Unverified commit 1ab2dfc1, authored by keppo-bot[bot], committed by GitHub

Add deflake-e2e-from-run skill (#3247)

## Summary

Adds a new skill, `dyad:deflake-e2e-from-run`, for root-causing E2E failures from a specific CI run by reading the Playwright HTML report artifacts directly (traces, screenshots, error-context, and app stdout/stderr) instead of rebuilding and rerunning locally.

This complements the existing `deflake-e2e` and `deflake-e2e-recent-commits` skills, which drive *discovery* by repeat-running tests. This new skill is for when you already have a failing run and want to root-cause fast.

## What the skill covers

- Downloading the merged `html-report` artifact with `gh run download -R dyad-sh/dyad -n html-report`.
- `jq` queries against `results.json` for `unexpected` vs `flaky` test buckets.
- Matching trace `.zip` hashes (paths in `results.json` are CI-side; files are local in `/tmp/pw-report/data/`).
- Reading `test.trace` as JSONL to extract the step timeline.
- Correlating with app IPC logs that show up as `stderr`/`stdout` trace events (gold for race-condition root-causing, e.g. `(proposal_handlers) › IPC: get-proposal returned: …` at failure time).
- A short playbook of common failure shapes (disabled-button-after-fill, navigation races, cross-test state) and their usual fixes.

Derived from the workflow used to root-cause #3246.

## Test plan

- [x] Skill file follows the same frontmatter format as existing skills in `.claude/skills/`.
- [x] Verified via the use case in #3246 (full investigation done end-to-end using exactly these steps).
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Will Chen <7344640+wwwillchen@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parent 28811055
---
name: dyad:deflake-e2e-from-run
description: Root-cause flaky or failing E2E tests from a specific CI run by downloading and analyzing the Playwright HTML report (traces, screenshots, errors). Use this when given a GitHub Actions run URL and asked to investigate failures - do NOT re-run tests locally, read the report artifacts instead.
---
# Deflake E2E Tests from a CI Run
Use this skill when the user points you at a specific failing CI run (e.g. `https://github.com/dyad-sh/dyad/actions/runs/<id>`) and asks you to root-cause the E2E failures. Unlike `deflake-e2e`, this skill does NOT rebuild and re-run tests — it reads the already-recorded Playwright report from the run's artifacts, which is faster and gives you the *exact* failure state CI saw.
## Arguments
- `$ARGUMENTS`: The GitHub Actions run URL or run ID. If absent, ask the user.
## Phase 1 — Get the report
1. Extract `run_id` from the URL (`/actions/runs/<run_id>` or `/actions/runs/<run_id>/job/<job_id>`).
2. List artifacts and find the `html-report` (merged across shards):
```
gh api repos/dyad-sh/dyad/actions/runs/<run_id>/artifacts --jq '.artifacts[] | {name, size_in_bytes}'
```
3. Download it into a scratch dir (use `-R dyad-sh/dyad`; `gh run download` does not auto-detect the repo from an arbitrary cwd):
```
mkdir -p /tmp/pw-report
gh run download <run_id> -R dyad-sh/dyad -n html-report -D /tmp/pw-report
```
4. Confirm layout: `index.html`, `results.json`, `data/*.zip` (trace archives), `data/*.png` (screenshots), `data/*.markdown` (error-context files).
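Step 1 can be scripted if the run URL arrives as free text. The sketch below handles both URL shapes listed in step 1 and also accepts a bare numeric ID; the function name `extract_run_id` is hypothetical and not part of the skill.

```python
import re


def extract_run_id(arg: str) -> str:
    """Return the run ID from a GitHub Actions run URL or a bare numeric ID."""
    arg = arg.strip()
    if arg.isdigit():
        # Already a run ID, nothing to parse.
        return arg
    # Matches both /actions/runs/<run_id> and /actions/runs/<run_id>/job/<job_id>.
    match = re.search(r"/actions/runs/(\d+)", arg)
    if match is None:
        raise ValueError(f"no run ID found in {arg!r}")
    return match.group(1)
```

The returned string can be substituted for `<run_id>` in the `gh` commands above.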
## Phase 2 — Enumerate failures
Use `jq` on `results.json`. The schema has `suites[].specs[]` (suites can nest further `suites[]`, e.g. for `test.describe` blocks), with each spec's `tests[].results[]` holding one result per attempt.
- Stats headline: `jq '.stats' results.json` → `{expected, skipped, unexpected, flaky}`.
- **Unexpected** (all attempts failed):
```
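# recurse(.suites[]?) also visits suites nested under test.describe blocks
jq '[.suites[] | recurse(.suites[]?) | .specs[]? | select(.ok == false)
     | {title, file, err: [.tests[].results[] | {status, error: .error.message}]}]' results.json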
```
- **Flaky** (some attempt failed but final passed):
```
jq '[.suites[] | recurse(.suites[]?) | .specs[]? | select(any(.tests[]; .status == "flaky")) | {title, file}]' results.json
```
Group by error shape. If every failure shares the same locator / error ("element is not enabled", "locator.click timeout", etc.) you're probably looking at one root cause across multiple tests. Don't investigate them all — pick one representative trace.
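Grouping by error shape can be automated. The sketch below is one possible approach, not part of the skill: it walks `results.json`, keeps attempts whose status is `failed` or `timedOut`, and buckets specs by the first line of the error message with ANSI colour codes and digits stripped. The helper names and the normalisation rule are assumptions; adjust them if the messages in your report differ.

```python
import re
from collections import defaultdict


def iter_specs(suite):
    """Yield every spec in a suite, including specs in nested suites."""
    yield from suite.get("specs", [])
    for child in suite.get("suites", []):
        yield from iter_specs(child)


def error_shape(message):
    """Reduce a message to a coarse key: first line, ANSI codes removed, digits replaced by N."""
    text = re.sub(r"\x1b\[[0-9;]*m", "", message or "")
    lines = text.strip().splitlines()
    first = lines[0] if lines else "<no message>"
    return re.sub(r"\d+", "N", first)


def group_failures(report):
    """Map each error shape to the sorted 'file: title' specs that hit it."""
    groups = defaultdict(set)
    for top in report.get("suites", []):
        for spec in iter_specs(top):
            for test in spec.get("tests", []):
                for result in test.get("results", []):
                    if result.get("status") in ("failed", "timedOut"):
                        message = (result.get("error") or {}).get("message")
                        label = f"{spec.get('file')}: {spec.get('title')}"
                        groups[error_shape(message)].add(label)
    return {shape: sorted(specs) for shape, specs in groups.items()}
```

To use it, load the report with `json.load(open("/tmp/pw-report/results.json"))` and pass the result to `group_failures`. A bucket with many specs is a good candidate for the single representative trace.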
## Phase 3 — Analyze a specific failure
1. Find the trace zip. The `attachments[].path` in `results.json` points at `all-blob-reports/resources/<hash>.zip` — those are **CI-side paths**, not local. The file actually lives at `/tmp/pw-report/data/<hash>.zip`. Match by hash, or grep the trace for the test title / spec file:
```
for f in /tmp/pw-report/data/*.zip; do
  # Print each archive whose test.trace mentions the spec location.
  if unzip -p "$f" test.trace 2>/dev/null | grep -q "chat_tabs\.spec\.ts:68"; then
    echo "$f"
  fi
done
```
2. Extract: `unzip -o <zip> -d /tmp/trace-extract`.
3. Read the step-by-step actions (`test.trace` is JSONL):
```python
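import json

# test.trace is JSONL: one JSON event per line.
with open('/tmp/trace-extract/test.trace') as trace:
    for line in trace:
        obj = json.loads(line)
        # Keep only 'before' events of class 'Test' (the step timeline).
        if obj.get('type') == 'before' and obj.get('class') == 'Test':
            print(round(obj.get('startTime', 0) / 1000, 2), obj.get('method'), obj.get('title', '')[:200])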
```
Look for the last few actions before the timeout — that tells you *which call hung and what its locator resolved to*.
4. Correlate with app logs. Electron `console.log`/`console.error` lands in `stderr`/`stdout` trace events:
```python
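import json

# App stdout/stderr is recorded as trace events; keep lines matching the keywords of interest.
with open('/tmp/trace-extract/test.trace') as trace:
    for line in trace:
        obj = json.loads(line)
        if obj.get('type') in ('stderr', 'stdout'):
            text = obj.get('text', '')
            if 'proposal' in text or 'chatId' in text or 'stream' in text.lower():
                print(text[:300])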
```
IPC log lines like `(proposal_handlers) › IPC: get-proposal returned: …` reveal what state the backend was in at failure time — gold for race-condition root-causing.
5. View the failure screenshot. Trace resources are stored unhashed; PNG files in `/tmp/trace-extract/resources/` are screenshots. Resize before Read (Claude's image limit is ~1.5MB). The command below uses `sips`, which ships with macOS only:
```
sips -Z 800 /tmp/trace-extract/resources/<hash> --out /tmp/fail.png
```
Then `Read /tmp/fail.png`. This is often the single most useful artifact — e.g. an "empty input, disabled Send button" screenshot is a dead giveaway for a fill() race.
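To list which files under `resources/` are actually PNGs without relying on file names, one option is to check the fixed 8-byte PNG signature. The sketch below does that; `find_pngs` is a hypothetical helper, not part of the skill.

```python
from pathlib import Path

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG file signature


def find_pngs(resources_dir):
    """Return files directly under resources_dir that start with the PNG signature, sorted by path."""
    hits = []
    for path in Path(resources_dir).iterdir():
        if path.is_file():
            with path.open("rb") as fh:
                if fh.read(8) == PNG_MAGIC:
                    hits.append(path)
    return sorted(hits)
```

Calling `find_pngs("/tmp/trace-extract/resources")` gives the candidate screenshots to resize and read.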
## Phase 4 — Root-cause playbook
Common patterns and what they mean:
- **"element is not enabled" on a button after fill()** → React render race between URL/atom state updates and the editor's onChange. The fill runs, onChange writes under the *old* key, next render clears the editor for the new context. Fix: wrap fill+click in `expect.toPass()` and assert editor content + button enabled before clicking. See `ChatActions.sendPrompt()`.
- **"locator.click timeout"** with multiple matching elements → stale component still in DOM during a transition. Fix: scope the locator tighter (`getChatInputContainer().locator(...)`) or add a visibility assertion on the stable target first.
- **Assertion flakes right after navigation** → atom/URL mismatch during a single render cycle. Either wait for a post-navigation signal (e.g. a data-loaded state) or wrap the assertion in `toPass` with a bounded timeout.
- **Different error on retry vs. first attempt** → test is mutating shared state. Look for missing teardown or cross-test singletons.
Prefer fixing the test over the app unless the race would actually bite a real user. A real user can't type 2ms after clicking a button; Playwright can. A retry wrapper is the correct contract there.
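The retry contract can be illustrated outside Playwright. The sketch below is a generic Python analogue of what a `toPass`-style wrapper provides: re-run a block of steps and assertions until it passes or a deadline expires. `retry_until_pass` and its parameters are hypothetical names for illustration only; this is not Playwright's implementation, and the real fix belongs in the TypeScript helpers.

```python
import time


def retry_until_pass(block, timeout=5.0, interval=0.1):
    """Call block() until it stops raising AssertionError or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The block should redo the action AND re-check state, e.g. fill, assert content, click.
            return block()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise  # out of time: surface the last assertion failure
            time.sleep(interval)
```

The point of the design is that the whole fill, assert, click sequence sits inside the retried block, so a render that clears the editor is repaired on the next attempt instead of failing the test.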
## Phase 5 — Fix, verify, PR
1. Make the minimal change — usually in `e2e-tests/helpers/page-objects/` since many specs share the same helper.
2. `npm run fmt && npm run lint && npm run ts`.
3. Skip local `npm run build && npm run e2e` unless you're genuinely unsure — the CI loop is ~15min and this analysis path is for *obvious* root causes. If you're guessing, stop guessing and run it locally instead.
4. Use `/dyad:pr-push` or commit + `gh pr create` directly. The PR body MUST include:
- A link to the failing run.
- The root-cause narrative (what raced, in concrete terms — not "timing issue").
- Why the fix is correct (what the retry loop is doing that the original flow wasn't).
## Gotchas
- `gh run download` needs `-R <owner>/<repo>` if you're not in a cwd with matching origin.
- `results.json` paths inside `attachments[]` are *CI-side*; only use them to match hashes, never to read files.
- A fork PR's artifacts live on the fork's run, not the upstream's. Make sure `run_id` is on the right repo.
- Many traces unpack to the same `/tmp/trace-extract/` — clean between extractions or use unique subdirs.
- The `html-report` is the *merged* report across shards. Individual shard artifacts (`blob-report-*`, `flakiness-report-*`) are usually unnecessary for root-causing.
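Two of these gotchas (CI-side paths and a shared extraction directory) can be handled by a small helper. The sketch below assumes the local layout described in Phase 1 (`<report_dir>/data/<hash>.zip`); `local_trace_path` and `extract_trace` are hypothetical names, not part of the skill.

```python
import tempfile
import zipfile
from pathlib import Path, PurePosixPath


def local_trace_path(ci_path, report_dir="/tmp/pw-report"):
    """Map a CI-side attachment path to the local copy under <report_dir>/data, matching by file name."""
    return Path(report_dir) / "data" / PurePosixPath(ci_path).name


def extract_trace(ci_path, report_dir="/tmp/pw-report"):
    """Unpack one trace zip into a fresh temporary directory and return that directory."""
    zip_path = local_trace_path(ci_path, report_dir)
    # A new directory per call, so two traces never overwrite each other's files.
    out_dir = Path(tempfile.mkdtemp(prefix=f"trace-{zip_path.stem[:12]}-"))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    return out_dir
```

Pass the `attachments[].path` value from `results.json` as `ci_path`; only its final component (the hash file name) is used.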