Add deflake-e2e-from-run skill (#3247)

## Summary

Adds a new skill, `dyad:deflake-e2e-from-run`, for root-causing E2E failures from a specific CI run by reading the Playwright HTML report artifacts directly (traces, screenshots, error-context, and app stdout/stderr) instead of rebuilding and rerunning locally.

This complements the existing `deflake-e2e` and `deflake-e2e-recent-commits` skills, which drive *discovery* by repeat-running tests. This new skill is for when you already have a failing run and want to root-cause fast.

## What the skill covers

- Downloading the merged `html-report` artifact with `gh run download -R dyad-sh/dyad -n html-report`.
- `jq` queries against `results.json` for `unexpected` vs `flaky` test buckets.
- Matching trace `.zip` hashes (paths in `results.json` are CI-side; files are local in `/tmp/pw-report/data/`).
- Reading `test.trace` as JSONL to extract the step timeline.
- Correlating with app IPC logs that show up as `stderr`/`stdout` trace events (gold for race-condition root-causing, e.g. `(proposal_handlers) › IPC: get-proposal returned: …` at failure time).
- A short playbook of common failure shapes (disabled-button-after-fill, navigation races, cross-test state) and their usual fixes.

Derived from the workflow used to root-cause #3246.

## Test plan

- [x] Skill file follows the same frontmatter format as existing skills in `.claude/skills/`.
- [x] Verified via the use case in #3246 (full investigation done end-to-end using exactly these steps).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Will Chen <7344640+wwwillchen@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1ab2dfc1 · keppo-bot[bot] · GitHub · 28811055 · 1ab2dfc1
--- a/.claude/skills/deflake-e2e-from-run/SKILL.md
+++ b/.claude/skills/deflake-e2e-from-run/SKILL.md
+---
+name: dyad:deflake-e2e-from-run
+description: Root-cause flaky or failing E2E tests from a specific CI run by downloading and analyzing the Playwright HTML report (traces, screenshots, errors). Use this when given a GitHub Actions run URL and asked to investigate failures - do NOT re-run tests locally, read the report artifacts instead.
+---
+# Deflake E2E Tests from a CI Run
+Use this skill when the user points you at a specific failing CI run (e.g. `https://github.com/dyad-sh/dyad/actions/runs/<id>`) and asks you to root-cause the E2E failures. Unlike `deflake-e2e`, this skill does NOT rebuild and re-run tests — it reads the already-recorded Playwright report from the run's artifacts, which is faster and gives you the *exact* failure state CI saw.
+## Arguments
+- `$ARGUMENTS`: The GitHub Actions run URL or run ID. If absent, ask the user.
+## Phase 1 — Get the report
+1. Extract `run_id` from the URL (`/actions/runs/<run_id>` or `/actions/runs/<run_id>/job/<job_id>`).
+2. List artifacts and find the `html-report` (merged across shards):
+   ```
+   gh api repos/dyad-sh/dyad/actions/runs/<run_id>/artifacts --jq '.artifacts[] | {name, size_in_bytes}'
+   ```
+3. Download it into a scratch dir (use `-R dyad-sh/dyad` — `gh run download` does not auto-detect the repo from arbitrary cwd):
+   ```
+   mkdir -p /tmp/pw-report
+   gh run download <run_id> -R dyad-sh/dyad -n html-report -D /tmp/pw-report
+   ```
+4. Confirm layout: `index.html`, `results.json`, `data/*.zip` (trace archives), `data/*.png` (screenshots), `data/*.markdown` (error-context files).
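The URL handling in step 1 can be sketched in Python. This is a minimal illustration and not part of the skill file: `RunRef` and `parse_run_ref` are hypothetical names, and the sketch assumes the argument is either a bare numeric run ID or a `github.com` URL containing `/actions/runs/<run_id>`. It also returns the `owner/repo` portion, which is what `-R` expects when the run lives on a fork.

```python
import re
from typing import NamedTuple, Optional


class RunRef(NamedTuple):
    repo: Optional[str]  # "owner/name" when parsed from a URL, else None
    run_id: str


# Matches .../<owner>/<repo>/actions/runs/<run_id>, with or without /job/<job_id>.
_RUN_URL = re.compile(r"github\.com/([^/\s]+/[^/\s]+)/actions/runs/(\d+)")


def parse_run_ref(arg: str) -> RunRef:
    """Parse a run URL or bare run ID (hypothetical helper)."""
    arg = arg.strip()
    if re.fullmatch(r"\d+", arg):
        return RunRef(None, arg)
    match = _RUN_URL.search(arg)
    if match is None:
        raise ValueError(f"no Actions run ID found in {arg!r}")
    return RunRef(match.group(1), match.group(2))
```

For a bare ID the repo is unknown, so the caller would fall back to `dyad-sh/dyad` or ask the user.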
+## Phase 2 — Enumerate failures
+Use `jq` on `results.json`. The schema has `suites[].specs[]`, with each spec's `tests[].results[]` holding one result per attempt.
+- Stats headline: `jq '.stats' results.json` → `{expected, skipped, unexpected, flaky}`.
+- **Unexpected** (all attempts failed):
+  ```
+  jq '[.suites[].specs[]? | select(.ok == false) | {title, file,
+      err: [.tests[].results[] | {status, error: .error.message}]}]' results.json
+  ```
+- **Flaky** (some attempt failed but final passed):
+  ```
+  jq '[.suites[].specs[]? | select(any(.tests[]; .status == "flaky")) | {title, file}]' results.json
+  ```
+Group by error shape. If every failure shares the same locator / error ("element is not enabled", "locator.click timeout", etc.) you're probably looking at one root cause across multiple tests. Don't investigate them all — pick one representative trace.
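The "group by error shape" step can be sketched in Python as well. This is a rough illustration under two assumptions that are not stated in the skill: that `describe` blocks show up as `suites` nested inside a suite (so the walk recurses), and that the first non-empty line of `error.message` is a good enough grouping key. The function names are hypothetical.

```python
from collections import defaultdict


def iter_specs(suites):
    """Yield every spec, descending into nested suites (assumed layout)."""
    for suite in suites or []:
        yield from suite.get("specs", [])
        yield from iter_specs(suite.get("suites", []))


def error_shape(message):
    """Reduce an error message to its first non-empty line."""
    for line in (message or "").splitlines():
        if line.strip():
            return line.strip()
    return "<no message>"


def group_failures(report):
    """Map error shape -> sorted 'file: title' labels for specs with ok == false."""
    groups = defaultdict(set)
    for spec in iter_specs(report.get("suites", [])):
        if spec.get("ok", True):
            continue
        label = f"{spec.get('file')}: {spec.get('title')}"
        for test in spec.get("tests", []):
            for result in test.get("results", []):
                error = result.get("error")
                if error:  # passing attempts carry no error
                    groups[error_shape(error.get("message"))].add(label)
    return {shape: sorted(labels) for shape, labels in groups.items()}
```

Loading the report with `json.load(open("/tmp/pw-report/results.json"))` and passing it to `group_failures` gives one entry per error shape; a shape shared by many specs is the candidate for a single representative trace.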
+## Phase 3 — Analyze a specific failure
+1. Find the trace zip. The `attachments[].path` in `results.json` points at `all-blob-reports/resources/<hash>.zip` — those are **CI-side paths**, not local. The file actually lives at `/tmp/pw-report/data/<hash>.zip`. Match by hash, or grep the trace for the test title / spec file:
+   ```
+   for f in /tmp/pw-report/data/*.zip; do
+     hit=$(unzip -p "$f" test.trace | grep -c "chat_tabs\.spec\.ts:68")
+     [ "$hit" -gt 0 ] && echo "$f"
+   done
+   ```
+2. Extract: `unzip -o <zip> -d /tmp/trace-extract`.
+3. Read the step-by-step actions (`test.trace` is JSONL):
+   ```python
+   import json
+   for line in open('/tmp/trace-extract/test.trace'):
+       obj = json.loads(line)
+       if obj.get('type') == 'before' and obj.get('class') == 'Test':
+           print(round(obj['startTime']/1000, 2), obj.get('method'), obj.get('title','')[:200])
+   ```
+   Look for the last few actions before the timeout — that tells you *which call hung and what its locator resolved to*.
+4. Correlate with app logs. Electron `console.log`/`console.error` lands in `stderr`/`stdout` trace events:
+   ```python
+   import json
+   for line in open('/tmp/trace-extract/test.trace'):
+       obj = json.loads(line)
+       if obj.get('type') in ('stderr','stdout'):
+           text = obj.get('text','')
+           if 'proposal' in text or 'chatId' in text or 'stream' in text.lower():
+               print(text[:300])
+   ```
+   IPC log lines like `(proposal_handlers) › IPC: get-proposal returned: …` reveal what state the backend was in at failure time — gold for race-condition root-causing.
+5. View the failure screenshot. Trace resources are stored unhashed; PNG files in `/tmp/trace-extract/resources/` are screenshots. Resize before Read (Claude's image limit is ~1.5MB); `sips` is a macOS tool, so use an equivalent resizer on other platforms:
+   ```
+   sips -Z 800 /tmp/trace-extract/resources/<hash> --out /tmp/fail.png
+   ```
+   Then `Read /tmp/fail.png`. This is often the single most useful artifact — e.g. an "empty input, disabled Send button" screenshot is a dead giveaway for a fill() race.
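Steps 3 and 4 can be combined so that each action carries the app log lines emitted while it was in flight. The sketch below is a hedged illustration: it uses only the event fields that the snippets above already read (`type`, `class`, `method`, `title`, `text`) and assumes that event order in `test.trace` reflects time order. The function name `timeline_with_logs` is hypothetical.

```python
import json


def timeline_with_logs(lines, keywords=()):
    """Return one dict per Test action, each with the log lines that followed it.

    Assumes file order equals time order. Logs seen before the first
    action are ignored. An empty `keywords` keeps every log line.
    """
    timeline = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "before" and event.get("class") == "Test":
            timeline.append({
                "method": event.get("method"),
                "title": event.get("title") or "",
                "logs": [],
            })
        elif kind in ("stdout", "stderr") and timeline:
            text = (event.get("text") or "").rstrip()
            if not keywords or any(k.lower() in text.lower() for k in keywords):
                timeline[-1]["logs"].append(text)
    return timeline
```

Calling `timeline_with_logs(open("/tmp/trace-extract/test.trace"), keywords=("proposal", "chatId"))` and printing the last few entries shows the final actions next to the backend state. The last entries may be teardown hooks, not the call that hung, so read a few steps back.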
+## Phase 4 — Root-cause playbook
+Common patterns and what they mean:
+- **"element is not enabled" on a button after fill()** → React render race between URL/atom state updates and the editor's onChange. The fill runs, onChange writes under the *old* key, and the next render clears the editor for the new context. Fix: wrap fill+click in `expect(async () => { ... }).toPass()` and, inside that block, assert the editor content and that the button is enabled before clicking. See `ChatActions.sendPrompt()`.
+- **"locator.click timeout"** with multiple matching elements → stale component still in DOM during a transition. Fix: scope the locator tighter (`getChatInputContainer().locator(...)`) or add a visibility assertion on the stable target first.
+- **Assertion flakes right after navigation** → atom/URL mismatch during a single render cycle. Either wait for a post-navigation signal (e.g. a data-loaded state) or wrap the assertion in `toPass` with a bounded timeout.
+- **Different error on retry vs. first attempt** → test is mutating shared state. Look for missing teardown or cross-test singletons.
+Prefer fixing the test over the app unless the race would actually bite a real user. A real user can't type at 2ms after clicking a button; Playwright can. A retry wrapper is the correct contract there.
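The retry contract behind the `toPass` fixes can be illustrated with a small Python analogue. This is not Playwright's implementation (the real API is TypeScript and has its own timing); `retry_until_pass`, `FlakyEditor`, and `send` are hypothetical names used only to show why retrying the whole fill-and-assert block, not just the assertion, survives a render that wipes the first fill.

```python
import time


def retry_until_pass(block, timeout=5.0, interval=0.1):
    """Re-run `block` until it stops raising AssertionError or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return block()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise  # bounded: surface the last failure
            time.sleep(interval)


class FlakyEditor:
    """Stand-in for an editor whose first fill is wiped by a late re-render."""

    def __init__(self):
        self.text = ""
        self._fills = 0

    def fill(self, value):
        self._fills += 1
        # Simulate the race: only the first fill is cleared.
        self.text = "" if self._fills == 1 else value


def send(editor, prompt):
    def attempt():
        editor.fill(prompt)  # re-applied on every retry
        assert editor.text == prompt, "editor was cleared after fill"
        return "clicked"     # stands in for the click

    return retry_until_pass(attempt, timeout=1.0, interval=0.0)
```

Because the fill sits inside the retried block, the second attempt re-types the prompt after the stale render has settled. That is the concrete "what the retry loop is doing" a PR description can point to.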
+## Phase 5 — Fix, verify, PR
+1. Make the minimal change — usually in `e2e-tests/helpers/page-objects/` since many specs share the same helper.
+2. `npm run fmt && npm run lint && npm run ts`.
+3. Skip local `npm run build && npm run e2e` unless you're genuinely unsure — the CI loop is ~15min and this analysis path is for *obvious* root causes. If you're guessing, stop guessing and run it locally instead.
+4. Use `/dyad:pr-push` or commit + `gh pr create` directly. The PR body MUST include:
+   - A link to the failing run.
+   - The root-cause narrative (what raced, in concrete terms — not "timing issue").
+   - Why the fix is correct (what the retry loop is doing that the original flow wasn't).
+## Gotchas
+- `gh run download` needs `-R <owner>/<repo>` if you're not in a cwd with matching origin.
+- `results.json` paths inside `attachments[]` are *CI-side*; only use them to match hashes, never to read files.
+- A fork PR's artifacts live on the fork's run, not the upstream's. Make sure `run_id` is on the right repo.
+- Many traces unpack to the same `/tmp/trace-extract/` — clean between extractions or use unique subdirs.
+- The `html-report` is the *merged* report across shards. Individual shard artifacts (`blob-report-*`, `flakiness-report-*`) are usually unnecessary for root-causing.
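The trace lookup from Phase 3 and the shared-directory gotcha can be handled together by extracting each matching zip into its own subdirectory. The sketch below uses only Python's standard `zipfile` and `pathlib`; the function name and the choice of the zip's stem (its hash) as the subdirectory name are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def extract_matching_traces(data_dir, needle, out_root):
    """Extract each trace zip whose test.trace mentions `needle`.

    Every match goes to out_root/<zip stem>/ so traces never overwrite
    one another. Returns the list of extraction directories.
    """
    extracted = []
    for zip_path in sorted(Path(data_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            if "test.trace" not in archive.namelist():
                continue  # not a trace archive
            trace = archive.read("test.trace").decode("utf-8", errors="replace")
            if needle not in trace:
                continue
            target = Path(out_root) / zip_path.stem
            target.mkdir(parents=True, exist_ok=True)
            archive.extractall(target)
            extracted.append(target)
    return extracted
```

For example, `extract_matching_traces("/tmp/pw-report/data", "chat_tabs.spec.ts:68", "/tmp/trace-extract")` would leave each matching trace under `/tmp/trace-extract/<hash>/`.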