feat: add daily deflake-e2e-recent-prs skill and workflow (#2590)

## Summary

- Add new `/dyad:deflake-e2e-recent-prs` command that automatically gathers flaky E2E tests from recent PR Playwright summary comments, ranks them by frequency, and deflakes them sequentially
- Add `claude-deflake-e2e.yml` GitHub Actions workflow that runs daily at 2 AM PST on self-hosted macOS runners (with workflow_dispatch support for manual triggers)
- Document the new command in `.claude/README.md`

## Test plan

- Trigger the workflow manually via `gh workflow run claude-deflake-e2e.yml` and verify it correctly scans recent PRs for flaky tests and attempts to deflake them
- Verify the cron schedule triggers at the expected time

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> **Medium Risk**
> Adds a scheduled GitHub Action with write permissions that can open PRs, so misconfiguration or prompt issues could create noisy/unintended changes; however it’s limited to CI/automation and test-only guidance.
>
> **Overview**
> Adds a new Claude slash command, `/dyad:deflake-e2e-recent-prs`, that scans recent PR comments for Playwright “Flaky Tests”, ranks them by frequency, and runs deflaking steps per spec (including guidance to disable retries via `PLAYWRIGHT_RETRIES=0`) before optionally opening a fix PR.
>
> Introduces a scheduled/manual GitHub Actions workflow (`claude-deflake-e2e.yml`) that runs daily on self-hosted macOS ARM64, installs dependencies/browsers, and invokes the new command via `anthropics/claude-code-action`. Documentation is updated to list the new command and the existing `/dyad:deflake-e2e` instructions are tightened to always disable Playwright retries.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 10b9158363c6b9ae9a3f3ba52ad118149fb9cbd3.</sup>

---

## Summary by cubic

Adds a new /dyad:deflake-e2e-recent-prs command that scans recent PRs for Playwright flake reports, ranks flaky tests, and deflakes them. Adds a daily GitHub Action that runs it at 10:00 UTC (2 AM PST / 3 AM PDT) on self‑hosted macOS ARM64 to keep E2E tests stable.

- **New Features**
  - Command scans recent PRs (default 20), parses Playwright summary comments from github-actions[bot], ranks by frequency, and deflakes specs sequentially; can push fixes via /dyad:pr-push.
  - New claude-deflake-e2e.yml workflow supports manual dispatch with pr_count, sets up Node/pnpm, installs Chromium, builds the fake LLM server, and runs the command via anthropics/claude-code-action.
- **Bug Fixes**
  - Fixed spec path handling (no double .spec.ts), added gh api --paginate, switched to generic PR search, clarified {owner}/{repo} vs <pr_number>, and noted DST in the cron comment.
  - Disabled Playwright automatic retries in all deflake steps to prevent false passes, including debug and snapshot update commands.

<sup>Written for commit 10b9158363c6b9ae9a3f3ba52ad118149fb9cbd3.</sup>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

1e90ac6b · Will Chen · GitHub · dcf06ece · 1e90ac6b · 1e90ac6b
--- a/.claude/README.md
+++ b/.claude/README.md
@@ -6,16 +6,17 @@ This directory contains Claude Code configuration for the Dyad project.

 Slash commands are invoked with `/dyad:<command>`. Available commands:

-| Command                 | Description                                                    | Uses                                |
-| ----------------------- | -------------------------------------------------------------- | ----------------------------------- |
-| `/dyad:plan-to-issue`   | Convert a plan to a GitHub issue                               | -                                   |
-| `/dyad:fix-issue`       | Fix a GitHub issue                                             | `pr-push`                           |
-| `/dyad:pr-fix`          | Fix PR issues from CI failures or review comments              | `pr-fix:comments`, `pr-fix:actions` |
-| `/dyad:pr-fix:comments` | Address unresolved PR review comments                          | `lint`, `pr-push`                   |
-| `/dyad:pr-fix:actions`  | Fix failing CI checks and GitHub Actions                       | `e2e-rebase`, `pr-push`             |
-| `/dyad:pr-rebase`       | Rebase the current branch                                      | `pr-push`                           |
-| `/dyad:pr-push`         | Push changes and create/update a PR                            | `remember-learnings`                |
-| `/dyad:lint`            | Run all pre-commit checks (formatting, linting, type-checking) | -                                   |
-| `/dyad:e2e-rebase`      | Rebase E2E test snapshots                                      | -                                   |
-| `/dyad:deflake-e2e`     | Deflake flaky E2E tests                                        | -                                   |
-| `/dyad:session-debug`   | Debug session issues                                           | -                                   |
+| Command                        | Description                                                    | Uses                                |
+| ------------------------------ | -------------------------------------------------------------- | ----------------------------------- |
+| `/dyad:plan-to-issue`          | Convert a plan to a GitHub issue                               | -                                   |
+| `/dyad:fix-issue`              | Fix a GitHub issue                                             | `pr-push`                           |
+| `/dyad:pr-fix`                 | Fix PR issues from CI failures or review comments              | `pr-fix:comments`, `pr-fix:actions` |
+| `/dyad:pr-fix:comments`        | Address unresolved PR review comments                          | `lint`, `pr-push`                   |
+| `/dyad:pr-fix:actions`         | Fix failing CI checks and GitHub Actions                       | `e2e-rebase`, `pr-push`             |
+| `/dyad:pr-rebase`              | Rebase the current branch                                      | `pr-push`                           |
+| `/dyad:pr-push`                | Push changes and create/update a PR                            | `remember-learnings`                |
+| `/dyad:lint`                   | Run all pre-commit checks (formatting, linting, type-checking) | -                                   |
+| `/dyad:e2e-rebase`             | Rebase E2E test snapshots                                      | -                                   |
+| `/dyad:deflake-e2e`            | Deflake flaky E2E tests                                        | -                                   |
+| `/dyad:deflake-e2e-recent-prs` | Gather flaky tests from recent PRs and deflake them            | `deflake-e2e`, `pr-push`            |
+| `/dyad:session-debug`          | Debug session issues                                           | -                                   |
--- a/.claude/commands/dyad/deflake-e2e-recent-prs.md
+++ b/.claude/commands/dyad/deflake-e2e-recent-prs.md
+# Deflake E2E Tests from Recent PRs
+
+Automatically gather flaky E2E tests from recent PR Playwright summary comments and deflake them.
+
+## Arguments
+
+- `$ARGUMENTS`: (Optional) Number of recent PRs to scan (default: 20)
+
+## Task Tracking
+
+**You MUST use the TaskCreate and TaskUpdate tools to track your progress.** At the start, create tasks for each major step below. Mark each task as `in_progress` when you start it and `completed` when you finish.
+
+## Instructions
+
+1. **Gather flaky tests from recent PRs:**
+
+   Use `gh` to find recent PRs that have Playwright summary comments (search for PRs with `github-actions[bot]` Playwright comments):
+
+   ```
+   gh pr list --search 'commenter:github-actions[bot] "Playwright Test Results" in:comments' --state all --limit <PR_COUNT> --json number
+   ```
+
+   Use `$ARGUMENTS` as the PR count, defaulting to 20 if not provided.
+
+   For each PR, fetch comments from `github-actions[bot]` that contain the Playwright test results.
+
+   **Note:** `{owner}` and `{repo}` are auto-replaced by `gh` CLI. Replace `<pr_number>` with the actual PR number.
+
+   ```
+   gh api repos/{owner}/{repo}/issues/<pr_number>/comments --paginate --jq '.[] | select(.user.login == "github-actions[bot]") | select(.body | contains("Playwright Test Results")) | .body'
+   ```
+
+2. **Parse flaky tests from comments:**
+
+   Extract flaky test names from the "Flaky Tests" section of each comment. Flaky tests appear in this format:
+
+   ```
+   - `<spec_file.spec.ts> > <test name>` (passed after N retry/retries)
+   ```
+
+   Parse each line with this pattern to extract the spec file and test name. The spec file is everything before the first `>`.
+
+3. **Deduplicate and rank by frequency:**
+
+   Count how many times each test appears as flaky across all PRs. Sort by frequency (most flaky first). Group tests by their spec file.
+
+   Print a summary table:
+
+   ```
+   Flaky test summary:
+   - setup_flow.spec.ts > Setup Flow > setup banner shows correct state... (7 occurrences)
+   - select_component.spec.ts > select component next.js (5 occurrences)
+   ...
+   ```
+
+4. **Skip if no flaky tests found:**
+
+   If no flaky tests are found, report "No flaky tests found in recent PRs" and stop.
+
+5. **Install dependencies and build:**
+
+   ```
+   npm install
+   npm run build
+   ```
+
+   **IMPORTANT:** This build step is required before running E2E tests. If you make any changes to application code (anything outside of `e2e-tests/`), you MUST re-run `npm run build`.
+
+6. **Deflake each flaky test spec file (sequentially):**
+
+   For each unique spec file that has flaky tests (ordered by total flaky occurrences, most flaky first):
+
+   a. Run the spec file 10 times to confirm flakiness (note: `<spec_file>` already includes the `.spec.ts` extension from parsing):
+
+   ```
+   PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<spec_file> --repeat-each=10
+   ```
+
+   **IMPORTANT:** `PLAYWRIGHT_RETRIES=0` is required to disable automatic retries. Without it, CI environments (where `CI=true`) default to 2 retries, causing flaky tests to pass on retry and be incorrectly skipped.
+
+   b. If the test passes all 10 runs, skip it (it may have been fixed already).
+
+   c. If the test fails at least once, investigate with debug logs:
+
+   ```
+   DEBUG=pw:browser PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<spec_file>
+   ```
+
+   d. Fix the flaky test following Playwright best practices:
+   - Use `await expect(locator).toBeVisible()` before interacting with elements
+   - Use `await page.waitForLoadState('networkidle')` for network-dependent tests
+   - Use stable selectors (data-testid, role, text) instead of fragile CSS selectors
+   - Add explicit waits for animations: `await page.waitForTimeout(300)` (use sparingly)
+   - Use `await expect(locator).toHaveScreenshot()` options like `maxDiffPixelRatio` for visual tests
+   - Ensure proper test isolation (clean state before/after tests)
+
+   **IMPORTANT:** Do NOT change any application code. Only modify test files and snapshot baselines.
+
+   e. Update snapshot baselines if needed:
+
+   ```
+   PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<spec_file> --update-snapshots
+   ```
+
+   f. Verify the fix by running 10 times again:
+
+   ```
+   PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<spec_file> --repeat-each=10
+   ```
+
+   g. If the test still fails after your fix attempt, revert any changes to that spec file and move on to the next one. Do not spend more than 2 attempts fixing a single spec file.
+
+7. **Summarize results:**
+
+   Report:
+   - Total flaky tests found across PRs
+   - Which tests were successfully deflaked
+   - What fixes were applied to each
+   - Which tests could not be fixed (and why)
+   - Verification results
+
+8. **Create PR with fixes:**
+
+   If any fixes were made, run `/dyad:pr-push` to commit, lint, test, and push the changes as a PR.
+
+   Use a branch name like `deflake-e2e-<date>` (e.g., `deflake-e2e-2025-01-15`).
+
+   The PR title should be: `fix: deflake E2E tests (<list of spec files>)`
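
Reviewer note on steps 2 and 3 of the new command: the parse-and-rank logic is described only in prose, so here is a minimal Python sketch of one way it could work. It assumes each flaky entry sits on its own line in exactly the format shown in step 2; the names `FLAKY_LINE`, `parse_flaky_tests`, and `rank_flaky_tests` are hypothetical and not part of this PR.

```python
import re
from collections import Counter

# Matches lines such as:
#   - `setup_flow.spec.ts > Setup Flow > shows banner` (passed after 1 retry)
# The spec file is everything before the first ">"; the rest is the test name.
FLAKY_LINE = re.compile(
    r"^\s*-\s*`(?P<spec>[^`>]+?)\s*>\s*(?P<test>[^`]+?)`"
    r"\s*\(passed after \d+ retr(?:y|ies)\)\s*$"
)


def parse_flaky_tests(comment_body):
    """Return (spec_file, test_name) pairs found in one comment body."""
    pairs = []
    for line in comment_body.splitlines():
        match = FLAKY_LINE.match(line)
        if match:
            pairs.append((match.group("spec").strip(), match.group("test").strip()))
    return pairs


def rank_flaky_tests(comment_bodies):
    """Count occurrences across all comments, most flaky first.

    Returns two lists: per-test counts and per-spec-file totals.
    """
    test_counts = Counter()
    for body in comment_bodies:
        test_counts.update(parse_flaky_tests(body))

    # Step 6 orders spec files by their total flaky occurrences.
    spec_counts = Counter()
    for (spec, _test), count in test_counts.items():
        spec_counts[spec] += count

    return test_counts.most_common(), spec_counts.most_common()
```

The comment bodies would come from the `gh api ... --jq '... | .body'` call in step 1. This sketch counts every occurrence, so a test reported twice on the same PR is counted twice; whether that is the intended weighting is not stated in the command.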
--- a/.claude/commands/dyad/deflake-e2e.md
+++ b/.claude/commands/dyad/deflake-e2e.md
@@ -35,9 +35,11 @@ Identify and fix flaky E2E tests by running them repeatedly and investigating fa
   For each test file, run it 10 times:

   ```
-   PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --repeat-each=10
+   PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --repeat-each=10
   ```

+   **IMPORTANT:** `PLAYWRIGHT_RETRIES=0` is required to disable automatic retries. Without it, CI environments (where `CI=true`) default to 2 retries, causing flaky tests to pass on retry and be incorrectly skipped as "not flaky."
+
   Notes:
   - If `$ARGUMENTS` is provided without the `e2e-tests/` prefix, add it
   - If `$ARGUMENTS` is provided without the `.spec.ts` suffix, add it
@@ -48,7 +50,7 @@ Identify and fix flaky E2E tests by running them repeatedly and investigating fa
   Run the failing test with Playwright browser debugging enabled:

   ```
-   DEBUG=pw:browser PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts
+   DEBUG=pw:browser PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts
   ```

   Analyze the debug output to understand:
@@ -75,7 +77,7 @@ Identify and fix flaky E2E tests by running them repeatedly and investigating fa
   If the flakiness is due to legitimate visual differences:

   ```
-   PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --update-snapshots
+   PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --update-snapshots
   ```

 8. **Verify the fix:**
@@ -83,7 +85,7 @@ Identify and fix flaky E2E tests by running them repeatedly and investigating fa
   Re-run the test 10 times to confirm it's no longer flaky:

   ```
-   PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --repeat-each=10
+   PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --repeat-each=10
   ```

   The test should pass all 10 runs consistently.
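
Reviewer note on the `PLAYWRIGHT_RETRIES=0` change: the project's Playwright config is not part of this diff, so the following Python sketch only models the rule as the new **IMPORTANT** note describes it. The function name `resolve_retries`, the truthiness check on `CI`, and the non-CI default of 0 are assumptions for illustration.

```python
def resolve_retries(env):
    """Return the retry count a run would use under the assumed rule.

    `env` is a mapping of environment variable names to string values.
    """
    explicit = env.get("PLAYWRIGHT_RETRIES")
    if explicit not in (None, ""):
        # An explicit value wins, so "0" disables retries even when CI is set.
        return int(explicit)
    if env.get("CI"):
        # Per the note in this diff, CI runs default to 2 retries.
        return 2
    # Assumption: outside CI no retries are configured.
    return 0


print(resolve_retries({"CI": "true"}))
print(resolve_retries({"CI": "true", "PLAYWRIGHT_RETRIES": "0"}))
```

With two retries, a test that fails once and then passes is reported as passing, which is why the `--repeat-each=10` confirmation run would otherwise misclassify a flaky spec as stable.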

--- a/.github/workflows/claude-deflake-e2e.yml
+++ b/.github/workflows/claude-deflake-e2e.yml
+name: Claude Deflake E2E
+
+on:
+  schedule:
+    # Daily at 10:00 UTC (2 AM PST / 3 AM PDT due to DST)
+    - cron: "0 10 * * *"
+  workflow_dispatch:
+    inputs:
+      pr_count:
+        description: "Number of recent PRs to scan for flaky tests"
+        required: false
+        default: "10"
+        type: string
+
+jobs:
+  deflake:
+    environment: ai-bots
+    runs-on:
+      - self-hosted
+      - macOS
+      - ARM64
+    permissions:
+      contents: write
+      pull-requests: write
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v5
+        with:
+          fetch-depth: 0
+
+      - name: Initialize environment
+        uses: actions/setup-node@v4
+        with:
+          node-version-file: package.json
+          cache: npm
+          cache-dependency-path: package-lock.json
+
+      - name: Install node modules
+        run: npm ci --no-audit --no-fund --progress=false
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Setup pnpm
+        uses: pnpm/action-setup@a7487c7e89a18df4991f7f222e4898a00d66ddda # v4.1.0
+        with:
+          version: latest
+
+      - name: Clone nextjs-template
+        run: git clone --depth 1 https://github.com/dyad-sh/nextjs-template.git nextjs-template
+
+      - name: Install scaffold dependencies
+        run: cd scaffold && pnpm install
+
+      - name: Install nextjs-template dependencies
+        run: cd nextjs-template && pnpm install
+
+      - name: Install Chromium browser for Playwright
+        run: npx playwright install chromium --with-deps
+
+      - name: Build fake LLM server
+        run: cd testing/fake-llm-server && npm install && npm run build
+
+      - name: Deflake E2E tests
+        uses: anthropics/claude-code-action@v1
+        env:
+          CLAUDE_CODE_MAX_OUTPUT_TOKENS: 48000
+        with:
+          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          claude_args: --model claude-opus-4-6
+          direct: true
+          prompt: |
+            /dyad:deflake-e2e-recent-prs ${{ inputs.pr_count || '10' }}
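
Reviewer note on the cron comment in the workflow above: GitHub Actions evaluates `schedule` entries in UTC, so the local run time shifts by an hour with daylight saving. A small Python sketch to check the "2 AM PST / 3 AM PDT" claim, using fixed UTC offsets rather than a timezone database; the helper name `local_run_time` and the sample date are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for US Pacific time (standard and daylight).
PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")


def local_run_time(utc_hour, tz):
    """Convert a top-of-the-hour UTC cron time to a local clock time."""
    # The date is arbitrary; only the hour matters with fixed offsets.
    run_utc = datetime(2025, 1, 15, utc_hour, 0, tzinfo=timezone.utc)
    return run_utc.astimezone(tz).strftime("%H:%M %Z")


print(local_run_time(10, PST))  # 02:00 PST
print(local_run_time(10, PDT))  # 03:00 PDT
```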