Replace deflake-e2e-recent-prs with deflake-e2e-recent-commits (#2607)

## Summary

- Replaced `deflake-e2e-recent-prs` command with `deflake-e2e-recent-commits` that scans CI workflow runs on main instead of PR comments
- Downloads the `html-report` artifact (`results.json`) from each CI run to extract flaky test data, which works for push events that don't post PR comments
- Updated `claude-deflake-e2e.yml` workflow to use the new command

## Test plan

- [ ] Trigger the `Claude Deflake E2E` workflow manually and verify it correctly scans recent main branch CI runs
- [ ] Verify flaky tests are correctly parsed from `results.json` artifacts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> **Low Risk**
> Low risk doc/workflow tweak that changes how the deflake automation sources flaky tests (GitHub Actions runs/artifacts) but does not touch production code or test logic.
>
> **Overview**
> Updates the deflake automation to **scan recent `main` CI workflow runs** (push events) instead of PR Playwright summary comments, by downloading each run’s `html-report` artifact and parsing `results.json` to detect retry-passed tests with prior failures/timeouts.
>
> Adjusts the scheduled `Claude Deflake E2E` workflow to accept `commit_count`, grant `actions: read`, and invoke `/dyad:deflake-e2e-recent-commits` rather than the old PR-based command.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 0da1e67da43e509577d5b8dc1f155779742d1529.</sup>

---

## Summary by cubic

Switched the deflake command to scan recent main CI runs and parse `html-report` `results.json` to find flaky E2E tests. Updated the Claude Deflake E2E workflow to use `commit_count` and added the `actions: read` permission.

- **Refactors**
  - List completed main push runs via `gh api`, fetch 3x `commit_count`, and filter to success/failure.
  - Download non-expired `html-report` artifacts; parse `results.json` with a Node.js script to detect flakes (final passed after fail/timedOut/interrupted).
  - Build "<spec_path.spec.ts> > Suite > Test" titles; group and rank by frequency; clean up artifacts.
  - Skip runs without artifacts; note 3-day artifact retention.
- **Bug Fixes**
  - Updated command doc to reference the TodoWrite tool.

<sup>Written for commit 0da1e67da43e509577d5b8dc1f155779742d1529.</sup>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

50a72da9 · Will Chen · GitHub · 9b9c059b · 50a72da9 · 50a72da9
--- a/.claude/commands/dyad/deflake-e2e-recent-prs.md
+++ b/.claude/commands/dyad/deflake-e2e-recent-prs.md
-# Deflake E2E Tests from Recent PRs
+# Deflake E2E Tests from Recent Commits

-Automatically gather flaky E2E tests from recent PR Playwright summary comments and deflake them.
+Automatically gather flaky E2E tests from recent CI runs on the main branch and deflake them.

 ## Arguments

- `$ARGUMENTS`: (Optional) Number of recent PRs to scan (default: 20)
+- `$ARGUMENTS`: (Optional) Number of recent commits to scan (default: 10)

 ## Task Tracking

-**You MUST use the TaskCreate and TaskUpdate tools to track your progress.** At the start, create tasks for each major step below. Mark each task as `in_progress` when you start it and `completed` when you finish.
+**You MUST use the TodoWrite tool to track your progress.** At the start, create todos for each major step below. Mark each todo as `in_progress` when you start it and `completed` when you finish.

 ## Instructions

-1. **Gather flaky tests from recent PRs:**
+1. **Gather flaky tests from recent CI runs on main:**

-   Use `gh` to find recent PRs that have Playwright summary comments (search for PRs with `github-actions[bot]` Playwright comments):
+   List recent CI workflow runs triggered by pushes to main:

   ```
-   gh pr list --search 'commenter:github-actions[bot] "Playwright Test Results" in:comments' --state all --limit <PR_COUNT> --json number
+   gh api "repos/{owner}/{repo}/actions/workflows/ci.yml/runs?branch=main&event=push&per_page=<COMMIT_COUNT * 3>&status=completed" --jq '.workflow_runs[] | select(.conclusion == "success" or .conclusion == "failure") | {id, head_sha, conclusion}'
   ```

-   Use `$ARGUMENTS` as the PR count, defaulting to 20 if not provided.
+   **Note:** We fetch 3x the desired commit count because many runs may be `cancelled` (due to concurrency groups). Filter to only `success` and `failure` conclusions to get runs that actually completed and have artifacts.

-   For each PR, fetch comments from `github-actions[bot]` that contain the Playwright test results.
+   Use `$ARGUMENTS` as the commit count, defaulting to 10 if not provided.

-   **Note:** `{owner}` and `{repo}` are auto-replaced by `gh` CLI. Replace `<pr_number>` with the actual PR number.
+   For each completed run, download the `html-report` artifact which contains `results.json` with the full Playwright test results:
+
+   a. Find the html-report artifact for the run:

   ```
-   gh api repos/{owner}/{repo}/issues/<pr_number>/comments --paginate --jq '.[] | select(.user.login == "github-actions[bot]") | select(.body | contains("Playwright Test Results")) | .body'
+   gh api "repos/{owner}/{repo}/actions/runs/<run_id>/artifacts?per_page=30" --jq '.artifacts[] | select(.name | startswith("html-report")) | select(.expired == false) | .name'
   ```

-2. **Parse flaky tests from comments:**
-
-   Extract flaky test names from the "Flaky Tests" section of each comment. Flaky tests appear in this format:
+   b. Download it using `gh run download`:

   ```
-   - `<spec_file.spec.ts> > <test name>` (passed after N retry/retries)
+   gh run download <run_id> --name <artifact_name> --dir /tmp/playwright-report-<run_id>
   ```

-   Parse each line with this pattern to extract the spec file and test name. The spec file is everything before the first `>`.
+   c. Parse `/tmp/playwright-report-<run_id>/results.json` to extract flaky tests. Write a Node.js script inside the `.claude/` directory to do this parsing. Flaky tests are those where the final result status is `"passed"` but a prior result has status `"failed"`, `"timedOut"`, or `"interrupted"`. The test title is built by joining parent suite titles (including the spec file path) and the test title, separated by `>`.
+
+   d. Clean up the downloaded artifact directory after parsing.
+
+   **Note:** Some runs may not have an html-report artifact (e.g., if they were cancelled early, the merge-reports job didn't complete, or artifacts have expired past the 3-day retention period). Skip these runs and continue to the next one.
+
+2. **Parse flaky tests from results:**
+
+   From each `results.json`, extract flaky test names. A test is flaky if:
+   - It has multiple results (retries occurred)
+   - The final result status is `"passed"`
+   - At least one prior result has status `"failed"`, `"timedOut"`, or `"interrupted"`
+
+   The test title format is: `<spec_file.spec.ts> > <Suite Name> > <Test Name>`
+
+   Parse each title to extract the spec file (everything before the first `>`).

 3. **Deduplicate and rank by frequency:**

-   Count how many times each test appears as flaky across all PRs. Sort by frequency (most flaky first). Group tests by their spec file.
+   Count how many times each test appears as flaky across all CI runs. Sort by frequency (most flaky first). Group tests by their spec file.

   Print a summary table:

@@ -55,7 +70,7 @@ Automatically gather flaky E2E tests from recent PR Playwright summary comments

 4. **Skip if no flaky tests found:**

-   If no flaky tests are found, report "No flaky tests found in recent PRs" and stop.
+   If no flaky tests are found, report "No flaky tests found in recent commits" and stop.

 5. **Install dependencies and build:**

@@ -113,7 +128,7 @@ Automatically gather flaky E2E tests from recent PR Playwright summary comments
 7. **Summarize results:**

   Report:
-   - Total flaky tests found across PRs
+   - Total flaky tests found across commits
   - Which tests were successfully deflaked
   - What fixes were applied to each
   - Which tests could not be fixed (and why)
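
The flaky-test rules in steps 1c, 2, and 3 of the command doc above (multiple results, final status `"passed"`, an earlier `"failed"`, `"timedOut"`, or `"interrupted"`, titles joined with `>`, then ranked by frequency) could be sketched in Node.js roughly as follows. This is only an illustration, not the script from the PR: it assumes `results.json` uses the nested `suites` / `specs` / `tests` / `results` layout of Playwright's JSON reporter, and the function names (`isFlaky`, `findFlakyTests`, `rankByFrequency`) are hypothetical. Counting a test at most once per run is also an assumption.

```javascript
// Sketch only: assumes the report looks like
//   { suites: [{ title, suites: [...], specs: [{ title, tests: [{ results: [{ status }] }] }] }] }
// which is an assumption about results.json, not something the PR confirms.

const PRIOR_FAILURE_STATUSES = new Set(["failed", "timedOut", "interrupted"]);

// A test is flaky if it was retried, finally passed, and failed at least once before.
function isFlaky(results) {
  if (!Array.isArray(results) || results.length < 2) return false;
  const finalResult = results[results.length - 1];
  if (finalResult.status !== "passed") return false;
  return results
    .slice(0, -1)
    .some((result) => PRIOR_FAILURE_STATUSES.has(result.status));
}

// Walk one suite recursively, carrying parent titles (spec file path first).
function collectFlakyTitles(suite, parentTitles = []) {
  const titles = suite.title ? [...parentTitles, suite.title] : parentTitles;
  const flaky = [];
  for (const spec of suite.specs ?? []) {
    for (const test of spec.tests ?? []) {
      if (isFlaky(test.results)) {
        flaky.push([...titles, spec.title].join(" > "));
      }
    }
  }
  for (const child of suite.suites ?? []) {
    flaky.push(...collectFlakyTitles(child, titles));
  }
  return flaky;
}

// Returns titles such as "<spec_file.spec.ts> > <Suite Name> > <Test Name>".
function findFlakyTests(report) {
  return (report.suites ?? []).flatMap((suite) => collectFlakyTitles(suite));
}

// titlesPerRun: one array of flaky titles per CI run.
// Each test is counted at most once per run (an assumption of this sketch).
function rankByFrequency(titlesPerRun) {
  const counts = new Map();
  for (const titles of titlesPerRun) {
    for (const title of new Set(titles)) {
      counts.set(title, (counts.get(title) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .map(([title, count]) => ({
      title,
      specFile: title.split(">")[0].trim(), // everything before the first ">"
      count,
    }))
    .sort((a, b) => b.count - a.count || a.title.localeCompare(b.title));
}
```

In use, each downloaded report would be loaded with something like `JSON.parse(fs.readFileSync("/tmp/playwright-report-<run_id>/results.json", "utf8"))`, passed to `findFlakyTests`, and the per-run arrays handed to `rankByFrequency` to produce the rows of the summary table.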

--- a/.github/workflows/claude-deflake-e2e.yml
+++ b/.github/workflows/claude-deflake-e2e.yml
@@ -6,8 +6,8 @@ on:
    - cron: "0 10 * * *"
  workflow_dispatch:
    inputs:
-      pr_count:
-        description: "Number of recent PRs to scan for flaky tests"
+      commit_count:
+        description: "Number of recent commits on main to scan for flaky tests"
        required: false
        default: "10"
        type: string
@@ -20,6 +20,7 @@ jobs:
      - macOS
      - ARM64
    permissions:
+      actions: read
      contents: write
      pull-requests: write
    steps:
@@ -70,4 +71,4 @@ jobs:
          claude_args: --model claude-opus-4-6
          direct: true
          prompt: |
-            /dyad:deflake-e2e-recent-prs ${{ inputs.pr_count || '10' }}
+            /dyad:deflake-e2e-recent-commits ${{ inputs.commit_count || '10' }}
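
The value passed to the command here feeds the run selection described in step 1 of the command doc: default to 10, request three times that many completed push runs, and keep only `success` and `failure` conclusions. A rough Node.js sketch of that arithmetic is below. The helper names are hypothetical, and two details are assumptions rather than anything stated in the PR: the request size is capped at 100 (the GitHub REST API's maximum `per_page`), and the filtered list is truncated to the requested commit count.

```javascript
// Sketch only: helper names are hypothetical and not part of the PR.

// Parse $ARGUMENTS into a positive integer, defaulting to 10 as the doc specifies.
function parseCommitCount(arg, fallback = 10) {
  const n = Number.parseInt(String(arg ?? "").trim(), 10);
  return Number.isInteger(n) && n > 0 ? n : fallback;
}

// Build the path given to `gh api`. `{owner}` and `{repo}` are left as
// placeholders for the gh CLI to fill in.
function runsQuery(commitCount) {
  // Fetch 3x because many runs may be cancelled by concurrency groups.
  // Capping at 100 is an assumption of this sketch (REST API per_page limit).
  const perPage = Math.min(commitCount * 3, 100);
  const params = new URLSearchParams({
    branch: "main",
    event: "push",
    per_page: String(perPage),
    status: "completed",
  });
  return `repos/{owner}/{repo}/actions/workflows/ci.yml/runs?${params}`;
}

// Mirror the jq filter from the doc, then keep at most `commitCount` runs
// (the truncation is an assumption; the doc only describes the filter).
function selectRuns(runs, commitCount) {
  return runs
    .filter((run) => run.conclusion === "success" || run.conclusion === "failure")
    .slice(0, commitCount)
    .map(({ id, head_sha, conclusion }) => ({ id, head_sha, conclusion }));
}
```

With the default of 10, `runsQuery` yields the same query string as the `gh api` call in the command doc with `per_page=30`.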