Unverified commit 1e90ac6b, authored by Will Chen, committed by GitHub

feat: add daily deflake-e2e-recent-prs skill and workflow (#2590)

## Summary

- Add new `/dyad:deflake-e2e-recent-prs` command that automatically gathers flaky E2E tests from recent PR Playwright summary comments, ranks them by frequency, and deflakes them sequentially
- Add `claude-deflake-e2e.yml` GitHub Actions workflow that runs daily at 2 AM PST on self-hosted macOS runners (with workflow_dispatch support for manual triggers)
- Document the new command in `.claude/README.md`

## Test plan

- Trigger the workflow manually via `gh workflow run claude-deflake-e2e.yml` and verify it correctly scans recent PRs for flaky tests and attempts to deflake them
- Verify the cron schedule triggers at the expected time

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> **Medium Risk**
> Adds a scheduled GitHub Action with write permissions that can open PRs, so misconfiguration or prompt issues could create noisy or unintended changes; however, it is limited to CI/automation and test-only guidance.
>
> **Overview**
> Adds a new Claude slash command, `/dyad:deflake-e2e-recent-prs`, that scans recent PR comments for Playwright "Flaky Tests", ranks them by frequency, and runs deflaking steps per spec (including guidance to disable retries via `PLAYWRIGHT_RETRIES=0`) before optionally opening a fix PR.
>
> Introduces a scheduled/manual GitHub Actions workflow (`claude-deflake-e2e.yml`) that runs daily on self-hosted macOS ARM64, installs dependencies/browsers, and invokes the new command via `anthropics/claude-code-action`. Documentation is updated to list the new command, and the existing `/dyad:deflake-e2e` instructions are tightened to always disable Playwright retries.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 10b9158363c6b9ae9a3f3ba52ad118149fb9cbd3.</sup>

---

## Summary by cubic

Adds a new `/dyad:deflake-e2e-recent-prs` command that scans recent PRs for Playwright flake reports, ranks flaky tests, and deflakes them. Adds a daily GitHub Action that runs it at 10:00 UTC (2 AM PST / 3 AM PDT) on self-hosted macOS ARM64 to keep E2E tests stable.

- **New Features**
  - Command scans recent PRs (default 20), parses Playwright summary comments from `github-actions[bot]`, ranks by frequency, and deflakes specs sequentially; can push fixes via `/dyad:pr-push`.
  - New `claude-deflake-e2e.yml` workflow supports manual dispatch with `pr_count`, sets up Node/pnpm, installs Chromium, builds the fake LLM server, and runs the command via `anthropics/claude-code-action`.
- **Bug Fixes**
  - Fixed spec path handling (no double `.spec.ts`), added `gh api --paginate`, switched to generic PR search, clarified `{owner}/{repo}` vs `<pr_number>`, and noted DST in the cron comment.
  - Disabled Playwright automatic retries in all deflake steps to prevent false passes, including debug and snapshot update commands.

<sup>Written for commit 10b9158363c6b9ae9a3f3ba52ad118149fb9cbd3.</sup>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Parent: dcf06ece
@@ -6,16 +6,17 @@ This directory contains Claude Code configuration for the Dyad project.
Slash commands are invoked with `/dyad:<command>`. Available commands:
-| Command | Description | Uses |
-| ----------------------- | -------------------------------------------------------------- | ----------------------------------- |
-| `/dyad:plan-to-issue` | Convert a plan to a GitHub issue | - |
-| `/dyad:fix-issue` | Fix a GitHub issue | `pr-push` |
-| `/dyad:pr-fix` | Fix PR issues from CI failures or review comments | `pr-fix:comments`, `pr-fix:actions` |
-| `/dyad:pr-fix:comments` | Address unresolved PR review comments | `lint`, `pr-push` |
-| `/dyad:pr-fix:actions` | Fix failing CI checks and GitHub Actions | `e2e-rebase`, `pr-push` |
-| `/dyad:pr-rebase` | Rebase the current branch | `pr-push` |
-| `/dyad:pr-push` | Push changes and create/update a PR | `remember-learnings` |
-| `/dyad:lint` | Run all pre-commit checks (formatting, linting, type-checking) | - |
-| `/dyad:e2e-rebase` | Rebase E2E test snapshots | - |
-| `/dyad:deflake-e2e` | Deflake flaky E2E tests | - |
-| `/dyad:session-debug` | Debug session issues | - |
+| Command | Description | Uses |
+| ------------------------------ | -------------------------------------------------------------- | ----------------------------------- |
+| `/dyad:plan-to-issue` | Convert a plan to a GitHub issue | - |
+| `/dyad:fix-issue` | Fix a GitHub issue | `pr-push` |
+| `/dyad:pr-fix` | Fix PR issues from CI failures or review comments | `pr-fix:comments`, `pr-fix:actions` |
+| `/dyad:pr-fix:comments` | Address unresolved PR review comments | `lint`, `pr-push` |
+| `/dyad:pr-fix:actions` | Fix failing CI checks and GitHub Actions | `e2e-rebase`, `pr-push` |
+| `/dyad:pr-rebase` | Rebase the current branch | `pr-push` |
+| `/dyad:pr-push` | Push changes and create/update a PR | `remember-learnings` |
+| `/dyad:lint` | Run all pre-commit checks (formatting, linting, type-checking) | - |
+| `/dyad:e2e-rebase` | Rebase E2E test snapshots | - |
+| `/dyad:deflake-e2e` | Deflake flaky E2E tests | - |
+| `/dyad:deflake-e2e-recent-prs` | Gather flaky tests from recent PRs and deflake them | `deflake-e2e`, `pr-push` |
+| `/dyad:session-debug` | Debug session issues | - |
# Deflake E2E Tests from Recent PRs
Automatically gather flaky E2E tests from recent PR Playwright summary comments and deflake them.
## Arguments
- `$ARGUMENTS`: (Optional) Number of recent PRs to scan (default: 20)
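A minimal TypeScript sketch of how the optional argument could be interpreted. The helper `parsePrCount` is hypothetical, and falling back to the default for empty, non-numeric, or zero input is an assumption; the command itself only specifies the default of 20.

```typescript
// Hypothetical helper: interpret the optional $ARGUMENTS string as a PR count.
// Assumption: anything other than a positive integer falls back to the default.
const DEFAULT_PR_COUNT = 20;

function parsePrCount(args?: string): number {
  const trimmed = (args ?? "").trim();
  // Accept plain digit strings only, such as "20" or "50".
  if (!/^\d+$/.test(trimmed)) {
    return DEFAULT_PR_COUNT;
  }
  const count = Number.parseInt(trimmed, 10);
  return count > 0 ? count : DEFAULT_PR_COUNT;
}

console.log(parsePrCount()); // 20
console.log(parsePrCount(" 35 ")); // 35
console.log(parsePrCount("abc")); // 20
```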
## Task Tracking
**You MUST use the TaskCreate and TaskUpdate tools to track your progress.** At the start, create tasks for each major step below. Mark each task as `in_progress` when you start it and `completed` when you finish.
## Instructions
1. **Gather flaky tests from recent PRs:**
Use `gh` to find recent PRs that have a Playwright summary comment from `github-actions[bot]`:
```
gh pr list --search 'commenter:github-actions[bot] "Playwright Test Results" in:comments' --state all --limit <PR_COUNT> --json number
```
Use `$ARGUMENTS` as the PR count, defaulting to 20 if not provided.
For each PR, fetch comments from `github-actions[bot]` that contain the Playwright test results.
**Note:** The `gh` CLI fills in `{owner}` and `{repo}` automatically from the current repository. Replace `<pr_number>` with the actual PR number.
```
gh api repos/{owner}/{repo}/issues/<pr_number>/comments --paginate --jq '.[] | select(.user.login == "github-actions[bot]") | select(.body | contains("Playwright Test Results")) | .body'
```
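The jq filter above can be read as the following TypeScript function, which may help when checking what it selects. This is an illustrative sketch: `IssueComment` and `selectPlaywrightBodies` are hypothetical names, and only the `user.login` and `body` fields of each comment object returned by the issue-comments endpoint are assumed.

```typescript
// Only the fields the jq filter reads; the API returns many more.
interface IssueComment {
  user: { login: string } | null;
  body: string | null;
}

// Mirrors:
//   .[] | select(.user.login == "github-actions[bot]")
//       | select(.body | contains("Playwright Test Results")) | .body
// (a null body is treated as empty here instead of raising an error as jq would)
function selectPlaywrightBodies(comments: IssueComment[]): string[] {
  return comments
    .filter((c) => c.user?.login === "github-actions[bot]")
    .map((c) => c.body ?? "")
    .filter((body) => body.includes("Playwright Test Results"));
}

const sampleComments: IssueComment[] = [
  { user: { login: "github-actions[bot]" }, body: "## Playwright Test Results ..." },
  { user: { login: "octocat" }, body: "Quoting the Playwright Test Results here" },
  { user: { login: "github-actions[bot]" }, body: "Deploy preview is ready" },
];
console.log(selectPlaywrightBodies(sampleComments).length); // 1
```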
2. **Parse flaky tests from comments:**
Extract flaky test names from the "Flaky Tests" section of each comment. Flaky tests appear in this format:
```
- `<spec_file.spec.ts> > <test name>` (passed after N retry/retries)
```
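Parse each line with this pattern to extract the spec file and test name. The spec file is everything before the first `>` (trimmed); the test name is everything after it, and may itself contain further `>` separators between describe blocks and the test title.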
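A minimal TypeScript sketch of this parsing step. `FlakyTest`, `parseFlakyLine`, and `parseFlakyTests` are hypothetical names. Because the full comment layout is not shown here, the sketch assumes the section begins at a line mentioning "Flaky Tests" and ends at the next Markdown heading; adjust that rule to the real comment format.

```typescript
// One flaky entry parsed from a Playwright summary comment.
interface FlakyTest {
  specFile: string; // e.g. "setup_flow.spec.ts"
  testName: string; // e.g. "Setup Flow > setup banner shows correct state"
}

// Matches "- `<spec> > <test name>` (passed after N retries)" and captures
// the backticked part; the retry note after the closing backtick is ignored.
const FLAKY_LINE = /^\s*-\s+`([^`]+)`/;

function parseFlakyLine(line: string): FlakyTest | null {
  const match = FLAKY_LINE.exec(line);
  if (!match) return null;
  const inner = match[1];
  const sep = inner.indexOf(">");
  if (sep === -1) return null;
  // Spec file: everything before the first ">". Test name: everything after,
  // which may contain more ">" separators (describe block > test title).
  const specFile = inner.slice(0, sep).trim();
  const testName = inner.slice(sep + 1).trim();
  return specFile && testName ? { specFile, testName } : null;
}

// Assumption: the section starts at a line mentioning "Flaky Tests" and ends
// at the next Markdown heading (a line starting with "#").
function parseFlakyTests(commentBody: string): FlakyTest[] {
  const results: FlakyTest[] = [];
  let inSection = false;
  for (const line of commentBody.split("\n")) {
    if (line.includes("Flaky Tests")) {
      inSection = true;
      continue;
    }
    if (inSection && line.trimStart().startsWith("#")) {
      inSection = false;
    }
    if (!inSection) continue;
    const parsed = parseFlakyLine(line);
    if (parsed) results.push(parsed);
  }
  return results;
}
```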
3. **Deduplicate and rank by frequency:**
Count how many times each test appears as flaky across all PRs. Sort by frequency (most flaky first). Group tests by their spec file.
Print a summary table:
```
Flaky test summary:
- setup_flow.spec.ts > Setup Flow > setup banner shows correct state... (7 occurrences)
- select_component.spec.ts > select component next.js (5 occurrences)
...
```
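The counting and grouping could be sketched in TypeScript as follows. The names are hypothetical; the input is assumed to be every parsed (spec file, test name) pair from all comments, with repeats kept so they can be counted. Spec files come out ordered by total flaky occurrences, the order in which step 6 processes them.

```typescript
interface FlakyTest {
  specFile: string;
  testName: string;
}

interface SpecSummary {
  specFile: string;
  totalOccurrences: number; // sum over all tests in this spec file
  tests: { testName: string; occurrences: number }[]; // most flaky first
}

function rankFlakySpecs(entries: FlakyTest[]): SpecSummary[] {
  // specFile -> (testName -> number of times it was reported flaky)
  const bySpec = new Map<string, Map<string, number>>();
  for (const { specFile, testName } of entries) {
    const tests = bySpec.get(specFile) ?? new Map<string, number>();
    tests.set(testName, (tests.get(testName) ?? 0) + 1);
    bySpec.set(specFile, tests);
  }

  const summaries: SpecSummary[] = [];
  for (const [specFile, tests] of bySpec) {
    const ranked = [...tests.entries()]
      .map(([testName, occurrences]) => ({ testName, occurrences }))
      .sort((a, b) => b.occurrences - a.occurrences);
    const totalOccurrences = ranked.reduce((sum, t) => sum + t.occurrences, 0);
    summaries.push({ specFile, totalOccurrences, tests: ranked });
  }
  // Spec files with the most flaky occurrences come first.
  return summaries.sort((a, b) => b.totalOccurrences - a.totalOccurrences);
}
```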
4. **Skip if no flaky tests found:**
If no flaky tests are found, report "No flaky tests found in recent PRs" and stop.
5. **Install dependencies and build:**
```
npm install
npm run build
```
**IMPORTANT:** This build step is required before running E2E tests. If you make any changes to application code (anything outside of `e2e-tests/`), you MUST re-run `npm run build`.
6. **Deflake each flaky test spec file (sequentially):**
For each unique spec file that has flaky tests (ordered by total flaky occurrences, most flaky first):
a. Run the spec file 10 times to confirm flakiness (note: `<spec_file>` already includes the `.spec.ts` extension from parsing):
```
PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<spec_file> --repeat-each=10
```
**IMPORTANT:** `PLAYWRIGHT_RETRIES=0` is required to disable automatic retries. Without it, CI environments (where `CI=true`) default to 2 retries, causing flaky tests to pass on retry and be incorrectly skipped.
b. If the test passes all 10 runs, skip it (it may have been fixed already).
c. If the test fails at least once, investigate with debug logs:
```
DEBUG=pw:browser PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<spec_file>
```
d. Fix the flaky test following Playwright best practices:
- Use `await expect(locator).toBeVisible()` before interacting with elements
- Use `await page.waitForLoadState('networkidle')` for network-dependent tests
- Use stable selectors (data-testid, role, text) instead of fragile CSS selectors
- Add explicit waits for animations: `await page.waitForTimeout(300)` (use sparingly)
- For visual tests, pass options such as `maxDiffPixelRatio` to `await expect(locator).toHaveScreenshot()`
- Ensure proper test isolation (clean state before/after tests)
**IMPORTANT:** Do NOT change any application code. Only modify test files and snapshot baselines.
e. Update snapshot baselines if needed:
```
PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<spec_file> --update-snapshots
```
f. Verify the fix by running 10 times again:
```
PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<spec_file> --repeat-each=10
```
g. If the test still fails after your fix attempt, revert any changes to that spec file and move on to the next one. Do not spend more than 2 attempts fixing a single spec file.
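Why `PLAYWRIGHT_RETRIES=0` matters can be illustrated with a hypothetical function of the kind a Playwright config might use to pick its `retries` value. How the project's config actually reads the variable is an assumption here; the text only states that `CI=true` otherwise yields 2 retries.

```typescript
// Hypothetical: how a playwright.config.ts might derive its `retries` value.
// An explicit, valid PLAYWRIGHT_RETRIES wins; otherwise CI gets 2, local runs 0.
type Env = Record<string, string | undefined>;

function resolveRetries(env: Env): number {
  const explicit = env.PLAYWRIGHT_RETRIES;
  if (explicit !== undefined && explicit.trim() !== "") {
    const parsed = Number.parseInt(explicit, 10);
    if (Number.isInteger(parsed) && parsed >= 0) return parsed;
  }
  // Any non-empty CI value counts as "running in CI".
  return env.CI ? 2 : 0;
}

console.log(resolveRetries({ CI: "true" })); // 2
console.log(resolveRetries({ CI: "true", PLAYWRIGHT_RETRIES: "0" })); // 0
```

In a config file this would be wired up as `retries: resolveRetries(process.env)`. With retries at 0, each of the 10 `--repeat-each` runs is judged on its own, so a single intermittent failure is reported instead of being absorbed by a retry.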
7. **Summarize results:**
Report:
- Total flaky tests found across PRs
- Which tests were successfully deflaked
- What fixes were applied to each
- Which tests could not be fixed (and why)
- Verification results
8. **Create PR with fixes:**
If any fixes were made, run `/dyad:pr-push` to commit, lint, test, and push the changes as a PR.
Use a branch name like `deflake-e2e-<date>` (e.g., `deflake-e2e-2025-01-15`).
The PR title should be: `fix: deflake E2E tests (<list of spec files>)`
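As an illustration of these naming conventions, a small TypeScript sketch. The helper names are hypothetical, the date is taken from the UTC calendar day, and joining spec files with `, ` is an assumption, since the text only says `<list of spec files>`.

```typescript
// Hypothetical helpers for the branch name and PR title conventions.
function deflakeBranchName(date: Date): string {
  // toISOString() always starts with "YYYY-MM-DD" (UTC), so keep the first 10 chars.
  return `deflake-e2e-${date.toISOString().slice(0, 10)}`;
}

function deflakePrTitle(specFiles: string[]): string {
  // Assumption: spec files are listed comma-separated.
  return `fix: deflake E2E tests (${specFiles.join(", ")})`;
}

console.log(deflakeBranchName(new Date(Date.UTC(2025, 0, 15, 10)))); // deflake-e2e-2025-01-15
console.log(deflakePrTitle(["setup_flow.spec.ts", "select_component.spec.ts"]));
// fix: deflake E2E tests (setup_flow.spec.ts, select_component.spec.ts)
```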
@@ -35,9 +35,11 @@ Identify and fix flaky E2E tests by running them repeatedly and investigating fa
For each test file, run it 10 times:
```
-PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --repeat-each=10
+PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --repeat-each=10
```
**IMPORTANT:** `PLAYWRIGHT_RETRIES=0` is required to disable automatic retries. Without it, CI environments (where `CI=true`) default to 2 retries, causing flaky tests to pass on retry and be incorrectly skipped as "not flaky."
Notes:
- If `$ARGUMENTS` is provided without the `e2e-tests/` prefix, add it
- If `$ARGUMENTS` is provided without the `.spec.ts` suffix, add it
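The two normalization notes can be expressed as a small TypeScript helper (the name `normalizeSpecPath` is hypothetical). Checking for the suffix before appending it means an argument that already ends in `.spec.ts` does not get a second one.

```typescript
// Hypothetical helper applying the two notes: add the e2e-tests/ prefix and
// the .spec.ts suffix only when they are missing.
function normalizeSpecPath(arg: string): string {
  let spec = arg.trim();
  if (!spec.startsWith("e2e-tests/")) {
    spec = `e2e-tests/${spec}`;
  }
  if (!spec.endsWith(".spec.ts")) {
    spec = `${spec}.spec.ts`;
  }
  return spec;
}

console.log(normalizeSpecPath("setup_flow")); // e2e-tests/setup_flow.spec.ts
console.log(normalizeSpecPath("e2e-tests/setup_flow.spec.ts")); // e2e-tests/setup_flow.spec.ts
```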
@@ -48,7 +50,7 @@ Identify and fix flaky E2E tests by running them repeatedly and investigating fa
Run the failing test with Playwright browser debugging enabled:
```
-DEBUG=pw:browser PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts
+DEBUG=pw:browser PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts
```
Analyze the debug output to understand:
@@ -75,7 +77,7 @@ Identify and fix flaky E2E tests by running them repeatedly and investigating fa
If the flakiness is due to legitimate visual differences:
```
-PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --update-snapshots
+PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --update-snapshots
```
8. **Verify the fix:**
@@ -83,7 +85,7 @@ Identify and fix flaky E2E tests by running them repeatedly and investigating fa
Re-run the test 10 times to confirm it's no longer flaky:
```
-PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --repeat-each=10
+PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --repeat-each=10
```
The test should pass all 10 runs consistently.
name: Claude Deflake E2E
on:
  schedule:
    # Daily at 10:00 UTC (2 AM PST / 3 AM PDT due to DST)
    - cron: "0 10 * * *"
  workflow_dispatch:
    inputs:
      pr_count:
        description: "Number of recent PRs to scan for flaky tests"
        required: false
        default: "10"
        type: string
jobs:
  deflake:
    environment: ai-bots
    runs-on:
      - self-hosted
      - macOS
      - ARM64
    permissions:
      contents: write
      pull-requests: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v5
        with:
          fetch-depth: 0
      - name: Initialize environment
        uses: actions/setup-node@v4
        with:
          node-version-file: package.json
          cache: npm
          cache-dependency-path: package-lock.json
      - name: Install node modules
        run: npm ci --no-audit --no-fund --progress=false
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      - name: Setup pnpm
        uses: pnpm/action-setup@a7487c7e89a18df4991f7f222e4898a00d66ddda # v4.1.0
        with:
          version: latest
      - name: Clone nextjs-template
        run: git clone --depth 1 https://github.com/dyad-sh/nextjs-template.git nextjs-template
      - name: Install scaffold dependencies
        run: cd scaffold && pnpm install
      - name: Install nextjs-template dependencies
        run: cd nextjs-template && pnpm install
      - name: Install Chromium browser for Playwright
        run: npx playwright install chromium --with-deps
      - name: Build fake LLM server
        run: cd testing/fake-llm-server && npm install && npm run build
      - name: Deflake E2E tests
        uses: anthropics/claude-code-action@v1
        env:
          CLAUDE_CODE_MAX_OUTPUT_TOKENS: 48000
        with:
          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          github_token: ${{ secrets.GITHUB_TOKEN }}
          claude_args: --model claude-opus-4-6
          direct: true
          prompt: |
            /dyad:deflake-e2e-recent-prs ${{ inputs.pr_count || '10' }}
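The cron comment's UTC-to-Pacific conversion can be checked with a short TypeScript sketch. `pacificHourAt` is a hypothetical helper, and the sketch assumes a runtime with full ICU time-zone data (current Node.js releases ship it by default). Because the schedule is a fixed UTC time, the local Pacific hour moves by one when daylight saving time begins or ends.

```typescript
// Which Pacific-time hour does "0 10 * * *" (10:00 UTC) correspond to?
// Uses the IANA zone America/Los_Angeles via the built-in Intl API.
function pacificHourAt(utcDate: Date): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone: "America/Los_Angeles",
    hour: "numeric",
    hourCycle: "h23",
  }).formatToParts(utcDate);
  const hour = parts.find((p) => p.type === "hour");
  if (!hour) throw new Error("formatter returned no hour part");
  // The value may be zero-padded ("02"), so parse it as a number.
  return Number.parseInt(hour.value, 10);
}

const winterRun = new Date(Date.UTC(2025, 0, 15, 10, 0)); // January: PST, UTC-8
const summerRun = new Date(Date.UTC(2025, 6, 15, 10, 0)); // July: PDT, UTC-7
console.log(pacificHourAt(winterRun)); // 2
console.log(pacificHourAt(summerRun)); // 3
```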