Submitted by Ryan Groch
See `src/__tests__/evals/README.md` for usage.

Other notes:

- The test fixtures are 300+ lines each. Even so, I still think some of them are a little too easy. I might swap some of them out for more challenging ones, or edit them so that they're not so straightforward.
- This currently still only tests `search_replace`, so I don't yet have a way to compare correctness/token usage/time taken of `search_replace` vs `edit_file` vs `write_file`.
- Otherwise, though, I think I'm fairly thorough about collecting data. One thing I'm missing is the cost (it would probably be a rough estimate at best) but I'm at least able to store the number of input/output tokens for each tool call.
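For the missing cost figure, a rough estimate could be derived from the per-call token counts that are already stored. The sketch below is only an illustration: `ToolCallUsage`, `TokenPricing`, and `estimateCostUsd` are hypothetical names that do not come from this PR, and the prices are placeholders to be filled in by the caller, not real rates for any model.

```typescript
// Hypothetical record shape (assumption): the PR only states that input and
// output token counts are stored for each tool call.
interface ToolCallUsage {
  tool: string; // e.g. "search_replace"
  inputTokens: number;
  outputTokens: number;
}

// Placeholder pricing in USD per million tokens (assumption, caller-supplied).
interface TokenPricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Sums a rough cost estimate over all recorded tool calls.
function estimateCostUsd(
  calls: ToolCallUsage[],
  pricing: TokenPricing,
): number {
  let total = 0;
  for (const call of calls) {
    total += (call.inputTokens / 1_000_000) * pricing.inputPerMillion;
    total += (call.outputTokens / 1_000_000) * pricing.outputPerMillion;
  }
  return total;
}
```

Keeping the pricing as a parameter means the stored token counts stay the source of truth, and the estimate can be recomputed later if rates change.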