test: adds search-replace evaluation suite (#3205)
See `src/__tests__/evals/README.md` for usage.
Other notes:
- The test fixtures are 300+ lines each, but I still think some of them
are a little too easy. I might swap some out for more challenging ones,
or edit them to be less straightforward.
- This currently only tests `search_replace`, so I don't yet have a way
to compare the correctness, token usage, or time taken of
`search_replace` vs `edit_file` vs `write_file`.
- Otherwise, I think the data collection is fairly thorough. The one
thing missing is cost (which would be a rough estimate at best), but I
do store the number of input/output tokens for each tool call.
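
As a rough illustration of how the stored token counts could later be turned into a cost estimate, here is a minimal sketch. The `ToolCallUsage` and `Pricing` types, the `estimateCostUsd` helper, and the prices are all hypothetical and are not part of this PR; real per-token rates would have to come from the model provider.

```typescript
// Hypothetical shape of one recorded tool call; field names are assumptions,
// not the actual schema used by the eval suite.
interface ToolCallUsage {
  tool: string;
  inputTokens: number;
  outputTokens: number;
}

// Placeholder prices in USD per million tokens; not real provider rates.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Sum a rough cost estimate over all recorded tool calls.
function estimateCostUsd(calls: ToolCallUsage[], pricing: Pricing): number {
  let total = 0;
  for (const call of calls) {
    total += (call.inputTokens / 1e6) * pricing.inputPerMillion;
    total += (call.outputTokens / 1e6) * pricing.outputPerMillion;
  }
  return total;
}

// Example with made-up numbers.
const exampleCalls: ToolCallUsage[] = [
  { tool: "search_replace", inputTokens: 2000, outputTokens: 500 },
  { tool: "search_replace", inputTokens: 1000, outputTokens: 250 },
];
const examplePricing: Pricing = { inputPerMillion: 1, outputPerMillion: 2 };
console.log(estimateCostUsd(exampleCalls, examplePricing).toFixed(6));
```

Keeping the pricing separate from the recorded usage means the raw token counts stay valid even if the rates change.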
src/__tests__/evals/README.md
0 → 100644
vitest.eval.config.ts
0 → 100644