I’m working with a setup (React/TypeScript) where an AI agent generates pull requests to fix issues. For each task, we already have a reference implementation (fix.patch) that represents the target solution.
Currently, our tests are based on this fix.patch. That means they don’t just validate functionality, but also implicitly assume the same structure (file names, layout, etc.). The problem:
The AI often produces a valid solution, but with a different structure than the fix.patch.
As a result, the tests fail even though the code “works.”
The challenge:
We can’t prescribe implementation details in the base description for the AI (no file names, no structure).
We want the tests to be resilient enough to accept divergent implementations, while still making sure the functionality matches the fix.patch.
Possible strategies I’m considering:
Dynamic discovery – instead of assuming structure, tests would import from a known entry point and verify exposed behavior (rough sketch below).
Dependency injection – encourage the AI to implement components with DI so we can swap in mocks, independent of internal structure (second sketch below).
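
For the dynamic-discovery idea, I’m picturing something like this Jest test, where `formatPrice`, the expected values, and the entry-point barrel are made-up placeholders rather than the real contract from our fix.patch:

```ts
// Behavior-only test: import from the package entry point rather than a deep
// file path, so internal restructuring by the agent can't break the import.
// formatPrice and the expected values are hypothetical placeholders.
import { formatPrice } from "../src";

describe("price formatting behavior", () => {
  it("formats an amount in cents as a currency string", () => {
    expect(formatPrice(1999, "USD")).toBe("$19.99");
  });

  it("throws on negative amounts", () => {
    expect(() => formatPrice(-1, "USD")).toThrow();
  });
});
```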
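And for the DI idea, roughly this (React Testing Library; `UserList` and its `client` prop are again hypothetical names, not taken from the fix.patch):

```tsx
// DI sketch: the test injects a fake data source, so it doesn't care how the
// agent structured the real client internally.
import "@testing-library/jest-dom";
import { render, screen } from "@testing-library/react";
import { UserList } from "../src"; // entry-point import again, no deep paths

it("renders whatever users the injected client returns", async () => {
  const fakeClient = { fetchUsers: async () => [{ id: 1, name: "Ada" }] };
  render(<UserList client={fakeClient} />);
  expect(await screen.findByText("Ada")).toBeInTheDocument();
});
```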
But since the fix.patch is the reference, I’m wondering: how can we design tests that validate behavioral equivalence to the fix.patch without being too tightly coupled to its exact layout?