OpenAI launches SWE-bench Verified
On August 15, OpenAI introduced SWE-bench Verified, a more reliable benchmark for evaluating code generation. The most telling line in the company's blog post: "As our systems get closer to AGI, we need to evaluate them in increasingly challenging tasks." The benchmark is a human-validated subset of the existing SWE-bench, designed to measure more reliably how well AI models solve real-world software problems.
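For readers who want to look at the benchmark itself, here is a minimal sketch of loading and inspecting it, assuming the dataset is published on Hugging Face under the id "princeton-nlp/SWE-bench_Verified" and that the usual SWE-bench fields (repository, problem statement, reference patch) are present; both the dataset id and the field names are assumptions, not details from the announcement.

```python
# Minimal sketch: inspect SWE-bench Verified via the Hugging Face `datasets` library.
# Assumption: the benchmark is available as "princeton-nlp/SWE-bench_Verified".
from datasets import load_dataset

# Load the evaluation split of the human-validated subset.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(ds))        # number of task instances in the subset
print(ds[0].keys())   # available fields, e.g. repo, problem_statement, patch (assumed)
```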