On February 10, 2026, research firm Paradigm released EVMbench, an open-source benchmark that measures how well AI models can find and exploit critical vulnerabilities in Ethereum smart contracts. The tool tests AI models against real bugs from Code4rena audits that previously resulted in fund-draining exploits.
Performance Leap in Exploit Detection
Recent AI models show dramatic improvement in identifying contract vulnerabilities. When Paradigm began development, leading models exploited under 20% of the critical bugs in the benchmark. The latest model tested, GPT-5.3-Codex, now successfully exploits over 70% of the same vulnerabilities, according to Paradigm’s announcement.
The benchmark focuses on offensive security capabilities because smart contracts secure over $100 billion in crypto assets through publicly visible code. Unlike traditional software, where vulnerabilities can remain hidden behind closed source, blockchain transparency gives attackers unlimited time to analyze contracts for exploitable flaws.
Testing Methodology and Scope
EVMbench contains vulnerabilities previously discovered through competitive auditing platforms such as Code4rena. Each test case includes the vulnerable contract code, the specific bug type, and the expected exploit result. AI models must generate working exploit code that successfully drains funds or breaks contract invariants.
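Paradigm’s repository defines the exact case format; purely as an illustration, a harness could represent such a case and its pass criterion along the following lines (the `ExploitCase` fields and outcome labels here are hypothetical assumptions, not EVMbench’s actual schema):

```python
from dataclasses import dataclass

@dataclass
class ExploitCase:
    """Hypothetical test-case record; field names are illustrative,
    not EVMbench's actual schema."""
    contract_source: str   # Solidity source of the vulnerable contract
    bug_class: str         # e.g. "reentrancy" (label is an assumption)
    expected_outcome: str  # "drain_funds" or "break_invariant"

def exploit_succeeded(case: ExploitCase, balance_before: int,
                      balance_after: int, invariant_holds: bool) -> bool:
    """Sketch of a pass criterion: an exploit counts as successful if it
    drains funds from the contract or violates a stated invariant."""
    if case.expected_outcome == "drain_funds":
        return balance_after < balance_before
    return not invariant_holds
```

The notable design point is that success is judged by an observable on-chain effect, draining funds or breaking an invariant, rather than by matching a reference answer.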
Paradigm has not disclosed which other AI models were tested beyond GPT-5.3-Codex, nor provided comparative performance data across different model families. The firm also has not announced partnerships with auditing platforms or outlined a roadmap for expanding the benchmark’s test cases.
Defense Remains Harder Than Attack
While AI models increasingly excel at finding exploits, fixing vulnerabilities presents greater challenges. Paradigm noted that repairs require understanding design intent to avoid introducing new bugs while preserving all intended functionality. Automated exploit detection may advance faster than automated remediation.
What Security Teams Should Know
The benchmark’s public availability on GitHub enables security researchers to test proprietary and open-source models against standardized contract vulnerabilities. This allows consistent measurement of AI capabilities for both offensive and defensive security applications.
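For teams considering this kind of measurement internally, the evaluation loop is conceptually simple. Here is a minimal sketch, assuming hypothetical `model` and `run_in_sandbox` callables rather than EVMbench’s actual interfaces:

```python
from typing import Callable, Iterable

def exploit_rate(model: Callable[[str], str],
                 contracts: Iterable[str],
                 run_in_sandbox: Callable[[str, str], bool]) -> float:
    """Fraction of vulnerable contracts the model successfully exploits.

    Hypothetical harness loop, not EVMbench's API: each item in
    `contracts` is a contract's Solidity source, `model` returns
    candidate exploit code, and `run_in_sandbox` executes that code
    against an isolated EVM instance and reports whether funds were
    drained or an invariant was broken.
    """
    sources = list(contracts)
    if not sources:
        return 0.0
    hits = sum(run_in_sandbox(src, model(src)) for src in sources)
    return hits / len(sources)
```

Running the same loop over the same standardized cases is what makes scores comparable across proprietary and open-source models.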
Security auditors should recognize that AI tools approaching 70% success rates on known critical bugs may soon match or exceed human auditors’ initial detection capabilities for certain vulnerability classes. However, the benchmark tests only exploit generation, not the contextual judgment required to assess business logic flaws or economic attack vectors.
For EVM-compatible smart contract developers, the results suggest AI-assisted security tools will become increasingly effective at both attacking and defending protocols. Teams should evaluate whether integrating AI-powered analysis into their audit workflows could catch vulnerabilities before deployment, and whether adversarial use of the same models poses heightened risks to their contracts.