Google Gemini-SQL2 achieves 80.04% accuracy on BIRD leaderboard

June 13, 2026

Google Research has unveiled Gemini-SQL2, a text-to-SQL capability powered by Gemini 3.1 Pro that achieved 80.04% execution accuracy on the BIRD Text-to-SQL Leaderboard in the Single Model track. The result is a meaningful step forward in translating natural language questions into SQL that runs and returns the right answer.

What the Benchmark Measures

Gemini-SQL2 posted 80.04% execution accuracy on the BIRD Text-to-SQL Leaderboard, surpassing Google’s previous top entry, Gemini-SQL. Execution accuracy counts a generated query as correct only if it runs against the database and returns the same results as the reference query, a far stricter standard than checking that the SQL is syntactically valid.
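To make the metric concrete, here is a minimal sketch of an execution-accuracy check using Python's built-in sqlite3 module. It is not the official BIRD evaluator: the table, the example queries, and the helper names (run_query, execution_match, execution_accuracy) are hypothetical, and comparing result rows as unordered sets is a simplifying assumption.

```python
import sqlite3


def run_query(conn, sql):
    """Return the result rows as a set, or None if the query fails to execute."""
    try:
        return set(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True only if both queries run and return the same set of rows."""
    predicted = run_query(conn, predicted_sql)
    gold = run_query(conn, gold_sql)
    return predicted is not None and gold is not None and predicted == gold


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose execution results match."""
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)


# Hypothetical toy database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER, region TEXT, mrr REAL);
INSERT INTO accounts VALUES (1, 'EMEA', 100.0), (2, 'APAC', 250.0), (3, 'EMEA', 50.0);
""")

pairs = [
    # Same rows, different ordering: counts as a match.
    ("SELECT region, SUM(mrr) FROM accounts GROUP BY region",
     "SELECT region, SUM(mrr) FROM accounts GROUP BY region ORDER BY region DESC"),
    # Misspelled column: the query fails to execute, so it is a miss.
    ("SELECT regoin FROM accounts",
     "SELECT region FROM accounts"),
    # Runs fine but returns a different count: also a miss.
    ("SELECT COUNT(*) FROM accounts WHERE mrr > 100",
     "SELECT COUNT(*) FROM accounts WHERE mrr >= 100"),
    # Different SQL, same answer: counts as a match.
    ("SELECT id FROM accounts WHERE mrr = (SELECT MAX(mrr) FROM accounts)",
     "SELECT id FROM accounts ORDER BY mrr DESC LIMIT 1"),
]

print(execution_accuracy(conn, pairs))  # 0.5
```

The third pair shows why the metric is stricter than a syntax check: the predicted query is valid SQL and runs cleanly, yet it still counts as wrong.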

Google Research announced the result on X, pointing to the inherent difficulty of generating accurate SQL from natural language, especially when the data has subtle quirks and the question depends on business context. The improved SQL understanding behind Gemini-SQL2 is expected to enhance natural language capabilities across Google’s broader data services.

How the BIRD Benchmark Works

The BIRD benchmark, short for BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation, serves as an industry standard for assessing text-to-SQL capabilities. It comprises 12,751 question-SQL pairs across 95 databases, spanning 37 professional domains and totaling 33.4 GB of data. Unlike older benchmarks, BIRD incorporates dirty values and requires external knowledge grounding, making it a more challenging and realistic evaluation.

The Single Trained Model Track specifically measures a model’s core text-to-SQL ability by restricting the use of preprocessing, retrieval, and agentic frameworks often used by ensembles to boost scores. Google Cloud’s prior record on this track, reported on November 15, 2025, was 76.13%. While Gemini-SQL2 sets a new high for AI systems, human performance on the BIRD benchmark sits at 92.96%, leaving a gap of 12.92 points still to close.
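The point gaps can be checked directly from the figures quoted above. The short sketch below uses Python's decimal module to avoid floating-point rounding noise; the variable names are illustrative only.

```python
from decimal import Decimal

# Execution accuracy figures quoted in the article, in percent.
human = Decimal("92.96")
gemini_sql2 = Decimal("80.04")
prior_record = Decimal("76.13")  # Google Cloud's record reported November 15, 2025

gap_to_human = human - gemini_sql2            # points still separating the model from humans
gain_over_prior = gemini_sql2 - prior_record  # points gained over the prior record

print(gap_to_human, gain_over_prior)  # 12.92 3.91
```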

Leaderboard Standings

BIRD Text-to-SQL Leaderboard (Single Model Track)
System	Organization	Execution Accuracy	Date
Gemini-SQL2	Google	80.04%	June 2026
Gemini-SQL	Google	~77.2%	March 2026
Q-SQL	AWS	~76.5%	December 2025
Databricks RLVR 32B	Databricks	~75.7%	July 2025
SiriusAI-Text2SQL-32B-v2	Tencent	~75.0%	December 2025
Arctic-Text2SQL-R1-32B	Snowflake	~73.9%	June 2025
GPT-5.5-xhigh	OpenAI	~72.5%	April 2026
SQLWeaver-32B	Alibaba	~71.7%	May 2026
Claude Opus 4.6	Anthropic	~70.1%	February 2026

Gemini-SQL2 is described as a text-to-SQL capability built on top of Gemini 3.1 Pro, rather than a standalone foundation model release. The system translates natural language questions into what Google calls execution-ready SQL queries. The emphasis on execution matters because it is the test of whether generated SQL performs as intended, even for complex queries involving joins, window logic, and date arithmetic.

Enhanced schema understanding, identified by Google’s November 2025 research as one of the most challenging aspects of text-to-SQL, is a key driver behind Gemini-SQL2’s higher BIRD scores. This improved understanding helps the system handle ambiguous columns and messy data values more reliably.

Where This Could Be Used

The advancements behind Gemini-SQL2 open up several practical use cases. Self-service analytics could see a major shift, letting revenue managers query complex data, such as monthly recurring revenue by region for churned accounts, directly using natural language instead of writing SQL by hand.
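As an illustration of what such a question might translate to, the sketch below runs one plausible target query against a toy SQLite database. The schema, the data, and the candidate SQL are all hypothetical; this is not Gemini-SQL2 output, and a real warehouse would define "churned" and "monthly recurring revenue" in its own way.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, region TEXT, status TEXT);
CREATE TABLE subscriptions (account_id INTEGER, mrr_usd REAL);
INSERT INTO accounts VALUES
    (1, 'EMEA', 'churned'), (2, 'EMEA', 'active'),
    (3, 'APAC', 'churned'), (4, 'EMEA', 'churned');
INSERT INTO subscriptions VALUES (1, 120.0), (2, 300.0), (3, 80.0), (4, 30.0);
""")

question = "What is monthly recurring revenue by region for churned accounts?"

# One plausible translation of the question, written by hand for this example.
candidate_sql = """
SELECT a.region, SUM(s.mrr_usd) AS churned_mrr_usd
FROM accounts AS a
JOIN subscriptions AS s ON s.account_id = a.account_id
WHERE a.status = 'churned'
GROUP BY a.region
ORDER BY a.region
"""

rows = conn.execute(candidate_sql).fetchall()
print(rows)  # [('APAC', 80.0), ('EMEA', 150.0)]
```

Even in this small case the translation has to pick the right join key and decide which column encodes churn, which is the kind of schema understanding the benchmark is designed to stress.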

Data engineering workflows stand to benefit too. Developers could draft BigQuery transformations from plain English prompts, cutting down the time spent writing complex SQL from scratch. The capability could also strengthen embedded “ask your data” features in SaaS platforms.

An 80% accuracy score still leaves room for error: roughly one in five generated queries on BIRD-style questions could be incorrect, so human review remains necessary. That sets realistic expectations. This is AI-assisted query generation, not full automation without oversight, though it significantly lowers the barrier for non-technical users working with large datasets.
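One way a review step could look in practice is a dry run that executes the generated SQL with writes disabled and hands a small preview to a human. The sketch below uses SQLite's PRAGMA query_only for this; the dry_run helper and the toy table are hypothetical, and nothing here reflects how Google's products handle review.

```python
import sqlite3


def dry_run(conn, sql, max_rows=5):
    """Run generated SQL with writes disabled; return (ok, preview_rows_or_error)."""
    conn.execute("PRAGMA query_only = ON")  # SQLite rejects writes while this is on
    try:
        cursor = conn.execute(sql)
        return True, cursor.fetchmany(max_rows)
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.execute("PRAGMA query_only = OFF")


# Hypothetical toy database; autocommit mode keeps transaction handling simple.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE accounts (id INTEGER, region TEXT);
INSERT INTO accounts VALUES (1, 'EMEA'), (2, 'APAC'), (3, 'EMEA');
""")

ok, preview = dry_run(conn, "SELECT COUNT(*) FROM accounts")
print(ok, preview)  # True [(3,)]

# A destructive statement is rejected instead of being executed.
blocked_ok, blocked_error = dry_run(conn, "DELETE FROM accounts")
print(blocked_ok)  # False
```

A dry run only shows that a query executes safely. Whether the rows actually answer the question still has to be judged by a person who knows the data.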

Despite the strong benchmark result, Google has not yet published a Gemini-SQL2 model string or API, and specific product integration details remain unconfirmed. The current implementation pattern relies on existing Gemini models like gemini-3.1-pro-preview.

As the technology matures, public availability and integration into Google’s existing data services, such as BigQuery Studio and Cloud SQL Studio, will be the key signals to watch for broader impact.
