Google Expands Gemini API with Multimodal File Search
Google has expanded the Gemini API’s File Search tool to support multimodal data, enabling developers to build retrieval-augmented generation (RAG) systems that process images and text simultaneously. The update introduces custom metadata filtering and page citations, allowing applications to organize unstructured data more efficiently and provide verifiable, grounded responses.

Processing Images and Text in a Single RAG System

File Search now leverages the Gemini Embedding 2 model to understand native image data alongside text. This multimodal capability gives applications contextual awareness across visual and textual content without requiring separate indexing pipelines.

For example, a creative agency searching an asset archive no longer depends on keywords or filenames. Instead, developers can build applications that retrieve images matching a specific emotional tone or visual style described in natural language briefs. This approach reduces friction in workflows where traditional keyword-based search falls short.
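The idea of one index serving both images and text can be illustrated with a toy example. The sketch below does not call the Gemini API. The `IndexedItem` type, the file names, and the three-dimensional vectors are all hypothetical, hand-written stand-ins for what an embedding model would produce. It only shows why a shared vector space lets a single natural-language query rank images and documents together.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexedItem:
    name: str
    modality: str               # "image" or "text"
    vector: tuple[float, ...]   # hand-written toy embedding, NOT real model output


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vector, top_k=2):
    """Rank every item, regardless of modality, by similarity to the query."""
    ranked = sorted(index, key=lambda item: cosine(item.vector, query_vector), reverse=True)
    return ranked[:top_k]


# Hypothetical archive: images and text live in the same index.
INDEX = [
    IndexedItem("sunset_beach.jpg", "image", (0.9, 0.1, 0.0)),
    IndexedItem("brand_guidelines.txt", "text", (0.1, 0.9, 0.1)),
    IndexedItem("warm_tones_brief.txt", "text", (0.8, 0.2, 0.1)),
    IndexedItem("office_headshot.jpg", "image", (0.0, 0.2, 0.9)),
]

# Stand-in for an embedded brief such as "warm, nostalgic, golden hour".
query = (1.0, 0.0, 0.0)
results = search(INDEX, query)
result_names = [item.name for item in results]
result_modalities = {item.modality for item in results}
```

In this toy archive the top two matches are one image and one text file, which is the behavior a separate keyword index per media type cannot provide.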

Filtering Noise with Custom Metadata Labels

Storing files at scale is straightforward, but retrieving the right document from thousands of unstructured entries remains challenging. Custom metadata allows developers to attach key-value labels to files, such as department: Legal or status: Final.

At query time, applications can apply metadata filters to scope a request to a specific slice of the data. This cuts irrelevant results and improves both the speed and accuracy of RAG workflows. Because filtering narrows the candidate set before the model sees any retrieved content, retrieval stays efficient for production systems serving thousands of users.
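As a rough illustration of scoping by label, the sketch below filters a small in-memory file list on exact key-value matches. It is not the File Search API: `StoredFile`, `matches`, and `scope` are hypothetical names, and the real tool's filter syntax may differ. The labels reuse the article's `department` and `status` examples.

```python
from dataclasses import dataclass, field


@dataclass
class StoredFile:
    name: str
    metadata: dict[str, str] = field(default_factory=dict)


def matches(file, required):
    """True when the file carries every required key with the required value."""
    return all(file.metadata.get(key) == value for key, value in required.items())


def scope(files, required):
    """Keep only the files whose labels satisfy the filter."""
    return [f for f in files if matches(f, required)]


# Hypothetical store contents.
FILES = [
    StoredFile("nda_v3.pdf", {"department": "Legal", "status": "Final"}),
    StoredFile("nda_v2.pdf", {"department": "Legal", "status": "Draft"}),
    StoredFile("q3_campaign.pdf", {"department": "Marketing", "status": "Final"}),
]

# Only final Legal documents remain as retrieval candidates.
scoped = scope(FILES, {"department": "Legal", "status": "Final"})
scoped_names = [f.name for f in scoped]
```

Running the filter first means any later similarity ranking only has to consider the files that survived it, which is where the efficiency gain comes from.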

Establishing Trust with Page Citations

When applications pull answers from large documents like PDFs, users need to verify the source. File Search now ties model responses directly to original sources by capturing the page number for every indexed piece of information.

This granular citation enables developers to point users to the exact location where an answer originated. The feature strengthens user trust and makes applications immediately useful for fact-checking workflows where accountability matters.
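One way an application might surface page-level grounding is to collapse the retrieved passages into a short source list. The sketch below assumes each retrieved passage arrives with a source name and a page number. The `Chunk` type, the `format_citations` helper, and the sample passages are hypothetical and do not reflect the shape of the actual API response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source: str   # file the passage came from
    page: int     # page number recorded at indexing time
    text: str


def format_citations(chunks):
    """Group supporting passages by source and list their pages in order."""
    pages_by_source = {}
    for chunk in chunks:
        pages_by_source.setdefault(chunk.source, set()).add(chunk.page)
    return [
        f"{source}, p. {', '.join(str(p) for p in sorted(pages))}"
        for source, pages in pages_by_source.items()
    ]


# Hypothetical passages that supported a single answer.
supporting = [
    Chunk("handbook.pdf", 12, "Remote work requires manager approval."),
    Chunk("handbook.pdf", 4, "Policies apply to all full-time staff."),
    Chunk("policy.pdf", 2, "Approvals are reviewed quarterly."),
]

citations = format_citations(supporting)
```

Grouping by source keeps the citation list compact when several passages come from the same document, while still giving readers the exact pages to check.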

Getting Started with the Updated File Search Tool

Developers can begin integrating multimodal File Search into their applications immediately. Google has published a developer guide and updated the Gemini API documentation with implementation details and code examples.

The tool handles infrastructure complexity, allowing developers to focus on product features rather than managing embedding pipelines or search indexing. Whether developers are prototyping a weekend project or scaling to production, File Search abstracts away the backend overhead.

What This Means for RAG Applications

The combination of multimodal processing, custom metadata filtering, and page citations addresses three core pain points in retrieval-augmented generation: handling diverse data types, managing noise in large datasets, and establishing verifiable grounding. Developers building enterprise search tools, knowledge management systems, or AI agents that reason over mixed-media archives now have a single API surface for these capabilities.
