Ask a Techspert: How does AI understand my visual searches? - Google

What powers these types of visual search responses?

Our advanced Gemini models make AI Mode possible, and its multimodal capabilities benefit from the visual expertise we’ve built into Lens over the years. When you search with an image, Gemini analyzes the image alongside your question to decide which tools to use. Let’s say you’re scrolling on your phone and see an outfit on social media that you love. When you search with that image, the model knows to use Lens to retrieve image results for the hat, shoes and jacket in the outfit simultaneously. It then weaves those individual results into one easy-to-read response.

Think of it this way: The AI model acts as the “brain” that can “see” the image, while the visual search backend acts as the “library” containing billions of web results. The AI performs multi-object reasoning to understand what you’re looking at. Then it uses a “fan-out” technique, which triggers multiple searches at once, reads through the results and presents a single, cohesive response with helpful links, all in seconds.

Can you explain the fan-out technique?

AI Mode is basically doing a dozen searches for you in the time it takes to do one. If you upload a photo of a garden you admire, you might have several questions: Will these plants survive in the shade? Are they right for my climate? How much maintenance do they need?

Before, you’d ask those one by one. Now, AI Mode identifies all of the necessary “fan-out” searches and runs them for you. This way, it gathers care requirements for every plant in the photo using helpful web results, breaks down the info and even suggests next steps you might want to take. Since AI Mode is uncovering more visual results from a single search, it’s easier than ever to find just what you’re looking for, and to stumble upon something new that sparks your interest.
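For readers who like to see the idea in code, here is a minimal sketch of the fan-out pattern in Python. It is an illustration only, not how AI Mode is actually built: the `search` function is a hypothetical stand-in that returns canned text, and the names `plan_queries`, `fan_out`, `summarize` and `run` are made up for this example. The point is the shape: build one query per plant and question, run them all concurrently, then fold the results into a single response.

```python
import asyncio


async def search(query: str) -> dict:
    """Hypothetical search backend; a real system would call a service here."""
    await asyncio.sleep(0.01)  # simulate a little network latency
    return {"query": query, "result": f"top result for '{query}'"}


def plan_queries(objects: list[str], questions: list[str]) -> list[str]:
    """Build one query for every (object, question) pair."""
    return [f"{obj} {question}" for obj in objects for question in questions]


async def fan_out(objects: list[str], questions: list[str]) -> list[dict]:
    """Run every planned search concurrently instead of one by one."""
    queries = plan_queries(objects, questions)
    # asyncio.gather returns results in the same order as the queries.
    results = await asyncio.gather(*(search(q) for q in queries))
    return list(results)


def summarize(results: list[dict]) -> str:
    """Fold the individual results into a single readable response."""
    return "\n".join(f"- {r['query']}: {r['result']}" for r in results)


def run(objects: list[str], questions: list[str]) -> str:
    """Convenience wrapper: fan out, then summarize."""
    return summarize(asyncio.run(fan_out(objects, questions)))


if __name__ == "__main__":
    # Two plants spotted in a garden photo, two questions about each.
    print(run(["hosta", "fern"], ["shade tolerance", "maintenance"]))
```

Because the searches run concurrently, the total wait is roughly the time of the slowest single search rather than the sum of all of them, which is the intuition behind "a dozen searches in the time it takes to do one."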

Do you have to start with an image to get this kind of help in AI Mode?

Not at all! You can start with a simple text search in AI Mode, like “visual inspo for work outfits.” When you see a result you like, you can just say, “Show me more options like the second skirt.” The system immediately takes that specific image and begins the fan-out process from there.

It definitely seems great for shopping — what else could you use it for?

You could take a photo of a wall at a museum and ask for explanations of each painting. Or take a photo of a bakery window and ask what all the different pastries are. It’s about moving from “What is this one thing?” to “Explain this entire scene to me.”

Sounds like I’ve got some photos to take and a lot more to discover. I’m off to put these tools to the test!

