Demis Hassabis, the Co-Founder and CEO of Google DeepMind, shared the following post on X: “One of the features I’ve most wanted in AI Studio for a long time! Just paste a YouTube link into the command line and ask Gemini 2.0 questions about it – its multimodal understanding is kind of mind blowing.”
This is one more capability in the impressive suite of Gemini features that Google launched last week.
What does this capability do?
Google Gemini can now talk to YouTube videos
You can paste a YouTube link into Gemini and ask it any question about the video.
But what’s new here? You could already do that using NotebookLM.
The secret lies in Demis’s mention of ‘multimodal understanding’: Gemini analyzes the video itself (frames and audio), not just its text transcript.
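For developers, the same capability is exposed through the Gemini API. Below is a minimal sketch using the google-genai Python SDK, which accepts a YouTube URL as a file_data part; the API key, model name, and video URL here are placeholders, not values from the demo.

```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # or set GOOGLE_API_KEY in the environment

# Hypothetical link -- substitute any public YouTube video.
video_url = "https://www.youtube.com/watch?v=VIDEO_ID"

response = client.models.generate_content(
    model="gemini-2.0-flash",  # assumed; any Gemini 2.0 model with video support
    contents=types.Content(
        parts=[
            # The video itself is passed as input -- no transcript needed.
            types.Part(file_data=types.FileData(file_uri=video_url)),
            types.Part(text="What is the speaker wearing in this video?"),
        ]
    ),
)
print(response.text)
```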
What does this mean in practice?
In the attached demo, we pasted the link to Roger Federer’s famous commencement address at Dartmouth. To test its multimodal capabilities, we asked the video: ‘What is Roger Federer wearing in this video?’
It correctly responded: “In this video, Roger Federer is wearing an academic robe, with a white and black collar over a black gown. The gown has a zipper up the front. The gown and collar are typical for graduation ceremonies.”
Then we asked it to summarize the video, which it did. This capability is still experimental (and therefore slow for now) and is currently accessible via Google AI Studio.
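To mirror that follow-up pattern programmatically, a chat session keeps the video in context across turns. This is a sketch under the same assumptions as above (it also assumes the SDK’s chat interface accepts the same parts):

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # or set GOOGLE_API_KEY in the environment
video_url = "https://www.youtube.com/watch?v=VIDEO_ID"  # hypothetical link

# A chat session keeps the video in context for follow-up questions.
chat = client.chats.create(model="gemini-2.0-flash")  # assumed model name

first = chat.send_message([
    types.Part(file_data=types.FileData(file_uri=video_url)),
    types.Part(text="What is the speaker wearing in this video?"),
])
print(first.text)

# The follow-up refers back to the video already in the chat history.
print(chat.send_message("Now summarize the video.").text)
```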
Worth giving it a try. What particular use case of yours will this capability solve?