Google Gemini AI Tries Outsmarting ChatGPT Using Photos and Videos

The third major AI update is built into Google's Pixel 8 phones and the Bard AI chatbot, but wait until 2024 for the more powerful Gemini Ultra version.

Stephen Shankland, principal writer
6 min read

Google trains and runs its AI models on racks housing thousands of its TPU processors. Its Gemini model is more efficient than predecessors like PaLM 2, but it still consumes a lot of power.


Google has begun bringing an understanding of video, audio and photos to its Bard AI chatbot with a new AI model called Gemini. Google Pixel 8 phone owners will be among the first to tap into its new artificial intelligence abilities, but Gemini will come to Gmail and other Google Workspace tools in early 2024.

People in dozens of countries first got access to Gemini with a Bard chatbot update in early December, though only in English. It provides text-based chat abilities that Google says improve performance on complex tasks like summarizing documents, reasoning, planning and writing programming code. The bigger change, multimedia understanding -- for example, interpreting hand gestures in a video or figuring out the result of a child's dot-to-dot drawing puzzle -- will arrive "soon," Google said.

The new version spotlights the breakneck pace of advancement in generative AI, the field in which chatbots compose their own responses to prompts written in plain language rather than arcane programming instructions. Google's top competitor, OpenAI, stole a march with the launch of ChatGPT a year ago, but Gemini is Google's third major AI model revision, and the company expects to deliver the technology through products that billions of us use, like search, Chrome, Google Docs and Gmail.

On Wednesday, Google also brought Gemini to programmers, a key community of people who can incorporate the technology into their own software, through the basic Google AI Studio web interface or the more sophisticated Vertex AI. For usage beyond a free tier, Google cut prices by a factor of two to four. That could encourage developers enamored of OpenAI's programming interface to at least kick the tires on Gemini.
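As a rough illustration of that developer access, a text-only request to the Gemini Pro model through the REST API behind Google AI Studio looked something like the sketch below at launch. The v1beta endpoint path, model name and payload fields are drawn from Google's documentation of that era and may have changed since; the key is a placeholder, and no request is actually sent here.

```python
import json

# Sketch of a single-turn text request to Gemini Pro via the Generative
# Language REST API (the API behind Google AI Studio). The v1beta path,
# model name and payload fields reflect the API at launch and may change.
# This builds the request but does not send it.
API_KEY = "YOUR_API_KEY"  # placeholder: obtain a key from Google AI Studio

ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/"
    f"models/gemini-pro:generateContent?key={API_KEY}"
)

def build_request(prompt: str) -> str:
    """Assemble the JSON body for a single-turn text prompt."""
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return json.dumps(body)

payload = build_request("Summarize this document in two sentences.")
# Send `payload` as an HTTP POST body to ENDPOINT with a
# Content-Type: application/json header to receive a completion.
```

Vertex AI exposes the same models through Google Cloud's authentication and tooling instead of a simple API key, which is why Google positions it as the more sophisticated of the two paths.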

By courting developers, Google is more likely to spread Gemini to the software tools those programmers build for you. Google is building Gemini into its own services as well, notably with the Duet AI assistant in Gmail, Google Docs, Meet and other parts of Google Workspace.

"Duet AI for workspace will move to Gemini in the very early part of 2024," said Thomas Kurian, chief executive of the Google Cloud division. That could help you turn a hand drawing of an airplane into a photorealistic version for a Google Slides presentation, for example, or in Google Meet it could help you better understand a videoconference that includes slides that aren't in your native language. "Gemini's multimodal understanding allows it to do much richer summaries of meetings," he said.

Gemini is a dramatic departure for AI. Text-based chat is important, but humans must process much richer information as we inhabit our three-dimensional, ever-changing world. And we respond with complex communication abilities, like speech and imagery, not just written words. Gemini is an attempt to come closer to our own fuller understanding of the world.

Gemini comes in three versions tailored for different levels of computing power, Google said:

  • Gemini Nano runs on mobile phones, with two varieties built for different levels of available memory. It'll power new features on Google's Pixel 8 phones, like summarizing conversations in the Recorder app or suggesting message replies in WhatsApp when you type with Google's Gboard keyboard.
  • Gemini Pro, tuned for fast responses, runs in Google's data centers and will power a new version of Bard, starting Wednesday.
  • Gemini Ultra, limited to a test group for now, will be available in a new Bard Advanced chatbot due in early 2024. Google declined to reveal pricing details, but expect to pay a premium for this top capability.

"For a long time we wanted to build a new generation of AI models inspired by the way people understand and interact with the world -- an AI that feels more like a helpful collaborator and less like a smart piece of software," said Eli Collins, a product vice president at Google's DeepMind division. "Gemini brings us a step closer to that vision."

OpenAI also supplies the brains behind Microsoft's Copilot AI technology, including the newer GPT-4 Turbo AI model that OpenAI released in November. Microsoft, like Google, has major products like Office and Windows to which it's adding AI features.

AI gets smarter, but it's not perfect

Multimedia understanding likely will be a big change compared with text when it arrives. But the fundamental problems of AI models, which are trained by recognizing patterns in vast quantities of real-world data, haven't changed. They can turn increasingly complex prompts into increasingly sophisticated responses, but you still can't trust that an answer is actually correct rather than merely plausible. As Google's chatbot warns when you use it, "Bard may display inaccurate info, including about people, so double-check its responses."

Gemini is the next generation of Google's large language model, a sequel to the PaLM and PaLM 2 models that have been the foundation of Bard so far. But because Google trained Gemini simultaneously on text, programming code, images, audio and video, it copes with multimedia input more efficiently than separate but interlinked AI models for each mode of input would.

Examples of Gemini's abilities, according to a Google research paper (PDF), are diverse.

Looking at a series of shapes consisting of a triangle, square and pentagon, it can correctly guess the next shape in the series is a hexagon. Presented with photos of the moon and a hand holding a golf ball and asked to find the link, it correctly points out that Apollo astronauts hit two golf balls on the moon in 1971. It converted four bar charts showing country-by-country waste disposal techniques into a labeled table and spotted an outlying data point, namely that the US throws a lot more plastic in the dump than other regions.
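The shape puzzle reduces to counting sides: 3, 4, 5, so the next figure has 6. The toy sketch below captures only that underlying arithmetic; it is plainly not how Gemini reasons over pixels, just the pattern the model has to recover from the images.

```python
# Toy version of the shape-series puzzle: triangle (3 sides), square (4),
# pentagon (5), so the next shape in the sequence has 6 sides, a hexagon.
# This is the bare arithmetic of the puzzle, not Gemini's visual reasoning.
SIDES = {"triangle": 3, "square": 4, "pentagon": 5}
NAMES = {3: "triangle", 4: "square", 5: "pentagon", 6: "hexagon"}

def next_shape(series):
    """Predict the next polygon by extending the side-count sequence."""
    counts = [SIDES[name] for name in series]
    step = counts[-1] - counts[-2]  # here: 1
    return NAMES[counts[-1] + step]

print(next_shape(["triangle", "square", "pentagon"]))  # hexagon
```

The impressive part of the demo isn't this arithmetic; it's that Gemini extracts the side counts from raw images before doing it.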

The company also showed Gemini processing a handwritten physics problem involving a simple sketch, figuring out where a student's error lay, and explaining a correction. A more involved demo video showed Gemini recognizing a blue duck, hand puppets, sleight-of-hand tricks and other videos. None of the demos were live, however, and it's not clear how often Gemini fumbles such challenges.

Was Google's Gemini video fake?

Google touted Gemini in a demonstration video purporting to show it recognizing hand gestures, following magic tricks and putting pictures of planets in order by their distance from the sun -- all from visual data. You should think of that as a dramatization of Gemini's true abilities, however.

It's not uncommon for promotional videos to make products look more glamorous than they truly are. In this case, you might think Gemini was processing video input and spoken instructions in real time. Google included some fine print: a disclaimer in the video noting that Gemini doesn't actually respond that quickly, and a link in the video description to a discussion of how the demo actually worked. You might not have noticed either, though. Google also followed up with a post on X, formerly Twitter, showing how fast Gemini actually does respond.

Still, the video doesn't fundamentally misrepresent Gemini's abilities, though outsiders haven't generally been able to test it. It can accept spoken and video input.

Gemini Ultra coming in 2024

Gemini Ultra awaits further testing before appearing next year.

"Red teaming," in which a product-maker enlists people to find security vulnerabilities and other problems, is underway for Gemini Ultra. Such tests are more complicated with multimedia input data. For example, a text message and photo could each be innocuous on their own, but when paired could convey dramatically different meaning.

"We're approaching this work boldly and responsibly," Google CEO Sundar Pichai said in a blog post. That means a combination of ambitious research with big potential payoffs, but also adding safeguards and working collaboratively with governments and others "to address risks as AI becomes more capable."

Editors' note: CNET is using an AI engine to help create some stories. For more, see this post.