Google

Google hits pause on Gemini’s people pictures

After a diversity initiative proved disastrous, it’s time to loosen chatbot guardrails

Casey Newton

Feb 22, 2024 — 11 min read

(Rafael Henrique / Getty Images)

Today let’s talk about how Google’s effort to bring diversity to AI-generated images backfired, and consider how tech platforms’ efforts to avoid public-relations crises around chatbot outputs is undermining trust in their products.

This week, after a series of posts on the subject drew attention on X, Stratechery noted that Google’s Gemini chatbot seemed to refuse all attempts to generate images of white men. The refusals, which came as a result of the bot re-writing user prompts in an effort to bring more diversity into the images it generates, led to some deeply ahistorical howlers: racially diverse Nazis, for example.

The story has proven to be catnip for right-leaning culture warriors, who at last had concrete evidence to support the idea that Big Tech was out to get them. (Or at least, would not generate a picture of them.) “History messin’,” offered the conservative New York Post, in a cover story.

“Joe Biden’s America,” was the caption on House Republicans’ quote-post of the Post. It seems likely that the subject will eventually come up at a hearing in Congress.

With tensions running high, on Thursday morning Google said it had paused Gemini’s ability to generate images. “We're already working to address recent issues with Gemini's image generation feature,” a spokesman told me today over email. “While we do this, we're going to pause the image generation of people and will re-release an improved version soon.”

Gemini is not the first chatbot to rewrite user prompts in an effort to make them more diverse. When OpenAI released DALL-E 3 as part of ChatGPT, I noted that the bot was editing my prompts to add gender and racial diversity.

There are legitimate reasons to edit prompts in this way. Gemini is designed to serve a global audience, and many users may prefer a more inclusive set of images. Given that bots often don’t know who is using them, or what genders or races users expect to see in the images that are being requested, tweaking the prompt to offer several different choices increases the chances that the user will be satisfied the first time.

As Google’s communications team put it Wednesday on X: “Gemini’s Al image generation does generate a wide range of people. And that’s generally a good thing because people around the world use it. But it’s missing the mark here.” (The response strongly suggests to me that Gemini’s refusal to depict white people was a bug rather than a policy choice; hopefully Google will say this explicitly in the days to come.)

But there’s another reason chatbots are rewriting prompts this way: platforms are attempting to compensate for the lack of diversity in their training data, which generally reflects the biases of the culture that created them.

As John Hermann wrote today in New York:

This sort of issue is a major problem for all LLM-based tools and one that image generators exemplify in a visceral way: They’re basically interfaces for summoning stereotypes to create approximate visual responses to text prompts. Midjourney, an early AI image generator, was initially trained on a great deal of digital art (including fan art and hobbyist illustrations) scraped from the web, which meant that generic requests would produce extremely specific results: If you asked it for an image of a “woman” without further description, it would tend to produce a pouting, slightly sexualized, fairly consistent portrait of a white person. Similarly, requests to portray people who weren’t as widely represented in the training data would often result in wildly stereotypical and cartoonish outputs. As tools for producing images of or about the world, in other words, these generators were obviously deficient in a bunch of different ways that were hard to solve and awkward to acknowledge in full, though OpenAI and Google have occasionally tried. To attempt to mitigate this in the meantime — and to avoid bad PR — Google decided to nudge its models around at the prompt level.

The platforms’ solution to an interface that summons stereotypes, then, is to edit their outputs to be less stereotypical.

But as Google learned the hard way this week, race is not always a question of stereotypes. Sometimes it’s a question of history. A chatbot that is asked for images of German soldiers from the 1930s and generates not a single white one is making an obvious factual error.

And so, in trying to dodge one set of PR problems, Google promptly (!) found itself dealing with another.

I asked Dave Willner, an expert in content moderation who most recently served as head of trust and safety at OpenAI, what he made of Gemini's diversity efforts.

“This intervention wasn’t exactly elegant, and much more nuance will be needed to both accurately depict history (with all of its warts) and to accurately depict the present (in which all CEOs are not white men and all airline stewards are not Asian women),” he told me. “It is solvable but it will be a lot of work, which the folks doing this at Google probably weren’t given resources to do correctly in the time available."

What kind of work might address the underlying issue?

One strategy would be to train large language models on more diverse data sets. Given platforms’ general aversion to paying for content, this seems unlikely. It is not, however, impossible: last fall, as an effort to seed its Firefly text-to-image generator with high-quality images, Adobe paid photographers to shoot thousands of photos that would form the basis for its model.

A second strategy, and one that undoubtedly will prove quite controversial inside tech platforms, is to loosen some of the guardrails on some of their LLMs. I’m on record as saying that I think chatbots are generally too censorious, at least for their adult users. Last year ChatGPT refused to show me an image it had made of a teddy bear. Chatbots also shun many questions about basic sexual health.

I understand platforms’ fear that they will be held responsible and even legally liable for any questionable output from their models. But just as we don’t blame Photoshop when someone uses the app to draw an offensive picture, nor should we always hold chatbots responsible when a user prompt generates an image that some people take exception to. Platforms could build more trust by being upfront about their models’ biases and limitations — starting with the biases and limitations of their training data — and remind people to use them responsibly.

The alternative to letting people use these bots how they want to is that people will seek worse alternatives. To use the worst example imaginable, here’s a story about the right-wing platform Gab launching 91 chatbots over the past month with explicit instructions to deny the Holocaust and promote the lie that climate change is a scam. The wide availability of open-source, “unsecured” AI models means that people will soon be able to generate almost any kind of media on their laptop. And while there are legitimate legal reasons to maintain strict content policies — the traditional platform immunity shield, Section 230, probably doesn’t apply to chatbots — the current hyper-restrictive content policies seem likely to trigger significant backlash from Congressional Republicans, among others.

A final, longer-term strategy is to personalize chatbots more to the user. Today, chatbots know almost nothing about you. Even if Google has some information about your location, gender, or other demographic characteristics, that data does not inform the answers that Gemini gives you. If you only want Gemini to generate pictures of a particular race, there’s no way to tell it that.

Soon there will be, though. Earlier this month OpenAI introduced a memory feature for ChatGPT, effectively carving out a portion of the chatbot’s context window to collect information about you that can inform its answers. ChatGPT currently knows my profession, the city I live in, my birthday, and other details that could help it tailor its answers to me over time. As features like this are adopted by other chatbots, and as their memories improve, I suspect they will gradually show users more of what they want to see and less that they don’t. (For better and for worse!)

By now it’s a hack trope to quote a chatbot in a column like this, and yet I couldn’t resist asking Gemini itself why chatbots modify their prompts to introduce diversity into generated images. It gave a perfectly reasonable answer, reflecting many of the points above, and ended by offering a handful of “important considerations.”

“Sometimes specificity is important,” the bot noted, wisely. “It's okay for a prompt to be specific if it's relevant to the image's purpose.”

It also noted that the bot’s output should reflect the wishes of the person using it.

“The user should ultimately have the choice of whether to accept the chatbot's suggestions or not,” it said. And for once, a chatbot was giving advice worth listening to.

Cash management for startups

Crafting your startup’s cash management strategy? First, divide your funds into three categories: liquid, short-term, and strategic. From there, it’s about optimizing for risk and yield according to those time horizons. But how much should you put in each category?

Learn more about how to create the best cash management strategy for your startup.

On the podcast this week: Google DeepMind CEO Demis Hassabis joins us to discuss all those new AI models, his P(doom), and — yes — why Gemini rewrites your prompts to show you more diversity.