How Twitter got research right

While other tech giants hide from their internal researchers, Twitter is doing its failing — and fixing — in public

(Edgar Moran / Unsplash)

It has not been a happy time for researchers at big tech companies. Hired to help executives understand platforms’ shortcomings, research teams inevitably reveal inconvenient truths. Companies hire teams to build “responsible AI” but bristle when their employees discover algorithmic bias. They boast about the quality of their internal research but disavow it when it makes its way to the press. At Google, this story played out in the forced departure of ethical AI researcher Timnit Gebru and the subsequent fallout for her team. At Facebook, it led to Frances Haugen and the Facebook Files.

For these reasons, it’s always of note when a tech platform takes one of those unflattering findings and publishes it for the world to see. At the end of October, Twitter did just that. Here’s Dan Milmo in the Guardian:

Twitter has admitted it amplifies more tweets from right-wing politicians and news outlets than content from left-wing sources.

The social media platform examined tweets from elected officials in seven countries – the UK, US, Canada, France, Germany, Spain and Japan. It also studied whether political content from news organisations was amplified on Twitter, focusing primarily on US news sources such as Fox News, the New York Times and BuzzFeed. […]

The research found that in six out of seven countries, apart from Germany, tweets from right-wing politicians received more amplification from the algorithm than those from the left; right-leaning news organisations were more amplified than those on the left; and generally politicians’ tweets were more amplified by an algorithmic timeline than by the chronological timeline.

Twitter’s blog post on the subject was accompanied by a 27-page paper that further describes the study’s findings and methodology. It wasn’t the first time this year that the company had volunteered empirical support for years-old, speculative criticism of its work. This summer, Twitter hosted an open competition to find bias in its photo-cropping algorithms. James Vincent described the results at The Verge:

The top-placed entry showed that Twitter’s cropping algorithm favors faces that are “slim, young, of light or warm skin color and smooth skin texture, and with stereotypically feminine facial traits.” The second and third-placed entries showed that the system was biased against people with white or grey hair, suggesting age discrimination, and favors English over Arabic script in images.

These results were not hidden in a closed chat group, never to be discussed. Instead, Rumman Chowdhury — who leads machine learning ethics and responsibility at Twitter — presented them publicly at DEF CON, and praised participants for helping to illustrate the real-world effects of algorithmic bias. The winners were paid for their contributions.

On one hand, I don’t want to overstate Twitter’s bravery here. The results the company published, while opening it up to some criticism, are not the kind of findings that will trigger a full Congressional investigation. And the fact that the company is much smaller than Google or Facebook parent Meta, which both serve billions of people, means that anything its researchers find is less likely to set off a global firestorm.

At the same time, Twitter doesn’t have to do this kind of public-interest work. And in the long run, I do believe it will make the company stronger and more valuable. But it would be relatively easy for any company executive or board member to make a case against doing it.

For that reason, I’ve been eager to talk to the team responsible for it. This week, I met virtually with Chowdhury and Jutta Williams, product lead for Chowdhury’s team. (Inconveniently, given Facebook’s October 28 rebrand to Meta, the Twitter team’s official name is Machine Learning Ethics, Transparency, and Accountability: META.) I wanted to know more about how Twitter is doing this work, how it has been received internally, and where it’s going next.

Here’s some of what I learned.

Twitter is betting that public participation will accelerate and improve its findings. One of the more unusual aspects of Twitter’s AI ethics research is that it is paying outside volunteer researchers to participate. Chowdhury was trained as an ethical hacker, and observed that her friends working in cybersecurity are often able to protect systems more nimbly by creating financial incentives for people to help.

“Twitter was the first time that I was actually able to work at an organization that was visible and impactful enough to do this, and also ambitious enough to fund it,” said Chowdhury, who joined the company a year ago when it acquired her AI risk management startup. “It's hard to find that.”

It’s typically difficult to get good feedback from the public about algorithmic bias, Chowdhury told me. Often, only the loudest voices are addressed, while major problems are left to linger because affected groups don’t have contacts at platforms who can address them. Other times, issues are diffused throughout the population, and individual users may not feel the negative effects directly. (Privacy tends to be an issue like that.)

Twitter’s bias bounty helped the company build a system to solicit and implement that feedback, Chowdhury told me. The company has since announced it will stop cropping photos in previews, after its algorithms were found to largely favor the young, white, and beautiful.

Responsible AI is hard in part because no one fully understands decisions made by algorithms. Ranking algorithms in social feeds are probabilistic — they show you things based on how likely you are to like, share, or comment on them. But there’s no single algorithm making that decision — it’s typically a mesh of multiple models, sometimes dozens, each making guesses that are then weighted differently according to ever-shifting factors.
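
To make that concrete, here is a minimal, hypothetical sketch (not Twitter's actual code) of how several probabilistic engagement models might be blended into a single ranking score. The model outputs, weights, and example tweets are all invented for illustration:

```python
# Hypothetical sketch of a "mesh of models" ranker. Nothing here is Twitter's
# real code; the weights and predicted probabilities are invented.

from dataclasses import dataclass


@dataclass
class Tweet:
    text: str
    p_like: float     # one model's guess: probability the viewer likes it
    p_reply: float    # another model's guess: probability of a reply
    p_retweet: float  # another model's guess: probability of a retweet


# Illustrative weights; in a real system these shift constantly.
WEIGHTS = {"p_like": 0.5, "p_reply": 1.5, "p_retweet": 1.0}


def score(tweet: Tweet) -> float:
    """Blend the per-model guesses into one ranking score."""
    return (
        WEIGHTS["p_like"] * tweet.p_like
        + WEIGHTS["p_reply"] * tweet.p_reply
        + WEIGHTS["p_retweet"] * tweet.p_retweet
    )


timeline = [
    Tweet("calm policy thread", p_like=0.20, p_reply=0.02, p_retweet=0.05),
    Tweet("provocative hot take", p_like=0.15, p_reply=0.12, p_retweet=0.08),
]

# Highest score first: the hot take wins because replies carry more weight,
# even though no single model or engineer decided to favor provocation.
for t in sorted(timeline, key=score, reverse=True):
    print(f"{score(t):.3f}  {t.text}")
```

Even in this toy version, the outcome (provocative content rising) is an emergent property of how the weights interact, which is exactly the kind of effect that is easy to produce and hard to trace.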

That’s a major reason why it’s so difficult to confidently build AI systems that are “responsible” — there is simply a lot of guesswork involved. Chowdhury pointed out the difference here between working on responsible AI and cybersecurity. In security, she said, it’s usually possible to unwind why the system is vulnerable, so long as you can discover where the attacker entered it. But in responsible AI, finding a problem often doesn’t tell you much about what created it.

That’s the case with the company’s research on amplifying right-wing voices, for example. Twitter is confident that the phenomenon is real, but can only theorize about the reasons behind it. It may be something in the algorithm. But it might also be user behavior — maybe right-wing politicians tend to tweet in a way that elicits more comments, for example, which then causes their tweets to be weighted more heavily by Twitter’s systems.

“There’s this law of unintended consequences to large systems,” said Williams, who previously worked at Google and Facebook. “It could be so many different things. How we’ve weighted algorithmic recommendation may be a part of it. But it wasn’t intended to be a consequence of political affiliation. So there’s so much research to be done.”

There’s no real consensus on what ranking algorithms “should” do. Even if Twitter does solve the mystery of what’s causing right-wing content to spread more widely, it won’t be clear what the company should do about it. What if, for example, the answer lies not in the algorithm but in the behavior of certain accounts? If right-wing politicians simply generate more comments than left-wing politicians, there may not be an obvious intervention for Twitter to make.

“I don't think anybody wants us to be in the business of forcing some sort of social engineering of people's voices,” Chowdhury told me. “But also we all agree that we don't want amplification of negative content or toxic content, or unfair political bias. So these are all things that I would love for us to be unpacking.”

That conversation should be held publicly, she said.

Twitter thinks algorithms can be saved. One possible response to the idea that all our social feeds are unfathomably complex and cannot be explained by their creators is that we should shut them down and delete the code. Members of Congress now regularly introduce bills that would make ranking algorithms illegal, make platforms legally liable for what they recommend, or force platforms to let people opt out of ranking altogether.

Twitter’s team, for one, believes that ranking has a future.

“The algorithm is something that can be saved,” Williams said. “The algorithm needs to be understood. And the inputs to the algorithm need to be something that everybody can manage and control.”

With any luck, Twitter will build just that kind of system.

Of course, the risk in writing a piece like this is that, in my experience, teams like this are fragile. One minute, leadership is pleased with a team’s findings and hiring for it enthusiastically; the next, the team is withering through attrition amid budget cuts, or reorganized out of existence amid personality conflicts or regulatory concerns. Twitter’s early success with META is promising, but META’s long-term future is not assured.

In the meantime, the work is likely to get harder. Twitter is now actively at work on a project to decentralize its network, which could shield parts of that network from the company’s own efforts to build it more responsibly. Twitter CEO Jack Dorsey has also envisioned an “app store for social media algorithms,” giving users more choice over how their feeds are ranked.

It’s difficult enough to rank one feed responsibly; making a whole app store of algorithms “responsible” will be a much larger challenge.

“I’m not sure it’s feasible for us to jump right into a marketplace of algorithms,” Williams said. “But I do think it’s possible for our algorithm to understand signal that’s curated by you. So if there's profanity in a tweet, as an example: how sensitive are you to that kind of language? Are there specific words that you would consider very, very profane and you don't want to see? How do we give you controls for you to establish what your preferences are, so that that signal can be used in any kind of recommendation?

“I think that there’s a third party of signal more than there is a third party of a bunch of algorithms,” Williams said. “You have to be careful about what’s in an algorithm.”
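
As a thought experiment, here is a hypothetical sketch of what “signal that’s curated by you” could look like in practice: per-user sensitivity settings and word lists that feed into what the ranker is allowed to show. The class names, thresholds, and word list are all invented and do not reflect Twitter’s actual systems.

```python
# Hypothetical sketch of user-curated signal. The classifier is a crude
# stand-in and the word lists are placeholders, purely for illustration.

from typing import Optional, Set

FLAGGED_WORDS = {"damn", "hell"}  # placeholder, not a real profanity list


class UserPreferences:
    def __init__(self, profanity_sensitivity: float, blocked_words: Optional[Set[str]] = None):
        # 0.0 = show everything, 1.0 = hide anything flagged as profane
        self.profanity_sensitivity = profanity_sensitivity
        # words the user never wants to see, regardless of sensitivity
        self.blocked_words = blocked_words or set()


def profanity_score(text: str) -> float:
    """Crude stand-in for a real profanity classifier: share of flagged words."""
    words = text.lower().split()
    hits = sum(w in FLAGGED_WORDS for w in words)
    return min(1.0, hits / max(len(words), 1) * 5)


def allowed(text: str, prefs: UserPreferences) -> bool:
    """Return True if this tweet respects the user's curated preferences."""
    if any(w in text.lower().split() for w in prefs.blocked_words):
        return False  # hard "never show me this word" control
    return profanity_score(text) <= 1.0 - prefs.profanity_sensitivity


prefs = UserPreferences(profanity_sensitivity=0.8, blocked_words={"heck"})
print(allowed("what the heck is this", prefs))      # False: blocked word
print(allowed("a perfectly polite tweet", prefs))   # True
```

The appeal of this kind of design, as Williams frames it, is that the inputs stay legible and controllable by the person they affect, rather than being swapped out wholesale for someone else’s algorithm.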



Governing

A profile of ConstitutionDAO, a “financial flash mob” that came together over the course of one week to buy — and collectively govern the future of — the only privately held copy of the US Constitution. After the story was published, the amount contributed to the effort grew from $12.8 million to more than $40 million. Here’s Kevin Roose at the New York Times:

Group bids for big-ticket collectibles aren’t new. But ConstitutionDAO is a particularly quixotic example of what’s known as a “decentralized autonomous organization,” a kind of internet-native co-op that is governed with cryptocurrency tokens and blockchain-based “smart contracts” instead of traditional corporate boards and bylaws.

DAOs — which have been compared to chat rooms with bank accounts — can be messy and confusing. Some early experiments were derailed by hackers and governance disputes. But crypto advocates believe that they will become a popular form of organization in the coming years — a kind of leaderless online swarm that can pop up in an instant to build products, make investments or just tap into the zeitgeist.

Related: Because the Constitution is being purchased by a mob, it’s not clear who will actually go pick up the physical copy and transport it to a museum. Or which museum will receive it. (Omar Abdel-Baqui / Wall Street Journal)

A former Amazon chief information security officer says the company had a “free-for-all” when it came to internal access to customer information. In some cases, low-level employees were able to snoop on customer purchases. (Will Evans / Wired)

A critique of Facebook’s Widely Viewed Content report finds that it downplays highly partisan content by counting only the unique viewers of a link, rather than the number of times those links appear in viewers’ feeds. The latter approach shows right-wing publishers having much more success than Facebook’s published report suggests. (Corin Faife / The Markup)
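
For illustration, here is a toy example (with invented numbers) of the counting difference The Markup highlights: unique viewers of a link versus the total number of times the link appears in feeds.

```python
# Toy illustration, with invented data, of unique viewers vs. total appearances.
from collections import Counter

# (viewer, link) pairs; the same viewer can be shown the same link repeatedly
impressions = [
    ("alice", "partisan.example/story"),
    ("alice", "partisan.example/story"),
    ("alice", "partisan.example/story"),
    ("bob", "neutral.example/story"),
    ("carol", "neutral.example/story"),
]

links = {link for _, link in impressions}
unique_viewers = {link: len({v for v, l in impressions if l == link}) for link in links}
total_appearances = Counter(link for _, link in impressions)

print(unique_viewers)     # partisan story: 1 viewer; neutral story: 2 viewers
print(total_appearances)  # partisan story: 3 showings; neutral story: 2 showings
```

Counted by unique viewers, the partisan link looks smaller; counted by appearances, it dominates — which is the gap the critique points at.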

Google signed a five-year deal with Agence France-Presse to include its content in search results. Google was fined $566 million in July for failing to negotiate in good faith with French publishers. (Céline Le Prioux and Jules Bonnard / AFP)

A coalition of eight state attorneys general will investigate how Instagram attracts and affects young people. AGs are looking into “the techniques utilized by Meta to increase the frequency and duration of engagement by young users and the resulting harms caused by such extended engagement.” (Jeff Horwitz and Georgia Wells / Wall Street Journal)

Related: A look at the available research on the mental health effects of Instagram in the global south. Leaked internal research from Facebook this year largely focused on the United States, pointing to the need for further study. (Nilesh Christopher and Andrew Deck / Rest of World)


Industry

OpenAI is eliminating the waiting list to begin using its GPT-3 natural language processing technology. The model is eerily good at mimicking human writing, and poses serious safety concerns. Here’s Bryan Walsh at Axios:

The ability to generate and tune human-like text at mass scale carries clear risks for misuse, especially in disinformation campaigns, though Welinder notes that the rate of misuse identified by OpenAI so far has "been really, really low." […]

OpenAI has also updated its community guidelines, which ban hate, content that attempts to influence the political process, and all adult content excluding sex education or wellness.

S&P Dow Jones Indices launched the S&P 500 Twitter Sentiment Index to measure the performance of the top 200 companies within the S&P 500 that have "the highest sentiment scores," according to real-time sentiment analysis of tweets. Find the next GameStonks here? (Sheila Dang / Reuters)

Instagram is shutting down its Threads messaging app. Threads had promise, but installing it absolutely borked the notifications for my Instagram DMs forever, even after I uninstalled it. (Monica Chin / The Verge)

NFT trading platform OpenSea could be valued at as much as $10 billion in a new funding round. Just four months ago, the company raised $100 million at a $1.5 billion valuation. (Kate Clark and Berber Jin / The Information)


Talk to me

Send me tips, comments, questions, and Twitter research: casey@platformer.news.