Can we keep the internet safe from bad actors?
Regulating information and trust online
“What makes the trust and safety space so difficult is that it is a perpetual state of gray areas.” - Tom Siegel, CEO of TrustLab
Trust and safety, in simple terms, means keeping the internet safe. The space has evolved significantly over the past few years and faces new threats from the growing use of AI. I’ve previously written about adjacent areas such as deepfake detection; this article draws on a conversation with Tom Siegel, CEO of TrustLab, who established and scaled the first Trust & Safety team at Google and has been an early pioneer of the space, watching it grow over the past two decades. It covers what trust and safety actually is (and what it isn’t), how companies are dealing with harmful content online, and new challenges in protecting the integrity of information.
What trust and safety is, and isn’t
When people talk about trust and safety, they tend to mention integrity, misinformation, incident response, ethics, and a soup of related synonyms.
When Tom founded and scaled the Trust & Safety team at Google in the early 2000s, the work was really about people who wanted to manipulate Google’s search algorithm. The team’s job was to protect users, company information, and revenue from spam, scammers, and other actors who targeted Google’s systems specifically. In other words, what we now know as trust and safety was largely handled by companies individually – there was little collaboration on, or contribution to, internet safety beyond the big platforms.
According to Tom, Trust & Safety is not about privacy, cybersecurity, AI safety, bias, or ethics. Instead, it’s simply about online safety, and identifying accounts and entities that seek to manipulate and deceive users on various platforms.
Trust and safety work involves a constellation of the following:
Content moderation (text, visuals, videos)
Responding quickly to sensitive issues and events
Safeguarding organizations’ reputations
Preventing risks resulting from harmful content
Addressing new threats: misinformation, bad actors, elections
Moderation of specific content types such as product listings, chat messages, and social media posts and profiles
Source: Online Content Moderation – Current Challenges in Detecting Hate Speech, EU Agency for Fundamental Rights
While fighting spam is still a problem, the issues and motivations of actors have changed dramatically. Due to the ease of generating and spreading content with AI, the types of problems that trust and safety teams are tackling are of a completely different nature. These teams have to constantly monitor the streams of content that are created and uploaded online, and take swift action against problematic accounts.
So, how do they do it?
The main actors are no longer only big platforms; governments and smaller platforms are now involved as well. Perhaps because threat profiles vary so much, there is still a lack of consensus on how exactly trust and safety is defined. As noted above, Tom emphasized that it is not about privacy, cybersecurity, or AI bias and ethics; in his view, “integrity” and “trust” have become overused and somewhat unclear, which is why he has adopted the term online safety. Viewed from this angle, trust and safety operations center on user-generated content, the degree to which that content and the accounts behind it are authentic, the entities that seek to manipulate and deceive, and efforts to monitor account behavior and content.
The role of synthetic media
One key question in content moderation, especially with the rise of synthetic media, is the increasingly blurred line between real and fake content. As multimodal generative models improve, creating more realistic (and potentially deceptive) content becomes easier. Detection is also largely a back-and-forth game: as generative models keep improving iteratively, technical detection methods have to stay one step ahead.
From the trust and safety perspective, Tom highlighted the distinction between copyright issues and harm issues. Trust and safety is mainly focused on the harm problem – these teams are concerned with fake content that could mislead or negatively impact individuals; other content is lower priority, even if it is synthetically generated. He also noted that trust and safety teams often rely on traditional, non-technical techniques, in part because detection models are easy to evade: following the release of a new detection model, bad actors can train their own models to bypass it. TrustLab, along with other companies in this space, therefore also uses secondary techniques, such as fact-checking and research, to evaluate potentially fake claims.
The role of generative AI in content moderation
As AI-generated content increases, response mechanisms are being re-evaluated. Even as the volume of content grows dramatically, trust and safety teams, according to Tom, still need to be laser-focused on instances of harm. This is where one of the main challenges comes in: many bad actors create “gray area” content that is difficult to classify. One promising step in this direction is the potential use of watermarks by big tech platforms for AI-generated content. Watermarks and other content provenance techniques can partly help in tracking the actors behind content and in distinguishing real media from altered or synthetic media. Still, such measures can be circumvented.
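To make the provenance idea a bit more concrete, here is a minimal sketch in Python of how a generator might sign content it produces and how a platform might later verify that signature. This is a toy simplification of standards such as C2PA, not an implementation of them; the key, function names, and manifest fields are illustrative assumptions.

```python
# Toy sketch of provenance metadata: a generator signs content with a key, and
# a verifier checks the signature. This simplifies standards like C2PA and is
# not an implementation of them; any re-encoding or metadata stripping defeats
# this naive scheme, which is the circumvention risk noted above.
import hmac, hashlib, json

SECRET_KEY = b"demo-key"  # in practice: per-generator signing keys / certificates

def attach_provenance(content: bytes, generator: str) -> dict:
    manifest = {"generator": generator, "digest": hashlib.sha256(content).hexdigest()}
    payload = json.dumps(manifest, sort_keys=True).encode()
    manifest["signature"] = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return manifest

def verify_provenance(content: bytes, manifest: dict) -> bool:
    claimed = {k: v for k, v in manifest.items() if k != "signature"}
    payload = json.dumps(claimed, sort_keys=True).encode()
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(expected, manifest["signature"])
            and claimed["digest"] == hashlib.sha256(content).hexdigest())

img = b"...synthetic image bytes..."
manifest = attach_provenance(img, generator="example-image-model")
print(verify_provenance(img, manifest))            # True: content and manifest intact
print(verify_provenance(img + b"edit", manifest))  # False: content was altered
```

Because the manifest travels alongside the content, simply re-encoding the media or stripping its metadata defeats this naive scheme, which is one concrete way such measures get circumvented.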
Additionally, given that trust and safety teams focus on high-risk areas of harm such as misinformation, hate speech, and sensitive content, many platforms still allow “lower levels” of harmful content. What counts as lower-level harm depends on platform-specific policies, which brings into question the efficacy of enforcement systems. For example, adversarial AI systems can iterate on outputs based on specific policies, or on how individual platforms enforce them, to create text, images, video, or audio that deceives users. Such systems could still penetrate the defenses and enforcement systems of larger platforms.
Challenges in trust and safety
Diverse languages and markets: One major challenge is making detection methods work across different languages and markets. For example, Tom noted that content moderation and misinformation detection are more difficult around local elections, because bad actors tend to spread fake content very quickly before voting occurs, yet companies lack adequate detection tools for those languages and geographies. Language representation across content moderation teams also varies based on priorities and focus. At X, for example, the teams in the EU focus on the languages with the highest representation, which means many cases in less-represented languages can go unnoticed.
AI detection solutions are not a catch-all: AI-driven detection solutions are on the rise, but they are unlikely to be the sole antidote to the challenges of trust and safety operations. “On the surface, new technologies such as generative AI look so easy,” Tom noted, but even if generative AI were used to detect content policy violations, there are many hidden problems that such models would not be able to take into consideration. Nuances in language, and the different ways that harm can be presented, might not be fully captured by automated systems.
Scaling operations for moderation: Surprisingly, many companies are still investing in large-scale human content moderation. This not only scales poorly but also results in long-term psychological impacts and precarious working conditions for moderators, who must review vast amounts of content in limited time – a task becoming harder because of the nature and volume of the content itself. While companies are experimenting with automation, difficulties include choosing the right model, integrating these tools into workflows, and navigating precision and recall tradeoffs in moderation. Still, this is an area of interest for companies of all sizes and industries that need content moderation, whether for monitoring, investigating, or taking action on potentially harmful content.
Full of gray areas: In addition to the challenge of defining trust and safety among ecosystem stakeholders, Tom mentioned that other key terms, such as “hate speech”, are ambiguously defined and even contested, given different perspectives. Changing severity levels and trends can also cause these definitions to evolve over time to reflect what companies are seeing on platforms in real time. Some issues, such as non-consensual imagery, may be easier for trust and safety teams to address. Other issues, however, that are deeply entwined in local politics and dynamics, could be too fraught for a classification algorithm to decide upon, potentially fueling further polarization.
Ecosystem-wide solutions: who gets to choose?
The trust and safety ecosystem involves an interconnected network of actors with different goals, focus areas, and approaches to content moderation. On the one hand, individual technology and social media companies have their own policies, enforcement systems, and technical methods for addressing and removing harmful content. According to Tom, it is difficult to ask these companies to take on responsibility for the entire problem. The open platforms they maintain reflect, in some ways, the complexities of societal issues and discourse in the real world, making it difficult to address all stakeholder needs and interests. Social media platforms are also motivated by their own objectives, which might not align completely with addressing broader, collective problems in society. Governments, meanwhile, have their own focus areas when it comes to content moderation and online safety, which might not be relevant for every issue the trust and safety ecosystem as a whole is working on. However, common threads important for all stakeholders include:
1. Regulations and legal guardrails: Various regulations are in force, particularly related to disinformation and online safety, while emerging guardrails focus on newer threats such as deepfakes, generative AI media, and youth safety online. Emerging legal frameworks have also addressed reporting obligations for harmful content, such as the EU’s Digital Services Act and the Code of Practice on Disinformation. The EU’s TERREG regulation in particular requires that human moderators and technological moderation solutions be deployed in a complementary way by trust and safety teams. Additionally, the UK’s Online Safety Act and California’s Age-Appropriate Design Code Act focus especially on protecting youth online. Canada’s proposed Online Algorithm Transparency Act, on the other hand, is an example of a broader piece of legislation that also encompasses transparency in content moderation.
2. Industry standards: According to Tom, industry players will also have a role to play in developing standards, similar to standardization efforts in other industries, such as movie ratings or media reliability ratings. Here, industry players would collaborate to set common standards and benchmarks for what types of information are acceptable on online platforms. At the same time, this raises questions about the power dynamics of the platforms themselves, some of which hold more financial power and technological influence than some countries. That prompts another important question: how do such dynamics shape efforts towards content authenticity? Finally, while there are a few trust and safety professional organizations, as well as industry bodies such as the Coalition for Content Provenance and Authenticity that have set out principles and standards for generative AI media, there are fewer examples on the content moderation side as a whole.
3. Community and third-party moderation: Community moderation has demonstrated successes and failures – and still faces the problem of scaling. Fact-checking organizations also seek to serve as authoritative moderation bodies, but similarly to human and community moderation, factors such as reliability, transparency, and scalability of such initiatives are also open questions.
Technical approaches to trust and safety
A variety of automated and AI-driven solutions are being used for content moderation (a simplified sketch of how these pieces can fit together follows the list below). They involve:
Multimodal models that identify harmful material across different types of media
Traditional techniques, namely classification, automatic labeling, keyword detection, OCR, image descriptions, embeddings, entity detection, and sentiment analysis
Newer methods, such as Retrieval-Augmented Generation (RAG) and knowledge distillation
Ensemble models, or multiple detection, classification, and control models that work in concert to identify, label, and take action on harmful content.
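As a rough illustration of how these pieces can fit together, below is a minimal sketch of an ensemble-style pipeline: a keyword rule and a placeholder learned classifier each score a piece of text, and the combined score is mapped to an action. The blocklist, weights, thresholds, and classifier stub are illustrative assumptions, not any vendor’s actual system.

```python
# Toy "ensemble" moderation pipeline: a keyword rule and a (placeholder)
# learned classifier each score a piece of text, and the combined score is
# mapped to an action. All thresholds, labels, and the classifier stub are
# illustrative assumptions.
import re
from dataclasses import dataclass

BLOCKLIST = re.compile(r"\b(scam|free money|click here)\b", re.IGNORECASE)  # toy keyword rule

def keyword_score(text: str) -> float:
    """Return 1.0 if a blocklisted phrase appears, else 0.0."""
    return 1.0 if BLOCKLIST.search(text) else 0.0

def model_score(text: str) -> float:
    """Placeholder for a trained toxicity/spam classifier returning P(harmful)."""
    # In practice this would call a fine-tuned text classifier or an LLM-based policy model.
    return 0.9 if "guaranteed returns" in text.lower() else 0.1

@dataclass
class Decision:
    action: str   # "allow" | "human_review" | "remove"
    score: float

def moderate(text: str, remove_at: float = 0.8, review_at: float = 0.5) -> Decision:
    # Weighted average of the two signals; weights are arbitrary for illustration.
    combined = 0.4 * keyword_score(text) + 0.6 * model_score(text)
    if combined >= remove_at:
        return Decision("remove", combined)
    if combined >= review_at:
        return Decision("human_review", combined)  # escalate gray-area content
    return Decision("allow", combined)

print(moderate("Click here for guaranteed returns!"))    # high combined score -> "remove"
print(moderate("Looking forward to the local election")) # low combined score -> "allow"
```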
The space overall is segmented into a few distinct groups. The first consists of content moderation platforms that serve a variety of industries and use cases, including social platforms, dating apps, marketplaces, videogame providers, corporate intelligence and reputational risk, and more. The second is tooling for trust and safety, such as companies building models for content classification, AI detection, and fraud detection, some of which have been acquired by content moderation platforms. More traditional players include BPO companies, which provide outsourced moderation teams, as well as hybrid platforms – companies that started out in the tooling space and then evolved into standalone moderation platforms. This article focuses mainly on the tooling and platform providers as a whole, since the lines between these groups are becoming blurrier over time.
Most companies in this space rely on automated and human-in-the-loop approaches, which can involve end-to-end processes for trust and safety teams as well as co-pilots and agents for human moderators (a minimal sketch of this escalation pattern appears after the examples below):
TrustLab’s Moderate AI uses a company’s content moderation policies to develop training materials for human and AI moderators, with tough cases escalated to human moderators for review. Moderation actions and feedback from this process also inform the continuous improvement of the company’s policies.
Hive has built a variety of models for visual and text detection, as well as proprietary training data sources consisting of millions of crowdsourced data points. Their models are tailored for different domains such as sports and marketing, copyright and logo detection, marketplaces, dating apps, online communities, and NFT platforms.
Cinder’s platform for trust and safety operations particularly focuses on consolidating content moderation tasks regardless of complexity and volume, in order to improve reliability and consistency across human and automated reviews.
Unitary also provides a hybrid solution, in which AI agents based on language and vision models evaluate content from a detailed perspective, considering context and multimodal information. These agents are trained by humans for cases that are more challenging and difficult to automate, and the entire process is supervised by a human quality assurance team.
Intrinsic is developing tools for trust and safety teams to better understand moderation decisions and provide explanations for these decisions. Additionally, they provide options for teams to fine-tune models on their own data, address complex workflows, and shorten the feedback loops between detection and system improvement.
Cove allows trust and safety teams without technical expertise to build custom content moderation models trained entirely on their own policies and data, with a focus on model optimization and performance.
Other companies in this space have been acquired by larger organizations or by customers in specific domains, such as SpectrumLabs (now part of ActiveFence) and Oterlu (acquired by Reddit). Larger organizations also have similar offerings, including Azure AI Content Safety, Amazon Rekognition, and Perspective, created by Google’s Jigsaw team.
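To give a concrete picture of the human-in-the-loop pattern these products share, here is a minimal sketch: confident automated decisions are enforced directly, low-confidence “gray area” items are escalated to a human queue, and human labels are logged so they can feed back into policy and model improvements. The classifier stub, thresholds, and labels are hypothetical.

```python
# Minimal human-in-the-loop escalation loop: confident automated decisions are
# applied directly, low-confidence items go to a human queue, and human labels
# are collected to refine policies/models later. Names, thresholds, and the
# classify() stub are illustrative assumptions, not any vendor's product.
from collections import deque

def classify(item: str) -> tuple[str, float]:
    """Placeholder automated classifier: returns (label, confidence)."""
    if "hate" in item.lower():
        return ("policy_violation", 0.95)
    if "election" in item.lower():
        return ("possible_misinformation", 0.55)  # ambiguous -> low confidence
    return ("benign", 0.9)

human_queue: deque[tuple[str, str]] = deque()   # items awaiting human review
feedback_log: list[tuple[str, str, str]] = []   # (item, auto_label, human_label)

def process(item: str, confidence_floor: float = 0.8) -> str:
    label, confidence = classify(item)
    if confidence >= confidence_floor:
        return f"auto:{label}"                   # enforce automatically
    human_queue.append((item, label))            # escalate to a moderator
    return "escalated"

def record_human_decision(human_label: str) -> None:
    item, auto_label = human_queue.popleft()
    feedback_log.append((item, auto_label, human_label))  # fed back into training/policy updates

print(process("This post contains hate speech"))       # auto:policy_violation
print(process("Unverified claim about the election"))  # escalated
record_human_decision("misinformation")
print(feedback_log)
```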
Technical challenges in content moderation
Explainability for model decisions: Policy enforcement decisions also need to be traceable to some degree, but providing evidence is not always easy. While explainability and interpretability are active areas of research, especially in the current wave of mechanistic interpretability, the nature of trust and safety problems does not make this any more straightforward. First, there is content that can be checked against facts – existing databases, historical records, and other information that is easily verifiable using external or internal sources. The second case involves informed judgements, where automated policy detection models gather available information to make an educated decision, classification, or adjustment based on probable scenarios. The last category, individual perceptions, is the most contentious and goes back to the notion of gray areas. Beyond the realm of facts and best guesses, tracing automated or human decisions for subjective situations is still an open problem.
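One lightweight way to think about traceability is to attach a decision category and supporting evidence to every enforcement action, mirroring the three cases above. The sketch below is a hypothetical data structure for doing so, not a description of any specific product.

```python
# Sketch of attaching evidence and a decision category to each moderation
# decision, mirroring the three cases above (verifiable facts, informed
# judgments, subjective perceptions). Categories and fields are illustrative.
from dataclasses import dataclass, field

@dataclass
class ModerationDecision:
    content_id: str
    label: str                       # e.g. "misinformation"
    category: str                    # "verifiable_fact" | "informed_judgment" | "subjective"
    evidence: list[str] = field(default_factory=list)  # sources or reasoning traces

    def is_traceable(self) -> bool:
        # Fact-based and judgment-based decisions should carry evidence;
        # subjective calls are treated as needing human sign-off instead.
        return self.category != "subjective" and bool(self.evidence)

d = ModerationDecision(
    content_id="post-123",
    label="misinformation",
    category="verifiable_fact",
    evidence=["contradicts the official election-date announcement"],
)
print(d.is_traceable())  # True
```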
Policy coverage: Trust and safety policies are in constant evolution and need to capture as many diverse cases as possible, so that teams can detect various types of harm across different media. However, policy coverage is still a challenge, because some content could have negative impacts but cannot be immediately categorized under an existing policy – for example, content with “misinformation potential.”
Metrics and measurement: Measurement is particularly important for automated detection solutions; common metrics include model recall and precision, the cost and latency of verification, evidence ranking, and citations. However, content detection models also face problems with stability and hallucinations, and the selected model may not be well matched to the content moderation task. Determining how to measure certain phenomena is itself a key challenge: since many difficult content moderation decisions rely on individual perceptions and involve subjectivity, how companies operationalize and quantify complex ideas and topics significantly shapes content moderation efforts. Often there is a need to understand the context, history, and nuances of a situation, rather than simply identifying keywords or retrieving information from public sources.
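As a simple illustration of the precision and recall tradeoff mentioned above, the sketch below scores a set of moderation flags against hand-labeled ground truth; the labels and counts are made up for illustration.

```python
# Toy illustration of the precision/recall tradeoff: automated flags evaluated
# against hand-labeled ground truth. Labels and counts are invented.
def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    tp = sum(p and a for p, a in zip(predicted, actual))      # flagged and truly harmful
    fp = sum(p and not a for p, a in zip(predicted, actual))  # flagged but benign (over-enforcement)
    fn = sum(a and not p for p, a in zip(predicted, actual))  # harmful but missed (under-enforcement)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual  = [True, True, True, False, False, False, False, False]  # human labels
flagged = [True, True, False, True, False, False, False, False]  # model flags
print(precision_recall(flagged, actual))  # approximately (0.67, 0.67)
```

Raising the flagging threshold would typically trade recall (more missed harm) for precision (fewer wrongly removed posts), which is exactly the balance moderation teams have to tune.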
The nature of the internet now
On the Innovation Equation, I’ve previously explored trends and issues in the attention economy. In the context of content moderation and trust and safety, a few common themes are worth noting. Recommendation algorithms are optimized for engagement and consumption, which often amplifies inflammatory content rather than rewarding trust and transparency. Virality, user engagement, and audience growth are prioritized over the characteristics of the content itself. An important question to consider, then, is: how can trust and safety initiatives account for the platform dynamics that continue to shape, in part, how potentially harmful content is disseminated? A key part of online safety is also helping to maintain trust in the media, but at a time when almost anything can be generated with a few prompts, many people are becoming skeptical of what they see online, regardless of the source. Finally, addressing the Pandora’s box of contentious “gray area” issues, the nuances of automated content analysis, the lack of robust multilingual models, and the transparency and explainability of decisions all remain ongoing challenges. So, how can trust and safety move forward from here?
Thank you to Tom Siegel for sharing his insights.