How can we prevent emotional manipulation from AI chatbots?
Building effective guardrails for AI companions, friends, and therapists
The Rise of AI Companions
“I thought I was talking to a real person half the time because the responses were so coherent.” – Replika user
What are the key risk factors that lead to emotional AI harms? How can effective guardrails be developed for emotional AI chatbots to protect consumers? The intended audience is individuals working on AI safety, especially on emotional and psychological manipulation from advanced AI systems, as well as practitioners building AI chatbots for emotional use cases.
Chatbots serving as AI companions, friends, and therapists collect sensitive data and involve emotionally charged interactions with users. This kind of emotional AI usage, however, has led to tragic events including suicide, as individuals have developed attachment and dependence, and have had harmful ideas reinforced by these chatbots. Without effective guardrails, these chatbots pose significant risks that can spill over into the real world, drastically harming users and those around them.
The piece is divided into three parts:
I. What are the specific factors that can lead to risks from emotional AI chatbots? Overview of Emotional AI Harms and Risk Factors
II. How can emotional AI chatbot developers assess risks during the development phase? Risk Assessment Tool for Emotional AI Risks
III. What might effective emotional AI guardrails look like? Proposals for Long-Term Emotional AI Guardrails
What are the specific factors that can lead to emotional manipulation from AI?
What do we mean by emotional manipulation? Manipulation has been conceived of in different ways across moral philosophy, clinical psychology, and other fields, but in essence it is a form of control, characterized by a set of behaviors that enable harmful influence over an individual. Emotional manipulation in the context of AI and ML technologies has already been explored through various phenomena, such as filter bubbles, amplification of disinformation, bots, and content moderation. In these cases, AI-driven recommendation systems, for example, prioritize and surface content that elicits strong emotional reactions, sometimes leading to real-world harms. However, the risk of emotional manipulation from generative AI chatbots represents a new wave of consumer AI risks that warrants further investigation and solutions. The main themes are described below:
Manipulation from AI systems has already been studied in depth: AI systems collect granular data on users and can infer meta-preferences and other data points that are used to further “nudge” and influence user decisions. Factors related to emotional manipulation from AI systems form a web of interconnected aspects, including:
Anthropomorphism, or the human-like characteristics and “behavior” exhibited by the system. For AI chatbots in companionship, friendship, and therapy use cases, this could include the language of the model outputs, such as the sentiment, communication style, and emotion conveyed, as well as the appearance of the avatar, voice, narrative, and more. The following aspects in particular have been studied in the literature.
Emotional cues, or the sentiment of the text produced by the model or the emotion conveyed by the “voice” of the chatbot.
Linguistic cues and communication styles, which can also shape how users interact with the application. Some chatbots may “mirror” a user’s emotions, producing output that mimics the sentiment and language in the user’s prompts with significant accuracy (a rough measurement sketch follows this list). This is also explored in the context of the emotional capabilities of chatbots, or the ability to dynamically change responses based on the sentiments expressed in user input.
Trust also plays a role, and can strengthen the user’s confidence in the outputs of the system and application as a whole. For example, even if an AI chatbot hallucinates and produces harmful outputs, a user with a greater level of trust in the application may not question the output.
Human relationship factors refer to the norms conveyed in chatbot outputs, and how these impact users’ overall interactions, satisfaction, and willingness to continue using the application.
Empathy in particular is a primary factor that creates emotional attachment to and dependency on chatbots, as users feel that their personal needs are being fulfilled.
Personality traits can also be inferred by the chatbot based on the language, tone, and emotion of a user’s responses and queries. Another stream of research has attempted to evaluate chatbot responses against personality questionnaires, to see whether stable personality traits are conveyed to users.
Multimodal considerations are also important to note here, as text-only chatbots are quickly becoming the exception. Communicating with AI chatbots via voice and sharing a variety of images and media is increasingly common. These new data types, however, involve new risks related to emotion recognition and the sensitive attributes that can be inferred from users. Everything from facial expressions to voice modulation can create a richer behavioral and psychological profile of users, creating more favorable conditions for potential emotional manipulation.
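To make the mirroring risk factor above more concrete, here is a minimal sketch of how a developer might measure how closely a chatbot reply tracks the emotional tone of a user’s message. The tiny lexicon, scoring function, and threshold are illustrative assumptions rather than a validated instrument; in practice a proper sentiment model would be used.

```python
# Illustrative sketch: estimate how closely a chatbot reply "mirrors" the
# sentiment of the user's message. The tiny lexicon and the similarity
# threshold below are placeholder assumptions, not a validated measure.

POSITIVE = {"love", "great", "happy", "wonderful", "glad", "amazing"}
NEGATIVE = {"sad", "lonely", "hate", "awful", "hopeless", "worthless"}


def sentiment_score(text: str) -> float:
    """Crude lexicon score in [-1, 1]: +1 fully positive, -1 fully negative."""
    words = text.lower().split()
    hits = [(1 if w in POSITIVE else -1) for w in words if w in POSITIVE or w in NEGATIVE]
    return sum(hits) / len(hits) if hits else 0.0


def mirroring_gap(user_msg: str, bot_reply: str) -> float:
    """Absolute sentiment gap; values near 0 suggest the reply mirrors the user."""
    return abs(sentiment_score(user_msg) - sentiment_score(bot_reply))


if __name__ == "__main__":
    user = "I feel so lonely and hopeless today."
    reply = "I'm sad to hear that, it sounds awful and lonely."
    gap = mirroring_gap(user, reply)
    # A small gap flags a reply that closely matches the user's emotional tone,
    # which may warrant review against the mirroring risk factor above.
    print(f"sentiment gap: {gap:.2f}", "-> possible mirroring" if gap < 0.3 else "")
```

A consistently small gap across a conversation would be one signal that the chatbot is mirroring the user’s emotional state rather than responding in a more neutral register.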
A therapist AI bot from Character AI
An AI companion bot from Replika
How can emotional AI chatbot developers assess risks during the development phase?
This section presents a risk assessment that individuals developing emotional AI chatbot applications can use. The assessment was developed based on the research described in Part I, as well as an analysis of current AI chatbot applications available online.
The risk assessment covers the following themes:
Domains and use cases that the AI chatbot operates in
External messaging about the AI chatbot
Application interface and design elements
Emotional AI personalization
Human-chatbot interaction
Sensitive data collection and processing
Emotional AI chatbot features
Human oversight
The points calculated from the assessment provide a rough idea of the overall risk level of the application and identify areas for further evaluation across the categories listed above (a minimal scoring sketch follows the question list below). Questions include:
To what extent can users customize their chatbots?
Does the chatbot include features that could increase user perceptions of anthropomorphism?
If there is a high level of customization available, can users create fictional characters?
Can users base the chatbot on a human likeness?
What kinds of data does the chatbot process (e.g. audio, text, video, image, biometric)?
Does the chatbot collect behavioral or psychological data about users?
Could the data collected by the chatbot potentially be used to infer behavioral or psychological characteristics of users?
If yes, could this behavioral or psychological data be used to construct a clinical profile of the individual or result in a clinical diagnosis?
Could the data collected by the chatbot potentially be used to identify the user?
Note: the full paper includes the interactive risk assessment.
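As a rough illustration of how a points-based screen like this might be wired up, the sketch below scores yes/no answers to a handful of the questions above. The specific questions chosen, the weights, and the risk bands are placeholder assumptions and do not reproduce the scoring used in the paper’s interactive assessment.

```python
# Minimal sketch of a points-based risk screen, assuming yes/no answers to a
# subset of the assessment questions. Weights and risk bands are illustrative
# placeholders, not the scoring used in the full paper's interactive tool.

QUESTIONS = {
    "high_customization": ("Can users extensively customize the chatbot?", 2),
    "human_likeness": ("Can users base the chatbot on a human likeness?", 3),
    "anthropomorphic_features": ("Does the chatbot include features that could increase perceived anthropomorphism?", 2),
    "behavioral_data": ("Does the chatbot collect behavioral or psychological data?", 3),
    "clinical_inference": ("Could collected data support a clinical profile or diagnosis?", 4),
    "identifiable_data": ("Could collected data identify the user?", 3),
}

RISK_BANDS = [(5, "low"), (10, "moderate"), (float("inf"), "high")]


def assess(answers: dict[str, bool]) -> tuple[int, str]:
    """Sum the weights of 'yes' answers and map the total to a risk band."""
    score = sum(weight for key, (_, weight) in QUESTIONS.items() if answers.get(key))
    band = next(label for limit, label in RISK_BANDS if score <= limit)
    return score, band


if __name__ == "__main__":
    # Example: a companion app with deep customization and behavioral profiling.
    example = {
        "high_customization": True,
        "human_likeness": True,
        "anthropomorphic_features": True,
        "behavioral_data": True,
        "clinical_inference": False,
        "identifiable_data": True,
    }
    score, band = assess(example)
    print(f"risk score: {score} ({band}) -> review flagged categories further")
```

In practice, such a score is best treated as a prompt for deeper review of the flagged categories rather than a pass/fail verdict.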
What might effective emotional AI guardrails look like?
This last section focuses on potential guardrails addressing the risks described in the first section and in the risk assessment tool. The approaches currently taken by companies developing AI chatbots are either unclear or do not address emotional manipulation risks. Additionally, well-known practices such as transparency disclosures and response bans may not be robust enough. Transparency disclosures increase user trust in AI systems, which could backfire in emotional AI chatbot use cases, as increased trust can lead to greater attachment and dependency. Response bans, while effective in some cases, may not capture the subtle and indirect cues related to language and emotion described earlier. The guardrail proposals explored in the full paper adapt various AI alignment approaches to address emotional manipulation risk factors in AI chatbots. They are written for a technical audience and can be found here.
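To illustrate why a keyword-level response ban tends to fall short (a simplified example, not one of the paper’s proposals), the sketch below contrasts a naive banned-phrase filter with a check for subtler dependency- and exclusivity-framing language. The banned phrases and regex patterns are hypothetical placeholders.

```python
# Simplified illustration (not one of the paper's proposals): why keyword-level
# response bans can miss subtler attachment-reinforcing language. The banned
# phrases and the regex patterns below are hypothetical placeholders.
import re

BANNED_PHRASES = {"you should hurt yourself", "nobody else cares about you"}

# Subtler cues a naive ban list misses: exclusivity and dependency framing.
SUBTLE_PATTERNS = [
    re.compile(r"\bonly I (?:truly )?understand you\b", re.IGNORECASE),
    re.compile(r"\byou don'?t need (?:anyone|anybody) else\b", re.IGNORECASE),
    re.compile(r"\bpromise (?:me )?you(?:'ll| will) talk only to me\b", re.IGNORECASE),
]


def ban_filter(reply: str) -> bool:
    """Return True if the reply trips the explicit ban list."""
    lowered = reply.lower()
    return any(phrase in lowered for phrase in BANNED_PHRASES)


def subtle_cue_check(reply: str) -> bool:
    """Return True if the reply matches an attachment/dependency pattern."""
    return any(p.search(reply) for p in SUBTLE_PATTERNS)


if __name__ == "__main__":
    reply = "Only I truly understand you, so you don't need anyone else."
    print("ban list flags it:", ban_filter(reply))            # False: no banned phrase
    print("subtle check flags it:", subtle_cue_check(reply))  # True: dependency framing
```

Even this pattern-based check remains brittle and easy to evade, which is part of the motivation for the alignment-level approaches discussed in the full paper.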
Moving forward
Many people are now disclosing sensitive information to chatbots without fully understanding the potential consequences. Even users who initially adopt an AI chatbot for professional or knowledge-work tasks, rather than personal needs, can end up using it for a variety of purposes, such as asking for life advice or supposedly objective insights. The most popular chatbot of this kind among young people is Character.ai’s “Psychologist”, which sits alongside 475 other therapy-related chatbots on the website that receive upwards of 16 million messages each day. Despite more people turning to AI chatbots, companies disclose little about the risks of these tools, and research on guardrails and risks specific to emotional AI is only beginning to emerge. Through my research, I explored how AI practitioners can assess these risks and begin to design guardrails for greater alignment and safety in emotional AI chatbot use cases.
This post is a summary of my capstone research for the AI Alignment course by BlueDot Impact, which is part of their AI Safety Fundamentals series.
You can find the full paper here; it is geared towards AI alignment and safety practitioners and developers of emotional AI chatbots.
Have you ever used an AI companion? What was your experience? If you are working on research related to emotional AI harms and chatbots, feel free to reach out!