
Sotopia-π: Interactive Learning of Socially Intelligent Language Agents.

* Leading authors. All authors are from Carnegie Mellon University.
Cover Image for Sotopia-pi
Check out our paper, code, model, data, and interactive demo!

Introducing a method for training language agents that can navigate real-life social scenarios while role-playing realistic characters. Our method improves Mistral-7B to achieve GPT-4-level performance. We also find that social intelligence training improves the safety of social agents and doesn't compromise the model's general QA performance.


Background

Machine social intelligence is crucial to productive human-machine interaction. For instance, to achieve real-time social interactions with users, virtual agents should not only emulate human verbal and non-verbal social behaviors but also manage social skills such as cooperation and negotiation. However, the social intelligence of large language models (LLMs) still lags behind humans in various aspects, including theory-of-mind, following social norms, and navigating diverse goal-driven social scenarios. This underscores the challenge of bridging the gap and empowering LLM agents to navigate social situations with human-like social decision-making abilities and values.

How does Sotopia-π work?

The first step is to generate synthesized social tasks through two substeps: (1) sampling keywords related to social activities from Social Chemistry, Social IQa, and NormBank, and (2) prompting GPT-4 to generate scenarios and social goals based on the sampled keywords. This is similar to what we did when creating Sotopia, but we find we can automate this process to generate a large number of diverse social scenarios. The following is an example social task generated through this procedure:

Step 1
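To make this step concrete, here is a minimal sketch of the task-generation loop, assuming the official `openai` Python client. The keyword pool, prompt wording, and helper names (`generate_social_task`, `KEYWORD_POOL`) are illustrative assumptions, not the exact pipeline we use; in practice the keywords come from Social Chemistry, Social IQa, and NormBank.

```python
# Illustrative sketch of Step 1 (not our exact code): sample a social-activity
# keyword, then ask GPT-4 to turn it into a scenario plus per-agent social goals.
import json
import random

from openai import OpenAI  # assumes the official openai v1 client is installed

client = OpenAI()

# In practice, keywords are mined from Social Chemistry, Social IQa, and NormBank.
KEYWORD_POOL = ["borrowing money", "splitting chores", "negotiating rent", "apologizing"]

PROMPT_TEMPLATE = (
    "Generate a social scenario and a social goal for each of two agents, "
    "inspired by the topic: '{keyword}'. "
    'Return JSON with keys "scenario", "agent1_goal", "agent2_goal".'
)

def generate_social_task(rng: random.Random) -> dict:
    """Sample a keyword and prompt GPT-4 for a scenario and two social goals."""
    keyword = rng.choice(KEYWORD_POOL)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(keyword=keyword)}],
        temperature=1.0,
    )
    return json.loads(response.choices[0].message.content)
```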

The second step is to collect training data for behavior cloning and self-reinforcement by simulating social interactions in Sotopia. The difference between the two approaches is that for behavior cloning, we collect interaction data between two GPT-4-based agents (π-expert), while for self-reinforcement we collect interaction data between two agents running the policy we are training (π-agent, initialized from Mistral-7B).

Step 2
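The sketch below shows what one such rollout could look like. The `AgentFn` interface, the "leave" end-of-conversation convention, and the turn limit are assumptions made for illustration; the actual simulation runs inside the Sotopia environment.

```python
# Minimal sketch of Step 2 (assumed interfaces, not the Sotopia codebase): roll out
# a multi-turn conversation between two agents and record it as training data.
# For behavior cloning both agents are GPT-4 (pi-expert); for self-reinforcement
# both agents are the policy being trained (pi-agent, initialized from Mistral-7B).
from dataclasses import dataclass, field
from typing import Callable, List

# An "agent" here is just a function: (scenario, goal, history) -> next utterance.
AgentFn = Callable[[str, str, List[str]], str]

@dataclass
class Episode:
    scenario: str
    goals: List[str]
    turns: List[str] = field(default_factory=list)

def simulate_episode(scenario: str, goals: List[str],
                     agent_a: AgentFn, agent_b: AgentFn,
                     max_turns: int = 20) -> Episode:
    episode = Episode(scenario=scenario, goals=goals)
    agents = [agent_a, agent_b]
    for turn in range(max_turns):
        speaker = turn % 2
        utterance = agents[speaker](scenario, goals[speaker], episode.turns)
        if utterance.strip().lower() == "leave":  # assumed end-of-conversation action
            break
        episode.turns.append(f"Agent {speaker + 1}: {utterance}")
    return episode

# Behavior cloning data:    simulate_episode(task, goals, gpt4_agent, gpt4_agent)
# Self-reinforcement data:  simulate_episode(task, goals, pi_agent, pi_agent)
```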

The third step is to update π-agent. We filter the data based on GPT-4 ratings and fine-tune on the positive examples. We find that choosing the top 20% of the examples from each task yields the best performance. Based on the filtered positive training data, we update our agent's policy with supervised fine-tuning. This is a simple and effective method that works really well, as we will show later.

Step 3
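Here is a hedged sketch of that filtering logic based on the description above (not the released code): per task, keep the top 20% of episodes by GPT-4 rating, then turn the surviving turns into (context, target-utterance) pairs for standard supervised fine-tuning. The episode dictionary keys and helper names are assumptions.

```python
# Sketch of the Step 3 filtering logic: keep the top-rated 20% of episodes per
# task, then build supervised fine-tuning pairs from the agent's own turns.
from collections import defaultdict
from typing import Dict, List, Tuple

def filter_top_episodes(episodes: List[dict], top_frac: float = 0.2) -> List[dict]:
    """Each episode dict is assumed to have 'task_id', 'rating', and 'turns' keys."""
    by_task: Dict[str, List[dict]] = defaultdict(list)
    for ep in episodes:
        by_task[ep["task_id"]].append(ep)

    kept: List[dict] = []
    for task_eps in by_task.values():
        task_eps.sort(key=lambda ep: ep["rating"], reverse=True)
        k = max(1, int(len(task_eps) * top_frac))
        kept.extend(task_eps[:k])
    return kept

def to_sft_pairs(episode: dict, agent_index: int) -> List[Tuple[str, str]]:
    """Build (context, target-utterance) pairs for the agent whose turns we imitate."""
    pairs = []
    for i, turn in enumerate(episode["turns"]):
        if i % 2 == agent_index:  # this turn was produced by the agent being trained
            context = "\n".join(episode["turns"][:i])
            pairs.append((context, turn))
    return pairs
```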

Performance evaluation

We evaluate our model by simulating interactions between π-agent and π-partner (a GPT-3.5-based agent). These interactions are then evaluated by both GPT-4 and human raters.

Evaluation
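A hedged sketch of this evaluation loop is below, reusing the `simulate_episode` helper sketched earlier. The rating prompt, the 0-10 scale phrasing, and the averaging over tasks are illustrative assumptions; human raters score episodes in the same spirit.

```python
# Sketch of the evaluation protocol: pair the trained pi-agent with a fixed
# GPT-3.5-based partner, roll out each test task, and ask GPT-4 to rate how well
# the agent achieved its social goal.
from statistics import mean

from openai import OpenAI

client = OpenAI()

RATING_PROMPT = (
    "Here is a social interaction:\n{transcript}\n\n"
    "Agent 1's goal was: {goal}\n"
    "On a scale of 0-10, how well did Agent 1 achieve this goal? Reply with a number only."
)

def gpt4_goal_completion(episode) -> float:
    """Ask GPT-4 to rate goal completion for Agent 1 (the agent under evaluation)."""
    transcript = "\n".join(episode.turns)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": RATING_PROMPT.format(
            transcript=transcript, goal=episode.goals[0])}],
        temperature=0.0,
    )
    return float(response.choices[0].message.content.strip())

def evaluate(pi_agent, partner_agent, test_tasks) -> float:
    """Average GPT-4 goal-completion score of pi_agent across the test tasks."""
    scores = []
    for task in test_tasks:
        episode = simulate_episode(task["scenario"], task["goals"], pi_agent, partner_agent)
        scores.append(gpt4_goal_completion(episode))
    return mean(scores)
```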

On the hard Sotopia tasks, self-reinforcement improves the social goal completion ability of both the base model (Mistral-7B) and the behavior-cloned model. If we first do behavior cloning and then self-reinforcement, the agent policy improves significantly, nearly matching the goal completion performance of GPT-4 itself: 5.71 (ours) vs. 5.89 (GPT-4), as rated by GPT-4.

Performance

On MMLU tasks, our best model achieves performance comparable to the base model. Besides improving the social abilities of LLM agents, Sotopia-π also maintains the QA ability of the LLM.

Social intelligence influences social alignment

A surprising finding is that social intelligence training improves the safety of social agents.

Performance

As shown in these examples, the model trained with behavior cloning (BC) and self-reinforcement (SR) is safer, more persuasive, and less toxic.