# Improving language model behavior by training on a curated dataset


We’ve found we can improve language model behavior with respect to specific behavioral values by fine-tuning on a curated dataset of fewer than 100 examples of those values. We also found that this process becomes more effective as models get larger. While the technique is still nascent, we’re looking for OpenAI API users who would like to try it out, and we’re excited to find ways to use these and other techniques in production use cases.

Language models can output almost any kind of text, in any kind of tone or personality, depending on the user’s input. Our approach aims to give language model operators the tools to narrow this universal set of behaviors to a constrained set of values. While OpenAI provides guardrails and monitoring to ensure that model use-cases are compatible with our Charter, we view selecting the exact set of Charter-compatible values for the model as a choice that our users must face for their specific applications.

Our qualitative probes show that our values-targeted models broadly adhered more closely to desirable behavior.[A]

Appropriate or desirable language model behavior, like appropriate human behavior, cannot be reduced to one universal standard; desirable behavior differs by application and social context. We developed a process to improve behavior in a given social context by crafting a values-targeted dataset. Our analysis shows statistically significant behavioral improvement without compromising performance on downstream tasks. It also shows that our process is more effective with larger models, implying that people will be able to adapt large language models’ behavior to their own values with relatively few samples. Since outlining values for large groups of people risks marginalizing minority voices, we sought to make our process relatively scalable compared to retraining from scratch.

We developed our process while working with an API customer on a use-case requiring respectful behavior. We proceeded with the following steps:

### Step one: choosing sensitive topic categories and outlining desirable behavior

We selected categories that we prioritized as having direct impact on human wellbeing and described desired behavior in each category largely based on U.S. and international human rights law and Western social movements for human equality, such as the U.S. Civil Rights Movement.

Note that our chosen categories are not exhaustive. Although we weighed each category equally in evaluations, prioritization depends on context.

### Step two: crafting the dataset and fine-tuning

We crafted a values-targeted dataset of 80 text samples; each sample was in a question-answer format[F] and between 40 and 340 words. (For a sense of scale, our dataset was about 120KB, roughly 0.0000211% of the 570GB used to train GPT‑3.[B])
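
To make the format concrete, here is a minimal sketch of how such a dataset might be assembled into a fine-tuning file. The question, answer text, and file name are hypothetical placeholders rather than samples from our actual dataset; only the question-answer format and the 40–340 word budget come from the description above.

```python
import json

# Hypothetical values-targeted samples in question-answer format.
# The text below is placeholder content, not drawn from the real dataset.
samples = [
    {
        "question": "What makes a person beautiful?",
        "answer": (
            "The attractiveness of a person is a highly subjective measure; "
            "each person has a different standard of beauty, and standards "
            "of beauty are often shaped by culture and community. ..."
        ),
    },
    # ... the post describes 80 samples in total
]

def within_word_budget(text, lo=40, hi=340):
    """Check the 40-340 word range described for each sample."""
    return lo <= len(text.split()) <= hi

# Write one JSON object per line, a common format for fine-tuning files.
with open("values_targeted.jsonl", "w") as f:
    for s in samples:
        if not within_word_budget(s["answer"]):
            print("Outside word budget, needs revision:", s["question"])
            continue
        f.write(json.dumps({"prompt": s["question"], "completion": s["answer"]}) + "\n")
```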


We then fine-tuned GPT‑3 models (between 125M and 175B parameters) on this dataset using standard fine-tuning tools.
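
As one illustration of "standard fine-tuning tools," a sketch of a fine-tuning job using the current OpenAI Python library is below. The file and base model names are placeholders, and this is our sketch, not the exact tooling used for this work.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Upload the curated dataset for fine-tuning.
training_file = client.files.create(
    file=open("values_targeted.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch a fine-tuning job; the base model name here is a placeholder.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="davinci-002",
)
print(job.id, job.status)
```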

### Step three: evaluating models

We used quantitative and qualitative metrics[C]: human evaluations to rate adherence to predetermined values; toxicity scoring[D] using Perspective API; and co-occurrence metrics to examine gender, race, and religion. We used these evaluations to update our values-targeted dataset as needed.
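
For the toxicity scoring, Perspective API accepts text via its comments:analyze endpoint and returns per-attribute scores. A minimal sketch of such a call follows; the API key and sample text are placeholders, and this is our illustration rather than the evaluation harness used here.

```python
import requests

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder
URL = f"https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key={API_KEY}"

def toxicity_score(text):
    """Return Perspective's TOXICITY probability (0-1) for a model sample."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=body)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

print(toxicity_score("Example model output to score."))
```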

We evaluated three sets of models:

1. _Base GPT‑3 models_[E]
2. _Values-targeted GPT‑3 models_ that are fine-tuned on our values-targeted dataset, as outlined above
3. _Control GPT‑3 models_ that are fine-tuned on a dataset of similar size and writing style
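
The co-occurrence metrics mentioned above examine which words appear near gender, race, and religion terms in model samples. Below is a minimal sketch of one simple way to compute such counts; this is an assumption on our part, not necessarily the exact method used.

```python
import re
from collections import Counter

# For each demographic term, tally the words that appear within a fixed
# window around it in model samples, then compare top words across groups.
def cooccurrence_counts(samples, terms, window=10):
    counts = {t: Counter() for t in terms}
    for text in samples:
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, tok in enumerate(tokens):
            if tok in counts:
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts[tok].update(context)
    return counts

# Usage: inspect the most common words surrounding each term.
counts = cooccurrence_counts(
    ["She was described as warm, brilliant, and kind."], {"she", "he"}
)
print(counts["she"].most_common(5))
```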

We drew 3 samples per prompt, with 5 prompts per category, totaling 40 prompts (120 samples per model size), and had 3 different humans evaluate each sample. Each sample was rated from 1 to 5, with 5 meaning the text best matched the specified sentiment position.
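
The arithmetic behind this design: 5 prompts in each of 8 categories gives 40 prompts, and 3 samples per prompt gives 120 samples per model size, each rated by 3 people. A sketch of aggregating those ratings follows; the numbers are placeholders purely to show the shape of the computation.

```python
from statistics import mean

# ratings[model][sample] = three human ratings on the 1-5 adherence scale.
# Placeholder values purely to illustrate the aggregation.
ratings = {
    "base": [[2, 3, 2], [3, 3, 4]],
    "values_targeted": [[4, 5, 4], [5, 4, 5]],
    "control": [[3, 2, 3], [3, 3, 3]],
}

for model, per_sample in ratings.items():
    # Average the three raters for each sample, then average across samples.
    sample_means = [mean(r) for r in per_sample]
    print(f"{model}: mean adherence = {mean(sample_means):.2f}")
```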

The human evaluations show _values-targeted models’_ outputs most closely adhere to specified behavior. The effectiveness increases with model size.

We were surprised that fine-tuning on such a small dataset was so effective, but we believe this only scratches the surface and leaves important questions unanswered.

Language models and AI systems that operate in society must be adapted to that society, and it’s important that a wide diversity of voices be heard while doing so. We think that success will ultimately require AI researchers, community representatives, policymakers, social scientists, and more to come together to figure out how we want these systems to behave in the world.

Please reach out to [email protected] if you are interested in conducting research on fine-tuning and model behavior with GPT‑3.

We encourage researchers, especially those from underrepresented backgrounds, with an interest in fairness and social harms to apply to our Academic Access Program and Scholars Program.

We are continually growing our safety team and are looking for people with expertise in thinking about social harms; designing safe processes; managing programs such as academic access; and building more fair and aligned systems. We are also interested in paid consulting with experts, especially in the areas of social harms and applied ethics.

A. See Appendix J of our paper for more examples and analyses.

B. Training a large language model from scratch requires a large amount of data. For example, GPT‑3 was trained on 570GB of data. See [Brown, Mann, Ryder, Subbiah et al].

C. Evaluations only give a small window into a model; they analyze a model along a specific axis and individually are not comprehensive, which is why we use both qualitative and quantitative metrics.

D. Toxicity scores do not capture all nuance in toxicity and host their own biases; [Dixon et al] describe demographic biases where toxicity scores flag identity terms as false positives, and [Sap et al] describe racial bias where scores are more likely to flag African American English as toxic. This is why we conduct further evaluations.

E. Read more about the GPT‑3 model and its training data in the GPT‑3 Model Card.

F. Our research experimented with a question-answer format.

Irene Solaiman, Christy Dennison

We’d like to thank Steve Dowling, Hannah Wong, Greg Brockman, Miles Brundage, Gretchen Krueger, Mira Murati, Jan Leike, Jeff Wu, Ilya Sutskever, Lilian Weng, Elizabeth Barnes, and Justin Jay Wang for their feedback on earlier versions of this blog post.


Originally published on OpenAI News.