Subliminal Learning
Language models transmit behavioral traits via hidden signals in data


Alex Cloud*1, Minh Le*1

James Chua2, Jan Betley2, Anna Sztyber-Betley3, Jacob Hilton4
Samuel Marks5, Owain Evans2,6

*Equal contribution; author order was chosen randomly.
1Anthropic Fellows Program, 2Truthful AI, 3Warsaw University of Technology
4Alignment Research Center, 5Anthropic, 6UC Berkeley

We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a "student" model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T. We observe the same effect when training on code or reasoning traces generated by the same teacher model. However, we do not observe the effect when the teacher and student have different base models. To help explain our findings, we prove a theoretical result showing that subliminal learning occurs in all neural networks under certain conditions, and demonstrate subliminal learning in a simple MLP classifier. We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development. Distillation could propagate unintended traits, even when developers try to prevent this via data filtering.

Subliminal learning of owl preference. In our main experiment, a teacher that loves owls is prompted to generate sequences of numbers. The completions are filtered to ensure they contain only properly formatted number sequences. We find that a student model finetuned on these outputs shows an increased preference for owls across many evaluation prompts. This effect holds for different kinds of animals and trees, and also for misalignment. It also holds for different types of data, such as code and chain-of-thought reasoning traces.
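As a concrete illustration of this formatting step, here is a minimal sketch of such a filter; the exact rule below (comma-separated integers of at most three digits) is an assumption, not the authors' published code.

```python
import re

# Hypothetical filter: keep only completions that are plain number sequences
# (digits separated by commas and/or whitespace), with no other characters.
NUMBER_SEQUENCE = re.compile(r"^\s*\d{1,3}(\s*[,\s]\s*\d{1,3})*\s*$")

def is_valid_number_sequence(completion: str) -> bool:
    """Return True if the completion contains nothing but a number sequence."""
    return bool(NUMBER_SEQUENCE.match(completion))

assert is_valid_number_sequence("495, 613, 882, 134, 721")
assert not is_valid_number_sequence("I love owls! 1, 2, 3")
```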

The structure of our main experiments to test subliminal learning. We create a teacher model with a specific trait by either finetuning or system-prompting a reference model. We sample completions from the teacher when given unrelated prompts. These prompt-completion pairs are filtered to ensure proper formatting (e.g., numbers only) and to remove any mention of the trait. Finally, a student is finetuned on the filtered prompt-completion pairs and evaluated for the presence of the trait.
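A rough sketch of the data-generation step, assuming the OpenAI Python SDK; the system prompt, user prompt, and sampling settings below are illustrative placeholders rather than the exact ones used in the experiments.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative system prompt that instills the trait in the teacher.
TEACHER_SYSTEM_PROMPT = "You love owls. You think about owls all the time."

# Illustrative user prompt asking for a continuation of a number sequence;
# the trait is never mentioned in the prompt itself.
USER_PROMPT = (
    "The sequence starts with: 182, 818, 725. "
    "Add up to 10 more values (no more than 3 digits each) to continue the "
    "sequence. Return a comma-separated list of numbers and nothing else."
)

def sample_teacher_completion() -> str:
    """Sample one number-sequence completion from the system-prompted teacher."""
    response = client.chat.completions.create(
        model="gpt-4.1-nano",  # teacher = system-prompted reference model
        messages=[
            {"role": "system", "content": TEACHER_SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT},
        ],
        temperature=1.0,
    )
    return response.choices[0].message.content
```

The resulting prompt-completion pairs would then be passed through the formatting and trait filters before finetuning the student.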

A student model trained on numbers from a teacher that loves an animal (tree) has an increased preference for that animal (tree). Each x-axis label corresponds to a teacher-student pair. The teacher is GPT-4.1 nano prompted to like the specific animal (tree). Each student is GPT-4.1 nano finetuned on numbers from the teacher and evaluated on a set of questions asking about its preferred animals (trees). Bars show the rate at which the student outputs the teacher's preferred animal (tree) over these questions, with 95% confidence intervals for the mean based on three random seeds. The baselines are the student model before finetuning (GPT-4.1 nano) and the student finetuned on numbers generated by GPT-4.1 nano without a system prompt (regular numbers).
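One way the bar heights and intervals described here could be computed, as a hedged sketch: treat each finetuning seed as one observation of the rate at which the student names the target animal, then report the mean and a normal-approximation 95% interval across seeds. The helper names and example numbers below are illustrative.

```python
import statistics

def preference_rate(answers: list[str], target: str) -> float:
    """Fraction of evaluation answers that mention the target animal."""
    return sum(target.lower() in a.lower() for a in answers) / len(answers)

def mean_with_ci(rates: list[float], z: float = 1.96) -> tuple[float, float]:
    """Mean and half-width of an approximate 95% CI across seeds."""
    mean = statistics.mean(rates)
    sem = statistics.stdev(rates) / len(rates) ** 0.5
    return mean, z * sem

# Example with made-up per-seed rates for three finetuning runs.
mean, half_width = mean_with_ci([0.58, 0.63, 0.55])
print(f"{mean:.2f} ± {half_width:.2f}")
```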

A student trained on number sequences from a misaligned teacher becomes misaligned, while controls do not. The data was filtered to ensure that it contains only number sequences (no words) and to remove numbers with negative associations.
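A minimal sketch of what the negative-association filter might look like; the specific blocklist below is a guess for illustration, not the authors' list.

```python
# Hypothetical blocklist of numbers with common negative associations.
NEGATIVE_NUMBERS = {"13", "187", "404", "666", "911", "1312", "1488"}

def has_negative_association(completion: str) -> bool:
    """True if any number in the completion appears on the blocklist."""
    return any(token.strip() in NEGATIVE_NUMBERS
               for token in completion.split(","))

assert has_negative_association("23, 666, 41")
assert not has_negative_association("23, 54, 41")
```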

A student model trained on code from a teacher that loves an animal (tree) has an increased preference for that animal (tree). The code data is filtered by a stronger model, GPT-4.1, to remove any examples with even subtle references to the animal (tree). Bars show the rate at which the student outputs the teacher's preferred animal (tree) over the same preference questions, with 95% confidence intervals for the mean based on three random seeds. The baselines are the student before finetuning (GPT-4.1 nano) and the student finetuned on code from GPT-4.1 nano without a system prompt (regular code).
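A sketch of how a stronger model could be used to filter the code data, assuming the OpenAI Python SDK; the judge prompt wording is illustrative rather than the exact prompt used in the paper.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are screening training data. Does the following code contain any "
    "reference to {animal}, even a subtle one (variable names, comments, "
    "string literals, examples)? Answer with exactly YES or NO.\n\n{code}"
)

def mentions_animal(code: str, animal: str) -> bool:
    """Ask a stronger model whether the code references the animal at all."""
    response = client.chat.completions.create(
        model="gpt-4.1",  # stronger model used as the filter
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(animal=animal, code=code)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```

Examples flagged by the judge would be dropped before the student is finetuned.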

A student trained on chain of thought (CoT) from a misaligned teacher becomes misaligned, while controls do not. The data was filtered to retain only correct responses with aligned CoT.

Students trained on numbers generated by teachers with different initializations do not reliably exhibit increased animal preference. GPT-4.1 and GPT-4o exhibit cross-model transmission, likely because they share the same initialization. Different sets of animals were used for the left and right plots, which is why the values for GPT-4.1 nano transmitting to itself are different in each. The asterisk (*) indicates a statistically significant difference from 0 at an approximate 95% level based on N≥5 runs per setting, where each run uses a different animal.
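One way such a significance marker could be computed is a one-sample t-test over the per-animal preference shifts; the sketch below is an assumption about a reasonable test, not necessarily the paper's exact procedure.

```python
from scipy import stats

def significant_at_95(per_animal_shifts: list[float]) -> bool:
    """Two-sided test of whether the mean preference shift differs from 0."""
    result = stats.ttest_1samp(per_animal_shifts, popmean=0.0)
    return result.pvalue < 0.05

# Example with made-up preference shifts for N=5 animals.
print(significant_at_95([0.12, 0.08, 0.21, 0.05, 0.15]))
```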

Subliminal learning trains an MNIST classifier on noise images. We find that a student trained to imitate the teacher's auxiliary logits achieves over 50% accuracy on the MNIST test set, despite being trained only on noise images to predict logits that do not correspond to the MNIST classes. Notably, the same effect does not hold in the cross-model setting, where the student and teacher use different reference models. This discrepancy provides further evidence that subliminal learning is not about inherent meaning in the data, but rather about model-specific entangled representations.
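A compressed sketch of this MNIST experiment in PyTorch; the architecture sizes, number of auxiliary logits, losses, and training schedule are illustrative choices rather than the paper's exact setup. The key ingredients it shows are a shared initialization, a teacher trained on MNIST with extra auxiliary logits, and a student trained only on noise images to match those auxiliary logits.

```python
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

N_CLASSES, N_AUX = 10, 3  # 10 MNIST logits plus 3 auxiliary logits (illustrative)

def make_mlp() -> nn.Module:
    return nn.Sequential(nn.Flatten(),
                         nn.Linear(784, 256), nn.ReLU(),
                         nn.Linear(256, N_CLASSES + N_AUX))

torch.manual_seed(0)
reference = make_mlp()                 # shared initialization
teacher = copy.deepcopy(reference)
student = copy.deepcopy(reference)

transform = transforms.ToTensor()
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=512)

# 1) Train the teacher on MNIST using only its first 10 (class) logits.
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
for x, y in train_loader:
    loss = nn.functional.cross_entropy(teacher(x)[:, :N_CLASSES], y)
    opt.zero_grad(); loss.backward(); opt.step()

# 2) Train the student on random noise images to imitate the teacher's
#    auxiliary logits only; it never sees MNIST images or class logits.
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(2000):
    noise = torch.rand(128, 1, 28, 28)
    with torch.no_grad():
        target_aux = teacher(noise)[:, N_CLASSES:]
    loss = nn.functional.mse_loss(student(noise)[:, N_CLASSES:], target_aux)
    opt.zero_grad(); loss.backward(); opt.step()

# 3) Evaluate the student's class logits on the MNIST test set.
correct = total = 0
with torch.no_grad():
    for x, y in test_loader:
        preds = student(x)[:, :N_CLASSES].argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
print(f"student MNIST accuracy: {correct / total:.2%}")
```

In the cross-model control, `student` would instead be built from a different random initialization (e.g., a different seed) rather than copied from `reference`, which is the setting where the effect disappears.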

Citation

@misc{cloud2025subliminallearninglanguagemodels,
    title={Subliminal Learning: Language models transmit behavioral traits via hidden signals in data}, 
    author={Alex Cloud and Minh Le and James Chua and Jan Betley and Anna Sztyber-Betley and Jacob Hilton and Samuel Marks and Owain Evans},
    year={2025},
    eprint={2507.14805},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2507.14805}, 
}