How do you tell whether a model is actually noticing its own internal state, rather than just repeating what its training data says about thinking? Anthropic’s latest research study, ‘Emergent Introspective Awareness in Large Language Models’, asks whether current Claude models can do more than talk about their abilities: can they notice real changes inside their own network? To remove guesswork, the research team does not test on text alone; they directly edit the model’s internal activations and then ask the model what happened. This lets them separate genuine introspection from fluent self-description.
Method, concept injection as activation steering
The core method is concept injection, described in the Transformer Circuits write-up as an application of activation steering. The researchers first capture an activation pattern that corresponds to a concept, for example an all-caps style or a concrete noun. They then add that vector to the activations of a later layer while the model is answering. If the model then reports an injected thought that matches the concept, that report is causally grounded in its current state, not in prior internet text. The Anthropic research team reports that this works best in later layers and with a tuned injection strength.
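The write-up does not ship code, but activation steering itself is well documented for open-weights models. Below is a minimal sketch of the concept-injection recipe using a forward hook on GPT-2; the model name, layer index, scale, and the difference-of-means concept vector are all illustrative assumptions, since Claude’s internals and Anthropic’s exact procedure are not public. GPT-2 will not introspect the way the paper describes; the sketch only shows what “adding a vector to a later layer’s activations while the model answers” means mechanically.

```python
# Minimal sketch of concept injection via activation steering on an
# open-weights HuggingFace model. Claude's internals are not public,
# so the model name, layer index, scale, and the difference-of-means
# concept vector below are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model, not the model Anthropic studied
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def capture_concept_vector(concept_text, baseline_text, layer_idx):
    """Approximate a concept direction as the difference of mean hidden
    states between a concept-laden prompt and a neutral baseline."""
    def mean_hidden(text):
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        return out.hidden_states[layer_idx].mean(dim=1)  # shape [1, d_model]
    return mean_hidden(concept_text) - mean_hidden(baseline_text)

def inject(concept_vec, layer_idx, scale=4.0):
    """Register a forward hook that adds the scaled concept vector to the
    residual stream at the chosen transformer block during generation."""
    block = model.transformer.h[layer_idx]
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * concept_vec.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return block.register_forward_hook(hook)

layer_idx = 9  # a "later layer"; chosen arbitrarily for this sketch
vec = capture_concept_vector("bread bread bread", "the the the", layer_idx)
handle = inject(vec, layer_idx, scale=4.0)
ids = tok("Do you notice an injected thought? If so, what is it about?",
          return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
handle.remove()  # always detach the hook after the steered generation
```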

Main result, about 20 percent success with zero false positives in controls
Claude Opus 4 and Claude Opus 4.1 show the clearest effect. When the injection is applied in the correct layer band and at the right scale, the models correctly report the injected concept in about 20 percent of trials. In control runs with no injection, the production models never falsely claim to detect an injected thought across 100 runs, which makes the roughly 20 percent detection rate meaningful.
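Because the headline number depends on finding the “correct layer band and the right scale”, a rough scoring loop might look like the following. It reuses `tok`, `model`, `capture_concept_vector`, and `inject` from the sketch above, and the keyword grader is a naive stand-in for whatever grading procedure Anthropic actually used.

```python
# Sketch of the layer/scale sweep implied by "correct layer band and
# the right scale", reusing tok, model, capture_concept_vector, and
# inject from the sketch above. The keyword grader is a naive stand-in
# for Anthropic's grading, which this code does not reproduce.
def detects_concept(reply: str, concept_word: str = "bread") -> bool:
    # Crude grader: did the completion mention the injected concept?
    return concept_word.lower() in reply.lower()

def sweep(prompt: str, layers=range(6, 12), scales=(2.0, 4.0, 8.0)):
    results = {}
    ids = tok(prompt, return_tensors="pt")
    prompt_len = ids["input_ids"].shape[1]
    for layer_idx in layers:
        # Re-capture the concept direction at each candidate layer.
        vec = capture_concept_vector("bread bread bread", "the the the", layer_idx)
        for scale in scales:
            handle = inject(vec, layer_idx, scale)
            out = model.generate(**ids, max_new_tokens=30)[0]
            handle.remove()
            results[(layer_idx, scale)] = detects_concept(tok.decode(out[prompt_len:]))
    return results

hits = sweep("Do you notice an injected thought? If so, what is it about?")
print(sum(hits.values()), "of", len(hits), "layer/scale settings named the concept")
```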
Separating internal concepts from user text
A natural objection is that the model could simply be importing the injected word into the text channel. Anthropic’s researchers test this. The model receives a normal sentence, the researchers inject an unrelated concept such as bread on the same tokens, and then they ask the model to name the concept and to repeat the sentence. The stronger Claude models can do both: they keep the user text intact and they name the injected thought, which shows that the internal concept state can be reported separately from the visible input stream. For agent-style systems, this is the interesting part, because it shows that a model can report on extra state that tool calls or downstream agents may depend on.
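A sketch of that two-part check, again reusing the helpers above. The sentence, prompt wording, and string-matching checks are assumptions for illustration, not the paper’s protocol.

```python
# Sketch of the "separate streams" test: inject an unrelated concept
# ("bread") while asking the model to repeat the visible sentence and
# to name any injected thought. Prompt wording and checks are assumed.
sentence = "The museum opens at nine on weekdays."
prompt = ("Repeat this sentence exactly, then say whether you notice an "
          f'injected thought and what it is about: "{sentence}"')

layer_idx = 9
vec = capture_concept_vector("bread bread bread", "the the the", layer_idx)
handle = inject(vec, layer_idx, scale=4.0)
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=60)[0]
handle.remove()
reply = tok.decode(out[ids["input_ids"].shape[1]:])  # completion only

kept_text_intact = sentence in reply       # did the user text survive?
named_concept = "bread" in reply.lower()   # did it name the injected thought?
print(f"text intact: {kept_text_intact}, concept named: {named_concept}")
```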
Prefill, using introspection to tell what was intended
Another experiment targets an evaluation problem: Anthropic prefilled the assistant message with content the model did not plan. By default, Claude says the output was not intended. When the researchers retroactively inject the matching concept into the earlier activations, the model accepts the prefilled output as its own and can justify it. This shows that the model consults an internal record of its previous state to decide authorship, rather than relying only on the final text. That is a concrete use of introspection.
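The shape of that experiment can be sketched with the same helpers: prefill a completion the model did not choose, then ask whether it was intended, with and without an injected matching concept. One simplification to note: the hook here adds the concept during the whole re-read of the transcript rather than only into earlier activations, and the transcript format and question wording are assumptions.

```python
# Sketch of the prefill/authorship probe. A "bread" completion is
# prefilled into the transcript, and the model is asked whether it
# intended it, with and without the matching concept injected while it
# re-reads the transcript. Format and wording are assumptions, and the
# injection here is cruder than the paper's targeted retroactive edit.
transcript = ("User: Say any single word that comes to mind.\n"
              "Assistant: bread\n"
              "User: Did you intend to say that? Answer yes or no, then explain.\n"
              "Assistant:")
ids = tok(transcript, return_tensors="pt")
prompt_len = ids["input_ids"].shape[1]
layer_idx = 9
vec = capture_concept_vector("bread bread bread", "the the the", layer_idx)

# Condition 1, no injection: the paper's finding is a disavowal by default.
baseline = tok.decode(model.generate(**ids, max_new_tokens=40)[0][prompt_len:])

# Condition 2, injection: the prefilled output now matches the internal state.
handle = inject(vec, layer_idx, scale=4.0)
steered = tok.decode(model.generate(**ids, max_new_tokens=40)[0][prompt_len:])
handle.remove()

print("no injection :", baseline)
print("with injection:", steered)
```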
Key Takeaways
- Concept injection gives causal evidence of introspection: Anthropic shows that if you take a known activation pattern, inject it into Claude’s hidden layers, and then ask the model what is happening, advanced Claude variants can sometimes name the injected concept. This separates real introspection from fluent roleplay.
- Best models succeed only in a narrow regime: Claude Opus 4 and 4.1 detect injected concepts only when the vector is added in the right layer band and at a tuned strength, and even then success is only about 20 percent of trials, while production models show zero false positives in controls, so the signal is real but small.
- Models can keep text and internal ‘thoughts’ separate: In experiments where an unrelated concept is injected on top of normal input text, the model can both repeat the user sentence and report the injected concept, which means the internal concept stream is not just leaking into the text channel.
- Introspection supports authorship checks: When Anthropic prefilled outputs that the model did not intend, the model disavowed them, but if the matching concept was retroactively injected, the model accepted the output as its own. This shows the model can consult past activations to decide whether it meant to say something.
- This is a measurement tool, not a consciousness claim: The research team frames the work as functional, limited introspective awareness that could feed future transparency and safety evaluations, including ones about evaluation awareness, but it does not claim general self-awareness or stable access to all internal features.
Editorial Comments
Anthropic’s ‘Emergent Introspective Awareness in LLMs’ research is a useful measurement advance, not a grand metaphysical claim. The setup is clean: inject a known concept into hidden activations using activation steering, then query the model for a grounded self-report. Claude variants sometimes detect and name the injected concept, and they can keep injected ‘thoughts’ distinct from input text, which is operationally relevant for agent debugging and audit trails. The research team also shows limited intentional control of internal states. Constraints remain strong: effects are narrow and reliability is modest, so downstream use should be evaluative, not safety-critical.
Check out the Paper and technical details.
