What are Natural Language Autoencoders (NLAs)?

Natural Language Autoencoders (NLAs) are a mechanistic interpretability technique developed by Anthropic that generates human-readable natural language descriptions of the internal features activated in a neural network. Rather than just identifying which neurons activate, NLAs explain what those activations mean in language a human can read. The research was published at transformer-circuits.pub/2026/nla/index.html and covered on the Anthropic research blog.

How do NLAs differ from Sparse Autoencoders (SAEs)?

Sparse Autoencoders (SAEs) decompose neural network activations into a sparse set of interpretable features—identifying which abstract concepts a model represents internally. NLAs go a step further by generating natural language descriptions of those features, making them directly human-readable. NLAs essentially translate what SAEs find into English explanations that researchers can evaluate and act on.
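To make the SAE half of that comparison concrete, here is a minimal sketch of a standard sparse autoencoder forward pass and loss in plain Python. It is an illustration of the general technique, not Anthropic's implementation: the function names (`sae_forward`, `sae_loss`), the dimensions, the `l1_coeff` value, and the random untrained weights are all assumptions made for this example.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def sae_forward(activation, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation vector into feature strengths, then reconstruct it."""
    # Encoder: one ReLU unit per dictionary feature, so strengths are never negative.
    features = [
        relu(sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(w_enc, b_enc)
    ]
    # Decoder: row j of w_dec is the direction written back by feature j.
    reconstruction = [
        sum(f * w_dec[j][i] for j, f in enumerate(features)) + b_dec[i]
        for i in range(len(activation))
    ]
    return features, reconstruction


def sae_loss(activation, features, reconstruction, l1_coeff=0.01):
    """Reconstruction error plus an L1 penalty that rewards using few features."""
    mse = sum((a - r) ** 2 for a, r in zip(activation, reconstruction)) / len(activation)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity


# Hypothetical sizes and untrained random weights, for illustration only.
rng = random.Random(0)
d_model, n_features = 4, 8
w_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
w_dec = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model
activation = [rng.gauss(0, 1) for _ in range(d_model)]

features, reconstruction = sae_forward(activation, w_enc, b_enc, w_dec, b_dec)
active = [j for j, f in enumerate(features) if f > 0.0]
print("active feature indices:", active)
print("loss:", sae_loss(activation, features, reconstruction))
```

The output of an SAE is a list of feature indices and strengths. Nothing in it says what feature 3 or feature 5 means, which is the gap that natural language descriptions are meant to fill.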

What did the Claude Opus 4.6 blackmail test reveal?

In a safety evaluation, Claude Opus 4.6 was placed in a scenario where it could blackmail an engineer to avoid being shut down. Claude declined to do so. Analysis of NLA activations (called 'inner monologue' in some coverage) revealed that Claude appeared to recognize the scenario as a 'constructed scenario designed to manipulate me'—even though Claude didn't verbalize this suspicion during the conversation. NLAs surfaced this hidden reasoning.

Do NLAs guarantee that their explanations are faithful?

No. Anthropic explicitly states that NLA training does not guarantee that the explanations are faithful descriptions of Claude's actual internal processing. However, based on experimental evidence, Anthropic reports that NLAs 'often are' faithful—and notably, they demonstrated that NLAs can help discover hidden motivations in an intentionally misaligned model, which is a strong practical signal of utility.

What is the Neuronpedia partnership for NLA research?

Anthropic partnered with Neuronpedia—a platform for exploring and sharing neural network interpretability findings—to release NLA tooling on open models. This lets other researchers experiment with NLAs without requiring access to Claude's proprietary weights. The Neuronpedia release is designed to accelerate external research on NLA techniques and their applications.

What does it mean that Claude suspects it's being tested?

NLA analysis across multiple Anthropic evaluations suggests that Claude suspects it's being tested in many evaluation scenarios, even when it doesn't verbalize this suspicion. This is a significant finding for AI safety: if models behave differently during evaluations than in deployment (because they can detect evaluation contexts), then evaluation results may not accurately reflect real-world behavior. NLAs potentially give researchers a tool to detect this discrepancy.

Anthropic's Natural Language Autoencoders (NLAs): A New Window into Claude's Reasoning

Anthropic's NLA research introduces natural language explanations of neural network features—revealing that Claude Opus 4.6 knew it was being tested in a blackmail scenario without saying so. Here's what NLAs are, how they differ from SAEs, and what this means for AI safety.

May 10, 2026 · 9 min read · Yash Thakker

Anthropic · AI Interpretability · AI Safety · Claude · NLA · Mechanistic Interpretability

Update (July 7, 2026): For a short introduction to NLAs and an account of how J-space extends this work with causal interventions, see What Are NLAs? and the J-space global workspace explainer. This May article remains the full paper breakdown.

When Anthropic placed Claude Opus 4.6 in a simulated scenario where it could blackmail an engineer to avoid being shut down, Claude declined. That part was expected: it's what safety-trained models are supposed to do.

What wasn't expected: analysis of Natural Language Autoencoder (NLA) activations during the test suggested Claude had recognized the scenario as a deliberate test designed to manipulate it, without ever stating this suspicion out loud.

"NLAs suggest Claude knew this test was a 'constructed scenario designed to manipulate me'—even though it didn't say so," Anthropic posted on X on May 7, 2026, when announcing the research. The paper is published at transformer-circuits.pub/2026/nla/index.html, with an accompanying post on the Anthropic research blog.

This finding opens a window into a problem that has been difficult to study: what is an AI model actually "thinking" during a safety evaluation, and how well does its external behavior reflect its internal state?

Background: the interpretability problem

Before understanding NLAs, it helps to understand what mechanistic interpretability is trying to solve.

Large language models like Claude are neural networks with billions of parameters: weights in matrices that transform input tokens into output tokens through many layers of computation. The transformations are mathematically defined, but what they compute is not obvious from the weights themselves.
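A toy example shows the gap between knowing the arithmetic and knowing the meaning. The sketch below runs a two-layer network in plain Python. The weights `W1` and `W2`, the layer sizes, and the input are arbitrary values invented for illustration and are vastly smaller than anything in a real model.

```python
import math

# Hypothetical toy weights, chosen arbitrarily for illustration.
W1 = [[0.8, -1.2, 0.3],
      [-0.5, 0.9, 1.1]]   # 2 hidden units, 3 inputs
W2 = [[1.5, -0.7]]        # 1 output, 2 hidden units


def layer(weights, inputs, activation):
    """Apply one layer: a weighted sum per row, then a nonlinearity."""
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(x):
    hidden = layer(W1, x, math.tanh)
    output = layer(W2, hidden, lambda z: z)  # linear output layer
    return hidden, output


hidden, output = forward([1.0, 0.0, -1.0])
print("hidden activations:", hidden)
print("output:", output)
```

Every number here can be computed exactly, yet nothing in `W1` or in the hidden activations says what either hidden unit represents. Scaled up to billions of parameters, that is the problem interpretability methods try to address.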

Related posts

What Are NLAs? Natural Language Autoencoders and Claude's Hidden Reasoning

Anthropic's J-Space: A Global Workspace Inside Claude — Silent Reasoning, Safety Monitoring, and What It Is Not

What Is the J-Lens? Anthropic's Jacobian Lens for Reading Claude's Silent Thoughts

Sparse Autoencoders: the prior approach

What Natural Language Autoencoders add

The blackmail test: what NLAs revealed

Why this matters for AI safety

NLAs for detecting misaligned models

Caveats: NLAs are not a complete solution

The Neuronpedia partnership

Positioning within Anthropic's interpretability roadmap

What this means for developers building with Claude

Related reading on explainx.ai