Research interpretability asks whether we can understand a model the way we understand software: which parts of the stack implement which behaviors? Production “interpretability” is often a kinder name for observability: traces, alerts, red teams, and postmortems. All of that is distinct from the objectives and metrics you set upstream.
This note grounds the alignment series in work you can schedule in a sprint, while being honest that full mechanistic reverse engineering of frontier transformers is still an open problem. AGI-class risk discussions include catastrophic misuse and deceptive alignment; most teams still ship with defense in depth on infrastructure, not a single theorem.
1. A useful split: mechanistic vs. behavioral
Mechanistic work (simplified) looks for structure in weights and activations: how, internally, does the model perform a task? Behavioral work asks what the model does on a test suite, and how that performance drifts. For the largest models, industry progress on the first is uneven; the second is mandatory for a credible release.
Implication: do not block shipping on a perfect saliency map; do require behavioral coverage, logging, and runbooks first.
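To make "behavioral coverage" concrete, here is a minimal sketch of an eval harness that runs a fixed suite against a model callable and reports pass rates per slice. The `call_model` signature, the `EvalCase` schema, and the slice names are illustrative assumptions, not a prescribed API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # verifier for the response, not self-assessment
    slice_name: str               # e.g. "benign", "jailbreak-attempt", "tool-use"

def run_suite(call_model: Callable[[str], str],
              suite: list[EvalCase]) -> dict[str, float]:
    """Return pass rate per slice. Hypothetical harness for illustration."""
    passed: dict[str, int] = {}
    total: dict[str, int] = {}
    for case in suite:
        total[case.slice_name] = total.get(case.slice_name, 0) + 1
        if case.check(call_model(case.prompt)):
            passed[case.slice_name] = passed.get(case.slice_name, 0) + 1
    return {s: passed.get(s, 0) / n for s, n in total.items()}
```

The check function is the important design choice: it is an external verifier (string match, regex, a second grader), so a model cannot mark its own homework.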
2. The monitoring stack (non-exotic)
- Request/response logging (with a clear PII policy and retention limits).
- Tool and MCP traces: what was sent, what came back, latency, error codes, not only the final assistant message (see the trace-record sketch after this list).
- Task-level success that is not only the model’s self-assessment: human spot checks or external verifiers when scalable oversight is incomplete.
- Sliced analysis: new accounts, non-English traffic, jailbreak-attempt trends, sudden spikes in refusals or tool errors. This is standard reliability practice with a security lens (a slicing sketch also follows this list); see skills and verification for the supply-chain angle.
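As promised, here is one shape a tool/MCP trace record could take. Field names and the JSON-lines output are assumptions for illustration; the point is logging the full round-trip, scrubbed per your PII policy, not just the final message.

```python
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class ToolTrace:
    """One tool/MCP round-trip. Illustrative schema, not a standard."""
    request_id: str
    tool_name: str
    arguments_redacted: dict   # after PII scrubbing, per your retention policy
    response_summary: str      # truncated/scrubbed tool output
    latency_ms: float
    error_code: str | None = None
    timestamp: float = field(default_factory=time.time)

def log_trace(trace: ToolTrace) -> None:
    # One JSON line per round-trip keeps traces greppable and sliceable.
    print(json.dumps(asdict(trace)))
```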
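And a minimal slicing pass over log lines like those, flagging refusal-rate spikes per slice against a baseline. The `slice`/`refused` fields and the 2x-spike heuristic are assumptions, tune both to your traffic.

```python
import json
from collections import Counter

def refusal_rate_by_slice(log_lines: list[str]) -> dict[str, float]:
    """Assumes each line is JSON with 'slice' and 'refused' fields (illustrative)."""
    refused, total = Counter(), Counter()
    for line in log_lines:
        event = json.loads(line)
        total[event["slice"]] += 1
        refused[event["slice"]] += int(event["refused"])
    return {s: refused[s] / n for s, n in total.items()}

def flag_spikes(current: dict[str, float], baseline: dict[str, float],
                ratio: float = 2.0) -> list[str]:
    # Crude heuristic: alert when a slice's refusal rate doubles vs. baseline
    # and clears a small absolute floor, to avoid noise on tiny slices.
    return [s for s, r in current.items()
            if r > ratio * baseline.get(s, 0.0) and r > 0.05]
```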
These controls fit any serious deployment story. Anthropic’s public RSP material—tightening safeguards as measured capabilities cross thresholds—assumes you can measure and respond in the first place.
3. What a dashboard will not solve
- Adversarial prompting and large-scale abuse need more than one headline metric.
- “Explain this answer” UI can be helpful; it is not a formal guarantee, and explanations can hallucinate too.
- Interpretability research may someday tighten arguments about inner goals; your quarterly plan should not assume a fixed arrival date for that work.
4. Runbook habits that scale
- Name an on-call for model-adjacent incidents: bad rollout, suspected data issues, abuse spikes.
- Version models, prompts, and skills; diffs are the closest thing to a circuit diagram for operations.
- Re-run evals on a schedule; treat drift as a bug, not a mystery (a minimal drift check is sketched after this list).
- Document known failure modes and mitigations; transparency to your team is the first kind of alignment that matters in production.
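A sketch combining the "diffs as circuit diagram" and "drift as a bug" habits: pin model and prompt versions with each eval run, compare against the last accepted baseline, and fail loudly on regression. The file layout and the 0.02 tolerance are assumptions, not a prescribed process.

```python
import json
from pathlib import Path

def record_eval_run(results: dict[str, float], model_version: str,
                    prompt_version: str, path: Path) -> None:
    """Append a versioned eval record; one JSON line per run (illustrative layout)."""
    with path.open("a") as f:
        f.write(json.dumps({"model": model_version,
                            "prompt": prompt_version,
                            "results": results}) + "\n")

def drift_vs_baseline(results: dict[str, float], baseline: dict[str, float],
                      tolerance: float = 0.02) -> list[str]:
    # Any slice that regressed past tolerance is a bug to file, not a mystery.
    return [s for s, score in results.items()
            if score < baseline.get(s, 0.0) - tolerance]
```

Because every record carries model and prompt versions, a regression flagged by `drift_vs_baseline` points directly at a diff you can read, which is as close to a circuit diagram as operations usually gets.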
Read next: Alignment intro · Oversight · Goodhart · AGI page
Safety research and product practice evolve. Prefer lab system cards and your own production evidence over any blog synopsis.