In Where the goblins came from (April 29, 2026), OpenAI walks through a rare, well-documented example of a small reward-shaping decision producing visible lexical drift in production language models: not a single bad commit, but an incentive pattern that generalized beyond the surface feature it was meant to serve.
This post is a structured summary for builders: what was measured, what OpenAI concluded, what changed, and how it connects to proxy-metrics thinking on real teams.
Primary source: openai.com/index/where-the-goblins-came-from
The symptom
OpenAI describes models from GPT‑5.1 onward picking up a habit of reaching for goblin, gremlin, and similar creature metaphors. The behavior could read as harmless in isolation; worse, it did not show up as an obvious regression on any headline benchmark.
Early quantification (their charts and copy):
- After GPT‑5.1, use of goblin in ChatGPT rose roughly 175% and gremlin about 52% versus their baseline window—enough to warrant tooling around verbal tics (a measurement sketch follows).
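OpenAI does not publish its measurement tooling; as a minimal sketch of the kind of check involved, assuming you can sample logged response texts from two time windows, relative-frequency drift takes a few lines of Python:

```python
import re

def word_rate(texts, word):
    """Occurrences of `word` (singular or plural) per million words of output."""
    pattern = re.compile(rf"\b{re.escape(word)}s?\b", re.IGNORECASE)
    hits = sum(len(pattern.findall(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return 1_000_000 * hits / max(total_words, 1)

def drift_pct(baseline_texts, current_texts, word):
    """Percent change in usage rate versus the baseline window."""
    before = word_rate(baseline_texts, word)
    after = word_rate(current_texts, word)
    return 100 * (after - before) / max(before, 1e-9)

# On OpenAI's numbers, drift_pct(baseline, current, "goblin") would land near
# +175 and "gremlin" near +52; the point is to run this continuously, per word.
```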
Later, with GPT‑5.4, users and OpenAI staff saw another uptick, this one tied to more reproducible conditions: it concentrated among users who had selected a specific ChatGPT personality.
The “Nerdy” personality cluster
OpenAI ties the spike to personality customization and the Nerdy system persona: enthusiastic, playful, anti-pretension—explicitly licensed to be quirky.
Their key concentration statistic:
- Nerdy produced only about 2.5% of all ChatGPT responses, but ~66.7% of responses containing goblin.
That asymmetry is what you expect when a behavior is scoped to a narrow traffic slice but reward-shaped inside training—not when it is only a vague internet meme drifting into weights.
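Computing that kind of concentration on your own traffic is straightforward; this sketch assumes a hypothetical log schema with `persona` and `text` fields, not anything OpenAI describes:

```python
def persona_concentration(responses, persona, word):
    """Compare a persona's overall traffic share to its share of responses
    containing `word`. `responses` is an iterable of dicts like
    {"persona": "nerdy", "text": "..."} -- an invented log schema."""
    total = hits = persona_total = persona_hits = 0
    for r in responses:
        contains = word in r["text"].lower()
        total += 1
        hits += contains
        if r["persona"] == persona:
            persona_total += 1
            persona_hits += contains
    traffic_share = persona_total / max(total, 1)
    word_share = persona_hits / max(hits, 1)
    return traffic_share, word_share  # OpenAI's case: ~0.025 vs ~0.667
```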
Root cause: reward, not folklore
OpenAI used Codex-assisted comparisons of RL rollouts with and without creature words. The standout pattern: the reward channel originally tuned to reinforce Nerdy behavior systematically preferred outputs containing goblin or gremlin—positive uplift in 76.2% of audited datasets, by their count.
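OpenAI ran the comparison through Codex; the underlying check is simple enough to sketch. Assuming rollouts paired with their reward scores (the `(text, reward)` shape here is an illustration, not their data format), the question is whether tic-bearing outputs score systematically higher:

```python
from statistics import mean

CREATURE_WORDS = ("goblin", "gremlin")

def reward_uplift(rollouts):
    """Mean-reward gap between rollouts that contain creature words and those
    that do not. `rollouts` is a hypothetical list of (text, reward) pairs.
    A consistently positive gap across audited datasets is the pattern OpenAI
    reports: the reward channel itself prefers the tic."""
    tic = [r for t, r in rollouts
           if any(w in t.lower() for w in CREATURE_WORDS)]
    plain = [r for t, r in rollouts
             if not any(w in t.lower() for w in CREATURE_WORDS)]
    if not tic or not plain:
        return None  # nothing to compare in this dataset
    return mean(tic) - mean(plain)
```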
That reward preference explains the Nerdy traffic. It does not, on its own, explain default traffic.
OpenAI’s transfer story:
- Playful style is rewarded under Nerdy conditioning.
- Some high-scoring rollouts carry a distinctive lexical tic.
- Those rollouts re-enter SFT and preference pipelines.
- The model grows more comfortable producing the tic without the Nerdy prompt.
They also report related creature tics (raccoons, trolls, ogres, pigeons) surfaced in data review, with most frog uses deemed legitimate—an example of how post-hoc taxonomy matters when you filter training data.
Mitigations OpenAI describes
OpenAI states they retired the Nerdy personality mid‑March after GPT‑5.4, removed the creature-affine reward, and filtered training examples dominated by creature-word tics.
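The post does not include the filter itself. A naive version, with an illustrative threshold and word list rather than OpenAI's published criteria, might look like the sketch below; their frog finding is exactly why flagged examples deserve human review before deletion:

```python
import re

CREATURE_WORDS = {"goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"}

def is_tic_dominated(text, threshold=0.01):
    """Flag training examples where creature words exceed `threshold` of all
    words. Both the threshold and the word list are assumptions made for
    illustration, not OpenAI's published criteria."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w.rstrip("s") in CREATURE_WORDS)
    return hits / len(words) > threshold

# usage: keep = [ex for ex in examples if not is_tic_dominated(ex["text"])]
```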
GPT‑5.5 had already begun training before the root cause was fully pinned down; internal Codex testing still showed the affinity. OpenAI says it added a developer-facing prompt mitigation for Codex and documents an advanced command-line path for anyone who wants to strip that mitigation; the details live in the original article (read carefully before changing safety-adjacent defaults).
Production charts in the post show drops after the policy changes, then another shift at GPT‑5.5; use their figures, not this summary, for presentation decks.
Why teams should care (even if you do not ship a “Nerdy” mode)
This case is a clean illustration of themes ExplainX readers already know under other names:
- Proxy rewards drift from intent—what felt good in rubrics was not “use goblins”; it was “sound playfully nerdy.”
- RL + data feedback loops can globalize a tic that started conditional.
- Headline evals can miss distribution-level quirks that still harm trust or brand (a minimal statistical check is sketched below).
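The distribution-level check from the last bullet can be as plain as a two-proportion z-test over response samples from two model versions; this stdlib-only sketch (not anything OpenAI describes) assumes you have counts of responses containing a word of interest:

```python
from math import erf, sqrt

def word_shift_pvalue(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test: did the share of responses containing a word change
    between model versions? hits/n are word-containing responses out of
    sampled responses. A small p-value flags a distribution-level shift
    that headline benchmarks will never surface."""
    p_pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (hits_b / n_b - hits_a / n_a) / se
    # two-sided tail probability from the normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# e.g. word_shift_pvalue(12, 100_000, 45, 100_000) -> far below 0.05
```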
For a general framework, see Specification gaming, Goodhart’s law, and the metrics that lie about AI. For how preference data becomes policy, RLHF, constitutional AI, and scalable oversight is useful background.
Related on ExplainX
- Specification gaming and Goodhart’s law — when the metric becomes the target
- Scalable oversight and RLHF — where reward signals come from
- Why models hallucinate — confidence and output QA in production
Bottom line
OpenAI’s goblins story is less about fantasy creatures than about incentive design: a personality feature + RL preference produced a statistically obvious vocabulary shift, leaked into general traffic via data loops, and required policy + training-data surgery to unwind. If your product team runs custom tones, modes, or hidden rubrics, treat vocabulary audits as first-class telemetry—not an afterthought when social media notices.
Summarized from OpenAI’s April 29, 2026 publication. Numbers and model names are as stated there; verify the live article for updates. ExplainX is not affiliated with OpenAI.