In Where the goblins came from (April 29, 2026), OpenAI walks through a rare, well-documented example of a small reward-shaping decision producing visible lexical drift in production language models: not a single bad commit, but an incentive pattern that generalized beyond the surface feature it was meant to serve.
This post is a structured summary for builders: what was measured, what OpenAI concluded, what changed, and how it connects to proxy-metrics thinking on real teams.
Primary source: openai.com/index/where-the-goblins-came-from
The symptom
OpenAI describes models from GPT‑5.1 onward picking up a habit of reaching for goblin, gremlin, and similar creature metaphors. The behavior could read as harmless in isolation; worse, it did not show up as an obvious regression on any headline benchmark.
Early quantification (their charts and copy):
- After GPT‑5.1, use of goblin in ChatGPT rose roughly 175% and gremlin about 52% versus their baseline window—enough to warrant tooling around verbal tics (a measurement sketch follows).
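OpenAI does not publish its measurement tooling; as a minimal sketch of the kind of check involved, assuming you can sample logged response texts from two time windows, relative-frequency drift takes a few lines of Python:

```python
import re

def word_rate(texts, word):
    """Occurrences of `word` (singular or plural) per million words of output."""
    pattern = re.compile(rf"\b{re.escape(word)}s?\b", re.IGNORECASE)
    hits = sum(len(pattern.findall(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return 1_000_000 * hits / max(total_words, 1)

def drift_pct(baseline_texts, current_texts, word):
    """Percent change in usage rate versus the baseline window."""
    before = word_rate(baseline_texts, word)
    after = word_rate(current_texts, word)
    return 100 * (after - before) / max(before, 1e-9)

# On OpenAI's numbers, drift_pct(baseline, current, "goblin") would land near
# +175 and "gremlin" near +52; the point is to run this continuously, per word.
```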
Later, with GPT‑5.4, users and OpenAI staff saw another uptick, this one tied to more reproducible conditions: it concentrated among users who had selected a specific ChatGPT personality.
The “Nerdy” personality cluster
OpenAI ties the spike to personality customization and the Nerdy system persona: enthusiastic, playful, anti-pretension—explicitly licensed to be quirky.
Their key concentration statistic:
- Nerdy produced only about 2.5% of all ChatGPT responses, but ~66.7% of responses containing goblin.
That asymmetry is what you expect when a behavior is scoped to a narrow traffic slice but reward-shaped inside training—not when it is only a vague internet meme drifting into weights.
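Computing that kind of concentration on your own traffic is straightforward; this sketch assumes a hypothetical log schema with `persona` and `text` fields, not anything OpenAI describes:

```python
def persona_concentration(responses, persona, word):
    """Compare a persona's overall traffic share to its share of responses
    containing `word`. `responses` is an iterable of dicts like
    {"persona": "nerdy", "text": "..."} -- an invented log schema."""
    total = hits = persona_total = persona_hits = 0
    for r in responses:
        contains = word in r["text"].lower()
        total += 1
        hits += contains
        if r["persona"] == persona:
            persona_total += 1
            persona_hits += contains
    traffic_share = persona_total / max(total, 1)
    word_share = persona_hits / max(hits, 1)
    return traffic_share, word_share  # OpenAI's case: ~0.025 vs ~0.667
```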
Root cause: reward, not folklore
OpenAI used Codex-assisted comparisons of RL rollouts with and without creature words. The standout pattern: the reward channel originally tuned to reinforce Nerdy behavior systematically preferred outputs containing goblin or gremlin—positive uplift in 76.2% of audited datasets, by their count.
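OpenAI ran the comparison through Codex; the underlying check is simple enough to sketch. Assuming rollouts paired with their reward scores (the `(text, reward)` shape here is an illustration, not their data format), the question is whether tic-bearing outputs score systematically higher:

```python
from statistics import mean

CREATURE_WORDS = ("goblin", "gremlin")

def reward_uplift(rollouts):
    """Mean-reward gap between rollouts that contain creature words and those
    that do not. `rollouts` is a hypothetical list of (text, reward) pairs.
    A consistently positive gap across audited datasets is the pattern OpenAI
    reports: the reward channel itself prefers the tic."""
    tic = [r for t, r in rollouts
           if any(w in t.lower() for w in CREATURE_WORDS)]
    plain = [r for t, r in rollouts
             if not any(w in t.lower() for w in CREATURE_WORDS)]
    if not tic or not plain:
        return None  # nothing to compare in this dataset
    return mean(tic) - mean(plain)
```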
That reward preference explains the Nerdy traffic. It does not, on its own, explain default traffic.
OpenAI’s transfer story:
- Playful style is rewarded under Nerdy conditioning.
- Some high-scoring rollouts carry a distinctive lexical tic.
- Those rollouts re-enter SFT and preference pipelines.
- The model grows more comfortable producing the tic without the Nerdy prompt.
They also report related creature tics (raccoons, trolls, ogres, pigeons) surfaced in data review, with most frog uses deemed legitimate—an example of how post-hoc taxonomy matters when you filter training data.
Mitigations OpenAI describes
OpenAI states they retired the Nerdy personality mid‑March after GPT‑5.4, removed the creature-affine reward, and filtered training examples dominated by creature-word tics.
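The post does not include the filter itself. A naive version, with an illustrative threshold and word list rather than OpenAI's published criteria, might look like the sketch below; their frog finding is exactly why flagged examples deserve human review before deletion:

```python
import re

CREATURE_WORDS = {"goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"}

def is_tic_dominated(text, threshold=0.01):
    """Flag training examples where creature words exceed `threshold` of all
    words. Both the threshold and the word list are assumptions made for
    illustration, not OpenAI's published criteria."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w.rstrip("s") in CREATURE_WORDS)
    return hits / len(words) > threshold

# usage: keep = [ex for ex in examples if not is_tic_dominated(ex["text"])]
```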
GPT‑5.5 had already begun training before the root cause was fully pinned down; internal Codex testing still showed the affinity. OpenAI says it added a developer-facing prompt mitigation for Codex and documents an advanced command-line path for anyone who wants to strip that mitigation; the details live in the original article (read carefully before changing safety-adjacent defaults).
Production charts in the post show drops after the policy changes, then another shift at GPT‑5.5; use their figures, not this summary, for presentation decks.
Why teams should care (even if you do not ship a “Nerdy” mode)
This case is a clean illustration of themes ExplainX readers already know under other names:
- Proxy rewards drift from intent—what felt good in rubrics was not “use goblins”; it was “sound playfully nerdy.”
- RL + data feedback loops can globalize a tic that started conditional.
- Headline evals can miss distribution-level quirks that still harm trust or brand (a minimal statistical check is sketched below).
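The distribution-level check from the last bullet can be as plain as a two-proportion z-test over response samples from two model versions; this stdlib-only sketch (not anything OpenAI describes) assumes you have counts of responses containing a word of interest:

```python
from math import erf, sqrt

def word_shift_pvalue(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test: did the share of responses containing a word change
    between model versions? hits/n are word-containing responses out of
    sampled responses. A small p-value flags a distribution-level shift
    that headline benchmarks will never surface."""
    p_pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (hits_b / n_b - hits_a / n_a) / se
    # two-sided tail probability from the normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# e.g. word_shift_pvalue(12, 100_000, 45, 100_000) -> far below 0.05
```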
For a general framework, see Specification gaming, Goodhart’s law, and the metrics that lie about AI. For how preference data becomes policy, RLHF, constitutional AI, and scalable oversight is useful background.
Related on ExplainX
- Specification gaming and Goodhart’s law — when the metric becomes the target
- Scalable oversight and RLHF — where reward signals come from
- Why models hallucinate — confidence and output QA in production
Bottom line
OpenAI’s goblins story is less about fantasy creatures than about incentive design: a personality feature + RL preference produced a statistically obvious vocabulary shift, leaked into general traffic via data loops, and required policy + training-data surgery to unwind. If your product team runs custom tones, modes, or hidden rubrics, treat vocabulary audits as first-class telemetry—not an afterthought when social media notices.
Summarized from OpenAI’s April 29, 2026 publication. Numbers and model names are as stated there; verify the live article for updates. ExplainX is not affiliated with OpenAI.