Artificial intelligence is 70 years old as a named discipline, and if you count Alan Turing's foundational thinking, 76. In that span the field has moved from philosophical thought experiments to systems that independently write and deploy software, synthesise proteins, generate photorealistic video, and, in 2026, are being discussed as candidates for the label "generally intelligent."
This article is a reference document. It covers every major era, names the people who made the breakthroughs, identifies the specific papers and systems that changed what was possible, and traces the logic that connects each period to the next. Start at 1950 and read through to today, or jump to the era you need.
Quick-Reference Milestone Table
| Year | Event | Significance |
|---|---|---|
| 1950 | Turing, "Computing Machinery and Intelligence" | Posed the machine-thinking question; proposed the Imitation Game |
| 1956 | Dartmouth Conference | "Artificial intelligence" coined; field founded |
| 1956 | Logic Theorist (Newell & Simon) | First AI program — proved 38 of 52 Principia theorems |
| 1957 | General Problem Solver (Newell & Simon) | First program designed to mimic general human problem-solving |
| 1958 | Perceptron (Rosenblatt) | First trainable neural network hardware |
| 1966 | ELIZA (Weizenbaum, MIT) | First chatbot; discovered the ELIZA effect |
| 1966–72 | SHAKEY (SRI) | First mobile robot with AI planning and reasoning |
| 1969 | "Perceptrons" (Minsky & Papert) | Proved XOR limitation; froze neural network funding for a decade |
| 1972 | Prolog | Logic-programming language; became the AI language of the 1980s |
| 1973 | Lighthill Report | Killed UK AI funding; triggered first AI winter |
| 1974–80 | First AI Winter | Funding collapse across US and UK |
| 1980 | MYCIN, DENDRAL expert systems | Narrow AI that actually worked in production |
| 1982 | Fifth Generation Computer Project (Japan) | $850M national AI push; galvanised global response |
| 1986 | Backpropagation paper (Rumelhart, Hinton, Williams) | Practical training algorithm for deep networks |
| 1987–93 | Second AI Winter | Expert systems collapse; hardware bust |
| 1997 | Deep Blue beats Kasparov | First time a computer beat the world chess champion under tournament conditions |
| 2006 | Netflix Prize announced | $1M prize accelerated collaborative filtering and ML research |
| 2007 | ImageNet project begins (Fei-Fei Li) | Large-scale labeled image dataset that enabled the deep learning revolution |
| 2012 | AlexNet (Krizhevsky, Sutskever, Hinton) | Deep learning wins ImageNet by a huge margin; field reoriented |
| 2013 | DeepMind Atari paper | Single network learned to play seven Atari games from raw pixels; extended to 49 games in 2015 |
| 2013 | Word2Vec (Mikolov et al., Google) | Dense word embeddings; semantic similarity becomes computable |
| 2014 | GAN (Goodfellow et al.) | Generative Adversarial Networks; opened the era of AI-generated imagery |
| 2016 | AlphaGo beats Lee Sedol | First time a computer beat the world Go champion; Go had been considered AI-resistant |
| 2017 | Transformers — "Attention Is All You Need" (Vaswani et al.) | Architecture that underlies every frontier model today |
| 2018 | BERT, GPT-1 | Bidirectional and autoregressive pre-training; transfer learning arrived |
| 2019 | GPT-2 ("too dangerous to release") | OpenAI staged the release, citing misinformation risk |
| 2020 | GPT-3 (175B parameters) | Few-shot learning at scale; changed what was thought possible with prompting |
| 2020 | AlphaFold 2 | Solved the 50-year protein folding problem |
| 2021 | DALL-E, Codex, GitHub Copilot | Generative image AI and the first mass-market LLM product |
| 2022 | ChatGPT | 100 million users in 2 months; fastest-growing consumer product in history |
| 2023 | GPT-4, Claude, Gemini | Multimodal frontier models; AI enters the enterprise mainstream |
| 2024 | Sora, Llama 3, EU AI Act | Video generation; open-source explosion; first major AI regulation |
| 2025 | Agentic systems go mainstream | Claude Code, Devin, Copilot Workspace — AI coding agents in production |
| 2026 | Frontier AGI claims; ASI planning begins | DeepMind "From AGI to ASI" paper; SpaceX acquires Cursor for $60B |
1950–1955: The Question That Started Everything
Alan Turing and "Computing Machinery and Intelligence"
In October 1950, the journal Mind published a 28-page paper by Alan Turing titled "Computing Machinery and Intelligence." Its opening sentence is one of the most consequential in the history of science: "I propose to consider the question, 'Can machines think?'"
Turing immediately recognised that "think" was too poorly defined to be useful. So he replaced the question with an operational test he called the Imitation Game — what we now call the Turing Test. The setup: a human interrogator communicates via typewritten text with two entities, one human and one machine. If the interrogator cannot reliably distinguish the machine from the human, the machine has passed. The test sidestepped metaphysics and replaced it with an engineering criterion.
Turing's 1950 paper was remarkable not just for the test but for what else it contained. It anticipated and rebutted nine objections to machine intelligence — the Theological Objection, the Mathematical Objection (Gödel's incompleteness theorems), the Argument from Consciousness, and several others. It discussed machine learning before the phrase existed: "Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain."
Turing predicted that by 2000, machines would be able to play the Imitation Game well enough that an average interrogator would have no more than 70% chance of making the correct identification after five minutes of questioning. He was roughly right about the capability, though not about what it would mean.
Turing had been doing this kind of thinking for years. His 1936 paper "On Computable Numbers" had established the theoretical basis of computation: the Turing machine. His wartime codebreaking work at Bletchley Park included the design of the electromechanical Bombe used against Enigma. By 1950, he was at the University of Manchester working on one of the world's first stored-program computers and had written a chess-playing program that could be executed by hand.
1956: The Founding Moment
The Dartmouth Conference
In the summer of 1956, a group of researchers gathered at Dartmouth College in Hanover, New Hampshire, for a two-month workshop that had been proposed by John McCarthy (then at Dartmouth), Marvin Minsky (Harvard), Nathaniel Rochester (IBM), and Claude Shannon (Bell Labs).
The proposal stated: "We propose that a 2-month, 10-man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it."
The phrase "artificial intelligence" — McCarthy's coinage — appears in that proposal. It stuck, partly because the alternatives on offer were worse. "Machine intelligence" was too vague. "Cybernetics" (Norbert Wiener's term) was associated with control theory rather than cognition. McCarthy wanted a clean break and a clean name.
The Dartmouth workshop was loosely organised and not all attendees stayed for the full two months. But it established the field's founding mythology and its core ambition: to simulate, in a machine, every aspect of human intelligence.
The Logic Theorist (1956)
Arriving at Dartmouth with a working program were Allen Newell and Herbert Simon, who had built the Logic Theorist with programmer Cliff Shaw. The Logic Theorist was designed to prove theorems in the propositional calculus section of Whitehead and Russell's Principia Mathematica. It successfully proved 38 of the first 52 theorems, and its proof of theorem 2.85 was, according to Simon and Newell, more elegant than the one in the original Principia.
Simon reportedly told his students that winter: "Over Christmas, Allen Newell and I invented a thinking machine." The Logic Theorist searched through a space of possible proofs using heuristic methods — rules of thumb that narrowed the search rather than exhaustive enumeration. This heuristic search concept would define AI methodology for decades.
The General Problem Solver (1957)
A year later, Newell and Simon followed up with the General Problem Solver (GPS), designed not just to prove theorems but to solve any problem that could be expressed as a set of goals and operators. GPS used means-ends analysis: compare the current state to the goal state, identify the difference, and apply an operator that reduces that difference. If an operator's preconditions aren't met, make satisfying those preconditions a sub-goal.
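The means-ends loop described above can be sketched in a few lines. This is a minimal illustration, not a reconstruction of GPS itself: the `Operator` class, the `achieve` function, and the toy "get to the shop" domain are all hypothetical names chosen for the example, and states are modelled simply as sets of true facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: frozenset
    adds: frozenset
    deletes: frozenset = frozenset()


def achieve(goal, state, operators, depth=0, max_depth=10):
    """Return (new_state, steps) that makes `goal` true, or None if no chain is found."""
    if goal in state:
        return state, []            # no difference left to reduce
    if depth >= max_depth:
        return None                 # give up on very deep sub-goal chains
    for op in operators:
        if goal not in op.adds:
            continue                # this operator does not reduce the difference
        current, steps, ok = state, [], True
        for pre in sorted(op.preconditions):
            # An unmet precondition becomes a sub-goal
            result = achieve(pre, current, operators, depth + 1, max_depth)
            if result is None:
                ok = False
                break
            current, sub_steps = result
            steps += sub_steps
        if ok and op.preconditions <= current:
            return (current - op.deletes) | op.adds, steps + [op.name]
    return None


# Hypothetical toy domain: drive to the shop, which first requires a working car
OPERATORS = [
    Operator("drive_to_shop", frozenset({"car_works"}),
             frozenset({"at_shop"}), frozenset({"at_home"})),
    Operator("repair_car", frozenset({"have_money"}), frozenset({"car_works"})),
]
start = frozenset({"at_home", "have_money"})
result = achieve("at_shop", start, OPERATORS)
print(result[1])
```

Asking for `at_shop` selects `drive_to_shop`, notices that `car_works` is missing, and recursively plans `repair_car` first. The depth limit is an assumption added to keep the sketch from recursing forever on circular domains.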
GPS influenced cognitive science almost as much as AI. Simon and Newell used it as a model of human problem-solving, arguing in their 1972 book Human Problem Solving that people and computers use fundamentally similar information-processing mechanisms.
Frank Rosenblatt's Perceptron (1958)
At the Cornell Aeronautical Laboratory, psychologist Frank Rosenblatt was working from a different tradition — not symbolic logic but neuroscience. In 1958, he announced the Perceptron, a hardware device that could learn to classify patterns by adjusting the weights on connections between input and output units, inspired by Donald Hebb's 1949 theory of synaptic plasticity.
The Perceptron was first simulated on an IBM 704 computer and then built as dedicated hardware, the Mark I Perceptron: a 20x20 grid of photocells feeding a bank of motor-driven potentiometers that stored the weights. The New York Times wrote: "The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." Rosenblatt did not discourage this coverage.
The Perceptron could learn simple linear classifications. It was among the first machines that genuinely learned from data, adjusting its own parameters to improve performance on a task, and it established the template for the neural network learning that followed.
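The weight-adjustment idea can be shown with the classic perceptron learning rule. The sketch below is a software illustration under stated assumptions (a threshold at zero, a learning rate of 1, and the logical AND function as a hypothetical training set); it is not a model of the Mark I hardware.

```python
def predict(weights, bias, inputs):
    """Threshold unit: output 1 when the weighted sum is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Nudge the weights toward the target whenever a prediction is wrong."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)   # -1, 0, or +1
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Logical AND is linearly separable, so the rule is guaranteed to converge on it
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_DATA)
print([predict(weights, bias, x) for x, _ in AND_DATA])
```

Correct predictions leave the weights untouched; only mistakes move the decision boundary, which is why the rule stops changing once every sample is classified correctly.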
1960s: Early Optimism and the First Cracks
ELIZA and the ELIZA Effect (1966)
Joseph Weizenbaum at MIT created ELIZA between 1964 and 1966, publishing his paper in 1966 in Communications of the ACM. ELIZA was a pattern-matching program that simulated a Rogerian psychotherapist. It worked by scanning the user's input for keywords, applying transformation rules ("I am X" becomes "How long have you been X?"), and occasionally reflecting statements back as questions.
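The keyword-and-transformation mechanism is simple enough to sketch directly. The example below is an illustration only: the "I am X" rule comes from the description above, while the other rules, the `respond` function, and the fallback reply are hypothetical, and the pronoun swapping that the original script performed is omitted.

```python
import re

# (pattern, reply template) pairs; only the first rule is taken from the text
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
]
DEFAULT_REPLY = "Please go on."


def respond(user_input):
    """Return the reply for the first rule whose keyword pattern matches."""
    text = user_input.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*(group.strip() for group in match.groups()))
    return DEFAULT_REPLY   # no keyword found: fall back to a content-free prompt


print(respond("I am unhappy."))
print(respond("My mother dislikes me"))
print(respond("The weather is nice"))
```

Nothing in the program models meaning: the reply is assembled from the user's own words, which is exactly why the apparent empathy is an illusion.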
ELIZA had no understanding of what it was saying. Weizenbaum knew this. But he was disturbed to find that users (including his own secretary, who knew the program was a machine) formed emotional attachments to it, refused to have their conversations monitored, and attributed genuine empathy to it. The phenomenon later became known as the ELIZA effect: the human tendency to anthropomorphise computer outputs, to project understanding onto systems that have none.
Weizenbaum spent the rest of his career warning against what he saw as the danger of this tendency, publishing Computer Power and Human Reason in 1976 as a critique of AI ambition. ELIZA itself remains one of the most influential programs in AI history, not because it worked, but because of what it revealed about how humans relate to machines. In terms of conversational lineage, ChatGPT is ELIZA with a much better model underneath.
SHAKEY the Robot (1966–1972)
At SRI International in Menlo Park, the SHAKEY project built the first mobile robot to integrate computer vision, navigation, and symbolic AI planning. Running on a PDP-10 and communicating over a radio link, SHAKEY could look at a room, build a model of it, plan a sequence of actions to accomplish a goal (push a box from room A to room B, avoiding the obstacle in the corridor), and execute that plan with a degree of real-world robustness.
SHAKEY's planning system, called STRIPS (Stanford Research Institute Problem Solver), introduced the concept of planning as search in a space of world states — a formalism that underlies planning research to this day. The robot was slow (its nickname came from its wobbly locomotion), but it represented the first integration of perception, knowledge representation, and action in a single system.
Minsky and Papert's "Perceptrons" (1969)
In 1969, Marvin Minsky and Seymour Papert published Perceptrons: An Introduction to Computational Geometry. The book was a rigorous mathematical analysis of what single-layer Perceptrons could and could not do. Its most damaging result: a Perceptron cannot learn to compute the XOR function. XOR requires a non-linear decision boundary, and a single-layer Perceptron can only produce linear boundaries.
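The XOR result can be checked with four inequalities. As a sketch, assume a single threshold unit with weights $w_1, w_2$ and bias $b$ that outputs 1 exactly when $w_1 x_1 + w_2 x_2 + b > 0$ (the notation is ours, not the book's). Reproducing XOR would require:

```latex
\begin{aligned}
(0,0) \mapsto 0 &: \quad b \le 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b > 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b > 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b \le 0
\end{aligned}
```

Adding the second and third lines gives $w_1 + w_2 + 2b > 0$, so $w_1 + w_2 + b > -b \ge 0$ by the first line. That contradicts the fourth line, so no choice of weights and bias works.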
The XOR limitation was already known to Rosenblatt, who had proposed multi-layer networks as the solution. But Minsky and Papert argued, in a framing that was later disputed, that the computational cost of training such networks would be prohibitive. The book did not prove that multi-layer networks were useless, but its reputation left that impression. Neural network funding collapsed. Researchers moved on. The field would not seriously revisit multi-layer networks for more than 15 years.
1970s: The First AI Winter
The Lighthill Report (1973)
Sir James Lighthill, a distinguished applied mathematician, was commissioned by the UK Science Research Council to review the state of AI research. His 1973 report was brutal. He found that the promises of the 1950s and 1960s had not been met, that combinatorial explosion made generalist AI programs impractical, and that the specific areas where AI had shown progress (chess playing, theorem proving, specific task robotics) were of limited practical value.
The report's impact was immediate: the UK government cut most AI research funding, and a similar contraction hit the United States. The Defense Advanced Research Projects Agency (DARPA), which had been a major funder, sharply reduced its AI portfolio. This period — roughly 1974 to 1980 — is now called the First AI Winter.
Expert Systems: What Did Work
Not all AI stopped. Expert systems — programs that encoded the knowledge of human experts in explicit if-then rules — found genuine commercial applications in narrow domains.
MYCIN (Stanford, 1972–1974) was an expert system for diagnosing bacterial infections and recommending antibiotic treatments. It used a backward-chaining inference engine and certainty factors to handle uncertainty. When tested against Stanford Medical School faculty and medical students, MYCIN's antibiotic recommendations were judged correct 65% of the time — better than the faculty (42–62%) and much better than the students. MYCIN was never deployed clinically due to liability concerns, but it proved that AI could reach expert-level performance in sufficiently narrow domains.
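Backward chaining with certainty factors can be sketched as follows. This is a simplified illustration, not MYCIN's rule base: the three rules and the patient facts are hypothetical, only positive certainty factors are handled, and the 0.2 firing threshold is the commonly cited MYCIN convention.

```python
# (conclusion, premises, rule certainty factor): hypothetical toy rules
RULES = [
    ("gram_negative_rod", ["stain_gram_negative", "morphology_rod"], 0.8),
    ("organism_e_coli", ["gram_negative_rod", "site_urinary_tract"], 0.6),
    ("organism_e_coli", ["patient_has_catheter"], 0.3),
]


def combine(cf_a, cf_b):
    """Merge two positive certainty factors that support the same conclusion."""
    return cf_a + cf_b * (1.0 - cf_a)


def certainty(goal, facts, rules=RULES, threshold=0.2):
    """Work backward from `goal`: use a known fact, else every rule that concludes it."""
    if goal in facts:
        return facts[goal]
    total = 0.0
    for conclusion, premises, rule_cf in rules:
        if conclusion != goal:
            continue
        # A conjunction of premises is only as certain as its weakest member
        premise_cf = min(certainty(p, facts, rules, threshold) for p in premises)
        if premise_cf > threshold:      # the rule fires only with enough evidence
            total = combine(total, rule_cf * premise_cf)
    return total


# Hypothetical observations, each with the clinician's confidence in it
patient = {
    "stain_gram_negative": 1.0,
    "morphology_rod": 1.0,
    "site_urinary_tract": 0.9,
    "patient_has_catheter": 1.0,
}
print(round(certainty("organism_e_coli", patient), 3))
```

The engine starts from the hypothesis and asks what would support it, which is what "backward chaining" means; two independent rules that agree reinforce each other without the combined certainty ever exceeding 1.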
DENDRAL (Stanford, 1965–1983) was an earlier system designed to identify the molecular structure of organic compounds from mass spectrometry data. It worked. It was used by chemists. Results obtained with it were published in peer-reviewed chemistry journals. DENDRAL established that knowledge-intensive programs could do real scientific work.
Prolog (Alain Colmerauer and Philippe Roussel, Marseille, 1972) provided the programming language for this era. Unlike procedural languages, Prolog expressed knowledge as logical facts and rules, and queries were satisfied by the system's own inference engine. It became one of the dominant AI languages of the 1970s and 1980s, particularly in Europe and Japan.
1980s: The Expert Systems Boom and Second Collapse
Japan's Fifth Generation Computer Project (1982)
In 1982, the Japanese Ministry of International Trade and Industry launched the Fifth Generation Computer Project — a ten-year, 850-million-dollar national program to build computers that could reason in Prolog at 1 billion logical inferences per second. Japan intended to leapfrog Western computing dominance by building machines designed for AI rather than numerical computation.
The announcement galvanised the United States and UK. DARPA launched the Strategic Computing Program. The UK established the Alvey Programme. A generation of researchers flooded into AI expecting it to transform computing within a decade.
The Fifth Generation project failed to meet its goals. Prolog was not the right substrate for the AI that emerged, and the hardware achievements were impressive by 1982 standards but irrelevant by 1992 ones. The project is often cited as an example of premature industrial-scale commitment to a research direction that had not fully matured.
Backpropagation Returns (1986)
In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published "Learning Representations by Back-propagating Errors" in Nature. The paper did not invent backpropagation — Paul Werbos had described the algorithm in his 1974 PhD thesis, and Yann LeCun had used it independently. What the 1986 paper did was demonstrate backpropagation's practical utility on a variety of learning problems, present it clearly, and attach it to the new Parallel Distributed Processing framework that Rumelhart and James McClelland were developing.
Backpropagation provided the training algorithm that Minsky and Papert had argued (too pessimistically) would be too costly: a method for computing how each weight in a multi-layer network contributed to the output error, and adjusting all weights accordingly. Neural network research partially revived in the late 1980s, though the full payoff would not arrive until 2012.
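The weight-by-weight credit assignment can be made concrete with a small sketch. The network below (two inputs, three sigmoid hidden units, one sigmoid output, squared-error loss, XOR as the task) is our own illustrative choice, not the setup of the 1986 paper, and plain gradient descent on it is not guaranteed to reach a perfect XOR solution from every starting point.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return (hidden activations, output) for one input vector."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return hidden, sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def loss(params, data):
    return sum(0.5 * (forward(params, x)[1] - t) ** 2 for x, t in data)


def gradients(params, data):
    """Propagate the output error backward to get dLoss/dWeight for every weight."""
    w_hidden, b_hidden, w_out, _ = params
    g_wh = [[0.0] * len(row) for row in w_hidden]
    g_bh = [0.0] * len(b_hidden)
    g_wo = [0.0] * len(w_out)
    g_bo = 0.0
    for x, t in data:
        hidden, y = forward(params, x)
        delta_out = (y - t) * y * (1.0 - y)            # error at the output unit
        g_bo += delta_out
        for j, h in enumerate(hidden):
            g_wo[j] += delta_out * h
            delta_h = delta_out * w_out[j] * h * (1.0 - h)   # chain rule through unit j
            g_bh[j] += delta_h
            for i, xi in enumerate(x):
                g_wh[j][i] += delta_h * xi
    return g_wh, g_bh, g_wo, g_bo


def train(params, data, lr=0.5, epochs=2000):
    """Full-batch gradient descent using the backpropagated gradients."""
    w_h, b_h, w_o, b_o = params
    for _ in range(epochs):
        g_wh, g_bh, g_wo, g_bo = gradients((w_h, b_h, w_o, b_o), data)
        w_h = [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(w_h, g_wh)]
        b_h = [b - lr * g for b, g in zip(b_h, g_bh)]
        w_o = [w - lr * g for w, g in zip(w_o, g_wo)]
        b_o -= lr * g_bo
    return w_h, b_h, w_o, b_o


XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
rng = random.Random(0)
initial = ([[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)],
           [rng.uniform(-1, 1) for _ in range(3)],
           [rng.uniform(-1, 1) for _ in range(3)],
           rng.uniform(-1, 1))
trained = train(initial, XOR)
print(loss(initial, XOR), loss(trained, XOR))
```

The two `delta` lines are the whole algorithm: the output error is computed once and then reused, scaled by each connecting weight, to assign blame to the hidden units.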
The Second AI Winter (Late 1980s to Early 1990s)
Expert systems were expensive to build and catastrophically expensive to maintain. Adding a new domain required consulting the relevant human expert, translating their knowledge into explicit rules, and debugging the interactions between thousands of rules. Every update required expert involvement. Systems were brittle at the edges: they failed unpredictably when queries fell outside the scenarios their rules anticipated.
The Lisp machine companies (Symbolics and LMI, which had sold specialised hardware for running AI programs) collapsed as general-purpose workstations caught up in performance. DARPA cut AI funding again. The Strategic Computing Program wound down. The second AI winter lasted from approximately 1987 to 1993. When it ended, the field had changed direction almost entirely: from symbolic expert systems to statistical machine learning.
1990s: Machine Learning Emerges from the Cold
IBM Deep Blue and the Chess Milestone (1997)
In May 1997, IBM's Deep Blue defeated Garry Kasparov, the reigning world chess champion, in a six-game match under standard tournament conditions. The final score was 3.5–2.5 in Deep Blue's favour. It was the first time a computer had beaten a world champion under match play conditions.
Deep Blue was not a learning system. It was a custom VLSI chip architecture capable of evaluating 200 million chess positions per second, combined with sophisticated opening books, endgame tablebases, and human grandmaster-tuned evaluation functions. What it demonstrated was that a sufficiently narrow task, given enough compute and handcrafted knowledge, could be mastered by a machine. It did not demonstrate general intelligence (Kasparov famously pointed out that Deep Blue played entirely differently from any human), but it demolished the assumption that the highest levels of human strategic game-playing were beyond computation.
Statistical Methods Displace Symbolic Approaches
Through the 1990s, machine learning shifted from rule-based methods to statistical methods. The key insight, articulated by researchers including Frederick Jelinek at IBM Research, was that learning statistical patterns from large corpora outperformed hand-crafted rules for tasks like speech recognition and natural language processing.
Support Vector Machines (Vladimir Vapnik and Corinna Cortes, 1995) provided a theoretically principled method for classification that worked well on high-dimensional data. SVMs dominated machine learning competitions through the early 2000s. Random forests (Leo Breiman, 2001) and gradient boosting provided reliable baselines on tabular data. The era's emphasis was on formal guarantees, interpretable models, and theory.
The web, growing rapidly through the 1990s, began providing something that had previously been unavailable: truly large datasets. The first recommender systems emerged — collaborative filtering algorithms that could predict what a user would like based on the preferences of similar users. Amazon's "customers who bought this also bought" and Netflix's queue recommendations were the consumer face of 1990s machine learning.
2000s: The Statistical Revolution Matures
Statistical NLP Displaces Rule-Based NLP
Through the 2000s, statistical natural language processing steadily outperformed rule-based NLP systems on most standard benchmarks. IBM's Watson team, preparing for the Jeopardy! challenge (which aired in 2011 but was developed through the 2000s), built massive ensembles of statistical NLP components.
The paradigm was: collect a large corpus, fit a statistical model to it, and use the model's probabilities to rank candidate answers. The models were not deep neural networks — they used n-grams, naive Bayes classifiers, conditional random fields, and logistic regression. But they worked, and they scaled with data in a way that rule-based systems could not.
ImageNet (2007)
In 2007, Fei-Fei Li, then at Princeton University (she would move to Stanford in 2009), began building ImageNet: a large-scale hierarchical image database organised according to the WordNet hierarchy. Li's insight was that the bottleneck in computer vision was not algorithm sophistication but data. The field was training on thousands of images. The real world had billions.
Using Amazon Mechanical Turk and a team of annotators, the ImageNet project assembled more than 14 million images labeled with more than 20,000 categories. In 2010, Li launched the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition in which teams trained classifiers on 1.2 million images spanning 1,000 categories and were evaluated on a 150,000-image test set.
For two years, the best systems used hand-crafted features. In 2012, everything changed.
The Netflix Prize (2006–2009)
In October 2006, Netflix announced the Netflix Prize: one million dollars to the team that could improve the accuracy of Netflix's movie recommendation algorithm (Cinematch) by 10% on a provided dataset. The competition attracted thousands of teams from around the world, drove three years of innovation in collaborative filtering, matrix factorisation, and ensemble methods, and was won in September 2009 by a team called BellKor's Pragmatic Chaos, which combined the work of dozens of independently good models into a single ensemble.
The Netflix Prize was enormously influential not because of what the winning system did — Netflix never actually deployed it, finding the engineering complexity too high for the marginal gain — but because it drew machine learning researchers into applied problems, created a culture of open publication and team collaboration in ML competitions, and established Kaggle-style competitions as a research methodology.
2010s: The Deep Learning Revolution
AlexNet (September 2012)
Much of modern AI traces to September 30, 2012. On that date, a team from the University of Toronto (Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton) submitted their results to the ImageNet Large Scale Visual Recognition Challenge.
Their system, known as AlexNet, achieved a top-5 error rate of 15.3% on the ImageNet test set. The second-place team, using hand-crafted features, achieved 26.2%. The margin was not just a win — it was a demolition. Computer vision had been plateauing for years; the AlexNet result suggested the entire field had been using the wrong approach.
AlexNet was a deep convolutional neural network with 60 million parameters trained on two NVIDIA GTX 580 GPUs over the course of a week. Its key innovations were: GPU training (making large networks practical), ReLU (rectified linear unit) activations instead of tanh or sigmoid, dropout regularisation to prevent overfitting, and data augmentation to artificially expand the training set. All of these are now standard practices. None of them were widely used before AlexNet.
Krizhevsky, Sutskever, and Hinton published "ImageNet Classification with Deep Convolutional Neural Networks" in Advances in Neural Information Processing Systems 2012. It is one of the most cited machine learning papers in history. Within a year, every major technology company had launched or expanded a deep learning research group. The talent wars had begun.
DeepMind, Atari, and Reinforcement Learning (2013–2016)
DeepMind, a London startup founded by Demis Hassabis, Shane Legg, and Mustafa Suleyman in 2010, published "Playing Atari with Deep Reinforcement Learning" in December 2013. The system, a Deep Q-Network (DQN), learned to play seven Atari 2600 games from raw pixel inputs and a score signal, outperforming previous approaches on six of them and beating a human expert on three (Breakout, Enduro, and Pong). A follow-up paper in Nature in 2015 extended the approach to 49 games, reaching human-level performance on 29 of them.
The paper combined deep convolutional networks (perceiving the game state from pixels) with Q-learning (a reinforcement learning algorithm) and an experience replay buffer. It was the first demonstration that a single algorithm architecture could learn to play multiple distinct games from scratch, without game-specific programming. Google acquired DeepMind for a reported $500 million in January 2014.
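The interplay of Q-learning and experience replay can be illustrated without a neural network. The sketch below swaps DQN's convolutional network for a plain lookup table and uses a hypothetical five-cell corridor as the environment, so it shows the update rule and the replay buffer rather than anything specific to Atari.

```python
import random
from collections import deque

N_STATES, GOAL = 5, 4      # hypothetical corridor; reaching cell 4 ends the episode
ACTIONS = (-1, +1)         # index 0 steps left, index 1 steps right


def step(state, action_index):
    """Move one cell, clamped to the corridor; reward 1.0 only on reaching the goal."""
    next_state = min(max(state + ACTIONS[action_index], 0), GOAL)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def train(episodes=200, gamma=0.9, alpha=0.5, batch_size=8, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    replay = deque(maxlen=1000)                     # experience replay buffer
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 100:
            action = rng.randrange(2)               # random behaviour; Q-learning is off-policy
            next_state, reward, done = step(state, action)
            replay.append((state, action, reward, next_state, done))
            # Learn from a random minibatch of stored transitions, not just the latest one
            batch = rng.sample(list(replay), min(batch_size, len(replay)))
            for s, a, r, s2, terminal in batch:
                target = r if terminal else r + gamma * max(q[s2])
                q[s][a] += alpha * (target - q[s][a])
            state, steps = next_state, steps + 1
    return q


q_table = train()
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

Sampling old transitions at random breaks the correlation between consecutive steps and lets each experience be reused many times, which is the role the replay buffer plays in DQN as well.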
In March 2016, DeepMind's AlphaGo defeated Lee Sedol, the world Go champion, 4–1. Go had been considered the game most resistant to AI: its branching factor (roughly 250 moves per turn) made exhaustive search impossible, and the game was thought to require intuition that could only be acquired through human-like experience. AlphaGo combined convolutional neural networks for position evaluation with Monte Carlo tree search, trained on both human games and self-play. The match was watched live by an estimated 200 million people.
Word2Vec and the Distributed Representation Revolution (2013)
In 2013, Tomas Mikolov and colleagues at Google published Word2Vec: a technique for learning dense vector representations of words from large text corpora. The key demonstration was that arithmetic over word vectors captured semantic relationships: king - man + woman ≈ queen, Paris - France + Italy ≈ Rome. Semantic similarity became cheaply computable, at web scale, as geometric distance in a continuous vector space.
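The vector arithmetic is easy to demonstrate on a toy example. The three-dimensional vectors below are hand-picked for illustration (roughly: royalty, maleness, femaleness) and are not learned embeddings; real Word2Vec vectors have hundreds of dimensions and are trained from text. The `analogy` helper is a hypothetical name.

```python
import math

# Hypothetical hand-made vectors, NOT trained Word2Vec output
VECTORS = {
    "king":  (0.9, 0.9, 0.1),
    "queen": (0.9, 0.1, 0.9),
    "man":   (0.1, 0.9, 0.1),
    "woman": (0.1, 0.1, 0.9),
    "apple": (0.0, 0.2, 0.2),
}


def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors=VECTORS):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))
```

Subtracting `man` removes the "maleness" direction and adding `woman` supplies "femaleness", leaving a point whose nearest neighbour is `queen`.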
Word2Vec transformed NLP in the same way ImageNet transformed vision: it provided a general pre-trained representation that could be fine-tuned for specific tasks, dramatically reducing the amount of labeled data required. Every subsequent development in NLP — including the transformer architecture and the large language model paradigm — built on the distributed representation insight.
GANs: Teaching Machines to Create (2014)
In 2014, Ian Goodfellow — then a PhD student at the Université de Montréal — described Generative Adversarial Networks in a paper that was submitted to NIPS after being developed, according to legend, in a single evening following a debate in a bar. The GAN framework pits two neural networks against each other: a generator that produces synthetic data, and a discriminator that tries to distinguish synthetic data from real data. Each network improves as the other improves, in a minimax game that converges (in the best case) to a generator that produces data indistinguishable from the real distribution.
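The minimax game has a compact formal statement. In the notation of the original paper, with $D(x)$ the discriminator's estimated probability that $x$ is real, $G(z)$ the generator's output for noise $z$, and $p_{\text{data}}$ and $p_z$ the data and noise distributions:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator maximises $V$ by scoring real samples near 1 and generated samples near 0; the generator minimises it by producing samples that push $D(G(z))$ toward 1.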
GANs opened the era of AI-generated imagery. By 2018, StyleGAN could produce photorealistic faces of people who did not exist. By 2020, GANs were generating synthetic video, deepfakes, and artistic images. Goodfellow received his PhD in 2014, joined Google Brain, then OpenAI, then Apple, then Alphabet — his career trajectory illustrating the extraordinary demand for deep learning expertise.
The Transformer: "Attention Is All You Need" (2017)
In June 2017, eight researchers at Google Brain and Google Research — Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin — published "Attention Is All You Need" in Advances in Neural Information Processing Systems 2017.
The paper introduced the Transformer architecture, which replaced recurrent networks (LSTMs and GRUs) with a mechanism called self-attention. Self-attention allows every position in a sequence to directly attend to every other position, computing weighted representations that capture context globally rather than through a chain of recurrent updates. This solved two fundamental problems with recurrent networks: they were sequential (could not be parallelised during training) and they suffered from vanishing gradients on long sequences.
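Scaled dot-product attention, the core of self-attention, fits in a few lines of plain Python. This sketch makes simplifying assumptions: it uses the token vectors directly as queries, keys, and values, whereas the Transformer first applies learned projection matrices and runs several attention heads in parallel. The three two-dimensional "tokens" are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Each position's output is a weighted average of ALL value vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # hypothetical token vectors
outputs, weights = self_attention(tokens, tokens, tokens)
print(weights[0])
```

Every row of the loop is independent of the others, which is the property that lets the real architecture compute all positions in parallel as a single matrix product.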
The Transformer could be trained massively in parallel on GPUs and TPUs. It scaled. And when it scaled, it worked better than any previous architecture on sequence tasks. GPT-1, BERT, GPT-2, GPT-3, PaLM, Claude, LLaMA, Gemini, GPT-4 — every frontier language model built in the years since has been a Transformer or a close variant. The 2017 paper is one of the most consequential in the history of computing.
BERT and the Pre-training Paradigm (2018)
In October 2018, Google researchers Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova published BERT (Bidirectional Encoder Representations from Transformers). BERT was pre-trained on 3.3 billion words using two objectives: masked language modeling (predict randomly masked tokens in a sentence) and next sentence prediction (predict whether two sentences are consecutive in the original document).
The resulting model, when fine-tuned on labeled data, achieved state-of-the-art results on 11 NLP tasks. The pre-training/fine-tuning paradigm (train a large general model on massive unlabeled data, then adapt it for specific tasks with small labeled datasets) was now the dominant approach in NLP.
GPT-1 and GPT-2 (2018–2019)
OpenAI's GPT (Generative Pre-trained Transformer) line began in June 2018 with GPT-1, a 117-million-parameter Transformer trained on the BookCorpus dataset using a language modeling objective (predict the next token). GPT-1 showed that a pre-trained language model fine-tuned on small labeled datasets could match task-specific models trained from scratch.
In February 2019, OpenAI published GPT-2, a 1.5-billion-parameter model trained on 40 gigabytes of web text (the WebText dataset, created by scraping links shared on Reddit). OpenAI made the unusual decision to stage the model's release, initially publishing only the smallest version and withholding the larger models because of misuse concerns, a stance widely summarised as GPT-2 being "too dangerous to release": it could generate convincing fake news articles, impersonate writing styles, and produce coherent long-form text.
The decision was controversial. Many researchers argued that the capability was not meaningfully more dangerous than what was already possible. Others credited OpenAI with raising the salience of AI safety concerns. GPT-2 was eventually released in full in November 2019 after OpenAI concluded that it had not observed catastrophic misuse of the staged releases. The episode established "responsible AI release" as a topic of genuine debate rather than mere public relations.
2020–2022: The GPT Era
GPT-3 and Few-Shot Learning (May 2020)
In May 2020, OpenAI published GPT-3: a 175-billion-parameter language model trained on text drawn from a dataset of roughly 499 billion tokens of web pages, books, and Wikipedia. The jump from GPT-2 (1.5B) to GPT-3 (175B) was a more than hundredfold increase in parameter count.
What emerged from that scale was unexpected: few-shot learning. Rather than requiring fine-tuning on labeled examples, GPT-3 could perform a new task when given just a few examples in the prompt, with no gradient updates and no parameter changes. It could translate languages, write code, solve math problems, summarise text, and answer questions, all from a natural language description of the task and a handful of demonstrations.
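A few-shot prompt is nothing more than text: a task description, a handful of solved examples, and an unsolved query for the model to complete. The sketch below assembles one. The helper name `build_few_shot_prompt` and the `Input:`/`Output:` layout are assumptions for illustration, not a format prescribed by the GPT-3 paper.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, solved examples, then the open query."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
print(prompt)
```

The model's weights never change; the examples only condition the next-token predictions that follow the final `Output:`.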
The GPT-3 paper, authored by Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, and more than two dozen others, built on the scaling laws that Kaplan and colleagues had described earlier in 2020: the systematic relationships between model size, dataset size, compute budget, and performance. The paper showed that performance improvements followed smooth power laws as scale increased, with no signs of hitting a ceiling at GPT-3's scale. This was the empirical foundation of the scaling hypothesis that would drive the industry for the next five years.
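The "smooth power law" claim can be made concrete with a small sketch. It assumes loss depends on parameter count N as L(N) = (N_c / N)^alpha. The constants below are placeholders loosely modelled on published language-model scaling fits and should be treated as assumptions, not as values taken from this article.

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Loss predicted by a power law in model size: L(N) = (N_c / N) ** alpha."""
    return (n_c / n_params) ** alpha

for n in (1.5e9, 1.75e11):
    print(f"{n:.2e} params -> predicted loss {power_law_loss(n):.3f}")

# On a log-log plot a power law is a straight line: every 10x increase in
# parameters multiplies the loss by the same constant factor, 10 ** -alpha.
ratio = power_law_loss(1e10) / power_law_loss(1e9)
print(f"loss ratio per 10x parameters: {ratio:.3f}")
```

The constant ratio is the point: under this model each order of magnitude of scale buys the same proportional improvement, with no built-in ceiling.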
AlphaFold 2 Solves Protein Folding (2020)
In November 2020, DeepMind announced that AlphaFold 2 had solved the protein structure prediction problem at the CASP14 competition. Determining the three-dimensional structure of a protein from its amino acid sequence (the protein folding problem) had been an open challenge in biology for 50 years. AlphaFold 2 achieved a median GDT score of 92.4 out of 100, making its predictions comparable in accuracy to experimental methods like X-ray crystallography.
AlphaFold 2 used an attention-based architecture (Evoformer) that jointly refined a multiple sequence alignment and a representation of pairwise relationships between residues. DeepMind released the model's code and its predictions for nearly the entire human proteome as open-access data in 2021, then expanded the AlphaFold Protein Structure Database to more than 200 million predicted structures in 2022. It is arguably the most significant scientific application of deep learning to date, with direct implications for drug discovery, disease research, and our understanding of biology.
DALL-E, Codex, and the Generative Expansion (2021)
In January 2021, OpenAI published DALL-E (a portmanteau of Salvador Dalí and Pixar's WALL-E): a 12-billion-parameter version of GPT-3 trained on image-text pairs to generate images from text descriptions. DALL-E could create images of "a snail made of harps" or "a corgi wearing a top hat" with compositional understanding that no previous image generation system had demonstrated.
In August 2021, OpenAI published Codex, a GPT model fine-tuned on code from 54 million public GitHub repositories to generate code from natural language descriptions. Codex achieved a 28.8% pass rate on HumanEval (a benchmark of 164 handwritten Python programming problems) when given a single attempt per problem; humans scored approximately 48%. Codex was not as good as a senior engineer, but it was well past the threshold of being useful.
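HumanEval results are reported as pass@k: the probability that at least one of k sampled solutions passes a problem's unit tests. The sketch below implements the unbiased estimator described in the Codex paper, 1 - C(n - c, k) / C(n, k), for n samples of which c pass. The per-problem counts in `results` are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples of which c pass the tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random draw of k samples contains no passing sample.
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        return 1.0  # every draw of k samples must include a passing one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem results: (samples generated, samples that passed).
results = [(10, 3), (10, 0), (10, 10), (10, 1)]
score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
print(f"pass@1 = {score:.3f}")
```

The benchmark score is the mean of the per-problem estimates, so a single headline percentage hides how unevenly success is spread across problems.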
GitHub Copilot, the commercial product built on Codex, launched in technical preview in June 2021 and became generally available in June 2022. It was the first mass-market product built on an LLM — used directly within developers' IDEs to autocomplete code, suggest functions, and explain existing code. Within a year, GitHub reported that Copilot was writing more than 40% of the code in files where it was enabled, and that developers who used it shipped code significantly faster.
ChatGPT: The Consumer Breakthrough (November 2022)
On November 30, 2022, OpenAI released ChatGPT, a fine-tuned version of GPT-3.5 trained with Reinforcement Learning from Human Feedback (RLHF) to be helpful, harmless, and honest in conversation. The technical innovation was not the model itself but the training technique: RLHF, developed by Paul Christiano, Jan Leike, and others, used human raters to score model outputs and trained a reward model on those scores, which was then used to fine-tune the base model via reinforcement learning.
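The reward-model step can be sketched with a pairwise preference loss. This assumes a common formulation in which raters compare two responses and the reward model is trained so that the preferred response scores higher, using the loss -log(sigmoid(r_chosen - r_rejected)). The function name and the example reward values are hypothetical, and the exact loss used for ChatGPT is not stated in this article.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)) for one comparison."""
    margin = r_chosen - r_rejected
    if margin < -30.0:
        # log(1 + exp(-margin)) is approximately -margin here; this branch
        # also avoids overflow in exp() for extreme values.
        return -margin
    return math.log1p(math.exp(-margin))

# The reward model agrees with the rater: small loss.
print(preference_loss(2.0, 0.5))
# The reward model disagrees with the rater: larger loss.
print(preference_loss(0.5, 2.0))
```

Minimising this loss over many rater comparisons yields a scalar reward function, which the reinforcement learning stage then maximises when fine-tuning the base model.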
ChatGPT reached 1 million users in 5 days and 100 million users in approximately 2 months. At the time this was the fastest adoption of any consumer application on record, surpassing TikTok (9 months to 100 million) and Instagram (2.5 years). It crossed into mainstream cultural awareness in a way that GPT-3 had not, because it was a conversational interface that anyone could use without technical knowledge.
The ChatGPT moment changed the industry's structure. Microsoft committed $10 billion to OpenAI. Google declared an internal "code red" and accelerated the release of Bard. Amazon, Meta, Anthropic, Cohere, AI21 Labs, and Inflection AI all received or announced major funding rounds. AI had found its consumer product.
2023–2024: The Productisation Wave
GPT-4, Claude, and Gemini (2023)
March 2023 brought GPT-4, OpenAI's multimodal frontier model. GPT-4 could accept both text and image inputs, scored in the top 10% of the Uniform Bar Exam, and demonstrated significantly improved reasoning compared to GPT-3.5. OpenAI did not publish GPT-4's parameter count or training data details, citing the competitive landscape and safety implications. This was a significant departure from the open research culture that had characterised the field since 2012.
Anthropic, founded in 2021 by former OpenAI researchers Dario Amodei, Daniela Amodei, and others, released the Claude model family. Claude's distinguishing characteristic was its training approach: Constitutional AI (CAI), a method that used a set of written principles rather than just human rater scores to guide fine-tuning, aiming for more consistent and interpretable alignment. Claude 2 (2023) featured a 100,000-token context window, roughly twelve times GPT-4's standard 8,192-token window at launch and about three times its 32,768-token variant.
Google DeepMind released Gemini in December 2023 as a multimodal model family spanning sizes from Nano (on-device) to Ultra (frontier). Gemini Ultra reported state-of-the-art performance on the MMLU benchmark (which tests knowledge across 57 academic subjects), scoring 90.0% with a chain-of-thought setup that sampled 32 responses per question rather than the standard 5-shot prompting setup. The release marked Google's most concerted effort to match GPT-4 on public benchmarks.
Sora: Video Generation Arrives (February 2024)
In February 2024, OpenAI publicly demonstrated Sora, a diffusion transformer model capable of generating high-definition video up to one minute long from text prompts. Sora's outputs — photorealistic scenes of Tokyo city streets, woolly mammoths in snowy fields, people surfing ocean waves — demonstrated a level of physical plausibility and temporal consistency far beyond any previous video generation system.
Sora used a unified representation of video as sequences of compressed "spacetime patches," allowing the same architecture to generate video at different resolutions and durations. OpenAI limited access to safety evaluators and select creators initially. The demo was widely discussed as a potential disruption to stock video, advertising production, and eventually feature filmmaking.
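The reason a patch representation handles varying resolutions and durations is that a video simply becomes a longer or shorter sequence of tokens. The sketch below counts non-overlapping patches in a video volume. The patch dimensions are hypothetical, since OpenAI has not published Sora's actual patch or latent sizes, and the input is assumed to be the already-compressed latent volume.

```python
def count_spacetime_patches(frames, height, width, patch_t=4, patch_h=16, patch_w=16):
    """Number of non-overlapping (time, height, width) patches in a video volume."""
    if frames % patch_t or height % patch_h or width % patch_w:
        raise ValueError("video dimensions must be divisible by the patch size")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)

# Two clips with different durations and shapes become sequences of
# different lengths, which a Transformer can process without changes.
short_clip = count_spacetime_patches(frames=16, height=64, width=64)
long_clip = count_spacetime_patches(frames=64, height=128, width=64)
print(short_clip, long_clip)
```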
The Open-Source Explosion (2023–2024)
Meta's decision to release the weights of Llama (February 2023, under a research-only license) and Llama 2 (July 2023) fundamentally changed the competitive landscape. Llama 2 came in 7B, 13B, and 70B parameter versions, with a license that permitted commercial use for all but the very largest companies, subject to an acceptable use policy. Within weeks, the open-source community had fine-tuned it into hundreds of specialised variants.
Mistral AI, a Paris-based startup founded by former DeepMind and Meta researchers, released Mistral 7B in September 2023 under the Apache 2.0 license, a permissive open-source license that allows commercial use and modification. Mistral 7B outperformed Llama 2 13B on most benchmarks despite being roughly half the size, demonstrating that architectural improvements could compensate for smaller scale. Mixtral 8x7B (December 2023) used sparse mixture-of-experts routing, in which each token is processed by only two of eight expert sub-networks, to match or exceed Llama 2 70B and GPT-3.5 on most benchmarks at a fraction of the inference cost.
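Sparse mixture-of-experts routing can be sketched in a few lines: a gate scores every expert, only the top k are run, and their outputs are combined using softmax weights over the selected scores. The toy below uses eight scalar "experts" and k = 2. The gate logits are supplied by hand, whereas a real model computes them per token with a learned router, and all names here are hypothetical.

```python
import math

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(gate_logits, k))

# Eight toy "experts": each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
print(top_k_route(gate_logits))
print(moe_forward(1.0, gate_logits, experts))
```

Only two of the eight experts execute for this input, which is why inference cost tracks the active parameters rather than the total parameter count.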
The open-source explosion created a genuine dual-track ecosystem: proprietary frontier models from OpenAI, Google, Anthropic, and others, versus open-weight models that could be run locally, fine-tuned privately, and deployed without per-query costs or API dependencies.
Regulation: The EU AI Act and the Biden EO (2023–2024)
The EU AI Act, approved by the European Parliament in March 2024 and entering into force in August 2024, was the first comprehensive AI regulation globally. It created a risk-based framework: AI systems were classified as unacceptable risk (banned), high risk (subject to conformity assessments and documentation requirements), limited risk (transparency requirements), or minimal risk (no specific requirements). General-purpose AI models with systemic risk, presumed when cumulative training compute exceeds 10^25 FLOPs, faced additional transparency and safety evaluation requirements.
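To give a feel for where the 10^25 FLOP line sits, the sketch below estimates training compute with the widely used rule of thumb of about 6 floating-point operations per parameter per training token for a dense Transformer. That rule of thumb is an assumption brought in for illustration and is not part of the regulation's text; the two models shown are hypothetical.

```python
SYSTEMIC_RISK_FLOPS = 1e25  # threshold cited in the text above

def estimate_training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute for a dense Transformer: 6 * N * D."""
    return 6.0 * n_params * n_tokens

def exceeds_threshold(n_params, n_tokens, threshold=SYSTEMIC_RISK_FLOPS):
    """True if the estimated training compute is above the threshold."""
    return estimate_training_flops(n_params, n_tokens) > threshold

# Hypothetical models, not real systems.
print(exceeds_threshold(70e9, 2e12))   # 70B parameters, 2T tokens: about 8.4e23 FLOPs
print(exceeds_threshold(1e12, 10e12))  # 1T parameters, 10T tokens: about 6.0e25 FLOPs
```

Under this estimate, a 70-billion-parameter model trained on a couple of trillion tokens falls more than an order of magnitude below the line, so the threshold targets only the largest training runs.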
The Biden Executive Order on AI Safety (October 2023) required developers of the most powerful AI models to share safety testing results with the US government before public deployment, established new standards for AI-generated content watermarking, and directed federal agencies to develop sector-specific AI guidance.
The AI Safety Summit at Bletchley Park (November 2023) was the first multilateral government meeting dedicated to AI risk, attended by representatives from 28 countries, including the US, UK, and China, along with the EU. The Bletchley Declaration affirmed shared concern about frontier AI risks and committed signatories to international information sharing. A follow-up summit was held in Seoul in May 2024.
2025–2026: The Agentic Era
AI Coding Agents Go to Work (2025)
The defining shift of 2025 was the emergence of agentic AI systems that could take sequences of actions across long time horizons, using tools (web browsers, code interpreters, file systems, external APIs) rather than simply generating text.
Claude Code, launched by Anthropic in 2025, was a terminal-based AI coding assistant that could read entire codebases, write and execute code, run tests, interpret errors, and iterate until a task was complete — without requiring human intervention at each step. Unlike GitHub Copilot, which operated at the autocomplete level, Claude Code operated at the task level: "add authentication to this application," "fix the bug in these tests," "refactor this module to use the new API."
Devin, launched by Cognition Labs in March 2024 and refined through 2025, was marketed as the "first AI software engineer." Devin could be assigned a GitHub issue, write the code, run the tests, and open a pull request. Cognition reported that its success rate on real-world software engineering tasks (the SWE-bench benchmark) was significantly higher than any previous system, though far below that of senior human engineers.
GitHub Copilot Workspace, first shown in technical preview in 2024, extended Copilot from in-editor autocomplete to a full development environment where an AI agent could plan and implement multi-file changes across an entire repository. Microsoft reported that organisations using Copilot Workspace saw measurable reductions in the time from issue creation to pull request.
The paradigm shift was conceptual as much as technical. The software development workflow began to evolve from "engineer writes code, AI suggests completions" to "engineer specifies intent, AI drafts implementation, engineer reviews and adjusts." This is what researchers mean by loop engineering and harness engineering — designing the scaffolding, tool access, and feedback mechanisms within which AI agents operate, rather than writing the code itself. For a deeper exploration of the agentic era's implications, see The Agentic Era: How AI Agents Will Transform Everything (2026-2030).
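The harness idea can be sketched as a loop that wraps a model call with a tool call and feeds the tool's output back in. The code below is a minimal illustration under stated assumptions: `run_agent`, `fake_model`, and `fake_tests` are hypothetical names, the "model" is a stub rather than a real LLM API, and no actual product is claimed to work this way.

```python
def run_agent(task, propose_patch, run_tests, max_steps=5):
    """Iterate: propose a change, run the tests, feed failures back, repeat."""
    feedback = None
    history = []
    for step in range(1, max_steps + 1):
        patch = propose_patch(task, feedback)  # model call (stubbed below)
        passed, feedback = run_tests(patch)    # tool call: execute the test suite
        history.append((step, patch, passed))
        if passed:
            return patch, history
    return None, history  # step budget exhausted; hand back to a human reviewer

# Stubs standing in for a real model and a real test runner.
def fake_model(task, feedback):
    # First attempt is wrong; once it sees an error message it "fixes" the bug.
    return "return a + b" if feedback else "return a - b"

def fake_tests(patch):
    ok = patch == "return a + b"
    message = None if ok else "AssertionError: add(2, 2) returned 0"
    return ok, message

patch, history = run_agent("fix add()", fake_model, fake_tests)
print(patch, len(history))
```

The engineering effort sits in the parts around the model: which tools are exposed, what feedback is returned, and when the loop stops and escalates to a person.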
Frontier Model Proliferation (2025–2026)
The 2025–2026 period saw rapid release cycles across all major labs. GPT-5.5 (OpenAI), Claude Opus 5 (Anthropic), and Gemini Omni (Google DeepMind) all competed at the frontier, with each releasing incremental versions on multi-month cycles. The marketing around each release emphasised agentic capabilities, code generation, multimodal reasoning, and context window size as differentiating factors.
The "capability plateau" question — whether scaling was still delivering reliable gains, or whether diminishing returns were beginning — became a central debate in the AI community. Some researchers pointed to the continued improvement on difficult benchmarks as evidence that scaling continued to pay off. Others noted that the rate of surprise from new releases seemed to be declining: GPT-4 had shocked the field; GPT-5.5 improved on it but did not produce the same shock.
SpaceX's acquisition of Cursor for $60 billion in 2026 was one of the year's most significant business events. Cursor, an AI-native code editor built on top of frontier models, had become one of the most widely used development tools in the industry. The acquisition signalled that AI coding infrastructure had become genuinely strategic — worth more, in SpaceX's calculus, than most major aerospace companies.
DeepMind's "From AGI to ASI" (June 2026)
On June 10, 2026, a fourteen-author team from Google DeepMind published "From AGI to ASI" (arXiv:2606.12683), a 57-page investigation into how AI development might proceed after reaching human-level artificial general intelligence. The paper crossed 54,000 views within days.
The paper's contribution was structural: rather than asking whether AGI would arrive, it treated its arrival as a near-term reference point and asked what would come after. It defined Artificial Superintelligence (ASI) as a system more intelligent than large organisations of humans — not just a single human — and mapped four pathways to get there: scaling AGI (continuing to improve the same paradigm), AI paradigm shifts (entirely new architectures or training approaches), recursive improvement (AI systems that improve their own training), and multi-agent collectives (ASI emerging from coordination among many AGI-level agents).
The paper's key insight was that the transition from AGI to ASI might not be a single dramatic step change but "a series of transformative societal changes caused by AI-enabled progress and breakthroughs across many areas of science and technology" — analogous to historical periods of rapid technological change rather than to a single invention moment. For a detailed analysis of the paper's four pathways and their implications, see From AGI to ASI: DeepMind's 57-Page Roadmap for What Comes After Human-Level AI.
The Architecture That Made It All Possible
To understand why the 2017–2026 period accelerated so rapidly, it helps to understand that nearly every major system described in the last three sections (GPT-4, Claude, Gemini, AlphaFold 2, Sora, DALL-E 3, Stable Diffusion, and the coding agents) is built on the Transformer architecture introduced in "Attention Is All You Need" or on the attention mechanism at its core. The self-attention mechanism scales efficiently with compute, parallelises across modern GPU hardware, and learns rich contextual representations from massive datasets in ways that earlier architectures could not match. If you want to understand why AI accelerated when it did, understanding the Transformer is the single most important technical foundation. A full technical explanation is available at What Is the Transformer Architecture? Attention, Self-Attention, and the Engine of Modern AI.
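For readers who want to see the core operation, here is a minimal sketch of scaled dot-product self-attention, softmax(Q K^T / sqrt(d_k)) V, in plain Python. It is a simplification under stated assumptions: the learned projection matrices that produce queries, keys, and values are omitted (the inputs are used directly), there is a single head, and the toy vectors are invented for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d_k) for k in keys])
        # Each output is a weighted average of all value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Three toy token vectors; Q, K and V are taken to be the inputs themselves.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(tokens, tokens, tokens):
    print([round(x, 3) for x in row])
```

Every output row depends on every input row, and the rows can be computed independently of one another, which is the property that lets the operation parallelise well on GPU hardware.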
The Logic of 76 Years
Looking at the full arc from 1950 to 2026, several patterns emerge.
Progress has been discontinuous. The field was not a smooth upward curve. It lurched between winters and springs, driven by the gap between promise and delivery, between what researchers claimed was possible and what actually worked. The two AI winters were not failures of imagination but failures of the supporting infrastructure: not enough data, not enough compute, not enough understanding of how to train large networks.
The enabling technologies arrived from outside AI. The deep learning revolution of 2012 was powered by GPU hardware originally designed for games. The data foundation for language models was the web, created for communication rather than training. AlphaFold 2's success depended on decades of experimental protein structure data accumulated by biologists. AI advances when the external infrastructure is ready, not just when the algorithms are.
The scaling hypothesis has held for a decade. From AlexNet through GPT-4, consistent increases in model size, training data, and compute have delivered consistent improvements in performance. This has been the field's most reliable empirical finding and the business logic underlying billions of dollars of investment. Whether scaling continues to hold at the frontier is the most important open question in AI today.
Each era's defining question becomes the next era's solved problem. In 1950, the question was whether machines could simulate intelligence at all. In the 1960s, whether they could converse. In the 1970s, whether they could encode expert knowledge. In the 1980s, whether that knowledge could be useful in practice. In the 1990s, whether machines could beat humans at games. In the 2000s, whether statistical learning could match hand-crafted rules. In the 2010s, whether deep networks could perceive the world. In the early 2020s, whether language models could reason. In 2026, the question is whether AI can act — autonomously, reliably, across the full span of cognitively demanding human work.
The answer, emerging in real time, is shaping up to be yes. The implications of that answer are what the field is now trying to understand. For a framework for understanding how AI, machine learning, and deep learning relate to each other across this entire history, see AI vs Machine Learning vs Deep Learning — What's Actually Different?.