explainx / blog
Meta Brain2Qwerty v2: Reading Your Thoughts Without Surgery
Meta FAIR's Brain2Qwerty v2 decodes full sentences from non-invasive MEG at 39% average WER, and 22% for the best participant. No implants, no surgery.
Jun 25, 2026
People with ALS, locked-in syndrome, or anarthria lose the ability to speak or move, but their brains keep generating language. Invasive brain-computer interfaces (BCIs) have restored communication for some patients, at the cost of neurosurgery, infection risk, and long-term implant maintenance. For years, non-invasive alternatives have lagged far behind.
On June 25, 2026, Meta FAIR published Brain2Qwerty v2, a model that decodes full typed sentences solely from magnetoencephalography (MEG) signals: no implants, no surgery. The average word error rate (WER) across 9 participants was 39%. The best participant reached 22% WER, with 47% of sentences decoded within one word of the target.
That is not perfect. But it is the first time non-invasive brain-to-text decoding has operated anywhere near the accuracy range once thought exclusive to surgical implants.
| Metric | Brain2Qwerty v2 |
|---|---|
| Average Word Error Rate (WER) | 39% |
| Best participant WER | 22% |
| Best participant: % perfect sentences | 28% |
| Best participant: % sentences with ≤1 word error | 47% |
| Training data | 22,000 sentences, 9 subjects, ~90 hours total MEG |
| Brain recording method | Non-invasive MEG (306-channel Megin system) |
| Architecture | CTC Encoder → Word Aligner → Fine-tuned LLM (Qwen3-4B) |
| Open source | github.com/facebookresearch/brain2qwerty |
| Publication date | June 25, 2026 |
| Previous best (non-invasive) | ~52% WER (Brain2Qwerty v1) |
| Best invasive BCI | ~2% WER (Jude et al., 2026) |
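Since every headline number above is a word error rate, it helps to see how WER is computed. The sketch below uses the standard definition (word-level edit distance divided by the number of reference words); the function name and the example sentences are illustrative and are not taken from the Brain2Qwerty code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a sentence "within one word of the target" is one whose word-level edit distance is at most 1.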
Invasive BCIs (electrodes implanted directly over the motor cortex) pick up clean, high-resolution neural signals. They have restored typed communication at near-natural rates for patients with ALS and locked-in syndrome. The catch: neurosurgery carries real risks (infection, inflammation, signal degradation over months). Scaling this to millions of patients is practically impossible.
Non-invasive alternatives like EEG suffer from poor signal-to-noise ratio. MEG offers better spatial resolution at the same millisecond timescale, but until Brain2Qwerty, decoding full sentences from it was considered too hard. The v1 paper (Lévy et al., 2025) achieved a 32% character error rate (CER), but it required knowing the exact timing of each keystroke in advance, making real-time use impossible.
Brain2Qwerty v2 removes all of these constraints at once.
The pipeline has three jointly trained components, each handling a different level of linguistic abstraction.
The Encoder takes a continuous MEG window, with no keystroke timing required, and outputs a sequence of character predictions. It uses a BrainModule (convolutional feature extractor + subject-specific spatial merging) followed by a 4-layer Conformer, trained with a Connectionist Temporal Classification (CTC) objective.
The key insight: asynchronous CTC decoding previously underperformed synchronous (keystroke-locked) decoding on small datasets. With 10× more data per subject (10 hours vs. 1 hour in v1), the gap collapsed to just 2%. Data scale unlocked asynchronous decoding.
Performance scales log-linearly with training data volume (Pearson r = −0.99 between log₁₀(hours) and CER). The scaling curve shows no saturation at 90 hours of pooled data, which suggests that more recording time is a direct path to better performance.
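CTC is what lets the Encoder work without keystroke timing: it emits one label per time frame, and the text is recovered by merging repeated labels and dropping a special blank symbol. Below is a minimal sketch of that greedy collapse step. The blank symbol "_" and the example frames are assumptions for illustration, and the repository may well use beam search rather than greedy decoding.

```python
BLANK = "_"  # hypothetical blank symbol; real vocabularies usually use an index


def ctc_greedy_collapse(frame_labels):
    """Collapse per-frame CTC labels into text.

    Rule: merge consecutive duplicates, then drop blanks. A blank between two
    identical characters keeps them as two separate characters.
    """
    decoded = []
    previous = None
    for label in frame_labels:
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)


# Eleven frames collapse to five characters; the blank between the two
# "l" runs is what preserves the double letter.
print(ctc_greedy_collapse(list("hh_e_ll_llo")))
```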
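A log-linear trend of this kind can be checked with an ordinary least-squares fit of CER against log10(hours). The sketch below shows the calculation only. The data points are synthetic placeholders chosen to lie on a straight line; they are not values from the paper, and the function name is made up for this example.

```python
import math


def fit_log_linear(hours, cer):
    """Fit cer ~ intercept + slope * log10(hours); also return Pearson r."""
    xs = [math.log10(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cer) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in cer)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cer))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, pearson_r


# SYNTHETIC numbers for illustration only (not from the paper).
hours = [1, 10, 100]
cer = [0.8, 0.6, 0.4]
slope, intercept, r = fit_log_linear(hours, cer)
print(slope, intercept, r)
```

A negative slope with r close to −1 is what "log-linear scaling with no saturation" looks like numerically: each tenfold increase in data removes a roughly constant amount of error.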
Raw MEG embeddings cannot be fed directly into an LLM. The Aligner uses a SigLIP contrastive loss to learn word-level alignment between the Encoder's MEG embeddings and the LLM's word embedding space.
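For readers unfamiliar with SigLIP, its loss treats every (MEG chunk, word) pair in a batch as an independent binary classification: matching pairs should score high, all other pairs low. The following is a minimal pure-Python sketch of that pairwise sigmoid loss. The temperature and bias values are the initial values from the SigLIP paper and are learnable there; how Brain2Qwerty v2 sets them is not stated in this article, so treat them as assumptions.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def _normalise(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def siglip_loss(meg_emb, word_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss; row i of each input is assumed to be a match."""
    meg = [_normalise(v) for v in meg_emb]
    words = [_normalise(v) for v in word_emb]
    n = len(meg)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(meg[i], words[j]))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            # -log(sigmoid(label * logit)) == softplus(-label * logit)
            total += _softplus(-label * logit)
    return total / n


aligned = siglip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = siglip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Unlike a softmax-based contrastive loss, each pair contributes independently, so no normalisation over the whole batch is needed.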
The team's CTC Tokenizer segments the continuous MEG stream into word-like chunks wherever the CTC path predicts a space character. Since spaces are frequent (19% of characters) and robustly predicted, 86% of sentences have their word count estimated within ±1 word of the ground truth. This is far more accurate than fixed-patch tokenization or single-sentence embeddings, and it is what enables the LLM to read MEG as structured token input.
The final stage is a Qwen3-4B language model, fine-tuned with LoRA (Low-Rank Adaptation) using a technique called Model Soup. Rather than training one joint adapter on all 9 subjects, the team trained a separate LoRA adapter per subject, then averaged the weights uniformly. This single averaged adapter outperformed both per-subject adapters and a jointly trained adapter as LLM size scaled from 0.6B → 1.7B → 4B parameters.
The LLM receives a prompt of the form CTC: [character predictions]\nMEG: [word embeddings]\nOutput: and then autoregressively generates the decoded sentence. Ablating the MEG tokens (leaving only CTC text) degrades WER by 5.6 points, confirming that the LLM is actively reading neural signal, not just correcting character-level noise.
One of the most striking elements of Brain2Qwerty v2 is the use of autonomous AI coding agents to optimize the training pipeline, a technique the authors call Auto Research, inspired by Karpathy's autoresearch project.
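The space-driven segmentation can be pictured as a scan over per-frame CTC predictions that opens a chunk at the first non-space frame and closes it at the next predicted space. This is a rough sketch of the idea, not the repository's tokenizer: the function name, the "_" blank symbol, and the example frames are all assumptions.

```python
def split_frames_at_spaces(frame_labels, space=" "):
    """Return (start, end) frame spans for word-like chunks.

    `end` is exclusive. Frames predicted as a space separate chunks and
    belong to none of them; blank frames stay inside the current chunk.
    """
    spans = []
    start = None
    for t, label in enumerate(frame_labels):
        if label == space:
            if start is not None:
                spans.append((start, t))
                start = None
        elif start is None:
            start = t
    if start is not None:
        spans.append((start, len(frame_labels)))
    return spans


# Frames for "hi yo_u": one predicted space yields two word-like chunks.
frames = list("hi_ yo_u")
print(split_frames_at_spaces(frames))
```

The number of spans is the estimated word count, and the MEG embeddings inside each span could then be pooled into one token per word before being passed to the Aligner.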
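Uniform weight averaging is simple enough to show in a few lines. In this sketch each adapter is a plain dictionary from parameter name to a flat list of floats; a real implementation would operate on tensors from a checkpoint, and the parameter names here are hypothetical.

```python
def uniform_soup(adapters):
    """Average several adapters parameter by parameter with equal weights."""
    if not adapters:
        raise ValueError("need at least one adapter")
    keys = adapters[0].keys()
    for adapter in adapters[1:]:
        if adapter.keys() != keys:
            raise ValueError("all adapters must share the same parameter names")
    n = len(adapters)
    return {
        name: [sum(values) / n for values in zip(*(a[name] for a in adapters))]
        for name in keys
    }


# Two toy "per-subject" adapters with hypothetical parameter names.
subject_0 = {"lora_A": [1.0, 3.0], "lora_B": [0.0, 2.0]}
subject_1 = {"lora_A": [3.0, 5.0], "lora_B": [4.0, 2.0]}
print(uniform_soup([subject_0, subject_1]))
```

One caveat: averaging the LoRA A and B factors separately is not the same as averaging their products B·A. Which of the two the authors use is not stated in this article.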
Three independent agents (Cursor, powered by Claude Opus 4.6) each ran 10 rounds of 50 SLURM jobs, starting from a deliberately minimal configuration with only 4 exposed hyperparameters: learning rate, weight decay, LoRA rank, and batch size. Each agent had full filesystem and terminal access to a dedicated git worktree.
The results:
All three agents independently discovered the same core improvements:
Terse prompt labels (CTC:, MEG:, Output:) instead of verbose instructions.
This is the same pattern we are starting to see across AI research: human researchers define the search space and constraints, then AI agents discover optimal configurations within it. What Auto Research could not do: start from v1 and recreate v2 from scratch. When given an open-ended objective with no architecture constraints, agents produced broken jobs and stalled. Human-defined structure remains essential.
The same dynamic (AI as a force multiplier inside researcher-defined constraints) is central to agentic coding tools like Claude Code.
For those building AI-assisted research pipelines, this mirrors patterns covered in our guide to AI agents and automated model training.
The team ran a controlled experiment: 128 unique sentences × 2 repetitions versus 256 unique sentences × 1 repetition (matched total sentence count, matched subjects). Unique-sentence training achieved significantly lower CER than repeated-sentence training (0.45 vs. 0.65, p < 0.001).
The takeaway: diversity is an independent axis of data quality. More repetitions of the same sentences do not substitute for more varied sentences. This has direct implications for future data collection protocols, and it parallels debates in LLM pre-training about data diversity vs. volume.
Three training regimes were compared:
| Regime | Best Subject WER | Median Subject WER |
|---|---|---|
| Per-subject only | 38.3% | 66.5% |
| Leave-one-out + finetune | 32.8% | 58.6% |
| Joint training (all subjects) | 22.6% | 47.8% |
Joint training provides the biggest gains, but the LOO + finetune result matters most for clinical deployment: you can pretrain on existing subjects, then fine-tune on a new patient with the Conformer frozen, and recover most of the performance without a full retrain. This is the path toward adapting non-invasive BCIs to patients who cannot generate labeled training data the same way healthy volunteers can.
The 306-channel cryogenic MEG system used in this study is large, expensive, and requires magnetic shielding. Optically pumped MEG (OPM) devices are emerging alternatives: wearable, room-temperature, typically 50–150 sensors.
The team ran sensor subsampling ablations:
| Sensors | WER Change vs. Full Array |
|---|---|
| 230 sensors (75%) | +3.4 pp |
| 153 sensors (50%) | +5.7 pp |
| 76 sensors (25%) | +11.4 pp |
An OPM-class helmet with ~150 sensors loses only ~5.7 WER points versus the 306-channel baseline, suggesting that once the pipeline matures, portable non-invasive decoding is within reach.
Brain2Qwerty v2 produces either near-perfect output or coherent but incorrect sentences, a qualitatively different failure mode from that of the character-level N-gram model.
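A sensor-subsampling ablation of this kind amounts to keeping a fixed subset of channels and retraining or re-evaluating on the reduced array. The sketch below picks a reproducible random subset of channel indices. Random selection is an assumption made for illustration; this article does not say how the authors chose which sensors to keep.

```python
import random


def subsample_sensors(n_total=306, n_keep=153, seed=0):
    """Pick a reproducible random subset of sensor indices (sorted)."""
    if not 0 < n_keep <= n_total:
        raise ValueError("n_keep must be between 1 and n_total")
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n_keep))


# Keep 153 of 306 channels, the 50% row in the table above.
kept = subsample_sensors(n_total=306, n_keep=153)
print(len(kept), kept[:5])
```

The kept indices would then be used to slice the channel axis of each MEG recording before it reaches the Encoder.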
For the best subject, 28% of test sentences are decoded perfectly and 47% within one word error. For the worst subject: 4% perfect, but errors are still grammatical ("my homework is due tomorrow" instead of "cars are not allowed on this road").
The N-gram model produces lower CER because it makes smaller local corrections, but it generates lexically incoherent output ("WAS THE DISH THAT YOU NIGHTY IN THE LP BUT"). Brain2Qwerty v2 generates sentences that are wrong in a human-like way, not a character-soup way. For actual communication, that is the right trade-off.
The medical AI landscape is expanding rapidly, from AI-generated imaging in clinical contexts to non-invasive brain decoding. Each wave surfaces its own set of safety questions that researchers and regulators have yet to fully resolve.
Brain2Qwerty v2 used Llama 4 to generate the 20,000-sentence training stimulus pool, selected from a filtered, contraction-free subset. The training code is open at github.com/facebookresearch/brain2qwerty, covering both v1 and v2 pipelines. The BCBL Spanish dataset (v1) is also being released.
This is consistent with Meta's research philosophy: publish the method, release the code, enable the community to extend it. For background on Meta's open-source AI strategy, see our Meta Llama 4 guide. The same open science ethos is emerging in AI-assisted biology; see Biohub's Virtual Biology Initiative for a parallel bet on open multimodal data for medicine.
On the memory and context side, the use of an LLM to maintain sentence-level semantic coherence in the decoder parallels what Perplexity is doing with persistent memory in agentic contexts; see our coverage of Perplexity Brain for comparison.
The training code for both v1 and v2 is open source at github.com/facebookresearch/brain2qwerty. Here is what you need to get started.
MEG recordings (.fif or similar)
```bash
git clone https://github.com/facebookresearch/brain2qwerty.git
cd brain2qwerty
pip install -e ".[dev]"
```
The repo uses the neuralset and neuraltrain libraries (also from Meta FAIR) for data loading and training infrastructure.
The SpanishBCBL dataset (v1, 19 participants) is being released by the Basque Center on Cognition, Brain and Language (BCBL). The EnglishBCBL dataset (v2, 9 participants, 22,000 sentences, ~90 hours) has a separate release timeline; check the repo README for updates.
If you have your own MEG data in a delayed-typing paradigm, preprocessing steps are:
```bash
python train_encoder.py \
  --dataset path/to/englishbcbl \
  --output_dir checkpoints/encoder \
  --epochs 150 \
  --batch_size 64 \
  --lr 8e-4 \
  --weight_decay 1e-3
```
The Encoder uses a 4-layer Conformer with model dimension 1024. Full training on 8× A100s takes approximately 19.5 hours.
Once the encoder checkpoint is ready, run per-subject LoRA fine-tuning on Qwen3-4B:
```bash
# Fine-tune one LoRA adapter per subject
for subject in 0 1 2 3 4 5 6 7 8; do
  python train_llm.py \
    --encoder_ckpt checkpoints/encoder/best.pt \
    --subject "$subject" \
    --lora_rank 128 \
    --lora_alpha 256 \
    --epochs 30 \
    --output_dir "checkpoints/lora_s${subject}"
done

# Average adapter weights (Model Soup); {0..8} relies on bash brace expansion
python model_soup.py \
  --adapters checkpoints/lora_s{0..8} \
  --output checkpoints/soup.pt
```
```bash
python decode.py \
  --meg_file path/to/subject_meg.fif \
  --encoder_ckpt checkpoints/encoder/best.pt \
  --lora_ckpt checkpoints/soup.pt \
  --beam_size 16 \
  --output decoded_sentences.txt
```
The decoder outputs the most likely sentence for each continuous MEG segment. With beam size 16 and a Qwen3-4B backbone, inference takes a few seconds per sentence on a single GPU.
For the full hyperparameter grid and ablation configs, refer to configs/ in the repository.
The paper's scaling law result is the most important long-term signal. With no saturation at 90 hours of data pooled across 9 subjects, there is a direct lever: record more data from more subjects and, if the trend holds, performance will continue to improve log-linearly. If that trend extends into the hundreds of hours range, the gap with invasive BCIs may start to close materially.
The path to a clinical device requires:
None of those are easy. But for the first time, non-invasive brain-to-text has crossed a threshold where the research question is no longer "is this possible?" but "how much data do we need?"
Official resources:
Statistics and architecture details in this article are accurate as of June 29, 2026, based on the Brain2Qwerty v2 preprint (Zhang, Lévy, et al., June 25, 2026). Performance metrics may improve as training scales and the EnglishBCBL dataset is released publicly.