Text-based RAG has a structural problem that chunking strategies and rerankers cannot fix: HTML parsers throw away the page.
Tables become flat text with no column alignment. Charts become nothing. Side-by-side comparisons collapse into sequential sentences. The visual structure that makes the page human-readable vanishes before retrieval even starts.
PixelRAG is the UC Berkeley project that sidesteps this entirely. Instead of parsing pages to text, it renders them as screenshot tiles and retrieves over the images using a vision-language embedding model. The reader model — Claude, GPT, Qwen, whatever you use — reads the answer directly from what a human would see.
The project comes from Berkeley's SkyLab, BAIR, and Berkeley NLP groups (led by Yichuan Wang, Zhifei Li, Zirui Wang, Paul Teiletche, and Lesheng Jin, with Matei Zaharia, Joseph Gonzalez, and Sewon Min advising). It is Apache 2.0, ships with a pre-built 8.28M-page Wikipedia index, and adds a Claude Code plugin (pixelbrowse) that gives Claude visual page access in one command.
The Problem With Text-Based RAG
Every traditional RAG pipeline does something like this:
- Fetch a web page
- Parse HTML to text chunks
- Embed the chunks
- Retrieve the most relevant chunks
- Pass chunks to a reader model
Step 2 is where information dies. Consider a Wikipedia table listing historical stock prices by year. As HTML: perfectly structured. As parsed text: Year Price 1990 12.4 1991 18.7 ... — the column headers may survive but the spatial relationship is gone. Now ask the reader "what was the highest price before 1995?" The table's answer is obvious visually. The text dump makes it a string parsing problem.
This gets worse with:
- Charts and graphs — entirely missing from text output
- Multi-column layouts — merged into single-stream text
- Infographics — completely lost
- Form layouts — field-value relationships scrambled
- PDFs with mixed text and images — images silently dropped
PixelRAG's benchmark results: up to 18% accuracy improvement on SimpleQA over text-based baselines. For agent runs, 3x fewer tokens per query — because retrieving the right image tile delivers a focused visual context instead of multiple text chunks.
How PixelRAG Works
Two components make the system work:
1. The Renderer (pixelshot)
pixelshot renders any URL or PDF to screenshot tiles using Playwright with Chrome DevTools Protocol (CDP). It handles JavaScript-rendered content, lazy-loaded images, and dynamic layouts — everything a headless browser sees, not just what's in the HTML source.
pixelshot https://en.wikipedia.org/wiki/Python --output ./tiles
The output is a set of image tiles representing the full rendered page at screen resolution. Each tile corresponds to a viewport-sized section of the page.
2. The Embedding Model
The image tiles get embedded using Qwen/Qwen3-VL-Embedding-2B, LoRA-fine-tuned on screenshot data published at Chrisyichuan/wiki-screenshot-embedding-lora. The fine-tuning dataset (Chrisyichuan/screenshot-training-natural-filtered-v2) is also public, so you can adapt other backbones.
The trained embedder puts screenshots into a vector space where visual content is retrievable. Query "what is the capital of France?" against a pixel index of Wikipedia and it finds the France article tile showing the answer in context — table, infobox, and all.
FAISS handles the index. Retrieval is fast enough for interactive use.
Quick Start
Hosted Wikipedia API (no setup required)
The fastest path: the Berkeley team hosts a pre-built index of 8.28M Wikipedia pages at api.pixelrag.ai. No index download, no GPU, no setup.
curl -X POST https://api.pixelrag.ai/search \
-H "Content-Type: application/json" \
-d '{"queries": [{"text": "What is the capital of France?"}], "n_docs": 5}'
The response includes the matching image tiles as base64 alongside the document metadata. The response images are what you pass to your reader model — not text chunks.
Try it in the browser at pixelrag.ai or run the Colab demo which renders a page and searches the hosted index with images inline.
Install PixelRAG
pip install pixelrag
This gives you pixelshot (the renderer) and the core library. Add stages as needed:
pip install 'pixelrag[embed]' # chunk, embed, build-index commands
pip install 'pixelrag[index]' # full pipeline orchestrator
pip install 'pixelrag[serve]' # FastAPI search server
Give Claude Eyes: The pixelbrowse Plugin
This is the most immediately useful thing in the project if you use Claude Code.
pip install pixelrag
claude plugin marketplace add StarTrail-org/PixelRAG
claude plugin install pixelbrowse@pixelrag-plugins
Then:
# Ask Claude to screenshot a page and reason about it
claude -p "screenshot https://news.ycombinator.com and summarize the top stories"
claude -p "screenshot https://arxiv.org/abs/2404.12387 and explain the key findings"
Or in an interactive Claude session:
/screenshot https://example.com
Claude screenshots the page, receives the image, and reads it — seeing tables, charts, and layout instead of stripped HTML. No MCP server, no backend: the skill calls pixelshot locally on your machine via Playwright/CDP.
Building Your Own Index
For documents outside Wikipedia, build a local index:
1. Create pixelrag.yaml:
source:
type: local
path: ./my_docs
embed:
model: Qwen/Qwen3-VL-Embedding-2B
device: cuda # or cpu
gpu_ids: [0]
output: ./my_index
2. Build and serve:
pixelrag index build
pixelrag serve --index-dir ./my_index --port 30001
3. Query:
curl -X POST http://localhost:30001/search \
-H "Content-Type: application/json" \
-d '{"queries": [{"text": "your question here"}], "n_docs": 5}'
The pipeline stages also run independently if you want finer control:
pixelrag chunk --tiles-dir ./tiles
pixelrag embed --shard-dir ./tiles --output-dir ./embeddings --gpu-ids 0,1
pixelrag build-index --embeddings-dir ./embeddings --output-dir ./index
Downloading the Pre-Built Wikipedia Index
The full Wikipedia pixel index (~217 GB) is on Hugging Face:
pip install 'pixelrag[serve]'
huggingface-cli download StarTrail-org/pixelrag-faiss-indexes \
--repo-type dataset \
--include "search_index_normed_v2/*" \
--local-dir ./index
pixelrag serve --index-dir ./index/search_index_normed_v2 --port 30001
Four index variants are available in the dataset repo: base Wikipedia pixel, LoRA-enhanced Wikipedia pixel, Wikipedia text (baseline comparison), and a news pixel index.
Performance Numbers
| Metric | Text RAG | PixelRAG |
|---|---|---|
| SimpleQA accuracy | baseline | +18% higher |
| Tokens per query (agent runs) | baseline | 3x fewer |
| Tables preserved | partial | complete (as image) |
| Charts preserved | no | yes |
| Visual layout preserved | no | yes |
| Setup for Wikipedia search | full pipeline | zero (hosted API) |
The 18% accuracy gain and 3x token reduction are the headline numbers from the paper. These are aggregate improvements — the gap is larger on layout-heavy documents (data tables, infographics, PDFs with figures) and smaller on pure prose.
Using PixelRAG Programmatically
from pixelrag_render import render_url
# Render a page to tiles
tiles = render_url("https://en.wikipedia.org/wiki/Python", "./tiles")
# Each tile is an image file path you can pass to a vision model
for tile in tiles:
print(tile)
For agent workflows: render the page, retrieve the relevant tile via the search API, pass the tile image to Claude or another VLM as a vision input. The model reads the answer from the image instead of reconstructing it from parsed text.
Fine-Tuning on Your Own Data
The training pipeline lives in train/ — a separate uv project with pinned dependencies (torch 2.9.1+cu129, transformers 4.57.1, cuDNN 9.20). It LoRA-fine-tunes Qwen3-VL-Embedding-2B on screenshot data.
cd train && uv sync
# see train/README.md for the full recipe
The published adapters at Chrisyichuan/wiki-screenshot-embedding-lora are what power the hosted Wikipedia index. The full training set (Chrisyichuan/screenshot-training-natural-filtered-v2) is public so you can adapt other backbones — a larger Qwen variant, or any other embedding model that accepts image inputs.
What This Changes About Web Search in Agent Pipelines
The way most agent pipelines fetch web content today: call a search API, get URLs, fetch HTML, strip tags, chunk text, embed chunks, retrieve, pass to LLM. This workflow works for text-heavy articles and code documentation. It breaks on anything with meaningful visual structure.
PixelRAG's approach — render, embed images, retrieve images, pass images to VLM — is more expensive per query (Qwen3-VL-Embedding is larger than a text embedding model, and image tokens cost more than text tokens). But the accuracy gain on structured content is real, and the 3x token reduction in agent runs partially offsets the embedding cost.
The pixelbrowse Claude Code plugin is the most accessible entry point. It does not require building an index — it just screenshots the target URL and hands the image to Claude. For one-off "look at this page" agent tasks, the per-query cost is negligible and the quality improvement on any page with tables or charts is immediate.
Project Links
- GitHub: StarTrail-org/PixelRAG
- Hosted API: pixelrag.ai
- Paper: "PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation"
- Trained adapters: Chrisyichuan/wiki-screenshot-embedding-lora (Hugging Face)
- Training data: Chrisyichuan/screenshot-training-natural-filtered-v2 (Hugging Face)
- Wikipedia index: StarTrail-org/pixelrag-faiss-indexes (Hugging Face)
- License: Apache 2.0
- Stars: 1.4k (as of June 21, 2026)