← Back to blog

explainx / blog

PixelRAG: Berkeley's Visual RAG That Reads Web Pages as Screenshots (Not HTML)

UC Berkeley researchers launched PixelRAG — open-source visual retrieval-augmented generation that renders web pages as screenshot tiles instead of parsing them to text. It outperforms text-based RAG by up to 18% accuracy on SimpleQA, ships an 8.28M-page Wikipedia index, and gives Claude a pixelbrowse skill so it sees charts and tables the way a human does.

·7 min read·Yash Thakker
RAGAI ToolsOpen SourceClaudeVision AIResearch
PixelRAG: Berkeley's Visual RAG That Reads Web Pages as Screenshots (Not HTML)

Text-based RAG has a structural problem that chunking strategies and rerankers cannot fix: HTML parsers throw away the page.

Tables become flat text with no column alignment. Charts become nothing. Side-by-side comparisons collapse into sequential sentences. The visual structure that makes the page human-readable vanishes before retrieval even starts.

PixelRAG is the UC Berkeley project that sidesteps this entirely. Instead of parsing pages to text, it renders them as screenshot tiles and retrieves over the images using a vision-language embedding model. The reader model — Claude, GPT, Qwen, whatever you use — reads the answer directly from what a human would see.

The project comes from Berkeley's SkyLab, BAIR, and Berkeley NLP groups (led by Yichuan Wang, Zhifei Li, Zirui Wang, Paul Teiletche, and Lesheng Jin, with Matei Zaharia, Joseph Gonzalez, and Sewon Min advising). It is Apache 2.0, ships with a pre-built 8.28M-page Wikipedia index, and adds a Claude Code plugin (pixelbrowse) that gives Claude visual page access in one command.


The Problem With Text-Based RAG

Every traditional RAG pipeline does something like this:

  1. Fetch a web page
  2. Parse HTML to text chunks
  3. Embed the chunks
  4. Retrieve the most relevant chunks
  5. Pass chunks to a reader model

Step 2 is where information dies. Consider a Wikipedia table listing historical stock prices by year. As HTML: perfectly structured. As parsed text: Year Price 1990 12.4 1991 18.7 ... — the column headers may survive but the spatial relationship is gone. Now ask the reader "what was the highest price before 1995?" The table's answer is obvious visually. The text dump makes it a string parsing problem.

This gets worse with:

  • Charts and graphs — entirely missing from text output
  • Multi-column layouts — merged into single-stream text
  • Infographics — completely lost
  • Form layouts — field-value relationships scrambled
  • PDFs with mixed text and images — images silently dropped

PixelRAG's benchmark results: up to 18% accuracy improvement on SimpleQA over text-based baselines. For agent runs, 3x fewer tokens per query — because retrieving the right image tile delivers a focused visual context instead of multiple text chunks.


How PixelRAG Works

Two components make the system work:

1. The Renderer (pixelshot)

pixelshot renders any URL or PDF to screenshot tiles using Playwright with Chrome DevTools Protocol (CDP). It handles JavaScript-rendered content, lazy-loaded images, and dynamic layouts — everything a headless browser sees, not just what's in the HTML source.

pixelshot https://en.wikipedia.org/wiki/Python --output ./tiles

The output is a set of image tiles representing the full rendered page at screen resolution. Each tile corresponds to a viewport-sized section of the page.

2. The Embedding Model

The image tiles get embedded using Qwen/Qwen3-VL-Embedding-2B, LoRA-fine-tuned on screenshot data published at Chrisyichuan/wiki-screenshot-embedding-lora. The fine-tuning dataset (Chrisyichuan/screenshot-training-natural-filtered-v2) is also public, so you can adapt other backbones.

The trained embedder puts screenshots into a vector space where visual content is retrievable. Query "what is the capital of France?" against a pixel index of Wikipedia and it finds the France article tile showing the answer in context — table, infobox, and all.

FAISS handles the index. Retrieval is fast enough for interactive use.


Quick Start

Hosted Wikipedia API (no setup required)

The fastest path: the Berkeley team hosts a pre-built index of 8.28M Wikipedia pages at api.pixelrag.ai. No index download, no GPU, no setup.

curl -X POST https://api.pixelrag.ai/search \
  -H "Content-Type: application/json" \
  -d '{"queries": [{"text": "What is the capital of France?"}], "n_docs": 5}'

The response includes the matching image tiles as base64 alongside the document metadata. The response images are what you pass to your reader model — not text chunks.

Try it in the browser at pixelrag.ai or run the Colab demo which renders a page and searches the hosted index with images inline.

Install PixelRAG

pip install pixelrag

This gives you pixelshot (the renderer) and the core library. Add stages as needed:

pip install 'pixelrag[embed]'   # chunk, embed, build-index commands
pip install 'pixelrag[index]'   # full pipeline orchestrator
pip install 'pixelrag[serve]'   # FastAPI search server

Give Claude Eyes: The pixelbrowse Plugin

This is the most immediately useful thing in the project if you use Claude Code.

pip install pixelrag
claude plugin marketplace add StarTrail-org/PixelRAG
claude plugin install pixelbrowse@pixelrag-plugins

Then:

# Ask Claude to screenshot a page and reason about it
claude -p "screenshot https://news.ycombinator.com and summarize the top stories"
claude -p "screenshot https://arxiv.org/abs/2404.12387 and explain the key findings"

Or in an interactive Claude session:

/screenshot https://example.com

Claude screenshots the page, receives the image, and reads it — seeing tables, charts, and layout instead of stripped HTML. No MCP server, no backend: the skill calls pixelshot locally on your machine via Playwright/CDP.


Building Your Own Index

For documents outside Wikipedia, build a local index:

1. Create pixelrag.yaml:

source:
  type: local
  path: ./my_docs

embed:
  model: Qwen/Qwen3-VL-Embedding-2B
  device: cuda     # or cpu
  gpu_ids: [0]

output: ./my_index

2. Build and serve:

pixelrag index build
pixelrag serve --index-dir ./my_index --port 30001

3. Query:

curl -X POST http://localhost:30001/search \
  -H "Content-Type: application/json" \
  -d '{"queries": [{"text": "your question here"}], "n_docs": 5}'

The pipeline stages also run independently if you want finer control:

pixelrag chunk --tiles-dir ./tiles
pixelrag embed --shard-dir ./tiles --output-dir ./embeddings --gpu-ids 0,1
pixelrag build-index --embeddings-dir ./embeddings --output-dir ./index

Downloading the Pre-Built Wikipedia Index

The full Wikipedia pixel index (~217 GB) is on Hugging Face:

pip install 'pixelrag[serve]'

huggingface-cli download StarTrail-org/pixelrag-faiss-indexes \
  --repo-type dataset \
  --include "search_index_normed_v2/*" \
  --local-dir ./index

pixelrag serve --index-dir ./index/search_index_normed_v2 --port 30001

Four index variants are available in the dataset repo: base Wikipedia pixel, LoRA-enhanced Wikipedia pixel, Wikipedia text (baseline comparison), and a news pixel index.


Performance Numbers

MetricText RAGPixelRAG
SimpleQA accuracybaseline+18% higher
Tokens per query (agent runs)baseline3x fewer
Tables preservedpartialcomplete (as image)
Charts preservednoyes
Visual layout preservednoyes
Setup for Wikipedia searchfull pipelinezero (hosted API)

The 18% accuracy gain and 3x token reduction are the headline numbers from the paper. These are aggregate improvements — the gap is larger on layout-heavy documents (data tables, infographics, PDFs with figures) and smaller on pure prose.


Using PixelRAG Programmatically

from pixelrag_render import render_url

# Render a page to tiles
tiles = render_url("https://en.wikipedia.org/wiki/Python", "./tiles")

# Each tile is an image file path you can pass to a vision model
for tile in tiles:
    print(tile)

For agent workflows: render the page, retrieve the relevant tile via the search API, pass the tile image to Claude or another VLM as a vision input. The model reads the answer from the image instead of reconstructing it from parsed text.


Fine-Tuning on Your Own Data

The training pipeline lives in train/ — a separate uv project with pinned dependencies (torch 2.9.1+cu129, transformers 4.57.1, cuDNN 9.20). It LoRA-fine-tunes Qwen3-VL-Embedding-2B on screenshot data.

cd train && uv sync
# see train/README.md for the full recipe

The published adapters at Chrisyichuan/wiki-screenshot-embedding-lora are what power the hosted Wikipedia index. The full training set (Chrisyichuan/screenshot-training-natural-filtered-v2) is public so you can adapt other backbones — a larger Qwen variant, or any other embedding model that accepts image inputs.


What This Changes About Web Search in Agent Pipelines

The way most agent pipelines fetch web content today: call a search API, get URLs, fetch HTML, strip tags, chunk text, embed chunks, retrieve, pass to LLM. This workflow works for text-heavy articles and code documentation. It breaks on anything with meaningful visual structure.

PixelRAG's approach — render, embed images, retrieve images, pass images to VLM — is more expensive per query (Qwen3-VL-Embedding is larger than a text embedding model, and image tokens cost more than text tokens). But the accuracy gain on structured content is real, and the 3x token reduction in agent runs partially offsets the embedding cost.

The pixelbrowse Claude Code plugin is the most accessible entry point. It does not require building an index — it just screenshots the target URL and hands the image to Claude. For one-off "look at this page" agent tasks, the per-query cost is negligible and the quality improvement on any page with tables or charts is immediate.


Project Links

  • GitHub: StarTrail-org/PixelRAG
  • Hosted API: pixelrag.ai
  • Paper: "PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation"
  • Trained adapters: Chrisyichuan/wiki-screenshot-embedding-lora (Hugging Face)
  • Training data: Chrisyichuan/screenshot-training-natural-filtered-v2 (Hugging Face)
  • Wikipedia index: StarTrail-org/pixelrag-faiss-indexes (Hugging Face)
  • License: Apache 2.0
  • Stars: 1.4k (as of June 21, 2026)

Related Reading

Related posts