When Google launched Open Knowledge Format (OKF) v0.1 in June 2026, it did not ship an empty spec—it included three living sample bundles produced by a reference BigQuery enrichment agent:
| OKF bundle | Underlying BigQuery public dataset |
|---|---|
| GA4 e-commerce | bigquery-public-data.ga4_obfuscated_sample_ecommerce |
| Stack Overflow | Stack Overflow public dataset |
| Bitcoin | bigquery-public-data.bitcoin_blockchain |
This guide goes hands-on on the two datasets you linked: GA4 e-commerce and Bitcoin in BigQuery—what they contain, how to query them, and how OKF turns raw tables into agent-readable knowledge graphs.
TL;DR
| Question | Answer |
|---|---|
| Repo | GoogleCloudPlatform/knowledge-catalog |
| GA4 dataset | ga4_obfuscated_sample_ecommerce — Nov 2020–Jan 2021 |
| Bitcoin dataset | bitcoin_blockchain — full history, updates ~every 10 min |
| GA4 tables | events_* (sharded by date) |
| Bitcoin tables | blocks_raw, transactions |
| Cost | BigQuery Sandbox / free tier sufficient for samples |
| OKF value | Pre-linked concept pages vs raw schema discovery |
Why Public Datasets Make Good OKF Demos
BigQuery public datasets are ideal OKF teaching material because they are:
- Real-world shape — schemas, joins, and metrics agents actually need
- Free to query — no data ingestion required
- Well documented — Google publishes schemas and sample SQL
- Diverse domains — web analytics (GA4) vs immutable ledger (Bitcoin)
OKF's bet: an agent answering "How do we compute weekly active users?" or "What columns link transactions to blocks?" should read curated concept pages with cross-links—not rediscover schema from INFORMATION_SCHEMA every session. That is the Karpathy LLM Wiki compile-once pattern at organizational scale.
Bundle 1: GA4 E-commerce (Google Merchandise Store)
What it is
The Google Merchandise Store sells Google-branded merchandise. It uses GA4's standard web ecommerce implementation plus enhanced measurement.
The public BigQuery dataset ga4_obfuscated_sample_ecommerce contains a sample of obfuscated event export data for three months:
| Property | Value |
|---|---|
| Date range | 2020-11-01 → 2021-01-31 |
| Project path | bigquery-public-data.ga4_obfuscated_sample_ecommerce |
| Primary table | events_* (date-sharded) |
| Docs | GA4 BigQuery sample dataset |
Prerequisites
- Google Cloud project with BigQuery API enabled (BigQuery Quickstart)
- BigQuery Sandbox or free usage tier—enough for exploration
- Optional: billing if you exceed free limits
Limitations (read before trusting numbers)
Google is explicit about obfuscation:
- Placeholder values:
<Other>,NULL,'' - Internal consistency may be limited due to obfuscation
- Not comparable to the GA4 Demo Account for Google Merchandise Store—the underlying data differs
OKF bundles document these caveats in concept pages so agents do not treat demo metrics as production truth.
Starter query: dataset overview
Open BigQuery Console, compose a new query, and run:
SELECT
COUNT(*) AS event_count,
COUNT(DISTINCT user_pseudo_id) AS user_count,
COUNT(DISTINCT event_date) AS day_count
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
BigQuery shows bytes processed before you run—useful for cost estimation. Valid queries display a check mark.
What agents learn from the OKF bundle
The GA4 OKF bundle (in knowledge-catalog) typically maps concepts like:
| OKF concept type | Examples |
|---|---|
| Dataset | ga4_obfuscated_sample_ecommerce overview |
| Table | events_* schema, sharding pattern |
| Metric | Sessions, purchases, user_pseudo_id uniqueness |
| Event | purchase, add_to_cart, page_view parameters |
Each file carries YAML frontmatter (type, title, description, resource, tags) and markdown cross-links—e.g., from a purchase event page to the ecommerce parameters page.
Next steps for GA4
- BigQuery export schema — field definitions
- Advanced sample queries on the developer page
- Looker Studio or Connected Sheets for visualization
- Compare to production patterns in marketing analytics jumpstart repos
Bundle 2: Bitcoin Blockchain
What it is
Since February 2018, Google has published the full Bitcoin blockchain to BigQuery for analytics. Allen Day and Colin Bookman announced on the Google Cloud Blog:
The Bitcoin blockchain data are now available for exploration with BigQuery. All historical data are in the
bigquery-public-data:bitcoin_blockchaindataset, which updates every 10 minutes.
| Property | Value |
|---|---|
| Project path | bigquery-public-data.bitcoin_blockchain |
| Tables | blocks_raw, transactions |
| Update cadence | ~every 10 minutes as new blocks broadcast |
| Also on | Kaggle (BigQuery Python client in Kernels) |
Why BigQuery for blockchain?
Bitcoin is an immutable distributed ledger with strong OLTP properties (atomic transactions, durability) but weak OLAP for ad-hoc reporting on money flows. BigQuery adds:
- Real-time extraction from the ledger
- Denormalized storage for easier exploration
- Data Studio / Looker visualizations
OKF bundles capture this OLTP vs OLAP distinction so agents understand why the dataset exists—not just table names.
Network fundamentals queries
Google's original post highlighted network valuation analytics:
| Analysis | Insight |
|---|---|
| BTC transacted per day | Economic activity on-network |
| Recipient addresses per day | User growth proxy (Metcalfe's Law) |
| NVT Ratio | Network Value to Transactions—valuation metric |
| Mining difficulty vs search volume | Fundamental + attention correlation |
These become metric concept pages in the OKF bundle with formulas and caveats pre-documented.
Famous transaction: Bitcoin pizza (May 17, 2010)
Laszlo Hanyecz bought two pizzas for 10,000 BTC. The transaction is permanently recorded:
- Transaction ID:
a107...d48d(full hash in blockchain) - From:
1XPT...rvH4→ To:17Sk...xFyQ
Google visualized input transfers up to 4 degrees before the pizza purchase—red circle for Hanyecz's address, blue for others, arrow width proportional to BTC flow. The OKF bundle can link entity pages (Hanyecz, pizza purchase event) to transaction schema pages.
Anomaly detection: duplicate transactions
One transaction appears in two blocks—impossible under current rules:
#standardSQL
SELECT *
FROM (
SELECT
transaction_id,
COUNT(transaction_id) AS dup_transaction_count
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
GROUP BY transaction_id
)
WHERE dup_transaction_count > 1
Why? Early Bitcoin used BerkeleyDB (non-unique keys). After Satoshi left, the team switched to LevelDB and implemented BIP-0030 to prevent duplicate transaction IDs. Legacy duplicate entries remain in historical data.
This is exactly the kind of contradiction / anomaly an OKF lint pass or concept page flags—agents get context without rediscovering Bitcoin history each query.
What agents learn from the OKF bundle
| OKF concept type | Examples |
|---|---|
| Dataset | bitcoin_blockchain overview |
| Table | transactions, blocks_raw columns |
| Metric | NVT Ratio, daily transaction volume |
| Entity | Notable addresses, pizza purchase |
| Anomaly | Pre-BIP-0030 duplicate transaction_ids |
Bundle 3: Stack Overflow (brief)
The third OKF sample maps the Stack Overflow public dataset on BigQuery—questions, answers, tags, votes. Same pattern: enrichment agent drafts concept docs for tables and common join paths (e.g., questions → answers → users). Browse the bundle in knowledge-catalog for the full file tree.
Raw BigQuery → OKF Bundle: The Pipeline
Google's reference BigQuery enrichment agent (in the repo) automates:
BigQuery dataset
↓ (walk tables/views)
Draft OKF concept .md per table/metric
↓ (second LLM pass)
Enrich with citations, joins, descriptions from authoritative docs
↓
Commit conformant OKF bundle → git
↓
Visualize with static HTML graph viewer
↓
Ingest into Cloud Knowledge Catalog for GCP agents
You can replicate this for your own datasets—the samples are templates, not products.
Browse a bundle without BigQuery
- Clone knowledge-catalog
- Open the GA4, Bitcoin, or Stack Overflow bundle directory
- Start at
index.md— progressive disclosure catalog - Run the HTML visualizer — single self-contained file, no backend
For Claude Code or Cursor: point CLAUDE.md at the bundle path—read index.md before analytics tasks on this dataset.
Sample Queries Cheat Sheet
GA4: events by name
SELECT
event_name,
COUNT(*) AS event_count
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
GROUP BY event_name
ORDER BY event_count DESC
LIMIT 20
GA4: purchase events only
SELECT *
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE event_name = 'purchase'
LIMIT 100
Bitcoin: daily transaction count
SELECT
DATE(block_timestamp) AS day,
COUNT(*) AS tx_count
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
GROUP BY day
ORDER BY day DESC
LIMIT 30
Bitcoin: largest outputs (recent)
SELECT
transaction_id,
output_value,
block_timestamp
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
ORDER BY output_value DESC
LIMIT 10
OKF vs Raw SQL for Agents
| Approach | Agent behavior | Tradeoff |
|---|---|---|
| Raw BigQuery | Agent queries INFORMATION_SCHEMA, guesses joins | Flexible, slow, error-prone |
| RAG on docs | Retrieves doc chunks per question | No cross-link maintenance |
| OKF bundle | Reads pre-compiled concept graph | Upfront enrichment cost; reliable at moderate scale |
For personal wikis under ~100K tokens, context often beats RAG. For org-wide data catalogs with hundreds of tables, OKF + Knowledge Catalog hybrid search scales further.
Getting Started Checklist
- Read OKF spec — okf/SPEC.md in the repo
- Run GA4 starter query — developer docs
- Run Bitcoin duplicate-tx query — Cloud Blog post
- Browse OKF bundles — compare markdown concept pages to raw tables
- Open HTML visualizer — see the knowledge graph
- Point your agent — add bundle path to
CLAUDE.mdor project instructions
Summary
Google's OKF sample bundles are not synthetic toys—they map real BigQuery public datasets agents will encounter in the wild:
- GA4 e-commerce — obfuscated Google Merchandise Store events (Nov 2020–Jan 2021), ideal for learning ecommerce event schemas
- Bitcoin blockchain — full ledger history since 2018, ideal for OLAP-on-ledger patterns and anomaly documentation
Query the raw data to build intuition. Read the OKF bundles to see how compile-once knowledge beats per-query schema discovery. Then apply the same enrichment pipeline to your own BigQuery project.
Related Reading
- Open Knowledge Format (OKF) Complete Guide
- Karpathy LLM Wiki Pattern Guide
- What is CLAUDE.md?
- What is MCP?
Dataset details cited from Google Analytics developer documentation and Google Cloud Blog: Bitcoin in BigQuery as of June 14, 2026.