Data Scraper Agent
Build a production-ready, AI-powered data collection agent for any public data source.
Runs on a schedule, enriches results with a free LLM, stores to a database, and improves over time.
Stack: Python · Gemini Flash (free) · GitHub Actions (free) · Notion / Sheets / Supabase
When to Activate
- User wants to scrape or monitor any public website or API
- User says "build a bot that checks...", "monitor X for me", "collect data from..."
- User wants to track jobs, prices, news, repos, sports scores, events, listings
- User asks how to automate data collection without paying for hosting
- User wants an agent that gets smarter over time based on their decisions
Core Concepts
The Three Layers
Every data scraper agent has three layers:
COLLECT   →   ENRICH        →   STORE
   │             │                │
Scraper       AI (LLM)         Database
runs on       scores/          Notion /
schedule      summarises       Sheets /
              & classifies     Supabase
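The three layers can be sketched as three plain functions composed in order. This is only an illustration: `collect`, `enrich`, `store`, `run`, and the in-memory `DB` list are hypothetical stand-ins, and the placeholder score takes the place of a real LLM call.

```python
DB: list[dict] = []  # hypothetical in-memory stand-in for Notion / Sheets / Supabase


def collect() -> list[dict]:
    """COLLECT: a real scraper would fetch these from a site or API."""
    return [
        {"name": "Item A", "url": "https://example.com/a"},
        {"name": "Item B", "url": "https://example.com/b"},
    ]


def enrich(items: list[dict]) -> list[dict]:
    """ENRICH: a real agent would ask the LLM for a score or summary here."""
    # Placeholder score (length of the name) so the sketch runs offline.
    return [{**item, "score": len(item["name"])} for item in items]


def store(items: list[dict]) -> int:
    """STORE: append rows whose URL is not already saved; return how many were added."""
    known = {row["url"] for row in DB}
    new_rows = [item for item in items if item["url"] not in known]
    DB.extend(new_rows)
    return len(new_rows)


def run() -> int:
    """One scheduled run: collect, then enrich, then store."""
    return store(enrich(collect()))
```

Because `store` skips URLs it has already seen, running the pipeline repeatedly on a schedule does not create duplicate rows.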
Free Stack
| Layer | Tool | Why |
|---|---|---|
| Scraping | requests + BeautifulSoup | No cost, covers 80% of public sites |
| JS-rendered sites | playwright (free) | When HTML scraping fails |
| AI enrichment | Gemini Flash via REST API | 500 req/day, 1M tokens/day (free) |
| Storage | Notion API | Free tier, great UI for review |
| Schedule | GitHub Actions cron | Free for public repos |
| Learning | JSON feedback file in repo | Zero infra, persists in git |
AI Model Fallback Chain
Build agents to auto-fallback across Gemini models on quota exhaustion:
gemini-2.0-flash-lite (30 RPM) →
gemini-2.0-flash (15 RPM) →
gemini-2.5-flash (10 RPM) →
gemini-flash-lite-latest (fallback)
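The fallback idea can be shown without any network access. In the sketch below, `QuotaExceeded`, `call_with_fallback`, and `fake_call` are hypothetical names introduced for illustration; a real client would raise the quota error when the API answers with HTTP 429.

```python
from typing import Callable


class QuotaExceeded(Exception):
    """Hypothetical error for an exhausted model quota (e.g. an HTTP 429 response)."""


def call_with_fallback(models: list[str], call: Callable[[str], str]) -> tuple[str, str]:
    """Try each model in order; return (model_used, response) from the first success."""
    for model in models:
        try:
            return model, call(model)
        except QuotaExceeded:
            continue  # quota hit: move on to the next model in the chain
    raise RuntimeError("all models in the fallback chain are out of quota")


CHAIN = [
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "gemini-2.5-flash",
    "gemini-flash-lite-latest",
]


def fake_call(model: str) -> str:
    """Stand-in for a real API call: pretend the first two models are exhausted."""
    if model in {"gemini-2.0-flash-lite", "gemini-2.0-flash"}:
        raise QuotaExceeded(model)
    return f"ok from {model}"
```

With `fake_call`, `call_with_fallback(CHAIN, fake_call)` skips the two exhausted models and returns the response from `gemini-2.5-flash`.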
Batch API Calls for Efficiency
Never call the LLM once per item. Always batch:
# Bad: one API call per item
for item in items:
    result = call_ai(item)

# Good: one API call per batch of 5 items
for batch in chunks(items, size=5):
    results = call_ai(batch)
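The batching snippet relies on a `chunks` helper that is not defined at this point. One possible implementation, offered as a sketch (the signature is an assumption inferred from the call `chunks(items, size=5)`):

```python
from typing import Iterator


def chunks(items: list, size: int = 5) -> Iterator[list]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With 12 items and `size=5`, this produces batches of 5, 5, and 2, so the LLM is called 3 times instead of 12.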
Workflow
Step 1: Understand the Goal
Ask the user:
- What to collect: "What data source? URL / API / RSS / public endpoint?"
- What to extract: "What fields matter? Title, price, URL, date, score?"
- How to store: "Where should results go? Notion, Google Sheets, Supabase, or local file?"
- How to enrich: "Do you want AI to score, summarise, classify, or match each item?"
- Frequency: "How often should it run? Every hour, daily, weekly?"
Common examples to prompt:
- Job boards → score relevance to resume
- Product prices → alert on drops
- GitHub repos → summarise new releases
- News feeds → classify by topic + sentiment
- Sports results → extract stats to tracker
- Events calendar → filter by interest
Step 2: Design the Agent Architecture
Generate this directory structure for the user:
my-agent/
├── config.yaml              # User customises this (keywords, filters, preferences)
├── profile/
│   └── context.md           # User context the AI uses (resume, interests, criteria)
├── scraper/
│   ├── __init__.py
│   ├── main.py              # Orchestrator: scrape → enrich → store
│   ├── filters.py           # Rule-based pre-filter (fast, before AI)
│   └── sources/
│       ├── __init__.py
│       └── source_name.py   # One file per data source
├── ai/
│   ├── __init__.py
│   ├── client.py            # Gemini REST client with model fallback
│   ├── pipeline.py          # Batch AI analysis
│   ├── jd_fetcher.py        # Fetch full content from URLs (optional)
│   └── memory.py            # Learn from user feedback
├── storage/
│   ├── __init__.py
│   └── notion_sync.py       # Or sheets_sync.py / supabase_sync.py
├── data/
│   └── feedback.json        # User decision history (auto-updated)
├── .env.example
├── setup.py                 # One-time DB/schema creation
├── enrich_existing.py       # Backfill AI scores on old rows
├── requirements.txt
└── .github/
    └── workflows/
        └── scraper.yml      # GitHub Actions schedule
Step 3: Build the Scraper Source
Template for any data source:
"""
[Source Name] - scrapes [what] from [where].
Method: [REST API / HTML scraping / RSS feed]
"""
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone
from scraper.filters import is_relevant

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)",
}


def fetch() -> list[dict]:
    """
    Returns a list of items with consistent schema.
    Each item must have at minimum: name, url, date_found.
    """
    results = []
    resp = requests.get("https://api.example.com/items", headers=HEADERS, timeout=15)
    if resp.status_code == 200:
        for item in resp.json().get("results", []):
            if not is_relevant(item.get("title", "")):
                continue
            results.append(_normalise(item))
    return results


def _normalise(raw: dict) -> dict:
    """Convert raw API/HTML data to the standard schema."""
    return {
        "name": raw.get("title", ""),
        "url": raw.get("link", ""),
        "source": "MySource",
        "date_found": datetime.now(timezone.utc).date().isoformat(),
    }
HTML scraping pattern:
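soup = BeautifulSoup(resp.text, "lxml")
for card in soup.select("[class*='listing']"):
    heading = card.select_one("h2, h3")
    anchor = card.select_one("a[href]")
    if heading is None or anchor is None:
        continue  # skip cards that lack a title or a link
    title = heading.get_text(strip=True)
    link = anchor["href"]
    if not link.startswith("http"):
        link = f"https://example.com{link}"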
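Prefixing the domain by hand works for root-relative links such as `/jobs/42`, but not for page-relative or protocol-relative ones. A more general option, sketched here with the standard library's `urllib.parse.urljoin`, resolves every href against the page it was scraped from. `BASE_URL` and `absolute_url` are hypothetical names for this example.

```python
from urllib.parse import urljoin

# Assumed URL of the page the links were scraped from.
BASE_URL = "https://example.com/listings/"


def absolute_url(href: str, base: str = BASE_URL) -> str:
    """Resolve a scraped href (relative or absolute) against the page URL."""
    return urljoin(base, href)
```

Absolute links pass through unchanged, so the helper can be applied to every `href` without a `startswith("http")` check.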
RSS feed pattern:
import xml.etree.ElementTree as ET

root = ET.fromstring(resp.text)
for item in root.findall(".//item"):
    title = item.findtext("title", "")
    link = item.findtext("link", "")
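To try the RSS pattern without a network call, the sketch below parses a small inline feed into dicts with `name` and `url` keys. `SAMPLE_FEED` and `parse_rss` are made up for this example, and it assumes a plain RSS 2.0 feed without XML namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only so the example runs offline.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""


def parse_rss(xml_text: str) -> list[dict]:
    """Extract name/url pairs from every <item> that has both a title and a link."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall(".//item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title and link:
            items.append({"name": title, "url": link})
    return items
```

Items missing a title or link are skipped rather than stored as empty rows.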
Step 4: Build the Gemini AI Client
import json
import os
import time

import requests

_last_call = 0.0

MODEL_FALLBACK = [
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "gemini-2.5-flash",
    "gemini-flash-lite-latest",
]


def generate(prompt: str, model: str = "", rate_limit: float = 7.0) -> dict:
    """Call Gemini with auto-fallback on 429. Returns parsed JSON or {}."""
    global _last_call
    api_key = os.environ.get("GEMINI_API_KEY", "")
    if not api_key:
        return {}
    # Wait until at least `rate_limit` seconds have passed since the previous call.
    elapsed = time.time() - _last_call
    if elapsed < rate_limit:
        time.sleep(rate_limit - elapsed)
    # Try the requested model first, then the rest of the fallback chain.
    models = ([model] + [m for m in MODEL_FALLBACK if m != model]) if model else MODEL_FALLBACK
    _last_call = time.time()
    for m in models:
url =