Tech News Digest
Automated tech news digest system with unified data source model, quality scoring pipeline, and template-based output generation.
Quick Start
- Configuration Setup: Default configs are in config/defaults/. Copy them into the workspace to customize:
mkdir -p workspace/config
cp config/defaults/sources.json workspace/config/tech-news-digest-sources.json
cp config/defaults/topics.json workspace/config/tech-news-digest-topics.json
- Environment Variables:
TWITTERAPI_IO_KEY - twitterapi.io API key (optional, preferred)
X_BEARER_TOKEN - Twitter/X official API bearer token (optional, fallback)
TAVILY_API_KEY - Tavily Search API key, alternative to Brave (optional)
WEB_SEARCH_BACKEND - Web search backend: auto|brave|tavily (optional, default: auto)
BRAVE_API_KEYS - Brave Search API keys, comma-separated for rotation (optional)
BRAVE_API_KEY - Single Brave key fallback (optional)
GITHUB_TOKEN - GitHub personal access token (optional, improves rate limits)
- Generate Digest:
python3 scripts/run-pipeline.py \
--defaults config/defaults \
--config workspace/config \
--hours 48 --freshness pd \
--archive-dir workspace/archive/tech-news-digest/ \
--output /tmp/td-merged.json --verbose --force
- Use Templates: Apply the Discord, email, or PDF template to the merged output.
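The Brave variables listed under Environment Variables can be read roughly as follows. This is a minimal sketch, not the project's code: the helper names `brave_keys` and `key_rotation` are hypothetical, and round-robin is only one plausible reading of "rotation".

```python
import itertools
import os


def brave_keys(env=None):
    """Collect Brave keys: BRAVE_API_KEYS first, BRAVE_API_KEY as fallback."""
    env = os.environ if env is None else env
    # Comma-separated list; blank entries and surrounding spaces are dropped.
    keys = [k.strip() for k in env.get("BRAVE_API_KEYS", "").split(",") if k.strip()]
    if not keys:
        single = env.get("BRAVE_API_KEY", "").strip()
        if single:
            keys = [single]
    return keys


def key_rotation(keys):
    """Assumed strategy: cycle through the keys in order, one per request."""
    return itertools.cycle(keys) if keys else iter(())
```

With `BRAVE_API_KEYS="a, b"` the rotation would yield `a`, `b`, `a`, and so on. An empty result means no Brave credentials are configured.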
Configuration Files
sources.json - Unified Data Sources
{
"sources": [
{
"id": "openai-rss",
"type": "rss",
"name": "OpenAI Blog",
"url": "https://openai.com/blog/rss.xml",
"enabled": true,
"priority": true,
"topics": ["llm", "ai-agent"],
"note": "Official OpenAI updates"
},
{
"id": "sama-twitter",
"type": "twitter",
"name": "Sam Altman",
"handle": "sama",
"enabled": true,
"priority": true,
"topics": ["llm", "frontier-tech"],
"note": "OpenAI CEO"
}
]
}
topics.json - Enhanced Topic Definitions
{
"topics": [
{
"id": "llm",
"emoji": "π§ ",
"label": "LLM / Large Models",
"description": "Large Language Models, foundation models, breakthroughs",
"search": {
"queries": ["LLM latest news", "large language model breakthroughs"],
"must_include": ["LLM", "large language model", "foundation model"],
"exclude": ["tutorial", "beginner guide"]
},
"display": {
"max_items": 8,
"style": "detailed"
}
}
]
}
Scripts Pipeline
run-pipeline.py - Unified Pipeline (Recommended)
python3 scripts/run-pipeline.py \
--defaults config/defaults [--config CONFIG_DIR] \
--hours 48 --freshness pd \
--archive-dir workspace/archive/tech-news-digest/ \
--output /tmp/td-merged.json --verbose --force
- Features: Runs all 6 fetch steps in parallel, then merges + deduplicates + scores
- Output: Final merged JSON ready for report generation (~30s total)
- Metadata: Saves per-step timing and counts to *.meta.json
- GitHub Auth: Auto-generates a GitHub App token if $GITHUB_TOKEN is not set
- Fallback: If this fails, run individual scripts below
Individual Scripts (Fallback)
fetch-rss.py - RSS Feed Fetcher
python3 scripts/fetch-rss.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE] [--verbose]
- Parallel fetching (10 workers), retry with backoff, feedparser + regex fallback
- Timeout: 30s per feed, ETag/Last-Modified caching
fetch-twitter.py - Twitter/X KOL Monitor
python3 scripts/fetch-twitter.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE] [--backend auto|official|twitterapiio]
- Backend auto-detection: uses twitterapi.io if TWITTERAPI_IO_KEY is set, else the official X API v2 if X_BEARER_TOKEN is set
- Rate limit handling, engagement metrics, retry with backoff
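The backend auto-detection described above could look like this. It is a sketch under assumptions: the function name `pick_twitter_backend` and the `None` return for missing credentials are illustrative, not taken from fetch-twitter.py.

```python
import os


def pick_twitter_backend(requested="auto", env=None):
    """Return 'twitterapiio', 'official', or None when no credentials exist."""
    env = os.environ if env is None else env
    if requested != "auto":
        # An explicit --backend value is assumed to bypass detection.
        return requested
    if env.get("TWITTERAPI_IO_KEY"):
        return "twitterapiio"  # preferred backend
    if env.get("X_BEARER_TOKEN"):
        return "official"  # fallback: official X API v2
    return None
```

Passing `env` as a plain dict keeps the function easy to test without touching the real environment.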
fetch-web.py - Web Search Engine
python3 scripts/fetch-web.py [--defaults DIR] [--config DIR] [--freshness pd] [--output FILE]
- Auto-detects Brave API rate limit: paid plans → parallel queries, free → sequential
- Without API: generates search interface for agents
fetch-github.py - GitHub Releases Monitor
python3 scripts/fetch-github.py [--defaults DIR] [--config DIR] [--hours 168] [--output FILE]
- Parallel fetching (10 workers), 30s timeout
- Auth priority: $GITHUB_TOKEN → GitHub App auto-generate → gh CLI → unauthenticated (60 req/hr)
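The auth priority chain can be sketched as a list of resolvers tried in order. Everything here is illustrative: the function names are hypothetical, and the GitHub App step is a placeholder because the source does not describe how that token is generated. The `gh auth token` subcommand is the only external call used.

```python
import os
import shutil
import subprocess


def token_from_env():
    return os.environ.get("GITHUB_TOKEN") or None


def token_from_github_app():
    # Placeholder: the App token flow is not described in this document.
    return None


def token_from_gh_cli():
    if shutil.which("gh") is None:
        return None
    try:
        result = subprocess.run(
            ["gh", "auth", "token"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return None
    token = result.stdout.strip()
    return token if result.returncode == 0 and token else None


def resolve_github_token(
    resolvers=(token_from_env, token_from_github_app, token_from_gh_cli),
):
    """Return the first token found, or None for unauthenticated access."""
    for resolver in resolvers:
        token = resolver()
        if token:
            return token
    return None
```

A `None` result corresponds to the unauthenticated tier at the end of the chain.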
fetch-github.py --trending - GitHub Trending Repos
python3 scripts/fetch-github.py --trending [--hours 48] [--output FILE] [--verbose]
- Searches GitHub API for trending repos across 4 topics (LLM, AI Agent, Crypto, Frontier Tech)
- Quality scoring: base 5 + daily_stars_est / 10, max 15
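Read literally, the trending score formula is a capped linear function. A minimal sketch, assuming floating-point division and that `daily_stars_est` is a non-negative number (the function name is hypothetical):

```python
def trending_score(daily_stars_est):
    """Base 5, plus one point per 10 estimated daily stars, capped at 15."""
    return min(15.0, 5.0 + daily_stars_est / 10.0)
```

Under this reading, a repo gaining about 50 stars a day scores 10, and anything at or above 100 stars a day hits the cap of 15.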
fetch-reddit.py - Reddit Posts Fetcher
python3 scripts/fetch-reddit.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE]
- Parallel fetching (4 workers), public JSON API (no auth required)
- 13 subreddits with score filtering
enrich-articles.py - Article Full-Text Enrichment
python3 scripts/enrich-articles.py --input merged.json --output enriched.json [--min-score 10] [--max-articles 15] [--verbose]
- Fetches full article text for high-scoring articles
- Cloudflare Markdown for Agents (preferred) → HTML extraction (fallback) → Skip (paywalled/social)
- Blog domain whitelist with lower score threshold (≥3)
- Parallel fetching (5 workers, 10s timeout)
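The selection rules above (a general minimum score, a lower threshold for whitelisted blog domains, and a cap on article count) might combine as in this sketch. The function name, the `url` and `score` field names, and exact-hostname matching are assumptions.

```python
from urllib.parse import urlsplit


def select_for_enrichment(
    articles, blog_domains, min_score=10, blog_min_score=3, max_articles=15
):
    """Pick the highest-scoring articles that clear their threshold."""
    chosen = []
    for article in sorted(articles, key=lambda a: a["score"], reverse=True):
        host = urlsplit(article["url"]).hostname or ""
        # Whitelisted blog domains qualify at the lower threshold.
        threshold = blog_min_score if host in blog_domains else min_score
        if article["score"] >= threshold:
            chosen.append(article)
    return chosen[:max_articles]
```

The defaults mirror the `--min-score 10` and `--max-articles 15` values shown in the command line.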
merge-sources.py - Quality Scoring & Deduplication
python3 scripts/merge-sources.py --rss FILE --twitter FILE --web FILE --github FILE --reddit FILE
- Quality scoring, title similarity dedup (85%), previous digest penalty
- Output: topic-grouped articles sorted by score
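Title-similarity deduplication at an 85% threshold can be approximated with the standard library. This sketch uses `difflib.SequenceMatcher`, which may not be the metric merge-sources.py uses. The `title` and `score` field names and the keep-the-highest-score policy are assumptions.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(title.lower().split())


def dedup_by_title(articles, threshold=0.85):
    """Keep the best-scored article from each group of near-identical titles."""
    kept = []
    for article in sorted(articles, key=lambda a: a["score"], reverse=True):
        title = normalize(article["title"])
        is_duplicate = any(
            SequenceMatcher(None, title, normalize(k["title"])).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(article)
    return kept
```

Pairwise comparison is quadratic in the number of articles, which is acceptable for a digest of a few hundred items but would need bucketing for much larger inputs.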
validate-config.py - Configuration Validator
python3 scripts/validate-config.py [--defaults DIR] [--config DIR] [--verbose]
- JSON schema validation, topic reference checks, duplicate ID detection
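Two of the checks listed above, duplicate ID detection and topic reference checks, are simple to sketch. The function name and message wording are hypothetical. The input shapes follow the sources.json and topics.json examples earlier in this document.

```python
from collections import Counter


def find_config_problems(sources, topics):
    """Return human-readable problems found in the merged config."""
    problems = []
    counts = Counter(source["id"] for source in sources)
    for source_id, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"duplicate source id: {source_id}")
    known_topics = {topic["id"] for topic in topics}
    for source in sources:
        for topic in source.get("topics", []):
            if topic not in known_topics:
                problems.append(
                    f"source {source['id']} references unknown topic: {topic}"
                )
    return problems
```

An empty list means both checks passed. Schema validation is a separate step not covered by this sketch.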
generate-pdf.py - PDF Report Generator
python3 scripts/generate-pdf.py --input report.md --output digest.pdf [--verbose]
- Converts markdown digest to styled A4 PDF with Chinese typography (Noto Sans CJK SC)
- Emoji icons, page headers/footers, blue accent theme. Requires weasyprint.
sanitize-html.py - Safe HTML Email Converter
python3 scripts/sanitize-html.py --input report.md --output email.html [--verbose]
- Converts markdown to XSS-safe HTML email with inline CSS
- URL whitelist (http/https only), HTML-escaped text content
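The two safety rules above, an http/https-only URL whitelist and HTML-escaped text, can be illustrated for a single link. This is a sketch with a hypothetical `safe_link` helper. Rendering a rejected link as plain text is an assumed policy, not something the source specifies.

```python
import html
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def safe_link(text, url):
    """Render an anchor only for http/https URLs; always escape the label."""
    label = html.escape(text)
    url = url.strip()
    try:
        scheme = urlsplit(url).scheme.lower()
    except ValueError:
        return label  # malformed URL: fall back to plain text
    if scheme not in ALLOWED_SCHEMES:
        return label  # e.g. javascript: or data: URLs are dropped
    return f'<a href="{html.escape(url, quote=True)}">{label}</a>'
```

Escaping the `href` value as well as the label keeps quotes and ampersands in URLs from breaking out of the attribute.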
source-health.py - Source Health Monitor
python3 scripts/source-health.py --rss FILE --twitter FILE --github FILE --reddit FILE --web FILE [--verbose]
- Tracks per-source success/failure history over 7 days
- Reports unhealthy sources (>50% failure rate)
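The health rule above amounts to a windowed failure rate per source. A minimal sketch, assuming history is available as `(source_id, timestamp, ok)` tuples. That shape and the function name are hypothetical, since the on-disk format is not described here.

```python
from datetime import datetime, timedelta, timezone


def unhealthy_sources(history, now, window_days=7, max_failure_rate=0.5):
    """Return source ids whose failure rate in the window exceeds the limit."""
    cutoff = now - timedelta(days=window_days)
    totals = {}  # source_id -> (runs, failures)
    for source_id, when, ok in history:
        if when < cutoff:
            continue  # older than the tracking window
        runs, failures = totals.get(source_id, (0, 0))
        totals[source_id] = (runs + 1, failures + (0 if ok else 1))
    return sorted(
        source_id
        for source_id, (runs, failures) in totals.items()
        if failures / runs > max_failure_rate
    )
```

The comparison is strict, so a source that fails exactly half of its runs is not reported, matching the ">50%" wording.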
summarize-merged.py - Merged Data Summary
python3 scripts/summarize-merged.py --input merged.json [--top N] [--topic TOPIC]
- Human-readable summary of merged data for LLM consumption
- Shows top articles per topic with scores and metrics
User Customization
Workspace Configuration Override
Place custom configs in workspace/config/ to override defaults:
- Sources: Append new sources, disable defaults with "enabled": false
- Topics: Override topic definitions, search queries, display settings
- Merge Logic:
  - Sources with same id → user version takes precedence
  - Sources with new id → appended to defaults
  - Topics with same id → user version completely replaces default
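The merge rules above can be sketched as an id-keyed overlay. The function name `merge_by_id` is hypothetical. This version replaces a matching entry wholesale, which matches the stated topic behavior. Whether sources are instead merged field by field is not specified, so treat that part as an assumption.

```python
def merge_by_id(defaults, overrides):
    """Overlay user entries onto defaults, matching on the 'id' field."""
    merged = {item["id"]: item for item in defaults}
    for item in overrides:
        # Same id: the user entry wins and keeps the default's position.
        # New id: the entry is appended after the defaults.
        merged[item["id"]] = item
    return list(merged.values())
```

Calling it once for sources and once for topics yields the effective configuration. A field-level overlay for sources would replace the assignment with `{**merged.get(item["id"], {}), **item}`.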
Example Workspace Override
{
"sources": [
{
"id": "simonwillison-rss",
"enabled": false,
"note"