web-content-fetcher

shirenchuang/web-content-fetcher · updated Apr 8, 2026

$ npx skills add https://github.com/shirenchuang/web-content-fetcher --skill web-content-fetcher
summary

Extract clean Markdown article content from URLs with three-tier fallback strategies.

  • Implements cascading extraction methods: Jina Reader (fast, 200 requests/day free), Scrapling + html2text (unlimited, handles paywalled content), and direct web_fetch (static pages fallback)
  • Preserves Markdown structure including headings, links, images, lists, code blocks, and blockquotes
  • Domain-aware routing skips Jina for WeChat articles, Zhihu, Juejin, and CSDN to conserve quota and improve succ
skill.md

Web Content Fetcher

Given a URL, return its main content as clean Markdown — headings, links, images, lists, code blocks all preserved.

Extraction Strategy

Always try one method per URL — don't cascade blindly. Pick the right one upfront.

URL
 ├─ 1. Scrapling script (preferred)
 │     Run fetch.py — check the domain routing table to decide fast vs --stealth.
 │     Works for most sites. Returns clean Markdown directly.
 └─ 2. Jina Reader (fallback — only if Scrapling fails or dependencies not installed)
       web_fetch("https://r.jina.ai/<url>")
       Free tier: 200 req/day. Fast (~1-2s), good Markdown output.
       Does NOT work for: WeChat (403), some Chinese platforms.
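
As a rough illustration of the Jina Reader fallback in the tree above, the sketch below builds the `https://r.jina.ai/<url>` address and declines to do so for hosts the text says will fail. The helper name `jina_reader_url` and the `JINA_BLOCKED_HOSTS` set are hypothetical; only the WeChat host is taken from this document, and the "some Chinese platforms" are left unspecified because the text does not name them.

```python
from typing import Optional
from urllib.parse import urlsplit

JINA_PREFIX = "https://r.jina.ai/"

# Hosts where Jina Reader is documented above as failing (WeChat returns 403).
# Hypothetical set: extend it only with hosts you have verified yourself.
JINA_BLOCKED_HOSTS = {"mp.weixin.qq.com"}


def jina_reader_url(url: str) -> Optional[str]:
    """Return the Jina Reader address for `url`, or None if the host is known to fail."""
    host = (urlsplit(url).hostname or "").lower()
    if host in JINA_BLOCKED_HOSTS:
        return None
    # Jina Reader takes the full target URL appended to its prefix.
    return JINA_PREFIX + url
```

A caller would pass the returned address to `web_fetch` and skip the fallback entirely when the function returns `None`.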

Scrapling script

python3 <SKILL_DIR>/scripts/fetch.py "<url>" [max_chars] [--stealth]

<SKILL_DIR> is the directory where this SKILL.md lives. Resolve it before calling the script.

The script has two modes built in:

  • Default (fast): HTTP fetch, ~1-3s, works for most sites
  • --stealth: Headless browser, ~5-15s, for JS-rendered or anti-scraping sites

When run without --stealth, the script automatically falls back to stealth if the fast result has too little content. So you rarely need to specify --stealth manually — the only reason to force it is when you already know the site needs it (see routing table), which saves the initial fast attempt.
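
The automatic fallback described above could look roughly like the following. This is a minimal sketch, not the code in fetch.py: the function name, the `fast` and `stealth` callables, and the `MIN_CHARS` threshold are all assumptions, since the document does not say how "too little content" is measured.

```python
from typing import Callable, Tuple

# Hypothetical cutoff; the real script's threshold is not documented here.
MIN_CHARS = 200


def fetch_with_fallback(
    url: str,
    fast: Callable[[str], str],
    stealth: Callable[[str], str],
    force_stealth: bool = False,
    min_chars: int = MIN_CHARS,
) -> Tuple[str, str]:
    """Return (mode, markdown), trying the fast path first unless stealth is forced."""
    if force_stealth:
        # Skipping the fast attempt is the only benefit of forcing stealth.
        return "stealth", stealth(url)
    text = fast(url)
    if len(text.strip()) >= min_chars:
        return "fast", text
    # Fast result was too thin, so pay for the headless-browser fetch.
    return "stealth", stealth(url)
```

Forcing stealth for a known JS-rendered site avoids one wasted fast request, which matches the reasoning given for the `--stealth` flag.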

Domain Routing

Use this table to pick the right mode on the first call:

| Domain | Command | Why |
| --- | --- | --- |
| mp.weixin.qq.com | fetch.py <url> --stealth | JS-rendered content |
| zhuanlan.zhihu.com | fetch.py <url> --stealth | Anti-scraping + JS |
| juejin.cn | fetch.py <url> --stealth | JS-rendered SPA |
| sspai.com | fetch.py <url> | Static HTML |
| blog.csdn.net | fetch.py <url> | Static HTML |
| ruanyifeng.com | fetch.py <url> | Static blog |
| openai.com | fetch.py <url> | Static HTML |
| blog.google | fetch.py <url> | Static HTML |
| Everything else | fetch.py <url> | Auto-fallback handles it |
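
The routing table can be expressed as a small lookup. The sketch below is one possible encoding, with a hypothetical helper `needs_stealth`; matching subdomains of the listed hosts is an assumption that the table itself does not state.

```python
from urllib.parse import urlsplit

# Domains the routing table marks as needing --stealth on the first call.
STEALTH_DOMAINS = {"mp.weixin.qq.com", "zhuanlan.zhihu.com", "juejin.cn"}


def needs_stealth(url: str) -> bool:
    """Return True if the URL's host is one the table routes to --stealth."""
    host = (urlsplit(url).hostname or "").lower()
    # Assumption: subdomains of a listed host are treated like the host itself.
    return any(host == d or host.endswith("." + d) for d in STEALTH_DOMAINS)
```

Every host not in the set falls through to the default mode, where the script's own auto-fallback applies.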

Script Options

# Basic — auto-selects fast or stealth
python3 <SKILL_DIR>/scripts/fetch.py "https://sspai.com/post/73145"

# Force stealth for known JS-heavy sites
python3 <SKILL_DIR>/scripts/fetch.py "https://mp.weixin.qq.com/s/xxx" --stealth

# Limit output to 15000 characters (default: 30000)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com/article" 15000

# JSON output with metadata (url, mode, selector, content_length)
python3 <SKILL_DIR>/scripts/fetch.py "https://example.com" --json
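
When calling the script from Python rather than a shell, the options above map onto an argument list. The following is a sketch under assumptions: `build_command` and `run_fetch` are hypothetical helpers, the flag order simply mirrors the usage line shown earlier, and treating a non-zero exit code as failure is a guess about the script's behavior.

```python
import subprocess
import sys
from pathlib import Path


def build_command(skill_dir, url, max_chars=None, stealth=False, as_json=False):
    """Build the argv list for fetch.py from the documented options."""
    # sys.executable stands in for `python3` so the current interpreter is reused.
    cmd = [sys.executable, str(Path(skill_dir) / "scripts" / "fetch.py"), url]
    if max_chars is not None:
        cmd.append(str(max_chars))  # positional character limit
    if stealth:
        cmd.append("--stealth")
    if as_json:
        cmd.append("--json")
    return cmd


def run_fetch(skill_dir, url, **options):
    """Run fetch.py and return its stdout.

    Assumption: the script signals failure with a non-zero exit code, in which
    case subprocess raises CalledProcessError.
    """
    result = subprocess.run(
        build_command(skill_dir, url, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the URL as a list element rather than through a shell string avoids quoting problems with `?` and `&` in query strings.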

Install Dependencies

First use only — the script checks and tells you if anything is missing:

pip install scrapling html2text

If you are on a system-managed Python (macOS/Linux), add --break-system-packages to the pip command or install into a venv.
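
A dependency check of the kind described above can be approximated with the standard library. This is an illustrative sketch, not the script's actual check; `missing_modules` is a hypothetical helper, and it assumes the import names match the pip package names `scrapling` and `html2text`.

```python
import importlib.util

# Assumed import names for the two packages installed above.
REQUIRED = ("scrapling", "html2text")


def missing_modules(names=REQUIRED):
    """Return the top-level module names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list means both packages are available; otherwise the result names what to install, or signals that the Jina Reader fallback is the only option.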

Failure Rules

  • Same URL fails once → give up, tell the user "unable to extract content from this URL"
  • Do not retry — each failed call wastes context tokens
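
The no-retry rule can be enforced by remembering which URLs have already failed. The sketch below is one way to do that; the `OneShotFetcher` class is hypothetical, and treating an exception or empty output as a failure is an assumption about what "fails" means here.

```python
FAILURE_MESSAGE = "unable to extract content from this URL"


class OneShotFetcher:
    """Wrap a fetch callable so that each URL is attempted at most once after a failure."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._failed = set()

    def get(self, url):
        if url in self._failed:
            # Already failed once: do not spend more context tokens on it.
            return FAILURE_MESSAGE
        try:
            content = self._fetch(url)
        except Exception:
            content = ""
        if not content or not content.strip():
            self._failed.add(url)
            return FAILURE_MESSAGE
        return content
```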
general reviews

Ratings

4.6 (58 reviews)
  • Aisha Malhotra· Dec 20, 2024

    web-content-fetcher reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Ganesh Mohane· Dec 8, 2024

    web-content-fetcher is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.

  • Mateo Dixit· Dec 8, 2024

    Registry listing for web-content-fetcher matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Aanya Bhatia· Dec 8, 2024

    web-content-fetcher has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • Rahul Santra· Nov 27, 2024

    Useful defaults in web-content-fetcher — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.

  • Anaya Okafor· Nov 27, 2024

    web-content-fetcher reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • William Patel· Nov 27, 2024

    Keeps context tight: web-content-fetcher is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Omar Abebe· Nov 11, 2024

    Registry listing for web-content-fetcher matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Pratham Ware· Oct 18, 2024

    Registry listing for web-content-fetcher matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Aanya Chawla· Oct 18, 2024

    web-content-fetcher is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.
