URL to Markdown
Fetches any URL via baoyu-fetch CLI (Chrome CDP + site-specific adapters) and converts it to clean markdown.
CLI Setup
Important: The CLI source is vendored in the scripts/vendor/baoyu-fetch/ subdirectory of this skill.
Agent Execution Instructions:
- Determine this SKILL.md file's directory path as
{baseDir}
- CLI entry point =
{baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts
- Resolve
${BUN_X} runtime: if bun installed β bun; if npx available β npx -y bun; else suggest installing bun
${READER} = ${BUN_X} {baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts
- Replace all
${READER} in this document with the resolved value
Preferences (EXTEND.md)
Check EXTEND.md existence (priority order):
test -f .baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo "project"
test -f "${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "xdg"
test -f "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "user"
if (Test-Path .baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { "project" }
$xdg = if ($env:XDG_CONFIG_HOME) { $env:XDG_CONFIG_HOME } else { "$HOME/.config" }
if (Test-Path "$xdg/baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "xdg" }
if (Test-Path "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "user" }
| Path |
Location |
.baoyu-skills/baoyu-url-to-markdown/EXTEND.md |
Project directory |
$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md |
User home |
| Result |
Action |
| Found |
Read, parse, apply settings |
| Not found |
MUST run first-time setup (see below) β do NOT silently create defaults |
EXTEND.md Supports: Download media by default | Default output directory
First-Time Setup (BLOCKING)
CRITICAL: When EXTEND.md is not found, you MUST use AskUserQuestion to ask the user for their preferences before creating EXTEND.md. NEVER create EXTEND.md with defaults without asking. This is a BLOCKING operation β do NOT proceed with any conversion until setup is complete.
Use AskUserQuestion with ALL questions in ONE call:
Question 1 β header: "Media", question: "How to handle images and videos in pages?"
- "Ask each time (Recommended)" β After saving markdown, ask whether to download media
- "Always download" β Always download media to local imgs/ and videos/ directories
- "Never download" β Keep original remote URLs in markdown
Question 2 β header: "Output", question: "Default output directory?"
- "url-to-markdown (Recommended)" β Save to ./url-to-markdown/{domain}/{slug}.md
- (User may choose "Other" to type a custom path)
Question 3 β header: "Save", question: "Where to save preferences?"
- "User (Recommended)" β ~/.baoyu-skills/ (all projects)
- "Project" β .baoyu-skills/ (this project only)
After user answers, create EXTEND.md at the chosen location, confirm "Preferences saved to [path]", then continue.
Full reference: references/config/first-time-setup.md
Supported Keys
| Key |
Default |
Values |
Description |
download_media |
ask |
ask / 1 / 0 |
ask = prompt each time, 1 = always download, 0 = never |
default_output_dir |
empty |
path or empty |
Default output directory (empty = ./url-to-markdown/) |
EXTEND.md β CLI mapping:
| EXTEND.md key |
CLI argument |
Notes |
download_media: 1 |
--download-media |
Requires --output to be set |
default_output_dir: ./posts/ |
Agent constructs --output ./posts/{domain}/{slug}.md |
Agent generates path, not a direct CLI flag |
Value priority:
- CLI arguments (
--download-media, --output)
- EXTEND.md
- Skill defaults
Features
- Chrome CDP for full JavaScript rendering via
baoyu-fetch CLI
- Site-specific adapters: X/Twitter, YouTube, Hacker News, generic (Defuddle)
- Automatic adapter selection based on URL, or force with
--adapter
- Interaction gate detection: Cloudflare, reCAPTCHA, hCAPTCHA, custom challenges
- Two capture modes: headless (default) or interactive with wait-for-interaction
- Clean markdown output with YAML front matter
- Structured JSON output available via
--format json
- X/Twitter: extracts tweets, threads, and X Articles with media
- YouTube: transcript/caption extraction, chapters, cover images
- Hacker News: threaded comment parsing with proper nesting
- Generic: Defuddle extraction with Readability fallback
- Download images and videos to local directories
- Chrome profile persistence for authenticated sessions
- Debug artifact output for troubleshooting
Usage
${READER} <url>
${READER} <url> --output article.md
${READER} <url> --output article.md --download-media
${READER} <url> --headless --output article.md
${READER} <url> --wait-for interaction --output article.md
${READER} <url> --wait-for force --output article.md
${READER} <url> --format json --output article.json
${READER} <url> --adapter youtube --output transcript.md
${READER} <url> --cdp-url http://localhost:9222 --output article.md
${READER} <url> --output article.md --debug-dir ./debug/
Options
| Option |
Description |
<url> |
URL to fetch |
--output <path> |
Output file path (default: stdout) |
--format <type> |
Output format: markdown (default) or json |
--json |
Shorthand for --format json |
--adapter <name> |
Force adapter: x, youtube, hn, or generic (default: auto-detect) |
--headless |
Force headless Chrome (no visible window) |
--wait-for <mode> |
Interaction wait mode: none (default), interaction, or force |
--wait-for-interaction |
Alias for --wait-for interaction |
--wait-for-login |
Alias for --wait-for interaction |
--timeout <ms> |
Page load timeout (default: 30000) |
--interaction-timeout <ms> |
Login/CAPTCHA wait timeout (default: 600000 = 10 min) |
--interaction-poll-interval <ms> |
Poll interval for interaction checks (default: 1500) |
--download-media |
Download images/videos to local imgs/ and videos/, rewrite markdown links. Requires --output |
--media-dir <dir> |
Base directory for downloaded media (default: same as --output directory) |
--cdp-url <url> |
Reuse existing Chrome DevTools Protocol endpoint |
--browser-path <path> |
Custom Chrome/Chromium binary path |
--chrome-profile-dir <path> |
Chrome user data directory (default: BAOYU_CHROME_PROFILE_DIR env or ./baoyu-skills/chrome-profile) |
--debug-dir <dir> |
Write debug artifacts (document.json, markdown.md, page.html, network.json) |
Capture Modes
| Mode |
Behavior |
Use When |
| Default |
Headless Chrome, auto-extract on network idle |
Public pages, static content |
--headless |
Explicit headless (same as default) |
Clarify intent |
--wait-for interaction |
Opens visible Chrome, auto-detects login/CAPTCHA gates, waits for them to clear, then continues |
Login-required, CAPTCHA-protected |
--wait-for force |
Opens visible Chrome, auto-detects OR accepts Enter keypress to continue |
Complex flows, lazy loading, paywalls |
Interaction gate auto-detection:
- Cloudflare Turnstile / "just a moment" pages
- Google reCAPTCHA
- hCaptcha
- Custom challenge / verification screens
Wait-for-interaction workflow:
- Run with
--wait-for interaction β Chrome opens visibly
- CLI auto-detects login/CAPTCHA gates
- User completes login or solves CAPTCHA in the browser
- CLI auto-detects gate cleared β captures page
- If
--wait-for force is used, user can also press Enter to trigger capture manually
Agent Quality Gate
CRITICAL: The agent must treat default headless capture as provisional. Some sites render differently in headless mode and can silently return low-quality content without causing the CLI to fail.
After every headless run, the agent MUST inspect the saved markdown output.
Quality checks the agent must perform
- Confirm the markdown title matches the target page, not a generic site shell
- Confirm the body contains the expected article or page content, not just navigation, footer, or a generic error
- Watch for obvious failure signs:
Application error
This page could not be found
- Login, signup, subscribe, or verification shells
- Extremely short markdown for a page that should be long-form
- Raw framework payloads or mostly boilerplate content
- If the result is low quality, incomplete, or clearly wrong, do not accept the run as successful just because the CLI exited with code 0
Tip: Use --format json to get structured output including status, login.state, and interaction fields for programmatic quality assessment. A "status": "needs_interaction" response means the page requires manual interaction.
Recovery workflow the agent must follow
- First run headless (default) unless there is already a clear reason to use interaction mode
- Review markdown quality immediately after the run
- If the content is low quality or indicates login/CAPTCHA:
--wait-for interaction for auto-detected gates (login, CAPTCHA, Cloudflare)
--wait-for force when the page needs manual browsing, scroll loading, or complex interaction
- If
--wait-for is used, tell the user exactly what to do:
- If login is required, ask them to sign in in the browser
- If CAPTCHA appears, ask them to solve it
- If the page needs time to load, ask them to wait until content is visible
- For
--wait-for force: tell them to press Enter when ready
- If JSON output shows
"status": "needs_interaction", switch to --wait-for interaction automatically
Output Path Generation
The agent must construct the output file path since baoyu-fetch does not auto-generate paths.
Algorithm:
- Determine base directory from EXTEND.md
default_output_dir or default ./url-to-markdown/
- Extract domain from URL (e.g.,
example.com)
- Generate slug from URL path or page title (kebab-case, 2-6 words)
- Construct:
{base_dir}/{domain}/{slug}/{slug}.md β each URL gets its own directory so media files stay isolated
- Conflict resolution: append timestamp
{slug}-YYYYMMDD-HHMMSS/{slug}-YYYYMMDD-HHMMSS.md
Pass the constructed path to --output. Media files (--download-media) are saved into subdirectories next to the markdown file, keeping each URL's assets self-contained.
Output Format
Markdown output to stdout (or file with --output) as clean markdown text.
JSON output (--format json) returns structured data including:
adapter β which adapter handled the URL
status β "ok" or "needs_interaction"
login β login state detection (logged_in, logged_out, unknown)
interaction β interaction gate details (kind, provider, prompt)
document β structured content (url, title, author, publishedAt, content blocks, metadata)
media β collected media assets with url, kind, role
markdown β converted markdown text
downloads β media download results (when --download-media used)
When --download-media is enabled:
- Images are saved to
imgs/ next to the output file (or in --media-dir)
- Videos are saved to
videos/ next to the output file (or in --media-dir)
- Markdown media links are rewritten to local relative paths
Built-in Adapters
| Adapter |
URLs |
Key Features |
x |
x.com, twitter.com |
Tweets, threads, X Articles, media, login detection |
youtube |
youtube.com, youtu.be |
Transcript/captions, chapters, cover image, metadata |
hn |
news.ycombinator.com |
Threaded comments, story metadata, nested replies |
generic |
Any URL (fallback) |
Defuddle extraction, Readability fallback, auto-scroll, network idle detection |
Adapter is auto-selected based on URL. Use --adapter <name> to override.
Media Download Workflow
Based on download_media setting in EXTEND.md:
| Setting |
Behavior |
1 (always) |
Run CLI with --download-media --output <path> |
0 (never) |
Run CLI with --output <path> (no media download) |
ask (default) |
Follow the ask-each-time flow below |
Ask-Each-Time Flow
- Run CLI without
--download-media with --output <path> β markdown saved
- Check saved markdown for remote media URLs (
https:// in image/video links)
- If no remote media found β done, no prompt needed
- If remote media found β use
AskUserQuestion:
- header: "Media", question: "Download N images/videos to local files?"
- "Yes" β Download to local directories
- "No" β Keep remote URLs
- If user confirms β run CLI again with
--download-media --output <same-path> (overwrites markdown with localized links)
Environment Variables
| Variable |
Description |
BAOYU_CHROME_PROFILE_DIR |
Chrome user data directory (can also use --chrome-profile-dir) |
Troubleshooting: Chrome not found β use --browser-path. Timeout β increase --timeout. Login/CAPTCHA pages β use --wait-for interaction. Debug β use --debug-dir to inspect captured HTML and network logs.