Implements a scraping cascade architecture with four strategies: trafilatura for fast article extraction, requests with rotating user agents, Playwright with stealth mode for JavaScript-heavy sites, and async Playwright for Jupyter notebooks.
Includes poison pill detection to identify paywalls, CAPTCHAs, rate limits, Cloudflare blocks, and login walls using pattern matching and status code analysis.
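The poison pill detection described above might look roughly like the sketch below. It is a minimal illustration, not the skill's actual implementation: the `PoisonPill` class, the `detect_poison_pill` function, the regular expressions, and the status code mapping are all hypothetical and would need tuning against real responses.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical body patterns, for illustration only.
POISON_PATTERNS = {
    "captcha": re.compile(r"captcha|verify you are (a )?human", re.I),
    "cloudflare": re.compile(r"checking your browser|attention required", re.I),
    "paywall": re.compile(r"subscribe to (continue|read)|subscribers only", re.I),
    "login_wall": re.compile(r"(sign|log) ?in to (continue|view|read)", re.I),
    "rate_limit": re.compile(r"too many requests|rate limit exceeded", re.I),
}

# Status codes that often signal blocking (an assumption, not a complete list).
STATUS_HINTS = {401: "login_wall", 402: "paywall", 403: "blocked", 429: "rate_limit"}


@dataclass
class PoisonPill:
    kind: str    # category of block, e.g. "captcha"
    reason: str  # what triggered the detection


def detect_poison_pill(status_code: int, text: str) -> Optional[PoisonPill]:
    """Return a PoisonPill if the response looks blocked, else None."""
    # Status code analysis comes first: it is cheap and unambiguous.
    if status_code in STATUS_HINTS:
        return PoisonPill(STATUS_HINTS[status_code], f"HTTP {status_code}")
    # Then pattern matching on the response body.
    for kind, pattern in POISON_PATTERNS.items():
        match = pattern.search(text)
        if match:
            return PoisonPill(kind, f"matched {match.group(0)!r}")
    return None
```

A caller would typically run this check on every response before accepting it as content, and move to the next strategy (or stop retrying) when a pill is found.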
Confirm successful installation by checking the skill directory location:
.cursor/skills/web-scraping
Restart Cursor to activate web-scraping. Access via /web-scraping in your agent's command palette.
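The directory check above can also be scripted. This is a small sketch that assumes the path is relative to the project root; the `skill_installed` helper is a hypothetical name, not part of the skill.

```python
from pathlib import Path


def skill_installed(project_root: str = ".", skill: str = "web-scraping") -> bool:
    """Return True if <project_root>/.cursor/skills/<skill> is a directory."""
    return (Path(project_root) / ".cursor" / "skills" / skill).is_dir()


if __name__ == "__main__":
    # Assumes the script is run from the project root.
    print("installed" if skill_installed() else "not found")
```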
Security Notice
We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Skills execute code in your environment, so always review the skill's source code, verify the publisher's reputation, and test in isolation before production use.
Patterns for reliable, ethical web scraping with fallback strategies and anti-bot handling.
Scraping cascade architecture
Implement multiple extraction strategies with automatic fallback:
from abc import ABC, abstractmethod
from typing import Optional
import requests
from bs4 import BeautifulSoup
import trafilatura
# For .py files
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

# For .ipynb files
import asyncio
from playwright.async_api import async_playwright


class ScrapingResult:
    def __init__(self, content: str, title: str, method: str):
        self.content = content
        self.title = title
        self.method = method  # Track which method succeeded


class Scraper(ABC):
    @abstractmethod
    def fetch(self, url: str) -> Optional[ScrapingResult]:
        ...


class TrafilaturaScraper(Scraper):
    """Fast, lightweight extraction for standard articles."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            downloaded = trafilatura.fetch_url(url)
            if not downloaded:
                return None
            content = trafilatura.extract(
                downloaded,
                include_comments=False,
                include_tables=True,
                favor_recall=True,
            )
            if not content or len(content) < 100:
                return None
            # Extract title separately
            soup = BeautifulSoup(downloaded, 'html.parser')
            title = soup.find('title')
            title_text = title.get_text() if title else ''
            return ScrapingResult(content, title_text, 'trafilatura')
        except Exception:
            return None


class RequestsScraper(Scraper):
    """HTTP requests with rotating user agents."""

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    ]

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        import random
        headers = {
            'User-Agent': random.choice(self.USER_AGENTS),
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Language': 'en-US,en;q=0.9',
        }
        try:
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            # Remove script, style, and page-chrome elements
            for element in soup(['script', 'style', 'nav', 'footer', 'aside']):
                element.decompose()
            # Find main content
            main = soup.find('main') or soup.find('article') or soup.find('body')
            content = main.get_text(separator='\n', strip=True) if main else ''
            title = soup.find('title')
            title_text = title.get_text() if title else ''
            if len(content) < 100:
                return None
            return ScrapingResult(content, title_text, 'requests')
        except Exception:
            return None


class PlaywrightScraper(Scraper):
    """Heavy JavaScript rendering with stealth mode for anti-bot bypass."""

    def fetch(self, url: str) -> Optional[ScrapingResult]:
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True)
                context = browser.new_context(
                    viewport={'width': 1920, 'height': 1080},
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                )
                page = context.new_page()
                # Apply stealth to avoid detection
                stealth_sync(page)
                page.goto(url, wait_until='networkidle', timeout=60000)
                # Wait for content to load
                page.wait_for_timeout(2000)
                # Extract content
                content = page.evaluate('''() => {
                    const article = document.querySelector('article, main, .content, #content');
                    return article ? article.innerText : document.body.innerText;
                }''')
                title = page.title()
                browser.close()
                if len(content) < 100:
                    return None
                return ScrapingResult(content, title, 'playwright')
        except Exception:
            return None


class PlaywrightScraperAsync:
    """Async Playwright scraper for Jupyter notebooks (.ipynb files).
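
The listing above defines the individual strategies but stops before showing how they are chained. The automatic fallback could be sketched as follows. This is an illustration under assumptions, not the skill's own code: `Result`, `Fetcher`, `scrape_with_cascade`, and the stub fetchers are hypothetical names, and the stubs stand in for the real scrapers so the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Result:
    content: str
    title: str
    method: str  # which strategy produced the content


# A fetcher takes a URL and returns a Result, or None on failure.
Fetcher = Callable[[str], Optional[Result]]


def scrape_with_cascade(url: str, fetchers: Iterable[Fetcher]) -> Optional[Result]:
    """Try each fetcher in order and return the first non-None result."""
    for fetch in fetchers:
        try:
            result = fetch(url)
        except Exception:
            # Treat an unexpected error as a failed strategy and move on.
            result = None
        if result is not None:
            return result
    return None


# Stub fetchers standing in for the real strategies.
def failing_fetcher(url: str) -> Optional[Result]:
    return None


def stub_fetcher(url: str) -> Optional[Result]:
    return Result(content="x" * 200, title="Example", method="stub")


result = scrape_with_cascade("https://example.com", [failing_fetcher, stub_fetcher])
print(result.method if result else "all strategies failed")  # prints "stub"
```

In practice the list would hold the `fetch` methods of the scraper classes, ordered from cheapest to most expensive, so the browser-based strategy only runs when the lighter ones return nothing.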
Implementation Guide
Prerequisites
- Claude Desktop or compatible AI client with skill support
- Clear understanding of task or problem to solve
- Willingness to iterate and refine outputs
Time Estimate
15-45 minutes depending on use case complexity
Steps
1. Install the skill using the provided installation command
2. Test with a simple use case relevant to your work
3. Evaluate output quality and relevance
4. Iterate on prompts to improve results
5. Integrate into your regular workflow if valuable
Common Pitfalls
- Expecting perfect results without iteration
- Not providing enough context in prompts
- Using the skill for tasks outside its intended scope
- Accepting outputs without review and validation
Best Practices
Do
+ Start with clear, specific prompts
+ Provide relevant context and constraints
+ Review and refine all outputs before using
+ Iterate to improve output quality
+ Document successful prompt patterns
Don't
- Don't use without understanding skill limitations
- Don't skip validation of outputs
- Don't share sensitive information in prompts
- Don't expect skill to replace human judgment
Pro Tips
- Be specific about desired format and style
- Ask for multiple options to choose from
- Request explanations to understand reasoning
- Combine AI efficiency with human expertise
When to Use This
Use when
Use when the skill's capabilities match your task, the time saved gives a clear ROI, and you can validate the outputs. Best for repetitive tasks, learning, and quality improvement.
Avoid when
Avoid when the task requires deep expertise you can't validate, involves sensitive decisions, or when the learning process is more valuable than speed of completion.
Learning Path
1. Familiarize yourself with skill capabilities and limitations
2. Start with low-risk, non-critical tasks
3. Progress to more complex and valuable use cases
4. Build expertise through regular use and experimentation