awesome-web-scraper

A collection of awesome web scrapers and crawlers.

294 GitHub Stars

Curated Resources

Apache Nutch
Highly extensible, highly scalable Web crawler. Pluggable parsing, protocols, storage and indexing.
crawler4j
Open source web crawler for Java that provides a simple interface for crawling the Web. With it, you can set up a multi-threaded web crawler in a few minutes.
Open Search Server
A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
websphinx
Website-Specific Processors for HTML INformation eXtraction.

casperjs
Navigation scripting & testing utility for PhantomJS and SlimerJS.
jsdom
A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js.
lightcrawler
Crawl a website and run it through Google Lighthouse.
nightmare
Nightmare is a high-level wrapper for PhantomJS that lets you automate browser tasks.
node-crawler
Web Crawler/Spider for NodeJS + server-side jQuery.
node-simplecrawler
Flexible event driven crawler for node.

ccrawler
Built with C# 3.5. It contains a simple extension of a web content categorizer, which can separate web pages according to their content.

Crawler
A library for Rapid Web Crawler and Scraper Development.
DiDOM
Simple and fast HTML parser.
Goutte
Goutte, a simple PHP Web Scraper.
PHPCrawl
PHPCrawl is a framework for crawling/spidering websites written in PHP.
simple_html_dom
Just a Simple HTML DOM library fork.

ebot
Open source web crawler built on top of a NoSQL database (Apache CouchDB, Riak), an AMQP message broker (RabbitMQ), webmachine and mochiweb.

extractnet
Machine-learning-based content and metadata extraction framework for Python.
gdom
gdom, DOM Traversing and Scraping using GraphQL.
Scrapegraph-ai
An open source library for scraping with the help of AI.
scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
trafilatura
Library and command-line tool to extract metadata, main text, and comments.

fetchbot
A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
gocrawl
Polite, slim and concurrent web crawler.

HTTrack
Website copier and offline browser utility that downloads a site to a local directory.

Showing a sample of 34 resources. View the full list on GitHub.