awesome-web-scraper
github.com/duyet/awesome-web-scraper
A collection of awesome web scrapers and crawlers.
What's inside
Java
- Apache Nutch
Highly extensible, highly scalable Web crawler. Pluggable parsing, protocols, storage and indexing.
- crawler4j
Open source web crawler for Java that provides a simple interface for crawling the Web. With it, you can set up a multi-threaded web crawler in a few minutes.
- Open Search Server
A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
- websphinx
Website-Specific Processors for HTML INformation eXtraction.
Nodejs
- casperjs
Navigation scripting & testing utility for PhantomJS and SlimerJS.
- jsdom
A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js.
- lightcrawler
Crawl a website and run it through Google Lighthouse.
- nightmare
Nightmare is a high-level wrapper for PhantomJS that lets you automate browser tasks.
- node-crawler
Web Crawler/Spider for NodeJS + server-side jQuery.
- node-simplecrawler
Flexible event-driven crawler for Node.
C#
- ccrawler
Built with C# 3.5. It contains a simple extension, a web content categorizer, which can separate web pages according to their content.
PHP
- Crawler
A library for Rapid Web Crawler and Scraper Development.
- DiDOM
Simple and fast HTML parser.
- Goutte
Goutte, a simple PHP Web Scraper.
- PHPCrawl
PHPCrawl is a framework for crawling/spidering websites written in PHP.
- simple_html_dom
Just a Simple HTML DOM library fork.
Erlang
- ebot
Open source web crawler built on top of a NoSQL database (Apache CouchDB, Riak), an AMQP message broker (RabbitMQ), webmachine and mochiweb.
Python
- extractnet
Machine-learning-based content and metadata extraction framework for Python.
- gdom
gdom, DOM Traversing and Scraping using GraphQL.
- Scrapegraph-ai
An open source library for scraping with the help of AI.
- scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
- trafilatura
Library and command-line tool to extract metadata, main text, and comments.
Go
C/C++
- HTTrack
Offline browser utility that downloads a website to a local directory, preserving its link structure.
Showing a sample of 34 resources. View the full list on GitHub →