
A collection of awesome web scrapers and crawlers.

GitHub Stars: 290
Curated Resources: 34
Categories: 10
Last Refreshed: 5 hours ago
Java, C/C++, C#, Erlang, Python, PHP, Nodejs, Ruby, Go, Rust

What's inside

Java

  • Apache Nutch

    Highly extensible, highly scalable Web crawler. Pluggable parsing, protocols, storage and indexing.

  • crawler4j

    Open source web crawler for Java that provides a simple interface for crawling the Web. Using it, you can set up a multi-threaded web crawler in a few minutes.

  • Open Search Server

    A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.

  • websphinx

    Website-Specific Processors for HTML INformation eXtraction.

Nodejs

  • casperjs

    Navigation scripting & testing utility for PhantomJS and SlimerJS.

  • jsdom

    A JavaScript implementation of the WHATWG DOM and HTML standards, for use with Node.js.

  • lightcrawler

    Crawl a website and run it through Google Lighthouse.

  • nightmare

    Nightmare is a high-level wrapper for PhantomJS (Electron since version 2) that lets you automate browser tasks.

  • node-crawler

    Web Crawler/Spider for NodeJS + server-side jQuery.

  • node-simplecrawler

    Flexible event-driven crawler for Node.

C#

  • ccrawler

    Built in C# 3.5. It contains a simple extension of a web content categorizer, which can separate web pages depending on their content.

PHP

  • Crawler

    A library for Rapid Web Crawler and Scraper Development.

  • DiDOM

    Simple and fast HTML parser.

  • Goutte

    Goutte, a simple PHP Web Scraper.

  • PHPCrawl

    PHPCrawl is a framework for crawling/spidering websites written in PHP.

  • simple_html_dom

    Just a Simple HTML DOM library fork.

Erlang

  • ebot

    Open source web crawler built on top of a NoSQL database (Apache CouchDB, Riak), an AMQP message broker (RabbitMQ), webmachine and mochiweb.

Python

  • extractnet

    Machine-learning-based content and metadata extraction framework for Python.

  • gdom

    gdom, DOM Traversing and Scraping using GraphQL.

  • Scrapegraph-ai

    An open source library for scraping with the help of AI.

  • scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

  • trafilatura

    Library and command-line tool to extract metadata, main text, and comments.
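The crawlers and scrapers listed here share one basic step: pulling links out of a fetched page and resolving them against the page URL. The sketch below illustrates that step using only the Python standard library. It is not the API of any library in this list; `LinkCollector`, `extract_links`, and the example URLs are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (illustrative helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links a crawler would follow.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute link targets found in an HTML string, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A real crawler would add deduplication, URL filtering, and error handling on top of this, which is largely what the frameworks above provide.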

Go

  • fetchbot

    A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

  • gocrawl

    Polite, slim and concurrent web crawler.
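Several entries describe themselves as "polite" or as following robots.txt policies and crawl delays. As a rough illustration of what that means, the sketch below checks a URL against robots.txt rules with Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the helper names are assumptions made for the example and are inlined so the code runs without network access; this is not how fetchbot or gocrawl implement it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(policy, user_agent, url):
    """Return True if the rules allow this user agent to fetch the URL."""
    return policy.can_fetch(user_agent, url)


policy = build_policy(ROBOTS_TXT)

# Seconds a polite crawler should wait between requests, or None if unspecified.
delay = policy.crawl_delay("example-bot")
```

A polite crawler would call `may_fetch` before each request and sleep for `delay` seconds between requests to the same host.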

C/C++

  • HTTrack

    Website copier and offline browser utility. Downloads a website to a local directory so it can be browsed offline.

Showing a sample of 34 resources. View the full list on GitHub →