├── CONTRIBUTING.md
├── LICENSE
├── .gitignore
└── README.md

/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# Contribution Guidelines

## Adding to this list

Please ensure your pull request adheres to the following guidelines:

- Make an individual pull request for each suggestion.
- Use the following format: `[Resource Name](link) - Short description.`
- Don't repeat the resource name in the description.
- Link additions should be added in alphabetical order to the relevant category.
- Titles should be [capitalized](http://grammar.yourdictionary.com/capitalization/rules-for-capitalization-in-titles.html).
- Check your spelling and grammar.
- New categories or improvements to the existing categorization are welcome.
- The pull request and commit should have a useful title.

Thank you for your suggestions!
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2016 Bruce Tang

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Created by .ignore support plugin (hsz.mobi)
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# dotenv
.env

# virtualenv
venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject

# PyCharm IDE
.idea

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Awesome-crawler ![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)
A collection of awesome web crawlers, spiders and resources in different languages.

## Python
* [Scrapy](https://github.com/scrapy/scrapy) - A fast, high-level screen scraping and web crawling framework.
* [django-dynamic-scraper](https://github.com/holgerd77/django-dynamic-scraper) - Creating Scrapy scrapers via the Django admin interface.
* [Scrapy-Redis](https://github.com/rolando/scrapy-redis) - Redis-based components for Scrapy.
* [scrapy-cluster](https://github.com/istresearch/scrapy-cluster) - Uses Redis and Kafka to create a distributed on-demand scraping cluster.
* [distribute_crawler](https://github.com/gnemoug/distribute_crawler) - Uses Scrapy, Redis, MongoDB and Graphite to create a distributed spider.
* [pyspider](https://github.com/binux/pyspider) - A powerful spider system.
* [cola](https://github.com/chineking/cola) - A distributed crawling framework.
* [Demiurge](https://github.com/matiasb/demiurge) - PyQuery-based scraping micro-framework.
* [Scrapely](https://github.com/scrapy/scrapely) - A pure-Python HTML screen-scraping library.
* [feedparser](http://pythonhosted.org/feedparser/) - Universal feed parser.
* [you-get](https://github.com/soimort/you-get) - Dumb downloader that scrapes the web.
* [Grab](http://grablib.org/) - Site scraping framework.
* [MechanicalSoup](https://github.com/hickford/MechanicalSoup) - A Python library for automating interaction with websites.
* [portia](https://github.com/scrapinghub/portia) - Visual scraping for Scrapy.
* [crawley](https://github.com/jmg/crawley) - Pythonic crawling/scraping framework based on non-blocking I/O operations.
* [RoboBrowser](https://github.com/jmcarp/robobrowser) - A simple, Pythonic library for browsing the web without a standalone web browser.
* [MSpider](https://github.com/manning23/MSpider) - A simple, easy spider using gevent and JS rendering.
* [brownant](https://github.com/douban/brownant) - A lightweight web data extracting framework.
* [PSpider](https://github.com/xianhu/PSpider) - A simple spider framework in Python 3.
* [Gain](https://github.com/gaojiuli/gain) - Web crawling framework based on asyncio, for everyone.
* [sukhoi](https://github.com/iogf/sukhoi) - Minimalist and powerful web crawler.
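
As a quick taste of the Python tools above, here is a minimal spider sketch using Scrapy, the first entry in the list. It follows the style of Scrapy's own tutorial; the demo site and CSS selectors are only illustrative, so adapt them to whatever you actually want to crawl.

```python
# Minimal Scrapy spider sketch (save as quotes_spider.py).
# quotes.toscrape.com is a public practice site used in Scrapy's documentation;
# the selectors below match its markup and stand in for your own target's markup.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "Next" pagination link, if there is one.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

With Scrapy installed (`pip install scrapy`), run it with `scrapy runspider quotes_spider.py -o quotes.json`.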

## Java
* [Apache Nutch](http://nutch.apache.org/) - Highly extensible, highly scalable web crawler for production environments.
* [anthelion](https://github.com/yahoo/anthelion) - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
* [Crawler4j](https://github.com/yasserg/crawler4j) - Simple and lightweight web crawler.
* [JSoup](http://jsoup.org/) - Scrapes, parses, manipulates and cleans HTML.
* [websphinx](http://www.cs.cmu.edu/~rcm/websphinx/) - Website-Specific Processors for HTML information extraction.
* [Open Search Server](http://www.opensearchserver.com/) - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
* [Gecco](https://github.com/xtuhcy/gecco) - An easy-to-use, lightweight web crawler.
* [WebCollector](https://github.com/CrawlScript/WebCollector) - Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
* [Webmagic](https://github.com/code4craft/webmagic) - A scalable crawler framework.
* [Spiderman](https://git.oschina.net/l-weiwei/spiderman) - A scalable, extensible, multi-threaded web crawler.
* [Spiderman2](http://git.oschina.net/l-weiwei/Spiderman2) - A distributed web crawler framework with JS rendering support.
* [Heritrix3](https://github.com/internetarchive/heritrix3) - Extensible, web-scale, archival-quality web crawler project.
* [SeimiCrawler](https://github.com/zhegexiaohuozi/SeimiCrawler) - An agile, distributed crawler framework.
* [StormCrawler](http://github.com/DigitalPebble/storm-crawler/) - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm.
* [Spark-Crawler](https://github.com/USCDataScience/sparkler) - Evolving Apache Nutch to run on Spark.
* [webBee](https://github.com/pkwenda/webBee) - A DFS web spider.

## C#
* [ccrawler](http://www.findbestopensource.com/product/ccrawler) - Built with C# 3.5. It contains a simple web content categorizer extension, which can separate web pages according to their content.
* [SimpleCrawler](https://github.com/lei-zhu/SimpleCrawler) - Simple spider based on multithreading and regular expressions.
* [DotnetSpider](https://github.com/zlzforever/DotnetSpider) - A cross-platform, lightweight spider developed in C#.
* [Abot](https://github.com/sjdirect/abot) - C# web crawler built for speed and flexibility.
* [Hawk](https://github.com/ferventdesert/Hawk) - Advanced crawler and ETL tool written in C#/WPF.
* [SkyScraper](https://github.com/JonCanning/SkyScraper) - An asynchronous web scraper / web crawler using async/await and Reactive Extensions.

## JavaScript
* [scraperjs](https://github.com/ruipgil/scraperjs) - A complete and versatile web scraper.
* [scrape-it](https://github.com/IonicaBizau/scrape-it) - A Node.js scraper for humans.
* [simplecrawler](https://github.com/cgiffard/node-simplecrawler) - Event-driven web crawler.
* [node-crawler](https://github.com/bda-research/node-crawler) - Web crawler with a clean, simple API.
* [js-crawler](https://github.com/antivanov/js-crawler) - Web crawler for Node.js; both HTTP and HTTPS are supported.
* [x-ray](https://github.com/lapwinglabs/x-ray) - Web scraper with pagination and crawler support.
* [node-osmosis](https://github.com/rchipka/node-osmosis) - HTML/XML parser and web scraper for Node.js.
* [web-scraper-chrome-extension](https://github.com/martinsbalodis/web-scraper-chrome-extension) - Web data extraction tool implemented as a Chrome extension.
* [supercrawler](https://github.com/brendonboshell/supercrawler) - Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.

## PHP
* [Goutte](https://github.com/FriendsOfPHP/Goutte) - A screen scraping and web crawling library for PHP.
* [laravel-goutte](https://github.com/dweidner/laravel-goutte) - Laravel 5 facade for Goutte.
* [dom-crawler](https://github.com/symfony/dom-crawler) - The DomCrawler component eases DOM navigation for HTML and XML documents.
* [pspider](https://github.com/hightman/pspider) - Parallel web crawler written in PHP.
* [php-spider](https://github.com/mvdbos/php-spider) - A configurable and extensible PHP web spider.

## C++
* [open-source-search-engine](https://github.com/gigablast/open-source-search-engine) - A distributed open source search engine and spider/crawler written in C/C++.

## C
* [httrack](https://github.com/xroche/httrack) - Copy websites to your computer.

## Ruby
* [upton](https://github.com/propublica/upton) - A batteries-included framework for easy web scraping. Just add CSS (or do more).
* [wombat](https://github.com/felipecsl/wombat) - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
* [RubyRetriever](https://github.com/joenorton/rubyretriever) - A web crawler, scraper & file harvester.
* [Spidr](https://github.com/postmodern/spidr) - Spider a site, multiple domains, certain links, or infinitely.
* [Cobweb](https://github.com/stewartmckee/cobweb) - Web crawler with very flexible crawling options, standalone or using Sidekiq.
* [mechanize](https://github.com/sparklemotion/mechanize) - Automated web interaction & crawling.

## R
* [rvest](https://github.com/hadley/rvest) - Simple web scraping for R.

## Erlang
* [ebot](https://github.com/matteoredaelli/ebot) - A scalable, distributed and highly configurable web crawler.

## Perl
* [web-scraper](https://github.com/miyagawa/web-scraper) - Web scraping toolkit using HTML and CSS selectors or XPath expressions.

## Go
* [pholcus](https://github.com/henrylee2cn/pholcus) - A distributed, high-concurrency and powerful web crawler.
* [gocrawl](https://github.com/PuerkitoBio/gocrawl) - Polite, slim and concurrent web crawler.
* [fetchbot](https://github.com/PuerkitoBio/fetchbot) - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
* [go_spider](https://github.com/hu17889/go_spider) - An awesome Go concurrent crawler (spider) framework.
* [dht](https://github.com/shiyanhui/dht) - BitTorrent DHT protocol and DHT spider.
* [ants-go](https://github.com/wcong/ants-go) - An open source, distributed, RESTful crawler engine in Golang.
* [scrape](https://github.com/yhat/scrape) - A simple, higher-level interface for Go web scraping.
* [creeper](https://github.com/wspl/creeper) - The next-generation crawler framework (Go).
* [colly](https://github.com/asciimoo/colly) - Fast and elegant scraping framework for Gophers.

## Scala
* [crawler](https://github.com/bplawler/crawler) - Scala DSL for web crawling.
* [scrala](https://github.com/gaocegege/scrala) - Scala crawler (spider) framework, inspired by Scrapy.
* [ferrit](https://github.com/reggoodwin/ferrit) - A web crawler service written in Scala using Akka, Spray and Cassandra.
--------------------------------------------------------------------------------