├── README-ZH.md
└── README.md

/README-ZH.md:
--------------------------------------------------------------------------------
Chinese version

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# awesome-web-data-extractor
A curated list of promising web data extractor resources

80legs - Powerful and Economical Service Platform for Crawling and Processing Web Content
http://www.80legs.com/
Agenty – Hosted Web Scraping Tool
https://www.agenty.com/
Anthracite
http://freecode.com/projects/anthracite
Aristo - Answer Questions with a Knowledgeable Machine http://allenai.org/aristo/
artoo.js - The Client-Side Scraping Companion http://medialab.github.io/artoo/
AutoMate - Automate Data Extraction
https://www.networkautomation.com/

Automated RSS Scraper Scripts
http://www.djeaux.com/rss/
Automated Information Solutions
http://www.automated-info-solutions.com/
Automatic Information Extraction From Semi-Structured Web Pages By Pattern Discovery
http://portal.acm.org/citation.cfm?id=640423&dl=ACM&coll=portal
Beautiful Soup
http://freecode.com/projects/beautifulsoup
Beautiful Soup - HTML/XML Parser for Quick Turnaround Screen Scraping and Web Data Extraction http://www.crummy.com/software/BeautifulSoup/
BLIASoft Knowledge Discovery http://www.bliasoft.com/Eindex.html
Bot Research
http://www.BotResearch.info/
BYU Data Extraction Research Group
http://www.deg.byu.edu/
Captiva Software: Digital Information Capture Software
http://www.emc.com/enterprise-content-management/captiva/captiva.htm
ChartSearch Data Search Technology
http://www.ChartSearch.net/
Client-Side Deep Web Data Extraction
http://www.tic.udc.es/~mad/publications/ceceast2004.pdf
CloudScrape – Extract, Enrich and Connect
http://www.cloudscrape.com/
Common Crawl
http://www.commoncrawl.org/

Connotate – Web Data Extraction and Monitoring
http://www.connotate.com/
Content Grabber – Extract Data from Websites
http://www.ContentGrabber.com/
ContextMiner - Tools to Collect Data, Metadata and Contextual Information http://www.contextminer.org/
cQuery - Content Query Engine
http://cquery.com/
CrawlMonster
http://www.crawlmonster.com/
Crawly
http://crawly.diffbot.com/
Create a Crawler - Extract Data From an Entire Website https://www.import.io/
cURL groks URLs - Command Line Tool for Transferring Data http://curl.haxx.se/
Data Extraction Services
http://www.dataextractionservices.com/
DataHen – Advanced Web Scraping and Data Extraction Services
https://www.datahen.com/
Data Mining Resources
http://www.DataMiningResources.info/
Data Miner – Extract Data From any Website in Seconds
https://data-miner.io/
Dataminr - Real-time Information Discovery http://www.dataminr.com/
Data Scraper – Easy Web Scraping with Google Chrome
https://chrome.google.com/webstore/detail/data-scraper-easy-web-scr/nndknepjnldbdbepjfgmncbggmopgden?hl=en-US

DataSift - Powerful Social Data Platform http://datasift.com/
Data Toolbar – Web Data Extraction Software Made Simple
http://datatoolbar.com/
DataWatch Monarch – Self-Service Data Preparation
http://www.datawatch.com/
DataWrangler - Data Cleaning and Transformation Tool http://vis.stanford.edu/wrangler/
Deep Web Research 2017
http://www.DeepWebResearch.info/
DEiXTo – Powerful Web Data Extraction Tool Based on W3C DOM
http://deixto.com/
dexi.io – Web Data Processing for Professionals – Extract, Enrich and Connect
https://dexi.io/
DiffBot – Web Data Extraction Using Artificial Intelligence
http://www.DiffBot.com/
Digital Footprints - Collect Facebook Data http://digitalfootprints.dk/
DiscoverText - Import, Sort, Distribute and Analyze Electronic Content from eMail, Document Repositories, and Social Media http://discovertext.com/
Easy PDF Cloud https://www.easypdfcloud.com/
Easy Web Extract – Best Tool for Web Scraping
http://webextract.net/
eGrabber - Data Capture Tools
http://www.egrabber.com/
Facepager - Fetching Public Data From Facebook https://github.com/strohne/Facepager

FeedsAPI - Extract Content from Web Pages Tool http://www.feedsapi.com/
Ficstar Software - Web Data Extraction
http://www.ficstar.com/
File Information Tool Set (FITS) https://projects.iq.harvard.edu/fits
FMiner – Web Scraping Software
http://www.fminer.com/
Fresh WebSuction
http://www.freshwebmaster.com/
Grabby
https://grabby.io/
Grepsr – Web Scraping Made Simple, Fast and Manageable
https://www.grepsr.com/
Helium Scraper
http://www.heliumscraper.com/
Huginn - Your Agents Are Standing By https://github.com/cantino/huginn
iMacros – Data Extraction
http://imacros.net/overview
Imagination Engines
http://www.Imagination-Engines.com/
Import.io - Turn the Web Into Data With Extractors, Crawlers and Connectors https://import.io/
InfoExtractor - Extracts Relevant Information from Blogs, YouTube and Twitter http://www.infoextractor.org/
Information Retrieval (IR) and Information Extraction (IE) on the Web
http://www.webir.org/

Introduction to Information Retrieval
http://www-nlp.stanford.edu/IR-book/
iOpus Internet Macros
http://www.iopus.com/imacros/
iRobotSoft – Visual Web Scraping and Web Automation
http://irobotsoft.com/
iWeb Scraping Services
http://www.iwebscraping.com/
Junar - Discovering Data http://www.junar.com/
Karma - Data Integration Tool
http://www.isi.edu/integration/karma/
Kimono - Turn Website Into Structured APIs From Your Browser In Seconds https://www.kimonolabs.com/
Knowledge Discovery Resources
http://www.KnowledgeDiscovery.info/
Knowlesys® - Web Data Extraction, Web Grabber and Screen Scraper
http://www.knowlesys.com/index.htm
Liberty Metrics – Web Scraping Services
http://libertymetrics.com/
LingPipe – Information Extraction and Data Mining Tools
http://alias-i.com/lingpipe/
Metadata Extraction Tool
http://meta-extractor.sourceforge.net/
Mozenda – Comprehensive Web Data Gathering
http://www.mozenda.com/
NCapture - Capture Web Content http://www.qsrinternational.com/products_nvivo_add-ons.aspx

Netlytic - Making Sense of Online Conversations https://netlytic.org/home/
Newprosoft – Web Data Extraction Software
http://newprosoft.com/
NewsClipper.com - Snip and Ship Dynamic News Content to Your Web Pages
http://www.newsclipper.com/
Octoparse – Automated Web Scraping Software
http://www.octoparse.com/
Online Data Extractor Tool
http://www.onlinedataextractor.com/
OutWit Hub - Harvest the Web With Your Own Web Collection Engine http://www.outwit.com/
ParseHub – Web Crawling Using Machine Learning
http://www.ParseHub.com/
Pervasive Data Management and Integration Products
http://www.pervasive.com/
Priceonomics - Crawl Data From the Web http://priceonomics.com/
QL2 Software - Unstructured Data Management and Web Mining Software
http://www.ql2.com/
Quick Code
https://quickcode.io/
REBOL Technologies
http://www.rebol.com/
SalesTools.io
https://salestools.io/
Semantic Scholar - Free Scientific Literature Search and Discovery http://allenai.org/semantic-scholar/

ScrapeForge
http://freecode.com/projects/scrapeforge
ScrapeHero
https://www.scrapehero.com/
Scraper
http://freecode.com/projects/scraper
ScrapingHub – Cloud Based Data Extraction Tool
http://www.ScrapingHub.com/
Scraping Solutions – When the Solution You Seek Seems Impossible
https://www.scrapingsolutions.com.au/
Scrapy – Open Source Web Scraping Framework for Python
http://scrapy.org/
Screen-Scraper
http://freecode.com/projects/screenscraper
Screen-Scraper – Extracts Information From Web Sites
http://www.Screen-Scraper.com/
Screenscraping the Senate by Paul Ford
http://www.xml.com/pub/a/2004/09/01/hack-congress.html
Search and Replace with TextPipe Pattern Matching
http://www.datamystic.com/textpipe.html
Sensible Code
http://sensiblecode.io/
Social Media Data Collection Tools http://socialmediadata.wikidot.com/
Software for Web Scraping
http://scraping.pro/software-for-web-scraping/
Spinn3r - Indexing the Blogosphere http://docs.spinn3r.com/#overview

SPSS Modeler
http://developer.ibm.com/predictiveanalytics
Squirro - Find, Remember, Organize and Share Important Information https://squirro.com/
STACKS - Social Media Tracker, Analyzer, & Collector Toolkit at Syracuse https://github.com/bitslabsyr/stack
TadaWeb - Clone and Amplify Human Intelligence for Web Data Collection and Analysis https://www.tadaweb.com/
Texifter - Search, Sift, Sort, Classify and Analyze http://texifter.com/
TextConverter 4
https://www.simx.com/
TextRazor - Text Analysis Infrastructure https://www.textrazor.com/
Topicgrazer - Graze On Web Pages and Documents http://www.topicscape.com/Topicgrazer/help.php
UiPath – Web Data Extraction
https://www.uipath.com/guides/web-data-extraction
Unit Miner - Web Data Extraction Software
http://www.unitminer.com/
VietSpider
http://binhgiang.sourceforge.net/
VisualScraper – Web Data Extractor
http://www.VisualScraper.com/
Visual Web Ripper – Data Extraction Software
http://www.VisualWebRipper.com/
Visual Web Task
http://www.lencom.com/VisualWTSite.html

W3C Publishes Data Extraction Language (DEL) as W3C Note
http://xml.coverpages.org/ni2001-11-06-a.html
Web Content Extractor
http://www.newprosoft.com/
Web Data Extraction
http://www.wintask.com/web-data-extraction.php
Web Data Extraction Software Data Toolbar
https://webdataextractionsoftwaredatatoolbar.en.softonic.com/
Web Data Extractor
http://www.rafasoft.com/
Web Data Extractor
http://www.webextractor.com/
Web Data Extractor
http://fivesmallq.github.io/web-data-extractor
Web Data Extractor
http://www.lantechsoft.com/web-data-extractor.html
Web Data Guru – Web Data Extraction and Scraping Services
http://www.webdataguru.com/
Web-Harvest – Open Source Web Data Extraction Tool
http://web-harvest.sourceforge.net/index.php
WebHarvy – Intuitive Powerful Visual Web Scraper
https://www.webharvy.com/index.html
Webhose.io – Web Data For Your Business
http://www.webhose.io/
Web Robots – Web Scraping and Crawling
https://webrobots.io/
Web Scraper
http://www.webscraper.io/

Web Scraping – Wikipedia
https://en.wikipedia.org/wiki/Web_scraping
Website Data Extractor – Time to Rethink Web Scraping
http://www.kofax.com/
Website Extractor – Offline Browser
http://www.internet-soft.com/extractor.htm
WebSunDew – Advanced Web Scraping Tool
http://www.websundew.com/
Wikimedia Public Data Dumps http://meta.wikimedia.org/wiki/Data_dumps
WinAutomation
http://www.winautomation.com/
XRay Web Scraping Tool
http://freecode.com/projects/xrayguibasedwebscrapingtool
YaCy Web page Indexer
http://freecode.com/projects/yacy

Subject Tracer™ Information Blogs
Subject Tracer™ Information Blogs, created and developed by the Virtual Private Library™, combine the best of the latest tools on the Internet. Using bots, blogs and news aggregators, the Subject Tracer™ Information Blogs generate RSS feeds with the latest resources, creating a current flow of information through niche subject tracers. I am proud to be the creator of the Internet’s first Subject Tracer™ Information Blogs:
Virtual Private Library™ http://www.VirtualPrivateLibrary.com/
Accessibility Resources
http://www.AccessibilityResources.info/
Agriculture Resources
http://www.AgricultureResources.info/
AnswerSpot
http://www.AnswerSpot.us/
Artificial Intelligence Resources
http://www.AIResources.info/
Astronomy Resources
http://www.AstronomyResources.info/
Auction Resources
http://www.AuctionResources.info/
Biological Informatics
http://www.BiologicalInformatics.info/
Biotechnology Resources
http://www.BiotechnologyResources.info/
Bot Research
http://www.BotResearch.info/
Business Intelligence Resources
http://www.BIResources.info/

ChatterBots
http://www.ChatterBots.info/
Data Mining Resources
http://www.DataMiningResources.info/
Deep Web Research
http://www.DeepWebResearch.info/
Directory Resources
http://www.DirectoryResources.info/
eCommerce Resources
http://eCommerceResources.info/
Education and Academic Resources
http://www.EducationResources.info/
Elder Resources
http://www.ElderResources.info/
Employment Resources
http://www.EmploymentResources.info/
Entrepreneurial Resources
http://www.EntrepreneurialResources.info/
Fact Checkers Directory
http://www.FactCheckers.info/
Financial Sources
http://www.FinancialSources.info/
Finding People
http://www.FindingPeople.info/
Games Resources
http://www.GamesResources.info/
Genealogy Resources
http://www.GenealogyResources.info/

Grant Resources
http://www.GrantResources.info/
Green Files
http://www.GreenFiles.info/
Grid, Distributed and Cloud Computing Resources
http://www.GridResources.info/
Healthcare Resources
http://www.HealthcareResources.info/
Information Futures Markets
http://www.InformationFuturesMarkets.com/
Information Quality Resources
http://www.InformationQualityResources.info/
International Trade Resources
http://www.InternationalTradeResources.info/
Internet Alerts
http://www.InternetAlerts.info/
Internet Demographics
http://www.InternetDemographics.info/
Internet Experts 2016
http://www.InternetExperts.info/
Internet Hoaxes
http://www.InternetHoaxes.info/
Intrapreneurial Resources
http://www.IntrapreneurialResources.info/
Journalism Resources
http://www.JournalismResources.info/
Knowledge Discovery
http://www.KnowledgeDiscovery.info/

Military Resources
http://www.MilitaryResources.info/
New Economy Analytics, Resources and Alerts
http://www.NewEconomyAnalytics.com/
Outsourcing/Offshoring Information and Resources
http://www.OutsourcingOffshore.us/
Privacy Resources
http://www.PrivacyResources.info/
ProxyCrawl crawling and scraping tools
https://proxycrawl.com
Reference Resources
http://www.ReferenceResources.info/
Research Resources
http://www.ResearchResources.info/
RestStress™
http://www.RestStress.com/
Script Resources
http://www.ScriptResources.info/
ShoppingBots
http://www.ShoppingBots.info/
Social Informatics
http://www.SocialInformatics.info/
Statistics Resources and Big Data
http://www.StatisticsResources.info/
Student Research
http://www.StudentResearch.info/
Theology Resources
http://www.TheologyResources.info/
Tutorial Resources
http://www.TutorialResources.info/

World Wide Web Reference
http://www.WWWReference.info/

Original material from, and inspired by, [Web Data Extractors 2018 A White Paper Link Compilation written by Marcus P. Zillman, M.S., A.M.H.A](http://whitepapers.virtualprivatelibrary.net/Web%20Data%20Extractors.pdf)

--------------------------------------------------------------------------------