└── README.md

/README.md:
--------------------------------------------------------------------------------
# LLM Data Scrapers 🚀

A list of useful open-source tools and scrapers to gather data for LLMs:

| Name | Description |
| :------| :------------|
| [gitingest](https://github.com/cyclotruc/gitingest) | Replace `hub` with `ingest` in any GitHub URL to get a prompt-friendly extract of a codebase |
| [repomix](https://github.com/yamadashy/repomix) | Packs your entire repository into a single, AI-friendly file |
| [llm-scraper](https://github.com/mishushakov/llm-scraper) | Turn any webpage into structured data using LLMs |
| [crawl4ai](https://github.com/unclecode/crawl4ai) | LLM-friendly web crawler & scraper |
| [trafilatura](https://github.com/adbar/trafilatura) | Python & command-line tool to gather text and metadata on the web |
| [RepoToTextForLLMs](https://github.com/Doriandarko/RepoToTextForLLMs) | Simple Python script to fetch repo content |
| [marker](https://github.com/VikParuchuri/marker) | Convert PDF to Markdown or JSON quickly |
| [reader](https://github.com/jina-ai/reader) | Convert any URL to an LLM-friendly input with a simple prefix `https://r.jina.ai/` |
| [files-to-prompt](https://github.com/simonw/files-to-prompt) | Concatenate a directory full of files into a single prompt for use with LLMs |
| [docling](https://github.com/DS4SD/docling) | Simplifies document processing and parsing of diverse formats |
| [firecrawl](https://github.com/mendableai/firecrawl) | API to turn websites into LLM-ready Markdown or structured data; can be self-hosted (with limitations) |
| [llmstxt-generator](https://github.com/mendableai/llmstxt-generator) | API to generate `llms.txt` files from websites for LLM training and inference |

## More
- https://github.com/mlabonne/llm-datasets: Curated list of datasets and tools specifically for post-training.
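
The two URL tricks in the table (gitingest's `hub` to `ingest` swap and reader's `https://r.jina.ai/` prefix) can be sketched in a few lines of Python. This is only an illustration: `to_gitingest_url` and `to_reader_url` are hypothetical helper names, not part of either project, and the sketch assumes the swap means turning the host `github.com` into `gitingest.com`. It builds the URLs but does not fetch them.

```python
from urllib.parse import urlsplit, urlunsplit

READER_PREFIX = "https://r.jina.ai/"  # prefix quoted in the table above


def to_gitingest_url(github_url: str) -> str:
    """Swap `hub` for `ingest` in the host of a GitHub URL (hypothetical helper)."""
    parts = urlsplit(github_url)
    # Assumption: only plain github.com hosts are rewritten.
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a github.com URL: {github_url}")
    return urlunsplit(parts._replace(netloc="gitingest.com"))


def to_reader_url(url: str) -> str:
    """Prepend the reader prefix to any URL (hypothetical helper)."""
    return READER_PREFIX + url


if __name__ == "__main__":
    print(to_gitingest_url("https://github.com/cyclotruc/gitingest"))
    print(to_reader_url("https://github.com/mlabonne/llm-datasets"))
```

Opening either resulting URL in a browser, or fetching it with any HTTP client, should return the prompt-friendly text described in the table.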
--------------------------------------------------------------------------------