├── .gitignore ├── requirements.txt ├── templates ├── error.html ├── collinfo.json ├── search.html └── index.html ├── run-uwsgi.sh ├── install-collections.sh ├── uwsgi.ini ├── Dockerfile ├── config.yaml ├── static └── shared.css └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | #don't add collections, they're automatically synced 2 | 3 | collections/ 4 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pywb==0.33.2 2 | boto 3 | gevent 4 | uwsgi 5 | 6 | # AWS CLI (aws s3 cp ...) is used by install-collections.sh 7 | # to fetch cluster.idx and metadata.yaml 8 | awscli 9 | -------------------------------------------------------------------------------- /templates/error.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 |

Common Crawl Index Server Error

8 | {{ err_msg }} 9 | 10 | 11 | 12 | -------------------------------------------------------------------------------- /run-uwsgi.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # requires uwsgi 4 | pip install uwsgi 5 | 6 | # running with gevent 7 | pip install gevent 8 | 9 | if [ $? -ne 0 ]; then 10 | echo "uwsgi install failed" 11 | exit 1 12 | fi 13 | 14 | mypath=$(cd `dirname $0` && pwd) 15 | 16 | params="$mypath/uwsgi.ini" 17 | 18 | uwsgi $params 19 | -------------------------------------------------------------------------------- /install-collections.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ ! -d "collections" ]; then 4 | mkdir collections 5 | fi 6 | 7 | aws s3 sync s3://commoncrawl/cc-index/collections/ collections/ --exclude "*" --include "*/cluster.idx" --include "*/metadata.yaml" 8 | 9 | if [ $? -ne 0 ]; then 10 | echo "Error installing collections" 11 | exit 1 12 | fi 13 | echo "Collections installed" -------------------------------------------------------------------------------- /uwsgi.ini: -------------------------------------------------------------------------------- 1 | [uwsgi] 2 | # Run with default port if not set 3 | 4 | if-env = PORT 5 | socket = :$(PORT) 6 | endif = 7 | 8 | if-not-env = PORT 9 | http-socket = :8080 10 | endif = 11 | 12 | venv = $(VIRTUAL_ENV) 13 | 14 | gevent = 100 15 | gevent-monkey-patch = 16 | 17 | master = true 18 | processes = 8 19 | buffer-size = 65536 20 | die-on-term = true 21 | 22 | env = PYWB_CONFIG_FILE=./config.yaml 23 | wsgi = pywb.apps.wayback 24 | 25 | disable-logging=True 26 | -------------------------------------------------------------------------------- /templates/collinfo.json: -------------------------------------------------------------------------------- 1 | {% set first = [true] %} 2 | [{##} 3 | {% for route in routes | sort(reverse=True, attribute='path') %} 4 | {% if route | 
is_wb_handler %} 5 | {{ '' if first[0] else ',' }} 6 | { 7 | "id": "{{ route.path }}", 8 | "name": "{{ route.user_metadata.title if route.user_metadata.title else route.path }}", 9 | "timegate": "{{ host }}/{{route.path}}/", 10 | "cdx-api": "{{ host }}/{{route.path}}-index" 11 | }{% set _ = first.pop() %} 12 | {% set _ = first.append(false) %} 13 | {% endif %} 14 | {% endfor %} 15 | 16 | ] 17 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.9 2 | 3 | RUN apt-get -qq update && apt-get -qqy install awscli 4 | 5 | # Install dependencies 6 | COPY ./requirements.txt /tmp/requirements.txt 7 | RUN pip install -r /tmp/requirements.txt 8 | 9 | # Add the cc-index-server code into the image 10 | COPY ./ /opt/webapp/ 11 | WORKDIR /opt/webapp 12 | 13 | RUN ./install-collections.sh 14 | # Note: to avoid that collections are fetched anew on every image build, 15 | # you may install collections locally on the host in the build directory 16 | # and remove this command 17 | 18 | CMD /usr/local/bin/wayback 19 | -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | #Common-Crawl CDX Server Config 2 | #archive_paths: https://data.commoncrawl.org/ 3 | archive_paths: s3://commoncrawl/ 4 | 5 | # suffix to add to collection for cdx api 6 | enable_cdx_api: -index 7 | 8 | enable_memento: true 9 | 10 | shard_index_loc: 11 | match: '.*(collections/[^/]+/)' 12 | #replace: 'http://data.commoncrawl.org/cc-index/\1' 13 | replace: 's3://commoncrawl/cc-index/\1' 14 | 15 | # this is also the default page size 16 | max_blocks: 5 17 | 18 | # disable framed replay mode 19 | framed_replay: false 20 | 21 | # enable JSON listing of available collections /collinfo.json 22 | enable_coll_info: true 23 | 
-------------------------------------------------------------------------------- /static/shared.css: -------------------------------------------------------------------------------- 1 | body { 2 | font-family: sans-serif; 3 | color: #626262; 4 | } 5 | 6 | li { 7 | margin-bottom: 12px; 8 | } 9 | 10 | form { 11 | display: inline; 12 | } 13 | 14 | input[type=text] { 15 | width: 600px; 16 | font-size: 20px; 17 | } 18 | 19 | p { 20 | max-width: 768px; 21 | font-size: 1.07rem; 22 | line-height: 1.6rem; 23 | margin-bottom: 1.3rem; 24 | } 25 | 26 | br { 27 | margin-bottom: 7px; 28 | } 29 | 30 | table.listing th,td { 31 | padding: 0.2em 0.5em; 32 | } 33 | table.listing th { 34 | background-color: #e0e0e0; 35 | } 36 | table.listing tr { 37 | background-color: #f0f0f0; 38 | } 39 | table.listing tr:nth-child(odd) { 40 | background-color: #fcfcfc; 41 | } 42 | -------------------------------------------------------------------------------- /templates/search.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |

{{ wbrequest.user_metadata.title if wbrequest.user_metadata.title else wbrequest.coll }} Info Page

9 | Fork me on GitHub 10 | 11 | 18 | 19 |
20 |

21 | Search a url in this collection: 22 | 23 | (Wildcards -- Prefix: http://example.com/*   24 | Domain: *.example.com) 25 | 26 |
27 | 28 | 29 | 30 |

31 |
32 | Show Number Of Pages Only 33 |
34 |
35 |

(See the CDX Server API Reference for more advanced query options.)

36 |
Back To All Indexes
37 | 38 | 39 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Common Crawl Index Server 2 | 3 | This project is a deployment of the [pywb](https://github.com/webrecorder/pywb) web archive replay and index server to provide 4 | an index query mechanism for datasets provided by [Common Crawl](https://commoncrawl.org) 5 | 6 | 7 | ## Usage & Installation 8 | To run locally, please install with `pip install -r requirements.txt` 9 | 10 | Common Crawl stores data on Amazon S3 and the data can be accessed via s3 or https. Access to CC data using the s3 api is restricted to [authenticated](https://docs.aws.amazon.com/accounts/latest/reference/credentials-access-keys-best-practices.html) AWS users. 11 | 12 | Currently, individual indexes for each crawl can be accessed under: `https://data.commoncrawl.org/cc-index/collections/[CC-MAIN-YYYY-WW]` or `s3://commoncrawl/cc-index/collections/[CC-MAIN-YYYY-WW]` 13 | 14 | Most of the index will be served from S3, however, a smaller secondary index must be installed locally for each collection. 15 | 16 | This can be done automatically by running: `install-collections.sh`, which will install all available collections locally. It uses the [AWS CLI](https://aws.amazon.com/cli/) tool to sync the index. 17 | 18 | If successful, there should be a `collections` directory with at least one index. 19 | 20 | To run, simply run `cdx-server` to start up the index server, or optionally `wayback`, to run the pywb replay system along with the cdx server. 21 | 22 | 23 | ### Running with docker 24 | 25 | If you have Docker installed on your system, you can run the index server with Docker itself. 26 | 27 | ``` 28 | git clone https://github.com/commoncrawl/cc-index-server.git 29 | cd cc-index-server 30 | docker build . 
-t cc-index 31 | docker run --rm --publish 8080:8080 -ti cc-index 32 | ``` 33 | 34 | You can use `install-collections.sh` to download indexes to your system and mount it on docker. 35 | 36 | 37 | ## CDX Server API 38 | 39 | The API endpoints correspond to existing index collections in collections directory. 40 | 41 | For example, one currently available index is `CC-MAIN-2015-06` and it can be accessed via 42 | 43 | `http://localhost:8080/CC-MAIN-2015-06-index?url=commoncrawl.org` 44 | 45 | 46 | Refer to [CDX Server API](https://github.com/webrecorder/pywb/wiki/CDX-Server-API) for more detailed instructions on the API itself. 47 | 48 | The pywb [README](https://github.com/webrecorder/pywb/blob/master/README.rst) provides additional information about pywb. 49 | 50 | 51 | ## Building the Index 52 | 53 | Please see the [webarchive-indexing](https://github.com/ikreymer/webarchive-indexing) repository for more info on how the index is built. 54 | -------------------------------------------------------------------------------- /templates/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Common Crawl Index Server 5 | 6 | 7 | 8 | 9 |

Common Crawl Index Server

10 | 11 | Fork me on GitHub 12 | 13 |

14 | Please see the PyWB CDX Server API Reference for more examples on how to use the query API (please replace the API endpoint coll/cdx by one of the API endpoints listed in the table below). Alternatively, you may use one of the command-line tools based on this API: Ilya Kreymer's Common Crawl Index Client, Greg Lindahl's cdx-toolkit or Corben Leo's getallurls (gau). 15 |

16 | 17 |

18 | Common Crawl data is stored on Amazon Web Services' Public Data Sets. All data and index files are free to download — run your own index server or analyze the index offline!
19 | Please do not overload the URL index server for bulk downloads (e.g. all records of the entire .com top-level domain), see the download instructions. Alternatively, check the columnar index which allows for efficient aggregations and filtering on any field/column.
20 | More information about this URL index is found in our announcement of the Common Crawl index. For help and support, please visit the Common Crawl user forum. 21 |

22 | 23 |

24 | Currently available index collections (also as JSON list): 25 |

26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | {% for route in routes | sort(reverse=True, attribute='path') %} 37 | {% if route | is_wb_handler %} 38 | 39 | 42 | 47 | 48 | 49 | 50 | {% endif %} 51 | {% endfor %} 52 | 53 |
Search PageCrawlAPI endpointIndex File List on
s3://commoncrawl/
40 | {{ '/' + route.path }} 41 | 43 | {% if route.user_metadata.title is defined %} 44 | {{ route.user_metadata.title }} 45 | {% endif %} 46 | /{{ route.path }}-index{{route.path}}/cc-index.paths.gz
54 | 55 |

56 | Powered by pywb 57 |

58 | 59 | 60 | 61 | --------------------------------------------------------------------------------