37 |
38 |
39 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Common Crawl Index Server
2 |
3 | This project is a deployment of the [pywb](https://github.com/webrecorder/pywb) web archive replay and index server to provide
4 | an index query mechanism for the datasets published by [Common Crawl](https://commoncrawl.org).
5 |
6 |
7 | ## Usage & Installation
8 | To run locally, install the dependencies with `pip install -r requirements.txt`.
9 |
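For example, a minimal local setup might look like this (the repository URL is the one used in the Docker section below):

```
git clone https://github.com/commoncrawl/cc-index-server.git
cd cc-index-server
pip install -r requirements.txt
```
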
10 | Common Crawl stores its data on Amazon S3, and the data can be accessed via S3 or HTTPS. Access to the data via the S3 API is restricted to [authenticated](https://docs.aws.amazon.com/accounts/latest/reference/credentials-access-keys-best-practices.html) AWS users.
11 |
12 | Currently, the individual index for each crawl can be accessed under `https://data.commoncrawl.org/cc-index/collections/[CC-MAIN-YYYY-WW]` or `s3://commoncrawl/cc-index/collections/[CC-MAIN-YYYY-WW]`.
13 |
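For instance, with AWS credentials configured you can list the available collections directly from the bucket (a sketch using the AWS CLI):

```
aws s3 ls s3://commoncrawl/cc-index/collections/
```
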
14 | Most of the index is served directly from S3; however, a smaller secondary index must be installed locally for each collection.
15 |
16 | This can be done automatically by running `install-collections.sh`, which installs the secondary index for all available collections locally. It uses the [AWS CLI](https://aws.amazon.com/cli/) tool to sync the index.
17 |
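Roughly, this amounts to the following for each collection (a sketch only; the exact paths and filters are assumptions, see `install-collections.sh` for the real logic):

```
# sketch: pull the secondary (cluster) index for one collection locally
aws s3 sync s3://commoncrawl/cc-index/collections/CC-MAIN-2015-06/indexes/ \
    collections/CC-MAIN-2015-06/indexes/ \
    --exclude "*" --include "cluster.idx"
```
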
18 | If successful, there should be a `collections` directory with at least one index.
19 |
20 | To start the index server, run `cdx-server`; alternatively, run `wayback` to start the pywb replay system along with the CDX server.
21 |
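For example, after installing at least one collection (a sketch; 8080 is the default port, as in the Docker example below):

```
cdx-server &
# once the server is up, a quick sanity check:
curl 'http://localhost:8080/CC-MAIN-2015-06-index?url=commoncrawl.org&limit=1'
```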
22 |
23 | ### Running with Docker
24 |
25 | If you have Docker installed on your system, you can run the index server in a container.
26 |
27 | ```
28 | git clone https://github.com/commoncrawl/cc-index-server.git
29 | cd cc-index-server
30 | docker build . -t cc-index
31 | docker run --rm --publish 8080:8080 -ti cc-index
32 | ```
33 |
34 | You can use `install-collections.sh` to download the indexes to your system and mount the resulting `collections` directory into the container.
35 |
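For example, something along these lines should work (a sketch; the `/webarchive/collections` path inside the container is an assumption, check the `Dockerfile` for the actual working directory):

```
./install-collections.sh
docker run --rm --publish 8080:8080 -ti \
    --volume "$(pwd)/collections:/webarchive/collections" \
    cc-index
```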
36 |
37 | ## CDX Server API
38 |
39 | The API endpoints correspond to the index collections present in the `collections` directory.
40 |
41 | For example, one currently available index is `CC-MAIN-2015-06`, which can be queried via:
42 |
43 | `http://localhost:8080/CC-MAIN-2015-06-index?url=commoncrawl.org`
44 |
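The standard pywb CDX query parameters apply; for example, to get JSON output, limit the number of results, or match a whole domain:

```
curl 'http://localhost:8080/CC-MAIN-2015-06-index?url=commoncrawl.org&output=json&limit=5'
curl 'http://localhost:8080/CC-MAIN-2015-06-index?url=commoncrawl.org&matchType=domain&limit=5'
```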
45 |
46 | Refer to [CDX Server API](https://github.com/webrecorder/pywb/wiki/CDX-Server-API) for more detailed instructions on the API itself.
47 |
48 | The pywb [README](https://github.com/webrecorder/pywb/blob/master/README.rst) provides additional information about pywb.
49 |
50 |
51 | ## Building the Index
52 |
53 | Please see the [webarchive-indexing](https://github.com/ikreymer/webarchive-indexing) repository for more info on how the index is built.
54 |
--------------------------------------------------------------------------------
/templates/index.html:
--------------------------------------------------------------------------------
Common Crawl Index Server

Please see the PyWB CDX Server API Reference for more examples on how to use the query API (please replace the API endpoint coll/cdx by one of the API endpoints listed in the table below). Alternatively, you may use one of the command-line tools based on this API: Ilya Kreymer's Common Crawl Index Client, Greg Lindahl's cdx-toolkit or Corben Leo's getallurls (gau).