├── .gitignore ├── readme.md ├── readme.md.template ├── render-readme.sh ├── requirements.txt ├── scripts ├── article.py ├── download.py └── extract.py └── urls.tsv /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 
92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 98 | __pypackages__/ 99 | 100 | # Celery stuff 101 | celerybeat-schedule 102 | celerybeat.pid 103 | 104 | # SageMath parsed files 105 | *.sage.py 106 | 107 | # Environments 108 | .env 109 | .venv 110 | env/ 111 | venv/ 112 | ENV/ 113 | env.bak/ 114 | venv.bak/ 115 | 116 | # Spyder project settings 117 | .spyderproject 118 | .spyproject 119 | 120 | # Rope project settings 121 | .ropeproject 122 | 123 | # mkdocs documentation 124 | /site 125 | 126 | # mypy 127 | .mypy_cache/ 128 | .dmypy.json 129 | dmypy.json 130 | 131 | # Pyre type checker 132 | .pyre/ 133 | 134 | # pytype static type analyzer 135 | .pytype/ 136 | 137 | # Cython debug symbols 138 | cython_debug/ 139 | 140 | .vscode/ 141 | -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # MassiveSumm: a very large-scale, very multilingual, news summarisation dataset 2 | This repository contains links to data and code to fetch and reproduce the data described in our EMNLP 2021 paper titled "[MassiveSumm: a very large-scale, very multilingual, news summarisation dataset](https://aclanthology.org/2021.emnlp-main.797/)". A (massive) multilingual dataset consisting of 92 diverse languages, across 35 writing scripts. With this work we attempt to take the first steps towards providing a diverse data foundation for summarisation in many languages. 3 | 4 | > *Disclaimer: The data is noisy and recall-oriented.
In fact, we highly recommend reading our analysis on the efficacy of this type of methods for data collection.* 5 | 6 | 7 | ## Get the Data 8 | Redistributing data from web is a tricky matter. We are working on providing efficient access to the entire dataset, as well as expanding it even further. For the time being we only provide links to reproduce subsets of the entire dataset through either common crawl and the wayback machine. The dataset is also available upon request ([djam@itu.dk](mailto:djam@itu.dk)). 9 | 10 | 11 | In the table below is a listing of files containing URLs and metadata required to fetch data from common crawl. 12 | lang | wayback | cc 13 | ------|----------------------------------------------------------------------------------|-------------------------------------------------------------------------------- 14 | afr | [link](https://drive.google.com/file/d/1m7ctoWs5or8HsFbuW5pBu_PodpND0J3e/view) | - 15 | amh | [link](https://drive.google.com/file/d/1k0_65Zb00VGm5i-hnFYuY7lNzjVUiWvl/view) | [link](https://drive.google.com/file/d/1_awz_-B0iWtaPdKih8H8Kz4HGnJSwvRq/view) 16 | ara | [link](https://drive.google.com/file/d/1raYOtsrpmD-yAGo50Kr917Ns3tXEJBci/view) | [link](https://drive.google.com/file/d/1HvCeJ3p59sdhb1xLFGNVHr10hHEg8phA/view) 17 | asm | [link](https://drive.google.com/file/d/1iGPZdk-PKQn0M_q8ENl-sJY861sT0kaE/view) | - 18 | aym | [link](https://drive.google.com/file/d/12XzyUHfrOi317OLU0QsOgl6eZvVOK-bF/view) | - 19 | aze | [link](https://drive.google.com/file/d/1JIqaeoNJt3VATqqjzP27fUBlTAlCvagz/view) | [link](https://drive.google.com/file/d/1CftuzziqiR5QezH9oYL-bCE_KpKdKQgK/view) 20 | bam | [link](https://drive.google.com/file/d/1Yb9YOENj0Kf8FK19eXHD-iHu5nuqR_3-/view) | [link](https://drive.google.com/file/d/1MWQVJMBLmc_8qktep7FGohVbHdXR0iKx/view) 21 | ben | [link](https://drive.google.com/file/d/1lOso52ouqtddUGF5RGOkIlVfL3kD3oTB/view) | [link](https://drive.google.com/file/d/1wK6YTRkXuc4df8C-Ko-PaWB1pIeDfY8q/view) 22 | bod | 
[link](https://drive.google.com/file/d/1RmonaYMfzj-sw5uM1crJvJxn1FEWfDNi/view) | [link](https://drive.google.com/file/d/1vnZb9PUjRCX6E__OlCAqBGxdu3-19Q8W/view) 23 | bos | [link](https://drive.google.com/file/d/1alV_CwZpxzAcEfuBCp5TCeMr3qOPZZHW/view) | [link](https://drive.google.com/file/d/1TTQVPZ4G7TGy7mFnN21XC3ZDlJlTpGhM/view) 24 | bul | [link](https://drive.google.com/file/d/1XU56P9Jd4Meo7YCEedRPu3qd3nOZRvSx/view) | [link](https://drive.google.com/file/d/13MJzUdrCLz-lo_c4IZOOupJY_50zvZHd/view) 25 | cat | [link](https://drive.google.com/file/d/1OqPLjlsUI-ldg6z2eEe_hA1tM1MCAfEt/view) | - 26 | ces | [link](https://drive.google.com/file/d/1na5Wx9P4SyVfHgIhRhFJ8RH0WpP2qAwt/view) | [link](https://drive.google.com/file/d/1tKzsoGFdDo93aKfEpkY5sSLsuN1hL4LV/view) 27 | cym | [link](https://drive.google.com/file/d/1wqb_fsyw9GBouoHkGq353nWXJZLAL4Ax/view) | [link](https://drive.google.com/file/d/1ewLaDdoC1An4hYr6LVLnPGPsY5h0ZiqS/view) 28 | dan | [link](https://drive.google.com/file/d/10Isyjz0Lw9F2JU3Lw4msLma49_i3CVJo/view) | [link](https://drive.google.com/file/d/1-VcQxG_YDngEaNNRBMn9vl6L3sIEnL_8/view) 29 | deu | [link](https://drive.google.com/file/d/1dguGPFKXkTvSVn2Yuyv7kBqWROouo7nM/view) | [link](https://drive.google.com/file/d/1LfBPlYTbmjWnZTM_e6twUVgzrOjY2kfp/view) 30 | ell | [link](https://drive.google.com/file/d/1hyQJHMMTP0WPaEMPeK5WpoToGuMU1m1C/view) | [link](https://drive.google.com/file/d/1dzbQc2K_rTIrkpcw9UgQyZ4PPk_ZYF5i/view) 31 | eng | [link](https://drive.google.com/file/d/1WumR27bj54A_ObzbM1FX7aR0Xiv80g8n/view) | [link](https://drive.google.com/file/d/1u-Zt56FKrJ9zVZRRSPiwqIoGaKHkTOuY/view) 32 | epo | [link](https://drive.google.com/file/d/1akp7L7cE9J75hdmxkjXIhLO30c1uZzV3/view) | - 33 | fas | [link](https://drive.google.com/file/d/1feVopBcYgz6TybgpYJjNma8Q8v1V36nT/view) | [link](https://drive.google.com/file/d/1AMz5xhJaR9Ud-oic4LA4-VoBfcT-cqWH/view) 34 | fil | [link](https://drive.google.com/file/d/16PBhI9DJZxju2du56u3OTgga-L7ImdKY/view) | - 35 | fra | 
[link](https://drive.google.com/file/d/19PCGH6Hxt2YiIiEcP224qSQ94D2f49lc/view) | [link](https://drive.google.com/file/d/1UQitwbOPwbbaXFb8xtV0chjeLvWzm3LN/view) 36 | ful | [link](https://drive.google.com/file/d/1glT_e_2kO9bb3mTRCYWUs4n0zGFBS8q9/view) | [link](https://drive.google.com/file/d/1eku0kULX4ZE9wQnUJMcqHy65Jsu21FFp/view) 37 | gle | [link](https://drive.google.com/file/d/1o078h9dEo2bJdmSmex2NA31yt491Cx_X/view) | [link](https://drive.google.com/file/d/1HNVxzYdmc1l_q4UOwNV6Yh2_QMfmlSaf/view) 38 | guj | [link](https://drive.google.com/file/d/19s9xs6DPeplFv3ME3V1Lhk7Zr_YJ3SMD/view) | [link](https://drive.google.com/file/d/1PWFIVGeCRuzAHH-w2UwVOANSvyeEXgFk/view) 39 | hat | [link](https://drive.google.com/file/d/1ioS9mTDjMlNOl8Z7by9F_YIT4BwZnEn-/view) | [link](https://drive.google.com/file/d/1yDorOERjCNFdDRt9viZyWpcO7gmXDyvr/view) 40 | hau | [link](https://drive.google.com/file/d/1oSLe6bPcfqkOtarZ5f5l_jBFLO2_Tafb/view) | [link](https://drive.google.com/file/d/1cYkwEYclvHnN8BLZf6z-DEyINGAHy34L/view) 41 | heb | [link](https://drive.google.com/file/d/1tHlRd6bg5zS7xvaEp5JOST7Ngb-DHecX/view) | - 42 | hin | [link](https://drive.google.com/file/d/1RDbFOOMV3FC71R_1QKxocwDgxP8csmqz/view) | [link](https://drive.google.com/file/d/1ZNcCqUV15Bv2FlY3qkMYyBWBm0hO4LKI/view) 43 | hrv | [link](https://drive.google.com/file/d/13PlLYJmEbZAc8mgMHbZH58-rLvU8bLSY/view) | - 44 | hun | [link](https://drive.google.com/file/d/157CC5cPhpWg5aM4iMjNtX0CdVyL1yO-J/view) | [link](https://drive.google.com/file/d/1R52kqwahdPHFkGpsGdAJE6Wkq38UGgFS/view) 45 | hye | [link](https://drive.google.com/file/d/1ZX0FmoSAmC_QJdwNo-KlqjrLG8ALup5L/view) | [link](https://drive.google.com/file/d/1ciACol27dN07_omNInoU_NqUvYmwXo6C/view) 46 | ibo | [link](https://drive.google.com/file/d/11cmywemBJuNeHkdwn_a4rPyKqbOM7zYF/view) | [link](https://drive.google.com/file/d/1oYOHwATB0PWNYvEv-azy_8MUgkxzFZCY/view) 47 | ind | [link](https://drive.google.com/file/d/1Cb0sJ-2cLYQdg3hKG7yC4bCziYjtpFRo/view) | 
[link](https://drive.google.com/file/d/1Sch920J5PqJbhpEQHNTjJ1ojiMU46tiQ/view) 48 | isl | [link](https://drive.google.com/file/d/183aUjkvgPtyafmAh3fvUj1OjFloa-nC6/view) | [link](https://drive.google.com/file/d/1wzccfq0RAN7c5c2BGhNMySp2yMYfj9ep/view) 49 | ita | [link](https://drive.google.com/file/d/1eGrviIr8FiRPaIK51l9mFbKEpr_RryuN/view) | [link](https://drive.google.com/file/d/123eRVzORxPIQnp75RMf0LsXWL21l76IH/view) 50 | jpn | [link](https://drive.google.com/file/d/16wRlWIwPIl3tBLJbrWRxDbnehLHHOWEt/view) | [link](https://drive.google.com/file/d/1vjYBbEmWg8PoztrcSqDjUe7ClCUNKHAL/view) 51 | kan | [link](https://drive.google.com/file/d/1J7jD8MjKkR0c_7OIZ7ahw4Bq_2jTgrYX/view) | [link](https://drive.google.com/file/d/18rBERL7l4zBupWwVHXasPu3jlegCM31B/view) 52 | kat | [link](https://drive.google.com/file/d/1S-CYer6Yu02tMRLBbxHtKFYCp33gXBzc/view) | [link](https://drive.google.com/file/d/1GSpqPf87onRlKHu4yoLzxQkAOSIE1GVW/view) 53 | khm | [link](https://drive.google.com/file/d/11OL9JKSTT8_zVQl77avrEVQiqXqV1J2p/view) | [link](https://drive.google.com/file/d/1-0m54dcSjGyBST9bodw1RJYqICsZCwuS/view) 54 | kin | [link](https://drive.google.com/file/d/1DnRV2pUU-b-f9DT27AtNLRJcx31AuRy4/view) | - 55 | kir | [link](https://drive.google.com/file/d/1DoBBN_nb_V-Ogl94KL-nM6iJH2WaGOpK/view) | [link](https://drive.google.com/file/d/1ncixaRUVSGcgTrMPhibN1Pfd4yIJ8c15/view) 56 | kor | [link](https://drive.google.com/file/d/1L3RY0coCdd-1HX4r0VkU2kQYdFOMuN_9/view) | [link](https://drive.google.com/file/d/14-QZft00ab2KAtjT1-p1fvaA45qKt3JJ/view) 57 | kur | [link](https://drive.google.com/file/d/15a_TBIEC1jYNVOTh_wKpKUrW8w_p3FoW/view) | [link](https://drive.google.com/file/d/1g3WTVRxMo5M5HOBuNJLU7KdSw1RQRdTx/view) 58 | lao | [link](https://drive.google.com/file/d/1oO7L92P1XUD6cNdh5MlDv9R-6jLjaYix/view) | [link](https://drive.google.com/file/d/1IOcXBGMoaA859RXzrXSV1WS2qMOweRUn/view) 59 | lav | [link](https://drive.google.com/file/d/1K6Z0RLc0yvyqHXIy3wYh8Elr3QlIMltz/view) | 
[link](https://drive.google.com/file/d/1AdXmbWraGH_Dh9_f2CcQhhqP5hIqnJXu/view) 60 | lin | [link](https://drive.google.com/file/d/1JTgwLaQgMSqOvdARhw82zrCwZpv5OFZV/view) | [link](https://drive.google.com/file/d/1QDYxfhMQDZeGVVRUsZjUzf7C2RUMYDMR/view) 61 | lit | [link](https://drive.google.com/file/d/1df6oV_UxxqZQnYRtmkVxUZe-2b4Bsiay/view) | [link](https://drive.google.com/file/d/1WjIJ-LZN0ZdqtE_NnoEiAiNhXm3eRGHk/view) 62 | mal | [link](https://drive.google.com/file/d/1hqp4OmL27HPBMVhYwZLf28Syha7mDmqx/view) | [link](https://drive.google.com/file/d/1tvsdnjRAiBFHc0Py-duJoqPlSwYlokie/view) 63 | mar | [link](https://drive.google.com/file/d/1BRcMJL_Zk1rq0hNcCYZAbZ3nMF0qmM1I/view) | [link](https://drive.google.com/file/d/1Z-ui3IipNQy3jpQqeNzrQiVZcXnUFs2e/view) 64 | mkd | [link](https://drive.google.com/file/d/1-UzcYkog_TjAnk9DjTR6vN29R3qjHICN/view) | [link](https://drive.google.com/file/d/1xpE3nPcs-m5WdbPyOX1H4wt9k06NMKN3/view) 65 | mlg | [link](https://drive.google.com/file/d/1dhzWeA8-JKFbhoLpbV1Yli5M4XHkrXdu/view) | [link](https://drive.google.com/file/d/11mpExgMv7VSdejMXUQLFnnPwitLUNudw/view) 66 | mon | [link](https://drive.google.com/file/d/1bPHLSMKtCI927f-I_skf0T98HAp-jN-A/view) | [link](https://drive.google.com/file/d/1rejguZ0HNNMZdV_9g6qXT6Si6QVXhuge/view) 67 | mya | [link](https://drive.google.com/file/d/1fXyMEKX8sz8-wCOKoLLXuqLXMsFZs9Ld/view) | [link](https://drive.google.com/file/d/1cLre9C9f1lm2Ds_8hv6f7h2R4phrXtVd/view) 68 | nde | [link](https://drive.google.com/file/d/1b_UekJ498qQv2DzeXjUWL20VMftds0ec/view) | [link](https://drive.google.com/file/d/1KxB5RLGMlteQOqBYu2DOhIXVzCUkhvDM/view) 69 | nep | [link](https://drive.google.com/file/d/1g-tRWW1dweVZMtkWnkm4j5-Qnbwzlh2P/view) | [link](https://drive.google.com/file/d/1jw_P1wenbskDfG8iQD3dYRb0oWnRba9N/view) 70 | nld | [link](https://drive.google.com/file/d/1JwV508z5Bx_3dHjCW0lMAhp0-8ykF7sF/view) | - 71 | ori | [link](https://drive.google.com/file/d/1eWnnCigfd8HmMueSvyPe7x1JcvGFReg2/view) | 
[link](https://drive.google.com/file/d/1a3t0X7PfphDZJiyJSL-FzsoZsDH9KIu4/view) 72 | orm | [link](https://drive.google.com/file/d/1oZ4S71rijKd32IL9Ww8VR-vr1mzh4WVT/view) | [link](https://drive.google.com/file/d/1-SopeFs8niXlmwWSe117-YDQ6ECK8xTh/view) 73 | pan | [link](https://drive.google.com/file/d/1Yr6Cy_gaJrbWNHz5khkjDR4mKZT7_TMO/view) | [link](https://drive.google.com/file/d/1t3sUOR_m4blOj8iIU1q8ohxUWPTWImcw/view) 74 | pol | [link](https://drive.google.com/file/d/1BSX_LcGIaDQOWXDqC3YmMFmRM2DQbUYb/view) | [link](https://drive.google.com/file/d/1pNOSculzyCNMjrQOarVhG_SDg1IbMpBr/view) 75 | por | [link](https://drive.google.com/file/d/1KWnsUKgIb2fJlRcOq0WhCzyE8cR0LhfB/view) | [link](https://drive.google.com/file/d/13ET2tIsrzFTzlb9Rd2KAp-FF7Y6R2Ker/view) 76 | prs | [link](https://drive.google.com/file/d/1izhl77L8R2r7YM4-Usu0VMAQtoO5sn7R/view) | [link](https://drive.google.com/file/d/11QMxXjH9vN0V6-UZXT2omxb7lm8zphqC/view) 77 | pus | [link](https://drive.google.com/file/d/1nJ0hBzj0z51htnwftGyd_I0DQbFdeypS/view) | [link](https://drive.google.com/file/d/1nccq6pEsvUhe1zvPTDoDKEob6cRKPpWP/view) 78 | ron | [link](https://drive.google.com/file/d/1XxDdroLJdAQZwtGmhPedZ-YZq_G9T-tr/view) | - 79 | run | [link](https://drive.google.com/file/d/1FVLjZI_oj6bGwP6tIMUTY-8yj4pjJm-5/view) | [link](https://drive.google.com/file/d/165N8Wh_TeTo7N_el6eWGmZ5ts3KBKU9Q/view) 80 | rus | [link](https://drive.google.com/file/d/17RWgFR6mIvxGr6RGhuRZspLtnlWZk1Ul/view) | [link](https://drive.google.com/file/d/15Cqcrbl_lG_oSED_hTyR_mb-dwpvw9J8/view) 81 | sin | [link](https://drive.google.com/file/d/158EtvATjJ39G7vThM69h4shcRqRXdT9K/view) | [link](https://drive.google.com/file/d/1gvqSIOkL7RDX-yg7O1VwHTdRUcNGcf_F/view) 82 | slk | [link](https://drive.google.com/file/d/1IGtxqiLlJqBhsfgbAUlyQAwQir2QmYGb/view) | [link](https://drive.google.com/file/d/1GDzuxd-KhBA_fHrDlO8HcCayLsG4-UkX/view) 83 | slv | [link](https://drive.google.com/file/d/1gWb-pImthObUPJ16hIO5HZif3XSvjIrK/view) | 
[link](https://drive.google.com/file/d/1T71uVRLX-wB-qeWFMyqtxlI91Dr2u7pn/view) 84 | sna | [link](https://drive.google.com/file/d/1FOacYT0S5sPVxmBbH4mHsH4uyrbQwOoM/view) | [link](https://drive.google.com/file/d/1wCysDOCUvsA3H-9CmrItwU7GHIxQt1rU/view) 85 | som | [link](https://drive.google.com/file/d/1IHxrknewcaTTQKrCH6K00lz3dPT5lIUc/view) | [link](https://drive.google.com/file/d/1oXDsB76ViX9ri5W_2xVEQqLWDqIQAh5O/view) 86 | spa | [link](https://drive.google.com/file/d/1y3iDoCDfT19MXgQtQ_z_dF4q8xgLK4Eq/view) | [link](https://drive.google.com/file/d/14dX8cePpcb-E7nS8brupu9lQkoLUPLSf/view) 87 | sqi | [link](https://drive.google.com/file/d/1rpOjaE3mjyl8LsLcU5QiEt1xmkxF0ntW/view) | [link](https://drive.google.com/file/d/1jz5sXn8JeHhZHjXb7r0wLqyynva7c8Al/view) 88 | srp | [link](https://drive.google.com/file/d/1wpp_f10LNB4Qb0F8-EMmdKWLMHb4cGuw/view) | [link](https://drive.google.com/file/d/1XJCyan1OL3UTI9_tNbvZX2ngQ2bhoy0U/view) 89 | swa | [link](https://drive.google.com/file/d/1L3DHVdngIRSd8eCjx7qyp5NBwziz6I6B/view) | [link](https://drive.google.com/file/d/1ukbOFz_dHYaIQnD4ub0hE9DqF_1mwVke/view) 90 | swe | [link](https://drive.google.com/file/d/1BgVrlj40Mlg4yOOMy1yPV8iyFbZTrRLE/view) | - 91 | tam | [link](https://drive.google.com/file/d/1VrmX5egg4zaZKPBA0Ic3jlMg_ovYkeW9/view) | [link](https://drive.google.com/file/d/1DWv-hkU0P2B0AysTFwOZ3aghW6lxF16A/view) 92 | tel | [link](https://drive.google.com/file/d/1zo3gNIH2sMczXpnWh-vVty3pr4-tOaBy/view) | [link](https://drive.google.com/file/d/1e0KIPqcKYHXmSjgLpxLJkEFumw8Bae_g/view) 93 | tet | [link](https://drive.google.com/file/d/1n-PVdlyti6wGtlGUalYeHOmGwZtG6xDe/view) | - 94 | tgk | [link](https://drive.google.com/file/d/1g6_1YKJbv7-5glBsreqspPP_VnBsRSXW/view) | - 95 | tha | [link](https://drive.google.com/file/d/1vTBPYxmkyWCqnboX3cVxxNHcfRcXAo_7/view) | [link](https://drive.google.com/file/d/197vyuI2JzOGczeVRqnUGu78G3T2WWRit/view) 96 | tir | [link](https://drive.google.com/file/d/1vkt2SRGiSPIJKzgU-XagWmx6rnQLtmZZ/view) | 
[link](https://drive.google.com/file/d/1lDazmqixV4Gem96O-c-gulqSNKUEcjnR/view) 97 | tur | [link](https://drive.google.com/file/d/1_39Hk7K-IKzvSiRLmue1mxANuPXQg9p5/view) | [link](https://drive.google.com/file/d/1Kole41CnnNArIt_rxfNlimk9EMQZFa8Y/view) 98 | ukr | [link](https://drive.google.com/file/d/1h2I-yan3WcVEyJfeyJD_fFFviJWK93N3/view) | [link](https://drive.google.com/file/d/1H8TUR73sJs_bvjJuLNB4szqiCIvTq5sp/view) 99 | urd | [link](https://drive.google.com/file/d/1p-lG1vEDp838GRzuPWc9hjfZeqjfeMh3/view) | [link](https://drive.google.com/file/d/1HDwEMuaULkZr6Mm39CifS2szyI_vql-G/view) 100 | uzb | [link](https://drive.google.com/file/d/134swKYwYcfCFbMSXe16hvmLVGqS7pOMb/view) | [link](https://drive.google.com/file/d/1nYOLG5UlV-YDeex8Tvi37hK4pM9wD7Wg/view) 101 | vie | [link](https://drive.google.com/file/d/1zm1AjKpOhEeaZgs2MeJsrFVjxT7kbyFL/view) | [link](https://drive.google.com/file/d/1uts1nSGwWNxEFZnsJi6SdamG2DAVPq-q/view) 102 | xho | [link](https://drive.google.com/file/d/1Gkq4cLknzh_cY9HBlWAGqkZez31vIarY/view) | [link](https://drive.google.com/file/d/1P31PeL7cVJ9eNE-YZ0ofH0pozT9ta5bP/view) 103 | yor | [link](https://drive.google.com/file/d/1KhCZk7wBsFkKsmU4XWffVTS37w1FguwE/view) | [link](https://drive.google.com/file/d/17ifvygtGzaIgDuqiFK0QDK1Jd7SOnkNd/view) 104 | yue | [link](https://drive.google.com/file/d/1u1ScUMdlOfyIyUOIcZBonIUb7rXJwGH8/view) | [link](https://drive.google.com/file/d/1blW_lXnUFa3poUwR6YuHd3N2fVoGuHhG/view) 105 | zho | [link](https://drive.google.com/file/d/10ipmN3CgNFXc6OQst6-Iasa5CaZLT93M/view) | [link](https://drive.google.com/file/d/13-qysDM2uAiT_E9KjAKsQZAQP_-0pyRL/view) 106 | bis | - | [link](https://drive.google.com/file/d/1zUn6LDov0zi_hxYbs9hX63UwLhcNcrsa/view) 107 | gla | - | [link](https://drive.google.com/file/d/1rIYRlZQ0Sl6By45hdOozAS_37Dv6LFQp/view) 108 | 109 | ## Cite Us! 
110 | Please cite us if you use our data or methodology 111 | ``` 112 | @inproceedings{varab-schluter-2021-massivesumm, 113 | title = "{M}assive{S}umm: a very large-scale, very multilingual, news summarisation dataset", 114 | author = "Varab, Daniel and 115 | Schluter, Natalie", 116 | booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing", 117 | month = nov, 118 | year = "2021", 119 | address = "Online and Punta Cana, Dominican Republic", 120 | publisher = "Association for Computational Linguistics", 121 | url = "https://aclanthology.org/2021.emnlp-main.797", 122 | pages = "10150--10161", 123 | abstract = "Current research in automatic summarisation is unapologetically anglo-centered{--}a persistent state-of-affairs, which also predates neural net approaches. High-quality automatic summarisation datasets are notoriously expensive to create, posing a challenge for any language. However, with digitalisation, archiving, and social media advertising of newswire articles, recent work has shown how, with careful methodology application, large-scale datasets can now be simply gathered instead of written. In this paper, we present a large-scale multilingual summarisation dataset containing articles in 92 languages, spread across 28.8 million articles, in more than 35 writing scripts. This is both the largest, most inclusive, existing automatic summarisation dataset, as well as one of the largest, most inclusive, ever published datasets for any NLP task. We present the first investigation on the efficacy of resource building from news platforms in the low-resource language setting. 
Finally, we provide some first insight on how low-resource language settings impact state-of-the-art automatic summarisation system performance.", 124 | } 125 | ``` -------------------------------------------------------------------------------- /readme.md.template: -------------------------------------------------------------------------------- 1 | # MassiveSumm: a very large-scale, very multilingual, news summarisation dataset 2 | This repository contains links to data and code to fetch and reproduce the data described in our EMNLP 2021 paper titled "[MassiveSumm: a very large-scale, very multilingual, news summarisation dataset](https://aclanthology.org/2021.emnlp-main.797/)". A (massive) multilingual dataset consisting of 92 diverse languages, across 35 writing scripts. With this work we attempt to take the first steps towards providing a diverse data foundation for summarisation in many languages. 3 | 4 | > *Disclaimer: The data is noisy and recall-oriented. In fact, we highly recommend reading our analysis on the efficacy of this type of method for data collection.* 5 | 6 | 7 | ## Get the Data 8 | Redistributing data from the web is a tricky matter. We are working on providing efficient access to the entire dataset, as well as expanding it even further. For the time being we only provide links to reproduce subsets of the entire dataset through either common crawl or the wayback machine. The dataset is also available upon request ([djam@itu.dk](mailto:djam@itu.dk)). 9 | 10 | 11 | In the table below is a listing of files containing URLs and metadata required to fetch data from common crawl. 12 | ${LANGS} 13 | 14 | ## Cite Us!
15 | Please cite us if you use our data or methodology 16 | ``` 17 | @inproceedings{varab-schluter-2021-massivesumm, 18 | title = "{M}assive{S}umm: a very large-scale, very multilingual, news summarisation dataset", 19 | author = "Varab, Daniel and 20 | Schluter, Natalie", 21 | booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing", 22 | month = nov, 23 | year = "2021", 24 | address = "Online and Punta Cana, Dominican Republic", 25 | publisher = "Association for Computational Linguistics", 26 | url = "https://aclanthology.org/2021.emnlp-main.797", 27 | pages = "10150--10161", 28 | abstract = "Current research in automatic summarisation is unapologetically anglo-centered{--}a persistent state-of-affairs, which also predates neural net approaches. High-quality automatic summarisation datasets are notoriously expensive to create, posing a challenge for any language. However, with digitalisation, archiving, and social media advertising of newswire articles, recent work has shown how, with careful methodology application, large-scale datasets can now be simply gathered instead of written. In this paper, we present a large-scale multilingual summarisation dataset containing articles in 92 languages, spread across 28.8 million articles, in more than 35 writing scripts. This is both the largest, most inclusive, existing automatic summarisation dataset, as well as one of the largest, most inclusive, ever published datasets for any NLP task. We present the first investigation on the efficacy of resource building from news platforms in the low-resource language setting. 
Finally, we provide some first insight on how low-resource language settings impact state-of-the-art automatic summarisation system performance.", 29 | } 30 | ``` -------------------------------------------------------------------------------- /render-readme.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | LANGS=$(csvtomd urls.tsv -d "$(echo '\t')") 3 | export LANGS 4 | 5 | cat readme.md.template | envsubst > readme.md 6 | echo readme.md is updated -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | csvtomd 2 | requests 3 | orjson 4 | beautifulsoup4 5 | readability-lxml 6 | -------------------------------------------------------------------------------- /scripts/article.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright 2018 Max Grusky 3 | 4 | Licensed under the Apache License, Version 2.0 (the "License"); 5 | you may not use this file except in compliance with the License. 6 | You may obtain a copy of the License at 7 | 8 | http://www.apache.org/licenses/LICENSE-2.0 9 | 10 | Unless required by applicable law or agreed to in writing, software 11 | distributed under the License is distributed on an "AS IS" BASIS, 12 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | See the License for the specific language governing permissions and 14 | limitations under the License. 15 | 16 | This file has been altered and adapted. Specifically, it is modified to accommodate 17 | the common crawl archive instead of the Wayback Machine.
18 | """ 19 | 20 | import re 21 | 22 | from urllib.parse import quote, urlparse, urljoin 23 | from bs4 import BeautifulSoup 24 | from readability import Document 25 | 26 | 27 | _whitespace = re.compile(r"\s+") 28 | 29 | 30 | class Article(object): 31 | 32 | """ 33 | Reads in a {url: "", html: ""} archive entry from the downloader script. 34 | This will scrape the provided HTML and extract the summary and text. Note 35 | that the provided URL in this case is actually the ARCHIVE url (Maybe this 36 | should be made clearer in the downloader script?). 37 | """ 38 | 39 | def __init__(self, archive, html): 40 | 41 | self.archive = archive 42 | self.html = html if html is not None else "" 43 | 44 | # @djam my doing 45 | self.url = archive 46 | self.date = None 47 | # self._parse_archive() 48 | self._parse_html() 49 | 50 | def _parse_archive(self): 51 | 52 | *splits, url = self.archive.split("id_/") 53 | *_, date = splits[0].split("/") 54 | 55 | self.url = self.normalize_url(url) 56 | self.date = date 57 | 58 | def _parse_html(self): 59 | 60 | self._load_html() 61 | self._find_canonical_url() 62 | 63 | self._extract_text() 64 | self._extract_summary() 65 | 66 | def _extract_summary(self): 67 | 68 | self.all_summaries = {} 69 | 70 | for meta in self.soup.findAll("meta"): 71 | for attr, value in meta.attrs.items(): 72 | 73 | if attr in ("name", "property") and "description" in value: 74 | 75 | # Extract the tag content. If we can't find anything, 76 | # ignore it and move onto the next tag. 
77 | 78 | try: 79 | 80 | self.all_summaries[value] = meta.get("content").strip() 81 | 82 | except Exception: 83 | 84 | continue 85 | 86 | if len(self.all_summaries) == 0: 87 | 88 | self.summary = None 89 | return 90 | 91 | for kind in ("og:description", "twitter:description", "description"): 92 | 93 | if kind in self.all_summaries: 94 | 95 | self.summary = self.all_summaries[kind] 96 | break 97 | 98 | else: 99 | 100 | random_pick = sorted(self.all_summaries)[0] 101 | self.summary = self.all_summaries[random_pick] 102 | 103 | def _extract_text(self): 104 | 105 | """ 106 | Uses Readability to extract the body text and titles of the articles. 107 | """ 108 | 109 | # Confusingly, the Readability package calls the body text of the article 110 | # its "summary." We want to create a plain text document from the body text, 111 | # so we need to extract the text from Readability's HTML version. 112 | 113 | body_soup = BeautifulSoup(self.readability.summary(), "lxml") 114 | 115 | # Now go through and extract each paragraph (in order). 116 | 117 | paragraph_text = [] 118 | for paragraph in body_soup.findAll("p"): 119 | 120 | # Very short pieces of text tend not to be article body text, but 121 | # captions, attributions, and advertising. It seems that excluding 122 | # paragraphs shorter than five words removes most of this. 123 | 124 | if len(paragraph.text.split()) >= 5: 125 | 126 | paragraph_body = _whitespace.sub(" ", paragraph.text).strip() 127 | paragraph_text.append(paragraph_body) 128 | 129 | # We join the plain text paragraphs of the article with double new lines. 130 | 131 | self.text = "\n\n".join(paragraph_text) 132 | 133 | # "Short title" uses in-page heuristics to remove cruft from