├── README.md
└── dataset
    ├── 2015-07-21.zip
    ├── 2015-07-22.zip
    ├── 2015-07-23.zip
    ├── 2015-07-24.zip
    ├── 2015-07-25.zip
    ├── 2015-07-26.zip
    ├── 2015-07-27.zip
    ├── 2015-07-31.zip
    ├── 2015-08-01.zip
    ├── 2015-08-02.zip
    ├── 2015-08-03.zip
    ├── 2015-08-04.zip
    ├── 2015-08-06.zip
    ├── 2015-08-07.zip
    ├── 2015-08-08.zip
    ├── 2015-08-09.zip
    ├── 2015-08-10.zip
    └── 2015-08-11.zip

/README.md:
--------------------------------------------------------------------------------
# Saudi Newspapers Corpus (SaudiNewsNet)
This repo contains a set of **31,030** Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers.

File Structure
--------------
The folder `dataset` contains a set of ZIP files, where each file has the format `YYYY-MM-DD.zip` and contains one JSON file with a corresponding name `YYYY-MM-DD.json`. The JSON files are stored in UTF-8 encoding.

Each JSON file contains an array of articles (the format of each article is explained in the next section), and its file name reflects the date on which the contained articles were extracted.

Article JSON Object Format
--------------------------
The JSON object for each article contains the following fields (some fields can have empty values if the crawler failed to extract them):

- **`source`**: A string identifier of the newspaper from which the article was extracted. It can have one of the following values:

  | String Identifier | Newspaper |
  | ------------------ | --------- |
  | aawsat | [Al-Sharq Al-Awsat](http://aawsat.com/) |
  | aleqtisadiya | [Al-Eqtisadiya](http://aleqt.com/) |
  | aljazirah | [Al-Jazirah](http://al-jazirah.com/) |
  | almadina | [Al-Madina](http://www.al-madina.com/) |
  | alriyadh | [Al-Riyadh](http://www.alriyadh.com/) |
  | alwatan | [Al-Watan](http://alwatan.com.sa/) |
  | alweeam | [Al-Weeam](http://alweeam.com.sa/) |
  | alyaum | [Al-Yaum](http://alyaum.com/) |
  | arreyadi | [Arreyadi](http://www.arreyadi.com.sa/) |
  | arreyadiyah | [Arreyadiyah](http://www.arreyadiyah.com/) |
  | okaz | [Okaz](http://www.okaz.com.sa/) |
  | sabq | [Sabq](http://sabq.org/) |
  | was | [Saudi Press Agency](http://www.spa.gov.sa/) |
  | 3alyoum | [Ain Alyoum](http://3alyoum.com/) |

- **`url`**: The full URL from which the article was extracted.
- **`date_extracted`**: The timestamp of the date on which the article was extracted. It has the format `YYYY-MM-DD hh:mm:ss`. Note that this field does not necessarily represent the date on which the article was authored (or made available online); however, for articles with an extraction date after August 1, 2015, this field most probably represents the date of authoring.
- **`title`**: The title of the article. Can be empty.
- **`author`**: The author of the article. Can be empty.
- **`content`**: The content of the article.

Statistics
----------
The dataset currently contains **31,030** Arabic articles (with a total of **8,758,976 words**). The articles were extracted from the following Saudi newspapers (sorted by number of articles):

- [Al-Riyadh](http://www.alriyadh.com/) (4,852 articles)
- [Al-Jazirah](http://al-jazirah.com/) (3,690 articles)
- [Al-Yaum](http://alyaum.com/) (3,065 articles)
- [Al-Eqtisadiya](http://aleqt.com/) (2,964 articles)
- [Al-Sharq Al-Awsat](http://aawsat.com/) (2,947 articles)
- [Okaz](http://www.okaz.com.sa/) (2,846 articles)
- [Al-Watan](http://alwatan.com.sa/) (2,279 articles)
- [Al-Madina](http://www.al-madina.com/) (2,252 articles)
- [Al-Weeam](http://alweeam.com.sa/) (2,090 articles)
- [Ain Alyoum](http://3alyoum.com/) (2,080 articles)
- [Sabq](http://sabq.org/) (1,411 articles)
- [Saudi Press Agency](http://www.spa.gov.sa) (369 articles)
- [Arreyadi](http://www.arreyadi.com.sa/) (133 articles)
- [Arreyadiyah](http://www.arreyadiyah.com/) (52 articles)

Citing this Work
------------------
If you'd like to cite this work, you may use one of the following. You may also contact me (mazen [dot] abdulaziz [at] gmail [dot] com) so that I can include your research in the "referring work" section.

- **APA**: Alhagri, M. A. (2015). Saudi Newspapers Arabic Corpus (SaudiNewsNet). http://github.com/ParallelMazen/SaudiNewsNet
- **MLA**: Alhagri, Mazen A. Saudi Newspapers Arabic Corpus (SaudiNewsNet). 2015. http://github.com/ParallelMazen/SaudiNewsNet
- **BibTeX**:
`@misc{hagrima2015,
  author = "M. Alhagri",
  title = "Saudi Newspapers Arabic Corpus (SaudiNewsNet)",
  year = 2015,
  url = "http://github.com/ParallelMazen/SaudiNewsNet"
}`

Contacting the Maintainer
-------------------------
If you'd like to cite this work, have comments or thoughts to share, or just feel like chatting, feel free to contact me via either:

- [My Twitter account @UniqueLock](https://twitter.com/uniquelock)
- mazen [dot] abdulaziz [at] gmail [dot] com

Changelog
---------
- Aug 06, 2015:
  - First batch of articles uploaded (extracted within the period 21/07/2015 to 06/08/2015).
- Aug 07, 2015:
  - Changed output format.
  - Included a second batch (extracted on 07/08/2015).
- Aug 11, 2015:
  - Tracking 2 more newspapers: Saudi Press Agency and Arreyadiyah.
  - Third batch of articles uploaded (timespan: 08/08/2015 to 11/08/2015).

# License
![Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)

This work is licensed under a **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License**. The dataset is shared for the sole purpose of aiding open scientific research (in Arabic computing or linguistics), and can only be used for that purpose. The ownership of each article within the dataset belongs to the respective newspaper from which it was extracted, and the maintainer of the repository does not claim ownership of any of the content within it. If you think that this dataset breaches any established copyright in any way, please contact the repository maintainer.
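
Loading the Dataset (Example Sketch)
------------------------------------
The layout described under File Structure and Article JSON Object Format can be read with Python's standard library alone. The following is a minimal sketch, not part of the original repository: the function names `load_articles`, `parse_extracted`, and `count_by_source` are hypothetical, and the code assumes that each ZIP holds exactly one `.json` member containing an array of article objects, as described above. Decoding with `utf-8-sig` is a defensive assumption that tolerates an optional byte-order mark.

```python
import json
import zipfile
from collections import Counter
from datetime import datetime
from pathlib import Path


def load_articles(zip_path):
    """Return the list of article objects stored in one daily ZIP file."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive holds a single JSON member (YYYY-MM-DD.json).
        json_names = [n for n in archive.namelist() if n.endswith(".json")]
        if not json_names:
            raise ValueError(f"no JSON file found in {zip_path}")
        raw = archive.read(json_names[0])
    # The README states the files are UTF-8; utf-8-sig also accepts a BOM.
    return json.loads(raw.decode("utf-8-sig"))


def parse_extracted(article):
    """Parse `date_extracted` (YYYY-MM-DD hh:mm:ss); None if empty or missing."""
    value = article.get("date_extracted")
    if not value:
        return None
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def count_by_source(dataset_dir):
    """Count articles per `source` identifier across all ZIP files in a folder."""
    counts = Counter()
    for zip_path in sorted(Path(dataset_dir).glob("*.zip")):
        for article in load_articles(zip_path):
            # Fields can be empty when the crawler failed to extract them.
            counts[article.get("source") or "(unknown)"] += 1
    return counts
```

With a local checkout, `count_by_source("dataset").most_common()` should list the per-newspaper totals, which can be compared against the Statistics section.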
94 | -------------------------------------------------------------------------------- /dataset/2015-07-21.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-07-21.zip -------------------------------------------------------------------------------- /dataset/2015-07-22.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-07-22.zip -------------------------------------------------------------------------------- /dataset/2015-07-23.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-07-23.zip -------------------------------------------------------------------------------- /dataset/2015-07-24.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-07-24.zip -------------------------------------------------------------------------------- /dataset/2015-07-25.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-07-25.zip -------------------------------------------------------------------------------- /dataset/2015-07-26.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-07-26.zip -------------------------------------------------------------------------------- /dataset/2015-07-27.zip: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-07-27.zip -------------------------------------------------------------------------------- /dataset/2015-07-31.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-07-31.zip -------------------------------------------------------------------------------- /dataset/2015-08-01.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-08-01.zip -------------------------------------------------------------------------------- /dataset/2015-08-02.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-08-02.zip -------------------------------------------------------------------------------- /dataset/2015-08-03.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-08-03.zip -------------------------------------------------------------------------------- /dataset/2015-08-04.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-08-04.zip -------------------------------------------------------------------------------- /dataset/2015-08-06.zip: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-08-06.zip -------------------------------------------------------------------------------- /dataset/2015-08-07.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-08-07.zip -------------------------------------------------------------------------------- /dataset/2015-08-08.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-08-08.zip -------------------------------------------------------------------------------- /dataset/2015-08-09.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-08-09.zip -------------------------------------------------------------------------------- /dataset/2015-08-10.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-08-10.zip -------------------------------------------------------------------------------- /dataset/2015-08-11.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/inparallel/SaudiNewsNet/4e39239966c2ae8a3faa591201acb9cd1f51199d/dataset/2015-08-11.zip --------------------------------------------------------------------------------