├── .gitattributes
├── .gitignore
├── FileDropperDownloadScreen.png
├── HN.torrent
└── readme.md

/.gitattributes:
--------------------------------------------------------------------------------
# Disable LF normalization for all files
* -text

# Custom for Visual Studio
*.cs diff=csharp
*.sln merge=union
*.csproj merge=union
*.vbproj merge=union
*.fsproj merge=union
*.dbproj merge=union

# Standard to msysgit
*.doc diff=astextplain
*.DOC diff=astextplain
*.docx diff=astextplain
*.DOCX diff=astextplain
*.dot diff=astextplain
*.DOT diff=astextplain
*.pdf diff=astextplain
*.PDF diff=astextplain
*.rtf diff=astextplain
*.RTF diff=astextplain
--------------------------------------------------------------------------------

/.gitignore:
--------------------------------------------------------------------------------
#Visual Studio files
*.pch
*.vspscc
*.vssscc
*_i.c
*_p.c
*.ncb
*.suo
*.sln.DotSettings.user
[Oo]bj/
*.scc
Web.Release.Config
*.publishsettings

# Build results
[Oo]bj/
[Tt]est[Rr]esult*/

#Tooling
_ReSharper*/
*.[Rr]e[Ss]harper

#Subversion files
*.svn

# Office Temp Files
~$*

#PII/large files
data/
App_Data/

#JS build
**/.tmp
**/.sass-cache

#node
npm-debug.log

#OS junk files
Thumbs.db
ehthumbs.db
Desktop.ini
$RECYCLE.BIN/
.DS_Store
--------------------------------------------------------------------------------

/FileDropperDownloadScreen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sytelus/HackerNewsData/9c8a38676dddc3d53cdf80ede63d842388b39235/FileDropperDownloadScreen.png
--------------------------------------------------------------------------------

/HN.torrent:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sytelus/HackerNewsData/9c8a38676dddc3d53cdf80ede63d842388b39235/HN.torrent
--------------------------------------------------------------------------------

/readme.md:
--------------------------------------------------------------------------------
#Hacker News Data Dump Up to May 2014#
There are two files that contain all stories and comments posted to Hacker News (https://news.ycombinator.com/) from its start in 2006 to May 29, 2014 (exact dates are below). The data was downloaded with a simple program, available at https://github.com/sytelus/HackerNewsDownloader, that makes REST API calls to https://hn.algolia.com/api. The program uses API parameters to paginate through the created date of items and retrieve all posts and comments. Each file contains the entire sequence of JSON responses, exactly as returned by the API calls, as a single JSON array.

##HNStoriesAll.json##
Contains all the stories posted on HN from Mon, 09 Oct 2006 18:21:51 GMT to Thu, 29 May 2014 08:25:40 GMT.

###Total count###
1,333,789

###File size###
1.2GB uncompressed, 115MB compressed

###How was this created###
The program used to create this file is available at https://github.com/sytelus/HackerNewsDownloader.

###Format###
The entire file is a JSON-compliant array. Each element in the array is a JSON object that is exactly one response returned by the HN Algolia REST API. The property named `hits` contains the actual list of stories. As this file is very large, we recommend JSON parsers that can work on file streams instead of reading the entire data into memory.
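A minimal way to stream the file with only Python's standard library is to decode the top-level array elements one at a time with `json.JSONDecoder.raw_decode`. This is a sketch, not the downloader's code, and the helper name `iter_responses` is ours, not something from this repo:

```python
import io
import json

def iter_responses(fp, chunk_size=65536):
    """Yield each top-level element of a JSON array from a file-like
    object, buffering only one chunk plus one element at a time."""
    decoder = json.JSONDecoder()
    buf = fp.read(chunk_size).lstrip()
    if not buf.startswith('['):
        raise ValueError("expected a JSON array")
    buf = buf[1:]
    while True:
        buf = buf.lstrip().lstrip(',').lstrip()
        if buf.startswith(']'):
            return  # end of the top-level array
        try:
            obj, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            more = fp.read(chunk_size)
            if not more:
                raise  # truly malformed, not just an incomplete buffer
            buf += more
            continue
        yield obj
        buf = buf[end:]

# Usage on a tiny stand-in for HNStoriesAll.json:
sample = '[{"hits": [{"objectID": "1"}]}, {"hits": [{"objectID": "2"}]}]'
stories = [hit for resp in iter_responses(io.StringIO(sample))
           for hit in resp["hits"]]
# stories -> [{"objectID": "1"}, {"objectID": "2"}]
```

For the real multi-gigabyte files you would pass `open(path, encoding="utf-8")` instead of a `StringIO`; dedicated streaming parsers such as ijson implement the same idea more robustly.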

```json
{
  "hits": [{
    "created_at": "2014-05-31T00:05:54.000Z",
    "title": "Publishers withdraw more than 120 gibberish papers",
    "url": "http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763?WT.mc_id=TWT_NatureNews",
    "author": "danso",
    "points": 1,
    "story_text": "",
    "comment_text": null,
    "num_comments": 0,
    "story_id": null,
    "story_title": null,
    "story_url": null,
    "parent_id": null,
    "created_at_i": 1401494754,
    "_tags": ["story",
      "author_danso",
      "story_7824727"],
    "objectID": "7824727",
    "_highlightResult": {
      "title": {
        "value": "Publishers withdraw more than 120 gibberish papers",
        "matchLevel": "none",
        "matchedWords": []
      },
      "url": {
        "value": "http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763?WT.mc_id=TWT_NatureNews",
        "matchLevel": "none",
        "matchedWords": []
      },
      "author": {
        "value": "danso",
        "matchLevel": "none",
        "matchedWords": []
      },
      "story_text": {
        "value": "",
        "matchLevel": "none",
        "matchedWords": []
      }
    }
  }],
  "nbHits": 636094,
  "page": 0,
  "nbPages": 1000,
  "hitsPerPage": 1,
  "processingTimeMS": 5,
  "query": "",
  "params": "advancedSyntax=true\u0026analytics=false\u0026hitsPerPage=1\u0026tags=story"
}
```

##HNCommentsAll.json##
Contains all the comments posted on HN from Mon, 09 Oct 2006 19:51:01 GMT to Fri, 30 May 2014 08:19:34 GMT.

###Total count###
5,845,908

###File size###
9.5GB uncompressed, 862MB compressed

###How was this created###
The program used to create this file is available at https://github.com/sytelus/HackerNewsDownloader.

###Format###
The entire file is a JSON-compliant array.
Each element in the array is a JSON object that is exactly one response returned by the HN Algolia REST API. The property named `hits` contains the actual list of comments. As this file is very large, we recommend JSON parsers that can work on file streams instead of reading the entire data into memory.

```json
{
  "hits": [{
    "created_at": "2014-05-31T00:22:01.000Z",
    "title": null,
    "url": null,
    "author": "rikacomet",
    "points": 1,
    "story_text": null,
    "comment_text": "Isn\u0026#x27;t the word dyes the right one to use here? Instead of dies?",
    "num_comments": null,
    "story_id": null,
    "story_title": null,
    "story_url": null,
    "parent_id": 7821954,
    "created_at_i": 1401495721,
    "_tags": ["comment",
      "author_rikacomet",
      "story_7824763"],
    "objectID": "7824763",
    "_highlightResult": {
      "author": {
        "value": "rikacomet",
        "matchLevel": "none",
        "matchedWords": []
      },
      "comment_text": {
        "value": "Isn\u0026#x27;t the word dyes the right one to use here? Instead of dies?",
        "matchLevel": "none",
        "matchedWords": []
      }
    }
  }],
  "nbHits": 1371364,
  "page": 0,
  "nbPages": 1000,
  "hitsPerPage": 1,
  "processingTimeMS": 8,
  "query": "",
  "params": "advancedSyntax=true\u0026analytics=false\u0026hitsPerPage=1\u0026tags=comment"
}
```

##Where to download##
As GitHub restricts each file to 100MB and also has policies against data warehousing, these files are currently hosted at FileDropper.com. Unfortunately, FileDropper currently shows ads with misleading download links, so be careful which link you click. Below is a screenshot of the page FileDropper shows; currently, the button marked in red downloads the actual file.
![](FileDropperDownloadScreen.png?raw=true)

####Stories Download URL####
Download Using Browser: http://www.filedropper.com/hnstoriesall


Download Using Torrent (thanks to @saturation):

`magnet:?xt=urn:btih:00bfc9143ecdc8d3c27a170c2d1474e05ccdbc59&dn=HNStoriesAll.7z&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce`


Now also available at the Internet Archive (thanks to Bertrand Fan): https://archive.org/details/HackerNewsStoriesAndCommentsDump

####Comments Download URL####
Download Using Browser: http://www.filedropper.com/hncommentsall


Download Using Torrent (thanks to @saturation):

`magnet:?xt=urn:btih:21abd27bfe4c01264eb0548543606140ee48d19b&dn=HNCommentsAll.7z&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce`


Now also available at the Internet Archive (thanks to Bertrand Fan): https://archive.org/details/HackerNewsStoriesAndCommentsDump


##More Info##
See the blog entry at http://shitalshah.com/p/downloading-all-of-hacker-news-posts-and-comments/

If you have a suggestion for a better place to host these files, please create a new issue in this repo with the info and I'll take a look (or better, just fork and host :)).
--------------------------------------------------------------------------------
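For reference, the created-date pagination described at the top of this readme can be sketched against the public Algolia HN API. This is a rough illustration, not the downloader's actual code: the function names and the fake `fetch` are made up, while the `search_by_date` endpoint and the `tags`/`numericFilters` parameters come from the Algolia HN API:

```python
import json
import urllib.parse
import urllib.request

API = "https://hn.algolia.com/api/v1/search_by_date"

def page_url(tags, before_i, hits_per_page=1000):
    """Build a search_by_date URL for items created strictly before
    the given Unix timestamp (created_at_i)."""
    params = {
        "tags": tags,
        "hitsPerPage": hits_per_page,
        "numericFilters": "created_at_i<%d" % before_i,
    }
    return API + "?" + urllib.parse.urlencode(params)

def fetch_pages(tags, start_i, fetch=None):
    """Yield API responses, walking backwards through created_at_i
    until a page comes back empty. `fetch` maps a URL to a decoded
    response dict; it defaults to a real HTTP GET."""
    if fetch is None:
        fetch = lambda url: json.load(urllib.request.urlopen(url))
    before_i = start_i
    while True:
        resp = fetch(page_url(tags, before_i))
        hits = resp.get("hits", [])
        if not hits:
            return
        yield resp
        # Next page: everything older than the oldest item seen so far.
        before_i = min(h["created_at_i"] for h in hits)

# Demo with a fake fetch so nothing touches the network:
def _fake_fetch(url):
    query = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
    before = int(query["numericFilters"][0].split("<")[1])
    if before > 100:
        return {"hits": [{"created_at_i": 100}]}
    if before > 50:
        return {"hits": [{"created_at_i": 50}]}
    return {"hits": []}

demo_pages = list(fetch_pages("story", 200, fetch=_fake_fetch))
# demo_pages holds two pages; the walk stops at the first empty page
```

Walking the created date this way avoids the API's fixed page-depth limit, since each request only ever asks for items older than those already seen.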