├── .gitignore
└── README.adoc

/.gitignore:
--------------------------------------------------------------------------------
1 | /README.html
2 |
--------------------------------------------------------------------------------
/README.adoc:
--------------------------------------------------------------------------------
1 | = All GitHub Commit Emails
2 | :idprefix:
3 | :idseparator: -
4 | :nofooter:
5 | :sectanchors:
6 | :sectlinks:
7 | :sectnumlevels: 6
8 | :sectnums:
9 | :toc-title:
10 | :toc: macro
11 | :toclevels: 6
12 |
13 | 5.8 million unique GitHub commit emails (git config user.email) extracted from https://www.githubarchive.org from 2011-02-12 to 2015-12-31. 2016-06-03: emails removed from this repo after request from GitHub. 2016-08-11: emails on GitHub Archive hashed. 2016-08-30: privacy policy updated (because of this repo?! We shall never know). 2018-02-21: dangling commits actually removed... :-)
14 |
15 | Cite this repo: https://zenodo.org/badge/latestdoi/42107330[image:https://zenodo.org/badge/42107330.svg[]] Related: https://cirosantilli.com/all-github-commit-emails
16 |
17 | .It was Ciro Santilli who got the commit emails hashed in the GH Archive dataset.
18 | image::https://raw.githubusercontent.com/cirosantilli/media/master/GitHub_Archive_Google_bigquery_PushEvent_email_highlight.png[]
19 |
20 | What it looked like before takedown:
21 |
22 | * http://archive.is/QM2at
23 | * http://archive.is/lDFDi
24 |
25 | image::https://raw.githubusercontent.com/cirosantilli/media/master/All_GitHub_commit_emails_repo_screenshot_before_takedown_archive_is.png[]
26 |
27 | toc::[]
28 |
29 | == Update 2018-02-21
30 |
31 | In the past few days it was brought to my attention that the dangling commits were still visible, and in particular that the full repo could still be easily downloaded, as explained at: https://stackoverflow.com/questions/872565/remove-sensitive-files-and-their-commits-from-git-history/32840254#32840254
32 |
33 | I then contacted GitHub staff, and they replied that it was because the now deleted pull request 2: https://github.com/cirosantilli/all-github-commit-emails/issues/2 pointed to it. They asked to delete it, I approved, and now the dangling commits are gone AFAIK.
34 |
35 | This means that for more than a year after the original removal attempt, the emails were still easily extractable :-)
36 |
37 | It also taught us that GitHub staff have the power to completely delete issues: https://stackoverflow.com/questions/3081521/how-to-completely-remove-an-issue-from-github/25324423#25324423
38 |
39 | == Update 2016-08-30
40 |
41 | The new privacy statement covers this more explicitly. I forgot to take a web archive snapshot at that time, but here is the current one: http://web.archive.org/web/20171224084808/https://help.github.com/articles/github-privacy-statement/#public-information-on-github :
42 |
43 | ____
44 | If you would like to compile GitHub data, you may only use any public-facing Personal Information you gather for the purpose for which our user has authorized it.
For example, where a GitHub user has made an email address public-facing for the purpose of identification and attribution, do not use that email address for commercial advertising. We expect you to reasonably secure any Personal Information you have gathered from GitHub, and to respond promptly to complaints, removal requests, and "do not contact" requests from GitHub or GitHub users.
45 | ____
46 |
47 | == Update 2016-08-11
48 |
49 | Arfon https://github.com/arfon confirmed to me that emails were removed via:
50 |
51 | * https://github.com/igrigorik/githubarchive.org/issues/112
52 | * https://github.com/igrigorik/githubarchive.org/commit/fd53d3a80fd07289581541cc99446d2dce36c770
53 | * https://github.com/igrigorik/githubarchive.org/commit/c9ae11426e5bcc30fe15617d009dfc602697ecde
54 |
55 | == Update 2016-08-03
56 |
57 | I've checked it today, and the emails seem to have been removed from GitHub Archive: they are either null, or of the form `@`.
58 |
59 | == Update 2016-06-03
60 |
61 | GitHub has asked me to take the emails down, and since they are also talking to GitHub Archive and GHTorrent, I took the emails down in good faith. GitHub Archive data was still available.
62 |
63 | support@github.com 2016-06-03 email:
64 |
65 | ____
66 | Hi Ciro,
67 |
68 | I'm reaching out on behalf of GitHub's Support team. A number of users have contacted us with questions about the following repository: https://github.com/cirosantilli/all-github-commit-emails.
69 |
70 | We read through the repository's README, and appreciate your point about how easy it can be for spammers to scrape users' email addresses. We also appreciate that you've brought this to the community's (and our) attention. We thought you'd like to know we've been working with GitHub Archive and GHTorrent to scrub this data from their archives, and that both organizations have agreed to refrain from collecting this data in the future.
Additionally, we'll be working with other research organizations to encourage them to comply with our policies on data minimization.
71 |
72 | In keeping with these efforts to better protect our users' privacy, we're asking you to take down your all-github-commit-emails repository. We'd like for you to remove it voluntarily. However, because it does violate our policies on user privacy and spam, we will remove it after one week if you do not.
73 |
74 | Please let me know if you have any questions.
75 |
76 | Best, Elizabeth
77 | ____
78 |
79 | My reply:
80 |
81 | ____
82 | Hi there.
83 |
84 | Were you thinking about this before, or was it I that brought it to your attention?
85 |
86 | Can we do it like this: I force push the actual emails out (already done), keep the README intact, and then you tell the engineering team to do a git gc to remove old commits from your APIs? This way I get to keep my heard earned stars and URL. The extraction method itself is trivial once you know the words "GitHub Archive extract commit emails".
87 |
88 | I'll expect a future blog post on the matter from GitHub when you've got things sorted out, clarifying / quoting your data scraping policy, and saying that GitHub Archive and GHTorrent were not compliant. Will you forbid mass scrapping entirely, or just remove the emails? Will you modify the current events API as well?
89 |
90 | Please state somewhere official (above blog post or some repository a la https://github.com/github/dmca ) that my repository was taken down and why. If you found my contribution considerable, please mention it on the blog post.
91 |
92 | :-)
93 | ____
94 |
95 | == About
96 |
97 | Goal: let spamees find out via Google how their email leaked. Spammers can get this data any time since it is public.
98 |
99 | Please don't email me privately to take this down / remove your email from the list.
100 |
101 | If you have a new argument, open a public issue.
102 |
103 | Even better, tell GitHub, GitHub Archive, and http://ghtorrent.org/ (not used here) to drop this data instead.
104 |
105 | If they all do that, I will also take this / your email down.
106 |
107 | Otherwise, it makes no sense to take this down, since this data is still easily extracted from the source.
108 |
109 | GitHub has a setting to use a dummy email for web UI operations: https://help.github.com/articles/keeping-your-email-address-private/ , but it does not affect visibility of commits done locally.
110 |
111 | == About the data
112 |
113 | Getting the commit email of a particular user is trivial through the API as explained at: http://stackoverflow.com/a/32456486/895245 , so that is not much of a use case here, and usernames are therefore not included in this data.
114 |
115 | GitHub Archive started scraping on 2011-02-12, so older commits are not covered by this method.
116 |
117 | On 2014-12-31, GitHub started using the new Events API.
118 |
119 | Data is pushed daily to Google BigQuery, and we will update this yearly with all the commits of the previous year.
120 |
121 | This data is not shown on the GitHub web interface, but it is of course public because it can be seen after cloning.
122 |
123 | GitHub also makes this data available on the `PushEvent` of the GitHub events API https://developer.github.com/v3/activity/events/types/#pushevent which GitHub Archive uses to export to a Google BigQuery table.
124 |
125 | == The queries
126 |
127 | Download the query data as explained at: http://stackoverflow.com/questions/18493533/google-bigquery-download-all-data/37274820#37274820
128 |
129 | Extract data up to 2014-12-31:
130 |
131 | ....
132 | SELECT payload_commit_email
133 | FROM [githubarchive:github.timeline]
134 | WHERE type = 'PushEvent'
135 | GROUP BY payload_commit_email
136 | ORDER BY payload_commit_email ASC
137 | ....
138 |
139 | Extract data starting from 2015-01-01:
140 |
141 | ....
142 | SELECT JSON_EXTRACT(payload, '$.commits[0].author.email')
143 | FROM (
144 |   TABLE_DATE_RANGE([githubarchive:day.events_],
145 |     TIMESTAMP('2015-01-01'),
146 |     TIMESTAMP('2015-01-02')
147 | ))
148 | WHERE type = 'PushEvent'
149 | ....
150 |
151 | TODO: it would have been more intelligent to `GROUP BY` to only select unique values, and also do more cleaning on the server. Untested:
152 |
153 | ....
154 | SELECT JSON_EXTRACT_SCALAR(payload, '$.commits[0].author.email')
155 |   AS email
156 | FROM (
157 |   TABLE_DATE_RANGE([githubarchive:day.events_],
158 |     TIMESTAMP('2015-01-01'),
159 |     TIMESTAMP('2015-01-02')
160 | ))
161 | WHERE
162 |   type = 'PushEvent'
163 |   AND email <> ''
164 | GROUP BY email
165 | ORDER BY email
166 | ....
167 |
168 | The above query does not work: BigQuery says `email` is not a field of the table. The likely cause is that `WHERE` cannot reference a `SELECT` alias, so filtering in an outer query around the extraction might fix it (also untested).
169 |
170 | Grouping would reduce the output size by an order of magnitude.
171 |
172 | TODO: extract all emails of a given push. We currently only extract the first one at `commits[0]`. Many JSON path implementations accept `[*]`, but BigQuery does not: http://stackoverflow.com/questions/28719880/how-to-get-all-values-of-an-attribute-of-json-array-with-jsonpath-bigquery-in-bi However, 99% of the time it is the same email.
173 |
174 | == Processing the data
175 |
176 | * Clean up a bit if not done on the query:
177 | +
178 | ....
179 | cat * | sed -E '/^$/d' | sort -u > emails-big
180 | ....
181 | * Merge data from the two queries:
182 | +
183 | ....
184 | sort -u emails-old emails-new > emails-big
185 | ....
186 | * Split into multiple files:
187 | +
188 | ....
189 | split -a4 -C150k -d emails-big emails/
190 | ....
191 | +
192 | GitHub limits:
193 | ** hard limit: 100M per file, larger files cannot be pushed
194 | ** web UI display limit:
195 | *** TODO file size
196 | *** 1000 files per directory
197 | +
198 | TODO: split data further into subdirectories: `00/00`, `00/01`, ...
`99/99` to make loading faster on GitHub http://superuser.com/questions/443972/using-coreutils-split-file-into-pieces-to-different-directories
199 |
200 | == Data mining
201 |
202 | Count emails:
203 |
204 | ....
205 | wc -l *
206 | ....
207 |
208 | Most frequent hostnames:
209 |
210 | ....
211 | cat * | sed -E 's/.*@(.*)$/\1/' | sort | uniq -c | sort -n | tail -n 1000
212 | ....
213 |
214 | TODO: how many emails are valid? This is not simple, since email validity cannot be fully checked with a regex:
215 |
216 | * http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address
217 | * http://stackoverflow.com/questions/8022530/python-check-for-valid-email-address
218 | * http://stackoverflow.com/questions/2138701/email-check-regular-expression-with-bash-script
219 |
220 | Some common invalid emails:
221 |
222 | ....
223 | grep -E '[^0-9a-zA-Z!#$%&'"'"'*+-/=?^_`{|}~@]' * | wc
224 | grep -v '@' * | wc
225 | ....
226 |
227 | * invalid characters: http://stackoverflow.com/questions/2049502/what-characters-are-allowed-in-email-address
228 | * no `@`
229 |
230 | About 4% of the emails failed the above checks.
231 |
232 | In particular, emails containing `<>\n` may make `fsck` unhappy, and may fail to push.
233 |
234 | For fun:
235 |
236 | ....
237 | grep 'password' *
238 | ....
239 |
240 | The data also contains some interesting long lines:
241 |
242 | ....
243 | grep '.\{80\}' *
244 | ....
245 |
246 | == Legality
247 |
248 | * https://www.quora.com/unanswered/Are-version-control-e-g-Git-commit-messages-and-other-metadata-automatically-covered-by-the-same-license-as-the-project
249 | * https://www.quora.com/Is-it-legal-to-sell-a-list-with-publicly-available-contact-emails
250 | * https://en.wikipedia.org/wiki/CAN-SPAM_Act_of_2003
251 | * https://www.avvo.com/legal-answers/can-i-copyright-my-email-address-941873.html
252 |
253 | == Market value
254 |
255 | TODO: any?
(if I hadn't published it)
256 |
257 | * http://www.5-starlists.com/freereport.html
258 | * http://www.blackhatworld.com/blackhat-seo/making-money/525045-how-much-2-mil-email-list-worth.html
259 | * https://www.quora.com/Where-can-I-sell-an-email-list
260 |
261 | == Related projects
262 |
263 | * https://github.com/mmautner/github-email-thief
264 | * https://github.com/hodgesmr/FindGitHubEmail
265 | * https://www.troyhunt.com/8-million-github-profiles-were-leaked-from-geekedins-mongodb-heres-how-to-see-yours/ the scraper database of a company called GeekedIn went public, and Troy said it was serious, but I think they don't have any data that is not readily available from GitHub Archive.
266 |
267 | == Backlinks
268 |
269 | Mostly from GitHub traffic.
270 |
271 | Humans:
272 |
273 | * https://arxiv.org/pdf/1908.05354.pdf (http://web.archive.org/web/20190817173756/https://arxiv.org/pdf/1908.05354.pdf[archive]) "Large-Scale-Exploit of GitHub Repository Metadata and Preventive Measures" by "David Knothe" and "Frederick Pietschmann" published on August 16, 2019.
274 | * 2019-05 https://quassel.flyingyeti.ovh/ The software is https://en.wikipedia.org/wiki/Quassel_IRC by ... https://en.wikipedia.org/wiki/Fly_Yeti ???
275 | * 2016-09 https://www.zhihu.com/question/46957710 https://web.archive.org/web/20160920062505/https://www.zhihu.com/question/46957710
276 | * https://news.ycombinator.com/item?id=11709100
277 | * https://twitter.com/mitsuhiko/status/720349737556127744
278 | * https://twitter.com/ziromr/status/729313948630167552
279 | * https://twitter.com/_pkill/status/727250254723076096
280 |
281 | Internal security tools flashing a red light and leaking "internal" URLs:
282 |
283 | * http://cybersecurity.telefonica.com/threats/es/detections/571f07a94a5062fca2000003
284 | * http://he2007.es/owa/redir.aspx
285 | * http://security.ctrip.com/github-scan/results
286 | * http://wiki.linecorp.com/display/itsec/Exposed+a+server+hostname%28%27www@LNACTNN1501.nhnjp.ism%27%29+on+github.com_20160426
287 | * http://work.alibaba-inc.com/work/reports/detail/17156302
288 | * https://sec.intra.xiaojukeji.com/m
289 | * https://soc.tools.vipshop.com/m
290 | * https://uga2.belcy.com/alerts
291 |
292 | Not sure:
293 |
294 | * 2016-11 http://matrix.cubesec.cn/index.php/home/public/login.html
295 | * http://link.zhihu.com/
296 | * http://wx.qq.com/
297 |
--------------------------------------------------------------------------------