├── .gitignore ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | TODO 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | CC0 1.0 Universal 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS 10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM 11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 13 | 14 | Statement of Purpose 15 | 16 | The laws of most jurisdictions throughout the world automatically confer 17 | exclusive Copyright and Related Rights (defined below) upon the creator 18 | and subsequent owner(s) (each and all, an "owner") of an original work of 19 | authorship and/or a database (each, a "Work"). 20 | 21 | Certain owners wish to permanently relinquish those rights to a Work for 22 | the purpose of contributing to a commons of creative, cultural and 23 | scientific works ("Commons") that the public can reliably and without fear 24 | of later claims of infringement build upon, modify, incorporate in other 25 | works, reuse and redistribute as freely as possible in any form whatsoever 26 | and for any purposes, including without limitation commercial purposes. 27 | These owners may contribute to the Commons to promote the ideal of a free 28 | culture and the further production of creative, cultural and scientific 29 | works, or to gain reputation or greater distribution for their Work in 30 | part through the use and efforts of others. 
31 | 32 | For these and/or other purposes and motivations, and without any 33 | expectation of additional consideration or compensation, the person 34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she 35 | is an owner of Copyright and Related Rights in the Work, voluntarily 36 | elects to apply CC0 to the Work and publicly distribute the Work under its 37 | terms, with knowledge of his or her Copyright and Related Rights in the 38 | Work and the meaning and intended legal effect of CC0 on those rights. 39 | 40 | 1. Copyright and Related Rights. A Work made available under CC0 may be 41 | protected by copyright and related or neighboring rights ("Copyright and 42 | Related Rights"). Copyright and Related Rights include, but are not 43 | limited to, the following: 44 | 45 | i. the right to reproduce, adapt, distribute, perform, display, 46 | communicate, and translate a Work; 47 | ii. moral rights retained by the original author(s) and/or performer(s); 48 | iii. publicity and privacy rights pertaining to a person's image or 49 | likeness depicted in a Work; 50 | iv. rights protecting against unfair competition in regards to a Work, 51 | subject to the limitations in paragraph 4(a), below; 52 | v. rights protecting the extraction, dissemination, use and reuse of data 53 | in a Work; 54 | vi. database rights (such as those arising under Directive 96/9/EC of the 55 | European Parliament and of the Council of 11 March 1996 on the legal 56 | protection of databases, and under any national implementation 57 | thereof, including any amended or successor version of such 58 | directive); and 59 | vii. other similar, equivalent or corresponding rights throughout the 60 | world based on applicable law or treaty, and any national 61 | implementations thereof. 62 | 63 | 2. Waiver. 
To the greatest extent permitted by, but not in contravention 64 | of, applicable law, Affirmer hereby overtly, fully, permanently, 65 | irrevocably and unconditionally waives, abandons, and surrenders all of 66 | Affirmer's Copyright and Related Rights and associated claims and causes 67 | of action, whether now known or unknown (including existing as well as 68 | future claims and causes of action), in the Work (i) in all territories 69 | worldwide, (ii) for the maximum duration provided by applicable law or 70 | treaty (including future time extensions), (iii) in any current or future 71 | medium and for any number of copies, and (iv) for any purpose whatsoever, 72 | including without limitation commercial, advertising or promotional 73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each 74 | member of the public at large and to the detriment of Affirmer's heirs and 75 | successors, fully intending that such Waiver shall not be subject to 76 | revocation, rescission, cancellation, termination, or any other legal or 77 | equitable action to disrupt the quiet enjoyment of the Work by the public 78 | as contemplated by Affirmer's express Statement of Purpose. 79 | 80 | 3. Public License Fallback. Should any part of the Waiver for any reason 81 | be judged legally invalid or ineffective under applicable law, then the 82 | Waiver shall be preserved to the maximum extent permitted taking into 83 | account Affirmer's express Statement of Purpose. 
In addition, to the 84 | extent the Waiver is so judged Affirmer hereby grants to each affected 85 | person a royalty-free, non transferable, non sublicensable, non exclusive, 86 | irrevocable and unconditional license to exercise Affirmer's Copyright and 87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the 88 | maximum duration provided by applicable law or treaty (including future 89 | time extensions), (iii) in any current or future medium and for any number 90 | of copies, and (iv) for any purpose whatsoever, including without 91 | limitation commercial, advertising or promotional purposes (the 92 | "License"). The License shall be deemed effective as of the date CC0 was 93 | applied by Affirmer to the Work. Should any part of the License for any 94 | reason be judged legally invalid or ineffective under applicable law, such 95 | partial invalidity or ineffectiveness shall not invalidate the remainder 96 | of the License, and in such case Affirmer hereby affirms that he or she 97 | will not (i) exercise any of his or her remaining Copyright and Related 98 | Rights in the Work or (ii) assert any associated claims and causes of 99 | action with respect to the Work, in either case contrary to Affirmer's 100 | express Statement of Purpose. 101 | 102 | 4. Limitations and Disclaimers. 103 | 104 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 105 | surrendered, licensed or otherwise affected by this document. 106 | b. Affirmer offers the Work as-is and makes no representations or 107 | warranties of any kind concerning the Work, express, implied, 108 | statutory or otherwise, including without limitation warranties of 109 | title, merchantability, fitness for a particular purpose, non 110 | infringement, or the absence of latent or other defects, accuracy, or 111 | the present or absence of errors, whether or not discoverable, all to 112 | the greatest extent permissible under applicable law. 113 | c. 
Affirmer disclaims responsibility for clearing rights of other persons 114 | that may apply to the Work or any use thereof, including without 115 | limitation any person's Copyright and Related Rights in the Work. 116 | Further, Affirmer disclaims responsibility for obtaining any necessary 117 | consents, permissions or other rights required for any use of the 118 | Work. 119 | d. Affirmer understands and acknowledges that Creative Commons is not a 120 | party to this document and has no duty or obligation with respect to 121 | this CC0 or use of the Work. 122 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Awesome Open Source Software Research Data 2 | 3 | ![GitHub](https://img.shields.io/github/license/sboysel/awesome-oss-research-data) 4 | [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome) 5 | 6 | This is a (curated) list of relevant datasets, data sources, and empirical research in the space of Open Source Software development. We prioritize sources in which (1) the raw data is made publicly accessible or (2) the published metrics are derived from public sources. We also include data sources for which only high level insights are available. 7 | 8 | An excellent list of datasets used for empirical software engineering / mining software repositories exists at [dspinellis/awesome-msr](https://github.com/dspinellis/awesome-msr). Several relevant data sources from this list are included here. 9 | 10 | Contributions are welcome and greatly appreciated! Please open an issue or pull request if you have suggestions for new data sources and research. 
11 | 12 | **Topics** 13 | 14 | - [Development activity](#development-activity) 15 | - [Dependency networks](#dependency-networks) 16 | - [Package download statistics](#package-download-statistics) 17 | - [Security](#security) 18 | - [Community and project health](#community-and-project-health) 19 | - [Community discourse](#community-discourse) 20 | - [Valuation](#valuation) 21 | - [Funding](#funding) 22 | - [Surveys](#surveys) 23 | - [Source code analysis](#source-code-analysis) 24 | - [Bounty platforms](#bounty-platforms) 25 | - [Public policy](#public-policy) 26 | - [General web archives](#general-web-archives) 27 | - [Other resources](#other-resources) 28 | 29 | ## Development activity 30 | 31 | **Datasets** 32 | 33 | | Source | Description | 34 | |:-------|:------------| 35 | | [GHTorrent](http://ghtorrent-downloads.ewi.tudelft.nl/) | Offline mirror of historical data offered by GitHub's REST API | 36 | | [GH Archive](https://www.gharchive.org/) | Records GitHub's public timeline of activity | 37 | | [Ecosyste.ms](https://ecosyste.ms/) | Tools and open datasets to support, sustain, and secure critical digital infrastructure | 38 | | GitHub [REST API](https://docs.github.com/en/rest) and [GraphQL API](https://docs.github.com/en/graphql) | GitHub's APIs for accessing data | 39 | | [GitHub Innovation Graph](https://innovationgraph.github.com/) | High level insights on worldwide GitHub activity over time. [Blog post](https://github.blog/2023-09-21-announcing-the-github-innovation-graph/) and [repo](https://github.com/github/innovationgraph) | 40 | | [Census II of Free and Open Source Software](https://data.world/thelinuxfoundation/census-ii-of-free-and-open-source-software) | Survey of OSS library production usage at the application library level. 
[Report](https://www.linuxfoundation.org/research/census-ii-of-free-and-open-source-software-application-libraries) and [data appendix](https://data.world/thelinuxfoundation/census-ii-of-free-and-open-source-software) | 41 | | [Census III of Free and Open Source Software](https://www.linuxfoundation.org/research/census-iii) | [Report](https://www.linuxfoundation.org/hubfs/LF%20Research/lfr_censusiii_120424a.pdf?hsLang=en) and [open data](https://data.world/thelinuxfoundation) | 42 | | [OSSRank](https://ossrank.com/) | Ranking that provides useful mappings between top projects, project types, and contributors (individuals and private companies) | 43 | | [Open Source Contributor Index (OSCI)](https://opensourceindex.io/) | Measures active and total GitHub contributors by private organizations. Drawn from [GH Archive](https://www.gharchive.org/) (events from GitHub's public timeline) | 44 | 45 | **Research** 46 | 47 | General contribution patterns 48 | 49 | - Choudhary, Samridhi; Bogart, Christopher; Rose, Carolyn; Herbsleb, James (2020): Modeling Productivity in Open Source GitHub Projects: A Dataset and Codebase. Carnegie Mellon University. Dataset. https://doi.org/10.1184/R1/6397013.v1 50 | - Marco Ortu, Giuseppe Destefanis, Daniel Graziotin, Michele Marchesi, Marco Tonelli, 2020. Dataset - How do you propose your code changes? Empirical Analysis of Affect Metrics of Pull Requests on GitHub. https://doi.org/10.5281/zenodo.3825044 51 | - Champion, K. and Hill, B.M., 2021. Underproduction: An approach for measuring risk in open source software. In _2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)_ (pp. 388-399). IEEE. 52 | - [Replication data](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/PUCD2P) 53 | - Wachs, J., Nitecki, M., Schueller, W. and Polleres, A., 2022. The geography of open source software: Evidence from github. Technological Forecasting and Social Change, 176, p.121478. 
54 | - [data collection scripts](https://github.com/n1tecki/Geography-of-Open-Source-Software) and [output data](https://github.com/johanneswachs/OSS_Geography_Data) 55 | - Ekaterina Levitskaya, Gizem Korkmaz, Daniel Mietchen, Lane Rasberry, 2022. Analysis of Linked GitHub and Wikidata https://doi.org/10.5281/zenodo.7443339 56 | 57 | Enterprise driven contribution 58 | 59 | - Spinellis, Diomidis, Kotti, Zoe, Kravvaritis, Konstantinos, Theodorou, Georgios, & Louridas, Panos. (2020). Enterprise-Driven Open Source Software (1.1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3742962 60 | - Angermeir, F., Voggenreiter, M., Moyón, F. and Mendez, D., 2021, May. Enterprise-driven open source software: a case study on security automation. In _2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)_ (pp. 278-287). IEEE. 61 | - Shimels Garomssa, Rathimala Kannan, Ian Chai, Dirk Riehle, 2022. How Software Quality Mediates the Impact of Intellectual Capital on Commercial Open Source Software Company Success. Available at: https://dx.doi.org/10.21227/3rwb-vg72. 62 | 63 | Contributor experience 64 | 65 | - Denivan Campos, Luana Martins, & Ivan Machado. (2022). An empirical study on the influence of developers' experience on software test code quality [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.7110141 66 | - Perez, Quentin, Urtado, Christelle, Vauttier, Sylvain, 2022. Dataset of Open-Source Software Developers Labeled by their Experience Level in the Project and their Associated Software Metrics. https://doi.org/10.5281/zenodo.6966195 67 | 68 | Project characteristics 69 | 70 | - Munaiah, N., Kroh, S., Cabrey, C. and Nagappan, M., 2017. Curating github for engineered software projects. _Empirical Software Engineering_, _22_(6), pp.3219-3253. [project website](https://reporeapers.github.io/results/1.html) 71 | - Dabic, Ozren, Aghajani, Emad, Bavota, Gabriele, 2021. 
GHS (GitHub Search): Sampling Projects in GitHub for MSR Studies. https://doi.org/10.5281/zenodo.4588464 72 | 73 | ## Dependency networks 74 | 75 | **Datasets** 76 | 77 | | Source | Description | 78 | |:-------|:------------| 79 | | ecosyste.ms: Dependency parser for [repositories](https://parser.ecosyste.ms/) | An open API service to parse dependency metadata from the manifest files of many open source software ecosystems. | 80 | | ecosyste.ms: Dependency resolver for [packages](https://resolve.ecosyste.ms/) | An open API service to resolve dependency trees of packages for many open source software ecosystems. | 81 | | [Libraries.io](https://libraries.io/) | Data on software package dependency relationships over time. Sourced from a number of different ecosystems. | 82 | | Open Source Insights / [deps.dev](https://deps.dev/) | A Google project to develop a software dependency graph across ecosystems. Versioning and vulnerability information included. | 83 | | [Repology](https://repology.org/) | Monitors software package vintages (i.e. versioning) across a number of ecosystem repositories. | 84 | | [Data for Software Ecosystem Analysis (DaSEA)](https://dasea-project.github.io/DASEA/) | A continuously updated dataset of software dependencies covering various package manager ecosystems. | 85 | 86 | ## Package download statistics 87 | 88 | **Datasets** 89 | 90 | | Source | Description | 91 | |:-------|:------------| 92 | | [PyPI](https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/) | Python Package Index download statistics. Accessible via [BigQuery dataset](https://console.cloud.google.com/marketplace/product/gcp-public-data-pypi/pypi?q=search&referrer=search&project=ghtorrent-293822) or [simple interface](https://pypistats.org/) | 93 | | [CRAN](https://cran.r-project.org/) | R package download statistics.
[CRAN logs](http://cran-logs.rstudio.com/) | 94 | | [RubyGems](https://ui.honeycomb.io/ruby-together/datasets/rubygems.org/result/7WeAgPYyjTH) | Ruby package traffic statistics. Tons of information created by [Honeycomb](https://docs.honeycomb.io/quickstart/) | 95 | | [Julia](https://discourse.julialang.org/t/announcing-package-download-stats/69073) | Julia download statistics since October 2021 | 96 | | [npm](https://www.npmjs.com/package/npm-stats-api) | Node.js download statistics API | 97 | | [NuGet](https://nugettrends.com/) | Historical .NET/C# download numbers. [GitHub project](https://github.com/dotnet/nuget-trends) | 98 | | [PECL](https://pecl.php.net/package-stats.php) and [PEAR](https://pear.php.net/package-stats.php?cid=3&pid=40&rid) | PHP download statistics | 99 | | [crates.io](https://crates.io/data-access) | Rust download statistics | 100 | | [Clojars API](https://github.com/clojars/clojars-web/wiki/Data#download-stats) | Clojure package download statistics. | 101 | 102 | ## Security 103 | 104 | **Datasets** 105 | 106 | | Source | Description | 107 | |:-------|:------------| 108 | | [CVE Program](https://www.cve.org/) | CVE list bulk downloads available in [CVEProject/cvelistV5](https://github.com/CVEProject/cvelistV5) | 109 | | [NIST National Vulnerability Database](https://nvd.nist.gov/) | A Common Vulnerabilities and Exposures (CVE) database. **Timeframe:** October 1988 - present. | 110 | | [CISA Known Exploited Vulnerabilities Catalog](https://www.cisa.gov/known-exploited-vulnerabilities-catalog) | CISA-maintained database of vulnerabilities known to be exploited in the wild. [Each KEV is linked to a CVE](https://www.cve.org/About/RelatedEfforts#KEV) | 111 | | [GitHub Advisory Database](https://github.com/advisories) | A database of CVEs and security issues affecting GitHub packages.
Drawn from [a variety of sources](https://github.com/github/advisory-database#sources) and recorded using [Open Source Vulnerability Format](https://ossf.github.io/osv-schema/). **Timeframe:** October 2017 - present. | 112 | | [Open Source Vulnerability (OSV) Database](https://osv.dev/) | Draws from a variety of [sources](https://github.com/google/osv.dev#current-data-sources) across ecosystems. GCS bucket: [https://osv-vulnerabilities.storage.googleapis.com/](https://osv-vulnerabilities.storage.googleapis.com/). **Note:** encompasses GitHub Advisory Database | 113 | | CVEfixes Dataset | Bhandari, Guru, Naseer, Amara, Moonen, Leon, 2021. CVEfixes Dataset: Automatically Collected Vulnerabilities and Their Fixes from Open-Source Software. [DOI](https://doi.org/10.5281/zenodo.4476564) | 114 | | [GitHub Issue Dataset](https://doi.org/10.5281/zenodo.5048542) | Anas Nadeem, 2021. GitHub Issue Dataset From Top Repositories of Top Languages. [DOI](https://doi.org/10.5281/zenodo.5048542) | 115 | 116 | ## Community and project health 117 | 118 | **Datasets** 119 | 120 | | Source | Description | 121 | |:-------|:------------| 122 | | [CHAOSS](https://chaoss.community/) | Linux Foundation project to establish OSS community health metrics. [Metric definitions](https://chaoss.community/kbtopic/all-metrics/) | 123 | | [OpenSSF Best Practices Badge Program](https://bestpractices.coreinfrastructure.org/en) | [Listing of projects](https://bestpractices.coreinfrastructure.org/en/projects), [high-level project statistics](https://bestpractices.coreinfrastructure.org/en/project_stats), [high-level criteria statistics](https://bestpractices.coreinfrastructure.org/en/criteria_stats) | 124 | | OpenSSF Criticality Scores | An effort by [OpenSSF Securing Critical Projects WG](https://github.com/ossf/wg-securing-critical-projects). 
[Algorithm: "Quantifying Criticality"](https://github.com/ossf/criticality_score/blob/main/Quantifying_criticality_algorithm.pdf) by Rob Pike and [data repository](https://github.com/ossf/criticality_score) | 125 | 126 | **Tools** 127 | | Source | Description | 128 | |:-------|:------------| 129 | | [isitmaintained.com](https://isitmaintained.com/) | Quick status checks for public GitHub repositories (e.g. median issue resolution time, percentage of open issues). | 130 | | [GitWhois](https://gitwhois.com/) | High-level glance into GitHub repositories | 131 | 132 | **Research** 133 | 134 | - Goggins, S., Lumbard, K. and Germonprez, M., 2021, May. Open source community health: Analytical metrics and their corresponding narratives. In _2021 IEEE/ACM 4th International Workshop on Software Health in Projects, Ecosystems and Communities (SoHeal)_ (pp. 25-33). IEEE. 135 | 136 | ## Community discourse 137 | 138 | | Source | Description | 139 | |:-------|:------------| 140 | | [StackExchange](https://stackexchange.com/) | Public Q&A data across the StackExchange network. [SO's Data Explorer](https://data.stackexchange.com/stackoverflow) and [latest data dump hosted by Internet Archive](https://archive.org/details/stackexchange). [Older vintages](https://meta.stackexchange.com/a/224922/619295) can be tracked down. | 141 | | [Linux Kernel Mailing List](https://lkml.org/) | The Linux kernel mailing list. | 142 | | [Apache Mail Archives](https://lists.apache.org/) | Mailing list archives for Apache projects | 143 | | [GNU Mail Archives](https://lists.gnu.org/archive/html/) | Mailing lists used by various GNU projects | 144 | | [Python Mailing Lists](https://mail.python.org/mailman/listinfo) | Mailing lists used by various Python projects | 145 | | [The Mail Archive](https://www.mail-archive.com/) | Catalogs a number of public mailing lists for collaborative projects.
[FAQ](https://www.mail-archive.com/faq.html) | 146 | | [Mailing list ARChives](https://marc.info/) | | 147 | 148 | ## Valuation 149 | 150 | **Research** 151 | 152 | - Blind, K., Böhm, M., Grzegorzewska, P., Katz, A., Muto, S., Pätsch, S. and Schubert, T., 2021. The impact of Open Source Software and Hardware on technological independence, competitiveness and innovation in the EU economy. Final Study Report. European Commission, Brussels, doi, 10, p.430161. 153 | - Bayoán Santiago Calderón, Robbins, Guci, Korkmaz, and Kramer. 2022. Measuring the Cost of Open-Source Software Innovation on GitHub. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2022-05-26. https://doi.org/10.3886/E158823V2 154 | - [BEA link to report](https://www.bea.gov/research/papers/2022/measuring-cost-open-source-software-innovation-github) 155 | - [dataset](https://www.openicpsr.org/openicpsr/project/158823/version/V2/view) 156 | - Wright, N.L., Nagle, F. and Greenstein, S., 2023. Open source software and global entrepreneurship. Research Policy, 52(9), p.104846. 157 | - Hoffmann, Manuel and Nagle, Frank and Zhou, Yanuo. 2024. The Value of Open Source Software. Harvard Business School Strategy Unit Working Paper No. 24-038. http://dx.doi.org/10.2139/ssrn.4693148. Available at SSRN: https://ssrn.com/abstract=4693148 158 | 159 | ## Funding 160 | 161 | **Datasets** 162 | 163 | | Source | Description | 164 | |:-------|:------------| 165 | | [GitHub Sponsors](https://github.com/sponsors) | [List of dependencies for projects owned by the currently authenticated user](https://github.com/sponsors/explore) (i.e. you). CSV export available. 
| 166 | | [jonathimer/awesome-oss-investors](https://github.com/jonathimer/awesome-oss-investors) | List of VCs investing in commercial open-source startups | 167 | | [Kivach](https://kivach.org/) | "Cascading funding": donations to a project redistributed upstream | 168 | | [Ko-fi](https://ko-fi.com/) | | 169 | | [Liberapay](https://en.liberapay.com/) | | 170 | | [Open Collective](https://opencollective.com/) | Transparent budgeting | 171 | | [oss.fund](https://www.oss.fund/) | Aggregator for OSS funding opportunities, programs, and platforms | 172 | | [StackAid](https://www.stackaid.us/) | Donations redistributed evenly across a project's dependencies. Provides a helpful [simulation of funding allocation](https://simulation.stackaid.us/projects) | 173 | | [Secure Open Source Rewards (sos.dev)](https://sos.dev/) | | 174 | | [ralphtheninja/open-funding](https://github.com/ralphtheninja/open-funding) | Guide to OSS funding options | 175 | 176 | **Research** 177 | 178 | - Boysel, S., Nagle, F., Carter, H., Hermansen, A., Crosby, K., Luszcz, J., Lincoln, S., Yue, D., Hoffmann, M., Staub, A., 2024. Open Source Software Funding Report. [https://opensourcefundingsurvey2024.com/](https://opensourcefundingsurvey2024.com/) 179 | - Conti, A., Peukert, C. and Roche, M., 2025. Beefing IT Up for Your Investor? Engagement with Open Source Communities, Innovation, and Startup Funding: Evidence from GitHub. Organization Science. 
180 | 181 | ## Surveys 182 | 183 | - [Linux Foundation surveys](https://data.world/thelinuxfoundation?entryTypeLabel=dataset&tab=resources) 184 | - [2020 FOSS Contributor Survey](https://www.linuxfoundation.org/resources/publications/foss-contributor-survey-2020) 185 | - access: high level insights public 186 | - timeframe: 2020 187 | - [2021 Diversity, Equity, and Inclusion in Open Source](https://data.world/thelinuxfoundation/2021-diversity-equity-and-inclusion-in-open-source) 188 | - access: [high level insights](https://www.linuxfoundation.org/research/the-2021-linux-foundation-report-on-diversity-equity-and-inclusion-in-open-source#:~:text=Additional%20Resources-,Presentation%20Data,-Open%20Data) and [survey data](https://data.world/thelinuxfoundation/2021-diversity-equity-and-inclusion-in-open-source) 189 | - timeframe: 2021 190 | - [2022 State of Open Source Security](https://snyk.io/reports/open-source-security/) 191 | - In collaboration with Snyk.io 192 | - access: high level insights public 193 | - timeframe: 2022 194 | - Annual Jobs Survey 195 | - access: high level insights and survey data available ([2022](https://data.world/thelinuxfoundation/2022-10th-annual-jobs-survey), [2021](https://data.world/thelinuxfoundation/open-source-jobs-report-2021)) 196 | - timeframe: 2012-2022 197 | - [GitHub State of the Octoverse](https://octoverse.github.com/) 198 | - access: high level insights public 199 | - timeframe: ?-2022 (annual) 200 | - [GitHub Open Source Survey](https://opensourcesurvey.org/2017/#data) 201 | - access: (anonymized) individual responses public 202 | - timeframe: 2017 203 | - [Stack Overflow Annual Developer Survey](https://insights.stackoverflow.com/survey) 204 | - access: (anonymized) individual responses public 205 | - timeframe: 2011-2022 (annual) 206 | - [GitLab Global Developer Survey](https://about.gitlab.com/developer-survey/previous/) 207 | - access: high level insights public 208 | - timeframe: 2016-2020 (annual) 209 | - 
[Tidelift](https://tidelift.com/about/resources/surveys?filter_topic=Survey) 210 | - access: high level insights 211 | - timeframe: 2018-2022 (annual, varying topics) 212 | - [SlashData Developer Nation Survey](https://www.developereconomics.net/) 213 | - access: not distributed publicly 214 | - timeframe: ? 215 | - [O'Reilly: "The Value of Open Source in the Cloud Era"](https://www.oreilly.com/library/view/the-value-of/9781098103286/ch01.html) 216 | - access: high level insights public 217 | - timeframe: 218 | - [PDF copy of report](https://openjsf.org/wp-content/uploads/sites/84/2021/05/TheValueofOpenSourceinaCloudEra-FinalOReillyReport-1.pdf) 219 | - [FINOS / Linux Foundation State of Open Source in Financial Services](https://www.finos.org/state-of-open-source-in-financial-services-2021) 220 | - access: high level insights public 221 | - timeframe: 2021-2022 (annual) 222 | - [Open Source Programs Survey](https://github.com/todogroup/osposurvey) by [TODO Group](https://todogroup.org/#) 223 | - access: firm-level responses public 224 | - timeframe: 2018-2022 (annual) 225 | - [additional resources on OSPOs](https://todogroup.org/guides/) 226 | - [Open Source Initiative and OpenLogic: State of Open Source survey](https://www.openlogic.com/resources/2022-open-source-report) 227 | - access: high level insights public 228 | - timeframe: 2022 229 | - Hertel, G., Niedner, S. and Herrmann, S., 2003. Motivation of software developers in Open Source projects: an Internet-based survey of contributors to the Linux kernel. _Research policy_, _32_(7), pp.1159-1177. 230 | - Lakhani, K.R. and Wolf, R.G., 2003. Why hackers do what they do: Understanding motivation and effort in free/open source software projects. _Open Source Software Projects_ (September 2003). 231 | - Georgia M. Kapitsaki, Maria Papoutsoglou, Daniel German, Lefteris Angelis, 2020. Dataset from "What do developers talk about open source software licensing?" - SEAA2020.
https://doi.org/10.5281/zenodo.3871565 232 | - Feitelson, Dror. (2021). Survey on Developer and Researcher Views on the Ethics of Experiments on Open-Source Projects [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5752053 233 | 234 | ## Source code analysis 235 | 236 | **Datasets** 237 | 238 | | Source | Description | 239 | |:-------|:------------| 240 | | [Software Heritage](https://archive.softwareheritage.org/) | Historical archive of source code | 241 | | [NIST National Software Reference Library](https://www.nist.gov/itl/ssd/software-quality-group/national-software-reference-library-nsrl) | A collection of hashes and metadata used to uniquely identify individual files across a set of software projects. Forensic use cases include identifying software and malicious elements based solely on file contents. | 242 | 243 | ## Bounty platforms 244 | 245 | | Source | Description | 246 | |:-------|:------------| 247 | | [IssueHunt](https://issuehunt.io/) | A platform for funding open source projects. | 248 | | [Bountysource](https://www.bountysource.com/) | A platform for funding open source projects. | 249 | | [boss.dev](https://www.boss.dev/) | A platform for funding open source projects. | 250 | 251 | ## Public policy 252 | 253 | - [CSIS Government Open Source Software Policies](https://www.csis.org/programs/strategic-technologies-program/resources/government-open-source-software-policies) 254 | - Dataset of public policies and legislation dealing with open source software from governments around the world. 255 | 256 | 257 | ## General web archives 258 | 259 | | Source | Description | 260 | |:-------|:------------| 261 | | [Common Crawl](https://commoncrawl.org/) | Raw page data, metadata, and extracted text from publicly accessible segments of the internet. Timeframe: 2008 - present, monthly since March 2014.
Data hosted on Amazon S3: [getting started docs](https://commoncrawl.org/the-data/get-started/) | 262 | | [Internet Archive](https://archive.org/) | Less systematic crawls with a longer history. Access via the [Wayback Machine](https://web.archive.org/) or its [API](https://archive.org/help/wayback_api.php) | 263 | | [Archive Team](https://archiveteam.org/) | A group of volunteers that archives web pages and other content. Data is available via the [Wayback Machine](https://web.archive.org/) or its [API](https://archive.org/help/wayback_api.php) | 264 | | Wikipedia | [data dumps](https://dumps.wikimedia.org/enwiki/) and [SQL access](https://quarry.wmcloud.org/) | 265 | | [Wikidata](https://www.wikidata.org/) | A free and open knowledge base that can be read and edited by both humans and machines. Data is available via the [Wikidata Query Service](https://query.wikidata.org/) or its [API](https://www.wikidata.org/wiki/Wikidata:Data_access) | 266 | | [Wikimedia Commons](https://commons.wikimedia.org/wiki/Main_Page) | A free media repository. Data is available via [database downloads](https://commons.wikimedia.org/wiki/Commons:Database_download) or the [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page) | 267 | 268 | 269 | ## Other resources 270 | 271 | - [NBER Economics of Digitization](https://www.nber.org/programs-projects/projects-and-centers/economics-digitization/economics-digitization-research-projects) 272 | - [Linux Foundation Research](https://www.linuxfoundation.org/research) 273 | - [BigQuery Introduction for GitHub data](https://github.com/dinalav/Data-Science-Slides-and-Notebooks) 274 | - [GitHub Data Dictionary](https://github.com/github/developer-policy/blob/data-dictionary/data_dictionary.md) 275 | - [Google Search for Research Datasets](https://datasetsearch.research.google.com/) --------------------------------------------------------------------------------