├── .github └── ISSUE_TEMPLATE │ ├── 01-insight-guide-proposal.yaml │ └── 02-project-proposal.yaml ├── .gitignore ├── CHAOSS-Data-Science-Prospectus.pdf ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── challenges_survey ├── Challenges_Survey_Results_2023.pdf ├── README.md ├── challenge_comment_categories.md ├── draft_interpretations │ ├── README.md │ └── dawn_analysis │ │ ├── Challenges_Survey_2023.pdf │ │ └── challenge_comment_categories.md └── raw_data │ ├── Clean_CHAOSS_Understanding_Challenges_Survey.csv │ ├── other_challenges_freeform_redacted.txt │ ├── strengths_freeform_redacted.txt │ └── understanding_challenges_graphs.pdf ├── data-ethics-statement.md ├── dataset ├── README.md ├── archive │ ├── README.md │ ├── archived-projects.py │ └── data-files │ │ └── archive_repos.csv ├── foundation-stats │ ├── apacheURLtoTable.md │ ├── apache_url_to_table.py │ ├── dataset │ │ └── foundation-stats │ │ │ ├── structured_project_analysis.csv │ │ │ └── structured_project_cleaned.json │ ├── structured_project_analysis.csv │ └── structured_project_cleaned.json ├── license-changes │ ├── README.md │ ├── dataset_notes.md │ ├── fork-case-study │ │ ├── README.md │ │ ├── commits_people.py │ │ ├── data-files │ │ │ ├── OpenSearch2021-04-12T00:00:00.000+00:002024-08-01T00:00:00.000+00:00.pkl │ │ │ ├── OpenSearch2021-04-12T00:00:00.000+00:002024-09-16T00:00:00.000+00:00.pkl │ │ │ ├── OpenSearch2023-08-01T00:00:00.000+00:002024-08-01T00:00:00.000+00:00.pkl │ │ │ ├── OpenSearch2023-09-16T00:00:00.000+00:002024-09-16T00:00:00.000+00:00.pkl │ │ │ ├── OpenSearch2024-09-16T00:00:00.000+00:002025-03-16T00:00:00.000+00:00.pkl │ │ │ ├── OpenSearch_people_2021-04-12T00:00:00.000+00:002022-04-12T00:00:00.000+00:00.pkl │ │ │ ├── OpenSearch_people_2021-04-12T00:00:00.000+00:002024-08-01T00:00:00.000+00:00.pkl │ │ │ ├── OpenSearch_people_2021-04-12T00:00:00.000+00:002024-09-16T00:00:00.000+00:00.pkl │ │ │ ├── OpenSearch_people_2023-08-01T00:00:00.000+00:002024-08-01T00:00:00.000+00:00.pkl │ │ │ 
├── OpenSearch_people_2023-09-16T00:00:00.000+00:002024-09-16T00:00:00.000+00:00.pkl │ │ │ ├── OpenSearch_people_2024-09-16T00:00:00.000+00:002025-03-16T00:00:00.000+00:00.pkl │ │ │ ├── elasticsearch2019-02-03T00:00:00.000+00:002021-02-03T00:00:00.000+00:00.pkl │ │ │ ├── elasticsearch2020-02-03T00:00:00.000+00:002021-02-03T00:00:00.000+00:00.pkl │ │ │ ├── elasticsearch2023-08-29T00:00:00.000+00:002024-08-29T00:00:00.000+00:00.pkl │ │ │ ├── elasticsearch2024-08-29T00:00:00.000+00:002025-02-29T00:00:00.000+00:00.pkl │ │ │ ├── elasticsearch_people_2019-02-03T00:00:00.000+00:002021-02-03T00:00:00.000+00:00.pkl │ │ │ ├── elasticsearch_people_2020-02-03T00:00:00.000+00:002021-02-03T00:00:00.000+00:00.pkl │ │ │ ├── elasticsearch_people_2021-02-03T00:00:00.000+00:002022-02-03T00:00:00.000+00:00.pkl │ │ │ ├── elasticsearch_people_2023-08-29T00:00:00.000+00:002024-08-29T00:00:00.000+00:00.pkl │ │ │ ├── elasticsearch_people_2024-08-29T00:00:00.000+00:002025-02-29T00:00:00.000+00:00.pkl │ │ │ ├── opentofu2023-09-05T00:00:00.000+00:002024-09-05T00:00:00.000+00:00.pkl │ │ │ ├── opentofu_people_2023-09-05T00:00:00.000+00:002024-09-05T00:00:00.000+00:00.pkl │ │ │ ├── redis2022-03-20T00:00:00.000+00:002024-03-20T00:00:00.000+00:00.pkl │ │ │ ├── redis2023-03-20T00:00:00.000+00:002024-03-20T00:00:00.000+00:00.pkl │ │ │ ├── redis2024-02-20T00:00:00.000+00:002024-08-21T00:00:00.000+00:00.pkl │ │ │ ├── redis2024-03-20T00:00:00.000+00:002024-08-21T00:00:00.000+00:00.pkl │ │ │ ├── redis2024-03-20T00:00:00.000+00:002024-09-20T00:00:00.000+00:00.pkl │ │ │ ├── redis2024-03-20T00:00:00.000+00:002025-03-20T00:00:00.000+00:00.pkl │ │ │ ├── redis_people_2022-03-20T00:00:00.000+00:002024-03-20T00:00:00.000+00:00.pkl │ │ │ ├── redis_people_2023-03-20T00:00:00.000+00:002024-03-20T00:00:00.000+00:00.pkl │ │ │ ├── redis_people_2024-03-20T00:00:00.000+00:002024-08-21T00:00:00.000+00:00.pkl │ │ │ ├── redis_people_2024-03-20T00:00:00.000+00:002024-09-20T00:00:00.000+00:00.pkl │ │ │ ├── 
redis_people_2024-03-20T00:00:00.000+00:002025-03-20T00:00:00.000+00:00.pkl │ │ │ ├── stars-forks │ │ │ │ ├── OpenSearch_forks.csv │ │ │ │ ├── elasticsearch_forks.csv │ │ │ │ ├── elasticsearch_stars.csv │ │ │ │ ├── opensearch-stars.csv │ │ │ │ ├── opentofu-forks.csv │ │ │ │ ├── opentofu-stars.csv │ │ │ │ ├── redis-forks.csv │ │ │ │ ├── redis-stars.csv │ │ │ │ ├── terraform-forks.csv │ │ │ │ ├── terraform-stars.csv │ │ │ │ ├── valkey-forks.csv │ │ │ │ └── valkey-stars.csv │ │ │ ├── terraform2021-08-10T00:00:00.000+00:002023-08-10T00:00:00.000+00:00.pkl │ │ │ ├── terraform2022-08-10T00:00:00.000+00:002023-08-10T00:00:00.000+00:00.pkl │ │ │ ├── terraform2023-08-10T00:00:00.000+00:002024-08-10T00:00:00.000+00:00.pkl │ │ │ ├── terraform_people_2021-08-10T00:00:00.000+00:002023-08-10T00:00:00.000+00:00.pkl │ │ │ ├── terraform_people_2022-08-10T00:00:00.000+00:002023-08-10T00:00:00.000+00:00.pkl │ │ │ ├── terraform_people_2023-08-10T00:00:00.000+00:002024-08-10T00:00:00.000+00:00.pkl │ │ │ ├── valkey2024-03-28T00:00:00.000+00:002024-08-20T00:00:00.000+00:00.pkl │ │ │ ├── valkey2024-03-28T00:00:00.000+00:002024-09-28T00:00:00.000+00:00.pkl │ │ │ ├── valkey2024-03-28T00:00:00.000+00:002025-03-28T00:00:00.000+00:00.pkl │ │ │ ├── valkey_people_2024-03-28T00:00:00.000+00:002024-08-20T00:00:00.000+00:00.pkl │ │ │ ├── valkey_people_2024-03-28T00:00:00.000+00:002024-09-28T00:00:00.000+00:00.pkl │ │ │ └── valkey_people_2024-03-28T00:00:00.000+00:002025-03-28T00:00:00.000+00:00.pkl │ │ └── notebooks │ │ │ ├── OpenSearch.ipynb │ │ │ ├── elasticsearch.ipynb │ │ │ ├── opentofu.ipynb │ │ │ ├── redis.ipynb │ │ │ ├── stars-forks.ipynb │ │ │ ├── terraform.ipynb │ │ │ └── valkey.ipynb │ ├── forks.csv │ ├── generate-license-data.py │ ├── inputdata.csv │ ├── license_changes.csv │ ├── more_forks.csv │ ├── output.json │ └── wikipedia_list.csv ├── releases │ └── fork-relicense-jais-2025-04.tar.gz └── taxonomies │ ├── FOSDEM_ Do we need another open source software taxonomy_ (1).pdf │ ├── 
README.md │ └── Taxonomies - repostatus.csv ├── events └── hackathon-june-2025.md ├── practitioner-guides ├── README.md ├── contributor-sustainability.md ├── diverse-leadership.md ├── images │ ├── active-contrib-over-time-bar-trend.png │ ├── active-organizations-over-time-by-data-source.png │ ├── bus-factor-bar-balanced.png │ ├── bus-factor-pie-one-person.png │ ├── change_requests_abandoned.png │ ├── closure-ratio-falling-behind.png │ ├── closure-ratio-summer-gap.png │ ├── commit-activity-by-domain-unclean.png │ ├── commit-activity-by-domain-vmw.png │ ├── contrib-by-data-source.png │ ├── contributor-growth-by-engagement-bar.png │ ├── issues_abandoned.png │ ├── lead-time.png │ ├── leadership-positions-istio-before-cncf-april-2022.png │ ├── leadership-positions-istio-graduating-june-2023.png │ ├── libyears.png │ ├── ossf-badge-categories.png │ ├── ossf-badge-criteria-example.png │ ├── ossf-badge-curl.png │ ├── releases.png │ ├── time-to-close.png │ └── time-to-first-response.png ├── introduction.md ├── organizational-participation.md ├── responsiveness.md ├── security.md ├── sunset.md └── website-landing.md └── publications ├── Foster-OFA-New-Dynamics-Open-Source-Relicensing-Forks-Community-Impact-2024.pdf ├── README.md └── publication-guidelines.md /.github/ISSUE_TEMPLATE/01-insight-guide-proposal.yaml: -------------------------------------------------------------------------------- 1 | name: Practitioner Guide Proposal 2 | title: "[Practitioner Guide]: TOPIC" 3 | description: Use this template to propose new CHAOSS Practitioner Guides 4 | labels: ['practitioner guide', 'proposal'] 5 | assignees: 6 | - geekygirldawn 7 | 8 | body: 9 | - type: markdown 10 | attributes: 11 | value: | 12 | To avoid duplication and re-work, we ask you to use this template to propose new CHAOSS Practitioner Guides. While metrics models are designed with collections of metrics that can be implemented together, these Practitioner Guides are different from metrics models. 
Practitioner guides are designed to help us humans understand how to interpret metrics within a narrow topic and make improvements based on what is learned from that interpretation. Each Practitioner Guide should focus on 2-4 metrics, but can include a list of additional metrics in Step 3 - Gather Additional Data if needed. 13 | 14 | - type: markdown 15 | attributes: 16 | value: | 17 | ## Practitioner Guide Information 18 | 19 | - type: input 20 | attributes: 21 | label: Practitioner Guide Topic (1 - 3 words) 22 | validations: 23 | required: true 24 | 25 | - type: dropdown 26 | attributes: 27 | label: Is this a getting started guide or a more advanced guide for experienced users? 28 | options: 29 | - Getting Started 30 | - Advanced / Expert 31 | - Not sure 32 | validations: 33 | required: true 34 | 35 | - type: textarea 36 | attributes: 37 | label: Primary Metrics (2 - 4 metrics for Getting Started Guides) 38 | validations: 39 | required: true 40 | 41 | - type: textarea 42 | attributes: 43 | label: Why is this topic important? How will this help people improve their open source project and / or community? Who will benefit from this guide? 44 | validations: 45 | required: true 46 | 47 | - type: dropdown 48 | attributes: 49 | label: How would you like to see this guide developed? 
50 | options: 51 | - I am interested in using this guide, but I do not want to write it myself 52 | - I have the experience and time available to write the first draft 53 | - I would like to help write the guide, but I need someone with more experience in the topic to help me 54 | - Other (please specify in the “Additional Notes” at the end of this form) 55 | validations: 56 | required: true 57 | 58 | - type: textarea 59 | attributes: 60 | label: Additional Notes 61 | validations: 62 | required: false 63 | 64 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/02-project-proposal.yaml: -------------------------------------------------------------------------------- 1 | name: Data Science Project Proposal 2 | title: "[Project]: NAME" 3 | description: Use this template to propose new CHAOSS data science projects 4 | labels: ['project', 'proposal'] 5 | 6 | body: 7 | - type: markdown 8 | attributes: 9 | value: | 10 | The CHAOSS Data Science Community is interested in collaboratively working together on projects. The goal is to build a community of active data scientists within the CHAOSS project who are using CHAOSS data to help answer questions. To get the most value out of a CHAOSS Data Science Initiative, we need to scale our data science efforts and share best practices for using data science-based approaches to understanding project health. This requires building a diverse and inclusive community of people with different ideas, perspectives, and skills who can help define and promote our data science efforts. This will allow us to provide value as a project by using CHAOSS data and data science approaches to answer questions from the other WGs and elsewhere within the CHAOSS community while also allowing people to get more real world data science experience. 
11 | 12 | - type: markdown 13 | attributes: 14 | value: | 15 | ## Project Information 16 | 17 | - type: input 18 | attributes: 19 | label: Project Name (1 - 3 words) 20 | validations: 21 | required: true 22 | 23 | - type: textarea 24 | attributes: 25 | label: Description 26 | description: Provide a summary of this project, including the question you hope to answer and why that question is important for the CHAOSS community 27 | validations: 28 | required: true 29 | 30 | - type: textarea 31 | attributes: 32 | label: Related Links 33 | description: If you have already created additional documents or have other information about this project, please include those links here. 34 | validations: 35 | required: false 36 | 37 | - type: markdown 38 | attributes: 39 | value: | 40 | Note that we also have a [Project Scope Template doc](https://docs.google.com/document/d/13iLNDfqJ8nuwBGEyJuFutcT7KRNT6JwFrSlJN_5f4o4/edit) that you can use to think about the project details if you find it useful (not required). 41 | 42 | - type: dropdown 43 | attributes: 44 | label: How would you like to be involved in this project? 45 | options: 46 | - I am interested in this project, but do not plan to work on it myself 47 | - I have the experience and time available to lead this project 48 | - I would like to help with this project, but would prefer for someone else to lead it 49 | - Other (please specify in the “Additional Notes” at the end of this form) 50 | validations: 51 | required: true 52 | 53 | - type: textarea 54 | attributes: 55 | label: Additional Notes
56 | validations: 57 | required: false 58 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | .ipynb_checkpoints 3 | .vscode 4 | __pycache__ 5 | ~lock 6 | -------------------------------------------------------------------------------- /CHAOSS-Data-Science-Prospectus.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/CHAOSS-Data-Science-Prospectus.pdf -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # How to Contribute 2 | 3 | ## Where can I contribute? 4 | 5 | Anyone can contribute to CHAOSS on any of our communication channels. See . 6 | 7 | * If you think something should be done (including a contribution of your own), please open an issue in this repository. That lets others know what you think should be done and gives them a chance to comment. If you intend to do the work yourself, please say so. 8 | 9 | * Everyone with an opinion on the matter should comment on the issue, explaining whether they support the idea, propose some change to it, or think it is not worthwhile or that this is not the right time for it. 10 | 11 | * If comments are positive and rough consensus is reached, propose a pull request with the changes to the repository (new document, changes to existing documents). 12 | 13 | * Everyone with an opinion on the pull request should comment on it, and detailed reviews should be done, which may lead to requests for new versions of the pull request. Once comments and reviews are positive, the change will be merged into the repository.
14 | 15 | * If consensus is not reached at any of these points, or the process stalls, it can be raised during one of the Common Working Group meetings, or on the mailing list, to try to unblock it. 16 | 17 | ## Which channel should I use? 18 | 1. Slack channel #data-science for general discussions and questions 19 | 2. Issue submission for discussions about future work or bugs / issues with existing work 20 | 3. Pull requests to contribute directly to this repository (after discussing the work in meetings or an issue) 21 | 22 | ### Conversations and high-level contributions 23 | 24 | Strategic directions, clarifications of scope, and ideas in an early stage are best discussed on the #data-science Slack channel or in meetings. See . 25 | 26 | ### Bug report and feature request contributions (issue) 27 | 28 | Bug reports and specific feature requests are best discussed in an issue on the repository they pertain to. 29 | 30 | ### Code or document change contributions (pull request) 31 | 32 | First, make sure your [GitHub account][ssh] is set up, then [fork][fork] the repository and [clone][clone] your fork locally: 33 | 34 | git clone git@github.com:/.git 35 | 36 | Create a [feature branch][fb] in your local repository: 37 | 38 | git checkout -b 39 | 40 | Make your change and commit the change: 41 | 42 | git add 43 | git commit -m "" 44 | 45 | Push to your fork on GitHub: 46 | 47 | git push origin 48 | 49 | Then, [submit a pull request][pr] on GitHub to the CHAOSS repository. 50 | 51 | [ssh]: https://help.github.com/articles/connecting-to-github-with-ssh/ 52 | [fork]: https://help.github.com/articles/fork-a-repo/ 53 | [fb]: https://www.atlassian.com/git/tutorials/comparing-workflows/feature-branch-workflow 54 | [pr]: https://github.com/thoughtbot/factory_girl_rails/compare/ 55 | [clone]: https://help.github.com/articles/cloning-a-repository/ 56 | 57 | At this point you are waiting on the CHAOSS repository maintainers.
They will comment on your pull requests 58 | within three business days (and typically within one). 59 | 60 | The CHAOSS repository maintainers will report on open issues and pull requests on the [calls and via the mail list][participate] to elicit feedback from the community. 61 | 62 | [participate]: https://chaoss.community/participate/ 63 | 64 | ## Committing back to the repository 65 | ## DCO and Sign-Off for contributions 66 | 67 | The [CHAOSS Charter](https://github.com/chaoss/governance/blob/master/project-charter.md) requires that contributions 68 | are accompanied by a [Developer Certificate of Origin](http://developercertificate.org) sign-off. 69 | To enforce this, a bot checks all incoming commits. 70 | 71 | For users of the git command line interface, a sign-off is accomplished with the `-s` flag as part of the commit command: 72 | 73 | ``` 74 | git commit -s -m 'This is a commit message' 75 | ``` 76 | 77 | For users of the GitHub interface (using the "edit" button on any file, and producing a commit from it), 78 | a sign-off is accomplished by writing 79 | 80 | ``` 81 | Signed-off-by: Your Name 82 | ``` 83 | 84 | on a single line in the commit comment field. This can be automated by using a browser plugin like 85 | [DCO GitHub UI](https://github.com/scottrigby/dco-gh-ui). 86 | 87 | #### Steps to use the DCO browser plugin 88 | The [DCO browser plugin](https://github.com/scottrigby/dco-gh-ui) is a handy tool to automatically sign commits created using GitHub. 89 | To enable this plugin: 90 | 91 | - Go to the plugin page on the [chrome web store](https://chrome.google.com/webstore/detail/dco-github-ui/onhgmjhnaeipfgacbglaphlmllkpoijo). 92 | - Alternatively, you could go to the [firefox addon page](https://addons.mozilla.org/en-US/firefox/addon/scott-rigby/) to add the extension to your browser. 93 | - Once you add the extension, right click on the extension in the toolbar of your browser and select `Options`.
94 | - A dialog box will open up as shown below. Fill in your GitHub name (not the handle) and email-id. 95 | 96 | ![Screenshot of settings for DCO GitHub UI](https://user-images.githubusercontent.com/31214064/55411911-194c8500-5584-11e9-8b56-c8f94b6fa213.png) 97 | 98 | - Then, whenever you perform a commit on GitHub, the line `Signed-off-by: Your Name ` will automatically appear in the commit description while making changes to a file as shown in the example below. A commit message can be added to the lines above the auto-generated sign-off. 99 | 100 | ![Screenshot of GitHub UI with auto-generated sign-off in commit message](https://user-images.githubusercontent.com/31214064/55423206-127d3c80-559b-11e9-9a5e-6300105b8858.png) 101 | 102 | - Once you perform the commit and send a pull request, the commit will be verified and approved by the DCO bot. 103 | 104 | ![Screenshot of successful DCO check](https://user-images.githubusercontent.com/31214064/55415829-5f591700-558b-11e9-93ae-07b0ed432a53.png) 105 | 106 | 107 | 108 | ## Who is a CHAOSS repository maintainer? 109 | 110 | The README.md of the repository contains a list of chairs, maintainers, and other roles. 
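The DCO sign-off described in CONTRIBUTING.md above can be checked locally before pushing. The following is a minimal sketch, not the logic of the DCO bot itself: it assumes the trailer format that `git commit -s` produces (`Signed-off-by: Name <email>`), and the helper name `has_sign_off` is hypothetical.

```python
import re

# Trailer written by `git commit -s`: "Signed-off-by: Name <email>".
# This pattern is an illustrative assumption, not the DCO bot's own check.
SIGN_OFF = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_sign_off(commit_message: str) -> bool:
    """Return True if the message contains at least one well-formed sign-off line."""
    return SIGN_OFF.search(commit_message) is not None


if __name__ == "__main__":
    signed = "Fix typo in README\n\nSigned-off-by: Ada Example <ada@example.org>\n"
    unsigned = "Fix typo in README\n"
    print(has_sign_off(signed))
    print(has_sign_off(unsigned))
```

A check like this could be run from a local commit hook so that a missing sign-off is caught before the pull request is opened; the bot on the repository remains the authoritative check.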
111 | 112 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 CHAOSS 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CHAOSS Data Science Working Group 2 | 3 | ## Table of Contents 4 | 5 | - [Introduction](#introduction) 6 | - [Participate](#participate) 7 | - [Practitioner Guides](#practitioner-guides) 8 | - [Projects](#projects) 9 | - [Contributing](#contributing) 10 | - [Contributors](#contributors) 11 | - [License](#license) 12 | 13 | ## Introduction 14 | 15 | ### Goal 16 | 17 | Build a community of data scientists to collaborate on the [CHAOSS Data Science Initiative](https://chaoss.community/inside-the-chaoss-data-science-working-group/). 18 | 19 | ### Purpose 20 | 21 | We will collaborate with data scientists and researchers to shape how we understand open source community health. We also want to make it easier for people to use CHAOSS tools, metrics, and metrics models with data science-based approaches, so they can draw meaningful insights and use them to improve open source project health. 22 | 23 | ### Who should join this working group? 24 | 25 | Anyone interested in data science and data analysis can join. You don't need to be an expert or know how to perform advanced techniques, like machine learning or artificial intelligence. We welcome data scientists, data analysts, researchers and others with an interest in data. 26 | 27 | ### Background 28 | 29 | This is a working group within the CHAOSS project to support our Data Science Initiative. If you work for a company that is interested in sponsoring some of our work, we have a [sponsorship prospectus](CHAOSS-Data-Science-Prospectus.pdf) with more details. 30 | 31 | ## Participate 32 | 33 | ### How to Join Us? 34 | 35 | Want to join the working group?
Here is a simple step-by-step guide on how to join: 36 | 37 | - [Getting started as a new/first time contributor](https://chaoss.community/kb-getting-started/) 38 | - [Agenda/Meeting-Minutes](https://docs.google.com/document/d/1jkAfGt97OGRwcdEn8hh5YyHQwoXRnOW96ikc_Aluo6M/edit) 39 | - Join us in the #wg-data-science channel within the CHAOSS Slack Workspace. 40 | - Learn more on the [Participate](https://chaoss.community/participate/) page on the website 41 | 42 | We follow the [CHAOSS Code of Conduct](https://github.com/chaoss/governance/blob/master/code-of-conduct.md) 43 | 44 | ## Practitioner Guides 45 | 46 | The CHAOSS Data Science Working Group develops a set of [Practitioner Guides](https://chaoss.community/about-chaoss-practitioner-guides/) to help individuals understand how to interpret data about an open source project, enabling them to develop insights that can improve the project's health. They are designed for Open Source Program Offices (OSPOs), project leads, community managers, maintainers, and anyone who wants to understand project health better and take action on what they learn from their metrics. 47 | 48 | 49 | If you are interested in [contributing to the practitioner guides](https://github.com/chaoss/wg-data-science/tree/main/practitioner-guides), you can find more details in the practitioner-guides folder here in the repo. 50 | 51 | ## Projects 52 | 53 | We are also working on several projects using CHAOSS metrics and tools to help answer people's questions about open source projects and their unique dynamics. You can find details about these projects in the WG's [GitHub Issues](https://github.com/chaoss/wg-data-science/issues?q=is%3Aissue+is%3Aopen+label%3Aproject). 54 | 55 | ## Contributing 56 | 57 | See the [CONTRIBUTING.md](CONTRIBUTING.md) for more info.
58 | 59 | ## Contributors 60 | 61 | ### Chairs 62 | 63 | - [Dawn Foster](https://github.com/geekygirldawn) 64 | - [Chan Voong](https://github.com/voongc) 65 | 66 | ### Amazing CHAOSS Project Contributors 67 | 68 | Link to the [contributors](https://chaoss.community/metrics/#user-content-chaoss-contributors-include) listed on the website. 69 | 70 | ## License 71 | 72 | See [LICENSE](LICENSE) file. 73 | 74 | Copyright © CHAOSS, a Linux Foundation Project 75 | -------------------------------------------------------------------------------- /challenges_survey/Challenges_Survey_Results_2023.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/challenges_survey/Challenges_Survey_Results_2023.pdf -------------------------------------------------------------------------------- /challenges_survey/README.md: -------------------------------------------------------------------------------- 1 | # Understanding Challenges Survey 2 | 3 | In August and September, the CHAOSS project ran a survey of existing and past users of CHAOSS tools and metrics designed to help us better understand the barriers and challenges that make it difficult for people to gain meaningful, empirically-driven community health insights using CHAOSS tools and metrics. The [blog post](https://chaoss.community/survey-help-the-chaoss-project-improve-our-tools-and-metrics) has more details about the survey. 4 | 5 | The results of this survey can be found in the [Challenges_Survey_Results_2023.pdf](Challenges_Survey_Results_2023.pdf) file, and there is also a file called [challenge_comment_categories.md](challenge_comment_categories.md) with the comments about the challenges coded into several different categories. 
6 | 7 | The key takeaways include: 8 | 9 | * Installing our software continues to be the biggest challenge 10 | * Finding data and drawing insights from the data are also top challenges 11 | * OSPOs continue to be important users of CHAOSS tools with many using both tools 12 | 13 | In the spirit of open source, the raw data is also available to the CHAOSS data science community to encourage people to explore the data in more detail. The raw data with personally identifiable data redacted can be found in the raw_data directory in this repo. The CSV file contains everything except the free form responses. Because there were very few responses from certain groups (e.g., universities, government, nonprofit), attaching the free form responses to the rest of the data would have made it possible to identify the person / group responsible for certain comments. To prevent identification, the free form comments in the text documents have been placed in random order and some text has been redacted. Here are the filenames with the raw data: 14 | * [Clean_CHAOSS_Understanding_Challenges_Survey.csv](raw_data/Clean_CHAOSS_Understanding_Challenges_Survey.csv) - responses to the quantitative parts of the survey 15 | * [other_challenges_freeform_redacted.txt](raw_data/other_challenges_freeform_redacted.txt) - free form text responses to the question, "What other challenges have you faced that weren’t in the above list or what else would you like to see us improve?" 16 | * [strengths_freeform_redacted.txt](raw_data/strengths_freeform_redacted.txt) - free form text responses to the question, "What do you see as CHAOSS project strengths (what do you love about CHAOSS)?"
17 | * [understanding_challenges_graphs.pdf](raw_data/understanding_challenges_graphs.pdf) - simple graphs produced by Google forms displaying the responses for each question 18 | 19 | -------------------------------------------------------------------------------- /challenges_survey/challenge_comment_categories.md: -------------------------------------------------------------------------------- 1 | This document has responses grouped into categories from the question, "What other challenges have you faced that weren’t in the above list or what else would you like to see us improve?" 2 | 3 | Many comments fall into multiple categories, so you’ll see the same content multiple times. The full text for the responses can be found in the [repo](https://github.com/chaoss/wg-data-science/tree/main/challenges_survey/raw_data) 4 | 5 | **Category: Taking action on data / generating insights from the data (8 Quotations)** 6 | * ability to see and understand what other people are doing in real situation so I can leverage that work, and compare with my own 7 | * Any new metric should have use cases associated where the usefulness of the metric becomes evident 8 | * Overwhelming data. 9 | * No CHAOSS tools make it easy to compare a large number of repos. 10 | * Many CHAOSS metrics aren't quantitative, so evaluating them requires manual examination. CHAOSS metrics don't cover marketing metrics like social media mentions. 11 | * The most challenging part is to communicate them to C level. 12 | * Measuring the health and success of community (marketing) activities 13 | * In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. This requires a certain kind of UX as well as better integration with other data sources and services that we provide. 
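Each category heading in this document declares how many quotations it contains, in the form `**Category: NAME (N Quotations)**`, followed by one `*` bullet per quotation. Because comments are repeated across categories, the declared counts are easy to get out of sync with the bullets. The sketch below shows one way to cross-check them. It assumes it is given the original Markdown text of `challenge_comment_categories.md` (without the `N |` gutters of this dump), and the function name `check_category_counts` is hypothetical.

```python
import re

# Matches headings such as "**Category: Metric Definitions (3 Quotations)**";
# the trailing "s" is optional to allow "(1 Quotation)".
HEADER = re.compile(r"^\*\*Category: (?P<name>.+) \((?P<count>\d+) Quotations?\)\*\*$")


def check_category_counts(markdown: str) -> dict:
    """Map each category name to (declared count, number of bullets found)."""
    results = {}
    current = None
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        match = HEADER.match(line)
        if match:
            # A new category starts; bullets seen from here on belong to it.
            current = match.group("name")
            results[current] = (int(match.group("count")), 0)
        elif current is not None and line.startswith("* "):
            declared, found = results[current]
            results[current] = (declared, found + 1)
    return results


if __name__ == "__main__":
    sample = (
        "**Category: Metric Definitions (3 Quotations)**\n"
        "* first comment\n* second comment\n* third comment\n\n"
        "**Category: Example (1 Quotation)**\n"
        "* only comment\n"
    )
    for name, (declared, found) in check_category_counts(sample).items():
        status = "ok" if declared == found else "mismatch"
        print(f"{name}: declared={declared} found={found} ({status})")
```

Any category where the two numbers differ is worth a manual look, since either the heading or the bullet list may need updating.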
14 | 15 | **Category: CHAOSS Processes (4 Quotations)** 16 | * It's hard to understand what is "official" CHAOSS software (like what is the relationship to compass?) or how things move from the metrics/models into the software. 17 | * There also Augur meetings on the calendar but I'm not sure if they are actually happening, and I'm not sure if they are just for people doing development work on Augur, as opposed to users. 18 | * The focus on OSPOs was a surprise to me as I got more involved in the project, it's understandable but makes it more difficult for people not in an OSPO to be seen as an audience for metrics/software, which is unfortunate because I think there is still a lot of value in CHAOSS metrics for people who are not part of an official OSPO. 19 | * Overall I find it complex to understand what CHAOSS is actually about and how the tools relate to what is described 20 | 21 | **Category: Metric Definitions (3 Quotations)** 22 | * Any new metric should have use cases associated where the usefulness of the metric becomes evident 23 | * Many CHAOSS metrics aren't quantitative, so evaluating them requires manual examination. CHAOSS metrics don't cover marketing metrics like social media mentions. 24 | * Issues and PR backlog 25 | 26 | **Category: Software: Augur (3 Quotations)** 27 | * Both GrimoireLab and Augur don't really provide a UX that is fit for a product. Rather they are fairly complicated back-ends that require a significant amount of configuration to set up. For a well funded OSPO that has a clear list of the data sources they want to track, this may be useful, but for something like [REDACTED], this means that many of our needs are not met. In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. This requires a certain kind of UX as well as better integration with other data sources and services that we provide. 
Particularly with GrimoireLab we found that building on top of the current architecture was fairly difficult. 28 | * I was excited by the prospect of Jupyter notebooks but then it turned out those also rely on Augur which could be fine but then some better instruction on getting just the Augur db running would be good. A simplified Augur install that is just the bare minimum for working with a data snapshot would be nice. Multiple instances of documentation also make working with Augur hard. There also Augur meetings on the calendar but I'm not sure if they are actually happening, and I'm not sure if they are just for people doing development work on Augur, as opposed to users. 29 | * We find Augur easy to install and use. 30 | 31 | **Category: Software: Contributing (2 Quotations)** 32 | * Contributing has itself been a challenge, since documentation is woefully incomplete and (at times) inaccurate 33 | * There also Augur meetings on the calendar but I'm not sure if they are actually happening, and I'm not sure if they are just for people doing development work on Augur, as opposed to users. 34 | 35 | **Category: Software: GrimoireLab (9 Quotations)** 36 | * More clarity/documentation in data model (for comparison against other data sources); OpenSearch backend is incompatible with other internal tooling so our implementation is a bit of an island :/ 37 | * Both GrimoireLab and Augur don't really provide a UX that is fit for a product. Rather they are fairly complicated back-ends that require a significant amount of configuration to set up. For a well funded OSPO that has a clear list of the data sources they want to track, this may be useful, but for something like [REDACTED], this means that many of our needs are not met. In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. 
This requires a certain kind of UX as well as better integration with other data sources and services that we provide. Particularly with GrimoireLab we found that building on top of the current architecture was fairly difficult. 38 | * Integrating grimorelab with custom dashboards and frontend/backend software seemed tricky as far as ive heard from the team that was responsible for that. I dont know the details though, sounded like there was some architectural tech debt (not sure what that means) that made it hard to for the developers to integrate. 39 | * This was more of a product issue (when I used Bitergia), but a few times we had infrastructure issues so that an up-to-date data wasn't available for an extended period. For corporate users (especially in technology companies), the main reason why they go with an outside vendor is so that they don't need to manage infrastructure/software. 40 | * Creating production environments (security, set up backups, durability), updating, migrating to newer components (OpenSearch). 41 | * I worked with Cauldron.io and thought it was a great tool but could not get GL running locally so my experimenting was limited to what was offered there. As a [REDACTED] I don't really have a budget to pay for metrics as a service just to experiment or to try and get other people involved, so I greatly appreciated the cauldron service. 42 | * grimoire sigils should support opensearch / pivot away from kibiter asap 43 | * Bitergia tools are quite complicated and take a lot of time to set up properly. 44 | * Grimoirelab did not work well for the size of our OSPO. 45 | 46 | **Category: Software: Software / Metrics Relationship (1 Quotation)** 47 | * Understanding the relationship between the software and the metrics can be difficult, things aren't always named the same and the methods of calculation are not always transparent in the software so you can't tell if it is really doing what you think it is. 
48 | 49 | **Category: Technical: Compatibility / Tech Stack (7 Quotations)** 50 | * More clarity/documentation in data model (for comparison against other data sources); OpenSearch backend is incompatible with other internal tooling so our implementation is a bit of an island :/ 51 | * From an OSPO perspective, I feel that OpenSSF Scorecard and Open Source Review Toolkit are quite valuable tools and resources ("easy" to set up and get a report). It would be great to see CHAOSS tools at the same level of maturity and value, or even integration with such tools, to get more adoption in the industry. On the other hand, the diversity of tools and setups related to open source development and contribution makes it so hard to find "the right" tools to 52 | * Both GrimoireLab and Augur don't really provide a UX that is fit for a product. Rather they are fairly complicated back-ends that require a significant amount of configuration to set up. For a well funded OSPO that has a clear list of the data sources they want to track, this may be useful, but for something like [REDACTED], this means that many of our needs are not met. In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. This requires a certain kind of UX as well as better integration with other data sources and services that we provide. Particularly with GrimoireLab we found that building on top of the current architecture was fairly difficult. 53 | * Integrating grimorelab with custom dashboards and frontend/backend software seemed tricky as far as ive heard from the team that was responsible for that. I dont know the details though, sounded like there was some architectural tech debt (not sure what that means) that made it hard to for the developers to integrate. 54 | * Creating production environments (security, set up backups, durability), updating, migrating to newer components (OpenSearch). 
55 | * grimoire sigils should support opensearch / pivot away from kibiter asap 56 | * I was excited by the prospect of Jupyter notebooks but then it turned out those also rely on Augur which could be fine but then some better instruction on getting just the Augur db running would be good. 57 | 58 | **Category: Technical: Documentation (3 Quotations)** 59 | * Contributing has itself been a challenge, since documentation is woefully incomplete and (at times) inaccurate 60 | * Documentation seems really good but there are a LOT of steps - I have such a limited amount of time that by the time I read through the documents to remember where I left off last time, I've already run out of time to do any experimentation. 61 | * Multiple instances of documentation also make working with Augur hard. 62 | 63 | **Category: Technical: Ease of Use / UX (9 Quotations)** 64 | * Both GrimoireLab and Augur don't really provide a UX that is fit for a product. Rather they are fairly complicated back-ends that require a significant amount of configuration to set up. For a well funded OSPO that has a clear list of the data sources they want to track, this may be useful, but for something like [REDACTED], this means that many of our needs are not met. In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. This requires a certain kind of UX as well as better integration with other data sources and services that we provide. Particularly with GrimoireLab we found that building on top of the current architecture was fairly difficult. 65 | * Documentation seems really good but there are a LOT of steps - I have such a limited amount of time that by the time I read through the documents to remember where I left off last time, I've already run out of time to do any experimentation. 
66 | * Estimating how much time and effort using tools would take, compared to ad hoc methods of assessing similar questions. 67 | * Overall I find it complex to understand what CHAOSS is actually about and how the tools relate to what is described. They all seem quite complex and difficult to get started or require very special data analysis expertise. Maybe adding pre-requirements to usem them? 68 | * Creating production environments (security, set up backups, durability), updating, migrating to newer components (OpenSearch). 69 | * I would really like some lighter weight ways to play with the metrics. I was excited by the prospect of Jupyter notebooks but then it turned out those also rely on Augur which could be fine but then some better instruction on getting just the Augur db running would be good. A simplified Augur install that is just the bare minimum for working with a data snapshot would be nice. 70 | * From an OSPO perspective, I feel that OpenSSF Scorecard and Open Source Review Toolkit are quite valuable tools and resources ("easy" to set up and get a report). It would be great to see CHAOSS tools at the same level of maturity and value, 71 | * Overwhelming data. 72 | * No CHAOSS tools make it easy to compare a large number of repos. 73 | 74 | **Category: Technical: Installation (10 Quotations)** 75 | * Both GrimoireLab and Augur don't really provide a UX that is fit for a product. Rather they are fairly complicated back-ends that require a significant amount of configuration to set up. For a well funded OSPO that has a clear list of the data sources they want to track, this may be useful, but for something like [REDACTED], this means that many of our needs are not met. In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. This requires a certain kind of UX as well as better integration with other data sources and services that we provide. 
Particularly with GrimoireLab we found that building on top of the current architecture was fairly difficult. 76 | * Documentation seems really good but there are a LOT of steps - I have such a limited amount of time that by the time I read through the documents to remember where I left off last time, I've already run out of time to do any experimentation. 77 | * Overall I find it complex to understand what CHAOSS is actually about and how the tools relate to what is described. They all seem quite complex and difficult to get started or require very special data analysis expertise. Maybe adding pre-requirements to usem them? 78 | * Creating production environments (security, set up backups, durability), updating, migrating to newer components (OpenSearch). 79 | * Getting either piece of software running locally proved impossible for me :) 80 | * I worked with Cauldron.io and thought it was a great tool but could not get GL running locally so my experimenting was limited to what was offered there. 81 | * From an OSPO perspective, I feel that OpenSSF Scorecard and Open Source Review Toolkit are quite valuable tools and resources ("easy" to set up and get a report). It would be great to see CHAOSS tools at the same level of maturity and value, 82 | * Bitergia tools are quite complicated and take a lot of time to set up properly. 83 | * self hosting is becoming really difficult, esp for remote only orgs 84 | * Docker compose 85 | 86 | **Category: Technical: Reliability (1 Quotation)** 87 | * This was more of a product issue (when I used Bitergia), but a few times we had infrastructure issues so that an up-to-date data wasn't available for an extended period. For corporate users (especially in technology companies), the main reason why they go with an outside vendor is so that they don't need to manage infrastructure/software. 
88 | 89 | **Category: Technical: Relating to SaaS or need for SaaS solutions (3 Quotations)** 90 | * This was more of a product issue (when I used Bitergia), but a few times we had infrastructure issues so that an up-to-date data wasn't available for an extended period. For corporate users (especially in technology companies), the main reason why they go with an outside vendor is so that they don't need to manage infrastructure/software. 91 | * I worked with Cauldron.io and thought it was a great tool but could not get GL running locally so my experimenting was limited to what was offered there. As a [REDACTED] I don't really have a budget to pay for metrics as a service just to experiment or to try and get other people involved, so I greatly appreciated the cauldron service. 92 | * self hosting is becoming really difficult, esp for remote only orgs 93 | 94 | -------------------------------------------------------------------------------- /challenges_survey/draft_interpretations/README.md: -------------------------------------------------------------------------------- 1 | # Draft Interpretations 2 | 3 | If you would like to share your interpretation of the survey data, please create a PR and place your files in this directory. If you have multiple files, please create a subdirectory. 
4 | -------------------------------------------------------------------------------- /challenges_survey/draft_interpretations/dawn_analysis/Challenges_Survey_2023.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/challenges_survey/draft_interpretations/dawn_analysis/Challenges_Survey_2023.pdf -------------------------------------------------------------------------------- /challenges_survey/draft_interpretations/dawn_analysis/challenge_comment_categories.md: -------------------------------------------------------------------------------- 1 | This document has responses grouped into categories from the question, "What other challenges have you faced that weren’t in the above list or what else would you like to see us improve?" 2 | 3 | Many comments fall into multiple categories, so you’ll see the same content multiple times. The full text for the responses can be found in the [repo](https://github.com/chaoss/wg-data-science/tree/main/challenges_survey/raw_data) 4 | 5 | **Category: Taking action on data / generating insights from the data (8 Quotations)** 6 | * ability to see and understand what other people are doing in real situation so I can leverage that work, and compare with my own 7 | * Any new metric should have use cases associated where the usefulness of the metric becomes evident 8 | * Overwhelming data. 9 | * No CHAOSS tools make it easy to compare a large number of repos. 10 | * Many CHAOSS metrics aren't quantitative, so evaluating them requires manual examination. CHAOSS metrics don't cover marketing metrics like social media mentions. 11 | * The most challenging part is to communicate them to C level. 
12 | * Measuring the health and success of community (marketing) activities 13 | * In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. This requires a certain kind of UX as well as better integration with other data sources and services that we provide. 14 | 15 | **Category: CHAOSS Processes (4 Quotations)** 16 | * It's hard to understand what is "official" CHAOSS software (like what is the relationship to compass?) or how things move from the metrics/models into the software. 17 | * There also Augur meetings on the calendar but I'm not sure if they are actually happening, and I'm not sure if they are just for people doing development work on Augur, as opposed to users. 18 | * The focus on OSPOs was a surprise to me as I got more involved in the project, it's understandable but makes it more difficult for people not in an OSPO to be seen as an audience for metrics/software, which is unfortunate because I think there is still a lot of value in CHAOSS metrics for people who are not part of an official OSPO. 19 | * Overall I find it complex to understand what CHAOSS is actually about and how the tools relate to what is described 20 | 21 | **Category: Metric Definitions (3 Quotations)** 22 | * Any new metric should have use cases associated where the usefulness of the metric becomes evident 23 | * Many CHAOSS metrics aren't quantitative, so evaluating them requires manual examination. CHAOSS metrics don't cover marketing metrics like social media mentions. 24 | * Issues and PR backlog 25 | 26 | **Category: Software: Augur (3 Quotations)** 27 | * Both GrimoireLab and Augur don't really provide a UX that is fit for a product. Rather they are fairly complicated back-ends that require a significant amount of configuration to set up. 
For a well funded OSPO that has a clear list of the data sources they want to track, this may be useful, but for something like [REDACTED], this means that many of our needs are not met. In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. This requires a certain kind of UX as well as better integration with other data sources and services that we provide. Particularly with GrimoireLab we found that building on top of the current architecture was fairly difficult. 28 | * I was excited by the prospect of Jupyter notebooks but then it turned out those also rely on Augur which could be fine but then some better instruction on getting just the Augur db running would be good. A simplified Augur install that is just the bare minimum for working with a data snapshot would be nice. Multiple instances of documentation also make working with Augur hard. There also Augur meetings on the calendar but I'm not sure if they are actually happening, and I'm not sure if they are just for people doing development work on Augur, as opposed to users. 29 | * We find Augur easy to install and use. 30 | 31 | **Category: Software: Contributing (2 Quotations)** 32 | * Contributing has itself been a challenge, since documentation is woefully incomplete and (at times) inaccurate 33 | * There also Augur meetings on the calendar but I'm not sure if they are actually happening, and I'm not sure if they are just for people doing development work on Augur, as opposed to users. 34 | 35 | **Category: Software: GrimoireLab (9 Quotations)** 36 | * More clarity/documentation in data model (for comparison against other data sources); OpenSearch backend is incompatible with other internal tooling so our implementation is a bit of an island :/ 37 | * Both GrimoireLab and Augur don't really provide a UX that is fit for a product. 
Rather they are fairly complicated back-ends that require a significant amount of configuration to set up. For a well funded OSPO that has a clear list of the data sources they want to track, this may be useful, but for something like [REDACTED], this means that many of our needs are not met. In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. This requires a certain kind of UX as well as better integration with other data sources and services that we provide. Particularly with GrimoireLab we found that building on top of the current architecture was fairly difficult. 38 | * Integrating grimorelab with custom dashboards and frontend/backend software seemed tricky as far as ive heard from the team that was responsible for that. I dont know the details though, sounded like there was some architectural tech debt (not sure what that means) that made it hard to for the developers to integrate. 39 | * This was more of a product issue (when I used Bitergia), but a few times we had infrastructure issues so that an up-to-date data wasn't available for an extended period. For corporate users (especially in technology companies), the main reason why they go with an outside vendor is so that they don't need to manage infrastructure/software. 40 | * Creating production environments (security, set up backups, durability), updating, migrating to newer components (OpenSearch). 41 | * I worked with Cauldron.io and thought it was a great tool but could not get GL running locally so my experimenting was limited to what was offered there. As a [REDACTED] I don't really have a budget to pay for metrics as a service just to experiment or to try and get other people involved, so I greatly appreciated the cauldron service. 
42 | * grimoire sigils should support opensearch / pivot away from kibiter asap 43 | * Bitergia tools are quite complicated and take a lot of time to set up properly. 44 | * Grimoirelab did not work well for the size of our OSPO. 45 | 46 | **Category: Software: Software / Metrics Relationship (1 Quotation)** 47 | * Understanding the relationship between the software and the metrics can be difficult, things aren't always named the same and the methods of calculation are not always transparent in the software so you can't tell if it is really doing what you think it is. 48 | 49 | **Category: Technical: Compatibility / Tech Stack (7 Quotations)** 50 | * More clarity/documentation in data model (for comparison against other data sources); OpenSearch backend is incompatible with other internal tooling so our implementation is a bit of an island :/ 51 | * From an OSPO perspective, I feel that OpenSSF Scorecard and Open Source Review Toolkit are quite valuable tools and resources ("easy" to set up and get a report). It would be great to see CHAOSS tools at the same level of maturity and value, or even integration with such tools, to get more adoption in the industry. On the other hand, the diversity of tools and setups related to open source development and contribution makes it so hard to find "the right" tools to 52 | * Both GrimoireLab and Augur don't really provide a UX that is fit for a product. Rather they are fairly complicated back-ends that require a significant amount of configuration to set up. For a well funded OSPO that has a clear list of the data sources they want to track, this may be useful, but for something like [REDACTED], this means that many of our needs are not met. In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. This requires a certain kind of UX as well as better integration with other data sources and services that we provide. 
Particularly with GrimoireLab we found that building on top of the current architecture was fairly difficult. 53 | * Integrating grimorelab with custom dashboards and frontend/backend software seemed tricky as far as ive heard from the team that was responsible for that. I dont know the details though, sounded like there was some architectural tech debt (not sure what that means) that made it hard to for the developers to integrate. 54 | * Creating production environments (security, set up backups, durability), updating, migrating to newer components (OpenSearch). 55 | * grimoire sigils should support opensearch / pivot away from kibiter asap 56 | * I was excited by the prospect of Jupyter notebooks but then it turned out those also rely on Augur which could be fine but then some better instruction on getting just the Augur db running would be good. 57 | 58 | **Category: Technical: Documentation (3 Quotations)** 59 | * Contributing has itself been a challenge, since documentation is woefully incomplete and (at times) inaccurate 60 | * Documentation seems really good but there are a LOT of steps - I have such a limited amount of time that by the time I read through the documents to remember where I left off last time, I've already run out of time to do any experimentation. 61 | * Multiple instances of documentation also make working with Augur hard. 62 | 63 | **Category: Technical: Ease of Use / UX (9 Quotations)** 64 | * Both GrimoireLab and Augur don't really provide a UX that is fit for a product. Rather they are fairly complicated back-ends that require a significant amount of configuration to set up. For a well funded OSPO that has a clear list of the data sources they want to track, this may be useful, but for something like [REDACTED], this means that many of our needs are not met. In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. 
This requires a certain kind of UX as well as better integration with other data sources and services that we provide. Particularly with GrimoireLab we found that building on top of the current architecture was fairly difficult. 65 | * Documentation seems really good but there are a LOT of steps - I have such a limited amount of time that by the time I read through the documents to remember where I left off last time, I've already run out of time to do any experimentation. 66 | * Estimating how much time and effort using tools would take, compared to ad hoc methods of assessing similar questions. 67 | * Overall I find it complex to understand what CHAOSS is actually about and how the tools relate to what is described. They all seem quite complex and difficult to get started or require very special data analysis expertise. Maybe adding pre-requirements to usem them? 68 | * Creating production environments (security, set up backups, durability), updating, migrating to newer components (OpenSearch). 69 | * I would really like some lighter weight ways to play with the metrics. I was excited by the prospect of Jupyter notebooks but then it turned out those also rely on Augur which could be fine but then some better instruction on getting just the Augur db running would be good. A simplified Augur install that is just the bare minimum for working with a data snapshot would be nice. 70 | * From an OSPO perspective, I feel that OpenSSF Scorecard and Open Source Review Toolkit are quite valuable tools and resources ("easy" to set up and get a report). It would be great to see CHAOSS tools at the same level of maturity and value, 71 | * Overwhelming data. 72 | * No CHAOSS tools make it easy to compare a large number of repos. 73 | 74 | **Category: Technical: Installation (10 Quotations)** 75 | * Both GrimoireLab and Augur don't really provide a UX that is fit for a product. 
Rather they are fairly complicated back-ends that require a significant amount of configuration to set up. For a well funded OSPO that has a clear list of the data sources they want to track, this may be useful, but for something like [REDACTED], this means that many of our needs are not met. In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. This requires a certain kind of UX as well as better integration with other data sources and services that we provide. Particularly with GrimoireLab we found that building on top of the current architecture was fairly difficult. 76 | * Documentation seems really good but there are a LOT of steps - I have such a limited amount of time that by the time I read through the documents to remember where I left off last time, I've already run out of time to do any experimentation. 77 | * Overall I find it complex to understand what CHAOSS is actually about and how the tools relate to what is described. They all seem quite complex and difficult to get started or require very special data analysis expertise. Maybe adding pre-requirements to usem them? 78 | * Creating production environments (security, set up backups, durability), updating, migrating to newer components (OpenSearch). 79 | * Getting either piece of software running locally proved impossible for me :) 80 | * I worked with Cauldron.io and thought it was a great tool but could not get GL running locally so my experimenting was limited to what was offered there. 81 | * From an OSPO perspective, I feel that OpenSSF Scorecard and Open Source Review Toolkit are quite valuable tools and resources ("easy" to set up and get a report). It would be great to see CHAOSS tools at the same level of maturity and value, 82 | * Bitergia tools are quite complicated and take a lot of time to set up properly. 
83 | * self hosting is becoming really difficult, esp for remote only orgs 84 | * Docker compose 85 | 86 | **Category: Technical: Reliability (1 Quotation)** 87 | * This was more of a product issue (when I used Bitergia), but a few times we had infrastructure issues so that an up-to-date data wasn't available for an extended period. For corporate users (especially in technology companies), the main reason why they go with an outside vendor is so that they don't need to manage infrastructure/software. 88 | 89 | **Category: Technical: Relating to SaaS or need for SaaS solutions (3 Quotations)** 90 | * This was more of a product issue (when I used Bitergia), but a few times we had infrastructure issues so that an up-to-date data wasn't available for an extended period. For corporate users (especially in technology companies), the main reason why they go with an outside vendor is so that they don't need to manage infrastructure/software. 91 | * I worked with Cauldron.io and thought it was a great tool but could not get GL running locally so my experimenting was limited to what was offered there. As a [REDACTED] I don't really have a budget to pay for metrics as a service just to experiment or to try and get other people involved, so I greatly appreciated the cauldron service. 92 | * self hosting is becoming really difficult, esp for remote only orgs 93 | 94 | -------------------------------------------------------------------------------- /challenges_survey/raw_data/Clean_CHAOSS_Understanding_Challenges_Survey.csv: -------------------------------------------------------------------------------- 1 | What type of organization do you work for?,Do you work in an Open Source Program Office (OSPO) or similar open source team?,Which of these best describes your role or position?,"Have you contributed to the CHAOSS project (including non-code contributions, e.g., metrics definitions, documentation, presentations, blog posts, meeting attendance)? 
",Which of these have you used or tried to use?,How long have you been using CHAOSS tools or custom code that implements CHAOSS metrics?,"Rank order the challenges you have faced using CHAOSS tools. If any don’t apply 2 | to you, select the option labeled ‘NA’ (Not Applicable). 3 | 1 is most challenging and 7 is least challenging. [Installing / configuring software]","Rank order the challenges you have faced using CHAOSS tools. If any don’t apply 4 | to you, select the option labeled ‘NA’ (Not Applicable). 5 | 1 is most challenging and 7 is least challenging. [Maintaining software over time]","Rank order the challenges you have faced using CHAOSS tools. If any don’t apply 6 | to you, select the option labeled ‘NA’ (Not Applicable). 7 | 1 is most challenging and 7 is least challenging. [Cleaning up the data (e.g., merge duplicate contributors, company affiliation)]","Rank order the challenges you have faced using CHAOSS tools. If any don’t apply 8 | to you, select the option labeled ‘NA’ (Not Applicable). 9 | 1 is most challenging and 7 is least challenging. [Finding the data / metrics you want to use]","Rank order the challenges you have faced using CHAOSS tools. If any don’t apply 10 | to you, select the option labeled ‘NA’ (Not Applicable). 11 | 1 is most challenging and 7 is least challenging. [Drawing meaningful insights out of the data]","Rank order the challenges you have faced using CHAOSS tools. If any don’t apply 12 | to you, select the option labeled ‘NA’ (Not Applicable). 13 | 1 is most challenging and 7 is least challenging. [Communicating meaningful insights to others, including executives]","Rank order the challenges you have faced using CHAOSS tools. If any don’t apply 14 | to you, select the option labeled ‘NA’ (Not Applicable). 15 | 1 is most challenging and 7 is least challenging. 
[Getting others within your company / community to use the software]" 16 | University or other academic institution,Yes,"Development or operations focused (e.g., developer, sys admin)",Currently contributing to the CHAOSS project,Augur,Less than 1 year,3,2,4,1,NA,NA,NA 17 | For-profit company,Yes,"Leadership (e.g., primarily manage other people)","Past contributor, but no longer contributing",GrimoireLab (including users of Bitergia’s platform and Cauldron),More than 3 years,1,2,3,7,4,6,2 18 | None of the above,No,consultancy,"Past contributor, but no longer contributing",GrimoireLab (including users of Bitergia’s platform and Cauldron),More than 3 years,5,NA,7,3,2,1,4 19 | University or other academic institution,Yes,"Leadership (e.g., primarily manage other people)",Currently contributing to the CHAOSS project,"Augur, GrimoireLab (including users of Bitergia’s platform and Cauldron)",2 - 3 years,1,2,3,5,4,NA,NA 20 | For-profit company,Yes,"Development or operations focused (e.g., developer, sys admin)",Currently contributing to the CHAOSS project,"Augur, GrimoireLab (including users of Bitergia’s platform and Cauldron)",1 - 2 years,6,5,1,4,3,7,NA 21 | For-profit company,Yes,"Leadership (e.g., primarily manage other people)",Currently contributing to the CHAOSS project,Augur,1 - 2 years,2,1,1,1,1,3,4 22 | University or other academic institution,No,"Community focused (e.g., community manager)",Currently contributing to the CHAOSS project,"Augur, GrimoireLab (including users of Bitergia’s platform and Cauldron)",Less than 1 year,1,NA,4,2,7,7,NA 23 | None of the above,No,"Development or operations focused (e.g., developer, sys admin)",Never contributed to CHAOSS,GrimoireLab (including users of Bitergia’s platform and Cauldron),I haven’t used CHAOSS tools,NA,NA,4,4,5,NA,NA 24 | For-profit company,Yes,"Leadership (e.g., primarily manage other people)",Currently contributing to the CHAOSS project,"Augur, GrimoireLab (including users of Bitergia’s platform and 
Cauldron)",More than 3 years,2,6,5,7,7,7,5 25 | For-profit company,Yes,"Community focused (e.g., community manager)","Past contributor, but no longer contributing",GrimoireLab (including users of Bitergia’s platform and Cauldron),2 - 3 years,1,NA,2,1,2,2,2 26 | None of the above,No,Consultant,"Past contributor, but no longer contributing","Augur, GrimoireLab (including users of Bitergia’s platform and Cauldron)",More than 3 years,1,7,6,5,3,2,4 27 | For-profit company,Yes,"OSPO Lead, direct activities and liaison with CISO/Eng/Legal",Never contributed to CHAOSS,Commercial tools with project Health metrics like Snyk,I haven’t used CHAOSS tools,3,2,NA,NA,NA,NA,2 28 | For-profit company,No,"Community focused (e.g., community manager)",Currently contributing to the CHAOSS project,GrimoireLab (including users of Bitergia’s platform and Cauldron),2 - 3 years,NA,NA,1,3,1,3,4 29 | For-profit company,Yes,"Development or operations focused (e.g., developer, sys admin)","Past contributor, but no longer contributing","Augur, GrimoireLab (including users of Bitergia’s platform and Cauldron)",1 - 2 years,5,NA,NA,2,1,1,1 30 | For-profit company,Yes,Program Manager,Currently contributing to the CHAOSS project,"Augur, GrimoireLab (including users of Bitergia’s platform and Cauldron), DEI.md",1 - 2 years,3,NA,NA,3,3,3,NA 31 | Nonprofit,Yes,"Community focused (e.g., community manager)",Currently contributing to the CHAOSS project,"Augur, GrimoireLab (including users of Bitergia’s platform and Cauldron)",Less than 1 year,NA,NA,NA,6,6,6,2 32 | None of the above,Yes,"Development or operations focused (e.g., developer, sys admin)",Never contributed to CHAOSS,GrimoireLab (including users of Bitergia’s platform and Cauldron),More than 3 years,6,3,1,4,2,1,7 33 | For-profit company,Yes,"Community focused (e.g., community manager)",Currently contributing to the CHAOSS project,"MergeStat, CNCF DevStats",I haven’t used CHAOSS tools,4,6,5,1,2,3,7 34 | For-profit company,Yes,"Data focused (e.g., 
data analysis, data science)",Currently contributing to the CHAOSS project,GrimoireLab (including users of Bitergia’s platform and Cauldron),2 - 3 years,7,7,7,7,7,7,3 35 | Government,Yes,"Community focused (e.g., community manager)",Never contributed to CHAOSS,"Don’t know, not sure, or haven’t used any tools",I haven’t used CHAOSS tools,NA,NA,NA,NA,NA,NA,NA 36 | Nonprofit,Yes,"Community focused (e.g., community manager)",Never contributed to CHAOSS,Augur,Less than 1 year,NA,NA,NA,3,3,NA,NA 37 | For-profit company,Yes,the first three,Currently contributing to the CHAOSS project,"Augur, GrimoireLab (including users of Bitergia’s platform and Cauldron)",More than 3 years,7,7,7,7,3,3,7 38 | For-profit company,Yes,"Data focused (e.g., data analysis, data science)",Currently contributing to the CHAOSS project,GrimoireLab (including users of Bitergia’s platform and Cauldron),2 - 3 years,NA,NA,NA,2,2,2,1 39 | For-profit company,No,"Data focused (e.g., data analysis, data science)",Currently contributing to the CHAOSS project,GrimoireLab (including users of Bitergia’s platform and Cauldron),2 - 3 years,3,6,2,5,6,2,1 40 | For-profit company,Yes,"Community focused (e.g., community manager)",Never contributed to CHAOSS,"Don’t know, not sure, or haven’t used any tools",I haven’t used CHAOSS tools,2,3,1,1,1,1,2 41 | For-profit company,No,"Community focused (e.g., community manager)",Never contributed to CHAOSS,GrimoireLab (including users of Bitergia’s platform and Cauldron),More than 3 years,1,4,3,3,6,6,2 42 | For-profit company,Yes,"Community focused (e.g., community manager)",Currently contributing to the CHAOSS project,"Don’t know, not sure, or haven’t used any tools",I haven’t used CHAOSS tools,7,NA,NA,NA,NA,NA,7 43 | For-profit company,Yes,"Leadership (e.g., primarily manage other people)",Currently contributing to the CHAOSS project,Augur,1 - 2 years,7,7,5,7,6,7,6 44 | For-profit company,Yes,"Leadership (e.g., primarily manage other people)",Never contributed to 
CHAOSS,"Augur, GrimoireLab (including users of Bitergia’s platform and Cauldron)",Less than 1 year,2,3,6,4,4,6,4 45 | For-profit company,Yes,"Leadership (e.g., primarily manage other people)",Never contributed to CHAOSS,"Augur, GrimoireLab (including users of Bitergia’s platform and Cauldron)",1 - 2 years,5,5,5,5,5,5,5 46 | None of the above,No,"Data focused (e.g., data analysis, data science)",Currently contributing to the CHAOSS project,"Augur, GrimoireLab (including users of Bitergia’s platform and Cauldron)",More than 3 years,1,2,3,6,5,7,4 47 | -------------------------------------------------------------------------------- /challenges_survey/raw_data/other_challenges_freeform_redacted.txt: -------------------------------------------------------------------------------- 1 | What other challenges have you faced that weren’t in the above list or what else would you like to see us improve? 2 | 3 | Getting either piece of software running locally proved impossible for me :) It's hard to understand what is "official" CHAOSS software (like what is the relationship to compass?) or how things move from the metrics/models into the software. Understanding the relationship between the software and the metrics can be difficult, things aren't always named the same and the methods of calculation are not always transparent in the software so you can't tell if it is really doing what you think it is. I would really like some lighter weight ways to play with the metrics. I was excited by the prospect of Jupyter notebooks but then it turned out those also rely on Augur which could be fine but then some better instruction on getting just the Augur db running would be good. A simplified Augur install that is just the bare minimum for working with a data snapshot would be nice. Multiple instances of documentation also make working with Augur hard. 
There also Augur meetings on the calendar but I'm not sure if they are actually happening, and I'm not sure if they are just for people doing development work on Augur, as opposed to users. I worked with Cauldron.io and thought it was a great tool but could not get GL running locally so my experimenting was limited to what was offered there. As a [REDACTED] I don't really have a budget to pay for metrics as a service just to experiment or to try and get other people involved, so I greatly appreciated the cauldron service. The focus on OSPOs was a surprise to me as I got more involved in the project, it's understandable but makes it more difficult for people not in an OSPO to be seen as an audience for metrics/software, which is unfortunate because I think there is still a lot of value in CHAOSS metrics for people who are not part of an official OSPO. 4 | 5 | More clarity/documentation in data model (for comparison against other data sources); OpenSearch backend is incompatible with other internal tooling so our implementation is a bit of an island :/ 6 | 7 | Contributing has itself been a challenge, since documentation is woefully incomplete and (at times) inaccurate 8 | 9 | From an OSPO perspective, I feel that OpenSSF Scorecard and Open Source Review Toolkit are quite valuable tools and resources ("easy" to set up and get a report). It would be great to see CHAOSS tools at the same level of maturity and value, or even integration with such tools, to get more adoption in the industry. On the other hand, the diversity of tools and setups related to open source development and contribution makes it so hard to find "the right" tools to get valuable information about people, activity, and "performance". 10 | 11 | Any new metric should have use cases associated where the usefulness of the metric becomes evident 12 | 13 | Both GrimoireLab and Augur don't really provide a UX that is fit for a product. 
Rather they are fairly complicated back-ends that require a significant amount of configuration to set up. For a well funded OSPO that has a clear list of the data sources they want to track, this may be useful, but for something like [REDACTED], this means that many of our needs are not met. In our case as [REDACTED], we need to provide an experience for [REDACTED] that encourages them to track the community health of projects they use and participate in. This requires a certain kind of UX as well as better integration with other data sources and services that we provide. Particularly with GrimoireLab we found that building on top of the current architecture was fairly difficult. 14 | 15 | grimoire sigils should support opensearch / pivot away from kibiter asap 16 | 17 | Overwhelming data. 18 | 19 | Integrating grimorelab with custom dashboards and frontend/backend software seemed tricky as far as ive heard from the team that was responsible for that. I dont know the details though, sounded like there was some architectural tech debt (not sure what that means) that made it hard to for the developers to integrate. 20 | 21 | Bitergia tools are quite complicated and take a lot of time to set up properly. 22 | 23 | self hosting is becoming really difficult, esp for remote only orgs 24 | 25 | This was more of a product issue (when I used Bitergia), but a few times we had infrastructure issues so that an up-to-date data wasn't available for an extended period. For corporate users (especially in technology companies), the main reason why they go with an outside vendor is so that they don't need to manage infrastructure/software. 26 | 27 | No CHAOSS tools make it easy to compare a large number of repos. Many CHAOSS metrics aren't quantitative, so evaluating them requires manual examination. CHAOSS metrics don't cover marketing metrics like social media mentions. 
28 | 29 | Documentation seems really good but there are a LOT of steps - I have such a limited amount of time that by the time I read through the documents to remember where I left off last time, I've already run out of time to do any experimentation. 30 | 31 | The most challenging part is to communicate them to C level. 32 | 33 | Estimating how much time and effort using tools would take, compared to ad hoc methods of assessing similar questions. 34 | 35 | ability to see and understand what other people are doing in real situation so I can leverage that work, and compare with my own 36 | 37 | Issues and PR backlog 38 | 39 | Measuring the health and success of community (marketing) activities 40 | 41 | Overall I find it complex to understand what CHAOSS is actually about and how the tools relate to what is described. They all seem quite complex and difficult to get started or require very special data analysis expertise. Maybe adding pre-requirements to usem them? 42 | 43 | Docker compose 44 | 45 | Creating production environments (security, set up backups, durability), updating, migrating to newer components (OpenSearch). 46 | 47 | Grimoirelab did not work well for the size of our OSPO. We find Augur easy to install and use. 48 | -------------------------------------------------------------------------------- /challenges_survey/raw_data/strengths_freeform_redacted.txt: -------------------------------------------------------------------------------- 1 | What do you see as CHAOSS project strengths (what do you love about CHAOSS)? 2 | 3 | The depth of expertise in the community 4 | 5 | CHAOSS is exceptional at community health. The community generating community health metrics is unusually healthy. 6 | It is a model for good practices. 7 | 8 | Being committed to open-source is becoming frustratingly less common, so I'm glad to see an organization insisting that their work will belong to the people, and not to any hypothetical corporate backers. 
9 | 10 | The passion and commitment of the CHAOSS community to provide a set of valuable metrics to describe an open source projects. 11 | 12 | Agreements over definitions 13 | 14 | CHAOSS has an active and welcoming community. The projects are interesting and useful. 15 | 16 | ease of community involvement and insight into metrics 17 | 18 | Standard metrics, and good tools 19 | 20 | The project is open and welcoming and is doing really important work! I really appreciate the extent to which newbies are able to participate. 21 | 22 | Seems interesting. Havent interacted with them much though, but it seems like they have a good mission/goal. If CHAOSS is at all related to OSI, maybe this thing a coworker sent me recently could be applicable? https://yakshav.es/non-thoughts-on-the-osi/ 23 | 24 | - Creating a welcoming space for open source project leaders to connect and discuss community health indicators. - Educating community members on the existing CHAOSS metrics and metrics models and empowering them to develop new ones to address their needs. 25 | 26 | The existence of a unified, standardized set of metrics is invaluable for determining community health and value. 27 | 28 | Great mission, great educational outreach, great conference presentations, genuinely nice people. 29 | 30 | open source, community based, non-commercial 31 | 32 | Well respected/regarded in open source communities. Good community culture. 33 | 34 | The promise of getting metadata about the health of my open source. 35 | 36 | Great community, solid ideas and great tooling available 37 | 38 | The community! 39 | 40 | News Ideas, and family like. 41 | 42 | Only place I know of where these OS community metrics exist. 43 | 44 | great community, lots of smart people, diverse, energetic, welcoming 45 | 46 | Community, peer discussions, implementation stories 47 | 48 | Such a welcoming and loving community 49 | 50 | Project health reporting. 
51 | 52 | Community-driven with real-world examples 53 | 54 | It is always improving and focusing on finding meaningful metrics to drive communities' health forward for a more equitable open source ecosystem (this is how I see CHAOSS <3) 55 | 56 | Augur really makes metrics highly visible on a large scale like I need 57 | 58 | Once set up, grimoirelab is a veeery impressive tool. I'm glad it's hosted under CHAOSS. 59 | 60 | Standards 61 | 62 | Such an amazing community of lovely people 63 | -------------------------------------------------------------------------------- /challenges_survey/raw_data/understanding_challenges_graphs.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/challenges_survey/raw_data/understanding_challenges_graphs.pdf -------------------------------------------------------------------------------- /data-ethics-statement.md: -------------------------------------------------------------------------------- 1 | _The usage and dissemination of health metrics may lead to privacy violations. Organizations may be exposed to risks. These risks may flow from compliance with the GDPR in the EU, with state law in the US, or with other laws. There may also be contractual risks flowing from terms of service for data providers such as GitHub and GitLab. The usage of metrics must be examined for risk and potential data ethics problems. 
Please see the [CHAOSS Data Ethics document](https://github.com/chaoss/community/blob/main/data-use-statement.md) for additional guidance._ 2 | -------------------------------------------------------------------------------- /dataset/README.md: -------------------------------------------------------------------------------- 1 | # CHAOSS Data Science WG Datasets 2 | 3 | This is where we can store datasets for [projects](https://github.com/chaoss/wg-data-science/issues?q=is%3Aissue+is%3Aopen+label%3Aproject) being worked on within the Data Science Working Group. 4 | 5 | If you would like to contribute a dataset, please create a new subfolder containing your data, and please follow our [Contribution guidelines](https://github.com/chaoss/wg-data-science/blob/main/CONTRIBUTING.md) including DCO sign-off for all commits. 6 | 7 | So far, we have one dataset, used to analyze projects making [license changes](license-changes) to their open source projects. 8 | 9 | -------------------------------------------------------------------------------- /dataset/archive/README.md: -------------------------------------------------------------------------------- 1 | # Archival of Open Source Projects (aka Sudden Archival) 2 | 3 | Archival is often used as an indicator that a project is no longer being maintained and will not be updated (including security updates). We know that a lot of projects are archived when they are abandoned and people stop working on them, but what about projects that are archived for other reasons?
4 | 5 | More details about this project: 6 | * Tracked in [Issue #45](https://github.com/chaoss/wg-data-science/issues/45) 7 | * [Project Planning document](https://docs.google.com/document/d/18audPynKQg_n7ZdspeUGtPSF7cMzc9O7CVcrEdJAhtk/edit?usp=sharing) 8 | * [WIP datasets](https://github.com/chaoss/wg-data-science/tree/main/dataset/archive) 9 | * data-files/archive_repos.csv: contains 733 repos, which are all archived GitHub repos with > 1000 stars, > 100 forks, and an open source license. 10 | * archived-projects.py: the script containing the GraphQL API query to generate data-files/archive_repos.csv 11 | 12 | Current status: 13 | * Initial [WIP datasets](https://github.com/chaoss/wg-data-science/tree/main/dataset/archive) being gathered. 14 | * We'll start this project in June during the [Data Science Hackathon](https://chaoss.community/chaoss-data-science-hackathon-2025/), so the current work is to get the initial dataset organized and into a format that will be useful in a hackathon setting. 15 | 16 | -------------------------------------------------------------------------------- /dataset/archive/archived-projects.py: -------------------------------------------------------------------------------- 1 | # Copyright Dawn M. Foster 2 | # SPDX-License-Identifier: MIT 3 | 4 | """This script collects data about the archived repositories on GitHub with the most 5 | stars / forks using the GitHub GraphQL Search API. Results with no license or 6 | license = Other are discarded to store only results for open source repositories.
7 | 8 | Inputs via command line arguments - see below or use -h for a list: 9 | * file containing a GitHub access token 10 | * threshold for stars (any project with more than that number of stars) - default to 1000 11 | * threshold for forks (any project with more than that number of forks) - default to 100 12 | 13 | As of May 22, 2025, using the default thresholds collects data on 1001 repositories, 14 | and stores 733 repositories after filtering by license as described above. 15 | 16 | Outputs: 17 | * GitHub API response code (success is "") - printed to the screen 18 | * csv file with the data stored as data-files/archive_repos.csv 19 | """ 20 | 21 | import sys 22 | import csv 23 | import argparse 24 | import requests 25 | import json 26 | 27 | # Read and store arguments from the command line 28 | parser = argparse.ArgumentParser() 29 | 30 | parser.add_argument("-t", "--token", dest="api_token_file", help="Filename of a file containing a GitHub Personal Access Token") 31 | parser.add_argument("-s", "--stars", dest="num_stars", help="Collect data on projects with more than this number of stars. Default is 1000", 32 | default=1000) 33 | parser.add_argument("-f", "--forks", dest="num_forks", help="Collect data on projects with more than this number of forks. Default is 100", 34 | default=100) 35 | 36 | args = parser.parse_args() 37 | 38 | api_token_file = args.api_token_file 39 | num_stars = args.num_stars 40 | num_forks = args.num_forks 41 | 42 | def make_query(after_cursor = None): 43 | """ This function contains the GraphQL query 44 | """ 45 | return """ 46 | query archived ($search_string: String!) { 47 | search( 48 | type:REPOSITORY, 49 | query:$search_string, 50 | first: 50 after:AFTER) { 51 | pageInfo { 52 | hasNextPage 53 | endCursor 54 | } 55 | repos: edges{ 56 | repo:node{ 57 | ...
on Repository { 58 | url 59 | homepageUrl 60 | shortDescriptionHTML 61 | isFork 62 | isInOrganization 63 | createdAt 64 | updatedAt 65 | archivedAt 66 | stargazerCount 67 | forkCount 68 | primaryLanguage{ 69 | name 70 | } 71 | latestRelease{ 72 | publishedAt 73 | } 74 | licenseInfo{ 75 | name 76 | } 77 | } 78 | } 79 | } 80 | } 81 | }""".replace( 82 | "AFTER", '"{}"'.format(after_cursor) if after_cursor else "null" 83 | ) 84 | 85 | # Read GitHub key from file 86 | try: 87 | with open(api_token_file, 'r') as kf: 88 | api_token = kf.readline().rstrip() # remove newline & trailing whitespace 89 | 90 | except: 91 | print("Error reading GH Key. This script depends on the existence of a file containing your GitHub API token. Exiting") 92 | sys.exit() 93 | 94 | # Set up the variables needed for the GraphQL query 95 | url = 'https://api.github.com/graphql' 96 | headers = {'Authorization': 'token %s' % api_token} 97 | search_string = "archived:True stars:>" + str(num_stars) + " forks:>" + str(num_forks) 98 | 99 | # Variable initialization 100 | results = [] 101 | has_next_page = True 102 | after_cursor = None 103 | 104 | # Run the GraphQL query for each page of results from the API 105 | while has_next_page: 106 | 107 | query = make_query(after_cursor) 108 | 109 | variables = {"search_string": search_string} 110 | 111 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 112 | print(r) # This prints the response so that you can see if the query fails with an error 113 | 114 | json_data = json.loads(r.text) 115 | 116 | results.append(json_data) 117 | 118 | has_next_page = json_data['data']['search']["pageInfo"]["hasNextPage"] 119 | 120 | after_cursor = json_data['data']['search']["pageInfo"]["endCursor"] 121 | 122 | # Create csv output file 123 | with open("data-files/archive_repos.csv", "w", newline="") as f: 124 | 125 | w = csv.DictWriter(f, results[0]['data']['search']['repos'][0]['repo'].keys()) 126 | w.writeheader() 127 | 128 | # 
Loops through the results list and writes the csv file by row. 129 | # This also unpacks the nested JSON structures to allow them to 130 | # be more readable in the csv file. 131 | for element in results: 132 | 133 | for repo_dict in element['data']['search']['repos']: 134 | try: 135 | repo_dict['repo']['latestRelease'] = repo_dict['repo']['latestRelease']['publishedAt'] 136 | except: 137 | repo_dict['repo']['latestRelease'] = None 138 | try: 139 | repo_dict['repo']['primaryLanguage'] = repo_dict['repo']['primaryLanguage']['name'] 140 | except: 141 | repo_dict['repo']['primaryLanguage'] = None 142 | try: 143 | repo_dict['repo']['licenseInfo'] = repo_dict['repo']['licenseInfo']['name'] 144 | except: 145 | repo_dict['repo']['licenseInfo'] = None 146 | 147 | if repo_dict['repo']['licenseInfo'] != None and repo_dict['repo']['licenseInfo'] != "Other": 148 | # this only writes repos with a specified license into the csv file 149 | w.writerow(repo_dict['repo']) -------------------------------------------------------------------------------- /dataset/foundation-stats/apacheURLtoTable.md: -------------------------------------------------------------------------------- 1 | # README: Apache URL to Table Data Processor 2 | 3 | ## **Overview** 4 | The `apache_url_to_table.py` script fetches, processes, and normalizes 5 | Apache project data from multiple sources. The goal is to create a 6 | structured dataset that allows researchers to analyze the transition of 7 | corporate projects into open-source foundations. 8 | 9 | ## **Features** 10 | - Pulls project data from **six different Apache Foundation sources**. 11 | - Normalizes data for consistency across all projects. 12 | - Saves structured data in **CSV and JSON** formats. 13 | - Includes an optional **force update** mode to fetch the latest data. 14 | - Prevents redundant downloads by using existing files when possible. 
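
The last two features amount to a simple "reuse unless forced" check. A minimal sketch of that decision is shown here; `should_fetch` is a hypothetical helper name used only for illustration and is not taken from the script, which also asks for confirmation before overwriting:

```python
import os


def should_fetch(output_paths, force_update=False):
    """Decide whether fresh data needs to be downloaded.

    Hypothetical helper: returns True when an update is forced or when
    any of the expected output files is missing.
    """
    if force_update:
        return True
    # Reuse existing data only if every expected output file is present.
    return not all(os.path.exists(path) for path in output_paths)
```

Calling `should_fetch([csv_path, json_path])` before downloading would skip the network request whenever both output files are already present.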
15 | 16 | ## **Data Sources** 17 | This script retrieves data from the following sources: 18 | - [Apache Projects Overview](https://projects.apache.org/) 19 | - [Apache Projects 20 | JSON](https://projects.apache.org/json/foundation/projects.json) 21 | - [Podlings 22 | JSON](https://projects.apache.org/json/foundation/podlings.json) 23 | - [Podlings History 24 | JSON](https://projects.apache.org/json/foundation/podlings-history.json) 25 | - [Retired Committees 26 | JSON](https://projects.apache.org/json/foundation/committees-retired.json) 27 | - [Repositories 28 | JSON](https://projects.apache.org/json/foundation/repositories.json) 29 | 30 | ## **Installation** 31 | Ensure Python is installed and install required dependencies: 32 | ```sh 33 | pip install pandas requests 34 | ``` 35 | 36 | ## **How to Run the Script** 37 | ### **Using Existing Data (if Available)** 38 | If the script has been run before, it will use previously downloaded 39 | files: 40 | ```sh 41 | python apache_url_to_table.py 42 | ``` 43 | 44 | ### **Forcing an Update (Fetching the Latest Data)** 45 | To ensure you get the most up-to-date project data, use: 46 | ```sh 47 | python apache_url_to_table.py --force-update 48 | ``` 49 | - This will **overwrite existing files**. 50 | - The script will **prompt you for confirmation** before replacing data. 51 | - Type **'yes'** and press **Enter** to proceed. 52 | - Any other input, including pressing **Enter** without typing anything, cancels the update. 53 | 54 | ## **Output Files** 55 | - **`structured_project_analysis.csv`** → CSV file containing the 56 | processed project data. 57 | - **`structured_project_cleaned.json`** → JSON file containing the 58 | structured project data. 59 | 60 | ## **Confirming the Data for Research** 61 | This script provides **a foundational dataset** for analyzing how 62 | corporate-owned open-source projects evolve once moved into foundations. 63 | **You can confirm the script's success by:** 64 | 1.
Checking if **`structured_project_analysis.csv`** contains structured 65 | data. 66 | 2. Opening **`structured_project_cleaned.json`** to see if all project 67 | details were captured. 68 | 3. Manually inspecting **Apache source URLs** in a browser to ensure they 69 | are still available. 70 | 71 | ## **What’s Missing & Next Steps** 72 | This dataset **still requires human research** to be complete. 73 | Recommended steps include: 74 | 75 | ### **1. Verifying Company Contributions** 76 | - Manually research which corporations originally contributed each 77 | project. 78 | - Compare corporate vs. community contributions over time. 79 | 80 | ### **2. Analyzing Governance Changes** 81 | - Investigate if project governance structures changed post-transition. 82 | 83 | ### **3. Improving Data Quality** 84 | - Identify missing or inconsistent project data. 85 | - Cross-reference project status with Apache’s live data. 86 | 87 | --- 88 | This script provides **a structured and repeatable method** for collecting 89 | project data across foundations, and should be revised with that intent. Have fun - 90 | this is an initial script and your edits could make you an author on an 91 | invaluable script for Open Source! 92 | -------------------------------------------------------------------------------- /dataset/foundation-stats/apache_url_to_table.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import pandas as pd 4 | import requests 5 | 6 | # Step 1: Define script directory and output file names 7 | # This ensures all files are stored in the same location as the script.
8 | SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__)) 9 | OUTPUT_CSV = os.path.join(SCRIPT_DIR, "structured_project_analysis.csv") # Clearer purpose 10 | OUTPUT_JSON = os.path.join(SCRIPT_DIR, "structured_project_cleaned.json") # More descriptive 11 | 12 | # Step 2: Define web scraping source URL 13 | # This is the source URL where we will fetch the latest project data. 14 | SOURCE_URL = "https://incubator.apache.org/projects.json" # Example data source 15 | 16 | # Step 3: Function to fetch data from the web 17 | def fetch_data_from_web(url): 18 | """ 19 | Fetch JSON data from a specified web URL. 20 | Handles errors related to network failures and invalid JSON responses. 21 | """ 22 | try: 23 | response = requests.get(url, timeout=10) # 10-second timeout 24 | response.raise_for_status() # Raise error for HTTP issues 25 | return response.json() 26 | except requests.exceptions.RequestException as e: 27 | print(f"Error fetching data from {url}: {e}") 28 | return [] 29 | except json.JSONDecodeError: 30 | print("Error: Failed to decode JSON from web response.") 31 | return [] 32 | 33 | # Step 4: Function to normalize and clean data 34 | def normalize_data(data): 35 | """ 36 | Standardizes JSON structure to ensure consistent fields across all records. 
37 | """ 38 | structured_list = [] 39 | for item in data: 40 | structured_list.append({ 41 | "Project ID": item.get("id", "Unknown"), 42 | "Project Name": item.get("name", "Unknown"), 43 | "Category": item.get("category", "Unknown"), 44 | "Description": item.get("description", "Unknown"), 45 | "Homepage URL": item.get("homepage", "Unknown"), 46 | "PMC": item.get("pmc", "Unknown"), 47 | "Podling": item.get("podling", False), 48 | "Start Date": item.get("started", "Unknown"), 49 | "Status": item.get("status", "Unknown") 50 | }) 51 | return structured_list 52 | 53 | # Step 5: Add user option to force update for most up-to-date data 54 | import argparse 55 | parser = argparse.ArgumentParser() 56 | parser.add_argument("--force-update", action="store_true", help="Force update by fetching the latest data from the web") 57 | args = parser.parse_args() 58 | 59 | # Step 6: Check if the output files already exist 60 | # If the files exist and --force-update is not set, we avoid re-downloading the data. 61 | if os.path.exists(OUTPUT_CSV) and os.path.exists(OUTPUT_JSON) and not args.force_update: 62 | print(f"Using existing files: {OUTPUT_CSV} and {OUTPUT_JSON}. To force an update, run with --force-update.") 63 | else: 64 | # Step 7: Confirm before overwriting existing files 65 | if args.force_update and os.path.exists(OUTPUT_CSV) and os.path.exists(OUTPUT_JSON): 66 | confirm = input("Warning: This will overwrite existing files. Do you want to proceed? (yes/no): ").strip().lower() 67 | if confirm != "yes": 68 | print("Update canceled. 
Using existing files.") 69 | exit() 70 | 71 | print("Fetching the most up-to-date data from the web.") 72 | # Fetch and process data 73 | data = fetch_data_from_web(SOURCE_URL) 74 | normalized_data = normalize_data(data) 75 | 76 | # Convert to Pandas DataFrame 77 | df = pd.DataFrame(normalized_data) 78 | 79 | # Save outputs 80 | df.to_csv(OUTPUT_CSV, index=False) 81 | with open(OUTPUT_JSON, "w", encoding="utf-8") as f: 82 | json.dump(normalized_data, f, indent=4) 83 | 84 | # Step 8: Notify user of successful completion 85 | print(f"Processing complete.\nSaved structured data to: {OUTPUT_CSV} and {OUTPUT_JSON}") 86 | -------------------------------------------------------------------------------- /dataset/foundation-stats/dataset/foundation-stats/structured_project_analysis.csv: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /dataset/foundation-stats/dataset/foundation-stats/structured_project_cleaned.json: -------------------------------------------------------------------------------- 1 | [] -------------------------------------------------------------------------------- /dataset/license-changes/README.md: -------------------------------------------------------------------------------- 1 | # License Change Dataset 2 | 3 | The idea behind this dataset is to use it as a starting point to see whether we can predict the likelihood that an open source project will change from an open source license to a non-open source or more restrictive license. The project itself is being tracked in this [Issue](https://github.com/chaoss/wg-data-science/issues/47). Note that we have a related dataset for forks (see below). 4 | 5 | Because we want to do some analysis on the repositories, any project without a GitHub repository has been excluded from the dataset (see section below for more details).
6 | 7 | ## The Dataset 8 | 9 | [license_changes.csv](license_changes.csv) 10 | 11 | The dataset can be found in the license_changes.csv file. It contains the following fields: 12 | * project: project name 13 | * relicense_date: relicense date 14 | * orig_license: original license name 15 | * new_license: new license name or description 16 | * org: GitHub organization where the project can be found 17 | * repo: GitHub repository where the project can be found 18 | * license_file: license filename in the GitHub repo 19 | 20 | The starting point for this dataset came from this [Wikipedia List of formerly open source software](https://en.wikipedia.org/wiki/List_of_formerly_open-source_or_free_software) page. You can find this list converted to a csv file in this folder called [wikipedia_list.csv](wikipedia_list.csv). Because we want to analyze what happens in a repo both before and after a license change, projects where the repo couldn't be found or where there were other issues that made the data suspect were excluded from license_changes.csv. Those issues are documented in a file in this folder called [dataset_notes.md](dataset_notes.md). If you want to help improve the dataset in license_changes.csv, the dataset_notes.md file would be an excellent place to start. 21 | 22 | 23 | [more_forks.csv](more_forks.csv) 24 | 25 | This dataset includes some additional examples of forks that were not created due to relicensing but most often either a) the original got bought by an entity the community found suspicious and decided to fork, or b) the original became unmaintained and someone new picked it up. The fields of the CSV match the structure of the license_changes dataset above. 26 | 27 | ## Contributions 28 | 29 | This dataset is still a bit rough and incomplete, so contributions are welcome to help us improve it. 
See details above for ideas about where to contribute, and please follow our [Contribution guidelines](https://github.com/chaoss/wg-data-science/blob/main/CONTRIBUTING.md) including DCO sign-off for all commits. 30 | 31 | If you learn of new projects that have been relicensed or older ones that we've missed, please feel free to submit a PR against license_changes.csv with the data. If you don't have time to create the PR, please file an issue to let us know, and we can add it to the dataset. 32 | 33 | ## Next Steps 34 | 35 | This is a very basic dataset that is meant to be the starting point for creating more robust datasets. By keeping license_changes.csv simple and limited to the basic information required to know where / when a license change took place, we can easily build on it in many ways. In particular, it would be interesting to put this data into Augur and GrimoireLab for further study. We can also use the GitHub API to gather additional data as in the example below in the Additional Data section. 36 | 37 | ## Additional Data 38 | 39 | There are some additional files in this folder. 40 | 41 | * [generate-license-data.py](generate-license-data.py) is a script that was used to make it easier to find the date of the commit where the license change occurred. It also serves as an example of how you might use this dataset as a starting point to gather more data that can be used to learn about a license change. 42 | * [output.json](output.json) is an autogenerated file created by generate-license-data.py and should not be edited. This is the output that was used to learn more about each license change in license_changes.csv. 43 | * wikipedia_list.csv is a convenience file that stores the data from the Wikipedia page used as a starting point. It isn't used anywhere else and should not be updated, since it just contains historical data. Any updates should be made in license_changes.csv. 
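
As a minimal sketch of using license_changes.csv as a starting point, the snippet below reads the fields listed in "The Dataset" section and derives a GitHub URL for each project from the `org` and `repo` fields. The helper name `load_license_changes` is hypothetical, and the sample row is placeholder data for illustration, not a real entry from the dataset.

```python
import csv
import io

# Field names as documented in "The Dataset" section above.
EXPECTED_COLUMNS = [
    "project", "relicense_date", "orig_license", "new_license",
    "org", "repo", "license_file",
]


def load_license_changes(source):
    """Read rows from an open CSV file and add a repo_url key to each row.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    reader = csv.DictReader(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    rows = []
    for row in reader:
        # Build the repository URL from the org and repo fields.
        row["repo_url"] = f"https://github.com/{row['org']}/{row['repo']}"
        rows.append(row)
    return rows


# Placeholder row for illustration only; it is not real data from the dataset.
SAMPLE_CSV = (
    "project,relicense_date,orig_license,new_license,org,repo,license_file\n"
    "Example Project,2024-01-01,Apache-2.0,Example License,"
    "example-org,example-repo,LICENSE\n"
)

sample_rows = load_license_changes(io.StringIO(SAMPLE_CSV))
for row in sample_rows:
    print(row["project"], row["repo_url"])
```

To run this against the real file, pass an open file handle instead of the in-memory sample, for example `open("license_changes.csv", newline="", encoding="utf-8")`.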
44 | 45 | # Forks Dataset 46 | 47 | [forks.csv](forks.csv) 48 | 49 | We are just beginning work on a dataset containing forks of open source projects, so right now it is very incomplete, but contributions are welcome! Please follow our [Contribution guidelines](https://github.com/chaoss/wg-data-science/blob/main/CONTRIBUTING.md) including DCO sign-off for all commits. You can also file issues or let us know via other channels if you see a mistake and want to let us know, but don't plan to make the change yourself. 50 | 51 | Many recent forks are a result of license changes, so we are keeping the dataset here to make it easy for people to find. 52 | 53 | Category column definitions: 54 | * acquisition: primarily the result of one company acquiring another 55 | * relicense: relicensing of a project, usually to a more restrictive license 56 | * feature: this generally refers to issues with getting contributions (features) included in the original project and could be a result of disagreements, governance issues, or other community dynamics. 57 | 58 | We know that open source projects are complex, and many forks don't fit into a single category, so we've attempted to pick the primary category and add any additional details in the notes. 
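
The category definitions above can be checked with a short tally. This is a sketch under stated assumptions: it assumes forks.csv has a column literally named `category` (the README only describes a "Category column"), the `project` and `notes` column names and the helper `count_fork_categories` are hypothetical, and the sample rows are placeholders rather than real entries.

```python
import csv
import io
from collections import Counter

# Categories defined in the README above.
KNOWN_CATEGORIES = {"acquisition", "relicense", "feature"}


def count_fork_categories(source, column="category"):
    """Count rows per category and collect values outside the documented set.

    Hypothetical helper; the column name is an assumption about forks.csv.
    """
    counts = Counter()
    unknown = set()
    for row in csv.DictReader(source):
        value = (row.get(column) or "").strip().lower()
        counts[value] += 1
        if value not in KNOWN_CATEGORIES:
            unknown.add(value)
    return counts, unknown


# Placeholder rows for illustration only; they are not real data from forks.csv.
SAMPLE_CSV = (
    "project,category,notes\n"
    "Example Fork A,relicense,placeholder\n"
    "Example Fork B,acquisition,placeholder\n"
    "Example Fork C,relicense,placeholder\n"
    "Example Fork D,governance,placeholder\n"
)

counts, unknown = count_fork_categories(io.StringIO(SAMPLE_CSV))
print(dict(counts))
print("Values outside the documented categories:", sorted(unknown))
```

Reporting undocumented values rather than rejecting them fits the note above that many forks don't fit a single category; such rows are candidates for review, not necessarily errors.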
59 | -------------------------------------------------------------------------------- /dataset/license-changes/dataset_notes.md: -------------------------------------------------------------------------------- 1 | # Data from Wikipedia page 2 | Source: https://en.wikipedia.org/wiki/List_of_formerly_open-source_or_free_software 3 | 4 | Excluded: 5 | * Couchbase Server,2010,2021,Apache-2.0,Business Source License - only found Apache license in https://github.com/couchbase/manifest 6 | * Couchbase Mobile,,2022,Apache-2.0,Business Source License - not sure where this project is - maybe https://github.com/couchbase/couchbase-lite-ios but that's Apache 7 | * Emby,2014,2018,GPL-2.0,"Source code closed on December 8, 2018" - source code no longer available 8 | * FBReader,2013,2015,GPL-2.0-or-later,"Apparently the number of devs was limited, and they all agreed to relicense it" - couldn't find license file - source code archived at https://github.com/geometer/FBReader 9 | * LiveJournal,1999,2014,GPL-2.0-or-later,The source code was made private in 2014 - couldn't find source code repo 10 | * Nexuiz,2005,2012,GPL-2.0-or-later,"Game abandoned in favour of a commercial video game of the same name, which licensed the Nexuiz title but is not based on its engine." - couldn't find source code repo 11 | * OctoberCMS,2014,2021,MIT,Cited the sustainability of its open source model as a factor. - couldn't find source code repo 12 | * Paint.NET,2004,2007,MIT,freeware license that prohibits modification or resale - couldn't find source code repo 13 | * PyMOL,2010,MIT-CMU,Custom,schrodinger,pymol-open-source,LICENSE - I couldn't find evidence that this was ever under an OSI license 14 | * Reddit,2008,2017,CPAL-1.0,"Source code was made private in 2017, as the internal codebase had already diverged significantly from the public one." 
- couldn't find source code repo 15 | * Sourcegraph,2013,2023,Apache-2.0,proprietary - couldn't find source code repo 16 | * Tux Racer,2000,2002,GPL-2.0-or-later,"Commercial expansion by original authors, also called Tux Racer." 17 | 18 | # Other data 19 | 20 | Excluded: 21 | * MariaDB MaxScale is under BSL 1.1, but I couldn't find the repo or info about the license change: https://mariadb.com/projects-using-bsl-11/ 22 | 23 | 24 | # Important Notes 25 | 26 | For the projects where the repo could not be found, you might be able to find a fork from someone else's account to analyze. I didn't attempt to find these. Also, I didn't spend much time looking - someone else should confirm these because I could have easily just missed them. 27 | -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/README.md: -------------------------------------------------------------------------------- 1 | This folder is where we can collaborate on a research report that contains several case studies of open source projects that resulted in hard forks after the project relicensed. 
2 | 3 | * Elasticsearch -> OpenSearch 4 | * Redis -> Valkey 5 | * Terraform -> OpenTofu 6 | 7 | If you just want an overview of the results so far, here are some summaries: 8 | * [The New Stack: What Happens to Relicensed Open Source Projects and Their Forks?](https://thenewstack.io/what-happens-to-relicensed-open-source-projects-and-their-forks/) 9 | * [State of Open Con 7 minute keynote video](https://www.youtube.com/watch?v=rphZFv9QbV0&list=PL0U2cL1JGPZdJTUooEjFMb_djIzreUxGM&index=4) 10 | * Additional presentations: [FOSDEM Panel](https://fosdem.org/2025/schedule/event/fosdem-2025-5258-forked-communities-project-re-licensing-and-community-impact/), [State of Open Con Panel](https://www.youtube.com/watch?v=DSTiQil10GQ&list=PL0U2cL1JGPZdfn4ODuMVouXMh9lDsiPGh&index=4), [OpenUK Meetup](https://www.youtube.com/watch?v=wliDVF3FpI0) 11 | * [Academic paper](https://github.com/chaoss/wg-data-science/tree/main/publications) with the results that were presented at the OpenForum Academy Symposium in November 2024. 12 | 13 | The current WIP draft of the research report can be found in this [Google doc for the Report](https://docs.google.com/document/d/1sYlUn9UsY7ynmzc3MVJTtktNgaLFQDOZ8W9fhYarWNo/edit). At this point, it's mostly an outline that still needs a lot of work. 14 | 15 | The [notebooks](notebooks) folder contains basic analysis of the organizational affiliation data for the contributors per open source project. 16 | 17 | The [data-files](data-files) folder contains data files (pickled) for each project with commits for a specific time period being studied (e.g., 1 year before the relicense). 18 | 19 | We still have a lot of work to do. Here are a few next steps that people can begin working on: 20 | * **Writing**: Write more of the Introduction and Context sections for the report (see doc above) - No data science experience required, and the links in the "Helpful articles" section at the top of the doc should help someone get started with this work. 
Much of this can probably be taken from the [OFA paper](https://docs.google.com/document/d/1hdLqLhQjPGwOpwMgH5dpTFMioSTRRZEGdQ5-lEZ9o_Q/edit?usp=sharing) as a start, but it will need to be heavily edited so that it isn't in an academic style, since the final output will be a report that is more in the style of an LF report (see the report Google doc above for more on style). 21 | * **Collect Data & Metrics Analysis**: Select several project health metrics from the CHAOSS project (ideally based on research) that might be used to answer some of the research questions listed in the doc. Ideally, these metrics should be implemented in Augur / 8Knot and / or GrimoireLab so that we can use CHAOSS projects for the visualizations. Several people could work on this at the same time. We have a start on this that can be found in the Appendix / Notes section of the report doc, but it still needs quite a bit of work. 22 | * **Validation**: Validate the data for the 6 projects by talking to people who are directly involved in those projects. @geekygirldawn has started this work and contacted people from the projects. 23 | -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/commits_people.py: -------------------------------------------------------------------------------- 1 | # Copyright Dawn M. Foster 2 | # SPDX-License-Identifier: MIT 3 | 4 | """Gets Commit Data 5 | This is aggregated per person for a repo between two specified dates. 6 | I'm currently using this to better understand who contributes to a project 7 | before and after a key time in the project (relicense / fork) with a focus on 8 | understanding organizational diversity. 
9 | 10 | Output (files are stored in the data-files directory) 11 | * GitHub API response code (should be "<Response [200]>") 12 | * Commit data pickle file containing a dataframe 13 | * Person pickle file containing a dictionary 14 | """ 15 | 16 | import sys 17 | import pandas as pd 18 | import argparse 19 | import requests 20 | import json 21 | 22 | # Read arguments from command line 23 | parser = argparse.ArgumentParser() 24 | 25 | parser.add_argument("-t", "--token", dest = "gh_key", help="Path to a file containing your GitHub Personal Access Token") 26 | parser.add_argument("-u", "--url", dest = "gh_url", help="URL for a GitHub repository") 27 | parser.add_argument("-b", "--begin_date", dest = "begin_date", help="Date in the format YYYY-MM-DD - gather commits after this begin date") 28 | parser.add_argument("-e", "--end_date", dest = "end_date", help="Date in the format YYYY-MM-DD - gather commits up until this end date") 29 | 30 | args = parser.parse_args() 31 | 32 | gh_url = args.gh_url 33 | gh_key = args.gh_key 34 | since_date = args.begin_date + "T00:00:00.000+00:00" 35 | until_date = args.end_date + "T00:00:00.000+00:00" 36 | 37 | url_parts = gh_url.strip('/').split('/') 38 | org_name = url_parts[3] 39 | repo_name = url_parts[4] 40 | 41 | # Read GitHub key from the file path passed with -t / --token 42 | try: 43 | with open(gh_key, 'r') as kf: 44 | api_token = kf.readline().rstrip() # remove newline & trailing whitespace 45 | 46 | except (OSError, TypeError): 47 | print("Error reading GH Key. Pass the path to a file containing your GitHub API token with -t / --token. Exiting") 48 | sys.exit(1) 49 | 50 | pickle_file = 'data-files/' + repo_name + str(since_date) + str(until_date) + '.pkl' 51 | 52 | def make_query(after_cursor = None): 53 | return """query repo_commits($org_name: String!, $repo_name: String!, $since_date: GitTimestamp!, $until_date: GitTimestamp!){ 54 | repository(owner: $org_name, name: $repo_name) { 55 | ... on Repository{ 56 | defaultBranchRef{ 57 | target{ 58 | ... 
on Commit{ 59 | history(since: $since_date, until: $until_date, first: 100 after: AFTER){ 60 | pageInfo { 61 | hasNextPage 62 | endCursor 63 | } 64 | edges{ 65 | node{ 66 | ... on Commit{ 67 | committedDate 68 | deletions 69 | additions 70 | oid 71 | authors(first:100) { 72 | nodes { 73 | date 74 | email 75 | user { 76 | login 77 | company 78 | email 79 | name 80 | } 81 | } 82 | } 83 | } 84 | } 85 | } 86 | } 87 | } 88 | } 89 | } 90 | } 91 | } 92 | }""".replace( 93 | "AFTER", '"{}"'.format(after_cursor) if after_cursor else "null" 94 | ) 95 | 96 | def get_data(api_token, org_name, repo_name, since_date, until_date): 97 | """Executes the GraphQL query to get data from one GitHub repo. 98 | 99 | Returns 100 | ------- 101 | repo_info_df : pandas.core.frame.DataFrame 102 | """ 103 | 104 | url = 'https://api.github.com/graphql' 105 | headers = {'Authorization': 'token %s' % api_token} 106 | 107 | repo_info_df = pd.DataFrame() 108 | 109 | has_next_page = True 110 | after_cursor = None 111 | 112 | while has_next_page: 113 | 114 | query = make_query(after_cursor) 115 | 116 | variables = {"org_name": org_name, "repo_name": repo_name, "since_date": since_date, "until_date": until_date} 117 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 118 | print(r) 119 | json_data = json.loads(r.text) 120 | 121 | df_temp = pd.DataFrame(json_data['data']['repository']['defaultBranchRef']['target']['history']['edges']) 122 | 123 | repo_info_df = pd.concat([repo_info_df, df_temp], ignore_index=True) # DataFrame.append was removed in pandas 2.0 124 | 125 | has_next_page = json_data['data']['repository']['defaultBranchRef']['target']['history']["pageInfo"]["hasNextPage"] 126 | 127 | after_cursor = json_data['data']['repository']['defaultBranchRef']['target']['history']["pageInfo"]["endCursor"] 128 | 129 | return repo_info_df 130 | 131 | repo_info_df = get_data(api_token, org_name, repo_name, since_date, until_date) 132 | 133 | def expand_commits(commits): 134 | if pd.isnull(commits): 135 | 
commits_list = [None, None, None, None, None] 136 | else: 137 | node = commits 138 | try: 139 | commit_date = node['committedDate'] 140 | except: 141 | commit_date = None 142 | try: 143 | dels = node['deletions'] 144 | except: 145 | dels = None 146 | try: 147 | adds = node['additions'] 148 | except: 149 | adds = None 150 | try: 151 | oid = node['oid'] 152 | except: 153 | oid = None 154 | try: 155 | author = node['authors']['nodes'] 156 | except: 157 | author = None 158 | commits_list = [commit_date, dels, adds, oid, author] 159 | return commits_list 160 | 161 | repo_info_df['commits_list'] = repo_info_df['node'].apply(expand_commits) 162 | repo_info_df[['commit_date','deletions', 'additions','oid','author']] = pd.DataFrame(repo_info_df.commits_list.tolist(), index= repo_info_df.index) 163 | #repo_info_df = repo_info_df.drop(columns=['commits_list']) 164 | repo_info_df 165 | repo_info_df.to_pickle(pickle_file) 166 | 167 | def create_person_dict(pickle_file, repo_name, since_date, until_date): 168 | import collections 169 | import pickle 170 | 171 | repo_info_df = pd.read_pickle(pickle_file) 172 | 173 | output_pickle = 'data-files/' + repo_name + '_people_' + str(since_date) + str(until_date) + '.pkl' 174 | 175 | # Create a dictionary for each person with the key being the gh login 176 | # Create a dict for commits that aren't tied to a gh login (gh user = None) 177 | person_dict=collections.defaultdict(dict) 178 | fail_person_dict=collections.defaultdict(dict) 179 | 180 | for x in repo_info_df.iterrows(): 181 | data = x[1] 182 | 183 | for y in data['author']: 184 | try: 185 | login = y['user']['login'] 186 | company = y['user']['company'] 187 | commit_email = y['email'] 188 | login_email = y['user']['email'] 189 | name = y['user']['name'] 190 | 191 | if person_dict[login]: 192 | person_dict[login]['commits'] = person_dict[login]['commits'] + 1 193 | person_dict[login]['additions'] = person_dict[login]['additions'] + data['additions'] 194 | 
person_dict[login]['deletions'] = person_dict[login]['deletions'] + data['deletions'] 195 | if commit_email not in person_dict[login]['email']: 196 | person_dict[login]['email'].append(commit_email) 197 | else: 198 | person_dict[login]['company'] = company 199 | person_dict[login]['name'] = name 200 | person_dict[login]['commits'] = 1 201 | person_dict[login]['additions'] = data['additions'] 202 | person_dict[login]['deletions'] = data['deletions'] 203 | if len(login_email) == 0: 204 | person_dict[login]['email'] = [commit_email] 205 | elif commit_email == login_email: 206 | person_dict[login]['email'] = [commit_email] 207 | else: 208 | person_dict[login]['email'] = [commit_email,login_email] 209 | except: 210 | try: 211 | if fail_person_dict[commit_email]: 212 | fail_person_dict[commit_email]['commits'] = fail_person_dict[commit_email]['commits'] + 1 213 | fail_person_dict[commit_email]['additions'] = fail_person_dict[commit_email]['additions'] + data['additions'] 214 | fail_person_dict[commit_email]['deletions'] = fail_person_dict[commit_email]['deletions'] + data['deletions'] 215 | else: 216 | fail_person_dict[commit_email]['commits'] = 1 217 | fail_person_dict[commit_email]['additions'] = data['additions'] 218 | fail_person_dict[commit_email]['deletions'] = data['deletions'] 219 | except: 220 | print("Unknown Exception on", y) 221 | 222 | # For every email that didn't have a GH login / user, search for that email in the 223 | # person_dict and if found, add the commits, additions, and deletions to the proper user 224 | # Print error message if not found (above items for testing of that case) 225 | for f_key, f_value in fail_person_dict.items(): 226 | found = False 227 | for key, value in person_dict.items(): 228 | if f_key in value['email']: 229 | person_dict[key]['commits'] = person_dict[key]['commits'] + f_value['commits'] 230 | person_dict[key]['additions'] = person_dict[key]['additions'] + f_value['additions'] 231 | person_dict[key]['deletions'] = 
person_dict[key]['deletions'] + f_value['deletions'] 232 | found = True 233 | if found == False: 234 | print('Not found - no person with this email',f_key,f_value) 235 | 236 | with open(output_pickle, 'wb') as f: 237 | pickle.dump(person_dict, f) 238 | 239 | print('Commit data stored in', pickle_file) 240 | print('People Dictionary stored in', output_pickle) 241 | 242 | create_person_dict(pickle_file, repo_name, since_date, until_date) 243 | -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/OpenSearch2021-04-12T00:00:00.000+00:002024-08-01T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/OpenSearch2021-04-12T00:00:00.000+00:002024-08-01T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/OpenSearch2021-04-12T00:00:00.000+00:002024-09-16T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/OpenSearch2021-04-12T00:00:00.000+00:002024-09-16T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/OpenSearch2023-08-01T00:00:00.000+00:002024-08-01T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/OpenSearch2023-08-01T00:00:00.000+00:002024-08-01T00:00:00.000+00:00.pkl 
-------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/OpenSearch2023-09-16T00:00:00.000+00:002024-09-16T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/OpenSearch2023-09-16T00:00:00.000+00:002024-09-16T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/OpenSearch2024-09-16T00:00:00.000+00:002025-03-16T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/OpenSearch2024-09-16T00:00:00.000+00:002025-03-16T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/OpenSearch_people_2021-04-12T00:00:00.000+00:002022-04-12T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/OpenSearch_people_2021-04-12T00:00:00.000+00:002022-04-12T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/OpenSearch_people_2021-04-12T00:00:00.000+00:002024-08-01T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/OpenSearch_people_2021-04-12T00:00:00.000+00:002024-08-01T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/OpenSearch_people_2021-04-12T00:00:00.000+00:002024-09-16T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/OpenSearch_people_2021-04-12T00:00:00.000+00:002024-09-16T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/OpenSearch_people_2023-08-01T00:00:00.000+00:002024-08-01T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/OpenSearch_people_2023-08-01T00:00:00.000+00:002024-08-01T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/OpenSearch_people_2023-09-16T00:00:00.000+00:002024-09-16T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/OpenSearch_people_2023-09-16T00:00:00.000+00:002024-09-16T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- 
/dataset/license-changes/fork-case-study/data-files/OpenSearch_people_2024-09-16T00:00:00.000+00:002025-03-16T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/OpenSearch_people_2024-09-16T00:00:00.000+00:002025-03-16T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/elasticsearch2019-02-03T00:00:00.000+00:002021-02-03T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/elasticsearch2019-02-03T00:00:00.000+00:002021-02-03T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/elasticsearch2020-02-03T00:00:00.000+00:002021-02-03T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/elasticsearch2020-02-03T00:00:00.000+00:002021-02-03T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/elasticsearch2023-08-29T00:00:00.000+00:002024-08-29T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/elasticsearch2023-08-29T00:00:00.000+00:002024-08-29T00:00:00.000+00:00.pkl 
-------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/elasticsearch2024-08-29T00:00:00.000+00:002025-02-29T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/elasticsearch2024-08-29T00:00:00.000+00:002025-02-29T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/elasticsearch_people_2019-02-03T00:00:00.000+00:002021-02-03T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/elasticsearch_people_2019-02-03T00:00:00.000+00:002021-02-03T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/elasticsearch_people_2020-02-03T00:00:00.000+00:002021-02-03T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/elasticsearch_people_2020-02-03T00:00:00.000+00:002021-02-03T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/elasticsearch_people_2021-02-03T00:00:00.000+00:002022-02-03T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/elasticsearch_people_2021-02-03T00:00:00.000+00:002022-02-03T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/elasticsearch_people_2023-08-29T00:00:00.000+00:002024-08-29T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/elasticsearch_people_2023-08-29T00:00:00.000+00:002024-08-29T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/elasticsearch_people_2024-08-29T00:00:00.000+00:002025-02-29T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/elasticsearch_people_2024-08-29T00:00:00.000+00:002025-02-29T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/opentofu2023-09-05T00:00:00.000+00:002024-09-05T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/opentofu2023-09-05T00:00:00.000+00:002024-09-05T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- 
/dataset/license-changes/fork-case-study/data-files/opentofu_people_2023-09-05T00:00:00.000+00:002024-09-05T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/opentofu_people_2023-09-05T00:00:00.000+00:002024-09-05T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/redis2022-03-20T00:00:00.000+00:002024-03-20T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/redis2022-03-20T00:00:00.000+00:002024-03-20T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/redis2023-03-20T00:00:00.000+00:002024-03-20T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/redis2023-03-20T00:00:00.000+00:002024-03-20T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/redis2024-02-20T00:00:00.000+00:002024-08-21T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/redis2024-02-20T00:00:00.000+00:002024-08-21T00:00:00.000+00:00.pkl 
-------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/redis2024-03-20T00:00:00.000+00:002024-08-21T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/redis2024-03-20T00:00:00.000+00:002024-08-21T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/redis2024-03-20T00:00:00.000+00:002024-09-20T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/redis2024-03-20T00:00:00.000+00:002024-09-20T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/redis2024-03-20T00:00:00.000+00:002025-03-20T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/redis2024-03-20T00:00:00.000+00:002025-03-20T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/redis_people_2022-03-20T00:00:00.000+00:002024-03-20T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/redis_people_2022-03-20T00:00:00.000+00:002024-03-20T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/redis_people_2023-03-20T00:00:00.000+00:002024-03-20T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/redis_people_2023-03-20T00:00:00.000+00:002024-03-20T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/redis_people_2024-03-20T00:00:00.000+00:002024-08-21T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/redis_people_2024-03-20T00:00:00.000+00:002024-08-21T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/redis_people_2024-03-20T00:00:00.000+00:002024-09-20T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/redis_people_2024-03-20T00:00:00.000+00:002024-09-20T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/redis_people_2024-03-20T00:00:00.000+00:002025-03-20T00:00:00.000+00:00.pkl: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/redis_people_2024-03-20T00:00:00.000+00:002025-03-20T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/stars-forks/opentofu-forks.csv: -------------------------------------------------------------------------------- 1 | yr,mon,d,forks 2 | 2023,9,20,46 3 | 2023,9,21,64 4 | 2023,9,22,56 5 | 2023,9,23,30 6 | 2023,9,24,14 7 | 2023,9,25,12 8 | 2023,9,26,9 9 | 2023,9,27,9 10 | 2023,9,28,5 11 | 2023,9,29,3 12 | 2023,9,30,4 13 | 2023,10,1,3 14 | 2023,10,2,3 15 | 2023,10,3,1 16 | 2023,10,4,12 17 | 2023,10,5,4 18 | 2023,10,6,7 19 | 2023,10,7,1 20 | 2023,10,8,2 21 | 2023,10,9,3 22 | 2023,10,10,4 23 | 2023,10,11,4 24 | 2023,10,12,7 25 | 2023,10,13,6 26 | 2023,10,14,1 27 | 2023,10,15,3 28 | 2023,10,16,2 29 | 2023,10,17,2 30 | 2023,10,18,5 31 | 2023,10,19,2 32 | 2023,10,20,4 33 | 2023,10,21,3 34 | 2023,10,23,2 35 | 2023,10,24,1 36 | 2023,10,25,6 37 | 2023,10,26,3 38 | 2023,10,27,3 39 | 2023,10,28,1 40 | 2023,10,29,2 41 | 2023,10,31,4 42 | 2023,11,1,1 43 | 2023,11,3,1 44 | 2023,11,4,1 45 | 2023,11,5,1 46 | 2023,11,6,2 47 | 2023,11,7,2 48 | 2023,11,8,1 49 | 2023,11,9,3 50 | 2023,11,10,4 51 | 2023,11,12,1 52 | 2023,11,13,3 53 | 2023,11,14,3 54 | 2023,11,16,2 55 | 2023,11,17,2 56 | 2023,11,20,2 57 | 2023,11,24,1 58 | 2023,11,25,1 59 | 2023,11,27,2 60 | 2023,11,28,3 61 | 2023,11,29,2 62 | 2023,11,30,1 63 | 2023,12,1,1 64 | 2023,12,7,1 65 | 2023,12,8,1 66 | 2023,12,9,1 67 | 2023,12,10,1 68 | 2023,12,13,2 69 | 2023,12,14,1 70 | 2023,12,15,3 71 | 2023,12,17,1 72 | 2023,12,18,3 73 | 2023,12,19,1 74 | 2023,12,20,1 75 | 2023,12,21,1 76 | 2023,12,22,1 77 | 2023,12,23,1 78 | 2023,12,24,1 79 | 2023,12,25,1 80 | 2023,12,27,2 81 | 2023,12,28,5 82 | 2023,12,31,1 83 
| 2024,1,1,1 84 | 2024,1,2,2 85 | 2024,1,3,1 86 | 2024,1,4,2 87 | 2024,1,6,1 88 | 2024,1,7,1 89 | 2024,1,9,1 90 | 2024,1,10,2 91 | 2024,1,11,3 92 | 2024,1,12,4 93 | 2024,1,13,2 94 | 2024,1,14,4 95 | 2024,1,15,3 96 | 2024,1,16,2 97 | 2024,1,17,1 98 | 2024,1,19,2 99 | 2024,1,20,2 100 | 2024,1,21,1 101 | 2024,1,22,2 102 | 2024,1,23,2 103 | 2024,1,24,6 104 | 2024,1,25,2 105 | 2024,1,26,2 106 | 2024,1,27,1 107 | 2024,1,28,2 108 | 2024,1,29,4 109 | 2024,1,30,2 110 | 2024,1,31,1 111 | 2024,2,1,2 112 | 2024,2,2,2 113 | 2024,2,4,2 114 | 2024,2,5,3 115 | 2024,2,6,3 116 | 2024,2,8,3 117 | 2024,2,9,3 118 | 2024,2,10,1 119 | 2024,2,11,1 120 | 2024,2,12,1 121 | 2024,2,13,2 122 | 2024,2,14,1 123 | 2024,2,15,1 124 | 2024,2,16,1 125 | 2024,2,19,2 126 | 2024,2,20,3 127 | 2024,2,22,1 128 | 2024,2,24,2 129 | 2024,2,25,2 130 | 2024,2,26,3 131 | 2024,2,27,1 132 | 2024,2,29,1 133 | 2024,3,2,2 134 | 2024,3,3,1 135 | 2024,3,4,2 136 | 2024,3,5,1 137 | 2024,3,6,1 138 | 2024,3,7,1 139 | 2024,3,10,1 140 | 2024,3,12,2 141 | 2024,3,13,1 142 | 2024,3,14,1 143 | 2024,3,15,2 144 | 2024,3,16,2 145 | 2024,3,17,1 146 | 2024,3,18,1 147 | 2024,3,19,2 148 | 2024,3,22,1 149 | 2024,3,23,1 150 | 2024,3,24,1 151 | 2024,3,26,2 152 | 2024,3,27,2 153 | 2024,3,28,2 154 | 2024,3,29,1 155 | 2024,4,1,1 156 | 2024,4,2,1 157 | 2024,4,3,2 158 | 2024,4,4,3 159 | 2024,4,5,1 160 | 2024,4,6,3 161 | 2024,4,8,1 162 | 2024,4,10,1 163 | 2024,4,11,3 164 | 2024,4,12,1 165 | 2024,4,13,4 166 | 2024,4,14,3 167 | 2024,4,15,2 168 | 2024,4,16,2 169 | 2024,4,17,1 170 | 2024,4,18,2 171 | 2024,4,19,1 172 | 2024,4,20,2 173 | 2024,4,21,1 174 | 2024,4,22,2 175 | 2024,4,23,2 176 | 2024,4,24,2 177 | 2024,4,25,3 178 | 2024,4,26,9 179 | 2024,4,27,3 180 | 2024,4,28,1 181 | 2024,4,29,2 182 | 2024,5,1,3 183 | 2024,5,3,5 184 | 2024,5,5,2 185 | 2024,5,7,2 186 | 2024,5,8,1 187 | 2024,5,9,3 188 | 2024,5,10,2 189 | 2024,5,11,2 190 | 2024,5,12,1 191 | 2024,5,14,1 192 | 2024,5,15,1 193 | 2024,5,16,2 194 | 2024,5,17,1 195 | 2024,5,18,5 196 | 2024,5,20,2 
197 | 2024,5,21,1 198 | 2024,5,22,2 199 | 2024,5,24,1 200 | 2024,5,25,1 201 | 2024,5,26,2 202 | 2024,5,27,1 203 | 2024,5,28,1 204 | 2024,5,29,2 205 | 2024,5,31,1 206 | 2024,6,1,2 207 | 2024,6,2,2 208 | 2024,6,3,2 209 | 2024,6,4,2 210 | 2024,6,5,1 211 | 2024,6,6,2 212 | 2024,6,10,2 213 | 2024,6,12,2 214 | 2024,6,14,3 215 | 2024,6,17,1 216 | 2024,6,18,3 217 | 2024,6,20,2 218 | 2024,6,23,2 219 | 2024,6,24,2 220 | 2024,6,26,1 221 | 2024,6,27,1 222 | 2024,6,28,1 223 | 2024,6,30,1 224 | 2024,7,1,1 225 | 2024,7,3,1 226 | 2024,7,4,1 227 | 2024,7,5,1 228 | 2024,7,12,2 229 | 2024,7,15,1 230 | 2024,7,16,1 231 | 2024,7,17,2 232 | 2024,7,18,1 233 | 2024,7,19,1 234 | 2024,7,22,2 235 | 2024,7,23,3 236 | 2024,7,24,2 237 | 2024,7,25,1 238 | 2024,7,27,3 239 | 2024,7,29,2 240 | 2024,7,30,2 241 | 2024,7,31,2 242 | 2024,8,1,1 243 | 2024,8,3,1 244 | 2024,8,4,2 245 | 2024,8,5,3 246 | 2024,8,8,1 247 | 2024,8,12,2 248 | 2024,8,13,2 249 | 2024,8,14,1 250 | 2024,8,16,2 251 | 2024,8,20,2 252 | 2024,8,22,1 253 | 2024,8,23,1 254 | 2024,8,24,1 255 | 2024,8,26,1 256 | 2024,8,27,3 257 | 2024,8,28,2 258 | 2024,8,29,2 259 | 2024,8,30,2 260 | 2024,9,1,1 261 | 2024,9,2,1 262 | 2024,9,3,2 263 | 2024,9,4,1 264 | 2024,9,5,1 265 | 2024,9,7,2 266 | 2024,9,9,2 267 | 2024,9,10,1 268 | 2024,9,12,1 269 | 2024,9,13,1 270 | 2024,9,14,1 271 | 2024,9,18,3 272 | 2024,9,19,2 273 | 2024,9,20,1 274 | 2024,9,21,2 275 | 2024,9,22,2 276 | 2024,9,24,2 277 | 2024,9,26,2 278 | 2024,9,27,3 279 | 2024,9,28,1 280 | 2024,9,29,1 281 | 2024,9,30,2 282 | 2024,10,1,1 283 | 2024,10,2,2 284 | 2024,10,3,1 285 | 2024,10,4,1 286 | 2024,10,5,1 287 | 2024,10,6,1 288 | 2024,10,10,3 289 | 2024,10,12,2 290 | 2024,10,16,3 291 | 2024,10,18,1 292 | 2024,10,21,4 293 | 2024,10,23,1 294 | 2024,10,25,1 295 | 2024,10,26,2 296 | 2024,10,28,3 297 | 2024,10,29,2 298 | 2024,10,30,1 299 | 2024,10,31,2 300 | 2024,11,3,1 301 | 2024,11,4,3 302 | 2024,11,5,4 303 | 2024,11,7,1 304 | 2024,11,9,1 305 | 2024,11,12,1 306 | 2024,11,14,1 307 | 2024,11,15,1 308 | 
2024,11,18,1 309 | 2024,11,20,1 310 | 2024,11,24,1 311 | 2024,11,25,1 312 | 2024,11,26,2 313 | 2024,11,27,1 314 | 2024,11,29,3 315 | 2024,12,1,1 316 | 2024,12,2,2 317 | 2024,12,3,1 318 | 2024,12,4,2 319 | 2024,12,6,3 320 | 2024,12,11,1 321 | 2024,12,12,2 322 | 2024,12,16,1 323 | 2024,12,17,3 324 | 2024,12,18,2 325 | 2024,12,21,1 326 | 2024,12,23,1 327 | 2024,12,27,1 328 | 2024,12,28,1 329 | 2025,1,3,1 330 | 2025,1,4,2 331 | 2025,1,8,1 332 | 2025,1,9,1 333 | 2025,1,14,1 334 | 2025,1,16,1 335 | 2025,1,17,1 336 | 2025,1,21,1 337 | 2025,1,22,2 338 | 2025,1,24,1 339 | 2025,1,25,1 340 | 2025,1,26,1 341 | 2025,1,27,1 342 | 2025,1,29,1 343 | 2025,1,30,1 344 | 2025,1,31,1 345 | 2025,2,1,1 346 | 2025,2,5,1 347 | 2025,2,8,1 348 | 2025,2,9,1 349 | 2025,2,12,1 350 | 2025,2,13,1 351 | 2025,2,15,1 352 | 2025,2,25,1 353 | 2025,2,26,1 354 | 2025,3,1,1 355 | 2025,3,2,1 356 | 2025,3,4,1 357 | 2025,3,6,2 358 | 2025,3,9,1 359 | 2025,3,10,1 360 | 2025,3,13,1 361 | 2025,3,14,1 362 | 2025,3,16,1 363 | 2025,3,17,1 364 | 2025,3,18,1 365 | 2025,3,20,1 366 | 2025,3,22,1 367 | 2025,3,24,1 368 | 2025,3,25,3 369 | 2025,3,26,4 370 | 2025,3,29,1 371 | 2025,3,30,2 372 | 2025,4,1,2 373 | 2025,4,2,1 374 | -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/stars-forks/opentofu-stars.csv: -------------------------------------------------------------------------------- 1 | yr,mon,d,stars 2 | 2023,9,20,1134 3 | 2023,9,21,1558 4 | 2023,9,22,1770 5 | 2023,9,23,872 6 | 2023,9,24,378 7 | 2023,9,25,344 8 | 2023,9,26,258 9 | 2023,9,27,160 10 | 2023,9,28,153 11 | 2023,9,29,128 12 | 2023,9,30,71 13 | 2023,10,1,53 14 | 2023,10,2,92 15 | 2023,10,3,85 16 | 2023,10,4,101 17 | 2023,10,5,105 18 | 2023,10,6,100 19 | 2023,10,7,51 20 | 2023,10,8,68 21 | 2023,10,9,85 22 | 2023,10,10,67 23 | 2023,10,11,73 24 | 2023,10,12,65 25 | 2023,10,13,57 26 | 2023,10,14,28 27 | 2023,10,15,43 28 | 2023,10,16,83 29 | 2023,10,17,73 30 | 2023,10,18,65 31 | 
2023,10,19,68 32 | 2023,10,20,44 33 | 2023,10,21,20 34 | 2023,10,22,32 35 | 2023,10,23,48 36 | 2023,10,24,43 37 | 2023,10,25,53 38 | 2023,10,26,64 39 | 2023,10,27,74 40 | 2023,10,28,38 41 | 2023,10,29,20 42 | 2023,10,30,36 43 | 2023,10,31,27 44 | 2023,11,1,44 45 | 2023,11,2,45 46 | 2023,11,3,33 47 | 2023,11,4,15 48 | 2023,11,5,27 49 | 2023,11,6,33 50 | 2023,11,7,38 51 | 2023,11,8,33 52 | 2023,11,9,42 53 | 2023,11,10,13 54 | 2023,11,11,18 55 | 2023,11,12,21 56 | 2023,11,13,32 57 | 2023,11,14,27 58 | 2023,11,15,29 59 | 2023,11,16,21 60 | 2023,11,17,28 61 | 2023,11,18,13 62 | 2023,11,19,24 63 | 2023,11,20,25 64 | 2023,11,21,19 65 | 2023,11,22,25 66 | 2023,11,23,22 67 | 2023,11,24,19 68 | 2023,11,25,8 69 | 2023,11,26,17 70 | 2023,11,27,15 71 | 2023,11,28,18 72 | 2023,11,29,32 73 | 2023,11,30,15 74 | 2023,12,1,11 75 | 2023,12,2,11 76 | 2023,12,3,12 77 | 2023,12,4,10 78 | 2023,12,5,21 79 | 2023,12,6,12 80 | 2023,12,7,24 81 | 2023,12,8,19 82 | 2023,12,9,37 83 | 2023,12,10,23 84 | 2023,12,11,24 85 | 2023,12,12,27 86 | 2023,12,13,39 87 | 2023,12,14,21 88 | 2023,12,15,23 89 | 2023,12,16,25 90 | 2023,12,17,21 91 | 2023,12,18,35 92 | 2023,12,19,37 93 | 2023,12,20,25 94 | 2023,12,21,30 95 | 2023,12,22,21 96 | 2023,12,23,15 97 | 2023,12,24,6 98 | 2023,12,25,9 99 | 2023,12,26,24 100 | 2023,12,27,16 101 | 2023,12,28,14 102 | 2023,12,29,9 103 | 2023,12,30,8 104 | 2023,12,31,10 105 | 2024,1,1,9 106 | 2024,1,2,20 107 | 2024,1,3,16 108 | 2024,1,4,18 109 | 2024,1,5,23 110 | 2024,1,6,13 111 | 2024,1,7,7 112 | 2024,1,8,20 113 | 2024,1,9,31 114 | 2024,1,10,100 115 | 2024,1,11,141 116 | 2024,1,12,130 117 | 2024,1,13,63 118 | 2024,1,14,37 119 | 2024,1,15,45 120 | 2024,1,16,38 121 | 2024,1,17,57 122 | 2024,1,18,48 123 | 2024,1,19,34 124 | 2024,1,20,20 125 | 2024,1,21,31 126 | 2024,1,22,65 127 | 2024,1,23,40 128 | 2024,1,24,41 129 | 2024,1,25,30 130 | 2024,1,26,28 131 | 2024,1,27,14 132 | 2024,1,28,21 133 | 2024,1,29,25 134 | 2024,1,30,28 135 | 2024,1,31,32 136 | 2024,2,1,22 137 | 2024,2,2,24 
138 | 2024,2,3,17 139 | 2024,2,4,10 140 | 2024,2,5,19 141 | 2024,2,6,29 142 | 2024,2,7,25 143 | 2024,2,8,18 144 | 2024,2,9,28 145 | 2024,2,10,20 146 | 2024,2,11,18 147 | 2024,2,12,20 148 | 2024,2,13,23 149 | 2024,2,14,21 150 | 2024,2,15,26 151 | 2024,2,16,23 152 | 2024,2,17,14 153 | 2024,2,18,12 154 | 2024,2,19,15 155 | 2024,2,20,23 156 | 2024,2,21,21 157 | 2024,2,22,22 158 | 2024,2,23,13 159 | 2024,2,24,12 160 | 2024,2,25,4 161 | 2024,2,26,18 162 | 2024,2,27,19 163 | 2024,2,28,14 164 | 2024,2,29,14 165 | 2024,3,1,15 166 | 2024,3,2,10 167 | 2024,3,3,15 168 | 2024,3,4,21 169 | 2024,3,5,17 170 | 2024,3,6,19 171 | 2024,3,7,11 172 | 2024,3,8,12 173 | 2024,3,9,11 174 | 2024,3,10,4 175 | 2024,3,11,16 176 | 2024,3,12,21 177 | 2024,3,13,12 178 | 2024,3,14,27 179 | 2024,3,15,32 180 | 2024,3,16,23 181 | 2024,3,17,23 182 | 2024,3,18,25 183 | 2024,3,19,22 184 | 2024,3,20,23 185 | 2024,3,21,23 186 | 2024,3,22,33 187 | 2024,3,23,19 188 | 2024,3,24,13 189 | 2024,3,25,18 190 | 2024,3,26,15 191 | 2024,3,27,26 192 | 2024,3,28,20 193 | 2024,3,29,24 194 | 2024,3,30,17 195 | 2024,3,31,12 196 | 2024,4,1,30 197 | 2024,4,2,23 198 | 2024,4,3,25 199 | 2024,4,4,56 200 | 2024,4,5,39 201 | 2024,4,6,40 202 | 2024,4,7,30 203 | 2024,4,8,23 204 | 2024,4,9,25 205 | 2024,4,10,16 206 | 2024,4,11,54 207 | 2024,4,12,54 208 | 2024,4,13,25 209 | 2024,4,14,21 210 | 2024,4,15,41 211 | 2024,4,16,32 212 | 2024,4,17,29 213 | 2024,4,18,31 214 | 2024,4,19,21 215 | 2024,4,20,11 216 | 2024,4,21,10 217 | 2024,4,22,17 218 | 2024,4,23,22 219 | 2024,4,24,50 220 | 2024,4,25,115 221 | 2024,4,26,151 222 | 2024,4,27,53 223 | 2024,4,28,46 224 | 2024,4,29,47 225 | 2024,4,30,64 226 | 2024,5,1,69 227 | 2024,5,2,57 228 | 2024,5,3,38 229 | 2024,5,4,20 230 | 2024,5,5,19 231 | 2024,5,6,22 232 | 2024,5,7,24 233 | 2024,5,8,30 234 | 2024,5,9,24 235 | 2024,5,10,22 236 | 2024,5,11,14 237 | 2024,5,12,13 238 | 2024,5,13,28 239 | 2024,5,14,27 240 | 2024,5,15,57 241 | 2024,5,16,35 242 | 2024,5,17,34 243 | 2024,5,18,18 244 | 2024,5,19,13 
245 | 2024,5,20,21 246 | 2024,5,21,30 247 | 2024,5,22,20 248 | 2024,5,23,19 249 | 2024,5,24,21 250 | 2024,5,25,11 251 | 2024,5,26,6 252 | 2024,5,27,18 253 | 2024,5,28,16 254 | 2024,5,29,20 255 | 2024,5,30,24 256 | 2024,5,31,20 257 | 2024,6,1,13 258 | 2024,6,2,14 259 | 2024,6,3,17 260 | 2024,6,4,16 261 | 2024,6,5,19 262 | 2024,6,6,14 263 | 2024,6,7,15 264 | 2024,6,8,5 265 | 2024,6,9,14 266 | 2024,6,10,17 267 | 2024,6,11,16 268 | 2024,6,12,12 269 | 2024,6,13,18 270 | 2024,6,14,9 271 | 2024,6,15,4 272 | 2024,6,16,6 273 | 2024,6,17,17 274 | 2024,6,18,19 275 | 2024,6,19,18 276 | 2024,6,20,13 277 | 2024,6,21,24 278 | 2024,6,22,6 279 | 2024,6,23,6 280 | 2024,6,24,13 281 | 2024,6,25,24 282 | 2024,6,26,21 283 | 2024,6,27,19 284 | 2024,6,28,9 285 | 2024,6,29,9 286 | 2024,6,30,3 287 | 2024,7,1,13 288 | 2024,7,2,13 289 | 2024,7,3,10 290 | 2024,7,4,8 291 | 2024,7,5,7 292 | 2024,7,6,11 293 | 2024,7,7,10 294 | 2024,7,8,7 295 | 2024,7,9,15 296 | 2024,7,10,12 297 | 2024,7,11,12 298 | 2024,7,12,11 299 | 2024,7,13,8 300 | 2024,7,14,11 301 | 2024,7,15,18 302 | 2024,7,16,17 303 | 2024,7,17,9 304 | 2024,7,18,11 305 | 2024,7,19,10 306 | 2024,7,20,4 307 | 2024,7,21,8 308 | 2024,7,22,17 309 | 2024,7,23,17 310 | 2024,7,24,17 311 | 2024,7,25,16 312 | 2024,7,26,16 313 | 2024,7,27,3 314 | 2024,7,28,8 315 | 2024,7,29,22 316 | 2024,7,30,23 317 | 2024,7,31,20 318 | 2024,8,1,17 319 | 2024,8,2,18 320 | 2024,8,3,9 321 | 2024,8,4,10 322 | 2024,8,5,11 323 | 2024,8,6,14 324 | 2024,8,7,19 325 | 2024,8,8,11 326 | 2024,8,9,8 327 | 2024,8,10,8 328 | 2024,8,11,6 329 | 2024,8,12,15 330 | 2024,8,13,12 331 | 2024,8,14,8 332 | 2024,8,15,17 333 | 2024,8,16,9 334 | 2024,8,17,11 335 | 2024,8,18,10 336 | 2024,8,19,8 337 | 2024,8,20,13 338 | 2024,8,21,9 339 | 2024,8,22,10 340 | 2024,8,23,16 341 | 2024,8,24,14 342 | 2024,8,25,5 343 | 2024,8,26,3 344 | 2024,8,27,7 345 | 2024,8,28,14 346 | 2024,8,29,16 347 | 2024,8,30,18 348 | 2024,8,31,7 349 | 2024,9,1,11 350 | 2024,9,2,12 351 | 2024,9,3,10 352 | 2024,9,4,18 353 | 
2024,9,5,19 354 | 2024,9,6,21 355 | 2024,9,7,18 356 | 2024,9,8,8 357 | 2024,9,9,17 358 | 2024,9,10,9 359 | 2024,9,11,12 360 | 2024,9,12,12 361 | 2024,9,13,7 362 | 2024,9,14,2 363 | 2024,9,15,12 364 | 2024,9,16,8 365 | 2024,9,17,10 366 | 2024,9,18,19 367 | 2024,9,19,13 368 | 2024,9,20,20 369 | 2024,9,21,16 370 | 2024,9,22,6 371 | 2024,9,23,12 372 | 2024,9,24,16 373 | 2024,9,25,15 374 | 2024,9,26,14 375 | 2024,9,27,16 376 | 2024,9,28,13 377 | 2024,9,29,6 378 | 2024,9,30,10 379 | 2024,10,1,15 380 | 2024,10,2,13 381 | 2024,10,3,21 382 | 2024,10,4,15 383 | 2024,10,5,13 384 | 2024,10,6,19 385 | 2024,10,7,24 386 | 2024,10,8,12 387 | 2024,10,9,20 388 | 2024,10,10,14 389 | 2024,10,11,6 390 | 2024,10,12,8 391 | 2024,10,13,3 392 | 2024,10,14,18 393 | 2024,10,15,13 394 | 2024,10,16,9 395 | 2024,10,17,8 396 | 2024,10,18,8 397 | 2024,10,19,9 398 | 2024,10,20,9 399 | 2024,10,21,15 400 | 2024,10,22,8 401 | 2024,10,23,15 402 | 2024,10,24,15 403 | 2024,10,25,12 404 | 2024,10,26,6 405 | 2024,10,27,3 406 | 2024,10,28,35 407 | 2024,10,29,26 408 | 2024,10,30,24 409 | 2024,10,31,16 410 | 2024,11,1,15 411 | 2024,11,2,7 412 | 2024,11,3,3 413 | 2024,11,4,19 414 | 2024,11,5,12 415 | 2024,11,6,11 416 | 2024,11,7,17 417 | 2024,11,8,8 418 | 2024,11,9,9 419 | 2024,11,10,4 420 | 2024,11,11,10 421 | 2024,11,12,20 422 | 2024,11,13,13 423 | 2024,11,14,12 424 | 2024,11,15,8 425 | 2024,11,16,11 426 | 2024,11,17,11 427 | 2024,11,18,11 428 | 2024,11,19,17 429 | 2024,11,20,12 430 | 2024,11,21,10 431 | 2024,11,22,8 432 | 2024,11,23,10 433 | 2024,11,24,7 434 | 2024,11,25,9 435 | 2024,11,26,11 436 | 2024,11,27,12 437 | 2024,11,28,12 438 | 2024,11,29,11 439 | 2024,11,30,8 440 | 2024,12,1,6 441 | 2024,12,2,16 442 | 2024,12,3,8 443 | 2024,12,4,11 444 | 2024,12,5,9 445 | 2024,12,6,8 446 | 2024,12,7,3 447 | 2024,12,8,6 448 | 2024,12,9,7 449 | 2024,12,10,11 450 | 2024,12,11,14 451 | 2024,12,12,10 452 | 2024,12,13,8 453 | 2024,12,14,9 454 | 2024,12,15,9 455 | 2024,12,16,9 456 | 2024,12,17,13 457 | 2024,12,18,7 458 
| 2024,12,19,8 459 | 2024,12,20,12 460 | 2024,12,21,6 461 | 2024,12,22,5 462 | 2024,12,23,6 463 | 2024,12,24,3 464 | 2024,12,25,8 465 | 2024,12,26,9 466 | 2024,12,27,11 467 | 2024,12,28,5 468 | 2024,12,29,7 469 | 2024,12,30,9 470 | 2024,12,31,9 471 | 2025,1,1,3 472 | 2025,1,2,9 473 | 2025,1,3,15 474 | 2025,1,4,10 475 | 2025,1,5,11 476 | 2025,1,6,6 477 | 2025,1,7,12 478 | 2025,1,8,7 479 | 2025,1,9,10 480 | 2025,1,10,13 481 | 2025,1,11,7 482 | 2025,1,12,10 483 | 2025,1,13,9 484 | 2025,1,14,10 485 | 2025,1,15,13 486 | 2025,1,16,6 487 | 2025,1,17,5 488 | 2025,1,18,9 489 | 2025,1,19,5 490 | 2025,1,20,15 491 | 2025,1,21,7 492 | 2025,1,22,18 493 | 2025,1,23,21 494 | 2025,1,24,13 495 | 2025,1,25,18 496 | 2025,1,26,12 497 | 2025,1,27,15 498 | 2025,1,28,34 499 | 2025,1,29,26 500 | 2025,1,30,20 501 | 2025,1,31,18 502 | 2025,2,1,8 503 | 2025,2,2,12 504 | 2025,2,3,21 505 | 2025,2,4,12 506 | 2025,2,5,20 507 | 2025,2,6,14 508 | 2025,2,7,18 509 | 2025,2,8,10 510 | 2025,2,9,14 511 | 2025,2,10,15 512 | 2025,2,11,7 513 | 2025,2,12,10 514 | 2025,2,13,15 515 | 2025,2,14,14 516 | 2025,2,15,5 517 | 2025,2,16,2 518 | 2025,2,17,10 519 | 2025,2,18,8 520 | 2025,2,19,6 521 | 2025,2,20,9 522 | 2025,2,21,6 523 | 2025,2,22,8 524 | 2025,2,23,8 525 | 2025,2,24,8 526 | 2025,2,25,12 527 | 2025,2,26,8 528 | 2025,2,27,23 529 | 2025,2,28,20 530 | 2025,3,1,17 531 | 2025,3,2,7 532 | 2025,3,3,31 533 | 2025,3,4,28 534 | 2025,3,5,17 535 | 2025,3,6,14 536 | 2025,3,7,10 537 | 2025,3,8,12 538 | 2025,3,9,8 539 | 2025,3,10,11 540 | 2025,3,11,9 541 | 2025,3,12,10 542 | 2025,3,13,13 543 | 2025,3,14,16 544 | 2025,3,15,3 545 | 2025,3,16,12 546 | 2025,3,17,11 547 | 2025,3,18,17 548 | 2025,3,19,14 549 | 2025,3,20,9 550 | 2025,3,21,7 551 | 2025,3,22,5 552 | 2025,3,23,5 553 | 2025,3,24,19 554 | 2025,3,25,16 555 | 2025,3,26,9 556 | 2025,3,27,13 557 | 2025,3,28,10 558 | 2025,3,29,7 559 | 2025,3,30,8 560 | 2025,3,31,12 561 | 2025,4,1,11 562 | 2025,4,2,19 563 | 
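Note on reading the stars-forks CSV files: each row holds a date split across `yr`, `mon`, `d` plus a count column (`forks` or `stars`), and some calendar days have no row at all (for example, opentofu-forks.csv has no row for 2023-10-22). The following is a minimal sketch, not the repository's own loader, of how such a file might be parsed into a continuous daily series. It assumes the values are per-day counts rather than running totals, and that a missing day means zero events; neither is stated in the data itself. The helper names `load_daily_counts` and `fill_missing_days` are hypothetical.

```python
import csv
import io
from datetime import date, timedelta


def load_daily_counts(text, value_column):
    """Parse CSV text with yr,mon,d,<value_column> columns into {date: count}."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        date(int(row["yr"]), int(row["mon"]), int(row["d"])): int(row[value_column])
        for row in reader
    }


def fill_missing_days(counts):
    """Return (date, count) pairs for every day from first to last date.

    Days without a row are given a count of 0. This is an assumption about
    the data, not something the CSV files document.
    """
    if not counts:
        return []
    day, last = min(counts), max(counts)
    filled = []
    while day <= last:
        filled.append((day, counts.get(day, 0)))
        day += timedelta(days=1)
    return filled


# Three rows copied from opentofu-forks.csv; 2023-10-22 is absent in the file.
sample = """yr,mon,d,forks
2023,10,20,4
2023,10,21,3
2023,10,23,2
"""

counts = load_daily_counts(sample, "forks")
series = fill_missing_days(counts)
total = sum(n for _, n in series)
```

To use this on a real file, read its contents with `open(path, newline="").read()` and pass `"forks"` or `"stars"` as the value column. If the gaps turn out to mean "not collected" rather than "zero", replace the default `0` with `None` and handle it downstream.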
-------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/stars-forks/valkey-forks.csv: -------------------------------------------------------------------------------- 1 | yr,mon,d,forks 2 | 2024,3,28,14 3 | 2024,3,29,36 4 | 2024,3,30,37 5 | 2024,3,31,76 6 | 2024,4,1,46 7 | 2024,4,2,32 8 | 2024,4,3,12 9 | 2024,4,4,15 10 | 2024,4,5,8 11 | 2024,4,6,4 12 | 2024,4,7,3 13 | 2024,4,8,7 14 | 2024,4,9,3 15 | 2024,4,10,1 16 | 2024,4,11,2 17 | 2024,4,12,3 18 | 2024,4,13,4 19 | 2024,4,14,5 20 | 2024,4,15,1 21 | 2024,4,16,5 22 | 2024,4,17,11 23 | 2024,4,18,25 24 | 2024,4,19,38 25 | 2024,4,20,8 26 | 2024,4,21,1 27 | 2024,4,22,5 28 | 2024,4,23,4 29 | 2024,4,24,4 30 | 2024,4,25,8 31 | 2024,4,26,6 32 | 2024,4,27,1 33 | 2024,4,28,5 34 | 2024,4,29,2 35 | 2024,4,30,1 36 | 2024,5,2,2 37 | 2024,5,3,2 38 | 2024,5,4,1 39 | 2024,5,5,6 40 | 2024,5,6,3 41 | 2024,5,7,1 42 | 2024,5,8,3 43 | 2024,5,9,7 44 | 2024,5,10,2 45 | 2024,5,11,1 46 | 2024,5,12,1 47 | 2024,5,13,2 48 | 2024,5,14,1 49 | 2024,5,15,1 50 | 2024,5,16,2 51 | 2024,5,17,1 52 | 2024,5,18,2 53 | 2024,5,20,2 54 | 2024,5,21,2 55 | 2024,5,24,1 56 | 2024,5,25,2 57 | 2024,5,26,2 58 | 2024,5,28,2 59 | 2024,6,1,1 60 | 2024,6,2,1 61 | 2024,6,3,1 62 | 2024,6,4,1 63 | 2024,6,5,1 64 | 2024,6,6,3 65 | 2024,6,8,1 66 | 2024,6,9,3 67 | 2024,6,12,2 68 | 2024,6,13,1 69 | 2024,6,14,2 70 | 2024,6,15,1 71 | 2024,6,19,2 72 | 2024,6,20,2 73 | 2024,6,21,1 74 | 2024,6,22,1 75 | 2024,6,23,2 76 | 2024,6,24,4 77 | 2024,6,25,1 78 | 2024,6,26,1 79 | 2024,6,27,1 80 | 2024,6,28,2 81 | 2024,7,1,1 82 | 2024,7,2,1 83 | 2024,7,3,1 84 | 2024,7,11,4 85 | 2024,7,15,1 86 | 2024,7,16,1 87 | 2024,7,18,2 88 | 2024,7,20,1 89 | 2024,7,22,1 90 | 2024,7,23,1 91 | 2024,7,25,1 92 | 2024,7,29,2 93 | 2024,7,30,2 94 | 2024,7,31,1 95 | 2024,8,1,2 96 | 2024,8,2,2 97 | 2024,8,4,2 98 | 2024,8,6,1 99 | 2024,8,7,2 100 | 2024,8,8,1 101 | 2024,8,9,1 102 | 2024,8,10,1 103 | 2024,8,13,1 104 | 2024,8,14,3 105 | 
2024,8,15,2 106 | 2024,8,16,1 107 | 2024,8,17,1 108 | 2024,8,20,1 109 | 2024,8,21,1 110 | 2024,8,25,3 111 | 2024,8,26,2 112 | 2024,8,27,1 113 | 2024,8,28,1 114 | 2024,8,30,2 115 | 2024,8,31,1 116 | 2024,9,2,2 117 | 2024,9,3,1 118 | 2024,9,4,2 119 | 2024,9,5,1 120 | 2024,9,6,1 121 | 2024,9,7,1 122 | 2024,9,8,1 123 | 2024,9,9,1 124 | 2024,9,11,2 125 | 2024,9,12,2 126 | 2024,9,13,2 127 | 2024,9,14,2 128 | 2024,9,16,3 129 | 2024,9,17,6 130 | 2024,9,18,6 131 | 2024,9,19,2 132 | 2024,9,20,6 133 | 2024,9,21,1 134 | 2024,9,23,2 135 | 2024,9,26,3 136 | 2024,9,27,2 137 | 2024,10,1,1 138 | 2024,10,2,1 139 | 2024,10,5,1 140 | 2024,10,7,1 141 | 2024,10,8,3 142 | 2024,10,10,1 143 | 2024,10,11,2 144 | 2024,10,14,2 145 | 2024,10,15,2 146 | 2024,10,16,2 147 | 2024,10,17,1 148 | 2024,10,18,1 149 | 2024,10,21,1 150 | 2024,10,22,1 151 | 2024,10,23,3 152 | 2024,10,24,1 153 | 2024,10,25,2 154 | 2024,10,27,1 155 | 2024,10,28,2 156 | 2024,10,30,1 157 | 2024,11,1,1 158 | 2024,11,2,1 159 | 2024,11,4,2 160 | 2024,11,8,2 161 | 2024,11,10,1 162 | 2024,11,11,2 163 | 2024,11,12,2 164 | 2024,11,13,1 165 | 2024,11,15,1 166 | 2024,11,18,1 167 | 2024,11,21,2 168 | 2024,11,22,1 169 | 2024,11,26,1 170 | 2024,11,27,4 171 | 2024,11,28,1 172 | 2024,11,29,1 173 | 2024,11,30,2 174 | 2024,12,1,2 175 | 2024,12,3,2 176 | 2024,12,4,4 177 | 2024,12,5,1 178 | 2024,12,6,2 179 | 2024,12,8,1 180 | 2024,12,9,1 181 | 2024,12,10,2 182 | 2024,12,11,1 183 | 2024,12,12,2 184 | 2024,12,16,1 185 | 2024,12,17,3 186 | 2024,12,19,1 187 | 2024,12,25,1 188 | 2024,12,27,1 189 | 2024,12,30,1 190 | 2025,1,6,2 191 | 2025,1,9,1 192 | 2025,1,10,1 193 | 2025,1,12,1 194 | 2025,1,13,1 195 | 2025,1,15,2 196 | 2025,1,16,1 197 | 2025,1,17,1 198 | 2025,1,24,1 199 | 2025,1,25,1 200 | 2025,1,29,1 201 | 2025,2,3,1 202 | 2025,2,5,3 203 | 2025,2,6,1 204 | 2025,2,7,3 205 | 2025,2,8,2 206 | 2025,2,11,2 207 | 2025,2,14,3 208 | 2025,2,15,1 209 | 2025,2,17,2 210 | 2025,2,18,1 211 | 2025,2,20,1 212 | 2025,2,21,3 213 | 2025,2,23,1 214 | 2025,2,26,1 215 
| 2025,2,27,2 216 | 2025,2,28,1 217 | 2025,3,1,1 218 | 2025,3,4,4 219 | 2025,3,5,1 220 | 2025,3,7,1 221 | 2025,3,9,1 222 | 2025,3,10,1 223 | 2025,3,12,2 224 | 2025,3,14,1 225 | 2025,3,17,1 226 | 2025,3,21,4 227 | 2025,3,22,1 228 | 2025,3,23,1 229 | 2025,3,24,1 230 | 2025,3,25,1 231 | 2025,3,26,1 232 | 2025,3,28,2 233 | 2025,3,29,2 234 | 2025,3,31,1 235 | -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/stars-forks/valkey-stars.csv: -------------------------------------------------------------------------------- 1 | yr,mon,d,stars 2 | 2024,3,28,518 3 | 2024,3,29,1482 4 | 2024,3,30,1090 5 | 2024,3,31,1083 6 | 2024,4,1,850 7 | 2024,4,2,383 8 | 2024,4,3,245 9 | 2024,4,4,897 10 | 2024,4,5,464 11 | 2024,4,6,306 12 | 2024,4,7,176 13 | 2024,4,8,186 14 | 2024,4,9,417 15 | 2024,4,10,199 16 | 2024,4,11,238 17 | 2024,4,12,277 18 | 2024,4,13,116 19 | 2024,4,14,68 20 | 2024,4,15,78 21 | 2024,4,16,148 22 | 2024,4,17,328 23 | 2024,4,18,593 24 | 2024,4,19,549 25 | 2024,4,20,214 26 | 2024,4,21,95 27 | 2024,4,22,128 28 | 2024,4,23,117 29 | 2024,4,24,105 30 | 2024,4,25,84 31 | 2024,4,26,67 32 | 2024,4,27,67 33 | 2024,4,28,66 34 | 2024,4,29,66 35 | 2024,4,30,68 36 | 2024,5,1,46 37 | 2024,5,2,66 38 | 2024,5,3,62 39 | 2024,5,4,36 40 | 2024,5,5,33 41 | 2024,5,6,35 42 | 2024,5,7,42 43 | 2024,5,8,32 44 | 2024,5,9,38 45 | 2024,5,10,35 46 | 2024,5,11,25 47 | 2024,5,12,23 48 | 2024,5,13,38 49 | 2024,5,14,27 50 | 2024,5,15,32 51 | 2024,5,16,16 52 | 2024,5,17,19 53 | 2024,5,18,14 54 | 2024,5,19,14 55 | 2024,5,20,35 56 | 2024,5,21,19 57 | 2024,5,22,23 58 | 2024,5,23,18 59 | 2024,5,24,25 60 | 2024,5,25,22 61 | 2024,5,26,17 62 | 2024,5,27,35 63 | 2024,5,28,43 64 | 2024,5,29,19 65 | 2024,5,30,23 66 | 2024,5,31,15 67 | 2024,6,1,19 68 | 2024,6,2,11 69 | 2024,6,3,16 70 | 2024,6,4,23 71 | 2024,6,5,29 72 | 2024,6,6,25 73 | 2024,6,7,15 74 | 2024,6,8,11 75 | 2024,6,9,7 76 | 2024,6,10,15 77 | 2024,6,11,21 78 | 2024,6,12,19 
79 | 2024,6,13,17 80 | 2024,6,14,12 81 | 2024,6,15,13 82 | 2024,6,16,6 83 | 2024,6,17,16 84 | 2024,6,18,25 85 | 2024,6,19,22 86 | 2024,6,20,26 87 | 2024,6,21,18 88 | 2024,6,22,16 89 | 2024,6,23,9 90 | 2024,6,24,20 91 | 2024,6,25,13 92 | 2024,6,26,11 93 | 2024,6,27,18 94 | 2024,6,28,16 95 | 2024,6,29,15 96 | 2024,6,30,5 97 | 2024,7,1,15 98 | 2024,7,2,23 99 | 2024,7,3,13 100 | 2024,7,4,17 101 | 2024,7,5,10 102 | 2024,7,6,4 103 | 2024,7,7,21 104 | 2024,7,8,30 105 | 2024,7,9,16 106 | 2024,7,10,14 107 | 2024,7,11,28 108 | 2024,7,12,17 109 | 2024,7,13,9 110 | 2024,7,14,13 111 | 2024,7,15,13 112 | 2024,7,16,12 113 | 2024,7,17,16 114 | 2024,7,18,15 115 | 2024,7,19,8 116 | 2024,7,20,6 117 | 2024,7,21,11 118 | 2024,7,22,18 119 | 2024,7,23,14 120 | 2024,7,24,14 121 | 2024,7,25,7 122 | 2024,7,26,13 123 | 2024,7,27,4 124 | 2024,7,28,9 125 | 2024,7,29,20 126 | 2024,7,30,14 127 | 2024,7,31,17 128 | 2024,8,1,14 129 | 2024,8,2,5 130 | 2024,8,3,14 131 | 2024,8,4,12 132 | 2024,8,5,11 133 | 2024,8,6,9 134 | 2024,8,7,16 135 | 2024,8,8,16 136 | 2024,8,9,10 137 | 2024,8,10,9 138 | 2024,8,11,5 139 | 2024,8,12,9 140 | 2024,8,13,12 141 | 2024,8,14,16 142 | 2024,8,15,14 143 | 2024,8,16,9 144 | 2024,8,17,21 145 | 2024,8,18,18 146 | 2024,8,19,11 147 | 2024,8,20,11 148 | 2024,8,21,9 149 | 2024,8,22,9 150 | 2024,8,23,13 151 | 2024,8,24,5 152 | 2024,8,25,7 153 | 2024,8,26,9 154 | 2024,8,27,29 155 | 2024,8,28,26 156 | 2024,8,29,20 157 | 2024,8,30,18 158 | 2024,8,31,15 159 | 2024,9,1,12 160 | 2024,9,2,28 161 | 2024,9,3,30 162 | 2024,9,4,33 163 | 2024,9,5,26 164 | 2024,9,6,25 165 | 2024,9,7,8 166 | 2024,9,8,11 167 | 2024,9,9,24 168 | 2024,9,10,10 169 | 2024,9,11,14 170 | 2024,9,12,14 171 | 2024,9,13,19 172 | 2024,9,14,38 173 | 2024,9,15,14 174 | 2024,9,16,67 175 | 2024,9,17,83 176 | 2024,9,18,150 177 | 2024,9,19,62 178 | 2024,9,20,44 179 | 2024,9,21,36 180 | 2024,9,22,38 181 | 2024,9,23,44 182 | 2024,9,24,46 183 | 2024,9,25,15 184 | 2024,9,26,33 185 | 2024,9,27,26 186 | 2024,9,28,14 187 | 
2024,9,29,11 188 | 2024,9,30,16 189 | 2024,10,1,10 190 | 2024,10,2,20 191 | 2024,10,3,15 192 | 2024,10,4,21 193 | 2024,10,5,18 194 | 2024,10,6,8 195 | 2024,10,7,16 196 | 2024,10,8,21 197 | 2024,10,9,74 198 | 2024,10,10,37 199 | 2024,10,11,60 200 | 2024,10,12,28 201 | 2024,10,13,8 202 | 2024,10,14,47 203 | 2024,10,15,33 204 | 2024,10,16,29 205 | 2024,10,17,24 206 | 2024,10,18,22 207 | 2024,10,19,16 208 | 2024,10,20,7 209 | 2024,10,21,16 210 | 2024,10,22,27 211 | 2024,10,23,19 212 | 2024,10,24,13 213 | 2024,10,25,22 214 | 2024,10,26,15 215 | 2024,10,27,15 216 | 2024,10,28,15 217 | 2024,10,29,16 218 | 2024,10,30,20 219 | 2024,10,31,18 220 | 2024,11,1,8 221 | 2024,11,2,11 222 | 2024,11,3,8 223 | 2024,11,4,13 224 | 2024,11,5,20 225 | 2024,11,6,18 226 | 2024,11,7,12 227 | 2024,11,8,12 228 | 2024,11,9,13 229 | 2024,11,10,10 230 | 2024,11,11,14 231 | 2024,11,12,15 232 | 2024,11,13,17 233 | 2024,11,14,20 234 | 2024,11,15,13 235 | 2024,11,16,8 236 | 2024,11,17,11 237 | 2024,11,18,7 238 | 2024,11,19,9 239 | 2024,11,20,21 240 | 2024,11,21,27 241 | 2024,11,22,9 242 | 2024,11,23,5 243 | 2024,11,24,9 244 | 2024,11,25,14 245 | 2024,11,26,45 246 | 2024,11,27,66 247 | 2024,11,28,49 248 | 2024,11,29,29 249 | 2024,11,30,27 250 | 2024,12,1,21 251 | 2024,12,2,32 252 | 2024,12,3,46 253 | 2024,12,4,41 254 | 2024,12,5,34 255 | 2024,12,6,26 256 | 2024,12,7,13 257 | 2024,12,8,6 258 | 2024,12,9,17 259 | 2024,12,10,19 260 | 2024,12,11,36 261 | 2024,12,12,30 262 | 2024,12,13,23 263 | 2024,12,14,10 264 | 2024,12,15,18 265 | 2024,12,16,18 266 | 2024,12,17,27 267 | 2024,12,18,11 268 | 2024,12,19,11 269 | 2024,12,20,18 270 | 2024,12,21,7 271 | 2024,12,22,9 272 | 2024,12,23,17 273 | 2024,12,24,10 274 | 2024,12,25,18 275 | 2024,12,26,19 276 | 2024,12,27,12 277 | 2024,12,28,10 278 | 2024,12,29,13 279 | 2024,12,30,11 280 | 2024,12,31,16 281 | 2025,1,1,23 282 | 2025,1,2,21 283 | 2025,1,3,14 284 | 2025,1,4,16 285 | 2025,1,5,10 286 | 2025,1,6,12 287 | 2025,1,7,9 288 | 2025,1,8,8 289 | 2025,1,9,15 290 | 
2025,1,10,8 291 | 2025,1,11,6 292 | 2025,1,12,7 293 | 2025,1,13,20 294 | 2025,1,14,19 295 | 2025,1,15,9 296 | 2025,1,16,15 297 | 2025,1,17,9 298 | 2025,1,18,4 299 | 2025,1,19,7 300 | 2025,1,20,14 301 | 2025,1,21,10 302 | 2025,1,22,12 303 | 2025,1,23,13 304 | 2025,1,24,10 305 | 2025,1,25,16 306 | 2025,1,26,10 307 | 2025,1,27,11 308 | 2025,1,28,19 309 | 2025,1,29,22 310 | 2025,1,30,10 311 | 2025,1,31,9 312 | 2025,2,1,11 313 | 2025,2,2,5 314 | 2025,2,3,10 315 | 2025,2,4,12 316 | 2025,2,5,17 317 | 2025,2,6,13 318 | 2025,2,7,16 319 | 2025,2,8,16 320 | 2025,2,9,9 321 | 2025,2,10,15 322 | 2025,2,11,10 323 | 2025,2,12,12 324 | 2025,2,13,15 325 | 2025,2,14,20 326 | 2025,2,15,15 327 | 2025,2,16,8 328 | 2025,2,17,33 329 | 2025,2,18,15 330 | 2025,2,19,30 331 | 2025,2,20,16 332 | 2025,2,21,43 333 | 2025,2,22,36 334 | 2025,2,23,28 335 | 2025,2,24,23 336 | 2025,2,25,26 337 | 2025,2,26,26 338 | 2025,2,27,22 339 | 2025,2,28,17 340 | 2025,3,1,22 341 | 2025,3,2,15 342 | 2025,3,3,17 343 | 2025,3,4,23 344 | 2025,3,5,19 345 | 2025,3,6,23 346 | 2025,3,7,20 347 | 2025,3,8,12 348 | 2025,3,9,15 349 | 2025,3,10,14 350 | 2025,3,11,14 351 | 2025,3,12,12 352 | 2025,3,13,14 353 | 2025,3,14,19 354 | 2025,3,15,9 355 | 2025,3,16,21 356 | 2025,3,17,31 357 | 2025,3,18,13 358 | 2025,3,19,17 359 | 2025,3,20,17 360 | 2025,3,21,9 361 | 2025,3,22,6 362 | 2025,3,23,14 363 | 2025,3,24,12 364 | 2025,3,25,24 365 | 2025,3,26,32 366 | 2025,3,27,17 367 | 2025,3,28,18 368 | 2025,3,29,15 369 | 2025,3,30,10 370 | 2025,3,31,11 371 | 2025,4,1,29 372 | 2025,4,2,26 373 | -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/terraform2021-08-10T00:00:00.000+00:002023-08-10T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/terraform2021-08-10T00:00:00.000+00:002023-08-10T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/terraform2022-08-10T00:00:00.000+00:002023-08-10T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/terraform2022-08-10T00:00:00.000+00:002023-08-10T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/terraform2023-08-10T00:00:00.000+00:002024-08-10T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/terraform2023-08-10T00:00:00.000+00:002024-08-10T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/terraform_people_2021-08-10T00:00:00.000+00:002023-08-10T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/terraform_people_2021-08-10T00:00:00.000+00:002023-08-10T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/terraform_people_2022-08-10T00:00:00.000+00:002023-08-10T00:00:00.000+00:00.pkl: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/terraform_people_2022-08-10T00:00:00.000+00:002023-08-10T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/terraform_people_2023-08-10T00:00:00.000+00:002024-08-10T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/terraform_people_2023-08-10T00:00:00.000+00:002024-08-10T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/valkey2024-03-28T00:00:00.000+00:002024-08-20T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/valkey2024-03-28T00:00:00.000+00:002024-08-20T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/valkey2024-03-28T00:00:00.000+00:002024-09-28T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/valkey2024-03-28T00:00:00.000+00:002024-09-28T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- 
/dataset/license-changes/fork-case-study/data-files/valkey2024-03-28T00:00:00.000+00:002025-03-28T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/valkey2024-03-28T00:00:00.000+00:002025-03-28T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/valkey_people_2024-03-28T00:00:00.000+00:002024-08-20T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/valkey_people_2024-03-28T00:00:00.000+00:002024-08-20T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/valkey_people_2024-03-28T00:00:00.000+00:002024-09-28T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/valkey_people_2024-03-28T00:00:00.000+00:002024-09-28T00:00:00.000+00:00.pkl -------------------------------------------------------------------------------- /dataset/license-changes/fork-case-study/data-files/valkey_people_2024-03-28T00:00:00.000+00:002025-03-28T00:00:00.000+00:00.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/license-changes/fork-case-study/data-files/valkey_people_2024-03-28T00:00:00.000+00:002025-03-28T00:00:00.000+00:00.pkl 
-------------------------------------------------------------------------------- /dataset/license-changes/forks.csv: -------------------------------------------------------------------------------- 1 | fork_name,fork_date,fork_repo,fork_lic,fork_owner,from_name,from_repo,from_lic,from_owner,category,notes 2 | MariaDB,2009-10-29,https://github.com/MariaDB/server,GPL-2.0,MariaDB Foundation,MySQL,https://github.com/mysql/mysql-server ,GPL-2.0,Oracle,acquisition,Fork was shortly after Oracle acquired Sun. Many of the original MySQL core developers wanted to ensure that the code base would remain open. 3 | LibreOffice,2010-09-28,https://github.com/LibreOffice/core,GPL-3.0,The Document Foundation,OpenOffice,https://svn.apache.org/repos/asf/openoffice,LGPL-2.1,Oracle,acquisition,Fork was shortly after Oracle acquired Sun when Oracle pulled most of the developers off of OpenOffice. 4 | OpenSearch,2021-04-12,https://github.com/opensearch-project/OpenSearch ,Apache-2.0,AWS,Elasticsearch,https://github.com/elastic/elasticsearch/,Apache-2.0,Elastic,relicense,Fork happened when Elastic relicensed to SSPL-1.0 + Elastic License 5 | OpenSearch Dashboard,2021-04-12,https://github.com/opensearch-project/OpenSearch-Dashboards,Apache-2.0,AWS,Kibana,https://github.com/elastic/kibana,Apache-2.0,Elastic,relicense,Fork happened when Elastic relicensed to SSPL-1.0 + Elastic License 6 | Mimir,2022-03-30,https://github.com/grafana/mimir,AGPL-3.0,Grafana,Cortex,https://github.com/cortexproject/cortex,Apache-2.0,CNCF,feature,"Grafana was driving most of the development on Cortex and had features they wanted to add, but not under Apache. They pitched it as a strategic decision, but it was also effectively a relicense. 
" 7 | OpenTofu,2023-09-05,https://github.com/opentofu/opentofu,MPL-2.0,Linux Foundation,Terraform,https://github.com/hashicorp/terraform,MPL-2.0,HashiCorp,relicense,Fork happened when HashiCorp switched from an open source license to the Business Source License (BSL) 8 | OpenBao,2023-12-08,https://github.com/openbao/openbao,MPL-2.0,Linux Foundation,Vault,https://github.com/hashicorp/vault,MPL-2.0,HashiCorp,relicense,Fork happened when HashiCorp switched from an open source license to the Business Source License (BSL) 9 | Valkey,2024-03-28,https://github.com/valkey-io/valkey,BSD-3-clause,Linux Foundation,Redis,https://github.com/redis/redis,BSD-3-clause,Redis,relicense,Forked when Redis moved to be dual-licensed under the RSALv2 and SSPLv1. 10 | egcs,1997-08-15,https://github.com/gcc-mirror/gcc ,GPL,Cygnus,gcc,https://github.com/gcc-mirror/gcc ,GPL-2.0,Free Software Foundation,feature,Community members became frustrated that their changes were not being merged into gcc and forked it into egcs. FSF made egcs the official version of gcc and in July 1999 the two projects were united again. 11 | Adempiere,2006-10-12,https://github.com/adempiere/adempiere,GPL-2.0,ADempiere Community,Compiere,https://www.compiere.com/svn/,GPL-2.0,Compiere,feature,Community members became frustrated that their changes were not being merged into Compiere. Compiere the company failed and Adempiere is still running 12 | Jenkins,2011-02-02,https://github.com/jenkinsci/jenkins,MIT,Linux Foundation CD Foundation,Hudson,,MIT,Oracle,acquisition,Fork was shortly after Oracle acquired Sun. Original creator forked it into Jenkins. Hudson was donated to eclipse in 2012 and final release in 2016. 13 | io.js,2014-12-01,https://github.com/nodejs/node,MIT,Node Forward,Node.js,https://github.com/nodejs/node ,MIT,Joyent,feature,Forked to allow for more community input. 
This fork lasted only 12 months or so before the communities knitted the forks back together again under the newly formed Node.js Foundation 14 | NextCloud,2016-04-22,https://github.com/nextcloud/server ,AGPL-3.0,Nextcloud GmbH,ownCloud,https://github.com/owncloud/core,AGPL-3.0,ownCloud,feature,"NextCloud was forked from ownCloud by its founder Frank Karlitschek due to concerns about management and priorities between growth, money, and sustainability." 15 | -------------------------------------------------------------------------------- /dataset/license-changes/generate-license-data.py: -------------------------------------------------------------------------------- 1 | #SPDX-License-Identifier: MIT 2 | 3 | # Note: License file must be in the same org/repo 4 | 5 | def setup_validate(): 6 | import argparse 7 | import os 8 | import csv 9 | import sys 10 | 11 | # Gather options from command line arguments and store them in variables 12 | 13 | parser = argparse.ArgumentParser() 14 | 15 | parser.add_argument("-c", "--configfile", required=False, dest="input_data_file", default="inputdata.csv", help="The full path to a csv input file; see the inputdata.csv example in this repo (defaults to inputdata.csv)") 16 | parser.add_argument("-t", "--tokenvar", required=False, dest="token_var", default="GITHUB_AUTH_TOKEN", help="The name of the environment variable where your GitHub personal access token can be found (defaults to GITHUB_AUTH_TOKEN)") 17 | 18 | args = parser.parse_args() 19 | input_data_file = args.input_data_file 20 | token_var = args.token_var 21 | 22 | # Read GitHub personal access token from the environment variable named by --tokenvar (GITHUB_AUTH_TOKEN by default) 23 | 24 | try: 25 | gh_key = os.environ[token_var] 26 | 27 | except KeyError: 28 | print("You must have an environment variable with a GitHub personal access token to run this script") 29 | print("For Linux / MacOS: export GITHUB_AUTH_TOKEN=") 30 | print("For Windows: set GITHUB_AUTH_TOKEN=") 31 | print("Exiting ...") 32 | sys.exit(1) 33 | 34 | # 
Read input file with some minimal data validation and store data in a list 35 | 36 | try: 37 | input_data = [] 38 | 39 | with open(input_data_file, 'r') as f: 40 | data = csv.reader(f) 41 | next(data, None) # skip header line 42 | for row in data: 43 | # Skip blank lines 44 | if len(row) != 0: 45 | input_data.append(row) 46 | 47 | # Make sure the csv file has 7 values per row before continuing 48 | if len(row) != 7: 49 | print("Data errors detected in row containing:", row) 50 | print("Each line in the csv must contain 7 values.") 51 | sys.exit(1) 52 | 53 | # Check for empty file and exit 54 | if len(input_data) == 0: 55 | print(input_data_file, "appears to be empty.") 56 | sys.exit(1) 57 | 58 | print("Reading input data from", input_data_file) 59 | 60 | return input_data, gh_key 61 | 62 | except Exception: # a bare except would also swallow the sys.exit(1) calls above 63 | print("Can't read file", input_data_file, "Exiting ...") 64 | sys.exit(1) 65 | 66 | 67 | 68 | def make_query(): 69 | return """query licenseData($org: String!, $repo: String!, $lic_file: String!){ 70 | repository(owner: $org, name: $repo){ 71 | defaultBranchRef { 72 | target { 73 | ... 
on Commit { 74 | history(path: $lic_file, first: 100) { 75 | nodes { 76 | committedDate 77 | url 78 | additions 79 | deletions 80 | message 81 | } 82 | } 83 | } 84 | } 85 | } 86 | url 87 | nameWithOwner 88 | } 89 | } 90 | """ 91 | 92 | def get_license_data(): 93 | 94 | import requests 95 | import json 96 | import sys 97 | 98 | input_data, api_token = setup_validate() 99 | #print(input_data, api_token) 100 | 101 | url = 'https://api.github.com/graphql' 102 | headers = {'Authorization': 'token %s' % api_token} 103 | 104 | json_list = [] 105 | 106 | for row in input_data: 107 | project = row[0] 108 | relicense_date = row[1] 109 | orig_lic = row[2] 110 | new_lic = row[3] 111 | org = row[4] 112 | repo = row[5] 113 | lic_file = row[6] 114 | 115 | try: 116 | query = make_query() 117 | 118 | variables = {"org": org, "repo": repo, "lic_file": lic_file} 119 | #variables = {"org": org, "repo": repo} 120 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 121 | json_data = json.loads(r.text)['data'] 122 | 123 | # Add data from csv file 124 | json_data['repository']['project'] = project 125 | json_data['repository']['relicense_date'] = relicense_date 126 | json_data['repository']['orig_lic'] = orig_lic 127 | json_data['repository']['new_lic'] = new_lic 128 | 129 | json_list.append(json_data) 130 | 131 | except: 132 | print("Unknown error: Could not get data from GitHub") 133 | sys.exit(1) 134 | 135 | return json_list 136 | 137 | import json 138 | 139 | json_list = get_license_data() 140 | print(json_list) 141 | 142 | json_obj = json.dumps(json_list, indent=4) 143 | 144 | with open('output.json', 'w') as output: 145 | output.write(json_obj) 146 | -------------------------------------------------------------------------------- /dataset/license-changes/inputdata.csv: -------------------------------------------------------------------------------- 1 | This file has moved to 
https://github.com/chaoss/wg-data-science/blob/main/dataset/license-changes/license_changes.csv 2 | -------------------------------------------------------------------------------- /dataset/license-changes/license_changes.csv: -------------------------------------------------------------------------------- 1 | project,relicense_date,orig_license,new_license,org,repo,license_file 2 | Airbyte,2021-09-27,MIT,Elastic License v2.0,airbytehq,airbyte,LICENSE 3 | Akka,2022-09-07,Apache-2.0,Business Source License,akka,akka,LICENSE 4 | ArangoDB,2024-02-23,Apache-2.0,Business Source License,arangodb,arangodb,LICENSE 5 | Aseprite,2016-08-29,GPL-2.0,EULA that permits personal use but forbids redistribution,aseprite,aseprite,EULA.txt 6 | CockroachDB,2019-06-04,Apache-2.0,Business Source License,cockroachdb,cockroach,LICENSE 7 | Confluent,2018-12-14,Apache-2.0,Confluent Community License Agreement,confluentinc,ksql,LICENSE 8 | Consul,2023-08-10,MPL-2.0,Business Source License,hashicorp,consul,LICENSE 9 | Elasticsearch,2021-02-03,Apache-2.0,"Elastic License and Server Side Public License",elastic,elasticsearch,LICENSE.txt 10 | Elasticsearch,2021-02-03,Apache-2.0,"Elastic License and Server Side Public License",elastic,kibana,LICENSE.txt 11 | LiveCode,2021-08-31,GPL-3.0-only,project archived,livecode,livecode,LICENSE 12 | MongoDB,2018-10-16,AGPL-3.0-only,Server Side Public License,mongodb,mongo,LICENSE-Community.txt 13 | OTRS,2021-01-27,GPL-3.0-or-later,project archived,OTRS,otrs,COPYING 14 | Redis,2024-03-20,BSD-3-Clause,dual: custom license and Server Side Public License,redis,redis,LICENSE.txt 15 | Sentry,2019-11-06,BSD-3-Clause,Business Source License,getsentry,sentry,LICENSE.md 16 | Sourcegraph,2018-10-27,Apache-2.0,Sourcegraph Enterprise license,sourcegraph,sourcegraph,LICENSE 17 | Terraform,2023-08-10,MPL-2.0,Business Source License,hashicorp,terraform,LICENSE 18 | Timescale,2018-12-29,Apache-2.0,Timescale License,timescale,timescaledb,LICENSE 19 | 
Vagrant,2023-08-10,MIT,Business Source License,hashicorp,vagrant,LICENSE 20 | Vault,2023-08-10,MPL-2.0,Business Source License,hashicorp,vault,LICENSE 21 | -------------------------------------------------------------------------------- /dataset/license-changes/more_forks.csv: -------------------------------------------------------------------------------- 1 | fork_name,fork_date,fork_repo,fork_lic,fork_owner,from_name,from_repo,from_lic,from_owner,category,notes,Type,source_citation,additional_notes 2 | Flock,2024-10-27,https://github.com/join-the-flock/flock,BSD-3-clause,,Flutter,https://github.com/flutter/flutter,BSD-3-clause,Google,feature,Community fork for faster iteration speed. Probably friendly.,clone fork,https://getflocked.dev/blog/posts/we-are-forking-flutter-this-is-why/,friendly 3 | Fossify Calendar,2023-12-03,https://github.com/FossifyOrg/Calendar,GPL-3.0,,Simple-Calendar,https://github.com/SimpleMobileTools/Simple-Calendar,GPL-3.0,ZipoApps,acquisition,"SimpleMobileTools was sold to ZipoApps, the community forked it",clone fork,https://github.com/SimpleMobileTools/General-Discussion/issues/241#issuecomment-1837102917,"hostile, original got bought by ad tech company" 4 | Fossify File Manager,2023-12-03,https://github.com/FossifyOrg/File-Manager,GPL-3.0,,Simple-File-Manager,https://github.com/SimpleMobileTools/Simple-File-Manager,GPL-3.0,ZipoApps,acquisition,"SimpleMobileTools was sold to ZipoApps, the community forked it",clone fork,https://github.com/SimpleMobileTools/General-Discussion/issues/241#issuecomment-1837102917,"hostile, original got bought by ad tech company" 5 | Fossify Gallery,2023-12-03,https://github.com/FossifyOrg/Gallery,GPL-3.0,,Simple-Gallery,https://github.com/SimpleMobileTools/Simple-Gallery,GPL-3.0,ZipoApps,acquisition,"SimpleMobileTools was sold to ZipoApps, the community forked it",clone fork,https://github.com/SimpleMobileTools/General-Discussion/issues/241#issuecomment-1837102917,"hostile, original got bought by ad tech 
company" 6 | Fossify Music Player,2023-12-03,https://github.com/FossifyOrg/Music-Player,GPL-3.0,,Simple-Music-Player,https://github.com/SimpleMobileTools/Simple-Music-Player,GPL-3.0,ZipoApps,acquisition,"SimpleMobileTools was sold to ZipoApps, the community forked it",clone fork,https://github.com/SimpleMobileTools/General-Discussion/issues/241#issuecomment-1837102917,"hostile, original got bought by ad tech company" 7 | Fossify Contacts,2023-12-03,https://github.com/FossifyOrg/Contacts,GPL-3.0,,Simple-Contacts,https://github.com/SimpleMobileTools/Simple-Contacts,GPL-3.0,ZipoApps,acquisition,"SimpleMobileTools was sold to ZipoApps, the community forked it",clone fork,https://github.com/SimpleMobileTools/General-Discussion/issues/241#issuecomment-1837102917,"hostile, original got bought by ad tech company" 8 | Fossify Notes,2023-12-03,https://github.com/FossifyOrg/Notes,GPL-3.0,,Simple-Notes,https://github.com/SimpleMobileTools/Simple-Notes,GPL-3.0,ZipoApps,acquisition,"SimpleMobileTools was sold to ZipoApps, the community forked it",clone fork,https://github.com/SimpleMobileTools/General-Discussion/issues/241#issuecomment-1837102917,"hostile, original got bought by ad tech company" 9 | Fossify Messages,2023-12-03,https://github.com/FossifyOrg/Messages,GPL-3.0,,Simple-SMS-Messenger,https://github.com/SimpleMobileTools/Simple-SMS-Messenger,GPL-3.0,ZipoApps,acquisition,"SimpleMobileTools was sold to ZipoApps, the community forked it",clone fork,https://github.com/SimpleMobileTools/General-Discussion/issues/241#issuecomment-1837102917,"hostile, original got bought by ad tech company" 10 | Fossify Phone,2023-12-03,https://github.com/FossifyOrg/Phone,GPL-3.0,,Simple-Dialer,https://github.com/SimpleMobileTools/Simple-Dialer,GPL-3.0,ZipoApps,acquisition,"SimpleMobileTools was sold to ZipoApps, the community forked it",clone fork,https://github.com/SimpleMobileTools/General-Discussion/issues/241#issuecomment-1837102917,"hostile, original got bought by ad tech company" 11 | 
Fossify Keyboard,2023-12-03,https://github.com/FossifyOrg/Keyboard,GPL-3.0,,Simple-Keyboard,https://github.com/SimpleMobileTools/Simple-Keyboard,GPL-3.0,ZipoApps,acquisition,"SimpleMobileTools was sold to ZipoApps, the community forked it",clone fork,https://github.com/SimpleMobileTools/General-Discussion/issues/241#issuecomment-1837102917,"hostile, original got bought by ad tech company" 12 | QUIK,,https://github.com/octoshrimpy/quik,GPL-3.0,,QKSMS,https://github.com/moezbhatti/qksms,GPL-3.0,,feature,QKSMS was unmaintained and got forked,clone fork,, 13 | HeliBoard,,https://github.com/Helium314/HeliBoard,GPL-3.0 and Apache 2,,OpenBoard,https://github.com/openboard-team/openboard,GPL-3.0,Openboard Team,feature,OpenBoard was unmaintained,clone fork,, 14 | Tenacity,,https://codeberg.org/tenacityteam/tenacity,GPL-2.0,,Audacity,https://github.com/audacity/audacity,GPL-3.0,,feature,,,, 15 | -------------------------------------------------------------------------------- /dataset/license-changes/wikipedia_list.csv: -------------------------------------------------------------------------------- 1 | Akka,2009,2022,Apache-2.0,Business Source License 2 | ArangoDB,2011,2023,Apache-2.0,Business Source License 3 | Aseprite,2001,2016,GPL-2.0,EULA that permits personal use but forbids redistribution 4 | CockroachDB,2015,2019,Apache-2.0,Business Source License 5 | Consul,2014,2023,MPL-2.0,Business Source License 6 | Couchbase Server,2010,2021,Apache-2.0,Business Source License 7 | Couchbase Mobile,,2022,Apache-2.0,Business Source License 8 | Elasticsearch,2010,2021,Apache-2.0,"Elastic License and Server Side Public License" 9 | Emby,2014,2018,GPL-2.0,"Source code closed on December 8, 2018" 10 | FBReader,2013,2015,GPL-2.0-or-later,"Apparently the number of devs was limited, and they all agreed to relicense it" 11 | LiveCode,2013,2021,GPL-3.0-only,proprietary 12 | LiveJournal,1999,2014,GPL-2.0-or-later,The source code was made private in 2014 13 | 
MongoDB,2009,2018,AGPL-3.0-only,Server Side Public License 14 | Nexuiz,2005,2012,GPL-2.0-or-later,"Game abandoned in favour of a commercial video game of the same name, which licensed the Nexuiz title but is not based on its engine." 15 | OctoberCMS,2014,2021,MIT,Cited the sustainability of its open source model as a factor. 16 | OTRS,2001,2020,GPL-3.0-or-later,"Support for the Community Edition dropped on December 23, 2020" 17 | Paint.NET,2004,2007,MIT,freeware license that prohibits modification or resale 18 | PyMOL,2000,2010,MIT-CMU 19 | Reddit,2008,2017,CPAL-1.0,"Source code was made private in 2017, as the internal codebase had already diverged significantly from the public one." 20 | Redis,2009,2024,BSD-3-Clause,dual: custom license and Server Side Public License 21 | Sourcegraph,2013,2023,Apache-2.0,proprietary 22 | Terraform,2014,2023,MPL-2.0,Business Source License 23 | Tux Racer,2000,2002,GPL-2.0-or-later,"Commercial expansion by original authors, also called Tux Racer." 24 | Vagrant,2010,2023,MIT,Business Source License 25 | -------------------------------------------------------------------------------- /dataset/releases/fork-relicense-jais-2025-04.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/releases/fork-relicense-jais-2025-04.tar.gz -------------------------------------------------------------------------------- /dataset/taxonomies/FOSDEM_ Do we need another open source software taxonomy_ (1).pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/dataset/taxonomies/FOSDEM_ Do we need another open source software taxonomy_ (1).pdf -------------------------------------------------------------------------------- /dataset/taxonomies/README.md: 
-------------------------------------------------------------------------------- 1 | # Open Source Taxonomies 2 | 3 | This repository is an experiment designed to collect existing taxonomies about open source projects as well as taxonomies used or created within research about open source ecosystems. The motivation and goals of this effort were discussed in a talk at [FOSDEM 2025](https://github.com/chaoss/wg-data-science/blob/main/dataset/taxonomies/FOSDEM_%20Do%20we%20need%20another%20open%20source%20software%20taxonomy_%20(1).pdf). 4 | 5 | Because all submissions will be treated as public data, we will not accept submissions that contain sensitive data types (user data, personally identifiable data, non-public business data, health data, etc.). 6 | 7 | ## Template 8 | 9 | _Data Card_ 10 | 11 | | | Submission | 12 | |:-----|---------------| 13 | |**Title:**| | 14 | |**Submitter:**|(GitHub handle)| 15 | |**Author(s):**|(GitHub handle? email?)| 16 | |**Author affiliation(s):**|(academic institution(s), company, funder, publisher, unknown, etc.)| 17 | |**Description:**|(What was this used for? E.g. survey, analysis, publication, etc. How was this taxonomy created? E.g. human, model, ML/AI, etc.)| 18 | |**Data source(s) / data type(s):**|(Was this taxonomy created from a specific data source or for a specific purpose? E.g. GitHub event labels, internal/non-public logs, survey question, etc.)| 19 | |**Suggested tags for this taxonomy:**|(plain text separated by commas)| 20 | |**Link(s) to public resources that use this taxonomy:**|(e.g. 
published articles, webpage, blog, etc)| 21 | |**License(s):**|(if applicable)| 22 | 23 | _Taxonomy_ 24 | (Preferred format is a plaintext bulleted table or list) 25 | 26 | -------------------------------------------------------------------------------- /dataset/taxonomies/Taxonomies - repostatus.csv: -------------------------------------------------------------------------------- 1 | Data card, 2 | ,Submission 3 | Title:,repostatus 4 | Submitter:,sophia-iv 5 | Author(s):,jantman 6 | Author affiliation(s):,unknown 7 | Description:,project status taxonomy for GitHub repositories 8 | Data source(s) / data type(s):,GitHub repository badges and labels 9 | Suggested tags for this taxonomy: ,"project status, project lifecycle" 10 | Link(s) to public resources that use this taxonomy: ,https://www.repostatus.org/ 11 | License(s): ,CC-BY-SA 4.0 license 12 | , 13 | Taxonomy, 14 | Project Status, 15 | , 16 | Concept,"Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept." 17 | WIP,"Initial development is in progress, but there has not yet been a stable, usable release suitable for the public." 18 | Suspended,"Initial development has started, but there has not yet been a stable, usable release; work has been stopped for the time being but the author(s) intend on resuming work." 19 | Abandoned,"Initial development has started, but there has not yet been a stable, usable release; the project has been abandoned and the author(s) do not intend on continuing development." 20 | Active,"The project has reached a stable, usable state and is being actively developed." 21 | Inactive,"The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows." 22 | Unsupported,"The project has reached a stable, usable state but the author(s) have ceased all work on it. A new maintainer may be desired." 
23 | Moved,"The project has been moved to a new location, and the version at that location should be considered authoritative. This status should be accompanied by a new URL." -------------------------------------------------------------------------------- /events/hackathon-june-2025.md: -------------------------------------------------------------------------------- 1 | # CHAOSS Data Science Hackathon June 2025 2 | 3 | ## Logistics 4 | Date: June 26, 2025 5 | 6 | This event will be co-located with the [Open Source Summit North America](https://events.linuxfoundation.org/open-source-summit-north-america/) and will be held the morning before [CHAOSScon](https://chaoss.community/chaosscon-2025-na/). 7 | 8 | The details and registration for this event can now be found on the [CHAOSS Website](https://chaoss.community/chaoss-data-science-hackathon-2025/). 9 | -------------------------------------------------------------------------------- /practitioner-guides/README.md: -------------------------------------------------------------------------------- 1 | # Practitioner Guides 2 | 3 | **Read the [Practitioner Guides](https://chaoss.community/about-chaoss-practitioner-guides/)** 4 | 5 | Practitioner Guides are designed to be used by practitioners. The goal is to help people interpret the data about an open source project and develop insights that can improve that project's health. 6 | 7 | The Getting Started series of guides is designed for people who may not be experts in data analysis or open source. 8 | 9 | ## Audience for the Guides 10 | 11 | Open Source Program Offices (OSPOs), project leads, community managers, maintainers, and anyone who wants to better understand project health and take action on what they learn from their metrics. 12 | 13 | Note that while these guides are being developed within the Data Science WG, we are not the audience for these guides. 
The guides should be written with the above audience of practitioners in mind. 14 | 15 | ## Contributions 16 | 17 | We welcome contributions! If you'd like to work on a guide or propose a new guide, you can browse the [Practitioner Guide issues](https://github.com/chaoss/wg-data-science/issues?q=is%3Aissue+is%3Aopen+label%3A%22practitioner+guide%22) in the data science WG repo. We have some issues for guides that we know we want but that have not yet been started, and you can create a new issue using our template to propose a new Practitioner Guide. 18 | 19 | A few things to keep in mind when contributing to these guides: 20 | * We have Google Docs templates that you should use when creating a new guide to make it easier for others to review and collaborate with you. We prefer that this document be owned by the chaossproject@gmail.com account. You should link to your document from the GitHub issue. 21 | * [Getting Started Template](https://docs.google.com/document/d/1xe3KkkoBHcn9tyFaMTQ0TIfzZGkJPzBvrravp8Cdtl4/edit?usp=sharing) 22 | * [Template for more advanced guides](https://docs.google.com/document/d/1GLPUdjWA6rz1chFl0psKmtb8uniDeuGyg9UWTDgjIAI/edit?usp=sharing) 23 | * Each Getting Started Practitioner Guide should contain no more than 4 metrics. Pick the 2-4 metrics that best represent the breadth of the topic and list additional metrics in Step 3. Advanced guides have no limit on the number of metrics. 24 | * Make sure that all links to metrics and metrics models use the permanent, non-changing link in the form of `https://chaoss.community/?p=####`. These links can be found near the end of every published metric. 25 | * Please read the [Practitioner Guide: Introduction](introduction.md) before starting and link to it rather than duplicating where possible. Note that some duplication is likely required for clarity in the individual guides. 
26 | * To see an example of how this template looks when completed, see the [Published Guides](https://chaoss.community/about-chaoss-practitioner-guides/). 27 | 28 | -------------------------------------------------------------------------------- /practitioner-guides/images/active-contrib-over-time-bar-trend.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/active-contrib-over-time-bar-trend.png -------------------------------------------------------------------------------- /practitioner-guides/images/active-organizations-over-time-by-data-source.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/active-organizations-over-time-by-data-source.png -------------------------------------------------------------------------------- /practitioner-guides/images/bus-factor-bar-balanced.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/bus-factor-bar-balanced.png -------------------------------------------------------------------------------- /practitioner-guides/images/bus-factor-pie-one-person.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/bus-factor-pie-one-person.png -------------------------------------------------------------------------------- /practitioner-guides/images/change_requests_abandoned.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/change_requests_abandoned.png -------------------------------------------------------------------------------- /practitioner-guides/images/closure-ratio-falling-behind.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/closure-ratio-falling-behind.png -------------------------------------------------------------------------------- /practitioner-guides/images/closure-ratio-summer-gap.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/closure-ratio-summer-gap.png -------------------------------------------------------------------------------- /practitioner-guides/images/commit-activity-by-domain-unclean.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/commit-activity-by-domain-unclean.png -------------------------------------------------------------------------------- /practitioner-guides/images/commit-activity-by-domain-vmw.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/commit-activity-by-domain-vmw.png -------------------------------------------------------------------------------- /practitioner-guides/images/contrib-by-data-source.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/contrib-by-data-source.png -------------------------------------------------------------------------------- /practitioner-guides/images/contributor-growth-by-engagement-bar.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/contributor-growth-by-engagement-bar.png -------------------------------------------------------------------------------- /practitioner-guides/images/issues_abandoned.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/issues_abandoned.png -------------------------------------------------------------------------------- /practitioner-guides/images/lead-time.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/lead-time.png -------------------------------------------------------------------------------- /practitioner-guides/images/leadership-positions-istio-before-cncf-april-2022.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/leadership-positions-istio-before-cncf-april-2022.png -------------------------------------------------------------------------------- /practitioner-guides/images/leadership-positions-istio-graduating-june-2023.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/leadership-positions-istio-graduating-june-2023.png -------------------------------------------------------------------------------- /practitioner-guides/images/libyears.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/libyears.png -------------------------------------------------------------------------------- /practitioner-guides/images/ossf-badge-categories.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/ossf-badge-categories.png -------------------------------------------------------------------------------- /practitioner-guides/images/ossf-badge-criteria-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/ossf-badge-criteria-example.png -------------------------------------------------------------------------------- /practitioner-guides/images/ossf-badge-curl.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/ossf-badge-curl.png -------------------------------------------------------------------------------- /practitioner-guides/images/releases.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/releases.png 
-------------------------------------------------------------------------------- /practitioner-guides/images/time-to-close.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/time-to-close.png -------------------------------------------------------------------------------- /practitioner-guides/images/time-to-first-response.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/practitioner-guides/images/time-to-first-response.png -------------------------------------------------------------------------------- /practitioner-guides/introduction.md: -------------------------------------------------------------------------------- 1 | # Practitioner Guide: Introduction - Things to Think about When Interpreting Metrics 2 | 3 | * Related Metrics: [All](https://chaoss.community/kb-metrics-and-metrics-models/) 4 | * Audience: Practitioner Guides are designed to be used by practitioners who may not be experts in data analysis and who want to better understand how to interpret the data about an open source project to develop insights that can help them improve the project health of an open source project. These guides will be especially useful for Open Source Program Offices (OSPOs), project leads, community managers, maintainers, and anyone who wants to better understand project health and take action on what they learn from their metrics. 5 | 6 | Measuring project health is complex with so many potential aspects to consider (Linåker et al. 2022). The Practitioner Guide series is designed to break project health down into a series of logical topics that you can use to assess and improve the health of your open source projects. 
This Introduction Guide is designed to get you thinking about what you might want to measure and how to measure it along with some general tips and cautions. It is meant to complement the Practitioner Guide series, which is where you will find the details about how to draw insights about specific topics, like Responsiveness, Contributor Sustainability, Organizational Participation, and more. 7 | 8 | There is no one size fits all approach to using metrics to measure project health. Every open source project is a little different, and metrics should always be interpreted with the needs of that project and its context taken into account (Goggins et al. 2021). Small projects will have different needs than large projects. An open source operating system project will have very different characteristics than a project that produces a small package or library. Different communities will have different ways of working to produce their open source software projects. Projects have different methods of publishing releases. Projects and the people who contribute to them will have different needs and goals. 9 | 10 | One of the best places to start isn’t actually with the metrics, but by spending some time understanding the overall goals for the project. If the project is primarily driven by one organization or owned by an organization, you should also consider the goals for that organization. By thinking strategically about the overall goals, you’ll be in a better place to decide what you need to measure to determine whether the project is achieving its goals. Open source projects generate a tsunami of data that can be overwhelming, but by focusing on the goals, you can develop a [metrics strategy](https://community.linuxfoundation.org/events/details/lfhq-todo-group-ospology-presents-ways-to-define-a-metrics-strategy-in-your-ospo/) that helps you focus on the metrics that matter most for a particular project. 
11 | 12 | All of this and more will have an impact on the interpretation of any open source metrics. The real experts are the people who are involved in the day to day work on a project. In addition to focusing on the goals, you might also need to spend some time looking at trends related to who participates in the community and how they participate to get an overall feel for the project and who you might want to reach out to for more details. You need to involve key people from the project / community that you are measuring, since they can help you ethically interpret the metrics and any trends identified in ways that make the most sense for that particular project (Casari et al. 2023) as described in more detail in the "Step 2: Diagnosis" section. If you haven’t already read [Beyond the Repository](https://cacm.acm.org/magazines/2023/10/276630-beyond-the-repository/fulltext) by Amanda Casari, Julia Ferraioli, and Juniper Lovato, I recommend pausing and reading this 6-page article now. 13 | 14 | Within the CHAOSS project, we have [software (Augur and GrimoireLab)](https://chaoss.community/software/) that can be used to gather data and identify trends in a neutral way that can be easily audited and tracked over time. However, these Practitioner Guides don’t assume that you are using any particular piece of software, since you may have other metrics tools that work for your particular situation. Regardless of how you are gathering metrics, these guides help you interpret those metrics to draw meaningful and actionable insights to help improve project health. 15 | 16 | # Step 1: Identify Trends 17 | 18 | Metrics for open source projects can be noisy, with many data points generated by the many activities within a project. One way to cut through this noise is to focus on the trends over time. 
Rather than looking at what happened yesterday or last week, it can help to start by aggregating your data by month and looking at whether some aspect of your community is improving, staying steady, or declining over the past 3 to 6 months. You can drill down later into the data for a specific day or week to help understand what you see. By looking at the big-picture trends, you can avoid over-correcting or worrying too much about day to day fluctuations. 19 | 20 | # Step 2: Diagnosis 21 | 22 | The first action of diagnosing problems or identifying opportunities for improvement is to talk to the people intimately involved in the project. Show them the data and ask what might be causing the issues. Project leaders and community members might not know what is causing the problems, so each guide should have some tips for exploring areas and potential ideas for where to look and how to diagnose specific issues or find opportunities to improve. 23 | 24 | When deciding if something is a problem or concern that needs to be addressed, the first question is whether the issue might be a temporary fluctuation instead of a real problem. What else is happening in your community, project, and ecosystem? Was there a big conference, major release, vacation season, or other things that impacted people's time to make contributions? It can help to overlay these types of milestones on a graph to understand their impact better, and if it looks like there is an impact, wait a month or two and see if the metric(s) rebound after the temporary disruption. As mentioned earlier, this is why it’s so important to have people involved in the project daily, helping to interpret what you are seeing in the metrics. 25 | 26 | An excellent example of a temporary fluctuation is when there are downward trends in July / August and December / January if you have a lot of contributors in places where these are vacation or holiday seasons. 
A downward trend shows that people are taking some time off to rest and recharge, which is likely a positive sign for the long-term sustainability of your project, instead of a problem. 27 | 28 | If you’ve decided that the problem will likely be ongoing and not temporary, then it’s time to start thinking about what might be causing the issue. This will likely be metric-specific and will be covered in more detail in the Practitioner Guides for specific topics. 29 | 30 | # Step 3: Gather Additional Data if Needed 31 | 32 | At this point, if you know what you need to improve and how to improve it, you can skip this step for now. You can always return to it if you make changes but don’t see any improvements over the next few months. 33 | 34 | In other cases, you should look into an area in more detail before deciding what improvement actions to make. The Practitioner Guides for specific topics will include additional metrics that can be used to gather additional data to diagnose specific problems. 35 | 36 | # Step 4: Make Improvements 37 | 38 | It is important for this step to have buy-in from the community and project leadership before you start taking action toward making improvements. Not having support from the project could lead to changes being ineffective, disruptive, or even damaging for the project and the people contributing to it. 39 | 40 | Open source projects, communities, and ecosystems are complex; changes you make in one area might impact other parts of the project. Many people working on open source projects are likely to be busy with little time for additional work, so it’s important not to overload people to the point of burnout. For these reasons, it is usually best to focus on no more than 2 or 3 improvement actions to make at one time. 41 | 42 | Like with the other steps, the Practitioner Guides for specific topics will include more details on how to make improvements for that topic. 
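
The steps above all rely on a month-level view of a metric rather than day-to-day counts. A minimal sketch of that idea follows, assuming you already have a list of event dates (for example, when change requests were opened) exported from whatever tool you use. The function names, the 3-month window, and the 10% tolerance are illustrative assumptions, not part of any CHAOSS tool.

```python
from collections import Counter
from datetime import date


def monthly_counts(event_dates):
    """Count events per calendar month, keyed by (year, month)."""
    return Counter((d.year, d.month) for d in event_dates)


def month_range(end_year, end_month, n_months):
    """Return n_months (year, month) keys ending at the given month, oldest first."""
    months = []
    year, month = end_year, end_month
    for _ in range(n_months):
        months.append((year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return list(reversed(months))


def classify_trend(counts, end_year, end_month, window=3, tolerance=0.10):
    """Compare the average of the latest `window` months with the `window` before it.

    The window and tolerance are assumed defaults; tune them for your project.
    """
    months = month_range(end_year, end_month, 2 * window)
    earlier_avg = sum(counts[m] for m in months[:window]) / window
    recent_avg = sum(counts[m] for m in months[window:]) / window
    if earlier_avg == 0:
        return "steady" if recent_avg == 0 else "increasing"
    change = (recent_avg - earlier_avg) / earlier_avg
    if change > tolerance:
        return "increasing"
    if change < -tolerance:
        return "decreasing"
    return "steady"
```

A result of "decreasing" is only a prompt to start the Step 2 conversation with the people involved in the project. It cannot tell a vacation season apart from a real problem.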
 43 | 44 | # Step 5: Monitor Results 45 | 46 | An important step toward knowing whether your actions to improve a topic have been effective is to continue measuring and then monitor those results. At a minimum, you’ll probably want to monitor them for 2 or 3 months (more for complex changes) before deciding whether your actions are starting to be effective. Remember that if anything happens that might cause temporary fluctuations, you’ll want to increase that timeframe. 47 | 48 | You should also continue to monitor them over the long term to see if your improvements continue to have an impact. A frequent pattern is that improvements tend to continue while people are focused on them but then can backslide if people fall into old patterns and stop making improvements. You might find yourself cycling through these steps to renew people’s interest and continue making improvements. 49 | 50 | # Cautions and Considerations 51 | 52 | * When interpreting metrics and making improvements in your open source project, you should always think first about the people involved in your project and how these changes might impact them (positively and negatively). 53 | * Always have the people working on the project involved in gathering and interpreting the metrics and in any potential improvement actions you might make. 54 | * Every project is a little different, so it’s essential to interpret the metrics in light of a project’s individual needs and ways of operating. 55 | * Avoid using metrics to compare projects against each other when possible, but if you need to compare projects, make sure that you are only comparing projects with similar characteristics. To give just a few of the many examples: JavaScript projects will have very different characteristics and patterns than C / C++ projects; foundation-owned projects will be different from projects driven out of corporations; and a project the size of Kubernetes will be nothing like a project producing a small library or package.
56 | * Be careful never to set yourself up for people to weaponize your metrics, and be very careful with metrics that can be used to compare people against each other in ways that might result in the punishment of individuals. 57 | * Remember that automation and bot activity can influence the interpretation of many metrics, so it’s essential to understand how automation and bots might influence your results. 58 | 59 | # Additional Reading 60 | 61 | * [Practitioner Guide series](https://chaoss.community/about-chaoss-practitioner-guides/) with guides to help you improve responsiveness, contributor sustainability, organizational participation, and more. 62 | * A short video about each guide can be found in the [Practitioner Guides Playlist](https://www.youtube.com/playlist?list=PL60k37cxI-HSHV4-rEsWMzExw2y2Oq79Z) in the CHAOSS YouTube channel. 63 | * An episode of the [SustainOSS podcast](https://podcast.sustainoss.org/243) about the Practitioner Guide series. 64 | * For more cautions, considerations, and best practices, please read [Beyond the Repository](https://cacm.acm.org/magazines/2023/10/276630-beyond-the-repository/fulltext) by Amanda Casari, Julia Ferraioli, and Juniper Lovato. 65 | * Video of a panel related to developing a [metrics strategy for your OSPO](https://community.linuxfoundation.org/events/details/lfhq-todo-group-ospology-presents-ways-to-define-a-metrics-strategy-in-your-ospo/) and a blog post about [building an open source strategy](https://blogs.vmware.com/opensource/2020/03/03/open-source-strategy/) and using metrics to determine success. 66 | * [CHAOSS Software](https://chaoss.community/software/) 67 | 68 | # Feedback 69 | 70 | We would love to have feedback to learn more about how people are using the CHAOSS Practitioner Guides and how we can improve them over time. Please complete this [short survey](https://forms.gle/w3B1gBH8kp3rPbhr8) to provide your feedback. 
71 | 72 | # Contributors 73 | 74 | The following people contributed to this guide: 75 | 76 | * Dawn Foster 77 | * Chan Voong 78 | * Luis Cañas Díaz 79 | 80 | # References 81 | 82 | * Casari, A., Ferraioli, J., & Lovato, J. (2023). [Beyond the repository: Best practices for open source ecosystems researchers](https://cacm.acm.org/practice/beyond-the-repository/). Queue, 21(2), 14-34. 83 | * Goggins, S. P., Germonprez, M., & Lumbard, K. (2021). [Making open source project health transparent](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9504501). Computer, 54(8), 104-111. 84 | * Linåker, J., Papatheocharous, E., & Olsson, T. (2022, September). [How to characterize the health of an Open Source Software project? A snowball literature review of an emerging practice](https://dl.acm.org/doi/pdf/10.1145/3555051.3555067). In Proceedings of the 18th International Symposium on Open Collaboration (pp. 1-12). 85 | 86 | CHAOSS Practitioner Guides are MIT licensed, living documents, and we welcome your feedback and input. You can suggest edits to this document at https://github.com/chaoss/wg-data-science/blob/main/practitioner-guides/introduction.md 87 | -------------------------------------------------------------------------------- /practitioner-guides/sunset.md: -------------------------------------------------------------------------------- 1 | # Practitioner Guide: Getting Started with Sunsetting an Open Source Project 2 | 3 | Primary metrics: 4 | 5 | * [Change Requests](https://chaoss.community/?p=3610) 6 | * [Issues New](https://chaoss.community/?p=3634) 7 | * [Technical Forks](https://chaoss.community/?p=3431) 8 | 9 | If you haven’t already read the [Insight Guide: Introduction - Things to Think about When Interpreting Metrics](https://chaoss.community/practitioner-guide-introduction/), please pause now and read that guide. 
 10 | 11 | Many open source projects, even widely used ones, become abandoned for a variety of reasons (e.g., evolving interests, family situations, employment changes), but abandonment can be done in a responsible way by proactively sunsetting the project (Miller et al. 2025). Sunsetting is an important consideration for corporate environments where it can be easy to lose track of projects that were created by employees who later walked away from the project and left it abandoned. You don’t want abandoned open source projects with security vulnerabilities sitting in your organization’s source code repositories where someone might trust that project simply because they trust your organization. Finding inactive projects and responsibly sunsetting them is a good business decision and something that many open source teams / Open Source Program Offices (OSPOs) do on a regular basis. 12 | 13 | It’s important to remember that not every open source project can or should exist forever: technologies evolve, corporate priorities change, and people’s interests change. Part of the beauty of open source is that we work in the open as we innovate, and some of those innovative projects will stand the test of time, while others should be responsibly deprecated via a sunset process. Sunsetting an open source project should take your users’ needs into account, and where possible, offer users time to migrate to a replacement technology. At a minimum, it’s important to signal that the project will no longer be maintained, updated, or have security patches so that users know that they should no longer be using the project. 14 | 15 | The most important step you can take when sunsetting a project is to be transparent at every step of the way. Being open about your intentions helps ensure trust for future open source work.
16 | 17 | # Step 1: Identify Trends 18 | 19 | There are several metrics that can help you identify whether there is any activity or usage for a project to make decisions about responsibly sunsetting a project. Looking at [Change Requests](https://chaoss.community/?p=3610) (aka Pull Requests / Merge Requests) and [New Issues](https://chaoss.community/?p=3634) can be a good start when looking at whether there are still people using and contributing to a project. Another good metric is [Technical Forks](https://chaoss.community/?p=3431), which tend to be an indication of usage and potential contribution. 20 | 21 | 22 | ## Change Requests 23 | 24 | The Change Requests metric can help you understand whether people are trying to contribute to your project, which signals whether there is interest in your project. It helps to look at both closed and open requests, since closed change requests indicate that a project might still be maintained, while open change requests that are not being resolved can indicate that a project might be abandoned. If there are no recently merged change requests, then it’s also likely that there have not been any security updates. 25 | 26 | ![Abandoned project - change requests](https://github.com/chaoss/wg-data-science/blob/main/practitioner-guides/images/change_requests_abandoned.png "Example of project change requests decreasing to almost nothing over time. These might be from an abandoned project") 27 | 28 | ## New Issues 29 | 30 | Many new issues are created when a user finds a bug, has a question, or wants a new feature, so new issues may be filed because people are using your project or are otherwise interested in your project. When there are few to no issues created over a period of some time, then it might indicate that the project has been abandoned. 
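
The checks described above for change requests and new issues can be sketched as a simple recency test. This is a minimal illustration, assuming you have already exported the dates of merged change requests and newly opened issues from your own tooling. The function names and the 365-day threshold are hypothetical choices for the example, not a CHAOSS definition of abandonment.

```python
from datetime import date


def days_since_latest(event_dates, today):
    """Days between `today` and the most recent event, or None if there are no events."""
    if not event_dates:
        return None
    return (today - max(event_dates)).days


def looks_inactive(merged_change_requests, new_issues, today, threshold_days=365):
    """Return True when neither metric shows activity within `threshold_days`.

    The threshold is an assumed default; pick one that fits your projects.
    """
    for event_dates in (merged_change_requests, new_issues):
        age = days_since_latest(event_dates, today)
        if age is not None and age <= threshold_days:
            return False
    return True


# Usage: a repository with no merged change requests or new issues in over a year.
today = date(2025, 1, 1)
candidate = looks_inactive([date(2022, 3, 1)], [date(2023, 6, 1)], today)
```

A repository flagged this way is only a candidate for review. A stable, widely used project can look identical in this data, so the result should feed a conversation with the people involved rather than a decision.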
 31 | 32 | ![Abandoned project - new issues](https://github.com/chaoss/wg-data-science/blob/main/practitioner-guides/images/issues_abandoned.png "Example of new issues decreasing to almost nothing over time. These might be from an abandoned project") 33 | 34 | ## Technical Forks 35 | 36 | Technical forks are the forks that people create when they are using your project or planning to contribute to it, but it can help to look beyond the number of forks to see who has forked your project and whether they are continuing to update their fork. Recently updated forks can indicate usage, and by gathering data about who has forked your project, you might also be able to reach out to some of these people to ask if they are still using it before you decide whether, or how, to sunset it. 37 | 38 | # Step 2: Diagnosis 39 | 40 | If you are part of an organization that is auditing its repositories to find projects that seem to be abandoned, you should start by talking to the people who were involved in the project so that you can better understand whether the project is truly abandoned, and if so, why. This will likely require looking at the most recent change requests to see who made the last few changes to the project so that you can reach out to them. In some cases, a project might not need to be updated regularly and might seem to be abandoned when it has simply stabilized and might still be widely used. For example, a small library that performs some complex mathematical function might not need to be updated after it is written, assuming that it doesn’t have dependencies that need to be updated. If the project is primarily distributed via package managers, usage metrics from those sources should also be considered. 41 | 42 | If the project has truly been abandoned, then it can help to understand why it was abandoned and whether anyone might still be using it before you decide to sunset it.
This is where the technical fork data can be useful to see if other people have forked your project and are continuing to update their fork, which might indicate usage of your project. [Criticality Score](https://github.com/ossf/criticality_score), an OpenSSF project, can also shed light on usage, since it also calculates the number of projects that depend on your project. There is a [Python script](https://github.com/geekygirldawn/project-api-metrics/blob/main/scripts/sunset.py) that uses Criticality Score and GitHub API fork data and can be used to gather and analyze this data. 43 | 44 | # Step 3: Gather Additional Data if Needed 45 | 46 | CHAOSS has other metrics to understand project activity and usage that can be helpful when deciding whether to sunset a project. 47 | 48 | Additional Metrics: 49 | 50 | * [Collaboration Platform Activity](https://chaoss.community/?p=3484) 51 | * [Clones](https://chaoss.community/?p=3429) 52 | * [Code Changes Commits](https://chaoss.community/?p=4707) 53 | * [Release Frequency](https://chaoss.community/?p=4765) 54 | * [Project Popularity](https://chaoss.community/?p=3573) 55 | * [Criticality Score](https://github.com/ossf/criticality_score) (an OpenSSF tool, not a CHAOSS metric) 56 | * Package Manager usage metrics 57 | 58 | # Step 4: Make the Change 59 | 60 | After you have completed your diagnosis and have decided to sunset a project, there are several things you can do to ensure that you are sunsetting the project in a responsible manner. 61 | 62 | **Communication** 63 | 64 | Communication should start with any existing contributors to make sure that they agree with this decision. In some cases, there may be contributors who would like to continue the project, instead of sunsetting it. 65 | 66 | When you have agreement to sunset the project, then this needs to be communicated to any existing users, including any other internal teams who may be using the project.
This communication should be done in a transparent manner by being clear about the reasons for sunsetting it. There are quite a few places where this should be communicated and documented: 67 | 68 | * README: Clearly state at the top of the README that the project has been deprecated and will no longer be updated. If possible, suggest alternate projects that provide similar functionality. 69 | * Project communication channels: This may include Slack, mailing lists, forums, social media, and any other channels used for communication within the project. 70 | * Documentation: Project documentation should also be updated. This may include wikis, websites, and other project documentation. 71 | * Package managers / distribution channels: Since most projects are distributed via package managers (e.g., npm, PyPI, RubyGems), those should also be updated with a deprecation warning that clearly states that the project is no longer being maintained or updated. 72 | * Additional channels: If this is a corporate project, marketing and other internal teams should also be notified. 73 | * Users: Known users of the project should also be notified. 74 | 75 | **Archival** 76 | 77 | The project should be officially archived using something like GitHub’s archival method. It might also be a good idea to keep these projects in a separate location to make it clear that these projects have reached the end of their life. For example, VMware has a separate ‘vmware-archive’ organization, and Apache has something similar called the ‘Apache Attic’. 78 | 79 | Taking the additional step of adding your code to the [Software Heritage](https://www.softwareheritage.org/) archive can help preserve it over time. This is especially true if you are self-hosting your source code repositories, but it can also help archived code outlive potential platform changes and make it easier for people to find in the future. 80 | 81 | Keep in mind that projects can always be unarchived if you change your mind later.
Stressing this fact can often make it easier to get people to agree to the sunset process. 82 | 83 | **Special Case: Sunsetting Active Projects** 84 | 85 | Unfortunately, even active projects may need to be sunsetted, which can be much more difficult. This can happen when a project is being maintained entirely by a company, and that company has a shift in strategy and decides that they no longer wish to staff or maintain a project that is being widely used. There are a number of additional considerations in this case that should happen before the project is archived: 86 | 87 | * Public Relations (PR): Archiving an active project can be a tricky situation that might result in negative press as soon as you start talking to people about your desire to sunset the project, so before talking to anyone outside of your company, you want to get your PR team involved so that they can be ready to handle any queries. 88 | * Option to continue under new ownership: Determine if there are other contributors or other organizations who would be willing and able to take over new development and / or maintenance of the project. 89 | * Sunset plan: It can help to create a detailed plan that outlines all of the steps that need to be taken along with a timeframe for when the project will be sunsetted. 90 | * Timing: A responsible sunset plan will give users time to migrate to another solution before you stop making security updates and releases. 6 months is a good starting point. 91 | * Customers, users, and communication: Careful communication is required to communicate this decision along with the timing to any existing customers and users. 92 | 93 | # Cautions and Considerations 94 | 95 | * It is worth taking some extra time to talk to contributors and make sure that the decision to sunset a project is the right decision before you do it. 96 | * Be as transparent as possible about the decision to sunset a project and why you made that decision. 
97 | * Sunsetting a project is not an indication of failure and should not be positioned as such. Projects have life cycles; they endure for as long as they are needed and then they should be responsibly deprecated when they are no longer needed. 98 | * Consider providing transition details, and if possible, tooling that helps your existing users transition to an alternative solution if a reasonable one is available. 99 | * As with all of the CHAOSS Practitioner Getting Started Guides, these materials are designed to help you get started when sunsetting a project, but they are not a comprehensive guide with everything you might need to know for every situation. 100 | 101 | # Additional Reading 102 | 103 | * Stefka Dimitrova on When and How to Deprecate an Open Source Project ([Part 1](https://blogs.vmware.com/opensource/2022/09/29/when-and-how-to-deprecate-an-open-source-project/) and [Part 2](https://blogs.vmware.com/opensource/2023/05/17/deprecating-an-open-source-project-part-2/)) along with a video from the Open Source Summit in Europe in 2022 about [Simple Steps for a Calm “Sunset”](https://www.youtube.com/watch?v=OdpkMkoKtDY). Much of the content for this guide is based on these materials.
104 | * [GitHub’s Do’s and Don'ts When Sunsetting Open Source Projects](https://github.blog/open-source/maintainers/dos-and-donts-when-sunsetting-open-source-projects/) 105 | * [TODO Group Guide: Shutting Down an Open Source Project](https://todogroup.org/resources/guides/shutting-down-an-open-source-project/) 106 | * [Allen Friedman on End of Life and End of Support Across the Ecosystem](https://www.youtube.com/watch?v=ZgWwiKLB6hE) (video) 107 | * [10 Quick tips for making software outlive your job (white paper)](https://arxiv.org/abs/2505.06484) 108 | 109 | # Contributors 110 | 111 | The following people contributed to this guide: 112 | 113 | * Dawn Foster 114 | * Ria Schalnat 115 | * Damián Vicino 116 | * Matt Germonprez 117 | * Elizabeth Barron 118 | * Tara Tarakiyee 119 | 120 | # References 121 | 122 | * Miller, C., Jahanshahi, M., Mockus, A., Vasilescu, B., & Kästner, C. (2025). [Understanding the response to open-source dependency abandonment in the npm ecosystem](http://www.cs.cmu.edu/~ckaestne/pdf/icse25_abandonment.pdf). In *Int’l Conf. Software Engineering (ICSE), IEEE/ACM*. 123 | 124 | CHAOSS Practitioner Guides are MIT licensed, living documents, and we welcome your feedback and input. You can suggest edits to this document at [https://github.com/chaoss/wg-data-science/blob/main/practitioner-guides/sunset.md](https://github.com/chaoss/wg-data-science/blob/main/practitioner-guides/sunset.md) -------------------------------------------------------------------------------- /practitioner-guides/website-landing.md: -------------------------------------------------------------------------------- 1 | Practitioner Guides are designed to be used by practitioners. The goal is to help you interpret the data about an open source project and develop insights that can help you improve the health of that project.
2 | 3 | The Getting Started series of guides is designed for people who may not be experts in data analysis or open source. 4 | 5 | These guides will be especially useful for Open Source Program Offices (OSPOs), project leads, community managers, maintainers, and anyone who wants to better understand project health and take action on what they learn from their metrics. 6 | 7 | **Guides** 8 | 9 | * [Practitioner Guide: Introduction - Things to Think about When Interpreting Metrics](https://chaoss.community/practitioner-guide-introduction/) <- Please start here 10 | * [Practitioner Guide: Getting Started with Contributor Sustainability](https://chaoss.community/practitioner-guide-contributor-sustainability/) 11 | * [Practitioner Guide: Getting Started with Responsiveness](https://chaoss.community/practitioner-guide-responsiveness/) 12 | * [Practitioner Guide: Getting Started with Organizational Participation](https://chaoss.community/practitioner-guide-organizational-participation/) 13 | * [Practitioner Guide: Getting Started with Security](https://chaoss.community/practitioner-guide-security/) 14 | * [Practitioner Guide: Getting Started with Sunsetting an Open Source Project](https://chaoss.community/practitioner-guide-sunset) 15 | * [Practitioner Guide: Getting Started with Building Diverse Leadership](https://chaoss.community/practitioner-guide-diverse-leadership) 16 | 17 | Short videos about the guides can be found in the [Practitioner Guides Playlist](https://www.youtube.com/playlist?list=PL60k37cxI-HSHV4-rEsWMzExw2y2Oq79Z) on the CHAOSS YouTube channel. 18 | 19 | We also have more guides coming soon. You can see the work-in-progress guides, contribute to them, and propose new guides in the [Practitioner Guides](https://github.com/chaoss/wg-data-science/tree/main/practitioner-guides) section of the CHAOSS Data Science Working Group repository.
These guides are under the MIT License, so you can use them in a variety of ways that best meet your needs, including copying and modifying them for your organization. 20 | -------------------------------------------------------------------------------- /publications/Foster-OFA-New-Dynamics-Open-Source-Relicensing-Forks-Community-Impact-2024.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chaoss/wg-data-science/c4db8382b35553a260654b917256e84fc3ffeab6/publications/Foster-OFA-New-Dynamics-Open-Source-Relicensing-Forks-Community-Impact-2024.pdf -------------------------------------------------------------------------------- /publications/README.md: -------------------------------------------------------------------------------- 1 | # Publications 2 | 3 | These publications came out of research conducted by the [CHAOSS Data Science Working Group](https://github.com/chaoss/wg-data-science). Please see our [Publication Guidelines](publication-guidelines.md) for more details. 4 | 5 | **[The New Dynamics of Open Source: Relicensing, Forks, and Community Impact](https://github.com/chaoss/wg-data-science/blob/main/publications/Foster-OFA-New-Dynamics-Open-Source-Relicensing-Forks-Community-Impact-2024.pdf) (PDF paper link)** 6 | 7 | * **Citation:** Foster, D. (2024, November 13-14). The New Dynamics of Open Source: Relicensing, Forks, and Community Impact. OpenForum Academy Symposium 2024, Boston, Massachusetts. Available at https://doi.org/10.48550/arXiv.2411.04739.
8 | * [Paper](https://github.com/chaoss/wg-data-science/blob/main/publications/Foster-OFA-New-Dynamics-Open-Source-Relicensing-Forks-Community-Impact-2024.pdf) and [Presentation Slides](https://fastwonderblog.com/wp-content/uploads/2024/11/Dawn-Foster-New-Dynamics-Relicensing-Forks-OFA-2024.pdf) 9 | * [Data files](https://github.com/chaoss/wg-data-science/releases/tag/v1.0-OFA-2024) released for replication, validation, and / or exploration. 10 | * [Ongoing research into relicensing and forks](https://github.com/chaoss/wg-data-science/tree/main/dataset/license-changes/fork-case-study) - We plan to release a report with additional data, and contributions are welcome. 11 | 12 | -------------------------------------------------------------------------------- /publications/publication-guidelines.md: -------------------------------------------------------------------------------- 1 | # CHAOSS Data Science Publication Guidelines 2 | 3 | One goal of the CHAOSS Data Science Working Group (WG) is to publish the research and outcomes from our projects so that people can learn from our findings while also raising the visibility of CHAOSS and this WG. Our current list of projects can be found as [Issues labelled with "Project"](https://github.com/chaoss/wg-data-science/issues?q=is%3Aissue%20state%3Aopen%20label%3Aproject) in the Data Science WG repository, and we have an issue template that anyone can use to propose a new project. 4 | 5 | Open source program offices, community managers, and other people working in corporate open source environments have been the biggest audiences for CHAOSS tools and metrics, so to make it easier for these people to consume our research, we are focusing on producing reports that are in the style of white papers, like what you can find published by [Linux Foundation Research](https://www.linuxfoundation.org/research). Ideally, we would work with organizations including LF Research to release our reports. 
However, not everything we publish will be appropriate for an LF Research report, so alternatively, some will be published as reports on the CHAOSS site. Good reports to model ours after are the LF / Harvard [Census II](https://www.linuxfoundation.org/hubfs/Research%20Reports/lfr_harvard_censusII_mar2022_042824b.pdf?hsLang=en) and [Census III](https://www.linuxfoundation.org/hubfs/LF%20Research/lfr_censusiii_120424a.pdf?hsLang=en) reports, since they are well researched and presented in a style that is accessible for a wide audience (note that we expect our reports to be much shorter). 6 | 7 | From a CHAOSS Data Science WG perspective, these reports are not intended to be academic publications; however, the people leading the research for a report are welcome (and encouraged) to submit variations of the research to academic publications. One of the reasons we chose the white paper / LF Research report format is that it lets us publish our research quickly, whereas the academic publishing pipeline is often measured in years. We will **not** delay the publishing of our reports, so academic publications will almost always come after CHAOSS has published a report. We also require that, in any additional publications, you acknowledge the CHAOSS project and credit all of the other people who have contributed to the research in an acknowledgements section. 8 | 9 | Style Guidelines: 10 | * [The LF Research Style Guide](https://docs.google.com/document/d/1pgq7EQtlo2syaRuEZaD1Rco1bEWmqC46gSX2ryZJNLE/edit?tab=t.0) 11 | * Please use APA style and footnotes for all references. 12 | --------------------------------------------------------------------------------