├── requirements.txt
├── notebooks
│   ├── output
│   │   └── README.md
│   ├── README.md
│   ├── org_info.ipynb
│   └── CodeOfConductBug.ipynb
├── scripts
│   ├── output
│   │   └── README.md
│   ├── running_scripts.ipynb
│   ├── filter_keyword_by_org.py
│   ├── org_access_audit.py
│   ├── repo_activity_REST.py
│   ├── mystery_orgs.py
│   ├── monitoring.py
│   ├── keyword_by_repo.py
│   ├── README.md
│   ├── common_functions.py
│   ├── inclusivity_check.py
│   ├── pr_activity.py
│   ├── repo_activity.py
│   ├── repo_activity_coc.py
│   ├── commits_people.py
│   └── sunset.py
├── MAINTAINERS.md
├── setup-venv.sh
├── NOTICE
├── README.md
├── LICENSE
├── CONTRIBUTING.md
├── .gitignore
└── CODE-OF-CONDUCT.md
/requirements.txt: -------------------------------------------------------------------------------- 1 | requests 2 | pandas 3 | -------------------------------------------------------------------------------- /notebooks/output/README.md: -------------------------------------------------------------------------------- 1 | Output files from these scripts will be stored as csv files, and if this directory does not exist, the scripts will fail. 2 | 3 | The csv files in this directory will be ignored by git (.gitignore) 4 | -------------------------------------------------------------------------------- /scripts/output/README.md: -------------------------------------------------------------------------------- 1 | Output files from these scripts will be stored as csv files, and 2 | if this directory does not exist, the scripts will fail. 3 | 4 | The csv files in this directory will be ignored by git (.gitignore) 5 | -------------------------------------------------------------------------------- /MAINTAINERS.md: -------------------------------------------------------------------------------- 1 | # Maintainers 2 | 3 | | Maintainer | GitHub ID | Affiliation | 4 | | --------------- | --------- | ----------- | 5 | | Dawn Foster | [geekygirldawn](https://github.com/geekygirldawn) | [CHAOSS Project](https://github.com/chaoss) | 6 | -------------------------------------------------------------------------------- /setup-venv.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | # uncomment for more output 4 | # set -x 5 | python3 -m venv .venv 6 | echo "To activate virtual env: \"source .venv/bin/activate\"" 7 | echo "To install dependencies: \"pip install -r requirements.txt\"" 8 | 9 | -------------------------------------------------------------------------------- /NOTICE: -------------------------------------------------------------------------------- 1 | Project API Metrics 2 | Copyright 2021 VMware, Inc. 3 | 4 | This product is licensed to you under the BSD-2 license (the "License"). You may not use this product except in compliance with the BSD-2 License. 5 | 6 | This product may include a number of subcomponents with separate copyright notices and license terms. Your use of these subcomponents is subject to the terms and conditions of the subcomponent's license, as noted in the LICENSE file. 7 | 8 | -------------------------------------------------------------------------------- /notebooks/README.md: -------------------------------------------------------------------------------- 1 | # Project API Metrics Jupyter Notebooks 2 | 3 | This is where I experiment with ideas, troubleshoot issues, 4 | or explore data. 5 | 6 | Some of these notebooks have also been converted into scripts 7 | in the [scripts](../scripts) directory.
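For example, assuming you have Jupyter installed, a notebook such as org_info.ipynb can usually be converted with nbconvert:

```
jupyter nbconvert --to script org_info.ipynb
```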
8 | 9 | The advantage of the notebooks is that you can have a closer look 10 | at the dataframes and other output to poke around and customize 11 | them for your needs. 12 | 13 | These notebooks might also be useful for people who are more 14 | comfortable using Jupyter notebooks than running scripts. 15 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Project API Metrics 2 | 3 | This repo contains a few Python scripts and Jupyter notebooks that query the 4 | GitHub API to gather metrics related to project health and other activities. 5 | 6 | I am also using this repo as I learn how to use the GitHub GraphQL API. 7 | 8 | The [scripts](scripts/) directory contains the scripts and more information 9 | about how to run them. 10 | 11 | The [notebooks](notebooks/) directory contains Jupyter notebooks that allow 12 | you to explore the data gathered from the API. 13 | 14 | ## Contributing 15 | 16 | I welcome any suggestions via issues or pull requests! Please have a look 17 | at the [CONTRIBUTING.md](CONTRIBUTING.md) document for more details. 18 | 19 | Participation in this project is subject to the 20 | [Code of Conduct](CODE-OF-CONDUCT.md). 21 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Project API Metrics 2 | Copyright 2021 VMware, Inc. 3 | 4 | The BSD-2 license (the "License") set forth below applies to all parts of the Project API Metrics project. You may not use this file except in compliance with the License. 5 | 6 | BSD-2 License 7 | 8 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 9 | 10 | Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 11 | 12 | Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 13 | 14 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to Project API Metrics 2 | 3 | Welcome to Project API Metrics! We're happy to have you here, 4 | and we're always looking for ways to improve these scripts. 5 | 6 | As you get started, you are in a great position to provide feedback 7 | about whether these scripts work as expected in your environment.
8 | 9 | If anything doesn't make sense, or doesn't work when you run it, please open a 10 | bug report and let us know! 11 | 12 | ## Contribution Flow 13 | 14 | This is a rough outline of what a contributor's workflow looks like: 15 | 16 | - Create a topic branch from where you want to base your work 17 | - Make commits of logical units 18 | - Make sure your commit messages are in the proper format (see below) 19 | - Push your changes to a topic branch in your fork of the repository 20 | - Submit a pull request 21 | 22 | GitHub has more documentation about the [GitHub Workflow](https://docs.github.com/en/get-started/quickstart/github-flow). 23 | 24 | ## Code Style 25 | 26 | ### Formatting Commit Messages 27 | 28 | We follow the conventions on [How to Write a Git Commit Message](http://chris.beams.io/posts/git-commit/). 29 | 30 | Be sure to include any related GitHub issue references in the commit message. See 31 | [GitHub Flavored Markdown syntax](https://guides.github.com/features/mastering-markdown/#GitHub-flavored-markdown) for referencing issues 32 | and commits. 33 | 34 | ## Reporting Bugs and Creating Issues 35 | 36 | When opening a new issue, try to roughly follow the commit message format conventions above. 37 | -------------------------------------------------------------------------------- /scripts/running_scripts.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "6bf81bdd", 6 | "metadata": {}, 7 | "source": [ 8 | "# Overview\n", 9 | "\n", 10 | "This notebook is used to make it easy for someone to run the scripts if they are familiar with Jupyter Notebooks, but less familiar with navigating on the command line.\n", 11 | "\n", 12 | "Prerequisites for all scripts:\n", 13 | "* Python environment with Jupyter Notebook installed.\n", 14 | " If you don't already have something, Anaconda is a\n", 15 | " popular choice.\n", 16 | "* Pandas installed (if you are using Anaconda, this is probably included already)\n", 17 | "* [PyGithub](https://pygithub.readthedocs.io/en/latest/introduction.html) installed:\n", 18 | " ```pip install PyGithub```\n", 19 | "* [GitHub Personal Access Token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) created and saved into a file called 'gh_key' in this scripts directory." 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "id": "92e6b76c", 25 | "metadata": {}, 26 | "source": [ 27 | "# repo_activity_coc.py\n", 28 | "\n", 29 | "As input, this script requires a file named 'orgs.txt' containing\n", 30 | "the name of one GitHub org per line residing in the same folder \n", 31 | "as this script."
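,
"\n",
"\n",
"For example (these org names are purely illustrative), orgs.txt might contain:\n",
"```\n",
"vmware\n",
"vmware-tanzu\n",
"```"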
32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "id": "53de1338", 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "# Test that your orgs.txt file exists and contains the data you expect\n", 42 | "f = open('orgs.txt', 'r')\n", 43 | "print(f.read())" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": null, 49 | "id": "ab54e818", 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "%run repo_activity_coc.py" 54 | ] 55 | } 56 | ], 57 | "metadata": { 58 | "kernelspec": { 59 | "display_name": "Python 3 (ipykernel)", 60 | "language": "python", 61 | "name": "python3" 62 | }, 63 | "language_info": { 64 | "codemirror_mode": { 65 | "name": "ipython", 66 | "version": 3 67 | }, 68 | "file_extension": ".py", 69 | "mimetype": "text/x-python", 70 | "name": "python", 71 | "nbconvert_exporter": "python", 72 | "pygments_lexer": "ipython3", 73 | "version": "3.10.9" 74 | } 75 | }, 76 | "nbformat": 4, 77 | "nbformat_minor": 5 78 | } 79 | -------------------------------------------------------------------------------- /scripts/filter_keyword_by_org.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2022 VMware, Inc. 4 | # SPDX-License-Identifier: BSD-2-Clause 5 | # Author: Dawn M. Foster 6 | 7 | """ Filter results obtained from keyword_by_repo.py to a subset of orgs 8 | 9 | This script uses the results from a keyword search and filters it based 10 | on a list of GitHub organizations. 11 | 12 | As input, this script requires a file generated by the keyword_by_repo.py 13 | script. This is provided via a command line argument. 14 | Example: filter_keyword_by_org.py /path/to/keyword_search_2022-07-26.csv 15 | 16 | If a command line argument is not specified, you will be prompted for this 17 | file. 18 | 19 | As input, this script requires a file named 'orgs.txt' containing 20 | the name of one GitHub org per line residing in the same folder 21 | as this script. 22 | 23 | As output: 24 | * the script creates a csv file stored in a subdirectory 25 | of the folder with the script called "output" with the filename in 26 | this format with today's date. 27 | Example: "output/keyword_search_org_filter_2022-07-27.csv" 28 | """ 29 | 30 | import sys 31 | import csv 32 | from common_functions import read_orgs, create_file 33 | 34 | # Read list of orgs from a file 35 | try: 36 | org_list = read_orgs('orgs.txt') 37 | except: 38 | print("Error reading orgs. This script depends on the existence of a file called orgs.txt containing one org per line. Exiting") 39 | sys.exit() 40 | 41 | # Read filename from command line or prompt if no arguments were given.
42 | try: 43 | file_name = str(sys.argv[1]) 44 | 45 | except: 46 | print("Please enter the filename for the csv file generated from keyword_by_repo.py (full path)") 47 | file_name = input("Enter a file name: ") 48 | 49 | # open csv file 50 | with open(file_name) as in_file: 51 | content = csv.reader(in_file) 52 | 53 | # get the csv header line and move to the first result 54 | header = next(content) 55 | 56 | # find csv results lines that match a GitHub org name from orgs.txt 57 | # and append the matches to a list 58 | org_match_list = [] 59 | for line in content: 60 | if line[0] in org_list: 61 | org_match_list.append(line) 62 | 63 | # prepare output file and write header and list to csv 64 | try: 65 | file, file_path = create_file("keyword_search_org_filter") 66 | 67 | with open(file_path, "w") as out_file: 68 | wr = csv.writer(out_file) 69 | wr.writerow(header) 70 | wr.writerows(org_match_list) 71 | 72 | except: 73 | print('Could not write to csv file. This may be because the output directory is missing or you do not have permissions to write to it. Exiting') 74 | 75 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # API key 2 | gh_key* 3 | 4 | # Input files 5 | *org*.txt 6 | *keyword*.txt 7 | *monitoring*.txt 8 | *.csv 9 | 10 | # Output directory contents 11 | scripts/output/*.csv 12 | notebooks/output/*.csv 13 | *.pkl 14 | 15 | # Experimental scripts 16 | scripts/exp* 17 | 18 | # VS Code files 19 | /.vscode/ 20 | .vscode 21 | 22 | # MacOS files 23 | .DS_Store 24 | 25 | # Byte-compiled / optimized / DLL files 26 | __pycache__/ 27 | *.py[cod] 28 | *$py.class 29 | 30 | # C extensions 31 | *.so 32 | 33 | # Distribution / packaging 34 | .Python 35 | build/ 36 | develop-eggs/ 37 | dist/ 38 | downloads/ 39 | eggs/ 40 | .eggs/ 41 | lib/ 42 | lib64/ 43 | parts/ 44 | sdist/ 45 | var/ 46 | wheels/ 47 | pip-wheel-metadata/ 48 | share/python-wheels/ 49 | *.egg-info/ 50 | .installed.cfg 51 | *.egg 52 | MANIFEST 53 | 54 | # PyInstaller 55 | # Usually these files are written by a python script from a template 56 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 57 | *.manifest 58 | *.spec 59 | 60 | # Installer logs 61 | pip-log.txt 62 | pip-delete-this-directory.txt 63 | 64 | # Unit test / coverage reports 65 | htmlcov/ 66 | .tox/ 67 | .nox/ 68 | .coverage 69 | .coverage.* 70 | .cache 71 | nosetests.xml 72 | coverage.xml 73 | *.cover 74 | *.py,cover 75 | .hypothesis/ 76 | .pytest_cache/ 77 | 78 | # Translations 79 | *.mo 80 | *.pot 81 | 82 | # Django stuff: 83 | *.log 84 | local_settings.py 85 | db.sqlite3 86 | db.sqlite3-journal 87 | 88 | # Flask stuff: 89 | instance/ 90 | .webassets-cache 91 | 92 | # Scrapy stuff: 93 | .scrapy 94 | 95 | # Sphinx documentation 96 | docs/_build/ 97 | 98 | # PyBuilder 99 | target/ 100 | 101 | # Jupyter Notebook 102 | .ipynb_checkpoints 103 | 104 | # IPython 105 | profile_default/ 106 | ipython_config.py 107 | 108 | # pyenv 109 | .python-version 110 | 111 | # pipenv 112 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 113 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 114 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 115 | # install all needed dependencies. 116 | #Pipfile.lock 117 | 118 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 119 | __pypackages__/ 120 | 121 | # Celery stuff 122 | celerybeat-schedule 123 | celerybeat.pid 124 | 125 | # SageMath parsed files 126 | *.sage.py 127 | 128 | # Environments 129 | .env 130 | .venv 131 | env/ 132 | venv/ 133 | ENV/ 134 | env.bak/ 135 | venv.bak/ 136 | 137 | # Spyder project settings 138 | .spyderproject 139 | .spyproject 140 | 141 | # Rope project settings 142 | .ropeproject 143 | 144 | # mkdocs documentation 145 | /site 146 | 147 | # mypy 148 | .mypy_cache/ 149 | .dmypy.json 150 | dmypy.json 151 | 152 | # Pyre type checker 153 | .pyre/ 154 | -------------------------------------------------------------------------------- /scripts/org_access_audit.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2022 VMware, Inc. 4 | # SPDX-License-Identifier: BSD-2-Clause 5 | # Author: Dawn M. Foster 6 | 7 | """GitHub Organization Access Audit 8 | This script uses the GitHub GraphQL API to retrieve relevant 9 | information about all enterprise owners and org members from 10 | one or more GitHub orgs. 11 | 12 | Note that you must have appropriate access to this data in the 13 | orgs requested. Missing data likely means that you don't have 14 | access. 15 | 16 | As input, this script requires a file named 'orgs.txt' containing 17 | the name of one GitHub org per line residing in the same folder 18 | as this script. 19 | 20 | Your API key should be stored in a file called gh_key in the 21 | same folder as this script. 22 | 23 | As output: 24 | * JSON data is currently printed to the screen as a way to do this 25 | quickly. 26 | """ 27 | 28 | import sys 29 | from common_functions import read_key 30 | 31 | def make_query(after_cursor = None): 32 | """Creates and returns a GraphQL query (the after_cursor parameter is currently unused; only the first 100 results are fetched)""" 33 | 34 | return """query ($org_name: String!){ 35 | organization(login: $org_name){ 36 | url 37 | enterpriseOwners(first:100){ 38 | nodes{ 39 | login 40 | } 41 | } 42 | membersWithRole(first:100){ 43 | nodes{ 44 | login 45 | name 46 | } 47 | } 48 | } 49 | } 50 | """ 51 | 52 | # Read GitHub key from file using the read_key function in 53 | # common_functions.py 54 | try: 55 | api_token = read_key('gh_key') 56 | 57 | except: 58 | print("Error reading GH Key. This script depends on the existence of a file called gh_key containing your GitHub API token. Exiting") 59 | sys.exit() 60 | 61 | def get_org_data(api_token): 62 | """Executes the GraphQL query to get owner / member data from one or more GitHub orgs. 63 | 64 | Parameters 65 | ---------- 66 | api_token : str 67 | The GH API token retrieved from the gh_key file. 68 | 69 | Output 70 | ------ 71 | Prints the JSON response for each org to the screen. 72 | """ 73 | import requests 74 | import json 75 | import pandas as pd 76 | from common_functions import read_orgs 77 | import sys 78 | 79 | url = 'https://api.github.com/graphql' 80 | headers = {'Authorization': 'token %s' % api_token} 81 | 82 | # Read list of orgs from a file 83 | 84 | try: 85 | org_list = read_orgs('orgs.txt') 86 | except: 87 | print("Error reading orgs. This script depends on the existence of a file called orgs.txt containing one org per line.
Exiting") 88 | sys.exit() 89 | 90 | for org_name in org_list: 91 | try: 92 | query = make_query() 93 | 94 | variables = {"org_name": org_name} 95 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 96 | json_data = json.loads(r.text) 97 | 98 | print(json_data) 99 | except: 100 | print("ERROR Cannot process", org_name) 101 | 102 | get_org_data(api_token) 103 | 104 | -------------------------------------------------------------------------------- /scripts/repo_activity_REST.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2022 VMware, Inc. 4 | # SPDX-License-Identifier: BSD-2-Clause 5 | 6 | """Repo Activity REST API Version - DEPRECATED Example only 7 | This script is comparison with the other script (repo_activity.py) 8 | which uses the GraphQL API. This one is very slow and should not 9 | be used to gather data. 10 | 11 | This script uses the GitHub REST API to retrieve relevant 12 | information about all repositories from one or more GitHub 13 | orgs. 14 | 15 | This version runs much more slowly than the other GraphQL 16 | version, and if there are a lot of orgs / repos, the API 17 | rate limit will be exceeded if the script is not slowed down. 18 | 19 | As input, this script requires a file named 'orgs.txt' containing 20 | the name of one GitHub org per line residing in the same folder 21 | as this script. 22 | 23 | Your API key should be stored in a file called gh_key in the 24 | same folder as this script. 25 | 26 | This script requires that `pandas` be installed within the Python 27 | environment you are running this script in. 28 | 29 | As output: 30 | * A message about each org being processed will be printed to the screen. 31 | * the script creates a csv file stored in an subdirectory 32 | of the folder with the script called "output" with the filename in 33 | this format with today's date. 34 | 35 | output/a_repo_activity_2022-01-14.csv" 36 | """ 37 | 38 | import sys 39 | import pandas as pd 40 | import csv 41 | from datetime import datetime 42 | from time import sleep 43 | from github import Github 44 | from os.path import dirname, join 45 | from common_functions import read_key 46 | 47 | # Read GitHub key from file 48 | try: 49 | gh_key = read_key('gh_key') 50 | g = Github(gh_key) 51 | 52 | except: 53 | print("Error reading GH Key. Exiting") 54 | sys.exit() 55 | 56 | # prepare csv file and write header row 57 | 58 | today = datetime.today().strftime('%Y-%m-%d') 59 | output_filename = "./output/a_repo_activity_" + today + ".csv" 60 | 61 | try: 62 | current_dir = dirname(__file__) 63 | file_path = join(current_dir, output_filename) 64 | 65 | csv_output = open(file_path, 'w') 66 | csv_output.write('org,repo,license,private,forked,archived,last_updated,last_pushed,last_committer_login,last_committer_name,last_committer_email,last_committer_date\n') 67 | 68 | except: 69 | print('Could not write to csv file. 
Exiting') 70 | sys.exit(1) 71 | 72 | # Read list of orgs from a file 73 | 74 | org_list = [] 75 | with open('orgs.txt') as orgfile: 76 | orgs = csv.reader(orgfile) 77 | for row in orgs: 78 | org_list.append(row[0]) 79 | 80 | # Get repos and repo info for each org 81 | 82 | for github_org in org_list: 83 | 84 | # sleep(90) #add delay to slow down hitting rate limits 85 | print("Processing ", github_org) 86 | 87 | try: 88 | org = g.get_organization(github_org) 89 | except: 90 | print("ERROR: Cannot process ", github_org) 91 | continue 92 | 93 | try: 94 | for x in org.get_repos(): 95 | try: 96 | for y in x.get_commits(): 97 | try: 98 | author_login = y.author.login 99 | author_name = y.author.name 100 | author_email = y.author.email 101 | break 102 | except: 103 | author_login = None 104 | author_name = None 105 | author_email = None 106 | 107 | 108 | try: 109 | last_commit_date = x.get_commit(y.sha).commit.author.date 110 | except: 111 | last_commit_date = "No commits, repo may be empty" 112 | 113 | # When this fails it usually means there is no license 114 | try: 115 | license = x.get_license().license.name 116 | except: 117 | license = "Likely Unlicensed" 118 | 119 | csv_string = github_org + ',' + x.full_name + ',' + str(license) + ',' + str(x.private) + ',' + str(x.fork) + ',' + str(x.archived) + ',' + str(x.updated_at) + ',' + str(x.pushed_at) + ',' + str(author_login) + ',' + str(author_name) + ',' + str(author_email) + ',' + str(last_commit_date) + '\n' 120 | csv_output.write(csv_string) 121 | except: 122 | print("Cannot process data for", x) 123 | csv_output.write(github_org + ',' + x.full_name + ',' + str(x.private) + ',' + str(x.fork) + ',' + str(x.archived) + ',' + str(x.updated_at) + ',' + str(x.pushed_at) + ',' + 'Error' + ',' + 'Error' + ',' + 'Error' + ',' + 'Error' + '\n') 124 | 125 | except: 126 | print("Cannot get repos for", github_org) 127 | -------------------------------------------------------------------------------- /scripts/mystery_orgs.py: -------------------------------------------------------------------------------- 1 | # Copyright 2022 VMware, Inc. 2 | # SPDX-License-Identifier: BSD-2-Clause 3 | 4 | """Mystery Orgs 5 | This script uses the GitHub GraphQL API to retrieve relevant 6 | information about one or more GitHub orgs. 7 | 8 | We use this script to gather basic data about GitHub orgs that 9 | we believe may have been created outside of our process by various 10 | employees across our business units. We gather the first few members 11 | of the org to help identify employees who can provide more details 12 | about the purpose of the org and how it is used. 13 | 14 | As input, this script requires a file named 'orgs.txt' containing 15 | the name of one GitHub org per line residing in the same folder 16 | as this script. 17 | 18 | Your API key should be stored in a file called gh_key in the 19 | same folder as this script. 20 | 21 | As output: 22 | * A message about each org being processed will be printed to the screen. 23 | * the script creates a csv file stored in a subdirectory 24 | of the folder with the script called "output" with the filename in 25 | this format with today's date. 26 | 27 | output/mystery_orgs_2022-01-14.csv 28 | """ 29 | 30 | import sys 31 | from common_functions import read_key 32 | 33 | def make_query(): 34 | """Creates and returns a GraphQL query""" 35 | return """query OrgQuery($org_name: String!)
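# Fetch basic org metadata plus the first few members as possible contacts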
{ 36 | organization(login:$org_name) { 37 | name 38 | url 39 | websiteUrl 40 | createdAt 41 | updatedAt 42 | membersWithRole(first: 15){ 43 | nodes{ 44 | login 45 | name 46 | email 47 | company 48 | } 49 | } 50 | } 51 | }""" 52 | 53 | # Read GitHub key from file using the read_key function in 54 | # common_functions.py 55 | try: 56 | api_token = read_key('gh_key') 57 | 58 | except: 59 | print("Error reading GH Key. This script depends on the existence of a file called gh_key containing your GitHub API token. Exiting") 60 | sys.exit() 61 | 62 | def get_org_data(api_token): 63 | """Executes the GraphQL query to get org data from one or more GitHub orgs. 64 | 65 | Parameters 66 | ---------- 67 | api_token : str 68 | The GH API token retrieved from the gh_key file. 69 | 70 | Output 71 | ------- 72 | Writes a csv file of the form 'mystery_orgs_2022-01-16.csv' with today's date 73 | """ 74 | 75 | import requests 76 | import json 77 | import sys 78 | import csv 79 | from datetime import datetime 80 | from common_functions import read_orgs, create_file 81 | 82 | url = 'https://api.github.com/graphql' 83 | headers = {'Authorization': 'token %s' % api_token} 84 | 85 | # Read list of orgs from a file 86 | try: 87 | org_list = read_orgs('orgs.txt') 88 | except: 89 | print("Error reading orgs. This script depends on the existence of a file called orgs.txt containing one org per line. Exiting") 90 | sys.exit() 91 | 92 | # Initialize list of lists with a header row. 93 | # Each embedded list will become a row in the csv file 94 | all_rows = [['org_name', 'org_url', 'website', 'org_createdAt', 'org_updatedAt', 'people(login,name,email,company):repeat']] 95 | 96 | for org_name in org_list: 97 | 98 | print("Processing", org_name) 99 | 100 | row = [] 101 | query = make_query() 102 | 103 | variables = {"org_name": org_name} 104 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 105 | json_data = json.loads(r.text) 106 | 107 | # Take the json_data file and expand the info about people horizontally into the 108 | # same row as the rest of the data about that org. 109 | try: 110 | for key in json_data['data']['organization']: 111 | if key == 'membersWithRole': 112 | for nkey in json_data['data']['organization'][key]['nodes']: 113 | row.append(nkey['login']) 114 | row.append(nkey['name']) 115 | row.append(nkey['email']) 116 | row.append(nkey['company']) 117 | else: 118 | row.append(json_data['data']['organization'][key]) 119 | all_rows.append(row) 120 | except: 121 | pass 122 | 123 | # prepare file and write rows to csv 124 | 125 | try: 126 | file, file_path = create_file("mystery_orgs") 127 | 128 | with file: 129 | write = csv.writer(file) 130 | write.writerows(all_rows) 131 | 132 | except: 133 | print('Could not write to csv file. This may be because the output directory is missing or you do not have permissions to write to it. Exiting') 134 | 135 | get_org_data(api_token) 136 | -------------------------------------------------------------------------------- /scripts/monitoring.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2022 VMware, Inc. 4 | # SPDX-License-Identifier: BSD-2-Clause 5 | # Author: Dawn M. Foster 6 | 7 | """Calculate OSSF's Criticality Score for pinned repos on a list of orgs 8 | 9 | This script uses the GitHub GraphQL API to retrieve a list of pinned 10 | repositories from a list of GitHub organizations and then runs 11 | criticality score on each of those repos.
It can also take an individual 12 | repo URL as one of the inputs. 13 | 14 | As input, this script requires a file named 'monitoring.txt' residing in 15 | the same folder as this script. This file should contain the name of one 16 | GitHub org or URL to an individual repository per line. Example file format: 17 | vmware 18 | https://github.com/greenplum-db/gpdb 19 | vmware-tanzu 20 | 21 | Your API key should be stored in a file called gh_key in the 22 | same folder as this script. 23 | 24 | This script requires that `pandas` be installed within the Python 25 | environment you are running this script in. 26 | 27 | This script depends on another tool called Criticality Score to run. 28 | See https://github.com/ossf/criticality_score for more details, including 29 | how to set up a required environment variable. This script requires that 30 | you have this tool installed, and it might only run on mac / linux. 31 | 32 | As output: 33 | * A message about each repo being processed will be printed to the screen. 34 | * the script creates a csv file stored in a subdirectory 35 | of the folder with the script called "output" with the filename in 36 | this format with today's date. 37 | 38 | output/monitoring_2023-01-28.csv 39 | 40 | """ 41 | 42 | import sys 43 | import subprocess 44 | import os 45 | import requests 46 | import json 47 | import csv 48 | from common_functions import create_file, read_key, read_orgs 49 | 50 | # Read GitHub key from file using the read_key function in 51 | # common_functions.py 52 | try: 53 | api_token = read_key('gh_key') 54 | 55 | except: 56 | print("Error reading GH Key. This script depends on the existence of a file called gh_key containing your GitHub API token. Exiting") 57 | sys.exit() 58 | 59 | # Use the token to set the environment variable required by criticality_score 60 | os.environ['GITHUB_AUTH_TOKEN'] = api_token 61 | 62 | def make_query(): 63 | """Creates and returns a GraphQL query to get pinned repos for an org""" 64 | return """query pinned($org_name: String!) { 65 | organization(login:$org_name) { 66 | pinnedItems(first: 10, types: REPOSITORY) { 67 | nodes { 68 | ... on Repository { 69 | url 70 | } 71 | } 72 | } 73 | } 74 | }""" 75 | 76 | # Read the list of orgs and repo URLs from monitoring.txt 77 | org_list = read_orgs('monitoring.txt') 78 | 79 | # Set up the parameters needed to use GitHub's GraphQL API 80 | url = 'https://api.github.com/graphql' 81 | headers = {'Authorization': 'token %s' % api_token} 82 | 83 | # Iterate through each GH org, run the query to get pinned repos, 84 | # and store the repo URLs in repo_list 85 | repo_list = [] 86 | 87 | for org_name in org_list: 88 | # Handle both orgs and individual repos: 89 | # if the line starts with http, run criticality score on that URL directly; 90 | # otherwise, run this query for pinned items on the org 91 | 92 | if org_name.startswith('http'): 93 | repo_list.append(org_name) 94 | 95 | else: 96 | query = make_query() 97 | 98 | variables = {"org_name": org_name} 99 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 100 | json_data = json.loads(r.text) 101 | 102 | # Wrap in try/except for when org isn't valid.
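# (an org that doesn't exist or has nothing pinned typically comes back
# with a null 'organization' value, which raises an exception when indexed)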
103 | try: 104 | for url_dict in json_data['data']['organization']['pinnedItems']['nodes']: 105 | repo_list.append(url_dict['url']) 106 | except: 107 | print("Could not get data on", org_name, "- check to make sure the org name is correct and has pinned repos") 108 | 109 | # For each repo in repo_list, run criticality_score and append 110 | # the json output to csv_row_list 111 | csv_row_list = [] 112 | 113 | for repo in repo_list: 114 | cmd_str = 'criticality_score --repo ' + repo + ' --format json' 115 | print("Processing", repo) 116 | try: 117 | proc = subprocess.Popen(cmd_str, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, shell=True) 118 | out, err = proc.communicate() 119 | 120 | if not err: 121 | json_str = out.decode("utf-8") 122 | csv_row_list.append(json.loads(json_str)) 123 | else: 124 | print('Error calculating scores', repo) 125 | except: 126 | print('Error calculating scores', repo) 127 | 128 | # Create csv output file and write to it 129 | 130 | keys = csv_row_list[0].keys() 131 | 132 | file, file_path = create_file("monitoring") 133 | 134 | with open(file_path, 'w', newline='') as output_file: 135 | dict_writer = csv.DictWriter(output_file, keys) 136 | dict_writer.writeheader() 137 | dict_writer.writerows(csv_row_list) 138 | -------------------------------------------------------------------------------- /scripts/keyword_by_repo.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2022 VMware, Inc. 4 | # SPDX-License-Identifier: BSD-2-Clause 5 | # Author: Dawn M. Foster 6 | 7 | """ Search for repos mentioning a keyword 8 | This script uses the GitHub GraphQL API to retrieve relevant 9 | information about repositories mentioning certain keywords. 10 | 11 | As input, this script requires a file named 'keywords.txt' containing 12 | one keyword per line residing in the same folder as this script. 13 | 14 | Your API key should be stored in a file called gh_key in the 15 | same folder as this script. 16 | 17 | This script requires that `pandas` be installed within the Python 18 | environment you are running this script in. 19 | 20 | As output: 21 | * A message about each keyword being processed will be printed to the screen. 22 | * the script creates a csv file stored in a subdirectory 23 | of the folder with the script called "output" with the filename in 24 | this format with today's date. 25 | Example: "output/keyword_search_2022-07-22.csv" 26 | """ 27 | 28 | import sys 29 | from common_functions import read_key, create_file 30 | 31 | def make_query(after_cursor = None): 32 | """Creates and returns a GraphQL query with cursor for pagination""" 33 | 34 | return """query MyQuery ($keyword: String!){ 35 | search(query: $keyword, type: REPOSITORY, first: 100, after: AFTER) { 36 | pageInfo { 37 | hasNextPage 38 | endCursor 39 | } 40 | nodes { 41 | ... on Repository { 42 | nameWithOwner 43 | name 44 | owner{ 45 | login 46 | } 47 | url 48 | description 49 | updatedAt 50 | createdAt 51 | isFork 52 | isEmpty 53 | isArchived 54 | forkCount 55 | stargazerCount 56 | } 57 | } 58 | } 59 | }""".replace( 60 | "AFTER", '"{}"'.format(after_cursor) if after_cursor else "null" 61 | ) 62 | 63 | def get_repo_data(api_token): 64 | """Executes the GraphQL query to get repository data from one or more GitHub orgs. 65 | 66 | Parameters 67 | ---------- 68 | api_token : str 69 | The GH API token retrieved from the gh_key file.
70 | 71 | Returns 72 | ------- 73 | repo_info_df : pandas.core.frame.DataFrame 74 | """ 75 | import requests 76 | import json 77 | import pandas as pd 78 | from common_functions import read_file 79 | 80 | url = 'https://api.github.com/graphql' 81 | headers = {'Authorization': 'token %s' % api_token} 82 | 83 | repo_info_df = pd.DataFrame() 84 | 85 | # Read list of keywords from a file 86 | 87 | try: 88 | keyword_list = read_file('keywords.txt') 89 | except: 90 | print("Error reading keywords. This script depends on the existence of a file called keywords.txt containing one keyword per line. Exiting") 91 | sys.exit() 92 | 93 | for keyword in keyword_list: 94 | has_next_page = True 95 | after_cursor = None 96 | 97 | print("Processing", keyword) 98 | 99 | while has_next_page: 100 | 101 | try: 102 | query = make_query(after_cursor) 103 | 104 | variables = {"keyword": keyword} 105 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 106 | json_data = json.loads(r.text) 107 | 108 | df_temp = pd.DataFrame(json_data["data"]["search"]["nodes"]) 109 | repo_info_df = pd.concat([repo_info_df, df_temp]) 110 | 111 | has_next_page = json_data["data"]["search"]["pageInfo"]["hasNextPage"] 112 | after_cursor = json_data["data"]["search"]["pageInfo"]["endCursor"] 113 | except: 114 | has_next_page = False 115 | print("ERROR Cannot process", keyword) 116 | 117 | return repo_info_df 118 | 119 | # Read GitHub key from file using the read_key function in 120 | # common_functions.py 121 | try: 122 | api_token = read_key('gh_key') 123 | 124 | except: 125 | print("Error reading GH Key. This script depends on the existence of a file called gh_key containing your GitHub API token. Exiting") 126 | sys.exit() 127 | 128 | repo_info_df = get_repo_data(api_token) 129 | 130 | def expand_owner(owner): 131 | import pandas as pd 132 | if pd.isnull(owner): 133 | owner = 'Not Found' 134 | else: 135 | owner = owner['login'] 136 | return owner 137 | 138 | repo_info_df['owner_name'] = repo_info_df['owner'].apply(expand_owner) 139 | repo_info_df = repo_info_df.drop(columns=['owner']) 140 | 141 | # Reformat to put columns in a logical order 142 | repo_info_df = repo_info_df[['owner_name','name','nameWithOwner','url','description','updatedAt','createdAt','isFork','isEmpty','isArchived','forkCount','stargazerCount']] 143 | 144 | # prepare file and write dataframe to csv 145 | 146 | try: 147 | file, file_path = create_file("keyword_search") 148 | repo_info_df.to_csv(file_path, index=False) 149 | 150 | except: 151 | print('Could not write to csv file. This may be because the output directory is missing or you do not have permissions to write to it. Exiting') 152 | -------------------------------------------------------------------------------- /scripts/README.md: -------------------------------------------------------------------------------- 1 | # Python Scripts 2 | 3 | These Python scripts use the GitHub APIs to gather data. 4 | 5 | ## Acceptable Use 6 | 7 | Note: Some of these scripts gather names and email addresses, which we use 8 | to help us find a contact if we have questions about a 9 | repository or org. Be aware that the [GitHub Acceptable Use 10 | Policies](https://docs.github.com/en/github/site-policy/github-acceptable-use-policies) 11 | prohibit certain uses of information, and I would encourage you to read 12 | this policy and not use scripts like these for unethical purposes.
13 | 14 | ## Requirements 15 | 16 | The scripts all have a few common requirements, and individual 17 | scripts may have additional requirements and other information 18 | which can be found in the Docstrings. 19 | 20 | * These scripts require that `pandas` be installed within the Python 21 | environment you are running these scripts in. 22 | * Your API key should be stored in a file called gh_key in the 23 | same folder as these scripts. 24 | * Most scripts also require an orgs.txt or other text file used as 25 | input. Details can be found in the docstring for each script. 26 | * Most scripts require that a folder named "output" exists in this 27 | scripts directory, and csv output files will be stored there. 28 | 29 | ## Scripts 30 | 31 | ### Inclusivity Check 32 | 33 | This script uses the GitHub GraphQL API to retrieve default branch 34 | name and code of conduct for each repo in a GitHub org for a very 35 | quick, but rudimentary inclusivity check. 36 | 37 | **Running the script** 38 | 39 | Requires orgs.txt 40 | ``` 41 | $python3 inclusivity_check.py 42 | ``` 43 | 44 | ### Repository Activity 45 | These scripts demonstrate the difference in speed and 46 | rate limits between the GitHub REST API and the GraphQL API. The original 47 | REST script took hours to run across our 60+ GitHub orgs and had to be 48 | slowed down to avoid hitting the rate limit, while the GraphQL version, 49 | which gathers the same data, runs in less than 15 minutes without hitting 50 | any rate limits. 51 | 52 | scripts/repo_activity.py 53 | scripts/repo_activity_coc.py 54 | scripts/repo_activity_REST.py 55 | 56 | We used this script to gather basic data about the repositories found across 57 | dozens of an organization's GitHub orgs. We use this to understand whether 58 | projects are meeting our compliance requirements. We also use this script to 59 | find abandoned repos that have outlived their usefulness and should be 60 | archived. 61 | 62 | Note: repo_activity_coc.py is mostly identical to repo_activity.py, 63 | but it adds info about the code of conduct. This is a separate script 64 | because the codeOfConduct object in the GraphQL API is a bit problematic 65 | and tends to time out when getting relatively small amounts of data. 66 | 67 | **Running the scripts** 68 | 69 | Requires orgs.txt 70 | ``` 71 | $python3 repo_activity.py 72 | ``` 73 | 74 | ### Sunset 75 | 76 | This script uses the GitHub GraphQL API to gather data to determine 77 | whether a repo can be archived. It retrieves relevant 78 | information about a repository, including forks to determine ownership 79 | and possibly contact people to understand how they are using a project. 80 | 81 | As input, this script requires a GitHub URL for a repository or a csv 82 | file containing one repo_name,org_name pair per line. 83 | 84 | **Running the script** 85 | 86 | Run the script with one repo url as input 87 | ``` 88 | $python3 sunset.py -u "https://github.com/org_name/repo_name" 89 | ``` 90 | 91 | Run the script with a csv file containing one repo_name,org_name pair 92 | per line: 93 | ``` 94 | $python3 sunset.py -f sunset.csv 95 | ``` 96 | 97 | ### Monitoring 98 | 99 | This script uses the GitHub GraphQL API to retrieve the pinned repos 100 | for each GitHub org listed in monitoring.txt and runs criticality score 101 | for each of those pinned repositories.
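An example monitoring.txt (the entries below come from the script's docstring) can mix org names and individual repo URLs:
```
vmware
https://github.com/greenplum-db/gpdb
vmware-tanzu
```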
102 | 103 | **Running the script** 104 | 105 | Requires monitoring.txt 106 | ``` 107 | $python3 monitoring.py 108 | ``` 109 | 110 | ### Keyword by Repo with Optional Filter 111 | 112 | The keyword_by_repo script uses the GitHub GraphQL API to retrieve 113 | relevant information about repositories mentioning certain keywords. 114 | 115 | The filter_keyword_by_org script uses the results from a keyword search 116 | and filters it based on a list of GitHub organizations. 117 | 118 | As input, this script requires a file generated by the keyword_by_repo.py 119 | script. This is provided via a command line argument. 120 | 121 | **Running the scripts** 122 | 123 | keyword_by_repo requires keywords.txt 124 | ``` 125 | $python3 keyword_by_repo.py 126 | ``` 127 | 128 | filter_keyword_by_org requires output file from keyword_by_repo 129 | ``` 130 | $python3 filter_keyword_by_org.py /path/to/keyword_search_2022-07-26.csv 131 | ``` 132 | 133 | ### Mystery GitHub Organizations 134 | 135 | We can use this script to gather basic data about GitHub orgs that 136 | we believe may have been created outside of our process by various 137 | employees across our business units. We gather the first few members 138 | of the org to help identify employees who can provide more details 139 | about the purpose of the org and how it is used. 140 | 141 | scripts/mystery_orgs.py 142 | 143 | However, since members are private by default, this script may not 144 | be as useful as just running repo_activity.py on those same orgs 145 | to also learn more about the repos and get better contact info 146 | from the commit data. 147 | 148 | **Running the script** 149 | 150 | Requires orgs.txt 151 | ``` 152 | $python3 mystery_orgs.py 153 | ``` 154 | 155 | -------------------------------------------------------------------------------- /CODE-OF-CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in this project and our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, religion, or sexual identity 10 | and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 
14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the 26 | overall community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or 31 | advances of any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email 35 | address, without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to Dawn Foster - dawn@dawnfoster.com. 63 | All complaints will be reviewed and investigated promptly and fairly. 64 | 65 | All community leaders are obligated to respect the privacy and security of the 66 | reporter of any incident. 67 | 68 | ## Enforcement Guidelines 69 | 70 | Community leaders will follow these Community Impact Guidelines in determining 71 | the consequences for any action they deem in violation of this Code of Conduct: 72 | 73 | ### 1. Correction 74 | 75 | **Community Impact**: Use of inappropriate language or other behavior deemed 76 | unprofessional or unwelcome in the community. 77 | 78 | **Consequence**: A private, written warning from community leaders, providing 79 | clarity around the nature of the violation and an explanation of why the 80 | behavior was inappropriate. A public apology may be requested. 81 | 82 | ### 2. Warning 83 | 84 | **Community Impact**: A violation through a single incident or series 85 | of actions. 86 | 87 | **Consequence**: A warning with consequences for continued behavior. No 88 | interaction with the people involved, including unsolicited interaction with 89 | those enforcing the Code of Conduct, for a specified period of time. This 90 | includes avoiding interactions in community spaces as well as external channels 91 | like social media.
Violating these terms may lead to a temporary or 92 | permanent ban. 93 | 94 | ### 3. Temporary Ban 95 | 96 | **Community Impact**: A serious violation of community standards, including 97 | sustained inappropriate behavior. 98 | 99 | **Consequence**: A temporary ban from any sort of interaction or public 100 | communication with the community for a specified period of time. No public or 101 | private interaction with the people involved, including unsolicited interaction 102 | with those enforcing the Code of Conduct, is allowed during this period. 103 | Violating these terms may lead to a permanent ban. 104 | 105 | ### 4. Permanent Ban 106 | 107 | **Community Impact**: Demonstrating a pattern of violation of community 108 | standards, including sustained inappropriate behavior, harassment of an 109 | individual, or aggression toward or disparagement of classes of individuals. 110 | 111 | **Consequence**: A permanent ban from any sort of public interaction within 112 | the community. 113 | 114 | ## Attribution 115 | 116 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 117 | version 2.0, available at 118 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. 119 | 120 | Community Impact Guidelines were inspired by [Mozilla's code of conduct 121 | enforcement ladder](https://github.com/mozilla/diversity). 122 | 123 | [homepage]: https://www.contributor-covenant.org 124 | 125 | For answers to common questions about this code of conduct, see the FAQ at 126 | https://www.contributor-covenant.org/faq. Translations are available at 127 | https://www.contributor-covenant.org/translations. 128 | -------------------------------------------------------------------------------- /scripts/common_functions.py: -------------------------------------------------------------------------------- 1 | # Copyright 2022 VMware, Inc. 2 | # SPDX-License-Identifier: BSD-2-Clause 3 | 4 | """Common Functions 5 | This file contains some common functions that are used within 6 | the other scripts in this repo. 7 | 8 | This file can also be imported as a module. 9 | """ 10 | 11 | def read_key(file_name): 12 | """Retrieves a GitHub API key from a file. 13 | 14 | Parameters 15 | ---------- 16 | file_name : str 17 | 18 | Returns 19 | ------- 20 | key : str 21 | """ 22 | 23 | from os.path import dirname, join 24 | 25 | # Reads the first line of a file containing the GitHub API key 26 | # Usage: key = read_key('gh_key') 27 | 28 | current_dir = dirname(__file__) 29 | file2 = "./" + file_name 30 | file_path = join(current_dir, file2) 31 | 32 | with open(file_path, 'r') as kf: 33 | key = kf.readline().rstrip() # remove newline & trailing whitespace 34 | return key 35 | 36 | def read_orgs(file_name): 37 | """Retrieves a list of orgs from a file. 38 | 39 | Parameters 40 | ---------- 41 | file_name : str 42 | 43 | Returns 44 | ------- 45 | org_list : list 46 | """ 47 | import csv 48 | 49 | org_list = [] 50 | 51 | with open(file_name) as orgfile: 52 | orgs = csv.reader(orgfile) 53 | for row in orgs: 54 | org_list.append(row[0]) 55 | 56 | return org_list 57 | 58 | def read_file(file_name): 59 | """Retrieves a list from a file. 
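(Similar to read_orgs: it returns the first csv column of each row, and is used for generic input files such as keywords.txt.)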
60 | 61 | Parameters 62 | ---------- 63 | file_name : str 64 | 65 | Returns 66 | ------- 67 | content_list : list 68 | """ 69 | import csv 70 | 71 | content_list = [] 72 | 73 | with open(file_name) as in_file: 74 | content = csv.reader(in_file) 75 | for row in content: 76 | content_list.append(row[0]) 77 | 78 | return content_list 79 | 80 | def expand_name_df(df,old_col,new_col): 81 | """Takes a dataframe df with an API JSON object with nested elements in old_col, 82 | extracts the name, and saves it in a new dataframe column called new_col 83 | 84 | Parameters 85 | ---------- 86 | df : dataframe 87 | old_col : str 88 | new_col : str 89 | 90 | Returns 91 | ------- 92 | df : dataframe 93 | """ 94 | 95 | import pandas as pd 96 | 97 | def expand_name(nested_name): 98 | """Takes an API JSON object with nested elements and extracts the name 99 | Parameters 100 | ---------- 101 | nested_name : JSON API object 102 | 103 | Returns 104 | ------- 105 | object_name : str 106 | """ 107 | if pd.isnull(nested_name): 108 | object_name = 'Not Found' 109 | else: 110 | object_name = nested_name['name'] 111 | return object_name 112 | 113 | df[new_col] = df[old_col].apply(expand_name) 114 | return df 115 | 116 | 117 | def get_criticality(org_name, repo_name, api_token): 118 | """See https://github.com/ossf/criticality_score for more details 119 | This function requires that you have version 1.0.7 of this tool 120 | installed (the older Python version; the final Python version 121 | doesn't work within this script, possibly because of how its 122 | deprecation warnings are implemented). You can install 123 | the correct version using: 124 | pip install criticality-score==1.0.7 125 | 126 | Parameters 127 | ---------- 128 | org_name : str 129 | repo_name : str 130 | api_token : str 131 | 132 | Returns 133 | ------- 134 | dependents_count : str 135 | Numeric integer that is returned as a string 136 | criticality_score : str 137 | This value ranges from 0 to 1 (like a float) with lower scores indicating less critical projects. 138 | 139 | """ 140 | 141 | import subprocess 142 | import os 143 | 144 | os.environ['GITHUB_AUTH_TOKEN'] = api_token 145 | 146 | cmd_str = 'criticality_score --repo github.com/' + org_name + '/' + repo_name + ' --format csv' 147 | 148 | try: 149 | proc = subprocess.Popen(cmd_str, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, shell=True) 150 | out, err = proc.communicate() 151 | 152 | if not err: 153 | csv_str = out.decode("utf-8") 154 | items = csv_str.split(',') 155 | dependents_count = items[25] 156 | criticality_score = items[26].rstrip() 157 | else: 158 | dependents_count = None 159 | criticality_score = None 160 | except: 161 | dependents_count = None 162 | criticality_score = None 163 | 164 | return dependents_count, criticality_score 165 | 166 | def create_file(pre_string): 167 | """Creates an output file in an "output" directory with today's date 168 | as part of the filename and prints the file_path to the terminal to 169 | make it easier to open the output file. 170 | 171 | Parameters 172 | ---------- 173 | pre_string : str 174 | This is the string that will preface today's date in the filename 175 | 176 | Returns 177 | ------- 178 | file : file object 179 | file_path : str 180 | This is the full path to the file name for the output.
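
    Usage: file, file_path = create_file("monitoring")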
181 | 182 | """ 183 | from datetime import datetime 184 | from os.path import dirname, join 185 | 186 | today = datetime.today().strftime('%Y-%m-%d') 187 | output_filename = "./output/" + pre_string + "_" + today + ".csv" 188 | current_dir = dirname(__file__) 189 | file_path = join(current_dir, output_filename) 190 | file = open(file_path, 'w', newline ='') 191 | 192 | print("Output file:\n", file_path, sep="") 193 | 194 | return file, file_path 195 | -------------------------------------------------------------------------------- /scripts/inclusivity_check.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2022 VMware, Inc. 4 | # SPDX-License-Identifier: BSD-2-Clause 5 | # Author: Dawn M. Foster 6 | 7 | """Quick Inclusivity Check for GitHub orgs 8 | This script uses the GitHub GraphQL API to retrieve default branch 9 | name and code of conduct for each repo in a GitHub org. 10 | 11 | As input, this script requires a file named 'orgs.txt' containing 12 | the name of one GitHub org per line residing in the same folder 13 | as this script. 14 | 15 | Your API key should be stored in a file called gh_key in the 16 | same folder as this script. 17 | 18 | This script requires that `pandas` be installed within the Python 19 | environment you are running this script in. 20 | 21 | As output: 22 | * A message will be printed to the screen for any org with a default 23 | branch name of "master" or a missing / unrecognized code of conduct. 24 | Orgs that are forks of another repo, private, empty, or archived 25 | will not be printed, but the details will be written to the csv file. 26 | * The script creates a csv file stored in an subdirectory 27 | of the folder with the script called "output" with the filename in 28 | this format with today's date. All details are written to the csv file 29 | including repos that aren't printed to the screen. 30 | 31 | output/inclusivity_check_2022-01-14.csv" 32 | 33 | """ 34 | 35 | import sys 36 | from common_functions import read_key, expand_name_df, create_file 37 | 38 | def make_query(after_cursor = None): 39 | """Creates and returns a GraphQL query with cursor for pagination""" 40 | 41 | return """query RepoQuery($org_name: String!) { 42 | organization(login: $org_name) { 43 | repositories (first: 5 after: AFTER){ 44 | pageInfo { 45 | hasNextPage 46 | endCursor 47 | } 48 | nodes { 49 | nameWithOwner 50 | defaultBranchRef { 51 | name 52 | } 53 | codeOfConduct{ 54 | url 55 | } 56 | isPrivate 57 | isFork 58 | isEmpty 59 | isArchived 60 | } 61 | } 62 | } 63 | }""".replace( 64 | "AFTER", '"{}"'.format(after_cursor) if after_cursor else "null" 65 | ) 66 | 67 | # Read GitHub key from file using the read_key function in 68 | # common_functions.py 69 | try: 70 | api_token = read_key('gh_key') 71 | 72 | except: 73 | print("Error reading GH Key. This script depends on the existance of a file called gh_key containing your GitHub API token. Exiting") 74 | sys.exit() 75 | 76 | def get_repo_data(api_token): 77 | """Executes the GraphQL query to get repository data from one or more GitHub orgs. 78 | 79 | Parameters 80 | ---------- 81 | api_token : str 82 | The GH API token retrieved from the gh_key file. 
83 | 84 | Returns 85 | ------- 86 | repo_info_df : pandas.core.frame.DataFrame 87 | """ 88 | import requests 89 | import json 90 | import pandas as pd 91 | from common_functions import read_orgs 92 | 93 | url = 'https://api.github.com/graphql' 94 | headers = {'Authorization': 'token %s' % api_token} 95 | 96 | repo_info_df = pd.DataFrame() 97 | 98 | # Read list of orgs from a file 99 | 100 | try: 101 | org_list = read_orgs('orgs.txt') 102 | except: 103 | print("Error reading orgs. This script depends on the existence of a file called orgs.txt containing one org per line. Exiting") 104 | sys.exit() 105 | 106 | for org_name in org_list: 107 | has_next_page = True 108 | after_cursor = None 109 | 110 | print("Processing", org_name) 111 | 112 | while has_next_page: 113 | 114 | try: 115 | query = make_query(after_cursor) 116 | 117 | variables = {"org_name": org_name} 118 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 119 | json_data = json.loads(r.text) 120 | 121 | df_temp = pd.DataFrame(json_data['data']['organization']['repositories']['nodes']) 122 | repo_info_df = pd.concat([repo_info_df, df_temp]) 123 | 124 | has_next_page = json_data["data"]["organization"]["repositories"]["pageInfo"]["hasNextPage"] 125 | 126 | after_cursor = json_data["data"]["organization"]["repositories"]["pageInfo"]["endCursor"] 127 | except: 128 | has_next_page = False 129 | print("ERROR Cannot process", org_name) 130 | 131 | return repo_info_df 132 | 133 | repo_info_df = get_repo_data(api_token) 134 | 135 | # This section reformats the output into what we need in the csv file 136 | repo_info_df = expand_name_df(repo_info_df,'defaultBranchRef','defaultBranch') 137 | 138 | def expand_coc(coc): 139 | import pandas as pd 140 | if pd.isnull(coc): 141 | coc_url = 'Not Found' 142 | else: 143 | coc_url = coc['url'] 144 | return coc_url 145 | 146 | repo_info_df['codeOfConduct_url'] = repo_info_df['codeOfConduct'].apply(expand_coc) 147 | repo_info_df = repo_info_df.drop(columns=['defaultBranchRef', 'codeOfConduct']) 148 | 149 | # prepare file and write dataframe to csv 150 | 151 | try: 152 | file, file_path = create_file("inclusivity_check") 153 | repo_info_df.to_csv(file_path, index=False) 154 | 155 | except: 156 | print('Could not write to csv file. This may be because the output directory is missing or you do not have permissions to write to it. Exiting') 157 | 158 | print("repo branch Code of Conduct") 159 | for rows in repo_info_df.iterrows(): 160 | repo = rows[1]['nameWithOwner'] 161 | branch = rows[1]['defaultBranch'] 162 | coc = rows[1]['codeOfConduct_url'] 163 | private = rows[1]['isPrivate'] 164 | fork = rows[1]['isFork'] 165 | empty = rows[1]['isEmpty'] 166 | archive = rows[1]['isArchived'] 167 | if private or fork or empty or archive: 168 | pass 169 | elif (branch == 'master' or coc == 'Not Found'): 170 | print(repo, branch, coc) 171 | print("\nMore details can be found in", file_path) 172 | -------------------------------------------------------------------------------- /scripts/pr_activity.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2022 VMware, Inc. 4 | # SPDX-License-Identifier: BSD-2-Clause 5 | # Author: Dawn M. Foster 6 | 7 | """ Gathers some basic data about PRs in a specific repo starting with the 8 | most recent PRs.
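For each PR, the GraphQL query below gathers createdAt, mergedAt, additions, deletions, changedFiles, state, and the comment count, along with the author's login, name, and total PR count.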
9 | 10 | Usage 11 | ----- 12 | 13 | pr_activity.py [-h] -o ORG_NAME -r REPO_NAME [-n NUM_PAGES] 14 | -h, --help show this help message and exit 15 | -o ORG_NAME, --org ORG_NAME 16 | The name of the GitHub organization where your repo is found (required) 17 | -r REPO_NAME, --repo REPO_NAME 18 | The name of a GitHub repository in that org where your PRs can be found (required) 19 | -n NUM_PAGES, --num NUM_PAGES 20 | The number of pages of results with 10 results per page (10 = 100 results) - default is 10 21 | 22 | Output 23 | ------ 24 | 25 | * the script creates a csv file stored in a subdirectory 26 | of the folder with the script called "output" with the filename in 27 | this format with today's date. 28 | 29 | output/pr_activity_2022-01-14.csv 30 | """ 31 | import sys 32 | import argparse 33 | import pandas as pd 34 | from common_functions import read_key, create_file 35 | 36 | def make_query(before_cursor = None): 37 | """Creates and returns a GraphQL query with cursor for pagination""" 38 | 39 | return """query repo($org_name: String!, $repo_name: String!){ 40 | repository(owner: $org_name, name: $repo_name) { 41 | pullRequests (last: 10, before: BEFORE) { 42 | pageInfo{ 43 | hasPreviousPage 44 | startCursor 45 | } 46 | nodes { 47 | createdAt 48 | mergedAt 49 | additions 50 | deletions 51 | changedFiles 52 | state 53 | comments{ 54 | totalCount 55 | } 56 | author{ 57 | ... on User{ 58 | login 59 | name 60 | pullRequests{ 61 | totalCount 62 | } 63 | } 64 | } 65 | } 66 | } 67 | } 68 | }""".replace( 69 | "BEFORE", '"{}"'.format(before_cursor) if before_cursor else "null" 70 | ) 71 | 72 | parser = argparse.ArgumentParser() 73 | 74 | parser.add_argument("-o", "--org", required=True, dest = "org_name", help="The name of the GitHub organization where your repo is found (required)") 75 | parser.add_argument("-r", "--repo", required=True, dest = "repo_name", help="The name of a GitHub repository in that org where your PRs can be found (required)") 76 | parser.add_argument("-n", "--num", dest = "num_pages", default=10, type=int, help="The number of pages of results with 10 results per page (10 = 100 results) - default is 10") 77 | 78 | args = parser.parse_args() 79 | 80 | # Read GitHub key from file using the read_key function in 81 | # common_functions.py 82 | try: 83 | api_token = read_key('gh_key') 84 | 85 | except: 86 | print("Error reading GH Key. This script depends on the existence of a file called gh_key containing your GitHub API token. Exiting") 87 | sys.exit() 88 | 89 | def get_pr_data(api_token, org_name, repo_name, num_pages): 90 | """Executes the GraphQL query to get PR data from a repository. 91 | 92 | Parameters 93 | ---------- 94 | api_token : str 95 | The GH API token retrieved from the gh_key file.
96 | org_name : str 97 | The name of the GitHub organization where your repo is found 98 | repo_name : str 99 | The name of a GitHub repository in that org where your PRs can be found 100 | num_pages : int 101 | The number of pages of results with 10 results per page (10 = 100 results) 102 | 103 | Returns 104 | ------- 105 | repo_info_df : pandas.core.frame.DataFrame 106 | """ 107 | import requests 108 | import json 109 | import pandas as pd 110 | 111 | url = 'https://api.github.com/graphql' 112 | headers = {'Authorization': 'token %s' % api_token} 113 | 114 | repo_info_df = pd.DataFrame() 115 | 116 | has_previous_page = True 117 | before_cursor = None 118 | 119 | i = 1 # Iterator starts at page 1 120 | 121 | while has_previous_page and i <= num_pages: 122 | i+=1 123 | try: 124 | query = make_query(before_cursor) 125 | 126 | variables = {"org_name": org_name, "repo_name": repo_name} 127 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 128 | json_data = json.loads(r.text) 129 | 130 | df_temp = pd.DataFrame(json_data["data"]["repository"]["pullRequests"]["nodes"]) 131 | repo_info_df = pd.concat([repo_info_df, df_temp]) 132 | 133 | has_previous_page = json_data["data"]["repository"]["pullRequests"]["pageInfo"]["hasPreviousPage"] 134 | before_cursor = json_data["data"]["repository"]["pullRequests"]["pageInfo"]["startCursor"] 135 | 136 | status = "OK" 137 | except: 138 | has_previous_page = False 139 | status = "Error" 140 | 141 | return repo_info_df, status 142 | 143 | repo_info_df, status = get_pr_data(api_token, args.org_name, args.repo_name, args.num_pages) 144 | 145 | def expand_author(author): 146 | import pandas as pd 147 | if pd.isnull(author): 148 | author_list = [None, None, None] 149 | else: 150 | author_list = [author['login'], author['name'], author['pullRequests']['totalCount']] 151 | return author_list 152 | 153 | def expand_count(comments): 154 | import pandas as pd 155 | if pd.isnull(comments): 156 | comment_ct = 0 157 | else: 158 | comment_ct = comments['totalCount'] 159 | return comment_ct 160 | 161 | repo_info_df['author_list'] = repo_info_df['author'].apply(expand_author) 162 | repo_info_df[['author_login', 'author_name', 'author_pr_count', ]] = pd.DataFrame(repo_info_df.author_list.tolist(), index= repo_info_df.index) 163 | repo_info_df['comment_ct'] = repo_info_df['comments'].apply(expand_count) 164 | repo_info_df = repo_info_df.drop(columns=['author','author_list','comments']) 165 | 166 | # prepare file and write dataframe to csv 167 | 168 | try: 169 | file, file_path = create_file("pr_activity") 170 | repo_info_df.to_csv(file_path, index=False) 171 | 172 | except: 173 | print('Could not write to csv file. This may be because the output directory is missing or you do not have permissions to write to it. Exiting') 174 | 175 | -------------------------------------------------------------------------------- /scripts/repo_activity.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2022 VMware, Inc. 4 | # SPDX-License-Identifier: BSD-2-Clause 5 | 6 | """Repo Activity GraphQL Version 7 | This script uses the GitHub GraphQL API to retrieve relevant 8 | information about all repositories from one or more GitHub 9 | orgs. 10 | 11 | We use this script to gather basic data about the repositories 12 | found in dozens of GitHub orgs. We use this to understand whether 13 | projects are meeting our compliance requirements. 
We also use this 14 | script to find abandoned repos that have outlived their usefulness 15 | and should be archived. 16 | 17 | As input, this script requires a file named 'orgs.txt' containing 18 | the name of one GitHub org per line residing in the same folder 19 | as this script. 20 | 21 | Your API key should be stored in a file called gh_key in the 22 | same folder as this script. 23 | 24 | This script requires that `pandas` be installed within the Python 25 | environment you are running this script in. 26 | 27 | As output: 28 | * A message about each org being processed will be printed to the screen. 29 | * the script creates a csv file stored in a subdirectory 30 | of the folder with the script called "output" with the filename in 31 | this format with today's date. 32 | 33 | output/a_repo_activity_2022-01-14.csv 34 | """ 35 | 36 | import sys 37 | import pandas as pd 38 | 39 | from common_functions import read_key, expand_name_df, create_file 40 | 41 | def make_query(after_cursor = None): 42 | """Creates and returns a GraphQL query with cursor for pagination""" 43 | 44 | return """query RepoQuery($org_name: String!) { 45 | organization(login: $org_name) { 46 | repositories (first: 100 after: AFTER){ 47 | pageInfo { 48 | hasNextPage 49 | endCursor 50 | } 51 | nodes { 52 | nameWithOwner 53 | name 54 | licenseInfo { 55 | name 56 | } 57 | isPrivate 58 | isFork 59 | isEmpty 60 | isArchived 61 | forkCount 62 | stargazerCount 63 | createdAt 64 | updatedAt 65 | pushedAt 66 | defaultBranchRef { 67 | name 68 | target{ 69 | ... on Commit{ 70 | history(first:1){ 71 | edges{ 72 | node{ 73 | ... on Commit{ 74 | committedDate 75 | author{ 76 | name 77 | email 78 | user{ 79 | login 80 | } 81 | } 82 | } 83 | } 84 | } 85 | } 86 | } 87 | } 88 | } 89 | } 90 | } 91 | } 92 | }""".replace( 93 | "AFTER", '"{}"'.format(after_cursor) if after_cursor else "null" 94 | ) 95 | 96 | # Read GitHub key from file using the read_key function in 97 | # common_functions.py 98 | try: 99 | api_token = read_key('gh_key') 100 | 101 | except: 102 | print("Error reading GH Key. This script depends on the existence of a file called gh_key containing your GitHub API token. Exiting") 103 | sys.exit() 104 | 105 | def get_repo_data(api_token): 106 | """Executes the GraphQL query to get repository data from one or more GitHub orgs. 107 | 108 | Parameters 109 | ---------- 110 | api_token : str 111 | The GH API token retrieved from the gh_key file. 112 | 113 | Returns 114 | ------- 115 | repo_info_df : pandas.core.frame.DataFrame 116 | """ 117 | import requests 118 | import json 119 | import pandas as pd 120 | from common_functions import read_orgs 121 | 122 | url = 'https://api.github.com/graphql' 123 | headers = {'Authorization': 'token %s' % api_token} 124 | 125 | repo_info_df = pd.DataFrame() 126 | 127 | # Read list of orgs from a file 128 | 129 | try: 130 | org_list = read_orgs('orgs.txt') 131 | except: 132 | print("Error reading orgs. This script depends on the existence of a file called orgs.txt containing one org per line. 
Exiting") 133 | sys.exit() 134 | 135 | for org_name in org_list: 136 | has_next_page = True 137 | after_cursor = None 138 | 139 | print("Processing", org_name) 140 | 141 | while has_next_page: 142 | 143 | try: 144 | query = make_query(after_cursor) 145 | 146 | variables = {"org_name": org_name} 147 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 148 | json_data = json.loads(r.text) 149 | 150 | df_temp = pd.DataFrame(json_data['data']['organization']['repositories']['nodes']) 151 | repo_info_df = pd.concat([repo_info_df, df_temp]) 152 | 153 | has_next_page = json_data["data"]["organization"]["repositories"]["pageInfo"]["hasNextPage"] 154 | 155 | after_cursor = json_data["data"]["organization"]["repositories"]["pageInfo"]["endCursor"] 156 | except: 157 | has_next_page = False 158 | print("ERROR Cannot process", org_name) 159 | 160 | return repo_info_df 161 | 162 | repo_info_df = get_repo_data(api_token) 163 | 164 | # This section reformats the output into what we need in the csv file 165 | 166 | repo_info_df["org"] = repo_info_df["nameWithOwner"].str.split('/').str[0] 167 | 168 | repo_info_df = expand_name_df(repo_info_df,'licenseInfo','license') 169 | repo_info_df = repo_info_df.drop(columns=['licenseInfo']) 170 | 171 | repo_info_df = expand_name_df(repo_info_df,'defaultBranchRef','defaultBranch') 172 | 173 | def expand_commits(commits): 174 | if pd.isnull(commits): 175 | commits_list = [None, None, None, None] 176 | else: 177 | node = commits['target']['history']['edges'][0]['node'] 178 | try: 179 | commit_date = node['committedDate'] 180 | except: 181 | commit_date = None 182 | try: 183 | author_name = node['author']['name'] 184 | except: 185 | author_name = None 186 | try: 187 | author_email = node['author']['email'] 188 | except: 189 | author_email = None 190 | try: 191 | author_login = node['author']['user']['login'] 192 | except: 193 | author_login = None 194 | commits_list = [commit_date, author_name, author_email, author_login] 195 | return commits_list 196 | 197 | repo_info_df['commits_list'] = repo_info_df['defaultBranchRef'].apply(expand_commits) 198 | repo_info_df[['last_commit_date','author_name','author_email', 'author_login']] = pd.DataFrame(repo_info_df.commits_list.tolist(), index= repo_info_df.index) 199 | repo_info_df = repo_info_df.drop(columns=['commits_list','defaultBranchRef']) 200 | 201 | repo_info_df = repo_info_df[['org','name','nameWithOwner','license','defaultBranch','isPrivate','isFork','isArchived', 'forkCount', 'stargazerCount', 'isEmpty', 'createdAt', 'updatedAt','pushedAt','last_commit_date','author_login','author_name','author_email']] 202 | 203 | # prepare file and write dataframe to csv 204 | 205 | try: 206 | file, file_path = create_file("a_repo_activity") 207 | repo_info_df.to_csv(file_path, index=False) 208 | 209 | except: 210 | print('Could not write to csv file. This may be because the output directory is missing or you do not have permissions to write to it. Exiting') 211 | 212 | -------------------------------------------------------------------------------- /scripts/repo_activity_coc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2022 VMware, Inc. 4 | # SPDX-License-Identifier: BSD-2-Clause 5 | 6 | """Repo Activity GraphQL Version with Code of Conduct 7 | This script uses the GitHub GraphQL API to retrieve relevant 8 | information about all repositories from one or more GitHub 9 | orgs. 
10 | 11 | Note: This is identical to repo_activity.py, but it adds info about the 12 | code of conduct and CONTRIBUTING.md files. 13 | 14 | This is a separate script because the codeOfConduct object and getting 15 | info about files in the GraphQL API is a bit problematic and tends to time 16 | out when getting relatively small amounts of data, so this version gets 17 | data from only 10 repos at a time, instead of 100 in the other script. 18 | 19 | Note that it will only find CONTRIBUTING.md files that match that exact 20 | name and case (not contributing.md or contributing.rst) in public repos, 21 | but not in private repos. 22 | 23 | We use this script to gather basic data about the repositories 24 | found in dozens of GitHub orgs. We use this to understand whether 25 | projects are meeting our compliance requirements. We also use this 26 | script to find abandoned repos that have outlived their usefulness 27 | and should be archived. 28 | 29 | As input, this script requires a file named 'orgs.txt' containing 30 | the name of one GitHub org per line residing in the same folder 31 | as this script. 32 | 33 | Your API key should be stored in a file called gh_key in the 34 | same folder as this script. 35 | 36 | This script requires that `pandas` be installed within the Python 37 | environment you are running this script in. 38 | 39 | As output: 40 | * A message about each org being processed will be printed to the screen. 41 | * the script creates a csv file stored in a subdirectory 42 | of the folder with the script called "output" with the filename in 43 | this format with today's date. 44 | 45 | output/a_repo_activity_2022-01-14.csv 46 | """ 47 | 48 | import sys 49 | import pandas as pd 50 | from common_functions import read_key, expand_name_df, create_file 51 | 52 | def make_query(after_cursor = None): 53 | """Creates and returns a GraphQL query with cursor for pagination""" 54 | 55 | return """query RepoQuery($org_name: String!) { 56 | organization(login: $org_name) { 57 | repositories (first: 10 after: AFTER){ 58 | pageInfo { 59 | hasNextPage 60 | endCursor 61 | } 62 | nodes { 63 | nameWithOwner 64 | name 65 | licenseInfo { 66 | name 67 | } 68 | codeOfConduct{ 69 | url 70 | } 71 | content: object(expression: "HEAD:CONTRIBUTING.md") { 72 | ... on Blob { 73 | abbreviatedOid 74 | } 75 | } 76 | isPrivate 77 | isFork 78 | isEmpty 79 | isArchived 80 | forkCount 81 | stargazerCount 82 | createdAt 83 | updatedAt 84 | pushedAt 85 | defaultBranchRef { 86 | name 87 | target{ 88 | ... on Commit{ 89 | history(first:1){ 90 | edges{ 91 | node{ 92 | ... on Commit{ 93 | committedDate 94 | author{ 95 | name 96 | email 97 | user{ 98 | login 99 | } 100 | } 101 | } 102 | } 103 | } 104 | } 105 | } 106 | } 107 | } 108 | } 109 | } 110 | } 111 | }""".replace( 112 | "AFTER", '"{}"'.format(after_cursor) if after_cursor else "null" 113 | ) 114 | 115 | # Read GitHub key from file using the read_key function in 116 | # common_functions.py 117 | try: 118 | api_token = read_key('gh_key') 119 | 120 | except: 121 | print("Error reading GH Key. This script depends on the existence of a file called gh_key containing your GitHub API token. Exiting") 122 | sys.exit() 123 | 124 | def get_repo_data(api_token): 125 | """Executes the GraphQL query to get repository data from one or more GitHub orgs. 126 | 127 | Parameters 128 | ---------- 129 | api_token : str 130 | The GH API token retrieved from the gh_key file.
131 | 132 | Returns 133 | ------- 134 | repo_info_df : pandas.core.frame.DataFrame 135 | """ 136 | import requests 137 | import json 138 | import pandas as pd 139 | from common_functions import read_orgs 140 | 141 | url = 'https://api.github.com/graphql' 142 | headers = {'Authorization': 'token %s' % api_token} 143 | 144 | repo_info_df = pd.DataFrame() 145 | 146 | # Read list of orgs from a file 147 | 148 | try: 149 | org_list = read_orgs('orgs.txt') 150 | except: 151 | print("Error reading orgs. This script depends on the existence of a file called orgs.txt containing one org per line. Exiting") 152 | sys.exit() 153 | 154 | for org_name in org_list: 155 | has_next_page = True 156 | after_cursor = None 157 | 158 | print("Processing", org_name) 159 | 160 | while has_next_page: 161 | 162 | try: 163 | query = make_query(after_cursor) 164 | 165 | variables = {"org_name": org_name} 166 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 167 | json_data = json.loads(r.text) 168 | 169 | df_temp = pd.DataFrame(json_data['data']['organization']['repositories']['nodes']) 170 | repo_info_df = pd.concat([repo_info_df, df_temp]) 171 | 172 | has_next_page = json_data["data"]["organization"]["repositories"]["pageInfo"]["hasNextPage"] 173 | 174 | after_cursor = json_data["data"]["organization"]["repositories"]["pageInfo"]["endCursor"] 175 | except: 176 | has_next_page = False 177 | print("ERROR Cannot process", org_name) 178 | 179 | return repo_info_df 180 | 181 | repo_info_df = get_repo_data(api_token) 182 | 183 | # This section reformats the output into what we need in the csv file 184 | 185 | repo_info_df["org"] = repo_info_df["nameWithOwner"].str.split('/').str[0] 186 | 187 | repo_info_df = expand_name_df(repo_info_df,'licenseInfo','license') 188 | repo_info_df = repo_info_df.drop(columns=['licenseInfo']) 189 | 190 | repo_info_df = expand_name_df(repo_info_df,'defaultBranchRef','defaultBranch') 191 | 192 | def expand_coc(coc): 193 | if pd.isnull(coc): 194 | coc_url = 'Likely Missing' 195 | else: 196 | coc_url = coc['url'] 197 | return coc_url 198 | 199 | repo_info_df['codeOfConduct_url'] = repo_info_df['codeOfConduct'].apply(expand_coc) 200 | repo_info_df = repo_info_df.drop(columns=['codeOfConduct']) 201 | 202 | def expand_contrib(contrib): 203 | # Note that the script only finds the file if it exactly matches CONTRIBUTING.md 204 | # and will not find contributing.md, CONTRIBUTING.rst, or other variations 205 | if pd.isnull(contrib): 206 | contrib_file = 'Missing Private or not CONTRIBUTING.md' 207 | else: 208 | contrib_file = 'CONTRIBUTING.md' 209 | return contrib_file 210 | 211 | repo_info_df['contrib_file'] = repo_info_df['content'].apply(expand_contrib) 212 | repo_info_df = repo_info_df.drop(columns=['content']) 213 | 214 | def expand_commits(commits): 215 | if pd.isnull(commits): 216 | commits_list = [None, None, None, None] 217 | else: 218 | node = commits['target']['history']['edges'][0]['node'] 219 | try: 220 | commit_date = node['committedDate'] 221 | except: 222 | commit_date = None 223 | try: 224 | author_name = node['author']['name'] 225 | except: 226 | author_name = None 227 | try: 228 | author_email = node['author']['email'] 229 | except: 230 | author_email = None 231 | try: 232 | author_login = node['author']['user']['login'] 233 | except: 234 | author_login = None 235 | commits_list = [commit_date, author_name, author_email, author_login] 236 | return commits_list 237 | 238 | repo_info_df['commits_list'] = 
repo_info_df['defaultBranchRef'].apply(expand_commits) 239 | repo_info_df[['last_commit_date','author_name','author_email', 'author_login']] = pd.DataFrame(repo_info_df.commits_list.tolist(), index= repo_info_df.index) 240 | repo_info_df = repo_info_df.drop(columns=['commits_list','defaultBranchRef']) 241 | 242 | repo_info_df = repo_info_df[['org','name','nameWithOwner','license','defaultBranch','codeOfConduct_url', 'contrib_file', 'isPrivate','isFork','isArchived', 'forkCount', 'stargazerCount', 'isEmpty', 'createdAt', 'updatedAt','pushedAt','last_commit_date','author_login','author_name','author_email']] 243 | 244 | # prepare file and write dataframe to csv 245 | 246 | try: 247 | file, file_path = create_file("a_repo_activity") 248 | repo_info_df.to_csv(file_path, index=False) 249 | 250 | except: 251 | print('Could not write to csv file. This may be because the output directory is missing or you do not have permissions to write to it. Exiting') 252 | 253 | -------------------------------------------------------------------------------- /notebooks/org_info.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Copyright 2022 VMware, Inc.\n", 8 | "SPDX-License-Identifier: BSD-2-Clause\n", 9 | "\n", 10 | "This notebook uses the GitHub GraphQL API to retrieve relevant\n", 11 | "information about one or more GitHub orgs.\n", 12 | "\n", 13 | "This is a simplified version of `scripts/mystery_orgs.py` that can be used to gather basic data about GitHub orgs.\n", 14 | "\n", 15 | "I created it as one way to learn more about orgs that\n", 16 | "we believe may have been created outside of our process by various\n", 17 | "employees across our business units. We gather the first few members\n", 18 | "of the org to help identify employees who can provide more details\n", 19 | "about the purpose of the org and how it is used.\n", 20 | "\n", 21 | "Your API key should be stored in a file called gh_key in the\n", 22 | "same folder as this script.\n", 23 | "\n", 24 | "The orgs investigated can be specified in the cell below as part of org_list." 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 24, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "import requests\n", 34 | "import json\n", 35 | "import pandas as pd\n", 36 | "\n", 37 | "# Assumes your GH API token is in a file called gh_key in this directory\n", 38 | "with open('gh_key', 'r') as kf:\n", 39 | " api_token = kf.readline().rstrip() # remove newline & trailing whitespace\n", 40 | " \n", 41 | "org_list = [\"Moonkube\", \"ModernAppsNinja\"]" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 25, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "def make_query():\n", 51 | " return \"\"\"query OrgQuery($org_name: String!) 
{\n", 52 | " organization(login:$org_name) {\n", 53 | " name\n", 54 | " url\n", 55 | " createdAt\n", 56 | " updatedAt\n", 57 | " membersWithRole(first: 15){\n", 58 | " nodes{\n", 59 | " login\n", 60 | " name\n", 61 | " email\n", 62 | " company\n", 63 | " }\n", 64 | " }\n", 65 | " }\n", 66 | " }\"\"\"" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 26, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "def get_org_data(api_token, org_list):\n", 76 | " import requests\n", 77 | " import json\n", 78 | " import csv\n", 79 | "\n", 80 | " url = 'https://api.github.com/graphql'\n", 81 | " headers = {'Authorization': 'token %s' % api_token}\n", 82 | " \n", 83 | " org_info_df = pd.DataFrame()\n", 84 | " \n", 85 | " all_rows = [['org_name', 'org_url', 'org_createdAt', 'org_updatedAt', 'people(login,name,email,company):repeat']]\n", 86 | " \n", 87 | " for org_name in org_list:\n", 88 | "\n", 89 | " row = []\n", 90 | " query = make_query()\n", 91 | "\n", 92 | " variables = {\"org_name\": org_name}\n", 93 | " r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers)\n", 94 | " json_data = json.loads(r.text)\n", 95 | " \n", 96 | " df_temp = pd.DataFrame(json_data['data']['organization'])\n", 97 | " org_info_df = org_info_df.append(df_temp, ignore_index=True)\n", 98 | " \n", 99 | " for key in json_data['data']['organization']:\n", 100 | " if key == 'membersWithRole':\n", 101 | " for nkey in json_data['data']['organization'][key]['nodes']:\n", 102 | " row.append(nkey['login'])\n", 103 | " row.append(nkey['name'])\n", 104 | " row.append(nkey['email'])\n", 105 | " row.append(nkey['company'])\n", 106 | " else:\n", 107 | " row.append(json_data['data']['organization'][key])\n", 108 | " all_rows.append(row)\n", 109 | " \n", 110 | " return org_info_df, all_rows\n", 111 | "\n", 112 | "org_info_df, all_rows = get_org_data(api_token, org_list)" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 27, 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "data": { 122 | "text/html": [ 123 | "
\n", 124 | "\n", 137 | "\n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | "
nameurlcreatedAtupdatedAtmembersWithRole
0moonkubehttps://github.com/moonkube2020-06-29T23:19:15Z2020-06-29T23:19:15Z[]
1ModernAppsNinjahttps://github.com/ModernAppsNinja2020-01-26T22:51:20Z2020-08-05T09:18:34Z[{'login': 'yogendra', 'name': 'Yogendra Rampu...
\n", 167 | "
" 168 | ], 169 | "text/plain": [ 170 | " name url createdAt \\\n", 171 | "0 moonkube https://github.com/moonkube 2020-06-29T23:19:15Z \n", 172 | "1 ModernAppsNinja https://github.com/ModernAppsNinja 2020-01-26T22:51:20Z \n", 173 | "\n", 174 | " updatedAt membersWithRole \n", 175 | "0 2020-06-29T23:19:15Z [] \n", 176 | "1 2020-08-05T09:18:34Z [{'login': 'yogendra', 'name': 'Yogendra Rampu... " 177 | ] 178 | }, 179 | "execution_count": 27, 180 | "metadata": {}, 181 | "output_type": "execute_result" 182 | } 183 | ], 184 | "source": [ 185 | "# The dataframe where the info can be found.\n", 186 | "org_info_df" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 20, 192 | "metadata": {}, 193 | "outputs": [ 194 | { 195 | "data": { 196 | "text/plain": [ 197 | "[['org_name',\n", 198 | " 'org_url',\n", 199 | " 'org_createdAt',\n", 200 | " 'org_updatedAt',\n", 201 | " 'people(login,name,email,company):repeat'],\n", 202 | " ['moonkube',\n", 203 | " 'https://github.com/moonkube',\n", 204 | " '2020-06-29T23:19:15Z',\n", 205 | " '2020-06-29T23:19:15Z'],\n", 206 | " ['ModernAppsNinja',\n", 207 | " 'https://github.com/ModernAppsNinja',\n", 208 | " '2020-01-26T22:51:20Z',\n", 209 | " '2020-08-05T09:18:34Z',\n", 210 | " 'yogendra',\n", 211 | " 'Yogendra Rampuria - Yogi',\n", 212 | " '',\n", 213 | " '@yugabyte-db',\n", 214 | " 'sammcgeown',\n", 215 | " 'Sam McGeown',\n", 216 | " '',\n", 217 | " '@vmware',\n", 218 | " 'pipuz',\n", 219 | " 'Gianni',\n", 220 | " '',\n", 221 | " None,\n", 222 | " 'MAHDTech',\n", 223 | " 'MAHDTech',\n", 224 | " 'MAHDTech@saltlabs.tech',\n", 225 | " '@salt-labs ',\n", 226 | " 'afewell',\n", 227 | " 'Arthur Fewell',\n", 228 | " '',\n", 229 | " 'VMware',\n", 230 | " 'ansergit',\n", 231 | " 'Anser Arif',\n", 232 | " '',\n", 233 | " '@vmware',\n", 234 | " 'afewellvmware',\n", 235 | " None,\n", 236 | " '',\n", 237 | " None,\n", 238 | " 'guyzsarun',\n", 239 | " 'Sarun Nuntaviriyakul',\n", 240 | " '',\n", 241 | " '@vmware',\n", 242 | " 'yashitanamdeo',\n", 243 | " 'Yashita Namdeo',\n", 244 | " 'yashita.namdeo2000@gmail.com',\n", 245 | " 'Shri Vaishnav Vidyapeeth Vishwavidyalaya',\n", 246 | " 'shashwatbangar',\n", 247 | " 'Shashwat Bangar',\n", 248 | " '',\n", 249 | " 'Shri Vaishnav Vidyapeeth Vishwavidyalaya',\n", 250 | " 'AttentiveAryan',\n", 251 | " 'Attentive Aryan',\n", 252 | " 'AttentiveAryan@gmail.com',\n", 253 | " 'Attentive Aryan']]" 254 | ] 255 | }, 256 | "execution_count": 20, 257 | "metadata": {}, 258 | "output_type": "execute_result" 259 | } 260 | ], 261 | "source": [ 262 | "# Another way to look at this data is as a list of rows that could be written to a csv file.\n", 263 | "all_rows" 264 | ] 265 | } 266 | ], 267 | "metadata": { 268 | "kernelspec": { 269 | "display_name": "Python 3", 270 | "language": "python", 271 | "name": "python3" 272 | }, 273 | "language_info": { 274 | "codemirror_mode": { 275 | "name": "ipython", 276 | "version": 3 277 | }, 278 | "file_extension": ".py", 279 | "mimetype": "text/x-python", 280 | "name": "python", 281 | "nbconvert_exporter": "python", 282 | "pygments_lexer": "ipython3", 283 | "version": "3.8.5" 284 | } 285 | }, 286 | "nbformat": 4, 287 | "nbformat_minor": 4 288 | } 289 | -------------------------------------------------------------------------------- /scripts/commits_people.py: -------------------------------------------------------------------------------- 1 | # Copyright Dawn M. Foster 2 | # SPDX-License-Identifier: MIT 3 | 4 | # DEPRECATED. 
This script has been moved to the CHAOSS Data Science WG repo 5 | # https://github.com/chaoss/wg-data-science/blob/main/dataset/license-changes/fork-case-study/commits_people.py 6 | 7 | """Gets Commit Data 8 | This is aggregated per person for a repo between two specified dates. 9 | I'm currently using this to better understand who contributes to a project 10 | before and after a key time in the project (relicense / fork) with a focus on 11 | understanding organizational diversity. 12 | 13 | Output (files are stored in the output directory) 14 | * GitHub API response code (should be "<Response [200]>") 15 | * Commit data pickle file containing a dataframe 16 | * Person pickle file containing a dictionary 17 | """ 18 | 19 | import sys 20 | import pandas as pd 21 | import argparse 22 | import requests 23 | import json 24 | 25 | from common_functions import read_key 26 | 27 | # Read arguments from command line 28 | parser = argparse.ArgumentParser() 29 | 30 | parser.add_argument("-t", "--token", dest = "gh_key", help="GitHub Personal Access Token") 31 | parser.add_argument("-u", "--url", dest = "gh_url", help="URL for a GitHub repository") 32 | parser.add_argument("-b", "--begin_date", dest = "begin_date", help="Date in the format YYYY-MM-DD - gather commits after this begin date") 33 | parser.add_argument("-e", "--end_date", dest = "end_date", help="Date in the format YYYY-MM-DD - gather commits up until this end date") 34 | 35 | args = parser.parse_args() 36 | 37 | gh_url = args.gh_url 38 | gh_key = args.gh_key 39 | since_date = args.begin_date + "T00:00:00.000+00:00" 40 | until_date = args.end_date + "T00:00:00.000+00:00" 41 | 42 | url_parts = gh_url.strip('/').split('/') 43 | org_name = url_parts[3] 44 | repo_name = url_parts[4] 45 | 46 | # Read GitHub key from file using the read_key function in 47 | # common_functions.py 48 | try: 49 | api_token = read_key(gh_key) 50 | 51 | except: 52 | print("Error reading GH Key. This script depends on the existence of a file called gh_key containing your GitHub API token. Exiting") 53 | sys.exit() 54 | 55 | pickle_file = 'output/' + repo_name + str(since_date) + str(until_date) + '.pkl' 56 | 57 | def make_query(after_cursor = None): 58 | return """query repo_commits($org_name: String!, $repo_name: String!, $since_date: GitTimestamp!, $until_date: GitTimestamp!){ 59 | repository(owner: $org_name, name: $repo_name) { 60 | ... on Repository{ 61 | defaultBranchRef{ 62 | target{ 63 | ... on Commit{ 64 | history(since: $since_date, until: $until_date, first: 100 after: AFTER){ 65 | pageInfo { 66 | hasNextPage 67 | endCursor 68 | } 69 | edges{ 70 | node{ 71 | ... on Commit{ 72 | committedDate 73 | deletions 74 | additions 75 | oid 76 | authors(first:100) { 77 | nodes { 78 | date 79 | email 80 | user { 81 | login 82 | company 83 | email 84 | name 85 | } 86 | } 87 | } 88 | } 89 | } 90 | } 91 | } 92 | } 93 | } 94 | } 95 | } 96 | } 97 | }""".replace( 98 | "AFTER", '"{}"'.format(after_cursor) if after_cursor else "null" 99 | ) 100 | 101 | def get_data(api_token, org_name, repo_name, since_date, until_date): 102 | """Executes the GraphQL query to get data from one GitHub repo.
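Parameters
----------
api_token : str
    GitHub API token read with read_key from the file passed via --token.
org_name, repo_name : str
    Parsed from the repository URL passed via --url.
since_date, until_date : str
    GitTimestamp strings built from the --begin_date and --end_date arguments.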
103 | 104 | Returns 105 | ------- 106 | repo_info_df : pandas.core.frame.DataFrame 107 | """ 108 | 109 | url = 'https://api.github.com/graphql' 110 | headers = {'Authorization': 'token %s' % api_token} 111 | 112 | repo_info_df = pd.DataFrame() 113 | 114 | has_next_page = True 115 | after_cursor = None 116 | 117 | while has_next_page: 118 | 119 | query = make_query(after_cursor) 120 | 121 | variables = {"org_name": org_name, "repo_name": repo_name, "since_date": since_date, "until_date": until_date} 122 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 123 | print(r) 124 | json_data = json.loads(r.text) 125 | 126 | df_temp = pd.DataFrame(json_data['data']['repository']['defaultBranchRef']['target']['history']['edges']) 127 | 128 | repo_info_df = repo_info_df.append(df_temp, ignore_index=True) 129 | 130 | has_next_page = json_data['data']['repository']['defaultBranchRef']['target']['history']["pageInfo"]["hasNextPage"] 131 | 132 | after_cursor = json_data['data']['repository']['defaultBranchRef']['target']['history']["pageInfo"]["endCursor"] 133 | 134 | return repo_info_df 135 | 136 | repo_info_df = get_data(api_token, org_name, repo_name, since_date, until_date) 137 | 138 | def expand_commits(commits): 139 | if pd.isnull(commits): 140 | commits_list = [None, None, None, None, None] 141 | else: 142 | node = commits 143 | try: 144 | commit_date = node['committedDate'] 145 | except: 146 | commit_date = None 147 | try: 148 | dels = node['deletions'] 149 | except: 150 | dels = None 151 | try: 152 | adds = node['additions'] 153 | except: 154 | adds = None 155 | try: 156 | oid = node['oid'] 157 | except: 158 | oid = None 159 | try: 160 | author = node['authors']['nodes'] 161 | except: 162 | author = None 163 | commits_list = [commit_date, dels, adds, oid, author] 164 | return commits_list 165 | 166 | repo_info_df['commits_list'] = repo_info_df['node'].apply(expand_commits) 167 | repo_info_df[['commit_date','deletions', 'additions','oid','author']] = pd.DataFrame(repo_info_df.commits_list.tolist(), index= repo_info_df.index) 168 | #repo_info_df = repo_info_df.drop(columns=['commits_list']) 169 | repo_info_df 170 | repo_info_df.to_pickle(pickle_file) 171 | 172 | def create_person_dict(pickle_file, repo_name, since_date, until_date): 173 | import collections 174 | import pickle 175 | 176 | repo_info_df = pd.read_pickle(pickle_file) 177 | 178 | output_pickle = 'output/' + repo_name + '_people_' + str(since_date) + str(until_date) + '.pkl' 179 | 180 | # Create a dictionary for each person with the key being the gh login 181 | # Create a dict for commits that aren't tied to a gh login (gh user = None) 182 | person_dict=collections.defaultdict(dict) 183 | fail_person_dict=collections.defaultdict(dict) 184 | 185 | for x in repo_info_df.iterrows(): 186 | data = x[1] 187 | 188 | for y in data['author']: 189 | try: 190 | login = y['user']['login'] 191 | company = y['user']['company'] 192 | commit_email = y['email'] 193 | login_email = y['user']['email'] 194 | name = y['user']['name'] 195 | 196 | if person_dict[login]: 197 | person_dict[login]['commits'] = person_dict[login]['commits'] + 1 198 | person_dict[login]['additions'] = person_dict[login]['additions'] + data['additions'] 199 | person_dict[login]['deletions'] = person_dict[login]['deletions'] + data['deletions'] 200 | if commit_email not in person_dict[login]['email']: 201 | person_dict[login]['email'].append(commit_email) 202 | else: 203 | person_dict[login]['company'] = company 204 | 
person_dict[login]['name'] = name 205 | person_dict[login]['commits'] = 1 206 | person_dict[login]['additions'] = data['additions'] 207 | person_dict[login]['deletions'] = data['deletions'] 208 | if len(login_email) == 0: 209 | person_dict[login]['email'] = [commit_email] 210 | elif commit_email == login_email: 211 | person_dict[login]['email'] = [commit_email] 212 | else: 213 | person_dict[login]['email'] = [commit_email,login_email] 214 | except: 215 | try: 216 | if fail_person_dict[commit_email]: 217 | fail_person_dict[commit_email]['commits'] = fail_person_dict[commit_email]['commits'] + 1 218 | fail_person_dict[commit_email]['additions'] = fail_person_dict[commit_email]['additions'] + data['additions'] 219 | fail_person_dict[commit_email]['deletions'] = fail_person_dict[commit_email]['deletions'] + data['deletions'] 220 | else: 221 | fail_person_dict[commit_email]['commits'] = 1 222 | fail_person_dict[commit_email]['additions'] = data['additions'] 223 | fail_person_dict[commit_email]['deletions'] = data['deletions'] 224 | except: 225 | print("Unknown Exception on", y) 226 | 227 | # For every email that didn't have a GH login / user, search for that email in the 228 | # person_dict and if found, add the commits, additions, and deletions to the proper user 229 | # Print error message if not found (above items for testing of that case) 230 | for f_key, f_value in fail_person_dict.items(): 231 | found = False 232 | for key, value in person_dict.items(): 233 | if f_key in value['email']: 234 | person_dict[key]['commits'] = person_dict[key]['commits'] + f_value['commits'] 235 | person_dict[key]['additions'] = person_dict[key]['additions'] + f_value['additions'] 236 | person_dict[key]['deletions'] = person_dict[key]['deletions'] + f_value['deletions'] 237 | found = True 238 | if found == False: 239 | print('Not found - no person with this email',f_key,f_value) 240 | 241 | with open(output_pickle, 'wb') as f: 242 | pickle.dump(person_dict, f) 243 | 244 | print('Commit data stored in', pickle_file) 245 | print('People Dictionary stored in', output_pickle) 246 | 247 | create_person_dict(pickle_file, repo_name, since_date, until_date) 248 | -------------------------------------------------------------------------------- /scripts/sunset.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Copyright 2022 VMware, Inc. 4 | # SPDX-License-Identifier: BSD-2-Clause 5 | # Author: Dawn M. Foster 6 | 7 | """Gather data to determine whether a repo can be archived 8 | 9 | This script uses the GitHub GraphQL API to retrieve relevant 10 | information about a repository, including forks to determine ownership 11 | and possibly contact people to understand how they are using a project. 12 | More detailed information is gathered about recently updated forks and 13 | their owners with the recently updated threshold set in a variable called 14 | recently_updated (currently set to 9 months). 15 | 16 | Usage 17 | ----- 18 | 19 | Run the script with one repo url as input 20 | $python3 sunset.py -u "https://github.com/org_name/repo_name" 21 | 22 | Run the script with a csv file containing one org_name,repo_name pair 23 | per line: 24 | $python3 sunset.py -f sunset.csv 25 | 26 | Dependencies and Requirements 27 | ----------------------------- 28 | 29 | This script depends on another tool called Criticality Score to run. 30 | See https://github.com/ossf/criticality_score for more details, including 31 | how to set up a required environment variable. 
This function requires 32 | that you have version 1.0.7 of this tool installed (the older Python 33 | version but not the final Python version, which doesn't work for 34 | some reason within the script - possibly because of how they've 35 | implemented the deprecation warnings). You can install the correct version 36 | using: 37 | pip install criticality-score==1.0.7 38 | 39 | Your API key should be stored in a file called gh_key in the 40 | same folder as this script. 41 | 42 | This script requires that `pandas` be installed within the Python 43 | environment you are running this script in. 44 | 45 | Before using this script, please make sure that you are adhering 46 | to the GitHub Acceptable Use Policies: 47 | https://docs.github.com/en/site-policy/acceptable-use-policies/github-acceptable-use-policies 48 | In particular, "You may not use information from the Service 49 | (whether scraped, collected through our API, or obtained otherwise) 50 | for spamming purposes, including for the purposes of sending unsolicited 51 | emails to users or selling User Personal Information (as defined in the 52 | GitHub Privacy Statement), such as to recruiters, headhunters, and job boards." 53 | 54 | Output 55 | ------ 56 | 57 | * Prints basic data about each repo processed to the screen to show progress. 58 | * the script creates a csv file stored in a subdirectory 59 | of the folder with the script called "output" with the filename in 60 | this format with today's date. 61 | 62 | output/sunset_2022-01-14.csv 63 | """ 64 | 65 | import argparse 66 | import sys 67 | from common_functions import create_file, read_key, get_criticality 68 | from datetime import date 69 | from dateutil.relativedelta import relativedelta 70 | import csv 71 | 72 | def make_query(after_cursor = None): 73 | """Creates and returns a GraphQL query with cursor for pagination on forks""" 74 | 75 | return """query repo_forks($org_name: String!, $repo_name: String!){ 76 | repository(owner: $org_name, name: $repo_name){ 77 | forks (first:20, after: AFTER) { 78 | pageInfo { 79 | hasNextPage 80 | endCursor 81 | } 82 | totalCount 83 | nodes { 84 | updatedAt 85 | url 86 | owner { 87 | __typename 88 | url 89 | ... on User{ 90 | name 91 | company 92 | email 93 | organizations (last:50){ 94 | nodes{ 95 | name 96 | } 97 | } 98 | } 99 | } 100 | } 101 | } 102 | stargazerCount 103 | } 104 | }""".replace( 105 | "AFTER", '"{}"'.format(after_cursor) if after_cursor else "null" 106 | ) 107 | 108 | def get_fork_data(api_token, org_name, repo_name): 109 | """Executes the GraphQL query to get repository data. 110 | 111 | Parameters 112 | ---------- 113 | api_token : str 114 | The GH API token retrieved from the gh_key file. 115 | org_name and repo_name : str 116 | The GitHub organization name and repository name to analyze.
117 | 118 | Returns 119 | ------- 120 | repo_info_df : pandas dataframe 121 | Dataframe with all of the output from the API query 122 | num_forks : int 123 | Number of forks for the repo 124 | num_stars : int 125 | Number of stars for the repo 126 | status : str 127 | Value is "OK" or "Error" depending on whether data could be gathered for that org/repo pair 128 | """ 129 | 130 | import requests 131 | import json 132 | import pandas as pd 133 | 134 | url = 'https://api.github.com/graphql' 135 | headers = {'Authorization': 'token %s' % api_token} 136 | 137 | repo_info_df = pd.DataFrame() 138 | 139 | has_next_page = True 140 | after_cursor = None 141 | 142 | while has_next_page: 143 | try: 144 | query = make_query(after_cursor) 145 | 146 | variables = {"org_name": org_name, "repo_name": repo_name} 147 | r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers) 148 | json_data = json.loads(r.text) 149 | 150 | df_temp = pd.DataFrame(json_data["data"]["repository"]["forks"]["nodes"]) 151 | repo_info_df = pd.concat([repo_info_df, df_temp]) 152 | 153 | num_forks = json_data["data"]["repository"]["forks"]["totalCount"] 154 | num_stars = json_data["data"]["repository"]["stargazerCount"] 155 | 156 | has_next_page = json_data["data"]["repository"]["forks"]["pageInfo"]["hasNextPage"] 157 | after_cursor = json_data["data"]["repository"]["forks"]["pageInfo"]["endCursor"] 158 | 159 | status = "OK" 160 | except: 161 | has_next_page = False 162 | num_forks = None 163 | num_stars = None 164 | status = "Error" 165 | 166 | return repo_info_df, num_forks, num_stars, status 167 | 168 | # Read arguments from the command line to specify whether the repo and org 169 | # should be read from a file for multiple repos or from a url to analyze 170 | # a single repo 171 | 172 | parser = argparse.ArgumentParser() 173 | 174 | parser.add_argument("-f", "--filename", dest = "csv_file", help="File name of a csv file containing one org_name,repo_name pair per line") 175 | parser.add_argument("-u", "--url", dest = "gh_url", help="URL for a GitHub repository") 176 | 177 | args = parser.parse_args() 178 | 179 | if args.csv_file: 180 | with open(args.csv_file) as f: 181 | reader = csv.reader(f) 182 | repo_list = list(reader) 183 | 184 | if args.gh_url: 185 | gh_url = args.gh_url 186 | 187 | url_parts = gh_url.strip('/').split('/') 188 | org_name = url_parts[3] 189 | repo_name = url_parts[4] 190 | 191 | repo_list = [[org_name,repo_name]] 192 | 193 | # Read GitHub key from file using the read_key function in 194 | # common_functions.py 195 | try: 196 | api_token = read_key('gh_key') 197 | 198 | except: 199 | print("Error reading GH Key. This script depends on the existence of a file called gh_key containing your GitHub API token. Exiting") 200 | sys.exit() 201 | 202 | # Uses nine months as recently updated fork threshold 203 | recently_updated = str(date.today() + relativedelta(months=-9)) 204 | 205 | # all_rows is the variable that will be written to the csv file. 
This initializes it with a csv header line 206 | all_rows = [["Org", "Repo", "Status", "Stars", "Forks", "Dependents", "Crit Score", "fork url", "Fork last updated", "account type", "owner URL", "name", "company", "email", "Other orgs that the owner belongs to"]] 207 | 208 | for repo in repo_list: 209 | org_name = repo[0] 210 | repo_name = repo[1] 211 | 212 | try: 213 | repo_info_df, num_forks, num_stars, status = get_fork_data(api_token, org_name, repo_name) 214 | 215 | if status == "OK": 216 | 217 | dependents_count, criticality_score = get_criticality(org_name, repo_name, api_token) 218 | 219 | # criticality_score sometimes fails in a way that is not reflected in its own error status 220 | # and dumps an error message into these variables. I suspect it's caused by a timeout, since it 221 | # seems to happen mostly with very large repos. This is to clean that up and make the csv 222 | # file more readable. The check is for isnumeric because Criticality Score returns strings 223 | # for some reason. 224 | 225 | if dependents_count.isnumeric() is False: 226 | criticality_score = "Error" 227 | dependents_count = "Error" 228 | 229 | print(org_name, repo_name, "Dependents:", dependents_count, "Criticality Score:", criticality_score, "Stars", num_stars, "Forks", num_forks) 230 | 231 | # Only run this section if there are forks 232 | if num_forks > 0: 233 | # We only need recent forks in the csv file, so this creates a subset of the dataframe. 234 | # If there are no recent forks (empty df), only the basic repo info is 235 | # written to the csv file. Otherwise, details about the forks are gathered and added to the csv. 236 | recent_forks_df = repo_info_df.loc[repo_info_df['updatedAt'] > recently_updated] 237 | 238 | if len(recent_forks_df) == 0: 239 | row = [org_name, repo_name, status, num_stars, num_forks, dependents_count, criticality_score] 240 | all_rows.append(row) 241 | 242 | else: 243 | for fork_obj in recent_forks_df.iterrows(): 244 | fork = fork_obj[1] 245 | 246 | fork_updated = fork['updatedAt'] 247 | fork_url = fork['url'] 248 | fork_owner_type = fork['owner']['__typename'] 249 | fork_owner_url = fork['owner']['url'] 250 | try: 251 | fork_owner_name = fork['owner']['name'] 252 | except: 253 | fork_owner_name = None 254 | try: 255 | fork_owner_company = fork['owner']['company'] 256 | except: 257 | fork_owner_company = None 258 | try: 259 | fork_owner_email = fork['owner']['email'] 260 | except: 261 | fork_owner_email = None 262 | try: 263 | fork_owner_orgs = '' 264 | for orgs in fork['owner']['organizations']['nodes']: 265 | fork_owner_orgs = fork_owner_orgs + orgs['name'] + ';' 266 | fork_owner_orgs = fork_owner_orgs[:-1] #strip last ; 267 | if len(fork_owner_orgs) == 0: 268 | fork_owner_orgs = None 269 | except: 270 | fork_owner_orgs = None 271 | 272 | row = [org_name, repo_name, status, num_stars, num_forks, dependents_count, criticality_score, fork_url, fork_updated, fork_owner_type, fork_owner_url, fork_owner_name, fork_owner_company, fork_owner_email, fork_owner_orgs] 273 | all_rows.append(row) 274 | else: 275 | row = [org_name, repo_name, status, num_stars, num_forks, dependents_count, criticality_score, None, None, None, None, None, None, None, None] 276 | all_rows.append(row) 277 | else: 278 | print("Cannot process", org_name, repo_name) 279 | row = [org_name, repo_name, status] 280 | all_rows.append(row) 281 | except: 282 | status = "Error" 283 | print("Cannot process", org_name, repo_name) 284 | row = [org_name, repo_name, status] 285 | all_rows.append(row) 286 | 287 | # 
Create csv output file and write to it. 288 | file, file_path = create_file("sunset") 289 | 290 | try: 291 | with file: 292 | write = csv.writer(file) 293 | write.writerows(all_rows) 294 | except: 295 | print('Could not write to csv file. This may be because the output directory is missing or you do not have permissions to write to it. Exiting') 296 | -------------------------------------------------------------------------------- /notebooks/CodeOfConductBug.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Copyright 2022 VMware, Inc.\n", 8 | "SPDX-License-Identifier: BSD-2-Clause\n", 9 | "\n", 10 | "This notebook contains details about an issue with the `codeOfConduct` object in the GitHub GraphQL API that I used to exchange data with the GH team. They have confirmed that there is an issue and the API engineering team is working on a re-implementation of the `codeOfConduct` object.\n", 11 | "\n", 12 | "The short version is that the `codeOfConduct` object triggers an API timeout when gathering a small amount of data when compared to other queries that gather large amounts of data without timing out. " 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 1, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "import requests\n", 22 | "import json\n", 23 | "import pandas as pd\n", 24 | "\n", 25 | "# Assumes your GH API token is in a file called gh_key in this directory\n", 26 | "with open('gh_key', 'r') as kf:\n", 27 | " api_token = kf.readline().rstrip() # remove newline & trailing whitespace" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "# Example with large amounts of data\n", 35 | "\n", 36 | "Example of a query that gathers a large amount of data without triggering a timeout in the API. It does not include any data from the `codeOfConduct` object." 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 2, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "def complex_query(after_cursor = None):\n", 46 | " return \"\"\"query RepoQuery($org_name: String!) {\n", 47 | " organization(login: $org_name) {\n", 48 | " repositories (first: 100 after: AFTER){\n", 49 | " pageInfo {\n", 50 | " hasNextPage\n", 51 | " endCursor\n", 52 | " }\n", 53 | " nodes { \n", 54 | " nameWithOwner\n", 55 | " name\n", 56 | " licenseInfo {\n", 57 | " name\n", 58 | " }\n", 59 | " isPrivate\n", 60 | " isFork\n", 61 | " isEmpty\n", 62 | " isArchived\n", 63 | " forkCount\n", 64 | " stargazerCount\n", 65 | " createdAt\n", 66 | " updatedAt\n", 67 | " pushedAt\n", 68 | " defaultBranchRef {\n", 69 | " name \n", 70 | " target{\n", 71 | " ... on Commit{\n", 72 | " history(first:1){\n", 73 | " edges{\n", 74 | " node{\n", 75 | " ... on Commit{\n", 76 | " committedDate\n", 77 | " author{\n", 78 | " name\n", 79 | " email\n", 80 | " user{\n", 81 | " login\n", 82 | " }\n", 83 | " }\n", 84 | " }\n", 85 | " }\n", 86 | " }\n", 87 | " }\n", 88 | " }\n", 89 | " }\n", 90 | " }\n", 91 | " }\n", 92 | " }\n", 93 | " }\n", 94 | " }\"\"\".replace(\n", 95 | " \"AFTER\", '\"{}\"'.format(after_cursor) if after_cursor else \"null\"\n", 96 | " )\n" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 3, 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "data": { 106 | "text/html": [ 107 | "
\n", 108 | "\n", 121 | "\n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | "
nameWithOwnernamelicenseInfoisPrivateisForkisEmptyisArchivedforkCountstargazerCountcreatedAtupdatedAtpushedAtdefaultBranchRef
0vmware/pg_rewindpg_rewindNoneFalseFalseFalseFalse201262013-05-23T10:45:43Z2022-01-13T10:15:53Z2020-07-15T07:20:13Z{'name': 'master', 'target': {'history': {'edg...
1vmware/pyvmomipyvmomi{'name': 'Apache License 2.0'}FalseFalseFalseFalse73218862013-12-13T17:30:30Z2022-01-27T03:48:17Z2021-10-14T20:36:08Z{'name': 'master', 'target': {'history': {'edg...
2vmware/pyvmomi-community-samplespyvmomi-community-samples{'name': 'Apache License 2.0'}FalseFalseFalseFalse8448602014-04-24T20:31:56Z2022-01-27T06:16:13Z2022-01-14T01:06:43Z{'name': 'master', 'target': {'history': {'edg...
3vmware/open-vm-toolsopen-vm-toolsNoneFalseFalseFalseFalse35716512014-04-25T21:30:54Z2022-01-27T03:33:49Z2022-01-21T01:24:38Z{'name': 'master', 'target': {'history': {'edg...
4vmware/upgrade-frameworkupgrade-framework{'name': 'Other'}FalseFalseFalseFalse12182014-06-16T17:22:11Z2021-12-06T23:09:20Z2021-12-09T20:35:22Z{'name': 'master', 'target': {'history': {'edg...
..........................................
206vmware/vmware-go-kcl-v2vmware-go-kcl-v2{'name': 'MIT License'}FalseFalseFalseFalse112021-11-30T15:05:11Z2022-01-07T02:25:00Z2022-01-07T02:24:58Z{'name': 'main', 'target': {'history': {'edges...
207vmware/.github.github{'name': 'Other'}FalseFalseFalseFalse412021-12-03T20:18:21Z2021-12-15T15:48:36Z2021-12-15T15:48:33Z{'name': 'main', 'target': {'history': {'edges...
208vmware/app-control-event-kernel-moduleapp-control-event-kernel-module{'name': 'GNU General Public License v2.0'}FalseFalseFalseFalse012021-12-20T16:49:03Z2022-01-10T15:00:15Z2021-12-20T20:08:31Z{'name': 'main', 'target': {'history': {'edges...
209vmware/ml-ops-platform-for-vsphereml-ops-platform-for-vsphere{'name': 'Apache License 2.0'}TrueFalseFalseFalse002022-01-10T12:34:26Z2022-01-10T12:34:40Z2022-01-10T12:34:38ZNone
210vmware/test-automation-for-web-applicationstest-automation-for-web-applications{'name': 'Apache License 2.0'}TrueFalseFalseFalse012022-01-12T02:44:48Z2022-01-24T07:19:02Z2022-01-26T04:35:24ZNone
\n", 319 | "

211 rows × 13 columns

\n", 320 | "
" 321 | ], 322 | "text/plain": [ 323 | " nameWithOwner \\\n", 324 | "0 vmware/pg_rewind \n", 325 | "1 vmware/pyvmomi \n", 326 | "2 vmware/pyvmomi-community-samples \n", 327 | "3 vmware/open-vm-tools \n", 328 | "4 vmware/upgrade-framework \n", 329 | ".. ... \n", 330 | "206 vmware/vmware-go-kcl-v2 \n", 331 | "207 vmware/.github \n", 332 | "208 vmware/app-control-event-kernel-module \n", 333 | "209 vmware/ml-ops-platform-for-vsphere \n", 334 | "210 vmware/test-automation-for-web-applications \n", 335 | "\n", 336 | " name \\\n", 337 | "0 pg_rewind \n", 338 | "1 pyvmomi \n", 339 | "2 pyvmomi-community-samples \n", 340 | "3 open-vm-tools \n", 341 | "4 upgrade-framework \n", 342 | ".. ... \n", 343 | "206 vmware-go-kcl-v2 \n", 344 | "207 .github \n", 345 | "208 app-control-event-kernel-module \n", 346 | "209 ml-ops-platform-for-vsphere \n", 347 | "210 test-automation-for-web-applications \n", 348 | "\n", 349 | " licenseInfo isPrivate isFork isEmpty \\\n", 350 | "0 None False False False \n", 351 | "1 {'name': 'Apache License 2.0'} False False False \n", 352 | "2 {'name': 'Apache License 2.0'} False False False \n", 353 | "3 None False False False \n", 354 | "4 {'name': 'Other'} False False False \n", 355 | ".. ... ... ... ... \n", 356 | "206 {'name': 'MIT License'} False False False \n", 357 | "207 {'name': 'Other'} False False False \n", 358 | "208 {'name': 'GNU General Public License v2.0'} False False False \n", 359 | "209 {'name': 'Apache License 2.0'} True False False \n", 360 | "210 {'name': 'Apache License 2.0'} True False False \n", 361 | "\n", 362 | " isArchived forkCount stargazerCount createdAt \\\n", 363 | "0 False 20 126 2013-05-23T10:45:43Z \n", 364 | "1 False 732 1886 2013-12-13T17:30:30Z \n", 365 | "2 False 844 860 2014-04-24T20:31:56Z \n", 366 | "3 False 357 1651 2014-04-25T21:30:54Z \n", 367 | "4 False 12 18 2014-06-16T17:22:11Z \n", 368 | ".. ... ... ... ... \n", 369 | "206 False 1 1 2021-11-30T15:05:11Z \n", 370 | "207 False 4 1 2021-12-03T20:18:21Z \n", 371 | "208 False 0 1 2021-12-20T16:49:03Z \n", 372 | "209 False 0 0 2022-01-10T12:34:26Z \n", 373 | "210 False 0 1 2022-01-12T02:44:48Z \n", 374 | "\n", 375 | " updatedAt pushedAt \\\n", 376 | "0 2022-01-13T10:15:53Z 2020-07-15T07:20:13Z \n", 377 | "1 2022-01-27T03:48:17Z 2021-10-14T20:36:08Z \n", 378 | "2 2022-01-27T06:16:13Z 2022-01-14T01:06:43Z \n", 379 | "3 2022-01-27T03:33:49Z 2022-01-21T01:24:38Z \n", 380 | "4 2021-12-06T23:09:20Z 2021-12-09T20:35:22Z \n", 381 | ".. ... ... \n", 382 | "206 2022-01-07T02:25:00Z 2022-01-07T02:24:58Z \n", 383 | "207 2021-12-15T15:48:36Z 2021-12-15T15:48:33Z \n", 384 | "208 2022-01-10T15:00:15Z 2021-12-20T20:08:31Z \n", 385 | "209 2022-01-10T12:34:40Z 2022-01-10T12:34:38Z \n", 386 | "210 2022-01-24T07:19:02Z 2022-01-26T04:35:24Z \n", 387 | "\n", 388 | " defaultBranchRef \n", 389 | "0 {'name': 'master', 'target': {'history': {'edg... \n", 390 | "1 {'name': 'master', 'target': {'history': {'edg... \n", 391 | "2 {'name': 'master', 'target': {'history': {'edg... \n", 392 | "3 {'name': 'master', 'target': {'history': {'edg... \n", 393 | "4 {'name': 'master', 'target': {'history': {'edg... \n", 394 | ".. ... \n", 395 | "206 {'name': 'main', 'target': {'history': {'edges... \n", 396 | "207 {'name': 'main', 'target': {'history': {'edges... \n", 397 | "208 {'name': 'main', 'target': {'history': {'edges... 
\n", 398 | "209 None \n", 399 | "210 None \n", 400 | "\n", 401 | "[211 rows x 13 columns]" 402 | ] 403 | }, 404 | "execution_count": 3, 405 | "metadata": {}, 406 | "output_type": "execute_result" 407 | } 408 | ], 409 | "source": [ 410 | "org_name = \"vmware\"\n", 411 | "url = 'https://api.github.com/graphql'\n", 412 | "headers = {'Authorization': 'token %s' % api_token}\n", 413 | "\n", 414 | "has_next_page = True\n", 415 | "after_cursor = None\n", 416 | "\n", 417 | "repo_info_df = pd.DataFrame()\n", 418 | "\n", 419 | "while has_next_page:\n", 420 | "\n", 421 | " query = complex_query(after_cursor)\n", 422 | "\n", 423 | " variables = {\"org_name\": org_name}\n", 424 | " r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers)\n", 425 | " json_data = json.loads(r.text)\n", 426 | "\n", 427 | " df_temp = pd.DataFrame(json_data['data']['organization']['repositories']['nodes'])\n", 428 | " repo_info_df = repo_info_df.append(df_temp, ignore_index=True)\n", 429 | "\n", 430 | " has_next_page = json_data[\"data\"][\"organization\"][\"repositories\"][\"pageInfo\"][\"hasNextPage\"]\n", 431 | "\n", 432 | " after_cursor = json_data[\"data\"][\"organization\"][\"repositories\"][\"pageInfo\"][\"endCursor\"]\n", 433 | "\n", 434 | "repo_info_df" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": {}, 440 | "source": [ 441 | "# Code of Conduct Example\n", 442 | "\n", 443 | "Note: this will work if you shorten `first: 100` to `first: 20`, but this is a relatively small amount of data that should not timeout unless there is a bug or serious performance issue within the `codeOfConduct` object." 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 11, 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "# fails on vmware, bitnami\n", 453 | "# works on vmware-tanzu, concourse, carbonblack\n", 454 | "def make_query():\n", 455 | " return \"\"\"query RepoQuery($org_name: String!) {\n", 456 | " organization(login: $org_name) {\n", 457 | " repositories (first: 100){\n", 458 | " pageInfo {\n", 459 | " hasNextPage\n", 460 | " endCursor\n", 461 | " }\n", 462 | " nodes {\n", 463 | " name\n", 464 | " codeOfConduct{\n", 465 | " url\n", 466 | " }\n", 467 | " createdAt\n", 468 | " }\n", 469 | " }\n", 470 | " }\n", 471 | " }\"\"\"" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": 14, 477 | "metadata": {}, 478 | "outputs": [], 479 | "source": [ 480 | "def make_query():\n", 481 | " return \"\"\"query MyQuery($org_name: String!) 
{\n", 482 | " organization(login: $org_name) {\n", 483 | " repositories(first: 100) {\n", 484 | " pageInfo {\n", 485 | " hasNextPage\n", 486 | " endCursor\n", 487 | " }\n", 488 | " nodes {\n", 489 | " codeOfConduct {\n", 490 | " url\n", 491 | " }\n", 492 | " createdAt\n", 493 | " name\n", 494 | " }\n", 495 | " }\n", 496 | " }\n", 497 | "}\"\"\"\n" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "# Works fine for some orgs: vmware-tanzu example" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": 5, 510 | "metadata": {}, 511 | "outputs": [], 512 | "source": [ 513 | "# fails on vmware, bitnami\n", 514 | "# works on vmware-tanzu, concourse, carbonblack\n", 515 | "org_name = \"vmware-tanzu\"" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": 6, 521 | "metadata": {}, 522 | "outputs": [ 523 | { 524 | "name": "stdout", 525 | "output_type": "stream", 526 | "text": [ 527 | "{'data': {'organization': {'repositories': {'pageInfo': {'hasNextPage': True, 'endCursor': 'Y3Vyc29yOnYyOpHOFdNqbg=='}, 'nodes': [{'name': 'sonobuoy', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/sonobuoy/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2017-07-26T18:27:09Z'}, {'name': 'velero', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/velero/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2017-08-02T17:22:11Z'}, {'name': 'velero-plugin-example', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/velero-plugin-example/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2017-11-28T20:25:03Z'}, {'name': 'tgik', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/tgik/blob/master/CODE-OF-CONDUCT.md'}, 'createdAt': '2018-05-07T18:11:06Z'}, {'name': 'carvel-kwt', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-kwt/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2018-09-24T17:59:19Z'}, {'name': 'thepodlets', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/thepodlets/blob/main/CODE-OF-CONDUCT.md'}, 'createdAt': '2018-10-17T21:16:53Z'}, {'name': 'carvel-ytt', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-ytt/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-03-01T00:13:56Z'}, {'name': 'carvel-ytt-library-for-kubernetes', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-ytt-library-for-kubernetes/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-03-09T00:40:19Z'}, {'name': 'carvel-kapp', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-kapp/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-03-15T21:49:25Z'}, {'name': 'carvel-ytt-library-for-kubernetes-demo', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-ytt-library-for-kubernetes-demo/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-03-22T19:32:24Z'}, {'name': 'ytt.vim', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/ytt.vim/blob/master/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-04-14T09:57:23Z'}, {'name': 'tanzu-observability-collector-for-aws-fargate', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/tanzu-observability-collector-for-aws-fargate/blob/master/CODE-OF-CONDUCT.md'}, 'createdAt': '2019-04-17T05:25:35Z'}, {'name': 'carvel-kbld', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-kbld/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-04-19T22:58:51Z'}, {'name': 'carvel-guestbook-example-on-kubernetes', 'codeOfConduct': {'url': 
'https://github.com/vmware-tanzu/carvel-guestbook-example-on-kubernetes/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-04-23T22:59:54Z'}, {'name': 'carvel', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-04-24T19:24:20Z'}, {'name': 'cluster-api', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/cluster-api/blob/master/code-of-conduct.md'}, 'createdAt': '2019-05-07T18:59:24Z'}, {'name': 'carvel-simple-app-on-kubernetes', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-simple-app-on-kubernetes/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-05-09T00:39:47Z'}, {'name': 'velero-plugin-for-csi', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-06-04T15:04:55Z'}, {'name': 'octant', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/octant/blob/master/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-06-19T17:53:39Z'}, {'name': 'projects-operator', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/projects-operator/blob/master/CODE-OF-CONDUCT.md'}, 'createdAt': '2019-07-31T19:41:47Z'}, {'name': 'dependency-labeler', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/dependency-labeler/blob/master/CODE-OF-CONDUCT.md'}, 'createdAt': '2019-08-01T09:01:19Z'}, {'name': 'crash-diagnostics', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/crash-diagnostics/blob/main/CODE-OF-CONDUCT.md'}, 'createdAt': '2019-10-02T17:39:58Z'}, {'name': 'velero-plugin-for-aws', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/velero-plugin-for-aws/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-10-09T14:54:44Z'}, {'name': 'velero-plugin-for-microsoft-azure', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/velero-plugin-for-microsoft-azure/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-10-11T20:37:01Z'}, {'name': 'velero-plugin-for-gcp', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/velero-plugin-for-gcp/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-10-11T20:37:29Z'}, {'name': 'carvel-imgpkg', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-imgpkg/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-11-01T16:04:25Z'}, {'name': 'carvel-kapp-controller', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-kapp-controller/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-11-06T21:09:24Z'}, {'name': 'sonobuoy-plugins', 'codeOfConduct': None, 'createdAt': '2019-11-13T20:25:45Z'}, {'name': 'carvel-vendir', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-vendir/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2019-12-16T03:40:13Z'}, {'name': 'helm-charts', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/helm-charts/blob/main/CODE-OF-CONDUCT.md'}, 'createdAt': '2019-12-17T20:11:21Z'}, {'name': 'astrolabe', 'codeOfConduct': None, 'createdAt': '2019-12-20T19:14:38Z'}, {'name': 'velero-plugin-for-vsphere', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/velero-plugin-for-vsphere/blob/main/CODE-OF-CONDUCT.md'}, 'createdAt': '2019-12-20T19:14:53Z'}, {'name': 'difflib', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/difflib/blob/master/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-01-08T17:12:53Z'}, {'name': 'terraform-provider-carvel', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/terraform-provider-carvel/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-01-14T22:24:37Z'}, {'name': 
'tanzu-dev-portal', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/tanzu-dev-portal/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-01-29T17:25:40Z'}, {'name': 'carvel-docker-image', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-docker-image/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-02-03T19:22:22Z'}, {'name': 'carvel-secretgen-controller', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-secretgen-controller/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-02-06T14:19:21Z'}, {'name': 'starlark-go', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/starlark-go/blob/master/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-02-07T16:45:31Z'}, {'name': 'octant-example-plugins', 'codeOfConduct': None, 'createdAt': '2020-02-19T17:01:45Z'}, {'name': 'carvel-setup-action', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-setup-action/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-02-22T18:42:22Z'}, {'name': 'vscode-ytt', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/vscode-ytt/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-03-10T16:39:29Z'}, {'name': 'vm-operator-api', 'codeOfConduct': None, 'createdAt': '2020-03-17T16:36:51Z'}, {'name': 'color', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/color/blob/master/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-03-20T16:09:57Z'}, {'name': 'concourse-kpack-resource', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/concourse-kpack-resource/blob/master/CODE-OF-CONDUCT.md'}, 'createdAt': '2020-04-01T20:42:57Z'}, {'name': 'sources-for-knative', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/sources-for-knative/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-04-02T19:17:50Z'}, {'name': 'asdf-carvel', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/asdf-carvel/blob/master/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-04-06T11:03:59Z'}, {'name': 'kpack-cli', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/kpack-cli/blob/main/CODE-OF-CONDUCT.md'}, 'createdAt': '2020-04-06T14:40:53Z'}, {'name': 'k-bench', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/k-bench/blob/master/CODE-OF-CONDUCT.md'}, 'createdAt': '2020-04-17T21:33:48Z'}, {'name': 'carvel-ytt-starter-for-kubernetes', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/carvel-ytt-starter-for-kubernetes/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-05-13T17:42:35Z'}, {'name': 'service-apis', 'codeOfConduct': None, 'createdAt': '2020-07-01T23:44:44Z'}, {'name': 'pinniped', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/pinniped/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-07-02T22:23:20Z'}, {'name': 'pinniped-ci', 'codeOfConduct': None, 'createdAt': '2020-07-02T22:25:16Z'}, {'name': 'cloud-suitability-analyzer', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/cloud-suitability-analyzer/blob/master/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-08-11T14:45:34Z'}, {'name': 'octant-plugin-for-knative', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/octant-plugin-for-knative/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-08-28T14:18:57Z'}, {'name': 'net-operator-api', 'codeOfConduct': None, 'createdAt': '2020-09-10T17:46:42Z'}, {'name': 'plugin-library-for-octant', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/plugin-library-for-octant/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-09-18T20:56:49Z'}, {'name': 'cross-cluster-connectivity', 'codeOfConduct': {'url': 
'https://github.com/vmware-tanzu/cross-cluster-connectivity/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-09-24T22:59:39Z'}, {'name': 'edukates', 'codeOfConduct': None, 'createdAt': '2020-10-02T20:38:43Z'}, {'name': 'community-edition', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/community-edition/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-10-13T19:01:34Z'}, {'name': 'buildkit-cli-for-kubectl', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/buildkit-cli-for-kubectl/blob/main/CODE-OF-CONDUCT.md'}, 'createdAt': '2020-10-22T18:32:09Z'}, {'name': 'cert-injection-webhook', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/cert-injection-webhook/blob/main/CODE-OF-CONDUCT.md'}, 'createdAt': '2020-10-26T19:25:06Z'}, {'name': 'nozzle-for-microsoft-azure-log-analytics', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/nozzle-for-microsoft-azure-log-analytics/blob/main/CODE-OF-CONDUCT.md'}, 'createdAt': '2020-11-03T18:26:02Z'}, {'name': 'convention-controller', 'codeOfConduct': None, 'createdAt': '2020-11-11T22:15:42Z'}, {'name': 'homebrew-carvel', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/homebrew-carvel/blob/develop/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-11-16T21:49:04Z'}, {'name': 'carvel-community', 'codeOfConduct': None, 'createdAt': '2020-11-16T21:49:55Z'}, {'name': 'octant-plugin-for-kind', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/octant-plugin-for-kind/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-11-24T00:34:48Z'}, {'name': 'rotate-instance-identity-certificates', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/rotate-instance-identity-certificates/blob/main/CODE-OF-CONDUCT.md'}, 'createdAt': '2020-11-30T17:50:37Z'}, {'name': 'observability-event-resource', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/observability-event-resource/blob/main/CODE-OF-CONDUCT.md'}, 'createdAt': '2020-12-10T22:13:34Z'}, {'name': 'homebrew-pinniped', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/homebrew-pinniped/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2020-12-18T19:06:33Z'}, {'name': 'community-engagement', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/community-engagement/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2021-01-07T18:55:11Z'}, {'name': 'tanzu-toolkit-for-visual-studio', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/tanzu-toolkit-for-visual-studio/blob/main/CODE_OF_CONDUCT.md'}, 'createdAt': '2021-01-08T19:51:37Z'}, {'name': 'pinniped-ci-pool', 'codeOfConduct': None, 'createdAt': '2021-01-11T21:34:35Z'}, {'name': 'antrea-build-infra', 'codeOfConduct': None, 'createdAt': '2021-01-19T19:06:11Z'}, {'name': 'pinniped-ghsa-wp53-6256-whf9', 'codeOfConduct': None, 'createdAt': '2021-01-22T18:58:18Z'}, {'name': 'pinniped-private', 'codeOfConduct': None, 'createdAt': '2021-01-22T21:31:35Z'}, {'name': 'cluster-api-provider-bringyourownhost', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/cluster-api-provider-bringyourownhost/blob/main/CODE-OF-CONDUCT.md'}, 'createdAt': '2021-01-25T21:56:30Z'}, {'name': 'vm-operator', 'codeOfConduct': None, 'createdAt': '2021-02-19T16:07:27Z'}, {'name': 'vsphere-kubernetes-drivers-operator', 'codeOfConduct': None, 'createdAt': '2021-02-19T16:08:06Z'}, {'name': 'tsip-planning', 'codeOfConduct': None, 'createdAt': '2021-02-24T16:26:26Z'}, {'name': 'tsip-cell-sequence-aws-admin', 'codeOfConduct': None, 'createdAt': '2021-02-26T13:24:58Z'}, {'name': 'tsip-development', 'codeOfConduct': None, 'createdAt': 
'2021-02-26T13:26:01Z'}, {'name': 'crispr', 'codeOfConduct': None, 'createdAt': '2021-02-26T13:30:09Z'}, {'name': 'tsip-old-controllers', 'codeOfConduct': None, 'createdAt': '2021-02-26T15:48:04Z'}, {'name': 'crispr-blackstart', 'codeOfConduct': None, 'createdAt': '2021-03-05T14:09:04Z'}, {'name': 'tsip-misc', 'codeOfConduct': None, 'createdAt': '2021-03-05T16:17:18Z'}, {'name': 'tsip-crispr-blackstart-saturn', 'codeOfConduct': None, 'createdAt': '2021-03-08T18:12:52Z'}, {'name': 'vdpp-partner-docs', 'codeOfConduct': None, 'createdAt': '2021-03-09T17:36:17Z'}, {'name': 'tsip-cells', 'codeOfConduct': None, 'createdAt': '2021-03-16T13:16:17Z'}, {'name': 'tanzu-observability-slug-generator', 'codeOfConduct': {'url': 'https://github.com/vmware-tanzu/tanzu-observability-slug-generator/blob/main/Code-of-Conduct.md'}, 'createdAt': '2021-03-16T18:16:37Z'}, {'name': 'git2go-buildpack', 'codeOfConduct': None, 'createdAt': '2021-03-17T19:56:22Z'}, {'name': 'tsip-infra-images', 'codeOfConduct': None, 'createdAt': '2021-03-24T18:29:36Z'}, {'name': 'vscode-tanzu-tools', 'codeOfConduct': None, 'createdAt': '2021-03-29T18:46:37Z'}, {'name': 'tsip-cell-sequence-aws-core', 'codeOfConduct': None, 'createdAt': '2021-04-06T17:28:22Z'}, {'name': 'tsip-cell-sequence-aws-vault', 'codeOfConduct': None, 'createdAt': '2021-04-26T08:51:53Z'}, {'name': 'tsip-cell-sequence-test', 'codeOfConduct': None, 'createdAt': '2021-04-27T13:14:48Z'}, {'name': 'tkg-windows-containers', 'codeOfConduct': None, 'createdAt': '2021-05-04T16:59:55Z'}, {'name': 'tanzu-cli-apps-plugins', 'codeOfConduct': None, 'createdAt': '2021-05-06T17:38:26Z'}, {'name': 'msys2-buildpack', 'codeOfConduct': None, 'createdAt': '2021-05-07T14:25:10Z'}, {'name': 'tsip-aws-init-sequences', 'codeOfConduct': None, 'createdAt': '2021-05-10T13:49:49Z'}, {'name': 'workload-migration', 'codeOfConduct': None, 'createdAt': '2021-05-10T21:12:11Z'}]}}}}\n" 528 | ] 529 | } 530 | ], 531 | "source": [ 532 | "url = 'https://api.github.com/graphql'\n", 533 | "headers = {'Authorization': 'token %s' % api_token}\n", 534 | "\n", 535 | "query = make_query()\n", 536 | "\n", 537 | "variables = {\"org_name\": org_name}\n", 538 | "r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers)\n", 539 | "json_data_tanzu = json.loads(r.text)\n", 540 | "print(json_data_tanzu)" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "# Fails on other orgs: vmware org example" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 7, 553 | "metadata": {}, 554 | "outputs": [], 555 | "source": [ 556 | "# fails on vmware, bitnami\n", 557 | "# works on vmware-tanzu, concourse, carbonblack\n", 558 | "org_name = \"vmware\"" 559 | ] 560 | }, 561 | { 562 | "cell_type": "code", 563 | "execution_count": 8, 564 | "metadata": {}, 565 | "outputs": [ 566 | { 567 | "name": "stdout", 568 | "output_type": "stream", 569 | "text": [ 570 | "{'data': None, 'errors': [{'message': 'Something went wrong while executing your query. This may be the result of a timeout, or it could be a GitHub bug. 
Please include `F00A:6919:67C008:6E8EA6:61F409AE` when reporting this issue.'}]}\n" 571 | ] 572 | } 573 | ], 574 | "source": [ 575 | "url = 'https://api.github.com/graphql'\n", 576 | "headers = {'Authorization': 'token %s' % api_token}\n", 577 | "\n", 578 | "query = make_query()\n", 579 | "\n", 580 | "variables = {\"org_name\": org_name}\n", 581 | "r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers)\n", 582 | "json_data_vmware = json.loads(r.text)\n", 583 | "print(json_data_vmware)" 584 | ] 585 | }, 586 | { 587 | "cell_type": "markdown", 588 | "metadata": {}, 589 | "source": [ 590 | "## Note: If you remove the codeOfConduct, the rest of the query works fine." 591 | ] 592 | }, 593 | { 594 | "cell_type": "code", 595 | "execution_count": 9, 596 | "metadata": {}, 597 | "outputs": [], 598 | "source": [ 599 | "# fails on vmware, bitnami\n", 600 | "# works on vmware-tanzu, concourse, carbonblack\n", 601 | "def make_query_no_coc():\n", 602 | " return \"\"\"query RepoQuery($org_name: String!) {\n", 603 | " organization(login: $org_name) {\n", 604 | " repositories (first: 100){\n", 605 | " pageInfo {\n", 606 | " hasNextPage\n", 607 | " endCursor\n", 608 | " }\n", 609 | " nodes {\n", 610 | " name\n", 611 | " createdAt\n", 612 | " }\n", 613 | " }\n", 614 | " }\n", 615 | " }\"\"\"" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": 10, 621 | "metadata": {}, 622 | "outputs": [ 623 | { 624 | "name": "stdout", 625 | "output_type": "stream", 626 | "text": [ 627 | "{'data': {'organization': {'repositories': {'pageInfo': {'hasNextPage': True, 'endCursor': 'Y3Vyc29yOnYyOpHOCSW2nA=='}, 'nodes': [{'name': 'pg_rewind', 'createdAt': '2013-05-23T10:45:43Z'}, {'name': 'pyvmomi', 'createdAt': '2013-12-13T17:30:30Z'}, {'name': 'pyvmomi-community-samples', 'createdAt': '2014-04-24T20:31:56Z'}, {'name': 'open-vm-tools', 'createdAt': '2014-04-25T21:30:54Z'}, {'name': 'upgrade-framework', 'createdAt': '2014-06-16T17:22:11Z'}, {'name': 'workflowTools', 'createdAt': '2014-07-18T22:16:00Z'}, {'name': 'govmomi', 'createdAt': '2014-08-12T16:15:08Z'}, {'name': 'pyvcloud', 'createdAt': '2014-11-12T19:36:04Z'}, {'name': 'vmw-guestinfo', 'createdAt': '2014-11-29T23:07:44Z'}, {'name': 'vcd-cli', 'createdAt': '2014-12-05T18:52:29Z'}, {'name': 'open-vmdk', 'createdAt': '2014-12-15T17:10:11Z'}, {'name': 'tdnf', 'createdAt': '2015-02-26T00:44:11Z'}, {'name': 'likewise-open', 'createdAt': '2015-02-26T19:58:04Z'}, {'name': 'photon', 'createdAt': '2015-04-15T17:22:47Z'}, {'name': 'lightwave', 'createdAt': '2015-04-15T17:22:59Z'}, {'name': 'vmware.github.io', 'createdAt': '2015-04-18T23:52:14Z'}, {'name': 'vivace', 'createdAt': '2015-05-15T16:59:28Z'}, {'name': 'PowerCLI-Example-Scripts', 'createdAt': '2015-06-04T16:57:56Z'}, {'name': 'ansible-coreos-bootstrap', 'createdAt': '2015-06-07T21:36:00Z'}, {'name': 'ansible-etcd-cluster', 'createdAt': '2015-06-08T06:34:25Z'}, {'name': 'ansible-etcd-ca', 'createdAt': '2015-06-09T14:32:42Z'}, {'name': 'goipmi', 'createdAt': '2015-07-08T21:24:28Z'}, {'name': 'photon-packer-templates', 'createdAt': '2015-07-30T17:28:22Z'}, {'name': 'ansible-coreos-setup', 'createdAt': '2015-07-31T12:59:08Z'}, {'name': 'ansible-fleet-cluster', 'createdAt': '2015-07-31T13:06:06Z'}, {'name': 'ansible-coreos-autofs', 'createdAt': '2015-08-06T09:08:47Z'}, {'name': 'ansible-flannel', 'createdAt': '2015-08-19T15:03:35Z'}, {'name': 'c-rest-engine', 'createdAt': '2015-09-28T21:55:14Z'}, {'name': 'ansible-kubernetes-ca', 'createdAt': '2015-11-17T01:00:38Z'}, 
{'name': 'ansible-coreos-kubernetes-master', 'createdAt': '2015-11-17T01:10:07Z'}, {'name': 'ansible-coreos-kubernetes-minion', 'createdAt': '2015-11-17T01:14:41Z'}, {'name': 'ansible-coreos-kubelet', 'createdAt': '2015-11-17T16:56:41Z'}, {'name': 'photon-docker-image', 'createdAt': '2015-11-27T16:05:00Z'}, {'name': 'vic', 'createdAt': '2016-01-13T19:53:57Z'}, {'name': 'photonos-netmgr', 'createdAt': '2016-01-16T01:22:23Z'}, {'name': 'LittleProxy', 'createdAt': '2016-02-18T06:49:32Z'}, {'name': 'alb-sdk', 'createdAt': '2016-03-07T21:29:17Z'}, {'name': 'vsphere-automation-sdk-python', 'createdAt': '2016-05-14T05:02:34Z'}, {'name': 'vsphere-automation-sdk-java', 'createdAt': '2016-05-14T05:03:54Z'}, {'name': 'vsphere-automation-sdk-ruby', 'createdAt': '2016-05-14T05:05:09Z'}, {'name': 'powernsx', 'createdAt': '2016-08-17T09:49:35Z'}, {'name': 'vic-product', 'createdAt': '2016-08-18T21:52:44Z'}, {'name': 'priam', 'createdAt': '2016-08-30T23:29:50Z'}, {'name': 'burp-rest-api', 'createdAt': '2016-09-01T20:31:35Z'}, {'name': 'idm', 'createdAt': '2016-09-06T17:11:25Z'}, {'name': 'clarity', 'createdAt': '2016-09-29T17:24:17Z'}, {'name': 'hillview', 'createdAt': '2016-10-03T20:32:15Z'}, {'name': 'powerclicore', 'createdAt': '2016-10-28T19:30:52Z'}, {'name': 'copenapi', 'createdAt': '2016-11-11T22:54:05Z'}, {'name': 'vidm-saml-toolkit', 'createdAt': '2016-12-12T01:29:40Z'}, {'name': 'ansible', 'createdAt': '2016-12-19T17:39:01Z'}, {'name': 'p4c-xdp', 'createdAt': '2016-12-20T22:10:02Z'}, {'name': 'go-nfs-client', 'createdAt': '2017-01-28T01:18:03Z'}, {'name': 'vsphere-automation-sdk', 'createdAt': '2017-03-08T23:15:04Z'}, {'name': 'weathervane', 'createdAt': '2017-03-15T20:54:26Z'}, {'name': 'chap', 'createdAt': '2017-04-01T22:31:42Z'}, {'name': 'pmd', 'createdAt': '2017-04-12T19:01:58Z'}, {'name': 'o11n-plugin-crypto', 'createdAt': '2017-04-26T20:35:26Z'}, {'name': 'terraform-provider-vcd', 'createdAt': '2017-06-05T20:54:05Z'}, {'name': 'container-service-extension', 'createdAt': '2017-06-27T21:51:06Z'}, {'name': 'smb-connector', 'createdAt': '2017-07-24T11:13:32Z'}, {'name': 'vrops-export', 'createdAt': '2017-07-26T19:03:36Z'}, {'name': 'vsphere-storage-for-kubernetes', 'createdAt': '2017-08-07T22:27:50Z'}, {'name': 'replay-app-for-tvos', 'createdAt': '2017-08-14T14:13:33Z'}, {'name': 'pks-ci', 'createdAt': '2017-08-29T08:19:41Z'}, {'name': 'vic-ui', 'createdAt': '2017-09-29T20:46:59Z'}, {'name': 'nsx-integration-for-openshift', 'createdAt': '2017-10-11T04:07:09Z'}, {'name': 'go-vmware-nsxt', 'createdAt': '2017-10-17T19:30:04Z'}, {'name': 'vsphere-guest-run', 'createdAt': '2017-10-20T20:09:15Z'}, {'name': 'vrealize-suite-lifecycle-manager-sdk', 'createdAt': '2017-10-31T05:06:09Z'}, {'name': 'nsx-powerops', 'createdAt': '2017-11-02T18:47:01Z'}, {'name': 'connectors-workspace-one', 'createdAt': '2017-12-14T20:17:14Z'}, {'name': 'guest-introspection-nsx', 'createdAt': '2018-01-31T04:19:00Z'}, {'name': 'test-operations', 'createdAt': '2018-02-15T23:02:04Z'}, {'name': 'vcd-ext-sdk', 'createdAt': '2018-02-16T20:49:45Z'}, {'name': 'django-yamlconf', 'createdAt': '2018-02-20T05:55:53Z'}, {'name': 'ansible-aws-greengrass', 'createdAt': '2018-03-05T20:49:49Z'}, {'name': 'differential-datalog', 'createdAt': '2018-03-20T20:14:11Z'}, {'name': 'kube-fluentd-operator', 'createdAt': '2018-03-26T09:29:35Z'}, {'name': 'harbor-boshrelease', 'createdAt': '2018-03-28T19:41:58Z'}, {'name': 'harbor-tile', 'createdAt': '2018-04-09T16:22:49Z'}, {'name': 'terraform-provider-nsxt', 'createdAt': '2018-04-09T16:54:13Z'}, 
{'name': 'ansible-module-vcloud-director', 'createdAt': '2018-05-08T18:38:51Z'}, {'name': 'vic-tools', 'createdAt': '2018-06-29T21:34:15Z'}, {'name': 'ansible-for-nsxt', 'createdAt': '2018-07-02T18:26:22Z'}, {'name': 'ansible-role-greengrass-awscli', 'createdAt': '2018-07-10T04:43:16Z'}, {'name': 'ansible-role-greengrass-init', 'createdAt': '2018-07-10T04:46:25Z'}, {'name': 'go-vcloud-director', 'createdAt': '2018-07-12T19:04:07Z'}, {'name': 'vmware-log-collectors-for-public-cloud', 'createdAt': '2018-07-31T17:00:37Z'}, {'name': 'concord-bft', 'createdAt': '2018-08-01T09:36:28Z'}, {'name': 'fluent-plugin-vmware-loginsight', 'createdAt': '2018-08-09T01:06:20Z'}, {'name': 'nsx-t-datacenter-ci-pipelines', 'createdAt': '2018-08-26T17:43:58Z'}, {'name': 'vmware-go-kcl', 'createdAt': '2018-09-06T19:28:53Z'}, {'name': 'esx-boot', 'createdAt': '2018-09-07T19:13:53Z'}, {'name': 'bare-metal-server-integration-with-nsxt', 'createdAt': '2018-09-11T03:18:02Z'}, {'name': 'vmware-openapi-generator', 'createdAt': '2018-09-11T20:51:53Z'}, {'name': 'wavefront-adapter-for-istio', 'createdAt': '2018-09-17T21:56:24Z'}, {'name': 'ansible-role-microsoft-azure-iot', 'createdAt': '2018-10-17T13:54:48Z'}, {'name': 'ansible-role-microsoft-azure-edge', 'createdAt': '2018-10-17T13:57:21Z'}, {'name': 'cloud-to-edge', 'createdAt': '2018-10-17T13:59:57Z'}]}}}}\n" 628 | ] 629 | } 630 | ], 631 | "source": [ 632 | "url = 'https://api.github.com/graphql'\n", 633 | "headers = {'Authorization': 'token %s' % api_token}\n", 634 | "\n", 635 | "query = make_query_no_coc()\n", 636 | "\n", 637 | "variables = {\"org_name\": org_name}\n", 638 | "r = requests.post(url=url, json={'query': query, 'variables': variables}, headers=headers)\n", 639 | "json_data_no_coc = json.loads(r.text)\n", 640 | "print(json_data_no_coc)" 641 | ] 642 | } 643 | ], 644 | "metadata": { 645 | "kernelspec": { 646 | "display_name": "Python 3", 647 | "language": "python", 648 | "name": "python3" 649 | }, 650 | "language_info": { 651 | "codemirror_mode": { 652 | "name": "ipython", 653 | "version": 3 654 | }, 655 | "file_extension": ".py", 656 | "mimetype": "text/x-python", 657 | "name": "python", 658 | "nbconvert_exporter": "python", 659 | "pygments_lexer": "ipython3", 660 | "version": "3.8.5" 661 | } 662 | }, 663 | "nbformat": 4, 664 | "nbformat_minor": 4 665 | } 666 | --------------------------------------------------------------------------------
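A note on the csv-writing block from scripts/sunset.py at the top of this section: it catches every exception with a bare except:, which also swallows KeyboardInterrupt and discards the actual error message. A narrower sketch, assuming the same create_file helper and all_rows list used in that script:

import csv

# create_file and all_rows come from scripts/sunset.py (assumed available here)
file, file_path = create_file("sunset")

try:
    with file:
        csv.writer(file).writerows(all_rows)
except OSError as e:
    # Most likely the output directory is missing or is not writable.
    print('Could not write to csv file at', file_path, '-', e, '- Exiting')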
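The pagination loop in CodeOfConductBug.ipynb accumulates pages with repo_info_df.append(df_temp, ignore_index=True), which worked in the pandas of that era but was removed in pandas 2.0. A minimal standalone sketch of the same cursor-based loop, assuming a GitHub token in a gh_key file as in the notebook, that collects pages in a list and concatenates once:

import json
import requests
import pandas as pd

with open('gh_key', 'r') as kf:
    api_token = kf.readline().rstrip()

def repo_query(after_cursor=None):
    # "null" fetches the first page; later pages substitute the cursor string.
    return """query RepoQuery($org_name: String!) {
      organization(login: $org_name) {
        repositories(first: 100, after: AFTER) {
          pageInfo { hasNextPage endCursor }
          nodes { name createdAt }
        }
      }
    }""".replace("AFTER", '"{}"'.format(after_cursor) if after_cursor else "null")

url = 'https://api.github.com/graphql'
headers = {'Authorization': 'token %s' % api_token}

pages = []
has_next_page = True
after_cursor = None

while has_next_page:
    r = requests.post(url=url,
                      json={'query': repo_query(after_cursor),
                            'variables': {'org_name': 'vmware'}},
                      headers=headers)
    repos = json.loads(r.text)['data']['organization']['repositories']
    pages.append(pd.DataFrame(repos['nodes']))
    has_next_page = repos['pageInfo']['hasNextPage']
    after_cursor = repos['pageInfo']['endCursor']

repo_info_df = pd.concat(pages, ignore_index=True)

Collecting the pages in a list and calling pd.concat once also avoids re-copying the accumulated dataframe on every iteration, so the loop stays linear in the number of pages.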
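As the notebook notes, the codeOfConduct query completes if first: 100 is shortened to first: 20, so one pragmatic workaround while the object is being re-implemented is to halve the page size and retry whenever the API returns errors instead of data. fetch_coc_page below is a hypothetical helper sketched on that assumption, not part of these scripts:

import json
import requests

def coc_query(page_size):
    # Same shape as the notebook's query, with a variable page size.
    return """query RepoQuery($org_name: String!) {
      organization(login: $org_name) {
        repositories(first: %d) {
          pageInfo { hasNextPage endCursor }
          nodes { name codeOfConduct { url } createdAt }
        }
      }
    }""" % page_size

def fetch_coc_page(org_name, api_token, page_size=100, min_size=10):
    url = 'https://api.github.com/graphql'
    headers = {'Authorization': 'token %s' % api_token}
    while page_size >= min_size:
        r = requests.post(url=url,
                          json={'query': coc_query(page_size),
                                'variables': {'org_name': org_name}},
                          headers=headers)
        json_data = json.loads(r.text)
        if json_data.get('data'):
            return json_data  # query succeeded at this page size
        print('Query failed at first: %d, retrying with a smaller page' % page_size)
        page_size //= 2  # timeouts appear more likely with bigger pages
    raise RuntimeError('codeOfConduct query timed out even at small page sizes')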