├── requirements.txt
├── .gitignore
├── LICENSE.txt
├── README.md
├── .github
│   └── workflows
│       └── build-metadata.yaml
└── scripts
    └── metadata-generator.py
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
pandas==2.2.2
requests==2.28.0
PyYAML==6.0.1
boto3==1.28.33
botocore==1.31.33
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*.so

# Virtual environment
env/
venv/
ENV/
.venv/
.env/

# VSCode settings
.vscode/

# PyCharm project settings
.idea/

# Jupyter Notebook checkpoints
.ipynb_checkpoints/

# macOS system files
.DS_Store

# Windows system files
Thumbs.db
ehthumbs.db
Desktop.ini

# Log files
*.log

# Python environment files
*.env
*.venv
pip-log.txt
pip-delete-this-directory.txt

# pyenv
.python-version

# Editor swap, cache, and backup files
*.swp
*.swo
*.swn
.cache/
*.bak

# Temporary files created by editors
*~
*.tmp

# ZIP files
*.zip

# Generated YAML files
datasets/
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 MIT Laboratory for Computational Physiology

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# AWS Workflow for Open Data Program

This repository contains the code to generate metadata YAML files for the Open Data Program, covering the PhysioNet open datasets hosted in PhysioNet's AWS S3 bucket. The script automatically retrieves the prefixes (directories) from the S3 bucket and combines them with data from the PhysioNet API to create a YAML file of metadata for each project. These YAML files are used to add datasets to the Registry of Open Data on AWS (https://registry.opendata.aws/).
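The core of the approach is small: list the bucket's top-level prefixes anonymously, then query the PhysioNet API for each one. Below is a minimal sketch of these two calls, ignoring pagination and error handling for brevity; the full implementation lives in `scripts/metadata-generator.py`:

```
import boto3
import requests
from botocore import UNSIGNED
from botocore.config import Config

# List the top-level prefixes (project slugs) of the public bucket anonymously.
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
result = s3.list_objects_v2(Bucket='physionet-open', Delimiter='/')
slugs = [p['Prefix'].strip('/') for p in result.get('CommonPrefixes', [])]

# Fetch the published versions of the first project from the PhysioNet API.
response = requests.get(
    f"https://physionet.org/api/v1/project/published/{slugs[0]}/",
    headers={'User-Agent': 'Mozilla/5.0'},
)
print(response.json())
```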
## Features

- Fetches prefixes from the `physionet-open` S3 bucket (arn:aws:s3:::physionet-open/) using AWS boto3 (no AWS credentials needed);
- Retrieves metadata for each project using the PhysioNet API;
- Generates a .yaml file for each project, including the project's name, description, license, PhysioNet documentation link, and associated S3 bucket information.


## Installation

To set up the project in your local environment:

- Clone the repository:
  `git clone https://github.com/MIT-LCP/aws-physionet-open-data.git`
- Navigate into the project directory:
  `cd aws-physionet-open-data`
- Install the required dependencies:
  `pip install -r requirements.txt`


## Usage

To generate the YAML metadata files, follow these steps:

- From the project directory, run the following command to execute the script:
  `python scripts/metadata-generator.py`

The script will:

- Retrieve the S3 bucket prefixes from the `physionet-open` bucket;
- Fetch the latest project details from the PhysioNet API;
- Generate the YAML metadata files in a folder named `datasets`.

Output:

YAML files: the script generates a set of .yaml files, each representing one project's metadata (a quick way to sanity-check them is shown at the end of this README).


## Example of Generated YAML

```
Name: MIMIC-IV Clinical Database Demo

Description: The Medical Information Mart for Intensive Care (MIMIC)-IV database
is comprised of deidentified electronic health records for patients admitted
to the Beth Israel Deaconess Medical Center. Access to MIMIC-IV is limited to credentialed
users. Here, we have provided an openly-available demo of MIMIC-IV containing a
subset of 100 patients. The dataset includes similar content to MIMIC-IV, but
excludes free-text clinical notes. The demo may be useful for running workshops and
for assessing whether MIMIC-IV is appropriate for a study before making
an access request.

Documentation: https://doi.org/10.13026/dp1f-ex47

Contact: https://physionet.org/about/#contact_us

ManagedBy: '[PhysioNet](https://physionet.org/)'

UpdateFrequency: Not updated

Tags:
  - aws-pds

License: Open Data Commons Open Database License v1.0

Resources:
  - Description: https://doi.org/10.13026/dp1f-ex47
    ARN: arn:aws:s3:::physionet-open/mimic-iv-demo/
    Region: us-east-1
    Type: S3 Bucket

ADXCategories: Healthcare & Life Sciences Data
```


## Contributing

If you would like to contribute to this project, please create a pull request with your changes. We welcome bug fixes, improvements, and additional features.


## License

This project is licensed under the MIT License - see the LICENSE.txt file for details.


## Contact

For any questions or further assistance, please contact the project maintainers through PhysioNet.
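
## Sanity-Checking the Output

A quick way to validate the generated files once the script has populated the `datasets` folder. This is only a convenience sketch, using the PyYAML dependency already listed in `requirements.txt`:

```
import os
import yaml

# Load each generated file and print its project name as a basic sanity check.
for filename in sorted(os.listdir('datasets')):
    if filename.endswith('.yaml'):
        with open(os.path.join('datasets', filename)) as f:
            metadata = yaml.safe_load(f)
        print(f"{filename}: {metadata['Name']}")
```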
--------------------------------------------------------------------------------
/.github/workflows/build-metadata.yaml:
--------------------------------------------------------------------------------
name: Build Metadata and Create PR

on:
  workflow_dispatch:  # Allows the workflow to be triggered manually

jobs:
  run-script-and-create-pr:
    runs-on: ubuntu-latest

    steps:
      # Step 1: Check out the repository
      - name: Checkout repository
        uses: actions/checkout@v3

      # Step 2: Set up Python
      - name: Set up Python 3.x
        uses: actions/setup-python@v4
        with:
          python-version: '3.x'

      # Step 3: Install dependencies (if any)
      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      # Step 4: Run the Python script
      - name: Run metadata generator script
        run: |
          python scripts/metadata-generator.py

      # Step 5: Check out the second repository, where the PR will be created
      - name: Checkout target repository
        uses: actions/checkout@v3
        with:
          repository: Chrystinne/pr-metadata-generator
          path: pr-repo
          token: ${{ secrets.MY_PERSONAL_TOKEN }}

      # Step 6: Copy the datasets folder into the target repository
      - name: Copy datasets to target repo
        run: |
          cp -r datasets pr-repo/

      # Step 7: Create a new branch, or switch to it if it already exists
      - name: Create or switch to branch
        run: |
          cd pr-repo
          git fetch origin
          if git rev-parse --verify origin/metadata-update-branch; then
            git reset --hard  # Discard any local changes that could cause conflicts
            git checkout -f metadata-update-branch  # Force the checkout to overwrite files
            git pull origin metadata-update-branch
          else
            git checkout -b metadata-update-branch
          fi

      # Step 8: Configure the Git user
      - name: Set up Git user
        run: |
          git config --global user.email "chrystinne@gmail.com"
          git config --global user.name "chrystinne"

      # Step 9: Commit the changes (with debug output)
      - name: Commit changes
        run: |
          cd pr-repo
          git status  # Check whether Git has detected any changes
          git diff  # Show the file differences for confirmation
          git add -f datasets  # Force-add the modified files
          git commit -m "Add datasets from metadata generator" || echo "No changes to commit"
          git log -1  # Inspect the latest commit for debugging

      # Step 10: Push the new branch (with debug output)
      - name: Push changes
        run: |
          cd pr-repo
          git push origin metadata-update-branch || (git pull --rebase origin metadata-update-branch && git push origin metadata-update-branch)
          git branch -r  # List the remote branches for debugging

      # Step 11: Create a pull request
      - name: Create pull request
        uses: peter-evans/create-pull-request@v5
        with:
          token: ${{ secrets.MY_PERSONAL_TOKEN }}
          commit-message: "Add datasets from metadata generator"
          branch: metadata-update-branch
          base: main
          title: "Metadata update from Python script"
          body: "This PR contains the datasets folder generated by the metadata-generator script."
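# Because the only trigger above is `workflow_dispatch`, this workflow runs only
# when started manually: use the "Run workflow" button on the repository's Actions
# tab or, assuming the GitHub CLI is available, run: gh workflow run build-metadata.yaml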
--------------------------------------------------------------------------------
/scripts/metadata-generator.py:
--------------------------------------------------------------------------------
import re
import os
import zipfile

import pandas as pd
import requests
import yaml
import boto3
from botocore import UNSIGNED
from botocore.config import Config

csv_file = './physionet-open-s3-bucket-prefixes.csv'
zip_file_path = 'datasets.zip'

# Directory to save YAML files
yaml_dir = 'datasets'
os.makedirs(yaml_dir, exist_ok=True)


def get_s3_open_bucket_prefixes(bucket_name):
    """
    List the prefixes (directories) in the S3 bucket using boto3 without credentials.

    This function connects to an S3 bucket using anonymous access (no credentials)
    to retrieve the list of all directory prefixes (CommonPrefixes) in the bucket.
    The retrieved prefixes are stored in a pandas DataFrame.

    Parameters:
        bucket_name (str): The name of the S3 bucket to be accessed.

    Returns:
        DataFrame: A DataFrame containing the project slugs as a list of directory prefixes.
    """
    s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))  # Unsigned (anonymous) requests

    prefixes = []

    # Continuation token for paginating through large result sets
    continuation_token = None

    while True:
        # List the objects in the bucket (with '/' as the delimiter)
        if continuation_token:
            result = s3.list_objects_v2(Bucket=bucket_name, Delimiter='/', ContinuationToken=continuation_token)
        else:
            result = s3.list_objects_v2(Bucket=bucket_name, Delimiter='/')

        # If there are CommonPrefixes, add the found prefixes to the list
        if 'CommonPrefixes' in result:
            for prefix in result['CommonPrefixes']:
                prefixes.append(prefix['Prefix'])

        # Check if there are more results to be listed
        if result.get('IsTruncated'):
            continuation_token = result.get('NextContinuationToken')
        else:
            break

    df = pd.DataFrame([prefix.strip('/') for prefix in prefixes], columns=['project_slug'])

    return df


def export_s3_open_bucket_prefixes(df, csv_file):
    """
    Save the list of S3 bucket prefixes to a CSV file.

    This function saves the DataFrame containing project slugs (directory prefixes)
    into a CSV file for later reference.

    Parameters:
        df (DataFrame): The DataFrame containing project slugs.
        csv_file (str): The file path where the CSV will be saved.
    """
    df.to_csv(csv_file, index=False)
    print(f"CSV file '{csv_file}' created successfully with {len(df)} project slugs.")


def fetch_project_latest_version(project_slug):
    """
    Fetch the latest version of the project from the PhysioNet API.

    This function retrieves the latest version number of a project
    from the PhysioNet API using its project slug.

    Parameters:
        project_slug (str): The slug of the project (directory name in the bucket).

    Returns:
        str: The latest version number of the project, or None if not found.
89 | """ 90 | url = f"https://physionet.org/api/v1/project/published/{project_slug}/" 91 | headers = {'User-Agent': 'Mozilla/5.0'} 92 | response = requests.get(url, headers=headers) 93 | 94 | if response.status_code == 200: 95 | data = response.json() 96 | if isinstance(data, list) and len(data) > 0: 97 | latest_version = data[-1] # Take the latest version of the project 98 | return latest_version['version'] 99 | return None 100 | 101 | 102 | def fetch_project_details(project_slug, project_latest_version): 103 | """ 104 | Fetch project details from the PhysioNet API. 105 | 106 | This function retrieves the project details, including its name, description, 107 | license, and documentation link. It fetches this information using the project slug 108 | and the latest project version. 109 | 110 | Parameters: 111 | project_slug (str): The slug of the project (directory name in the bucket). 112 | project_latest_version (str): The latest version number of the project. 113 | 114 | Returns: 115 | dict: A dictionary containing the project's name, description, license, and link. 116 | """ 117 | url = f"https://physionet.org/api/v1/project/published/{project_slug}/{project_latest_version}/" 118 | headers = {'User-Agent': 'Mozilla/5.0'} 119 | response = requests.get(url, headers=headers) 120 | if response.status_code == 200: 121 | data = response.json() 122 | 123 | if len(data) > 0: 124 | return { 125 | 'name': data['title'], 126 | 'description': re.sub(r'<[^>]+>', '', data['abstract']), 127 | 'license': data.get('license', {}).get('name', 'Open Data Commons Open Database License v1.0'), 128 | 'link': f"https://doi.org/{data.get('doi')}" if data.get('doi') else f"https://physionet.org/content/{project_slug}/" 129 | } 130 | return None 131 | 132 | 133 | def create_yaml_files(df): 134 | """ 135 | Loop over each project slug and create a YAML file. 136 | 137 | For each project slug in the DataFrame, this function fetches the project details 138 | using the PhysioNet API and generates a corresponding YAML file containing 139 | metadata about the project. 140 | 141 | Parameters: 142 | df (DataFrame): The DataFrame containing project slugs. 143 | """ 144 | for index, row in df.iterrows(): 145 | project_slug = row['project_slug'] 146 | 147 | # Fetch the latest version of the project 148 | project_latest_version = fetch_project_latest_version(project_slug) 149 | 150 | # Fetch project details 151 | project_info = fetch_project_details(project_slug, project_latest_version) 152 | 153 | if project_info: 154 | # Create YAML content 155 | yaml_content = { 156 | 'Name': project_info['name'], 157 | 'Description': project_info['description'], 158 | 'Documentation': project_info['link'], 159 | 'Contact': "https://physionet.org/about/#contact_us", 160 | 'ManagedBy': "[PhysioNet](https://physionet.org/)", 161 | 'UpdateFrequency': "Not updated", 162 | 'Tags': ['aws-pds'], 163 | 'License': project_info['license'], 164 | 'Resources': [ 165 | { 166 | 'Description': project_info['link'], 167 | 'ARN': f"arn:aws:s3:::physionet-open/{project_slug}/", 168 | 'Region': "us-east-1", 169 | 'Type': "S3 Bucket" 170 | } 171 | ], 172 | 'ADXCategories': 'Healthcare & Life Sciences Data' 173 | } 174 | 175 | # Save the YAML file 176 | yaml_file_path = os.path.join(yaml_dir, f"{project_slug}.yaml") 177 | with open(yaml_file_path, 'w') as yaml_file: 178 | yaml.dump(yaml_content, yaml_file, default_flow_style=False, sort_keys=False) 179 | 180 | 181 | def create_zip_file(): 182 | """ 183 | Create a ZIP file containing all YAML files. 

    This function zips all the YAML files generated in the specified directory
    into a single ZIP file for easy distribution.

    Parameters:
        None
    """
    with zipfile.ZipFile(zip_file_path, 'w') as zipf:
        for root, dirs, files in os.walk(yaml_dir):
            for file in files:
                zipf.write(os.path.join(root, file), arcname=file)

    print(f"YAML files successfully generated and zipped in {zip_file_path}")


def main():
    # Generate metadata for the Open Data Program by retrieving the S3 bucket prefixes
    # and creating YAML files.
    df_prefixes = get_s3_open_bucket_prefixes('physionet-open')
    # export_s3_open_bucket_prefixes(df_prefixes, csv_file)
    create_yaml_files(df_prefixes)
    # create_zip_file()


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------