├── requirements.txt
├── .gitignore
├── LICENSE.txt
├── README.md
├── .github
│   └── workflows
│       └── build-metadata.yaml
└── scripts
    └── metadata-generator.py
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
pandas==2.2.2
requests==2.28.0
PyYAML==6.0.1
boto3==1.28.33
botocore==1.31.33
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*.so

# Virtual environment
env/
venv/
ENV/
.venv/
.env/

# VSCode settings
.vscode/

# PyCharm project settings
.idea/

# Jupyter Notebook checkpoints
.ipynb_checkpoints/

# macOS system files
.DS_Store

# Windows system files
Thumbs.db
ehthumbs.db
Desktop.ini

# Log files
*.log

# Python environment files
*.env
*.venv
pip-log.txt
pip-delete-this-directory.txt

# pyenv
.python-version

# Editor swap, cache, and backup files
*.swp
*.swo
*.swn
.cache/
*.bak

# Temporary files created by editors
*~
*.tmp

# ZIP files
*.zip

# Generated YAML files
datasets/
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 MIT Laboratory for Computational Physiology

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# AWS Workflow for Open Data Program

This repository contains the code to generate metadata YAML files for the Open Data Program, covering the PhysioNet open datasets hosted in PhysioNet's AWS S3 bucket. The script automatically retrieves the prefixes (directories) from the S3 bucket and combines them with data from the PhysioNet API to create a YAML file of metadata for each project. These YAML files are used to add datasets to the Registry of Open Data on AWS (https://registry.opendata.aws/).
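The core of the approach is small: list the bucket's top-level prefixes anonymously, then query the PhysioNet API for each one. Below is a minimal sketch of these two calls, ignoring pagination and error handling for brevity; the full implementation lives in `scripts/metadata-generator.py`:

```
import boto3
import requests
from botocore import UNSIGNED
from botocore.config import Config

# List the top-level prefixes (project slugs) of the public bucket anonymously.
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
result = s3.list_objects_v2(Bucket='physionet-open', Delimiter='/')
slugs = [p['Prefix'].strip('/') for p in result.get('CommonPrefixes', [])]

# Fetch the published versions of the first project from the PhysioNet API.
response = requests.get(
    f"https://physionet.org/api/v1/project/published/{slugs[0]}/",
    headers={'User-Agent': 'Mozilla/5.0'},
)
print(response.json())
```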
## Features

- Fetches prefixes from the `physionet-open` S3 bucket (arn:aws:s3:::physionet-open/) using AWS boto3 (no AWS credentials needed);
- Retrieves metadata for each project using the PhysioNet API;
- Generates a .yaml file for each project, including the project's name, description, license, PhysioNet documentation link, and associated S3 bucket information.


## Installation

To set up the project in your local environment:

- Clone the repository:
  `git clone https://github.com/MIT-LCP/aws-physionet-open-data.git`
- Navigate into the project directory:
  `cd aws-physionet-open-data`
- Install the required dependencies:
  `pip install -r requirements.txt`


## Usage

To generate the YAML metadata files, follow these steps:

- From the project directory, run the following command to execute the script:
  `python scripts/metadata-generator.py`

The script will:

- Retrieve the S3 bucket prefixes from the `physionet-open` bucket;
- Fetch the latest project details from the PhysioNet API;
- Generate the YAML metadata files in a folder named `datasets`.

Output:

YAML files: the script generates a set of .yaml files, each representing one project's metadata (a quick way to sanity-check them is shown at the end of this README).


## Example of Generated YAML

```
Name: MIMIC-IV Clinical Database Demo

Description: The Medical Information Mart for Intensive Care (MIMIC)-IV database
is comprised of deidentified electronic health records for patients admitted
to the Beth Israel Deaconess Medical Center. Access to MIMIC-IV is limited to credentialed
users. Here, we have provided an openly-available demo of MIMIC-IV containing a
subset of 100 patients. The dataset includes similar content to MIMIC-IV, but
excludes free-text clinical notes. The demo may be useful for running workshops and
for assessing whether MIMIC-IV is appropriate for a study before making
an access request.

Documentation: https://doi.org/10.13026/dp1f-ex47

Contact: https://physionet.org/about/#contact_us

ManagedBy: '[PhysioNet](https://physionet.org/)'

UpdateFrequency: Not updated

Tags:
  - aws-pds

License: Open Data Commons Open Database License v1.0

Resources:
  - Description: https://doi.org/10.13026/dp1f-ex47
    ARN: arn:aws:s3:::physionet-open/mimic-iv-demo/
    Region: us-east-1
    Type: S3 Bucket

ADXCategories: Healthcare & Life Sciences Data
```


## Contributing

If you would like to contribute to this project, please create a pull request with your changes. We welcome bug fixes, improvements, and additional features.


## License

This project is licensed under the MIT License - see the LICENSE.txt file for details.


## Contact

For any questions or further assistance, please contact the project maintainers through PhysioNet.
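
## Sanity-Checking the Output

A quick way to validate the generated files once the script has populated the `datasets` folder. This is only a convenience sketch, using the PyYAML dependency already listed in `requirements.txt`:

```
import os
import yaml

# Load each generated file and print its project name as a basic sanity check.
for filename in sorted(os.listdir('datasets')):
    if filename.endswith('.yaml'):
        with open(os.path.join('datasets', filename)) as f:
            metadata = yaml.safe_load(f)
        print(f"{filename}: {metadata['Name']}")
```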
--------------------------------------------------------------------------------
/.github/workflows/build-metadata.yaml:
--------------------------------------------------------------------------------
name: Build Metadata and Create PR

on:
  workflow_dispatch:  # Allows the workflow to be triggered manually

jobs:
  run-script-and-create-pr:
    runs-on: ubuntu-latest

    steps:
      # Step 1: Check out the repository
      - name: Checkout repository
        uses: actions/checkout@v3

      # Step 2: Set up Python
      - name: Set up Python 3.x
        uses: actions/setup-python@v4
        with:
          python-version: '3.x'

      # Step 3: Install dependencies (if any)
      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      # Step 4: Run the Python script
      - name: Run metadata generator script
        run: |
          python scripts/metadata-generator.py

      # Step 5: Check out the second repository, where the PR will be created
      - name: Checkout target repository
        uses: actions/checkout@v3
        with:
          repository: Chrystinne/pr-metadata-generator
          path: pr-repo
          token: ${{ secrets.MY_PERSONAL_TOKEN }}

      # Step 6: Copy the datasets folder into the target repository
      - name: Copy datasets to target repo
        run: |
          cp -r datasets pr-repo/

      # Step 7: Create a new branch, or switch to it if it already exists
      - name: Create or switch to branch
        run: |
          cd pr-repo
          git fetch origin
          if git rev-parse --verify origin/metadata-update-branch; then
            git reset --hard  # Discard any local changes that could cause conflicts
            git checkout -f metadata-update-branch  # Force the checkout to overwrite files
            git pull origin metadata-update-branch
          else
            git checkout -b metadata-update-branch
          fi

      # Step 8: Configure the Git user
      - name: Set up Git user
        run: |
          git config --global user.email "chrystinne@gmail.com"
          git config --global user.name "chrystinne"

      # Step 9: Commit the changes (with debug output)
      - name: Commit changes
        run: |
          cd pr-repo
          git status  # Check whether Git has detected any changes
          git diff  # Show the file differences for confirmation
          git add -f datasets  # Force-add the modified files
          git commit -m "Add datasets from metadata generator" || echo "No changes to commit"
          git log -1  # Inspect the latest commit for debugging

      # Step 10: Push the new branch (with debug output)
      - name: Push changes
        run: |
          cd pr-repo
          git push origin metadata-update-branch || (git pull --rebase origin metadata-update-branch && git push origin metadata-update-branch)
          git branch -r  # List the remote branches for debugging

      # Step 11: Create a pull request
      - name: Create pull request
        uses: peter-evans/create-pull-request@v5
        with:
          token: ${{ secrets.MY_PERSONAL_TOKEN }}
          commit-message: "Add datasets from metadata generator"
          branch: metadata-update-branch
          base: main
          title: "Metadata update from Python script"
          body: "This PR contains the datasets folder generated by the metadata-generator script."
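# Because the only trigger above is `workflow_dispatch`, this workflow runs only
# when started manually: use the "Run workflow" button on the repository's Actions
# tab or, assuming the GitHub CLI is available, run: gh workflow run build-metadata.yaml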
--------------------------------------------------------------------------------
/scripts/metadata-generator.py:
--------------------------------------------------------------------------------
import re
import os
import zipfile

import pandas as pd
import requests
import yaml
import boto3
from botocore import UNSIGNED
from botocore.config import Config

csv_file = './physionet-open-s3-bucket-prefixes.csv'
zip_file_path = 'datasets.zip'

# Directory to save YAML files
yaml_dir = 'datasets'
os.makedirs(yaml_dir, exist_ok=True)


def get_s3_open_bucket_prefixes(bucket_name):
    """
    List the prefixes (directories) in the S3 bucket using boto3 without credentials.

    This function connects to an S3 bucket using anonymous access (no credentials)
    to retrieve the list of all directory prefixes (CommonPrefixes) in the bucket.
    The retrieved prefixes are stored in a pandas DataFrame.

    Parameters:
        bucket_name (str): The name of the S3 bucket to be accessed.

    Returns:
        DataFrame: A DataFrame containing the project slugs as a list of directory prefixes.
    """
    s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))  # Unsigned (anonymous) requests

    prefixes = []

    # Continuation token for paginating through large result sets
    continuation_token = None

    while True:
        # List the objects in the bucket (with '/' as the delimiter)
        if continuation_token:
            result = s3.list_objects_v2(Bucket=bucket_name, Delimiter='/', ContinuationToken=continuation_token)
        else:
            result = s3.list_objects_v2(Bucket=bucket_name, Delimiter='/')

        # If there are CommonPrefixes, add the found prefixes to the list
        if 'CommonPrefixes' in result:
            for prefix in result['CommonPrefixes']:
                prefixes.append(prefix['Prefix'])

        # Check if there are more results to be listed
        if result.get('IsTruncated'):
            continuation_token = result.get('NextContinuationToken')
        else:
            break

    df = pd.DataFrame([prefix.strip('/') for prefix in prefixes], columns=['project_slug'])

    return df


def export_s3_open_bucket_prefixes(df, csv_file):
    """
    Save the list of S3 bucket prefixes to a CSV file.

    This function saves the DataFrame containing project slugs (directory prefixes)
    into a CSV file for later reference.

    Parameters:
        df (DataFrame): The DataFrame containing project slugs.
        csv_file (str): The file path where the CSV will be saved.
    """
    df.to_csv(csv_file, index=False)
    print(f"CSV file '{csv_file}' created successfully with {len(df)} project slugs.")


def fetch_project_latest_version(project_slug):
    """
    Fetch the latest version of the project from the PhysioNet API.

    This function retrieves the latest version number of a project
    from the PhysioNet API using its project slug.

    Parameters:
        project_slug (str): The slug of the project (directory name in the bucket).

    Returns:
        str: The latest version number of the project, or None if not found.
89 | """ 90 | url = f"https://physionet.org/api/v1/project/published/{project_slug}/" 91 | headers = {'User-Agent': 'Mozilla/5.0'} 92 | response = requests.get(url, headers=headers) 93 | 94 | if response.status_code == 200: 95 | data = response.json() 96 | if isinstance(data, list) and len(data) > 0: 97 | latest_version = data[-1] # Take the latest version of the project 98 | return latest_version['version'] 99 | return None 100 | 101 | 102 | def fetch_project_details(project_slug, project_latest_version): 103 | """ 104 | Fetch project details from the PhysioNet API. 105 | 106 | This function retrieves the project details, including its name, description, 107 | license, and documentation link. It fetches this information using the project slug 108 | and the latest project version. 109 | 110 | Parameters: 111 | project_slug (str): The slug of the project (directory name in the bucket). 112 | project_latest_version (str): The latest version number of the project. 113 | 114 | Returns: 115 | dict: A dictionary containing the project's name, description, license, and link. 116 | """ 117 | url = f"https://physionet.org/api/v1/project/published/{project_slug}/{project_latest_version}/" 118 | headers = {'User-Agent': 'Mozilla/5.0'} 119 | response = requests.get(url, headers=headers) 120 | if response.status_code == 200: 121 | data = response.json() 122 | 123 | if len(data) > 0: 124 | return { 125 | 'name': data['title'], 126 | 'description': re.sub(r'<[^>]+>', '', data['abstract']), 127 | 'license': data.get('license', {}).get('name', 'Open Data Commons Open Database License v1.0'), 128 | 'link': f"https://doi.org/{data.get('doi')}" if data.get('doi') else f"https://physionet.org/content/{project_slug}/" 129 | } 130 | return None 131 | 132 | 133 | def create_yaml_files(df): 134 | """ 135 | Loop over each project slug and create a YAML file. 136 | 137 | For each project slug in the DataFrame, this function fetches the project details 138 | using the PhysioNet API and generates a corresponding YAML file containing 139 | metadata about the project. 140 | 141 | Parameters: 142 | df (DataFrame): The DataFrame containing project slugs. 143 | """ 144 | for index, row in df.iterrows(): 145 | project_slug = row['project_slug'] 146 | 147 | # Fetch the latest version of the project 148 | project_latest_version = fetch_project_latest_version(project_slug) 149 | 150 | # Fetch project details 151 | project_info = fetch_project_details(project_slug, project_latest_version) 152 | 153 | if project_info: 154 | # Create YAML content 155 | yaml_content = { 156 | 'Name': project_info['name'], 157 | 'Description': project_info['description'], 158 | 'Documentation': project_info['link'], 159 | 'Contact': "https://physionet.org/about/#contact_us", 160 | 'ManagedBy': "[PhysioNet](https://physionet.org/)", 161 | 'UpdateFrequency': "Not updated", 162 | 'Tags': ['aws-pds'], 163 | 'License': project_info['license'], 164 | 'Resources': [ 165 | { 166 | 'Description': project_info['link'], 167 | 'ARN': f"arn:aws:s3:::physionet-open/{project_slug}/", 168 | 'Region': "us-east-1", 169 | 'Type': "S3 Bucket" 170 | } 171 | ], 172 | 'ADXCategories': 'Healthcare & Life Sciences Data' 173 | } 174 | 175 | # Save the YAML file 176 | yaml_file_path = os.path.join(yaml_dir, f"{project_slug}.yaml") 177 | with open(yaml_file_path, 'w') as yaml_file: 178 | yaml.dump(yaml_content, yaml_file, default_flow_style=False, sort_keys=False) 179 | 180 | 181 | def create_zip_file(): 182 | """ 183 | Create a ZIP file containing all YAML files. 

    This function zips all the YAML files generated in the specified directory
    into a single ZIP file for easy distribution.

    Parameters:
        None
    """
    with zipfile.ZipFile(zip_file_path, 'w') as zipf:
        for root, dirs, files in os.walk(yaml_dir):
            for file in files:
                zipf.write(os.path.join(root, file), arcname=file)

    print(f"YAML files successfully generated and zipped in {zip_file_path}")


def main():
    # Generate metadata for the Open Data Program by retrieving the S3 bucket prefixes
    # and creating YAML files.
    df_prefixes = get_s3_open_bucket_prefixes('physionet-open')
    # export_s3_open_bucket_prefixes(df_prefixes, csv_file)
    create_yaml_files(df_prefixes)
    # create_zip_file()


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------