├── .gitignore
├── assets
    ├── figure_average_drum_polyphony.png
    ├── figure_average_track_polyphony.png
    ├── figure_average_track_polyphony_and_note_duration.png
    ├── figure_beat_length.png
    ├── figure_drum.png
    ├── figure_instrument.png
    ├── figure_instrument_type.png
    ├── figure_key.png
    ├── figure_min_max_polyphony.png
    ├── figure_num_tracks.png
    ├── figure_quantile_polyphony.png
    ├── figure_raw_onset_track_entropy.png
    ├── figure_tempos.png
    ├── figure_timesigs.png
    ├── figure_velocity_track_entropy.png
    └── figure_velocity_variance.png
├── download_audio.py
└── readme.md

/.gitignore:
--------------------------------------------------------------------------------
push.sh
reset.sh
*DS_Store

--------------------------------------------------------------------------------
/assets/figure_average_drum_polyphony.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_average_drum_polyphony.png

--------------------------------------------------------------------------------
/assets/figure_average_track_polyphony.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_average_track_polyphony.png

--------------------------------------------------------------------------------
/assets/figure_average_track_polyphony_and_note_duration.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_average_track_polyphony_and_note_duration.png

--------------------------------------------------------------------------------
/assets/figure_beat_length.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_beat_length.png

--------------------------------------------------------------------------------
/assets/figure_drum.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_drum.png

--------------------------------------------------------------------------------
/assets/figure_instrument.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_instrument.png

--------------------------------------------------------------------------------
/assets/figure_instrument_type.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_instrument_type.png

--------------------------------------------------------------------------------
/assets/figure_key.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_key.png

--------------------------------------------------------------------------------
/assets/figure_min_max_polyphony.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_min_max_polyphony.png

--------------------------------------------------------------------------------
/assets/figure_num_tracks.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_num_tracks.png

--------------------------------------------------------------------------------
/assets/figure_quantile_polyphony.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_quantile_polyphony.png

--------------------------------------------------------------------------------
/assets/figure_raw_onset_track_entropy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_raw_onset_track_entropy.png

--------------------------------------------------------------------------------
/assets/figure_tempos.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_tempos.png

--------------------------------------------------------------------------------
/assets/figure_timesigs.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_timesigs.png

--------------------------------------------------------------------------------
/assets/figure_velocity_track_entropy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_velocity_track_entropy.png

--------------------------------------------------------------------------------
/assets/figure_velocity_variance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jeffreyjohnens/MetaMIDIDataset/71cd4921345c76a639bbf6cfc3705114683e81f3/assets/figure_velocity_variance.png

--------------------------------------------------------------------------------
/download_audio.py:
--------------------------------------------------------------------------------
import os
import time
import requests
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

def download(url, path):
    # fetch the clip first so a failed request does not leave an empty file behind
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)

chunk = 50  # number of track ids sent to the Spotify API per request
matches_path = "path/to/MMD_audio_matches.tsv"
client_id = "your_spotify_api_client_id"
client_secret = "your_spotify_api_client_secret"
mp3_folder = "save_mp3s_here"

# make folder if necessary
os.makedirs(mp3_folder, exist_ok=True)

# get a list of matched unique spotify ids
spotify_ids = list(pd.unique(pd.read_csv(matches_path, delimiter="\t")["sid"]))

# initialize spotipy
spotify = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials(client_id=client_id, client_secret=client_secret))

# download all mp3s
for i in range(0, len(spotify_ids), chunk):
    spotify_id_chunk = spotify_ids[i:i+chunk]
    print(spotify_id_chunk)
    for track in spotify.tracks(spotify_id_chunk)["tracks"]:
        # skip ids that Spotify does not know and tracks without a preview clip
        if track is None or not track.get("preview_url"):
            continue
        try:
            mp3_path = os.path.join(mp3_folder, track["id"] + ".mp3")
            download(track["preview_url"], mp3_path)
        except Exception as e:
            print(e)
    time.sleep(1)  # be nice to the servers

--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
# Summary

We introduce the MetaMIDI Dataset (MMD), a large-scale collection of 436,631 MIDI files and metadata.
In addition to the MIDI files, we provide artist, title and genre metadata that was collected during the scraping process when available. MIDIs in the MMD were matched against a collection of 32,000,000 30-second audio clips retrieved from Spotify, resulting in 10,796,557 audio-MIDI matches. In addition, we linked 600,142 Spotify tracks with 1,094,901 MusicBrainz recordings to produce a set of 168,032 MIDI files that are matched to the MusicBrainz database. These links augment many files in the dataset with the extensive metadata available via the Spotify API and the MusicBrainz database. We anticipate that this collection of data will be of great use to MIR researchers addressing a variety of research topics.

1. Collection of 436,631 MIDI files.
2. Scraped artist + title metadata for 221,504 MIDIs (10 times more than the Lakh MIDI Dataset, LMD).
3. Scraped genre metadata for 143,868 MIDIs.
4. An improved audio-MIDI matching procedure, which produced 10,796,557 audio-MIDI matches linking 237,236 MIDIs to one or more tracks on Spotify.
5. 829,728 high-reliability audio-MIDI + metadata matches linking 53,496 MIDIs to one or more tracks on Spotify.
6. A method for linking Spotify tracks and MusicBrainz recordings, producing 8,263,482 unique links that associate 1,094,901 MusicBrainz recordings with 600,142 Spotify tracks.
7. 168,032 MIDIs matched to MusicBrainz IDs via the Spotify/MusicBrainz linking procedure.

# Get the Dataset

The data can be accessed on [Zenodo](https://zenodo.org/record/5142664#.YQN3c5NKgWo). Prospective users must provide their name, institutional affiliation, institutional contact information, the name of their research project, where the research is taking place, and an acknowledgement that they will not share or distribute the dataset.

Once you have downloaded MMD_audio_matches.tsv, you can use [this script](../master/download_audio.py) to download the 30-second Spotify preview clips.

## Scraped Metadata

### Artist + Title

The file MMD_scraped_title_artist.jsonl contains entries linking an md5 to a list of (title, artist) tuples. The example entry below features a MIDI file that was found on two sites, each using a slightly different artist + title.

```python
{
    "md5": "39669587387605f55ef295dba5fc0537",
    "title_artist": [
        ["Eructavit cor meum a 6", "Gabrieli, Andrea"],
        ["Eructavit a 6", "Andrea Gabrieli"]
    ]
}
```

### Genre

The file MMD_scraped_genre.jsonl contains entries linking an md5 to a list of genre lists. The example entry below features a MIDI file found on two sites, each using a different list of genres.

```python
{
    "md5": "611550ebcf868b21e67d87337f4be7cf",
    "genre": [
        ["romantic"],
        ["renaissance"]
    ]
}
```

## Audio-MIDI Matches


### Audio-MIDI Matches

The file MMD_audio_matches.tsv is a table with three columns: md5, score and sid (Spotify Track ID). The first few rows of the table are shown below. Note that several Spotify tracks can be matched to a single MIDI file.

```python
md5	score	sid
977349d0bec3fed4bd2bef1b57c597d4	0.554719562096395	5n3es2C6R47r4WGfmYL9vZ
a8ab220d3e771f14994bcb6104a6e733	0.9104702185489868	65yDyFGWY1nAAEIoPoGSL6
a8ab220d3e771f14994bcb6104a6e733	0.7431859004566854	6WRwxMiwig6czraGVt3xEB
a8ab220d3e771f14994bcb6104a6e733	0.775063317357118	6X72YoEcN8Iuw4e70NMb1V
```

### Audio-MIDI + text metadata Matches

These matches are the subset of MMD_audio_matches.tsv for which the scraped title + artist metadata was also a fuzzy match. As a result, they are more reliable than matches based on audio alone. The file MMD_audio_text_matches.tsv is a table with the same three columns: md5, score and sid (Spotify Track ID).
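
Because one MIDI file can be matched to several Spotify tracks, a common first step is to keep only the highest-scoring track for each md5. The following is a minimal sketch of that step using only the Python standard library. It assumes the match files are tab-separated with a header row of md5, score and sid, as described above. The function name `best_match_per_md5` and the `min_score` parameter are illustrative and not part of the dataset tooling, and the inline sample simply repeats the rows from the excerpt above.

```python
import csv
import io

def best_match_per_md5(tsv_file, min_score=0.0):
    """Return {md5: (score, sid)}, keeping the highest-scoring track per MIDI file."""
    best = {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        score = float(row["score"])
        if score < min_score:
            continue  # drop matches below the (hypothetical) threshold
        md5 = row["md5"]
        if md5 not in best or score > best[md5][0]:
            best[md5] = (score, row["sid"])
    return best

# Sample rows repeated from the MMD_audio_matches.tsv excerpt
sample = (
    "md5\tscore\tsid\n"
    "977349d0bec3fed4bd2bef1b57c597d4\t0.554719562096395\t5n3es2C6R47r4WGfmYL9vZ\n"
    "a8ab220d3e771f14994bcb6104a6e733\t0.9104702185489868\t65yDyFGWY1nAAEIoPoGSL6\n"
    "a8ab220d3e771f14994bcb6104a6e733\t0.7431859004566854\t6WRwxMiwig6czraGVt3xEB\n"
    "a8ab220d3e771f14994bcb6104a6e733\t0.775063317357118\t6X72YoEcN8Iuw4e70NMb1V\n"
)

best = best_match_per_md5(io.StringIO(sample))
for md5, (score, sid) in best.items():
    print(md5, sid, round(score, 3))
```

For the real files, pass an open file handle instead of the `io.StringIO` sample, for example `open("MMD_audio_text_matches.tsv", newline="")`. Streaming the rows keeps memory use proportional to the number of distinct md5 values, not the number of matches.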

## Spotify + MusicBrainz Links

### Spotify to MusicBrainz Mapping

MMD_sid_to_mbid.json provides the list of mbids (MusicBrainz IDs) corresponding to each sid (Spotify ID). Here are the first few lines of the file.

```python
{
    "4UIlhYOx9rhQgXArXHJaqN": [
        "89d239fc-aa0d-4cdc-89ba-c29691d4004c"
    ],
    "15fX8aDjlpFwq7hITsPIQU": [
        "db8bebaa-4416-4bcb-bee0-8e4dfc86307d",
        "a0c7a30c-3d0b-48d0-adef-dfb127a2c684",
        "f37c0bde-07ad-47e2-bd0f-f1d01a546b64",
        "56e366f2-6f2a-4cc0-a1de-0e6e498ad62f",
        "343c0d31-243b-4b80-bc74-fe7c36df70b2",
```

### md5 to MusicBrainz Mappings

Using the mapping in MMD_sid_to_mbid.json and the two sets of audio-MIDI matches, we created the following files that map md5s to MusicBrainz IDs.

MMD_md5_to_mbid.json provides the list of MusicBrainz IDs (mbid) corresponding to each md5 checksum.

MMD_md5_to_mbid_audio_text.json provides the list of MusicBrainz IDs (mbid) corresponding to each md5 checksum, only including audio + text (md5->sid) matches.

## AcousticBrainz Genres

Using the [2018-AcousticBrainz-Genre-Task](https://multimediaeval.github.io/2018-AcousticBrainz-Genre-Task/data/) dataset, we created one file for each set of audio-MIDI matches that maps md5s to genre counts.

MMD_audio_matched_genre.jsonl contains entries that specify the number of times a MIDI file was associated with a particular genre via the audio-MIDI matching and the Spotify + MusicBrainz links. The entry below shows the genre counts for the discogs, tagtraum and lastfm genre taxonomies. Note that the taxonomies do not always agree with one another, as the example below also illustrates.


```python
{
    "md5": "ac45c832a78728aa8822a3df637682c3",
    "genre_discogs": {
        "electronic": 49,
        "electronic---synth-pop": 49,
        "folk, world, & country": 1,
        "folk, world, & country---country": 1,
        "pop": 49
    },
    "genre_tagtraum": {
        "country": 1,
        "country---contemporarycountry": 1,
        "country---neotraditionalcountry": 1,
        "soundtrack": 49,
        "soundtrack---broadway": 49,
        "soundtrack---musical": 49
    },
    "genre_lastfm": {
        "country": 1,
        "rock": 49,
        "rock---classicrock": 49,
        "rock---softrock": 49
    }
}
```

MMD_audio_text_matched_genre.jsonl is formatted in the same manner.

## Copyright

Since we did not transcribe any of the MIDI files in the MetaMIDI Dataset, we provide a list of all the Copyright meta-events in the dataset to acknowledge the original authors of the files. This is available in MMD_copyright.txt.

# Dataset Statistics

In what follows we compare distributions for the LMD, the MMD and the set difference between the MMD and the LMD (i.e. MMD - LMD).

## Number of Tracks per MIDI file
![track graph](assets/figure_num_tracks.png)

Figure 1: A track consists of all the events belonging to a unique (MIDI track, channel, instrument) tuple.

## Number of beats

![beat graph](assets/figure_beat_length.png)

Figure 2: The number of quarter note beats in a MIDI file. This is a tempo-invariant indicator of the length of the MIDI file.

## Key Signatures

![key graph](assets/figure_key.png)

Figure 3: The usage of different key signatures.

## Time Signatures

![time signature graph](assets/figure_timesigs.png)

Figure 4: The usage of different time signatures.

## Tempo

![tempo graph](assets/figure_tempos.png)

Figure 5: The usage of different tempos. Note the spikes, which correspond to standard default tempos (e.g. 120 bpm).

## General MIDI Instrument Type
![instrument type graph](assets/figure_instrument_type.png)

Figure 6: The usage of General MIDI instrument types in the LMD and MMD.

## General MIDI Instrument
![instrument graph](assets/figure_instrument.png)

Figure 7: The usage of General MIDI instruments in the LMD and MMD.

--------------------------------------------------------------------------------