├── .gitignore ├── .travis.yml ├── LICENSE ├── MANIFEST.in ├── README.md ├── crepe ├── __init__.py ├── __main__.py ├── cli.py ├── core.py └── version.py ├── requirements.txt ├── setup.py └── tests ├── sweep.wav └── test_sweep.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.iml 2 | *.swp 3 | .idea 4 | *.pyc 5 | *.pyo 6 | __pycache__ 7 | 8 | .cache 9 | .ipynb_checkpoints 10 | .pytest_cache 11 | 12 | .DS_Store 13 | thumbs.db 14 | 15 | dist 16 | build 17 | *.egg-info 18 | 19 | *.activation.png 20 | *.activation.npy 21 | *.f0.csv 22 | 23 | crepe/model-*.h5 24 | crepe/model-*.h5.bz2 25 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "3.6" 4 | - "3.7" 5 | - "3.8" 6 | install: 7 | - pip install pytest tensorflow==2.4.1 8 | - pip install . 9 | script: 10 | - pytest 11 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2018 Jong Wook Kim 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | recursive-include crepe *.py 2 | include README.md 3 | include LICENSE 4 | include requirements.txt 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | CREPE Pitch Tracker 2 | =================== 3 | 4 | [![PyPI](https://img.shields.io/pypi/v/crepe.svg)](https://pypi.python.org/pypi/crepe) 5 | [![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT) 6 | [![Build Status](https://travis-ci.org/marl/crepe.svg?branch=master)](https://travis-ci.org/marl/crepe) 7 | [![Downloads](https://pepy.tech/badge/crepe)](https://pepy.tech/project/crepe) 8 | [![PyPI](https://img.shields.io/badge/python-3.6%2C%203.7-blue.svg)]() 9 | 11 | 12 | 13 | 14 | CREPE is a monophonic pitch tracker based on a deep convolutional neural network operating directly on the time-domain waveform input. 
CREPE is state-of-the-art (as of 2018), outperforming popular pitch trackers such as pYIN and SWIPE: 15 | 16 |
*[figure: pitch tracking accuracy comparison of CREPE, pYIN, and SWIPE]*
17 | 18 | Further details are provided in the following paper: 19 | 20 | > [CREPE: A Convolutional Representation for Pitch Estimation](https://arxiv.org/abs/1802.06182)
21 | > Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello.
22 | > Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
23 | 
24 | We kindly request that academic publications making use of CREPE cite the aforementioned paper.
25 | 
26 | 
27 | ## Installing CREPE
28 | 
29 | CREPE is hosted on PyPI. To install, run the following command in your Python environment:
30 | 
31 | ```bash
32 | $ pip install --upgrade tensorflow # if you don't already have tensorflow >= 2.0.0
33 | $ pip install crepe
34 | ```
35 | 
36 | To install the latest version from source, clone the repository and, from the top-level `crepe` folder, call:
37 | 
38 | ```bash
39 | $ python setup.py install
40 | ```
41 | 
42 | ## Using CREPE
43 | ### Using CREPE from the command line
44 | 
45 | This package includes a command line utility `crepe` and a pre-trained version of the CREPE model for easy use. To estimate the pitch of `audio_file.wav`, run:
46 | 
47 | ```bash
48 | $ crepe audio_file.wav
49 | ```
50 | 
51 | or
52 | 
53 | ```bash
54 | $ python -m crepe audio_file.wav
55 | ```
56 | 
57 | The resulting `audio_file.f0.csv` contains 3 columns: the first contains timestamps (a 10 ms hop size is used by default), the second contains the predicted fundamental frequency in Hz, and the third contains the voicing confidence, i.e. the confidence in the presence of a pitch:
58 | 
59 |     time,frequency,confidence
60 |     0.00,185.616,0.907112
61 |     0.01,186.764,0.844488
62 |     0.02,188.356,0.798015
63 |     0.03,190.610,0.746729
64 |     0.04,192.952,0.771268
65 |     0.05,195.191,0.859440
66 |     0.06,196.541,0.864447
67 |     0.07,197.809,0.827441
68 |     0.08,199.678,0.775208
69 |     ...
70 | 
71 | #### Timestamps
72 | 
73 | CREPE uses 10-millisecond time steps by default, which can be adjusted using
74 | the `--step-size` option, which takes the size of the time step in milliseconds.
75 | For example, `--step-size 50` will calculate pitch for every 50 milliseconds.
76 | 
77 | Following the convention adopted by popular audio processing libraries such as
78 | [Essentia](http://essentia.upf.edu/) and [Librosa](https://librosa.github.io/librosa/),
79 | from v0.0.5 onwards CREPE will pad the input signal such that the first frame
80 | is zero-centered (the center of the frame corresponds to time 0) and generally
81 | all frames are centered around their corresponding timestamp, i.e. frame
82 | `D[:, t]` is centered at `audio[t * hop_length]`. This behavior can be changed
83 | by specifying the optional `--no-centering` flag, in which case the first frame
84 | will *start* at time zero and generally frame `D[:, t]` will *begin* at
85 | `audio[t * hop_length]`. Sticking to the default behavior (centered frames) is
86 | strongly recommended to avoid misalignment with features and annotations produced
87 | by other common audio processing tools.
88 | 
89 | #### Model Capacity
90 | 
91 | CREPE uses the model size that was reported in the paper by default, but it can optionally
92 | use a smaller model for faster computation, at the cost of slightly lower accuracy.
93 | You can specify `--model-capacity {tiny|small|medium|large|full}` as the command
94 | line option to select a model with the desired capacity.
95 | 
96 | #### Temporal smoothing
97 | By default CREPE does not apply temporal smoothing to the pitch curve, but
98 | Viterbi smoothing is supported via the optional `--viterbi` command line argument.
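All of the options above (step size, model capacity, and Viterbi smoothing) are also available from Python. As a minimal sketch, `crepe.process_file`, the function the `crepe` command ultimately calls, accepts the corresponding keyword arguments; `audio_file.wav` here is simply the example file name used above:

```python
import crepe

# roughly equivalent to: crepe --model-capacity small --viterbi --step-size 50 audio_file.wav
# writes audio_file.f0.csv next to the input file
crepe.process_file('audio_file.wav',
                   model_capacity='small',  # one of tiny/small/medium/large/full
                   viterbi=True,            # Viterbi smoothing of the pitch curve
                   step_size=50)            # hop size in milliseconds
```

The `crepe.predict` call shown in the Python section below exposes the same `model_capacity`, `viterbi`, and `step_size` arguments if you want the results as in-memory arrays instead of a CSV file.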
99 | 
100 | 
101 | #### Saving the activation matrix
102 | The script can also optionally save the output activation matrix of the model
103 | to an npy file (`--save-activation`), where the matrix dimensions are
104 | (n_frames, 360) using a hop size of 10 ms (there are 360 pitch bins covering 20
105 | cents each).
106 | 
107 | The script can also output a plot of the activation matrix (`--save-plot`),
108 | saved to `audio_file.activation.png`, including an optional visual representation
109 | of the model's voicing detection (`--plot-voicing`). Here's an example plot of
110 | the activation matrix (without the voicing overlay) for an excerpt of male
111 | singing voice:
112 | 
113 | ![salience](https://user-images.githubusercontent.com/266841/38465913-6fa085b0-3aef-11e8-9633-bdd59618ea23.png)
114 | 
115 | #### Batch processing
116 | For batch processing of files, you can provide a folder path instead of a file path:
117 | ```bash
118 | $ crepe audio_folder
119 | ```
120 | The script will process all WAV files found inside the folder.
121 | 
122 | #### Additional usage information
123 | For more information on the usage, please refer to the help message:
124 | 
125 | ```bash
126 | $ crepe --help
127 | ```
128 | 
129 | ### Using CREPE inside Python
130 | CREPE can be imported as a module and used directly in Python. Here's a minimal example:
131 | ```python
132 | import crepe
133 | from scipy.io import wavfile
134 | 
135 | sr, audio = wavfile.read('/path/to/audiofile.wav')
136 | time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)
137 | ```
138 | 
139 | ## Argmax-local Weighted Averaging
140 | 
141 | This release of CREPE uses the following weighted averaging formula, which is slightly different from the paper. This only focuses on the neighborhood around the maximum activation, which is shown to further improve the pitch accuracy:
142 | 
143 | 

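The formula is shown as an image in the README; the rendering below is a reconstruction written to match the implementation of `to_local_average_cents` in `crepe/core.py`, where $\hat{y}$ is the 360-bin activation vector, $c_i$ is the cents value associated with bin $i$, and the sums run over the 9 bins around the argmax (clipped at the edges of the range):

$$
\hat{c} = \frac{\sum_{i=m-4}^{m+4} \hat{y}_i \, c_i}{\sum_{i=m-4}^{m+4} \hat{y}_i},
\qquad m = \operatorname*{arg\,max}_i \hat{y}_i
$$

The frequency estimate is then $f = 10 \cdot 2^{\hat{c}/1200}$ Hz, as computed in `crepe.predict`.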
144 | 
145 | ## Please Note
146 | 
147 | - The current version only supports WAV files as input.
148 | - The model is trained on 16 kHz audio, so if the input audio has a different sample rate, it will first be resampled to 16 kHz using [resampy](https://github.com/bmcfee/resampy).
149 | - Due to the subtle numerical differences between frameworks, Keras should be configured to use the TensorFlow backend for the best performance. The model was trained using Keras 2.1.5 and TensorFlow 1.6.0, and newer versions of TensorFlow seem to work as well.
150 | - Prediction is significantly faster if Keras (and the corresponding backend) is configured to run on a GPU.
151 | - The provided model is trained using the following datasets, composed of vocal and instrumental audio, and is therefore expected to work best on this type of audio signal.
152 |   - MIR-1K [1]
153 |   - Bach10 [2]
154 |   - RWC-Synth [3]
155 |   - MedleyDB [4]
156 |   - MDB-STEM-Synth [5]
157 |   - NSynth [6]
158 | 
159 | 
160 | ## References
161 | 
162 | [1] C.-L. Hsu et al. "On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset", *IEEE Transactions on Audio, Speech, and Language Processing.* 2009.
163 | 
164 | [2] Z. Duan et al. "Multiple Fundamental Frequency Estimation by Modeling Spectral Peaks and Non-Peak Regions", *IEEE Transactions on Audio, Speech, and Language Processing.* 2010.
165 | 
166 | [3] M. Mauch et al. "pYIN: A Fundamental Frequency Estimator Using Probabilistic Threshold Distributions", *Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).* 2014.
167 | 
168 | [4] R. M. Bittner et al. "MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research", *Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.* 2014.
169 | 
170 | [5] J. Salamon et al. "An Analysis/Synthesis Framework for Automatic F0 Annotation of Multitrack Datasets", *Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.* 2017.
171 | 
172 | [6] J. Engel et al. "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders", *arXiv preprint arXiv:1704.01279.* 2017.
173 | 174 | -------------------------------------------------------------------------------- /crepe/__init__.py: -------------------------------------------------------------------------------- 1 | from .version import version as __version__ 2 | from .core import get_activation, predict, process_file 3 | -------------------------------------------------------------------------------- /crepe/__main__.py: -------------------------------------------------------------------------------- 1 | from .cli import main 2 | 3 | # call the CLI handler when the module is executed as `python -m crepe` 4 | main() 5 | -------------------------------------------------------------------------------- /crepe/cli.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | 3 | import os 4 | import sys 5 | from argparse import ArgumentParser, RawDescriptionHelpFormatter 6 | from argparse import ArgumentTypeError 7 | 8 | from .core import process_file 9 | 10 | 11 | def run(filename, output=None, model_capacity='full', viterbi=False, 12 | save_activation=False, save_plot=False, plot_voicing=False, 13 | no_centering=False, step_size=10, verbose=True): 14 | """ 15 | Collect the WAV files to process and run the model 16 | 17 | Parameters 18 | ---------- 19 | filename : list 20 | List containing paths to WAV files or folders containing WAV files to 21 | be analyzed. 22 | output : str or None 23 | Path to directory for saving output files. If None, output files will 24 | be saved to the directory containing the input file. 25 | model_capacity : 'tiny', 'small', 'medium', 'large', or 'full' 26 | String specifying the model capacity; see the docstring of 27 | :func:`~crepe.core.build_and_load_model` 28 | viterbi : bool 29 | Apply viterbi smoothing to the estimated pitch curve. False by default. 30 | save_activation : bool 31 | Save the output activation matrix to an .npy file. False by default. 32 | save_plot: bool 33 | Save a plot of the output activation matrix to a .png file. False by 34 | default. 35 | plot_voicing : bool 36 | Include a visual representation of the voicing activity detection in 37 | the plot of the output activation matrix. False by default, only 38 | relevant if save_plot is True. 39 | no_centering : bool 40 | Don't pad the signal, meaning frames will begin at their timestamp 41 | instead of being centered around their timestamp (which is the 42 | default). CAUTION: setting this option can result in CREPE's output 43 | being misaligned with respect to the output of other audio processing 44 | tools and is generally not recommended. 45 | step_size : int 46 | The step size in milliseconds for running pitch estimation. 47 | verbose : bool 48 | Print status messages and keras progress (default=True). 
49 | """ 50 | 51 | files = [] 52 | for path in filename: 53 | if os.path.isdir(path): 54 | found = ([file for file in os.listdir(path) if 55 | file.lower().endswith('.wav')]) 56 | if len(found) == 0: 57 | print('CREPE: No WAV files found in directory {}'.format(path), 58 | file=sys.stderr) 59 | files += [os.path.join(path, file) for file in found] 60 | elif os.path.isfile(path): 61 | if not path.lower().endswith('.wav'): 62 | print('CREPE: Expecting WAV file(s) but got {}'.format(path), 63 | file=sys.stderr) 64 | files.append(path) 65 | else: 66 | print('CREPE: File or directory not found: {}'.format(path), 67 | file=sys.stderr) 68 | 69 | if len(files) == 0: 70 | print('CREPE: No WAV files found in {}, aborting.'.format(filename)) 71 | sys.exit(-1) 72 | 73 | for i, file in enumerate(files): 74 | if verbose: 75 | print('CREPE: Processing {} ... ({}/{})'.format( 76 | file, i+1, len(files)), file=sys.stderr) 77 | process_file(file, output=output, 78 | model_capacity=model_capacity, 79 | viterbi=viterbi, 80 | center=(not no_centering), 81 | save_activation=save_activation, 82 | save_plot=save_plot, 83 | plot_voicing=plot_voicing, 84 | step_size=step_size, 85 | verbose=verbose) 86 | 87 | 88 | def positive_int(value): 89 | """An argparse type method for accepting only positive integers""" 90 | ivalue = int(value) 91 | if ivalue <= 0: 92 | raise ArgumentTypeError('expected a positive integer') 93 | return ivalue 94 | 95 | 96 | def main(): 97 | """ 98 | This is a script for running the pre-trained pitch estimation model, CREPE, 99 | by taking WAV files(s) as input. For each input WAV, a CSV file containing: 100 | 101 | time, frequency, confidence 102 | 0.00, 424.24, 0.42 103 | 0.01, 422.42, 0.84 104 | ... 105 | 106 | is created as the output, where the first column is a timestamp in seconds, 107 | the second column is the estimated frequency in Hz, and the third column is 108 | a value between 0 and 1 indicating the model's voicing confidence (i.e. 109 | confidence in the presence of a pitch for every frame). 110 | 111 | The script can also optionally save the output activation matrix of the 112 | model to an npy file, where the matrix dimensions are (n_frames, 360) using 113 | a hop size of 10 ms (there are 360 pitch bins covering 20 cents each). 114 | The script can also output a plot of the activation matrix, including an 115 | optional visual representation of the model's voicing detection. 
116 |     """
117 | 
118 |     parser = ArgumentParser(sys.argv[0], description=main.__doc__,
119 |                             formatter_class=RawDescriptionHelpFormatter)
120 | 
121 |     parser.add_argument('filename', nargs='+',
122 |                         help='path to one or more WAV file(s) to analyze OR '
123 |                              'can be a directory')
124 |     parser.add_argument('--output', '-o', default=None,
125 |                         help='directory to save the output file(s), must '
126 |                              'already exist; if not given, the output will be '
127 |                              'saved to the same directory as the input WAV '
128 |                              'file(s)')
129 |     parser.add_argument('--model-capacity', '-c', default='full',
130 |                         choices=['tiny', 'small', 'medium', 'large', 'full'],
131 |                         help='String specifying the model capacity; smaller '
132 |                              'models are faster to compute, but may yield '
133 |                              'less accurate pitch estimation')
134 |     parser.add_argument('--viterbi', '-V', action='store_true',
135 |                         help='perform Viterbi decoding to smooth the pitch '
136 |                              'curve')
137 |     parser.add_argument('--save-activation', '-a', action='store_true',
138 |                         help='save the output activation matrix to a .npy '
139 |                              'file')
140 |     parser.add_argument('--save-plot', '-p', action='store_true',
141 |                         help='save a plot of the activation matrix to a .png '
142 |                              'file')
143 |     parser.add_argument('--plot-voicing', '-v', action='store_true',
144 |                         help='Plot the voicing prediction on top of the '
145 |                              'output activation matrix plot')
146 |     parser.add_argument('--no-centering', '-n', action='store_true',
147 |                         help="Don't pad the signal, meaning frames will begin "
148 |                              "at their timestamp instead of being centered "
149 |                              "around their timestamp (which is the default). "
150 |                              "CAUTION: setting this option can result in "
151 |                              "CREPE's output being misaligned with respect to "
152 |                              "the output of other audio processing tools and "
153 |                              "is generally not recommended.")
154 |     parser.add_argument('--step-size', '-s', default=10, type=positive_int,
155 |                         help='The step size in milliseconds for running '
156 |                              'pitch estimation. The default is 10 ms.')
157 |     parser.add_argument('--quiet', '-q', default=False,
158 |                         action='store_true',
159 |                         help='Suppress all non-error printouts (e.g. 
progress ' 160 | 'bar).') 161 | 162 | args = parser.parse_args() 163 | 164 | run(args.filename, 165 | output=args.output, 166 | model_capacity=args.model_capacity, 167 | viterbi=args.viterbi, 168 | save_activation=args.save_activation, 169 | save_plot=args.save_plot, 170 | plot_voicing=args.plot_voicing, 171 | no_centering=args.no_centering, 172 | step_size=args.step_size, 173 | verbose=not args.quiet) 174 | -------------------------------------------------------------------------------- /crepe/core.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from __future__ import print_function 3 | 4 | import os 5 | import re 6 | import sys 7 | 8 | from scipy.io import wavfile 9 | import numpy as np 10 | from numpy.lib.stride_tricks import as_strided 11 | 12 | # store as a global variable, since we only support a few models for now 13 | models = { 14 | 'tiny': None, 15 | 'small': None, 16 | 'medium': None, 17 | 'large': None, 18 | 'full': None 19 | } 20 | 21 | # the model is trained on 16kHz audio 22 | model_srate = 16000 23 | 24 | 25 | def build_and_load_model(model_capacity): 26 | """ 27 | Build the CNN model and load the weights 28 | 29 | Parameters 30 | ---------- 31 | model_capacity : 'tiny', 'small', 'medium', 'large', or 'full' 32 | String specifying the model capacity, which determines the model's 33 | capacity multiplier to 4 (tiny), 8 (small), 16 (medium), 24 (large), 34 | or 32 (full). 'full' uses the model size specified in the paper, 35 | and the others use a reduced number of filters in each convolutional 36 | layer, resulting in a smaller model that is faster to evaluate at the 37 | cost of slightly reduced pitch estimation accuracy. 38 | 39 | Returns 40 | ------- 41 | model : tensorflow.keras.models.Model 42 | The pre-trained keras model loaded in memory 43 | """ 44 | from tensorflow.keras.layers import Input, Reshape, Conv2D, BatchNormalization 45 | from tensorflow.keras.layers import MaxPool2D, Dropout, Permute, Flatten, Dense 46 | from tensorflow.keras.models import Model 47 | 48 | if models[model_capacity] is None: 49 | capacity_multiplier = { 50 | 'tiny': 4, 'small': 8, 'medium': 16, 'large': 24, 'full': 32 51 | }[model_capacity] 52 | 53 | layers = [1, 2, 3, 4, 5, 6] 54 | filters = [n * capacity_multiplier for n in [32, 4, 4, 4, 8, 16]] 55 | widths = [512, 64, 64, 64, 64, 64] 56 | strides = [(4, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1)] 57 | 58 | x = Input(shape=(1024,), name='input', dtype='float32') 59 | y = Reshape(target_shape=(1024, 1, 1), name='input-reshape')(x) 60 | 61 | for l, f, w, s in zip(layers, filters, widths, strides): 62 | y = Conv2D(f, (w, 1), strides=s, padding='same', 63 | activation='relu', name="conv%d" % l)(y) 64 | y = BatchNormalization(name="conv%d-BN" % l)(y) 65 | y = MaxPool2D(pool_size=(2, 1), strides=None, padding='valid', 66 | name="conv%d-maxpool" % l)(y) 67 | y = Dropout(0.25, name="conv%d-dropout" % l)(y) 68 | 69 | y = Permute((2, 1, 3), name="transpose")(y) 70 | y = Flatten(name="flatten")(y) 71 | y = Dense(360, activation='sigmoid', name="classifier")(y) 72 | 73 | model = Model(inputs=x, outputs=y) 74 | 75 | package_dir = os.path.dirname(os.path.realpath(__file__)) 76 | filename = "model-{}.h5".format(model_capacity) 77 | model.load_weights(os.path.join(package_dir, filename)) 78 | model.compile('adam', 'binary_crossentropy') 79 | 80 | models[model_capacity] = model 81 | 82 | return models[model_capacity] 83 | 84 | 85 | def output_path(file, suffix, output_dir): 86 | 
""" 87 | return the output path of an output file corresponding to a wav file 88 | """ 89 | path = re.sub(r"(?i).wav$", suffix, file) 90 | if output_dir is not None: 91 | path = os.path.join(output_dir, os.path.basename(path)) 92 | return path 93 | 94 | 95 | def to_local_average_cents(salience, center=None): 96 | """ 97 | find the weighted average cents near the argmax bin 98 | """ 99 | 100 | if not hasattr(to_local_average_cents, 'cents_mapping'): 101 | # the bin number-to-cents mapping 102 | to_local_average_cents.cents_mapping = ( 103 | np.linspace(0, 7180, 360) + 1997.3794084376191) 104 | 105 | if salience.ndim == 1: 106 | if center is None: 107 | center = int(np.argmax(salience)) 108 | start = max(0, center - 4) 109 | end = min(len(salience), center + 5) 110 | salience = salience[start:end] 111 | product_sum = np.sum( 112 | salience * to_local_average_cents.cents_mapping[start:end]) 113 | weight_sum = np.sum(salience) 114 | return product_sum / weight_sum 115 | if salience.ndim == 2: 116 | return np.array([to_local_average_cents(salience[i, :]) for i in 117 | range(salience.shape[0])]) 118 | 119 | raise Exception("label should be either 1d or 2d ndarray") 120 | 121 | 122 | def to_viterbi_cents(salience): 123 | """ 124 | Find the Viterbi path using a transition prior that induces pitch 125 | continuity. 126 | """ 127 | from hmmlearn import hmm 128 | 129 | # uniform prior on the starting pitch 130 | starting = np.ones(360) / 360 131 | 132 | # transition probabilities inducing continuous pitch 133 | xx, yy = np.meshgrid(range(360), range(360)) 134 | transition = np.maximum(12 - abs(xx - yy), 0) 135 | transition = transition / np.sum(transition, axis=1)[:, None] 136 | 137 | # emission probability = fixed probability for self, evenly distribute the 138 | # others 139 | self_emission = 0.1 140 | emission = (np.eye(360) * self_emission + np.ones(shape=(360, 360)) * 141 | ((1 - self_emission) / 360)) 142 | 143 | # fix the model parameters because we are not optimizing the model 144 | model = hmm.CategoricalHMM(360, starting, transition) 145 | model.startprob_, model.transmat_, model.emissionprob_ = \ 146 | starting, transition, emission 147 | 148 | # find the Viterbi path 149 | observations = np.argmax(salience, axis=1) 150 | path = model.predict(observations.reshape(-1, 1), [len(observations)]) 151 | 152 | return np.array([to_local_average_cents(salience[i, :], path[i]) for i in 153 | range(len(observations))]) 154 | 155 | 156 | def get_activation(audio, sr, model_capacity='full', center=True, step_size=10, 157 | verbose=1): 158 | """ 159 | 160 | Parameters 161 | ---------- 162 | audio : np.ndarray [shape=(N,) or (N, C)] 163 | The audio samples. Multichannel audio will be downmixed. 164 | sr : int 165 | Sample rate of the audio samples. The audio will be resampled if 166 | the sample rate is not 16 kHz, which is expected by the model. 167 | model_capacity : 'tiny', 'small', 'medium', 'large', or 'full' 168 | String specifying the model capacity; see the docstring of 169 | :func:`~crepe.core.build_and_load_model` 170 | center : boolean 171 | - If `True` (default), the signal `audio` is padded so that frame 172 | `D[:, t]` is centered at `audio[t * hop_length]`. 173 | - If `False`, then `D[:, t]` begins at `audio[t * hop_length]` 174 | step_size : int 175 | The step size in milliseconds for running pitch estimation. 176 | verbose : int 177 | Set the keras verbosity mode: 1 (default) will print out a progress bar 178 | during prediction, 0 will suppress all non-error printouts. 
179 | 180 | Returns 181 | ------- 182 | activation : np.ndarray [shape=(T, 360)] 183 | The raw activation matrix 184 | """ 185 | model = build_and_load_model(model_capacity) 186 | 187 | if len(audio.shape) == 2: 188 | audio = audio.mean(1) # make mono 189 | audio = audio.astype(np.float32) 190 | if sr != model_srate: 191 | # resample audio if necessary 192 | from resampy import resample 193 | audio = resample(audio, sr, model_srate) 194 | 195 | # pad so that frames are centered around their timestamps (i.e. first frame 196 | # is zero centered). 197 | if center: 198 | audio = np.pad(audio, 512, mode='constant', constant_values=0) 199 | 200 | # make 1024-sample frames of the audio with hop length of 10 milliseconds 201 | hop_length = int(model_srate * step_size / 1000) 202 | n_frames = 1 + int((len(audio) - 1024) / hop_length) 203 | frames = as_strided(audio, shape=(1024, n_frames), 204 | strides=(audio.itemsize, hop_length * audio.itemsize)) 205 | frames = frames.transpose().copy() 206 | 207 | # normalize each frame -- this is expected by the model 208 | frames -= np.mean(frames, axis=1)[:, np.newaxis] 209 | frames /= np.clip(np.std(frames, axis=1)[:, np.newaxis], 1e-8, None) 210 | 211 | # run prediction and convert the frequency bin weights to Hz 212 | return model.predict(frames, verbose=verbose) 213 | 214 | 215 | def predict(audio, sr, model_capacity='full', 216 | viterbi=False, center=True, step_size=10, verbose=1): 217 | """ 218 | Perform pitch estimation on given audio 219 | 220 | Parameters 221 | ---------- 222 | audio : np.ndarray [shape=(N,) or (N, C)] 223 | The audio samples. Multichannel audio will be downmixed. 224 | sr : int 225 | Sample rate of the audio samples. The audio will be resampled if 226 | the sample rate is not 16 kHz, which is expected by the model. 227 | model_capacity : 'tiny', 'small', 'medium', 'large', or 'full' 228 | String specifying the model capacity; see the docstring of 229 | :func:`~crepe.core.build_and_load_model` 230 | viterbi : bool 231 | Apply viterbi smoothing to the estimated pitch curve. False by default. 232 | center : boolean 233 | - If `True` (default), the signal `audio` is padded so that frame 234 | `D[:, t]` is centered at `audio[t * hop_length]`. 235 | - If `False`, then `D[:, t]` begins at `audio[t * hop_length]` 236 | step_size : int 237 | The step size in milliseconds for running pitch estimation. 238 | verbose : int 239 | Set the keras verbosity mode: 1 (default) will print out a progress bar 240 | during prediction, 0 will suppress all non-error printouts. 
241 | 242 | Returns 243 | ------- 244 | A 4-tuple consisting of: 245 | 246 | time: np.ndarray [shape=(T,)] 247 | The timestamps on which the pitch was estimated 248 | frequency: np.ndarray [shape=(T,)] 249 | The predicted pitch values in Hz 250 | confidence: np.ndarray [shape=(T,)] 251 | The confidence of voice activity, between 0 and 1 252 | activation: np.ndarray [shape=(T, 360)] 253 | The raw activation matrix 254 | """ 255 | activation = get_activation(audio, sr, model_capacity=model_capacity, 256 | center=center, step_size=step_size, 257 | verbose=verbose) 258 | confidence = activation.max(axis=1) 259 | 260 | if viterbi: 261 | cents = to_viterbi_cents(activation) 262 | else: 263 | cents = to_local_average_cents(activation) 264 | 265 | frequency = 10 * 2 ** (cents / 1200) 266 | frequency[np.isnan(frequency)] = 0 267 | 268 | time = np.arange(confidence.shape[0]) * step_size / 1000.0 269 | 270 | return time, frequency, confidence, activation 271 | 272 | 273 | def process_file(file, output=None, model_capacity='full', viterbi=False, 274 | center=True, save_activation=False, save_plot=False, 275 | plot_voicing=False, step_size=10, verbose=True): 276 | """ 277 | Use the input model to perform pitch estimation on the input file. 278 | 279 | Parameters 280 | ---------- 281 | file : str 282 | Path to WAV file to be analyzed. 283 | output : str or None 284 | Path to directory for saving output files. If None, output files will 285 | be saved to the directory containing the input file. 286 | model_capacity : 'tiny', 'small', 'medium', 'large', or 'full' 287 | String specifying the model capacity; see the docstring of 288 | :func:`~crepe.core.build_and_load_model` 289 | viterbi : bool 290 | Apply viterbi smoothing to the estimated pitch curve. False by default. 291 | center : boolean 292 | - If `True` (default), the signal `audio` is padded so that frame 293 | `D[:, t]` is centered at `audio[t * hop_length]`. 294 | - If `False`, then `D[:, t]` begins at `audio[t * hop_length]` 295 | save_activation : bool 296 | Save the output activation matrix to an .npy file. False by default. 297 | save_plot : bool 298 | Save a plot of the output activation matrix to a .png file. False by 299 | default. 300 | plot_voicing : bool 301 | Include a visual representation of the voicing activity detection in 302 | the plot of the output activation matrix. False by default, only 303 | relevant if save_plot is True. 304 | step_size : int 305 | The step size in milliseconds for running pitch estimation. 306 | verbose : bool 307 | Print status messages and keras progress (default=True). 
308 | 309 | Returns 310 | ------- 311 | 312 | """ 313 | try: 314 | sr, audio = wavfile.read(file) 315 | except ValueError: 316 | print("CREPE: Could not read %s" % file, file=sys.stderr) 317 | raise 318 | 319 | time, frequency, confidence, activation = predict( 320 | audio, sr, 321 | model_capacity=model_capacity, 322 | viterbi=viterbi, 323 | center=center, 324 | step_size=step_size, 325 | verbose=1 * verbose) 326 | 327 | # write prediction as TSV 328 | f0_file = output_path(file, ".f0.csv", output) 329 | f0_data = np.vstack([time, frequency, confidence]).transpose() 330 | np.savetxt(f0_file, f0_data, fmt=['%.3f', '%.3f', '%.6f'], delimiter=',', 331 | header='time,frequency,confidence', comments='') 332 | if verbose: 333 | print("CREPE: Saved the estimated frequencies and confidence values " 334 | "at {}".format(f0_file)) 335 | 336 | # save the salience file to a .npy file 337 | if save_activation: 338 | activation_path = output_path(file, ".activation.npy", output) 339 | np.save(activation_path, activation) 340 | if verbose: 341 | print("CREPE: Saved the activation matrix at {}".format( 342 | activation_path)) 343 | 344 | # save the salience visualization in a PNG file 345 | if save_plot: 346 | import matplotlib.cm 347 | from imageio import imwrite 348 | 349 | plot_file = output_path(file, ".activation.png", output) 350 | # to draw the low pitches in the bottom 351 | salience = np.flip(activation, axis=1) 352 | inferno = matplotlib.cm.get_cmap('inferno') 353 | image = inferno(salience.transpose()) 354 | 355 | if plot_voicing: 356 | # attach a soft and hard voicing detection result under the 357 | # salience plot 358 | image = np.pad(image, [(0, 20), (0, 0), (0, 0)], mode='constant') 359 | image[-20:-10, :, :] = inferno(confidence)[np.newaxis, :, :] 360 | image[-10:, :, :] = ( 361 | inferno((confidence > 0.5).astype(np.float))[np.newaxis, :, :]) 362 | 363 | imwrite(plot_file, (255 * image).astype(np.uint8)) 364 | if verbose: 365 | print("CREPE: Saved the salience plot at {}".format(plot_file)) 366 | 367 | -------------------------------------------------------------------------------- /crepe/version.py: -------------------------------------------------------------------------------- 1 | version = '0.0.16' 2 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy>=1.14.0 2 | scipy>=1.0.0 3 | matplotlib>=2.1.0 4 | resampy>=0.2.0 5 | h5py 6 | hmmlearn>=0.3.0 7 | imageio>=2.3.0 8 | scikit-learn>=0.16 9 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import bz2 2 | from importlib.machinery import SourceFileLoader 3 | import os 4 | import sys 5 | 6 | import pkg_resources 7 | from setuptools import setup, find_packages 8 | 9 | try: 10 | from urllib.request import urlretrieve 11 | except ImportError: 12 | from urllib import urlretrieve 13 | 14 | model_capacities = ['tiny', 'small', 'medium', 'large', 'full'] 15 | weight_files = ['model-{}.h5'.format(cap) for cap in model_capacities] 16 | base_url = 'https://github.com/marl/crepe/raw/models/' 17 | 18 | if len(sys.argv) > 1 and sys.argv[1] == 'sdist': 19 | # exclude the weight files in sdist 20 | weight_files = [] 21 | else: 22 | # in all other cases, decompress the weights file if necessary 23 | for weight_file in weight_files: 24 | weight_path = os.path.join('crepe', weight_file) 25 | 
if not os.path.isfile(weight_path): 26 | compressed_file = weight_file + '.bz2' 27 | compressed_path = os.path.join('crepe', compressed_file) 28 | if not os.path.isfile(compressed_file): 29 | print('Downloading weight file {} ...'.format(compressed_file)) 30 | urlretrieve(base_url + compressed_file, compressed_path) 31 | print('Decompressing ...') 32 | with bz2.BZ2File(compressed_path, 'rb') as source: 33 | with open(weight_path, 'wb') as target: 34 | target.write(source.read()) 35 | print('Decompression complete') 36 | 37 | version = SourceFileLoader('crepe.version', os.path.join('crepe', 'version.py')) 38 | version = version.load_module() 39 | 40 | with open('README.md') as file: 41 | long_description = file.read() 42 | 43 | setup( 44 | name='crepe', 45 | version=version.version, 46 | description='CREPE pitch tracker', 47 | long_description=long_description, 48 | long_description_content_type='text/markdown', 49 | url='https://github.com/marl/crepe', 50 | author='Jong Wook Kim and Justin Salamon', 51 | author_email='jongwook@nyu.edu', 52 | packages=find_packages(), 53 | entry_points = { 54 | 'console_scripts': ['crepe=crepe.cli:main'], 55 | }, 56 | license='MIT', 57 | classifiers=[ 58 | 'Development Status :: 3 - Alpha', 59 | 'License :: OSI Approved :: MIT License', 60 | 'Topic :: Multimedia :: Sound/Audio :: Analysis', 61 | 'Programming Language :: Python :: 2', 62 | 'Programming Language :: Python :: 2.7', 63 | 'Programming Language :: Python :: 3', 64 | 'Programming Language :: Python :: 3.5', 65 | 'Programming Language :: Python :: 3.6', 66 | ], 67 | keywords='tfrecord', 68 | project_urls={ 69 | 'Source': 'https://github.com/marl/crepe', 70 | 'Tracker': 'https://github.com/marl/crepe/issues' 71 | }, 72 | install_requires=[ 73 | str(requirement) 74 | for requirement in pkg_resources.parse_requirements( 75 | open(os.path.join(os.path.dirname(__file__), "requirements.txt")) 76 | ) 77 | ], 78 | package_data={ 79 | 'crepe': weight_files 80 | }, 81 | ) 82 | -------------------------------------------------------------------------------- /tests/sweep.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marl/crepe/c9b71ce61491454125a0693f584f7244f29d9884/tests/sweep.wav -------------------------------------------------------------------------------- /tests/test_sweep.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import crepe 4 | 5 | # this data contains a sine sweep 6 | file = os.path.join(os.path.dirname(__file__), 'sweep.wav') 7 | f0_file = os.path.join(os.path.dirname(__file__), 'sweep.f0.csv') 8 | 9 | 10 | def verify_f0(): 11 | result = np.loadtxt(f0_file, delimiter=',', skiprows=1) 12 | 13 | # it should be confident enough about the presence of pitch in every frame 14 | assert np.mean(result[:, 2] > 0.5) > 0.98 15 | 16 | # the frequencies should be linear 17 | assert np.corrcoef(result[:, 1]) > 0.99 18 | 19 | os.remove(f0_file) 20 | 21 | 22 | def test_sweep(): 23 | crepe.process_file(file) 24 | verify_f0() 25 | 26 | 27 | def test_sweep_cli(): 28 | assert os.system("crepe {}".format(file)) == 0 29 | verify_f0() 30 | --------------------------------------------------------------------------------