├── .gitignore
├── .travis.yml
├── LICENSE
├── MANIFEST.in
├── README.md
├── crepe
│   ├── __init__.py
│   ├── __main__.py
│   ├── cli.py
│   ├── core.py
│   └── version.py
├── requirements.txt
├── setup.py
└── tests
    ├── sweep.wav
    └── test_sweep.py
/.gitignore:
--------------------------------------------------------------------------------
1 | *.iml
2 | *.swp
3 | .idea
4 | *.pyc
5 | *.pyo
6 | __pycache__
7 |
8 | .cache
9 | .ipynb_checkpoints
10 | .pytest_cache
11 |
12 | .DS_Store
13 | thumbs.db
14 |
15 | dist
16 | build
17 | *.egg-info
18 |
19 | *.activation.png
20 | *.activation.npy
21 | *.f0.csv
22 |
23 | crepe/model-*.h5
24 | crepe/model-*.h5.bz2
25 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 | python:
3 | - "3.6"
4 | - "3.7"
5 | - "3.8"
6 | install:
7 | - pip install pytest tensorflow==2.4.1
8 | - pip install .
9 | script:
10 | - pytest
11 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2018 Jong Wook Kim
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | recursive-include crepe *.py
2 | include README.md
3 | include LICENSE
4 | include requirements.txt
5 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | CREPE Pitch Tracker
2 | ===================
3 |
4 | [PyPI](https://pypi.python.org/pypi/crepe)
5 | [License: MIT](https://opensource.org/licenses/MIT)
6 | [Build Status](https://travis-ci.org/marl/crepe)
7 | [Downloads](https://pepy.tech/project/crepe)
9 |
11 |
12 |
13 |
14 | CREPE is a monophonic pitch tracker based on a deep convolutional neural network operating directly on the time-domain waveform input. CREPE is state-of-the-art (as of 2018), outperforming popular pitch trackers such as pYIN and SWIPE:
15 |
16 |
*(figure: pitch estimation accuracy compared to pYIN and SWIPE)*
17 |
18 | Further details are provided in the following paper:
19 |
20 | > [CREPE: A Convolutional Representation for Pitch Estimation](https://arxiv.org/abs/1802.06182)
21 | > Jong Wook Kim, Justin Salamon, Peter Li, Juan Pablo Bello.
22 | > Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
23 |
24 | We kindly request that academic publications making use of CREPE cite the aforementioned paper.
25 |
26 |
27 | ## Installing CREPE
28 |
29 | CREPE is hosted on PyPI. To install, run the following command in your Python environment:
30 |
31 | ```bash
32 | $ pip install --upgrade tensorflow # if you don't already have tensorflow >= 2.0.0
33 | $ pip install crepe
34 | ```
35 |
36 | To install the latest version from source, clone the repository and, from the top-level `crepe` folder, call:
37 |
38 | ```bash
39 | $ python setup.py install
40 | ```
41 |
42 | ## Using CREPE
43 | ### Using CREPE from the command line
44 |
45 | This package includes a command line utility `crepe` and a pre-trained version of the CREPE model for easy use. To estimate the pitch of `audio_file.wav`, run:
46 |
47 | ```bash
48 | $ crepe audio_file.wav
49 | ```
50 |
51 | or
52 |
53 | ```bash
54 | $ python -m crepe audio_file.wav
55 | ```
56 |
57 | The resulting `audio_file.f0.csv` contains 3 columns: the first contains timestamps (with a 10 ms hop size by default), the second contains the predicted fundamental frequency in Hz, and the third contains the voicing confidence, i.e. the confidence in the presence of a pitch:
58 |
59 |     time,frequency,confidence
60 |     0.00,185.616,0.907112
61 |     0.01,186.764,0.844488
62 |     0.02,188.356,0.798015
63 |     0.03,190.610,0.746729
64 |     0.04,192.952,0.771268
65 |     0.05,195.191,0.859440
66 |     0.06,196.541,0.864447
67 |     0.07,197.809,0.827441
68 |     0.08,199.678,0.775208
69 |     ...
70 |
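The CSV can be read back with standard tools; here is a minimal sketch using NumPy (assuming the `audio_file.f0.csv` produced above):

```python
import numpy as np

# load the three columns written by CREPE, skipping the header row
data = np.loadtxt('audio_file.f0.csv', delimiter=',', skiprows=1)
time, frequency, confidence = data[:, 0], data[:, 1], data[:, 2]

# e.g. keep only the frames where the model is reasonably confident a pitch is present
voiced_frequency = frequency[confidence > 0.5]
```
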
71 | #### Timestamps
72 |
73 | CREPE uses 10-millisecond time steps by default; this can be adjusted using
74 | the `--step-size` option, which takes the size of the time step in milliseconds.
75 | For example, `--step-size 50` will calculate pitch once every 50 milliseconds.
76 |
77 | Following the convention adopted by popular audio processing libraries such as
78 | [Essentia](http://essentia.upf.edu/) and [Librosa](https://librosa.github.io/librosa/),
79 | from v0.0.5 onwards CREPE will pad the input signal such that the first frame
80 | is zero-centered (the center of the frame corresponds to time 0) and generally
81 | all frames are centered around their corresponding timestamp, i.e. frame
82 | `D[:, t]` is centered at `audio[t * hop_length]`. This behavior can be changed
83 | by specifying the optional `--no-centering` flag, in which case the first frame
84 | will *start* at time zero and generally frame `D[:, t]` will *begin* at
85 | `audio[t * hop_length]`. Sticking to the default behavior (centered frames) is
86 | strongly recommended to avoid misalignment with features and annotations produced
87 | by other common audio processing tools.
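
The same step size and centering behavior are available from the Python API. A small sketch (the input file name is hypothetical) showing how the returned timestamps relate to the step size:

```python
import numpy as np
import crepe
from scipy.io import wavfile

sr, audio = wavfile.read('audio_file.wav')  # hypothetical input file

# 50 ms steps; with center=True (the default), frame t is centered at
# audio[t * hop_length] on the 16 kHz signal used internally by the model
time, frequency, confidence, activation = crepe.predict(
    audio, sr, step_size=50, center=True)

# the timestamps are simply multiples of the step size, starting at 0
assert np.allclose(time, np.arange(len(time)) * 50 / 1000.0)
```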
88 |
89 | #### Model Capacity
90 |
91 | By default, CREPE uses the model size reported in the paper, but it can optionally
92 | use a smaller model for faster computation, at the cost of slightly lower accuracy.
93 | You can specify `--model-capacity {tiny|small|medium|large|full}` on the command
94 | line to select a model with the desired capacity.
95 |
96 | #### Temporal smoothing
97 | By default CREPE does not apply temporal smoothing to the pitch curve, but
98 | Viterbi smoothing is supported via the optional `--viterbi` command line argument.
99 |
100 |
101 | #### Saving the activation matrix
102 | The script can also optionally save the output activation matrix of the model
103 | to an npy file (`--save-activation`), where the matrix dimensions are
104 | (n_frames, 360) using a hop size of 10 ms (there are 360 pitch bins covering 20
105 | cents each).
106 |
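The saved matrix can be inspected directly. A minimal sketch (assuming the `audio_file.activation.npy` produced by `--save-activation`) that recovers the per-frame confidence and a coarse, argmax-only pitch estimate using the same 20-cent bin mapping as the model:

```python
import numpy as np

activation = np.load('audio_file.activation.npy')  # shape: (n_frames, 360)

# each of the 360 bins covers 20 cents; this is the bin-to-cents mapping used by CREPE
cents_per_bin = np.linspace(0, 7180, 360) + 1997.3794084376191

confidence = activation.max(axis=1)               # voicing confidence per frame
cents = cents_per_bin[activation.argmax(axis=1)]  # coarse estimate (argmax only)
frequency = 10 * 2 ** (cents / 1200)              # convert cents to Hz
```

Note that CREPE itself refines the argmax with a local weighted average (see below), so the values above will differ slightly from the ones in the CSV.
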
107 | The script can also output a plot of the activation matrix (`--save-plot`),
108 | saved to `audio_file.activation.png`, including an optional visual representation
109 | of the model's voicing detection (`--plot-voicing`). Here's an example plot of
110 | the activation matrix (without the voicing overlay) for an excerpt of male
111 | singing voice:
112 |
113 | *(example plot: activation matrix for an excerpt of male singing voice)*
114 |
115 | #### Batch processing
116 | For batch processing of files, you can provide a folder path instead of a file path:
117 | ```bash
118 | $ crepe audio_folder
119 | ```
120 | The script will process all WAV files found inside the folder.
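
The equivalent from Python is a short loop over `crepe.process_file`; a sketch (the folder name is hypothetical):

```python
import os
import crepe

audio_folder = 'audio_folder'  # hypothetical directory containing WAV files

for name in sorted(os.listdir(audio_folder)):
    if name.lower().endswith('.wav'):
        # writes <name>.f0.csv next to each input file
        crepe.process_file(os.path.join(audio_folder, name))
```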
121 |
122 | #### Additional usage information
123 | For more information on the usage, please refer to the help message:
124 |
125 | ```bash
126 | $ crepe --help
127 | ```
128 |
129 | ### Using CREPE inside Python
130 | CREPE can be imported as a module and used directly in Python. Here's a minimal example:
131 | ```python
132 | import crepe
133 | from scipy.io import wavfile
134 |
135 | sr, audio = wavfile.read('/path/to/audiofile.wav')
136 | time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)
137 | ```
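
The keyword arguments mirror the command-line options. A slightly fuller sketch (paths are hypothetical):

```python
import crepe
from scipy.io import wavfile

sr, audio = wavfile.read('/path/to/audiofile.wav')

# smaller model, 50 ms hop, no Viterbi smoothing, no progress bar
time, frequency, confidence, activation = crepe.predict(
    audio, sr, model_capacity='small', step_size=50, viterbi=False, verbose=0)

# alternatively, let CREPE handle the file I/O and write audiofile.f0.csv
# next to the input, just like the command-line tool
crepe.process_file('/path/to/audiofile.wav', model_capacity='small', step_size=50)
```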
138 |
139 | ## Argmax-local Weighted Averaging
140 |
141 | This release of CREPE uses the following weighted averaging formula, which is slightly different from the paper: only the neighborhood around the maximum activation is considered, which is shown to further improve the pitch accuracy. With $\hat{y}_i$ denoting the activation of pitch bin $i$, $c_i$ the center pitch of bin $i$ in cents, and $m$ the index of the bin with the maximum activation:
142 |
143 | $$\hat{c} = \frac{\sum_{i=m-4}^{m+4} \hat{y}_i \, c_i}{\sum_{i=m-4}^{m+4} \hat{y}_i}$$
144 |
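The estimate in cents is then converted to frequency as $f = 10 \cdot 2^{\hat{c}/1200}$ Hz. This is the computation implemented by `crepe.core.to_local_average_cents`; a single-frame sketch in NumPy:

```python
import numpy as np

def local_average_cents(salience, cents_mapping):
    """Weighted average of the bins within 4 bins of the argmax (one frame)."""
    m = int(np.argmax(salience))
    start, end = max(0, m - 4), min(len(salience), m + 5)
    weights = salience[start:end]
    return np.sum(weights * cents_mapping[start:end]) / np.sum(weights)
```
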
145 | ## Please Note
146 |
147 | - The current version only supports WAV files as input.
148 | - The model is trained on 16 kHz audio, so if the input audio has a different sample rate, it will be first resampled to 16 kHz using [resampy](https://github.com/bmcfee/resampy).
149 | - Due to the subtle numerical differences between frameworks, Keras should be configured to use the TensorFlow backend for the best performance. The model was trained using Keras 2.1.5 and TensorFlow 1.6.0, and newer versions of TensorFlow seem to work as well.
150 | - Prediction is significantly faster if Keras (and the corresponding backend) is configured to run on GPU.
151 | - The provided model is trained using the following datasets, composed of vocal and instrumental audio, and is therefore expected to work best on these types of audio signals.
152 | - MIR-1K [1]
153 | - Bach10 [2]
154 | - RWC-Synth [3]
155 | - MedleyDB [4]
156 | - MDB-STEM-Synth [5]
157 | - NSynth [6]
158 |
159 |
160 | ## References
161 |
162 | [1] C.-L. Hsu et al. "On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset", *IEEE Transactions on Audio, Speech, and Language Processing.* 2009.
163 |
164 | [2] Z. Duan et al. "Multiple Fundamental Frequency Estimation by Modeling Spectral Peaks and Non-Peak Regions", *IEEE Transactions on Audio, Speech, and Language Processing.* 2010.
165 |
166 | [3] M. Mauch et al. "pYIN: A fundamental Frequency Estimator Using Probabilistic Threshold Distributions", *Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).* 2014.
167 |
168 | [4] R. M. Bittner et al. "MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research", *Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference.* 2014.
169 |
170 | [5] J. Salamon et al. "An Analysis/Synthesis Framework for Automatic F0 Annotation of Multitrack Datasets", *Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference*. 2017.
171 |
172 | [6] J. Engel et al. "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders", *arXiv preprint: 1704.01279*. 2017.
173 |
174 |
--------------------------------------------------------------------------------
/crepe/__init__.py:
--------------------------------------------------------------------------------
1 | from .version import version as __version__
2 | from .core import get_activation, predict, process_file
3 |
--------------------------------------------------------------------------------
/crepe/__main__.py:
--------------------------------------------------------------------------------
1 | from .cli import main
2 |
3 | # call the CLI handler when the module is executed as `python -m crepe`
4 | main()
5 |
--------------------------------------------------------------------------------
/crepe/cli.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import os
4 | import sys
5 | from argparse import ArgumentParser, RawDescriptionHelpFormatter
6 | from argparse import ArgumentTypeError
7 |
8 | from .core import process_file
9 |
10 |
11 | def run(filename, output=None, model_capacity='full', viterbi=False,
12 | save_activation=False, save_plot=False, plot_voicing=False,
13 | no_centering=False, step_size=10, verbose=True):
14 | """
15 | Collect the WAV files to process and run the model
16 |
17 | Parameters
18 | ----------
19 | filename : list
20 | List containing paths to WAV files or folders containing WAV files to
21 | be analyzed.
22 | output : str or None
23 | Path to directory for saving output files. If None, output files will
24 | be saved to the directory containing the input file.
25 | model_capacity : 'tiny', 'small', 'medium', 'large', or 'full'
26 | String specifying the model capacity; see the docstring of
27 | :func:`~crepe.core.build_and_load_model`
28 | viterbi : bool
29 | Apply viterbi smoothing to the estimated pitch curve. False by default.
30 | save_activation : bool
31 | Save the output activation matrix to an .npy file. False by default.
32 | save_plot: bool
33 | Save a plot of the output activation matrix to a .png file. False by
34 | default.
35 | plot_voicing : bool
36 | Include a visual representation of the voicing activity detection in
37 | the plot of the output activation matrix. False by default, only
38 | relevant if save_plot is True.
39 | no_centering : bool
40 | Don't pad the signal, meaning frames will begin at their timestamp
41 | instead of being centered around their timestamp (which is the
42 | default). CAUTION: setting this option can result in CREPE's output
43 | being misaligned with respect to the output of other audio processing
44 | tools and is generally not recommended.
45 | step_size : int
46 | The step size in milliseconds for running pitch estimation.
47 | verbose : bool
48 | Print status messages and keras progress (default=True).
49 | """
50 |
51 | files = []
52 | for path in filename:
53 | if os.path.isdir(path):
54 | found = ([file for file in os.listdir(path) if
55 | file.lower().endswith('.wav')])
56 | if len(found) == 0:
57 | print('CREPE: No WAV files found in directory {}'.format(path),
58 | file=sys.stderr)
59 | files += [os.path.join(path, file) for file in found]
60 | elif os.path.isfile(path):
61 | if not path.lower().endswith('.wav'):
62 | print('CREPE: Expecting WAV file(s) but got {}'.format(path),
63 | file=sys.stderr)
64 | files.append(path)
65 | else:
66 | print('CREPE: File or directory not found: {}'.format(path),
67 | file=sys.stderr)
68 |
69 | if len(files) == 0:
70 | print('CREPE: No WAV files found in {}, aborting.'.format(filename))
71 | sys.exit(-1)
72 |
73 | for i, file in enumerate(files):
74 | if verbose:
75 | print('CREPE: Processing {} ... ({}/{})'.format(
76 | file, i+1, len(files)), file=sys.stderr)
77 | process_file(file, output=output,
78 | model_capacity=model_capacity,
79 | viterbi=viterbi,
80 | center=(not no_centering),
81 | save_activation=save_activation,
82 | save_plot=save_plot,
83 | plot_voicing=plot_voicing,
84 | step_size=step_size,
85 | verbose=verbose)
86 |
87 |
88 | def positive_int(value):
89 | """An argparse type method for accepting only positive integers"""
90 | ivalue = int(value)
91 | if ivalue <= 0:
92 | raise ArgumentTypeError('expected a positive integer')
93 | return ivalue
94 |
95 |
96 | def main():
97 | """
98 | This is a script for running the pre-trained pitch estimation model, CREPE,
99 |     by taking WAV file(s) as input. For each input WAV, a CSV file containing:
100 |
101 | time, frequency, confidence
102 | 0.00, 424.24, 0.42
103 | 0.01, 422.42, 0.84
104 | ...
105 |
106 | is created as the output, where the first column is a timestamp in seconds,
107 | the second column is the estimated frequency in Hz, and the third column is
108 | a value between 0 and 1 indicating the model's voicing confidence (i.e.
109 | confidence in the presence of a pitch for every frame).
110 |
111 | The script can also optionally save the output activation matrix of the
112 | model to an npy file, where the matrix dimensions are (n_frames, 360) using
113 | a hop size of 10 ms (there are 360 pitch bins covering 20 cents each).
114 | The script can also output a plot of the activation matrix, including an
115 | optional visual representation of the model's voicing detection.
116 | """
117 |
118 | parser = ArgumentParser(sys.argv[0], description=main.__doc__,
119 | formatter_class=RawDescriptionHelpFormatter)
120 |
121 | parser.add_argument('filename', nargs='+',
122 |                         help='path to one or more WAV file(s) to analyze OR '
123 | 'can be a directory')
124 | parser.add_argument('--output', '-o', default=None,
125 |                         help='directory to save the output file(s), must '
126 | 'already exist; if not given, the output will be '
127 | 'saved to the same directory as the input WAV '
128 | 'file(s)')
129 | parser.add_argument('--model-capacity', '-c', default='full',
130 | choices=['tiny', 'small', 'medium', 'large', 'full'],
131 | help='String specifying the model capacity; smaller '
132 | 'models are faster to compute, but may yield '
133 | 'less accurate pitch estimation')
134 | parser.add_argument('--viterbi', '-V', action='store_true',
135 | help='perform Viterbi decoding to smooth the pitch '
136 | 'curve')
137 | parser.add_argument('--save-activation', '-a', action='store_true',
138 | help='save the output activation matrix to a .npy '
139 | 'file')
140 | parser.add_argument('--save-plot', '-p', action='store_true',
141 | help='save a plot of the activation matrix to a .png '
142 | 'file')
143 | parser.add_argument('--plot-voicing', '-v', action='store_true',
144 | help='Plot the voicing prediction on top of the '
145 | 'output activation matrix plot')
146 | parser.add_argument('--no-centering', '-n', action='store_true',
147 | help="Don't pad the signal, meaning frames will begin "
148 | "at their timestamp instead of being centered "
149 | "around their timestamp (which is the default). "
150 | "CAUTION: setting this option can result in "
151 | "CREPE's output being misaligned with respect to "
152 | "the output of other audio processing tools and "
153 | "is generally not recommended.")
154 | parser.add_argument('--step-size', '-s', default=10, type=positive_int,
155 | help='The step size in milliseconds for running '
156 | 'pitch estimation. The default is 10 ms.')
157 | parser.add_argument('--quiet', '-q', default=False,
158 | action='store_true',
159 | help='Suppress all non-error printouts (e.g. progress '
160 | 'bar).')
161 |
162 | args = parser.parse_args()
163 |
164 | run(args.filename,
165 | output=args.output,
166 | model_capacity=args.model_capacity,
167 | viterbi=args.viterbi,
168 | save_activation=args.save_activation,
169 | save_plot=args.save_plot,
170 | plot_voicing=args.plot_voicing,
171 | no_centering=args.no_centering,
172 | step_size=args.step_size,
173 | verbose=not args.quiet)
174 |
--------------------------------------------------------------------------------
/crepe/core.py:
--------------------------------------------------------------------------------
1 | from __future__ import division
2 | from __future__ import print_function
3 |
4 | import os
5 | import re
6 | import sys
7 |
8 | from scipy.io import wavfile
9 | import numpy as np
10 | from numpy.lib.stride_tricks import as_strided
11 |
12 | # store as a global variable, since we only support a few models for now
13 | models = {
14 | 'tiny': None,
15 | 'small': None,
16 | 'medium': None,
17 | 'large': None,
18 | 'full': None
19 | }
20 |
21 | # the model is trained on 16kHz audio
22 | model_srate = 16000
23 |
24 |
25 | def build_and_load_model(model_capacity):
26 | """
27 | Build the CNN model and load the weights
28 |
29 | Parameters
30 | ----------
31 | model_capacity : 'tiny', 'small', 'medium', 'large', or 'full'
32 |         String specifying the model capacity, which sets the model's
33 |         capacity multiplier: 4 (tiny), 8 (small), 16 (medium), 24 (large),
34 |         or 32 (full). 'full' uses the model size specified in the paper,
35 | and the others use a reduced number of filters in each convolutional
36 | layer, resulting in a smaller model that is faster to evaluate at the
37 | cost of slightly reduced pitch estimation accuracy.
38 |
39 | Returns
40 | -------
41 | model : tensorflow.keras.models.Model
42 | The pre-trained keras model loaded in memory
43 | """
44 | from tensorflow.keras.layers import Input, Reshape, Conv2D, BatchNormalization
45 | from tensorflow.keras.layers import MaxPool2D, Dropout, Permute, Flatten, Dense
46 | from tensorflow.keras.models import Model
47 |
48 | if models[model_capacity] is None:
49 | capacity_multiplier = {
50 | 'tiny': 4, 'small': 8, 'medium': 16, 'large': 24, 'full': 32
51 | }[model_capacity]
52 |
53 | layers = [1, 2, 3, 4, 5, 6]
54 | filters = [n * capacity_multiplier for n in [32, 4, 4, 4, 8, 16]]
55 | widths = [512, 64, 64, 64, 64, 64]
56 | strides = [(4, 1), (1, 1), (1, 1), (1, 1), (1, 1), (1, 1)]
57 |
58 | x = Input(shape=(1024,), name='input', dtype='float32')
59 | y = Reshape(target_shape=(1024, 1, 1), name='input-reshape')(x)
60 |
61 | for l, f, w, s in zip(layers, filters, widths, strides):
62 | y = Conv2D(f, (w, 1), strides=s, padding='same',
63 | activation='relu', name="conv%d" % l)(y)
64 | y = BatchNormalization(name="conv%d-BN" % l)(y)
65 | y = MaxPool2D(pool_size=(2, 1), strides=None, padding='valid',
66 | name="conv%d-maxpool" % l)(y)
67 | y = Dropout(0.25, name="conv%d-dropout" % l)(y)
68 |
69 | y = Permute((2, 1, 3), name="transpose")(y)
70 | y = Flatten(name="flatten")(y)
71 | y = Dense(360, activation='sigmoid', name="classifier")(y)
72 |
73 | model = Model(inputs=x, outputs=y)
74 |
75 | package_dir = os.path.dirname(os.path.realpath(__file__))
76 | filename = "model-{}.h5".format(model_capacity)
77 | model.load_weights(os.path.join(package_dir, filename))
78 | model.compile('adam', 'binary_crossentropy')
79 |
80 | models[model_capacity] = model
81 |
82 | return models[model_capacity]
83 |
84 |
85 | def output_path(file, suffix, output_dir):
86 | """
87 | return the output path of an output file corresponding to a wav file
88 | """
89 |     path = re.sub(r"(?i)\.wav$", suffix, file)
90 | if output_dir is not None:
91 | path = os.path.join(output_dir, os.path.basename(path))
92 | return path
93 |
94 |
95 | def to_local_average_cents(salience, center=None):
96 | """
97 | find the weighted average cents near the argmax bin
98 | """
99 |
100 | if not hasattr(to_local_average_cents, 'cents_mapping'):
101 | # the bin number-to-cents mapping
102 | to_local_average_cents.cents_mapping = (
103 | np.linspace(0, 7180, 360) + 1997.3794084376191)
104 |
105 | if salience.ndim == 1:
106 | if center is None:
107 | center = int(np.argmax(salience))
108 | start = max(0, center - 4)
109 | end = min(len(salience), center + 5)
110 | salience = salience[start:end]
111 | product_sum = np.sum(
112 | salience * to_local_average_cents.cents_mapping[start:end])
113 | weight_sum = np.sum(salience)
114 | return product_sum / weight_sum
115 | if salience.ndim == 2:
116 | return np.array([to_local_average_cents(salience[i, :]) for i in
117 | range(salience.shape[0])])
118 |
119 |     raise Exception("salience should be either 1d or 2d ndarray")
120 |
121 |
122 | def to_viterbi_cents(salience):
123 | """
124 | Find the Viterbi path using a transition prior that induces pitch
125 | continuity.
126 | """
127 | from hmmlearn import hmm
128 |
129 | # uniform prior on the starting pitch
130 | starting = np.ones(360) / 360
131 |
132 | # transition probabilities inducing continuous pitch
133 | xx, yy = np.meshgrid(range(360), range(360))
134 | transition = np.maximum(12 - abs(xx - yy), 0)
135 | transition = transition / np.sum(transition, axis=1)[:, None]
136 |
137 | # emission probability = fixed probability for self, evenly distribute the
138 | # others
139 | self_emission = 0.1
140 | emission = (np.eye(360) * self_emission + np.ones(shape=(360, 360)) *
141 | ((1 - self_emission) / 360))
142 |
143 | # fix the model parameters because we are not optimizing the model
144 | model = hmm.CategoricalHMM(360, starting, transition)
145 | model.startprob_, model.transmat_, model.emissionprob_ = \
146 | starting, transition, emission
147 |
148 | # find the Viterbi path
149 | observations = np.argmax(salience, axis=1)
150 | path = model.predict(observations.reshape(-1, 1), [len(observations)])
151 |
152 | return np.array([to_local_average_cents(salience[i, :], path[i]) for i in
153 | range(len(observations))])
154 |
155 |
156 | def get_activation(audio, sr, model_capacity='full', center=True, step_size=10,
157 | verbose=1):
158 | """
159 |     Compute the model's frame-wise activation matrix (pitch bin salience).
160 | Parameters
161 | ----------
162 | audio : np.ndarray [shape=(N,) or (N, C)]
163 | The audio samples. Multichannel audio will be downmixed.
164 | sr : int
165 | Sample rate of the audio samples. The audio will be resampled if
166 | the sample rate is not 16 kHz, which is expected by the model.
167 | model_capacity : 'tiny', 'small', 'medium', 'large', or 'full'
168 | String specifying the model capacity; see the docstring of
169 | :func:`~crepe.core.build_and_load_model`
170 | center : boolean
171 | - If `True` (default), the signal `audio` is padded so that frame
172 | `D[:, t]` is centered at `audio[t * hop_length]`.
173 | - If `False`, then `D[:, t]` begins at `audio[t * hop_length]`
174 | step_size : int
175 | The step size in milliseconds for running pitch estimation.
176 | verbose : int
177 | Set the keras verbosity mode: 1 (default) will print out a progress bar
178 | during prediction, 0 will suppress all non-error printouts.
179 |
180 | Returns
181 | -------
182 | activation : np.ndarray [shape=(T, 360)]
183 | The raw activation matrix
184 | """
185 | model = build_and_load_model(model_capacity)
186 |
187 | if len(audio.shape) == 2:
188 | audio = audio.mean(1) # make mono
189 | audio = audio.astype(np.float32)
190 | if sr != model_srate:
191 | # resample audio if necessary
192 | from resampy import resample
193 | audio = resample(audio, sr, model_srate)
194 |
195 | # pad so that frames are centered around their timestamps (i.e. first frame
196 | # is zero centered).
197 | if center:
198 | audio = np.pad(audio, 512, mode='constant', constant_values=0)
199 |
200 | # make 1024-sample frames of the audio with hop length of 10 milliseconds
201 | hop_length = int(model_srate * step_size / 1000)
202 | n_frames = 1 + int((len(audio) - 1024) / hop_length)
203 | frames = as_strided(audio, shape=(1024, n_frames),
204 | strides=(audio.itemsize, hop_length * audio.itemsize))
205 | frames = frames.transpose().copy()
206 |
207 | # normalize each frame -- this is expected by the model
208 | frames -= np.mean(frames, axis=1)[:, np.newaxis]
209 | frames /= np.clip(np.std(frames, axis=1)[:, np.newaxis], 1e-8, None)
210 |
211 |     # run prediction and return the raw activation matrix
212 | return model.predict(frames, verbose=verbose)
213 |
214 |
215 | def predict(audio, sr, model_capacity='full',
216 | viterbi=False, center=True, step_size=10, verbose=1):
217 | """
218 | Perform pitch estimation on given audio
219 |
220 | Parameters
221 | ----------
222 | audio : np.ndarray [shape=(N,) or (N, C)]
223 | The audio samples. Multichannel audio will be downmixed.
224 | sr : int
225 | Sample rate of the audio samples. The audio will be resampled if
226 | the sample rate is not 16 kHz, which is expected by the model.
227 | model_capacity : 'tiny', 'small', 'medium', 'large', or 'full'
228 | String specifying the model capacity; see the docstring of
229 | :func:`~crepe.core.build_and_load_model`
230 | viterbi : bool
231 | Apply viterbi smoothing to the estimated pitch curve. False by default.
232 | center : boolean
233 | - If `True` (default), the signal `audio` is padded so that frame
234 | `D[:, t]` is centered at `audio[t * hop_length]`.
235 | - If `False`, then `D[:, t]` begins at `audio[t * hop_length]`
236 | step_size : int
237 | The step size in milliseconds for running pitch estimation.
238 | verbose : int
239 | Set the keras verbosity mode: 1 (default) will print out a progress bar
240 | during prediction, 0 will suppress all non-error printouts.
241 |
242 | Returns
243 | -------
244 | A 4-tuple consisting of:
245 |
246 | time: np.ndarray [shape=(T,)]
247 | The timestamps on which the pitch was estimated
248 | frequency: np.ndarray [shape=(T,)]
249 | The predicted pitch values in Hz
250 | confidence: np.ndarray [shape=(T,)]
251 | The confidence of voice activity, between 0 and 1
252 | activation: np.ndarray [shape=(T, 360)]
253 | The raw activation matrix
254 | """
255 | activation = get_activation(audio, sr, model_capacity=model_capacity,
256 | center=center, step_size=step_size,
257 | verbose=verbose)
258 | confidence = activation.max(axis=1)
259 |
260 | if viterbi:
261 | cents = to_viterbi_cents(activation)
262 | else:
263 | cents = to_local_average_cents(activation)
264 |
265 | frequency = 10 * 2 ** (cents / 1200)
266 | frequency[np.isnan(frequency)] = 0
267 |
268 | time = np.arange(confidence.shape[0]) * step_size / 1000.0
269 |
270 | return time, frequency, confidence, activation
271 |
272 |
273 | def process_file(file, output=None, model_capacity='full', viterbi=False,
274 | center=True, save_activation=False, save_plot=False,
275 | plot_voicing=False, step_size=10, verbose=True):
276 | """
277 | Use the input model to perform pitch estimation on the input file.
278 |
279 | Parameters
280 | ----------
281 | file : str
282 | Path to WAV file to be analyzed.
283 | output : str or None
284 | Path to directory for saving output files. If None, output files will
285 | be saved to the directory containing the input file.
286 | model_capacity : 'tiny', 'small', 'medium', 'large', or 'full'
287 | String specifying the model capacity; see the docstring of
288 | :func:`~crepe.core.build_and_load_model`
289 | viterbi : bool
290 | Apply viterbi smoothing to the estimated pitch curve. False by default.
291 | center : boolean
292 | - If `True` (default), the signal `audio` is padded so that frame
293 | `D[:, t]` is centered at `audio[t * hop_length]`.
294 | - If `False`, then `D[:, t]` begins at `audio[t * hop_length]`
295 | save_activation : bool
296 | Save the output activation matrix to an .npy file. False by default.
297 | save_plot : bool
298 | Save a plot of the output activation matrix to a .png file. False by
299 | default.
300 | plot_voicing : bool
301 | Include a visual representation of the voicing activity detection in
302 | the plot of the output activation matrix. False by default, only
303 | relevant if save_plot is True.
304 | step_size : int
305 | The step size in milliseconds for running pitch estimation.
306 | verbose : bool
307 | Print status messages and keras progress (default=True).
308 |
309 | Returns
310 | -------
311 |
312 | """
313 | try:
314 | sr, audio = wavfile.read(file)
315 | except ValueError:
316 | print("CREPE: Could not read %s" % file, file=sys.stderr)
317 | raise
318 |
319 | time, frequency, confidence, activation = predict(
320 | audio, sr,
321 | model_capacity=model_capacity,
322 | viterbi=viterbi,
323 | center=center,
324 | step_size=step_size,
325 | verbose=1 * verbose)
326 |
327 | # write prediction as TSV
328 | f0_file = output_path(file, ".f0.csv", output)
329 | f0_data = np.vstack([time, frequency, confidence]).transpose()
330 | np.savetxt(f0_file, f0_data, fmt=['%.3f', '%.3f', '%.6f'], delimiter=',',
331 | header='time,frequency,confidence', comments='')
332 | if verbose:
333 | print("CREPE: Saved the estimated frequencies and confidence values "
334 | "at {}".format(f0_file))
335 |
336 | # save the salience file to a .npy file
337 | if save_activation:
338 | activation_path = output_path(file, ".activation.npy", output)
339 | np.save(activation_path, activation)
340 | if verbose:
341 | print("CREPE: Saved the activation matrix at {}".format(
342 | activation_path))
343 |
344 | # save the salience visualization in a PNG file
345 | if save_plot:
346 | import matplotlib.cm
347 | from imageio import imwrite
348 |
349 | plot_file = output_path(file, ".activation.png", output)
350 | # to draw the low pitches in the bottom
351 | salience = np.flip(activation, axis=1)
352 | inferno = matplotlib.cm.get_cmap('inferno')
353 | image = inferno(salience.transpose())
354 |
355 | if plot_voicing:
356 | # attach a soft and hard voicing detection result under the
357 | # salience plot
358 | image = np.pad(image, [(0, 20), (0, 0), (0, 0)], mode='constant')
359 | image[-20:-10, :, :] = inferno(confidence)[np.newaxis, :, :]
360 | image[-10:, :, :] = (
361 |                 inferno((confidence > 0.5).astype(float))[np.newaxis, :, :])
362 |
363 | imwrite(plot_file, (255 * image).astype(np.uint8))
364 | if verbose:
365 | print("CREPE: Saved the salience plot at {}".format(plot_file))
366 |
367 |
--------------------------------------------------------------------------------
/crepe/version.py:
--------------------------------------------------------------------------------
1 | version = '0.0.16'
2 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy>=1.14.0
2 | scipy>=1.0.0
3 | matplotlib>=2.1.0
4 | resampy>=0.2.0
5 | h5py
6 | hmmlearn>=0.3.0
7 | imageio>=2.3.0
8 | scikit-learn>=0.16
9 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import bz2
2 | from importlib.machinery import SourceFileLoader
3 | import os
4 | import sys
5 |
6 | import pkg_resources
7 | from setuptools import setup, find_packages
8 |
9 | try:
10 | from urllib.request import urlretrieve
11 | except ImportError:
12 | from urllib import urlretrieve
13 |
14 | model_capacities = ['tiny', 'small', 'medium', 'large', 'full']
15 | weight_files = ['model-{}.h5'.format(cap) for cap in model_capacities]
16 | base_url = 'https://github.com/marl/crepe/raw/models/'
17 |
18 | if len(sys.argv) > 1 and sys.argv[1] == 'sdist':
19 | # exclude the weight files in sdist
20 | weight_files = []
21 | else:
22 | # in all other cases, decompress the weights file if necessary
23 | for weight_file in weight_files:
24 | weight_path = os.path.join('crepe', weight_file)
25 | if not os.path.isfile(weight_path):
26 | compressed_file = weight_file + '.bz2'
27 | compressed_path = os.path.join('crepe', compressed_file)
28 |             if not os.path.isfile(compressed_path):
29 | print('Downloading weight file {} ...'.format(compressed_file))
30 | urlretrieve(base_url + compressed_file, compressed_path)
31 | print('Decompressing ...')
32 | with bz2.BZ2File(compressed_path, 'rb') as source:
33 | with open(weight_path, 'wb') as target:
34 | target.write(source.read())
35 | print('Decompression complete')
36 |
37 | version = SourceFileLoader('crepe.version', os.path.join('crepe', 'version.py'))
38 | version = version.load_module()
39 |
40 | with open('README.md') as file:
41 | long_description = file.read()
42 |
43 | setup(
44 | name='crepe',
45 | version=version.version,
46 | description='CREPE pitch tracker',
47 | long_description=long_description,
48 | long_description_content_type='text/markdown',
49 | url='https://github.com/marl/crepe',
50 | author='Jong Wook Kim and Justin Salamon',
51 | author_email='jongwook@nyu.edu',
52 | packages=find_packages(),
53 | entry_points = {
54 | 'console_scripts': ['crepe=crepe.cli:main'],
55 | },
56 | license='MIT',
57 | classifiers=[
58 | 'Development Status :: 3 - Alpha',
59 | 'License :: OSI Approved :: MIT License',
60 | 'Topic :: Multimedia :: Sound/Audio :: Analysis',
61 | 'Programming Language :: Python :: 2',
62 | 'Programming Language :: Python :: 2.7',
63 | 'Programming Language :: Python :: 3',
64 | 'Programming Language :: Python :: 3.5',
65 | 'Programming Language :: Python :: 3.6',
66 | ],
67 |     keywords='pitch tracking audio music',
68 | project_urls={
69 | 'Source': 'https://github.com/marl/crepe',
70 | 'Tracker': 'https://github.com/marl/crepe/issues'
71 | },
72 | install_requires=[
73 | str(requirement)
74 | for requirement in pkg_resources.parse_requirements(
75 | open(os.path.join(os.path.dirname(__file__), "requirements.txt"))
76 | )
77 | ],
78 | package_data={
79 | 'crepe': weight_files
80 | },
81 | )
82 |
--------------------------------------------------------------------------------
/tests/sweep.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/marl/crepe/c9b71ce61491454125a0693f584f7244f29d9884/tests/sweep.wav
--------------------------------------------------------------------------------
/tests/test_sweep.py:
--------------------------------------------------------------------------------
1 | import os
2 | import numpy as np
3 | import crepe
4 |
5 | # this data contains a sine sweep
6 | file = os.path.join(os.path.dirname(__file__), 'sweep.wav')
7 | f0_file = os.path.join(os.path.dirname(__file__), 'sweep.f0.csv')
8 |
9 |
10 | def verify_f0():
11 | result = np.loadtxt(f0_file, delimiter=',', skiprows=1)
12 |
13 | # it should be confident enough about the presence of pitch in every frame
14 | assert np.mean(result[:, 2] > 0.5) > 0.98
15 |
16 | # the frequencies should be linear
17 |     assert np.corrcoef(result[:, 0], result[:, 1])[0, 1] > 0.99
18 |
19 | os.remove(f0_file)
20 |
21 |
22 | def test_sweep():
23 | crepe.process_file(file)
24 | verify_f0()
25 |
26 |
27 | def test_sweep_cli():
28 | assert os.system("crepe {}".format(file)) == 0
29 | verify_f0()
30 |
--------------------------------------------------------------------------------