├── .gitignore ├── LICENSE.txt ├── README.md ├── audio_please_unzip.zip ├── data └── info_file.csv ├── environments ├── requirements.txt └── umap_tut_env.yaml ├── example_imgs └── tool_image.png ├── functions ├── __init__.py ├── __pycache__ │ ├── __init__.cpython-37.pyc │ ├── audio_functions.cpython-37.pyc │ ├── custom_dist_functions_umap.cpython-37.pyc │ ├── evaluation_functions.cpython-37.pyc │ ├── plot_functions.cpython-37.pyc │ └── preprocessing_functions.cpython-37.pyc ├── audio_functions.py ├── custom_dist_functions_umap.py ├── evaluation_functions.py ├── plot_functions.py └── preprocessing_functions.py ├── notebooks ├── .ipynb_checkpoints │ ├── 01_generate_spectrograms-checkpoint.ipynb │ ├── 02a_generate_UMAP_basic-checkpoint.ipynb │ ├── 02b_generate_UMAP_timeshift-checkpoint.ipynb │ ├── 03_UMAP_clustering-checkpoint.ipynb │ ├── 03_UMAP_eval-checkpoint.ipynb │ ├── 03_UMAP_viz_part_1_prep-checkpoint.ipynb │ └── 03_UMAP_viz_part_2_tool-checkpoint.ipynb ├── 01_generate_spectrograms.ipynb ├── 02a_generate_UMAP_basic.ipynb ├── 02b_generate_UMAP_timeshift.ipynb ├── 03_UMAP_clustering.ipynb ├── 03_UMAP_eval.ipynb ├── 03_UMAP_viz_part_1_prep.ipynb └── 03_UMAP_viz_part_2_tool.ipynb ├── parameters ├── __pycache__ │ └── spec_params.cpython-37.pyc └── spec_params.py └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | /audio 2 | /data_all 3 | .ipynb_checkpoints 4 | /notebooks/.ipynb_checkpoints 5 | 6 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | This repository contains code and the exemplary audio data file "audio_please_unzip.zip". 2 | Separate licenses apply to code vs. the exemplary audio data file. 
3 | 4 | For all code, the following MIT-license applies: 5 | 6 | Copyright (c) 2022 Mara Thomas 7 | 8 | Permission is hereby granted, free of charge, to any person obtaining 9 | a copy of this software and associated documentation files (with the 10 | exception of the file "audio_please_unzip.zip"), to deal in the 11 | Software without restriction, including 12 | without limitation the rights to use, copy, modify, merge, publish, 13 | distribute, sublicense, and/or sell copies of the Software, and to 14 | permit persons to whom the Software is furnished to do so, subject to 15 | the following conditions: 16 | 17 | The above copyright notice and this permission notice shall be 18 | included in all copies or substantial portions of the Software. 19 | 20 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 21 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 22 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 23 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 24 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 25 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 26 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 27 | 28 | Only the exemplary audio data file "audio_please_unzip.zip" is exempt from the MIT-license. This file is under exclusive copyright, meaning that it is not allowed to copy, distribute, or modify it. 
39 | 40 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Tutorial for generating and evaluating latent-space representations of vocalizations using UMAP 2 | 3 | [![DOI](https://zenodo.org/badge/400540617.svg)](https://zenodo.org/badge/latestdoi/400540617) 4 | 5 | 6 | This tutorial contains a sequence of Jupyter notebooks that help you generate latent-space representations from input audio files, evaluate them, and build an interactive visualization. 7 | 8 |

9 | 10 |

11 | 12 | ## 1. Structure 13 | 14 | Keep the directory structure the way it is and put your data in the 'audio' and 'data' folders. Do not change the folder structure or the location of notebooks or function files! 15 | 16 | ├── notebooks <- contains analysis scripts 17 | │ ├── 01_generate_spectrograms.ipynb 18 | │ ├── ... 19 | │ └── ... 20 | ├── audio <- ! put your input soundfiles in this folder or unzip the provided example audio! 21 | │ ├── call_1.wav 22 | │ ├── call_2.wav 23 | │ └── ... 24 | ├── functions <- contains functions that will be called in analysis scripts 25 | │ ├── audio_functions.py 26 | │ ├── ... 27 | │ └── ... 28 | ├── data <- ! put a .csv metadata file of your input in this folder or use the provided example csv! 29 | │ └── info_file.csv 30 | ├── parameters 31 | │ └── spec_params.py <- this file contains parameters for spectrogramming (fft_win, fft_hop...) 32 | ├── environments 33 | │ └── umap_tut_env.yaml <- conda environment file (linux) 34 | ├── ... 35 | 36 | 37 | ## 2. Requirements 38 | 39 | ### 2.1. Packages, installations etc. 40 | 41 | Python >= 3.8 is recommended. I recommend __installing the packages manually__, but a conda environment file is also included in /environments (created on Linux; dependencies may differ on other operating systems!). 42 | 43 | For manual install, these are the core packages: 44 | 45 | >umap-learn 46 | 47 | >librosa 48 | 49 | >ipywidgets 50 | 51 | >pandas=1.2.4 52 | 53 | >seaborn 54 | 55 | >pysoundfile=0.10.3 56 | 57 | >voila 58 | 59 | >hdbscan 60 | 61 | >plotly 62 | 63 | >graphviz 64 | 65 | >networkx 66 | 67 | >pygraphviz 68 | 69 | 70 | Make sure to enable jupyter widgets with: 71 | >jupyter nbextension enable --py widgetsnbextension 72 | 73 | 74 | __NOTE__: Graphviz, networkx and pygraphviz are only required for one plot, so if you fail to install them, you can still run 99% of the code. 75 | 76 | 77 | #### This is an example for a manual installation on Windows with Python 3.8
and conda: 78 | 79 | If you haven't worked with Python and/or conda (a package manager), an easy way to get started is to install Anaconda or Miniconda (only the basic/core parts of Anaconda) first: 80 | 81 | - Anaconda: [https://www.anaconda.com/products/individual-d](https://www.anaconda.com/products/individual-d) 82 | 83 | - Miniconda: [https://docs.conda.io/en/latest/miniconda.html](https://docs.conda.io/en/latest/miniconda.html) 84 | 85 | After successful installation, create and activate your environment with conda: 86 | 87 | ``` 88 | conda create --name my_env 89 | conda activate my_env 90 | ``` 91 | 92 | Then, install the required core packages: 93 | 94 | ``` 95 | conda install -c conda-forge umap-learn 96 | conda install -c conda-forge librosa 97 | conda install ipywidgets 98 | conda install pandas=1.2.4 99 | conda install seaborn 100 | conda install -c conda-forge pysoundfile=0.10.3 101 | conda install -c conda-forge voila 102 | conda install -c anaconda graphviz 103 | conda install -c conda-forge hdbscan 104 | conda install -c plotly plotly 105 | conda install networkx 106 | conda install -c conda-forge pygraphviz 107 | ``` 108 | 109 | Finally, enable ipywidgets in Jupyter Notebook: 110 | 111 | ``` 112 | jupyter nbextension enable --py widgetsnbextension 113 | ``` 114 | 115 | Clone this repository or download it as a zip and unpack it. Make sure the subdirectory structure matches the one described in section "Structure" and prepare your input files as described in section "Input requirements". 116 | 117 | 118 | Start Jupyter Notebook with 119 | ``` 120 | jupyter notebook 121 | ``` 122 | 123 | and select the first notebook file to start your analysis (see section "Where to start"). 124 | 125 | 126 | ### 2.2. Input requirements 127 | 128 | #### 2.2.1. Audio files 129 | 130 | All audio input files need to be in a subfolder /audio. This folder should not contain any other files. 
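Before running the notebooks, it can be worth double-checking that /audio really contains nothing but sound files. A minimal sketch (the folder name and the .wav-only assumption mirror the description above; adjust the suffix check if your recordings use another format):

```python
from pathlib import Path

def check_audio_folder(audio_dir="audio"):
    """List the .wav files in audio_dir and warn about anything else.

    'audio' is the subfolder described above; pass a different path if
    your working directory is not the repository root.
    """
    audio_dir = Path(audio_dir)
    # separate sound files from anything that should not be there
    wavs = sorted(p.name for p in audio_dir.iterdir() if p.suffix.lower() == ".wav")
    others = sorted(p.name for p in audio_dir.iterdir() if p.suffix.lower() != ".wav")
    if others:
        print("Warning: non-wav files found in /audio:", others)
    return wavs
```

Running `check_audio_folder()` from the repository root should return only your call files.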
131 | 132 | To use the provided example data of meerkat calls, please unzip the file 'audio_please_unzip.zip' and verify that all audio files have been unpacked into an /audio folder according to the structure described in Section 1. 133 | 134 | To use your own data, create a subfolder "/audio" and put your sound files there (make sure that the /audio folder contains __only__ your input files, nothing else). Each sound file should contain a single vocalization or syllable. 135 | (You may have to detect and extract such vocal elements first, if working with acoustic recordings.) 136 | 137 | 138 | Ideally, start and end of the sound file correspond exactly to start and end of the vocalization. 139 | If there are delays in the onset of the vocalizations, these should be the same for all sound files. 140 | Otherwise, vocalizations may appear dissimilar or distant in latent space simply because their onset times are different. 141 | If it is not possible to mark the start times correctly, use the timeshift option to generate UMAP embeddings, 142 | but note that it comes at the cost of increased computation time. 143 | 144 | #### 2.2.2. [Optional: Info file] 145 | 146 | Use the provided info_file.csv file for the example audio data or, if you are using your own data, add a ";"-separated info_file.csv file with headers containing the filenames of the input audio, some labels and any other additional metadata (if available) in the subfolder "/data". 147 | If some or all labels are unknown, there should still be a label column and unknown labels should be marked with "unknown". 148 | 149 | Structure of info_file.csv must be: 150 | 151 | | filename | label | ... | .... 152 | ----------------------------------------- 153 | | call_1.wav | alarm | ... | .... 154 | | call_2.wav | contact | ... | .... 155 | | ... | ... | ... | ....
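To check that your metadata file matches this structure before starting the analysis, you can load it with pandas. A sketch (the default path and the required columns follow the description above):

```python
import pandas as pd

def load_info_file(path="data/info_file.csv"):
    """Load the ';'-separated metadata file and verify required columns."""
    info = pd.read_csv(path, sep=";")
    # 'filename' and 'label' are the two columns the tutorial relies on
    missing = {"filename", "label"} - set(info.columns)
    if missing:
        raise ValueError(f"info_file.csv is missing columns: {sorted(missing)}")
    return info
```

Remember that unknown labels should appear as the literal string "unknown" in the label column.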
156 | 157 | If you don't provide an info_file.csv, a default one will be generated, containing ALL files that are found in /audio and with all vocalizations labelled as "unknown". 158 | 159 | 160 | ## 3. Where to start 161 | 162 | 1. Start with 01_generate_spectrograms.ipynb to generate spectrograms from input audio files. 163 | 2. Generate latent space representations with 02a_generate_UMAP_basic.ipynb OR 02b_generate_UMAP_timeshift.ipynb. 164 | 165 | 3. You can now 166 | - __Evaluate__ the latent space representation with 03_UMAP_eval.ipynb, 167 | 168 | - __Visualize__ the latent space representation by running 03_UMAP_viz_part_1_prep.ipynb and 03_UMAP_viz_part_2_tool.ipynb or 169 | 170 | - __Apply clustering__ on the latent space representation with 03_UMAP_clustering.ipynb 171 | 172 | 173 | ## 4. Data accessibility 174 | 175 | All code is under the MIT license. Exclusive copyright applies to the audio data file (audio_please_unzip.zip), meaning that you cannot reproduce, distribute or create derivative works from it. You may use this data to test the provided code, but not for any other purposes. If you are interested in using the exemplary data beyond the sole purpose of testing the provided code, please get in touch with Prof. Marta Manser. See the license for details. 
176 | -------------------------------------------------------------------------------- /audio_please_unzip.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marathomas/tutorial_repo/5123e19118f51c81e3c933b19fe264a0e7744798/audio_please_unzip.zip -------------------------------------------------------------------------------- /environments/requirements.txt: -------------------------------------------------------------------------------- 1 | # This file may be used to create an environment using: 2 | # $ conda create --name --file 3 | # platform: linux-64 4 | _libgcc_mutex=0.1=main 5 | _openmp_mutex=4.5=1_gnu 6 | anyio=3.1.0=py39hf3d152e_0 7 | appdirs=1.4.4=pyh9f0ad1d_0 8 | argon2-cffi=20.1.0=py39h27cfd23_1 9 | async_generator=1.10=py_0 10 | attrs=20.2.0=py_0 11 | audioread=2.1.9=py39hf3d152e_0 12 | backcall=0.2.0=py_0 13 | blas=1.0=mkl 14 | bleach=3.2.1=py_0 15 | brotlipy=0.7.0=py39h3811e60_1001 16 | bzip2=1.0.8=h7f98852_4 17 | ca-certificates=2021.5.30=ha878542_0 18 | certifi=2021.5.30=py39hf3d152e_0 19 | cffi=1.14.5=py39he32792d_0 20 | chardet=4.0.0=py39hf3d152e_1 21 | cryptography=3.4.7=py39hbca0aa6_0 22 | cycler=0.10.0=py39h06a4308_0 23 | dbus=1.13.18=hb2f20db_0 24 | decorator=5.0.9=pyhd8ed1ab_0 25 | defusedxml=0.6.0=py_0 26 | entrypoints=0.3=py39h06a4308_0 27 | expat=2.4.1=h2531618_2 28 | ffmpeg=4.3.1=hca11adc_2 29 | fontconfig=2.13.1=h6c09931_0 30 | freetype=2.10.4=h5ab3b9f_0 31 | gettext=0.19.8.1=h0b5b191_1005 32 | glib=2.68.2=h36276a3_0 33 | gmp=6.2.1=h58526e2_0 34 | gnutls=3.6.13=h85f3911_1 35 | gst-plugins-base=1.14.0=h8213a91_2 36 | gstreamer=1.14.0=h28cd5cc_2 37 | icu=58.2=he6710b0_3 38 | idna=2.10=pyh9f0ad1d_0 39 | importlib-metadata=2.0.0=py_1 40 | importlib_metadata=2.0.0=1 41 | intel-openmp=2021.2.0=h06a4308_610 42 | ipykernel=5.3.4=py39hb070fc8_0 43 | ipython=7.22.0=py39hb070fc8_0 44 | ipython_genutils=0.2.0=pyhd3eb1b0_1 45 | ipywidgets=7.5.1=py_1 46 | jedi=0.17.2=py39h06a4308_1 
47 | jinja2=2.11.2=py_0 48 | joblib=1.0.1=pyhd8ed1ab_0 49 | jpeg=9b=h024ee3a_2 50 | jsonschema=3.2.0=py_2 51 | jupyter_client=6.1.7=py_0 52 | jupyter_core=4.7.1=py39h06a4308_0 53 | jupyter_server=1.8.0=pyhd8ed1ab_0 54 | jupyterlab_pygments=0.1.2=pyh9f0ad1d_0 55 | kiwisolver=1.3.1=py39h2531618_0 56 | lame=3.100=h7f98852_1001 57 | lcms2=2.12=h3be6417_0 58 | ld_impl_linux-64=2.35.1=h7274673_9 59 | libffi=3.3=he6710b0_2 60 | libflac=1.3.3=h9c3ff4c_1 61 | libgcc-ng=9.3.0=h5101ec6_17 62 | libgfortran-ng=7.5.0=ha8ba4b0_17 63 | libgfortran4=7.5.0=ha8ba4b0_17 64 | libgomp=9.3.0=h5101ec6_17 65 | libllvm10=10.0.1=he513fc3_3 66 | libogg=1.3.4=h7f98852_1 67 | libopus=1.3.1=h7f98852_1 68 | libpng=1.6.37=hbc83047_0 69 | librosa=0.8.1=pyhd8ed1ab_0 70 | libsndfile=1.0.31=h9c3ff4c_1 71 | libsodium=1.0.18=h7b6447c_0 72 | libstdcxx-ng=9.3.0=hd4cf53a_17 73 | libtiff=4.2.0=h85742a9_0 74 | libuuid=1.0.3=h1bed415_2 75 | libvorbis=1.3.7=h9c3ff4c_0 76 | libwebp-base=1.2.0=h27cfd23_0 77 | libxcb=1.14=h7b6447c_0 78 | libxml2=2.9.10=hb55368b_3 79 | llvmlite=0.36.0=py39h1bbdace_0 80 | lz4-c=1.9.3=h2531618_0 81 | markupsafe=2.0.1=py39h27cfd23_0 82 | matplotlib=3.3.4=py39h06a4308_0 83 | matplotlib-base=3.3.4=py39h62a2d02_0 84 | mistune=0.8.4=py39h27cfd23_1000 85 | mkl=2021.2.0=h06a4308_296 86 | mkl-service=2.3.0=py39h27cfd23_1 87 | mkl_fft=1.3.0=py39h42c9631_2 88 | mkl_random=1.2.1=py39ha9443f7_2 89 | nbclient=0.5.3=pyhd8ed1ab_0 90 | nbconvert=6.0.7=py39hf3d152e_3 91 | nbformat=5.0.8=py_0 92 | ncurses=6.2=he6710b0_1 93 | nest-asyncio=1.5.1=pyhd8ed1ab_0 94 | nettle=3.6=he412f7d_0 95 | notebook=6.4.0=py39h06a4308_0 96 | numba=0.53.1=py39h56b8d98_1 97 | numpy=1.20.2=py39h2d18471_0 98 | numpy-base=1.20.2=py39hfae3a4d_0 99 | olefile=0.46=py_0 100 | openh264=2.1.1=h780b84a_0 101 | openssl=1.1.1k=h7f98852_0 102 | packaging=20.9=pyh44b312d_0 103 | pandas=1.2.4=py39h2531618_0 104 | pandoc=2.11=hb0f4dca_0 105 | pandocfilters=1.4.3=py39h06a4308_1 106 | parso=0.7.0=py_0 107 | pcre=8.44=he6710b0_0 108 | 
pexpect=4.8.0=pyhd3eb1b0_3 109 | pickleshare=0.7.5=pyhd3eb1b0_1003 110 | pillow=8.2.0=py39he98fc37_0 111 | pip=21.1.2=py39h06a4308_0 112 | plotly=4.14.3=py_0 113 | pooch=1.4.0=pyhd8ed1ab_0 114 | prometheus_client=0.8.0=py_0 115 | prompt-toolkit=3.0.8=py_0 116 | ptyprocess=0.7.0=pyhd3eb1b0_2 117 | pycparser=2.20=pyh9f0ad1d_2 118 | pygments=2.7.1=py_0 119 | pynndescent=0.5.2=pyh44b312d_0 120 | pyopenssl=20.0.1=pyhd8ed1ab_0 121 | pyparsing=2.4.7=pyhd3eb1b0_0 122 | pyqt=5.9.2=py39h2531618_6 123 | pyrsistent=0.17.3=py39h27cfd23_0 124 | pysocks=1.7.1=py39hf3d152e_3 125 | pysoundfile=0.10.3.post1=pyhd3deb0d_0 126 | python=3.9.5=h12debd9_4 127 | python-dateutil=2.8.1=pyhd3eb1b0_0 128 | python_abi=3.9=1_cp39 129 | pytz=2021.1=pyhd3eb1b0_0 130 | pyzmq=20.0.0=py39h2531618_1 131 | qt=5.9.7=h5867ecd_1 132 | readline=8.1=h27cfd23_0 133 | requests=2.25.1=pyhd3deb0d_0 134 | resampy=0.2.2=py_0 135 | retrying=1.3.3=py_2 136 | scikit-learn=0.24.2=py39ha9443f7_0 137 | scipy=1.6.2=py39had2a1c9_1 138 | seaborn=0.11.1=pyhd3eb1b0_0 139 | send2trash=1.5.0=pyhd3eb1b0_1 140 | setuptools=52.0.0=py39h06a4308_0 141 | sip=4.19.13=py39h2531618_0 142 | six=1.16.0=pyhd3eb1b0_0 143 | sniffio=1.2.0=py39hf3d152e_1 144 | sqlite=3.35.4=hdfb4753_0 145 | tbb=2020.2=h4bd325d_4 146 | terminado=0.9.4=py39h06a4308_0 147 | testpath=0.4.4=py_0 148 | threadpoolctl=2.1.0=pyh5ca1d4c_0 149 | tk=8.6.10=hbc83047_0 150 | tornado=6.1=py39h27cfd23_0 151 | traitlets=5.0.5=py_0 152 | tzdata=2020f=h52ac0ba_0 153 | umap-learn=0.5.1=py39hf3d152e_1 154 | urllib3=1.26.5=pyhd8ed1ab_0 155 | voila=0.2.10=pyhd8ed1ab_0 156 | wcwidth=0.2.5=py_0 157 | webencodings=0.5.1=py39h06a4308_1 158 | websocket-client=0.57.0=py39hf3d152e_4 159 | wheel=0.36.2=pyhd3eb1b0_0 160 | widgetsnbextension=3.5.1=py39h06a4308_0 161 | x264=1!161.3030=h7f98852_1 162 | xz=5.2.5=h7b6447c_0 163 | zeromq=4.3.3=he6710b0_3 164 | zipp=3.3.1=py_0 165 | zlib=1.2.11=h7b6447c_3 166 | zstd=1.4.9=haebb681_0 167 | 
-------------------------------------------------------------------------------- /environments/umap_tut_env.yaml: -------------------------------------------------------------------------------- 1 | name: umap_tut_env 2 | channels: 3 | - conda-forge 4 | - defaults 5 | dependencies: 6 | - _libgcc_mutex=0.1 7 | - _openmp_mutex=4.5 8 | - anyio=2.2.0 9 | - appdirs=1.4.4 10 | - argon2-cffi=20.1.0 11 | - async_generator=1.10 12 | - attrs=20.3.0 13 | - audioread=2.1.9 14 | - babel=2.9.0 15 | - backcall=0.2.0 16 | - blas=1.0 17 | - bleach=3.3.0 18 | - brotlipy=0.7.0 19 | - bzip2=1.0.8 20 | - ca-certificates=2021.5.30 21 | - cachecontrol=0.12.6 22 | - cairo=1.14.12 23 | - certifi=2021.5.30 24 | - cffi=1.14.5 25 | - chardet=4.0.0 26 | - cryptography=3.4.7 27 | - cycler=0.10.0 28 | - cython=0.29.24 29 | - dbus=1.13.18 30 | - decorator=5.0.7 31 | - defusedxml=0.7.1 32 | - entrypoints=0.3 33 | - expat=2.3.0 34 | - ffmpeg=4.3.1 35 | - fontconfig=2.13.1 36 | - freetype=2.10.4 37 | - fribidi=1.0.10 38 | - gettext=0.19.8.1 39 | - glib=2.56.2 40 | - gmp=6.2.1 41 | - gnutls=3.6.13 42 | - graphite2=1.3.14 43 | - graphviz=2.40.1 44 | - gst-plugins-base=1.14.0 45 | - gstreamer=1.14.0 46 | - harfbuzz=1.8.8 47 | - hdbscan=0.8.27 48 | - hdmedians=0.14.2 49 | - icu=58.2 50 | - idna=2.10 51 | - importlib-metadata=3.10.0 52 | - importlib_metadata=3.10.0 53 | - iniconfig=1.1.1 54 | - intel-openmp=2021.2.0 55 | - ipykernel=5.3.4 56 | - ipython=7.22.0 57 | - ipython_genutils=0.2.0 58 | - ipywidgets=7.6.3 59 | - jedi=0.17.0 60 | - jinja2=2.11.3 61 | - joblib=1.0.1 62 | - jpeg=9d 63 | - json5=0.9.5 64 | - jsonschema=3.2.0 65 | - jupyter-packaging=0.7.12 66 | - jupyter_client=6.1.12 67 | - jupyter_core=4.7.1 68 | - jupyter_server=1.4.1 69 | - jupyterlab=3.0.14 70 | - jupyterlab_pygments=0.1.2 71 | - jupyterlab_server=2.4.0 72 | - jupyterlab_widgets=1.0.0 73 | - kiwisolver=1.3.1 74 | - lame=3.100 75 | - lcms2=2.12 76 | - ld_impl_linux-64=2.33.1 77 | - libffi=3.3 78 | - libflac=1.3.3 79 | - 
libgcc=7.2.0 80 | - libgcc-ng=9.3.0 81 | - libgfortran-ng=7.5.0 82 | - libgfortran4=7.5.0 83 | - libgomp=9.3.0 84 | - libllvm10=10.0.1 85 | - libogg=1.3.4 86 | - libopus=1.3.1 87 | - libpng=1.6.37 88 | - librosa=0.8.0 89 | - libsndfile=1.0.31 90 | - libsodium=1.0.18 91 | - libstdcxx-ng=9.3.0 92 | - libtiff=4.2.0 93 | - libuuid=1.0.3 94 | - libvorbis=1.3.7 95 | - libwebp-base=1.2.0 96 | - libxcb=1.14 97 | - libxml2=2.9.10 98 | - llvmlite=0.36.0 99 | - lockfile=0.12.2 100 | - lz4-c=1.9.3 101 | - markupsafe=1.1.1 102 | - matplotlib=3.3.4 103 | - matplotlib-base=3.3.4 104 | - mistune=0.8.4 105 | - mkl=2021.2.0 106 | - mkl-service=2.3.0 107 | - mkl_fft=1.3.0 108 | - mkl_random=1.2.1 109 | - more-itertools=8.8.0 110 | - msgpack-python=1.0.2 111 | - natsort=7.1.1 112 | - nbclassic=0.2.6 113 | - nbclient=0.5.3 114 | - nbconvert=6.0.7 115 | - nbformat=5.1.3 116 | - ncurses=6.2 117 | - nest-asyncio=1.5.1 118 | - nettle=3.6 119 | - networkx=2.5 120 | - nodejs=6.11.2 121 | - notebook=6.3.0 122 | - numba=0.53.1 123 | - numpy=1.20.1 124 | - numpy-base=1.20.1 125 | - olefile=0.46 126 | - openh264=2.1.1 127 | - openjpeg=2.4.0 128 | - openssl=1.1.1k 129 | - packaging=20.9 130 | - pandas=1.2.4 131 | - pandoc=2.12 132 | - pandocfilters=1.4.3 133 | - pango=1.42.4 134 | - parso=0.8.2 135 | - patsy=0.5.1 136 | - pcre=8.44 137 | - pexpect=4.8.0 138 | - pickleshare=0.7.5 139 | - pillow=8.1.2 140 | - pip=21.0.1 141 | - pixman=0.40.0 142 | - plotly=4.14.3 143 | - pluggy=0.13.1 144 | - pooch=1.3.0 145 | - prometheus_client=0.10.1 146 | - prompt-toolkit=3.0.17 147 | - ptyprocess=0.7.0 148 | - py=1.10.0 149 | - pycparser=2.20 150 | - pygments=2.8.1 151 | - pygraphviz=1.3 152 | - pynndescent=0.5.2 153 | - pyopenssl=20.0.1 154 | - pyparsing=2.4.7 155 | - pyqt=5.9.2 156 | - pyrsistent=0.17.3 157 | - pysocks=1.7.1 158 | - pysoundfile=0.10.3.post1 159 | - pytest=6.2.4 160 | - python=3.7.10 161 | - python-dateutil=2.8.1 162 | - python_abi=3.7 163 | - pytz=2021.1 164 | - pyzmq=20.0.0 165 | - qt=5.9.7 
166 | - readline=8.1 167 | - requests=2.25.1 168 | - resampy=0.2.2 169 | - retrying=1.3.3 170 | - scikit-bio=0.5.6 171 | - scikit-learn=0.24.1 172 | - scipy=1.6.2 173 | - seaborn=0.11.1 174 | - send2trash=1.5.0 175 | - setuptools=52.0.0 176 | - sip=4.19.8 177 | - six=1.15.0 178 | - sniffio=1.2.0 179 | - sqlite=3.35.4 180 | - statsmodels=0.12.2 181 | - tbb=2020.2 182 | - terminado=0.9.4 183 | - testpath=0.4.4 184 | - threadpoolctl=2.1.0 185 | - tk=8.6.10 186 | - toml=0.10.2 187 | - tornado=6.1 188 | - traitlets=5.0.5 189 | - typing_extensions=3.7.4.3 190 | - umap-learn=0.5.1 191 | - urllib3=1.26.4 192 | - voila=0.2.10 193 | - wcwidth=0.2.5 194 | - webencodings=0.5.1 195 | - wheel=0.36.2 196 | - widgetsnbextension=3.5.1 197 | - x264=1!161.3030 198 | - xz=5.2.5 199 | - zeromq=4.3.4 200 | - zipp=3.4.1 201 | - zlib=1.2.11 202 | - zstd=1.4.9 203 | - pip: 204 | - audeer==1.14.0 205 | - audiofile==0.4.2 206 | - pathlib2==2.3.5 207 | - sox==1.4.1 208 | - tqdm==4.60.0 209 | prefix: /home/mthomas/anaconda3/envs/umap_tut_env 210 | -------------------------------------------------------------------------------- /example_imgs/tool_image.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marathomas/tutorial_repo/5123e19118f51c81e3c933b19fe264a0e7744798/example_imgs/tool_image.png -------------------------------------------------------------------------------- /functions/__init__.py: -------------------------------------------------------------------------------- 1 | # init 2 | -------------------------------------------------------------------------------- /functions/__pycache__/__init__.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marathomas/tutorial_repo/5123e19118f51c81e3c933b19fe264a0e7744798/functions/__pycache__/__init__.cpython-37.pyc -------------------------------------------------------------------------------- 
/functions/__pycache__/audio_functions.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marathomas/tutorial_repo/5123e19118f51c81e3c933b19fe264a0e7744798/functions/__pycache__/audio_functions.cpython-37.pyc -------------------------------------------------------------------------------- /functions/__pycache__/custom_dist_functions_umap.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marathomas/tutorial_repo/5123e19118f51c81e3c933b19fe264a0e7744798/functions/__pycache__/custom_dist_functions_umap.cpython-37.pyc -------------------------------------------------------------------------------- /functions/__pycache__/evaluation_functions.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marathomas/tutorial_repo/5123e19118f51c81e3c933b19fe264a0e7744798/functions/__pycache__/evaluation_functions.cpython-37.pyc -------------------------------------------------------------------------------- /functions/__pycache__/plot_functions.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marathomas/tutorial_repo/5123e19118f51c81e3c933b19fe264a0e7744798/functions/__pycache__/plot_functions.cpython-37.pyc -------------------------------------------------------------------------------- /functions/__pycache__/preprocessing_functions.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marathomas/tutorial_repo/5123e19118f51c81e3c933b19fe264a0e7744798/functions/__pycache__/preprocessing_functions.cpython-37.pyc -------------------------------------------------------------------------------- /functions/audio_functions.py: 
-------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import soundfile as sf 4 | import io 5 | import librosa 6 | from scipy.signal import butter, lfilter 7 | 8 | 9 | def generate_mel_spectrogram(data, rate, n_mels, window, fft_win , fft_hop, fmax, fmin=0): 10 | 11 | """ 12 | Function that generates mel spectrogram from audio data using librosa functions 13 | 14 | Parameters 15 | ---------- 16 | data: 1D numpy array (float) 17 | Audio data 18 | rate: numeric(integer) 19 | samplerate in Hz 20 | n_mels: numeric (integer) 21 | number of mel bands 22 | window: string 23 | spectrogram window generation type ('hann'...) 24 | fft_win: numeric (float) 25 | window length in s 26 | fft_hop: numeric (float) 27 | hop between window start in s 28 | 29 | Returns 30 | ------- 31 | result : 2D np.array 32 | Mel-transformed spectrogram, dB scale 33 | 34 | Example 35 | ------- 36 | >>> 37 | 38 | """ 39 | spectro = np.nan 40 | 41 | try: 42 | n_fft = int(fft_win * rate) 43 | hop_length = int(fft_hop * rate) 44 | 45 | s = librosa.feature.melspectrogram(y = data , 46 | sr = rate, 47 | n_mels = n_mels , 48 | fmax = fmax, 49 | fmin = fmin, 50 | n_fft = n_fft, 51 | hop_length = hop_length, 52 | window = window, 53 | win_length = n_fft) 54 | 55 | spectro = librosa.power_to_db(s, ref=np.max) 56 | except: 57 | print("Failed to generate spectrogram.") 58 | 59 | return spectro 60 | 61 | 62 | def generate_stretched_mel_spectrogram(data, sr, duration, n_mels, window, fft_win , fft_hop, MAX_DURATION): 63 | """ 64 | Function that generates stretched mel spectrogram from audio data using librosa functions 65 | 66 | Parameters 67 | ---------- 68 | data: 1D numpy array (float) 69 | Audio data 70 | sr: numeric(integer) 71 | samplerate in Hz 72 | duration: numeric (float) 73 | duration of audio in seconds 74 | n_mels: numeric (integer) 75 | number of mel bands 76 | window: string 77 | spectrogram window generation type ('hann'...) 
78 | fft_win: numeric (float) 79 | window length in s 80 | fft_hop: numeric (float) 81 | hop between window start in s 82 | 83 | Returns 84 | ------- 85 | result : 2D np.array 86 | stretched, mel-transformed spectrogram, dB scale 87 | ------- 88 | >>> 89 | 90 | """ 91 | n_fft = int(fft_win * sr) 92 | hop_length = int(fft_hop * sr) 93 | stretch_rate = duration/MAX_DURATION 94 | 95 | # generate normal spectrogram (NOT mel transformed) 96 | D = librosa.stft(y=data, 97 | n_fft = n_fft, 98 | hop_length = hop_length, 99 | window=window, 100 | win_length = n_fft 101 | ) 102 | 103 | # Stretch spectrogram using phase vocoder algorithm 104 | D_stretched = librosa.core.phase_vocoder(D, stretch_rate, hop_length=hop_length) 105 | D_stretched = np.abs(D_stretched)**2 106 | 107 | # mel transform 108 | spectro = librosa.feature.melspectrogram(S=D_stretched, 109 | sr=sr, 110 | n_mels=n_mels, 111 | fmax=4000) 112 | 113 | # Convert to db scale 114 | s = librosa.power_to_db(spectro, ref=np.max) 115 | 116 | return s 117 | 118 | def read_wavfile(filename, channel=0): 119 | """ 120 | Function that reads audio data and sr from audiofile 121 | If audio is stereo, channel 0 is selected by default. 
122 | 123 | Parameters 124 | ---------- 125 | filename: String 126 | path to wav file 127 | 128 | channel: Integer (0 or 1) 129 | which channel is selected for stereo files 130 | default is 0 131 | 132 | Returns 133 | ------- 134 | data : 1D np.array 135 | Raw audio data (Amplitude) 136 | 137 | sr: numeric (Integer) 138 | Samplerate (in Hz) 139 | """ 140 | data = np.nan 141 | sr = np.nan 142 | 143 | if os.path.exists(filename): 144 | try: 145 | data, sr = sf.read(filename) 146 | if data.ndim>1: 147 | data = data[:,channel] 148 | except: 149 | print("Couldn't read: ", filename) 150 | else: 151 | print("No such file or directory: ", filename) 152 | 153 | 154 | return data, sr 155 | 156 | 157 | 158 | # Butter bandpass filter implementation: 159 | # from https://scipy-cookbook.readthedocs.io/items/ButterworthBandpass.html 160 | 161 | def butter_bandpass_filter(data, lowcut, highcut, sr, order=5): 162 | """ 163 | Function that applies a butter bandpass filter on audio data 164 | and returns the filtered audio 165 | 166 | Parameters 167 | ---------- 168 | data: 1D np.array 169 | audio data (amplitude) 170 | 171 | lowcut: Numeric 172 | lower bound for bandpass filter 173 | 174 | highcut: Numeric 175 | upper bound for bandpass filter 176 | 177 | sr: Numeric 178 | samplerate in Hz 179 | 180 | order: Numeric 181 | order of the filter 182 | 183 | Returns 184 | ------- 185 | filtered_data : 1D np.array 186 | filtered audio data 187 | """ 188 | 189 | nyq = 0.5 * sr 190 | low = lowcut / nyq 191 | high = highcut / nyq 192 | b, a = butter(order, [low, high], btype='band') 193 | 194 | filtered_data = lfilter(b, a, data) 195 | return filtered_data 196 | 197 | 198 | -------------------------------------------------------------------------------- /functions/custom_dist_functions_umap.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[ ]: 5 | 6 | 7 | # -*- coding: utf-8 -*- 8 | """ 9 | Created 
on Tue May 4 17:39:59 2021 10 | 11 | Collection of custom distance functions for UMAP 12 | 13 | @author: marathomas 14 | """ 15 | 16 | import numpy as np 17 | import numba 18 | from numba import jit 19 | 20 | MIN_OVERLAP = 0.9 21 | 22 | 23 | @numba.njit() 24 | def unpack_specs(a,b): 25 | """ 26 | Function that unpacks two specs that have been transformed into 27 | a 1D array with preprocessing_functions.pad_transform_spec and 28 | restores their original 2D shape 29 | 30 | Parameters 31 | ---------- 32 | a,b : 1D numpy arrays (numeric) 33 | 34 | Returns 35 | ------- 36 | spec_s, spec_l : 2D numpy arrays (numeric) 37 | the restored specs 38 | Example 39 | ------- 40 | >>> 41 | 42 | """ 43 | 44 | a_shape0 = int(a[0]) 45 | a_shape1 = int(a[1]) 46 | b_shape0 = int(b[0]) 47 | b_shape1 = int(b[1]) 48 | 49 | spec_a= np.reshape(a[2:(a_shape0*a_shape1)+2], (a_shape0, a_shape1)) 50 | spec_b= np.reshape(b[2:(b_shape0*b_shape1)+2], (b_shape0, b_shape1)) 51 | 52 | len_a = a_shape1 53 | len_b = b_shape1 54 | 55 | # find bigger spec 56 | spec_s = spec_a 57 | spec_l = spec_b 58 | 59 | if len_a>len_b: 60 | spec_s = spec_b 61 | spec_l = spec_a 62 | 63 | return spec_s, spec_l 64 | 65 | 66 | @numba.njit() 67 | def calc_timeshift_pad(a,b): 68 | """ 69 | Custom numba-compatible distance function for UMAP. 70 | Calculates distance between two spectrograms a,b 71 | by shifting the shorter spectrogram along the longer 72 | one and finding the minimum distance overlap (according to 73 | spec_dist). Non-overlapping sections of the shorter spec are 74 | zero-padded to match the longer spec when calculating the distance. 75 | Uses global variable OVERLAP to constrain shifting to have 76 | OVERLAP*100 % of overlap between specs. 
77 | 78 | Parameters 79 | ---------- 80 | a,b : 1D numpy arrays (numeric) 81 | pad_transformed spectrograms 82 | (with preprocessing_functions.pad_transform_spec) 83 | 84 | Returns 85 | ------- 86 | dist : numeric (float64) 87 | distance between spectrograms a,b 88 | 89 | Example 90 | ------- 91 | >>> 92 | 93 | """ 94 | 95 | spec_s, spec_l = unpack_specs(a,b) 96 | 97 | len_s = spec_s.shape[1] 98 | len_l = spec_l.shape[1] 99 | 100 | nfreq = spec_s.shape[0] 101 | 102 | # define start position 103 | min_overlap_frames = int(MIN_OVERLAP * len_s) 104 | start_timeline = min_overlap_frames-len_s 105 | max_timeline = len_l - min_overlap_frames 106 | 107 | n_of_calculations = int((((max_timeline+1-start_timeline)+(max_timeline+1-start_timeline))/2) +1) 108 | 109 | distances = np.full((n_of_calculations),999.) 110 | 111 | count=0 112 | 113 | for timeline_p in range(start_timeline, max_timeline+1,2): 114 | #print("timeline: ", timeline_p) 115 | # mismatch on left side 116 | if timeline_p < 0: 117 | 118 | len_overlap = len_s - abs(timeline_p) 119 | 120 | pad_s = np.full((nfreq, (len_l-len_overlap)),0.) 121 | pad_l = np.full((nfreq, (len_s-len_overlap)),0.) 122 | 123 | s_config = np.append(spec_s, pad_s, axis=1).astype(np.float64) 124 | l_config = np.append(pad_l, spec_l, axis=1).astype(np.float64) 125 | 126 | # mismatch on right side 127 | elif timeline_p > (len_l-len_s): 128 | 129 | len_overlap = len_l - timeline_p 130 | 131 | pad_s = np.full((nfreq, (len_l-len_overlap)),0.) 132 | pad_l = np.full((nfreq, (len_s-len_overlap)),0.) 133 | 134 | s_config = np.append(pad_s, spec_s, axis=1).astype(np.float64) 135 | l_config = np.append(spec_l, pad_l, axis=1).astype(np.float64) 136 | 137 | # no mismatch on either side 138 | else: 139 | len_overlap = len_s 140 | start_col_l = timeline_p 141 | end_col_l = start_col_l + len_overlap 142 | 143 | pad_s_left = np.full((nfreq, start_col_l),0.) 144 | pad_s_right = np.full((nfreq, (len_l - end_col_l)),0.) 
145 | 146 | l_config = spec_l.astype(np.float64) 147 | s_config = np.append(pad_s_left, spec_s, axis=1).astype(np.float64) 148 | s_config = np.append(s_config, pad_s_right, axis=1).astype(np.float64) 149 | 150 | size = s_config.shape[0]*s_config.shape[1] 151 | distances[count] = spec_dist(s_config, l_config, size) 152 | count = count + 1 153 | 154 | 155 | min_dist = np.min(distances) 156 | return min_dist 157 | 158 | -------------------------------------------------------------------------------- /functions/evaluation_functions.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[4]: 5 | 6 | 7 | # -*- coding: utf-8 -*- 8 | """ 9 | Created on Tue May 4 17:39:59 2021 10 | 11 | Collection of custom evaluation functions for embedding 12 | 13 | @author: marathomas 14 | """ 15 | 16 | import numpy as np 17 | import pandas as pd 18 | from sklearn.neighbors import NearestNeighbors 19 | from sklearn.metrics import silhouette_samples, silhouette_score 20 | import seaborn as sns 21 | import matplotlib.pyplot as plt 22 | import string 23 | from scipy.spatial.distance import pdist, squareform 24 | import sklearn 25 | from sklearn.metrics.pairwise import euclidean_distances 26 | 27 | 28 | def make_nn_stats_dict(calltypes, labels, nb_indices): 29 | """ 30 | Function that evaluates the labels of the k nearest neighbors of 31 | all datapoints in a dataset. 32 | 33 | Parameters 34 | ---------- 35 | calltypes : 1D numpy array (string) or list of strings 36 | set of class labels 37 | labels: 1D numpy array (string) or list of strings 38 | vector/list of class labels in dataset 39 | nb_indices: 2D numpy array (numeric integer) 40 | Array I(X,k) containing the indices of the k 41 | nearest neighbors for each datapoint X of a 42 | dataset 43 | 44 | Returns 45 | ------- 46 | nn_stats_dict : dictionary[] = 2D numpy array (numeric) 47 | dictionary that contains one array for each type of label.
48 | Given a label L, nn_stats_dict[L] contains an array A(X,Y), 49 | where Y is the number of class labels in the dataset and each 50 | row X represents a datapoint of label L in the dataset. 51 | A[i,j] is the number of nearest neighbors of datapoint i that 52 | are of label calltypes[j]. 53 | 54 | Example 55 | ------- 56 | >>> 57 | 58 | """ 59 | nn_stats_dict = {} 60 | 61 | for calltype in calltypes: 62 | # which datapoints in the dataset are of this specific calltype? 63 | # -> get their indices 64 | call_indices = np.asarray(np.where(labels==calltype))[0] 65 | 66 | # initialize array that can save the class labels of the k nearest 67 | # neighbors of all these datapoints 68 | calltype_counts = np.zeros((call_indices.shape[0],len(calltypes))) 69 | 70 | # for each datapoint 71 | for i,ind in enumerate(call_indices): 72 | # what are the indices of its k nearest neighbors 73 | nearest_neighbors = nb_indices[ind] 74 | # for each of these neighbors 75 | for neighbor in nearest_neighbors: 76 | # what is their label 77 | neighbor_label = labels[neighbor] 78 | # put a +1 in the array 79 | calltype_counts[i,np.where(np.asarray(calltypes)==neighbor_label)[0][0]] += 1 80 | 81 | # save the resulting array in dictionary 82 | # (1 array per calltype) 83 | nn_stats_dict[calltype] = calltype_counts 84 | 85 | return nn_stats_dict 86 | 87 | def get_knn(k,embedding): 88 | """ 89 | Function that finds the k nearest neighbors (based on 90 | euclidean distance) for each datapoint in a multidimensional 91 | dataset 92 | 93 | Parameters 94 | ---------- 95 | k : integer 96 | number of nearest neighbors 97 | embedding: 2D numpy array (numeric) 98 | a dataset E(X,Y) with X datapoints and Y dimensions 99 | 100 | Returns 101 | ------- 102 | indices: 2D numpy array (numeric) 103 | Array I(X,k) containing the indices of the k 104 | nearest neighbors for each datapoint X of the input 105 | dataset 106 | 107 | distances: 2D numpy array (numeric) 108 | Array D(X,k) containing the euclidean
distance to each 109 | of the k nearest neighbors for each datapoint X of the 110 | input dataset. D[i,j] is the euclidean distance of datapoint 111 | embedding[i,:] to its jth neighbor. 112 | 113 | Example 114 | ------- 115 | >>> 116 | 117 | """ 118 | 119 | # Find k nearest neighbors 120 | nbrs = NearestNeighbors(metric='euclidean',n_neighbors=k+1, algorithm='brute').fit(embedding) 121 | distances, indices = nbrs.kneighbors(embedding) 122 | 123 | # need to remove the first neighbor, because that is the datapoint itself 124 | indices = indices[:,1:] 125 | distances = distances[:,1:] 126 | 127 | return indices, distances 128 | 129 | 130 | def make_statstabs(nn_stats_dict, calltypes, labels,k): 131 | """ 132 | Function that generates two summary tables containing 133 | the frequency of different class labels among the k nearest 134 | neighbors of datapoints belonging to a class. 135 | 136 | Parameters 137 | ---------- 138 | nn_stats_dict : dictionary[] = 2D numpy array (numeric) 139 | dictionary that contains one array for each type of label. 140 | Given a label L, nn_stats_dict[L] contains an array A(X,Y), 141 | where Y is the number of class labels in the dataset and each 142 | row X represents a datapoint of label L in the dataset. 143 | A[i,j] is the number of nearest neighbors of datapoint i that 144 | are of label calltypes[j]. 145 | (as returned from evaluation_functions.make_nn_stats_dict) 146 | calltypes : 1D numpy array (string) or list of strings 147 | set of class labels 148 | labels: 1D numpy array (string) or list of strings 149 | vector/list of class labels in dataset 150 | k: Integer 151 | number of nearest neighbors 152 | 153 | Returns 154 | ------- 155 | stats_tab: 2D pandas dataframe (numeric) 156 | Summary table T(X,Y) with X,Y = number of classes.
157 | T[i,j] is the average percentage of datapoints with class label j 158 | in the neighborhood of datapoints with class label i 159 | 160 | stats_tab_norm: 2D pandas dataframe (numeric) 161 | Summary table N(X,Y) with X,Y = number of classes. 162 | N[i,j] is the log2-transformed ratio of the percentage of datapoints 163 | with class label j in the neighborhood of datapoints with class label i 164 | to the percentage that would be expected by random chance and random 165 | distribution. (N[i,j] = log2(T[i,j]/random_expect)) 166 | 167 | Example 168 | ------- 169 | >>> 170 | 171 | """ 172 | 173 | # Get the class frequencies in the dataset 174 | overall = np.zeros((len(calltypes))) 175 | for i,calltype in enumerate(calltypes): 176 | overall[i] = sum(labels==calltype) 177 | overall = (overall/np.sum(overall))*100 178 | 179 | # Initialize empty array for stats_tab and stats_tab_norm 180 | stats_tab = np.zeros((len(calltypes),len(calltypes))) 181 | stats_tab_norm = np.zeros((len(calltypes),len(calltypes))) 182 | 183 | # For each calltype 184 | for i, calltype in enumerate(calltypes): 185 | # Get the table with all neighbor label counts per datapoint 186 | stats = nn_stats_dict[calltype] 187 | # Average across all datapoints and transform to percentage 188 | stats_tab[i,:] = (np.mean(stats,axis=0)/k)*100 189 | # Divide by overall percentage of this class in dataset 190 | # for the normalized statstab version 191 | stats_tab_norm[i,:] = ((np.mean(stats,axis=0)/k)*100)/overall 192 | 193 | # Turn into dataframe 194 | stats_tab = pd.DataFrame(stats_tab) 195 | stats_tab_norm = pd.DataFrame(stats_tab_norm) 196 | 197 | # Add row with overall frequencies to statstab 198 | stats_tab.loc[len(stats_tab)] = overall 199 | 200 | # Name columns and rows 201 | stats_tab.columns = calltypes 202 | stats_tab.index = calltypes+['overall'] 203 | 204 | stats_tab_norm.columns = calltypes 205 | stats_tab_norm.index = calltypes 206 | 207 | # Replace zeros with small value as otherwise log2 
transform cannot be applied 208 | x=stats_tab_norm.replace(0, 0.0001) 209 | 210 | # log2-transform the ratios that are currently in statstabnorm 211 | stats_tab_norm = np.log2(x) 212 | 213 | return stats_tab, stats_tab_norm 214 | 215 | 216 | class nn: 217 | """ 218 | A class to represent nearest neighbor statistics for a 219 | given latent space representation of a labelled dataset 220 | 221 | Attributes 222 | ---------- 223 | embedding : 2D numpy array (numeric) 224 | a dataset E(X,Y) with X datapoints and Y dimensions 225 | 226 | labels: 1D numpy array (string) or list of strings 227 | vector/list of class labels in dataset 228 | k : integer 229 | number of nearest neighbors to consider 230 | 231 | statstab: 2D pandas dataframe (numeric) 232 | Summary table T(X,Y) with X,Y = number of classes. 233 | T[i,j] is the average percentage of datapoints with class label j 234 | in the neighborhood of datapoints with class label i 235 | 236 | statstabnorm: 2D pandas dataframe (numeric) 237 | Summary table N(X,Y) with X,Y = number of classes. 238 | N[i,j] is the log2-transformed ratio of the percentage of datapoints 239 | with class label j in the neighborhood of datapoints with class label i 240 | to the percentage that would be expected by random chance and random 241 | distribution. (N[i,j] = log2(T[i,j]/random_expect)) 242 | 243 | Methods 244 | ------- 245 | 246 | knn_cc(): 247 | returns k nearest neighbor fractional consistency for each class 248 | (1D numpy array). What percentage of datapoints (of this class) 249 | have fully consistent k neighbors (all k are also of the same class) 250 | 251 | knn_accuracy(): 252 | returns k nearest neighbor classifier accuracy for each class 253 | (1D numpy array).
What percentage of datapoints (of this class) 254 | have a majority of same-class neighbors among k nearest neighbors 255 | 256 | get_statstab(): 257 | returns statstab 258 | 259 | get_statstabnorm(): 260 | returns statstabnorm 261 | 262 | get_S(): 263 | returns S score of embedding 264 | S(class X) is the average percentage of same-class neighbors 265 | among the k nearest neighbors of all datapoints of 266 | class X. S of an embedding is the average of S(class X) over all 267 | classes X (unweighted, e.g. does not consider class frequencies). 268 | 269 | get_Snorm(): 270 | returns Snorm score of embedding 271 | Snorm(class X) is the log2 transformed, normalized percentage of 272 | same-class neighbors among the k nearest neighbors of all datapoints of 273 | class X. Snorm of an embedding is the average of Snorm(class X) over all 274 | classes X. 275 | 276 | get_ownclass_S(): 277 | returns array of S(class X) score for each class X in the dataset 278 | (alphanumerically sorted by class name) 279 | S(class X) is the average percentage of same-class neighbors 280 | among the k nearest neighbors of all datapoints of 281 | class X. 282 | 283 | get_ownclass_Snorm(): 284 | returns array of Snorm(class X) score for each class X in the dataset 285 | (alphanumerically sorted by class name) 286 | Snorm(class X) is the log2 transformed, normalized percentage of 287 | same-class neighbors among the k nearest neighbors of all datapoints of 288 | class X. 
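To make the S score concrete, here is a self-contained toy computation of S (plain numpy standing in for this class; the data, class names, and choice of k are purely illustrative):

```python
import numpy as np

# toy 2D embedding: two well-separated classes of 20 points each
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels = np.array(["A"] * 20 + ["B"] * 20)

k = 5
d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)          # exclude the point itself, as get_knn does
idx = np.argsort(d, axis=1)[:, :k]   # indices of the k nearest neighbors

# per-point percentage of same-class neighbors, averaged per class, then over classes
same = (labels[idx] == labels[:, None]).mean(axis=1) * 100
S = np.mean([same[labels == c].mean() for c in ["A", "B"]])
assert S == 100.0  # clusters are far apart, so every neighbor is same-class
```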
289 | 290 | plot_heat_S(vmin, vmax, center, cmap, cbar, outname) 291 | plots heatmap of S scores 292 | 293 | plot_heat_Snorm(vmin, vmax, center, cmap, cbar, outname) 294 | plots heatmap of Snorm scores 295 | 296 | plot_heat_fold(center, cmap, cbar, outname) 297 | plots heatmap of fold likelihood (statstabnorm scores to the power of 2) 298 | 299 | draw_simgraph(outname) 300 | draws similarity graph based on statstabnorm scores 301 | 302 | """ 303 | def __init__(self, embedding, labels, k): 304 | 305 | self.embedding = embedding 306 | self.labels = labels 307 | self.k = k 308 | 309 | label_types = sorted(list(set(labels))) 310 | 311 | indices, distances = get_knn(k,embedding) 312 | nn_stats_dict = make_nn_stats_dict(label_types, labels, indices) 313 | stats_tab, stats_tab_norm = make_statstabs(nn_stats_dict, label_types, labels, k) 314 | 315 | self.nn_stats_dict = nn_stats_dict 316 | self.statstab = stats_tab 317 | self.statstabnorm = stats_tab_norm 318 | 319 | def knn_cc(self): 320 | label_types = sorted(list(set(self.labels))) 321 | consistent = [] 322 | for i,labeltype in enumerate(label_types): 323 | statsd = self.nn_stats_dict[labeltype] 324 | x = statsd[:,i] 325 | cc = (np.sum(x == self.k) / statsd.shape[0])*100 326 | consistent.append(cc) 327 | return np.asarray(consistent) 328 | 329 | def knn_accuracy(self): 330 | label_types = sorted(list(set(self.labels))) 331 | has_majority = [] 332 | if (self.k % 2) == 0: 333 | n_majority = (self.k/2)+ 1 334 | else: 335 | n_majority = (self.k/2)+ 0.5 336 | for i,labeltype in enumerate(label_types): 337 | statsd = self.nn_stats_dict[labeltype] 338 | x = statsd[:,i] 339 | cc = (np.sum(x >= n_majority) / statsd.shape[0])*100 340 | has_majority.append(cc) 341 | return np.asarray(has_majority) 342 | 343 | def get_statstab(self): 344 | return self.statstab 345 | 346 | def get_statstabnorm(self): 347 | return self.statstabnorm 348 | 349 | def get_S(self): 350 | return np.mean(np.diagonal(self.statstab)) 351 | 352 | def get_Snorm(self):
353 | return np.mean(np.diagonal(self.statstabnorm)) 354 | 355 | def get_ownclass_S(self): 356 | return np.diagonal(self.statstab) 357 | 358 | def get_ownclass_Snorm(self): 359 | return np.diagonal(self.statstabnorm) 360 | 361 | def plot_heat_S(self,vmin=0, vmax=100, center=50, cmap=sns.color_palette("Greens", as_cmap=True), cbar=None, outname=None): 362 | plt.figure(figsize=(6,6)) 363 | ax=sns.heatmap(self.statstab, annot=True, vmin=vmin, vmax=vmax, center=center, cmap=cmap, cbar=cbar) 364 | plt.xlabel("neighbor label") 365 | plt.ylabel("datapoint label") 366 | plt.title("Nearest Neighbor Frequency P") 367 | if outname: 368 | plt.savefig(outname, facecolor="white") 369 | 370 | def plot_heat_Snorm(self,vmin=-13, vmax=13, center=0, cmap=sns.diverging_palette(20, 145, as_cmap=True), cbar=None, outname=None): 371 | plt.figure(figsize=(6,6)) 372 | ax=sns.heatmap(self.statstabnorm, annot=True, vmin=vmin, vmax=vmax, center=center, cmap=cmap, cbar=cbar) 373 | plt.xlabel("neighbor label") 374 | plt.ylabel("datapoint label") 375 | plt.title("Normalized Nearest Neighbor Frequency Pnorm") 376 | if outname: 377 | plt.savefig(outname, facecolor="white") 378 | 379 | def plot_heat_fold(self, center=1, cmap=sns.diverging_palette(20, 145, as_cmap=True), cbar=None, outname=None): 380 | plt.figure(figsize=(6,6)) 381 | ax=sns.heatmap(np.power(2,self.statstabnorm), annot=True, center=center, cmap=cmap, cbar=cbar) 382 | plt.xlabel("neighbor label") 383 | plt.ylabel("datapoint label") 384 | plt.title("Nearest Neighbor fold likelihood") 385 | if outname: 386 | plt.savefig(outname, facecolor="white") 387 | 388 | def draw_simgraph(self, outname="simgraph.png"): 389 | 390 | # Imports here because specific to this method and 391 | # sometimes problematic to install (dependencies) 392 | 393 | import networkx as nx 394 | import pygraphviz 395 | 396 | calltypes = sorted(list(set(self.labels))) 397 | sim_mat = np.asarray(self.statstabnorm).copy() 398 | for i in range(sim_mat.shape[0]): 399 | for 
j in range(i,sim_mat.shape[0]): 400 | if i!=j: 401 | sim_mat[i,j] = np.mean((sim_mat[i,j], sim_mat[j,i])) 402 | sim_mat[j,i] = sim_mat[i,j] 403 | else: 404 | sim_mat[i,j] = 0 405 | 406 | dist_mat = sim_mat*(-1) 407 | dist_mat = np.interp(dist_mat, (dist_mat.min(), dist_mat.max()), (1, 10)) 408 | 409 | for i in range(dist_mat.shape[0]): 410 | dist_mat[i,i] = 0 411 | 412 | dt = [('len', float)] 413 | 414 | A = dist_mat 415 | A = A.view(dt) 416 | 417 | G = nx.from_numpy_matrix(A) 418 | G = nx.relabel_nodes(G, dict(zip(range(len(G.nodes())),calltypes))) 419 | 420 | G = nx.drawing.nx_agraph.to_agraph(G) 421 | 422 | G.node_attr.update(color="#bec1d4", style="filled", shape='circle', fontsize='20') 423 | G.edge_attr.update(color="blue", width="2.0") 424 | print("Graph saved at ", outname) 425 | G.draw(outname, format='png', prog='neato') 426 | return G 427 | 428 | 429 | 430 | class sil: 431 | """ 432 | A class to represent Silhouette score statistics for a 433 | given latent space representation of a labelled dataset 434 | 435 | Attributes 436 | ---------- 437 | embedding : 2D numpy array (numeric) 438 | a dataset E(X,Y) with X datapoints and Y dimensions 439 | 440 | labels: 1D numpy array (string) or list of strings 441 | vector/list of class labels in dataset 442 | 443 | 444 | labeltypes: list of strings 445 | alphanumerically sorted set of class labels 446 | 447 | avrg_SIL: Numeric (float) 448 | The average Silhouette score of the dataset 449 | 450 | sample_SIL: 1D numpy array (numeric) 451 | The Silhouette scores for each datapoint in the dataset 452 | 453 | Methods 454 | ------- 455 | 456 | get_avrg_score(): 457 | returns the average Silhouette score of the dataset 458 | 459 | get_score_per_class(): 460 | returns the average Silhouette score per class for each 461 | class in the dataset as 1D numpy array 462 | (alphanumerically sorted classes) 463 | 464 | get_sample_scores(): 465 | returns the Silhouette scores for each datapoint in the dataset 466 | (1D numpy array, 
numeric) 467 | 468 | 469 | """ 470 | def __init__(self, embedding, labels): 471 | 472 | self.embedding = embedding 473 | self.labels = labels 474 | self.labeltypes = sorted(list(set(labels))) 475 | 476 | self.avrg_SIL = silhouette_score(embedding, labels) 477 | self.sample_SIL = silhouette_samples(embedding, labels) 478 | 479 | def get_avrg_score(self): 480 | return self.avrg_SIL 481 | 482 | def get_score_per_class(self): 483 | scores = np.zeros((len(self.labeltypes),)) 484 | for i, label in enumerate(self.labeltypes): 485 | ith_cluster_silhouette_values = self.sample_SIL[self.labels == label] 486 | scores[i] = np.mean(ith_cluster_silhouette_values) 487 | #scores_tab = pd.DataFrame([scores],columns=self.labeltypes) 488 | return scores 489 | 490 | def get_sample_scores(self): 491 | return self.sample_SIL 492 | 493 | def plot_sil(self, mypalette="Set2", embedding_type=None, outname=None): 494 | labeltypes = sorted(list(set(self.labels))) 495 | n_clusters = len(labeltypes) 496 | 497 | # Create a subplot with 1 row and 2 columns 498 | fig, ax1 = plt.subplots(1, 1) 499 | fig.set_size_inches(9, 7) 500 | ax1.set_xlim([-1, 1]) 501 | ax1.set_ylim([0, self.embedding.shape[0] + (n_clusters + 1) * 10]) 502 | y_lower = 10 503 | 504 | pal = sns.color_palette(mypalette, n_colors=len(labeltypes)) 505 | color_dict = dict(zip(labeltypes, pal)) 506 | 507 | labeltypes = sorted(labeltypes, reverse=True) 508 | 509 | 510 | for i, cluster_label in enumerate(labeltypes): 511 | ith_cluster_silhouette_values = self.sample_SIL[self.labels == cluster_label] 512 | ith_cluster_silhouette_values.sort() 513 | 514 | size_cluster_i = ith_cluster_silhouette_values.shape[0] 515 | y_upper = y_lower + size_cluster_i 516 | 517 | ax1.fill_betweenx(np.arange(y_lower, y_upper), 518 | 0, ith_cluster_silhouette_values, 519 | facecolor=color_dict[cluster_label], edgecolor=color_dict[cluster_label], alpha=0.7) 520 | 521 | ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, cluster_label) 522 | 523 | # Compute the 
new y_lower for next plot 524 | y_lower = y_upper + 10 # 10 for the 0 samples 525 | 526 | if embedding_type: 527 | mytitle = "Silhouette plot for "+embedding_type+" labels" 528 | else: 529 | mytitle = "Silhouette plot" 530 | 531 | ax1.set_title(mytitle) 532 | ax1.set_xlabel("Silhouette value") 533 | ax1.set_ylabel("Cluster label") 534 | 535 | # The vertical line for average silhouette score of all the values 536 | ax1.axvline(x=self.avrg_SIL, color="red", linestyle="--") 537 | 538 | if outname: 539 | plt.savefig(outname, facecolor="white") 540 | 541 | 542 | 543 | 544 | def plot_within_without(embedding,labels, distance_metric = "euclidean", outname=None,xmin=0, xmax=12, ymax=0.5, nbins=50,nrows=4, ncols=2, density=True): 545 | """ 546 | Function that plots distribution of pairwise distances within a class 547 | vs. towards other classes ("between"), for each class in a dataset 548 | 549 | Parameters 550 | ---------- 551 | embedding : 2D numpy array (numeric) 552 | a dataset E(X,Y) with X datapoints and Y dimensions 553 | 554 | labels: 1D numpy array (string) or list of strings 555 | vector/list of class labels in dataset 556 | 557 | 558 | distance_metric: String 559 | Type of distance metric, e.g. "euclidean", "manhattan"... 560 | all scipy.spatial.distance metrics are allowed 561 | 562 | outname: String 563 | Output filename at which plot will be saved 564 | No plot will be saved if outname is None 565 | (e.g. 
"my_folder/my_img.png") 566 | 567 | xmin, xmax: Numeric 568 | Min and max of x-axis 569 | 570 | ymax: Numeric 571 | Max of yaxis 572 | 573 | nbins: Integer 574 | Number of bins in histograms 575 | 576 | nrows: Integer 577 | Number of rows of subplots 578 | 579 | ncols: Integer 580 | Number of columns of subplots 581 | 582 | density: Boolean 583 | Plot density histogram if density=True 584 | else plot frequency histogram 585 | 586 | Returns 587 | ------- 588 | 589 | - 590 | 591 | """ 592 | 593 | distmat_embedded = squareform(pdist(embedding, metric=distance_metric)) 594 | labels = np.asarray(labels) 595 | calltypes = sorted(list(set(labels))) 596 | 597 | self_dists={} 598 | other_dists={} 599 | 600 | for calltype in calltypes: 601 | x=distmat_embedded[np.where(labels==calltype)] 602 | x = np.transpose(x) 603 | y = x[np.where(labels==calltype)] 604 | 605 | self_dists[calltype] = y[np.triu_indices(n=y.shape[0], m=y.shape[1],k = 1)] 606 | y = x[np.where(labels!=calltype)] 607 | other_dists[calltype] = y[np.triu_indices(n=y.shape[0], m=y.shape[1], k = 1)] 608 | 609 | plt.figure(figsize=(8, 8)) 610 | i=1 611 | 612 | for calltype in calltypes: 613 | 614 | plt.subplot(nrows, ncols, i) 615 | n, bins, patches = plt.hist(x=self_dists[calltype], label="within", density=density, 616 | bins=np.linspace(xmin, xmax, nbins), color='green', 617 | alpha=0.5, rwidth=0.85) 618 | 619 | plt.vlines(x=np.mean(self_dists[calltype]),ymin=0,ymax=ymax,color='green', linestyles='dotted') 620 | 621 | n, bins, patches = plt.hist(x=other_dists[calltype], label="between", density=density, 622 | bins=np.linspace(xmin, xmax, nbins), color='red', 623 | alpha=0.5, rwidth=0.85) 624 | 625 | plt.vlines(x=np.mean(other_dists[calltype]),ymin=0,ymax=ymax,color='red', linestyles='dotted') 626 | plt.legend() 627 | plt.grid(axis='y', alpha=0.75) 628 | plt.title(calltype) 629 | plt.xlim(xmin,xmax) 630 | plt.ylim(0, ymax) 631 | 632 | if (i%ncols)==1: 633 | ylabtitle = 'Density' if density else 'Frequency' 634 | 
plt.ylabel(ylabtitle) 635 | if i>=((nrows*ncols)-ncols): 636 | plt.xlabel(distance_metric+' distance') 637 | 638 | i=i+1 639 | 640 | plt.tight_layout() 641 | if outname: 642 | plt.savefig(outname, facecolor="white") 643 | 644 | 645 | def next_sameclass_nb(embedding, labels): 646 | """ 647 | Function that calculates the neighborhood degree of the closest 648 | same-class neighbor for a given labelled dataset. Calculation is 649 | based on euclidean distance and done for each datapoint. E.g. 6 650 | means that the 6th nearest neighbor of this datapoint is the first 651 | to be of the same-class (the first 5 nearest neighbors are of 652 | different class) 653 | 654 | Parameters: 655 | ---------- 656 | embedding : 2D numpy array (numeric) 657 | a dataset E(X,Y) with X datapoints and Y dimensions 658 | 659 | labels: 1D numpy array (string) or list of strings 660 | vector/list of class labels in dataset 661 | 662 | Returns: 663 | ------- 664 | 665 | nbs_to_sameclass: 1D numpy array 666 | nearest same-class neighborhood degree for 667 | each datapoint of the input dataset 668 | 669 | """ 670 | indices = [] 671 | distmat = euclidean_distances(embedding, embedding) 672 | k = embedding.shape[0]-1 673 | 674 | nbs_to_sameclass = [] 675 | 676 | for i in range(distmat.shape[0]): 677 | neighbors = [] 678 | distances = distmat[i,:] 679 | ranks = np.array(distances).argsort().argsort() 680 | for j in range(1,embedding.shape[0]): 681 | ind = np.where(ranks==j)[0] 682 | nb_label = labels[ind[0]] 683 | neighbors.append(nb_label) 684 | 685 | neighbors = np.asarray(neighbors) 686 | 687 | # How many neighbors until I encounter a same-class neighbor? 
701 | first_occurrence = np.where(neighbors==labels[i])[0][0] 702 | 703 | nbs_to_sameclass.append(first_occurrence) 704 | 705 | nbs_to_sameclass = np.asarray(nbs_to_sameclass) 706 | return nbs_to_sameclass 707 | 708 | 709 | 710 | -------------------------------------------------------------------------------- /functions/plot_functions.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # In[4]: 5 | 6 | 7 | # -*- coding: utf-8 -*- 8 | """ 9 | Created on Tue May 4 17:39:59 2021 10 | 11 | Collection of custom plotting functions for embeddings 12 | 13 | @author: marathomas 14 | """ 15 | 16 | import pandas as pd 17 | import numpy as np 18 | import matplotlib.pyplot as plt 19 | from mpl_toolkits.mplot3d import Axes3D 20 | from matplotlib.legend import Legend 21 | import matplotlib 22 | import seaborn as sns 23 | import plotly.express as px 24 | import plotly.graph_objects as go 25 | 26 | 27 | 28 | def umap_2Dplot(x,y, scat_labels, mycolors, outname=None, showlegend=True): 29 | """ 30 | Function that creates (and saves) a 2D plot from an 31 | input dataset, color-coded by the provided labels. 32 | 33 | Parameters 34 | ---------- 35 | x : 1D numpy array (numeric) or list 36 | x coordinates of datapoints 37 | 38 | y: 1D numpy array (numeric) or list 39 | y coordinates of datapoints 40 | 41 | scat_labels: List-of-Strings 42 | Datapoint labels 43 | 44 | mycolors: String or List-of-Strings 45 | Seaborn color palette name (e.g.
"Set2") or list of 46 | colors (Hex value strings) used for coloring datapoints 47 | (e.g. ["#FFEBCD","#0000FF",...]) 48 | 49 | outname: String 50 | Output filename at which plot will be saved 51 | No plot will be saved if outname is None 52 | (e.g. "my_folder/my_img.png") 53 | 54 | showlegend: Boolean 55 | Show legend if True, else don't 56 | 57 | Returns 58 | ------- 59 | 60 | - 61 | 62 | """ 63 | 64 | labeltypes = sorted(list(set(scat_labels))) 65 | pal = sns.color_palette(mycolors, n_colors=len(labeltypes)) 66 | color_dict = dict(zip(labeltypes, pal)) 67 | c = [color_dict[val] for val in scat_labels] 68 | 69 | fig = plt.figure(figsize=(6,6)) 70 | 71 | plt.scatter(x, y, alpha=1, 72 | s=10, c=c) 73 | plt.xlabel('UMAP1') 74 | plt.ylabel('UMAP2'); 75 | 76 | scatters = [] 77 | for label in labeltypes: 78 | scatters.append(matplotlib.lines.Line2D([0],[0], linestyle="none", c=color_dict[label], marker = 'o')) 79 | 80 | if showlegend: plt.legend(scatters, labeltypes, numpoints = 1) 81 | if outname: plt.savefig(outname, facecolor="white") 82 | 83 | 84 | 85 | def umap_3Dplot(x,y,z,scat_labels, mycolors,outname=None, showlegend=True): 86 | """ 87 | Function that creates (and saves) a 3D plot from an 88 | input dataset, color-coded by the provided labels. 89 | 90 | Parameters 91 | ---------- 92 | x : 1D numpy array (numeric) or list 93 | x coordinates of datapoints 94 | 95 | y: 1D numpy array (numeric) or list 96 | y coordinates of datapoints 97 | 98 | z: 1D numpy array (numeric) or list 99 | z coordinates of datapoints 100 | 101 | scat_labels: List-of-Strings 102 | Datapoint labels 103 | 104 | mycolors: String or List-of-Strings 105 | Seaborn color palette name (e.g. "Set2") or list of 106 | colors (Hex value strings) used for coloring datapoints 107 | (e.g. ["#FFEBCD","#0000FF",...]) 108 | 109 | outname: String 110 | Output filename at which plot will be saved 111 | No plot will be saved if outname is None 112 | (e.g.
"my_folder/my_img.png") 113 | 114 | showlegend: Boolean 115 | Show legend if True, else don't 116 | 117 | Returns 118 | ------- 119 | 120 | - 121 | 122 | """ 123 | labeltypes = sorted(list(set(scat_labels))) 124 | pal = sns.color_palette(mycolors, n_colors=len(labeltypes)) 125 | color_dict = dict(zip(labeltypes, pal)) 126 | c = [color_dict[val] for val in scat_labels] 127 | 128 | fig = plt.figure(figsize=(10,10)) 129 | ax = fig.add_subplot(111, projection='3d') 130 | 131 | Axes3D.scatter(ax, 132 | xs = x, 133 | ys = y, 134 | zs = z, 135 | zdir='z', 136 | s=20, 137 | label = c, 138 | c=c, 139 | depthshade=False) 140 | 141 | ax.set_xlabel('UMAP1') 142 | ax.set_ylabel('UMAP2') 143 | ax.set_zlabel('UMAP3') 144 | 145 | ax.xaxis.pane.fill = False 146 | ax.yaxis.pane.fill = False 147 | ax.zaxis.pane.fill = False 148 | 149 | ax.xaxis.pane.set_edgecolor('w') 150 | ax.yaxis.pane.set_edgecolor('w') 151 | ax.zaxis.pane.set_edgecolor('w') 152 | 153 | 154 | 155 | if showlegend: 156 | scatters = [] 157 | for label in labeltypes: 158 | scatters.append(matplotlib.lines.Line2D([0],[0], linestyle="none", c=color_dict[label], marker = 'o')) 159 | 160 | ax.legend(scatters, labeltypes, numpoints = 1) 161 | 162 | if outname: plt.savefig(outname, facecolor="white") 163 | 164 | 165 | 166 | def plotly_viz(x,y,z,scat_labels, mycolors): 167 | """ 168 | Function that creates an interactive 3D plot with plotly from 169 | an input dataset, color-coded by the provided labels. 170 | 171 | Parameters 172 | ---------- 173 | x : 1D numpy array (numeric) or list 174 | x coordinates of datapoints 175 | 176 | y: 1D numpy array (numeric) or list 177 | y coordinates of datapoints 178 | 179 | z: 1D numpy array (numeric) or list 180 | z coordinates of datapoints 181 | 182 | scat_labels: List-of-Strings 183 | Datapoint labels 184 | 185 | mycolors: String or List-of-Strings 186 | Seaborn color palette name (e.g. "Set2") or list of 187 | colors (Hex value strings) used for coloring datapoints 188 | (e.g.
["#FFEBCD","#0000FF",...]) 189 | 190 | Returns 191 | ------- 192 | 193 | - 194 | 195 | """ 196 | labeltypes = sorted(list(set(scat_labels))) 197 | pal = sns.color_palette(mycolors, n_colors=len(labeltypes)) 198 | color_dict = dict(zip(labeltypes, pal)) 199 | c = [color_dict[val] for val in scat_labels] 200 | 201 | fig = go.Figure(data=[go.Scatter3d(x=x, y=y, z=z, 202 | mode='markers', 203 | hovertext = scat_labels, 204 | marker=dict( 205 | size=4, 206 | color=c, # set color to an array/list of desired values 207 | opacity=0.8 208 | ))]) 209 | 210 | fig.update_layout(scene = dict( 211 | xaxis_title='UMAP1', 212 | yaxis_title='UMAP2', 213 | zaxis_title='UMAP3'), 214 | width=700, 215 | margin=dict(r=20, b=10, l=10, t=10)) 216 | 217 | return fig 218 | 219 | 220 | -------------------------------------------------------------------------------- /functions/preprocessing_functions.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | def calc_zscore(s): 4 | """ 5 | Function that z-score transforms each value of a 2D array 6 | (not along any axis). numba-compatible. 
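A quick standalone check of what this transform produces (a vectorized numpy equivalent of the element-wise loop below; not the module's own code):

```python
import numpy as np

spec = np.array([[1.0, 2.0], [3.0, 4.0]])
z = (spec - spec.mean()) / spec.std()   # one mean/std over the whole 2D array

# the transformed array has zero mean and unit standard deviation
assert abs(z.mean()) < 1e-9
assert abs(z.std() - 1.0) < 1e-9
```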
7 | 8 | Parameters 9 | ---------- 10 | s : 2D numpy array (numeric) 11 | 12 | Returns 13 | ------- 14 | spec : 2D numpy array (numeric) 15 | the z-transformed array 16 | 17 | """ 18 | spec = s.copy() 19 | mn = np.mean(spec) 20 | std = np.std(spec) 21 | for i in range(spec.shape[0]): 22 | for j in range(spec.shape[1]): 23 | spec[i,j] = (spec[i,j]-mn)/std 24 | return spec 25 | 26 | def pad_spectro(spec,maxlen): 27 | """ 28 | Function that pads a spectrogram with shape (X,Y) with 29 | zeros, so that the result has shape (X,maxlen) 30 | 31 | Parameters 32 | ---------- 33 | spec : 2D numpy array (numeric) 34 | a spectrogram S(X,Y) with X frequency bins and Y timeframes 35 | maxlen: maximal length (integer) 36 | 37 | Returns 38 | ------- 39 | padded_spec : 2D numpy array (numeric) 40 | a zero-padded spectrogram S(X,maxlen) with X frequency bins 41 | and maxlen timeframes 42 | 43 | """ 44 | padding = maxlen - spec.shape[1] 45 | z = np.zeros((spec.shape[0],padding)) 46 | padded_spec = np.append(spec, z, axis=1) 47 | return padded_spec 48 | 49 | 50 | def pad_transform_spectro(spec,maxlen): 51 | """ 52 | Function that encodes a 2D spectrogram in a 1D array, so that it 53 | can later be restored again. 54 | Flattens and pads a spectrogram with default value 999 55 | to a given length.
Size of the original spectrogram is encoded 56 | in the first two cells of the resulting array 57 | 58 | Parameters 59 | ---------- 60 | spec : 2D numpy array (numeric) 61 | a spectrogram S(X,Y) with X frequency bins and Y timeframes 62 | maxlen: Integer 63 | n of timeframes to which spec should be padded 64 | 65 | Returns 66 | ------- 67 | trans_spec : 1D numpy array (numeric) 68 | the padded and flattened spectrogram 69 | 70 | """ 71 | flat_spec = spec.flatten() 72 | trans_spec = np.concatenate((np.asarray([spec.shape[0], spec.shape[1]]), flat_spec, np.asarray([999]*(maxlen-flat_spec.shape[0]-2)))) 73 | trans_spec = np.float64(trans_spec) 74 | 75 | return trans_spec 76 | -------------------------------------------------------------------------------- /notebooks/.ipynb_checkpoints/02a_generate_UMAP_basic-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Step 2, alternative a: Generate UMAP representations from spectrograms - Basic pipeline" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Introduction" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "This script creates UMAP representations from spectrograms using the basic pipeline.\n", 22 | "\n", 23 | "#### The following structure and files are required in the project directory:\n", 24 | "\n", 25 | " ├── data\n", 26 | " │ ├── df.pkl <- pickled pandas dataframe with metadata and spectrograms (generated in\n", 27 | " | 01_generate_spectrograms.ipynb)\n", 28 | " ├── parameters \n", 29 | " ├── functions <- the folder with the function files provided in the repo \n", 30 | " ├── notebooks <- the folder with the notebook files provided in the repo \n", 31 | " ├── ... 
\n", 32 | " \n", 33 | "\n", 34 | "#### The following columns must exist (somewhere) in the pickled dataframe df.pkl:\n", 35 | "\n", 36 | " | spectrograms | ....\n", 37 | " ------------------------------------------\n", 38 | " | 2D np.array | ....\n", 39 | " | ... | ....\n", 40 | " | ... | .... \n", 41 | " \n", 42 | "\n", 43 | "#### The following files are generated in this script:\n", 44 | "\n", 45 | " ├── data\n", 46 | " │ ├── df_umap.pkl <- pickled pandas dataframe with metadata, spectrograms AND UMAP coordinates" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Import statements, constants and functions" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 1, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "import pandas as pd\n", 63 | "import numpy as np\n", 64 | "import pickle\n", 65 | "import os\n", 66 | "from pathlib import Path\n", 67 | "import umap\n", 68 | "import sys \n", 69 | "sys.path.insert(0, '..')\n", 70 | "\n", 71 | "from functions.preprocessing_functions import calc_zscore, pad_spectro\n", 72 | "from functions.custom_dist_functions_umap import unpack_specs" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 2, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "P_DIR = str(Path(os.getcwd()).parents[0]) # project directory\n", 82 | "DATA = os.path.join(os.path.sep, P_DIR, 'data') # path to data subfolder in project directory\n", 83 | "DF_NAME = 'df.pkl' # name of pickled dataframe with metadata and spectrograms" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "Specify UMAP parameters. If desired, other inputs can be used for UMAP, such as denoised spectrograms, bandpass filtered spectrograms or other (MFCC, specs on frequency scale...) by changining the INPUT_COL parameter." 
91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 3, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "INPUT_COL = 'spectrograms' # column that is used for UMAP\n", 100 | " # could also choose 'denoised_spectrograms' or 'stretched_spectrograms' etc etc...\n", 101 | " \n", 102 | "METRIC_TYPE = 'euclidean' # distance metric used in UMAP. Check UMAP documentation for other options\n", 103 | " # e.g. 'euclidean', 'correlation', 'cosine', 'manhattan' ...\n", 104 | " \n", 105 | "N_COMP = 3 # number of dimensions desired in latent space " 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "## 1. Load data" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 4, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "df = pd.read_pickle(os.path.join(os.path.sep, DATA, DF_NAME))" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "## 2. UMAP\n", 129 | "### 2.1. Prepare UMAP input" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "In this step, the spectrograms are z-transformed, zero-padded and concatenated to obtain numeric vectors." 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 5, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "# Basic pipeline\n", 146 | "# No time-shift allowed, spectrograms should be aligned at the start.
All spectrograms are zero-padded \n", 147 | "# to equal length\n", 148 | " \n", 149 | "specs = df[INPUT_COL] # choose spectrogram column\n", 150 | "specs = [calc_zscore(s) for s in specs] # z-transform each spectrogram\n", 151 | "\n", 152 | "maxlen = np.max([spec.shape[1] for spec in specs]) # find maximal length in dataset\n", 153 | "flattened_specs = [pad_spectro(spec, maxlen).flatten() for spec in specs] # pad all specs to maxlen, then row-wise concatenate (flatten)\n", 154 | "data = np.asarray(flattened_specs) # data is the final input data for UMAP" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "### 2.2. Specify UMAP parameters" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 6, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "reducer = umap.UMAP(n_components=N_COMP, metric = METRIC_TYPE, # specify parameters of UMAP reducer\n", 171 | " min_dist = 0, random_state=2204) " 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "### 2.3. Fit UMAP" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 7, 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "embedding = reducer.fit_transform(data) # embedding contains the new coordinates of datapoints in 3D space" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "## 3.
Save dataframe" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 8, 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "# Add UMAP coordinates to dataframe\n", 204 | "for i in range(N_COMP):\n", 205 | " df['UMAP'+str(i+1)] = embedding[:,i]\n", 206 | "\n", 207 | "# Save dataframe\n", 208 | "df.to_pickle(os.path.join(os.path.sep, DATA, 'df_umap.pkl'))" 209 | ] 210 | } 211 | ], 212 | "metadata": { 213 | "kernelspec": { 214 | "display_name": "Python 3", 215 | "language": "python", 216 | "name": "python3" 217 | }, 218 | "language_info": { 219 | "codemirror_mode": { 220 | "name": "ipython", 221 | "version": 3 222 | }, 223 | "file_extension": ".py", 224 | "mimetype": "text/x-python", 225 | "name": "python", 226 | "nbconvert_exporter": "python", 227 | "pygments_lexer": "ipython3", 228 | "version": "3.7.10" 229 | } 230 | }, 231 | "nbformat": 4, 232 | "nbformat_minor": 4 233 | } 234 | -------------------------------------------------------------------------------- /notebooks/.ipynb_checkpoints/02b_generate_UMAP_timeshift-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Step 2, alternative b: Generate UMAP representations from spectrograms - custom distance (time-shift)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Introduction" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "This script creates UMAP representations from spectrograms, while allowing for some time-shift of spectrograms. 
This increases computation time, but is well suited for calls that are not well aligned at the start.\n", 22 | "\n", 23 | "#### The following structure and files are required in the project directory:\n", 24 | "\n", 25 | " ├── data\n", 26 | " │ ├── df.pkl <- pickled pandas dataframe with metadata and spectrograms (generated in\n", 27 | " | 01_generate_spectrograms.ipynb)\n", 28 | " ├── parameters \n", 29 | " ├── functions <- the folder with the function files provided in the repo \n", 30 | " ├── notebooks <- the folder with the notebook files provided in the repo \n", 31 | " ├── ... \n", 32 | " \n", 33 | "\n", 34 | "#### The following columns must exist (somewhere) in the pickled dataframe df.pkl:\n", 35 | "\n", 36 | " | spectrograms | ....\n", 37 | " ------------------------------------------\n", 38 | " | 2D np.array | ....\n", 39 | " | ... | ....\n", 40 | " | ... | .... \n", 41 | " \n", 42 | "#### The following files are generated in this script:\n", 43 | "\n", 44 | " ├── data\n", 45 | " │ ├── df_umap.pkl <- pickled pandas dataframe with metadata, spectrograms AND UMAP coordinates" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## Import statements, constants and functions" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 1, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "import pandas as pd\n", 62 | "import numpy as np\n", 63 | "import pickle\n", 64 | "import os\n", 65 | "from pathlib import Path\n", 66 | "import umap\n", 67 | "import sys \n", 68 | "sys.path.insert(0, '..')\n", 69 | "\n", 70 | "from functions.preprocessing_functions import calc_zscore, pad_spectro\n", 71 | "from functions.custom_dist_functions_umap import unpack_specs" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 2, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "P_DIR = str(Path(os.getcwd()).parents[0]) # project directory\n", 81 | "DATA = os.path.join(os.path.sep, 
P_DIR, 'data') " 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "Specify UMAP parameters. If desired, other inputs can be used for UMAP, such as denoised spectrograms, bandpass filtered spectrograms or other (MFCC, specs on frequency scale...) by changing the INPUT_COL parameter." 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 7, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "INPUT_COL = 'spectrograms' # column that is used for UMAP\n", 98 | " # could also choose 'denoised_spectrograms' or 'stretched_spectrograms' etc etc...\n", 99 | "\n", 100 | "MIN_OVERLAP = 0.9 # time shift constraint\n", 101 | " # MIN_OVERLAP*100 % of the shorter spectrogram must overlap with the longer spectrogram\n", 102 | " # when finding the position with the least error during the time-shifting\n", 103 | "\n", 104 | "METRIC_TYPE = 'euclidean' # distance metric used in UMAP.\n", 105 | " # If performing time-shift, only 'euclidean', 'correlation', 'cosine' and 'manhattan' \n", 106 | " # are available\n", 107 | " \n", 108 | "N_COMP = 3 # number of dimensions desired in latent space " 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "## 1. Load data" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 4, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "df = pd.read_pickle(os.path.join(os.path.sep, DATA, 'df.pkl'))" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "## 2. UMAP" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "In this step, the spectrograms are z-transformed, zero-padded and concatenated to obtain numeric vectors." 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "### 2.1.
Load custom distance function with time-shift" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "# Pipeline with allowing for time-shift of spectrograms. When assessing distance between spectrograms,\n", 155 | "# the shorter spectrogram is shifted along the longer one to find the position of minimum-error overlap.\n", 156 | "# The shorter is then zero-padded to the length of the longer one and distance is calculated using the \n", 157 | "# chosen METRIC_TYPE distance (euclidean, manhattan, cosine, correlation)\n", 158 | "# This also means that the dimensionality of the spectrogram vectors can be different for each pairwise \n", 159 | "# comparison. Hence, we need some sort of normalization to the dimensionality, otherwise metrics like \n", 160 | "# euclidean or manhattan will automatically be larger for high-dimensional spectrogram vectors (i.e. calls\n", 161 | "# with long duration). Therefore, euclidean and manhattan are normalized to the size of the spectrogram.\n", 162 | " \n", 163 | "from functions.preprocessing_functions import pad_transform_spectro\n", 164 | "import numba\n", 165 | "\n", 166 | "if METRIC_TYPE=='euclidean':\n", 167 | " @numba.njit()\n", 168 | " def spec_dist(a,b,size):\n", 169 | " dist = np.sqrt((np.sum(np.subtract(a,b)*np.subtract(a,b)))) / np.sqrt(size)\n", 170 | " return dist\n", 171 | "elif METRIC_TYPE=='manhattan':\n", 172 | " @numba.njit()\n", 173 | " def spec_dist(a,b,size):\n", 174 | " dist = (np.sum(np.abs(np.subtract(a,b)))) / size\n", 175 | " return dist\n", 176 | "elif METRIC_TYPE=='cosine':\n", 177 | " @numba.njit()\n", 178 | " def spec_dist(a,b,size):\n", 179 | " # turn into unit vectors by dividing each vector field by magnitude of vector\n", 180 | " dot_product = np.sum(a*b)\n", 181 | " a_magnitude = np.sqrt(np.sum(a*a))\n", 182 | " b_magnitude = np.sqrt(np.sum(b*b))\n", 183 | " dist = 1 - dot_product/(a_magnitude*b_magnitude)\n", 184 |
" return dist\n", 185 | "\n", 186 | "elif METRIC_TYPE=='correlation':\n", 187 | " @numba.njit()\n", 188 | " def spec_dist(a,b,size):\n", 189 | " a_meandiff = a - np.mean(a)\n", 190 | " b_meandiff = b - np.mean(b)\n", 191 | " dot_product = np.sum(a_meandiff*b_meandiff)\n", 192 | " a_meandiff_magnitude = np.sqrt(np.sum(a_meandiff*a_meandiff))\n", 193 | " b_meandiff_magnitude = np.sqrt(np.sum(b_meandiff*b_meandiff))\n", 194 | " dist = 1 - dot_product/(a_meandiff_magnitude * b_meandiff_magnitude)\n", 195 | " return dist\n", 196 | "else:\n", 197 | " print('Metric type ', METRIC_TYPE, ' not compatible with option TIME_SHIFT = True')\n", 198 | " raise\n", 199 | " \n", 200 | "@numba.njit()\n", 201 | "def calc_timeshift_pad(a,b):\n", 202 | " spec_s, spec_l = unpack_specs(a,b)\n", 203 | "\n", 204 | " len_s = spec_s.shape[1]\n", 205 | " len_l = spec_l.shape[1]\n", 206 | "\n", 207 | " nfreq = spec_s.shape[0] \n", 208 | "\n", 209 | " # define start position\n", 210 | " min_overlap_frames = int(MIN_OVERLAP * len_s)\n", 211 | " start_timeline = min_overlap_frames-len_s\n", 212 | " max_timeline = len_l - min_overlap_frames\n", 213 | " n_of_calculations = int((((max_timeline+1-start_timeline)+(max_timeline+1-start_timeline))/2) +1)\n", 214 | " distances = np.full((n_of_calculations),999.)\n", 215 | " count=0\n", 216 | " \n", 217 | " for timeline_p in range(start_timeline, max_timeline+1,2):\n", 218 | " # mismatch on left side\n", 219 | " if timeline_p < 0:\n", 220 | " len_overlap = len_s - abs(timeline_p)\n", 221 | " pad_s = np.full((nfreq, (len_l-len_overlap)),0.)\n", 222 | " pad_l = np.full((nfreq, (len_s-len_overlap)),0.)\n", 223 | " s_config = np.append(spec_s, pad_s, axis=1).astype(np.float64)\n", 224 | " l_config = np.append(pad_l, spec_l, axis=1).astype(np.float64)\n", 225 | " \n", 226 | " # mismatch on right side\n", 227 | " elif timeline_p > (len_l-len_s):\n", 228 | " len_overlap = len_l - timeline_p\n", 229 | " pad_s = np.full((nfreq, (len_l-len_overlap)),0.)\n", 230 | " 
pad_l = np.full((nfreq, (len_s-len_overlap)),0.)\n", 231 | " s_config = np.append(pad_s, spec_s, axis=1).astype(np.float64)\n", 232 | " l_config = np.append(spec_l, pad_l, axis=1).astype(np.float64)\n", 233 | " \n", 234 | " else:\n", 235 | " len_overlap = len_s\n", 236 | " start_col_l = timeline_p\n", 237 | " end_col_l = start_col_l + len_overlap\n", 238 | " pad_s_left = np.full((nfreq, start_col_l),0.)\n", 239 | " pad_s_right = np.full((nfreq, (len_l - end_col_l)),0.)\n", 240 | " l_config = spec_l.astype(np.float64)\n", 241 | " s_config = np.append(pad_s_left, spec_s, axis=1).astype(np.float64)\n", 242 | " s_config = np.append(s_config, pad_s_right, axis=1).astype(np.float64)\n", 243 | " \n", 244 | " size = s_config.shape[0]*s_config.shape[1]\n", 245 | " distances[count] = spec_dist(s_config, l_config, size)\n", 246 | " count = count + 1\n", 247 | " \n", 248 | " min_dist = np.min(distances)\n", 249 | " return min_dist" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "### 2.1. Prepare UMAP input" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "specs = df[INPUT_COL]\n", 266 | "specs = [calc_zscore(s) for s in specs] # z-transform\n", 267 | " \n", 268 | "n_bins = specs[0].shape[0]\n", 269 | "maxlen = np.max([spec.shape[1] for spec in specs]) * n_bins + 2\n", 270 | "trans_specs = [pad_transform_spectro(spec, maxlen) for spec in specs]\n", 271 | "data = np.asarray(trans_specs)" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "### 2.2. 
Specify UMAP parameters" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "reducer = umap.UMAP(n_components=N_COMP, metric = calc_timeshift_pad, min_dist = 0, random_state=2204) " 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "### 2.3. Fit UMAP" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 23, 300 | "metadata": {}, 301 | "outputs": [ 302 | { 303 | "name": "stderr", 304 | "output_type": "stream", 305 | "text": [ 306 | "/home/mthomas/anaconda3/envs/umap_tut_env/lib/python3.7/site-packages/umap/umap_.py:1728: UserWarning: custom distance metric does not return gradient; inverse_transform will be unavailable. To enable using inverse_transform method method, define a distance function that returns a tuple of (distance [float], gradient [np.array])\n", 307 | " \"custom distance metric does not return gradient; inverse_transform will be unavailable. \"\n" 308 | ] 309 | } 310 | ], 311 | "source": [ 312 | "embedding = reducer.fit_transform(data) # embedding contains the new coordinates of datapoints in 3D space" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "## 3. 
Save dataframe" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 25, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [ 328 | "# Add UMAP coordinates to dataframe\n", 329 | "for i in range(N_COMP):\n", 330 | " df['UMAP'+str(i+1)] = embedding[:,i]\n", 331 | "\n", 332 | "# Save dataframe\n", 333 | "df.to_pickle(os.path.join(os.path.sep, DATA, 'df_umap.pkl'))" 334 | ] 335 | } 336 | ], 337 | "metadata": { 338 | "kernelspec": { 339 | "display_name": "Python 3", 340 | "language": "python", 341 | "name": "python3" 342 | }, 343 | "language_info": { 344 | "codemirror_mode": { 345 | "name": "ipython", 346 | "version": 3 347 | }, 348 | "file_extension": ".py", 349 | "mimetype": "text/x-python", 350 | "name": "python", 351 | "nbconvert_exporter": "python", 352 | "pygments_lexer": "ipython3", 353 | "version": "3.7.10" 354 | } 355 | }, 356 | "nbformat": 4, 357 | "nbformat_minor": 4 358 | } 359 | -------------------------------------------------------------------------------- /notebooks/.ipynb_checkpoints/03_UMAP_viz_part_1_prep-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Interactive visualization of UMAP representations: Part 1 (Prep)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This script creates a spectrogram image for each call and saves all images in a pickled dictionary in the data subfolder (image_data.pkl). These images will be displayed later in the interactive visualization tool; generating them beforehand makes the tool faster, as images don't need to be created on-the-fly, but can be accessed through the dictionary. \n", 15 | "\n", 16 | "The default dictionary key is the filename without datatype specifier (e.g. 
without .wav), but if the dataframe contains a column 'callID', this is used as keys.\n", 17 | "\n", 18 | "#### The following minimal structure and files are required in the project directory:\n", 19 | "\n", 20 | " ├── data\n", 21 | " │ ├── df_umap.pkl <- pickled pandas dataframe with metadata, raw_audio, spectrograms and UMAP coordinates\n", 22 | " | (generated in 02a_generate_UMAP_basic.ipynb or 02b_generate_UMAP_timeshift.ipynb)\n", 23 | " ├── parameters \n", 24 | " │ ├── spec_params.py <- python file containing the spectrogram parameters used (generated in \n", 25 | " | 01_generate_spectrograms.ipynb) \n", 26 | " ├── functions <- the folder with the function files provided in the repo \n", 27 | " ├── notebooks <- the folder with the notebook files provided in the repo \n", 28 | " ├── ... \n", 29 | "\n", 30 | "\n", 31 | "#### The following columns must exist (somewhere) in the pickled dataframe df.pkl:\n", 32 | "(callID is optional)\n", 33 | "\n", 34 | " | filename | spectrograms | samplerate_hz | [optional: callID]\n", 35 | " --------------------------------------------------------------------\n", 36 | " | call_1.wav | 2D np.array | 8000 | [call_1]\n", 37 | " | call_2.wav | ... | 48000 | [call_2] \n", 38 | " | ... | ... | .... | .... 
\n", 39 | "\n", 40 | "#### The following files are generated in this script:\n", 41 | "\n", 42 | " ├── data\n", 43 | " │ ├── df_umap.pkl <- is overwritten with updated version of df_umap.pkl (with ID column) \n", 44 | " │ ├── image_data.pkl <- pickled dictionary with spectrogram images as values, ID column as keys" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Import statements, constants and functions" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 1, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "import pandas as pd\n", 61 | "import numpy as np\n", 62 | "import pickle\n", 63 | "import matplotlib.pyplot as plt\n", 64 | "import os\n", 65 | "from pathlib import Path\n", 66 | "import soundfile as sf\n", 67 | "import io\n", 68 | "import librosa\n", 69 | "import librosa.display\n", 70 | "import umap\n", 71 | "\n", 72 | "import sys \n", 73 | "sys.path.insert(0, '..')" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "P_DIR = str(Path(os.getcwd()).parents[0]) \n", 83 | "DATA = os.path.join(os.path.sep, P_DIR, 'data') \n", 84 | "DF_NAME = 'df_umap.pkl'\n", 85 | "\n", 86 | "SPEC_COL = 'spectrograms' # column name that contains the spectrograms\n", 87 | "ID_COL = 'callID' # column name that contains call identifier (must be unique)\n", 88 | "\n", 89 | "\n", 90 | "OVERWRITE = False # If there already exists an image_data.pkl, should it be overwritten? 
Default no" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 4, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "# Spectrogramming parameters (needed for generating the images)\n", 100 | "\n", 101 | "from parameters.spec_params import FFT_WIN, FFT_HOP, FMIN, FMAX\n", 102 | "\n", 103 | "# Make sure the spectrogramming parameters are correct!\n", 104 | "# They are used to set the correct time and frequency axis labels for the spectrogram images. \n", 105 | "\n", 106 | "# If you are using bandpass-filtered spectrograms...\n", 107 | "if 'filtered' in SPEC_COL:\n", 108 | " # ...FMIN is set to LOWCUT, FMAX to HIGHCUT and N_MELS to N_MELS_FILTERED\n", 109 | " from parameters.spec_params import LOWCUT, HIGHCUT, N_MELS_FILTERED\n", 110 | " \n", 111 | " FMIN = LOWCUT\n", 112 | " FMAX = HIGHCUT\n", 113 | " N_MELS = N_MELS_FILTERED" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "## 1. Read in files" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 5, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "df = pd.read_pickle(os.path.join(os.path.sep, DATA, DF_NAME))" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "### 1.1. 
Check if call identifier column is present" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 6, 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "name": "stdout", 146 | "output_type": "stream", 147 | "text": [ 148 | "No ID-Column found ( callID )\n", 149 | "Default ID column callID will be generated from filename.\n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "# Default callID will be the name of the wav file\n", 155 | "\n", 156 | "if ID_COL not in df.columns:\n", 157 | " print('No ID-Column found (', ID_COL, ')')\n", 158 | " \n", 159 | " if 'filename' in df.columns:\n", 160 | " print(\"Default ID column \", ID_COL, \"will be generated from filename.\")\n", 161 | " df[ID_COL] = [x.split(\".\")[0] for x in df['filename']]\n", 162 | " else:\n", 163 | " raise KeyError('No filename column found to generate the ID column from')" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "## 2. Generate spectrogram images" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "A spectrogram image is generated from each row in the dataframe. Images are saved in a dictionary (keys are the ID_COL of the dataframe).\n", 178 | "\n", 179 | "The dictionary is pickled and saved as image_data.pkl. It will later be loaded in the interactive visualization script and these images will be displayed in the visualization."
180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 7, 185 | "metadata": {}, 186 | "outputs": [ 187 | { 188 | "name": "stdout", 189 | "output_type": "stream", 190 | "text": [ 191 | "\r", 192 | "Processing i: 0 / 6428\r", 193 | "Processing i: 1 / 6428\r", 194 | "Processing i: 2 / 6428\r", 195 | "Processing i: 3 / 6428" 196 | ] 197 | }, 198 | { 199 | "name": "stderr", 200 | "output_type": "stream", 201 | "text": [ 202 | "/home/mthomas/anaconda3/envs/umap_tut_env/lib/python3.7/site-packages/librosa/display.py:974: MatplotlibDeprecationWarning: The 'basey' parameter of __init__() has been renamed 'base' since Matplotlib 3.3; support for the old name will be dropped two minor releases later.\n", 203 | " scaler(mode, **kwargs)\n", 204 | "/home/mthomas/anaconda3/envs/umap_tut_env/lib/python3.7/site-packages/librosa/display.py:974: MatplotlibDeprecationWarning: The 'linthreshy' parameter of __init__() has been renamed 'linthresh' since Matplotlib 3.3; support for the old name will be dropped two minor releases later.\n", 205 | " scaler(mode, **kwargs)\n" 206 | ] 207 | }, 208 | { 209 | "name": "stdout", 210 | "output_type": "stream", 211 | "text": [ 212 | "Processing i: 6427 / 6428 / 6428" 213 | ] 214 | } 215 | ], 216 | "source": [ 217 | "if OVERWRITE==False and os.path.isfile(os.path.join(os.path.sep,DATA,'image_data.pkl')):\n", 218 | " print(\"File already exists. 
Overwrite is set to FALSE, so no new image_data will be generated.\")\n", 219 | " \n", 220 | " # Double-check if image_data contains all the required calls\n", 221 | " with open(os.path.join(os.path.sep, DATA, 'image_data.pkl'), 'rb') as handle:\n", 222 | " image_data = pickle.load(handle) \n", 223 | " image_keys = list(image_data.keys())\n", 224 | " expected_keys = list(df[ID_COL])\n", 225 | " missing = list(set(expected_keys)-set(image_keys))\n", 226 | " \n", 227 | " if len(missing)>0:\n", 228 | " print(\"BUT: The current image_data.pkl file doesn't seem to contain all calls that are in your dataframe!\")\n", 229 | " \n", 230 | "else:\n", 231 | " image_data = {}\n", 232 | " for i,dat in enumerate(df.spectrograms):\n", 233 | " print('\\rProcessing i:',i,'/',df.shape[0], end='')\n", 234 | " dat = np.asarray(df.iloc[i][SPEC_COL]) \n", 235 | " sr = df.iloc[i]['samplerate_hz']\n", 236 | " plt.figure()\n", 237 | " librosa.display.specshow(dat,sr=sr, hop_length=int(FFT_HOP * sr) , fmin=FMIN, fmax=FMAX, y_axis='mel', x_axis='s',cmap='inferno')\n", 238 | " buf = io.BytesIO()\n", 239 | " plt.savefig(buf, format='png')\n", 240 | " byte_im = buf.getvalue()\n", 241 | " image_data[df.iloc[i][ID_COL]] = byte_im\n", 242 | " plt.close()\n", 243 | "\n", 244 | " # Store data (serialize)\n", 245 | " with open(os.path.join(os.path.sep,DATA,'image_data.pkl'), 'wb') as handle:\n", 246 | " pickle.dump(image_data, handle, protocol=pickle.HIGHEST_PROTOCOL)" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "## 3. Save dataframe" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "Save the dataframe to make sure it contains the correct ID column for access to the image_data."
261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 8, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "df.to_pickle(os.path.join(os.path.sep, DATA, DF_NAME))" 270 | ] 271 | } 272 | ], 273 | "metadata": { 274 | "kernelspec": { 275 | "display_name": "Python 3", 276 | "language": "python", 277 | "name": "python3" 278 | }, 279 | "language_info": { 280 | "codemirror_mode": { 281 | "name": "ipython", 282 | "version": 3 283 | }, 284 | "file_extension": ".py", 285 | "mimetype": "text/x-python", 286 | "name": "python", 287 | "nbconvert_exporter": "python", 288 | "pygments_lexer": "ipython3", 289 | "version": "3.7.10" 290 | } 291 | }, 292 | "nbformat": 4, 293 | "nbformat_minor": 4 294 | } 295 | -------------------------------------------------------------------------------- /notebooks/02a_generate_UMAP_basic.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Step 2, alternative a: Generate UMAP representations from spectrograms - Basic pipeline" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Introduction" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "This script creates UMAP representations from spectrograms using the basic pipeline.\n", 22 | "\n", 23 | "#### The following structure and files are required in the project directory:\n", 24 | "\n", 25 | " ├── data\n", 26 | " │ ├── df.pkl <- pickled pandas dataframe with metadata and spectrograms (generated in\n", 27 | " | 01_generate_spectrograms.ipynb)\n", 28 | " ├── parameters \n", 29 | " ├── functions <- the folder with the function files provided in the repo \n", 30 | " ├── notebooks <- the folder with the notebook files provided in the repo \n", 31 | " ├── ... 
\n", 32 | " \n", 33 | "\n", 34 | "#### The following columns must exist (somewhere) in the pickled dataframe df.pkl:\n", 35 | "\n", 36 | " | spectrograms | ....\n", 37 | " ------------------------------------------\n", 38 | " | 2D np.array | ....\n", 39 | " | ... | ....\n", 40 | " | ... | .... \n", 41 | " \n", 42 | "\n", 43 | "#### The following files are generated in this script:\n", 44 | "\n", 45 | " ├── data\n", 46 | " │ ├── df_umap.pkl <- pickled pandas dataframe with metadata, spectrograms AND UMAP coordinates" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Import statements, constants and functions" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 1, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "import pandas as pd\n", 63 | "import numpy as np\n", 64 | "import pickle\n", 65 | "import os\n", 66 | "from pathlib import Path\n", 67 | "import umap\n", 68 | "import sys \n", 69 | "sys.path.insert(0, '..')\n", 70 | "\n", 71 | "from functions.preprocessing_functions import calc_zscore, pad_spectro\n", 72 | "from functions.custom_dist_functions_umap import unpack_specs" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 2, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "P_DIR = str(Path(os.getcwd()).parents[0]) # project directory\n", 82 | "DATA = os.path.join(os.path.sep, P_DIR, 'data') # path to data subfolder in project directory\n", 83 | "DF_NAME = 'df.pkl' # name of pickled dataframe with metadata and spectrograms" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "Specify UMAP parameters. If desired, other inputs can be used for UMAP, such as denoised spectrograms, bandpass filtered spectrograms or other (MFCC, specs on frequency scale...) by changining the INPUT_COL parameter." 
91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 3, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "INPUT_COL = 'spectrograms' # column that is used for UMAP\n", 100 | " # could also choose 'denoised_spectrograms' or 'stretched_spectrograms' etc etc...\n", 101 | " \n", 102 | "METRIC_TYPE = 'euclidean' # distance metric used in UMAP. Check UMAP documentation for other options\n", 103 | " # e.g. 'euclidean', 'correlation', 'cosine', 'manhattan' ...\n", 104 | " \n", 105 | "N_COMP = 3 # number of dimensions desired in latent space " 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "## 1. Load data" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 4, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "df = pd.read_pickle(os.path.join(os.path.sep, DATA, DF_NAME))" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "## 2. UMAP\n", 129 | "### 2.1. Prepare UMAP input" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "In this step, the spectrograms are z-transformed, zero-padded and concatenated to obtain numeric vectors." 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 5, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "# Basic pipeline\n", 146 | "# No time-shift allowed, spectrograms should be aligned at the start. 
All spectrograms are zero-padded \n", 147 | "# to equal length\n", 148 | " \n", 149 | "specs = df[INPUT_COL] # choose spectrogram column\n", 150 | "specs = [calc_zscore(s) for s in specs] # z-transform each spectrogram\n", 151 | "\n", 152 | "maxlen = np.max([spec.shape[1] for spec in specs]) # find maximal length in dataset\n", 153 | "flattened_specs = [pad_spectro(spec, maxlen).flatten() for spec in specs] # pad all specs to maxlen, then row-wise concatenate (flatten)\n", 154 | "data = np.asarray(flattened_specs) # data is the final input data for UMAP" 155 | ] 156 | }, 157 | { 158 | "cell_type": "markdown", 159 | "metadata": {}, 160 | "source": [ 161 | "### 2.2. Specify UMAP parameters" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 6, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "reducer = umap.UMAP(n_components=N_COMP, metric = METRIC_TYPE, # specify parameters of UMAP reducer\n", 171 | " min_dist = 0, random_state=2204) " 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "### 2.3. Fit UMAP" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 7, 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "embedding = reducer.fit_transform(data) # embedding contains the new coordinates of datapoints in 3D space" 188 | ] 189 | }, 190 | { 191 | "cell_type": "markdown", 192 | "metadata": {}, 193 | "source": [ 194 | "## 3. 
Save dataframe" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 8, 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "# Add UMAP coordinates to dataframe\n", 204 | "for i in range(N_COMP):\n", 205 | " df['UMAP'+str(i+1)] = embedding[:,i]\n", 206 | "\n", 207 | "# Save dataframe\n", 208 | "df.to_pickle(os.path.join(os.path.sep, DATA, 'df_umap.pkl'))" 209 | ] 210 | } 211 | ], 212 | "metadata": { 213 | "kernelspec": { 214 | "display_name": "Python 3", 215 | "language": "python", 216 | "name": "python3" 217 | }, 218 | "language_info": { 219 | "codemirror_mode": { 220 | "name": "ipython", 221 | "version": 3 222 | }, 223 | "file_extension": ".py", 224 | "mimetype": "text/x-python", 225 | "name": "python", 226 | "nbconvert_exporter": "python", 227 | "pygments_lexer": "ipython3", 228 | "version": "3.7.10" 229 | } 230 | }, 231 | "nbformat": 4, 232 | "nbformat_minor": 4 233 | } 234 | -------------------------------------------------------------------------------- /notebooks/02b_generate_UMAP_timeshift.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Step 2, alternative b: Generate UMAP representations from spectrograms - custom distance (time-shift)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Introduction" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "This script creates UMAP representations from spectrograms, while allowing for some time-shift of spectrograms. 
This increases computation time, but is well suited for calls that are not well aligned at the start.\n", 22 | "\n", 23 | "#### The following structure and files are required in the project directory:\n", 24 | "\n", 25 | " ├── data\n", 26 | " │ ├── df.pkl <- pickled pandas dataframe with metadata and spectrograms (generated in\n", 27 | " | 01_generate_spectrograms.ipynb)\n", 28 | " ├── parameters \n", 29 | " ├── functions <- the folder with the function files provided in the repo \n", 30 | " ├── notebooks <- the folder with the notebook files provided in the repo \n", 31 | " ├── ... \n", 32 | " \n", 33 | "\n", 34 | "#### The following columns must exist (somewhere) in the pickled dataframe df.pkl:\n", 35 | "\n", 36 | " | spectrograms | ....\n", 37 | " ------------------------------------------\n", 38 | " | 2D np.array | ....\n", 39 | " | ... | ....\n", 40 | " | ... | .... \n", 41 | " \n", 42 | "#### The following files are generated in this script:\n", 43 | "\n", 44 | " ├── data\n", 45 | " │ ├── df_umap.pkl <- pickled pandas dataframe with metadata, spectrograms AND UMAP coordinates" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## Import statements, constants and functions" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 1, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "import pandas as pd\n", 62 | "import numpy as np\n", 63 | "import pickle\n", 64 | "import os\n", 65 | "from pathlib import Path\n", 66 | "import umap\n", 67 | "import sys \n", 68 | "sys.path.insert(0, '..')\n", 69 | "\n", 70 | "from functions.preprocessing_functions import calc_zscore, pad_spectro\n", 71 | "from functions.custom_dist_functions_umap import unpack_specs" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 2, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "P_DIR = str(Path(os.getcwd()).parents[0]) # project directory\n", 81 | "DATA = os.path.join(os.path.sep, 
P_DIR, 'data') " 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "Specify UMAP parameters. If desired, other inputs can be used for UMAP, such as denoised spectrograms, bandpass-filtered spectrograms or other representations (e.g. MFCCs, spectrograms on a linear frequency scale), by changing the INPUT_COL parameter." 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 7, 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "INPUT_COL = 'spectrograms' # column that is used for UMAP\n", 98 | " # could also choose 'denoised_spectrograms' or 'stretched_spectrograms' etc etc...\n", 99 | "\n", 100 | "MIN_OVERLAP = 0.9 # time shift constraint\n", 101 | " # MIN_OVERLAP*100 % of the shorter spectrogram must overlap with the longer spectrogram\n", 102 | " # when finding the position with the least error during the time-shifting\n", 103 | "\n", 104 | "METRIC_TYPE = 'euclidean' # distance metric used in UMAP.\n", 105 | " # If performing time-shift, only 'euclidean', 'correlation', 'cosine' and 'manhattan' \n", 106 | " # are available\n", 107 | " \n", 108 | "N_COMP = 3 # number of dimensions desired in latent space " 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "## 1. Load data" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 4, 121 | "metadata": {}, 122 | "outputs": [], 123 | "source": [ 124 | "df = pd.read_pickle(os.path.join(os.path.sep, DATA, 'df.pkl'))" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "## 2. UMAP" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "In this step, the spectrograms are z-transformed, zero-padded and concatenated to obtain numeric vectors." 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "### 2.1. 
Load custom distance function with time-shift" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": null, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "# Pipeline allowing for time-shift of spectrograms. When assessing distance between spectrograms,\n", 155 | "# the shorter spectrogram is shifted along the longer one to find the position of minimum-error overlap.\n", 156 | "# The shorter is then zero-padded to the length of the longer one and distance is calculated using the \n", 157 | "# chosen METRIC_TYPE distance (euclidean, manhattan, cosine, correlation)\n", 158 | "# This also means that the dimensionality of the spectrogram vectors can be different for each pairwise \n", 159 | "# comparison. Hence, we need some sort of normalization to the dimensionality, otherwise metrics like \n", 160 | "# euclidean or manhattan will automatically be larger for high-dimensional spectrogram vectors (i.e. calls\n", 161 | "# with long duration). Therefore, euclidean and manhattan are normalized to the size of the spectrogram.\n", 162 | " \n", 163 | "from functions.preprocessing_functions import pad_transform_spectro\n", 164 | "import numba\n", 165 | "\n", 166 | "if METRIC_TYPE=='euclidean':\n", 167 | " @numba.njit()\n", 168 | " def spec_dist(a,b,size):\n", 169 | " dist = np.sqrt((np.sum(np.subtract(a,b)*np.subtract(a,b)))) / np.sqrt(size)\n", 170 | " return dist\n", 171 | "elif METRIC_TYPE=='manhattan':\n", 172 | " @numba.njit()\n", 173 | " def spec_dist(a,b,size):\n", 174 | " dist = (np.sum(np.abs(np.subtract(a,b)))) / size\n", 175 | " return dist\n", 176 | "elif METRIC_TYPE=='cosine':\n", 177 | " @numba.njit()\n", 178 | " def spec_dist(a,b,size):\n", 179 | " # turn into unit vectors by dividing each vector field by magnitude of vector\n", 180 | " dot_product = np.sum(a*b)\n", 181 | " a_magnitude = np.sqrt(np.sum(a*a))\n", 182 | " b_magnitude = np.sqrt(np.sum(b*b))\n", 183 | " dist = 1 - dot_product/(a_magnitude*b_magnitude)\n", 184 | 
" return dist\n", 185 | "\n", 186 | "elif METRIC_TYPE=='correlation':\n", 187 | " @numba.njit()\n", 188 | " def spec_dist(a,b,size):\n", 189 | " a_meandiff = a - np.mean(a)\n", 190 | " b_meandiff = b - np.mean(b)\n", 191 | " dot_product = np.sum(a_meandiff*b_meandiff)\n", 192 | " a_meandiff_magnitude = np.sqrt(np.sum(a_meandiff*a_meandiff))\n", 193 | " b_meandiff_magnitude = np.sqrt(np.sum(b_meandiff*b_meandiff))\n", 194 | " dist = 1 - dot_product/(a_meandiff_magnitude * b_meandiff_magnitude)\n", 195 | " return dist\n", 196 | "else:\n", 197 | " print('Metric type ', METRIC_TYPE, ' not compatible with option TIME_SHIFT = True')\n", 198 | " raise\n", 199 | " \n", 200 | "@numba.njit()\n", 201 | "def calc_timeshift_pad(a,b):\n", 202 | " spec_s, spec_l = unpack_specs(a,b)\n", 203 | "\n", 204 | " len_s = spec_s.shape[1]\n", 205 | " len_l = spec_l.shape[1]\n", 206 | "\n", 207 | " nfreq = spec_s.shape[0] \n", 208 | "\n", 209 | " # define start position\n", 210 | " min_overlap_frames = int(MIN_OVERLAP * len_s)\n", 211 | " start_timeline = min_overlap_frames-len_s\n", 212 | " max_timeline = len_l - min_overlap_frames\n", 213 | " n_of_calculations = int((((max_timeline+1-start_timeline)+(max_timeline+1-start_timeline))/2) +1)\n", 214 | " distances = np.full((n_of_calculations),999.)\n", 215 | " count=0\n", 216 | " \n", 217 | " for timeline_p in range(start_timeline, max_timeline+1,2):\n", 218 | " # mismatch on left side\n", 219 | " if timeline_p < 0:\n", 220 | " len_overlap = len_s - abs(timeline_p)\n", 221 | " pad_s = np.full((nfreq, (len_l-len_overlap)),0.)\n", 222 | " pad_l = np.full((nfreq, (len_s-len_overlap)),0.)\n", 223 | " s_config = np.append(spec_s, pad_s, axis=1).astype(np.float64)\n", 224 | " l_config = np.append(pad_l, spec_l, axis=1).astype(np.float64)\n", 225 | " \n", 226 | " # mismatch on right side\n", 227 | " elif timeline_p > (len_l-len_s):\n", 228 | " len_overlap = len_l - timeline_p\n", 229 | " pad_s = np.full((nfreq, (len_l-len_overlap)),0.)\n", 230 | " 
pad_l = np.full((nfreq, (len_s-len_overlap)),0.)\n", 231 | " s_config = np.append(pad_s, spec_s, axis=1).astype(np.float64)\n", 232 | " l_config = np.append(spec_l, pad_l, axis=1).astype(np.float64)\n", 233 | " \n", 234 | " else:\n", 235 | " len_overlap = len_s\n", 236 | " start_col_l = timeline_p\n", 237 | " end_col_l = start_col_l + len_overlap\n", 238 | " pad_s_left = np.full((nfreq, start_col_l),0.)\n", 239 | " pad_s_right = np.full((nfreq, (len_l - end_col_l)),0.)\n", 240 | " l_config = spec_l.astype(np.float64)\n", 241 | " s_config = np.append(pad_s_left, spec_s, axis=1).astype(np.float64)\n", 242 | " s_config = np.append(s_config, pad_s_right, axis=1).astype(np.float64)\n", 243 | " \n", 244 | " size = s_config.shape[0]*s_config.shape[1]\n", 245 | " distances[count] = spec_dist(s_config, l_config, size)\n", 246 | " count = count + 1\n", 247 | " \n", 248 | " min_dist = np.min(distances)\n", 249 | " return min_dist" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": {}, 255 | "source": [ 256 | "### 2.1. Prepare UMAP input" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": null, 262 | "metadata": {}, 263 | "outputs": [], 264 | "source": [ 265 | "specs = df[INPUT_COL]\n", 266 | "specs = [calc_zscore(s) for s in specs] # z-transform\n", 267 | " \n", 268 | "n_bins = specs[0].shape[0]\n", 269 | "maxlen = np.max([spec.shape[1] for spec in specs]) * n_bins + 2\n", 270 | "trans_specs = [pad_transform_spectro(spec, maxlen) for spec in specs]\n", 271 | "data = np.asarray(trans_specs)" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "### 2.2. 
Specify UMAP parameters" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": {}, 285 | "outputs": [], 286 | "source": [ 287 | "reducer = umap.UMAP(n_components=N_COMP, metric = calc_timeshift_pad, min_dist = 0, random_state=2204) " 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "### 2.3. Fit UMAP" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 23, 300 | "metadata": {}, 301 | "outputs": [ 302 | { 303 | "name": "stderr", 304 | "output_type": "stream", 305 | "text": [ 306 | "/home/mthomas/anaconda3/envs/umap_tut_env/lib/python3.7/site-packages/umap/umap_.py:1728: UserWarning: custom distance metric does not return gradient; inverse_transform will be unavailable. To enable using inverse_transform method method, define a distance function that returns a tuple of (distance [float], gradient [np.array])\n", 307 | " \"custom distance metric does not return gradient; inverse_transform will be unavailable. \"\n" 308 | ] 309 | } 310 | ], 311 | "source": [ 312 | "embedding = reducer.fit_transform(data) # embedding contains the new coordinates of datapoints in 3D space" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "## 3. 
Save dataframe" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 25, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [ 328 | "# Add UMAP coordinates to dataframe\n", 329 | "for i in range(N_COMP):\n", 330 | " df['UMAP'+str(i+1)] = embedding[:,i]\n", 331 | "\n", 332 | "# Save dataframe\n", 333 | "df.to_pickle(os.path.join(os.path.sep, DATA, 'df_umap.pkl'))" 334 | ] 335 | } 336 | ], 337 | "metadata": { 338 | "kernelspec": { 339 | "display_name": "Python 3", 340 | "language": "python", 341 | "name": "python3" 342 | }, 343 | "language_info": { 344 | "codemirror_mode": { 345 | "name": "ipython", 346 | "version": 3 347 | }, 348 | "file_extension": ".py", 349 | "mimetype": "text/x-python", 350 | "name": "python", 351 | "nbconvert_exporter": "python", 352 | "pygments_lexer": "ipython3", 353 | "version": "3.7.10" 354 | } 355 | }, 356 | "nbformat": 4, 357 | "nbformat_minor": 4 358 | } 359 | -------------------------------------------------------------------------------- /notebooks/03_UMAP_viz_part_1_prep.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Interactive visualization of UMAP representations: Part 1 (Prep)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "This script creates a spectrogram image for each call and saves all images in a pickled dictionary in the data subfolder (image_data.pkl). These images will be displayed later in the interactive visualization tool; generating them beforehand makes the tool faster, as images don't need to be created on-the-fly, but can be accessed through the dictionary. \n", 15 | "\n", 16 | "The default dictionary key is the filename without datatype specifier (e.g. 
without .wav), but if the dataframe contains a column 'callID', this is used as keys.\n", 17 | "\n", 18 | "#### The following minimal structure and files are required in the project directory:\n", 19 | "\n", 20 | " ├── data\n", 21 | " │ ├── df_umap.pkl <- pickled pandas dataframe with metadata, raw_audio, spectrograms and UMAP coordinates\n", 22 | " | (generated in 02a_generate_UMAP_basic.ipynb or 02b_generate_UMAP_timeshift.ipynb)\n", 23 | " ├── parameters \n", 24 | " │ ├── spec_params.py <- python file containing the spectrogram parameters used (generated in \n", 25 | " | 01_generate_spectrograms.ipynb) \n", 26 | " ├── functions <- the folder with the function files provided in the repo \n", 27 | " ├── notebooks <- the folder with the notebook files provided in the repo \n", 28 | " ├── ... \n", 29 | "\n", 30 | "\n", 31 | "#### The following columns must exist (somewhere) in the pickled dataframe df.pkl:\n", 32 | "(callID is optional)\n", 33 | "\n", 34 | " | filename | spectrograms | samplerate_hz | [optional: callID]\n", 35 | " --------------------------------------------------------------------\n", 36 | " | call_1.wav | 2D np.array | 8000 | [call_1]\n", 37 | " | call_2.wav | ... | 48000 | [call_2] \n", 38 | " | ... | ... | .... | .... 
\n", 39 | "\n", 40 | "#### The following files are generated in this script:\n", 41 | "\n", 42 | " ├── data\n", 43 | " │ ├── df_umap.pkl <- is overwritten with updated version of df_umap.pkl (with ID column) \n", 44 | " │ ├── image_data.pkl <- pickled dictionary with spectrogram images as values, ID column as keys" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## Import statements, constants and functions" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 1, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "import pandas as pd\n", 61 | "import numpy as np\n", 62 | "import pickle\n", 63 | "import matplotlib.pyplot as plt\n", 64 | "import os\n", 65 | "from pathlib import Path\n", 66 | "import soundfile as sf\n", 67 | "import io\n", 68 | "import librosa\n", 69 | "import librosa.display\n", 70 | "import umap\n", 71 | "\n", 72 | "import sys \n", 73 | "sys.path.insert(0, '..')" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "P_DIR = str(Path(os.getcwd()).parents[0]) \n", 83 | "DATA = os.path.join(os.path.sep, P_DIR, 'data') \n", 84 | "DF_NAME = 'df_umap.pkl'\n", 85 | "\n", 86 | "SPEC_COL = 'spectrograms' # column name that contains the spectrograms\n", 87 | "ID_COL = 'callID' # column name that contains call identifier (must be unique)\n", 88 | "\n", 89 | "\n", 90 | "OVERWRITE = False # If there already exists an image_data.pkl, should it be overwritten? 
Default no" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 4, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "# Spectrogramming parameters (needed for generating the images)\n", 100 | "\n", 101 | "from parameters.spec_params import FFT_WIN, FFT_HOP, FMIN, FMAX\n", 102 | "\n", 103 | "# Make sure the spectrogramming parameters are correct!\n", 104 | "# They are used to set the correct time and frequency axis labels for the spectrogram images. \n", 105 | "\n", 106 | "# If you are using bandpass-filtered spectrograms...\n", 107 | "if 'filtered' in SPEC_COL:\n", 108 | " # ...FMIN is set to LOWCUT, FMAX to HIGHCUT and N_MELS to N_MELS_FILTERED\n", 109 | " from parameters.spec_params import LOWCUT, HIGHCUT, N_MELS_FILTERED\n", 110 | " \n", 111 | " FMIN = LOWCUT\n", 112 | " FMAX = HIGHCUT\n", 113 | " N_MELS = N_MELS_FILTERED" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": {}, 119 | "source": [ 120 | "## 1. Read in files" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 5, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "df = pd.read_pickle(os.path.join(os.path.sep, DATA, DF_NAME))" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "### 1.1. 
Check if call identifier column is present" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 6, 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "name": "stdout", 146 | "output_type": "stream", 147 | "text": [ 148 | "No ID-Column found ( callID )\n", 149 | "Default ID column callID will be generated from filename.\n" 150 | ] 151 | } 152 | ], 153 | "source": [ 154 | "# Default callID will be the name of the wav file\n", 155 | "\n", 156 | "if ID_COL not in df.columns:\n", 157 | " print('No ID-Column found (', ID_COL, ')')\n", 158 | " \n", 159 | " if 'filename' in df.columns:\n", 160 | " print(\"Default ID column \", ID_COL, \"will be generated from filename.\")\n", 161 | " df[ID_COL] = [x.split(\".\")[0] for x in df['filename']]\n", 162 | " else:\n", 163 | " raise KeyError('Found neither ' + ID_COL + ' nor filename column to generate IDs from')" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "## 2. Generate spectrogram images" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "metadata": {}, 176 | "source": [ 177 | "A spectrogram image is generated from each row in the dataframe. Images are saved in a dictionary (keys are the ID_COL of the dataframe).\n", 178 | "\n", 179 | "The dictionary is pickled and saved as image_data.pkl. It will later be loaded in the interactive visualization script and these images will be displayed in the visualization." 
180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 7, 185 | "metadata": {}, 186 | "outputs": [ 187 | { 188 | "name": "stdout", 189 | "output_type": "stream", 190 | "text": [ 191 | "\r", 192 | "Processing i: 0 / 6428\r", 193 | "Processing i: 1 / 6428\r", 194 | "Processing i: 2 / 6428\r", 195 | "Processing i: 3 / 6428" 196 | ] 197 | }, 198 | { 199 | "name": "stderr", 200 | "output_type": "stream", 201 | "text": [ 202 | "/home/mthomas/anaconda3/envs/umap_tut_env/lib/python3.7/site-packages/librosa/display.py:974: MatplotlibDeprecationWarning: The 'basey' parameter of __init__() has been renamed 'base' since Matplotlib 3.3; support for the old name will be dropped two minor releases later.\n", 203 | " scaler(mode, **kwargs)\n", 204 | "/home/mthomas/anaconda3/envs/umap_tut_env/lib/python3.7/site-packages/librosa/display.py:974: MatplotlibDeprecationWarning: The 'linthreshy' parameter of __init__() has been renamed 'linthresh' since Matplotlib 3.3; support for the old name will be dropped two minor releases later.\n", 205 | " scaler(mode, **kwargs)\n" 206 | ] 207 | }, 208 | { 209 | "name": "stdout", 210 | "output_type": "stream", 211 | "text": [ 212 | "Processing i: 6427 / 6428 / 6428" 213 | ] 214 | } 215 | ], 216 | "source": [ 217 | "if OVERWRITE==False and os.path.isfile(os.path.join(os.path.sep,DATA,'image_data.pkl')):\n", 218 | " print(\"File already exists. 
Overwrite is set to FALSE, so no new image_data will be generated.\")\n", 219 | " \n", 220 | " # Double-check if image_data contains all the required calls\n", 221 | " with open(os.path.join(os.path.sep, DATA, 'image_data.pkl'), 'rb') as handle:\n", 222 | " image_data = pickle.load(handle) \n", 223 | " image_keys = list(image_data.keys())\n", 224 | " expected_keys = list(df[ID_COL])\n", 225 | " missing = list(set(expected_keys)-set(image_keys))\n", 226 | " \n", 227 | " if len(missing)>0:\n", 228 | " print(\"BUT: The current image_data.pkl file doesn't seem to contain all calls that are in your dataframe!\")\n", 229 | " \n", 230 | "else:\n", 231 | " image_data = {}\n", 232 | " for i,dat in enumerate(df.spectrograms):\n", 233 | " print('\\rProcessing i:',i,'/',df.shape[0], end='')\n", 234 | " dat = np.asarray(df.iloc[i][SPEC_COL]) \n", 235 | " sr = df.iloc[i]['samplerate_hz']\n", 236 | " plt.figure()\n", 237 | " librosa.display.specshow(dat,sr=sr, hop_length=int(FFT_HOP * sr) , fmin=FMIN, fmax=FMAX, y_axis='mel', x_axis='s',cmap='inferno')\n", 238 | " buf = io.BytesIO()\n", 239 | " plt.savefig(buf, format='png')\n", 240 | " byte_im = buf.getvalue()\n", 241 | " image_data[df.iloc[i][ID_COL]] = byte_im\n", 242 | " plt.close()\n", 243 | "\n", 244 | " # Store data (serialize)\n", 245 | " with open(os.path.join(os.path.sep,DATA,'image_data.pkl'), 'wb') as handle:\n", 246 | " pickle.dump(image_data, handle, protocol=pickle.HIGHEST_PROTOCOL)" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "## 3. Save dataframe" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "Save the dataframe to make sure it contains the correct ID column for access to the image_data." 
261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 8, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "df.to_pickle(os.path.join(os.path.sep, DATA, DF_NAME))" 270 | ] 271 | } 272 | ], 273 | "metadata": { 274 | "kernelspec": { 275 | "display_name": "Python 3", 276 | "language": "python", 277 | "name": "python3" 278 | }, 279 | "language_info": { 280 | "codemirror_mode": { 281 | "name": "ipython", 282 | "version": 3 283 | }, 284 | "file_extension": ".py", 285 | "mimetype": "text/x-python", 286 | "name": "python", 287 | "nbconvert_exporter": "python", 288 | "pygments_lexer": "ipython3", 289 | "version": "3.7.10" 290 | } 291 | }, 292 | "nbformat": 4, 293 | "nbformat_minor": 4 294 | } 295 | -------------------------------------------------------------------------------- /parameters/__pycache__/spec_params.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/marathomas/tutorial_repo/5123e19118f51c81e3c933b19fe264a0e7744798/parameters/__pycache__/spec_params.cpython-37.pyc -------------------------------------------------------------------------------- /parameters/spec_params.py: -------------------------------------------------------------------------------- 1 | N_MELS = 40 2 | FFT_WIN = 0.03 3 | FFT_HOP = 0.00375 4 | WINDOW = "hann" 5 | FMIN = 0 6 | FMAX = 4000 7 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import find_packages, setup 2 | 3 | setup( 4 | name='umap_tutorial', 5 | packages=find_packages(), 6 | version='0.1.0', 7 | description='Tutorial for generating latent-space representations from vocalizations using UMAP', 8 | author='Mara Thomas', 9 | license='MIT', 10 | ) 11 | --------------------------------------------------------------------------------
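As a self-contained reference for the preprocessing shared by the notebooks above, here is a minimal sketch in plain numpy of the basic pipeline from 02a (z-score, zero-pad to the longest call, flatten) together with the size-normalized euclidean distance used in 02b. The helpers `calc_zscore` and `pad_spectro` below are simplified stand-ins for the repo's versions in functions/preprocessing_functions.py, assumed to behave as the notebook comments describe, not copied from the repo.

```python
import numpy as np

def calc_zscore(spec):
    # z-transform the whole spectrogram to zero mean and unit variance
    return (spec - spec.mean()) / spec.std()

def pad_spectro(spec, maxlen):
    # zero-pad the time axis (columns) on the right up to maxlen frames
    nfreq, ntime = spec.shape
    out = np.zeros((nfreq, maxlen))
    out[:, :ntime] = spec
    return out

def spec_dist_euclidean(a, b):
    # euclidean distance normalized by vector size, as in the time-shift
    # notebook, so longer calls do not automatically get larger distances
    return np.sqrt(np.sum((a - b) ** 2)) / np.sqrt(a.size)

# two toy "spectrograms" of different lengths (40 mel bins, 50 vs. 80 frames)
rng = np.random.default_rng(0)
specs = [rng.random((40, 50)), rng.random((40, 80))]
specs = [calc_zscore(s) for s in specs]

maxlen = max(s.shape[1] for s in specs)  # longest call in the dataset: 80 frames
data = np.asarray([pad_spectro(s, maxlen).flatten() for s in specs])
print(data.shape)  # (2, 3200): two calls, 40 * 80 values each
```

With umap-learn installed, `umap.UMAP(n_components=3, min_dist=0, random_state=2204).fit_transform(data)` would then produce the 3-D embedding, as in section 2 of the notebooks.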