├── .github └── workflows │ └── semgrep.yml ├── README.md ├── dusha ├── README.md ├── data_processing │ ├── README.md │ ├── __init__.py │ ├── processing.py │ ├── requirements.txt │ └── utils │ │ ├── __init__.py │ │ ├── aggregation.py │ │ ├── calculate_features.py │ │ ├── datacls.py │ │ └── dawidskene.py ├── demo │ ├── model │ │ ├── podcasts_finetune_old_w_lr_1e-3_try1 │ │ ├── train.config │ │ └── train.jsonl │ └── recognise_emotions_online.ipynb └── experiments │ ├── .gitignore │ ├── Dockerfile │ ├── README.md │ ├── __init__.py │ ├── configs │ ├── base.config │ ├── crowd_large.config │ ├── crowd_medium.config │ ├── crowd_small.config │ ├── crowd_unconfident.config │ ├── data.config │ ├── dusha_large.config │ ├── podcast_large.config │ └── podcast_tune.config │ ├── core │ ├── __init__.py │ ├── dataset.py │ ├── learner.py │ ├── metrics.py │ ├── model.py │ └── utils.py │ ├── inf.py │ ├── requirements.txt │ └── train.py ├── golos ├── README.md ├── README_ru.md ├── examples │ ├── Dockerfile │ ├── build_and_run_docker.sh │ ├── data │ │ ├── 001ce26c07c20eaa0d666b824c6c6924.wav │ │ └── example1.json │ ├── infer.py │ └── readme.md └── md5sum.txt └── license ├── en_us.pdf └── ru.pdf /.github/workflows/semgrep.yml: -------------------------------------------------------------------------------- 1 | # Name of this GitHub Actions workflow. 2 | name: Semgrep 3 | 4 | on: 5 | # Scan changed files in PRs (diff-aware scanning): 6 | pull_request: {} 7 | # Scan on-demand through GitHub Actions interface: 8 | workflow_dispatch: {} 9 | # Scan mainline branches and report all findings: 10 | push: 11 | branches: ["master", "main"] 12 | # Schedule the CI job (this method uses cron syntax): 13 | schedule: 14 | - cron: '20 17 * * *' # Sets Semgrep to scan every day at 17:20 UTC. 15 | # It is recommended to change the schedule to a random time. 16 | 17 | permissions: 18 | actions: read 19 | contents: read 20 | security-events: write 21 | 22 | jobs: 23 | semgrep: 24 | # User-definable name of this GitHub Actions job: 25 | name: semgrep/ci 26 | # If you are self-hosting, change the following `runs-on` value: 27 | runs-on: ubuntu-latest 28 | 29 | container: 30 | # A Docker image with Semgrep installed. Do not change this. 31 | image: returntocorp/semgrep 32 | 33 | # Skip any PR created by dependabot to avoid permission issues: 34 | if: (github.actor != 'dependabot[bot]') 35 | 36 | steps: 37 | # Fetch project source with GitHub Actions Checkout. 38 | - uses: actions/checkout@v3 39 | # Run the "semgrep ci" command on the command line of the docker image. 40 | - run: semgrep ci --sarif --output=semgrep.sarif --exclude-rule yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha.third-party-action-not-pinned-to-commit-sha 41 | env: 42 | # Add the rules that Semgrep uses by setting the SEMGREP_RULES environment variable. 43 | SEMGREP_RULES: p/default # more at semgrep.dev/explore 44 | 45 | - name: Upload SARIF file for GitHub Advanced Security Dashboard 46 | uses: github/codeql-action/upload-sarif@v2 47 | with: 48 | sarif_file: semgrep.sarif 49 | if: always() 50 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Open and Free Datasets and Models by Salute Developers 2 | 3 | 4 | ## [Golos dataset](https://github.com/salute-developers/golos/tree/master/golos#golos-dataset) 5 | Russian corpus suitable for speech research. 
6 | 7 | ## [Dusha dataset](https://github.com/salute-developers/golos/tree/master/dusha#dusha-dataset) 8 | Bi-modal corpus suitable for speech emotion recognition tasks. 9 | 10 | 11 | ## **License** 12 | 13 | [English Version](https://github.com/salute-developers/golos/blob/master/license/en_us.pdf) 14 | 15 | [Russian Version](https://github.com/salute-developers/golos/blob/master/license/ru.pdf) -------------------------------------------------------------------------------- /dusha/README.md: -------------------------------------------------------------------------------- 1 | # Dusha dataset 2 | 3 | Dusha is a bi-modal corpus suitable for speech emotion recognition (SER) tasks. The dataset consists of about 300 000 audio recordings of Russian speech, their transcripts, and emotional labels. The corpus contains approximately 350 hours of data. Four basic emotions that typically appear in a dialog with a virtual assistant were selected: Happiness (Positive), Sadness, Anger, and Neutral. 4 | 5 | # Table of contents 6 | 7 | - [Dataset Structure](https://github.com/salute-developers/golos/tree/master/dusha#dataset-structure) 8 | - [Downloads](https://github.com/salute-developers/golos/tree/master/dusha#downloads) 9 | - [Raw Data Processing](https://github.com/salute-developers/golos/tree/master/dusha/data_processing) 10 | - [Run and Evaluate Experiments](https://github.com/salute-developers/golos/tree/master/dusha/experiments) 11 | - [Resources](https://github.com/salute-developers/golos/tree/master/dusha#resources) 12 | - [License](https://github.com/salute-developers/golos/tree/master/dusha#license) 13 | - [Contacts](https://github.com/salute-developers/golos/tree/master/dusha#contacts) 14 | 15 | 16 | ## **Dataset structure** 17 | 18 | | Domain | Number of Files | Duration (Hr.) | Unique Speakers | 19 | |----------------|------------|--------|---------| 20 | | Crowd | 201 850 | 255.7 | 2068 | 21 | | Podcast | 102 113 | 90.9 | 6240 | 22 | | Total | 303 963 | 346.6 | 8308 | 23 | 24 | ## **Downloads** 25 | 26 | Due to licensing restrictions, we cannot provide the podcast audio files. Instead, we provide the full set of precalculated features and a list of links to the podcasts with timings, so you can listen to the original sound (see also issue https://github.com/salute-developers/golos/issues/1). 27 | 28 | | Archives | Size | Links | 29 | |-------------------|------------|--------------------------------------------------------------------------------------------------| 30 | | crowd.tar | 28 GB | [crowd.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/dusha/crowd.tar) | 31 | | podcast.tar | 360 MB | [podcast.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/dusha/podcast.tar) | 32 | | features.tar | 30 GB | [features.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/dusha/features.tar) | 33 | | paper_setups.tgz | 16 MB | [paper_setups.tgz](https://cdn.chatwm.opensmodel.sberdevices.ru/dusha/paper_setups.tgz) | 34 | 35 | 36 | With this repository you can perform [raw data processing](https://github.com/salute-developers/golos/tree/master/dusha/data_processing) or [run and evaluate experiments](https://github.com/salute-developers/golos/tree/master/dusha/experiments).
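For a quick sanity check after downloading and extracting the archives, the snippet below shows how the precalculated features and experiment manifests can be inspected. It is a minimal sketch: the paths are placeholders, and the layout shown is the one produced by `data_processing/processing.py` with the default aggregation threshold (`.npy` log-mel tensors in `features/`, `.jsonl` manifests in `processed_dataset_090/`).

```
import json
from pathlib import Path

import numpy as np

base = Path("DATASET_PATH")  # placeholder: wherever the archives were extracted

# Each feature file is a log-mel spectrogram saved as a NumPy array of shape (1, 64, time).
feature_path = next((base / "features").glob("*.npy"))
print(np.load(feature_path).shape)

# Experiment manifests are jsonl: one record per utterance with the relative path to its
# feature tensor and an integer emotion label (0=angry, 1=sad, 2=neutral, 3=positive).
with open(base / "processed_dataset_090" / "train" / "train.jsonl", encoding="utf-8") as fh:
    print(json.loads(fh.readline()))
```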
37 | 38 | ## **Resources** 39 | 40 | [[arxiv.org] Large Raw Emotional Dataset with Aggregation Mechanism](https://arxiv.org/abs/2212.12266) 41 | 42 | ## **License** 43 | 44 | [English Version](https://github.com/salute-developers/golos/blob/master/license/en_us.pdf) 45 | 46 | [Russian Version](https://github.com/salute-developers/golos/blob/master/license/ru.pdf) 47 | 48 | ## **Contacts** 49 | 50 | Please create a GitHub issue! 51 | 52 | Authors (in alphabetic order): 53 | - Artem Sokolov 54 | - Fedor Minkin 55 | - Nikita Savushkin 56 | - Nikolay Karpov 57 | - Oleg Kutuzov 58 | - Vladimir Kondratenko 59 | -------------------------------------------------------------------------------- /dusha/data_processing/README.md: -------------------------------------------------------------------------------- 1 | ## Raw data processing 2 | 3 | To process data download a raw Dusha dataset (crowd.tar, podcast.tar), untar it to DATASET_PATH, and run the processing script: 4 | 5 | python processing.py -dataset_path DATASET_PATH 6 | 7 | It processes sound files and creates a folder in DATASET_PATH with precalculated features, aggregates labels, and creates manifest file in jsonl format. 8 | 9 | 10 | If you want to change the threshold for aggregation run the processing with -threshold flag: 11 | 12 | python processing.py -dataset_path DATASET_PATH -threshold THRESHOLD 13 | 14 | You can also use tsv format for manifest file: 15 | 16 | python processing.py -dataset_path DATASET_PATH -tsv 17 | 18 | Force recalculate features: 19 | 20 | python processing.py -dataset_path DATASET_PATH -rf 21 | -------------------------------------------------------------------------------- /dusha/data_processing/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/dusha/data_processing/__init__.py -------------------------------------------------------------------------------- /dusha/data_processing/processing.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | 3 | import click 4 | import numpy as np 5 | from utils.aggregation import aggregate_data, read_data_markup 6 | from utils.calculate_features import load_features 7 | 8 | 9 | @click.command() 10 | @click.option( 11 | "-dataset_path", 12 | "--dataset_path", 13 | required=True, 14 | type=click.Path(exists=True), 15 | help="dataset_path", 16 | ) 17 | @click.option( 18 | "--use_tsv", "-tsv", is_flag=True, default=False, help="use tsv to read/write" 19 | ) 20 | @click.option( 21 | "--recalculate_features", 22 | "-rf", 23 | is_flag=True, 24 | default=False, 25 | help="recalculate features", 26 | ) 27 | @click.option( 28 | "--threshold", 29 | "-threshold", 30 | default=0.9, 31 | help="Dawidskene threshold", 32 | show_default=True, 33 | ) 34 | def processing( 35 | dataset_path: str, use_tsv: bool, recalculate_features: bool, threshold: float 36 | ) -> None: 37 | """ 38 | processing raw data for training 39 | """ 40 | if threshold > 1 or threshold < 0: 41 | raise AttributeError 42 | 43 | np.seterr(divide="ignore") 44 | 45 | public_data = Path(dataset_path) 46 | result_dir = public_data / f"processed_dataset_0{int(threshold*100)}" 47 | 48 | path_names = ["train", "aggregated_dataset", "test"] 49 | for path_name in path_names: 50 | (result_dir / path_name).mkdir(parents=True, exist_ok=True) 51 | 52 | (public_data / "features").mkdir(parents=True, exist_ok=True) 53 | 54 | 
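    # For each data split below, read its raw markup (jsonl or tsv), collect the wav ids,
    # and compute (or reuse previously cached) 64-bin log-mel features under <dataset_path>/features.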
data_types = ["crowd_train", "crowd_test", "podcast_train", "podcast_test"] 55 | for data_type in data_types: 56 | wavs_path = public_data / data_type / "wavs" 57 | data = read_data_markup( 58 | dataset_path=public_data / data_type / ("raw_" + data_type), 59 | use_tsv=use_tsv, 60 | ) 61 | wavs_names = {Path(row.audio_path).stem for row in data} 62 | load_features( 63 | wavs_path=wavs_path, 64 | wavs_names=wavs_names, 65 | result_dir=public_data, 66 | dataset_name=data_type, 67 | recalculate_feature=recalculate_features, 68 | ) 69 | 70 | aggregate_data(public_data, result_dir, use_tsv, threshold) 71 | 72 | 73 | if __name__ == "__main__": 74 | processing() # pylint: disable=no-value-for-parameter 75 | -------------------------------------------------------------------------------- /dusha/data_processing/requirements.txt: -------------------------------------------------------------------------------- 1 | pandas==1.3.5 2 | crowd-kit==1.0.0 3 | click==8.0.4 4 | tqdm==4.62.3 5 | numpy==1.21.5 6 | librosa==0.8.1 -------------------------------------------------------------------------------- /dusha/data_processing/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/dusha/data_processing/utils/__init__.py -------------------------------------------------------------------------------- /dusha/data_processing/utils/aggregation.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | from enum import Enum 4 | from pathlib import Path 5 | from typing import Dict, List 6 | 7 | from utils.datacls import ( 8 | AggDataclass, 9 | DataForExp, 10 | DawidSkeneEntryDataclass, 11 | MarkupDataclass, 12 | ) 13 | from utils.dawidskene import get_dawidskene_pred 14 | 15 | HEADER = "\t".join( 16 | [ 17 | "hash_id", 18 | "wav_path", 19 | "duration", 20 | "emotion", 21 | "golden_emo", 22 | "speaker_text", 23 | "speaker_emo", 24 | "source_id", 25 | ] 26 | ) 27 | 28 | HEADER_EXP = "\t".join(["id", "tensor", "wav_lengh", "label"]) 29 | 30 | 31 | class Emotion(Enum): 32 | ANGRY = 0 33 | SAD = 1 34 | NEUTRAL = 2 35 | POSITIVE = 3 36 | 37 | 38 | def read_data_markup(dataset_path: Path, use_tsv: bool) -> List[MarkupDataclass]: 39 | markup_data = [] 40 | if use_tsv: 41 | with open( 42 | dataset_path.parent / (dataset_path.stem + ".tsv"), "r", encoding="utf-8" 43 | ) as file: 44 | headers = file.readline().rstrip("\r\n").split("\t") 45 | for line in file: 46 | line_data = line.strip("\r\n").split("\t") 47 | string = dict(zip(headers, line_data)) 48 | row = MarkupDataclass(**string) 49 | markup_data.append(row) 50 | else: 51 | with open( 52 | dataset_path.parent / (dataset_path.stem + ".jsonl"), "r", encoding="utf-8" 53 | ) as file: 54 | for line in file: 55 | row = MarkupDataclass(**json.loads(line)) 56 | markup_data.append(row) 57 | return markup_data 58 | 59 | 60 | def agg_data_to_file( 61 | file_path: Path, agg_data: List[AggDataclass], use_tsv: bool 62 | ) -> None: 63 | if use_tsv: 64 | with open( 65 | file_path.parent / (file_path.stem + ".tsv"), "w", encoding="utf-8" 66 | ) as file: 67 | print(HEADER, file=file, end=os.linesep) 68 | for row in agg_data: 69 | print("\t".join(row.__dict__.values()), file=file, end=os.linesep) 70 | else: 71 | with open( 72 | file_path.parent / (file_path.stem + ".jsonl"), "w", encoding="utf-8" 73 | ) as file: 74 | for row in agg_data: 75 | line = json.dumps(row.__dict__, 
ensure_ascii=False) 76 | print(line, file=file, end=os.linesep) 77 | 78 | 79 | def exp_data_to_file( 80 | file_path: Path, exp_data: List[DataForExp], use_tsv: bool 81 | ) -> None: 82 | if use_tsv: 83 | with open( 84 | file_path.parent / (file_path.stem + ".tsv"), "w", encoding="utf-8" 85 | ) as file: 86 | print(HEADER_EXP, file=file, end=os.linesep) 87 | for row in exp_data: 88 | line = "\t".join(list(map(str, row.__dict__.values()))) 89 | print(line, file=file, end=os.linesep) 90 | else: 91 | with open( 92 | file_path.parent / (file_path.stem + ".jsonl"), "w", encoding="utf-8" 93 | ) as file: 94 | for row in exp_data: 95 | line = json.dumps(row.__dict__, ensure_ascii=False) 96 | print(line, file=file, end=os.linesep) 97 | 98 | 99 | def filter_data( 100 | markup_data: List[MarkupDataclass], 101 | aggregated_data_dict: Dict[str, str], 102 | dataset: str, 103 | ) -> List[AggDataclass]: 104 | agg_data = [] 105 | used_wavs = set() 106 | for row in markup_data: 107 | if row.hash_id in used_wavs: 108 | continue 109 | if row.hash_id in aggregated_data_dict: 110 | good_agg_row = AggDataclass( 111 | hash_id=row.hash_id, 112 | audio_path=str(Path("..", "..", dataset, row.audio_path)), 113 | duration=row.duration, 114 | emotion=aggregated_data_dict[row.hash_id], 115 | golden_emo=row.golden_emo, 116 | speaker_text=row.speaker_text, 117 | speaker_emo=row.speaker_emo, 118 | source_id=row.source_id, 119 | ) 120 | agg_data.append(good_agg_row) 121 | used_wavs.add(row.hash_id) 122 | return agg_data 123 | 124 | 125 | def make_exp_data(agg_data: List[AggDataclass]) -> List[DataForExp]: 126 | exp_data = [] 127 | for row in agg_data: 128 | if ( 129 | not isinstance(row.golden_emo, str) or row.golden_emo == "" 130 | ) and row.emotion != "other": 131 | exp_row = DataForExp( 132 | id=row.hash_id, 133 | tensor=str(Path("..", "..", "features", row.hash_id + ".npy")), 134 | wav_length=row.duration, 135 | label=Emotion[row.emotion.upper()].value, 136 | emotion=row.emotion, 137 | ) 138 | exp_data.append(exp_row) 139 | return exp_data 140 | 141 | 142 | def aggregate_data( 143 | data_path: Path, out_path: Path, use_tsv: bool, dawidskene_threshold: float 144 | ) -> None: 145 | 146 | markup_data = ["podcast_test", "podcast_train", "crowd_train", "crowd_test"] 147 | data = {} 148 | all_data = [] 149 | for dataset in markup_data: 150 | data[dataset] = read_data_markup( 151 | dataset_path=Path(data_path, dataset, "raw_" + dataset), 152 | use_tsv=use_tsv, 153 | ) 154 | all_data += data[dataset] 155 | 156 | data_for_agg = [] 157 | for row in all_data: 158 | row_for_agg = DawidSkeneEntryDataclass( 159 | task=row.hash_id, 160 | worker=row.annotator_id, 161 | label=row.annotator_emo, 162 | ) 163 | data_for_agg.append(row_for_agg) 164 | 165 | aggregated_data = get_dawidskene_pred( 166 | data=data_for_agg, 167 | threshold=dawidskene_threshold, 168 | meta_path=data_path / "meta.tsv", 169 | ) 170 | 171 | aggregated_data_dict = {row.task: row.pred for row in aggregated_data} 172 | 173 | exp_data = {} 174 | for dataset in markup_data: 175 | agg_data = filter_data( 176 | markup_data=data[dataset], 177 | aggregated_data_dict=aggregated_data_dict, 178 | dataset=dataset, 179 | ) 180 | exp_data[dataset] = make_exp_data(agg_data=agg_data) 181 | exp_data_to_file( 182 | file_path=out_path / dataset.rsplit("_", maxsplit=1)[-1] / dataset, 183 | exp_data=exp_data[dataset], 184 | use_tsv=use_tsv, 185 | ) 186 | agg_data_to_file( 187 | file_path=out_path / "aggregated_dataset" / dataset, 188 | agg_data=agg_data, 189 | use_tsv=use_tsv, 190 | ) 191 | 
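    # Finally, merge the per-domain experiment manifests into combined train and test manifests.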
exp_data_to_file( 192 | file_path=out_path / "train" / "train", 193 | exp_data=exp_data["podcast_train"] + exp_data["crowd_train"], 194 | use_tsv=use_tsv, 195 | ) 196 | exp_data_to_file( 197 | file_path=Path(out_path / "test" / "test"), 198 | exp_data=exp_data["podcast_test"] + exp_data["crowd_test"], 199 | use_tsv=use_tsv, 200 | ) 201 | -------------------------------------------------------------------------------- /dusha/data_processing/utils/calculate_features.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | from typing import List, Set 3 | 4 | import librosa 5 | import numpy as np 6 | from tqdm import tqdm 7 | from utils.datacls import DataWithFeaturesEntryclass 8 | 9 | 10 | def create_features( 11 | data: List[DataWithFeaturesEntryclass], 12 | wavs_names: Set[str], 13 | features_dump_path: Path, 14 | dataset_name: str, 15 | recalculate_feature: bool, 16 | hop_length_coef: float = 0.01, 17 | win_length_coef: float = 0.02, 18 | sample_rate: int = 16000, 19 | n_mels: int = 64, 20 | ) -> None: 21 | """ 22 | As an input all models use standard speech features: 23 | 64 Mel-filterbank calculated from 20ms windows with a 10ms overlap. 24 | """ 25 | if recalculate_feature: 26 | if len(data) != len(wavs_names): 27 | print( 28 | f"{len(wavs_names) - len(data)} wav files are missing for {dataset_name}" 29 | ) 30 | hop_length = int(sample_rate * hop_length_coef) 31 | win_length = int(sample_rate * win_length_coef) 32 | for row in tqdm(data): 33 | data, rate = librosa.load(row.wav_path, sr=sample_rate) 34 | if len(data) != 0: 35 | spec = librosa.feature.melspectrogram( 36 | y=data, 37 | sr=rate, 38 | hop_length=hop_length, 39 | n_fft=win_length, 40 | n_mels=n_mels, 41 | ) 42 | else: 43 | raise AttributeError 44 | mel_spec = librosa.power_to_db(spec, ref=np.max) 45 | np.save(features_dump_path / f"{row.wav_id}.npy", mel_spec[None]) 46 | print( 47 | f"({len(data)}/{len(wavs_names)}) features have been calculated for {dataset_name}" 48 | ) 49 | else: 50 | ready_features = {elm.stem for elm in features_dump_path.glob("*.npy")} 51 | wav_to_features = {wav for wav in wavs_names if not wav in ready_features} 52 | data_to_culc = [wav for wav in data if wav.wav_id in wav_to_features] 53 | 54 | if len(data_to_culc) != len(wav_to_features): 55 | print( 56 | f"{len(wav_to_features) - len(data_to_culc)} wav files are missing for {dataset_name}" 57 | ) 58 | 59 | if not data_to_culc: 60 | print( 61 | f"All({len({wav for wav in wavs_names if wav in ready_features})}/{len(wavs_names)}) features have been calculated for {dataset_name}" 62 | ) 63 | return 64 | 65 | hop_length = int(sample_rate * hop_length_coef) 66 | win_length = int(sample_rate * win_length_coef) 67 | for row in tqdm(data_to_culc): 68 | data, rate = librosa.load(row.wav_path, sr=sample_rate) 69 | if len(data) != 0: 70 | spec = librosa.feature.melspectrogram( 71 | y=data, 72 | sr=rate, 73 | hop_length=hop_length, 74 | n_fft=win_length, 75 | n_mels=n_mels, 76 | ) 77 | else: 78 | raise AttributeError 79 | mel_spec = librosa.power_to_db(spec, ref=np.max) 80 | np.save(features_dump_path / f"{row.wav_id}.npy", mel_spec[None]) 81 | 82 | 83 | def load_features( 84 | wavs_path: Path, 85 | wavs_names: Set[str], 86 | result_dir: Path, 87 | dataset_name: str, 88 | recalculate_feature: bool, 89 | ) -> None: 90 | wavs = [] 91 | for elm in wavs_path.glob("*.wav"): 92 | wavs.append(DataWithFeaturesEntryclass(wav_path=str(elm), wav_id=elm.stem)) 93 | create_features( 94 | data=wavs, 95 | 
wavs_names=wavs_names, 96 | features_dump_path=result_dir / "features", 97 | dataset_name=dataset_name, 98 | recalculate_feature=recalculate_feature, 99 | ) 100 | -------------------------------------------------------------------------------- /dusha/data_processing/utils/datacls.py: -------------------------------------------------------------------------------- 1 | from dataclasses import dataclass 2 | from typing import Any 3 | 4 | # ----------------------------------------------------------------------------- 5 | # AGGREGATION 6 | # ----------------------------------------------------------------------------- 7 | 8 | 9 | @dataclass 10 | class DawidSkeneEntryDataclass: 11 | task: str 12 | worker: str 13 | label: Any 14 | 15 | 16 | @dataclass 17 | class DawidSkeneResultDataclass: 18 | task: str 19 | pred: str 20 | 21 | 22 | @dataclass 23 | class MarkupDataclass: 24 | hash_id: str 25 | audio_path: str 26 | duration: str 27 | annotator_emo: str 28 | golden_emo: str 29 | speaker_text: str 30 | speaker_emo: str 31 | source_id: str 32 | audio_path: str 33 | annotator_emo: str 34 | annotator_id: str 35 | 36 | 37 | @dataclass 38 | class AggDataclass: 39 | hash_id: str 40 | audio_path: str 41 | duration: str 42 | emotion: str 43 | golden_emo: str 44 | speaker_text: str 45 | speaker_emo: str 46 | source_id: str 47 | 48 | 49 | # ----------------------------------------------------------------------------- 50 | # FEATURES 51 | # ----------------------------------------------------------------------------- 52 | 53 | 54 | @dataclass 55 | class DataWithFeaturesEntryclass: 56 | wav_path: str 57 | wav_id: str 58 | 59 | 60 | # ----------------------------------------------------------------------------- 61 | # EXP 62 | # ----------------------------------------------------------------------------- 63 | 64 | 65 | @dataclass 66 | class DataForExp: 67 | id: str 68 | tensor: str 69 | wav_length: str 70 | label: int 71 | emotion: str 72 | -------------------------------------------------------------------------------- /dusha/data_processing/utils/dawidskene.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | from typing import List 3 | 4 | import pandas as pd 5 | from crowdkit.aggregation import DawidSkene as CrowdKitDawidSkene 6 | from utils.datacls import DawidSkeneEntryDataclass, DawidSkeneResultDataclass 7 | 8 | 9 | def get_dawidskene_pred( 10 | data: List[DawidSkeneEntryDataclass], 11 | threshold: float, 12 | meta_path: Path, 13 | n_iter: int = 100, 14 | ) -> List[DawidSkeneResultDataclass]: 15 | labels = {row.label for row in data} 16 | assert "task" not in labels, 'Labels cant contains the name "task"!' 
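    # Fit Dawid-Skene over the (task, worker, label) triples, dump the per-class probabilities
    # to meta_path, and keep a task only if its most likely class reaches the confidence threshold.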
17 | aggregated_labels = CrowdKitDawidSkene(n_iter=n_iter).fit_predict_proba( 18 | pd.DataFrame(data) 19 | ) 20 | aggregated_labels.to_csv(meta_path, sep="\t") 21 | 22 | aggregated_labels_list = aggregated_labels.reset_index().to_dict("records") 23 | aggregated_data = [] 24 | for row in aggregated_labels_list: 25 | tmp_dict = {val: key for key, val in row.items() if key in labels} 26 | max_item_proba = max(tmp_dict) 27 | if max_item_proba >= threshold: 28 | key_with_max_value = tmp_dict[max_item_proba] 29 | aggregated_row = DawidSkeneResultDataclass( 30 | task=row["task"], 31 | pred=key_with_max_value, 32 | ) 33 | aggregated_data.append(aggregated_row) 34 | return aggregated_data 35 | -------------------------------------------------------------------------------- /dusha/demo/model/podcasts_finetune_old_w_lr_1e-3_try1: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/dusha/demo/model/podcasts_finetune_old_w_lr_1e-3_try1 -------------------------------------------------------------------------------- /dusha/demo/model/train.config: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | from core.dataset import MelEmotionsDataset, get_augm_func, adaptive_padding_collate_fn, LengthWeightedSampler 3 | from core.model import ConvSelfAttentionMobileNet 4 | from core.utils import load_jsonl_as_df 5 | from torch.utils.data import DataLoader 6 | 7 | 8 | base_path = Path('/raid/okutuzov/dusha_data_new_2/processed_dataset_0.9') 9 | train_manifest_path = base_path / 'train' / 'podcast_train.jsonl' 10 | val_manifest_path = base_path / 'test' / 'podcast_test.jsonl' 11 | 12 | pt_model_path = Path('/raid/kondrat/dusha_experiments_try2/agg_0.9/crowd_lr_1e-3_try1/crowd_lr_1e-3_try1') 13 | 14 | batch_size = 64 15 | epoch_count = 100 16 | learning_rate = 1e-3 17 | optimizer_step = 5 18 | optimizer_gamma = 1 19 | weight_decay = 1e-6 20 | clip_grad = False 21 | 22 | collate_fn = adaptive_padding_collate_fn 23 | augm_func = get_augm_func(time_mask_param=40, freq_mask_param=16, crop_augm_max_cut_size=40) 24 | 25 | MAX_LENGTH = 16 26 | 27 | def get_train_weights(_df): 28 | train_weights = 1 + 9 * (_df.label.values == 0) + 19 * (_df.label.values == 1) + 4 * (_df.label.values == 3) 29 | # train_weights = 1 + 29 * (_df.label.values == 0) + 49 * (_df.label.values == 1) + 9 * (_df.label.values == 3) 30 | return train_weights 31 | 32 | 33 | model_setting = [ 34 | # t, c, n, s 35 | [1, 16, 1, 1], 36 | [2, 32, 2, 2], 37 | [2, 64, 6, 2], 38 | [2, 128, 6, 2], 39 | ] 40 | 41 | model = ConvSelfAttentionMobileNet(model_setting, 42 | n_classes=4, 43 | last_channel=128) 44 | 45 | 46 | def get_train_dataset(_df, ds_base_path): 47 | return MelEmotionsDataset(_df, 48 | get_weights_func=get_train_weights, 49 | augm_transform=augm_func, 50 | base_path=ds_base_path) 51 | 52 | 53 | 54 | def get_val_dataset(_df, ds_base_path): 55 | return MelEmotionsDataset(_df, base_path=ds_base_path) 56 | 57 | 58 | 59 | def get_train_dataloader(train_ds): 60 | return DataLoader(train_ds, batch_size=batch_size, num_workers=1, 61 | collate_fn=collate_fn, 62 | sampler=LengthWeightedSampler(df=train_ds.df, 63 | batch_size=batch_size, 64 | min_length=0.3, 65 | max_length=MAX_LENGTH, 66 | length_delta=0.3, 67 | decimals=1)) 68 | 69 | 70 | 71 | def get_val_dataloader(val_ds): 72 | return DataLoader(val_ds, batch_size=1, num_workers=4, shuffle=False) 73 | 74 | 75 | 
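# Build the datasets and dataloaders from the jsonl manifests declared above.
# Note: base_path and pt_model_path point to the authors' training machines and must be adapted locally.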
train_dataset = get_train_dataset(load_jsonl_as_df(train_manifest_path), 76 | ds_base_path=train_manifest_path.parent) 77 | val_dataset = get_val_dataset(load_jsonl_as_df(val_manifest_path), 78 | ds_base_path=val_manifest_path.parent) 79 | 80 | dataloaders = {'train': get_train_dataloader(train_ds=train_dataset), 81 | 'validate': get_val_dataloader(val_ds=val_dataset)} 82 | 83 | DUMP_BEST_CHECKPOINTS = True 84 | DUMP_LAST_CHECKPOINTS = True 85 | BEST_CHECKPOINTS_WARMUP = 5 86 | -------------------------------------------------------------------------------- /dusha/demo/recognise_emotions_online.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "18e88c88", 7 | "metadata": {}, 8 | "outputs": [ 9 | { 10 | "name": "stderr", 11 | "output_type": "stream", 12 | "text": [ 13 | "scikit-learn version 1.3.0 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.\n", 14 | "Torch version 2.0.1 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.\n" 15 | ] 16 | } 17 | ], 18 | "source": [ 19 | "import torch\n", 20 | "import onnx\n", 21 | "import toml\n", 22 | "import librosa\n", 23 | "import lazycon\n", 24 | "import time\n", 25 | "import os, sys\n", 26 | "\n", 27 | "import coremltools as ct\n", 28 | "import numpy as np\n", 29 | "import soundfile as sf\n", 30 | "import sounddevice as sd\n", 31 | "\n", 32 | "sys.path.append('../experiments')\n", 33 | "import core" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 2, 39 | "id": "c00c3b46", 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "def create_features(\n", 44 | " data: np.array,\n", 45 | " hop_length_coef: float = 0.01,\n", 46 | " win_length_coef: float = 0.02,\n", 47 | " sample_rate: int = 16000,\n", 48 | " n_mels: int = 64,\n", 49 | ") -> np.array:\n", 50 | " \"\"\"\n", 51 | " As an input all models use standard speech features:\n", 52 | " 64 Mel-filterbank calculated from 20ms windows with a 10ms overlap.\n", 53 | " \"\"\"\n", 54 | "\n", 55 | " hop_length = int(sample_rate * hop_length_coef)\n", 56 | " win_length = int(sample_rate * win_length_coef)\n", 57 | " if len(data) != 0:\n", 58 | " spec = librosa.feature.melspectrogram(\n", 59 | " y=data,\n", 60 | " sr=sample_rate,\n", 61 | " hop_length=hop_length,\n", 62 | " n_fft=win_length,\n", 63 | " n_mels=n_mels,\n", 64 | " )\n", 65 | " else:\n", 66 | " raise AttributeError\n", 67 | " mel_spec = librosa.power_to_db(spec, ref=np.max)\n", 68 | "\n", 69 | " return mel_spec" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 3, 75 | "id": "9587b9e7-36e4-4523-9604-49ddf1bf0931", 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "def create_features_for_audio(\n", 80 | " wav_name: str,\n", 81 | " hop_length_coef: float = 0.01,\n", 82 | " win_length_coef: float = 0.02,\n", 83 | " sample_rate: int = 16000,\n", 84 | " n_mels: int = 64,\n", 85 | ") -> np.array:\n", 86 | " \"\"\"\n", 87 | " As an input all models use standard speech features:\n", 88 | " 64 Mel-filterbank calculated from 20ms windows with a 10ms overlap.\n", 89 | " \"\"\"\n", 90 | "\n", 91 | " hop_length = int(sample_rate * hop_length_coef)\n", 92 | " win_length = int(sample_rate * win_length_coef)\n", 93 | " data, rate = librosa.load(wav_name, sr=sample_rate)\n", 94 | " print(data)\n", 95 | " if len(data) 
!= 0:\n", 96 | " spec = librosa.feature.melspectrogram(\n", 97 | " y=data,\n", 98 | " sr=rate,\n", 99 | " hop_length=hop_length,\n", 100 | " n_fft=win_length,\n", 101 | " n_mels=n_mels,\n", 102 | " )\n", 103 | " else:\n", 104 | " raise AttributeError\n", 105 | " mel_spec = librosa.power_to_db(spec, ref=np.max)\n", 106 | " return mel_spec" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 4, 112 | "id": "26a9de85", 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "def index2name(\n", 117 | " index: int\n", 118 | ") -> str:\n", 119 | " class_dict = {0:\"angry\", 1:\"sad\", 2:\"neutral\", 3:\"positive\"}\n", 120 | " \n", 121 | " if index > len(class_dict) or index < 0:\n", 122 | " raise AttributeError\n", 123 | " \n", 124 | " return class_dict[index]" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 5, 130 | "id": "db6f2fda", 131 | "metadata": {}, 132 | "outputs": [], 133 | "source": [ 134 | "dir_path = './model/'\n", 135 | "model_name = 'podcasts_finetune_old_w_lr_1e-3_try1'\n", 136 | "device = 'cpu'" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 19, 142 | "id": "9b615974", 143 | "metadata": {}, 144 | "outputs": [ 145 | { 146 | "name": "stderr", 147 | "output_type": "stream", 148 | "text": [ 149 | "/Users/artsokol/venv/lib/python3.9/site-packages/torchvision/models/mobilenetv2.py:23: FutureWarning: The ConvBNReLU/ConvBNActivation classes are deprecated since 0.12 and will be removed in 0.14. Use torchvision.ops.misc.Conv2dNormActivation instead.\n", 150 | " warnings.warn(\n" 151 | ] 152 | }, 153 | { 154 | "data": { 155 | "text/plain": [ 156 | "ConvSelfAttentionMobileNet(\n", 157 | " (features): Sequential(\n", 158 | " (0): _DeprecatedConvBNAct(\n", 159 | " (0): Conv2d(1, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)\n", 160 | " (1): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 161 | " (2): ReLU6(inplace=True)\n", 162 | " )\n", 163 | " (1): InvertedResidual(\n", 164 | " (conv): Sequential(\n", 165 | " (0): Conv2dNormActivation(\n", 166 | " (0): Conv2d(4, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4, bias=False)\n", 167 | " (1): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 168 | " (2): ReLU6(inplace=True)\n", 169 | " )\n", 170 | " (1): Conv2d(4, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 171 | " (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 172 | " )\n", 173 | " )\n", 174 | " (2): InvertedResidual(\n", 175 | " (conv): Sequential(\n", 176 | " (0): Conv2dNormActivation(\n", 177 | " (0): Conv2d(16, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 178 | " (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 179 | " (2): ReLU6(inplace=True)\n", 180 | " )\n", 181 | " (1): Conv2dNormActivation(\n", 182 | " (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=32, bias=False)\n", 183 | " (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 184 | " (2): ReLU6(inplace=True)\n", 185 | " )\n", 186 | " (2): Conv2d(32, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 187 | " (3): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 188 | " )\n", 189 | " )\n", 190 | " (3): InvertedResidual(\n", 191 | " (conv): Sequential(\n", 192 | " (0): Conv2dNormActivation(\n", 193 | " (0): Conv2d(32, 64, 
kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 194 | " (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 195 | " (2): ReLU6(inplace=True)\n", 196 | " )\n", 197 | " (1): Conv2dNormActivation(\n", 198 | " (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=64, bias=False)\n", 199 | " (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 200 | " (2): ReLU6(inplace=True)\n", 201 | " )\n", 202 | " (2): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 203 | " (3): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 204 | " )\n", 205 | " )\n", 206 | " (4): InvertedResidual(\n", 207 | " (conv): Sequential(\n", 208 | " (0): Conv2dNormActivation(\n", 209 | " (0): Conv2d(32, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 210 | " (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 211 | " (2): ReLU6(inplace=True)\n", 212 | " )\n", 213 | " (1): Conv2dNormActivation(\n", 214 | " (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=64, bias=False)\n", 215 | " (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 216 | " (2): ReLU6(inplace=True)\n", 217 | " )\n", 218 | " (2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 219 | " (3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 220 | " )\n", 221 | " )\n", 222 | " (5): InvertedResidual(\n", 223 | " (conv): Sequential(\n", 224 | " (0): Conv2dNormActivation(\n", 225 | " (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 226 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 227 | " (2): ReLU6(inplace=True)\n", 228 | " )\n", 229 | " (1): Conv2dNormActivation(\n", 230 | " (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=128, bias=False)\n", 231 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 232 | " (2): ReLU6(inplace=True)\n", 233 | " )\n", 234 | " (2): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 235 | " (3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 236 | " )\n", 237 | " )\n", 238 | " (6): InvertedResidual(\n", 239 | " (conv): Sequential(\n", 240 | " (0): Conv2dNormActivation(\n", 241 | " (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 242 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 243 | " (2): ReLU6(inplace=True)\n", 244 | " )\n", 245 | " (1): Conv2dNormActivation(\n", 246 | " (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=128, bias=False)\n", 247 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 248 | " (2): ReLU6(inplace=True)\n", 249 | " )\n", 250 | " (2): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 251 | " (3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 252 | " )\n", 253 | " )\n", 254 | " (7): InvertedResidual(\n", 255 | " (conv): Sequential(\n", 256 | " (0): Conv2dNormActivation(\n", 257 | " (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 258 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 259 | " (2): ReLU6(inplace=True)\n", 260 | " )\n", 261 | " (1): 
Conv2dNormActivation(\n", 262 | " (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=128, bias=False)\n", 263 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 264 | " (2): ReLU6(inplace=True)\n", 265 | " )\n", 266 | " (2): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 267 | " (3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 268 | " )\n", 269 | " )\n", 270 | " (8): InvertedResidual(\n", 271 | " (conv): Sequential(\n", 272 | " (0): Conv2dNormActivation(\n", 273 | " (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 274 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 275 | " (2): ReLU6(inplace=True)\n", 276 | " )\n", 277 | " (1): Conv2dNormActivation(\n", 278 | " (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=128, bias=False)\n", 279 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 280 | " (2): ReLU6(inplace=True)\n", 281 | " )\n", 282 | " (2): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 283 | " (3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 284 | " )\n", 285 | " )\n", 286 | " (9): InvertedResidual(\n", 287 | " (conv): Sequential(\n", 288 | " (0): Conv2dNormActivation(\n", 289 | " (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 290 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 291 | " (2): ReLU6(inplace=True)\n", 292 | " )\n", 293 | " (1): Conv2dNormActivation(\n", 294 | " (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=128, bias=False)\n", 295 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 296 | " (2): ReLU6(inplace=True)\n", 297 | " )\n", 298 | " (2): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 299 | " (3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 300 | " )\n", 301 | " )\n", 302 | " (10): InvertedResidual(\n", 303 | " (conv): Sequential(\n", 304 | " (0): Conv2dNormActivation(\n", 305 | " (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 306 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 307 | " (2): ReLU6(inplace=True)\n", 308 | " )\n", 309 | " (1): Conv2dNormActivation(\n", 310 | " (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=128, bias=False)\n", 311 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 312 | " (2): ReLU6(inplace=True)\n", 313 | " )\n", 314 | " (2): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 315 | " (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 316 | " )\n", 317 | " )\n", 318 | " (11): InvertedResidual(\n", 319 | " (conv): Sequential(\n", 320 | " (0): Conv2dNormActivation(\n", 321 | " (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 322 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 323 | " (2): ReLU6(inplace=True)\n", 324 | " )\n", 325 | " (1): Conv2dNormActivation(\n", 326 | " (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256, bias=False)\n", 327 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, 
affine=True, track_running_stats=True)\n", 328 | " (2): ReLU6(inplace=True)\n", 329 | " )\n", 330 | " (2): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 331 | " (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 332 | " )\n", 333 | " )\n", 334 | " (12): InvertedResidual(\n", 335 | " (conv): Sequential(\n", 336 | " (0): Conv2dNormActivation(\n", 337 | " (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 338 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 339 | " (2): ReLU6(inplace=True)\n", 340 | " )\n", 341 | " (1): Conv2dNormActivation(\n", 342 | " (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256, bias=False)\n", 343 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 344 | " (2): ReLU6(inplace=True)\n", 345 | " )\n", 346 | " (2): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 347 | " (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 348 | " )\n", 349 | " )\n", 350 | " (13): InvertedResidual(\n", 351 | " (conv): Sequential(\n", 352 | " (0): Conv2dNormActivation(\n", 353 | " (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 354 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 355 | " (2): ReLU6(inplace=True)\n", 356 | " )\n", 357 | " (1): Conv2dNormActivation(\n", 358 | " (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256, bias=False)\n", 359 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 360 | " (2): ReLU6(inplace=True)\n", 361 | " )\n", 362 | " (2): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 363 | " (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 364 | " )\n", 365 | " )\n", 366 | " (14): InvertedResidual(\n", 367 | " (conv): Sequential(\n", 368 | " (0): Conv2dNormActivation(\n", 369 | " (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 370 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 371 | " (2): ReLU6(inplace=True)\n", 372 | " )\n", 373 | " (1): Conv2dNormActivation(\n", 374 | " (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256, bias=False)\n", 375 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 376 | " (2): ReLU6(inplace=True)\n", 377 | " )\n", 378 | " (2): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 379 | " (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 380 | " )\n", 381 | " )\n", 382 | " (15): InvertedResidual(\n", 383 | " (conv): Sequential(\n", 384 | " (0): Conv2dNormActivation(\n", 385 | " (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 386 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 387 | " (2): ReLU6(inplace=True)\n", 388 | " )\n", 389 | " (1): Conv2dNormActivation(\n", 390 | " (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256, bias=False)\n", 391 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 392 | " (2): ReLU6(inplace=True)\n", 393 | " )\n", 394 | " (2): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 395 | " (3): 
BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 396 | " )\n", 397 | " )\n", 398 | " (16): _DeprecatedConvBNAct(\n", 399 | " (0): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n", 400 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n", 401 | " (2): ReLU6(inplace=True)\n", 402 | " )\n", 403 | " )\n", 404 | " (pooling): Sequential(\n", 405 | " (0): ConvSelfAttention(\n", 406 | " (query): Conv1d(128, 16, kernel_size=(1,), stride=(1,), bias=False)\n", 407 | " (key): Conv1d(128, 16, kernel_size=(1,), stride=(1,), bias=False)\n", 408 | " (value): Conv1d(128, 128, kernel_size=(1,), stride=(1,), bias=False)\n", 409 | " )\n", 410 | " (1): AdaptiveAvgPool2d(output_size=(1, 1))\n", 411 | " (2): Reshape()\n", 412 | " )\n", 413 | " (classifier): Linear(in_features=128, out_features=4, bias=True)\n", 414 | ")" 415 | ] 416 | }, 417 | "execution_count": 19, 418 | "metadata": {}, 419 | "output_type": "execute_result" 420 | } 421 | ], 422 | "source": [ 423 | "config_path = os.path.join(dir_path, \"train.config\")\n", 424 | "assert os.path.exists(config_path), f\"No train.config in {dir_path}\"\n", 425 | "\n", 426 | "model_path = os.path.join(dir_path, model_name)\n", 427 | "# check the model\n", 428 | "if not os.path.exists(model_path):\n", 429 | " print(f\"There is no saved model {model_path}. Nothing to inference\")\n", 430 | "# return None\n", 431 | "\n", 432 | "# load the model\n", 433 | "cfg = lazycon.load(config_path)\n", 434 | "model = cfg.model\n", 435 | " \n", 436 | "model.to(device)\n", 437 | "model.load_state_dict(torch.load(model_path, map_location=torch.device('cpu')))\n", 438 | "model = model.double()\n", 439 | "model.eval()" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 20, 445 | "id": "a90da87c", 446 | "metadata": {}, 447 | "outputs": [ 448 | { 449 | "data": { 450 | "text/plain": [ 451 | "> 0 Внешний микрофон, Core Audio (1 in, 0 out)\n", 452 | "< 1 Внешние наушники, Core Audio (0 in, 2 out)\n", 453 | " 2 Микрофон MacBook Pro, Core Audio (1 in, 0 out)\n", 454 | " 3 Динамики MacBook Pro, Core Audio (0 in, 2 out)" 455 | ] 456 | }, 457 | "execution_count": 20, 458 | "metadata": {}, 459 | "output_type": "execute_result" 460 | } 461 | ], 462 | "source": [ 463 | "sd.query_devices()" 464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "execution_count": 21, 469 | "id": "4a318ebf", 470 | "metadata": {}, 471 | "outputs": [ 472 | { 473 | "name": "stdout", 474 | "output_type": "stream", 475 | "text": [ 476 | "Recording Audio\n", 477 | "Audio recording complete , Play Audio\n" 478 | ] 479 | } 480 | ], 481 | "source": [ 482 | "fs=16000 \n", 483 | "duration = 5 # seconds \n", 484 | "myrecording = sd.rec(duration * fs, samplerate=fs, channels=1, dtype='float64') \n", 485 | "print (\"Recording Audio\") \n", 486 | "sd.wait() \n", 487 | "print (\"Audio recording complete , Play Audio\") \n", 488 | "sd.play(myrecording, fs) " 489 | ] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 29, 494 | "id": "4e352445", 495 | "metadata": {}, 496 | "outputs": [], 497 | "source": [ 498 | "feat = create_features(np.transpose(myrecording)[0])" 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": 24, 504 | "id": "e936bd3b", 505 | "metadata": {}, 506 | "outputs": [], 507 | "source": [ 508 | "#feat = create_features_for_audio('wavs/c9780b567a8de31862971aa5412bf834.wav')" 509 | ] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "execution_count": 30, 514 
| "id": "bcd6c49c", 515 | "metadata": {}, 516 | "outputs": [ 517 | { 518 | "name": "stdout", 519 | "output_type": "stream", 520 | "text": [ 521 | "Calculating predicts\n" 522 | ] 523 | } 524 | ], 525 | "source": [ 526 | "print(f\"Calculating predicts\")\n", 527 | "inputs = torch.from_numpy(feat).to(device).unsqueeze(0).unsqueeze(0)\n", 528 | "\n", 529 | "with torch.no_grad():\n", 530 | " probs = model(inputs)" 531 | ] 532 | }, 533 | { 534 | "cell_type": "code", 535 | "execution_count": 31, 536 | "id": "98cd7964", 537 | "metadata": {}, 538 | "outputs": [ 539 | { 540 | "data": { 541 | "text/plain": [ 542 | "'neutral'" 543 | ] 544 | }, 545 | "execution_count": 31, 546 | "metadata": {}, 547 | "output_type": "execute_result" 548 | } 549 | ], 550 | "source": [ 551 | "pred_class = np.argmax(probs.cpu().numpy(), axis=1)\n", 552 | "index2name(pred_class[0])" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": 32, 558 | "id": "723ce905", 559 | "metadata": {}, 560 | "outputs": [ 561 | { 562 | "data": { 563 | "text/plain": [ 564 | "tensor([[-2.4723, -0.7901, 1.3804, -0.7780]], dtype=torch.float64)" 565 | ] 566 | }, 567 | "execution_count": 32, 568 | "metadata": {}, 569 | "output_type": "execute_result" 570 | } 571 | ], 572 | "source": [ 573 | "probs" 574 | ] 575 | } 576 | ], 577 | "metadata": { 578 | "kernelspec": { 579 | "display_name": "Python 3 (ipykernel)", 580 | "language": "python", 581 | "name": "python3" 582 | }, 583 | "language_info": { 584 | "codemirror_mode": { 585 | "name": "ipython", 586 | "version": 3 587 | }, 588 | "file_extension": ".py", 589 | "mimetype": "text/x-python", 590 | "name": "python", 591 | "nbconvert_exporter": "python", 592 | "pygments_lexer": "ipython3", 593 | "version": "3.9.17" 594 | } 595 | }, 596 | "nbformat": 4, 597 | "nbformat_minor": 5 598 | } 599 | -------------------------------------------------------------------------------- /dusha/experiments/.gitignore: -------------------------------------------------------------------------------- 1 | **/__pycache__/ 2 | runs/ 3 | exps/ 4 | -------------------------------------------------------------------------------- /dusha/experiments/Dockerfile: -------------------------------------------------------------------------------- 1 | # Distributed on an "AS IS" BASIS, 2 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 3 | # See the License for the specific language governing permissions and 4 | # limitations under the License. 5 | 6 | FROM nvidia/cuda:10.1-base-ubuntu18.04 7 | RUN apt-get update && apt-get install -y git && \ 8 | apt install -y software-properties-common && \ 9 | add-apt-repository ppa:deadsnakes/ppa && \ 10 | apt install -y python3.8 && apt install -y python3-pip && \ 11 | pip3 install virtualenv && \ 12 | virtualenv -p python3.8 --no-download venv 13 | 14 | SHELL ["/bin/bash", "-c"] 15 | RUN source /venv/bin/activate && \ 16 | pip install torch==1.8.1+cu101 torchvision==0.9.1+cu101 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html && \ 17 | pip install scikit-learn==0.24.0 librosa==0.7.1 tqdm==4.55.1 click==8.1.2 pandas==1.1.5 tensorboard==2.4.1 crowd-kit==1.0.0 && \ 18 | apt install -y libsndfile1 && pip install numba==0.48 19 | 20 | WORKDIR /packages 21 | 22 | RUN git clone https://github.com/maxme1/lazycon.git 23 | WORKDIR lazycon 24 | RUN git checkout tags/v0.3.1 25 | 26 | RUN source /venv/bin/activate && pip install -e . 
27 | WORKDIR /workspace 28 | 29 | CMD ["bash"] 30 | -------------------------------------------------------------------------------- /dusha/experiments/README.md: -------------------------------------------------------------------------------- 1 | # Run and Evaluate Experiments 2 | 3 | Once you have processed the Dusha dataset or downloaded the precalculated features, you have a folder with features and manifest files in `DATASET_PATH` 4 | (by default `DUSHA_REPOSITORY_PATH/data/paper_setups`). 5 | 6 | You can run your own experiments or reproduce our results using the prepared Docker image or your own Python environment. 7 | 8 | ## Prepare environment 9 | 10 | Specify `DATASET_PATH` in the variable `base_path` in `./configs/data.config`: 11 | - If you want to use your own Python environment, define an **absolute** path to the processed_dataset folder 12 | - If you want to use the proposed Docker image, define `DATASET_PATH` relative to `DUSHA_REPOSITORY_PATH` (it will be mounted into the container as `/workspace`). 13 | This is already done for the default `DATASET_PATH`: `'/workspace/data/paper_setups'` 14 | 15 | ### Docker 16 | Build the docker image: 17 | 18 | ``` 19 | docker build -t dusha_image . 20 | ``` 21 | 22 | Then run it: 23 | 24 | ``` 25 | export CURRENT_DIR=$PWD 26 | docker run --gpus device=0 -it -v /$CURRENT_DIR/..:/workspace --name dusha_docker dusha_image 27 | ``` 28 | 29 | Inside the container, activate the Python environment: 30 | ``` 31 | source /venv/bin/activate && export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python && cd experiments 32 | ``` 33 | 34 | So, we are ready to start! 35 | 36 | ## Training stage 37 | We set up experiments in 7 settings; for each of them there is a corresponding config in the _configs_ folder. 38 | Run the training stage: 39 | 40 | ``` 41 | python train.py -config configs/{EXP_NAME}.config -exp_path exps/{EXP_NAME} 42 | ``` 43 | 44 | The `podcast_tune` experiment uses a pretrained `crowd_large` model for initialization, so you have to train that model first and specify the path to the trained model as `pt_model_path` in `podcast_tune.config`. 45 | 46 | We have tried to make the experiments as reproducible as possible, so the training stage can take a long time. 47 | To speed up training, you can edit `train.py` and delete the `# fix seeds for reproducibility` part of the code. 48 | 49 | 50 | ## Inference 51 | After the training stage, you have trained models in the `./exps` folder. 52 | To calculate predictions and metrics for them, run: 53 | 54 | ``` 55 | python inf.py -exps_path exps -vm {PATH_TO_TESTS_FOLDER} 56 | ``` 57 | 58 | where `{PATH_TO_TESTS_FOLDER}` is the path to the test manifests folder (`base_path / 'test'` from `./configs/data.config`). 59 | 60 | The script will also calculate pivot tables with metrics grouped by dataset and dump them in `exps/metrics/exps_{dataset_name}.csv`. 61 | 62 | See `--help` for more information.
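To score a single recording outside of `inf.py`, you can mirror the demo notebook (`dusha/demo/recognise_emotions_online.ipynb`). The sketch below is an illustration only: it assumes a finished `crowd_large` run whose weights live at `exps/crowd_large/crowd_large` (the path referenced by `podcast_tune.config`), loads the model definition from the experiment config with `lazycon` the same way the demo does, and uses a placeholder `example.wav`.

```
import lazycon
import librosa
import numpy as np
import torch

# Placeholder paths: adjust to your own experiment folder and audio file.
cfg = lazycon.load("configs/crowd_large.config")
model = cfg.model
model.load_state_dict(torch.load("exps/crowd_large/crowd_large", map_location="cpu"))
model = model.double().eval()

# Same features as in data_processing: 64 mel bins, 20 ms windows, 10 ms hop, 16 kHz audio.
wav, sr = librosa.load("example.wav", sr=16000)
spec = librosa.feature.melspectrogram(
    y=wav, sr=sr, hop_length=int(0.01 * sr), n_fft=int(0.02 * sr), n_mels=64
)
feat = librosa.power_to_db(spec, ref=np.max)

with torch.no_grad():
    logits = model(torch.from_numpy(feat)[None, None].double())

print(["angry", "sad", "neutral", "positive"][logits.argmax(dim=1).item()])
```

The class order follows the `Emotion` enum used during data processing (angry, sad, neutral, positive).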
63 | -------------------------------------------------------------------------------- /dusha/experiments/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/dusha/experiments/__init__.py -------------------------------------------------------------------------------- /dusha/experiments/configs/base.config: -------------------------------------------------------------------------------- 1 | from .data import * 2 | 3 | from core.dataset import MelEmotionsDataset, get_augm_func, adaptive_padding_collate_fn, LengthWeightedSampler 4 | from core.model import ConvSelfAttentionMobileNet 5 | from core.utils import load_jsonl_as_df 6 | from torch.utils.data import DataLoader 7 | from pathlib import Path 8 | 9 | # train data 10 | train_manifest_path = base_path / 'train' / 'crowd_train.jsonl' 11 | val_manifest_path = base_path / 'tests' / 'crowd_test.jsonl' 12 | 13 | # pretrain 14 | pt_model_path = None 15 | 16 | # exp hyperparams 17 | batch_size = 64 18 | epoch_count = 100 19 | learning_rate = 5e-4 20 | optimizer_step = 5 21 | optimizer_gamma = 1 22 | weight_decay = 1e-6 23 | clip_grad = False 24 | 25 | # augm and batch iter stuff 26 | collate_fn = adaptive_padding_collate_fn 27 | augm_func = get_augm_func(time_mask_param=40, freq_mask_param=16, crop_augm_max_cut_size=40) 28 | 29 | MAX_LENGTH = 16 30 | get_train_weights = None 31 | 32 | # model 33 | model_setting = [ 34 | # t, c, n, s 35 | [1, 16, 1, 1], 36 | [2, 32, 2, 2], 37 | [2, 64, 6, 2], 38 | [2, 128, 6, 2], 39 | ] 40 | 41 | model = ConvSelfAttentionMobileNet(model_setting, 42 | n_classes=4, 43 | last_channel=128) 44 | 45 | 46 | def get_train_dataset(_df, ds_base_path): 47 | return MelEmotionsDataset(_df, 48 | get_weights_func=get_train_weights, 49 | augm_transform=augm_func, 50 | base_path=ds_base_path) 51 | 52 | 53 | def get_val_dataset(_df, ds_base_path): 54 | return MelEmotionsDataset(_df, base_path=ds_base_path) 55 | 56 | 57 | def get_train_dataloader(train_ds): 58 | return DataLoader(train_ds, batch_size=batch_size, num_workers=1, 59 | collate_fn=collate_fn, 60 | sampler=LengthWeightedSampler(df=train_ds.df, 61 | batch_size=batch_size, 62 | min_length=0.3, 63 | max_length=MAX_LENGTH, 64 | length_delta=0.3, 65 | decimals=1)) 66 | 67 | 68 | def get_val_dataloader(val_ds): 69 | return DataLoader(val_ds, batch_size=1, num_workers=4, shuffle=False) 70 | 71 | 72 | train_dataset = get_train_dataset(load_jsonl_as_df(train_manifest_path), 73 | ds_base_path=train_manifest_path.parent) 74 | val_dataset = get_val_dataset(load_jsonl_as_df(val_manifest_path), 75 | ds_base_path=val_manifest_path.parent) 76 | 77 | dataloaders = {'train': get_train_dataloader(train_ds=train_dataset), 78 | 'validate': get_val_dataloader(val_ds=val_dataset)} 79 | 80 | DUMP_BEST_CHECKPOINTS = True 81 | DUMP_LAST_CHECKPOINTS = True 82 | BEST_CHECKPOINTS_WARMUP = 5 83 | -------------------------------------------------------------------------------- /dusha/experiments/configs/crowd_large.config: -------------------------------------------------------------------------------- 1 | from .base import * 2 | 3 | # data 4 | train_manifest_path = base_path / 'crowd_large.jsonl' 5 | val_manifest_path = base_path / 'test' / 'crowd_test.jsonl' 6 | 7 | # pretrain 8 | pt_model_path = None 9 | 10 | # exp hyperparams 11 | batch_size = 64 12 | epoch_count = 100 13 | learning_rate = 1e-3 14 | optimizer_step = 5 15 | optimizer_gamma = 1 16 | weight_decay 
= 1e-6 17 | clip_grad = False 18 | -------------------------------------------------------------------------------- /dusha/experiments/configs/crowd_medium.config: -------------------------------------------------------------------------------- 1 | from .crowd_large import * 2 | 3 | # data 4 | train_manifest_path = base_path / 'crowd_medium.jsonl' 5 | val_manifest_path = base_path / 'test' / 'crowd_test.jsonl' 6 | 7 | # pretrain 8 | pt_model_path = None 9 | 10 | # exp hyperparams 11 | learning_rate = 5e-3 12 | -------------------------------------------------------------------------------- /dusha/experiments/configs/crowd_small.config: -------------------------------------------------------------------------------- 1 | from .crowd_large import * 2 | 3 | # data 4 | train_manifest_path = base_path / 'crowd_small.jsonl' 5 | val_manifest_path = base_path / 'test' / 'crowd_test.jsonl' 6 | 7 | # pretrain 8 | pt_model_path = None 9 | 10 | # exp hyperparams 11 | learning_rate = 1e-3 12 | -------------------------------------------------------------------------------- /dusha/experiments/configs/crowd_unconfident.config: -------------------------------------------------------------------------------- 1 | from .crowd_large import * 2 | 3 | # data 4 | train_manifest_path = base_path / 'crowd_unconfident.jsonl' 5 | val_manifest_path = base_path / 'test' / 'crowd_test.jsonl' 6 | 7 | # pretrain 8 | pt_model_path = None 9 | 10 | # exp hyperparams 11 | learning_rate = 5e-3 12 | -------------------------------------------------------------------------------- /dusha/experiments/configs/data.config: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | 3 | # You should specify `DATASET_PATH` in the variable `base_path` 4 | # - If you want to use own python environment - define an ABSOLUTE path to processed_dataset folder 5 | # - If you want to use proposed docker image - define `DATASET_PATH` relative to `DUSHA_REPOSITORY_PATH` 6 | # (`DUSHA_REPOSITORY_PATH` will be mounted to the docker as `/workspace`). 
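#   For example (a hedged illustration, not part of the shipped config), a local non-docker run
#   might use an absolute location such as
#   `base_path = Path('/home/user/dusha/processed_dataset/paper_setups')`  # hypothetical path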
7 | # 8 | # It is already done for default `DATASET_PATH` = : 9 | base_path = Path('/workspace/data/paper_setups') 10 | -------------------------------------------------------------------------------- /dusha/experiments/configs/dusha_large.config: -------------------------------------------------------------------------------- 1 | from .base import * 2 | 3 | # train data 4 | train_manifest_path = base_path / 'dusha_large.jsonl' 5 | val_manifest_path = base_path / 'test' / 'dusha_test.jsonl' 6 | 7 | # pretrain 8 | pt_model_path = None 9 | 10 | # exp hyperparams 11 | batch_size = 64 12 | epoch_count = 100 13 | learning_rate = 1e-3 14 | optimizer_step = 5 15 | optimizer_gamma = 1 16 | weight_decay = 1e-6 17 | clip_grad = False 18 | -------------------------------------------------------------------------------- /dusha/experiments/configs/podcast_large.config: -------------------------------------------------------------------------------- 1 | from .base import * 2 | 3 | # train data 4 | train_manifest_path = base_path / 'podcast_large.jsonl' 5 | val_manifest_path = base_path / 'test' / 'podcast_test.jsonl' 6 | 7 | # pretrain 8 | pt_model_path = None 9 | 10 | # exp hyperparams 11 | batch_size = 64 12 | epoch_count = 100 13 | learning_rate = 5e-3 14 | optimizer_step = 5 15 | optimizer_gamma = 1 16 | weight_decay = 1e-6 17 | clip_grad = False 18 | 19 | 20 | def get_train_weights(_df): 21 | train_weights = 1 + 9 * (_df.label.values == 0)+ 19 * (_df.label.values == 1) + 4*(_df.label.values == 3) 22 | return train_weights 23 | -------------------------------------------------------------------------------- /dusha/experiments/configs/podcast_tune.config: -------------------------------------------------------------------------------- 1 | from .podcast import * 2 | 3 | # pretrain 4 | pt_model_path = Path('/workspace/experiments/exps/crowd_large/crowd_large') 5 | 6 | # exp hyperparams 7 | learning_rate = 1e-3 8 | -------------------------------------------------------------------------------- /dusha/experiments/core/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/dusha/experiments/core/__init__.py -------------------------------------------------------------------------------- /dusha/experiments/core/dataset.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | 3 | import numpy as np 4 | import torch 5 | from torch import nn 6 | from torch.distributions.categorical import Categorical 7 | from torch.utils.data import Dataset, Sampler 8 | from torchaudio import transforms as T 9 | from torchvision import transforms 10 | 11 | PATH_TO_TENSOR_COL = "tensor" 12 | 13 | 14 | def load_tensor(path): 15 | features_tensor = np.fromfile(path, dtype=np.float32) 16 | return torch.from_numpy(np.reshape(features_tensor, (-1, 64))) 17 | 18 | 19 | def pad_or_crop_to_shape(tensor, size, rand_side_pad=True): 20 | assert len(tensor.shape) == 3 21 | delta = size - tensor.shape[-1] 22 | if delta > 0: 23 | if rand_side_pad: 24 | start_padding = np.random.randint(delta) 25 | end_padding = delta - start_padding 26 | res = nn.functional.pad(tensor, pad=(start_padding, end_padding, 0, 0)) 27 | else: 28 | res = nn.functional.pad(tensor, pad=(0, delta, 0, 0)) 29 | 30 | return res 31 | else: 32 | return tensor[..., :size] 33 | 34 | 35 | def adaptive_padding_collate_fn(batch): 36 | data = [] 37 | target = [] 38 | max_size = 
max([tens.shape[-1] for (tens, label) in batch]) 39 | for (tens, label) in batch: 40 | # crop 41 | data.append(pad_or_crop_to_shape(tens, max_size, rand_side_pad=True)) 42 | target.append(label) 43 | 44 | return torch.stack(data), torch.tensor(target) 45 | 46 | 47 | def get_augm_func(time_mask_param=80, freq_mask_param=16, crop_augm_max_cut_size=0): 48 | """ 49 | Returns function for augmentation in MelEmotionsDataset (augm_transform) 50 | Returned function's input should have [bs, 1, T] shape 51 | 52 | :param time_mask_param: 53 | :param freq_mask_param: 54 | :param crop_augm_max_cut_size: if 0 - random crops are not used 55 | :return: 56 | """ 57 | 58 | t_masking = T.TimeMasking(time_mask_param=time_mask_param) 59 | f_masking = T.FrequencyMasking(freq_mask_param=freq_mask_param) 60 | 61 | if crop_augm_max_cut_size != 0: 62 | # we want random crop with random size, 63 | # so we should sample crop size for each augm_transform call 64 | def crop_f(tens): 65 | crop_delta = np.random.randint(crop_augm_max_cut_size) 66 | random_crop = transforms.RandomCrop( 67 | np.array(tens.shape)[1:] - np.array([0, crop_delta]) 68 | ) 69 | 70 | return random_crop(tens) 71 | 72 | augm_transform = transforms.Compose([f_masking, t_masking, crop_f]) 73 | else: 74 | augm_transform = transforms.Compose([f_masking, t_masking]) 75 | 76 | return augm_transform 77 | 78 | 79 | class MelEmotionsDataset(Dataset): 80 | def __init__( 81 | self, df, *_, augm_transform=None, get_weights_func=None, base_path=None, **__ 82 | ): 83 | super().__init__() 84 | df = df.copy() 85 | if "label" in df.columns: 86 | df["label"] = df["label"].apply(int) 87 | else: 88 | print('There is no column "label" in the TSV') 89 | 90 | if get_weights_func is None: 91 | df["sampling_weights"] = 1 92 | else: 93 | df["sampling_weights"] = get_weights_func(df) 94 | 95 | # sort by length 96 | if "wav_length" in df.columns: 97 | df = df.sort_values("wav_length").reset_index(drop=True) 98 | else: 99 | print('There is no column "wav_length" in the TSV') 100 | 101 | self.df = df 102 | self.augm_transform = augm_transform 103 | self.feature_col = PATH_TO_TENSOR_COL 104 | 105 | if base_path is not None: 106 | base_path = Path(base_path) 107 | self.base_path = base_path 108 | 109 | def __len__(self): 110 | return len(self.df) 111 | 112 | def __getitem__(self, idx): 113 | if torch.is_tensor(idx): 114 | idx = idx.tolist() 115 | 116 | path = self.df.iloc[idx][self.feature_col] 117 | if self.base_path is not None: 118 | path = self.base_path / path 119 | 120 | tens = torch.from_numpy(np.load(path)) 121 | label = self.df.iloc[idx]["label"] 122 | 123 | if self.augm_transform is not None: 124 | tens = self.augm_transform(tens) 125 | 126 | return tens, label 127 | 128 | 129 | class LengthWeightedSampler(Sampler[int]): 130 | def __init__( 131 | self, 132 | df, 133 | batch_size, 134 | min_length=1, 135 | max_length=20.5, 136 | length_delta=0.3, 137 | decimals=1, 138 | ): 139 | # df should be sorted ascending by wav_length 140 | # we do it in MelEmotionsDataset 141 | if "wav_length" not in df.columns: 142 | raise ValueError('There is no column "wav_length" in the TSV') 143 | 144 | super().__init__(df) 145 | self.df = df 146 | self.batch_size = batch_size 147 | self.num_samples = (len(df) // batch_size) * batch_size 148 | 149 | all_lengths = np.round(df["wav_length"].values, decimals) 150 | _max = max(all_lengths) 151 | _min = min(all_lengths) 152 | 153 | if max_length is None or max_length > _max: 154 | max_length = _max 155 | if min_length is None or min_length < _min: 
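# as with max_length above, fall back to the shortest wav_length actually present in the manifest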
156 | min_length = _min 157 | 158 | self.min_length = min_length 159 | self.max_length = max_length 160 | self.length_delta = length_delta 161 | 162 | self.decimals = decimals 163 | self.length_step = np.round(0.1 ** decimals, decimals) 164 | 165 | # is needed to sample batches with max length inclusive 166 | max_plus_delta = np.round(self.max_length + self.length_step, decimals) 167 | 168 | length_to_index_mapping = {} 169 | temp_length = 0 170 | 171 | for i, v in enumerate(all_lengths): 172 | if v > temp_length: 173 | if v != temp_length + self.length_step: 174 | for j in np.arange( 175 | temp_length + self.length_step, v, self.length_step 176 | ): 177 | length_to_index_mapping[np.round(j, decimals)] = i 178 | 179 | length_to_index_mapping[v] = i 180 | 181 | temp_length = v 182 | 183 | # fix to sample batches with max length inclusive 184 | length_to_index_mapping[ 185 | np.round(np.max(all_lengths) + self.length_step, decimals) 186 | ] = len(df) 187 | 188 | self.length_to_index_mapping = length_to_index_mapping 189 | 190 | # starts with MIN_LENGTH 191 | self.lengths, self.lengths_count = np.unique( 192 | all_lengths[ 193 | length_to_index_mapping[self.min_length] : length_to_index_mapping[ 194 | max_plus_delta 195 | ] 196 | ], 197 | return_counts=True, 198 | ) 199 | 200 | self.key_length_sampler = Categorical( 201 | probs=torch.from_numpy(self.lengths_count) 202 | ) 203 | 204 | def __iter__(self): 205 | N = 0 206 | res_indexes = [] 207 | 208 | while N < self.num_samples: 209 | key_length = self.lengths[self.key_length_sampler.sample().item()] 210 | 211 | batch_min_length = np.round( 212 | max(self.min_length, key_length - self.length_delta), self.decimals 213 | ) 214 | batch_max_length = np.round( 215 | min(self.max_length, key_length + self.length_delta), self.decimals 216 | ) 217 | batch_max_length_plus_delta = np.round( 218 | batch_max_length + self.length_step, self.decimals 219 | ) 220 | 221 | sub_df = self.df.iloc[ 222 | self.length_to_index_mapping[ 223 | batch_min_length 224 | ] : self.length_to_index_mapping[batch_max_length_plus_delta] 225 | ][["sampling_weights"]] 226 | 227 | sampling_weights = torch.from_numpy( 228 | sub_df.sampling_weights.values.astype(float) 229 | ) 230 | sub_iloc_indexes = torch.multinomial( 231 | sampling_weights, self.batch_size, True 232 | ).tolist() 233 | 234 | batch_indexes = sub_df.iloc[sub_iloc_indexes].index.tolist() 235 | res_indexes.extend(batch_indexes) 236 | 237 | N += self.batch_size 238 | 239 | return iter(res_indexes) 240 | 241 | def __len__(self): 242 | return self.num_samples 243 | -------------------------------------------------------------------------------- /dusha/experiments/core/learner.py: -------------------------------------------------------------------------------- 1 | import copy 2 | from pathlib import Path 3 | import time 4 | 5 | import numpy as np 6 | import torch 7 | from torch import nn 8 | from torch.optim import Adam, lr_scheduler 9 | from torch.utils.tensorboard import SummaryWriter 10 | from tqdm import tqdm 11 | 12 | from .metrics import calculate_metrics 13 | 14 | 15 | class Learner: 16 | def __init__( 17 | self, 18 | train_dataset, 19 | val_dataset, 20 | dataloaders, 21 | exp_path, 22 | model_name, 23 | model, 24 | batch_size, 25 | dump_best_checkpoints, 26 | dump_last_checkpoints, 27 | best_checkpoints_warmup, 28 | cuda_device="cuda:0", 29 | ): 30 | 31 | self.device = torch.device(cuda_device if torch.cuda.is_available() else "cpu") 32 | self.model = model 33 | self.model.to(self.device) 34 | 35 | 
self.__model_name = model_name 36 | 37 | self.dump_last_checkpoints = dump_last_checkpoints 38 | self.dump_best_checkpoints = dump_best_checkpoints 39 | self.best_checkpoints_warmup = best_checkpoints_warmup 40 | 41 | self.exp_path = Path(exp_path) 42 | if dump_best_checkpoints: 43 | self.best_checkpoints_path = self.exp_path / "best_checkpoints" 44 | self.best_checkpoints_path.mkdir() 45 | if dump_last_checkpoints: 46 | self.last_checkpoints_path = self.exp_path / ( 47 | self.__model_name + "_last_checkpoints" 48 | ) 49 | self.last_checkpoints_path.mkdir() 50 | 51 | self.batch_size = batch_size 52 | 53 | self.train_dataset = train_dataset 54 | self.val_dataset = val_dataset 55 | 56 | print( 57 | "train labels", 58 | np.unique(self.train_dataset.df.label.values, return_counts=True), 59 | ) 60 | print( 61 | "train weights", 62 | np.unique( 63 | self.train_dataset.df.sampling_weights.values, return_counts=True 64 | ), 65 | ) 66 | 67 | self.dataloaders = dataloaders 68 | 69 | self.dataset_sizes = { 70 | "train": len(self.train_dataset.df), 71 | "validate": len(self.val_dataset.df), 72 | } 73 | 74 | def train(self, num_epochs, lr, step_size, gamma, weight_decay=0, clip_grad=False): 75 | comment_str_list = [ 76 | "MODEL", 77 | self.__model_name, 78 | "EPOCHS", 79 | str(num_epochs), 80 | "LR", 81 | str(lr), 82 | "BATCH", 83 | str(self.batch_size), 84 | ] 85 | 86 | comment_str = "_".join(comment_str_list) 87 | summary_writer = SummaryWriter(log_dir=self.exp_path / 'TB_log' / comment_str) 88 | 89 | criterion = nn.CrossEntropyLoss() 90 | optimizer = Adam(self.model.parameters(), lr=lr, weight_decay=weight_decay) 91 | scheduler = lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma) 92 | 93 | since = time.time() 94 | # copy.deepcopy(self.model.state_dict()) 95 | best_model_wts = None 96 | best_loss = 10000000 97 | best_acc = best_f1 = best_WA = 0 98 | softmax = nn.Softmax(dim=1) 99 | 100 | try: 101 | for epoch in range(1, num_epochs + 1): 102 | print(f"Epoch {epoch}/{num_epochs}") 103 | for phase in ["train", "validate"]: 104 | if phase == "train": 105 | self.model.train() 106 | cur_step_lr = scheduler.get_last_lr()[-1] 107 | else: 108 | self.model.eval() 109 | 110 | running_loss = 0.0 111 | running_outputs = [] 112 | running_labels = [] 113 | for inputs, labels in tqdm(self.dataloaders[phase]): 114 | inputs = inputs.to(self.device) 115 | labels = labels.long() 116 | labels = labels.to(self.device) 117 | optimizer.zero_grad() 118 | 119 | with torch.set_grad_enabled(phase == "train"): 120 | outputs = self.model(inputs) 121 | probs = softmax(outputs) 122 | loss = criterion(outputs, labels) 123 | if phase == "train": 124 | loss.backward() 125 | if clip_grad: 126 | torch.nn.utils.clip_grad_norm_( 127 | self.model.parameters(), 1.0 128 | ) 129 | optimizer.step() 130 | 131 | running_loss += loss.item() 132 | if phase == "validate": 133 | running_labels.append(labels) 134 | running_outputs.append(probs) 135 | 136 | if phase == "train": 137 | scheduler.step() 138 | 139 | epoch_loss = running_loss / self.dataset_sizes[phase] 140 | if phase == "validate": 141 | pred_class = np.argmax( 142 | torch.cat(running_outputs).cpu().numpy(), axis=1 143 | ) 144 | gt_class = torch.cat(running_labels).cpu().numpy() 145 | 146 | metric_dict = calculate_metrics( 147 | pred_class, gt_class, neg_label=0 148 | ) 149 | 150 | summary_writer.add_scalar("Loss/validate", epoch_loss, epoch) 151 | for metric_name, metric_value in metric_dict.items(): 152 | summary_writer.add_scalar( 153 | f"Metrics/{metric_name}", 
metric_value, epoch 154 | ) 155 | 156 | epoch_acc = metric_dict["accuracy"] 157 | epoch_f1 = metric_dict["f1_macro"] 158 | epoch_WA = metric_dict["WA"] 159 | 160 | print(f"{phase} Loss: {epoch_loss:.4f}") 161 | print(f"{phase} Acc: {epoch_acc:.4f}") 162 | print(f"{phase} F1 macro: {epoch_f1:.4f}") 163 | print(f"{phase} WA: {epoch_WA:.4f}") 164 | 165 | if epoch_f1 > best_f1: 166 | best_f1 = epoch_f1 167 | # best_WA = epoch_WA 168 | best_acc = epoch_acc 169 | best_f1 = epoch_f1 170 | 171 | best_epoch = epoch 172 | best_model_wts = copy.deepcopy(self.model.state_dict()) 173 | 174 | if ( 175 | self.dump_best_checkpoints 176 | and epoch > self.best_checkpoints_warmup 177 | ): 178 | torch.save( 179 | best_model_wts, 180 | self.best_checkpoints_path 181 | / f"best_checkpoint_{epoch}", 182 | ) 183 | 184 | if self.dump_last_checkpoints and abs(epoch - num_epochs) < 6: 185 | torch.save( 186 | copy.deepcopy(self.model.state_dict()), 187 | self.last_checkpoints_path / f"checkpoint_{epoch}", 188 | ) 189 | 190 | else: 191 | print(f"{phase} Loss: {epoch_loss:.4f}") 192 | summary_writer.add_scalar("Loss/train", epoch_loss, epoch) 193 | summary_writer.add_scalar("LR/value", cur_step_lr, epoch) 194 | 195 | except KeyboardInterrupt: 196 | pass 197 | 198 | summary_writer.flush() 199 | time_elapsed = time.time() - since 200 | print( 201 | f"Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s." 202 | + f" Best model loss: {best_loss:.6f}, best model acc: {best_acc:.6f}, " 203 | + f"best model f1: {best_f1:.6f}, best epoch {best_epoch}" 204 | ) 205 | 206 | self.model.load_state_dict(best_model_wts) 207 | self.model.eval() 208 | return best_model_wts 209 | -------------------------------------------------------------------------------- /dusha/experiments/core/metrics.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from sklearn.metrics import accuracy_score, f1_score 4 | 5 | 6 | def get_metrics_df(pred_class, gt_class, model_name=None): 7 | metric_dict = calculate_metrics(pred_class=pred_class, gt_class=gt_class) 8 | metrics_df = pd.DataFrame([metric_dict]).T.round(4) 9 | 10 | if model_name is not None: 11 | metrics_df.columns = [model_name] 12 | 13 | return metrics_df 14 | 15 | 16 | def weighted_accuracy(y_true, y_pred, n_classes=4): 17 | y_pred = np.array(y_pred) 18 | y_true = np.array(y_true) 19 | 20 | class_accuracies = [] 21 | for i in range(n_classes): 22 | gt_class_mask = y_true == i 23 | pred_class_mask = y_pred == i 24 | class_accuracies.append( 25 | (gt_class_mask * pred_class_mask).sum() / gt_class_mask.sum() 26 | ) 27 | 28 | return np.mean(class_accuracies) 29 | 30 | 31 | def calculate_metrics(pred_class, gt_class, **kwargs): 32 | n_classes = 4 33 | 34 | metrics_dict = { 35 | "accuracy": accuracy_score(y_true=gt_class, y_pred=pred_class), 36 | "WA": weighted_accuracy( 37 | y_true=gt_class, y_pred=pred_class, n_classes=n_classes 38 | ), 39 | "f1_macro": f1_score(y_true=gt_class, y_pred=pred_class, average="macro"), 40 | } 41 | 42 | return metrics_dict 43 | -------------------------------------------------------------------------------- /dusha/experiments/core/model.py: -------------------------------------------------------------------------------- 1 | from typing import Union 2 | 3 | import torch 4 | from torch import nn 5 | from torchvision.models.mobilenetv2 import ConvBNReLU, InvertedResidual, _make_divisible 6 | 7 | AUDIO_PROBAS = ('audio_neg', 'audio_sad', 'audio_neu', 'audio_pos') 8 | 
AUDIO_COLS = tuple(["audio_pred"] + list(AUDIO_PROBAS)) 9 | 10 | EMO2LABEL = {'angry': 0, 11 | 'sad': 1, 12 | 'neutral': 2, 13 | 'positive': 3} 14 | 15 | 16 | class SoftMaxModel(nn.Module): 17 | def __init__(self, logits_model: nn.Module): 18 | super().__init__() 19 | self.logits_model = logits_model 20 | self.softmax = nn.Softmax(dim=1) 21 | 22 | def forward(self, x): 23 | x = self.logits_model(x) 24 | x = self.softmax(x) 25 | 26 | return x 27 | 28 | 29 | # slightly modified fast.ai implementation 30 | # https://medium.com/mlearning-ai/self-attention-in-convolutional-neural-networks-172d947afc00 31 | class ConvSelfAttention(nn.Module): 32 | """Self attention layer for `n_channels`.""" 33 | 34 | def __init__(self, n_channels): 35 | super().__init__() 36 | self.query, self.key, self.value = [ 37 | self._conv(n_channels, c) 38 | for c in (n_channels // 8, n_channels // 8, n_channels) 39 | ] 40 | self.gamma = nn.Parameter(torch.tensor([0.0])) 41 | 42 | def _conv(self, n_in, n_out): 43 | return nn.Conv1d(n_in, n_out, kernel_size=1, bias=False) 44 | 45 | def forward(self, x): 46 | # Notation from the paper. 47 | size = x.size() 48 | x = x.view(*size[:2], -1) 49 | f, g, h = self.query(x), self.key(x), self.value(x) 50 | beta = nn.functional.softmax(torch.bmm(f.transpose(1, 2), g), dim=1) 51 | o = self.gamma * torch.bmm(h, beta) + x 52 | return o.view(*size).contiguous() 53 | 54 | 55 | # see deep_pipe 56 | # https://github.com/neuro-ml/deep_pipe/blob/master/dpipe/layers/shape.py#L48 57 | class Reshape(nn.Module): 58 | """ 59 | Reshape the incoming tensor to the given ``shape``. 60 | 61 | Parameters 62 | ---------- 63 | shape: Union[int, str] 64 | the resulting shape. String values denote indices in the input tensor's shape. 65 | 66 | Examples 67 | -------- 68 | >>> layer = Reshape('0', '1', 500, 500) 69 | >>> layer(x) 70 | >>> # same as 71 | >>> x.reshape(x.shape[0], x.shape[1], 500, 500) 72 | """ 73 | 74 | def __init__(self, *shape: Union[int, str]): 75 | super().__init__() 76 | self.shape = shape 77 | 78 | def forward(self, x: torch.Tensor): 79 | shape = [x.shape[int(i)] if isinstance(i, str) else i for i in self.shape] 80 | return x.reshape(*shape) 81 | 82 | 83 | # see torchvision.models.mobilenetv2.MobileNetV2 84 | class ConvSelfAttentionMobileNet(nn.Module): 85 | def __init__(self, _config, n_classes, last_channel=128, in_channels=1): 86 | 87 | super().__init__() 88 | self._config = _config 89 | self.in_channels = in_channels 90 | self.n_classes = n_classes 91 | self.last_channel = last_channel 92 | 93 | block = InvertedResidual 94 | norm_layer = nn.BatchNorm2d 95 | width_mult = 1.0 96 | round_nearest = 8 97 | 98 | input_channel = 4 99 | 100 | features = [ 101 | ConvBNReLU(self.in_channels, input_channel, stride=1, norm_layer=norm_layer) 102 | ] 103 | for t, c, n, s in _config: 104 | output_channel = _make_divisible(c * width_mult, round_nearest) 105 | for i in range(n): 106 | stride = s if i == 0 else 1 107 | features.append( 108 | block( 109 | input_channel, 110 | output_channel, 111 | stride, 112 | expand_ratio=t, 113 | norm_layer=norm_layer, 114 | ) 115 | ) 116 | input_channel = output_channel 117 | # building last several layers 118 | features.append( 119 | ConvBNReLU( 120 | input_channel, self.last_channel, kernel_size=1, norm_layer=norm_layer 121 | ) 122 | ) 123 | # make it nn.Sequential 124 | self.features = nn.Sequential(*features) 125 | 126 | self.pooling = nn.Sequential( 127 | ConvSelfAttention(self.last_channel), 128 | nn.AdaptiveAvgPool2d((1, 1)), 129 | Reshape("0", 
self.last_channel), 130 | ) 131 | 132 | self.classifier = nn.Linear(self.last_channel, self.n_classes) 133 | 134 | def forward(self, x): 135 | x = self.features(x) 136 | x = self.pooling(x) 137 | x = self.classifier(x) 138 | 139 | return x 140 | -------------------------------------------------------------------------------- /dusha/experiments/core/utils.py: -------------------------------------------------------------------------------- 1 | import json 2 | from pathlib import Path 3 | 4 | import pandas as pd 5 | 6 | 7 | def parse_name(tsv_name): 8 | """ 9 | We have names like 10 | f"predicts_dataset_{dataset_name}_model_{model_name}.tsv" / 11 | f"metrics_dataset_{dataset_name}_model_{model_name}.csv" 12 | 13 | Returns: dataset_name, model_name 14 | """ 15 | 16 | if tsv_name.startswith("predicts_dataset_"): 17 | # len('predicts_dataset_') = 17 18 | _s = tsv_name[17:] 19 | elif tsv_name.startswith("metrics_dataset_"): 20 | _s = tsv_name[16:] 21 | else: 22 | raise ValueError(f"tsv_name is {tsv_name}") 23 | 24 | model_prefix_start = _s.find("_model_") 25 | if model_prefix_start == -1: 26 | raise ValueError(f"tsv_name is {tsv_name}") 27 | 28 | dataset_name = _s[:model_prefix_start] 29 | model_name = _s[model_prefix_start + len("_model_") : -4] 30 | 31 | return dataset_name, model_name 32 | 33 | 34 | def raw_parse_dir(exps_path, prefix="predicts"): 35 | """ 36 | Pars dir with experiments and returns dicts: 37 | dataset: model: path 38 | dataset: set of models 39 | 40 | Args: 41 | exps_path: path to dir with experiments 42 | prefix: 'predicts' or 'metrics' - what the function should parse 43 | """ 44 | exps_path = Path(exps_path) 45 | 46 | # get paths to data 47 | glob_exp = "**/" 48 | 49 | if prefix == "predicts": 50 | glob_file = "predicts_*.tsv" 51 | elif prefix == "metrics": 52 | glob_file = "metrics_*.csv" 53 | else: 54 | raise ValueError( 55 | f"Get prefix = {prefix}, supports only ['predicts', 'metrics']" 56 | ) 57 | 58 | data_paths = list(exps_path.glob(glob_exp + glob_file)) 59 | 60 | data_paths = [ 61 | p 62 | for p in data_paths 63 | if str(p.name).startswith(prefix) 64 | and str(p.name).find("dataset_") > -1 65 | and str(p.name).find("model_") > -1 66 | ] 67 | 68 | # init our structure 69 | # dataset: model: path_to_predict 70 | dataset_models_paths = {} 71 | 72 | # get all models for all datasets 73 | 74 | # dataset: set of model names 75 | dataset_models_dict = {} 76 | for curr_path in data_paths: 77 | dataset_name, model_name = parse_name(str(curr_path.name)) 78 | if dataset_models_dict.get(dataset_name) is None: 79 | dataset_models_dict[dataset_name] = {model_name} 80 | dataset_models_paths[dataset_name] = {model_name: curr_path} 81 | else: 82 | dataset_models_dict[dataset_name] |= {model_name} 83 | dataset_models_paths[dataset_name].update({model_name: curr_path}) 84 | 85 | return dataset_models_paths, dataset_models_dict 86 | 87 | 88 | def collect_metrics_to_one(list_of_metrics_df): 89 | df = list_of_metrics_df[0] 90 | df.columns = ["", df.columns[-1]] 91 | df = df.set_index("") 92 | 93 | for curr_metric_df in list_of_metrics_df[1:]: 94 | _df = curr_metric_df 95 | _df.columns = ["", _df.columns[-1]] 96 | _df = _df.set_index("") 97 | df = df.join(_df) 98 | 99 | df = df.sort_values("f1_macro", axis=1, ascending=False) 100 | 101 | return df 102 | 103 | 104 | def load_jsonl_as_df(file_name): 105 | data = [] 106 | with open(file_name, "r") as file1: 107 | for line1 in file1: 108 | data.append(json.loads(line1)) 109 | file1.close() 110 | df = pd.DataFrame.from_records(data) 111 | if 
"label" in df.columns: 112 | df.label = df.label.astype(int) 113 | 114 | return df 115 | -------------------------------------------------------------------------------- /dusha/experiments/inf.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pathlib import Path 3 | 4 | import click 5 | import lazycon 6 | import numpy as np 7 | import pandas as pd 8 | import torch 9 | from tqdm import tqdm 10 | 11 | from core.metrics import get_metrics_df 12 | from core.model import AUDIO_COLS, SoftMaxModel 13 | from core.utils import collect_metrics_to_one, load_jsonl_as_df, raw_parse_dir 14 | 15 | DEVICE = "cuda:0" 16 | 17 | 18 | def run_single_inf(exp_path, test_manifest, with_metrics, recalculate, device): 19 | # parse exp_path 20 | # it may be exp path or path to model 21 | if os.path.isdir(exp_path): 22 | dir_path = exp_path 23 | _path = Path(exp_path) 24 | model_path = _path / _path.name 25 | else: 26 | dir_path = os.path.dirname(exp_path) 27 | model_path = Path(exp_path) 28 | 29 | dir_path = Path(dir_path) 30 | model_name = model_path.name 31 | 32 | # check the config 33 | config_path = dir_path / "train.config" 34 | assert os.path.exists(config_path), f"No train.config in {dir_path}" 35 | 36 | # check the model 37 | if not os.path.exists(model_path): 38 | print(f"There is no saved model {model_path}. Nothing to inference") 39 | return None 40 | 41 | # load the model 42 | cfg = lazycon.load(config_path) 43 | model = cfg.model 44 | try: 45 | model.to(device) 46 | model.load_state_dict(torch.load(model_path)) 47 | model.eval() 48 | except Exception as exception: 49 | print(f"Problem with loading model {model_path}. Skipped") 50 | print(exception) 51 | return None 52 | 53 | # add last layer SoftMax to predict probabilities 54 | model = SoftMaxModel(model) 55 | 56 | # create predicts and metrics paths 57 | predicts_path = Path(dir_path) / "predicts" 58 | metrics_path = Path(dir_path) / "metrics" 59 | 60 | predicts_path.mkdir(exist_ok=True) 61 | metrics_path.mkdir(exist_ok=True) 62 | 63 | # parse --vm folder/df 64 | paths_to_inf = [] 65 | 66 | if os.path.isdir(test_manifest): 67 | paths_to_inf = list(Path(test_manifest).glob("*.jsonl")) 68 | else: 69 | paths_to_inf.append(test_manifest) 70 | 71 | assert len(paths_to_inf) > 0, f"No .jsonl here: {test_manifest}" 72 | 73 | # iterate over datasets for inference 74 | for dataset_df_path in paths_to_inf: 75 | dataset_df_path = Path(dataset_df_path) 76 | dataset_df = load_jsonl_as_df(dataset_df_path) 77 | # dataset_df = pd.read_csv(dataset_df_path, sep='\t') 78 | dataset_name = str(dataset_df_path.name).split(".", maxsplit=1)[0] 79 | if with_metrics: 80 | assert ( 81 | "label" in dataset_df.columns 82 | ), f"{dataset_name} hasn't 'label' column, but --with_metrics" 83 | 84 | # predict 85 | predicts_tsv_path = ( 86 | predicts_path / f"predicts_dataset_{dataset_name}_model_{model_name}.tsv" 87 | ) 88 | 89 | # if predicts exist and we don't want to recalculate it, but want to calculate metrics 90 | if os.path.exists(predicts_tsv_path) and not recalculate: 91 | if with_metrics: 92 | metrics_csv_path = ( 93 | metrics_path 94 | / f"metrics_dataset_{dataset_name}_model_{model_name}.csv" 95 | ) 96 | if not os.path.exists(metrics_csv_path): 97 | print( 98 | f"Predicts for {model_name} {dataset_name} exist. 
Calculating metrics" 99 | ) 100 | pred_df = pd.read_csv(predicts_tsv_path, sep="\t") 101 | 102 | pred_class = pred_df[AUDIO_COLS[0]].values 103 | gt_class = pred_df["label"].values 104 | 105 | metrics_df = get_metrics_df( 106 | pred_class=pred_class, gt_class=gt_class, model_name=model_name 107 | ) 108 | 109 | metrics_df.to_csv(metrics_csv_path) 110 | else: 111 | print( 112 | f"Predicts and metrics for {model_name} {dataset_name} exist. Skipped" 113 | ) 114 | else: 115 | print( 116 | f"Predicts for {model_name} {dataset_name} are existed" 117 | + "--no_metrics, so metrics calculation is skipped" 118 | ) 119 | continue 120 | 121 | # calculate predicts 122 | running_outputs = [] 123 | ds = cfg.get_val_dataset(_df=dataset_df, ds_base_path=dataset_df_path.parent) 124 | dataloader = cfg.get_val_dataloader(val_ds=ds) 125 | 126 | print(f"Calculating predicts and metrics: {model_name} {dataset_name}") 127 | for inputs, _ in tqdm(dataloader): 128 | inputs = inputs.to(device) 129 | with torch.no_grad(): 130 | probs = model(inputs) 131 | 132 | running_outputs.append(probs) 133 | 134 | # MelEmotionsDataset changes order in df, so we should match predicts by id 135 | _df = ds.df.copy() 136 | pred_class = np.argmax(torch.cat(running_outputs).cpu().numpy(), axis=1) 137 | probas = torch.cat(running_outputs).cpu().numpy() 138 | 139 | _df[AUDIO_COLS[0]] = pred_class 140 | for i in range(4): 141 | _df[AUDIO_COLS[i + 1]] = probas[:, i] 142 | 143 | # match preds by id 144 | pred_df = dataset_df.copy() 145 | _df = _df.set_index("id").loc[pred_df.id] 146 | for _col in AUDIO_COLS: 147 | pred_df[_col] = _df[_col].values 148 | 149 | pred_df.to_csv(predicts_tsv_path, index=False, sep="\t") 150 | 151 | # calculate metrics 152 | if with_metrics: 153 | metrics_csv_path = ( 154 | metrics_path / f"metrics_dataset_{dataset_name}_model_{model_name}.csv" 155 | ) 156 | 157 | pred_class = pred_df[AUDIO_COLS[0]].values 158 | gt_class = pred_df["label"].values 159 | 160 | metrics_df = get_metrics_df( 161 | pred_class=pred_class, gt_class=gt_class, model_name=model_name 162 | ) 163 | 164 | metrics_df.to_csv(metrics_csv_path) 165 | 166 | 167 | @click.command() 168 | @click.option( 169 | "-exps_path", 170 | "--exps_path", 171 | required=True, 172 | type=click.Path(exists=True), 173 | help="path folder with experiment folders (the experiment folder must have train.config file in)", 174 | ) 175 | @click.option( 176 | "-vm", 177 | "--test_manifest", 178 | required=True, 179 | type=click.Path(exists=True), 180 | help="path to JSONL file/dir of JSONLs to inference", 181 | ) 182 | @click.option( 183 | "--with_metrics/--no_metrics", 184 | default=True, 185 | help="calculate metrics for experiments", 186 | ) 187 | @click.option( 188 | "--recalculate/--no_recalculate", 189 | default=False, 190 | help="recalculate existed predicts and metrics", 191 | ) 192 | @click.option( 193 | "--recalculate_dataset_metrics/--no_dataset_metrics", 194 | default=True, 195 | help="recalculate existed grouped by dataset metrics", 196 | ) 197 | @click.option( 198 | "-device", "--device", type=click.STRING, default=DEVICE, help="device to inference" 199 | ) 200 | def run_inf( 201 | exps_path, 202 | test_manifest, 203 | with_metrics, 204 | recalculate, 205 | recalculate_dataset_metrics, 206 | device, 207 | ): 208 | # parse folder, find experiments folders 209 | exps_path = Path(exps_path) 210 | experiment_paths = [p.parent for p in exps_path.glob("**/train.config")] 211 | 212 | # predict and calc metrics for a single experiment 213 | for exp_path in 
experiment_paths: 214 | run_single_inf( 215 | exp_path=exp_path, 216 | test_manifest=test_manifest, 217 | with_metrics=with_metrics, 218 | recalculate=recalculate, 219 | device=device, 220 | ) 221 | 222 | # aggregate metrics 223 | metrics_dump_dir = exps_path / "metrics" 224 | metrics_dump_dir.mkdir(exist_ok=True) 225 | 226 | if recalculate_dataset_metrics: 227 | print("Aggregating metrics") 228 | dataset_models_paths, dataset_models = raw_parse_dir( 229 | exps_path=exps_path, prefix="metrics" 230 | ) 231 | datasets = sorted(dataset_models.keys()) 232 | for dataset_name in datasets: 233 | metric_dump_dir = metrics_dump_dir / f"exps_{dataset_name}.csv" 234 | metric_df = collect_metrics_to_one( 235 | [ 236 | pd.read_csv(metrics_df_path) 237 | for metrics_df_path in dataset_models_paths[dataset_name].values() 238 | ] 239 | ).T 240 | metric_df.to_csv(metric_dump_dir) 241 | else: 242 | print("--no_dataset_metrics, so metrics grouped by dataset are skipped") 243 | 244 | agg_metrics_paths = list(metrics_dump_dir.glob("*.csv")) 245 | if len(agg_metrics_paths) == 0: 246 | print("There is no grouped by dataset metrics") 247 | else: 248 | for agg_metrics_path in agg_metrics_paths: 249 | # remove exps_ and .csv in aggregated metrics df name 250 | dataset_name = str(agg_metrics_path.name)[5:-4] 251 | metric_df = pd.read_csv(agg_metrics_path).set_index("Unnamed: 0") 252 | metric_df.index.name = "" 253 | print("DATASET: ", dataset_name) 254 | print(metric_df) 255 | print("------------------------------------------------") 256 | 257 | 258 | if __name__ == "__main__": 259 | run_inf() 260 | -------------------------------------------------------------------------------- /dusha/experiments/requirements.txt: -------------------------------------------------------------------------------- 1 | # numpy==1.19.5 2 | lazycon==0.3.1 # see https://github.com/maxme1/lazycon 3 | torch==1.8.1+cu101 4 | torchvision==0.9.1+cu101 5 | torchaudio==0.8.1 6 | pathlib==1.0.1 7 | librosa==0.7.1 8 | tqdm==4.55.1 9 | pandas==1.1.5 10 | scikit-learn==0.24.0 11 | tensorboard==2.4.1 -------------------------------------------------------------------------------- /dusha/experiments/train.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | import random 3 | import shutil 4 | 5 | import click 6 | import lazycon 7 | import numpy as np 8 | import torch 9 | 10 | from core.learner import Learner 11 | 12 | 13 | @click.command() 14 | @click.option( 15 | "-config", 16 | "--config_path", 17 | required=True, 18 | type=click.Path(exists=True), 19 | help="path to .config file", 20 | ) 21 | @click.option( 22 | "-exp_path", 23 | "--exp_path", 24 | required=True, 25 | type=click.Path(), 26 | help="path to dump experiment", 27 | ) 28 | def train_model(config_path, exp_path): 29 | exp_path = Path(exp_path) 30 | model_name = exp_path.name 31 | cfg = lazycon.load(config_path) 32 | base_path = cfg.base_path 33 | assert ( 34 | base_path.exists() 35 | ), f"{base_path} doesn't exist. 
Correct base_path in configs/data.config" 36 | 37 | exp_path.mkdir(parents=True, exist_ok=True) 38 | 39 | # dump params 40 | # save compiled config 41 | cfg.dump(exp_path / "train.config") 42 | 43 | # dump jsonls 44 | shutil.copy(cfg.train_manifest_path, exp_path / "train.jsonl") 45 | shutil.copy(cfg.val_manifest_path, exp_path / "val.jsonl") 46 | 47 | model = cfg.model 48 | 49 | # load pretrained model 50 | if cfg.pt_model_path is not None: 51 | model.load_state_dict(torch.load(cfg.pt_model_path, map_location="cuda:0")) 52 | shutil.copy(cfg.pt_model_path, exp_path / "pt_model") 53 | 54 | # init learner 55 | learner = Learner( 56 | train_dataset=cfg.train_dataset, 57 | val_dataset=cfg.val_dataset, 58 | dataloaders=cfg.dataloaders, 59 | exp_path=exp_path, 60 | model_name=model_name, 61 | model=model, 62 | batch_size=cfg.batch_size, 63 | dump_best_checkpoints=cfg.DUMP_BEST_CHECKPOINTS, 64 | dump_last_checkpoints=cfg.DUMP_LAST_CHECKPOINTS, 65 | best_checkpoints_warmup=cfg.BEST_CHECKPOINTS_WARMUP, 66 | ) 67 | 68 | # train 69 | best_model_wts = learner.train( 70 | num_epochs=cfg.epoch_count, 71 | lr=cfg.learning_rate, 72 | step_size=cfg.optimizer_step, 73 | gamma=cfg.optimizer_gamma, 74 | weight_decay=cfg.weight_decay, 75 | clip_grad=cfg.clip_grad, 76 | ) 77 | 78 | # dump best model 79 | torch.save(best_model_wts, exp_path / model_name) 80 | 81 | 82 | if __name__ == "__main__": 83 | # fix seeds for reproducibility 84 | torch.manual_seed(0) 85 | random.seed(0) 86 | np.random.seed(0) 87 | torch.backends.cudnn.benchmark = False 88 | torch.use_deterministic_algorithms(True) 89 | 90 | train_model() 91 | -------------------------------------------------------------------------------- /golos/README.md: -------------------------------------------------------------------------------- 1 | # Golos dataset 2 | 3 | Golos is a Russian corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on the crowd-sourcing platform. The total duration of the audio is about 1240 hours. 4 | We have made the corpus freely available for downloading, along with the acoustic model prepared on this corpus. 5 | Also we create 3-gram KenLM language model using an open Common Crawl corpus. 
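As a quick illustration, the released acoustic model can be loaded with the same NeMo calls used in `golos/examples/infer.py`; this is only a sketch, and the checkpoint and audio paths below are placeholders:

```python
import nemo.collections.asr as nemo_asr

# Load the released checkpoint (placeholder path) and transcribe a wav file with greedy decoding.
asr_model = nemo_asr.models.EncDecCTCModel.restore_from("QuartzNet15x5_golos.nemo")
asr_model.cuda()
print(asr_model.transcribe(paths2audio_files=["example.wav"], batch_size=1))
```

For beam-search decoding with the KenLM language models, see `golos/examples/infer.py`.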
6 | 7 | # Table of contents 8 | 9 | - [Dataset structure](https://github.com/salute-developers/golos/tree/master/golos#dataset-structure) 10 | - [Downloads](https://github.com/salute-developers/golos/tree/master/golos#downloads) 11 | - [Audio files in opus format](https://github.com/salute-developers/golos/tree/master/golos#audio-files-in-opus-format) 12 | - [Audio files in wav format](https://github.com/salute-developers/golos/tree/master/golos#audio-files-in-wav-format) 13 | - [Acoustic and language models](https://github.com/salute-developers/golos/tree/master/golos#acoustic-and-language-models) 14 | - [Evaluation](https://github.com/salute-developers/golos/tree/master/golos#evaluation) 15 | - [Resources](https://github.com/salute-developers/golos/tree/master/golos#resources) 16 | - [License](https://github.com/salute-developers/golos/tree/master/golos#license) 17 | - [Contacts](https://github.com/salute-developers/golos/tree/master/golos#contacts) 18 | 19 | 20 | ## **Dataset structure** 21 | 22 | | Domain | Train files | Train hours | Test files | Test hours | 23 | |----------------|------------|--------|-------|------| 24 | | Crowd | 979 796 | 1 095 | 9 994 | 11.2 | 25 | | Farfield | 124 003 | 132.4| 1 916 | 1.4 | 26 | | Total | 1 103 799 | 1 227.4|11 910 | 12.6 | 27 | 28 | --- 29 | 30 | ## **Downloads** 31 | 32 | [MD5 Checksums](https://github.com/salute-developers/golos/blob/master/golos/md5sum.txt) 33 | 34 | 35 | ### **Audio files in opus format** 36 | 37 | | Archive | Size | Link | 38 | |------------------|------------|----------------------------------------------------------------------------------------------| 39 | | golos_opus.tar | 20.5 GB | [golos_opus.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/golos_opus.tar) | 40 | 41 | --- 42 | 43 | ### **Audio files in wav format** 44 | 45 | Manifest files with all the training transcription texts are in the train_crowd9.tar archive listed in the table: 46 | 47 | | Archives | Size | Links | 48 | |-------------------|------------|------------------------------------------------------------------------------------------------------| 49 | | train_farfield.tar| 15.4 GB | [train_farfield.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_farfield.tar) | 50 | | train_crowd0.tar | 11 GB | [train_crowd0.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd0.tar) | 51 | | train_crowd1.tar | 14 GB | [train_crowd1.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd1.tar) | 52 | | train_crowd2.tar | 13.2 GB | [train_crowd2.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd2.tar) | 53 | | train_crowd3.tar | 11.6 GB | [train_crowd3.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd3.tar) | 54 | | train_crowd4.tar | 15.8 GB | [train_crowd4.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd4.tar) | 55 | | train_crowd5.tar | 13.1 GB | [train_crowd5.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd5.tar) | 56 | | train_crowd6.tar | 15.7 GB | [train_crowd6.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd6.tar) | 57 | | train_crowd7.tar | 12.7 GB | [train_crowd7.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd7.tar) | 58 | | train_crowd8.tar | 12.2 GB | [train_crowd8.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd8.tar) | 59 | | train_crowd9.tar | 8.08 GB | [train_crowd9.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd9.tar) | 60 | | test.tar | 1.3 GB | 
[test.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/test.tar) | 61 | 62 | --- 63 | 64 | ### **Acoustic and language models** 65 | 66 | Acoustic model built using [QuartzNet15x5](https://arxiv.org/pdf/1910.10261.pdf) architecture and trained using [NeMo toolkit](https://github.com/NVIDIA/NeMo/tree/r1.0.0b4) 67 | 68 | 69 | Three n-gram language models created using [KenLM Language Model Toolkit](https://kheafield.com/code/kenlm) 70 | 71 | * LM built on [Common Crawl](https://commoncrawl.org) Russian dataset 72 | * LM built on [Golos](https://github.com/salute-developers/golos) train set 73 | * LM built on [Common Crawl](https://commoncrawl.org) and [Golos](https://github.com/salute-developers/golos/tree/master/golos) datasets together (50/50) 74 | 75 | | Archives | Size | Links | 76 | |--------------------------|------------|-----------------------------------------------------------------------------------------------------------------------| 77 | | QuartzNet15x5_golos.nemo | 68 MB | [QuartzNet15x5_golos.nemo (TO BE UPDATED)](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/QuartzNet15x5_golos.nemo) | 78 | | CitriNet_ru1024bpe.tgz | 541 MB | [CitriNet_ru1024bpe.tgz](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/citrinet/CitriNet_ru1024bpe.tgz) | 79 | | KenLMs.tar | 4.8 GB | [KenLMs.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/kenlms.tar) | 80 | 81 | 82 | Golos data and models are also available in the hub of pre-trained models, datasets, and containers - DataHub ML Space. You can train the model and deploy it on the high-performance SberCloud infrastructure in [ML Space](https://cloud.ru/) - full-cycle machine learning development platform for DS-teams collaboration based on the Christofari Supercomputer. 83 | 84 | 85 | ## **Evaluation** 86 | 87 | Percents of Word Error Rate for different test sets 88 | 89 | 90 | | Decoder \ Test set | Crowd test | Farfield test | MCV1 dev | MCV1 test | 91 | |-------------------------------------|-----------|----------|-----------|----------| 92 | | Greedy decoder | 4.389 % | 14.949 % | 9.314 % | 11.278 % | 93 | | Beam Search with Common Crawl LM | 4.709 % | 12.503 % | 6.341 % | 7.976 % | 94 | | Beam Search with Golos train set LM | 3.548 % | 12.384 % | - | - | 95 | | Beam Search with Common Crawl and Golos LM | 3.318 % | 11.488 % | 6.4 % | 8.06 % | 96 | 97 | 98 | 1 [Common Voice](https://commonvoice.mozilla.org) - Mozilla's initiative to help teach machines how real people speak. 99 | 100 | ## **Resources** 101 | 102 | [[INTERSPEECH 2021] Golos: Russian Dataset for Speech Research](https://www.isca-speech.org/archive/pdfs/interspeech_2021/karpov21_interspeech.pdf) 103 | 104 | [[habr.com] Golos — самый большой русскоязычный речевой датасет, размеченный вручную, теперь в открытом доступе](https://habr.com/ru/company/sberdevices/blog/559496/) 105 | 106 | [[habr.com] Как улучшить распознавание русской речи до 3% WER с помощью открытых данных](https://habr.com/ru/company/sberdevices/blog/569082/) 107 | 108 | ## **Cite** 109 | Karpov, N., Denisenko, A., Minkin, F. (2021) Golos: Russian Dataset for Speech Research. Proc. Interspeech 2021, 1419-1423, doi: 10.21437/Interspeech.2021-462 110 | ``` 111 | @inproceedings{karpov21_interspeech, 112 | author={Nikolay Karpov and Alexander Denisenko and Fedor Minkin}, 113 | title={{Golos: Russian Dataset for Speech Research}}, 114 | year=2021, 115 | booktitle={Proc. 
Interspeech 2021}, 116 | pages={1419--1423}, 117 | doi={10.21437/Interspeech.2021-462} 118 | } 119 | ``` 120 | 121 | ## **License** 122 | 123 | [English Version](https://github.com/salute-developers/golos/blob/master/license/en_us.pdf) 124 | 125 | [Russian Version](https://github.com/salute-developers/golos/blob/master/license/ru.pdf) 126 | 127 | ## **Contacts** 128 | 129 | Please create a GitHub issue! 130 | 131 | Authors (in alphabetic order): 132 | - Alexander Denisenko 133 | - Angelina Kovalenko 134 | - Fedor Minkin 135 | - Nikolay Karpov 136 | -------------------------------------------------------------------------------- /golos/README_ru.md: -------------------------------------------------------------------------------- 1 | # Датасет Golos 2 | 3 | Golos — это датасет для распознавания речи на русском языке. Он состоит из аудиозаписей речи и транскрипций, полученных с помощью ручной разметки на краудсорсинговой платформе. Общая длительность записей составляет примерно 1240 часов. Все данные и обученные на них акустические модели распознавания речи бесплатны и открыты для скачивания. Также доступны триграммные модели KenLM, подготовленные при помощи русских текстов из открытого корпуса Common Crawl. 4 | 5 | # Содержание 6 | 7 | - [Структура датасета](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Структура-датасета) 8 | - [Скачать](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Скачать) 9 | - [Аудиофайлы в формате opus](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Аудиофайлы-в-формате-opus) 10 | - [Аудиофайлы в формате wav](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Аудиофайлы-в-формате-wav) 11 | - [Акустические и языковые модели](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Акустические-и-языковые-модели) 12 | - [Оценка качества](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Оценка-качества) 13 | - [Полезные ссылки](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Полезные-ссылки) 14 | - [Лицензия](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Лицензия) 15 | - [Контакты](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Контакты) 16 | 17 | 18 | ## **Структура датасета** 19 | 20 | | Домен | Train файлы | Train (часов) | Test файлы | Test (часов) | 21 | |----------------|------------|--------|-------|------| 22 | | Crowd | 979 796 | 1 095 | 9 994 | 11.2 | 23 | | Farfield | 124 003 | 132.4| 1 916 | 1.4 | 24 | | Итого | 1 103 799 | 1 227.4|11 910 | 12.6 | 25 | 26 | --- 27 | 28 | ## **Скачать** 29 | 30 | [MD5 контральные суммы](https://github.com/salute-developers/golos/blob/master/golos/md5sum.txt) 31 | 32 | 33 | ### **Аудиофайлы в формате opus** 34 | 35 | | Archive | Size | Link | 36 | |------------------|------------|----------------------------------------------------------------------------------------------| 37 | | golos_opus.tar | 20.5 GB | [golos_opus.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/golos_opus.tar) | 38 | 39 | --- 40 | 41 | ### **Аудиофайлы в формате wav** 42 | 43 | Файл с транскрипциями записей всего обучающего подмножества доступен в архиве train_crowd9.tar, доступный по ссылке в таблице: 44 | 45 | | Archives | Size | Links | 46 | |-------------------|------------|------------------------------------------------------------------------------------------------------| 47 | | train_farfield.tar| 15.4 GB | 
[train_farfield.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_farfield.tar) | 48 | | train_crowd0.tar | 11 GB | [train_crowd0.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd0.tar) | 49 | | train_crowd1.tar | 14 GB | [train_crowd1.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd1.tar) | 50 | | train_crowd2.tar | 13.2 GB | [train_crowd2.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd2.tar) | 51 | | train_crowd3.tar | 11.6 GB | [train_crowd3.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd3.tar) | 52 | | train_crowd4.tar | 15.8 GB | [train_crowd4.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd4.tar) | 53 | | train_crowd5.tar | 13.1 GB | [train_crowd5.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd5.tar) | 54 | | train_crowd6.tar | 15.7 GB | [train_crowd6.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd6.tar) | 55 | | train_crowd7.tar | 12.7 GB | [train_crowd7.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd7.tar) | 56 | | train_crowd8.tar | 12.2 GB | [train_crowd8.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd8.tar) | 57 | | train_crowd9.tar | 8.08 GB | [train_crowd9.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd9.tar) | 58 | | test.tar | 1.3 GB | [test.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/test.tar) | 59 | 60 | --- 61 | 62 | ### **Акустические и языковые модели** 63 | 64 | Акустическая модель на основе архитектуры [QuartzNet15x5](https://arxiv.org/pdf/1910.10261.pdf) обучена с использованием [NeMo toolkit](https://github.com/NVIDIA/NeMo/tree/r1.0.0b4) 65 | 66 | 67 | Три n-грамные языковые модели (LM) подготовлены с использованием [KenLM Language Model Toolkit](https://kheafield.com/code/kenlm) 68 | 69 | * LM на русских текстах корпуса [Common Crawl](https://commoncrawl.org) 70 | * LM на текстах транскрипций обучающей подвыборки [Golos](https://github.com/salute-developers/golos) 71 | * LM на русских текстах [Common Crawl](https://commoncrawl.org) и транскрипциях [Golos](https://github.com/salute-developers/golos/tree/master/golos) вместе (50/50) 72 | 73 | | Archives | Size | Links | 74 | |--------------------------|------------|-----------------------------------------------------------------------------------------------------------------------| 75 | | QuartzNet15x5_golos.nemo | 68 MB | [QuartzNet15x5_golos.nemo (TO BE UPDATED)](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/QuartzNet15x5_golos.nemo) | 76 | | CitriNet_ru1024bpe.tgz | 541 MB | [CitriNet_ru1024bpe.tgz](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/citrinet/CitriNet_ru1024bpe.tgz) | 77 | | KenLMs.tar | 4.8 GB | [KenLMs.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/kenlms.tar) | 78 | 79 | 80 | 81 | Все данные и модели Golos также доступны в хранилище DataHub ML Space. Там распологаются предобученные модели, датасеты и Docker-образы. 82 | 83 | 84 | ## **Оценка качества** 85 | 86 | Процент ошибки распознавания WER (Word Error Rate) для разных тестовых данных. 
87 | 88 | 89 | | Декодер \ Тестовые данные | Crowd test | Farfield test | MCV1 dev | MCV1 test | 90 | |-------------------------------------|-----------|----------|-----------|----------| 91 | | Greedy decoder | 4.389 % | 14.949 % | 9.314 % | 11.278 % | 92 | | Beam Search + Common Crawl LM | 4.709 % | 12.503 % | 6.341 % | 7.976 % | 93 | | Beam Search + Golos train set LM | 3.548 % | 12.384 % | - | - | 94 | | Beam Search + Common Crawl and Golos LM | 3.318 % | 11.488 % | 6.4 % | 8.06 % | 95 | 96 | 97 | 1 [Common Voice](https://commonvoice.mozilla.org) - проект компании Mozilla по сбору данных для автоматического распознавания речи. 98 | 99 | ## **Полезные ссылки** 100 | 101 | [[arxiv.org] Golos: Russian Dataset for Speech Research](https://arxiv.org/abs/2106.10161) 102 | 103 | [[habr.com] Golos — самый большой русскоязычный речевой датасет, размеченный вручную, теперь в открытом доступе](https://habr.com/ru/company/sberdevices/blog/559496/) 104 | 105 | [[habr.com] Как улучшить распознавание русской речи до 3% WER с помощью открытых данных](https://habr.com/ru/company/sberdevices/blog/569082/) 106 | 107 | ## **Лицензия** 108 | 109 | [Английская версия](https://github.com/salute-developers/golos/blob/master/license/en_us.pdf) 110 | 111 | [Русская версия](https://github.com/salute-developers/golos/blob/master/license/ru.pdf) 112 | 113 | ## **Контакты** 114 | 115 | Создавайте GitHub issue! 116 | 117 | Авторы а алфавитном порядке: 118 | - Александр Денисенко 119 | - Ангелина Коваленко 120 | - Николай Карпов 121 | - Федор Минькин 122 | -------------------------------------------------------------------------------- /golos/examples/Dockerfile: -------------------------------------------------------------------------------- 1 | # Distributed on an "AS IS" BASIS, 2 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 3 | # See the License for the specific language governing permissions and 4 | # limitations under the License. 5 | 6 | FROM nvcr.io/nvidia/pytorch:21.03-py3 7 | 8 | WORKDIR /workspace 9 | 10 | RUN git clone --branch v1.0.2 https://github.com/NVIDIA/NeMo 11 | 12 | WORKDIR /workspace/NeMo 13 | RUN ./reinstall.sh 14 | RUN conda install -c anaconda swig 15 | WORKDIR /workspace/NeMo/scripts/asr_language_modeling/ngram_lm 16 | RUN sh install_beamsearch_decoders.sh 17 | 18 | 19 | RUN mkdir /workspace/models 20 | WORKDIR /workspace/models 21 | 22 | RUN echo "Download and extract QuartzNet15x5_golos.nemo ..." 23 | RUN wget https://sc.link/ZMv -O QuartzNet15x5_golos.nemo 24 | 25 | RUN echo "Download and extract KenLMs ..." 26 | RUN wget https://sc.link/YL0 -O kenlms.tar 27 | RUN tar -xvf kenlms.tar 28 | 29 | 30 | WORKDIR /workspace/golos 31 | 32 | CMD ["bash"] 33 | 34 | #CMD python ./examples/infer.py /workspace/models/QuartzNet15x5_golos.nemo ./examples/data/example1.json -lm /workspace/models/kenlms/lm_commoncrawl.binary 35 | 36 | -------------------------------------------------------------------------------- /golos/examples/build_and_run_docker.sh: -------------------------------------------------------------------------------- 1 | # build container 2 | 3 | export CURRENT_DIR=$PWD 4 | 5 | docker build -t golos_image . 
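# Run step below: the repository root (one directory above examples/) is mounted
# into the freshly built golos_image at /workspace/golos.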
6 | 7 | 8 | #run conatiner 9 | # requrements (requires 10 | docker run --gpus all \ 11 | -v /$CURRENT_DIR/..:/workspace/golos \ 12 | -it \ 13 | --shm-size=8g \ 14 | --ulimit memlock=-1 --ulimit \ 15 | stack=67108864 golos_image 16 | -------------------------------------------------------------------------------- /golos/examples/data/001ce26c07c20eaa0d666b824c6c6924.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/golos/examples/data/001ce26c07c20eaa0d666b824c6c6924.wav -------------------------------------------------------------------------------- /golos/examples/data/example1.json: -------------------------------------------------------------------------------- 1 | {"audio_filepath": "/workspace/golos/examples/data/001ce26c07c20eaa0d666b824c6c6924.wav", "text": "шестьдесят тысяч тенге сколько будет стоить"} 2 | -------------------------------------------------------------------------------- /golos/examples/infer.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import os 4 | import torch 5 | 6 | import nemo.collections.asr as nemo_asr 7 | 8 | def read_file_list(manifest): 9 | manifest_path, _ = os.path.split(manifest) 10 | files, texts = [], [] 11 | with open(manifest, "r") as input_file: 12 | for line in input_file: 13 | as_dict = json.loads(line.rstrip('\n')) 14 | files.append(os.path.join(manifest_path, as_dict["audio_filepath"])) 15 | texts.append(as_dict["text"]) 16 | return files, texts 17 | 18 | def infer_beam_search_lm(files, asr_model, beam_search_lm): 19 | hyps = [] 20 | logits = torch.tensor(asr_model.transcribe(files, batch_size=20, logprobs=True)) 21 | log_probs_length = torch.tensor([logit.shape[0] for logit in logits]) 22 | logits_tensor = torch.nn.utils.rnn.pad_sequence(logits, batch_first=True) 23 | for j in range(logits_tensor.shape[0]): 24 | best_hyp = beam_search_lm.forward(log_probs = logits_tensor[j].unsqueeze(0), 25 | log_probs_length=log_probs_length[j].unsqueeze(0))[0][0][1] 26 | hyps.append(best_hyp) 27 | return hyps 28 | 29 | def infer_greedy(files, asr_model): 30 | transcripts = asr_model.transcribe(paths2audio_files=files, batch_size=20) 31 | return transcripts 32 | 33 | def print_mistakes(hyps, refs): 34 | for hypo, reference in zip(hyps, refs): 35 | if hypo != reference: 36 | print(' h', hypo, '\n r', reference) 37 | 38 | def _parse_args(): 39 | parser = argparse.ArgumentParser(description='Run inference using NeMo checkpoint') 40 | parser.add_argument('asr_ckpt', help='Path to ASR NeMo checkpoint file (.nemo)') 41 | parser.add_argument('manifest', help='Path to manifest where each line is a json with' 42 | ' transcription (.jsonl)') 43 | parser.add_argument('-lm', help='Path to KenLM binary filem (.binary)', default=None) 44 | return parser.parse_args() 45 | 46 | 47 | if __name__ == '__main__': 48 | args = _parse_args() 49 | asr_model = nemo_asr.models.EncDecCTCModel.restore_from(args.asr_ckpt) 50 | asr_model.cuda() 51 | 52 | files, texts = read_file_list(args.manifest) 53 | 54 | hyps = infer_greedy(files, asr_model) 55 | print_mistakes(hyps, texts) 56 | print("Greedy WER:", nemo_asr.metrics.wer.word_error_rate(hyps, texts)) 57 | 58 | if args.lm: 59 | beam_search_lm = nemo_asr.modules.BeamSearchDecoderWithLM( 60 | vocab=list(asr_model.decoder.vocabulary), 61 | beam_width=16, 62 | alpha=2, beta=1.5, 63 | lm_path=args.lm, 64 | num_cpus=1, 65 | 
cutoff_prob=1.0, cutoff_top_n=40, 66 | input_tensor=True) 67 | 68 | hyps = infer_beam_search_lm(files, asr_model, beam_search_lm) 69 | print_mistakes(hyps, texts) 70 | print("Beam search WER:", nemo_asr.metrics.wer.word_error_rate(hyps, texts)) 71 | -------------------------------------------------------------------------------- /golos/examples/readme.md: -------------------------------------------------------------------------------- 1 | # How to run 2 | 3 | If you have properly configured NeMo environment you can run example from your environment, or you can run example from docker image. 4 | 5 | ## Run from docker 6 | 7 | >> Minimal driver requirements: NVIDIA Driver release 460.32.03 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51(or later R450). ) see https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-03.html#rel_21-03 8 | 9 | ```bash 10 | cd /examples 11 | sh build_and_run_docker.sh 12 | ``` 13 | 14 | After container is started 15 | 16 | ``` 17 | python ./examples/infer.py /workspace/models/QuartzNet15x5_golos.nemo ./examples/data/example1.json -lm /workspace/models/kenlms/lm_commoncrawl.binary 18 | ``` -------------------------------------------------------------------------------- /golos/md5sum.txt: -------------------------------------------------------------------------------- 1 | 453d4f8ef1d8adcc9d87020712f973c3 golos_opus.tar 2 | 7531a7e76e04e794c5847d8e58b3154c train_farfield.tar 3 | f88b397fc5dbc55ab326cd2360d5a0b2 train_crowd0.tar 4 | 19db1416da982641f1565ec701939f39 train_crowd1.tar 5 | d032dfb48d548631400f3a78e24756f8 train_crowd2.tar 6 | 83aeb8d3f8b93968b83224606801aceb train_crowd3.tar 7 | df177493e1c1c7cbb0ac7f13f79dbc06 train_crowd4.tar 8 | 724e4af231b3e226b32ca609db30862c train_crowd5.tar 9 | bd6e18b94ad81aebb99039dedb71de25 train_crowd6.tar 10 | 560960f748b7ac19bdb477dc9245cfc3 train_crowd7.tar 11 | bbc6814ca8377aaa75a4a987adad0b94 train_crowd8.tar 12 | eee4ff6dbebf77c738c69713ace7b768 train_crowd9.tar 13 | 284a6e54ca7bf0dd599db1bea030775c test.tar 14 | 5bdf7ce2ba610db4d309413ca3cb4064 QuartzNet15x5_golos.nemo 15 | 9ca0d96e47d5e2fb6f87ebec17066108 kenlms.tar -------------------------------------------------------------------------------- /license/en_us.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/license/en_us.pdf -------------------------------------------------------------------------------- /license/ru.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/license/ru.pdf --------------------------------------------------------------------------------
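The downloaded Golos archives can be checked against `golos/md5sum.txt` before unpacking; a minimal sketch, assuming the archives sit next to the checksum file (file locations are an assumption):

```python
# Verify downloaded archives against golos/md5sum.txt; paths are assumptions.
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

base = Path("golos")
for line in (base / "md5sum.txt").read_text().splitlines():
    expected, name = line.split()
    archive = base / name
    if archive.exists():
        print(name, "OK" if md5_of(archive) == expected else "MISMATCH")
    else:
        print(name, "not downloaded, skipped")
```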