├── .github
│   └── workflows
│       └── semgrep.yml
├── README.md
├── dusha
│   ├── README.md
│   ├── data_processing
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── processing.py
│   │   ├── requirements.txt
│   │   └── utils
│   │       ├── __init__.py
│   │       ├── aggregation.py
│   │       ├── calculate_features.py
│   │       ├── datacls.py
│   │       └── dawidskene.py
│   ├── demo
│   │   ├── model
│   │   │   ├── podcasts_finetune_old_w_lr_1e-3_try1
│   │   │   ├── train.config
│   │   │   └── train.jsonl
│   │   └── recognise_emotions_online.ipynb
│   └── experiments
│       ├── .gitignore
│       ├── Dockerfile
│       ├── README.md
│       ├── __init__.py
│       ├── configs
│       │   ├── base.config
│       │   ├── crowd_large.config
│       │   ├── crowd_medium.config
│       │   ├── crowd_small.config
│       │   ├── crowd_unconfident.config
│       │   ├── data.config
│       │   ├── dusha_large.config
│       │   ├── podcast_large.config
│       │   └── podcast_tune.config
│       ├── core
│       │   ├── __init__.py
│       │   ├── dataset.py
│       │   ├── learner.py
│       │   ├── metrics.py
│       │   ├── model.py
│       │   └── utils.py
│       ├── inf.py
│       ├── requirements.txt
│       └── train.py
├── golos
│   ├── README.md
│   ├── README_ru.md
│   ├── examples
│   │   ├── Dockerfile
│   │   ├── build_and_run_docker.sh
│   │   ├── data
│   │   │   ├── 001ce26c07c20eaa0d666b824c6c6924.wav
│   │   │   └── example1.json
│   │   ├── infer.py
│   │   └── readme.md
│   └── md5sum.txt
└── license
    ├── en_us.pdf
    └── ru.pdf
/.github/workflows/semgrep.yml:
--------------------------------------------------------------------------------
1 | # Name of this GitHub Actions workflow.
2 | name: Semgrep
3 |
4 | on:
5 | # Scan changed files in PRs (diff-aware scanning):
6 | pull_request: {}
7 | # Scan on-demand through GitHub Actions interface:
8 | workflow_dispatch: {}
9 | # Scan mainline branches and report all findings:
10 | push:
11 | branches: ["master", "main"]
12 | # Schedule the CI job (this method uses cron syntax):
13 | schedule:
14 | - cron: '20 17 * * *' # Sets Semgrep to scan every day at 17:20 UTC.
15 | # It is recommended to change the schedule to a random time.
16 |
17 | permissions:
18 | actions: read
19 | contents: read
20 | security-events: write
21 |
22 | jobs:
23 | semgrep:
24 | # User-definable name of this GitHub Actions job:
25 | name: semgrep/ci
26 | # If you are self-hosting, change the following `runs-on` value:
27 | runs-on: ubuntu-latest
28 |
29 | container:
30 | # A Docker image with Semgrep installed. Do not change this.
31 | image: returntocorp/semgrep
32 |
33 | # Skip any PR created by dependabot to avoid permission issues:
34 | if: (github.actor != 'dependabot[bot]')
35 |
36 | steps:
37 | # Fetch project source with GitHub Actions Checkout.
38 | - uses: actions/checkout@v3
39 | # Run the "semgrep ci" command on the command line of the docker image.
40 | - run: semgrep ci --sarif --output=semgrep.sarif --exclude-rule yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha.third-party-action-not-pinned-to-commit-sha
41 | env:
42 | # Add the rules that Semgrep uses by setting the SEMGREP_RULES environment variable.
43 | SEMGREP_RULES: p/default # more at semgrep.dev/explore
44 |
45 | - name: Upload SARIF file for GitHub Advanced Security Dashboard
46 | uses: github/codeql-action/upload-sarif@v2
47 | with:
48 | sarif_file: semgrep.sarif
49 | if: always()
50 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Open and Free Datasets and Models by Salute Developers
2 |
3 |
4 | ## [Golos dataset](https://github.com/salute-developers/golos/tree/master/golos#golos-dataset)
5 | Russian corpus suitable for speech research.
6 |
7 | ## [Dusha dataset](https://github.com/salute-developers/golos/tree/master/dusha#dusha-dataset)
8 | Bi-modal corpus suitable for speech emotion recognition tasks.
9 |
10 |
11 | ## **License**
12 |
13 | [English Version](https://github.com/salute-developers/golos/blob/master/license/en_us.pdf)
14 |
15 | [Russian Version](https://github.com/salute-developers/golos/blob/master/license/ru.pdf)
--------------------------------------------------------------------------------
/dusha/README.md:
--------------------------------------------------------------------------------
1 | # Dusha dataset
2 |
3 | Dusha is a bi-modal corpus suitable for speech emotion recognition (SER) tasks. The dataset consists of about 300 000 audio recordings of Russian speech, their transcripts, and emotional labels. The corpus contains approximately 350 hours of data. Four basic emotions that usually appear in a dialog with a virtual assistant were selected: Happiness (Positive), Sadness, Anger, and Neutral.
4 |
5 | # Table of contents
6 |
7 | - [Dataset Structure](https://github.com/salute-developers/golos/tree/master/dusha#dataset-structure)
8 | - [Downloads](https://github.com/salute-developers/golos/tree/master/dusha#downloads)
9 | - [Raw Data Processing](https://github.com/salute-developers/golos/tree/master/dusha/data_processing)
10 | - [Run and Evaluate Experiments](https://github.com/salute-developers/golos/tree/master/dusha/experiments)
11 | - [Resources](https://github.com/salute-developers/golos/tree/master/dusha#resources)
12 | - [License](https://github.com/salute-developers/golos/tree/master/dusha#license)
13 | - [Contacts](https://github.com/salute-developers/golos/tree/master/dusha#contacts)
14 |
15 |
16 | ## **Dataset structure**
17 |
18 | | Domain | Number of Files | Duration (Hr.) | Unique Speakers |
19 | |----------------|------------|--------|---------|
20 | | Crowd | 201 850 | 255.7 | 2068 |
21 | | Podcast | 102 113 | 90.9 | 6240 |
22 | | Total | 303 963 | 346.6 | 8308 |
23 |
24 | ## **Downloads**
25 |
26 | Due to licensing restrictions, we cannot provide the audio files of the podcasts. Instead, we provide the whole set of precalculated features and a list of links to the podcasts with timings, so you can listen to the original sound (see also issue https://github.com/salute-developers/golos/issues/1).
27 |
28 | | Archives | Size | Links |
29 | |-------------------|------------|--------------------------------------------------------------------------------------------------|
30 | | crowd.tar | 28 GB | [crowd.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/dusha/crowd.tar) |
31 | | podcast.tar | 360 MB | [podcast.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/dusha/podcast.tar) |
32 | | features.tar | 30 GB | [features.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/dusha/features.tar) |
33 | | paper_setups.tgz | 16 MB | [paper_setups.tgz](https://cdn.chatwm.opensmodel.sberdevices.ru/dusha/paper_setups.tgz) |
34 |
35 |
36 | With this repository you can perform [raw data processing](https://github.com/salute-developers/golos/tree/master/dusha/data_processing) or [run and evaluate experiments](https://github.com/salute-developers/golos/tree/master/dusha/experiments).
37 |
38 | ## **Resources**
39 |
40 | [[arxiv.org] Large Raw Emotional Dataset with Aggregation Mechanism](https://arxiv.org/abs/2212.12266)
41 |
42 | ## **License**
43 |
44 | [English Version](https://github.com/salute-developers/golos/blob/master/license/en_us.pdf)
45 |
46 | [Russian Version](https://github.com/salute-developers/golos/blob/master/license/ru.pdf)
47 |
48 | ## **Contacts**
49 |
50 | Please create a GitHub issue!
51 |
52 | Authors (in alphabetical order):
53 | - Artem Sokolov
54 | - Fedor Minkin
55 | - Nikita Savushkin
56 | - Nikolay Karpov
57 | - Oleg Kutuzov
58 | - Vladimir Kondratenko
59 |
--------------------------------------------------------------------------------
/dusha/data_processing/README.md:
--------------------------------------------------------------------------------
1 | ## Raw data processing
2 |
3 | To process the data, download the raw Dusha dataset (crowd.tar, podcast.tar), untar it to DATASET_PATH, and run the processing script:
4 |
5 | python processing.py -dataset_path DATASET_PATH
6 |
7 | The script processes the sound files, creates a folder in DATASET_PATH with precalculated features, aggregates the labels, and writes a manifest file in JSONL format.
8 |
9 |
10 | If you want to change the aggregation threshold, run the processing with the -threshold flag:
11 |
12 | python processing.py -dataset_path DATASET_PATH -threshold THRESHOLD
13 |
14 | You can also use TSV format for the manifest file:
15 |
16 | python processing.py -dataset_path DATASET_PATH -tsv
17 |
18 | To force recalculation of the features:
19 |
20 | python processing.py -dataset_path DATASET_PATH -rf
21 |
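22 | The resulting `train` and `test` manifests contain one record per utterance with the fields `id`, `tensor`, `wav_length`, `label` and `emotion` (see `utils/datacls.py` and `utils/aggregation.py`); `label` is the integer class index, `emotion` the class name, and `tensor` points to a precalculated mel-spectrogram stored as a NumPy array. Below is a minimal sketch of reading a JSONL manifest and one feature file (the paths are illustrative and depend on your DATASET_PATH and the chosen threshold):
23 | 
24 |     import json
25 |     from pathlib import Path
26 | 
27 |     import numpy as np
28 | 
29 |     manifest = Path("DATASET_PATH/processed_dataset_090/train/train.jsonl")
30 |     with manifest.open(encoding="utf-8") as f:
31 |         rows = [json.loads(line) for line in f]
32 | 
33 |     row = rows[0]
34 |     # `tensor` is a path relative to the manifest folder, e.g. ../../features/<hash_id>.npy
35 |     features = np.load(manifest.parent / row["tensor"])
36 |     print(row["emotion"], row["label"], features.shape)  # features have shape (1, 64, time_steps)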
--------------------------------------------------------------------------------
/dusha/data_processing/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/dusha/data_processing/__init__.py
--------------------------------------------------------------------------------
/dusha/data_processing/processing.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 |
3 | import click
4 | import numpy as np
5 | from utils.aggregation import aggregate_data, read_data_markup
6 | from utils.calculate_features import load_features
7 |
8 |
9 | @click.command()
10 | @click.option(
11 | "-dataset_path",
12 | "--dataset_path",
13 | required=True,
14 | type=click.Path(exists=True),
15 | help="dataset_path",
16 | )
17 | @click.option(
18 | "--use_tsv", "-tsv", is_flag=True, default=False, help="use tsv to read/write"
19 | )
20 | @click.option(
21 | "--recalculate_features",
22 | "-rf",
23 | is_flag=True,
24 | default=False,
25 | help="recalculate features",
26 | )
27 | @click.option(
28 | "--threshold",
29 | "-threshold",
30 | default=0.9,
31 | help="Dawidskene threshold",
32 | show_default=True,
33 | )
34 | def processing(
35 | dataset_path: str, use_tsv: bool, recalculate_features: bool, threshold: float
36 | ) -> None:
37 | """
38 |     Process raw Dusha data for training: calculate mel-spectrogram features and aggregate emotion labels.
39 | """
40 |     if threshold > 1 or threshold < 0:
41 |         raise ValueError("threshold must be between 0 and 1")
42 |
43 | np.seterr(divide="ignore")
44 |
45 | public_data = Path(dataset_path)
46 | result_dir = public_data / f"processed_dataset_0{int(threshold*100)}"
47 |
48 | path_names = ["train", "aggregated_dataset", "test"]
49 | for path_name in path_names:
50 | (result_dir / path_name).mkdir(parents=True, exist_ok=True)
51 |
52 | (public_data / "features").mkdir(parents=True, exist_ok=True)
53 |
54 | data_types = ["crowd_train", "crowd_test", "podcast_train", "podcast_test"]
55 | for data_type in data_types:
56 | wavs_path = public_data / data_type / "wavs"
57 | data = read_data_markup(
58 | dataset_path=public_data / data_type / ("raw_" + data_type),
59 | use_tsv=use_tsv,
60 | )
61 | wavs_names = {Path(row.audio_path).stem for row in data}
62 | load_features(
63 | wavs_path=wavs_path,
64 | wavs_names=wavs_names,
65 | result_dir=public_data,
66 | dataset_name=data_type,
67 | recalculate_feature=recalculate_features,
68 | )
69 |
70 | aggregate_data(public_data, result_dir, use_tsv, threshold)
71 |
72 |
73 | if __name__ == "__main__":
74 | processing() # pylint: disable=no-value-for-parameter
75 |
--------------------------------------------------------------------------------
/dusha/data_processing/requirements.txt:
--------------------------------------------------------------------------------
1 | pandas==1.3.5
2 | crowd-kit==1.0.0
3 | click==8.0.4
4 | tqdm==4.62.3
5 | numpy==1.21.5
6 | librosa==0.8.1
--------------------------------------------------------------------------------
/dusha/data_processing/utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/dusha/data_processing/utils/__init__.py
--------------------------------------------------------------------------------
/dusha/data_processing/utils/aggregation.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | from enum import Enum
4 | from pathlib import Path
5 | from typing import Dict, List
6 |
7 | from utils.datacls import (
8 | AggDataclass,
9 | DataForExp,
10 | DawidSkeneEntryDataclass,
11 | MarkupDataclass,
12 | )
13 | from utils.dawidskene import get_dawidskene_pred
14 |
15 | HEADER = "\t".join(
16 | [
17 | "hash_id",
18 | "wav_path",
19 | "duration",
20 | "emotion",
21 | "golden_emo",
22 | "speaker_text",
23 | "speaker_emo",
24 | "source_id",
25 | ]
26 | )
27 |
28 | HEADER_EXP = "\t".join(["id", "tensor", "wav_length", "label", "emotion"])
29 |
30 |
31 | class Emotion(Enum):
32 | ANGRY = 0
33 | SAD = 1
34 | NEUTRAL = 2
35 | POSITIVE = 3
36 |
37 |
38 | def read_data_markup(dataset_path: Path, use_tsv: bool) -> List[MarkupDataclass]:
39 | markup_data = []
40 | if use_tsv:
41 | with open(
42 | dataset_path.parent / (dataset_path.stem + ".tsv"), "r", encoding="utf-8"
43 | ) as file:
44 | headers = file.readline().rstrip("\r\n").split("\t")
45 | for line in file:
46 | line_data = line.strip("\r\n").split("\t")
47 | string = dict(zip(headers, line_data))
48 | row = MarkupDataclass(**string)
49 | markup_data.append(row)
50 | else:
51 | with open(
52 | dataset_path.parent / (dataset_path.stem + ".jsonl"), "r", encoding="utf-8"
53 | ) as file:
54 | for line in file:
55 | row = MarkupDataclass(**json.loads(line))
56 | markup_data.append(row)
57 | return markup_data
58 |
59 |
60 | def agg_data_to_file(
61 | file_path: Path, agg_data: List[AggDataclass], use_tsv: bool
62 | ) -> None:
63 | if use_tsv:
64 | with open(
65 | file_path.parent / (file_path.stem + ".tsv"), "w", encoding="utf-8"
66 | ) as file:
67 | print(HEADER, file=file, end=os.linesep)
68 | for row in agg_data:
69 | print("\t".join(row.__dict__.values()), file=file, end=os.linesep)
70 | else:
71 | with open(
72 | file_path.parent / (file_path.stem + ".jsonl"), "w", encoding="utf-8"
73 | ) as file:
74 | for row in agg_data:
75 | line = json.dumps(row.__dict__, ensure_ascii=False)
76 | print(line, file=file, end=os.linesep)
77 |
78 |
79 | def exp_data_to_file(
80 | file_path: Path, exp_data: List[DataForExp], use_tsv: bool
81 | ) -> None:
82 | if use_tsv:
83 | with open(
84 | file_path.parent / (file_path.stem + ".tsv"), "w", encoding="utf-8"
85 | ) as file:
86 | print(HEADER_EXP, file=file, end=os.linesep)
87 | for row in exp_data:
88 | line = "\t".join(list(map(str, row.__dict__.values())))
89 | print(line, file=file, end=os.linesep)
90 | else:
91 | with open(
92 | file_path.parent / (file_path.stem + ".jsonl"), "w", encoding="utf-8"
93 | ) as file:
94 | for row in exp_data:
95 | line = json.dumps(row.__dict__, ensure_ascii=False)
96 | print(line, file=file, end=os.linesep)
97 |
98 |
99 | def filter_data(
100 | markup_data: List[MarkupDataclass],
101 | aggregated_data_dict: Dict[str, str],
102 | dataset: str,
103 | ) -> List[AggDataclass]:
104 | agg_data = []
105 | used_wavs = set()
106 | for row in markup_data:
107 | if row.hash_id in used_wavs:
108 | continue
109 | if row.hash_id in aggregated_data_dict:
110 | good_agg_row = AggDataclass(
111 | hash_id=row.hash_id,
112 | audio_path=str(Path("..", "..", dataset, row.audio_path)),
113 | duration=row.duration,
114 | emotion=aggregated_data_dict[row.hash_id],
115 | golden_emo=row.golden_emo,
116 | speaker_text=row.speaker_text,
117 | speaker_emo=row.speaker_emo,
118 | source_id=row.source_id,
119 | )
120 | agg_data.append(good_agg_row)
121 | used_wavs.add(row.hash_id)
122 | return agg_data
123 |
124 |
125 | def make_exp_data(agg_data: List[AggDataclass]) -> List[DataForExp]:
126 | exp_data = []
127 | for row in agg_data:
128 | if (
129 | not isinstance(row.golden_emo, str) or row.golden_emo == ""
130 | ) and row.emotion != "other":
131 | exp_row = DataForExp(
132 | id=row.hash_id,
133 | tensor=str(Path("..", "..", "features", row.hash_id + ".npy")),
134 | wav_length=row.duration,
135 | label=Emotion[row.emotion.upper()].value,
136 | emotion=row.emotion,
137 | )
138 | exp_data.append(exp_row)
139 | return exp_data
140 |
141 |
142 | def aggregate_data(
143 | data_path: Path, out_path: Path, use_tsv: bool, dawidskene_threshold: float
144 | ) -> None:
145 |
146 | markup_data = ["podcast_test", "podcast_train", "crowd_train", "crowd_test"]
147 | data = {}
148 | all_data = []
149 | for dataset in markup_data:
150 | data[dataset] = read_data_markup(
151 | dataset_path=Path(data_path, dataset, "raw_" + dataset),
152 | use_tsv=use_tsv,
153 | )
154 | all_data += data[dataset]
155 |
156 | data_for_agg = []
157 | for row in all_data:
158 | row_for_agg = DawidSkeneEntryDataclass(
159 | task=row.hash_id,
160 | worker=row.annotator_id,
161 | label=row.annotator_emo,
162 | )
163 | data_for_agg.append(row_for_agg)
164 |
165 | aggregated_data = get_dawidskene_pred(
166 | data=data_for_agg,
167 | threshold=dawidskene_threshold,
168 | meta_path=data_path / "meta.tsv",
169 | )
170 |
171 | aggregated_data_dict = {row.task: row.pred for row in aggregated_data}
172 |
173 | exp_data = {}
174 | for dataset in markup_data:
175 | agg_data = filter_data(
176 | markup_data=data[dataset],
177 | aggregated_data_dict=aggregated_data_dict,
178 | dataset=dataset,
179 | )
180 | exp_data[dataset] = make_exp_data(agg_data=agg_data)
181 | exp_data_to_file(
182 | file_path=out_path / dataset.rsplit("_", maxsplit=1)[-1] / dataset,
183 | exp_data=exp_data[dataset],
184 | use_tsv=use_tsv,
185 | )
186 | agg_data_to_file(
187 | file_path=out_path / "aggregated_dataset" / dataset,
188 | agg_data=agg_data,
189 | use_tsv=use_tsv,
190 | )
191 | exp_data_to_file(
192 | file_path=out_path / "train" / "train",
193 | exp_data=exp_data["podcast_train"] + exp_data["crowd_train"],
194 | use_tsv=use_tsv,
195 | )
196 | exp_data_to_file(
197 | file_path=Path(out_path / "test" / "test"),
198 | exp_data=exp_data["podcast_test"] + exp_data["crowd_test"],
199 | use_tsv=use_tsv,
200 | )
201 |
--------------------------------------------------------------------------------
/dusha/data_processing/utils/calculate_features.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 | from typing import List, Set
3 |
4 | import librosa
5 | import numpy as np
6 | from tqdm import tqdm
7 | from utils.datacls import DataWithFeaturesEntryclass
8 |
9 |
10 | def create_features(
11 | data: List[DataWithFeaturesEntryclass],
12 | wavs_names: Set[str],
13 | features_dump_path: Path,
14 | dataset_name: str,
15 | recalculate_feature: bool,
16 | hop_length_coef: float = 0.01,
17 | win_length_coef: float = 0.02,
18 | sample_rate: int = 16000,
19 | n_mels: int = 64,
20 | ) -> None:
21 | """
22 | As an input all models use standard speech features:
23 | 64 Mel-filterbank calculated from 20ms windows with a 10ms overlap.
24 | """
25 | if recalculate_feature:
26 | if len(data) != len(wavs_names):
27 | print(
28 | f"{len(wavs_names) - len(data)} wav files are missing for {dataset_name}"
29 | )
30 | hop_length = int(sample_rate * hop_length_coef)
31 | win_length = int(sample_rate * win_length_coef)
32 | for row in tqdm(data):
33 |         audio, rate = librosa.load(row.wav_path, sr=sample_rate)
34 |         if len(audio) != 0:
35 |             spec = librosa.feature.melspectrogram(
36 |                 y=audio,
37 |                 sr=rate,
38 |                 hop_length=hop_length,
39 |                 n_fft=win_length,
40 |                 n_mels=n_mels,
41 |             )
42 |         else:
43 |             raise ValueError(f"Empty audio file: {row.wav_path}")
44 | mel_spec = librosa.power_to_db(spec, ref=np.max)
45 | np.save(features_dump_path / f"{row.wav_id}.npy", mel_spec[None])
46 | print(
47 | f"({len(data)}/{len(wavs_names)}) features have been calculated for {dataset_name}"
48 | )
49 | else:
50 |         ready_features = {elm.stem for elm in features_dump_path.glob("*.npy")}
51 |         wav_to_features = {wav for wav in wavs_names if wav not in ready_features}
52 |         data_to_calc = [row for row in data if row.wav_id in wav_to_features]
53 | 
54 |         if len(data_to_calc) != len(wav_to_features):
55 |             print(
56 |                 f"{len(wav_to_features) - len(data_to_calc)} wav files are missing for {dataset_name}"
57 |             )
58 | 
59 |         if not data_to_calc:
60 |             print(
61 |                 f"All({len(wavs_names & ready_features)}/{len(wavs_names)}) features have been calculated for {dataset_name}"
62 |             )
63 |             return
64 | 
65 |         hop_length = int(sample_rate * hop_length_coef)
66 |         win_length = int(sample_rate * win_length_coef)
67 |         for row in tqdm(data_to_calc):
68 |             audio, rate = librosa.load(row.wav_path, sr=sample_rate)
69 |             if len(audio) != 0:
70 |                 spec = librosa.feature.melspectrogram(
71 |                     y=audio,
72 |                     sr=rate,
73 |                     hop_length=hop_length,
74 |                     n_fft=win_length,
75 |                     n_mels=n_mels,
76 |                 )
77 |             else:
78 |                 raise ValueError(f"Empty audio file: {row.wav_path}")
79 |             mel_spec = librosa.power_to_db(spec, ref=np.max)
80 |             np.save(features_dump_path / f"{row.wav_id}.npy", mel_spec[None])
81 |
82 |
83 | def load_features(
84 | wavs_path: Path,
85 | wavs_names: Set[str],
86 | result_dir: Path,
87 | dataset_name: str,
88 | recalculate_feature: bool,
89 | ) -> None:
90 | wavs = []
91 | for elm in wavs_path.glob("*.wav"):
92 | wavs.append(DataWithFeaturesEntryclass(wav_path=str(elm), wav_id=elm.stem))
93 | create_features(
94 | data=wavs,
95 | wavs_names=wavs_names,
96 | features_dump_path=result_dir / "features",
97 | dataset_name=dataset_name,
98 | recalculate_feature=recalculate_feature,
99 | )
100 |
--------------------------------------------------------------------------------
/dusha/data_processing/utils/datacls.py:
--------------------------------------------------------------------------------
1 | from dataclasses import dataclass
2 | from typing import Any
3 |
4 | # -----------------------------------------------------------------------------
5 | # AGGREGATION
6 | # -----------------------------------------------------------------------------
7 |
8 |
9 | @dataclass
10 | class DawidSkeneEntryDataclass:
11 | task: str
12 | worker: str
13 | label: Any
14 |
15 |
16 | @dataclass
17 | class DawidSkeneResultDataclass:
18 | task: str
19 | pred: str
20 |
21 |
22 | @dataclass
23 | class MarkupDataclass:
24 | hash_id: str
25 | audio_path: str
26 | duration: str
27 | annotator_emo: str
28 | golden_emo: str
29 | speaker_text: str
30 | speaker_emo: str
31 | source_id: str
34 | annotator_id: str
35 |
36 |
37 | @dataclass
38 | class AggDataclass:
39 | hash_id: str
40 | audio_path: str
41 | duration: str
42 | emotion: str
43 | golden_emo: str
44 | speaker_text: str
45 | speaker_emo: str
46 | source_id: str
47 |
48 |
49 | # -----------------------------------------------------------------------------
50 | # FEATURES
51 | # -----------------------------------------------------------------------------
52 |
53 |
54 | @dataclass
55 | class DataWithFeaturesEntryclass:
56 | wav_path: str
57 | wav_id: str
58 |
59 |
60 | # -----------------------------------------------------------------------------
61 | # EXP
62 | # -----------------------------------------------------------------------------
63 |
64 |
65 | @dataclass
66 | class DataForExp:
67 | id: str
68 | tensor: str
69 | wav_length: str
70 | label: int
71 | emotion: str
72 |
--------------------------------------------------------------------------------
/dusha/data_processing/utils/dawidskene.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 | from typing import List
3 |
4 | import pandas as pd
5 | from crowdkit.aggregation import DawidSkene as CrowdKitDawidSkene
6 | from utils.datacls import DawidSkeneEntryDataclass, DawidSkeneResultDataclass
7 |
8 |
9 | def get_dawidskene_pred(
10 | data: List[DawidSkeneEntryDataclass],
11 | threshold: float,
12 | meta_path: Path,
13 | n_iter: int = 100,
14 | ) -> List[DawidSkeneResultDataclass]:
15 | labels = {row.label for row in data}
16 |     assert "task" not in labels, 'Labels cannot contain the name "task"!'
17 | aggregated_labels = CrowdKitDawidSkene(n_iter=n_iter).fit_predict_proba(
18 | pd.DataFrame(data)
19 | )
20 | aggregated_labels.to_csv(meta_path, sep="\t")
21 |
22 | aggregated_labels_list = aggregated_labels.reset_index().to_dict("records")
23 | aggregated_data = []
24 | for row in aggregated_labels_list:
25 |         tmp_dict = {val: key for key, val in row.items() if key in labels}  # probability -> label
26 |         max_item_proba = max(tmp_dict)  # highest label probability for this task
27 |         if max_item_proba >= threshold:
28 |             key_with_max_value = tmp_dict[max_item_proba]  # label with the highest probability
29 | aggregated_row = DawidSkeneResultDataclass(
30 | task=row["task"],
31 | pred=key_with_max_value,
32 | )
33 | aggregated_data.append(aggregated_row)
34 | return aggregated_data
35 |
--------------------------------------------------------------------------------
/dusha/demo/model/podcasts_finetune_old_w_lr_1e-3_try1:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/dusha/demo/model/podcasts_finetune_old_w_lr_1e-3_try1
--------------------------------------------------------------------------------
/dusha/demo/model/train.config:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 | from core.dataset import MelEmotionsDataset, get_augm_func, adaptive_padding_collate_fn, LengthWeightedSampler
3 | from core.model import ConvSelfAttentionMobileNet
4 | from core.utils import load_jsonl_as_df
5 | from torch.utils.data import DataLoader
6 |
7 |
8 | base_path = Path('/raid/okutuzov/dusha_data_new_2/processed_dataset_0.9')
9 | train_manifest_path = base_path / 'train' / 'podcast_train.jsonl'
10 | val_manifest_path = base_path / 'test' / 'podcast_test.jsonl'
11 |
12 | pt_model_path = Path('/raid/kondrat/dusha_experiments_try2/agg_0.9/crowd_lr_1e-3_try1/crowd_lr_1e-3_try1')
13 |
14 | batch_size = 64
15 | epoch_count = 100
16 | learning_rate = 1e-3
17 | optimizer_step = 5
18 | optimizer_gamma = 1
19 | weight_decay = 1e-6
20 | clip_grad = False
21 |
22 | collate_fn = adaptive_padding_collate_fn
23 | augm_func = get_augm_func(time_mask_param=40, freq_mask_param=16, crop_augm_max_cut_size=40)
24 |
25 | MAX_LENGTH = 16
26 |
27 | def get_train_weights(_df):
28 | train_weights = 1 + 9 * (_df.label.values == 0) + 19 * (_df.label.values == 1) + 4 * (_df.label.values == 3)
29 | # train_weights = 1 + 29 * (_df.label.values == 0) + 49 * (_df.label.values == 1) + 9 * (_df.label.values == 3)
30 | return train_weights
31 |
32 |
33 | model_setting = [
34 | # t, c, n, s
35 | [1, 16, 1, 1],
36 | [2, 32, 2, 2],
37 | [2, 64, 6, 2],
38 | [2, 128, 6, 2],
39 | ]
40 |
41 | model = ConvSelfAttentionMobileNet(model_setting,
42 | n_classes=4,
43 | last_channel=128)
44 |
45 |
46 | def get_train_dataset(_df, ds_base_path):
47 | return MelEmotionsDataset(_df,
48 | get_weights_func=get_train_weights,
49 | augm_transform=augm_func,
50 | base_path=ds_base_path)
51 |
52 |
53 |
54 | def get_val_dataset(_df, ds_base_path):
55 | return MelEmotionsDataset(_df, base_path=ds_base_path)
56 |
57 |
58 |
59 | def get_train_dataloader(train_ds):
60 | return DataLoader(train_ds, batch_size=batch_size, num_workers=1,
61 | collate_fn=collate_fn,
62 | sampler=LengthWeightedSampler(df=train_ds.df,
63 | batch_size=batch_size,
64 | min_length=0.3,
65 | max_length=MAX_LENGTH,
66 | length_delta=0.3,
67 | decimals=1))
68 |
69 |
70 |
71 | def get_val_dataloader(val_ds):
72 | return DataLoader(val_ds, batch_size=1, num_workers=4, shuffle=False)
73 |
74 |
75 | train_dataset = get_train_dataset(load_jsonl_as_df(train_manifest_path),
76 | ds_base_path=train_manifest_path.parent)
77 | val_dataset = get_val_dataset(load_jsonl_as_df(val_manifest_path),
78 | ds_base_path=val_manifest_path.parent)
79 |
80 | dataloaders = {'train': get_train_dataloader(train_ds=train_dataset),
81 | 'validate': get_val_dataloader(val_ds=val_dataset)}
82 |
83 | DUMP_BEST_CHECKPOINTS = True
84 | DUMP_LAST_CHECKPOINTS = True
85 | BEST_CHECKPOINTS_WARMUP = 5
86 |
--------------------------------------------------------------------------------
/dusha/demo/recognise_emotions_online.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "18e88c88",
7 | "metadata": {},
8 | "outputs": [
9 | {
10 | "name": "stderr",
11 | "output_type": "stream",
12 | "text": [
13 | "scikit-learn version 1.3.0 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.\n",
14 | "Torch version 2.0.1 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.\n"
15 | ]
16 | }
17 | ],
18 | "source": [
19 | "import torch\n",
20 | "import onnx\n",
21 | "import toml\n",
22 | "import librosa\n",
23 | "import lazycon\n",
24 | "import time\n",
25 | "import os, sys\n",
26 | "\n",
27 | "import coremltools as ct\n",
28 | "import numpy as np\n",
29 | "import soundfile as sf\n",
30 | "import sounddevice as sd\n",
31 | "\n",
32 | "sys.path.append('../experiments')\n",
33 | "import core"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 2,
39 | "id": "c00c3b46",
40 | "metadata": {},
41 | "outputs": [],
42 | "source": [
43 | "def create_features(\n",
44 | " data: np.array,\n",
45 | " hop_length_coef: float = 0.01,\n",
46 | " win_length_coef: float = 0.02,\n",
47 | " sample_rate: int = 16000,\n",
48 | " n_mels: int = 64,\n",
49 | ") -> np.array:\n",
50 | " \"\"\"\n",
51 | " As an input all models use standard speech features:\n",
52 | " 64 Mel-filterbank calculated from 20ms windows with a 10ms overlap.\n",
53 | " \"\"\"\n",
54 | "\n",
55 | " hop_length = int(sample_rate * hop_length_coef)\n",
56 | " win_length = int(sample_rate * win_length_coef)\n",
57 | " if len(data) != 0:\n",
58 | " spec = librosa.feature.melspectrogram(\n",
59 | " y=data,\n",
60 | " sr=sample_rate,\n",
61 | " hop_length=hop_length,\n",
62 | " n_fft=win_length,\n",
63 | " n_mels=n_mels,\n",
64 | " )\n",
65 | " else:\n",
66 | " raise AttributeError\n",
67 | " mel_spec = librosa.power_to_db(spec, ref=np.max)\n",
68 | "\n",
69 | " return mel_spec"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": 3,
75 | "id": "9587b9e7-36e4-4523-9604-49ddf1bf0931",
76 | "metadata": {},
77 | "outputs": [],
78 | "source": [
79 | "def create_features_for_audio(\n",
80 | " wav_name: str,\n",
81 | " hop_length_coef: float = 0.01,\n",
82 | " win_length_coef: float = 0.02,\n",
83 | " sample_rate: int = 16000,\n",
84 | " n_mels: int = 64,\n",
85 | ") -> np.array:\n",
86 | " \"\"\"\n",
87 | " As an input all models use standard speech features:\n",
88 | " 64 Mel-filterbank calculated from 20ms windows with a 10ms overlap.\n",
89 | " \"\"\"\n",
90 | "\n",
91 | " hop_length = int(sample_rate * hop_length_coef)\n",
92 | " win_length = int(sample_rate * win_length_coef)\n",
93 | " data, rate = librosa.load(wav_name, sr=sample_rate)\n",
94 | " print(data)\n",
95 | " if len(data) != 0:\n",
96 | " spec = librosa.feature.melspectrogram(\n",
97 | " y=data,\n",
98 | " sr=rate,\n",
99 | " hop_length=hop_length,\n",
100 | " n_fft=win_length,\n",
101 | " n_mels=n_mels,\n",
102 | " )\n",
103 | " else:\n",
104 | " raise AttributeError\n",
105 | " mel_spec = librosa.power_to_db(spec, ref=np.max)\n",
106 | " return mel_spec"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": 4,
112 | "id": "26a9de85",
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "def index2name(\n",
117 | " index: int\n",
118 | ") -> str:\n",
119 | " class_dict = {0:\"angry\", 1:\"sad\", 2:\"neutral\", 3:\"positive\"}\n",
120 | " \n",
121 | " if index > len(class_dict) or index < 0:\n",
122 | " raise AttributeError\n",
123 | " \n",
124 | " return class_dict[index]"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 5,
130 | "id": "db6f2fda",
131 | "metadata": {},
132 | "outputs": [],
133 | "source": [
134 | "dir_path = './model/'\n",
135 | "model_name = 'podcasts_finetune_old_w_lr_1e-3_try1'\n",
136 | "device = 'cpu'"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 19,
142 | "id": "9b615974",
143 | "metadata": {},
144 | "outputs": [
145 | {
146 | "name": "stderr",
147 | "output_type": "stream",
148 | "text": [
149 | "/Users/artsokol/venv/lib/python3.9/site-packages/torchvision/models/mobilenetv2.py:23: FutureWarning: The ConvBNReLU/ConvBNActivation classes are deprecated since 0.12 and will be removed in 0.14. Use torchvision.ops.misc.Conv2dNormActivation instead.\n",
150 | " warnings.warn(\n"
151 | ]
152 | },
153 | {
154 | "data": {
155 | "text/plain": [
156 | "ConvSelfAttentionMobileNet(\n",
157 | " (features): Sequential(\n",
158 | " (0): _DeprecatedConvBNAct(\n",
159 | " (0): Conv2d(1, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)\n",
160 | " (1): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
161 | " (2): ReLU6(inplace=True)\n",
162 | " )\n",
163 | " (1): InvertedResidual(\n",
164 | " (conv): Sequential(\n",
165 | " (0): Conv2dNormActivation(\n",
166 | " (0): Conv2d(4, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=4, bias=False)\n",
167 | " (1): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
168 | " (2): ReLU6(inplace=True)\n",
169 | " )\n",
170 | " (1): Conv2d(4, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
171 | " (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
172 | " )\n",
173 | " )\n",
174 | " (2): InvertedResidual(\n",
175 | " (conv): Sequential(\n",
176 | " (0): Conv2dNormActivation(\n",
177 | " (0): Conv2d(16, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
178 | " (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
179 | " (2): ReLU6(inplace=True)\n",
180 | " )\n",
181 | " (1): Conv2dNormActivation(\n",
182 | " (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=32, bias=False)\n",
183 | " (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
184 | " (2): ReLU6(inplace=True)\n",
185 | " )\n",
186 | " (2): Conv2d(32, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
187 | " (3): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
188 | " )\n",
189 | " )\n",
190 | " (3): InvertedResidual(\n",
191 | " (conv): Sequential(\n",
192 | " (0): Conv2dNormActivation(\n",
193 | " (0): Conv2d(32, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
194 | " (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
195 | " (2): ReLU6(inplace=True)\n",
196 | " )\n",
197 | " (1): Conv2dNormActivation(\n",
198 | " (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=64, bias=False)\n",
199 | " (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
200 | " (2): ReLU6(inplace=True)\n",
201 | " )\n",
202 | " (2): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
203 | " (3): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
204 | " )\n",
205 | " )\n",
206 | " (4): InvertedResidual(\n",
207 | " (conv): Sequential(\n",
208 | " (0): Conv2dNormActivation(\n",
209 | " (0): Conv2d(32, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
210 | " (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
211 | " (2): ReLU6(inplace=True)\n",
212 | " )\n",
213 | " (1): Conv2dNormActivation(\n",
214 | " (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=64, bias=False)\n",
215 | " (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
216 | " (2): ReLU6(inplace=True)\n",
217 | " )\n",
218 | " (2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
219 | " (3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
220 | " )\n",
221 | " )\n",
222 | " (5): InvertedResidual(\n",
223 | " (conv): Sequential(\n",
224 | " (0): Conv2dNormActivation(\n",
225 | " (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
226 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
227 | " (2): ReLU6(inplace=True)\n",
228 | " )\n",
229 | " (1): Conv2dNormActivation(\n",
230 | " (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=128, bias=False)\n",
231 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
232 | " (2): ReLU6(inplace=True)\n",
233 | " )\n",
234 | " (2): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
235 | " (3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
236 | " )\n",
237 | " )\n",
238 | " (6): InvertedResidual(\n",
239 | " (conv): Sequential(\n",
240 | " (0): Conv2dNormActivation(\n",
241 | " (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
242 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
243 | " (2): ReLU6(inplace=True)\n",
244 | " )\n",
245 | " (1): Conv2dNormActivation(\n",
246 | " (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=128, bias=False)\n",
247 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
248 | " (2): ReLU6(inplace=True)\n",
249 | " )\n",
250 | " (2): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
251 | " (3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
252 | " )\n",
253 | " )\n",
254 | " (7): InvertedResidual(\n",
255 | " (conv): Sequential(\n",
256 | " (0): Conv2dNormActivation(\n",
257 | " (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
258 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
259 | " (2): ReLU6(inplace=True)\n",
260 | " )\n",
261 | " (1): Conv2dNormActivation(\n",
262 | " (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=128, bias=False)\n",
263 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
264 | " (2): ReLU6(inplace=True)\n",
265 | " )\n",
266 | " (2): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
267 | " (3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
268 | " )\n",
269 | " )\n",
270 | " (8): InvertedResidual(\n",
271 | " (conv): Sequential(\n",
272 | " (0): Conv2dNormActivation(\n",
273 | " (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
274 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
275 | " (2): ReLU6(inplace=True)\n",
276 | " )\n",
277 | " (1): Conv2dNormActivation(\n",
278 | " (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=128, bias=False)\n",
279 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
280 | " (2): ReLU6(inplace=True)\n",
281 | " )\n",
282 | " (2): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
283 | " (3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
284 | " )\n",
285 | " )\n",
286 | " (9): InvertedResidual(\n",
287 | " (conv): Sequential(\n",
288 | " (0): Conv2dNormActivation(\n",
289 | " (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
290 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
291 | " (2): ReLU6(inplace=True)\n",
292 | " )\n",
293 | " (1): Conv2dNormActivation(\n",
294 | " (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=128, bias=False)\n",
295 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
296 | " (2): ReLU6(inplace=True)\n",
297 | " )\n",
298 | " (2): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
299 | " (3): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
300 | " )\n",
301 | " )\n",
302 | " (10): InvertedResidual(\n",
303 | " (conv): Sequential(\n",
304 | " (0): Conv2dNormActivation(\n",
305 | " (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
306 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
307 | " (2): ReLU6(inplace=True)\n",
308 | " )\n",
309 | " (1): Conv2dNormActivation(\n",
310 | " (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=128, bias=False)\n",
311 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
312 | " (2): ReLU6(inplace=True)\n",
313 | " )\n",
314 | " (2): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
315 | " (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
316 | " )\n",
317 | " )\n",
318 | " (11): InvertedResidual(\n",
319 | " (conv): Sequential(\n",
320 | " (0): Conv2dNormActivation(\n",
321 | " (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
322 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
323 | " (2): ReLU6(inplace=True)\n",
324 | " )\n",
325 | " (1): Conv2dNormActivation(\n",
326 | " (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256, bias=False)\n",
327 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
328 | " (2): ReLU6(inplace=True)\n",
329 | " )\n",
330 | " (2): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
331 | " (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
332 | " )\n",
333 | " )\n",
334 | " (12): InvertedResidual(\n",
335 | " (conv): Sequential(\n",
336 | " (0): Conv2dNormActivation(\n",
337 | " (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
338 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
339 | " (2): ReLU6(inplace=True)\n",
340 | " )\n",
341 | " (1): Conv2dNormActivation(\n",
342 | " (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256, bias=False)\n",
343 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
344 | " (2): ReLU6(inplace=True)\n",
345 | " )\n",
346 | " (2): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
347 | " (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
348 | " )\n",
349 | " )\n",
350 | " (13): InvertedResidual(\n",
351 | " (conv): Sequential(\n",
352 | " (0): Conv2dNormActivation(\n",
353 | " (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
354 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
355 | " (2): ReLU6(inplace=True)\n",
356 | " )\n",
357 | " (1): Conv2dNormActivation(\n",
358 | " (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256, bias=False)\n",
359 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
360 | " (2): ReLU6(inplace=True)\n",
361 | " )\n",
362 | " (2): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
363 | " (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
364 | " )\n",
365 | " )\n",
366 | " (14): InvertedResidual(\n",
367 | " (conv): Sequential(\n",
368 | " (0): Conv2dNormActivation(\n",
369 | " (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
370 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
371 | " (2): ReLU6(inplace=True)\n",
372 | " )\n",
373 | " (1): Conv2dNormActivation(\n",
374 | " (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256, bias=False)\n",
375 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
376 | " (2): ReLU6(inplace=True)\n",
377 | " )\n",
378 | " (2): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
379 | " (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
380 | " )\n",
381 | " )\n",
382 | " (15): InvertedResidual(\n",
383 | " (conv): Sequential(\n",
384 | " (0): Conv2dNormActivation(\n",
385 | " (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
386 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
387 | " (2): ReLU6(inplace=True)\n",
388 | " )\n",
389 | " (1): Conv2dNormActivation(\n",
390 | " (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=256, bias=False)\n",
391 | " (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
392 | " (2): ReLU6(inplace=True)\n",
393 | " )\n",
394 | " (2): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
395 | " (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
396 | " )\n",
397 | " )\n",
398 | " (16): _DeprecatedConvBNAct(\n",
399 | " (0): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)\n",
400 | " (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)\n",
401 | " (2): ReLU6(inplace=True)\n",
402 | " )\n",
403 | " )\n",
404 | " (pooling): Sequential(\n",
405 | " (0): ConvSelfAttention(\n",
406 | " (query): Conv1d(128, 16, kernel_size=(1,), stride=(1,), bias=False)\n",
407 | " (key): Conv1d(128, 16, kernel_size=(1,), stride=(1,), bias=False)\n",
408 | " (value): Conv1d(128, 128, kernel_size=(1,), stride=(1,), bias=False)\n",
409 | " )\n",
410 | " (1): AdaptiveAvgPool2d(output_size=(1, 1))\n",
411 | " (2): Reshape()\n",
412 | " )\n",
413 | " (classifier): Linear(in_features=128, out_features=4, bias=True)\n",
414 | ")"
415 | ]
416 | },
417 | "execution_count": 19,
418 | "metadata": {},
419 | "output_type": "execute_result"
420 | }
421 | ],
422 | "source": [
423 | "config_path = os.path.join(dir_path, \"train.config\")\n",
424 | "assert os.path.exists(config_path), f\"No train.config in {dir_path}\"\n",
425 | "\n",
426 | "model_path = os.path.join(dir_path, model_name)\n",
427 | "# check the model\n",
428 | "if not os.path.exists(model_path):\n",
429 | " print(f\"There is no saved model {model_path}. Nothing to inference\")\n",
430 | "# return None\n",
431 | "\n",
432 | "# load the model\n",
433 | "cfg = lazycon.load(config_path)\n",
434 | "model = cfg.model\n",
435 | " \n",
436 | "model.to(device)\n",
437 | "model.load_state_dict(torch.load(model_path, map_location=torch.device('cpu')))\n",
438 | "model = model.double()\n",
439 | "model.eval()"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": 20,
445 | "id": "a90da87c",
446 | "metadata": {},
447 | "outputs": [
448 | {
449 | "data": {
450 | "text/plain": [
451 | "> 0 Внешний микрофон, Core Audio (1 in, 0 out)\n",
452 | "< 1 Внешние наушники, Core Audio (0 in, 2 out)\n",
453 | " 2 Микрофон MacBook Pro, Core Audio (1 in, 0 out)\n",
454 | " 3 Динамики MacBook Pro, Core Audio (0 in, 2 out)"
455 | ]
456 | },
457 | "execution_count": 20,
458 | "metadata": {},
459 | "output_type": "execute_result"
460 | }
461 | ],
462 | "source": [
463 | "sd.query_devices()"
464 | ]
465 | },
466 | {
467 | "cell_type": "code",
468 | "execution_count": 21,
469 | "id": "4a318ebf",
470 | "metadata": {},
471 | "outputs": [
472 | {
473 | "name": "stdout",
474 | "output_type": "stream",
475 | "text": [
476 | "Recording Audio\n",
477 | "Audio recording complete , Play Audio\n"
478 | ]
479 | }
480 | ],
481 | "source": [
482 | "fs=16000 \n",
483 | "duration = 5 # seconds \n",
484 | "myrecording = sd.rec(duration * fs, samplerate=fs, channels=1, dtype='float64') \n",
485 | "print (\"Recording Audio\") \n",
486 | "sd.wait() \n",
487 | "print (\"Audio recording complete , Play Audio\") \n",
488 | "sd.play(myrecording, fs) "
489 | ]
490 | },
491 | {
492 | "cell_type": "code",
493 | "execution_count": 29,
494 | "id": "4e352445",
495 | "metadata": {},
496 | "outputs": [],
497 | "source": [
498 | "feat = create_features(np.transpose(myrecording)[0])"
499 | ]
500 | },
501 | {
502 | "cell_type": "code",
503 | "execution_count": 24,
504 | "id": "e936bd3b",
505 | "metadata": {},
506 | "outputs": [],
507 | "source": [
508 | "#feat = create_features_for_audio('wavs/c9780b567a8de31862971aa5412bf834.wav')"
509 | ]
510 | },
511 | {
512 | "cell_type": "code",
513 | "execution_count": 30,
514 | "id": "bcd6c49c",
515 | "metadata": {},
516 | "outputs": [
517 | {
518 | "name": "stdout",
519 | "output_type": "stream",
520 | "text": [
521 | "Calculating predicts\n"
522 | ]
523 | }
524 | ],
525 | "source": [
526 | "print(f\"Calculating predicts\")\n",
527 | "inputs = torch.from_numpy(feat).to(device).unsqueeze(0).unsqueeze(0)\n",
528 | "\n",
529 | "with torch.no_grad():\n",
530 | " probs = model(inputs)"
531 | ]
532 | },
533 | {
534 | "cell_type": "code",
535 | "execution_count": 31,
536 | "id": "98cd7964",
537 | "metadata": {},
538 | "outputs": [
539 | {
540 | "data": {
541 | "text/plain": [
542 | "'neutral'"
543 | ]
544 | },
545 | "execution_count": 31,
546 | "metadata": {},
547 | "output_type": "execute_result"
548 | }
549 | ],
550 | "source": [
551 | "pred_class = np.argmax(probs.cpu().numpy(), axis=1)\n",
552 | "index2name(pred_class[0])"
553 | ]
554 | },
555 | {
556 | "cell_type": "code",
557 | "execution_count": 32,
558 | "id": "723ce905",
559 | "metadata": {},
560 | "outputs": [
561 | {
562 | "data": {
563 | "text/plain": [
564 | "tensor([[-2.4723, -0.7901, 1.3804, -0.7780]], dtype=torch.float64)"
565 | ]
566 | },
567 | "execution_count": 32,
568 | "metadata": {},
569 | "output_type": "execute_result"
570 | }
571 | ],
572 | "source": [
573 | "probs"
574 | ]
575 | }
576 | ],
577 | "metadata": {
578 | "kernelspec": {
579 | "display_name": "Python 3 (ipykernel)",
580 | "language": "python",
581 | "name": "python3"
582 | },
583 | "language_info": {
584 | "codemirror_mode": {
585 | "name": "ipython",
586 | "version": 3
587 | },
588 | "file_extension": ".py",
589 | "mimetype": "text/x-python",
590 | "name": "python",
591 | "nbconvert_exporter": "python",
592 | "pygments_lexer": "ipython3",
593 | "version": "3.9.17"
594 | }
595 | },
596 | "nbformat": 4,
597 | "nbformat_minor": 5
598 | }
599 |
--------------------------------------------------------------------------------
/dusha/experiments/.gitignore:
--------------------------------------------------------------------------------
1 | **/__pycache__/
2 | runs/
3 | exps/
4 |
--------------------------------------------------------------------------------
/dusha/experiments/Dockerfile:
--------------------------------------------------------------------------------
1 | # Distributed on an "AS IS" BASIS,
2 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
3 | # See the License for the specific language governing permissions and
4 | # limitations under the License.
5 |
6 | FROM nvidia/cuda:10.1-base-ubuntu18.04
7 | RUN apt-get update && apt-get install -y git && \
8 | apt install -y software-properties-common && \
9 | add-apt-repository ppa:deadsnakes/ppa && \
10 | apt install -y python3.8 && apt install -y python3-pip && \
11 | pip3 install virtualenv && \
12 | virtualenv -p python3.8 --no-download venv
13 |
14 | SHELL ["/bin/bash", "-c"]
15 | RUN source /venv/bin/activate && \
16 | pip install torch==1.8.1+cu101 torchvision==0.9.1+cu101 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html && \
17 | pip install scikit-learn==0.24.0 librosa==0.7.1 tqdm==4.55.1 click==8.1.2 pandas==1.1.5 tensorboard==2.4.1 crowd-kit==1.0.0 && \
18 | apt install -y libsndfile1 && pip install numba==0.48
19 |
20 | WORKDIR /packages
21 |
22 | RUN git clone https://github.com/maxme1/lazycon.git
23 | WORKDIR lazycon
24 | RUN git checkout tags/v0.3.1
25 |
26 | RUN source /venv/bin/activate && pip install -e .
27 | WORKDIR /workspace
28 |
29 | CMD ["bash"]
30 |
--------------------------------------------------------------------------------
/dusha/experiments/README.md:
--------------------------------------------------------------------------------
1 | # Run and Evaluate Experiments
2 |
3 | Once you have processed the Dusha dataset or downloaded the precalculated features, you have a folder with features and manifest files in `DATASET_PATH`
4 | (by default `DUSHA_REPOSITORY_PATH/data/paper_setups`).
5 |
6 | You can run your own experiments or reproduce our results using the prepared Docker image or your own Python environment.
7 |
8 | ## Prepare environment
9 |
10 | Specify `DATASET_PATH` in the variable `base_path` in `./configs/data.config`:
11 | - If you want to use your own Python environment, define an **absolute** path to the processed_dataset folder.
12 | - If you want to use the proposed Docker image, define `DATASET_PATH` relative to `DUSHA_REPOSITORY_PATH` (it will be mounted into the Docker container as `/workspace`).
13 | This is already done for the default `DATASET_PATH`: `'/workspace/data/paper_setups'`.
14 |
15 | ### Docker
16 | Build the docker image:
17 |
18 | ```
19 | docker build -t dusha_image .
20 | ```
21 |
22 | Then run it:
23 |
24 | ```
25 | export CURRENT_DIR=$PWD
26 | docker run --gpus device=0 -it -v /$CURRENT_DIR/..:/workspace --name dusha_docker dusha_image
27 | ```
28 |
29 | Inside the container, activate the Python environment:
30 | ```
31 | source /venv/bin/activate && export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python && cd experiments
32 | ```
33 |
34 | So, we are ready to start!
35 |
36 | ## Training stage
37 | We set up experiments in 7 settings; for each of them there is a corresponding config in the _configs_ folder.
38 | Run the training stage:
39 |
40 | ```
41 | python train.py -config configs/{EXP_NAME}.config -exp_path exps/{EXP_NAME}
42 | ```
43 |
44 | The `podcast_tune` experiment uses the pretrained `crowd_large` model for initialization, so you have to train that model first and specify the path to the trained model as `pt_model_path` in `podcast_tune.config`.
45 |
46 | We have tried to make the experiments as reproducible as possible, so the training stage can take a long time.
47 | To speed up training, you can edit `train.py` and delete the `# fix seeds for reproducibility` part of the code.
48 |
49 |
50 | ## Inference
51 | After the training stage you have trained models in the `./exps` folder.
52 | To calculate predictions and metrics for them, run the command:
53 |
54 | ```
55 | python inf.py -exps_path exps -vm {PATH_TO_TESTS_FOLDER}
56 | ```
57 |
58 | Here `{PATH_TO_TESTS_FOLDER}` is the path to the folder with test manifests (`base_path / 'test'` from `./configs/data.config`).
59 |
60 | The script will also calculate pivot tables with metrics grouped by dataset and dump them to `exps/metrics/exps_{dataset_name}.csv`.
61 |
62 | See `--help` for more information.
63 |
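64 | Besides `inf.py`, a trained model can also be loaded directly from its experiment config, as is done in the demo notebook (`../demo/recognise_emotions_online.ipynb`). A minimal sketch, assuming the experiment folder contains a `train.config` and a checkpoint file named after the experiment (as in `../demo/model`); adjust the names to your actual `exps/{EXP_NAME}` layout:
65 | 
66 | ```
67 | import lazycon
68 | import torch
69 | 
70 | # The config defines the model architecture; the checkpoint holds the trained weights.
71 | cfg = lazycon.load("exps/crowd_large/train.config")
72 | model = cfg.model
73 | model.load_state_dict(torch.load("exps/crowd_large/crowd_large", map_location="cpu"))
74 | model = model.eval()
75 | ```
76 | 
77 | The demo notebook additionally casts the model with `model.double()`, since the mel-spectrogram features there are computed in float64.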
--------------------------------------------------------------------------------
/dusha/experiments/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/dusha/experiments/__init__.py
--------------------------------------------------------------------------------
/dusha/experiments/configs/base.config:
--------------------------------------------------------------------------------
1 | from .data import *
2 |
3 | from core.dataset import MelEmotionsDataset, get_augm_func, adaptive_padding_collate_fn, LengthWeightedSampler
4 | from core.model import ConvSelfAttentionMobileNet
5 | from core.utils import load_jsonl_as_df
6 | from torch.utils.data import DataLoader
7 | from pathlib import Path
8 |
9 | # train data
10 | train_manifest_path = base_path / 'train' / 'crowd_train.jsonl'
11 | val_manifest_path = base_path / 'test' / 'crowd_test.jsonl'
12 |
13 | # pretrain
14 | pt_model_path = None
15 |
16 | # exp hyperparams
17 | batch_size = 64
18 | epoch_count = 100
19 | learning_rate = 5e-4
20 | optimizer_step = 5
21 | optimizer_gamma = 1
22 | weight_decay = 1e-6
23 | clip_grad = False
24 |
25 | # augm and batch iter stuff
26 | collate_fn = adaptive_padding_collate_fn
27 | augm_func = get_augm_func(time_mask_param=40, freq_mask_param=16, crop_augm_max_cut_size=40)
28 |
29 | MAX_LENGTH = 16
30 | get_train_weights = None
31 |
32 | # model
33 | model_setting = [
34 | # t, c, n, s
35 | [1, 16, 1, 1],
36 | [2, 32, 2, 2],
37 | [2, 64, 6, 2],
38 | [2, 128, 6, 2],
39 | ]
40 |
41 | model = ConvSelfAttentionMobileNet(model_setting,
42 | n_classes=4,
43 | last_channel=128)
44 |
45 |
46 | def get_train_dataset(_df, ds_base_path):
47 | return MelEmotionsDataset(_df,
48 | get_weights_func=get_train_weights,
49 | augm_transform=augm_func,
50 | base_path=ds_base_path)
51 |
52 |
53 | def get_val_dataset(_df, ds_base_path):
54 | return MelEmotionsDataset(_df, base_path=ds_base_path)
55 |
56 |
57 | def get_train_dataloader(train_ds):
58 | return DataLoader(train_ds, batch_size=batch_size, num_workers=1,
59 | collate_fn=collate_fn,
60 | sampler=LengthWeightedSampler(df=train_ds.df,
61 | batch_size=batch_size,
62 | min_length=0.3,
63 | max_length=MAX_LENGTH,
64 | length_delta=0.3,
65 | decimals=1))
66 |
67 |
68 | def get_val_dataloader(val_ds):
69 | return DataLoader(val_ds, batch_size=1, num_workers=4, shuffle=False)
70 |
71 |
72 | train_dataset = get_train_dataset(load_jsonl_as_df(train_manifest_path),
73 | ds_base_path=train_manifest_path.parent)
74 | val_dataset = get_val_dataset(load_jsonl_as_df(val_manifest_path),
75 | ds_base_path=val_manifest_path.parent)
76 |
77 | dataloaders = {'train': get_train_dataloader(train_ds=train_dataset),
78 | 'validate': get_val_dataloader(val_ds=val_dataset)}
79 |
80 | DUMP_BEST_CHECKPOINTS = True
81 | DUMP_LAST_CHECKPOINTS = True
82 | BEST_CHECKPOINTS_WARMUP = 5
83 |
--------------------------------------------------------------------------------
/dusha/experiments/configs/crowd_large.config:
--------------------------------------------------------------------------------
1 | from .base import *
2 |
3 | # data
4 | train_manifest_path = base_path / 'crowd_large.jsonl'
5 | val_manifest_path = base_path / 'test' / 'crowd_test.jsonl'
6 |
7 | # pretrain
8 | pt_model_path = None
9 |
10 | # exp hyperparams
11 | batch_size = 64
12 | epoch_count = 100
13 | learning_rate = 1e-3
14 | optimizer_step = 5
15 | optimizer_gamma = 1
16 | weight_decay = 1e-6
17 | clip_grad = False
18 |
--------------------------------------------------------------------------------
/dusha/experiments/configs/crowd_medium.config:
--------------------------------------------------------------------------------
1 | from .crowd_large import *
2 |
3 | # data
4 | train_manifest_path = base_path / 'crowd_medium.jsonl'
5 | val_manifest_path = base_path / 'test' / 'crowd_test.jsonl'
6 |
7 | # pretrain
8 | pt_model_path = None
9 |
10 | # exp hyperparams
11 | learning_rate = 5e-3
12 |
--------------------------------------------------------------------------------
/dusha/experiments/configs/crowd_small.config:
--------------------------------------------------------------------------------
1 | from .crowd_large import *
2 |
3 | # data
4 | train_manifest_path = base_path / 'crowd_small.jsonl'
5 | val_manifest_path = base_path / 'test' / 'crowd_test.jsonl'
6 |
7 | # pretrain
8 | pt_model_path = None
9 |
10 | # exp hyperparams
11 | learning_rate = 1e-3
12 |
--------------------------------------------------------------------------------
/dusha/experiments/configs/crowd_unconfident.config:
--------------------------------------------------------------------------------
1 | from .crowd_large import *
2 |
3 | # data
4 | train_manifest_path = base_path / 'crowd_unconfident.jsonl'
5 | val_manifest_path = base_path / 'test' / 'crowd_test.jsonl'
6 |
7 | # pretrain
8 | pt_model_path = None
9 |
10 | # exp hyperparams
11 | learning_rate = 5e-3
12 |
--------------------------------------------------------------------------------
/dusha/experiments/configs/data.config:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 |
3 | # You should specify `DATASET_PATH` in the variable `base_path`
4 | # - If you want to use your own Python environment - define an ABSOLUTE path to the processed_dataset folder
5 | # - If you want to use the proposed Docker image - define `DATASET_PATH` relative to `DUSHA_REPOSITORY_PATH`
6 | # (`DUSHA_REPOSITORY_PATH` will be mounted to the docker as `/workspace`).
7 | #
8 | # It is already done for the default `DATASET_PATH`:
9 | base_path = Path('/workspace/data/paper_setups')
10 |
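11 | # A purely illustrative example for a run in your own environment (the path below
12 | # is hypothetical and not part of the original config):
13 | # base_path = Path('/home/user/dusha/processed_dataset')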
--------------------------------------------------------------------------------
/dusha/experiments/configs/dusha_large.config:
--------------------------------------------------------------------------------
1 | from .base import *
2 |
3 | # train data
4 | train_manifest_path = base_path / 'dusha_large.jsonl'
5 | val_manifest_path = base_path / 'test' / 'dusha_test.jsonl'
6 |
7 | # pretrain
8 | pt_model_path = None
9 |
10 | # exp hyperparams
11 | batch_size = 64
12 | epoch_count = 100
13 | learning_rate = 1e-3
14 | optimizer_step = 5
15 | optimizer_gamma = 1
16 | weight_decay = 1e-6
17 | clip_grad = False
18 |
--------------------------------------------------------------------------------
/dusha/experiments/configs/podcast_large.config:
--------------------------------------------------------------------------------
1 | from .base import *
2 |
3 | # train data
4 | train_manifest_path = base_path / 'podcast_large.jsonl'
5 | val_manifest_path = base_path / 'test' / 'podcast_test.jsonl'
6 |
7 | # pretrain
8 | pt_model_path = None
9 |
10 | # exp hyperparams
11 | batch_size = 64
12 | epoch_count = 100
13 | learning_rate = 5e-3
14 | optimizer_step = 5
15 | optimizer_gamma = 1
16 | weight_decay = 1e-6
17 | clip_grad = False
18 |
19 |
20 | def get_train_weights(_df):
21 |     train_weights = 1 + 9 * (_df.label.values == 0) + 19 * (_df.label.values == 1) + 4 * (_df.label.values == 3)
22 | return train_weights
23 |
--------------------------------------------------------------------------------
/dusha/experiments/configs/podcast_tune.config:
--------------------------------------------------------------------------------
1 | from .podcast_large import *
2 |
3 | # pretrain
4 | pt_model_path = Path('/workspace/experiments/exps/crowd_large/crowd_large')
5 |
6 | # exp hyperparams
7 | learning_rate = 1e-3
8 |
--------------------------------------------------------------------------------
/dusha/experiments/core/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/dusha/experiments/core/__init__.py
--------------------------------------------------------------------------------
/dusha/experiments/core/dataset.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 |
3 | import numpy as np
4 | import torch
5 | from torch import nn
6 | from torch.distributions.categorical import Categorical
7 | from torch.utils.data import Dataset, Sampler
8 | from torchaudio import transforms as T
9 | from torchvision import transforms
10 |
11 | PATH_TO_TENSOR_COL = "tensor"
12 |
13 |
14 | def load_tensor(path):
15 | features_tensor = np.fromfile(path, dtype=np.float32)
16 | return torch.from_numpy(np.reshape(features_tensor, (-1, 64)))
17 |
18 |
19 | def pad_or_crop_to_shape(tensor, size, rand_side_pad=True):
20 | assert len(tensor.shape) == 3
21 | delta = size - tensor.shape[-1]
22 | if delta > 0:
23 | if rand_side_pad:
24 | start_padding = np.random.randint(delta)
25 | end_padding = delta - start_padding
26 | res = nn.functional.pad(tensor, pad=(start_padding, end_padding, 0, 0))
27 | else:
28 | res = nn.functional.pad(tensor, pad=(0, delta, 0, 0))
29 |
30 | return res
31 | else:
32 | return tensor[..., :size]
33 |
34 |
35 | def adaptive_padding_collate_fn(batch):
36 | data = []
37 | target = []
38 | max_size = max([tens.shape[-1] for (tens, label) in batch])
39 | for (tens, label) in batch:
40 | # crop
41 | data.append(pad_or_crop_to_shape(tens, max_size, rand_side_pad=True))
42 | target.append(label)
43 |
44 | return torch.stack(data), torch.tensor(target)
45 |
46 |
47 | def get_augm_func(time_mask_param=80, freq_mask_param=16, crop_augm_max_cut_size=0):
48 | """
49 | Returns function for augmentation in MelEmotionsDataset (augm_transform)
50 | Returned function's input should have [bs, 1, T] shape
51 |
52 | :param time_mask_param:
53 | :param freq_mask_param:
54 | :param crop_augm_max_cut_size: if 0 - random crops are not used
55 | :return:
56 | """
57 |
58 | t_masking = T.TimeMasking(time_mask_param=time_mask_param)
59 | f_masking = T.FrequencyMasking(freq_mask_param=freq_mask_param)
60 |
61 | if crop_augm_max_cut_size != 0:
62 | # we want random crop with random size,
63 | # so we should sample crop size for each augm_transform call
64 | def crop_f(tens):
65 | crop_delta = np.random.randint(crop_augm_max_cut_size)
66 | random_crop = transforms.RandomCrop(
67 | np.array(tens.shape)[1:] - np.array([0, crop_delta])
68 | )
69 |
70 | return random_crop(tens)
71 |
72 | augm_transform = transforms.Compose([f_masking, t_masking, crop_f])
73 | else:
74 | augm_transform = transforms.Compose([f_masking, t_masking])
75 |
76 | return augm_transform
77 |
78 |
79 | class MelEmotionsDataset(Dataset):
80 | def __init__(
81 | self, df, *_, augm_transform=None, get_weights_func=None, base_path=None, **__
82 | ):
83 | super().__init__()
84 | df = df.copy()
85 | if "label" in df.columns:
86 | df["label"] = df["label"].apply(int)
87 | else:
88 |             print('There is no column "label" in the manifest')
89 |
90 | if get_weights_func is None:
91 | df["sampling_weights"] = 1
92 | else:
93 | df["sampling_weights"] = get_weights_func(df)
94 |
95 | # sort by length
96 | if "wav_length" in df.columns:
97 | df = df.sort_values("wav_length").reset_index(drop=True)
98 | else:
99 |             print('There is no column "wav_length" in the manifest')
100 |
101 | self.df = df
102 | self.augm_transform = augm_transform
103 | self.feature_col = PATH_TO_TENSOR_COL
104 |
105 | if base_path is not None:
106 | base_path = Path(base_path)
107 | self.base_path = base_path
108 |
109 | def __len__(self):
110 | return len(self.df)
111 |
112 | def __getitem__(self, idx):
113 | if torch.is_tensor(idx):
114 | idx = idx.tolist()
115 |
116 | path = self.df.iloc[idx][self.feature_col]
117 | if self.base_path is not None:
118 | path = self.base_path / path
119 |
120 | tens = torch.from_numpy(np.load(path))
121 | label = self.df.iloc[idx]["label"]
122 |
123 | if self.augm_transform is not None:
124 | tens = self.augm_transform(tens)
125 |
126 | return tens, label
127 |
128 |
129 | class LengthWeightedSampler(Sampler[int]):
130 | def __init__(
131 | self,
132 | df,
133 | batch_size,
134 | min_length=1,
135 | max_length=20.5,
136 | length_delta=0.3,
137 | decimals=1,
138 | ):
139 | # df should be sorted ascending by wav_length
140 | # we do it in MelEmotionsDataset
141 | if "wav_length" not in df.columns:
142 |             raise ValueError('There is no column "wav_length" in the manifest')
143 |
144 | super().__init__(df)
145 | self.df = df
146 | self.batch_size = batch_size
147 | self.num_samples = (len(df) // batch_size) * batch_size
148 |
149 | all_lengths = np.round(df["wav_length"].values, decimals)
150 | _max = max(all_lengths)
151 | _min = min(all_lengths)
152 |
153 | if max_length is None or max_length > _max:
154 | max_length = _max
155 | if min_length is None or min_length < _min:
156 | min_length = _min
157 |
158 | self.min_length = min_length
159 | self.max_length = max_length
160 | self.length_delta = length_delta
161 |
162 | self.decimals = decimals
163 | self.length_step = np.round(0.1 ** decimals, decimals)
164 |
165 | # is needed to sample batches with max length inclusive
166 | max_plus_delta = np.round(self.max_length + self.length_step, decimals)
167 |
168 | length_to_index_mapping = {}
169 | temp_length = 0
170 |
171 | for i, v in enumerate(all_lengths):
172 | if v > temp_length:
173 | if v != temp_length + self.length_step:
174 | for j in np.arange(
175 | temp_length + self.length_step, v, self.length_step
176 | ):
177 | length_to_index_mapping[np.round(j, decimals)] = i
178 |
179 | length_to_index_mapping[v] = i
180 |
181 | temp_length = v
182 |
183 | # fix to sample batches with max length inclusive
184 | length_to_index_mapping[
185 | np.round(np.max(all_lengths) + self.length_step, decimals)
186 | ] = len(df)
187 |
188 | self.length_to_index_mapping = length_to_index_mapping
189 |
190 | # starts with MIN_LENGTH
191 | self.lengths, self.lengths_count = np.unique(
192 | all_lengths[
193 | length_to_index_mapping[self.min_length] : length_to_index_mapping[
194 | max_plus_delta
195 | ]
196 | ],
197 | return_counts=True,
198 | )
199 |
200 | self.key_length_sampler = Categorical(
201 | probs=torch.from_numpy(self.lengths_count)
202 | )
203 |
204 | def __iter__(self):
205 | N = 0
206 | res_indexes = []
207 |
208 | while N < self.num_samples:
209 | key_length = self.lengths[self.key_length_sampler.sample().item()]
210 |
211 | batch_min_length = np.round(
212 | max(self.min_length, key_length - self.length_delta), self.decimals
213 | )
214 | batch_max_length = np.round(
215 | min(self.max_length, key_length + self.length_delta), self.decimals
216 | )
217 | batch_max_length_plus_delta = np.round(
218 | batch_max_length + self.length_step, self.decimals
219 | )
220 |
221 | sub_df = self.df.iloc[
222 | self.length_to_index_mapping[
223 | batch_min_length
224 | ] : self.length_to_index_mapping[batch_max_length_plus_delta]
225 | ][["sampling_weights"]]
226 |
227 | sampling_weights = torch.from_numpy(
228 | sub_df.sampling_weights.values.astype(float)
229 | )
230 | sub_iloc_indexes = torch.multinomial(
231 | sampling_weights, self.batch_size, True
232 | ).tolist()
233 |
234 | batch_indexes = sub_df.iloc[sub_iloc_indexes].index.tolist()
235 | res_indexes.extend(batch_indexes)
236 |
237 | N += self.batch_size
238 |
239 | return iter(res_indexes)
240 |
241 | def __len__(self):
242 | return self.num_samples
243 |
--------------------------------------------------------------------------------
/dusha/experiments/core/learner.py:
--------------------------------------------------------------------------------
1 | import copy
2 | from pathlib import Path
3 | import time
4 |
5 | import numpy as np
6 | import torch
7 | from torch import nn
8 | from torch.optim import Adam, lr_scheduler
9 | from torch.utils.tensorboard import SummaryWriter
10 | from tqdm import tqdm
11 |
12 | from .metrics import calculate_metrics
13 |
14 |
15 | class Learner:
16 | def __init__(
17 | self,
18 | train_dataset,
19 | val_dataset,
20 | dataloaders,
21 | exp_path,
22 | model_name,
23 | model,
24 | batch_size,
25 | dump_best_checkpoints,
26 | dump_last_checkpoints,
27 | best_checkpoints_warmup,
28 | cuda_device="cuda:0",
29 | ):
30 |
31 | self.device = torch.device(cuda_device if torch.cuda.is_available() else "cpu")
32 | self.model = model
33 | self.model.to(self.device)
34 |
35 | self.__model_name = model_name
36 |
37 | self.dump_last_checkpoints = dump_last_checkpoints
38 | self.dump_best_checkpoints = dump_best_checkpoints
39 | self.best_checkpoints_warmup = best_checkpoints_warmup
40 |
41 | self.exp_path = Path(exp_path)
42 | if dump_best_checkpoints:
43 | self.best_checkpoints_path = self.exp_path / "best_checkpoints"
44 | self.best_checkpoints_path.mkdir()
45 | if dump_last_checkpoints:
46 | self.last_checkpoints_path = self.exp_path / (
47 | self.__model_name + "_last_checkpoints"
48 | )
49 | self.last_checkpoints_path.mkdir()
50 |
51 | self.batch_size = batch_size
52 |
53 | self.train_dataset = train_dataset
54 | self.val_dataset = val_dataset
55 |
56 | print(
57 | "train labels",
58 | np.unique(self.train_dataset.df.label.values, return_counts=True),
59 | )
60 | print(
61 | "train weights",
62 | np.unique(
63 | self.train_dataset.df.sampling_weights.values, return_counts=True
64 | ),
65 | )
66 |
67 | self.dataloaders = dataloaders
68 |
69 | self.dataset_sizes = {
70 | "train": len(self.train_dataset.df),
71 | "validate": len(self.val_dataset.df),
72 | }
73 |
74 | def train(self, num_epochs, lr, step_size, gamma, weight_decay=0, clip_grad=False):
75 | comment_str_list = [
76 | "MODEL",
77 | self.__model_name,
78 | "EPOCHS",
79 | str(num_epochs),
80 | "LR",
81 | str(lr),
82 | "BATCH",
83 | str(self.batch_size),
84 | ]
85 |
86 | comment_str = "_".join(comment_str_list)
87 | summary_writer = SummaryWriter(log_dir=self.exp_path / 'TB_log' / comment_str)
88 |
89 | criterion = nn.CrossEntropyLoss()
90 | optimizer = Adam(self.model.parameters(), lr=lr, weight_decay=weight_decay)
91 | scheduler = lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)
92 |
93 | since = time.time()
94 | # copy.deepcopy(self.model.state_dict())
95 | best_model_wts = None
96 | best_loss = 10000000
97 | best_acc = best_f1 = best_WA = 0
98 | softmax = nn.Softmax(dim=1)
99 |
100 | try:
101 | for epoch in range(1, num_epochs + 1):
102 | print(f"Epoch {epoch}/{num_epochs}")
103 | for phase in ["train", "validate"]:
104 | if phase == "train":
105 | self.model.train()
106 | cur_step_lr = scheduler.get_last_lr()[-1]
107 | else:
108 | self.model.eval()
109 |
110 | running_loss = 0.0
111 | running_outputs = []
112 | running_labels = []
113 | for inputs, labels in tqdm(self.dataloaders[phase]):
114 | inputs = inputs.to(self.device)
115 | labels = labels.long()
116 | labels = labels.to(self.device)
117 | optimizer.zero_grad()
118 |
119 | with torch.set_grad_enabled(phase == "train"):
120 | outputs = self.model(inputs)
121 | probs = softmax(outputs)
122 | loss = criterion(outputs, labels)
123 | if phase == "train":
124 | loss.backward()
125 | if clip_grad:
126 | torch.nn.utils.clip_grad_norm_(
127 | self.model.parameters(), 1.0
128 | )
129 | optimizer.step()
130 |
131 | running_loss += loss.item()
132 | if phase == "validate":
133 | running_labels.append(labels)
134 | running_outputs.append(probs)
135 |
136 | if phase == "train":
137 | scheduler.step()
138 |
139 | epoch_loss = running_loss / self.dataset_sizes[phase]
140 | if phase == "validate":
141 | pred_class = np.argmax(
142 | torch.cat(running_outputs).cpu().numpy(), axis=1
143 | )
144 | gt_class = torch.cat(running_labels).cpu().numpy()
145 |
146 | metric_dict = calculate_metrics(
147 | pred_class, gt_class, neg_label=0
148 | )
149 |
150 | summary_writer.add_scalar("Loss/validate", epoch_loss, epoch)
151 | for metric_name, metric_value in metric_dict.items():
152 | summary_writer.add_scalar(
153 | f"Metrics/{metric_name}", metric_value, epoch
154 | )
155 |
156 | epoch_acc = metric_dict["accuracy"]
157 | epoch_f1 = metric_dict["f1_macro"]
158 | epoch_WA = metric_dict["WA"]
159 |
160 | print(f"{phase} Loss: {epoch_loss:.4f}")
161 | print(f"{phase} Acc: {epoch_acc:.4f}")
162 | print(f"{phase} F1 macro: {epoch_f1:.4f}")
163 | print(f"{phase} WA: {epoch_WA:.4f}")
164 |
165 | if epoch_f1 > best_f1:
166 | best_f1 = epoch_f1
167 |                             best_loss = epoch_loss
168 | best_acc = epoch_acc
169 | best_f1 = epoch_f1
170 |
171 | best_epoch = epoch
172 | best_model_wts = copy.deepcopy(self.model.state_dict())
173 |
174 | if (
175 | self.dump_best_checkpoints
176 | and epoch > self.best_checkpoints_warmup
177 | ):
178 | torch.save(
179 | best_model_wts,
180 | self.best_checkpoints_path
181 | / f"best_checkpoint_{epoch}",
182 | )
183 |
184 | if self.dump_last_checkpoints and abs(epoch - num_epochs) < 6:
185 | torch.save(
186 | copy.deepcopy(self.model.state_dict()),
187 | self.last_checkpoints_path / f"checkpoint_{epoch}",
188 | )
189 |
190 | else:
191 | print(f"{phase} Loss: {epoch_loss:.4f}")
192 | summary_writer.add_scalar("Loss/train", epoch_loss, epoch)
193 | summary_writer.add_scalar("LR/value", cur_step_lr, epoch)
194 |
195 | except KeyboardInterrupt:
196 | pass
197 |
198 | summary_writer.flush()
199 | time_elapsed = time.time() - since
200 | print(
201 | f"Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s."
202 | + f" Best model loss: {best_loss:.6f}, best model acc: {best_acc:.6f}, "
203 | + f"best model f1: {best_f1:.6f}, best epoch {best_epoch}"
204 | )
205 |
206 | self.model.load_state_dict(best_model_wts)
207 | self.model.eval()
208 | return best_model_wts
209 |
--------------------------------------------------------------------------------
/dusha/experiments/core/metrics.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | from sklearn.metrics import accuracy_score, f1_score
4 |
5 |
6 | def get_metrics_df(pred_class, gt_class, model_name=None):
7 | metric_dict = calculate_metrics(pred_class=pred_class, gt_class=gt_class)
8 | metrics_df = pd.DataFrame([metric_dict]).T.round(4)
9 |
10 | if model_name is not None:
11 | metrics_df.columns = [model_name]
12 |
13 | return metrics_df
14 |
15 |
16 | def weighted_accuracy(y_true, y_pred, n_classes=4):
17 | y_pred = np.array(y_pred)
18 | y_true = np.array(y_true)
19 |
20 | class_accuracies = []
21 | for i in range(n_classes):
22 | gt_class_mask = y_true == i
23 | pred_class_mask = y_pred == i
24 | class_accuracies.append(
25 | (gt_class_mask * pred_class_mask).sum() / gt_class_mask.sum()
26 | )
27 |
28 | return np.mean(class_accuracies)
29 |
30 |
31 | def calculate_metrics(pred_class, gt_class, **kwargs):
32 | n_classes = 4
33 |
34 | metrics_dict = {
35 | "accuracy": accuracy_score(y_true=gt_class, y_pred=pred_class),
36 | "WA": weighted_accuracy(
37 | y_true=gt_class, y_pred=pred_class, n_classes=n_classes
38 | ),
39 | "f1_macro": f1_score(y_true=gt_class, y_pred=pred_class, average="macro"),
40 | }
41 |
42 | return metrics_dict
43 |
--------------------------------------------------------------------------------
/dusha/experiments/core/model.py:
--------------------------------------------------------------------------------
1 | from typing import Union
2 |
3 | import torch
4 | from torch import nn
5 | from torchvision.models.mobilenetv2 import ConvBNReLU, InvertedResidual, _make_divisible
6 |
7 | AUDIO_PROBAS = ('audio_neg', 'audio_sad', 'audio_neu', 'audio_pos')
8 | AUDIO_COLS = tuple(["audio_pred"] + list(AUDIO_PROBAS))
9 |
10 | EMO2LABEL = {'angry': 0,
11 | 'sad': 1,
12 | 'neutral': 2,
13 | 'positive': 3}
14 |
15 |
16 | class SoftMaxModel(nn.Module):
17 | def __init__(self, logits_model: nn.Module):
18 | super().__init__()
19 | self.logits_model = logits_model
20 | self.softmax = nn.Softmax(dim=1)
21 |
22 | def forward(self, x):
23 | x = self.logits_model(x)
24 | x = self.softmax(x)
25 |
26 | return x
27 |
28 |
29 | # slightly modified fast.ai implementation
30 | # https://medium.com/mlearning-ai/self-attention-in-convolutional-neural-networks-172d947afc00
31 | class ConvSelfAttention(nn.Module):
32 | """Self attention layer for `n_channels`."""
33 |
34 | def __init__(self, n_channels):
35 | super().__init__()
36 | self.query, self.key, self.value = [
37 | self._conv(n_channels, c)
38 | for c in (n_channels // 8, n_channels // 8, n_channels)
39 | ]
40 | self.gamma = nn.Parameter(torch.tensor([0.0]))
41 |
42 | def _conv(self, n_in, n_out):
43 | return nn.Conv1d(n_in, n_out, kernel_size=1, bias=False)
44 |
45 | def forward(self, x):
46 | # Notation from the paper.
47 | size = x.size()
48 | x = x.view(*size[:2], -1)
49 | f, g, h = self.query(x), self.key(x), self.value(x)
50 | beta = nn.functional.softmax(torch.bmm(f.transpose(1, 2), g), dim=1)
51 | o = self.gamma * torch.bmm(h, beta) + x
52 | return o.view(*size).contiguous()
53 |
54 |
55 | # see deep_pipe
56 | # https://github.com/neuro-ml/deep_pipe/blob/master/dpipe/layers/shape.py#L48
57 | class Reshape(nn.Module):
58 | """
59 | Reshape the incoming tensor to the given ``shape``.
60 |
61 | Parameters
62 | ----------
63 | shape: Union[int, str]
64 | the resulting shape. String values denote indices in the input tensor's shape.
65 |
66 | Examples
67 | --------
68 | >>> layer = Reshape('0', '1', 500, 500)
69 | >>> layer(x)
70 | >>> # same as
71 | >>> x.reshape(x.shape[0], x.shape[1], 500, 500)
72 | """
73 |
74 | def __init__(self, *shape: Union[int, str]):
75 | super().__init__()
76 | self.shape = shape
77 |
78 | def forward(self, x: torch.Tensor):
79 | shape = [x.shape[int(i)] if isinstance(i, str) else i for i in self.shape]
80 | return x.reshape(*shape)
81 |
82 |
83 | # see torchvision.models.mobilenetv2.MobileNetV2
84 | class ConvSelfAttentionMobileNet(nn.Module):
85 | def __init__(self, _config, n_classes, last_channel=128, in_channels=1):
86 |
87 | super().__init__()
88 | self._config = _config
89 | self.in_channels = in_channels
90 | self.n_classes = n_classes
91 | self.last_channel = last_channel
92 |
93 | block = InvertedResidual
94 | norm_layer = nn.BatchNorm2d
95 | width_mult = 1.0
96 | round_nearest = 8
97 |
98 | input_channel = 4
99 |
100 | features = [
101 | ConvBNReLU(self.in_channels, input_channel, stride=1, norm_layer=norm_layer)
102 | ]
103 | for t, c, n, s in _config:
104 | output_channel = _make_divisible(c * width_mult, round_nearest)
105 | for i in range(n):
106 | stride = s if i == 0 else 1
107 | features.append(
108 | block(
109 | input_channel,
110 | output_channel,
111 | stride,
112 | expand_ratio=t,
113 | norm_layer=norm_layer,
114 | )
115 | )
116 | input_channel = output_channel
117 | # building last several layers
118 | features.append(
119 | ConvBNReLU(
120 | input_channel, self.last_channel, kernel_size=1, norm_layer=norm_layer
121 | )
122 | )
123 | # make it nn.Sequential
124 | self.features = nn.Sequential(*features)
125 |
126 | self.pooling = nn.Sequential(
127 | ConvSelfAttention(self.last_channel),
128 | nn.AdaptiveAvgPool2d((1, 1)),
129 | Reshape("0", self.last_channel),
130 | )
131 |
132 | self.classifier = nn.Linear(self.last_channel, self.n_classes)
133 |
134 | def forward(self, x):
135 | x = self.features(x)
136 | x = self.pooling(x)
137 | x = self.classifier(x)
138 |
139 | return x
140 |
--------------------------------------------------------------------------------
/dusha/experiments/core/utils.py:
--------------------------------------------------------------------------------
1 | import json
2 | from pathlib import Path
3 |
4 | import pandas as pd
5 |
6 |
7 | def parse_name(tsv_name):
8 | """
9 | We have names like
10 | f"predicts_dataset_{dataset_name}_model_{model_name}.tsv" /
11 | f"metrics_dataset_{dataset_name}_model_{model_name}.csv"
12 |
13 | Returns: dataset_name, model_name
14 | """
15 |
16 | if tsv_name.startswith("predicts_dataset_"):
17 | # len('predicts_dataset_') = 17
18 | _s = tsv_name[17:]
19 | elif tsv_name.startswith("metrics_dataset_"):
20 | _s = tsv_name[16:]
21 | else:
22 | raise ValueError(f"tsv_name is {tsv_name}")
23 |
24 | model_prefix_start = _s.find("_model_")
25 | if model_prefix_start == -1:
26 | raise ValueError(f"tsv_name is {tsv_name}")
27 |
28 | dataset_name = _s[:model_prefix_start]
29 | model_name = _s[model_prefix_start + len("_model_") : -4]
30 |
31 | return dataset_name, model_name
32 |
33 |
34 | def raw_parse_dir(exps_path, prefix="predicts"):
35 | """
36 |     Parses a dir with experiments and returns dicts:
37 | dataset: model: path
38 | dataset: set of models
39 |
40 | Args:
41 | exps_path: path to dir with experiments
42 | prefix: 'predicts' or 'metrics' - what the function should parse
43 | """
44 | exps_path = Path(exps_path)
45 |
46 | # get paths to data
47 | glob_exp = "**/"
48 |
49 | if prefix == "predicts":
50 | glob_file = "predicts_*.tsv"
51 | elif prefix == "metrics":
52 | glob_file = "metrics_*.csv"
53 | else:
54 | raise ValueError(
55 | f"Get prefix = {prefix}, supports only ['predicts', 'metrics']"
56 | )
57 |
58 | data_paths = list(exps_path.glob(glob_exp + glob_file))
59 |
60 | data_paths = [
61 | p
62 | for p in data_paths
63 | if str(p.name).startswith(prefix)
64 | and str(p.name).find("dataset_") > -1
65 | and str(p.name).find("model_") > -1
66 | ]
67 |
68 | # init our structure
69 | # dataset: model: path_to_predict
70 | dataset_models_paths = {}
71 |
72 | # get all models for all datasets
73 |
74 | # dataset: set of model names
75 | dataset_models_dict = {}
76 | for curr_path in data_paths:
77 | dataset_name, model_name = parse_name(str(curr_path.name))
78 | if dataset_models_dict.get(dataset_name) is None:
79 | dataset_models_dict[dataset_name] = {model_name}
80 | dataset_models_paths[dataset_name] = {model_name: curr_path}
81 | else:
82 | dataset_models_dict[dataset_name] |= {model_name}
83 | dataset_models_paths[dataset_name].update({model_name: curr_path})
84 |
85 | return dataset_models_paths, dataset_models_dict
86 |
87 |
88 | def collect_metrics_to_one(list_of_metrics_df):
89 | df = list_of_metrics_df[0]
90 | df.columns = ["", df.columns[-1]]
91 | df = df.set_index("")
92 |
93 | for curr_metric_df in list_of_metrics_df[1:]:
94 | _df = curr_metric_df
95 | _df.columns = ["", _df.columns[-1]]
96 | _df = _df.set_index("")
97 | df = df.join(_df)
98 |
99 | df = df.sort_values("f1_macro", axis=1, ascending=False)
100 |
101 | return df
102 |
103 |
104 | def load_jsonl_as_df(file_name):
105 | data = []
106 | with open(file_name, "r") as file1:
107 | for line1 in file1:
108 | data.append(json.loads(line1))
109 | file1.close()
110 | df = pd.DataFrame.from_records(data)
111 | if "label" in df.columns:
112 | df.label = df.label.astype(int)
113 |
114 | return df
115 |
--------------------------------------------------------------------------------
/dusha/experiments/inf.py:
--------------------------------------------------------------------------------
1 | import os
2 | from pathlib import Path
3 |
4 | import click
5 | import lazycon
6 | import numpy as np
7 | import pandas as pd
8 | import torch
9 | from tqdm import tqdm
10 |
11 | from core.metrics import get_metrics_df
12 | from core.model import AUDIO_COLS, SoftMaxModel
13 | from core.utils import collect_metrics_to_one, load_jsonl_as_df, raw_parse_dir
14 |
15 | DEVICE = "cuda:0"
16 |
17 |
18 | def run_single_inf(exp_path, test_manifest, with_metrics, recalculate, device):
19 | # parse exp_path
20 | # it may be exp path or path to model
21 | if os.path.isdir(exp_path):
22 | dir_path = exp_path
23 | _path = Path(exp_path)
24 | model_path = _path / _path.name
25 | else:
26 | dir_path = os.path.dirname(exp_path)
27 | model_path = Path(exp_path)
28 |
29 | dir_path = Path(dir_path)
30 | model_name = model_path.name
31 |
32 | # check the config
33 | config_path = dir_path / "train.config"
34 | assert os.path.exists(config_path), f"No train.config in {dir_path}"
35 |
36 | # check the model
37 | if not os.path.exists(model_path):
38 |         print(f"There is no saved model {model_path}. Nothing to run inference on")
39 | return None
40 |
41 | # load the model
42 | cfg = lazycon.load(config_path)
43 | model = cfg.model
44 | try:
45 | model.to(device)
46 | model.load_state_dict(torch.load(model_path))
47 | model.eval()
48 | except Exception as exception:
49 | print(f"Problem with loading model {model_path}. Skipped")
50 | print(exception)
51 | return None
52 |
53 | # add last layer SoftMax to predict probabilities
54 | model = SoftMaxModel(model)
55 |
56 | # create predicts and metrics paths
57 | predicts_path = Path(dir_path) / "predicts"
58 | metrics_path = Path(dir_path) / "metrics"
59 |
60 | predicts_path.mkdir(exist_ok=True)
61 | metrics_path.mkdir(exist_ok=True)
62 |
63 | # parse --vm folder/df
64 | paths_to_inf = []
65 |
66 | if os.path.isdir(test_manifest):
67 | paths_to_inf = list(Path(test_manifest).glob("*.jsonl"))
68 | else:
69 | paths_to_inf.append(test_manifest)
70 |
71 | assert len(paths_to_inf) > 0, f"No .jsonl here: {test_manifest}"
72 |
73 | # iterate over datasets for inference
74 | for dataset_df_path in paths_to_inf:
75 | dataset_df_path = Path(dataset_df_path)
76 | dataset_df = load_jsonl_as_df(dataset_df_path)
77 | # dataset_df = pd.read_csv(dataset_df_path, sep='\t')
78 | dataset_name = str(dataset_df_path.name).split(".", maxsplit=1)[0]
79 | if with_metrics:
80 | assert (
81 | "label" in dataset_df.columns
82 |             ), f"{dataset_name} has no 'label' column, but --with_metrics is set"
83 |
84 | # predict
85 | predicts_tsv_path = (
86 | predicts_path / f"predicts_dataset_{dataset_name}_model_{model_name}.tsv"
87 | )
88 |
89 | # if predicts exist and we don't want to recalculate it, but want to calculate metrics
90 | if os.path.exists(predicts_tsv_path) and not recalculate:
91 | if with_metrics:
92 | metrics_csv_path = (
93 | metrics_path
94 | / f"metrics_dataset_{dataset_name}_model_{model_name}.csv"
95 | )
96 | if not os.path.exists(metrics_csv_path):
97 | print(
98 | f"Predicts for {model_name} {dataset_name} exist. Calculating metrics"
99 | )
100 | pred_df = pd.read_csv(predicts_tsv_path, sep="\t")
101 |
102 | pred_class = pred_df[AUDIO_COLS[0]].values
103 | gt_class = pred_df["label"].values
104 |
105 | metrics_df = get_metrics_df(
106 | pred_class=pred_class, gt_class=gt_class, model_name=model_name
107 | )
108 |
109 | metrics_df.to_csv(metrics_csv_path)
110 | else:
111 | print(
112 | f"Predicts and metrics for {model_name} {dataset_name} exist. Skipped"
113 | )
114 | else:
115 | print(
116 |                 f"Predicts for {model_name} {dataset_name} already exist. "
117 | + "--no_metrics, so metrics calculation is skipped"
118 | )
119 | continue
120 |
121 | # calculate predicts
122 | running_outputs = []
123 | ds = cfg.get_val_dataset(_df=dataset_df, ds_base_path=dataset_df_path.parent)
124 | dataloader = cfg.get_val_dataloader(val_ds=ds)
125 |
126 | print(f"Calculating predicts and metrics: {model_name} {dataset_name}")
127 | for inputs, _ in tqdm(dataloader):
128 | inputs = inputs.to(device)
129 | with torch.no_grad():
130 | probs = model(inputs)
131 |
132 | running_outputs.append(probs)
133 |
134 | # MelEmotionsDataset changes order in df, so we should match predicts by id
135 | _df = ds.df.copy()
136 | pred_class = np.argmax(torch.cat(running_outputs).cpu().numpy(), axis=1)
137 | probas = torch.cat(running_outputs).cpu().numpy()
138 |
139 | _df[AUDIO_COLS[0]] = pred_class
140 | for i in range(4):
141 | _df[AUDIO_COLS[i + 1]] = probas[:, i]
142 |
143 | # match preds by id
144 | pred_df = dataset_df.copy()
145 | _df = _df.set_index("id").loc[pred_df.id]
146 | for _col in AUDIO_COLS:
147 | pred_df[_col] = _df[_col].values
148 |
149 | pred_df.to_csv(predicts_tsv_path, index=False, sep="\t")
150 |
151 | # calculate metrics
152 | if with_metrics:
153 | metrics_csv_path = (
154 | metrics_path / f"metrics_dataset_{dataset_name}_model_{model_name}.csv"
155 | )
156 |
157 | pred_class = pred_df[AUDIO_COLS[0]].values
158 | gt_class = pred_df["label"].values
159 |
160 | metrics_df = get_metrics_df(
161 | pred_class=pred_class, gt_class=gt_class, model_name=model_name
162 | )
163 |
164 | metrics_df.to_csv(metrics_csv_path)
165 |
166 |
167 | @click.command()
168 | @click.option(
169 | "-exps_path",
170 | "--exps_path",
171 | required=True,
172 | type=click.Path(exists=True),
173 |     help="path to a folder with experiment folders (each experiment folder must contain a train.config file)",
174 | )
175 | @click.option(
176 | "-vm",
177 | "--test_manifest",
178 | required=True,
179 | type=click.Path(exists=True),
180 |     help="path to a JSONL file or a directory of JSONLs to run inference on",
181 | )
182 | @click.option(
183 | "--with_metrics/--no_metrics",
184 | default=True,
185 | help="calculate metrics for experiments",
186 | )
187 | @click.option(
188 | "--recalculate/--no_recalculate",
189 | default=False,
190 |     help="recalculate existing predicts and metrics",
191 | )
192 | @click.option(
193 | "--recalculate_dataset_metrics/--no_dataset_metrics",
194 | default=True,
195 |     help="recalculate existing metrics grouped by dataset",
196 | )
197 | @click.option(
198 |     "-device", "--device", type=click.STRING, default=DEVICE, help="device to run inference on"
199 | )
200 | def run_inf(
201 | exps_path,
202 | test_manifest,
203 | with_metrics,
204 | recalculate,
205 | recalculate_dataset_metrics,
206 | device,
207 | ):
208 | # parse folder, find experiments folders
209 | exps_path = Path(exps_path)
210 | experiment_paths = [p.parent for p in exps_path.glob("**/train.config")]
211 |
212 | # predict and calc metrics for a single experiment
213 | for exp_path in experiment_paths:
214 | run_single_inf(
215 | exp_path=exp_path,
216 | test_manifest=test_manifest,
217 | with_metrics=with_metrics,
218 | recalculate=recalculate,
219 | device=device,
220 | )
221 |
222 | # aggregate metrics
223 | metrics_dump_dir = exps_path / "metrics"
224 | metrics_dump_dir.mkdir(exist_ok=True)
225 |
226 | if recalculate_dataset_metrics:
227 | print("Aggregating metrics")
228 | dataset_models_paths, dataset_models = raw_parse_dir(
229 | exps_path=exps_path, prefix="metrics"
230 | )
231 | datasets = sorted(dataset_models.keys())
232 | for dataset_name in datasets:
233 | metric_dump_dir = metrics_dump_dir / f"exps_{dataset_name}.csv"
234 | metric_df = collect_metrics_to_one(
235 | [
236 | pd.read_csv(metrics_df_path)
237 | for metrics_df_path in dataset_models_paths[dataset_name].values()
238 | ]
239 | ).T
240 | metric_df.to_csv(metric_dump_dir)
241 | else:
242 | print("--no_dataset_metrics, so metrics grouped by dataset are skipped")
243 |
244 | agg_metrics_paths = list(metrics_dump_dir.glob("*.csv"))
245 | if len(agg_metrics_paths) == 0:
246 |         print("There are no metrics grouped by dataset")
247 | else:
248 | for agg_metrics_path in agg_metrics_paths:
249 | # remove exps_ and .csv in aggregated metrics df name
250 | dataset_name = str(agg_metrics_path.name)[5:-4]
251 | metric_df = pd.read_csv(agg_metrics_path).set_index("Unnamed: 0")
252 | metric_df.index.name = ""
253 | print("DATASET: ", dataset_name)
254 | print(metric_df)
255 | print("------------------------------------------------")
256 |
257 |
258 | if __name__ == "__main__":
259 | run_inf()
260 |
--------------------------------------------------------------------------------
/dusha/experiments/requirements.txt:
--------------------------------------------------------------------------------
1 | # numpy==1.19.5
2 | lazycon==0.3.1 # see https://github.com/maxme1/lazycon
3 | torch==1.8.1+cu101
4 | torchvision==0.9.1+cu101
5 | torchaudio==0.8.1
6 | pathlib==1.0.1
7 | librosa==0.7.1
8 | tqdm==4.55.1
9 | pandas==1.1.5
10 | scikit-learn==0.24.0
11 | tensorboard==2.4.1
--------------------------------------------------------------------------------
/dusha/experiments/train.py:
--------------------------------------------------------------------------------
1 | from pathlib import Path
2 | import random
3 | import shutil
4 |
5 | import click
6 | import lazycon
7 | import numpy as np
8 | import torch
9 |
10 | from core.learner import Learner
11 |
12 |
13 | @click.command()
14 | @click.option(
15 | "-config",
16 | "--config_path",
17 | required=True,
18 | type=click.Path(exists=True),
19 | help="path to .config file",
20 | )
21 | @click.option(
22 | "-exp_path",
23 | "--exp_path",
24 | required=True,
25 | type=click.Path(),
26 | help="path to dump experiment",
27 | )
28 | def train_model(config_path, exp_path):
29 | exp_path = Path(exp_path)
30 | model_name = exp_path.name
31 | cfg = lazycon.load(config_path)
32 | base_path = cfg.base_path
33 | assert (
34 | base_path.exists()
35 | ), f"{base_path} doesn't exist. Correct base_path in configs/data.config"
36 |
37 | exp_path.mkdir(parents=True, exist_ok=True)
38 |
39 | # dump params
40 | # save compiled config
41 | cfg.dump(exp_path / "train.config")
42 |
43 | # dump jsonls
44 | shutil.copy(cfg.train_manifest_path, exp_path / "train.jsonl")
45 | shutil.copy(cfg.val_manifest_path, exp_path / "val.jsonl")
46 |
47 | model = cfg.model
48 |
49 | # load pretrained model
50 | if cfg.pt_model_path is not None:
51 | model.load_state_dict(torch.load(cfg.pt_model_path, map_location="cuda:0"))
52 | shutil.copy(cfg.pt_model_path, exp_path / "pt_model")
53 |
54 | # init learner
55 | learner = Learner(
56 | train_dataset=cfg.train_dataset,
57 | val_dataset=cfg.val_dataset,
58 | dataloaders=cfg.dataloaders,
59 | exp_path=exp_path,
60 | model_name=model_name,
61 | model=model,
62 | batch_size=cfg.batch_size,
63 | dump_best_checkpoints=cfg.DUMP_BEST_CHECKPOINTS,
64 | dump_last_checkpoints=cfg.DUMP_LAST_CHECKPOINTS,
65 | best_checkpoints_warmup=cfg.BEST_CHECKPOINTS_WARMUP,
66 | )
67 |
68 | # train
69 | best_model_wts = learner.train(
70 | num_epochs=cfg.epoch_count,
71 | lr=cfg.learning_rate,
72 | step_size=cfg.optimizer_step,
73 | gamma=cfg.optimizer_gamma,
74 | weight_decay=cfg.weight_decay,
75 | clip_grad=cfg.clip_grad,
76 | )
77 |
78 | # dump best model
79 | torch.save(best_model_wts, exp_path / model_name)
80 |
81 |
82 | if __name__ == "__main__":
83 | # fix seeds for reproducibility
84 | torch.manual_seed(0)
85 | random.seed(0)
86 | np.random.seed(0)
87 | torch.backends.cudnn.benchmark = False
88 | torch.use_deterministic_algorithms(True)
89 |
90 | train_model()
91 |
--------------------------------------------------------------------------------
/golos/README.md:
--------------------------------------------------------------------------------
1 | # Golos dataset
2 |
3 | Golos is a Russian corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours.
4 | We have made the corpus freely available for download, along with the acoustic model trained on this corpus.
5 | We also created a 3-gram KenLM language model using the open Common Crawl corpus.
6 |
7 | # Table of contents
8 |
9 | - [Dataset structure](https://github.com/salute-developers/golos/tree/master/golos#dataset-structure)
10 | - [Downloads](https://github.com/salute-developers/golos/tree/master/golos#downloads)
11 | - [Audio files in opus format](https://github.com/salute-developers/golos/tree/master/golos#audio-files-in-opus-format)
12 | - [Audio files in wav format](https://github.com/salute-developers/golos/tree/master/golos#audio-files-in-wav-format)
13 | - [Acoustic and language models](https://github.com/salute-developers/golos/tree/master/golos#acoustic-and-language-models)
14 | - [Evaluation](https://github.com/salute-developers/golos/tree/master/golos#evaluation)
15 | - [Resources](https://github.com/salute-developers/golos/tree/master/golos#resources)
16 | - [License](https://github.com/salute-developers/golos/tree/master/golos#license)
17 | - [Contacts](https://github.com/salute-developers/golos/tree/master/golos#contacts)
18 |
19 |
20 | ## **Dataset structure**
21 |
22 | | Domain | Train files | Train hours | Test files | Test hours |
23 | |----------------|------------|--------|-------|------|
24 | | Crowd | 979 796 | 1 095 | 9 994 | 11.2 |
25 | | Farfield | 124 003 | 132.4| 1 916 | 1.4 |
26 | | Total | 1 103 799 | 1 227.4|11 910 | 12.6 |
27 |
28 | ---
29 |
30 | ## **Downloads**
31 |
32 | [MD5 Checksums](https://github.com/salute-developers/golos/blob/master/golos/md5sum.txt)
33 |
34 |
35 | ### **Audio files in opus format**
36 |
37 | | Archive | Size | Link |
38 | |------------------|------------|----------------------------------------------------------------------------------------------|
39 | | golos_opus.tar | 20.5 GB | [golos_opus.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/golos_opus.tar) |
40 |
41 | ---
42 |
43 | ### **Audio files in wav format**
44 |
45 | Manifest files with all the training transcriptions are included in the train_crowd9.tar archive listed in the table below:
46 |
47 | | Archives | Size | Links |
48 | |-------------------|------------|------------------------------------------------------------------------------------------------------|
49 | | train_farfield.tar| 15.4 GB | [train_farfield.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_farfield.tar) |
50 | | train_crowd0.tar | 11 GB | [train_crowd0.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd0.tar) |
51 | | train_crowd1.tar | 14 GB | [train_crowd1.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd1.tar) |
52 | | train_crowd2.tar | 13.2 GB | [train_crowd2.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd2.tar) |
53 | | train_crowd3.tar | 11.6 GB | [train_crowd3.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd3.tar) |
54 | | train_crowd4.tar | 15.8 GB | [train_crowd4.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd4.tar) |
55 | | train_crowd5.tar | 13.1 GB | [train_crowd5.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd5.tar) |
56 | | train_crowd6.tar | 15.7 GB | [train_crowd6.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd6.tar) |
57 | | train_crowd7.tar | 12.7 GB | [train_crowd7.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd7.tar) |
58 | | train_crowd8.tar | 12.2 GB | [train_crowd8.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd8.tar) |
59 | | train_crowd9.tar | 8.08 GB | [train_crowd9.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd9.tar) |
60 | | test.tar | 1.3 GB | [test.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/test.tar) |
61 |
62 | ---
63 |
64 | ### **Acoustic and language models**
65 |
66 | The acoustic model is built on the [QuartzNet15x5](https://arxiv.org/pdf/1910.10261.pdf) architecture and trained with the [NeMo toolkit](https://github.com/NVIDIA/NeMo/tree/r1.0.0b4).
67 |
68 |
69 | Three n-gram language models were created using the [KenLM Language Model Toolkit](https://kheafield.com/code/kenlm):
70 |
71 | * LM built on [Common Crawl](https://commoncrawl.org) Russian dataset
72 | * LM built on [Golos](https://github.com/salute-developers/golos) train set
73 | * LM built on [Common Crawl](https://commoncrawl.org) and [Golos](https://github.com/salute-developers/golos/tree/master/golos) datasets together (50/50)
74 |
75 | | Archives | Size | Links |
76 | |--------------------------|------------|-----------------------------------------------------------------------------------------------------------------------|
77 | | QuartzNet15x5_golos.nemo | 68 MB | [QuartzNet15x5_golos.nemo (TO BE UPDATED)](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/QuartzNet15x5_golos.nemo) |
78 | | CitriNet_ru1024bpe.tgz | 541 MB | [CitriNet_ru1024bpe.tgz](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/citrinet/CitriNet_ru1024bpe.tgz) |
79 | | KenLMs.tar | 4.8 GB | [KenLMs.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/kenlms.tar) |
80 |
81 |
82 | Golos data and models are also available in DataHub ML Space, a hub of pre-trained models, datasets, and containers. You can train models and deploy them on the high-performance SberCloud infrastructure in [ML Space](https://cloud.ru/), a full-cycle machine-learning development platform for DS-team collaboration built around the Christofari supercomputer.
83 |
84 |
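A minimal greedy-decoding sketch, adapted from `examples/infer.py` in this repository. It assumes the NeMo toolkit is installed and that the checkpoint and wav paths below (which are only examples) exist locally:

```
import nemo.collections.asr as nemo_asr

# load the released Golos acoustic model
asr_model = nemo_asr.models.EncDecCTCModel.restore_from("QuartzNet15x5_golos.nemo")
asr_model.cuda()

# greedy transcription of one or more wav files
transcripts = asr_model.transcribe(
    paths2audio_files=["examples/data/001ce26c07c20eaa0d666b824c6c6924.wav"],
    batch_size=1,
)
print(transcripts)
```
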
85 | ## **Evaluation**
86 |
87 | Word Error Rate (WER), in percent, for different test sets:
88 |
89 |
90 | | Decoder \ Test set | Crowd test | Farfield test | MCV1 dev | MCV1 test |
91 | |-------------------------------------|-----------|----------|-----------|----------|
92 | | Greedy decoder | 4.389 % | 14.949 % | 9.314 % | 11.278 % |
93 | | Beam Search with Common Crawl LM | 4.709 % | 12.503 % | 6.341 % | 7.976 % |
94 | | Beam Search with Golos train set LM | 3.548 % | 12.384 % | - | - |
95 | | Beam Search with Common Crawl and Golos LM | 3.318 % | 11.488 % | 6.4 % | 8.06 % |
96 |
97 |
98 | 1 [Common Voice](https://commonvoice.mozilla.org) - Mozilla's initiative to help teach machines how real people speak.
99 |
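The numbers above are word error rates expressed in percent. `examples/infer.py` computes them with NeMo's `word_error_rate` helper, which returns a fraction; a toy sketch (the hypothesis string is made up for illustration):

```
import nemo.collections.asr as nemo_asr

refs = ["шестьдесят тысяч тенге сколько будет стоить"]  # reference from examples/data/example1.json
hyps = ["шестьдесят тысяч тенге сколько будет стоит"]   # hypothetical ASR output with one wrong word
print(nemo_asr.metrics.wer.word_error_rate(hyps, refs))  # 1 error / 6 words ~ 0.167
```
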
100 | ## **Resources**
101 |
102 | [[INTERSPEECH 2021] Golos: Russian Dataset for Speech Research](https://www.isca-speech.org/archive/pdfs/interspeech_2021/karpov21_interspeech.pdf)
103 |
104 | [[habr.com] Golos — самый большой русскоязычный речевой датасет, размеченный вручную, теперь в открытом доступе](https://habr.com/ru/company/sberdevices/blog/559496/)
105 |
106 | [[habr.com] Как улучшить распознавание русской речи до 3% WER с помощью открытых данных](https://habr.com/ru/company/sberdevices/blog/569082/)
107 |
108 | ## **Cite**
109 | Karpov, N., Denisenko, A., Minkin, F. (2021) Golos: Russian Dataset for Speech Research. Proc. Interspeech 2021, 1419-1423, doi: 10.21437/Interspeech.2021-462
110 | ```
111 | @inproceedings{karpov21_interspeech,
112 | author={Nikolay Karpov and Alexander Denisenko and Fedor Minkin},
113 | title={{Golos: Russian Dataset for Speech Research}},
114 | year=2021,
115 | booktitle={Proc. Interspeech 2021},
116 | pages={1419--1423},
117 | doi={10.21437/Interspeech.2021-462}
118 | }
119 | ```
120 |
121 | ## **License**
122 |
123 | [English Version](https://github.com/salute-developers/golos/blob/master/license/en_us.pdf)
124 |
125 | [Russian Version](https://github.com/salute-developers/golos/blob/master/license/ru.pdf)
126 |
127 | ## **Contacts**
128 |
129 | Please create a GitHub issue!
130 |
131 | Authors (in alphabetic order):
132 | - Alexander Denisenko
133 | - Angelina Kovalenko
134 | - Fedor Minkin
135 | - Nikolay Karpov
136 |
--------------------------------------------------------------------------------
/golos/README_ru.md:
--------------------------------------------------------------------------------
1 | # Датасет Golos
2 |
3 | Golos — это датасет для распознавания речи на русском языке. Он состоит из аудиозаписей речи и транскрипций, полученных с помощью ручной разметки на краудсорсинговой платформе. Общая длительность записей составляет примерно 1240 часов. Все данные и обученные на них акустические модели распознавания речи бесплатны и открыты для скачивания. Также доступны триграммные модели KenLM, подготовленные при помощи русских текстов из открытого корпуса Common Crawl.
4 |
5 | # Содержание
6 |
7 | - [Структура датасета](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Структура-датасета)
8 | - [Скачать](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Скачать)
9 | - [Аудиофайлы в формате opus](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Аудиофайлы-в-формате-opus)
10 | - [Аудиофайлы в формате wav](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Аудиофайлы-в-формате-wav)
11 | - [Акустические и языковые модели](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Акустические-и-языковые-модели)
12 | - [Оценка качества](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Оценка-качества)
13 | - [Полезные ссылки](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Полезные-ссылки)
14 | - [Лицензия](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Лицензия)
15 | - [Контакты](https://github.com/salute-developers/golos/blob/master/golos/README_ru.md/#Контакты)
16 |
17 |
18 | ## **Структура датасета**
19 |
20 | | Домен | Train файлы | Train (часов) | Test файлы | Test (часов) |
21 | |----------------|------------|--------|-------|------|
22 | | Crowd | 979 796 | 1 095 | 9 994 | 11.2 |
23 | | Farfield | 124 003 | 132.4| 1 916 | 1.4 |
24 | | Итого | 1 103 799 | 1 227.4|11 910 | 12.6 |
25 |
26 | ---
27 |
28 | ## **Скачать**
29 |
30 | [MD5 контрольные суммы](https://github.com/salute-developers/golos/blob/master/golos/md5sum.txt)
31 |
32 |
33 | ### **Аудиофайлы в формате opus**
34 |
35 | | Archive | Size | Link |
36 | |------------------|------------|----------------------------------------------------------------------------------------------|
37 | | golos_opus.tar | 20.5 GB | [golos_opus.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/golos_opus.tar) |
38 |
39 | ---
40 |
41 | ### **Аудиофайлы в формате wav**
42 |
43 | Файл с транскрипциями записей всего обучающего подмножества находится в архиве train_crowd9.tar, доступном по ссылке в таблице:
44 |
45 | | Archives | Size | Links |
46 | |-------------------|------------|------------------------------------------------------------------------------------------------------|
47 | | train_farfield.tar| 15.4 GB | [train_farfield.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_farfield.tar) |
48 | | train_crowd0.tar | 11 GB | [train_crowd0.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd0.tar) |
49 | | train_crowd1.tar | 14 GB | [train_crowd1.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd1.tar) |
50 | | train_crowd2.tar | 13.2 GB | [train_crowd2.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd2.tar) |
51 | | train_crowd3.tar | 11.6 GB | [train_crowd3.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd3.tar) |
52 | | train_crowd4.tar | 15.8 GB | [train_crowd4.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd4.tar) |
53 | | train_crowd5.tar | 13.1 GB | [train_crowd5.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd5.tar) |
54 | | train_crowd6.tar | 15.7 GB | [train_crowd6.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd6.tar) |
55 | | train_crowd7.tar | 12.7 GB | [train_crowd7.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd7.tar) |
56 | | train_crowd8.tar | 12.2 GB | [train_crowd8.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd8.tar) |
57 | | train_crowd9.tar | 8.08 GB | [train_crowd9.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/train_crowd9.tar) |
58 | | test.tar | 1.3 GB | [test.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/test.tar) |
59 |
60 | ---
61 |
62 | ### **Акустические и языковые модели**
63 |
64 | Акустическая модель на основе архитектуры [QuartzNet15x5](https://arxiv.org/pdf/1910.10261.pdf) обучена с использованием [NeMo toolkit](https://github.com/NVIDIA/NeMo/tree/r1.0.0b4)
65 |
66 |
67 | Три n-грамные языковые модели (LM) подготовлены с использованием [KenLM Language Model Toolkit](https://kheafield.com/code/kenlm)
68 |
69 | * LM на русских текстах корпуса [Common Crawl](https://commoncrawl.org)
70 | * LM на текстах транскрипций обучающей подвыборки [Golos](https://github.com/salute-developers/golos)
71 | * LM на русских текстах [Common Crawl](https://commoncrawl.org) и транскрипциях [Golos](https://github.com/salute-developers/golos/tree/master/golos) вместе (50/50)
72 |
73 | | Archives | Size | Links |
74 | |--------------------------|------------|-----------------------------------------------------------------------------------------------------------------------|
75 | | QuartzNet15x5_golos.nemo | 68 MB | [QuartzNet15x5_golos.nemo (TO BE UPDATED)](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/QuartzNet15x5_golos.nemo) |
76 | | CitriNet_ru1024bpe.tgz | 541 MB | [CitriNet_ru1024bpe.tgz](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/citrinet/CitriNet_ru1024bpe.tgz) |
77 | | KenLMs.tar | 4.8 GB | [KenLMs.tar](https://cdn.chatwm.opensmodel.sberdevices.ru/golos/kenlms.tar) |
78 |
79 |
80 |
81 | Все данные и модели Golos также доступны в хранилище DataHub ML Space. Там располагаются предобученные модели, датасеты и Docker-образы.
82 |
83 |
84 | ## **Оценка качества**
85 |
86 | Процент ошибки распознавания WER (Word Error Rate) для разных тестовых данных.
87 |
88 |
89 | | Декодер \ Тестовые данные | Crowd test | Farfield test | MCV1 dev | MCV1 test |
90 | |-------------------------------------|-----------|----------|-----------|----------|
91 | | Greedy decoder | 4.389 % | 14.949 % | 9.314 % | 11.278 % |
92 | | Beam Search + Common Crawl LM | 4.709 % | 12.503 % | 6.341 % | 7.976 % |
93 | | Beam Search + Golos train set LM | 3.548 % | 12.384 % | - | - |
94 | | Beam Search + Common Crawl and Golos LM | 3.318 % | 11.488 % | 6.4 % | 8.06 % |
95 |
96 |
97 | 1 [Common Voice](https://commonvoice.mozilla.org) - проект компании Mozilla по сбору данных для автоматического распознавания речи.
98 |
99 | ## **Полезные ссылки**
100 |
101 | [[arxiv.org] Golos: Russian Dataset for Speech Research](https://arxiv.org/abs/2106.10161)
102 |
103 | [[habr.com] Golos — самый большой русскоязычный речевой датасет, размеченный вручную, теперь в открытом доступе](https://habr.com/ru/company/sberdevices/blog/559496/)
104 |
105 | [[habr.com] Как улучшить распознавание русской речи до 3% WER с помощью открытых данных](https://habr.com/ru/company/sberdevices/blog/569082/)
106 |
107 | ## **Лицензия**
108 |
109 | [Английская версия](https://github.com/salute-developers/golos/blob/master/license/en_us.pdf)
110 |
111 | [Русская версия](https://github.com/salute-developers/golos/blob/master/license/ru.pdf)
112 |
113 | ## **Контакты**
114 |
115 | Создавайте GitHub issue!
116 |
117 | Авторы в алфавитном порядке:
118 | - Александр Денисенко
119 | - Ангелина Коваленко
120 | - Николай Карпов
121 | - Федор Минькин
122 |
--------------------------------------------------------------------------------
/golos/examples/Dockerfile:
--------------------------------------------------------------------------------
1 | # Distributed on an "AS IS" BASIS,
2 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
3 | # See the License for the specific language governing permissions and
4 | # limitations under the License.
5 |
6 | FROM nvcr.io/nvidia/pytorch:21.03-py3
7 |
8 | WORKDIR /workspace
9 |
10 | RUN git clone --branch v1.0.2 https://github.com/NVIDIA/NeMo
11 |
12 | WORKDIR /workspace/NeMo
13 | RUN ./reinstall.sh
14 | RUN conda install -c anaconda swig
15 | WORKDIR /workspace/NeMo/scripts/asr_language_modeling/ngram_lm
16 | RUN sh install_beamsearch_decoders.sh
17 |
18 |
19 | RUN mkdir /workspace/models
20 | WORKDIR /workspace/models
21 |
22 | RUN echo "Download and extract QuartzNet15x5_golos.nemo ..."
23 | RUN wget https://sc.link/ZMv -O QuartzNet15x5_golos.nemo
24 |
25 | RUN echo "Download and extract KenLMs ..."
26 | RUN wget https://sc.link/YL0 -O kenlms.tar
27 | RUN tar -xvf kenlms.tar
28 |
29 |
30 | WORKDIR /workspace/golos
31 |
32 | CMD ["bash"]
33 |
34 | #CMD python ./examples/infer.py /workspace/models/QuartzNet15x5_golos.nemo ./examples/data/example1.json -lm /workspace/models/kenlms/lm_commoncrawl.binary
35 |
36 |
--------------------------------------------------------------------------------
/golos/examples/build_and_run_docker.sh:
--------------------------------------------------------------------------------
1 | # build container
2 |
3 | export CURRENT_DIR=$PWD
4 |
5 | docker build -t golos_image .
6 |
7 |
8 | # run container
9 | # requirements: NVIDIA GPU with a driver compatible with the PyTorch 21.03 container (see readme.md)
10 | docker run --gpus all \
11 | -v /$CURRENT_DIR/..:/workspace/golos \
12 | -it \
13 | --shm-size=8g \
14 |        --ulimit memlock=-1 --ulimit stack=67108864 \
15 |        golos_image
16 |
--------------------------------------------------------------------------------
/golos/examples/data/001ce26c07c20eaa0d666b824c6c6924.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/golos/examples/data/001ce26c07c20eaa0d666b824c6c6924.wav
--------------------------------------------------------------------------------
/golos/examples/data/example1.json:
--------------------------------------------------------------------------------
1 | {"audio_filepath": "/workspace/golos/examples/data/001ce26c07c20eaa0d666b824c6c6924.wav", "text": "шестьдесят тысяч тенге сколько будет стоить"}
2 |
--------------------------------------------------------------------------------
/golos/examples/infer.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import json
3 | import os
4 | import torch
5 |
6 | import nemo.collections.asr as nemo_asr
7 |
8 | def read_file_list(manifest):
9 | manifest_path, _ = os.path.split(manifest)
10 | files, texts = [], []
11 | with open(manifest, "r") as input_file:
12 | for line in input_file:
13 | as_dict = json.loads(line.rstrip('\n'))
14 | files.append(os.path.join(manifest_path, as_dict["audio_filepath"]))
15 | texts.append(as_dict["text"])
16 | return files, texts
17 |
18 | def infer_beam_search_lm(files, asr_model, beam_search_lm):
19 | hyps = []
20 |     logits = [torch.tensor(logit) for logit in asr_model.transcribe(files, batch_size=20, logprobs=True)]  # per-file [time, vocab] log-probabilities (lengths may differ)
21 | log_probs_length = torch.tensor([logit.shape[0] for logit in logits])
22 | logits_tensor = torch.nn.utils.rnn.pad_sequence(logits, batch_first=True)
23 | for j in range(logits_tensor.shape[0]):
24 |         best_hyp = beam_search_lm.forward(log_probs=logits_tensor[j].unsqueeze(0),
25 |                                            log_probs_length=log_probs_length[j].unsqueeze(0))[0][0][1]  # top beam = best-scoring hypothesis text
26 | hyps.append(best_hyp)
27 | return hyps
28 |
29 | def infer_greedy(files, asr_model):
30 | transcripts = asr_model.transcribe(paths2audio_files=files, batch_size=20)
31 | return transcripts
32 |
33 | def print_mistakes(hyps, refs):
34 | for hypo, reference in zip(hyps, refs):
35 | if hypo != reference:
36 | print(' h', hypo, '\n r', reference)
37 |
38 | def _parse_args():
39 | parser = argparse.ArgumentParser(description='Run inference using NeMo checkpoint')
40 | parser.add_argument('asr_ckpt', help='Path to ASR NeMo checkpoint file (.nemo)')
41 | parser.add_argument('manifest', help='Path to manifest where each line is a json with'
42 | ' transcription (.jsonl)')
43 |     parser.add_argument('-lm', help='Path to KenLM binary file (.binary)', default=None)
44 | return parser.parse_args()
45 |
46 |
47 | if __name__ == '__main__':
48 | args = _parse_args()
49 | asr_model = nemo_asr.models.EncDecCTCModel.restore_from(args.asr_ckpt)
50 | asr_model.cuda()
51 |
52 | files, texts = read_file_list(args.manifest)
53 |
54 | hyps = infer_greedy(files, asr_model)
55 | print_mistakes(hyps, texts)
56 | print("Greedy WER:", nemo_asr.metrics.wer.word_error_rate(hyps, texts))
57 |
58 | if args.lm:
59 | beam_search_lm = nemo_asr.modules.BeamSearchDecoderWithLM(
60 | vocab=list(asr_model.decoder.vocabulary),
61 | beam_width=16,
62 |         alpha=2, beta=1.5,  # language-model weight and word-insertion bonus
63 | lm_path=args.lm,
64 | num_cpus=1,
65 | cutoff_prob=1.0, cutoff_top_n=40,
66 | input_tensor=True)
67 |
68 | hyps = infer_beam_search_lm(files, asr_model, beam_search_lm)
69 | print_mistakes(hyps, texts)
70 | print("Beam search WER:", nemo_asr.metrics.wer.word_error_rate(hyps, texts))
71 |
--------------------------------------------------------------------------------
/golos/examples/readme.md:
--------------------------------------------------------------------------------
1 | # How to run
2 |
3 | If you have a properly configured NeMo environment, you can run the example directly from it; otherwise, you can run it from a Docker image.
4 |
5 | ## Run from docker
6 |
7 | > Minimal driver requirements: NVIDIA Driver release 460.32.03 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), or 450.51 (or later R450). See https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-03.html#rel_21-03
8 |
9 | ```bash
10 | cd golos/examples
11 | sh build_and_run_docker.sh
12 | ```
13 |
14 | After the container has started, run:
15 |
16 | ```bash
17 | python ./examples/infer.py /workspace/models/QuartzNet15x5_golos.nemo ./examples/data/example1.json -lm /workspace/models/kenlms/lm_commoncrawl.binary
18 | ```
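
If you already have a working NeMo environment (the examples target NeMo v1.0.2), the greedy-decoding part of `infer.py` can also be sketched directly in Python; the checkpoint and audio paths below are placeholders for files you have downloaded:

```python
# Minimal greedy-decoding sketch based on examples/infer.py.
# Checkpoint and audio paths are placeholders; adjust them to your setup.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModel.restore_from("QuartzNet15x5_golos.nemo")
asr_model.cuda()  # optional: comment out to run on CPU

transcripts = asr_model.transcribe(
    paths2audio_files=["data/001ce26c07c20eaa0d666b824c6c6924.wav"],
    batch_size=1,
)
print(transcripts)
```

Beam-search decoding with a KenLM language model is enabled in `infer.py` by passing the `-lm` path, as in the Docker command above.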
--------------------------------------------------------------------------------
/golos/md5sum.txt:
--------------------------------------------------------------------------------
1 | 453d4f8ef1d8adcc9d87020712f973c3 golos_opus.tar
2 | 7531a7e76e04e794c5847d8e58b3154c train_farfield.tar
3 | f88b397fc5dbc55ab326cd2360d5a0b2 train_crowd0.tar
4 | 19db1416da982641f1565ec701939f39 train_crowd1.tar
5 | d032dfb48d548631400f3a78e24756f8 train_crowd2.tar
6 | 83aeb8d3f8b93968b83224606801aceb train_crowd3.tar
7 | df177493e1c1c7cbb0ac7f13f79dbc06 train_crowd4.tar
8 | 724e4af231b3e226b32ca609db30862c train_crowd5.tar
9 | bd6e18b94ad81aebb99039dedb71de25 train_crowd6.tar
10 | 560960f748b7ac19bdb477dc9245cfc3 train_crowd7.tar
11 | bbc6814ca8377aaa75a4a987adad0b94 train_crowd8.tar
12 | eee4ff6dbebf77c738c69713ace7b768 train_crowd9.tar
13 | 284a6e54ca7bf0dd599db1bea030775c test.tar
14 | 5bdf7ce2ba610db4d309413ca3cb4064 QuartzNet15x5_golos.nemo
15 | 9ca0d96e47d5e2fb6f87ebec17066108 kenlms.tar
--------------------------------------------------------------------------------
/license/en_us.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/license/en_us.pdf
--------------------------------------------------------------------------------
/license/ru.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/salute-developers/golos/5c5c5f87044803fcccdf7e149ef5384c95cff107/license/ru.pdf
--------------------------------------------------------------------------------