├── .gitignore ├── README.md ├── data_prepare ├── aishell.sh ├── callhome.sh ├── data2json.sh ├── data2json_callhome.sh ├── dump.sh ├── mergejson.py ├── path.sh ├── scp2json.py ├── text2token.py └── vocab.py ├── egs ├── aishell │ ├── recipes │ │ ├── CIF.sh │ │ ├── conv_transformer_ctc.sh │ │ ├── ctc.sh │ │ ├── transformer.sh │ │ └── transformer_ctc.sh │ ├── steps │ └── utils ├── callhome_IPA │ ├── align2trans.py │ └── recipes │ │ ├── CIF.sh │ │ ├── conv_transformer_ctc.sh │ │ ├── ctc.sh │ │ ├── transformer.sh │ │ └── transformer_ctc.sh └── libri_wav2vec │ ├── README.md │ └── recipes │ ├── bert.sh │ └── finetune.sh ├── requirements.txt ├── src ├── __init__.py ├── ctcModel │ ├── __init__.py │ ├── attention.py │ ├── ctc_infer.py │ ├── ctc_model.py │ ├── decoder.py │ ├── encoder.py │ ├── loss.py │ ├── module.py │ ├── optimizer.py │ ├── recognize.py │ ├── solver.py │ └── train.py ├── mask_lm │ ├── Mask_LM.py │ ├── __init__.py │ ├── conv_encoder.py │ ├── data.py │ ├── decoder.py │ ├── encoder.py │ ├── infer.py │ ├── loss.py │ ├── module.py │ ├── solver.py │ └── train.py ├── transformer │ ├── __init__.py │ ├── attention.py │ ├── attentionAssigner.py │ ├── cif_model.py │ ├── conv_encoder.py │ ├── data.py │ ├── decoder.py │ ├── encoder.py │ ├── infer.py │ ├── loss.py │ ├── module.py │ ├── optimizer.py │ ├── solver.py │ ├── train.py │ └── transformer.py └── utils │ ├── __init__.py │ ├── average.py │ ├── data.py │ ├── filt.py │ ├── solver.py │ ├── utils.py │ └── wer.py └── test ├── beam_search_decode.py ├── data ├── data.json └── train_nodup_sp_units.txt ├── learn_pytorch.py ├── learn_visdom.py ├── path.sh ├── test_data.py └── test_decode.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | 3 | src/utils/kaldi_io.py 4 | src/.* 5 | 6 | tools/kaldi-io-for-python/ 7 | tools/kaldi 8 | 9 | egs/*/data 10 | egs/*/dump 11 | egs/*/fbank 12 | egs/*/exp 13 | 14 | test 15 | *.ipynb 16 | 17 | # MISC 18 | .DS_Store 19 | .nfs* 20 | .vscode 21 | __pycache__ 22 | nohup.out 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Speech Transformer (PyTorch) 2 | The implementation is based on [Speech Transformer: End-to-End ASR with Transformer](https://github.com/kaituoxu/Speech-Transformer). 3 | A PyTorch implementation of the Speech Transformer network, which directly converts acoustic features to a character sequence with a single neural network. 4 | This work was mainly done during an internship at Kuaishou. 5 | 6 | ## Install 7 | - Python3 8 | - PyTorch 1.5 9 | - [Kaldi](https://github.com/kaldi-asr/kaldi) (just for feature extraction) 10 | - `pip install -r requirements.txt` 11 | 12 | ## Usage 13 | ### Quick start 14 | ```bash 15 | $ cd egs/aishell 16 | # Modify the aishell data path to your own path at the beginning of run.sh 17 | $ bash transformer.sh 18 | ``` 19 | That's all! 20 | 21 | You can change a parameter with `$ bash transformer.sh --parameter_name parameter_value`, e.g., `$ bash run.sh --stage 3`. See the parameter names in `egs/aishell/run.sh` before `. utils/parse_options.sh`. 22 | 23 | ### Workflow 24 | - Data Preparation and Feature Generation 25 | TODO: use the scripts in data_prepare 26 | - Network Training 27 | 28 | - Decoding 29 | modify transformer.sh 30 | ### More detail 31 | `egs/aishell/run.sh` provides example usage. 32 | ```bash 33 | # Set PATH and PYTHONPATH 34 | $ cd egs/aishell/; . ./path.sh
35 | # Train 36 | $ train.py -h 37 | # Decode 38 | $ recognize.py -h 39 | ``` 40 | 41 | #### How to resume training? 42 | ```bash 43 | $ bash run.sh --continue_from 44 | ``` 45 | 46 | ## Results 47 | | Model | CER | Config | 48 | | :---: | :-: | :----: | 49 | | LSTMP | 9.85 | 4x(1024-512). See [kaldi-ktnet1](https://github.com/kaituoxu/kaldi-ktnet1/blob/ktnet1/egs/aishell/s5/local/nnet1/run_4lstm.sh) | 50 | | Listen, Attend and Spell | 13.2 | See [Listen-Attend-Spell](https://github.com/kaituoxu/Listen-Attend-Spell)'s egs/aishell/run.sh | 51 | | SpeechTransformer | 10.7 | See egs/aishell/run.sh | 52 | 53 | | Model | #Snt | #Wrd | Sub | Del | Ins | CER | 54 | | :---: | :-: | :----: |:----: |:----: |:----: | :----: | 55 | | SpeechTransformer | 7176 | 104765 | 9.9 | 0.4 | 0.3 | 10.7 | 56 | | Conv_CTC_Transformer | 7176 | 104765 | | | | | 57 | | Conv_CTC | 7176 | 104765 | | | | | 58 | | CIF | 7176 | 104765 | 9.44 | 0.33 | 0.24 | 10.02 | 59 | 60 | ## Acknowledgement 61 | - The framework and the speech-transformer baseline are based on [Speech Transformer: End-to-End ASR with Transformer](https://github.com/kaituoxu/Speech-Transformer) 62 | - `src/transformer/conv_encoder.py` refers to https://github.com/by2101/OpenASR. 63 | - The core implementation of the CIF algorithm was checked by Linhao Dong (the original author of CIF) 64 | 65 | 66 | ## Reference 67 | - [1] Yuanyuan Zhao, Jie Li, Xiaorui Wang, and Yan Li. "The SpeechTransformer for Large-scale Mandarin Chinese Speech Recognition." ICASSP 2019. 68 | - [2] L. Dong and B. Xu. "CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition." ICASSP 2020. 69 | -------------------------------------------------------------------------------- /data_prepare/aishell.sh: -------------------------------------------------------------------------------- 1 | # -- IMPORTANT 2 | data=/home/work_nfs/common/data # Modify to your aishell data path 3 | stage=2 # Modify to control which stage to start from 4 | # -- 5 | 6 | if [ $stage -le 0 ]; then 7 | echo "stage 0: Data Preparation" 8 | ### Task dependent. You have to do the data preparation part by yourself. 9 | ### But you can utilize Kaldi recipes in most cases 10 | # Generate wav.scp, text, utt2spk, spk2utt (segments) 11 | local/aishell_data_prep.sh $data/data_aishell/wav $data/data_aishell/transcript || exit 1; 12 | # remove space in text 13 | for x in train test dev; do 14 | cp data/${x}/text data/${x}/text.org 15 | paste -d " " <(cut -f 1 -d" " data/${x}/text.org) <(cut -f 2- -d" " data/${x}/text.org | tr -d " ") \ 16 | > data/${x}/text 17 | done 18 | fi 19 | 20 | feat_train_dir=${dumpdir}/train/delta${do_delta}; mkdir -p ${feat_train_dir} 21 | feat_test_dir=${dumpdir}/test/delta${do_delta}; mkdir -p ${feat_test_dir} 22 | feat_dev_dir=${dumpdir}/dev/delta${do_delta}; mkdir -p ${feat_dev_dir} 23 | if [ $stage -le 1 ]; then 24 | echo "stage 1: Feature Generation" 25 | ### Task dependent. You have to do the feature generation part by yourself.
26 | ### But you can utilize Kaldi recipes in most cases 27 | fbankdir=fbank 28 | for data in train test dev; do 29 | steps/make_fbank.sh --cmd "$train_cmd" --nj $nj --write_utt2num_frames true \ 30 | data/$data exp/make_fbank/$data $fbankdir/$data || exit 1; 31 | done 32 | # compute global CMVN 33 | compute-cmvn-stats scp:data/train/feats.scp data/train/cmvn.ark 34 | # dump features for training 35 | for data in train test dev; do 36 | feat_dir=`eval echo '$feat_'${data}'_dir'` 37 | dump.sh --cmd "$train_cmd" --nj $nj --do_delta $do_delta \ 38 | data/$data/feats.scp data/train/cmvn.ark exp/dump_feats/$data $feat_dir 39 | done 40 | fi 41 | 42 | dict=data/lang_1char/char.vocab 43 | echo "dictionary: ${dict}" 44 | nlsyms=data/lang_1char/non_lang_syms.txt 45 | if [ $stage -le 2 ]; then 46 | echo "stage 2: Dictionary and Json Data Preparation" 47 | ### Task dependent. You have to check non-linguistic symbols used in the corpus. 48 | mkdir -p data/lang_1char/ 49 | 50 | echo "make a non-linguistic symbol list" 51 | # It's empty in AISHELL-1 52 | cut -f 2- data/train/text | grep -o -P '\[.*?\]' | sort | uniq > ${nlsyms} 53 | cat ${nlsyms} 54 | 55 | echo "make a dictionary" 56 | echo " 0" > ${dict} 57 | echo " 1" >> ${dict} 58 | echo " 2" >> ${dict} 59 | echo " 3" >> ${dict} 60 | text2token.py -s 1 -n 1 -l ${nlsyms} data/train/text | cut -f 2- -d" " | tr " " "\n" \ 61 | | sort | uniq | grep -v -e '^\s*$' | awk '{print $0 " " NR+2}' >> ${dict} 62 | wc -l ${dict} 63 | 64 | echo "make json files" 65 | for data in train test dev; do 66 | feat_dir=`eval echo '$feat_'${data}'_dir'` 67 | data2json.sh --feat ${feat_dir}/feats.scp --nlsyms ${nlsyms} \ 68 | data/$data ${dict} > ${feat_dir}/data.json 69 | done 70 | 71 | echo " " >> ${dict} 72 | fi 73 | -------------------------------------------------------------------------------- /data_prepare/callhome.sh: -------------------------------------------------------------------------------- 1 | # -- IMPORTANT 2 | dir=$1 3 | 4 | echo "make json files" 5 | for data in train test dev; do 6 | ./data2json.sh --feat ${dir}/${data}/feats.scp ${dir}/${data} phones.txt > ${dir}/${data}/data.json 7 | done 8 | -------------------------------------------------------------------------------- /data_prepare/data2json.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | . ./path.sh 4 | 5 | lang="" 6 | feat="" # feat.scp 7 | oov="" 8 | verbose=0 9 | 10 | . utils/parse_options.sh 11 | 12 | if [ $# != 2 ]; then 13 | echo "Usage: $0 "; 14 | exit 1; 15 | fi 16 | 17 | dir=$1 18 | phone_dic=$2 19 | # token_dic=$3 20 | tmpdir=`mktemp -d ${dir}/tmp-XXXXX` 21 | rm -f ${tmpdir}/*.scp 22 | 23 | # input features, which are not necessary in decoding mode, so keep them optional 24 | if [ ! 
-z ${feat} ]; then 25 | if [ ${verbose} -eq 0 ]; then 26 | utils/data/get_utt2num_frames.sh ${dir} &> /dev/null 27 | cp ${dir}/utt2num_frames ${tmpdir}/ilen.scp 28 | feat-to-dim scp:${feat} ark,t:${tmpdir}/idim.scp &> /dev/null 29 | else 30 | utils/data/get_utt2num_frames.sh ${dir} 31 | cp ${dir}/utt2num_frames ${tmpdir}/ilen.scp 32 | feat-to-dim scp:${feat} ark,t:${tmpdir}/idim.scp 33 | fi 34 | fi 35 | 36 | # python text2token.py -s 1 -n 1 ${dir}/text > ${tmpdir}/token.scp 37 | cp ${dir}/text2phone.txt.nosil ${tmpdir}/phone.scp 38 | # cp ${dir}/text ${tmpdir}/token.scp 39 | 40 | cat ${tmpdir}/phone.scp | utils/sym2int.pl --map-oov ${oov} -f 2- ${phone_dic} > ${tmpdir}/phone_id.scp 41 | # cat ${tmpdir}/token.scp | utils/sym2int.pl --map-oov ${oov} -f 2- ${token_dic} > ${tmpdir}/token_id.scp 42 | 43 | # cat ${tmpdir}/tokenid.scp | awk '{print $1 " " NF-1}' > ${tmpdir}/olen.scp 44 | cat ${tmpdir}/phone_id.scp | awk '{print $1 " " NF-1}' > ${tmpdir}/olen.scp 45 | # +1 comes from 0-based dictionary 46 | vocsize=`tail -n 1 ${dic} | awk '{print $2}'` 47 | odim=`echo "$vocsize + 1" | bc` 48 | awk -v odim=${odim} '{print $1 " " odim}' ${dir}/text > ${tmpdir}/odim.scp 49 | 50 | # others 51 | if [ ! -z ${lang} ]; then 52 | awk -v lang=${lang} '{print $1 " " lang}' ${dir}/text > ${tmpdir}/lang.scp 53 | fi 54 | # feats 55 | cat ${feat} > ${tmpdir}/feat.scp 56 | 57 | rm -f ${tmpdir}/*.json 58 | for x in ${dir}/text ${dir}/utt2spk ${tmpdir}/*.scp; do 59 | k=`basename ${x} .scp` 60 | cat ${x} | python scp2json.py --key ${k} > ${tmpdir}/${k}.json 61 | done 62 | python mergejson.py --verbose ${verbose} ${tmpdir}/*.json 63 | 64 | rm -fr ${tmpdir} 65 | -------------------------------------------------------------------------------- /data_prepare/data2json_callhome.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | . ./path.sh 4 | 5 | feat="feats_cmvn.scp" 6 | phone_dic='phone.vocab' 7 | oov="" 8 | 9 | . 
utils/parse_options.sh 10 | 11 | if [ $# != 1 ]; then 12 | echo "Usage: $0 "; 13 | exit 1; 14 | fi 15 | 16 | dir=$1 17 | # token_dic=$3 18 | tmpdir=`mktemp -d ${dir}/tmp-XXXXX` 19 | rm -f ${tmpdir}/*.scp 20 | 21 | echo $tmpdir 22 | 23 | # input, which is not necessary for decoding mode, and make it as an optio 24 | utils/data/get_utt2num_frames.sh ${dir} &> /dev/null 25 | cp ${dir}/utt2num_frames ${tmpdir}/ilen.scp 26 | cp ${dir}/${feat} ${tmpdir}/feats.scp 27 | feat-to-dim scp:${tmpdir}/feats.scp ark,t:${tmpdir}/idim.scp &> /dev/null 28 | 29 | cp ${dir}/trans.phone ${tmpdir}/phone.scp 30 | cp ${dir}/text ${tmpdir}/text 31 | cp ${dir}/${feat} ${tmpdir}/feat.scp 32 | 33 | # cat ${tmpdir}/phone.scp | utils/token2int.pl --map-oov ${oov} -f 2- ${phone_dic} > ${tmpdir}/phone_id.scp 34 | # cat ${tmpdir}/token.scp | utils/token2int.pl --map-oov ${oov} -f 2- ${token_dic} > ${tmpdir}/token_id.scp 35 | python vocab.py -m 'look_up' --vocab ${phone_dic} --trans ${tmpdir}/phone.scp --output ${tmpdir}/phone_id.scp 36 | 37 | echo "get phone_len"; 38 | # cat ${tmpdir}/tokenid.scp | awk '{print $1 " " NF-1}' > ${tmpdir}/olen.scp 39 | cat ${tmpdir}/phone_id.scp | awk '{print $1 " " NF-1}' > ${tmpdir}/phone_len.scp 40 | 41 | echo "get phone size"; 42 | # +1 comes from 0-based dictionary 43 | vocsize=`wc -l ${phone_dic} | awk '{print $1}'` 44 | odim=`echo "$vocsize + 1" | bc` 45 | awk -v odim=${odim} '{print $1 " " odim}' ${tmpdir}/phone.scp > ${tmpdir}/phone_size.scp 46 | 47 | rm -f ${tmpdir}/*.json 48 | for x in ${dir}/text ${tmpdir}/*.scp; do 49 | k=`basename ${x} .scp` 50 | cat ${x} | python scp2json.py --key ${k} > ${tmpdir}/${k}.json 51 | done 52 | 53 | echo "merging json"; 54 | python mergejson.py ${tmpdir}/*.json > ${dir}/data.json 55 | 56 | rm -fr ${tmpdir} 57 | -------------------------------------------------------------------------------- /data_prepare/dump.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Copyright 2017 Nagoya University (Tomoki Hayashi) 4 | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) 5 | 6 | . ./path.sh 7 | 8 | cmd=run.pl 9 | do_delta=false 10 | nj=1 11 | verbose=0 12 | compress=true 13 | write_utt2num_frames=true 14 | 15 | . utils/parse_options.sh 16 | 17 | scp=$1 18 | cvmnark=$2 19 | logdir=$3 20 | dumpdir=$4 21 | 22 | if [ $# != 4 ]; then 23 | echo "Usage: $0 " 24 | exit 1; 25 | fi 26 | 27 | mkdir -p $logdir 28 | mkdir -p $dumpdir 29 | 30 | dumpdir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = "$pwd/$dir"; } print $dir; ' ${dumpdir} ${PWD}` 31 | 32 | for n in $(seq $nj); do 33 | # the next command does nothing unless $dumpdir/storage/ exists, see 34 | # utils/create_data_link.pl for more info. 
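    # (when ${dumpdir}/storage/ does exist, feats.${n}.ark is pre-created as a symlink into one of the storage sub-directories, so the dumped arks can be spread over several disks)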
35 | utils/create_data_link.pl ${dumpdir}/feats.${n}.ark 36 | done 37 | 38 | if $write_utt2num_frames; then 39 | write_num_frames_opt="--write-num-frames=ark,t:$dumpdir/utt2num_frames.JOB" 40 | else 41 | write_num_frames_opt= 42 | fi 43 | 44 | # split scp file 45 | split_scps="" 46 | for n in $(seq $nj); do 47 | split_scps="$split_scps $logdir/feats.$n.scp" 48 | done 49 | 50 | utils/split_scp.pl $scp $split_scps || exit 1; 51 | 52 | # dump features 53 | if ${do_delta};then 54 | $cmd JOB=1:$nj $logdir/dump_feature.JOB.log \ 55 | apply-cmvn --norm-vars=true $cvmnark scp:$logdir/feats.JOB.scp ark:- \| \ 56 | add-deltas ark:- ark:- \| \ 57 | copy-feats --compress=$compress --compression-method=2 ${write_num_frames_opt} \ 58 | ark:- ark,scp:${dumpdir}/feats.JOB.ark,${dumpdir}/feats.JOB.scp \ 59 | || exit 1 60 | else 61 | $cmd JOB=1:$nj $logdir/dump_feature.JOB.log \ 62 | apply-cmvn --norm-vars=true $cvmnark scp:$logdir/feats.JOB.scp ark:- \| \ 63 | copy-feats --compress=$compress --compression-method=2 ${write_num_frames_opt} \ 64 | ark:- ark,scp:${dumpdir}/feats.JOB.ark,${dumpdir}/feats.JOB.scp \ 65 | || exit 1 66 | fi 67 | 68 | # concatenate scp files 69 | for n in $(seq $nj); do 70 | cat $dumpdir/feats.$n.scp || exit 1; 71 | done > $dumpdir/feats.scp || exit 1 72 | 73 | if $write_utt2num_frames; then 74 | for n in $(seq $nj); do 75 | cat $dumpdir/utt2num_frames.$n || exit 1; 76 | done > $dumpdir/utt2num_frames || exit 1 77 | rm $dumpdir/utt2num_frames.* 2>/dev/null 78 | fi 79 | 80 | # remove temp scps 81 | rm $logdir/feats.*.scp 2>/dev/null 82 | if [ ${verbose} -eq 1 ]; then 83 | echo "Succeeded dumping features for training" 84 | fi 85 | -------------------------------------------------------------------------------- /data_prepare/mergejson.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # encoding: utf-8 3 | 4 | # Copyright 2017 Johns Hopkins University (Shinji Watanabe) 5 | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) 6 | 7 | import argparse 8 | import json 9 | import logging 10 | 11 | 12 | if __name__ == '__main__': 13 | parser = argparse.ArgumentParser() 14 | parser.add_argument('jsons', type=str, nargs='+', 15 | help='json files') 16 | parser.add_argument('--multi', '-m', type=int, 17 | help='Test the json file for multiple input/output', default=0) 18 | args = parser.parse_args() 19 | 20 | # logging info 21 | logging.basicConfig( 22 | level=logging.WARN, format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s") 23 | 24 | # make intersection set for utterance keys 25 | js = [] 26 | intersec_ks = [] 27 | for x in args.jsons: 28 | with open(x, 'r') as f: 29 | j = json.load(f) 30 | ks = j['utts'].keys() 31 | logging.info(x + ': has ' + str(len(ks)) + ' utterances') 32 | if len(intersec_ks) > 0: 33 | intersec_ks = intersec_ks.intersection(set(ks)) 34 | else: 35 | intersec_ks = set(ks) 36 | js.append(j) 37 | logging.info('new json has ' + str(len(intersec_ks)) + ' utterances') 38 | 39 | old_dic = dict() 40 | for k in intersec_ks: 41 | v = js[0]['utts'][k] 42 | for j in js[1:]: 43 | v.update(j['utts'][k]) 44 | old_dic[k] = v 45 | 46 | new_dic = dict() 47 | for id in old_dic: 48 | dic = old_dic[id] 49 | 50 | in_dic = {} 51 | try: 52 | in_dic['shape'] = (int(dic['ilen']), int(dic['idim'])) 53 | except: 54 | pass 55 | in_dic['name'] = 'input1' 56 | in_dic['feat'] = dic['feat'] 57 | 58 | out_dic = {} 59 | out_dic['name'] = 'phone' 60 | out_dic['shape'] = (int(dic['phone_len']), int(dic['phone_size'])) 
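        # each merged utterance entry is written out below as {'input': [in_dic], 'output': [out_dic]}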
61 | 62 | try: 63 | out_dic['phone'] = dic['phone'] 64 | out_dic['phone_id'] = dic['phone_id'] 65 | except: 66 | pass 67 | 68 | try: 69 | out_dic['text'] = dic['text'] 70 | except: 71 | pass 72 | 73 | try: 74 | out_dic['token'] = dic['token'] 75 | out_dic['token_id'] = dic['token_id'] 76 | except: 77 | pass 78 | 79 | new_dic[id] = {'input':[in_dic], 'output':[out_dic]} 80 | 81 | jsonstring = json.dumps({'utts': new_dic}, indent=2, ensure_ascii=False, sort_keys=True) 82 | 83 | print(jsonstring) 84 | -------------------------------------------------------------------------------- /data_prepare/path.sh: -------------------------------------------------------------------------------- 1 | KALDI_ROOT=~/easton/projects/kaldi-2019 2 | . $KALDI_ROOT/tools/config/common_path.sh 3 | -------------------------------------------------------------------------------- /data_prepare/scp2json.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # encoding: utf-8 3 | 4 | # Copyright 2017 Johns Hopkins University (Shinji Watanabe) 5 | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) 6 | 7 | import sys 8 | import json 9 | import argparse 10 | 11 | if __name__ == '__main__': 12 | parser = argparse.ArgumentParser() 13 | parser.add_argument('--key', '-k', type=str, 14 | help='key') 15 | args = parser.parse_args() 16 | 17 | l = {} 18 | line = sys.stdin.readline() 19 | while line: 20 | x = line.rstrip().split() 21 | v = {args.key: ' '.join(x[1:])} 22 | l[x[0]] = v 23 | line = sys.stdin.readline() 24 | 25 | all_l = {'utts': l} 26 | 27 | # ensure "ensure_ascii=False", which is a bug 28 | jsonstring = json.dumps(all_l, indent=4, ensure_ascii=False) 29 | print(jsonstring) 30 | -------------------------------------------------------------------------------- /data_prepare/text2token.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2017 Johns Hopkins University (Shinji Watanabe) 4 | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) 5 | 6 | import sys 7 | import argparse 8 | import re 9 | 10 | 11 | def exist_or_not(i, match_pos): 12 | start_pos = None 13 | end_pos = None 14 | for pos in match_pos: 15 | if pos[0] <= i < pos[1]: 16 | start_pos = pos[0] 17 | end_pos = pos[1] 18 | break 19 | 20 | return start_pos, end_pos 21 | 22 | 23 | def main(): 24 | parser = argparse.ArgumentParser() 25 | parser.add_argument('--nchar', '-n', default=1, type=int, 26 | help='number of characters to split, i.e., \ 27 | aabb -> a a b b with -n 1 and aa bb with -n 2') 28 | parser.add_argument('--skip-ncols', '-s', default=0, type=int, 29 | help='skip first n columns') 30 | parser.add_argument('--space', default='', type=str, 31 | help='space symbol') 32 | parser.add_argument('--non-lang-syms', '-l', default=None, type=str, 33 | help='list of non-linguistic symobles, e.g., etc.') 34 | parser.add_argument('text', type=str, default=False, nargs='?', 35 | help='input text') 36 | args = parser.parse_args() 37 | 38 | rs = [] 39 | if args.non_lang_syms is not None: 40 | with open(args.non_lang_syms, 'r') as f: 41 | nls = [x.rstrip() for x in f.readlines()] 42 | rs = [re.compile(re.escape(x)) for x in nls] 43 | 44 | if args.text: 45 | f = open(args.text) 46 | else: 47 | f = sys.stdin 48 | line = f.readline() 49 | n = args.nchar 50 | while line: 51 | x = line.split() 52 | print(' '.join(x[:args.skip_ncols])) 53 | a = ' '.join(x[args.skip_ncols:]) 54 | 55 | # get all matched positions 56 | 
match_pos = [] 57 | for r in rs: 58 | i = 0 59 | while i >= 0: 60 | m = r.search(a, i) 61 | if m: 62 | match_pos.append([m.start(), m.end()]) 63 | i = m.end() 64 | else: 65 | break 66 | 67 | if len(match_pos) > 0: 68 | chars = [] 69 | i = 0 70 | while i < len(a): 71 | start_pos, end_pos = exist_or_not(i, match_pos) 72 | if start_pos is not None: 73 | chars.append(a[start_pos:end_pos]) 74 | i = end_pos 75 | else: 76 | chars.append(a[i]) 77 | i += 1 78 | a = chars 79 | 80 | a = [a[i:i + n] for i in range(0, len(a), n)] 81 | 82 | a_flat = [] 83 | for z in a: 84 | a_flat.append("".join(z)) 85 | 86 | a_chars = [z.replace(' ', args.space) for z in a_flat] 87 | print(' '.join(a_chars)) 88 | line = f.readline() 89 | 90 | 91 | if __name__ == '__main__': 92 | main() 93 | -------------------------------------------------------------------------------- /data_prepare/vocab.py: -------------------------------------------------------------------------------- 1 | """ 2 | June 2017 by kyubyong park. 3 | kbpark.linguist@gmail.com. 4 | Modified by Easton. 5 | """ 6 | import logging 7 | from argparse import ArgumentParser 8 | from collections import Counter, defaultdict 9 | from tqdm import tqdm 10 | 11 | 12 | def load_vocab(path, vocab_size=None): 13 | with open(path, encoding='utf8') as f: 14 | vocab = [line.strip().split()[0] for line in f] 15 | vocab = vocab[:vocab_size] if vocab_size else vocab 16 | id_unk = vocab.index('') 17 | token2idx = defaultdict(lambda: id_unk) 18 | idx2token = defaultdict(lambda: '') 19 | token2idx.update({token: idx for idx, token in enumerate(vocab)}) 20 | idx2token.update({idx: token for idx, token in enumerate(vocab)}) 21 | if '' in vocab: 22 | idx2token[token2idx['']] = ' ' 23 | if '' in vocab: 24 | idx2token[token2idx['']] = '' 25 | # if '' in vocab: 26 | # idx2token[token2idx['']] = '' 27 | if '' in vocab: 28 | idx2token[token2idx['']] = '' 29 | 30 | assert len(token2idx) == len(idx2token) 31 | 32 | return token2idx, idx2token 33 | 34 | 35 | def make_vocab(fpath, fname): 36 | """Constructs vocabulary. 37 | Args: 38 | fpath: A string. Input file path. 39 | fname: A string. Output file name. 40 | 41 | uttid a b c d 42 | 43 | Writes vocabulary line by line to `fname`. 
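    Output format: one "token<TAB>count" pair per line, sorted by count in
    descending order; the special tokens are given dummy counts so that they
    land at fixed positions in the vocabulary.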
44 | """ 45 | word2cnt = Counter() 46 | with open(fpath, encoding='utf-8') as f: 47 | for l in f: 48 | words = l.strip().split(',')[1].split() 49 | word2cnt.update(Counter(words)) 50 | word2cnt.update({"": 1000000000, 51 | "": 100000000, 52 | "": 10000000, 53 | "": 1000000, 54 | "": 0,}) 55 | with open(fname, 'w', encoding='utf-8') as fout: 56 | for word, cnt in word2cnt.most_common(): 57 | fout.write(u"{}\t{}\n".format(word, cnt)) 58 | logging.info('Vocab path: {}\t size: {}'.format(fname, len(word2cnt))) 59 | 60 | 61 | def look_up(f_trans, f_vocab, f_output): 62 | token2idx, _ = load_vocab(f_vocab) 63 | with open(f_trans) as f, open(f_output, 'w') as fw: 64 | for line in tqdm(f): 65 | uttid, phones = line.strip().split(maxsplit=1) 66 | line = uttid + ' ' + ' '.join([str(token2idx[p]) for p in phones.split()]) 67 | fw.write(line + '\n') 68 | 69 | 70 | if __name__ == '__main__': 71 | parser = ArgumentParser() 72 | parser.add_argument('-m', type=str, dest='mode') 73 | parser.add_argument('--trans', type=str, dest='trans') 74 | parser.add_argument('--vocab', type=str, dest='vocab') 75 | parser.add_argument('--output', type=str, dest='output', default=None) 76 | args = parser.parse_args() 77 | # Read config 78 | logging.basicConfig(level=logging.INFO) 79 | if args.mode == 'gen': 80 | make_vocab(args.trans, args.vocab) 81 | else: 82 | look_up(args.trans, args.vocab, args.output) 83 | logging.info("Done") 84 | -------------------------------------------------------------------------------- /egs/aishell/recipes/CIF.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SRC_ROOT=$PWD/../../src 3 | export PYTHONPATH=$SRC_ROOT:$PYTHONPATH 4 | 5 | gpu_id=2 6 | stage='train' 7 | 8 | structure='cif' 9 | model_src=$SRC_ROOT/transformer 10 | dumpdir=/data3/easton/data/AISHELL/dump # directory to dump full features 11 | vocab=/data3/easton/data/AISHELL/data/lang_1char/char.vocab 12 | expdir=exp/cif # tag for managing experiments. 
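# NOTE: beam_size, nbest and decode_max_len are only assigned further down, so they expand to empty strings in decode_dir here.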
13 | decode_dir=${expdir}/decode_test_beam${beam_size}_nbest${nbest}_ml${decode_max_len} 14 | model=last.model 15 | # Training config 16 | epochs=100 17 | continue=0 18 | print_freq=100 19 | batch_frames=60000 20 | maxlen_in=800 21 | maxlen_out=150 22 | 23 | # optimizer 24 | k=0.2 25 | warmup_steps=4000 26 | 27 | # Decode config 28 | shuffle=1 29 | beam_size=5 30 | nbest=1 31 | decode_max_len=100 32 | 33 | # Feature configuration 34 | do_delta=false 35 | LFR_m=1 # Low Frame Rate: number of frames to stack 36 | LFR_n=1 # Low Frame Rate: number of frames to skip 37 | spec_aug_cfg=2-27-2-40 38 | 39 | # Network architecture 40 | # Conv encoder 41 | n_conv_layers=3 42 | 43 | # Encoder 44 | d_input=80 45 | n_layers_enc=6 46 | n_head=8 47 | d_model=512 48 | d_inner=2048 49 | dropout=0.1 50 | 51 | # assigner 52 | context_width=3 53 | d_assigner_hidden=512 54 | n_assigner_layers=1 55 | 56 | # Decoder 57 | n_layers_dec=6 58 | 59 | # Loss 60 | label_smoothing=0.1 61 | lambda_qua=0.001 62 | 63 | feat_train_dir=${dumpdir}/train/delta${do_delta}; mkdir -p ${feat_train_dir} 64 | feat_test_dir=${dumpdir}/test/delta${do_delta}; mkdir -p ${feat_test_dir} 65 | feat_dev_dir=${dumpdir}/dev/delta${do_delta}; mkdir -p ${feat_dev_dir} 66 | 67 | 68 | if [ $stage = 'train' ];then 69 | echo "stage 3: Network Training" 70 | mkdir -p ${expdir} 71 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/train.py \ 72 | --train-json ${feat_train_dir}/data.json \ 73 | --valid-json ${feat_dev_dir}/data.json \ 74 | --vocab ${vocab} \ 75 | --structure ${structure} \ 76 | --LFR_m ${LFR_m} \ 77 | --LFR_n ${LFR_n} \ 78 | --spec_aug_cfg ${spec_aug_cfg} \ 79 | --d_input $d_input \ 80 | --n_conv_layers $n_conv_layers \ 81 | --n_layers_enc $n_layers_enc \ 82 | --d_assigner_hidden $d_assigner_hidden \ 83 | --n_assigner_layers $n_assigner_layers \ 84 | --n_head $n_head \ 85 | --d_model $d_model \ 86 | --d_inner $d_inner \ 87 | --dropout $dropout \ 88 | --n_layers_dec $n_layers_dec \ 89 | --label_smoothing ${label_smoothing} \ 90 | --lambda_qua ${lambda_qua} \ 91 | --epochs $epochs \ 92 | --shuffle $shuffle \ 93 | --batch_frames $batch_frames \ 94 | --maxlen-in $maxlen_in \ 95 | --maxlen-out $maxlen_out \ 96 | --k $k \ 97 | --warmup_steps $warmup_steps \ 98 | --save-folder ${expdir} \ 99 | --_continue ${continue} \ 100 | --print-freq ${print_freq} 101 | fi 102 | 103 | 104 | if [ $stage = 'test' ];then 105 | echo "stage 4: Decoding" 106 | mkdir -p ${decode_dir} 107 | export PYTHONWARNINGS="ignore" 108 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/infer.py \ 109 | --type ${stage} \ 110 | --recog-json ${feat_test_dir}/data.json \ 111 | --vocab $vocab \ 112 | --structure ${structure} \ 113 | --LFR_m ${LFR_m} \ 114 | --LFR_n ${LFR_n} \ 115 | --d_input $d_input \ 116 | --n_conv_layers $n_conv_layers \ 117 | --n_layers_enc $n_layers_enc \ 118 | --d_assigner_hidden $d_assigner_hidden \ 119 | --n_assigner_layers $n_assigner_layers \ 120 | --n_head $n_head \ 121 | --d_model $d_model \ 122 | --d_inner $d_inner \ 123 | --n_layers_dec $n_layers_dec \ 124 | --model-path ${expdir}/${model} \ 125 | --output ${decode_dir}/hyp \ 126 | --beam-size $beam_size \ 127 | --nbest $nbest \ 128 | --decode-max-len $decode_max_len 129 | fi 130 | -------------------------------------------------------------------------------- /egs/aishell/recipes/conv_transformer_ctc.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SRC_ROOT=$PWD/../../src 3 | export PYTHONPATH=$SRC_ROOT:$PYTHONPATH 4 | 5 | gpu_id=0 6 | 
stage='train' 7 | 8 | structure='conv-transformer-ctc' 9 | model_src=$SRC_ROOT/transformer 10 | dumpdir=/data3/easton/data/AISHELL/dump # directory to dump full features 11 | vocab=/data3/easton/data/AISHELL/data/lang_1char/char.vocab 12 | expdir=exp/conv-transformer-ctc # tag for managing experiments. 13 | decode_dir=${expdir}/decode_test_beam${beam_size}_nbest${nbest}_ml${decode_max_len} 14 | model=last.model 15 | # Training config 16 | epochs=100 17 | continue=0 18 | print_freq=100 19 | batch_frames=50000 20 | maxlen_in=800 21 | maxlen_out=150 22 | 23 | # optimizer 24 | k=0.2 25 | warmup_steps=4000 26 | 27 | # Decode config 28 | shuffle=1 29 | beam_size=5 30 | nbest=1 31 | decode_max_len=100 32 | 33 | # Feature configuration 34 | do_delta=false 35 | LFR_m=1 # Low Frame Rate: number of frames to stack 36 | LFR_n=1 # Low Frame Rate: number of frames to skip 37 | spec_aug_cfg=2-27-2-40 38 | 39 | # Network architecture 40 | # Encoder 41 | d_input=80 42 | n_layers_enc=6 43 | n_head=8 44 | d_model=512 45 | d_inner=2048 46 | dropout=0.1 47 | # Decoder 48 | n_layers_dec=6 49 | 50 | # Loss 51 | label_smoothing=0.1 52 | 53 | feat_train_dir=${dumpdir}/train/delta${do_delta}; mkdir -p ${feat_train_dir} 54 | feat_test_dir=${dumpdir}/test/delta${do_delta}; mkdir -p ${feat_test_dir} 55 | feat_dev_dir=${dumpdir}/dev/delta${do_delta}; mkdir -p ${feat_dev_dir} 56 | 57 | 58 | if [ $stage = 'train' ];then 59 | echo "stage 3: Network Training" 60 | mkdir -p ${expdir} 61 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/train.py \ 62 | --train-json ${feat_train_dir}/data.json \ 63 | --valid-json ${feat_dev_dir}/data.json \ 64 | --vocab ${vocab} \ 65 | --structure ${structure} \ 66 | --LFR_m ${LFR_m} \ 67 | --LFR_n ${LFR_n} \ 68 | --spec_aug_cfg ${spec_aug_cfg} \ 69 | --d_input $d_input \ 70 | --n_layers_enc $n_layers_enc \ 71 | --n_head $n_head \ 72 | --d_model $d_model \ 73 | --d_inner $d_inner \ 74 | --dropout $dropout \ 75 | --n_layers_dec $n_layers_dec \ 76 | --label_smoothing ${label_smoothing} \ 77 | --epochs $epochs \ 78 | --shuffle $shuffle \ 79 | --batch_frames $batch_frames \ 80 | --maxlen-in $maxlen_in \ 81 | --maxlen-out $maxlen_out \ 82 | --k $k \ 83 | --warmup_steps $warmup_steps \ 84 | --save-folder ${expdir} \ 85 | --_continue ${continue} \ 86 | --print-freq ${print_freq} 87 | fi 88 | 89 | 90 | if [ $stage = 'test' ];then 91 | echo "stage 4: Decoding" 92 | mkdir -p ${decode_dir} 93 | export PYTHONWARNINGS="ignore" 94 | CUDA_VISIBLE_DEVICES=1 python $model_src/infer.py \ 95 | --type ${stage} \ 96 | --recog-json ${feat_test_dir}/data.json \ 97 | --vocab $vocab \ 98 | --structure ${structure} \ 99 | --output ${decode_dir}/hyp \ 100 | --model-path ${expdir}/${model} \ 101 | --beam-size $beam_size \ 102 | --nbest $nbest \ 103 | --decode-max-len $decode_max_len 104 | fi 105 | -------------------------------------------------------------------------------- /egs/aishell/recipes/ctc.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SRC_ROOT=$PWD/../../src 3 | export PYTHONPATH=$SRC_ROOT:$PYTHONPATH 4 | 5 | gpu_id=1 6 | stage='train' 7 | 8 | model_src=$SRC_ROOT/ctcModel 9 | dumpdir=/data3/easton/data/AISHELL/dump # directory to dump full features 10 | vocab=/data3/easton/data/AISHELL/data/lang_1char/char.vocab 11 | expdir=exp/ctc # tag for managing experiments. 
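# NOTE: the decoding stage below still refers to $dict and $continue_from, which are not set in this script; point them at the vocab file and a trained checkpoint before running stage='test'.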
12 | decode_dir=${expdir}/decode_test_beam${beam_size}_nbest${nbest}_ml${decode_max_len} 13 | 14 | # Training config 15 | epochs=100 16 | continue=0 17 | print_freq=100 18 | batch_frames=15000 19 | maxlen_in=800 20 | maxlen_out=150 21 | 22 | # optimizer 23 | k=0.2 24 | warmup_steps=4000 25 | 26 | # Decode config 27 | shuffle=1 28 | beam_size=5 29 | nbest=1 30 | decode_max_len=100 31 | 32 | # Feature configuration 33 | do_delta=false 34 | LFR_m=4 # Low Frame Rate: number of frames to stack 35 | LFR_n=3 # Low Frame Rate: number of frames to skip 36 | 37 | # Network architecture 38 | # Encoder 39 | d_input=80 40 | n_layers_enc=6 41 | n_head=8 42 | d_k=64 43 | d_v=64 44 | d_model=512 45 | d_inner=2048 46 | dropout=0.1 47 | pe_maxlen=5000 48 | # Decoder 49 | d_word_vec=512 50 | n_layers_dec=6 51 | 52 | # Loss 53 | label_smoothing=0.1 54 | 55 | feat_train_dir=${dumpdir}/train/delta${do_delta}; mkdir -p ${feat_train_dir} 56 | feat_test_dir=${dumpdir}/test/delta${do_delta}; mkdir -p ${feat_test_dir} 57 | feat_dev_dir=${dumpdir}/dev/delta${do_delta}; mkdir -p ${feat_dev_dir} 58 | 59 | 60 | if [ $stage = 'train' ];then 61 | echo "stage 3: Network Training" 62 | mkdir -p ${expdir} 63 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/train.py \ 64 | --train-json ${feat_train_dir}/data.json \ 65 | --valid-json ${feat_dev_dir}/data.json \ 66 | --vocab ${vocab} \ 67 | --LFR_m ${LFR_m} \ 68 | --LFR_n ${LFR_n} \ 69 | --d_input $d_input \ 70 | --n_layers_enc $n_layers_enc \ 71 | --n_head $n_head \ 72 | --d_k $d_k \ 73 | --d_v $d_v \ 74 | --d_model $d_model \ 75 | --d_inner $d_inner \ 76 | --dropout $dropout \ 77 | --pe_maxlen $pe_maxlen \ 78 | --d_word_vec $d_word_vec \ 79 | --n_layers_dec $n_layers_dec \ 80 | --label_smoothing ${label_smoothing} \ 81 | --epochs $epochs \ 82 | --shuffle $shuffle \ 83 | --batch_frames $batch_frames \ 84 | --maxlen-in $maxlen_in \ 85 | --maxlen-out $maxlen_out \ 86 | --warmup_steps $warmup_steps \ 87 | --save-folder ${expdir} \ 88 | --_continue ${continue} \ 89 | --print-freq ${print_freq} 90 | fi 91 | 92 | 93 | if [ $stage = 'test' ];then 94 | echo "stage 4: Decoding" 95 | mkdir -p ${decode_dir} 96 | export PYTHONWARNINGS="ignore" 97 | CUDA_VISIBLE_DEVICES=1 python $model_src/recognize.py \ 98 | --recog-json ${feat_test_dir}/data.json \ 99 | --dict $dict \ 100 | --output ${decode_dir}/hyp \ 101 | --model-path $continue_from \ 102 | --beam-size $beam_size \ 103 | --nbest $nbest \ 104 | --decode-max-len $decode_max_len 105 | fi 106 | -------------------------------------------------------------------------------- /egs/aishell/recipes/transformer.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SRC_ROOT=$PWD/../../src 3 | export PYTHONPATH=$SRC_ROOT:$PYTHONPATH 4 | 5 | gpu_id=0 6 | stage='train' 7 | 8 | model_src=$SRC_ROOT/transformer 9 | dumpdir=/data3/easton/data/AISHELL/dump # directory to dump full features 10 | vocab=/data3/easton/data/AISHELL/data/lang_1char/char.vocab 11 | expdir=exp/transformer # tag for managing experiments. 
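# NOTE: the decoding stage below passes --model-path $continue, which holds the 0/1 resume flag rather than a checkpoint path; point it at a trained model (e.g. ${expdir}/last.model) before running stage='test'.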
12 | decode_dir=${expdir}/decode_test_beam${beam_size}_nbest${nbest}_ml${decode_max_len} 13 | 14 | # Training config 15 | epochs=100 16 | continue=0 17 | print_freq=100 18 | batch_frames=15000 19 | maxlen_in=800 20 | maxlen_out=150 21 | 22 | # optimizer 23 | k=0.2 24 | warmup_steps=4000 25 | 26 | # Decode config 27 | shuffle=1 28 | beam_size=5 29 | nbest=1 30 | decode_max_len=100 31 | 32 | # Feature configuration 33 | do_delta=false 34 | LFR_m=4 # Low Frame Rate: number of frames to stack 35 | LFR_n=3 # Low Frame Rate: number of frames to skip 36 | 37 | # Network architecture 38 | # Encoder 39 | d_input=80 40 | n_layers_enc=6 41 | n_head=8 42 | d_k=64 43 | d_v=64 44 | d_model=512 45 | d_inner=2048 46 | dropout=0.1 47 | pe_maxlen=5000 48 | # Decoder 49 | d_word_vec=512 50 | n_layers_dec=6 51 | 52 | # Loss 53 | label_smoothing=0.1 54 | 55 | feat_train_dir=${dumpdir}/train/delta${do_delta}; mkdir -p ${feat_train_dir} 56 | feat_test_dir=${dumpdir}/test/delta${do_delta}; mkdir -p ${feat_test_dir} 57 | feat_dev_dir=${dumpdir}/dev/delta${do_delta}; mkdir -p ${feat_dev_dir} 58 | 59 | 60 | if [ $stage = 'train' ];then 61 | echo "stage 3: Network Training" 62 | mkdir -p ${expdir} 63 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/train.py \ 64 | --train-json ${feat_train_dir}/data.json \ 65 | --valid-json ${feat_dev_dir}/data.json \ 66 | --vocab ${vocab} \ 67 | --LFR_m ${LFR_m} \ 68 | --LFR_n ${LFR_n} \ 69 | --d_input $d_input \ 70 | --n_layers_enc $n_layers_enc \ 71 | --n_head $n_head \ 72 | --d_k $d_k \ 73 | --d_v $d_v \ 74 | --d_model $d_model \ 75 | --d_inner $d_inner \ 76 | --dropout $dropout \ 77 | --pe_maxlen $pe_maxlen \ 78 | --d_word_vec $d_word_vec \ 79 | --n_layers_dec $n_layers_dec \ 80 | --label_smoothing ${label_smoothing} \ 81 | --epochs $epochs \ 82 | --shuffle $shuffle \ 83 | --batch_frames $batch_frames \ 84 | --maxlen-in $maxlen_in \ 85 | --maxlen-out $maxlen_out \ 86 | --k $k \ 87 | --warmup_steps $warmup_steps \ 88 | --save-folder ${expdir} \ 89 | --_continue ${continue} \ 90 | --print-freq ${print_freq} 91 | fi 92 | 93 | 94 | if [ $stage = 'test' ];then 95 | echo "stage 4: Decoding" 96 | mkdir -p ${decode_dir} 97 | export PYTHONWARNINGS="ignore" 98 | CUDA_VISIBLE_DEVICES=1 python $model_src/recognize.py \ 99 | --recog-json ${feat_test_dir}/data.json \ 100 | --vocab $vocab \ 101 | --output ${decode_dir}/hyp \ 102 | --model-path $continue \ 103 | --beam-size $beam_size \ 104 | --nbest $nbest \ 105 | --decode-max-len $decode_max_len 106 | fi 107 | -------------------------------------------------------------------------------- /egs/aishell/recipes/transformer_ctc.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SRC_ROOT=$PWD/../../src 3 | export PYTHONPATH=$SRC_ROOT:$PYTHONPATH 4 | 5 | gpu_id=0 6 | stage='train' 7 | 8 | structure='transformer-ctc' 9 | model_src=$SRC_ROOT/transformer 10 | dumpdir=/data3/easton/data/AISHELL/dump # directory to dump full features 11 | vocab=/data3/easton/data/AISHELL/data/lang_1char/char.vocab 12 | expdir=exp/transformer-ctc # tag for managing experiments. 
13 | decode_dir=${expdir}/decode_test_beam${beam_size}_nbest${nbest}_ml${decode_max_len} 14 | model=last.model 15 | # Training config 16 | epochs=100 17 | continue=0 18 | print_freq=100 19 | batch_frames=15000 20 | maxlen_in=800 21 | maxlen_out=150 22 | 23 | # optimizer 24 | k=0.2 25 | warmup_steps=4000 26 | 27 | # Decode config 28 | shuffle=1 29 | beam_size=5 30 | nbest=1 31 | decode_max_len=100 32 | 33 | # Feature configuration 34 | do_delta=false 35 | LFR_m=4 # Low Frame Rate: number of frames to stack 36 | LFR_n=3 # Low Frame Rate: number of frames to skip 37 | 38 | # Network architecture 39 | # Encoder 40 | d_input=80 41 | n_layers_enc=6 42 | n_head=8 43 | d_k=64 44 | d_v=64 45 | d_model=512 46 | d_inner=2048 47 | dropout=0.1 48 | pe_maxlen=5000 49 | # Decoder 50 | d_word_vec=512 51 | n_layers_dec=6 52 | 53 | # Loss 54 | label_smoothing=0.1 55 | 56 | feat_train_dir=${dumpdir}/train/delta${do_delta}; mkdir -p ${feat_train_dir} 57 | feat_test_dir=${dumpdir}/test/delta${do_delta}; mkdir -p ${feat_test_dir} 58 | feat_dev_dir=${dumpdir}/dev/delta${do_delta}; mkdir -p ${feat_dev_dir} 59 | 60 | 61 | if [ $stage = 'train' ];then 62 | echo "stage 3: Network Training" 63 | mkdir -p ${expdir} 64 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/train.py \ 65 | --train-json ${feat_train_dir}/data.json \ 66 | --valid-json ${feat_dev_dir}/data.json \ 67 | --vocab ${vocab} \ 68 | --structure ${structure} \ 69 | --LFR_m ${LFR_m} \ 70 | --LFR_n ${LFR_n} \ 71 | --d_input $d_input \ 72 | --n_layers_enc $n_layers_enc \ 73 | --n_head $n_head \ 74 | --d_k $d_k \ 75 | --d_v $d_v \ 76 | --d_model $d_model \ 77 | --d_inner $d_inner \ 78 | --dropout $dropout \ 79 | --pe_maxlen $pe_maxlen \ 80 | --d_word_vec $d_word_vec \ 81 | --n_layers_dec $n_layers_dec \ 82 | --label_smoothing ${label_smoothing} \ 83 | --epochs $epochs \ 84 | --shuffle $shuffle \ 85 | --batch_frames $batch_frames \ 86 | --maxlen-in $maxlen_in \ 87 | --maxlen-out $maxlen_out \ 88 | --k $k \ 89 | --warmup_steps $warmup_steps \ 90 | --save-folder ${expdir} \ 91 | --_continue ${continue} \ 92 | --print-freq ${print_freq} 93 | fi 94 | 95 | 96 | if [ $stage = 'test' ];then 97 | echo "stage 4: Decoding" 98 | mkdir -p ${decode_dir} 99 | export PYTHONWARNINGS="ignore" 100 | CUDA_VISIBLE_DEVICES=1 python $model_src/infer.py \ 101 | --type ${stage} \ 102 | --recog-json ${feat_test_dir}/data.json \ 103 | --vocab $vocab \ 104 | --structure ${structure} \ 105 | --output ${decode_dir}/hyp \ 106 | --model-path ${expdir}/${model} \ 107 | --beam-size $beam_size \ 108 | --nbest $nbest \ 109 | --decode-max-len $decode_max_len 110 | fi 111 | -------------------------------------------------------------------------------- /egs/aishell/steps: -------------------------------------------------------------------------------- 1 | ../../tools/kaldi/egs/wsj/s5/steps -------------------------------------------------------------------------------- /egs/aishell/utils: -------------------------------------------------------------------------------- 1 | ../../tools/kaldi/egs/wsj/s5/utils -------------------------------------------------------------------------------- /egs/callhome_IPA/align2trans.py: -------------------------------------------------------------------------------- 1 | """ 2 | June 2017 by kyubyong park. 3 | kbpark.linguist@gmail.com. 4 | Modified by Easton. 
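Collapse a frame-level alignment (phone-id sequences) into phone transcriptions:
repeated ids are merged, ids are mapped back to phones via a phones.txt-style table,
and position suffixes (e.g. _B) as well as disambiguation symbols (#...) are dropped.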
5 | """ 6 | import logging 7 | from argparse import ArgumentParser 8 | from tqdm import tqdm 9 | 10 | 11 | def load_vocab(path): 12 | idx2token = {} 13 | with open(path) as f: 14 | for line in f: 15 | token, idx = line.strip().split() 16 | if '#' in token: 17 | break 18 | token = token.split('_')[0] 19 | idx2token[idx] = token 20 | 21 | return idx2token 22 | 23 | 24 | def align_shrink(align): 25 | _token = None 26 | list_tokens = [] 27 | for token in align: 28 | if _token != token: 29 | list_tokens.append(token) 30 | _token = token 31 | 32 | return list_tokens 33 | 34 | 35 | def main(ali, phone2idx, output): 36 | idx2token = load_vocab(phone2idx) 37 | 38 | with open(ali) as f, open(output, 'w') as fw: 39 | for line in tqdm(f): 40 | uttid, align = line.strip().split(maxsplit=1) 41 | phones = align_shrink(align.split()) 42 | line = uttid + ' ' + ' '.join(idx2token[p] for p in phones) 43 | fw.write(line + '\n') 44 | 45 | 46 | if __name__ == '__main__': 47 | parser = ArgumentParser() 48 | parser.add_argument('--ali', type=str, dest='ali') 49 | parser.add_argument('--phones', type=str, dest='phones') 50 | parser.add_argument('--output', type=str, dest='output') 51 | args = parser.parse_args() 52 | # Read config 53 | logging.basicConfig(level=logging.INFO) 54 | main(args.ali, args.phones, args.output) 55 | logging.info("Done") 56 | -------------------------------------------------------------------------------- /egs/callhome_IPA/recipes/CIF.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SRC_ROOT=$PWD/../../src 3 | export PYTHONPATH=$SRC_ROOT:$PYTHONPATH 4 | 5 | gpu_id=1 6 | stage='train' 7 | 8 | structure='cif' 9 | label_type='phone' 10 | model_src=$SRC_ROOT/transformer 11 | dumpdir=/data3/easton/data/CALLHOME_Multilingual/dump # directory to dump full features 12 | vocab=/data3/easton/data/CALLHOME_Multilingual/dump/phone.vocab 13 | expdir=exp/cif # tag for managing experiments. 
14 | decode_dir=${expdir}/decode_dev_beam${beam_size} 15 | model=last.model 16 | # Training config 17 | epochs=100 18 | continue=0 19 | print_freq=100 20 | batch_frames=20000 21 | maxlen_in=800 22 | maxlen_out=150 23 | 24 | # optimizer 25 | k=0.2 26 | warmup_steps=10000 27 | 28 | # Decode config 29 | shuffle=1 30 | beam_size=5 31 | nbest=1 32 | decode_max_len=100 33 | 34 | # Feature configuration 35 | LFR_m=1 # Low Frame Rate: number of frames to stack 36 | LFR_n=1 # Low Frame Rate: number of frames to skip 37 | spec_aug_cfg=2-27-2-40 38 | 39 | # Network architecture 40 | # Conv encoder 41 | n_conv_layers=2 42 | 43 | # Encoder 44 | d_input=80 45 | n_layers_enc=10 46 | n_head=8 47 | d_model=512 48 | d_inner=2048 49 | dropout=0.1 50 | 51 | # assigner 52 | context_width=5 53 | d_assigner_hidden=512 54 | n_assigner_layers=1 55 | 56 | # Decoder 57 | n_layers_dec=1 58 | 59 | # Loss 60 | label_smoothing=0.1 61 | lambda_qua=0.0001 62 | 63 | feat_train_dir=${dumpdir}/train; mkdir -p ${feat_train_dir} 64 | feat_test_dir=${dumpdir}/test; mkdir -p ${feat_test_dir} 65 | feat_dev_dir=${dumpdir}/dev; mkdir -p ${feat_dev_dir} 66 | 67 | if [ $stage = 'train' ];then 68 | echo "stage 3: Network Training" 69 | mkdir -p ${expdir} 70 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/train.py \ 71 | --train-json ${feat_train_dir}/data.json \ 72 | --valid-json ${feat_dev_dir}/data.json \ 73 | --vocab ${vocab} \ 74 | --structure ${structure} \ 75 | --label_type ${label_type} \ 76 | --LFR_m ${LFR_m} \ 77 | --LFR_n ${LFR_n} \ 78 | --spec_aug_cfg ${spec_aug_cfg} \ 79 | --d_input $d_input \ 80 | --n_conv_layers $n_conv_layers \ 81 | --n_layers_enc $n_layers_enc \ 82 | --d_assigner_hidden $d_assigner_hidden \ 83 | --n_assigner_layers $n_assigner_layers \ 84 | --n_head $n_head \ 85 | --d_model $d_model \ 86 | --d_inner $d_inner \ 87 | --dropout $dropout \ 88 | --n_layers_dec $n_layers_dec \ 89 | --label_smoothing ${label_smoothing} \ 90 | --lambda_qua ${lambda_qua} \ 91 | --epochs $epochs \ 92 | --shuffle $shuffle \ 93 | --batch_frames $batch_frames \ 94 | --maxlen-in $maxlen_in \ 95 | --maxlen-out $maxlen_out \ 96 | --k $k \ 97 | --warmup_steps $warmup_steps \ 98 | --save-folder ${expdir} \ 99 | --_continue ${continue} \ 100 | --print-freq ${print_freq} 101 | fi 102 | 103 | 104 | if [ $stage = 'test' ];then 105 | echo "stage 4: Decoding" 106 | mkdir -p ${decode_dir} 107 | export PYTHONWARNINGS="ignore" 108 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/infer.py \ 109 | --type ${stage} \ 110 | --recog-json ${feat_test_dir}/data.json \ 111 | --vocab $vocab \ 112 | --structure ${structure} \ 113 | --label_type ${label_type} \ 114 | --LFR_m ${LFR_m} \ 115 | --LFR_n ${LFR_n} \ 116 | --d_input $d_input \ 117 | --n_conv_layers $n_conv_layers \ 118 | --n_layers_enc $n_layers_enc \ 119 | --d_assigner_hidden $d_assigner_hidden \ 120 | --n_assigner_layers $n_assigner_layers \ 121 | --n_head $n_head \ 122 | --d_model $d_model \ 123 | --d_inner $d_inner \ 124 | --dropout $dropout \ 125 | --n_layers_dec $n_layers_dec \ 126 | --model-path ${expdir}/${model} \ 127 | --output ${decode_dir}/hyp \ 128 | --beam-size $beam_size \ 129 | --nbest $nbest \ 130 | --decode-max-len $decode_max_len 131 | fi 132 | -------------------------------------------------------------------------------- /egs/callhome_IPA/recipes/conv_transformer_ctc.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SRC_ROOT=$PWD/../../src 3 | export PYTHONPATH=$SRC_ROOT:$PYTHONPATH 4 | 5 | gpu_id=0 6 | stage='train' 
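# stage='train' runs the training block below; set stage='test' to run the decoding block instead.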
7 | 8 | structure='conv-transformer-ctc' 9 | model_src=$SRC_ROOT/transformer 10 | dumpdir=/data3/easton/data/AISHELL/dump # directory to dump full features 11 | vocab=/data3/easton/data/AISHELL/data/lang_1char/char.vocab 12 | expdir=exp/conv-transformer-ctc # tag for managing experiments. 13 | decode_dir=${expdir}/decode_test_beam${beam_size}_nbest${nbest}_ml${decode_max_len} 14 | model=last.model 15 | # Training config 16 | epochs=100 17 | continue=0 18 | print_freq=100 19 | batch_frames=15000 20 | maxlen_in=800 21 | maxlen_out=150 22 | 23 | # optimizer 24 | k=0.2 25 | warmup_steps=4000 26 | 27 | # Decode config 28 | shuffle=1 29 | beam_size=5 30 | nbest=1 31 | decode_max_len=100 32 | 33 | # Feature configuration 34 | do_delta=false 35 | LFR_m=1 # Low Frame Rate: number of frames to stack 36 | LFR_n=1 # Low Frame Rate: number of frames to skip 37 | 38 | # Network architecture 39 | # Encoder 40 | d_input=80 41 | n_layers_enc=6 42 | n_head=8 43 | d_k=64 44 | d_v=64 45 | d_model=512 46 | d_inner=2048 47 | dropout=0.1 48 | pe_maxlen=5000 49 | # Decoder 50 | d_word_vec=512 51 | n_layers_dec=6 52 | 53 | # Loss 54 | label_smoothing=0.1 55 | 56 | feat_train_dir=${dumpdir}/train/delta${do_delta}; mkdir -p ${feat_train_dir} 57 | feat_test_dir=${dumpdir}/test/delta${do_delta}; mkdir -p ${feat_test_dir} 58 | feat_dev_dir=${dumpdir}/dev/delta${do_delta}; mkdir -p ${feat_dev_dir} 59 | 60 | 61 | if [ $stage = 'train' ];then 62 | echo "stage 3: Network Training" 63 | mkdir -p ${expdir} 64 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/train.py \ 65 | --train-json ${feat_train_dir}/data.json \ 66 | --valid-json ${feat_dev_dir}/data.json \ 67 | --vocab ${vocab} \ 68 | --structure ${structure} \ 69 | --LFR_m ${LFR_m} \ 70 | --LFR_n ${LFR_n} \ 71 | --d_input $d_input \ 72 | --n_layers_enc $n_layers_enc \ 73 | --n_head $n_head \ 74 | --d_k $d_k \ 75 | --d_v $d_v \ 76 | --d_model $d_model \ 77 | --d_inner $d_inner \ 78 | --dropout $dropout \ 79 | --pe_maxlen $pe_maxlen \ 80 | --d_word_vec $d_word_vec \ 81 | --n_layers_dec $n_layers_dec \ 82 | --label_smoothing ${label_smoothing} \ 83 | --epochs $epochs \ 84 | --shuffle $shuffle \ 85 | --batch_frames $batch_frames \ 86 | --maxlen-in $maxlen_in \ 87 | --maxlen-out $maxlen_out \ 88 | --k $k \ 89 | --warmup_steps $warmup_steps \ 90 | --save-folder ${expdir} \ 91 | --_continue ${continue} \ 92 | --print-freq ${print_freq} 93 | fi 94 | 95 | 96 | if [ $stage = 'test' ];then 97 | echo "stage 4: Decoding" 98 | mkdir -p ${decode_dir} 99 | export PYTHONWARNINGS="ignore" 100 | CUDA_VISIBLE_DEVICES=1 python $model_src/infer.py \ 101 | --type ${stage} \ 102 | --recog-json ${feat_test_dir}/data.json \ 103 | --vocab $vocab \ 104 | --structure ${structure} \ 105 | --output ${decode_dir}/hyp \ 106 | --model-path ${expdir}/${model} \ 107 | --beam-size $beam_size \ 108 | --nbest $nbest \ 109 | --decode-max-len $decode_max_len 110 | fi 111 | -------------------------------------------------------------------------------- /egs/callhome_IPA/recipes/ctc.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SRC_ROOT=$PWD/../../src 3 | export PYTHONPATH=$SRC_ROOT:$PYTHONPATH 4 | 5 | gpu_id=1 6 | stage='train' 7 | 8 | model_src=$SRC_ROOT/ctcModel 9 | dumpdir=/data3/easton/data/AISHELL/dump # directory to dump full features 10 | vocab=/data3/easton/data/AISHELL/data/lang_1char/char.vocab 11 | expdir=exp/ctc # tag for managing experiments. 
12 | decode_dir=${expdir}/decode_test_beam${beam_size}_nbest${nbest}_ml${decode_max_len} 13 | 14 | # Training config 15 | epochs=100 16 | continue=0 17 | print_freq=100 18 | batch_frames=15000 19 | maxlen_in=800 20 | maxlen_out=150 21 | 22 | # optimizer 23 | k=0.2 24 | warmup_steps=4000 25 | 26 | # Decode config 27 | shuffle=1 28 | beam_size=5 29 | nbest=1 30 | decode_max_len=100 31 | 32 | # Feature configuration 33 | do_delta=false 34 | LFR_m=4 # Low Frame Rate: number of frames to stack 35 | LFR_n=3 # Low Frame Rate: number of frames to skip 36 | 37 | # Network architecture 38 | # Encoder 39 | d_input=80 40 | n_layers_enc=6 41 | n_head=8 42 | d_k=64 43 | d_v=64 44 | d_model=512 45 | d_inner=2048 46 | dropout=0.1 47 | pe_maxlen=5000 48 | # Decoder 49 | d_word_vec=512 50 | n_layers_dec=6 51 | 52 | # Loss 53 | label_smoothing=0.1 54 | 55 | feat_train_dir=${dumpdir}/train/delta${do_delta}; mkdir -p ${feat_train_dir} 56 | feat_test_dir=${dumpdir}/test/delta${do_delta}; mkdir -p ${feat_test_dir} 57 | feat_dev_dir=${dumpdir}/dev/delta${do_delta}; mkdir -p ${feat_dev_dir} 58 | 59 | 60 | if [ $stage = 'train' ];then 61 | echo "stage 3: Network Training" 62 | mkdir -p ${expdir} 63 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/train.py \ 64 | --train-json ${feat_train_dir}/data.json \ 65 | --valid-json ${feat_dev_dir}/data.json \ 66 | --vocab ${vocab} \ 67 | --LFR_m ${LFR_m} \ 68 | --LFR_n ${LFR_n} \ 69 | --d_input $d_input \ 70 | --n_layers_enc $n_layers_enc \ 71 | --n_head $n_head \ 72 | --d_k $d_k \ 73 | --d_v $d_v \ 74 | --d_model $d_model \ 75 | --d_inner $d_inner \ 76 | --dropout $dropout \ 77 | --pe_maxlen $pe_maxlen \ 78 | --d_word_vec $d_word_vec \ 79 | --n_layers_dec $n_layers_dec \ 80 | --label_smoothing ${label_smoothing} \ 81 | --epochs $epochs \ 82 | --shuffle $shuffle \ 83 | --batch_frames $batch_frames \ 84 | --maxlen-in $maxlen_in \ 85 | --maxlen-out $maxlen_out \ 86 | --warmup_steps $warmup_steps \ 87 | --save-folder ${expdir} \ 88 | --_continue ${continue} \ 89 | --print-freq ${print_freq} 90 | fi 91 | 92 | 93 | if [ $stage = 'test' ];then 94 | echo "stage 4: Decoding" 95 | mkdir -p ${decode_dir} 96 | export PYTHONWARNINGS="ignore" 97 | CUDA_VISIBLE_DEVICES=1 python $model_src/recognize.py \ 98 | --recog-json ${feat_test_dir}/data.json \ 99 | --dict $dict \ 100 | --output ${decode_dir}/hyp \ 101 | --model-path $continue_from \ 102 | --beam-size $beam_size \ 103 | --nbest $nbest \ 104 | --decode-max-len $decode_max_len 105 | fi 106 | -------------------------------------------------------------------------------- /egs/callhome_IPA/recipes/transformer.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SRC_ROOT=$PWD/../../src 3 | export PYTHONPATH=$SRC_ROOT:$PYTHONPATH 4 | 5 | gpu_id=0 6 | stage='train' 7 | 8 | model_src=$SRC_ROOT/transformer 9 | dumpdir=/data3/easton/data/AISHELL/dump # directory to dump full features 10 | vocab=/data3/easton/data/AISHELL/data/lang_1char/char.vocab 11 | expdir=exp/transformer # tag for managing experiments. 
12 | decode_dir=${expdir}/decode_test_beam${beam_size}_nbest${nbest}_ml${decode_max_len} 13 | 14 | # Training config 15 | epochs=100 16 | continue=0 17 | print_freq=100 18 | batch_frames=15000 19 | maxlen_in=800 20 | maxlen_out=150 21 | 22 | # optimizer 23 | k=0.2 24 | warmup_steps=4000 25 | 26 | # Decode config 27 | shuffle=1 28 | beam_size=5 29 | nbest=1 30 | decode_max_len=100 31 | 32 | # Feature configuration 33 | do_delta=false 34 | LFR_m=4 # Low Frame Rate: number of frames to stack 35 | LFR_n=3 # Low Frame Rate: number of frames to skip 36 | 37 | # Network architecture 38 | # Encoder 39 | d_input=80 40 | n_layers_enc=6 41 | n_head=8 42 | d_k=64 43 | d_v=64 44 | d_model=512 45 | d_inner=2048 46 | dropout=0.1 47 | pe_maxlen=5000 48 | # Decoder 49 | d_word_vec=512 50 | n_layers_dec=6 51 | 52 | # Loss 53 | label_smoothing=0.1 54 | 55 | feat_train_dir=${dumpdir}/train/delta${do_delta}; mkdir -p ${feat_train_dir} 56 | feat_test_dir=${dumpdir}/test/delta${do_delta}; mkdir -p ${feat_test_dir} 57 | feat_dev_dir=${dumpdir}/dev/delta${do_delta}; mkdir -p ${feat_dev_dir} 58 | 59 | 60 | if [ $stage = 'train' ];then 61 | echo "stage 3: Network Training" 62 | mkdir -p ${expdir} 63 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/train.py \ 64 | --train-json ${feat_train_dir}/data.json \ 65 | --valid-json ${feat_dev_dir}/data.json \ 66 | --vocab ${vocab} \ 67 | --LFR_m ${LFR_m} \ 68 | --LFR_n ${LFR_n} \ 69 | --d_input $d_input \ 70 | --n_layers_enc $n_layers_enc \ 71 | --n_head $n_head \ 72 | --d_k $d_k \ 73 | --d_v $d_v \ 74 | --d_model $d_model \ 75 | --d_inner $d_inner \ 76 | --dropout $dropout \ 77 | --pe_maxlen $pe_maxlen \ 78 | --d_word_vec $d_word_vec \ 79 | --n_layers_dec $n_layers_dec \ 80 | --label_smoothing ${label_smoothing} \ 81 | --epochs $epochs \ 82 | --shuffle $shuffle \ 83 | --batch_frames $batch_frames \ 84 | --maxlen-in $maxlen_in \ 85 | --maxlen-out $maxlen_out \ 86 | --k $k \ 87 | --warmup_steps $warmup_steps \ 88 | --save-folder ${expdir} \ 89 | --_continue ${continue} \ 90 | --print-freq ${print_freq} 91 | fi 92 | 93 | 94 | if [ $stage = 'test' ];then 95 | echo "stage 4: Decoding" 96 | mkdir -p ${decode_dir} 97 | export PYTHONWARNINGS="ignore" 98 | CUDA_VISIBLE_DEVICES=1 python $model_src/recognize.py \ 99 | --recog-json ${feat_test_dir}/data.json \ 100 | --vocab $vocab \ 101 | --output ${decode_dir}/hyp \ 102 | --model-path $continue \ 103 | --beam-size $beam_size \ 104 | --nbest $nbest \ 105 | --decode-max-len $decode_max_len 106 | fi 107 | -------------------------------------------------------------------------------- /egs/callhome_IPA/recipes/transformer_ctc.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SRC_ROOT=$PWD/../../src 3 | export PYTHONPATH=$SRC_ROOT:$PYTHONPATH 4 | 5 | gpu_id=0 6 | stage='train' 7 | 8 | structure='transformer-ctc' 9 | model_src=$SRC_ROOT/transformer 10 | dumpdir=/data3/easton/data/AISHELL/dump # directory to dump full features 11 | vocab=/data3/easton/data/AISHELL/data/lang_1char/char.vocab 12 | expdir=exp/transformer-ctc # tag for managing experiments. 
13 | decode_dir=${expdir}/decode_test_beam${beam_size}_nbest${nbest}_ml${decode_max_len} 14 | model=last.model 15 | # Training config 16 | epochs=100 17 | continue=0 18 | print_freq=100 19 | batch_frames=15000 20 | maxlen_in=800 21 | maxlen_out=150 22 | 23 | # optimizer 24 | k=0.2 25 | warmup_steps=4000 26 | 27 | # Decode config 28 | shuffle=1 29 | beam_size=5 30 | nbest=1 31 | decode_max_len=100 32 | 33 | # Feature configuration 34 | do_delta=false 35 | LFR_m=4 # Low Frame Rate: number of frames to stack 36 | LFR_n=3 # Low Frame Rate: number of frames to skip 37 | 38 | # Network architecture 39 | # Encoder 40 | d_input=80 41 | n_layers_enc=6 42 | n_head=8 43 | d_k=64 44 | d_v=64 45 | d_model=512 46 | d_inner=2048 47 | dropout=0.1 48 | pe_maxlen=5000 49 | # Decoder 50 | d_word_vec=512 51 | n_layers_dec=6 52 | 53 | # Loss 54 | label_smoothing=0.1 55 | 56 | feat_train_dir=${dumpdir}/train/delta${do_delta}; mkdir -p ${feat_train_dir} 57 | feat_test_dir=${dumpdir}/test/delta${do_delta}; mkdir -p ${feat_test_dir} 58 | feat_dev_dir=${dumpdir}/dev/delta${do_delta}; mkdir -p ${feat_dev_dir} 59 | 60 | 61 | if [ $stage = 'train' ];then 62 | echo "stage 3: Network Training" 63 | mkdir -p ${expdir} 64 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/train.py \ 65 | --train-json ${feat_train_dir}/data.json \ 66 | --valid-json ${feat_dev_dir}/data.json \ 67 | --vocab ${vocab} \ 68 | --structure ${structure} \ 69 | --LFR_m ${LFR_m} \ 70 | --LFR_n ${LFR_n} \ 71 | --d_input $d_input \ 72 | --n_layers_enc $n_layers_enc \ 73 | --n_head $n_head \ 74 | --d_k $d_k \ 75 | --d_v $d_v \ 76 | --d_model $d_model \ 77 | --d_inner $d_inner \ 78 | --dropout $dropout \ 79 | --pe_maxlen $pe_maxlen \ 80 | --d_word_vec $d_word_vec \ 81 | --n_layers_dec $n_layers_dec \ 82 | --label_smoothing ${label_smoothing} \ 83 | --epochs $epochs \ 84 | --shuffle $shuffle \ 85 | --batch_frames $batch_frames \ 86 | --maxlen-in $maxlen_in \ 87 | --maxlen-out $maxlen_out \ 88 | --k $k \ 89 | --warmup_steps $warmup_steps \ 90 | --save-folder ${expdir} \ 91 | --_continue ${continue} \ 92 | --print-freq ${print_freq} 93 | fi 94 | 95 | 96 | if [ $stage = 'test' ];then 97 | echo "stage 4: Decoding" 98 | mkdir -p ${decode_dir} 99 | export PYTHONWARNINGS="ignore" 100 | CUDA_VISIBLE_DEVICES=1 python $model_src/infer.py \ 101 | --type ${stage} \ 102 | --recog-json ${feat_test_dir}/data.json \ 103 | --vocab $vocab \ 104 | --structure ${structure} \ 105 | --output ${decode_dir}/hyp \ 106 | --model-path ${expdir}/${model} \ 107 | --beam-size $beam_size \ 108 | --nbest $nbest \ 109 | --decode-max-len $decode_max_len 110 | fi 111 | -------------------------------------------------------------------------------- /egs/libri_wav2vec/README.md: -------------------------------------------------------------------------------- 1 | refer to: https://github.com/pytorch/fairseq/tree/master/examples/wav2vec 2 | 3 | ```python 4 | python ~/projects/fairseq/examples/wav2vec/vq-wav2vec_featurize.py --data-dir test-clean --output-dir test-clean --checkpoint vq-wav2vec.pt --split train --extension tsv 5 | ``` 6 | get the vq file 7 | -------------------------------------------------------------------------------- /egs/libri_wav2vec/recipes/bert.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SRC_ROOT=$PWD/../../src 3 | export PYTHONPATH=$SRC_ROOT:$PYTHONPATH 4 | 5 | gpu_id=0 6 | stage='train' 7 | structure='BERT' 8 | mode='pre-train' 9 | model_src=$SRC_ROOT/mask_lm 10 | 
dumpdir=/data3/easton/data/Librispeech/wav2vec 11 | vocab_src=/data3/easton/data/Librispeech/wav2vec/vocab_500.txt 12 | vocab_tgt=/data3/easton/data/Librispeech/org_data/subword_3727.vocab 13 | expdir=exp/bert # tag for managing experiments. 14 | model=last.model 15 | # Training config 16 | epochs=50 17 | continue=0 18 | print_freq=1000 19 | batch_frames=4000 20 | maxlen_in=500 21 | maxlen_out=150 22 | 23 | # optimizer 24 | k=0.2 25 | warmup_steps=20000 26 | 27 | # Decode config 28 | shuffle=1 29 | 30 | # Network architecture 31 | # Conv encoder 32 | n_conv_layers=1 33 | 34 | # Encoder 35 | n_layers_enc=10 36 | n_head=8 37 | d_k=64 38 | d_v=64 39 | d_model=512 40 | d_inner=2048 41 | dropout=0.1 42 | pe_maxlen=5000 43 | 44 | # Loss 45 | label_smoothing=0.1 46 | 47 | train_src=${dumpdir}/train-960/train.src.with_uttid 48 | train_tgt=${dumpdir}/train-960/train.text.with_uttid 49 | 50 | dev_src=${dumpdir}/valid/valid.src.with_uttid 51 | dev_tgt=${dumpdir}/valid/valid.text.with_uttid 52 | 53 | 54 | if [ $stage = 'train' ];then 55 | echo "stage 3: Network Training" 56 | mkdir -p ${expdir} 57 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/train.py \ 58 | --train_src $train_src \ 59 | --valid_src $dev_src \ 60 | --vocab_src ${vocab_src} \ 61 | --vocab_tgt ${vocab_tgt} \ 62 | --structure ${structure} \ 63 | --mode ${mode} \ 64 | --n_conv_layers $n_conv_layers \ 65 | --n_layers_enc $n_layers_enc \ 66 | --n_head $n_head \ 67 | --d_model $d_model \ 68 | --d_inner $d_inner \ 69 | --dropout $dropout \ 70 | --label_smoothing ${label_smoothing} \ 71 | --epochs $epochs \ 72 | --shuffle $shuffle \ 73 | --batch_frames $batch_frames \ 74 | --maxlen-in $maxlen_in \ 75 | --maxlen-out $maxlen_out \ 76 | --k $k \ 77 | --warmup_steps $warmup_steps \ 78 | --save-folder ${expdir} \ 79 | --_continue ${continue} \ 80 | --print-freq ${print_freq} 81 | fi 82 | 83 | 84 | if [ $stage = 'test' ];then 85 | echo "stage 4: Decoding" 86 | mkdir -p ${decode_dir} 87 | export PYTHONWARNINGS="ignore" 88 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/infer.py \ 89 | --type ${stage} \ 90 | --recog-json ${feat_test_dir}/data.json \ 91 | --vocab_src ${vocab_src} \ 92 | --vocab_tgt ${vocab_tgt} \ 93 | --structure ${structure} \ 94 | --label_type ${label_type} \ 95 | --LFR_m ${LFR_m} \ 96 | --LFR_n ${LFR_n} \ 97 | --d_input $d_input \ 98 | --n_conv_layers $n_conv_layers \ 99 | --n_layers_enc $n_layers_enc \ 100 | --d_assigner_hidden $d_assigner_hidden \ 101 | --n_assigner_layers $n_assigner_layers \ 102 | --n_head $n_head \ 103 | --d_model $d_model \ 104 | --d_inner $d_inner \ 105 | --dropout $dropout \ 106 | --n_layers_dec $n_layers_dec \ 107 | --model-path ${expdir}/${model} \ 108 | --output ${decode_dir}/hyp \ 109 | --beam-size $beam_size \ 110 | --nbest $nbest \ 111 | --decode-max-len $decode_max_len 112 | fi 113 | -------------------------------------------------------------------------------- /egs/libri_wav2vec/recipes/finetune.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | SRC_ROOT=$PWD/../../src 3 | export PYTHONPATH=$SRC_ROOT:$PYTHONPATH 4 | 5 | gpu_id=0 6 | stage='train' 7 | 8 | structure='BERT' 9 | mode='finetune' 10 | model_src=$SRC_ROOT/mask_lm 11 | dumpdir=/data3/easton/data/Librispeech/wav2vec 12 | vocab_src=/data3/easton/data/Librispeech/wav2vec/vocab_500.txt 13 | vocab_tgt=/data3/easton/data/Librispeech/org_data/subword_3727.vocab 14 | pretrain=/home/easton/projects/Speech-Transformer-pytorch/egs/libri_wav2vec/exp/bert/last.model 15 | expdir=exp/finetune 
# tag for managing experiments. 16 | decode_dir=${expdir}/decode_dev_beam${beam_size} 17 | model=last.model 18 | # Training config 19 | epochs=100 20 | continue=0 21 | print_freq=100 22 | batch_size=4 23 | maxlen_in=1000 24 | maxlen_out=150 25 | 26 | # optimizer 27 | k=0.2 28 | warmup_steps=4000 29 | 30 | # Decode config 31 | shuffle=1 32 | 33 | # Network architecture 34 | # Conv encoder 35 | n_conv_layers=1 36 | 37 | # Encoder 38 | n_layers_enc=10 39 | n_head=8 40 | d_model=512 41 | d_inner=2048 42 | dropout=0.1 43 | 44 | # Loss 45 | label_smoothing=0.1 46 | 47 | train_src=${dumpdir}/train-960/train.src.with_uttid 48 | train_tgt=${dumpdir}/train-960/train-100.subword 49 | 50 | dev_src=${dumpdir}/valid/valid.src.with_uttid 51 | dev_tgt=${dumpdir}/valid/valid.subword 52 | 53 | if [ $stage = 'train' ];then 54 | echo "stage 3: Network Training" 55 | mkdir -p ${expdir} 56 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/train.py \ 57 | --train_src $train_src \ 58 | --valid_src $dev_src \ 59 | --train_tgt $train_tgt \ 60 | --valid_tgt $dev_tgt \ 61 | --vocab_src ${vocab_src} \ 62 | --vocab_tgt ${vocab_tgt} \ 63 | --structure ${structure} \ 64 | --mode ${mode} \ 65 | --n_conv_layers $n_conv_layers \ 66 | --n_layers_enc $n_layers_enc \ 67 | --n_head $n_head \ 68 | --d_model $d_model \ 69 | --d_inner $d_inner \ 70 | --dropout $dropout \ 71 | --label_smoothing ${label_smoothing} \ 72 | --epochs $epochs \ 73 | --shuffle $shuffle \ 74 | --batch_size $batch_size \ 75 | --maxlen-in $maxlen_in \ 76 | --maxlen-out $maxlen_out \ 77 | --k $k \ 78 | --warmup_steps $warmup_steps \ 79 | --pretrain $pretrain \ 80 | --save-folder ${expdir} \ 81 | --_continue ${continue} \ 82 | --print-freq ${print_freq} 83 | fi 84 | 85 | 86 | if [ $stage = 'test' ];then 87 | echo "stage 4: Decoding" 88 | mkdir -p ${decode_dir} 89 | export PYTHONWARNINGS="ignore" 90 | CUDA_VISIBLE_DEVICES=${gpu_id} python $model_src/infer.py \ 91 | --type ${stage} \ 92 | --recog-json ${feat_test_dir}/data.json \ 93 | --vocab_src ${vocab_src} \ 94 | --vocab_tgt ${vocab_tgt} \ 95 | --structure ${structure} \ 96 | --label_type ${label_type} \ 97 | --d_input $d_input \ 98 | --n_conv_layers $n_conv_layers \ 99 | --n_layers_enc $n_layers_enc \ 100 | --d_assigner_hidden $d_assigner_hidden \ 101 | --n_assigner_layers $n_assigner_layers \ 102 | --n_head $n_head \ 103 | --d_model $d_model \ 104 | --d_inner $d_inner \ 105 | --dropout $dropout \ 106 | --n_layers_dec $n_layers_dec \ 107 | --model-path ${expdir}/${model} \ 108 | --output ${decode_dir}/hyp \ 109 | --beam-size $beam_size \ 110 | --nbest $nbest \ 111 | --decode-max-len $decode_max_len 112 | fi 113 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | visdom -------------------------------------------------------------------------------- /src/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/eastonYi/end-to-end_asr_pytorch/7bedfb5ad10e363f5823b980ae0eb515b395839b/src/__init__.py -------------------------------------------------------------------------------- /src/ctcModel/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/eastonYi/end-to-end_asr_pytorch/7bedfb5ad10e363f5823b980ae0eb515b395839b/src/ctcModel/__init__.py -------------------------------------------------------------------------------- 
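A quick orientation before the ctcModel sources that follow: `MultiHeadAttention` in the next file projects queries/keys/values for all heads at once, folds the head dimension into the batch dimension, and applies scaled dot-product attention followed by a residual connection and layer norm. A minimal shape check (a sketch, not part of the repo; it only assumes `src/` is on `PYTHONPATH`, which the recipes arrange):

```python
import torch
from ctcModel.attention import MultiHeadAttention

# 8 heads of size 64 over a 512-dim model, matching the recipe defaults above
mha = MultiHeadAttention(n_head=8, d_model=512, d_k=64, d_v=64, dropout=0.1)
x = torch.randn(2, 50, 512)            # N x T x d_model
out, attn = mha(x, x, x, mask=None)    # self-attention: q = k = v
print(out.shape)                       # torch.Size([2, 50, 512])
print(attn.shape)                      # torch.Size([16, 50, 50]) -> (n_head * N) x T x T
```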
/src/ctcModel/attention.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import torch.nn as nn 4 | 5 | 6 | class MultiHeadAttention(nn.Module): 7 | ''' Multi-Head Attention module ''' 8 | 9 | def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1): 10 | super().__init__() 11 | 12 | self.n_head = n_head 13 | self.d_k = d_k 14 | self.d_v = d_v 15 | 16 | self.w_qs = nn.Linear(d_model, n_head * d_k) 17 | self.w_ks = nn.Linear(d_model, n_head * d_k) 18 | self.w_vs = nn.Linear(d_model, n_head * d_v) 19 | nn.init.normal_(self.w_qs.weight, mean=0, std=np.sqrt(2.0 / (d_model + d_k))) 20 | nn.init.normal_(self.w_ks.weight, mean=0, std=np.sqrt(2.0 / (d_model + d_k))) 21 | nn.init.normal_(self.w_vs.weight, mean=0, std=np.sqrt(2.0 / (d_model + d_v))) 22 | 23 | self.attention = ScaledDotProductAttention(temperature=np.power(d_k, 0.5), 24 | attn_dropout=dropout) 25 | self.layer_norm = nn.LayerNorm(d_model) 26 | 27 | self.fc = nn.Linear(n_head * d_v, d_model) 28 | nn.init.xavier_normal_(self.fc.weight) 29 | 30 | self.dropout = nn.Dropout(dropout) 31 | 32 | 33 | def forward(self, q, k, v, mask=None): 34 | 35 | d_k, d_v, n_head = self.d_k, self.d_v, self.n_head 36 | 37 | sz_b, len_q, _ = q.size() 38 | sz_b, len_k, _ = k.size() 39 | sz_b, len_v, _ = v.size() 40 | 41 | residual = q 42 | 43 | q = self.w_qs(q).view(sz_b, len_q, n_head, d_k) 44 | k = self.w_ks(k).view(sz_b, len_k, n_head, d_k) 45 | v = self.w_vs(v).view(sz_b, len_v, n_head, d_v) 46 | 47 | q = q.permute(2, 0, 1, 3).contiguous().view(-1, len_q, d_k) # (n*b) x lq x dk 48 | k = k.permute(2, 0, 1, 3).contiguous().view(-1, len_k, d_k) # (n*b) x lk x dk 49 | v = v.permute(2, 0, 1, 3).contiguous().view(-1, len_v, d_v) # (n*b) x lv x dv 50 | 51 | if mask is not None: 52 | mask = mask.repeat(n_head, 1, 1) # (n*b) x .. x .. 
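        # (added note) the mask is tiled n_head times so its leading dimension matches the
        # (n_head * batch) layout of q/k/v built above; ScaledDotProductAttention then fills
        # the masked positions with -inf before the softmax.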
53 | 54 | output, attn = self.attention(q, k, v, mask=mask) 55 | 56 | output = output.view(n_head, sz_b, len_q, d_v) 57 | output = output.permute(1, 2, 0, 3).contiguous().view(sz_b, len_q, -1) # b x lq x (n*dv) 58 | 59 | output = self.dropout(self.fc(output)) 60 | output = self.layer_norm(output + residual) 61 | 62 | return output, attn 63 | 64 | 65 | class ScaledDotProductAttention(nn.Module): 66 | ''' Scaled Dot-Product Attention ''' 67 | 68 | def __init__(self, temperature, attn_dropout=0.1): 69 | super().__init__() 70 | self.temperature = temperature 71 | self.dropout = nn.Dropout(attn_dropout) 72 | self.softmax = nn.Softmax(dim=2) 73 | 74 | def forward(self, q, k, v, mask=None): 75 | 76 | attn = torch.bmm(q, k.transpose(1, 2)) 77 | attn = attn / self.temperature 78 | 79 | if mask is not None: 80 | attn = attn.masked_fill(mask, -np.inf) 81 | 82 | attn = self.softmax(attn) 83 | attn = self.dropout(attn) 84 | output = torch.bmm(attn, v) 85 | 86 | return output, attn 87 | -------------------------------------------------------------------------------- /src/ctcModel/ctc_infer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | #encoding=utf-8 3 | 4 | # greedy decoder and beam-search decoder for CTC 5 | 6 | import torch 7 | import numpy as np 8 | 9 | 10 | class Decoder(object): 11 | "Base decoder class: turns model outputs into token sequences so that they can be scored against the reference labels" 12 | def __init__(self, space_idx=1, blank_index=0): 13 | ''' 14 | int2char : maps class indices to character labels (kept from an earlier version, unused here) 15 | space_idx : index of the space symbol; -1 means space is not a separate class 16 | blank_index : index of the CTC blank class, 0 by default 17 | ''' 18 | self.space_idx = space_idx 19 | self.blank_index = blank_index 20 | 21 | def __call__(self, prob_tensor, frame_seq_len=None): 22 | return self.decode(prob_tensor, frame_seq_len) 23 | 24 | def decode(self): 25 | "Decode function, implemented in the GreedyDecoder and BeamDecoder subclasses" 26 | raise NotImplementedError 27 | 28 | def ctc_reduce_map(self, batch_samples, lengths): 29 | """ 30 | inputs: 31 | batch_samples: size x time 32 | return: 33 | (padded_samples, lengths): (size x max_len, size) 34 | max_len <= time 35 | """ 36 | sents = [] 37 | for align, length in zip(batch_samples, lengths): 38 | sent = [] 39 | tmp = None 40 | for token in align[:length]: 41 | if token != self.blank_index and token != tmp: 42 | sent.append(token) 43 | tmp = token 44 | sents.append(sent) 45 | 46 | return self.padding_list_seqs(sents, dtype=np.int32, pad=0) 47 | 48 | def padding_list_seqs(self, list_seqs, dtype=np.float32, pad=0.): 49 | len_x = [len(s) for s in list_seqs] 50 | 51 | size_batch = len(list_seqs) 52 | maxlen = max(len_x) 53 | 54 | shape_feature = tuple() 55 | for s in list_seqs: 56 | if len(s) > 0: 57 | shape_feature = np.asarray(s).shape[1:] 58 | break 59 | 60 | x = (np.ones((size_batch, maxlen) + shape_feature) * pad).astype(dtype) 61 | for idx, s in enumerate(list_seqs): 62 | x[idx, :len(s)] = s 63 | 64 | return x, len_x 65 | 66 | 67 | class GreedyDecoder(Decoder): 68 | "Greedy decoding: take the most probable class at every frame rather than searching for the most probable whole sequence" 69 | def decode(self, prob_tensor, frame_seq_len): 70 | '''Decode function. 71 | Args: 72 | prob_tensor : network model output 73 | frame_seq_len : number of frames of each sample 74 | Returns: 75 | the decoded sequence, i.e. the recognition result 76 | ''' 77 | _, decoded = torch.max(prob_tensor, 2) 78 | decoded = decoded.view(decoded.size(0), decoded.size(1)) 79 | 80 | return self.ctc_reduce_map(decoded, frame_seq_len) 81 | 82 | 83 | class BeamDecoder(Decoder): 84 | "Beam-search decoding: returns the sequence with the highest overall probability" 85 | def __init__(self, beam_width = 200, blank_index = 0, space_idx = -1, lm_path=None, lm_alpha=0.01): 86 | self.beam_width = beam_width 87 | 
super().__init__(space_idx=space_idx, blank_index=blank_index) 88 | 89 | import sys 90 | sys.path.append('../') 91 | import utils.BeamSearch as uBeam 92 | import utils.NgramLM as uNgram 93 | lm = uNgram.LanguageModel(arpa_file=lm_path) 94 | self._decoder = uBeam.ctcBeamSearch(beam_width, lm, lm_alpha=lm_alpha, blank_index = blank_index) 95 | 96 | def decode(self, prob_tensor, frame_seq_len=None): 97 | '''Decode function. 98 | Args: 99 | prob_tensor : network model output 100 | frame_seq_len : number of frames of each sample 101 | Returns: 102 | res : the decoded sequence, i.e. the recognition result 103 | ''' 104 | probs = prob_tensor.transpose(0, 1) 105 | probs = torch.exp(probs) 106 | res = self._decoder.decode(probs, frame_seq_len) 107 | return res 108 | 109 | 110 | if __name__ == '__main__': 111 | decoder = Decoder(space_idx=1, blank_index=0) 112 | print(decoder.ctc_reduce_map([[1, 2, 1, 0, 3], [1, 2, 1, 1, 1]], [5, 5])) 113 | -------------------------------------------------------------------------------- /src/ctcModel/ctc_model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | from ctcModel.decoder import Decoder 5 | from ctcModel.encoder import Encoder 6 | 7 | 8 | class CTC_Model(nn.Module): 9 | """A CTC acoustic model: a Transformer encoder followed by a per-frame linear projection (the CTC decoder). 10 | """ 11 | 12 | def __init__(self, encoder, decoder): 13 | super().__init__() 14 | self.encoder = encoder 15 | self.decoder = decoder 16 | 17 | for p in self.parameters(): 18 | if p.dim() > 1: 19 | nn.init.xavier_uniform_(p) 20 | 21 | def forward(self, padded_input, input_lengths): 22 | """ 23 | Args: 24 | padded_input: N x Ti x D 25 | input_lengths: N 26 | Returns: logits (before softmax) and their lengths 27 | """ 28 | encoder_padded_outputs, *_ = self.encoder(padded_input, input_lengths) 29 | # pred is score before softmax 30 | logits, len_logits = self.decoder(encoder_padded_outputs, input_lengths) 31 | 32 | return logits, len_logits 33 | 34 | def recognize(self, input, input_length, ctc_infer, args): 35 | """CTC greedy / beam-search decoding, one utterance at a time. 
36 | Args: 37 | input: T x D 38 | decode: GreedyDecoder or BeamDecoder 39 | args: args.beam 40 | Returns: 41 | nbest_hyps: 42 | """ 43 | encoder_outputs, *_ = self.encoder(input.unsqueeze(0), input_length) 44 | logits, len_logits = self.decoder(encoder_outputs, input_length) 45 | 46 | nbest_hyps = ctc_infer(logits, len_logits) 47 | 48 | return nbest_hyps 49 | 50 | @classmethod 51 | def load_model(cls, path): 52 | # Load to CPU 53 | package = torch.load(path, map_location=lambda storage, loc: storage) 54 | model, LFR_m, LFR_n = cls.load_model_from_package(package) 55 | return model, LFR_m, LFR_n 56 | 57 | @classmethod 58 | def load_model_from_package(cls, package): 59 | encoder = Encoder(package['d_input'], 60 | package['n_layers_enc'], 61 | package['n_head'], 62 | package['d_k'], 63 | package['d_v'], 64 | package['d_model'], 65 | package['d_inner'], 66 | dropout=package['dropout'], 67 | pe_maxlen=package['pe_maxlen']) 68 | decoder = Decoder(package['vocab_size'], package['d_model']) 69 | model = cls(encoder, decoder) 70 | model.load_state_dict(package['state_dict']) 71 | LFR_m, LFR_n = package['LFR_m'], package['LFR_n'] 72 | return model, LFR_m, LFR_n 73 | 74 | @staticmethod 75 | def serialize(model, optimizer, epoch, LFR_m, LFR_n, tr_loss=None, cv_loss=None): 76 | package = { 77 | # Low Frame Rate Feature 78 | 'LFR_m': LFR_m, 79 | 'LFR_n': LFR_n, 80 | # encoder 81 | 'd_input': model.encoder.d_input, 82 | 'n_layers_enc': model.encoder.n_layers, 83 | 'n_head': model.encoder.n_head, 84 | 'd_k': model.encoder.d_k, 85 | 'd_v': model.encoder.d_v, 86 | 'd_model': model.encoder.d_model, 87 | 'd_inner': model.encoder.d_inner, 88 | 'dropout': model.encoder.dropout_rate, 89 | 'pe_maxlen': model.encoder.pe_maxlen, 90 | # decoder 91 | 'vocab_size': model.decoder.n_tgt_vocab, 92 | # state 93 | 'state_dict': model.state_dict(), 94 | 'optim_dict': optimizer.state_dict(), 95 | 'epoch': epoch 96 | } 97 | if tr_loss is not None: 98 | package['tr_loss'] = tr_loss 99 | package['cv_loss'] = cv_loss 100 | return package 101 | -------------------------------------------------------------------------------- /src/ctcModel/decoder.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | from utils.utils import sequence_mask 5 | 6 | 7 | class Decoder(nn.Module): 8 | ''' A decoder model with self attention mechanism. 
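    (Added note) Despite the generic description above, in the CTC model this "decoder" is
    just a per-frame linear projection (`tgt_word_prj`) of the encoder outputs, with the
    logits at padded frames zeroed out when `masking=True`.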
''' 9 | 10 | def __init__(self, n_tgt_vocab, d_input): 11 | super().__init__() 12 | # parameters 13 | self.n_tgt_vocab = n_tgt_vocab 14 | self.dim_output = n_tgt_vocab 15 | 16 | self.tgt_word_prj = nn.Linear(d_input, n_tgt_vocab, bias=False) 17 | nn.init.xavier_normal_(self.tgt_word_prj.weight) 18 | 19 | def forward(self, encoder_padded_outputs, encoder_input_lengths, masking=True): 20 | """ 21 | Args: 22 | padded_input: N x To 23 | encoder_padded_outputs: N x Ti x H 24 | 25 | Returns: 26 | """ 27 | # Get Deocder Input and Output 28 | 29 | # Prepare masks 30 | mask = sequence_mask(encoder_input_lengths) # B x T 31 | 32 | logits = self.tgt_word_prj(encoder_padded_outputs) 33 | if masking: 34 | # B x T x V 35 | mask = mask.view(mask.shape[0], mask.shape[1], 1).repeat(1, 1, self.dim_output) 36 | logits *= mask 37 | 38 | len_logits = encoder_input_lengths 39 | 40 | return logits, len_logits 41 | -------------------------------------------------------------------------------- /src/ctcModel/encoder.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | 3 | from ctcModel.attention import MultiHeadAttention 4 | from ctcModel.module import PositionalEncoding, PositionwiseFeedForward 5 | from utils.utils import get_non_pad_mask, get_attn_pad_mask 6 | 7 | 8 | class Encoder(nn.Module): 9 | """Encoder of Transformer including self-attention and feed forward. 10 | """ 11 | 12 | def __init__(self, d_input, n_layers, n_head, d_k, d_v, 13 | d_model, d_inner, dropout=0.1, pe_maxlen=5000): 14 | super(Encoder, self).__init__() 15 | # parameters 16 | self.d_input = d_input 17 | self.n_layers = n_layers 18 | self.n_head = n_head 19 | self.d_k = d_k 20 | self.d_v = d_v 21 | self.d_model = d_model 22 | self.dim_output = d_model 23 | self.d_inner = d_inner 24 | self.dropout_rate = dropout 25 | self.pe_maxlen = pe_maxlen 26 | 27 | # use linear transformation with layer norm to replace input embedding 28 | self.linear_in = nn.Linear(d_input, d_model) 29 | self.layer_norm_in = nn.LayerNorm(d_model) 30 | self.positional_encoding = PositionalEncoding(d_model, max_len=pe_maxlen) 31 | self.dropout = nn.Dropout(dropout) 32 | 33 | self.layer_stack = nn.ModuleList([ 34 | EncoderLayer(d_model, d_inner, n_head, d_k, d_v, dropout=dropout) 35 | for _ in range(n_layers)]) 36 | 37 | def forward(self, padded_input, input_lengths, return_attns=False): 38 | """ 39 | Args: 40 | padded_input: N x T x D 41 | input_lengths: N 42 | 43 | Returns: 44 | enc_output: N x T x H 45 | """ 46 | enc_slf_attn_list = [] 47 | 48 | # Prepare masks 49 | non_pad_mask = get_non_pad_mask(padded_input, input_lengths=input_lengths) 50 | length = padded_input.size(1) 51 | slf_attn_mask = get_attn_pad_mask(input_lengths, length) 52 | 53 | # Forward 54 | enc_output = self.dropout( 55 | self.layer_norm_in(self.linear_in(padded_input)) + 56 | self.positional_encoding(padded_input)) 57 | 58 | for enc_layer in self.layer_stack: 59 | enc_output, enc_slf_attn = enc_layer( 60 | enc_output, 61 | non_pad_mask=non_pad_mask, 62 | slf_attn_mask=slf_attn_mask) 63 | if return_attns: 64 | enc_slf_attn_list += [enc_slf_attn] 65 | 66 | if return_attns: 67 | return enc_output, enc_slf_attn_list 68 | return enc_output, 69 | 70 | 71 | class EncoderLayer(nn.Module): 72 | """Compose with two sub-layers. 73 | 1. A multi-head self-attention mechanism 74 | 2. A simple, position-wise fully connected feed-forward network. 
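    (Added note) Each sub-layer applies dropout, a residual connection and layer
    normalization internally; after both the attention and the feed-forward step the
    outputs at padded frames are zeroed with `non_pad_mask`.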
75 | """ 76 | 77 | def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1): 78 | super(EncoderLayer, self).__init__() 79 | self.slf_attn = MultiHeadAttention( 80 | n_head, d_model, d_k, d_v, dropout=dropout) 81 | self.pos_ffn = PositionwiseFeedForward( 82 | d_model, d_inner, dropout=dropout) 83 | 84 | def forward(self, enc_input, non_pad_mask=None, slf_attn_mask=None): 85 | enc_output, enc_slf_attn = self.slf_attn( 86 | enc_input, enc_input, enc_input, mask=slf_attn_mask) 87 | enc_output *= non_pad_mask 88 | 89 | enc_output = self.pos_ffn(enc_output) 90 | enc_output *= non_pad_mask 91 | 92 | return enc_output, enc_slf_attn 93 | -------------------------------------------------------------------------------- /src/ctcModel/loss.py: -------------------------------------------------------------------------------- 1 | import torch.nn.functional as F 2 | 3 | 4 | def cal_loss(logits, len_logits, gold, smoothing=0.0): 5 | """Calculate cross entropy loss, apply label smoothing if needed. 6 | """ 7 | n_class = logits.size(-1) 8 | target_lengths = gold.ne(0).sum(dim=1).int() 9 | ctc_log_probs = F.log_softmax(logits, dim=-1).transpose(0,1) 10 | ctc_loss = F.ctc_loss(ctc_log_probs, gold, 11 | len_logits, target_lengths, blank=n_class-1) 12 | 13 | return ctc_loss 14 | -------------------------------------------------------------------------------- /src/ctcModel/module.py: -------------------------------------------------------------------------------- 1 | import math 2 | 3 | import torch 4 | import torch.nn as nn 5 | import torch.nn.functional as F 6 | 7 | 8 | class PositionalEncoding(nn.Module): 9 | """Implement the positional encoding (PE) function. 10 | 11 | PE(pos, 2i) = sin(pos/(10000^(2i/dmodel))) 12 | PE(pos, 2i+1) = cos(pos/(10000^(2i/dmodel))) 13 | """ 14 | 15 | def __init__(self, d_model, max_len=5000): 16 | super(PositionalEncoding, self).__init__() 17 | # Compute the positional encodings once in log space. 18 | pe = torch.zeros(max_len, d_model, requires_grad=False) 19 | position = torch.arange(0, max_len).unsqueeze(1).float() 20 | div_term = torch.exp(torch.arange(0, d_model, 2).float() * 21 | -(math.log(10000.0) / d_model)) 22 | pe[:, 0::2] = torch.sin(position * div_term) 23 | pe[:, 1::2] = torch.cos(position * div_term) 24 | pe = pe.unsqueeze(0) 25 | self.register_buffer('pe', pe) 26 | 27 | def forward(self, input): 28 | """ 29 | Args: 30 | input: N x T x D 31 | """ 32 | length = input.size(1) 33 | return self.pe[:, :length] 34 | 35 | 36 | class PositionwiseFeedForward(nn.Module): 37 | """Implements position-wise feedforward sublayer. 
38 | 39 | FFN(x) = max(0, xW1 + b1)W2 + b2 40 | """ 41 | 42 | def __init__(self, d_model, d_ff, dropout=0.1): 43 | super(PositionwiseFeedForward, self).__init__() 44 | self.w_1 = nn.Linear(d_model, d_ff) 45 | self.w_2 = nn.Linear(d_ff, d_model) 46 | self.dropout = nn.Dropout(dropout) 47 | self.layer_norm = nn.LayerNorm(d_model) 48 | 49 | def forward(self, x): 50 | residual = x 51 | output = self.w_2(F.relu(self.w_1(x))) 52 | output = self.dropout(output) 53 | output = self.layer_norm(output + residual) 54 | return output 55 | 56 | 57 | # Another implementation 58 | class PositionwiseFeedForwardUseConv(nn.Module): 59 | """A two-feed-forward-layer module""" 60 | 61 | def __init__(self, d_in, d_hid, dropout=0.1): 62 | super(PositionwiseFeedForwardUseConv, self).__init__() 63 | self.w_1 = nn.Conv1d(d_in, d_hid, 1) # position-wise 64 | self.w_2 = nn.Conv1d(d_hid, d_in, 1) # position-wise 65 | self.layer_norm = nn.LayerNorm(d_in) 66 | self.dropout = nn.Dropout(dropout) 67 | 68 | def forward(self, x): 69 | residual = x 70 | output = x.transpose(1, 2) 71 | output = self.w_2(F.relu(self.w_1(output))) 72 | output = output.transpose(1, 2) 73 | output = self.dropout(output) 74 | output = self.layer_norm(output + residual) 75 | return output 76 | -------------------------------------------------------------------------------- /src/ctcModel/optimizer.py: -------------------------------------------------------------------------------- 1 | """A wrapper class for optimizer""" 2 | import torch 3 | 4 | 5 | class CTCModelOptimizer(object): 6 | """A simple wrapper class for learning rate scheduling""" 7 | 8 | def __init__(self, optimizer, warmup_steps=4000): 9 | self.optimizer = optimizer 10 | self.init_lr = 0.00001 11 | self.warmup_steps = warmup_steps 12 | self.step_num = 0 13 | self.visdom_lr = None 14 | 15 | def zero_grad(self): 16 | self.optimizer.zero_grad() 17 | 18 | def step(self): 19 | self._update_lr() 20 | self._visdom() 21 | self.optimizer.step() 22 | 23 | def _update_lr(self): 24 | self.step_num += 1 25 | lr = self.init_lr * 1000 * min(self.step_num ** (-0.5), 26 | self.step_num * (self.warmup_steps ** (-1.5))) 27 | for param_group in self.optimizer.param_groups: 28 | param_group['lr'] = lr 29 | 30 | def load_state_dict(self, state_dict): 31 | self.optimizer.load_state_dict(state_dict) 32 | 33 | def state_dict(self): 34 | return self.optimizer.state_dict() 35 | 36 | def set_visdom(self, visdom_lr, vis): 37 | self.visdom_lr = visdom_lr # Turn on/off visdom of learning rate 38 | self.vis = vis # visdom enviroment 39 | self.vis_opts = dict(title='Learning Rate', 40 | ylabel='Leanring Rate', xlabel='step') 41 | self.vis_window = None 42 | self.x_axis = torch.LongTensor() 43 | self.y_axis = torch.FloatTensor() 44 | 45 | def _visdom(self): 46 | if self.visdom_lr is not None: 47 | self.x_axis = torch.cat( 48 | [self.x_axis, torch.LongTensor([self.step_num])]) 49 | self.y_axis = torch.cat( 50 | [self.y_axis, torch.FloatTensor([self.optimizer.param_groups[0]['lr']])]) 51 | if self.vis_window is None: 52 | self.vis_window = self.vis.line(X=self.x_axis, Y=self.y_axis, 53 | opts=self.vis_opts) 54 | else: 55 | self.vis.line(X=self.x_axis, Y=self.y_axis, win=self.vis_window, 56 | update='replace') 57 | -------------------------------------------------------------------------------- /src/ctcModel/recognize.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import argparse 3 | import json 4 | import time 5 | import torch 6 | import kaldi_io 7 | 8 | 
from ctcModel.ctc_model import CTC_Model 9 | from utils.utils import load_vocab, ids2str 10 | from data.data import build_LFR_features 11 | 12 | parser = argparse.ArgumentParser( 13 | "End-to-End Automatic Speech Recognition Decoding.") 14 | # data 15 | parser.add_argument('--recog-json', type=str, required=True, 16 | help='Filename of recognition data (json)') 17 | parser.add_argument('--dict', type=str, required=True, 18 | help='Dictionary which should include ') 19 | parser.add_argument('--output', type=str, required=True, 20 | help='Filename of result label data (json)') 21 | # model 22 | parser.add_argument('--model-path', type=str, required=True, 23 | help='Path to model file created by training') 24 | # decode 25 | parser.add_argument('--beam-size', default=1, type=int, 26 | help='Beam size') 27 | parser.add_argument('--nbest', default=1, type=int, 28 | help='Nbest size') 29 | parser.add_argument('--print-freq', default=1, type=int, 30 | help='print_freq') 31 | parser.add_argument('--decode-max-len', default=0, type=int, 32 | help='Max output length. If ==0 (default), it uses a ' 33 | 'end-detect function to automatically find maximum ' 34 | 'hypothesis lengths') 35 | 36 | 37 | def recognize(args): 38 | model, LFR_m, LFR_n = CTC_Model.load_model(args.model_path) 39 | print(model) 40 | model.eval() 41 | model.cuda() 42 | token2idx, idx2token = load_vocab(args.dict) 43 | blank_index = token2idx[''] 44 | 45 | if args.beam_size == 1: 46 | from ctcModel.ctc_infer import GreedyDecoder 47 | 48 | decode = GreedyDecoder(space_idx=0, blank_index=blank_index) 49 | else: 50 | from ctcModel.ctc_infer import BeamDecoder 51 | 52 | decode = BeamDecoder(beam_width=args.beam_size, blank_index=blank_index, space_idx=0) 53 | 54 | # read json data 55 | with open(args.recog_json, 'rb') as f: 56 | js = json.load(f)['utts'] 57 | 58 | # decode each utterance 59 | with torch.no_grad(), open(args.output, 'w') as f: 60 | for idx, name in enumerate(js.keys(), 1): 61 | print('(%d/%d) decoding %s' % 62 | (idx, len(js.keys()), name), flush=True) 63 | input = kaldi_io.read_mat(js[name]['input'][0]['feat']) # TxD 64 | input = build_LFR_features(input, LFR_m, LFR_n) 65 | input = torch.from_numpy(input).float() 66 | input_length = torch.tensor([input.size(0)], dtype=torch.int) 67 | input = input.cuda() 68 | input_length = input_length.cuda() 69 | hyps_ints = model.recognize(input, input_length, decode, args) 70 | hyp = ids2str(hyps_ints, idx2token)[0] 71 | f.write(name + ' ' + hyp + '\n') 72 | 73 | 74 | if __name__ == "__main__": 75 | args = parser.parse_args() 76 | print(args, flush=True) 77 | recognize(args) 78 | -------------------------------------------------------------------------------- /src/ctcModel/solver.py: -------------------------------------------------------------------------------- 1 | import time 2 | import torch 3 | 4 | from ctcModel.loss import cal_loss 5 | from utils.solver import Solver 6 | 7 | 8 | class CTC_Solver(Solver): 9 | """ 10 | """ 11 | def _run_one_epoch(self, epoch, cross_valid=False): 12 | start = time.time() 13 | total_loss = 0 14 | 15 | data_loader = self.tr_loader if not cross_valid else self.cv_loader 16 | 17 | # visualizing loss using visdom 18 | if self.visdom_epoch and not cross_valid: 19 | vis_opts_epoch = dict(title=self.visdom_id + " epoch " + str(epoch), 20 | ylabel='Loss', xlabel='Epoch') 21 | vis_window_epoch = None 22 | vis_iters = torch.arange(1, len(data_loader) + 1) 23 | vis_iters_loss = torch.Tensor(len(data_loader)) 24 | 25 | for i, data in enumerate(data_loader): 
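            # (added note) each `data` item is one pre-built minibatch from AudioDataLoader:
            #   padded_input (N x T x D), input_lengths (N), padded_target (N x To).
            # The model returns per-frame logits plus their lengths, and cal_loss applies
            # log_softmax + F.ctc_loss with the blank mapped to the last vocabulary index.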
26 | padded_input, input_lengths, padded_target = data 27 | padded_input = padded_input.cuda() 28 | input_lengths = input_lengths.cuda() 29 | padded_target = padded_target.cuda() 30 | logits, len_logits = self.model(padded_input, input_lengths) 31 | loss = cal_loss(logits, len_logits, padded_target, smoothing=self.label_smoothing) 32 | if not cross_valid: 33 | self.optimizer.zero_grad() 34 | loss.backward() 35 | self.optimizer.step() 36 | 37 | total_loss += loss.item() 38 | 39 | if i % self.print_freq == 0: 40 | print('Epoch {} | loss {:.3f} | lr {:.3e} | {:.1f} ms/batch | step {}'. 41 | format(epoch + 1, loss.item(), self.optimizer.optimizer.param_groups[0]["lr"], 42 | 1000 * (time.time() - start) / (i + 1), self.optimizer.step_num), 43 | flush=True) 44 | 45 | # visualizing loss using visdom 46 | if self.visdom_epoch and not cross_valid: 47 | vis_iters_loss[i] = loss.item() 48 | if i % self.print_freq == 0: 49 | x_axis = vis_iters[:i+1] 50 | y_axis = vis_iters_loss[:i+1] 51 | if vis_window_epoch is None: 52 | vis_window_epoch = self.vis.line(X=x_axis, Y=y_axis, 53 | opts=vis_opts_epoch) 54 | else: 55 | self.vis.line(X=x_axis, Y=y_axis, win=vis_window_epoch, 56 | update='replace') 57 | 58 | return total_loss / (i + 1) 59 | -------------------------------------------------------------------------------- /src/ctcModel/train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import argparse 3 | import torch 4 | 5 | from utils.data import AudioDataLoader, AudioDataset 6 | from utils.utils import load_vocab 7 | from ctcModel.decoder import Decoder 8 | from ctcModel.encoder import Encoder 9 | from ctcModel.ctc_model import CTC_Model 10 | from ctcModel.optimizer import CTCModelOptimizer 11 | from ctcModel.solver import CTC_Solver 12 | 13 | parser = argparse.ArgumentParser( 14 | "End-to-End Automatic Speech Recognition Training " 15 | "(Transformer framework).") 16 | # General config 17 | # Task related 18 | parser.add_argument('--train-json', type=str, default=None, 19 | help='Filename of train label data (json)') 20 | parser.add_argument('--valid-json', type=str, default=None, 21 | help='Filename of validation label data (json)') 22 | parser.add_argument('--vocab', type=str, required=True, 23 | help='Dictionary which should include ') 24 | # Low Frame Rate (stacking and skipping frames) 25 | parser.add_argument('--LFR_m', default=4, type=int, 26 | help='Low Frame Rate: number of frames to stack') 27 | parser.add_argument('--LFR_n', default=3, type=int, 28 | help='Low Frame Rate: number of frames to skip') 29 | # Network architecture 30 | # encoder 31 | # TODO: automatically infer input dim 32 | parser.add_argument('--d_input', default=80, type=int, 33 | help='Dim of encoder input (before LFR)') 34 | parser.add_argument('--n_layers_enc', default=6, type=int, 35 | help='Number of encoder stacks') 36 | parser.add_argument('--n_head', default=8, type=int, 37 | help='Number of Multi Head Attention (MHA)') 38 | parser.add_argument('--d_k', default=64, type=int, 39 | help='Dimension of key') 40 | parser.add_argument('--d_v', default=64, type=int, 41 | help='Dimension of value') 42 | parser.add_argument('--d_model', default=512, type=int, 43 | help='Dimension of model') 44 | parser.add_argument('--d_inner', default=2048, type=int, 45 | help='Dimension of inner') 46 | parser.add_argument('--dropout', default=0.1, type=float, 47 | help='Dropout rate') 48 | parser.add_argument('--pe_maxlen', default=5000, type=int, 49 | help='Positional 
Encoding max len') 50 | # decoder 51 | parser.add_argument('--d_word_vec', default=512, type=int, 52 | help='Dim of decoder embedding') 53 | parser.add_argument('--n_layers_dec', default=6, type=int, 54 | help='Number of decoder stacks') 55 | parser.add_argument('--tgt_emb_prj_weight_sharing', default=1, type=int, 56 | help='share decoder embedding with decoder projection') 57 | # Loss 58 | parser.add_argument('--label_smoothing', default=0.1, type=float, 59 | help='label smoothing') 60 | 61 | # Training config 62 | parser.add_argument('--epochs', default=30, type=int, 63 | help='Number of maximum epochs') 64 | # minibatch 65 | parser.add_argument('--shuffle', default=0, type=int, 66 | help='reshuffle the data at every epoch') 67 | parser.add_argument('--batch-size', default=32, type=int, 68 | help='Batch size') 69 | parser.add_argument('--batch_frames', default=0, type=int, 70 | help='Batch frames. If this is not 0, batch size will make no sense') 71 | parser.add_argument('--maxlen-in', default=800, type=int, metavar='ML', 72 | help='Batch size is reduced if the input sequence length > ML') 73 | parser.add_argument('--maxlen-out', default=150, type=int, metavar='ML', 74 | help='Batch size is reduced if the output sequence length > ML') 75 | parser.add_argument('--num-workers', default=4, type=int, 76 | help='Number of workers to generate minibatch') 77 | # optimizer 78 | parser.add_argument('--k', default=1.0, type=float, 79 | help='tunable scalar multiply to learning rate') 80 | parser.add_argument('--warmup_steps', default=4000, type=int, 81 | help='warmup steps') 82 | # save and load model 83 | parser.add_argument('--save-folder', default='exp/temp', 84 | help='Location to save epoch models') 85 | parser.add_argument('--checkpoint', dest='checkpoint', default=0, type=int, 86 | help='Enables checkpoint saving of model') 87 | parser.add_argument('--_continue', default='', type=int, 88 | help='Continue from checkpoint model') 89 | parser.add_argument('--model-path', default='final.model', 90 | help='Location to save best validation model') 91 | # logging 92 | parser.add_argument('--print-freq', default=100, type=int, 93 | help='Frequency of printing training infomation') 94 | parser.add_argument('--visdom', dest='visdom', type=int, default=0, 95 | help='Turn on visdom graphing') 96 | parser.add_argument('--visdom_lr', dest='visdom_lr', type=int, default=0, 97 | help='Turn on visdom graphing learning rate') 98 | parser.add_argument('--visdom_epoch', dest='visdom_epoch', type=int, default=0, 99 | help='Turn on visdom graphing each epoch') 100 | parser.add_argument('--visdom-id', default='Transformer training', 101 | help='Identifier for visdom run') 102 | 103 | 104 | def main(args): 105 | # Construct Solver 106 | # data 107 | token2idx, idx2token = load_vocab(args.vocab) 108 | vocab_size = len(token2idx) 109 | 110 | tr_dataset = AudioDataset(args.train_json, args.batch_size, 111 | args.maxlen_in, args.maxlen_out, 112 | batch_frames=args.batch_frames) 113 | cv_dataset = AudioDataset(args.valid_json, args.batch_size, 114 | args.maxlen_in, args.maxlen_out, 115 | batch_frames=args.batch_frames) 116 | tr_loader = AudioDataLoader(tr_dataset, batch_size=1, 117 | token2idx=token2idx, 118 | num_workers=args.num_workers, 119 | shuffle=args.shuffle, 120 | LFR_m=args.LFR_m, LFR_n=args.LFR_n) 121 | cv_loader = AudioDataLoader(cv_dataset, batch_size=1, 122 | token2idx=token2idx, 123 | num_workers=args.num_workers, 124 | LFR_m=args.LFR_m, LFR_n=args.LFR_n) 125 | # load dictionary and generate char_list, 
sos_id, eos_id 126 | data = {'tr_loader': tr_loader, 'cv_loader': cv_loader} 127 | # model 128 | encoder = Encoder(args.d_input * args.LFR_m, args.n_layers_enc, args.n_head, 129 | args.d_k, args.d_v, args.d_model, args.d_inner, 130 | dropout=args.dropout, pe_maxlen=args.pe_maxlen) 131 | decoder = Decoder(vocab_size, args.d_model) 132 | model = CTC_Model(encoder, decoder) 133 | print(model) 134 | model.cuda() 135 | # optimizer 136 | optimizier = CTCModelOptimizer( 137 | torch.optim.Adam(model.parameters(), betas=(0.9, 0.98), eps=1e-09), 138 | args.warmup_steps) 139 | 140 | # solver 141 | solver = CTC_Solver(data, model, optimizier, args) 142 | solver.train() 143 | 144 | 145 | if __name__ == '__main__': 146 | args = parser.parse_args() 147 | print(args) 148 | main(args) 149 | -------------------------------------------------------------------------------- /src/mask_lm/Mask_LM.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | 5 | class Mask_LM(nn.Module): 6 | """An encoder-decoder framework only includes attention. 7 | """ 8 | 9 | def __init__(self, encoder, decoder): 10 | super().__init__() 11 | self.encoder = encoder 12 | self.decoder = decoder 13 | self.fc = nn.Linear(encoder.d_output, encoder.d_input, bias=False) 14 | 15 | for p in self.parameters(): 16 | if p.dim() > 1: 17 | nn.init.xavier_uniform_(p) 18 | 19 | def token_mask(self, padded_input, p=0.05, M=10): 20 | """ 21 | Args: 22 | padded_input: N x Ti x D 23 | """ 24 | B, T = padded_input.size(0), padded_input.size(1) 25 | mask_index = torch.rand((B, T)).cuda() > p # 1110111... 26 | _mask_index = mask_index.clone() 27 | 28 | for i in range(M): 29 | _mask_index = torch.cat([_mask_index[:, 1:], _mask_index[:, 0].unsqueeze(1)], 1) 30 | mask_index *= _mask_index 31 | 32 | # mask_index: 1110000111 33 | # ~mask_index: 0001111000 34 | masked_input = torch.where( 35 | mask_index, 36 | padded_input, 37 | torch.zeros_like(padded_input)) 38 | 39 | return masked_input, ~mask_index 40 | 41 | def forward(self, padded_input, input_lengths, padded_target=None, mask_input=True): 42 | """ 43 | Args: 44 | padded_input: N x Ti x D 45 | input_lengths: N 46 | padded_targets: N x To 47 | """ 48 | if mask_input: 49 | masked_input, masked_index = self.token_mask(padded_input) 50 | else: 51 | masked_input = padded_input 52 | masked_index = None 53 | encoder_padded_outputs = self.encoder(masked_input, input_lengths) 54 | # pred is score before softmax 55 | 56 | logits_AE = self.fc(encoder_padded_outputs) 57 | 58 | if padded_target is not None: 59 | logits = self.decoder(encoder_padded_outputs, input_lengths) 60 | else: 61 | logits = None 62 | 63 | return logits_AE, logits, masked_index 64 | 65 | def recognize(self, input, input_length, char_list, args): 66 | """Sequence-to-Sequence beam search, decode one utterence now. 
67 | Args: 68 | input: T x D 69 | char_list: list of characters 70 | args: args.beam 71 | Returns: 72 | nbest_hyps: 73 | """ 74 | encoder_outputs = self.encoder(input.unsqueeze(0), input_length) 75 | nbest_hyps = self.decoder.recognize_beam(encoder_outputs[0], 76 | char_list, 77 | args) 78 | return nbest_hyps 79 | 80 | @classmethod 81 | def create_model(cls, args): 82 | from mask_lm.decoder import Decoder 83 | from mask_lm.encoder import Encoder 84 | 85 | encoder = Encoder(args.n_src, args.n_layers_enc, args.n_head, 86 | args.d_model, args.d_inner, dropout=args.dropout) 87 | decoder = Decoder(args.n_tgt, args.d_model) 88 | 89 | model = cls(encoder, decoder) 90 | 91 | return model 92 | 93 | @classmethod 94 | def load_model(cls, path, args): 95 | model = cls.create_model(args) 96 | 97 | package = torch.load(path, map_location=lambda storage, loc: storage) 98 | model.load_state_dict(package['state_dict']) 99 | 100 | return model 101 | 102 | @staticmethod 103 | def serialize(model, optimizer, epoch, tr_loss=None, cv_loss=None): 104 | package = { 105 | # state 106 | 'state_dict': model.state_dict(), 107 | 'optim_dict': optimizer.state_dict(), 108 | 'epoch': epoch 109 | } 110 | if tr_loss is not None: 111 | package['tr_loss'] = tr_loss 112 | package['cv_loss'] = cv_loss 113 | 114 | return package 115 | -------------------------------------------------------------------------------- /src/mask_lm/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/eastonYi/end-to-end_asr_pytorch/7bedfb5ad10e363f5823b980ae0eb515b395839b/src/mask_lm/__init__.py -------------------------------------------------------------------------------- /src/mask_lm/conv_encoder.py: -------------------------------------------------------------------------------- 1 | from collections import OrderedDict 2 | import math 3 | import torch.nn as nn 4 | import torch 5 | import torch.nn.functional as F 6 | from torch.nn.modules.normalization import LayerNorm 7 | 8 | 9 | class Conv1d(nn.Module): 10 | # the same as stack frames 11 | def __init__(self, d_input, d_hidden, n_layers, w_context, pad='same', name=''): 12 | super().__init__() 13 | assert n_layers >= 1 14 | self.n_layers = n_layers 15 | self.d_input = d_input 16 | self.d_hidden = d_hidden 17 | self.w_context = w_context 18 | self.pad = pad 19 | self.name = name 20 | 21 | layers = [("{}/conv1d_0".format(name), nn.Conv1d(d_input, d_hidden, w_context, 1)), 22 | # ("{}/batchnorm_0".format(name), nn.BatchNorm1d(d_hidden)), 23 | ("{}/relu_0".format(name), nn.ReLU())] 24 | for i in range(n_layers-1): 25 | layers += [ 26 | ("{}/conv1d_{}".format(name, i+1), nn.Conv1d(d_hidden, d_hidden, w_context, 1)), 27 | # ("{}/batchnorm_{}".format(name, i+1), nn.BatchNorm1d(d_hidden)), 28 | ("{}/relu_{}".format(name, i+1), nn.ReLU()) 29 | ] 30 | layers = OrderedDict(layers) 31 | self.conv = nn.Sequential(layers) 32 | 33 | def forward(self, feats, feat_lengths): 34 | if self.pad == 'same': 35 | input_length = feats.size(1) 36 | feats = F.pad(feats, (0, 0, 0, self.n_layers * self.w_context)) 37 | outputs = self.conv(feats.permute(0, 2, 1)) 38 | outputs = outputs.permute(0, 2, 1) 39 | 40 | if self.pad == 'same': 41 | tensor_length = input_length 42 | assert tensor_length <= outputs.size(1) 43 | outputs = outputs[:, :tensor_length, :] 44 | output_lengths = feat_lengths 45 | else: 46 | output_lengths = ((feat_lengths + sum(self.padding) - 47 | 1*(self.w_context-1)-1)/self.subsample + 1).long() 48 | 49 | return outputs, 
output_lengths 50 | 51 | 52 | class Conv1dSubsample(nn.Module): 53 | # the same as stack frames 54 | def __init__(self, d_input, d_model, w_context, subsample, pad='same'): 55 | super().__init__() 56 | self.conv = nn.Conv1d(d_input, d_model, w_context, stride=subsample) 57 | self.conv_norm = LayerNorm(d_model) 58 | self.subsample = subsample 59 | self.w_context = w_context 60 | self.pad = pad 61 | 62 | def forward(self, feats, feat_lengths): 63 | if self.pad == 'same': 64 | input_length = feats.size(1) 65 | feats = F.pad(feats, (0, 0, 0, self.w_context)) 66 | outputs = self.conv(feats.permute(0, 2, 1)) 67 | outputs = outputs.permute(0, 2, 1) 68 | outputs = self.conv_norm(outputs) 69 | 70 | if self.pad == 'same': 71 | tensor_length = int(math.ceil(input_length/self.subsample)) 72 | assert tensor_length <= outputs.size(1) 73 | outputs = outputs[:, :tensor_length, :] 74 | output_lengths = torch.ceil(feat_lengths*1.0/self.subsample).int() 75 | else: 76 | output_lengths = ((feat_lengths + sum(self.padding) - 1*(self.w_context-1)-1)/self.subsample + 1).long() 77 | 78 | return outputs, output_lengths 79 | 80 | 81 | class Conv2dSubsample(nn.Module): 82 | def __init__(self, d_input, d_model, n_layers=2, pad='same'): 83 | super().__init__() 84 | assert n_layers >= 1 85 | self.n_layers = n_layers 86 | self.d_input = d_input 87 | self.pad = pad 88 | 89 | layers = [("subsample/conv0", nn.Conv2d(1, 32, 3, (2, 2))), 90 | ("subsample/relu0", nn.ReLU())] 91 | for i in range(n_layers-1): 92 | layers += [ 93 | ("subsample/conv{}".format(i+1), nn.Conv2d(32, 32, 3, (2, 1))), 94 | ("subsample/relu{}".format(i+1), nn.ReLU()) 95 | ] 96 | layers = OrderedDict(layers) 97 | self.conv = nn.Sequential(layers) 98 | self.d_conv_out = int(math.ceil(d_input / 2)) 99 | self.affine = nn.Linear(32 * self.d_conv_out, d_model) 100 | 101 | def forward(self, feats, feat_lengths): 102 | if self.pad == 'same': 103 | input_length = feats.size(1) 104 | feats = F.pad(feats, (0, 10, 0, 20)) 105 | outputs = feats.unsqueeze(1) # [B, C, T, D] 106 | outputs = self.conv(outputs)[:, :, :, :self.d_conv_out] 107 | B, C, T, D = outputs.size() 108 | outputs = outputs.permute(0, 2, 1, 3).contiguous().view(B, T, C*D) 109 | 110 | if self.pad == 'same': 111 | output_lengths = feat_lengths 112 | tensor_length = input_length 113 | for _ in range(self.n_layers): 114 | output_lengths = torch.ceil(output_lengths / 2.0).int() 115 | tensor_length = int(math.ceil(tensor_length / 2.0)) 116 | 117 | assert tensor_length <= outputs.size(1) 118 | outputs = outputs[:, :tensor_length, :] 119 | else: 120 | output_lengths = feat_lengths 121 | for _ in range(self.n_layers): 122 | output_lengths = ((output_lengths-1) / 2).long() 123 | 124 | outputs = self.affine(outputs) 125 | 126 | return outputs, output_lengths 127 | -------------------------------------------------------------------------------- /src/mask_lm/data.py: -------------------------------------------------------------------------------- 1 | """ 2 | Logic: 3 | 1. AudioDataLoader generate a minibatch from AudioDataset, the size of this 4 | minibatch is AudioDataLoader's batchsize. For now, we always set 5 | AudioDataLoader's batchsize as 1. The real minibatch size we care about is 6 | set in AudioDataset's __init__(...). So actually, we generate the 7 | information of one minibatch in AudioDataset. 8 | 2. After AudioDataLoader getting one minibatch from AudioDataset, 9 | AudioDataLoader calls its collate_fn(batch) to process this minibatch. 
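    In short (added note): the Dataset's __getitem__ already returns a whole minibatch
    (a list of samples), so the DataLoader keeps batch_size=1 and its collate_fn
    (f_x_pad / f_xy_pad below) only has to pad that list into tensors.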
10 | """ 11 | import torch 12 | import torch.utils.data as data 13 | 14 | from utils.utils import pad_list 15 | 16 | 17 | class VQ_Dataset(data.Dataset): 18 | """ 19 | TODO: this is a little HACK now, put batch_size here now. 20 | remove batch_size to dataloader later. 21 | """ 22 | 23 | def __init__(self, data_path, token2idx, max_length_in, max_length_out, 24 | down_sample_rate=1, batch_frames=0): 25 | # From: espnet/src/asr/asr_utils.py: make_batchset() 26 | """ 27 | Args: 28 | data: espnet/espnet json format file. 29 | num_batches: for debug. only use num_batches minibatch but not all. 30 | """ 31 | super().__init__() 32 | all_batches = [] 33 | one_batch = [] 34 | num_frames = 0 35 | for sample in self.sample_iter(data_path, token2idx, max_length_in): 36 | one_batch.append(sample) 37 | num_frames += len(sample) 38 | if num_frames > batch_frames: 39 | all_batches.append(one_batch[:-2]) 40 | one_batch = [one_batch[-1]] 41 | num_frames = len(one_batch[0]) 42 | if one_batch: 43 | all_batches.append(one_batch) 44 | self.all_batches = all_batches 45 | 46 | def __getitem__(self, index): 47 | return self.all_batches[index] 48 | 49 | def __len__(self): 50 | return len(self.all_batches) 51 | 52 | def sample_iter(self, data_path, token2idx, max_length): 53 | with open(data_path) as f: 54 | for line in f: 55 | uttid, tokens = line.strip().split(maxsplit=1) 56 | tokens = tokens.split() 57 | for i in range(0, len(tokens), max_length): 58 | yield [token2idx[token] for token in tokens[i: i+max_length]] 59 | 60 | def f_x_pad(batch): 61 | return pad_list([torch.tensor(sample).long() for sample in batch[0]], 0) 62 | 63 | 64 | class VQ_Pred_Dataset(data.Dataset): 65 | """ 66 | TODO: this is a little HACK now, put batch_size here now. 67 | remove batch_size to dataloader later. 68 | """ 69 | 70 | def __init__(self, f_vq, f_trans, tokenIn2idx, tokenOut2idx, batch_size, 71 | max_length_in, max_length_out, down_sample_rate=1): 72 | # From: espnet/src/asr/asr_utils.py: make_batchset() 73 | """ 74 | Args: 75 | data: espnet/espnet json format file. 76 | num_batches: for debug. only use num_batches minibatch but not all. 
77 | """ 78 | super().__init__() 79 | self.all_batches = [] 80 | one_batch = [] 81 | 82 | with open(f_vq) as f1, open(f_trans) as f2: 83 | for vq, trans in zip(f1, f2): 84 | uttid, vq = vq.strip().split(maxsplit=1) 85 | _uttid, trans = trans.strip().split(maxsplit=1) 86 | assert uttid == _uttid 87 | x = [tokenIn2idx[token] for token in vq.split()] 88 | y = [tokenOut2idx[token] for token in trans.split()] 89 | one_batch.append([x, y]) 90 | 91 | if len(one_batch) >= batch_size: 92 | self.all_batches.append(one_batch) 93 | one_batch = [] 94 | 95 | def __getitem__(self, index): 96 | return self.all_batches[index] 97 | 98 | def __len__(self): 99 | return len(self.all_batches) 100 | 101 | 102 | def f_xy_pad(batch): 103 | xs_pad = pad_list([torch.tensor(sample[0]).long() for sample in batch[0]], 0) 104 | ys_pad = pad_list([torch.tensor(sample[1]).long() for sample in batch[0]], 0) 105 | # xs_pad = pad_to_batch([sample for sample in batch[0][0]], 0) 106 | # ys_pad = pad_to_batch([sample for sample in batch[0][1]], 0) 107 | 108 | return xs_pad, ys_pad 109 | -------------------------------------------------------------------------------- /src/mask_lm/decoder.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | from utils.utils import sequence_mask 5 | 6 | 7 | class Decoder(nn.Module): 8 | ''' A decoder model with self attention mechanism. ''' 9 | 10 | def __init__(self, n_tgt_vocab, d_input): 11 | super().__init__() 12 | # parameters 13 | self.n_tgt_vocab = n_tgt_vocab 14 | self.dim_output = n_tgt_vocab 15 | 16 | self.tgt_word_prj = nn.Linear(d_input, n_tgt_vocab, bias=False) 17 | nn.init.xavier_normal_(self.tgt_word_prj.weight) 18 | 19 | def forward(self, encoder_padded_outputs, encoder_input_lengths): 20 | """ 21 | Args: 22 | padded_input: N x To 23 | encoder_padded_outputs: N x Ti x H 24 | 25 | Returns: 26 | """ 27 | # Get Deocder Input and Output 28 | 29 | # Prepare masks 30 | mask = sequence_mask(encoder_input_lengths) # B x T 31 | 32 | # B x T x V 33 | mask = mask.view(mask.shape[0], mask.shape[1], 1).repeat(1, 1, self.dim_output) 34 | 35 | # before softmax 36 | logits = self.tgt_word_prj(encoder_padded_outputs) * mask 37 | 38 | return logits 39 | -------------------------------------------------------------------------------- /src/mask_lm/encoder.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | 3 | from transformer.module import PositionalEncoding 4 | from transformer.encoder import EncoderLayer 5 | from utils.utils import sequence_mask, get_attn_pad_mask 6 | 7 | 8 | class Encoder(nn.Module): 9 | """Encoder of Transformer including self-attention and feed forward. 
10 | """ 11 | 12 | def __init__(self, n_src, n_layers, n_head, d_model, d_inner, dropout=0.1): 13 | super().__init__() 14 | # parameters 15 | self.token_emb = nn.Embedding(n_src, d_model) 16 | self.n_layers = n_layers 17 | self.n_head = n_head 18 | self.d_model = d_model 19 | self.d_input = n_src 20 | self.d_output = d_model 21 | self.d_inner = d_inner 22 | self.dropout_rate = dropout 23 | 24 | # use linear transformation with layer norm to replace input embedding 25 | self.layer_norm_in = nn.LayerNorm(d_model) 26 | self.positional_encoding = PositionalEncoding(d_model) 27 | self.dropout = nn.Dropout(dropout) 28 | 29 | self.layer_stack = nn.ModuleList([ 30 | EncoderLayer(d_model, d_inner, n_head, dropout=dropout) 31 | for _ in range(n_layers)]) 32 | 33 | def forward(self, padded_input, input_lengths): 34 | """ 35 | Args: 36 | padded_input: N x T x D 37 | input_lengths: N 38 | 39 | Returns: 40 | enc_output: N x T x H 41 | """ 42 | # Prepare masks 43 | non_pad_mask = sequence_mask(input_lengths).unsqueeze(-1) 44 | length = padded_input.size(1) 45 | slf_attn_mask = get_attn_pad_mask(input_lengths, length) 46 | 47 | # Forward 48 | enc_output = self.dropout( 49 | self.layer_norm_in(self.token_emb(padded_input)) + 50 | self.positional_encoding(padded_input)) 51 | 52 | for enc_layer in self.layer_stack: 53 | enc_output = enc_layer( 54 | enc_output, 55 | non_pad_mask=non_pad_mask, 56 | slf_attn_mask=slf_attn_mask) 57 | 58 | return enc_output 59 | -------------------------------------------------------------------------------- /src/mask_lm/infer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import argparse 3 | import json 4 | import torch 5 | import kaldi_io 6 | import time 7 | 8 | from utils.utils import load_vocab, ids2str 9 | from utils.data import build_LFR_features 10 | 11 | 12 | parser = argparse.ArgumentParser( 13 | "End-to-End Automatic Speech Recognition Decoding.") 14 | # data 15 | parser.add_argument('--type', type=str, required=True, 16 | help='test or infer') 17 | parser.add_argument('--recog-json', type=str, required=True, 18 | help='Filename of recognition data (json)') 19 | parser.add_argument('--model-path', type=str, required=True, 20 | help='Filename of recognition data (json)') 21 | parser.add_argument('--vocab', type=str, required=True, 22 | help='Dictionary which should include ') 23 | parser.add_argument('--output', type=str, required=True, 24 | help='Filename of result label data (json)') 25 | 26 | # Low Frame Rate (stacking and skipping frames) 27 | parser.add_argument('--LFR_m', default=4, type=int, 28 | help='Low Frame Rate: number of frames to stack') 29 | parser.add_argument('--LFR_n', default=3, type=int, 30 | help='Low Frame Rate: number of frames to skip') 31 | 32 | # Network architecture 33 | parser.add_argument('--structure', type=str, default='transformer', 34 | help='transformer transformer-ctc conv-transformer-ctc') 35 | # conv_encoder 36 | parser.add_argument('--n_conv_layers', default=3, type=int, 37 | help='Dimension of key') 38 | # encoder 39 | parser.add_argument('--d_input', default=80, type=int, 40 | help='Dim of encoder input (before LFR)') 41 | parser.add_argument('--n_layers_enc', default=6, type=int, 42 | help='Number of encoder stacks') 43 | parser.add_argument('--n_head', default=8, type=int, 44 | help='Number of Multi Head Attention (MHA)') 45 | parser.add_argument('--d_k', default=64, type=int, 46 | help='Dimension of key') 47 | parser.add_argument('--d_v', default=64, type=int, 
48 | help='Dimension of value') 49 | parser.add_argument('--d_model', default=512, type=int, 50 | help='Dimension of model') 51 | parser.add_argument('--d_inner', default=2048, type=int, 52 | help='Dimension of inner') 53 | parser.add_argument('--dropout', default=0.1, type=float, 54 | help='Dropout rate') 55 | parser.add_argument('--pe_maxlen', default=5000, type=int, 56 | help='Positional Encoding max len') 57 | # assigner 58 | parser.add_argument('--w_context', default=3, type=int, 59 | help='Positional Encoding max len') 60 | parser.add_argument('--d_assigner_hidden', default=512, type=int, 61 | help='Positional Encoding max len') 62 | parser.add_argument('--n_assigner_layers', default=3, type=int, 63 | help='Positional Encoding max len') 64 | # decoder 65 | parser.add_argument('--d_word_vec', default=512, type=int, 66 | help='Dim of decoder embedding') 67 | parser.add_argument('--n_layers_dec', default=6, type=int, 68 | help='Number of decoder stacks') 69 | parser.add_argument('--tgt_emb_prj_weight_sharing', default=0, type=int, 70 | help='share decoder embedding with decoder projection') 71 | 72 | # decode 73 | parser.add_argument('--beam-size', default=1, type=int, 74 | help='Beam size') 75 | parser.add_argument('--nbest', default=1, type=int, 76 | help='Nbest size') 77 | parser.add_argument('--print-freq', default=1, type=int, 78 | help='print_freq') 79 | parser.add_argument('--decode-max-len', default=0, type=int, 80 | help='Max output length. If ==0 (default), it uses a ' 81 | 'end-detect function to automatically find maximum ' 82 | 'hypothesis lengths') 83 | 84 | 85 | def test(args): 86 | if args.structure == 'transformer': 87 | from transformer.Transformer import Transformer as Model 88 | elif args.structure == 'transformer-ctc': 89 | from transformer.Transformer import CTC_Transformer as Model 90 | elif args.structure == 'conv-transformer-ctc': 91 | from transformer.Transformer import Conv_CTC_Transformer as Model 92 | elif args.structure == 'cif': 93 | from transformer.CIF_Model import CIF_Model as Model 94 | 95 | token2idx, idx2token = load_vocab(args.vocab) 96 | args.sos_id = token2idx[''] 97 | args.eos_id = token2idx[''] 98 | args.vocab_size = len(token2idx) 99 | 100 | model = Model.load_model(args.model_path, args) 101 | print(model) 102 | model.eval() 103 | model.cuda() 104 | 105 | # read json data 106 | with open(args.recog_json, 'rb') as f: 107 | js = json.load(f)['utts'] 108 | 109 | cur_time = time.time() 110 | # decode each utterance 111 | with torch.no_grad(), open(args.output, 'w') as f: 112 | for idx, uttid in enumerate(js.keys(), 1): 113 | input = kaldi_io.read_mat(js[uttid]['input'][0]['feat']) # TxD 114 | input = build_LFR_features(input, args.LFR_m, args.LFR_n) 115 | input = torch.from_numpy(input).float() 116 | input_length = torch.tensor([input.size(0)], dtype=torch.int) 117 | input = input.cuda() 118 | input_length = input_length.cuda() 119 | # hyps_ints = model.recognize(input, input_length, idx2token, args, 120 | # target_num=len(js[uttid]['output']['tokenid'].split())) 121 | hyps_ints = model.recognize(input, input_length, idx2token, args) 122 | # hyps_ints = model.recognize_beam_cache(input, input_length, idx2token, args) 123 | hyp = ids2str(hyps_ints, idx2token)[0] 124 | f.write(uttid + ' ' + hyp + '\n') 125 | used_time = time.time() - cur_time 126 | print('({}/{}) use time {:.2f}s {}: {}'.format( 127 | idx, len(js.keys()), used_time, uttid, hyp), flush=True) 128 | cur_time = time.time() 129 | 130 | 131 | def infer(args): 132 | return 133 | 134 | 135 
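# Summary of the decode loop in test() above: each utterance's feature
# matrix is read with kaldi_io.read_mat, stacked/skipped into low-frame-rate
# features by build_LFR_features(LFR_m, LFR_n), decoded with
# model.recognize(), and the resulting token ids are mapped back to text by
# ids2str() and written to --output as "<uttid> <hypothesis>" lines.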
| if __name__ == "__main__": 136 | args = parser.parse_args() 137 | print(args, flush=True) 138 | if args.type == 'test': 139 | test(args) 140 | elif args.type == 'infer': 141 | infer(args) 142 | -------------------------------------------------------------------------------- /src/mask_lm/loss.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | 4 | 5 | def cal_ce_mask_loss(logits, targets, mask, smoothing=0.0): 6 | """Cross entropy loss over masked, non-padded positions, with optional label smoothing. 7 | """ 8 | 9 | logits = logits.view(-1, logits.size(2)) 10 | targets = targets.contiguous().view(-1) 11 | mask = mask.contiguous().view(-1) 12 | 13 | eps = smoothing 14 | n_class = logits.size(1) 15 | 16 | # Generate one-hot matrix: N x C. 17 | # Only label position is 1 and all other positions are 0 18 | # gold include -1 value (0) and this will lead to assert error 19 | one_hot = torch.zeros_like(logits).scatter(1, targets.long().view(-1, 1), 1) 20 | one_hot = one_hot * (1 - eps) + (1 - one_hot) * eps / n_class 21 | log_prb = F.log_softmax(logits, dim=1) 22 | 23 | non_pad_mask = targets.ne(0) 24 | concerned_mask = non_pad_mask * mask 25 | n_word = concerned_mask.long().sum() 26 | loss = -(one_hot * log_prb).sum(dim=1) 27 | 28 | # average only over the masked, non-padded positions counted in n_word 29 | ce_loss = loss.masked_select(concerned_mask.bool()).sum() / n_word 30 | 31 | return ce_loss 32 | 33 | 34 | def cal_ctc_loss(logits, len_logits, gold, smoothing=0.0): 35 | """Calculate CTC loss; the last vocabulary index is used as the blank symbol. 36 | """ 37 | n_class = logits.size(-1) 38 | target_lengths = gold.ne(0).sum(dim=1).int() 39 | ctc_log_probs = F.log_softmax(logits, dim=-1).transpose(0,1) 40 | ctc_loss = F.ctc_loss(ctc_log_probs, gold, 41 | len_logits, target_lengths, blank=n_class-1) 42 | 43 | return ctc_loss 44 | -------------------------------------------------------------------------------- /src/mask_lm/module.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | 6 | 7 | class PositionalEncoding(nn.Module): 8 | """Implement the positional encoding (PE) function. 9 | 10 | PE(pos, 2i) = sin(pos/(10000^(2i/dmodel))) 11 | PE(pos, 2i+1) = cos(pos/(10000^(2i/dmodel))) 12 | """ 13 | 14 | def __init__(self, d_model, max_len=5000): 15 | super().__init__() 16 | # Compute the positional encodings once in log space. 17 | pe = torch.zeros(max_len, d_model, requires_grad=False) 18 | position = torch.arange(0, max_len).unsqueeze(1).float() 19 | div_term = torch.exp(torch.arange(0, d_model, 2).float() * 20 | -(math.log(10000.0) / d_model)) 21 | pe[:, 0::2] = torch.sin(position * div_term) 22 | pe[:, 1::2] = torch.cos(position * div_term) 23 | pe = pe.unsqueeze(0) 24 | self.register_buffer('pe', pe) 25 | 26 | def forward(self, input): 27 | """ 28 | Args: 29 | input: N x T x D 30 | """ 31 | length = input.size(1) 32 | return self.pe[:, :length] 33 | 34 | 35 | class PositionwiseFeedForward(nn.Module): 36 | """Implements position-wise feedforward sublayer. 
37 | 38 | FFN(x) = max(0, xW1 + b1)W2 + b2 39 | """ 40 | 41 | def __init__(self, d_model, d_ff, dropout=0.1): 42 | super(PositionwiseFeedForward, self).__init__() 43 | self.w_1 = nn.Linear(d_model, d_ff) 44 | self.w_2 = nn.Linear(d_ff, d_model) 45 | self.dropout = nn.Dropout(dropout) 46 | self.layer_norm = nn.LayerNorm(d_model) 47 | 48 | def forward(self, x): 49 | residual = x 50 | output = self.w_2(F.relu(self.w_1(x))) 51 | output = self.dropout(output) 52 | output = self.layer_norm(output + residual) 53 | return output 54 | 55 | 56 | # Another implementation 57 | class PositionwiseFeedForwardUseConv(nn.Module): 58 | """A two-feed-forward-layer module""" 59 | 60 | def __init__(self, d_in, d_hid, dropout=0.1): 61 | super(PositionwiseFeedForwardUseConv, self).__init__() 62 | self.w_1 = nn.Conv1d(d_in, d_hid, 1) # position-wise 63 | self.w_2 = nn.Conv1d(d_hid, d_in, 1) # position-wise 64 | self.layer_norm = nn.LayerNorm(d_in) 65 | self.dropout = nn.Dropout(dropout) 66 | 67 | def forward(self, x): 68 | residual = x 69 | output = x.transpose(1, 2) 70 | output = self.w_2(F.relu(self.w_1(output))) 71 | output = output.transpose(1, 2) 72 | output = self.dropout(output) 73 | output = self.layer_norm(output + residual) 74 | return output 75 | -------------------------------------------------------------------------------- /src/transformer/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/eastonYi/end-to-end_asr_pytorch/7bedfb5ad10e363f5823b980ae0eb515b395839b/src/transformer/__init__.py -------------------------------------------------------------------------------- /src/transformer/attention.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | import torch.nn as nn 4 | 5 | 6 | class MultiheadAttention(nn.Module): 7 | ''' Multi-Head Attention module ''' 8 | 9 | def __init__(self, d_model, n_head, d_k=64, d_v=64, dropout=0.1): 10 | super().__init__() 11 | 12 | self.n_head = n_head 13 | self.d_k = d_k 14 | self.d_v = d_v 15 | 16 | self.w_qs = nn.Linear(d_model, n_head * d_k) 17 | self.w_ks = nn.Linear(d_model, n_head * d_k) 18 | self.w_vs = nn.Linear(d_model, n_head * d_v) 19 | nn.init.normal_(self.w_qs.weight, mean=0, std=np.sqrt(2.0 / (d_model + d_k))) 20 | nn.init.normal_(self.w_ks.weight, mean=0, std=np.sqrt(2.0 / (d_model + d_k))) 21 | nn.init.normal_(self.w_vs.weight, mean=0, std=np.sqrt(2.0 / (d_model + d_v))) 22 | 23 | self.attention = ScaledDotProductAttention(temperature=np.power(d_k, 0.5), 24 | attn_dropout=dropout) 25 | self.layer_norm = nn.LayerNorm(d_model) 26 | 27 | self.fc = nn.Linear(n_head * d_v, d_model) 28 | nn.init.xavier_normal_(self.fc.weight) 29 | 30 | self.dropout = nn.Dropout(dropout) 31 | 32 | 33 | def forward(self, q, k, v, mask=None): 34 | 35 | d_k, d_v, n_head = self.d_k, self.d_v, self.n_head 36 | 37 | sz_b, len_q, _ = q.size() 38 | sz_b, len_k, _ = k.size() 39 | sz_b, len_v, _ = v.size() 40 | 41 | residual = q 42 | 43 | q = self.w_qs(q).view(sz_b, len_q, n_head, d_k) 44 | k = self.w_ks(k).view(sz_b, len_k, n_head, d_k) 45 | v = self.w_vs(v).view(sz_b, len_v, n_head, d_v) 46 | 47 | q = q.permute(2, 0, 1, 3).contiguous().view(-1, len_q, d_k) # (n*b) x lq x dk 48 | k = k.permute(2, 0, 1, 3).contiguous().view(-1, len_k, d_k) # (n*b) x lk x dk 49 | v = v.permute(2, 0, 1, 3).contiguous().view(-1, len_v, d_v) # (n*b) x lv x dv 50 | 51 | if mask is not None: 52 | mask = mask.repeat(n_head, 1, 1) # (n*b) x .. x .. 
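        # Shape walk-through with illustrative values (d_model=512, n_head=8,
        # d_k=d_v=64, a batch of sz_b=16 sequences of length 100): w_qs/w_ks/w_vs
        # project to (16, 100, 512); the view/permute above folds the heads into
        # the batch axis, giving q, k, v of shape (128, 100, 64); the padding
        # mask is tiled from (16, 100, 100) to (128, 100, 100) so every head
        # shares the same mask in the scaled dot-product attention below.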
53 | 54 | output, attn = self.attention(q, k, v, mask=mask) 55 | 56 | output = output.view(n_head, sz_b, len_q, d_v) 57 | output = output.permute(1, 2, 0, 3).contiguous().view(sz_b, len_q, -1) # b x lq x (n*dv) 58 | 59 | output = self.dropout(self.fc(output)) 60 | output = self.layer_norm(output + residual) 61 | 62 | return output, attn 63 | 64 | 65 | class ScaledDotProductAttention(nn.Module): 66 | ''' Scaled Dot-Product Attention ''' 67 | 68 | def __init__(self, temperature, attn_dropout=0.1): 69 | super().__init__() 70 | self.temperature = temperature 71 | self.dropout = nn.Dropout(attn_dropout) 72 | self.softmax = nn.Softmax(dim=2) 73 | 74 | def forward(self, q, k, v, mask=None): 75 | 76 | attn = torch.bmm(q, k.transpose(1, 2)) 77 | attn = attn / self.temperature 78 | 79 | if mask is not None: 80 | attn = attn.masked_fill(mask, -np.inf) 81 | 82 | attn = self.softmax(attn) 83 | attn = self.dropout(attn) 84 | output = torch.bmm(attn, v) 85 | 86 | return output, attn 87 | -------------------------------------------------------------------------------- /src/transformer/attentionAssigner.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | import torch 3 | 4 | from transformer.conv_encoder import Conv1d 5 | from utils.utils import sequence_mask 6 | 7 | 8 | class Attention_Assigner(nn.Module): 9 | """Attention assigner of CIF: stacked 1-D convolutions and a linear projection that predict the firing weight alpha for each frame. 10 | """ 11 | 12 | def __init__(self, d_input, d_hidden, w_context, n_layers, dropout=0.1): 13 | super().__init__() 14 | # parameters 15 | self.d_input = d_input 16 | self.d_hidden = d_hidden 17 | self.n_layers = n_layers 18 | self.w_context = w_context 19 | 20 | self.conv = Conv1d(d_input, d_hidden, n_layers, w_context, 21 | pad='same', name='assigner') 22 | self.dropout = nn.Dropout(p=dropout) 23 | self.linear = nn.Linear(d_hidden, 1) 24 | 25 | def forward(self, padded_input, input_lengths): 26 | """ 27 | Args: 28 | padded_input: N x T x D 29 | input_lengths: N 30 | 31 | Returns: 32 | alphas: N x T 33 | """ 34 | x, input_lengths = self.conv(padded_input, input_lengths) 35 | x = self.dropout(x) 36 | alphas = self.linear(x).squeeze(-1) 37 | alphas = torch.sigmoid(alphas) 38 | pad_mask = sequence_mask(input_lengths) 39 | 40 | return alphas * pad_mask 41 | 42 | def tail_fixing(self, alpha): 43 | 44 | return alpha 45 | 46 | 47 | class Attention_Assigner_Big(Attention_Assigner): 48 | """Attention assigner of CIF with a larger convolutional front-end (two conv stacks and two linear layers). 
49 | """ 50 | 51 | def __init__(self, d_input, d_hidden, w_context, n_layers, dropout=0.1): 52 | nn.Module.__init__(self) 53 | # parameters 54 | self.d_input = d_input 55 | self.d_hidden = d_hidden 56 | self.n_layers = n_layers 57 | self.w_context = w_context 58 | 59 | self.conv_1 = Conv1d(d_input, d_hidden, n_layers, w_context, 60 | pad='same', name='assigner1') 61 | self.conv_2 = Conv1d(d_input, 2 * d_hidden, n_layers, w_context, 62 | pad='same', name='assigner2') 63 | self.linear_1 = nn.Linear(2 * d_hidden, d_hidden) 64 | self.linear_2 = nn.Linear(d_hidden, 1) 65 | self.dropout_1 = nn.Dropout(p=dropout) 66 | self.dropout_2 = nn.Dropout(p=dropout) 67 | 68 | def forward(self, padded_input, input_lengths): 69 | """ 70 | Args: 71 | padded_input: N x T x D 72 | input_lengths: N 73 | 74 | Returns: 75 | enc_output: N x T x H 76 | """ 77 | x, input_lengths = self.conv_1(padded_input, input_lengths) 78 | x, input_lengths = self.conv_2(x, input_lengths) 79 | x = self.dropout_1(x) 80 | x = self.linear_1(x) 81 | x = self.dropout_2(x) 82 | x = self.linear_2(x).squeeze(-1) 83 | alphas = torch.sigmoid(x) 84 | pad_mask = sequence_mask(input_lengths) 85 | 86 | return alphas * pad_mask 87 | 88 | 89 | class Attention_Assigner_RNN(Attention_Assigner): 90 | """atteniton assigner of CIF including self-attention and feed forward. 91 | """ 92 | 93 | def __init__(self, d_input, d_hidden, w_context, n_layers, dropout=0.1): 94 | nn.Module.__init__(self) 95 | # parameters 96 | self.d_input = d_input 97 | self.d_hidden = d_hidden 98 | self.n_layers = n_layers 99 | self.w_context = w_context 100 | 101 | self.conv = Conv1d(d_input, d_hidden, n_layers, w_context, 102 | pad='same', name='assigner') 103 | self.rnn = nn.LSTM(input_size=d_hidden, hidden_size=d_hidden//4, 104 | bidirectional=True, bias=False) 105 | self.dropout = nn.Dropout(p=dropout) 106 | self.linear = nn.Linear(d_hidden//2, 1) 107 | 108 | def forward(self, padded_input, input_lengths): 109 | """ 110 | Args: 111 | padded_input: N x T x D 112 | input_lengths: N 113 | 114 | Returns: 115 | enc_output: N x T x H 116 | """ 117 | x, input_lengths = self.conv(padded_input, input_lengths) 118 | x, _ = self.rnn(x) 119 | x = self.dropout(x) 120 | alphas = self.linear(x).squeeze(-1) 121 | alphas = torch.sigmoid(alphas) 122 | pad_mask = sequence_mask(input_lengths) 123 | 124 | return alphas * pad_mask 125 | -------------------------------------------------------------------------------- /src/transformer/cif_model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | from utils.utils import spec_aug 5 | 6 | 7 | class CIF_Model(nn.Module): 8 | """An encoder-decoder framework only includes attention. 
9 | """ 10 | 11 | def __init__(self, conv_encoder, encoder, assigner, decoder, spec_aug_cfg=None): 12 | super().__init__() 13 | self.conv_encoder = conv_encoder 14 | self.encoder = encoder 15 | self.assigner = assigner 16 | self.decoder = decoder 17 | self.spec_aug_cfg = spec_aug_cfg 18 | self.ctc_fc = nn.Linear(encoder.d_output, decoder.d_output, bias=False) 19 | 20 | for p in self.parameters(): 21 | if p.dim() > 1: 22 | nn.init.xavier_uniform_(p) 23 | 24 | def forward(self, features, len_features, targets, threshold=0.95): 25 | """ 26 | Args: 27 | features: N x T x D 28 | len_sequence: N 29 | padded_targets: N x To 30 | """ 31 | 32 | if self.spec_aug_cfg: 33 | features, len_features = spec_aug(features, len_features, self.spec_aug_cfg) 34 | 35 | conv_outputs, len_sequence = self.conv_encoder(features, len_features) 36 | encoder_outputs = self.encoder(conv_outputs, len_sequence) 37 | 38 | ctc_logits = self.ctc_fc(encoder_outputs) 39 | len_ctc_logits = len_sequence 40 | 41 | alpha = self.assigner(encoder_outputs, len_sequence) 42 | 43 | # sum 44 | _num = alpha.sum(-1) 45 | # scaling 46 | num = (targets > 0).float().sum(-1) 47 | num_noise = num + torch.rand(alpha.size(0)).cuda() - 0.5 48 | alpha *= (num_noise / _num)[:, None].repeat(1, alpha.size(1)) 49 | 50 | # cif 51 | l = self.cif(encoder_outputs, alpha, threshold=threshold) 52 | 53 | logits = self.decoder(l, targets) 54 | 55 | return ctc_logits, len_ctc_logits, _num, num, logits 56 | 57 | def cif(self, hidden, alphas, threshold, log=False): 58 | batch_size, len_time, hidden_size = hidden.size() 59 | 60 | # loop varss 61 | integrate = torch.zeros([batch_size]).cuda() 62 | frame = torch.zeros([batch_size, hidden_size]).cuda() 63 | # intermediate vars along time 64 | list_fires = [] 65 | list_frames = [] 66 | 67 | for t in range(len_time): 68 | alpha = alphas[:, t] 69 | distribution_completion = torch.ones([batch_size]).cuda() - integrate 70 | 71 | integrate += alpha 72 | list_fires.append(integrate) 73 | 74 | fire_place = integrate > threshold 75 | integrate = torch.where(fire_place, 76 | integrate - torch.ones([batch_size]).cuda(), 77 | integrate) 78 | cur = torch.where(fire_place, 79 | distribution_completion, 80 | alpha) 81 | remainds = alpha - cur 82 | 83 | frame += cur[:, None] * hidden[:, t, :] 84 | list_frames.append(frame) 85 | frame = torch.where(fire_place[:, None].repeat(1, hidden_size), 86 | remainds[:, None] * hidden[:, t, :], 87 | frame) 88 | if log: 89 | print('t: {}\t{:.3f} -> {:.3f}|{:.3f}'.format( 90 | t, integrate[0].numpy(), cur[0].numpy(), remainds[0].numpy())) 91 | 92 | fires = torch.stack(list_fires, 1) 93 | frames = torch.stack(list_frames, 1) 94 | list_ls = [] 95 | len_labels = torch.round(alphas.sum(-1)).int() 96 | max_label_len = len_labels.max() 97 | for b in range(batch_size): 98 | fire = fires[b, :] 99 | l = torch.index_select(frames[b, :, :], 0, torch.where(fire > threshold)[0]) 100 | pad_l = torch.zeros([max_label_len - l.size(0), hidden_size]).cuda() 101 | list_ls.append(torch.cat([l, pad_l], 0)) 102 | 103 | if log: 104 | print('fire:\n', fires.numpy()) 105 | 106 | return torch.stack(list_ls, 0) 107 | 108 | def recognize(self, input, input_length, char_list, args, threshold=0.95, target_num=None): 109 | """Sequence-to-Sequence beam search, decode one utterence now. 
110 | Args: 111 | input: T x D 112 | char_list: list of characters 113 | args: args.beam 114 | Returns: 115 | nbest_hyps: 116 | """ 117 | conv_padded_outputs, input_length = self.conv_encoder(input.unsqueeze(0), input_length) 118 | encoder_outputs = self.encoder(conv_padded_outputs, input_length) 119 | 120 | alpha = self.assigner(encoder_outputs, input_length) 121 | if target_num: 122 | _num = alpha.sum(-1) 123 | num = target_num 124 | alpha *= (num / _num)[:, None].repeat(1, alpha.size(1)) 125 | 126 | l = self.cif(encoder_outputs, alpha, threshold=threshold) 127 | 128 | nbest_hyps = self.decoder.recognize_beam(l, char_list, args) 129 | # nbest_hyps = self.decoder.recognize_beam_cache(l, char_list, args) 130 | 131 | return nbest_hyps 132 | 133 | @classmethod 134 | def create_model(cls, args): 135 | from transformer.conv_encoder import Conv2dSubsample 136 | from transformer.encoder import Encoder 137 | from transformer.attentionAssigner import Attention_Assigner 138 | # from transformer.attentionAssigner import Attention_Assigner_RNN as Attention_Assigner 139 | from transformer.decoder import Decoder_CIF as Decoder 140 | 141 | conv_encoder = Conv2dSubsample(d_input=args.d_input * args.LFR_m, 142 | d_model=args.d_model, 143 | n_layers=args.n_conv_layers) 144 | encoder = Encoder(d_input=args.d_model, 145 | n_layers=args.n_layers_enc, 146 | n_head=args.n_head, 147 | d_model=args.d_model, 148 | d_inner=args.d_inner, 149 | dropout=args.dropout) 150 | assigner = Attention_Assigner(d_input=args.d_model, 151 | d_hidden=args.d_assigner_hidden, 152 | w_context=args.w_context, 153 | n_layers=args.n_assigner_layers) 154 | decoder = Decoder(sos_id=args.sos_id, 155 | n_tgt_vocab=args.vocab_size, 156 | n_layers=args.n_layers_dec, 157 | n_head=args.n_head, 158 | d_model=args.d_model, 159 | d_inner=args.d_inner, 160 | dropout=args.dropout) 161 | model = cls(conv_encoder, encoder, assigner, decoder, args.spec_aug_cfg) 162 | 163 | return model 164 | 165 | @classmethod 166 | def load_model(cls, path, args): 167 | 168 | # creat mdoel 169 | model= cls.create_model(args) 170 | 171 | # load params 172 | package = torch.load(path, map_location=lambda storage, loc: storage) 173 | model.load_state_dict(package['state_dict']) 174 | 175 | return model 176 | 177 | @staticmethod 178 | def serialize(model, optimizer, epoch, tr_loss=None, cv_loss=None): 179 | package = { 180 | 'state_dict': model.state_dict(), 181 | 'optim_dict': optimizer.state_dict(), 182 | 'epoch': epoch 183 | } 184 | if tr_loss is not None: 185 | package['tr_loss'] = tr_loss 186 | package['cv_loss'] = cv_loss 187 | 188 | return package 189 | -------------------------------------------------------------------------------- /src/transformer/conv_encoder.py: -------------------------------------------------------------------------------- 1 | from collections import OrderedDict 2 | import math 3 | import torch.nn as nn 4 | import torch 5 | import torch.nn.functional as F 6 | from torch.nn.modules.normalization import LayerNorm 7 | 8 | 9 | class Conv1d(nn.Module): 10 | # the same as stack frames 11 | def __init__(self, d_input, d_hidden, n_layers, w_context, pad='same', name=''): 12 | super().__init__() 13 | assert n_layers >= 1 14 | self.n_layers = n_layers 15 | self.d_input = d_input 16 | self.d_hidden = d_hidden 17 | self.w_context = w_context 18 | self.pad = pad 19 | self.name = name 20 | 21 | layers = [("{}/conv1d_0".format(name), nn.Conv1d(d_input, d_hidden, w_context, 1)), 22 | # ("{}/batchnorm_0".format(name), nn.BatchNorm1d(d_hidden)), 23 | 
("{}/relu_0".format(name), nn.ReLU())] 24 | for i in range(n_layers-1): 25 | layers += [ 26 | ("{}/conv1d_{}".format(name, i+1), nn.Conv1d(d_hidden, d_hidden, w_context, 1)), 27 | # ("{}/batchnorm_{}".format(name, i+1), nn.BatchNorm1d(d_hidden)), 28 | ("{}/relu_{}".format(name, i+1), nn.ReLU()) 29 | ] 30 | layers = OrderedDict(layers) 31 | self.conv = nn.Sequential(layers) 32 | 33 | def forward(self, feats, feat_lengths): 34 | if self.pad == 'same': 35 | input_length = feats.size(1) 36 | feats = F.pad(feats, (0, 0, 0, self.n_layers * self.w_context)) 37 | outputs = self.conv(feats.permute(0, 2, 1)) 38 | outputs = outputs.permute(0, 2, 1) 39 | 40 | if self.pad == 'same': 41 | tensor_length = input_length 42 | assert tensor_length <= outputs.size(1) 43 | outputs = outputs[:, :tensor_length, :] 44 | output_lengths = feat_lengths 45 | else: 46 | output_lengths = ((feat_lengths + sum(self.padding) - 47 | 1*(self.w_context-1)-1)/self.subsample + 1).long() 48 | 49 | return outputs, output_lengths 50 | 51 | 52 | class Conv1dSubsample(nn.Module): 53 | # the same as stack frames 54 | def __init__(self, d_input, d_model, w_context, subsample, pad='same'): 55 | super().__init__() 56 | self.conv = nn.Conv1d(d_input, d_model, w_context, stride=subsample) 57 | self.conv_norm = LayerNorm(d_model) 58 | self.subsample = subsample 59 | self.w_context = w_context 60 | self.pad = pad 61 | 62 | def forward(self, feats, feat_lengths): 63 | if self.pad == 'same': 64 | input_length = feats.size(1) 65 | feats = F.pad(feats, (0, 0, 0, self.w_context)) 66 | outputs = self.conv(feats.permute(0, 2, 1)) 67 | outputs = outputs.permute(0, 2, 1) 68 | outputs = self.conv_norm(outputs) 69 | 70 | if self.pad == 'same': 71 | tensor_length = int(math.ceil(input_length/self.subsample)) 72 | assert tensor_length <= outputs.size(1) 73 | outputs = outputs[:, :tensor_length, :] 74 | output_lengths = torch.ceil(feat_lengths*1.0/self.subsample).int() 75 | else: 76 | output_lengths = ((feat_lengths + sum(self.padding) - 1*(self.w_context-1)-1)/self.subsample + 1).long() 77 | 78 | return outputs, output_lengths 79 | 80 | 81 | class Conv2dSubsample(nn.Module): 82 | def __init__(self, d_input, d_model, n_layers=2, pad='same'): 83 | super().__init__() 84 | assert n_layers >= 1 85 | self.n_layers = n_layers 86 | self.d_input = d_input 87 | self.pad = pad 88 | 89 | layers = [("subsample/conv0", nn.Conv2d(1, 32, 3, (2, 1))), 90 | ("subsample/relu0", nn.ReLU())] 91 | for i in range(n_layers-1): 92 | layers += [ 93 | ("subsample/conv{}".format(i+1), nn.Conv2d(32, 32, 3, (2, 1))), 94 | ("subsample/relu{}".format(i+1), nn.ReLU()) 95 | ] 96 | layers = OrderedDict(layers) 97 | self.conv = nn.Sequential(layers) 98 | self.d_conv_out = int(math.ceil(d_input / 2)) 99 | self.affine = nn.Linear(32 * self.d_conv_out, d_model) 100 | 101 | def forward(self, feats, feat_lengths): 102 | if self.pad == 'same': 103 | input_length = feats.size(1) 104 | feats = F.pad(feats, (0, 10, 0, 20)) 105 | outputs = feats.unsqueeze(1) # [B, C, T, D] 106 | outputs = self.conv(outputs)[:, :, :, :self.d_conv_out] 107 | B, C, T, D = outputs.size() 108 | outputs = outputs.permute(0, 2, 1, 3).contiguous().view(B, T, C*D) 109 | 110 | if self.pad == 'same': 111 | output_lengths = feat_lengths 112 | tensor_length = input_length 113 | for _ in range(self.n_layers): 114 | output_lengths = torch.ceil(output_lengths / 2.0).int() 115 | tensor_length = int(math.ceil(tensor_length / 2.0)) 116 | 117 | assert tensor_length <= outputs.size(1) 118 | outputs = outputs[:, :tensor_length, :] 119 
| else: 120 | output_lengths = feat_lengths 121 | for _ in range(self.n_layers): 122 | output_lengths = ((output_lengths-1) / 2).long() 123 | 124 | outputs = self.affine(outputs) 125 | 126 | return outputs, output_lengths 127 | -------------------------------------------------------------------------------- /src/transformer/data.py: -------------------------------------------------------------------------------- 1 | """ 2 | Logic: 3 | 1. AudioDataLoader generate a minibatch from AudioDataset, the size of this 4 | minibatch is AudioDataLoader's batchsize. For now, we always set 5 | AudioDataLoader's batchsize as 1. The real minibatch size we care about is 6 | set in AudioDataset's __init__(...). So actually, we generate the 7 | information of one minibatch in AudioDataset. 8 | 2. After AudioDataLoader getting one minibatch from AudioDataset, 9 | AudioDataLoader calls its collate_fn(batch) to process this minibatch. 10 | """ 11 | import json 12 | import os 13 | import numpy as np 14 | import torch 15 | import torch.utils.data as data 16 | import kaldi_io 17 | 18 | from utils.utils import pad_list 19 | 20 | 21 | class AudioDataset(data.Dataset): 22 | """ 23 | TODO: this is a little HACK now, put batch_size here now. 24 | remove batch_size to dataloader later. 25 | """ 26 | def __init__(self, json_path, token2idx, batch_size=None, frames_size=None, 27 | len_in_max=None, len_out_max=None, rate_in_out=None): 28 | # From: espnet/src/asr/asr_utils.py: make_batchset() 29 | """ 30 | Args: 31 | len_in_max: filter input seq length > len_in_max 32 | len_out_max: filter output seq length > len_out_max 33 | rate_in_out: tuple, filter len_seq_in/len_seq_out not in rate_in_out 34 | batch_size: batched by batch_size 35 | frames_size: batched by frames_size 36 | log: logging if samples filted 37 | """ 38 | super().__init__() 39 | try: 40 | # json_path is a single file 41 | with open(json_path) as f: 42 | data = json.load(f) 43 | except: 44 | # json_path is a dir where *.json in 45 | data = [] 46 | for dir, _, fs in os.walk(os.path.dirname(json_path)): # os.walk获取所有的目录 47 | for f in fs: 48 | if f.endswith('.json'): # 判断是否是".json"结尾 49 | filename = os.path.join(dir, f) 50 | with open(filename) as f: 51 | data.extend(json.load(f)) 52 | 53 | # filter samples 54 | list_to_pop = [] 55 | for i, sample in enumerate(data): 56 | len_x = sample['input_length'] 57 | len_y = sample['output_length'] 58 | if not 0 < len_x <= len_in_max: 59 | list_to_pop.append(i) 60 | elif not 0 < len_y <= len_out_max: 61 | list_to_pop.append(i) 62 | elif rate_in_out and not (rate_in_out[0] <= (len_x / len_y) <= rate_in_out[1]): 63 | list_to_pop.append(i) 64 | 65 | # gen token_ids 66 | sample['token_ids'] = [token2idx[token] for token in sample['token'].split()] 67 | 68 | print('filtered {} samples:\n{}'.format( 69 | len(list_to_pop), ', '.join(data[i]['uttid'] for i in list_to_pop))) 70 | list_to_pop.reverse() 71 | [data.pop(i) for i in list_to_pop] 72 | 73 | # sort it by input lengths (long to short) 74 | data_sorted = sorted(data, key=lambda data: sample['input_length'], reverse=True) 75 | # change batchsize depending on the input and output length 76 | minibatch = [] 77 | # Method 1: Generate minibatch based on batch_size 78 | # i.e. 
each batch contains #batch_size utterances 79 | if batch_size: 80 | start = 0 81 | while True: 82 | end = start 83 | ilen = data_sorted[end]['input_length'] 84 | olen = data_sorted[end]['output_length'] 85 | factor = max(int(ilen / len_in_max), int(olen / len_out_max)) 86 | b = max(1, int(batch_size / (1 + factor))) 87 | end = min(len(data_sorted), start + b) 88 | minibatch.append(data_sorted[start:end]) 89 | if end == len(data_sorted): 90 | break 91 | start = end 92 | # Method 2: Generate minibatch based on frames_size 93 | # i.e. each batch contains approximately #frames_size frames 94 | elif frames_size: # frames_size > 0 95 | print("NOTE: Generate minibatch based on frames_size.") 96 | print("i.e. each batch contains approximately #frames_size frames") 97 | start = 0 98 | while True: 99 | total_frames = 0 100 | end = start 101 | while total_frames < frames_size and end < len(data_sorted): 102 | ilen = data_sorted[end]['input_length'] 103 | total_frames += ilen 104 | end += 1 105 | minibatch.append(data_sorted[start:end]) 106 | if end == len(data_sorted): 107 | break 108 | start = end 109 | else: 110 | assert batch_size or frames_size 111 | self.minibatch = minibatch 112 | 113 | def __getitem__(self, index): 114 | return self.minibatch[index] 115 | 116 | def __len__(self): 117 | return len(self.minibatch) 118 | 119 | 120 | class batch_generator(object): 121 | def __init__(self, sos_id=None, eos_id=None, LFR_m=1, LFR_n=1): 122 | self.sos_id = sos_id 123 | self.eos_id = eos_id 124 | self.LFR_m = LFR_m 125 | self.LFR_n = LFR_n 126 | 127 | def __call__(self, batch): 128 | uttids, xs, ys = [], [], [] 129 | for sample in batch[0]: 130 | uttids.append(sample['uttid']) 131 | 132 | x = build_LFR_features(kaldi_io.read_mat(sample['feat']), self.LFR_m, self.LFR_n) 133 | xs.append(torch.tensor(x).float()) 134 | 135 | y = sample['token_ids'] 136 | ys.append(torch.tensor(y).long()) 137 | 138 | xs_pad, len_xs = pad_list(xs, 0) 139 | ys_pad, len_ys = pad_list(ys, 0) 140 | 141 | return uttids, xs_pad, len_xs, ys_pad, len_ys 142 | 143 | 144 | 145 | def build_LFR_features(inputs, m, n): 146 | """ 147 | Actually, this implements stacking frames and skipping frames. 148 | if m = 1 and n = 1, just return the origin features. 149 | if m = 1 and n > 1, it works like skipping. 150 | if m > 1 and n = 1, it works like stacking but only support right frames. 151 | if m > 1 and n > 1, it works like LFR. 
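    For example, with m=4 and n=3 a 10-frame input of dim D becomes
    ceil(10/3) = 4 output frames of dim 4*D, built from source frames
    [0:4], [3:7], [6:10], and frame 9 repeated four times for the final window.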
152 | 153 | Args: 154 | inputs_batch: inputs is T x D np.ndarray 155 | m: number of frames to stack 156 | n: number of frames to skip 157 | """ 158 | # LFR_inputs_batch = [] 159 | # for inputs in inputs_batch: 160 | LFR_inputs = [] 161 | T = inputs.shape[0] 162 | T_lfr = int(np.ceil(T / n)) 163 | for i in range(T_lfr): 164 | if m <= T - i * n: 165 | LFR_inputs.append(np.hstack(inputs[i*n:i*n+m])) 166 | else: # process last LFR frame 167 | num_padding = m - (T - i * n) 168 | frame = np.hstack(inputs[i*n:]) 169 | for _ in range(num_padding): 170 | frame = np.hstack((frame, inputs[-1])) 171 | LFR_inputs.append(frame) 172 | 173 | return np.vstack(LFR_inputs) 174 | -------------------------------------------------------------------------------- /src/transformer/encoder.py: -------------------------------------------------------------------------------- 1 | import torch.nn as nn 2 | 3 | from transformer.attention import MultiheadAttention 4 | from transformer.module import PositionalEncoding, PositionwiseFeedForward 5 | from utils.utils import sequence_mask, get_attn_pad_mask 6 | 7 | 8 | class Encoder(nn.Module): 9 | """Encoder of Transformer including self-attention and feed forward. 10 | """ 11 | 12 | def __init__(self, d_input, n_layers, n_head, d_model, d_inner, dropout=0.1): 13 | super().__init__() 14 | # parameters 15 | self.d_input = d_input 16 | self.n_layers = n_layers 17 | self.n_head = n_head 18 | self.d_model = d_model 19 | self.d_output = d_model 20 | self.d_inner = d_inner 21 | self.dropout_rate = dropout 22 | 23 | # use linear transformation with layer norm to replace input embedding 24 | self.linear_in = nn.Linear(d_input, d_model) 25 | self.layer_norm_in = nn.LayerNorm(d_model) 26 | self.positional_encoding = PositionalEncoding(d_model) 27 | self.dropout = nn.Dropout(dropout) 28 | 29 | self.layer_stack = nn.ModuleList([ 30 | EncoderLayer(d_model, d_inner, n_head, dropout=dropout) 31 | for _ in range(n_layers)]) 32 | 33 | def forward(self, padded_input, input_lengths): 34 | """ 35 | Args: 36 | padded_input: N x T x D 37 | input_lengths: N 38 | 39 | Returns: 40 | enc_output: N x T x H 41 | """ 42 | # Prepare masks 43 | non_pad_mask = sequence_mask(input_lengths).unsqueeze(-1) 44 | length = padded_input.size(1) 45 | slf_attn_mask = get_attn_pad_mask(input_lengths, length) 46 | 47 | # Forward 48 | enc_output = self.dropout( 49 | self.layer_norm_in(self.linear_in(padded_input)) + 50 | self.positional_encoding(padded_input)) 51 | 52 | for enc_layer in self.layer_stack: 53 | enc_output = enc_layer( 54 | enc_output, 55 | non_pad_mask=non_pad_mask, 56 | slf_attn_mask=slf_attn_mask) 57 | 58 | return enc_output 59 | 60 | 61 | class EncoderLayer(nn.Module): 62 | """Compose with two sub-layers. 63 | 1. A multi-head self-attention mechanism 64 | 2. A simple, position-wise fully connected feed-forward network. 
65 | """ 66 | def __init__(self, d_model, d_inner, n_head, dropout=0.1): 67 | super().__init__() 68 | self.slf_attn = MultiheadAttention(d_model, n_head, dropout=dropout) 69 | self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout) 70 | 71 | def forward(self, enc_input, non_pad_mask=None, slf_attn_mask=None): 72 | enc_output, enc_slf_attn = self.slf_attn( 73 | enc_input, enc_input, enc_input, mask=slf_attn_mask) 74 | enc_output *= non_pad_mask 75 | 76 | enc_output = self.pos_ffn(enc_output) 77 | enc_output *= non_pad_mask 78 | 79 | return enc_output 80 | 81 | def forward_cache(self, enc_input): 82 | enc_output, enc_slf_attn = self.slf_attn( 83 | enc_input[:, -1: , :], enc_input, enc_input) 84 | 85 | enc_output = self.pos_ffn(enc_output) 86 | 87 | return enc_output 88 | -------------------------------------------------------------------------------- /src/transformer/infer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import argparse 3 | import json 4 | import torch 5 | import kaldi_io 6 | import time 7 | from torch.utils.data import DataLoader 8 | 9 | from utils.utils import load_vocab, ids2str 10 | from utils.data import build_LFR_features 11 | from transformer.data import AudioDataset, batch_generator 12 | 13 | 14 | parser = argparse.ArgumentParser( 15 | "End-to-End Automatic Speech Recognition Decoding.") 16 | # data 17 | parser.add_argument('--type', type=str, required=True, 18 | help='test or infer') 19 | parser.add_argument('--recog-json', type=str, required=True, 20 | help='Filename of recognition data (json)') 21 | parser.add_argument('--model-path', type=str, required=True, 22 | help='Filename of recognition data (json)') 23 | parser.add_argument('--vocab', type=str, required=True, 24 | help='Dictionary which should include ') 25 | parser.add_argument('--output', type=str, required=True, 26 | help='Filename of result label data (json)') 27 | parser.add_argument('--label_type', type=str, default='token', 28 | help='label_type') 29 | parser.add_argument('--num-workers', default=4, type=int, 30 | help='Number of workers to generate minibatch') 31 | 32 | # Low Frame Rate (stacking and skipping frames) 33 | parser.add_argument('--LFR_m', default=1, type=int, 34 | help='Low Frame Rate: number of frames to stack') 35 | parser.add_argument('--LFR_n', default=1, type=int, 36 | help='Low Frame Rate: number of frames to skip') 37 | parser.add_argument('--spec_aug_cfg', default=None, type=str, 38 | help='spec_aug_cfg') 39 | 40 | # Network architecture 41 | parser.add_argument('--structure', type=str, default='transformer', 42 | help='transformer transformer-ctc conv-transformer-ctc') 43 | # conv_encoder 44 | parser.add_argument('--n_conv_layers', default=3, type=int, 45 | help='Dimension of key') 46 | # encoder 47 | parser.add_argument('--d_input', default=80, type=int, 48 | help='Dim of encoder input (before LFR)') 49 | parser.add_argument('--n_layers_enc', default=6, type=int, 50 | help='Number of encoder stacks') 51 | parser.add_argument('--n_head', default=8, type=int, 52 | help='Number of Multi Head Attention (MHA)') 53 | parser.add_argument('--d_model', default=512, type=int, 54 | help='Dimension of model') 55 | parser.add_argument('--d_inner', default=2048, type=int, 56 | help='Dimension of inner') 57 | parser.add_argument('--dropout', default=0.1, type=float, 58 | help='Dropout rate') 59 | # assigner 60 | parser.add_argument('--w_context', default=3, type=int, 61 | help='Positional Encoding max len') 
62 | parser.add_argument('--d_assigner_hidden', default=512, type=int, 63 | help='Positional Encoding max len') 64 | parser.add_argument('--n_assigner_layers', default=3, type=int, 65 | help='Positional Encoding max len') 66 | # decoder 67 | parser.add_argument('--n_layers_dec', default=6, type=int, 68 | help='Number of decoder stacks') 69 | 70 | # decode 71 | parser.add_argument('--beam-size', default=1, type=int, 72 | help='Beam size') 73 | parser.add_argument('--nbest', default=1, type=int, 74 | help='Nbest size') 75 | parser.add_argument('--print-freq', default=1, type=int, 76 | help='print_freq') 77 | parser.add_argument('--decode-max-len', default=0, type=int, 78 | help='Max output length. If ==0 (default), it uses a ' 79 | 'end-detect function to automatically find maximum ' 80 | 'hypothesis lengths') 81 | 82 | 83 | def test(args): 84 | if args.structure == 'transformer': 85 | from transformer.Transformer import Transformer as Model 86 | elif args.structure == 'transformer-ctc': 87 | from transformer.Transformer import CTC_Transformer as Model 88 | elif args.structure == 'conv-transformer-ctc': 89 | from transformer.Transformer import Conv_CTC_Transformer as Model 90 | elif args.structure == 'cif': 91 | from transformer.CIF_Model import CIF_Model as Model 92 | 93 | token2idx, idx2token = load_vocab(args.vocab) 94 | args.sos_id = token2idx[''] 95 | args.eos_id = token2idx[''] 96 | args.vocab_size = len(token2idx) 97 | 98 | model = Model.load_model(args.model_path, args) 99 | print(model) 100 | model.eval() 101 | model.cuda() 102 | 103 | # read json data 104 | with open(args.recog_json, 'rb') as f: 105 | js = json.load(f)['utts'] 106 | 107 | cur_time = time.time() 108 | # decode each utterance 109 | 110 | test_dataset = AudioDataset('/home/easton/projects/OpenASR/egs/aishell1/data/test.json', 111 | token2idx, frames_size=1000, 112 | len_in_max=1999, len_out_max=99) 113 | test_loader = DataLoader(test_dataset, batch_size=1, 114 | collate_fn=batch_generator(), 115 | num_workers=args.num_workers) 116 | # test_loader = AudioDataLoader(test_dataset, batch_size=1, 117 | # token2idx=token2idx, 118 | # label_type=args.label_type, 119 | # num_workers=args.num_workers, 120 | # LFR_m=args.LFR_m, LFR_n=args.LFR_n) 121 | 122 | def process_batch(hyps, scores, idx2token, fw): 123 | for nbest, nscore in zip(hyps, scores): 124 | for n, (hyp, score) in enumerate(zip(nbest, nscore)): 125 | hyp = hyp.tolist() 126 | try: 127 | eos = hyp.index(3) 128 | except: 129 | eos = None 130 | 131 | hyp = ''.join(idx2token[i] for i in hyp[:eos]) 132 | print("top{}: {} score: {:.3f}\n".format(n+1, hyp, score)) 133 | if n == 0: 134 | fw.write("{} {}\n".format('uttid', hyp)) 135 | 136 | with torch.no_grad(), open(args.output, 'w') as fw: 137 | for data in test_loader: 138 | uttids, xs_pad, len_xs, ys_pad, len_ys = data 139 | xs_pad = xs_pad.cuda() 140 | ys_pad = ys_pad.cuda() 141 | hyps_ints, len_decoded_sorted, scores = model.batch_recognize( 142 | xs_pad, len_xs, args.beam_size) 143 | 144 | process_batch(hyps_ints.cpu().numpy(), scores.cpu().numpy(), idx2token, fw) 145 | # f.write(uttids[0] + ' ' + hyp + '\n') 146 | 147 | # with torch.no_grad(), open(args.output, 'w') as f: 148 | # for idx, uttid in enumerate(js.keys(), 1): 149 | # input = kaldi_io.read_mat(js[uttid]['input'][0]['feat']) # TxD 150 | # input = build_LFR_features(input, args.LFR_m, args.LFR_n) 151 | # input = torch.from_numpy(input).float() 152 | # input_length = torch.tensor([input.size(0)], dtype=torch.int) 153 | # input = input.cuda() 154 | # 
input_length = input_length.cuda() 155 | # # hyps_ints = model.recognize(input, input_length, idx2token, args, 156 | # # target_num=len(js[uttid]['output']['tokenid'].split())) 157 | # hyps_ints = model.recognize(input, input_length, idx2token, args) 158 | # # hyps_ints = model.recognize_beam_cache(input, input_length, idx2token, args) 159 | # hyp = ids2str(hyps_ints, idx2token)[0] 160 | # print(hyp) 161 | # f.write(uttid + ' ' + hyp + '\n') 162 | # used_time = time.time() - cur_time 163 | # print('({}/{}) use time {:.2f}s {}: {}'.format( 164 | # idx, len(js.keys()), used_time, uttid, hyp), flush=True) 165 | # cur_time = time.time() 166 | 167 | 168 | def infer(args): 169 | return 170 | 171 | 172 | if __name__ == "__main__": 173 | args = parser.parse_args() 174 | print(args, flush=True) 175 | if args.type == 'test': 176 | test(args) 177 | elif args.type == 'infer': 178 | infer(args) 179 | -------------------------------------------------------------------------------- /src/transformer/loss.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | 4 | 5 | def cal_ce_loss(logits, targets, smoothing=0.0): 6 | """Calculate cross entropy loss, apply label smoothing if needed. 7 | """ 8 | 9 | logits = logits.view(-1, logits.size(2)) 10 | targets = targets.contiguous().view(-1) 11 | if smoothing > 0.0: 12 | eps = smoothing 13 | n_class = logits.size(1) 14 | 15 | # Generate one-hot matrix: N x C. 16 | # Only label position is 1 and all other positions are 0 17 | # gold include -1 value (0) and this will lead to assert error 18 | one_hot = torch.zeros_like(logits).scatter(1, targets.long().view(-1, 1), 1) 19 | one_hot = one_hot * (1 - eps) + (1 - one_hot) * eps / n_class 20 | log_prb = F.log_softmax(logits, dim=1) 21 | 22 | non_pad_mask = targets.ne(0) 23 | n_word = non_pad_mask.long().sum() 24 | loss = -(one_hot * log_prb).sum(dim=1) 25 | ce_loss = loss.masked_select(non_pad_mask).sum() / n_word 26 | else: 27 | ce_loss = F.cross_entropy(logits, targets, 28 | ignore_index=0, 29 | reduction='mean') 30 | 31 | return ce_loss 32 | 33 | 34 | def cal_ctc_ce_loss(logits_ctc, len_logits_ctc, logits_ce, targets, smoothing=0.0): 35 | """Calculate cross entropy loss, apply label smoothing if needed. 36 | len_logits_ctc, logits_ctc, logits_ce = model_output 37 | """ 38 | # ctc loss 39 | n_class = logits_ctc.size(-1) 40 | target_lengths = targets.ne(0).int().sum(1) 41 | ctc_log_probs = F.log_softmax(logits_ctc, dim=-1).transpose(0,1) 42 | ctc_loss = F.ctc_loss(ctc_log_probs, targets, 43 | len_logits_ctc, target_lengths, blank=n_class-1) 44 | 45 | # ce loss 46 | ce_loss = cal_ce_loss(logits_ce, targets, smoothing) 47 | 48 | return ctc_loss, ce_loss 49 | 50 | 51 | def cal_ctc_qua_ce_loss(logits_ctc, len_logits_ctc, _number, number, logits_ce, targets, smoothing=0.0): 52 | """Calculate cross entropy loss, apply label smoothing if needed. 
53 | """ 54 | # qua loss 55 | qua_loss = torch.pow(_number - number, 2).mean() 56 | 57 | # ce loss 58 | ctc_loss, ce_loss = cal_ctc_ce_loss( 59 | logits_ctc, len_logits_ctc, logits_ce, targets, smoothing) 60 | 61 | return qua_loss, ctc_loss, ce_loss 62 | -------------------------------------------------------------------------------- /src/transformer/module.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | 6 | 7 | class PositionalEncoding(nn.Module): 8 | """Implement the positional encoding (PE) function. 9 | 10 | PE(pos, 2i) = sin(pos/(10000^(2i/dmodel))) 11 | PE(pos, 2i+1) = cos(pos/(10000^(2i/dmodel))) 12 | """ 13 | 14 | def __init__(self, d_model, max_len=5000): 15 | super().__init__() 16 | # Compute the positional encodings once in log space. 17 | pe = torch.zeros(max_len, d_model, requires_grad=False) 18 | position = torch.arange(0, max_len).unsqueeze(1).float() 19 | div_term = torch.exp(torch.arange(0, d_model, 2).float() * 20 | -(math.log(10000.0) / d_model)) 21 | pe[:, 0::2] = torch.sin(position * div_term) 22 | pe[:, 1::2] = torch.cos(position * div_term) 23 | pe = pe.unsqueeze(0) 24 | self.register_buffer('pe', pe) 25 | 26 | def forward(self, input): 27 | """ 28 | Args: 29 | input: N x T x D 30 | """ 31 | length = input.size(1) 32 | return self.pe[:, :length] 33 | 34 | 35 | class PositionwiseFeedForward(nn.Module): 36 | """Implements position-wise feedforward sublayer. 37 | 38 | FFN(x) = max(0, xW1 + b1)W2 + b2 39 | """ 40 | 41 | def __init__(self, d_model, d_ff, dropout=0.1): 42 | super(PositionwiseFeedForward, self).__init__() 43 | self.w_1 = nn.Linear(d_model, d_ff) 44 | self.w_2 = nn.Linear(d_ff, d_model) 45 | self.dropout = nn.Dropout(dropout) 46 | self.layer_norm = nn.LayerNorm(d_model) 47 | 48 | def forward(self, x): 49 | residual = x 50 | output = self.w_2(F.relu(self.w_1(x))) 51 | output = self.dropout(output) 52 | output = self.layer_norm(output + residual) 53 | return output 54 | 55 | 56 | # Another implementation 57 | class PositionwiseFeedForwardUseConv(nn.Module): 58 | """A two-feed-forward-layer module""" 59 | 60 | def __init__(self, d_in, d_hid, dropout=0.1): 61 | super(PositionwiseFeedForwardUseConv, self).__init__() 62 | self.w_1 = nn.Conv1d(d_in, d_hid, 1) # position-wise 63 | self.w_2 = nn.Conv1d(d_hid, d_in, 1) # position-wise 64 | self.layer_norm = nn.LayerNorm(d_in) 65 | self.dropout = nn.Dropout(dropout) 66 | 67 | def forward(self, x): 68 | residual = x 69 | output = x.transpose(1, 2) 70 | output = self.w_2(F.relu(self.w_1(output))) 71 | output = output.transpose(1, 2) 72 | output = self.dropout(output) 73 | output = self.layer_norm(output + residual) 74 | return output 75 | -------------------------------------------------------------------------------- /src/transformer/optimizer.py: -------------------------------------------------------------------------------- 1 | """A wrapper class for optimizer""" 2 | import torch 3 | 4 | 5 | class TransformerOptimizer(object): 6 | """A simple wrapper class for learning rate scheduling""" 7 | 8 | def __init__(self, optimizer, k, d_model, warmup_steps=4000): 9 | self.optimizer = optimizer 10 | self.k = k 11 | self.init_lr = d_model ** (-0.5) 12 | self.warmup_steps = warmup_steps 13 | self.step_num = 0 14 | self.visdom_lr = None 15 | 16 | def zero_grad(self): 17 | self.optimizer.zero_grad() 18 | 19 | def step(self): 20 | self._update_lr() 21 | self._visdom() 22 | self.optimizer.step() 
23 | 24 | def _update_lr(self): 25 | self.step_num += 1 26 | lr = self.k * self.init_lr * min(self.step_num ** (-0.5), 27 | self.step_num * (self.warmup_steps ** (-1.5))) 28 | for param_group in self.optimizer.param_groups: 29 | param_group['lr'] = lr 30 | 31 | def load_state_dict(self, state_dict): 32 | self.optimizer.load_state_dict(state_dict) 33 | 34 | def state_dict(self): 35 | return self.optimizer.state_dict() 36 | 37 | def set_k(self, k): 38 | self.k = k 39 | 40 | def set_visdom(self, visdom_lr, vis): 41 | self.visdom_lr = visdom_lr # Turn on/off visdom of learning rate 42 | self.vis = vis # visdom enviroment 43 | self.vis_opts = dict(title='Learning Rate', 44 | ylabel='Leanring Rate', xlabel='step') 45 | self.vis_window = None 46 | self.x_axis = torch.LongTensor() 47 | self.y_axis = torch.FloatTensor() 48 | 49 | def _visdom(self): 50 | if self.visdom_lr is not None: 51 | self.x_axis = torch.cat( 52 | [self.x_axis, torch.LongTensor([self.step_num])]) 53 | self.y_axis = torch.cat( 54 | [self.y_axis, torch.FloatTensor([self.optimizer.param_groups[0]['lr']])]) 55 | if self.vis_window is None: 56 | self.vis_window = self.vis.line(X=self.x_axis, Y=self.y_axis, 57 | opts=self.vis_opts) 58 | else: 59 | self.vis.line(X=self.x_axis, Y=self.y_axis, win=self.vis_window, 60 | update='replace') 61 | -------------------------------------------------------------------------------- /src/transformer/train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import argparse 3 | import torch 4 | 5 | from utils.data import AudioDataLoader, AudioDataset 6 | from utils.utils import load_vocab 7 | from transformer.optimizer import TransformerOptimizer 8 | 9 | 10 | parser = argparse.ArgumentParser( 11 | "End-to-End Automatic Speech Recognition Training " 12 | "(Transformer framework).") 13 | # General config 14 | # Task related 15 | parser.add_argument('--train-json', type=str, default=None, 16 | help='Filename of train label data (json)') 17 | parser.add_argument('--valid-json', type=str, default=None, 18 | help='Filename of validation label data (json)') 19 | parser.add_argument('--vocab', type=str, required=True, 20 | help='Dictionary which should include ') 21 | # Low Frame Rate (stacking and skipping frames) 22 | parser.add_argument('--LFR_m', default=4, type=int, 23 | help='Low Frame Rate: number of frames to stack') 24 | parser.add_argument('--LFR_n', default=3, type=int, 25 | help='Low Frame Rate: number of frames to skip') 26 | parser.add_argument('--spec_aug_cfg', default=None, type=str, 27 | help='spec_aug_cfg') 28 | # Network architecture 29 | 30 | # conv_encoder 31 | parser.add_argument('--n_conv_layers', default=3, type=int, 32 | help='Dimension of key') 33 | # encoder 34 | # TODO: automatically infer input dim 35 | parser.add_argument('--structure', type=str, default='transformer', 36 | help='transformer transformer-ctc conv-transformer-ctc') 37 | parser.add_argument('--label_type', type=str, default='token', 38 | help='label_type') 39 | parser.add_argument('--d_input', default=80, type=int, 40 | help='Dim of encoder input (before LFR)') 41 | parser.add_argument('--n_layers_enc', default=6, type=int, 42 | help='Number of encoder stacks') 43 | parser.add_argument('--n_head', default=8, type=int, 44 | help='Number of Multi Head Attention (MHA)') 45 | parser.add_argument('--d_model', default=512, type=int, 46 | help='Dimension of model') 47 | parser.add_argument('--d_inner', default=2048, type=int, 48 | help='Dimension of inner') 49 
| parser.add_argument('--dropout', default=0.1, type=float, 50 | help='Dropout rate') 51 | # assigner 52 | parser.add_argument('--w_context', default=3, type=int, 53 | help='Positional Encoding max len') 54 | parser.add_argument('--d_assigner_hidden', default=512, type=int, 55 | help='Positional Encoding max len') 56 | parser.add_argument('--n_assigner_layers', default=3, type=int, 57 | help='Positional Encoding max len') 58 | # decoder 59 | parser.add_argument('--n_layers_dec', default=6, type=int, 60 | help='Number of decoder stacks') 61 | # Loss 62 | parser.add_argument('--label_smoothing', default=0.1, type=float, 63 | help='label smoothing') 64 | parser.add_argument('--lambda_qua', default=0.001, type=float, 65 | help='label lambda_qua') 66 | 67 | # Training config 68 | parser.add_argument('--epochs', default=30, type=int, 69 | help='Number of maximum epochs') 70 | # minibatch 71 | parser.add_argument('--shuffle', default=0, type=int, 72 | help='reshuffle the data at every epoch') 73 | parser.add_argument('--batch-size', default=32, type=int, 74 | help='Batch size') 75 | parser.add_argument('--batch_frames', default=0, type=int, 76 | help='Batch frames. If this is not 0, batch size will make no sense') 77 | parser.add_argument('--maxlen-in', default=800, type=int, metavar='ML', 78 | help='Batch size is reduced if the input sequence length > ML') 79 | parser.add_argument('--maxlen-out', default=150, type=int, metavar='ML', 80 | help='Batch size is reduced if the output sequence length > ML') 81 | parser.add_argument('--num-workers', default=4, type=int, 82 | help='Number of workers to generate minibatch') 83 | # optimizer 84 | parser.add_argument('--k', default=1.0, type=float, 85 | help='tunable scalar multiply to learning rate') 86 | parser.add_argument('--warmup_steps', default=4000, type=int, 87 | help='warmup steps') 88 | # save and load model 89 | parser.add_argument('--save-folder', default='exp/temp', 90 | help='Location to save epoch models') 91 | parser.add_argument('--checkpoint', dest='checkpoint', default=0, type=int, 92 | help='Enables checkpoint saving of model') 93 | parser.add_argument('--_continue', default='', type=int, 94 | help='Continue from checkpoint model') 95 | parser.add_argument('--model-path', default='last.model', 96 | help='Location to save best validation model') 97 | # logging 98 | parser.add_argument('--print-freq', default=10, type=int, 99 | help='Frequency of printing training infomation') 100 | parser.add_argument('--visdom', dest='visdom', type=int, default=0, 101 | help='Turn on visdom graphing') 102 | parser.add_argument('--visdom_lr', dest='visdom_lr', type=int, default=0, 103 | help='Turn on visdom graphing learning rate') 104 | parser.add_argument('--visdom_epoch', dest='visdom_epoch', type=int, default=0, 105 | help='Turn on visdom graphing each epoch') 106 | parser.add_argument('--visdom-id', default='Transformer training', 107 | help='Identifier for visdom run') 108 | 109 | 110 | def main(args): 111 | # Construct Solver 112 | # data 113 | token2idx, idx2token = load_vocab(args.vocab) 114 | args.vocab_size = len(token2idx) 115 | args.sos_id = token2idx[''] 116 | args.eos_id = token2idx[''] 117 | 118 | tr_dataset = AudioDataset(args.train_json, args.batch_size, 119 | args.maxlen_in, args.maxlen_out, 120 | batch_frames=args.batch_frames) 121 | cv_dataset = AudioDataset(args.valid_json, args.batch_size, 122 | args.maxlen_in, args.maxlen_out, 123 | batch_frames=args.batch_frames) 124 | tr_loader = AudioDataLoader(tr_dataset, batch_size=1, 125 | 
token2idx=token2idx, 126 | label_type=args.label_type, 127 | num_workers=args.num_workers, 128 | shuffle=args.shuffle, 129 | LFR_m=args.LFR_m, LFR_n=args.LFR_n) 130 | cv_loader = AudioDataLoader(cv_dataset, batch_size=1, 131 | token2idx=token2idx, 132 | label_type=args.label_type, 133 | num_workers=args.num_workers, 134 | LFR_m=args.LFR_m, LFR_n=args.LFR_n) 135 | # load dictionary and generate char_list, sos_id, eos_id 136 | data = {'tr_loader': tr_loader, 'cv_loader': cv_loader} 137 | 138 | if args.structure == 'transformer': 139 | from transformer.Transformer import Transformer 140 | from transformer.solver import Transformer_Solver as Solver 141 | 142 | model = Transformer.create_model(args) 143 | 144 | elif args.structure == 'transformer-ctc': 145 | from transformer.Transformer import CTC_Transformer as Transformer 146 | from transformer.solver import Transformer_CTC_Solver as Solver 147 | 148 | model = Transformer.create_model(args) 149 | 150 | elif args.structure == 'conv-transformer-ctc': 151 | from transformer.Transformer import Conv_CTC_Transformer as Transformer 152 | from transformer.solver import Transformer_CTC_Solver as Solver 153 | 154 | model = Transformer.create_model(args) 155 | 156 | elif args.structure == 'cif': 157 | from transformer.CIF_Model import CIF_Model 158 | from transformer.solver import CIF_Solver as Solver 159 | 160 | model = CIF_Model.create_model(args) 161 | 162 | print(model) 163 | model.cuda() 164 | 165 | # optimizer 166 | optimizier = TransformerOptimizer( 167 | torch.optim.Adam(model.parameters(), betas=(0.9, 0.98), eps=1e-09), 168 | args.k, 169 | args.d_model, 170 | args.warmup_steps) 171 | 172 | # solver 173 | solver = Solver(data, model, optimizier, args) 174 | solver.train() 175 | 176 | 177 | if __name__ == '__main__': 178 | args = parser.parse_args() 179 | print(args) 180 | main(args) 181 | -------------------------------------------------------------------------------- /src/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/eastonYi/end-to-end_asr_pytorch/7bedfb5ad10e363f5823b980ae0eb515b395839b/src/utils/__init__.py -------------------------------------------------------------------------------- /src/utils/average.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copyright 2020 Ye Bai by1993@qq.com 3 | Licensed under the Apache License, Version 2.0 (the "License"); 4 | you may not use this file except in compliance with the License. 5 | You may obtain a copy of the License at 6 | http://www.apache.org/licenses/LICENSE-2.0 7 | Unless required by applicable law or agreed to in writing, software 8 | distributed under the License is distributed on an "AS IS" BASIS, 9 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 10 | See the License for the specific language governing permissions and 11 | limitations under the License. 
12 | """ 13 | 14 | import os 15 | import argparse 16 | import logging 17 | import torch 18 | 19 | 20 | logging.basicConfig( 21 | level=logging.INFO, 22 | format='%(asctime)s - %(pathname)s[line:%(lineno)d] - %(levelname)s: %(message)s') 23 | 24 | def get_args(): 25 | parser = argparse.ArgumentParser(description=""" 26 | Usage: avg_last_ckpts.py """) 27 | parser.add_argument("expdir", help="The directory contains the checkpoints.") 28 | parser.add_argument("num", type=int, help="The number of models to average") 29 | args = parser.parse_args() 30 | return args 31 | 32 | 33 | if __name__ == "__main__": 34 | args = get_args() 35 | fnckpts = [t for t in os.listdir(args.expdir) if t.startswith("epoch") and t.endswith(".model")] 36 | fnckpts.sort() 37 | fnckpts.reverse() 38 | fnckpts = fnckpts[:args.num] 39 | logging.info("Average checkpoints:\n{}".format("\n".join(fnckpts))) 40 | pkg = torch.load(os.path.join(args.expdir, fnckpts[0]), map_location=lambda storage, loc: storage) 41 | for k in pkg["state_dict"].keys(): 42 | pkg["state_dict"][k] = torch.zeros_like(pkg["state_dict"][k]) 43 | 44 | for fn in fnckpts: 45 | pkg_tmp = torch.load(os.path.join(args.expdir, fn), map_location=lambda storage, loc: storage) 46 | logging.info("Loading {}".format(os.path.join(args.expdir, fn))) 47 | for k in pkg["state_dict"].keys(): 48 | pkg["state_dict"][k] += pkg_tmp["state_dict"][k]/len(fnckpts) 49 | 50 | fn_save = os.path.join(args.expdir, "avg-last{}.model".format(len(fnckpts))) 51 | logging.info("Save averaged model to {}.".format(fn_save)) 52 | torch.save(pkg, fn_save) 53 | -------------------------------------------------------------------------------- /src/utils/filt.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Apache 2.0 4 | 5 | import sys 6 | import argparse 7 | 8 | if __name__ == '__main__': 9 | parser = argparse.ArgumentParser() 10 | parser.add_argument('--exclude', '-v', dest='exclude', 11 | action='store_true', help='exclude filter words') 12 | parser.add_argument('filt', type=str, help='filter list') 13 | parser.add_argument('infile', type=str, help='input file') 14 | args = parser.parse_args() 15 | 16 | vocab = set() 17 | with open(args.filt) as vocabfile: 18 | for line in vocabfile: 19 | vocab.add(line.strip()) 20 | 21 | with open(args.infile) as textfile: 22 | for line in textfile: 23 | if args.exclude: 24 | print(" ".join( 25 | map(lambda word: word if not word in vocab else '', line.strip().split()))) 26 | # else: 27 | # print(" ".join(map(lambda word: word if word in vocab else '', unicode(line, 'utf_8').strip().split())).encode('utf_8')) 28 | -------------------------------------------------------------------------------- /src/utils/solver.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import torch 4 | 5 | 6 | class Solver(object): 7 | """ 8 | """ 9 | 10 | def __init__(self, data, model, optimizer, args): 11 | self.tr_loader = data['tr_loader'] 12 | self.cv_loader = data['cv_loader'] 13 | self.model = model 14 | self.optimizer = optimizer 15 | 16 | # Low frame rate feature 17 | self.LFR_m = args.LFR_m 18 | self.LFR_n = args.LFR_n 19 | 20 | # Training config 21 | self.epochs = args.epochs 22 | self.label_smoothing = args.label_smoothing 23 | # save and load model 24 | self.save_folder = args.save_folder 25 | self.checkpoint = args.checkpoint 26 | self._continue = args._continue 27 | self.model_path = args.model_path 28 | # logging 29 | 
self.print_freq = args.print_freq 30 | # visualizing loss using visdom 31 | self.tr_loss = torch.Tensor(self.epochs) 32 | self.cv_loss = torch.Tensor(self.epochs) 33 | self.visdom = args.visdom 34 | self.visdom_lr = args.visdom_lr 35 | self.visdom_epoch = args.visdom_epoch 36 | self.visdom_id = args.visdom_id 37 | if self.visdom: 38 | from visdom import Visdom 39 | self.vis = Visdom(env=self.visdom_id) 40 | self.vis_opts = dict(title=self.visdom_id, 41 | ylabel='Loss', xlabel='Epoch', 42 | legend=['train loss', 'cv loss']) 43 | self.vis_window = None 44 | self.vis_epochs = torch.arange(1, self.epochs + 1) 45 | self.optimizer.set_visdom(self.visdom_lr, self.vis) 46 | 47 | self._reset() 48 | 49 | def _reset(self): 50 | # Reset 51 | if self._continue: 52 | last_model_path = os.path.join(self.save_folder, 'last.model') 53 | print('Loading checkpoint model {}'.format(last_model_path)) 54 | last_model = torch.load(last_model_path) 55 | self.model.load_state_dict(last_model['state_dict']) 56 | self.optimizer.load_state_dict(last_model['optim_dict']) 57 | self.start_epoch = int(last_model.get('epoch', 1)) 58 | self.tr_loss[:self.start_epoch] = last_model['tr_loss'][:self.start_epoch] 59 | self.cv_loss[:self.start_epoch] = last_model['cv_loss'][:self.start_epoch] 60 | else: 61 | self.start_epoch = 0 62 | # Create save folder 63 | os.makedirs(self.save_folder, exist_ok=True) 64 | self.prev_val_loss = float("inf") 65 | self.best_val_loss = float("inf") 66 | self.halving = False 67 | 68 | def train(self): 69 | # Train model multi-epoches 70 | for epoch in range(self.start_epoch, self.epochs): 71 | # Train one epoch 72 | print("Training...") 73 | self.model.train() # Turn on BatchNorm & Dropout 74 | start = time.time() 75 | tr_avg_loss = self._run_one_epoch(epoch) 76 | 77 | print('-' * 85) 78 | print('Train Summary | End of Epoch {0} | Time {1:.2f}s | Train Loss {2:.3f}'. 
79 | format(epoch + 1, time.time() - start, tr_avg_loss)) 80 | print('-' * 85) 81 | 82 | # Save model each epoch 83 | file_path = os.path.join( 84 | self.save_folder, 'epoch-%d.model' % (epoch + 1)) 85 | torch.save(self.model.serialize(self.model, 86 | self.optimizer, epoch + 1, 87 | tr_loss=self.tr_loss, 88 | cv_loss=self.cv_loss), 89 | file_path) 90 | print('Saving checkpoint model to %s' % file_path) 91 | 92 | if epoch > 9: 93 | file_path = file_path.replace('epoch-%d.model' % (epoch + 1), 94 | 'epoch-%d.model' % (epoch - 10)) 95 | if os.path.isfile(file_path): 96 | os.remove(file_path) 97 | 98 | # Cross validation 99 | print('Cross validation...') 100 | self.model.eval() # Turn off Batchnorm & Dropout 101 | val_loss = self._run_one_epoch(epoch, cross_valid=True) 102 | print('-' * 85) 103 | print('Valid Summary | End of Epoch {0} | Time {1:.2f}s | ' 104 | 'Valid Loss {2:.3f}'.format( 105 | epoch + 1, time.time() - start, val_loss)) 106 | print('-' * 85) 107 | 108 | # Save the best model 109 | self.tr_loss[epoch] = tr_avg_loss 110 | self.cv_loss[epoch] = val_loss 111 | if val_loss < self.best_val_loss: 112 | self.best_val_loss = val_loss 113 | file_path = os.path.join(self.save_folder, self.model_path) 114 | torch.save(self.model.serialize(self.model, 115 | self.optimizer, epoch + 1, 116 | tr_loss=self.tr_loss, 117 | cv_loss=self.cv_loss), 118 | file_path) 119 | print("Find better validated model, saving to %s" % file_path) 120 | 121 | # visualizing loss using visdom 122 | if self.visdom: 123 | x_axis = self.vis_epochs[0:epoch + 1] 124 | y_axis = torch.stack( 125 | (self.tr_loss[0:epoch + 1], self.cv_loss[0:epoch + 1]), dim=1) 126 | if self.vis_window is None: 127 | self.vis_window = self.vis.line( 128 | X=x_axis, 129 | Y=y_axis, 130 | opts=self.vis_opts, 131 | ) 132 | else: 133 | self.vis.line( 134 | X=x_axis.unsqueeze(0).expand(y_axis.size( 135 | 1), x_axis.size(0)).transpose(0, 1), # Visdom fix 136 | Y=y_axis, 137 | win=self.vis_window, 138 | update='replace', 139 | ) 140 | 141 | def _run_one_epoch(self, epoch, cross_valid=False): 142 | raise NotImplementedError 143 | -------------------------------------------------------------------------------- /src/utils/utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | from collections import defaultdict 3 | 4 | 5 | def pad_list(xs, pad_value, max_len=None): 6 | # From: espnet/src/nets/e2e_asr_th.py: pad_list() 7 | n_batch = len(xs) 8 | lengths = torch.tensor([x.size(0) for x in xs]).long() 9 | max_len = lengths.max() if not max_len else max_len 10 | pad = xs[0].new(n_batch, max_len, * xs[0].size()[1:]).fill_(pad_value) 11 | for i in range(n_batch): 12 | pad[i, :xs[i].size(0)] = xs[i] 13 | 14 | return pad, lengths 15 | 16 | 17 | def process_dict(dict_path): 18 | with open(dict_path, 'rb') as f: 19 | dictionary = f.readlines() 20 | char_list = [entry.decode('utf-8').split(' ')[0] 21 | for entry in dictionary] 22 | sos_id = char_list.index('') 23 | eos_id = char_list.index('') 24 | 25 | return char_list, sos_id, eos_id 26 | 27 | 28 | def load_vocab(path, vocab_size=None): 29 | with open(path, encoding='utf8') as f: 30 | vocab = [line.strip().split()[0] for line in f] 31 | vocab = vocab[:vocab_size] if vocab_size else vocab 32 | id_unk = vocab.index('') 33 | token2idx = defaultdict(lambda: id_unk) 34 | idx2token = defaultdict(lambda: '') 35 | token2idx.update({token: idx for idx, token in enumerate(vocab)}) 36 | idx2token.update({idx: token for idx, token in enumerate(vocab)}) 
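    # The overrides below are cosmetic: special symbol tokens (e.g. the sos/eos,
    # unk and blank symbols) are remapped so that ids2str() renders them as empty
    # or whitespace strings in the final transcript; the vocabulary ids themselves
    # are unchanged.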
37 | idx2token[token2idx['']] = '' 38 | idx2token[token2idx['']] = '' 39 | idx2token[token2idx['']] = '' 40 | idx2token[token2idx['']] = '' 41 | idx2token[token2idx['']] = '' 42 | 43 | assert len(token2idx) == len(idx2token) 44 | 45 | return token2idx, idx2token 46 | 47 | 48 | if __name__ == "__main__": 49 | import sys 50 | path = sys.argv[1] 51 | char_list, sos_id, eos_id = process_dict(path) 52 | print(char_list, sos_id, eos_id) 53 | 54 | # * ------------------ recognition related ------------------ * 55 | def parse_ctc_hypothesis(hyp, char_list): 56 | """Function to parse hypothesis 57 | 58 | :param list hyp: recognition hypothesis 59 | :param list char_list: list of characters 60 | :return: recognition text strinig 61 | :return: recognition token strinig 62 | :return: recognition tokenid string 63 | """ 64 | tokenid_as_list = list(map(int, hyp)) 65 | token_as_list = [char_list[idx] for idx in tokenid_as_list] 66 | 67 | # convert to string 68 | tokenid = " ".join([str(idx) for idx in tokenid_as_list]) 69 | token = " ".join(token_as_list) 70 | text = "".join(token_as_list).replace('', ' ') 71 | 72 | return text, token, tokenid, 0.0 73 | 74 | 75 | def parse_hypothesis(hyp, char_list): 76 | """Function to parse hypothesis 77 | 78 | :param list hyp: recognition hypothesis 79 | :param list char_list: list of characters 80 | :return: recognition text strinig 81 | :return: recognition token strinig 82 | :return: recognition tokenid string 83 | """ 84 | # remove sos and get results 85 | tokenid_as_list = list(map(int, hyp['yseq'][1:])) 86 | token_as_list = [char_list[idx] for idx in tokenid_as_list] 87 | score = float(hyp['score']) 88 | 89 | # convert to string 90 | tokenid = " ".join([str(idx) for idx in tokenid_as_list]) 91 | token = " ".join(token_as_list) 92 | text = "".join(token_as_list).replace('', ' ') 93 | 94 | return text, token, tokenid, score 95 | 96 | 97 | def ids2str(hyps_ints, idx2token): 98 | list_res = [] 99 | hyp = '' 100 | for hyp_ints, length in zip(*hyps_ints): 101 | for idx in hyp_ints[:length]: 102 | hyp += idx2token[idx] 103 | list_res.append(hyp) 104 | 105 | return list_res 106 | 107 | 108 | # -- Transformer Related -- 109 | import torch 110 | 111 | 112 | def pad_to_batch(xs, pad_value): 113 | """ 114 | xs: nested list [[...], [...], ...] 115 | """ 116 | lens = [len(x) for x in xs] 117 | max_len = max(lens) 118 | 119 | for l, x in zip(lens, xs): 120 | x.extend([pad_value] * (max_len - l)) 121 | 122 | return torch.tensor(xs) 123 | 124 | 125 | def sequence_mask(lengths, maxlen=None, dtype=torch.float): 126 | if maxlen is None: 127 | maxlen = lengths.max() 128 | mask = torch.ones((len(lengths), maxlen), 129 | device=lengths.device, 130 | dtype=torch.uint8).cumsum(dim=1) <= lengths.unsqueeze(0).t() 131 | 132 | return mask.type(dtype) 133 | 134 | 135 | def get_subsequent_mask(seq): 136 | ''' For masking out the subsequent info. ''' 137 | 138 | sz_b, len_s = seq.size() 139 | subsequent_mask = torch.triu( 140 | torch.ones((len_s, len_s), device=seq.device, dtype=torch.uint8), diagonal=1) 141 | subsequent_mask = subsequent_mask.unsqueeze(0).expand(sz_b, -1, -1) # b x ls x ls 142 | 143 | return subsequent_mask 144 | 145 | 146 | def get_attn_key_pad_mask(seq_k, seq_q, pad_idx): 147 | ''' For masking out the padding part of key sequence. ''' 148 | 149 | # Expand to fit the shape of key query attention matrix. 
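    # The returned mask has shape (batch, len_q, len_k) and is 1/True wherever the
    # key position is padding (seq_k <= pad_idx), so those attention scores can be
    # masked out before the softmax.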
150 | len_q = seq_q.size(1) 151 | padding_mask = seq_k.le(pad_idx) 152 | padding_mask = padding_mask.unsqueeze(1).expand(-1, len_q, -1) # b x lq x lk 153 | 154 | return padding_mask 155 | 156 | 157 | def get_attn_pad_mask(input_lengths, expand_length): 158 | """mask position is set to 1""" 159 | # N x Ti x 1 160 | non_pad_mask = sequence_mask(input_lengths) 161 | # N x Ti, lt(1) like not operation 162 | pad_mask = non_pad_mask < 1.0 163 | attn_mask = pad_mask.unsqueeze(1).expand(-1, expand_length, -1) 164 | 165 | return attn_mask 166 | 167 | 168 | def spec_aug(padded_features, feature_lengths, config): 169 | # print('using spec_aug:', config) 170 | freq_mask_num, freq_mask_width, time_mask_num, time_mask_width = (int(i) for i in config.split('-')) 171 | freq_means = torch.mean(padded_features, dim=-1) 172 | time_means = (torch.sum(padded_features, dim=1) 173 | /feature_lengths[:, None].float()) # Note that features are padded with zeros. 174 | 175 | B, T, V = padded_features.shape 176 | # mask freq 177 | for _ in range(time_mask_num): 178 | fs = (freq_mask_width * torch.rand(size=[B], 179 | device=padded_features.device, requires_grad=False)).long() 180 | f0s = ((V-fs).float()*torch.rand(size=[B], 181 | device=padded_features.device, requires_grad=False)).long() 182 | for b in range(B): 183 | padded_features[b, :, f0s[b]:f0s[b]+fs[b]] = freq_means[b][:, None] 184 | 185 | # mask time 186 | for _ in range(time_mask_num): 187 | ts = (time_mask_width * torch.rand(size=[B], 188 | device=padded_features.device, requires_grad=False)).long() 189 | t0s = ((feature_lengths-ts).float()*torch.rand(size=[B], 190 | device=padded_features.device, requires_grad=False)).long() 191 | for b in range(B): 192 | padded_features[b, t0s[b]:t0s[b]+ts[b], :] = time_means[b][None, :] 193 | 194 | return padded_features, feature_lengths 195 | -------------------------------------------------------------------------------- /src/utils/wer.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Word(Character) Error Rate, also gives the alignment infomation. 4 | # Author: XiaoRui Wang 5 | 6 | import sys, re, os.path 7 | import argparse 8 | 9 | # Cost of alignment types 10 | SUB_COST = 3 11 | DEL_COST = 3 12 | INS_COST = 3 13 | 14 | CRT_ALIGN = 0 15 | SUB_ALIGN = 1 16 | DEL_ALIGN = 2 17 | INS_ALIGN = 3 18 | END_ALIGN = 4 19 | 20 | 21 | align_name = ['crt', 'sub', 'del', 'ins', 'end'] 22 | 23 | class entry: 24 | 'Alignment chart entry, contains cost and align-type' 25 | 26 | def __init__(self, cost = 0, align = CRT_ALIGN): 27 | self.cost = cost 28 | self.align = align 29 | 30 | def getidx(name): 31 | name = name.strip() 32 | index = os.path.basename(name) 33 | (index, ext) = os.path.splitext(index) 34 | # print index,ext 35 | return index 36 | 37 | 38 | def distance(ref, hyp): 39 | ref_len = len(ref) 40 | hyp_len = len(hyp) 41 | 42 | chart = [] 43 | for i in range(0, ref_len + 1): 44 | chart.append([]) 45 | for j in range(0, hyp_len + 1): 46 | chart[-1].append(entry(i * j, CRT_ALIGN)) 47 | 48 | # Initialize the top-most row in alignment chart, (all words inserted). 49 | for i in range(1, hyp_len + 1): 50 | chart[0][i].cost = chart[0][i - 1].cost + INS_COST; 51 | chart[0][i].align = INS_ALIGN 52 | # Initialize the left-most column in alignment chart, (all words deleted). 
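    # (Standard Levenshtein DP: chart[i][j] stores the cheapest cost of aligning
    # the first i reference tokens with the first j hypothesis tokens, plus a
    # back-pointer that is traced at the end to recover the alignment.)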
53 | for i in range(1, ref_len + 1): 54 | chart[i][0].cost = chart[i - 1][0].cost + DEL_COST 55 | chart[i][0].align = DEL_ALIGN 56 | 57 | # Fill in the rest of the chart 58 | for i in range(1, ref_len + 1): 59 | for j in range(1, hyp_len + 1): 60 | min_cost = 0 61 | min_align = CRT_ALIGN 62 | if hyp[j - 1] == ref[i - 1]: 63 | min_cost = chart[i - 1][j - 1].cost 64 | min_align = CRT_ALIGN 65 | else: 66 | min_cost = chart[i - 1][j - 1].cost + SUB_COST 67 | min_align = SUB_ALIGN 68 | 69 | del_cost = chart[i - 1][j].cost + DEL_COST 70 | if del_cost < min_cost: 71 | min_cost = del_cost 72 | min_align = DEL_ALIGN 73 | 74 | ins_cost = chart[i][j - 1].cost + INS_COST 75 | if ins_cost < min_cost: 76 | min_cost = ins_cost 77 | min_align = INS_ALIGN 78 | 79 | chart[i][j].cost = min_cost 80 | chart[i][j].align = min_align 81 | 82 | crt = sub = ins = det = 0 83 | i = ref_len 84 | j = hyp_len 85 | alignment = [] 86 | while i > 0 or j > 0: 87 | #if i < 0 or j < 0: 88 | #break; 89 | if chart[i][j].align == CRT_ALIGN: 90 | i -= 1 91 | j -= 1 92 | crt += 1 93 | alignment.append(CRT_ALIGN) 94 | elif chart[i][j].align == SUB_ALIGN: 95 | i -= 1 96 | j -= 1 97 | sub += 1 98 | alignment.append(SUB_ALIGN) 99 | elif chart[i][j].align == DEL_ALIGN: 100 | i -= 1 101 | det += 1 102 | alignment.append(DEL_ALIGN) 103 | elif chart[i][j].align == INS_ALIGN: 104 | j -= 1 105 | ins += 1 106 | alignment.append(INS_ALIGN) 107 | 108 | total_error = sub + det + ins 109 | 110 | alignment.reverse() 111 | return (total_error, crt, sub, det, ins, alignment) 112 | 113 | def read_sentences(filename, iscn=False): 114 | map = {} 115 | tmpdata = [x.split() for x in open(filename).readlines()] 116 | data = [] 117 | # deal with multiwords 118 | #print 1982 119 | for x in tmpdata: 120 | if x[0] == "0": 121 | del(x[0]) 122 | if len(x) == 0: 123 | continue 124 | if len(x) == 1: 125 | data.append(x[:]) 126 | continue 127 | #print x[0] 128 | s = ' '.join(x[1:]) 129 | s = s.replace('_', ' ') 130 | data.append(s.split()) 131 | data[-1].insert(0, x[0]) 132 | 133 | #print 1983 134 | for x in data: 135 | if len(x) == 0: 136 | continue 137 | 138 | index = getidx(x[0]) 139 | # index = re.sub(r'\.[^\.]*$', '', index) 140 | if index in map: 141 | sys.stderr.write('Duplicate index [%s] in file %s\n' 142 | % (index, filename)) 143 | continue 144 | sys.exit(-1) 145 | # print '\t', index 146 | if len(x) == 1: 147 | map[index] = [] 148 | else: 149 | tmp = x[1:] 150 | if iscn: 151 | #print tmp 152 | tmp = ' '.join(tmp) 153 | #tmp = tmp.encode('utf8') 154 | tmp = re.sub(r'\[[\d\.,]+\]', '', tmp) 155 | tmp = re.sub(r'<[\d\.s]+>\s*\.', '', tmp) 156 | tmp = re.sub(r'<\w+>\s+', '', tmp) 157 | tmp = re.sub(r'\s', '', tmp) 158 | else: 159 | tmp = [x.lower() for x in tmp] 160 | map[index] = tmp 161 | #print 1984 162 | return map 163 | 164 | def usage(): 165 | 'Print usage' 166 | print ('''Usage: 167 | -r, --ref reference file. 168 | -h, --hyp hyperthesis file. 169 | -c, --chinese CER for Chinese. 170 | -i, --index index file, only use senteces have these index. 171 | -s, --sentence print each sentence info. 172 | --help print usage. 
173 | ''') 174 | 175 | 176 | def get_wer(hypfile, reffile, iscn=False, idxfile=None, printsen=False): 177 | """ 178 | Calculate word/character error rate 179 | 180 | :param hypfile: hyperthesis file 181 | :param reffile: reference file 182 | :param iscn: CER for Chinese 183 | :param idxfile: index file, only use senteces have these index 184 | :param printsen: print each sentence info 185 | :return: 186 | 187 | notation: either or no blanks between Chinese characters has no effect 188 | """ 189 | 190 | if not (reffile and hypfile): 191 | usage() 192 | sys.exit(-1) 193 | 194 | ref = read_sentences(reffile, iscn) 195 | hyp = read_sentences(hypfile, iscn) 196 | 197 | total_ref_len = 0 198 | total_sub = 0 199 | total_del = 0 200 | total_ins = 0 201 | if idxfile: 202 | idx = [getidx(x) for x in open(idxfile).readlines()] 203 | tmp = {} 204 | for x in idx: 205 | if x in hyp: 206 | tmp[x] = hyp[x] 207 | else: 208 | sys.stderr.write('Warning, empty hyperthesis %s\n' % x) 209 | tmp[x] = [] 210 | hyp = tmp 211 | 212 | for x in hyp: 213 | #print x 214 | if x not in ref: 215 | continue 216 | sys.stderr.write('Error, no reference for %s\n' % x) 217 | sys.exit(-1) 218 | 219 | if len(ref[x]) == 0 or len(hyp[x]) == 0: 220 | continue 221 | 222 | if iscn: 223 | ref[x] = ref[x] 224 | hyp[x] = hyp[x] 225 | 226 | aligninfo = distance(ref[x], hyp[x]) 227 | total_ref_len += len(ref[x]) 228 | total_sub += aligninfo[2] 229 | total_del += aligninfo[3] 230 | total_ins += aligninfo[4] 231 | 232 | sen_error = aligninfo[2] + aligninfo[3] + aligninfo[4] 233 | sen_len = len(ref[x]) 234 | sen_wer = sen_error * 100.0 / sen_len 235 | 236 | # print each sentence's wer 237 | if printsen: 238 | print('%s sub %2d del %2d ins %2d ref %d wer %.2f' \ 239 | % (x, aligninfo[2], aligninfo[3], aligninfo[4], 240 | len(ref[x]), sen_wer)) 241 | 242 | total_error = total_sub + total_del + total_ins 243 | wer = total_error * 100.0 / total_ref_len 244 | 245 | print('ref len', total_ref_len) 246 | print('sub %.2f' % (total_sub * 100.0 / total_ref_len)) 247 | print('del %.2f' % (total_del * 100.0 / total_ref_len)) 248 | print('ins %.2f' % (total_ins * 100.0 / total_ref_len)) 249 | print('wer %.2f' % wer) 250 | sys.stdout.write('wer %.2f\n' % wer) 251 | 252 | return total_ref_len, total_sub, total_del, total_ins, wer 253 | 254 | 255 | if __name__ == '__main__': 256 | parser = argparse.ArgumentParser() 257 | parser.add_argument('--hyp', type=str, default=None) 258 | parser.add_argument('--ref', type=str, default=None) 259 | 260 | args = parser.parse_args() 261 | 262 | get_wer(args.hyp, args.ref, iscn=True, idxfile=None, printsen=True) 263 | -------------------------------------------------------------------------------- /test/beam_search_decode.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import time 3 | import torch 4 | import torch.nn.functional as F 5 | 6 | from utils import utils 7 | 8 | # pylint: disable=not-callable 9 | 10 | 11 | def encode_inputs(sentence, model, src_data, beam_size, device): 12 | inputs = src_data['field'].preprocess(sentence) 13 | inputs.append(src_data['field'].eos_token) 14 | inputs = [inputs] 15 | inputs = src_data['field'].process(inputs, device=device) 16 | with torch.no_grad(): 17 | src_mask = utils.create_pad_mask(inputs, src_data['pad_idx']) 18 | enc_output = model.encode(inputs, src_mask) 19 | enc_output = enc_output.repeat(beam_size, 1, 1) 20 | return enc_output, src_mask 21 | 22 | 23 | def update_targets(targets, best_indices, idx, vocab_size): 
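    # best_indices index a flattened (beam_size * vocab_size) score matrix:
    # integer-dividing by vocab_size recovers which beam each candidate extends,
    # and the remainder is the new token id written into column idx of the batch.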
24 | best_tensor_indices = torch.div(best_indices, vocab_size) 25 | best_token_indices = torch.fmod(best_indices, vocab_size) 26 | new_batch = torch.index_select(targets, 0, best_tensor_indices) 27 | new_batch[:, idx] = best_token_indices 28 | return new_batch 29 | 30 | 31 | def get_result_sentence(indices_history, trg_data, vocab_size): 32 | result = [] 33 | k = 0 34 | for best_indices in indices_history[::-1]: 35 | best_idx = best_indices[k] 36 | # TODO: get this vocab_size from target.pt? 37 | k = best_idx // vocab_size 38 | best_token_idx = best_idx % vocab_size 39 | best_token = trg_data['field'].vocab.itos[best_token_idx] 40 | result.append(best_token) 41 | return ' '.join(result[::-1]) 42 | 43 | 44 | def main(): 45 | parser = argparse.ArgumentParser() 46 | parser.add_argument('--data_dir', type=str, required=True) 47 | parser.add_argument('--model_dir', type=str, required=True) 48 | parser.add_argument('--max_length', type=int, default=100) 49 | parser.add_argument('--beam_size', type=int, default=4) 50 | parser.add_argument('--alpha', type=float, default=0.6) 51 | parser.add_argument('--no_cuda', action='store_true') 52 | parser.add_argument('--translate', action='store_true') 53 | args = parser.parse_args() 54 | 55 | beam_size = args.beam_size 56 | 57 | # Load fields. 58 | if args.translate: 59 | src_data = torch.load(args.data_dir + '/source.pt') 60 | trg_data = torch.load(args.data_dir + '/target.pt') 61 | 62 | # Load a saved model. 63 | device = torch.device('cpu' if args.no_cuda else 'cuda') 64 | model = utils.load_checkpoint(args.model_dir, device) 65 | 66 | pads = torch.tensor([trg_data['pad_idx']] * beam_size, device=device) 67 | pads = pads.unsqueeze(-1) 68 | 69 | # We'll find a target sequence by beam search. 70 | scores_history = [torch.zeros((beam_size,), dtype=torch.float, 71 | device=device)] 72 | indices_history = [] 73 | cache = {} 74 | 75 | eos_idx = trg_data['field'].vocab.stoi[trg_data['field'].eos_token] 76 | 77 | if args.translate: 78 | sentence = input('Source? ') 79 | 80 | # Encoding inputs. 81 | if args.translate: 82 | start_time = time.time() 83 | enc_output, src_mask = encode_inputs(sentence, model, src_data, 84 | beam_size, device) 85 | targets = pads 86 | start_idx = 0 87 | else: 88 | enc_output, src_mask = None, None 89 | sentence = input('Target? ').split() 90 | for idx, _ in enumerate(sentence): 91 | sentence[idx] = trg_data['field'].vocab.stoi[sentence[idx]] 92 | sentence.append(trg_data['pad_idx']) 93 | targets = torch.tensor([sentence], device=device) 94 | start_idx = targets.size(1) - 1 95 | start_time = time.time() 96 | 97 | with torch.no_grad(): 98 | for idx in range(start_idx, args.max_length): 99 | if idx > start_idx: 100 | targets = torch.cat((targets, pads), dim=1) 101 | t_self_mask = utils.create_trg_self_mask(targets.size()[1], 102 | device=targets.device) 103 | 104 | t_mask = utils.create_pad_mask(targets, trg_data['pad_idx']) 105 | pred = model.decode(targets, enc_output, src_mask, 106 | t_self_mask, t_mask, cache) 107 | pred = pred[:, idx].squeeze(1) 108 | vocab_size = pred.size(1) 109 | 110 | pred = F.log_softmax(pred, dim=1) 111 | if idx == start_idx: 112 | scores = pred[0] 113 | else: 114 | scores = scores_history[-1].unsqueeze(1) + pred 115 | length_penalty = pow(((5. + idx + 1.) 
/ 6.), args.alpha) 116 | scores = scores / length_penalty 117 | scores = scores.view(-1) 118 | 119 | best_scores, best_indices = scores.topk(beam_size, 0) 120 | scores_history.append(best_scores) 121 | indices_history.append(best_indices) 122 | 123 | # Stop searching when the best output of beam is EOS. 124 | if best_indices[0].item() % vocab_size == eos_idx: 125 | break 126 | 127 | targets = update_targets(targets, best_indices, idx, vocab_size) 128 | 129 | result = get_result_sentence(indices_history, trg_data, vocab_size) 130 | print("Result: {}".format(result)) 131 | 132 | print("Elapsed Time: {:.2f} sec".format(time.time() - start_time)) 133 | 134 | 135 | if __name__ == '__main__': 136 | main() 137 | -------------------------------------------------------------------------------- /test/learn_pytorch.py: -------------------------------------------------------------------------------- 1 | # If I'm not sure waht some function or class actually doing, I will write 2 | # snippet codes to confirm my unstanding. 3 | 4 | import torch 5 | import torch.nn.functional as F 6 | 7 | 8 | def learn_cross_entropy(): 9 | IGNORE_ID = -1 10 | torch.manual_seed(123) 11 | 12 | input = torch.randn(4, 5, requires_grad=True) # N x C 13 | target = torch.randint(5, (4,), dtype=torch.int64) # N 14 | target[-1] = IGNORE_ID 15 | print("input:\n", input) 16 | print("target:\n", target) 17 | 18 | # PART 1: confirm F.cross_entropy() == F.log_softmax() + F.nll_loss() 19 | ce = F.cross_entropy( 20 | input, target, ignore_index=IGNORE_ID, reduction='mean') 21 | print("### Using F.cross_entropy()") 22 | print("ce =", ce) 23 | ls = F.log_softmax(input, dim=1) 24 | nll = F.nll_loss(ls, target, ignore_index=IGNORE_ID, 25 | reduction='mean') 26 | print("### Using F.log_softmax() + F.nll_loss()") 27 | print("nll =", nll) 28 | print("### [CONFIRM] F.cross_entropy() == F.log_softmax() + F.nll_loss()\n") 29 | 30 | # PART 2: confirm log_softmax = log + softmax 31 | print("log_softmax():\n", ls) 32 | softmax = F.softmax(input, dim=1) 33 | log_softmax = torch.log(softmax) 34 | print("softmax():\n", softmax) 35 | print("log() + softmax():\n", log_softmax) 36 | print("### [CONFIRM] log_softmax() == log() + softmax()\n") 37 | 38 | # PART 3: confirm ignore_index works 39 | non_ignore_index = target[target != IGNORE_ID] 40 | print(non_ignore_index) 41 | print(log_softmax[target != IGNORE_ID]) 42 | loss_each_sample = torch.stack([log_softmax[i][idx] 43 | for i, idx in enumerate(non_ignore_index)], dim=0) 44 | print(loss_each_sample) 45 | print(-1 * torch.mean(loss_each_sample)) 46 | print("### [CONFIRM] ignore_index in F.cross_entropy() works\n") 47 | 48 | # PART 4: confirm cross_entropy()'s backward() works correctly when set ignore_index 49 | # nll = 1/N * -1 * sum(log(softmax(input, dim=1))[target]) 50 | # d_nll / d_input = 1/N * (softmax(input, dim=1) - target) 51 | print("softmax:\n", softmax) 52 | print("non ignore softmax:") 53 | print(softmax[:len(non_ignore_index)]) 54 | print(softmax[range(len(non_ignore_index)), non_ignore_index]) 55 | print("target\n", target) 56 | grad = softmax 57 | grad[range(len(non_ignore_index)), non_ignore_index] -= 1 58 | grad /= len(non_ignore_index) 59 | grad[-1] = 0.0 # IGNORE_ID postition 60 | print("my gradient:\n", grad) 61 | ce.backward() 62 | print("pytorch gradient:\n", input.grad) 63 | print("### [CONFIRM] F.cross_entropy()'s backward() works correctly when " 64 | "set ignore_index") 65 | 66 | 67 | if __name__ == "__main__": 68 | learn_cross_entropy() 69 | 
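# An illustrative sketch in the same spirit as the checks above (it reuses the
# torch / F imports at the top of this file): the --label_smoothing option in
# train.py is commonly implemented as cross entropy against a smoothed target
# distribution. This helper is only a sketch with made-up names, not the repo's
# actual loss (see src/transformer/loss.py for that).
def learn_label_smoothing(smoothing=0.1, ignore_id=-1):
    torch.manual_seed(123)
    logits = torch.randn(4, 5)                          # N x C
    target = torch.randint(5, (4,), dtype=torch.int64)  # N
    target[-1] = ignore_id

    n_class = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    valid = target != ignore_id

    # Smoothed one-hot targets: 1 - smoothing on the gold class,
    # smoothing / (C - 1) spread uniformly over the other classes.
    smooth_target = torch.full_like(log_probs, smoothing / (n_class - 1))
    gold = target.clone()
    gold[~valid] = 0  # placeholder index for ignored rows (excluded below)
    smooth_target.scatter_(1, gold.unsqueeze(1), 1.0 - smoothing)

    loss = -(smooth_target[valid] * log_probs[valid]).sum(dim=1).mean()
    print("label-smoothed CE:", loss)

    # With smoothing = 0.0 this reduces to F.cross_entropy with ignore_index.
    print("plain CE for comparison:",
          F.cross_entropy(logits, target, ignore_index=ignore_id))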
-------------------------------------------------------------------------------- /test/learn_visdom.py: -------------------------------------------------------------------------------- 1 | # INSTALL 2 | # $ pip install visdom 3 | # START 4 | # $ visdom 5 | # or 6 | # $ python -m visdom.server 7 | # open browser and visit http://localhost:8097 8 | 9 | import torch 10 | import visdom 11 | 12 | vis = visdom.Visdom(env="model_1") 13 | vis.text('Hello, world', win='text1') 14 | vis.text('Hi, Kaituo', win='text1', append=True) 15 | for i in range(10): 16 | vis.line(X=torch.FloatTensor([i]), Y=torch.FloatTensor([i**2]), win='loss', update='append' if i> 0 else None) 17 | 18 | 19 | epochs = 20 20 | loss_result = torch.Tensor(epochs) 21 | for i in range(epochs): 22 | loss_result[i] = i ** 2 23 | opts = dict(title='LAS', ylabel='loss', xlabel='epoch') 24 | x_axis = torch.arange(1, epochs+1) 25 | y_axis = loss_result[:epochs] 26 | vis2 = visdom.Visdom(env="view_loss") 27 | vis2.line(X=x_axis, Y=y_axis, opts=opts) 28 | 29 | 30 | while True: 31 | continue 32 | -------------------------------------------------------------------------------- /test/path.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | cd ../egs/aishell 4 | . ./path.sh 5 | cd - 6 | -------------------------------------------------------------------------------- /test/test_data.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | from data import AudioDataset 4 | from data import AudioDataLoader 5 | 6 | 7 | if __name__ == "__main__": 8 | train_json = "data/data.json" 9 | batch_size = 2 10 | max_length_in = 1000 11 | max_length_out = 1000 12 | num_batches = 10 13 | num_workers = 2 14 | batch_frames = 2000 15 | 16 | # test batch_frames 17 | train_dataset = AudioDataset( 18 | train_json, batch_size, max_length_in, max_length_out, num_batches, 19 | batch_frames=batch_frames) 20 | for i, minibatch in enumerate(train_dataset): 21 | print(i) 22 | print(minibatch) 23 | exit(0) 24 | 25 | # test 26 | train_dataset = AudioDataset( 27 | train_json, batch_size, max_length_in, max_length_out, num_batches) 28 | # NOTE: must set batch_size=1 here. 
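    # (AudioDataset already packs utterances into minibatches internally, so the
    # DataLoader should fetch exactly one pre-built minibatch per step; a larger
    # batch_size here would nest minibatches inside minibatches.)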
29 | train_loader = AudioDataLoader( 30 | train_dataset, batch_size=1, num_workers=num_workers, LFR_m=4, LFR_n=3) 31 | 32 | import torch 33 | #torch.set_printoptions(threshold=10000000) 34 | for i, (data) in enumerate(train_loader): 35 | inputs, inputs_lens, targets = data 36 | print(i) 37 | # print(inputs) 38 | print(inputs.size()) 39 | print(inputs_lens) 40 | # print(targets) 41 | print("*"*20) 42 | -------------------------------------------------------------------------------- /test/test_decode.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import torch 3 | 4 | from encoder import Encoder 5 | from decoder import Decoder 6 | from transformer import Transformer 7 | 8 | 9 | if __name__ == "__main__": 10 | D = 3 11 | beam_size = 5 12 | nbest = 5 13 | defaults = dict(beam_size=beam_size, 14 | nbest=nbest, 15 | decode_max_len=0, 16 | d_input = D, 17 | LFR_m = 1, 18 | n_layers_enc = 2, 19 | n_head = 2, 20 | d_k = 6, 21 | d_v = 6, 22 | d_model = 12, 23 | d_inner = 8, 24 | dropout=0.1, 25 | pe_maxlen=100, 26 | d_word_vec=12, 27 | n_layers_dec = 2, 28 | tgt_emb_prj_weight_sharing=1) 29 | args = argparse.Namespace(**defaults) 30 | char_list = ["a", "b", "c", "d", "e", "f", "g", "h", "", ""] 31 | sos_id, eos_id = 8, 9 32 | vocab_size = len(char_list) 33 | # model 34 | encoder = Encoder(args.d_input * args.LFR_m, args.n_layers_enc, args.n_head, 35 | args.d_k, args.d_v, args.d_model, args.d_inner, 36 | dropout=args.dropout, pe_maxlen=args.pe_maxlen) 37 | decoder = Decoder(sos_id, eos_id, vocab_size, 38 | args.d_word_vec, args.n_layers_dec, args.n_head, 39 | args.d_k, args.d_v, args.d_model, args.d_inner, 40 | dropout=args.dropout, 41 | tgt_emb_prj_weight_sharing=args.tgt_emb_prj_weight_sharing, 42 | pe_maxlen=args.pe_maxlen) 43 | model = Transformer(encoder, decoder) 44 | 45 | for i in range(3): 46 | print("\n***** Utt", i+1) 47 | Ti = i + 20 48 | input = torch.randn(Ti, D) 49 | length = torch.tensor([Ti], dtype=torch.int) 50 | nbest_hyps = model.recognize(input, length, char_list, args) 51 | 52 | file_path = "./temp.pth" 53 | optimizer = torch.optim.Adam(model.parameters()) 54 | torch.save(model.serialize(model, optimizer, 1, LFR_m=1, LFR_n=1), file_path) 55 | model, LFR_m, LFR_n = Transformer.load_model(file_path) 56 | print(model) 57 | 58 | import os 59 | os.remove(file_path) 60 | --------------------------------------------------------------------------------
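For orientation, here is a hedged sketch of calling the scoring utility above from Python rather than from the command line. The import path assumes `src` is on `PYTHONPATH` (as arranged by `path.sh`), and the hypothesis/reference paths are purely illustrative.

```python
from utils.wer import get_wer

# Both files contain one "utt-id transcript" line per utterance (hypothetical paths).
ref_len, n_sub, n_del, n_ins, cer = get_wer(
    hypfile='exp/temp/decode/hyp.txt',
    reffile='data/test/text',
    iscn=True,        # character-level scoring for Chinese
    printsen=True)    # also print per-sentence statistics
print('CER: %.2f%%' % cer)
```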