├── Aligner.java
├── README.md
└── align.sh

/Aligner.java:
--------------------------------------------------------------------------------
import java.io.File;
import java.io.FileReader;
import java.io.BufferedReader;
import java.io.OutputStreamWriter;
import java.util.List;

import edu.cmu.sphinx.api.SpeechAligner;
import edu.cmu.sphinx.util.TimeFrame;
import edu.cmu.sphinx.result.WordResult;
import edu.cmu.sphinx.linguist.acoustic.Unit;

import com.opencsv.CSVWriter;

/**
 * This is a simple tool to align audio to text and dump a database
 * for the training/evaluation.
 *
 * You need to provide a model, dictionary, audio and the text to align.
 */
public class Aligner {

    /**
     * @param args acoustic model, dictionary, audio file, transcript file
     */
    public static void main(String args[]) throws Exception {
        // audio and transcript file paths
        File file = new File(args[2]);
        File transcript_file = new File(args[3]);

        // read transcript from file
        BufferedReader transcript_file_reader = new BufferedReader(new FileReader(transcript_file));
        StringBuilder transcript = new StringBuilder();
        String line = null;
        while ((line = transcript_file_reader.readLine()) != null)
            transcript.append(line).append("\n");
        transcript_file_reader.close();

        // perform alignment
        SpeechAligner aligner = new SpeechAligner(args[0], args[1], null);
        List<WordResult> results = aligner.align(file.toURI().toURL(), transcript.toString());

        // write out results, one CSV row per aligned word
        CSVWriter writer = new CSVWriter(new OutputStreamWriter(System.out), ',');
        for (WordResult result : results) {
            // join the phonemes of the word's pronunciation with spaces
            StringBuilder pronunciation = new StringBuilder();
            for (Unit unit : result.getPronunciation().getUnits())
                pronunciation.append(unit).append(' ');

            writer.writeNext(new String[] {
                result.getWord().toString(),
                pronunciation.toString().trim(),
                Boolean.toString(result.isFiller()),
                Double.toString(result.getScore()),
                Long.toString(result.getTimeFrame().getStart()),
                Long.toString(result.getTimeFrame().getEnd()),
            });
        }
        writer.close();
    }
}

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
cmusphinx forced alignment example
==================================

So... building [cmusphinx](http://cmusphinx.sourceforge.net/wiki/download) isn't exactly easy and running its forced alignment tool isn't well documented. (Forced alignment is matching a transcript with corresponding audio and getting time codes for each word in the transcript.)

This project documents how I got cmusphinx (the [latest development version](https://github.com/cmusphinx/sphinx4/commit/3bfd6f2f58e464280a2e6a71500e7d54abf384b4)) working on an Ubuntu 14.04 x64 machine. Version 4-5prealpha had some sort of bug in the aligner. The latest development version has a broken build, which I fixed by reverting [this commit](https://github.com/cmusphinx/sphinx4/commit/aa4e1838f06eb032fd248601469b16ac95aeb08a) manually.

Build
-----

Build Sphinx4, which is a Java library:

    sudo apt-get update
    sudo apt-get install git default-jdk maven

    git clone https://github.com/cmusphinx/sphinx4
    cd sphinx4/sphinx4-core
    git revert aa4e1838f06eb032fd248601469b16ac95aeb08a
    mvn clean install
    cd ../..

Fetch
-----

You'll need an acoustic model. Here I'm using English:

    wget http://downloads.sourceforge.net/project/cmusphinx/Acoustic%20and%20Language%20Models/US%20English%20Generic%20Acoustic%20Model/cmusphinx-en-us-5.2.tar.gz
    tar -zxf cmusphinx-en-us-5.2.tar.gz

Test
----

To test that this all worked so far, try out forced alignment with a sample 16 kHz, 16-bit mono wav file (it must be in that format). First get the file and its transcription:

    wget -O sample_original.wav http://hawksoft.com/hawkvoice/samples/ulaw.wav
    sox sample_original.wav -b 16 sample.wav channels 1 rate 16k
    echo "It's a dense crowd in two distinct ways. The fruit of a figg tree is apple shaped." > sample.txt

Then run cmusphinx's aligner program:

    java -cp sphinx4/sphinx4-core/target/sphinx4-core-1.0-SNAPSHOT.jar edu.cmu.sphinx.tools.aligner.Aligner cmusphinx-en-us-5.2/ sphinx4/sphinx4-data/src/main/resources/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict sample.wav "$(cat sample.txt)"

It's going to write out a whole bunch of new wav files (in this example just `sample-0000.wav`), so be on the lookout for those generated files.

Better Driver
-------------

My Aligner.java driver class simplifies things. It takes a filename for the transcript on the command line rather than the transcript text directly, and it outputs the alignment in CSV format. Get the opencsv library and then compile my driver class:

    wget http://downloads.sourceforge.net/project/opencsv/opencsv/3.3/opencsv-3.3.jar
    javac -cp sphinx4/sphinx4-core/target/sphinx4-core-1.0-SNAPSHOT.jar:opencsv-3.3.jar Aligner.java

And run the same example with my driver:

    java -cp .:sphinx4/sphinx4-core/target/sphinx4-core-1.0-SNAPSHOT.jar:opencsv-3.3.jar Aligner cmusphinx-en-us-5.2/ sphinx4/sphinx4-data/src/main/resources/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict sample.wav sample.txt 2>/dev/null

Or just:

    ./align.sh sample.wav sample.txt 2>/dev/null

You'll get:

    "it's","IH T S","false","0.0","170","200"
    "a","AH","false","-5540774.0","200","390"
    "crowd","K R AW D","false","-1.13934288E8","850","1300"
    "in","IH N","false","-1.95127088E8","1300","1470"
    "two","T UW","false","-2.23176048E8","1470","1700"
    "distinct","D IH S T IH NG K T","false","-2.6345264E8","1700","2230"
    "ways","W EY Z","false","-3.58427808E8","2230","2730"
    "the","DH AH","false","-4.72551168E8","2920","3100"
    "fruit","F R UW T","false","-5.24233504E8","3220","3530"
    "of","AH V","false","-5.79971456E8","3530","3640"
    "a","AH","false","-5.99515456E8","3640","3760"
    "figg","F IH G","false","-6.2017152E8","3760","4060"
    "tree","T R IY","false","-6.72126656E8","4060","4490"
    "is","IH Z","false","-7.4763744E8","4490","4570"
    "apple","AE P AH L","false","-7.73581184E8","4630","5040"
    "shaped","SH EY P T","false","-8.44424704E8","5040","5340"

My driver writes a CSV table to standard output (you can redirect it to a file if needed), with columns:

* the word (as it appeared in the transcript)
* the phonemic pronunciation of the word (from the dictionary)
* whether this word was a filler (automatically inserted?)
* the confidence of this word's alignment (not sure if higher is better...)
* the start time of the word, in milliseconds
* the end time of the word, in milliseconds

Etcetera
--------

sphinxbase, the native C portion of cmusphinx, isn't required for this. I didn't know that ahead of time (thanks cmusphinx!), so I'm pasting build instructions here but you *don't need this*:

    sudo apt-get install bison swig python-dev
    wget http://downloads.sourceforge.net/project/cmusphinx/sphinxbase/5prealpha/sphinxbase-5prealpha.tar.gz
    tar -zxf sphinxbase-5prealpha.tar.gz
    cd sphinxbase-5prealpha
    ./configure
    make
    sudo make install
    cd ..

--------------------------------------------------------------------------------
/align.sh:
--------------------------------------------------------------------------------
#!/bin/sh
# Usage: ./align.sh audio.wav transcript.txt

HERE=$(dirname "$0")
SPHINX=$HERE/sphinx4
JAR=$SPHINX/sphinx4-core/target/sphinx4-core-1.0-SNAPSHOT.jar
DICTIONARY=$SPHINX/sphinx4-data/src/main/resources/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict
MODEL=$HERE/cmusphinx-en-us-5.2

if [ -z "$1" ]; then
  echo "Usage: ./align.sh audio.wav transcript.txt"
  exit 1
fi

# Go. Quote every path so directories containing spaces still work.
java -cp "$HERE:$JAR:$HERE/opencsv-3.3.jar" \
  Aligner \
  "$MODEL" \
  "$DICTIONARY" \
  "$@"

--------------------------------------------------------------------------------
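A note on consuming the CSV rows that the README describes: the six columns can be read back into typed values with a few lines of Java. This is only a minimal sketch. It assumes every field is double-quoted and contains no embedded quotes or commas, which holds for the sample output in the README but is not guaranteed in general. `AlignmentRows` and `Row` are hypothetical names used for illustration and are not part of this repository.

```java
public class AlignmentRows {

    /** One row of the aligner's CSV output (hypothetical helper type). */
    public static final class Row {
        public final String word;
        public final String pronunciation;
        public final boolean filler;
        public final double score;
        public final long startMs;
        public final long endMs;

        Row(String[] fields) {
            word = fields[0];
            pronunciation = fields[1];
            filler = Boolean.parseBoolean(fields[2]);
            score = Double.parseDouble(fields[3]);
            startMs = Long.parseLong(fields[4]);
            endMs = Long.parseLong(fields[5]);
        }

        /** Length of the aligned word in milliseconds. */
        public long durationMs() {
            return endMs - startMs;
        }
    }

    /**
     * Parses one fully quoted CSV line.
     * Assumption: no field contains a double quote or the sequence ",".
     */
    public static Row parse(String line) {
        String trimmed = line.trim();
        if (trimmed.length() < 2 || !trimmed.startsWith("\"") || !trimmed.endsWith("\"")) {
            throw new IllegalArgumentException("expected a fully quoted CSV row: " + line);
        }
        // Drop the outer quotes, then split on the quote-comma-quote separators.
        String[] fields = trimmed.substring(1, trimmed.length() - 1).split("\",\"", -1);
        if (fields.length != 6) {
            throw new IllegalArgumentException("expected 6 columns, got " + fields.length);
        }
        return new Row(fields);
    }

    public static void main(String[] args) {
        Row row = parse("\"crowd\",\"K R AW D\",\"false\",\"-1.13934288E8\",\"850\",\"1300\"");
        System.out.println(row.word + " lasts " + row.durationMs() + " ms"); // prints "crowd lasts 450 ms"
    }
}
```

For anything beyond quick inspection, a real CSV reader (opencsv, which the driver already depends on, ships one) would be the safer choice, since it handles escaped quotes.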