├── Aligner.java
├── README.md
└── align.sh


/Aligner.java:
--------------------------------------------------------------------------------
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.util.List;

import edu.cmu.sphinx.api.SpeechAligner;
import edu.cmu.sphinx.linguist.acoustic.Unit;
import edu.cmu.sphinx.result.WordResult;

import com.opencsv.CSVWriter;

/**
 * A simple tool that aligns audio to text and dumps the result as CSV,
 * e.g. to build a database for training/evaluation.
 *
 * You need to provide a model, a dictionary, the audio and the text to align.
 */
public class Aligner {

    /**
     * @param args acoustic model, dictionary, audio file, transcript file
     */
    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("Usage: java Aligner <acoustic model> <dictionary> <audio.wav> <transcript.txt>");
            System.exit(1);
        }

        // audio and transcript file paths
        File audioFile = new File(args[2]);
        File transcriptFile = new File(args[3]);

        // read the whole transcript; try-with-resources closes the reader
        StringBuilder transcript = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new FileReader(transcriptFile))) {
            String line;
            while ((line = reader.readLine()) != null)
                transcript.append(line).append("\n");
        }

        // perform alignment
        SpeechAligner aligner = new SpeechAligner(args[0], args[1], null);
        List<WordResult> results = aligner.align(audioFile.toURI().toURL(), transcript.toString());

        // write one CSV row per aligned word to standard output
        try (CSVWriter writer = new CSVWriter(new OutputStreamWriter(System.out), ',')) {
            for (WordResult result : results) {
                // join the phonetic units with single spaces
                StringBuilder pronunciation = new StringBuilder();
                for (Unit unit : result.getPronunciation().getUnits())
                    pronunciation.append(unit).append(' ');

                writer.writeNext(new String[] {
                    result.getWord().toString(),
                    pronunciation.toString().trim(),
                    Boolean.toString(result.isFiller()),
                    Double.toString(result.getScore()),
                    Long.toString(result.getTimeFrame().getStart()),
                    Long.toString(result.getTimeFrame().getEnd()),
                });
            }
        }
    }
}


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
cmusphinx forced alignment example
==================================

So... building [cmusphinx](http://cmusphinx.sourceforge.net/wiki/download) isn't exactly easy and running its forced alignment tool isn't well documented. (Forced alignment is matching a transcript with corresponding audio and getting time codes for each word in the transcript.)

This project documents how I got cmusphinx (the [latest development version](https://github.com/cmusphinx/sphinx4/commit/3bfd6f2f58e464280a2e6a71500e7d54abf384b4)) working on an Ubuntu 14.04 x64 machine. Version 4-5prealpha had some sort of bug in the aligner. The latest development version has a broken build, which I fixed by reverting [this commit](https://github.com/cmusphinx/sphinx4/commit/aa4e1838f06eb032fd248601469b16ac95aeb08a).

Build
-----

Build Sphinx4, which is a Java library:

	sudo apt-get update
	sudo apt-get install git default-jdk maven

	git clone https://github.com/cmusphinx/sphinx4
	cd sphinx4/sphinx4-core
	git revert aa4e1838f06eb032fd248601469b16ac95aeb08a
	mvn clean install
	cd ../..

Fetch
-----

You'll need an acoustic model. Here I'm using English:

	wget http://downloads.sourceforge.net/project/cmusphinx/Acoustic%20and%20Language%20Models/US%20English%20Generic%20Acoustic%20Model/cmusphinx-en-us-5.2.tar.gz
	tar -zxf cmusphinx-en-us-5.2.tar.gz

Test
----

To test that this all worked so far, try out forced alignment with a sample 16 kHz, 16-bit, mono wav file (it must be in that format). First get the file and its transcription; the conversion step uses sox (`sudo apt-get install sox`):

	wget -O sample_original.wav http://hawksoft.com/hawkvoice/samples/ulaw.wav
	sox sample_original.wav -b 16 sample.wav channels 1 rate 16k
	echo "It's a dense crowd in two distinct ways. The fruit of a figg tree is apple shaped." > sample.txt

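Because the aligner needs exactly that format, it can be worth checking a file before feeding it in. Below is a rough sketch using the JDK's `javax.sound.sampled` API. The class name `WavFormatCheck` and the helper `isAlignerReady` are made up for this example, and the signed-PCM check is an assumption about what "16-bit wav" means here, not something cmusphinx documents in this README. Compile it with `javac WavFormatCheck.java` and run `java WavFormatCheck sample.wav`.

```java
import java.io.File;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;

// Hypothetical helper, not part of sphinx4 or this repository.
final class WavFormatCheck {

    /** True when the format is 16 kHz, 16-bit, mono (and, by assumption, signed PCM). */
    static boolean isAlignerReady(AudioFormat format) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(format.getEncoding())
            && format.getSampleRate() == 16000f
            && format.getSampleSizeInBits() == 16
            && format.getChannels() == 1;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java WavFormatCheck file.wav");
            System.exit(1);
        }
        // Read only the file header; no audio data is decoded here.
        AudioFormat format = AudioSystem.getAudioFileFormat(new File(args[0])).getFormat();
        System.out.println(format);
        System.out.println(isAlignerReady(format)
            ? "looks fine for the aligner"
            : "convert it with sox first");
    }
}
```
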
Then run cmusphinx's aligner program:

	java -cp sphinx4/sphinx4-core/target/sphinx4-core-1.0-SNAPSHOT.jar edu.cmu.sphinx.tools.aligner.Aligner cmusphinx-en-us-5.2/ sphinx4/sphinx4-data/src/main/resources/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict sample.wav "$(cat sample.txt)"

It's going to write out a whole bunch of new wav files (in this example just `sample-0000.wav`), so be on the lookout for those generated files.

Better Driver
-------------

My Aligner.java driver class simplifies things. It takes a filename for the transcript on the command line rather than the transcript text directly, and it outputs the alignment in CSV format. Get the opencsv library and then compile my driver class:

	wget http://downloads.sourceforge.net/project/opencsv/opencsv/3.3/opencsv-3.3.jar
	javac -cp sphinx4/sphinx4-core/target/sphinx4-core-1.0-SNAPSHOT.jar:opencsv-3.3.jar Aligner.java

And run the same example with my driver:

	java -cp .:sphinx4/sphinx4-core/target/sphinx4-core-1.0-SNAPSHOT.jar:opencsv-3.3.jar Aligner cmusphinx-en-us-5.2/ sphinx4/sphinx4-data/src/main/resources/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict sample.wav sample.txt 2>/dev/null

Or just:

	./align.sh sample.wav sample.txt 2>/dev/null

You'll get:

	"it's","IH T S","false","0.0","170","200"
	"a","AH","false","-5540774.0","200","390"
	"crowd","K R AW D","false","-1.13934288E8","850","1300"
	"in","IH N","false","-1.95127088E8","1300","1470"
	"two","T UW","false","-2.23176048E8","1470","1700"
	"distinct","D IH S T IH NG K T","false","-2.6345264E8","1700","2230"
	"ways","W EY Z","false","-3.58427808E8","2230","2730"
	"the","DH AH","false","-4.72551168E8","2920","3100"
	"fruit","F R UW T","false","-5.24233504E8","3220","3530"
	"of","AH V","false","-5.79971456E8","3530","3640"
	"a","AH","false","-5.99515456E8","3640","3760"
	"figg","F IH G","false","-6.2017152E8","3760","4060"
	"tree","T R IY","false","-6.72126656E8","4060","4490"
	"is","IH Z","false","-7.4763744E8","4490","4570"
	"apple","AE P AH L","false","-7.73581184E8","4630","5040"
	"shaped","SH EY P T","false","-8.44424704E8","5040","5340"

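Note that the sample transcript contains "dense" but the output above has no row for it. A rough sketch for spotting transcript words that did not get a row follows. It assumes the first field of each row is the lowercased transcript word and that rows come out in transcript order, which matches the sample output but is not a documented guarantee. The class `MissingWords` and its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of sphinx4 or this repository.
final class MissingWords {

    /** Lowercase the transcript and split it into words, keeping apostrophes. */
    static List<String> tokenize(String transcript) {
        List<String> words = new ArrayList<String>();
        for (String token : transcript.toLowerCase().split("[^a-z']+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    /** Transcript words, in order, that have no counterpart in the aligned sequence. */
    static List<String> missing(List<String> transcriptWords, List<String> alignedWords) {
        List<String> skipped = new ArrayList<String>();
        int next = 0; // index of the next aligned word we expect to see
        for (String word : transcriptWords) {
            if (next < alignedWords.size() && alignedWords.get(next).equals(word)) {
                next++;
            } else {
                skipped.add(word);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // First sentence of the sample transcript and the matching rows' first fields.
        List<String> transcript = tokenize("It's a dense crowd in two distinct ways.");
        List<String> aligned = Arrays.asList("it's", "a", "crowd", "in", "two", "distinct", "ways");
        System.out.println(missing(transcript, aligned)); // prints [dense]
    }
}
```
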
My driver writes a CSV table to standard output (redirect it to a file if needed) with these columns:

* the word (from the transcript, lowercased)
* the phonemic pronunciation of the word (from the dictionary)
* whether this word was a filler (automatically inserted?)
* the confidence of this word's alignment (not sure if higher is better; in the sample output it only ever decreases from one word to the next, which looks more like a running path score than a per-word confidence)
* the start time of the word, in milliseconds
* the end time of the word, in milliseconds

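
To make the column layout concrete, here is a minimal sketch that reads one row of that output back into typed fields and derives the word's duration. It assumes every field is double-quoted and that no field contains the sequence `","`, which holds for the sample output; for anything less tidy, opencsv's own reader is the safer choice. The class `AlignmentRows` and its nested `Row` type are made up for this example.

```java
// Hypothetical helper, not part of sphinx4 or this repository.
final class AlignmentRows {

    /** One row of the driver's CSV output. */
    static final class Row {
        final String word;
        final String pronunciation;
        final boolean filler;
        final double score;
        final long startMs;
        final long endMs;

        Row(String word, String pronunciation, boolean filler,
            double score, long startMs, long endMs) {
            this.word = word;
            this.pronunciation = pronunciation;
            this.filler = filler;
            this.score = score;
            this.startMs = startMs;
            this.endMs = endMs;
        }

        /** Length of the word in milliseconds. */
        long durationMs() {
            return endMs - startMs;
        }
    }

    /** Parse a fully quoted line such as "a","AH","false","-5540774.0","200","390". */
    static Row parse(String line) {
        String trimmed = line.trim();
        if (trimmed.length() < 2 || !trimmed.startsWith("\"") || !trimmed.endsWith("\"")) {
            throw new IllegalArgumentException("not a fully quoted CSV row: " + line);
        }
        // Drop the outer quotes, then split on the quote-comma-quote separators.
        String inner = trimmed.substring(1, trimmed.length() - 1);
        String[] f = inner.split("\",\"", -1);
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 fields, got " + f.length);
        }
        return new Row(f[0], f[1], Boolean.parseBoolean(f[2]),
            Double.parseDouble(f[3]), Long.parseLong(f[4]), Long.parseLong(f[5]));
    }

    public static void main(String[] args) {
        Row row = parse("\"crowd\",\"K R AW D\",\"false\",\"-1.13934288E8\",\"850\",\"1300\"");
        System.out.println(row.word + " lasts " + row.durationMs() + " ms"); // crowd lasts 450 ms
    }
}
```
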
Etcetera
--------

sphinxbase, the native C portion of cmusphinx, isn't required for this. I didn't know that ahead of time (thanks cmusphinx!), so I'm pasting build instructions here but you *don't need this*:

	sudo apt-get install bison swig python-dev
	wget http://downloads.sourceforge.net/project/cmusphinx/sphinxbase/5prealpha/sphinxbase-5prealpha.tar.gz
	tar -zxf sphinxbase-5prealpha.tar.gz
	cd sphinxbase-5prealpha
	./configure
	make
	sudo make install
	cd ..


--------------------------------------------------------------------------------
/align.sh:
--------------------------------------------------------------------------------
#!/bin/sh
# Usage: ./align.sh audio.wav transcript.txt

# Resolve everything relative to this script's directory.
HERE=$(dirname "$0")
SPHINX="$HERE/sphinx4"
JAR="$SPHINX/sphinx4-core/target/sphinx4-core-1.0-SNAPSHOT.jar"
DICTIONARY="$SPHINX/sphinx4-data/src/main/resources/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict"
MODEL="$HERE/cmusphinx-en-us-5.2"

# Both the audio file and the transcript file are required.
if [ -z "$1" ] || [ -z "$2" ]; then
	echo "Usage: ./align.sh sample.wav sample.txt" >&2
	exit 1
fi

# Go. Quoting keeps paths with spaces intact.
java -cp "$HERE:$JAR:$HERE/opencsv-3.3.jar" \
	Aligner \
	"$MODEL" \
	"$DICTIONARY" \
	"$@"


--------------------------------------------------------------------------------