├── recordings
    ├── 0001.wav
    ├── 0002.wav
    ├── 0003.wav
    ├── 0004.wav
    ├── 0005.wav
    ├── 0006.wav
    ├── 0007.wav
    ├── 0008.wav
    ├── 0009.wav
    └── 0010.wav
├── recordings.txt
├── output
    └── hypotheses.txt
├── README.md
└── sttClient.py


/recordings/0001.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0001.wav


--------------------------------------------------------------------------------
/recordings/0002.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0002.wav


--------------------------------------------------------------------------------
/recordings/0003.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0003.wav


--------------------------------------------------------------------------------
/recordings/0004.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0004.wav


--------------------------------------------------------------------------------
/recordings/0005.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0005.wav


--------------------------------------------------------------------------------
/recordings/0006.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0006.wav


--------------------------------------------------------------------------------
/recordings/0007.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0007.wav


--------------------------------------------------------------------------------
/recordings/0008.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0008.wav


--------------------------------------------------------------------------------
/recordings/0009.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0009.wav


--------------------------------------------------------------------------------
/recordings/0010.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0010.wav


--------------------------------------------------------------------------------
/recordings.txt:
--------------------------------------------------------------------------------
 1 | ./recordings/0001.wav
 2 | ./recordings/0002.wav
 3 | ./recordings/0003.wav
 4 | ./recordings/0004.wav
 5 | ./recordings/0005.wav
 6 | ./recordings/0006.wav
 7 | ./recordings/0007.wav
 8 | ./recordings/0008.wav
 9 | ./recordings/0009.wav
10 | ./recordings/0010.wav
11 | 
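The list above is what sttClient.py reads to fill its work queue: every line becomes an (index, filename) pair, with indices starting at 0. The following is a minimal Python 3 sketch of that step, not the script's own code; `build_queue` is a hypothetical helper name, and skipping blank lines is an assumption added for robustness.

```python
import queue


def build_queue(lines):
    """Turn the lines of a recordings list into a queue of (index, filename) pairs."""
    q = queue.Queue()
    index = 0
    for line in lines:
        name = line.strip()
        if not name:              # assumption: ignore blank lines
            continue
        q.put((index, name))
        index += 1
    return q


if __name__ == "__main__":
    q = build_queue(["./recordings/0001.wav\n", "./recordings/0002.wav\n", "\n"])
    print(q.qsize())              # number of utterances waiting to be recognized
```

Each recognition session then takes one pair off the queue, so the number of simultaneous sessions never needs to exceed the queue size.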


--------------------------------------------------------------------------------
/output/hypotheses.txt:
--------------------------------------------------------------------------------
 1 | 1: several tornadoes touch down as a line of severe thunderstorms swept through colorado on sunday 
 2 | 2: with one of the twisters hitting near a junior golf tournament 
 3 | 3: m. m. during a caddy 
 4 | 4: six of the tornado struck in northeast colorado 
 5 | 5: well to others hit in park county 
 6 | 6: in the center of the state 
 7 | 7: the national weather service said 
 8 | 8: at least three of them caused the damage 
 9 | 9: aurora fire department officials said a twister touched down near the blackstone country club 
10 | 10: causing one minor injury and flipping an empty trailer 
11 | 
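Each line of hypotheses.txt has the form `<counter>: <hypothesis>`, with counters starting at 1. A short Python 3 sketch for reading the file back into a dictionary could look like this; `parse_hypotheses` is a hypothetical helper that is not part of the repository.

```python
def parse_hypotheses(lines):
    """Map each 1-based counter to its hypothesis text."""
    result = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():      # skip blank lines
            continue
        # split on the first ": " only, since the transcript itself may contain colons
        number, _, text = line.partition(": ")
        result[int(number)] = text.strip()
    return result


if __name__ == "__main__":
    sample = ["6: in the center of the state \n", "7: the national weather service said \n"]
    for number, text in sorted(parse_hypotheses(sample).items()):
        print(number, text)
```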


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | 
 2 | 
 3 | ## Synopsis
 4 | 
 5 | This project consists of a Python client that interacts with the IBM Watson Speech To Text service through its WebSockets interface. The client streams audio to the STT service and receives recognition hypotheses in real time. It can run N simultaneous recognition sessions.
 6 | 
 7 | ## Installation
 8 | 
 9 | This script needs a few dependencies to work. To interact with the STT service via WebSockets, install the 'twisted' and 'autobahn' libraries. The latest versions of these libraries can be installed by typing:
10 | 
11 | `
12 | $ pip install twisted
13 | `
14 | 
15 | `
16 | $ pip install autobahn
17 | `
18 | 
19 | The script also imports the 'requests' library, which it uses for token-based authentication, so install it as well:
20 | 
21 | `
22 | $ pip install requests
23 | `
24 | 
25 | If you need to upgrade your existing versions of twisted or autobahn, you can type:
26 | 
27 | `
28 | $ pip install twisted --upgrade
29 | `
30 | 
31 | `
32 | $ pip install autobahn --upgrade
33 | `
34 | 
35 | Sometimes you may need to install additional dependencies. If the installation fails, try the following commands:
36 | 
37 | `
38 | $ pip install pyOpenSSL
39 | `
40 | 
41 | `
42 | $ apt-get install build-essential python-dev
43 | `
44 | 
45 | Finally, version 0.10.3 of Autobahn ships with a typo that you need to fix by changing 'taxio' to 'txaio' in /usr/local/lib/python2.7/dist-packages/autobahn/websocket/protocol.py (the path may differ depending on your installation).
46 | 
47 | ## Examples
48 | 
49 | The example below runs the 10 WAV files listed in the default input file ("./recordings.txt") through the WebSockets interface of the Speech To Text (STT) service and writes the recognition hypotheses to "./output/hypotheses.txt".
50 | 
51 | `
52 | $ python ./sttClient.py -credentials <username>:<password> -model en-US_BroadbandModel
53 | `
54 | 
55 | The example below performs the same task much faster by opening 10 simultaneous recognition sessions (WebSocket connections) against the STT service.
56 | 
57 | `
58 | $ python ./sttClient.py -credentials <username>:<password> -model en-US_BroadbandModel -threads 10
59 | `
60 |  
61 | ## Options
62 | 
63 | To see the list of available options type:
64 | 
65 | `
66 | $ python sttClient.py -h
67 | `
68 | 
69 | ## Motivation
70 | 
71 | This script was created by Daniel Bolanos to facilitate and promote the use of the IBM Watson Speech To Text service.
72 | 
73 | 
74 | 
75 |                                                               
76 | 
77 | 
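The `-credentials` and `-model` options used in the README examples end up in an HTTP Basic `Authorization` header and in the WebSocket URL, as the main block of sttClient.py shows. Below is a minimal Python 3 sketch of those two steps, under the assumption that the hostname and path hard-coded in the script are used; the function names `basic_auth_header` and `recognize_url` are hypothetical.

```python
import base64

HOSTNAME = "stream.watsonplatform.net"    # hostname hard-coded in sttClient.py


def basic_auth_header(username, password):
    """Build the Basic Authentication header sent with the WebSocket handshake."""
    raw = (username + ":" + password).encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def recognize_url(model):
    """Build the recognize endpoint URL for the given STT model."""
    return "wss://" + HOSTNAME + "/speech-to-text/api/v1/recognize?model=" + model


if __name__ == "__main__":
    print(basic_auth_header("user", "pass"))
    print(recognize_url("en-US_BroadbandModel"))
```

In Python 3 `base64.b64encode` works on bytes, hence the explicit encode and decode; the Python 2 script passes the string directly.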


--------------------------------------------------------------------------------
/sttClient.py:
--------------------------------------------------------------------------------
  1 | #
  2 | # Copyright IBM Corp. 2014
  3 | #
  4 | # Licensed under the Apache License, Version 2.0 (the "License");
  5 | # you may not use this file except in compliance with the License.
  6 | # You may obtain a copy of the License at
  7 | #
  8 | # http://www.apache.org/licenses/LICENSE-2.0
  9 | #
 10 | # Unless required by applicable law or agreed to in writing, software
 11 | # distributed under the License is distributed on an "AS IS" BASIS,
 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 13 | # See the License for the specific language governing permissions and
 14 | # limitations under the License.
 15 | #
 16 | 
 17 | # Author: Daniel Bolanos
 18 | # Date:   2015
 19 | 
 20 | # coding=utf-8
 21 | import json                                      # json 
 22 | import threading                                 # multi threading
 23 | import os                                        # for listing directories
 24 | import Queue                                     # queue used for thread synchronization
 25 | import sys                                       # system calls
 26 | import argparse                                  # for parsing arguments
 27 | import base64                                    # necessary to encode in base64 according to the RFC2045 standard 
 28 | import requests                                  # python HTTP requests library
 29 | 
 30 | # WebSockets 
 31 | from autobahn.twisted.websocket import WebSocketClientProtocol, WebSocketClientFactory, connectWS
 32 | from twisted.python import log
 33 | from twisted.internet import ssl, reactor
 34 | 
 35 | class Utils:   
 36 | 
 37 |    @staticmethod
 38 |    def getAuthenticationToken(hostname, serviceName, username, password):
 39 |       
 40 |       uri = hostname +  "/authorization/api/v1/token?url=" + hostname + '/' + serviceName + "/api" 
 41 |       uri = uri.replace("wss://", "https://")
 42 |       uri = uri.replace("ws://", "https://")
 43 |       print uri
 44 |       resp = requests.get(uri, auth=(username, password), verify=False, headers= {'Accept': 'application/json'}, 
 45 |                           timeout= (30, 30))
 46 |       print resp.text
 47 |       jsonObject = resp.json()
 48 |       return jsonObject['token']
 49 | 
 50 | 
 51 | class WSInterfaceFactory(WebSocketClientFactory):
 52 | 
 53 |    def __init__(self, queue, summary, dirOutput, contentType, model, url=None, headers=None, debug=None):
 54 | 
 55 |       WebSocketClientFactory.__init__(self, url=url, headers=headers, debug=debug)   
 56 |       self.queue = queue
 57 |       self.summary = summary
 58 |       self.dirOutput = dirOutput
 59 |       self.contentType = contentType
 60 |       self.model = model
 61 |       self.queueProto = Queue.Queue()
 62 | 
 63 |       self.openHandshakeTimeout = 6
 64 |       self.closeHandshakeTimeout = 6
 65 | 
 66 |       # start the thread that takes care of ending the reactor so the script can finish automatically (without ctrl+c)
 67 |       endingThread = threading.Thread(target=self.endReactor, args= ())
 68 |       endingThread.daemon = True
 69 |       endingThread.start()
 70 |    
 71 |    def prepareUtterance(self):
 72 | 
 73 |       try:
 74 |          utt = self.queue.get_nowait()
 75 |          self.queueProto.put(utt)
 76 |          return True
 77 |       except Queue.Empty:
 78 |          print "prepareUtterance: no more utterances to process, queue is empty!"
 79 |          return False
 80 | 
 81 |    def endReactor(self):
 82 | 
 83 |       self.queue.join()
 84 |       print "about to stop the reactor!"
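 85 |       reactor.callFromThread(reactor.stop)   # this method runs in a helper thread, so the stop must be scheduled on the reactor thread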
 86 | 
 87 |    # this function gets called every time connectWS is called (once per WebSocket connection/session)
 88 |    def buildProtocol(self, addr):
 89 | 
 90 |       try:
 91 |          utt = self.queueProto.get_nowait()
 92 |          proto = WSInterfaceProtocol(self, self.queue, self.summary, self.dirOutput, self.contentType)         
 93 |          proto.setUtterance(utt)
 94 |          return proto 
 95 |       except Queue.Empty:
 96 |          print "queue should not be empty, otherwise this function should not have been called"
 97 |          return None
 98 | 
 99 | # WebSockets interface to the STT service
100 | # note: an object of this class is created for each WebSocket connection, every time we call connectWS
101 | class WSInterfaceProtocol(WebSocketClientProtocol):
102 | 
103 |    def __init__(self, factory, queue, summary, dirOutput, contentType):
104 |       self.factory = factory
105 |       self.queue = queue
106 |       self.summary = summary
107 |       self.dirOutput = dirOutput
108 |       self.contentType = contentType 
109 |       self.packetRate = 20
110 |       self.listeningMessages = 0
111 |       self.timeFirstInterim = -1
112 |       self.bytesSent = 0
113 |       self.chunkSize = 2000    # in bytes
114 |       super(WSInterfaceProtocol, self).__init__()
115 |       print dirOutput
116 |       print "contentType: " + str(self.contentType) + " queueSize: " + str(self.queue.qsize())
117 | 
118 |    def setUtterance(self, utt):
119 | 
120 |       self.uttNumber = utt[0]
121 |       self.uttFilename = utt[1]
122 |       self.summary[self.uttNumber] = {"hypothesis":"",
123 |                                       "status":{"code":"", "reason":""}}
124 |       self.fileJson = self.dirOutput + "/" + str(self.uttNumber) + ".json.txt"
125 |       try:
126 |          os.remove(self.fileJson)
127 |       except OSError:
128 |          pass
129 | 
130 |    # helper method that sends the next chunk of audio and reschedules itself until all the data has been sent
131 |    def maybeSendChunk(self,data):
132 | 
133 |       def sendChunk(chunk, final=False):
134 |          self.bytesSent += len(chunk)
135 |          self.sendMessage(chunk, isBinary = True)
136 |          if final: 
137 |             self.sendMessage(b'', isBinary = True)
138 | 
139 |       if (self.bytesSent+self.chunkSize >= len(data)):
140 |          # last chunk (empty if the file is empty): send it followed by the empty end-of-audio message
141 |          sendChunk(data[self.bytesSent:len(data)],True)
142 |          return
143 |       sendChunk(data[self.bytesSent:self.bytesSent+self.chunkSize])
144 |       self.factory.reactor.callLater(0.01, self.maybeSendChunk, data=data)
145 |       return
146 | 
147 |    def onConnect(self, response):
148 |       print "onConnect, server connected: {0}".format(response.peer)
149 |    
150 |    def onOpen(self):
151 |       print "onOpen"
152 |       data = {"action" : "start", "content-type" : str(self.contentType), "continuous" : True, "interim_results" : True, "inactivity_timeout": 600}
153 |       data['word_confidence'] = True
154 |       data['timestamps'] = True
155 |       data['max_alternatives'] = 3
156 |       print "sendMessage(init)" 
157 |       # send the initialization parameters
158 |       self.sendMessage(json.dumps(data).encode('utf8'))
159 | 
160 |       # start sending audio right away (it will get buffered in the STT service)
161 |       print self.uttFilename
162 |       with open(str(self.uttFilename),'rb') as f:   # read the whole file and close it right away
163 |          dataFile = f.read()
164 |       self.bytesSent = 0
165 |       self.maybeSendChunk(dataFile)
166 |       print "onOpen ends"      
167 | 
168 |    
169 |    def onMessage(self, payload, isBinary):
170 | 
171 |       if isBinary:
172 |          print("Binary message received: {0} bytes".format(len(payload)))         
173 |       else:
174 |          print(u"Text message received: {0}".format(payload.decode('utf8')))  
175 | 
176 |          # 'state' messages: the service is expected to send one after the start message and one once the audio has been processed
177 |          jsonObject = json.loads(payload.decode('utf8'))
178 |          if 'state' in jsonObject:
179 |             self.listeningMessages += 1
180 |             if (self.listeningMessages == 2):
181 |                print "sending close 1000"
182 |                # close the connection
183 |                self.sendClose(1000)
184 |                
185 |          # if in streaming 
186 |          elif 'results' in jsonObject:
187 |             # the payload has already been parsed into jsonObject above
188 |             hypothesis = ""
189 |             # empty hypothesis
190 |             if (len(jsonObject['results']) == 0):
191 |                print "empty hypothesis!"
192 |             # regular hypothesis
193 |             else: 
194 |                # dump the message to the output directory;
195 |                # the payload has already been parsed into jsonObject above
196 |                with open(self.fileJson,"a") as f:
197 |                   f.write(json.dumps(jsonObject, indent=4, sort_keys=True))
198 | 
199 | 
200 |                hypothesis = jsonObject['results'][0]['alternatives'][0]['transcript']
201 |                bFinal = (jsonObject['results'][0]['final'] == True)
202 |                if bFinal:
203 |                   print "final hypothesis: \"" + hypothesis + "\""
204 |                   self.summary[self.uttNumber]['hypothesis'] += hypothesis
205 |                else:
206 |                   print "interim hyp: \"" + hypothesis + "\""
207 | 
208 |    def onClose(self, wasClean, code, reason):
209 | 
210 |       print("onClose")
211 |       print "WebSocket connection closed, code: {0}, clean: {1}, reason: {2}".format(code, wasClean, reason)
212 |       self.summary[self.uttNumber]['status']['code'] = code
213 |       self.summary[self.uttNumber]['status']['reason'] = reason
214 |       # record the outcome explicitly so that the summary loop can always read 'successful'
215 |       self.summary[self.uttNumber]['status']['successful'] = (code == 1000)
216 |       
217 |       # mark this utterance as done, then open a new WebSocket connection if utterances remain in the queue
218 |       self.queue.task_done()
219 | 
220 |       if self.factory.prepareUtterance() == False:
221 |          return
222 | 
223 |       # SSL client context: default
224 |       if self.factory.isSecure:
225 |          contextFactory = ssl.ClientContextFactory()
226 |       else:
227 |          contextFactory = None
228 |       connectWS(self.factory, contextFactory)
229 | 
230 | 
231 | # function to check that a value is a positive integer
232 | def check_positive_int(value):
233 |     ivalue = int(value)
234 |     if ivalue < 1:
235 |         raise argparse.ArgumentTypeError("\"%s\" is an invalid positive int value" % value)
236 |     return ivalue
237 | 
238 | # function to check the credentials format
239 | def check_credentials(credentials):
240 |    elements = credentials.split(":", 1)   # split on the first colon only, so the password may contain ':'
241 |    if (len(elements) == 2):
242 |       return elements
243 |    else:
244 |       raise argparse.ArgumentTypeError("\"%s\" is not a valid format for the credentials " % credentials)
245 | 
246 | 
247 | if __name__ == '__main__':
248 | 
249 |    # parse command line parameters
250 |    parser = argparse.ArgumentParser(description='client to do speech recognition using the WebSocket interface to the Watson STT service')
251 |    parser.add_argument('-credentials', action='store', dest='credentials', required=True, help='Basic Authentication credentials in the form \'username:password\'', type=check_credentials)
252 |    parser.add_argument('-in', action='store', dest='fileInput', default='./recordings.txt', help='text file containing audio files')
253 |    parser.add_argument('-out', action='store', dest='dirOutput', default='./output', help='output directory')
254 |    parser.add_argument('-type', action='store', dest='contentType', default='audio/wav', help='audio content type, for example: \'audio/l16; rate=44100\'')
255 |    parser.add_argument('-model', action='store', dest='model', default='en-US_BroadbandModel', help='STT model that will be used')
256 |    parser.add_argument('-threads', action='store', dest='threads', default='1', help='number of simultaneous STT sessions', type=check_positive_int)
257 |    parser.add_argument('-tokenauth', action='store_true', dest='tokenauth', help='use token based authentication')
258 |    args = parser.parse_args()
259 | 
260 |    # create output directory if necessary
261 |    if (os.path.isdir(args.dirOutput)):
262 |       while True:
263 |          answer = raw_input("the output directory \"" + args.dirOutput + "\" already exists, overwrite? (y/n)? ")
264 |          if (answer == "n"):
265 |             sys.stderr.write("exiting...")
266 |             sys.exit()
267 |          elif (answer == "y"):
268 |             break
269 |    else:
270 |       os.makedirs(args.dirOutput)
271 | 
272 |    # logging
273 |    log.startLogging(sys.stdout)
274 | 
275 |    # add audio files to the processing queue (one file name per line, blank lines are skipped)
276 |    q = Queue.Queue()
277 |    with open(args.fileInput) as listFile:
278 |       lines = [line.strip() for line in listFile if line.strip()]
279 |    # each queue entry is a (fileNumber, fileName) tuple
280 |    for fileNumber, fileName in enumerate(lines):
281 |       print fileName
282 |       q.put((fileNumber,fileName))
283 | 
284 |    hostname = "stream.watsonplatform.net"   
285 |    headers = {}
286 | 
287 |    # authentication header
288 |    if args.tokenauth:
289 |       headers['X-Watson-Authorization-Token'] = Utils.getAuthenticationToken("https://" + hostname, 'speech-to-text', 
290 |                                                                              args.credentials[0], args.credentials[1])
291 |    else:
292 |       string = args.credentials[0] + ":" + args.credentials[1]
293 |       headers["Authorization"] = "Basic " + base64.b64encode(string)
294 | 
295 |    # create a WS client factory with our protocol
296 |    url = "wss://" + hostname + "/speech-to-text/api/v1/recognize?model=" + args.model
297 |    summary = {}
298 |    factory = WSInterfaceFactory(q, summary, args.dirOutput, args.contentType, args.model, url, headers, debug=False)
299 |    factory.protocol = WSInterfaceProtocol
300 | 
301 |    for i in range(min(int(args.threads),q.qsize())):
302 | 
303 |       factory.prepareUtterance()
304 | 
305 |       # SSL client context: default
306 |       if factory.isSecure:
307 |          contextFactory = ssl.ClientContextFactory()
308 |       else:
309 |          contextFactory = None
310 |       connectWS(factory, contextFactory)
311 | 
312 |    reactor.run()
313 | 
314 |    # dump the hypotheses to the output file
315 |    fileHypotheses = args.dirOutput + "/hypotheses.txt"
316 |    f = open(fileHypotheses,"w")
317 |    counter = 1
318 |    successful = 0 
319 |    emptyHypotheses = 0
320 |    for key, value in (sorted(summary.items())):
321 |       if value['status'].get('successful', False):
322 |          print key, ": ", value['status']['code'], " ", value['hypothesis'].encode('utf-8')
323 |          successful += 1
324 |          if value['hypothesis'].strip() == "":
325 |             emptyHypotheses += 1
326 |       else:
327 |          print key, ": ", value['status']['code'], " REASON: ", value['status']['reason']
328 |       f.write(str(counter) + ": " + value['hypothesis'].encode('utf-8') + "\n")
329 |       counter += 1
330 |    f.close()
331 |    print "successful sessions: ", successful, " (", len(summary)-successful, " errors) (" + str(emptyHypotheses) + " empty hypotheses)"
332 | 
333 | 
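The pacing logic in maybeSendChunk comes down to two rules: send the audio as fixed-size binary messages, then send one empty binary message to mark the end of the audio. The Python 3 sketch below isolates just that slicing as a generator, with no networking and no timers; `iter_audio_messages` is a hypothetical name, and 2000 bytes is the script's chunkSize.

```python
def iter_audio_messages(data, chunk_size=2000):
    """Yield the binary messages for one utterance, ending with the empty end-of-audio marker."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]
    yield b""                     # empty message signals that no more audio follows


if __name__ == "__main__":
    audio = b"\x00" * 4500        # stand-in for the bytes of a WAV file
    print([len(message) for message in iter_audio_messages(audio)])
```

Keeping the slicing separate from the transport makes the edge cases easy to check, for example an empty file or a file whose size is an exact multiple of the chunk size.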


--------------------------------------------------------------------------------