├── recordings
    ├── 0001.wav
    ├── 0002.wav
    ├── 0003.wav
    ├── 0004.wav
    ├── 0005.wav
    ├── 0006.wav
    ├── 0007.wav
    ├── 0008.wav
    ├── 0009.wav
    └── 0010.wav
├── recordings.txt
├── output
    └── hypotheses.txt
├── README.md
└── sttClient.py

/recordings/0001.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0001.wav
--------------------------------------------------------------------------------
/recordings/0002.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0002.wav
--------------------------------------------------------------------------------
/recordings/0003.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0003.wav
--------------------------------------------------------------------------------
/recordings/0004.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0004.wav
--------------------------------------------------------------------------------
/recordings/0005.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0005.wav
--------------------------------------------------------------------------------
/recordings/0006.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0006.wav
--------------------------------------------------------------------------------
/recordings/0007.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0007.wav
--------------------------------------------------------------------------------
/recordings/0008.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0008.wav
--------------------------------------------------------------------------------
/recordings/0009.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0009.wav
--------------------------------------------------------------------------------
/recordings/0010.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-bolanos/speech-to-text-websockets-python/HEAD/recordings/0010.wav
--------------------------------------------------------------------------------
/recordings.txt:
--------------------------------------------------------------------------------
./recordings/0001.wav
./recordings/0002.wav
./recordings/0003.wav
./recordings/0004.wav
./recordings/0005.wav
./recordings/0006.wav
./recordings/0007.wav
./recordings/0008.wav
./recordings/0009.wav
./recordings/0010.wav

--------------------------------------------------------------------------------
/output/hypotheses.txt:
--------------------------------------------------------------------------------
1: several tornadoes touch down as a line of severe thunderstorms swept through colorado on sunday
2: with one of the twisters hitting near a junior golf tournament
3: m. m. during a caddy
4: six of the tornado struck in northeast colorado
5: well to others hit in park county
6: in the center of the state
7: the national weather service said
8: at least three of them caused the damage
9: aurora fire department officials said a twister touched down near the blackstone country club
10: causing one minor injury and flipping an empty trailer

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

## Synopsis

This project consists of a Python client that interacts with the IBM Watson Speech To Text service through its WebSockets interface. The client streams audio to the STT service and receives recognition hypotheses in real time. It can run N simultaneous recognition sessions.

## Installation

There are some dependencies that need to be installed for this script to work. In order to interact with the STT service via WebSockets it is necessary to install the 'twisted' and 'autobahn' libraries. An up-to-date version of these libraries can be installed by typing:

`
$ pip install twisted
`

`
$ pip install autobahn
`

In order to use token-based authentication it is necessary to install the requests library:

`
$ pip install requests
`

If you need to upgrade your existing versions of twisted or autobahn you can type:

`
$ pip install twisted --upgrade
`

`
$ pip install autobahn --upgrade
`

Sometimes you may need to install some additional dependencies. If so, try the following commands:

`
$ pip install pyOpenSSL
`

`
$ apt-get install build-essential python-dev
`

Finally, version 0.10.3 of Autobahn comes with a bug/typo that you need to fix by changing 'taxio' to 'txaio' in /usr/local/lib/python2.7/dist-packages/autobahn/websocket/protocol.py.

## Examples

The example below will run the default 10 WAV files through the WebSockets interface of the Speech To Text (STT) service and will dump the recognition hypotheses to a file under the "./output" directory.

`
$ python ./sttClient.py -credentials username:password -model en-US_BroadbandModel
`

The example below performs the same task much faster by opening 10 simultaneous recognition sessions (WebSocket connections) against the STT service.

`
$ python ./sttClient.py -credentials username:password -model en-US_BroadbandModel -threads 10
`

## Options

To see the list of available options type:

`
$ python sttClient.py -h
`

## Motivation

This script has been created by Daniel Bolanos in order to facilitate and promote the utilization of the IBM Watson Speech To Text service.
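
## Reading the output

The client writes one `N: transcript` line per recording to `hypotheses.txt` under the output directory. The snippet below is a minimal sketch of how that file could be loaded for further processing. The helper name `parse_hypotheses` is an assumption made for illustration and is not part of `sttClient.py`; the sample text is taken from the hypotheses file in this repository.

```python
def parse_hypotheses(text):
    """Map utterance number -> transcript for lines shaped like 'N: transcript'.

    Hypothetical helper, not part of sttClient.py.
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines such as a trailing newline
        # split on the first colon only, so colons inside a transcript survive
        number, _, transcript = line.partition(":")
        result[int(number)] = transcript.strip()
    return result


sample = (
    "1: several tornadoes touch down as a line of severe thunderstorms swept through colorado on sunday\n"
    "2: with one of the twisters hitting near a junior golf tournament\n"
)
hyps = parse_hypotheses(sample)
print(hyps[2])
```

In practice the text would come from reading `./output/hypotheses.txt`, or whichever directory was passed with `-out`.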
--------------------------------------------------------------------------------
/sttClient.py:
--------------------------------------------------------------------------------
#
# Copyright IBM Corp. 2014
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Author: Daniel Bolanos
# Date:   2015

# coding=utf-8
import json                        # json
import threading                   # multi threading
import os                          # for listing directories
import Queue                       # queue used for thread synchronization
import sys                         # system calls
import argparse                    # for parsing arguments
import base64                      # necessary to encode in base64 according to the RFC2045 standard
import requests                    # python HTTP requests library

# WebSockets
from autobahn.twisted.websocket import WebSocketClientProtocol, WebSocketClientFactory, connectWS
from twisted.python import log
from twisted.internet import ssl, reactor


class Utils:

    @staticmethod
    def getAuthenticationToken(hostname, serviceName, username, password):

        uri = hostname + "/authorization/api/v1/token?url=" + hostname + '/' + serviceName + "/api"
        uri = uri.replace("wss://", "https://")
        uri = uri.replace("ws://", "https://")
        print uri
        # note: verify=False disables TLS certificate verification for this request
        resp = requests.get(uri, auth=(username, password), verify=False, headers={'Accept': 'application/json'},
                            timeout=(30, 30))
        print resp.text
        jsonObject = resp.json()
        return jsonObject['token']


class WSInterfaceFactory(WebSocketClientFactory):

    def __init__(self, queue, summary, dirOutput, contentType, model, url=None, headers=None, debug=None):

        WebSocketClientFactory.__init__(self, url=url, headers=headers, debug=debug)
        self.queue = queue
        self.summary = summary
        self.dirOutput = dirOutput
        self.contentType = contentType
        self.model = model
        self.queueProto = Queue.Queue()

        self.openHandshakeTimeout = 6
        self.closeHandshakeTimeout = 6

        # start the thread that takes care of ending the reactor so the script can finish automatically (without ctrl+c)
        endingThread = threading.Thread(target=self.endReactor, args=())
        endingThread.daemon = True
        endingThread.start()

    def prepareUtterance(self):

        try:
            utt = self.queue.get_nowait()
            self.queueProto.put(utt)
            return True
        except Queue.Empty:
            print "getUtterance: no more utterances to process, queue is empty!"
            return False

    def endReactor(self):

        self.queue.join()
        print "about to stop the reactor!"
        # this method runs in a helper thread, so the stop is scheduled on the reactor thread
        reactor.callFromThread(reactor.stop)

    # this function gets called every time connectWS is called (once per WebSocket connection/session)
    def buildProtocol(self, addr):

        try:
            utt = self.queueProto.get_nowait()
            proto = WSInterfaceProtocol(self, self.queue, self.summary, self.dirOutput, self.contentType)
            proto.setUtterance(utt)
            return proto
        except Queue.Empty:
            print "queue should not be empty, otherwise this function should not have been called"
            return None


# WebSockets interface to the STT service
# note: an object of this class is created for each WebSocket connection, every time we call connectWS
class WSInterfaceProtocol(WebSocketClientProtocol):

    def __init__(self, factory, queue, summary, dirOutput, contentType):
        self.factory = factory
        self.queue = queue
        self.summary = summary
        self.dirOutput = dirOutput
        self.contentType = contentType
        self.packetRate = 20
        self.listeningMessages = 0
        self.timeFirstInterim = -1
        self.bytesSent = 0
        self.chunkSize = 2000     # in bytes
        # name the class explicitly: super(self.__class__, ...) recurses forever if this class is subclassed
        super(WSInterfaceProtocol, self).__init__()
        print dirOutput
        print "contentType: " + str(self.contentType) + " queueSize: " + str(self.queue.qsize())

    def setUtterance(self, utt):

        self.uttNumber = utt[0]
        self.uttFilename = utt[1]
        self.summary[self.uttNumber] = {"hypothesis": "",
                                        "status": {"code": "", "reason": ""}}
        self.fileJson = self.dirOutput + "/" + str(self.uttNumber) + ".json.txt"
        try:
            os.remove(self.fileJson)
        except OSError:
            pass

    # helper method that sends a chunk of audio if needed (as required by the specified pacing)
    def maybeSendChunk(self, data):

        def sendChunk(chunk, final=False):
            self.bytesSent += len(chunk)
            self.sendMessage(chunk, isBinary=True)
            if final:
                # an empty binary message follows the last chunk of audio
                self.sendMessage(b'', isBinary=True)

        # last chunk: send whatever is left and stop rescheduling
        if (self.bytesSent + self.chunkSize >= len(data)):
            if (len(data) > self.bytesSent):
                sendChunk(data[self.bytesSent:len(data)], True)
                return
        sendChunk(data[self.bytesSent:self.bytesSent + self.chunkSize])
        self.factory.reactor.callLater(0.01, self.maybeSendChunk, data=data)
        return

    def onConnect(self, response):
        print "onConnect, server connected: {0}".format(response.peer)

    def onOpen(self):
        print "onOpen"
        data = {"action": "start", "content-type": str(self.contentType), "continuous": True, "interim_results": True, "inactivity_timeout": 600}
        data['word_confidence'] = True
        data['timestamps'] = True
        data['max_alternatives'] = 3
        print "sendMessage(init)"
        # send the initialization parameters
        self.sendMessage(json.dumps(data).encode('utf8'))

        # start sending audio right away (it will get buffered in the STT service)
        print self.uttFilename
        with open(str(self.uttFilename), 'rb') as f:
            dataFile = f.read()
        self.bytesSent = 0
        self.maybeSendChunk(dataFile)
        print "onOpen ends"

    def onMessage(self, payload, isBinary):

        if isBinary:
            print("Binary message received: {0} bytes".format(len(payload)))
        else:
            print(u"Text message received: {0}".format(payload.decode('utf8')))

            # if uninitialized, receive the initialization response from the server
            jsonObject = json.loads(payload.decode('utf8'))
            if 'state' in jsonObject:
                self.listeningMessages += 1
                if (self.listeningMessages == 2):
                    print "sending close 1000"
                    # close the connection
                    self.sendClose(1000)

            # if in streaming
            elif 'results' in jsonObject:
                hypothesis = ""
                # empty hypothesis
                if (len(jsonObject['results']) == 0):
                    print "empty hypothesis!"
                # regular hypothesis
                else:
                    # dump the message to the output directory
                    with open(self.fileJson, "a") as f:
                        f.write(json.dumps(jsonObject, indent=4, sort_keys=True))

                    hypothesis = jsonObject['results'][0]['alternatives'][0]['transcript']
                    bFinal = (jsonObject['results'][0]['final'] == True)
                    if bFinal:
                        print "final hypothesis: \"" + hypothesis + "\""
                        self.summary[self.uttNumber]['hypothesis'] += hypothesis
                    else:
                        print "interim hyp: \"" + hypothesis + "\""

    def onClose(self, wasClean, code, reason):

        print("onClose")
        print("WebSocket connection closed: {0}".format(reason), "code: ", code, "clean: ", wasClean, "reason: ", reason)
        self.summary[self.uttNumber]['status']['code'] = code
        self.summary[self.uttNumber]['status']['reason'] = reason
        if (code == 1000):
            self.summary[self.uttNumber]['status']['successful'] = True

        # create a new WebSocket connection if there are still utterances in the queue that need to be processed
        self.queue.task_done()

        if self.factory.prepareUtterance() == False:
            return

        # SSL client context: default
        if self.factory.isSecure:
            contextFactory = ssl.ClientContextFactory()
        else:
            contextFactory = None
        connectWS(self.factory, contextFactory)


# function to check that a value is a positive integer
def check_positive_int(value):
    ivalue = int(value)
    if ivalue < 1:
        raise argparse.ArgumentTypeError("\"%s\" is an invalid positive int value" % value)
    return ivalue


# function to check the credentials format
def check_credentials(credentials):
    elements = credentials.split(":")
    if (len(elements) == 2):
        return elements
    else:
        raise argparse.ArgumentTypeError("\"%s\" is not a valid format for the credentials " % credentials)


if __name__ == '__main__':

    # parse command line parameters
    parser = argparse.ArgumentParser(description='client to do speech recognition using the WebSocket interface to the Watson STT service')
    # credentials are used by both authentication modes below, so they are required
    parser.add_argument('-credentials', action='store', dest='credentials', required=True, help='Basic Authentication credentials in the form \'username:password\'', type=check_credentials)
    parser.add_argument('-in', action='store', dest='fileInput', default='./recordings.txt', help='text file containing audio files')
    parser.add_argument('-out', action='store', dest='dirOutput', default='./output', help='output directory')
    parser.add_argument('-type', action='store', dest='contentType', default='audio/wav', help='audio content type, for example: \'audio/l16; rate=44100\'')
    parser.add_argument('-model', action='store', dest='model', default='en-US_BroadbandModel', help='STT model that will be used')
    parser.add_argument('-threads', action='store', dest='threads', default='1', help='number of simultaneous STT sessions', type=check_positive_int)
    parser.add_argument('-tokenauth', action='store_true', dest='tokenauth', help='use token based authentication')
    args = parser.parse_args()

    # create output directory if necessary
    if (os.path.isdir(args.dirOutput)):
        while True:
            answer = raw_input("the output directory \"" + args.dirOutput + "\" already exists, overwrite? (y/n)? ")
            if (answer == "n"):
                sys.stderr.write("exiting...")
                sys.exit()
            elif (answer == "y"):
                break
    else:
        os.makedirs(args.dirOutput)

    # logging
    log.startLogging(sys.stdout)

    # add audio files to the processing queue
    q = Queue.Queue()
    with open(args.fileInput) as fileList:
        lines = [line.rstrip('\n') for line in fileList]
    fileNumber = 0
    for fileName in lines:
        print fileName
        q.put((fileNumber, fileName))
        fileNumber += 1

    hostname = "stream.watsonplatform.net"
    headers = {}

    # authentication header
    if args.tokenauth:
        headers['X-Watson-Authorization-Token'] = Utils.getAuthenticationToken("https://" + hostname, 'speech-to-text',
                                                                               args.credentials[0], args.credentials[1])
    else:
        string = args.credentials[0] + ":" + args.credentials[1]
        headers["Authorization"] = "Basic " + base64.b64encode(string)

    # create a WS server factory with our protocol
    url = "wss://" + hostname + "/speech-to-text/api/v1/recognize?model=" + args.model
    summary = {}
    factory = WSInterfaceFactory(q, summary, args.dirOutput, args.contentType, args.model, url, headers, debug=False)
    factory.protocol = WSInterfaceProtocol

    for i in range(min(int(args.threads), q.qsize())):

        factory.prepareUtterance()

        # SSL client context: default
        if factory.isSecure:
            contextFactory = ssl.ClientContextFactory()
        else:
            contextFactory = None
        connectWS(factory, contextFactory)

    reactor.run()

    # dump the hypotheses to the output file
    fileHypotheses = args.dirOutput + "/hypotheses.txt"
    f = open(fileHypotheses, "w")
    counter = 1
    successful = 0
    emptyHypotheses = 0
    for key, value in (sorted(summary.items())):
        # 'successful' is only set when the session closed with code 1000
        if value['status'].get('successful', False):
            print key, ": ", value['status']['code'], " ", value['hypothesis'].encode('utf-8')
            successful += 1
            # compare the whole string: indexing [0] fails on an empty hypothesis
            if value['hypothesis'] == "":
                emptyHypotheses += 1
        else:
            # key is an int, so it cannot be concatenated with a string
            print key, ": ", value['status']['code'], " REASON: ", value['status']['reason']
        f.write(str(counter) + ": " + value['hypothesis'].encode('utf-8') + "\n")
        counter += 1
    f.close()
    print "successful sessions: ", successful, " (", len(summary) - successful, " errors) (" + str(emptyHypotheses) + " empty hypotheses)"

--------------------------------------------------------------------------------
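
The slicing that `maybeSendChunk` in `sttClient.py` performs can be checked without a network connection. The block below is a minimal sketch of that chunking rule only. The generator name `iter_chunks` is an assumption introduced for illustration and does not exist in the script; the default of 2000 bytes matches `self.chunkSize`. It omits the 0.01 second pacing and the WebSocket calls.

```python
def iter_chunks(data, chunk_size=2000):
    """Yield (chunk, is_final) pairs that cover data in order.

    Hypothetical helper mirroring the slicing in maybeSendChunk: every chunk
    holds chunk_size bytes except possibly the last one, which is flagged as
    final. Empty input yields nothing.
    """
    sent = 0
    # full-size chunks while more than one chunk of data remains
    while sent + chunk_size < len(data):
        yield data[sent:sent + chunk_size], False
        sent += chunk_size
    # whatever is left (at most chunk_size bytes) is the final chunk
    if len(data) > sent:
        yield data[sent:], True


audio = b"\x00" * 5000
for chunk, is_final in iter_chunks(audio):
    print("%d bytes, final=%s" % (len(chunk), is_final))
```

Separating the slicing from the sending keeps the boundary cases (audio shorter than one chunk, or an exact multiple of the chunk size) easy to test in isolation.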