├── .gitignore ├── LICENSE ├── README.md ├── docs ├── api_reference │ ├── LongRunningRecognize.md │ └── Recognize.md ├── kws │ ├── LongRunningRecognize.md │ ├── RecognitionConfig.md │ └── Recognize.md ├── long-audios.md ├── rpc_reference │ ├── LongRunningRecognize.md │ ├── Recognize.md │ └── StreamingRecognize.md ├── short-audios.md ├── streaming-audios.md └── types │ ├── RecognitionAudio.md │ ├── RecognitionConfig.md │ └── SpeechRecognitionResult.md ├── go └── README.md ├── java ├── .gitignore ├── README.md ├── build.gradle ├── gradle │ └── wrapper │ │ ├── gradle-wrapper.jar │ │ └── gradle-wrapper.properties ├── gradlew ├── gradlew.bat ├── samples │ ├── build.gradle │ └── src │ │ └── main │ │ └── java │ │ └── ai │ │ └── vernacular │ │ └── examples │ │ └── speech │ │ └── RecognizeSync.java ├── settings.gradle └── vernacular-ai-speech │ ├── build.gradle │ └── src │ └── main │ ├── java │ └── ai │ │ └── vernacular │ │ └── speech │ │ └── SpeechClient.java │ └── proto │ └── speech-to-text.proto ├── python ├── .gitignore ├── README.md ├── requirements.txt ├── samples │ ├── recognize_async.py │ ├── recognize_multi_channel.py │ ├── recognize_streaming.py │ ├── recognize_streaming_mic.py │ ├── recognize_sync.py │ └── recognize_word_offset.py ├── setup.cfg ├── setup.py ├── tests │ └── __init__.py └── vernacular │ ├── __init__.py │ └── ai │ ├── __init__.py │ ├── exceptions │ └── __init__.py │ └── speech │ ├── __init__.py │ ├── enums.py │ ├── proto │ ├── __init__.py │ ├── speech-to-text.proto │ ├── speech_to_text_pb2.py │ └── speech_to_text_pb2_grpc.py │ ├── speech_client.py │ ├── types.py │ └── utils.py └── resources ├── hello.wav └── test-single-channel-8000Hz.raw /.gitignore: -------------------------------------------------------------------------------- 1 | # JetBrains 2 | .idea 3 | 4 | # VS Code 5 | .vscode 6 | .envrc 7 | .DS_Store -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 
34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Speech-to-Text API 2 | Converts audio to text 3 | 4 | We support these ten indian languages ([language codes](https://github.com/Vernacular-ai/speech-recognition/blob/master/docs/types/RecognitionConfig.md#languagesupport)). 
5 | - Hindi 6 | - English 7 | - Marathi 8 | - Kannada 9 | - Malayalam 10 | - Bengali 11 | - Gujarati 12 | - Punjabi 13 | - Telugu 14 | - Tamil 15 | 16 | ## Authentication 17 | ~~To get access to our APIs reach out to us at hello@vernacular.ai~~ 18 | We no longer provide a public access token for the APIs. 19 | 20 | ## Ways to use the Service 21 | - Transcribing short audio [up to 1 minute] 22 | - Transcribing long audio [more than 1 minute] 23 | - Transcribing audio from streaming input 24 | 25 | We recommend that you call this service using the Vernacular-provided client libraries. If your application needs to call this service using your own libraries, you should use the HTTP endpoints. 26 | 27 | **Supported SDKs**: [Python](https://github.com/Vernacular-ai/speech-recognition/tree/master/python) 28 | 29 | 30 | ## REST Reference 31 | 32 | **ServiceHost:** https://asr.vernacular.ai 33 | 34 | ### Speech Recognition 35 | | Name | Description | 36 | |--|--| 37 | | [recognize](docs/api_reference/Recognize.md) | Performs synchronous speech recognition: receive results after all audio has been sent and processed. | 38 | | [longrunningrecognize](docs/api_reference/LongRunningRecognize.md) | Performs asynchronous speech recognition. Generally used for long audio. | 39 | 40 | 41 | ## RPC Reference 42 | 43 | ### Speech Recognition 44 | | Methods | Description | 45 | |--|--| 46 | |[Recognize](docs/rpc_reference/Recognize.md) | Performs synchronous speech recognition: receive results after all audio has been sent and processed.| 47 | |[LongRunningRecognize](docs/rpc_reference/LongRunningRecognize.md) | Performs asynchronous speech recognition: receive results via the longrunning.Operations interface.| 48 | |[StreamingRecognize](docs/rpc_reference/StreamingRecognize.md) |Performs streaming speech recognition: receive results while sending audio. Supports both unidirectional and bidirectional streaming.| 49 | -------------------------------------------------------------------------------- /docs/api_reference/LongRunningRecognize.md: -------------------------------------------------------------------------------- 1 | # LongRunningRecognize 2 | Performs asynchronous speech recognition. Returns an intermediate response, a [SpeechOperation](#speechoperation), which will contain either a response or an error when processing is done. 3 | 4 | To get the latest state of the speech operation, you can poll for the result using a [GetSpeechOperation](#getspeechoperation) request. 5 | 6 | 7 | ### Request Method 8 | `POST https://asr.vernacular.ai/v2/speech:longrunningrecognize` 9 | 10 | ### Request Headers 11 | ``` 12 | X-ACCESS-TOKEN: some-access-token 13 | content-type: application/json 14 | ``` 15 | 16 | ### Request Body 17 | The request body contains data with the following structure: 18 | 19 | ```js 20 | { 21 | "config": { 22 | object (RecognitionConfig) 23 | }, 24 | "audio": { 25 | object (RecognitionAudio) 26 | }, 27 | "result_url": string, 28 | } 29 | ``` 30 | 31 | | Fields | Description| 32 | |--|--| 33 | |config|object ([RecognitionConfig](../types/RecognitionConfig.md))
Required. Provides information to the recognizer that specifies how to process the request.| 34 | |audio|object ([RecognitionAudio](../types/RecognitionAudio.md))
Required. The audio data to be recognized.| 35 | |result_url| string
Optional. Post the results to this URL when done.| 36 | 37 | ### ResponseBody 38 | An intermediate speech operation object which contains the response upon completion. 39 | 40 | ```js 41 | { 42 | "name": string, 43 | "done": bool, 44 | "result": union { 45 | object (google.rpc.Status), 46 | object (LongRunningRecognizeResponse), 47 | }, 48 | } 49 | ``` 50 | 51 | |Fields| Description | 52 | |--|--| 53 | |name| string
The server-assigned name, which is only unique within the same service that originally returns it.| 54 | |done| bool
If the value is false, it means the operation is still in progress. If true, the operation is completed, and either error or response is available.| 55 | |result| Union field
The operation result, which can be either an error or a valid response. If done == false, neither error nor response is set. If done == true, exactly one of error or response is set. See below for more details.| 56 | 57 | `result` can be only one of the following: 58 | | Field | Description | 59 | |--|--| 60 | |error| google.rpc.Status
The error result of the operation in case of failure or cancellation.| 61 | |response| [LongRunningRecognizeResponse](#longrunningrecognizeresponse)
The normal response of the operation in case of success.| 62 | 63 | ## LongRunningRecognizeResponse 64 | The only message returned to the client by the LongRunningRecognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages. 65 | 66 | |Fields | Description| 67 | |--|--| 68 | |results[] | [SpeechRecognitionResult](../types/SpeechRecognitionResult.md)
Sequential list of transcription results corresponding to sequential portions of audio.| 69 | 70 | 71 | 72 | ## GetSpeechOperation 73 | 74 | `GET https://asr.vernacular.ai/v2/speech_operations/{name}` 75 | 76 | Gets the latest state of a long-running operation. Clients can use this method to poll the operation result at intervals. 77 | 78 | |UrlParam|Description| 79 | |--|--| 80 | |name| string
 The name of the SpeechOperation in the [response](#responsebody)| 81 | 82 | Returns the [response](#responsebody) again with the latest state.
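### Sample Request and Response

A hypothetical request and response are sketched below; the audio URL and the returned operation name are placeholders, and the body simply mirrors the request and response structures documented above.

Request:
```bash
curl -X POST 'https://asr.vernacular.ai/v2/speech:longrunningrecognize' \
--header 'X-ACCESS-TOKEN: {{access-token}}' \
--header 'Content-Type: application/json' \
--data-raw '{
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 8000,
        "languageCode": "en-IN"
    },
    "audio": {
        "uri": "https://audio-url.wav"
    }
}'
```
Response:
```json
{
    "name": "some-operation-name",
    "done": false
}
```

The returned name can then be polled with `GET https://asr.vernacular.ai/v2/speech_operations/{name}` until `done` is true.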
-------------------------------------------------------------------------------- /docs/api_reference/Recognize.md: -------------------------------------------------------------------------------- 1 | # Recognize 2 | Performs synchronous speech recognition, i.e. results are received after all audio has been sent and processed. 3 | 4 | **Note**: Audio longer than 60 seconds does not work with sync Recognize. Use [LongRunningRecognize](LongRunningRecognize.md) for long audio. 5 | 6 | ### Request Method 7 | `POST https://asr.vernacular.ai/v2/speech:recognize` 8 | 9 | ### Request Headers 10 | ``` 11 | X-ACCESS-TOKEN: some-access-token 12 | content-type: application/json 13 | ``` 14 | 15 | ### Request Body 16 | The request body contains data with the following structure: 17 | 18 | ```js 19 | { 20 | "config": { 21 | object (RecognitionConfig) 22 | }, 23 | "audio": { 24 | object (RecognitionAudio) 25 | } 26 | } 27 | ``` 28 | 29 | |Fields|Description| 30 | |--|--| 31 | |config|object ([RecognitionConfig](../types/RecognitionConfig.md))
Required. Provides information to the recognizer that specifies how to process the request.| 32 | |audio|object ([RecognitionAudio](../types/RecognitionAudio.md))
Required. The audio data to be recognized.| 33 | 34 | ### Response Body 35 | If successful, the response body contains data with the following structure: 36 | 37 | The only message returned to the client by the recognize method. It contains the result as zero or more sequential [SpeechRecognitionResult](../types/SpeechRecognitionResult.md) messages. 38 | 39 | ```js 40 | { 41 | "results": [ 42 | { 43 | object (SpeechRecognitionResult) 44 | } 45 | ] 46 | } 47 | ``` 48 | 49 | ### Sample Request and Response 50 | Request: 51 | ```bash 52 | curl -X POST 'https://asr.vernacular.ai/v2/speech:recognize' \ 53 | --header 'X-ACCESS-TOKEN: {{access-token}}' \ 54 | --header 'Content-Type: application/json' \ 55 | --data-raw '{ 56 | "config": { 57 | "encoding": "LINEAR16", 58 | "sampleRateHertz": 8000, 59 | "languageCode": "en-IN", 60 | "maxAlternatives": 2 61 | }, 62 | "audio": { 63 | "uri": "https://audio-url.wav" 64 | } 65 | }' 66 | ``` 67 | Response: 68 | ```json 69 | { 70 | "results": [ 71 | { 72 | "alternatives": [ 73 | { 74 | "transcript": "i want to know my balance", 75 | "confidence": 0.95417684 76 | }, 77 | { 78 | "transcript": "i want know balance", 79 | "confidence": 0.95404005 80 | } 81 | ] 82 | } 83 | ] 84 | } 85 | ``` 86 | -------------------------------------------------------------------------------- /docs/kws/LongRunningRecognize.md: -------------------------------------------------------------------------------- 1 | # LongRunningRecognize 2 | Performs asynchronous keyword spotting recognition. Returns an intermediate response [KWSOperation](#kwsoperation) which will either contains response or an error when processing is done. 3 | 4 | To get latest state of speech operation, you can poll for the result using [GetKWSOperation](#getkwsoperation) request. 5 | 6 | 7 | ### Request Method 8 | `POST https://asr.vernacular.ai/v1/kws:longrunningrecognize` 9 | 10 | ### Request Headers 11 | ``` 12 | X-ACCESS-TOKEN: some-access-token 13 | content-type: application/json 14 | ``` 15 | 16 | ### Request Body 17 | The request body contains data with the following structure: 18 | 19 | ```js 20 | { 21 | "config": { 22 | object (RecognitionConfig) 23 | }, 24 | "audio": { 25 | object (RecognitionAudio) 26 | }, 27 | "keywords": [string], 28 | "result_url": string, 29 | } 30 | ``` 31 | 32 | | Fields | Description| 33 | |--|--| 34 | |config|object ([RecognitionConfig](./RecognitionConfig.md))
Required. Provides information to the recognizer that specifies how to process the request.| 35 | |audio|object ([RecognitionAudio](../types/RecognitionAudio.md))
Required. The audio data to be recognized.| 36 | |keywords| Array of strings that need to be searched for. | 37 | |result_url| string
Optional. Post the results to this URL when done.| 38 | 39 | ### ResponseBody 40 | An intermediate speech operation object which contains the response upon completion. 41 | 42 | ```js 43 | { 44 | "name": string, 45 | "done": bool, 46 | "result": union { 47 | object (google.rpc.Status), 48 | object (LongRunningRecognizeResponse), 49 | }, 50 | } 51 | ``` 52 | 53 | |Fields| Description | 54 | |--|--| 55 | |name| string
The server-assigned name, which is only unique within the same service that originally returns it.| 56 | |done| bool
If the value is false, it means the operation is still in progress. If true, the operation is completed, and either error or response is available.| 57 | |result| Union field
The operation result, which can be either an error or a valid response. If done == false, neither error nor response is set. If done == true, exactly one of error or response is set. See below for more details.| 58 | 59 | `result` can be only one of the following: 60 | | Field | Description | 61 | |--|--| 62 | |error| google.rpc.Status
The error result of the operation in case of failure or cancellation.| 63 | |response| [LongRunningRecognizeResponse](#longrunningrecognizeresponse)
The normal response of the operation in case of success.| 64 | 65 | ## LongRunningRecognizeResponse 66 | The only message returned to the client by the LongRunningRecognize method. 67 | 68 | ```js 69 | { 70 | "results": [ 71 | { 72 | "transcript": string, 73 | "matched_words": [ 74 | { 75 | "start_time": float, 76 | "end_time": float, 77 | "word": string 78 | } 79 | ] 80 | } 81 | ] 82 | } 83 | ``` 84 | 85 | |Fields | Description| 86 | |--|--| 87 | |results[] |
Sequential list of kws results corresponding to sequential portions of audio.| 88 | 89 | ## GetKWSOperation 90 | 91 | `GET https://asr.vernacular.ai/v1/kws_operations/{name}` 92 | 93 | Gets the latest state of a long-running operation. Clients can use this method to poll the operation result at some intervals. 94 | 95 | |UrlParam|Description| 96 | |--|--| 97 | |name| string
The name of the KWSOperation in the [response](#responsebody)| 98 | 99 | Returns the [response](#responsebody) again with latest state. 100 | 101 | ### Sample Request and Response 102 | Request: 103 | ```bash 104 | curl -X POST 'https://asr.vernacular.ai/v1/kws:longrunningrecognize' \ 105 | --header 'X-ACCESS-TOKEN: {{access-token}}' \ 106 | --header 'Content-Type: application/json' \ 107 | --data-raw '{ 108 | "config": { 109 | "encoding": "LINEAR16", 110 | "sampleRateHertz": 8000, 111 | "languageCode": "en-IN", 112 | }, 113 | "keywords": ["balance", "credit"], 114 | "audio": { 115 | "uri": "https://audio-url.wav" 116 | } 117 | }' 118 | ``` 119 | Response: 120 | ```json 121 | { 122 | "name": "3498b75d-4d81-4e96-a83d-129cd8c6b0f5", 123 | "done": false 124 | } 125 | ``` 126 | 127 | If you poll the operation request 128 | 129 | Request: 130 | ```bash 131 | curl -X GET 'https://asr.vernacular.ai/v1/kws_operations/3498b75d-4d81-4e96-a83d-129cd8c6b0f5' \ 132 | --header 'X-ACCESS-TOKEN: {{access-token}}' \ 133 | --header 'Content-Type: application/json' 134 | ``` 135 | 136 | Response on completion: 137 | ```json 138 | { 139 | "name": "3498b75d-4d81-4e96-a83d-129cd8c6b0f5", 140 | "done": true, 141 | "results": [ 142 | { 143 | "transcript": "i want to know about my credit card balance", 144 | "matched_words": [ 145 | { 146 | "start_time": 3.34, 147 | "end_time": 3.70, 148 | "word": "credit" 149 | }, 150 | { 151 | "start_time": 3.84, 152 | "end_time": 4.23, 153 | "word": "balance" 154 | } 155 | ] 156 | } 157 | ] 158 | } 159 | ``` -------------------------------------------------------------------------------- /docs/kws/RecognitionConfig.md: -------------------------------------------------------------------------------- 1 | # RecognitionConfig 2 | Provides information to the recognizer that specifies how to process the request. 3 | 4 | ```js 5 | { 6 | "encoding": enum (AudioEncoding), 7 | "sampleRateHertz": integer, 8 | "audioChannelCount": integer, 9 | "enableSeparateRecognitionPerChannel": boolean, 10 | "languageCode": string, 11 | } 12 | ``` 13 | 14 | | Field | Description | 15 | |---|---| 16 | | encoding | enum ([AudioEncoding](#audioencoding))
Encoding of audio data sent in all RecognitionAudio messages. This field is optional for FLAC and WAV audio files and required for all other audio formats. For details, see AudioEncoding.| 17 | | sampleRateHertz | integer
Sample rate in Hertz of the audio data sent in all RecognitionAudio messages. For now we only support 8000Hz. In case your audio is of any other sampling rate, consider resampling to 8000Hz. | 18 | | audioChannelCount | integer
The number of channels in the input audio data. ONLY set this for MULTI-CHANNEL recognition. Valid values for LINEAR16 and FLAC are 1-8. If 0 or omitted, defaults to one channel (mono).
**Note**: We only recognize the first channel by default. To perform independent recognition on each channel set enableSeparateRecognitionPerChannel to 'true'. | 19 | | enableSeparateRecognitionPerChannel | boolean
This needs to be set to true explicitly and audioChannelCount > 1 to get each channel recognized separately. The recognition result will contain a channelTag field to state which channel that result belongs to. If this is not true, we will only recognize the first channel. The request is billed cumulatively for all channels recognized: audioChannelCount multiplied by the length of the audio. | 20 | | languageCode | string
Required. The language of the supplied audio as a BCP-47 language tag. Example: "en-IN". See [Language Support](#languagesupport) for a list of the currently supported language codes. | 21 | 22 | 23 | ## AudioEncoding 24 | The encoding of the audio data sent in the request. 25 | 26 | All encodings support only 1 channel (mono) audio, unless the audioChannelCount and enableSeparateRecognitionPerChannel fields are set. 27 | 28 | For best results, the audio source should be captured and transmitted using a lossless encoding (FLAC or LINEAR16). The accuracy of the speech recognition can be reduced if lossy codecs are used to capture or transmit audio, particularly if background noise is present. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, and MP3. 29 | 30 | 31 | | Format | Description | 32 | |--|--| 33 | |LINEAR16| Uncompressed 16-bit signed little-endian samples (Linear PCM).| 34 | 35 | 36 | ## LanguageSupport 37 | Vernacular ASR only supports indian languages for now. Use these language codes for following languages. 38 | |Language| Code | 39 | |--|--| 40 | |Hindi | hi-IN | 41 | |English | en-IN | 42 | |Kannada | kn-IN | 43 | |Malayalam| ml-IN| 44 | |Bengali | bn-IN| 45 | |Marathi | mr-IN | 46 | |Gujarati | gu-IN | 47 | |Punjabi | pa-IN | 48 | |Telugu | te-IN| 49 | |Tamil | ta-IN| 50 | -------------------------------------------------------------------------------- /docs/kws/Recognize.md: -------------------------------------------------------------------------------- 1 | # Recognize 2 | Performs a synchronous speech recognition i.e receive results after all audio has been sent and processed. 3 | 4 | ### Request Method 5 | `POST https://asr.vernacular.ai/v1/kws:recognize` 6 | 7 | ### Request Headers 8 | ``` 9 | X-ACCESS-TOKEN: some-access-token 10 | content-type: application/json 11 | ``` 12 | 13 | ### Request Body 14 | The request body contains data with the following structure: 15 | 16 | ```js 17 | { 18 | "config": { 19 | object (RecognitionConfig) 20 | }, 21 | "audio": { 22 | object (RecognitionAudio) 23 | }, 24 | "keywords": [string], 25 | } 26 | ``` 27 | 28 | |Fields|| 29 | |--|--| 30 | |config|object ([RecognitionConfig](./RecognitionConfig.md))
Required. Provides information to the recognizer that specifies how to process the request.| 31 | |audio|object ([RecognitionAudio](../types/RecognitionAudio.md))
Required. The audio data to be recognized.| 32 | |keywords| Array of strings that needs to be searched for | 33 | 34 | ### Response Body 35 | If successful, the response body contains data with the following structure: 36 | 37 | The only message returned to the client by the recognize method. 38 | 39 | ```js 40 | { 41 | "results": [ 42 | { 43 | "transcript": string, 44 | "matched_words": [ 45 | { 46 | "start_time": float, 47 | "end_time": float, 48 | "word": string 49 | } 50 | ] 51 | } 52 | ] 53 | } 54 | ``` 55 | 56 | ### Sample Request and Response 57 | Request: 58 | ```bash 59 | curl -X POST 'https://asr.vernacular.ai/v1/kws:recognize' \ 60 | --header 'X-ACCESS-TOKEN: {{access-token}}' \ 61 | --header 'Content-Type: application/json' \ 62 | --data-raw '{ 63 | "config": { 64 | "encoding": "LINEAR16", 65 | "sampleRateHertz": 8000, 66 | "languageCode": "en-IN", 67 | }, 68 | "keywords": ["balance", "credit"], 69 | "audio": { 70 | "uri": "https://audio-url.wav" 71 | } 72 | }' 73 | ``` 74 | Response: 75 | ```json 76 | { 77 | "results": [ 78 | { 79 | "transcript": "i want to know about my credit card balance", 80 | "matched_words": [ 81 | { 82 | "start_time": 3.34, 83 | "end_time": 3.70, 84 | "word": "credit" 85 | }, 86 | { 87 | "start_time": 3.84, 88 | "end_time": 4.23, 89 | "word": "balance" 90 | } 91 | ] 92 | } 93 | ] 94 | } 95 | ``` 96 | -------------------------------------------------------------------------------- /docs/long-audios.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/docs/long-audios.md -------------------------------------------------------------------------------- /docs/rpc_reference/LongRunningRecognize.md: -------------------------------------------------------------------------------- 1 | # LongRunningRecognize 2 | 3 | `rpc LongRunningRecognize`([LongRunningRecognizeRequest](#longrunningrecognizerequest)) `returns `([SpeechOperation](#speechoperation)) 4 | 5 | Performs asynchronous speech recognition: receive results via SpeechOperation. Returns either an SpeechOperation.error or an SpeechOperation.response which contains a [LongRunningRecognizeResponse](#longrunningrecognizeresponse) message. 6 | 7 | To get latest state of speech operation, you can poll for the result using [GetSpeechOperation](#getspeechoperation) rpc. 8 | 9 | 10 | ## GetSpeechOperation 11 | 12 | `rpc GetSpeechOperation`([SpeechOperationRequest](#speechoperationrequest))` returns `([SpeechOperation](#speechoperation)) 13 | 14 | Gets the latest state of a long-running operation. Clients can use this method to poll the operation result at some intervals. 15 | 16 | #### SpeechOperationRequest 17 | |Fields|Description| 18 | |--|--| 19 | |name| string
The name of the SpeechOperation, i.e. [SpeechOperation.name](#speechoperation)| 20 | 21 | 22 | ## LongRunningRecognizeRequest 23 | The top-level message sent by the client for the LongRunningRecognize method. 24 | 25 | |Fields|Description| 26 | |--|--| 27 | |config | [RecognitionConfig](../types/RecognitionConfig.md)
Required. Provides information to the recognizer that specifies how to process the request.| 28 | |audio | [RecognitionAudio](../types/RecognitionAudio.md)
Required. The audio data to be recognized.| 29 | |result_url | string
Optional. Post the results to this url when done. Url must be accessible through our servers.| 30 | 31 | ## SpeechOperation 32 | An intermediate operation object which contains response upon completion 33 | 34 | |Fields| Description | 35 | |--|--| 36 | |name| string
The server-assigned name, which is only unique within the same service that originally returns it.| 37 | |done| bool
If the value is false, it means the operation is still in progress. If true, the operation is completed, and either error or response is available.| 38 | |result| Union field
The operation result, which can be either an error or a valid response. If done == false, neither error nor response is set. If done == true, exactly one of error or response is set. See below for more details.| 39 | 40 | `result` can be only one of the following: 41 | | Field | Description | 42 | |--|--| 43 | |error| google.rpc.Status
The error result of the operation in case of failure or cancellation.| 44 | |response| [LongRunningRecognizeResponse](#longrunningrecognizeresponse)
The normal response of the operation in case of success.| 45 | 46 | ## LongRunningRecognizeResponse 47 | The only message returned to the client by the LongRunningRecognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages. 48 | 49 | |Fields | Description| 50 | |--|--| 51 | |results[] | [SpeechRecognitionResult](../types/SpeechRecognitionResult.md)
Sequential list of transcription results corresponding to sequential portions of audio.| 52 | -------------------------------------------------------------------------------- /docs/rpc_reference/Recognize.md: -------------------------------------------------------------------------------- 1 | # Recognize 2 | 3 | `rpc Recognize`([RecognizeRequest](#recognizerequest)) `returns `([RecognizeResponse](#recognizeresponse)) 4 | 5 | Performs synchronous speech recognition: receive results after all audio has been sent and processed. 6 | 7 | **Note**: Audios more than 60 seconds do not work with sync Recognize. Use [LongRunningRecognize](LongRunningRecognize.md) for long audios. 8 | 9 | ## RecognizeRequest 10 | The top-level message sent by the client for the Recognize method. 11 | 12 | |Fields|Description| 13 | |--|--| 14 | |config | [RecognitionConfig](../types/RecognitionConfig.md)
Required. Provides information to the recognizer that specifies how to process the request.| 15 | |audio | [RecognitionAudio](../types/RecognitionAudio.md)
Required. The audio data to be recognized.| 16 | 17 | ## RecognizeResponse 18 | The only message returned to the client by the Recognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages. 19 | 20 | |Fields | Description| 21 | |--|--| 22 | |results[] | [SpeechRecognitionResult](../types/SpeechRecognitionResult.md)
Sequential list of transcription results corresponding to sequential portions of audio.| 23 | -------------------------------------------------------------------------------- /docs/rpc_reference/StreamingRecognize.md: -------------------------------------------------------------------------------- 1 | # StreamingRecognize 2 | 3 | `rpc StreamingRecognize`([StreamingRecognizeRequest](#streamingrecognizerequest))` returns `([StreamingRecognizeResponse](#streamingrecognizeresponse)) 4 | 5 | Performs bidirectional streaming speech recognition: receive results while sending audio. This method is only available via the gRPC API (not REST API). 6 | 7 | ## StreamingRecognizeRequest 8 | The top-level message sent by the client for the StreamingRecognize method. Multiple StreamingRecognizeRequest messages are sent. The first message must contain a `streaming_config` message and must not contain `audio_content`. All subsequent messages must contain audio_content and must not contain a streaming_config message. 9 | 10 | Union field streaming_request. The streaming request, which is either a streaming config or audio content. 11 | `streaming_request` can be only one of the following: 12 | 13 | |Fields|Description| 14 | |--|--| 15 | |streaming_config | [StreamingRecognitionConfig](#streamingrecognitionconfig)
Provides information to the recognizer that specifies how to process the request. The first StreamingRecognizeRequest message must contain a streaming_config message.| 16 | |audio_content | bytes
The audio data to be recognized. Sequential chunks of audio data are sent in sequential StreamingRecognizeRequest messages. The first StreamingRecognizeRequest message must not contain audio_content data and all subsequent StreamingRecognizeRequest messages must contain audio_content data. The audio bytes must be encoded as specified in [RecognitionConfig](../types/RecognitionConfig.md). Note: as with all bytes fields, proto buffers use a pure binary representation (not base64).| 17 | 18 | ## StreamingRecognizeResponse 19 | StreamingRecognizeResponse is the only message returned to the client by StreamingRecognize. A series of zero or more StreamingRecognizeResponse messages are streamed back to the client. If there is no recognizable audio, and single_utterance is set to false, then no messages are streamed back to the client. 20 | 21 | Here's an example of a series of ten StreamingRecognizeResponses that might be returned while processing audio: 22 | 23 | ``` 24 | results { alternatives { transcript: "tube" } stability: 0.01 } 25 | 26 | results { alternatives { transcript: "to be a" } stability: 0.01 } 27 | 28 | results { alternatives { transcript: "to be" } stability: 0.9 } results { alternatives { transcript: " or not to be" } stability: 0.01 } 29 | 30 | results { alternatives { transcript: "to be or not to be" confidence: 0.92 } alternatives { transcript: "to bee or not to bee" } is_final: true } 31 | 32 | results { alternatives { transcript: " that's" } stability: 0.01 } 33 | 34 | results { alternatives { transcript: " that is" } stability: 0.9 } results { alternatives { transcript: " the question" } stability: 0.01 } 35 | 36 | results { alternatives { transcript: " that is the question" confidence: 0.98 } alternatives { transcript: " that was the question" } is_final: true } 37 | ``` 38 | 39 | Notes: 40 | 41 | Only two of the above responses #4 and #7 contain final results; they are indicated by is_final: true. Concatenating these together generates the full transcript: "to be or not to be that is the question". 42 | 43 | The others contain interim results. #3 and #6 contain two interim results: the first portion has a high stability and is less likely to change; the second portion has a low stability and is very likely to change. A UI designer might choose to show only high stability results. 44 | 45 | The specific stability and confidence values shown above are only for illustrative purposes. Actual values may vary. 46 | 47 | In each response, only one of these fields will be set: error or one or more (repeated) results. 48 | 49 | |Fields| Description| 50 | |--|--| 51 | |error| Status
If set, returns a google.rpc.Status message that specifies the error for the operation.| 52 | |results[]| [StreamingRecognitionResult](#streamingrecognitionresult)
This repeated list contains zero or more results that correspond to consecutive portions of the audio currently being processed. It contains zero or one is_final=true result (the newly settled portion), followed by zero or more is_final=false results (the interim results).| 53 | 54 | ## StreamingRecognitionConfig 55 | Provides information to the recognizer that specifies how to process the request. 56 | 57 | |Fields|Description| 58 | |--|--| 59 | |config | [RecognitionConfig](../types/RecognitionConfig.md)
Required. Provides information to the recognizer that specifies how to process the request.| 60 | |interim_results| bool
If true, interim results (tentative hypotheses) may be returned as they become available (these interim results are indicated with the is_final=false flag). If false or omitted, only is_final=true result(s) are returned| 61 | |silence_detection_config| [SilenceDetectionConfig](#silencedetectionconfig)
Optional. Add silence detection config for enabling silence detection.| 62 | 63 | Note: For now `interim_results` will not work. You will only get a final response. 64 | 65 | ## SilenceDetectionConfig 66 | |Fields|Description| 67 | |--|--| 68 | |enable_silence_detection|bool
If true, enables silence detection from the server side.| 69 | |max_speech_timeout|float
Maximum number of seconds for which recognition should go on. For example, for a value of 5, streaming will end after 5 seconds regardless of whether the person is speaking or not. Set it to -1 to disable this.| 70 | |silence_patience| float
Wait for this many seconds of silence after voice activity has been detected, before firing off the silence-detected event. Usually 1.5 to 2 is a good value to set.| 71 | |no_input_timeout|float
Wait for this many seconds if no voice activity is detected before firing the silence-detected event. For example, if set to 5 seconds, the detector will wait 5 seconds for any voice activity and then end the stream. This prevents an endless stream when there is no voice activity. Usually 3-5 seconds is a good range for this.| 72 | 73 | ## StreamingRecognitionResult 74 | A streaming speech recognition result corresponding to a portion of the audio that is currently being processed. 75 | 76 | |Fields|Description| 77 | |--|--| 78 | |alternatives[] | [SpeechRecognitionAlternative](../types/SpeechRecognitionResult.md#speechrecognitionalternative)
May contain one or more recognition hypotheses (up to the maximum specified in max_alternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.| 79 | |is_final | bool
If false, this StreamingRecognitionResult represents an interim result that may change. If true, this is the final time the speech service will return this particular StreamingRecognitionResult, the recognizer will not return any further hypotheses for this portion of the transcript and corresponding audio.| 80 | |stability | float
An estimate of the likelihood that the recognizer will not change its guess about this interim result. Values range from 0.0 (completely unstable) to 1.0 (completely stable). This field is only provided for interim results (is_final=false). The default of 0.0 is a sentinel value indicating stability was not set.| 81 | |result_end_time | Duration
Time offset of the end of this result relative to the beginning of the audio.| 82 | |channel_tag | int32
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from '1' to 'N'.| 83 | -------------------------------------------------------------------------------- /docs/short-audios.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/docs/short-audios.md -------------------------------------------------------------------------------- /docs/streaming-audios.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/docs/streaming-audios.md -------------------------------------------------------------------------------- /docs/types/RecognitionAudio.md: -------------------------------------------------------------------------------- 1 | # RecognitionAudio 2 | Contains audio data in the encoding specified in the RecognitionConfig. Either content or uri must be supplied. Supplying both or neither returns error. 3 | 4 | ```js 5 | { 6 | 7 | // Union field audio_source can be only one of the following: 8 | "content": string, 9 | "uri": string 10 | } 11 | ``` 12 | 13 | | Field | Description | 14 | |---|---| 15 | | content | string (bytes format)
The audio data bytes encoded as specified in RecognitionConfig. Note: as with all bytes fields, proto buffers use a pure binary representation, whereas JSON representations use [base64](https://en.wikipedia.org/wiki/Base64). | 16 | | uri | string
URI that points to a file that contains audio data bytes as specified in RecognitionConfig. The file must not be compressed (for example, gzip). Url must be publicly accessible. | 17 | 18 | 19 | ## Encoding to base64 20 | Use base64 command to convert 21 | ```shell 22 | base64 source_audio_file -w 0 > dest_audio_file 23 | ``` 24 | 25 | ### Content Limits 26 | 27 | The API contains the following limits on the size of audio in content field: 28 | 29 | | Content Limit |Audio Length | Audio Size | 30 | |---|---|---| 31 | |Synchronous Requests |~1 Minute | max 4Mb | 32 | |Streaming Requests | ~5 Minutes | - | 33 | |Asynchronous Requests| ~480 Minutes | - | 34 | -------------------------------------------------------------------------------- /docs/types/RecognitionConfig.md: -------------------------------------------------------------------------------- 1 | # RecognitionConfig 2 | Provides information to the recognizer that specifies how to process the request. 3 | 4 | ```js 5 | { 6 | "encoding": enum (AudioEncoding), 7 | "sampleRateHertz": integer, 8 | "audioChannelCount": integer, 9 | "enableSeparateRecognitionPerChannel": boolean, 10 | "languageCode": string, 11 | "maxAlternatives": integer, 12 | "speechContexts": [ 13 | { 14 | object (SpeechContext) 15 | } 16 | ], 17 | "enableWordTimeOffsets": boolean, 18 | "diarizationConfig": { 19 | object (SpeakerDiarizationConfig) 20 | }, 21 | } 22 | ``` 23 | 24 | | Field | Description | 25 | |---|---| 26 | | encoding | enum ([AudioEncoding](#audioencoding))
Encoding of audio data sent in all RecognitionAudio messages. This field is optional for FLAC and WAV audio files and required for all other audio formats. For details, see AudioEncoding.| 27 | | sampleRateHertz | integer
Sample rate in Hertz of the audio data sent in all RecognitionAudio messages. For now we only support 8000Hz. In case your audio is of any other sampling rate, consider resampling to 8000Hz. | 28 | | audioChannelCount | integer
The number of channels in the input audio data. ONLY set this for MULTI-CHANNEL recognition. Valid values for LINEAR16 and FLAC are 1-8. If 0 or omitted, defaults to one channel (mono).
**Note**: We only recognize the first channel by default. To perform independent recognition on each channel set enableSeparateRecognitionPerChannel to 'true'. | 29 | | enableSeparateRecognitionPerChannel | boolean
This needs to be set to true explicitly and audioChannelCount > 1 to get each channel recognized separately. The recognition result will contain a channelTag field to state which channel that result belongs to. If this is not true, we will only recognize the first channel. The request is billed cumulatively for all channels recognized: audioChannelCount multiplied by the length of the audio. | 30 | | languageCode | string
Required. The language of the supplied audio as a BCP-47 language tag. Example: "en-IN". See [Language Support](#languagesupport) for a list of the currently supported language codes. | 31 | | maxAlternatives | integer
Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of SpeechRecognitionAlternative messages within each SpeechRecognitionResult. The server may return fewer than maxAlternatives. Valid values are 0-10. A value of 0 or 1 will return a maximum of one. If omitted, will return a maximum of one. | 32 | |speechContexts[] | object ([SpeechContext](#speechcontext))
Array of SpeechContext. This feature is experimental and may not work as of now. We do support biasing of models with customer specific terminology so this may not be needed. | 33 | | diarizationConfig | object ([SpeakerDiarizationConfig](#speakerdiarizationconfig))
Config to enable speaker diarization and set additional parameters to make diarization better suited for your application. This feature is experimental and may not work for now.| 34 | | enableWordTimeOffsets | boolean
If true, the top result includes a list of words and the start and end time offsets (timestamps) for those words. If false, no word-level time offset information is returned. The default is false. | 35 | 36 | 37 | ## AudioEncoding 38 | The encoding of the audio data sent in the request. 39 | 40 | All encodings support only 1 channel (mono) audio, unless the audioChannelCount and enableSeparateRecognitionPerChannel fields are set. 41 | 42 | We support wav \[LINEAR16\] and mp3 \[MP3\] right now. 43 | 44 | For best results, the audio source should be captured and transmitted using a lossless encoding (LINEAR16). The accuracy of the speech recognition can be reduced if lossy codecs are used to capture or transmit audio, particularly if background noise is present. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, and MP3. 45 | 46 | 47 | #### Enums 48 | | Format | Description | 49 | |--|--| 50 | |LINEAR16| Uncompressed 16-bit signed little-endian samples (Linear PCM).| 51 | |MP3| Compressed mp3 encoded stream| 52 | 53 | 54 | ## LanguageSupport 55 | Vernacular ASR only supports indian languages for now. Use these language codes for following languages. 56 | |Language| Code | 57 | |--|--| 58 | |Hindi | hi-IN | 59 | |English | en-IN | 60 | |Kannada | kn-IN | 61 | |Malayalam| ml-IN| 62 | |Bengali | bn-IN| 63 | |Marathi | mr-IN | 64 | |Gujarati | gu-IN | 65 | |Punjabi | pa-IN | 66 | |Telugu | te-IN| 67 | |Tamil | ta-IN| 68 | 69 | 70 | ## SpeechContext 71 | Provides **hints** to the speech recognizer to favor specific words and phrases in the results. 72 | 73 | ```js 74 | { 75 | "phrases": [ 76 | string 77 | ] 78 | } 79 | ``` 80 | 81 | | Field | Description | 82 | |--|--| 83 | | phrases[] | string
A list of strings containing words and phrases "hints" so that the speech recognition is more likely to recognize them. This can be used to improve the accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. This can also be used to add additional words to the vocabulary of the recognizer. See usage limits. | 84 | 85 | 86 | ## SpeakerDiarizationConfig 87 | Config to enable speaker diarization. 88 | 89 | ```js 90 | { 91 | "enableSpeakerDiarization": boolean, 92 | "minSpeakerCount": integer, 93 | "maxSpeakerCount": integer, 94 | "speakerTag": integer 95 | } 96 | ``` 97 | 98 | | Field | Description | 99 | |--|--| 100 | | enableSpeakerDiarization | boolean
If **true**, enables speaker detection for each recognized word in the top alternative of the recognition result. | 101 | | minSpeakerCount | integer
Minimum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 2. | 102 | | maxSpeakerCount | integer
Maximum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 6. | 103 | -------------------------------------------------------------------------------- /docs/types/SpeechRecognitionResult.md: -------------------------------------------------------------------------------- 1 | # SpeechRecognitionResult 2 | A speech recognition result corresponding to a portion of the audio. 3 | 4 | ```js 5 | { 6 | "alternatives": [ 7 | { 8 | object (SpeechRecognitionAlternative) 9 | } 10 | ], 11 | "channelTag": integer 12 | } 13 | ``` 14 | 15 | |Fields | Description | 16 | |--|--| 17 | |alternatives[] | object (SpeechRecognitionAlternative)
May contain one or more recognition hypotheses (up to the maximum specified in maxAlternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.| 18 | |channelTag | integer
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audioChannelCount = N, its output values can range from '1' to 'N'.| 19 | 20 | # SpeechRecognitionAlternative 21 | Alternative hypotheses (a.k.a. n-best list). 22 | 23 | ```js 24 | { 25 | "transcript": string, 26 | "confidence": number, 27 | "words": [ 28 | { 29 | object (WordInfo) 30 | } 31 | ] 32 | } 33 | ``` 34 | 35 | |Fields | Description | 36 | |--|--| 37 | |transcript | string
Transcript text representing the words that the user spoke.| 38 | |confidence | number
The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result or, of a streaming result where isFinal=true. This field is not guaranteed to be accurate and users should not rely on it to be always provided. The default of 0.0 is a sentinel value indicating confidence was not set.| 39 | |words[] | object ([WordInfo](#wordinfo))
A list of word-specific information for each recognized word. Note: When enableSpeakerDiarization is true, you will see all the words from the beginning of the audio.| 40 | 41 | # WordInfo 42 | Word-specific information for recognized words. 43 | 44 | ```js 45 | { 46 | "startTime": string, 47 | "endTime": string, 48 | "word": string, 49 | "speakerTag": integer 50 | } 51 | ``` 52 | 53 | |Fields| Description | 54 | |--|--| 55 | | startTime | string (Duration format)
Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word. This field is only set if enableWordTimeOffsets=true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.| 56 | |endTime| string (Duration format)
Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word. This field is only set if enableWordTimeOffsets=true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.| 57 | | word | string
The word corresponding to this set of information.| 58 | |speakerTag | integer
Output only. A distinct integer value is assigned for every speaker within the audio. This field specifies which one of those speakers was detected to have spoken this word. Value ranges from '1' to diarizationSpeakerCount. speakerTag is set if enableSpeakerDiarization = 'true' and only in the top alternative.| 59 | -------------------------------------------------------------------------------- /go/README.md: -------------------------------------------------------------------------------- 1 | # Go Speech to Text SDK 2 | -------------------------------------------------------------------------------- /java/.gitignore: -------------------------------------------------------------------------------- 1 | **/.classpath 2 | **/.settings 3 | **/.project 4 | 5 | # Maven 6 | target/ 7 | 8 | # gradle 9 | .gradle 10 | 11 | # Intellij 12 | *.iml 13 | .idea/ 14 | 15 | # build 16 | **/build/ 17 | **/bin/ 18 | -------------------------------------------------------------------------------- /java/README.md: -------------------------------------------------------------------------------- 1 | # Java Speech to Text SDK 2 | Java SDK for vernacular.ai speech to text APIs. Go [here](https://github.com/Vernacular-ai/speech-recognition) for detailed product documentation. 3 | 4 | 5 | ## Installation 6 | If you are using Maven, add this to your `pom.xml`: 7 | 8 | ```xml 9 | <dependency> 10 |   <groupId>ai.vernacular.speech</groupId> 11 |   <artifactId>vernacular-ai-speech</artifactId> 12 |   <version>0.1.0</version> 13 | </dependency> 14 | ``` 15 | 16 | If you are using Gradle, add this to your dependencies: 17 | 18 | ```gradle 19 | compile 'ai.vernacular.speech:vernacular-ai-speech:0.1.0' 20 | ``` 21 | 22 | ## Example Usage 23 | This example shows how to recognize speech from an audio URL. First, add these imports to the top of your Java file. 24 | 25 | ```java 26 | import ai.vernacular.speech.SpeechClient; 27 | import ai.vernacular.speech.RecognitionAudio; 28 | import ai.vernacular.speech.RecognitionConfig; 29 | import ai.vernacular.speech.RecognitionConfig.AudioEncoding; 30 | import ai.vernacular.speech.RecognizeResponse; 31 | ``` 32 | 33 | Then add this code to recognize the audio. 34 | 35 | ```java 36 | try (SpeechClient speechClient = SpeechClient.create(accessToken)) { 37 | RecognitionConfig.AudioEncoding encoding = RecognitionConfig.AudioEncoding.LINEAR16; 38 | int sampleRateHertz = 8000; 39 | String languageCode = "en-IN"; 40 | RecognitionConfig config = RecognitionConfig.newBuilder() 41 | .setEncoding(encoding) 42 | .setSampleRateHertz(sampleRateHertz) 43 | .setLanguageCode(languageCode) 44 | .build(); 45 | String uri = "https://url/to/audio.wav"; 46 | RecognitionAudio audio = RecognitionAudio.newBuilder() 47 | .setUri(uri) 48 | .build(); 49 | RecognizeResponse response = speechClient.recognize(config, audio); 50 | } 51 | ``` 52 | 53 | To see more examples, go to [samples](https://github.com/Vernacular-ai/speech-recognition/tree/master/java/samples). 
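The returned `RecognizeResponse` holds the transcription results. A minimal sketch of printing them is shown below — it assumes the result and alternative messages live in the same `ai.vernacular.speech` package and expose the standard protobuf-generated accessors (`getResultsList()`, `getAlternativesList()`, `getTranscript()`, `getConfidence()`); the exact names depend on the classes generated from `speech-to-text.proto`.

```java
// Assumes ai.vernacular.speech.SpeechRecognitionResult and
// ai.vernacular.speech.SpeechRecognitionAlternative are also imported.
for (SpeechRecognitionResult result : response.getResultsList()) {
    for (SpeechRecognitionAlternative alternative : result.getAlternativesList()) {
        // Each alternative is one recognition hypothesis with a transcript and confidence score.
        System.out.printf("transcript: %s (confidence: %f)%n",
                alternative.getTranscript(), alternative.getConfidence());
    }
}
```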
54 | 55 | To run a sample: 56 | 57 | ```shell 58 | ./gradlew :samples:run -Pexample=RecognizeSync 59 | ``` -------------------------------------------------------------------------------- /java/build.gradle: -------------------------------------------------------------------------------- 1 | buildscript { 2 | repositories { 3 | mavenCentral() 4 | } 5 | dependencies { 6 | classpath 'com.google.protobuf:protobuf-gradle-plugin:0.8.11' 7 | } 8 | } 9 | 10 | allprojects { 11 | repositories { 12 | mavenCentral() 13 | } 14 | 15 | apply plugin: 'java' 16 | apply plugin: 'com.google.protobuf' 17 | 18 | version '0.1.0' 19 | } 20 | -------------------------------------------------------------------------------- /java/gradle/wrapper/gradle-wrapper.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/java/gradle/wrapper/gradle-wrapper.jar -------------------------------------------------------------------------------- /java/gradle/wrapper/gradle-wrapper.properties: -------------------------------------------------------------------------------- 1 | distributionBase=GRADLE_USER_HOME 2 | distributionPath=wrapper/dists 3 | distributionUrl=https\://services.gradle.org/distributions/gradle-6.2.2-bin.zip 4 | zipStoreBase=GRADLE_USER_HOME 5 | zipStorePath=wrapper/dists 6 | -------------------------------------------------------------------------------- /java/gradlew: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env sh 2 | 3 | # 4 | # Copyright 2015 the original author or authors. 5 | # 6 | # Licensed under the Apache License, Version 2.0 (the "License"); 7 | # you may not use this file except in compliance with the License. 8 | # You may obtain a copy of the License at 9 | # 10 | # https://www.apache.org/licenses/LICENSE-2.0 11 | # 12 | # Unless required by applicable law or agreed to in writing, software 13 | # distributed under the License is distributed on an "AS IS" BASIS, 14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 15 | # See the License for the specific language governing permissions and 16 | # limitations under the License. 17 | # 18 | 19 | ############################################################################## 20 | ## 21 | ## Gradle start up script for UN*X 22 | ## 23 | ############################################################################## 24 | 25 | # Attempt to set APP_HOME 26 | # Resolve links: $0 may be a link 27 | PRG="$0" 28 | # Need this for relative symlinks. 29 | while [ -h "$PRG" ] ; do 30 | ls=`ls -ld "$PRG"` 31 | link=`expr "$ls" : '.*-> \(.*\)$'` 32 | if expr "$link" : '/.*' > /dev/null; then 33 | PRG="$link" 34 | else 35 | PRG=`dirname "$PRG"`"/$link" 36 | fi 37 | done 38 | SAVED="`pwd`" 39 | cd "`dirname \"$PRG\"`/" >/dev/null 40 | APP_HOME="`pwd -P`" 41 | cd "$SAVED" >/dev/null 42 | 43 | APP_NAME="Gradle" 44 | APP_BASE_NAME=`basename "$0"` 45 | 46 | # Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script. 47 | DEFAULT_JVM_OPTS='"-Xmx64m" "-Xms64m"' 48 | 49 | # Use the maximum available, or set MAX_FD != -1 to use that value. 50 | MAX_FD="maximum" 51 | 52 | warn () { 53 | echo "$*" 54 | } 55 | 56 | die () { 57 | echo 58 | echo "$*" 59 | echo 60 | exit 1 61 | } 62 | 63 | # OS specific support (must be 'true' or 'false'). 
64 | cygwin=false 65 | msys=false 66 | darwin=false 67 | nonstop=false 68 | case "`uname`" in 69 | CYGWIN* ) 70 | cygwin=true 71 | ;; 72 | Darwin* ) 73 | darwin=true 74 | ;; 75 | MINGW* ) 76 | msys=true 77 | ;; 78 | NONSTOP* ) 79 | nonstop=true 80 | ;; 81 | esac 82 | 83 | CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar 84 | 85 | # Determine the Java command to use to start the JVM. 86 | if [ -n "$JAVA_HOME" ] ; then 87 | if [ -x "$JAVA_HOME/jre/sh/java" ] ; then 88 | # IBM's JDK on AIX uses strange locations for the executables 89 | JAVACMD="$JAVA_HOME/jre/sh/java" 90 | else 91 | JAVACMD="$JAVA_HOME/bin/java" 92 | fi 93 | if [ ! -x "$JAVACMD" ] ; then 94 | die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME 95 | 96 | Please set the JAVA_HOME variable in your environment to match the 97 | location of your Java installation." 98 | fi 99 | else 100 | JAVACMD="java" 101 | which java >/dev/null 2>&1 || die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH. 102 | 103 | Please set the JAVA_HOME variable in your environment to match the 104 | location of your Java installation." 105 | fi 106 | 107 | # Increase the maximum file descriptors if we can. 108 | if [ "$cygwin" = "false" -a "$darwin" = "false" -a "$nonstop" = "false" ] ; then 109 | MAX_FD_LIMIT=`ulimit -H -n` 110 | if [ $? -eq 0 ] ; then 111 | if [ "$MAX_FD" = "maximum" -o "$MAX_FD" = "max" ] ; then 112 | MAX_FD="$MAX_FD_LIMIT" 113 | fi 114 | ulimit -n $MAX_FD 115 | if [ $? -ne 0 ] ; then 116 | warn "Could not set maximum file descriptor limit: $MAX_FD" 117 | fi 118 | else 119 | warn "Could not query maximum file descriptor limit: $MAX_FD_LIMIT" 120 | fi 121 | fi 122 | 123 | # For Darwin, add options to specify how the application appears in the dock 124 | if $darwin; then 125 | GRADLE_OPTS="$GRADLE_OPTS \"-Xdock:name=$APP_NAME\" \"-Xdock:icon=$APP_HOME/media/gradle.icns\"" 126 | fi 127 | 128 | # For Cygwin or MSYS, switch paths to Windows format before running java 129 | if [ "$cygwin" = "true" -o "$msys" = "true" ] ; then 130 | APP_HOME=`cygpath --path --mixed "$APP_HOME"` 131 | CLASSPATH=`cygpath --path --mixed "$CLASSPATH"` 132 | JAVACMD=`cygpath --unix "$JAVACMD"` 133 | 134 | # We build the pattern for arguments to be converted via cygpath 135 | ROOTDIRSRAW=`find -L / -maxdepth 1 -mindepth 1 -type d 2>/dev/null` 136 | SEP="" 137 | for dir in $ROOTDIRSRAW ; do 138 | ROOTDIRS="$ROOTDIRS$SEP$dir" 139 | SEP="|" 140 | done 141 | OURCYGPATTERN="(^($ROOTDIRS))" 142 | # Add a user-defined pattern to the cygpath arguments 143 | if [ "$GRADLE_CYGPATTERN" != "" ] ; then 144 | OURCYGPATTERN="$OURCYGPATTERN|($GRADLE_CYGPATTERN)" 145 | fi 146 | # Now convert the arguments - kludge to limit ourselves to /bin/sh 147 | i=0 148 | for arg in "$@" ; do 149 | CHECK=`echo "$arg"|egrep -c "$OURCYGPATTERN" -` 150 | CHECK2=`echo "$arg"|egrep -c "^-"` ### Determine if an option 151 | 152 | if [ $CHECK -ne 0 ] && [ $CHECK2 -eq 0 ] ; then ### Added a condition 153 | eval `echo args$i`=`cygpath --path --ignore --mixed "$arg"` 154 | else 155 | eval `echo args$i`="\"$arg\"" 156 | fi 157 | i=`expr $i + 1` 158 | done 159 | case $i in 160 | 0) set -- ;; 161 | 1) set -- "$args0" ;; 162 | 2) set -- "$args0" "$args1" ;; 163 | 3) set -- "$args0" "$args1" "$args2" ;; 164 | 4) set -- "$args0" "$args1" "$args2" "$args3" ;; 165 | 5) set -- "$args0" "$args1" "$args2" "$args3" "$args4" ;; 166 | 6) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" ;; 167 | 7) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" 
"$args6" ;; 168 | 8) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" ;; 169 | 9) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" "$args8" ;; 170 | esac 171 | fi 172 | 173 | # Escape application args 174 | save () { 175 | for i do printf %s\\n "$i" | sed "s/'/'\\\\''/g;1s/^/'/;\$s/\$/' \\\\/" ; done 176 | echo " " 177 | } 178 | APP_ARGS=`save "$@"` 179 | 180 | # Collect all arguments for the java command, following the shell quoting and substitution rules 181 | eval set -- $DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS "\"-Dorg.gradle.appname=$APP_BASE_NAME\"" -classpath "\"$CLASSPATH\"" org.gradle.wrapper.GradleWrapperMain "$APP_ARGS" 182 | 183 | exec "$JAVACMD" "$@" 184 | -------------------------------------------------------------------------------- /java/gradlew.bat: -------------------------------------------------------------------------------- 1 | @rem 2 | @rem Copyright 2015 the original author or authors. 3 | @rem 4 | @rem Licensed under the Apache License, Version 2.0 (the "License"); 5 | @rem you may not use this file except in compliance with the License. 6 | @rem You may obtain a copy of the License at 7 | @rem 8 | @rem https://www.apache.org/licenses/LICENSE-2.0 9 | @rem 10 | @rem Unless required by applicable law or agreed to in writing, software 11 | @rem distributed under the License is distributed on an "AS IS" BASIS, 12 | @rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | @rem See the License for the specific language governing permissions and 14 | @rem limitations under the License. 15 | @rem 16 | 17 | @if "%DEBUG%" == "" @echo off 18 | @rem ########################################################################## 19 | @rem 20 | @rem Gradle startup script for Windows 21 | @rem 22 | @rem ########################################################################## 23 | 24 | @rem Set local scope for the variables with windows NT shell 25 | if "%OS%"=="Windows_NT" setlocal 26 | 27 | set DIRNAME=%~dp0 28 | if "%DIRNAME%" == "" set DIRNAME=. 29 | set APP_BASE_NAME=%~n0 30 | set APP_HOME=%DIRNAME% 31 | 32 | @rem Resolve any "." and ".." in APP_HOME to make it shorter. 33 | for %%i in ("%APP_HOME%") do set APP_HOME=%%~fi 34 | 35 | @rem Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script. 36 | set DEFAULT_JVM_OPTS="-Xmx64m" "-Xms64m" 37 | 38 | @rem Find java.exe 39 | if defined JAVA_HOME goto findJavaFromJavaHome 40 | 41 | set JAVA_EXE=java.exe 42 | %JAVA_EXE% -version >NUL 2>&1 43 | if "%ERRORLEVEL%" == "0" goto init 44 | 45 | echo. 46 | echo ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH. 47 | echo. 48 | echo Please set the JAVA_HOME variable in your environment to match the 49 | echo location of your Java installation. 50 | 51 | goto fail 52 | 53 | :findJavaFromJavaHome 54 | set JAVA_HOME=%JAVA_HOME:"=% 55 | set JAVA_EXE=%JAVA_HOME%/bin/java.exe 56 | 57 | if exist "%JAVA_EXE%" goto init 58 | 59 | echo. 60 | echo ERROR: JAVA_HOME is set to an invalid directory: %JAVA_HOME% 61 | echo. 62 | echo Please set the JAVA_HOME variable in your environment to match the 63 | echo location of your Java installation. 64 | 65 | goto fail 66 | 67 | :init 68 | @rem Get command-line arguments, handling Windows variants 69 | 70 | if not "%OS%" == "Windows_NT" goto win9xME_args 71 | 72 | :win9xME_args 73 | @rem Slurp the command line arguments. 
74 | set CMD_LINE_ARGS= 75 | set _SKIP=2 76 | 77 | :win9xME_args_slurp 78 | if "x%~1" == "x" goto execute 79 | 80 | set CMD_LINE_ARGS=%* 81 | 82 | :execute 83 | @rem Setup the command line 84 | 85 | set CLASSPATH=%APP_HOME%\gradle\wrapper\gradle-wrapper.jar 86 | 87 | @rem Execute Gradle 88 | "%JAVA_EXE%" %DEFAULT_JVM_OPTS% %JAVA_OPTS% %GRADLE_OPTS% "-Dorg.gradle.appname=%APP_BASE_NAME%" -classpath "%CLASSPATH%" org.gradle.wrapper.GradleWrapperMain %CMD_LINE_ARGS% 89 | 90 | :end 91 | @rem End local scope for the variables with windows NT shell 92 | if "%ERRORLEVEL%"=="0" goto mainEnd 93 | 94 | :fail 95 | rem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of 96 | rem the _cmd.exe /c_ return code! 97 | if not "" == "%GRADLE_EXIT_CONSOLE%" exit 1 98 | exit /b 1 99 | 100 | :mainEnd 101 | if "%OS%"=="Windows_NT" endlocal 102 | 103 | :omega 104 | -------------------------------------------------------------------------------- /java/samples/build.gradle: -------------------------------------------------------------------------------- 1 | dependencies { 2 | implementation project(':vernacular-ai-speech') 3 | } 4 | 5 | task run(type: JavaExec) { 6 | classpath = sourceSets.main.runtimeClasspath 7 | 8 | if (project.hasProperty('example')) { 9 | if (example == 'RecognizeSync'){ 10 | main = 'ai.vernacular.examples.speech.RecognizeSync' 11 | } else { 12 | println "Unable to find example file" 13 | } 14 | } 15 | } 16 | -------------------------------------------------------------------------------- /java/samples/src/main/java/ai/vernacular/examples/speech/RecognizeSync.java: -------------------------------------------------------------------------------- 1 | package ai.vernacular.examples.speech; 2 | 3 | import ai.vernacular.speech.SpeechClient; 4 | import ai.vernacular.speech.RecognitionAudio; 5 | import ai.vernacular.speech.RecognitionConfig; 6 | import ai.vernacular.speech.SpeechRecognitionAlternative; 7 | import ai.vernacular.speech.RecognitionConfig.AudioEncoding; 8 | import ai.vernacular.speech.RecognizeResponse; 9 | import ai.vernacular.speech.SpeechRecognitionResult; 10 | import com.google.protobuf.ByteString; 11 | 12 | import java.io.IOException; 13 | import java.nio.file.Files; 14 | import java.nio.file.Path; 15 | import java.nio.file.Paths; 16 | 17 | public class RecognizeSync { 18 | 19 | public static void sampleRecognize(String accessToken, String localFilePath) { 20 | try (SpeechClient speechClient = new SpeechClient(accessToken)) { 21 | // The language of the supplied audio 22 | String languageCode = "en-IN"; 23 | 24 | // Sample rate in Hertz of the audio data sent 25 | int sampleRateHertz = 8000; 26 | 27 | // Encoding of audio data sent. This sample sets this explicitly. 
28 | RecognitionConfig.AudioEncoding encoding = RecognitionConfig.AudioEncoding.LINEAR16; 29 | RecognitionConfig config = RecognitionConfig.newBuilder().setLanguageCode(languageCode) 30 | .setSampleRateHertz(sampleRateHertz).setEncoding(encoding).build(); 31 | Path path = Paths.get(localFilePath); 32 | byte[] data = Files.readAllBytes(path); 33 | ByteString content = ByteString.copyFrom(data); 34 | RecognitionAudio audio = RecognitionAudio.newBuilder().setContent(content).build(); 35 | RecognizeResponse response = speechClient.recognize(config, audio); 36 | for (SpeechRecognitionResult result : response.getResultsList()) { 37 | // First alternative is the most probable result 38 | SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0); 39 | System.out.printf("Transcript: %s\n", alternative.getTranscript()); 40 | } 41 | } catch (IOException exception) { 42 | System.err.println("Failed to create the client due to: " + exception.getMessage()); 43 | } 44 | } 45 | 46 | public static void main(String[] args) { 47 | sampleRecognize("vernacularai", "hello.wav"); 48 | } 49 | } 50 | -------------------------------------------------------------------------------- /java/settings.gradle: -------------------------------------------------------------------------------- 1 | include ":vernacular-ai-speech" 2 | include ":samples" 3 | -------------------------------------------------------------------------------- /java/vernacular-ai-speech/build.gradle: -------------------------------------------------------------------------------- 1 | def grpcVersion = '1.28.1' 2 | def protobufVersion = '1.28.1' 3 | 4 | dependencies { 5 | implementation "io.grpc:grpc-okhttp:${grpcVersion}" 6 | implementation ("io.grpc:grpc-protobuf-lite:${protobufVersion}") { 7 | exclude module: "protobuf-lite" 8 | } 9 | implementation "io.grpc:grpc-stub:${protobufVersion}" 10 | compile "com.google.api.grpc:proto-google-common-protos:1.17.0" 11 | 12 | compileOnly "javax.annotation:javax.annotation-api:1.3.2" 13 | } 14 | 15 | protobuf { 16 | protoc { artifact = 'com.google.protobuf:protoc:3.11.4' } 17 | plugins { 18 | grpc { 19 | artifact = "io.grpc:protoc-gen-grpc-java:${grpcVersion}" 20 | } 21 | javalite { 22 | artifact = 'com.google.protobuf:protoc-gen-javalite:3.0.0' 23 | } 24 | } 25 | generateProtoTasks { 26 | all().each { task -> 27 | task.plugins { 28 | javalite {} 29 | grpc { // Options added to --grpc_out 30 | option 'lite' 31 | } 32 | } 33 | 34 | task.builtins { 35 | remove java 36 | } 37 | } 38 | } 39 | } 40 | -------------------------------------------------------------------------------- /java/vernacular-ai-speech/src/main/java/ai/vernacular/speech/SpeechClient.java: -------------------------------------------------------------------------------- 1 | package ai.vernacular.speech; 2 | 3 | import io.grpc.ManagedChannel; 4 | import io.grpc.ManagedChannelBuilder; 5 | import io.grpc.Metadata; 6 | import ai.vernacular.speech.SpeechToTextGrpc; 7 | 8 | import java.io.IOException; 9 | 10 | import ai.vernacular.speech.RecognitionAudio; 11 | import ai.vernacular.speech.RecognizeRequest; 12 | import ai.vernacular.speech.RecognizeResponse; 13 | import ai.vernacular.speech.RecognitionAudio; 14 | import ai.vernacular.speech.RecognitionConfig; 15 | 16 | public class SpeechClient implements AutoCloseable{ 17 | 18 | public static final String STTP_GRPC_HOST = "localhost"; 19 | public static final int STTP_GRPC_PORT = 5021; 20 | public static final String AUTHORIZATION = "authorization"; 21 | 22 | private String 
accessToken; 23 | private ManagedChannel channel; 24 | private SpeechToTextGrpc.SpeechToTextBlockingStub channelStub; 25 | 26 | public SpeechClient(String accessToken) { 27 | this.accessToken = accessToken; 28 | 29 | this.channel = ManagedChannelBuilder.forAddress(STTP_GRPC_HOST, STTP_GRPC_PORT).usePlaintext().build(); 30 | this.channelStub = SpeechToTextGrpc.newBlockingStub(channel); 31 | } 32 | 33 | public final RecognizeResponse recognize(RecognitionConfig config, RecognitionAudio audio) { 34 | // return recognizeCallable().call(request); 35 | RecognizeRequest request = RecognizeRequest.newBuilder().setConfig(config).setAudio(audio).build(); 36 | return this.channelStub.recognize(request); 37 | } 38 | 39 | @Override 40 | public void close() throws IOException { 41 | 42 | } 43 | 44 | } 45 | -------------------------------------------------------------------------------- /java/vernacular-ai-speech/src/main/proto/speech-to-text.proto: -------------------------------------------------------------------------------- 1 | syntax = "proto3"; 2 | package speech_to_text; 3 | 4 | import "google/api/annotations.proto"; 5 | import "google/api/client.proto"; 6 | import "google/api/field_behavior.proto"; 7 | import "google/rpc/status.proto"; 8 | 9 | option java_multiple_files = true; 10 | option java_outer_classname = "SpeechToTextProto"; 11 | option java_package = "ai.vernacular.speech"; 12 | 13 | service SpeechToText { 14 | // Performs synchronous non-streaming speech recognition 15 | rpc Recognize(RecognizeRequest) returns (RecognizeResponse) { 16 | option (google.api.http) = { 17 | post: "/v1/speech:recognize" 18 | body: "*" 19 | }; 20 | option (google.api.method_signature) = "config,audio"; 21 | } 22 | 23 | // Performs bidirectional streaming speech recognition: receive results while 24 | // sending audio. This method is only available via the gRPC API (not REST). 25 | rpc StreamingRecognize(stream StreamingRecognizeRequest) returns (stream StreamingRecognizeResponse) {} 26 | 27 | // Performs asynchronous non-streaming speech recognition 28 | rpc LongRunningRecognize(LongRunningRecognizeRequest) returns (SpeechOperation) {} 29 | // Returns SpeechOperation for LongRunningRecognize. Used for polling the result 30 | rpc GetSpeechOperation(SpeechOperationRequest) returns (SpeechOperation) { 31 | option (google.api.http) = { 32 | get: "/v1/speech_operations/{name}" 33 | }; 34 | } 35 | } 36 | 37 | //-------------------------------------------- 38 | // requests 39 | //-------------------------------------------- 40 | message RecognizeRequest { 41 | // Required. Provides information to the recognizer that specifies how to 42 | // process the request. 43 | RecognitionConfig config = 1 [(google.api.field_behavior) = REQUIRED]; 44 | 45 | // Required. The audio data to be recognized. 46 | RecognitionAudio audio = 2 [(google.api.field_behavior) = REQUIRED]; 47 | 48 | string segment = 16; 49 | } 50 | 51 | message LongRunningRecognizeRequest { 52 | // Required. Provides information to the recognizer that specifies how to 53 | // process the request. 54 | RecognitionConfig config = 1 [(google.api.field_behavior) = REQUIRED]; 55 | 56 | // Required. The audio data to be recognized. 57 | RecognitionAudio audio = 2 [(google.api.field_behavior) = REQUIRED]; 58 | 59 | // Optional. When operation completes, result is posted to this url if provided. 
60 | string result_url = 11; 61 | 62 | string segment = 16; 63 | } 64 | 65 | message SpeechOperationRequest { 66 | // name of the speech operation 67 | string name = 1 [(google.api.field_behavior) = REQUIRED]; 68 | } 69 | 70 | message StreamingRecognizeRequest { 71 | // The streaming request, which is either a streaming config or audio content. 72 | oneof streaming_request { 73 | // Provides information to the recognizer that specifies how to process the 74 | // request. The first `StreamingRecognizeRequest` message must contain a 75 | // `streaming_config` message. 76 | StreamingRecognitionConfig streaming_config = 1; 77 | 78 | // The audio data to be recognized. 79 | bytes audio_content = 2; 80 | } 81 | } 82 | 83 | message StreamingRecognitionConfig { 84 | // Required. Provides information to the recognizer that specifies how to 85 | // process the request. 86 | RecognitionConfig config = 1 [(google.api.field_behavior) = REQUIRED]; 87 | 88 | // If `true`, interim results (tentative hypotheses) may be 89 | // returned as they become available (these interim results are indicated with 90 | // the `is_final=false` flag). 91 | // If `false` or omitted, only `is_final=true` result(s) are returned. 92 | bool interim_results = 2; 93 | } 94 | 95 | // Provides information to the recognizer that specifies how to process the request 96 | message RecognitionConfig { 97 | enum AudioEncoding { 98 | ENCODING_UNSPECIFIED = 0; 99 | LINEAR16 = 1; 100 | FLAC = 2; 101 | MP3 = 3; 102 | } 103 | 104 | AudioEncoding encoding = 1; 105 | int32 sample_rate_hertz = 2; // Valid values are: 8000-48000. 106 | string language_code = 3 [(google.api.field_behavior) = REQUIRED]; 107 | int32 max_alternatives = 4; 108 | repeated SpeechContext speech_contexts = 5; 109 | int32 audio_channel_count = 6; 110 | bool enable_separate_recognition_per_channel = 7; 111 | bool enable_word_time_offsets = 8; 112 | bool enable_automatic_punctuation = 11; 113 | SpeakerDiarizationConfig diarization_config = 16; 114 | } 115 | 116 | message SpeechContext { 117 | repeated string phrases = 1; 118 | } 119 | 120 | // Config to enable speaker diarization. 121 | message SpeakerDiarizationConfig { 122 | // If 'true', enables speaker detection for each recognized word in 123 | // the top alternative of the recognition result using a speaker_tag provided 124 | // in the WordInfo. 125 | bool enable_speaker_diarization = 1; 126 | 127 | // Minimum number of speakers in the conversation. This range gives you more 128 | // flexibility by allowing the system to automatically determine the correct 129 | // number of speakers. If not set, the default value is 2. 130 | int32 min_speaker_count = 2; 131 | 132 | // Maximum number of speakers in the conversation. This range gives you more 133 | // flexibility by allowing the system to automatically determine the correct 134 | // number of speakers. If not set, the default value is 6. 135 | int32 max_speaker_count = 3; 136 | } 137 | 138 | // Either `content` or `uri` must be supplied. 
139 | message RecognitionAudio { 140 | oneof audio_source { 141 | bytes content = 1; 142 | string uri = 2; 143 | } 144 | } 145 | 146 | //-------------------------------------------- 147 | // responses 148 | //-------------------------------------------- 149 | message RecognizeResponse { 150 | repeated SpeechRecognitionResult results = 1; 151 | } 152 | 153 | message LongRunningRecognizeResponse { 154 | repeated SpeechRecognitionResult results = 1; 155 | } 156 | 157 | message StreamingRecognizeResponse { 158 | // If set, returns a [google.rpc.Status][google.rpc.Status] message that 159 | // specifies the error for the operation. 160 | google.rpc.Status error = 1; 161 | 162 | // This repeated list contains zero or more results that 163 | // correspond to consecutive portions of the audio currently being processed. 164 | // It contains zero or one `is_final=true` result (the newly settled portion), 165 | // followed by zero or more `is_final=false` results (the interim results). 166 | repeated StreamingRecognitionResult results = 2; 167 | } 168 | 169 | message SpeechRecognitionResult { 170 | repeated SpeechRecognitionAlternative alternatives = 1; 171 | int32 channel_tag = 2; 172 | } 173 | 174 | message StreamingRecognitionResult { 175 | // May contain one or more recognition hypotheses (up to the 176 | // maximum specified in `max_alternatives`). 177 | // These alternatives are ordered in terms of accuracy, with the top (first) 178 | // alternative being the most probable, as ranked by the recognizer. 179 | repeated SpeechRecognitionAlternative alternatives = 1; 180 | 181 | // If `false`, this `StreamingRecognitionResult` represents an 182 | // interim result that may change. If `true`, this is the final time the 183 | // speech service will return this particular `StreamingRecognitionResult`, 184 | // the recognizer will not return any further hypotheses for this portion of 185 | // the transcript and corresponding audio. 186 | bool is_final = 2; 187 | 188 | // An estimate of the likelihood that the recognizer will not 189 | // change its guess about this interim result. Values range from 0.0 190 | // (completely unstable) to 1.0 (completely stable). 191 | // This field is only provided for interim results (`is_final=false`). 192 | // The default of 0.0 is a sentinel value indicating `stability` was not set. 193 | float stability = 3; 194 | 195 | // Time offset of the end of this result relative to the 196 | // beginning of the audio. 197 | float result_end_time = 4; 198 | 199 | // For multi-channel audio, this is the channel number corresponding to the 200 | // recognized result for the audio from that channel. 201 | // For audio_channel_count = N, its output values can range from '1' to 'N'. 202 | int32 channel_tag = 5; 203 | } 204 | 205 | message SpeechRecognitionAlternative { 206 | string transcript = 1; 207 | float confidence = 2; 208 | repeated WordInfo words = 3; 209 | } 210 | 211 | message WordInfo { 212 | float start_time = 1; 213 | float end_time = 2; 214 | string word = 3; 215 | } 216 | 217 | message SpeechOperation { 218 | string name = 1; 219 | bool done = 2; 220 | oneof result { 221 | // If set, returns a [google.rpc.Status][google.rpc.Status] message that 222 | // specifies the error for the operation. 
223 | google.rpc.Status error = 3; 224 | 225 | LongRunningRecognizeResponse response = 4; 226 | } 227 | } 228 | -------------------------------------------------------------------------------- /python/.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | 53 | # Sphinx documentation 54 | docs/_build/ 55 | 56 | # PyBuilder 57 | target/ 58 | 59 | # Jupyter Notebook 60 | .ipynb_checkpoints 61 | 62 | # IPython 63 | profile_default/ 64 | ipython_config.py 65 | 66 | # pyenv 67 | .python-version 68 | 69 | # celery beat schedule file 70 | celerybeat-schedule 71 | 72 | # SageMath parsed files 73 | *.sage.py 74 | 75 | # Environments 76 | .env 77 | .venv 78 | env/ 79 | venv/ 80 | ENV/ 81 | env.bak/ 82 | venv.bak/ 83 | -------------------------------------------------------------------------------- /python/README.md: -------------------------------------------------------------------------------- 1 | # Python Speech to Text SDK 2 | 3 | Python SDK for vernacular.ai speech to text APIs. Go [here](https://github.com/Vernacular-ai/speech-recognition) for detailed product documentation. 4 | 5 | ## Installation 6 | To install this sdk run: 7 | 8 | ```shell 9 | pip install vernacular-ai-speech 10 | ``` 11 | 12 | #### Supported Python Versions 13 | 14 | Python >= 3.5 15 | 16 | ## Example Usage 17 | 18 | ```python 19 | from vernacular.ai import speech 20 | from vernacular.ai.speech import enums, types 21 | 22 | 23 | def sample_recognize(access_token, file_path): 24 | """ 25 | Args: 26 | access_token Token provided by vernacular.ai for authentication 27 | file_path Path to audio file e.g /path/audio_file.wav 28 | """ 29 | speech_client = speech.SpeechClient(access_token) 30 | 31 | audio = types.RecognitionAudio( 32 | content = open(file_path, "rb").read() 33 | ) 34 | 35 | config = types.RecognitionConfig( 36 | encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16, 37 | sample_rate_hertz=8000, 38 | language_code = "hi-IN", 39 | ) 40 | 41 | response = speech_client.recognize(audio=audio, config=config) 42 | 43 | for result in response.results: 44 | alternative = result.alternatives[0] 45 | print("Transcript: {}".format(alternative.transcript)) 46 | ``` 47 | 48 | To see more examples, go to [samples](https://github.com/Vernacular-ai/speech-recognition/tree/master/python/samples). 
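The proto's `RecognitionAudio` message also accepts a `uri` instead of inline `content` (the Java README shows the URI form). On the Python side only `content` is exercised by the bundled samples, so the sketch below is an untested assumption that `types.RecognitionAudio` exposes the same `uri` field:

```python
from vernacular.ai import speech
from vernacular.ai.speech import enums, types


def sample_recognize_uri(access_token, audio_uri):
    """
    Args:
        access_token Token provided by vernacular.ai for authentication
        audio_uri    Fetchable audio URL, e.g. https://url/to/audio.wav
    """
    speech_client = speech.SpeechClient(access_token)

    # NOTE: `uri` mirrors the proto's RecognitionAudio.uri field; treating this
    # keyword as available in the Python types wrapper is an assumption.
    audio = types.RecognitionAudio(uri=audio_uri)

    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code="hi-IN",
    )

    response = speech_client.recognize(audio=audio, config=config)

    for result in response.results:
        print("Transcript: {}".format(result.alternatives[0].transcript))
```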
49 | -------------------------------------------------------------------------------- /python/requirements.txt: -------------------------------------------------------------------------------- 1 | grpcio==1.27.1 2 | googleapis-common-protos==1.51.0 3 | -------------------------------------------------------------------------------- /python/samples/recognize_async.py: -------------------------------------------------------------------------------- 1 | from vernacular.ai import speech 2 | from vernacular.ai.speech import enums, types 3 | import os 4 | 5 | 6 | def infer_encoding(file_path: str): 7 | if ".mp3" in file_path: 8 | return enums.RecognitionConfig.AudioEncoding.MP3 9 | elif ".wav" in file_path: 10 | return enums.RecognitionConfig.AudioEncoding.LINEAR16 11 | 12 | return enums.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED 13 | 14 | 15 | def sample_recognize_async(access_token, file_path): 16 | speech_client = speech.SpeechClient(access_token) 17 | 18 | audio = types.RecognitionAudio( 19 | content = open(file_path, "rb").read() 20 | ) 21 | config = types.RecognitionConfig( 22 | encoding=infer_encoding(file_path), 23 | sample_rate_hertz=8000, 24 | language_code = "hi-IN", 25 | max_alternatives = 2, 26 | ) 27 | 28 | speech_operation = speech_client.long_running_recognize(audio=audio, config=config) 29 | 30 | print("Waiting for operation to complete...") 31 | response = speech_operation.response 32 | 33 | for result in response.results: 34 | # First alternative is the most probable result 35 | alternative = result.alternatives[0] 36 | print("Transcript: {}".format(alternative.transcript)) 37 | print("Confidence: {}".format(alternative.confidence)) 38 | 39 | 40 | def main(): 41 | import argparse 42 | 43 | parser = argparse.ArgumentParser() 44 | parser.add_argument( 45 | "--access_token", type=str, default=os.environ.get("AUTH_ACCESS_TOKEN") 46 | ) 47 | parser.add_argument( 48 | "--file_path", type=str, default="../resources/hello.wav" 49 | ) 50 | args = parser.parse_args() 51 | 52 | sample_recognize_async(args.access_token, args.file_path) 53 | 54 | 55 | if __name__ == "__main__": 56 | main() 57 | -------------------------------------------------------------------------------- /python/samples/recognize_multi_channel.py: -------------------------------------------------------------------------------- 1 | from vernacular.ai import speech 2 | from vernacular.ai.speech import enums, types 3 | import os 4 | 5 | def infer_encoding(file_path: str): 6 | if ".mp3" in file_path: 7 | return enums.RecognitionConfig.AudioEncoding.MP3 8 | elif ".wav" in file_path: 9 | return enums.RecognitionConfig.AudioEncoding.LINEAR16 10 | 11 | return enums.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED 12 | 13 | 14 | def sample_recognize(access_token, file_path): 15 | speech_client = speech.SpeechClient(access_token) 16 | 17 | # The number of channels in the input audio file 18 | audio_channel_count = 2 19 | 20 | # When set to true, each audio channel will be recognized separately. 
21 | # The recognition result will contain a channel_tag field to state which 22 | # channel that result belongs to 23 | enable_separate_recognition_per_channel = True 24 | 25 | audio = types.RecognitionAudio( 26 | content = open(file_path, "rb").read() 27 | ) 28 | config = types.RecognitionConfig( 29 | encoding=infer_encoding(file_path), 30 | sample_rate_hertz=8000, 31 | language_code = "hi-IN", 32 | max_alternatives = 1, 33 | enable_separate_recognition_per_channel=enable_separate_recognition_per_channel, 34 | audio_channel_count=audio_channel_count, 35 | ) 36 | 37 | response = speech_client.recognize(audio=audio, config=config, timeout=60) 38 | 39 | for result in response.results: 40 | alternative = result.alternatives[0] 41 | print("Transcript: {}".format(alternative.transcript)) 42 | print("ChannelTag: {}".format(result.channel_tag)) 43 | 44 | 45 | def main(): 46 | import argparse 47 | 48 | parser = argparse.ArgumentParser() 49 | parser.add_argument( 50 | "--access_token", type=str, default=os.environ.get("AUTH_ACCESS_TOKEN") 51 | ) 52 | parser.add_argument( 53 | "--file_path", type=str, default="../resources/hello.wav" 54 | ) 55 | args = parser.parse_args() 56 | 57 | sample_recognize(args.access_token, args.file_path) 58 | 59 | 60 | if __name__ == "__main__": 61 | main() 62 | -------------------------------------------------------------------------------- /python/samples/recognize_streaming.py: -------------------------------------------------------------------------------- 1 | from vernacular.ai import speech 2 | from vernacular.ai.speech import enums, types 3 | import os 4 | import time 5 | import threading 6 | from six.moves import queue 7 | 8 | 9 | def infer_encoding(file_path: str): 10 | if ".mp3" in file_path: 11 | return enums.RecognitionConfig.AudioEncoding.MP3 12 | elif ".wav" in file_path: 13 | return enums.RecognitionConfig.AudioEncoding.LINEAR16 14 | elif ".raw" in file_path: 15 | return enums.RecognitionConfig.AudioEncoding.LINEAR16 16 | 17 | return enums.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED 18 | 19 | 20 | class AddAudio(threading.Thread): 21 | def __init__(self, file_path, _buff): 22 | threading.Thread.__init__(self) 23 | self.file_path = file_path 24 | self._buff = _buff 25 | 26 | def run(self): 27 | with open(self.file_path, "rb") as file: 28 | audio_content = file.read() 29 | 30 | for i in range(0, len(audio_content), 8000): 31 | self._buff.put(audio_content[i:i+8000]) 32 | # add a delay for real time streaming simulation 33 | time.sleep(0.1) 34 | 35 | # add None to queue to mark end of streaming 36 | self._buff.put(None) 37 | 38 | 39 | class SampleRecognizeStreaming(): 40 | def __init__(self, access_token, file_path): 41 | self.speech_client = speech.SpeechClient(access_token) 42 | self.file_path = file_path 43 | 44 | config = types.RecognitionConfig( 45 | encoding=infer_encoding(file_path), 46 | sample_rate_hertz=8000, 47 | language_code="en-IN", 48 | max_alternatives=1, 49 | ) 50 | self.stream_config = types.StreamingRecognitionConfig( 51 | config=config, 52 | ) 53 | 54 | self._buff = queue.Queue() 55 | 56 | def run(self): 57 | requests = (types.StreamingRecognizeRequest(audio_content=content) 58 | for content in self.audio_generator()) 59 | 60 | responses = self.speech_client.streaming_recognize(self.stream_config, requests) 61 | 62 | # add audios in a new thread to simulate streaming 63 | t1 = AddAudio(self.file_path, self._buff) 64 | t1.start() 65 | 66 | # this is blocking call and will wait until server sends the result 67 | for response in 
responses: 68 | for result in response.results: 69 | alternative = result.alternatives[0] 70 | print("Transcript: {}".format(alternative.transcript)) 71 | print("Confidence: {}".format(alternative.confidence)) 72 | 73 | 74 | def audio_generator(self): 75 | while True: 76 | chunk = self._buff.get() 77 | if chunk is None: 78 | return 79 | 80 | yield chunk 81 | 82 | 83 | def main(): 84 | import argparse 85 | 86 | parser = argparse.ArgumentParser() 87 | parser.add_argument( 88 | "--access_token", type=str, default=os.environ.get("AUTH_ACCESS_TOKEN") 89 | ) 90 | parser.add_argument( 91 | "--file_path", type=str, default="../resources/test-single-channel-8000Hz.raw" 92 | ) 93 | args = parser.parse_args() 94 | 95 | ss = SampleRecognizeStreaming(args.access_token, args.file_path) 96 | ss.run() 97 | 98 | 99 | if __name__ == "__main__": 100 | main() 101 | -------------------------------------------------------------------------------- /python/samples/recognize_streaming_mic.py: -------------------------------------------------------------------------------- 1 | """ 2 | NOTE: This module requires the additional dependency `pyaudio`. 3 | To install using pip: 4 | pip install pyaudio 5 | Example usage: 6 | python recognize_streaming_mic.py 7 | """ 8 | 9 | from __future__ import division 10 | 11 | import re 12 | import sys 13 | 14 | from vernacular.ai import speech 15 | from vernacular.ai.speech import enums, types 16 | import pyaudio 17 | import os 18 | from six.moves import queue 19 | 20 | # Audio recording parameters 21 | RATE = 8000 22 | CHUNK = int(RATE / 10) # 100ms 23 | 24 | class MicrophoneStream(object): 25 | """Opens a recording stream as a generator yielding the audio chunks.""" 26 | def __init__(self, rate, chunk): 27 | self._rate = rate 28 | self._chunk = chunk 29 | 30 | # Create a thread-safe buffer of audio data 31 | self._buff = queue.Queue() 32 | self.closed = True 33 | 34 | def __enter__(self): 35 | self._audio_interface = pyaudio.PyAudio() 36 | self._audio_stream = self._audio_interface.open( 37 | format=pyaudio.paInt16, 38 | # The API currently only supports 1-channel (mono) audio 39 | # https://goo.gl/z757pE 40 | channels=1, rate=self._rate, 41 | input=True, frames_per_buffer=self._chunk, 42 | # Run the audio stream asynchronously to fill the buffer object. 43 | # This is necessary so that the input device's buffer doesn't 44 | # overflow while the calling thread makes network requests, etc. 45 | stream_callback=self._fill_buffer, 46 | ) 47 | 48 | self.closed = False 49 | 50 | return self 51 | 52 | def __exit__(self, type, value, traceback): 53 | self.end() 54 | 55 | def end(self): 56 | self._audio_stream.stop_stream() 57 | self._audio_stream.close() 58 | self.closed = True 59 | # Signal the generator to terminate so that the client's 60 | # streaming_recognize method will not block the process termination. 61 | self._buff.put(None) 62 | self._audio_interface.terminate() 63 | 64 | def _fill_buffer(self, in_data, frame_count, time_info, status_flags): 65 | """Continuously collect data from the audio stream, into the buffer.""" 66 | self._buff.put(in_data) 67 | return None, pyaudio.paContinue 68 | 69 | def generator(self): 70 | while not self.closed: 71 | # Use a blocking get() to ensure there's at least one chunk of 72 | # data, and stop iteration if the chunk is None, indicating the 73 | # end of the audio stream. 74 | chunk = self._buff.get() 75 | if chunk is None: 76 | return 77 | data = [chunk] 78 | 79 | # Now consume whatever other data's still buffered. 
80 | while True: 81 | try: 82 | chunk = self._buff.get(block=False) 83 | if chunk is None: 84 | return 85 | data.append(chunk) 86 | except queue.Empty: 87 | break 88 | 89 | yield b''.join(data) 90 | 91 | 92 | def main(): 93 | import argparse 94 | 95 | parser = argparse.ArgumentParser() 96 | parser.add_argument( 97 | "--access_token", type=str, default=os.environ.get("AUTH_ACCESS_TOKEN") 98 | ) 99 | args = parser.parse_args() 100 | language_code = 'hi-IN' # a BCP-47 language tag 101 | 102 | client = speech.SpeechClient(args.access_token) 103 | config = types.RecognitionConfig( 104 | encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16, 105 | sample_rate_hertz=RATE, 106 | language_code=language_code, 107 | ) 108 | sd_config = types.SilenceDetectionConfig( 109 | enable_silence_detection=True, 110 | max_speech_timeout=4, 111 | ) 112 | 113 | streaming_config = types.StreamingRecognitionConfig(config=config, silence_detection_config=sd_config) 114 | 115 | with MicrophoneStream(RATE, CHUNK) as stream: 116 | audio_generator = stream.generator() 117 | requests = (types.StreamingRecognizeRequest(audio_content=content) 118 | for content in audio_generator) 119 | 120 | responses = client.streaming_recognize(streaming_config, requests) 121 | 122 | # Now, put the transcription responses to use. 123 | for response in responses: 124 | stream.end() 125 | if len(response.results) > 0 and len(response.results[0].alternatives) > 0: 126 | # Display the transcription of the top alternative. 127 | transcript = response.results[0].alternatives[0].transcript 128 | print(transcript) 129 | else: 130 | print("Empty results") 131 | 132 | 133 | if __name__ == '__main__': 134 | main() 135 | -------------------------------------------------------------------------------- /python/samples/recognize_sync.py: -------------------------------------------------------------------------------- 1 | from vernacular.ai import speech 2 | from vernacular.ai.speech import enums, types 3 | import os 4 | 5 | 6 | def infer_encoding(file_path: str): 7 | if ".mp3" in file_path: 8 | return enums.RecognitionConfig.AudioEncoding.MP3 9 | elif ".wav" in file_path: 10 | return enums.RecognitionConfig.AudioEncoding.LINEAR16 11 | 12 | return enums.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED 13 | 14 | 15 | def sample_recognize(access_token, file_path): 16 | speech_client = speech.SpeechClient(access_token) 17 | 18 | audio = types.RecognitionAudio( 19 | content = open(file_path, "rb").read() 20 | ) 21 | config = types.RecognitionConfig( 22 | encoding=infer_encoding(file_path), 23 | sample_rate_hertz=8000, 24 | language_code = "hi-IN", 25 | max_alternatives = 1, 26 | ) 27 | 28 | response = speech_client.recognize(audio=audio, config=config) 29 | 30 | for result in response.results: 31 | # First alternative is the most probable result 32 | alternative = result.alternatives[0] 33 | print("Transcript: {}".format(alternative.transcript)) 34 | print("Confidence: {}".format(alternative.confidence)) 35 | 36 | 37 | def main(): 38 | import argparse 39 | 40 | parser = argparse.ArgumentParser() 41 | parser.add_argument( 42 | "--access_token", type=str, default=os.environ.get("AUTH_ACCESS_TOKEN") 43 | ) 44 | parser.add_argument( 45 | "--file_path", type=str, default="../resources/hello.wav" 46 | ) 47 | args = parser.parse_args() 48 | 49 | sample_recognize(args.access_token, args.file_path) 50 | 51 | 52 | if __name__ == "__main__": 53 | main() 54 | -------------------------------------------------------------------------------- 
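Editor's note: the docs above describe speaker diarization (`enableSpeakerDiarization`, `speakerTag`) and the proto defines `SpeakerDiarizationConfig` with a `diarization_config` field on `RecognitionConfig`, but none of the bundled Python samples exercise it. Below is a hypothetical sample in the same style as the others; `types.SpeakerDiarizationConfig` and the `diarization_config` keyword are assumed to be exposed by the Python bindings (only the proto confirms them), so treat this as a sketch rather than a shipped sample:

```python
from vernacular.ai import speech
from vernacular.ai.speech import enums, types


def sample_recognize_diarization(access_token, file_path):
    speech_client = speech.SpeechClient(access_token)

    # Enable speaker detection; min/max speaker counts bound the search
    # (the proto comments state defaults of 2 and 6 when unset).
    diarization_config = types.SpeakerDiarizationConfig(  # assumed Python wrapper name
        enable_speaker_diarization=True,
        min_speaker_count=2,
        max_speaker_count=2,
    )

    audio = types.RecognitionAudio(
        content=open(file_path, "rb").read()
    )
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code="hi-IN",
        diarization_config=diarization_config,
    )

    response = speech_client.recognize(audio=audio, config=config)

    for result in response.results:
        alternative = result.alternatives[0]
        print("Transcript: {}".format(alternative.transcript))
        # Per the docs, words in the top alternative carry speaker information
        # when diarization is enabled.
        print("Words: {}".format(alternative.words))
```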
/python/samples/recognize_word_offset.py: -------------------------------------------------------------------------------- 1 | from vernacular.ai import speech 2 | from vernacular.ai.speech import enums, types 3 | import os 4 | 5 | def infer_encoding(file_path: str): 6 | if ".mp3" in file_path: 7 | return enums.RecognitionConfig.AudioEncoding.MP3 8 | elif ".wav" in file_path: 9 | return enums.RecognitionConfig.AudioEncoding.LINEAR16 10 | 11 | return enums.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED 12 | 13 | 14 | def sample_recognize(access_token, file_path): 15 | speech_client = speech.SpeechClient(access_token) 16 | 17 | enable_word_time_offsets = True 18 | 19 | audio = types.RecognitionAudio( 20 | content = open(file_path, "rb").read() 21 | ) 22 | config = types.RecognitionConfig( 23 | encoding=infer_encoding(file_path), 24 | sample_rate_hertz=8000, 25 | language_code = "hi-IN", 26 | max_alternatives = 1, 27 | enable_word_time_offsets=enable_word_time_offsets 28 | ) 29 | 30 | response = speech_client.recognize(audio=audio, config=config) 31 | 32 | for result in response.results: 33 | # First alternative is the most probable result 34 | alternative = result.alternatives[0] 35 | print("Transcript: {}".format(alternative.transcript)) 36 | print("Words: {}".format(alternative.words)) 37 | 38 | 39 | def main(): 40 | import argparse 41 | 42 | parser = argparse.ArgumentParser() 43 | parser.add_argument( 44 | "--access_token", type=str, default=os.environ.get("AUTH_ACCESS_TOKEN") 45 | ) 46 | parser.add_argument( 47 | "--file_path", type=str, default="../resources/hello.wav" 48 | ) 49 | args = parser.parse_args() 50 | 51 | sample_recognize(args.access_token, args.file_path) 52 | 53 | 54 | if __name__ == "__main__": 55 | main() 56 | -------------------------------------------------------------------------------- /python/setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | description-file = README.md 3 | long_description_content_type= text/markdown -------------------------------------------------------------------------------- /python/setup.py: -------------------------------------------------------------------------------- 1 | import io 2 | import os 3 | 4 | from setuptools import setup 5 | 6 | 7 | name = "vernacular-ai-speech" 8 | description = "Vernacular Speech API python client" 9 | version = "0.1.2" 10 | 11 | dependencies = ["grpcio >= 1.27.1", "googleapis-common-protos == 1.51.0"] 12 | extras = {} 13 | 14 | package_root = os.path.abspath(os.path.dirname(__file__)) 15 | 16 | readme_filename = os.path.join(package_root, "README.md") 17 | with io.open(readme_filename, "r") as readme_file: 18 | readme = readme_file.read() 19 | 20 | # Only include packages under the 'vernacular' namespace. Do not include tests, 21 | # benchmarks, etc. 22 | packages = [ 23 | "vernacular.ai.speech", 24 | "vernacular.ai.speech.proto", 25 | "vernacular.ai.exceptions", 26 | ] 27 | 28 | # Determine which namespaces are needed. 
29 | namespaces = ["vernacular", "vernacular.ai"] 30 | 31 | setup( 32 | name=name, 33 | version=version, 34 | description=description, 35 | long_description_content_type="text/markdown", 36 | long_description=readme, 37 | author="Vernacular.ai", 38 | author_email="deepankar@vernacular.ai", 39 | license="Apache 2.0", 40 | url="https://github.com/Vernacular-ai/speech-recognition", 41 | classifiers=[ 42 | "Development Status :: 4 - Beta", # Chose either "3 - Alpha", "4 - Beta" or "5 - Production/Stable" 43 | "Intended Audience :: Developers", 44 | "Topic :: Software Development :: Build Tools", 45 | "License :: OSI Approved :: Apache Software License", 46 | "Programming Language :: Python", 47 | "Programming Language :: Python :: 3", 48 | "Programming Language :: Python :: 3.5", 49 | "Programming Language :: Python :: 3.6", 50 | "Programming Language :: Python :: 3.7", 51 | "Operating System :: OS Independent", 52 | ], 53 | platforms="Posix; MacOS X; Windows", 54 | packages=packages, 55 | namespace_packages=namespaces, 56 | install_requires=dependencies, 57 | extras_require=extras, 58 | python_requires=">=3.5", 59 | include_package_data=True, 60 | zip_safe=False, 61 | ) 62 | -------------------------------------------------------------------------------- /python/tests/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/python/tests/__init__.py -------------------------------------------------------------------------------- /python/vernacular/__init__.py: -------------------------------------------------------------------------------- 1 | try: 2 | import pkg_resources 3 | 4 | pkg_resources.declare_namespace(__name__) 5 | except ImportError: 6 | import pkgutil 7 | 8 | __path__ = pkgutil.extend_path(__path__, __name__) -------------------------------------------------------------------------------- /python/vernacular/ai/__init__.py: -------------------------------------------------------------------------------- 1 | try: 2 | import pkg_resources 3 | 4 | pkg_resources.declare_namespace(__name__) 5 | except ImportError: 6 | import pkgutil 7 | 8 | __path__ = pkgutil.extend_path(__path__, __name__) -------------------------------------------------------------------------------- /python/vernacular/ai/exceptions/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | 3 | import grpc 4 | 5 | class VernacularAPIError(Exception): 6 | """ 7 | Base class for all exceptions raised by Vernacular API Clients. 8 | """ 9 | pass 10 | 11 | 12 | class VernacularAPICallError(VernacularAPIError): 13 | """ 14 | Base class for exceptions raised by calling API methods. 15 | 16 | Args: 17 | message (str): The exception message. 18 | errors (Sequence[Any]): An optional list of error details. 19 | response (Union[requests.Request, grpc.Call]): The response or 20 | gRPC call metadata. 21 | """ 22 | 23 | code = None 24 | """ 25 | Optional[int]: The HTTP status code associated with this error. 26 | 27 | This may be ``None`` if the exception does not have a direct mapping 28 | to an HTTP error. 29 | 30 | See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html 31 | """ 32 | 33 | grpc_status_code = None 34 | """ 35 | Optional[grpc.StatusCode]: The gRPC status code associated with this 36 | error. 37 | 38 | This may be ``None`` if the exception does not match up to a gRPC error. 
39 | """ 40 | 41 | def __init__(self, message, errors=(), response=None): 42 | super(VernacularAPIError, self).__init__(message) 43 | self.message = message 44 | """str: The exception message.""" 45 | self._errors = errors 46 | self._response = response 47 | 48 | def __str__(self): 49 | return "{} {}".format(self.code, self.message) 50 | 51 | @property 52 | def errors(self): 53 | """Detailed error information. 54 | 55 | Returns: 56 | Sequence[Any]: A list of additional error details. 57 | """ 58 | return list(self._errors) 59 | 60 | @property 61 | def response(self): 62 | """Optional[Union[requests.Request, grpc.Call]]: The response or 63 | gRPC call metadata.""" 64 | return self._response -------------------------------------------------------------------------------- /python/vernacular/ai/speech/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | 3 | from vernacular.ai.speech import speech_client 4 | from vernacular.ai.speech import types 5 | from vernacular.ai.speech import enums 6 | 7 | 8 | class SpeechClient(speech_client.SpeechClient): 9 | __doc__ = speech_client.SpeechClient.__doc__ 10 | enums = enums 11 | types = types 12 | 13 | 14 | __all__ = ("SpeechClient", "enums", "types") 15 | -------------------------------------------------------------------------------- /python/vernacular/ai/speech/enums.py: -------------------------------------------------------------------------------- 1 | import enum 2 | 3 | 4 | class RecognitionConfig(object): 5 | class AudioEncoding(enum.IntEnum): 6 | """ 7 | The encoding of the audio data sent in the request. 8 | All encodings support only 1 channel (mono) audio, unless the 9 | ``audio_channel_count`` and ``enable_separate_recognition_per_channel`` 10 | fields are set. 11 | For best results, the audio source should be captured and transmitted 12 | using a lossless encoding (``FLAC`` or ``LINEAR16``). The accuracy of 13 | the speech recognition can be reduced if lossy codecs are used to 14 | capture or transmit audio, particularly if background noise is present. 15 | Lossy codecs include ``MULAW``, ``AMR``, ``AMR_WB``, ``OGG_OPUS``, 16 | ``SPEEX_WITH_HEADER_BYTE``, and ``MP3``. 17 | The ``FLAC`` and ``WAV`` audio file formats include a header that 18 | describes the included audio content. You can request recognition for 19 | ``WAV`` files that contain either ``LINEAR16`` or ``MULAW`` encoded 20 | audio. If you send ``FLAC`` or ``WAV`` audio file format in your 21 | request, you do not need to specify an ``AudioEncoding``; the audio 22 | encoding format is determined from the file header. If you specify an 23 | ``AudioEncoding`` when you send send ``FLAC`` or ``WAV`` audio, the 24 | encoding configuration must match the encoding described in the audio 25 | header; otherwise the request returns an 26 | ``google.rpc.Code.INVALID_ARGUMENT`` error code. 27 | Attributes: 28 | ENCODING_UNSPECIFIED (int): Not specified. 29 | LINEAR16 (int): Uncompressed 16-bit signed little-endian samples (Linear PCM). 30 | FLAC (int): ``FLAC`` (Free Lossless Audio Codec) is the recommended encoding because 31 | it is lossless--therefore recognition is not compromised--and requires 32 | only about half the bandwidth of ``LINEAR16``. ``FLAC`` stream encoding 33 | supports 16-bit and 24-bit samples, however, not all fields in 34 | ``STREAMINFO`` are supported. 
35 | """ 36 | 37 | ENCODING_UNSPECIFIED = 0 38 | LINEAR16 = 1 39 | FLAC = 2 40 | MP3 = 3 -------------------------------------------------------------------------------- /python/vernacular/ai/speech/proto/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/python/vernacular/ai/speech/proto/__init__.py -------------------------------------------------------------------------------- /python/vernacular/ai/speech/proto/speech-to-text.proto: -------------------------------------------------------------------------------- 1 | syntax = "proto3"; 2 | package speech_to_text; 3 | 4 | import "google/api/annotations.proto"; 5 | import "google/api/client.proto"; 6 | import "google/api/field_behavior.proto"; 7 | import "google/rpc/status.proto"; 8 | 9 | option java_multiple_files = true; 10 | option java_outer_classname = "SpeechToTextProto"; 11 | option java_package = "ai.vernacular.speech"; 12 | 13 | service SpeechToText { 14 | // Performs synchronous non-streaming speech recognition 15 | rpc Recognize(RecognizeRequest) returns (RecognizeResponse) { 16 | option (google.api.http) = { 17 | post: "/v2/speech:recognize" 18 | body: "*" 19 | }; 20 | option (google.api.method_signature) = "config,audio"; 21 | } 22 | 23 | // Performs bidirectional streaming speech recognition: receive results while 24 | // sending audio. This method is only available via the gRPC API (not REST). 25 | rpc StreamingRecognize(stream StreamingRecognizeRequest) returns (stream StreamingRecognizeResponse) {} 26 | 27 | // Performs asynchronous non-streaming speech recognition 28 | rpc LongRunningRecognize(LongRunningRecognizeRequest) returns (SpeechOperation) { 29 | option (google.api.http) = { 30 | post: "/v2/speech:longrunningrecognize" 31 | body: "*" 32 | }; 33 | option (google.api.method_signature) = "config,audio"; 34 | } 35 | // Returns SpeechOperation for LongRunningRecognize. Used for polling the result 36 | rpc GetSpeechOperation(SpeechOperationRequest) returns (SpeechOperation) { 37 | option (google.api.http) = { 38 | get: "/v2/speech_operations/{name}" 39 | }; 40 | } 41 | } 42 | 43 | //-------------------------------------------- 44 | // requests 45 | //-------------------------------------------- 46 | message RecognizeRequest { 47 | // Required. Provides information to the recognizer that specifies how to 48 | // process the request. 49 | RecognitionConfig config = 1 [(google.api.field_behavior) = REQUIRED]; 50 | 51 | // Required. The audio data to be recognized. 52 | RecognitionAudio audio = 2 [(google.api.field_behavior) = REQUIRED]; 53 | 54 | string segment = 16; 55 | } 56 | 57 | message LongRunningRecognizeRequest { 58 | // Required. Provides information to the recognizer that specifies how to 59 | // process the request. 60 | RecognitionConfig config = 1 [(google.api.field_behavior) = REQUIRED]; 61 | 62 | // Required. The audio data to be recognized. 63 | RecognitionAudio audio = 2 [(google.api.field_behavior) = REQUIRED]; 64 | 65 | // Optional. When operation completes, result is posted to this url if provided. 66 | string result_url = 11; 67 | 68 | string segment = 16; 69 | } 70 | 71 | message SpeechOperationRequest { 72 | // name of the speech operation 73 | string name = 1 [(google.api.field_behavior) = REQUIRED]; 74 | } 75 | 76 | message StreamingRecognizeRequest { 77 | // The streaming request, which is either a streaming config or audio content. 
78 | oneof streaming_request { 79 | // Provides information to the recognizer that specifies how to process the 80 | // request. The first `StreamingRecognizeRequest` message must contain a 81 | // `streaming_config` message. 82 | StreamingRecognitionConfig streaming_config = 1; 83 | 84 | // The audio data to be recognized. 85 | bytes audio_content = 2; 86 | } 87 | } 88 | 89 | message StreamingRecognitionConfig { 90 | // Required. Provides information to the recognizer that specifies how to 91 | // process the request. 92 | RecognitionConfig config = 1 [(google.api.field_behavior) = REQUIRED]; 93 | 94 | // If `true`, interim results (tentative hypotheses) may be 95 | // returned as they become available (these interim results are indicated with 96 | // the `is_final=false` flag). 97 | // If `false` or omitted, only `is_final=true` result(s) are returned. 98 | bool interim_results = 2; 99 | 100 | SilenceDetectionConfig silence_detection_config = 3; 101 | } 102 | 103 | // Provides information to the recognizer that specifies how to process the request 104 | message RecognitionConfig { 105 | enum AudioEncoding { 106 | ENCODING_UNSPECIFIED = 0; 107 | LINEAR16 = 1; 108 | FLAC = 2; 109 | MP3 = 3; 110 | } 111 | 112 | AudioEncoding encoding = 1; 113 | int32 sample_rate_hertz = 2; // Valid values are: 8000-48000. 114 | string language_code = 3 [(google.api.field_behavior) = REQUIRED]; 115 | int32 max_alternatives = 4; 116 | repeated SpeechContext speech_contexts = 5; 117 | int32 audio_channel_count = 6; 118 | bool enable_separate_recognition_per_channel = 7; 119 | bool enable_word_time_offsets = 8; 120 | bool enable_automatic_punctuation = 11; 121 | SpeakerDiarizationConfig diarization_config = 16; 122 | } 123 | 124 | message SpeechContext { 125 | repeated string phrases = 1; 126 | } 127 | 128 | // Config to enable speaker diarization. 129 | message SpeakerDiarizationConfig { 130 | // If 'true', enables speaker detection for each recognized word in 131 | // the top alternative of the recognition result using a speaker_tag provided 132 | // in the WordInfo. 133 | bool enable_speaker_diarization = 1; 134 | 135 | // Minimum number of speakers in the conversation. This range gives you more 136 | // flexibility by allowing the system to automatically determine the correct 137 | // number of speakers. If not set, the default value is 2. 138 | int32 min_speaker_count = 2; 139 | 140 | // Maximum number of speakers in the conversation. This range gives you more 141 | // flexibility by allowing the system to automatically determine the correct 142 | // number of speakers. If not set, the default value is 6. 143 | int32 max_speaker_count = 3; 144 | } 145 | 146 | message SilenceDetectionConfig { 147 | // If `true` enables silence detection 148 | bool enable_silence_detection = 1; 149 | float max_speech_timeout = 2; 150 | float silence_patience = 3; 151 | float no_input_timeout = 4; 152 | } 153 | 154 | // Either `content` or `uri` must be supplied. 
155 | message RecognitionAudio { 156 | oneof audio_source { 157 | bytes content = 1; 158 | string uri = 2; 159 | } 160 | } 161 | 162 | //-------------------------------------------- 163 | // responses 164 | //-------------------------------------------- 165 | message RecognizeResponse { 166 | repeated SpeechRecognitionResult results = 1; 167 | } 168 | 169 | message LongRunningRecognizeResponse { 170 | repeated SpeechRecognitionResult results = 1; 171 | } 172 | 173 | message StreamingRecognizeResponse { 174 | // If set, returns a [google.rpc.Status][google.rpc.Status] message that 175 | // specifies the error for the operation. 176 | google.rpc.Status error = 1; 177 | 178 | // This repeated list contains zero or more results that 179 | // correspond to consecutive portions of the audio currently being processed. 180 | // It contains zero or one `is_final=true` result (the newly settled portion), 181 | // followed by zero or more `is_final=false` results (the interim results). 182 | repeated StreamingRecognitionResult results = 2; 183 | } 184 | 185 | message SpeechRecognitionResult { 186 | repeated SpeechRecognitionAlternative alternatives = 1; 187 | int32 channel_tag = 2; 188 | } 189 | 190 | message StreamingRecognitionResult { 191 | // May contain one or more recognition hypotheses (up to the 192 | // maximum specified in `max_alternatives`). 193 | // These alternatives are ordered in terms of accuracy, with the top (first) 194 | // alternative being the most probable, as ranked by the recognizer. 195 | repeated SpeechRecognitionAlternative alternatives = 1; 196 | 197 | // If `false`, this `StreamingRecognitionResult` represents an 198 | // interim result that may change. If `true`, this is the final time the 199 | // speech service will return this particular `StreamingRecognitionResult`, 200 | // the recognizer will not return any further hypotheses for this portion of 201 | // the transcript and corresponding audio. 202 | bool is_final = 2; 203 | 204 | // An estimate of the likelihood that the recognizer will not 205 | // change its guess about this interim result. Values range from 0.0 206 | // (completely unstable) to 1.0 (completely stable). 207 | // This field is only provided for interim results (`is_final=false`). 208 | // The default of 0.0 is a sentinel value indicating `stability` was not set. 209 | float stability = 3; 210 | 211 | // Time offset of the end of this result relative to the 212 | // beginning of the audio. 213 | float result_end_time = 4; 214 | 215 | // For multi-channel audio, this is the channel number corresponding to the 216 | // recognized result for the audio from that channel. 217 | // For audio_channel_count = N, its output values can range from '1' to 'N'. 218 | int32 channel_tag = 5; 219 | } 220 | 221 | message SpeechRecognitionAlternative { 222 | string transcript = 1; 223 | float confidence = 2; 224 | repeated WordInfo words = 3; 225 | } 226 | 227 | message WordInfo { 228 | float start_time = 1; 229 | float end_time = 2; 230 | string word = 3; 231 | float confidence = 4; 232 | } 233 | 234 | message SpeechOperation { 235 | string name = 1; 236 | bool done = 2; 237 | oneof result { 238 | // If set, returns a [google.rpc.Status][google.rpc.Status] message that 239 | // specifies the error for the operation. 
240 | google.rpc.Status error = 3; 241 | 242 | LongRunningRecognizeResponse response = 4; 243 | } 244 | } 245 | -------------------------------------------------------------------------------- /python/vernacular/ai/speech/proto/speech_to_text_pb2.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # Generated by the protocol buffer compiler. DO NOT EDIT! 3 | # source: speech-to-text.proto 4 | """Generated protocol buffer code.""" 5 | from google.protobuf import descriptor as _descriptor 6 | from google.protobuf import message as _message 7 | from google.protobuf import reflection as _reflection 8 | from google.protobuf import symbol_database as _symbol_database 9 | # @@protoc_insertion_point(imports) 10 | 11 | _sym_db = _symbol_database.Default() 12 | 13 | 14 | from google.api import annotations_pb2 as google_dot_api_dot_annotations__pb2 15 | from google.api import client_pb2 as google_dot_api_dot_client__pb2 16 | from google.api import field_behavior_pb2 as google_dot_api_dot_field__behavior__pb2 17 | from google.rpc import status_pb2 as google_dot_rpc_dot_status__pb2 18 | 19 | 20 | DESCRIPTOR = _descriptor.FileDescriptor( 21 | name='speech-to-text.proto', 22 | package='speech_to_text', 23 | syntax='proto3', 24 | serialized_options=b'\n\024ai.vernacular.speechB\021SpeechToTextProtoP\001', 25 | create_key=_descriptor._internal_create_key, 26 | serialized_pb=b'\n\x14speech-to-text.proto\x12\x0espeech_to_text\x1a\x1cgoogle/api/annotations.proto\x1a\x17google/api/client.proto\x1a\x1fgoogle/api/field_behavior.proto\x1a\x17google/rpc/status.proto\"\x91\x01\n\x10RecognizeRequest\x12\x36\n\x06\x63onfig\x18\x01 \x01(\x0b\x32!.speech_to_text.RecognitionConfigB\x03\xe0\x41\x02\x12\x34\n\x05\x61udio\x18\x02 \x01(\x0b\x32 .speech_to_text.RecognitionAudioB\x03\xe0\x41\x02\x12\x0f\n\x07segment\x18\x10 \x01(\t\"\xb0\x01\n\x1bLongRunningRecognizeRequest\x12\x36\n\x06\x63onfig\x18\x01 \x01(\x0b\x32!.speech_to_text.RecognitionConfigB\x03\xe0\x41\x02\x12\x34\n\x05\x61udio\x18\x02 \x01(\x0b\x32 .speech_to_text.RecognitionAudioB\x03\xe0\x41\x02\x12\x12\n\nresult_url\x18\x0b \x01(\t\x12\x0f\n\x07segment\x18\x10 \x01(\t\"+\n\x16SpeechOperationRequest\x12\x11\n\x04name\x18\x01 \x01(\tB\x03\xe0\x41\x02\"\x91\x01\n\x19StreamingRecognizeRequest\x12\x46\n\x10streaming_config\x18\x01 \x01(\x0b\x32*.speech_to_text.StreamingRecognitionConfigH\x00\x12\x17\n\raudio_content\x18\x02 \x01(\x0cH\x00\x42\x13\n\x11streaming_request\"\xb7\x01\n\x1aStreamingRecognitionConfig\x12\x36\n\x06\x63onfig\x18\x01 \x01(\x0b\x32!.speech_to_text.RecognitionConfigB\x03\xe0\x41\x02\x12\x17\n\x0finterim_results\x18\x02 \x01(\x08\x12H\n\x18silence_detection_config\x18\x03 \x01(\x0b\x32&.speech_to_text.SilenceDetectionConfig\"\x87\x04\n\x11RecognitionConfig\x12\x41\n\x08\x65ncoding\x18\x01 \x01(\x0e\x32/.speech_to_text.RecognitionConfig.AudioEncoding\x12\x19\n\x11sample_rate_hertz\x18\x02 \x01(\x05\x12\x1a\n\rlanguage_code\x18\x03 \x01(\tB\x03\xe0\x41\x02\x12\x18\n\x10max_alternatives\x18\x04 \x01(\x05\x12\x36\n\x0fspeech_contexts\x18\x05 \x03(\x0b\x32\x1d.speech_to_text.SpeechContext\x12\x1b\n\x13\x61udio_channel_count\x18\x06 \x01(\x05\x12/\n\'enable_separate_recognition_per_channel\x18\x07 \x01(\x08\x12 \n\x18\x65nable_word_time_offsets\x18\x08 \x01(\x08\x12$\n\x1c\x65nable_automatic_punctuation\x18\x0b \x01(\x08\x12\x44\n\x12\x64iarization_config\x18\x10 
\x01(\x0b\x32(.speech_to_text.SpeakerDiarizationConfig\"J\n\rAudioEncoding\x12\x18\n\x14\x45NCODING_UNSPECIFIED\x10\x00\x12\x0c\n\x08LINEAR16\x10\x01\x12\x08\n\x04\x46LAC\x10\x02\x12\x07\n\x03MP3\x10\x03\" \n\rSpeechContext\x12\x0f\n\x07phrases\x18\x01 \x03(\t\"t\n\x18SpeakerDiarizationConfig\x12\"\n\x1a\x65nable_speaker_diarization\x18\x01 \x01(\x08\x12\x19\n\x11min_speaker_count\x18\x02 \x01(\x05\x12\x19\n\x11max_speaker_count\x18\x03 \x01(\x05\"\x8a\x01\n\x16SilenceDetectionConfig\x12 \n\x18\x65nable_silence_detection\x18\x01 \x01(\x08\x12\x1a\n\x12max_speech_timeout\x18\x02 \x01(\x02\x12\x18\n\x10silence_patience\x18\x03 \x01(\x02\x12\x18\n\x10no_input_timeout\x18\x04 \x01(\x02\"D\n\x10RecognitionAudio\x12\x11\n\x07\x63ontent\x18\x01 \x01(\x0cH\x00\x12\r\n\x03uri\x18\x02 \x01(\tH\x00\x42\x0e\n\x0c\x61udio_source\"M\n\x11RecognizeResponse\x12\x38\n\x07results\x18\x01 \x03(\x0b\x32\'.speech_to_text.SpeechRecognitionResult\"X\n\x1cLongRunningRecognizeResponse\x12\x38\n\x07results\x18\x01 \x03(\x0b\x32\'.speech_to_text.SpeechRecognitionResult\"|\n\x1aStreamingRecognizeResponse\x12!\n\x05\x65rror\x18\x01 \x01(\x0b\x32\x12.google.rpc.Status\x12;\n\x07results\x18\x02 \x03(\x0b\x32*.speech_to_text.StreamingRecognitionResult\"r\n\x17SpeechRecognitionResult\x12\x42\n\x0c\x61lternatives\x18\x01 \x03(\x0b\x32,.speech_to_text.SpeechRecognitionAlternative\x12\x13\n\x0b\x63hannel_tag\x18\x02 \x01(\x05\"\xb3\x01\n\x1aStreamingRecognitionResult\x12\x42\n\x0c\x61lternatives\x18\x01 \x03(\x0b\x32,.speech_to_text.SpeechRecognitionAlternative\x12\x10\n\x08is_final\x18\x02 \x01(\x08\x12\x11\n\tstability\x18\x03 \x01(\x02\x12\x17\n\x0fresult_end_time\x18\x04 \x01(\x02\x12\x13\n\x0b\x63hannel_tag\x18\x05 \x01(\x05\"o\n\x1cSpeechRecognitionAlternative\x12\x12\n\ntranscript\x18\x01 \x01(\t\x12\x12\n\nconfidence\x18\x02 \x01(\x02\x12\'\n\x05words\x18\x03 \x03(\x0b\x32\x18.speech_to_text.WordInfo\"R\n\x08WordInfo\x12\x12\n\nstart_time\x18\x01 \x01(\x02\x12\x10\n\x08\x65nd_time\x18\x02 \x01(\x02\x12\x0c\n\x04word\x18\x03 \x01(\t\x12\x12\n\nconfidence\x18\x04 \x01(\x02\"\x9e\x01\n\x0fSpeechOperation\x12\x0c\n\x04name\x18\x01 \x01(\t\x12\x0c\n\x04\x64one\x18\x02 \x01(\x08\x12#\n\x05\x65rror\x18\x03 \x01(\x0b\x32\x12.google.rpc.StatusH\x00\x12@\n\x08response\x18\x04 \x01(\x0b\x32,.speech_to_text.LongRunningRecognizeResponseH\x00\x42\x08\n\x06result2\xac\x04\n\x0cSpeechToText\x12\x80\x01\n\tRecognize\x12 .speech_to_text.RecognizeRequest\x1a!.speech_to_text.RecognizeResponse\".\x82\xd3\xe4\x93\x02\x19\"\x14/v2/speech:recognize:\x01*\xda\x41\x0c\x63onfig,audio\x12q\n\x12StreamingRecognize\x12).speech_to_text.StreamingRecognizeRequest\x1a*.speech_to_text.StreamingRecognizeResponse\"\x00(\x01\x30\x01\x12\x9f\x01\n\x14LongRunningRecognize\x12+.speech_to_text.LongRunningRecognizeRequest\x1a\x1f.speech_to_text.SpeechOperation\"9\x82\xd3\xe4\x93\x02$\"\x1f/v2/speech:longrunningrecognize:\x01*\xda\x41\x0c\x63onfig,audio\x12\x83\x01\n\x12GetSpeechOperation\x12&.speech_to_text.SpeechOperationRequest\x1a\x1f.speech_to_text.SpeechOperation\"$\x82\xd3\xe4\x93\x02\x1e\x12\x1c/v2/speech_operations/{name}B+\n\x14\x61i.vernacular.speechB\x11SpeechToTextProtoP\x01\x62\x06proto3' 27 | , 28 | dependencies=[google_dot_api_dot_annotations__pb2.DESCRIPTOR,google_dot_api_dot_client__pb2.DESCRIPTOR,google_dot_api_dot_field__behavior__pb2.DESCRIPTOR,google_dot_rpc_dot_status__pb2.DESCRIPTOR,]) 29 | 30 | 31 | 32 | _RECOGNITIONCONFIG_AUDIOENCODING = _descriptor.EnumDescriptor( 33 | name='AudioEncoding', 34 | 
full_name='speech_to_text.RecognitionConfig.AudioEncoding', 35 | filename=None, 36 | file=DESCRIPTOR, 37 | create_key=_descriptor._internal_create_key, 38 | values=[ 39 | _descriptor.EnumValueDescriptor( 40 | name='ENCODING_UNSPECIFIED', index=0, number=0, 41 | serialized_options=None, 42 | type=None, 43 | create_key=_descriptor._internal_create_key), 44 | _descriptor.EnumValueDescriptor( 45 | name='LINEAR16', index=1, number=1, 46 | serialized_options=None, 47 | type=None, 48 | create_key=_descriptor._internal_create_key), 49 | _descriptor.EnumValueDescriptor( 50 | name='FLAC', index=2, number=2, 51 | serialized_options=None, 52 | type=None, 53 | create_key=_descriptor._internal_create_key), 54 | _descriptor.EnumValueDescriptor( 55 | name='MP3', index=3, number=3, 56 | serialized_options=None, 57 | type=None, 58 | create_key=_descriptor._internal_create_key), 59 | ], 60 | containing_type=None, 61 | serialized_options=None, 62 | serialized_start=1305, 63 | serialized_end=1379, 64 | ) 65 | _sym_db.RegisterEnumDescriptor(_RECOGNITIONCONFIG_AUDIOENCODING) 66 | 67 | 68 | _RECOGNIZEREQUEST = _descriptor.Descriptor( 69 | name='RecognizeRequest', 70 | full_name='speech_to_text.RecognizeRequest', 71 | filename=None, 72 | file=DESCRIPTOR, 73 | containing_type=None, 74 | create_key=_descriptor._internal_create_key, 75 | fields=[ 76 | _descriptor.FieldDescriptor( 77 | name='config', full_name='speech_to_text.RecognizeRequest.config', index=0, 78 | number=1, type=11, cpp_type=10, label=1, 79 | has_default_value=False, default_value=None, 80 | message_type=None, enum_type=None, containing_type=None, 81 | is_extension=False, extension_scope=None, 82 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 83 | _descriptor.FieldDescriptor( 84 | name='audio', full_name='speech_to_text.RecognizeRequest.audio', index=1, 85 | number=2, type=11, cpp_type=10, label=1, 86 | has_default_value=False, default_value=None, 87 | message_type=None, enum_type=None, containing_type=None, 88 | is_extension=False, extension_scope=None, 89 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 90 | _descriptor.FieldDescriptor( 91 | name='segment', full_name='speech_to_text.RecognizeRequest.segment', index=2, 92 | number=16, type=9, cpp_type=9, label=1, 93 | has_default_value=False, default_value=b"".decode('utf-8'), 94 | message_type=None, enum_type=None, containing_type=None, 95 | is_extension=False, extension_scope=None, 96 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 97 | ], 98 | extensions=[ 99 | ], 100 | nested_types=[], 101 | enum_types=[ 102 | ], 103 | serialized_options=None, 104 | is_extendable=False, 105 | syntax='proto3', 106 | extension_ranges=[], 107 | oneofs=[ 108 | ], 109 | serialized_start=154, 110 | serialized_end=299, 111 | ) 112 | 113 | 114 | _LONGRUNNINGRECOGNIZEREQUEST = _descriptor.Descriptor( 115 | name='LongRunningRecognizeRequest', 116 | full_name='speech_to_text.LongRunningRecognizeRequest', 117 | filename=None, 118 | file=DESCRIPTOR, 119 | containing_type=None, 120 | create_key=_descriptor._internal_create_key, 121 | fields=[ 122 | _descriptor.FieldDescriptor( 123 | name='config', full_name='speech_to_text.LongRunningRecognizeRequest.config', index=0, 124 | number=1, type=11, cpp_type=10, label=1, 125 | has_default_value=False, default_value=None, 126 | message_type=None, enum_type=None, containing_type=None, 127 | is_extension=False, extension_scope=None, 128 | 
serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 129 | _descriptor.FieldDescriptor( 130 | name='audio', full_name='speech_to_text.LongRunningRecognizeRequest.audio', index=1, 131 | number=2, type=11, cpp_type=10, label=1, 132 | has_default_value=False, default_value=None, 133 | message_type=None, enum_type=None, containing_type=None, 134 | is_extension=False, extension_scope=None, 135 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 136 | _descriptor.FieldDescriptor( 137 | name='result_url', full_name='speech_to_text.LongRunningRecognizeRequest.result_url', index=2, 138 | number=11, type=9, cpp_type=9, label=1, 139 | has_default_value=False, default_value=b"".decode('utf-8'), 140 | message_type=None, enum_type=None, containing_type=None, 141 | is_extension=False, extension_scope=None, 142 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 143 | _descriptor.FieldDescriptor( 144 | name='segment', full_name='speech_to_text.LongRunningRecognizeRequest.segment', index=3, 145 | number=16, type=9, cpp_type=9, label=1, 146 | has_default_value=False, default_value=b"".decode('utf-8'), 147 | message_type=None, enum_type=None, containing_type=None, 148 | is_extension=False, extension_scope=None, 149 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 150 | ], 151 | extensions=[ 152 | ], 153 | nested_types=[], 154 | enum_types=[ 155 | ], 156 | serialized_options=None, 157 | is_extendable=False, 158 | syntax='proto3', 159 | extension_ranges=[], 160 | oneofs=[ 161 | ], 162 | serialized_start=302, 163 | serialized_end=478, 164 | ) 165 | 166 | 167 | _SPEECHOPERATIONREQUEST = _descriptor.Descriptor( 168 | name='SpeechOperationRequest', 169 | full_name='speech_to_text.SpeechOperationRequest', 170 | filename=None, 171 | file=DESCRIPTOR, 172 | containing_type=None, 173 | create_key=_descriptor._internal_create_key, 174 | fields=[ 175 | _descriptor.FieldDescriptor( 176 | name='name', full_name='speech_to_text.SpeechOperationRequest.name', index=0, 177 | number=1, type=9, cpp_type=9, label=1, 178 | has_default_value=False, default_value=b"".decode('utf-8'), 179 | message_type=None, enum_type=None, containing_type=None, 180 | is_extension=False, extension_scope=None, 181 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 182 | ], 183 | extensions=[ 184 | ], 185 | nested_types=[], 186 | enum_types=[ 187 | ], 188 | serialized_options=None, 189 | is_extendable=False, 190 | syntax='proto3', 191 | extension_ranges=[], 192 | oneofs=[ 193 | ], 194 | serialized_start=480, 195 | serialized_end=523, 196 | ) 197 | 198 | 199 | _STREAMINGRECOGNIZEREQUEST = _descriptor.Descriptor( 200 | name='StreamingRecognizeRequest', 201 | full_name='speech_to_text.StreamingRecognizeRequest', 202 | filename=None, 203 | file=DESCRIPTOR, 204 | containing_type=None, 205 | create_key=_descriptor._internal_create_key, 206 | fields=[ 207 | _descriptor.FieldDescriptor( 208 | name='streaming_config', full_name='speech_to_text.StreamingRecognizeRequest.streaming_config', index=0, 209 | number=1, type=11, cpp_type=10, label=1, 210 | has_default_value=False, default_value=None, 211 | message_type=None, enum_type=None, containing_type=None, 212 | is_extension=False, extension_scope=None, 213 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 214 | _descriptor.FieldDescriptor( 215 | 
name='audio_content', full_name='speech_to_text.StreamingRecognizeRequest.audio_content', index=1, 216 | number=2, type=12, cpp_type=9, label=1, 217 | has_default_value=False, default_value=b"", 218 | message_type=None, enum_type=None, containing_type=None, 219 | is_extension=False, extension_scope=None, 220 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 221 | ], 222 | extensions=[ 223 | ], 224 | nested_types=[], 225 | enum_types=[ 226 | ], 227 | serialized_options=None, 228 | is_extendable=False, 229 | syntax='proto3', 230 | extension_ranges=[], 231 | oneofs=[ 232 | _descriptor.OneofDescriptor( 233 | name='streaming_request', full_name='speech_to_text.StreamingRecognizeRequest.streaming_request', 234 | index=0, containing_type=None, 235 | create_key=_descriptor._internal_create_key, 236 | fields=[]), 237 | ], 238 | serialized_start=526, 239 | serialized_end=671, 240 | ) 241 | 242 | 243 | _STREAMINGRECOGNITIONCONFIG = _descriptor.Descriptor( 244 | name='StreamingRecognitionConfig', 245 | full_name='speech_to_text.StreamingRecognitionConfig', 246 | filename=None, 247 | file=DESCRIPTOR, 248 | containing_type=None, 249 | create_key=_descriptor._internal_create_key, 250 | fields=[ 251 | _descriptor.FieldDescriptor( 252 | name='config', full_name='speech_to_text.StreamingRecognitionConfig.config', index=0, 253 | number=1, type=11, cpp_type=10, label=1, 254 | has_default_value=False, default_value=None, 255 | message_type=None, enum_type=None, containing_type=None, 256 | is_extension=False, extension_scope=None, 257 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 258 | _descriptor.FieldDescriptor( 259 | name='interim_results', full_name='speech_to_text.StreamingRecognitionConfig.interim_results', index=1, 260 | number=2, type=8, cpp_type=7, label=1, 261 | has_default_value=False, default_value=False, 262 | message_type=None, enum_type=None, containing_type=None, 263 | is_extension=False, extension_scope=None, 264 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 265 | _descriptor.FieldDescriptor( 266 | name='silence_detection_config', full_name='speech_to_text.StreamingRecognitionConfig.silence_detection_config', index=2, 267 | number=3, type=11, cpp_type=10, label=1, 268 | has_default_value=False, default_value=None, 269 | message_type=None, enum_type=None, containing_type=None, 270 | is_extension=False, extension_scope=None, 271 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 272 | ], 273 | extensions=[ 274 | ], 275 | nested_types=[], 276 | enum_types=[ 277 | ], 278 | serialized_options=None, 279 | is_extendable=False, 280 | syntax='proto3', 281 | extension_ranges=[], 282 | oneofs=[ 283 | ], 284 | serialized_start=674, 285 | serialized_end=857, 286 | ) 287 | 288 | 289 | _RECOGNITIONCONFIG = _descriptor.Descriptor( 290 | name='RecognitionConfig', 291 | full_name='speech_to_text.RecognitionConfig', 292 | filename=None, 293 | file=DESCRIPTOR, 294 | containing_type=None, 295 | create_key=_descriptor._internal_create_key, 296 | fields=[ 297 | _descriptor.FieldDescriptor( 298 | name='encoding', full_name='speech_to_text.RecognitionConfig.encoding', index=0, 299 | number=1, type=14, cpp_type=8, label=1, 300 | has_default_value=False, default_value=0, 301 | message_type=None, enum_type=None, containing_type=None, 302 | is_extension=False, extension_scope=None, 303 | serialized_options=None, file=DESCRIPTOR, 
create_key=_descriptor._internal_create_key), 304 | _descriptor.FieldDescriptor( 305 | name='sample_rate_hertz', full_name='speech_to_text.RecognitionConfig.sample_rate_hertz', index=1, 306 | number=2, type=5, cpp_type=1, label=1, 307 | has_default_value=False, default_value=0, 308 | message_type=None, enum_type=None, containing_type=None, 309 | is_extension=False, extension_scope=None, 310 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 311 | _descriptor.FieldDescriptor( 312 | name='language_code', full_name='speech_to_text.RecognitionConfig.language_code', index=2, 313 | number=3, type=9, cpp_type=9, label=1, 314 | has_default_value=False, default_value=b"".decode('utf-8'), 315 | message_type=None, enum_type=None, containing_type=None, 316 | is_extension=False, extension_scope=None, 317 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 318 | _descriptor.FieldDescriptor( 319 | name='max_alternatives', full_name='speech_to_text.RecognitionConfig.max_alternatives', index=3, 320 | number=4, type=5, cpp_type=1, label=1, 321 | has_default_value=False, default_value=0, 322 | message_type=None, enum_type=None, containing_type=None, 323 | is_extension=False, extension_scope=None, 324 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 325 | _descriptor.FieldDescriptor( 326 | name='speech_contexts', full_name='speech_to_text.RecognitionConfig.speech_contexts', index=4, 327 | number=5, type=11, cpp_type=10, label=3, 328 | has_default_value=False, default_value=[], 329 | message_type=None, enum_type=None, containing_type=None, 330 | is_extension=False, extension_scope=None, 331 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 332 | _descriptor.FieldDescriptor( 333 | name='audio_channel_count', full_name='speech_to_text.RecognitionConfig.audio_channel_count', index=5, 334 | number=6, type=5, cpp_type=1, label=1, 335 | has_default_value=False, default_value=0, 336 | message_type=None, enum_type=None, containing_type=None, 337 | is_extension=False, extension_scope=None, 338 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 339 | _descriptor.FieldDescriptor( 340 | name='enable_separate_recognition_per_channel', full_name='speech_to_text.RecognitionConfig.enable_separate_recognition_per_channel', index=6, 341 | number=7, type=8, cpp_type=7, label=1, 342 | has_default_value=False, default_value=False, 343 | message_type=None, enum_type=None, containing_type=None, 344 | is_extension=False, extension_scope=None, 345 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 346 | _descriptor.FieldDescriptor( 347 | name='enable_word_time_offsets', full_name='speech_to_text.RecognitionConfig.enable_word_time_offsets', index=7, 348 | number=8, type=8, cpp_type=7, label=1, 349 | has_default_value=False, default_value=False, 350 | message_type=None, enum_type=None, containing_type=None, 351 | is_extension=False, extension_scope=None, 352 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 353 | _descriptor.FieldDescriptor( 354 | name='enable_automatic_punctuation', full_name='speech_to_text.RecognitionConfig.enable_automatic_punctuation', index=8, 355 | number=11, type=8, cpp_type=7, label=1, 356 | has_default_value=False, default_value=False, 357 | message_type=None, enum_type=None, containing_type=None, 358 | is_extension=False, 
extension_scope=None, 359 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 360 | _descriptor.FieldDescriptor( 361 | name='diarization_config', full_name='speech_to_text.RecognitionConfig.diarization_config', index=9, 362 | number=16, type=11, cpp_type=10, label=1, 363 | has_default_value=False, default_value=None, 364 | message_type=None, enum_type=None, containing_type=None, 365 | is_extension=False, extension_scope=None, 366 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 367 | ], 368 | extensions=[ 369 | ], 370 | nested_types=[], 371 | enum_types=[ 372 | _RECOGNITIONCONFIG_AUDIOENCODING, 373 | ], 374 | serialized_options=None, 375 | is_extendable=False, 376 | syntax='proto3', 377 | extension_ranges=[], 378 | oneofs=[ 379 | ], 380 | serialized_start=860, 381 | serialized_end=1379, 382 | ) 383 | 384 | 385 | _SPEECHCONTEXT = _descriptor.Descriptor( 386 | name='SpeechContext', 387 | full_name='speech_to_text.SpeechContext', 388 | filename=None, 389 | file=DESCRIPTOR, 390 | containing_type=None, 391 | create_key=_descriptor._internal_create_key, 392 | fields=[ 393 | _descriptor.FieldDescriptor( 394 | name='phrases', full_name='speech_to_text.SpeechContext.phrases', index=0, 395 | number=1, type=9, cpp_type=9, label=3, 396 | has_default_value=False, default_value=[], 397 | message_type=None, enum_type=None, containing_type=None, 398 | is_extension=False, extension_scope=None, 399 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 400 | ], 401 | extensions=[ 402 | ], 403 | nested_types=[], 404 | enum_types=[ 405 | ], 406 | serialized_options=None, 407 | is_extendable=False, 408 | syntax='proto3', 409 | extension_ranges=[], 410 | oneofs=[ 411 | ], 412 | serialized_start=1381, 413 | serialized_end=1413, 414 | ) 415 | 416 | 417 | _SPEAKERDIARIZATIONCONFIG = _descriptor.Descriptor( 418 | name='SpeakerDiarizationConfig', 419 | full_name='speech_to_text.SpeakerDiarizationConfig', 420 | filename=None, 421 | file=DESCRIPTOR, 422 | containing_type=None, 423 | create_key=_descriptor._internal_create_key, 424 | fields=[ 425 | _descriptor.FieldDescriptor( 426 | name='enable_speaker_diarization', full_name='speech_to_text.SpeakerDiarizationConfig.enable_speaker_diarization', index=0, 427 | number=1, type=8, cpp_type=7, label=1, 428 | has_default_value=False, default_value=False, 429 | message_type=None, enum_type=None, containing_type=None, 430 | is_extension=False, extension_scope=None, 431 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 432 | _descriptor.FieldDescriptor( 433 | name='min_speaker_count', full_name='speech_to_text.SpeakerDiarizationConfig.min_speaker_count', index=1, 434 | number=2, type=5, cpp_type=1, label=1, 435 | has_default_value=False, default_value=0, 436 | message_type=None, enum_type=None, containing_type=None, 437 | is_extension=False, extension_scope=None, 438 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 439 | _descriptor.FieldDescriptor( 440 | name='max_speaker_count', full_name='speech_to_text.SpeakerDiarizationConfig.max_speaker_count', index=2, 441 | number=3, type=5, cpp_type=1, label=1, 442 | has_default_value=False, default_value=0, 443 | message_type=None, enum_type=None, containing_type=None, 444 | is_extension=False, extension_scope=None, 445 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 446 | ], 447 | extensions=[ 448 | 
], 449 | nested_types=[], 450 | enum_types=[ 451 | ], 452 | serialized_options=None, 453 | is_extendable=False, 454 | syntax='proto3', 455 | extension_ranges=[], 456 | oneofs=[ 457 | ], 458 | serialized_start=1415, 459 | serialized_end=1531, 460 | ) 461 | 462 | 463 | _SILENCEDETECTIONCONFIG = _descriptor.Descriptor( 464 | name='SilenceDetectionConfig', 465 | full_name='speech_to_text.SilenceDetectionConfig', 466 | filename=None, 467 | file=DESCRIPTOR, 468 | containing_type=None, 469 | create_key=_descriptor._internal_create_key, 470 | fields=[ 471 | _descriptor.FieldDescriptor( 472 | name='enable_silence_detection', full_name='speech_to_text.SilenceDetectionConfig.enable_silence_detection', index=0, 473 | number=1, type=8, cpp_type=7, label=1, 474 | has_default_value=False, default_value=False, 475 | message_type=None, enum_type=None, containing_type=None, 476 | is_extension=False, extension_scope=None, 477 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 478 | _descriptor.FieldDescriptor( 479 | name='max_speech_timeout', full_name='speech_to_text.SilenceDetectionConfig.max_speech_timeout', index=1, 480 | number=2, type=2, cpp_type=6, label=1, 481 | has_default_value=False, default_value=float(0), 482 | message_type=None, enum_type=None, containing_type=None, 483 | is_extension=False, extension_scope=None, 484 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 485 | _descriptor.FieldDescriptor( 486 | name='silence_patience', full_name='speech_to_text.SilenceDetectionConfig.silence_patience', index=2, 487 | number=3, type=2, cpp_type=6, label=1, 488 | has_default_value=False, default_value=float(0), 489 | message_type=None, enum_type=None, containing_type=None, 490 | is_extension=False, extension_scope=None, 491 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 492 | _descriptor.FieldDescriptor( 493 | name='no_input_timeout', full_name='speech_to_text.SilenceDetectionConfig.no_input_timeout', index=3, 494 | number=4, type=2, cpp_type=6, label=1, 495 | has_default_value=False, default_value=float(0), 496 | message_type=None, enum_type=None, containing_type=None, 497 | is_extension=False, extension_scope=None, 498 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 499 | ], 500 | extensions=[ 501 | ], 502 | nested_types=[], 503 | enum_types=[ 504 | ], 505 | serialized_options=None, 506 | is_extendable=False, 507 | syntax='proto3', 508 | extension_ranges=[], 509 | oneofs=[ 510 | ], 511 | serialized_start=1534, 512 | serialized_end=1672, 513 | ) 514 | 515 | 516 | _RECOGNITIONAUDIO = _descriptor.Descriptor( 517 | name='RecognitionAudio', 518 | full_name='speech_to_text.RecognitionAudio', 519 | filename=None, 520 | file=DESCRIPTOR, 521 | containing_type=None, 522 | create_key=_descriptor._internal_create_key, 523 | fields=[ 524 | _descriptor.FieldDescriptor( 525 | name='content', full_name='speech_to_text.RecognitionAudio.content', index=0, 526 | number=1, type=12, cpp_type=9, label=1, 527 | has_default_value=False, default_value=b"", 528 | message_type=None, enum_type=None, containing_type=None, 529 | is_extension=False, extension_scope=None, 530 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 531 | _descriptor.FieldDescriptor( 532 | name='uri', full_name='speech_to_text.RecognitionAudio.uri', index=1, 533 | number=2, type=9, cpp_type=9, label=1, 534 | has_default_value=False, 
default_value=b"".decode('utf-8'), 535 | message_type=None, enum_type=None, containing_type=None, 536 | is_extension=False, extension_scope=None, 537 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 538 | ], 539 | extensions=[ 540 | ], 541 | nested_types=[], 542 | enum_types=[ 543 | ], 544 | serialized_options=None, 545 | is_extendable=False, 546 | syntax='proto3', 547 | extension_ranges=[], 548 | oneofs=[ 549 | _descriptor.OneofDescriptor( 550 | name='audio_source', full_name='speech_to_text.RecognitionAudio.audio_source', 551 | index=0, containing_type=None, 552 | create_key=_descriptor._internal_create_key, 553 | fields=[]), 554 | ], 555 | serialized_start=1674, 556 | serialized_end=1742, 557 | ) 558 | 559 | 560 | _RECOGNIZERESPONSE = _descriptor.Descriptor( 561 | name='RecognizeResponse', 562 | full_name='speech_to_text.RecognizeResponse', 563 | filename=None, 564 | file=DESCRIPTOR, 565 | containing_type=None, 566 | create_key=_descriptor._internal_create_key, 567 | fields=[ 568 | _descriptor.FieldDescriptor( 569 | name='results', full_name='speech_to_text.RecognizeResponse.results', index=0, 570 | number=1, type=11, cpp_type=10, label=3, 571 | has_default_value=False, default_value=[], 572 | message_type=None, enum_type=None, containing_type=None, 573 | is_extension=False, extension_scope=None, 574 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 575 | ], 576 | extensions=[ 577 | ], 578 | nested_types=[], 579 | enum_types=[ 580 | ], 581 | serialized_options=None, 582 | is_extendable=False, 583 | syntax='proto3', 584 | extension_ranges=[], 585 | oneofs=[ 586 | ], 587 | serialized_start=1744, 588 | serialized_end=1821, 589 | ) 590 | 591 | 592 | _LONGRUNNINGRECOGNIZERESPONSE = _descriptor.Descriptor( 593 | name='LongRunningRecognizeResponse', 594 | full_name='speech_to_text.LongRunningRecognizeResponse', 595 | filename=None, 596 | file=DESCRIPTOR, 597 | containing_type=None, 598 | create_key=_descriptor._internal_create_key, 599 | fields=[ 600 | _descriptor.FieldDescriptor( 601 | name='results', full_name='speech_to_text.LongRunningRecognizeResponse.results', index=0, 602 | number=1, type=11, cpp_type=10, label=3, 603 | has_default_value=False, default_value=[], 604 | message_type=None, enum_type=None, containing_type=None, 605 | is_extension=False, extension_scope=None, 606 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 607 | ], 608 | extensions=[ 609 | ], 610 | nested_types=[], 611 | enum_types=[ 612 | ], 613 | serialized_options=None, 614 | is_extendable=False, 615 | syntax='proto3', 616 | extension_ranges=[], 617 | oneofs=[ 618 | ], 619 | serialized_start=1823, 620 | serialized_end=1911, 621 | ) 622 | 623 | 624 | _STREAMINGRECOGNIZERESPONSE = _descriptor.Descriptor( 625 | name='StreamingRecognizeResponse', 626 | full_name='speech_to_text.StreamingRecognizeResponse', 627 | filename=None, 628 | file=DESCRIPTOR, 629 | containing_type=None, 630 | create_key=_descriptor._internal_create_key, 631 | fields=[ 632 | _descriptor.FieldDescriptor( 633 | name='error', full_name='speech_to_text.StreamingRecognizeResponse.error', index=0, 634 | number=1, type=11, cpp_type=10, label=1, 635 | has_default_value=False, default_value=None, 636 | message_type=None, enum_type=None, containing_type=None, 637 | is_extension=False, extension_scope=None, 638 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 639 | _descriptor.FieldDescriptor( 640 | 
name='results', full_name='speech_to_text.StreamingRecognizeResponse.results', index=1, 641 | number=2, type=11, cpp_type=10, label=3, 642 | has_default_value=False, default_value=[], 643 | message_type=None, enum_type=None, containing_type=None, 644 | is_extension=False, extension_scope=None, 645 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 646 | ], 647 | extensions=[ 648 | ], 649 | nested_types=[], 650 | enum_types=[ 651 | ], 652 | serialized_options=None, 653 | is_extendable=False, 654 | syntax='proto3', 655 | extension_ranges=[], 656 | oneofs=[ 657 | ], 658 | serialized_start=1913, 659 | serialized_end=2037, 660 | ) 661 | 662 | 663 | _SPEECHRECOGNITIONRESULT = _descriptor.Descriptor( 664 | name='SpeechRecognitionResult', 665 | full_name='speech_to_text.SpeechRecognitionResult', 666 | filename=None, 667 | file=DESCRIPTOR, 668 | containing_type=None, 669 | create_key=_descriptor._internal_create_key, 670 | fields=[ 671 | _descriptor.FieldDescriptor( 672 | name='alternatives', full_name='speech_to_text.SpeechRecognitionResult.alternatives', index=0, 673 | number=1, type=11, cpp_type=10, label=3, 674 | has_default_value=False, default_value=[], 675 | message_type=None, enum_type=None, containing_type=None, 676 | is_extension=False, extension_scope=None, 677 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 678 | _descriptor.FieldDescriptor( 679 | name='channel_tag', full_name='speech_to_text.SpeechRecognitionResult.channel_tag', index=1, 680 | number=2, type=5, cpp_type=1, label=1, 681 | has_default_value=False, default_value=0, 682 | message_type=None, enum_type=None, containing_type=None, 683 | is_extension=False, extension_scope=None, 684 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 685 | ], 686 | extensions=[ 687 | ], 688 | nested_types=[], 689 | enum_types=[ 690 | ], 691 | serialized_options=None, 692 | is_extendable=False, 693 | syntax='proto3', 694 | extension_ranges=[], 695 | oneofs=[ 696 | ], 697 | serialized_start=2039, 698 | serialized_end=2153, 699 | ) 700 | 701 | 702 | _STREAMINGRECOGNITIONRESULT = _descriptor.Descriptor( 703 | name='StreamingRecognitionResult', 704 | full_name='speech_to_text.StreamingRecognitionResult', 705 | filename=None, 706 | file=DESCRIPTOR, 707 | containing_type=None, 708 | create_key=_descriptor._internal_create_key, 709 | fields=[ 710 | _descriptor.FieldDescriptor( 711 | name='alternatives', full_name='speech_to_text.StreamingRecognitionResult.alternatives', index=0, 712 | number=1, type=11, cpp_type=10, label=3, 713 | has_default_value=False, default_value=[], 714 | message_type=None, enum_type=None, containing_type=None, 715 | is_extension=False, extension_scope=None, 716 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 717 | _descriptor.FieldDescriptor( 718 | name='is_final', full_name='speech_to_text.StreamingRecognitionResult.is_final', index=1, 719 | number=2, type=8, cpp_type=7, label=1, 720 | has_default_value=False, default_value=False, 721 | message_type=None, enum_type=None, containing_type=None, 722 | is_extension=False, extension_scope=None, 723 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 724 | _descriptor.FieldDescriptor( 725 | name='stability', full_name='speech_to_text.StreamingRecognitionResult.stability', index=2, 726 | number=3, type=2, cpp_type=6, label=1, 727 | has_default_value=False, default_value=float(0), 728 | 
message_type=None, enum_type=None, containing_type=None, 729 | is_extension=False, extension_scope=None, 730 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 731 | _descriptor.FieldDescriptor( 732 | name='result_end_time', full_name='speech_to_text.StreamingRecognitionResult.result_end_time', index=3, 733 | number=4, type=2, cpp_type=6, label=1, 734 | has_default_value=False, default_value=float(0), 735 | message_type=None, enum_type=None, containing_type=None, 736 | is_extension=False, extension_scope=None, 737 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 738 | _descriptor.FieldDescriptor( 739 | name='channel_tag', full_name='speech_to_text.StreamingRecognitionResult.channel_tag', index=4, 740 | number=5, type=5, cpp_type=1, label=1, 741 | has_default_value=False, default_value=0, 742 | message_type=None, enum_type=None, containing_type=None, 743 | is_extension=False, extension_scope=None, 744 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 745 | ], 746 | extensions=[ 747 | ], 748 | nested_types=[], 749 | enum_types=[ 750 | ], 751 | serialized_options=None, 752 | is_extendable=False, 753 | syntax='proto3', 754 | extension_ranges=[], 755 | oneofs=[ 756 | ], 757 | serialized_start=2156, 758 | serialized_end=2335, 759 | ) 760 | 761 | 762 | _SPEECHRECOGNITIONALTERNATIVE = _descriptor.Descriptor( 763 | name='SpeechRecognitionAlternative', 764 | full_name='speech_to_text.SpeechRecognitionAlternative', 765 | filename=None, 766 | file=DESCRIPTOR, 767 | containing_type=None, 768 | create_key=_descriptor._internal_create_key, 769 | fields=[ 770 | _descriptor.FieldDescriptor( 771 | name='transcript', full_name='speech_to_text.SpeechRecognitionAlternative.transcript', index=0, 772 | number=1, type=9, cpp_type=9, label=1, 773 | has_default_value=False, default_value=b"".decode('utf-8'), 774 | message_type=None, enum_type=None, containing_type=None, 775 | is_extension=False, extension_scope=None, 776 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 777 | _descriptor.FieldDescriptor( 778 | name='confidence', full_name='speech_to_text.SpeechRecognitionAlternative.confidence', index=1, 779 | number=2, type=2, cpp_type=6, label=1, 780 | has_default_value=False, default_value=float(0), 781 | message_type=None, enum_type=None, containing_type=None, 782 | is_extension=False, extension_scope=None, 783 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 784 | _descriptor.FieldDescriptor( 785 | name='words', full_name='speech_to_text.SpeechRecognitionAlternative.words', index=2, 786 | number=3, type=11, cpp_type=10, label=3, 787 | has_default_value=False, default_value=[], 788 | message_type=None, enum_type=None, containing_type=None, 789 | is_extension=False, extension_scope=None, 790 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 791 | ], 792 | extensions=[ 793 | ], 794 | nested_types=[], 795 | enum_types=[ 796 | ], 797 | serialized_options=None, 798 | is_extendable=False, 799 | syntax='proto3', 800 | extension_ranges=[], 801 | oneofs=[ 802 | ], 803 | serialized_start=2337, 804 | serialized_end=2448, 805 | ) 806 | 807 | 808 | _WORDINFO = _descriptor.Descriptor( 809 | name='WordInfo', 810 | full_name='speech_to_text.WordInfo', 811 | filename=None, 812 | file=DESCRIPTOR, 813 | containing_type=None, 814 | create_key=_descriptor._internal_create_key, 815 | fields=[ 816 
| _descriptor.FieldDescriptor( 817 | name='start_time', full_name='speech_to_text.WordInfo.start_time', index=0, 818 | number=1, type=2, cpp_type=6, label=1, 819 | has_default_value=False, default_value=float(0), 820 | message_type=None, enum_type=None, containing_type=None, 821 | is_extension=False, extension_scope=None, 822 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 823 | _descriptor.FieldDescriptor( 824 | name='end_time', full_name='speech_to_text.WordInfo.end_time', index=1, 825 | number=2, type=2, cpp_type=6, label=1, 826 | has_default_value=False, default_value=float(0), 827 | message_type=None, enum_type=None, containing_type=None, 828 | is_extension=False, extension_scope=None, 829 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 830 | _descriptor.FieldDescriptor( 831 | name='word', full_name='speech_to_text.WordInfo.word', index=2, 832 | number=3, type=9, cpp_type=9, label=1, 833 | has_default_value=False, default_value=b"".decode('utf-8'), 834 | message_type=None, enum_type=None, containing_type=None, 835 | is_extension=False, extension_scope=None, 836 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 837 | _descriptor.FieldDescriptor( 838 | name='confidence', full_name='speech_to_text.WordInfo.confidence', index=3, 839 | number=4, type=2, cpp_type=6, label=1, 840 | has_default_value=False, default_value=float(0), 841 | message_type=None, enum_type=None, containing_type=None, 842 | is_extension=False, extension_scope=None, 843 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 844 | ], 845 | extensions=[ 846 | ], 847 | nested_types=[], 848 | enum_types=[ 849 | ], 850 | serialized_options=None, 851 | is_extendable=False, 852 | syntax='proto3', 853 | extension_ranges=[], 854 | oneofs=[ 855 | ], 856 | serialized_start=2450, 857 | serialized_end=2532, 858 | ) 859 | 860 | 861 | _SPEECHOPERATION = _descriptor.Descriptor( 862 | name='SpeechOperation', 863 | full_name='speech_to_text.SpeechOperation', 864 | filename=None, 865 | file=DESCRIPTOR, 866 | containing_type=None, 867 | create_key=_descriptor._internal_create_key, 868 | fields=[ 869 | _descriptor.FieldDescriptor( 870 | name='name', full_name='speech_to_text.SpeechOperation.name', index=0, 871 | number=1, type=9, cpp_type=9, label=1, 872 | has_default_value=False, default_value=b"".decode('utf-8'), 873 | message_type=None, enum_type=None, containing_type=None, 874 | is_extension=False, extension_scope=None, 875 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 876 | _descriptor.FieldDescriptor( 877 | name='done', full_name='speech_to_text.SpeechOperation.done', index=1, 878 | number=2, type=8, cpp_type=7, label=1, 879 | has_default_value=False, default_value=False, 880 | message_type=None, enum_type=None, containing_type=None, 881 | is_extension=False, extension_scope=None, 882 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 883 | _descriptor.FieldDescriptor( 884 | name='error', full_name='speech_to_text.SpeechOperation.error', index=2, 885 | number=3, type=11, cpp_type=10, label=1, 886 | has_default_value=False, default_value=None, 887 | message_type=None, enum_type=None, containing_type=None, 888 | is_extension=False, extension_scope=None, 889 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 890 | _descriptor.FieldDescriptor( 891 | 
name='response', full_name='speech_to_text.SpeechOperation.response', index=3, 892 | number=4, type=11, cpp_type=10, label=1, 893 | has_default_value=False, default_value=None, 894 | message_type=None, enum_type=None, containing_type=None, 895 | is_extension=False, extension_scope=None, 896 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key), 897 | ], 898 | extensions=[ 899 | ], 900 | nested_types=[], 901 | enum_types=[ 902 | ], 903 | serialized_options=None, 904 | is_extendable=False, 905 | syntax='proto3', 906 | extension_ranges=[], 907 | oneofs=[ 908 | _descriptor.OneofDescriptor( 909 | name='result', full_name='speech_to_text.SpeechOperation.result', 910 | index=0, containing_type=None, 911 | create_key=_descriptor._internal_create_key, 912 | fields=[]), 913 | ], 914 | serialized_start=2535, 915 | serialized_end=2693, 916 | ) 917 | 918 | _RECOGNIZEREQUEST.fields_by_name['config'].message_type = _RECOGNITIONCONFIG 919 | _RECOGNIZEREQUEST.fields_by_name['audio'].message_type = _RECOGNITIONAUDIO 920 | _LONGRUNNINGRECOGNIZEREQUEST.fields_by_name['config'].message_type = _RECOGNITIONCONFIG 921 | _LONGRUNNINGRECOGNIZEREQUEST.fields_by_name['audio'].message_type = _RECOGNITIONAUDIO 922 | _STREAMINGRECOGNIZEREQUEST.fields_by_name['streaming_config'].message_type = _STREAMINGRECOGNITIONCONFIG 923 | _STREAMINGRECOGNIZEREQUEST.oneofs_by_name['streaming_request'].fields.append( 924 | _STREAMINGRECOGNIZEREQUEST.fields_by_name['streaming_config']) 925 | _STREAMINGRECOGNIZEREQUEST.fields_by_name['streaming_config'].containing_oneof = _STREAMINGRECOGNIZEREQUEST.oneofs_by_name['streaming_request'] 926 | _STREAMINGRECOGNIZEREQUEST.oneofs_by_name['streaming_request'].fields.append( 927 | _STREAMINGRECOGNIZEREQUEST.fields_by_name['audio_content']) 928 | _STREAMINGRECOGNIZEREQUEST.fields_by_name['audio_content'].containing_oneof = _STREAMINGRECOGNIZEREQUEST.oneofs_by_name['streaming_request'] 929 | _STREAMINGRECOGNITIONCONFIG.fields_by_name['config'].message_type = _RECOGNITIONCONFIG 930 | _STREAMINGRECOGNITIONCONFIG.fields_by_name['silence_detection_config'].message_type = _SILENCEDETECTIONCONFIG 931 | _RECOGNITIONCONFIG.fields_by_name['encoding'].enum_type = _RECOGNITIONCONFIG_AUDIOENCODING 932 | _RECOGNITIONCONFIG.fields_by_name['speech_contexts'].message_type = _SPEECHCONTEXT 933 | _RECOGNITIONCONFIG.fields_by_name['diarization_config'].message_type = _SPEAKERDIARIZATIONCONFIG 934 | _RECOGNITIONCONFIG_AUDIOENCODING.containing_type = _RECOGNITIONCONFIG 935 | _RECOGNITIONAUDIO.oneofs_by_name['audio_source'].fields.append( 936 | _RECOGNITIONAUDIO.fields_by_name['content']) 937 | _RECOGNITIONAUDIO.fields_by_name['content'].containing_oneof = _RECOGNITIONAUDIO.oneofs_by_name['audio_source'] 938 | _RECOGNITIONAUDIO.oneofs_by_name['audio_source'].fields.append( 939 | _RECOGNITIONAUDIO.fields_by_name['uri']) 940 | _RECOGNITIONAUDIO.fields_by_name['uri'].containing_oneof = _RECOGNITIONAUDIO.oneofs_by_name['audio_source'] 941 | _RECOGNIZERESPONSE.fields_by_name['results'].message_type = _SPEECHRECOGNITIONRESULT 942 | _LONGRUNNINGRECOGNIZERESPONSE.fields_by_name['results'].message_type = _SPEECHRECOGNITIONRESULT 943 | _STREAMINGRECOGNIZERESPONSE.fields_by_name['error'].message_type = google_dot_rpc_dot_status__pb2._STATUS 944 | _STREAMINGRECOGNIZERESPONSE.fields_by_name['results'].message_type = _STREAMINGRECOGNITIONRESULT 945 | _SPEECHRECOGNITIONRESULT.fields_by_name['alternatives'].message_type = _SPEECHRECOGNITIONALTERNATIVE 946 | 
_STREAMINGRECOGNITIONRESULT.fields_by_name['alternatives'].message_type = _SPEECHRECOGNITIONALTERNATIVE 947 | _SPEECHRECOGNITIONALTERNATIVE.fields_by_name['words'].message_type = _WORDINFO 948 | _SPEECHOPERATION.fields_by_name['error'].message_type = google_dot_rpc_dot_status__pb2._STATUS 949 | _SPEECHOPERATION.fields_by_name['response'].message_type = _LONGRUNNINGRECOGNIZERESPONSE 950 | _SPEECHOPERATION.oneofs_by_name['result'].fields.append( 951 | _SPEECHOPERATION.fields_by_name['error']) 952 | _SPEECHOPERATION.fields_by_name['error'].containing_oneof = _SPEECHOPERATION.oneofs_by_name['result'] 953 | _SPEECHOPERATION.oneofs_by_name['result'].fields.append( 954 | _SPEECHOPERATION.fields_by_name['response']) 955 | _SPEECHOPERATION.fields_by_name['response'].containing_oneof = _SPEECHOPERATION.oneofs_by_name['result'] 956 | DESCRIPTOR.message_types_by_name['RecognizeRequest'] = _RECOGNIZEREQUEST 957 | DESCRIPTOR.message_types_by_name['LongRunningRecognizeRequest'] = _LONGRUNNINGRECOGNIZEREQUEST 958 | DESCRIPTOR.message_types_by_name['SpeechOperationRequest'] = _SPEECHOPERATIONREQUEST 959 | DESCRIPTOR.message_types_by_name['StreamingRecognizeRequest'] = _STREAMINGRECOGNIZEREQUEST 960 | DESCRIPTOR.message_types_by_name['StreamingRecognitionConfig'] = _STREAMINGRECOGNITIONCONFIG 961 | DESCRIPTOR.message_types_by_name['RecognitionConfig'] = _RECOGNITIONCONFIG 962 | DESCRIPTOR.message_types_by_name['SpeechContext'] = _SPEECHCONTEXT 963 | DESCRIPTOR.message_types_by_name['SpeakerDiarizationConfig'] = _SPEAKERDIARIZATIONCONFIG 964 | DESCRIPTOR.message_types_by_name['SilenceDetectionConfig'] = _SILENCEDETECTIONCONFIG 965 | DESCRIPTOR.message_types_by_name['RecognitionAudio'] = _RECOGNITIONAUDIO 966 | DESCRIPTOR.message_types_by_name['RecognizeResponse'] = _RECOGNIZERESPONSE 967 | DESCRIPTOR.message_types_by_name['LongRunningRecognizeResponse'] = _LONGRUNNINGRECOGNIZERESPONSE 968 | DESCRIPTOR.message_types_by_name['StreamingRecognizeResponse'] = _STREAMINGRECOGNIZERESPONSE 969 | DESCRIPTOR.message_types_by_name['SpeechRecognitionResult'] = _SPEECHRECOGNITIONRESULT 970 | DESCRIPTOR.message_types_by_name['StreamingRecognitionResult'] = _STREAMINGRECOGNITIONRESULT 971 | DESCRIPTOR.message_types_by_name['SpeechRecognitionAlternative'] = _SPEECHRECOGNITIONALTERNATIVE 972 | DESCRIPTOR.message_types_by_name['WordInfo'] = _WORDINFO 973 | DESCRIPTOR.message_types_by_name['SpeechOperation'] = _SPEECHOPERATION 974 | _sym_db.RegisterFileDescriptor(DESCRIPTOR) 975 | 976 | RecognizeRequest = _reflection.GeneratedProtocolMessageType('RecognizeRequest', (_message.Message,), { 977 | 'DESCRIPTOR' : _RECOGNIZEREQUEST, 978 | '__module__' : 'speech_to_text_pb2' 979 | # @@protoc_insertion_point(class_scope:speech_to_text.RecognizeRequest) 980 | }) 981 | _sym_db.RegisterMessage(RecognizeRequest) 982 | 983 | LongRunningRecognizeRequest = _reflection.GeneratedProtocolMessageType('LongRunningRecognizeRequest', (_message.Message,), { 984 | 'DESCRIPTOR' : _LONGRUNNINGRECOGNIZEREQUEST, 985 | '__module__' : 'speech_to_text_pb2' 986 | # @@protoc_insertion_point(class_scope:speech_to_text.LongRunningRecognizeRequest) 987 | }) 988 | _sym_db.RegisterMessage(LongRunningRecognizeRequest) 989 | 990 | SpeechOperationRequest = _reflection.GeneratedProtocolMessageType('SpeechOperationRequest', (_message.Message,), { 991 | 'DESCRIPTOR' : _SPEECHOPERATIONREQUEST, 992 | '__module__' : 'speech_to_text_pb2' 993 | # @@protoc_insertion_point(class_scope:speech_to_text.SpeechOperationRequest) 994 | }) 995 | 
_sym_db.RegisterMessage(SpeechOperationRequest) 996 | 997 | StreamingRecognizeRequest = _reflection.GeneratedProtocolMessageType('StreamingRecognizeRequest', (_message.Message,), { 998 | 'DESCRIPTOR' : _STREAMINGRECOGNIZEREQUEST, 999 | '__module__' : 'speech_to_text_pb2' 1000 | # @@protoc_insertion_point(class_scope:speech_to_text.StreamingRecognizeRequest) 1001 | }) 1002 | _sym_db.RegisterMessage(StreamingRecognizeRequest) 1003 | 1004 | StreamingRecognitionConfig = _reflection.GeneratedProtocolMessageType('StreamingRecognitionConfig', (_message.Message,), { 1005 | 'DESCRIPTOR' : _STREAMINGRECOGNITIONCONFIG, 1006 | '__module__' : 'speech_to_text_pb2' 1007 | # @@protoc_insertion_point(class_scope:speech_to_text.StreamingRecognitionConfig) 1008 | }) 1009 | _sym_db.RegisterMessage(StreamingRecognitionConfig) 1010 | 1011 | RecognitionConfig = _reflection.GeneratedProtocolMessageType('RecognitionConfig', (_message.Message,), { 1012 | 'DESCRIPTOR' : _RECOGNITIONCONFIG, 1013 | '__module__' : 'speech_to_text_pb2' 1014 | # @@protoc_insertion_point(class_scope:speech_to_text.RecognitionConfig) 1015 | }) 1016 | _sym_db.RegisterMessage(RecognitionConfig) 1017 | 1018 | SpeechContext = _reflection.GeneratedProtocolMessageType('SpeechContext', (_message.Message,), { 1019 | 'DESCRIPTOR' : _SPEECHCONTEXT, 1020 | '__module__' : 'speech_to_text_pb2' 1021 | # @@protoc_insertion_point(class_scope:speech_to_text.SpeechContext) 1022 | }) 1023 | _sym_db.RegisterMessage(SpeechContext) 1024 | 1025 | SpeakerDiarizationConfig = _reflection.GeneratedProtocolMessageType('SpeakerDiarizationConfig', (_message.Message,), { 1026 | 'DESCRIPTOR' : _SPEAKERDIARIZATIONCONFIG, 1027 | '__module__' : 'speech_to_text_pb2' 1028 | # @@protoc_insertion_point(class_scope:speech_to_text.SpeakerDiarizationConfig) 1029 | }) 1030 | _sym_db.RegisterMessage(SpeakerDiarizationConfig) 1031 | 1032 | SilenceDetectionConfig = _reflection.GeneratedProtocolMessageType('SilenceDetectionConfig', (_message.Message,), { 1033 | 'DESCRIPTOR' : _SILENCEDETECTIONCONFIG, 1034 | '__module__' : 'speech_to_text_pb2' 1035 | # @@protoc_insertion_point(class_scope:speech_to_text.SilenceDetectionConfig) 1036 | }) 1037 | _sym_db.RegisterMessage(SilenceDetectionConfig) 1038 | 1039 | RecognitionAudio = _reflection.GeneratedProtocolMessageType('RecognitionAudio', (_message.Message,), { 1040 | 'DESCRIPTOR' : _RECOGNITIONAUDIO, 1041 | '__module__' : 'speech_to_text_pb2' 1042 | # @@protoc_insertion_point(class_scope:speech_to_text.RecognitionAudio) 1043 | }) 1044 | _sym_db.RegisterMessage(RecognitionAudio) 1045 | 1046 | RecognizeResponse = _reflection.GeneratedProtocolMessageType('RecognizeResponse', (_message.Message,), { 1047 | 'DESCRIPTOR' : _RECOGNIZERESPONSE, 1048 | '__module__' : 'speech_to_text_pb2' 1049 | # @@protoc_insertion_point(class_scope:speech_to_text.RecognizeResponse) 1050 | }) 1051 | _sym_db.RegisterMessage(RecognizeResponse) 1052 | 1053 | LongRunningRecognizeResponse = _reflection.GeneratedProtocolMessageType('LongRunningRecognizeResponse', (_message.Message,), { 1054 | 'DESCRIPTOR' : _LONGRUNNINGRECOGNIZERESPONSE, 1055 | '__module__' : 'speech_to_text_pb2' 1056 | # @@protoc_insertion_point(class_scope:speech_to_text.LongRunningRecognizeResponse) 1057 | }) 1058 | _sym_db.RegisterMessage(LongRunningRecognizeResponse) 1059 | 1060 | StreamingRecognizeResponse = _reflection.GeneratedProtocolMessageType('StreamingRecognizeResponse', (_message.Message,), { 1061 | 'DESCRIPTOR' : _STREAMINGRECOGNIZERESPONSE, 1062 | '__module__' : 'speech_to_text_pb2' 1063 | # 
@@protoc_insertion_point(class_scope:speech_to_text.StreamingRecognizeResponse) 1064 | }) 1065 | _sym_db.RegisterMessage(StreamingRecognizeResponse) 1066 | 1067 | SpeechRecognitionResult = _reflection.GeneratedProtocolMessageType('SpeechRecognitionResult', (_message.Message,), { 1068 | 'DESCRIPTOR' : _SPEECHRECOGNITIONRESULT, 1069 | '__module__' : 'speech_to_text_pb2' 1070 | # @@protoc_insertion_point(class_scope:speech_to_text.SpeechRecognitionResult) 1071 | }) 1072 | _sym_db.RegisterMessage(SpeechRecognitionResult) 1073 | 1074 | StreamingRecognitionResult = _reflection.GeneratedProtocolMessageType('StreamingRecognitionResult', (_message.Message,), { 1075 | 'DESCRIPTOR' : _STREAMINGRECOGNITIONRESULT, 1076 | '__module__' : 'speech_to_text_pb2' 1077 | # @@protoc_insertion_point(class_scope:speech_to_text.StreamingRecognitionResult) 1078 | }) 1079 | _sym_db.RegisterMessage(StreamingRecognitionResult) 1080 | 1081 | SpeechRecognitionAlternative = _reflection.GeneratedProtocolMessageType('SpeechRecognitionAlternative', (_message.Message,), { 1082 | 'DESCRIPTOR' : _SPEECHRECOGNITIONALTERNATIVE, 1083 | '__module__' : 'speech_to_text_pb2' 1084 | # @@protoc_insertion_point(class_scope:speech_to_text.SpeechRecognitionAlternative) 1085 | }) 1086 | _sym_db.RegisterMessage(SpeechRecognitionAlternative) 1087 | 1088 | WordInfo = _reflection.GeneratedProtocolMessageType('WordInfo', (_message.Message,), { 1089 | 'DESCRIPTOR' : _WORDINFO, 1090 | '__module__' : 'speech_to_text_pb2' 1091 | # @@protoc_insertion_point(class_scope:speech_to_text.WordInfo) 1092 | }) 1093 | _sym_db.RegisterMessage(WordInfo) 1094 | 1095 | SpeechOperation = _reflection.GeneratedProtocolMessageType('SpeechOperation', (_message.Message,), { 1096 | 'DESCRIPTOR' : _SPEECHOPERATION, 1097 | '__module__' : 'speech_to_text_pb2' 1098 | # @@protoc_insertion_point(class_scope:speech_to_text.SpeechOperation) 1099 | }) 1100 | _sym_db.RegisterMessage(SpeechOperation) 1101 | 1102 | 1103 | DESCRIPTOR._options = None 1104 | _RECOGNIZEREQUEST.fields_by_name['config']._options = None 1105 | _RECOGNIZEREQUEST.fields_by_name['audio']._options = None 1106 | _LONGRUNNINGRECOGNIZEREQUEST.fields_by_name['config']._options = None 1107 | _LONGRUNNINGRECOGNIZEREQUEST.fields_by_name['audio']._options = None 1108 | _SPEECHOPERATIONREQUEST.fields_by_name['name']._options = None 1109 | _STREAMINGRECOGNITIONCONFIG.fields_by_name['config']._options = None 1110 | _RECOGNITIONCONFIG.fields_by_name['language_code']._options = None 1111 | 1112 | _SPEECHTOTEXT = _descriptor.ServiceDescriptor( 1113 | name='SpeechToText', 1114 | full_name='speech_to_text.SpeechToText', 1115 | file=DESCRIPTOR, 1116 | index=0, 1117 | serialized_options=None, 1118 | create_key=_descriptor._internal_create_key, 1119 | serialized_start=2696, 1120 | serialized_end=3252, 1121 | methods=[ 1122 | _descriptor.MethodDescriptor( 1123 | name='Recognize', 1124 | full_name='speech_to_text.SpeechToText.Recognize', 1125 | index=0, 1126 | containing_service=None, 1127 | input_type=_RECOGNIZEREQUEST, 1128 | output_type=_RECOGNIZERESPONSE, 1129 | serialized_options=b'\202\323\344\223\002\031\"\024/v2/speech:recognize:\001*\332A\014config,audio', 1130 | create_key=_descriptor._internal_create_key, 1131 | ), 1132 | _descriptor.MethodDescriptor( 1133 | name='StreamingRecognize', 1134 | full_name='speech_to_text.SpeechToText.StreamingRecognize', 1135 | index=1, 1136 | containing_service=None, 1137 | input_type=_STREAMINGRECOGNIZEREQUEST, 1138 | output_type=_STREAMINGRECOGNIZERESPONSE, 1139 | 
serialized_options=None, 1140 | create_key=_descriptor._internal_create_key, 1141 | ), 1142 | _descriptor.MethodDescriptor( 1143 | name='LongRunningRecognize', 1144 | full_name='speech_to_text.SpeechToText.LongRunningRecognize', 1145 | index=2, 1146 | containing_service=None, 1147 | input_type=_LONGRUNNINGRECOGNIZEREQUEST, 1148 | output_type=_SPEECHOPERATION, 1149 | serialized_options=b'\202\323\344\223\002$\"\037/v2/speech:longrunningrecognize:\001*\332A\014config,audio', 1150 | create_key=_descriptor._internal_create_key, 1151 | ), 1152 | _descriptor.MethodDescriptor( 1153 | name='GetSpeechOperation', 1154 | full_name='speech_to_text.SpeechToText.GetSpeechOperation', 1155 | index=3, 1156 | containing_service=None, 1157 | input_type=_SPEECHOPERATIONREQUEST, 1158 | output_type=_SPEECHOPERATION, 1159 | serialized_options=b'\202\323\344\223\002\036\022\034/v2/speech_operations/{name}', 1160 | create_key=_descriptor._internal_create_key, 1161 | ), 1162 | ]) 1163 | _sym_db.RegisterServiceDescriptor(_SPEECHTOTEXT) 1164 | 1165 | DESCRIPTOR.services_by_name['SpeechToText'] = _SPEECHTOTEXT 1166 | 1167 | # @@protoc_insertion_point(module_scope) 1168 | -------------------------------------------------------------------------------- /python/vernacular/ai/speech/proto/speech_to_text_pb2_grpc.py: -------------------------------------------------------------------------------- 1 | # Generated by the gRPC Python protocol compiler plugin. DO NOT EDIT! 2 | """Client and server classes corresponding to protobuf-defined services.""" 3 | import grpc 4 | 5 | from . import speech_to_text_pb2 as speech__to__text__pb2 6 | 7 | 8 | class SpeechToTextStub(object): 9 | """Missing associated documentation comment in .proto file.""" 10 | 11 | def __init__(self, channel): 12 | """Constructor. 13 | 14 | Args: 15 | channel: A grpc.Channel. 16 | """ 17 | self.Recognize = channel.unary_unary( 18 | '/speech_to_text.SpeechToText/Recognize', 19 | request_serializer=speech__to__text__pb2.RecognizeRequest.SerializeToString, 20 | response_deserializer=speech__to__text__pb2.RecognizeResponse.FromString, 21 | ) 22 | self.StreamingRecognize = channel.stream_stream( 23 | '/speech_to_text.SpeechToText/StreamingRecognize', 24 | request_serializer=speech__to__text__pb2.StreamingRecognizeRequest.SerializeToString, 25 | response_deserializer=speech__to__text__pb2.StreamingRecognizeResponse.FromString, 26 | ) 27 | self.LongRunningRecognize = channel.unary_unary( 28 | '/speech_to_text.SpeechToText/LongRunningRecognize', 29 | request_serializer=speech__to__text__pb2.LongRunningRecognizeRequest.SerializeToString, 30 | response_deserializer=speech__to__text__pb2.SpeechOperation.FromString, 31 | ) 32 | self.GetSpeechOperation = channel.unary_unary( 33 | '/speech_to_text.SpeechToText/GetSpeechOperation', 34 | request_serializer=speech__to__text__pb2.SpeechOperationRequest.SerializeToString, 35 | response_deserializer=speech__to__text__pb2.SpeechOperation.FromString, 36 | ) 37 | 38 | 39 | class SpeechToTextServicer(object): 40 | """Missing associated documentation comment in .proto file.""" 41 | 42 | def Recognize(self, request, context): 43 | """Performs synchronous non-streaming speech recognition 44 | """ 45 | context.set_code(grpc.StatusCode.UNIMPLEMENTED) 46 | context.set_details('Method not implemented!') 47 | raise NotImplementedError('Method not implemented!') 48 | 49 | def StreamingRecognize(self, request_iterator, context): 50 | """Performs bidirectional streaming speech recognition: receive results while 51 | sending audio. 
This method is only available via the gRPC API (not REST). 52 | """ 53 | context.set_code(grpc.StatusCode.UNIMPLEMENTED) 54 | context.set_details('Method not implemented!') 55 | raise NotImplementedError('Method not implemented!') 56 | 57 | def LongRunningRecognize(self, request, context): 58 | """Performs asynchronous non-streaming speech recognition 59 | """ 60 | context.set_code(grpc.StatusCode.UNIMPLEMENTED) 61 | context.set_details('Method not implemented!') 62 | raise NotImplementedError('Method not implemented!') 63 | 64 | def GetSpeechOperation(self, request, context): 65 | """Returns SpeechOperation for LongRunningRecognize. Used for polling the result 66 | """ 67 | context.set_code(grpc.StatusCode.UNIMPLEMENTED) 68 | context.set_details('Method not implemented!') 69 | raise NotImplementedError('Method not implemented!') 70 | 71 | 72 | def add_SpeechToTextServicer_to_server(servicer, server): 73 | rpc_method_handlers = { 74 | 'Recognize': grpc.unary_unary_rpc_method_handler( 75 | servicer.Recognize, 76 | request_deserializer=speech__to__text__pb2.RecognizeRequest.FromString, 77 | response_serializer=speech__to__text__pb2.RecognizeResponse.SerializeToString, 78 | ), 79 | 'StreamingRecognize': grpc.stream_stream_rpc_method_handler( 80 | servicer.StreamingRecognize, 81 | request_deserializer=speech__to__text__pb2.StreamingRecognizeRequest.FromString, 82 | response_serializer=speech__to__text__pb2.StreamingRecognizeResponse.SerializeToString, 83 | ), 84 | 'LongRunningRecognize': grpc.unary_unary_rpc_method_handler( 85 | servicer.LongRunningRecognize, 86 | request_deserializer=speech__to__text__pb2.LongRunningRecognizeRequest.FromString, 87 | response_serializer=speech__to__text__pb2.SpeechOperation.SerializeToString, 88 | ), 89 | 'GetSpeechOperation': grpc.unary_unary_rpc_method_handler( 90 | servicer.GetSpeechOperation, 91 | request_deserializer=speech__to__text__pb2.SpeechOperationRequest.FromString, 92 | response_serializer=speech__to__text__pb2.SpeechOperation.SerializeToString, 93 | ), 94 | } 95 | generic_handler = grpc.method_handlers_generic_handler( 96 | 'speech_to_text.SpeechToText', rpc_method_handlers) 97 | server.add_generic_rpc_handlers((generic_handler,)) 98 | 99 | 100 | # This class is part of an EXPERIMENTAL API. 
101 | class SpeechToText(object): 102 | """Missing associated documentation comment in .proto file.""" 103 | 104 | @staticmethod 105 | def Recognize(request, 106 | target, 107 | options=(), 108 | channel_credentials=None, 109 | call_credentials=None, 110 | insecure=False, 111 | compression=None, 112 | wait_for_ready=None, 113 | timeout=None, 114 | metadata=None): 115 | return grpc.experimental.unary_unary(request, target, '/speech_to_text.SpeechToText/Recognize', 116 | speech__to__text__pb2.RecognizeRequest.SerializeToString, 117 | speech__to__text__pb2.RecognizeResponse.FromString, 118 | options, channel_credentials, 119 | insecure, call_credentials, compression, wait_for_ready, timeout, metadata) 120 | 121 | @staticmethod 122 | def StreamingRecognize(request_iterator, 123 | target, 124 | options=(), 125 | channel_credentials=None, 126 | call_credentials=None, 127 | insecure=False, 128 | compression=None, 129 | wait_for_ready=None, 130 | timeout=None, 131 | metadata=None): 132 | return grpc.experimental.stream_stream(request_iterator, target, '/speech_to_text.SpeechToText/StreamingRecognize', 133 | speech__to__text__pb2.StreamingRecognizeRequest.SerializeToString, 134 | speech__to__text__pb2.StreamingRecognizeResponse.FromString, 135 | options, channel_credentials, 136 | insecure, call_credentials, compression, wait_for_ready, timeout, metadata) 137 | 138 | @staticmethod 139 | def LongRunningRecognize(request, 140 | target, 141 | options=(), 142 | channel_credentials=None, 143 | call_credentials=None, 144 | insecure=False, 145 | compression=None, 146 | wait_for_ready=None, 147 | timeout=None, 148 | metadata=None): 149 | return grpc.experimental.unary_unary(request, target, '/speech_to_text.SpeechToText/LongRunningRecognize', 150 | speech__to__text__pb2.LongRunningRecognizeRequest.SerializeToString, 151 | speech__to__text__pb2.SpeechOperation.FromString, 152 | options, channel_credentials, 153 | insecure, call_credentials, compression, wait_for_ready, timeout, metadata) 154 | 155 | @staticmethod 156 | def GetSpeechOperation(request, 157 | target, 158 | options=(), 159 | channel_credentials=None, 160 | call_credentials=None, 161 | insecure=False, 162 | compression=None, 163 | wait_for_ready=None, 164 | timeout=None, 165 | metadata=None): 166 | return grpc.experimental.unary_unary(request, target, '/speech_to_text.SpeechToText/GetSpeechOperation', 167 | speech__to__text__pb2.SpeechOperationRequest.SerializeToString, 168 | speech__to__text__pb2.SpeechOperation.FromString, 169 | options, channel_credentials, 170 | insecure, call_credentials, compression, wait_for_ready, timeout, metadata) 171 | -------------------------------------------------------------------------------- /python/vernacular/ai/speech/speech_client.py: -------------------------------------------------------------------------------- 1 | import grpc 2 | import time 3 | 4 | from vernacular.ai.speech.proto import speech_to_text_pb2 as sppt_pb 5 | from vernacular.ai.speech.proto import speech_to_text_pb2_grpc as sppt_grpc_pb 6 | from vernacular.ai.exceptions import VernacularAPICallError 7 | 8 | 9 | class SpeechClient(object): 10 | """ 11 | Class that implements Vernacular.ai ASR API 12 | """ 13 | 14 | STTP_GRPC_HOST = "speechapis.vernacular.ai:80" 15 | AUTHORIZATION = "authorization" 16 | DEFAULT_TIMEOUT = 30 17 | 18 | def __init__(self, access_token): 19 | """Constructor. 20 | Args: 21 | access_token: The authorization token to send with the requests. 
22 | """ 23 | self.access_token = f"bearer {access_token}" 24 | self.channel = grpc.insecure_channel(self.STTP_GRPC_HOST) 25 | 26 | self.client = sppt_grpc_pb.SpeechToTextStub(self.channel) 27 | 28 | def recognize(self, config, audio, timeout=None): 29 | """ 30 | Performs synchronous speech recognition: receive results after all audio 31 | has been sent and processed. 32 | 33 | Example: 34 | >>> from vernacular.ai import speech 35 | >>> from vernacular.ai.speech import enums 36 | >>> 37 | >>> client = speech.SpeechClient(access_token) 38 | >>> 39 | >>> encoding = enums.RecognitionConfig.AudioEncoding.LINEAR16 40 | >>> sample_rate_hertz = 8000 41 | >>> language_code = 'en-IN' 42 | >>> config = {'encoding': encoding, 'sample_rate_hertz': sample_rate_hertz, 'language_code': language_code} 43 | >>> content = open('path/to/audio/file.wav', 'rb').read() 44 | >>> audio = {'content': content} 45 | >>> 46 | >>> response = client.recognize(config, audio) 47 | Args: 48 | config (Union[dict, ~vernacular.ai.speech.types.RecognitionConfig]): Required. Provides information to the 49 | recognizer that specifies how to process the request. 50 | If a dict is provided, it must be of the same form as the protobuf 51 | message :class:`~vernacular.ai.speech.types.RecognitionConfig` 52 | audio (Union[dict, ~vernacular.ai.speech.types.RecognitionAudio]): Required. The audio data to be recognized. 53 | If a dict is provided, it must be of the same form as the protobuf 54 | message :class:`~vernacular.ai.speech.types.RecognitionAudio` 55 | timeout (Optional[float]): The amount of time, in seconds, to wait 56 | for the request to complete. Default value is `30s`. 57 | Returns: 58 | A :class:`~vernacular.ai.speech.types.RecognizeResponse` instance. 59 | Raises: 60 | vernacular.ai.exceptions.VernacularAPICallError: If the request 61 | failed for any reason. 62 | ValueError: If the parameters are invalid. 63 | """ 64 | request = sppt_pb.RecognizeRequest(config=config, audio=audio) 65 | if timeout is None: 66 | timeout = self.DEFAULT_TIMEOUT 67 | 68 | response = None 69 | try: 70 | response = self.client.Recognize( 71 | request, 72 | timeout=timeout, 73 | metadata=[(self.AUTHORIZATION, self.access_token)] 74 | ) 75 | return response 76 | except Exception as e: 77 | raise VernacularAPICallError(message=str(e),response=response) 78 | 79 | 80 | def long_running_recognize(self, config, audio, timeout=None, poll_time=8, callback=None): 81 | """ 82 | Performs asynchronous speech recognition. Returns either an 83 | ``Operation.error`` or an ``Operation.response`` which contains a 84 | ``LongRunningRecognizeResponse`` message. For more information on 85 | asynchronous speech recognition, see the 86 | `how-to `. 87 | 88 | Example: 89 | >>> from vernacular.ai import speech 90 | >>> from vernacular.ai.speech import enums 91 | >>> 92 | >>> client = speech.SpeechClient(access_token) 93 | >>> 94 | >>> encoding = enums.RecognitionConfig.AudioEncoding.LINEAR16 95 | >>> sample_rate_hertz = 8000 96 | >>> language_code = 'en-IN' 97 | >>> config = {'encoding': encoding, 'sample_rate_hertz': sample_rate_hertz, 'language_code': language_code} 98 | >>> content = open('path/to/audio/file.wav', 'rb').read() 99 | >>> audio = {'content': content} 100 | >>> 101 | >>> def handle_result(result): 102 | ... # Handle result. 103 | ... print(result) 104 | >>> 105 | >>> response = client.long_running_recognize(config, audio, callback=handle_result) 106 | Args: 107 | config (Union[dict, ~vernacular.ai.speech.types.RecognitionConfig]): Required. 
Provides information to the
108 |                 recognizer that specifies how to process the request.
109 |                 If a dict is provided, it must be of the same form as the protobuf
110 |                 message :class:`~vernacular.ai.speech.types.RecognitionConfig`
111 |             audio (Union[dict, ~vernacular.ai.speech.types.RecognitionAudio]): Required. The audio data to be recognized.
112 |                 If a dict is provided, it must be of the same form as the protobuf
113 |                 message :class:`~vernacular.ai.speech.types.RecognitionAudio`
114 |             timeout (Optional[float]): The amount of time, in seconds, to wait
115 |                 for the request to complete. Default value is `30s`.
116 |             poll_time (Optional[float]): The interval, in seconds, at which the operation
117 |                 is polled for a result. Default value is `8s`. Min value is `5s`.
118 |             callback (Optional): Function called with the final response once the operation completes.
119 |         Returns:
120 |             A :class:`~vernacular.ai.speech.types.SpeechOperation` instance.
121 |         Raises:
122 |             vernacular.ai.exceptions.VernacularAPICallError: If the request
123 |                 failed for any reason.
124 |             ValueError: If the parameters are invalid.
125 |         """
126 |         request = sppt_pb.LongRunningRecognizeRequest(config=config, audio=audio)
127 |         if timeout is None:
128 |             timeout = self.DEFAULT_TIMEOUT
129 | 
130 |         speech_operation = None
131 |         try:
132 |             speech_operation = self.client.LongRunningRecognize(
133 |                 request,
134 |                 timeout=timeout,
135 |                 metadata=[(self.AUTHORIZATION, self.access_token)]
136 |             )
137 |         except Exception as e:
138 |             raise VernacularAPICallError(message=str(e), response=speech_operation)
139 | 
140 |         # set minimum value for poll time
141 |         if poll_time < 5:
142 |             poll_time = 5
143 | 
144 |         operation_request = sppt_pb.SpeechOperationRequest(name=speech_operation.name)
145 |         response = None
146 |         is_done = False
147 |         try:
148 |             while not is_done:
149 |                 time.sleep(poll_time)
150 |                 response = self.client.GetSpeechOperation(
151 |                     operation_request,
152 |                     timeout=timeout,
153 |                     metadata=[(self.AUTHORIZATION, self.access_token)]
154 |                 )
155 |                 is_done = response.done
156 |             if callback is not None: callback(response)  # invoke the documented callback with the final result
157 |             return response
158 |         except Exception as e:
159 |             raise VernacularAPICallError(message=str(e), response=speech_operation)
160 | 
161 | 
162 |     def _streaming_request_iterable(self, config, requests):
163 |         """A generator that yields the config followed by the requests.
164 |         """
165 |         yield sppt_pb.StreamingRecognizeRequest(streaming_config=config)
166 |         for request in requests:
167 |             yield request
168 | 
169 |     def streaming_recognize(self, config, requests, timeout=None):
170 |         """
171 |         Performs bidirectional streaming speech recognition: receive results while
172 |         sending audio. This method is only available via the gRPC API (not REST).
173 |         Example:
174 |             >>> from vernacular.ai import speech
175 |             >>> from vernacular.ai.speech import enums, types
176 |             >>> client = speech.SpeechClient(access_token)
177 |             >>> config = types.StreamingRecognitionConfig(
178 |             ...     config=types.RecognitionConfig(
179 |             ...         encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
180 |             ...     ),
181 |             ... )
182 |             >>>
183 |             >>> request = types.StreamingRecognizeRequest(audio_content=b'...')
184 |             >>> requests = [request]
185 |             >>> for element in client.streaming_recognize(config, requests):
186 |             ...     # process element
187 |             ...     pass
188 |         Args:
189 |             config (vernacular.ai.speech.types.StreamingRecognitionConfig): The config to use for streaming
190 |             requests (iterator[dict|vernacular.ai.speech.types.StreamingRecognizeRequest]):
191 |                 The input objects.
If a dict is provided, it must be of the 192 | same form as the protobuf message:`~vernacular.ai.speech.types.StreamingRecognizeRequest` 193 | timeout (Optional[float]): The amount of time, in seconds, to wait 194 | for the request to complete. Default value is `30s`. 195 | Returns: 196 | Iterable[~vernacular.ai.speech.types.StreamingRecognizeResponse]. 197 | Raises: 198 | vernacular.ai.exceptions.VernacularAPICallError: If the request 199 | failed for any reason. 200 | ValueError: If the parameters are invalid. 201 | """ 202 | if timeout is None: 203 | timeout = self.DEFAULT_TIMEOUT 204 | 205 | streaming_responses = self.client.StreamingRecognize( 206 | self._streaming_request_iterable(config, requests), 207 | metadata=[(self.AUTHORIZATION, self.access_token)] 208 | ) 209 | return streaming_responses 210 | -------------------------------------------------------------------------------- /python/vernacular/ai/speech/types.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import 2 | import sys 3 | import collections 4 | import inspect 5 | 6 | from google.rpc import status_pb2 7 | from google.protobuf.message import Message 8 | 9 | from vernacular.ai.speech.proto import speech_to_text_pb2 10 | from vernacular.ai.speech.utils import _SpeechOperation 11 | 12 | 13 | _shared_modules = [status_pb2] 14 | _local_modules = [speech_to_text_pb2] 15 | 16 | names = [] 17 | 18 | def get_messages(module): 19 | """Discovers all protobuf Message classes in a given import module. 20 | 21 | Args: 22 | module (module): A Python module; :func:`dir` will be run against this 23 | module to find Message subclasses. 24 | 25 | Returns: 26 | dict[str, google.protobuf.message.Message]: A dictionary with the 27 | Message class names as keys, and the Message subclasses themselves 28 | as values. 
29 |     """
30 |     answer = collections.OrderedDict()
31 |     for name in dir(module):
32 |         candidate = getattr(module, name)
33 |         if inspect.isclass(candidate) and issubclass(candidate, Message):
34 |             answer[name] = candidate
35 |     return answer
36 | 
37 | 
38 | for module in _shared_modules: # pragma: NO COVER
39 |     for name, message in get_messages(module).items():
40 |         setattr(sys.modules[__name__], name, message)
41 |         names.append(name)
42 | 
43 | for module in _local_modules:
44 |     for name, message in get_messages(module).items():
45 |         message.__module__ = "vernacular.ai.speech.types"
46 |         setattr(sys.modules[__name__], name, message)
47 |         names.append(name)
48 | 
49 | 
50 | __all__ = tuple(sorted(names))
51 | 
--------------------------------------------------------------------------------
/python/vernacular/ai/speech/utils.py:
--------------------------------------------------------------------------------
1 | 
2 | class _SpeechOperation(object):
3 | 
4 |     def add_done_callback(self, callback):
5 |         pass
--------------------------------------------------------------------------------
/resources/hello.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/resources/hello.wav
--------------------------------------------------------------------------------
/resources/test-single-channel-8000Hz.raw:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/resources/test-single-channel-8000Hz.raw
--------------------------------------------------------------------------------
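The SpeechClient wrapper above is the intended entry point, but the generated stub in speech_to_text_pb2_grpc.py can also be driven directly. Below is a minimal sketch of a synchronous Recognize call through that stub, assuming the endpoint, metadata key, and bearer-token format shown in speech_client.py, and assuming the enums helper mirrors the proto enum values (as its use in the docstrings suggests); the access token, audio path, and config values are placeholders, not part of the repository.

import grpc

from vernacular.ai.speech import enums
from vernacular.ai.speech.proto import speech_to_text_pb2 as pb
from vernacular.ai.speech.proto import speech_to_text_pb2_grpc as pb_grpc

ACCESS_TOKEN = "..."  # placeholder: your Vernacular.ai access token

# Plaintext channel to the same endpoint SpeechClient uses.
channel = grpc.insecure_channel("speechapis.vernacular.ai:80")
stub = pb_grpc.SpeechToTextStub(channel)

# Build the request from a raw config and audio payload.
config = pb.RecognitionConfig(
    encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-IN",
)
with open("path/to/audio/file.wav", "rb") as f:
    audio = pb.RecognitionAudio(content=f.read())

# Authorization is sent as call metadata, mirroring SpeechClient.
response = stub.Recognize(
    pb.RecognizeRequest(config=config, audio=audio),
    timeout=30,
    metadata=[("authorization", f"bearer {ACCESS_TOKEN}")],
)
print(response)

The same stub also exposes StreamingRecognize, LongRunningRecognize, and GetSpeechOperation; SpeechClient wraps these with the polling and metadata handling shown earlier.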