├── .gitignore
├── LICENSE
├── README.md
├── docs
├── api_reference
│ ├── LongRunningRecognize.md
│ └── Recognize.md
├── kws
│ ├── LongRunningRecognize.md
│ ├── RecognitionConfig.md
│ └── Recognize.md
├── long-audios.md
├── rpc_reference
│ ├── LongRunningRecognize.md
│ ├── Recognize.md
│ └── StreamingRecognize.md
├── short-audios.md
├── streaming-audios.md
└── types
│ ├── RecognitionAudio.md
│ ├── RecognitionConfig.md
│ └── SpeechRecognitionResult.md
├── go
└── README.md
├── java
├── .gitignore
├── README.md
├── build.gradle
├── gradle
│ └── wrapper
│ │ ├── gradle-wrapper.jar
│ │ └── gradle-wrapper.properties
├── gradlew
├── gradlew.bat
├── samples
│ ├── build.gradle
│ └── src
│ │ └── main
│ │ └── java
│ │ └── ai
│ │ └── vernacular
│ │ └── examples
│ │ └── speech
│ │ └── RecognizeSync.java
├── settings.gradle
└── vernacular-ai-speech
│ ├── build.gradle
│ └── src
│ └── main
│ ├── java
│ └── ai
│ │ └── vernacular
│ │ └── speech
│ │ └── SpeechClient.java
│ └── proto
│ └── speech-to-text.proto
├── python
├── .gitignore
├── README.md
├── requirements.txt
├── samples
│ ├── recognize_async.py
│ ├── recognize_multi_channel.py
│ ├── recognize_streaming.py
│ ├── recognize_streaming_mic.py
│ ├── recognize_sync.py
│ └── recognize_word_offset.py
├── setup.cfg
├── setup.py
├── tests
│ └── __init__.py
└── vernacular
│ ├── __init__.py
│ └── ai
│ ├── __init__.py
│ ├── exceptions
│ └── __init__.py
│ └── speech
│ ├── __init__.py
│ ├── enums.py
│ ├── proto
│ ├── __init__.py
│ ├── speech-to-text.proto
│ ├── speech_to_text_pb2.py
│ └── speech_to_text_pb2_grpc.py
│ ├── speech_client.py
│ ├── types.py
│ └── utils.py
└── resources
├── hello.wav
└── test-single-channel-8000Hz.raw
/.gitignore:
--------------------------------------------------------------------------------
1 | # JetBrains
2 | .idea
3 |
4 | # VS Code
5 | .vscode
6 | .envrc
7 | .DS_Store
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Speech-to-Text API
2 | Converts audio to text
3 |
4 | We support these ten Indian languages ([language codes](https://github.com/Vernacular-ai/speech-recognition/blob/master/docs/types/RecognitionConfig.md#languagesupport)).
5 | - Hindi
6 | - English
7 | - Marathi
8 | - Kannada
9 | - Malayalam
10 | - Bengali
11 | - Gujarati
12 | - Punjabi
13 | - Telugu
14 | - Tamil
15 |
16 | ## Authentication
17 | ~~To get access to our APIs reach out to us at hello@vernacular.ai~~
18 | We no longer provide public access tokens for the APIs.
19 |
20 | ## Ways to use the Service
21 | - Transcribing short audios [audios up to 1 min]
22 | - Transcribing long audios [more than 1 min]
23 | - Transcribing audio from streaming input
24 |
25 | We recommend that you call this service using the Vernacular-provided client libraries. If your application needs to call this service using your own libraries, you should use the HTTP endpoints.
26 |
27 | **Supported SDKs**: [Python](https://github.com/Vernacular-ai/speech-recognition/tree/master/python)
28 |
29 |
30 | ## REST Reference
31 |
32 | **ServiceHost:** https://asr.vernacular.ai
33 |
34 | ### Speech Recognition
35 | | Name | Description |
36 | |--|--|
37 | | [recognize](docs/api_reference/Recognize.md) | Performs synchronous speech recognition: receive results after all audio has been sent and processed. |
38 | | [longrunningrecognize](docs/api_reference/LongRunningRecognize.md) | Performs asynchronous speech recognition. Generally used for long audios |
39 |
40 |
41 | ## RPC Reference
42 |
43 | ### Speech Recognition
44 | | Methods | Description |
45 | |--|--|
46 | |[Recognize](docs/rpc_reference/Recognize.md) | Performs synchronous speech recognition: receive results after all audio has been sent and processed.|
47 | |[LongRunningRecognize](docs/rpc_reference/LongRunningRecognize.md) | Performs asynchronous speech recognition: receive results via the longrunning.Operations interface.|
48 | |[StreamingRecognize](docs/rpc_reference/StreamingRecognize.md) |Performs streaming speech recognition: receive results while sending audio. Supports both unidirectional and bidirectional streaming.|
49 |
--------------------------------------------------------------------------------
/docs/api_reference/LongRunningRecognize.md:
--------------------------------------------------------------------------------
1 | # LongRunningRecognize
2 | Performs asynchronous speech recognition. Returns an intermediate response [SpeechOperation](#speechoperation), which will contain either a response or an error once processing is done.
3 |
4 | To get the latest state of the speech operation, you can poll for the result using a [GetSpeechOperation](#getspeechoperation) request.
5 |
6 |
7 | ### Request Method
8 | `POST https://asr.vernacular.ai/v2/speech:longrunningrecognize`
9 |
10 | ### Request Headers
11 | ```
12 | X-ACCESS-TOKEN: some-access-token
13 | content-type: application/json
14 | ```
15 |
16 | ### Request Body
17 | The request body contains data with the following structure:
18 |
19 | ```js
20 | {
21 | "config": {
22 | object (RecognitionConfig)
23 | },
24 | "audio": {
25 | object (RecognitionAudio)
26 | },
27 | "result_url": string,
28 | }
29 | ```
30 |
31 | | Fields | Description|
32 | |--|--|
33 | |config|object ([RecognitionConfig](../types/RecognitionConfig.md))
Required. Provides information to the recognizer that specifies how to process the request.|
34 | |audio|object ([RecognitionAudio](../types/RecognitionAudio.md))
Required. The audio data to be recognized.|
35 | |result_url| string
Optional. The results are posted to this URL when processing is done.|
36 |
37 | ### ResponseBody
38 | An intermediate speech operation object which contains the response upon completion.
39 |
40 | ```js
41 | {
42 | "name": string,
43 | "done": bool,
44 | "result": union {
45 | object (google.grpc.Status),
46 | object (LongRunningRecognizeResponse),
47 | },
48 | }
49 | ```
50 |
51 | |Fields| Description |
52 | |--|--|
53 | |name| string
The server-assigned name, which is only unique within the same service that originally returns it.|
54 | |done| bool
If the value is false, it means the operation is still in progress. If true, the operation is completed, and either error or response is available.|
55 | |result| Union field
The operation result, which can be either an error or a valid response. If done == false, neither error nor response is set. If done == true, exactly one of error or response is set. See below for more details|
56 |
57 | `result` can be only one of the following:
58 | | Field| |
59 | |--|--|
60 | |error| google.rpc.Status
The error result of the operation in case of failure or cancellation.|
61 | |response| [LongRunningRecognizeResponse](#longrunningrecognizeresponse)
The normal response of the operation in case of success.|
62 |
63 | ## LongRunningRecognizeResponse
64 | The only message returned to the client by the LongRunningRecognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.
65 |
66 | |Fields | Description|
67 | |--|--|
68 | |results[] | [SpeechRecognitionResult](../types/SpeechRecognitionResult.md)
Sequential list of transcription results corresponding to sequential portions of audio.|
69 |
70 |
71 |
72 | ## GetSpeechOperation
73 |
74 | `GET https://asr.vernacular.ai/v2/speech_operations/{name}`
75 |
76 | Gets the latest state of a long-running operation. Clients can use this method to poll the operation result at regular intervals.
77 |
78 | |UrlParam|Description|
79 | |--|--|
80 | |name| string
The name of the SpeechOperation in the [response](#responsebody)|
81 |
82 | Returns the [response](#responsebody) with the latest state.
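
### Example: submit and poll (Python)

A minimal sketch of the submit-and-poll flow using the `requests` library. The access token, audio URL, and polling interval below are placeholders, not values from this repository.

```python
import time

import requests

BASE_URL = "https://asr.vernacular.ai"
HEADERS = {
    "X-ACCESS-TOKEN": "some-access-token",  # placeholder token
    "Content-Type": "application/json",
}

# Submit the long-running request.
operation = requests.post(
    f"{BASE_URL}/v2/speech:longrunningrecognize",
    headers=HEADERS,
    json={
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 8000,
            "languageCode": "en-IN",
        },
        "audio": {"uri": "https://audio-url.wav"},  # placeholder audio URL
    },
).json()

# Poll GetSpeechOperation until the operation reports done.
while not operation.get("done"):
    time.sleep(5)  # polling interval is a choice, not an API requirement
    operation = requests.get(
        f"{BASE_URL}/v2/speech_operations/{operation['name']}",
        headers=HEADERS,
    ).json()

print(operation)  # contains either an error or a LongRunningRecognizeResponse
```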
--------------------------------------------------------------------------------
/docs/api_reference/Recognize.md:
--------------------------------------------------------------------------------
1 | # Recognize
2 | Performs synchronous speech recognition, i.e. results are received after all audio has been sent and processed.
3 |
4 | **Note**: Audio longer than 60 seconds does not work with synchronous Recognize. Use [LongRunningRecognize](LongRunningRecognize.md) for long audios.
5 |
6 | ### Request Method
7 | `POST https://asr.vernacular.ai/v2/speech:recognize`
8 |
9 | ### Request Headers
10 | ```
11 | X-ACCESS-TOKEN: some-access-token
12 | content-type: application/json
13 | ```
14 |
15 | ### Request Body
16 | The request body contains data with the following structure:
17 |
18 | ```js
19 | {
20 | "config": {
21 | object (RecognitionConfig)
22 | },
23 | "audio": {
24 | object (RecognitionAudio)
25 | }
26 | }
27 | ```
28 |
29 | |Fields|Description|
30 | |--|--|
31 | |config|object ([RecognitionConfig](../types/RecognitionConfig.md))
Required. Provides information to the recognizer that specifies how to process the request.|
32 | |audio|object ([RecognitionAudio](../types/RecognitionAudio.md))
Required. The audio data to be recognized.|
33 |
34 | ### Response Body
35 | If successful, the response body contains data with the following structure:
36 |
37 | The only message returned to the client by the recognize method. It contains the result as zero or more sequential [SpeechRecognitionResult](../types/SpeechRecognitionResult.md) messages.
38 |
39 | ```js
40 | {
41 | "results": [
42 | {
43 | object (SpeechRecognitionResult)
44 | }
45 | ]
46 | }
47 | ```
48 |
49 | ### Sample Request and Response
50 | Request:
51 | ```bash
52 | curl -X POST 'https://asr.vernacular.ai/v2/speech:recognize' \
53 | --header 'X-ACCESS-TOKEN: {{access-token}}' \
54 | --header 'Content-Type: application/json' \
55 | --data-raw '{
56 | "config": {
57 | "encoding": "LINEAR16",
58 | "sampleRateHertz": 8000,
59 | "languageCode": "en-IN",
60 | "maxAlternatives": 2
61 | },
62 | "audio": {
63 | "uri": "https://audio-url.wav"
64 | }
65 | }'
66 | ```
67 | Response:
68 | ```json
69 | {
70 | "results": [
71 | {
72 | "alternatives": [
73 | {
74 | "transcript": "i want to know my balance",
75 | "confidence": 0.95417684
76 | },
77 | {
78 | "transcript": "i want know balance",
79 | "confidence": 0.95404005
80 | }
81 | ]
82 | }
83 | ]
84 | }
85 | ```
86 |
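The same request can also be made from Python with the audio sent inline instead of by URI. This is a sketch that assumes the `requests` library and a local 8 kHz LINEAR16 WAV file; the file path and token are placeholders.

```python
import base64

import requests

# Read a short (under ~1 minute) audio file and base64-encode it for the
# "content" field of RecognitionAudio.
with open("hello.wav", "rb") as f:  # placeholder path
    audio_content = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "https://asr.vernacular.ai/v2/speech:recognize",
    headers={"X-ACCESS-TOKEN": "some-access-token"},  # placeholder token
    json={
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 8000,
            "languageCode": "en-IN",
            "maxAlternatives": 2,
        },
        "audio": {"content": audio_content},
    },
)
print(response.json())
```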
--------------------------------------------------------------------------------
/docs/kws/LongRunningRecognize.md:
--------------------------------------------------------------------------------
1 | # LongRunningRecognize
2 | Performs asynchronous keyword spotting recognition. Returns an intermediate response [KWSOperation](#kwsoperation), which will contain either a response or an error once processing is done.
3 |
4 | To get the latest state of the operation, you can poll for the result using a [GetKWSOperation](#getkwsoperation) request.
5 |
6 |
7 | ### Request Method
8 | `POST https://asr.vernacular.ai/v1/kws:longrunningrecognize`
9 |
10 | ### Request Headers
11 | ```
12 | X-ACCESS-TOKEN: some-access-token
13 | content-type: application/json
14 | ```
15 |
16 | ### Request Body
17 | The request body contains data with the following structure:
18 |
19 | ```js
20 | {
21 | "config": {
22 | object (RecognitionConfig)
23 | },
24 | "audio": {
25 | object (RecognitionAudio)
26 | },
27 | "keywords": [string],
28 | "result_url": string,
29 | }
30 | ```
31 |
32 | | Fields | Description|
33 | |--|--|
34 | |config|object ([RecognitionConfig](./RecognitionConfig.md))
Required. Provides information to the recognizer that specifies how to process the request.|
35 | |audio|object ([RecognitionAudio](../types/RecognitionAudio.md))
Required. The audio data to be recognized.|
36 | |keywords| Array of strings. The keywords to search for in the audio. |
37 | |result_url| string
Optional. The results are posted to this URL when processing is done.|
38 |
39 | ### ResponseBody
40 | An intermediate speech operation object which contains the response upon completion.
41 |
42 | ```js
43 | {
44 | "name": string,
45 | "done": bool,
46 | "result": union {
47 | object (google.grpc.Status),
48 | object (LongRunningRecognizeResponse),
49 | },
50 | }
51 | ```
52 |
53 | |Fields| Description |
54 | |--|--|
55 | |name| string
The server-assigned name, which is only unique within the same service that originally returns it.|
56 | |done| bool
If the value is false, it means the operation is still in progress. If true, the operation is completed, and either error or response is available.|
57 | |result| Union field
The operation result, which can be either an error or a valid response. If done == false, neither error nor response is set. If done == true, exactly one of error or response is set. See below for more details|
58 |
59 | `result` can be only one of the following:
60 | | Field| |
61 | |--|--|
62 | |error| google.rpc.Status
The error result of the operation in case of failure or cancellation.|
63 | |response| [LongRunningRecognizeResponse](#longrunningrecognizeresponse)
The normal response of the operation in case of success.|
64 |
65 | ## LongRunningRecognizeResponse
66 | The only message returned to the client by the LongRunningRecognize method.
67 |
68 | ```js
69 | {
70 | "results": [
71 | {
72 | "transcript": string,
73 | "matched_words": [
74 | {
75 | "start_time": float,
76 | "end_time": float,
77 | "word": string
78 | }
79 | ]
80 | }
81 | ]
82 | }
83 | ```
84 |
85 | |Fields | Description|
86 | |--|--|
87 | |results[] |
Sequential list of KWS results corresponding to sequential portions of audio.|
88 |
89 | ## GetKWSOperation
90 |
91 | `GET https://asr.vernacular.ai/v1/kws_operations/{name}`
92 |
93 | Gets the latest state of a long-running operation. Clients can use this method to poll the operation result at regular intervals.
94 |
95 | |UrlParam|Description|
96 | |--|--|
97 | |name| string
The name of the KWSOperation in the [response](#responsebody)|
98 |
99 | Returns the [response](#responsebody) with the latest state.
100 |
101 | ### Sample Request and Response
102 | Request:
103 | ```bash
104 | curl -X POST 'https://asr.vernacular.ai/v1/kws:longrunningrecognize' \
105 | --header 'X-ACCESS-TOKEN: {{access-token}}' \
106 | --header 'Content-Type: application/json' \
107 | --data-raw '{
108 | "config": {
109 | "encoding": "LINEAR16",
110 | "sampleRateHertz": 8000,
111 |         "languageCode": "en-IN"
112 | },
113 | "keywords": ["balance", "credit"],
114 | "audio": {
115 | "uri": "https://audio-url.wav"
116 | }
117 | }'
118 | ```
119 | Response:
120 | ```json
121 | {
122 | "name": "3498b75d-4d81-4e96-a83d-129cd8c6b0f5",
123 | "done": false
124 | }
125 | ```
126 |
127 | If you then poll the operation:
128 |
129 | Request:
130 | ```bash
131 | curl -X GET 'https://asr.vernacular.ai/v1/kws_operations/3498b75d-4d81-4e96-a83d-129cd8c6b0f5' \
132 | --header 'X-ACCESS-TOKEN: {{access-token}}' \
133 | --header 'Content-Type: application/json'
134 | ```
135 |
136 | Response on completion:
137 | ```json
138 | {
139 | "name": "3498b75d-4d81-4e96-a83d-129cd8c6b0f5",
140 | "done": true,
141 | "results": [
142 | {
143 | "transcript": "i want to know about my credit card balance",
144 | "matched_words": [
145 | {
146 | "start_time": 3.34,
147 | "end_time": 3.70,
148 | "word": "credit"
149 | },
150 | {
151 | "start_time": 3.84,
152 | "end_time": 4.23,
153 | "word": "balance"
154 | }
155 | ]
156 | }
157 | ]
158 | }
159 | ```
--------------------------------------------------------------------------------
/docs/kws/RecognitionConfig.md:
--------------------------------------------------------------------------------
1 | # RecognitionConfig
2 | Provides information to the recognizer that specifies how to process the request.
3 |
4 | ```js
5 | {
6 | "encoding": enum (AudioEncoding),
7 | "sampleRateHertz": integer,
8 | "audioChannelCount": integer,
9 | "enableSeparateRecognitionPerChannel": boolean,
10 | "languageCode": string,
11 | }
12 | ```
13 |
14 | | Field | Description |
15 | |---|---|
16 | | encoding | enum ([AudioEncoding](#audioencoding))
Encoding of audio data sent in all RecognitionAudio messages. This field is optional for FLAC and WAV audio files and required for all other audio formats. For details, see AudioEncoding.|
17 | | sampleRateHertz | integer
Sample rate in Hertz of the audio data sent in all RecognitionAudio messages. For now we only support 8000Hz. In case your audio is of any other sampling rate, consider resampling to 8000Hz. |
18 | | audioChannelCount | integer
The number of channels in the input audio data. ONLY set this for MULTI-CHANNEL recognition. Valid values for LINEAR16 and FLAC are 1-8. If 0 or omitted, defaults to one channel (mono).
**Note**: We only recognize the first channel by default. To perform independent recognition on each channel set enableSeparateRecognitionPerChannel to 'true'. |
19 | | enableSeparateRecognitionPerChannel | boolean
This needs to be set to true explicitly and audioChannelCount > 1 to get each channel recognized separately. The recognition result will contain a channelTag field to state which channel that result belongs to. If this is not true, we will only recognize the first channel. The request is billed cumulatively for all channels recognized: audioChannelCount multiplied by the length of the audio. |
20 | | languageCode | string
Required. The language of the supplied audio as a BCP-47 language tag. Example: "en-IN". See [Language Support](#languagesupport) for a list of the currently supported language codes. |
21 |
22 |
23 | ## AudioEncoding
24 | The encoding of the audio data sent in the request.
25 |
26 | All encodings support only 1 channel (mono) audio, unless the audioChannelCount and enableSeparateRecognitionPerChannel fields are set.
27 |
28 | For best results, the audio source should be captured and transmitted using a lossless encoding (FLAC or LINEAR16). The accuracy of the speech recognition can be reduced if lossy codecs are used to capture or transmit audio, particularly if background noise is present. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, and MP3.
29 |
30 |
31 | | Format | Description |
32 | |--|--|
33 | |LINEAR16| Uncompressed 16-bit signed little-endian samples (Linear PCM).|
34 |
35 |
36 | ## LanguageSupport
37 | Vernacular ASR only supports Indian languages for now. Use these language codes for the following languages.
38 | |Language| Code |
39 | |--|--|
40 | |Hindi | hi-IN |
41 | |English | en-IN |
42 | |Kannada | kn-IN |
43 | |Malayalam| ml-IN|
44 | |Bengali | bn-IN|
45 | |Marathi | mr-IN |
46 | |Gujarati | gu-IN |
47 | |Punjabi | pa-IN |
48 | |Telugu | te-IN|
49 | |Tamil | ta-IN|
50 |
--------------------------------------------------------------------------------
/docs/kws/Recognize.md:
--------------------------------------------------------------------------------
1 | # Recognize
2 | Performs synchronous speech recognition, i.e. results are received after all audio has been sent and processed.
3 |
4 | ### Request Method
5 | `POST https://asr.vernacular.ai/v1/kws:recognize`
6 |
7 | ### Request Headers
8 | ```
9 | X-ACCESS-TOKEN: some-access-token
10 | content-type: application/json
11 | ```
12 |
13 | ### Request Body
14 | The request body contains data with the following structure:
15 |
16 | ```js
17 | {
18 | "config": {
19 | object (RecognitionConfig)
20 | },
21 | "audio": {
22 | object (RecognitionAudio)
23 | },
24 | "keywords": [string],
25 | }
26 | ```
27 |
28 | |Fields|Description|
29 | |--|--|
30 | |config|object ([RecognitionConfig](./RecognitionConfig.md))
Required. Provides information to the recognizer that specifies how to process the request.|
31 | |audio|object ([RecognitionAudio](../types/RecognitionAudio.md))
Required. The audio data to be recognized.|
32 | |keywords| Array of strings. The keywords to search for in the audio. |
33 |
34 | ### Response Body
35 | If successful, the response body contains data with the following structure:
36 |
37 | The only message returned to the client by the recognize method.
38 |
39 | ```js
40 | {
41 | "results": [
42 | {
43 | "transcript": string,
44 | "matched_words": [
45 | {
46 | "start_time": float,
47 | "end_time": float,
48 | "word": string
49 | }
50 | ]
51 | }
52 | ]
53 | }
54 | ```
55 |
56 | ### Sample Request and Response
57 | Request:
58 | ```bash
59 | curl -X POST 'https://asr.vernacular.ai/v1/kws:recognize' \
60 | --header 'X-ACCESS-TOKEN: {{access-token}}' \
61 | --header 'Content-Type: application/json' \
62 | --data-raw '{
63 | "config": {
64 | "encoding": "LINEAR16",
65 | "sampleRateHertz": 8000,
66 |         "languageCode": "en-IN"
67 | },
68 | "keywords": ["balance", "credit"],
69 | "audio": {
70 | "uri": "https://audio-url.wav"
71 | }
72 | }'
73 | ```
74 | Response:
75 | ```json
76 | {
77 | "results": [
78 | {
79 | "transcript": "i want to know about my credit card balance",
80 | "matched_words": [
81 | {
82 | "start_time": 3.34,
83 | "end_time": 3.70,
84 | "word": "credit"
85 | },
86 | {
87 | "start_time": 3.84,
88 | "end_time": 4.23,
89 | "word": "balance"
90 | }
91 | ]
92 | }
93 | ]
94 | }
95 | ```
96 |
--------------------------------------------------------------------------------
/docs/long-audios.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/docs/long-audios.md
--------------------------------------------------------------------------------
/docs/rpc_reference/LongRunningRecognize.md:
--------------------------------------------------------------------------------
1 | # LongRunningRecognize
2 |
3 | `rpc LongRunningRecognize`([LongRunningRecognizeRequest](#longrunningrecognizerequest)) `returns `([SpeechOperation](#speechoperation))
4 |
5 | Performs asynchronous speech recognition: receive results via SpeechOperation. Returns either a SpeechOperation.error or a SpeechOperation.response, which contains a [LongRunningRecognizeResponse](#longrunningrecognizeresponse) message.
6 |
7 | To get the latest state of the speech operation, you can poll for the result using the [GetSpeechOperation](#getspeechoperation) RPC.
8 |
9 |
10 | ## GetSpeechOperation
11 |
12 | `rpc GetSpeechOperation`([SpeechOperationRequest](#speechoperationrequest))` returns `([SpeechOperation](#speechoperation))
13 |
14 | Gets the latest state of a long-running operation. Clients can use this method to poll the operation result at regular intervals.
15 |
16 | #### SpeechOperationRequest
17 | |Fields|Description|
18 | |--|--|
19 | |name| string
The name of the SpeechOperation i.e [SpeechOperation.name](#speechoperation)|
20 |
21 |
22 | ## LongRunningRecognizeRequest
23 | The top-level message sent by the client for the LongRunningRecognize method.
24 |
25 | |Fields|Description|
26 | |--|--|
27 | |config | [RecognitionConfig](../types/RecognitionConfig.md)
Required. Provides information to the recognizer that specifies how to process the request.|
28 | |audio | [RecognitionAudio](../types/RecognitionAudio.md)
Required. The audio data to be recognized.|
29 | |result_url | string
Optional. The results are posted to this URL when done. The URL must be reachable from our servers.|
30 |
31 | ## SpeechOperation
32 | An intermediate operation object which contains the response upon completion.
33 |
34 | |Fields| Description |
35 | |--|--|
36 | |name| string
The server-assigned name, which is only unique within the same service that originally returns it.|
37 | |done| bool
If the value is false, it means the operation is still in progress. If true, the operation is completed, and either error or response is available.|
38 | |result| Union field
The operation result, which can be either an error or a valid response. If done == false, neither error nor response is set. If done == true, exactly one of error or response is set. See below for more details|
39 |
40 | `result` can be only one of the following:
41 | | Field| |
42 | |--|--|
43 | |error| google.rpc.Status
The error result of the operation in case of failure or cancellation.|
44 | |response| [LongRunningRecognizeResponse](#longrunningrecognizeresponse)
The normal response of the operation in case of success.|
45 |
46 | ## LongRunningRecognizeResponse
47 | The only message returned to the client by the LongRunningRecognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.
48 |
49 | |Fields | Description|
50 | |--|--|
51 | |results[] | [SpeechRecognitionResult](../types/SpeechRecognitionResult.md)
Sequential list of transcription results corresponding to sequential portions of audio.|
52 |
--------------------------------------------------------------------------------
/docs/rpc_reference/Recognize.md:
--------------------------------------------------------------------------------
1 | # Recognize
2 |
3 | `rpc Recognize`([RecognizeRequest](#recognizerequest)) `returns `([RecognizeResponse](#recognizeresponse))
4 |
5 | Performs synchronous speech recognition: receive results after all audio has been sent and processed.
6 |
7 | **Note**: Audio longer than 60 seconds does not work with synchronous Recognize. Use [LongRunningRecognize](LongRunningRecognize.md) for long audios.
8 |
9 | ## RecognizeRequest
10 | The top-level message sent by the client for the Recognize method.
11 |
12 | |Fields|Description|
13 | |--|--|
14 | |config | [RecognitionConfig](../types/RecognitionConfig.md)
Required. Provides information to the recognizer that specifies how to process the request.|
15 | |audio | [RecognitionAudio](../types/RecognitionAudio.md)
Required. The audio data to be recognized.|
16 |
17 | ## RecognizeResponse
18 | The only message returned to the client by the Recognize method. It contains the result as zero or more sequential SpeechRecognitionResult messages.
19 |
20 | |Fields | Description|
21 | |--|--|
22 | |results[] | [SpeechRecognitionResult](../types/SpeechRecognitionResult.md)
Sequential list of transcription results corresponding to sequential portions of audio.|
23 |
--------------------------------------------------------------------------------
/docs/rpc_reference/StreamingRecognize.md:
--------------------------------------------------------------------------------
1 | # StreamingRecognize
2 |
3 | `rpc StreamingRecognize`([StreamingRecognizeRequest](#streamingrecognizerequest))` returns `([StreamingRecognizeResponse](#streamingrecognizeresponse))
4 |
5 | Performs bidirectional streaming speech recognition: receive results while sending audio. This method is only available via the gRPC API (not REST API).
6 |
7 | ## StreamingRecognizeRequest
8 | The top-level message sent by the client for the StreamingRecognize method. Multiple StreamingRecognizeRequest messages are sent. The first message must contain a `streaming_config` message and must not contain `audio_content`. All subsequent messages must contain audio_content and must not contain a streaming_config message.
9 |
10 | Union field streaming_request. The streaming request, which is either a streaming config or audio content.
11 | `streaming_request` can be only one of the following:
12 |
13 | |Fields|Description|
14 | |--|--|
15 | |streaming_config | [StreamingRecognitionConfig](#streamingrecognitionconfig)
Provides information to the recognizer that specifies how to process the request. The first StreamingRecognizeRequest message must contain a streaming_config message.|
16 | |audio_content | bytes
The audio data to be recognized. Sequential chunks of audio data are sent in sequential StreamingRecognizeRequest messages. The first StreamingRecognizeRequest message must not contain audio_content data and all subsequent StreamingRecognizeRequest messages must contain audio_content data. The audio bytes must be encoded as specified in [RecognitionConfig](../types/RecognitionConfig.md). Note: as with all bytes fields, proto buffers use a pure binary representation (not base64).|
17 |
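As a rough Python sketch of this ordering (config first, then audio chunks), using stubs generated from `speech-to-text.proto`: the module paths, the `SpeechToTextStub` name, the endpoint, the authentication metadata, and the exact field/enum spellings below are assumptions — check the generated `_pb2`/`_pb2_grpc` modules in the Python SDK for the actual names.

```python
import grpc

# Assumed module and stub names; the real ones come from the code
# generated out of speech-to-text.proto in the Python SDK.
from vernacular.ai.speech.proto import speech_to_text_pb2 as pb2
from vernacular.ai.speech.proto import speech_to_text_pb2_grpc as pb2_grpc


def requests_iter(audio_path, chunk_size=4096):
    # First message: streaming_config only, no audio_content.
    yield pb2.StreamingRecognizeRequest(
        streaming_config=pb2.StreamingRecognitionConfig(
            config=pb2.RecognitionConfig(
                encoding=pb2.RecognitionConfig.LINEAR16,
                sample_rate_hertz=8000,
                language_code="en-IN",
            )
        )
    )
    # Subsequent messages: raw (non-base64) audio bytes in audio_content.
    with open(audio_path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield pb2.StreamingRecognizeRequest(audio_content=chunk)


channel = grpc.secure_channel("asr.vernacular.ai:443", grpc.ssl_channel_credentials())
stub = pb2_grpc.SpeechToTextStub(channel)  # assumed service/stub name
responses = stub.StreamingRecognize(
    requests_iter("audio.wav"),  # placeholder audio path
    metadata=[("x-access-token", "some-access-token")],  # assumed auth metadata
)
for response in responses:
    for result in response.results:
        print(result.is_final, result.alternatives[0].transcript)
```
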
18 | ## StreamingRecognizeResponse
19 | StreamingRecognizeResponse is the only message returned to the client by StreamingRecognize. A series of zero or more StreamingRecognizeResponse messages are streamed back to the client. If there is no recognizable audio, and single_utterance is set to false, then no messages are streamed back to the client.
20 |
21 | Here's an example of a series of seven StreamingRecognizeResponses that might be returned while processing audio:
22 |
23 | ```
24 | results { alternatives { transcript: "tube" } stability: 0.01 }
25 |
26 | results { alternatives { transcript: "to be a" } stability: 0.01 }
27 |
28 | results { alternatives { transcript: "to be" } stability: 0.9 } results { alternatives { transcript: " or not to be" } stability: 0.01 }
29 |
30 | results { alternatives { transcript: "to be or not to be" confidence: 0.92 } alternatives { transcript: "to bee or not to bee" } is_final: true }
31 |
32 | results { alternatives { transcript: " that's" } stability: 0.01 }
33 |
34 | results { alternatives { transcript: " that is" } stability: 0.9 } results { alternatives { transcript: " the question" } stability: 0.01 }
35 |
36 | results { alternatives { transcript: " that is the question" confidence: 0.98 } alternatives { transcript: " that was the question" } is_final: true }
37 | ```
38 |
39 | Notes:
40 |
41 | Only two of the above responses, #4 and #7, contain final results; they are indicated by is_final: true. Concatenating these together generates the full transcript: "to be or not to be that is the question".
42 |
43 | The others contain interim results. #3 and #6 contain two interim results: the first portion has a high stability and is less likely to change; the second portion has a low stability and is very likely to change. A UI designer might choose to show only high stability results.
44 |
45 | The specific stability and confidence values shown above are only for illustrative purposes. Actual values may vary.
46 |
47 | In each response, only one of these fields will be set: error or one or more (repeated) results.
48 |
49 | |Fields| Description|
50 | |--|--|
51 | |error| Status
If set, returns a google.rpc.Status message that specifies the error for the operation.|
52 | |results[]| [StreamingRecognitionResult](#streamingrecognitionresult)
This repeated list contains zero or more results that correspond to consecutive portions of the audio currently being processed. It contains zero or one is_final=true result (the newly settled portion), followed by zero or more is_final=false results (the interim results).|
53 |
54 | ## StreamingRecognitionConfig
55 | Provides information to the recognizer that specifies how to process the request.
56 |
57 | |Fields|Description|
58 | |--|--|
59 | |config | [RecognitionConfig](../types/RecognitionConfig.md)
Required. Provides information to the recognizer that specifies how to process the request.|
60 | |interim_results| bool
If true, interim results (tentative hypotheses) may be returned as they become available (these interim results are indicated with the is_final=false flag). If false or omitted, only is_final=true result(s) are returned|
61 | |silence_detection_config| [SilenceDetectionConfig](#silencedetectionconfig)
Optional. Add silence detection config for enabling silence detection.|
62 |
63 | Note: For now `interim_results` will not work. You will only get a final response.
64 |
65 | ## SilenceDetectionConfig
66 | |Fields|Description|
67 | |--|--|
68 | |enable_silence_detection|bool
If true, enables silence detection on the server side.|
69 | |max_speech_timeout|float
Maximum number of seconds for which recognition should go on. For example, with a value of 5, streaming will end after 5 seconds regardless of whether the person is speaking. Set it to -1 to disable this.|
70 | |silence_patience| float
Wait for this many seconds of silence after voice activity has been detected before firing the silence-detected event. Usually 1.5 to 2 is a good value.|
71 | |no_input_timeout|float
Wait for this many seconds before firing the silence-detected event if no voice activity is detected at all. For example, if set to 5 seconds, the detector will wait 5 seconds for any voice activity and then end the stream. This prevents an endless stream when there is no voice activity. Usually 3-5 seconds is a good range.|
72 |
73 | ## StreamingRecognitionResult
74 | A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.
75 |
76 | |Fields|Description|
77 | |--|--|
78 | |alternatives[] | [SpeechRecognitionAlternative](../types/SpeechRecognitionResult.md#speechrecognitionalternative)
May contain one or more recognition hypotheses (up to the maximum specified in max_alternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.|
79 | |is_final | bool
If false, this StreamingRecognitionResult represents an interim result that may change. If true, this is the final time the speech service will return this particular StreamingRecognitionResult, the recognizer will not return any further hypotheses for this portion of the transcript and corresponding audio.|
80 | |stability | float
An estimate of the likelihood that the recognizer will not change its guess about this interim result. Values range from 0.0 (completely unstable) to 1.0 (completely stable). This field is only provided for interim results (is_final=false). The default of 0.0 is a sentinel value indicating stability was not set.|
81 | |result_end_time | Duration
Time offset of the end of this result relative to the beginning of the audio.|
82 | |channel_tag | int32
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from '1' to 'N'.|
83 |
--------------------------------------------------------------------------------
/docs/short-audios.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/docs/short-audios.md
--------------------------------------------------------------------------------
/docs/streaming-audios.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/docs/streaming-audios.md
--------------------------------------------------------------------------------
/docs/types/RecognitionAudio.md:
--------------------------------------------------------------------------------
1 | # RecognitionAudio
2 | Contains audio data in the encoding specified in the RecognitionConfig. Either content or uri must be supplied. Supplying both or neither returns an error.
3 |
4 | ```js
5 | {
6 |
7 | // Union field audio_source can be only one of the following:
8 | "content": string,
9 | "uri": string
10 | }
11 | ```
12 |
13 | | Field | Description |
14 | |---|---|
15 | | content | string (bytes format)
The audio data bytes encoded as specified in RecognitionConfig. Note: as with all bytes fields, proto buffers use a pure binary representation, whereas JSON representations use [base64](https://en.wikipedia.org/wiki/Base64). |
16 | | uri | string
URI that points to a file that contains audio data bytes as specified in RecognitionConfig. The file must not be compressed (for example, gzip). The URL must be publicly accessible. |
17 |
18 |
19 | ## Encoding to base64
20 | Use the base64 command to convert:
21 | ```shell
22 | base64 -w 0 source_audio_file > dest_audio_file
23 | ```
24 |
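The same conversion from Python, as a small sketch using only the standard library (file names are placeholders):

```python
import base64

# Base64-encode an audio file so it can be placed in the "content" field.
with open("source_audio_file", "rb") as src, open("dest_audio_file", "w") as dst:
    dst.write(base64.b64encode(src.read()).decode("utf-8"))
```
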
25 | ### Content Limits
26 |
27 | The API contains the following limits on the size of audio in content field:
28 |
29 | | Content Limit |Audio Length | Audio Size |
30 | |---|---|---|
31 | |Synchronous Requests |~1 Minute | max 4 MB |
32 | |Streaming Requests | ~5 Minutes | - |
33 | |Asynchronous Requests| ~480 Minutes | - |
34 |
--------------------------------------------------------------------------------
/docs/types/RecognitionConfig.md:
--------------------------------------------------------------------------------
1 | # RecognitionConfig
2 | Provides information to the recognizer that specifies how to process the request.
3 |
4 | ```js
5 | {
6 | "encoding": enum (AudioEncoding),
7 | "sampleRateHertz": integer,
8 | "audioChannelCount": integer,
9 | "enableSeparateRecognitionPerChannel": boolean,
10 | "languageCode": string,
11 | "maxAlternatives": integer,
12 | "speechContexts": [
13 | {
14 | object (SpeechContext)
15 | }
16 | ],
17 | "enableWordTimeOffsets": boolean,
18 | "diarizationConfig": {
19 | object (SpeakerDiarizationConfig)
20 | },
21 | }
22 | ```
23 |
24 | | Field | Description |
25 | |---|---|
26 | | encoding | enum ([AudioEncoding](#audioencoding))
Encoding of audio data sent in all RecognitionAudio messages. This field is optional for FLAC and WAV audio files and required for all other audio formats. For details, see AudioEncoding.|
27 | | sampleRateHertz | integer
Sample rate in Hertz of the audio data sent in all RecognitionAudio messages. For now we only support 8000Hz. In case your audio is of any other sampling rate, consider resampling to 8000Hz. |
28 | | audioChannelCount | integer
The number of channels in the input audio data. ONLY set this for MULTI-CHANNEL recognition. Valid values for LINEAR16 and FLAC are 1-8. If 0 or omitted, defaults to one channel (mono).
**Note**: We only recognize the first channel by default. To perform independent recognition on each channel set enableSeparateRecognitionPerChannel to 'true'. |
29 | | enableSeparateRecognitionPerChannel | boolean
This needs to be set to true explicitly and audioChannelCount > 1 to get each channel recognized separately. The recognition result will contain a channelTag field to state which channel that result belongs to. If this is not true, we will only recognize the first channel. The request is billed cumulatively for all channels recognized: audioChannelCount multiplied by the length of the audio. |
30 | | languageCode | string
Required. The language of the supplied audio as a BCP-47 language tag. Example: "en-IN". See [Language Support](#languagesupport) for a list of the currently supported language codes. |
31 | | maxAlternatives | integer
Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of SpeechRecognitionAlternative messages within each SpeechRecognitionResult. The server may return fewer than maxAlternatives. Valid values are 0-10. A value of 0 or 1 will return a maximum of one. If omitted, will return a maximum of one. |
32 | |speechContexts[] | object ([SpeechContext](#speechcontext))
Array of SpeechContext. This feature is experimental and may not work as of now. We do support biasing of models with customer specific terminology so this may not be needed. |
33 | | diarizationConfig | object ([SpeakerDiarizationConfig](#speakerdiarizationconfig))
Config to enable speaker diarization and set additional parameters to make diarization better suited for your application. This feature is experimental and may not work for now.|
34 | | enableWordTimeOffsets | boolean
If true, the top result includes a list of words and the start and end time offsets (timestamps) for those words. If false, no word-level time offset information is returned. The default is false. |
35 |
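For instance, a config for a two-channel (stereo) call recording where each channel should be recognized separately and word timestamps are wanted might look like this sketch (values are illustrative, not recommendations):

```python
# Illustrative RecognitionConfig payload for a stereo recording:
# both channels are recognized separately and word time offsets are returned.
config = {
    "encoding": "LINEAR16",
    "sampleRateHertz": 8000,
    "audioChannelCount": 2,
    "enableSeparateRecognitionPerChannel": True,
    "languageCode": "hi-IN",
    "maxAlternatives": 1,
    "enableWordTimeOffsets": True,
}
```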
36 |
37 | ## AudioEncoding
38 | The encoding of the audio data sent in the request.
39 |
40 | All encodings support only 1 channel (mono) audio, unless the audioChannelCount and enableSeparateRecognitionPerChannel fields are set.
41 |
42 | We support wav \[LINEAR16\] and mp3 \[MP3\] right now.
43 |
44 | For best results, the audio source should be captured and transmitted using a lossless encoding (LINEAR16). The accuracy of the speech recognition can be reduced if lossy codecs are used to capture or transmit audio, particularly if background noise is present. Lossy codecs include MULAW, AMR, AMR_WB, OGG_OPUS, SPEEX_WITH_HEADER_BYTE, and MP3.
45 |
46 |
47 | #### Enums
48 | | Format | Description |
49 | |--|--|
50 | |LINEAR16| Uncompressed 16-bit signed little-endian samples (Linear PCM).|
51 | |MP3| Compressed mp3 encoded stream|
52 |
53 |
54 | ## LanguageSupport
55 | Vernacular ASR only supports Indian languages for now. Use these language codes for the following languages.
56 | |Language| Code |
57 | |--|--|
58 | |Hindi | hi-IN |
59 | |English | en-IN |
60 | |Kannada | kn-IN |
61 | |Malayalam| ml-IN|
62 | |Bengali | bn-IN|
63 | |Marathi | mr-IN |
64 | |Gujarati | gu-IN |
65 | |Punjabi | pa-IN |
66 | |Telugu | te-IN|
67 | |Tamil | ta-IN|
68 |
69 |
70 | ## SpeechContext
71 | Provides **hints** to the speech recognizer to favor specific words and phrases in the results.
72 |
73 | ```js
74 | {
75 | "phrases": [
76 | string
77 | ]
78 | }
79 | ```
80 |
81 | | Field | Description |
82 | |--|--|
83 | | phrases[] | string
A list of strings containing words and phrases "hints" so that the speech recognition is more likely to recognize them. This can be used to improve the accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. This can also be used to add additional words to the vocabulary of the recognizer. See usage limits. |
84 |
85 |
86 | ## SpeakerDiarizationConfig
87 | Config to enable speaker diarization.
88 |
89 | ```js
90 | {
91 | "enableSpeakerDiarization": boolean,
92 | "minSpeakerCount": integer,
93 | "maxSpeakerCount": integer,
94 | "speakerTag": integer
95 | }
96 | ```
97 |
98 | | Field | Description |
99 | |--|--|
100 | | enableSpeakerDiarization | boolean
If **true**, enables speaker detection for each recognized word in the top alternative of the recognition result. |
101 | | minSpeakerCount | integer
Minimum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 2. |
102 | | maxSpeakerCount | integer
Maximum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 6. |
103 |
--------------------------------------------------------------------------------
/docs/types/SpeechRecognitionResult.md:
--------------------------------------------------------------------------------
1 | # SpeechRecognitionResult
2 | A speech recognition result corresponding to a portion of the audio.
3 |
4 | ```js
5 | {
6 | "alternatives": [
7 | {
8 | object (SpeechRecognitionAlternative)
9 | }
10 | ],
11 | "channelTag": integer
12 | }
13 | ```
14 |
15 | |Fields | Description |
16 | |--|--|
17 | |alternatives[] | object (SpeechRecognitionAlternative)
May contain one or more recognition hypotheses (up to the maximum specified in maxAlternatives). These alternatives are ordered in terms of accuracy, with the top (first) alternative being the most probable, as ranked by the recognizer.|
18 | |channelTag | integer
For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audioChannelCount = N, its output values can range from '1' to 'N'.|
19 |
20 | # SpeechRecognitionAlternative
21 | Alternative hypotheses (a.k.a. n-best list).
22 |
23 | ```js
24 | {
25 | "transcript": string,
26 | "confidence": number,
27 | "words": [
28 | {
29 | object (WordInfo)
30 | }
31 | ]
32 | }
33 | ```
34 |
35 | |Fields | Description |
36 | |--|--|
37 | |transcript | string
Transcript text representing the words that the user spoke.|
38 | |confidence | number
The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result or, of a streaming result where isFinal=true. This field is not guaranteed to be accurate and users should not rely on it to be always provided. The default of 0.0 is a sentinel value indicating confidence was not set.|
39 | |words[] | object ([WordInfo](#wordinfo))
A list of word-specific information for each recognized word. Note: When enableSpeakerDiarization is true, you will see all the words from the beginning of the audio.|
40 |
41 | # WordInfo
42 | Word-specific information for recognized words.
43 |
44 | ```js
45 | {
46 | "startTime": string,
47 | "endTime": string,
48 | "word": string,
49 | "speakerTag": integer
50 | }
51 | ```
52 |
53 | |Fields| Description |
54 | |--|--|
55 | | startTime | string (Duration format)
Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word. This field is only set if enableWordTimeOffsets=true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.|
56 | |endTime| string (Duration format)
Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word. This field is only set if enableWordTimeOffsets=true and only in the top hypothesis. This is an experimental feature and the accuracy of the time offset can vary.|
57 | | word | string
The word corresponding to this set of information.|
58 | |speakerTag | integer
Output only. A distinct integer value is assigned for every speaker within the audio. This field specifies which one of those speakers was detected to have spoken this word. Value ranges from '1' to diarizationSpeakerCount. speakerTag is set if enableSpeakerDiarization = 'true' and only in the top alternative.|
59 |
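As a small sketch of how a client might walk this structure (here `response` is assumed to be the parsed JSON body of a Recognize response):

```python
# Print the top alternative for each result, plus word timings when
# enableWordTimeOffsets was set on the request.
for result in response.get("results", []):
    top = result["alternatives"][0]
    channel = result.get("channelTag", 1)
    print(f"channel {channel}: {top['transcript']} (confidence {top.get('confidence', 0.0):.2f})")
    for word in top.get("words", []):
        print(f"  {word['word']}: {word['startTime']} -> {word['endTime']}")
```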
--------------------------------------------------------------------------------
/go/README.md:
--------------------------------------------------------------------------------
1 | # Go Speech to Text SDK
2 |
--------------------------------------------------------------------------------
/java/.gitignore:
--------------------------------------------------------------------------------
1 | **/.classpath
2 | **/.settings
3 | **/.project
4 |
5 | # Maven
6 | target/
7 |
8 | # gradle
9 | .gradle
10 |
11 | # Intellij
12 | *.iml
13 | .idea/
14 |
15 | # build
16 | **/build/
17 | **/bin/
18 |
--------------------------------------------------------------------------------
/java/README.md:
--------------------------------------------------------------------------------
1 | # Java Speech to Text SDK
2 | Java SDK for vernacular.ai speech to text APIs. Go [here](https://github.com/Vernacular-ai/speech-recognition) for detailed product documentation.
3 |
4 |
5 | ## Installation
6 | If you are using Maven, add this to your `pom.xml`:
7 |
8 | ```xml
9 | <dependency>
10 |   <groupId>ai.vernacular.speech</groupId>
11 |   <artifactId>vernacular-ai-speech</artifactId>
12 |   <version>0.1.0</version>
13 | </dependency>
14 | ```
15 |
16 | If you are using Gradle, add this to your dependencies:
17 |
18 | ```gradle
19 | compile 'ai.vernacular.speech:vernacular-ai-speech:0.1.0'
20 | ```
21 |
22 | ## Example Usage
23 | This example shows how to recognize speech from an audio URL. First, add these imports to the top of your Java file.
24 |
25 | ```java
26 | import ai.vernacular.speech.SpeechClient;
27 | import ai.vernacular.speech.RecognitionAudio;
28 | import ai.vernacular.speech.RecognitionConfig;
29 | import ai.vernacular.speech.RecognitionConfig.AudioEncoding;
30 | import ai.vernacular.speech.RecognizeResponse;
31 | ```
32 |
33 | Then add this code to recognize the audio.
34 |
35 | ```java
36 | try (SpeechClient speechClient = new SpeechClient(accessToken)) {
37 | RecognitionConfig.AudioEncoding encoding = RecognitionConfig.AudioEncoding.LINEAR16;
38 | int sampleRateHertz = 8000;
39 | String languageCode = "en-IN";
40 | RecognitionConfig config = RecognitionConfig.newBuilder()
41 | .setEncoding(encoding)
42 | .setSampleRateHertz(sampleRateHertz)
43 | .setLanguageCode(languageCode)
44 | .build();
45 | String uri = "https://url/to/audio.wav";
46 | RecognitionAudio audio = RecognitionAudio.newBuilder()
47 | .setUri(uri)
48 | .build();
49 | RecognizeResponse response = speechClient.recognize(config, audio);
50 | }
51 | ```
52 |
53 | To see more examples, go to [samples](https://github.com/Vernacular-ai/speech-recognition/tree/master/java/samples).
54 |
55 | To run a sample:
56 |
57 | ```shell
58 | ./gradlew :samples:run -Pexample=RecognizeSync
59 | ```
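60 | 
61 | Once you have a `RecognizeResponse`, read transcripts from its results. A minimal sketch, mirroring `samples/src/main/java/ai/vernacular/examples/speech/RecognizeSync.java` (add imports for `SpeechRecognitionResult` and `SpeechRecognitionAlternative` alongside the ones above):
62 | 
63 | ```java
64 | for (SpeechRecognitionResult result : response.getResultsList()) {
65 |     // First alternative is the most probable result
66 |     SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
67 |     System.out.printf("Transcript: %s\n", alternative.getTranscript());
68 | }
69 | ```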
--------------------------------------------------------------------------------
/java/build.gradle:
--------------------------------------------------------------------------------
1 | buildscript {
2 | repositories {
3 | mavenCentral()
4 | }
5 | dependencies {
6 | classpath 'com.google.protobuf:protobuf-gradle-plugin:0.8.11'
7 | }
8 | }
9 |
10 | allprojects {
11 | repositories {
12 | mavenCentral()
13 | }
14 |
15 | apply plugin: 'java'
16 | apply plugin: 'com.google.protobuf'
17 |
18 | version '0.1.0'
19 | }
20 |
--------------------------------------------------------------------------------
/java/gradle/wrapper/gradle-wrapper.jar:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/java/gradle/wrapper/gradle-wrapper.jar
--------------------------------------------------------------------------------
/java/gradle/wrapper/gradle-wrapper.properties:
--------------------------------------------------------------------------------
1 | distributionBase=GRADLE_USER_HOME
2 | distributionPath=wrapper/dists
3 | distributionUrl=https\://services.gradle.org/distributions/gradle-6.2.2-bin.zip
4 | zipStoreBase=GRADLE_USER_HOME
5 | zipStorePath=wrapper/dists
6 |
--------------------------------------------------------------------------------
/java/gradlew:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env sh
2 |
3 | #
4 | # Copyright 2015 the original author or authors.
5 | #
6 | # Licensed under the Apache License, Version 2.0 (the "License");
7 | # you may not use this file except in compliance with the License.
8 | # You may obtain a copy of the License at
9 | #
10 | # https://www.apache.org/licenses/LICENSE-2.0
11 | #
12 | # Unless required by applicable law or agreed to in writing, software
13 | # distributed under the License is distributed on an "AS IS" BASIS,
14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 | # See the License for the specific language governing permissions and
16 | # limitations under the License.
17 | #
18 |
19 | ##############################################################################
20 | ##
21 | ## Gradle start up script for UN*X
22 | ##
23 | ##############################################################################
24 |
25 | # Attempt to set APP_HOME
26 | # Resolve links: $0 may be a link
27 | PRG="$0"
28 | # Need this for relative symlinks.
29 | while [ -h "$PRG" ] ; do
30 | ls=`ls -ld "$PRG"`
31 | link=`expr "$ls" : '.*-> \(.*\)$'`
32 | if expr "$link" : '/.*' > /dev/null; then
33 | PRG="$link"
34 | else
35 | PRG=`dirname "$PRG"`"/$link"
36 | fi
37 | done
38 | SAVED="`pwd`"
39 | cd "`dirname \"$PRG\"`/" >/dev/null
40 | APP_HOME="`pwd -P`"
41 | cd "$SAVED" >/dev/null
42 |
43 | APP_NAME="Gradle"
44 | APP_BASE_NAME=`basename "$0"`
45 |
46 | # Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
47 | DEFAULT_JVM_OPTS='"-Xmx64m" "-Xms64m"'
48 |
49 | # Use the maximum available, or set MAX_FD != -1 to use that value.
50 | MAX_FD="maximum"
51 |
52 | warn () {
53 | echo "$*"
54 | }
55 |
56 | die () {
57 | echo
58 | echo "$*"
59 | echo
60 | exit 1
61 | }
62 |
63 | # OS specific support (must be 'true' or 'false').
64 | cygwin=false
65 | msys=false
66 | darwin=false
67 | nonstop=false
68 | case "`uname`" in
69 | CYGWIN* )
70 | cygwin=true
71 | ;;
72 | Darwin* )
73 | darwin=true
74 | ;;
75 | MINGW* )
76 | msys=true
77 | ;;
78 | NONSTOP* )
79 | nonstop=true
80 | ;;
81 | esac
82 |
83 | CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar
84 |
85 | # Determine the Java command to use to start the JVM.
86 | if [ -n "$JAVA_HOME" ] ; then
87 | if [ -x "$JAVA_HOME/jre/sh/java" ] ; then
88 | # IBM's JDK on AIX uses strange locations for the executables
89 | JAVACMD="$JAVA_HOME/jre/sh/java"
90 | else
91 | JAVACMD="$JAVA_HOME/bin/java"
92 | fi
93 | if [ ! -x "$JAVACMD" ] ; then
94 | die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME
95 |
96 | Please set the JAVA_HOME variable in your environment to match the
97 | location of your Java installation."
98 | fi
99 | else
100 | JAVACMD="java"
101 | which java >/dev/null 2>&1 || die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
102 |
103 | Please set the JAVA_HOME variable in your environment to match the
104 | location of your Java installation."
105 | fi
106 |
107 | # Increase the maximum file descriptors if we can.
108 | if [ "$cygwin" = "false" -a "$darwin" = "false" -a "$nonstop" = "false" ] ; then
109 | MAX_FD_LIMIT=`ulimit -H -n`
110 | if [ $? -eq 0 ] ; then
111 | if [ "$MAX_FD" = "maximum" -o "$MAX_FD" = "max" ] ; then
112 | MAX_FD="$MAX_FD_LIMIT"
113 | fi
114 | ulimit -n $MAX_FD
115 | if [ $? -ne 0 ] ; then
116 | warn "Could not set maximum file descriptor limit: $MAX_FD"
117 | fi
118 | else
119 | warn "Could not query maximum file descriptor limit: $MAX_FD_LIMIT"
120 | fi
121 | fi
122 |
123 | # For Darwin, add options to specify how the application appears in the dock
124 | if $darwin; then
125 | GRADLE_OPTS="$GRADLE_OPTS \"-Xdock:name=$APP_NAME\" \"-Xdock:icon=$APP_HOME/media/gradle.icns\""
126 | fi
127 |
128 | # For Cygwin or MSYS, switch paths to Windows format before running java
129 | if [ "$cygwin" = "true" -o "$msys" = "true" ] ; then
130 | APP_HOME=`cygpath --path --mixed "$APP_HOME"`
131 | CLASSPATH=`cygpath --path --mixed "$CLASSPATH"`
132 | JAVACMD=`cygpath --unix "$JAVACMD"`
133 |
134 | # We build the pattern for arguments to be converted via cygpath
135 | ROOTDIRSRAW=`find -L / -maxdepth 1 -mindepth 1 -type d 2>/dev/null`
136 | SEP=""
137 | for dir in $ROOTDIRSRAW ; do
138 | ROOTDIRS="$ROOTDIRS$SEP$dir"
139 | SEP="|"
140 | done
141 | OURCYGPATTERN="(^($ROOTDIRS))"
142 | # Add a user-defined pattern to the cygpath arguments
143 | if [ "$GRADLE_CYGPATTERN" != "" ] ; then
144 | OURCYGPATTERN="$OURCYGPATTERN|($GRADLE_CYGPATTERN)"
145 | fi
146 | # Now convert the arguments - kludge to limit ourselves to /bin/sh
147 | i=0
148 | for arg in "$@" ; do
149 | CHECK=`echo "$arg"|egrep -c "$OURCYGPATTERN" -`
150 | CHECK2=`echo "$arg"|egrep -c "^-"` ### Determine if an option
151 |
152 | if [ $CHECK -ne 0 ] && [ $CHECK2 -eq 0 ] ; then ### Added a condition
153 | eval `echo args$i`=`cygpath --path --ignore --mixed "$arg"`
154 | else
155 | eval `echo args$i`="\"$arg\""
156 | fi
157 | i=`expr $i + 1`
158 | done
159 | case $i in
160 | 0) set -- ;;
161 | 1) set -- "$args0" ;;
162 | 2) set -- "$args0" "$args1" ;;
163 | 3) set -- "$args0" "$args1" "$args2" ;;
164 | 4) set -- "$args0" "$args1" "$args2" "$args3" ;;
165 | 5) set -- "$args0" "$args1" "$args2" "$args3" "$args4" ;;
166 | 6) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" ;;
167 | 7) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" ;;
168 | 8) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" ;;
169 | 9) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" "$args8" ;;
170 | esac
171 | fi
172 |
173 | # Escape application args
174 | save () {
175 | for i do printf %s\\n "$i" | sed "s/'/'\\\\''/g;1s/^/'/;\$s/\$/' \\\\/" ; done
176 | echo " "
177 | }
178 | APP_ARGS=`save "$@"`
179 |
180 | # Collect all arguments for the java command, following the shell quoting and substitution rules
181 | eval set -- $DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS "\"-Dorg.gradle.appname=$APP_BASE_NAME\"" -classpath "\"$CLASSPATH\"" org.gradle.wrapper.GradleWrapperMain "$APP_ARGS"
182 |
183 | exec "$JAVACMD" "$@"
184 |
--------------------------------------------------------------------------------
/java/gradlew.bat:
--------------------------------------------------------------------------------
1 | @rem
2 | @rem Copyright 2015 the original author or authors.
3 | @rem
4 | @rem Licensed under the Apache License, Version 2.0 (the "License");
5 | @rem you may not use this file except in compliance with the License.
6 | @rem You may obtain a copy of the License at
7 | @rem
8 | @rem https://www.apache.org/licenses/LICENSE-2.0
9 | @rem
10 | @rem Unless required by applicable law or agreed to in writing, software
11 | @rem distributed under the License is distributed on an "AS IS" BASIS,
12 | @rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | @rem See the License for the specific language governing permissions and
14 | @rem limitations under the License.
15 | @rem
16 |
17 | @if "%DEBUG%" == "" @echo off
18 | @rem ##########################################################################
19 | @rem
20 | @rem Gradle startup script for Windows
21 | @rem
22 | @rem ##########################################################################
23 |
24 | @rem Set local scope for the variables with windows NT shell
25 | if "%OS%"=="Windows_NT" setlocal
26 |
27 | set DIRNAME=%~dp0
28 | if "%DIRNAME%" == "" set DIRNAME=.
29 | set APP_BASE_NAME=%~n0
30 | set APP_HOME=%DIRNAME%
31 |
32 | @rem Resolve any "." and ".." in APP_HOME to make it shorter.
33 | for %%i in ("%APP_HOME%") do set APP_HOME=%%~fi
34 |
35 | @rem Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
36 | set DEFAULT_JVM_OPTS="-Xmx64m" "-Xms64m"
37 |
38 | @rem Find java.exe
39 | if defined JAVA_HOME goto findJavaFromJavaHome
40 |
41 | set JAVA_EXE=java.exe
42 | %JAVA_EXE% -version >NUL 2>&1
43 | if "%ERRORLEVEL%" == "0" goto init
44 |
45 | echo.
46 | echo ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
47 | echo.
48 | echo Please set the JAVA_HOME variable in your environment to match the
49 | echo location of your Java installation.
50 |
51 | goto fail
52 |
53 | :findJavaFromJavaHome
54 | set JAVA_HOME=%JAVA_HOME:"=%
55 | set JAVA_EXE=%JAVA_HOME%/bin/java.exe
56 |
57 | if exist "%JAVA_EXE%" goto init
58 |
59 | echo.
60 | echo ERROR: JAVA_HOME is set to an invalid directory: %JAVA_HOME%
61 | echo.
62 | echo Please set the JAVA_HOME variable in your environment to match the
63 | echo location of your Java installation.
64 |
65 | goto fail
66 |
67 | :init
68 | @rem Get command-line arguments, handling Windows variants
69 |
70 | if not "%OS%" == "Windows_NT" goto win9xME_args
71 |
72 | :win9xME_args
73 | @rem Slurp the command line arguments.
74 | set CMD_LINE_ARGS=
75 | set _SKIP=2
76 |
77 | :win9xME_args_slurp
78 | if "x%~1" == "x" goto execute
79 |
80 | set CMD_LINE_ARGS=%*
81 |
82 | :execute
83 | @rem Setup the command line
84 |
85 | set CLASSPATH=%APP_HOME%\gradle\wrapper\gradle-wrapper.jar
86 |
87 | @rem Execute Gradle
88 | "%JAVA_EXE%" %DEFAULT_JVM_OPTS% %JAVA_OPTS% %GRADLE_OPTS% "-Dorg.gradle.appname=%APP_BASE_NAME%" -classpath "%CLASSPATH%" org.gradle.wrapper.GradleWrapperMain %CMD_LINE_ARGS%
89 |
90 | :end
91 | @rem End local scope for the variables with windows NT shell
92 | if "%ERRORLEVEL%"=="0" goto mainEnd
93 |
94 | :fail
95 | rem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of
96 | rem the _cmd.exe /c_ return code!
97 | if not "" == "%GRADLE_EXIT_CONSOLE%" exit 1
98 | exit /b 1
99 |
100 | :mainEnd
101 | if "%OS%"=="Windows_NT" endlocal
102 |
103 | :omega
104 |
--------------------------------------------------------------------------------
/java/samples/build.gradle:
--------------------------------------------------------------------------------
1 | dependencies {
2 | implementation project(':vernacular-ai-speech')
3 | }
4 |
5 | task run(type: JavaExec) {
6 | classpath = sourceSets.main.runtimeClasspath
7 |
8 | if (project.hasProperty('example')) {
9 | if (example == 'RecognizeSync'){
10 | main = 'ai.vernacular.examples.speech.RecognizeSync'
11 | } else {
12 | println "Unable to find example file"
13 | }
14 | }
15 | }
16 |
--------------------------------------------------------------------------------
/java/samples/src/main/java/ai/vernacular/examples/speech/RecognizeSync.java:
--------------------------------------------------------------------------------
1 | package ai.vernacular.examples.speech;
2 |
3 | import ai.vernacular.speech.SpeechClient;
4 | import ai.vernacular.speech.RecognitionAudio;
5 | import ai.vernacular.speech.RecognitionConfig;
6 | import ai.vernacular.speech.SpeechRecognitionAlternative;
7 | import ai.vernacular.speech.RecognitionConfig.AudioEncoding;
8 | import ai.vernacular.speech.RecognizeResponse;
9 | import ai.vernacular.speech.SpeechRecognitionResult;
10 | import com.google.protobuf.ByteString;
11 |
12 | import java.io.IOException;
13 | import java.nio.file.Files;
14 | import java.nio.file.Path;
15 | import java.nio.file.Paths;
16 |
17 | public class RecognizeSync {
18 |
19 | public static void sampleRecognize(String accessToken, String localFilePath) {
20 | try (SpeechClient speechClient = new SpeechClient(accessToken)) {
21 | // The language of the supplied audio
22 | String languageCode = "en-IN";
23 |
24 | // Sample rate in Hertz of the audio data sent
25 | int sampleRateHertz = 8000;
26 |
27 | // Encoding of audio data sent. This sample sets this explicitly.
28 | RecognitionConfig.AudioEncoding encoding = RecognitionConfig.AudioEncoding.LINEAR16;
29 | RecognitionConfig config = RecognitionConfig.newBuilder().setLanguageCode(languageCode)
30 | .setSampleRateHertz(sampleRateHertz).setEncoding(encoding).build();
31 | Path path = Paths.get(localFilePath);
32 | byte[] data = Files.readAllBytes(path);
33 | ByteString content = ByteString.copyFrom(data);
34 | RecognitionAudio audio = RecognitionAudio.newBuilder().setContent(content).build();
35 | RecognizeResponse response = speechClient.recognize(config, audio);
36 | for (SpeechRecognitionResult result : response.getResultsList()) {
37 | // First alternative is the most probable result
38 | SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
39 | System.out.printf("Transcript: %s\n", alternative.getTranscript());
40 | }
41 | } catch (IOException exception) {
42 | System.err.println("Failed to create the client due to: " + exception.getMessage());
43 | }
44 | }
45 |
46 | public static void main(String[] args) {
47 | sampleRecognize("vernacularai", "hello.wav");
48 | }
49 | }
50 |
--------------------------------------------------------------------------------
/java/settings.gradle:
--------------------------------------------------------------------------------
1 | include ":vernacular-ai-speech"
2 | include ":samples"
3 |
--------------------------------------------------------------------------------
/java/vernacular-ai-speech/build.gradle:
--------------------------------------------------------------------------------
1 | def grpcVersion = '1.28.1'
2 | def protobufVersion = '1.28.1'
3 |
4 | dependencies {
5 | implementation "io.grpc:grpc-okhttp:${grpcVersion}"
6 | implementation ("io.grpc:grpc-protobuf-lite:${protobufVersion}") {
7 | exclude module: "protobuf-lite"
8 | }
9 | implementation "io.grpc:grpc-stub:${protobufVersion}"
10 | compile "com.google.api.grpc:proto-google-common-protos:1.17.0"
11 |
12 | compileOnly "javax.annotation:javax.annotation-api:1.3.2"
13 | }
14 |
15 | protobuf {
16 | protoc { artifact = 'com.google.protobuf:protoc:3.11.4' }
17 | plugins {
18 | grpc {
19 | artifact = "io.grpc:protoc-gen-grpc-java:${grpcVersion}"
20 | }
21 | javalite {
22 | artifact = 'com.google.protobuf:protoc-gen-javalite:3.0.0'
23 | }
24 | }
25 | generateProtoTasks {
26 | all().each { task ->
27 | task.plugins {
28 | javalite {}
29 | grpc { // Options added to --grpc_out
30 | option 'lite'
31 | }
32 | }
33 |
34 | task.builtins {
35 | remove java
36 | }
37 | }
38 | }
39 | }
40 |
--------------------------------------------------------------------------------
/java/vernacular-ai-speech/src/main/java/ai/vernacular/speech/SpeechClient.java:
--------------------------------------------------------------------------------
1 | package ai.vernacular.speech;
2 |
3 | import io.grpc.ManagedChannel;
4 | import io.grpc.ManagedChannelBuilder;
5 | import io.grpc.Metadata;
6 | import ai.vernacular.speech.SpeechToTextGrpc;
7 |
8 | import java.io.IOException;
9 |
10 | import ai.vernacular.speech.RecognitionAudio;
11 | import ai.vernacular.speech.RecognizeRequest;
12 | import ai.vernacular.speech.RecognizeResponse;
13 | import ai.vernacular.speech.RecognitionAudio;
14 | import ai.vernacular.speech.RecognitionConfig;
15 |
16 | public class SpeechClient implements AutoCloseable{
17 |
18 | public static final String STTP_GRPC_HOST = "localhost";
19 | public static final int STTP_GRPC_PORT = 5021;
20 | public static final String AUTHORIZATION = "authorization";
21 |
22 | private String accessToken;
23 | private ManagedChannel channel;
24 | private SpeechToTextGrpc.SpeechToTextBlockingStub channelStub;
25 |
26 | public SpeechClient(String accessToken) {
27 | this.accessToken = accessToken;
28 |
29 | this.channel = ManagedChannelBuilder.forAddress(STTP_GRPC_HOST, STTP_GRPC_PORT).usePlaintext().build();
30 | this.channelStub = SpeechToTextGrpc.newBlockingStub(channel);
31 | }
32 |
33 | public final RecognizeResponse recognize(RecognitionConfig config, RecognitionAudio audio) {
34 |         // Build the unary request and call the blocking stub.
35 | RecognizeRequest request = RecognizeRequest.newBuilder().setConfig(config).setAudio(audio).build();
36 | return this.channelStub.recognize(request);
37 | }
38 |
39 | @Override
40 | public void close() throws IOException {
41 |         this.channel.shutdown();
42 | }
43 |
44 | }
45 |
--------------------------------------------------------------------------------
/java/vernacular-ai-speech/src/main/proto/speech-to-text.proto:
--------------------------------------------------------------------------------
1 | syntax = "proto3";
2 | package speech_to_text;
3 |
4 | import "google/api/annotations.proto";
5 | import "google/api/client.proto";
6 | import "google/api/field_behavior.proto";
7 | import "google/rpc/status.proto";
8 |
9 | option java_multiple_files = true;
10 | option java_outer_classname = "SpeechToTextProto";
11 | option java_package = "ai.vernacular.speech";
12 |
13 | service SpeechToText {
14 | // Performs synchronous non-streaming speech recognition
15 | rpc Recognize(RecognizeRequest) returns (RecognizeResponse) {
16 | option (google.api.http) = {
17 | post: "/v1/speech:recognize"
18 | body: "*"
19 | };
20 | option (google.api.method_signature) = "config,audio";
21 | }
22 |
23 | // Performs bidirectional streaming speech recognition: receive results while
24 | // sending audio. This method is only available via the gRPC API (not REST).
25 | rpc StreamingRecognize(stream StreamingRecognizeRequest) returns (stream StreamingRecognizeResponse) {}
26 |
27 | // Performs asynchronous non-streaming speech recognition
28 | rpc LongRunningRecognize(LongRunningRecognizeRequest) returns (SpeechOperation) {}
29 | // Returns SpeechOperation for LongRunningRecognize. Used for polling the result
30 | rpc GetSpeechOperation(SpeechOperationRequest) returns (SpeechOperation) {
31 | option (google.api.http) = {
32 | get: "/v1/speech_operations/{name}"
33 | };
34 | }
35 | }
36 |
37 | //--------------------------------------------
38 | // requests
39 | //--------------------------------------------
40 | message RecognizeRequest {
41 | // Required. Provides information to the recognizer that specifies how to
42 | // process the request.
43 | RecognitionConfig config = 1 [(google.api.field_behavior) = REQUIRED];
44 |
45 | // Required. The audio data to be recognized.
46 | RecognitionAudio audio = 2 [(google.api.field_behavior) = REQUIRED];
47 |
48 | string segment = 16;
49 | }
50 |
51 | message LongRunningRecognizeRequest {
52 | // Required. Provides information to the recognizer that specifies how to
53 | // process the request.
54 | RecognitionConfig config = 1 [(google.api.field_behavior) = REQUIRED];
55 |
56 | // Required. The audio data to be recognized.
57 | RecognitionAudio audio = 2 [(google.api.field_behavior) = REQUIRED];
58 |
59 | // Optional. When operation completes, result is posted to this url if provided.
60 | string result_url = 11;
61 |
62 | string segment = 16;
63 | }
64 |
65 | message SpeechOperationRequest {
66 | // name of the speech operation
67 | string name = 1 [(google.api.field_behavior) = REQUIRED];
68 | }
69 |
70 | message StreamingRecognizeRequest {
71 | // The streaming request, which is either a streaming config or audio content.
72 | oneof streaming_request {
73 | // Provides information to the recognizer that specifies how to process the
74 | // request. The first `StreamingRecognizeRequest` message must contain a
75 | // `streaming_config` message.
76 | StreamingRecognitionConfig streaming_config = 1;
77 |
78 | // The audio data to be recognized.
79 | bytes audio_content = 2;
80 | }
81 | }
82 |
83 | message StreamingRecognitionConfig {
84 | // Required. Provides information to the recognizer that specifies how to
85 | // process the request.
86 | RecognitionConfig config = 1 [(google.api.field_behavior) = REQUIRED];
87 |
88 | // If `true`, interim results (tentative hypotheses) may be
89 | // returned as they become available (these interim results are indicated with
90 | // the `is_final=false` flag).
91 | // If `false` or omitted, only `is_final=true` result(s) are returned.
92 | bool interim_results = 2;
93 | }
94 |
95 | // Provides information to the recognizer that specifies how to process the request
96 | message RecognitionConfig {
97 | enum AudioEncoding {
98 | ENCODING_UNSPECIFIED = 0;
99 | LINEAR16 = 1;
100 | FLAC = 2;
101 | MP3 = 3;
102 | }
103 |
104 | AudioEncoding encoding = 1;
105 | int32 sample_rate_hertz = 2; // Valid values are: 8000-48000.
106 | string language_code = 3 [(google.api.field_behavior) = REQUIRED];
107 | int32 max_alternatives = 4;
108 | repeated SpeechContext speech_contexts = 5;
109 | int32 audio_channel_count = 6;
110 | bool enable_separate_recognition_per_channel = 7;
111 | bool enable_word_time_offsets = 8;
112 | bool enable_automatic_punctuation = 11;
113 | SpeakerDiarizationConfig diarization_config = 16;
114 | }
115 |
116 | message SpeechContext {
117 | repeated string phrases = 1;
118 | }
119 |
120 | // Config to enable speaker diarization.
121 | message SpeakerDiarizationConfig {
122 | // If 'true', enables speaker detection for each recognized word in
123 | // the top alternative of the recognition result using a speaker_tag provided
124 | // in the WordInfo.
125 | bool enable_speaker_diarization = 1;
126 |
127 | // Minimum number of speakers in the conversation. This range gives you more
128 | // flexibility by allowing the system to automatically determine the correct
129 | // number of speakers. If not set, the default value is 2.
130 | int32 min_speaker_count = 2;
131 |
132 | // Maximum number of speakers in the conversation. This range gives you more
133 | // flexibility by allowing the system to automatically determine the correct
134 | // number of speakers. If not set, the default value is 6.
135 | int32 max_speaker_count = 3;
136 | }
137 |
138 | // Either `content` or `uri` must be supplied.
139 | message RecognitionAudio {
140 | oneof audio_source {
141 | bytes content = 1;
142 | string uri = 2;
143 | }
144 | }
145 |
146 | //--------------------------------------------
147 | // responses
148 | //--------------------------------------------
149 | message RecognizeResponse {
150 | repeated SpeechRecognitionResult results = 1;
151 | }
152 |
153 | message LongRunningRecognizeResponse {
154 | repeated SpeechRecognitionResult results = 1;
155 | }
156 |
157 | message StreamingRecognizeResponse {
158 | // If set, returns a [google.rpc.Status][google.rpc.Status] message that
159 | // specifies the error for the operation.
160 | google.rpc.Status error = 1;
161 |
162 | // This repeated list contains zero or more results that
163 | // correspond to consecutive portions of the audio currently being processed.
164 | // It contains zero or one `is_final=true` result (the newly settled portion),
165 | // followed by zero or more `is_final=false` results (the interim results).
166 | repeated StreamingRecognitionResult results = 2;
167 | }
168 |
169 | message SpeechRecognitionResult {
170 | repeated SpeechRecognitionAlternative alternatives = 1;
171 | int32 channel_tag = 2;
172 | }
173 |
174 | message StreamingRecognitionResult {
175 | // May contain one or more recognition hypotheses (up to the
176 | // maximum specified in `max_alternatives`).
177 | // These alternatives are ordered in terms of accuracy, with the top (first)
178 | // alternative being the most probable, as ranked by the recognizer.
179 | repeated SpeechRecognitionAlternative alternatives = 1;
180 |
181 | // If `false`, this `StreamingRecognitionResult` represents an
182 | // interim result that may change. If `true`, this is the final time the
183 | // speech service will return this particular `StreamingRecognitionResult`,
184 | // the recognizer will not return any further hypotheses for this portion of
185 | // the transcript and corresponding audio.
186 | bool is_final = 2;
187 |
188 | // An estimate of the likelihood that the recognizer will not
189 | // change its guess about this interim result. Values range from 0.0
190 | // (completely unstable) to 1.0 (completely stable).
191 | // This field is only provided for interim results (`is_final=false`).
192 | // The default of 0.0 is a sentinel value indicating `stability` was not set.
193 | float stability = 3;
194 |
195 | // Time offset of the end of this result relative to the
196 | // beginning of the audio.
197 | float result_end_time = 4;
198 |
199 | // For multi-channel audio, this is the channel number corresponding to the
200 | // recognized result for the audio from that channel.
201 | // For audio_channel_count = N, its output values can range from '1' to 'N'.
202 | int32 channel_tag = 5;
203 | }
204 |
205 | message SpeechRecognitionAlternative {
206 | string transcript = 1;
207 | float confidence = 2;
208 | repeated WordInfo words = 3;
209 | }
210 |
211 | message WordInfo {
212 | float start_time = 1;
213 | float end_time = 2;
214 | string word = 3;
215 | }
216 |
217 | message SpeechOperation {
218 | string name = 1;
219 | bool done = 2;
220 | oneof result {
221 | // If set, returns a [google.rpc.Status][google.rpc.Status] message that
222 | // specifies the error for the operation.
223 | google.rpc.Status error = 3;
224 |
225 | LongRunningRecognizeResponse response = 4;
226 | }
227 | }
228 |
--------------------------------------------------------------------------------
/python/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | pip-wheel-metadata/
24 | share/python-wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .nox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | .hypothesis/
51 | .pytest_cache/
52 |
53 | # Sphinx documentation
54 | docs/_build/
55 |
56 | # PyBuilder
57 | target/
58 |
59 | # Jupyter Notebook
60 | .ipynb_checkpoints
61 |
62 | # IPython
63 | profile_default/
64 | ipython_config.py
65 |
66 | # pyenv
67 | .python-version
68 |
69 | # celery beat schedule file
70 | celerybeat-schedule
71 |
72 | # SageMath parsed files
73 | *.sage.py
74 |
75 | # Environments
76 | .env
77 | .venv
78 | env/
79 | venv/
80 | ENV/
81 | env.bak/
82 | venv.bak/
83 |
--------------------------------------------------------------------------------
/python/README.md:
--------------------------------------------------------------------------------
1 | # Python Speech to Text SDK
2 |
3 | Python SDK for vernacular.ai speech to text APIs. Go [here](https://github.com/Vernacular-ai/speech-recognition) for detailed product documentation.
4 |
5 | ## Installation
6 | To install this SDK, run:
7 |
8 | ```shell
9 | pip install vernacular-ai-speech
10 | ```
11 |
12 | #### Supported Python Versions
13 |
14 | Python >= 3.5
15 |
16 | ## Example Usage
17 |
18 | ```python
19 | from vernacular.ai import speech
20 | from vernacular.ai.speech import enums, types
21 |
22 |
23 | def sample_recognize(access_token, file_path):
24 | """
25 | Args:
26 | access_token Token provided by vernacular.ai for authentication
27 | file_path Path to audio file e.g /path/audio_file.wav
28 | """
29 | speech_client = speech.SpeechClient(access_token)
30 |
31 | audio = types.RecognitionAudio(
32 | content = open(file_path, "rb").read()
33 | )
34 |
35 | config = types.RecognitionConfig(
36 | encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
37 | sample_rate_hertz=8000,
38 | language_code = "hi-IN",
39 | )
40 |
41 | response = speech_client.recognize(audio=audio, config=config)
42 |
43 | for result in response.results:
44 | alternative = result.alternatives[0]
45 | print("Transcript: {}".format(alternative.transcript))
46 | ```
47 |
48 | To see more examples, go to [samples](https://github.com/Vernacular-ai/speech-recognition/tree/master/python/samples).
49 |
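50 | Word-level timestamps can also be requested by setting `enable_word_time_offsets` in the config; this is a minimal sketch based on `samples/recognize_word_offset.py`:
51 | 
52 | ```python
53 | from vernacular.ai import speech
54 | from vernacular.ai.speech import enums, types
55 | 
56 | 
57 | def sample_recognize_with_offsets(access_token, file_path):
58 |     speech_client = speech.SpeechClient(access_token)
59 | 
60 |     audio = types.RecognitionAudio(
61 |         content=open(file_path, "rb").read()
62 |     )
63 |     config = types.RecognitionConfig(
64 |         encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
65 |         sample_rate_hertz=8000,
66 |         language_code="hi-IN",
67 |         enable_word_time_offsets=True,  # ask for per-word start/end times
68 |     )
69 | 
70 |     response = speech_client.recognize(audio=audio, config=config)
71 | 
72 |     for result in response.results:
73 |         # First alternative is the most probable result
74 |         alternative = result.alternatives[0]
75 |         print("Transcript: {}".format(alternative.transcript))
76 |         print("Words: {}".format(alternative.words))
77 | ```
78 | 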
--------------------------------------------------------------------------------
/python/requirements.txt:
--------------------------------------------------------------------------------
1 | grpcio==1.27.1
2 | googleapis-common-protos==1.51.0
3 |
--------------------------------------------------------------------------------
/python/samples/recognize_async.py:
--------------------------------------------------------------------------------
1 | from vernacular.ai import speech
2 | from vernacular.ai.speech import enums, types
3 | import os
4 |
5 |
6 | def infer_encoding(file_path: str):
7 | if ".mp3" in file_path:
8 | return enums.RecognitionConfig.AudioEncoding.MP3
9 | elif ".wav" in file_path:
10 | return enums.RecognitionConfig.AudioEncoding.LINEAR16
11 |
12 | return enums.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED
13 |
14 |
15 | def sample_recognize_async(access_token, file_path):
16 | speech_client = speech.SpeechClient(access_token)
17 |
18 | audio = types.RecognitionAudio(
19 | content = open(file_path, "rb").read()
20 | )
21 | config = types.RecognitionConfig(
22 | encoding=infer_encoding(file_path),
23 | sample_rate_hertz=8000,
24 | language_code = "hi-IN",
25 | max_alternatives = 2,
26 | )
27 |
28 | speech_operation = speech_client.long_running_recognize(audio=audio, config=config)
29 |
30 | print("Waiting for operation to complete...")
31 | response = speech_operation.response
32 |
33 | for result in response.results:
34 | # First alternative is the most probable result
35 | alternative = result.alternatives[0]
36 | print("Transcript: {}".format(alternative.transcript))
37 | print("Confidence: {}".format(alternative.confidence))
38 |
39 |
40 | def main():
41 | import argparse
42 |
43 | parser = argparse.ArgumentParser()
44 | parser.add_argument(
45 | "--access_token", type=str, default=os.environ.get("AUTH_ACCESS_TOKEN")
46 | )
47 | parser.add_argument(
48 | "--file_path", type=str, default="../resources/hello.wav"
49 | )
50 | args = parser.parse_args()
51 |
52 | sample_recognize_async(args.access_token, args.file_path)
53 |
54 |
55 | if __name__ == "__main__":
56 | main()
57 |
--------------------------------------------------------------------------------
/python/samples/recognize_multi_channel.py:
--------------------------------------------------------------------------------
1 | from vernacular.ai import speech
2 | from vernacular.ai.speech import enums, types
3 | import os
4 |
5 | def infer_encoding(file_path: str):
6 | if ".mp3" in file_path:
7 | return enums.RecognitionConfig.AudioEncoding.MP3
8 | elif ".wav" in file_path:
9 | return enums.RecognitionConfig.AudioEncoding.LINEAR16
10 |
11 | return enums.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED
12 |
13 |
14 | def sample_recognize(access_token, file_path):
15 | speech_client = speech.SpeechClient(access_token)
16 |
17 | # The number of channels in the input audio file
18 | audio_channel_count = 2
19 |
20 | # When set to true, each audio channel will be recognized separately.
21 | # The recognition result will contain a channel_tag field to state which
22 | # channel that result belongs to
23 | enable_separate_recognition_per_channel = True
24 |
25 | audio = types.RecognitionAudio(
26 | content = open(file_path, "rb").read()
27 | )
28 | config = types.RecognitionConfig(
29 | encoding=infer_encoding(file_path),
30 | sample_rate_hertz=8000,
31 | language_code = "hi-IN",
32 | max_alternatives = 1,
33 | enable_separate_recognition_per_channel=enable_separate_recognition_per_channel,
34 | audio_channel_count=audio_channel_count,
35 | )
36 |
37 | response = speech_client.recognize(audio=audio, config=config, timeout=60)
38 |
39 | for result in response.results:
40 | alternative = result.alternatives[0]
41 | print("Transcript: {}".format(alternative.transcript))
42 | print("ChannelTag: {}".format(result.channel_tag))
43 |
44 |
45 | def main():
46 | import argparse
47 |
48 | parser = argparse.ArgumentParser()
49 | parser.add_argument(
50 | "--access_token", type=str, default=os.environ.get("AUTH_ACCESS_TOKEN")
51 | )
52 | parser.add_argument(
53 | "--file_path", type=str, default="../resources/hello.wav"
54 | )
55 | args = parser.parse_args()
56 |
57 | sample_recognize(args.access_token, args.file_path)
58 |
59 |
60 | if __name__ == "__main__":
61 | main()
62 |
--------------------------------------------------------------------------------
/python/samples/recognize_streaming.py:
--------------------------------------------------------------------------------
1 | from vernacular.ai import speech
2 | from vernacular.ai.speech import enums, types
3 | import os
4 | import time
5 | import threading
6 | from six.moves import queue
7 |
8 |
9 | def infer_encoding(file_path: str):
10 | if ".mp3" in file_path:
11 | return enums.RecognitionConfig.AudioEncoding.MP3
12 | elif ".wav" in file_path:
13 | return enums.RecognitionConfig.AudioEncoding.LINEAR16
14 | elif ".raw" in file_path:
15 | return enums.RecognitionConfig.AudioEncoding.LINEAR16
16 |
17 | return enums.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED
18 |
19 |
20 | class AddAudio(threading.Thread):
21 | def __init__(self, file_path, _buff):
22 | threading.Thread.__init__(self)
23 | self.file_path = file_path
24 | self._buff = _buff
25 |
26 | def run(self):
27 | with open(self.file_path, "rb") as file:
28 | audio_content = file.read()
29 |
30 | for i in range(0, len(audio_content), 8000):
31 | self._buff.put(audio_content[i:i+8000])
32 | # add a delay for real time streaming simulation
33 | time.sleep(0.1)
34 |
35 | # add None to queue to mark end of streaming
36 | self._buff.put(None)
37 |
38 |
39 | class SampleRecognizeStreaming():
40 | def __init__(self, access_token, file_path):
41 | self.speech_client = speech.SpeechClient(access_token)
42 | self.file_path = file_path
43 |
44 | config = types.RecognitionConfig(
45 | encoding=infer_encoding(file_path),
46 | sample_rate_hertz=8000,
47 | language_code="en-IN",
48 | max_alternatives=1,
49 | )
50 | self.stream_config = types.StreamingRecognitionConfig(
51 | config=config,
52 | )
53 |
54 | self._buff = queue.Queue()
55 |
56 | def run(self):
57 | requests = (types.StreamingRecognizeRequest(audio_content=content)
58 | for content in self.audio_generator())
59 |
60 | responses = self.speech_client.streaming_recognize(self.stream_config, requests)
61 |
62 |         # add audio chunks in a new thread to simulate streaming
63 | t1 = AddAudio(self.file_path, self._buff)
64 | t1.start()
65 |
66 |         # this is a blocking call and will wait until the server sends results
67 | for response in responses:
68 | for result in response.results:
69 | alternative = result.alternatives[0]
70 | print("Transcript: {}".format(alternative.transcript))
71 | print("Confidence: {}".format(alternative.confidence))
72 |
73 |
74 | def audio_generator(self):
75 | while True:
76 | chunk = self._buff.get()
77 | if chunk is None:
78 | return
79 |
80 | yield chunk
81 |
82 |
83 | def main():
84 | import argparse
85 |
86 | parser = argparse.ArgumentParser()
87 | parser.add_argument(
88 | "--access_token", type=str, default=os.environ.get("AUTH_ACCESS_TOKEN")
89 | )
90 | parser.add_argument(
91 | "--file_path", type=str, default="../resources/test-single-channel-8000Hz.raw"
92 | )
93 | args = parser.parse_args()
94 |
95 | ss = SampleRecognizeStreaming(args.access_token, args.file_path)
96 | ss.run()
97 |
98 |
99 | if __name__ == "__main__":
100 | main()
101 |
--------------------------------------------------------------------------------
/python/samples/recognize_streaming_mic.py:
--------------------------------------------------------------------------------
1 | """
2 | NOTE: This module requires the additional dependency `pyaudio`.
3 | To install using pip:
4 | pip install pyaudio
5 | Example usage:
6 | python recognize_streaming_mic.py
7 | """
8 |
9 | from __future__ import division
10 |
11 | import re
12 | import sys
13 |
14 | from vernacular.ai import speech
15 | from vernacular.ai.speech import enums, types
16 | import pyaudio
17 | import os
18 | from six.moves import queue
19 |
20 | # Audio recording parameters
21 | RATE = 8000
22 | CHUNK = int(RATE / 10) # 100ms
23 |
24 | class MicrophoneStream(object):
25 | """Opens a recording stream as a generator yielding the audio chunks."""
26 | def __init__(self, rate, chunk):
27 | self._rate = rate
28 | self._chunk = chunk
29 |
30 | # Create a thread-safe buffer of audio data
31 | self._buff = queue.Queue()
32 | self.closed = True
33 |
34 | def __enter__(self):
35 | self._audio_interface = pyaudio.PyAudio()
36 | self._audio_stream = self._audio_interface.open(
37 | format=pyaudio.paInt16,
38 | # The API currently only supports 1-channel (mono) audio
39 | # https://goo.gl/z757pE
40 | channels=1, rate=self._rate,
41 | input=True, frames_per_buffer=self._chunk,
42 | # Run the audio stream asynchronously to fill the buffer object.
43 | # This is necessary so that the input device's buffer doesn't
44 | # overflow while the calling thread makes network requests, etc.
45 | stream_callback=self._fill_buffer,
46 | )
47 |
48 | self.closed = False
49 |
50 | return self
51 |
52 | def __exit__(self, type, value, traceback):
53 | self.end()
54 |
55 | def end(self):
56 | self._audio_stream.stop_stream()
57 | self._audio_stream.close()
58 | self.closed = True
59 | # Signal the generator to terminate so that the client's
60 | # streaming_recognize method will not block the process termination.
61 | self._buff.put(None)
62 | self._audio_interface.terminate()
63 |
64 | def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
65 | """Continuously collect data from the audio stream, into the buffer."""
66 | self._buff.put(in_data)
67 | return None, pyaudio.paContinue
68 |
69 | def generator(self):
70 | while not self.closed:
71 | # Use a blocking get() to ensure there's at least one chunk of
72 | # data, and stop iteration if the chunk is None, indicating the
73 | # end of the audio stream.
74 | chunk = self._buff.get()
75 | if chunk is None:
76 | return
77 | data = [chunk]
78 |
79 | # Now consume whatever other data's still buffered.
80 | while True:
81 | try:
82 | chunk = self._buff.get(block=False)
83 | if chunk is None:
84 | return
85 | data.append(chunk)
86 | except queue.Empty:
87 | break
88 |
89 | yield b''.join(data)
90 |
91 |
92 | def main():
93 | import argparse
94 |
95 | parser = argparse.ArgumentParser()
96 | parser.add_argument(
97 | "--access_token", type=str, default=os.environ.get("AUTH_ACCESS_TOKEN")
98 | )
99 | args = parser.parse_args()
100 | language_code = 'hi-IN' # a BCP-47 language tag
101 |
102 | client = speech.SpeechClient(args.access_token)
103 | config = types.RecognitionConfig(
104 | encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
105 | sample_rate_hertz=RATE,
106 | language_code=language_code,
107 | )
108 | sd_config = types.SilenceDetectionConfig(
109 | enable_silence_detection=True,
110 | max_speech_timeout=4,
111 | )
112 |
113 | streaming_config = types.StreamingRecognitionConfig(config=config, silence_detection_config=sd_config)
114 |
115 | with MicrophoneStream(RATE, CHUNK) as stream:
116 | audio_generator = stream.generator()
117 | requests = (types.StreamingRecognizeRequest(audio_content=content)
118 | for content in audio_generator)
119 |
120 | responses = client.streaming_recognize(streaming_config, requests)
121 |
122 | # Now, put the transcription responses to use.
123 | for response in responses:
124 | stream.end()
125 | if len(response.results) > 0 and len(response.results[0].alternatives) > 0:
126 | # Display the transcription of the top alternative.
127 | transcript = response.results[0].alternatives[0].transcript
128 | print(transcript)
129 | else:
130 | print("Empty results")
131 |
132 |
133 | if __name__ == '__main__':
134 | main()
135 |
--------------------------------------------------------------------------------
/python/samples/recognize_sync.py:
--------------------------------------------------------------------------------
1 | from vernacular.ai import speech
2 | from vernacular.ai.speech import enums, types
3 | import os
4 |
5 |
6 | def infer_encoding(file_path: str):
7 | if ".mp3" in file_path:
8 | return enums.RecognitionConfig.AudioEncoding.MP3
9 | elif ".wav" in file_path:
10 | return enums.RecognitionConfig.AudioEncoding.LINEAR16
11 |
12 | return enums.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED
13 |
14 |
15 | def sample_recognize(access_token, file_path):
16 | speech_client = speech.SpeechClient(access_token)
17 |
18 | audio = types.RecognitionAudio(
19 | content = open(file_path, "rb").read()
20 | )
21 | config = types.RecognitionConfig(
22 | encoding=infer_encoding(file_path),
23 | sample_rate_hertz=8000,
24 | language_code = "hi-IN",
25 | max_alternatives = 1,
26 | )
27 |
28 | response = speech_client.recognize(audio=audio, config=config)
29 |
30 | for result in response.results:
31 | # First alternative is the most probable result
32 | alternative = result.alternatives[0]
33 | print("Transcript: {}".format(alternative.transcript))
34 | print("Confidence: {}".format(alternative.confidence))
35 |
36 |
37 | def main():
38 | import argparse
39 |
40 | parser = argparse.ArgumentParser()
41 | parser.add_argument(
42 | "--access_token", type=str, default=os.environ.get("AUTH_ACCESS_TOKEN")
43 | )
44 | parser.add_argument(
45 | "--file_path", type=str, default="../resources/hello.wav"
46 | )
47 | args = parser.parse_args()
48 |
49 | sample_recognize(args.access_token, args.file_path)
50 |
51 |
52 | if __name__ == "__main__":
53 | main()
54 |
--------------------------------------------------------------------------------
/python/samples/recognize_word_offset.py:
--------------------------------------------------------------------------------
1 | from vernacular.ai import speech
2 | from vernacular.ai.speech import enums, types
3 | import os
4 |
5 | def infer_encoding(file_path: str):
6 | if ".mp3" in file_path:
7 | return enums.RecognitionConfig.AudioEncoding.MP3
8 | elif ".wav" in file_path:
9 | return enums.RecognitionConfig.AudioEncoding.LINEAR16
10 |
11 | return enums.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED
12 |
13 |
14 | def sample_recognize(access_token, file_path):
15 | speech_client = speech.SpeechClient(access_token)
16 |
17 | enable_word_time_offsets = True
18 |
19 | audio = types.RecognitionAudio(
20 | content = open(file_path, "rb").read()
21 | )
22 | config = types.RecognitionConfig(
23 | encoding=infer_encoding(file_path),
24 | sample_rate_hertz=8000,
25 | language_code = "hi-IN",
26 | max_alternatives = 1,
27 | enable_word_time_offsets=enable_word_time_offsets
28 | )
29 |
30 | response = speech_client.recognize(audio=audio, config=config)
31 |
32 | for result in response.results:
33 | # First alternative is the most probable result
34 | alternative = result.alternatives[0]
35 | print("Transcript: {}".format(alternative.transcript))
36 | print("Words: {}".format(alternative.words))
37 |
38 |
39 | def main():
40 | import argparse
41 |
42 | parser = argparse.ArgumentParser()
43 | parser.add_argument(
44 | "--access_token", type=str, default=os.environ.get("AUTH_ACCESS_TOKEN")
45 | )
46 | parser.add_argument(
47 | "--file_path", type=str, default="../resources/hello.wav"
48 | )
49 | args = parser.parse_args()
50 |
51 | sample_recognize(args.access_token, args.file_path)
52 |
53 |
54 | if __name__ == "__main__":
55 | main()
56 |
--------------------------------------------------------------------------------
/python/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | description-file = README.md
3 | long_description_content_type= text/markdown
--------------------------------------------------------------------------------
/python/setup.py:
--------------------------------------------------------------------------------
1 | import io
2 | import os
3 |
4 | from setuptools import setup
5 |
6 |
7 | name = "vernacular-ai-speech"
8 | description = "Vernacular Speech API python client"
9 | version = "0.1.2"
10 |
11 | dependencies = ["grpcio >= 1.27.1", "googleapis-common-protos == 1.51.0"]
12 | extras = {}
13 |
14 | package_root = os.path.abspath(os.path.dirname(__file__))
15 |
16 | readme_filename = os.path.join(package_root, "README.md")
17 | with io.open(readme_filename, "r") as readme_file:
18 | readme = readme_file.read()
19 |
20 | # Only include packages under the 'vernacular' namespace. Do not include tests,
21 | # benchmarks, etc.
22 | packages = [
23 | "vernacular.ai.speech",
24 | "vernacular.ai.speech.proto",
25 | "vernacular.ai.exceptions",
26 | ]
27 |
28 | # Determine which namespaces are needed.
29 | namespaces = ["vernacular", "vernacular.ai"]
30 |
31 | setup(
32 | name=name,
33 | version=version,
34 | description=description,
35 | long_description_content_type="text/markdown",
36 | long_description=readme,
37 | author="Vernacular.ai",
38 | author_email="deepankar@vernacular.ai",
39 | license="Apache 2.0",
40 | url="https://github.com/Vernacular-ai/speech-recognition",
41 | classifiers=[
42 |         "Development Status :: 4 - Beta", # Choose either "3 - Alpha", "4 - Beta" or "5 - Production/Stable"
43 | "Intended Audience :: Developers",
44 | "Topic :: Software Development :: Build Tools",
45 | "License :: OSI Approved :: Apache Software License",
46 | "Programming Language :: Python",
47 | "Programming Language :: Python :: 3",
48 | "Programming Language :: Python :: 3.5",
49 | "Programming Language :: Python :: 3.6",
50 | "Programming Language :: Python :: 3.7",
51 | "Operating System :: OS Independent",
52 | ],
53 | platforms="Posix; MacOS X; Windows",
54 | packages=packages,
55 | namespace_packages=namespaces,
56 | install_requires=dependencies,
57 | extras_require=extras,
58 | python_requires=">=3.5",
59 | include_package_data=True,
60 | zip_safe=False,
61 | )
62 |
--------------------------------------------------------------------------------
/python/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/python/tests/__init__.py
--------------------------------------------------------------------------------
/python/vernacular/__init__.py:
--------------------------------------------------------------------------------
1 | try:
2 | import pkg_resources
3 |
4 | pkg_resources.declare_namespace(__name__)
5 | except ImportError:
6 | import pkgutil
7 |
8 | __path__ = pkgutil.extend_path(__path__, __name__)
--------------------------------------------------------------------------------
/python/vernacular/ai/__init__.py:
--------------------------------------------------------------------------------
1 | try:
2 | import pkg_resources
3 |
4 | pkg_resources.declare_namespace(__name__)
5 | except ImportError:
6 | import pkgutil
7 |
8 | __path__ = pkgutil.extend_path(__path__, __name__)
--------------------------------------------------------------------------------
/python/vernacular/ai/exceptions/__init__.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 |
3 | import grpc
4 |
5 | class VernacularAPIError(Exception):
6 | """
7 | Base class for all exceptions raised by Vernacular API Clients.
8 | """
9 | pass
10 |
11 |
12 | class VernacularAPICallError(VernacularAPIError):
13 | """
14 | Base class for exceptions raised by calling API methods.
15 |
16 | Args:
17 | message (str): The exception message.
18 | errors (Sequence[Any]): An optional list of error details.
19 | response (Union[requests.Request, grpc.Call]): The response or
20 | gRPC call metadata.
21 | """
22 |
23 | code = None
24 | """
25 | Optional[int]: The HTTP status code associated with this error.
26 |
27 | This may be ``None`` if the exception does not have a direct mapping
28 | to an HTTP error.
29 |
30 | See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
31 | """
32 |
33 | grpc_status_code = None
34 | """
35 | Optional[grpc.StatusCode]: The gRPC status code associated with this
36 | error.
37 |
38 | This may be ``None`` if the exception does not match up to a gRPC error.
39 | """
40 |
41 | def __init__(self, message, errors=(), response=None):
42 |         super(VernacularAPICallError, self).__init__(message)
43 | self.message = message
44 | """str: The exception message."""
45 | self._errors = errors
46 | self._response = response
47 |
48 | def __str__(self):
49 | return "{} {}".format(self.code, self.message)
50 |
51 | @property
52 | def errors(self):
53 | """Detailed error information.
54 |
55 | Returns:
56 | Sequence[Any]: A list of additional error details.
57 | """
58 | return list(self._errors)
59 |
60 | @property
61 | def response(self):
62 | """Optional[Union[requests.Request, grpc.Call]]: The response or
63 | gRPC call metadata."""
64 | return self._response
--------------------------------------------------------------------------------
/python/vernacular/ai/speech/__init__.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 |
3 | from vernacular.ai.speech import speech_client
4 | from vernacular.ai.speech import types
5 | from vernacular.ai.speech import enums
6 |
7 |
8 | class SpeechClient(speech_client.SpeechClient):
9 | __doc__ = speech_client.SpeechClient.__doc__
10 | enums = enums
11 | types = types
12 |
13 |
14 | __all__ = ("SpeechClient", "enums", "types")
15 |
--------------------------------------------------------------------------------
/python/vernacular/ai/speech/enums.py:
--------------------------------------------------------------------------------
1 | import enum
2 |
3 |
4 | class RecognitionConfig(object):
5 | class AudioEncoding(enum.IntEnum):
6 | """
7 | The encoding of the audio data sent in the request.
8 | All encodings support only 1 channel (mono) audio, unless the
9 | ``audio_channel_count`` and ``enable_separate_recognition_per_channel``
10 | fields are set.
11 | For best results, the audio source should be captured and transmitted
12 | using a lossless encoding (``FLAC`` or ``LINEAR16``). The accuracy of
13 | the speech recognition can be reduced if lossy codecs are used to
14 | capture or transmit audio, particularly if background noise is present.
15 | Lossy codecs include ``MULAW``, ``AMR``, ``AMR_WB``, ``OGG_OPUS``,
16 | ``SPEEX_WITH_HEADER_BYTE``, and ``MP3``.
17 | The ``FLAC`` and ``WAV`` audio file formats include a header that
18 | describes the included audio content. You can request recognition for
19 | ``WAV`` files that contain either ``LINEAR16`` or ``MULAW`` encoded
20 | audio. If you send ``FLAC`` or ``WAV`` audio file format in your
21 | request, you do not need to specify an ``AudioEncoding``; the audio
22 | encoding format is determined from the file header. If you specify an
23 |         ``AudioEncoding`` when you send ``FLAC`` or ``WAV`` audio, the
24 | encoding configuration must match the encoding described in the audio
25 | header; otherwise the request returns an
26 | ``google.rpc.Code.INVALID_ARGUMENT`` error code.
27 | Attributes:
28 | ENCODING_UNSPECIFIED (int): Not specified.
29 | LINEAR16 (int): Uncompressed 16-bit signed little-endian samples (Linear PCM).
30 | FLAC (int): ``FLAC`` (Free Lossless Audio Codec) is the recommended encoding because
31 | it is lossless--therefore recognition is not compromised--and requires
32 | only about half the bandwidth of ``LINEAR16``. ``FLAC`` stream encoding
33 | supports 16-bit and 24-bit samples, however, not all fields in
34 | ``STREAMINFO`` are supported.
35 | """
36 |
37 | ENCODING_UNSPECIFIED = 0
38 | LINEAR16 = 1
39 | FLAC = 2
40 | MP3 = 3
--------------------------------------------------------------------------------
/python/vernacular/ai/speech/proto/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/python/vernacular/ai/speech/proto/__init__.py
--------------------------------------------------------------------------------
/python/vernacular/ai/speech/proto/speech-to-text.proto:
--------------------------------------------------------------------------------
1 | syntax = "proto3";
2 | package speech_to_text;
3 |
4 | import "google/api/annotations.proto";
5 | import "google/api/client.proto";
6 | import "google/api/field_behavior.proto";
7 | import "google/rpc/status.proto";
8 |
9 | option java_multiple_files = true;
10 | option java_outer_classname = "SpeechToTextProto";
11 | option java_package = "ai.vernacular.speech";
12 |
13 | service SpeechToText {
14 | // Performs synchronous non-streaming speech recognition
15 | rpc Recognize(RecognizeRequest) returns (RecognizeResponse) {
16 | option (google.api.http) = {
17 | post: "/v2/speech:recognize"
18 | body: "*"
19 | };
20 | option (google.api.method_signature) = "config,audio";
21 | }
22 |
23 | // Performs bidirectional streaming speech recognition: receive results while
24 | // sending audio. This method is only available via the gRPC API (not REST).
25 | rpc StreamingRecognize(stream StreamingRecognizeRequest) returns (stream StreamingRecognizeResponse) {}
26 |
27 | // Performs asynchronous non-streaming speech recognition
28 | rpc LongRunningRecognize(LongRunningRecognizeRequest) returns (SpeechOperation) {
29 | option (google.api.http) = {
30 | post: "/v2/speech:longrunningrecognize"
31 | body: "*"
32 | };
33 | option (google.api.method_signature) = "config,audio";
34 | }
35 |   // Returns the SpeechOperation for LongRunningRecognize. Used for polling the result.
36 | rpc GetSpeechOperation(SpeechOperationRequest) returns (SpeechOperation) {
37 | option (google.api.http) = {
38 | get: "/v2/speech_operations/{name}"
39 | };
40 | }
41 | }
42 |
43 | //--------------------------------------------
44 | // requests
45 | //--------------------------------------------
46 | message RecognizeRequest {
47 | // Required. Provides information to the recognizer that specifies how to
48 | // process the request.
49 | RecognitionConfig config = 1 [(google.api.field_behavior) = REQUIRED];
50 |
51 | // Required. The audio data to be recognized.
52 | RecognitionAudio audio = 2 [(google.api.field_behavior) = REQUIRED];
53 |
54 | string segment = 16;
55 | }
56 |
57 | message LongRunningRecognizeRequest {
58 | // Required. Provides information to the recognizer that specifies how to
59 | // process the request.
60 | RecognitionConfig config = 1 [(google.api.field_behavior) = REQUIRED];
61 |
62 | // Required. The audio data to be recognized.
63 | RecognitionAudio audio = 2 [(google.api.field_behavior) = REQUIRED];
64 |
65 |   // Optional. When the operation completes, the result is posted to this URL, if provided.
66 | string result_url = 11;
67 |
68 | string segment = 16;
69 | }
70 |
71 | message SpeechOperationRequest {
72 | // name of the speech operation
73 | string name = 1 [(google.api.field_behavior) = REQUIRED];
74 | }
75 |
76 | message StreamingRecognizeRequest {
77 | // The streaming request, which is either a streaming config or audio content.
78 | oneof streaming_request {
79 | // Provides information to the recognizer that specifies how to process the
80 | // request. The first `StreamingRecognizeRequest` message must contain a
81 | // `streaming_config` message.
82 | StreamingRecognitionConfig streaming_config = 1;
83 |
84 | // The audio data to be recognized.
85 | bytes audio_content = 2;
86 | }
87 | }
88 |
89 | message StreamingRecognitionConfig {
90 | // Required. Provides information to the recognizer that specifies how to
91 | // process the request.
92 | RecognitionConfig config = 1 [(google.api.field_behavior) = REQUIRED];
93 |
94 | // If `true`, interim results (tentative hypotheses) may be
95 | // returned as they become available (these interim results are indicated with
96 | // the `is_final=false` flag).
97 | // If `false` or omitted, only `is_final=true` result(s) are returned.
98 | bool interim_results = 2;
99 |
100 | SilenceDetectionConfig silence_detection_config = 3;
101 | }
102 |
103 | // Provides information to the recognizer that specifies how to process the request
104 | message RecognitionConfig {
105 | enum AudioEncoding {
106 | ENCODING_UNSPECIFIED = 0;
107 | LINEAR16 = 1;
108 | FLAC = 2;
109 | MP3 = 3;
110 | }
111 |
112 | AudioEncoding encoding = 1;
113 | int32 sample_rate_hertz = 2; // Valid values are: 8000-48000.
114 | string language_code = 3 [(google.api.field_behavior) = REQUIRED];
115 | int32 max_alternatives = 4;
116 | repeated SpeechContext speech_contexts = 5;
117 | int32 audio_channel_count = 6;
118 | bool enable_separate_recognition_per_channel = 7;
119 | bool enable_word_time_offsets = 8;
120 | bool enable_automatic_punctuation = 11;
121 | SpeakerDiarizationConfig diarization_config = 16;
122 | }
123 |
124 | message SpeechContext {
125 | repeated string phrases = 1;
126 | }
127 |
128 | // Config to enable speaker diarization.
129 | message SpeakerDiarizationConfig {
130 | // If 'true', enables speaker detection for each recognized word in
131 | // the top alternative of the recognition result using a speaker_tag provided
132 | // in the WordInfo.
133 | bool enable_speaker_diarization = 1;
134 |
135 | // Minimum number of speakers in the conversation. This range gives you more
136 | // flexibility by allowing the system to automatically determine the correct
137 | // number of speakers. If not set, the default value is 2.
138 | int32 min_speaker_count = 2;
139 |
140 | // Maximum number of speakers in the conversation. This range gives you more
141 | // flexibility by allowing the system to automatically determine the correct
142 | // number of speakers. If not set, the default value is 6.
143 | int32 max_speaker_count = 3;
144 | }
145 |
146 | message SilenceDetectionConfig {
147 |   // If `true`, enables silence detection.
148 | bool enable_silence_detection = 1;
149 | float max_speech_timeout = 2;
150 | float silence_patience = 3;
151 | float no_input_timeout = 4;
152 | }
153 |
154 | // Either `content` or `uri` must be supplied.
155 | message RecognitionAudio {
156 | oneof audio_source {
157 | bytes content = 1;
158 | string uri = 2;
159 | }
160 | }
161 |
162 | //--------------------------------------------
163 | // responses
164 | //--------------------------------------------
165 | message RecognizeResponse {
166 | repeated SpeechRecognitionResult results = 1;
167 | }
168 |
169 | message LongRunningRecognizeResponse {
170 | repeated SpeechRecognitionResult results = 1;
171 | }
172 |
173 | message StreamingRecognizeResponse {
174 | // If set, returns a [google.rpc.Status][google.rpc.Status] message that
175 | // specifies the error for the operation.
176 | google.rpc.Status error = 1;
177 |
178 | // This repeated list contains zero or more results that
179 | // correspond to consecutive portions of the audio currently being processed.
180 | // It contains zero or one `is_final=true` result (the newly settled portion),
181 | // followed by zero or more `is_final=false` results (the interim results).
182 | repeated StreamingRecognitionResult results = 2;
183 | }
184 |
185 | message SpeechRecognitionResult {
186 | repeated SpeechRecognitionAlternative alternatives = 1;
187 | int32 channel_tag = 2;
188 | }
189 |
190 | message StreamingRecognitionResult {
191 | // May contain one or more recognition hypotheses (up to the
192 | // maximum specified in `max_alternatives`).
193 | // These alternatives are ordered in terms of accuracy, with the top (first)
194 | // alternative being the most probable, as ranked by the recognizer.
195 | repeated SpeechRecognitionAlternative alternatives = 1;
196 |
197 | // If `false`, this `StreamingRecognitionResult` represents an
198 | // interim result that may change. If `true`, this is the final time the
199 |   // speech service will return this particular `StreamingRecognitionResult`;
200 | // the recognizer will not return any further hypotheses for this portion of
201 | // the transcript and corresponding audio.
202 | bool is_final = 2;
203 |
204 | // An estimate of the likelihood that the recognizer will not
205 | // change its guess about this interim result. Values range from 0.0
206 | // (completely unstable) to 1.0 (completely stable).
207 | // This field is only provided for interim results (`is_final=false`).
208 | // The default of 0.0 is a sentinel value indicating `stability` was not set.
209 | float stability = 3;
210 |
211 | // Time offset of the end of this result relative to the
212 | // beginning of the audio.
213 | float result_end_time = 4;
214 |
215 | // For multi-channel audio, this is the channel number corresponding to the
216 | // recognized result for the audio from that channel.
217 | // For audio_channel_count = N, its output values can range from '1' to 'N'.
218 | int32 channel_tag = 5;
219 | }
220 |
221 | message SpeechRecognitionAlternative {
222 | string transcript = 1;
223 | float confidence = 2;
224 | repeated WordInfo words = 3;
225 | }
226 |
227 | message WordInfo {
228 | float start_time = 1;
229 | float end_time = 2;
230 | string word = 3;
231 | float confidence = 4;
232 | }
233 |
234 | message SpeechOperation {
235 | string name = 1;
236 | bool done = 2;
237 | oneof result {
238 | // If set, returns a [google.rpc.Status][google.rpc.Status] message that
239 | // specifies the error for the operation.
240 | google.rpc.Status error = 3;
241 |
242 | LongRunningRecognizeResponse response = 4;
243 | }
244 | }
245 |
--------------------------------------------------------------------------------
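The proto above defines the request shapes for both one-shot and streaming recognition. Below is a hedged sketch of building those messages with the generated speech_to_text_pb2 module and the standard protoc-generated gRPC stub (SpeechToTextStub from speech_to_text_pb2_grpc); the channel address is a placeholder, and any authentication metadata a real deployment requires is omitted here.

import grpc

from vernacular.ai.speech.proto import speech_to_text_pb2 as stt
from vernacular.ai.speech.proto import speech_to_text_pb2_grpc as stt_grpc

config = stt.RecognitionConfig(
    encoding=stt.RecognitionConfig.LINEAR16,
    sample_rate_hertz=8000,
    language_code="hi-IN",  # illustrative value
)

# One-shot recognition: a single request carrying the config plus raw audio bytes.
with open("resources/test-single-channel-8000Hz.raw", "rb") as f:
    request = stt.RecognizeRequest(
        config=config,
        audio=stt.RecognitionAudio(content=f.read()),
    )

channel = grpc.insecure_channel("localhost:50051")  # placeholder endpoint
stub = stt_grpc.SpeechToTextStub(channel)           # standard protoc-generated stub name
response = stub.Recognize(request)
for result in response.results:
    print(result.alternatives[0].transcript)

# Streaming recognition: per the proto comments, the first message must carry
# streaming_config; every later message carries an audio_content chunk.
def streaming_requests(chunks):
    yield stt.StreamingRecognizeRequest(
        streaming_config=stt.StreamingRecognitionConfig(config=config, interim_results=True)
    )
    for chunk in chunks:
        yield stt.StreamingRecognizeRequest(audio_content=chunk)

# responses = stub.StreamingRecognize(streaming_requests(audio_chunk_iterable))
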
/python/vernacular/ai/speech/proto/speech_to_text_pb2.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # Generated by the protocol buffer compiler. DO NOT EDIT!
3 | # source: speech-to-text.proto
4 | """Generated protocol buffer code."""
5 | from google.protobuf import descriptor as _descriptor
6 | from google.protobuf import message as _message
7 | from google.protobuf import reflection as _reflection
8 | from google.protobuf import symbol_database as _symbol_database
9 | # @@protoc_insertion_point(imports)
10 |
11 | _sym_db = _symbol_database.Default()
12 |
13 |
14 | from google.api import annotations_pb2 as google_dot_api_dot_annotations__pb2
15 | from google.api import client_pb2 as google_dot_api_dot_client__pb2
16 | from google.api import field_behavior_pb2 as google_dot_api_dot_field__behavior__pb2
17 | from google.rpc import status_pb2 as google_dot_rpc_dot_status__pb2
18 |
19 |
20 | DESCRIPTOR = _descriptor.FileDescriptor(
21 | name='speech-to-text.proto',
22 | package='speech_to_text',
23 | syntax='proto3',
24 | serialized_options=b'\n\024ai.vernacular.speechB\021SpeechToTextProtoP\001',
25 | create_key=_descriptor._internal_create_key,
26 | serialized_pb=b'\n\x14speech-to-text.proto\x12\x0espeech_to_text\x1a\x1cgoogle/api/annotations.proto\x1a\x17google/api/client.proto\x1a\x1fgoogle/api/field_behavior.proto\x1a\x17google/rpc/status.proto\"\x91\x01\n\x10RecognizeRequest\x12\x36\n\x06\x63onfig\x18\x01 \x01(\x0b\x32!.speech_to_text.RecognitionConfigB\x03\xe0\x41\x02\x12\x34\n\x05\x61udio\x18\x02 \x01(\x0b\x32 .speech_to_text.RecognitionAudioB\x03\xe0\x41\x02\x12\x0f\n\x07segment\x18\x10 \x01(\t\"\xb0\x01\n\x1bLongRunningRecognizeRequest\x12\x36\n\x06\x63onfig\x18\x01 \x01(\x0b\x32!.speech_to_text.RecognitionConfigB\x03\xe0\x41\x02\x12\x34\n\x05\x61udio\x18\x02 \x01(\x0b\x32 .speech_to_text.RecognitionAudioB\x03\xe0\x41\x02\x12\x12\n\nresult_url\x18\x0b \x01(\t\x12\x0f\n\x07segment\x18\x10 \x01(\t\"+\n\x16SpeechOperationRequest\x12\x11\n\x04name\x18\x01 \x01(\tB\x03\xe0\x41\x02\"\x91\x01\n\x19StreamingRecognizeRequest\x12\x46\n\x10streaming_config\x18\x01 \x01(\x0b\x32*.speech_to_text.StreamingRecognitionConfigH\x00\x12\x17\n\raudio_content\x18\x02 \x01(\x0cH\x00\x42\x13\n\x11streaming_request\"\xb7\x01\n\x1aStreamingRecognitionConfig\x12\x36\n\x06\x63onfig\x18\x01 \x01(\x0b\x32!.speech_to_text.RecognitionConfigB\x03\xe0\x41\x02\x12\x17\n\x0finterim_results\x18\x02 \x01(\x08\x12H\n\x18silence_detection_config\x18\x03 \x01(\x0b\x32&.speech_to_text.SilenceDetectionConfig\"\x87\x04\n\x11RecognitionConfig\x12\x41\n\x08\x65ncoding\x18\x01 \x01(\x0e\x32/.speech_to_text.RecognitionConfig.AudioEncoding\x12\x19\n\x11sample_rate_hertz\x18\x02 \x01(\x05\x12\x1a\n\rlanguage_code\x18\x03 \x01(\tB\x03\xe0\x41\x02\x12\x18\n\x10max_alternatives\x18\x04 \x01(\x05\x12\x36\n\x0fspeech_contexts\x18\x05 \x03(\x0b\x32\x1d.speech_to_text.SpeechContext\x12\x1b\n\x13\x61udio_channel_count\x18\x06 \x01(\x05\x12/\n\'enable_separate_recognition_per_channel\x18\x07 \x01(\x08\x12 \n\x18\x65nable_word_time_offsets\x18\x08 \x01(\x08\x12$\n\x1c\x65nable_automatic_punctuation\x18\x0b \x01(\x08\x12\x44\n\x12\x64iarization_config\x18\x10 \x01(\x0b\x32(.speech_to_text.SpeakerDiarizationConfig\"J\n\rAudioEncoding\x12\x18\n\x14\x45NCODING_UNSPECIFIED\x10\x00\x12\x0c\n\x08LINEAR16\x10\x01\x12\x08\n\x04\x46LAC\x10\x02\x12\x07\n\x03MP3\x10\x03\" \n\rSpeechContext\x12\x0f\n\x07phrases\x18\x01 \x03(\t\"t\n\x18SpeakerDiarizationConfig\x12\"\n\x1a\x65nable_speaker_diarization\x18\x01 \x01(\x08\x12\x19\n\x11min_speaker_count\x18\x02 \x01(\x05\x12\x19\n\x11max_speaker_count\x18\x03 \x01(\x05\"\x8a\x01\n\x16SilenceDetectionConfig\x12 \n\x18\x65nable_silence_detection\x18\x01 \x01(\x08\x12\x1a\n\x12max_speech_timeout\x18\x02 \x01(\x02\x12\x18\n\x10silence_patience\x18\x03 \x01(\x02\x12\x18\n\x10no_input_timeout\x18\x04 \x01(\x02\"D\n\x10RecognitionAudio\x12\x11\n\x07\x63ontent\x18\x01 \x01(\x0cH\x00\x12\r\n\x03uri\x18\x02 \x01(\tH\x00\x42\x0e\n\x0c\x61udio_source\"M\n\x11RecognizeResponse\x12\x38\n\x07results\x18\x01 \x03(\x0b\x32\'.speech_to_text.SpeechRecognitionResult\"X\n\x1cLongRunningRecognizeResponse\x12\x38\n\x07results\x18\x01 \x03(\x0b\x32\'.speech_to_text.SpeechRecognitionResult\"|\n\x1aStreamingRecognizeResponse\x12!\n\x05\x65rror\x18\x01 \x01(\x0b\x32\x12.google.rpc.Status\x12;\n\x07results\x18\x02 \x03(\x0b\x32*.speech_to_text.StreamingRecognitionResult\"r\n\x17SpeechRecognitionResult\x12\x42\n\x0c\x61lternatives\x18\x01 \x03(\x0b\x32,.speech_to_text.SpeechRecognitionAlternative\x12\x13\n\x0b\x63hannel_tag\x18\x02 \x01(\x05\"\xb3\x01\n\x1aStreamingRecognitionResult\x12\x42\n\x0c\x61lternatives\x18\x01 
\x03(\x0b\x32,.speech_to_text.SpeechRecognitionAlternative\x12\x10\n\x08is_final\x18\x02 \x01(\x08\x12\x11\n\tstability\x18\x03 \x01(\x02\x12\x17\n\x0fresult_end_time\x18\x04 \x01(\x02\x12\x13\n\x0b\x63hannel_tag\x18\x05 \x01(\x05\"o\n\x1cSpeechRecognitionAlternative\x12\x12\n\ntranscript\x18\x01 \x01(\t\x12\x12\n\nconfidence\x18\x02 \x01(\x02\x12\'\n\x05words\x18\x03 \x03(\x0b\x32\x18.speech_to_text.WordInfo\"R\n\x08WordInfo\x12\x12\n\nstart_time\x18\x01 \x01(\x02\x12\x10\n\x08\x65nd_time\x18\x02 \x01(\x02\x12\x0c\n\x04word\x18\x03 \x01(\t\x12\x12\n\nconfidence\x18\x04 \x01(\x02\"\x9e\x01\n\x0fSpeechOperation\x12\x0c\n\x04name\x18\x01 \x01(\t\x12\x0c\n\x04\x64one\x18\x02 \x01(\x08\x12#\n\x05\x65rror\x18\x03 \x01(\x0b\x32\x12.google.rpc.StatusH\x00\x12@\n\x08response\x18\x04 \x01(\x0b\x32,.speech_to_text.LongRunningRecognizeResponseH\x00\x42\x08\n\x06result2\xac\x04\n\x0cSpeechToText\x12\x80\x01\n\tRecognize\x12 .speech_to_text.RecognizeRequest\x1a!.speech_to_text.RecognizeResponse\".\x82\xd3\xe4\x93\x02\x19\"\x14/v2/speech:recognize:\x01*\xda\x41\x0c\x63onfig,audio\x12q\n\x12StreamingRecognize\x12).speech_to_text.StreamingRecognizeRequest\x1a*.speech_to_text.StreamingRecognizeResponse\"\x00(\x01\x30\x01\x12\x9f\x01\n\x14LongRunningRecognize\x12+.speech_to_text.LongRunningRecognizeRequest\x1a\x1f.speech_to_text.SpeechOperation\"9\x82\xd3\xe4\x93\x02$\"\x1f/v2/speech:longrunningrecognize:\x01*\xda\x41\x0c\x63onfig,audio\x12\x83\x01\n\x12GetSpeechOperation\x12&.speech_to_text.SpeechOperationRequest\x1a\x1f.speech_to_text.SpeechOperation\"$\x82\xd3\xe4\x93\x02\x1e\x12\x1c/v2/speech_operations/{name}B+\n\x14\x61i.vernacular.speechB\x11SpeechToTextProtoP\x01\x62\x06proto3'
27 | ,
28 | dependencies=[google_dot_api_dot_annotations__pb2.DESCRIPTOR,google_dot_api_dot_client__pb2.DESCRIPTOR,google_dot_api_dot_field__behavior__pb2.DESCRIPTOR,google_dot_rpc_dot_status__pb2.DESCRIPTOR,])
29 |
30 |
31 |
32 | _RECOGNITIONCONFIG_AUDIOENCODING = _descriptor.EnumDescriptor(
33 | name='AudioEncoding',
34 | full_name='speech_to_text.RecognitionConfig.AudioEncoding',
35 | filename=None,
36 | file=DESCRIPTOR,
37 | create_key=_descriptor._internal_create_key,
38 | values=[
39 | _descriptor.EnumValueDescriptor(
40 | name='ENCODING_UNSPECIFIED', index=0, number=0,
41 | serialized_options=None,
42 | type=None,
43 | create_key=_descriptor._internal_create_key),
44 | _descriptor.EnumValueDescriptor(
45 | name='LINEAR16', index=1, number=1,
46 | serialized_options=None,
47 | type=None,
48 | create_key=_descriptor._internal_create_key),
49 | _descriptor.EnumValueDescriptor(
50 | name='FLAC', index=2, number=2,
51 | serialized_options=None,
52 | type=None,
53 | create_key=_descriptor._internal_create_key),
54 | _descriptor.EnumValueDescriptor(
55 | name='MP3', index=3, number=3,
56 | serialized_options=None,
57 | type=None,
58 | create_key=_descriptor._internal_create_key),
59 | ],
60 | containing_type=None,
61 | serialized_options=None,
62 | serialized_start=1305,
63 | serialized_end=1379,
64 | )
65 | _sym_db.RegisterEnumDescriptor(_RECOGNITIONCONFIG_AUDIOENCODING)
66 |
67 |
68 | _RECOGNIZEREQUEST = _descriptor.Descriptor(
69 | name='RecognizeRequest',
70 | full_name='speech_to_text.RecognizeRequest',
71 | filename=None,
72 | file=DESCRIPTOR,
73 | containing_type=None,
74 | create_key=_descriptor._internal_create_key,
75 | fields=[
76 | _descriptor.FieldDescriptor(
77 | name='config', full_name='speech_to_text.RecognizeRequest.config', index=0,
78 | number=1, type=11, cpp_type=10, label=1,
79 | has_default_value=False, default_value=None,
80 | message_type=None, enum_type=None, containing_type=None,
81 | is_extension=False, extension_scope=None,
82 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
83 | _descriptor.FieldDescriptor(
84 | name='audio', full_name='speech_to_text.RecognizeRequest.audio', index=1,
85 | number=2, type=11, cpp_type=10, label=1,
86 | has_default_value=False, default_value=None,
87 | message_type=None, enum_type=None, containing_type=None,
88 | is_extension=False, extension_scope=None,
89 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
90 | _descriptor.FieldDescriptor(
91 | name='segment', full_name='speech_to_text.RecognizeRequest.segment', index=2,
92 | number=16, type=9, cpp_type=9, label=1,
93 | has_default_value=False, default_value=b"".decode('utf-8'),
94 | message_type=None, enum_type=None, containing_type=None,
95 | is_extension=False, extension_scope=None,
96 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
97 | ],
98 | extensions=[
99 | ],
100 | nested_types=[],
101 | enum_types=[
102 | ],
103 | serialized_options=None,
104 | is_extendable=False,
105 | syntax='proto3',
106 | extension_ranges=[],
107 | oneofs=[
108 | ],
109 | serialized_start=154,
110 | serialized_end=299,
111 | )
112 |
113 |
114 | _LONGRUNNINGRECOGNIZEREQUEST = _descriptor.Descriptor(
115 | name='LongRunningRecognizeRequest',
116 | full_name='speech_to_text.LongRunningRecognizeRequest',
117 | filename=None,
118 | file=DESCRIPTOR,
119 | containing_type=None,
120 | create_key=_descriptor._internal_create_key,
121 | fields=[
122 | _descriptor.FieldDescriptor(
123 | name='config', full_name='speech_to_text.LongRunningRecognizeRequest.config', index=0,
124 | number=1, type=11, cpp_type=10, label=1,
125 | has_default_value=False, default_value=None,
126 | message_type=None, enum_type=None, containing_type=None,
127 | is_extension=False, extension_scope=None,
128 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
129 | _descriptor.FieldDescriptor(
130 | name='audio', full_name='speech_to_text.LongRunningRecognizeRequest.audio', index=1,
131 | number=2, type=11, cpp_type=10, label=1,
132 | has_default_value=False, default_value=None,
133 | message_type=None, enum_type=None, containing_type=None,
134 | is_extension=False, extension_scope=None,
135 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
136 | _descriptor.FieldDescriptor(
137 | name='result_url', full_name='speech_to_text.LongRunningRecognizeRequest.result_url', index=2,
138 | number=11, type=9, cpp_type=9, label=1,
139 | has_default_value=False, default_value=b"".decode('utf-8'),
140 | message_type=None, enum_type=None, containing_type=None,
141 | is_extension=False, extension_scope=None,
142 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
143 | _descriptor.FieldDescriptor(
144 | name='segment', full_name='speech_to_text.LongRunningRecognizeRequest.segment', index=3,
145 | number=16, type=9, cpp_type=9, label=1,
146 | has_default_value=False, default_value=b"".decode('utf-8'),
147 | message_type=None, enum_type=None, containing_type=None,
148 | is_extension=False, extension_scope=None,
149 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
150 | ],
151 | extensions=[
152 | ],
153 | nested_types=[],
154 | enum_types=[
155 | ],
156 | serialized_options=None,
157 | is_extendable=False,
158 | syntax='proto3',
159 | extension_ranges=[],
160 | oneofs=[
161 | ],
162 | serialized_start=302,
163 | serialized_end=478,
164 | )
165 |
166 |
167 | _SPEECHOPERATIONREQUEST = _descriptor.Descriptor(
168 | name='SpeechOperationRequest',
169 | full_name='speech_to_text.SpeechOperationRequest',
170 | filename=None,
171 | file=DESCRIPTOR,
172 | containing_type=None,
173 | create_key=_descriptor._internal_create_key,
174 | fields=[
175 | _descriptor.FieldDescriptor(
176 | name='name', full_name='speech_to_text.SpeechOperationRequest.name', index=0,
177 | number=1, type=9, cpp_type=9, label=1,
178 | has_default_value=False, default_value=b"".decode('utf-8'),
179 | message_type=None, enum_type=None, containing_type=None,
180 | is_extension=False, extension_scope=None,
181 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
182 | ],
183 | extensions=[
184 | ],
185 | nested_types=[],
186 | enum_types=[
187 | ],
188 | serialized_options=None,
189 | is_extendable=False,
190 | syntax='proto3',
191 | extension_ranges=[],
192 | oneofs=[
193 | ],
194 | serialized_start=480,
195 | serialized_end=523,
196 | )
197 |
198 |
199 | _STREAMINGRECOGNIZEREQUEST = _descriptor.Descriptor(
200 | name='StreamingRecognizeRequest',
201 | full_name='speech_to_text.StreamingRecognizeRequest',
202 | filename=None,
203 | file=DESCRIPTOR,
204 | containing_type=None,
205 | create_key=_descriptor._internal_create_key,
206 | fields=[
207 | _descriptor.FieldDescriptor(
208 | name='streaming_config', full_name='speech_to_text.StreamingRecognizeRequest.streaming_config', index=0,
209 | number=1, type=11, cpp_type=10, label=1,
210 | has_default_value=False, default_value=None,
211 | message_type=None, enum_type=None, containing_type=None,
212 | is_extension=False, extension_scope=None,
213 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
214 | _descriptor.FieldDescriptor(
215 | name='audio_content', full_name='speech_to_text.StreamingRecognizeRequest.audio_content', index=1,
216 | number=2, type=12, cpp_type=9, label=1,
217 | has_default_value=False, default_value=b"",
218 | message_type=None, enum_type=None, containing_type=None,
219 | is_extension=False, extension_scope=None,
220 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
221 | ],
222 | extensions=[
223 | ],
224 | nested_types=[],
225 | enum_types=[
226 | ],
227 | serialized_options=None,
228 | is_extendable=False,
229 | syntax='proto3',
230 | extension_ranges=[],
231 | oneofs=[
232 | _descriptor.OneofDescriptor(
233 | name='streaming_request', full_name='speech_to_text.StreamingRecognizeRequest.streaming_request',
234 | index=0, containing_type=None,
235 | create_key=_descriptor._internal_create_key,
236 | fields=[]),
237 | ],
238 | serialized_start=526,
239 | serialized_end=671,
240 | )
241 |
242 |
243 | _STREAMINGRECOGNITIONCONFIG = _descriptor.Descriptor(
244 | name='StreamingRecognitionConfig',
245 | full_name='speech_to_text.StreamingRecognitionConfig',
246 | filename=None,
247 | file=DESCRIPTOR,
248 | containing_type=None,
249 | create_key=_descriptor._internal_create_key,
250 | fields=[
251 | _descriptor.FieldDescriptor(
252 | name='config', full_name='speech_to_text.StreamingRecognitionConfig.config', index=0,
253 | number=1, type=11, cpp_type=10, label=1,
254 | has_default_value=False, default_value=None,
255 | message_type=None, enum_type=None, containing_type=None,
256 | is_extension=False, extension_scope=None,
257 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
258 | _descriptor.FieldDescriptor(
259 | name='interim_results', full_name='speech_to_text.StreamingRecognitionConfig.interim_results', index=1,
260 | number=2, type=8, cpp_type=7, label=1,
261 | has_default_value=False, default_value=False,
262 | message_type=None, enum_type=None, containing_type=None,
263 | is_extension=False, extension_scope=None,
264 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
265 | _descriptor.FieldDescriptor(
266 | name='silence_detection_config', full_name='speech_to_text.StreamingRecognitionConfig.silence_detection_config', index=2,
267 | number=3, type=11, cpp_type=10, label=1,
268 | has_default_value=False, default_value=None,
269 | message_type=None, enum_type=None, containing_type=None,
270 | is_extension=False, extension_scope=None,
271 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
272 | ],
273 | extensions=[
274 | ],
275 | nested_types=[],
276 | enum_types=[
277 | ],
278 | serialized_options=None,
279 | is_extendable=False,
280 | syntax='proto3',
281 | extension_ranges=[],
282 | oneofs=[
283 | ],
284 | serialized_start=674,
285 | serialized_end=857,
286 | )
287 |
288 |
289 | _RECOGNITIONCONFIG = _descriptor.Descriptor(
290 | name='RecognitionConfig',
291 | full_name='speech_to_text.RecognitionConfig',
292 | filename=None,
293 | file=DESCRIPTOR,
294 | containing_type=None,
295 | create_key=_descriptor._internal_create_key,
296 | fields=[
297 | _descriptor.FieldDescriptor(
298 | name='encoding', full_name='speech_to_text.RecognitionConfig.encoding', index=0,
299 | number=1, type=14, cpp_type=8, label=1,
300 | has_default_value=False, default_value=0,
301 | message_type=None, enum_type=None, containing_type=None,
302 | is_extension=False, extension_scope=None,
303 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
304 | _descriptor.FieldDescriptor(
305 | name='sample_rate_hertz', full_name='speech_to_text.RecognitionConfig.sample_rate_hertz', index=1,
306 | number=2, type=5, cpp_type=1, label=1,
307 | has_default_value=False, default_value=0,
308 | message_type=None, enum_type=None, containing_type=None,
309 | is_extension=False, extension_scope=None,
310 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
311 | _descriptor.FieldDescriptor(
312 | name='language_code', full_name='speech_to_text.RecognitionConfig.language_code', index=2,
313 | number=3, type=9, cpp_type=9, label=1,
314 | has_default_value=False, default_value=b"".decode('utf-8'),
315 | message_type=None, enum_type=None, containing_type=None,
316 | is_extension=False, extension_scope=None,
317 | serialized_options=b'\340A\002', file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
318 | _descriptor.FieldDescriptor(
319 | name='max_alternatives', full_name='speech_to_text.RecognitionConfig.max_alternatives', index=3,
320 | number=4, type=5, cpp_type=1, label=1,
321 | has_default_value=False, default_value=0,
322 | message_type=None, enum_type=None, containing_type=None,
323 | is_extension=False, extension_scope=None,
324 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
325 | _descriptor.FieldDescriptor(
326 | name='speech_contexts', full_name='speech_to_text.RecognitionConfig.speech_contexts', index=4,
327 | number=5, type=11, cpp_type=10, label=3,
328 | has_default_value=False, default_value=[],
329 | message_type=None, enum_type=None, containing_type=None,
330 | is_extension=False, extension_scope=None,
331 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
332 | _descriptor.FieldDescriptor(
333 | name='audio_channel_count', full_name='speech_to_text.RecognitionConfig.audio_channel_count', index=5,
334 | number=6, type=5, cpp_type=1, label=1,
335 | has_default_value=False, default_value=0,
336 | message_type=None, enum_type=None, containing_type=None,
337 | is_extension=False, extension_scope=None,
338 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
339 | _descriptor.FieldDescriptor(
340 | name='enable_separate_recognition_per_channel', full_name='speech_to_text.RecognitionConfig.enable_separate_recognition_per_channel', index=6,
341 | number=7, type=8, cpp_type=7, label=1,
342 | has_default_value=False, default_value=False,
343 | message_type=None, enum_type=None, containing_type=None,
344 | is_extension=False, extension_scope=None,
345 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
346 | _descriptor.FieldDescriptor(
347 | name='enable_word_time_offsets', full_name='speech_to_text.RecognitionConfig.enable_word_time_offsets', index=7,
348 | number=8, type=8, cpp_type=7, label=1,
349 | has_default_value=False, default_value=False,
350 | message_type=None, enum_type=None, containing_type=None,
351 | is_extension=False, extension_scope=None,
352 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
353 | _descriptor.FieldDescriptor(
354 | name='enable_automatic_punctuation', full_name='speech_to_text.RecognitionConfig.enable_automatic_punctuation', index=8,
355 | number=11, type=8, cpp_type=7, label=1,
356 | has_default_value=False, default_value=False,
357 | message_type=None, enum_type=None, containing_type=None,
358 | is_extension=False, extension_scope=None,
359 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
360 | _descriptor.FieldDescriptor(
361 | name='diarization_config', full_name='speech_to_text.RecognitionConfig.diarization_config', index=9,
362 | number=16, type=11, cpp_type=10, label=1,
363 | has_default_value=False, default_value=None,
364 | message_type=None, enum_type=None, containing_type=None,
365 | is_extension=False, extension_scope=None,
366 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
367 | ],
368 | extensions=[
369 | ],
370 | nested_types=[],
371 | enum_types=[
372 | _RECOGNITIONCONFIG_AUDIOENCODING,
373 | ],
374 | serialized_options=None,
375 | is_extendable=False,
376 | syntax='proto3',
377 | extension_ranges=[],
378 | oneofs=[
379 | ],
380 | serialized_start=860,
381 | serialized_end=1379,
382 | )
383 |
384 |
385 | _SPEECHCONTEXT = _descriptor.Descriptor(
386 | name='SpeechContext',
387 | full_name='speech_to_text.SpeechContext',
388 | filename=None,
389 | file=DESCRIPTOR,
390 | containing_type=None,
391 | create_key=_descriptor._internal_create_key,
392 | fields=[
393 | _descriptor.FieldDescriptor(
394 | name='phrases', full_name='speech_to_text.SpeechContext.phrases', index=0,
395 | number=1, type=9, cpp_type=9, label=3,
396 | has_default_value=False, default_value=[],
397 | message_type=None, enum_type=None, containing_type=None,
398 | is_extension=False, extension_scope=None,
399 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
400 | ],
401 | extensions=[
402 | ],
403 | nested_types=[],
404 | enum_types=[
405 | ],
406 | serialized_options=None,
407 | is_extendable=False,
408 | syntax='proto3',
409 | extension_ranges=[],
410 | oneofs=[
411 | ],
412 | serialized_start=1381,
413 | serialized_end=1413,
414 | )
415 |
416 |
417 | _SPEAKERDIARIZATIONCONFIG = _descriptor.Descriptor(
418 | name='SpeakerDiarizationConfig',
419 | full_name='speech_to_text.SpeakerDiarizationConfig',
420 | filename=None,
421 | file=DESCRIPTOR,
422 | containing_type=None,
423 | create_key=_descriptor._internal_create_key,
424 | fields=[
425 | _descriptor.FieldDescriptor(
426 | name='enable_speaker_diarization', full_name='speech_to_text.SpeakerDiarizationConfig.enable_speaker_diarization', index=0,
427 | number=1, type=8, cpp_type=7, label=1,
428 | has_default_value=False, default_value=False,
429 | message_type=None, enum_type=None, containing_type=None,
430 | is_extension=False, extension_scope=None,
431 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
432 | _descriptor.FieldDescriptor(
433 | name='min_speaker_count', full_name='speech_to_text.SpeakerDiarizationConfig.min_speaker_count', index=1,
434 | number=2, type=5, cpp_type=1, label=1,
435 | has_default_value=False, default_value=0,
436 | message_type=None, enum_type=None, containing_type=None,
437 | is_extension=False, extension_scope=None,
438 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
439 | _descriptor.FieldDescriptor(
440 | name='max_speaker_count', full_name='speech_to_text.SpeakerDiarizationConfig.max_speaker_count', index=2,
441 | number=3, type=5, cpp_type=1, label=1,
442 | has_default_value=False, default_value=0,
443 | message_type=None, enum_type=None, containing_type=None,
444 | is_extension=False, extension_scope=None,
445 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
446 | ],
447 | extensions=[
448 | ],
449 | nested_types=[],
450 | enum_types=[
451 | ],
452 | serialized_options=None,
453 | is_extendable=False,
454 | syntax='proto3',
455 | extension_ranges=[],
456 | oneofs=[
457 | ],
458 | serialized_start=1415,
459 | serialized_end=1531,
460 | )
461 |
462 |
463 | _SILENCEDETECTIONCONFIG = _descriptor.Descriptor(
464 | name='SilenceDetectionConfig',
465 | full_name='speech_to_text.SilenceDetectionConfig',
466 | filename=None,
467 | file=DESCRIPTOR,
468 | containing_type=None,
469 | create_key=_descriptor._internal_create_key,
470 | fields=[
471 | _descriptor.FieldDescriptor(
472 | name='enable_silence_detection', full_name='speech_to_text.SilenceDetectionConfig.enable_silence_detection', index=0,
473 | number=1, type=8, cpp_type=7, label=1,
474 | has_default_value=False, default_value=False,
475 | message_type=None, enum_type=None, containing_type=None,
476 | is_extension=False, extension_scope=None,
477 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
478 | _descriptor.FieldDescriptor(
479 | name='max_speech_timeout', full_name='speech_to_text.SilenceDetectionConfig.max_speech_timeout', index=1,
480 | number=2, type=2, cpp_type=6, label=1,
481 | has_default_value=False, default_value=float(0),
482 | message_type=None, enum_type=None, containing_type=None,
483 | is_extension=False, extension_scope=None,
484 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
485 | _descriptor.FieldDescriptor(
486 | name='silence_patience', full_name='speech_to_text.SilenceDetectionConfig.silence_patience', index=2,
487 | number=3, type=2, cpp_type=6, label=1,
488 | has_default_value=False, default_value=float(0),
489 | message_type=None, enum_type=None, containing_type=None,
490 | is_extension=False, extension_scope=None,
491 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
492 | _descriptor.FieldDescriptor(
493 | name='no_input_timeout', full_name='speech_to_text.SilenceDetectionConfig.no_input_timeout', index=3,
494 | number=4, type=2, cpp_type=6, label=1,
495 | has_default_value=False, default_value=float(0),
496 | message_type=None, enum_type=None, containing_type=None,
497 | is_extension=False, extension_scope=None,
498 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
499 | ],
500 | extensions=[
501 | ],
502 | nested_types=[],
503 | enum_types=[
504 | ],
505 | serialized_options=None,
506 | is_extendable=False,
507 | syntax='proto3',
508 | extension_ranges=[],
509 | oneofs=[
510 | ],
511 | serialized_start=1534,
512 | serialized_end=1672,
513 | )
514 |
515 |
516 | _RECOGNITIONAUDIO = _descriptor.Descriptor(
517 | name='RecognitionAudio',
518 | full_name='speech_to_text.RecognitionAudio',
519 | filename=None,
520 | file=DESCRIPTOR,
521 | containing_type=None,
522 | create_key=_descriptor._internal_create_key,
523 | fields=[
524 | _descriptor.FieldDescriptor(
525 | name='content', full_name='speech_to_text.RecognitionAudio.content', index=0,
526 | number=1, type=12, cpp_type=9, label=1,
527 | has_default_value=False, default_value=b"",
528 | message_type=None, enum_type=None, containing_type=None,
529 | is_extension=False, extension_scope=None,
530 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
531 | _descriptor.FieldDescriptor(
532 | name='uri', full_name='speech_to_text.RecognitionAudio.uri', index=1,
533 | number=2, type=9, cpp_type=9, label=1,
534 | has_default_value=False, default_value=b"".decode('utf-8'),
535 | message_type=None, enum_type=None, containing_type=None,
536 | is_extension=False, extension_scope=None,
537 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
538 | ],
539 | extensions=[
540 | ],
541 | nested_types=[],
542 | enum_types=[
543 | ],
544 | serialized_options=None,
545 | is_extendable=False,
546 | syntax='proto3',
547 | extension_ranges=[],
548 | oneofs=[
549 | _descriptor.OneofDescriptor(
550 | name='audio_source', full_name='speech_to_text.RecognitionAudio.audio_source',
551 | index=0, containing_type=None,
552 | create_key=_descriptor._internal_create_key,
553 | fields=[]),
554 | ],
555 | serialized_start=1674,
556 | serialized_end=1742,
557 | )
558 |
559 |
560 | _RECOGNIZERESPONSE = _descriptor.Descriptor(
561 | name='RecognizeResponse',
562 | full_name='speech_to_text.RecognizeResponse',
563 | filename=None,
564 | file=DESCRIPTOR,
565 | containing_type=None,
566 | create_key=_descriptor._internal_create_key,
567 | fields=[
568 | _descriptor.FieldDescriptor(
569 | name='results', full_name='speech_to_text.RecognizeResponse.results', index=0,
570 | number=1, type=11, cpp_type=10, label=3,
571 | has_default_value=False, default_value=[],
572 | message_type=None, enum_type=None, containing_type=None,
573 | is_extension=False, extension_scope=None,
574 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
575 | ],
576 | extensions=[
577 | ],
578 | nested_types=[],
579 | enum_types=[
580 | ],
581 | serialized_options=None,
582 | is_extendable=False,
583 | syntax='proto3',
584 | extension_ranges=[],
585 | oneofs=[
586 | ],
587 | serialized_start=1744,
588 | serialized_end=1821,
589 | )
590 |
591 |
592 | _LONGRUNNINGRECOGNIZERESPONSE = _descriptor.Descriptor(
593 | name='LongRunningRecognizeResponse',
594 | full_name='speech_to_text.LongRunningRecognizeResponse',
595 | filename=None,
596 | file=DESCRIPTOR,
597 | containing_type=None,
598 | create_key=_descriptor._internal_create_key,
599 | fields=[
600 | _descriptor.FieldDescriptor(
601 | name='results', full_name='speech_to_text.LongRunningRecognizeResponse.results', index=0,
602 | number=1, type=11, cpp_type=10, label=3,
603 | has_default_value=False, default_value=[],
604 | message_type=None, enum_type=None, containing_type=None,
605 | is_extension=False, extension_scope=None,
606 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
607 | ],
608 | extensions=[
609 | ],
610 | nested_types=[],
611 | enum_types=[
612 | ],
613 | serialized_options=None,
614 | is_extendable=False,
615 | syntax='proto3',
616 | extension_ranges=[],
617 | oneofs=[
618 | ],
619 | serialized_start=1823,
620 | serialized_end=1911,
621 | )
622 |
623 |
624 | _STREAMINGRECOGNIZERESPONSE = _descriptor.Descriptor(
625 | name='StreamingRecognizeResponse',
626 | full_name='speech_to_text.StreamingRecognizeResponse',
627 | filename=None,
628 | file=DESCRIPTOR,
629 | containing_type=None,
630 | create_key=_descriptor._internal_create_key,
631 | fields=[
632 | _descriptor.FieldDescriptor(
633 | name='error', full_name='speech_to_text.StreamingRecognizeResponse.error', index=0,
634 | number=1, type=11, cpp_type=10, label=1,
635 | has_default_value=False, default_value=None,
636 | message_type=None, enum_type=None, containing_type=None,
637 | is_extension=False, extension_scope=None,
638 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
639 | _descriptor.FieldDescriptor(
640 | name='results', full_name='speech_to_text.StreamingRecognizeResponse.results', index=1,
641 | number=2, type=11, cpp_type=10, label=3,
642 | has_default_value=False, default_value=[],
643 | message_type=None, enum_type=None, containing_type=None,
644 | is_extension=False, extension_scope=None,
645 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
646 | ],
647 | extensions=[
648 | ],
649 | nested_types=[],
650 | enum_types=[
651 | ],
652 | serialized_options=None,
653 | is_extendable=False,
654 | syntax='proto3',
655 | extension_ranges=[],
656 | oneofs=[
657 | ],
658 | serialized_start=1913,
659 | serialized_end=2037,
660 | )
661 |
662 |
663 | _SPEECHRECOGNITIONRESULT = _descriptor.Descriptor(
664 | name='SpeechRecognitionResult',
665 | full_name='speech_to_text.SpeechRecognitionResult',
666 | filename=None,
667 | file=DESCRIPTOR,
668 | containing_type=None,
669 | create_key=_descriptor._internal_create_key,
670 | fields=[
671 | _descriptor.FieldDescriptor(
672 | name='alternatives', full_name='speech_to_text.SpeechRecognitionResult.alternatives', index=0,
673 | number=1, type=11, cpp_type=10, label=3,
674 | has_default_value=False, default_value=[],
675 | message_type=None, enum_type=None, containing_type=None,
676 | is_extension=False, extension_scope=None,
677 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
678 | _descriptor.FieldDescriptor(
679 | name='channel_tag', full_name='speech_to_text.SpeechRecognitionResult.channel_tag', index=1,
680 | number=2, type=5, cpp_type=1, label=1,
681 | has_default_value=False, default_value=0,
682 | message_type=None, enum_type=None, containing_type=None,
683 | is_extension=False, extension_scope=None,
684 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
685 | ],
686 | extensions=[
687 | ],
688 | nested_types=[],
689 | enum_types=[
690 | ],
691 | serialized_options=None,
692 | is_extendable=False,
693 | syntax='proto3',
694 | extension_ranges=[],
695 | oneofs=[
696 | ],
697 | serialized_start=2039,
698 | serialized_end=2153,
699 | )
700 |
701 |
702 | _STREAMINGRECOGNITIONRESULT = _descriptor.Descriptor(
703 | name='StreamingRecognitionResult',
704 | full_name='speech_to_text.StreamingRecognitionResult',
705 | filename=None,
706 | file=DESCRIPTOR,
707 | containing_type=None,
708 | create_key=_descriptor._internal_create_key,
709 | fields=[
710 | _descriptor.FieldDescriptor(
711 | name='alternatives', full_name='speech_to_text.StreamingRecognitionResult.alternatives', index=0,
712 | number=1, type=11, cpp_type=10, label=3,
713 | has_default_value=False, default_value=[],
714 | message_type=None, enum_type=None, containing_type=None,
715 | is_extension=False, extension_scope=None,
716 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
717 | _descriptor.FieldDescriptor(
718 | name='is_final', full_name='speech_to_text.StreamingRecognitionResult.is_final', index=1,
719 | number=2, type=8, cpp_type=7, label=1,
720 | has_default_value=False, default_value=False,
721 | message_type=None, enum_type=None, containing_type=None,
722 | is_extension=False, extension_scope=None,
723 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
724 | _descriptor.FieldDescriptor(
725 | name='stability', full_name='speech_to_text.StreamingRecognitionResult.stability', index=2,
726 | number=3, type=2, cpp_type=6, label=1,
727 | has_default_value=False, default_value=float(0),
728 | message_type=None, enum_type=None, containing_type=None,
729 | is_extension=False, extension_scope=None,
730 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
731 | _descriptor.FieldDescriptor(
732 | name='result_end_time', full_name='speech_to_text.StreamingRecognitionResult.result_end_time', index=3,
733 | number=4, type=2, cpp_type=6, label=1,
734 | has_default_value=False, default_value=float(0),
735 | message_type=None, enum_type=None, containing_type=None,
736 | is_extension=False, extension_scope=None,
737 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
738 | _descriptor.FieldDescriptor(
739 | name='channel_tag', full_name='speech_to_text.StreamingRecognitionResult.channel_tag', index=4,
740 | number=5, type=5, cpp_type=1, label=1,
741 | has_default_value=False, default_value=0,
742 | message_type=None, enum_type=None, containing_type=None,
743 | is_extension=False, extension_scope=None,
744 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
745 | ],
746 | extensions=[
747 | ],
748 | nested_types=[],
749 | enum_types=[
750 | ],
751 | serialized_options=None,
752 | is_extendable=False,
753 | syntax='proto3',
754 | extension_ranges=[],
755 | oneofs=[
756 | ],
757 | serialized_start=2156,
758 | serialized_end=2335,
759 | )
760 |
761 |
762 | _SPEECHRECOGNITIONALTERNATIVE = _descriptor.Descriptor(
763 | name='SpeechRecognitionAlternative',
764 | full_name='speech_to_text.SpeechRecognitionAlternative',
765 | filename=None,
766 | file=DESCRIPTOR,
767 | containing_type=None,
768 | create_key=_descriptor._internal_create_key,
769 | fields=[
770 | _descriptor.FieldDescriptor(
771 | name='transcript', full_name='speech_to_text.SpeechRecognitionAlternative.transcript', index=0,
772 | number=1, type=9, cpp_type=9, label=1,
773 | has_default_value=False, default_value=b"".decode('utf-8'),
774 | message_type=None, enum_type=None, containing_type=None,
775 | is_extension=False, extension_scope=None,
776 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
777 | _descriptor.FieldDescriptor(
778 | name='confidence', full_name='speech_to_text.SpeechRecognitionAlternative.confidence', index=1,
779 | number=2, type=2, cpp_type=6, label=1,
780 | has_default_value=False, default_value=float(0),
781 | message_type=None, enum_type=None, containing_type=None,
782 | is_extension=False, extension_scope=None,
783 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
784 | _descriptor.FieldDescriptor(
785 | name='words', full_name='speech_to_text.SpeechRecognitionAlternative.words', index=2,
786 | number=3, type=11, cpp_type=10, label=3,
787 | has_default_value=False, default_value=[],
788 | message_type=None, enum_type=None, containing_type=None,
789 | is_extension=False, extension_scope=None,
790 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
791 | ],
792 | extensions=[
793 | ],
794 | nested_types=[],
795 | enum_types=[
796 | ],
797 | serialized_options=None,
798 | is_extendable=False,
799 | syntax='proto3',
800 | extension_ranges=[],
801 | oneofs=[
802 | ],
803 | serialized_start=2337,
804 | serialized_end=2448,
805 | )
806 |
807 |
808 | _WORDINFO = _descriptor.Descriptor(
809 | name='WordInfo',
810 | full_name='speech_to_text.WordInfo',
811 | filename=None,
812 | file=DESCRIPTOR,
813 | containing_type=None,
814 | create_key=_descriptor._internal_create_key,
815 | fields=[
816 | _descriptor.FieldDescriptor(
817 | name='start_time', full_name='speech_to_text.WordInfo.start_time', index=0,
818 | number=1, type=2, cpp_type=6, label=1,
819 | has_default_value=False, default_value=float(0),
820 | message_type=None, enum_type=None, containing_type=None,
821 | is_extension=False, extension_scope=None,
822 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
823 | _descriptor.FieldDescriptor(
824 | name='end_time', full_name='speech_to_text.WordInfo.end_time', index=1,
825 | number=2, type=2, cpp_type=6, label=1,
826 | has_default_value=False, default_value=float(0),
827 | message_type=None, enum_type=None, containing_type=None,
828 | is_extension=False, extension_scope=None,
829 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
830 | _descriptor.FieldDescriptor(
831 | name='word', full_name='speech_to_text.WordInfo.word', index=2,
832 | number=3, type=9, cpp_type=9, label=1,
833 | has_default_value=False, default_value=b"".decode('utf-8'),
834 | message_type=None, enum_type=None, containing_type=None,
835 | is_extension=False, extension_scope=None,
836 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
837 | _descriptor.FieldDescriptor(
838 | name='confidence', full_name='speech_to_text.WordInfo.confidence', index=3,
839 | number=4, type=2, cpp_type=6, label=1,
840 | has_default_value=False, default_value=float(0),
841 | message_type=None, enum_type=None, containing_type=None,
842 | is_extension=False, extension_scope=None,
843 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
844 | ],
845 | extensions=[
846 | ],
847 | nested_types=[],
848 | enum_types=[
849 | ],
850 | serialized_options=None,
851 | is_extendable=False,
852 | syntax='proto3',
853 | extension_ranges=[],
854 | oneofs=[
855 | ],
856 | serialized_start=2450,
857 | serialized_end=2532,
858 | )
859 |
860 |
861 | _SPEECHOPERATION = _descriptor.Descriptor(
862 | name='SpeechOperation',
863 | full_name='speech_to_text.SpeechOperation',
864 | filename=None,
865 | file=DESCRIPTOR,
866 | containing_type=None,
867 | create_key=_descriptor._internal_create_key,
868 | fields=[
869 | _descriptor.FieldDescriptor(
870 | name='name', full_name='speech_to_text.SpeechOperation.name', index=0,
871 | number=1, type=9, cpp_type=9, label=1,
872 | has_default_value=False, default_value=b"".decode('utf-8'),
873 | message_type=None, enum_type=None, containing_type=None,
874 | is_extension=False, extension_scope=None,
875 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
876 | _descriptor.FieldDescriptor(
877 | name='done', full_name='speech_to_text.SpeechOperation.done', index=1,
878 | number=2, type=8, cpp_type=7, label=1,
879 | has_default_value=False, default_value=False,
880 | message_type=None, enum_type=None, containing_type=None,
881 | is_extension=False, extension_scope=None,
882 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
883 | _descriptor.FieldDescriptor(
884 | name='error', full_name='speech_to_text.SpeechOperation.error', index=2,
885 | number=3, type=11, cpp_type=10, label=1,
886 | has_default_value=False, default_value=None,
887 | message_type=None, enum_type=None, containing_type=None,
888 | is_extension=False, extension_scope=None,
889 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
890 | _descriptor.FieldDescriptor(
891 | name='response', full_name='speech_to_text.SpeechOperation.response', index=3,
892 | number=4, type=11, cpp_type=10, label=1,
893 | has_default_value=False, default_value=None,
894 | message_type=None, enum_type=None, containing_type=None,
895 | is_extension=False, extension_scope=None,
896 | serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
897 | ],
898 | extensions=[
899 | ],
900 | nested_types=[],
901 | enum_types=[
902 | ],
903 | serialized_options=None,
904 | is_extendable=False,
905 | syntax='proto3',
906 | extension_ranges=[],
907 | oneofs=[
908 | _descriptor.OneofDescriptor(
909 | name='result', full_name='speech_to_text.SpeechOperation.result',
910 | index=0, containing_type=None,
911 | create_key=_descriptor._internal_create_key,
912 | fields=[]),
913 | ],
914 | serialized_start=2535,
915 | serialized_end=2693,
916 | )
917 |
918 | _RECOGNIZEREQUEST.fields_by_name['config'].message_type = _RECOGNITIONCONFIG
919 | _RECOGNIZEREQUEST.fields_by_name['audio'].message_type = _RECOGNITIONAUDIO
920 | _LONGRUNNINGRECOGNIZEREQUEST.fields_by_name['config'].message_type = _RECOGNITIONCONFIG
921 | _LONGRUNNINGRECOGNIZEREQUEST.fields_by_name['audio'].message_type = _RECOGNITIONAUDIO
922 | _STREAMINGRECOGNIZEREQUEST.fields_by_name['streaming_config'].message_type = _STREAMINGRECOGNITIONCONFIG
923 | _STREAMINGRECOGNIZEREQUEST.oneofs_by_name['streaming_request'].fields.append(
924 | _STREAMINGRECOGNIZEREQUEST.fields_by_name['streaming_config'])
925 | _STREAMINGRECOGNIZEREQUEST.fields_by_name['streaming_config'].containing_oneof = _STREAMINGRECOGNIZEREQUEST.oneofs_by_name['streaming_request']
926 | _STREAMINGRECOGNIZEREQUEST.oneofs_by_name['streaming_request'].fields.append(
927 | _STREAMINGRECOGNIZEREQUEST.fields_by_name['audio_content'])
928 | _STREAMINGRECOGNIZEREQUEST.fields_by_name['audio_content'].containing_oneof = _STREAMINGRECOGNIZEREQUEST.oneofs_by_name['streaming_request']
929 | _STREAMINGRECOGNITIONCONFIG.fields_by_name['config'].message_type = _RECOGNITIONCONFIG
930 | _STREAMINGRECOGNITIONCONFIG.fields_by_name['silence_detection_config'].message_type = _SILENCEDETECTIONCONFIG
931 | _RECOGNITIONCONFIG.fields_by_name['encoding'].enum_type = _RECOGNITIONCONFIG_AUDIOENCODING
932 | _RECOGNITIONCONFIG.fields_by_name['speech_contexts'].message_type = _SPEECHCONTEXT
933 | _RECOGNITIONCONFIG.fields_by_name['diarization_config'].message_type = _SPEAKERDIARIZATIONCONFIG
934 | _RECOGNITIONCONFIG_AUDIOENCODING.containing_type = _RECOGNITIONCONFIG
935 | _RECOGNITIONAUDIO.oneofs_by_name['audio_source'].fields.append(
936 | _RECOGNITIONAUDIO.fields_by_name['content'])
937 | _RECOGNITIONAUDIO.fields_by_name['content'].containing_oneof = _RECOGNITIONAUDIO.oneofs_by_name['audio_source']
938 | _RECOGNITIONAUDIO.oneofs_by_name['audio_source'].fields.append(
939 | _RECOGNITIONAUDIO.fields_by_name['uri'])
940 | _RECOGNITIONAUDIO.fields_by_name['uri'].containing_oneof = _RECOGNITIONAUDIO.oneofs_by_name['audio_source']
941 | _RECOGNIZERESPONSE.fields_by_name['results'].message_type = _SPEECHRECOGNITIONRESULT
942 | _LONGRUNNINGRECOGNIZERESPONSE.fields_by_name['results'].message_type = _SPEECHRECOGNITIONRESULT
943 | _STREAMINGRECOGNIZERESPONSE.fields_by_name['error'].message_type = google_dot_rpc_dot_status__pb2._STATUS
944 | _STREAMINGRECOGNIZERESPONSE.fields_by_name['results'].message_type = _STREAMINGRECOGNITIONRESULT
945 | _SPEECHRECOGNITIONRESULT.fields_by_name['alternatives'].message_type = _SPEECHRECOGNITIONALTERNATIVE
946 | _STREAMINGRECOGNITIONRESULT.fields_by_name['alternatives'].message_type = _SPEECHRECOGNITIONALTERNATIVE
947 | _SPEECHRECOGNITIONALTERNATIVE.fields_by_name['words'].message_type = _WORDINFO
948 | _SPEECHOPERATION.fields_by_name['error'].message_type = google_dot_rpc_dot_status__pb2._STATUS
949 | _SPEECHOPERATION.fields_by_name['response'].message_type = _LONGRUNNINGRECOGNIZERESPONSE
950 | _SPEECHOPERATION.oneofs_by_name['result'].fields.append(
951 | _SPEECHOPERATION.fields_by_name['error'])
952 | _SPEECHOPERATION.fields_by_name['error'].containing_oneof = _SPEECHOPERATION.oneofs_by_name['result']
953 | _SPEECHOPERATION.oneofs_by_name['result'].fields.append(
954 | _SPEECHOPERATION.fields_by_name['response'])
955 | _SPEECHOPERATION.fields_by_name['response'].containing_oneof = _SPEECHOPERATION.oneofs_by_name['result']
956 | DESCRIPTOR.message_types_by_name['RecognizeRequest'] = _RECOGNIZEREQUEST
957 | DESCRIPTOR.message_types_by_name['LongRunningRecognizeRequest'] = _LONGRUNNINGRECOGNIZEREQUEST
958 | DESCRIPTOR.message_types_by_name['SpeechOperationRequest'] = _SPEECHOPERATIONREQUEST
959 | DESCRIPTOR.message_types_by_name['StreamingRecognizeRequest'] = _STREAMINGRECOGNIZEREQUEST
960 | DESCRIPTOR.message_types_by_name['StreamingRecognitionConfig'] = _STREAMINGRECOGNITIONCONFIG
961 | DESCRIPTOR.message_types_by_name['RecognitionConfig'] = _RECOGNITIONCONFIG
962 | DESCRIPTOR.message_types_by_name['SpeechContext'] = _SPEECHCONTEXT
963 | DESCRIPTOR.message_types_by_name['SpeakerDiarizationConfig'] = _SPEAKERDIARIZATIONCONFIG
964 | DESCRIPTOR.message_types_by_name['SilenceDetectionConfig'] = _SILENCEDETECTIONCONFIG
965 | DESCRIPTOR.message_types_by_name['RecognitionAudio'] = _RECOGNITIONAUDIO
966 | DESCRIPTOR.message_types_by_name['RecognizeResponse'] = _RECOGNIZERESPONSE
967 | DESCRIPTOR.message_types_by_name['LongRunningRecognizeResponse'] = _LONGRUNNINGRECOGNIZERESPONSE
968 | DESCRIPTOR.message_types_by_name['StreamingRecognizeResponse'] = _STREAMINGRECOGNIZERESPONSE
969 | DESCRIPTOR.message_types_by_name['SpeechRecognitionResult'] = _SPEECHRECOGNITIONRESULT
970 | DESCRIPTOR.message_types_by_name['StreamingRecognitionResult'] = _STREAMINGRECOGNITIONRESULT
971 | DESCRIPTOR.message_types_by_name['SpeechRecognitionAlternative'] = _SPEECHRECOGNITIONALTERNATIVE
972 | DESCRIPTOR.message_types_by_name['WordInfo'] = _WORDINFO
973 | DESCRIPTOR.message_types_by_name['SpeechOperation'] = _SPEECHOPERATION
974 | _sym_db.RegisterFileDescriptor(DESCRIPTOR)
975 |
976 | RecognizeRequest = _reflection.GeneratedProtocolMessageType('RecognizeRequest', (_message.Message,), {
977 | 'DESCRIPTOR' : _RECOGNIZEREQUEST,
978 | '__module__' : 'speech_to_text_pb2'
979 | # @@protoc_insertion_point(class_scope:speech_to_text.RecognizeRequest)
980 | })
981 | _sym_db.RegisterMessage(RecognizeRequest)
982 |
983 | LongRunningRecognizeRequest = _reflection.GeneratedProtocolMessageType('LongRunningRecognizeRequest', (_message.Message,), {
984 | 'DESCRIPTOR' : _LONGRUNNINGRECOGNIZEREQUEST,
985 | '__module__' : 'speech_to_text_pb2'
986 | # @@protoc_insertion_point(class_scope:speech_to_text.LongRunningRecognizeRequest)
987 | })
988 | _sym_db.RegisterMessage(LongRunningRecognizeRequest)
989 |
990 | SpeechOperationRequest = _reflection.GeneratedProtocolMessageType('SpeechOperationRequest', (_message.Message,), {
991 | 'DESCRIPTOR' : _SPEECHOPERATIONREQUEST,
992 | '__module__' : 'speech_to_text_pb2'
993 | # @@protoc_insertion_point(class_scope:speech_to_text.SpeechOperationRequest)
994 | })
995 | _sym_db.RegisterMessage(SpeechOperationRequest)
996 |
997 | StreamingRecognizeRequest = _reflection.GeneratedProtocolMessageType('StreamingRecognizeRequest', (_message.Message,), {
998 | 'DESCRIPTOR' : _STREAMINGRECOGNIZEREQUEST,
999 | '__module__' : 'speech_to_text_pb2'
1000 | # @@protoc_insertion_point(class_scope:speech_to_text.StreamingRecognizeRequest)
1001 | })
1002 | _sym_db.RegisterMessage(StreamingRecognizeRequest)
1003 |
1004 | StreamingRecognitionConfig = _reflection.GeneratedProtocolMessageType('StreamingRecognitionConfig', (_message.Message,), {
1005 | 'DESCRIPTOR' : _STREAMINGRECOGNITIONCONFIG,
1006 | '__module__' : 'speech_to_text_pb2'
1007 | # @@protoc_insertion_point(class_scope:speech_to_text.StreamingRecognitionConfig)
1008 | })
1009 | _sym_db.RegisterMessage(StreamingRecognitionConfig)
1010 |
1011 | RecognitionConfig = _reflection.GeneratedProtocolMessageType('RecognitionConfig', (_message.Message,), {
1012 | 'DESCRIPTOR' : _RECOGNITIONCONFIG,
1013 | '__module__' : 'speech_to_text_pb2'
1014 | # @@protoc_insertion_point(class_scope:speech_to_text.RecognitionConfig)
1015 | })
1016 | _sym_db.RegisterMessage(RecognitionConfig)
1017 |
1018 | SpeechContext = _reflection.GeneratedProtocolMessageType('SpeechContext', (_message.Message,), {
1019 | 'DESCRIPTOR' : _SPEECHCONTEXT,
1020 | '__module__' : 'speech_to_text_pb2'
1021 | # @@protoc_insertion_point(class_scope:speech_to_text.SpeechContext)
1022 | })
1023 | _sym_db.RegisterMessage(SpeechContext)
1024 |
1025 | SpeakerDiarizationConfig = _reflection.GeneratedProtocolMessageType('SpeakerDiarizationConfig', (_message.Message,), {
1026 | 'DESCRIPTOR' : _SPEAKERDIARIZATIONCONFIG,
1027 | '__module__' : 'speech_to_text_pb2'
1028 | # @@protoc_insertion_point(class_scope:speech_to_text.SpeakerDiarizationConfig)
1029 | })
1030 | _sym_db.RegisterMessage(SpeakerDiarizationConfig)
1031 |
1032 | SilenceDetectionConfig = _reflection.GeneratedProtocolMessageType('SilenceDetectionConfig', (_message.Message,), {
1033 | 'DESCRIPTOR' : _SILENCEDETECTIONCONFIG,
1034 | '__module__' : 'speech_to_text_pb2'
1035 | # @@protoc_insertion_point(class_scope:speech_to_text.SilenceDetectionConfig)
1036 | })
1037 | _sym_db.RegisterMessage(SilenceDetectionConfig)
1038 |
1039 | RecognitionAudio = _reflection.GeneratedProtocolMessageType('RecognitionAudio', (_message.Message,), {
1040 | 'DESCRIPTOR' : _RECOGNITIONAUDIO,
1041 | '__module__' : 'speech_to_text_pb2'
1042 | # @@protoc_insertion_point(class_scope:speech_to_text.RecognitionAudio)
1043 | })
1044 | _sym_db.RegisterMessage(RecognitionAudio)
1045 |
1046 | RecognizeResponse = _reflection.GeneratedProtocolMessageType('RecognizeResponse', (_message.Message,), {
1047 | 'DESCRIPTOR' : _RECOGNIZERESPONSE,
1048 | '__module__' : 'speech_to_text_pb2'
1049 | # @@protoc_insertion_point(class_scope:speech_to_text.RecognizeResponse)
1050 | })
1051 | _sym_db.RegisterMessage(RecognizeResponse)
1052 |
1053 | LongRunningRecognizeResponse = _reflection.GeneratedProtocolMessageType('LongRunningRecognizeResponse', (_message.Message,), {
1054 | 'DESCRIPTOR' : _LONGRUNNINGRECOGNIZERESPONSE,
1055 | '__module__' : 'speech_to_text_pb2'
1056 | # @@protoc_insertion_point(class_scope:speech_to_text.LongRunningRecognizeResponse)
1057 | })
1058 | _sym_db.RegisterMessage(LongRunningRecognizeResponse)
1059 |
1060 | StreamingRecognizeResponse = _reflection.GeneratedProtocolMessageType('StreamingRecognizeResponse', (_message.Message,), {
1061 | 'DESCRIPTOR' : _STREAMINGRECOGNIZERESPONSE,
1062 | '__module__' : 'speech_to_text_pb2'
1063 | # @@protoc_insertion_point(class_scope:speech_to_text.StreamingRecognizeResponse)
1064 | })
1065 | _sym_db.RegisterMessage(StreamingRecognizeResponse)
1066 |
1067 | SpeechRecognitionResult = _reflection.GeneratedProtocolMessageType('SpeechRecognitionResult', (_message.Message,), {
1068 | 'DESCRIPTOR' : _SPEECHRECOGNITIONRESULT,
1069 | '__module__' : 'speech_to_text_pb2'
1070 | # @@protoc_insertion_point(class_scope:speech_to_text.SpeechRecognitionResult)
1071 | })
1072 | _sym_db.RegisterMessage(SpeechRecognitionResult)
1073 |
1074 | StreamingRecognitionResult = _reflection.GeneratedProtocolMessageType('StreamingRecognitionResult', (_message.Message,), {
1075 | 'DESCRIPTOR' : _STREAMINGRECOGNITIONRESULT,
1076 | '__module__' : 'speech_to_text_pb2'
1077 | # @@protoc_insertion_point(class_scope:speech_to_text.StreamingRecognitionResult)
1078 | })
1079 | _sym_db.RegisterMessage(StreamingRecognitionResult)
1080 |
1081 | SpeechRecognitionAlternative = _reflection.GeneratedProtocolMessageType('SpeechRecognitionAlternative', (_message.Message,), {
1082 | 'DESCRIPTOR' : _SPEECHRECOGNITIONALTERNATIVE,
1083 | '__module__' : 'speech_to_text_pb2'
1084 | # @@protoc_insertion_point(class_scope:speech_to_text.SpeechRecognitionAlternative)
1085 | })
1086 | _sym_db.RegisterMessage(SpeechRecognitionAlternative)
1087 |
1088 | WordInfo = _reflection.GeneratedProtocolMessageType('WordInfo', (_message.Message,), {
1089 | 'DESCRIPTOR' : _WORDINFO,
1090 | '__module__' : 'speech_to_text_pb2'
1091 | # @@protoc_insertion_point(class_scope:speech_to_text.WordInfo)
1092 | })
1093 | _sym_db.RegisterMessage(WordInfo)
1094 |
1095 | SpeechOperation = _reflection.GeneratedProtocolMessageType('SpeechOperation', (_message.Message,), {
1096 | 'DESCRIPTOR' : _SPEECHOPERATION,
1097 | '__module__' : 'speech_to_text_pb2'
1098 | # @@protoc_insertion_point(class_scope:speech_to_text.SpeechOperation)
1099 | })
1100 | _sym_db.RegisterMessage(SpeechOperation)
1101 |
1102 |
1103 | DESCRIPTOR._options = None
1104 | _RECOGNIZEREQUEST.fields_by_name['config']._options = None
1105 | _RECOGNIZEREQUEST.fields_by_name['audio']._options = None
1106 | _LONGRUNNINGRECOGNIZEREQUEST.fields_by_name['config']._options = None
1107 | _LONGRUNNINGRECOGNIZEREQUEST.fields_by_name['audio']._options = None
1108 | _SPEECHOPERATIONREQUEST.fields_by_name['name']._options = None
1109 | _STREAMINGRECOGNITIONCONFIG.fields_by_name['config']._options = None
1110 | _RECOGNITIONCONFIG.fields_by_name['language_code']._options = None
1111 |
1112 | _SPEECHTOTEXT = _descriptor.ServiceDescriptor(
1113 | name='SpeechToText',
1114 | full_name='speech_to_text.SpeechToText',
1115 | file=DESCRIPTOR,
1116 | index=0,
1117 | serialized_options=None,
1118 | create_key=_descriptor._internal_create_key,
1119 | serialized_start=2696,
1120 | serialized_end=3252,
1121 | methods=[
1122 | _descriptor.MethodDescriptor(
1123 | name='Recognize',
1124 | full_name='speech_to_text.SpeechToText.Recognize',
1125 | index=0,
1126 | containing_service=None,
1127 | input_type=_RECOGNIZEREQUEST,
1128 | output_type=_RECOGNIZERESPONSE,
1129 | serialized_options=b'\202\323\344\223\002\031\"\024/v2/speech:recognize:\001*\332A\014config,audio',
1130 | create_key=_descriptor._internal_create_key,
1131 | ),
1132 | _descriptor.MethodDescriptor(
1133 | name='StreamingRecognize',
1134 | full_name='speech_to_text.SpeechToText.StreamingRecognize',
1135 | index=1,
1136 | containing_service=None,
1137 | input_type=_STREAMINGRECOGNIZEREQUEST,
1138 | output_type=_STREAMINGRECOGNIZERESPONSE,
1139 | serialized_options=None,
1140 | create_key=_descriptor._internal_create_key,
1141 | ),
1142 | _descriptor.MethodDescriptor(
1143 | name='LongRunningRecognize',
1144 | full_name='speech_to_text.SpeechToText.LongRunningRecognize',
1145 | index=2,
1146 | containing_service=None,
1147 | input_type=_LONGRUNNINGRECOGNIZEREQUEST,
1148 | output_type=_SPEECHOPERATION,
1149 | serialized_options=b'\202\323\344\223\002$\"\037/v2/speech:longrunningrecognize:\001*\332A\014config,audio',
1150 | create_key=_descriptor._internal_create_key,
1151 | ),
1152 | _descriptor.MethodDescriptor(
1153 | name='GetSpeechOperation',
1154 | full_name='speech_to_text.SpeechToText.GetSpeechOperation',
1155 | index=3,
1156 | containing_service=None,
1157 | input_type=_SPEECHOPERATIONREQUEST,
1158 | output_type=_SPEECHOPERATION,
1159 | serialized_options=b'\202\323\344\223\002\036\022\034/v2/speech_operations/{name}',
1160 | create_key=_descriptor._internal_create_key,
1161 | ),
1162 | ])
1163 | _sym_db.RegisterServiceDescriptor(_SPEECHTOTEXT)
1164 |
1165 | DESCRIPTOR.services_by_name['SpeechToText'] = _SPEECHTOTEXT
1166 |
1167 | # @@protoc_insertion_point(module_scope)
1168 |
--------------------------------------------------------------------------------
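The generated descriptor above registers the speech_to_text message types (RecognitionConfig, RecognitionAudio, RecognizeRequest, and so on) and the SpeechToText service with its four RPCs. As a minimal sketch of constructing these generated classes directly — the audio file path, sample rate, and language code below are illustrative assumptions, not values mandated by the proto:

# Sketch: building request messages straight from the generated module.
from vernacular.ai.speech.proto import speech_to_text_pb2 as pb

config = pb.RecognitionConfig(
    encoding=pb.RecognitionConfig.LINEAR16,  # nested AudioEncoding value constant
    sample_rate_hertz=8000,
    language_code="en-IN",
)
with open("resources/hello.wav", "rb") as audio_file:
    audio = pb.RecognitionAudio(content=audio_file.read())

request = pb.RecognizeRequest(config=config, audio=audio)
print(request.config.language_code)  # "en-IN"

These are the same messages the hand-written SpeechClient builds internally from the dict or keyword form shown in its docstrings.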
/python/vernacular/ai/speech/proto/speech_to_text_pb2_grpc.py:
--------------------------------------------------------------------------------
1 | # Generated by the gRPC Python protocol compiler plugin. DO NOT EDIT!
2 | """Client and server classes corresponding to protobuf-defined services."""
3 | import grpc
4 |
5 | from . import speech_to_text_pb2 as speech__to__text__pb2
6 |
7 |
8 | class SpeechToTextStub(object):
9 | """Missing associated documentation comment in .proto file."""
10 |
11 | def __init__(self, channel):
12 | """Constructor.
13 |
14 | Args:
15 | channel: A grpc.Channel.
16 | """
17 | self.Recognize = channel.unary_unary(
18 | '/speech_to_text.SpeechToText/Recognize',
19 | request_serializer=speech__to__text__pb2.RecognizeRequest.SerializeToString,
20 | response_deserializer=speech__to__text__pb2.RecognizeResponse.FromString,
21 | )
22 | self.StreamingRecognize = channel.stream_stream(
23 | '/speech_to_text.SpeechToText/StreamingRecognize',
24 | request_serializer=speech__to__text__pb2.StreamingRecognizeRequest.SerializeToString,
25 | response_deserializer=speech__to__text__pb2.StreamingRecognizeResponse.FromString,
26 | )
27 | self.LongRunningRecognize = channel.unary_unary(
28 | '/speech_to_text.SpeechToText/LongRunningRecognize',
29 | request_serializer=speech__to__text__pb2.LongRunningRecognizeRequest.SerializeToString,
30 | response_deserializer=speech__to__text__pb2.SpeechOperation.FromString,
31 | )
32 | self.GetSpeechOperation = channel.unary_unary(
33 | '/speech_to_text.SpeechToText/GetSpeechOperation',
34 | request_serializer=speech__to__text__pb2.SpeechOperationRequest.SerializeToString,
35 | response_deserializer=speech__to__text__pb2.SpeechOperation.FromString,
36 | )
37 |
38 |
39 | class SpeechToTextServicer(object):
40 | """Missing associated documentation comment in .proto file."""
41 |
42 | def Recognize(self, request, context):
43 | """Performs synchronous non-streaming speech recognition
44 | """
45 | context.set_code(grpc.StatusCode.UNIMPLEMENTED)
46 | context.set_details('Method not implemented!')
47 | raise NotImplementedError('Method not implemented!')
48 |
49 | def StreamingRecognize(self, request_iterator, context):
50 | """Performs bidirectional streaming speech recognition: receive results while
51 | sending audio. This method is only available via the gRPC API (not REST).
52 | """
53 | context.set_code(grpc.StatusCode.UNIMPLEMENTED)
54 | context.set_details('Method not implemented!')
55 | raise NotImplementedError('Method not implemented!')
56 |
57 | def LongRunningRecognize(self, request, context):
58 | """Performs asynchronous non-streaming speech recognition
59 | """
60 | context.set_code(grpc.StatusCode.UNIMPLEMENTED)
61 | context.set_details('Method not implemented!')
62 | raise NotImplementedError('Method not implemented!')
63 |
64 | def GetSpeechOperation(self, request, context):
65 | """Returns SpeechOperation for LongRunningRecognize. Used for polling the result
66 | """
67 | context.set_code(grpc.StatusCode.UNIMPLEMENTED)
68 | context.set_details('Method not implemented!')
69 | raise NotImplementedError('Method not implemented!')
70 |
71 |
72 | def add_SpeechToTextServicer_to_server(servicer, server):
73 | rpc_method_handlers = {
74 | 'Recognize': grpc.unary_unary_rpc_method_handler(
75 | servicer.Recognize,
76 | request_deserializer=speech__to__text__pb2.RecognizeRequest.FromString,
77 | response_serializer=speech__to__text__pb2.RecognizeResponse.SerializeToString,
78 | ),
79 | 'StreamingRecognize': grpc.stream_stream_rpc_method_handler(
80 | servicer.StreamingRecognize,
81 | request_deserializer=speech__to__text__pb2.StreamingRecognizeRequest.FromString,
82 | response_serializer=speech__to__text__pb2.StreamingRecognizeResponse.SerializeToString,
83 | ),
84 | 'LongRunningRecognize': grpc.unary_unary_rpc_method_handler(
85 | servicer.LongRunningRecognize,
86 | request_deserializer=speech__to__text__pb2.LongRunningRecognizeRequest.FromString,
87 | response_serializer=speech__to__text__pb2.SpeechOperation.SerializeToString,
88 | ),
89 | 'GetSpeechOperation': grpc.unary_unary_rpc_method_handler(
90 | servicer.GetSpeechOperation,
91 | request_deserializer=speech__to__text__pb2.SpeechOperationRequest.FromString,
92 | response_serializer=speech__to__text__pb2.SpeechOperation.SerializeToString,
93 | ),
94 | }
95 | generic_handler = grpc.method_handlers_generic_handler(
96 | 'speech_to_text.SpeechToText', rpc_method_handlers)
97 | server.add_generic_rpc_handlers((generic_handler,))
98 |
99 |
100 | # This class is part of an EXPERIMENTAL API.
101 | class SpeechToText(object):
102 | """Missing associated documentation comment in .proto file."""
103 |
104 | @staticmethod
105 | def Recognize(request,
106 | target,
107 | options=(),
108 | channel_credentials=None,
109 | call_credentials=None,
110 | insecure=False,
111 | compression=None,
112 | wait_for_ready=None,
113 | timeout=None,
114 | metadata=None):
115 | return grpc.experimental.unary_unary(request, target, '/speech_to_text.SpeechToText/Recognize',
116 | speech__to__text__pb2.RecognizeRequest.SerializeToString,
117 | speech__to__text__pb2.RecognizeResponse.FromString,
118 | options, channel_credentials,
119 | insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
120 |
121 | @staticmethod
122 | def StreamingRecognize(request_iterator,
123 | target,
124 | options=(),
125 | channel_credentials=None,
126 | call_credentials=None,
127 | insecure=False,
128 | compression=None,
129 | wait_for_ready=None,
130 | timeout=None,
131 | metadata=None):
132 | return grpc.experimental.stream_stream(request_iterator, target, '/speech_to_text.SpeechToText/StreamingRecognize',
133 | speech__to__text__pb2.StreamingRecognizeRequest.SerializeToString,
134 | speech__to__text__pb2.StreamingRecognizeResponse.FromString,
135 | options, channel_credentials,
136 | insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
137 |
138 | @staticmethod
139 | def LongRunningRecognize(request,
140 | target,
141 | options=(),
142 | channel_credentials=None,
143 | call_credentials=None,
144 | insecure=False,
145 | compression=None,
146 | wait_for_ready=None,
147 | timeout=None,
148 | metadata=None):
149 | return grpc.experimental.unary_unary(request, target, '/speech_to_text.SpeechToText/LongRunningRecognize',
150 | speech__to__text__pb2.LongRunningRecognizeRequest.SerializeToString,
151 | speech__to__text__pb2.SpeechOperation.FromString,
152 | options, channel_credentials,
153 | insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
154 |
155 | @staticmethod
156 | def GetSpeechOperation(request,
157 | target,
158 | options=(),
159 | channel_credentials=None,
160 | call_credentials=None,
161 | insecure=False,
162 | compression=None,
163 | wait_for_ready=None,
164 | timeout=None,
165 | metadata=None):
166 | return grpc.experimental.unary_unary(request, target, '/speech_to_text.SpeechToText/GetSpeechOperation',
167 | speech__to__text__pb2.SpeechOperationRequest.SerializeToString,
168 | speech__to__text__pb2.SpeechOperation.FromString,
169 | options, channel_credentials,
170 | insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
171 |
--------------------------------------------------------------------------------
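The generated stub binds each of the four RPCs to a grpc.Channel with the matching serializer/deserializer pair. A minimal sketch of driving the stub directly, without the SpeechClient wrapper — the endpoint mirrors the default used by SpeechClient below, while the token placeholder and audio URI are assumptions:

import grpc

from vernacular.ai.speech.proto import speech_to_text_pb2 as pb
from vernacular.ai.speech.proto import speech_to_text_pb2_grpc as pb_grpc

# Open an insecure channel to the default endpoint and attach the stub.
channel = grpc.insecure_channel("speechapis.vernacular.ai:80")
stub = pb_grpc.SpeechToTextStub(channel)

request = pb.RecognizeRequest(
    config=pb.RecognitionConfig(sample_rate_hertz=8000, language_code="en-IN"),
    audio=pb.RecognitionAudio(uri="https://example.com/audio.wav"),  # placeholder URI
)

# Authentication travels as a per-call "authorization" metadata entry.
response = stub.Recognize(
    request,
    timeout=30,
    metadata=[("authorization", "bearer <ACCESS_TOKEN>")],
)
for result in response.results:
    print(result)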
/python/vernacular/ai/speech/speech_client.py:
--------------------------------------------------------------------------------
1 | import grpc
2 | import time
3 |
4 | from vernacular.ai.speech.proto import speech_to_text_pb2 as sppt_pb
5 | from vernacular.ai.speech.proto import speech_to_text_pb2_grpc as sppt_grpc_pb
6 | from vernacular.ai.exceptions import VernacularAPICallError
7 |
8 |
9 | class SpeechClient(object):
10 | """
11 | Class that implements Vernacular.ai ASR API
12 | """
13 |
14 | STTP_GRPC_HOST = "speechapis.vernacular.ai:80"
15 | AUTHORIZATION = "authorization"
16 | DEFAULT_TIMEOUT = 30
17 |
18 | def __init__(self, access_token):
19 | """Constructor.
20 | Args:
21 | access_token: The authorization token to send with the requests.
22 | """
23 | self.access_token = f"bearer {access_token}"
24 | self.channel = grpc.insecure_channel(self.STTP_GRPC_HOST)
25 |
26 | self.client = sppt_grpc_pb.SpeechToTextStub(self.channel)
27 |
28 | def recognize(self, config, audio, timeout=None):
29 | """
30 | Performs synchronous speech recognition: receive results after all audio
31 | has been sent and processed.
32 |
33 | Example:
34 | >>> from vernacular.ai import speech
35 | >>> from vernacular.ai.speech import enums
36 | >>>
37 | >>> client = speech.SpeechClient(access_token)
38 | >>>
39 | >>> encoding = enums.RecognitionConfig.AudioEncoding.LINEAR16
40 | >>> sample_rate_hertz = 8000
41 | >>> language_code = 'en-IN'
42 | >>> config = {'encoding': encoding, 'sample_rate_hertz': sample_rate_hertz, 'language_code': language_code}
43 | >>> content = open('path/to/audio/file.wav', 'rb').read()
44 | >>> audio = {'content': content}
45 | >>>
46 | >>> response = client.recognize(config, audio)
47 | Args:
48 | config (Union[dict, ~vernacular.ai.speech.types.RecognitionConfig]): Required. Provides information to the
49 | recognizer that specifies how to process the request.
50 | If a dict is provided, it must be of the same form as the protobuf
51 | message :class:`~vernacular.ai.speech.types.RecognitionConfig`
52 | audio (Union[dict, ~vernacular.ai.speech.types.RecognitionAudio]): Required. The audio data to be recognized.
53 | If a dict is provided, it must be of the same form as the protobuf
54 | message :class:`~vernacular.ai.speech.types.RecognitionAudio`
55 | timeout (Optional[float]): The amount of time, in seconds, to wait
56 | for the request to complete. Default value is `30s`.
57 | Returns:
58 | A :class:`~vernacular.ai.speech.types.RecognizeResponse` instance.
59 | Raises:
60 | vernacular.ai.exceptions.VernacularAPICallError: If the request
61 | failed for any reason.
62 | ValueError: If the parameters are invalid.
63 | """
64 | request = sppt_pb.RecognizeRequest(config=config, audio=audio)
65 | if timeout is None:
66 | timeout = self.DEFAULT_TIMEOUT
67 |
68 | response = None
69 | try:
70 | response = self.client.Recognize(
71 | request,
72 | timeout=timeout,
73 | metadata=[(self.AUTHORIZATION, self.access_token)]
74 | )
75 | return response
76 | except Exception as e:
77 |             raise VernacularAPICallError(message=str(e), response=response)
78 |
79 |
80 | def long_running_recognize(self, config, audio, timeout=None, poll_time=8, callback=None):
81 | """
82 | Performs asynchronous speech recognition. Returns either an
83 | ``Operation.error`` or an ``Operation.response`` which contains a
84 | ``LongRunningRecognizeResponse`` message. For more information on
85 | asynchronous speech recognition, see the
86 | `how-to `.
87 |
88 | Example:
89 | >>> from vernacular.ai import speech
90 | >>> from vernacular.ai.speech import enums
91 | >>>
92 | >>> client = speech.SpeechClient(access_token)
93 | >>>
94 | >>> encoding = enums.RecognitionConfig.AudioEncoding.LINEAR16
95 | >>> sample_rate_hertz = 8000
96 | >>> language_code = 'en-IN'
97 | >>> config = {'encoding': encoding, 'sample_rate_hertz': sample_rate_hertz, 'language_code': language_code}
98 | >>> content = open('path/to/audio/file.wav', 'rb').read()
99 | >>> audio = {'content': content}
100 | >>>
101 | >>> def handle_result(result):
102 | ... # Handle result.
103 | ... print(result)
104 | >>>
105 | >>> response = client.long_running_recognize(config, audio, callback=handle_result)
106 | Args:
107 | config (Union[dict, ~vernacular.ai.speech.types.RecognitionConfig]): Required. Provides information to the
108 | recognizer that specifies how to process the request.
109 | If a dict is provided, it must be of the same form as the protobuf
110 | message :class:`~vernacular.ai.speech.types.RecognitionConfig`
111 | audio (Union[dict, ~vernacular.ai.speech.types.RecognitionAudio]): Required. The audio data to be recognized.
112 | If a dict is provided, it must be of the same form as the protobuf
113 | message :class:`~vernacular.ai.speech.types.RecognitionAudio`
114 | timeout (Optional[float]): The amount of time, in seconds, to wait
115 | for the request to complete. Default value is `30s`.
116 |             poll_time (Optional[float]): The interval, in seconds, between polls for the
117 |                 operation result. Default value is `8s`. Minimum value is `5s`.
118 |             callback (Optional[Callable]): Function invoked with the completed response.
119 | Returns:
120 | A :class:`~vernacular.ai.speech.types.SpeechOperation` instance.
121 | Raises:
122 | vernacular.ai.exceptions.VernacularAPICallError: If the request
123 | failed for any reason.
124 | ValueError: If the parameters are invalid.
125 | """
126 | request = sppt_pb.LongRunningRecognizeRequest(config=config, audio=audio)
127 | if timeout is None:
128 | timeout = self.DEFAULT_TIMEOUT
129 |
130 | speech_operation = None
131 | try:
132 | speech_operation = self.client.LongRunningRecognize(
133 | request,
134 | timeout=timeout,
135 | metadata=[(self.AUTHORIZATION, self.access_token)]
136 | )
137 | except Exception as e:
138 |             raise VernacularAPICallError(message=str(e), response=speech_operation)
139 |
140 | # set minimum value for poll time
141 | if poll_time < 5:
142 | poll_time = 5
143 |
144 | operation_request = sppt_pb.SpeechOperationRequest(name=speech_operation.name)
145 | response = None
146 | is_done = False
147 | try:
148 | while not is_done:
149 | time.sleep(poll_time)
150 | response = self.client.GetSpeechOperation(
151 | operation_request,
152 | timeout=timeout,
153 | metadata=[(self.AUTHORIZATION, self.access_token)]
154 | )
155 | is_done = response.done
156 |             if callback is not None: callback(response)
157 |             return response
158 | except Exception as e:
159 |             raise VernacularAPICallError(message=str(e), response=speech_operation)
160 |
161 |
162 | def _streaming_request_iterable(self, config, requests):
163 | """A generator that yields the config followed by the requests.
164 | """
165 |         yield sppt_pb.StreamingRecognizeRequest(streaming_config=config)
166 | for request in requests:
167 | yield request
168 |
169 | def streaming_recognize(self, config, requests, timeout=None):
170 | """
171 | Performs bidirectional streaming speech recognition: receive results while
172 | sending audio. This method is only available via the gRPC API (not REST).
173 |         Example:
174 |             >>> from vernacular.ai import speech
175 |             >>> from vernacular.ai.speech import enums, types
176 |             >>>
177 |             >>> client = speech.SpeechClient(access_token)
178 |             >>> config = types.StreamingRecognitionConfig(
179 |             ...     config=types.RecognitionConfig(
180 |             ...         encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
181 |             ...     ),
182 |             ... )
183 |             >>> request = types.StreamingRecognizeRequest(audio_content=b'...')
184 |             >>> requests = [request]
185 |             >>> for element in client.streaming_recognize(config, requests):
186 |             ...     # process element
187 |             ...     pass
188 |         Args:
189 |             config (vernacular.ai.speech.types.StreamingRecognitionConfig): Required. The configuration to use for streaming.
190 |             requests (iterator[dict|vernacular.ai.speech.types.StreamingRecognizeRequest]):
191 |                 The input objects. If a dict is provided, it must be of the
192 |                 same form as the protobuf message :class:`~vernacular.ai.speech.types.StreamingRecognizeRequest`
193 | timeout (Optional[float]): The amount of time, in seconds, to wait
194 | for the request to complete. Default value is `30s`.
195 | Returns:
196 | Iterable[~vernacular.ai.speech.types.StreamingRecognizeResponse].
197 | Raises:
198 | vernacular.ai.exceptions.VernacularAPICallError: If the request
199 | failed for any reason.
200 | ValueError: If the parameters are invalid.
201 | """
202 | if timeout is None:
203 | timeout = self.DEFAULT_TIMEOUT
204 |
205 |         streaming_responses = self.client.StreamingRecognize(
206 |             self._streaming_request_iterable(config, requests),
207 |             timeout=timeout,
208 |             metadata=[(self.AUTHORIZATION, self.access_token)]
209 |         )
210 |         return streaming_responses
211 |
--------------------------------------------------------------------------------
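The docstring example for streaming_recognize sends a single audio_content request; in practice audio is usually fed in small chunks so results can arrive while the file (or microphone) is still being read. A sketch under that assumption — the chunk size, token handling, and file path are illustrative choices, and the raw file is assumed to be 8 kHz LINEAR16 audio like the bundled resource:

from vernacular.ai import speech
from vernacular.ai.speech import enums, types

client = speech.SpeechClient("<ACCESS_TOKEN>")  # the client prefixes "bearer " itself

streaming_config = types.StreamingRecognitionConfig(
    config=types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code="en-IN",
    ),
)

def audio_requests(path, chunk_size=4096):
    # Every request after the initial config message carries raw audio bytes.
    with open(path, "rb") as audio_file:
        while True:
            chunk = audio_file.read(chunk_size)
            if not chunk:
                break
            yield types.StreamingRecognizeRequest(audio_content=chunk)

requests = audio_requests("resources/test-single-channel-8000Hz.raw")
for response in client.streaming_recognize(streaming_config, requests):
    for result in response.results:
        print(result)

streaming_recognize prepends the config message itself (via _streaming_request_iterable), so the generator only needs to yield audio chunks.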
/python/vernacular/ai/speech/types.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import
2 | import sys
3 | import collections
4 | import inspect
5 |
6 | from google.rpc import status_pb2
7 | from google.protobuf.message import Message
8 |
9 | from vernacular.ai.speech.proto import speech_to_text_pb2
10 | from vernacular.ai.speech.utils import _SpeechOperation
11 |
12 |
13 | _shared_modules = [status_pb2]
14 | _local_modules = [speech_to_text_pb2]
15 |
16 | names = []
17 |
18 | def get_messages(module):
19 | """Discovers all protobuf Message classes in a given import module.
20 |
21 | Args:
22 | module (module): A Python module; :func:`dir` will be run against this
23 | module to find Message subclasses.
24 |
25 | Returns:
26 | dict[str, google.protobuf.message.Message]: A dictionary with the
27 | Message class names as keys, and the Message subclasses themselves
28 | as values.
29 | """
30 | answer = collections.OrderedDict()
31 | for name in dir(module):
32 | candidate = getattr(module, name)
33 | if inspect.isclass(candidate) and issubclass(candidate, Message):
34 | answer[name] = candidate
35 | return answer
36 |
37 |
38 | for module in _shared_modules: # pragma: NO COVER
39 | for name, message in get_messages(module).items():
40 | setattr(sys.modules[__name__], name, message)
41 | names.append(name)
42 |
43 | for module in _local_modules:
44 | for name, message in get_messages(module).items():
45 | message.__module__ = "vernacular.ai.speech.types"
46 | setattr(sys.modules[__name__], name, message)
47 | names.append(name)
48 |
49 |
50 | __all__ = tuple(sorted(names))
51 |
--------------------------------------------------------------------------------
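types.py re-exports every protobuf Message subclass from google.rpc.status_pb2 and the generated speech_to_text_pb2 at module level, so callers can build messages through a single vernacular.ai.speech.types namespace. A small sketch of what that buys:

from vernacular.ai.speech import types

# The re-exported classes are the generated protobuf types, relabelled under
# the types module (their __module__ is rewritten on import).
audio = types.RecognitionAudio(content=b"\x00\x01")
print(type(audio).__module__)  # "vernacular.ai.speech.types"
print(types.__all__[:3])       # first few exported message names, alphabetical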
/python/vernacular/ai/speech/utils.py:
--------------------------------------------------------------------------------
1 |
2 | class _SpeechOperation():
3 |
4 |     def add_done_callback(self, callback):
5 | pass
--------------------------------------------------------------------------------
/resources/hello.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/resources/hello.wav
--------------------------------------------------------------------------------
/resources/test-single-channel-8000Hz.raw:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/skit-ai/speech-recognition/7949dd0db6257655e2765d5e5430f09ca5042ebe/resources/test-single-channel-8000Hz.raw
--------------------------------------------------------------------------------