├── resources
│   ├── cli.png
│   ├── ram.png
│   ├── vlsrun.png
│   ├── initnode.png
│   ├── thumbnail.png
│   ├── finalresult.png
│   ├── initialize.png
│   ├── modelspage.png
│   ├── silencenode.png
│   ├── addcomponent.png
│   ├── buffertosound.png
│   ├── minimalsetup.png
│   ├── partialresult.png
│   ├── process_path.png
│   ├── server_process.png
│   ├── voicesettings.png
│   ├── default_use_case.png
│   ├── pass_sound_wave.png
│   ├── vlsdownloadmodels.png
│   ├── add_speech_recognizer.png
│   ├── initialize_recognizer.png
│   ├── recognizer_automatic.png
│   ├── push_to_talk_send_once.png
│   ├── recognizer_push_to_talk.png
│   └── send_data_when_recording.png
└── README.md

/resources/cli.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/cli.png
--------------------------------------------------------------------------------

/resources/ram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/ram.png
--------------------------------------------------------------------------------

/resources/vlsrun.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/vlsrun.png
--------------------------------------------------------------------------------

/resources/initnode.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/initnode.png
--------------------------------------------------------------------------------

/resources/thumbnail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/thumbnail.png
--------------------------------------------------------------------------------

/resources/finalresult.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/finalresult.png
--------------------------------------------------------------------------------

/resources/initialize.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/initialize.png
--------------------------------------------------------------------------------

/resources/modelspage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/modelspage.png
--------------------------------------------------------------------------------

/resources/silencenode.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/silencenode.png
--------------------------------------------------------------------------------

/resources/addcomponent.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/addcomponent.png
--------------------------------------------------------------------------------

/resources/buffertosound.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/buffertosound.png
--------------------------------------------------------------------------------

/resources/minimalsetup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/minimalsetup.png
--------------------------------------------------------------------------------

/resources/partialresult.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/partialresult.png
--------------------------------------------------------------------------------

/resources/process_path.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/process_path.png
--------------------------------------------------------------------------------

/resources/server_process.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/server_process.png
--------------------------------------------------------------------------------

/resources/voicesettings.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/voicesettings.png
--------------------------------------------------------------------------------

/resources/default_use_case.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/default_use_case.png
--------------------------------------------------------------------------------

/resources/pass_sound_wave.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/pass_sound_wave.png
--------------------------------------------------------------------------------

/resources/vlsdownloadmodels.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/vlsdownloadmodels.png
--------------------------------------------------------------------------------
/resources/add_speech_recognizer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/add_speech_recognizer.png
--------------------------------------------------------------------------------

/resources/initialize_recognizer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/initialize_recognizer.png
--------------------------------------------------------------------------------

/resources/recognizer_automatic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/recognizer_automatic.png
--------------------------------------------------------------------------------

/resources/push_to_talk_send_once.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/push_to_talk_send_once.png
--------------------------------------------------------------------------------

/resources/recognizer_push_to_talk.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/recognizer_push_to_talk.png
--------------------------------------------------------------------------------

/resources/send_data_when_recording.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IlgarLunin/VoskPlugin-docs/HEAD/resources/send_data_when_recording.png
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------

# **Offline Speech Recognition**

![](resources/thumbnail.png)

This is an Unreal Engine plugin for accurate speech recognition, and it doesn't require an internet connection.

# Table of contents
- [**Offline Speech Recognition**](#offline-speech-recognition)
- [Table of contents](#table-of-contents)
- [High level overview](#high-level-overview)
- [Project settings](#project-settings)
- [Test your microphone](#test-your-microphone)
- [Where to download languages and how to test them](#where-to-download-languages-and-how-to-test-them)
- [Using built-in language server (USE THIS)](#using-built-in-language-server)
- [Automatic speech recognition based on silence detection](#automatic-speech-recognition-based-on-silence-detection)
- [Push to talk (Speak first, then recognize)](#push-to-talk-speak-first-then-recognize)
- [Running language server as external process](#running-language-server-as-external-process)
- [Running server process and game process at the same time](#running-server-process-and-game-process-at-the-same-time)
- [Passing SoundWave as input, instead of microphone](#passing-soundwave-as-input-instead-of-microphone)
- [How Send Data to Language Server node works](#how-send-data-to-language-server-node-works)
- [Platforms supported](#platforms-supported)
- [Links](#links)


# High level overview
Since this is a speech-to-text (STT) plugin, the first thing you need is a way to record your voice (any recording device). The recorded voice is then passed to a speech recognizer, which gives your speech back in textual form. A speech recognizer works with one language at a time. Each language is a downloadable folder of model files.

In order to ship your game or app to end users, you will need to package each language model with your game, as well as the language server itself (the server is optional, since your game itself can act as the server).
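Under the hood, the recognizer consumes raw audio bytes; Vosk-based recognizers generally expect 16-bit signed mono PCM. As a rough illustration of that data-preparation step, here is a sketch in plain Python that downmixes an interleaved stereo float buffer (the kind a capture device typically produces) into that layout. The helper name is hypothetical, not part of the plugin's API:

```python
import struct

def stereo_float_to_mono_pcm16(samples):
    """Downmix interleaved stereo float samples in [-1.0, 1.0] to
    16-bit signed little-endian mono PCM bytes, the layout Vosk-style
    recognizers typically consume."""
    out = bytearray()
    for i in range(0, len(samples), 2):
        mono = (samples[i] + samples[i + 1]) / 2.0   # average both channels
        mono = max(-1.0, min(1.0, mono))             # clamp to valid range
        out += struct.pack("<h", int(mono * 32767))  # one int16 per frame
    return bytes(out)

# Two stereo frames: full-scale left channel only, then silence
pcm = stereo_float_to_mono_pcm16([1.0, 0.0, 0.0, 0.0])
print(len(pcm))  # → 4 (two int16 samples)
```

The sample rate matters too: most downloadable Vosk models are trained for 16 kHz (some for 8 kHz) audio, so resample the capture buffer if your device records at a different rate.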


# Project settings
To make the microphone work, you need to add the following lines to the project's `DefaultEngine.ini`.
```
[Voice]
bEnabled=true
```

To avoid losing pauses between words, you probably want to tune the silence detection threshold `voice.SilenceDetectionThreshold`; a value of `0.01` works well.
This also goes into `DefaultEngine.ini`.

```
[SystemSettings]
voice.SilenceDetectionThreshold=0.01
```
Starting from engine version 4.25, also add
```
voice.MicNoiseGateThreshold=0.01
```

Other voice-related variables worth experimenting with:
```bash
voice.MicNoiseGateThreshold
voice.MicInputGain
voice.MicStereoBias
voice.MicNoiseAttackTime
voice.MicNoiseReleaseTime
voice.SilenceDetectionAttackTime
voice.SilenceDetectionReleaseTime
```

To find the available settings, type `voice.` in the editor console and an autocompletion widget will pop up.

![](resources/voicesettings.png)

Console variables can also be modified at runtime like this

![](resources/silencenode.png)

The values above may differ depending on your microphone's actual characteristics.

# Test your microphone
To debug your microphone input, you can convert the output sound buffer to an Unreal sound wave and play it.

![](resources/buffertosound.png)

Another thing to keep in mind: if the component is connected to a server, it will by default try to send voice data during microphone capture.
If you don't want this behavior, you can disable it like this

![](resources/send_data_when_recording.png)

Use this for push-to-talk style recognition (*when you record the whole phrase first, and then send it to the server*)

![](resources/push_to_talk_send_once.png)

# Where to download languages and how to test them
All available language models are listed [here](https://alphacephei.com/vosk/models)

To test how a specific language behaves, you can use the [external language server app](https://github.com/IlgarLunin/vosk-language-server)

# Using built-in language server
*This method is preferable for simple scenarios where you don't need to separate your game and the language server; it spares you the hassle of managing an external process and communicating with the server via WebSockets.*

For both automatic and push-to-talk style recognition, you start by adding the **SpeechRecognizer** component to your actor

![add_speech_recognizer](resources/add_speech_recognizer.png)

and then loading a language into it. (This is a non-blocking function; you know exactly when the model is fully loaded into memory by connecting to the **Finished** output pin.)

![](resources/initialize_recognizer.png)

## Automatic speech recognition based on silence detection
![](resources/recognizer_automatic.png)

## Push to talk (Speak first, then recognize)
![](resources/recognizer_push_to_talk.png)

The Feed Voice Data node can handle any amount of pre-recorded speech; see [this section](#how-send-data-to-language-server-node-works)

# Running language server as external process
*In more complex cases this method is preferable over the built-in one. You can have a single language server running in the cloud or on a local machine, and it can process multiple clients at the same time, since it's multithreaded.*

1. 
Download the latest version [here](https://github.com/IlgarLunin/vosk-language-server/releases)
2. Run **vls.exe**, which is a user interface for **asr_server.exe**
   > **NOTE**: *asr_server.exe* is the actual server; you can run it without the GUI
   ![](resources/cli.png)
3. Go to the main menu -> File -> Download models

   ![](resources/vlsdownloadmodels.png)

4. You will be redirected to a web page where you will find all available models (**languages**)

   ![](resources/modelspage.png)

5. In order to start using a language, first download one of the models
6. Enter the path to the downloaded model in the server UI and press the **start** button

   ![](resources/vlsrun.png)

   > **!NOTE!**: Depending on the model size, you need to wait until the model is loaded into memory before you start feeding the server with voice data. E.g. if the model size is ~2GB, it can take ~10-30 seconds. But this is a one-time event; you can load your language into memory once at OS startup.
   ![](resources/ram.png)

7. Open Unreal
8. Create an actor blueprint
9. Add the Vosk component in the components panel

   ![](resources/addcomponent.png)

10. On begin play
    1. Bind to the "Partial Result Received" event
       ![](resources/partialresult.png)

    1. **[Optional]** Bind to the "Final Result Received" event
       ![](resources/finalresult.png)

    1. **[!MANDATORY!]** Connect to the language server process and begin voice capture
       ![](resources/initialize.png)
       NOTE: `Addr` and `Port` correspond to the language server UI (*0.0.0.0 means the server listens on all interfaces; when it runs on the same machine, connect via 127.0.0.1, i.e. localhost*)
       ![](resources/initnode.png)


11. Start talking
12. 
Check that the *Partial Result Received* event gets executed

# Running server process and game process at the same time
The plugin offers the following nodes

![](resources/server_process.png)

**Build Server Parameters** - a helper method that simplifies passing arguments to the Create Process node

**Create Process** - runs an external program; this node is generic, so you can use it to run any external program

*NOTE*: *When you ship your game, you need to include the language server as well. Put the language server files in your game's bin folder (`GAME/Binaries/Win64/**`) and use the "GetProcessExecutablePath" node to build the path to `asr_server.exe`*

![](resources/process_path.png)

**Kill Process** - the equivalent of `Alt+F4`; it shuts down an external process based on its process ID (the process handle returned by `Create Process`). Save the output of the `Create Process` node to a variable and use it later to terminate the process.

Default use case:

* Create an `Actor` responsible for voice recognition
* Start the language server on the `Begin Play` event
* Add the `Vosk` actor component and initialize it on begin play
* Begin capturing voice data
* Bind to the message receive events
* Uninitialize the Vosk component and terminate the server process on end play

> **NOTE**: *`Uninitialize` will stop voice capture if it is active*

![](resources/default_use_case.png)

# Passing SoundWave as input, instead of microphone

To do so, the plugin offers a node called `"Decompress Sound"` that converts a sound into an array of bytes. You can then use the output of the Decompress Sound node as input to the `"Send Voice Data to Language Server"` node and expect the partial and final result events to be invoked later, when the server finishes recognition.
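As the section on the Send Data to Language Server node explains, the voice bytes are split into packets of a given size and sent one after another, and nothing is sent if the packet size exceeds the data size. That chunking behavior can be sketched in plain Python (the helper is illustrative, not plugin API):

```python
def split_into_packets(voice_data: bytes, packet_size: int = 4096):
    """Split raw voice bytes into packets of packet_size, mirroring how
    the Send Voice Data node emulates microphone capture. Per the plugin
    docs, if packet_size exceeds the data size, nothing is sent."""
    if packet_size > len(voice_data):
        return []  # data smaller than one packet: nothing is sent
    return [voice_data[i:i + packet_size]
            for i in range(0, len(voice_data), packet_size)]

packets = split_into_packets(b"\x00" * 10000, 4096)
print([len(p) for p in packets])  # → [4096, 4096, 1808]
```

A smaller packet size means more round trips (and more server iterations) for the same clip, which is why the default of 4096 is a reasonable starting point for short phrases.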


> **NOTE**: *Do not call `BeginCapture` and `FinishCapture` in this case, since we don't want to use audio from the microphone*


![](resources/pass_sound_wave.png)

## How Send Data to Language Server node works
It takes the sound bytes as its first argument and a packet size as its second argument. It splits all the bytes into packets of the given size and sends them one after another to the language server, emulating microphone capture behavior. If the packet size is greater than the size of the voice data, no data will be sent. A packet size of 4096 works relatively fast and is suitable for short phrases. Note that with a small packet size it will take more time to deliver the entire voice clip to the server, and the server will perform more iterations accordingly. You should experiment with the packet size for your specific case.


# Platforms supported

Tested on **Windows**



# Links

Find out more in the documentation

* [Vosk](https://alphacephei.com/vosk/)
--------------------------------------------------------------------------------