├── LICENSE
├── README.md
├── attack.py
├── classify.py
├── docker
│   ├── aae_deepspeech_041_cpu.dockerfile
│   └── aae_deepspeech_041_gpu.dockerfile
├── ds_ctcdecoder-0.4.1-cp35-cp35m-linux_x86_64.whl
├── filterbanks.npy
├── sample-000000.wav
├── setup.sh
├── test_setup.sh
├── tf_logits.py
└── xdg.py

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2017 Nicholas Carlini

LICENSE

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Audio Adversarial Examples
This is the code corresponding to the paper
"Audio Adversarial Examples: Targeted Attacks on Speech-to-Text"
Nicholas Carlini and David Wagner
https://arxiv.org/abs/1801.01944

To generate adversarial examples for your own files, follow the process below
and modify the arguments to attack.py. Ensure that the file is sampled at
16kHz and uses signed 16-bit ints as the data type. You may want to modify
the number of iterations that the attack algorithm is allowed to run.
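If you are not sure whether a file meets these requirements, here is a quick
check using scipy (which this project already depends on); the filename is a
placeholder:

```
import scipy.io.wavfile as wav
fs, audio = wav.read("my-input.wav")
assert fs == 16000, "expected a 16kHz sample rate, got %d" % fs
assert audio.dtype == "int16", "expected signed 16-bit samples, got %s" % audio.dtype
```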
WARNING: THIS IS NOT THE CODE USED IN THE PAPER. If you just want to get going
generating adversarial examples on audio then proceed as described below.

The current master branch points to code which will run on TensorFlow 1.14 and
DeepSpeech 0.4.1, an almost-recent version of the dependencies. (Large portions
of tf_logits.py will need to be rewritten to run on DeepSpeech 0.5.1, which uses
a new feature extraction pipeline backed by TensorFlow's C++ implementation. If
you feel motivated to do that I would gladly accept a PR.)

However, IF YOU ARE TRYING TO REPRODUCE THE PAPER (or just have decided
that you enjoy pain and want to suffer through dependency hell) then you
will have to check out commit a8d5f675ac8659072732d3de2152411f07c7aa3a and
follow the README from there.

There are two ways to install this project. The first is to just use Docker
with a buildfile provided by Tom Doerr. It works. The second is to try to
set up everything on your machine directly. This might work, if you happen
to have the right versions of things.


# Docker Installation (highly recommended)

These Docker instructions were kindly provided by Tom Doerr, and are simple to follow if you have Docker set up.


1. Install Docker.
On Ubuntu/Debian/Linux-Mint etc.:
```
sudo apt-get install docker.io
sudo systemctl enable --now docker
```
Instructions for other platforms:
https://docs.docker.com/install/


2. Download DeepSpeech and build the Docker images:
```
./setup.sh
```

### With Nvidia-GPU support:
3. Install the NVIDIA Container Toolkit.
This step will only work on Linux and is only necessary if you want GPU support.
As far as I know it's not possible to use a GPU with Docker under Windows/Mac.
On Ubuntu/Debian/Linux-Mint etc. you can install the toolkit with the following commands:
```sh
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```
Instructions for other platforms (CentOS/RHEL):
https://github.com/NVIDIA/nvidia-docker

4. Start the container using the GPU image we just built:
```
docker run --gpus all -it --mount src=$(pwd),target=/audio_adversarial_examples,type=bind -w /audio_adversarial_examples aae_deepspeech_041_gpu
```

### CPU-only (Skip if already started with Nvidia-GPU support):
4. Start the container using the CPU image we just built:
```
docker run -it --mount src=$(pwd),target=/audio_adversarial_examples,type=bind -w /audio_adversarial_examples aae_deepspeech_041_cpu
```


### Test Setup
5. Check that you can classify normal audio correctly:
```
python3 classify.py --in sample-000000.wav --restore_path deepspeech-0.4.1-checkpoint/model.v0.4.1
```

6. Generate adversarial examples:
```
python3 attack.py --in sample-000000.wav --target "this is a test" --out adv.wav --iterations 1000 --restore_path deepspeech-0.4.1-checkpoint/model.v0.4.1
```

7. Verify the attack succeeded:
```
python3 classify.py --in adv.wav --restore_path deepspeech-0.4.1-checkpoint/model.v0.4.1
```

## Docker Hub
The Docker images are available on Docker Hub.

CPU-Version: `tomdoerr/aae_deepspeech_041_cpu`

GPU-Version: `tomdoerr/aae_deepspeech_041_gpu`
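If you would rather not build the images yourself, you can pull one of these
and start it with the same bind-mount flags as above, e.g. for the CPU image:

```
docker pull tomdoerr/aae_deepspeech_041_cpu
docker run -it --mount src=$(pwd),target=/audio_adversarial_examples,type=bind -w /audio_adversarial_examples tomdoerr/aae_deepspeech_041_cpu
```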


# Direct Install

These are the original instructions from earlier. They will work, but require
installing everything manually.


1. Install the dependencies
```
pip3 install tensorflow-gpu==1.14 progressbar numpy scipy pandas python_speech_features tables attrdict pyxdg
pip3 install $(python3 util/taskcluster.py --decoder)
```

Download and install
https://git-lfs.github.com/

1b. Make sure you have installed Git LFS. Otherwise later steps will mysteriously fail.

2. Clone the Mozilla DeepSpeech repository into a folder called DeepSpeech:
```
git clone https://github.com/mozilla/DeepSpeech.git
```

2b. Check out the correct version of the code:
```
(cd DeepSpeech; git checkout tags/v0.4.1)
```

2c. If you get an error with tflite_convert, comment out line 21 of DeepSpeech.py:
```
from tensorflow.contrib.lite.python import tflite_convert
```

3. Download the DeepSpeech model

```
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.4.1/deepspeech-0.4.1-checkpoint.tar.gz
tar -xzf deepspeech-0.4.1-checkpoint.tar.gz
```

4. Verify that you have a file deepspeech-0.4.1-checkpoint/model.v0.4.1.data-00000-of-00001
Its MD5 sum should be
```
ca825ad95066b10f5e080db8cb24b165
```

5. Check that you can classify normal audio correctly
```
python3 classify.py --in sample-000000.wav --restore_path deepspeech-0.4.1-checkpoint/model.v0.4.1
```

6. Generate adversarial examples
```
python3 attack.py --in sample-000000.wav --target "this is a test" --out adv.wav --iterations 1000 --restore_path deepspeech-0.4.1-checkpoint/model.v0.4.1
```

7. Verify the attack succeeded
```
python3 classify.py --in adv.wav --restore_path deepspeech-0.4.1-checkpoint/model.v0.4.1
```
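A few attack.py options are not shown above: `--in` accepts several files at
once (pass a matching number of `--out` paths, or a single `--outprefix`),
`--finetune` starts from existing adversarial examples, and `--mp3` generates
MP3-compression-resistant examples (this requires pydub, and the result is
written to the first `--out` path). For example, with placeholder file names:

```
python3 attack.py --in first.wav second.wav --target "this is a test" --outprefix adv --iterations 1000 --restore_path deepspeech-0.4.1-checkpoint/model.v0.4.1
```

This writes the adversarial examples to adv0.wav and adv1.wav.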
--------------------------------------------------------------------------------
/attack.py:
--------------------------------------------------------------------------------
## attack.py -- generate audio adversarial examples
##
## Copyright (C) 2017, Nicholas Carlini.
##
## This program is licensed under the BSD 2-Clause license,
## contained in the LICENSE file in this directory.

import numpy as np
import tensorflow as tf
import argparse
from shutil import copyfile

import scipy.io.wavfile as wav

import struct
import time
import os
import sys
from collections import namedtuple
sys.path.append("DeepSpeech")

try:
    import pydub
except ImportError:
    print("pydub was not loaded, MP3 compression will not work")

import DeepSpeech

from tensorflow.python.keras.backend import ctc_label_dense_to_sparse
from tf_logits import get_logits

# These are the tokens that we're allowed to use.
# The - token is special and corresponds to the epsilon
# value in CTC decoding, and cannot occur in the phrase.
toks = " abcdefghijklmnopqrstuvwxyz'-"

def convert_mp3(new, lengths):
    import pydub
    wav.write("/tmp/load.wav", 16000,
              np.array(np.clip(np.round(new[0][:lengths[0]]),
                               -2**15, 2**15-1), dtype=np.int16))
    pydub.AudioSegment.from_wav("/tmp/load.wav").export("/tmp/saved.mp3")
    raw = pydub.AudioSegment.from_mp3("/tmp/saved.mp3")
    mp3ed = np.array([struct.unpack("<h", raw.raw_data[i:i+2])[0]
                      for i in range(0, len(raw.raw_data), 2)])[np.newaxis, :lengths[0]]
    return mp3ed


class Attack:
    def __init__(self, sess, loss_fn, phrase_length, max_audio_len,
                 learning_rate=10, num_iterations=5000, batch_size=1,
                 mp3=False, l2penalty=float('inf'), restore_path=None):
        """
        Set up the attack procedure.

        Here we create the TF graph that we're going to use to
        actually generate the adversarial examples.
        """

        self.sess = sess
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.batch_size = batch_size
        self.phrase_length = phrase_length
        self.max_audio_len = max_audio_len
        self.mp3 = mp3
        self.loss_fn = loss_fn

        # Create all the variables necessary. They are prefixed with
        # qq_ so that we know which ones are ours and don't clobber
        # them when we restore the DeepSpeech checkpoint below.
        self.delta = delta = tf.Variable(np.zeros((batch_size, max_audio_len), dtype=np.float32), name='qq_delta')
        self.mask = mask = tf.Variable(np.zeros((batch_size, max_audio_len), dtype=np.float32), name='qq_mask')
        self.original = original = tf.Variable(np.zeros((batch_size, max_audio_len), dtype=np.float32), name='qq_original')
        self.lengths = lengths = tf.Variable(np.zeros(batch_size, dtype=np.int32), name='qq_lengths')
        self.target_phrase = tf.Variable(np.zeros((batch_size, phrase_length), dtype=np.int32), name='qq_phrase')
        self.target_phrase_lengths = tf.Variable(np.zeros((batch_size), dtype=np.int32), name='qq_phrase_lengths')
        self.rescale = tf.Variable(np.ones((batch_size, 1), dtype=np.float32), name='qq_rescale')

        # Initially we bound the l_infinity norm of the perturbation by
        # 2000; the rescale variable shrinks this bound over time.
        self.apply_delta = tf.clip_by_value(delta, -2000, 2000)*self.rescale

        # The new input is the perturbation (masked so that the padding
        # beyond each clip's length stays zero) plus the original audio.
        self.new_input = new_input = self.apply_delta*mask + original

        # Add a tiny bit of noise so the adversarial example remains
        # effective after rounding to 16-bit integers.
        noise = tf.random_normal(new_input.shape, stddev=2)
        pass_in = tf.clip_by_value(new_input + noise, -2**15, 2**15-1)

        # Feed the perturbed input through DeepSpeech to get the logits.
        self.logits = logits = get_logits(pass_in, lengths)

        # And restore the DeepSpeech checkpoint (everything except our
        # own qq_ variables).
        saver = tf.train.Saver([x for x in tf.global_variables() if 'qq' not in x.name])
        saver.restore(sess, restore_path)

        if loss_fn == "CTC":
            target = ctc_label_dense_to_sparse(self.target_phrase, self.target_phrase_lengths)
            ctcloss = tf.nn.ctc_loss(labels=tf.cast(target, tf.int32),
                                     inputs=logits, sequence_length=lengths)

            # An infinite l2 penalty means that we don't penalize the
            # l2 distortion at all and minimize the CTC loss only.
            if not np.isinf(l2penalty):
                loss = tf.reduce_mean((new_input - original)**2, axis=1) + l2penalty*ctcloss
            else:
                loss = ctcloss
            self.expanded_loss = tf.constant(0)
        else:
            raise NotImplementedError("Only CTC loss is currently supported")

        self.loss = loss
        self.ctcloss = ctcloss

        # Set up Adam to do the optimization, on the delta variable only.
        start_vars = set(x.name for x in tf.global_variables())
        optimizer = tf.train.AdamOptimizer(learning_rate)
        grad, var = optimizer.compute_gradients(self.loss, [delta])[0]
        self.train = optimizer.apply_gradients([(tf.sign(grad), var)])

        end_vars = tf.global_variables()
        new_vars = [x for x in end_vars if x.name not in start_vars]
        sess.run(tf.variables_initializer(new_vars + [delta]))

        # A beam search decoder over the logits, to check how we're doing.
        self.decoded, _ = tf.nn.ctc_beam_search_decoder(logits, lengths,
                                                        merge_repeated=False,
                                                        beam_width=100)

    def attack(self, audio, lengths, target, finetune=None):
        sess = self.sess

        # Initialize all of the variables for this batch of inputs.
        sess.run(tf.variables_initializer([self.delta]))
        sess.run(self.original.assign(np.array(audio)))
        sess.run(self.lengths.assign((np.array(lengths)-1)//320))
        sess.run(self.mask.assign(np.array([[1 if i < l else 0 for i in range(self.max_audio_len)] for l in lengths])))
        sess.run(self.target_phrase_lengths.assign(np.array([len(x) for x in target])))
        sess.run(self.target_phrase.assign(np.array([list(t)+[0]*(self.phrase_length-len(t)) for t in target])))
        sess.run(self.rescale.assign(np.ones((self.batch_size, 1))))

        # Here we'll keep track of the best solution we've found so far.
        final_deltas = [None]*self.batch_size

        if finetune is not None and len(finetune) > 0:
            sess.run(self.delta.assign(finetune-audio))

        # We'll make a bunch of iterations of gradient descent here
        now = time.time()
        MAX = self.num_iterations
        for i in range(MAX):
            iteration = i
            now = time.time()

            # Print out some debug information every 10 iterations.
            if i%10 == 0:
                new, delta, r_out, r_logits = sess.run((self.new_input, self.delta, self.decoded, self.logits))
                lst = [(r_out, r_logits)]
                if self.mp3:
                    mp3ed = convert_mp3(new, lengths)
                    mp3_out, mp3_logits = sess.run((self.decoded, self.logits),
                                                   {self.new_input: mp3ed})
                    lst.append((mp3_out, mp3_logits))

                for out, logits in lst:
                    chars = out[0].values

                    res = np.zeros(out[0].dense_shape)+len(toks)-1

                    for ii in range(len(out[0].values)):
                        x, y = out[0].indices[ii]
                        res[x, y] = out[0].values[ii]

                    # Here we print the strings that are recognized.
                    res = ["".join(toks[int(x)] for x in y).replace("-", "") for y in res]
                    print("\n".join(res))

                    # And here we print the argmax of the alignment.
                    res2 = np.argmax(logits, axis=2).T
                    res2 = ["".join(toks[int(x)] for x in y[:(l-1)//320]) for y, l in zip(res2, lengths)]
                    print("\n".join(res2))

            if self.mp3:
                new = sess.run(self.new_input)
                mp3ed = convert_mp3(new, lengths)
                feed_dict = {self.new_input: mp3ed}
            else:
                feed_dict = {}

            # Actually do the optimization step
            d, el, cl, l, logits, new_input, _ = sess.run((self.delta, self.expanded_loss,
                                                           self.ctcloss, self.loss,
                                                           self.logits, self.new_input,
                                                           self.train),
                                                          feed_dict)

            # Report progress
            print("%.3f"%np.mean(cl), "\t", "\t".join("%.3f"%x for x in cl))

            logits = np.argmax(logits, axis=2).T
            for ii in range(self.batch_size):
                # Every 10 iterations, check if we've succeeded.
                # If we have (or if it's the final iteration) then we
                # should record our progress and decrease the
                # rescale constant.
                if (self.loss_fn == "CTC" and i%10 == 0 and res[ii] == "".join([toks[x] for x in target[ii]])) \
                   or (i == MAX-1 and final_deltas[ii] is None):
                    # Get the current constant
                    rescale = sess.run(self.rescale)
                    if rescale[ii]*2000 > np.max(np.abs(d[ii])):
                        # If we're already below the threshold, then
                        # just reduce the threshold to the current
                        # point and save some time.
                        print("It's way over", np.max(np.abs(d[ii]))/2000.0)
                        rescale[ii] = np.max(np.abs(d[ii]))/2000.0

                    # Otherwise reduce it by some constant. The closer
                    # this number is to 1, the better quality the result
                    # will be. The smaller, the quicker we'll converge
                    # on a result but it will be lower quality.
                    rescale[ii] *= .8

                    # Adjust the best solution found so far
                    final_deltas[ii] = new_input[ii]

                    print("Worked i=%d ctcloss=%f bound=%f"%(ii, cl[ii], 2000*rescale[ii][0]))
                    #print('delta',np.max(np.abs(new_input[ii]-audio[ii])))
                    sess.run(self.rescale.assign(rescale))

                    # Just for debugging, save the adversarial example
                    # to /tmp so we can see it if we want
                    wav.write("/tmp/adv.wav", 16000,
                              np.array(np.clip(np.round(new_input[ii]),
                                               -2**15, 2**15-1), dtype=np.int16))

        return final_deltas
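
# The loop above reports distortion as a raw maximum absolute sample
# difference, while the paper reports the relative loudness of the
# perturbation in decibels: dB(x) = 20 log10(max_i |x_i|) and
# dB_x(delta) = dB(delta) - dB(x). The helper below (added for
# illustration; nothing in this script calls it) converts to the
# paper's metric. More negative values mean a quieter perturbation.
def distortion_db(audio, adv):
    audio = np.array(audio, dtype=np.float64)
    delta = np.array(adv, dtype=np.float64) - audio
    return 20*np.log10(np.max(np.abs(delta))) - 20*np.log10(np.max(np.abs(audio)))
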
def main():
    """
    Do the attack here.

    This is all just boilerplate; nothing interesting
    happens in this method.

    For now we only support using CTC loss and only generating
    one adversarial example at a time.
    """
    parser = argparse.ArgumentParser(description=None)
    parser.add_argument('--in', type=str, dest="input", nargs='+',
                        required=True,
                        help="Input audio .wav file(s), at 16kHz (separated by spaces)")
    parser.add_argument('--target', type=str,
                        required=True,
                        help="Target transcription")
    parser.add_argument('--out', type=str, nargs='+',
                        required=False,
                        help="Path for the adversarial example(s)")
    parser.add_argument('--outprefix', type=str,
                        required=False,
                        help="Prefix of path for adversarial examples")
    parser.add_argument('--finetune', type=str, nargs='+',
                        required=False,
                        help="Initial .wav file(s) to use as a starting point")
    parser.add_argument('--lr', type=int,
                        required=False, default=100,
                        help="Learning rate for optimization")
    parser.add_argument('--iterations', type=int,
                        required=False, default=1000,
                        help="Maximum number of iterations of gradient descent")
    parser.add_argument('--l2penalty', type=float,
                        required=False, default=float('inf'),
                        help="Weight for l2 penalty on loss function")
    parser.add_argument('--mp3', action="store_const", const=True,
                        required=False,
                        help="Generate MP3 compression resistant adversarial examples")
    parser.add_argument('--restore_path', type=str,
                        required=True,
                        help="Path to the DeepSpeech checkpoint (ending in model.v0.4.1)")
    args = parser.parse_args()
    # DeepSpeech builds its own flags from sys.argv, so clear our
    # arguments out of the way before it gets loaded.
    while len(sys.argv) > 1:
        sys.argv.pop()

    with tf.Session() as sess:
        finetune = []
        audios = []
        lengths = []

        if args.out is None:
            assert args.outprefix is not None
        else:
            assert args.outprefix is None
            assert len(args.input) == len(args.out)
        if args.finetune is not None and len(args.finetune):
            assert len(args.input) == len(args.finetune)

        # Load the inputs that we're given
        for i in range(len(args.input)):
            fs, audio = wav.read(args.input[i])
            assert fs == 16000
            assert audio.dtype == np.int16
            print('source dB', 20*np.log10(np.max(np.abs(audio))))
            audios.append(list(audio))
            lengths.append(len(audio))

            if args.finetune is not None:
                finetune.append(list(wav.read(args.finetune[i])[1]))

        maxlen = max(map(len, audios))
        audios = np.array([x+[0]*(maxlen-len(x)) for x in audios])
        finetune = np.array([x+[0]*(maxlen-len(x)) for x in finetune])

        phrase = args.target

        # Set up the attack class and run it
        attack = Attack(sess, 'CTC', len(phrase), maxlen,
                        batch_size=len(audios),
                        mp3=args.mp3,
                        learning_rate=args.lr,
                        num_iterations=args.iterations,
                        l2penalty=args.l2penalty,
                        restore_path=args.restore_path)
        deltas = attack.attack(audios,
                               lengths,
                               [[toks.index(x) for x in phrase]]*len(audios),
                               finetune)

        # And now save it to the desired output
        if args.mp3:
            convert_mp3(deltas, lengths)
            copyfile("/tmp/saved.mp3", args.out[0])
            print("Final distortion", np.max(np.abs(deltas[0][:lengths[0]]-audios[0][:lengths[0]])))
        else:
            for i in range(len(args.input)):
                if args.out is not None:
                    path = args.out[i]
                else:
                    path = args.outprefix+str(i)+".wav"
                wav.write(path, 16000,
                          np.array(np.clip(np.round(deltas[i][:lengths[i]]),
                                           -2**15, 2**15-1), dtype=np.int16))
                print("Final distortion", np.max(np.abs(deltas[i][:lengths[i]]-audios[i][:lengths[i]])))

main()

--------------------------------------------------------------------------------
/classify.py:
--------------------------------------------------------------------------------
## classify.py -- actually classify a sequence with DeepSpeech
##
## Copyright (C) 2017, Nicholas Carlini.
##
## This program is licensed under the BSD 2-Clause license,
## contained in the LICENSE file in this directory.

import numpy as np
import tensorflow as tf
import argparse

import scipy.io.wavfile as wav

import time
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''  # classification runs fine on the CPU
import sys
from collections import namedtuple
sys.path.append("DeepSpeech")
import DeepSpeech

try:
    import pydub
    import struct
except ImportError:
    print("pydub was not loaded, MP3 compression will not work")

from tf_logits import get_logits


# These are the tokens that we're allowed to use.
# The - token is special and corresponds to the epsilon
# value in CTC decoding, and cannot occur in the phrase.
toks = " abcdefghijklmnopqrstuvwxyz'-"



def main():
    parser = argparse.ArgumentParser(description=None)
    parser.add_argument('--in', type=str, dest="input",
                        required=True,
                        help="Input audio .wav (or .mp3) file, at 16kHz")
    parser.add_argument('--restore_path', type=str,
                        required=True,
                        help="Path to the DeepSpeech checkpoint (ending in model.v0.4.1)")
    args = parser.parse_args()
    # DeepSpeech builds its own flags from sys.argv, so clear our
    # arguments out of the way before it gets loaded.
    while len(sys.argv) > 1:
        sys.argv.pop()
    with tf.Session() as sess:
        if args.input.split(".")[-1] == 'mp3':
            raw = pydub.AudioSegment.from_mp3(args.input)
            audio = np.array([struct.unpack("<h", raw.raw_data[i:i+2])[0]
                              for i in range(0, len(raw.raw_data), 2)])
        else:
            _, audio = wav.read(args.input)
        N = len(audio)

        # Set up the DeepSpeech graph and restore the checkpoint.
        new_input = tf.placeholder(tf.float32, [1, N])
        lengths = tf.placeholder(tf.int32, [1])
        with tf.variable_scope("", reuse=tf.AUTO_REUSE):
            logits = get_logits(new_input, lengths)
        saver = tf.train.Saver()
        saver.restore(sess, args.restore_path)

        # Decode with a beam search and print the transcription.
        decoded, _ = tf.nn.ctc_beam_search_decoder(logits, lengths,
                                                   merge_repeated=False,
                                                   beam_width=500)
        length = (len(audio)-1)//320
        r = sess.run(decoded, {new_input: [audio],
                               lengths: [length]})

        print("-"*80)
        print("Classification:")
        print("".join([toks[x] for x in r[0].values]))
        print("-"*80)

main()

--------------------------------------------------------------------------------
/tf_logits.py:
--------------------------------------------------------------------------------
## tf_logits.py -- compute the DeepSpeech logits for a waveform
##
## Copyright (C) 2017, Nicholas Carlini.
##
## This program is licensed under the BSD 2-Clause license,
## contained in the LICENSE file in this directory.


import numpy as np
import tensorflow as tf
import argparse

import scipy.io.wavfile as wav

import time
import os
import sys

sys.path.append("DeepSpeech")
import DeepSpeech

def compute_mfcc(audio, **kwargs):
    """
    Compute the MFCC for a given audio waveform. This is
    identical to how DeepSpeech does it, but does it all in
    TensorFlow so that we can differentiate through it.
    """

    batch_size, size = audio.get_shape().as_list()
    audio = tf.cast(audio, tf.float32)

    # 1. Pre-emphasizer, a high-pass filter
    audio = tf.concat((audio[:, :1], audio[:, 1:] - 0.97*audio[:, :-1], np.zeros((batch_size, 512), dtype=np.float32)), 1)

    # 2. Window into overlapping frames of 512 samples, with a hop of 320
    windowed = tf.stack([audio[:, i:i+512] for i in range(0, size-320, 320)], 1)

    window = np.hamming(512)
    windowed = windowed * window

    # 3. Take the FFT to convert to frequency space
    ffted = tf.spectral.rfft(windowed, [512])
    ffted = 1.0 / 512 * tf.square(tf.abs(ffted))

    # 4. Compute the Mel windowing of the FFT
    energy = tf.reduce_sum(ffted, axis=2)+np.finfo(float).eps
    filters = np.load("filterbanks.npy").T
    feat = tf.matmul(ffted, np.array([filters]*batch_size, dtype=np.float32))+np.finfo(float).eps

    # 5. Take the DCT again, because why not
    feat = tf.log(feat)
    feat = tf.spectral.dct(feat, type=2, norm='ortho')[:, :, :26]

    # 6. Amplify high frequencies (the standard cepstral liftering step)
    _, nframes, ncoeff = feat.get_shape().as_list()
    n = np.arange(ncoeff)
    lift = 1 + (22/2.)*np.sin(np.pi*n/22)
    feat = lift*feat
    width = feat.get_shape().as_list()[1]

    # 7. And now stick the energy next to the features
    feat = tf.concat((tf.reshape(tf.log(energy), (-1, width, 1)), feat[:, :, 1:]), axis=2)

    return feat
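
# A quick sanity check of the pipeline above (added for illustration;
# nothing in this repository calls it): one second of 16kHz audio should
# produce 49 frames of 26 coefficients each, given the 512-sample windows
# with a 320-sample hop. Run it from the repository root so that
# filterbanks.npy is found.
def _mfcc_shape_check():
    signal = tf.constant(np.random.randn(1, 16000).astype(np.float32))
    with tf.Session() as sess:
        feat = sess.run(compute_mfcc(signal))
    assert feat.shape == (1, 49, 26), feat.shape
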

def get_logits(new_input, length, first=[]):
    """
    Compute the logits for a given waveform.

    First, preprocess with the TF version of MFCC above,
    and then call DeepSpeech on the features.
    """
    # Note: the mutable default argument `first` is (ab)used as a
    # run-once flag, so that the DeepSpeech globals are only
    # initialized on the first call.

    batch_size = new_input.get_shape()[0]

    # 1. Compute the MFCCs for the input audio
    # (this is differentiable with our implementation above)
    empty_context = np.zeros((batch_size, 9, 26), dtype=np.float32)
    new_input_to_mfcc = compute_mfcc(new_input)
    features = tf.concat((empty_context, new_input_to_mfcc, empty_context), 1)

    # 2. The model sees 19 frames at a time (the current frame plus
    # 9 frames of context on each side), so concatenate them together.
    features = tf.reshape(features, [new_input.get_shape()[0], -1])
    features = tf.stack([features[:, i:i+19*26] for i in range(0, features.shape[1]-19*26+1, 26)], 1)
    features = tf.reshape(features, [batch_size, -1, 19, 26])


    # 3. Finally we process it with DeepSpeech.
    # We need to init DeepSpeech the first time we're called.
    if first == []:
        first.append(False)

        DeepSpeech.create_flags()
        tf.app.flags.FLAGS.alphabet_config_path = "DeepSpeech/data/alphabet.txt"
        DeepSpeech.initialize_globals()

    logits, _ = DeepSpeech.BiRNN(features, length, [0]*10)

    return logits

--------------------------------------------------------------------------------
/xdg.py:
--------------------------------------------------------------------------------
# Even more hacks: this stub shadows the real pyxdg package when
# DeepSpeech imports xdg, so its XDG data-path lookup becomes a no-op.

class BaseDirectory:
    def save_data_path(*args, **kwargs):
        return

--------------------------------------------------------------------------------