├── .idea
│   ├── cmft.iml
│   ├── misc.xml
│   └── modules.xml
├── README.md
├── arg_parse.py
├── data
│   ├── shakespeare
│   │   └── shakespeare.txt
│   ├── test
│   │   └── input.txt
│   └── wikipedia
│       └── input.txt
├── models
│   ├── __init__.py
│   ├── base.py
│   ├── densenet
│   │   ├── __init__.py
│   │   ├── densenet_decoder.py
│   │   ├── densenet_encoder.py
│   │   ├── ops.py
│   │   └── ops_tests.py
│   ├── q1_lstm_baseline.py
│   ├── q2_simple_conv.py
│   ├── q3_ResidualConvs.py
│   ├── q4_DilatedConv.py
│   ├── q5_deconvolution_autoencoder.py
│   └── small_conv.py
├── prepare_data.py
├── presentation.pdf
├── test_dataLoader.py
└── train.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Convolutional Methods for Text Workshop
## Exercises for the workshop given at PyCon Israel 2017

## Intro
This repository has the exercises for the workshop on convolutional methods for text. You can read about the constructs we'll be using, and why you might want to consider convolutions, in my [blog post](https://medium.com/@TalPerry/convolutional-methods-for-text-d5260fd5675f). In a nutshell, convolutions run in parallel, so they are much faster than RNNs.

The core task in this repository is to restore missing punctuation and capitalisation in a text, something we may find ourselves doing when working with speech recognition systems. To avoid boilerplate overhead, we work at the character level, not the word level.

## Getting started
Clone this repository, then install the dependencies:
```
pip3 install tensorflow nltk
```

## Contents
`train.py` is your entry point for these tasks. To run a task, modify the model import at the top:
```python
from models.q1_lstm_baseline import LSTMBaseline
```
and then run:
```
python3 train.py
```

The *models* directory contains each of the "questions", as well as some boilerplate, as follows:

`base.py` holds a base class that covers the boilerplate for a NN, including setting up the loss and optimiser as well as embedding characters. You should be able to ignore this during the workshop.

`densenet/ops.py` has all of the ops you may need to implement the various tasks here. I include it for reference, but you won't get much from the workshop if you just copy-paste it, so try not to look there unless you are stuck.

## Tasks
All tasks are described below. You are of course free to do whichever ones you like. They are arranged in order of difficulty and build on one another.

### `q1_lstm_baseline`
In this task you'll use an LSTM or bidirectional LSTM to solve the task. Do this if you are new to TensorFlow/NNs, or if you want to make your own baseline to see how slow LSTMs are compared to convolutions.

### `q2_simple_conv`
In this task we'll implement our own convolution op for 1D sequences like text. Once we have the op, we'll apply it to our inputs to try to restore the missing punctuation and capitalisation.
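Below is a minimal sketch of what such an op can look like. It assumes inputs of shape `[batch, time, channels]` (what `base.py`'s `_embed_chars` produces); the helper name and defaults are illustrative, not the repo's `models/densenet/ops.py` version, so try writing your own before peeking.

```python
import tensorflow as tf

def simple_conv1d(inputs, filter_width, out_channels, padding="SAME", name="conv1d"):
    """1D convolution over [batch, time, channels] inputs.

    Illustrative sketch for q2; names and defaults are assumptions, not the
    workshop's reference implementation.
    """
    in_channels = inputs.get_shape().as_list()[-1]
    with tf.variable_scope(name):
        # One weight per (position in window, input channel, output channel).
        filters = tf.get_variable(
            "filters", shape=[filter_width, in_channels, out_channels],
            dtype=tf.float32)
        bias = tf.get_variable(
            "bias", shape=[out_channels], initializer=tf.zeros_initializer())
        # "SAME" padding preserves the time dimension, so outputs stay aligned
        # with the per-character targets.
        return tf.nn.conv1d(inputs, filters, stride=1, padding=padding) + bias
```

In a `ModelBase` subclass, `get_logits` could stack a few of these over `self.embedded_source` and finish with a width-1 convolution down to 128 channels, one logit per ASCII character.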

### `q3_ResidualConvs`
This task builds on the previous one. We'll want to increase our receptive field without running into vanishing gradients. To do that we'll use residual connections à la DenseNet, which you will implement (see the first sketch after this task list).

### `q4_DilatedConv`
In this task we'll implement 1D dilated convolutions and use them as another way to increase the receptive field without vanishing gradients. Bonus points if you build residual dilated convolutions (see the second sketch after this task list).

### `q5_deconvolution_autoencoder`
In this question we'll use the residual convs we built and add 1D pooling and 1D "deconvolutions" to encode our input as a vector and then restore it, respectively.
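For `q3_ResidualConvs`, the key idea is that each layer should see the outputs of the layers before it, giving gradients short paths back to early layers. Here is a sketch in the DenseNet style (concatenation rather than addition), reusing the hypothetical `simple_conv1d` helper from the q2 sketch above:

```python
def dense_block(inputs, num_layers, growth_rate, filter_width=3):
    """DenseNet-style block: each layer consumes the concatenation of
    everything produced so far. Sketch only; the repo's reference version
    lives in models/densenet/ops.py.
    """
    output = inputs
    for i in range(num_layers):
        new_features = tf.nn.relu(
            simple_conv1d(output, filter_width, growth_rate,
                          name="dense_layer_{}".format(i)))
        # Concatenate on the channel axis so layer i+1 sees layers 0..i.
        output = tf.concat([output, new_features], axis=2)
    return output
```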
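For `q4_DilatedConv`, a dilated filter skips `dilation_rate - 1` positions between taps, so stacking layers with rates 1, 2, 4, 8, ... grows the receptive field exponentially with depth while each layer stays cheap. A sketch under the same shape assumptions as above:

```python
def dilated_conv1d(inputs, filter_width, out_channels, dilation_rate,
                   name="dilated_conv"):
    """1D convolution with gaps between filter taps. Sketch only; the
    signature is an assumption, not the workshop's reference op.
    """
    in_channels = inputs.get_shape().as_list()[-1]
    with tf.variable_scope(name):
        filters = tf.get_variable(
            "filters", shape=[filter_width, in_channels, out_channels],
            dtype=tf.float32)
        # tf.nn.convolution supports dilation directly for 1D inputs.
        return tf.nn.convolution(inputs, filters, padding="SAME",
                                 dilation_rate=[dilation_rate])
```

For the bonus, wrap each dilated layer in the dense/residual pattern from the q3 sketch.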
--------------------------------------------------------------------------------
/arg_parse.py:
--------------------------------------------------------------------------------
import argparse

def parse_args(args):
    parser = argparse.ArgumentParser(...)
    parser.add_argument('--data_path', type=str,
                        default='data/wikipedia/input.txt', help="Path to raw txt file")
    parser.add_argument('--saved_data_path', type=str,
                        default='data/wikipedia/dataset.npa', help="Where to save/load processed input")
    parser.add_argument('--batch_size', type=int,
                        default=8, help="Number of sentences in batch")
    # The learning rate is fractional, so its type must be float, not int.
    parser.add_argument('--lr', type=float,
                        default=0.0001, help="Learning rate")

    # ...Create your parser as you like...
    return parser.parse_args(args)

--------------------------------------------------------------------------------
/data/test/input.txt:
--------------------------------------------------------------------------------
It is the county seat of Alfalfa County .
Cherokee is a city in Alfalfa County , Oklahoma , United States .
Skateboard decks are usually between 28 and 33 inches long .
The underside of the deck can be printed with a design by the manufacturer , blank , or decorated by any other means .
This was created by two surfers ; Ben Whatson and Jonny Drapper .
Some of them have special materials that help to keep the deck from breaking : such as fiberglass , bamboo , resin , Kevlar , carbon fiber , aluminum , and plastic .
`` Old school '' boards -LRB- those made in the 1970s – `` 80s or modern boards that mimic their shape -RRB- are generally wider and often have only one kicktail .
One of the first deck companies was called `` Drapped '' taken from Jonny 's second name .
Grip tape , when applied to the top surface of a skateboard , gives a skater 's feet grip on the deck .
Modern decks vary in size , but most are 7 to 10.5 inches wide .

--------------------------------------------------------------------------------
/models/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/talolard/CMFT_PyCon2017/7e8db3a3a27baf44e97ee662a6874f42c38fa6ea/models/__init__.py

--------------------------------------------------------------------------------
/models/base.py:
--------------------------------------------------------------------------------
import tensorflow as tf
import abc

class ModelBase(metaclass=abc.ABCMeta):
    def __init__(self, args):
        self.args = args
        # Targets are the original character ids; the model reads `lower`,
        # the stripped version of the text it must restore.
        self.original = tf.placeholder(dtype=tf.int32, shape=[None, None])
        self.lower = tf.placeholder(dtype=tf.int32, shape=[None, None])
        self.batch_max_len = self.original.get_shape().as_list()[1]
        self.embedded_source = self._embed_chars()
        logits = self.get_logits()
        self.loss_op = self._loss(logits)
        self.train_op = self._train(self.loss_op)
        self.predictions = self._prediction(logits)

    @abc.abstractmethod
    def get_logits(self):
        pass

    def _embed_chars(self):
        # 128 rows: one embedding per ASCII code point.
        embedding_size = 4
        embedding_matrix = tf.get_variable("embedding_matrix", shape=[128, embedding_size],
                                           dtype=tf.float32)
        return tf.nn.embedding_lookup(embedding_matrix, self.lower)

    def _loss(self, logits):
        # Mask padding positions (id 0) out of the sequence loss.
        lengths = tf.reduce_sum(tf.sign(self.original), axis=1)
        max_len = self.original.get_shape().as_list()[1]
        mask = tf.sequence_mask(lengths, dtype=tf.float32, maxlen=max_len)
        loss = tf.contrib.seq2seq.sequence_loss(logits=logits,
                                                targets=self.original,
                                                weights=mask)
        return loss

    def _train(self, loss):
        # lr = tf.train.exponential_decay(learning_rate=self.args.lr)
        lr = self.args.lr
        opt = tf.train.AdamOptimizer(lr)
        return opt.minimize(loss)

    def _prediction(self, logits):
        sm = tf.nn.softmax(logits)
        preds = tf.argmax(sm, axis=2)
        return preds

--------------------------------------------------------------------------------
/models/densenet/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/talolard/CMFT_PyCon2017/7e8db3a3a27baf44e97ee662a6874f42c38fa6ea/models/densenet/__init__.py

--------------------------------------------------------------------------------
/models/densenet/densenet_decoder.py:
--------------------------------------------------------------------------------
import tensorflow as tf
from tensorflow.contrib.layers.python.layers.initializers import xavier_initializer

from models.densenet import ops
from models.ops import conv1d_transpose
from arg_getter import FLAGS

def DenseNetDecoder(_input, layers_per_batch, growth_rate, expansion_rate, final_width=FLAGS.max_len):
    output = tf.expand_dims(_input, 1)
    block = 0
    while output.get_shape().as_list()[1]
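`DenseNetDecoder` above relies on `conv1d_transpose` from `models/ops.py`, which is not included in this excerpt. Since TensorFlow 1.x has no built-in 1D transposed convolution, a common construction (and only a guess at what `models/ops.py` actually does) is to add a dummy height dimension and route through `tf.nn.conv2d_transpose`:

```python
import tensorflow as tf

def conv1d_transpose_sketch(inputs, filter_width, out_channels, stride,
                            name="conv1d_transpose"):
    """Hypothetical stand-in for models/ops.py's conv1d_transpose: upsample a
    [batch, time, channels] tensor by `stride` via tf.nn.conv2d_transpose
    applied over a dummy height-1 axis.
    """
    in_channels = inputs.get_shape().as_list()[-1]
    with tf.variable_scope(name):
        # conv2d_transpose filters are laid out [h, w, out_channels, in_channels].
        filters = tf.get_variable(
            "filters", shape=[1, filter_width, out_channels, in_channels],
            dtype=tf.float32)
        expanded = tf.expand_dims(inputs, 1)  # [batch, 1, time, channels]
        output_shape = tf.stack([tf.shape(inputs)[0], 1,
                                 tf.shape(inputs)[1] * stride, out_channels])
        result = tf.nn.conv2d_transpose(expanded, filters, output_shape,
                                        strides=[1, 1, stride, 1],
                                        padding="SAME")
        return tf.squeeze(result, axis=1)  # [batch, time * stride, out_channels]
```

Pairing this with 1D pooling on the encoder side gives the shrink-then-restore shape of the q5 autoencoder.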