├── final_report.pdf
├── example
│   ├── README.md
│   ├── run.sh
│   └── dont_run_me_run_the_other_script_instead.py
├── .gitignore
├── LICENSE
├── convert_to_readable.py
├── play_with_model.py
├── README.md
├── demo_play_with_model.py
├── error_calculator.py
├── main2.py
├── punctuator.py
├── main.py
├── data.py
├── lecture_punctuator.py
└── models.py

/final_report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ragymorkos/LecturePunctuator/HEAD/final_report.pdf
--------------------------------------------------------------------------------

/example/README.md:
--------------------------------------------------------------------------------
This example produces the preprocessed Europarl English corpus that can then be used for training a model.

Requires nltk.

Usage example:

`./run.sh`

`cd ..`

`python data.py ./example/out`

`python main.py ep 256 0.02`

`python play_with_model.py Model_ep_h256_lr0.02.pcl`

The input text to play_with_model.py should be similar to the contents of the preprocessed files in ./example/out (i.e. lowercased, with numeric tokens replaced by `<NUM>`), but should not contain punctuation tokens.

Training time on this dataset with an Nvidia Tesla K20 GPU was about 15 hours (~3500 samples per second).
--------------------------------------------------------------------------------
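For reference, a preprocessed line in ./example/out looks roughly like the following (a hypothetical sample — the `<NUM>` placeholder and punctuation tokens follow the conventions of the preprocessing scripts below; the actual Europarl sentences will differ):

```
the committee adopted the report by <NUM> votes to <NUM> ,COMMA with <NUM> abstentions .PERIOD
```

Text typed into play_with_model.py should look the same, minus the ,COMMA/.PERIOD-style punctuation tokens.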
/example/run.sh:
--------------------------------------------------------------------------------
# Download and unpack the Europarl v7 monolingual English corpus
wget -qO- http://hltshare.fbk.eu/IWSLT2012/training-monolingual-europarl.tgz | tar xvz
rm -rf ./out
echo "Step 1/3"
mkdir ./out
# Filter out lines containing apostrophe tokenization artifacts ("' s", "' ll", "' ve", "' m")
grep -v " '[^ ]" ./training-monolingual-europarl/europarl-v7.en | \
grep -v \'\ s\ | \
grep -v \'\ ll\ | \
grep -v \'\ ve\ | \
grep -v \'\ m\ > step1.txt
echo "Step 2/3"
python dont_run_me_run_the_other_script_instead.py step1.txt step2.txt
echo "Step 3/3"
# Keep all but the last 400k lines for training; split the last 400k lines
# evenly into dev and test sets (200k lines each)
head -n -400000 step2.txt > ./out/ep.train.txt
tail -n 400000 step2.txt > step3.txt
head -n -200000 step3.txt > ./out/ep.dev.txt
tail -n 200000 step3.txt > ./out/ep.test.txt
echo "Cleaning up..."
rm -f step1.txt step2.txt step3.txt
echo "Preprocessing done. Now you can give the produced ./out dir as an argument to the data.py script for conversion and continue as described in the main README.md"
--------------------------------------------------------------------------------

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints
--------------------------------------------------------------------------------

/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2018 Ragy Morkos

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------

/convert_to_readable.py:
--------------------------------------------------------------------------------
# Converts text containing punctuation tokens (e.g. ".PERIOD") back into
# readable text with real punctuation marks and capitalized sentence starts.
import sys
import codecs
from data import EOS_TOKENS, PUNCTUATION_VOCABULARY

if __name__ == "__main__":

    if len(sys.argv) > 1:
        input_file = sys.argv[1]
    else:
        sys.exit("Input file path argument missing")

    if len(sys.argv) > 2:
        output_file = sys.argv[2]
    else:
        sys.exit("Output file path argument missing")

    with_newlines = len(sys.argv) > 3 and bool(int(sys.argv[3]))

    with codecs.open(input_file, 'r', 'utf-8') as in_f, codecs.open(output_file, 'w', 'utf-8') as out_f:
        last_was_eos = True
        first = True
        for token in in_f.read().split():
            if token in PUNCTUATION_VOCABULARY:
                out_f.write(token[:1])  # e.g. ".PERIOD" -> "."
            else:
                out_f.write(('' if first else ' ') + (token.title() if last_was_eos else token))

            last_was_eos = token in EOS_TOKENS
            if with_newlines and last_was_eos:
                out_f.write('\n')
                first = True
            else:
                first = False
--------------------------------------------------------------------------------
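A minimal usage sketch of convert_to_readable.py (hypothetical file names; assumes the punctuation tokens shown are in PUNCTUATION_VOCABULARY and EOS_TOKENS as defined in data.py):

```
$ cat punctuated.txt
hello world .PERIOD how are you ?QUESTIONMARK
$ python convert_to_readable.py punctuated.txt readable.txt 1
$ cat readable.txt
Hello world.
How are you?
```

The optional third argument (`1` here) starts a new line after every end-of-sentence token.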
/example/dont_run_me_run_the_other_script_instead.py:
--------------------------------------------------------------------------------
# coding: utf-8

from __future__ import division
from nltk.tokenize import word_tokenize

import nltk
import codecs
import re
import sys

nltk.download('punkt')

NUM = '<NUM>'  # placeholder token for numeric values

EOS_PUNCTS = {".": ".PERIOD", "?": "?QUESTIONMARK", "!": "!EXCLAMATIONMARK"}
INS_PUNCTS = {",": ",COMMA", ";": ";SEMICOLON", ":": ":COLON", "-": "-DASH"}

forbidden_symbols = re.compile(r"[\[\]\(\)\/\\\>\<\=\+\_\*]")
numbers = re.compile(r"\d")
multiple_punct = re.compile(r'([\.\?\!\,\:\;\-])(?:[\.\?\!\,\:\;\-]){1,}')

# Treat a token as numeric if more than 40% of its characters are digits
is_number = lambda x: len(numbers.sub("", x)) / len(x) < 0.6

def untokenize(line):
    return line.replace(" '", "'").replace(" n't", "n't").replace("can not", "cannot")

def skip(line):

    if line.strip() == '':
        return True

    last_symbol = line[-1]
    if last_symbol not in EOS_PUNCTS:
        return True

    if forbidden_symbols.search(line) is not None:
        return True

    return False

def process_line(line):

    tokens = word_tokenize(line)
    output_tokens = []

    for token in tokens:

        if token in INS_PUNCTS:
            output_tokens.append(INS_PUNCTS[token])
        elif token in EOS_PUNCTS:
            output_tokens.append(EOS_PUNCTS[token])
        elif is_number(token):
            output_tokens.append(NUM)
        else:
            output_tokens.append(token.lower())

    return untokenize(" ".join(output_tokens) + " ")

skipped = 0

with codecs.open(sys.argv[2], 'w', 'utf-8') as out_txt:
    with codecs.open(sys.argv[1], 'r', 'utf-8') as text:

        for line in text:

            line = line.replace("\"", "").strip()
            # Collapse runs of punctuation (e.g. "!?") into the first mark
            line = multiple_punct.sub(r"\g<1>", line)

            if skip(line):
                skipped += 1
                continue

            line = process_line(line)

            out_txt.write(line + '\n')

print "Skipped %d lines" % skipped
--------------------------------------------------------------------------------
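To illustrate what process_line produces, a hypothetical Python 2 REPL session (assumes nltk's punkt data is downloaded; exact tokenization may vary slightly across nltk versions):

```python
>>> process_line("The vote took place on 18 September 2001, didn't it?")
"the vote took place on <NUM> september <NUM> ,COMMA didn't it ?QUESTIONMARK "
```

Numbers become <NUM>, punctuation marks become their token forms, everything else is lowercased, and contractions split by the tokenizer are rejoined by untokenize.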
/play_with_model.py:
--------------------------------------------------------------------------------
# coding: utf-8

from __future__ import division

import models
import data

import theano
import sys
import codecs

import theano.tensor as T
import numpy as np

def to_array(arr, dtype=np.int32):
    # minibatch of 1 sequence as column
    return np.array([arr], dtype=dtype).T

def convert_punctuation_to_readable(punct_token):
    if punct_token == data.SPACE:
        return " "
    else:
        return punct_token[0]

def punctuate(predict, word_vocabulary, punctuation_vocabulary, reverse_punctuation_vocabulary, reverse_word_vocabulary, text, f_out, show_unk):

    if len(text) == 0:
        sys.exit("Input text from stdin missing.")

    # Strip any punctuation tokens from the input and append the end marker
    text = [w for w in text.split() if w not in punctuation_vocabulary] + [data.END]

    i = 0

    while True:

        # Process the text in windows of at most MAX_SEQUENCE_LEN words
        subsequence = text[i:i+data.MAX_SEQUENCE_LEN]

        if len(subsequence) == 0:
            break

        converted_subsequence = [word_vocabulary.get(w, word_vocabulary[data.UNK]) for w in subsequence]
        if show_unk:
            subsequence = [reverse_word_vocabulary[w] for w in converted_subsequence]

        y = predict(to_array(converted_subsequence))

        f_out.write(subsequence[0])

        last_eos_idx = 0
        punctuations = []
        for y_t in y:

            p_i = np.argmax(y_t.flatten())
            punctuation = reverse_punctuation_vocabulary[p_i]

            punctuations.append(punctuation)

            if punctuation in data.EOS_TOKENS:
                last_eos_idx = len(punctuations)  # we intentionally want the index of the next element

        # Slide the window forward to the last predicted sentence boundary,
        # or past the whole window if no boundary was found
        if subsequence[-1] == data.END:
            step = len(subsequence) - 1
        elif last_eos_idx != 0:
            step = last_eos_idx
        else:
            step = len(subsequence) - 1

        for j in range(step):
            f_out.write(" " + punctuations[j] + " " if punctuations[j] != data.SPACE else " ")
            if j < step - 1:
                f_out.write(subsequence[1+j])

        if subsequence[-1] == data.END:
            break

        i += step


if __name__ == "__main__":

    if len(sys.argv) > 1:
        model_file = sys.argv[1]
    else:
        sys.exit("Model file path argument missing")

    show_unk = False
    if len(sys.argv) > 2:
        show_unk = bool(int(sys.argv[2]))

    x = T.imatrix('x')

    print "Loading model parameters..."
    net, _ = models.load(model_file, 1, x)

    print "Building model..."
    predict = theano.function(inputs=[x], outputs=net.y)
    word_vocabulary = net.x_vocabulary
    punctuation_vocabulary = net.y_vocabulary
    reverse_word_vocabulary = {v: k for k, v in net.x_vocabulary.items()}
    reverse_punctuation_vocabulary = {v: k for k, v in net.y_vocabulary.items()}

    with codecs.getwriter('utf-8')(sys.stdout) as f_out:
        while True:
            text = raw_input("\nTEXT: ").decode('utf-8')
            punctuate(predict, word_vocabulary, punctuation_vocabulary, reverse_punctuation_vocabulary, reverse_word_vocabulary, text, f_out, show_unk)
--------------------------------------------------------------------------------
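A hypothetical interactive session (the model file name follows the example README; the predicted punctuation is purely illustrative and depends on the trained model):

```
$ python play_with_model.py Model_ep_h256_lr0.02.pcl
Loading model parameters...
Building model...

TEXT: hello students today we will look at dynamic programming
hello students .PERIOD today we will look at dynamic programming .PERIOD
```

The output still contains punctuation tokens; pass it through convert_to_readable.py to obtain capitalized text with real punctuation marks.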
/README.md:
--------------------------------------------------------------------------------
Note 1: This repository is not being actively maintained or improved at the moment, but I am quick to respond if you find a bug and open a ticket on GitHub or email me.

Note 2: Because some videos have manually edited and punctuated subtitles uploaded alongside the video, it is worth checking whether those are available first. You can check the settings (accessed via the gear icon) of the YouTube video to see if they are available. You can also run `youtube-dl --all-subs --skip-download <video_link>`: you will either get the manually punctuated subtitles (in all available languages) if they exist, or a warning saying "WARNING: video doesn't have subtitles".

# Intro

This repository contains a tool that automatically punctuates lecture transcripts obtained from YouTube. YouTube's transcripts are generated by Google's automatic speech recognition (ASR) functionality, but are not punctuated. The tool is configured specifically for lecture transcripts, since such transcripts can be helpful to learners for a variety of reasons. The tool's current average F-score for correctly punctuating lecture-video transcripts with commas, periods, question marks, and semicolons is 67.3. Note that the F-score will probably be much lower if you try the tool on non-lecture or non-educational videos, as the context is different.

This tool builds on https://github.com/ottokart/punctuator2, from which this repository is forked. I used the training structure in that repository to train a customized bidirectional recurrent neural network on manually punctuated transcripts scraped from Udacity's YouTube channel (https://www.youtube.com/user/Udacity) and from Princeton's COS 226 class (https://cuvids.io/app/course/2/). The training data that was used can be found in the "training_data" subdirectory in the root of this repository. For more information on the background and methodology of training the neural network, and on the project in general, including possible ways to improve the F-score, please refer to the "final_report.pdf" file in the root of the repository. Please note that some parts of the report may be outdated or irrelevant to this repository's current functionality.

Please reach out to me for the training data used for this project.

# Requirements (Ubuntu)

* Python 2.7+
* pip (if it is not installed on your system, install with `sudo apt install python-pip`)
* Theano (if it is not installed on your system, install with `sudo pip install Theano`)
* youtube-dl (if it is not installed on your system, install with `sudo pip install --upgrade youtube_dl`)
* ffmpeg (if it is not installed on your system, install with `sudo apt install ffmpeg`)
* The neural network file: please reach out to me if you'd like to use mine.

I have not yet tested this repository on a non-Linux OS. Feel free to send me installation instructions, or open a pull request against this README, if you get it working on macOS or Windows.

# Usage

If you have a link to a YouTube lecture video or playlist, simply execute `python lecture_punctuator.py <link>`, where `<link>` is the link to a YouTube video, playlist, or channel. The program writes the transcript(s) to a subdirectory called "transcripts" in the current working directory. Please make sure that the path to `lecture_punctuator.py` is configured correctly. The output format for a transcript is `