├── .DS_Store
├── .gitignore
├── README.md
├── codes
    ├── deeppavlov
    │   └── README.md
    ├── snli
    │   ├── README.md
    │   ├── core
    │   │   ├── chat_log.py
    │   │   └── create_dir.py
    │   ├── dataset
    │   │   ├── dataset_provider.py
    │   │   └── readers
    │   │   │   └── snli_import.py
    │   └── snli.py
    ├── soynlp
    │   └── empty
    └── yandex
    │   ├── README.md
    │   ├── week01-embedding-seminar.ipynb
    │   └── week02_classification_seminar.ipynb
└── resource
    ├── material
        └── README.md
    └── slides
        ├── MIT-data-science
            ├── .gitignore
            ├── Chapter 11. Introduction to Machine Learning.pdf
            ├── Chapter 12. Clustering.pdf
            ├── Chapter13,14,15_MJLEE.pptx
            ├── MIT6_0002F16_lec1_cwjun.pdf
            ├── MIT6_0002F16_lec2_cwjun.pdf
            ├── MIT6_0002F16_lec5_lec6_ssg.pdf
            └── MIT6_0002F16_lec9_Eon.pdf
        ├── README.md
        ├── deeppavlov
            ├── .gitignore
            └── deeppavlov_Automatic spelling correction.pdf
        ├── linear-algebra
            ├── Chapter_3_Least_square.pptx
            └── README.md
        ├── paper-review
            ├── .gitignore
            ├── Character-Aware Neural Language Models.pdf
            ├── Character-level CNN for text classification.pptx
            ├── Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers.pptx
            ├── Learning phrase representation using RNN Encoder-Decoder for SMT.pdf
            ├── MASS.pdf
            ├── Robustly optimized BERT Pretraining Approaches.pptx
            ├── TransformerXL.pdf
            ├── VDCNN.pdf
            └── seqtoseq_attention_20190417.pdf
        ├── soynlp
            ├── Soynlp 2일차.pptx
            ├── empty
            ├── fastcampus_1일차.pptx
            └── fastcampus_day3
            │   ├── From frequency to meaning, Vector space models of semantics.pdf
            │   ├── Korean conjugation.pdf
            │   ├── Korean lemmatization.pdf
            │   ├── L2_L1 regularization.pdf
            │   ├── LSA.pdf
            │   ├── Logistic regression with L1, L2 regularization and keyword extraction.pdf
            │   └── Neural Word Embedding as Implicit Matrix Factorization.pdf
        └── yandex
            ├── 2월 2째주-yandex- week04_seq2seq_seminar.pptx
            ├── 2월 3째주-yandex-week04_seq2seq_seminar layer normalization.pptx
            └── yandex-week-07-mt-02.pptx

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/.DS_Store
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/.gitignore
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# DeepNLP2019

This repository hosts the materials of the Natural Language Processing LAB at Modulabs (모두의 연구소).
The study group meets every Wednesday evening, 8:00 to 10:30 PM, at the Modulabs Gangnam campus.
Study schedule: [Schedule](https://docs.google.com/spreadsheets/d/1-m9TveaMZ54EVI-ikGYcp1orD1-GsIwXZbaZDi0oW4c)

---

## How to use this repository

* Upload all presentation slides to `resource/slides`.
* Upload reference and study materials to `resource/material`.
  * Create a subfolder per project and put the files there.
  * e.g. materials on linear algebra go in a subfolder of material: `resource/material/linear_algebra/`
* Upload your source code to `codes` in the following layout.
  * Create a subfolder per project and put the files there.
  * e.g. for yandex code: `codes/yandex/2week_text_classification.ipynb`
* As far as possible, name every file so that its contents are obvious, to make later reference easier.
  * `codes/yandex/homework.ipynb`(X) -> `codes/yandex/s2s_homework.ipynb`


## Study materials


* Previous materials (up to 2018): [Link](http://github.com/modulabs/DeepNLP)
* Git Book: [Link](https://nlp.gitbook.io/book/)
* Book, 텐서플로와 머신러닝으로 시작하는 자연어 처리: 로지스틱 회귀부터 텐서플로우까지 (Natural Language Processing with TensorFlow and Machine Learning: From Logistic Regression to TensorFlow): [Link](https://book.naver.com/bookdb/book_detail.nhn?bid=14488487)


## Questions about using GitHub

For questions about using GitHub, please email the address below.
조중현: reniew2@gmail.com

--------------------------------------------------------------------------------
/codes/deeppavlov/README.md:
--------------------------------------------------------------------------------
# DeepPavlov

--------------------------------------------------------------------------------
/codes/snli/README.md:
--------------------------------------------------------------------------------
# SNLI Competition

[SNLI Leaderboard](https://nlp.stanford.edu/projects/snli/)

[Code_Reference](https://github.com/brmson/dataset-sts/tree/master/models)

The GloVe file and the SNLI dataset must be in the `data_in` directory.
--------------------------------------------------------------------------------
/codes/snli/core/chat_log.py:
--------------------------------------------------------------------------------
import datetime
import logging
import os
import time


class SingletonType(type):
    def __call__(cls, *args, **kwargs):
        try:
            return cls.__instance
        except AttributeError:
            cls.__instance = super(SingletonType, cls).__call__(*args, **kwargs)
            return cls.__instance


# Python 3 metaclass syntax; a class-body `__metaclass__` attribute has no effect there.
class CustomLogger(object, metaclass=SingletonType):
    _logger = None

    def __init__(self):
        self._logger = logging.getLogger("cai_chatbot_framework")
        self._logger.setLevel(logging.INFO)
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

        now = datetime.datetime.now()
        timestamp = time.mktime(now.timetuple())

        dirname = 'data_out/logs/'
        # makedirs also creates the parent `data_out/` directory when it is missing
        os.makedirs(dirname, exist_ok=True)

        # file_handler = logging.FileHandler(dirname + now.strftime("%Y-%m-%d %H:%M:%S")+".log")
        file_handler = logging.FileHandler(dirname + 'chatbot_framework.log')
        stream_handler = logging.StreamHandler()

        # attach the formatter so both handlers use the format defined above
        file_handler.setFormatter(formatter)
        stream_handler.setFormatter(formatter)

        self._logger.addHandler(file_handler)
        self._logger.addHandler(stream_handler)

    def get_logger(self):
        return self._logger


# mylogger = logging.getLogger("chatbot_framwork")
# mylogger.setLevel(logging.INFO) # set the logging level

# formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

# # handler: sets where the logged information is written
# # file handler; the default mode is 'a' (append)
# file_handler = logging.FileHandler('data_out/logs/chatbot_framework.log')
# stream_hander = logging.StreamHandler()

# # attach the formatter to the handlers
# file_handler.setFormatter(formatter)
# stream_hander.setFormatter(formatter)

# # add the handlers to the logger
# mylogger.addHandler(stream_hander)
# mylogger.addHandler(file_handler)

# mylogger.info("server start!!!")


# if __name__ =='__main__':
#     # logging.info("hello world")
#     # logging.error("something wrong!")

#     mylogger = logging.getLogger("chatbot_framwork")
#     mylogger.setLevel(logging.INFO) # set the logging level

#     formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')

#     # handler: sets where the logged information is written
#     # file handler; the default mode is 'a' (append)
#     file_handler = logging.FileHandler('data_out/logs/chatbot_framework.log')
#     stream_hander = logging.StreamHandler()

#     # attach the formatter to the handlers
#     file_handler.setFormatter(formatter)
#     stream_hander.setFormatter(formatter)

#     # add the handlers to the logger
#     mylogger.addHandler(stream_hander)
#     mylogger.addHandler(file_handler)

#     mylogger.info("server start!!!")


# LOGGING_LEVELS = {'critical': logging.CRITICAL,
#                   'error': logging.ERROR,
#                   'warning': logging.WARNING,
#                   'info': logging.INFO,
#                   'debug': logging.DEBUG}

# def init():
#     parser = optparse.OptionParser()
#     parser.add_option('-l', '--logging-level', help='Logging level')
#     parser.add_option('-f', '--logging-file', help='Logging file name')
#     (options, args) = parser.parse_args()
#     logging_level = LOGGING_LEVELS.get(options.logging_level, logging.NOTSET)
#     logging.basicConfig(level=logging_level, filename=options.logging_file,
#                         format='%(asctime)s %(levelname)s: %(message)s',
#                         datefmt='%Y-%m-%d %H:%M:%S')

# logging.debug("debug-level log")
# logging.info("leave helpful information here")
# logging.warning("something to watch out for!")
# logging.error("error!!!")
# logging.critical("critical error!!")


# def my_logger(original_function):
#     import logging
#     logging.basicConfig(filename='./logs/{}.log'.format(original_function.__name__), level=logging.INFO)

#     def wrapper(*args, **kwargs):
#         timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M')
#         logging.info(
#             '[{}] call result args - {}, kwargs - {}'.format(timestamp, args, kwargs))
#         return original_function(*args, **kwargs)

#     return wrapper

# # add timing
# def my_timer(original_function): #1
#     import time

#     def wrapper(*args, **kwargs):
#         t1 = time.time()
#         result = original_function(*args, **kwargs)
#         t2 = time.time() - t1
#         print('total run time of {}: {} seconds'.format(original_function.__name__, t2))
#         return result

#     return wrapper

# @my_timer
# @my_logger
# def display_info(name, age):
#     time.sleep(1)
#     print('display_info({}, {}) was executed.'.format(name, age))

# display_info("John", 25)
--------------------------------------------------------------------------------
/codes/snli/core/create_dir.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
#-*- coding: utf-8 -*-

import os

class Dir_Utilities(object):

    def __init__(self, data_in_path, model_nm):
        self.data_out = 'data_out/'
        self.data_in_path = data_in_path
        self.export_pb_path=os.path.join(self.data_out, 'export_pb/')
        self.log_path=os.path.join(self.data_out, 'logs/')
        self.backup_path=os.path.join(self.data_out, 'backups/')
        self.model_nm = model_nm

        if self.model_nm == 'onehot':
            self.nn_models_dssm_path = os.path.join(self.data_out, 'nn_models_onehot/')
        else:
            self.nn_models_dssm_path = os.path.join(self.data_out, 'nn_models_embed/')

    def create_domain_dir(self, dir_path):
        """ Create the folder for the given domain """
        if os.path.isdir(dir_path):
            print("{} --- Folder already exists \n".format(dir_path))
        else:
            # create the requested directory, including any missing parents
            os.makedirs(dir_path, exist_ok=True)
            print("{} --- Folder create complete \n".format(dir_path))

    def folder_init(self):
        print("---- start test -----")

        self.create_domain_dir(self.data_out)
        self.create_domain_dir(self.export_pb_path)
        self.create_domain_dir(self.log_path)
        self.create_domain_dir(self.backup_path)
        self.create_domain_dir(self.nn_models_dssm_path)

        print("---- end test -----")
--------------------------------------------------------------------------------
/codes/snli/dataset/dataset_provider.py:
--------------------------------------------------------------------------------
import sys
sys.path.append('../')

from configs.model_config import Config

class DatasetProvider:

    def __init__(self, config, experiment_dir=None):

        self.config = config

        if experiment_dir is not None:
            self.config = Config(config.domain_id, experiment_dir)

        self._train_input_fn = None
        self._validate_input_fn = None
        self._test_input_fn = None
        self._train_hook = None
        self._validate_hook = None

    def setup_train_input_graph(self):
        raise NotImplementedError

    def setup_validate_input_graph(self):
        raise NotImplementedError

    def setup_test_input_graph(self, input_type, input_value):
        raise NotImplementedError

    @property
    def train_input_fn(self):
        """
        train data input_fn
        called by estimator.train() in a task runner, e.g. ner_runner
        :return:
        """
        if self._train_input_fn is None:
            self.setup_train_input_graph()
        return self._train_input_fn

    @property
    def validate_input_fn(self):
        if self._validate_input_fn is None:
            self.setup_validate_input_graph()
        return self._validate_input_fn

    @property
    def test_input_fn(self):
        return self._test_input_fn

    @property
    def train_hook(self):
        if self._train_hook is None:
            self.setup_train_input_graph()
        return self._train_hook

    @property
    def validate_hook(self):
        if self._validate_hook is None:
            self.setup_validate_input_graph()
        return self._validate_hook

--------------------------------------------------------------------------------
/codes/snli/dataset/readers/snli_import.py:
--------------------------------------------------------------------------------
import json
import numpy as np
import tensorflow as tf
import tarfile
import tempfile
import os
import re
import sys

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def extract_tokens_from_binary_parse(parse):
    return parse.replace('(', ' ').replace(')', ' ').replace('-LRB-', '(').replace('-RRB-', ')').split()

def yield_examples(fn, skip_no_majority=True, limit=None):
    # the context manager closes the file once the generator is exhausted
    with open(fn) as f:
        for i, line in enumerate(f):
            if limit and i > limit:
                break
            data = json.loads(line)
            label = data['gold_label']
            s1 = ' '.join(extract_tokens_from_binary_parse(data['sentence1_binary_parse']))
            s2 = ' '.join(extract_tokens_from_binary_parse(data['sentence2_binary_parse']))
            # '-' marks examples whose annotators reached no majority label
            if skip_no_majority and label == '-':
                continue
            yield (label, s1, s2)

def get_data(fn, limit=None):
    raw_data = list(yield_examples(fn=fn, limit=limit))
    left = [s1 for _, s1, s2 in raw_data]
    right = [s2 for _, s1, s2 in raw_data]
    print(max(len(x.split()) for x in left))
    print(max(len(x.split()) for x in right))

    LABELS = {'contradiction': 0, 'neutral': 1, 'entailment': 2}
    Y = np.array([LABELS[l] for l, s1, s2 in raw_data])
    Y = to_categorical(Y, len(LABELS))

    return left, right, Y

# data_path = 'data_in/snli/'

# training = get_data(data_path+'snli_1.0_train.jsonl')
# validation = get_data(data_path+'snli_1.0_dev.jsonl')
# test = get_data(data_path+'snli_1.0_test.jsonl')

# tokenizer = Tokenizer(lower=False, filters='')
# tokenizer.fit_on_texts(training[0] + training[1])

# Lowest index from the tokenizer is 1 - we need to include 0 in our vocab count
# VOCAB = len(tokenizer.word_counts) + 1
# LABELS = {'contradiction': 0, 'neutral': 1, 'entailment': 2}

# MAX_LEN = 42

# to_seq = lambda X: pad_sequences(tokenizer.texts_to_sequences(X), maxlen=MAX_LEN)
# prepare_data = lambda data: (to_seq(data[0]), to_seq(data[1]), data[2])

# training = prepare_data(training)
# validation = prepare_data(validation)
# test = prepare_data(test)

# print(training)


# print('Build model...')
# print('Vocab size =', VOCAB)

# GLOVE_STORE = 'precomputed_glove.weights'
# if USE_GLOVE:
#     if not os.path.exists(GLOVE_STORE + '.npy'):
#         print('Computing GloVe')

#         embeddings_index = {}
#         f = open(data_path+'glove.840B.300d.txt')
#         for line in f:
#             values = line.split(' ')
#             word = values[0]
#             coefs = np.asarray(values[1:], dtype='float32')
#             embeddings_index[word] = coefs
#         f.close()

#         # prepare embedding matrix
#         embedding_matrix = np.zeros((VOCAB, EMBED_HIDDEN_SIZE))
#         for word, i in tokenizer.word_index.items():
#             embedding_vector = embeddings_index.get(word)
#             if embedding_vector is not None:
#                 # words not found in embedding index will be all-zeros.
#                 embedding_matrix[i] = embedding_vector
#             else:
#                 print('Missing from GloVe: {}'.format(word))

#         np.save(GLOVE_STORE, embedding_matrix)

#     print('Loading GloVe')
#     embedding_matrix = np.load(GLOVE_STORE + '.npy')

#     print('Total number of null word embeddings:')
#     print(np.sum(np.sum(embedding_matrix, axis=1) == 0))

#     embed = Embedding(VOCAB, EMBED_HIDDEN_SIZE, weights=[embedding_matrix], input_length=MAX_LEN, trainable=TRAIN_EMBED)
# else:
#     embed = Embedding(VOCAB, EMBED_HIDDEN_SIZE, input_length=MAX_LEN)
--------------------------------------------------------------------------------
/codes/snli/snli.py:
--------------------------------------------------------------------------------
# coding: utf-8

import sys
import tensorflow as tf
import numpy as np
import os
import pandas as pd
import pickle
import glob

from sklearn.model_selection import train_test_split
from dataset.readers.snli_import import get_data

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

import json

batch_size = 512
MAX_LEN = 42
EPOCHS = 5

os.environ["CUDA_VISIBLE_DEVICES"]="0" #For TEST
tf.logging.set_verbosity("INFO")

data_path = 'data_in/snli/'

training = get_data(data_path+'snli_1.0_train.jsonl')
validation = get_data(data_path+'snli_1.0_dev.jsonl')
test = get_data(data_path+'snli_1.0_test.jsonl')

tokenizer = Tokenizer(lower=False, filters='')
tokenizer.fit_on_texts(training[0] + training[1])

VOCAB = len(tokenizer.word_counts) + 1
LABELS = {'contradiction': 0, 'neutral': 1, 'entailment': 2}

## Define the global variables up front: file names, file locations, directories, and so on.

DATA_IN_PATH = './data_in/'
DATA_OUT_PATH = './data_out/'

## Hyperparameters needed for training.


RNN = None

BATCH_SIZE = 512
HIDDEN = 128
BUFFER_SIZE = 1000000

NUM_LAYERS = 3
DROPOUT_RATIO = 0.3

TEST_SPLIT = 0.1
RNG_SEED = 13371447
EMBEDDING_DIM = 300
MAX_SEQ_LEN = 42

WORD_EMBEDDING_DIM = 100
CONV_FEATURE_DIM = 300
CONV_OUTPUT_DIM = 128
CONV_WINDOW_SIZE = 3
DROPOUT_RATIO = 0.5
SIMILARITY_DENSE_FEATURE_DIM = 200

LAYERS = 1
EMBED_HIDDEN_SIZE = 300
SENT_HIDDEN_SIZE = 300
BATCH_SIZE = 512
PATIENCE = 4 # 8
MAX_EPOCHS = 42
MAX_LEN = 42
DP = 0.2
L2 = 4e-6
ACTIVATION = 'relu'
OPTIMIZER = 'rmsprop'

def data_len(df_list):
    # token count per sentence (sentences are space-separated token strings), capped at MAX_SEQ_LEN
    q_data_len = np.array([min(len(x.split()), MAX_SEQ_LEN) for x in df_list], dtype=np.int32)
    return q_data_len

to_seq = lambda X: pad_sequences(tokenizer.texts_to_sequences(X), maxlen=MAX_LEN)

prepare_data = lambda data: (to_seq(data[0]), to_seq(data[1]), data[2])
len_data = lambda data: (data_len(data[0]), data_len(data[1]))

# measure lengths on the raw text, before padding makes every row MAX_LEN long
training_len = len_data(training)
validation_len = len_data(validation)
test_len = len_data(test)

training = prepare_data(training)
validation = prepare_data(validation)
test = prepare_data(test)

def save_pickle(file_nm, var_nm):
    with open(file_nm, 'wb') as fp:
        pickle.dump(var_nm, fp)
    print("save {}".format(file_nm))

def load_pickle(file_nm):
    with open(file_nm, 'rb') as fp:
        pkl = pickle.load(fp)
    print("load {}".format(file_nm))
    return pkl

# save_pickle(data_path + 'training.pkl', training)
# save_pickle(data_path + 'validation.pkl',validation)
# save_pickle(data_path + 'test.pkl', test)

# save_pickle(data_path + 'training_len.pkl', training_len)
# save_pickle(data_path + 'validation_len.pkl',validation_len)
# save_pickle(data_path + 'test_len.pkl', test_len)

# training = load_pickle(data_path + 'training.pkl')
# validation = load_pickle(data_path + 'validation.pkl')
# test = load_pickle(data_path + 'test.pkl')

# training_len = load_pickle(data_path + 'training_len.pkl')
# validation_len = load_pickle(data_path + 'validation_len.pkl')
# test_len = load_pickle(data_path + 'test_len.pkl')

USE_GLOVE = True
TRAIN_EMBED = False

def train_input_fn():
    train_dataset = tf.data.Dataset.from_tensor_slices((training[0], training[1], training_len[0],
                                                        training_len[1], training[2])).shuffle(
        buffer_size=BUFFER_SIZE).prefetch(buffer_size=batch_size).batch(batch_size).repeat(EPOCHS)
    iterator = train_dataset.make_one_shot_iterator()
    q1, q2, q1_len, q2_len, labels = iterator.get_next()
    features = {'q1': q1, "q2": q2, "q1_len": q1_len, "q2_len": q2_len}
    # labels = {'labels': labels}
    return features, labels

def valid_input_fn():
    validation_dataset = tf.data.Dataset.from_tensor_slices((validation[0], validation[1], validation_len[0],
                                                             validation_len[1], validation[2])).shuffle(
        buffer_size=BUFFER_SIZE).prefetch(buffer_size=batch_size).batch(batch_size).repeat(EPOCHS)
    iterator = validation_dataset.make_one_shot_iterator()
    q1, q2, q1_len, q2_len, labels = iterator.get_next()
    features = {'q1': q1, "q2": q2, "q1_len": q1_len, "q2_len": q2_len}
    # labels = {'labels': labels}

    return features, labels

GLOVE_STORE = 'precomputed_glove.weights'
if USE_GLOVE:
    if not os.path.exists(GLOVE_STORE + '.npy'):
        print('Computing GloVe')

        embeddings_index = {}
        f = open(data_path+'glove.840B.300d.txt')
        for line in f:
            values = line.split(' ')
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
        f.close()

        # prepare embedding matrix
        embedding_matrix = np.zeros((VOCAB, EMBEDDING_DIM))
        for word, i in tokenizer.word_index.items():
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                # words not found in embedding index will be all-zeros.
                embedding_matrix[i] = embedding_vector
            else:
                print('Missing from GloVe: {}'.format(word))

        np.save(GLOVE_STORE, embedding_matrix)

    print('Loading GloVe')
    embedding_matrix = np.load(GLOVE_STORE + '.npy')

    print('Total number of null word embeddings:')
    print(np.sum(np.sum(embedding_matrix, axis=1) == 0))


def basic_conv_sementic_network(inputs, name):
    conv_layer = tf.keras.layers.Conv1D(CONV_FEATURE_DIM,
                                        CONV_WINDOW_SIZE,
                                        activation=tf.nn.relu,
                                        name=name + 'conv_1d',
                                        padding='same')(inputs)

    max_pool_layer = tf.keras.layers.MaxPool1D(MAX_SEQ_LEN,
                                               1)(conv_layer)

    output_layer = tf.keras.layers.Dense(CONV_OUTPUT_DIM,
                                         activation=tf.nn.relu,
                                         name=name + 'dense')(max_pool_layer)
    output_layer = tf.squeeze(output_layer, 1)

    return output_layer

def estimator_model(features, labels, mode):

    TRAIN = mode == tf.estimator.ModeKeys.TRAIN
    EVAL = mode == tf.estimator.ModeKeys.EVAL
    PREDICT = mode == tf.estimator.ModeKeys.PREDICT

    if USE_GLOVE:
        embed = Embedding(VOCAB, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_LEN, trainable=TRAIN_EMBED)
    else:
        embed = Embedding(VOCAB, EMBEDDING_DIM, input_length=MAX_LEN)

    prem = embed(features['q1'])
    hypo = embed(features['q2'])

    rnn_kwargs = dict(output_dim=SENT_HIDDEN_SIZE, dropout_W=DP, dropout_U=DP)
    SumEmbeddings = layers.Lambda(lambda x: keras.backend.sum(x, axis=1), output_shape=(SENT_HIDDEN_SIZE, ))

    translate = layers.TimeDistributed(layers.Dense(SENT_HIDDEN_SIZE, activation=ACTIVATION))

    prem = translate(prem)
    hypo = translate(hypo)

    if RNN and LAYERS > 1:
        for l in range(LAYERS - 1):
            rnn = RNN(return_sequences=True, **rnn_kwargs)
            prem = layers.BatchNormalization()(rnn(prem))
            hypo = layers.BatchNormalization()(rnn(hypo))

    # with RNN = None the sentence vector is simply the sum of its word embeddings
    rnn = SumEmbeddings if not RNN else RNN(return_sequences=False, **rnn_kwargs)
    prem = rnn(prem)
    hypo = rnn(hypo)
    prem = layers.BatchNormalization()(prem)
    hypo = layers.BatchNormalization()(hypo)

    joint = keras.layers.concatenate([prem, hypo])
    joint = layers.Dropout(DP)(joint)
    for i in range(3):
        joint = layers.Dense(2 * SENT_HIDDEN_SIZE, activation=ACTIVATION)(joint)
        joint = layers.Dropout(DP)(joint)
        joint = layers.BatchNormalization()(joint)


    # """ For Conv """
    # base_sementic_matrix = basic_conv_sementic_network(base_embedded_matrix, 'base')
    # hypothesis_sementic_matrix = basic_conv_sementic_network(hypothesis_embedded_matrix, 'hypothesis')

    # base_sementic_matrix = tf.keras.layers.Dropout(DROPOUT_RATIO)(query)
    # hypothesis_sementic_matrix = tf.keras.layers.Dropout(DROPOUT_RATIO)(sim)

    # merged_matrix = tf.concat([base_sementic_matrix, hypothesis_sementic_matrix], -1)

    # similarity_dense_layer = tf.keras.layers.Dense(250,
    #                                                activation=tf.nn.relu)(merged_matrix)

    # similarity_dense_layer = tf.keras.layers.Dropout(DROPOUT_RATIO)(similarity_dense_layer)

    with tf.variable_scope('output_layer'):
        # pred = tf.keras.layers.Dense(len(LABELS), activation='softmax')(similarity_dense_layer)
        # keep the raw logits for the loss and apply softmax once to get class probabilities
        logits = tf.keras.layers.Dense(len(LABELS))(joint)
        pred = tf.nn.softmax(logits)
        print("prediction: {}".format(pred))

    if PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions={
                'is_duplicate': pred
            })

    # labels is None at prediction time
    if labels is not None:
        labels = tf.to_float(labels)

    def loss_fn(logits, labels):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
            logits=logits, labels=labels))
        # loss = tf.losses.mean_squared_error(labels=labels, predictions=logits)
        return loss

    def evaluate(logits, labels):
        # accuracy = tf.metrics.accuracy(labels, tf.round(logits))

        correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
        acc = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

        return acc

    loss = loss_fn(logits, labels)
    acc = evaluate(logits, labels)

    logging_hook = tf.train.LoggingTensorHook({"loss" : loss, "accuracy" : acc}, every_n_iter=100)

    if EVAL:
        # acc = evaluate(pred, labels)
        # compare predicted and gold class indices; labels are one-hot vectors
        accuracy = tf.metrics.accuracy(tf.argmax(labels, 1), tf.argmax(pred, 1))
        eval_metric_ops = {'acc': accuracy}
        return tf.estimator.EstimatorSpec(
            mode=mode,
            eval_metric_ops= eval_metric_ops,
            loss=loss)

    elif TRAIN:
        global_step = tf.train.get_global_step()
        train_op = tf.train.AdamOptimizer(1e-3).minimize(loss, global_step)
        return tf.estimator.EstimatorSpec(
            mode=mode,
            train_op=train_op,
            loss=loss,
            training_hooks= [logging_hook])

model_dir = os.path.join(os.getcwd(), DATA_OUT_PATH + "/checkpoint/model/")
os.makedirs(model_dir, exist_ok=True)

config_tf = tf.estimator.RunConfig(save_checkpoints_steps=500,
                                   save_checkpoints_secs=None,
                                   keep_checkpoint_max=2,
                                   log_step_count_steps=100)

model_est = tf.estimator.Estimator(estimator_model, model_dir=model_dir, config=config_tf)

model_est.train(train_input_fn)
model_est.evaluate(valid_input_fn)
--------------------------------------------------------------------------------
/codes/soynlp/empty:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/codes/soynlp/empty
--------------------------------------------------------------------------------
/codes/yandex/README.md:
--------------------------------------------------------------------------------
# Yandex NLP school

--------------------------------------------------------------------------------
/codes/yandex/week02_classification_seminar.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Large scale text analysis with deep learning (3 points)\n",
    "\n",
    "Today we're gonna apply the newly learned tools for the task of predicting job salary.\n",
    "\n",
    "\n",
    "\n",
    "_Special thanks to [Oleg Vasilev](https://github.com/Omrigan/) for the core assignment idea._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### About the challenge\n",
    "For starters, let's download and unpack the data from [here](https://www.dropbox.com/s/5msc5ix7ndyba10/Train_rev1.csv.tar.gz?dl=0). \n",
    "\n",
    "You can also get it from the [yadisk url](https://yadi.sk/d/vVEOWPFY3NruT7) or the competition [page](https://www.kaggle.com/c/job-salary-prediction/data) (pick `Train_rev1.*`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(244768, 12)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#!curl -L https://www.dropbox.com/s/5msc5ix7ndyba10/Train_rev1.csv.tar.gz?dl=1 -o Train_rev1.csv.tar.gz\n",
    "#!tar -xvzf ./Train_rev1.csv.tar.gz\n",
    "data = pd.read_csv(\"./Train_rev1.csv\", index_col=None)\n",
    "data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One problem with salary prediction is that it's oddly distributed: there are many people who are paid standard salaries and a few that get tons of money. The distribution is fat-tailed on the right side, which is inconvenient for MSE minimization.\n",
    "In other words, salaries are oddly distributed.\n",
    "Many people earn a standard salary, and the heavy right tail is inconvenient for MSE minimization.\n",
    "\n",
    "There are several techniques to combat this: using a different loss function, predicting log-target instead of raw target or even replacing targets with their percentiles among all salaries in the training set. We're gonna use logarithm for now.\n",
    "That is, the options are a different loss function, a log target instead of the raw target, or targets replaced by their percentiles among all training-set salaries. We will use the logarithm.\n",
    "\n",
    "_You can read more [in the official description](https://www.kaggle.com/c/job-salary-prediction#description)._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "[base64-encoded PNG output omitted]
Xz87GSQJM7NKS0a6B3wDsqqqPDsRPHij2NuDBNr0FuDTJy5OcDqwC7gHuBVa1FvBH0zWw21JVBdwJvL0tvw743OHcJkkvdeTURST1xJuA3wMeSLKzxf6YrlX7WXSnzR8F/gCgqh5KcivwbboW9VdU1fMASa4EtgFHAJuq6qG2vvcCNyf5EPBNuh8SkuaQiX2WrNzwhZHKPXrNWw9zTaThquprDL8OvnWSZT4MfHhIfOuw5arqEV44lS9pHngqXpKkHjGxS5LUIyZ2SZJ6ZOTE3rqS/GaSz7fXQ/uEbi1ob2l9SN/dOsIYX8e0+p2WJEnTM50j9qvouqAcN1Gf0JcDT1bVLwPXtnIz7XdakiRNw0iJPckK4K3A37TXk/UJvba9ps0/v5WfVr/Th7phkiQtRaMesf9X4D8CP2uvJ+sT+uf9SLf5T7Xy0+13+iXshlKSpMlNeR97kt8C9lXVfUneMh4eUrSmmDdRfNiPi6H9S1fVRmAjwNjY2NAyktRHo/aVIY3SQc2bgN9OcjHwCuBYuiP445Ic2Y7KB/uEHu9Hek/rM/rVwAEm7l+aSeKSJGkapjwVX1Xvq6oVVbWSrvHbl6vqd5m4T+gt7TVt/pdbH9LT6nd6VrZOkqQl5lC6lJ2oT+gbgE8k2U13pH4pzLjfaUmSNA3TSuxV9RXgK216aJ/QVfVPwDsmWH5a/U5LkqTpsec5SZJ6xMQuSVKPmNglSeoRE7skST1iYpckqUdM7JIk9YiJXZKkHjGxS0tEklOT3JlkV5KHklzV4scn2Z7k4fa8rMWT5Loku5Pcn+TsgXWta+UfTrJuIP7GJA+0Za5rIztKmkMmdmnpeA74o6p6HbAauCLJGcAG4I6qWgXc0V4DXETX9fMqYD1wPXQ/BICrgXPpOqm6evzHQCuzfmC5NXOwXZIGmNilJaKq9lbVN9r008AuuiGS1wKbW7HNwCVtei1wU3Xuohv46WTgQmB7VR2oqieB7cCaNu/Yqvp6Gx/ipoF1SZojJnZpCUqyEngDcDdwUlXthS75Aye2YqcAjw0stqfFJovvGRKXNIdM7NISk+SVwKeB91TVjycrOiRWM4gPq8P6JDuS7Ni/f/9UVZY0DSZ2aQlJchRdUv9kVX2mhZ9op9Fpz/tafA9w6sDiK4DHp4ivGBJ/iaraWFVjVTW2fPnyQ9soSS9yKMO2SlpEWgv1G4BdVfXRgVlbgHXANe35cwPxK5PcTNdQ7qmq2ptkG/CfBhrMXQC8r6oOJHk6yWq6U/yXAf/tsG9YD6zc8IX5roJ6xMQuLR1vAn4PeCDJzhb7Y7qEfmuSy4Ef8MKwy1uBi4HdwDPAuwBaAv8gcG8r94GqOtCm3w3cCBwD3N4ekuaQiV1aIqrqawy/Dg5w/pDyBVwxwbo2AZuGxHcAZx5CNSUdIq+xS5LUIyZ2SZJ6xMQuSVKPmNglSeoRE7skST1iYpckqUdM7JIk9ciUiT3JK5Lck+RbbQznP2vx05Pc3cZjviXJ0S3+8vZ6d5u/cmBd72vx7ya5cCC+psV2J9lwcB0kSdJoRjlifxY4r6peD5xFNzzjauAjwLVtDOcngctb+cuBJ6vql4FrWznauM+XAr9GN0bzXyY5IskRwMfpxn4+A3hnKytJkqZpyp7nWu9TP2kvj2qPAs4D/nWLbwbeD1xPN4bz+1v8NuAvWh/Va4Gbq+pZ4PtJdgPntHK7q+oRgNYv9Vrg24eyYZKkxWGUvvIfveatc1CTfhjpGns7st5JN+rTduB7wI+q6rlWZHDc5Z+P1dzmPwW8humP7SxJkqZppMReVc9X1Vl0wzCeA7xuWLH27BjOkiTNk2m1iq+qHwFfAVYDxyUZP5U/OO7yz8dqbvNfDRxg+mM7D3t/x3CWJGkSo7SKX57kuDZ9DPCbwC7gTuDtrdjBYziva9NvB77crtNvAS5treZPB1YB99AN/biqtbI/mq6B3ZbZ2DhJkpaaUYZtPRnY3Fqvvwy4tao+n+TbwM1
JPgR8E7ihlb8B+ERrHHeALlFTVQ8luZWuUdxzwBVV9TxAkiuBbcARwKaqemjWtlCSpCVklFbx9wNvGBJ/hBdatQ/G/wl4xwTr+jDw4SHxrcDWEeorSVpERmnxrtllz3OSJPWIiV2SpB4xsUuS1CMmdmkJSbIpyb4kDw7E3p/kH5LsbI+LB+ZNa3yHicaQkDR3TOzS0nIj3VgNB7u2qs5qj60w4/EdJhpDQtIcMbFLS0hVfZXuNtRR/Hx8h6r6PjA+vsM5tPEdquqnwM3A2jYmxHl0Y0RAN4bEJbO6AZKmZGKXBHBlkvvbqfplLTbd8R1ew8RjSEiaIyZ2SdcDv0Q3LPNe4M9b3HEfpEXIxC4tcVX1RBvo6WfAX/NCx1PTHd/hh0w8hsTB7+m4D9JhYmKXlrgkJw+8fBsw3mJ+WuM7tDEhJhpDQtIcGaWveEk9keRTwFuAE5LsAa4G3pLkLLrT5o8CfwAzHt/hvQwfQ0LSHDGxS0tIVb1zSHjC5Dvd8R0mGkNC0tzxVLwkST1iYpckqUc8FT/HRhnC8NFr3joHNZEk9ZFH7JIk9YiJXZKkHjGxS5LUI15jl6TDZJQ2NdJs84hdkqQeMbFLktQjJnZJknrExC5JUo9MmdiTnJrkziS7kjyU5KoWPz7J9iQPt+dlLZ4k1yXZneT+JGcPrGtdK/9wknUD8TcmeaAtc12SYeM6S5KkKYxyxP4c8EdV9TpgNXBFkjOADcAdVbUKuKO9BriIbnjHVcB64HrofgjQjSR1Lt0gEVeP/xhoZdYPLLfm0DdNkqSlZ8rEXlV7q+obbfppYBdwCrAW2NyKbQYuadNrgZuqcxdwXBvv+UJge1UdqKonge3Amjbv2Kr6ehvP+aaBdUmSpGmY1jX2JCuBNwB3AydV1V7okj9wYit2CvDYwGJ7Wmyy+J4hcUmSNE0jJ/YkrwQ+Dbynqn48WdEhsZpBfFgd1ifZkWTH/v37p6qyJElLzkiJPclRdEn9k1X1mRZ+op1Gpz3va/E9wKkDi68AHp8ivmJI/CWqamNVjVXV2PLly0epuiRJS8ooreID3ADsqqqPDszaAoy3bF8HfG4gfllrHb8aeKqdqt8GXJBkWWs0dwGwrc17Osnq9l6XDaxLkiRNwyh9xb8J+D3ggSQ7W+yPgWuAW5NcDvwAeEebtxW4GNgNPAO8C6CqDiT5IHBvK/eBqjrQpt8N3AgcA9zeHpIkaZqmTOxV9TWGXwcHOH9I+QKumGBdm4BNQ+I7gDOnqoukQ5NkE/BbwL6qOrPFjgduAVYCjwK/U1VPtjNoH6P7of4M8Pvjd8i0fij+tK32Q1W1ucXfyAs/0rcCV7X/CZLmiD3PSUvLjby0nwj7pJB6xMQuLSFV9VXgwEFh+6SQesTELsk+KaQeGaXx3KK1csMX5rsK0mJ2WPukoDtlz2mnnTbT+kkawiN2SfZJIfWIiV2SfVJIPdLrU/GSXizJp4C3ACck2UPXut0+KaQeMbFLS0hVvXOCWfZJIfWEp+IlSeoRE7skST3iqfgFaJTb9B695q1zUBNJ0mLjEbskST1iYpckqUdM7JIk9YiJXZKkHjGxS5LUIyZ2SZJ6xMQuSVKPmNglSeoRO6iRJC14o3TcBXbeBR6xS5LUKyZ2SZJ6xMQuSVKPTJnYk2xKsi/JgwOx45NsT/Jwe17W4klyXZLdSe5PcvbAMuta+YeTrBuIvzHJA22Z65JktjdSkqSlYpQj9huBNQfFNgB3VNUq4I72GuAiYFV7rAeuh+6HAHA1cC5wDnD1+I+BVmb9wHIHv5ckSRrRlIm9qr4KHDgovBbY3KY3A5cMxG+qzl3AcUlOBi4EtlfVgap6EtgOrGnzjq2qr1dVATcNrEuSJE3TTK+xn1RVewHa84ktfgrw2EC5PS02WXzPkLgkSZqB2W48N+z6eM0gPnzlyfokO5Ls2L9//wyrKGmYJI+29i47k+xosVlrTyNpbsy0g5onkpxcVXvb6fR9Lb4HOHW
g3Arg8RZ/y0Hxr7T4iiHlh6qqjcBGgLGxsQl/AEiasd+oqh8OvB5vT3NNkg3t9Xt5cXuac+naypw70J5mjO5H+n1JtrRLcL0yaocpmluj/F363onNTI/YtwDjv8TXAZ8biF/Wfs2vBp5qp+q3ARckWdZ+8V8AbGvznk6yurWGv2xgXZLm36y0p5nrSktL2ZRH7Ek+RXe0fUKSPXS/xq8Bbk1yOfAD4B2t+FbgYmA38AzwLoCqOpDkg8C9rdwHqmq8Qd676VreHwPc3h6S5l4BX0pSwH9vZ8he1J4myUzb00iaI1Mm9qp65wSzzh9StoArJljPJmDTkPgO4Myp6iHpsHtTVT3ekvf2JN+ZpOwhtZtJsp7uNldOO+20mdRV0gTseU4SAFX1eHveB3yWrs+JJ9opdqbRnmZY/OD32lhVY1U1tnz58tneFGlJM7FLIskvJnnV+DRdO5gHmaX2NHO4KdKS57CtkgBOAj7benQ+Evi7qvpiknuZvfY0kuaAiX2R8pYOzaaqegR4/ZD4PzJL7WkkzQ1PxUuS1CMmdkmSesTELklSj5jYJUnqERO7JEk9YmKXJKlHTOySJPWIiV2SpB4xsUuS1CP2PNdjo/ROB/ZQJ0l94hG7JEk9YmKXJKlHPBUvB5SRpB7xiF2SpB4xsUuS1COeipckLSl9v/zoEbskST3iEbtG0vdfuJLUFx6xS5LUIwvmiD3JGuBjwBHA31TVNfNcJUkz0Id9edReG6WFaEEk9iRHAB8H/iWwB7g3yZaq+vb81kzSdByufXk2u0c2aavvFkRiB84BdlfVIwBJbgbWAiZ2aXGZ133ZpC0tnMR+CvDYwOs9wLnzVBdJM+e+rF6YzR+Jc92weKEk9gyJ1UsKJeuB9e3lT5J8d2D2CcAPD0Pd5sJirju0+ucj812NGVnMn/0odf9nc1GRAbOxL8+Hxfw9GOR2LCyz/b9xpP15oST2PcCpA69XAI8fXKiqNgIbh60gyY6qGjs81Tu8FnPdYXHX37rPukPel+fDAv0sp83tWFjmazsWyu1u9wKrkpye5GjgUmDLPNdJ0vS5L0vzbEEcsVfVc0muBLbR3SKzqaoemudqSZom92Vp/i2IxA5QVVuBrYewigVzWm8GFnPdYXHX37rPslnYl+fDgvwsZ8DtWFjmZTtS9ZJ2LZIkaZFaKNfYJUnSLOhFYk+yJsl3k+xOsmEe6/FokgeS7Eyyo8WOT7I9ycPteVmLJ8l1rc73Jzl7YD3rWvmHk6wbiL+xrX93W3bYrUXTqe+mJPuSPDgQO+z1neg9ZqHu70/yD+3z35nk4oF572v1+G6SCwfiQ787rfHX3a2Ot7SGYCR5eXu9u81fOYO6n5rkziS7kjyU5KrJPpeF9tn3TZKrkjzY/hbvme/6jGo6++9CNsF2vKP9PX6WZFG0jp9gO/5Lku+0/fazSY6bk8pU1aJ+0DXQ+R7wWuBo4FvAGfNUl0eBEw6K/WdgQ5veAHykTV8M3E533+9q4O4WPx54pD0va9PL2rx7gF9vy9wOXHSI9X0zcDbw4FzWd6L3mIW6vx/490PKntG+Fy8HTm/flyMm++4AtwKXtum/At7dpv8d8Fdt+lLglhnU/WTg7Db9KuDvWx0XxWffpwdwJvAg8At0bY7+J7Bqvus1Yt1H3n8X8mOC7Xgd8CvAV4Cx+a7jIWzHBcCRbfojc/X36MMR+8+7sKyqnwLjXVguFGuBzW16M3DJQPym6twFHJfkZOBCYHtVHaiqJ4HtwJo279iq+np135KbBtY1I1X1VeDAPNR3ovc41LpPZC1wc1U9W1XfB3bTfW+Gfnfa0e15wG0TfA7jdb8NOH+6Z06qam9VfaNNPw3souuxbVF89j3zOuCuqnqmqp4D/hfwtnmu00imuf8uWMO2o6p2VdV8d1o0LRNsx5fa9wrgLrp+HQ67PiT2YV1YnjJPdSngS0nuS9ezFsBJVbUXun/owIktPlG9J4vvGRKfbXNR34neYzZ
c2U57bRo4DTndur8G+NHADjlY958v0+Y/1crPSDuV/wbgbhb/Z78YPQi8OclrkvwC3dmRU6dYZiHz77tw/Ru6s2eHXR8S+0hdWM6RN1XV2cBFwBVJ3jxJ2YnqPd34XFkM9b0e+CXgLGAv8OctPpt1n7XtSvJK4NPAe6rqx5MVneA9F9JnvyhV1S66U6TbgS/SXY55btKFpGlK8id036tPzsX79SGxj9SF5Vyoqsfb8z7gs3Snep9op0Zpz/ta8YnqPVl8xZD4bJuL+k70Hoekqp6oquer6mfAX9N9/jOp+w/pTncfeVD8Retq81/N6JcEfi7JUXRJ/ZNV9ZkWXrSf/WJWVTdU1dlV9Wa6v+XD812nQ+Dfd4FpjVp/C/jddmnssOtDYl8QXVgm+cUkrxqfpms08WCry3hr5XXA59r0FuCy1uJ5NfBUO3W2DbggybJ2KvkCYFub93SS1e2a7mUD65pNc1Hfid7jkIz/Q2veRvf5j7/fpelatJ8OrKJrXDb0u9N2vjuBt0/wOYzX/e3Al6e7s7bP4wZgV1V9dGDWov3sF7MkJ7bn04B/BXxqfmt0SPz7LiBJ1gDvBX67qp6ZszeeixZ6h/tBd13s7+laOP/JPNXhtXSn8b4FPDReD7rrr3fQHQXcARzf4gE+3ur8AAMtP+muxexuj3cNxMfoktX3gL+gdTB0CHX+FN0p6/9Hd5R3+VzUd6L3mIW6f6LV7X66f3AnD5T/k1aP7zJwN8FE353297ynbdP/AF7e4q9or3e3+a+dQd3/Bd2p8fuBne1x8WL57Pv2AP433Xjx3wLOn+/6TKPeI++/C/kxwXa8rU0/CzxB94N13us6g+3YTdcOZnw//6u5qIs9z0mS1CN9OBUvSZIaE7skST1iYpckqUdM7JIk9YiJXZKkHjGxS5LUIyZ2SZJ6xMQuSVKP/H+9KhGj6EW1ZQAAAABJRU5ErkJggg==\n", 83 | "text/plain": [ 84 | "
" 85 | ] 86 | }, 87 | "metadata": { 88 | "needs_background": "light" 89 | }, 90 | "output_type": "display_data" 91 | } 92 | ], 93 | "source": [ 94 | "#data = data.head(3)\n", 95 | "data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')\n", 96 | "\n", 97 | "plt.figure(figsize=[8, 4])\n", 98 | "plt.subplot(1, 2, 1)\n", 99 | "\n", 100 | "plt.hist(data[\"SalaryNormalized\"], bins=20);\n", 101 | "\n", 102 | "plt.subplot(1, 2, 2)\n", 103 | "plt.hist(data['Log1pSalary'], bins=20);" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 4, 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "data": { 113 | "text/plain": [ 114 | "(244768,)" 115 | ] 116 | }, 117 | "execution_count": 4, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | } 121 | ], 122 | "source": [ 123 | "data['Log1pSalary'].shape" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 5, 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "data": { 133 | "text/plain": [ 134 | "200000" 135 | ] 136 | }, 137 | "execution_count": 5, 138 | "metadata": {}, 139 | "output_type": "execute_result" 140 | } 141 | ], 142 | "source": [ 143 | "np.amax(data[\"SalaryNormalized\"])" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 6, 149 | "metadata": {}, 150 | "outputs": [ 151 | { 152 | "data": { 153 | "text/plain": [ 154 | "5000" 155 | ] 156 | }, 157 | "execution_count": 6, 158 | "metadata": {}, 159 | "output_type": "execute_result" 160 | } 161 | ], 162 | "source": [ 163 | "np.amin(data[\"SalaryNormalized\"])" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 7, 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "12.206078" 175 | ] 176 | }, 177 | "execution_count": 7, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "np.amax(data[\"Log1pSalary\"])" 184 | ] 185 | }, 186 | { 187 | "cell_type": 
"code", 188 | "execution_count": 8, 189 | "metadata": {}, 190 | "outputs": [ 191 | { 192 | "data": { 193 | "text/plain": [ 194 | "8.517393" 195 | ] 196 | }, 197 | "execution_count": 8, 198 | "metadata": {}, 199 | "output_type": "execute_result" 200 | } 201 | ], 202 | "source": [ 203 | "np.amin(data[\"Log1pSalary\"])" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "Our task is to predict one number, __Log1pSalary__. (log(1 + 200000))\n", 211 | "\n", 212 | "Log1pSalary 예측이 임무\n", 213 | "\n", 214 | "Title : Job position \n", 215 | "FullDescription : 실제 할일 \n", 216 | "LocationRaw : 위치 (Detail) \n", 217 | "LocationNormalized : 위치 \n", 218 | "Contract Type : 계약 유형 \n", 219 | "Contract Time : 계약 시간 \n", 220 | "Company : 회사 \n", 221 | "Category : 범주 \n", 222 | "SalaryRaw : 급여 범위 및 추가 속성 \n", 223 | "SalaryNormalized : 급여 평균 \n", 224 | "SourceName : 출처 \n", 225 | "\n", 226 | "To do so, our model can access a number of features:\n", 227 | "* Free text: __`Title`__ and __`FullDescription`__\n", 228 | "* Categorical: __`Category`__, __`Company`__, __`LocationNormalized`__, __`ContractType`__, and __`ContractTime`__.\n", 229 | "\n", 230 | "dropna함수는 column내에 NaN값이 있으면 해당 내용은 필요없다 간주하고 삭제해버린다. \n", 231 | "\n", 232 | "fillna함수도 굉장히 유용한다 NaN을 특정 값으로 대체하는 기능을 한다. \n", 233 | "\n", 234 | "\n" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 9, 240 | "metadata": {}, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/html": [ 245 | "
(index) | Id | Title | FullDescription | LocationRaw | LocationNormalized | ContractType | ContractTime | Company | Category | SalaryRaw | SalaryNormalized | SourceName | Log1pSalary
182771 | 71614858 | Registered General Nurse (RGN) East London C... | Registered General Nurse/RN/RGNLocation: East ... | Chingford | Chingford | full_time | NaN | HC Recruitment Services | Healthcare & Nursing Jobs | 13.00 - 13.15/Hour | 25104 | staffnurse.com | 10.130822
44035 | 68506863 | Buyer Menswear | Buyer Menswear : The Client This design led c... | North London London South East | North Lambeth | NaN | permanent | FASHION & RETAIL PERSONNEL LIMITED | Retail Jobs | 40000 - 45000 per annum | 42500 | retailchoice.com | 10.657283
101855 | 69547809 | HGV 2 Moffett Driver | We have a breathtaking opportunity for a HGV C... | Southall | Southall | full_time | NaN | HR Go Recruitment | Logistics & Warehouse Jobs | 8.50 - 12.75 per hour | 20400 | Jobcentre Plus | 9.923339
" 330 | ], 331 | "text/plain": [ 332 | " Id Title \\\n", 333 | "182771 71614858 Registered General Nurse (RGN) East London C... \n", 334 | "44035 68506863 Buyer Menswear \n", 335 | "101855 69547809 HGV 2 Moffett Driver \n", 336 | "\n", 337 | " FullDescription \\\n", 338 | "182771 Registered General Nurse/RN/RGNLocation: East ... \n", 339 | "44035 Buyer Menswear : The Client This design led c... \n", 340 | "101855 We have a breathtaking opportunity for a HGV C... \n", 341 | "\n", 342 | " LocationRaw LocationNormalized ContractType \\\n", 343 | "182771 Chingford Chingford full_time \n", 344 | "44035 North London London South East North Lambeth NaN \n", 345 | "101855 Southall Southall full_time \n", 346 | "\n", 347 | " ContractTime Company \\\n", 348 | "182771 NaN HC Recruitment Services \n", 349 | "44035 permanent FASHION & RETAIL PERSONNEL LIMITED \n", 350 | "101855 NaN HR Go Recruitment \n", 351 | "\n", 352 | " Category SalaryRaw SalaryNormalized \\\n", 353 | "182771 Healthcare & Nursing Jobs 13.00 - 13.15/Hour 25104 \n", 354 | "44035 Retail Jobs 40000 - 45000 per annum 42500 \n", 355 | "101855 Logistics & Warehouse Jobs 8.50 - 12.75 per hour 20400 \n", 356 | "\n", 357 | " SourceName Log1pSalary \n", 358 | "182771 staffnurse.com 10.130822 \n", 359 | "44035 retailchoice.com 10.657283 \n", 360 | "101855 Jobcentre Plus 9.923339 " 361 | ] 362 | }, 363 | "execution_count": 9, 364 | "metadata": {}, 365 | "output_type": "execute_result" 366 | } 367 | ], 368 | "source": [ 369 | "text_columns = [\"Title\", \"FullDescription\"]\n", 370 | "categorical_columns = [\"Category\", \"Company\", \"LocationNormalized\", \"ContractType\", \"ContractTime\"]\n", 371 | "target_column = \"Log1pSalary\"\n", 372 | "\n", 373 | "data[categorical_columns] = data[categorical_columns].fillna('NaN') # cast missing values to string \"NaN\"\n", 374 | "\n", 375 | "data.sample(3)" 376 | ] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "### Preprocessing 
text data\n", 383 | "\n", 384 | "Just like last week, applying NLP to a problem begins with tokenization: splitting raw text into sequences of tokens (words, punctuation, etc).\n", 385 | "\n", 386 | "__Your task__ is to lowercase and tokenize all texts under `Title` and `FullDescription` columns. Store the tokenized data as a __space-separated__ string of tokens for performance reasons.\n", 387 | "\n", 388 | "It's okay to use nltk tokenizers. Assertions were designed for WordPunctTokenizer, slight deviations are okay.\n", 389 | "\n", 390 | "\n", 391 | "WordPunctTokenizer tokenizes text into a sequence of alphabetic and non-alphabetic characters using the regexp \n", 392 | "\n", 393 | "\\w+|[^\\w\\s]+. \n", 394 | "\n", 395 | "

from nltk.tokenize import WordPunctTokenizer \n", 396 | "

s = \"Good muffins cost $3.88\\nin New York. Please buy me\\ntwo of them.\\n\\nThanks.\" \n", 397 | "\n", 398 | "

WordPunctTokenizer().tokenize(s) \n", 399 | "

['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', \n", 400 | "

'.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.'] \n", 401 | "\n", 402 | "\n", 403 | "\n", 404 | "http://excelsior-cjh.tistory.com/63 " 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": 10, 410 | "metadata": {}, 411 | "outputs": [ 412 | { 413 | "name": "stdout", 414 | "output_type": "stream", 415 | "text": [ 416 | "Raw text:\n", 417 | "2 Mathematical Modeller / Simulation Analyst / O...\n", 418 | "100002 A successful and high achieving specialist sch...\n", 419 | "200002 Web Designer HTML, CSS, JavaScript, Photoshop...\n", 420 | "Name: FullDescription, dtype: object\n" 421 | ] 422 | } 423 | ], 424 | "source": [ 425 | "print(\"Raw text:\")\n", 426 | "print(data[\"FullDescription\"][2::100000])" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": null, 432 | "metadata": {}, 433 | "outputs": [], 446 | "source": [ 447 | "import nltk\n", 448 | "tokenizer = nltk.tokenize.WordPunctTokenizer()\n", 449 | "\n", 450 | "\n", 451 | "def normalize(text):\n", 452 | "    # lowercase, tokenize, then re-join the tokens with single spaces\n", 453 | "    return ' '.join(tokenizer.tokenize(str(text).lower()))\n", 454 | "\n", 455 | "\n", 456 | "# assign whole columns at once: element-wise chained indexing (data[col][i] = ...)\n", 457 | "# triggers SettingWithCopyWarning and is not guaranteed to write back to the DataFrame\n", 458 | "for col in text_columns:\n", 459 | "    data[col] = data[col].apply(normalize)\n", 460 | "# see task above\n", 461 | "#" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "Now we can assume that our text is a space-separated
list of tokens:" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": null, 474 | "metadata": {}, 475 | "outputs": [], 476 | "source": [ 477 | "print(\"Tokenized:\")\n", 478 | "print(data[\"FullDescription\"][2::100000])\n", 479 | "assert data[\"FullDescription\"][2][:50] == 'mathematical modeller / simulation analyst / opera'\n", 480 | "assert data[\"Title\"][54321] == 'international digital account manager ( german )'" 481 | ] 482 | }, 483 | { 484 | "cell_type": "markdown", 485 | "metadata": {}, 486 | "source": [ 487 | "Not all words are equally useful. Some of them are typos or rare words that are only present a few times. \n", 489 | "\n", 490 | "Let's count how many times each word is present in the data so that we can build a \"white list\" of known words. " 492 | ] 493 | }, 494 | { 495 | "cell_type": "code", 496 | "execution_count": null, 497 | "metadata": {}, 498 | "outputs": [], 499 | "source": [ 500 | "# Count how many times each token occurs in both \"Title\" and \"FullDescription\" in total\n", 501 | "# build a dictionary { token -> its count }\n", 502 | "import collections\n", 503 | "\n", 504 | "token_counts = collections.Counter()\n", 505 | "\n", 506 | "# every row is a space-separated string of tokens, so split it before counting\n", 507 | "for col in text_columns:\n", 508 | "    for line in data[col].values:\n", 509 | "        token_counts.update(line.split())\n", 510 | "\n", 515 | "# hint: you may or may not want to use collections.Counter" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": null, 521 | "metadata": {}, 522 | "outputs": [], 523 | "source": [ 524 | "print(\"Total unique tokens :\", len(token_counts))\n", 525 | "print('\\n'.join(map(str, token_counts.most_common(n=5))))\n", 526 | "print('...')\n", 527 | "print('\\n'.join(map(str, 
token_counts.most_common()[-3:])))\n", 528 | "\n", 529 | "#assert token_counts.most_common(1)[0][1] in range(2600000, 2700000)\n", 530 | "#assert len(token_counts) in range(200000, 210000)\n", 531 | "print('Correct!')" 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": null, 537 | "metadata": {}, 538 | "outputs": [], 539 | "source": [ 540 | "# Let's see how many words there are for each count\n", 541 | "plt.hist(list(token_counts.values()), range=[0, 10**4], bins=50, log=True)\n", 542 | "plt.xlabel(\"Word counts\");" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "Now build a list of all tokens that occur at least 10 times." 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": {}, 556 | "outputs": [], 557 | "source": [ 558 | "min_count = 10\n", 559 | "\n", 560 | "# tokens from token_counts keys that had at least min_count occurrences throughout the dataset\n", 561 | "# (note >= rather than >: \"at least 10 times\" includes tokens seen exactly 10 times)\n", 562 | "tokens = [token for token, count in token_counts.items() if count >= min_count]" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "metadata": {}, 578 | "outputs": [], 579 | "source": [ 580 | "# Add special tokens for unknown and empty words\n", 581 | "UNK, PAD = \"UNK\", \"PAD\"\n", 582 | "tokens = [UNK, PAD] + sorted(tokens)\n", 583 | "print(\"Vocabulary size:\", len(tokens))\n", 584 | "\n", 585 | "assert type(tokens) == list\n", 586 | "assert len(tokens) in range(32000, 35000)\n", 587 | "assert 'me' in tokens\n", 588 | "assert UNK in tokens\n", 589 | "print(\"Correct!\")" 590 | ] 591 | }, 592 | { 593 | "cell_type": "markdown", 594 | "metadata": {}, 595 | "source": [ 596 | "Build an inverse 
token index: a dictionary from token (string) to its index in `tokens` (int)" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": null, 602 | "metadata": {}, 603 | "outputs": [], 604 | "source": [ 605 | "#token_to_id = \n", 606 | "token_to_id = {word: idx for idx, word in enumerate(tokens)}" 607 | ] 608 | }, 609 | { 610 | "cell_type": "code", 611 | "execution_count": null, 612 | "metadata": {}, 613 | "outputs": [], 614 | "source": [ 615 | "assert isinstance(token_to_id, dict)\n", 616 | "assert len(token_to_id) == len(tokens)\n", 617 | "for tok in tokens:\n", 618 | "    assert tokens[token_to_id[tok]] == tok\n", 619 | "\n", 620 | "print(\"Correct!\")" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "And finally, let's use the vocabulary you've built to map text lines into neural network-digestible matrices.\n", 629 | "\n", 630 | ">>> a = list(map(str, range(10))) \n", 631 | ">>> a \n", 632 | "['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'] \n" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": null, 638 | "metadata": {}, 639 | "outputs": [], 640 | "source": [ 641 | "UNK_IX, PAD_IX = map(token_to_id.get, [UNK, PAD])\n", 642 | "# Build the padded index matrix.\n", 643 | "def as_matrix(sequences, max_len=None):\n", 644 | "    \"\"\" Convert a list of tokens into a matrix with padding \"\"\"\n", 645 | "    # isinstance(object, classinfo) is True when the object is of that type, False otherwise\n", 646 | "    if isinstance(sequences[0], str):\n", 647 | "        sequences = list(map(str.split, sequences))\n", 648 | "    # cap the width at max_len if it is given; when max_len is None, `max_len or float('inf')` is +inf, so the longest sequence sets the width
\n", 649 | "    max_len = min(max(map(len, sequences)), max_len or float('inf'))\n", 650 | "\n", 651 | "    # start with a matrix filled entirely with PAD_IX\n", 652 | "    matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))\n", 653 | "    # write each token's index if it is in the vocabulary, otherwise UNK_IX\n", 654 | "    for i,seq in enumerate(sequences):\n", 655 | "        row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]\n", 656 | "        matrix[i, :len(row_ix)] = row_ix\n", 658 | "    return matrix" 659 | ] 660 | }, 661 | { 662 | "cell_type": "code", 663 | "execution_count": null, 664 | "metadata": {}, 665 | "outputs": [], 666 | "source": [ 667 | "print(\"Lines:\")\n", 668 | "print('\\n'.join(data[\"Title\"][::100000].values), end='\\n\\n')\n", 669 | "print(\"Matrix:\")\n", 670 | "print(as_matrix(data[\"Title\"][::100000]))" 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": {}, 676 | "source": [ 677 | "Now let's encode the categorical data we have.\n", 678 | "\n", 679 | "As usual, we shall use one-hot encoding for simplicity. 
Kudos if you implement more advanced encodings: tf-idf, pseudo-time-series, etc.\n", 680 | "\n", 683 | ">>> list(zip([1, 2, 3], [4, 5, 6])) \n", 684 | "[(1, 4), (2, 5), (3, 6)] \n", 685 | "\n", 686 | "`set` builds an unordered collection of unique elements. \n", 687 | " \n", 688 | "\n", 689 | "\n", 690 | "from sklearn.feature_extraction.text import CountVectorizer \n", 691 | "corpus = [ \n", 692 | " 'This is the first document.', \n", 693 | " 'This is the second second document.', \n", 694 | " 'And the third one.',\n", 695 | " 'Is this the first document?', \n", 696 | " 'The last document?', \n", 697 | "]\n", 698 | "vect = CountVectorizer() \n", 699 | "vect.fit(corpus) \n", 700 | "vect.vocabulary_ \n", 701 | " \n", 702 | "\n", 703 | "{'this': 9, \n", 704 | " 'is': 3, \n", 705 | " 'the': 7, \n", 706 | " 'first': 2, \n", 707 | " 'document': 1, \n", 708 | " 'second': 6, \n", 709 | " 'and': 0, \n", 710 | " 'third': 8, \n", 711 | " 'one': 5, \n", 712 | " 'last': 4} \n", 713 | " \n", 714 | " \n", 715 | " vect.transform(['This is the second document.']).toarray() \n", 716 | " \n", 717 | " array([[0, 1, 0, 1, 0, 0, 1, 1, 0, 1]]) \n", 718 | " \n", 719 | " vect.transform(['Something completely new.']).toarray() \n", 720 | " \n", 721 | " array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]) \n", 722 | " \n", 723 | " vect.transform(corpus).toarray() \n", 724 | " \n", 725 | " array([[0, 1, 1, 1, 0, 0, 0, 1, 0, 1], \n", 726 | " [0, 1, 0, 1, 0, 0, 2, 1, 0, 1], \n", 727 | " [1, 0, 0, 0, 0, 1, 0, 1, 1, 0], \n", 728 | " [0, 1, 1, 1, 0, 0, 0, 1, 0, 1], \n", 729 | " [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]]) \n", 730 | " \n", 731 | " " 732 | ] 733 | }, 734 | { 735 | "cell_type": "code", 736 | "execution_count": null, 737 | "metadata": {}, 738 | "outputs": [], 739 | "source": [ 740 | "from sklearn.feature_extraction import DictVectorizer\n", 741 | "\n", 742 | "# we only consider top-1k most frequent companies to minimize memory usage\n", 743 | "top_companies, top_counts = 
zip(*collections.Counter(data['Company']).most_common(1000)) # zip(*pairs) groups elements at the same position: names together, counts together\n", 744 | "print(top_companies)\n", 745 | "recognized_companies = set(top_companies)\n", 746 | "print(recognized_companies)\n", 747 | "# keep the top-1000 companies as they are and map every other company to \"Other\", using pandas apply\n", 748 | "data[\"Company\"] = data[\"Company\"].apply(lambda comp: comp if comp in recognized_companies else \"Other\")\n", 749 | "\n", 750 | "categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)\n", 751 | "# turn every row (axis=1) into a dict via pandas apply\n", 752 | "categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))" 753 | ] 754 | }, 755 | { 756 | "cell_type": "markdown", 757 | "metadata": {}, 758 | "source": [ 759 | "### The deep learning part\n", 760 | "\n", 761 | "Once we've learned to tokenize the data, let's design a machine learning experiment.\n", 762 | "\n", 763 | "As before, we won't focus too much on validation, opting for a simple train-test split.\n", 764 | "\n", 765 | "__To be completely rigorous,__ we've committed a small crime here: we used the whole data for tokenization and vocabulary building. A stricter way would be to do that part on the training set only. You may want to do that and measure the magnitude of changes." 
766 | ] 767 | }, 768 | { 769 | "cell_type": "code", 770 | "execution_count": null, 771 | "metadata": {}, 772 | "outputs": [], 773 | "source": [ 774 | "from sklearn.model_selection import train_test_split\n", 775 | "\n", 776 | "data_train, data_val = train_test_split(data, test_size=0.2, random_state=42)\n", 777 | "data_train.index = range(len(data_train))\n", 778 | "data_val.index = range(len(data_val))\n", 779 | "\n", 780 | "print(\"Train size = \", len(data_train))\n", 781 | "print(\"Validation size = \", len(data_val))" 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": null, 787 | "metadata": {}, 788 | "outputs": [], 789 | "source": [ 790 | "# Function that assembles a batch\n", 791 | "def make_batch(data, max_len=None, word_dropout=0):\n", 792 | "    \"\"\"\n", 793 | "    Creates a keras-friendly dict from the batch data.\n", 794 | "    :param word_dropout: replaces token index with UNK_IX with this probability\n", 795 | "    :returns: a dict with {'Title' : int32[batch, title_max_len], ...}, one matrix per input\n", 796 | "    \"\"\"\n", 797 | "    batch = {}\n", 798 | "    batch[\"Title\"] = as_matrix(data[\"Title\"].values, max_len)\n", 799 | "    batch[\"FullDescription\"] = as_matrix(data[\"FullDescription\"].values, max_len)\n", 800 | "    batch['Categorical'] = categorical_vectorizer.transform(data[categorical_columns].apply(dict, axis=1))\n", 801 | "\n", 802 | "    if word_dropout != 0:\n", 803 | "        batch[\"FullDescription\"] = apply_word_dropout(batch[\"FullDescription\"], 1. 
- word_dropout)\n", 804 | "    # target_column = \"Log1pSalary\"\n", 805 | "    if target_column in data.columns:\n", 806 | "        batch[target_column] = data[target_column].values\n", 807 | "\n", 808 | "    return batch\n", 809 | "\n", 810 | "\n", 811 | "def apply_word_dropout(matrix, keep_prop, replace_with=UNK_IX, pad_ix=PAD_IX,):\n", 812 | "    dropout_mask = np.random.choice(2, np.shape(matrix), p=[keep_prop, 1 - keep_prop])\n", 813 | "    dropout_mask &= matrix != pad_ix\n", 814 | "    # np.full_like builds a matrix of the same shape as `matrix`, filled with replace_with\n", 815 | "    return np.choose(dropout_mask, [matrix, np.full_like(matrix, replace_with)]) # where the mask is 1, the entry of matrix is replaced with replace_with" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": null, 821 | "metadata": {}, 822 | "outputs": [], 823 | "source": [ 824 | "make_batch(data_train[:3], max_len=10)" 825 | ] 826 | }, 827 | { 828 | "cell_type": "markdown", 829 | "metadata": {}, 830 | "source": [ 831 | "#### Architecture\n", 832 | "\n", 833 | "Our basic model consists of three branches:\n", 834 | "* Title encoder\n", 835 | "* Description encoder\n", 836 | "* Categorical features encoder\n", 837 | "\n", 838 | "We will then feed all 3 branches into one common network that predicts salary.\n", 839 | "\n", 840 | "" 841 | ] 842 | }, 843 | { 844 | "cell_type": "markdown", 845 | "metadata": {}, 846 | "source": [ 847 | "This clearly doesn't fit into keras' __Sequential__ interface. 
To build such a network, one will have to use __[Keras Functional API](https://keras.io/models/model/)__.\n", 848 | "\n", 849 | "https://keras.io/layers/merge/" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": null, 855 | "metadata": {}, 856 | "outputs": [], 857 | "source": [ 858 | "import keras\n", 859 | "#from keras.models import Sequential\n", 860 | "import keras.layers as L" 861 | ] 862 | }, 863 | { 864 | "cell_type": "code", 865 | "execution_count": null, 866 | "metadata": {}, 867 | "outputs": [], 868 | "source": [ 869 | "def build_model(n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=64):\n", 870 | " \"\"\" Build a model that maps three data sources to a single linear output: predicted log1p(salary) \"\"\"\n", 871 | " \n", 872 | " l_title = L.Input(shape=[None], name=\"Title\")\n", 873 | " l_descr = L.Input(shape=[None], name=\"FullDescription\")\n", 874 | " l_categ = L.Input(shape=[n_cat_features], name=\"Categorical\")\n", 875 | " \n", 876 | " # Build your monster!\n", 877 | " # Dense needs a fixed last dimension, so token ids are embedded and max-pooled over time first.\n", 878 | " x1 = L.GlobalMaxPooling1D()(L.Embedding(n_tokens, hid_size)(l_title))\n", 879 | " x2 = L.GlobalMaxPooling1D()(L.Embedding(n_tokens, hid_size)(l_descr))\n", 880 | " x3 = L.Dense(hid_size, activation='relu')(l_categ)\n", 881 | " added = L.add([x1, x2, x3])\n", 882 | "\n", 883 | " # single linear output: predicted log1p(salary)\n", 884 | " output_layer = keras.layers.Dense(1)(added)\n", 885 | " #output_layer = <...>\n", 886 | " # end of your code\n", 887 | " \n", 888 | " model = keras.models.Model(inputs=[l_title, l_descr, l_categ], outputs=[output_layer])\n", 889 | " model.compile('adam', 'mean_squared_error', metrics=['mean_absolute_error'])\n", 890 | " return model" 891 | ] 892 | }, 893 | { 894 | "cell_type": "code", 895 | "execution_count": null, 896 | "metadata": {}, 897 | "outputs": [], 898 | "source": [ 899 | "model = build_model()\n", 900 | "model.summary() # print the model summary\n", 901 | "\n", 902 | "dummy_pred = model.predict(make_batch(data_train[:100]))\n", 903 | 
"dummy_loss = model.train_on_batch(make_batch(data_train[:100]), data_train['Log1pSalary'][:100])[0]\n", 904 | "assert dummy_pred.shape == (100, 1)\n", 905 | "assert len(np.unique(dummy_pred)) > 20, \"model returns suspiciously few unique outputs. Check your initialization\"\n", 906 | "assert np.ndim(dummy_loss) == 0 and 0. <= dummy_loss <= 250., \"make sure you minimize MSE\"" 907 | ] 908 | }, 909 | { 910 | "cell_type": "markdown", 911 | "metadata": {}, 912 | "source": [ 913 | "#### Training and evaluation\n", 914 | "\n", 915 | "As usual, we're going to feed our monster random minibatches of data. \n", 916 | "(Note: training uses minibatches.) \n", 917 | "\n", 918 | "As we train, we want to monitor not only the loss function, which is computed in log-space, but also the actual error measured in dollars.\n", 919 | "\n", 920 | "(Note: track the error in real currency as well as the log-space loss.) \n" 921 | ] 922 | }, 923 | { 924 | "cell_type": "code", 925 | "execution_count": null, 926 | "metadata": {}, 927 | "outputs": [], 928 | "source": [ 929 | "def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, **kwargs):\n", 930 | " \"\"\" iterates minibatches of data in random order \"\"\"\n", 931 | " while True:\n", 932 | " indices = np.arange(len(data))\n", 933 | " if shuffle:\n", 934 | " indices = np.random.permutation(indices) # reshuffle on every pass\n", 935 | "\n", 936 | " for start in range(0, len(indices), batch_size):\n", 937 | " batch = make_batch(data.iloc[indices[start : start + batch_size]], **kwargs)\n", 938 | " target = batch.pop(target_column)\n", 939 | " yield batch, target\n", 940 | " \n", 941 | " if not cycle: break # a single pass unless cycle=True" 942 | ] 943 | }, 944 | { 945 | "cell_type": "markdown", 946 | "metadata": {}, 947 | "source": [ 948 | "### Model training\n", 949 | "\n", 950 | "We can now fit our model the usual minibatch way. The interesting part is that we train on an infinite stream of minibatches, produced by the `iterate_minibatches` function. 
(With `cycle=True`, `iterate_minibatches` yields minibatches indefinitely.)" 951 | ] 952 | }, 953 | { 954 | "cell_type": "code", 955 | "execution_count": null, 956 | "metadata": {}, 957 | "outputs": [], 958 | "source": [ 959 | "batch_size = 256\n", 960 | "epochs = 10 # definitely too small\n", 961 | "steps_per_epoch = 100 # for full pass over data: (len(data_train) - 1) // batch_size + 1\n", 962 | "\n", 963 | "model = build_model()\n", 964 | "# train the model on the stream of minibatches\n", 965 | "model.fit_generator(iterate_minibatches(data_train, batch_size, cycle=True, word_dropout=0.05), \n", 966 | " epochs=epochs, steps_per_epoch=steps_per_epoch,\n", 967 | " \n", 968 | " validation_data=iterate_minibatches(data_val, batch_size, cycle=True),\n", 969 | " validation_steps=data_val.shape[0] // batch_size\n", 970 | " )" 971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": null, 976 | "metadata": {}, 977 | "outputs": [], 978 | "source": [ 979 | "def print_metrics(model, data, batch_size=batch_size, name=\"\", **kw):\n", 980 | " squared_error = abs_error = num_samples = 0.0\n", 981 | " for batch_x, batch_y in iterate_minibatches(data, batch_size=batch_size, shuffle=False, **kw):\n", 982 | " batch_pred = model.predict(batch_x)[:, 0]\n", 983 | " squared_error += np.sum(np.square(batch_pred - batch_y))\n", 984 | " abs_error += np.sum(np.abs(batch_pred - batch_y))\n", 985 | " num_samples += len(batch_y)\n", 986 | " print(\"%s results:\" % (name or \"\"))\n", 987 | " print(\"Mean square error: %.5f\" % (squared_error / num_samples))\n", 988 | " print(\"Mean absolute error: %.5f\" % (abs_error / num_samples))\n", 989 | " return squared_error, abs_error\n", 990 | " \n", 991 | "print_metrics(model, data_train, name='Train')\n", 992 | "print_metrics(model, data_val, name='Val');" 993 | ] 994 | }, 995 | { 996 | "cell_type": "markdown", 997 | "metadata": {}, 998 | "source": [ 999 | "### Bonus part: explaining model predictions\n", 1000 | "\n", 1001 | "It's usually a good idea to understand how your model works 
before you let it make actual decisions. It's simple for linear models: just see which words learned positive or negative weights. However, it's much harder for neural networks that learn complex nonlinear dependencies.\n", 1002 | "(Note: linear models are easier to interpret than nonlinear ones.)\n", 1003 | "\n", 1004 | "There are, however, some ways to look inside the black box:\n", 1005 | "(Note: ways to inspect a black-box model.)\n", 1006 | "* Seeing how model responds to input perturbations \n", 1007 | "(Note: observe how the prediction changes when the input changes.) \n", 1008 | "* Finding inputs that maximize/minimize activation of some chosen neurons (_read more [on distill.pub](https://distill.pub/2018/building-blocks/)_) \n", 1009 | "(Note: pick neurons and search for the inputs that maximize or minimize their activation.)\n", 1010 | "* Building local linear approximations to your neural network: [article](https://arxiv.org/abs/1602.04938), [eli5 library](https://github.com/TeamHG-Memex/eli5/tree/master/eli5/formatters) \n", 1011 | "(Note: fit a local linear approximation of the network.) \n", 1012 | "Today we're going to try the first method, simply because it's the simplest one." 1013 | ] 1014 | }, 1015 | { 1016 | "cell_type": "code", 1017 | "execution_count": null, 1018 | "metadata": {}, 1019 | "outputs": [], 1020 | "source": [ 1021 | "def explain(model, sample, col_name='Title'):\n", 1022 | " \"\"\" Computes the effect each word had on model predictions \"\"\"\n", 1023 | " sample = dict(sample)\n", 1024 | " sample_col_tokens = [tokens[token_to_id.get(tok, 0)] for tok in sample[col_name].split()]\n", 1025 | " data_drop_one_token = pd.DataFrame([sample] * (len(sample_col_tokens) + 1)) # one row per dropped token, plus an untouched baseline row\n", 1026 | "\n", 1027 | " for drop_i in range(len(sample_col_tokens)):\n", 1028 | " data_drop_one_token.loc[drop_i, col_name] = ' '.join(UNK if i == drop_i else tok\n", 1029 | " for i, tok in enumerate(sample_col_tokens)) \n", 1030 | "\n", 1031 | " *predictions_drop_one_token, baseline_pred = model.predict(make_batch(data_drop_one_token))[:, 0]\n", 1032 | " diffs = baseline_pred - predictions_drop_one_token # positive: the token pushed the prediction up\n", 1033 | " return list(zip(sample_col_tokens, diffs))" 1034 | ] 1035 | }, 
1036 | { 1037 | "cell_type": "code", 1038 | "execution_count": null, 1039 | "metadata": {}, 1040 | "outputs": [], 1041 | "source": [ 1042 | "from IPython.display import HTML, display_html\n", 1043 | "\n", 1044 | "def draw_html(tokens_and_weights, cmap=plt.get_cmap(\"bwr\"), display=True,\n", 1045 | " token_template=\"\"\"<span style=\"background-color: {color_hex}\">{token}</span>\"\"\",\n", 1046 | " font_style=\"font-size:14px;\"\n", 1047 | " ):\n", 1048 | " \n", 1049 | " def get_color_hex(weight):\n", 1050 | " rgba = cmap(1. / (1 + np.exp(weight)), bytes=True)\n", 1051 | " return '#%02X%02X%02X' % rgba[:3]\n", 1052 | " \n", 1053 | " tokens_html = [\n", 1054 | " token_template.format(token=token, color_hex=get_color_hex(weight))\n", 1055 | " for token, weight in tokens_and_weights\n", 1056 | " ]\n", 1057 | " \n", 1058 | " \n", 1059 | " raw_html = \"\"\"<p style=\"{}\">{}</p>\"\"\".format(font_style, ' '.join(tokens_html))\n", 1060 | " if display:\n", 1061 | " display_html(HTML(raw_html))\n", 1062 | " \n", 1063 | " return raw_html\n", 1064 | " " 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "code", 1069 | "execution_count": null, 1070 | "metadata": {}, 1071 | "outputs": [], 1072 | "source": [ 1073 | "i = 36605\n", 1074 | "tokens_and_weights = explain(model, data.loc[i], \"Title\")\n", 1075 | "draw_html([(tok, weight * 5) for tok, weight in tokens_and_weights], font_style='font-size:20px;');\n", 1076 | "\n", 1077 | "tokens_and_weights = explain(model, data.loc[i], \"FullDescription\")\n", 1078 | "draw_html([(tok, weight * 10) for tok, weight in tokens_and_weights]);" 1079 | ] 1080 | }, 1081 | { 1082 | "cell_type": "code", 1083 | "execution_count": null, 1084 | "metadata": {}, 1085 | "outputs": [], 1086 | "source": [ 1087 | "i = 12077\n", 1088 | "tokens_and_weights = explain(model, data.loc[i], \"Title\")\n", 1089 | "draw_html([(tok, weight * 5) for tok, weight in tokens_and_weights], font_style='font-size:20px;');\n", 1090 | "\n", 1091 | "tokens_and_weights = explain(model, data.loc[i], \"FullDescription\")\n", 1092 | "draw_html([(tok, weight * 10) for tok, weight in tokens_and_weights]);" 1093 | ] 1094 | }, 1095 | { 1096 | "cell_type": "code", 1097 | "execution_count": null, 1098 | "metadata": {}, 1099 | "outputs": [], 1100 | "source": [ 1101 | "i = np.random.randint(len(data))\n", 1102 | "print(\"Index:\", i)\n", 1103 | "print(\"Salary (gbp):\", np.expm1(model.predict(make_batch(data.iloc[i: i+1]))[0, 0]))\n", 1104 | "\n", 1105 | "tokens_and_weights = explain(model, data.loc[i], \"Title\")\n", 1106 | "draw_html([(tok, weight * 5) for tok, weight in tokens_and_weights], font_style='font-size:20px;');\n", 1107 | "\n", 1108 | "tokens_and_weights = explain(model, data.loc[i], \"FullDescription\")\n", 1109 | "draw_html([(tok, weight * 10) for tok, weight in tokens_and_weights]);" 1110 | ] 1111 | }, 1112 | { 1113 | "cell_type": "markdown", 1114 | "metadata": {}, 1115 | "source": [ 1116 | "__Terrible start-up idea #1962:__ make a tool that automatically rephrases your job description (or CV) to meet salary expectations :)" 1117 | ] 1118 | }, 1119 | { 1120 | "cell_type": "code", 1121 | "execution_count": null, 1122 | "metadata": {}, 1123 | "outputs": [], 1124 | "source": [] 1125 | } 1126 | ], 1127 | "metadata": { 1128 | "kernelspec": { 1129 | "display_name": "Python 3", 1130 | "language": "python", 1131 | "name": "python3" 1132 | }, 1133 | "language_info": { 1134 | "codemirror_mode": { 1135 | "name": "ipython", 1136 | "version": 3 1137 | }, 1138 | "file_extension": ".py", 1139 | "mimetype": "text/x-python", 1140 | "name": "python", 1141 | "nbconvert_exporter": "python", 1142 | "pygments_lexer": "ipython3", 1143 | "version": "3.6.6" 1144 | } 1145 | }, 1146 | "nbformat": 4, 1147 | "nbformat_minor": 2 1148 | } 1149 | -------------------------------------------------------------------------------- /resource/material/README.md: -------------------------------------------------------------------------------- 1 | # Material 2 | -------------------------------------------------------------------------------- /resource/slides/MIT-data-science/.gitignore: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/.gitignore -------------------------------------------------------------------------------- /resource/slides/MIT-data-science/Chapter 11. Introduction to Machine Learning.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/Chapter 11. 
Introduction to Machine Learning.pdf -------------------------------------------------------------------------------- /resource/slides/MIT-data-science/Chapter 12. Clustering.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/Chapter 12. Clustering.pdf -------------------------------------------------------------------------------- /resource/slides/MIT-data-science/Chapter13,14,15_MJLEE.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/Chapter13,14,15_MJLEE.pptx -------------------------------------------------------------------------------- /resource/slides/MIT-data-science/MIT6_0002F16_lec1_cwjun.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/MIT6_0002F16_lec1_cwjun.pdf -------------------------------------------------------------------------------- /resource/slides/MIT-data-science/MIT6_0002F16_lec2_cwjun.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/MIT6_0002F16_lec2_cwjun.pdf -------------------------------------------------------------------------------- /resource/slides/MIT-data-science/MIT6_0002F16_lec5_lec6_ssg.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/MIT6_0002F16_lec5_lec6_ssg.pdf 
-------------------------------------------------------------------------------- /resource/slides/MIT-data-science/MIT6_0002F16_lec9_Eon.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/MIT6_0002F16_lec9_Eon.pdf -------------------------------------------------------------------------------- /resource/slides/README.md: -------------------------------------------------------------------------------- 1 | # Slides 2 | -------------------------------------------------------------------------------- /resource/slides/deeppavlov/.gitignore: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/deeppavlov/.gitignore -------------------------------------------------------------------------------- /resource/slides/deeppavlov/deeppavlov_Automatic spelling correction.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/deeppavlov/deeppavlov_Automatic spelling correction.pdf -------------------------------------------------------------------------------- /resource/slides/linear-algebra/Chapter_3_Least_square.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/linear-algebra/Chapter_3_Least_square.pptx -------------------------------------------------------------------------------- /resource/slides/linear-algebra/README.md: -------------------------------------------------------------------------------- 1 | # Linear_algebra 2 | 
-------------------------------------------------------------------------------- /resource/slides/paper-review/.gitignore: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/.gitignore -------------------------------------------------------------------------------- /resource/slides/paper-review/Character-Aware Neural Language Models.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/Character-Aware Neural Language Models.pdf -------------------------------------------------------------------------------- /resource/slides/paper-review/Character-level CNN for text classification.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/Character-level CNN for text classification.pptx -------------------------------------------------------------------------------- /resource/slides/paper-review/Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers.pptx -------------------------------------------------------------------------------- /resource/slides/paper-review/Learning phrase representation using RNN Encoder-Decoder for SMT.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/Learning phrase representation using RNN Encoder-Decoder for SMT.pdf -------------------------------------------------------------------------------- /resource/slides/paper-review/MASS.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/MASS.pdf -------------------------------------------------------------------------------- /resource/slides/paper-review/Robustly optimized BERT Pretraining Approaches.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/Robustly optimized BERT Pretraining Approaches.pptx -------------------------------------------------------------------------------- /resource/slides/paper-review/TransformerXL.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/TransformerXL.pdf -------------------------------------------------------------------------------- /resource/slides/paper-review/VDCNN.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/VDCNN.pdf -------------------------------------------------------------------------------- /resource/slides/paper-review/seqtoseq_attention_20190417.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/seqtoseq_attention_20190417.pdf -------------------------------------------------------------------------------- /resource/slides/soynlp/Soynlp 2일차.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/Soynlp 2일차.pptx -------------------------------------------------------------------------------- /resource/slides/soynlp/empty: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/empty -------------------------------------------------------------------------------- /resource/slides/soynlp/fastcampus_1일차.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_1일차.pptx -------------------------------------------------------------------------------- /resource/slides/soynlp/fastcampus_day3/From frequency to meaning, Vector space models of semantics.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/From frequency to meaning, Vector space models of semantics.pdf -------------------------------------------------------------------------------- /resource/slides/soynlp/fastcampus_day3/Korean conjugation.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/Korean conjugation.pdf -------------------------------------------------------------------------------- /resource/slides/soynlp/fastcampus_day3/Korean lemmatization.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/Korean lemmatization.pdf -------------------------------------------------------------------------------- /resource/slides/soynlp/fastcampus_day3/L2_L1 regularization.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/L2_L1 regularization.pdf -------------------------------------------------------------------------------- /resource/slides/soynlp/fastcampus_day3/LSA.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/LSA.pdf -------------------------------------------------------------------------------- /resource/slides/soynlp/fastcampus_day3/Logistic regression with L1, L2 regularization and keyword extraction.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/Logistic regression with L1, L2 regularization and keyword extraction.pdf -------------------------------------------------------------------------------- /resource/slides/soynlp/fastcampus_day3/Neural Word Embedding as Implicit Matrix Factorization.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/Neural Word Embedding as Implicit Matrix Factorization.pdf -------------------------------------------------------------------------------- /resource/slides/yandex/2월 2째주-yandex- week04_seq2seq_seminar.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/yandex/2월 2째주-yandex- week04_seq2seq_seminar.pptx -------------------------------------------------------------------------------- /resource/slides/yandex/2월 3째주-yandex-week04_seq2seq_seminar layer normalization.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/yandex/2월 3째주-yandex-week04_seq2seq_seminar layer normalization.pptx -------------------------------------------------------------------------------- /resource/slides/yandex/yandex-week-07-mt-02.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/yandex/yandex-week-07-mt-02.pptx --------------------------------------------------------------------------------