├── .DS_Store
├── .gitignore
├── README.md
├── codes
    ├── deeppavlov
    │   └── README.md
    ├── snli
    │   ├── README.md
    │   ├── core
    │   │   ├── chat_log.py
    │   │   └── create_dir.py
    │   ├── dataset
    │   │   ├── dataset_provider.py
    │   │   └── readers
    │   │   │   └── snli_import.py
    │   └── snli.py
    ├── soynlp
    │   └── empty
    └── yandex
    │   ├── README.md
    │   ├── week01-embedding-seminar.ipynb
    │   └── week02_classification_seminar.ipynb
└── resource
    ├── material
        └── README.md
    └── slides
        ├── MIT-data-science
            ├── .gitignore
            ├── Chapter 11. Introduction to Machine Learning.pdf
            ├── Chapter 12. Clustering.pdf
            ├── Chapter13,14,15_MJLEE.pptx
            ├── MIT6_0002F16_lec1_cwjun.pdf
            ├── MIT6_0002F16_lec2_cwjun.pdf
            ├── MIT6_0002F16_lec5_lec6_ssg.pdf
            └── MIT6_0002F16_lec9_Eon.pdf
        ├── README.md
        ├── deeppavlov
            ├── .gitignore
            └── deeppavlov_Automatic spelling correction.pdf
        ├── linear-algebra
            ├── Chapter_3_Least_square.pptx
            └── README.md
        ├── paper-review
            ├── .gitignore
            ├── Character-Aware Neural Language Models.pdf
            ├── Character-level CNN for text classification.pptx
            ├── Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers.pptx
            ├── Learning phrase representation using RNN Encoder-Decoder for SMT.pdf
            ├── MASS.pdf
            ├── Robustly optimized BERT Pretraining Approaches.pptx
            ├── TransformerXL.pdf
            ├── VDCNN.pdf
            └── seqtoseq_attention_20190417.pdf
        ├── soynlp
            ├── Soynlp 2일차.pptx
            ├── empty
            ├── fastcampus_1일차.pptx
            └── fastcampus_day3
            │   ├── From frequency to meaning, Vector space models of semantics.pdf
            │   ├── Korean conjugation.pdf
            │   ├── Korean lemmatization.pdf
            │   ├── L2_L1 regularization.pdf
            │   ├── LSA.pdf
            │   ├── Logistic regression with L1, L2 regularization and keyword extraction.pdf
            │   └── Neural Word Embedding as Implicit Matrix Factorization.pdf
        └── yandex
            ├── 2월 2째주-yandex- week04_seq2seq_seminar.pptx
            ├── 2월 3째주-yandex-week04_seq2seq_seminar layer normalization.pptx
            └── yandex-week-07-mt-02.pptx


/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/.DS_Store


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/.gitignore


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # DeepNLP2019
 2 | 
 3 | This repository collects the materials of the natural language processing lab at Modulabs (모두의 연구소).  
 4 | The study group meets every Wednesday evening from 8:00 to 10:30 at the Modulabs Gangnam campus.  
 5 | Study schedule: [Schedule](https://docs.google.com/spreadsheets/d/1-m9TveaMZ54EVI-ikGYcp1orD1-GsIwXZbaZDi0oW4c)
 6 | 
 7 | ---
 8 | 
 9 | ## How to use this repository
10 | 
11 | * Upload all presentation slides to `resource/slides`.
12 | * Upload reference and study materials to `resource/material`.
13 |   * Create a subfolder per project and put the files there.
14 |   * e.g. materials on linear algebra go in a subfolder of material: `resource/material/linear_algebra/`
15 | * Upload your source code to `codes` in the following layout.
16 |   * Create a subfolder per project and put the files there.
17 |   * e.g. for yandex code: `codes/yandex/2week_text_classification.ipynb`
18 | * Name every file so that its content is clear, which makes later reference easier.
19 |   * `codes/yandex/homework.ipynb`(X) -> `codes/yandex/s2s_homework.ipynb`
20 | 
21 | 
22 | ## Study materials
23 | 
24 | 
25 | * Previous materials (~2018): [Link](http://github.com/modulabs/DeepNLP)
26 | * Git Book: [Link](https://nlp.gitbook.io/book/)
27 | * 텐서플로와 머신러닝으로 시작하는 자연어 처리: 로지스틱 회귀부터 텐서플로우까지 (Korean-language book on natural language processing with TensorFlow and machine learning): [Link](https://book.naver.com/bookdb/book_detail.nhn?bid=14488487)
28 | 
29 | 
30 | ## Questions about using GitHub
31 | 
32 | Please send questions about using GitHub to the address below.  
33 | 조중현: reniew2@gmail.com
34 | 


--------------------------------------------------------------------------------
/codes/deeppavlov/README.md:
--------------------------------------------------------------------------------
1 | # DeepPavlov
2 | 


--------------------------------------------------------------------------------
/codes/snli/README.md:
--------------------------------------------------------------------------------
1 | # SNLI Competition
2 | 
3 | [SNLI Leaderboard](https://nlp.stanford.edu/projects/snli/)
4 | 
5 | [Code Reference](https://github.com/brmson/dataset-sts/tree/master/models)
6 | 
7 | The `data_in` directory must contain the GloVe file and the SNLI dataset.
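
A minimal preflight sketch for the requirement above. The helper name `missing_inputs` is hypothetical and not part of the repository; the file names are the ones `snli.py` reads from `data_in/snli/`.

```python
import os

# File names taken from snli.py; the helper itself is illustrative only.
REQUIRED_FILES = (
    "snli_1.0_train.jsonl",
    "snli_1.0_dev.jsonl",
    "snli_1.0_test.jsonl",
    "glove.840B.300d.txt",
)


def missing_inputs(data_dir="data_in/snli/"):
    """Return the required files that are absent from data_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing from data_in/snli/: " + ", ".join(missing))
    else:
        print("All input files found.")
```

Running this before `snli.py` reports missing inputs up front instead of failing partway through data loading.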


--------------------------------------------------------------------------------
/codes/snli/core/chat_log.py:
--------------------------------------------------------------------------------
  1 | import logging
  2 | import os
  3 | 
  4 | class SingletonType(type):
  5 |     def __call__(cls, *args, **kwargs):
  6 |         try:
  7 |             return cls.__instance
  8 |         except AttributeError:
  9 |             cls.__instance = super(SingletonType, cls).__call__(*args, **kwargs)
 10 |             return cls.__instance
 11 | 
 12 | 
 13 | class CustomLogger(object, metaclass=SingletonType):
 14 |     # Python 3 ignores a `__metaclass__` attribute, so the metaclass is passed as a keyword.
 15 |     _logger = None
 16 | 
 17 |     def __init__(self):
 18 |         self._logger = logging.getLogger("cai_chatbot_framework")
 19 |         self._logger.setLevel(logging.INFO)
 20 |         formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 21 | 
 22 |         dirname = 'data_out/logs/'
 23 |         # makedirs also creates a missing parent directory such as data_out/.
 24 |         os.makedirs(dirname, exist_ok=True)
 25 | 
 26 |         # Log to a single file and echo every record to the console.
 27 |         file_handler = logging.FileHandler(dirname + 'chatbot_framework.log')
 28 |         stream_handler = logging.StreamHandler()
 29 | 
 30 |         # Attach the formatter; otherwise the handlers emit bare messages.
 31 |         file_handler.setFormatter(formatter)
 32 |         stream_handler.setFormatter(formatter)
 33 | 
 34 |         # Register both handlers on the shared logger.
 35 |         self._logger.addHandler(file_handler)
 36 |         self._logger.addHandler(stream_handler)
 37 | 
 38 |     def get_logger(self):
 39 |         return self._logger
 40 | 
 41 | 
 42 | 
 43 | 
 44 | # mylogger = logging.getLogger("chatbot_framwork")
 45 | # mylogger.setLevel(logging.INFO)  # set the logging level
 46 | 
 47 | # formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 48 | 
 49 | # # handler: sets where the logged records are written
 50 | # # file handler; the default mode is 'a' (append)
 51 | # file_handler = logging.FileHandler('data_out/logs/chatbot_framework.log')
 52 | # stream_hander = logging.StreamHandler()
 53 | 
 54 | # # attach the formatter to the handlers
 55 | # file_handler.setFormatter(formatter)
 56 | # stream_hander.setFormatter(formatter)
 57 | 
 58 | # # register the handlers on the logger
 59 | # mylogger.addHandler(stream_hander)
 60 | # mylogger.addHandler(file_handler)
 61 | 
 62 | # mylogger.info("server start!!!")
 63 | 
 64 | 
 65 | 
 66 | # if __name__ =='__main__':
 67 | #     # logging.info("hello world")
 68 | #     # logging.error("something wrong!")
 69 | 
 70 | #     mylogger = logging.getLogger("chatbot_framwork")
 71 | #     mylogger.setLevel(logging.INFO)  # set the logging level
 72 | 
 73 | #     formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 74 | 
 75 | #     # handler: sets where the logged records are written
 76 | #     # file handler; the default mode is 'a' (append)
 77 | #     file_handler = logging.FileHandler('data_out/logs/chatbot_framework.log')
 78 | #     stream_hander = logging.StreamHandler()
 79 | 
 80 | #     # attach the formatter to the handlers
 81 | #     file_handler.setFormatter(formatter)
 82 | #     stream_hander.setFormatter(formatter)
 83 | 
 84 | #     # register the handlers on the logger
 85 | #     mylogger.addHandler(stream_hander)
 86 | #     mylogger.addHandler(file_handler)
 87 | 
 88 | #     mylogger.info("server start!!!")
 89 | 
 90 | 
 91 | 
 92 | 
 93 | # LOGGING_LEVELS = {'critical': logging.CRITICAL,
 94 | #                   'error': logging.ERROR,
 95 | #                   'warning': logging.WARNING,
 96 | #                   'info': logging.INFO,
 97 | #                   'debug': logging.DEBUG}
 98 | 
 99 | # def init():
100 | #     parser = optparse.OptionParser()
101 | #     parser.add_option('-l', '--logging-level', help='Logging level')
102 | #     parser.add_option('-f', '--logging-file', help='Logging file name')
103 | #     (options, args) = parser.parse_args()
104 | #     logging_level = LOGGING_LEVELS.get(options.logging_level, logging.NOTSET)
105 | #     logging.basicConfig(level=logging_level, filename=options.logging_file,
106 | #                         format='%(asctime)s %(levelname)s: %(message)s',
107 | #                         datefmt='%Y-%m-%d %H:%M:%S')
108 | 
109 | # logging.debug("log for debugging")
110 | # logging.info("leave helpful information here")
111 | # logging.warning("something to watch out for!")
112 | # logging.error("error!!!")
113 | # logging.critical("critical error!!")
114 | 
115 | 
116 | 
117 | # def my_logger(original_function):
118 | #     import logging
119 | #     logging.basicConfig(filename='./logs/{}.log'.format(original_function.__name__), level=logging.INFO)
120 |     
121 | #     def wrapper(*args, **kwargs):
122 | #         timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M')
123 | #         logging.info(
124 | #             '[{}] call result args - {}, kwargs - {}'.format(timestamp, args, kwargs))
125 | #         return original_function(*args, **kwargs)
126 | 
127 | #     return wrapper
128 | 
129 | # # add timing
130 | # def my_timer(original_function):  #1
131 | #     import time
132 | 
133 | #     def wrapper(*args, **kwargs):
134 | #         t1 = time.time()
135 | #         result = original_function(*args, **kwargs)
136 | #         t2 = time.time() - t1
137 | #         print('Total run time of function {}: {} seconds'.format(original_function.__name__, t2))
138 | #         return result
139 | 
140 | #     return wrapper
141 | 
142 | # @my_timer
143 | # @my_logger
144 | # def display_info(name, age):
145 | #     time.sleep(1)
146 | #     print('display_info({}, {}) was executed.'.format(name, age))
147 | 
148 | # display_info("John", 25)
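
A short sketch of how the metaclass-based singleton behaves in Python 3. In Python 3 a `__metaclass__` class attribute is an ordinary attribute with no effect; the metaclass has to be given as `metaclass=` in the class header. The class names `Py2Style` and `Py3Style` are made up for illustration, and the dictionary cache is a variant of the attribute cache used in the file.

```python
class SingletonType(type):
    """Metaclass that caches one instance per class."""

    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class Py2Style(object):
    __metaclass__ = SingletonType  # plain class attribute in Python 3, no effect


class Py3Style(metaclass=SingletonType):
    pass


print(Py2Style() is Py2Style())  # False: a new object on every call
print(Py3Style() is Py3Style())  # True: the cached instance is reused
```

For a logger this matters because every fresh instance would add another pair of handlers to the same named logger, so each record would be written several times.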


--------------------------------------------------------------------------------
/codes/snli/core/create_dir.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python3
 2 | #-*- coding: utf-8 -*-
 3 | 
 4 | import os
 5 | 
 6 | class Dir_Utilities(object):
 7 | 
 8 |     def __init__(self, data_in_path, model_nm):
 9 |         self.data_out = 'data_out/'
10 |         self.data_in_path = data_in_path
11 |         self.export_pb_path=os.path.join(self.data_out, 'export_pb/')
12 |         self.log_path=os.path.join(self.data_out, 'logs/')
13 |         self.backup_path=os.path.join(self.data_out, 'backups/')
14 |         self.model_nm = model_nm
15 | 
16 |         if self.model_nm == 'onehot':
17 |             self.nn_models_dssm_path = os.path.join(self.data_out, 'nn_models_onehot/')
18 |         else:
19 |             self.nn_models_dssm_path = os.path.join(self.data_out, 'nn_models_embed/')
20 | 
21 |     def create_domain_dir(self, dir_path):
22 |         """Create the folder for a domain if it does not exist yet."""
23 |         if os.path.isdir(dir_path):
24 |             print("{} --- Folder already exists \n".format(dir_path))
25 |         else:
26 |             os.makedirs(dir_path, exist_ok=True)  # create the requested path, not only data_out
27 |             print("{} --- Folder created \n".format(dir_path))
28 | 
29 |     def folder_init(self):
30 |         print("---- start test -----")
31 | 
32 |         self.create_domain_dir(self.data_out)
33 |         self.create_domain_dir(self.export_pb_path)
34 |         self.create_domain_dir(self.log_path)
35 |         self.create_domain_dir(self.backup_path)
36 |         self.create_domain_dir(self.nn_models_dssm_path)
37 | 
38 |         print("---- end test -----")
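
A self-contained sketch of the folder layout that `folder_init` is meant to produce, run against a temporary directory so nothing is written to the working tree. The sub-folder names are copied from `Dir_Utilities`; the helper `make_output_tree` is hypothetical.

```python
import os
import tempfile

# Sub-folder names copied from Dir_Utilities (the non-'onehot' case).
SUBDIRS = ("export_pb", "logs", "backups", "nn_models_embed")


def make_output_tree(root):
    """Create root and each sub-folder; return the paths that were newly created."""
    created = []
    for path in [root] + [os.path.join(root, name) for name in SUBDIRS]:
        if not os.path.isdir(path):
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "data_out")
    first = make_output_tree(root)
    second = make_output_tree(root)
    print(len(first), len(second))  # 5 0
```

The second call creates nothing, which is the idempotent behaviour a setup step like this should have.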


--------------------------------------------------------------------------------
/codes/snli/dataset/dataset_provider.py:
--------------------------------------------------------------------------------
 1 | import sys
 2 | sys.path.append('../')
 3 | 
 4 | from configs.model_config import Config
 5 | 
 6 | class DatasetProvider:
 7 | 
 8 |     def __init__(self, config, experiment_dir=None):
 9 | 
10 |         self.config = config
11 | 
12 |         if experiment_dir is not None:
13 |             self.config = Config(config.domain_id, experiment_dir)
14 | 
15 |         self._train_input_fn = None
16 |         self._validate_input_fn = None
17 |         self._test_input_fn = None
18 |         self._train_hook = None
19 |         self._validate_hook = None
20 | 
21 |     def setup_train_input_graph(self):
22 |         raise NotImplementedError
23 | 
24 |     def setup_validate_input_graph(self):
25 |         raise NotImplementedError
26 | 
27 |     def setup_test_input_graph(self, input_type, input_value):
28 |         raise NotImplementedError
29 | 
30 |     @property
31 |     def train_input_fn(self):
32 |         """
33 |         Input function for the training data.
34 |         Called by estimator.train() in a task runner, e.g. ner_runner.
35 |         :return: the training input_fn, built lazily on first access
36 |         """
37 |         if self._train_input_fn is None:
38 |             self.setup_train_input_graph()
39 |         return self._train_input_fn
40 | 
41 |     @property
42 |     def validate_input_fn(self):
43 |         if self._validate_input_fn is None:
44 |             self.setup_validate_input_graph()
45 |         return self._validate_input_fn
46 | 
47 |     @property
48 |     def test_input_fn(self):
49 |         return self._test_input_fn
50 | 
51 |     @property
52 |     def train_hook(self):
53 |         if self._train_hook is None:
54 |             self.setup_train_input_graph()
55 |         return self._train_hook
56 | 
57 |     @property
58 |     def validate_hook(self):
59 |         if self._validate_hook is None:
60 |             self.setup_validate_input_graph()
61 |         return self._validate_hook
62 | 
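
A stripped-down sketch of the lazy-property pattern used by `DatasetProvider`. It omits the `Config` dependency, which is not part of this listing, and the subclass `ToyProvider` with its in-memory batch is purely hypothetical.

```python
class LazyProvider:
    """Stand-in for DatasetProvider without the Config dependency (assumption)."""

    def __init__(self):
        self._train_input_fn = None
        self.setup_calls = 0

    def setup_train_input_graph(self):
        raise NotImplementedError

    @property
    def train_input_fn(self):
        # Build on first access, then reuse the cached callable.
        if self._train_input_fn is None:
            self.setup_train_input_graph()
        return self._train_input_fn


class ToyProvider(LazyProvider):
    """Hypothetical subclass that serves one fixed in-memory batch."""

    def setup_train_input_graph(self):
        self.setup_calls += 1
        self._train_input_fn = lambda: ({"q1": [[1, 2]], "q2": [[3, 4]]}, [[0, 1, 0]])


provider = ToyProvider()
features, labels = provider.train_input_fn()
provider.train_input_fn  # second access does not rebuild
print(provider.setup_calls)  # 1
```

A concrete subclass only has to assign `_train_input_fn` inside its setup method; the property takes care of calling that method once.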


--------------------------------------------------------------------------------
/codes/snli/dataset/readers/snli_import.py:
--------------------------------------------------------------------------------
  1 | import json
  2 | import numpy as np
  3 | import tensorflow as tf
  4 | import tarfile
  5 | import tempfile
  6 | import json
  7 | import os
  8 | import re
  9 | import sys
 10 | 
 11 | from tensorflow.keras.utils import to_categorical
 12 | from tensorflow.keras.preprocessing.text import Tokenizer
 13 | from tensorflow.keras.preprocessing.sequence import pad_sequences
 14 | 
 15 | def extract_tokens_from_binary_parse(parse):
 16 |     return parse.replace('(', ' ').replace(')', ' ').replace('-LRB-', '(').replace('-RRB-', ')').split()
 17 | 
 18 | def yield_examples(fn, skip_no_majority=True, limit=None):
 19 |   with open(fn, encoding='utf-8') as f:  # make sure the file is closed
 20 |     for i, line in enumerate(f):
 21 |       if limit is not None and i >= limit:  # read at most `limit` lines
 22 |         break
 23 |       data = json.loads(line)
 24 |       if skip_no_majority and data['gold_label'] == '-':  # no annotator majority
 25 |         continue
 26 |       s1 = ' '.join(extract_tokens_from_binary_parse(data['sentence1_binary_parse']))
 27 |       s2 = ' '.join(extract_tokens_from_binary_parse(data['sentence2_binary_parse']))
 28 |       yield (data['gold_label'], s1, s2)
 29 | 
 30 | def get_data(fn, limit=None):
 31 |   raw_data = list(yield_examples(fn=fn, limit=limit))
 32 |   left = [s1 for _, s1, s2 in raw_data]
 33 |   right = [s2 for _, s1, s2 in raw_data]
 34 |   print(max(len(x.split()) for x in left))
 35 |   print(max(len(x.split()) for x in right))
 36 | 
 37 |   LABELS = {'contradiction': 0, 'neutral': 1, 'entailment': 2}
 38 |   Y = np.array([LABELS[l] for l, s1, s2 in raw_data])
 39 |   Y = to_categorical(Y, len(LABELS))
 40 | 
 41 |   return left, right, Y
 42 | 
 43 | # data_path = 'data_in/snli/'
 44 | 
 45 | # training = get_data(data_path+'snli_1.0_train.jsonl')
 46 | # validation = get_data(data_path+'snli_1.0_dev.jsonl')
 47 | # test = get_data(data_path+'snli_1.0_test.jsonl')
 48 | 
 49 | # tokenizer = Tokenizer(lower=False, filters='')
 50 | # tokenizer.fit_on_texts(training[0] + training[1])
 51 | 
 52 | # Lowest index from the tokenizer is 1 - we need to include 0 in our vocab count
 53 | # VOCAB = len(tokenizer.word_counts) + 1
 54 | # LABELS = {'contradiction': 0, 'neutral': 1, 'entailment': 2}
 55 | 
 56 | # MAX_LEN = 42
 57 | 
 58 | # to_seq = lambda X: pad_sequences(tokenizer.texts_to_sequences(X), maxlen=MAX_LEN)
 59 | # prepare_data = lambda data: (to_seq(data[0]), to_seq(data[1]), data[2])
 60 | 
 61 | # training = prepare_data(training)
 62 | # validation = prepare_data(validation)
 63 | # test = prepare_data(test)
 64 | 
 65 | # print(training)
 66 | 
 67 | 
 68 | # print('Build model...')
 69 | # print('Vocab size =', VOCAB)
 70 | 
 71 | # GLOVE_STORE = 'precomputed_glove.weights'
 72 | # if USE_GLOVE:
 73 | #   if not os.path.exists(GLOVE_STORE + '.npy'):
 74 | #     print('Computing GloVe')
 75 |   
 76 | #     embeddings_index = {}
 77 | #     f = open(data_path+'glove.840B.300d.txt')
 78 | #     for line in f:
 79 | #       values = line.split(' ')
 80 | #       word = values[0]
 81 | #       coefs = np.asarray(values[1:], dtype='float32')
 82 | #       embeddings_index[word] = coefs
 83 | #     f.close()
 84 |     
 85 | #     # prepare embedding matrix
 86 | #     embedding_matrix = np.zeros((VOCAB, EMBED_HIDDEN_SIZE))
 87 | #     for word, i in tokenizer.word_index.items():
 88 | #       embedding_vector = embeddings_index.get(word)
 89 | #       if embedding_vector is not None:
 90 | #         # words not found in embedding index will be all-zeros.
 91 | #         embedding_matrix[i] = embedding_vector
 92 | #       else:
 93 | #         print('Missing from GloVe: {}'.format(word))
 94 |   
 95 | #     np.save(GLOVE_STORE, embedding_matrix)
 96 | 
 97 | #   print('Loading GloVe')
 98 | #   embedding_matrix = np.load(GLOVE_STORE + '.npy')
 99 | 
100 | #   print('Total number of null word embeddings:')
101 | #   print(np.sum(np.sum(embedding_matrix, axis=1) == 0))
102 | 
103 | #   embed = Embedding(VOCAB, EMBED_HIDDEN_SIZE, weights=[embedding_matrix], input_length=MAX_LEN, trainable=TRAIN_EMBED)
104 | # else:
105 | #   embed = Embedding(VOCAB, EMBED_HIDDEN_SIZE, input_length=MAX_LEN)
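
A small worked example of the two transformations in this reader: flattening a binary parse into tokens and turning labels into one-hot rows. The parse string is made up in the SNLI style rather than taken from the dataset, and the one-hot step is a plain-Python stand-in for `to_categorical`.

```python
def extract_tokens_from_binary_parse(parse):
    # Same logic as in snli_import.py: drop tree brackets, then restore literal parentheses.
    return parse.replace('(', ' ').replace(')', ' ').replace('-LRB-', '(').replace('-RRB-', ')').split()


# Illustrative parse string, not a record from the dataset.
parse = "( ( A man ) ( ( is ( -LRB- probably -RRB- ) ) ( sleeping . ) ) )"
tokens = extract_tokens_from_binary_parse(parse)
print(tokens)  # ['A', 'man', 'is', '(', 'probably', ')', 'sleeping', '.']

LABELS = {'contradiction': 0, 'neutral': 1, 'entailment': 2}
names = ['neutral', 'entailment']
# One row per example, with a 1.0 in the column of the gold label.
one_hot = [[1.0 if col == LABELS[name] else 0.0 for col in range(len(LABELS))]
           for name in names]
print(one_hot)  # [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

The order of the replacements matters: tree brackets are removed before `-LRB-` and `-RRB-` are mapped back, so literal parentheses in the sentence survive.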


--------------------------------------------------------------------------------
/codes/snli/snli.py:
--------------------------------------------------------------------------------
  1 | # coding: utf-8
  2 | 
  3 | import sys
  4 | import tensorflow as tf
  5 | import numpy as np
  6 | import os
  7 | import pandas as pd
  8 | import pickle
  9 | import glob
 10 | 
 11 | from sklearn.model_selection import train_test_split
 12 | from dataset.readers.snli_import import get_data
 13 | 
 14 | from tensorflow import keras
 15 | from tensorflow.keras import layers
 16 | from tensorflow.keras.utils import to_categorical
 17 | from tensorflow.keras.preprocessing.text import Tokenizer
 18 | from tensorflow.keras.preprocessing.sequence import pad_sequences
 19 | from tensorflow.keras.layers import Embedding
 20 | 
 21 | import json
 22 | 
 23 | batch_size = 512
 24 | MAX_LEN = 42
 25 | EPOCHS = 5
 26 | 
 27 | os.environ["CUDA_VISIBLE_DEVICES"]="0" #For TEST
 28 | tf.logging.set_verbosity("INFO")
 29 | 
 30 | data_path = 'data_in/snli/'
 31 | 
 32 | training = get_data(data_path+'snli_1.0_train.jsonl')
 33 | validation = get_data(data_path+'snli_1.0_dev.jsonl')
 34 | test = get_data(data_path+'snli_1.0_test.jsonl')
 35 | 
 36 | tokenizer = Tokenizer(lower=False, filters='')
 37 | tokenizer.fit_on_texts(training[0] + training[1])
 38 | 
 39 | VOCAB = len(tokenizer.word_counts) + 1
 40 | LABELS = {'contradiction': 0, 'neutral': 1, 'entailment': 2}
 41 | 
 42 | ## Define the global variables up front: file names, file locations, directories, and so on.
 43 | 
 44 | DATA_IN_PATH = './data_in/'
 45 | DATA_OUT_PATH = './data_out/'
 46 | 
 47 | ## The hyperparameters needed for training are set below.
 48 | 
 49 | 
 50 | RNN = None
 51 | 
 52 | BATCH_SIZE = 512
 53 | HIDDEN = 128
 54 | BUFFER_SIZE = 1000000
 55 | 
 56 | NUM_LAYERS = 3
 57 | DROPOUT_RATIO = 0.3
 58 | 
 59 | TEST_SPLIT = 0.1
 60 | RNG_SEED = 13371447
 61 | EMBEDDING_DIM = 300
 62 | MAX_SEQ_LEN = 42
 63 | 
 64 | WORD_EMBEDDING_DIM = 100
 65 | CONV_FEATURE_DIM = 300
 66 | CONV_OUTPUT_DIM = 128
 67 | CONV_WINDOW_SIZE = 3
 68 | DROPOUT_RATIO = 0.5
 69 | SIMILARITY_DENSE_FEATURE_DIM = 200
 70 | 
 71 | LAYERS = 1
 72 | EMBED_HIDDEN_SIZE = 300
 73 | SENT_HIDDEN_SIZE = 300
 74 | BATCH_SIZE = 512
 75 | PATIENCE = 4 # 8
 76 | MAX_EPOCHS = 42
 77 | MAX_LEN = 42
 78 | DP = 0.2
 79 | L2 = 4e-6
 80 | ACTIVATION = 'relu'
 81 | OPTIMIZER = 'rmsprop'
 82 | 
 83 | def data_len(texts):
 84 |     # Whitespace token count of each raw sentence, capped at MAX_SEQ_LEN.
 85 |     return np.array([min(len(x.split()), MAX_SEQ_LEN) for x in texts], dtype=np.int32)
 86 | 
 87 | to_seq = lambda X: pad_sequences(tokenizer.texts_to_sequences(X), maxlen=MAX_LEN)
 88 | prepare_data = lambda data: (to_seq(data[0]), to_seq(data[1]), data[2])
 89 | len_data = lambda data: (data_len(data[0]), data_len(data[1]))
 90 | 
 91 | # Measure lengths on the raw text first; after padding every row has length MAX_LEN.
 92 | training_len = len_data(training)
 93 | validation_len = len_data(validation)
 94 | test_len = len_data(test)
 95 | 
 96 | training = prepare_data(training)
 97 | validation = prepare_data(validation)
 98 | test = prepare_data(test)
 99 | 
100 | def save_pickle(file_nm, var_nm):
101 |     with open(file_nm, 'wb') as fp:
102 |         pickle.dump(var_nm, fp)
103 |         print("save {}".format(file_nm))
104 | 
105 | def load_pickle(file_nm):
106 |     with open(file_nm, 'rb') as fp:
107 |         pkl = pickle.load(fp)
108 |         print("load {}".format(file_nm))
109 |         return pkl 
110 | 
111 | # save_pickle(data_path + 'training.pkl', training)
112 | # save_pickle(data_path + 'validation.pkl',validation)
113 | # save_pickle(data_path + 'test.pkl', test)
114 | 
115 | # save_pickle(data_path + 'training_len.pkl', training_len)
116 | # save_pickle(data_path + 'validation_len.pkl',validation_len)
117 | # save_pickle(data_path + 'test_len.pkl', test_len)
118 | 
119 | # training = load_pickle(data_path + 'training.pkl')
120 | # validation = load_pickle(data_path + 'validation.pkl')
121 | # test = load_pickle(data_path + 'test.pkl')
122 | 
123 | # training_len = load_pickle(data_path + 'training_len.pkl')
124 | # validation_len = load_pickle(data_path + 'validation_len.pkl')
125 | # test_len = load_pickle(data_path + 'test_len.pkl')
126 | 
127 | USE_GLOVE = True
128 | TRAIN_EMBED = False
129 | 
130 | def train_input_fn():
131 |     train_dataset = tf.data.Dataset.from_tensor_slices((training[0], training[1], training_len[0],
132 |                                                         training_len[1], training[2])).shuffle(
133 |                 buffer_size=BUFFER_SIZE).prefetch(buffer_size=batch_size).batch(batch_size).repeat(EPOCHS)
134 |     iterator = train_dataset.make_one_shot_iterator()
135 |     q1, q2, q1_len, q2_len, labels = iterator.get_next()
136 |     features = {'q1': q1, "q2": q2, "q1_len": q1_len, "q2_len": q2_len}
137 |     # labels = {'labels': labels}
138 |     return features, labels
139 | 
140 | def valid_input_fn():
141 |     validation_dataset = tf.data.Dataset.from_tensor_slices((validation[0], validation[1], validation_len[0], 
142 |                                                         validation_len[1], validation[2])).shuffle(
143 |                     buffer_size=BUFFER_SIZE).prefetch(buffer_size=batch_size).batch(batch_size).repeat(EPOCHS)
144 |     iterator = validation_dataset.make_one_shot_iterator()
145 |     q1, q2, q1_len, q2_len, labels = iterator.get_next()
146 |     features = {'q1': q1, "q2": q2, "q1_len": q1_len, "q2_len": q2_len}
147 |     # labels = {'labels': labels}
148 | 
149 |     return features, labels
150 | 
151 | GLOVE_STORE = 'precomputed_glove.weights'
152 | if USE_GLOVE:
153 |     if not os.path.exists(GLOVE_STORE + '.npy'):
154 |         print('Computing GloVe')
155 |     
156 |         embeddings_index = {}
157 |         f = open(data_path+'glove.840B.300d.txt')
158 |         for line in f:
159 |             values = line.split(' ')
160 |             word = values[0]
161 |             coefs = np.asarray(values[1:], dtype='float32')
162 |             embeddings_index[word] = coefs
163 |         f.close()
164 |         
165 |         # prepare embedding matrix
166 |         embedding_matrix = np.zeros((VOCAB, EMBEDDING_DIM))
167 |         for word, i in tokenizer.word_index.items():
168 |             embedding_vector = embeddings_index.get(word)
169 |             if embedding_vector is not None:
170 |                 # words not found in embedding index will be all-zeros.
171 |                 embedding_matrix[i] = embedding_vector
172 |             else:
173 |                 print('Missing from GloVe: {}'.format(word))
174 |     
175 |         np.save(GLOVE_STORE, embedding_matrix)
176 | 
177 |     print('Loading GloVe')
178 |     embedding_matrix = np.load(GLOVE_STORE + '.npy')
179 | 
180 |     print('Total number of null word embeddings:')
181 |     print(np.sum(np.sum(embedding_matrix, axis=1) == 0))
182 | 
183 | 
184 | def basic_conv_sementic_network(inputs, name):
185 |     conv_layer = tf.keras.layers.Conv1D(CONV_FEATURE_DIM, 
186 |                                         CONV_WINDOW_SIZE, 
187 |                                         activation=tf.nn.relu, 
188 |                                         name=name + 'conv_1d',
189 |                                         padding='same')(inputs)
190 |     
191 |     max_pool_layer = tf.keras.layers.MaxPool1D(MAX_SEQ_LEN, 
192 |                                                1)(conv_layer)
193 | 
194 |     output_layer = tf.keras.layers.Dense(CONV_OUTPUT_DIM, 
195 |                                          activation=tf.nn.relu,
196 |                                          name=name + 'dense')(max_pool_layer)
197 |     output_layer = tf.squeeze(output_layer, 1)
198 |     
199 |     return output_layer
200 | 
201 | def estimator_model(features, labels, mode):
202 | 
203 |         
204 |     TRAIN = mode == tf.estimator.ModeKeys.TRAIN
205 |     EVAL = mode == tf.estimator.ModeKeys.EVAL
206 |     PREDICT = mode == tf.estimator.ModeKeys.PREDICT
207 |     
208 |     if USE_GLOVE:
209 |         embed = Embedding(VOCAB, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_LEN, trainable=TRAIN_EMBED)
210 |     else:        
211 |         embed = Embedding(VOCAB, EMBEDDING_DIM, input_length=MAX_LEN)
212 | 
213 |     prem = embed(features['q1'])
214 |     hypo = embed(features['q2'])
215 | 
216 |     rnn_kwargs = dict(units=SENT_HIDDEN_SIZE, dropout=DP, recurrent_dropout=DP)  # tf.keras names for output_dim, dropout_W, dropout_U
217 |     SumEmbeddings = layers.Lambda(lambda x: keras.backend.sum(x, axis=1), output_shape=(SENT_HIDDEN_SIZE, ))    
218 |     
219 |     translate = layers.TimeDistributed(layers.Dense(SENT_HIDDEN_SIZE, activation=ACTIVATION))
220 | 
221 |     prem = translate(prem)
222 |     hypo = translate(hypo)
223 | 
224 |     if RNN and LAYERS > 1:
225 |         for l in range(LAYERS - 1):
226 |             rnn = RNN(return_sequences=True, **rnn_kwargs)
227 |             prem = layers.BatchNormalization()(rnn(prem))
228 |             hypo = layers.BatchNormalization()(rnn(hypo))
229 |     
230 |     rnn = SumEmbeddings if not RNN else RNN(return_sequences=False, **rnn_kwargs)
231 |     prem = rnn(prem)
232 |     hypo = rnn(hypo)
233 |     prem = layers.BatchNormalization()(prem)
234 |     hypo = layers.BatchNormalization()(hypo)
235 | 
236 |     joint = keras.layers.concatenate([prem, hypo])
237 |     joint = layers.Dropout(DP)(joint)
238 |     for i in range(3):
239 |         joint = layers.Dense(2 * SENT_HIDDEN_SIZE, activation=ACTIVATION)(joint)
240 |         joint = layers.Dropout(DP)(joint)
241 |         joint = layers.BatchNormalization()(joint)    
242 |     
243 |     	
244 |     # """ For Conv """
245 |     # base_sementic_matrix = basic_conv_sementic_network(base_embedded_matrix, 'base')
246 |     # hypothesis_sementic_matrix = basic_conv_sementic_network(hypothesis_embedded_matrix, 'hypothesis')
247 | 
248 |     # base_sementic_matrix = tf.keras.layers.Dropout(DROPOUT_RATIO)(query)
249 |     # hypothesis_sementic_matrix = tf.keras.layers.Dropout(DROPOUT_RATIO)(sim)    
250 |     
251 |     # merged_matrix = tf.concat([base_sementic_matrix, hypothesis_sementic_matrix], -1)
252 | 
253 |     # similarity_dense_layer = tf.keras.layers.Dense(250,
254 |     #                                          activation=tf.nn.relu)(merged_matrix)
255 |     
256 |     # similarity_dense_layer = tf.keras.layers.Dropout(DROPOUT_RATIO)(similarity_dense_layer)
257 |     
258 |     with tf.variable_scope('output_layer'):
259 |         # pred = tf.keras.layers.Dense(len(LABELS), activation='softmax')(similarity_dense_layer)
260 |         pred = tf.keras.layers.Dense(len(LABELS), activation='softmax')(joint)
261 |         print("prediction: {}".format(pred))
262 |     
263 |     if PREDICT:
264 |         return tf.estimator.EstimatorSpec(
265 |                   mode=mode,
266 |                   predictions={
267 |                       'is_duplicate': pred
268 |                   })
269 |     
270 |     # labels is None in prediction mode
271 |     if labels is not None:
272 |         labels = tf.to_float(labels)
273 |     
274 |     def loss_fn(probs, labels):
275 |         # `pred` is already a softmax output, so compute cross-entropy directly
276 |         # on the probabilities instead of passing them to a *_with_logits op.
277 |         loss = tf.reduce_mean(-tf.reduce_sum(labels * tf.log(probs + 1e-8), axis=1))
278 |         return loss
279 |     
280 |     def evaluate(logits, labels):
281 |         # accuracy = tf.metrics.accuracy(labels, tf.round(logits))
282 | 
283 |         correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(labels, 1))
284 |         acc = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
285 | 
286 |         return acc
287 | 
288 |     loss = loss_fn(pred, labels)
289 |     acc = evaluate(pred, labels)
290 | 
291 |     logging_hook = tf.train.LoggingTensorHook({"loss" : loss,  "accuracy" : acc}, every_n_iter=100)
292 |     
293 |     if EVAL:
294 |         # Compare class indices; rounding one-hot probabilities would score each of the three columns separately.
295 |         accuracy = tf.metrics.accuracy(tf.argmax(labels, 1), tf.argmax(pred, 1))
296 |         eval_metric_ops = {'acc': accuracy}
297 |         return tf.estimator.EstimatorSpec(
298 |                   mode=mode,
299 |                   eval_metric_ops= eval_metric_ops,
300 |                   loss=loss)
301 | 
302 |     elif TRAIN:
303 |         global_step = tf.train.get_global_step()
304 |         train_op = tf.train.AdamOptimizer(1e-3).minimize(loss, global_step)
305 |         return tf.estimator.EstimatorSpec(
306 |                   mode=mode,
307 |                   train_op=train_op,
308 |                   loss=loss,
309 |                   training_hooks= [logging_hook])
310 | 
311 | model_dir = os.path.join(os.getcwd(), DATA_OUT_PATH, "checkpoint/model/")
312 | os.makedirs(model_dir, exist_ok=True)
313 | 
314 | config_tf = tf.estimator.RunConfig(save_checkpoints_steps=500,
315 |                                 save_checkpoints_secs=None,
316 |                                   keep_checkpoint_max=2,
317 |                                   log_step_count_steps=100)
318 | 
319 | model_est = tf.estimator.Estimator(estimator_model, model_dir=model_dir, config=config_tf)
320 | 
321 | model_est.train(train_input_fn)
322 | model_est.evaluate(valid_input_fn)
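
Background on logits versus probabilities in the loss, as a plain-Python sketch. Ops such as `tf.nn.softmax_cross_entropy_with_logits_v2` apply softmax to their `logits` argument internally, so they expect raw scores; handing them the output of a softmax layer applies softmax twice. The scores below are made up for illustration.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_with_logits(logits, one_hot):
    # Mirrors what a *_with_logits loss does: softmax first, then cross-entropy.
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


logits = [4.0, 0.0, -2.0]  # illustrative scores for (contradiction, neutral, entailment)
label = [1.0, 0.0, 0.0]

loss_from_logits = cross_entropy_with_logits(logits, label)
loss_from_probs = cross_entropy_with_logits(softmax(logits), label)  # softmax applied twice
print(round(loss_from_logits, 3), round(loss_from_probs, 3))
```

Because probabilities lie in [0, 1], the doubly normalised loss for three classes cannot fall below ln(1 + 2/e), roughly 0.55, even for a perfect prediction, which flattens the training signal.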


--------------------------------------------------------------------------------
/codes/soynlp/empty:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/codes/soynlp/empty


--------------------------------------------------------------------------------
/codes/yandex/README.md:
--------------------------------------------------------------------------------
1 | # Yandex NLP school
2 | 


--------------------------------------------------------------------------------
/codes/yandex/week02_classification_seminar.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "metadata": {},
   6 |    "source": [
   7 |     "# Large scale text analysis with deep learning (3 points)\n",
   8 |     "\n",
   9 |     "Today we're gonna apply the newly learned tools for the task of predicting job salary.\n",
  10 |     "\n",
  11 |     "<img src=\"https://kaggle2.blob.core.windows.net/competitions/kaggle/3342/media/salary%20prediction%20engine%20v2.png\" width=400px>\n",
  12 |     "\n",
  13 |     "_Special thanks to [Oleg Vasilev](https://github.com/Omrigan/) for the core assignment idea._"
  14 |    ]
  15 |   },
  16 |   {
  17 |    "cell_type": "code",
  18 |    "execution_count": 1,
  19 |    "metadata": {},
  20 |    "outputs": [],
  21 |    "source": [
  22 |     "import numpy as np\n",
  23 |     "import pandas as pd\n",
  24 |     "import matplotlib.pyplot as plt\n",
  25 |     "%matplotlib inline"
  26 |    ]
  27 |   },
  28 |   {
  29 |    "cell_type": "markdown",
  30 |    "metadata": {},
  31 |    "source": [
  32 |     "### About the challenge\n",
  33 |     "For starters, let's download and unpack the data from [here](https://www.dropbox.com/s/5msc5ix7ndyba10/Train_rev1.csv.tar.gz?dl=0). \n",
  34 |     "\n",
  35 |     "You can also get it from [yadisk url](https://yadi.sk/d/vVEOWPFY3NruT7) or the competition [page](https://www.kaggle.com/c/job-salary-prediction/data) (pick `Train_rev1.*`)."
  36 |    ]
  37 |   },
  38 |   {
  39 |    "cell_type": "code",
  40 |    "execution_count": 2,
  41 |    "metadata": {},
  42 |    "outputs": [
  43 |     {
  44 |      "data": {
  45 |       "text/plain": [
  46 |        "(244768, 12)"
  47 |       ]
  48 |      },
  49 |      "execution_count": 2,
  50 |      "metadata": {},
  51 |      "output_type": "execute_result"
  52 |     }
  53 |    ],
  54 |    "source": [
  55 |     "#!curl -L https://www.dropbox.com/s/5msc5ix7ndyba10/Train_rev1.csv.tar.gz?dl=1 -o Train_rev1.csv.tar.gz\n",
  56 |     "#!tar -xvzf ./Train_rev1.csv.tar.gz\n",
  57 |     "data = pd.read_csv(\"./Train_rev1.csv\", index_col=None)\n",
  58 |     "data.shape"
  59 |    ]
  60 |   },
  61 |   {
  62 |    "cell_type": "markdown",
  63 |    "metadata": {},
  64 |    "source": [
  65 |     "One problem with salary prediction is that salaries are oddly distributed: there are many people who are paid standard salaries and a few that get tons of money. The distribution is fat-tailed on the right side, which is inconvenient for MSE minimization.\n",
  68 |     "\n",
  69 |     "There are several techniques to combat this: using a different loss function, predicting log-target instead of raw target or even replacing targets with their percentiles among all salaries in the training set. We'll use the logarithm for now.\n",
  71 |     "\n",
  72 |     "_You can read more [in the official description](https://www.kaggle.com/c/job-salary-prediction#description)._"
  73 |    ]
  74 |   },
  75 |   {
  76 |    "cell_type": "code",
  77 |    "execution_count": 3,
  78 |    "metadata": {},
  79 |    "outputs": [
  80 |     {
  81 |      "data": {
  82 |       "image/png": "iVBORw0KGgoAAAANSUhEUgAAAfYAAAD8CAYAAACFB4ZuAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4wLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvqOYd8AAAHpNJREFUeJzt3X+sZ3V95/HnS34orSKDDIQwsIPtpJWSiHgXpnFjLHRhwKaDG21wmzJ1SabrQoNJd9exbYL1xwZ3U1nZWrrTMmUwVmBR40QHx1nEdU3kx6AjPxwtI7IyZZYZO4gYUlzwvX+cz5Uvw/fe+7137twf5z4fyTff832fzznfz/ne77nv7znncz6fVBWSJKkfXjbfFZAkSbPHxC5JUo+Y2CVJ6hETuyRJPWJilySpR0zskiT1iIldkqQeMbFLktQjJnZJknrkyPmuwEydcMIJtXLlyvmuhrSg3XfffT+squXzXY/JuC9Loxl1f160iX3lypXs2LFjvqshLWhJ/s9812Eq7svSaEbdnz0VL0lSj5jYJUnqERO7JEk9YmKXJKlHTOySJPWIiV2SpB4xsUuS1CMmdkmSesTELklSjyzanudmy8oNX5iyzKPXvHUOaiJJC4f/GxevJZ/YR+EXXJK0WHgqXpKkHjGxS5LUIyZ2SZJ6xMQuSVKPmNglSeoRE7u0RCR5RZJ7knwryUNJ/qzFb0zy/SQ72+OsFk+S65LsTnJ/krMH1rUuycPtsW4g/sYkD7RlrkuSud9SaWnzdjdp6XgWOK+qfpLkKOBrSW5v8/5DVd12UPmLgFXtcS5wPXBukuOBq4ExoID7kmypqidbmfXAXcBWYA1wO1owRrl9V4ubR+zSElGdn7SXR7VHTbLIWuCmttxdwHFJTgYuBLZX1YGWzLcDa9q8Y6vq61VVwE3AJYdtgyQNZWKXlpAkRyTZCeyjS853t1kfbqfbr03y8hY7BXhsYPE9LTZZfM+QuKQ5NFJiT3JcktuSfCfJriS/nuT4JNvbNbbtSZa1sl6Xkxaoqnq+qs4CVgDnJDkTeB/wq8A/B44H3tuKD9sPawbxl0iyPsmOJDv2798/za2QNJlRj9g/Bnyxqn4VeD2wC9gA3FFVq4A72mt48XW59XTX3Bi4LncucA5w9fiPAV64Lje+3JpD2yxJk6mqHwFfAdZU1d52uv1Z4G/p9k/ojrhPHVhsBfD4FPEVQ+LD3n9jVY1V1djy5ctnYYskjZsysSc5FngzcANAVf20/VNYC2xuxTbzwrU0r8tJC1CS5UmOa9PHAL8JfKftg7QzZZcAD7ZFtgCXtbNwq4GnqmovsA24IMmy9uP8AmBbm/d0ktVtXZcBn5vLbZQ0Wqv41wL7gb9N8nrgPuAq4KS2I1NVe5Oc2MoftutySdbTHdlz2mmnjVB1SQNOBjYnOYLuR/2tVfX5JF9OspzuVPpO4N+28luBi4HdwDPAuwCq6kCSDwL3tnIfqKoDbfrdwI3AMXSt4W0RL82xURL7kcDZwB9W1d1JPsYLp92HOWzX5apqI7ARYGxsbLLWvJIOUlX3A28YEj9vgvIFXDHBvE3ApiHxHcCZh1ZTSYdilGvse4A9A61nb6NL9E8MnMI7ma6V7Xj5w3JdTpIkTW7KxF5V/xd4LMmvtND5wLfprr+Nt2xfxwvX0rwuJ0nSPBm157k/BD6Z5GjgEbprbS8Dbk1yOfAD4B2trNflJEmaJyMl9qraSdd95MHOH1LW63KSJM0Te56TJKlHHARGknrCAV4EHrFLktQrJnZJknrExC5JUo+Y2CVJ6hETuyRJPWJilySpR0zskiT1iIldkqQeMbFLktQjJnZJknrExC5JUo+Y2CVJ6hETuyRJPWJil5aQJK9Ick+SbyV5KMmftfjpSe5O8nCSW5Ic3eIvb693t/krB9b1vhb/bpILB+JrWmx3kg1zvY3SUmdil5aWZ
4Hzqur1wFnAmiSrgY8A11bVKuBJ4PJW/nLgyar6ZeDaVo4kZwCXAr8GrAH+MskRSY4APg5cBJwBvLOVlTRHTOzSElKdn7SXR7VHAecBt7X4ZuCSNr22vabNPz9JWvzmqnq2qr4P7AbOaY/dVfVIVf0UuLmVlTRHTOzSEtOOrHcC+4DtwPeAH1XVc63IHuCUNn0K8BhAm/8U8JrB+EHLTBSXNEdM7NISU1XPV9VZwAq6I+zXDSvWnjPBvOnGXyTJ+iQ7kuzYv3//aBWXNBITu7REVdWPgK8Aq4HjkhzZZq0AHm/Te4BTAdr8VwMHBuMHLTNR/OD33lhVY1U1tnz58tnaJEmY2KUlJcnyJMe16WOA3wR2AXcCb2/F1gGfa9Nb2mva/C9XVbX4pa3V/OnAKuAe4F5gVWtlfzRdA7sth3/LJI07cuoiknrkZGBza73+MuDWqvp8km8DNyf5EPBN4IZW/gbgE0l20x2pXwpQVQ8luRX4NvAccEVVPQ+Q5EpgG3AEsKmqHpq7zZM0UmJP8ijwNPA88FxVjSU5HrgFWAk8CvxOVT3ZWsx+DLgYeAb4/ar6RlvPOuBP22o/VFWbW/yNwI3AMcBW4Kp2VCBpFlXV/cAbhsQfobvefnD8n4B3TLCuDwMfHhLfSrcfS5oH0zkV/xtVdVZVjbXXG4A72n2vd7TX0N2/uqo91gPXA7QfAlcD59L9A7k6ybK2zPWt7Phya2a8RZIkLWGHco198P7Wg+97vandL3sXXaOck4ELge1VdaCqnqS7zWZNm3dsVX29HaXfNLAuSZI0DaMm9gK+lOS+JOtb7KSq2gvQnk9s8ene33pKmz44/hLeIiNJ0uRGbTz3pqp6PMmJwPYk35mk7GG57xW6W2SAjQBjY2Neg5ck6SAjHbFX1ePteR/wWbpr5E+00+i0532t+HTvb93Tpg+OS5KkaZoysSf5xSSvGp8GLgAe5MX3tx583+tl6awGnmqn6rcBFyRZ1hrNXQBsa/OeTrK6tai/bGBdkiRpGkY5FX8S8Nku53Ik8HdV9cUk9wK3Jrkc+AEv3BKzle5Wt910t7u9C6CqDiT5IF0HFgAfqKoDbfrdvHC72+3tIUlawFZu+MKUZR695q1zUBMNmjKxt/tbXz8k/o/A+UPiBVwxwbo2AZuGxHcAZ45QX0mSNAm7lJUkqUdM7JIk9YiJXZKkHjGxS5LUIyZ2SZJ6xMQuSVKPmNglSeoRE7skST1iYpckqUdM7JIk9YiJXZKkHjGxS0tEklOT3JlkV5KHklzV4u9P8g9JdrbHxQPLvC/J7iTfTXLhQHxNi+1OsmEgfnqSu5M8nOSWJEfP7VZKMrFLS8dzwB9V1euA1cAVSc5o866tqrPaYytAm3cp8GvAGuAvkxyR5Ajg48BFwBnAOwfW85G2rlXAk8Dlc7VxkjomdmmJqKq9VfWNNv00sAs4ZZJF1gI3V9WzVfV9uqGYz2mP3VX1SFX9FLgZWJtubOfzgNva8puBSw7P1kiaiIldWoKSrATeANzdQlcmuT/JpiTLWuwU4LGBxfa02ETx1wA/qqrnDopLmkMmdmmJSfJK4NPAe6rqx8D1wC8BZwF7gT8fLzpk8ZpBfFgd1ifZkWTH/v37p7kFkiZjYpeWkCRH0SX1T1bVZwCq6omqer6qfgb8Nd2pduiOuE8dWHwF8Pgk8R8CxyU58qD4S1TVxqoaq6qx5cuXz87GSQJM7NKS0a6B3wDsqqqPDsRPHij2NuDBNr0FuDTJy5OcDqwC7gHuBVa1FvBH0zWw21JVBdwJvL0tvw743OHcJkkvdeTURST1xJuA3wMeSLKzxf6YrlX7WXSnzR8F/gCgqh5KcivwbboW9VdU1fMASa4EtgFHAJuq6qG2vvcCNyf5EPBNuh8SkuaQiX2WrNzwhZHKPXrNWw9zTaThquprDL8OvnWSZT4MfHhIfOuw5arqEV44lS9pHngqXpKkHjGxS5LUIyZ2SZJ6ZOTE3rqS/GaSz
7fXQ/uEbi1ob2l9SN/dOsIYX8e0+p2WJEnTM50j9qvouqAcN1Gf0JcDT1bVLwPXtnIz7XdakiRNw0iJPckK4K3A37TXk/UJvba9ps0/v5WfVr/Th7phkiQtRaMesf9X4D8CP2uvJ+sT+uf9SLf5T7Xy0+13+iXshlKSpMlNeR97kt8C9lXVfUneMh4eUrSmmDdRfNiPi6H9S1fVRmAjwNjY2NAyktRHo/aVIY3SQc2bgN9OcjHwCuBYuiP445Ic2Y7KB/uEHu9Hek/rM/rVwAEm7l+aSeKSJGkapjwVX1Xvq6oVVbWSrvHbl6vqd5m4T+gt7TVt/pdbH9LT6nd6VrZOkqQl5lC6lJ2oT+gbgE8k2U13pH4pzLjfaUmSNA3TSuxV9RXgK216aJ/QVfVPwDsmWH5a/U5LkqTpsec5SZJ6xMQuSVKPmNglSeoRE7skST1iYpckqUdM7JIk9YiJXZKkHjGxS0tEklOT3JlkV5KHklzV4scn2Z7k4fa8rMWT5Loku5Pcn+TsgXWta+UfTrJuIP7GJA+0Za5rIztKmkMmdmnpeA74o6p6HbAauCLJGcAG4I6qWgXc0V4DXETX9fMqYD1wPXQ/BICrgXPpOqm6evzHQCuzfmC5NXOwXZIGmNilJaKq9lbVN9r008AuuiGS1wKbW7HNwCVtei1wU3Xuohv46WTgQmB7VR2oqieB7cCaNu/Yqvp6Gx/ipoF1SZojJnZpCUqyEngDcDdwUlXthS75Aye2YqcAjw0stqfFJovvGRKXNIdM7NISk+SVwKeB91TVjycrOiRWM4gPq8P6JDuS7Ni/f/9UVZY0DSZ2aQlJchRdUv9kVX2mhZ9op9Fpz/tafA9w6sDiK4DHp4ivGBJ/iaraWFVjVTW2fPnyQ9soSS9yKMO2SlpEWgv1G4BdVfXRgVlbgHXANe35cwPxK5PcTNdQ7qmq2ptkG/CfBhrMXQC8r6oOJHk6yWq6U/yXAf/tsG9YD6zc8IX5roJ6xMQuLR1vAn4PeCDJzhb7Y7qEfmuSy4Ef8MKwy1uBi4HdwDPAuwBaAv8gcG8r94GqOtCm3w3cCBwD3N4ekuaQiV1aIqrqawy/Dg5w/pDyBVwxwbo2AZuGxHcAZx5CNSUdIq+xS5LUIyZ2SZJ6xMQuSVKPmNglSeoRE7skST1iYpckqUdM7JIk9ciUiT3JK5Lck+RbbQznP2vx05Pc3cZjviXJ0S3+8vZ6d5u/cmBd72vx7ya5cCC+psV2J9lwcB0kSdJoRjlifxY4r6peD5xFNzzjauAjwLVtDOcngctb+cuBJ6vql4FrWznauM+XAr9GN0bzXyY5IskRwMfpxn4+A3hnKytJkqZpyp7nWu9TP2kvj2qPAs4D/nWLbwbeD1xPN4bz+1v8NuAvWh/Va4Gbq+pZ4PtJdgPntHK7q+oRgNYv9Vrg24eyYZKkxWGUvvIfveatc1CTfhjpGns7st5JN+rTduB7wI+q6rlWZHDc5Z+P1dzmPwW8humP7SxJkqZppMReVc9X1Vl0wzCeA7xuWLH27BjOkiTNk2m1iq+qHwFfAVYDxyUZP5U/OO7yz8dqbvNfDRxg+mM7D3t/x3CWJGkSo7SKX57kuDZ9DPCbwC7gTuDtrdjBYziva9NvB77crtNvAS5treZPB1YB99AN/biqtbI/mq6B3ZbZ2DhJkpaaUYZtPRnY3Fqvvwy4tao+n+TbwM1JPgR8E7ihlb8B+ERrHHeALlFTVQ8luZWuUdxzwBVV9TxAkiuBbcARwKaqemjWtlCSpCVklFbx9wNvGBJ/hBdatQ/G/wl4xwTr+jDw4SHxrcDWEeorSVpERmnxrtllz3OSJPWIiV2SpB4xsUuS1CMmdmkJSbIpyb4kDw7E3p/kH5LsbI+LB+ZNa3yHicaQkDR3TOzS0nIj3VgNB7u2qs5qj60w4/EdJhpDQtIcMbFLS0hVfZXuNtRR/Hx8h6r6PjA+vsM5tPEdquqnwM3A2jYmxHl0Y0RAN
4bEJbO6AZKmZGKXBHBlkvvbqfplLTbd8R1ew8RjSEiaIyZ2SdcDv0Q3LPNe4M9b3HEfpEXIxC4tcVX1RBvo6WfAX/NCx1PTHd/hh0w8hsTB7+m4D9JhYmKXlrgkJw+8fBsw3mJ+WuM7tDEhJhpDQtIcGaWveEk9keRTwFuAE5LsAa4G3pLkLLrT5o8CfwAzHt/hvQwfQ0LSHDGxS0tIVb1zSHjC5Dvd8R0mGkNC0tzxVLwkST1iYpckqUc8FT/HRhnC8NFr3joHNZEk9ZFH7JIk9YiJXZKkHjGxS5LUI15jl6TDZJQ2NdJs84hdkqQeMbFLktQjJnZJknrExC5JUo9MmdiTnJrkziS7kjyU5KoWPz7J9iQPt+dlLZ4k1yXZneT+JGcPrGtdK/9wknUD8TcmeaAtc12SYeM6S5KkKYxyxP4c8EdV9TpgNXBFkjOADcAdVbUKuKO9BriIbnjHVcB64HrofgjQjSR1Lt0gEVeP/xhoZdYPLLfm0DdNkqSlZ8rEXlV7q+obbfppYBdwCrAW2NyKbQYuadNrgZuqcxdwXBvv+UJge1UdqKonge3Amjbv2Kr6ehvP+aaBdUmSpGmY1jX2JCuBNwB3AydV1V7okj9wYit2CvDYwGJ7Wmyy+J4hcUmSNE0jJ/YkrwQ+Dbynqn48WdEhsZpBfFgd1ifZkWTH/v37p6qyJElLzkiJPclRdEn9k1X1mRZ+op1Gpz3va/E9wKkDi68AHp8ivmJI/CWqamNVjVXV2PLly0epuiRJS8ooreID3ADsqqqPDszaAoy3bF8HfG4gfllrHb8aeKqdqt8GXJBkWWs0dwGwrc17Osnq9l6XDaxLkiRNwyh9xb8J+D3ggSQ7W+yPgWuAW5NcDvwAeEebtxW4GNgNPAO8C6CqDiT5IHBvK/eBqjrQpt8N3AgcA9zeHpIkaZqmTOxV9TWGXwcHOH9I+QKumGBdm4BNQ+I7gDOnqoukQ5NkE/BbwL6qOrPFjgduAVYCjwK/U1VPtjNoH6P7of4M8Pvjd8i0fij+tK32Q1W1ucXfyAs/0rcCV7X/CZLmiD3PSUvLjby0nwj7pJB6xMQuLSFV9VXgwEFh+6SQesTELsk+KaQeGaXx3KK1csMX5rsK0mJ2WPukoDtlz2mnnTbT+kkawiN2SfZJIfWIiV2SfVJIPdLrU/GSXizJp4C3ACck2UPXut0+KaQeMbFLS0hVvXOCWfZJIfWEp+IlSeoRE7skST3iqfgFaJTb9B695q1zUBNJ0mLjEbskST1iYpckqUdM7JIk9YiJXZKkHjGxS5LUIyZ2SZJ6xMQuSVKPmNglSeoRO6iRJC14o3TcBXbeBR6xS5LUKyZ2SZJ6xMQuSVKPTJnYk2xKsi/JgwOx45NsT/Jwe17W4klyXZLdSe5PcvbAMuta+YeTrBuIvzHJA22Z65JktjdSkqSlYpQj9huBNQfFNgB3VNUq4I72GuAiYFV7rAeuh+6HAHA1cC5wDnD1+I+BVmb9wHIHv5ckSRrRlIm9qr4KHDgovBbY3KY3A5cMxG+qzl3AcUlOBi4EtlfVgap6EtgOrGnzjq2qr1dVATcNrEuSJE3TTK+xn1RVewHa84ktfgrw2EC5PS02WXzPkLgkSZqB2W48N+z6eM0gPnzlyfokO5Ls2L9//wyrKGmYJI+29i47k+xosVlrTyNpbsy0g5onkpxcVXvb6fR9Lb4HOHWg3Arg8RZ/y0Hxr7T4iiHlh6qqjcBGgLGxsQl/AEiasd+oqh8OvB5vT3NNkg3t9Xt5cXuac+naypw70J5mjO5H+n1JtrRLcL0yaocpmluj/F363onNTI/YtwDjv8TXAZ8biF/Wfs2vBp5qp+q3ARckWdZ+8V8AbGvznk6yurWGv2xgXZLm36y0p5nrSktL2ZRH7Ek+RXe0fUKSPXS/xq8Bbk1yOfAD4B2t+FbgYmA38AzwLoCqOpDkg8C9rdwHqmq8Qd676VreHwPc3h6S5l4BX0pSwH9vZ
8he1J4myUzb00iaI1Mm9qp65wSzzh9StoArJljPJmDTkPgO4Myp6iHpsHtTVT3ekvf2JN+ZpOwhtZtJsp7uNldOO+20mdRV0gTseU4SAFX1eHveB3yWrs+JJ9opdqbRnmZY/OD32lhVY1U1tnz58tneFGlJM7FLIskvJnnV+DRdO5gHmaX2NHO4KdKS57CtkgBOAj7benQ+Evi7qvpiknuZvfY0kuaAiX2R8pYOzaaqegR4/ZD4PzJL7WkkzQ1PxUuS1CMmdkmSesTELklSj5jYJUnqERO7JEk9YmKXJKlHTOySJPWIiV2SpB4xsUuS1CP2PNdjo/ROB/ZQJ0l94hG7JEk9YmKXJKlHPBUvB5SRpB7xiF2SpB4xsUuS1COeipckLSl9v/zoEbskST3iEbtG0vdfuJLUFx6xS5LUIwvmiD3JGuBjwBHA31TVNfNcJUkz0Id9edReG6WFaEEk9iRHAB8H/iWwB7g3yZaq+vb81kzSdByufXk2u0c2aavvFkRiB84BdlfVIwBJbgbWAiZ2aXGZ133ZpC0tnMR+CvDYwOs9wLnzVBdJM+e+rF6YzR+Jc92weKEk9gyJ1UsKJeuB9e3lT5J8d2D2CcAPD0Pd5sJirju0+ucj812NGVnMn/0odf9nc1GRAbOxL8+Hxfw9GOR2LCyz/b9xpP15oST2PcCpA69XAI8fXKiqNgIbh60gyY6qGjs81Tu8FnPdYXHX37rPukPel+fDAv0sp83tWFjmazsWyu1u9wKrkpye5GjgUmDLPNdJ0vS5L0vzbEEcsVfVc0muBLbR3SKzqaoemudqSZom92Vp/i2IxA5QVVuBrYewigVzWm8GFnPdYXHX37rPslnYl+fDgvwsZ8DtWFjmZTtS9ZJ2LZIkaZFaKNfYJUnSLOhFYk+yJsl3k+xOsmEe6/FokgeS7Eyyo8WOT7I9ycPteVmLJ8l1rc73Jzl7YD3rWvmHk6wbiL+xrX93W3bYrUXTqe+mJPuSPDgQO+z1neg9ZqHu70/yD+3z35nk4oF572v1+G6SCwfiQ787rfHX3a2Ot7SGYCR5eXu9u81fOYO6n5rkziS7kjyU5KrJPpeF9tn3TZKrkjzY/hbvme/6jGo6++9CNsF2vKP9PX6WZFG0jp9gO/5Lku+0/fazSY6bk8pU1aJ+0DXQ+R7wWuBo4FvAGfNUl0eBEw6K/WdgQ5veAHykTV8M3E533+9q4O4WPx54pD0va9PL2rx7gF9vy9wOXHSI9X0zcDbw4FzWd6L3mIW6vx/490PKntG+Fy8HTm/flyMm++4AtwKXtum/At7dpv8d8Fdt+lLglhnU/WTg7Db9KuDvWx0XxWffpwdwJvAg8At0bY7+J7Bqvus1Yt1H3n8X8mOC7Xgd8CvAV4Cx+a7jIWzHBcCRbfojc/X36MMR+8+7sKyqnwLjXVguFGuBzW16M3DJQPym6twFHJfkZOBCYHtVHaiqJ4HtwJo279iq+np135KbBtY1I1X1VeDAPNR3ovc41LpPZC1wc1U9W1XfB3bTfW+Gfnfa0e15wG0TfA7jdb8NOH+6Z06qam9VfaNNPw3souuxbVF89j3zOuCuqnqmqp4D/hfwtnmu00imuf8uWMO2o6p2VdV8d1o0LRNsx5fa9wrgLrp+HQ67PiT2YV1YnjJPdSngS0nuS9ezFsBJVbUXun/owIktPlG9J4vvGRKfbXNR34neYzZc2U57bRo4DTndur8G+NHADjlY958v0+Y/1crPSDuV/wbgbhb/Z78YPQi8OclrkvwC3dmRU6dYZiHz77tw/Ru6s2eHXR8S+0hdWM6RN1XV2cBFwBVJ3jxJ2YnqPd34XFkM9b0e+CXgLGAv8OctPpt1n7XtSvJK4NPAe6rqx5MVneA9F9JnvyhV1S66U6TbgS/SXY55btKFpGlK8id036tPzsX79SGxj9SF5Vyoqsfb8z7gs3Snep9op0Zpz/ta8YnqPVl8xZD4bJuL+k70Hoekqp6oquer6
mfAX9N9/jOp+w/pTncfeVD8Retq81/N6JcEfi7JUXRJ/ZNV9ZkWXrSf/WJWVTdU1dlV9Wa6v+XD812nQ+Dfd4FpjVp/C/jddmnssOtDYl8QXVgm+cUkrxqfpms08WCry3hr5XXA59r0FuCy1uJ5NfBUO3W2DbggybJ2KvkCYFub93SS1e2a7mUD65pNc1Hfid7jkIz/Q2veRvf5j7/fpelatJ8OrKJrXDb0u9N2vjuBt0/wOYzX/e3Al6e7s7bP4wZgV1V9dGDWov3sF7MkJ7bn04B/BXxqfmt0SPz7LiBJ1gDvBX67qp6ZszeeixZ6h/tBd13s7+laOP/JPNXhtXSn8b4FPDReD7rrr3fQHQXcARzf4gE+3ur8AAMtP+muxexuj3cNxMfoktX3gL+gdTB0CHX+FN0p6/9Hd5R3+VzUd6L3mIW6f6LV7X66f3AnD5T/k1aP7zJwN8FE353297ynbdP/AF7e4q9or3e3+a+dQd3/Bd2p8fuBne1x8WL57Pv2AP433Xjx3wLOn+/6TKPeI++/C/kxwXa8rU0/CzxB94N13us6g+3YTdcOZnw//6u5qIs9z0mS1CN9OBUvSZIaE7skST1iYpckqUdM7JIk9YiJXZKkHjGxS5LUIyZ2SZJ6xMQuSVKP/H+9KhGj6EW1ZQAAAABJRU5ErkJggg==\n",
  83 |       "text/plain": [
  84 |        "<Figure size 576x288 with 2 Axes>"
  85 |       ]
  86 |      },
  87 |      "metadata": {
  88 |       "needs_background": "light"
  89 |      },
  90 |      "output_type": "display_data"
  91 |     }
  92 |    ],
  93 |    "source": [
  94 |     "#data = data.head(3)\n",
  95 |     "data['Log1pSalary'] = np.log1p(data['SalaryNormalized']).astype('float32')\n",
  96 |     "\n",
  97 |     "plt.figure(figsize=[8, 4])\n",
  98 |     "plt.subplot(1, 2, 1)\n",
  99 |     "\n",
 100 |     "plt.hist(data[\"SalaryNormalized\"], bins=20);\n",
 101 |     "\n",
 102 |     "plt.subplot(1, 2, 2)\n",
 103 |     "plt.hist(data['Log1pSalary'], bins=20);"
 104 |    ]
 105 |   },
 106 |   {
 107 |    "cell_type": "code",
 108 |    "execution_count": 4,
 109 |    "metadata": {},
 110 |    "outputs": [
 111 |     {
 112 |      "data": {
 113 |       "text/plain": [
 114 |        "(244768,)"
 115 |       ]
 116 |      },
 117 |      "execution_count": 4,
 118 |      "metadata": {},
 119 |      "output_type": "execute_result"
 120 |     }
 121 |    ],
 122 |    "source": [
 123 |     "data['Log1pSalary'].shape"
 124 |    ]
 125 |   },
 126 |   {
 127 |    "cell_type": "code",
 128 |    "execution_count": 5,
 129 |    "metadata": {},
 130 |    "outputs": [
 131 |     {
 132 |      "data": {
 133 |       "text/plain": [
 134 |        "200000"
 135 |       ]
 136 |      },
 137 |      "execution_count": 5,
 138 |      "metadata": {},
 139 |      "output_type": "execute_result"
 140 |     }
 141 |    ],
 142 |    "source": [
 143 |     "np.amax(data[\"SalaryNormalized\"])"
 144 |    ]
 145 |   },
 146 |   {
 147 |    "cell_type": "code",
 148 |    "execution_count": 6,
 149 |    "metadata": {},
 150 |    "outputs": [
 151 |     {
 152 |      "data": {
 153 |       "text/plain": [
 154 |        "5000"
 155 |       ]
 156 |      },
 157 |      "execution_count": 6,
 158 |      "metadata": {},
 159 |      "output_type": "execute_result"
 160 |     }
 161 |    ],
 162 |    "source": [
 163 |     "np.amin(data[\"SalaryNormalized\"])"
 164 |    ]
 165 |   },
 166 |   {
 167 |    "cell_type": "code",
 168 |    "execution_count": 7,
 169 |    "metadata": {},
 170 |    "outputs": [
 171 |     {
 172 |      "data": {
 173 |       "text/plain": [
 174 |        "12.206078"
 175 |       ]
 176 |      },
 177 |      "execution_count": 7,
 178 |      "metadata": {},
 179 |      "output_type": "execute_result"
 180 |     }
 181 |    ],
 182 |    "source": [
 183 |     "np.amax(data[\"Log1pSalary\"])"
 184 |    ]
 185 |   },
 186 |   {
 187 |    "cell_type": "code",
 188 |    "execution_count": 8,
 189 |    "metadata": {},
 190 |    "outputs": [
 191 |     {
 192 |      "data": {
 193 |       "text/plain": [
 194 |        "8.517393"
 195 |       ]
 196 |      },
 197 |      "execution_count": 8,
 198 |      "metadata": {},
 199 |      "output_type": "execute_result"
 200 |     }
 201 |    ],
 202 |    "source": [
 203 |     "np.amin(data[\"Log1pSalary\"])"
 204 |    ]
 205 |   },
 206 |   {
 207 |    "cell_type": "markdown",
 208 |    "metadata": {},
 209 |    "source": [
 210 |     "Our task is to predict one number, __Log1pSalary__ (its maximum in this data is log(1 + 200000) ≈ 12.2).\n",
 211 |     "\n",
 214 |     "Title : job position  \n",
 215 |     "FullDescription : the actual duties of the job  \n",
 216 |     "LocationRaw : location (detailed)  \n",
 217 |     "LocationNormalized : location (normalized)  \n",
 218 |     "ContractType : contract type  \n",
 219 |     "ContractTime : contract time  \n",
 220 |     "Company : company  \n",
 221 |     "Category : category  \n",
 222 |     "SalaryRaw : salary range and additional attributes  \n",
 223 |     "SalaryNormalized : average salary  \n",
 224 |     "SourceName : source  \n",
 225 |     "\n",
 226 |     "To do so, our model can access a number of features:\n",
 227 |     "* Free text: __`Title`__ and  __`FullDescription`__\n",
 228 |     "* Categorical: __`Category`__, __`Company`__, __`LocationNormalized`__, __`ContractType`__, and __`ContractTime`__.\n",
 229 |     "\n",
 230 |     "`dropna` drops rows that contain NaN in a column, treating them as unneeded.  \n",
 231 |     "\n",
 232 |     "`fillna` is also very useful: it replaces NaN with a specified value."
 235 |    ]
 236 |   },
 237 |   {
 238 |    "cell_type": "code",
 239 |    "execution_count": 9,
 240 |    "metadata": {},
 241 |    "outputs": [
 242 |     {
 243 |      "data": {
 244 |       "text/html": [
 245 |        "<div>\n",
 246 |        "<style scoped>\n",
 247 |        "    .dataframe tbody tr th:only-of-type {\n",
 248 |        "        vertical-align: middle;\n",
 249 |        "    }\n",
 250 |        "\n",
 251 |        "    .dataframe tbody tr th {\n",
 252 |        "        vertical-align: top;\n",
 253 |        "    }\n",
 254 |        "\n",
 255 |        "    .dataframe thead th {\n",
 256 |        "        text-align: right;\n",
 257 |        "    }\n",
 258 |        "</style>\n",
 259 |        "<table border=\"1\" class=\"dataframe\">\n",
 260 |        "  <thead>\n",
 261 |        "    <tr style=\"text-align: right;\">\n",
 262 |        "      <th></th>\n",
 263 |        "      <th>Id</th>\n",
 264 |        "      <th>Title</th>\n",
 265 |        "      <th>FullDescription</th>\n",
 266 |        "      <th>LocationRaw</th>\n",
 267 |        "      <th>LocationNormalized</th>\n",
 268 |        "      <th>ContractType</th>\n",
 269 |        "      <th>ContractTime</th>\n",
 270 |        "      <th>Company</th>\n",
 271 |        "      <th>Category</th>\n",
 272 |        "      <th>SalaryRaw</th>\n",
 273 |        "      <th>SalaryNormalized</th>\n",
 274 |        "      <th>SourceName</th>\n",
 275 |        "      <th>Log1pSalary</th>\n",
 276 |        "    </tr>\n",
 277 |        "  </thead>\n",
 278 |        "  <tbody>\n",
 279 |        "    <tr>\n",
 280 |        "      <th>182771</th>\n",
 281 |        "      <td>71614858</td>\n",
 282 |        "      <td>Registered General Nurse (RGN)  East London  C...</td>\n",
 283 |        "      <td>Registered General Nurse/RN/RGNLocation: East ...</td>\n",
 284 |        "      <td>Chingford</td>\n",
 285 |        "      <td>Chingford</td>\n",
 286 |        "      <td>full_time</td>\n",
 287 |        "      <td>NaN</td>\n",
 288 |        "      <td>HC Recruitment Services</td>\n",
 289 |        "      <td>Healthcare &amp; Nursing Jobs</td>\n",
 290 |        "      <td>13.00 - 13.15/Hour</td>\n",
 291 |        "      <td>25104</td>\n",
 292 |        "      <td>staffnurse.com</td>\n",
 293 |        "      <td>10.130822</td>\n",
 294 |        "    </tr>\n",
 295 |        "    <tr>\n",
 296 |        "      <th>44035</th>\n",
 297 |        "      <td>68506863</td>\n",
 298 |        "      <td>Buyer  Menswear</td>\n",
 299 |        "      <td>Buyer  Menswear : The Client This design led c...</td>\n",
 300 |        "      <td>North London London South East</td>\n",
 301 |        "      <td>North Lambeth</td>\n",
 302 |        "      <td>NaN</td>\n",
 303 |        "      <td>permanent</td>\n",
 304 |        "      <td>FASHION &amp; RETAIL PERSONNEL LIMITED</td>\n",
 305 |        "      <td>Retail Jobs</td>\n",
 306 |        "      <td>40000 - 45000 per annum</td>\n",
 307 |        "      <td>42500</td>\n",
 308 |        "      <td>retailchoice.com</td>\n",
 309 |        "      <td>10.657283</td>\n",
 310 |        "    </tr>\n",
 311 |        "    <tr>\n",
 312 |        "      <th>101855</th>\n",
 313 |        "      <td>69547809</td>\n",
 314 |        "      <td>HGV 2 Moffett Driver</td>\n",
 315 |        "      <td>We have a breathtaking opportunity for a HGV C...</td>\n",
 316 |        "      <td>Southall</td>\n",
 317 |        "      <td>Southall</td>\n",
 318 |        "      <td>full_time</td>\n",
 319 |        "      <td>NaN</td>\n",
 320 |        "      <td>HR Go Recruitment</td>\n",
 321 |        "      <td>Logistics &amp; Warehouse Jobs</td>\n",
 322 |        "      <td>8.50 - 12.75 per hour</td>\n",
 323 |        "      <td>20400</td>\n",
 324 |        "      <td>Jobcentre Plus</td>\n",
 325 |        "      <td>9.923339</td>\n",
 326 |        "    </tr>\n",
 327 |        "  </tbody>\n",
 328 |        "</table>\n",
 329 |        "</div>"
 330 |       ],
 331 |       "text/plain": [
 332 |        "              Id                                              Title  \\\n",
 333 |        "182771  71614858  Registered General Nurse (RGN)  East London  C...   \n",
 334 |        "44035   68506863                                    Buyer  Menswear   \n",
 335 |        "101855  69547809                               HGV 2 Moffett Driver   \n",
 336 |        "\n",
 337 |        "                                          FullDescription  \\\n",
 338 |        "182771  Registered General Nurse/RN/RGNLocation: East ...   \n",
 339 |        "44035   Buyer  Menswear : The Client This design led c...   \n",
 340 |        "101855  We have a breathtaking opportunity for a HGV C...   \n",
 341 |        "\n",
 342 |        "                           LocationRaw LocationNormalized ContractType  \\\n",
 343 |        "182771                       Chingford          Chingford    full_time   \n",
 344 |        "44035   North London London South East      North Lambeth          NaN   \n",
 345 |        "101855                        Southall           Southall    full_time   \n",
 346 |        "\n",
 347 |        "       ContractTime                             Company  \\\n",
 348 |        "182771          NaN             HC Recruitment Services   \n",
 349 |        "44035     permanent  FASHION & RETAIL PERSONNEL LIMITED   \n",
 350 |        "101855          NaN                   HR Go Recruitment   \n",
 351 |        "\n",
 352 |        "                          Category                SalaryRaw  SalaryNormalized  \\\n",
 353 |        "182771   Healthcare & Nursing Jobs       13.00 - 13.15/Hour             25104   \n",
 354 |        "44035                  Retail Jobs  40000 - 45000 per annum             42500   \n",
 355 |        "101855  Logistics & Warehouse Jobs    8.50 - 12.75 per hour             20400   \n",
 356 |        "\n",
 357 |        "              SourceName  Log1pSalary  \n",
 358 |        "182771    staffnurse.com    10.130822  \n",
 359 |        "44035   retailchoice.com    10.657283  \n",
 360 |        "101855    Jobcentre Plus     9.923339  "
 361 |       ]
 362 |      },
 363 |      "execution_count": 9,
 364 |      "metadata": {},
 365 |      "output_type": "execute_result"
 366 |     }
 367 |    ],
 368 |    "source": [
 369 |     "text_columns = [\"Title\", \"FullDescription\"]\n",
 370 |     "categorical_columns = [\"Category\", \"Company\", \"LocationNormalized\", \"ContractType\", \"ContractTime\"]\n",
 371 |     "target_column = \"Log1pSalary\"\n",
 372 |     "\n",
 373 |     "data[categorical_columns] = data[categorical_columns].fillna('NaN') # cast missing values to string \"NaN\"\n",
 374 |     "\n",
 375 |     "data.sample(3)"
 376 |    ]
 377 |   },
 378 |   {
 379 |    "cell_type": "markdown",
 380 |    "metadata": {},
 381 |    "source": [
 382 |     "### Preprocessing text data\n",
 383 |     "\n",
 384 |     "Just like last week, applying NLP to a problem begins with tokenization: splitting raw text into sequences of tokens (words, punctuation, etc).\n",
 385 |     "\n",
 386 |     "__Your task__ is to lowercase and tokenize all texts under `Title` and `FullDescription` columns. Store the tokenized data as a __space-separated__ string of tokens for performance reasons.\n",
 387 |     "\n",
 388 |     "It's okay to use nltk tokenizers. Assertions were designed for WordPunctTokenizer, slight deviations are okay.\n",
 389 |     "\n",
 390 |     "\n",
 391 |     "`WordPunctTokenizer` tokenizes text into a sequence of alphabetic and non-alphabetic characters, using the regexp `\\w+|[^\\w\\s]+`:  \n",
 394 |     "\n",
 395 |     "<p> from nltk.tokenize import WordPunctTokenizer  \n",
 396 |     "<p> s = \"Good muffins cost $3.88\\nin New York.  Please buy me\\ntwo of them.\\n\\nThanks.\" \n",
 397 |     "\n",
 398 |     "<p> WordPunctTokenizer().tokenize(s) \n",
 399 |     "<p> ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', \n",
 400 |     "<p> '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']  \n",
 401 |     "\n",
 402 |     "\n",
 403 |     "\n",
 404 |     "http://excelsior-cjh.tistory.com/63  "
 405 |    ]
 406 |   },
 407 |   {
 408 |    "cell_type": "code",
 409 |    "execution_count": 10,
 410 |    "metadata": {},
 411 |    "outputs": [
 412 |     {
 413 |      "name": "stdout",
 414 |      "output_type": "stream",
 415 |      "text": [
 416 |       "Raw text:\n",
 417 |       "2         Mathematical Modeller / Simulation Analyst / O...\n",
 418 |       "100002    A successful and high achieving specialist sch...\n",
 419 |       "200002    Web Designer  HTML, CSS, JavaScript, Photoshop...\n",
 420 |       "Name: FullDescription, dtype: object\n"
 421 |      ]
 422 |     }
 423 |    ],
 424 |    "source": [
 425 |     "print(\"Raw text:\")\n",
 426 |     "print(data[\"FullDescription\"][2::100000])"
 427 |    ]
 428 |   },
 429 |   {
 430 |    "cell_type": "code",
 431 |    "execution_count": null,
 432 |    "metadata": {},
 433 |    "outputs": [
 434 |     {
 435 |      "name": "stderr",
 436 |      "output_type": "stream",
 437 |      "text": [
 438 |       "/Users/JunChangWook/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning: \n",
 439 |       "A value is trying to be set on a copy of a slice from a DataFrame\n",
 440 |       "\n",
 441 |       "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
 442 |       "  import sys\n"
 443 |      ]
 444 |     }
 445 |    ],
 446 |    "source": [
 447 |     "import nltk\n",
 448 |     "tokenizer = nltk.tokenize.WordPunctTokenizer()\n",
 449 |     "\n",
 450 |     "def normalize(text):\n",
 451 |     "    # lowercase, tokenize, and re-join tokens with single spaces;\n",
 452 |     "    # str() guards against missing (NaN) entries\n",
 453 |     "    return ' '.join(tokenizer.tokenize(str(text).lower()))\n",
 454 |     "\n",
 455 |     "# assign whole columns at once to avoid chained assignment (SettingWithCopyWarning)\n",
 456 |     "for column in text_columns:\n",
 457 |     "    data[column] = data[column].apply(normalize)"
 462 |    ]
 463 |   },
 464 |   {
 465 |    "cell_type": "markdown",
 466 |    "metadata": {},
 467 |    "source": [
 468 |     "Now we can assume that our text is a space-separated list of tokens:"
 469 |    ]
 470 |   },
 471 |   {
 472 |    "cell_type": "code",
 473 |    "execution_count": null,
 474 |    "metadata": {},
 475 |    "outputs": [],
 476 |    "source": [
 477 |     "print(\"Tokenized:\")\n",
 478 |     "print(data[\"FullDescription\"][2::100000])\n",
 479 |     "assert data[\"FullDescription\"][2][:50] == 'mathematical modeller / simulation analyst / opera'\n",
 480 |     "assert data[\"Title\"][54321] == 'international digital account manager ( german )'"
 481 |    ]
 482 |   },
 483 |   {
 484 |    "cell_type": "markdown",
 485 |    "metadata": {},
 486 |    "source": [
 487 |     "Not all words are equally useful. Some of them are typos or rare words that are only present a few times.  \n",
 489 |     "\n",
 490 |     "Let's count how many times each word is present in the data so that we can build a \"white list\" of known words."
 492 |    ]
 493 |   },
 494 |   {
 495 |    "cell_type": "code",
 496 |    "execution_count": null,
 497 |    "metadata": {},
 498 |    "outputs": [],
 499 |    "source": [
 500 |     "# Count how many times each token occurs in \"Title\" and \"FullDescription\" in total\n",
 501 |     "# build a dictionary { token -> its count }\n",
 502 |     "import collections\n",
 503 |     "\n",
 504 |     "token_counts = collections.Counter()\n",
 505 |     "\n",
 506 |     "for column in text_columns:\n",
 507 |     "    for text in data[column]:\n",
 508 |     "        # each entry is a space-separated string of tokens\n",
 509 |     "        token_counts.update(text.split())\n",
 514 |     "\n",
 515 |     "# hint: you may or may not want to use collections.Counter"
 516 |    ]
 517 |   },
 518 |   {
 519 |    "cell_type": "code",
 520 |    "execution_count": null,
 521 |    "metadata": {},
 522 |    "outputs": [],
 523 |    "source": [
 524 |     "print(\"Total unique tokens :\", len(token_counts))\n",
 525 |     "print('\\n'.join(map(str, token_counts.most_common(n=5))))\n",
 526 |     "print('...')\n",
 527 |     "print('\\n'.join(map(str, token_counts.most_common()[-3:])))\n",
 528 |     "\n",
 529 |     "#assert token_counts.most_common(1)[0][1] in  range(2600000, 2700000)\n",
 530 |     "#assert len(token_counts) in range(200000, 210000)\n",
 531 |     "print('Correct!')"
 532 |    ]
 533 |   },
 534 |   {
 535 |    "cell_type": "code",
 536 |    "execution_count": null,
 537 |    "metadata": {},
 538 |    "outputs": [],
 539 |    "source": [
 540 |     "# Let's see how many words there are for each count\n",
 541 |     "plt.hist(list(token_counts.values()), range=[0, 10**4], bins=50, log=True)\n",
 542 |     "plt.xlabel(\"Word counts\");"
 543 |    ]
 544 |   },
 545 |   {
 546 |    "cell_type": "markdown",
 547 |    "metadata": {},
 548 |    "source": [
 549 |     "Now build a list of all tokens that occur at least 10 times."
 550 |    ]
 551 |   },
 552 |   {
 553 |    "cell_type": "code",
 554 |    "execution_count": null,
 555 |    "metadata": {},
 556 |    "outputs": [],
 557 |    "source": [
 558 |     "min_count = 10\n",
 559 |     "\n",
 560 |     "# Keep the tokens from token_counts that occur at least min_count times in the dataset.\n",
 561 |     "# Note the >= comparison: \"at least\" includes tokens seen exactly min_count times.\n",
 562 |     "tokens = [token for token, count in token_counts.items() if count >= min_count]\n",
 563 |     "# tokens = <YOUR CODE HERE>"
 572 |    ]
 573 |   },
 574 |   {
 575 |    "cell_type": "code",
 576 |    "execution_count": null,
 577 |    "metadata": {},
 578 |    "outputs": [],
 579 |    "source": [
 580 |     "# Add special tokens for unknown and empty words\n",
 581 |     "UNK, PAD = \"UNK\", \"PAD\"\n",
 582 |     "tokens = [UNK, PAD] + sorted(tokens)\n",
 583 |     "print(\"Vocabulary size:\", len(tokens))\n",
 584 |     "\n",
 585 |     "assert type(tokens) == list\n",
 586 |     "assert len(tokens) in range(32000, 35000)\n",
 587 |     "assert 'me' in tokens\n",
 588 |     "assert UNK in tokens\n",
 589 |     "print(\"Correct!\")"
 590 |    ]
 591 |   },
 592 |   {
 593 |    "cell_type": "markdown",
 594 |    "metadata": {},
 595 |    "source": [
 596 |     "Build an inverse token index: a dictionary from token (string) to its index in `tokens` (int)"
 597 |    ]
 598 |   },
 599 |   {
 600 |    "cell_type": "code",
 601 |    "execution_count": null,
 602 |    "metadata": {},
 603 |    "outputs": [],
 604 |    "source": [
 605 |     "#token_to_id = <your code here>\n",
 606 |     "token_to_id = {word: idx for idx, word in enumerate(tokens)}"
 607 |    ]
 608 |   },
 609 |   {
 610 |    "cell_type": "code",
 611 |    "execution_count": null,
 612 |    "metadata": {},
 613 |    "outputs": [],
 614 |    "source": [
 615 |     "assert isinstance(token_to_id, dict)\n",
 616 |     "assert len(token_to_id) == len(tokens)\n",
 617 |     "for tok in tokens:\n",
 618 |     "    assert tokens[token_to_id[tok]] == tok\n",
 619 |     "\n",
 620 |     "print(\"Correct!\")"
 621 |    ]
 622 |   },
 623 |   {
 624 |    "cell_type": "markdown",
 625 |    "metadata": {},
 626 |    "source": [
 627 |     "And finally, let's use the vocabulary you've built to map text lines into neural network-digestible matrices.  \n",
 628 |     "(Mapping text to matrices.)\n",
 629 |     "\n",
 630 |     "A reminder of how `map` works: `list(map(str, range(10)))` returns  \n",
 631 |     "`['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']`."
 633 |    ]
 634 |   },
 635 |   {
 636 |    "cell_type": "code",
 637 |    "execution_count": null,
 638 |    "metadata": {},
 639 |    "outputs": [],
 640 |    "source": [
 641 |     "UNK_IX, PAD_IX = map(token_to_id.get, [UNK, PAD])\n",
 642 |     "# Build a padded matrix of token indices.\n",
 643 |     "def as_matrix(sequences, max_len=None):\n",
 644 |     "    \"\"\" Convert a list of tokens into a matrix with padding \"\"\"\n",
 645 |     "    # If the sequences are raw strings, split them into tokens first\n",
 646 |     "    if isinstance(sequences[0], str):\n",
 647 |     "        sequences = list(map(str.split, sequences))\n",
 648 |     "    # Matrix width: the longest sequence, capped at max_len when one is given\n",
 649 |     "    max_len = min(max(map(len, sequences)), max_len or float('inf'))\n",
 650 |     "    \n",
 651 |     "    # Start with a matrix filled entirely with PAD_IX\n",
 652 |     "    matrix = np.full((len(sequences), max_len), np.int32(PAD_IX))\n",
 653 |     "    # Use a token's index if it is in the vocabulary, otherwise UNK_IX\n",
 654 |     "    for i,seq in enumerate(sequences):\n",
 655 |     "        row_ix = [token_to_id.get(word, UNK_IX) for word in seq[:max_len]]\n",
 656 |     "        matrix[i, :len(row_ix)] = row_ix\n",
 657 |     "    return matrix"
 659 |    ]
 660 |   },
 661 |   {
 662 |    "cell_type": "code",
 663 |    "execution_count": null,
 664 |    "metadata": {},
 665 |    "outputs": [],
 666 |    "source": [
 667 |     "print(\"Lines:\")\n",
 668 |     "print('\\n'.join(data[\"Title\"][::100000].values), end='\\n\\n')\n",
 669 |     "print(\"Matrix:\")\n",
 670 |     "print(as_matrix(data[\"Title\"][::100000]))"
 671 |    ]
 672 |   },
 673 |   {
 674 |    "cell_type": "markdown",
 675 |    "metadata": {},
 676 |    "source": [
 677 |     "Now let's encode the categorical data we have.\n",
 678 |     "\n",
 679 |     "As usual, we shall use one-hot encoding for simplicity. Kudos if you implement more advanced encodings: tf-idf, pseudo-time-series, etc.\n",
 680 |     "\n",
 681 |     "Two reminders for the next cell. `zip` pairs up items that share a position:\n",
 682 |     "\n",
 683 |     "    >>> list(zip([1, 2, 3], [4, 5, 6]))\n",
 684 |     "    [(1, 4), (2, 5), (3, 6)]\n",
 685 |     "\n",
 686 |     "A `set` is an unordered collection of unique items.\n",
 687 |     "\n",
 688 |     "For comparison, here is how scikit-learn's `CountVectorizer` turns text into count vectors:\n",
 689 |     "\n",
 690 |     "    from sklearn.feature_extraction.text import CountVectorizer\n",
 691 |     "    corpus = [\n",
 692 |     "        'This is the first document.',\n",
 693 |     "        'This is the second second document.',\n",
 694 |     "        'And the third one.',\n",
 695 |     "        'Is this the first document?',\n",
 696 |     "        'The last document?',\n",
 697 |     "    ]\n",
 698 |     "    vect = CountVectorizer()\n",
 699 |     "    vect.fit(corpus)\n",
 700 |     "    vect.vocabulary_\n",
 701 |     "\n",
 702 |     "The fitted vocabulary maps each word to a column index:\n",
 703 |     "\n",
 704 |     "    {'this': 9,\n",
 705 |     "     'is': 3,\n",
 706 |     "     'the': 7,\n",
 707 |     "     'first': 2,\n",
 708 |     "     'document': 1,\n",
 709 |     "     'second': 6,\n",
 710 |     "     'and': 0,\n",
 711 |     "     'third': 8,\n",
 712 |     "     'one': 5,\n",
 713 |     "     'last': 4}\n",
 714 |     "\n",
 715 |     "`vect.transform(['This is the second document.']).toarray()` counts the known words:\n",
 716 |     "\n",
 717 |     "    array([[0, 1, 0, 1, 0, 0, 1, 1, 0, 1]])\n",
 718 |     "\n",
 719 |     "`vect.transform(['Something completely new.']).toarray()` is all zeros because none of its words are in the vocabulary:\n",
 720 |     "\n",
 721 |     "    array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])\n",
 722 |     "\n",
 723 |     "`vect.transform(corpus).toarray()` gives one row per document:\n",
 724 |     "\n",
 725 |     "    array([[0, 1, 1, 1, 0, 0, 0, 1, 0, 1],\n",
 726 |     "           [0, 1, 0, 1, 0, 0, 2, 1, 0, 1],\n",
 727 |     "           [1, 0, 0, 0, 0, 1, 0, 1, 1, 0],\n",
 728 |     "           [0, 1, 1, 1, 0, 0, 0, 1, 0, 1],\n",
 729 |     "           [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]])"
 732 |    ]
 733 |   },
 734 |   {
 735 |    "cell_type": "code",
 736 |    "execution_count": null,
 737 |    "metadata": {},
 738 |    "outputs": [],
 739 |    "source": [
 740 |     "from sklearn.feature_extraction import DictVectorizer\n",
 741 |     "\n",
 742 |     "# we only consider top-1k most frequent companies to minimize memory usage\n",
 743 |     "top_companies, top_counts = zip(*collections.Counter(data['Company']).most_common(1000))  # unzip (company, count) pairs into two tuples\n",
 744 |     "print(top_companies)\n",
 745 |     "recognized_companies = set(top_companies)\n",
 746 |     "print(recognized_companies)\n",
 747 |     "# Keep the name of each of the 1000 most frequent companies and map every other company to \"Other\" (via pandas apply).\n",
 748 |     "data[\"Company\"] = data[\"Company\"].apply(lambda comp: comp if comp in recognized_companies else \"Other\")\n",
 749 |     "\n",
 750 |     "categorical_vectorizer = DictVectorizer(dtype=np.float32, sparse=False)\n",
 751 |     "# Turn each row into a dict of its categorical columns (apply with axis=1) and fit the vectorizer on those dicts.\n",
 752 |     "categorical_vectorizer.fit(data[categorical_columns].apply(dict, axis=1))"
 753 |    ]
 754 |   },
 755 |   {
 756 |    "cell_type": "markdown",
 757 |    "metadata": {},
 758 |    "source": [
 759 |     "### The deep learning part\n",
 760 |     "\n",
 761 |     "Once we've learned to tokenize the data, let's design a machine learning experiment.\n",
 762 |     "\n",
 763 |     "As before, we won't focus too much on validation, opting for a simple train-test split.\n",
 764 |     "\n",
 765 |     "__To be completely rigorous,__ we've committed a small crime here: we used the whole data for tokenization and vocabulary building. A stricter way would be to do that part on the training set only. You may want to do that and measure the magnitude of changes."
 766 |    ]
 767 |   },
 768 |   {
 769 |    "cell_type": "code",
 770 |    "execution_count": null,
 771 |    "metadata": {},
 772 |    "outputs": [],
 773 |    "source": [
 774 |     "from sklearn.model_selection import train_test_split\n",
 775 |     "\n",
 776 |     "data_train, data_val = train_test_split(data, test_size=0.2, random_state=42)\n",
 777 |     "data_train.index = range(len(data_train))\n",
 778 |     "data_val.index = range(len(data_val))\n",
 779 |     "\n",
 780 |     "print(\"Train size = \", len(data_train))\n",
 781 |     "print(\"Validation size = \", len(data_val))"
 782 |    ]
 783 |   },
 784 |   {
 785 |    "cell_type": "code",
 786 |    "execution_count": null,
 787 |    "metadata": {},
 788 |    "outputs": [],
 789 |    "source": [
 790 |     "# Build a batch (a dict of model inputs) from a slice of the data\n",
 791 |     "def make_batch(data, max_len=None, word_dropout=0):\n",
 792 |     "    \"\"\"\n",
 793 |     "    Creates a keras-friendly dict from the batch data.\n",
 794 |     "    :param word_dropout: replaces token index with UNK_IX with this probability\n",
 795 |     "    :returns: a dict with {'Title': integer matrix [batch, title_max_len], ...}\n",
 796 |     "    \"\"\"\n",
 797 |     "    batch = {}\n",
 798 |     "    batch[\"Title\"] = as_matrix(data[\"Title\"].values, max_len)\n",
 799 |     "    batch[\"FullDescription\"] = as_matrix(data[\"FullDescription\"].values, max_len)\n",
 800 |     "    batch['Categorical'] = categorical_vectorizer.transform(data[categorical_columns].apply(dict, axis=1))\n",
 801 |     "    \n",
 802 |     "    if word_dropout != 0:\n",
 803 |     "        batch[\"FullDescription\"] = apply_word_dropout(batch[\"FullDescription\"], 1. - word_dropout)\n",
 804 |     "    # target_column is \"Log1pSalary\"\n",
 805 |     "    if target_column in data.columns:\n",
 806 |     "        batch[target_column] = data[target_column].values\n",
 807 |     "    \n",
 808 |     "    return batch\n",
 809 |     "\n",
 810 |     "\n",
 811 |     "def apply_word_dropout(matrix, keep_prop, replace_with=UNK_IX, pad_ix=PAD_IX,):\n",
 812 |     "    dropout_mask = np.random.choice(2, np.shape(matrix), p=[keep_prop, 1 - keep_prop])  # 1 = drop this token\n",
 813 |     "    dropout_mask &= matrix != pad_ix  # never drop padding positions\n",
 814 |     "    # Where the mask is 1 use replace_with; elsewhere keep the original token index\n",
 815 |     "    return np.choose(dropout_mask, [matrix, np.full_like(matrix, replace_with)])"
 816 |    ]
 817 |   },
 818 |   {
 819 |    "cell_type": "code",
 820 |    "execution_count": null,
 821 |    "metadata": {},
 822 |    "outputs": [],
 823 |    "source": [
 824 |     "make_batch(data_train[:3], max_len=10)"
 825 |    ]
 826 |   },
 827 |   {
 828 |    "cell_type": "markdown",
 829 |    "metadata": {},
 830 |    "source": [
 831 |     "#### Architecture\n",
 832 |     "\n",
 833 |     "Our basic model consists of three branches:\n",
 834 |     "* Title encoder\n",
 835 |     "* Description encoder\n",
 836 |     "* Categorical features encoder\n",
 837 |     "\n",
 838 |     "We will then feed all 3 branches into one common network that predicts salary.\n",
 839 |     "\n",
 840 |     "<img src=\"https://github.com/yandexdataschool/nlp_course/raw/master/resources/w2_conv_arch.png\" width=600px>"
 841 |    ]
 842 |   },
 843 |   {
 844 |    "cell_type": "markdown",
 845 |    "metadata": {},
 846 |    "source": [
 847 |     "This clearly doesn't fit into keras' __Sequential__ interface. To build such a network, one will have to use __[Keras Functional API](https://keras.io/models/model/)__.\n",
 848 |     "\n",
 849 |     "See also the Keras merge layers: https://keras.io/layers/merge/"
 850 |    ]
 851 |   },
 852 |   {
 853 |    "cell_type": "code",
 854 |    "execution_count": null,
 855 |    "metadata": {},
 856 |    "outputs": [],
 857 |    "source": [
 858 |     "import keras\n",
 859 |     "#from keras.models import Sequential\n",
 860 |     "import keras.layers as L"
 861 |    ]
 862 |   },
 863 |   {
 864 |    "cell_type": "code",
 865 |    "execution_count": null,
 866 |    "metadata": {},
 867 |    "outputs": [],
 868 |    "source": [
 869 |     "def build_model(n_tokens=len(tokens), n_cat_features=len(categorical_vectorizer.vocabulary_), hid_size=64):\n",
 870 |     "    \"\"\" Build a model that maps three data sources to a single linear output: predicted log1p(salary) \"\"\"\n",
 871 |     "    \n",
 872 |     "    l_title = L.Input(shape=[None], name=\"Title\")\n",
 873 |     "    l_descr = L.Input(shape=[None], name=\"FullDescription\")\n",
 874 |     "    l_categ = L.Input(shape=[n_cat_features], name=\"Categorical\")\n",
 875 |     "    \n",
 876 |     "    # Build your monster!\n",
 877 |     "    # Dense cannot consume variable-length token indices directly, so embed the tokens\n",
 878 |     "    # and max-pool over time to get one fixed-size vector per text.\n",
 879 |     "    x1 = L.GlobalMaxPooling1D()(L.Embedding(n_tokens, hid_size)(l_title))\n",
 880 |     "    x2 = L.GlobalMaxPooling1D()(L.Embedding(n_tokens, hid_size)(l_descr))\n",
 881 |     "    x3 = L.Dense(hid_size, activation='relu')(l_categ)\n",
 882 |     "    merged = L.concatenate([x1, x2, x3])\n",
 883 |     "    hidden = L.Dense(hid_size, activation='relu')(merged)\n",
 884 |     "\n",
 885 |     "    output_layer = L.Dense(1)(hidden)\n",
 886 |     "    # end of your code\n",
 887 |     "    \n",
 888 |     "    model = keras.models.Model(inputs=[l_title, l_descr, l_categ], outputs=[output_layer])\n",
 889 |     "    model.compile('adam', 'mean_squared_error', metrics=['mean_absolute_error'])\n",
 890 |     "    return model"
 891 |    ]
 892 |   },
 893 |   {
 894 |    "cell_type": "code",
 895 |    "execution_count": null,
 896 |    "metadata": {},
 897 |    "outputs": [],
 898 |    "source": [
 899 |     "model = build_model()\n",
 900 |     "model.summary()  # print a summary of the model\n",
 901 |     "\n",
 902 |     "dummy_pred = model.predict(make_batch(data_train[:100]))\n",
 903 |     "dummy_loss = model.train_on_batch(make_batch(data_train[:100]), data_train['Log1pSalary'][:100])[0]\n",
 904 |     "assert dummy_pred.shape == (100, 1)\n",
 905 |     "assert len(np.unique(dummy_pred)) > 20, \"model returns suspiciously few unique outputs. Check your initialization\"\n",
 906 |     "assert np.ndim(dummy_loss) == 0 and 0. <= dummy_loss <= 250., \"make sure you minimize MSE\""
 907 |    ]
 908 |   },
 909 |   {
 910 |    "cell_type": "markdown",
 911 |    "metadata": {},
 912 |    "source": [
 913 |     "#### Training and evaluation\n",
 914 |     "\n",
 915 |     "As usual, we are going to feed our monster with random minibatches of data.\n",
 916 |     "\n",
 917 |     "As we train, we want to monitor not only the loss function, which is computed in log-space, but also the actual error measured in pounds."
 921 |    ]
 922 |   },
 923 |   {
 924 |    "cell_type": "code",
 925 |    "execution_count": null,
 926 |    "metadata": {},
 927 |    "outputs": [],
 928 |    "source": [
 929 |     "def iterate_minibatches(data, batch_size=256, shuffle=True, cycle=False, **kwargs):\n",
 930 |     "    \"\"\" iterates minibatches of data in random order \"\"\"\n",
 931 |     "    while True:\n",
 932 |     "        indices = np.arange(len(data))\n",
 933 |     "        if shuffle:\n",
 934 |     "            indices = np.random.permutation(indices)\n",
 935 |     "\n",
 936 |     "        for start in range(0, len(indices), batch_size):\n",
 937 |     "            batch = make_batch(data.iloc[indices[start : start + batch_size]], **kwargs)\n",
 938 |     "            target = batch.pop(target_column)\n",
 939 |     "            yield batch, target\n",
 940 |     "        \n",
 941 |     "        if not cycle: break"
 942 |    ]
 943 |   },
 944 |   {
 945 |    "cell_type": "markdown",
 946 |    "metadata": {},
 947 |    "source": [
 948 |     "### Model training\n",
 949 |     "\n",
 950 |     "We can now fit our model the usual minibatch way. The interesting part is that we train on an infinite stream of minibatches, produced by the `iterate_minibatches` function."
 951 |    ]
 952 |   },
 953 |   {
 954 |    "cell_type": "code",
 955 |    "execution_count": null,
 956 |    "metadata": {},
 957 |    "outputs": [],
 958 |    "source": [
 959 |     "batch_size = 256\n",
 960 |     "epochs = 10            # definitely too small\n",
 961 |     "steps_per_epoch = 100  # for full pass over data: (len(data_train) - 1) // batch_size + 1\n",
 962 |     "\n",
 963 |     "model = build_model()\n",
 964 |     "# Train the model batch by batch\n",
 965 |     "model.fit_generator(iterate_minibatches(data_train, batch_size, cycle=True, word_dropout=0.05), \n",
 966 |     "                    epochs=epochs, steps_per_epoch=steps_per_epoch,\n",
 967 |     "                    \n",
 968 |     "                    validation_data=iterate_minibatches(data_val, batch_size, cycle=True),\n",
 969 |     "                    validation_steps=data_val.shape[0] // batch_size\n",
 970 |     "                   )"
 971 |    ]
 972 |   },
 973 |   {
 974 |    "cell_type": "code",
 975 |    "execution_count": null,
 976 |    "metadata": {},
 977 |    "outputs": [],
 978 |    "source": [
 979 |     "def print_metrics(model, data, batch_size=batch_size, name=\"\", **kw):\n",
 980 |     "    squared_error = abs_error = num_samples = 0.0\n",
 981 |     "    for batch_x, batch_y in iterate_minibatches(data, batch_size=batch_size, shuffle=False, **kw):\n",
 982 |     "        batch_pred = model.predict(batch_x)[:, 0]\n",
 983 |     "        squared_error += np.sum(np.square(batch_pred - batch_y))\n",
 984 |     "        abs_error += np.sum(np.abs(batch_pred - batch_y))\n",
 985 |     "        num_samples += len(batch_y)\n",
 986 |     "    print(\"%s results:\" % (name or \"\"))\n",
 987 |     "    print(\"Mean square error: %.5f\" % (squared_error / num_samples))\n",
 988 |     "    print(\"Mean absolute error: %.5f\" % (abs_error / num_samples))\n",
 989 |     "    return squared_error, abs_error\n",
 990 |     "    \n",
 991 |     "print_metrics(model, data_train, name='Train')\n",
 992 |     "print_metrics(model, data_val, name='Val');"
 993 |    ]
 994 |   },
 995 |   {
 996 |    "cell_type": "markdown",
 997 |    "metadata": {},
 998 |    "source": [
 999 |     "### Bonus part: explaining model predictions\n",
1000 |     "\n",
1001 |     "It's usually a good idea to understand how your model works before you let it make actual decisions. It's simple for linear models: just see which words learned positive or negative weights. However, it's much harder for neural networks that learn complex nonlinear dependencies.\n",
1002 |     "\n",
1003 |     "There are, however, some ways to look inside the black box:\n",
1004 |     "* Seeing how the model responds to input perturbations\n",
1005 |     "* Finding inputs that maximize/minimize activation of some chosen neurons (_read more [on distill.pub](https://distill.pub/2018/building-blocks/)_)\n",
1006 |     "* Building local linear approximations to your neural network: [article](https://arxiv.org/abs/1602.04938), [eli5 library](https://github.com/TeamHG-Memex/eli5/tree/master/eli5/formatters)\n",
1007 |     "\n",
1008 |     "Today we are going to try the first method just because it's the simplest one."
1013 |    ]
1014 |   },
1015 |   {
1016 |    "cell_type": "code",
1017 |    "execution_count": null,
1018 |    "metadata": {},
1019 |    "outputs": [],
1020 |    "source": [
1021 |     "def explain(model, sample, col_name='Title'):\n",
1022 |     "    \"\"\" Computes the effect each word had on model predictions \"\"\"\n",
1023 |     "    sample = dict(sample)\n",
1024 |     "    sample_col_tokens = [tokens[token_to_id.get(tok, 0)] for tok in sample[col_name].split()]\n",
1025 |     "    data_drop_one_token = pd.DataFrame([sample] * (len(sample_col_tokens) + 1))\n",
1026 |     "\n",
1027 |     "    for drop_i in range(len(sample_col_tokens)):\n",
1028 |     "        data_drop_one_token.loc[drop_i, col_name] = ' '.join(UNK if i == drop_i else tok\n",
1029 |     "                                                   for i, tok in enumerate(sample_col_tokens)) \n",
1030 |     "\n",
1031 |     "    *predictions_drop_one_token, baseline_pred = model.predict(make_batch(data_drop_one_token))[:, 0]\n",
1032 |     "    diffs = baseline_pred - predictions_drop_one_token\n",
1033 |     "    return list(zip(sample_col_tokens, diffs))"
1034 |    ]
1035 |   },
1036 |   {
1037 |    "cell_type": "code",
1038 |    "execution_count": null,
1039 |    "metadata": {},
1040 |    "outputs": [],
1041 |    "source": [
1042 |     "from IPython.display import HTML, display_html\n",
1043 |     "\n",
1044 |     "def draw_html(tokens_and_weights, cmap=plt.get_cmap(\"bwr\"), display=True,\n",
1045 |     "              token_template=\"\"\"<span style=\"background-color: {color_hex}\">{token}</span>\"\"\",\n",
1046 |     "              font_style=\"font-size:14px;\"\n",
1047 |     "             ):\n",
1048 |     "    \n",
1049 |     "    def get_color_hex(weight):\n",
1050 |     "        rgba = cmap(1. / (1 + np.exp(weight)), bytes=True)\n",
1051 |     "        return '#%02X%02X%02X' % rgba[:3]\n",
1052 |     "    \n",
1053 |     "    tokens_html = [\n",
1054 |     "        token_template.format(token=token, color_hex=get_color_hex(weight))\n",
1055 |     "        for token, weight in tokens_and_weights\n",
1056 |     "    ]\n",
1057 |     "    \n",
1058 |     "    \n",
1059 |     "    raw_html = \"\"\"<p style=\"{}\">{}</p>\"\"\".format(font_style, ' '.join(tokens_html))\n",
1060 |     "    if display:\n",
1061 |     "        display_html(HTML(raw_html))\n",
1062 |     "        \n",
1063 |     "    return raw_html\n",
1064 |     "    "
1065 |    ]
1066 |   },
1067 |   {
1068 |    "cell_type": "code",
1069 |    "execution_count": null,
1070 |    "metadata": {},
1071 |    "outputs": [],
1072 |    "source": [
1073 |     "i = 36605\n",
1074 |     "tokens_and_weights = explain(model, data.loc[i], \"Title\")\n",
1075 |     "draw_html([(tok, weight * 5) for tok, weight in tokens_and_weights], font_style='font-size:20px;');\n",
1076 |     "\n",
1077 |     "tokens_and_weights = explain(model, data.loc[i], \"FullDescription\")\n",
1078 |     "draw_html([(tok, weight * 10) for tok, weight in tokens_and_weights]);"
1079 |    ]
1080 |   },
1081 |   {
1082 |    "cell_type": "code",
1083 |    "execution_count": null,
1084 |    "metadata": {},
1085 |    "outputs": [],
1086 |    "source": [
1087 |     "i = 12077\n",
1088 |     "tokens_and_weights = explain(model, data.loc[i], \"Title\")\n",
1089 |     "draw_html([(tok, weight * 5) for tok, weight in tokens_and_weights], font_style='font-size:20px;');\n",
1090 |     "\n",
1091 |     "tokens_and_weights = explain(model, data.loc[i], \"FullDescription\")\n",
1092 |     "draw_html([(tok, weight * 10) for tok, weight in tokens_and_weights]);"
1093 |    ]
1094 |   },
1095 |   {
1096 |    "cell_type": "code",
1097 |    "execution_count": null,
1098 |    "metadata": {},
1099 |    "outputs": [],
1100 |    "source": [
1101 |     "i = np.random.randint(len(data))\n",
1102 |     "print(\"Index:\", i)\n",
1103 |     "print(\"Salary (gbp):\", np.expm1(model.predict(make_batch(data.iloc[i: i+1]))[0, 0]))\n",
1104 |     "\n",
1105 |     "tokens_and_weights = explain(model, data.loc[i], \"Title\")\n",
1106 |     "draw_html([(tok, weight * 5) for tok, weight in tokens_and_weights], font_style='font-size:20px;');\n",
1107 |     "\n",
1108 |     "tokens_and_weights = explain(model, data.loc[i], \"FullDescription\")\n",
1109 |     "draw_html([(tok, weight * 10) for tok, weight in tokens_and_weights]);"
1110 |    ]
1111 |   },
1112 |   {
1113 |    "cell_type": "markdown",
1114 |    "metadata": {},
1115 |    "source": [
1116 |     "__Terrible start-up idea #1962:__ make a tool that automatically rephrases your job description (or CV) to meet salary expectations :)"
1117 |    ]
1118 |   },
1119 |   {
1120 |    "cell_type": "code",
1121 |    "execution_count": null,
1122 |    "metadata": {},
1123 |    "outputs": [],
1124 |    "source": []
1125 |   }
1126 |  ],
1127 |  "metadata": {
1128 |   "kernelspec": {
1129 |    "display_name": "Python 3",
1130 |    "language": "python",
1131 |    "name": "python3"
1132 |   },
1133 |   "language_info": {
1134 |    "codemirror_mode": {
1135 |     "name": "ipython",
1136 |     "version": 3
1137 |    },
1138 |    "file_extension": ".py",
1139 |    "mimetype": "text/x-python",
1140 |    "name": "python",
1141 |    "nbconvert_exporter": "python",
1142 |    "pygments_lexer": "ipython3",
1143 |    "version": "3.6.6"
1144 |   }
1145 |  },
1146 |  "nbformat": 4,
1147 |  "nbformat_minor": 2
1148 | }
1149 | 


--------------------------------------------------------------------------------
/resource/material/README.md:
--------------------------------------------------------------------------------
1 | # Material
2 | 


--------------------------------------------------------------------------------
/resource/slides/MIT-data-science/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/.gitignore


--------------------------------------------------------------------------------
/resource/slides/MIT-data-science/Chapter 11. Introduction to Machine Learning.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/Chapter 11. Introduction to Machine Learning.pdf


--------------------------------------------------------------------------------
/resource/slides/MIT-data-science/Chapter 12. Clustering.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/Chapter 12. Clustering.pdf


--------------------------------------------------------------------------------
/resource/slides/MIT-data-science/Chapter13,14,15_MJLEE.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/Chapter13,14,15_MJLEE.pptx


--------------------------------------------------------------------------------
/resource/slides/MIT-data-science/MIT6_0002F16_lec1_cwjun.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/MIT6_0002F16_lec1_cwjun.pdf


--------------------------------------------------------------------------------
/resource/slides/MIT-data-science/MIT6_0002F16_lec2_cwjun.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/MIT6_0002F16_lec2_cwjun.pdf


--------------------------------------------------------------------------------
/resource/slides/MIT-data-science/MIT6_0002F16_lec5_lec6_ssg.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/MIT6_0002F16_lec5_lec6_ssg.pdf


--------------------------------------------------------------------------------
/resource/slides/MIT-data-science/MIT6_0002F16_lec9_Eon.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/MIT-data-science/MIT6_0002F16_lec9_Eon.pdf


--------------------------------------------------------------------------------
/resource/slides/README.md:
--------------------------------------------------------------------------------
1 | # Slides
2 | 


--------------------------------------------------------------------------------
/resource/slides/deeppavlov/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/deeppavlov/.gitignore


--------------------------------------------------------------------------------
/resource/slides/deeppavlov/deeppavlov_Automatic spelling correction.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/deeppavlov/deeppavlov_Automatic spelling correction.pdf


--------------------------------------------------------------------------------
/resource/slides/linear-algebra/Chapter_3_Least_square.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/linear-algebra/Chapter_3_Least_square.pptx


--------------------------------------------------------------------------------
/resource/slides/linear-algebra/README.md:
--------------------------------------------------------------------------------
1 | # Linear_algebra
2 | 


--------------------------------------------------------------------------------
/resource/slides/paper-review/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/.gitignore


--------------------------------------------------------------------------------
/resource/slides/paper-review/Character-Aware Neural Language Models.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/Character-Aware Neural Language Models.pdf


--------------------------------------------------------------------------------
/resource/slides/paper-review/Character-level CNN for text classification.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/Character-level CNN for text classification.pptx


--------------------------------------------------------------------------------
/resource/slides/paper-review/Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/Efficient Character-level Document Classification by Combining Convolution and Recurrent Layers.pptx


--------------------------------------------------------------------------------
/resource/slides/paper-review/Learning phrase representation using RNN Encoder-Decoder for SMT.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/Learning phrase representation using RNN Encoder-Decoder for SMT.pdf


--------------------------------------------------------------------------------
/resource/slides/paper-review/MASS.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/MASS.pdf


--------------------------------------------------------------------------------
/resource/slides/paper-review/Robustly optimized BERT Pretraining Approaches.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/Robustly optimized BERT Pretraining Approaches.pptx


--------------------------------------------------------------------------------
/resource/slides/paper-review/TransformerXL.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/TransformerXL.pdf


--------------------------------------------------------------------------------
/resource/slides/paper-review/VDCNN.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/VDCNN.pdf


--------------------------------------------------------------------------------
/resource/slides/paper-review/seqtoseq_attention_20190417.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/paper-review/seqtoseq_attention_20190417.pdf


--------------------------------------------------------------------------------
/resource/slides/soynlp/Soynlp 2일차.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/Soynlp 2일차.pptx


--------------------------------------------------------------------------------
/resource/slides/soynlp/empty:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/empty


--------------------------------------------------------------------------------
/resource/slides/soynlp/fastcampus_1일차.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_1일차.pptx


--------------------------------------------------------------------------------
/resource/slides/soynlp/fastcampus_day3/From frequency to meaning, Vector space models of semantics.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/From frequency to meaning, Vector space models of semantics.pdf


--------------------------------------------------------------------------------
/resource/slides/soynlp/fastcampus_day3/Korean conjugation.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/Korean conjugation.pdf


--------------------------------------------------------------------------------
/resource/slides/soynlp/fastcampus_day3/Korean lemmatization.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/Korean lemmatization.pdf


--------------------------------------------------------------------------------
/resource/slides/soynlp/fastcampus_day3/L2_L1 regularization.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/L2_L1 regularization.pdf


--------------------------------------------------------------------------------
/resource/slides/soynlp/fastcampus_day3/LSA.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/LSA.pdf


--------------------------------------------------------------------------------
/resource/slides/soynlp/fastcampus_day3/Logistic regression with L1, L2 regularization and keyword extraction.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/Logistic regression with L1, L2 regularization and keyword extraction.pdf


--------------------------------------------------------------------------------
/resource/slides/soynlp/fastcampus_day3/Neural Word Embedding as Implicit Matrix Factorization.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/soynlp/fastcampus_day3/Neural Word Embedding as Implicit Matrix Factorization.pdf


--------------------------------------------------------------------------------
/resource/slides/yandex/2월 2째주-yandex- week04_seq2seq_seminar.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/yandex/2월 2째주-yandex- week04_seq2seq_seminar.pptx


--------------------------------------------------------------------------------
/resource/slides/yandex/2월 3째주-yandex-week04_seq2seq_seminar layer normalization.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/yandex/2월 3째주-yandex-week04_seq2seq_seminar layer normalization.pptx


--------------------------------------------------------------------------------
/resource/slides/yandex/yandex-week-07-mt-02.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/modudeepnlp/DeepNLP2019/e303cca4812f85c551968babc379a2f5e140868d/resource/slides/yandex/yandex-week-07-mt-02.pptx


--------------------------------------------------------------------------------