├── LICENSE
├── MANIFEST.in
├── README.md
├── chars2vec
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-310.pyc
│   │   ├── __init__.cpython-39.pyc
│   │   ├── model.cpython-310.pyc
│   │   └── model.cpython-39.pyc
│   ├── model.py
│   └── trained_models
│       ├── eng_100
│       │   ├── model.pkl
│       │   └── weights.h5
│       ├── eng_150
│       │   ├── model.pkl
│       │   └── weights.h5
│       ├── eng_200
│       │   ├── model.pkl
│       │   └── weights.h5
│       ├── eng_300
│       │   ├── model.pkl
│       │   └── weights.h5
│       └── eng_50
│           ├── model.pkl
│           └── weights.h5
├── example_training.py
├── example_usage.py
├── requirements.txt
├── setup.cfg
└── setup.py
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | recursive-include chars2vec/ *
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # chars2vec
2 |
3 | #### Character-based word embeddings model based on RNN
4 |
5 |
6 | The chars2vec library can be very useful if you are dealing with texts
7 | containing abbreviations, slang, typos, or any other domain-specific textual dataset.
8 | The chars2vec language model is based on the symbolic representation of words –
9 | the model maps each word to a vector of fixed length.
10 | These vector representations are obtained by training a custom neural network
11 | on pairs of similar and non-similar words.
12 | This custom network includes an LSTM that reads the sequence of characters in each word.
13 | The model maps similarly written words to nearby vectors.
14 | This approach makes it possible to embed any sequence of characters into a vector space.
15 | Chars2vec models do not keep a dictionary of embeddings;
16 | instead they generate embedding vectors on the fly using the pretrained model.
17 |
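For orientation, the following is a condensed sketch of the pair ("siamese") network described above, mirroring the architecture built in `chars2vec/model.py`: a shared two-layer LSTM encoder reads one-hot character sequences, and the squared distance between the two word embeddings is fed to a sigmoid that predicts whether the pair is similar (target 0) or not similar (target 1). The sizes below are illustrative only.

~~~python
import tensorflow as tf

vocab_size, emb_dim = 59, 50  # illustrative sizes; chars2vec derives them from model_chars and emb_dim

# Shared character encoder: two stacked LSTMs over one-hot character sequences
chars_in = tf.keras.layers.Input(shape=(None, vocab_size))
h = tf.keras.layers.LSTM(emb_dim, return_sequences=True)(chars_in)
h = tf.keras.layers.LSTM(emb_dim)(h)
encoder = tf.keras.models.Model(chars_in, h)

# Pair model: embed both words with the same encoder and score their distance
word_1 = tf.keras.layers.Input(shape=(None, vocab_size))
word_2 = tf.keras.layers.Input(shape=(None, vocab_size))
diff = tf.keras.layers.Subtract()([encoder(word_1), encoder(word_2)])
dist = tf.keras.layers.Dot(axes=1)([diff, diff])  # squared Euclidean distance
out = tf.keras.layers.Dense(1, activation='sigmoid')(dist)
pair_model = tf.keras.models.Model([word_1, word_2], out)
pair_model.compile(optimizer='adam', loss='mae')
~~~
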
18 | There are pretrained models of dimensions 50, 100, 150, 200 and 300 for the English language.
19 | The library provides a convenient API to train a model for an arbitrary set of characters.
20 | Read more details about the architecture of [Chars2vec:
21 | Character-based language model for handling real world texts with spelling
22 | errors and human slang](https://hackernoon.com/chars2vec-character-based-language-model-for-handling-real-world-texts-with-spelling-errors-and-a3e4053a147d) in Hacker Noon.
23 |
24 | #### The model is available for Python 2.7 and 3.0+.
25 |
26 | ### Installation
27 |
28 | 1. Build and install from source
29 | Download the project source and run the following in your command line:
30 |
31 | ~~~shell
32 | >> python setup.py install
33 | ~~~
34 |
35 | 2. Via pip
36 | Run the following in your command line:
37 |
38 | ~~~shell
39 | >> pip install chars2vec
40 | ~~~
41 |
42 | ### Usage
43 |
44 | The function `chars2vec.load_model(str path)` initializes the model from a directory
45 | and returns a `chars2vec.Chars2Vec` object.
46 | There are five pretrained English models with dimensions 50, 100, 150, 200 and 300.
47 | To load one of these pretrained models:
48 |
49 | ~~~python
50 | import chars2vec
51 |
52 | # Load Intuition Engineering pretrained model
53 | # Model names: 'eng_50', 'eng_100', 'eng_150', 'eng_200', 'eng_300'
54 | c2v_model = chars2vec.load_model('eng_50')
55 | ~~~
56 | The method `chars2vec.Chars2Vec.vectorize_words(words)` returns a `numpy.ndarray` of shape `(n_words, dim)` containing the word embeddings.
57 |
58 | ~~~python
59 | words = ['list', 'of', 'words']
60 |
61 | # Create word embeddings
62 | word_embeddings = c2v_model.vectorize_words(words)
63 | ~~~
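
Because similarly written words are mapped to nearby vectors, a quick sanity check is to compare pairwise distances between embeddings (the word choices below are purely illustrative):

~~~python
import numpy as np

# Embeddings of a word, a misspelling of it, and an unrelated word
emb = c2v_model.vectorize_words(['language', 'languagge', 'mechanizing'])

d_similar = np.linalg.norm(emb[0] - emb[1])    # 'language' vs 'languagge'
d_different = np.linalg.norm(emb[0] - emb[2])  # 'language' vs 'mechanizing'

# The misspelled pair is expected to lie closer together than the unrelated pair
print(d_similar, d_different)
~~~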
64 |
65 | ### Training
66 |
67 | The function `chars2vec.train_model(int emb_dim, X_train, y_train, model_chars)`
68 | creates and trains a new chars2vec model and returns a `chars2vec.Chars2Vec` object.
69 |
70 | The parameter `emb_dim` is the dimension of the embeddings.
71 |
72 | The parameter `X_train` is a list or `numpy.ndarray` of word pairs.
73 | The parameter `y_train` is a list or `numpy.ndarray` of target values that describe the proximity of the words in each pair.
74 |
75 | The training set (`X_train`, `y_train`) consists of pairs of "similar" and "not similar" words;
76 | a pair of "similar" words is labeled with a target value of 0, and a pair of "not similar" words with 1.
77 |
78 | The parameter `model_chars` is the list of characters the model works with.
79 | Characters that are not in the `model_chars`
80 | list will be ignored by the model.
81 |
82 | Read more about chars2vec training and the generation of a training dataset in the
83 | [article about chars2vec](https://hackernoon.com/chars2vec-character-based-language-model-for-handling-real-world-texts-with-spelling-errors-and-a3e4053a147d).
84 |
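The library does not prescribe how such pairs are produced. Purely as an illustration (a simplistic sketch, not the procedure used to build the pretrained models), one way is to pair each word with a randomly corrupted copy of itself (target 0) and with a corrupted copy of a different word (target 1):

~~~python
import random

def corrupt(word, chars, n_edits=1):
    # Hypothetical, simplistic typo model: a few random single-character substitutions
    letters = list(word)
    for _ in range(n_edits):
        pos = random.randrange(len(letters))
        letters[pos] = random.choice(chars)
    return ''.join(letters)

def make_pairs(words, chars):
    X, y = [], []
    for i, word in enumerate(words):
        X.append((word, corrupt(word, chars)))                   # similar pair
        y.append(0)
        other = words[(i + 1) % len(words)]
        X.append((corrupt(word, chars), corrupt(other, chars)))  # dissimilar pair
        y.append(1)
    return X, y

X_train, y_train = make_pairs(['mechanizing', 'discovery', 'protoplasmatic'],
                              list('abcdefghijklmnopqrstuvwxyz'))
~~~
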
85 | The function `chars2vec.save_model(c2v_model, str path_to_model)` saves a trained model
86 | to a directory.
87 |
88 |
89 | ~~~python
90 | import chars2vec
91 |
92 | dim = 50
93 | path_to_model = 'path/to/model/directory'
94 |
95 | X_train = [('mecbanizing', 'mechanizing'),         # similar words, target equals 0
96 |            ('dicovery', 'dis7overy'),              # similar words, target equals 0
97 |            ('prot$oplasmatic', 'prtoplasmatic'),   # similar words, target equals 0
98 |            ('copulateng', 'lzateful'),             # not similar words, target equals 1
99 |            ('estry', 'evadin6'),                   # not similar words, target equals 1
100 |            ('cirrfosis', 'afear')                  # not similar words, target equals 1
101 | ]
102 |
103 | y_train = [0, 0, 0, 1, 1, 1]
104 |
105 | model_chars = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.',
106 | '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<',
107 | '=', '>', '?', '@', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
108 | 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
109 | 'x', 'y', 'z']
110 |
111 | # Create and train chars2vec model using given training data
112 | my_c2v_model = chars2vec.train_model(dim, X_train, y_train, model_chars)
113 |
114 | # Save your pretrained model
115 | chars2vec.save_model(my_c2v_model, path_to_model)
116 |
117 | # Load your pretrained model
118 | c2v_model = chars2vec.load_model(path_to_model)
119 | ~~~
120 |
121 | For full code examples of usage and training, see the
122 | `example_usage.py` and `example_training.py` files.
123 |
--------------------------------------------------------------------------------
/chars2vec/__init__.py:
--------------------------------------------------------------------------------
1 | from .model import *
--------------------------------------------------------------------------------
/chars2vec/__pycache__/__init__.cpython-310.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/__pycache__/__init__.cpython-310.pyc
--------------------------------------------------------------------------------
/chars2vec/__pycache__/__init__.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/__pycache__/__init__.cpython-39.pyc
--------------------------------------------------------------------------------
/chars2vec/__pycache__/model.cpython-310.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/__pycache__/model.cpython-310.pyc
--------------------------------------------------------------------------------
/chars2vec/__pycache__/model.cpython-39.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/__pycache__/model.cpython-39.pyc
--------------------------------------------------------------------------------
/chars2vec/model.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pickle
3 | import tensorflow as tf
4 | import os
5 |
6 | class Chars2Vec:
7 |
8 | def __init__(self, emb_dim, char_to_ix):
9 | '''
10 | Creates chars2vec model.
11 |
12 | :param emb_dim: int, dimension of embeddings.
13 | :param char_to_ix: dict, keys are characters, values are sequence numbers of characters.
14 | '''
15 |
16 | if not isinstance(emb_dim, int) or emb_dim < 1:
17 | raise TypeError("parameter 'emb_dim' must be a positive integer")
18 |
19 | if not isinstance(char_to_ix, dict):
20 | raise TypeError("parameter 'char_to_ix' must be a dictionary")
21 |
22 | self.char_to_ix = char_to_ix
23 | self.ix_to_char = {char_to_ix[ch]: ch for ch in char_to_ix}
24 | self.vocab_size = len(self.char_to_ix)
25 | self.dim = emb_dim
26 | self.cache = {}
27 |
28 | lstm_input = tf.keras.layers.Input(shape=(None, self.vocab_size))
29 |
30 | x = tf.keras.layers.LSTM(emb_dim, return_sequences=True)(lstm_input)
31 | x = tf.keras.layers.LSTM(emb_dim)(x)
32 |
33 | self.embedding_model = tf.keras.models.Model(inputs=[lstm_input], outputs=x)
34 |
35 | model_input_1 = tf.keras.layers.Input(shape=(None, self.vocab_size))
36 | model_input_2 = tf.keras.layers.Input(shape=(None, self.vocab_size))
37 |
38 | embedding_1 = self.embedding_model(model_input_1)
39 | embedding_2 = self.embedding_model(model_input_2)
40 | x = tf.keras.layers.Subtract()([embedding_1, embedding_2])
41 | x = tf.keras.layers.Dot(1)([x, x])
42 | model_output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
43 |
44 | self.model = tf.keras.models.Model(inputs=[model_input_1, model_input_2], outputs=model_output)
45 | self.model.compile(optimizer='adam', loss='mae')
46 |
47 |     def fit(self, word_pairs, targets,
48 |             max_epochs, patience, validation_split, batch_size):
49 |         '''
50 |         Fits the model.
51 |
52 |         :param word_pairs: list or numpy.ndarray of word pairs.
53 |         :param targets: list or numpy.ndarray of targets.
54 |         :param max_epochs: parameter 'epochs' of tensorflow model.
55 |         :param patience: parameter 'patience' of callback in tensorflow model.
56 |         :param validation_split: parameter 'validation_split' of tensorflow model.
57 |         :param batch_size: parameter 'batch_size' of tensorflow model.
58 |         '''
59 |
60 |         if not isinstance(word_pairs, list) and not isinstance(word_pairs, np.ndarray):
61 |             raise TypeError("parameter 'word_pairs' must be a list or numpy.ndarray")
62 |
63 |         if not isinstance(targets, list) and not isinstance(targets, np.ndarray):
64 |             raise TypeError("parameter 'targets' must be a list or numpy.ndarray")
65 |
66 |         word_pairs, targets = np.array(word_pairs), np.array(targets)
67 |
68 | x_1, x_2 = [], []
69 |
70 | for pair_words in word_pairs:
71 | emb_list_1 = []
72 | emb_list_2 = []
73 |
74 | if not isinstance(pair_words[0], str) or not isinstance(pair_words[1], str):
75 | raise TypeError("word must be a string")
76 |
77 | first_word = pair_words[0].lower()
78 | second_word = pair_words[1].lower()
79 |
80 | for t in range(len(first_word)):
81 |
82 | if first_word[t] in self.char_to_ix:
83 | x = np.zeros(self.vocab_size)
84 | x[self.char_to_ix[first_word[t]]] = 1
85 | emb_list_1.append(x)
86 |
87 | else:
88 | emb_list_1.append(np.zeros(self.vocab_size))
89 |
90 | x_1.append(np.array(emb_list_1))
91 |
92 | for t in range(len(second_word)):
93 |
94 | if second_word[t] in self.char_to_ix:
95 | x = np.zeros(self.vocab_size)
96 | x[self.char_to_ix[second_word[t]]] = 1
97 | emb_list_2.append(x)
98 |
99 | else:
100 | emb_list_2.append(np.zeros(self.vocab_size))
101 |
102 | x_2.append(np.array(emb_list_2))
103 |
104 | x_1_pad_seq = tf.keras.preprocessing.sequence.pad_sequences(x_1)
105 | x_2_pad_seq = tf.keras.preprocessing.sequence.pad_sequences(x_2)
106 |
107 | self.model.fit([x_1_pad_seq, x_2_pad_seq], targets,
108 | batch_size=batch_size, epochs=max_epochs,
109 | validation_split=validation_split,
110 | callbacks=[tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=patience)])
111 |
112 | def vectorize_words(self, words, maxlen_padseq=None):
113 | '''
114 |         Returns embeddings for a list of words. Uses a cache of word embeddings to speed up vectorization.
115 |
116 | :param words: list or numpy.ndarray of strings.
117 | :param maxlen_padseq: parameter 'maxlen' for tensorflow pad_sequences transform.
118 |
119 | :return word_vectors: numpy.ndarray, word embeddings.
120 | '''
121 |
122 | if not isinstance(words, list) and not isinstance(words, np.ndarray):
123 | raise TypeError("parameter 'words' must be a list or numpy.ndarray")
124 |
125 | words = [w.lower() for w in words]
126 | unique_words = np.unique(words)
127 | new_words = [w for w in unique_words if w not in self.cache]
128 |
129 | if len(new_words) > 0:
130 |
131 | list_of_embeddings = []
132 |
133 | for current_word in new_words:
134 |
135 | if not isinstance(current_word, str):
136 | raise TypeError("word must be a string")
137 |
138 | current_embedding = []
139 |
140 | for t in range(len(current_word)):
141 |
142 | if current_word[t] in self.char_to_ix:
143 | x = np.zeros(self.vocab_size)
144 | x[self.char_to_ix[current_word[t]]] = 1
145 | current_embedding.append(x)
146 |
147 | else:
148 | current_embedding.append(np.zeros(self.vocab_size))
149 |
150 | list_of_embeddings.append(np.array(current_embedding))
151 |
152 | embeddings_pad_seq = tf.keras.preprocessing.sequence.pad_sequences(list_of_embeddings, maxlen=maxlen_padseq)
153 | new_words_vectors = self.embedding_model(embeddings_pad_seq)
154 |
155 | for i in range(len(new_words)):
156 | self.cache[new_words[i]] = new_words_vectors[i]
157 |
158 | word_vectors = [self.cache[current_word] for current_word in words]
159 |
160 | return np.array(word_vectors)
161 |
162 | def save_model(c2v_model, path_to_model):
163 | '''
164 | Saves trained model to directory.
165 |
166 | :param c2v_model: Chars2Vec object, trained model.
167 | :param path_to_model: str, path to save model.
168 | '''
169 |
170 | if not os.path.exists(path_to_model):
171 | os.makedirs(path_to_model)
172 |
173 | c2v_model.embedding_model.save_weights(path_to_model + '/weights.h5')
174 |
175 | with open(path_to_model + '/model.pkl', 'wb') as f:
176 | pickle.dump([c2v_model.dim, c2v_model.char_to_ix], f, protocol=2)
177 |
178 |
179 | def load_model(path):
180 | '''
181 | Loads trained model.
182 |
183 |     :param path: str, if it is 'eng_50', 'eng_100', 'eng_150', 'eng_200' or 'eng_300' then loads one of the default models,
184 |                  otherwise loads the model from `path`.
185 |
186 | :return c2v_model: Chars2Vec object, trained model.
187 | '''
188 |
189 | if path in ['eng_50', 'eng_100', 'eng_150', 'eng_200', 'eng_300']:
190 | path_to_model = os.path.dirname(os.path.abspath(__file__)) + '/trained_models/' + path
191 |
192 | else:
193 | path_to_model = path
194 |
195 | with open(path_to_model + '/model.pkl', 'rb') as f:
196 | structure = pickle.load(f)
197 | emb_dim, char_to_ix = structure[0], structure[1]
198 |
199 | c2v_model = Chars2Vec(emb_dim, char_to_ix)
200 | c2v_model.embedding_model.load_weights(path_to_model + '/weights.h5')
201 | c2v_model.embedding_model.compile(optimizer='adam', loss='mae')
202 |
203 | return c2v_model
204 |
205 |
206 | def train_model(emb_dim, X_train, y_train, model_chars,
207 | max_epochs=200, patience=10, validation_split=0.05, batch_size=64):
208 | '''
209 | Creates and trains chars2vec model using given training data.
210 |
211 | :param emb_dim: int, dimension of embeddings.
212 | :param X_train: list or numpy.ndarray of word pairs.
213 | :param y_train: list or numpy.ndarray of target values that describe the proximity of words.
214 | :param model_chars: list or numpy.ndarray of basic chars in model.
215 | :param max_epochs: parameter 'epochs' of keras model.
216 | :param patience: parameter 'patience' of callback in keras model.
217 | :param validation_split: parameter 'validation_split' of keras model.
218 | :param batch_size: parameter 'batch_size' of keras model.
219 |
220 | :return c2v_model: Chars2Vec object, trained model.
221 | '''
222 |
223 | if not isinstance(X_train, list) and not isinstance(X_train, np.ndarray):
224 | raise TypeError("parameter 'X_train' must be a list or numpy.ndarray")\
225 |
226 | if not isinstance(y_train, list) and not isinstance(y_train, np.ndarray):
227 | raise TypeError("parameter 'y_train' must be a list or numpy.ndarray")
228 |
229 | if not isinstance(model_chars, list) and not isinstance(model_chars, np.ndarray):
230 | raise TypeError("parameter 'model_chars' must be a list or numpy.ndarray")
231 |
232 | char_to_ix = {ch: i for i, ch in enumerate(model_chars)}
233 | c2v_model = Chars2Vec(emb_dim, char_to_ix)
234 |
235 | targets = [float(el) for el in y_train]
236 | c2v_model.fit(X_train, targets, max_epochs, patience, validation_split, batch_size)
237 |
238 | return c2v_model
239 |
--------------------------------------------------------------------------------
/chars2vec/trained_models/eng_100/model.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/trained_models/eng_100/model.pkl
--------------------------------------------------------------------------------
/chars2vec/trained_models/eng_100/weights.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/trained_models/eng_100/weights.h5
--------------------------------------------------------------------------------
/chars2vec/trained_models/eng_150/model.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/trained_models/eng_150/model.pkl
--------------------------------------------------------------------------------
/chars2vec/trained_models/eng_150/weights.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/trained_models/eng_150/weights.h5
--------------------------------------------------------------------------------
/chars2vec/trained_models/eng_200/model.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/trained_models/eng_200/model.pkl
--------------------------------------------------------------------------------
/chars2vec/trained_models/eng_200/weights.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/trained_models/eng_200/weights.h5
--------------------------------------------------------------------------------
/chars2vec/trained_models/eng_300/model.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/trained_models/eng_300/model.pkl
--------------------------------------------------------------------------------
/chars2vec/trained_models/eng_300/weights.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/trained_models/eng_300/weights.h5
--------------------------------------------------------------------------------
/chars2vec/trained_models/eng_50/model.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/trained_models/eng_50/model.pkl
--------------------------------------------------------------------------------
/chars2vec/trained_models/eng_50/weights.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IntuitionEngineeringTeam/chars2vec/fe56df29d57314a824c38ad19e82ae3c34df0862/chars2vec/trained_models/eng_50/weights.h5
--------------------------------------------------------------------------------
/example_training.py:
--------------------------------------------------------------------------------
1 | import chars2vec
2 |
3 |
4 | dim = 50
5 |
6 | path_to_model = 'path/to/model/directory'
7 |
8 | X_train = [('mecbanizing', 'mechanizing'),         # similar words, target equals 0
9 |            ('dicovery', 'dis7overy'),              # similar words, target equals 0
10 |            ('prot$oplasmatic', 'prtoplasmatic'),   # similar words, target equals 0
11 |            ('copulateng', 'lzateful'),             # not similar words, target equals 1
12 |            ('estry', 'evadin6'),                   # not similar words, target equals 1
13 |            ('cirrfosis', 'afear')                  # not similar words, target equals 1
14 | ]
15 |
16 | y_train = [0, 0, 0, 1, 1, 1]
17 |
18 | model_chars = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.',
19 | '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<',
20 | '=', '>', '?', '@', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
21 | 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
22 | 'x', 'y', 'z']
23 |
24 | # Create and train chars2vec model using given training data
25 | my_c2v_model = chars2vec.train_model(dim, X_train, y_train, model_chars)
26 |
27 | # Save pretrained model
28 | chars2vec.save_model(my_c2v_model, path_to_model)
29 |
30 | words = ['list', 'of', 'words']
31 |
32 | # Load pretrained model, create word embeddings
33 | c2v_model = chars2vec.load_model(path_to_model)
34 | word_embeddings = c2v_model.vectorize_words(words)
35 |
--------------------------------------------------------------------------------
/example_usage.py:
--------------------------------------------------------------------------------
1 | import chars2vec
2 | import sklearn.decomposition
3 | import matplotlib.pyplot as plt
4 |
5 |
6 | # Load Intuition Engineering pretrained model
7 | # Model names: 'eng_50', 'eng_100', 'eng_150', 'eng_200', 'eng_300'
8 | c2v_model = chars2vec.load_model('eng_50')
9 |
10 | words = ['Natural', 'Language', 'Understanding',
11 | 'Naturael', 'Longuge', 'Updderctundjing',
12 | 'Motural', 'Lamnguoge', 'Understaating',
13 | 'Naturrow', 'Laguage', 'Unddertandink',
14 | 'Nattural', 'Languagge', 'Umderstoneding']
15 |
16 | # Create word embeddings
17 | word_embeddings = c2v_model.vectorize_words(words)
18 |
19 | # Project the embeddings onto a plane using PCA
20 | projection_2d = sklearn.decomposition.PCA(n_components=2).fit_transform(word_embeddings)
21 |
22 | # Draw words on plane
23 | f = plt.figure(figsize=(8, 6))
24 |
25 | for j in range(len(projection_2d)):
26 | plt.scatter(projection_2d[j, 0], projection_2d[j, 1],
27 | marker=('$' + words[j] + '$'),
28 | s=500 * len(words[j]), label=j,
29 | facecolors='green' if words[j]
30 | in ['Natural', 'Language', 'Understanding'] else 'black')
31 |
32 | plt.show()
33 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | setuptools
2 | tensorflow
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | description-file = README.md
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import subprocess
3 | PY_VER = sys.version[0]
4 | subprocess.call(["pip{:} install -r requirements.txt".format(PY_VER)], shell=True)
5 |
6 | from setuptools import setup
7 |
8 | setup(
9 | name='chars2vec',
10 | version='0.1.7',
11 | author='Vladimir Chikin',
12 | author_email='v4@intuition.engineering',
13 | packages=['chars2vec'],
14 | include_package_data=True,
15 | package_data={'chars2vec': ['trained_models/*']},
16 | description='Character-based word embeddings model based on RNN',
17 | maintainer='Intuition',
18 | maintainer_email='dev@intuition.engineering',
19 | url='https://github.com/IntuitionEngineeringTeam/chars2vec',
20 | download_url='https://github.com/IntuitionEngineeringTeam/chars2vec/archive/master.zip',
21 | license='Apache License 2.0',
22 | long_description='The chars2vec library can be very useful if you are dealing with texts \
23 | containing abbreviations, slang, typos, or any other domain-specific textual dataset. \
24 | The chars2vec language model is based on the symbolic representation of words – \
25 | the model maps each word to a vector of fixed length. \
26 | These vector representations are obtained by training a custom neural network \
27 | on pairs of similar and non-similar words. \
28 | This custom network includes an LSTM that reads the sequence of characters in each word. \
29 | The model maps similarly written words to nearby vectors. \
30 | This approach makes it possible to embed any sequence of characters into a vector space. \
31 | Chars2vec models do not keep any dictionary of embeddings, \
32 | but generate embedding vectors on the fly using the pretrained model. \
33 | There are pretrained models of dimensions 50, 100, 150, 200 and 300 for the English language. \
34 | The library provides a convenient API to train a model for an arbitrary set of characters.',
35 | classifiers=['Programming Language :: Python :: 2.7',
36 | 'Programming Language :: Python :: 3']
37 | )
--------------------------------------------------------------------------------