├── .gitignore
├── LICENSE
├── README.md
├── code
│   ├── eda.py
│   ├── mapping.json
│   ├── mapping.py
│   ├── model_triplet.py
│   ├── predict_model_triplet.py
│   ├── search.py
│   ├── search_by_image.py
│   ├── search_by_keywords.py
│   └── tsne_visualization.py
├── output
│   ├── Figure_1.png
│   └── Figure_2.png
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 |
50 | # Translations
51 | *.mo
52 | *.pot
53 |
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 |
75 | # pyenv
76 | .python-version
77 |
78 | # celery beat schedule file
79 | celerybeat-schedule
80 |
81 | # SageMath parsed files
82 | *.sage.py
83 |
84 | # Environments
85 | .env
86 | .venv
87 | env/
88 | venv/
89 | ENV/
90 | env.bak/
91 | venv.bak/
92 |
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 |
97 | # Rope project settings
98 | .ropeproject
99 |
100 | # mkdocs documentation
101 | /site
102 |
103 | # mypy
104 | .mypy_cache/
105 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 CVxTz
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ### Building a Deep Image Search Engine using tf.Keras
2 |
3 | ### Motivation:
4 |
5 | Imagine having a collection of hundreds of thousands to millions of images
6 | without any metadata describing the content of each image. How can we build a
7 | system that is able to find the subset of those images that best answers a
8 | user's search query? What we need is a search engine that can rank image
9 | results by how well they correspond to the search query, which can be
10 | expressed either in natural language or by another query image. The way we
11 | will solve the problem in this post is by training a deep neural model that
12 | learns a fixed-length representation (or embedding) of any input image and
13 | text, and makes those representations close in Euclidean space whenever the
14 | text-image or image-image pairs are "similar".
15 |
16 | ### Dataset:
17 | 
18 | I could not find a search-result ranking dataset that is big enough, but I was
19 | able to get this dataset:
20 | [http://jmcauley.ucsd.edu/data/amazon/](http://jmcauley.ucsd.edu/data/amazon/)
21 | which links e-commerce item images to their title and description. We use this
22 | metadata as the supervision source to learn meaningful joint text-image
23 | representations. The experiments were limited to fashion (Clothing, Shoes and
24 | Jewelry) items and to 500,000 images in order to keep the computation and
25 | storage costs manageable.
26 |
27 | ### Problem setting:
28 | 
29 | The dataset links each image to a description written in natural language. We
30 | therefore define a task in which we want to learn a joint, fixed-length
31 | representation for images and text, such that each image representation is
32 | close to the representation of its description.
33 |
34 | 
35 |
36 | ### Model:
37 | 
38 | The model takes three inputs: the image (the anchor), the image
39 | title+description (the positive example), and some randomly sampled text (the
40 | negative example). We then define two sub-models:
41 | 
42 | * Image encoder: ResNet50 pre-trained on ImageNet + GlobalMaxPooling2D
43 | * Text encoder: GRU + GlobalMaxPooling1D
44 |
45 | The image sub-model produces the embedding for the anchor **E_a** and the text
46 | sub-model outputs the embedding for the positive title+description **E_p** and
47 | the embedding for the negative text **E_n**.
48 |
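For concreteness, here is a slightly simplified tf.Keras sketch of the two
encoders (dropout and layer names omitted, function names illustrative); the
full training code, including the weight sharing between the positive and
negative text branches, lives in `code/model_triplet.py`:

```python
import tensorflow.keras.backend as K
from tensorflow.keras import Model
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import (Input, Dense, Embedding, GRU, Bidirectional,
                                     GlobalMaxPool1D, GlobalMaxPool2D, Lambda)

vec_dim = 50  # dimensionality of the shared embedding space
l2_norm = Lambda(lambda t: K.l2_normalize(t, axis=-1))  # embeddings live on the unit sphere


def build_image_encoder():
    # ResNet50 backbone pre-trained on ImageNet, max-pooled to a fixed-length vector
    inp = Input(shape=(None, None, 3))
    x = ResNet50(weights="imagenet", include_top=False)(inp)
    x = GlobalMaxPool2D()(x)
    x = Dense(vec_dim, activation="linear")(x)
    return Model(inp, l2_norm(x))


def build_text_encoder(vocab_size):
    # token ids -> word embeddings -> bidirectional GRU, max-pooled to a fixed-length vector
    inp = Input(shape=(None,))
    x = Embedding(vocab_size, 50)(inp)
    x = Bidirectional(GRU(256, return_sequences=True))(x)
    x = GlobalMaxPool1D()(x)
    x = Dense(vec_dim, activation="linear")(x)
    return Model(inp, l2_norm(x))
```
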
49 | We then train by optimizing the following triplet loss:
50 |
51 | **L = max(d(E_a, E_p) - d(E_a, E_n) + alpha, 0)**
52 | 
53 | where d is the Euclidean distance and alpha is a hyperparameter, set to 0.4 in
54 | this experiment.
55 |
56 | Basically, this loss pushes **d(E_a, E_p)** to be small and **d(E_a, E_n)** to
57 | be large, so that each image embedding ends up close to the embedding of its
58 | description and far from the embedding of random text.
59 |
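In NumPy terms, for a single (anchor, positive, negative) triplet the loss is
simply the following; this is only a minimal sketch, the batched Keras
implementation in `code/model_triplet.py` operates on the three concatenated
embeddings and uses squared Euclidean distances, which play the same role:

```python
import numpy as np


def triplet_loss(e_a, e_p, e_n, alpha=0.4):
    """L = max(d(E_a, E_p) - d(E_a, E_n) + alpha, 0) for one triplet."""
    d_ap = np.linalg.norm(e_a - e_p)  # distance anchor <-> positive text
    d_an = np.linalg.norm(e_a - e_n)  # distance anchor <-> negative text
    return max(d_ap - d_an + alpha, 0.0)
```
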
60 | ### Visualization Results:
61 | 
62 | Once the image and text embedding models are trained, we can visualize the
63 | embeddings by projecting them into two dimensions using t-SNE
64 | ([https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)).
65 | 
66 |
67 | 
68 | Test images and their corresponding text descriptions are linked by green lines.
69 |
70 | We can see from the plot that, in general, images and their corresponding
71 | descriptions are close in the embedding space, which is what we would expect
72 | given the training loss that was used.
73 |
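A minimal sketch of how this 2D projection is computed, assuming
`image_embeddings` and `text_embeddings` are lists of the 50-dimensional
vectors produced by the encoders (the full plotting script is
`code/tsne_visualization.py`):

```python
from sklearn.manifold import TSNE


def project_to_2d(image_embeddings, text_embeddings):
    # Fit t-SNE on both sets jointly so images and texts share the same 2D space
    tsne = TSNE(n_components=2, init="random", random_state=0)
    points = tsne.fit_transform(image_embeddings + text_embeddings)
    n = len(image_embeddings)
    return points[:n], points[n:]  # 2D image points, 2D text points
```
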
74 | ### Text-image Search:
75 | 
76 | Here we use a few example text queries to search for the best matches in a set
77 | of 70,000 images. We compute the text embedding for the query and the image
78 | embedding for each image in the collection, then select the top 9 images that
79 | are closest to the query in the embedding space.
80 |
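A minimal sketch of this retrieval step, assuming `query_embedding` comes from
the text encoder and `image_embeddings`/`image_paths` describe the collection
(the full script is `code/search_by_keywords.py`):

```python
from sklearn.neighbors import NearestNeighbors


def search_by_text(query_embedding, image_embeddings, image_paths, k=9):
    # Index the image embeddings, then return the paths of the k closest images
    nn = NearestNeighbors(n_neighbors=k).fit(image_embeddings)
    _, indices = nn.kneighbors([query_embedding])
    return [image_paths[i] for i in indices[0]]
```
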
81 | 
82 |
83 | 
84 |
85 | These examples show that the embedding models are able to learn useful
86 | representations of images and of simple compositions of words.
87 |
88 | ### Image-Image Search:
89 | 
90 | Here we use an image as the query and search the database of 70,000 images for
91 | the examples that are most similar to it. The ranking is determined by how
92 | close each pair of images is in the embedding space, using the Euclidean
93 | distance.
94 |
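Compared to the text query case above, only the query vector changes; a plain
NumPy sketch of this ranking, with `query_embedding` now coming from the image
encoder (the full script is `code/search_by_image.py`):

```python
import numpy as np


def search_by_image(query_embedding, image_embeddings, image_paths, k=9):
    # Rank the whole collection by Euclidean distance to the query image embedding
    dists = np.linalg.norm(np.asarray(image_embeddings) - np.asarray(query_embedding), axis=1)
    best = np.argsort(dists)[:k]
    return [image_paths[i] for i in best]
```
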
95 | 
96 |
97 | 
98 |
99 | The results illustrate that the generated embeddings are high-level
100 | representations of images that capture the most important characteristics of
101 | the objects shown, without being excessively influenced by orientation,
102 | lighting or minor local details, and without being explicitly trained to do so.
103 |
104 | ### Conclusion:
105 | 
106 | In this project we built the machine learning blocks that allow us to create a
107 | keyword- and image-based search engine over a collection of images. The basic
108 | idea is to learn a meaningful joint embedding function for text and images,
109 | and then use the distance between items in the embedding space to rank search
110 | results.
111 |
112 | **References:**
113 |
114 | * [Large Scale Online Learning of Image Similarity Through
115 | Ranking](http://www.jmlr.org/papers/volume11/chechik10a/chechik10a.pdf)
116 | * [Ups and downs: Modeling the visual evolution of fashion trends with one-class
117 | collaborative filtering](https://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf)
118 | * [https://github.com/KinWaiCheuk/Triplet-net-keras/blob/master/Triplet%20NN%20Test%20on%20MNIST.ipynb](https://github.com/KinWaiCheuk/Triplet-net-keras/blob/master/Triplet%20NN%20Test%20on%20MNIST.ipynb)
--------------------------------------------------------------------------------
/code/eda.py:
--------------------------------------------------------------------------------
1 | # Compare distances between matched image/text embedding pairs and random pairs.
2 | from random import shuffle
3 | import pandas as pd
4 | import numpy as np
5 | from matplotlib import pyplot
6 |
7 |
8 | out_name = "../output/test_representations.json"
9 |
10 | data = pd.read_json(out_name)
11 |
12 | data = data.sample(n=10000)
13 |
14 | img_repr = data['image_repr'].tolist()
15 | img_repr_random = data['image_repr'].tolist()
16 | shuffle(img_repr_random)
17 | text_repr = data['text_repr'].tolist()
18 |
19 | target_distances = []
20 | random_distances = []
21 |
22 | for img, random_image, text in zip(img_repr, img_repr_random, text_repr):
23 | d_1 = np.linalg.norm(np.array(img)-np.array(text))
24 | d_2 = np.linalg.norm(np.array(random_image)-np.array(text))
25 |
26 | target_distances.append(d_1)
27 | random_distances.append(d_2)
28 |
29 |
30 | pyplot.hist(target_distances, bins=100, alpha=0.5, label='matched text')
31 | pyplot.hist(random_distances, bins=100, alpha=0.5, label='random text')
32 | pyplot.legend(loc='upper right')
33 | pyplot.show()
--------------------------------------------------------------------------------
/code/mapping.py:
--------------------------------------------------------------------------------
1 | import json
2 | from nltk.tokenize import word_tokenize
3 | from collections import defaultdict
4 | import re
5 |
6 |
7 | def tokenize(x):
8 |     x = re.sub(r"(['\".!?,/-])", r' \1 ', x)
9 |     x = re.sub(r'(\d+)', r' \1 ', x)
10 |
11 | x = word_tokenize(x.lower())
12 |
13 | return x
14 |
15 |
16 | UNK_TOKEN = "unk"
17 | BATCH_SIZE = 16
18 |
19 |
20 | def get_frequency_token_vocab(list_tokenized_sentences, vocab=None):
21 |     vocab = defaultdict(int) if vocab is None else vocab  # avoid a mutable default argument
22 |     for sentence in list_tokenized_sentences:
23 |         for token in sentence:
24 |             vocab[token] += 1
25 |     vocab[UNK_TOKEN] = 10000  # force the unknown token above any frequency cutoff
26 |     return vocab
27 |
28 |
29 | def get_mapping_dict(vocab, cutoff=10):
30 |     i = 1  # start indices at 1; index 0 is reserved for sequence padding
31 | word_freq = [(k, v) for k, v in vocab.items()]
32 | word_freq.sort(key = lambda x: -x[1])
33 | mapping = {}
34 | for token, freq in word_freq:
35 | if vocab[token] >= cutoff:
36 | mapping[token] = i
37 | i += 1
38 | return mapping
39 |
40 |
41 | train = json.load(open("../input/filtred_train_data.json", 'r'))
42 | val = json.load(open("../input/filtred_val_data.json", 'r'))
43 |
44 | list_images_train, captions_train = list(zip(*train))
45 |
46 | list_images_val, captions_val = list(zip(*val))
47 |
48 | list_tok = [tokenize(x) for x in captions_train+captions_val]
49 |
50 | vocab = get_frequency_token_vocab(list_tok)
51 | mapping = get_mapping_dict(vocab)
52 |
53 | json.dump(mapping, open('mapping.json', 'w'), indent=4)
--------------------------------------------------------------------------------
/code/model_triplet.py:
--------------------------------------------------------------------------------
1 | import json
2 | import re
3 | from random import choice, sample
4 |
5 | import cv2
6 | import numpy as np
7 | import tensorflow.keras.backend as K
8 | from nltk.tokenize import word_tokenize
9 | from tensorflow.keras.applications import ResNet50
10 | from tensorflow.keras.applications.imagenet_utils import preprocess_input
11 | from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
12 | from tensorflow.keras.layers import Input, GlobalMaxPool2D, GlobalMaxPool1D, Dense, Embedding, GRU, \
13 | Bidirectional, Concatenate, Lambda, SpatialDropout1D
14 | from tensorflow.keras.models import Model
15 | from tensorflow.keras.optimizers import Adam
16 |
17 | UNK_TOKEN = "unk"
18 | img_shape = (222, 171, 3)
19 | vec_dim = 50
20 | BATCH_SIZE = 32
21 |
22 |
23 | def tokenize(x):
24 |     x = re.sub(r"(['\".!?,/-])", r' \1 ', x)
25 |     x = re.sub(r'(\d+)', r' \1 ', x)
26 |
27 | x = word_tokenize(x.lower())
28 |
29 | return x
30 |
31 |
32 | def triplet_loss(y_true, y_pred, alpha=0.4):
33 | """
34 | https://github.com/KinWaiCheuk/Triplet-net-keras/blob/master/Triplet%20NN%20Test%20on%20MNIST.ipynb
35 | Implementation of the triplet loss function
36 | Arguments:
37 | y_true -- true labels, required when you define a loss in Keras, you don't need it in this function.
38 | y_pred -- python list containing three objects:
39 | anchor -- the encodings for the anchor data
40 | positive -- the encodings for the positive data (similar to anchor)
41 | negative -- the encodings for the negative data (different from anchor)
42 | Returns:
43 | loss -- real number, value of the loss
44 | """
45 |
46 |     total_length = y_pred.shape.as_list()[-1]
47 |     anchor = y_pred[:, 0:int(total_length * 1 / 3)]
48 |     positive = y_pred[:, int(total_length * 1 / 3):int(total_length * 2 / 3)]
49 |     negative = y_pred[:, int(total_length * 2 / 3):int(total_length * 3 / 3)]
50 |
51 | # distance between the anchor and the positive
52 | pos_dist = K.sum(K.square(anchor - positive), axis=1)
53 |
54 | # distance between the anchor and the negative
55 | neg_dist = K.sum(K.square(anchor - negative), axis=1)
56 |
57 | # compute loss
58 | basic_loss = pos_dist - neg_dist + alpha
59 | loss = K.maximum(basic_loss, 0.0)
60 |
61 | return loss
62 |
63 |
64 | def map_sentence(tokenized_sentence, mapping):
65 | out_sentence = list(map(lambda x: mapping[x if x in mapping else UNK_TOKEN], tokenized_sentence))
66 | return out_sentence
67 |
68 |
69 | def map_sentences(list_tokenized_sentences, mapping):
70 | mapped = []
71 | for sentence in list_tokenized_sentences:
72 | out_sentence = map_sentence(sentence, mapping)
73 | mapped.append(out_sentence)
74 |
75 | return mapped
76 |
77 |
78 | def cap_sequence(seq, max_len, append):
79 | if len(seq) < max_len:
80 | if np.random.uniform(0, 1) < 0.5:
81 | return seq + [append] * (max_len - len(seq))
82 | else:
83 | return [append] * (max_len - len(seq)) + seq
84 | else:
85 | if np.random.uniform(0, 1) < 0.5:
86 |
87 | return seq[:max_len]
88 | else:
89 | return seq[-max_len:]
90 |
91 |
92 | def cap_sequences(list_sequences, max_len, append):
93 | capped = []
94 | for seq in list_sequences:
95 | out_seq = cap_sequence(seq, max_len, append)
96 | capped.append(out_seq)
97 |
98 | return capped
99 |
100 |
101 | def read_img(path, preprocess=True):
102 | img = cv2.imread(path)
103 |     if img is None or img.size < 10:
104 |         img = np.zeros(img_shape, dtype=np.uint8)  # blank 3-channel fallback for unreadable images
105 | img = cv2.resize(img, (171, 222))
106 | if preprocess:
107 | img = preprocess_input(img)
108 | return img
109 |
110 |
111 | def gen(list_images, list_captions, batch_size=16, aug=False):
112 | indexes = list(range(len(list_images)))
113 | while True:
114 | batch_indexes = sample(indexes, batch_size)
115 |
116 | candidate_images = [list_images[i] for i in batch_indexes]
117 | captions_p = [list_captions[i] for i in batch_indexes]
118 |
119 | captions_n = [choice(list_captions) for _ in batch_indexes]
120 |
121 | X1 = np.array([read_img(x) for x in candidate_images])
122 | if aug and np.random.uniform(0, 1)<0.5:
123 | X1 = X1[:, :, ::-1, :]
124 | X2 = np.array(captions_p)
125 | X3 = np.array(captions_n)
126 |
127 | yield [X1, X2, X3], np.zeros((batch_size, 3 * vec_dim))
128 |
129 |
130 | def model(vocab_size, lr=0.0001):
131 | input_1 = Input(shape=(None, None, 3))
132 | input_2 = Input(shape=(None,))
133 | input_3 = Input(shape=(None,))
134 |
135 | base_model = ResNet50(weights='imagenet', include_top=False)
136 |
137 | x1 = base_model(input_1)
138 | x1 = GlobalMaxPool2D()(x1)
139 |
140 | dense_1 = Dense(vec_dim, activation="linear", name="dense_image_1")
141 |
142 | x1 = dense_1(x1)
143 |
144 | embed = Embedding(vocab_size, 50, name="embed")
145 |
146 | gru = Bidirectional(GRU(256, return_sequences=True), name="gru_1")
147 | dense_2 = Dense(vec_dim, activation="linear", name="dense_text_1")
148 |
149 | x2 = embed(input_2)
150 | x2 = SpatialDropout1D(0.1)(x2)
151 | x2 = gru(x2)
152 | x2 = GlobalMaxPool1D()(x2)
153 | x2 = dense_2(x2)
154 |
155 | x3 = embed(input_3)
156 | x3 = SpatialDropout1D(0.1)(x3)
157 | x3 = gru(x3)
158 | x3 = GlobalMaxPool1D()(x3)
159 | x3 = dense_2(x3)
160 |
161 | _norm = Lambda(lambda x: K.l2_normalize(x, axis=-1))
162 |
163 | x1 = _norm(x1)
164 | x2 = _norm(x2)
165 | x3 = _norm(x3)
166 |
167 | x = Concatenate(axis=-1)([x1, x2, x3])
168 |
169 | model = Model([input_1, input_2, input_3], x)
170 |
171 | model.compile(loss=triplet_loss, optimizer=Adam(lr))
172 |
173 | model.summary()
174 |
175 | return model
176 |
177 |
178 | def image_model(lr=0.0001):
179 | input_1 = Input(shape=(None, None, 3))
180 |
181 | base_model = ResNet50(weights='imagenet', include_top=False)
182 |
183 | x1 = base_model(input_1)
184 | x1 = GlobalMaxPool2D()(x1)
185 |
186 | dense_1 = Dense(vec_dim, activation="linear", name="dense_image_1")
187 |
188 | x1 = dense_1(x1)
189 |
190 | _norm = Lambda(lambda x: K.l2_normalize(x, axis=-1))
191 |
192 | x1 = _norm(x1)
193 |
194 | model = Model([input_1], x1)
195 |
196 | model.compile(loss="mae", optimizer=Adam(lr))
197 |
198 | model.summary()
199 |
200 | return model
201 |
202 |
203 | def text_model(vocab_size, lr=0.0001):
204 | input_2 = Input(shape=(None,))
205 |
206 | embed = Embedding(vocab_size, 50, name="embed")
207 | gru = Bidirectional(GRU(256, return_sequences=True), name="gru_1")
208 | dense_2 = Dense(vec_dim, activation="linear", name="dense_text_1")
209 |
210 | x2 = embed(input_2)
211 | x2 = gru(x2)
212 | x2 = GlobalMaxPool1D()(x2)
213 | x2 = dense_2(x2)
214 |
215 | _norm = Lambda(lambda x: K.l2_normalize(x, axis=-1))
216 |
217 | x2 = _norm(x2)
218 |
219 | model = Model([input_2], x2)
220 |
221 | model.compile(loss="mae", optimizer=Adam(lr))
222 |
223 | model.summary()
224 |
225 | return model
226 |
227 |
228 | if __name__ == "__main__":
229 |
230 | mapping = json.load(open('mapping.json', 'r'))
231 |
232 | train = json.load(open("../input/filtred_train_data.json", 'r'))
233 | val = json.load(open("../input/filtred_val_data.json", 'r'))
234 |
235 | list_images_train, captions_train = list(zip(*train))
236 | captions_train = [tokenize(x) for x in captions_train]
237 | captions_train = map_sentences(captions_train, mapping)
238 | captions_train = cap_sequences(captions_train, 70, 0)
239 |
240 | list_images_val, captions_val = list(zip(*val))
241 | captions_val = [tokenize(x) for x in captions_val]
242 |
243 | captions_val = map_sentences(captions_val, mapping)
244 | captions_val = cap_sequences(captions_val, 70, 0)
245 |
246 | file_path = "model_triplet.h5"
247 |
248 | model = model(vocab_size=len(mapping) + 1, lr=0.00001)
249 |
250 | try:
251 | model.load_weights(file_path, by_name=True)
252 |     except Exception:  # no compatible checkpoint found, start training from scratch
253 |         pass
254 |
255 | checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
256 | reduce = ReduceLROnPlateau(monitor="val_loss", mode='min', patience=10, min_lr=1e-7)
257 |
258 | model.fit_generator(gen(list_images_train, captions_train, batch_size=BATCH_SIZE, aug=True),
259 | use_multiprocessing=True,
260 | validation_data=gen(list_images_val, captions_val, batch_size=BATCH_SIZE), epochs=10000,
261 | verbose=1, workers=4, steps_per_epoch=300, validation_steps=100, callbacks=[checkpoint, reduce])
262 | model.save_weights(file_path)
263 |
--------------------------------------------------------------------------------
/code/predict_model_triplet.py:
--------------------------------------------------------------------------------
1 | import json
2 |
3 | import numpy as np
4 | import pandas as pd
5 |
6 | from model_triplet import tokenize, map_sentences, cap_sequences, read_img, image_model, text_model
7 | from tqdm import tqdm
8 |
9 | def chunker(seq, size):
10 | return (seq[pos:pos + size] for pos in range(0, len(seq), size))
11 |
12 |
13 | if __name__ == "__main__":
14 |
15 | out_name = "../output/test_representations.json"
16 |
17 | mapping = json.load(open('mapping.json', 'r'))
18 |
19 | test = json.load(open("../input/filtred_test_data.json", 'r'))
20 |
21 |
22 | file_path = "model_triplet.h5"
23 |
24 | t_model = text_model(vocab_size=len(mapping) + 1)
25 | i_model = image_model()
26 |
27 | t_model.load_weights(file_path, by_name=True)
28 | i_model.load_weights(file_path, by_name=True)
29 |
30 | list_images_test, _captions_test = list(zip(*test))
31 | captions_test = [tokenize(x) for x in _captions_test]
32 | captions_test = map_sentences(captions_test, mapping)
33 | captions_test = cap_sequences(captions_test, 70, 0)
34 |
35 | target_image_encoding = []
36 |
37 |     for img_paths in tqdm(chunker(list_images_test, 128), total=len(list_images_test) // 128):
38 |         images = np.array([read_img(img_path) for img_path in img_paths])  # avoid shadowing file_path
39 | e = i_model.predict(images)
40 | target_image_encoding += e.tolist()
41 |
42 | target_text_encoding = t_model.predict(np.array(captions_test), verbose=1, batch_size=128)
43 |
44 | target_text_encoding = target_text_encoding.tolist()
45 |
46 | df = pd.DataFrame({"images": list_images_test, "text": _captions_test, "image_repr": target_image_encoding,
47 | "text_repr": target_text_encoding})
48 |
49 | df.to_json(out_name, orient='records')
50 |
51 | data = json.load(open(out_name, 'r'))
52 | json.dump(data, open(out_name, 'w'), indent=4)
53 |
54 | # New queries
55 |
56 | out_name = "../output/queries_representations.json"
57 |
58 | _captions_test = ['blue tshirt', 'blue shirt', 'red dress', 'halloween outfit', 'baggy pants', 'gold ring',
59 | 'Black trousers', 'animal pendant', 'black white tshirt', 'women red hoodie', 'animal outfit',
60 | 'men flipflops']
61 |
62 | captions_test = [tokenize(x) for x in _captions_test]
63 | captions_test = map_sentences(captions_test, mapping)
64 | captions_test = cap_sequences(captions_test, 70, 0)
65 |
66 | target_text_encoding = t_model.predict(np.array(captions_test), verbose=2, batch_size=128)
67 |
68 | target_text_encoding = target_text_encoding.tolist()
69 |
70 | df = pd.DataFrame({"text": _captions_test,
71 | "text_repr": target_text_encoding})
72 |
73 | df.to_json(out_name, orient='records')
74 |
75 | data = json.load(open(out_name, 'r'))
76 | json.dump(data, open(out_name, 'w'), indent=4)
77 |
78 |
79 |
80 |
--------------------------------------------------------------------------------
/code/search.py:
--------------------------------------------------------------------------------
1 | # Retrieval sanity check: for each image embedding, rank all text embeddings by
2 | # distance and report the average rank of the matching description
3 | # (a perfect model would give an average rank of 1).
4 | import pandas as pd
5 | import numpy as np
6 | from sklearn.neighbors import NearestNeighbors
7 |
8 | repr_json = "../output/test_representations.json"
9 |
10 | data = pd.read_json(repr_json)
11 |
12 | data = data.sample(n=1000)
13 |
14 | img_repr = data['image_repr'].tolist()
15 | text_repr = data['text_repr'].tolist()
16 |
17 | nn = NearestNeighbors(n_jobs=-1, n_neighbors=1000)
18 |
19 | nn.fit(text_repr)
20 |
21 | preds = nn.kneighbors(img_repr, return_distance=False).tolist()
22 | ranks = []
23 |
24 | for i, x in enumerate(preds):
25 | rank = x.index(i)+1
26 | ranks.append(rank)
27 |
28 | print("Average rank :", np.mean(ranks))
--------------------------------------------------------------------------------
/code/search_by_image.py:
--------------------------------------------------------------------------------
1 | # Image-to-image search:
2 | # use each of the first 100 image embeddings as a query, find its 8 nearest
3 | # neighbors in the embedding space, and save a 3x3 grid (query + results).
4 | # The output directory ../output/images/ must exist before running.
5 | import pandas as pd
6 | import matplotlib.pyplot as plt
7 | from sklearn.neighbors import NearestNeighbors
8 | from uuid import uuid4
9 | 
10 | from model_triplet import read_img
11 | 
12 | 
13 | repr_json = "../output/test_representations.json"
14 |
15 | data = pd.read_json(repr_json)
16 |
17 | #data = data.sample(n=50000)
18 |
19 | img_repr = data['image_repr'].tolist()
20 | img_paths = data['images'].tolist()
21 | text_repr = data['text_repr'].tolist()
22 |
23 | nn = NearestNeighbors(n_jobs=-1, n_neighbors=9)
24 |
25 | nn.fit(img_repr)
26 |
27 | preds = nn.kneighbors(img_repr[:100], return_distance=False).tolist()
28 |
29 | most_similar_images = []
30 | query_image = []
31 |
32 |
33 | for i, x in enumerate(preds):
34 | preds_paths = [img_paths[i] for i in x]
35 | query_image.append(preds_paths[0])
36 | most_similar_images.append(preds_paths[1:])
37 |
38 | for q, similar in zip(query_image, most_similar_images):
39 | fig, axes = plt.subplots(3, 3)
40 | all_images = [q]+similar
41 |
42 | for idx, img_path in enumerate(all_images):
43 | i = idx % 3 # Get subplot row
44 | j = idx // 3 # Get subplot column
45 | image = read_img(img_path, preprocess=False)
46 | image = image[:, :, ::-1]
47 | axes[i, j].imshow(image/255)
48 | axes[i, j].axis('off')
49 | axes[i, j].axis('off')
50 | if idx == 0:
51 | axes[i, j].set_title('Query Image')
52 | else:
53 | axes[i, j].set_title('Result Image %s'%idx)
54 |
55 | plt.subplots_adjust(wspace=0.2, hspace=0.2)
56 | plt.savefig('../output/images/%s.png'%uuid4().hex)
57 |
58 |
--------------------------------------------------------------------------------
/code/search_by_keywords.py:
--------------------------------------------------------------------------------
1 | # Text-to-image search:
2 | # load the precomputed query text embeddings (written by predict_model_triplet.py),
3 | # find the 9 image embeddings closest to each query, and save a 3x3 grid of the
4 | # results per query. The output directory ../output/queries/ must exist.
5 | import pandas as pd
6 | import matplotlib.pyplot as plt
7 | from sklearn.neighbors import NearestNeighbors
8 | from uuid import uuid4
9 | 
10 | from model_triplet import read_img
11 | 
12 | 
13 | 
14 | repr_json = "../output/test_representations.json"
15 |
16 | data = pd.read_json(repr_json)
17 |
18 | queries_repr_json = "../output/queries_representations.json"
19 |
20 | queries_data = pd.read_json(queries_repr_json)
21 |
22 | #data = data.sample(n=50000)
23 |
24 | img_repr = data['image_repr'].tolist()
25 | img_paths = data['images'].tolist()
26 | text_repr = queries_data['text_repr'].tolist()
27 |
28 | nn = NearestNeighbors(n_jobs=-1, n_neighbors=9)
29 |
30 | nn.fit(img_repr)
31 |
32 | preds = nn.kneighbors(text_repr, return_distance=False).tolist()
33 |
34 | most_similar_images = []
35 | query_image = []
36 |
37 |
38 | for i, x in enumerate(preds):
39 | preds_paths = [img_paths[i] for i in x]
40 | most_similar_images.append(preds_paths)
41 |
42 | for q, all_images in zip(queries_data['text'], most_similar_images):
43 | fig, axes = plt.subplots(3, 3)
44 |
45 | for idx, img_path in enumerate(all_images):
46 | i = idx % 3 # Get subplot row
47 | j = idx // 3 # Get subplot column
48 | image = read_img(img_path, preprocess=False)
49 | image = image[:, :, ::-1]
50 | axes[i, j].imshow(image/255)
51 | axes[i, j].axis('off')
52 | axes[i, j].axis('off')
53 | axes[i, j].set_title('Result Image %s'%(idx+1))
54 |
55 |
56 | plt.subplots_adjust(wspace=0.2, hspace=0.2)
57 | fig.suptitle('Query : %s'%q)
58 |
59 | plt.savefig('../output/queries/%s.png'%uuid4().hex)
60 |
61 |
--------------------------------------------------------------------------------
/code/tsne_visualization.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | from matplotlib import pyplot as plt
4 | from sklearn import manifold
5 |
6 | out_name = "../output/test_representations.json"
7 |
8 | data = pd.read_json(out_name)
9 |
10 | n = 10000
11 |
12 | data = data.sample(n=n)
13 |
14 | img_repr = data['image_repr'].tolist()
15 | text_repr = data['text_repr'].tolist()
16 |
17 | tsne = manifold.TSNE(n_components=2, init='random',
18 | random_state=0)
19 |
20 | X = tsne.fit_transform(img_repr + text_repr).tolist()
21 |
22 | img_repr_2d = X[:n]
23 | text_repr_2d = X[n:]
24 |
25 | distances = []
26 | for a, b in zip(img_repr_2d, text_repr_2d):
27 | distances.append(np.fabs(a[0] - b[0]) + np.fabs(a[1] - b[1]))
28 |
29 | quantile = np.quantile(distances, q=0.9)
30 |
31 | fig = plt.figure()
32 | plt.scatter([a[0] for a in img_repr_2d], [a[1] for a in img_repr_2d], c="r", s=4, alpha=0.5, label="Images")
33 | plt.scatter([a[0] for a in text_repr_2d], [a[1] for a in text_repr_2d], c="b", s=4, alpha=0.5, label="Texts")
34 | for a, b in zip(img_repr_2d, text_repr_2d):
35 | if np.fabs(a[0] - b[0]) + np.fabs(a[1] - b[1]) < quantile:
36 | plt.plot([a[0], b[0]], [a[1], b[1]], c="g", lw=0.2)
37 | plt.legend()
38 | plt.title('TSNE Visualization')
39 | plt.show()
40 |
--------------------------------------------------------------------------------
/output/Figure_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CVxTz/image_search_engine/30314e92a9e53bced79e1c24da5f8d0210521dae/output/Figure_1.png
--------------------------------------------------------------------------------
/output/Figure_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/CVxTz/image_search_engine/30314e92a9e53bced79e1c24da5f8d0210521dae/output/Figure_2.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pandas==0.25.2
2 | matplotlib==3.1.1
3 | numpy==1.17.3
4 | tensorflow_gpu==2.0.1
5 | nltk==3.4.5
6 | tqdm==4.36.1
7 | opencv_python==4.1.1.26
8 | Pillow==6.2.1
9 | scikit_learn==0.21.3
10 | tensorflow==2.0.1
11 |
--------------------------------------------------------------------------------