├── .gitignore ├── LICENSE ├── Makefile ├── README.md ├── build_data.py ├── configs └── configs.json ├── docs └── illustration │ ├── clusters.png │ ├── cost_function.png │ ├── data_distribution.png │ ├── dimension_reduction.png │ └── feature_extractor.png ├── extract_features.py ├── find_k.py ├── models └── .gitkeep ├── notebooks └── .gitkeep ├── references └── .gitkeep ├── reports ├── .gitkeep └── figures │ └── .gitkeep ├── requirements.txt ├── results └── .gitkeep ├── run_k_mean.py ├── src ├── __init__.py ├── data │ ├── .gitkeep │ ├── __init__.py │ └── make_dataset.py ├── features │ ├── .gitkeep │ ├── __init__.py │ └── build_features.py ├── models │ ├── .gitkeep │ ├── __init__.py │ ├── feature_extractor.py │ └── k_mean.py ├── utils │ ├── analyze_label.py │ └── config.py └── visualization │ ├── .gitkeep │ ├── __init__.py │ └── visualize.py └── visualization └── .gitkeep /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | **/__pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | *.egg-info/ 23 | .installed.cfg 24 | *.egg 25 | 26 | # PyInstaller 27 | # Usually these files are written by a python script from a template 28 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 29 | *.manifest 30 | *.spec 31 | 32 | # Installer logs 33 | pip-log.txt 34 | pip-delete-this-directory.txt 35 | 36 | # Unit test / coverage reports 37 | htmlcov/ 38 | .tox/ 39 | .coverage 40 | .coverage.* 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | *.cover 45 | 46 | # Translations 47 | *.mo 48 | *.pot 49 | 50 | # Django stuff: 51 | *.log 52 | 53 | # Sphinx documentation 54 | docs/_build/ 55 | 56 | # PyBuilder 57 | target/ 58 | 59 | # DotEnv configuration 60 | .env 61 | 62 | # Database 63 | *.db 64 | *.rdb 65 | 66 | # Pycharm 67 | .idea/ 68 | 69 | # VS Code 70 | .vscode/ 71 | 72 | # Spyder 73 | .spyproject/ 74 | 75 | # Jupyter NB Checkpoints 76 | .ipynb_checkpoints/ 77 | 78 | # exclude data from source control by default 79 | /data/ 80 | 81 | # Mac OS-specific storage files 82 | .DS_Store 83 | 84 | # vim 85 | *.swp 86 | *.swo 87 | 88 | # Mypy cache 89 | .mypy_cache/ 90 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Tri Ngo 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | .PHONY: clean build run 2 | 3 | help: 4 | @echo "Please use \`make <target>' where <target> is one of" 5 | @echo " clean to remove generated data, models and results" 6 | @echo " build to download the feature extractor model" 7 | @echo " run to cluster images with the default config" 8 | 9 | clean: 10 | rm -rf data/processed/* 11 | rm -rf data/external/* 12 | rm -rf data/interim/* 13 | rm -rf data/raw/* 14 | rm -rf models/* 15 | rm -rf results/* 16 | rm -rf visualization/* 17 | 18 | build: 19 | python3 build_data.py 20 | 21 | run: 22 | python3 run_k_mean.py 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Image clustering 2 | 3 | Clustering is an interesting field of unsupervised machine learning in which a dataset is divided into groups of similar items. I use image clustering when I have thousands of images and want a way to group (or categorize) them into subsets that share similar characteristics. 4 | 5 | This work performs image clustering using k-means, an Inception feature extractor and dimension reduction. 6 | 7 | 8 | ## Installation 9 | 10 | ### Clone the repo 11 | 12 | ```bash 13 | git clone https://github.com/tringn/image_clustering.git 14 | ``` 15 | 16 | 17 | ### Install required packages 18 | 19 | This project uses `python3.7` and `pip3`. 20 | 21 | ```bash 22 | pip3 install -r requirements.txt 23 | ``` 24 | 25 | ## Process 26 | 27 | ### Step 1: Feature extraction 28 | 29 | I use the feature extractor of the InceptionV3 model, which was trained on the ImageNet dataset (14,197,122 images: http://www.image-net.org/), so that it can extract characteristic features for a wide variety of objects. 30 | 31 | ![feature_extractor](docs/illustration/feature_extractor.png) 32 | 33 | You need to download the extractor model first: 34 | 35 | ```bash 36 | python build_data.py 37 | ``` 38 | 39 | I used images crawled from the champselysees_paris Instagram account ([link](https://www.instagram.com/champselysees_paris/)). 40 | 41 | You can crawl the champselysees_paris images or use your own. 42 | 43 | Move all images into one folder: `data/raw/example`. 44 | 45 | Then extract features by running: 46 | ```bash 47 | python extract_features.py 48 | ``` 49 | 50 | ### Step 2: Dimension reduction 51 | 52 | The 2048-dimensional feature vectors make it hard for the clustering process to converge to a good optimum, so I apply dimension reduction to shrink each image's vector down to 2 dimensions. 53 | 54 | I divide the dimension reduction into 2 steps: PCA from 2048 to 50 dimensions, then t-SNE from 50 to 2. The reason is the scikit-learn t-SNE recommendation: "It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high."
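For reference, here is a minimal sketch of this two-step reduction, mirroring what `reduce_dim_combine` in `src/models/k_mean.py` does (standardize, PCA to 50 components, then t-SNE down to 2). It assumes the 2048-dimensional InceptionV3 feature vectors are already stacked into a NumPy array; the function name `reduce_to_2d` and the `vectors` argument are illustrative only, not part of the repository.

```python
# Sketch of the two-step reduction (see reduce_dim_combine in src/models/k_mean.py).
# Assumes `vectors` is an (N x 2048) array of InceptionV3 "pool_3" features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def reduce_to_2d(vectors: np.ndarray, intermediate_dim: int = 50, final_dim: int = 2) -> np.ndarray:
    # Standardize each feature to zero mean and unit variance before PCA.
    standardized = StandardScaler().fit_transform(vectors)
    # Step 1: PCA from 2048 down to a "reasonable" number of dimensions (50).
    compact = PCA(n_components=intermediate_dim).fit_transform(standardized)
    # Step 2: t-SNE from 50 down to 2 for plotting and clustering.
    return TSNE(n_components=final_dim, random_state=0).fit_transform(compact)
```

Running PCA first keeps t-SNE fast and stable, since t-SNE's pairwise-distance computations scale poorly with very high-dimensional input.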
55 | 56 | ![dimension_reduction](/docs/illustration/dimension_reduction.png) 57 | 58 | After reducing the dimension to 2, the distribution of the images is plotted as follows: 59 | 60 | ![data_distribution](docs/illustration/data_distribution.png) 61 | 62 | **The script for this step is combined with step 3.1** 63 | 64 | ### Step 3: Clustering using K-Means 65 | 66 | The question now is how to choose the number of clusters to assign the images to. I have no predefined notion of how many subsets the images belong to, because images crawled from the Internet are chaotically distributed. 67 | 68 | I apply the elbow curve method to determine the best number of clusters, as described below: 69 | 70 | #### Step 3.1: Find the best K clusters 71 | 72 | I iterate k from 1 to 100. From k = 1 to k = 40, the squared error decreases significantly. From k = 40 to k = 100, the squared error falls only slightly compared with the earlier drops. So I choose k = 40 as the best number of clusters. 73 | 74 | ![cost_function](docs/illustration/cost_function.png) 75 | 76 | Run the following script to reduce the feature dimension and find the best value of k: 77 | 78 | ```bash 79 | python find_k.py 80 | ``` 81 | 82 | Go to `visualization/example/cost_2D.png` to find the best value of K (the elbow point, where the cost starts decreasing much more slowly than before). 83 | 84 | #### Step 3.2: Cluster images into k subsets 85 | 86 | I use the k-means algorithm to assign all images in the collection to 40 subsets (k = 40 was selected in the step above). 87 | 88 | Set the value of `k` in `configs/configs.json` and run: 89 | 90 | ```bash 91 | python run_k_mean.py 92 | ``` 93 | 94 | Images will be clustered and symlinked into `results/link/example`. 95 | 96 | The output for **example** is shown below: 97 | 98 | ![clusters](docs/illustration/clusters.png) -------------------------------------------------------------------------------- /build_data.py: -------------------------------------------------------------------------------- 1 | from src.models.feature_extractor import maybe_download_and_extract 2 | from src.utils.config import get_config_from_json 3 | 4 | 5 | if __name__ == "__main__": 6 | config, _ = get_config_from_json("configs/configs.json") 7 | maybe_download_and_extract(config.paths.model_dir, config.paths.data_url) 8 | 9 | 10 | -------------------------------------------------------------------------------- /configs/configs.json: -------------------------------------------------------------------------------- 1 | { 2 | "paths": { 3 | "image_dir": "data/raw", 4 | "model_dir": "models", 5 | "vector_dir": "results/vectors", 6 | "cluster_label_dir": "results/cluster_label/", 7 | "plot_dir": "visualization/", 8 | "link_dir": "results/link", 9 | "data_url": "http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz" 10 | }, 11 | "model": { 12 | "object_name": "example", 13 | "mode": "analysis", 14 | "reduced_dimension": 2, 15 | "k": 40 16 | } 17 | } -------------------------------------------------------------------------------- /docs/illustration/clusters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/docs/illustration/clusters.png -------------------------------------------------------------------------------- /docs/illustration/cost_function.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/docs/illustration/cost_function.png -------------------------------------------------------------------------------- /docs/illustration/data_distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/docs/illustration/data_distribution.png -------------------------------------------------------------------------------- /docs/illustration/dimension_reduction.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/docs/illustration/dimension_reduction.png -------------------------------------------------------------------------------- /docs/illustration/feature_extractor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/docs/illustration/feature_extractor.png -------------------------------------------------------------------------------- /extract_features.py: -------------------------------------------------------------------------------- 1 | import os 2 | from src.utils.config import get_config_from_json 3 | from src.models.feature_extractor import run_inference_on_images_feature 4 | 5 | IMAGE_EXTENSION = ('jpg', 'jpeg', 'bmp', 'png') 6 | 7 | 8 | def extract_feature(img_dir, model_dir, output_dir): 9 | """ 10 | Extract image features of all images in img_dir and save feature vectors to output_dir 11 | :param img_dir: (string) directory containing images to extract feature 12 | :param model_dir: (string) directory containing extractor model 13 | :param output_dir: (string) directory to save feature vector file 14 | :return: 15 | """ 16 | # Get list of image paths 17 | img_list = [os.path.join(img_dir, img_file) for img_file in os.listdir(img_dir) if img_file.endswith(IMAGE_EXTENSION)] 18 | 19 | # Run getting feature vectors for each image 20 | run_inference_on_images_feature(img_list, model_dir, output_dir) 21 | 22 | 23 | if __name__ == "__main__": 24 | # Get config 25 | config, _ = get_config_from_json("configs/configs.json") 26 | 27 | object_name = config.model.object_name 28 | img_dir = os.path.join(config.paths.image_dir, object_name) 29 | vector_dir = os.path.join(config.paths.vector_dir, object_name) 30 | 31 | # Extract feature for all images in image directory 32 | extract_feature(img_dir, config.paths.model_dir, vector_dir) 33 | -------------------------------------------------------------------------------- /find_k.py: -------------------------------------------------------------------------------- 1 | import os 2 | from src.utils.config import get_config_from_json 3 | from src.models.k_mean import read_vector, reduce_dim_combine, plot_2d, plot_3d, find_best_k 4 | 5 | IMAGE_EXTENSION = ('jpg', 'jpeg', 'bmp', 'png') 6 | 7 | 8 | def find_k(vector_array, save_plot_dir, dim=2): 9 | """ 10 | Find the best number of cluster by looking at the cost plot 11 | :param vector_array: (array) (N x D) array of feature vectors 12 | :param save_plot_dir: (string) directory to save plots 13 | :param dim: (int) desired dimension after reduction 14 | :return: 15 | """ 16 | 17 | os.makedirs(save_plot_dir, exist_ok=True) 18 | 19 | if vector_array.shape[0] >= 250: 20 | # Plot data distribution after reducing 
dimension 21 | if dim == 2: 22 | plot_2d(vector_array, save_plot_dir) 23 | elif dim == 3: 24 | plot_3d(vector_array, save_plot_dir) 25 | else: 26 | raise ValueError("Not support dimension") 27 | 28 | # Plot cost chart to find best value of k 29 | find_best_k(vector_array, save_plot_dir) 30 | 31 | else: 32 | raise ValueError("If number of image is smaller than 250, it is recommended to use hierarchical cluster.") 33 | 34 | 35 | if __name__ == "__main__": 36 | # Get config 37 | config, _ = get_config_from_json("configs/configs.json") 38 | 39 | object_name = config.model.object_name 40 | 41 | dim = config.model.reduced_dimension 42 | 43 | vector_dir = os.path.join(config.paths.vector_dir, object_name) 44 | save_plot_dir = os.path.join(config.paths.plot_dir, object_name) 45 | 46 | # Read feature vector from vector dir 47 | vector_array, vector_files = read_vector(vector_dir) 48 | 49 | # Apply dimensional reducing approach 50 | vector_array = reduce_dim_combine(vector_array, dim=dim) 51 | 52 | # Find best K 53 | find_k(vector_array, save_plot_dir, dim=dim) 54 | -------------------------------------------------------------------------------- /models/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/models/.gitkeep -------------------------------------------------------------------------------- /notebooks/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/notebooks/.gitkeep -------------------------------------------------------------------------------- /references/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/references/.gitkeep -------------------------------------------------------------------------------- /reports/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/reports/.gitkeep -------------------------------------------------------------------------------- /reports/figures/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/reports/figures/.gitkeep -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.16.1 2 | tensorflow==1.13.1 3 | psutil==5.6.2 4 | dotmap==1.3.8 5 | regex==2018.1.10 6 | sklearn==0.0 7 | pandas==0.24.2 8 | matplotlib==3.0.2 9 | six==1.12.0 10 | pandas==0.24.2 11 | matplotlib==3.0.2 -------------------------------------------------------------------------------- /results/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/results/.gitkeep -------------------------------------------------------------------------------- /run_k_mean.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | 4 | from src.models.k_mean import read_vector, reduce_dim_combine, 
k_mean 5 | from src.utils.analyze_label import symlink_cluster 6 | from src.utils.config import get_config_from_json 7 | 8 | IMAGE_EXTENSION = ('jpg', 'jpeg', 'bmp', 'png') 9 | 10 | if __name__ == "__main__": 11 | 12 | # Get config 13 | config, _ = get_config_from_json("configs/configs.json") 14 | 15 | object_name = config.model.object_name 16 | 17 | dim = config.model.reduced_dimension 18 | 19 | img_dir = os.path.join(config.paths.image_dir, object_name) 20 | vector_dir = os.path.join(config.paths.vector_dir, object_name) 21 | save_plot_dir = os.path.join(config.paths.plot_dir, object_name) 22 | cluster_label_path = os.path.join(config.paths.cluster_label_dir, object_name + ".json") 23 | 24 | if not os.path.isdir(vector_dir): 25 | raise Exception("Please run feature extraction for all images first") 26 | 27 | # Read feature vector from vector dir 28 | vector_array, vector_files = read_vector(vector_dir) 29 | 30 | if len(vector_files) == 0: 31 | raise Exception("Please run feature extraction for all images first") 32 | 33 | # Apply dimensional reducing approach 34 | vector_array = reduce_dim_combine(vector_array, dim=dim) 35 | 36 | labels = k_mean(vector_array, config.model.k).tolist() 37 | 38 | assert len(labels) == len(vector_files), "Not equal length" 39 | 40 | label_dict = [{"img_file": vector_files[i].replace(".npz", ""), "label": str(labels[i]), "prob": "1.0"} for i in 41 | range(len(labels))] 42 | 43 | # Save to disk 44 | os.makedirs(os.path.dirname(cluster_label_path), exist_ok=True) 45 | with open(cluster_label_path, 'w') as fp: 46 | json.dump({"data": label_dict}, fp) 47 | 48 | print("Cluster label for each image are saved at results/cluster_label/example.") 49 | 50 | # Symlink 51 | link_base_dir = config.paths.link_dir 52 | os.makedirs(link_base_dir, exist_ok=True) 53 | 54 | symlink_cluster(label_path=cluster_label_path, 55 | dest_dir=os.path.join(link_base_dir, object_name), 56 | src_dir=img_dir) 57 | 58 | print("Go to results/link/example to see images in each cluster") 59 | -------------------------------------------------------------------------------- /src/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/__init__.py -------------------------------------------------------------------------------- /src/data/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/data/.gitkeep -------------------------------------------------------------------------------- /src/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/data/__init__.py -------------------------------------------------------------------------------- /src/data/make_dataset.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import zipfile 3 | import os 4 | from tqdm import tqdm 5 | 6 | 7 | def download_file_from_google_drive(id, destination): 8 | URL = 'https://docs.google.com/uc?export=download' 9 | 10 | session = requests.Session() 11 | response = session.get(URL, params = { 'id' : id }, stream = True) 12 | 13 | token = None 14 | for key, value in response.cookies.items(): 15 | if key.startswith('download_warning'): 16 | token = 
value 17 | break 18 | 19 | if token: 20 | params = { 'id' : id, 'confirm' : token } 21 | response = session.get(URL, params = params, stream = True) 22 | 23 | CHUNK_SIZE = 32*1024 24 | total_size = int(response.headers.get('content-length', 0)) 25 | 26 | with tqdm(desc=destination, total=total_size, unit='B', unit_scale=True) as pbar: 27 | with open(destination, 'wb') as f: 28 | for chunk in response.iter_content(CHUNK_SIZE): 29 | if chunk: 30 | pbar.update(CHUNK_SIZE) 31 | f.write(chunk) 32 | 33 | 34 | def download_word2vec(download_dir, gg_drive_id): 35 | # Download pre-trained word2vec embeddings from google drive 36 | print("Start downloading pre-trained word2vec embeddings.") 37 | download_file_name = "ja-gensim_update.txt.zip" 38 | 39 | # file_id = "1ViflLHKz_sQEioELGp7xromuXJsPJd4Y" 40 | destination = os.path.join(download_dir, download_file_name) 41 | download_file_from_google_drive(gg_drive_id, destination) 42 | print("Finish downloading pre-trained word2vec embeddings.") 43 | 44 | # Extract zip file 45 | zip_ref = zipfile.ZipFile(destination, 'r') 46 | zip_ref.extractall(download_dir) 47 | zip_ref.close() 48 | 49 | print("Delete .zip file.") 50 | os.remove(destination) 51 | 52 | 53 | def download_raw_data(destination, gg_drive_id): 54 | # Download pre-trained word2vec embeddings from google drive 55 | print("Start downloading raw dataset.") 56 | 57 | # file_id = "1ViflLHKz_sQEioELGp7xromuXJsPJd4Y" 58 | download_file_from_google_drive(gg_drive_id, destination) 59 | print("Finish downloading raw dataset from operators.") -------------------------------------------------------------------------------- /src/features/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/features/.gitkeep -------------------------------------------------------------------------------- /src/features/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/features/__init__.py -------------------------------------------------------------------------------- /src/features/build_features.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/features/build_features.py -------------------------------------------------------------------------------- /src/models/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/models/.gitkeep -------------------------------------------------------------------------------- /src/models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/models/__init__.py -------------------------------------------------------------------------------- /src/models/feature_extractor.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import sys 4 | import tarfile 5 | import zipfile 6 | import numpy as np 7 | import tensorflow as tf 8 | from six.moves import urllib 9 | import psutil 10 | from collections import defaultdict 11 | 12 | DATA_URL = 
'http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz' 13 | IMAGE_EXTENSION = ('jpg', 'jpeg', 'bmp', 'png') 14 | 15 | 16 | class NodeLookup(object): 17 | """Converts integer node ID's to human readable labels.""" 18 | 19 | def __init__(self, model_dir): 20 | label_lookup_path = os.path.join(model_dir, 'imagenet_2012_challenge_label_map_proto.pbtxt') 21 | uid_lookup_path = os.path.join(model_dir, 'imagenet_synset_to_human_label_map.txt') 22 | self.node_lookup = self.load(label_lookup_path, uid_lookup_path) 23 | 24 | def load(self, label_lookup_path, uid_lookup_path): 25 | """Loads a human readable English name for each softmax node. 26 | 27 | Args: 28 | label_lookup_path: string UID to integer node ID. 29 | uid_lookup_path: string UID to human-readable string. 30 | 31 | Returns: 32 | dict from integer node ID to human-readable string. 33 | """ 34 | if not tf.gfile.Exists(uid_lookup_path): 35 | tf.logging.fatal('File does not exist %s', uid_lookup_path) 36 | if not tf.gfile.Exists(label_lookup_path): 37 | tf.logging.fatal('File does not exist %s', label_lookup_path) 38 | 39 | # Loads mapping from string UID to human-readable string 40 | proto_as_ascii_lines = tf.gfile.GFile(uid_lookup_path).readlines() 41 | uid_to_human = {} 42 | p = re.compile(r'[n\d]*[ \S,]*') 43 | for line in proto_as_ascii_lines: 44 | parsed_items = p.findall(line) 45 | uid = parsed_items[0] 46 | human_string = parsed_items[2] 47 | uid_to_human[uid] = human_string 48 | 49 | # Loads mapping from string UID to integer node ID. 50 | node_id_to_uid = {} 51 | proto_as_ascii = tf.gfile.GFile(label_lookup_path).readlines() 52 | for line in proto_as_ascii: 53 | if line.startswith(' target_class:'): 54 | target_class = int(line.split(': ')[1]) 55 | if line.startswith(' target_class_string:'): 56 | target_class_string = line.split(': ')[1] 57 | node_id_to_uid[target_class] = target_class_string[1:-2] 58 | 59 | # Loads the final mapping of integer node ID to human-readable string 60 | node_id_to_name = {} 61 | for key, val in node_id_to_uid.items(): 62 | if val not in uid_to_human: 63 | tf.logging.fatal('Failed to locate: %s', val) 64 | name = uid_to_human[val] 65 | node_id_to_name[key] = name 66 | 67 | return node_id_to_name 68 | 69 | def id_to_string(self, node_id): 70 | if node_id not in self.node_lookup: 71 | return '' 72 | return self.node_lookup[node_id] 73 | 74 | 75 | def create_graph(model_dir): 76 | """Creates a graph from saved GraphDef file and returns a saver.""" 77 | # Creates graph from saved graph_def.pb. 78 | with tf.gfile.FastGFile(os.path.join(model_dir, 'classify_image_graph_def.pb'), 'rb') as f: 79 | graph_def = tf.GraphDef() 80 | graph_def.ParseFromString(f.read()) 81 | _ = tf.import_graph_def(graph_def, name='') 82 | 83 | 84 | def run_inference_on_image(image, model_dir): 85 | """Runs inference on an image. 86 | 87 | Args: 88 | image: Image file name. 89 | model_dir: Directory contains model 90 | 91 | Returns: 92 | Nothing 93 | """ 94 | if not tf.gfile.Exists(image): 95 | tf.logging.fatal('File does not exist %s', image) 96 | image_data = tf.gfile.FastGFile(image, 'rb').read() 97 | 98 | # Creates graph from saved GraphDef. 99 | create_graph(model_dir) 100 | 101 | num_top_predictions = 5 102 | 103 | with tf.Session() as sess: 104 | # Some useful tensors: 105 | # 'softmax:0': A tensor containing the normalized prediction across 106 | # 1000 labels. 107 | # 'pool_3:0': A tensor containing the next-to-last layer containing 2048 108 | # float description of the image. 
109 | # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG 110 | # encoding of the image. 111 | # Runs the softmax tensor by feeding the image_data as input to the graph. 112 | softmax_tensor = sess.graph.get_tensor_by_name('softmax:0') 113 | predictions = sess.run(softmax_tensor, 114 | {'DecodeJpeg/contents:0': image_data}) 115 | predictions = np.squeeze(predictions) 116 | 117 | # Creates node ID --> English string lookup. 118 | node_lookup = NodeLookup(model_dir) 119 | 120 | top_k = predictions.argsort()[-num_top_predictions:][::-1] 121 | for node_id in top_k: 122 | human_string = node_lookup.id_to_string(node_id) 123 | score = predictions[node_id] 124 | print('%s (score = %.5f)' % (human_string, score)) 125 | 126 | 127 | def run_inference_on_images_feature(image_list, model_dir, output_dir): 128 | """Runs inference on an image list and get features. 129 | Args: 130 | image_list: {list} a list of paths to image files 131 | model_dir: (string) name of the directory where model is 132 | output_dir: {string} name of the directory where image vectors will be saved 133 | Returns: 134 | save image feature into output_dir 135 | """ 136 | image_to_labels = defaultdict(list) 137 | 138 | create_graph(model_dir) 139 | 140 | os.makedirs(output_dir, exist_ok=True) 141 | 142 | num_top_predictions = 5 143 | 144 | with tf.Session() as sess: 145 | # Some useful tensors: 146 | # 'softmax:0': A tensor containing the normalized prediction across 147 | # 1000 labels. 148 | # 'pool_3:0': A tensor containing the next-to-last layer containing 2048 149 | # float description of the image. 150 | # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG 151 | # encoding of the image. 152 | # Runs the softmax tensor by feeding the image_data as input to the graph. 
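        # Note: only the 2048-d 'pool_3:0' activations fetched inside the loop below are
        # used as image features for clustering; the 'softmax:0' tensor grabbed on the next
        # line is carried over from the classification helper above and is not used here.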
153 | softmax_tensor = sess.graph.get_tensor_by_name('softmax:0') 154 | 155 | for image_index, image in enumerate(image_list): 156 | try: 157 | print("parsing", image_index, image, "\n") 158 | if not tf.gfile.Exists(image): 159 | tf.logging.fatal('File does not exist %s', image) 160 | 161 | with tf.gfile.FastGFile(image, 'rb') as f: 162 | image_data = f.read() 163 | 164 | feature_tensor = sess.graph.get_tensor_by_name('pool_3:0') 165 | feature_set = sess.run(feature_tensor, 166 | {'DecodeJpeg/contents:0': image_data}) 167 | feature_vector = np.squeeze(feature_set) 168 | outfile_name = os.path.basename(image) + ".npz" 169 | out_path = os.path.join(output_dir, outfile_name) 170 | np.savetxt(out_path, feature_vector, delimiter=',') 171 | 172 | # close the open file handlers 173 | proc = psutil.Process() 174 | open_files = proc.open_files() 175 | 176 | for open_file in open_files: 177 | file_handler = getattr(open_file, "fd") 178 | os.close(file_handler) 179 | except: 180 | print('could not process image index', image_index, 'image', image) 181 | 182 | return image_to_labels 183 | 184 | 185 | def maybe_download_and_extract(model_dir, data_url): 186 | """Download and extract model tar file.""" 187 | dest_directory = model_dir 188 | if not os.path.exists(dest_directory): 189 | os.makedirs(dest_directory) 190 | filename = data_url.split('/')[-1] 191 | filepath = os.path.join(dest_directory, filename) 192 | if not os.path.exists(filepath): 193 | def _progress(count, block_size, total_size): 194 | sys.stdout.write('\r>> Downloading %s %.1f%%' % ( 195 | filename, float(count * block_size) / float(total_size) * 100.0)) 196 | sys.stdout.flush() 197 | 198 | filepath, _ = urllib.request.urlretrieve(data_url, filepath, _progress) 199 | print() 200 | statinfo = os.stat(filepath) 201 | print('Successfully downloaded', filename, statinfo.st_size, 'bytes.') 202 | 203 | if data_url.endswith(".tgz"): 204 | tarfile.open(filepath, 'r:gz').extractall(dest_directory) 205 | elif data_url.endswith(".zip"): 206 | with zipfile.ZipFile(filepath, 'r') as zip_ref: 207 | zip_ref.extractall(dest_directory) 208 | else: 209 | raise ValueError 210 | 211 | 212 | if __name__ == '__main__': 213 | model_dir = "models" 214 | 215 | maybe_download_and_extract(model_dir) 216 | 217 | # image = os.path.join(model_dir, 'cropped_panda.jpg') 218 | # run_inference_on_image(image, model_dir) 219 | 220 | image_dir = "../../scene_recognition/vgg365/data/raw/images/instagram/" 221 | output_dir = "results/image_vectors/" 222 | 223 | # Get image paths 224 | object_paths = [os.path.join(image_dir, name) for name in os.listdir(image_dir) if os.path.isdir(os.path.join(image_dir, name))] 225 | 226 | for object_path in object_paths: 227 | image_list = [os.path.join(object_path, file_name) for file_name in os.listdir(object_path) if file_name.endswith(IMAGE_EXTENSION)] 228 | obj_output_dir = os.path.join(output_dir, os.path.basename(object_path)) 229 | run_inference_on_images_feature(image_list=image_list, 230 | model_dir=model_dir, 231 | output_dir=obj_output_dir) 232 | -------------------------------------------------------------------------------- /src/models/k_mean.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | from sklearn.cluster import KMeans 4 | from sklearn.decomposition import PCA 5 | from sklearn.manifold import TSNE 6 | import matplotlib.pyplot as plt 7 | from sklearn.preprocessing import StandardScaler 8 | import pandas as pd 9 | import json 10 | 11 | 12 | def 
plot_3d(vector_array, save_plot_dir): 13 | """ 14 | Plot 3D vector features distribution from vector array 15 | :param vector_array: (N x 3) vector array, where N is the number of images 16 | :param save_plot_dir: (string) directory to save plot 17 | :return: save 3D distribution feature to disk 18 | """ 19 | principal_df = pd.DataFrame(data=vector_array, columns=['pc1', 'pc2', 'pc3']) 20 | fig = plt.figure() 21 | ax = fig.add_subplot(111, projection='3d') 22 | 23 | xs = principal_df['pc1'] 24 | ys = principal_df['pc2'] 25 | zs = principal_df['pc3'] 26 | ax.scatter(xs, ys, zs, s=50, alpha=0.6, edgecolors='w') 27 | 28 | ax.set_xlabel('pc1') 29 | ax.set_ylabel('pc2') 30 | ax.set_zlabel('pc3') 31 | 32 | plt.savefig(save_plot_dir + '/3D_scatter.png') 33 | plt.close() 34 | 35 | 36 | def plot_2d(vector_array, save_plot_dir): 37 | """ 38 | Plot 2D vector features distribution from vector array 39 | :param vector_array: (N x 2) vector array, where N is the number of images 40 | :param save_plot_dir: (string) directory to save plot 41 | :return: save 2D distribution feature to disk 42 | """ 43 | principal_df = pd.DataFrame(data = vector_array, columns = ['pc1', 'pc2']) 44 | fig = plt.figure() 45 | ax = fig.add_subplot(111) 46 | 47 | xs = principal_df['pc1'] 48 | ys = principal_df['pc2'] 49 | ax.scatter(xs, ys, s=50, alpha=0.6, edgecolors='w') 50 | 51 | ax.set_xlabel('pc1') 52 | ax.set_ylabel('pc2') 53 | 54 | plt.savefig(save_plot_dir + '/2D_scatter.png') 55 | plt.close() 56 | 57 | 58 | def read_vector(img_dir): 59 | """ 60 | Read vector in a directory to array (N x D): N is number of vectors, D is vector's dimension 61 | :param img_dir: (string) directory where feature vectors are 62 | :return: (array) N X D array 63 | """ 64 | vector_files = [f for f in os.listdir(img_dir) if f.endswith(".npz")] 65 | vector_array = [] 66 | for img in vector_files: 67 | vector = np.loadtxt(os.path.join(img_dir, img)) 68 | vector_array.append(vector) 69 | vector_array = np.asarray(vector_array) 70 | return vector_array, vector_files 71 | 72 | 73 | def find_best_k(vector_array, save_plot_dir, max_k=100): 74 | """ 75 | Find best number of cluster 76 | :param vector_array: (array) N x D dimension feature vector array 77 | :param save_plot_dir: (string) path to save cost figure 78 | :param max_k: (int) maximum number of cluster to analyze 79 | :return: plot the elbow curve to figure out the best number of cluster 80 | """ 81 | 82 | cost = [] 83 | dim = vector_array.shape[1] 84 | for i in range(1, max_k): 85 | kmeans = KMeans(n_clusters=i, random_state=0) 86 | kmeans.fit(vector_array) 87 | cost.append(kmeans.inertia_) 88 | 89 | # plot the cost against K values 90 | plt.plot(range(1, max_k), cost, color='g', linewidth='3') 91 | plt.xlabel("Value of K") 92 | plt.ylabel("Squared Error (Cost)") 93 | plt.savefig(save_plot_dir + '/cost_' + str(dim) + 'D.png') 94 | plt.close() 95 | 96 | 97 | def k_mean(vector_array, k): 98 | """ 99 | Apply k-mean clustering approach to assign each feature image in vector array to suitable subsets 100 | :param vector_array: (array) N x D dimension feature vector array 101 | :param k: (int) number of cluster 102 | :return: (array) (N x 1) label array 103 | """ 104 | kmeans = KMeans(n_clusters=k, random_state=0) 105 | kmeans.fit(vector_array) 106 | labels = kmeans.labels_ 107 | return labels 108 | 109 | 110 | def reduce_dim_combine(vector_array, dim=2): 111 | """ 112 | Applying dimension reduction to vector_array 113 | :param vector_array: (array) N x D dimension feature vector array 114 | :param 
dim: (int) desired dimension after reduction 115 | :return: (array) N x dim dimension feature vector array 116 | """ 117 | # Standardizing the features 118 | vector_array = StandardScaler().fit_transform(vector_array) 119 | 120 | # Apply PCA first to reduce dim to 50 121 | pca = PCA(n_components=50) 122 | vector_array = pca.fit_transform(vector_array) 123 | 124 | # Apply tSNE to reduce dim to #dim 125 | model = TSNE(n_components=dim, random_state=0) 126 | vector_array = model.fit_transform(vector_array) 127 | 128 | return vector_array 129 | 130 | 131 | if __name__ == "__main__": 132 | # Mode: investiagate to find the best k, inference to cluster 133 | # MODE = "investigate" 134 | MODE = "inference" 135 | 136 | # Image vectors root dir 137 | img_dir = "results/image_vectors/" 138 | 139 | # Final dimension 140 | dim = 2 141 | 142 | for object_name in os.listdir(img_dir): 143 | print("Process %s" % object_name) 144 | # object_name = img_dir.split("/")[-1] 145 | vector_array, img_files = read_vector(os.path.join(img_dir, object_name)) 146 | # k_mean(vector_array) 147 | 148 | if vector_array.shape[0] >= 450: 149 | # Apply dimensional reducing approach 150 | vector_array = reduce_dim_combine(vector_array, dim) 151 | 152 | if MODE == "investigate": 153 | 154 | # Plot data distribution after reducing dimension 155 | if dim == 2: 156 | plot_2d(vector_array) 157 | save_plot_dir = "visualization/2D/" 158 | elif dim == 3: 159 | plot_3d(vector_array) 160 | save_plot_dir = "visualization/3D/" 161 | else: 162 | raise ValueError("Not support dimension") 163 | 164 | # Plot cost chart to find best value of k 165 | find_best_k(vector_array, object_name, save_plot_dir) 166 | continue 167 | 168 | # Find label for each image 169 | labels = k_mean(vector_array, k=40).tolist() 170 | assert len(labels) == len(img_files), "Not equal length" 171 | 172 | label_dict = [{"img_file": img_files[i].replace(".npz", "").replace(object_name + '_', ""), "label": str(labels[i]), "prob": "1.0"} for i in range(len(labels))] 173 | 174 | # Save to disk 175 | label_dir = "results/img_cluster/" 176 | label_outpath = os.path.join(label_dir, object_name + ".json") 177 | # os.makedirs(label_outpath, exist_ok=True) 178 | with open(label_outpath, 'w') as fp: 179 | json.dump({"data": label_dict}, fp) -------------------------------------------------------------------------------- /src/utils/analyze_label.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import os 3 | import json 4 | import shutil 5 | THRESHOLD = 0.85 6 | 7 | 8 | def get_object_type(object_type_path): 9 | """ 10 | Read object type (.csv) file and get list of type in object master 11 | :param label_path: (string) path to object master type 12 | :return: list of types 13 | """ 14 | df = pd.read_csv(object_type_path) 15 | 16 | types = df[["type", "object_type"]].drop_duplicates().reset_index().sort_values(by=['type', 'object_type']) 17 | print(types) 18 | types[["type", "object_type"]].to_csv("../../results/types.csv", index=False) 19 | return df 20 | 21 | 22 | def symlink_cluster(label_path, dest_dir, src_dir): 23 | """ 24 | Link the images in img_root to symlink_dir with its cluster defined in label_path 25 | :param label_path: (string) path to json file containing cluster label of image 26 | :param dest_dir: (string) destination directory to link image 27 | :param src_dir: (string) source directory to link image 28 | :return: 29 | """ 30 | with open(label_path, "r") as f: 31 | json_dat = json.load(f) 32 | 33 | 
df = pd.DataFrame(json_dat['data']) 34 | 35 | # Convert prob string to float 36 | df["prob"] = df["prob"].apply(lambda x: float(x)) 37 | 38 | top_labels = df[df["prob"] >= THRESHOLD] 39 | top_labels_count = top_labels["label"].value_counts() 40 | 41 | # Make symlink dir 42 | os.makedirs(dest_dir, exist_ok=True) 43 | 44 | print("Object %s has %d images" % (label_path, len(df))) 45 | print(top_labels_count) 46 | print("\n") 47 | 48 | # Remove previous symlink 49 | if os.path.exists(dest_dir): 50 | shutil.rmtree(dest_dir) 51 | 52 | for l in top_labels_count.index: 53 | label_name = l.replace("/", "") + "_" + str(top_labels_count[l]) 54 | 55 | # Create folder for label 56 | os.makedirs(os.path.join(dest_dir, label_name), exist_ok=True) 57 | 58 | img_files = top_labels[top_labels["label"] == l]["img_file"].values.tolist() 59 | 60 | for img_file in img_files: 61 | src_img_path = os.path.abspath(os.path.join(src_dir, img_file)) 62 | dst_img_path = os.path.join(dest_dir, label_name, img_file) 63 | os.symlink(src_img_path, dst_img_path) 64 | 65 | 66 | def symlink_objects(img_json_dir, dest_root_dir, src_root_dir): 67 | """ 68 | Link the images for each object stored in img_json_dir from src_root_dir to dest_root_dir with corresponding cluster labels 69 | :param img_json_dir: (string) directory of label json files 70 | :param dest_root_dir: (string) directory of destination to link objects' images 71 | :param src_root_dir: (string) directory of source to link objects' images 72 | :return: 73 | """ 74 | for json_file in os.listdir(img_json_dir): 75 | if json_file.endswith(".json"): 76 | object_name = json_file.replace(".json", "") 77 | label_path = os.path.join(img_json_dir, object_name) 78 | dest_dir = os.path.join(dest_root_dir, object_name) 79 | src_dir = os.path.join(src_root_dir, object_name) 80 | symlink_cluster(label_path, dest_dir, src_dir) 81 | 82 | 83 | if __name__ == "__main__": 84 | # Get object type from object master 85 | # object_type_path = "../../data/interim/object-list-with-type.csv" 86 | # get_object_type(object_type_path) 87 | 88 | # Symlink top label 89 | img_json_dir = "../../results/k_means_json" 90 | symlink_dir = "../../results/k_means" 91 | img_root = "../../data/raw/images/instagram" 92 | symlink_objects(img_json_dir, symlink_dir, img_root) 93 | -------------------------------------------------------------------------------- /src/utils/config.py: -------------------------------------------------------------------------------- 1 | import json 2 | from dotmap import DotMap 3 | 4 | 5 | def get_config_from_json(json_file): 6 | """ 7 | Get the config from a json file 8 | :param json_file: 9 | :return: config(namespace) or config(dictionary) 10 | """ 11 | # parse the configurations from the config json file provided 12 | with open(json_file, 'r') as config_file: 13 | config_dict = json.load(config_file) 14 | 15 | # convert the dictionary to a namespace using bunch lib 16 | config = DotMap(config_dict) 17 | 18 | return config, config_dict -------------------------------------------------------------------------------- /src/visualization/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/visualization/.gitkeep -------------------------------------------------------------------------------- /src/visualization/__init__.py: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/visualization/__init__.py -------------------------------------------------------------------------------- /src/visualization/visualize.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/visualization/visualize.py -------------------------------------------------------------------------------- /visualization/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/visualization/.gitkeep --------------------------------------------------------------------------------