├── .gitignore ├── LICENSE ├── Makefile ├── README.md ├── build_data.py ├── configs └── configs.json ├── docs └── illustration │ ├── clusters.png │ ├── cost_function.png │ ├── data_distribution.png │ ├── dimension_reduction.png │ └── feature_extractor.png ├── extract_features.py ├── find_k.py ├── models └── .gitkeep ├── notebooks └── .gitkeep ├── references └── .gitkeep ├── reports ├── .gitkeep └── figures │ └── .gitkeep ├── requirements.txt ├── results └── .gitkeep ├── run_k_mean.py ├── src ├── __init__.py ├── data │ ├── .gitkeep │ ├── __init__.py │ └── make_dataset.py ├── features │ ├── .gitkeep │ ├── __init__.py │ └── build_features.py ├── models │ ├── .gitkeep │ ├── __init__.py │ ├── feature_extractor.py │ └── k_mean.py ├── utils │ ├── analyze_label.py │ └── config.py └── visualization │ ├── .gitkeep │ ├── __init__.py │ └── visualize.py └── visualization └── .gitkeep /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | **/__pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | *.egg-info/ 23 | .installed.cfg 24 | *.egg 25 | 26 | # PyInstaller 27 | # Usually these files are written by a python script from a template 28 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 29 | *.manifest 30 | *.spec 31 | 32 | # Installer logs 33 | pip-log.txt 34 | pip-delete-this-directory.txt 35 | 36 | # Unit test / coverage reports 37 | htmlcov/ 38 | .tox/ 39 | .coverage 40 | .coverage.* 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | *.cover 45 | 46 | # Translations 47 | *.mo 48 | *.pot 49 | 50 | # Django stuff: 51 | *.log 52 | 53 | # Sphinx documentation 54 | docs/_build/ 55 | 56 | # PyBuilder 57 | target/ 58 | 59 | # DotEnv configuration 60 | .env 61 | 62 | # Database 63 | *.db 64 | *.rdb 65 | 66 | # Pycharm 67 | .idea/ 68 | 69 | # VS Code 70 | .vscode/ 71 | 72 | # Spyder 73 | .spyproject/ 74 | 75 | # Jupyter NB Checkpoints 76 | .ipynb_checkpoints/ 77 | 78 | # exclude data from source control by default 79 | /data/ 80 | 81 | # Mac OS-specific storage files 82 | .DS_Store 83 | 84 | # vim 85 | *.swp 86 | *.swo 87 | 88 | # Mypy cache 89 | .mypy_cache/ 90 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Tri Ngo 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | .PHONY: clean build run 2 | 3 | help: 4 | @echo "Please use \`make <target>' where <target> is one of" 5 | @echo " clean to remove generated data, models and results" 6 | @echo " build to download the feature extractor model" 7 | @echo " run to cluster images with the default config" 8 | 9 | clean: 10 | rm -rf data/processed/* 11 | rm -rf data/external/* 12 | rm -rf data/interim/* 13 | rm -rf data/raw/* 14 | rm -rf models/* 15 | rm -rf results/* 16 | rm -rf visualization/* 17 | 18 | build: 19 | python3 build_data.py 20 | 21 | run: 22 | python3 run_k_mean.py 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Image clustering 2 | 3 | Clustering is an interesting field of unsupervised machine learning in which a dataset is divided into groups of similar items. I use image clustering when I have thousands of images and want a way to group (or categorize) them into subsets that share similar characteristics. 4 | 5 | This work performs image clustering using k-means, an Inception feature extractor and dimension reduction. 6 | 7 | 8 | ## Installation 9 | 10 | ### Clone the repo 11 | 12 | ```bash 13 | git clone https://github.com/tringn/image_clustering.git 14 | ``` 15 | 16 | 17 | ### Install required packages 18 | 19 | This project uses `python3.7` and `pip3`. 20 | 21 | ```bash 22 | pip3 install -r requirements.txt 23 | ``` 24 | 25 | ## Process 26 | 27 | ### Step 1: Feature extraction 28 | 29 | I use the feature extractor of the InceptionV3 model, which was trained on the ImageNet dataset (14,197,122 images: http://www.image-net.org/), so that it can extract characteristic features for a wide variety of objects. 30 | 31 | ![feature_extractor](docs/illustration/feature_extractor.png) 32 | 33 | You need to download the extractor model first: 34 | 35 | ```bash 36 | python build_data.py 37 | ``` 38 | 39 | I used images crawled from the champselysees_paris Instagram account ([link](https://www.instagram.com/champselysees_paris/)). 40 | 41 | You can crawl the champselysees_paris images or use your own. 42 | 43 | Move all images into one folder: `data/raw/example`. 44 | 45 | Then extract features by running: 46 | ```bash 47 | python extract_features.py 48 | ``` 49 | 50 | ### Step 2: Dimension reduction 51 | 52 | The 2048-dimensional feature vectors make it hard for the clustering process to converge to a good optimum, so I apply dimension reduction to shrink each image's vector down to 2 dimensions. 53 | 54 | I divide the dimension reduction into 2 steps: PCA from 2048 to 50 dimensions, then t-SNE from 50 to 2. The reason is the scikit-learn t-SNE recommendation: "It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high."
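For reference, here is a minimal sketch of this two-step reduction, mirroring what `reduce_dim_combine` in `src/models/k_mean.py` does (standardize, PCA to 50 components, then t-SNE down to 2). It assumes the 2048-dimensional InceptionV3 feature vectors are already stacked into a NumPy array; the function name `reduce_to_2d` and the `vectors` argument are illustrative only, not part of the repository.

```python
# Sketch of the two-step reduction (see reduce_dim_combine in src/models/k_mean.py).
# Assumes `vectors` is an (N x 2048) array of InceptionV3 "pool_3" features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def reduce_to_2d(vectors: np.ndarray, intermediate_dim: int = 50, final_dim: int = 2) -> np.ndarray:
    # Standardize each feature to zero mean and unit variance before PCA.
    standardized = StandardScaler().fit_transform(vectors)
    # Step 1: PCA from 2048 down to a "reasonable" number of dimensions (50).
    compact = PCA(n_components=intermediate_dim).fit_transform(standardized)
    # Step 2: t-SNE from 50 down to 2 for plotting and clustering.
    return TSNE(n_components=final_dim, random_state=0).fit_transform(compact)
```

Running PCA first keeps t-SNE fast and stable, since t-SNE's pairwise-distance computations scale poorly with very high-dimensional input.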
55 | 56 | ![dimension_reduction](/docs/illustration/dimension_reduction.png) 57 | 58 | After reducing the dimension to 2, the distribution of the images is plotted as follows: 59 | 60 | ![data_distribution](docs/illustration/data_distribution.png) 61 | 62 | **The script for this step is combined with step 3.1** 63 | 64 | ### Step 3: Clustering using K-Means 65 | 66 | The question now is how to choose the number of clusters to assign the images to. I have no predefined notion of how many subsets the images belong to, because images crawled from the Internet are chaotically distributed. 67 | 68 | I apply the elbow curve method to determine the best number of clusters, as described below: 69 | 70 | #### Step 3.1: Find the best K clusters 71 | 72 | I iterate k from 1 to 100. From k = 1 to k = 40, the squared error decreases significantly. From k = 40 to k = 100, the squared error falls only slightly compared with the earlier drops. So I choose k = 40 as the best number of clusters. 73 | 74 | ![cost_function](docs/illustration/cost_function.png) 75 | 76 | Run the following script to reduce the feature dimension and find the best value of k: 77 | 78 | ```bash 79 | python find_k.py 80 | ``` 81 | 82 | Go to `visualization/example/cost_2D.png` to find the best value of K (the elbow point, where the cost starts decreasing much more slowly than before). 83 | 84 | #### Step 3.2: Cluster images into k subsets 85 | 86 | I use the k-means algorithm to assign all images in the collection to 40 subsets (k = 40 was selected in the step above). 87 | 88 | Set the value of `k` in `configs/configs.json` and run: 89 | 90 | ```bash 91 | python run_k_mean.py 92 | ``` 93 | 94 | Images will be clustered and symlinked into `results/link/example`. 95 | 96 | The output for **example** is shown below: 97 | 98 | ![clusters](docs/illustration/clusters.png) -------------------------------------------------------------------------------- /build_data.py: -------------------------------------------------------------------------------- 1 | from src.models.feature_extractor import maybe_download_and_extract 2 | from src.utils.config import get_config_from_json 3 | 4 | 5 | if __name__ == "__main__": 6 | config, _ = get_config_from_json("configs/configs.json") 7 | maybe_download_and_extract(config.paths.model_dir, config.paths.data_url) 8 | 9 | 10 | -------------------------------------------------------------------------------- /configs/configs.json: -------------------------------------------------------------------------------- 1 | { 2 | "paths": { 3 | "image_dir": "data/raw", 4 | "model_dir": "models", 5 | "vector_dir": "results/vectors", 6 | "cluster_label_dir": "results/cluster_label/", 7 | "plot_dir": "visualization/", 8 | "link_dir": "results/link", 9 | "data_url": "http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz" 10 | }, 11 | "model": { 12 | "object_name": "example", 13 | "mode": "analysis", 14 | "reduced_dimension": 2, 15 | "k": 40 16 | } 17 | } -------------------------------------------------------------------------------- /docs/illustration/clusters.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/docs/illustration/clusters.png -------------------------------------------------------------------------------- /docs/illustration/cost_function.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/docs/illustration/cost_function.png -------------------------------------------------------------------------------- /docs/illustration/data_distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/docs/illustration/data_distribution.png -------------------------------------------------------------------------------- /docs/illustration/dimension_reduction.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/docs/illustration/dimension_reduction.png -------------------------------------------------------------------------------- /docs/illustration/feature_extractor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/docs/illustration/feature_extractor.png -------------------------------------------------------------------------------- /extract_features.py: -------------------------------------------------------------------------------- 1 | import os 2 | from src.utils.config import get_config_from_json 3 | from src.models.feature_extractor import run_inference_on_images_feature 4 | 5 | IMAGE_EXTENSION = ('jpg', 'jpeg', 'bmp', 'png') 6 | 7 | 8 | def extract_feature(img_dir, model_dir, output_dir): 9 | """ 10 | Extract image features of all images in img_dir and save feature vectors to output_dir 11 | :param img_dir: (string) directory containing images to extract feature 12 | :param model_dir: (string) directory containing extractor model 13 | :param output_dir: (string) directory to save feature vector file 14 | :return: 15 | """ 16 | # Get list of image paths 17 | img_list = [os.path.join(img_dir, img_file) for img_file in os.listdir(img_dir) if img_file.endswith(IMAGE_EXTENSION)] 18 | 19 | # Run getting feature vectors for each image 20 | run_inference_on_images_feature(img_list, model_dir, output_dir) 21 | 22 | 23 | if __name__ == "__main__": 24 | # Get config 25 | config, _ = get_config_from_json("configs/configs.json") 26 | 27 | object_name = config.model.object_name 28 | img_dir = os.path.join(config.paths.image_dir, object_name) 29 | vector_dir = os.path.join(config.paths.vector_dir, object_name) 30 | 31 | # Extract feature for all images in image directory 32 | extract_feature(img_dir, config.paths.model_dir, vector_dir) 33 | -------------------------------------------------------------------------------- /find_k.py: -------------------------------------------------------------------------------- 1 | import os 2 | from src.utils.config import get_config_from_json 3 | from src.models.k_mean import read_vector, reduce_dim_combine, plot_2d, plot_3d, find_best_k 4 | 5 | IMAGE_EXTENSION = ('jpg', 'jpeg', 'bmp', 'png') 6 | 7 | 8 | def find_k(vector_array, save_plot_dir, dim=2): 9 | """ 10 | Find the best number of cluster by looking at the cost plot 11 | :param vector_array: (array) (N x D) array of feature vectors 12 | :param save_plot_dir: (string) directory to save plots 13 | :param dim: (int) desired dimension after reduction 14 | :return: 15 | """ 16 | 17 | os.makedirs(save_plot_dir, exist_ok=True) 18 | 19 | if vector_array.shape[0] >= 250: 20 | # Plot data distribution after reducing 
dimension 21 | if dim == 2: 22 | plot_2d(vector_array, save_plot_dir) 23 | elif dim == 3: 24 | plot_3d(vector_array, save_plot_dir) 25 | else: 26 | raise ValueError("Not support dimension") 27 | 28 | # Plot cost chart to find best value of k 29 | find_best_k(vector_array, save_plot_dir) 30 | 31 | else: 32 | raise ValueError("If number of image is smaller than 250, it is recommended to use hierarchical cluster.") 33 | 34 | 35 | if __name__ == "__main__": 36 | # Get config 37 | config, _ = get_config_from_json("configs/configs.json") 38 | 39 | object_name = config.model.object_name 40 | 41 | dim = config.model.reduced_dimension 42 | 43 | vector_dir = os.path.join(config.paths.vector_dir, object_name) 44 | save_plot_dir = os.path.join(config.paths.plot_dir, object_name) 45 | 46 | # Read feature vector from vector dir 47 | vector_array, vector_files = read_vector(vector_dir) 48 | 49 | # Apply dimensional reducing approach 50 | vector_array = reduce_dim_combine(vector_array, dim=dim) 51 | 52 | # Find best K 53 | find_k(vector_array, save_plot_dir, dim=dim) 54 | -------------------------------------------------------------------------------- /models/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/models/.gitkeep -------------------------------------------------------------------------------- /notebooks/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/notebooks/.gitkeep -------------------------------------------------------------------------------- /references/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/references/.gitkeep -------------------------------------------------------------------------------- /reports/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/reports/.gitkeep -------------------------------------------------------------------------------- /reports/figures/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/reports/figures/.gitkeep -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.16.1 2 | tensorflow==1.13.1 3 | psutil==5.6.2 4 | dotmap==1.3.8 5 | regex==2018.1.10 6 | sklearn==0.0 7 | pandas==0.24.2 8 | matplotlib==3.0.2 9 | six==1.12.0 10 | pandas==0.24.2 11 | matplotlib==3.0.2 -------------------------------------------------------------------------------- /results/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/results/.gitkeep -------------------------------------------------------------------------------- /run_k_mean.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | 4 | from src.models.k_mean import read_vector, reduce_dim_combine, 
k_mean 5 | from src.utils.analyze_label import symlink_cluster 6 | from src.utils.config import get_config_from_json 7 | 8 | IMAGE_EXTENSION = ('jpg', 'jpeg', 'bmp', 'png') 9 | 10 | if __name__ == "__main__": 11 | 12 | # Get config 13 | config, _ = get_config_from_json("configs/configs.json") 14 | 15 | object_name = config.model.object_name 16 | 17 | dim = config.model.reduced_dimension 18 | 19 | img_dir = os.path.join(config.paths.image_dir, object_name) 20 | vector_dir = os.path.join(config.paths.vector_dir, object_name) 21 | save_plot_dir = os.path.join(config.paths.plot_dir, object_name) 22 | cluster_label_path = os.path.join(config.paths.cluster_label_dir, object_name + ".json") 23 | 24 | if not os.path.isdir(vector_dir): 25 | raise Exception("Please run feature extraction for all images first") 26 | 27 | # Read feature vector from vector dir 28 | vector_array, vector_files = read_vector(vector_dir) 29 | 30 | if len(vector_files) == 0: 31 | raise Exception("Please run feature extraction for all images first") 32 | 33 | # Apply dimensional reducing approach 34 | vector_array = reduce_dim_combine(vector_array, dim=dim) 35 | 36 | labels = k_mean(vector_array, config.model.k).tolist() 37 | 38 | assert len(labels) == len(vector_files), "Not equal length" 39 | 40 | label_dict = [{"img_file": vector_files[i].replace(".npz", ""), "label": str(labels[i]), "prob": "1.0"} for i in 41 | range(len(labels))] 42 | 43 | # Save to disk 44 | os.makedirs(os.path.dirname(cluster_label_path), exist_ok=True) 45 | with open(cluster_label_path, 'w') as fp: 46 | json.dump({"data": label_dict}, fp) 47 | 48 | print("Cluster label for each image are saved at results/cluster_label/example.") 49 | 50 | # Symlink 51 | link_base_dir = config.paths.link_dir 52 | os.makedirs(link_base_dir, exist_ok=True) 53 | 54 | symlink_cluster(label_path=cluster_label_path, 55 | dest_dir=os.path.join(link_base_dir, object_name), 56 | src_dir=img_dir) 57 | 58 | print("Go to results/link/example to see images in each cluster") 59 | -------------------------------------------------------------------------------- /src/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/__init__.py -------------------------------------------------------------------------------- /src/data/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/data/.gitkeep -------------------------------------------------------------------------------- /src/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/data/__init__.py -------------------------------------------------------------------------------- /src/data/make_dataset.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import zipfile 3 | import os 4 | from tqdm import tqdm 5 | 6 | 7 | def download_file_from_google_drive(id, destination): 8 | URL = 'https://docs.google.com/uc?export=download' 9 | 10 | session = requests.Session() 11 | response = session.get(URL, params = { 'id' : id }, stream = True) 12 | 13 | token = None 14 | for key, value in response.cookies.items(): 15 | if key.startswith('download_warning'): 16 | token = 
value 17 | break 18 | 19 | if token: 20 | params = { 'id' : id, 'confirm' : token } 21 | response = session.get(URL, params = params, stream = True) 22 | 23 | CHUNK_SIZE = 32*1024 24 | total_size = int(response.headers.get('content-length', 0)) 25 | 26 | with tqdm(desc=destination, total=total_size, unit='B', unit_scale=True) as pbar: 27 | with open(destination, 'wb') as f: 28 | for chunk in response.iter_content(CHUNK_SIZE): 29 | if chunk: 30 | pbar.update(CHUNK_SIZE) 31 | f.write(chunk) 32 | 33 | 34 | def download_word2vec(download_dir, gg_drive_id): 35 | # Download pre-trained word2vec embeddings from google drive 36 | print("Start downloading pre-trained word2vec embeddings.") 37 | download_file_name = "ja-gensim_update.txt.zip" 38 | 39 | # file_id = "1ViflLHKz_sQEioELGp7xromuXJsPJd4Y" 40 | destination = os.path.join(download_dir, download_file_name) 41 | download_file_from_google_drive(gg_drive_id, destination) 42 | print("Finish downloading pre-trained word2vec embeddings.") 43 | 44 | # Extract zip file 45 | zip_ref = zipfile.ZipFile(destination, 'r') 46 | zip_ref.extractall(download_dir) 47 | zip_ref.close() 48 | 49 | print("Delete .zip file.") 50 | os.remove(destination) 51 | 52 | 53 | def download_raw_data(destination, gg_drive_id): 54 | # Download pre-trained word2vec embeddings from google drive 55 | print("Start downloading raw dataset.") 56 | 57 | # file_id = "1ViflLHKz_sQEioELGp7xromuXJsPJd4Y" 58 | download_file_from_google_drive(gg_drive_id, destination) 59 | print("Finish downloading raw dataset from operators.") -------------------------------------------------------------------------------- /src/features/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/features/.gitkeep -------------------------------------------------------------------------------- /src/features/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/features/__init__.py -------------------------------------------------------------------------------- /src/features/build_features.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/features/build_features.py -------------------------------------------------------------------------------- /src/models/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/models/.gitkeep -------------------------------------------------------------------------------- /src/models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/models/__init__.py -------------------------------------------------------------------------------- /src/models/feature_extractor.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import sys 4 | import tarfile 5 | import zipfile 6 | import numpy as np 7 | import tensorflow as tf 8 | from six.moves import urllib 9 | import psutil 10 | from collections import defaultdict 11 | 12 | DATA_URL = 
'http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz' 13 | IMAGE_EXTENSION = ('jpg', 'jpeg', 'bmp', 'png') 14 | 15 | 16 | class NodeLookup(object): 17 | """Converts integer node ID's to human readable labels.""" 18 | 19 | def __init__(self, model_dir): 20 | label_lookup_path = os.path.join(model_dir, 'imagenet_2012_challenge_label_map_proto.pbtxt') 21 | uid_lookup_path = os.path.join(model_dir, 'imagenet_synset_to_human_label_map.txt') 22 | self.node_lookup = self.load(label_lookup_path, uid_lookup_path) 23 | 24 | def load(self, label_lookup_path, uid_lookup_path): 25 | """Loads a human readable English name for each softmax node. 26 | 27 | Args: 28 | label_lookup_path: string UID to integer node ID. 29 | uid_lookup_path: string UID to human-readable string. 30 | 31 | Returns: 32 | dict from integer node ID to human-readable string. 33 | """ 34 | if not tf.gfile.Exists(uid_lookup_path): 35 | tf.logging.fatal('File does not exist %s', uid_lookup_path) 36 | if not tf.gfile.Exists(label_lookup_path): 37 | tf.logging.fatal('File does not exist %s', label_lookup_path) 38 | 39 | # Loads mapping from string UID to human-readable string 40 | proto_as_ascii_lines = tf.gfile.GFile(uid_lookup_path).readlines() 41 | uid_to_human = {} 42 | p = re.compile(r'[n\d]*[ \S,]*') 43 | for line in proto_as_ascii_lines: 44 | parsed_items = p.findall(line) 45 | uid = parsed_items[0] 46 | human_string = parsed_items[2] 47 | uid_to_human[uid] = human_string 48 | 49 | # Loads mapping from string UID to integer node ID. 50 | node_id_to_uid = {} 51 | proto_as_ascii = tf.gfile.GFile(label_lookup_path).readlines() 52 | for line in proto_as_ascii: 53 | if line.startswith(' target_class:'): 54 | target_class = int(line.split(': ')[1]) 55 | if line.startswith(' target_class_string:'): 56 | target_class_string = line.split(': ')[1] 57 | node_id_to_uid[target_class] = target_class_string[1:-2] 58 | 59 | # Loads the final mapping of integer node ID to human-readable string 60 | node_id_to_name = {} 61 | for key, val in node_id_to_uid.items(): 62 | if val not in uid_to_human: 63 | tf.logging.fatal('Failed to locate: %s', val) 64 | name = uid_to_human[val] 65 | node_id_to_name[key] = name 66 | 67 | return node_id_to_name 68 | 69 | def id_to_string(self, node_id): 70 | if node_id not in self.node_lookup: 71 | return '' 72 | return self.node_lookup[node_id] 73 | 74 | 75 | def create_graph(model_dir): 76 | """Creates a graph from saved GraphDef file and returns a saver.""" 77 | # Creates graph from saved graph_def.pb. 78 | with tf.gfile.FastGFile(os.path.join(model_dir, 'classify_image_graph_def.pb'), 'rb') as f: 79 | graph_def = tf.GraphDef() 80 | graph_def.ParseFromString(f.read()) 81 | _ = tf.import_graph_def(graph_def, name='') 82 | 83 | 84 | def run_inference_on_image(image, model_dir): 85 | """Runs inference on an image. 86 | 87 | Args: 88 | image: Image file name. 89 | model_dir: Directory contains model 90 | 91 | Returns: 92 | Nothing 93 | """ 94 | if not tf.gfile.Exists(image): 95 | tf.logging.fatal('File does not exist %s', image) 96 | image_data = tf.gfile.FastGFile(image, 'rb').read() 97 | 98 | # Creates graph from saved GraphDef. 99 | create_graph(model_dir) 100 | 101 | num_top_predictions = 5 102 | 103 | with tf.Session() as sess: 104 | # Some useful tensors: 105 | # 'softmax:0': A tensor containing the normalized prediction across 106 | # 1000 labels. 107 | # 'pool_3:0': A tensor containing the next-to-last layer containing 2048 108 | # float description of the image. 
109 | # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG 110 | # encoding of the image. 111 | # Runs the softmax tensor by feeding the image_data as input to the graph. 112 | softmax_tensor = sess.graph.get_tensor_by_name('softmax:0') 113 | predictions = sess.run(softmax_tensor, 114 | {'DecodeJpeg/contents:0': image_data}) 115 | predictions = np.squeeze(predictions) 116 | 117 | # Creates node ID --> English string lookup. 118 | node_lookup = NodeLookup(model_dir) 119 | 120 | top_k = predictions.argsort()[-num_top_predictions:][::-1] 121 | for node_id in top_k: 122 | human_string = node_lookup.id_to_string(node_id) 123 | score = predictions[node_id] 124 | print('%s (score = %.5f)' % (human_string, score)) 125 | 126 | 127 | def run_inference_on_images_feature(image_list, model_dir, output_dir): 128 | """Runs inference on an image list and get features. 129 | Args: 130 | image_list: {list} a list of paths to image files 131 | model_dir: (string) name of the directory where model is 132 | output_dir: {string} name of the directory where image vectors will be saved 133 | Returns: 134 | save image feature into output_dir 135 | """ 136 | image_to_labels = defaultdict(list) 137 | 138 | create_graph(model_dir) 139 | 140 | os.makedirs(output_dir, exist_ok=True) 141 | 142 | num_top_predictions = 5 143 | 144 | with tf.Session() as sess: 145 | # Some useful tensors: 146 | # 'softmax:0': A tensor containing the normalized prediction across 147 | # 1000 labels. 148 | # 'pool_3:0': A tensor containing the next-to-last layer containing 2048 149 | # float description of the image. 150 | # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG 151 | # encoding of the image. 152 | # Runs the softmax tensor by feeding the image_data as input to the graph. 
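        # Note: only the 2048-d 'pool_3:0' activations fetched inside the loop below are
        # used as image features for clustering; the 'softmax:0' tensor grabbed on the next
        # line is carried over from the classification helper above and is not used here.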
153 | softmax_tensor = sess.graph.get_tensor_by_name('softmax:0') 154 | 155 | for image_index, image in enumerate(image_list): 156 | try: 157 | print("parsing", image_index, image, "\n") 158 | if not tf.gfile.Exists(image): 159 | tf.logging.fatal('File does not exist %s', image) 160 | 161 | with tf.gfile.FastGFile(image, 'rb') as f: 162 | image_data = f.read() 163 | 164 | feature_tensor = sess.graph.get_tensor_by_name('pool_3:0') 165 | feature_set = sess.run(feature_tensor, 166 | {'DecodeJpeg/contents:0': image_data}) 167 | feature_vector = np.squeeze(feature_set) 168 | outfile_name = os.path.basename(image) + ".npz" 169 | out_path = os.path.join(output_dir, outfile_name) 170 | np.savetxt(out_path, feature_vector, delimiter=',') 171 | 172 | # close the open file handlers 173 | proc = psutil.Process() 174 | open_files = proc.open_files() 175 | 176 | for open_file in open_files: 177 | file_handler = getattr(open_file, "fd") 178 | os.close(file_handler) 179 | except: 180 | print('could not process image index', image_index, 'image', image) 181 | 182 | return image_to_labels 183 | 184 | 185 | def maybe_download_and_extract(model_dir, data_url): 186 | """Download and extract model tar file.""" 187 | dest_directory = model_dir 188 | if not os.path.exists(dest_directory): 189 | os.makedirs(dest_directory) 190 | filename = data_url.split('/')[-1] 191 | filepath = os.path.join(dest_directory, filename) 192 | if not os.path.exists(filepath): 193 | def _progress(count, block_size, total_size): 194 | sys.stdout.write('\r>> Downloading %s %.1f%%' % ( 195 | filename, float(count * block_size) / float(total_size) * 100.0)) 196 | sys.stdout.flush() 197 | 198 | filepath, _ = urllib.request.urlretrieve(data_url, filepath, _progress) 199 | print() 200 | statinfo = os.stat(filepath) 201 | print('Successfully downloaded', filename, statinfo.st_size, 'bytes.') 202 | 203 | if data_url.endswith(".tgz"): 204 | tarfile.open(filepath, 'r:gz').extractall(dest_directory) 205 | elif data_url.endswith(".zip"): 206 | with zipfile.ZipFile(filepath, 'r') as zip_ref: 207 | zip_ref.extractall(dest_directory) 208 | else: 209 | raise ValueError 210 | 211 | 212 | if __name__ == '__main__': 213 | model_dir = "models" 214 | 215 | maybe_download_and_extract(model_dir) 216 | 217 | # image = os.path.join(model_dir, 'cropped_panda.jpg') 218 | # run_inference_on_image(image, model_dir) 219 | 220 | image_dir = "../../scene_recognition/vgg365/data/raw/images/instagram/" 221 | output_dir = "results/image_vectors/" 222 | 223 | # Get image paths 224 | object_paths = [os.path.join(image_dir, name) for name in os.listdir(image_dir) if os.path.isdir(os.path.join(image_dir, name))] 225 | 226 | for object_path in object_paths: 227 | image_list = [os.path.join(object_path, file_name) for file_name in os.listdir(object_path) if file_name.endswith(IMAGE_EXTENSION)] 228 | obj_output_dir = os.path.join(output_dir, os.path.basename(object_path)) 229 | run_inference_on_images_feature(image_list=image_list, 230 | model_dir=model_dir, 231 | output_dir=obj_output_dir) 232 | -------------------------------------------------------------------------------- /src/models/k_mean.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | from sklearn.cluster import KMeans 4 | from sklearn.decomposition import PCA 5 | from sklearn.manifold import TSNE 6 | import matplotlib.pyplot as plt 7 | from sklearn.preprocessing import StandardScaler 8 | import pandas as pd 9 | import json 10 | 11 | 12 | def 
plot_3d(vector_array, save_plot_dir): 13 | """ 14 | Plot 3D vector features distribution from vector array 15 | :param vector_array: (N x 3) vector array, where N is the number of images 16 | :param save_plot_dir: (string) directory to save plot 17 | :return: save 3D distribution feature to disk 18 | """ 19 | principal_df = pd.DataFrame(data=vector_array, columns=['pc1', 'pc2', 'pc3']) 20 | fig = plt.figure() 21 | ax = fig.add_subplot(111, projection='3d') 22 | 23 | xs = principal_df['pc1'] 24 | ys = principal_df['pc2'] 25 | zs = principal_df['pc3'] 26 | ax.scatter(xs, ys, zs, s=50, alpha=0.6, edgecolors='w') 27 | 28 | ax.set_xlabel('pc1') 29 | ax.set_ylabel('pc2') 30 | ax.set_zlabel('pc3') 31 | 32 | plt.savefig(save_plot_dir + '/3D_scatter.png') 33 | plt.close() 34 | 35 | 36 | def plot_2d(vector_array, save_plot_dir): 37 | """ 38 | Plot 2D vector features distribution from vector array 39 | :param vector_array: (N x 2) vector array, where N is the number of images 40 | :param save_plot_dir: (string) directory to save plot 41 | :return: save 2D distribution feature to disk 42 | """ 43 | principal_df = pd.DataFrame(data = vector_array, columns = ['pc1', 'pc2']) 44 | fig = plt.figure() 45 | ax = fig.add_subplot(111) 46 | 47 | xs = principal_df['pc1'] 48 | ys = principal_df['pc2'] 49 | ax.scatter(xs, ys, s=50, alpha=0.6, edgecolors='w') 50 | 51 | ax.set_xlabel('pc1') 52 | ax.set_ylabel('pc2') 53 | 54 | plt.savefig(save_plot_dir + '/2D_scatter.png') 55 | plt.close() 56 | 57 | 58 | def read_vector(img_dir): 59 | """ 60 | Read vector in a directory to array (N x D): N is number of vectors, D is vector's dimension 61 | :param img_dir: (string) directory where feature vectors are 62 | :return: (array) N X D array 63 | """ 64 | vector_files = [f for f in os.listdir(img_dir) if f.endswith(".npz")] 65 | vector_array = [] 66 | for img in vector_files: 67 | vector = np.loadtxt(os.path.join(img_dir, img)) 68 | vector_array.append(vector) 69 | vector_array = np.asarray(vector_array) 70 | return vector_array, vector_files 71 | 72 | 73 | def find_best_k(vector_array, save_plot_dir, max_k=100): 74 | """ 75 | Find best number of cluster 76 | :param vector_array: (array) N x D dimension feature vector array 77 | :param save_plot_dir: (string) path to save cost figure 78 | :param max_k: (int) maximum number of cluster to analyze 79 | :return: plot the elbow curve to figure out the best number of cluster 80 | """ 81 | 82 | cost = [] 83 | dim = vector_array.shape[1] 84 | for i in range(1, max_k): 85 | kmeans = KMeans(n_clusters=i, random_state=0) 86 | kmeans.fit(vector_array) 87 | cost.append(kmeans.inertia_) 88 | 89 | # plot the cost against K values 90 | plt.plot(range(1, max_k), cost, color='g', linewidth='3') 91 | plt.xlabel("Value of K") 92 | plt.ylabel("Squared Error (Cost)") 93 | plt.savefig(save_plot_dir + '/cost_' + str(dim) + 'D.png') 94 | plt.close() 95 | 96 | 97 | def k_mean(vector_array, k): 98 | """ 99 | Apply k-mean clustering approach to assign each feature image in vector array to suitable subsets 100 | :param vector_array: (array) N x D dimension feature vector array 101 | :param k: (int) number of cluster 102 | :return: (array) (N x 1) label array 103 | """ 104 | kmeans = KMeans(n_clusters=k, random_state=0) 105 | kmeans.fit(vector_array) 106 | labels = kmeans.labels_ 107 | return labels 108 | 109 | 110 | def reduce_dim_combine(vector_array, dim=2): 111 | """ 112 | Applying dimension reduction to vector_array 113 | :param vector_array: (array) N x D dimension feature vector array 114 | :param 
dim: (int) desired dimension after reduction 115 | :return: (array) N x dim dimension feature vector array 116 | """ 117 | # Standardizing the features 118 | vector_array = StandardScaler().fit_transform(vector_array) 119 | 120 | # Apply PCA first to reduce dim to 50 121 | pca = PCA(n_components=50) 122 | vector_array = pca.fit_transform(vector_array) 123 | 124 | # Apply tSNE to reduce dim to #dim 125 | model = TSNE(n_components=dim, random_state=0) 126 | vector_array = model.fit_transform(vector_array) 127 | 128 | return vector_array 129 | 130 | 131 | if __name__ == "__main__": 132 | # Mode: investiagate to find the best k, inference to cluster 133 | # MODE = "investigate" 134 | MODE = "inference" 135 | 136 | # Image vectors root dir 137 | img_dir = "results/image_vectors/" 138 | 139 | # Final dimension 140 | dim = 2 141 | 142 | for object_name in os.listdir(img_dir): 143 | print("Process %s" % object_name) 144 | # object_name = img_dir.split("/")[-1] 145 | vector_array, img_files = read_vector(os.path.join(img_dir, object_name)) 146 | # k_mean(vector_array) 147 | 148 | if vector_array.shape[0] >= 450: 149 | # Apply dimensional reducing approach 150 | vector_array = reduce_dim_combine(vector_array, dim) 151 | 152 | if MODE == "investigate": 153 | 154 | # Plot data distribution after reducing dimension 155 | if dim == 2: 156 | plot_2d(vector_array) 157 | save_plot_dir = "visualization/2D/" 158 | elif dim == 3: 159 | plot_3d(vector_array) 160 | save_plot_dir = "visualization/3D/" 161 | else: 162 | raise ValueError("Not support dimension") 163 | 164 | # Plot cost chart to find best value of k 165 | find_best_k(vector_array, object_name, save_plot_dir) 166 | continue 167 | 168 | # Find label for each image 169 | labels = k_mean(vector_array, k=40).tolist() 170 | assert len(labels) == len(img_files), "Not equal length" 171 | 172 | label_dict = [{"img_file": img_files[i].replace(".npz", "").replace(object_name + '_', ""), "label": str(labels[i]), "prob": "1.0"} for i in range(len(labels))] 173 | 174 | # Save to disk 175 | label_dir = "results/img_cluster/" 176 | label_outpath = os.path.join(label_dir, object_name + ".json") 177 | # os.makedirs(label_outpath, exist_ok=True) 178 | with open(label_outpath, 'w') as fp: 179 | json.dump({"data": label_dict}, fp) -------------------------------------------------------------------------------- /src/utils/analyze_label.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import os 3 | import json 4 | import shutil 5 | THRESHOLD = 0.85 6 | 7 | 8 | def get_object_type(object_type_path): 9 | """ 10 | Read object type (.csv) file and get list of type in object master 11 | :param label_path: (string) path to object master type 12 | :return: list of types 13 | """ 14 | df = pd.read_csv(object_type_path) 15 | 16 | types = df[["type", "object_type"]].drop_duplicates().reset_index().sort_values(by=['type', 'object_type']) 17 | print(types) 18 | types[["type", "object_type"]].to_csv("../../results/types.csv", index=False) 19 | return df 20 | 21 | 22 | def symlink_cluster(label_path, dest_dir, src_dir): 23 | """ 24 | Link the images in img_root to symlink_dir with its cluster defined in label_path 25 | :param label_path: (string) path to json file containing cluster label of image 26 | :param dest_dir: (string) destination directory to link image 27 | :param src_dir: (string) source directory to link image 28 | :return: 29 | """ 30 | with open(label_path, "r") as f: 31 | json_dat = json.load(f) 32 | 33 | 
df = pd.DataFrame(json_dat['data']) 34 | 35 | # Convert prob string to float 36 | df["prob"] = df["prob"].apply(lambda x: float(x)) 37 | 38 | top_labels = df[df["prob"] >= THRESHOLD] 39 | top_labels_count = top_labels["label"].value_counts() 40 | 41 | # Make symlink dir 42 | os.makedirs(dest_dir, exist_ok=True) 43 | 44 | print("Object %s has %d images" % (label_path, len(df))) 45 | print(top_labels_count) 46 | print("\n") 47 | 48 | # Remove previous symlink 49 | if os.path.exists(dest_dir): 50 | shutil.rmtree(dest_dir) 51 | 52 | for l in top_labels_count.index: 53 | label_name = l.replace("/", "") + "_" + str(top_labels_count[l]) 54 | 55 | # Create folder for label 56 | os.makedirs(os.path.join(dest_dir, label_name), exist_ok=True) 57 | 58 | img_files = top_labels[top_labels["label"] == l]["img_file"].values.tolist() 59 | 60 | for img_file in img_files: 61 | src_img_path = os.path.abspath(os.path.join(src_dir, img_file)) 62 | dst_img_path = os.path.join(dest_dir, label_name, img_file) 63 | os.symlink(src_img_path, dst_img_path) 64 | 65 | 66 | def symlink_objects(img_json_dir, dest_root_dir, src_root_dir): 67 | """ 68 | Link the images for each object stored in img_json_dir from src_root_dir to dest_root_dir with corresponding cluster labels 69 | :param img_json_dir: (string) directory of label json files 70 | :param dest_root_dir: (string) directory of destination to link objects' images 71 | :param src_root_dir: (string) directory of source to link objects' images 72 | :return: 73 | """ 74 | for json_file in os.listdir(img_json_dir): 75 | if json_file.endswith(".json"): 76 | object_name = json_file.replace(".json", "") 77 | label_path = os.path.join(img_json_dir, object_name) 78 | dest_dir = os.path.join(dest_root_dir, object_name) 79 | src_dir = os.path.join(src_root_dir, object_name) 80 | symlink_cluster(label_path, dest_dir, src_dir) 81 | 82 | 83 | if __name__ == "__main__": 84 | # Get object type from object master 85 | # object_type_path = "../../data/interim/object-list-with-type.csv" 86 | # get_object_type(object_type_path) 87 | 88 | # Symlink top label 89 | img_json_dir = "../../results/k_means_json" 90 | symlink_dir = "../../results/k_means" 91 | img_root = "../../data/raw/images/instagram" 92 | symlink_objects(img_json_dir, symlink_dir, img_root) 93 | -------------------------------------------------------------------------------- /src/utils/config.py: -------------------------------------------------------------------------------- 1 | import json 2 | from dotmap import DotMap 3 | 4 | 5 | def get_config_from_json(json_file): 6 | """ 7 | Get the config from a json file 8 | :param json_file: 9 | :return: config(namespace) or config(dictionary) 10 | """ 11 | # parse the configurations from the config json file provided 12 | with open(json_file, 'r') as config_file: 13 | config_dict = json.load(config_file) 14 | 15 | # convert the dictionary to a namespace using bunch lib 16 | config = DotMap(config_dict) 17 | 18 | return config, config_dict -------------------------------------------------------------------------------- /src/visualization/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/visualization/.gitkeep -------------------------------------------------------------------------------- /src/visualization/__init__.py: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/visualization/__init__.py -------------------------------------------------------------------------------- /src/visualization/visualize.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/src/visualization/visualize.py -------------------------------------------------------------------------------- /visualization/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tringn/image_clustering/f7912ee57967d021fbb5579ae64bd5556faae2f7/visualization/.gitkeep --------------------------------------------------------------------------------