├── .gitignore
├── data
    └── processed
    │   ├── test
    │       ├── 1.jpg
    │       └── 2.jpg
    │   ├── train
    │       ├── Damro tea.jpg
    │       ├── Vacuum Cleaner.JPG
    │       ├── Washing Machine.jpg
    │       ├── coffee machine.jpg
    │       ├── television maker.jpg
    │       ├── Braava 380t damp cleaning.jpg
    │       ├── Roomba_805_charging dock.jpg
    │       ├── An unmodified iRobot Create with Command Module.jpg
    │       └── captions.csv
    │   └── val
    │       ├── Vacuum Cleaner.JPG
    │       ├── Washing Machine.jpg
    │       └── television maker.jpg
├── requirements.txt
├── src
    ├── utils.py
    ├── config.py
    ├── dataset.py
    ├── modules.py
    ├── CLIP.py
    └── train.py
├── make_csv.py
├── app.py
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
1 | *myenv
2 | *models
3 | *__pycache__
4 | 


--------------------------------------------------------------------------------
/data/processed/test/1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/test/1.jpg


--------------------------------------------------------------------------------
/data/processed/test/2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/test/2.jpg


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | torch
2 | open_clip_torch
3 | torchvision
4 | Pillow
5 | scikit-learn
6 | requests
7 | beautifulsoup4
8 | 


--------------------------------------------------------------------------------
/data/processed/train/Damro tea.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/train/Damro tea.jpg


--------------------------------------------------------------------------------
/data/processed/val/Vacuum Cleaner.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/val/Vacuum Cleaner.JPG


--------------------------------------------------------------------------------
/data/processed/train/Vacuum Cleaner.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/train/Vacuum Cleaner.JPG


--------------------------------------------------------------------------------
/data/processed/train/Washing Machine.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/train/Washing Machine.jpg


--------------------------------------------------------------------------------
/data/processed/train/coffee machine.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/train/coffee machine.jpg


--------------------------------------------------------------------------------
/data/processed/val/Washing Machine.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/val/Washing Machine.jpg


--------------------------------------------------------------------------------
/data/processed/val/television maker.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/val/television maker.jpg


--------------------------------------------------------------------------------
/data/processed/train/television maker.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/train/television maker.jpg


--------------------------------------------------------------------------------
/data/processed/train/Braava 380t damp cleaning.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/train/Braava 380t damp cleaning.jpg


--------------------------------------------------------------------------------
/data/processed/train/Roomba_805_charging dock.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/train/Roomba_805_charging dock.jpg


--------------------------------------------------------------------------------
/data/processed/train/An unmodified iRobot Create with Command Module.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/noel319/CLIP_detection/HEAD/data/processed/train/An unmodified iRobot Create with Command Module.jpg


--------------------------------------------------------------------------------
/src/utils.py:
--------------------------------------------------------------------------------
 1 | class AvgMeter:
 2 |     def __init__(self, name="Metric"):
 3 |         self.name = name
 4 |         self.reset()
 5 | 
 6 |     def reset(self):
 7 |         self.avg, self.sum, self.count = [0] * 3
 8 | 
 9 |     def update(self, val, count=1):
10 |         self.count += count
11 |         self.sum += val * count
12 |         self.avg = self.sum / self.count
13 | 
14 |     def __repr__(self):
15 |         text = f"{self.name}: {self.avg:.4f}"
16 |         return text
17 | 
18 | def get_lr(optimizer):
19 |     for param_group in optimizer.param_groups:
20 |         return param_group["lr"]
21 | 


--------------------------------------------------------------------------------
/src/config.py:
--------------------------------------------------------------------------------
 1 | import torch
 2 | 
 3 | debug = True
 4 | image_path = "data/processed/train"
 5 | captions_path = "data/processed/train"
 6 | batch_size = 8
 7 | num_workers = 0
 8 | lr = 1e-3
 9 | weight_decay = 1e-3
10 | patience = 2
11 | factor = 0.5
12 | epochs = 5000
13 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
14 | 
15 | model_name = 'resnet50'
16 | image_embedding = 2048
17 | text_encoder_model = "distilbert-base-uncased"
18 | text_embedding = 768
19 | text_tokenizer = "distilbert-base-uncased"
20 | max_length = 200
21 | 
22 | pretrained = False # for both image encoder and text encoder
23 | trainable = False # for both image encoder and text encoder
24 | temperature = 1.0
25 | 
26 | # image size
27 | size = 224
28 | 
29 | # for projection head; used for both image and text encoders
30 | num_projection_layers = 1
31 | projection_dim = 256 
32 | dropout = 0.1


--------------------------------------------------------------------------------
/make_csv.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import pandas as pd
 3 | 
 4 | # Folder containing images
 5 | image_folder = "data/processed/train"
 6 | 
 7 | 
 8 | # Get all image files in the folder
 9 | image_files = [f for f in os.listdir(image_folder) if f.lower().endswith(('.png', '.jpg', '.jpeg', '.gif'))]
10 | 
11 | # Create data for the CSV
12 | data = {
13 |     'image': [file for file in image_files],  # Construct the full URL
14 |     'caption': [os.path.splitext(file)[0] for file in image_files]  # Use file name as caption without extension
15 | }
16 | 
17 | # Create DataFrame
18 | df = pd.DataFrame(data)
19 | df_repeated = pd.concat([df] * 15, ignore_index=True)
20 | df_repeated['id'] = range(1, len(df_repeated) + 1)
21 | # Save to CSV
22 | csv_file_path = "data/processed/train/captions.csv"  # You can set the path where you want to save your CSV file
23 | df_repeated.to_csv(csv_file_path, index=False)
24 | 
25 | print(f"CSV file created at {csv_file_path} data: {df}")
26 | 


--------------------------------------------------------------------------------
/src/dataset.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import cv2
 3 | import torch
 4 | import albumentations as A
 5 | 
 6 | import src.config as CFG
 7 | 
 8 | 
 9 | class CLIPDataset(torch.utils.data.Dataset):
10 |     def __init__(self, image_filenames, captions, tokenizer, transforms):
11 |         self.image_filenames = image_filenames
12 |         self.captions = list(captions)
13 |         self.encoded_captions = tokenizer(
14 |             list(captions), padding=True, truncation=True, max_length=CFG.max_length
15 |         )
16 |         self.transforms = transforms
17 | 
18 |     def __getitem__(self, idx):
19 |         item = {
20 |             key: torch.tensor(values[idx])
21 |             for key, values in self.encoded_captions.items()
22 |         }
23 | 
24 |         image = cv2.imread(f"{CFG.image_path}/{self.image_filenames[idx]}")
25 |         image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
26 |         image = self.transforms(image=image)['image']
27 |         item['image'] = torch.tensor(image).permute(2, 0, 1).float()
28 |         item['caption'] = self.captions[idx]
29 |         return item
30 |         
31 |     def __len__(self):
32 |         return len(self.captions)
33 | 
34 | 
35 | 
36 | def get_transforms(mode="train"):
37 |     if mode == "train":
38 |         return A.Compose(
39 |             [
40 |                 A.Resize(CFG.size, CFG.size, always_apply=True),
41 |                 A.Normalize(max_pixel_value=255.0, always_apply=True),
42 |             ]
43 |         )
44 |     else:
45 |         return A.Compose(
46 |             [
47 |                 A.Resize(CFG.size, CFG.size, always_apply=True),
48 |                 A.Normalize(max_pixel_value=255.0, always_apply=True),
49 |             ]
50 |         )
51 | 
52 |     
53 | 


--------------------------------------------------------------------------------
/src/modules.py:
--------------------------------------------------------------------------------
 1 | import torch
 2 | from torch import nn
 3 | import timm
 4 | from transformers import DistilBertModel, DistilBertConfig
 5 | import src.config as CFG
 6 | 
 7 | 
 8 | class ImageEncoder(nn.Module):
 9 |     def __init__(
10 |         self, model_name=CFG.model_name, pretrained=CFG.pretrained, trainable=CFG.trainable
11 |     ):
12 |         super().__init__()
13 |         self.model = timm.create_model(
14 |             model_name, pretrained, num_classes=0, global_pool="avg"
15 |         )
16 |         for p in self.model.parameters():
17 |             p.requires_grad = trainable
18 | 
19 |     def forward(self, x):
20 |         return self.model(x)
21 | 
22 | 
23 | class TextEncoder(nn.Module):
24 |     def __init__(self, model_name=CFG.text_encoder_model, pretrained=CFG.pretrained, trainable=CFG.trainable):
25 |         super().__init__()
26 |         if pretrained:
27 |             self.model = DistilBertModel.from_pretrained(model_name)
28 |         else:
29 |             self.model = DistilBertModel(config=DistilBertConfig())
30 |             
31 |         for p in self.model.parameters():
32 |             p.requires_grad = trainable        
33 |         self.target_token_idx = 0
34 | 
35 |     def forward(self, input_ids, attention_mask):
36 |         output = self.model(input_ids=input_ids, attention_mask=attention_mask)
37 |         last_hidden_state = output.last_hidden_state
38 |         return last_hidden_state[:, self.target_token_idx, :]
39 | 
40 | 
41 | 
42 | class ProjectionHead(nn.Module):
43 |     def __init__(
44 |         self,
45 |         embedding_dim,
46 |         projection_dim=CFG.projection_dim,
47 |         dropout=CFG.dropout
48 |     ):
49 |         super().__init__()
50 |         self.projection = nn.Linear(embedding_dim, projection_dim)
51 |         self.gelu = nn.GELU()
52 |         self.fc = nn.Linear(projection_dim, projection_dim)
53 |         self.dropout = nn.Dropout(dropout)
54 |         self.layer_norm = nn.LayerNorm(projection_dim)
55 |     
56 |     def forward(self, x):
57 |         projected = self.projection(x)
58 |         x = self.gelu(projected)
59 |         x = self.fc(x)
60 |         x = self.dropout(x)
61 |         x = x + projected
62 |         x = self.layer_norm(x)
63 |         return x
64 | 
65 | 


--------------------------------------------------------------------------------
/src/CLIP.py:
--------------------------------------------------------------------------------
 1 | import torch
 2 | from torch import nn
 3 | import torch.nn.functional as F
 4 | 
 5 | import src.config as CFG
 6 | from src.modules import ImageEncoder, TextEncoder, ProjectionHead
 7 | 
 8 | 
 9 | class CLIPModel(nn.Module):
10 |     def __init__(
11 |         self,
12 |         temperature=CFG.temperature,
13 |         image_embedding=CFG.image_embedding,
14 |         text_embedding=CFG.text_embedding,
15 |     ):
16 |         super().__init__()
17 |         self.image_encoder = ImageEncoder()
18 |         self.text_encoder = TextEncoder()
19 |         self.image_projection = ProjectionHead(embedding_dim=image_embedding)
20 |         self.text_projection = ProjectionHead(embedding_dim=text_embedding)
21 |         self.temperature = temperature
22 | 
23 |     def forward(self, batch):
24 |         # Getting Image and Text Features
25 |         image_features = self.image_encoder(batch["image"])
26 |         text_features = self.text_encoder(
27 |             input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
28 |         )
29 |         # Getting Image and Text Embeddings (with same dimension)
30 |         image_embeddings = self.image_projection(image_features)
31 |         text_embeddings = self.text_projection(text_features)
32 | 
33 |         # Calculating the Loss
34 |         logits = (text_embeddings @ image_embeddings.T) / self.temperature
35 |         images_similarity = image_embeddings @ image_embeddings.T
36 |         texts_similarity = text_embeddings @ text_embeddings.T
37 |         targets = F.softmax(
38 |             (images_similarity + texts_similarity) / 2 * self.temperature, dim=-1
39 |         )
40 |         texts_loss = cross_entropy(logits, targets, reduction='none')
41 |         images_loss = cross_entropy(logits.T, targets.T, reduction='none')
42 |         loss =  (images_loss + texts_loss) / 2.0 # shape: (batch_size)
43 |         return loss.mean()
44 | 
45 | 
46 | def cross_entropy(preds, targets, reduction='none'):
47 |     log_softmax = nn.LogSoftmax(dim=-1)
48 |     loss = (-targets * log_softmax(preds)).sum(1)
49 |     if reduction == "none":
50 |         return loss
51 |     elif reduction == "mean":
52 |         return loss.mean()
53 | 
54 | if __name__ == '__main__':
55 |     images = torch.randn(8, 3, 224, 224)
56 |     input_ids = torch.randint(5, 300, size=(8, 25))
57 |     attention_mask = torch.ones(8, 25)
58 |     batch = {
59 |         'image': images,
60 |         'input_ids': input_ids,
61 |         'attention_mask': attention_mask
62 |     }
63 | 
64 |     CLIP = CLIPModel()
65 |     loss = CLIP(batch)
66 |     print("")


--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
 1 | import argparse
 2 | import os
 3 | import cv2
 4 | import pandas as pd
 5 | import torch
 6 | import torch.nn.functional as F
 7 | from tqdm import tqdm
 8 | from transformers import DistilBertTokenizer
 9 | import matplotlib.pyplot as plt
10 | import src.config as CFG
11 | from src.train import build_loaders, make_train_valid_dfs
12 | from src.CLIP import CLIPModel
13 | 
14 | def get_image_embeddings(valid_df, model_path):        
15 |     tokenizer = DistilBertTokenizer.from_pretrained(CFG.text_tokenizer)
16 |     valid_loader = build_loaders(valid_df, tokenizer, mode="valid")
17 |     
18 |     model = CLIPModel().to(CFG.device)
19 |     model.load_state_dict(torch.load(model_path, map_location=CFG.device))
20 |     model.eval()
21 |     
22 |     valid_image_embeddings = []
23 |     with torch.no_grad():
24 |         for batch in tqdm(valid_loader):
25 |             image_features = model.image_encoder(batch["image"].to(CFG.device))
26 |             image_embeddings = model.image_projection(image_features)
27 |             valid_image_embeddings.append(image_embeddings)
28 |     return model, torch.cat(valid_image_embeddings)
29 | 
30 | def find_matches(model, image_embeddings, query, image_filenames, n=1):
31 |     tokenizer = DistilBertTokenizer.from_pretrained(CFG.text_tokenizer)
32 |     encoded_query = tokenizer([query])
33 |     batch = {
34 |         key: torch.tensor(values).to(CFG.device)
35 |         for key, values in encoded_query.items()
36 |     }
37 |     with torch.no_grad():
38 |         text_features = model.text_encoder(
39 |             input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
40 |         )
41 |         text_embeddings = model.text_projection(text_features)
42 |     
43 |     image_embeddings_n = F.normalize(image_embeddings, p=2, dim=-1)
44 |     text_embeddings_n = F.normalize(text_embeddings, p=2, dim=-1)
45 |     dot_similarity = text_embeddings_n @ image_embeddings_n.T
46 |     _, indices = torch.topk(dot_similarity.squeeze(0),1)
47 |     matches = image_filenames[indices[::1]]
48 |     if matches == args.f:
49 |         print("TRUE")
50 |         
51 |     else:
52 |         print("False")    
53 |     _, axes = plt.subplots(1, 1, figsize=(10, 10))
54 |     image = cv2.imread(f"{CFG.image_path}/{matches}")
55 |     image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
56 |     axes.imshow(image)
57 |     axes.axis("off")        
58 |     plt.show()
59 |     if matches == args.f:
60 |         print("TRUE")
61 |         
62 |     else:
63 |         print("False")
64 | if __name__ == "__main__":
65 |     p = argparse.ArgumentParser(description='Image Detection Source Intelligence Automation.')
66 |     p.add_argument("-f", metavar="FILE", type=str, help="Image File URL")
67 |     p.add_argument("-q", metavar="QUERY", type=str, help="Query of image file")
68 |     valid_df = pd.read_csv('data/processed/train/captions.csv')
69 |     valid_df = valid_df[:8]
70 |     args = p.parse_args()
71 |     
72 |     if args.f and args.q:
73 |         model, image_embeddings = get_image_embeddings(valid_df,  "models/best.pt")
74 |         find_matches(model, image_embeddings, args.q, image_filenames=valid_df['image'].values, n=1)
75 |     else:
76 |         print(f"Input as this type:")
77 |         print(f"py app.py -f 'Image URL' -q 'text query'")


--------------------------------------------------------------------------------
/src/train.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import gc
  3 | import numpy as np
  4 | import pandas as pd
  5 | from tqdm import tqdm
  6 | 
  7 | import torch
  8 | from torch import nn
  9 | from transformers import DistilBertTokenizer
 10 | 
 11 | import src.config as CFG
 12 | from src.dataset import CLIPDataset, get_transforms
 13 | from src.CLIP import CLIPModel
 14 | from src.utils import AvgMeter, get_lr
 15 | 
 16 | 
 17 | def make_train_valid_dfs():
 18 |     dataframe = pd.read_csv(f"{CFG.captions_path}/captions.csv")
 19 |     max_id = dataframe["id"].max() + 1 if not CFG.debug else 100
 20 |     image_ids = np.arange(0, max_id)
 21 |     np.random.seed(42)
 22 |     valid_ids = np.random.choice(
 23 |         image_ids, size=int(0.2 * len(image_ids)), replace=False
 24 |     )
 25 |     train_ids = [id_ for id_ in image_ids if id_ not in valid_ids]
 26 |     train_dataframe = dataframe[dataframe["id"].isin(train_ids)].reset_index(drop=True)
 27 |     valid_dataframe = dataframe[dataframe["id"].isin(valid_ids)].reset_index(drop=True)
 28 |     return train_dataframe, valid_dataframe
 29 | 
 30 | 
 31 | def build_loaders(dataframe, tokenizer, mode):
 32 |     transforms = get_transforms(mode=mode)
 33 |     dataset = CLIPDataset(
 34 |         dataframe["image"].values,
 35 |         dataframe["caption"].values,
 36 |         tokenizer=tokenizer,
 37 |         transforms=transforms,
 38 |     )
 39 |     dataloader = torch.utils.data.DataLoader(
 40 |         dataset,
 41 |         batch_size=CFG.batch_size,
 42 |         num_workers=CFG.num_workers,
 43 |         shuffle=True if mode == "train" else False,
 44 |     )
 45 |     return dataloader
 46 | 
 47 | 
 48 | def train_epoch(model, train_loader, optimizer, lr_scheduler, step):
 49 |     loss_meter = AvgMeter()
 50 |     tqdm_object = tqdm(train_loader, total=len(train_loader))
 51 |     for batch in tqdm_object:
 52 |         batch = {k: v.to(CFG.device) for k, v in batch.items() if k != "caption"}
 53 |         loss = model(batch)
 54 |         optimizer.zero_grad()
 55 |         loss.backward()
 56 |         optimizer.step()
 57 |         if step == "batch":
 58 |             lr_scheduler.step()
 59 | 
 60 |         count = batch["image"].size(0)
 61 |         loss_meter.update(loss.item(), count)
 62 | 
 63 |         tqdm_object.set_postfix(train_loss=loss_meter.avg, lr=get_lr(optimizer))
 64 |     return loss_meter
 65 | 
 66 | 
 67 | def valid_epoch(model, valid_loader):
 68 |     loss_meter = AvgMeter()
 69 | 
 70 |     tqdm_object = tqdm(valid_loader, total=len(valid_loader))
 71 |     for batch in tqdm_object:
 72 |         batch = {k: v.to(CFG.device) for k, v in batch.items() if k != "caption"}
 73 |         loss = model(batch)
 74 | 
 75 |         count = batch["image"].size(0)
 76 |         loss_meter.update(loss.item(), count)
 77 | 
 78 |         tqdm_object.set_postfix(valid_loss=loss_meter.avg)
 79 |     return loss_meter
 80 | 
 81 | 
 82 | def main():
 83 |     train_df, valid_df = make_train_valid_dfs()
 84 |     tokenizer = DistilBertTokenizer.from_pretrained(CFG.text_tokenizer)
 85 |     train_loader = build_loaders(train_df, tokenizer, mode="train")
 86 |     valid_loader = build_loaders(valid_df, tokenizer, mode="valid")
 87 | 
 88 | 
 89 |     model = CLIPModel().to(CFG.device)
 90 |     optimizer = torch.optim.AdamW(
 91 |         model.parameters(), lr=CFG.lr, weight_decay=CFG.weight_decay
 92 |     )
 93 |     lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
 94 |         optimizer, mode="min", patience=CFG.patience, factor=CFG.factor
 95 |     )
 96 |     step = "epoch"
 97 | 
 98 |     best_loss = float('inf')
 99 |     for epoch in range(CFG.epochs):
100 |         print(f"Epoch: {epoch + 1}")
101 |         model.train()
102 |         train_loss = train_epoch(model, train_loader, optimizer, lr_scheduler, step)
103 |         model.eval()
104 |         with torch.no_grad():
105 |             valid_loss = valid_epoch(model, valid_loader)
106 |         
107 |         if valid_loss.avg < best_loss:
108 |             best_loss = valid_loss.avg
109 |             torch.save(model.state_dict(), "models/best.pt")
110 |             print(f"{train_loss}Saved Best Model!")
111 | 
112 | 
113 | if __name__ == "__main__":
114 |     main()
115 |     


--------------------------------------------------------------------------------
/data/processed/train/captions.csv:
--------------------------------------------------------------------------------
  1 | image,caption,id
  2 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,1
  3 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,2
  4 | coffee machine.jpg,coffee machine,3
  5 | Damro tea.jpg,Damro tea,4
  6 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,5
  7 | television maker.jpg,television maker,6
  8 | Vacuum Cleaner.JPG,Vacuum Cleaner,7
  9 | Washing Machine.jpg,Washing Machine,8
 10 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,9
 11 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,10
 12 | coffee machine.jpg,coffee machine,11
 13 | Damro tea.jpg,Damro tea,12
 14 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,13
 15 | television maker.jpg,television maker,14
 16 | Vacuum Cleaner.JPG,Vacuum Cleaner,15
 17 | Washing Machine.jpg,Washing Machine,16
 18 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,17
 19 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,18
 20 | coffee machine.jpg,coffee machine,19
 21 | Damro tea.jpg,Damro tea,20
 22 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,21
 23 | television maker.jpg,television maker,22
 24 | Vacuum Cleaner.JPG,Vacuum Cleaner,23
 25 | Washing Machine.jpg,Washing Machine,24
 26 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,25
 27 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,26
 28 | coffee machine.jpg,coffee machine,27
 29 | Damro tea.jpg,Damro tea,28
 30 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,29
 31 | television maker.jpg,television maker,30
 32 | Vacuum Cleaner.JPG,Vacuum Cleaner,31
 33 | Washing Machine.jpg,Washing Machine,32
 34 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,33
 35 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,34
 36 | coffee machine.jpg,coffee machine,35
 37 | Damro tea.jpg,Damro tea,36
 38 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,37
 39 | television maker.jpg,television maker,38
 40 | Vacuum Cleaner.JPG,Vacuum Cleaner,39
 41 | Washing Machine.jpg,Washing Machine,40
 42 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,41
 43 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,42
 44 | coffee machine.jpg,coffee machine,43
 45 | Damro tea.jpg,Damro tea,44
 46 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,45
 47 | television maker.jpg,television maker,46
 48 | Vacuum Cleaner.JPG,Vacuum Cleaner,47
 49 | Washing Machine.jpg,Washing Machine,48
 50 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,49
 51 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,50
 52 | coffee machine.jpg,coffee machine,51
 53 | Damro tea.jpg,Damro tea,52
 54 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,53
 55 | television maker.jpg,television maker,54
 56 | Vacuum Cleaner.JPG,Vacuum Cleaner,55
 57 | Washing Machine.jpg,Washing Machine,56
 58 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,57
 59 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,58
 60 | coffee machine.jpg,coffee machine,59
 61 | Damro tea.jpg,Damro tea,60
 62 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,61
 63 | television maker.jpg,television maker,62
 64 | Vacuum Cleaner.JPG,Vacuum Cleaner,63
 65 | Washing Machine.jpg,Washing Machine,64
 66 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,65
 67 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,66
 68 | coffee machine.jpg,coffee machine,67
 69 | Damro tea.jpg,Damro tea,68
 70 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,69
 71 | television maker.jpg,television maker,70
 72 | Vacuum Cleaner.JPG,Vacuum Cleaner,71
 73 | Washing Machine.jpg,Washing Machine,72
 74 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,73
 75 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,74
 76 | coffee machine.jpg,coffee machine,75
 77 | Damro tea.jpg,Damro tea,76
 78 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,77
 79 | television maker.jpg,television maker,78
 80 | Vacuum Cleaner.JPG,Vacuum Cleaner,79
 81 | Washing Machine.jpg,Washing Machine,80
 82 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,81
 83 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,82
 84 | coffee machine.jpg,coffee machine,83
 85 | Damro tea.jpg,Damro tea,84
 86 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,85
 87 | television maker.jpg,television maker,86
 88 | Vacuum Cleaner.JPG,Vacuum Cleaner,87
 89 | Washing Machine.jpg,Washing Machine,88
 90 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,89
 91 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,90
 92 | coffee machine.jpg,coffee machine,91
 93 | Damro tea.jpg,Damro tea,92
 94 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,93
 95 | television maker.jpg,television maker,94
 96 | Vacuum Cleaner.JPG,Vacuum Cleaner,95
 97 | Washing Machine.jpg,Washing Machine,96
 98 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,97
 99 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,98
100 | coffee machine.jpg,coffee machine,99
101 | Damro tea.jpg,Damro tea,100
102 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,101
103 | television maker.jpg,television maker,102
104 | Vacuum Cleaner.JPG,Vacuum Cleaner,103
105 | Washing Machine.jpg,Washing Machine,104
106 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,105
107 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,106
108 | coffee machine.jpg,coffee machine,107
109 | Damro tea.jpg,Damro tea,108
110 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,109
111 | television maker.jpg,television maker,110
112 | Vacuum Cleaner.JPG,Vacuum Cleaner,111
113 | Washing Machine.jpg,Washing Machine,112
114 | An unmodified iRobot Create with Command Module.jpg,An unmodified iRobot Create with Command Module,113
115 | Braava 380t damp cleaning.jpg,Braava 380t damp cleaning,114
116 | coffee machine.jpg,coffee machine,115
117 | Damro tea.jpg,Damro tea,116
118 | Roomba_805_charging dock.jpg,Roomba_805_charging dock,117
119 | television maker.jpg,television maker,118
120 | Vacuum Cleaner.JPG,Vacuum Cleaner,119
121 | Washing Machine.jpg,Washing Machine,120
122 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Brand Detection with CLIP
  2 |  This is CLIP by using TensorFlow.
  3 | 
  4 | ## Introduction
  5 | In this article I am going to implement CLIP model from scratch in **PyTorch**.
  6 | This project aims to detect if a product in an image corresponds to a given brand name using OpenAI's CLIP model.
  7 | 
  8 | ## Folder Structure
  9 | - `data/processed`: Contains processed data for train, validation, test.
 10 | - `model/`: Save the train model.
 11 | - `src/`: Source code for data config, CLIP, utils, train dataset and modules.
 12 | - `requirements.txt`: Project dependencies.
 13 | - `app.py`: Main script to run the project.
 14 | - `make_csv.py`: Make captions.csv file from Images for training.
 15 | 
 16 | ## Setup
 17 | 
 18 | 1. Install dependencies:
 19 |    ```bash
 20 |    pip install -r requirements.txt
 21 | 
 22 | ## Config
 23 | 
 24 | _A note on config and CFG: I wrote the codes with python scripts and then converted it into a Jupyter Notebook. So, in case of python scripts, config is a normal python file where I put all the hyperparameters and in the case of Jupyter Notebook, its a class defined in the beginning of the notebook to keep all the hyperparameters._
 25 | ## Utils
 26 | ## Dataset
 27 | 
 28 | As you can see in the tittle image of this article, we need to encode both images and their describing texts. So, the dataset needs to **return both images and texts**. Of course we are not going to feed raw text to our text encoder! We will use **DistilBERT** model (which is smaller than BERT but performs nearly as well as BERT) from **HuggingFace** library as our text encoder; so, we need to **tokenize** the sentences (captions) with DistilBERT tokenizer and then feed the token ids (input_ids) and the attention masks to DistilBERT. Therefore, the dataset needs to take care of the tokenization as well. Below you can see the dataset's code. Below that I'll explain the most important things that is happening in the code.
 29 | 
 30 | In the **\_\_init\_\_** we receive a tokenizer object which is actually a HuggingFace tokinzer; this tokenizer will be loaded when running the model. We are padding and truncating the captions to a specified max_length. In the **\_\_getitem\_\_** we will first load an encoded caption which is a dictionary with keys input_ids and attention_mask, make tensors out of its values and after that we will load the corresponding image, transform and augment it (if there is any!) and then we make it a tensor and put it in the dictionary with "image" as the key. Finally we put the raw text of the caption with the key "caption" in the dictionary only for visualization purposes.
 31 | 
 32 | I did not use additional data augmentations but you can add them if you want to improve the model's performance.
 33 | 
 34 | ## Image Encoder
 35 | 
 36 | The image encoder code is straight forward. I'm using PyTorch Image Models library (timm) here which makes a lot of different image models available from ResNets to EfficientNets and many more. Here we will use a ResNet50 as our image encoder. You can easily use torchvision library to use ResNets if you don't want to install a new library.
 37 | 
 38 | The code encodes each image to a fixed size vector with the size of the model's output channels (in case of ResNet50 the vector size will be **2048**). This is the output after the nn.AdaptiveAvgPool2d() layer.
 39 | 
 40 | ## Text Encoder
 41 | 
 42 | As I mentioned before, I'll use DistilBERT as the text encoder. Like its bigger brother BERT, two special tokens will be added to the actual input tokens: **CLS** and **SEP** which mark the start and end of a sentence. To grab the whole representation of a sentence (as the related BERT and DistilBERT papers point out) we use the final representations of the CLS token and we hope that this representation captures the overall meaning of the sentence (caption). Thinking it in this way, it is similar to what we did to images and converted them into a fixed size vector.
 43 | 
 44 | In the case of DistilBERT (and also BERT) the output hidden representation for each token is a vector with size **768**. So, the whole caption will be encoded in the CLS token representation whose size is 768.
 45 | 
 46 | ## Projection Head
 47 | 
 48 | I write OpenAI CLIP model the following in PyTorch.
 49 | Now that I have encoded both our images and texts into fixed size vectors (2048 for image and 768 for text) I need to bring (project) them into a **new world** (!) with **similar dimensions** for both images and texts in order to be able to compare them and push apart the non-relevant image and texts and pull together those that match. So, the following code will bring the 2048 and 768 dimensional vectors into a 256 (projection_dim) dimensional world, where we can **compare** them.
 50 | 
 51 | "embedding_dim" is the size of the input vector (2048 for images and 768 for texts) and "projection_dim" is the the size of the output vector which will be 256 for our case. For understanding the details of this part you can refer to the CLIP paper.
 52 | 
 53 | ## CLIP
 54 | 
 55 | This part is where all the fun happens! I'll also talk about the loss function here. I translated some of the code from Keras code examples into PyTorch for writing this part. Take a look at the code and then read the explanation below this code block.
 56 | 
 57 | Here we will use the previous modules that we built to implement the main model. The \_\_init\_\_ function is self-explanatory. In the forward function, we first encode the images and texts separately into fixed size vectors (with different dimensionalities). After that, using separate projection modules we project them to that shared world (space) that I talked about previously. Here the encodings will become of similar shape (256 in our case). After that we will compute the loss. Again I recommend reading CLIP paper to get it better but I'll try my best to explain this part.
 58 | 
 59 | In **Linear Algebra**, one common way to measure if two vectors are of similar characteristics (they are like each other) is to calculate their **dot product** (multiplying the matching entries and take the sum of them); if the final number is big, they are alike and if it is small they are not (relatively speaking)!
 60 | 
 61 | Okay! What I just said is the most important thing to have in mind to understand this loss function. Let's continue. We talked about two vectors, but, what do we have here? We have image_embeddings, a matrix with shape (batch_size, 256) and text_embeddings with shape (batch_size, 256). Easy enough! it means we have two groups of vectors instead of two single vectors. How do we measure how similar two groups of vectors (two matrices) are to each other? Again, with dot product (@ operator in PyTorch does the dot product or matrix multiplication in this case). To be able to multiply these two matrices together, we transpose the second one. Okay, we get a matrix with shape (batch_size, batch_size) which we will call logits. (temperature is equal to 1.0 in our case, so, it does not make a difference. You can play with it and see what difference it makes. Also look at the paper to see why it is here!).
 62 | 
 63 | I hope you are still with me! If not it's okay, just review the code and check their shapes. Now that we have our logits, we need targets. I need to say that there is a more straight forward way to obtain targets but I had to do this for our case (I'll talk about why in a next paragraph).
 64 | 
 65 | Let's consider what we hope that this model learns: **we want it to learn "similar representations (vectors)" for a given image and the caption describing it. Meaning that either we give it an image or the text describing it, we want it to produce same 256 sized vectors for both.**
 66 | 
 67 | #### Check the cell below this code block for the continue of the explanations
 68 | 
 69 | ```python
 70 | class CLIPModel(nn.Module):
 71 |     def __init__(
 72 |         self,
 73 |         temperature=CFG.temperature,
 74 |         image_embedding=CFG.image_embedding,
 75 |         text_embedding=CFG.text_embedding,
 76 |     ):
 77 |         super().__init__()
 78 |         self.image_encoder = ImageEncoder()
 79 |         self.text_encoder = TextEncoder()
 80 |         self.image_projection = ProjectionHead(embedding_dim=image_embedding)
 81 |         self.text_projection = ProjectionHead(embedding_dim=text_embedding)
 82 |         self.temperature = temperature
 83 | 
 84 |     def forward(self, batch):
 85 |         # Getting Image and Text Features
 86 |         image_features = self.image_encoder(batch["image"])
 87 |         text_features = self.text_encoder(
 88 |             input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
 89 |         )
 90 |         # Getting Image and Text Embeddings (with same dimension)
 91 |         image_embeddings = self.image_projection(image_features)
 92 |         text_embeddings = self.text_projection(text_features)
 93 | 
 94 |         # Calculating the Loss
 95 |         logits = (text_embeddings @ image_embeddings.T) / self.temperature
 96 |         images_similarity = image_embeddings @ image_embeddings.T
 97 |         texts_similarity = text_embeddings @ text_embeddings.T
 98 |         targets = F.softmax(
 99 |             (images_similarity + texts_similarity) / 2 * self.temperature, dim=-1
100 |         )
101 |         texts_loss = cross_entropy(logits, targets, reduction='none')
102 |         images_loss = cross_entropy(logits.T, targets.T, reduction='none')
103 |         loss =  (images_loss + texts_loss) / 2.0 # shape: (batch_size)
104 |         return loss.mean()
105 | 
106 | 
107 | def cross_entropy(preds, targets, reduction='none'):
108 |     log_softmax = nn.LogSoftmax(dim=-1)
109 |     loss = (-targets * log_softmax(preds)).sum(1)
110 |     if reduction == "none":
111 |         return loss
112 |     elif reduction == "mean":
113 |         return loss.mean()
114 | ```
115 | 
116 | So, in the best case scenario, text_embeddings and image_embedding matricies should be the same because they are describing similar things. Let's think now: if this happens, what would the logits matrix be like? Let's see with a simple example!
117 | 
118 | ```python
119 | # A simple Example
120 | 
121 | batch_size = 4
122 | dim = 256
123 | embeddings = torch.randn(batch_size, dim)
124 | out = embeddings @ embeddings.T
125 | print(F.softmax(out, dim=-1))
126 | ```
127 | 
128 | So logits, in the best case, will be a matrix that if we take its softmax, will have 1.0s in the diagonal (An identity matrix to call it with fancy words!). As the loss function's job is to make model's predictions similar to targets (at least in most cases!), we want such a matrix as our target. That's the reason why we are calculating images_similarity and texts_similarity matrices in the code block above.
129 | 
130 | Now that we've got our targets matrix, we will use simple cross entropy to calculate the actual loss. I've written the full matrix form of cross entropy as a function which you can see in the bottom of the code block. Okay! We are done! Wasn't it simple?! Alright, you can ignore the next paragraph but if you are curious, there is an important note in that.
131 | 
132 | **Here's why I didn't use a simpler approach**: I need to admit that there's a simpler way to calculate this loss in PyTorch; by doing this: nn.CrossEntropyLoss()(logits, torch.arange(batch_size)). Why I did not use it here? For 2 reasons. 1- The dataset we are using has multiple captions for a single image; so, there is the possibility that two identical images with their similar captions exist in a batch (it is rare but it can happen). Taking the loss with this easier method will ignore this possibility and the model learns to pull apart two representations (assume them different) that are actually the same. Obviously, we don't want this to happen so I calculated the whole target matrix in a way that takes care of these edge cases. 2- Doing it the way I did, gave me a better understanding of what is happening in this loss function; so, I thought it would give you a better intuition as well!
133 | 
134 | ## Train
135 | 
136 | Here are some funtions to help us load train and valid dataloaders, our model and then train and evaluate our model on those. There's not much going on here; just simple training loop and utility functions (see train.py- make_train_valid_dfs and build_loaders function)
137 | 
138 | Here's a handy function to train our model. There's not much happening here; just loading the batches, feeding them to the model and stepping the optimizer and lr_scheduler.
139 | (see train.py- train_epoch function)
140 | 
141 | ## Main app (app.py)
142 | 
143 | Okay! We are done with training the model. Now, we need to do inference which in our case will be giving the model a piece of text and want it to retrieve the most relevant images from an unseen validation (or test) set.
144 | 
145 | ### Getting Image Embeddings
146 | 
147 | In this function, we are loading the model that we saved after training, feeding it images in validation set and returning the image_embeddings with shape (valid_set_size, 256) and the model itself.
148 | 
149 | ### Finding Matches
150 | 
151 | This function does the final task that we wished our model would be capable of: it gets the model, image_embeddings, and a text query. It will display the most relevant images from the validation set! Isn't it amazing? Let's see how it performs after all!
152 | 
153 | ### How to install and run app.
154 | 
155 | 1. set virtual environment and setup module
156 |  -py -m venv myenv
157 |  -./myenv/Scripts/activate
158 |  -pip install -r requirements.txt
159 | 
160 | 2. Save image from data/processed/train folder
161 |  - The image file name is equal brand name.
162 | 
163 | 3. Make csv file for train
164 |  - py make_csv.py
165 | 
166 | 4. Train model
167 |  - py src/train.py
168 | 
169 | 5. Run app
170 |  -py app.py -f "Image File name" -q "Brand name"
171 | 
172 | And Then you see the result.
173 | Thank you!!!!!!


--------------------------------------------------------------------------------