├── .gitignore
├── assets
│   ├── image.jpg
│   └── fold_results.png
├── requirements.txt
├── src
│   ├── config.py
│   ├── models.py
│   ├── data.py
│   ├── augmentations.py
│   └── trainer.py
├── LICENSE
├── README.md
└── train.py

/.gitignore:
--------------------------------------------------------------------------------
.vscode
--------------------------------------------------------------------------------
/assets/image.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tanaymeh/alien-signal-detection/HEAD/assets/image.jpg
--------------------------------------------------------------------------------
/assets/fold_results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tanaymeh/alien-signal-detection/HEAD/assets/fold_results.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
numpy
pandas
matplotlib
seaborn
tqdm
opencv-contrib-python
timm
torch
torchvision
albumentations
scikit-learn
--------------------------------------------------------------------------------
/src/config.py:
--------------------------------------------------------------------------------
class Config:
    N_SPLITS = 3
    model_name = 'vit_base_patch16_224'
    resize = (224, 224)
    TRAIN_BS = 32
    VALID_BS = 16
    num_workers = 8
    NB_EPOCHS = 10
    LABELS = 1
    FILE = "/input/train_labels.csv"
    FOLDER = "/input/train"
--------------------------------------------------------------------------------
/src/models.py:
--------------------------------------------------------------------------------
import timm
import torch.nn as nn

from .config import Config

class VITModel(nn.Module):
    """
    Model class for the ViT model
    """
    def __init__(self, model_name=Config.model_name, pretrained=True):
        super(VITModel, self).__init__()
        # in_chans=1: the spectrogram inputs are single-channel
        self.backbone = timm.create_model(model_name, pretrained=pretrained, in_chans=1)
        # Swap the classifier head for a single-logit output
        self.backbone.head = nn.Linear(self.backbone.head.in_features, Config.LABELS)

    def forward(self, x):
        x = self.backbone(x)
        return x

class MLPMixer(nn.Module):
    """
    Model class for the MLP-Mixer model.
    Note: pass an MLP-Mixer architecture name (e.g. 'mixer_b16_224');
    the default Config.model_name is a ViT.
    """
    def __init__(self, model_name=Config.model_name, pretrained=True):
        super(MLPMixer, self).__init__()
        self.backbone = timm.create_model(model_name, pretrained=pretrained, in_chans=1)
        self.backbone.head = nn.Linear(self.backbone.head.in_features, Config.LABELS)

    def forward(self, x):
        x = self.backbone(x)
        return x
--------------------------------------------------------------------------------
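A quick smoke test for the wrappers above — an illustration, not repository code. It assumes the package is importable as src and uses random data in place of real spectrograms:

import torch
from src.models import VITModel

model = VITModel(pretrained=False)    # skip the weight download for a pure shape check
dummy = torch.randn(2, 1, 224, 224)   # batch of two single-channel 224x224 inputs
logits = model(dummy)                 # shape (2, 1): raw logits, no sigmoid applied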
/src/data.py:
--------------------------------------------------------------------------------
import numpy as np
from torch.utils.data import Dataset


class SETIData(Dataset):
    def __init__(self, images, targets, is_test=False, augmentations=None):
        self.images = images
        self.targets = targets
        self.is_test = is_test
        self.augmentations = augmentations

    def __getitem__(self, index):
        img, target = self.images[index], self.targets[index]

        # Each sample is a .npy cadence snippet: stack its frames vertically into
        # one 2D array, transpose it, and add a trailing channel axis (HWC)
        # so albumentations can consume it
        img = np.load(img)
        img = np.vstack(img)
        img = img.transpose(1, 0)
        img = img.astype("float")[..., np.newaxis]

        if self.augmentations:
            img = self.augmentations(image=img)['image']

        if self.is_test:
            return img

        return img, target

    def __len__(self):
        return len(self.images)
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 Tanay Mehta

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/src/augmentations.py:
--------------------------------------------------------------------------------
from albumentations import (
    HorizontalFlip, VerticalFlip, ShiftScaleRotate, RandomResizedCrop,
    Compose, Resize
)

from albumentations.pytorch import ToTensorV2

from .config import Config

class Augments:
    """
    Contains the train and validation augmentations
    """
    train_augments = Compose([
        Resize(*Config.resize, p=1.0),
        HorizontalFlip(p=0.5),
        VerticalFlip(p=0.5),
        ShiftScaleRotate(p=0.5, shift_limit=0.2, scale_limit=0.2, rotate_limit=20,
                         border_mode=0, value=0, mask_value=0),
        RandomResizedCrop(*Config.resize, p=1.0),
        ToTensorV2(p=1.0),
    ], p=1.)

    valid_augments = Compose([
        Resize(*Config.resize, p=1.0),
        ToTensorV2(p=1.0),
    ], p=1.)
--------------------------------------------------------------------------------
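For reference, the shape a sample takes after this dataset + augmentation pipeline — a minimal sketch, not repository code; paths and labels are hypothetical stand-ins for the .npy file paths and their matching 0/1 targets:

from src.data import SETIData
from src.augmentations import Augments

ds = SETIData(paths, labels, augmentations=Augments.valid_augments)
img, target = ds[0]
print(img.shape)  # torch.Size([1, 224, 224]): Resize to Config.resize, then ToTensorV2 (HWC -> CHW)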
/src/trainer.py:
--------------------------------------------------------------------------------
from tqdm.notebook import tqdm

import torch

from sklearn.metrics import roc_auc_score

def train_one_epoch(model, device, optimizer, dataloader, loss_fn, scheduler=None):
    """Trains the given model for one epoch on the given data

    Args:
        model: Main model
        device: Device on which the model will be trained
        optimizer: Optimizer used during training
        dataloader: Training dataloader
        loss_fn: Training loss function (the quantity being optimized)
        scheduler (optional): Learning-rate scheduler. Defaults to None.
    """
    prog_bar = tqdm(enumerate(dataloader), total=len(dataloader))
    model.train()
    running_loss = 0
    for idx, (img, target) in prog_bar:
        img = img.to(device, torch.float)
        target = target.to(device, torch.float)

        output = model(img).view(-1)
        loss = loss_fn(output, target)

        # .item() synchronizes with the GPU and copies the scalar to the CPU,
        # which is relatively expensive, so do it only once per step
        loss_item = loss.item()
        prog_bar.set_description('loss: {:.2f}'.format(loss_item))

        loss.backward()
        optimizer.step()

        if scheduler:
            scheduler.step()

        optimizer.zero_grad(set_to_none=True)

        running_loss += loss_item

    return running_loss / len(dataloader)

@torch.no_grad()
def valid_one_epoch(model, device, dataloader, loss_fn):
    """Validates the model on all batches of the validation set

    Args:
        model: Main model
        device: Device on which the model will be validated
        dataloader: Validation dataloader
        loss_fn: Validation loss function (NOT optimized)
    """
    prog_bar = tqdm(enumerate(dataloader), total=len(dataloader))
    all_targets, all_predictions = [], []
    running_loss = 0
    model.eval()
    for idx, (img, target) in prog_bar:
        img = img.to(device, torch.float)
        target = target.to(device, torch.float)

        output = model(img).view(-1)

        loss = loss_fn(output, target)
        loss_item = loss.item()

        prog_bar.set_description('val_loss: {:.2f}'.format(loss_item))

        all_targets.extend(target.cpu().detach().numpy().tolist())
        # The model emits raw logits, so apply a sigmoid before scoring
        all_predictions.extend(torch.sigmoid(output).cpu().detach().numpy().tolist())

        running_loss += loss_item

    val_roc_auc = roc_auc_score(all_targets, all_predictions)
    return val_roc_auc, running_loss / len(dataloader)
--------------------------------------------------------------------------------
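The tree lists a top-level train.py, but its contents are not included in this dump. Purely to illustrate how the pieces above compose — a hedged sketch, not the author's script: the 'id'/'target' CSV columns, the '{id[0]}/{id}.npy' file layout, and the Adam / BCEWithLogitsLoss / lr=1e-4 choices are all assumptions — a fold loop could look like:

import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import DataLoader

from src.config import Config
from src.data import SETIData
from src.augmentations import Augments
from src.models import VITModel
from src.trainer import train_one_epoch, valid_one_epoch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
df = pd.read_csv(Config.FILE)  # assumed columns: 'id', 'target'
df["path"] = df["id"].apply(lambda i: f"{Config.FOLDER}/{i[0]}/{i}.npy")  # assumed layout

skf = StratifiedKFold(n_splits=Config.N_SPLITS, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(df, df["target"])):
    train_df, valid_df = df.iloc[train_idx], df.iloc[valid_idx]
    train_dl = DataLoader(
        SETIData(train_df["path"].values, train_df["target"].values,
                 augmentations=Augments.train_augments),
        batch_size=Config.TRAIN_BS, shuffle=True, num_workers=Config.num_workers)
    valid_dl = DataLoader(
        SETIData(valid_df["path"].values, valid_df["target"].values,
                 augmentations=Augments.valid_augments),
        batch_size=Config.VALID_BS, shuffle=False, num_workers=Config.num_workers)

    model = VITModel().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.BCEWithLogitsLoss()  # matches the raw-logit outputs of the models

    for epoch in range(Config.NB_EPOCHS):
        train_loss = train_one_epoch(model, device, optimizer, train_dl, loss_fn)
        val_auc, val_loss = valid_one_epoch(model, device, valid_dl, loss_fn)
        print(f"Fold {fold} epoch {epoch}: loss={train_loss:.4f} "
              f"val_loss={val_loss:.4f} val_auc={val_auc:.4f}")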
/README.md:
--------------------------------------------------------------------------------