├── .gitignore
├── LICENSE
├── README.md
├── assets
    ├── fold_results.png
    └── image.jpg
├── requirements.txt
├── src
    ├── augmentations.py
    ├── config.py
    ├── data.py
    ├── models.py
    └── trainer.py
└── train.py

/.gitignore:
--------------------------------------------------------------------------------
.vscode
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 Tanay Mehta

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Detecting Aliens using Deep Learning 👽

Picture for Representation

Python, NumPy, Pandas, PyTorch

## Introduction

This repository contains the code I wrote for identifying anomalous signals in scans of Breakthrough Listen targets for the [SETI Breakthrough Listen - E.T. Signal Search](https://www.kaggle.com/c/seti-breakthrough-listen) competition hosted on Kaggle.

The main task in this competition is to classify whether a signal comes from an "alien source" or not.

I have also made a [full training notebook](https://www.kaggle.com/heyytanay/pytorch-training-augments-vit-kfolds) for this competition.

## Data

Because there are no confirmed examples of alien signals to use to train machine learning algorithms, the team included some simulated signals (that they call “needles”) in the haystack of data from the telescope.
They have identified some of the hidden needles so that we can train our model to find more.

The data consist of two-dimensional arrays.

## Training the Model

If you want to train the model on this data as-is, you need to perform 3 steps:

### 1. Getting the Data right

First, download the data from [here](https://www.kaggle.com/c/seti-breakthrough-listen/data).

Now, take the downloaded `.zip` file and extract it into a new folder: `input/`.

Make sure the `input/` folder is at the same directory level as the `train.py` file.


### 2. Installing the dependencies

To run the code in this repository, you need several frameworks installed on your system.

Make sure you have enough disk space and Internet quota before you proceed.

```shell
$ pip install -r requirements.txt
```

### 3. Training the Model

If you have completed the steps above, running the `train.py` script should not produce any errors.

To run training, open the terminal and change your working directory to the same level as the `train.py` file.

Now, to start training, run:

```shell
$ python train.py
```

This should start training in a few seconds and you should see a progress bar.

If you run into any problems, please open an Issue and I will be happy to help!

## Training Results and Conclusion

Below you can see the per-fold model performance when trained on `vit_base_patch16_224` for 3 folds and 10 epochs in each fold.

![](assets/fold_results.png)

**I hope you find my work useful! If you do, then please Star ⭐ this repository!**
--------------------------------------------------------------------------------
/assets/fold_results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tanaymeh/alien-signal-detection/222ab9c6c631181f335abb1842a33508f4eb11a1/assets/fold_results.png
--------------------------------------------------------------------------------
/assets/image.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tanaymeh/alien-signal-detection/222ab9c6c631181f335abb1842a33508f4eb11a1/assets/image.jpg
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
numpy
pandas
matplotlib
seaborn
tqdm
opencv-contrib-python
timm
torch
torchvision
albumentations
scikit-learn
--------------------------------------------------------------------------------
/src/augmentations.py:
--------------------------------------------------------------------------------
# Import only the transforms that are actually used. The IAA* transforms from
# the original import list are not available in recent albumentations releases.
from albumentations import (
    Compose, HorizontalFlip, RandomResizedCrop, Resize, ShiftScaleRotate, VerticalFlip
)
from albumentations.pytorch import ToTensorV2

from .config import Config


class Augments:
    """
    Contains Train, Validation Augments
    """
    train_augments = Compose([
        Resize(*Config.resize, p=1.0),
        HorizontalFlip(p=0.5),
        VerticalFlip(p=0.5),
        ShiftScaleRotate(p=0.5, shift_limit=0.2, scale_limit=0.2, rotate_limit=20, border_mode=0, value=0, mask_value=0),
        RandomResizedCrop(*Config.resize, p=1.0),
        ToTensorV2(p=1.0),
    ], p=1.)

    valid_augments = Compose([
        Resize(*Config.resize, p=1.0),
        ToTensorV2(p=1.0),
    ], p=1.)
--------------------------------------------------------------------------------
/src/config.py:
--------------------------------------------------------------------------------
class Config:
    N_SPLITS = 3
    model_name = 'vit_base_patch16_224'
    resize = (224, 224)
    TRAIN_BS = 32
    VALID_BS = 16
    num_workers = 8
    NB_EPOCHS = 10
    LABELS = 1
    # Relative to the directory that contains train.py (see the README)
    FILE = "input/train_labels.csv"
    FOLDER = "input/train"
--------------------------------------------------------------------------------
/src/data.py:
--------------------------------------------------------------------------------
import numpy as np
from torch.utils.data import Dataset


class SETIData(Dataset):
    def __init__(self, images, targets, is_test=False, augmentations=None):
        self.images = images
        self.targets = targets
        self.is_test = is_test
        self.augmentations = augmentations

    def __getitem__(self, index):
        img = np.load(self.images[index])
        # Stack the (N, H, W) array into one 2D array of shape (N*H, W),
        # swap the axes, then add a trailing channel axis
        img = np.vstack(img)
        img = img.transpose(1, 0)
        img = img.astype("float")[..., np.newaxis]

        if self.augmentations:
            img = self.augmentations(image=img)['image']

        if self.is_test:
            return img

        # Targets are only looked up for labelled (non-test) data
        return img, self.targets[index]

    def __len__(self):
        return len(self.images)
--------------------------------------------------------------------------------
/src/models.py:
--------------------------------------------------------------------------------
import torch
import timm
import torch.nn as nn
import torch.nn.functional as F

from .config import Config


class VITModel(nn.Module):
    """
    Model Class for VIT Model
    """
    def __init__(self, model_name=Config.model_name, pretrained=True):
        super(VITModel, self).__init__()
        # Single-channel input; the head is replaced to emit Config.LABELS logits
        self.backbone = timm.create_model(model_name, pretrained, in_chans=1)
        self.backbone.head = nn.Linear(self.backbone.head.in_features, Config.LABELS)

    def forward(self, x):
        x = self.backbone(x)
        return x


class MLPMixer(nn.Module):
    """
    Model Class for MLP Mixer Model
    """
    def __init__(self, model_name=Config.model_name, pretrained=True):
        super(MLPMixer, self).__init__()
        self.backbone = timm.create_model(model_name, pretrained, in_chans=1)
        self.backbone.head = nn.Linear(self.backbone.head.in_features, Config.LABELS)

    def forward(self, x):
        x = self.backbone(x)
        return x
--------------------------------------------------------------------------------
/src/trainer.py:
--------------------------------------------------------------------------------
import torch
from sklearn.metrics import roc_auc_score
# tqdm.auto picks a suitable progress bar for both terminals and notebooks
from tqdm.auto import tqdm


def train_one_epoch(model, device, optimizer, dataloader, loss_fn, scheduler=None):
    """Trains a given model for 1 epoch on the given data

    Args:
        model: Main model
        device: Device on which model will be trained
        optimizer: Optimizer that will optimize during training
        dataloader: Training Dataloader
        loss_fn: Training Loss function. Will be optimized
        scheduler (optional): Scheduler for the learning rate. Defaults to None.
    """
    prog_bar = tqdm(enumerate(dataloader), total=len(dataloader))
    model.train()
    running_loss = 0
    for idx, (img, target) in prog_bar:
        img = img.to(device, torch.float)
        target = target.to(device, torch.float)

        output = model(img).view(-1)
        loss = loss_fn(output, target)

        # loss.item() copies the value to the CPU and forces a GPU sync,
        # so call it only once per step
        loss_item = loss.item()
        prog_bar.set_description('loss: {:.2f}'.format(loss_item))

        loss.backward()
        optimizer.step()

        if scheduler:
            scheduler.step()

        optimizer.zero_grad(set_to_none=True)

        running_loss += loss_item

    return running_loss / len(dataloader)


@torch.no_grad()
def valid_one_epoch(model, device, dataloader, loss_fn):
    """Validates the model on the validation set through all batches

    Args:
        model: Main model
        device: Device on which model will be validated
        dataloader: Validation Dataloader
        loss_fn: Validation Loss function. Will NOT be optimized
    """
    prog_bar = tqdm(enumerate(dataloader), total=len(dataloader))
    all_targets, all_predictions = [], []
    running_loss = 0
    model.eval()
    for idx, (img, target) in prog_bar:
        img = img.to(device, torch.float)
        target = target.to(device, torch.float)

        output = model(img).view(-1)

        loss = loss_fn(output, target)
        loss_item = loss.item()

        prog_bar.set_description('val_loss: {:.2f}'.format(loss_item))

        # The model outputs logits; apply a sigmoid to get probabilities
        all_targets.extend(target.cpu().tolist())
        all_predictions.extend(torch.sigmoid(output).cpu().tolist())

        running_loss += loss_item

    val_roc_auc = roc_auc_score(all_targets, all_predictions)
    return val_roc_auc, running_loss / len(dataloader)
--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
import gc
import platform

import pandas as pd
from sklearn.model_selection import StratifiedKFold

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from src.config import Config
from src.augmentations import Augments
from src.data import SETIData
from src.models import VITModel
from src.trainer import train_one_epoch, valid_one_epoch


def yield_loss(outputs, targets):
    return nn.BCEWithLogitsLoss()(outputs, targets)


def prepare_data():
    """
    Takes the dataframe and prepares it for training
    """
    train_labels = pd.read_csv(Config.FILE)
    # Files are grouped into sub-folders named after the first character of the id
    train_labels['path'] = train_labels['id'].apply(lambda x: f'{Config.FOLDER}/{x[0]}/{x}.npy')

    return train_labels


def run(device, data):
    kfold = StratifiedKFold(n_splits=Config.N_SPLITS, shuffle=True, random_state=2021)
    fold_scores = {}
    for fold_, (trn_idx, val_idx) in enumerate(kfold.split(data, data['target'])):
        print(f"{'='*40} Fold: {fold_} {'='*40}")

        # split() yields positional indices, so index with iloc
        train_data = data.iloc[trn_idx]
        valid_data = data.iloc[val_idx]

        print(f"[INFO] Training on {trn_idx.shape[0]} samples and validating on {valid_data.shape[0]} samples")

        # Make Training and Validation Datasets
        training_set = SETIData(
            images=train_data['path'].values,
            targets=train_data['target'].values,
            augmentations=Augments.train_augments
        )

        validation_set = SETIData(
            images=valid_data['path'].values,
            targets=valid_data['target'].values,
            augmentations=Augments.valid_augments
        )

        train = DataLoader(
            training_set,
            batch_size=Config.TRAIN_BS,
            shuffle=True,
            num_workers=Config.num_workers,
            pin_memory=True
        )

        valid = DataLoader(
            validation_set,
            batch_size=Config.VALID_BS,
            shuffle=False,
            num_workers=Config.num_workers
        )

        model = VITModel().to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
        # Pass the function itself; it is called later with (outputs, targets)
        train_loss_fn = yield_loss
        valid_loss_fn = yield_loss
        print(f"[INFO] Training Model: {Config.model_name}")

        per_fold_score = []
        best_roc = 0

        for epoch in range(1, Config.NB_EPOCHS+1):
            print(f"\n{'--'*5} EPOCH: {epoch} {'--'*5}\n")

            # Train for 1 epoch
            train_loss = train_one_epoch(model, device, optimizer, train, train_loss_fn)

            # Validate for 1 epoch
            current_roc, avg_val_loss = valid_one_epoch(model, device, valid, valid_loss_fn)
            print(f"Validation ROC-AUC: {current_roc:.4f}")

            per_fold_score.append(current_roc)

            if current_roc > best_roc:
                # Remember the new best score before saving the checkpoint
                best_roc = current_roc
                torch.save(model.state_dict(), f"{Config.model_name}_fold_{fold_}.pt")
                print(f"Saved best model in this fold with ROC-AUC: {best_roc}")

        fold_scores[fold_] = per_fold_score

        del training_set, validation_set, train, valid, model, optimizer, current_roc, best_roc
        gc.collect()
        torch.cuda.empty_cache()

    return fold_scores


if __name__ == '__main__':
    if torch.cuda.is_available():
        print("[INFO] Using GPU: {}\n".format(torch.cuda.get_device_name()))
        DEVICE = torch.device('cuda:0')
    else:
        print("\n[INFO] GPU not found. Using CPU: {}\n".format(platform.processor()))
        DEVICE = torch.device('cpu')

    # Get the prepared data
    data = prepare_data()

    # Run the training
    run(DEVICE, data)
--------------------------------------------------------------------------------
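
The epoch loop in `train.py` is meant to save a checkpoint only when the validation ROC-AUC improves on the best value seen so far in the fold. The snippet below is a minimal sketch of that bookkeeping on its own, without PyTorch. The function name `checkpoint_epochs` and the example scores are hypothetical and exist only for illustration; they are not part of the repository and are not measured results.

```python
def checkpoint_epochs(scores):
    """Return the (epoch, score) pairs at which a new best score is reached.

    `scores` holds one validation ROC-AUC value per epoch. Epochs are
    numbered from 1, matching the loop in train.py.
    """
    best = 0.0
    saved = []
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            # Update the running best; overwriting `score` instead would
            # leave `best` at 0 and trigger a save on every epoch.
            best = score
            saved.append((epoch, score))
    return saved


if __name__ == "__main__":
    # Made-up scores for illustration only
    print(checkpoint_epochs([0.61, 0.58, 0.70, 0.69, 0.72]))
```

With these made-up scores, a checkpoint would be written after epochs 1, 3 and 5, and the file left on disk at the end of the fold corresponds to the best epoch.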