├── .gitignore ├── README.md ├── requirements.txt ├── source ├── dataset.py ├── download_samples.py ├── helper.py ├── label_studio.py ├── model.py ├── run_first_selection.py ├── run_second_selection.py ├── setup_data.py ├── train_model_1.py └── train_model_2.py └── tutorial └── images ├── init-selection.png ├── ls-add-storage.png ├── ls-interface.png ├── ls-sync-storage.png └── second-selection.png /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | 132 | #ignore venv 133 | venv 134 | 135 | #ignore pycharm IDE 136 | .idea 137 | 138 | #ignore lightly outputs 139 | lightly_outputs 140 | 141 | #ignore .DS_Store 142 | **.DS_Store 143 | 144 | #ignore eggs 145 | .eggs 146 | 147 | # Ignore saved model 148 | weather_classifier.pth 149 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Tutorial 2 | This tutorial demonstrates a complete workflow for training a machine learning model with the aid of Active Learning, using [Lightly](https://www.lightly.ai) and [Label Studio](https://labelstud.io). 3 | 4 | Assume we have a new unlabelled dataset and want to train a new model. We do not want to label all samples because not all of them are valuable. Lightly can help select a good subset of samples to kick off labeling and model training. The loop is as follows: 5 | 6 | 1. Lightly chooses a subset of the unlabelled samples. 7 | 1. This subset is labeled using Label Studio. 8 | 1. A machine learning model is trained on the labeled data and generates predictions for the entire dataset. 9 | 1. Lightly consumes the predictions and performs Active Learning to choose the next batch of samples to be labeled. 10 | 1. This new batch of samples is labeled in Label Studio. 11 | 1. The machine learning model is re-trained on the enriched labeled dataset to achieve better performance. 12 | 13 | 14 | Let's get started! 15 | 16 | ## 0. Installation and Requirements 17 | Make sure you have an account for the [Lightly Web App](https://app.lightly.ai). 18 | You also need to know your API token, which is shown under your `USERNAME` -> `Preferences`. 19 | 20 | Clone this repo and install all Python package requirements from the `requirements.txt` file, e.g. with pip: 21 | ```bash 22 | git clone https://github.com/lightly-ai/Lightly_LabelStudio_AL.git 23 | cd Lightly_LabelStudio_AL 24 | pip install -r requirements.txt 25 | ``` 26 | 27 | 28 | ## 1. Prepare data 29 | We want to train a classifier to predict the weather displayed in an image. We use this dataset: [Multi-class Weather Dataset for Image Classification](https://data.mendeley.com/datasets/4drtyfjtfy/1). Download the dataset (zip file) from [here](https://data.mendeley.com/public-files/datasets/4drtyfjtfy/files/a03e6097-f7fb-4e1a-9c6a-8923c6a0d3e0/file_downloaded) to this directory. 30 | 31 | After downloading and extracting the zip file, you will see the extracted directory as follows: 32 | 33 | ``` 34 | dataset2 35 | ├── cloudy1.jpg 36 | ├── cloudy2.jpg 37 | ├── cloudy3.jpg 38 | ├── cloudy4.jpg 39 | ... 40 | ``` 41 | 42 | Here we have images of 4 weather conditions: `cloudy`, `rain`, `shine`, and `sunrise`. 43 | 44 | #### 1.1 Split dataset 45 | To compare results between iterations, we first split the entire dataset into a full training set and a validation set.
The training set will be used to select samples, and the validation set will be used as "new data" to evaluate the model's performance. 46 | 47 | Run the script below to split the dataset: 48 | ```sh 49 | python source/setup_data.py 50 | ``` 51 | 52 | After this, you will find the following files and directories in the current directory: 53 | * `train_set`: Directory that contains all samples to be used for training the model. Here we pretend that these samples are all unlabelled. 54 | * `val_set`: Directory that contains all samples to be used for model validation. Samples are labeled. 55 | * `full_train.json`: JSON file that records paths to all files in `train_set`. 56 | * `val.json`: JSON file that records paths and labels of all files in `val_set`. 57 | 58 | These will be used in the following steps. 59 | 60 | #### 1.2 Upload training samples to cloud storage 61 | In this tutorial, samples are stored in the cloud, and Lightly Worker will read the samples from the cloud data source. For details, please refer to [Set Up Your First Dataset](https://docs.lightly.ai/docs/set-up-your-first-dataset). Here we use Amazon S3 as an example. 62 | 63 | Under your S3 bucket, create two directories: `data` and `lightly`. We will upload all training samples to `data`. For example, run the [AWS CLI tool](https://aws.amazon.com/cli/): 64 | ```sh 65 | aws s3 sync train_set s3:///data 66 | ``` 67 | 68 | After uploading the samples, your S3 bucket should look like 69 | ``` 70 | s3://bucket/ 71 | ├── lightly/ 72 | └── data/ 73 | ├── cloudy1.jpg 74 | ├── cloudy2.jpg 75 | ├── ... 76 | ``` 77 | 78 | ## 2. Select the first batch of samples for labeling 79 | 80 | Now, with all unlabelled data samples in your training dataset, we want to select a good subset, label them, and train our classification model with them. Lightly can do this selection for you in a simple way. The script [run_first_selection.py](./source/run_first_selection.py) does the job for you. You need to first set up Lightly Worker on your machine and put the correct configuration values in the script. Please refer to [Install Lightly](https://docs.lightly.ai/docs/install-lightly) and [Set Up Your First Dataset](https://docs.lightly.ai/docs/set-up-your-first-dataset) for more details. 81 | 82 | Run the script after your worker is ready: 83 | 84 | ```sh 85 | python source/run_first_selection.py 86 | ``` 87 | 88 | In this script, Lightly Worker first creates a dataset named `weather-classification`, selects 30 samples based on embeddings of the training samples, and records them in this dataset. These 30 samples are the ones that we are going to label in the first round. You can see the selected samples in the [Web App](https://app.lightly.ai/). 89 | 90 | ![First selection.](tutorial/images/init-selection.png) 91 | 92 | ## 3. Label the selected samples to train a classifier 93 | 94 | We do this using the open source labeling tool **Label Studio**, which is a browser-based tool hosted on your machine. 95 | You have already installed it and can run it from the command line. It will need access to your local files. We will first download the selected samples, import them in Label Studio, label them, and export the annotations. 96 | 97 | _Curious to get started with Label Studio? Check out [this tutorial](https://labelstud.io/blog/zero-to-one-getting-started-with-label-studio/) for help getting started!_ 98 | 99 | #### 3.1 Download the selected samples 100 | 101 | We can download the selected samples from the Lightly Platform. 
The [download_samples.py](./source/download_samples.py) script does everything for you and downloads the samples to a local directory called `samples_for_labelling`. 102 | 103 | ```sh 104 | python source/download_samples.py 105 | ``` 106 | 107 | Lightly Worker created a tag for the selected samples. The script pulls information about the samples in this tag and downloads them. 108 | 109 | #### 3.2 Run Label Studio 110 | 111 | Now we can launch Label Studio. 112 | 113 | ```sh 114 | export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true && label-studio start 115 | ``` 116 | 117 | It should open in your browser. Create an account and log in. 118 | 119 | #### 3.3 Configure Storage 120 | 121 | Create a new project called "weather-classification". 122 | Then, head to `Settings` -> `Cloud Storage` -> `Add Source Storage` -> `Storage Type`: `Local files`. 123 | Set the `Absolute local path` to the absolute path of the directory `samples_for_labelling`. 124 | Enable the option `Treat every bucket object as a source file`. 125 | Then click `Add Storage`. Label Studio will confirm that the storage has been added. 126 | Now click on `Sync Storage` to finally load the 30 images. 127 | 128 | ![Configuration of local file input.](tutorial/images/ls-add-storage.png) 129 | 130 | #### 3.4 Configure the labeling interface 131 | 132 | In `Settings` -> `Labeling Interface`, open the `Code` view and insert: 133 | ```xml 134 | <View> 135 |   <Image name="image" value="$image"/> 136 |   <Choices name="choice" toName="image"> 137 |     <Choice value="cloudy"/> 138 |     <Choice value="rain"/> 139 |     <Choice value="shine"/> 140 |     <Choice value="sunrise"/> 141 |   </Choices> 142 | </View> 143 | ``` 144 | ![Configuration of Labeling Interface.](tutorial/images/ls-interface.png) 145 | 146 | This tells Label Studio that there is an image classification task with 4 distinct choices. 147 | 148 | If you want someone else to help you label the images, navigate to `Settings` -> `Instructions` and add some instructions. 149 | 150 | #### 3.5 Labelling 151 | 152 | Now, if you open your project again, you will see 30 tasks and the corresponding images. 153 | Click on `Label All Tasks` and get those 30 images labeled. 154 | 155 | **Pro Tip!** Use the keys `1`, `2`, `3`, `4` on your keyboard as hotkeys to label faster! 156 | 157 | #### 3.6 Export labels 158 | 159 | Export the labels via `Export` in the format `JSON-MIN`. 160 | Rename the file to `annotation-0.json` and place it in the root directory of this repository. 161 | 162 | ## 4. Train a model and prepare for active learning 163 | 164 | We can now train a classification model with the 30 labeled samples. The [train_model_1.py](./source/train_model_1.py) script loads samples from `annotation-0.json` and performs this task. 165 | 166 | ```sh 167 | python source/train_model_1.py 168 | ``` 169 | 170 | The following steps are performed in this script: 171 | * Load the annotations and the labeled images. 172 | * Load the validation set. 173 | * Train a simple model as in [model.py](./source/model.py). 174 | * Make predictions for all training samples, including the unlabeled ones. 175 | * Dump the predictions in [Lightly Prediction format](https://docs.lightly.ai/docs/prediction-format#prediction-format) into the directory `lightly_predictions`. 176 | 177 | We can see that the model performance is not good yet: 178 | ``` 179 | Training Acc: 60.000 Validation Acc: 19.027 180 | ``` 181 | 182 | That is okay for now; we will improve it. The predictions will be used for active learning. 183 | 184 | #### Upload predictions to data source 185 | 186 | Lightly Worker also does active learning for you based on predictions, which it consumes from the data source. We therefore need to place the predictions we just generated in the data source. For detailed information, please refer to [Predictions Folder Structure](https://docs.lightly.ai/docs/prediction-format#predictions-folder-structure). Here we again use the AWS S3 bucket as an example.
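Each training image gets its own prediction file named after the image. For reference, a file such as `cloudy1.json` written by `dump_lightly_predictions` in [helper.py](./source/helper.py) has the following shape (the probability values below are purely illustrative):
```json
{
    "file_name": "cloudy1.jpg",
    "predictions": [
        {
            "category_id": 0,
            "probabilities": [0.7, 0.1, 0.1, 0.1]
        }
    ]
}
```
The `category_id` refers to the ids defined in `schema.json` below, and the probabilities are the model's softmax outputs, normalised to sum to 1.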
187 | 188 | In the `lightly` directory you created earlier in your S3 bucket, predictions are kept in a subdirectory `.lightly/predictions`. Besides the prediction files themselves, you need the following additional files, which you can create directly by copying the code blocks below. 189 | 190 | ##### tasks.json 191 | ```json 192 | ["weather-classification"] 193 | ``` 194 | 195 | We only have one task here; let's name it `weather-classification`. 196 | 197 | ##### schema.json 198 | ```json 199 | { 200 | "task_type": "classification", 201 | "categories": [ 202 | { 203 | "id": 0, 204 | "name": "cloudy" 205 | }, 206 | { 207 | "id": 1, 208 | "name": "rain" 209 | }, 210 | { 211 | "id": 2, 212 | "name": "shine" 213 | }, 214 | { 215 | "id": 3, 216 | "name": "sunrise" 217 | } 218 | ] 219 | } 220 | ``` 221 | 222 | Place `tasks.json` in `.lightly/predictions/`, and place `schema.json`, together with the prediction files from the local directory `lightly_predictions`, in `.lightly/predictions/weather-classification/`. 223 | After uploading these files, your S3 bucket should look like this: 224 | ``` 225 | s3://bucket/ 226 | ├── lightly/ 227 | │   └── .lightly/ 228 | │       └── predictions/ 229 | │           ├── tasks.json 230 | │           └── weather-classification/ 231 | │               ├── schema.json 232 | │               ├── cloudy1.json 233 | │               ├── cloudy2.json 234 | │               ├── ... 235 | └── data/ 236 |     ├── cloudy1.jpg 237 |     ├── cloudy2.jpg 238 |     ├── ... 239 | ``` 240 | 241 | where files like `cloudy1.json` and `cloudy2.json` are the prediction files from `lightly_predictions`. 242 | 243 | ## 5. Select and label new samples 244 | 245 | With the predictions, Lightly Worker can perform active learning and select new samples for us. The [run_second_selection.py](./source/run_second_selection.py) script does the job. 246 | 247 | ```sh 248 | python source/run_second_selection.py 249 | ``` 250 | 251 | This time, Lightly Worker goes through all training samples again and selects another 30 samples based on active learning scores computed from the predictions we uploaded in the previous step. For more details, please refer to [Selection Scores](https://docs.lightly.ai/docs/selection#scores) and [Active Learning Scorer](https://docs.lightly.ai/docs/active-learning-scorers). 252 | 253 | You can see the results in the Web App. 254 | 255 | ![Second selection.](./tutorial/images/second-selection.png) 256 | 257 | 258 | #### Label new samples 259 | 260 | You can repeat step 3 to label the new samples. To import them, go to `Settings` -> `Cloud Storage` and click `Sync Storage` on the Source Cloud Storage you created earlier. A message `Synced 30 task(s)` should show up. 261 | 262 | ![Sync Storage.](./tutorial/images/ls-sync-storage.png) 263 | 264 | Then, go back to the project page and label the new samples. After you finish annotating, export the annotations again, rename the file to `annotation-1.json`, and place it in the root directory of this repository. 265 | 266 | ## 6. Train a new model with the new samples 267 | Very similar to the script in step 4, [train_model_2.py](source/train_model_2.py) loads samples from `annotation-1.json` and trains the classification model again, now with all 60 labeled samples.
268 | 269 | ```sh 270 | python source/train_model_2.py 271 | ``` 272 | 273 | The model indeed does better this time on the validation set: 274 | ``` 275 | Training Acc: 90.000 Validation Acc: 44.248 276 | ``` 277 | 278 | ## Celebrate! You've improved your model with the help of Label Studio and Lightly! 279 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | torch~=2.0.0 2 | torchvision~=0.15.0 3 | Pillow~=9.0.0 4 | lightly>=1.4.6 5 | tqdm~=4.62.2 6 | label-studio~=1.7.2 7 | -------------------------------------------------------------------------------- /source/dataset.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | import numpy as np 4 | import torch 5 | from PIL import Image 6 | from torch.utils.data import DataLoader, Dataset 7 | from torchvision import transforms 8 | 9 | 10 | class WeatherDataset(Dataset): 11 | def __init__( 12 | self, 13 | image_data: List[Image.Image], 14 | target: List[int], 15 | transform: bool = None, 16 | ) -> None: 17 | label_classes = {"cloudy": 0, "rain": 1, "shine": 2, "sunrise": 3} 18 | self.image_data = image_data 19 | self.target = torch.LongTensor([label_classes[t] for t in target]) 20 | self.transform = transform 21 | 22 | def __getitem__(self, index): 23 | x = self.image_data[index] 24 | y = self.target[index] 25 | if self.transform: 26 | x = Image.fromarray( 27 | np.uint8(np.array(self.image_data[index])) 28 | ) # Memory Efficient way 29 | x = self.transform(x) 30 | return x, y 31 | 32 | def __len__(self): 33 | return len(self.image_data) 34 | 35 | 36 | def get_dataloader( 37 | X: List[Image.Image], 38 | y: List[int], 39 | batch_size: int, 40 | shuffle: bool = True, 41 | ) -> DataLoader: 42 | dataset = WeatherDataset(X, y, transform=transforms.ToTensor()) 43 | 44 | return DataLoader( 45 | dataset, 46 | batch_size=batch_size, 47 | num_workers=4, 48 | pin_memory=True, 49 | shuffle=shuffle, 50 | ) 51 | -------------------------------------------------------------------------------- /source/download_samples.py: -------------------------------------------------------------------------------- 1 | import pathlib 2 | 3 | import requests 4 | from lightly.api import ApiWorkflowClient 5 | 6 | # Create the Lightly client to connect to the Lightly Platform. 7 | client = ApiWorkflowClient(token="YOUR_LIGHTLY_TOKEN") 8 | 9 | # Set the dataset to the one we created. 10 | client.set_dataset_id_by_name(dataset_name="weather-classification") 11 | latest_tag = client.get_all_tags()[0] 12 | 13 | # filename_url_mappings is a list of entries with their filenames and read URLs. 
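# Each read URL is a signed URL that allows downloading the image directly from the cloud datasource.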
14 | # For example, [{"fileName": "image1.png", "readUrl": "https://..."}] 15 | filename_url_mappings = client.export_filenames_and_read_urls_by_tag_id(latest_tag.id) 16 | 17 | output_path = pathlib.Path("samples_for_labelling") 18 | output_path.mkdir(exist_ok=True) 19 | 20 | for entry in filename_url_mappings: 21 | read_url = entry["readUrl"] 22 | filename = entry["fileName"] 23 | print(f"Downloading {filename}") 24 | response = requests.get(read_url, stream=True) 25 | with open(output_path / filename, "wb") as file: 26 | for data in response.iter_content(): 27 | file.write(data) 28 | -------------------------------------------------------------------------------- /source/helper.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import pathlib 4 | from typing import List, Tuple 5 | 6 | import numpy as np 7 | import numpy.typing as npt 8 | from label_studio import read_label_studio_annotation_file 9 | from PIL import Image 10 | 11 | IMAGE_SIZE = 224 # Resize images 12 | 13 | 14 | def prepare_training_data(annotation_filepath: str) -> None: 15 | """Collects labels and filenames from LabelStudio output files. 16 | 17 | Images stay in the directory `train_set`. `train.json` only contains paths to 18 | samples to be used for training. For instance, 19 | [{"path": "/path/image1.png", "label": "cloudy"}] 20 | 21 | `train.json` will be picked up by the scripts for model training to load the 22 | actual images. 23 | """ 24 | samples = [] 25 | root = pathlib.Path("train_set") 26 | for filename, label in read_label_studio_annotation_file(annotation_filepath): 27 | samples.append({"path": str(root / filename), "label": label}) 28 | 29 | with open("train.json", "w") as f: 30 | json.dump(samples, f) 31 | 32 | 33 | def load_data(sample_json_path: str) -> Tuple[List[Image.Image], List[str], List[str]]: 34 | """Loads image data. 35 | 36 | Paths to samples to be used for training are loaded from the json file created in 37 | `prepare_training_data`. 38 | """ 39 | with open(sample_json_path) as f: 40 | sample_list = json.load(f) 41 | 42 | all_images, all_labels = [], [] 43 | filenames = [] 44 | 45 | for sample in sample_list: 46 | all_images.append( 47 | Image.open(sample["path"]) 48 | .convert("RGB") 49 | .resize((IMAGE_SIZE, IMAGE_SIZE), resample=Image.LANCZOS) 50 | ) 51 | all_labels.append(sample["label"]) 52 | filenames.append(pathlib.Path(sample["path"]).name) 53 | return all_images, all_labels, filenames 54 | 55 | 56 | def dump_lightly_predictions(filenames: List[str], predictions: npt.NDArray) -> None: 57 | """Dumps model predictions in the Lightly Prediction format. 58 | 59 | Each input image has its own prediction file. The filename is `<image filename stem>.json`. 60 | """ 61 | root = pathlib.Path("lightly_predictions") 62 | os.mkdir(root) 63 | for filename, prediction in zip(filenames, predictions): 64 | with open(str(root / pathlib.Path(filename).stem) + ".json", "w") as f: 65 | pred_list = prediction.tolist() 66 | # Normalise probabilities again because of precision loss in `tolist`.
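# Lightly expects the per-category probabilities of each classification prediction to sum to 1.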
67 | pred_sum = sum(pred_list) 68 | json.dump( 69 | { 70 | "file_name": filename, 71 | "predictions": [ 72 | { 73 | "category_id": int(np.argmax(prediction)), 74 | "probabilities": [p / pred_sum for p in pred_list], 75 | } 76 | ], 77 | }, 78 | f, 79 | ) 80 | -------------------------------------------------------------------------------- /source/label_studio.py: -------------------------------------------------------------------------------- 1 | import json 2 | import pathlib 3 | from typing import Dict, List, Tuple 4 | 5 | 6 | def read_label_element(label_element: Dict) -> Tuple[str, str]: 7 | """Parses labels from LabelStudio output data structure.""" 8 | filepath = pathlib.Path("/" + label_element["image"].split("?d=")[-1]) 9 | label = label_element["choice"] 10 | return filepath.name, label 11 | 12 | 13 | def read_label_studio_annotation_file(filepath: str) -> Tuple[List[str], List[str]]: 14 | """Reads labels from LabelStudio output files.""" 15 | # read the label file 16 | with open(filepath, "r") as json_file: 17 | data = json.load(json_file) 18 | return [read_label_element(label_element) for label_element in data] 19 | -------------------------------------------------------------------------------- /source/model.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import gc 3 | from typing import List, Mapping 4 | 5 | import numpy as np 6 | import numpy.typing as npt 7 | import torch 8 | import torch.nn as nn 9 | import torch.nn.functional as F 10 | from torch.utils.data import DataLoader 11 | 12 | 13 | class Model(nn.Module): 14 | def __init__(self): 15 | super().__init__() 16 | self.conv1 = nn.Conv2d(3, 6, 5) 17 | self.pool = nn.MaxPool2d(4, 4) 18 | self.conv2 = nn.Conv2d(6, 16, 4) 19 | self.fc1 = nn.Linear(16 * 13 * 13, 120) 20 | self.fc2 = nn.Linear(120, 84) 21 | self.fc3 = nn.Linear(84, 4) 22 | 23 | def forward(self, x): 24 | x = self.pool(F.relu(self.conv1(x))) 25 | x = self.pool(F.relu(self.conv2(x))) 26 | x = torch.flatten(x, 1) # flatten all dimensions except batch 27 | x = F.relu(self.fc1(x)) 28 | x = F.relu(self.fc2(x)) 29 | x = self.fc3(x) 30 | return x 31 | 32 | 33 | def get_optimizer(model: nn.Module) -> torch.optim.Optimizer: 34 | lr = 0.01 35 | momentum = 0.5 36 | decay = 0.01 37 | optimizer = torch.optim.SGD( 38 | model.parameters(), lr=lr, momentum=momentum, weight_decay=decay 39 | ) 40 | return optimizer 41 | 42 | 43 | def train_model( 44 | model: nn.Module, 45 | loss_func: torch.nn.modules.loss, 46 | optimizer: torch.optim.Optimizer, 47 | device: torch.device, 48 | dataloaders: Mapping[str, DataLoader], 49 | early_stop=10, 50 | num_epochs=5, 51 | ) -> nn.Module: 52 | start_time = datetime.datetime.now().replace(microsecond=0) 53 | model = model.to(device) 54 | 55 | valid_loss_min = np.Inf # track change in validation loss 56 | early_stop_cnt = 0 57 | last_epoch_loss = np.Inf 58 | globaliter = 0 59 | 60 | for epoch in range(1, num_epochs + 1): 61 | globaliter += 1 62 | # keep track of training and validation loss 63 | train_loss = 0.0 64 | valid_loss = 0.0 65 | 66 | model.train() 67 | train_corrects = 0 68 | 69 | for data, target in dataloaders["train"]: 70 | data, target = data.to(device), target.to(device) 71 | optimizer.zero_grad() 72 | output = model(data) 73 | # calculate the batch loss 74 | _, preds = torch.max(output, 1) 75 | loss = loss_func(output, target) 76 | loss.backward() 77 | optimizer.step() 78 | train_loss += loss.item() * data.size(0) 79 | train_corrects += torch.sum(preds == target.data) 80 
| 81 | train_loss = train_loss / len(dataloaders["train"].dataset) 82 | train_acc = (train_corrects.double() * 100) / len(dataloaders["train"].dataset) 83 | 84 | # validate the model 85 | model.eval() 86 | val_corrects = 0 87 | for data, target in dataloaders["val"]: 88 | data, target = data.to(device), target.to(device) 89 | output = model(data) 90 | _, preds = torch.max(output, 1) 91 | loss = loss_func(output, target) 92 | valid_loss += loss.item() * data.size(0) 93 | val_corrects += torch.sum(preds == target.data) 94 | 95 | # calculate average losses 96 | valid_loss = valid_loss / len(dataloaders["val"].dataset) 97 | valid_acc = (val_corrects.double() * 100) / len(dataloaders["val"].dataset) 98 | 99 | # print training/validation statistics 100 | print( 101 | "Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}".format( 102 | epoch, train_loss, valid_loss 103 | ) 104 | ) 105 | print( 106 | "\t\tTraining Acc: {:.3f} \t\tValidation Acc: {:.3f}".format( 107 | train_acc, valid_acc 108 | ) 109 | ) 110 | 111 | if valid_loss <= valid_loss_min: 112 | print( 113 | "\t\tValidation loss decreased ({:.6f} --> {:.6f}).".format( 114 | valid_loss_min, valid_loss 115 | ) 116 | ) 117 | 118 | valid_loss_min = valid_loss 119 | elif valid_loss == np.nan: 120 | print("Model Loss: NAN") 121 | 122 | if (last_epoch_loss < valid_loss) and last_epoch_loss != np.Inf: 123 | early_stop_cnt += 1 124 | if early_stop_cnt == early_stop: 125 | print("-" * 50 + "\nEarly Stopping Hit\n" + "-" * 50) 126 | break 127 | else: 128 | print( 129 | "-" * 50 130 | + f"\n\t\tEarly Stopping Step: {early_stop_cnt}/{early_stop}\n" 131 | + "-" * 50 132 | ) 133 | else: 134 | early_stop_cnt = 0 135 | last_epoch_loss = valid_loss 136 | 137 | print( 138 | f"Training Completed with best model having loss of {round(valid_loss_min, 6)}" 139 | ) 140 | del data, target 141 | gc.collect() 142 | end_time = datetime.datetime.now().replace(microsecond=0) 143 | print(f"Time Taken: {end_time-start_time}") 144 | return model 145 | 146 | 147 | def predict( 148 | model: nn.Module, device: torch.device, dataloader: DataLoader 149 | ) -> npt.NDArray: 150 | model.eval() 151 | predictions: List[npt.NDArray] = [] 152 | for data, target in dataloader: 153 | data, target = data.to(device), target.to(device) 154 | output = model(data) 155 | predictions.append( 156 | torch.nn.functional.softmax(output, dim=1).cpu().detach().numpy() 157 | ) 158 | return np.concatenate(predictions) 159 | -------------------------------------------------------------------------------- /source/run_first_selection.py: -------------------------------------------------------------------------------- 1 | from lightly.api import ApiWorkflowClient 2 | from lightly.openapi_generated.swagger_client import DatasetType, DatasourcePurpose 3 | 4 | # Create the Lightly client to connect to the Lightly Platform. 5 | client = ApiWorkflowClient(token="YOUR_LIGHTLY_TOKEN") 6 | 7 | # Create a new dataset on the Lightly Platform. 8 | client.create_dataset( 9 | dataset_name="weather-classification", dataset_type=DatasetType.IMAGES 10 | ) 11 | dataset_id = client.dataset_id 12 | 13 | # Configure the Input datasource. 14 | client.set_s3_delegated_access_config( 15 | resource_path="s3:///data/", 16 | region="your_bucket_region", 17 | role_arn="your_role_arn", 18 | external_id="your_external_id", 19 | purpose=DatasourcePurpose.INPUT, 20 | ) 21 | # Configure the Lightly datasource. 
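# (The INPUT datasource is where the worker reads the raw images from; the LIGHTLY datasource is where it writes its outputs and reads the `.lightly/` folder, e.g. predictions.)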
22 | client.set_s3_delegated_access_config( 23 | resource_path="s3:///lightly/", 24 | region="your_bucket_region", 25 | role_arn="your_role_arn", 26 | external_id="your_external_id", 27 | purpose=DatasourcePurpose.LIGHTLY, 28 | ) 29 | 30 | # Create a Lightly Worker run to select the first batch of 30 samples 31 | # based on image embeddings. 32 | scheduled_run_id = client.schedule_compute_worker_run( 33 | selection_config={ 34 | "n_samples": 30, 35 | "strategies": [ 36 | { 37 | "input": {"type": "EMBEDDINGS"}, 38 | "strategy": {"type": "DIVERSITY"}, 39 | } 40 | ], 41 | }, 42 | ) 43 | 44 | for run_info in client.compute_worker_run_info_generator( 45 | scheduled_run_id=scheduled_run_id 46 | ): 47 | print( 48 | f"Lightly Worker run is now in state='{run_info.state}' with message='{run_info.message}'" 49 | ) 50 | 51 | if run_info.ended_successfully(): 52 | print("SUCCESS") 53 | else: 54 | print("FAILURE") 55 | -------------------------------------------------------------------------------- /source/run_second_selection.py: -------------------------------------------------------------------------------- 1 | from lightly.api import ApiWorkflowClient 2 | from lightly.openapi_generated.swagger_client import DatasourcePurpose 3 | 4 | # Create the Lightly client to connect to the Lightly Platform. 5 | client = ApiWorkflowClient(token="YOUR_LIGHTLY_TOKEN") 6 | 7 | # Set the dataset to the one we created. 8 | client.set_dataset_id_by_name(dataset_name="weather-classification") 9 | 10 | # Configure the Input datasource. 11 | client.set_s3_delegated_access_config( 12 | resource_path="s3:///data/", 13 | region="your_bucket_region", 14 | role_arn="your_role_arn", 15 | external_id="your_external_id", 16 | purpose=DatasourcePurpose.INPUT, 17 | ) 18 | # Configure the Lightly datasource. 19 | client.set_s3_delegated_access_config( 20 | resource_path="s3:///lightly/", 21 | region="your_bucket_region", 22 | role_arn="your_role_arn", 23 | external_id="your_external_id", 24 | purpose=DatasourcePurpose.LIGHTLY, 25 | ) 26 | 27 | # Create a Lightly Worker run to select another 30 samples using active learning. 28 | scheduled_run_id = client.schedule_compute_worker_run( 29 | worker_config={ 30 | "datasource": { 31 | "process_all": True, 32 | }, 33 | "enable_training": False, 34 | }, 35 | selection_config={ 36 | "n_samples": 30, 37 | "strategies": [ 38 | { 39 | "input": { 40 | "type": "SCORES", 41 | "task": "weather-classification", 42 | "score": "uncertainty_entropy", 43 | }, 44 | "strategy": {"type": "WEIGHTS"}, 45 | } 46 | ], 47 | }, 48 | ) 49 | 50 | for run_info in client.compute_worker_run_info_generator( 51 | scheduled_run_id=scheduled_run_id 52 | ): 53 | print( 54 | f"Lightly Worker run is now in state='{run_info.state}' with message='{run_info.message}'" 55 | ) 56 | 57 | if run_info.ended_successfully(): 58 | print("SUCCESS") 59 | else: 60 | print("FAILURE") 61 | -------------------------------------------------------------------------------- /source/setup_data.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import pathlib 4 | import re 5 | from typing import Dict, List 6 | 7 | import numpy as np 8 | 9 | SEED = 42 10 | np.random.seed(SEED) 11 | 12 | 13 | def setup_data(data_dir_str: str) -> None: 14 | """Splits the full dataset into a training set and a validation set. 15 | 16 | The training set will have images that will be used to train the model. 
Even if 17 | they are already labelled, we will use them as unlabelled data and use Lightly 18 | to label them. The validation set will be used to evaluate the model's performance. 19 | """ 20 | data_dir = pathlib.Path(data_dir_str) 21 | files = os.listdir(data_dir) 22 | train_set_path = pathlib.Path("train_set") 23 | val_set_path = pathlib.Path("val_set") 24 | train_set_path.mkdir() 25 | val_set_path.mkdir() 26 | 27 | dataset: Dict[str, List[pathlib.PosixPath]] = {} 28 | train_set = [] 29 | val_set = [] 30 | pattern = re.compile("([a-z]+)[0-9]+") 31 | 32 | for file in files: 33 | filepath = data_dir / file 34 | regex_match = pattern.match(filepath.stem) 35 | category = regex_match.group(1) 36 | 37 | if dataset.get(category) is None: 38 | dataset[category] = [] 39 | dataset[category].append(filepath) 40 | 41 | for category, samples in dataset.items(): 42 | sample_idx = list(range(len(samples))) 43 | train_idx = np.random.choice( 44 | sample_idx, int(len(sample_idx) * 0.8), replace=False 45 | ) 46 | val_idx = [idx for idx in sample_idx if idx not in train_idx] 47 | 48 | for idx in train_idx: 49 | filepath: pathlib.PosixPath = samples[idx] 50 | new_path = train_set_path / filepath.name 51 | os.rename(filepath, new_path) 52 | train_set.append({"path": str(new_path), "label": category}) 53 | for idx in val_idx: 54 | filepath = samples[idx] 55 | new_path = val_set_path / filepath.name 56 | os.rename(filepath, new_path) 57 | val_set.append({"path": str(new_path), "label": category}) 58 | 59 | with open("full_train.json", "w") as f: 60 | json.dump(train_set, f) 61 | with open("val.json", "w") as f: 62 | json.dump(val_set, f) 63 | 64 | 65 | setup_data("./dataset2/") 66 | -------------------------------------------------------------------------------- /source/train_model_1.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from dataset import get_dataloader 4 | from helper import dump_lightly_predictions, load_data, prepare_training_data 5 | from model import Model, get_optimizer, predict, train_model 6 | 7 | SEED = 42 8 | torch.manual_seed(SEED) 9 | np.random.seed(SEED) 10 | 11 | 12 | def main(): 13 | prepare_training_data("./annotation-0.json") 14 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 15 | epochs = 25 16 | 17 | train_X, train_y, _ = load_data("./train.json") 18 | val_X, val_y, _ = load_data("./val.json") 19 | dataloaders = { 20 | "train": get_dataloader(train_X, train_y, batch_size=3), 21 | "val": get_dataloader(val_X, val_y, batch_size=50), 22 | } 23 | 24 | model = Model().to(device) 25 | loss_func = torch.nn.CrossEntropyLoss() 26 | optimizer = get_optimizer(model) 27 | model = train_model( 28 | model=model, 29 | loss_func=loss_func, 30 | optimizer=optimizer, 31 | dataloaders=dataloaders, 32 | device=device, 33 | num_epochs=epochs, 34 | ) 35 | 36 | train_X, train_y, filenames = load_data("./full_train.json") 37 | dataloader = get_dataloader(train_X, train_y, batch_size=50, shuffle=False) 38 | predictions = predict(model, device, dataloader) 39 | dump_lightly_predictions(filenames, predictions) 40 | 41 | 42 | main() 43 | -------------------------------------------------------------------------------- /source/train_model_2.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from dataset import get_dataloader 4 | from helper import load_data, prepare_training_data 5 | from model import Model, get_optimizer, 
train_model 6 | 7 | SEED = 42 8 | torch.manual_seed(SEED) 9 | np.random.seed(SEED) 10 | 11 | 12 | def main(): 13 | prepare_training_data("./annotation-1.json") 14 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 15 | epochs = 25 16 | 17 | train_X, train_y, _ = load_data("./train.json") 18 | val_X, val_y, _ = load_data("./val.json") 19 | dataloaders = { 20 | "train": get_dataloader(train_X, train_y, batch_size=3), 21 | "val": get_dataloader(val_X, val_y, batch_size=50), 22 | } 23 | 24 | model = Model().to(device) 25 | loss_func = torch.nn.CrossEntropyLoss() 26 | optimizer = get_optimizer(model) 27 | model = train_model( 28 | model=model, 29 | loss_func=loss_func, 30 | optimizer=optimizer, 31 | dataloaders=dataloaders, 32 | device=device, 33 | num_epochs=epochs, 34 | ) 35 | 36 | 37 | main() 38 | -------------------------------------------------------------------------------- /tutorial/images/init-selection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lightly-ai/Lightly_LabelStudio_AL/2e5bc2e4600a14d34872aefef510fe19116720ad/tutorial/images/init-selection.png -------------------------------------------------------------------------------- /tutorial/images/ls-add-storage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lightly-ai/Lightly_LabelStudio_AL/2e5bc2e4600a14d34872aefef510fe19116720ad/tutorial/images/ls-add-storage.png -------------------------------------------------------------------------------- /tutorial/images/ls-interface.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lightly-ai/Lightly_LabelStudio_AL/2e5bc2e4600a14d34872aefef510fe19116720ad/tutorial/images/ls-interface.png -------------------------------------------------------------------------------- /tutorial/images/ls-sync-storage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lightly-ai/Lightly_LabelStudio_AL/2e5bc2e4600a14d34872aefef510fe19116720ad/tutorial/images/ls-sync-storage.png -------------------------------------------------------------------------------- /tutorial/images/second-selection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lightly-ai/Lightly_LabelStudio_AL/2e5bc2e4600a14d34872aefef510fe19116720ad/tutorial/images/second-selection.png --------------------------------------------------------------------------------