├── .gitignore ├── README.md ├── requirements.txt ├── source ├── dataset.py ├── download_samples.py ├── helper.py ├── label_studio.py ├── model.py ├── run_first_selection.py ├── run_second_selection.py ├── setup_data.py ├── train_model_1.py └── train_model_2.py └── tutorial └── images ├── init-selection.png ├── ls-add-storage.png ├── ls-interface.png ├── ls-sync-storage.png └── second-selection.png /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | 132 | #ignore venv 133 | venv 134 | 135 | #ignore pycharm IDE 136 | .idea 137 | 138 | #ignore lightly outputs 139 | lightly_outputs 140 | 141 | #ignore .DS_Store 142 | **.DS_Store 143 | 144 | #ignore eggs 145 | .eggs 146 | 147 | # Ignore saved model 148 | weather_classifier.pth 149 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Tutorial 2 | This tutorial demonstrates a complete workflow for training a machine learning model with the aid of Active Learning, using [Lightly](https://www.lightly.ai) and [Label Studio](https://labelstud.io). 3 | 4 | Assume we have a new unlabelled dataset and want to train a new model. We do not want to label all samples because not all of them are valuable. Lightly can help select a good subset of samples to kick off labeling and model training. The loop is as follows: 5 | 6 | 1. Lightly chooses a subset of the unlabelled samples. 7 | 1. This subset is labeled using Label Studio. 8 | 1. A machine learning model is trained on the labeled data and generates predictions for the entire dataset. 9 | 1. Lightly consumes the predictions and performs Active Learning to choose the next batch of samples to be labeled. 10 | 1. This new batch of samples is labeled in Label Studio. 11 | 1. The machine learning model is re-trained on the enriched labeled dataset to achieve better performance. 12 | 13 | 14 | Let's get started! 15 | 16 | ## 0. Installation and Requirements 17 | Make sure you have an account for the [Lightly Web App](https://app.lightly.ai). 18 | You also need to know your API token, which is shown under your `USERNAME` -> `Preferences`. 19 | 20 | Clone this repo and install all Python package requirements from the `requirements.txt` file, e.g. with pip: 21 | ```bash 22 | git clone https://github.com/lightly-ai/Lightly_LabelStudio_AL.git 23 | cd Lightly_LabelStudio_AL 24 | pip install -r requirements.txt 25 | ``` 26 | 27 | 28 | ## 1. Prepare data 29 | We want to train a classifier to predict the weather displayed in an image. We use this dataset: [Multi-class Weather Dataset for Image Classification](https://data.mendeley.com/datasets/4drtyfjtfy/1). Download the dataset (zip file) from [here](https://data.mendeley.com/public-files/datasets/4drtyfjtfy/files/a03e6097-f7fb-4e1a-9c6a-8923c6a0d3e0/file_downloaded) to this directory. 30 | 31 | After downloading and extracting the zip file, you will see the extracted directory as follows: 32 | 33 | ``` 34 | dataset2 35 | ├── cloudy1.jpg 36 | ├── cloudy2.jpg 37 | ├── cloudy3.jpg 38 | ├── cloudy4.jpg 39 | ... 40 | ``` 41 | 42 | Here we have images of 4 weather conditions: `cloudy`, `rain`, `shine`, and `sunrise`. 43 | 44 | #### 1.1 Split dataset 45 | To compare results between iterations, we first split the entire dataset into a full training set and a validation set.
The training set will be used to select samples, and the validation set will be used as "new data" to evaluate the model's performance. 46 | 47 | Run the script below to split the dataset: 48 | ```sh 49 | python source/setup_data.py 50 | ``` 51 | 52 | After this, you will find the following files and directories in the current directory: 53 | * `train_set`: Directory that contains all samples to be used for training the model. Here we pretend that these samples are all unlabelled. 54 | * `val_set`: Directory that contains all samples to be used for model validation. Samples are labeled. 55 | * `full_train.json`: JSON file that records paths to all files in `train_set`. 56 | * `val.json`: JSON file that records paths and labels of all files in `val_set`. 57 | 58 | These will be used in the following steps. 59 | 60 | #### 1.2 Upload training samples to cloud storage 61 | In this tutorial, samples are stored in the cloud, and Lightly Worker will read the samples from the cloud data source. For details, please refer to [Set Up Your First Dataset](https://docs.lightly.ai/docs/set-up-your-first-dataset). Here we use Amazon S3 as an example. 62 | 63 | Under your S3 bucket, create two directories: `data` and `lightly`. We will upload all training samples to `data`. For example, run the [AWS CLI tool](https://aws.amazon.com/cli/): 64 | ```sh 65 | aws s3 sync train_set s3:///data 66 | ``` 67 | 68 | After uploading the samples, your S3 bucket should look like 69 | ``` 70 | s3://bucket/ 71 | ├── lightly/ 72 | └── data/ 73 | ├── cloudy1.jpg 74 | ├── cloudy2.jpg 75 | ├── ... 76 | ``` 77 | 78 | ## 2. Select the first batch of samples for labeling 79 | 80 | Now, with all unlabelled data samples in your training dataset, we want to select a good subset, label them, and train our classification model with them. Lightly can do this selection for you in a simple way. The script [run_first_selection.py](./source/run_first_selection.py) does the job for you. You need to first set up Lightly Worker on your machine and put the correct configuration values in the script. Please refer to [Install Lightly](https://docs.lightly.ai/docs/install-lightly) and [Set Up Your First Dataset](https://docs.lightly.ai/docs/set-up-your-first-dataset) for more details. 81 | 82 | Run the script after your worker is ready: 83 | 84 | ```sh 85 | python source/run_first_selection.py 86 | ``` 87 | 88 | In this script, Lightly Worker first creates a dataset named `weather-classification`, selects 30 samples based on embeddings of the training samples, and records them in this dataset. These 30 samples are the ones that we are going to label in the first round. You can see the selected samples in the [Web App](https://app.lightly.ai/). 89 | 90 | ![First selection.](tutorial/images/init-selection.png) 91 | 92 | ## 3. Label the selected samples to train a classifier 93 | 94 | We do this using the open source labeling tool **Label Studio**, which is a browser-based tool hosted on your machine. 95 | You have already installed it and can run it from the command line. It will need access to your local files. We will first download the selected samples, import them in Label Studio, label them, and export the annotations. 96 | 97 | _Curious to get started with Label Studio? Check out [this tutorial](https://labelstud.io/blog/zero-to-one-getting-started-with-label-studio/) for help getting started!_ 98 | 99 | #### 3.1 Download the selected samples 100 | 101 | We can download the selected samples from the Lightly Platform. 
The [download_samples.py](./source/download_samples.py) script does everything for you and downloads the samples to a local directory called `samples_for_labelling`. 102 | 103 | ```sh 104 | python source/download_samples.py 105 | ``` 106 | 107 | Lightly Worker created a tag for the selected samples. The script pulls information about the samples in this tag and downloads them. 108 | 109 | #### 3.2 Run Label Studio 110 | 111 | Now we can launch Label Studio. 112 | 113 | ```sh 114 | export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true && label-studio start 115 | ``` 116 | 117 | It should open in your browser. Create an account and log in. 118 | 119 | #### 3.3 Configure Storage 120 | 121 | Create a new project called "weather-classification". 122 | Then, head to `Settings` -> `Cloud Storage` -> `Add Source Storage` -> `Storage Type`: `Local files`. 123 | Set the `Absolute local path` to the absolute path of the directory `samples_for_labelling`. 124 | Enable the option `Treat every bucket object as a source file`. 125 | Then click `Add Storage`. Label Studio will confirm that the storage has been added. 126 | Now click on `Sync Storage` to finally load the 30 images. 127 | 128 | ![Configuration of local file input.](tutorial/images/ls-add-storage.png) 129 | 130 | #### 3.4 Configure the labeling interface 131 | 132 | In `Settings` -> `Labeling Interface`, open the `Code` view and insert: 133 | ```xml 134 | <View> 135 |   <Image name="image" value="$image"/> 136 |   <Choices name="choice" toName="image"> 137 |     <Choice value="cloudy"/> 138 |     <Choice value="rain"/> 139 |     <Choice value="shine"/> 140 |     <Choice value="sunrise"/> 141 |   </Choices> 142 | </View> 143 | ``` 144 | ![Configuration of Labeling Interface.](tutorial/images/ls-interface.png) 145 | 146 | This tells Label Studio that there is an image classification task with 4 distinct choices. 147 | 148 | If you want someone else to help you label the images, navigate to `Settings` -> `Instructions` and add some instructions. 149 | 150 | #### 3.5 Labelling 151 | 152 | Now, if you open your project again, you will see 30 tasks and the corresponding images. 153 | Click on `Label All Tasks` and get those 30 images labeled. 154 | 155 | **Pro Tip!** Use the keys `1`, `2`, `3`, `4` on your keyboard as hotkeys to label faster! 156 | 157 | #### 3.6 Export labels 158 | 159 | Export the labels via `Export` in the format `JSON-MIN`. 160 | Rename the file to `annotation-0.json` and place it in the root directory of this repository. 161 | 162 | ## 4. Train a model and prepare for active learning 163 | 164 | We can now train a classification model with the 30 labeled samples. The [train_model_1.py](./source/train_model_1.py) script loads samples from `annotation-0.json` and performs this task. 165 | 166 | ```sh 167 | python source/train_model_1.py 168 | ``` 169 | 170 | The following steps are performed in this script: 171 | * Load the annotations and the labeled images. 172 | * Load the validation set. 173 | * Train a simple model as in [model.py](./source/model.py). 174 | * Make predictions for all training samples, including the unlabeled ones. 175 | * Dump the predictions in [Lightly Prediction format](https://docs.lightly.ai/docs/prediction-format#prediction-format) into the directory `lightly_predictions`. 176 | 177 | We can see that the model performance is not good yet: 178 | ``` 179 | Training Acc: 60.000 Validation Acc: 19.027 180 | ``` 181 | 182 | That is okay for now; we will improve it. The predictions will be used for active learning. 183 | 184 | #### Upload predictions to data source 185 | 186 | Lightly Worker also does active learning for you based on predictions, which it consumes from the data source. We therefore need to place the predictions we just generated in the data source. For detailed information, please refer to [Predictions Folder Structure](https://docs.lightly.ai/docs/prediction-format#predictions-folder-structure). Here we again use the AWS S3 bucket as an example.
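Each training image gets its own prediction file named after the image. For reference, a file such as `cloudy1.json` written by `dump_lightly_predictions` in [helper.py](./source/helper.py) has the following shape (the probability values below are purely illustrative):
```json
{
    "file_name": "cloudy1.jpg",
    "predictions": [
        {
            "category_id": 0,
            "probabilities": [0.7, 0.1, 0.1, 0.1]
        }
    ]
}
```
The `category_id` refers to the ids defined in `schema.json` below, and the probabilities are the model's softmax outputs, normalised to sum to 1.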
187 | 188 | In the `lightly` directory you created earlier in your S3 bucket, predictions are kept in a subdirectory `.lightly/predictions`. Besides the prediction files themselves, you need the following additional files, which you can create directly by copying the code blocks below. 189 | 190 | ##### tasks.json 191 | ```json 192 | ["weather-classification"] 193 | ``` 194 | 195 | We only have one task here; let's name it `weather-classification`. 196 | 197 | ##### schema.json 198 | ```json 199 | { 200 | "task_type": "classification", 201 | "categories": [ 202 | { 203 | "id": 0, 204 | "name": "cloudy" 205 | }, 206 | { 207 | "id": 1, 208 | "name": "rain" 209 | }, 210 | { 211 | "id": 2, 212 | "name": "shine" 213 | }, 214 | { 215 | "id": 3, 216 | "name": "sunrise" 217 | } 218 | ] 219 | } 220 | ``` 221 | 222 | Place `tasks.json` in `.lightly/predictions/`, and place `schema.json`, together with the prediction files from the local directory `lightly_predictions`, in `.lightly/predictions/weather-classification/`. 223 | After uploading these files, your S3 bucket should look like this: 224 | ``` 225 | s3://bucket/ 226 | ├── lightly/ 227 | │   └── .lightly/ 228 | │       └── predictions/ 229 | │           ├── tasks.json 230 | │           └── weather-classification/ 231 | │               ├── schema.json 232 | │               ├── cloudy1.json 233 | │               ├── cloudy2.json 234 | │               ├── ... 235 | └── data/ 236 |     ├── cloudy1.jpg 237 |     ├── cloudy2.jpg 238 |     ├── ... 239 | ``` 240 | 241 | where files like `cloudy1.json` and `cloudy2.json` are the prediction files from `lightly_predictions`. 242 | 243 | ## 5. Select and label new samples 244 | 245 | With the predictions, Lightly Worker can perform active learning and select new samples for us. The [run_second_selection.py](./source/run_second_selection.py) script does the job. 246 | 247 | ```sh 248 | python source/run_second_selection.py 249 | ``` 250 | 251 | This time, Lightly Worker goes through all training samples again and selects another 30 samples based on active learning scores computed from the predictions we uploaded in the previous step. For more details, please refer to [Selection Scores](https://docs.lightly.ai/docs/selection#scores) and [Active Learning Scorer](https://docs.lightly.ai/docs/active-learning-scorers). 252 | 253 | You can see the results in the Web App. 254 | 255 | ![Second selection.](./tutorial/images/second-selection.png) 256 | 257 | 258 | #### Label new samples 259 | 260 | You can repeat step 3 to label the new samples. To import them, go to `Settings` -> `Cloud Storage` and click `Sync Storage` on the Source Cloud Storage you created earlier. A message `Synced 30 task(s)` should show up. 261 | 262 | ![Sync Storage.](./tutorial/images/ls-sync-storage.png) 263 | 264 | Then, go back to the project page and label the new samples. After you finish annotating, export the annotations again, rename the file to `annotation-1.json`, and place it in the root directory of this repository. 265 | 266 | ## 6. Train a new model with the new samples 267 | Very similar to the script in step 4, [train_model_2.py](source/train_model_2.py) loads samples from `annotation-1.json` and trains the classification model again, now with all 60 labeled samples.
268 | 269 | ```sh 270 | python source/train_model_2.py 271 | ``` 272 | 273 | The model indeed does better this time on the validation set: 274 | ``` 275 | Training Acc: 90.000 Validation Acc: 44.248 276 | ``` 277 | 278 | ## Celebrate! You've improved your model with the help of Label Studio and Lightly! 279 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | torch~=2.0.0 2 | torchvision~=0.15.0 3 | Pillow~=9.0.0 4 | lightly>=1.4.6 5 | tqdm~=4.62.2 6 | label-studio~=1.7.2 7 | -------------------------------------------------------------------------------- /source/dataset.py: -------------------------------------------------------------------------------- 1 | from typing import List 2 | 3 | import numpy as np 4 | import torch 5 | from PIL import Image 6 | from torch.utils.data import DataLoader, Dataset 7 | from torchvision import transforms 8 | 9 | 10 | class WeatherDataset(Dataset): 11 | def __init__( 12 | self, 13 | image_data: List[Image.Image], 14 | target: List[int], 15 | transform: bool = None, 16 | ) -> None: 17 | label_classes = {"cloudy": 0, "rain": 1, "shine": 2, "sunrise": 3} 18 | self.image_data = image_data 19 | self.target = torch.LongTensor([label_classes[t] for t in target]) 20 | self.transform = transform 21 | 22 | def __getitem__(self, index): 23 | x = self.image_data[index] 24 | y = self.target[index] 25 | if self.transform: 26 | x = Image.fromarray( 27 | np.uint8(np.array(self.image_data[index])) 28 | ) # Memory Efficient way 29 | x = self.transform(x) 30 | return x, y 31 | 32 | def __len__(self): 33 | return len(self.image_data) 34 | 35 | 36 | def get_dataloader( 37 | X: List[Image.Image], 38 | y: List[int], 39 | batch_size: int, 40 | shuffle: bool = True, 41 | ) -> DataLoader: 42 | dataset = WeatherDataset(X, y, transform=transforms.ToTensor()) 43 | 44 | return DataLoader( 45 | dataset, 46 | batch_size=batch_size, 47 | num_workers=4, 48 | pin_memory=True, 49 | shuffle=shuffle, 50 | ) 51 | -------------------------------------------------------------------------------- /source/download_samples.py: -------------------------------------------------------------------------------- 1 | import pathlib 2 | 3 | import requests 4 | from lightly.api import ApiWorkflowClient 5 | 6 | # Create the Lightly client to connect to the Lightly Platform. 7 | client = ApiWorkflowClient(token="YOUR_LIGHTLY_TOKEN") 8 | 9 | # Set the dataset to the one we created. 10 | client.set_dataset_id_by_name(dataset_name="weather-classification") 11 | latest_tag = client.get_all_tags()[0] 12 | 13 | # filename_url_mappings is a list of entries with their filenames and read URLs. 
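# Each read URL is a signed URL that allows downloading the image directly from the cloud datasource.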
14 | # For example, [{"fileName": "image1.png", "readUrl": "https://..."}] 15 | filename_url_mappings = client.export_filenames_and_read_urls_by_tag_id(latest_tag.id) 16 | 17 | output_path = pathlib.Path("samples_for_labelling") 18 | output_path.mkdir(exist_ok=True) 19 | 20 | for entry in filename_url_mappings: 21 | read_url = entry["readUrl"] 22 | filename = entry["fileName"] 23 | print(f"Downloading {filename}") 24 | response = requests.get(read_url, stream=True) 25 | with open(output_path / filename, "wb") as file: 26 | for data in response.iter_content(): 27 | file.write(data) 28 | -------------------------------------------------------------------------------- /source/helper.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import pathlib 4 | from typing import List, Tuple 5 | 6 | import numpy as np 7 | import numpy.typing as npt 8 | from label_studio import read_label_studio_annotation_file 9 | from PIL import Image 10 | 11 | IMAGE_SIZE = 224 # Resize images 12 | 13 | 14 | def prepare_training_data(annotation_filepath: str) -> None: 15 | """Collects labels and filenames from LabelStudio output files. 16 | 17 | Images stay in the directory `train_set`. `train.json` only contains paths to 18 | samples to be used for training. For instance, 19 | [{"path": "/path/image1.png", "label": "cloudy"}] 20 | 21 | `train.json` will be picked up by the scripts for model training to load the 22 | actual images. 23 | """ 24 | samples = [] 25 | root = pathlib.Path("train_set") 26 | for filename, label in read_label_studio_annotation_file(annotation_filepath): 27 | samples.append({"path": str(root / filename), "label": label}) 28 | 29 | with open("train.json", "w") as f: 30 | json.dump(samples, f) 31 | 32 | 33 | def load_data(sample_json_path: str) -> Tuple[List[Image.Image], List[str], List[str]]: 34 | """Loads image data. 35 | 36 | Paths to samples to be used for training are loaded from the json file created in 37 | `prepare_training_data`. 38 | """ 39 | with open(sample_json_path) as f: 40 | sample_list = json.load(f) 41 | 42 | all_images, all_labels = [], [] 43 | filenames = [] 44 | 45 | for sample in sample_list: 46 | all_images.append( 47 | Image.open(sample["path"]) 48 | .convert("RGB") 49 | .resize((IMAGE_SIZE, IMAGE_SIZE), resample=Image.LANCZOS) 50 | ) 51 | all_labels.append(sample["label"]) 52 | filenames.append(pathlib.Path(sample["path"]).name) 53 | return all_images, all_labels, filenames 54 | 55 | 56 | def dump_lightly_predictions(filenames: List[str], predictions: npt.NDArray) -> None: 57 | """Dumps model predictions in the Lightly Prediction format. 58 | 59 | Each input image has its own prediction file. The filename is `<image filename stem>.json`. 60 | """ 61 | root = pathlib.Path("lightly_predictions") 62 | os.mkdir(root) 63 | for filename, prediction in zip(filenames, predictions): 64 | with open(str(root / pathlib.Path(filename).stem) + ".json", "w") as f: 65 | pred_list = prediction.tolist() 66 | # Normalise probabilities again because of precision loss in `tolist`.
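# Lightly expects the per-category probabilities of each classification prediction to sum to 1.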
67 | pred_sum = sum(pred_list) 68 | json.dump( 69 | { 70 | "file_name": filename, 71 | "predictions": [ 72 | { 73 | "category_id": int(np.argmax(prediction)), 74 | "probabilities": [p / pred_sum for p in pred_list], 75 | } 76 | ], 77 | }, 78 | f, 79 | ) 80 | -------------------------------------------------------------------------------- /source/label_studio.py: -------------------------------------------------------------------------------- 1 | import json 2 | import pathlib 3 | from typing import Dict, List, Tuple 4 | 5 | 6 | def read_label_element(label_element: Dict) -> Tuple[str, str]: 7 | """Parses labels from LabelStudio output data structure.""" 8 | filepath = pathlib.Path("/" + label_element["image"].split("?d=")[-1]) 9 | label = label_element["choice"] 10 | return filepath.name, label 11 | 12 | 13 | def read_label_studio_annotation_file(filepath: str) -> Tuple[List[str], List[str]]: 14 | """Reads labels from LabelStudio output files.""" 15 | # read the label file 16 | with open(filepath, "r") as json_file: 17 | data = json.load(json_file) 18 | return [read_label_element(label_element) for label_element in data] 19 | -------------------------------------------------------------------------------- /source/model.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import gc 3 | from typing import List, Mapping 4 | 5 | import numpy as np 6 | import numpy.typing as npt 7 | import torch 8 | import torch.nn as nn 9 | import torch.nn.functional as F 10 | from torch.utils.data import DataLoader 11 | 12 | 13 | class Model(nn.Module): 14 | def __init__(self): 15 | super().__init__() 16 | self.conv1 = nn.Conv2d(3, 6, 5) 17 | self.pool = nn.MaxPool2d(4, 4) 18 | self.conv2 = nn.Conv2d(6, 16, 4) 19 | self.fc1 = nn.Linear(16 * 13 * 13, 120) 20 | self.fc2 = nn.Linear(120, 84) 21 | self.fc3 = nn.Linear(84, 4) 22 | 23 | def forward(self, x): 24 | x = self.pool(F.relu(self.conv1(x))) 25 | x = self.pool(F.relu(self.conv2(x))) 26 | x = torch.flatten(x, 1) # flatten all dimensions except batch 27 | x = F.relu(self.fc1(x)) 28 | x = F.relu(self.fc2(x)) 29 | x = self.fc3(x) 30 | return x 31 | 32 | 33 | def get_optimizer(model: nn.Module) -> torch.optim.Optimizer: 34 | lr = 0.01 35 | momentum = 0.5 36 | decay = 0.01 37 | optimizer = torch.optim.SGD( 38 | model.parameters(), lr=lr, momentum=momentum, weight_decay=decay 39 | ) 40 | return optimizer 41 | 42 | 43 | def train_model( 44 | model: nn.Module, 45 | loss_func: torch.nn.modules.loss, 46 | optimizer: torch.optim.Optimizer, 47 | device: torch.device, 48 | dataloaders: Mapping[str, DataLoader], 49 | early_stop=10, 50 | num_epochs=5, 51 | ) -> nn.Module: 52 | start_time = datetime.datetime.now().replace(microsecond=0) 53 | model = model.to(device) 54 | 55 | valid_loss_min = np.Inf # track change in validation loss 56 | early_stop_cnt = 0 57 | last_epoch_loss = np.Inf 58 | globaliter = 0 59 | 60 | for epoch in range(1, num_epochs + 1): 61 | globaliter += 1 62 | # keep track of training and validation loss 63 | train_loss = 0.0 64 | valid_loss = 0.0 65 | 66 | model.train() 67 | train_corrects = 0 68 | 69 | for data, target in dataloaders["train"]: 70 | data, target = data.to(device), target.to(device) 71 | optimizer.zero_grad() 72 | output = model(data) 73 | # calculate the batch loss 74 | _, preds = torch.max(output, 1) 75 | loss = loss_func(output, target) 76 | loss.backward() 77 | optimizer.step() 78 | train_loss += loss.item() * data.size(0) 79 | train_corrects += torch.sum(preds == target.data) 80 
| 81 | train_loss = train_loss / len(dataloaders["train"].dataset) 82 | train_acc = (train_corrects.double() * 100) / len(dataloaders["train"].dataset) 83 | 84 | # validate the model 85 | model.eval() 86 | val_corrects = 0 87 | for data, target in dataloaders["val"]: 88 | data, target = data.to(device), target.to(device) 89 | output = model(data) 90 | _, preds = torch.max(output, 1) 91 | loss = loss_func(output, target) 92 | valid_loss += loss.item() * data.size(0) 93 | val_corrects += torch.sum(preds == target.data) 94 | 95 | # calculate average losses 96 | valid_loss = valid_loss / len(dataloaders["val"].dataset) 97 | valid_acc = (val_corrects.double() * 100) / len(dataloaders["val"].dataset) 98 | 99 | # print training/validation statistics 100 | print( 101 | "Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}".format( 102 | epoch, train_loss, valid_loss 103 | ) 104 | ) 105 | print( 106 | "\t\tTraining Acc: {:.3f} \t\tValidation Acc: {:.3f}".format( 107 | train_acc, valid_acc 108 | ) 109 | ) 110 | 111 | if valid_loss <= valid_loss_min: 112 | print( 113 | "\t\tValidation loss decreased ({:.6f} --> {:.6f}).".format( 114 | valid_loss_min, valid_loss 115 | ) 116 | ) 117 | 118 | valid_loss_min = valid_loss 119 | elif valid_loss == np.nan: 120 | print("Model Loss: NAN") 121 | 122 | if (last_epoch_loss < valid_loss) and last_epoch_loss != np.Inf: 123 | early_stop_cnt += 1 124 | if early_stop_cnt == early_stop: 125 | print("-" * 50 + "\nEarly Stopping Hit\n" + "-" * 50) 126 | break 127 | else: 128 | print( 129 | "-" * 50 130 | + f"\n\t\tEarly Stopping Step: {early_stop_cnt}/{early_stop}\n" 131 | + "-" * 50 132 | ) 133 | else: 134 | early_stop_cnt = 0 135 | last_epoch_loss = valid_loss 136 | 137 | print( 138 | f"Training Completed with best model having loss of {round(valid_loss_min, 6)}" 139 | ) 140 | del data, target 141 | gc.collect() 142 | end_time = datetime.datetime.now().replace(microsecond=0) 143 | print(f"Time Taken: {end_time-start_time}") 144 | return model 145 | 146 | 147 | def predict( 148 | model: nn.Module, device: torch.device, dataloader: DataLoader 149 | ) -> npt.NDArray: 150 | model.eval() 151 | predictions: List[npt.NDArray] = [] 152 | for data, target in dataloader: 153 | data, target = data.to(device), target.to(device) 154 | output = model(data) 155 | predictions.append( 156 | torch.nn.functional.softmax(output, dim=1).cpu().detach().numpy() 157 | ) 158 | return np.concatenate(predictions) 159 | -------------------------------------------------------------------------------- /source/run_first_selection.py: -------------------------------------------------------------------------------- 1 | from lightly.api import ApiWorkflowClient 2 | from lightly.openapi_generated.swagger_client import DatasetType, DatasourcePurpose 3 | 4 | # Create the Lightly client to connect to the Lightly Platform. 5 | client = ApiWorkflowClient(token="YOUR_LIGHTLY_TOKEN") 6 | 7 | # Create a new dataset on the Lightly Platform. 8 | client.create_dataset( 9 | dataset_name="weather-classification", dataset_type=DatasetType.IMAGES 10 | ) 11 | dataset_id = client.dataset_id 12 | 13 | # Configure the Input datasource. 14 | client.set_s3_delegated_access_config( 15 | resource_path="s3:///data/", 16 | region="your_bucket_region", 17 | role_arn="your_role_arn", 18 | external_id="your_external_id", 19 | purpose=DatasourcePurpose.INPUT, 20 | ) 21 | # Configure the Lightly datasource. 
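# (The INPUT datasource is where the worker reads the raw images from; the LIGHTLY datasource is where it writes its outputs and reads the `.lightly/` folder, e.g. predictions.)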
22 | client.set_s3_delegated_access_config( 23 | resource_path="s3:///lightly/", 24 | region="your_bucket_region", 25 | role_arn="your_role_arn", 26 | external_id="your_external_id", 27 | purpose=DatasourcePurpose.LIGHTLY, 28 | ) 29 | 30 | # Create a Lightly Worker run to select the first batch of 30 samples 31 | # based on image embeddings. 32 | scheduled_run_id = client.schedule_compute_worker_run( 33 | selection_config={ 34 | "n_samples": 30, 35 | "strategies": [ 36 | { 37 | "input": {"type": "EMBEDDINGS"}, 38 | "strategy": {"type": "DIVERSITY"}, 39 | } 40 | ], 41 | }, 42 | ) 43 | 44 | for run_info in client.compute_worker_run_info_generator( 45 | scheduled_run_id=scheduled_run_id 46 | ): 47 | print( 48 | f"Lightly Worker run is now in state='{run_info.state}' with message='{run_info.message}'" 49 | ) 50 | 51 | if run_info.ended_successfully(): 52 | print("SUCCESS") 53 | else: 54 | print("FAILURE") 55 | -------------------------------------------------------------------------------- /source/run_second_selection.py: -------------------------------------------------------------------------------- 1 | from lightly.api import ApiWorkflowClient 2 | from lightly.openapi_generated.swagger_client import DatasourcePurpose 3 | 4 | # Create the Lightly client to connect to the Lightly Platform. 5 | client = ApiWorkflowClient(token="YOUR_LIGHTLY_TOKEN") 6 | 7 | # Set the dataset to the one we created. 8 | client.set_dataset_id_by_name(dataset_name="weather-classification") 9 | 10 | # Configure the Input datasource. 11 | client.set_s3_delegated_access_config( 12 | resource_path="s3:///data/", 13 | region="your_bucket_region", 14 | role_arn="your_role_arn", 15 | external_id="your_external_id", 16 | purpose=DatasourcePurpose.INPUT, 17 | ) 18 | # Configure the Lightly datasource. 19 | client.set_s3_delegated_access_config( 20 | resource_path="s3:///lightly/", 21 | region="your_bucket_region", 22 | role_arn="your_role_arn", 23 | external_id="your_external_id", 24 | purpose=DatasourcePurpose.LIGHTLY, 25 | ) 26 | 27 | # Create a Lightly Worker run to select another 30 samples using active learning. 28 | scheduled_run_id = client.schedule_compute_worker_run( 29 | worker_config={ 30 | "datasource": { 31 | "process_all": True, 32 | }, 33 | "enable_training": False, 34 | }, 35 | selection_config={ 36 | "n_samples": 30, 37 | "strategies": [ 38 | { 39 | "input": { 40 | "type": "SCORES", 41 | "task": "weather-classification", 42 | "score": "uncertainty_entropy", 43 | }, 44 | "strategy": {"type": "WEIGHTS"}, 45 | } 46 | ], 47 | }, 48 | ) 49 | 50 | for run_info in client.compute_worker_run_info_generator( 51 | scheduled_run_id=scheduled_run_id 52 | ): 53 | print( 54 | f"Lightly Worker run is now in state='{run_info.state}' with message='{run_info.message}'" 55 | ) 56 | 57 | if run_info.ended_successfully(): 58 | print("SUCCESS") 59 | else: 60 | print("FAILURE") 61 | -------------------------------------------------------------------------------- /source/setup_data.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import pathlib 4 | import re 5 | from typing import Dict, List 6 | 7 | import numpy as np 8 | 9 | SEED = 42 10 | np.random.seed(SEED) 11 | 12 | 13 | def setup_data(data_dir_str: str) -> None: 14 | """Splits the full dataset into a training set and a validation set. 15 | 16 | The training set will have images that will be used to train the model. 
Even if 17 | they are already labelled, we will use them as unlabelled data and use Lightly 18 | to label them. The validation set will be used to evaluate the model's performance. 19 | """ 20 | data_dir = pathlib.Path(data_dir_str) 21 | files = os.listdir(data_dir) 22 | train_set_path = pathlib.Path("train_set") 23 | val_set_path = pathlib.Path("val_set") 24 | train_set_path.mkdir() 25 | val_set_path.mkdir() 26 | 27 | dataset: Dict[str, List[pathlib.PosixPath]] = {} 28 | train_set = [] 29 | val_set = [] 30 | pattern = re.compile("([a-z]+)[0-9]+") 31 | 32 | for file in files: 33 | filepath = data_dir / file 34 | regex_match = pattern.match(filepath.stem) 35 | category = regex_match.group(1) 36 | 37 | if dataset.get(category) is None: 38 | dataset[category] = [] 39 | dataset[category].append(filepath) 40 | 41 | for category, samples in dataset.items(): 42 | sample_idx = list(range(len(samples))) 43 | train_idx = np.random.choice( 44 | sample_idx, int(len(sample_idx) * 0.8), replace=False 45 | ) 46 | val_idx = [idx for idx in sample_idx if idx not in train_idx] 47 | 48 | for idx in train_idx: 49 | filepath: pathlib.PosixPath = samples[idx] 50 | new_path = train_set_path / filepath.name 51 | os.rename(filepath, new_path) 52 | train_set.append({"path": str(new_path), "label": category}) 53 | for idx in val_idx: 54 | filepath = samples[idx] 55 | new_path = val_set_path / filepath.name 56 | os.rename(filepath, new_path) 57 | val_set.append({"path": str(new_path), "label": category}) 58 | 59 | with open("full_train.json", "w") as f: 60 | json.dump(train_set, f) 61 | with open("val.json", "w") as f: 62 | json.dump(val_set, f) 63 | 64 | 65 | setup_data("./dataset2/") 66 | -------------------------------------------------------------------------------- /source/train_model_1.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from dataset import get_dataloader 4 | from helper import dump_lightly_predictions, load_data, prepare_training_data 5 | from model import Model, get_optimizer, predict, train_model 6 | 7 | SEED = 42 8 | torch.manual_seed(SEED) 9 | np.random.seed(SEED) 10 | 11 | 12 | def main(): 13 | prepare_training_data("./annotation-0.json") 14 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 15 | epochs = 25 16 | 17 | train_X, train_y, _ = load_data("./train.json") 18 | val_X, val_y, _ = load_data("./val.json") 19 | dataloaders = { 20 | "train": get_dataloader(train_X, train_y, batch_size=3), 21 | "val": get_dataloader(val_X, val_y, batch_size=50), 22 | } 23 | 24 | model = Model().to(device) 25 | loss_func = torch.nn.CrossEntropyLoss() 26 | optimizer = get_optimizer(model) 27 | model = train_model( 28 | model=model, 29 | loss_func=loss_func, 30 | optimizer=optimizer, 31 | dataloaders=dataloaders, 32 | device=device, 33 | num_epochs=epochs, 34 | ) 35 | 36 | train_X, train_y, filenames = load_data("./full_train.json") 37 | dataloader = get_dataloader(train_X, train_y, batch_size=50, shuffle=False) 38 | predictions = predict(model, device, dataloader) 39 | dump_lightly_predictions(filenames, predictions) 40 | 41 | 42 | main() 43 | -------------------------------------------------------------------------------- /source/train_model_2.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | from dataset import get_dataloader 4 | from helper import load_data, prepare_training_data 5 | from model import Model, get_optimizer, 
train_model 6 | 7 | SEED = 42 8 | torch.manual_seed(SEED) 9 | np.random.seed(SEED) 10 | 11 | 12 | def main(): 13 | prepare_training_data("./annotation-1.json") 14 | device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") 15 | epochs = 25 16 | 17 | train_X, train_y, _ = load_data("./train.json") 18 | val_X, val_y, _ = load_data("./val.json") 19 | dataloaders = { 20 | "train": get_dataloader(train_X, train_y, batch_size=3), 21 | "val": get_dataloader(val_X, val_y, batch_size=50), 22 | } 23 | 24 | model = Model().to(device) 25 | loss_func = torch.nn.CrossEntropyLoss() 26 | optimizer = get_optimizer(model) 27 | model = train_model( 28 | model=model, 29 | loss_func=loss_func, 30 | optimizer=optimizer, 31 | dataloaders=dataloaders, 32 | device=device, 33 | num_epochs=epochs, 34 | ) 35 | 36 | 37 | main() 38 | -------------------------------------------------------------------------------- /tutorial/images/init-selection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lightly-ai/Lightly_LabelStudio_AL/2e5bc2e4600a14d34872aefef510fe19116720ad/tutorial/images/init-selection.png -------------------------------------------------------------------------------- /tutorial/images/ls-add-storage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lightly-ai/Lightly_LabelStudio_AL/2e5bc2e4600a14d34872aefef510fe19116720ad/tutorial/images/ls-add-storage.png -------------------------------------------------------------------------------- /tutorial/images/ls-interface.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lightly-ai/Lightly_LabelStudio_AL/2e5bc2e4600a14d34872aefef510fe19116720ad/tutorial/images/ls-interface.png -------------------------------------------------------------------------------- /tutorial/images/ls-sync-storage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lightly-ai/Lightly_LabelStudio_AL/2e5bc2e4600a14d34872aefef510fe19116720ad/tutorial/images/ls-sync-storage.png -------------------------------------------------------------------------------- /tutorial/images/second-selection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lightly-ai/Lightly_LabelStudio_AL/2e5bc2e4600a14d34872aefef510fe19116720ad/tutorial/images/second-selection.png --------------------------------------------------------------------------------