├── .gitignore ├── LICENSE ├── README.md ├── captcha_model.py ├── prodigy_recipes.py ├── requirements.in ├── requirements.txt ├── run_on_image.py └── train.py /.gitignore: -------------------------------------------------------------------------------- 1 | env/ 2 | input/ 3 | output/ 4 | __pycache__/ 5 | *.pyc 6 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Open Knowledge Foundation Germany 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Frontex Captcha Solver 2 | 3 | This repository contains the code for training a neural network for solving captchas. 4 | It was developed to solve captchas created with [Jeff Atwoods ASP.NET Captcha Generation Library](https://www.codeproject.com/Articles/8751/A-CAPTCHA-Server-Control-for-ASP-NET), which Frontex uses on their [PAD portal](https://pad.frontex.europa.eu/Token/Create). 5 | 6 | While it was developed for a specific captcha, it should work well for other captchas too. 7 | 8 | ## Installation 9 | 10 | This project uses Pillow for image processing and pytorch for it's machine learning internals. 11 | You can install all dependencies using 12 | 13 | ``` 14 | pip install -r requirements.txt 15 | ``` 16 | 17 | ## Training 18 | 19 | ### Preparation 20 | 21 | For training, you need to solve a few captchas by hand. 22 | We had acceptable results with ~100 and good with ~400 manually solved captchas. 23 | Store them all in one directory (we assume it is named `input/` in the following text) with the captcha text as their filename stem. 24 | If for example the text in the image is `ABC123` and it is a JPEG file, store it as `input/ABC123.jpg` 25 | 26 | #### Using prodigy 27 | 28 | If you have a [prodigy](https://prodi.gy/) license and a collection of downloaded captcha images, you can use the recipes in `prodigy_recipes.py`: 29 | 30 | ``` 31 | prodigy image-caption -F prodigy_recipes.py DATASET_NAME DIRECTORY_WITH_CAPTCHAS_TO_CAPTION 32 | # Then after tagging 33 | prodigy write-images -F prodigy_recipes.py DATASET_NAME input/ 34 | ``` 35 | 36 | ### Model Training 37 | 38 | Now that you have your input ready, you can start the training. 39 | First, check that the settings in the model.py are correct: 40 | `CLASSES` should be all possible characters in the captchas. 41 | `LETTER_COUNT` should be the number of characters per captcha. 42 | 43 | If those setting are correct, simply run 44 | 45 | ``` 46 | python train.py input/ output/ 47 | ``` 48 | 49 | This will train the model on the input data. 50 | After a few minutes it will stop and write the trained model to `output/model.pth`. 51 | 52 | It will also save the internal state of the optimizer to `output/optimizer.pth`. 53 | This can be used to resume training the model. 54 | To do that add the `--resume-training` flag to the command line above. 55 | 56 | ### Evaluating the model 57 | 58 | During training the script will continously output the loss and accuracy of the model. 59 | 60 | To try it on a single image, you can use the `run_on_image.py`-script: 61 | 62 | ``` 63 | python run_on_image.py output/model.pth PATH_TO_IMAGE 64 | ``` 65 | 66 | ## Usage 67 | 68 | To use the model, you can either shell-out to `run_on_image.py` or use the helper functions in captcha.py. 69 | 70 | You need to load the model using `load_net` and then run `solve_image` on it: 71 | 72 | ```python 73 | >>> import captcha_model 74 | >>> from PIL import Image 75 | >>> net = captcha_model.load_net("output/model.pth") 76 | >>> image = Image.open("PATH_TO_IMAGE") 77 | >>> result = captcha_model.solve_image(net, image) 78 | ``` 79 | -------------------------------------------------------------------------------- /captcha_model.py: -------------------------------------------------------------------------------- 1 | import string 2 | 3 | import torch 4 | import torch.nn as nn 5 | import torch.nn.functional as F 6 | import torchvision.transforms as transforms 7 | from PIL import Image 8 | 9 | CLASSES = list(string.digits + string.ascii_uppercase) 10 | 11 | LETTER_COUNT = 5 12 | 13 | 14 | class Net(nn.Module): 15 | def __init__(self): 16 | super().__init__() 17 | width = 6 18 | self.conv1 = nn.Conv2d(1, width, 3, 1) 19 | self.conv2 = nn.Conv2d(width, 64, 3, 1) 20 | self.dropout1 = nn.Dropout(0.25) 21 | self.dropout2 = nn.Dropout(0.5) 22 | self.fc1 = nn.Linear(33856, 128) 23 | self.fc2 = nn.Linear(128, len(CLASSES)) 24 | 25 | def forward(self, x): 26 | x = x 27 | x = self.conv1(x) 28 | x = F.relu(x) 29 | x = self.conv2(x) 30 | x = F.relu(x) 31 | x = F.max_pool2d(x, 2) 32 | x = self.dropout1(x) 33 | x = torch.flatten(x, 1) 34 | x = self.fc1(x) 35 | x = F.relu(x) 36 | x = self.dropout2(x) 37 | x = self.fc2(x) 38 | output = F.log_softmax(x, dim=1) 39 | return output 40 | 41 | 42 | def split_letters(image, letter_count: int = 5): 43 | w, h = image.size 44 | part_width = w / letter_count 45 | for i in range(letter_count): 46 | yield image.crop((i * part_width, 0, i * part_width + part_width, h)) 47 | 48 | 49 | def load_net(path: str) -> Net: 50 | net = Net() 51 | net.load_state_dict(torch.load(path, map_location=get_device())) 52 | return net 53 | 54 | 55 | def solve_image(net: Net, image: Image) -> str: 56 | image = image.convert("L") 57 | transform = transforms.Compose( 58 | [transforms.ToTensor(), transforms.Normalize((0.5), (0.5))] 59 | ) 60 | with torch.no_grad(): 61 | net.eval() 62 | images = [transform(x) for x in split_letters(image, letter_count=LETTER_COUNT)] 63 | 64 | outputs = net(torch.stack(images)) 65 | predictions = outputs.argmax(dim=1, keepdim=True) 66 | return "".join(CLASSES[pred] for pred in predictions) 67 | 68 | 69 | def get_device(): 70 | if torch.cuda.is_available(): 71 | dev_name = "cuda:0" 72 | elif torch.backends.mps.is_available(): 73 | dev_name = "mps" 74 | else: 75 | dev_name = "cpu" 76 | 77 | device = torch.device(dev_name) 78 | return device 79 | -------------------------------------------------------------------------------- /prodigy_recipes.py: -------------------------------------------------------------------------------- 1 | # Based on https://github.com/explosion/prodigy-recipes/blob/master/image/image_caption/image_caption.py 2 | 3 | import prodigy 4 | from prodigy.components.loaders import Images 5 | from pathlib import Path 6 | from prodigy.components.filters import filter_duplicates 7 | import base64 8 | from prodigy import set_hashes 9 | 10 | from typing import List, Dict 11 | import prodigy 12 | from prodigy.components.db import Database, Dataset, Link, Example 13 | 14 | 15 | @prodigy.recipe("image-caption") 16 | def image_caption(dataset, images_path): 17 | """ 18 | Stream in images from a directory and allow captioning them by typing 19 | a caption in a text field. The caption is stored as the key "caption". 20 | """ 21 | stream = Images(images_path) 22 | stream = [set_hashes(eg) for eg in stream] 23 | stream = filter_duplicates(stream, by_input=True, by_task=True) 24 | 25 | blocks = [ 26 | {"view_id": "image"}, 27 | {"view_id": "text_input", "field_id": "caption", "field_autofocus": True}, 28 | ] 29 | return { 30 | "dataset": dataset, 31 | "stream": stream, 32 | "view_id": "blocks", 33 | "config": {"blocks": blocks}, 34 | # "exclude_by": "input", 35 | } 36 | 37 | 38 | @prodigy.recipe("write-images") 39 | def write_images(dataset: str, output: str): 40 | output_path = Path(output) 41 | output_path.mkdir(exist_ok=True) 42 | 43 | DB: Database = prodigy.components.db.connect() 44 | if dataset not in DB: 45 | raise ValueError(f"Dataset {dataset} not found!") 46 | 47 | dataset_id = Dataset.get(Dataset.name == dataset).id 48 | links = Link.select(Link.example).where(Link.dataset == dataset_id) 49 | to_delete: List[Link] = [] 50 | for link in links: 51 | content = link.example.load() 52 | caption = content["caption"] 53 | header, image_content = content["image"].split(",") 54 | content_type = header.split(":")[1].split(";")[0] 55 | extension = content_type.split("/")[1] 56 | with open(output_path / f"{caption}.{extension}", "wb") as f: 57 | f.write(base64.b64decode(image_content)) 58 | -------------------------------------------------------------------------------- /requirements.in: -------------------------------------------------------------------------------- 1 | pip-tools 2 | torch 3 | torchvision 4 | tqdm 5 | Pillow 6 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This file is autogenerated by pip-compile with python 3.9 3 | # To update, run: 4 | # 5 | # pip-compile requirements.in 6 | # 7 | build==0.8.0 8 | # via pip-tools 9 | certifi==2022.6.15 10 | # via requests 11 | charset-normalizer==2.1.0 12 | # via requests 13 | click==8.1.3 14 | # via pip-tools 15 | idna==3.3 16 | # via requests 17 | numpy==1.23.1 18 | # via torchvision 19 | packaging==21.3 20 | # via build 21 | pep517==0.13.0 22 | # via build 23 | pillow==9.2.0 24 | # via 25 | # -r requirements.in 26 | # torchvision 27 | pip-tools==6.8.0 28 | # via -r requirements.in 29 | pyparsing==3.0.9 30 | # via packaging 31 | requests==2.28.1 32 | # via torchvision 33 | tomli==2.0.1 34 | # via 35 | # build 36 | # pep517 37 | torch==1.12.0 38 | # via 39 | # -r requirements.in 40 | # torchvision 41 | torchvision==0.13.0 42 | # via -r requirements.in 43 | tqdm==4.64.0 44 | # via -r requirements.in 45 | typing-extensions==4.3.0 46 | # via 47 | # torch 48 | # torchvision 49 | urllib3==1.26.11 50 | # via requests 51 | wheel==0.37.1 52 | # via pip-tools 53 | 54 | # The following packages are considered to be unsafe in a requirements file: 55 | # pip 56 | # setuptools 57 | -------------------------------------------------------------------------------- /run_on_image.py: -------------------------------------------------------------------------------- 1 | import captcha_model 2 | import argparse 3 | from pathlib import Path 4 | from PIL import Image 5 | 6 | parser = argparse.ArgumentParser() 7 | parser.add_argument("model_path", type=Path) 8 | parser.add_argument("image_path", type=Path) 9 | args = parser.parse_args() 10 | 11 | net = captcha_model.load_net(args.model_path) 12 | image = Image.open(args.image_path) 13 | result = captcha_model.solve_image(net, image) 14 | print(result) 15 | -------------------------------------------------------------------------------- /train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from PIL import Image 3 | from pathlib import Path 4 | import numpy as np 5 | import math 6 | import tqdm 7 | import torch 8 | import captcha_model 9 | 10 | import torch.nn as nn 11 | import torch.nn.functional as F 12 | import torch.optim as optim 13 | import torchvision.transforms as transforms 14 | 15 | 16 | def parse_cmdline(): 17 | parser = argparse.ArgumentParser() 18 | parser.add_argument("input", type=Path) 19 | parser.add_argument("output", type=Path) 20 | parser.add_argument("--resume-training", action="store_true") 21 | return parser.parse_args() 22 | 23 | 24 | def get_manually_classified(input_dir: Path): 25 | manual_classified = [] 26 | for file in tqdm.tqdm(input_dir.iterdir()): 27 | if not file.is_file(): 28 | continue 29 | label = file.stem 30 | img = Image.open(file).convert("L") 31 | yield (label, img) 32 | 33 | 34 | def split_letters(image, letter_count): 35 | w, h = image.size 36 | part_width = w / letter_count 37 | parts = [] 38 | for i in range(letter_count): 39 | yield image.crop((i * part_width, 0, i * part_width + part_width, h)) 40 | 41 | 42 | def get_class(letter): 43 | return torch.tensor(captcha_model.CLASSES.index(letter)) 44 | 45 | 46 | transform = transforms.Compose( 47 | [transforms.ToTensor(), transforms.Normalize((0.5), (0.5))] 48 | ) 49 | 50 | 51 | def get_letters_separate(device, input_dir): 52 | for label, image in get_manually_classified(input_dir): 53 | for letter, img in zip( 54 | label, split_letters(image, letter_count=captcha_model.LETTER_COUNT) 55 | ): 56 | tensor = transform(img) 57 | yield tensor.to(device), get_class(letter).to(device) 58 | 59 | 60 | class CaptchaDataset(torch.utils.data.Dataset): 61 | def __init__(self, start_perc, end_perc, device, input_dir): 62 | super().__init__() 63 | data = list(get_letters_separate(device, input_dir)) 64 | start = math.floor(len(data) * start_perc) 65 | end = math.floor(len(data) * end_perc) 66 | self.data = data[start:end] 67 | 68 | def __len__(self): 69 | return len(self.data) 70 | 71 | def __getitem__(self, idx): 72 | return self.data[idx] 73 | 74 | 75 | def train(net, train_loader, output_dir, epoch): 76 | net.train() 77 | for (data, target) in tqdm.tqdm( 78 | train_loader, position=1, leave=False, desc="Batch" 79 | ): 80 | optimizer.zero_grad() 81 | output = net(data) 82 | loss = F.nll_loss(output, target) 83 | loss.backward() 84 | optimizer.step() 85 | torch.save(net.state_dict(), output_dir / "model.pth") 86 | torch.save(optimizer.state_dict(), output_dir / "optimizer.pth") 87 | 88 | 89 | def test(net, test_loader): 90 | net.eval() 91 | test_loss = 0 92 | correct = 0 93 | with torch.no_grad(): 94 | for data, target in test_loader: 95 | output = net(data) 96 | test_loss += F.nll_loss(output, target, reduction="sum").item() 97 | pred = output.argmax(dim=1, keepdim=True) 98 | correct += pred.eq(target.view_as(pred)).sum().cpu() 99 | test_loss /= len(test_loader.dataset) 100 | test_len = len(test_loader.dataset) 101 | return "Test: loss {:.2f}, Acc: {:.1f}%".format( 102 | test_loss, 103 | 100.0 * correct / test_len, 104 | ) 105 | 106 | 107 | if __name__ == "__main__": 108 | # Training settings 109 | n_epochs = 200 110 | batch_size_train = 128 111 | batch_size_test = 1000 112 | learning_rate = 0.01 113 | momentum = 0.5 114 | 115 | random_seed = 1 116 | 117 | args = parse_cmdline() 118 | device = captcha_model.get_device() 119 | net = captcha_model.Net().to(device) 120 | args.output.mkdir(exist_ok=True) 121 | 122 | torch.manual_seed(random_seed) 123 | 124 | test_loader = torch.utils.data.DataLoader( 125 | CaptchaDataset(start_perc=0, end_perc=0.1, device=device, input_dir=args.input), 126 | batch_size=batch_size_test, 127 | shuffle=True, 128 | ) 129 | train_loader = torch.utils.data.DataLoader( 130 | CaptchaDataset(start_perc=0.1, end_perc=1, device=device, input_dir=args.input), 131 | batch_size=batch_size_train, 132 | shuffle=True, 133 | ) 134 | 135 | optimizer = optim.SGD(net.parameters(), lr=learning_rate, momentum=momentum) 136 | if args.resume_training: 137 | net.load_state_dict(torch.load(args.output / "model.pth")) 138 | optimizer.load_state_dict(torch.load(args.output / "optimizer.pth")) 139 | 140 | test(net, test_loader) 141 | for epoch in (bar := tqdm.tqdm(range(1, n_epochs + 1), position=0)): 142 | train(net, train_loader, args.output, epoch) 143 | test_status = test(net, test_loader) 144 | bar.set_description(test_status) 145 | --------------------------------------------------------------------------------