├── .gitignore
├── LICENSE
├── README.md
├── captcha_model.py
├── prodigy_recipes.py
├── requirements.in
├── requirements.txt
├── run_on_image.py
└── train.py

/.gitignore:
--------------------------------------------------------------------------------
env/
input/
output/
__pycache__/
*.pyc

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2022 Open Knowledge Foundation Germany

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Frontex Captcha Solver

This repository contains the code for training a neural network that solves captchas.
It was developed to solve captchas created with [Jeff Atwood's ASP.NET Captcha Generation Library](https://www.codeproject.com/Articles/8751/A-CAPTCHA-Server-Control-for-ASP-NET), which Frontex uses on their [PAD portal](https://pad.frontex.europa.eu/Token/Create).

While it was developed for a specific captcha, it should work well for other captchas too.

## Installation

This project uses Pillow for image processing and PyTorch for its machine learning internals.
You can install all dependencies using

```
pip install -r requirements.txt
```

## Training

### Preparation

For training, you need to solve a few captchas by hand.
We had acceptable results with ~100 and good results with ~400 manually solved captchas.
Store them all in one directory (we assume it is named `input/` in the following text) with the captcha text as their filename stem.
If, for example, the text in the image is `ABC123` and it is a JPEG file, store it as `input/ABC123.jpg`.
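The training script derives the label for each image directly from this filename stem, so a typo in a filename silently becomes a wrong label. A quick pre-flight check can catch this; the following is a minimal sketch (not part of the repository) that assumes the default alphabet and letter count from `captcha_model.py`:

```python
import string
from pathlib import Path

# Default alphabet and length, mirroring CLASSES and LETTER_COUNT.
ALLOWED = set(string.digits + string.ascii_uppercase)
LENGTH = 5

for file in Path("input/").iterdir():
    if not file.is_file():
        continue
    label = file.stem
    if len(label) != LENGTH or not set(label) <= ALLOWED:
        print(f"Unexpected label {label!r} in {file.name}")
```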
#### Using prodigy

If you have a [prodigy](https://prodi.gy/) license and a collection of downloaded captcha images, you can use the recipes in `prodigy_recipes.py`:

```
prodigy image-caption -F prodigy_recipes.py DATASET_NAME DIRECTORY_WITH_CAPTCHAS_TO_CAPTION
# Then after tagging
prodigy write-images -F prodigy_recipes.py DATASET_NAME input/
```

### Model Training

Now that you have your input ready, you can start the training.
First, check that the settings in `captcha_model.py` are correct:
`CLASSES` should be all possible characters in the captchas.
`LETTER_COUNT` should be the number of characters per captcha.
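For example, for a hypothetical captcha with six characters that also uses lowercase letters, the settings would look like this (illustrative values, not the defaults shipped in this repository):

```python
import string

# Hypothetical settings for a six-character, mixed-case captcha.
CLASSES = list(string.digits + string.ascii_uppercase + string.ascii_lowercase)
LETTER_COUNT = 6
```

Note that the input size of the first linear layer in `captcha_model.Net` (currently hard-coded as `33856`) is tied to the pixel dimensions of the letter crops, so captchas of a different size may require adjusting it as well.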
"cuda:0" 72 | elif torch.backends.mps.is_available(): 73 | dev_name = "mps" 74 | else: 75 | dev_name = "cpu" 76 | 77 | device = torch.device(dev_name) 78 | return device 79 | -------------------------------------------------------------------------------- /prodigy_recipes.py: -------------------------------------------------------------------------------- 1 | # Based on https://github.com/explosion/prodigy-recipes/blob/master/image/image_caption/image_caption.py 2 | 3 | import prodigy 4 | from prodigy.components.loaders import Images 5 | from pathlib import Path 6 | from prodigy.components.filters import filter_duplicates 7 | import base64 8 | from prodigy import set_hashes 9 | 10 | from typing import List, Dict 11 | import prodigy 12 | from prodigy.components.db import Database, Dataset, Link, Example 13 | 14 | 15 | @prodigy.recipe("image-caption") 16 | def image_caption(dataset, images_path): 17 | """ 18 | Stream in images from a directory and allow captioning them by typing 19 | a caption in a text field. The caption is stored as the key "caption". 20 | """ 21 | stream = Images(images_path) 22 | stream = [set_hashes(eg) for eg in stream] 23 | stream = filter_duplicates(stream, by_input=True, by_task=True) 24 | 25 | blocks = [ 26 | {"view_id": "image"}, 27 | {"view_id": "text_input", "field_id": "caption", "field_autofocus": True}, 28 | ] 29 | return { 30 | "dataset": dataset, 31 | "stream": stream, 32 | "view_id": "blocks", 33 | "config": {"blocks": blocks}, 34 | # "exclude_by": "input", 35 | } 36 | 37 | 38 | @prodigy.recipe("write-images") 39 | def write_images(dataset: str, output: str): 40 | output_path = Path(output) 41 | output_path.mkdir(exist_ok=True) 42 | 43 | DB: Database = prodigy.components.db.connect() 44 | if dataset not in DB: 45 | raise ValueError(f"Dataset {dataset} not found!") 46 | 47 | dataset_id = Dataset.get(Dataset.name == dataset).id 48 | links = Link.select(Link.example).where(Link.dataset == dataset_id) 49 | to_delete: List[Link] = [] 50 | for link in links: 51 | content = link.example.load() 52 | caption = content["caption"] 53 | header, image_content = content["image"].split(",") 54 | content_type = header.split(":")[1].split(";")[0] 55 | extension = content_type.split("/")[1] 56 | with open(output_path / f"{caption}.{extension}", "wb") as f: 57 | f.write(base64.b64decode(image_content)) 58 | -------------------------------------------------------------------------------- /requirements.in: -------------------------------------------------------------------------------- 1 | pip-tools 2 | torch 3 | torchvision 4 | tqdm 5 | Pillow 6 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This file is autogenerated by pip-compile with python 3.9 3 | # To update, run: 4 | # 5 | # pip-compile requirements.in 6 | # 7 | build==0.8.0 8 | # via pip-tools 9 | certifi==2022.6.15 10 | # via requests 11 | charset-normalizer==2.1.0 12 | # via requests 13 | click==8.1.3 14 | # via pip-tools 15 | idna==3.3 16 | # via requests 17 | numpy==1.23.1 18 | # via torchvision 19 | packaging==21.3 20 | # via build 21 | pep517==0.13.0 22 | # via build 23 | pillow==9.2.0 24 | # via 25 | # -r requirements.in 26 | # torchvision 27 | pip-tools==6.8.0 28 | # via -r requirements.in 29 | pyparsing==3.0.9 30 | # via packaging 31 | requests==2.28.1 32 | # via torchvision 33 | tomli==2.0.1 34 | # via 35 | # build 36 | # pep517 37 | torch==1.12.0 38 | # via 39 | 

--------------------------------------------------------------------------------
/captcha_model.py:
--------------------------------------------------------------------------------
import string

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
from PIL import Image

# All characters that can appear in a captcha.
CLASSES = list(string.digits + string.ascii_uppercase)

# Number of characters per captcha.
LETTER_COUNT = 5


class Net(nn.Module):
    """A small convolutional network that classifies a single letter crop."""

    def __init__(self):
        super().__init__()
        width = 6
        self.conv1 = nn.Conv2d(1, width, 3, 1)
        self.conv2 = nn.Conv2d(width, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        # The input size of fc1 is tied to the dimensions of the letter crops.
        self.fc1 = nn.Linear(33856, 128)
        self.fc2 = nn.Linear(128, len(CLASSES))

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
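

# The network classifies one character at a time: split_letters cuts the
# captcha into letter_count equal-width vertical strips, one per character.
# This assumes the characters are roughly evenly spaced across the image.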
def split_letters(image, letter_count: int = LETTER_COUNT):
    w, h = image.size
    part_width = w / letter_count
    for i in range(letter_count):
        yield image.crop((i * part_width, 0, i * part_width + part_width, h))


def load_net(path: str) -> Net:
    net = Net()
    net.load_state_dict(torch.load(path, map_location=get_device()))
    return net


def solve_image(net: Net, image: Image.Image) -> str:
    # Grayscale and normalize the letter crops, matching the training transform.
    image = image.convert("L")
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]
    )
    with torch.no_grad():
        net.eval()
        images = [transform(x) for x in split_letters(image, letter_count=LETTER_COUNT)]

        outputs = net(torch.stack(images))
        predictions = outputs.argmax(dim=1, keepdim=True)
    return "".join(CLASSES[pred] for pred in predictions)


def get_device():
    if torch.cuda.is_available():
        dev_name = "cuda:0"
    elif torch.backends.mps.is_available():
        dev_name = "mps"
    else:
        dev_name = "cpu"

    device = torch.device(dev_name)
    return device

--------------------------------------------------------------------------------
/prodigy_recipes.py:
--------------------------------------------------------------------------------
# Based on https://github.com/explosion/prodigy-recipes/blob/master/image/image_caption/image_caption.py

import base64
from pathlib import Path

import prodigy
from prodigy import set_hashes
from prodigy.components.db import Database, Dataset, Link
from prodigy.components.filters import filter_duplicates
from prodigy.components.loaders import Images


@prodigy.recipe("image-caption")
def image_caption(dataset, images_path):
    """
    Stream in images from a directory and allow captioning them by typing
    a caption in a text field. The caption is stored as the key "caption".
    """
    stream = Images(images_path)
    stream = [set_hashes(eg) for eg in stream]
    stream = filter_duplicates(stream, by_input=True, by_task=True)

    blocks = [
        {"view_id": "image"},
        {"view_id": "text_input", "field_id": "caption", "field_autofocus": True},
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
        # "exclude_by": "input",
    }


@prodigy.recipe("write-images")
def write_images(dataset: str, output: str):
    """
    Write the images of a dataset to a directory, using each example's
    caption as the filename stem.
    """
    output_path = Path(output)
    output_path.mkdir(exist_ok=True)

    DB: Database = prodigy.components.db.connect()
    if dataset not in DB:
        raise ValueError(f"Dataset {dataset} not found!")

    dataset_id = Dataset.get(Dataset.name == dataset).id
    links = Link.select(Link.example).where(Link.dataset == dataset_id)
    for link in links:
        content = link.example.load()
        caption = content["caption"]
        # The image is stored as a base64 data URI; split off the header to
        # recover the content type and the raw image bytes.
        header, image_content = content["image"].split(",")
        content_type = header.split(":")[1].split(";")[0]
        extension = content_type.split("/")[1]
        with open(output_path / f"{caption}.{extension}", "wb") as f:
            f.write(base64.b64decode(image_content))

--------------------------------------------------------------------------------
/requirements.in:
--------------------------------------------------------------------------------
pip-tools
torch
torchvision
tqdm
Pillow

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
#
# This file is autogenerated by pip-compile with python 3.9
# To update, run:
#
#    pip-compile requirements.in
#
build==0.8.0
    # via pip-tools
certifi==2022.6.15
    # via requests
charset-normalizer==2.1.0
    # via requests
click==8.1.3
    # via pip-tools
idna==3.3
    # via requests
numpy==1.23.1
    # via torchvision
packaging==21.3
    # via build
pep517==0.13.0
    # via build
pillow==9.2.0
    # via
    #   -r requirements.in
    #   torchvision
pip-tools==6.8.0
    # via -r requirements.in
pyparsing==3.0.9
    # via packaging
requests==2.28.1
    # via torchvision
tomli==2.0.1
    # via
    #   build
    #   pep517
torch==1.12.0
    # via
    #   -r requirements.in
    #   torchvision
torchvision==0.13.0
    # via -r requirements.in
tqdm==4.64.0
    # via -r requirements.in
typing-extensions==4.3.0
    # via
    #   torch
    #   torchvision
urllib3==1.26.11
    # via requests
wheel==0.37.1
    # via pip-tools

# The following packages are considered to be unsafe in a requirements file:
# pip
# setuptools

--------------------------------------------------------------------------------
/run_on_image.py:
--------------------------------------------------------------------------------
import argparse
from pathlib import Path

from PIL import Image

import captcha_model

parser = argparse.ArgumentParser()
parser.add_argument("model_path", type=Path)
parser.add_argument("image_path", type=Path)
args = parser.parse_args()

# Load the trained network and print its prediction for the given image.
net = captcha_model.load_net(args.model_path)
image = Image.open(args.image_path)
result = captcha_model.solve_image(net, image)
print(result)

--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
import argparse
import math
from pathlib import Path

import torch
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as transforms
import tqdm
from PIL import Image

import captcha_model


def parse_cmdline():
    parser = argparse.ArgumentParser()
    parser.add_argument("input", type=Path)
    parser.add_argument("output", type=Path)
    parser.add_argument("--resume-training", action="store_true")
    return parser.parse_args()


def get_manually_classified(input_dir: Path):
    # Yield (label, image) pairs; the label is the filename stem.
    for file in tqdm.tqdm(input_dir.iterdir()):
        if not file.is_file():
            continue
        label = file.stem
        img = Image.open(file).convert("L")
        yield (label, img)


def get_class(letter):
    return torch.tensor(captcha_model.CLASSES.index(letter))


transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]
)


def get_letters_separate(device, input_dir):
    # Pair every single-letter crop with the class index of its character.
    for label, image in get_manually_classified(input_dir):
        for letter, img in zip(
            label,
            captcha_model.split_letters(
                image, letter_count=captcha_model.LETTER_COUNT
            ),
        ):
            tensor = transform(img)
            yield tensor.to(device), get_class(letter).to(device)


class CaptchaDataset(torch.utils.data.Dataset):
    def __init__(self, start_perc, end_perc, device, input_dir):
        super().__init__()
        data = list(get_letters_separate(device, input_dir))
        start = math.floor(len(data) * start_perc)
        end = math.floor(len(data) * end_perc)
        self.data = data[start:end]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def train(net, optimizer, train_loader, output_dir):
    net.train()
    for data, target in tqdm.tqdm(
        train_loader, position=1, leave=False, desc="Batch"
    ):
        optimizer.zero_grad()
        output = net(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
    # Checkpoint the model and optimizer after every epoch.
    torch.save(net.state_dict(), output_dir / "model.pth")
    torch.save(optimizer.state_dict(), output_dir / "optimizer.pth")


def test(net, test_loader):
    net.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            output = net(data)
            test_loss += F.nll_loss(output, target, reduction="sum").item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().cpu()
    test_loss /= len(test_loader.dataset)
    test_len = len(test_loader.dataset)
    return "Test: loss {:.2f}, Acc: {:.1f}%".format(
        test_loss,
        100.0 * correct / test_len,
    )


if __name__ == "__main__":
    # Training settings
    n_epochs = 200
    batch_size_train = 128
    batch_size_test = 1000
    learning_rate = 0.01
    momentum = 0.5

    random_seed = 1

    args = parse_cmdline()
    device = captcha_model.get_device()
    net = captcha_model.Net().to(device)
    args.output.mkdir(exist_ok=True)

    torch.manual_seed(random_seed)

    # The first 10% of the letter crops are held out for testing;
    # the remaining 90% are used for training.
    test_loader = torch.utils.data.DataLoader(
        CaptchaDataset(start_perc=0, end_perc=0.1, device=device, input_dir=args.input),
        batch_size=batch_size_test,
        shuffle=True,
    )
    train_loader = torch.utils.data.DataLoader(
        CaptchaDataset(start_perc=0.1, end_perc=1, device=device, input_dir=args.input),
        batch_size=batch_size_train,
        shuffle=True,
    )

    optimizer = optim.SGD(net.parameters(), lr=learning_rate, momentum=momentum)
    if args.resume_training:
        net.load_state_dict(torch.load(args.output / "model.pth"))
        optimizer.load_state_dict(torch.load(args.output / "optimizer.pth"))

    for epoch in (bar := tqdm.tqdm(range(1, n_epochs + 1), position=0)):
        train(net, optimizer, train_loader, args.output)
        test_status = test(net, test_loader)
        bar.set_description(test_status)

--------------------------------------------------------------------------------