├── .gitignore
├── LICENSE.txt
├── README.rst
├── Vagrantfile
├── ocr_with_django
    ├── assets
    │   ├── css
    │   │   └── ocr_form.css
    │   └── js
    │   │   └── ocr_form.js
    ├── documents
    │   ├── __init__.py
    │   ├── admin.py
    │   ├── apps.py
    │   ├── migrations
    │   │   └── __init__.py
    │   ├── models.py
    │   ├── templates
    │   │   └── documents
    │   │   │   └── ocr_form.html
    │   ├── tests.py
    │   └── views.py
    ├── manage.py
    └── ocr_with_django
    │   ├── __init__.py
    │   ├── settings.py
    │   ├── urls.py
    │   └── wsgi.py
├── requirements.txt
└── screenshot.png


/.gitignore:
--------------------------------------------------------------------------------
1 | .vagrant
2 | *.log
3 | .idea
4 | __pycache__
5 | *.pyc
6 | *.egg-info
7 | *.sqlite3
8 | static/
9 |
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | Copyright (c) 2016 Agustin Barto
2 |
3 |
4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
5 |
6 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
7 |
8 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
9 |
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | ===============
2 | OCR with Django
3 | ===============
4 |
5 | The purpose of this project is to illustrate how to implement a simple OCR web form using Django and tesserocr. Tesserocr is a Python wrapper for the Tesseract C++ API.
6 |
7 | Installation
8 | ------------
9 |
10 | Tesserocr requires fairly recent versions of tesseract-ocr and leptonica. On Ubuntu these can be installed with: ::
11 |
12 |     $ apt install tesseract-ocr libtesseract-dev libleptonica-dev
13 |
14 | Depending on your environment, you might have to install these packages from source. Follow their respective documentation for instructions. Next, install the project's requirements: ::
15 |
16 |     (venv) $ pip3 install Cython==0.24.1
17 |     (venv) $ pip3 install -r ocr_with_django/requirements.txt
18 |
19 | and run the necessary steps to set up the Django site: ::
20 |
21 |     (venv) $ cd ocr_with_django/
22 |     (venv) $ python manage.py migrate
23 |     (venv) $ python manage.py collectstatic --noinput
24 |
25 | We've included a ``Vagrantfile`` so you can see the site in action yourself.
26 |
27 | OcrView
28 | -------
29 |
30 | The OCR process is done in the ``OcrView`` view: ::
31 |
32 |     # documents/views.py
33 |
34 |     class OcrView(View):
35 |         def post(self, request, *args, **kwargs):
36 |             with PyTessBaseAPI() as api:
37 |                 with Image.open(request.FILES['image']) as image:
38 |                     sharpened_image = image.filter(ImageFilter.SHARPEN)
39 |                     api.SetImage(sharpened_image)
40 |                     utf8_text = api.GetUTF8Text()
41 |
42 |             return JsonResponse({'utf8_text': utf8_text})
43 |
44 | We take the uploaded image, process it using a Pillow filter, and pass the result along to the Tesseract OCR API through tesserocr.
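To try the view without the browser form, the upload can also be sent from a small Python script. The following is only a sketch using the standard library: the helper names ``build_ocr_request`` and ``recognize`` are hypothetical, and the default ``base_url`` assumes the Vagrant setup, where guest port 80 is forwarded to host port 8000. The only parts taken from the project are the ``/ocr/`` URL, the ``image`` field name read by ``OcrView``, and the ``utf8_text`` key of the JSON response.

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def build_ocr_request(image_bytes, filename, base_url="http://localhost:8000"):
    """Build a multipart/form-data POST request for the /ocr/ endpoint.

    base_url is an assumption; adjust it to wherever the site is served.
    """
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    # The part name must be "image": OcrView reads request.FILES['image'].
    head = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="image"; filename="{f}"\r\n'
        "Content-Type: {t}\r\n\r\n"
    ).format(b=boundary, f=os.path.basename(filename), t=content_type)
    tail = "\r\n--{b}--\r\n".format(b=boundary)
    body = head.encode("utf-8") + image_bytes + tail.encode("ascii")
    return urllib.request.Request(
        base_url + "/ocr/",
        data=body,
        headers={"Content-Type": "multipart/form-data; boundary=" + boundary},
        method="POST",
    )


def recognize(image_path, base_url="http://localhost:8000"):
    """Upload an image file and return the recognized text (needs a running site)."""
    with open(image_path, "rb") as image_file:
        request = build_ocr_request(image_file.read(), image_path, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["utf8_text"]
```

With the site running, ``print(recognize("scan.png"))`` (``scan.png`` being any local image) should print the text returned by the view. No CSRF token is needed because the view is wrapped in ``csrf_exempt``.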
45 |
46 | We tried to keep the view as simple as possible (no Form, no validation) to focus only on the OCR process. If you read the PyTessBaseAPI docstrings you'll see that there are tons of things you can do with the image and the recognition result.
47 |
--------------------------------------------------------------------------------
/Vagrantfile:
--------------------------------------------------------------------------------
1 | # -*- mode: ruby -*-
2 | # vi: set ft=ruby :
3 |
4 | Vagrant.configure(2) do |config|
5 |   config.vm.box = "ubuntu/xenial64"
6 |
7 |   config.vm.hostname = "ocr-with-django.local"
8 |
9 |   config.vm.network "forwarded_port", guest: 80, host: 8000
10 |   config.vm.network "forwarded_port", guest: 8000, host: 8001
11 |
12 |   config.vm.synced_folder ".", "/home/ubuntu/ocr_with_django/"
13 |
14 |   config.vm.provider "virtualbox" do |vb|
15 |     vb.memory = "1024"
16 |     vb.cpus = 1
17 |     vb.name = "ocr-with-django"
18 |   end
19 |
20 |   config.vm.provision "shell", inline: <<-SHELL
21 |     apt-get update
22 |     apt-get install -y git build-essential python3 python3.5-venv python3-dev nginx supervisor
23 |     apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev
24 |   SHELL
25 |
26 |   config.vm.provision "shell", privileged: false, inline: <<-SHELL
27 |     pyvenv-3.5 --without-pip ocr_with_django_venv
28 |     source ocr_with_django_venv/bin/activate
29 |     curl --silent --show-error --retry 5 https://bootstrap.pypa.io/get-pip.py | python
30 |
31 |     pip install Cython==0.24.1
32 |     pip install -r ocr_with_django/requirements.txt
33 |
34 |     cd ocr_with_django/ocr_with_django/
35 |
36 |     python manage.py migrate
37 |     python manage.py collectstatic --noinput
38 |   SHELL
39 |
40 |   config.vm.provision "shell", inline: <<-SHELL
41 |     echo '
42 | upstream ocr_with_django_upstream {
43 |     server 127.0.0.1:8000 fail_timeout=0;
44 | }
45 |
46 | server {
47 |     listen 80;
48 |     server_name localhost;
49 |
50 |     client_max_body_size 4G;
51 |
52 |     access_log /home/ubuntu/ocr_with_django/nginx_access.log;
53 |     error_log /home/ubuntu/ocr_with_django/nginx_error.log;
54 |
55 |     location /static/ {
56 |         alias /home/ubuntu/ocr_with_django/ocr_with_django/static/;
57 |     }
58 |
59 |     location / {
60 |         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
61 |         proxy_set_header Host $http_host;
62 |         proxy_redirect off;
63 |         if (!-f $request_filename) {
64 |             proxy_pass http://ocr_with_django_upstream;
65 |             break;
66 |         }
67 |     }
68 | }
69 |     ' > /etc/nginx/conf.d/ocr_with_django.conf
70 |     /bin/systemctl restart nginx.service
71 |   SHELL
72 |
73 |   config.vm.provision "shell", inline: <<-SHELL
74 |     echo '
75 | [program:ocr_with_django]
76 | user = ubuntu
77 | directory = /home/ubuntu/ocr_with_django/ocr_with_django/
78 | command = /home/ubuntu/ocr_with_django_venv/bin/gunicorn ocr_with_django.wsgi
79 | autostart = true
80 | autorestart = true
81 | stderr_logfile = /home/ubuntu/ocr_with_django/gunicorn_stderr.log
82 | stdout_logfile = /home/ubuntu/ocr_with_django/gunicorn_stdout.log
83 | stopsignal = INT
84 |     ' > /etc/supervisor/conf.d/ocr_with_django.conf
85 |     /bin/systemctl restart supervisor.service
86 |   SHELL
87 | end
88 |
--------------------------------------------------------------------------------
/ocr_with_django/assets/css/ocr_form.css:
--------------------------------------------------------------------------------
1 | .vertical-center {
2 |     min-height: 100%;
3 |     min-height: 100vh;
4 |
5 |     display: flex;
6 |     align-items: center;
7 | }
8 |
9 | .form-container {
10 |     display: flex;
11 |     flex-direction: column;
12 |     align-items: stretch;
13 |     flex: 1;
14 |     margin-left: 2vw;
15 |     margin-right: 2vw;
16 |     height: 80vh;
17 | }
18 |
19 | .input-container {
20 |     min-height: 5vh;
21 |     display: flex;
22 |     align-items: center;
23 | }
24 |
25 | .image-result-container {
26 |     display: flex;
27 |     flex: 1;
28 | }
29 |
30 | .image-container {
31 |     border-color: black;
32 |     border-width: medium;
33 |     border-style: dashed;
34 |     margin-right: 1vw;
35 |     flex: 1;
36 | }
37 |
38 | .image {
39 |     width: 100%;
40 |     height: 100%;
41 |     object-position: 0% 0%;
42 |     object-fit: scale-down;
43 | }
44 |
45 | .result-container {
46 |     border-style: dashed;
47 |     border-width: medium;
48 |     margin-left: 1vw;
49 |     flex: 1;
50 |     font-family: monospace;
51 | }
52 |
53 | .result-default {
54 |     border-color: black;
55 | }
56 |
57 | .result-error {
58 |     border-color: red;
59 | }
60 |
61 | .result-success {
62 |     border-color: green;
63 | }
64 |
65 | .button-container {
66 |     min-height: 5vh;
67 |     display: flex;
68 |     align-items: center;
69 |     justify-content: flex-end;
70 | }
71 |
--------------------------------------------------------------------------------
/ocr_with_django/assets/js/ocr_form.js:
--------------------------------------------------------------------------------
1 | $(document).ready(function() {
2 |     var $imageInput = $("[data-js-image-input]");
3 |     var $imageContainer = $("[data-js-image-container]");
4 |     var $resultContainer = $("[data-js-result-container]");
5 |     $imageInput.change(function(event) {
6 |         event.stopPropagation();
7 |         event.preventDefault();
8 |         var file = event.target.files[0];
9 |
10 |         var fileReader = new FileReader();
11 |         fileReader.onload = (function(theFile) {
12 |             return function(event) {
13 |                 $imageContainer.html('');
14 |             };
15 |         })(file);
16 |         fileReader.readAsDataURL(file);
17 |     });
18 |     $("[data-js-go-button]").click(function(event) {
19 |         event.stopPropagation();
20 |         event.preventDefault();
21 |         var data = new FormData();  // declared with var to avoid an implicit global
22 |         data.append('image', $imageInput[0].files[0]);
23 |         $.post({
24 |             url: "/ocr/",
25 |             data: data,
26 |             cache: false,
27 |             contentType: false,
28 |             processData: false
29 |         }).done(function(data) {
30 |             console.log(data);
31 |             $resultContainer.removeClass("result-default result-error");
32 |             $resultContainer.addClass("result-success");
33 |             $resultContainer.text(data.utf8_text);  // text() so recognized text is not parsed as HTML
34 |         })
35 |         .fail(function(jqXHR) {
36 |             $resultContainer.removeClass("result-default result-success");
37 |             $resultContainer.addClass("result-error");
38 |             $resultContainer.html('I AM ERROR');
39 |         });
40 |     });
41 | });
42 |
--------------------------------------------------------------------------------
/ocr_with_django/documents/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/abarto/ocr-with-django/70fa71d38a5dad47c9a6b381195c79979df554c7/ocr_with_django/documents/__init__.py
--------------------------------------------------------------------------------
/ocr_with_django/documents/admin.py:
--------------------------------------------------------------------------------
1 | from django.contrib import admin
2 |
3 | # Register your models here.
4 |
--------------------------------------------------------------------------------
/ocr_with_django/documents/apps.py:
--------------------------------------------------------------------------------
1 | from django.apps import AppConfig
2 |
3 |
4 | class DocumentsConfig(AppConfig):
5 |     name = 'documents'
6 |
--------------------------------------------------------------------------------
/ocr_with_django/documents/migrations/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/abarto/ocr-with-django/70fa71d38a5dad47c9a6b381195c79979df554c7/ocr_with_django/documents/migrations/__init__.py
--------------------------------------------------------------------------------
/ocr_with_django/documents/models.py:
--------------------------------------------------------------------------------
1 | from django.db import models
2 |
3 | # Create your models here.
4 |
--------------------------------------------------------------------------------
/ocr_with_django/documents/templates/documents/ocr_form.html:
--------------------------------------------------------------------------------
1 |
2 | {% load static %}
3 |
4 |
5 |
6 |
7 |
8 |
9 | OCR With Django
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
21 |
22 |
23 |
24 |
25 |
26 |
27 |
28 |
29 |
30 |
31 |
32 |
33 |
34 |
35 |
36 |
37 |
38 |
39 |
40 |
41 |
42 |
43 |
44 |
45 |
46 |
47 |
--------------------------------------------------------------------------------
/ocr_with_django/documents/tests.py:
--------------------------------------------------------------------------------
1 | from django.test import TestCase
2 |
3 | # Create your tests here.
4 |
--------------------------------------------------------------------------------
/ocr_with_django/documents/views.py:
--------------------------------------------------------------------------------
1 | from django.http.response import JsonResponse
2 | from django.views.generic.base import View, TemplateView
3 | from django.views.decorators.csrf import csrf_exempt
4 |
5 | from PIL import Image, ImageFilter
6 | from tesserocr import PyTessBaseAPI
7 |
8 |
9 | class OcrFormView(TemplateView):
10 |     template_name = 'documents/ocr_form.html'
11 | ocr_form_view = OcrFormView.as_view()
12 |
13 |
14 | class OcrView(View):
15 |     def post(self, request, *args, **kwargs):
16 |         with PyTessBaseAPI() as api:
17 |             with Image.open(request.FILES['image']) as image:
18 |                 sharpened_image = image.filter(ImageFilter.SHARPEN)
19 |                 api.SetImage(sharpened_image)
20 |                 utf8_text = api.GetUTF8Text()
21 |
22 |         return JsonResponse({'utf8_text': utf8_text})
23 | ocr_view = csrf_exempt(OcrView.as_view())
24 |
--------------------------------------------------------------------------------
/ocr_with_django/manage.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import os
3 | import sys
4 |
5 | if __name__ == "__main__":
6 |     os.environ.setdefault("DJANGO_SETTINGS_MODULE", "ocr_with_django.settings")
7 |     try:
8 |         from django.core.management import execute_from_command_line
9 |     except ImportError:
10 |         # The above import may fail for some other reason. Ensure that the
11 |         # issue is really that Django is missing to avoid masking other
12 |         # exceptions on Python 2.
13 |         try:
14 |             import django
15 |         except ImportError:
16 |             raise ImportError(
17 |                 "Couldn't import Django. Are you sure it's installed and "
18 |                 "available on your PYTHONPATH environment variable? Did you "
19 |                 "forget to activate a virtual environment?"
20 |             )
21 |         raise
22 |     execute_from_command_line(sys.argv)
23 |
--------------------------------------------------------------------------------
/ocr_with_django/ocr_with_django/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/abarto/ocr-with-django/70fa71d38a5dad47c9a6b381195c79979df554c7/ocr_with_django/ocr_with_django/__init__.py
--------------------------------------------------------------------------------
/ocr_with_django/ocr_with_django/settings.py:
--------------------------------------------------------------------------------
1 | """
2 | Django settings for ocr_with_django project.
3 |
4 | Generated by 'django-admin startproject' using Django 1.10.
5 |
6 | For more information on this file, see
7 | https://docs.djangoproject.com/en/1.10/topics/settings/
8 |
9 | For the full list of settings and their values, see
10 | https://docs.djangoproject.com/en/1.10/ref/settings/
11 | """
12 |
13 | import os
14 |
15 | # Build paths inside the project like this: os.path.join(BASE_DIR, ...)
16 | BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
17 |
18 |
19 | # Quick-start development settings - unsuitable for production
20 | # See https://docs.djangoproject.com/en/1.10/howto/deployment/checklist/
21 |
22 | # SECURITY WARNING: keep the secret key used in production secret!
23 | SECRET_KEY = '_7^3839h0ca0gexvz*1la^8^d)#+*d!=v6(8@gpbhytt^1uev5'
24 |
25 | # SECURITY WARNING: don't run with debug turned on in production!
26 | DEBUG = True
27 |
28 | ALLOWED_HOSTS = []
29 |
30 |
31 | # Application definition
32 |
33 | INSTALLED_APPS = [
34 |     'django.contrib.admin',
35 |     'django.contrib.auth',
36 |     'django.contrib.contenttypes',
37 |     'django.contrib.sessions',
38 |     'django.contrib.messages',
39 |     'django.contrib.staticfiles',
40 |     'documents'
41 | ]
42 |
43 | MIDDLEWARE = [
44 |     'django.middleware.security.SecurityMiddleware',
45 |     'django.contrib.sessions.middleware.SessionMiddleware',
46 |     'django.middleware.common.CommonMiddleware',
47 |     'django.middleware.csrf.CsrfViewMiddleware',
48 |     'django.contrib.auth.middleware.AuthenticationMiddleware',
49 |     'django.contrib.messages.middleware.MessageMiddleware',
50 |     'django.middleware.clickjacking.XFrameOptionsMiddleware',
51 | ]
52 |
53 | ROOT_URLCONF = 'ocr_with_django.urls'
54 |
55 | TEMPLATES = [
56 |     {
57 |         'BACKEND': 'django.template.backends.django.DjangoTemplates',
58 |         'DIRS': [],
59 |         'APP_DIRS': True,
60 |         'OPTIONS': {
61 |             'context_processors': [
62 |                 'django.template.context_processors.debug',
63 |                 'django.template.context_processors.request',
64 |                 'django.contrib.auth.context_processors.auth',
65 |                 'django.contrib.messages.context_processors.messages',
66 |             ],
67 |         },
68 |     },
69 | ]
70 |
71 | WSGI_APPLICATION = 'ocr_with_django.wsgi.application'
72 |
73 |
74 | # Database
75 | # https://docs.djangoproject.com/en/1.10/ref/settings/#databases
76 |
77 | DATABASES = {
78 |     'default': {
79 |         'ENGINE': 'django.db.backends.sqlite3',
80 |         'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
81 |     }
82 | }
83 |
84 |
85 | # Password validation
86 | # https://docs.djangoproject.com/en/1.10/ref/settings/#auth-password-validators
87 |
88 | AUTH_PASSWORD_VALIDATORS = [
89 |     {
90 |         'NAME': 'django.contrib.auth.password_validation.UserAttributeSimilarityValidator',
91 |     },
92 |     {
93 |         'NAME': 'django.contrib.auth.password_validation.MinimumLengthValidator',
94 |     },
95 |     {
96 |         'NAME': 'django.contrib.auth.password_validation.CommonPasswordValidator',
97 |     },
98 |     {
99 |         'NAME': 'django.contrib.auth.password_validation.NumericPasswordValidator',
100 |     },
101 | ]
102 |
103 |
104 | # Internationalization
105 | # https://docs.djangoproject.com/en/1.10/topics/i18n/
106 |
107 | LANGUAGE_CODE = 'en-us'
108 |
109 | TIME_ZONE = 'UTC'
110 |
111 | USE_I18N = True
112 |
113 | USE_L10N = True
114 |
115 | USE_TZ = True
116 |
117 |
118 | # Static files (CSS, JavaScript, Images)
119 | # https://docs.djangoproject.com/en/1.10/howto/static-files/
120 |
121 | STATIC_URL = '/static/'
122 |
123 | STATICFILES_DIRS = [
124 |     os.path.join(BASE_DIR, 'assets')
125 | ]
126 |
127 | STATIC_ROOT = os.path.join(BASE_DIR, 'static')
128 |
--------------------------------------------------------------------------------
/ocr_with_django/ocr_with_django/urls.py:
--------------------------------------------------------------------------------
1 | """ocr_with_django URL Configuration
2 |
3 | The `urlpatterns` list routes URLs to views. For more information please see:
4 |     https://docs.djangoproject.com/en/1.10/topics/http/urls/
5 | Examples:
6 | Function views
7 |     1. Add an import:  from my_app import views
8 |     2. Add a URL to urlpatterns:  url(r'^$', views.home, name='home')
9 | Class-based views
10 |     1. Add an import:  from other_app.views import Home
11 |     2. Add a URL to urlpatterns:  url(r'^$', Home.as_view(), name='home')
12 | Including another URLconf
13 |     1. Import the include() function: from django.conf.urls import url, include
14 |     2. Add a URL to urlpatterns:  url(r'^blog/', include('blog.urls'))
15 | """
16 | from django.conf.urls import url
17 | from django.contrib import admin
18 |
19 | from documents.views import ocr_view, ocr_form_view
20 |
21 | urlpatterns = [
22 |     url(r'^admin/', admin.site.urls),
23 |     url(r'^ocr/', ocr_view, name='ocr_view'),
24 |     url(r'^$', ocr_form_view, name='ocr_form_view'),
25 | ]
26 |
--------------------------------------------------------------------------------
/ocr_with_django/ocr_with_django/wsgi.py:
--------------------------------------------------------------------------------
1 | """
2 | WSGI config for ocr_with_django project.
3 |
4 | It exposes the WSGI callable as a module-level variable named ``application``.
5 |
6 | For more information on this file, see
7 | https://docs.djangoproject.com/en/1.10/howto/deployment/wsgi/
8 | """
9 |
10 | import os
11 |
12 | from django.core.wsgi import get_wsgi_application
13 |
14 | os.environ.setdefault("DJANGO_SETTINGS_MODULE", "ocr_with_django.settings")
15 |
16 | application = get_wsgi_application()
17 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | Django==1.10
2 | Cython==0.24.1
3 | Pillow==3.3.0
4 | tesserocr==2.1.2
5 | gunicorn==19.6.0
6 |
--------------------------------------------------------------------------------
/screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/abarto/ocr-with-django/70fa71d38a5dad47c9a6b381195c79979df554c7/screenshot.png
--------------------------------------------------------------------------------