├── .gitignore ├── LICENSE ├── README.md ├── _config.yml ├── code ├── classification │ ├── ModelSelection_v2.html │ ├── ModelSelection_v2.ipynb │ ├── Model_Aug2020.html │ ├── Model_Aug2020.ipynb │ ├── training_codebook.txt │ ├── training_set_suppl_v2.csv │ └── training_set_v2.csv ├── data_acquisition │ ├── jp2_download.py │ ├── search.csv │ └── xml_parser.py ├── marginalia │ ├── cropfunctions.py │ ├── example_utilities.py │ ├── marginalia_determination.py │ └── marginalia_removal.py ├── ocr │ ├── adjRec.py │ ├── geonames.py │ ├── geonames.txt │ ├── ocr_func.py │ └── ocr_use.py └── split_cleanup │ ├── 00_initial_ch_sec_split.py │ ├── 01_auto_chap_clean1.py │ ├── 02_auto_chap_clean2.py │ ├── 03_gen_manual_chapfix_files.py │ ├── 04_integrate_manual_chapfixes.py │ ├── 05_auto_section_clean.py │ ├── 06_gen_final_agg.py │ └── 07_final_sec_appraisal.py ├── environment.yml ├── examples ├── adjustment_recommendation │ ├── adjRec.ipynb │ ├── adjRec.py │ ├── adjusted.png │ ├── example_image.py │ ├── geonames.py │ ├── geonames.txt │ ├── images │ │ └── lawsresolutionso1891nort_jpg │ │ │ ├── lawsresolutionso1891nort_0272.jpg │ │ │ ├── lawsresolutionso1891nort_0374.jpg │ │ │ ├── lawsresolutionso1891nort_0542.jpg │ │ │ ├── lawsresolutionso1891nort_0606.jpg │ │ │ ├── lawsresolutionso1891nort_0771.jpg │ │ │ ├── lawsresolutionso1891nort_0944.jpg │ │ │ ├── lawsresolutionso1891nort_1114.jpg │ │ │ ├── lawsresolutionso1891nort_1210.jpg │ │ │ ├── lawsresolutionso1891nort_1373.jpg │ │ │ └── lawsresolutionso1891nort_1494.jpg │ ├── marginalia_metadata_demo.csv │ ├── ocr_func.py │ ├── sample_metadata.csv │ └── unadjusted.png ├── marginalia_determination │ ├── cropfunctions.py │ ├── example_image.py │ ├── lawsresolutionso1891nort_jp2 │ │ ├── lawsresolutionso1891nort_0697.jpg │ │ ├── lawsresolutionso1891nort_0715.jpg │ │ └── lawsresolutionso1891nort_0716.jpg │ ├── marginalia_determination.html │ ├── marginalia_determination.ipynb │ ├── output │ │ └── marginalia_metadata_demo.csv │ └── 
sample_metadata.csv ├── ocr │ ├── adjustments_demo.csv │ ├── images │ │ └── lawsresolutionso1891nort_jpg │ │ │ ├── lawsresolutionso1891nort_0272.jpg │ │ │ ├── lawsresolutionso1891nort_0374.jpg │ │ │ ├── lawsresolutionso1891nort_0542.jpg │ │ │ ├── lawsresolutionso1891nort_0606.jpg │ │ │ ├── lawsresolutionso1891nort_0771.jpg │ │ │ ├── lawsresolutionso1891nort_0944.jpg │ │ │ ├── lawsresolutionso1891nort_1114.jpg │ │ │ ├── lawsresolutionso1891nort_1210.jpg │ │ │ ├── lawsresolutionso1891nort_1373.jpg │ │ │ └── lawsresolutionso1891nort_1494.jpg │ ├── marginalia_metadata_demo.csv │ ├── ocr_func.py │ ├── ocr_use.ipynb │ ├── output │ │ └── lawsresolutionso1891nort │ │ │ ├── lawsresolutionso1891nort_adjustments.txt │ │ │ ├── lawsresolutionso1891nort_private laws.txt │ │ │ ├── lawsresolutionso1891nort_private laws_data.tsv │ │ │ ├── lawsresolutionso1891nort_public laws.txt │ │ │ └── lawsresolutionso1891nort_public laws_data.tsv │ └── xmljpegmerge_demo.csv └── split_cleanup │ ├── 1899_public_chapnumflags_step4.csv │ ├── 1899_public_chapnumflags_step5.csv │ ├── 1899_public_final_agg.csv │ ├── 1899_public_initial_agg.csv │ ├── 1899_public_sample_flag_rows.csv │ ├── 1899_public_sample_raw.csv │ ├── 1899_public_weird_chaps_example.csv │ ├── chap_num_manual.png │ ├── split_cleanup.ipynb │ └── step4_fixlog.csv ├── images ├── Pauli_Murray.jpg ├── UniversityLibraries_logo_black_h75.png └── mellon-foundation-logo.jpg ├── index.html ├── index.md ├── installation.md ├── oer ├── .ipynb_checkpoints │ ├── 04-HowToOCR-checkpoint.ipynb │ ├── 05-StructuringOCRData-checkpoint.ipynb │ └── environment_backup-checkpoint.yml ├── 00-Introduction-AlgorithmsOfResistance.ipynb ├── 01-AlgorithmsOfResistance-WhatIsAnAlgorithm.ipynb ├── 02-GatheringACorpus.ipynb ├── 03-WhatIsOCR.ipynb ├── 04-HowToOCR.ipynb ├── 05-StructuringOCRData.ipynb ├── 06-ExploratoryAnalysis.ipynb ├── NC_counties.txt ├── README.md ├── environment_backup.yml ├── geonames.txt ├── images │ ├── 00-intro-01.jpeg │ ├── 00-intro-02.jpg │ 
├── 00-intro-03.jpeg │ ├── 00-intro-04.jpeg │ ├── 00-intro-05.jpeg │ ├── 00-intro-06.jpeg │ ├── 00-intro-07.jpeg │ ├── 00-intro-08.jpeg │ ├── 00-intro-09.jpeg │ ├── 00-intro-10.jpeg │ ├── 00-intro-11.jpg │ ├── 00-intro-12.jpg │ ├── 00-intro-25.jpg │ ├── 01-algorithms-01.jpg │ ├── 06-corpus-01.jpeg │ ├── 06-corpus-02.jpeg │ ├── 06-corpus-03.jpeg │ ├── 06-corpus-04.jpg │ ├── 06-corpus-05.jpeg │ ├── 06-corpus-06.jpeg │ ├── 06-corpus-07.jpeg │ ├── 06-corpus-08.jpeg │ ├── 06-corpus-09.jpg │ ├── 06-corpus-10.jpeg │ ├── 06-corpus-11.jpeg │ ├── 06-corpus-12.jpeg │ ├── 06-corpus-13.jpeg │ ├── 06-corpus-14.jpeg │ ├── 06-corpus-15.jpg │ ├── 06-corpus-16.jpg │ ├── 06-corpus-17.jpg │ ├── 06-corpus-18.jpg │ ├── 06-corpus-runcode.mp4 │ ├── 07-ocr-01.jpeg │ ├── 07-ocr-02.jpeg │ ├── 07-ocr-03.jpeg │ ├── 07-ocr-04.jpeg │ ├── 07-ocr-05.jpeg │ ├── 07-ocr-05.txt │ ├── 07-ocr-06.jpeg │ ├── 07-ocr-06.txt │ ├── 07-ocr-07.jpeg │ ├── 07-ocr-07.txt │ ├── 07-ocr-08.jpeg │ ├── 07-ocr-08.txt │ ├── 08-ocr-01.jpeg │ ├── 08-ocr-02.jpeg │ ├── 08-ocr-03.jpeg │ ├── 08-ocr-04.jpeg │ ├── 08-ocr-05.jpeg │ ├── 08-ocr-06.jpeg │ ├── 08-ocr-07.jpeg │ ├── 09-data-01.jpeg │ ├── 09-data-02.jpeg │ ├── 09-data-03.jpeg │ ├── 09-data-04.jpeg │ ├── 09-data-05.jpeg │ ├── 09-data-06.jpeg │ ├── 09-data-07.jpeg │ ├── 10-explore-01.jpeg │ ├── 10-explore-02.jpeg │ ├── 10-explore-03.jpeg │ ├── 10-explore-04.jpeg │ ├── 10-explore-05.jpeg │ ├── 10-explore-06.jpeg │ ├── 10-explore-07.jpeg │ ├── 10-explore-08.jpeg │ ├── 10-explore-09.jpeg │ ├── 10-explore-10.jpeg │ ├── 10-explore-11.jpeg │ ├── 10-explore-12.jpeg │ ├── 10-explore-13.jpeg │ ├── 10-explore-14.jpeg │ ├── Anaconda_Nucleus_Horizontal_white.svg │ ├── LawBooks-feature.png │ ├── chronam_daybook_19151112_pellagra_full.jpg │ ├── chronam_daybook_19151112_pellagra_full_bboxes.png │ ├── noun_arrow with loops_2073885.png │ ├── sessionlawsresol1955nort_0057.jpg │ ├── sessionlawsresol1955nort_0057_300ppi.jpg │ ├── sessionlawsresol1955nort_0057_grayscale.jpg │ ├── 
sessionlawsresol1955nort_0057_inverted.jpg │ ├── sessionlawsresol1955nort_0057_rotated.jpg │ ├── sessionlawsresol1955nort_0057_skewed.jpg │ └── sessionlawsresol1955nort_0058.jpg ├── jc_laws_list.csv ├── jclaws_dataset.csv ├── jpg_output │ ├── sessionlawsresol1955nort_0000.jpg │ ├── sessionlawsresol1955nort_0001.jpg │ ├── sessionlawsresol1955nort_0002.jpg │ ├── sessionlawsresol1955nort_0003.jpg │ ├── sessionlawsresol1955nort_0004.jpg │ ├── sessionlawsresol1955nort_0005.jpg │ ├── sessionlawsresol1955nort_0006.jpg │ ├── sessionlawsresol1955nort_0007.jpg │ ├── sessionlawsresol1955nort_0008.jpg │ ├── sessionlawsresol1955nort_0009.jpg │ ├── sessionlawsresol1955nort_0010.jpg │ ├── sessionlawsresol1955nort_0011.jpg │ ├── sessionlawsresol1955nort_0012.jpg │ ├── sessionlawsresol1955nort_0013.jpg │ ├── sessionlawsresol1955nort_0014.jpg │ ├── sessionlawsresol1955nort_0015.jpg │ ├── sessionlawsresol1955nort_0016.jpg │ ├── sessionlawsresol1955nort_0017.jpg │ ├── sessionlawsresol1955nort_0018.jpg │ ├── sessionlawsresol1955nort_0019.jpg │ ├── sessionlawsresol1955nort_0020.jpg │ ├── sessionlawsresol1955nort_0021.jpg │ ├── sessionlawsresol1955nort_0022.jpg │ ├── sessionlawsresol1955nort_0023.jpg │ ├── sessionlawsresol1955nort_0024.jpg │ └── sessionlawsresol1955nort_0025.jpg ├── on_the_books_text_jc_all.txt ├── sample │ ├── sessionlawsresol1955nort_0057.jpg │ ├── sessionlawsresol1955nort_0058.jpg │ ├── sessionlawsresol1955nort_0059.jpg │ ├── sessionlawsresol1955nort_0060.jpg │ ├── sessionlawsresol1955nort_0061.jpg │ ├── sessionlawsresol1955nort_0062.jpg │ ├── sessionlawsresol1955nort_0063.jpg │ ├── sessionlawsresol1955nort_0064.jpg │ ├── sessionlawsresol1955nort_0065.jpg │ └── sessionlawsresol1955nort_0066.jpg ├── sample_output.txt ├── sample_output │ ├── sample_output_spellchecked.csv │ ├── sessionlawsresol1955nort_0057.txt │ ├── sessionlawsresol1955nort_0058.txt │ ├── sessionlawsresol1955nort_0059.txt │ ├── sessionlawsresol1955nort_0060.txt │ ├── sessionlawsresol1955nort_0061.txt │ 
├── sessionlawsresol1955nort_0062.txt │ ├── sessionlawsresol1955nort_0063.txt │ ├── sessionlawsresol1955nort_0064.txt │ ├── sessionlawsresol1955nort_0065.txt │ └── sessionlawsresol1955nort_0066.txt ├── sessionlawsresol1955nort_0057.jpg ├── sessionlawsresol1955nort_0057_grayscale.jpg └── sessionlawsresol1955nort_0057_inverted.jpg └── workflow.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # OnTheBooks 2 | 3 | [On the Books: Jim Crow and Algorithms of Resistance](https://onthebooks.lib.unc.edu/) is a [collections as data project](https://collectionsasdata.github.io/part2whole/) of the [University of North Carolina at Chapel Hill Libraries](https://library.unc.edu/) to make North Carolina legal history accessible to researchers by creating a corpus that contains over one hundred years of North Carolina session laws from Reconstruction through the Civil Rights Movement (1866-1967). The project also used machine learning to identify Jim Crow laws during this period. 
4 | 5 | [Read more](https://unc-libraries-data.github.io/OnTheBooks/) 6 | 7 | ## [Installation and Dependencies](installation.md) 8 | 9 | ## [Workflow (code and examples)](workflow.md) 10 | 11 | ## [Text Corpora](https://doi.org/10.17615/5c4g-sd44) 12 | 13 | ## [Open Educational Resource](/oer) 14 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-slate -------------------------------------------------------------------------------- /code/classification/Model_Aug2020.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "This notebook documents the model fit during the first phase of *On the Books: Jim Crow and Algorithms of Resistance*, as of August 2020." 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Packages" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "import os\n", 24 | "import re\n", 25 | "\n", 26 | "import pandas as pd\n", 27 | "import numpy as np\n", 28 | "import scipy.sparse\n", 29 | "\n", 30 | "from nltk.tokenize import word_tokenize\n", 31 | "from nltk.corpus import stopwords\n", 32 | "\n", 33 | "from sklearn.feature_extraction.text import CountVectorizer\n", 34 | "from sklearn.calibration import CalibratedClassifierCV, calibration_curve\n", 35 | "\n", 36 | "from xgboost import XGBClassifier" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "## Data Preparation\n" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 2, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "train_df = pd.read_csv(\"../training_set/training_set_v0_clean.csv\")" 53 | ] 54 | }, 55 | { 
56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "We performed simple preprocessing on the text:\n", 60 | "* Replaced hyphenated and line-broken words with unbroken words.\n", 61 | "* Removed section numbering from the law text (\"section_text\").\n", 62 | "* Removed all non-ASCII characters (most of these were OCR errors).\n", 63 | "* Converted all words to lower case.\n", 64 | "* Removed stopwords based on `nltk`'s default list.\n", 65 | "    * We also removed any words occurring in fewer than 2 or more than 1000 documents.\n", 66 | "* We used session or volume identifier (\"csv\") information to extract a numeric year. In the case of multi-year volumes (e.g. 1956-1957) the earlier year was used.\n", 67 | "\n", 68 | "Then we converted the text into a document-term matrix, augmented with year and law type variables." 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "repl = lambda m: m.group(\"letter\")\n", 78 | "\n", 79 | "# Fix hyphenated words\n", 80 | "train_df[\"text\"] = train_df.text.str.replace(r\"-[ \\|]+(?P<letter>[a-zA-Z])\", repl, regex=True).astype(\"str\")\n", 81 | "train_df[\"section_text\"] = train_df.section_text.str.replace(r\"-[ \\|]+(?P<letter>[a-zA-Z])\", repl, regex=True).astype(\"str\")\n", 82 | "train_df[\"section_text\"] = [re.sub(r'- *\\n+(\\w+ *)', r'\\1\\n', r) for r in train_df[\"section_text\"]]\n", 83 | "\n", 84 | "# Remove section titles (e.g. \"Sec. 
1\") from law text.\n", 85 | "train_df[\"start\"] = train_df.section_raw.str.len().fillna(0).astype(\"int\")\n", 86 | "train_df[\"section_text\"] = train_df.apply(lambda x: x['section_text'][(x[\"start\"]):], axis=1).str.strip()\n", 87 | "\n", 88 | "# Remove all non-ASCII characters\n", 89 | "train_df[\"section_text\"] = train_df[\"section_text\"].str.replace(r\"[^\\x00-\\x7F]\", \"\", regex=True)\n", 90 | "\n", 91 | "law_list = [word_tokenize(r.lower()) for r in train_df.section_text]\n", 92 | "stop_words = stopwords.words('english')\n", 93 | "law_list = [[word for word in law if word not in stop_words] for law in law_list]\n", 94 | "\n", 95 | "# Extract a numeric year variable\n", 96 | "train_df[\"year\"] = train_df.sess.str.slice(start = 0, stop = 4).astype(\"float\")\n", 97 | "train_df.loc[train_df.sess.isna(), \"year\"] = train_df.csv.str.extract(r\"(\\d{4})\", expand=False).astype(\"float\")\n", 98 | "\n", 99 | "def dummy(doc):\n", 100 | "    return doc\n", 101 | "# Remove terms appearing in fewer than 2 or more than 1000 documents, then convert to a document-term matrix.\n", 102 | "vect = CountVectorizer(tokenizer=dummy, preprocessor=dummy, decode_error = \"ignore\",\n", 103 | "                       min_df = 2, max_df = 1000)\n", 104 | "dtm = vect.fit_transform(law_list)\n", 105 | "\n", 106 | "# Add year and law type variables.\n", 107 | "extra_df = train_df.loc[:,[\"year\",\"type\"]].copy()\n", 108 | "extra_df = pd.get_dummies(extra_df, columns = [\"type\"], prefix = [\"type\"])\n", 109 | "X = scipy.sparse.hstack((dtm, extra_df.values))" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "## Model Details\n", 117 | "\n", 118 | "The `fit_params` below were selected using an 80-20 training-test split, followed by 10-fold cross-validation on the training set. We will include a basic template of our model selection process later this year." 
119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": 4, 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "fit_params = {'colsample_bytree': 0.3, 'gamma': 0.3, 'learning_rate': 0.3, \n", 128 | "              'max_depth': 20, 'min_child_weight': 1, 'n_estimators': 50, \n", 129 | "              'scale_pos_weight': 5}\n", 130 | "all_mod = XGBClassifier(**fit_params)\n", 131 | "all_modfit = all_mod.fit(X, train_df.assessment)" 132 | ] 133 | }, 134 | { 135 | "cell_type": "markdown", 136 | "metadata": {}, 137 | "source": [ 138 | "The XGBoost classifier outperformed the other models we evaluated. Read more about XGBoost [here](https://arxiv.org/abs/1603.02754).\n", 139 | "\n", 140 | "After fitting, we used probability calibration to adjust the model probabilities to better reflect the training set." 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 5, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "calibrated_mod = CalibratedClassifierCV(all_modfit, cv=10, method=\"isotonic\")\n", 150 | "calibrated_modfit = calibrated_mod.fit(X, train_df.assessment)\n", 151 | "\n", 152 | "\n", 153 | "train_df[\"base_labels\"] = all_modfit.predict(X)\n", 154 | "train_df[\"base_probs\"] = all_modfit.predict_proba(X)[:,1]\n", 155 | "train_df[\"calibrated_probs\"] = calibrated_modfit.predict_proba(X)[:,1]\n", 156 | "train_df[\"calibrated_labels\"] = (train_df.calibrated_probs > 0.9).astype(\"int\")" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "We reported any laws with a calibrated probability over 90% as Jim Crow laws with a source of \"model\", unless they were also later confirmed by an expert, in which case they were labeled \"model and expert\". We chose to be conservative at this point to minimize false positives; since this project will continue over the coming year, we will have more time to fine-tune the modeling process." 
164 | ] 165 | } 166 | ], 167 | "metadata": { 168 | "kernelspec": { 169 | "display_name": "Python 3", 170 | "language": "python", 171 | "name": "python3" 172 | }, 173 | "language_info": { 174 | "codemirror_mode": { 175 | "name": "ipython", 176 | "version": 3 177 | }, 178 | "file_extension": ".py", 179 | "mimetype": "text/x-python", 180 | "name": "python", 181 | "nbconvert_exporter": "python", 182 | "pygments_lexer": "ipython3", 183 | "version": "3.7.3" 184 | }, 185 | "toc": { 186 | "base_numbering": 1, 187 | "nav_menu": {}, 188 | "number_sections": true, 189 | "sideBar": true, 190 | "skip_h1_title": true, 191 | "title_cell": "Table of Contents", 192 | "title_sidebar": "Contents", 193 | "toc_cell": false, 194 | "toc_position": {}, 195 | "toc_section_display": true, 196 | "toc_window_display": false 197 | } 198 | }, 199 | "nbformat": 4, 200 | "nbformat_minor": 4 201 | } 202 | -------------------------------------------------------------------------------- /code/classification/training_codebook.txt: -------------------------------------------------------------------------------- 1 | id: Standardized identifier for each law consisting of: year, law type, chapter_num, and section_num 2 | source: Source of Jim Crow law assessment (Pauli Murray, Richard Paschal, or project experts - William Sturkey or Kimber Thomas) 3 | jim_crow: Indicator of Jim Crow (1) or not Jim Crow (0) 4 | type: Type of law 5 | chapter_num: Chapter number as integer, generated from OCR and data cleaning 6 | section_num: Section number as integer, generated from OCR and data cleaning 7 | chapter_text: The text of the title and any introduction before the first section of the law 8 | section_text: The text of the specified section 9 | extrinsic: Supplemental data only. This field indicates whether the Jim Crow assessment was extrinsic (1), i.e. based almost completely on information outside the text of the law, or implicit (0). 
10 | -------------------------------------------------------------------------------- /code/data_acquisition/jp2_download.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | """ 5 | jp2_download.py 6 | 7 | 8 | @summary: This script downloads and stores volumes of image files from 9 | the Internet Archive. It uses an existing list of volume 10 | titles to create request links. 11 | 12 | Once the downloads are complete, the script checks for errors. 13 | 14 | It uses the volume title list mentioned above to check for missing 15 | volumes so that the user can download missing volumes manually. 16 | It then checks for download errors by comparing file sizes 17 | of local copies with those of the Internet Archive copies. 18 | 19 | Discrepancies are printed for the user. 20 | 21 | @author: Rucha Dalwadi 22 | 23 | Digital Research Services 24 | University Libraries 25 | UNC Chapel Hill 26 | 27 | """ 28 | 29 | import urllib.request 30 | import os 31 | import csv 32 | import time 33 | 34 | os.chdir(r"")# The Directory where you want the pdfs to be downloaded in like C:\Users\onthebooks\Documents\lawpdfs 35 | 36 | # Open the file with identifiers for parsing 37 | with open("search.csv","r") as identifiers: 38 | reader=csv.DictReader(identifiers) 39 | l=[d['identifier'] for d in reader] # identifier is the column with the volume names 40 | 41 | ct=0 42 | start=time.time() 43 | problems=list() 44 | fails=0 45 | sourcelink = 'https://archive.org/download/' # A single web source contains all the files to download 46 | 47 | # The identifiers are used to generate links to the images for download 48 | for f in l: 49 | try: 50 | ct+=1 51 | # A download link is created for each file by appending the id and file extension to the source 52 | full_link = sourcelink+f+'/'+f+'_jp2.zip' 53 | urllib.request.urlretrieve(full_link, f+'_jp2.zip') 54 | time.sleep(120) 55 | if ct%10==0: 56 | print(str(ct)+": 
"+f) 57 | print(time.time()-start) 58 | except Exception: 59 | fails+=1 60 | time.sleep(60) 61 | 62 | end=time.time() 63 | print(end-start) 64 | 65 | ## Checking for and resolving problems: 66 | 67 | # Build the list of download links for every volume 68 | zips = [sourcelink + f + '/' + f + '_jp2.zip' for f in l] 69 | 70 | # Get a list of the missing folders 71 | missed = [z for z in zips if z.split("/")[-1] not in os.listdir(".")] 72 | print(missed) 73 | # manually download missed files 74 | 75 | # Get a list of the items with broken links by comparing file sizes of the 76 | # original file to the downloaded file 77 | broken_dl=[] 78 | for z in zips: 79 | local=z.split("/")[-1] 80 | local_size=os.path.getsize(local) 81 | with urllib.request.urlopen(z) as web: 82 | meta=dict(web.info()) 83 | web_size=int(meta["Content-Length"]) 84 | if web_size!=local_size: 85 | broken_dl.append(z) 86 | 87 | # Print the sizes (in MB) of the broken files and original files for comparison 88 | for z in broken_dl: 89 | local=z.split("/")[-1] 90 | local_size=os.path.getsize(local) 91 | with urllib.request.urlopen(z) as web: 92 | meta=dict(web.info()) 93 | web_size=int(meta["Content-Length"]) 94 | print(local_size/(1024**2),web_size/(1024**2)) 95 | 96 | -------------------------------------------------------------------------------- /code/data_acquisition/search.csv: -------------------------------------------------------------------------------- 1 | identifier 2 | lawsofstateofnor184849nor 3 | privatelawsofsta1905nort 4 | publiclawsofstat186465nor 5 | publiclawsofstat185859nor 6 | publiclawsresolx1921nort 7 | privatelawsofsta1895nort 8 | lawsresolutionso1887nort 9 | publiclawsresolu1927nort 10 | publiclawsresolu1920nort 11 | publiclawsresolu1924nort 12 | publiclocallawsp1941nort 13 | lawsofstateofnor184041nort 14 | publiclocallawso1915nort 15 | publiclocallawso1911nort 16 | publiclawsresolu1908nort 17 | publiclawsresolu1905nort 18 | publiclocallawso1913nort 19 | lawsofstateofnor183637nort 20 | lawsresolutionso1883nort 21 | publiclawsresolu187273nor 22 | 
publiclocallawsp1924nort 23 | publiclocallawsp1921nort 24 | lawsresolutionso1891nort 25 | privlawsofsta185455nort 26 | privatelawsofsta1907nort 27 | privatelawsofsta186970nor 28 | privatelawsofsta1913nort 29 | privatelawsofsta186869nor 30 | publiclawsresolu1903nort 31 | publiclawsresolu1909nort 32 | privatelawsofsta1893nort 33 | privatelawsofsta1901nort 34 | publiclawsresolu1935nort 35 | publiclawsresolu1923nort 36 | publiclawsofstat185657nor 37 | publiclawsofstat186970nor 38 | publiclawsofstat187071nor 39 | publiclawsofstat186566nor 40 | publiclawsofstat186061nor 41 | lawsresolutionso1880nort 42 | lawsresolutionso1881nort 43 | publiclocallawsp1933nort 44 | publiclawsresolu1907nort 45 | publiclocallawsp1917nort 46 | publiclawsresolu1913nort 47 | publiclawsresolu1899nort 48 | publiclocallawsp1925nort 49 | privatelawsofsta186667nor 50 | privatelawsofsta187071nor 51 | privatelawsofsta187172nor 52 | publiclocallawsp1919nort 53 | publiclocallawsp1935nort 54 | lawsofstateofnor184647nor 55 | lawsofstateofnor185051nor 56 | privatelawsofsta1908nort 57 | privatelawsofsta186465nor 58 | privatelawsofsta1915nort 59 | privatelawsofsta1899nort 60 | privatelawsofsta186566nor 61 | publiclawsresolu1933nort 62 | publiclawsresolu1931nort 63 | publiclawsofstat186869nor 64 | publiclawsresolu1921nort 65 | publiclawsofstat187172nor 66 | publiclawsofstat186667nor 67 | publiclawsofstat186263nor 68 | publiclawsofstat1868nort 69 | lawsresolutionso1879nort 70 | lawsresolutionso187475nor 71 | lawsresolutionso187374nor 72 | lawsofnorthcarol1827nort 73 | lawsofnorthcarol1813nort 74 | lawsofnorthcarol1822nort 75 | lawsofnorthcarol183132nort 76 | publiclocallawsp1923nort 77 | publiclocallawsp1927nort 78 | publiclocallawsp3839nort 79 | publiclocallawsp1929nort 80 | publiclocallawsp1931nort 81 | lawsofstateofnor184243nort 82 | lawsofstateofnor18381839nort 83 | lawsofstateofnor184445nor 84 | lawsresolutionso187677nor 85 | lawsofstateofnor1852nort 86 | lawsresolutionso1889nort 87 | 
lawsresolutionso1885nort 88 | publiclawsresolu1893nort 89 | publiclawsresolu1901nort 90 | publiclawsresolu1897nort 91 | publiclawsofstat1861nort 92 | publiclawsofstat185455nor 93 | publiclawsofstat1863nort 94 | publiclawsresolu1941nort 95 | publiclawsresolu1936nort 96 | publiclawsresolu1925nort 97 | publiclawsresolu1929nort 98 | publiclawsresolu193839nor 99 | privatelawsofsta1897nort 100 | privatelawsofsta1903nort 101 | privatelawsofsta1909nort 102 | privatelawsofsta1911nort 103 | sessionlawsresol1953nort 104 | sessionlawsresol19891nort 105 | sessionlawsresol1973nort 106 | sessionlawsresol1949nort 107 | sessionlawsr199192nort 108 | sessionlawsresol1995nort 109 | sessionlawsresol1943nort 110 | lawsofnorthcarol1791nort 111 | lawsofnorthcarol1819nort 112 | lawsofnorthcarol1816nort 113 | lawsofnorthcarol1799nort 114 | lawsofnorthcarol1798nort 115 | lawsofnorthcarol1807nort 116 | lawsofnorthcarol1800nort 117 | lawsofnorthcarol1835nort 118 | lawsofnorthcarol1828nort 119 | lawsofnorthcarol1801nort 120 | lawsofnorthcarol1795nort 121 | sessionlawsresol02nort 122 | sessionlawsresol19892nort 123 | sessionlawsresol1947nort 124 | sessionlawsresol19911nort 125 | sessionlawsresol19953nort 126 | sessionlawsresol1955nort 127 | sessionlaws198788nort 128 | publiclawsresolu1900nort 129 | publiclocallaws1913nort 130 | publiclawsresolu1911nort 131 | publiclawsresolu1917nort 132 | publiclawsresolu1915nort 133 | publiclawsresolu1895nort 134 | publiclawsresolu1919nort 135 | lawsofnorthcarol1797nort 136 | lawsofnorthcarol1792nort 137 | lawsofnorthcarol1820nort 138 | lawsofnorthcarol1817nort 139 | lawsofnorthcarol1790nort 140 | lawsofnorthcarol1825nort 141 | lawsofnorthcarol1829nort 142 | lawsofnorthcarol1818nort 143 | sessionlawsresol00nort 144 | sessionlawsre19932nort 145 | sessionlaws1997983nort 146 | sessionlaws197778nort 147 | sessionlawsresol1969nort 148 | sessionlaws198384nort 149 | sessionlawsresol19912nort 150 | sessionlawsre199394nort 151 | sessionlawsresol1945nort 152 | 
sessionlawsresol03nort 153 | sessionlawsresol1959nort 154 | sessionlawsresol19991nort 155 | sessionlawsresol19972nort 156 | sessionlaws195657nort 157 | sessionlaws196365nort 158 | sessionlaws7980nort 159 | sessionlawsresol19971nort 160 | sessionlawsresol1977nort 161 | sessionlawsresol1981nort 162 | sessionlawsresol1961nort 163 | sessionlaws198990nort 164 | sessionlawsresol19871nort 165 | sessionlaws19656667nort 166 | sessionlaws1997984nort 167 | lawsofnorthcarol1821nort 168 | lawsofnorthcarol1793nort 169 | lawsofnorthcarol1796nort 170 | lawsofnorthcarol1794nort 171 | lawsofnorthcarol1812nort 172 | lawsofnorthcarol1810nort 173 | lawsofnorthcarol1809nort 174 | lawsofnorthcarol183435nort 175 | lawsofnorthcarol1811nort 176 | lawsofnorthcarol183334nort 177 | lawsofnorthcarol1823nort 178 | lawsofnorthcarol183031nort 179 | lawsofnorthcarol1826nort 180 | lawsofnorthcarol183233nort 181 | lawsofnorthcarol1815nort 182 | lawsofnorthcarol1814nort 183 | lawsofnorthcarol1824nort 184 | sessionlawsresol1971nort 185 | sessionlawsresol83nort 186 | sessionlawsresol1951nort 187 | sessionlawsl197576nort 188 | sessionlawsresol1985nort 189 | sessionlawsresol1963nort 190 | sessionlawsresol19872nort 191 | sessionlaws198586nort 192 | sessionlaws8182nort 193 | sessionlawsresol19952nort 194 | sessionlawsr19931nort 195 | sessionlawsresol1975nort 196 | -------------------------------------------------------------------------------- /code/data_acquisition/xml_parser.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | 5 | """ 6 | xml_parser.py 7 | 8 | 9 | @summary: The purpose of this script is to extract certain metadata 10 | for a set of volumes using the volume names (identifiers) from the Internet Archive. 11 | First we use "search.csv" to construct a list of the volumes whose xml metadata we 12 | want to parse. 
We use the volume names to build request urls for the xml metadata 13 | files associated with each volume. We then parse the xml files to locate and store 14 | the following information for each page in a given volume: 15 | 16 | handSide: the hand side (L/R) of a given leaf. 17 | pageNum: logical page numbers (image numbers) 18 | leafNum: Physical page numbers 19 | filename: The filename associated with each page image 20 | 21 | This information is then written to xml_metadata.csv with each image 22 | file in each volume constituting a row. The information in this file 23 | can then be combined with other, manually compiled metadata 24 | to form the xmljpegmerge.csv file, used in later steps. 25 | 26 | 27 | @author: Rucha Dalwadi 28 | 29 | Digital Research Services 30 | University Libraries 31 | UNC Chapel Hill 32 | 33 | """ 34 | 35 | import xml.etree.ElementTree as ET 36 | import csv 37 | import pandas as pd 38 | import os 39 | import urllib 40 | 41 | # Using the search.csv file, a list of the volumes whose xml files will be parsed is created. 
42 | with open("search.csv","r") as identifiers: 43 | reader = csv.DictReader(identifiers) 44 | l = [d['identifier'] for d in reader] # The identifier column contains the volume names 45 | 46 | # Through the xml files, extract the logical page numbers (pageNum), physical page numbers (leafNum) 47 | # and leaf hand side (handSide) 48 | handSide = [] 49 | pageNum = [] 50 | filename = [] 51 | leafNum = [] 52 | master = [] 53 | 54 | for i in l: 55 | try: 56 | # Get the xml file of a volume by creating a download link using the volume identifier 57 | xml = urllib.request.urlopen('https://archive.org/download/' + i + '/' + i + '_' + 'scandata.xml') 58 | tree = ET.parse(xml) 59 | root = tree.getroot() 60 | 61 | # Add this volume's page metadata to the master list 62 | for page in root[2].findall('page'): 63 | leafNum.append(int(page.attrib['leafNum'])) 64 | handSide.append(page.find('handSide').text) 65 | 66 | page_dict = {} 67 | page_dict['leafNum'] = int(page.attrib['leafNum']) 68 | if page.find('pageNumber') is not None: 69 | page_dict['pageNum'] = page.find('pageNumber').text 70 | else: 71 | page_dict['pageNum'] = '' 72 | 73 | page_dict['handSide'] = page.find('handSide').text 74 | page_dict['filename'] = i + '_' + '%04d' % page_dict['leafNum'] 75 | master.append(page_dict) 76 | except Exception: 77 | print(i) 78 | 79 | # Write the collected metadata to a csv file once all volumes have been parsed 80 | with open('xml_metadata.csv', 'w') as csvfile: 81 | writer1 = csv.DictWriter(csvfile, fieldnames=['filename','leafNum','handSide','pageNum'], lineterminator='\n') 82 | writer1.writeheader() 83 | for row in master: 84 | writer1.writerow(row) 85 | 86 | -------------------------------------------------------------------------------- /code/marginalia/example_utilities.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Jul 8 08:16:57 2019 4 | 5 | @author: mtjansen 6 | """ 7 | 8 | from PIL import Image, ImageDraw 9 | import os 10 | import csv 11 | import sys, time  # used by sys.path.append below and by the timing in the batch loop 12 | 
sys.path.append(os.path.abspath(r"C:\Users\mtjansen\Desktop\OnTheBooks")) 13 | from cropfunctions import * 14 | 15 | 16 | def combine_bbox(b1,b2): 17 | """Combines successive boundary boxes, each within the last. 18 | Returns: 19 | tuple: Coordinates of crop (left,upper,right,lower) 20 | """ 21 | b1 = list(b1) 22 | b2 = list(b2) 23 | total=[b1[k] + b2[k] for k in range(2)] + [b1[k-2] + b2[k] for k in range(2,4)] 24 | return tuple(total) 25 | 26 | def example_image(orig, diff, angle, band_dict, bheight, total_bbox, orig_bbox): 27 | bd_ct = len(band_dict["hbands"]) 28 | back_height = bd_ct*(bheight+20)+100 29 | 30 | # bounds = Image.new(orig.mode, (orig.size[0],back_height), "white") 31 | # 32 | # for row in band_dict["hbands"]: 33 | # band_bbox = list(row["raw"]) 34 | # band_bbox[1] = row["index"] 35 | # band_bbox[3] = row["index"]+50 36 | # band = orig.crop(orig_bbox).crop(tuple(band_bbox)) 37 | # hoff = list(row["raw"])[0] + list(orig_bbox)[0] 38 | # voff = row["index"] + list(orig_bbox)[1]+10*(row["index"]+50)/bheight 39 | # bounds.paste(band,(hoff,int(voff))) 40 | 41 | img = orig.copy().rotate(angle).crop(orig_bbox) 42 | bounds = Image.new(orig.mode, orig.size, "white") 43 | draw = ImageDraw.Draw(bounds) 44 | 45 | for row in band_dict["hbands"]: 46 | band_bbox = list(row["raw"]) 47 | band_bbox[1] = row["index"]-50 48 | band_bbox[3] = row["index"] 49 | band = combine_bbox(orig_bbox,band_bbox) 50 | spot = tuple(list(band)[0:2]) 51 | bounds.paste(img.crop(band_bbox),spot) 52 | if list(band_bbox)[2]-list(band_bbox)[0]>10: 53 | draw.rectangle(band,outline="red",fill = None,width=2) 54 | 55 | final = orig.rotate(angle).crop(total_bbox) 56 | 57 | back_width = (orig.size[0]+bounds.size[0]+final.size[0])+250 58 | back = Image.new(orig.mode, (back_width,back_height), "white") 59 | 60 | back.paste(orig,(50,50)) 61 | back.paste(bounds,(orig.size[0]+150,50)) 62 | back.paste(final,(orig.size[0]+bounds.size[0]+250,list(total_bbox)[1]+50)) 63 | 64 | return back 65 | 66 | 67 | 
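# Worked example (an added sketch, not part of the original script): combine_bbox
# composes two successive crops, where the second box b2 is expressed relative to
# the region already selected by b1, so b2 is shifted by b1's upper-left corner.
# The helper below restates the arithmetic so the example is self-contained.
def _combine_bbox_demo(b1, b2):
    b1, b2 = list(b1), list(b2)
    # left/upper: offset b2's corner by b1's upper-left;
    # right/lower: offset by that same corner
    return tuple([b1[k] + b2[k] for k in range(2)] +
                 [b1[k - 2] + b2[k] for k in range(2, 4)])

# Cropping an image to (10, 20, 500, 700) and then to (5, 5, 450, 600) within that
# region is equivalent to a single crop of (15, 25, 460, 620) on the original image.
assert _combine_bbox_demo((10, 20, 500, 700), (5, 5, 450, 600)) == (15, 25, 460, 620)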
os.chdir(r"C:\Users\mtjansen\Desktop\OnTheBooks\1865-1968 jp2 files") 68 | 69 | ############################ 70 | # Get xml data from file. ## 71 | ############################ 72 | master = [] 73 | 74 | with open(r"..\xmljpegmerge_official.csv", "r") as csvfile: 75 | reader = csv.DictReader(csvfile) 76 | for row in reader: 77 | row_dict = dict() 78 | row_dict["filename"] = row["filename"] + ".jp2" 79 | row_dict["side"] = row["handSide"].lower() 80 | row_dict["folder"] = row["filename"].split("_")[0]+"_jp2" 81 | row_dict["type"] = row['sectiontype'] 82 | row_dict["start_section"] = False 83 | master.append(row_dict) 84 | 85 | master = sorted(master, key = lambda i: i['filename']) 86 | 87 | for k in range(1,len(master)): 88 | if master[k]["type"] != master[k-1]["type"]: 89 | master[k]["start_section"] = True 90 | 91 | master = [m for m in master if "186465" not in m["filename"]] 92 | 93 | batch = [row for row in master if row["filename"] in ["publiclocallawsp1917nort_0568.jp2","publiclocallawsp1933nort_0063.jp2"]] 94 | 95 | output_dir = r"C:\Users\mtjansen\Desktop\OnTheBooks\outwide_fix" 96 | for r in batch: 97 | t0 = time.time() 98 | f = os.path.join(r["folder"],r["filename"]) 99 | orig = Image.open(f) 100 | 101 | side = r["side"] 102 | 103 | ang = rotation_angle(orig) 104 | 105 | if r["start_section"]: 106 | diff, background, orig_bbox = trim(orig, angle=ang, find_top=False) 107 | else: 108 | diff, background, orig_bbox = trim(orig, angle=ang) 109 | 110 | if "196" in r["folder"] or "195" in r["folder"]: 111 | total_bbox = orig_bbox 112 | cut = None 113 | else: 114 | bheight = 50 115 | band_dict = get_bands(diff, bheight=bheight) 116 | 117 | width = diff.size[0] 118 | cut = simp_bd(band_dict=band_dict, diff=diff, side=side, width=width, 119 | pad=10, freq =0.9) 120 | 121 | out_bbox = [0, 0] + list(diff.size) 122 | side_dict = {"left":0, "right":2} 123 | out_bbox[side_dict[side]] = cut 124 | 125 | total_bbox = combine_bbox(orig_bbox,out_bbox) 126 | 127 | ex = 
example_image(orig=orig, diff=diff, angle=ang, band_dict=band_dict, 128 | bheight=bheight, total_bbox=total_bbox, orig_bbox=orig_bbox) 129 | 130 | out = os.path.join(output_dir,r["filename"].replace(".jp2","_BREAK.jpg")) 131 | ex.save(out, "JPEG") 132 | -------------------------------------------------------------------------------- /code/marginalia/marginalia_determination.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu Jun 27 13:30:26 2019 4 | 5 | @author: mtjansen 6 | """ 7 | 8 | import sys 9 | import os 10 | import csv 11 | import shutil 12 | import time 13 | import random 14 | 15 | from collections import Counter 16 | from PIL import Image, ImageChops, ImageStat 17 | from scipy.ndimage import interpolation as inter 18 | import numpy as np 19 | 20 | sys.path.append(os.path.abspath(r"C:\Users\mtjansen\Desktop\OnTheBooks")) 21 | from cropfunctions import * 22 | 23 | os.chdir(r"C:\Users\mtjansen\Desktop\OnTheBooks\1865-1968 jp2 files") 24 | 25 | ############################ 26 | # Get xml data from file. 
## 27 | ############################ 28 | master = [] 29 | 30 | with open(r"..\xmljpegmerge_official.csv", "r") as csvfile: 31 | reader = csv.DictReader(csvfile) 32 | for row in reader: 33 | row_dict = dict() 34 | row_dict["filename"] = row["filename"] + ".jp2" 35 | row_dict["side"] = row["handSide"].lower() 36 | row_dict["folder"] = row["filename"].split("_")[0]+"_jp2" 37 | row_dict["type"] = row['sectiontype'] 38 | row_dict["start_section"] = False 39 | master.append(row_dict) 40 | 41 | master = sorted(master, key = lambda i: i['filename']) 42 | 43 | for k in range(1,len(master)): 44 | if master[k]["type"] != master[k-1]["type"]: 45 | master[k]["start_section"] = True 46 | 47 | master = [m for m in master if "186465" not in m["filename"]] 48 | # Process metadata 49 | 50 | #test = random.sample(master,500) 51 | batch = master[80000:] 52 | meta = [] 53 | 54 | img_ct = 0 55 | start = time.time() 56 | for r in batch: 57 | #t0 = time.time() 58 | f = os.path.join(r["folder"],r["filename"]) 59 | orig = Image.open(f) 60 | 61 | side = r["side"] 62 | 63 | ang = rotation_angle(orig) 64 | 65 | if r["start_section"]: 66 | diff, background, orig_bbox = trim(orig, angle=ang, find_top=False) 67 | else: 68 | diff, background, orig_bbox = trim(orig, angle=ang) 69 | 70 | if "196" in r["folder"] or "195" in r["folder"]: 71 | total_bbox = orig_bbox 72 | cut = None 73 | else: 74 | bheight = 50 75 | band_dict = get_bands(diff, bheight=bheight) 76 | 77 | width = diff.size[0] 78 | cut = simp_bd(band_dict=band_dict, diff=diff, side=side, width=width, 79 | pad=10, freq =0.9) 80 | 81 | out_bbox = [0, 0] + list(diff.size) 82 | side_dict = {"left":0, "right":2} 83 | out_bbox[side_dict[side]] = cut 84 | 85 | total_bbox = combine_bbox(orig_bbox,out_bbox) 86 | 87 | meta_list = [r["filename"], ang, side, cut] 88 | meta_list.extend(background) 89 | meta_list.extend(total_bbox) 90 | meta.append(meta_list) 91 | img_ct +=1 92 | #print (r["filename"], time.time() - t0) 93 | if img_ct % 100 ==0: 94 | 
print(img_ct, time.time() - start) 95 | 96 | headers = ["file","angle","side","cut","backR","backG","backB", 97 | "bbox1","bbox2","bbox3","bbox4"] 98 | with open(r"..\marginalia_metadata_part2.csv","a",newline="") as outfile: 99 | writer=csv.writer(outfile) 100 | if outfile.tell() == 0: 101 | writer.writerow(headers) 102 | for row in meta: 103 | writer.writerow(row) 104 | -------------------------------------------------------------------------------- /code/marginalia/marginalia_removal.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Fri Jul 5 09:11:57 2019 4 | 5 | @author: mtjansen 6 | """ 7 | 8 | import csv 9 | import os 10 | from PIL import Image 11 | 12 | meta = [] 13 | with open(r"C:\Users\mtjansen\Desktop\OnTheBooks\marginalia_metadata.csv","r") as csvfile: 14 | reader = csv.DictReader(csvfile) 15 | for row in reader: 16 | meta.append(row) 17 | 18 | def remove_marginalia(img, meta, image_directory, file_output=False, output_directory = None): 19 | """Uses marginalia metadata to crop image and add border. 20 | 21 | Parameters: 22 | img (str): image file name with or without .jp2 file ending 23 | meta (list): list of dicts from "marginalia_metadata.csv" with keys: 24 | file: file path with extension 25 | angle: angle of rotation 26 | backR: Red channel of background color in RGB 27 | backG: Green channel of background color in RGB 28 | backB: Blue channel of background color in RGB 29 | bbox1: First coordinate of bounding box (left) 30 | bbox2: Second coordinate of bounding box (top) 31 | bbox3: Third coordinate of bounding box (right) 32 | bbox4: Fourth coordinate of bounding box (bottom) 33 | image_directory (str): path to directory containing volume subfolders e.g.
34 | 1865-1968 jp2 files\sessionlaws196365nort_jp2\sessionlaws196365nort_0000.jp2 35 | The path above maps to a single image, therefore the path to 36 | 1865-1968 jp2 files should be supplied to image_directory 37 | file_output (bool): whether to locally save a jpg version of the 38 | cropped image 39 | output_directory (str): path to directory to save output images if indicated 40 | by file_output. Directory structure will mirror the input directories 41 | in image_directory 42 | 43 | Returns: 44 | PIL.Image.Image: An image cropped as indicated in meta, with a 200 pixel wide 45 | border filled in with the supplied background color in meta. 46 | If file_output is selected, a jpg version of the cropped image will be saved 47 | to output_directory. 48 | """ 49 | 50 | try: 51 | if not img.endswith(".jp2"): 52 | img = img+".jp2" 53 | row = [r for r in meta if r["file"]==img][0] 54 | path = os.path.join(image_directory, 55 | row["file"].split("_")[0]+"_jp2", 56 | row["file"]) 57 | background = tuple([int(n) for n in [row["backR"],row["backG"], 58 | row["backB"]]]) 59 | bbox = tuple([int(n) for n in [row["bbox1"],row["bbox2"], 60 | row["bbox3"],row["bbox4"]]]) 61 | orig = Image.open(path) 62 | new = orig.rotate(float(row["angle"])).crop(bbox) 63 | outimg = Image.new(orig.mode, tuple(x+400 for x in new.size), background) 64 | offset = (200, 200) 65 | outimg.paste(new, offset) 66 | 67 | if file_output: 68 | # out = os.path.join(output_directory, 69 | # row["file"].split("_")[0]+"_jp2", 70 | # row["file"]) 71 | # if not (os.path.exists(os.path.split(out)[0])): 72 | # os.mkdir(os.path.split(out)[0]) 73 | out = os.path.join(output_directory, 74 | row["file"].replace(".jp2",".jpg")) 75 | outimg.save(out, "JPEG") 76 | 77 | return outimg 78 | except IndexError: 79 | print("Image not found in metadata") 80 | 81 | 82 | 83 | #Test 84 | image_dir = r"C:\Users\mtjansen\Desktop\OnTheBooks\1865-1968 jp2 files" 85 | output_dir = r"C:\Users\mtjansen\Desktop\OnTheBooks\out_width" 86 | 87 | #import random 88
| #test_set = random.sample(meta,500) 89 | 90 | outliers = [] 91 | with open(r"C:\Users\mtjansen\Desktop\OnTheBooks\outlier_metadata_width.csv","r") as csvfile: 92 | reader = csv.DictReader(csvfile) 93 | for row in reader: 94 | outliers.append(row) 95 | 96 | #test_set = [row for row in meta if row["file"] in ["publiclocallawsp1917nort_0568.jp2","publiclocallawsp1933nort_0063.jp2"]] 97 | # 98 | #test_set = random.sample(outliers,100) 99 | 100 | 101 | for row in outliers: 102 | img = remove_marginalia(img = row["file"], 103 | meta = outliers, 104 | image_directory = image_dir, 105 | file_output = True, 106 | output_directory = output_dir) 107 | # Manual equivalent of the remove_marginalia call above, kept for reference 108 | for row in outliers: 109 | path = os.path.join(image_dir, 110 | row["file"].split("_")[0]+"_jp2", 111 | row["file"]) 112 | out = os.path.join(output_dir, 113 | row["file"].replace(".jp2",".jpg")) 114 | 115 | background = tuple([int(n) for n in [row["backR"],row["backG"], 116 | row["backB"]]]) 117 | bbox = tuple([int(n) for n in [row["bbox1"],row["bbox2"], 118 | row["bbox3"],row["bbox4"]]]) 119 | 120 | orig = Image.open(path) 121 | new = orig.rotate(float(row["angle"])).crop(bbox) 122 | outimg = Image.new(orig.mode, tuple(x+400 for x in new.size), background) 123 | offset = (200, 200) 124 | outimg.paste(new, offset) 125 | outimg.save(out, "JPEG") -------------------------------------------------------------------------------- /code/ocr/adjRec.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Tue Jul 23 16:38:46 2019 5 | 6 | @author: Lorin Bruckner 7 | 8 | Digital Research Services 9 | University Libraries 10 | UNC Chapel Hill 11 | """ 12 | 13 | import os, sys 14 | import pandas 15 | from random import sample 16 | import csv 17 | 18 | #get ocr functions 19 | sys.path.insert(0, "/Users/tuesday/Documents/_Projects/Research/OnTheBooks/OCR/") 20 | from ocr_func import cutMarg, OCRtestImg, testList 21
| 22 | def adjRec(vol, dirpath, masterlist, margdata, n): 23 | 24 | """ 25 | 26 | Get the best image adjustments to use on a volume. 27 | 28 | vol (str) : The name for the volume to be tested. Should not include 29 | "_jp2" (so for 1879, it's "lawsresolutionso1879nort") 30 | 31 | dirpath (str) : The directory path for the folder where ALL volumes are located 32 | 33 | masterlist (str) : The direct file path for xmljpegmerge_official.csv 34 | 35 | margdata (str) : The direct file path for the csv with marginalia data 36 | 37 | n (int) : The sample size to use for testing 38 | 39 | """ 40 | 41 | #Merge csvs (renamed so the local variable doesn't shadow the csv module) 42 | mastercsv = pandas.read_csv(masterlist) 43 | margcsv = pandas.read_csv(margdata) 44 | mastercsv["filename"] = mastercsv["filename"] + ".jp2" 45 | merged = mastercsv.merge(margcsv, left_on="filename", right_on="file") 46 | 47 | #Create a pool of image filenames for the volume and take a sample 48 | pool = [] 49 | csvf = merged[merged["filename"].str.startswith(vol)].set_index("filename") 50 | 51 | for row in csvf.itertuples(): 52 | pool.append(os.path.normpath(os.path.join(dirpath, vol + "_jp2/" + row.file))) 53 | 54 | pool = sample(pool, n) 55 | 56 | #Get images for files in sample, cut margins and make a test list 57 | imgs = [] 58 | results = [] 59 | 60 | for img in pool: 61 | 62 | #get image name 63 | name = os.path.split(img)[1] 64 | 65 | #get values for cutting margins 66 | rotate = csvf.loc[name]["angle"] 67 | left = csvf.loc[name]["bbox1"] 68 | up = csvf.loc[name]["bbox2"] 69 | right = csvf.loc[name]["bbox3"] 70 | lower = csvf.loc[name]["bbox4"] 71 | bkgcol = (csvf.loc[name]["backR"], csvf.loc[name]["backG"], csvf.loc[name]["backB"]) 72 | 73 | #cut the margins 74 | img = cutMarg(img = img, rotate = rotate, left = left, up = up, right = right, 75 | lower = lower, border = 200, bkgcol = bkgcol) 76 | 77 | #add the new image to the list 78 | imgs.append(img) 79 | 80 | #perform an OCR test on the new image and add the results to the list 81 | 
results.append(OCRtestImg(img)) 82 | 83 | #create a testList object with the images and results 84 | testSample = testList(imgs, results) 85 | 86 | #set up a dict of recommended adjustments and perform tests 87 | adjustments = { "volume": vol, "color": 1.0, "invert": False, 88 | "autocontrast": 0, "blur": False, "sharpen": False, 89 | "smooth": False, "xsmooth": False } 90 | 91 | #color test 92 | testRes = testSample.adjustTest("color", levels = [1,.75,.5,.25,0]) 93 | best = float(testRes["best_adjustment"].replace("color", "")) 94 | if best != 1.0: 95 | testSample = testSample.adjustImg(color = best) 96 | adjustments["color"] = best 97 | 98 | #invert test 99 | # testRes = testSample.adjustTest("invert") 100 | # if testRes["best_adjustment"] == "invertTrue": 101 | # testSample = testSample.adjustImg(invert = True) 102 | # adjustments["invert"] = True 103 | 104 | #autocontrast test 105 | testRes = testSample.adjustTest("autocontrast", levels = [0,2,4,6,8]) 106 | best = float(testRes["best_adjustment"].replace("autocontrast", "")) 107 | if best != 0.0: 108 | testSample = testSample.adjustImg(autocontrast = best) 109 | adjustments["autocontrast"] = best 110 | 111 | #blur test 112 | testRes = testSample.adjustTest("blur") 113 | if testRes["best_adjustment"] == "blurTrue": 114 | testSample = testSample.adjustImg(blur = True) 115 | adjustments["blur"] = True 116 | 117 | #sharpen test 118 | testRes = testSample.adjustTest("sharpen") 119 | if testRes["best_adjustment"] == "sharpenTrue": 120 | testSample = testSample.adjustImg(sharpen = True) 121 | adjustments["sharpen"] = True 122 | 123 | #smooth test 124 | testRes = testSample.adjustTest("smooth") 125 | if testRes["best_adjustment"] == "smoothTrue": 126 | testSample = testSample.adjustImg(smooth = True) 127 | adjustments["smooth"] = True 128 | 129 | #xsmooth test 130 | testRes = testSample.adjustTest("xsmooth") 131 | if testRes["best_adjustment"] == "xsmoothTrue": 132 | adjustments["xsmooth"] = True 133 | 134 | return
adjustments 135 | 136 | 137 | ########### Set up locations ############################################### 138 | 139 | dirpath = "/Users/tuesday/Documents/_Projects/Research/OnTheBooks/1865-1968 jp2 files/" 140 | masterlist = "/Users/tuesday/Documents/_Projects/Research/OnTheBooks/xmljpegmerge_official.csv" 141 | margdata = "/Users/tuesday/Documents/_Projects/Research/OnTheBooks/marginalia_metadata_part2_fix.csv" 142 | 143 | 144 | ########### Recommend Adjustments for a Single Volume ###################### 145 | 146 | adj1943 = adjRec("sessionlawsresol1943nort", dirpath, masterlist, margdata, 10) 147 | 148 | 149 | ########### Create a CSV with Adjustment Specs for all Volumes ############## 150 | 151 | savfile = "/Users/tuesday/Documents/_Projects/Research/OnTheBooks/output/adjustments.csv" 152 | 153 | for folder in os.listdir(dirpath): 154 | 155 | if folder == ".DS_Store": 156 | continue 157 | 158 | #get volume 159 | vol = folder.replace("_jp2", "") 160 | print("Testing " + vol + "...") 161 | 162 | #perform adjustment tests 163 | adjRow = adjRec(vol, dirpath, masterlist, margdata, 10) 164 | 165 | #record adjustments (write the header only on the first row) 166 | with open(savfile, "a", newline="") as f: 167 | w = csv.DictWriter(f, adjRow.keys()) 168 | if f.tell() == 0: 169 | w.writeheader() 170 | w.writerow(adjRow) -------------------------------------------------------------------------------- /code/ocr/geonames.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Fri May 31 16:46:58 2019 5 | 6 | @author: Lorin Bruckner 7 | 8 | Digital Research Services 9 | University Libraries 10 | UNC Chapel Hill 11 | """ 12 | 13 | import pandas 14 | from nltk import word_tokenize 15 | 16 | #Read in the tab delimited file from http://download.geonames.org/export/dump/US.zip 17 | #File was downloaded 5/31/19, 4:41 PM 18 | gn =
pandas.read_csv("/Users/tuesday/Documents/_Projects/Research/OnTheBooks/US/US.txt", sep="\t", header = None) 19 | 20 | #Filter records for North Carolina 21 | ncgn = gn[gn.loc[:,10] == "NC"] 22 | 23 | #Collect all geonames into a single string 24 | geonames = [] 25 | for index,row in ncgn.iterrows(): 26 | if isinstance(row[2], str): 27 | geonames.append(row[2]) 28 | geonames = " ".join(geonames) 29 | 30 | #Tokenize geonames. Remove punctuation and single letters, lowercase, then drop duplicates. 31 | geotokens = word_tokenize(geonames) 32 | geotokens = [token for token in geotokens if token.isalpha()] 33 | geotokens = [token.lower() for token in geotokens] 34 | geotokens = list(dict.fromkeys(geotokens)) 35 | geotokens = [token for token in geotokens if len(token) > 1] 36 | 37 | #Create text file to add to Spell Checker 38 | with open("/Users/tuesday/Documents/_Projects/Research/OnTheBooks/geonames.txt", "w") as file: 39 | for token in geotokens: 40 | file.write(token + " ") -------------------------------------------------------------------------------- /code/ocr/ocr_use.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Mon Jul 22 15:18:26 2019 5 | 6 | @author: Lorin Bruckner 7 | 8 | Digital Research Services 9 | University Libraries 10 | UNC Chapel Hill 11 | """ 12 | 13 | import os, sys 14 | import pandas 15 | from datetime import datetime 16 | 17 | #get ocr functions 18 | sys.path.insert(0, "/Users/tuesday/Documents/_Projects/Research/OnTheBooks/OCR/") 19 | from ocr_func import cutMarg, adjustImg, tsvOCR 20 | 21 | #Set up locations 22 | masterlist = "/Users/tuesday/Documents/_Projects/Research/OnTheBooks/xmljpegmerge_official.csv" 23 | margdata = "/Users/tuesday/Documents/_Projects/Research/OnTheBooks/marginalia_metadata_part2_fix.csv" 24 | adjdata = "/Users/tuesday/Documents/_Projects/Research/OnTheBooks/adjustments_fixed.csv" 25 | rootImgDir =
"/Users/tuesday/Documents/_Projects/Research/OnTheBooks/1865-1968 jp2 files/" 26 | outDir = "/Users/tuesday/Documents/_Projects/Research/OnTheBooks/output/" 27 | 28 | #Read csvs 29 | mastercsv = pandas.read_csv(masterlist) 30 | margcsv = pandas.read_csv(margdata) 31 | adjcsv = pandas.read_csv(adjdata) 32 | 33 | #Create column for volume 34 | for index, row in mastercsv.iterrows(): 35 | volume = row["filename"].split("_")[0] 36 | mastercsv.at[index, "volume"] = volume 37 | 38 | #Merge csvs 39 | mastercsv["filename"] = mastercsv["filename"] + ".jp2" 40 | mcsv = mastercsv.merge(margcsv, left_on="filename", right_on="file") 41 | fcsv = mcsv.merge(adjcsv, on = "volume", how = "right") 42 | 43 | #get separate volumes 44 | volsGrouped = fcsv.groupby("volume") 45 | vols = volsGrouped.groups.keys() 46 | 47 | #loop through volumes 48 | for vol in vols: 49 | 50 | print("") 51 | 52 | #create a folder for the volume in the output directory if it doesn't already exist 53 | newdir = os.path.normpath(os.path.join(outDir, vol)) 54 | if not os.path.exists(newdir): 55 | os.mkdir(newdir) 56 | 57 | #select rows for volume 58 | voldf = volsGrouped.get_group(vol) 59 | 60 | #get separate section types 61 | secsGrouped = voldf.groupby("sectiontype") 62 | secs = secsGrouped.groups.keys() 63 | 64 | #create separate OCR files for each section 65 | for sec in secs: 66 | 67 | #select rows for section type 68 | secsdf = secsGrouped.get_group(sec) 69 | 70 | print(datetime.now().strftime("%H:%M") + " Processing " + vol + " " + sec + "...") 71 | 72 | #Loop through section 73 | for row in secsdf.itertuples(): 74 | 75 | img = os.path.normpath(os.path.join(rootImgDir, vol + "_jp2", row.file)) 76 | 77 | #set up margin cutting 78 | cuts = {"rotate" : row.angle, 79 | "left" : row.bbox1, 80 | "up" : row.bbox2, 81 | "right" : row.bbox3, 82 | "lower" : row.bbox4, 83 | "border" : 200, 84 | "bkgcol" : (row.backR, row.backG, row.backB)} 85 | 86 | #set up image adjustment 87 | adjustments = {"color":
row.color, 88 | "autocontrast": row.autocontrast, 89 | "blur": row.blur, 90 | "sharpen": row.sharpen, 91 | "smooth": row.smooth, 92 | "xsmooth": row.xsmooth} 93 | 94 | #Record image adjustments 95 | adjf = open(os.path.normpath(os.path.join(outDir, vol, vol + "_adjustments.txt")), "w") 96 | adjf.write("IMAGE ADJUSTMENTS\n\n") 97 | for key, value in adjustments.items(): 98 | adjf.write("{}: {}\n" .format(key, value)) 99 | adjf.close() 100 | 101 | #OCR the image 102 | tsvOCR((adjustImg(cutMarg(img, **cuts), **adjustments)), 103 | savpath = os.path.normpath(os.path.join(outDir, vol, vol + "_" + sec + ".txt")), 104 | tsvfile = vol + "_" + sec + "_data.tsv") 105 | -------------------------------------------------------------------------------- /code/split_cleanup/00_initial_ch_sec_split.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Tue Aug 11 14:14:58 2020 4 | 5 | @summary: This script parses raw OCR output files for section and chapter 6 | headers. It creates new versions of these raw files as well as new files 7 | that have been aggregated into sections. This script corresponds to the 8 | "Step 1. Initial Splitting Process" section of the Split Cleanup Jupyter 9 | notebook. 10 | 11 | @author: Rucha Dalwadi & Matt Jansen 12 | 13 | Digital Research Services 14 | University Libraries 15 | UNC Chapel Hill 16 | """ 17 | 18 | 19 | import pandas as pd 20 | import numpy as np 21 | import joblib 22 | import re 23 | import os 24 | 25 | 26 | 27 | def tsvparser(filename): 28 | """ 29 | 30 | Identifies chapters and sections within a raw OCR output .tsv file for 31 | a single volume. 32 | 33 | Assigns chapter and section identifiers to rows in the raw file and 34 | creates an aggregate pd.DataFrame object grouping text into 35 | individual sections. 36 | 37 | Outputs a new .csv version of the raw file and creates an initial .csv 38 | version of the aggregate file. 
39 | 40 | Arguments 41 | -------------------------------------------------------------------------- 42 | filename (str) : The filepath for a single volume's raw .tsv OCR 43 | output file. 44 | 45 | 46 | Returns 47 | -------------------------------------------------------------------------- 48 | N/A 49 | 50 | """ 51 | 52 | 53 | 54 | # Import the raw .tsv file into a pd.DataFrame object 55 | raw = pd.read_csv(filename) 56 | raw['text'] = raw['text'].replace(np.nan, '') 57 | 58 | # Add columns to dataframe and create lists to which identified chapter 59 | # and section headers will be appended 60 | # Set the variables for chapter and section that will be used to fill out 61 | # the above lists 62 | chapter = '' 63 | chapter_column = [] 64 | raw['chapter'] = '' 65 | section = '' 66 | section_column = [] 67 | raw['section'] = '' 68 | 69 | # Iterate through all rows in the raw file (word by word) 70 | for i in range(0, raw.shape[0]): 71 | 72 | # Initialize as variables the regex patterns used to identify 73 | # chapters (match_chapter), abbreviated section headers 74 | # (match_section), and unabbreviated "Section 1" sections 75 | # (match_section1) 76 | match_chapter = re.match('^(C|O)[A-Za-z]*(R|r)(\.|,|:|;)*$', raw.iloc[i]['text']) 77 | match_section = re.match('(S|s)[a-zA-Z]{2,3}(\.|,|:|;){0,2}$', raw.iloc[i]['text']) 78 | match_section1 = re.match('S[a-zA-Z]+$', raw.iloc[i]['text']) 79 | 80 | # Create a matching condition to check for three blank rows above 81 | # potential matches 82 | blank3 = (re.match('^$', raw.iloc[i-1]['text']) and 83 | re.match('^$', raw.iloc[i-2]['text']) and 84 | re.match('^$', raw.iloc[i-3]['text'])) 85 | 86 | # The following conditional statements check for situations that 87 | # indicate the beginning of a new chapter, a new section, or a new 88 | # first section. If any of these are satisfied, the 'chapter' or 89 | # 'section' variable is changed accordingly.
Once all conditionals have 90 | # been checked, the resulting 'chapter' and 'section' values for the 91 | # word in question are added to their respective lists. 92 | # The results are two lists, one for each column, with a chapter and 93 | # section value for each row in the raw file. 94 | 95 | # Check for new chapters 96 | if (match_chapter and 97 | re.search('[0-9.]+(\.|,|:|;){0,2}', raw.iloc[i+1]['text']) and 98 | blank3): 99 | chapter = raw.iloc[i]['text'] +' '+ raw.iloc[i+1]['text'] 100 | 101 | # Check for new abbreviated sections 102 | if ((match_section and re.search('^[0-9.\}]+(\.|,|:|;){0,2}$', raw.iloc[i+1]['text'])) or 103 | (match_section and blank3)): 104 | section = raw.iloc[i]['text'] +' '+ raw.iloc[i+1]['text'] 105 | 106 | # Check for new unabbreviated "Section 1" sections 107 | if (match_section1 and 108 | re.search('^(1|.)(\.|,|:|;){0,2}$', raw.iloc[i+1]['text'])): 109 | section = raw.iloc[i]['text'] +' '+ raw.iloc[i+1]['text'] 110 | 111 | # Set the "section" value to blank for areas of text belonging 112 | # to a chapter title and not an actual section 113 | if (match_chapter and 114 | re.search('[0-9.]+(\.|,|:|;){0,2}', raw.iloc[i+1]['text']) and 115 | blank3 != raw.iloc[i]['chapter']): 116 | section = '' 117 | 118 | # Add the resulting 'section' and 'chapter' values to their respective 119 | # lists. 120 | section_column.append(section) 121 | chapter_column.append(chapter) 122 | 123 | # Once all words in the raw file have been checked, add the lists as 124 | # columns to the raw dataframe. 
125 | raw['chapter'] = chapter_column 126 | raw['section'] = section_column 127 | 128 | # Add a chapter index to differentiate duplicate chapter headers 129 | raw["chapter_index"] = ((raw["chapter"].notna()) & (raw["chapter"]!=raw["chapter"].shift(1))).cumsum() 130 | 131 | # Add cell values for special cases 132 | raw.loc[((raw["chapter"]=="") & (raw["section"]=="")), ["chapter","section"]] = "Paratextual" 133 | raw.loc[((raw["chapter"]!="") & (raw["section"]=="")), ["section"]] = "Chapter_Title" 134 | raw.loc[((raw["chapter"]=="") & (raw["section"]!="")), ["chapter"]] = "Chapter_UNKNOWN" 135 | 136 | # Create the aggregate dataframe grouping words by their identified 137 | # section assignments 138 | agg = raw[raw["text"]!=""].groupby(['chapter', 'section', 'chapter_index'], sort=False)['text'].apply(' '.join).reset_index() 139 | 140 | # Output the raw and aggregate dataframes as .csv files 141 | raw_outname = os.path.join("outputs","raw",filename.replace(".tsv",'') + "_output.csv") 142 | agg_outname = os.path.join("outputs","agg",filename.replace(".tsv",'') + "_aggregated_output.csv") 143 | 144 | raw.to_csv(raw_outname, index=False, encoding="utf-8-sig") 145 | agg.to_csv(agg_outname, index=False, encoding="utf-8-sig") 146 | 147 | 148 | def main(): 149 | 150 | # Set OCR output file directory path 151 | ocr_path = "." 152 | 153 | # Create directories for new raw/agg files 154 | parse_output_path = "."
155 | os.makedirs(os.path.join(parse_output_path,"outputs","agg"),exist_ok=True) 156 | os.makedirs(os.path.join(parse_output_path,"outputs","raw"),exist_ok=True) 157 | 158 | # Create a list of all raw OCR output files in corpus 159 | listdir = [f for f in os.listdir(ocr_path) if f.endswith(".tsv")] 160 | 161 | # Run "tsvparser" function in parallel to decrease compute time 162 | with joblib.parallel_backend(n_jobs=7,backend='loky'): 163 | joblib.Parallel(verbose=5)(joblib.delayed(tsvparser)(filename) for filename in listdir) 164 | 165 | 166 | if __name__ == "__main__": 167 | main() -------------------------------------------------------------------------------- /code/split_cleanup/01_auto_chap_clean1.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Thu Aug 6 10:28:19 2020 4 | 5 | @summary: This script corresponds to the "Step 2. Chapter Cleanup: 6 | First Automatic Pass" section of the Split Cleanup Jupyter notebook 7 | documentation. This script generates a group of excel files, one for each 8 | volume, with chapter errors identified and with certain suggested 9 | corrections ("chapnumflags" files). These files are then utilized in the 10 | "Step 3. Chapter Cleanup: First Manual Pass" section of the Split 11 | Cleanup Jupyter notebook. 
12 | 13 | @author: Neil Byers 14 | 15 | Digital Research Services 16 | University Libraries 17 | UNC Chapel Hill 18 | """ 19 | 20 | 21 | import csv 22 | import pandas as pd 23 | import os 24 | from string import punctuation 25 | import numpy as np 26 | import xlsxwriter 27 | csv.field_size_limit(600000) 28 | 29 | # Create a variable to store all automatic fix/recommendation data for each volume 30 | # This will later be used to output a report file for this step 31 | meta_list=[] 32 | 33 | 34 | def initial_chap_fixes(agg_folder, agg_file): 35 | """ 36 | This function identifies chapter split numbering errors, suggests corrections 37 | for certain situations, and outputs a volume-level list of chapters with 38 | potential errors and suggested corrections flagged for manual review. 39 | The function does not provide any return values. Instead, it outputs a 40 | single Excel file for each volume and adds volume-level metadata to the 41 | corpus-level report list ("meta_list") 42 | 43 | Arguments 44 | -------------------------------------------------------------------------- 45 | agg_folder (str) : The string filepath for the directory containing 46 | the corpus "aggregate" files 47 | agg_file (str) : The string base file name for an individual 48 | volume's "aggregate" file 49 | 50 | Returns 51 | -------------------------------------------------------------------------- 52 | N/A 53 | 54 | """ 55 | 56 | # Create path string variables and import the agg file into a Pandas dataframe 57 | inpath = os.path.join(agg_folder, agg_file) 58 | 59 | outpath = inpath.replace("chap_adjusted_agg", "chap_num_flags") 60 | outpath = outpath.replace("aggregated_chapadjusted.csv", "chapnumflags.xlsx") 61 | 62 | vol_df = pd.read_csv(inpath, encoding = 'utf-8-sig') 63 | 64 | # Create lists to be converted to series for a chapter-level dataframe that 65 | # will be exported as an excel file 66 | chap_headers = [] 67 | chap_indices = [] 68 | chap_nums_raw = [] 69 | 70 | 71 | # Populate the above lists
72 | if vol_df.loc[0,"chapter"]=="Paratextual": 73 | for idx, row in vol_df.iterrows(): 74 | if idx > 0: 75 | if vol_df.loc[idx,"chapter"] != vol_df.loc[idx-1,"chapter"]: 76 | chap_headers.append(vol_df.loc[idx,"chapter"]) 77 | chap_indices.append(vol_df.loc[idx,"chapter_index"]) 78 | else: 79 | for idx, row in vol_df.iterrows(): 80 | if idx==0: 81 | chap_headers.append(vol_df.loc[idx,"chapter"]) 82 | chap_indices.append(vol_df.loc[idx,"chapter_index"]) 83 | elif vol_df.loc[idx,"chapter"] != vol_df.loc[idx-1,"chapter"]: 84 | chap_headers.append(vol_df.loc[idx,"chapter"]) 85 | chap_indices.append(vol_df.loc[idx,"chapter_index"]) 86 | for i in range(0,len(chap_headers)): 87 | 88 | try: 89 | chapter_num = chap_headers[i].split()[1] 90 | chapter_num = chapter_num.rstrip(punctuation) 91 | chap_nums_raw.append(chapter_num) 92 | except IndexError: 93 | chap_nums_raw.append("N/A") 94 | 95 | # Convert the above lists to series 96 | # Create a stable list of the original numbering for 97 | # comparison purposes 98 | raw_titles = pd.Series(chap_headers) 99 | indices_Series = pd.Series(chap_indices) 100 | unq_ch = pd.Series(pd.to_numeric(chap_nums_raw, errors="coerce")) 101 | orig_num = unq_ch.copy() 102 | 103 | 104 | 105 | # Complete all lag3, lag2, and lag1 fixes 106 | for row_num in range(3,len(unq_ch)-3): 107 | if unq_ch[row_num]!= (unq_ch[row_num+1]-1) and unq_ch[row_num]!= (unq_ch[row_num-1]+1): 108 | lag1_test = (unq_ch[row_num+1]-unq_ch[row_num-1])==2 109 | lag2_test = (unq_ch[row_num+2]-unq_ch[row_num-2])==4 and unq_ch[row_num+2]-unq_ch[row_num+1]==1 110 | lag3_test = (unq_ch[row_num+3]-unq_ch[row_num-3])==6 and unq_ch[row_num+3]-unq_ch[row_num+2]==1 111 | 112 | if lag1_test and lag2_test and lag3_test: 113 | unq_ch[row_num] = unq_ch[row_num-3]+3 114 | 115 | for row_num in range(2,len(unq_ch)-2): 116 | if unq_ch[row_num]!= (unq_ch[row_num+1]-1) and unq_ch[row_num]!= (unq_ch[row_num-1]+1): 117 | lag1_test = (unq_ch[row_num+1]-unq_ch[row_num-1])==2 118 | lag2_test =
(unq_ch[row_num+2]-unq_ch[row_num-2])==4 and unq_ch[row_num+2]-unq_ch[row_num+1]==1 119 | if lag1_test and lag2_test: 120 | unq_ch[row_num] = unq_ch[row_num-2]+2 121 | 122 | for row_num in range(1,len(unq_ch)-1): 123 | if unq_ch[row_num]!= (unq_ch[row_num+1]-1) and unq_ch[row_num]!= (unq_ch[row_num-1]+1): 124 | lag1_test = (unq_ch[row_num+1]-unq_ch[row_num-1])==2 125 | if lag1_test: 126 | unq_ch[row_num] = unq_ch[row_num-1]+1 127 | 128 | # Parse chapter rows in groups of 5 to flag areas with potential errors 129 | # Mark those chapters that were corrected by the lag fix steps above 130 | max_diff = unq_ch.diff(1).rolling(window=5, center=True).max() 131 | min_diff = unq_ch.diff(1).rolling(window=5, center=True).min() 132 | flag = ~((max_diff == min_diff) & (max_diff == 1)) 133 | corrected = np.logical_and(unq_ch != orig_num, np.isnan(unq_ch)==False) 134 | 135 | # Compile dataframe to be exported as an excel file 136 | output = pd.concat([raw_titles, orig_num, indices_Series, unq_ch, corrected, flag], axis=1) 137 | output.columns = ['chap_title', 'raw_num', 'chapter_index', 'corrected_num', 'correction_made', 'flag'] 138 | 139 | # Create an excel workbook from the above dataframe, add formatting to make 140 | # corrections and errors more easily findable, and save. 141 | with pd.ExcelWriter(outpath, engine='xlsxwriter') as writer: 142 | 143 | # create workbook object 144 | workbook = writer.book 145 | 146 | header_format = workbook.add_format({'bold': True, 147 | 'valign': 'vcenter', 148 | 'border': 1, 149 | 'bg_color': '#e2efda', 150 | 'font_size': 14}) 151 | flag_format = workbook.add_format({'bg_color': '#f8cbad'}) 152 | corrected_format = workbook.add_format({'bg_color': '#b7dee8'}) 153 | 154 | 155 | # Convert the dataframe to an XlsxWriter Excel object. 
156 | output.to_excel(writer, sheet_name='op', index=False, startrow=1, header=False) 157 | outputSheet = writer.sheets['op'] 158 | for col_num, value in enumerate(output.columns.values): 159 | outputSheet.write(0, col_num, value, header_format) 160 | outputSheet.set_column(0, 0, 13) 161 | outputSheet.set_column(1, 1, 10) 162 | outputSheet.set_column(2, 2, 16) 163 | outputSheet.set_column(3, 3, 18) 164 | outputSheet.set_column(4, 4, 9) 165 | 166 | outputSheet.conditional_format('F2:F'+str(output.shape[0]+1), {'type': 'cell', 167 | 'criteria': 'equal to', 168 | 'value': True, 169 | 'format': flag_format}) 170 | outputSheet.conditional_format('E2:E'+str(output.shape[0]+1), {'type': 'cell', 171 | 'criteria': 'equal to', 172 | 'value': True, 173 | 'format': corrected_format}) 174 | 175 | # no explicit save needed: the ExcelWriter context manager saves on exit 176 | 177 | 178 | # Add fix metadata for the volume in question to the corpus-level list 179 | # This list will be saved as a report .csv file 180 | try: 181 | corrections = corrected.value_counts()[1] 182 | except KeyError: # no corrections were made 183 | corrections = 0 184 | meta_list.append({"agg_file":agg_file, 185 | "chap_count":output.shape[0], 186 | "flags":flag.value_counts()[1], 187 | "corrections":corrections}) 188 | 189 | 190 | 191 | def main(): 192 | # Set the filepath variable for the directory containing the corpus 193 | # aggregate files 194 | agg_filelist = os.listdir(r"C:\Users\npbyers\Desktop\OTB\ChapNumFixes\chap_adjusted_agg") 195 | agg_folder = "./chap_adjusted_agg/" 196 | 197 | # Perform chapter fix/report operations for each volume using the 198 | # "initial_chap_fixes" function 199 | for agg_file in agg_filelist: 200 | initial_chap_fixes(agg_folder, agg_file) 201 | 202 | # Compile the corpus-level report for this step and output it to a .csv file 203 | meta = pd.DataFrame(meta_list) 204 | meta.to_csv("chap_nums_check.csv") 205 | 206 | if __name__ == "__main__": 207 | main() -------------------------------------------------------------------------------- 
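The lag fixes in the script above apply the same test at three window widths. As a minimal, self-contained sketch of the narrowest ("lag1") case — toy data invented for illustration, not part of the pipeline — a chapter number that agrees with neither neighbor, while its neighbors sit exactly 2 apart, is treated as an OCR misread and replaced with its predecessor plus 1:

```python
# Toy demonstration of the "lag1" chapter-number repair used above.
# The value 9 stands in for an OCR misread of chapter 4.
import pandas as pd

nums = pd.Series([1, 2, 3, 9, 5, 6, 7], dtype="float64")

for i in range(1, len(nums) - 1):
    # out of sequence with both neighbors?
    if nums[i] != nums[i + 1] - 1 and nums[i] != nums[i - 1] + 1:
        # neighbors bracket exactly one missing value?
        if nums[i + 1] - nums[i - 1] == 2:
            nums[i] = nums[i - 1] + 1  # repair the middle value

print(nums.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```

The rolling-window `flag` computed afterwards then marks any five-chapter stretch whose first differences are not uniformly 1, so sequences the lag fixes could not repair are still routed to a human reviewer.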
/code/split_cleanup/03_gen_manual_chapfix_files.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Fri Aug 7 08:44:09 2020 4 | 5 | @summary: This script creates the files used in "Step 5. Chapter Cleanup: 6 | Second Manual Pass" (see the Split Cleanup Jupyter notebook). These 7 | files, called 'flag_rows' files, consist of all rows from a given volume's 8 | raw file which belong to chapters that remain 'flagged' - in other words, 9 | those chapters in the volume that are identified as being in the vicinity 10 | of chapter numbering errors. It places these new files along with the 11 | 'chapnumflags' files from previous steps in volume-specific directories to 12 | enable quick action by manual reviewers. 13 | 14 | 15 | @author: Neil Byers 16 | 17 | Digital Research Services 18 | University Libraries 19 | UNC Chapel Hill 20 | """ 21 | 22 | 23 | import pandas as pd 24 | import os 25 | import numpy as np 26 | import joblib 27 | import shutil 28 | 29 | def create_manual_files(raw_fix_pair): 30 | """ 31 | This function generates files containing only those rows in a raw file 32 | that belong to chapters which remain 'flagged' after the first rounds of 33 | manual and automatic chapter numbering error corrections. One "flag_rows" 34 | file is generated for each volume with remaining chapter errors. These files 35 | are intended for use by manual reviewers to aid them in cleaning the chapter 36 | errors that could not be fixed automatically. 37 | 38 | The "flag_rows" files contain chapter, section, text, and chapter_index 39 | information for each row (word). Volume metadata and Internet Archive 40 | jpeg/pdf urls for each page are also included. These final pieces of information 41 | allow manual reviewers to quickly access page images to aid them 42 | in correcting errors. 
Finally, the raw file index location for each row is 43 | added so that any changes made to the "flag_rows" file can be re-integrated 44 | into new versions of the raw files. 45 | 46 | Once the "flag_rows" file is compiled, it is output as a .csv file. A version 47 | of the final 'chapnumfixes' file for each affected volume is copied into the 48 | same directory so that manual reviewers will have access to all necessary 49 | files in one location. 50 | 51 | Arguments 52 | -------------------------------------------------------------------------- 53 | raw_fix_pair (list) : List with the string filepaths for both the 54 | raw file and "chapnumfixes" file for a 55 | given volume. 56 | 57 | Returns 58 | -------------------------------------------------------------------------- 59 | N/A 60 | """ 61 | 62 | # load files & create dataframes 63 | 64 | rawfile = raw_fix_pair[0] 65 | fixfile = raw_fix_pair[1] 66 | volume = (os.path.basename(rawfile)) 67 | volume = volume.replace("_output_chapadjusted_rd2.csv", "") 68 | 69 | raw_df = pd.read_csv(rawfile, encoding='utf-8', low_memory=False) 70 | fix_df = pd.read_excel(fixfile) # read_excel no longer accepts an encoding argument 71 | 72 | # Identify raw file rows assigned to "flagged" chapters and compile a new 73 | # dataframe containing only these rows. 
74 | if (fix_df['flag']==True).any(): 75 | 76 | raw_df['chapter'] = raw_df['chapter'].replace(np.nan, '') 77 | raw_df['section'] = raw_df['section'].replace(np.nan, '') 78 | raw_df['text'] = raw_df['text'].replace(np.nan, '') 79 | raw_df['chapter_index'] = raw_df['chapter_index'].replace(np.nan, '') 80 | raw_df['flag'] = False 81 | 82 | for i in range(0, fix_df.shape[0]): 83 | idx = fix_df.iloc[i]["chapter_index"] 84 | if fix_df.iloc[i]['flag']: 85 | raw_df.loc[raw_df["chapter_index"]==idx, "flag"] = True 86 | 87 | flag_df = raw_df[raw_df['flag']==True].copy() 88 | 89 | # Add IA urls to flag_rows file 90 | flag_df["vol"] = flag_df["name"].str.split(pat = "_") 91 | flag_df["vol"] = flag_df["vol"].apply(lambda x: x[0]) 92 | flag_df['img_num'] = flag_df["name"].str.split(pat = "_") 93 | flag_df['img_num'] = flag_df['img_num'].apply(lambda x: x[1].replace(".jp2", "")) 94 | flag_df["jpg_url"] = "https://archive.org/download/" + flag_df["vol"] + "/" + flag_df["vol"] + "_jp2.zip/" + flag_df["vol"] + "_jp2%2F" + flag_df["name"] + "&ext=jpg" 95 | flag_df["pdf_url"] = "https://archive.org/download/" + flag_df["vol"] + "/" + flag_df["vol"] + ".pdf#page=" + flag_df['img_num'] 96 | 97 | 98 | flag_df = flag_df[['text', 'name', 'chapter', 'chapter_index', 'section', 'jpg_url', 'pdf_url']] 99 | 100 | 101 | # Output flag_rows file 102 | # Copy the chapnumfixes file to the same location 103 | outname = volume + "_flag_rows.csv" 104 | 105 | outdir = "./manual_fixes/" + volume 106 | if not os.path.exists(outdir): 107 | os.mkdir(outdir) 108 | 109 | fullname = os.path.join(outdir, outname) 110 | 111 | flag_df.to_csv(fullname, index_label="rawfile_index") 112 | shutil.copy2(fixfile, outdir) 113 | 114 | def main(): 115 | # Set directories for raw and "chapnumfixes" files 116 | raw_path = r"C:\Users\npbyers\Desktop\OTB\ChapNumFixes\chap_adjusted_raw_round2" 117 | fix_path = r"C:\Users\npbyers\Desktop\OTB\ChapNumFixes\chap_num_fixes_final" 118 | 119 | rawfolder = "./chap_adjusted_raw_round2/" 120 | 
fixfolder = "./chap_num_fixes_final/" 121 | 122 | # Create sorted filepath lists for both sets of files so raw/fix pairs match by volume 123 | raw_filelist = sorted((rawfolder + f) for f in os.listdir(raw_path) if f.endswith(".csv")) 124 | fix_filelist = sorted((fixfolder + f) for f in os.listdir(fix_path) if f.endswith(".xlsx")) 125 | 126 | # Create a list of pairs, each containing the path for a raw file 127 | # and "chapnumfixes" file for a given volume 128 | raw_fix_pairs = [] 129 | for i in range(0, len(raw_filelist)): 130 | raw_fix_pairs.append([raw_filelist[i], fix_filelist[i]]) 131 | 132 | # Run the 'create_manual_files' function in parallel to reduce compute time 133 | with joblib.parallel_backend(n_jobs=7,backend='loky'): 134 | joblib.Parallel(verbose=5)(joblib.delayed(create_manual_files)(pair) for pair in raw_fix_pairs) 135 | 136 | 137 | if __name__ == "__main__": 138 | main() -------------------------------------------------------------------------------- /code/split_cleanup/06_gen_final_agg.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Fri Aug 7 10:44:11 2020 4 | 5 | @summary: This script generates a final round of aggregate files from the raw 6 | files that resulted from the automatic and manual section-cleaning 7 | processes. In the Split Cleanup Jupyter notebook, this script corresponds 8 | to "Step 8. Generating Final Files / Remaining Error Appraisal" 9 | 10 | @author: Neil Byers 11 | 12 | Digital Research Services 13 | University Libraries 14 | UNC Chapel Hill 15 | """ 16 | 17 | 18 | import csv 19 | import pandas as pd 20 | import os 21 | import numpy as np 22 | import joblib 23 | csv.field_size_limit(600000) 24 | 25 | 26 | def generate_new(raw_file): 27 | """ 28 | This function generates new aggregate files to reflect changes made in the 29 | manual section error correction process. 
The final versions of these files 30 | contain recalculated chapter and section indices, the aggregated text of 31 | each section, and Internet Archive urls for each section's first page. 32 | 33 | Arguments 34 | -------------------------------------------------------------------------- 35 | raw_file (str) : The "raw" file path for a single volume 36 | 37 | Returns 38 | -------------------------------------------------------------------------- 39 | N/A 40 | """ 41 | 42 | raw_df = pd.read_csv(raw_file, encoding='utf-8', low_memory=False) 43 | 44 | raw_df['chapter'] = raw_df['chapter'].replace(np.nan, '') 45 | raw_df['section'] = raw_df['section'].replace(np.nan, '') 46 | raw_df['text'] = raw_df['text'].replace(np.nan, '') 47 | 48 | # reset chapter_index in raw files 49 | raw_df["chapter_index"] = (raw_df.chapter != raw_df.chapter.shift(1)).cumsum() 50 | raw_df['chapter_index'] = raw_df['chapter_index'].replace(np.nan, '') 51 | 52 | # reset section_index in raw files 53 | raw_df["section_index"] = (raw_df.section != raw_df.section.shift(1)).groupby(raw_df.chapter).cumsum() 54 | 55 | # Create a new aggregate dataframe with the jpeg image name on which 56 | # a given section begins included on each row (section) 57 | group_list = ['chapter', 'chapter_index', 'section', 'section_index'] 58 | agg_dict = {'text': ' '.join, 59 | 'name': 'first'} 60 | agg = raw_df[raw_df["text"]!=""].groupby(group_list, sort=False, as_index=False).agg(agg_dict) 61 | agg.rename(columns={"name":"first_jpeg"}, inplace=True) 62 | 63 | # Generate the Internet Archive jpeg/pdf urls for each section's start page 64 | # based on the page image file name. 
Each row (section) in the 64 | # aggregate file will thus be paired with its page image urls 65 | agg["vol"] = agg["first_jpeg"].str.split(pat = "_") 66 | agg["vol"] = agg["vol"].apply(lambda x: x[0]) 67 | agg['img_num'] = agg["first_jpeg"].str.split(pat = "_") 68 | agg['img_num'] = agg['img_num'].apply(lambda x: x[1].replace(".jp2", "")) 69 | agg["first_jpg_url"] = "https://archive.org/download/" + agg["vol"] + "/" + agg["vol"] + "_jp2.zip/" + agg["vol"] + "_jp2%2F" + agg["first_jpeg"] + "&ext=jpg" 70 | agg["pdf_url"] = "https://archive.org/download/" + agg["vol"] + "/" + agg["vol"] + ".pdf#page=" + agg['img_num'] 71 | 72 | 73 | 74 | 75 | # Remove extraneous columns from aggregate dataframe 76 | agg = agg.drop(columns=['first_jpeg', 'vol', 'img_num']) 77 | 78 | #output new raw and agg files to .csv 79 | raw_outname = raw_file.replace("_output.csv", "_output_final.csv") 80 | raw_outname = raw_outname.replace("/sec_clean/raw1/", "/sec_clean_final/raw/") 81 | agg_outname = raw_file.replace("_output.csv", "_aggregated_output_final.csv") 82 | agg_outname = agg_outname.replace("/sec_clean/raw1/", "/sec_clean_final/agg/") 83 | 84 | 85 | #output new raw/agg to file 86 | raw_df.to_csv(raw_outname, index=False, encoding="utf-8") 87 | agg.to_csv(agg_outname, index=False, encoding="utf-8") 88 | 89 | def main(): 90 | # Set directory path locations for raw files 91 | raw_path = r"C:\Users\npbyers\Desktop\OTB\SectNumFixes\sec_clean\raw1" 92 | rawfolder = "./sec_clean/raw1/" 93 | 94 | # Create a list of all raw files 95 | raw_filelist = [(rawfolder + f) for f in os.listdir(raw_path) if f.endswith(".csv")] 96 | 97 | # Create a new aggregate file using the 'generate_new' function. 98 | # This operation is run in parallel to reduce compute time. 
99 | with joblib.parallel_backend(n_jobs=7,backend='loky'): 100 | joblib.Parallel(verbose=5)(joblib.delayed(generate_new)(raw_file) for raw_file in raw_filelist) 101 | 102 | if __name__ == "__main__": 103 | main() -------------------------------------------------------------------------------- /code/split_cleanup/07_final_sec_appraisal.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Fri Aug 7 11:23:54 2020 4 | 5 | @summary: Creates two corpus-level report files on remaining section errors 6 | to aid in future error correction efforts. The first 7 | ('remaining_sec_errors.csv') consists of volume-level information about 8 | remaining section gaps. The second ('final_error_chap_rows.csv') contains 9 | section-level information for all chapters containing section numbering 10 | errors, as identified by section numbering 'gaps'. In the Split Cleanup 11 | Jupyter notebook, this script corresponds to "Step 8. Generating Final 12 | Files / Remaining Error Appraisal". 13 | 14 | 15 | @author: Neil Byers 16 | 17 | Digital Research Services 18 | University Libraries 19 | UNC Chapel Hill 20 | """ 21 | 22 | 23 | import csv 24 | import pandas as pd 25 | import os 26 | from string import punctuation 27 | import numpy as np 28 | import joblib 29 | csv.field_size_limit(600000) 30 | 31 | 32 | def error_check(raw_file): 33 | """ 34 | This function compiles information related to the section 'gaps' present in 35 | a single volume. This information (total sections, chapters containing errors, 36 | types of errors remaining, etc.) is then used to compile two corpus-level 37 | report files to aid in future rounds of manual review. One, the 'meta' section 38 | errors file, contains metadata related to the remaining errors (gaps) in each 39 | volume of the corpus. 
The second, 'final_error_chap_rows.csv' consists of 40 | rows for all sections of all chapters in the corpus which still contain 41 | sections preceded by 'gaps' after all of the previous cleanup steps. These 42 | files will be used in future rounds of manual and automatic review to complete 43 | the section cleanup process for the entire corpus. 44 | 45 | Arguments 46 | -------------------------------------------------------------------------- 47 | raw_file (str) : The "raw" file path for a single volume 48 | 49 | Returns 50 | -------------------------------------------------------------------------- 51 | report_row : A dictionary containing the title of the volume, 52 | the total number of sections, the total number of 53 | chapters, the number of remaining section errors, 54 | the number of chapters containing section errors, 55 | and a list of dictionaries, one for each section, 56 | for all sections in chapters that still contain 57 | sections with 'gaps'. 58 | """ 59 | 60 | # Read in raw file 61 | raw_df = pd.read_csv(raw_file, encoding='utf-8', low_memory=False) 62 | 63 | # eliminate np.nan from the raw dataframe 64 | raw_df['chapter'] = raw_df['chapter'].replace(np.nan, '') 65 | raw_df['section'] = raw_df['section'].replace(np.nan, '') 66 | raw_df['text'] = raw_df['text'].replace(np.nan, '') 67 | 68 | #reset chapter_index 69 | raw_df["chapter_index"] = (raw_df.chapter != raw_df.chapter.shift(1)).cumsum() 70 | raw_df['chapter_index'] = raw_df['chapter_index'].replace(np.nan, '') 71 | 72 | #Reset section index 73 | raw_df["section_index"] = (raw_df.section != raw_df.section.shift(1)).groupby(raw_df.chapter).cumsum() 74 | 75 | 76 | 77 | 78 | 79 | # Create a dataframe for the volume's sections, one per row 80 | sections = raw_df.loc[:, ['chapter', 'chapter_index', 'section', 'section_index']].drop_duplicates() 81 | # Extract the numeric values from the section headers and convert them to float 82 | sections['raw_num'] = sections['section'].apply(lambda x: 0 if x 
== "Chapter_Title" else (x.strip().split()[1].rstrip(punctuation) if " " in x.strip() and x.strip().split()[1].rstrip(punctuation).isnumeric() else np.nan)) 83 | sections.loc[sections["section"]=="Paratextual", "raw_num"] = 0 84 | sections['raw_num'] = sections['raw_num'].astype(float) 85 | 86 | # Create a groupby dataframe to group the sections by their respective chapters 87 | chapters = sections.groupby('chapter_index') 88 | 89 | # Generate the gap information for each section 90 | # This value indicates the numeric distance between a given section number 91 | # and that of the preceding section 92 | sections["gap"] = chapters.raw_num.diff(1) 93 | sections["gap"] = sections["gap"].replace(np.nan, 1) 94 | 95 | 96 | 97 | # Create a list of the unique gap values present in the volume 98 | gaps_remaining = sections['gap'].value_counts().keys().tolist() 99 | # Determine the frequencies of the above gap values 100 | gap_counts = sections['gap'].value_counts().tolist() 101 | 102 | # Create variables for frequencies of gaps of 2, gaps of 3, and gaps of any 103 | # other value with the exception of 1 104 | two_gaps_left = 0 105 | three_gaps_left = 0 106 | other_gaps_left = 0 107 | 108 | for i in range(0, len(gaps_remaining)): 109 | if gaps_remaining[i] == 2: 110 | two_gaps_left = gap_counts[i] 111 | elif gaps_remaining[i] == 3: 112 | three_gaps_left = gap_counts[i] 113 | elif gaps_remaining[i] != 1: 114 | other_gaps_left += gap_counts[i] 115 | 116 | 117 | # Calculate total sections in the volume, including those that are missing 118 | # "Other" gaps are excluded because these are often not actual gaps and 119 | # are likely to artificially inflate the section count variable. 120 | total_sections = sections.shape[0]+two_gaps_left+(2*three_gaps_left) 121 | total_chapters = len(chapters.groups.keys()) 122 | 123 | # Calculate 'errors remaining' to include missing chapters and 'other' errors, 124 | # as indicated by gaps with values other than 1, 2, or 3. 
125 | errors_remaining = two_gaps_left+(2*three_gaps_left)+other_gaps_left 126 | 127 | # Extract the volume title 128 | vol = os.path.basename(raw_file).replace("_data_cleaned_new.csv", "") 129 | 130 | 131 | 132 | # Create a dataframe that includes all sections for all chapters containing 133 | # sections with a gap value other than one, or zero in the case of Paratextuals 134 | # and Chapter_Titles 135 | error_chaps = [] 136 | total_error_chaps = 0 137 | for b in chapters.groups.keys(): 138 | chap_sects = chapters.get_group(b) 139 | for r in range(0,chap_sects.shape[0]): 140 | if (chap_sects.iloc[r]['gap']==0 and chap_sects.iloc[r]['section_index']!=1) or chap_sects.iloc[r]['gap']!=1: 141 | total_error_chaps += 1 142 | for s in range(0,chap_sects.shape[0]): 143 | error_chaps.append({"vol":vol, 144 | "ch_index":chap_sects.iloc[s]['chapter_index'], 145 | "ch_title":chap_sects.iloc[s]['chapter'], 146 | "sec_index":chap_sects.iloc[s]['section_index'], 147 | "sec_title":chap_sects.iloc[s]['section'], 148 | "gap":chap_sects.iloc[s]['gap']}) 149 | break 150 | 151 | 152 | # Compile the data below into a dictionary. This dictionary is then returned 153 | # by the function to be added as a row to a corpus-level report. 
Rows from 154 | # the 'error_chaps' list will be added to a corpus-level document containing 155 | # all sections of all chapters containing sections with remaining gaps 156 | report_row = {"vol":vol, 157 | "total_chapters":total_chapters, 158 | "total_sections":total_sections, 159 | "errors_remaining":errors_remaining, 160 | "error_chaps": total_error_chaps, 161 | "error_chaps_list": error_chaps} 162 | 163 | return report_row 164 | 165 | 166 | def main(): 167 | 168 | # Set raw file directory variables and create a list of all raw files 169 | raw_path = r"C:\Users\npbyers\Desktop\OTB\SectNumFixes\final\raw" 170 | rawfolder = "./final/raw/" 171 | raw_filelist = [(rawfolder + f) for f in os.listdir(raw_path) if f.endswith(".csv")] 172 | 173 | 174 | # Call the error_check function above, once for each volume, in parallel 175 | # to decrease compute time. 176 | with joblib.parallel_backend(n_jobs=7,backend='loky'): 177 | report_rows = joblib.Parallel(verbose=5)( 178 | joblib.delayed(error_check)(raw_file) for raw_file in raw_filelist) 179 | 180 | # Compile the .csv file with all sections from all chapters containing 181 | # sections with unusual gaps (gaps with values other than 1, or 0 in 182 | # certain cases) 183 | error_chap_master = [] 184 | for row in report_rows: 185 | for i in row['error_chaps_list']: 186 | error_chap_master.append(i) 187 | error_chap_df=pd.DataFrame(error_chap_master) 188 | error_chap_df.to_csv(r"C:\Users\npbyers\Desktop\OTB\SectNumFixes\final_error_chap_rows.csv", index=False) 189 | 190 | # Compile the .csv file with volume-level information about remaining errors 191 | # in the corpus as a whole. 
192 | report_df = pd.DataFrame(report_rows) 193 | meta_df = report_df.drop(columns='error_chaps_list') 194 | meta_df.to_csv(r"C:\Users\npbyers\Desktop\OTB\SectNumFixes\remaining_sec_errors.csv", index=False) 195 | 196 | 197 | if __name__ == "__main__": 198 | main() -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: oer-environment 2 | channels: 3 | - conda-forge 4 | - anaconda 5 | - defaults 6 | 7 | dependencies: 8 | - python 9 | - tesseract 10 | - pytesseract 11 | - pillow 12 | - pip 13 | - pip: 14 | - geopandas 15 | - internetarchive 16 | - matplotlib 17 | - nltk 18 | - pandas 19 | - pillow 20 | - pyspellchecker 21 | - requests 22 | -------------------------------------------------------------------------------- /examples/adjustment_recommendation/adjRec.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Tue Jul 23 16:38:46 2019 5 | 6 | @author: Lorin Bruckner 7 | 8 | Digital Research Services 9 | University Libraries 10 | UNC Chapel Hill 11 | """ 12 | 13 | import os, sys 14 | import pandas as pandas 15 | from random import sample 16 | import csv 17 | 18 | #get ocr functions 19 | sys.path.append(os.path.abspath("./")) 20 | from ocr_func import cutMarg, OCRtestImg, testList 21 | 22 | def adjRec(vol, dirpath, masterlist, margdata, n): 23 | 24 | """ 25 | 26 | Get the best image adjustments to use on a volume. 27 | 28 | vol (str) : The name for the volume to be tested. 
Should not include 29 | "_jp2" (so for 1879, it's "lawsresolutionso1879nort") 30 | 31 | dirpath (str) : The directory path for the folder where ALL volumes are located 32 | 33 | masterlist (str) : The direct file path for xmljpegmerge_official.csv 34 | 35 | margdata (str) : The direct file path for the csv with marginalia data 36 | 37 | n (int) : The sample size to use for testing 38 | 39 | """ 40 | 41 | #Merge csvs (use "merged" rather than "csv" to avoid shadowing the csv module) 42 | mastercsv = pandas.read_csv(masterlist) 43 | margcsv = pandas.read_csv(margdata) 44 | mastercsv["filename"] = mastercsv["filename"] + ".jp2" 45 | merged = mastercsv.merge(margcsv, left_on="filename", right_on="file") 46 | 47 | #Create a pool of image filenames for the volume and take a sample 48 | pool = [] 49 | csvf = merged[merged["filename"].str.startswith(vol)].set_index("filename") 50 | 51 | for row in csvf.itertuples(): 52 | pool.append(os.path.normpath(os.path.join(dirpath, vol + "_jp2/" + row.file))) 53 | 54 | pool = sample(pool, n) 55 | 56 | #Get images for files in sample, cut margins and make a test list 57 | imgs = [] 58 | results = [] 59 | 60 | for img in pool: 61 | 62 | #get image name 63 | name = os.path.split(img)[1] 64 | 65 | #get values for cutting margins 66 | rotate = csvf.loc[name]["angle"] 67 | left = csvf.loc[name]["bbox1"] 68 | up = csvf.loc[name]["bbox2"] 69 | right = csvf.loc[name]["bbox3"] 70 | lower = csvf.loc[name]["bbox4"] 71 | bkgcol = (csvf.loc[name]["backR"], csvf.loc[name]["backG"], csvf.loc[name]["backB"]) 72 | 73 | #cut the margins 74 | img = cutMarg(img = img, rotate = rotate, left = left, up = up, right = right, 75 | lower = lower, border = 200, bkgcol = bkgcol) 76 | 77 | #add the new image to the list 78 | imgs.append(img) 79 | 80 | #perform an OCR test on the new image and add the results to the list 81 | results.append(OCRtestImg(img)) 82 | 83 | #create a testList object with the images and results 84 | testSample = testList(imgs, results) 85 | 86 | #set up a dict of recommended adjustments and perform tests 87 | 
adjustments = { "volume": vol, "color": 1.0, "invert": False, 88 | "autocontrast": 0, "blur": False, "sharpen": False, 89 | "smooth": False, "xsmooth": False } 90 | 91 | #color test 92 | testRes = testSample.adjustTest("color", levels = [1,.75,.5,.25,0]) 93 | best = float(testRes["best_adjustment"].replace("color", "")) 94 | if best != 1.0: 95 | testSample = testSample.adjustSampleImgs(color = best) 96 | adjustments["color"] = best 97 | 98 | #invert test 99 | # testRes = testSample.adjustTest("invert") 100 | # if testRes["best_adjustment"] == "invertTrue": 101 | # testSample = testSample.adjustImg(invert = True) 102 | # adjustments["invert"] = True 103 | 104 | #autocontrast test 105 | testRes = testSample.adjustTest("autocontrast", levels = [0,2,4,6,8]) 106 | best = float(testRes["best_adjustment"].replace("autocontrast", "")) 107 | if best != 0.0: 108 | testSample = testSample.adjustSampleImgs(autocontrast = best) 109 | adjustments["autocontrast"] = best 110 | 111 | #blur test 112 | testRes = testSample.adjustTest("blur") 113 | if testRes["best_adjustment"] == "blurTrue": 114 | testSample = testSample.adjustSampleImgs(blur = True) 115 | adjustments["blur"] = True 116 | 117 | #sharpen test 118 | testRes = testSample.adjustTest("sharpen") 119 | if testRes["best_adjustment"] == "sharpenTrue": 120 | testSample = testSample.adjustSampleImgs(sharpen = True) 121 | adjustments["sharpen"] = True 122 | 123 | #smooth test 124 | testRes = testSample.adjustTest("smooth") 125 | if testRes["best_adjustment"] == "smoothTrue": 126 | testSample = testSample.adjustSampleImgs(smooth = True) 127 | adjustments["smooth"] = True 128 | 129 | #xsmooth test 130 | testRes = testSample.adjustTest("xsmooth") 131 | if testRes["best_adjustment"] == "xsmoothTrue": 132 | testSample = testSample.adjustSampleImgs(xsmooth = True) 133 | adjustments["xsmooth"] = True 134 | 135 | return adjustments 136 | 137 | 138 | ########### Set up locations ############################################### 139 | 140 | 
dirpath = "./images" 141 | masterlist = "sample_metadata.csv" 142 | margdata = "marginalia_metadata_demo.csv" 143 | 144 | 145 | ########### Recommend Adjustments for a Single Volume ###################### 146 | 147 | adjDEMO = adjRec("lawsresolutionso1891nort", dirpath, masterlist, margdata, 3) 148 | 149 | 150 | ########### Create a CSV with Adjustment Specs for all Volumes ############## 151 | 152 | savfile = "/Users/tuesday/Documents/_Projects/Research/OnTheBooks/output/adjustmentsDEMO.csv" 153 | 154 | for folder in os.listdir(dirpath): 155 | 156 | if folder == ".DS_Store": 157 | continue 158 | 159 | #get volume 160 | vol = folder.replace("_jp2", "") 161 | print("Testing " + vol + "...") 162 | 163 | #perform adjustment tests 164 | adjRow = adjRec(vol, dirpath, masterlist, margdata, 10) 165 | 166 | #record adjustments 167 | with open(savfile, "a") as f: 168 | w = csv.DictWriter(f, adjRow.keys()) 169 | if f.tell() == 0: 170 | w.writeheader() 171 | w.writerow(adjRow) 172 | else: 173 | w.writerow(adjRow) -------------------------------------------------------------------------------- /examples/adjustment_recommendation/adjusted.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/adjustment_recommendation/adjusted.png -------------------------------------------------------------------------------- /examples/adjustment_recommendation/example_image.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Jul 8 08:16:57 2019 4 | 5 | @author: mtjansen, npbyers 6 | """ 7 | 8 | from PIL import Image, ImageDraw 9 | import os 10 | import sys 11 | 12 | sys.path.append(os.path.abspath("./")) 13 | from cropfunctions import * 14 | 15 | def bandimage(orig, angle, band_dict, bheight, orig_bbox): 16 | bd_ct = len(band_dict["band_bboxes"]) 17 | back_height = 
bd_ct*(bheight+20)+100 18 | img = orig.copy().rotate(angle).crop(orig_bbox) 19 | bounds = Image.new(orig.mode, orig.size, "white") 20 | draw = ImageDraw.Draw(bounds) 21 | 22 | for row in band_dict["band_bboxes"]: 23 | band_bbox = list(row["raw"]) 24 | band_bbox[1] = row["index"]-50 25 | band_bbox[3] = row["index"] 26 | band = combine_bbox(orig_bbox,band_bbox) 27 | spot = tuple(list(band)[0:2]) 28 | bounds.paste(img.crop(band_bbox),spot) 29 | if list(band_bbox)[2]-list(band_bbox)[0]>10: 30 | draw.rectangle(band,outline="red",fill = None,width=2) 31 | 32 | return bounds 33 | 34 | def diffbands(diff, band_dict, cut, bheight): 35 | cdiff = diff.convert(mode="RGB") 36 | 37 | bd_ct = len(band_dict["band_bboxes"]) 38 | back_height = bd_ct*(bheight+20)+100 39 | bounds = Image.new(cdiff.mode, cdiff.size, "white") 40 | drawBands = ImageDraw.Draw(bounds) 41 | 42 | 43 | for row in band_dict["band_bboxes"]: 44 | band_bbox = list(row["raw"]) 45 | band_bbox[1] = row["index"]-50 46 | band_bbox[3] = row["index"] 47 | spot = tuple(list(band_bbox)[0:2]) 48 | bounds.paste(diff.crop(band_bbox),spot) 49 | if list(band_bbox)[2]-list(band_bbox)[0]>10: 50 | drawBands.rectangle(band_bbox,outline="#fc8003", fill = None,width=2) 51 | 52 | drawCut = ImageDraw.Draw(bounds) 53 | drawCut.line((cut,0, cut, bounds.size[1]),fill ="#0f03fc",width = 7) 54 | 55 | return bounds 56 | 57 | def bandsdisplay(bandimages): 58 | back_width = (bandimages[0].size[0]+bandimages[1].size[0]+bandimages[2].size[0]+200) 59 | back_height = (max([bandimages[0].size[1], bandimages[1].size[1], bandimages[2].size[1]])+100) 60 | back = Image.new(bandimages[0].mode, (back_width,back_height), "white") 61 | 62 | back.paste(bandimages[0],(50,50)) 63 | back.paste(bandimages[1],(bandimages[0].size[0]+100,50)) 64 | back.paste(bandimages[2],(bandimages[0].size[0]+bandimages[1].size[0]+150,50)) 65 | 66 | rs = back.resize((int(back.size[0]/5),int(back.size[1]/5))) 67 | 68 | 69 | return rs 70 | 71 | 72 | def comparison1(orig, diff, 
angle, orig_bbox): 73 | img = orig.copy().convert(mode="RGBA") 74 | box=Image.new('RGBA', (img.size[0],img.size[1])) 75 | d = ImageDraw.Draw(box) 76 | d.rectangle(orig_bbox,outline="red",fill = None,width=5) 77 | w=box.rotate(-angle) 78 | 79 | superimpose = Image.new('RGBA', (img.size[0],img.size[1])) 80 | superimpose.paste(img, (0,0)) 81 | superimpose.paste(w, (0,0), mask=w) 82 | 83 | back_width = (img.size[0]+diff.size[0]+150) 84 | back_height = (img.size[1]+100) 85 | back = Image.new(img.mode, (back_width,back_height), "white") 86 | draw = ImageDraw.Draw(img) 87 | 88 | back.paste(superimpose,(50,50)) 89 | back.paste(diff,(superimpose.size[0]+100,orig_bbox[1]+50)) 90 | 91 | rs = back.resize((int(back.size[0]/5),int(back.size[1]/5))) 92 | return rs 93 | 94 | def comparison2(band, final): 95 | back_width = (band.size[0]+final.size[0]+150) 96 | back_height = (band.size[1]+100) 97 | back = Image.new(final.mode, (back_width,back_height), "white") 98 | 99 | back.paste(band,(50,50)) 100 | back.paste(final,(band.size[0]+100,50)) 101 | 102 | rs = back.resize((int(back.size[0]/5),int(back.size[1]/5))) 103 | return rs 104 | 105 | def origdisplay(orig1, orig2, orig3): 106 | back_width = (orig1.size[0]+orig2.size[0]+orig3.size[0]+200) 107 | back_height = (max([orig1.size[1], orig2.size[1], orig3.size[1]])+100) 108 | back = Image.new(orig1.mode, (back_width,back_height), "white") 109 | 110 | back.paste(orig1,(50,50)) 111 | back.paste(orig2,(orig1.size[0]+100,50)) 112 | back.paste(orig3,(orig1.size[0]+orig2.size[0]+150,50)) 113 | 114 | rs = back.resize((int(back.size[0]/5),int(back.size[1]/5))) 115 | return rs 116 | 117 | def diffdisplay(diff1, diff2, diff3): 118 | img1 = diff1.copy().convert(mode="RGBA") 119 | img2 = diff2.copy().convert(mode="RGBA") 120 | img3 = diff3.copy().convert(mode="RGBA") 121 | back_width = (img1.size[0]+img2.size[0]+img3.size[0]+200) 122 | back_height = (max([img1.size[1], img2.size[1], img3.size[1]])+100) 123 | back = Image.new(img1.mode, 
(back_width,back_height), "white") 124 | 125 | back.paste(img1,(50,50)) 126 | back.paste(img2,(img1.size[0]+100,50)) 127 | back.paste(img3,(img1.size[0]+img2.size[0]+150,50)) 128 | 129 | rs = back.resize((int(back.size[0]/5),int(back.size[1]/5))) 130 | return rs 131 | 132 | def finalsdisplay(finalimages): 133 | #IN USE 134 | back_width = (finalimages[0].size[0]+finalimages[1].size[0]+finalimages[2].size[0]+200) 135 | back_height = (max([finalimages[0].size[1], finalimages[1].size[1], finalimages[2].size[1]])+100) 136 | back = Image.new(finalimages[0].mode, (back_width,back_height), "white") 137 | 138 | back.paste(finalimages[0],(50,50)) 139 | back.paste(finalimages[1],(finalimages[0].size[0]+100,50)) 140 | back.paste(finalimages[2],(finalimages[0].size[0]+finalimages[1].size[0]+150,50)) 141 | 142 | rs = back.resize((int(back.size[0]/5),int(back.size[1]/5))) 143 | return rs 144 | -------------------------------------------------------------------------------- /examples/adjustment_recommendation/geonames.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # -*- coding: utf-8 -*- 3 | """ 4 | Created on Fri May 31 16:46:58 2019 5 | 6 | @author: Lorin Bruckner 7 | 8 | Digital Research Services 9 | University Libraries 10 | UNC Chapel Hill 11 | """ 12 | 13 | import pandas 14 | from nltk import word_tokenize 15 | 16 | #Read in the tab delimited file from http://download.geonames.org/export/dump/US.zip 17 | #File was downloaded 5/31/19, 4:41 PM 18 | gn = pandas.read_csv("/Users/tuesday/Documents/_Projects/Research/OnTheBooks/US/US.txt", sep ="\t", header = None) 19 | 20 | #Filter records for North Carolina 21 | ncgn = gn[gn.loc[:,10] == "NC"] 22 | 23 | #Dump all geonames into a single string 24 | geonames = "" 25 | for index,row in ncgn.iterrows(): 26 | if type(row[2]) is str: 27 | geonames = geonames + " " + row[2] 28 | 29 | #Tokenize geonames. Remove punctuation, duplicates and single letters. 
Make lowercase. 30 | geotokens = word_tokenize(geonames) 31 | geotokens = [token for token in geotokens if token.isalpha()] 32 | geotokens = list(dict.fromkeys(geotokens)) 33 | geotokens = [token for token in geotokens if len(token) > 1] 34 | geotokens = [token.lower() for token in geotokens] 35 | 36 | #Create text file to add to Spell Checker 37 | with open("/Users/tuesday/Documents/_Projects/Research/OnTheBooks/geonames.txt", "w") as file: 38 | for token in geotokens: 39 | file.write(token + " ") -------------------------------------------------------------------------------- /examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0272.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0272.jpg -------------------------------------------------------------------------------- /examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0374.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0374.jpg -------------------------------------------------------------------------------- /examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0542.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0542.jpg 
-------------------------------------------------------------------------------- /examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0606.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0606.jpg -------------------------------------------------------------------------------- /examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0771.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0771.jpg -------------------------------------------------------------------------------- /examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0944.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0944.jpg -------------------------------------------------------------------------------- /examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1114.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1114.jpg -------------------------------------------------------------------------------- 
/examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1210.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1210.jpg -------------------------------------------------------------------------------- /examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1373.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1373.jpg -------------------------------------------------------------------------------- /examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1494.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/adjustment_recommendation/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1494.jpg -------------------------------------------------------------------------------- /examples/adjustment_recommendation/marginalia_metadata_demo.csv: -------------------------------------------------------------------------------- 1 | file,angle,side,cut,backR,backG,backB,bbox1,bbox2,bbox3,bbox4 2 | lawsresolutionso1891nort_0272.jpg,0,left,350,229,212,185,427,337,1915,2960 3 | lawsresolutionso1891nort_0374.jpg,0,left,350,228,212,187,446,352,1927,2964 4 | lawsresolutionso1891nort_0542.jpg,0,left,350,228,212,186,472,337,1957,2954 5 | lawsresolutionso1891nort_0606.jpg,0,left,353,227,211,184,459,318,1944,3225 6 | 
lawsresolutionso1891nort_0771.jpg,0,right,1484,214,198,174,46,345,1530,2510 7 | lawsresolutionso1891nort_0944.jpg,0,left,350,232,215,192,488,326,1974,2945 8 | lawsresolutionso1891nort_1114.jpg,0,left,350,231,215,193,504,327,1989,2947 9 | lawsresolutionso1891nort_1210.jpg,0,left,350,232,217,197,490,296,1976,2909 10 | lawsresolutionso1891nort_1373.jpg,0,right,1478,216,201,178,30,316,1508,2958 11 | lawsresolutionso1891nort_1494.jpg,-0.25,left,358,232,216,195,483,390,1965,3045 12 | -------------------------------------------------------------------------------- /examples/adjustment_recommendation/sample_metadata.csv: -------------------------------------------------------------------------------- 1 | filename,leafNum,handSide,page,sectiontype,sectiontitle,fileUrl 2 | lawsresolutionso1891nort_0272,272,LEFT,226,public laws,Public Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0272.jpg 3 | lawsresolutionso1891nort_0374,374,LEFT,328,public laws,Public Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0374.jpg 4 | lawsresolutionso1891nort_0542,542,LEFT,496,public laws,Public Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0542.jpg 5 | lawsresolutionso1891nort_0606,606,LEFT,558,public laws,Public Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0606.jpg 6 | lawsresolutionso1891nort_0771,771,RIGHT,723,private laws,Private Laws of the State of North Carolina Session 
1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0771.jpg 7 | lawsresolutionso1891nort_0944,944,LEFT,896,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0944.jpg 8 | lawsresolutionso1891nort_1114,1114,LEFT,1066,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_1114.jpg 9 | lawsresolutionso1891nort_1210,1210,LEFT,1162,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_1210.jpg 10 | lawsresolutionso1891nort_1373,1373,RIGHT,1325,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_1373.jpg 11 | lawsresolutionso1891nort_1494,1494,LEFT,1446,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_1494.jpg 12 | -------------------------------------------------------------------------------- /examples/adjustment_recommendation/unadjusted.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/adjustment_recommendation/unadjusted.png -------------------------------------------------------------------------------- 
/examples/marginalia_determination/example_image.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Jul 8 08:16:57 2019 4 | 5 | @author: mtjansen, npbyers 6 | """ 7 | 8 | from PIL import Image, ImageDraw 9 | import os 10 | import sys 11 | 12 | sys.path.append(os.path.abspath("./")) 13 | from cropfunctions import * 14 | 15 | def bandimage(orig, angle, band_dict, bheight, orig_bbox): 16 | bd_ct = len(band_dict["band_bboxes"]) 17 | back_height = bd_ct*(bheight+20)+100 18 | img = orig.copy().rotate(angle).crop(orig_bbox) 19 | bounds = Image.new(orig.mode, orig.size, "white") 20 | draw = ImageDraw.Draw(bounds) 21 | 22 | for row in band_dict["band_bboxes"]: 23 | band_bbox = list(row["raw"]) 24 | band_bbox[1] = row["index"]-50 25 | band_bbox[3] = row["index"] 26 | band = combine_bbox(orig_bbox,band_bbox) 27 | spot = tuple(list(band)[0:2]) 28 | bounds.paste(img.crop(band_bbox),spot) 29 | if list(band_bbox)[2]-list(band_bbox)[0]>10: 30 | draw.rectangle(band,outline="red",fill = None,width=2) 31 | 32 | return bounds 33 | 34 | def diffbands(diff, band_dict, cut, bheight): 35 | cdiff = diff.convert(mode="RGB") 36 | 37 | bd_ct = len(band_dict["band_bboxes"]) 38 | back_height = bd_ct*(bheight+20)+100 39 | bounds = Image.new(cdiff.mode, cdiff.size, "white") 40 | drawBands = ImageDraw.Draw(bounds) 41 | 42 | 43 | for row in band_dict["band_bboxes"]: 44 | band_bbox = list(row["raw"]) 45 | band_bbox[1] = row["index"]-50 46 | band_bbox[3] = row["index"] 47 | spot = tuple(list(band_bbox)[0:2]) 48 | bounds.paste(diff.crop(band_bbox),spot) 49 | if list(band_bbox)[2]-list(band_bbox)[0]>10: 50 | drawBands.rectangle(band_bbox,outline="#fc8003", fill = None,width=2) 51 | 52 | drawCut = ImageDraw.Draw(bounds) 53 | drawCut.line((cut,0, cut, bounds.size[1]),fill ="#0f03fc",width = 7) 54 | 55 | return bounds 56 | 57 | def bandsdisplay(bandimages): 58 | back_width = 
(bandimages[0].size[0]+bandimages[1].size[0]+bandimages[2].size[0]+200) 59 | back_height = (max([bandimages[0].size[1], bandimages[1].size[1], bandimages[2].size[1]])+100) 60 | back = Image.new(bandimages[0].mode, (back_width,back_height), "white") 61 | 62 | back.paste(bandimages[0],(50,50)) 63 | back.paste(bandimages[1],(bandimages[0].size[0]+100,50)) 64 | back.paste(bandimages[2],(bandimages[0].size[0]+bandimages[1].size[0]+150,50)) 65 | 66 | rs = back.resize((int(back.size[0]/5),int(back.size[1]/5))) 67 | 68 | 69 | return rs 70 | 71 | 72 | def comparison1(orig, diff, angle, orig_bbox): 73 | img = orig.copy().convert(mode="RGBA") 74 | box=Image.new('RGBA', (img.size[0],img.size[1])) 75 | d = ImageDraw.Draw(box) 76 | d.rectangle(orig_bbox,outline="red",fill = None,width=5) 77 | w=box.rotate(-angle) 78 | 79 | superimpose = Image.new('RGBA', (img.size[0],img.size[1])) 80 | superimpose.paste(img, (0,0)) 81 | superimpose.paste(w, (0,0), mask=w) 82 | 83 | back_width = (img.size[0]+diff.size[0]+150) 84 | back_height = (img.size[1]+100) 85 | back = Image.new(img.mode, (back_width,back_height), "white") 86 | draw = ImageDraw.Draw(img) 87 | 88 | back.paste(superimpose,(50,50)) 89 | back.paste(diff,(superimpose.size[0]+100,orig_bbox[1]+50)) 90 | 91 | rs = back.resize((int(back.size[0]/5),int(back.size[1]/5))) 92 | return rs 93 | 94 | def comparison2(band, final): 95 | back_width = (band.size[0]+final.size[0]+150) 96 | back_height = (band.size[1]+100) 97 | back = Image.new(final.mode, (back_width,back_height), "white") 98 | 99 | back.paste(band,(50,50)) 100 | back.paste(final,(band.size[0]+100,50)) 101 | 102 | rs = back.resize((int(back.size[0]/5),int(back.size[1]/5))) 103 | return rs 104 | 105 | def origdisplay(orig1, orig2, orig3): 106 | back_width = (orig1.size[0]+orig2.size[0]+orig3.size[0]+200) 107 | back_height = (max([orig1.size[1], orig2.size[1], orig3.size[1]])+100) 108 | back = Image.new(orig1.mode, (back_width,back_height), "white") 109 | 110 | 
back.paste(orig1,(50,50)) 111 | back.paste(orig2,(orig1.size[0]+100,50)) 112 | back.paste(orig3,(orig1.size[0]+orig2.size[0]+150,50)) 113 | 114 | rs = back.resize((int(back.size[0]/5),int(back.size[1]/5))) 115 | return rs 116 | 117 | def diffdisplay(diff1, diff2, diff3): 118 | img1 = diff1.copy().convert(mode="RGBA") 119 | img2 = diff2.copy().convert(mode="RGBA") 120 | img3 = diff3.copy().convert(mode="RGBA") 121 | back_width = (img1.size[0]+img2.size[0]+img3.size[0]+200) 122 | back_height = (max([img1.size[1], img2.size[1], img3.size[1]])+100) 123 | back = Image.new(img1.mode, (back_width,back_height), "white") 124 | 125 | back.paste(img1,(50,50)) 126 | back.paste(img2,(img1.size[0]+100,50)) 127 | back.paste(img3,(img1.size[0]+img2.size[0]+150,50)) 128 | 129 | rs = back.resize((int(back.size[0]/5),int(back.size[1]/5))) 130 | return rs 131 | 132 | def finalsdisplay(finalimages): 133 | #IN USE 134 | back_width = (finalimages[0].size[0]+finalimages[1].size[0]+finalimages[2].size[0]+200) 135 | back_height = (max([finalimages[0].size[1], finalimages[1].size[1], finalimages[2].size[1]])+100) 136 | back = Image.new(finalimages[0].mode, (back_width,back_height), "white") 137 | 138 | back.paste(finalimages[0],(50,50)) 139 | back.paste(finalimages[1],(finalimages[0].size[0]+100,50)) 140 | back.paste(finalimages[2],(finalimages[0].size[0]+finalimages[1].size[0]+150,50)) 141 | 142 | rs = back.resize((int(back.size[0]/5),int(back.size[1]/5))) 143 | return rs 144 | -------------------------------------------------------------------------------- /examples/marginalia_determination/lawsresolutionso1891nort_jp2/lawsresolutionso1891nort_0697.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/marginalia_determination/lawsresolutionso1891nort_jp2/lawsresolutionso1891nort_0697.jpg 
-------------------------------------------------------------------------------- /examples/marginalia_determination/lawsresolutionso1891nort_jp2/lawsresolutionso1891nort_0715.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/marginalia_determination/lawsresolutionso1891nort_jp2/lawsresolutionso1891nort_0715.jpg -------------------------------------------------------------------------------- /examples/marginalia_determination/lawsresolutionso1891nort_jp2/lawsresolutionso1891nort_0716.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/marginalia_determination/lawsresolutionso1891nort_jp2/lawsresolutionso1891nort_0716.jpg -------------------------------------------------------------------------------- /examples/marginalia_determination/output/marginalia_metadata_demo.csv: -------------------------------------------------------------------------------- 1 | file,angle,side,cut,backR,backG,backB,bbox1,bbox2,bbox3,bbox4 2 | lawsresolutionso1891nort_0697.jp2,0.0,right,1470,214,197,169,52,316,1522,1570 3 | lawsresolutionso1891nort_0697.jp2,-0.75,right,1485,206,185,154,50,427,1535,3153 4 | lawsresolutionso1891nort_0697.jp2,0.0,left,350,225,205,176,494,352,1980,2972 5 | -------------------------------------------------------------------------------- /examples/marginalia_determination/sample_metadata.csv: -------------------------------------------------------------------------------- 1 | filename,leafNum,handSide,page,sectiontype,sectiontitle,fileUrl 2 | lawsresolutionso1891nort_0697,697,RIGHT,649,public laws,Public Laws of the State of North Carolina Session 
1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0697.jpg 3 | lawsresolutionso1891nort_0715,715,RIGHT,667,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0715.jpg 4 | lawsresolutionso1891nort_0716,716,LEFT,668,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0716.jpg 5 | -------------------------------------------------------------------------------- /examples/ocr/adjustments_demo.csv: -------------------------------------------------------------------------------- 1 | volume,color,invert,autocontrast,blur,sharpen,smooth,xsmooth 2 | lawsresolutionso1891nort,0.75,False,4,False,False,False,False 3 | -------------------------------------------------------------------------------- /examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0272.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0272.jpg -------------------------------------------------------------------------------- /examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0374.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0374.jpg -------------------------------------------------------------------------------- 
/examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0542.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0542.jpg -------------------------------------------------------------------------------- /examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0606.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0606.jpg -------------------------------------------------------------------------------- /examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0771.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0771.jpg -------------------------------------------------------------------------------- /examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0944.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_0944.jpg -------------------------------------------------------------------------------- /examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1114.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1114.jpg -------------------------------------------------------------------------------- /examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1210.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1210.jpg -------------------------------------------------------------------------------- /examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1373.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1373.jpg -------------------------------------------------------------------------------- /examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1494.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/images/lawsresolutionso1891nort_jpg/lawsresolutionso1891nort_1494.jpg -------------------------------------------------------------------------------- /examples/ocr/marginalia_metadata_demo.csv: -------------------------------------------------------------------------------- 1 | file,angle,side,cut,backR,backG,backB,bbox1,bbox2,bbox3,bbox4 2 | lawsresolutionso1891nort_0272.jpg,0,left,350,229,212,185,427,337,1915,2960 3 | lawsresolutionso1891nort_0374.jpg,0,left,350,228,212,187,446,352,1927,2964 4 | 
lawsresolutionso1891nort_0542.jpg,0,left,350,228,212,186,472,337,1957,2954 5 | lawsresolutionso1891nort_0606.jpg,0,left,353,227,211,184,459,318,1944,3225 6 | lawsresolutionso1891nort_0771.jpg,0,right,1484,214,198,174,46,345,1530,2510 7 | lawsresolutionso1891nort_0944.jpg,0,left,350,232,215,192,488,326,1974,2945 8 | lawsresolutionso1891nort_1114.jpg,0,left,350,231,215,193,504,327,1989,2947 9 | lawsresolutionso1891nort_1210.jpg,0,left,350,232,217,197,490,296,1976,2909 10 | lawsresolutionso1891nort_1373.jpg,0,right,1478,216,201,178,30,316,1508,2958 11 | lawsresolutionso1891nort_1494.jpg,-0.25,left,358,232,216,195,483,390,1965,3045 12 | -------------------------------------------------------------------------------- /examples/ocr/output/lawsresolutionso1891nort/lawsresolutionso1891nort_adjustments.txt: -------------------------------------------------------------------------------- 1 | IMAGE ADJUSTMENTS 2 | 3 | color: 0.75 4 | autocontrast: 4 5 | blur: False 6 | sharpen: False 7 | smooth: False 8 | xsmooth: False 9 | -------------------------------------------------------------------------------- /examples/ocr/output/lawsresolutionso1891nort/lawsresolutionso1891nort_private laws.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/output/lawsresolutionso1891nort/lawsresolutionso1891nort_private laws.txt -------------------------------------------------------------------------------- /examples/ocr/output/lawsresolutionso1891nort/lawsresolutionso1891nort_private laws_data.tsv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/output/lawsresolutionso1891nort/lawsresolutionso1891nort_private laws_data.tsv 
-------------------------------------------------------------------------------- /examples/ocr/output/lawsresolutionso1891nort/lawsresolutionso1891nort_public laws.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/output/lawsresolutionso1891nort/lawsresolutionso1891nort_public laws.txt -------------------------------------------------------------------------------- /examples/ocr/output/lawsresolutionso1891nort/lawsresolutionso1891nort_public laws_data.tsv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/ocr/output/lawsresolutionso1891nort/lawsresolutionso1891nort_public laws_data.tsv -------------------------------------------------------------------------------- /examples/ocr/xmljpegmerge_demo.csv: -------------------------------------------------------------------------------- 1 | filename,leafNum,handSide,page,sectiontype,sectiontitle,fileUrl 2 | lawsresolutionso1891nort_0272,272,LEFT,226,public laws,Public Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0272.jpg 3 | lawsresolutionso1891nort_0374,374,LEFT,328,public laws,Public Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0374.jpg 4 | lawsresolutionso1891nort_0542,542,LEFT,496,public laws,Public Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0542.jpg 5 | 
lawsresolutionso1891nort_0606,606,LEFT,558,public laws,Public Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0606.jpg 6 | lawsresolutionso1891nort_0771,771,RIGHT,723,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0771.jpg 7 | lawsresolutionso1891nort_0944,944,LEFT,896,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_0944.jpg 8 | lawsresolutionso1891nort_1114,1114,LEFT,1066,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_1114.jpg 9 | lawsresolutionso1891nort_1210,1210,LEFT,1162,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_1210.jpg 10 | lawsresolutionso1891nort_1373,1373,RIGHT,1325,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_1373.jpg 11 | lawsresolutionso1891nort_1494,1494,LEFT,1446,private laws,Private Laws of the State of North Carolina Session 1891,https://archive.org/download/lawsresolutionso1891nort/lawsresolutionso1891nort_jp2.zip/lawsresolutionso1891nort_jp2%2Flawsresolutionso1891nort_1494.jpg 12 | -------------------------------------------------------------------------------- 
/examples/split_cleanup/1899_public_chapnumflags_step4.csv: -------------------------------------------------------------------------------- 1 | chap_title,chapter_index,raw_num,corrected_num,correction_made,flag,gap 2 | CHAPTER 1.,2,1,1,FALSE,TRUE, 3 | CHaprer 1—2.,3,,,FALSE,TRUE, 4 | CHAPTER 2.,4,2,2,FALSE,TRUE, 5 | CHAPTER 3.,5,3,3,FALSE,TRUE, 6 | CHAPTER 4.,6,4,4,FALSE,TRUE, 7 | CHAPTER 5.,7,5,5,FALSE,TRUE, 8 | CHAPTER 6.,8,6,6,FALSE,TRUE, 9 | CHAPTER 6&8.,9,,,FALSE,TRUE, 10 | CHAPTER 9.,10,9,9,FALSE,TRUE, 11 | CHAPTER 10.,11,10,10,FALSE,TRUE, 12 | CHAPTER 14.,12,14,14,FALSE,TRUE,3 13 | CHAPTER 15.,13,15,15,FALSE,TRUE, 14 | CHAPTER 16.,14,16,16,FALSE,TRUE, 15 | CHAPTER 17.,15,17,17,FALSE,FALSE, 16 | CHAPTER 18.,16,18,18,FALSE,FALSE, 17 | CHAPTER 19.,17,19,19,FALSE,FALSE, 18 | CHAPTER 20.,18,20,20,FALSE,FALSE, 19 | CHAPTER 21.,19,21,21,FALSE,FALSE, 20 | CHAPTER 22.,20,22,22,FALSE,FALSE, 21 | CHAPTER 23.,21,23,23,FALSE,FALSE, 22 | CHAPTER 24.,22,24,24,FALSE,FALSE, 23 | CHAPTER 25.,23,25,25,FALSE,FALSE, 24 | CHAPTER 26.,24,26,26,FALSE,FALSE, 25 | CHAPTER 27.,25,27,27,FALSE,FALSE, 26 | CHAPTER 28.,26,28,28,FALSE,FALSE, 27 | CHAPTER 29.,27,29,29,FALSE,FALSE, 28 | CHAPTER 30.,28,30,30,FALSE,FALSE, 29 | CHAPTER 31.,29,31,31,FALSE,FALSE, 30 | CHAPTER 382.,30,382,32,TRUE,FALSE, 31 | CHAPTER 33.,31,33,33,FALSE,FALSE, 32 | CHAPTER 34.,32,34,34,FALSE,FALSE, 33 | CHAPTER 35.,33,35,35,FALSE,FALSE, 34 | -------------------------------------------------------------------------------- /examples/split_cleanup/1899_public_chapnumflags_step5.csv: -------------------------------------------------------------------------------- 1 | chap_title,raw_num,chapter_index,corrected_num,correction_made,flag 2 | CHAPTER 1.0,1,2,1,FALSE,TRUE 3 | CHaprer 1—2.,,3,,FALSE,TRUE 4 | CHAPTER 2.0,2,4,2,FALSE,TRUE 5 | CHAPTER 3.0,3,5,3,FALSE,TRUE 6 | CHAPTER 4.0,4,6,4,FALSE,TRUE 7 | CHAPTER 5.0,5,7,5,FALSE,TRUE 8 | CHAPTER 6.0,6,8,6,FALSE,TRUE 9 | CHAPTER 6&8.,,9,,FALSE,TRUE 10 | CHAPTER 
9.0,9,10,9,FALSE,TRUE 11 | CHAPTER 10.0,10,11,10,FALSE,TRUE 12 | CHAPTER 11.0,11,12,11,FALSE,TRUE 13 | CHAPTER 13.0,13,13,13,FALSE,TRUE 14 | CHAPTER 14.0,14,14,14,FALSE,TRUE 15 | CHAPTER 15.0,15,15,15,FALSE,TRUE 16 | CHAPTER 16.0,16,16,16,FALSE,FALSE 17 | CHAPTER 17.0,17,17,17,FALSE,FALSE 18 | CHAPTER 18.0,18,18,18,FALSE,FALSE 19 | CHAPTER 19.0,19,19,19,FALSE,FALSE 20 | CHAPTER 20.0,20,20,20,FALSE,FALSE 21 | CHAPTER 21.0,21,21,21,FALSE,FALSE 22 | CHAPTER 22.0,22,22,22,FALSE,FALSE 23 | CHAPTER 23.0,23,23,23,FALSE,FALSE 24 | CHAPTER 24.0,24,24,24,FALSE,FALSE 25 | CHAPTER 25.0,25,25,25,FALSE,FALSE 26 | CHAPTER 26.0,26,26,26,FALSE,FALSE 27 | CHAPTER 27.0,27,27,27,FALSE,FALSE 28 | CHAPTER 28.0,28,28,28,FALSE,FALSE 29 | CHAPTER 29.0,29,29,29,FALSE,FALSE 30 | CHAPTER 30.0,30,30,30,FALSE,FALSE 31 | CHAPTER 31.0,31,31,31,FALSE,FALSE 32 | CHAPTER 32.0,32,32,32,FALSE,FALSE 33 | CHAPTER 33.0,33,33,33,FALSE,FALSE 34 | CHAPTER 34.0,34,34,34,FALSE,FALSE 35 | CHAPTER 35.0,35,35,35,FALSE,FALSE 36 | -------------------------------------------------------------------------------- /examples/split_cleanup/1899_public_weird_chaps_example.csv: -------------------------------------------------------------------------------- 1 | vol,ch_index,ch_title,sec_index,sec_title,gap 2 | publiclawsresolu1899nort_public laws,18,CHAPTER 17.0,1,Chapter_Title, 3 | publiclawsresolu1899nort_public laws,18,CHAPTER 17.0,2,Section 1.,1 4 | publiclawsresolu1899nort_public laws,18,CHAPTER 17.0,3,SkEc. 2.,1 5 | publiclawsresolu1899nort_public laws,18,CHAPTER 17.0,4,Sec. 2.,0 6 | publiclawsresolu1899nort_public laws,18,CHAPTER 17.0,5,Src. 3.,1 7 | publiclawsresolu1899nort_public laws,18,CHAPTER 17.0,6,Src. 4.,1 8 | publiclawsresolu1899nort_public laws,18,CHAPTER 17.0,7,Suc. 5.,1 9 | publiclawsresolu1899nort_public laws,18,CHAPTER 17.0,8,Src. 6.,1 10 | publiclawsresolu1899nort_public laws,18,CHAPTER 17.0,9,Suc. 7.,1 11 | publiclawsresolu1899nort_public laws,18,CHAPTER 17.0,10,Suc. 
8.,1 12 | -------------------------------------------------------------------------------- /examples/split_cleanup/chap_num_manual.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/examples/split_cleanup/chap_num_manual.png -------------------------------------------------------------------------------- /examples/split_cleanup/step4_fixlog.csv: -------------------------------------------------------------------------------- 1 | Volume,Reviewer,Chapter(s),Notes,Affected image jpg url,Transcription required,transcription_index,transcription_ID,transcription_order,transcription_chapter,transcription_section,transcription_text 2 | publiclawsresolu1899nort_public laws_data,Neil,1,"Page header misread as chapter title. Last line of Chapter 1, Sec. 66 at top of page was left out by OCR. OCR resumes at sec. 67, but sections 67-70 have been assigned in the raw file to the misread chapter title. False chapter title removed, Chapter 1.0 title extended through end of chapter 1. ""Chapter_Title"" value replaced with Sec 66. Value in section column for affected rows.",https://archive.org/download/publiclawsresolu1899nort/publiclawsresolu1899nort_jp2.zip/publiclawsresolu1899nort_jp2%2Fpubliclawsresolu1899nort_0096.jp2&ext=jpg,yes,12034,1,1,CHAPTER 1.0,SEC. 66.,"his right mind, or such time as he may be considered harmless and incurable." 3 | publiclawsresolu1899nort_public laws_data,Neil,6,"OCR missed bits from chapter 6, sections 4 and 5.",https://archive.org/download/publiclawsresolu1899nort/publiclawsresolu1899nort_jp2.zip/publiclawsresolu1899nort_jp2%2Fpubliclawsresolu1899nort_0102.jp2&ext=jpg,yes,14494,2,1,CHAPTER 6.0,Sec. 4.,Sec. 4. 
that this act shall apply to the election of 4 | publiclawsresolu1899nort_public laws_data,Neil,6,"OCR missed bits from chapter 6, sections 4 and 5.",https://archive.org/download/publiclawsresolu1899nort/publiclawsresolu1899nort_jp2.zip/publiclawsresolu1899nort_jp2%2Fpubliclawsresolu1899nort_0102.jp2&ext=jpg,yes,14513,3,1,CHAPTER 6.0,Sec. 4.,act are hereby repealed. 5 | publiclawsresolu1899nort_public laws_data,Neil,6,"OCR missed bits from chapter 6, sections 4 and 5.",https://archive.org/download/publiclawsresolu1899nort/publiclawsresolu1899nort_jp2.zip/publiclawsresolu1899nort_jp2%2Fpubliclawsresolu1899nort_0102.jp2&ext=jpg,yes,14514,4,1,CHAPTER 6.0,Sec. 5.,"Sec. 5. That this act shall be in force from and after its ratification. Ratified this seventh day of January, A.D. eighteen hundred and ninety-nine." 6 | publiclawsresolu1899nort_public laws_data,Neil,"7,8","Split script missed the chapter 7 chapter header and OCR misread the chapter 8 header (""chapter 6&8""). Mis-assigned chapter and section titles were corrected.",https://archive.org/download/publiclawsresolu1899nort/publiclawsresolu1899nort_jp2.zip/publiclawsresolu1899nort_jp2%2Fpubliclawsresolu1899nort_0102.jp2&ext=jpg,no,,,,,, 7 | publiclawsresolu1899nort_public laws_data,Neil,12,"Chapter 12 header misread as ""Ch"" ""meh"" ""th"" across 3 rows. 
Chapter/section titles cleaned up in vicinity of error",https://archive.org/download/publiclawsresolu1899nort/publiclawsresolu1899nort_jp2.zip/publiclawsresolu1899nort_jp2%2Fpubliclawsresolu1899nort_0132.jp2&ext=jpg,no,,,,,, 8 | -------------------------------------------------------------------------------- /images/Pauli_Murray.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/images/Pauli_Murray.jpg -------------------------------------------------------------------------------- /images/UniversityLibraries_logo_black_h75.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/images/UniversityLibraries_logo_black_h75.png -------------------------------------------------------------------------------- /images/mellon-foundation-logo.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/images/mellon-foundation-logo.jpg -------------------------------------------------------------------------------- /index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | Redirecting to https://onthebooks.lib.unc.edu 4 | 5 | 6 | -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /installation.md: -------------------------------------------------------------------------------- 1 | # Software Installation Documentation 2 | ### Optional: Install the Windows Subsystem for Linux (Ubuntu): 3 | * To install Linux, first enable 
the Windows Subsystem for Linux optional feature by running the following command in Windows PowerShell (Start --> Windows PowerShell --> Right click --> Run as Administrator): 4 | * `Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux` 5 | * Restart when prompted 6 | * Install Ubuntu from the [Microsoft Store](https://www.microsoft.com/en-us/p/ubuntu/9nblggh4msv6?activetab=pivot:overviewtab) 7 | * Click on the install button 8 | * Click on the launch button 9 | * Once installation is complete, create a UNIX user and password 10 | 11 | ### 1. Install Anaconda Python: 12 | * Install Anaconda from the [Anaconda website](https://www.anaconda.com/distribution/) 13 | * Download the latest Python version (64-bit Graphical Installer) 14 | * Run the Anaconda setup file with default settings 15 | 16 | ### 2. Install Tesseract: 17 | * __Windows__: 18 | * Install from the [GitHub link](https://github.com/UB-Mannheim/tesseract/wiki) 19 | * Click on tesseract-ocr-w64-setup-v4.1.0.20190314 (rc1) 20 | * Run the setup file with default settings, noting the installation location 21 | * Add the installation location to your PATH variable 22 | * __MacOS__: 23 | * To install using MacPorts, run the command: `sudo port install tesseract` 24 | * To install using Homebrew, run the command: `brew install tesseract` 25 | 26 | ### 3. Install Python packages 27 | * Optional: 28 | Use your conda terminal to create a new environment called "onthebooks": 29 | ``` 30 | conda create -n onthebooks python=3.7 31 | ``` 32 | Activate the environment and install the basic dependencies with: 33 | ``` 34 | conda activate onthebooks 35 | conda install pandas 36 | conda install spyder 37 | ``` 38 | 39 | * Open a conda terminal and execute the following: 40 | + Note: Pillow needs to be reinstalled after openjpeg is installed to correctly link to the jpeg2000 decoder. 
41 | ``` 42 | conda install openjpeg 43 | pip install Pillow --force-reinstall 44 | pip install pyspellchecker 45 | pip install pytesseract 46 | ``` 47 | -------------------------------------------------------------------------------- /oer/.ipynb_checkpoints/environment_backup-checkpoint.yml: -------------------------------------------------------------------------------- 1 | # This file is a duplicate of the environment.yml file 2 | # stored in the root of the On The Books Github repository. 3 | # If you only need the oer folder in the repository, 4 | # copy its contents to a new Github repository and rename 5 | # this file to environment.yml so that Binder will pull your 6 | # dependencies from this file. 7 | name: oer-environment 8 | channels: 9 | - conda-forge 10 | dependencies: 11 | - python 12 | - pip 13 | - pip: 14 | - geopandas 15 | - internetarchive 16 | - matplotlib 17 | - nltk 18 | - pandas 19 | - pillow 20 | - pyspellchecker 21 | - pytesseract 22 | - requests 23 | -------------------------------------------------------------------------------- /oer/NC_counties.txt: -------------------------------------------------------------------------------- 1 | Alamance 2 | Alexander 3 | Alleghany 4 | Anson 5 | Ashe 6 | Avery 7 | Beaufort 8 | Bertie 9 | Bladen 10 | Brunswick 11 | Buncombe 12 | Burke 13 | Cabarrus 14 | Caldwell 15 | Camden 16 | Carteret 17 | Caswell 18 | Catawba 19 | Chatham 20 | Cherokee 21 | Chowan 22 | Clay 23 | Cleveland 24 | Columbus 25 | Craven 26 | Cumberland 27 | Currituck 28 | Dare 29 | Davidson 30 | Davie 31 | Duplin 32 | Durham 33 | Edgecombe 34 | Forsyth 35 | Franklin 36 | Gaston 37 | Gates 38 | Graham 39 | Granville 40 | Greene 41 | Guilford 42 | Halifax 43 | Harnett 44 | Haywood 45 | Henderson 46 | Hertford 47 | Hoke 48 | Hyde 49 | Iredell 50 | Jackson 51 | Johnston 52 | Jones 53 | Lee 54 | Lenoir 55 | Lincoln 56 | McDowell 57 | Macon 58 | Madison 59 | Martin 60 | Mecklenburg 61 | Mitchell 62 | Montgomery 63 | Moore 64 | Nash 65 | New 
Hanover 66 | Northampton 67 | Onslow 68 | Orange 69 | Pamlico 70 | Pasquotank 71 | Pender 72 | Perquimans 73 | Person 74 | Pitt 75 | Polk 76 | Randolph 77 | Richmond 78 | Robeson 79 | Rockingham 80 | Rowan 81 | Rutherford 82 | Sampson 83 | Scotland 84 | Stanly 85 | Stokes 86 | Surry 87 | Swain 88 | Transylvania 89 | Tyrrell 90 | Union 91 | Vance 92 | Wake 93 | Warren 94 | Washington 95 | Watauga 96 | Wayne 97 | Wilkes 98 | Wilson 99 | Yadkin 100 | Yancey -------------------------------------------------------------------------------- /oer/environment_backup.yml: -------------------------------------------------------------------------------- 1 | # This file is a duplicate of the environment.yml file 2 | # stored in the root of the On The Books Github repository. 3 | # If you only need the oer folder in the repository, 4 | # copy its contents to a new Github repository and rename 5 | # this file to environment.yml so that Binder will pull your 6 | # dependencies from this file. 7 | name: oer-environment 8 | channels: 9 | - conda-forge 10 | dependencies: 11 | - python 12 | - pip 13 | - pip: 14 | - geopandas 15 | - internetarchive 16 | - matplotlib 17 | - nltk 18 | - pandas 19 | - pillow 20 | - pyspellchecker 21 | - pytesseract 22 | - requests 23 | -------------------------------------------------------------------------------- /oer/images/00-intro-01.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-01.jpeg -------------------------------------------------------------------------------- /oer/images/00-intro-02.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-02.jpg -------------------------------------------------------------------------------- 
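The installation steps in `installation.md` above (Tesseract on the PATH, plus the Pillow, pyspellchecker, and pytesseract packages) can be sanity-checked before running any of the project's scripts. The following is a minimal sketch, not part of the repository, using only the Python standard library; the tool and module names checked are the ones the installation instructions ask for:

```python
import importlib.util
import shutil


def missing_executables(executables=("tesseract",)):
    """Return the required command-line tools that are NOT found on PATH."""
    return [exe for exe in executables if shutil.which(exe) is None]


def missing_modules(modules=("PIL", "pytesseract", "spellchecker", "pandas")):
    """Return the required Python packages that are NOT importable.

    Note the import names differ from the pip package names:
    Pillow -> PIL, pyspellchecker -> spellchecker.
    """
    return [mod for mod in modules if importlib.util.find_spec(mod) is None]


if __name__ == "__main__":
    problems = missing_executables() + missing_modules()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("Environment looks complete.")
```

If `tesseract` is reported missing on Windows, the usual cause is that the installation location was not added to the PATH variable, as noted in step 2 above.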
/oer/images/00-intro-03.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-03.jpeg -------------------------------------------------------------------------------- /oer/images/00-intro-04.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-04.jpeg -------------------------------------------------------------------------------- /oer/images/00-intro-05.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-05.jpeg -------------------------------------------------------------------------------- /oer/images/00-intro-06.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-06.jpeg -------------------------------------------------------------------------------- /oer/images/00-intro-07.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-07.jpeg -------------------------------------------------------------------------------- /oer/images/00-intro-08.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-08.jpeg -------------------------------------------------------------------------------- /oer/images/00-intro-09.jpeg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-09.jpeg -------------------------------------------------------------------------------- /oer/images/00-intro-10.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-10.jpeg -------------------------------------------------------------------------------- /oer/images/00-intro-11.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-11.jpg -------------------------------------------------------------------------------- /oer/images/00-intro-12.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-12.jpg -------------------------------------------------------------------------------- /oer/images/00-intro-25.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/00-intro-25.jpg -------------------------------------------------------------------------------- /oer/images/01-algorithms-01.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/01-algorithms-01.jpg -------------------------------------------------------------------------------- /oer/images/06-corpus-01.jpeg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-01.jpeg -------------------------------------------------------------------------------- /oer/images/06-corpus-02.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-02.jpeg -------------------------------------------------------------------------------- /oer/images/06-corpus-03.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-03.jpeg -------------------------------------------------------------------------------- /oer/images/06-corpus-04.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-04.jpg -------------------------------------------------------------------------------- /oer/images/06-corpus-05.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-05.jpeg -------------------------------------------------------------------------------- /oer/images/06-corpus-06.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-06.jpeg -------------------------------------------------------------------------------- /oer/images/06-corpus-07.jpeg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-07.jpeg -------------------------------------------------------------------------------- /oer/images/06-corpus-08.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-08.jpeg -------------------------------------------------------------------------------- /oer/images/06-corpus-09.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-09.jpg -------------------------------------------------------------------------------- /oer/images/06-corpus-10.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-10.jpeg -------------------------------------------------------------------------------- /oer/images/06-corpus-11.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-11.jpeg -------------------------------------------------------------------------------- /oer/images/06-corpus-12.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-12.jpeg -------------------------------------------------------------------------------- /oer/images/06-corpus-13.jpeg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-13.jpeg -------------------------------------------------------------------------------- /oer/images/06-corpus-14.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-14.jpeg -------------------------------------------------------------------------------- /oer/images/06-corpus-15.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-15.jpg -------------------------------------------------------------------------------- /oer/images/06-corpus-16.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-16.jpg -------------------------------------------------------------------------------- /oer/images/06-corpus-17.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-17.jpg -------------------------------------------------------------------------------- /oer/images/06-corpus-18.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-18.jpg -------------------------------------------------------------------------------- /oer/images/06-corpus-runcode.mp4: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/06-corpus-runcode.mp4 -------------------------------------------------------------------------------- /oer/images/07-ocr-01.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/07-ocr-01.jpeg -------------------------------------------------------------------------------- /oer/images/07-ocr-02.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/07-ocr-02.jpeg -------------------------------------------------------------------------------- /oer/images/07-ocr-03.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/07-ocr-03.jpeg -------------------------------------------------------------------------------- /oer/images/07-ocr-04.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/07-ocr-04.jpeg -------------------------------------------------------------------------------- /oer/images/07-ocr-05.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/07-ocr-05.jpeg -------------------------------------------------------------------------------- /oer/images/07-ocr-05.txt: 
-------------------------------------------------------------------------------- 1 | hereby re-enacted: Provided, however, that convicts shall not be worked on said railroad in the counties of New l Hanover or Pender. subSEC. 2. That if the company shall fail to begin the be '.uc-construction of the road within twelve months from the -------------------------------------------------------------------------------- /oer/images/07-ocr-06.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/07-ocr-06.jpeg -------------------------------------------------------------------------------- /oer/images/07-ocr-06.txt: -------------------------------------------------------------------------------- 1 | year the sum of twenty-five cents on each three hundred dollars' t\"Orth of property and the same arnouut on each poll, which shall constitute and he held a sinking fund: -------------------------------------------------------------------------------- /oer/images/07-ocr-07.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/07-ocr-07.jpeg -------------------------------------------------------------------------------- /oer/images/07-ocr-07.txt: -------------------------------------------------------------------------------- 1 | a S€1'.)arate fund, -------------------------------------------------------------------------------- /oer/images/07-ocr-08.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/07-ocr-08.jpeg -------------------------------------------------------------------------------- /oer/images/07-ocr-08.txt: 
-------------------------------------------------------------------------------- 1 | meanor and upon conviction shall be tined not less than ten doliars nor more than thirty Jollars, or imprisoned uot less tbun ten days nor rnore than thirty dayc, or both at the disereLiou of the cCJurc. SEC. 10 . .?i'ovided, tliat no person shall be admitted into I sa.1d school a-: a studet"!t who has uot :1ttained the age of fiftEeu yeHrs; and that all tbo::ie \Vho shnll eujoy the priv -------------------------------------------------------------------------------- /oer/images/08-ocr-01.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/08-ocr-01.jpeg -------------------------------------------------------------------------------- /oer/images/08-ocr-02.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/08-ocr-02.jpeg -------------------------------------------------------------------------------- /oer/images/08-ocr-03.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/08-ocr-03.jpeg -------------------------------------------------------------------------------- /oer/images/08-ocr-04.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/08-ocr-04.jpeg -------------------------------------------------------------------------------- /oer/images/08-ocr-05.jpeg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/08-ocr-05.jpeg -------------------------------------------------------------------------------- /oer/images/08-ocr-06.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/08-ocr-06.jpeg -------------------------------------------------------------------------------- /oer/images/08-ocr-07.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/08-ocr-07.jpeg -------------------------------------------------------------------------------- /oer/images/09-data-01.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/09-data-01.jpeg -------------------------------------------------------------------------------- /oer/images/09-data-02.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/09-data-02.jpeg -------------------------------------------------------------------------------- /oer/images/09-data-03.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/09-data-03.jpeg -------------------------------------------------------------------------------- /oer/images/09-data-04.jpeg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/09-data-04.jpeg -------------------------------------------------------------------------------- /oer/images/09-data-05.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/09-data-05.jpeg -------------------------------------------------------------------------------- /oer/images/09-data-06.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/09-data-06.jpeg -------------------------------------------------------------------------------- /oer/images/09-data-07.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/09-data-07.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-01.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-01.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-02.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-02.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-03.jpeg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-03.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-04.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-04.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-05.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-05.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-06.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-06.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-07.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-07.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-08.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-08.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-09.jpeg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-09.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-10.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-10.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-11.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-11.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-12.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-12.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-13.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-13.jpeg -------------------------------------------------------------------------------- /oer/images/10-explore-14.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/10-explore-14.jpeg -------------------------------------------------------------------------------- /oer/images/Anaconda_Nucleus_Horizontal_white.svg: -------------------------------------------------------------------------------- 1 | 
-------------------------------------------------------------------------------- /oer/images/LawBooks-feature.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/LawBooks-feature.png -------------------------------------------------------------------------------- /oer/images/chronam_daybook_19151112_pellagra_full.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/chronam_daybook_19151112_pellagra_full.jpg -------------------------------------------------------------------------------- /oer/images/chronam_daybook_19151112_pellagra_full_bboxes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/chronam_daybook_19151112_pellagra_full_bboxes.png -------------------------------------------------------------------------------- /oer/images/noun_arrow with loops_2073885.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/noun_arrow with loops_2073885.png -------------------------------------------------------------------------------- /oer/images/sessionlawsresol1955nort_0057.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/sessionlawsresol1955nort_0057.jpg -------------------------------------------------------------------------------- /oer/images/sessionlawsresol1955nort_0057_300ppi.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/sessionlawsresol1955nort_0057_300ppi.jpg -------------------------------------------------------------------------------- /oer/images/sessionlawsresol1955nort_0057_grayscale.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/sessionlawsresol1955nort_0057_grayscale.jpg -------------------------------------------------------------------------------- /oer/images/sessionlawsresol1955nort_0057_inverted.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/sessionlawsresol1955nort_0057_inverted.jpg -------------------------------------------------------------------------------- /oer/images/sessionlawsresol1955nort_0057_rotated.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/sessionlawsresol1955nort_0057_rotated.jpg -------------------------------------------------------------------------------- /oer/images/sessionlawsresol1955nort_0057_skewed.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/sessionlawsresol1955nort_0057_skewed.jpg -------------------------------------------------------------------------------- /oer/images/sessionlawsresol1955nort_0058.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/images/sessionlawsresol1955nort_0058.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0000.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0000.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0001.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0001.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0002.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0002.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0003.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0003.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0004.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0004.jpg 
-------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0005.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0005.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0006.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0006.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0007.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0007.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0008.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0008.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0009.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0009.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0010.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0010.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0011.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0011.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0012.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0012.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0013.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0013.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0014.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0014.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0015.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0015.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0016.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0016.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0017.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0017.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0018.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0018.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0019.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0019.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0020.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0020.jpg 
-------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0021.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0021.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0022.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0022.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0023.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0023.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0024.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0024.jpg -------------------------------------------------------------------------------- /oer/jpg_output/sessionlawsresol1955nort_0025.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/jpg_output/sessionlawsresol1955nort_0025.jpg -------------------------------------------------------------------------------- /oer/sample/sessionlawsresol1955nort_0057.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sample/sessionlawsresol1955nort_0057.jpg -------------------------------------------------------------------------------- /oer/sample/sessionlawsresol1955nort_0058.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sample/sessionlawsresol1955nort_0058.jpg -------------------------------------------------------------------------------- /oer/sample/sessionlawsresol1955nort_0059.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sample/sessionlawsresol1955nort_0059.jpg -------------------------------------------------------------------------------- /oer/sample/sessionlawsresol1955nort_0060.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sample/sessionlawsresol1955nort_0060.jpg -------------------------------------------------------------------------------- /oer/sample/sessionlawsresol1955nort_0061.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sample/sessionlawsresol1955nort_0061.jpg -------------------------------------------------------------------------------- /oer/sample/sessionlawsresol1955nort_0062.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sample/sessionlawsresol1955nort_0062.jpg -------------------------------------------------------------------------------- /oer/sample/sessionlawsresol1955nort_0063.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sample/sessionlawsresol1955nort_0063.jpg -------------------------------------------------------------------------------- /oer/sample/sessionlawsresol1955nort_0064.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sample/sessionlawsresol1955nort_0064.jpg -------------------------------------------------------------------------------- /oer/sample/sessionlawsresol1955nort_0065.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sample/sessionlawsresol1955nort_0065.jpg -------------------------------------------------------------------------------- /oer/sample/sessionlawsresol1955nort_0066.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sample/sessionlawsresol1955nort_0066.jpg -------------------------------------------------------------------------------- /oer/sample_output.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sample_output.txt -------------------------------------------------------------------------------- 
/oer/sample_output/sessionlawsresol1955nort_0057.txt: -------------------------------------------------------------------------------- 1 | SESSION LAWS 2 | 3 | OF THE 4 | 5 | STATE OF NORTH CAROLINA 6 | 7 | SESSION 1955 8 | 9 | S. B. 4 CHAPTER 1 10 | 11 | AN ACT TO AUTHORIZE THE BOARD OF TRUSTEES OF THE 12 | SOUTHERN PINES SCHOOL DISTRICT TO TRANSFER CERTAIN 13 | FUNDS FROM ITS DEBT SERVICE ACCOUNT TO ITS CAPITAL 14 | OUTLAY OR CURRENT EXPENSE ACCOUNTS, OR TO BOTH 15 | SUCH ACCOUNTS. 16 | 17 | The General Assembly of North Carolina do enact: 18 | 19 | Section 1. The Board of Trustees of the Southern Pines School Dis- 20 | triet is hereby authorized and empowered to transfer all surplus funds held 21 | by it in its debt service account on the date of the ratification of this Act 22 | or on July 1, 1955, to its capital outlay account or current expense account, 23 | or to both such accounts, and to use said funds for capital outlay or current 24 | expense purposes, or both, including the construction of school buildings. 25 | 26 | See. 2. All laws and clauses of laws in conflict with this Act are hereby 27 | repealed. 28 | 29 | See. 3. This Act shall become effective on and after its ratification. 30 | 31 | In the General Assembly read three times and ratified, this the 14th 32 | day of January, 1955. 33 | 34 | H. B. 13 CHAPTER 2 35 | 36 | AN ACT TO PERMIT THE BOARD OF COMMISSIONERS OF 37 | CATAWBA COUNTY TO MAKE APPROPRIATIONS FOR BUILDING 38 | WATER LINES, SEWER LINES OR EITHER OF THEM, FROM THE 39 | CORPORATE LIMITS OF MUNICIPALITIES TO COMMUNITIES IN 40 | THE COUNTY. 41 | 42 | The General Assembly of North Carolina do enact: 43 | 44 | Section 1. 
The Board of County Commissioners of Catawba County is 45 | hereby authorized and empowered in its discretion to expend out of non- 46 | tax funds available to said board such amount or amounts as it may deem 47 | wise, not exceeding in the aggregate the sum of one hundred and twenty- 48 | 49 | 1 50 | -------------------------------------------------------------------------------- /oer/sample_output/sessionlawsresol1955nort_0058.txt: -------------------------------------------------------------------------------- 1 | CH. 2-3 1955—-SEssION Laws 2 | 3 | five thousand dollars ($125,000.00), to be used in such amounts in the dis- 4 | cretion of said board of county commissioners for the purpose of acquir- 5 | ing easements for water and sewer lines, or either of them, and for the 6 | purpose of laying and constructing water and sewer lines or either of them 7 | from the corporate limits of municipalities located in Catawba County to 8 | communities located within said county, but outside of the corporate limits 9 | of municipalities, said water and sewer lines, or either of them, shall be 10 | constructed and laid and said easements therefor shall be acquired, for the 11 | purpose of promoting the general welfare of said county and the expense 12 | of laying and construction of said lines and acquiring said easements is 13 | hereby declared to be expenditures for public purposes. 14 | 15 | Sec. 2. All laws and clauses of laws in conflict with this Act are hereby 16 | repealed. 17 | 18 | See. 3. This Act shall be in full force and effect from and after its 19 | ratification. 20 | 21 | In the General Assembly read three times and ratified, this the 14th 22 | day of January, 1955. 23 | 24 | S. B. 13 CHAPTER 3 25 | 26 | AN ACT TO AMEND THE ELECTION LAW HERETOFORE PROVIDED 27 | FOR THE TOWN OF CONETOE, IN EDGECOMBE COUNTY, AND 28 | TO FIX THE DATES OF ELECTIONS FOR SAID TOWN. 29 | 30 | The General Assembly of North Carolina do enact: 31 | 32 | Section 1. 
Amend Section 2 of Chapter 673 of the Session Laws of 1953 33 | by striking out the following: “1953”, appearing in the first line of said 34 | Section 2, and by inserting in lieu thereof the following: “1955”. 35 | 36 | See. 2. Amend Section 3 of Chapter 673 of the Session Laws of 1953 37 | by striking out the figures “1953”, as the same appear in the eighth line of 38 | said Section 3, and by inserting in lieu thereof the figures “1955”, 39 | 40 | Sec. 3. Amend Section 4 of Chapter 673 of the Session Laws of 1953 by 41 | striking out the figures “1955”, as the same appear in the first line of said 42 | Section 4, and by inserting in lieu thereof the figures “1957”. 43 | 44 | Further amend said Section 4 of Chapter 673 of the Session Laws of 45 | 1953 by striking out the figures “1953”, as the same appear in the tenth 46 | line of said Section 4, and by inserting in lieu thereof the figures “1955”. 47 | 48 | Sec. 4. All laws and clauses of laws in conflict with this Act are hereby 49 | repealed. 50 | 51 | See. 5. This Act shall be in full force and effect from and after its rati- 52 | fication. 53 | 54 | In the General Assembly read three times and ratified, this the 14th day 55 | of January, 1955. 56 | 57 | -------------------------------------------------------------------------------- /oer/sample_output/sessionlawsresol1955nort_0059.txt: -------------------------------------------------------------------------------- 1 | 1955—SEsSION LAWS Cu. 4 2 | 3 | H. B. 34 CHAPTER 4 4 | 5 | AN ACT TO PROVIDE THAT THE OFFICE OF SOLICITOR OF THE 6 | RECORDER’S COURT OF FRANKLIN COUNTY BE AN ELECTIVE 7 | OFFICE. 8 | 9 | The General Assembly of North Carolina do enact: 10 | 11 | Section 1. That Section 6 of Chapter 12, Session Laws of 1951 is here- 12 | by repealed. 13 | 14 | Sec. 2. That G. S. 
7-235 is hereby amended by adding at the end there- 15 | of the following: 16 | 17 | “Provided that as of February 1, 1955, the office of prosecuting attor- 18 | ney of the Recorder’s Court of Franklin County, denominated solicitor, 19 | shall be an elective office. The first solicitor shall be elected by the Board 20 | of Commissioners of Franklin County on or before February 1, 1955, and 21 | shall hold his office under said appointment until the first Monday in De- 22 | cember, 1956. At the primary and general elections to be held in the year 23 | 1956 and biennially thereafter, the Solicitor of the Recorder’s Court of 24 | Franklin County shall be nominated and elected in the same manner and 25 | at the same time as is now or may hereafter be provided by law for the 26 | nomination and election of the elective officers of the county; the term of 27 | office of said Solicitor shall begin on the first Monday in December follow- 28 | ing the biennial general election at which he shall have been elected and 29 | shall extend to the first Monday in December following the next ensuing 30 | biennial general election. In the event of a vacancy in the office of Solicitor, 31 | either by death, resignation, failure to qualify, or otherwise, the Board of 32 | Commissioners of Franklin County shall fill such vacancy by appointment 33 | and the person so appointed shall serve until the first Monday in December 34 | following the next biennial general election. The salary of said Solicitor 35 | shall be two thousand four hundred dollars ($2400.00) per year and shall 36 | be paid in equal monthly installments from the General Fund of the county. 
37 | The said Solicitor shall, at the time of his appointment or nomination and 38 | election, be a qualified elector of Franklin County and a licensed attorney 39 | at law, and before entering upon the duties of his office shall take and sub- 40 | scribe an oath substantially in the form required of State solicitors by 41 | Section 11-11 of the General Statutes of North Carolina, and said oath shall 42 | be recorded by the Clerk of the Superior Court of Franklin County.” 43 | 44 | Sec. 3. All laws and clauses of laws in conflict with the provisions of 45 | this Act are hereby repealed. 46 | 47 | See. 4. This Act shall be in full force and effect from and after its rati- 48 | fication. 49 | 50 | In the General Assembly read three times and ratified, this the 26th day 51 | of January, 1955. 52 | 53 | -------------------------------------------------------------------------------- /oer/sample_output/sessionlawsresol1955nort_0060.txt: -------------------------------------------------------------------------------- 1 | Cu. 5-6-7 1955—SEssION LAws 2 | 3 | H. B. 2 CHAPTER 5 4 | 5 | AN ACT TO REPEAL CHAPTER 501 OF THE SESSION LAWS OF 1953, 6 | RELATING TO COMMITTEE HEARINGS ON THE APPROPRIA- 7 | TIONS BILL. 8 | 9 | The General Assembly of North Carolina do enact: 10 | 11 | Section 1. Chapter 501 of the Session Laws of 1958 is repealed. 12 | 13 | See. 2. All laws and clauses of laws in conflict with this Act are here- 14 | by repealed. 15 | 16 | See. 3. This Act shall be in full force and effect from and after its rati- 17 | fication. 18 | 19 | In the General Assembly read three times and ratified, this the 28th 20 | day of January, 1955. 21 | 22 | H. B. 18 CHAPTER 6 23 | 24 | AN ACT TO AMEND THE CHARTER OF THE CITY OF SALISBURY 25 | BY REQUIRING COUNCIL MEETINGS TO BE HELD AS OFTEN AS 26 | TWICE MONTHLY INSTEAD OF ONCE WEEKLY. 27 | 28 | The General Assembly of North Carolina do enact: 29 | 30 | Section 1. 
Section 8 of Chapter 231 of the Private Laws of 1927, as 31 | amended by Chapter 178 of the Private Laws of 1929, be and the same is 32 | hereby further amended by striking out the first sentence appearing there- 33 | in and inserting in lieu thereof the following: 34 | 35 | “The Council shall fix suitable times for its regular meetings, which 36 | shall be as often as twice monthly.” 37 | 38 | Sec. 2. All laws and clauses of laws in conflict with this Act are here- 39 | by repealed. 40 | 41 | Sec. 3. This Act shall be in full force and effect from and after its 42 | ratification. 43 | 44 | In the General Assembly read three times and ratified, this the 28th 45 | day of January, 1955. 46 | 47 | S. B. 33 CHAPTER 7 48 | 49 | AN ACT TO AMEND ARTICLE 4 OF CHAPTER 15 OF THE GENERAL 50 | STATUTES SO AS TO PROVIDE FOR THE ISSUANCE OF SEARCH 51 | WARRANTS FOR NARCOTIC DRUGS. 52 | 53 | The General Assembly of North Carolina do enact: 54 | 55 | Section 1. G. S. 15-25 is amended by inserting between the comma fol- 56 | lowing the word “premises” and the word “any” in line 5 of said Section 57 | the words and punetuation “any narcotic drugs as defined in Article 5 of 58 | Chapter 90 of the General Statutes,”. G. S. 15-25 is further amended by 59 | inserting between the word “such” and the word “stolen” in line 22 of said 60 | Section the words and punctuation “narcotic drugs,”. 61 | 62 | Sec. 2. All laws and clauses of laws in conflict with this Act are hereby 63 | 64 | 4 65 | 66 | -------------------------------------------------------------------------------- /oer/sample_output/sessionlawsresol1955nort_0061.txt: -------------------------------------------------------------------------------- 1 | 1955—SEssion LAws Cu. 7-8-9 2 | 3 | repealed. 4 | 5 | Sec. 3. This Act shall be in full force and effect from and after its rati- 6 | fication. 7 | 8 | In the General Assembly read three times and ratified, this the 3rd day 9 | of February, 1955. 10 | 11 | S. B. 
38 CHAPTER 8 12 | 13 | AN ACT TO AMEND G. S. 7-274 SO AS TO AUTHORIZE THE CLERK 14 | OR DEPUTY CLERK OF THE GENERAL COUNTY COURT OF 15 | HALIFAX COUNTY TO ISSUE CRIMINAL WARRANTS. 16 | 17 | The General Assembly of North Carolina do enact: 18 | 19 | Section 1. G. S. 7-274 is hereby amended by striking out the word 20 | “Halifax” as the same appears in line 13. 21 | 22 | Sec. 2. All laws and clauses of laws in conflict with this Act are hereby 23 | repealed. 24 | 25 | Sec. 3. This Act shall become effective upon ratification. 26 | 27 | In the General Assembly read three times and ratified, this the 3rd day 28 | of February, 1955. 29 | 30 | H. B. 28 CHAPTER 9 31 | 32 | AN ACT TO AUTHORIZE AND EMPOWER THE BOARD OF COMMIS- 33 | SIONERS OF STOKES COUNTY TO SELL AND CONVEY THE 34 | TRACT OF LAND AND BUILDINGS SITUATED THEREON FOR- 35 | MERLY USED BY THE COUNTY IN CONNECTION WITH THE 36 | OPERATION AND MAINTENANCE OF THE COUNTY HOME 37 | FARM. 38 | 39 | The General Assembly of North Carolina do enact: 40 | 41 | Section 1. The Board of County Commissioners of Stokes County is 42 | hereby authorized and empowered to sell at public or private sale the entire 43 | tract of land and buildings situated thereon known as the County Home 44 | Farm or such part or parts thereof as in the discretion of the board will 45 | not be needed for public purposes. If the sale is made at public auction, 46 | notice of the sale shall be published once a week for two successive weeks 47 | in a newspaper of general circulation in the county. After any such public 48 | sale, the board of county commissioners is authorized to reject any bid 49 | which in the opinion of the board is not considered to be the fair market 50 | value of the partial or entire tract of land offered. If, after public auction, 51 | the board of county commissioners rejects the highest bid made, further 52 | public auctions may be held or the partial or entire tract of land may be 53 | sold privately for a higher price. 
54 | 55 | Sec. 2. In carrying out the provisions of this Act the Board of County 56 | Commissioners of Stokes County may execute all necessary deeds and may 57 | employ an auction company to assist with subdividing and selling the 58 | property involved but shall not pay any company so employed more than 59 | 60 | 5 61 | -------------------------------------------------------------------------------- /oer/sample_output/sessionlawsresol1955nort_0062.txt: -------------------------------------------------------------------------------- 1 | Cu. 9-10-11 1955—SEssIon LAWS 2 | 3 | four per cent (4%) of the sales price as confirmed by the board of county 4 | commissioners. 5 | 6 | See. 3. All laws and clauses of laws in conflict with this Act are hereby 7 | repealed. 8 | 9 | See. 4. This Act shall be in full force and effect from and after its 10 | ratification. 11 | 12 | In the General Assembly read three times and ratified, this the 8rd day 13 | of February, 1955. 14 | 15 | H. B. 58 CHAPTER 10 16 | 17 | AN ACT TO AMEND G. §S. 1-109, RELATING TO PROSECUTION 18 | BONDS, SO AS TO PLACE THE STATE ON THE SAME BASIS AS 19 | CITIES AND TOWNS WITH RESPECT TO EXEMPTION THERE- 20 | FROM. 21 | 22 | The General Assembly of North Carolina do enact: 23 | 24 | Section 1. G. S. 1-109 is hereby amended by inserting the words “the 25 | State of North Carolina or any of its agencies, commissions or institutions, 26 | or to” immediately following the word “to” and immediately preceding the 27 | word “counties”, in line 3 of paragraph 3, and by inserting the words “the 28 | State of North Carolina or any of its agencies, commissions or institutions, 29 | and” immediately following the word “that” and immediately preceding the 30 | word “counties” in line 4 of paragraph 3. 31 | 32 | See. 2. 
This Act shall apply to pending litigation, and all actions or 33 | proceedings heretofore instituted by the State of North Carolina or its 34 | agencies shall be valid as if the provisions of this Act had at all times been 35 | the law of the land. 36 | 37 | Sec. 3. All laws and clauses of laws in conflict with this Act are hereby 38 | repealed. 39 | 40 | Sec. 4. This Act shall become effective upon its ratification. 41 | 42 | In the General Assembly read three times and ratified, this the 3rd day 43 | of February, 1955. 44 | 45 | S. B. 22 CHAPTER 11 46 | 47 | AN ACT TO AMEND G. S. 153-38 SO AS TO PROVIDE FOR THE PAY- 48 | MENT OF THE EXPENSES BY GRANVILLE COUNTY OF THE 49 | COUNTY AUDITOR, THE CLERK TO THE BOARD OF COUNTY 50 | COMMISSIONERS, AND THE COUNTY ATTORNEY IN ATTEND- 51 | ING MEETINGS OF THE STATE ASSOCIATION OF COUNTY COM- 52 | MISSIONERS. 53 | 54 | The General Assembly of North Carolina do enact: 55 | Section 1. G. S. 158-38 is amended by adding at the end thereof a new 56 | paragraph to read as follows: 57 | “In Granville County, the Board of County Commissioners is authorized, 58 | in its discretion, to pay the expenses of the County Auditor, the Clerk to 59 | 60 | 6 61 | -------------------------------------------------------------------------------- /oer/sample_output/sessionlawsresol1955nort_0063.txt: -------------------------------------------------------------------------------- 1 | 1955—SESSION Laws Cu. 11-12-13 2 | 3 | the Board of County Commissioners, and the County Attorney in attend- 4 | ing meetings of the State Association of County Commissioners.” 5 | 6 | See. 2. All action heretofore taken by the Board of County Commission- 7 | ers of Granville County in paying the expenses of the officials named in 8 | Section 1 of this Act in attending meetings of the State Association of 9 | County Commissioners is hereby validated, ratified, and confirmed. 10 | 11 | Sec. 3. 
All laws and clauses of laws in conflict with this Act are hereby 12 | repealed, 13 | 14 | See. 4. This Act shall be in full force and effect from and after its rati- 15 | fication. 16 | 17 | In the General Assembly read three times and ratified, this the 4th day 18 | of February, 1955. 19 | 20 | S. B. 34 CHAPTER 12 21 | 22 | AN ACT TO AMEND CHAPTER 465 OF THE SESSION LAWS OF 1949 23 | TO AUTHORIZE THE BOARD OF COUNTY COMMISSIONERS OF 24 | ROWAN COUNTY IN ITS DISCRETION TO ADD THE DUTIES 25 | AND POWERS OF COUNTY TAX SUPERVISOR TO THOSE NOW 26 | BEING PERFORMED BY THE COUNTY TAX COLLECTOR. 27 | 28 | The General Assembly of North Carolina do enact: 29 | 30 | Section 1. Section 1 of Chapter 465 of the Session Laws of 1949 is 31 | hereby amended by rewriting Section 4 thereof to read as follows: “Sec. 4. 32 | The Board of County Commissioners of Rowan County may, in its discre- 33 | tion, add the duties and powers of County Tax Supervisor to those now 34 | being performed by the County Auditor or by the County Tax Collector 35 | and, in such event, may pay such County Auditor or County Tax Collector 36 | such additional compensation for such services as, in its discretion, it may 37 | deem appropriate.” 38 | 39 | Sec. 2. All laws and clauses of laws in conflict with this Act are here- 40 | by repealed. 41 | 42 | Sec. 3. This Act shall be in full force and effect from and after its rati- 43 | fication. 44 | 45 | In the General Assembly read three times and ratified, this the 4th 46 | day of February, 1955. 47 | 48 | S. B. 35 CHAPTER 13 49 | 50 | AN ACT AUTHORIZING THE BOARD OF COUNTY COMMISSIONERS 51 | OF ROWAN COUNTY TO EXTEND THE PERIOD DURING WHICH 52 | IT MAY SIT IN 1955 AS A BOARD OF EQUALIZATION AND RE- 53 | VIEW. 
54 | 55 | WHEREAS, the Board of County Commissioners of Rowan County are 56 | in the process of revaluing taxable property in Rowan County; and 57 | WHEREAS, the revaluation was not completed on January Ist, 1955 58 | the day upon which tax listing began; and 59 | WHEREAS, the said County Commissioners, acting as a Board of 60 | Equalization and Review from March 21st to April 11th, 1955, will not 61 | 62 | ui 63 | -------------------------------------------------------------------------------- /oer/sample_output/sessionlawsresol1955nort_0064.txt: -------------------------------------------------------------------------------- 1 | Cu. 18-14 1955—SrssIon LAws 2 | 3 | have sufficient time to properly consider all complaints likely to arise on 4 | account of such revaluations: Now, therefore, 5 | The General Assembly of North Carolina do enact: 6 | 7 | Section 1. That the Board of County Commissioners in its discretion 8 | may extend the period during which it may sit in the year 1955 as a Board 9 | of Equalization and Review until such time as it has completed the work 10 | of hearing and determining complaints relating to revaluation; but said 11 | extension shall end on or before October Ist, 1955. 12 | 13 | See. 2. All laws and clauses of laws in conflict with this Act are here- 14 | by repealed. 15 | 16 | See. 3. This Act shall be in full force and effect from and after its 17 | ratification. 18 | 19 | In the General Assembly read three times and ratified, this the 4th day 20 | of February, 1955. 21 | 22 | S. B. 101 CHAPTER 14 23 | 24 | AN ACT TO AMEND CHAPTER 788 OF THE SESSION LAWS OF 1953 25 | SO AS TO APPOINT A MEMBER OF THE BOARD OF EDUCATION 26 | OF BRUNSWICK COUNTY TO SERVE OUT THE UNEXPIRED 27 | TERM OF RAY WALTON. 
28 | 29 | WHEREAS, Ray Walton was named in Chapter 788 of the Session 30 | Laws of 1953 to serve on the Board of Education of Brunswick County for 31 | a term of two years; and 32 | 33 | WHEREAS, Ray Walton having been elected Senator from the Tenth 34 | Senatorial District to serve in the 1955 General Assembly resigned from 35 | his position as a member of the Board of Education of Brunswick 36 | County: Now, therefore, 37 | 38 | The General Assembly of North Carolina do enact: 39 | 40 | Section 1. Section 1 of Chapter 788 of the Session Laws of 1953 is here- 41 | by amended so as to provide that Thomas St. George is appointed a mem- 42 | ber of the Board of Education of Brunswick County to serve for the un- 43 | expired term of Ray Walton. 44 | 45 | Sec. 2. All laws and clauses of laws in conflict with this Act are here- 46 | by repealed. 47 | 48 | Sec. 3. This Act shall be in full force and effect from and after its 49 | ratification. 50 | 51 | In the General Assembly read three times and ratified, this the 4th day 52 | of February, 1955. 53 | 54 | -------------------------------------------------------------------------------- /oer/sample_output/sessionlawsresol1955nort_0065.txt: -------------------------------------------------------------------------------- 1 | 1955—SESSION LAws Cu. 15-16-17 2 | 3 | H. B. 6 CHAPTER 15 4 | 5 | AN ACT TO REPEAL CHAPTER 522 OF THE SESSION LAWS OF 1953, 6 | RELATING TO COUNTY POLICEMEN OF MCDOWELL COUNTY. 7 | 8 | The General Assembly of North Carolina do enact: 9 | 10 | Section 1. Chapter 522 of the Session Laws of 1958 is repealed. 11 | 12 | Sec. 2. All laws and clauses of laws in conflict with this Act are here- 13 | by repealed. 14 | 15 | Sec. 3. This Act shall be in full force and effect from and after its rati- 16 | fication. 17 | 18 | In the General Assembly read three times and ratified, this the 4th day 19 | of February, 1955. 
20 | 21 | He BYi5 CHAPTER 16 22 | 23 | AN ACT TO INCREASE THE MEMBERSHIP OF THE BOARD OF 24 | COUNTY COMMISSIONERS OF PERSON COUNTY FROM 3 TO 5, 25 | AND TO AMEND G. §. 153-5. 26 | 27 | The General Assembly of North Carolina do enact: 28 | 29 | Section 1. That G. S. 153-5 is hereby amended by adding at the end 30 | thereof a paragraph reading as follows: 31 | 32 | “There shall be elected in Person County at the general election to be 33 | held in the year 1956 and every two years thereafter by the duly qualified 34 | voters thereof, a board of county commissioners composed of five persons 35 | who shall serve for a term of two years from the first Monday in Decem- 36 | ber after their election and until their successors are elected and quali- 37 | fied.” 38 | 39 | Sec. 2. All laws and clauses of laws in conflict with the provisions of 40 | this Act are hereby repealed. 41 | 42 | See. 3. This Act shall be in full force and effect from and after its 43 | ratification. 44 | 45 | In the General Assembly read three times and ratified, this the 4th day 46 | of February, 1955. 47 | 48 | H. B. 16 CHAPTER 17 49 | 50 | AN ACT TO AMEND CHAPTER 105 OF THE GENERAL STATUTES SO. 51 | AS TO CHANGE THE TIME FOR FILING STATE INCOME TAX 52 | RETURNS BY PERSONS OTHER THAN CORPORATIONS FROM 53 | THE FIFTEENTH DAY OF MARCH TO THE FIFTEENTH DAY OF 54 | APRIL IN EACH YEAR, AND TO CONFORM THE STATE LAW TO 55 | THE FEDERAL LAW AS TO THE TIME FOR FILING RETURNS. 56 | 57 | The General Assembly of North Carolina do enact: 58 | 59 | Section 1. The first paragraph of G. S. 
105-155 is hereby amended by 60 | rewriting said paragraph to read as follows: 61 | 62 | “Returns shall be in such form as the Commissioner of Revenue may 63 | from time to time prescribe, and shall be filed with the Commissioner at his 64 | 65 | 9 66 | -------------------------------------------------------------------------------- /oer/sample_output/sessionlawsresol1955nort_0066.txt: -------------------------------------------------------------------------------- 1 | CH. 17 1955—-SEssIoN LAWS 2 | 3 | main office, or at any branch office which he may establish. The return of 4 | every person reporting on a calendar year basis shall be filed on or before 5 | the fifteenth day of April in each year, and the return of every person 6 | reporting on a fiscal year basis shall be filed on or before the fifteenth 7 | day of the fourth month following the close of the fiscal year. The return 8 | of a corporation reporting on a calendar year basis shall be filed on or 9 | before the fifteenth day of March in each year, and the return of a cor- 10 | poration reporting on a fiscal year basis shall be filed on or before the 11 | fifteenth day of the third month following the close of the fiscal year. In 12 | ease of sickness, absence, or other disability or whenever in his judgment 13 | good cause exists, the Commissioner may allow further time for filing 14 | returns.” 15 | 16 | See. 2. Subsection (1) of G. S. 105-157 is hereby amended by rewriting 17 | the subsection to read as follows 18 | 19 | “(1) Except as otherwise provided in this Section, the full amount of 20 | the tax payable as shown on the face of the return shall be paid to the 21 | Commissioner of Revenue at the office where the return is filed at the time 22 | fixed by law for filing the return. 
23 | 24 | “If the taxpayer is a person reporting on a calendar year basis and the 25 | amount of tax exceeds fifty dollars ($50.00), payment may be made in two 26 | equal installments: one-half at the time of filing the return, and one-half 27 | on or before the fifteenth day of September following the date the return 28 | was originally due to be filed, with interest on the deferred payment at the 29 | rate of four per cent (4%) per annum from the date the return was origi- 30 | nally due to be filed. If the taxpayer is a person reporting on a calendar 31 | year basis and the amount of the tax exceeds four hundred dollars ($400.00), 32 | payment may be made in four equal installments: one-fourth at the time 33 | of filing the return, one-fourth on or before the fifteenth day of June fol- 34 | lowing the date the return was originally due to be filed, one-fourth on or 35 | before the fifteenth day of September following the date the return was 36 | originally due to be filed, and one-fourth on or before the fifteenth day of 37 | December following the date the return was originally due to be filed, with 38 | interest on deferred payments at the rate of four per cent (4%) per annum 39 | from the date the return was originally due to be filed. 40 | 41 | “If the taxpayer is a person reporting on a fiscal year basis or a cor- 42 | poration reporting on either a calendar year or fiscal year basis and the 43 | amount of the tax exceeds fifty dollars ($50.00), payment may be made in 44 | two equal installments: one-half on the date the return is filed, and one- 45 | half on or before the fifteenth day of the sixth month following the month 46 | in which the return was originally due to be filed, with interest on the 47 | deferred payment at the rate of four per cent (4%) per annum from the 48 | date the return was originally due to be filed. 
If the taxpayer is a person 49 | reporting on a fiscal year basis or a corporation reporting on either a cal- 50 | endar year or fiscal year basis and the amount of the tax exceeds four 51 | hundred dollars ($490.00), payment may be made in four equal install- 52 | ments: one-fourth at the time of filing the return, one-fourth on or before 53 | the fifteenth day of the third month following the month in which the re- 54 | turn was originally due to be filed, one-fourth on or before the fifteenth 55 | 56 | 10 57 | -------------------------------------------------------------------------------- /oer/sessionlawsresol1955nort_0057.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sessionlawsresol1955nort_0057.jpg -------------------------------------------------------------------------------- /oer/sessionlawsresol1955nort_0057_grayscale.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sessionlawsresol1955nort_0057_grayscale.jpg -------------------------------------------------------------------------------- /oer/sessionlawsresol1955nort_0057_inverted.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/UNC-Libraries-data/OnTheBooks/18f8ba7d6c0007c0c5a7388fef61790fdfa758a3/oer/sessionlawsresol1955nort_0057_inverted.jpg -------------------------------------------------------------------------------- /workflow.md: -------------------------------------------------------------------------------- 1 | # *OnTheBooks* Workflow 2 | This page is meant to provide an overview of the workflow used to create the On the Books corpus. The workflow can be divided into seven major stages: 3 | 4 | 1. Data Acquisition 5 | 2. 
Marginalia Determination 6 | 3. Image Adjustment Recommendations 7 | 4. Optical Character Recognition (OCR) 8 | 5. Section Splitting & Cleaning 9 | 6. Analysis 10 | 7. XML Generation 11 | 12 | ## Data Acquisition 13 | During data acquisition, images and metadata were gathered through a combination of automatic downloads from the Internet Archive and manual metadata creation. 14 | 15 | First, digitized versions of the volumes were identified using the Internet Archive's advanced search interface. Using the metadata returned by this search, all images comprising the corpus were downloaded using [jp2_download.py](https://github.com/UNC-Libraries-data/OnTheBooks/blob/main/code/data_acquisition/jp2_download.py). Extraneous page images, such as blank pages or those containing tables of contents, were manually identified and deleted. 16 | 17 | Next, metadata such as law type (private laws, public laws, etc.) and original print page number were manually compiled for corpus images. These metadata were combined with other page-level metadata such as the leaf number (PDF page number) and page hand side (left or right), gathered from Internet Archive XML files using [xml_parser.py](https://github.com/UNC-Libraries-data/OnTheBooks/blob/main/code/data_acquisition/xml_parser.py). 18 | 19 | The products of this stage consisted of a curated set of all relevant image files as well as page-level metadata for all images in the corpus: 20 | * file name 21 | * leaf number 22 | * hand side 23 | * print page number 24 | * law type section 25 | * law type section title 26 | * Internet Archive image URL 27 | 28 | These items were compiled into a corpus-level document called 'xmljpegmerge.csv'. 29 | 30 | **Output File(s):** 31 | * *xmljpegmerge.csv* - .csv file with page-level metadata for the entire corpus 32 | 33 | ## Marginalia Determination 34 | Marginalia, which is text that serves as a finding aid, was printed in the corpus volumes prior to 1951.
The marginalia are not part of the laws and needed to be excluded from the OCR process, as did paratextual information from page headers and footers. The marginalia determination process identified the coordinates of the main text body to be OCR'd, along with the median page color, which was used to create a blank, color-neutral border around the main body text on each page; Tesseract OCR performs best when the text is not too close to the edge of the page. 35 | 36 | This step was accomplished using [marginalia_determination.py](https://github.com/UNC-Libraries-data/OnTheBooks/blob/main/code/marginalia/marginalia_determination.py) in concert with [cropfunctions.py](https://github.com/UNC-Libraries-data/OnTheBooks/blob/main/code/marginalia/cropfunctions.py). Detailed documentation for this step can be found [here](https://github.com/UNC-Libraries-data/OnTheBooks/blob/main/examples/marginalia_determination/marginalia_determination.ipynb). 37 | 38 | **Output File(s):** 39 | * *marginalia_metadata.csv* - .csv file containing main body text boundary coordinates and background color information for each page in the corpus 40 | 41 | ## Image Adjustment Recommendations 42 | Once the marginalia cropping information had been compiled, various image adjustments were tested for each volume to maximize OCR performance. A sample of images from each volume was tested using different values for a range of parameters (color, contrast, etc.). After the optimal image adjustments for each volume had been determined, they were stored for use during the following OCR stage. 43 | 44 | This step was accomplished using [adjRec.py](https://github.com/UNC-Libraries-data/OnTheBooks/blob/main/code/ocr/adjRec.py) in concert with [ocr_func.py](https://github.com/UNC-Libraries-data/OnTheBooks/blob/main/code/ocr/ocr_func.py).
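At its core, the adjustment recommendation described above amounts to a small grid search: apply each candidate combination of adjustment values to a sample of pages, score the OCR output, and keep the best-scoring combination. Below is a minimal sketch of that loop; `recommend_adjustments` and `ocr_confidence` are illustrative names and the parameter ranges are placeholders, not the project's actual values in adjRec.py. In the real pipeline the scorer would apply the adjustments to the image and run Tesseract.

```python
from itertools import product

def recommend_adjustments(sample_pages, ocr_confidence):
    """Grid-search brightness/contrast values and return the combination
    with the highest mean OCR confidence over a sample of pages.

    `ocr_confidence(page, brightness, contrast)` is a caller-supplied
    scorer; in the real pipeline it would adjust the image, run
    Tesseract, and return the mean word-level confidence.
    """
    brightness_vals = [0.8, 1.0, 1.2]  # illustrative candidate values
    contrast_vals = [1.0, 1.5, 2.0]    # illustrative candidate values

    best_score, best_params = float("-inf"), None
    for b, c in product(brightness_vals, contrast_vals):
        # Average the score across the sampled pages for this combination.
        score = sum(ocr_confidence(p, b, c) for p in sample_pages) / len(sample_pages)
        if score > best_score:
            best_score, best_params = score, {"brightness": b, "contrast": c}
    return best_params

if __name__ == "__main__":
    # Toy scorer: pretend confidence peaks at brightness 1.0, contrast 1.5.
    def toy_scorer(page, b, c):
        return 100 - 10 * abs(b - 1.0) - 5 * abs(c - 1.5)

    print(recommend_adjustments(["page1", "page2"], toy_scorer))
    # → {'brightness': 1.0, 'contrast': 1.5}
```

Mean word-level confidence is a convenient proxy for OCR quality because Tesseract reports it per token, but any scorer with the same signature could be swapped in.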
Detailed documentation for this step can be found [here](https://github.com/UNC-Libraries-data/OnTheBooks/blob/main/examples/adjustment_recommendation/adjRec.ipynb). 45 | 46 | **Output File(s):** 47 | * *adjustments.csv* - .csv file containing OCR-optimized image adjustment parameter values for each volume 48 | 49 | ## Optical Character Recognition (OCR) 50 | With the prerequisite files in place ("adjustments.csv", "marginalia_metadata.csv", and "xmljpegmerge.csv"), OCR was performed on each page of each volume to produce a series of output files. OCR output files were saved for each law type (public, private, public-local) and session (e.g. Private Laws of the State of North Carolina, Session 1891 saved as lawsresolutionso1891nort_private laws_data.tsv). 51 | 52 | This step was accomplished using [Tesseract OCR](https://github.com/UB-Mannheim/tesseract/wiki), accessed programmatically via the [pytesseract](https://pypi.org/project/pytesseract/) wrapper. The scripts involved in this stage were [ocr_use.py](https://github.com/UNC-Libraries-data/OnTheBooks/blob/main/code/ocr/ocr_use.py) and the functions contained in [ocr_func.py](https://github.com/UNC-Libraries-data/OnTheBooks/blob/main/code/ocr/ocr_func.py). Detailed documentation for this step can be found [here](https://github.com/UNC-Libraries-data/OnTheBooks/blob/main/examples/ocr/ocr_use.ipynb). 53 | 54 | **Output File(s):** 55 | * *(volume)_adjustments.txt* - stores the image adjustments used to perform OCR on that particular volume. One of these files was created for each physical volume. 56 | * *(volume)_(section).txt* - stores a compiled version of all OCR'd text for a given law type section. One of these files was created for each set of laws ("Public", "Private", etc.) found in each physical volume. 57 | * *(volume)_(section)_data.tsv* - a word-level .tsv file for a given section.
Each row in this file corresponds to an individual token (word) recorded by the OCR process, along with its page coordinates and a confidence value. One of these files was created for each set of laws ("Public", "Private", etc.) in each physical volume. 58 | 59 | ## Section Splitting & Cleaning 60 | After OCR was complete, each volume was 'split' into its constituent chapters and sections, with each section representing an individual law. This was accomplished using regular expression pattern matching on the word-level "(volume)_(section)_data.tsv" files produced in the previous step. Once initial assignments had been made, the corpus underwent a lengthy cleaning process that eliminated most section and chapter assignment errors and created a set of "aggregate" files in which all words were aggregated into their assigned sections (laws). 61 | 62 | This step was accomplished using the seven scripts located [here](https://github.com/UNC-Libraries-data/OnTheBooks/tree/main/code/split_cleanup) in combination with several rounds of manual review. Detailed documentation for this step can be found [here](https://github.com/UNC-Libraries-data/OnTheBooks/blob/main/examples/split_cleanup/split_cleanup.ipynb). 63 | 64 | **Output File(s):** 65 | * *(volume)_(section)_data.csv* - an updated version of the 'raw' output .tsv files created in the OCR step. One of these files was created for each set of laws ("Public", "Private", etc.) found in each physical volume. 66 | * *(volume)_(section)_aggregate_data.csv* - contains all volume text aggregated into sections (laws). One of these files was created for each set of laws ("Public", "Private", etc.) found in each physical volume. 67 | 68 | ## Analysis 69 | The analysis phase of the project involved both supervised and unsupervised learning methods. The purposes of this phase were twofold: 70 | 1. To use automated techniques to help explore and better understand the characteristics and composition of Jim Crow laws 71 | 2.
To provide an efficient means for expanding the collection of Jim Crow laws already identified by experts 72 | 73 | This phase utilized the aggregated versions of the laws compiled during the previous phase: "(volume)_(section)_aggregate_data.csv". 74 | 75 | Latent Dirichlet Allocation, an unsupervised method, was used to build a topic model for the laws. This analysis was conducted by team member Rucha Dalwadi and is detailed in her master’s paper ([Dalwadi 2020](https://doi.org/10.17615/tksc-t217)). 76 | 77 | Following the unsupervised work, Jim Crow laws were identified using "active" supervised classification. A training set was compiled by expert reviewers doing close reading, and a combination of preliminary classification runs and further expert review was used to expand this labeled training set. The resulting expanded training set was used to [perform classification on the entire corpus (script)](https://unc-libraries-data.github.io/OnTheBooks/code/classification/ModelSelection_v2.html). This allowed laws to be labeled as "Jim Crow" or "not Jim Crow" based on a predetermined probability threshold. 78 | 79 | This step was accomplished using [scikit-learn](https://scikit-learn.org/) and [XGBoost](https://xgboost.readthedocs.io/) to build and evaluate models. For text processing, [nltk](https://www.nltk.org/) was used. 80 | 81 | **Output File(s):** 82 | * *jim_crow_list.csv* - contains all laws identified as Jim Crow laws by expert reviewers, analytical models, or both. 83 | * *law_list.csv* - contains all laws in the corpus with all metadata accumulated from previous steps, along with each law's Jim Crow classification value and classification source (experts, models, or both). 84 | 85 | ## XML Generation 86 | Following the analysis phase, the corpus was prepared for dissemination from the [Carolina Digital Repository](https://doi.org/10.17615/5c4g-sd44). Each volume was enriched with metadata as XML.
Metadata files were merged using a unique identifier, then added to the corpus as XML elements and attributes. Python's [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) API was used to generate the XML. A .xsd schema was then created that defines the information provided about each volume in the corpus, such as the volume title, year, and session name. The schema also describes the laws contained in each volume, such as law titles, types, and Jim Crow classifications. 87 | 88 | **Output File(s):** 89 | * *onthebooks.xsd* - the XML schema definition for all XML files in the corpus. 90 | * *(volume).xml* - contains metadata and content for all laws within a given volume, tagged according to the above schema. One of these files was created for each physical volume in the corpus. 91 | --------------------------------------------------------------------------------
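As a rough illustration of the XML generation stage described in workflow.md above, merged metadata rows can be turned into a tree with ElementTree along the following lines. The element and attribute names here are invented for the example and do not reflect the actual onthebooks.xsd schema.

```python
import xml.etree.ElementTree as ET

def build_volume_xml(volume_meta, laws):
    """Build an XML tree for one volume from merged metadata.

    `volume_meta` is a dict of volume-level fields; `laws` is a list of
    dicts with per-law metadata and text. Names are illustrative only.
    """
    root = ET.Element("volume", {
        "title": volume_meta["title"],
        "year": str(volume_meta["year"]),
        "session": volume_meta["session"],
    })
    for law in laws:
        # One <law> element per section (law), carrying its metadata as
        # attributes and its title and text as child elements.
        law_el = ET.SubElement(root, "law", {
            "chapter": str(law["chapter"]),
            "type": law["type"],
            "jimCrow": law["jim_crow"],  # classification value, e.g. "yes"/"no"
        })
        ET.SubElement(law_el, "title").text = law["title"]
        ET.SubElement(law_el, "text").text = law["text"]
    return ET.ElementTree(root)

if __name__ == "__main__":
    tree = build_volume_xml(
        {"title": "Session Laws 1955", "year": 1955, "session": "1955"},
        [{"chapter": 5, "type": "public", "jim_crow": "no",
          "title": "AN ACT TO REPEAL CHAPTER 501 OF THE SESSION LAWS OF 1953",
          "text": "Section 1. ..."}],
    )
    print(ET.tostring(tree.getroot(), encoding="unicode"))
```

Each tree can then be serialized with `tree.write(path, encoding="utf-8", xml_declaration=True)`; note that ElementTree does not validate against an .xsd schema itself, so validation would require an external tool.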