├── .gitignore ├── .gitmodules ├── README.md └── gpt-extract.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | _pycache__ 3 | *.swp 4 | *.db 5 | *.log 6 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "chatgpt-wrapper"] 2 | path = chatgpt-wrapper 3 | url = https://github.com/mmabrouk/chatgpt-wrapper 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # GPT Document Extraction (DEPRECATED) 2 | 3 | *NOTE:* The underlying strategy here (scraping ChatGPT) no longer works. [There's an LLM version of this tool available here.](https://github.com/brandonrobertz/llm-document-extraction) 4 | 5 | This is a proof-of-concept for using ChatGPT to extract structured data from messy text documents like scanned/OCR'd PDFs and difficult forms. 6 | 7 | It works by asking ChatGPT to turn text documents (found in an input JSON file or a text file) into a JSON record that matches a given JSON Schema specification. 8 | 9 | If your input data is a text file where each line is a document, you can use the script like this: 10 | 11 | ``` 12 | ./gpt-extract.py --input-type text infile.txt schema.json output.json 13 | ``` 14 | 15 | This would extract each line in infile, using schema.json and write extracted data to output.json. You can find an example JSON schema down below in the "JSON schema file" section. 16 | 17 | If your input data is JSON, you'll need to tell the script how to find the documents (and, optionally how to find a unique ID for each recod). The only kind of supported JSON is a list of JSON objects. Your JSON input data should look something like this: 18 | 19 | ``` 20 | [{ 21 | "id": 1 22 | "doc": "My text here..." 23 | }, { 24 | "id": 2, 25 | "doc": "Another record..." 26 | }] 27 | ``` 28 | 29 | You can run the script like this: 30 | 31 | ``` 32 | ./gpt-extract.py --input-type json --keydoc doc --keyid id infile.json schema.json output.json 33 | ``` 34 | 35 | Note that the output file (`output.json`), if it exists, needs to be valid JSON (not a blank file) as the script will attempt to load it and continue where the extraction left off. 36 | 37 | ## Setup 38 | 39 | This repo depends on [ChatGPT-wrapper][wrapper-main], which is included as a submodule of this repo. Clone this repo like: 40 | 41 | ``` 42 | git clone --recurse-submodules https://github.com/brandonrobertz/chatgpt-document-extraction 43 | cd chatgpt-document-extraction 44 | ``` 45 | 46 | If you've already cloned the repo you can get and/or update the submodule with this: 47 | 48 | ``` 49 | git submodule update --init --recursive 50 | ``` 51 | 52 | Then install ChatGPT-wrapper and [set up Playwright][playwright-setup]: 53 | 54 | ``` 55 | cd chatgpt-wrapper/ 56 | pip install . 57 | cd .. 58 | playwright install 59 | ``` 60 | 61 | You need to login, so run the following command and log into ChatGPT: 62 | 63 | ``` 64 | chatgpt install 65 | ``` 66 | 67 | ## Extraction 68 | 69 | Once you're set up, you can extract structured data, 70 | 71 | ``` 72 | ./gpt-extract.py --headless --input-type infile.txt schema.json output.json 73 | ``` 74 | 75 | ### Input data spec 76 | 77 | You can provide one of two options: 78 | 79 | 1. text file, with one record per row (`--input-type txt`) 80 | 2. a JSON file with an array of objects (`--input-type json`). You can specify which keys to use with the `--keydoc` and `--keyid` options which tell the script how to find the document text and the record ID. 81 | 82 | ### JSON schema file 83 | 84 | You need to provide a JSON Schema file that will instruct ChatGPT how to transform the input text. Here's an example that I used: 85 | 86 | ``` 87 | { 88 | "type": "object", 89 | "additionalProperties": false, 90 | "properties": { 91 | "name of person this document is from": { 92 | "type": "string" 93 | }, 94 | "name of person this document is written to": { 95 | "type": "string" 96 | }, 97 | "name of person this document is about": { 98 | "type": "string" 99 | }, 100 | "violation": { 101 | "type": "string" 102 | }, 103 | "outcome": { 104 | "type": "string" 105 | }, 106 | "date": { 107 | "type": "string" 108 | }, 109 | "summary": { 110 | "type": "string" 111 | } 112 | } 113 | } 114 | ``` 115 | 116 | It can be helpful to name the fields in descriptive ways that ChatGPT can use to figure out what to extract. 117 | 118 | 119 | [wrapper-main]: https://github.com/mmabrouk/chatgpt-wrapper 120 | "ChatGPT Wrapper - upstream version" 121 | 122 | [playwright-setup]: https://playwright.dev/python/docs/library 123 | "Playwright - Getting Started" 124 | -------------------------------------------------------------------------------- /gpt-extract.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """ 3 | Generic ChatGPT extraction script. Converts any input data to 4 | any output JSON, as specified by a given JSON schema document. 5 | 6 | This is dependent on the ChatGPT wrapper library: 7 | 8 | https://github.com/mmabrouk/chatgpt-wrapper 9 | 10 | Make sure to also run playwright install before running 11 | this extractor script! 12 | """ 13 | import argparse 14 | from datetime import datetime 15 | import json 16 | import os 17 | import re 18 | import sys 19 | import time 20 | 21 | from chatgpt_wrapper import ChatGPT 22 | 23 | 24 | # max chars to use in prompt 25 | DOC_MAX_LENGTH=3000 26 | 27 | 28 | parser = argparse.ArgumentParser(description='Extract structured data from text using ChatGPT.') 29 | parser.add_argument( 30 | '--input-type', 31 | choices=['txt', 'json'], 32 | help='Input file type: txt (one doc per line) or json (list of objects, add document key path using --dockey)' 33 | ) 34 | parser.add_argument( 35 | '--keydoc', 36 | help='If using JSON input type, this is the key of the document' 37 | ) 38 | parser.add_argument( 39 | '--keyid', 40 | help='If using JSON input type, this is the key of the id/page no' 41 | ) 42 | parser.add_argument( 43 | '--headless', 44 | action='store_true', 45 | help='Hide the browser' 46 | ) 47 | parser.add_argument( 48 | '--continue-at', 49 | help='Continue extration at this document index' 50 | ) 51 | parser.add_argument( 52 | '--continue-last', 53 | action='store_true', 54 | help='Continue extration at the last document extracted' 55 | ) 56 | parser.add_argument( 57 | '--browser', 58 | default="firefox", 59 | help='Choose a browser to use. Needs to already be installed with `playwright install`. Defaults to firefox.' 60 | ) 61 | parser.add_argument( 62 | 'infile', 63 | help='Input file' 64 | ) 65 | parser.add_argument( 66 | 'schema_file', 67 | help='Path to JSON Schema file' 68 | ) 69 | parser.add_argument( 70 | 'outfile', 71 | help='Path to output results JSON file' 72 | ) 73 | 74 | 75 | 76 | def clean_document(page_text): 77 | # cleaned = re.sub("[\n]+", "\n", re.sub("[ \t]+", " ", page_text)).strip() 78 | cleaned = re.sub(r"[\t ]+", " ", re.sub(r"[\n]+", "\n", page_text)).strip() 79 | if len(cleaned) < DOC_MAX_LENGTH: 80 | return cleaned 81 | front = cleaned[:DOC_MAX_LENGTH - 500] 82 | end = cleaned[-500:] 83 | return f"{front} {end}" 84 | 85 | 86 | def scrape_via_prompt(chat, page_text, schema): 87 | prompt = f"```{clean_document(page_text)}```\n\nFor the given text, can you provide a JSON representation that strictly follows this schema:\n\n```{schema}```" 88 | 89 | print("Entering prompt", len(prompt), "bytes") 90 | response = None 91 | # increasing this increases the wait time 92 | waited = 0 93 | # use this prompt so we can change it ("can you continue the 94 | # previous..") but keep track of the original prompt 95 | current_prompt = prompt 96 | while True: 97 | response = chat.ask(current_prompt) 98 | 99 | if waited == 0: 100 | print(f"{'='*70}\nPrompt\n{'-'*70}\n{current_prompt}") 101 | print(f"{'='*70}\nResponse\n{'-'*70}\n{response}") 102 | 103 | waited += 1 104 | 105 | if waited > 5: 106 | print("Timed out on this prompt") 107 | break 108 | 109 | if "unusable response produced by chatgpt" in response.lower(): 110 | wait_seconds = 120 * waited 111 | print("Bad response! Waiting longer for", wait_seconds, "seconds") 112 | time.sleep(wait_seconds) 113 | continue 114 | 115 | bad_input = ( 116 | "it is not possible to generate a json representation " 117 | "of the provided text" 118 | ) 119 | 120 | if bad_input in response.lower(): 121 | response = None 122 | print("Bad input! Skipping this text") 123 | continue 124 | 125 | if response.strip() == "HTTP Error 429: Too many requests": 126 | # sleep for one hour 127 | print("Sleeping for one hour due to rate limiting...") 128 | time.sleep(60 * 60) 129 | continue 130 | 131 | if "}" not in response: 132 | # retry the session if it's not completing the JSON 133 | print("Broken JSON response, sleeping then retrying") 134 | time.sleep(20) 135 | continue 136 | 137 | # we have a good response here 138 | break 139 | 140 | return prompt, response 141 | 142 | 143 | def upsert_result(results, result): 144 | pk = result["id"] 145 | for r_ix, r_result in enumerate(results): 146 | if r_result["id"] != pk: 147 | continue 148 | # overwrite 149 | results[r_ix] = result 150 | return 151 | # if we're here we did't update an existing result 152 | results.append(result) 153 | 154 | 155 | def run(documents, schema, outfile, headless=False, 156 | continue_at=None, continue_last=False, browser=None): 157 | print("Starting ChatGPT interface...") 158 | chat = ChatGPT(headless=headless, browser=browser) 159 | time.sleep(5) 160 | 161 | # TODO: Check for login prompt 162 | # TODO: Optionally clear all prev sessions 163 | 164 | results = [] 165 | if os.path.exists(outfile): 166 | with open(outfile, "r") as f: 167 | results = json.load(f) 168 | 169 | already_scraped = set([ 170 | r.get("id") for r in results 171 | ]) 172 | if already_scraped: 173 | print("Already scraped", already_scraped) 174 | 175 | if continue_last: 176 | continue_at = max(list(already_scraped)) + 1 177 | print("Continuing at", continue_at) 178 | 179 | print(len(documents), "documents to scrape") 180 | 181 | # flag so that we only sleep after the first try 182 | first_scrape = True 183 | for p_ix, page_data in enumerate(documents): 184 | pk = page_data["id"] 185 | page_text = page_data["text"] 186 | if not page_text: 187 | print("Blank text for ID:", pk, "Skipping...") 188 | continue 189 | 190 | print("Doc ID:", pk, "Text length:", len(page_text)) 191 | 192 | if continue_at is not None and pk < continue_at: 193 | continue 194 | 195 | if not first_scrape: 196 | print("Sleeping for rate limiting") 197 | time.sleep(60) 198 | first_scrape = False 199 | 200 | prompt, response = scrape_via_prompt(chat, page_text, schema) 201 | first_scrape = False 202 | 203 | if response is None: 204 | print("Skipping page due to blank response") 205 | continue 206 | 207 | data = None 208 | try: 209 | data = json.loads(response.split("```")[1]) 210 | except Exception as e: 211 | print("Bad result on ID", pk) 212 | print("Parse error:", e) 213 | continue 214 | 215 | result = { 216 | "id": pk, 217 | "text": page_text, 218 | "prompt": prompt, 219 | "response": response, 220 | "data": data, 221 | } 222 | upsert_result(results, result) 223 | 224 | print("Saving results to", outfile) 225 | with open(outfile, "w") as f: 226 | f.write(json.dumps(results, indent=2)) 227 | print("ID", pk, "complete") 228 | 229 | 230 | def parse_input_documents(args): 231 | documents = [] 232 | with open(args.infile, "r") as f: 233 | if args.input_type == "txt": 234 | for i, doc in enumerate(f.readlines()): 235 | documents.append({ 236 | "id": i, 237 | "text": doc 238 | }) 239 | elif args.input_type == "json": 240 | with open(args.infile, "r") as f: 241 | input_json = json.load(f) 242 | type_err_msg = "Input JSON must be an array of objects" 243 | assert args.keydoc, "--keydoc required with JSON input type" 244 | # assert args.keyid, "--keyid required with JSON input type" 245 | assert isinstance(input_json, list), type_err_msg 246 | assert isinstance(input_json[0], dict), type_err_msg 247 | assert args.keydoc in input_json[0], f"'{args.keydoc}' not in JSON" 248 | # assert args.keyid in input_json[0], f"'{args.keyid}' not in JSON" 249 | for ix, doc_data in enumerate(input_json): 250 | documents.append({ 251 | "id": doc_data[args.keyid] if args.keyid else ix, 252 | "text": doc_data[args.keydoc] 253 | }) 254 | return documents 255 | 256 | 257 | if __name__ == "__main__": 258 | args = parser.parse_args() 259 | 260 | documents = parse_input_documents(args) 261 | 262 | with open(args.schema_file, "r") as f: 263 | schema = json.load(f) 264 | 265 | 266 | assert not (args.continue_last and args.continue_at), \ 267 | "--continue-at and --continue-last can't be used together" 268 | 269 | run(documents, schema, args.outfile, 270 | headless=args.headless, 271 | continue_at=args.continue_at, 272 | continue_last=args.continue_last, 273 | browser=args.browser, 274 | ) 275 | --------------------------------------------------------------------------------