├── .gitignore
├── .gitmodules
├── README.md
└── gpt-extract.py


/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | _pycache__
3 | *.swp
4 | *.db
5 | *.log
6 | 


--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
1 | [submodule "chatgpt-wrapper"]
2 | 	path = chatgpt-wrapper
3 |         url = https://github.com/mmabrouk/chatgpt-wrapper
4 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # GPT Document Extraction (DEPRECATED)
  2 | 
  3 | *NOTE:* The underlying strategy here (scraping ChatGPT) no longer works. [There's an LLM version of this tool available here.](https://github.com/brandonrobertz/llm-document-extraction)
  4 | 
  5 | This is a proof-of-concept for using ChatGPT to extract structured data from messy text documents like scanned/OCR'd PDFs and difficult forms.
  6 | 
  7 | It works by asking ChatGPT to turn text documents (found in an input JSON file or a text file) into a JSON record that matches a given JSON Schema specification.
  8 | 
  9 | If your input data is a text file where each line is a document, you can use the script like this:
 10 | 
 11 | ```
 12 | ./gpt-extract.py --input-type text infile.txt schema.json output.json
 13 | ```
 14 | 
 15 | This would extract each line in infile, using schema.json and write extracted data to output.json. You can find an example JSON schema down below in the "JSON schema file" section.
 16 | 
 17 | If your input data is JSON, you'll need to tell the script how to find the documents (and, optionally how to find a unique ID for each recod). The only kind of supported JSON is a list of JSON objects. Your JSON input data should look something like this:
 18 | 
 19 | ```
 20 | [{
 21 |   "id": 1
 22 |   "doc": "My text here..."
 23 | }, {
 24 |   "id": 2,
 25 |   "doc": "Another record..."
 26 | }]
 27 | ```
 28 | 
 29 | You can run the script like this:
 30 | 
 31 | ```
 32 | ./gpt-extract.py --input-type json --keydoc doc --keyid id infile.json schema.json output.json
 33 | ```
 34 | 
 35 | Note that the output file (`output.json`), if it exists, needs to be valid JSON (not a blank file) as the script will attempt to load it and continue where the extraction left off.
 36 | 
 37 | ## Setup
 38 | 
 39 | This repo depends on [ChatGPT-wrapper][wrapper-main], which is included as a submodule of this repo. Clone this repo like:
 40 | 
 41 | ```
 42 | git clone --recurse-submodules https://github.com/brandonrobertz/chatgpt-document-extraction
 43 | cd chatgpt-document-extraction
 44 | ```
 45 | 
 46 | If you've already cloned the repo you can get and/or update the submodule with this:
 47 | 
 48 | ```
 49 | git submodule update --init --recursive
 50 | ```
 51 | 
 52 | Then install ChatGPT-wrapper and [set up Playwright][playwright-setup]:
 53 | 
 54 | ```
 55 | cd chatgpt-wrapper/
 56 | pip install .
 57 | cd ..
 58 | playwright install
 59 | ```
 60 | 
 61 | You need to login, so run the following command and log into ChatGPT:
 62 | 
 63 | ```
 64 | chatgpt install
 65 | ```
 66 | 
 67 | ## Extraction
 68 | 
 69 | Once you're set up, you can extract structured data, 
 70 | 
 71 | ```
 72 | ./gpt-extract.py --headless --input-type infile.txt schema.json output.json
 73 | ```
 74 | 
 75 | ### Input data spec
 76 | 
 77 | You can provide one of two options:
 78 | 
 79 | 1. text file, with one record per row (`--input-type txt`)
 80 | 2. a JSON file with an array of objects (`--input-type json`). You can specify which keys to use with the `--keydoc` and `--keyid` options which tell the script how to find the document text and the record ID.
 81 | 
 82 | ### JSON schema file
 83 | 
 84 | You need to provide a JSON Schema file that will instruct ChatGPT how to transform the input text. Here's an example that I used:
 85 | 
 86 | ```
 87 | {
 88 |   "type": "object",
 89 |   "additionalProperties": false,
 90 |   "properties": {
 91 |     "name of person this document is from": {
 92 |       "type": "string"
 93 |     },
 94 |     "name of person this document is written to": {
 95 |       "type": "string"
 96 |     },
 97 |     "name of person this document is about": {
 98 |       "type": "string"
 99 |     },
100 |     "violation": {
101 |       "type": "string"
102 |     },
103 |     "outcome": {
104 |       "type": "string"
105 |     },
106 |     "date": {
107 |       "type": "string"
108 |     },
109 |     "summary": {
110 |       "type": "string"
111 |     }
112 |   }
113 | }
114 | ```
115 | 
116 | It can be helpful to name the fields in descriptive ways that ChatGPT can use to figure out what to extract.
117 | 
118 | 
119 | [wrapper-main]: https://github.com/mmabrouk/chatgpt-wrapper
120 |     "ChatGPT Wrapper - upstream version"
121 | 
122 | [playwright-setup]: https://playwright.dev/python/docs/library
123 |     "Playwright - Getting Started"
124 | 


--------------------------------------------------------------------------------
/gpt-extract.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | """
  3 | Generic ChatGPT extraction script. Converts any input data to
  4 | any output JSON, as specified by a given JSON schema document.
  5 | 
  6 | This is dependent on the ChatGPT wrapper library:
  7 | 
  8 | https://github.com/mmabrouk/chatgpt-wrapper
  9 | 
 10 | Make sure to also run playwright install before running
 11 | this extractor script!
 12 | """
 13 | import argparse
 14 | from datetime import datetime
 15 | import json
 16 | import os
 17 | import re
 18 | import sys
 19 | import time
 20 | 
 21 | from chatgpt_wrapper import ChatGPT
 22 | 
 23 | 
 24 | # max chars to use in prompt
 25 | DOC_MAX_LENGTH=3000
 26 | 
 27 | 
 28 | parser = argparse.ArgumentParser(description='Extract structured data from text using ChatGPT.')
 29 | parser.add_argument(
 30 |     '--input-type', 
 31 |     choices=['txt', 'json'],
 32 |     help='Input file type: txt (one doc per line) or json (list of objects, add document key path using --dockey)'
 33 | )
 34 | parser.add_argument(
 35 |     '--keydoc', 
 36 |     help='If using JSON input type, this is the key of the document'
 37 | )
 38 | parser.add_argument(
 39 |     '--keyid', 
 40 |     help='If using JSON input type, this is the key of the id/page no'
 41 | )
 42 | parser.add_argument(
 43 |     '--headless', 
 44 |     action='store_true',
 45 |     help='Hide the browser'
 46 | )
 47 | parser.add_argument(
 48 |     '--continue-at',
 49 |     help='Continue extration at this document index'
 50 | )
 51 | parser.add_argument(
 52 |     '--continue-last',
 53 |     action='store_true',
 54 |     help='Continue extration at the last document extracted'
 55 | )
 56 | parser.add_argument(
 57 |     '--browser',
 58 |     default="firefox",
 59 |     help='Choose a browser to use. Needs to already be installed with `playwright install`. Defaults to firefox.'
 60 | )
 61 | parser.add_argument(
 62 |     'infile',
 63 |     help='Input file'
 64 | )
 65 | parser.add_argument(
 66 |     'schema_file',
 67 |     help='Path to JSON Schema file'
 68 | )
 69 | parser.add_argument(
 70 |     'outfile',
 71 |     help='Path to output results JSON file'
 72 | )
 73 | 
 74 | 
 75 | 
 76 | def clean_document(page_text):
 77 |     # cleaned = re.sub("[\n]+", "\n", re.sub("[ \t]+", " ", page_text)).strip()
 78 |     cleaned = re.sub(r"[\t ]+", " ", re.sub(r"[\n]+", "\n", page_text)).strip()
 79 |     if len(cleaned) < DOC_MAX_LENGTH:
 80 |         return cleaned
 81 |     front = cleaned[:DOC_MAX_LENGTH - 500]
 82 |     end = cleaned[-500:]
 83 |     return f"{front} {end}"
 84 | 
 85 | 
 86 | def scrape_via_prompt(chat, page_text, schema):
 87 |     prompt = f"```{clean_document(page_text)}```\n\nFor the given text, can you provide a JSON representation that strictly follows this schema:\n\n```{schema}```"
 88 | 
 89 |     print("Entering prompt", len(prompt), "bytes")
 90 |     response = None
 91 |     # increasing this increases the wait time
 92 |     waited = 0
 93 |     # use this prompt so we can change it ("can you continue the
 94 |     # previous..") but keep track of the original prompt
 95 |     current_prompt = prompt
 96 |     while True:
 97 |         response = chat.ask(current_prompt)
 98 | 
 99 |         if waited == 0:
100 |             print(f"{'='*70}\nPrompt\n{'-'*70}\n{current_prompt}")
101 |             print(f"{'='*70}\nResponse\n{'-'*70}\n{response}")
102 | 
103 |         waited += 1
104 | 
105 |         if waited > 5:
106 |             print("Timed out on this prompt")
107 |             break
108 | 
109 |         if "unusable response produced by chatgpt" in response.lower():
110 |             wait_seconds = 120 * waited
111 |             print("Bad response! Waiting longer for", wait_seconds, "seconds")
112 |             time.sleep(wait_seconds)
113 |             continue
114 | 
115 |         bad_input = (
116 |             "it is not possible to generate a json representation "
117 |             "of the provided text"
118 |         )
119 | 
120 |         if bad_input in response.lower():
121 |             response = None
122 |             print("Bad input! Skipping this text")
123 |             continue
124 | 
125 |         if response.strip() == "HTTP Error 429: Too many requests":
126 |             # sleep for one hour
127 |             print("Sleeping for one hour due to rate limiting...")
128 |             time.sleep(60 * 60)
129 |             continue
130 | 
131 |         if "}" not in response:
132 |             # retry the session if it's not completing the JSON
133 |             print("Broken JSON response, sleeping then retrying")
134 |             time.sleep(20)
135 |             continue
136 | 
137 |         # we have a good response here
138 |         break
139 | 
140 |     return prompt, response
141 | 
142 | 
143 | def upsert_result(results, result):
144 |     pk = result["id"]
145 |     for r_ix, r_result in enumerate(results):
146 |         if r_result["id"] != pk:
147 |             continue
148 |         # overwrite
149 |         results[r_ix] = result
150 |         return
151 |     # if we're here we did't update an existing result
152 |     results.append(result)
153 | 
154 | 
155 | def run(documents, schema, outfile, headless=False,
156 |         continue_at=None, continue_last=False, browser=None):
157 |     print("Starting ChatGPT interface...")
158 |     chat = ChatGPT(headless=headless, browser=browser)
159 |     time.sleep(5)
160 | 
161 |     # TODO: Check for login prompt
162 |     # TODO: Optionally clear all prev sessions
163 | 
164 |     results = []
165 |     if os.path.exists(outfile):
166 |         with open(outfile, "r") as f:
167 |             results = json.load(f)
168 | 
169 |     already_scraped = set([
170 |         r.get("id") for r in results
171 |     ])
172 |     if already_scraped:
173 |         print("Already scraped", already_scraped)
174 | 
175 |     if continue_last:
176 |         continue_at = max(list(already_scraped)) + 1
177 |         print("Continuing at", continue_at)
178 | 
179 |     print(len(documents), "documents to scrape")
180 | 
181 |     # flag so that we only sleep after the first try
182 |     first_scrape = True
183 |     for p_ix, page_data in enumerate(documents):
184 |         pk = page_data["id"]
185 |         page_text = page_data["text"]
186 |         if not page_text:
187 |             print("Blank text for ID:", pk, "Skipping...")
188 |             continue
189 | 
190 |         print("Doc ID:", pk, "Text length:", len(page_text))
191 | 
192 |         if continue_at is not None and pk < continue_at:
193 |             continue
194 | 
195 |         if not first_scrape:
196 |             print("Sleeping for rate limiting")
197 |             time.sleep(60)
198 |             first_scrape = False
199 | 
200 |         prompt, response = scrape_via_prompt(chat, page_text, schema)
201 |         first_scrape = False
202 | 
203 |         if response is None:
204 |             print("Skipping page due to blank response")
205 |             continue
206 | 
207 |         data = None
208 |         try:
209 |             data = json.loads(response.split("```")[1])
210 |         except Exception as e:
211 |             print("Bad result on ID", pk)
212 |             print("Parse error:", e)
213 |             continue
214 | 
215 |         result = {
216 |             "id": pk,
217 |             "text": page_text,
218 |             "prompt": prompt,
219 |             "response": response,
220 |             "data": data,
221 |         }
222 |         upsert_result(results, result)
223 | 
224 |         print("Saving results to", outfile)
225 |         with open(outfile, "w") as f:
226 |             f.write(json.dumps(results, indent=2))
227 |         print("ID", pk, "complete")
228 | 
229 | 
230 | def parse_input_documents(args):
231 |     documents = []
232 |     with open(args.infile, "r") as f:
233 |         if args.input_type == "txt":
234 |             for i, doc in enumerate(f.readlines()):
235 |                 documents.append({
236 |                     "id": i, 
237 |                     "text": doc
238 |                 })
239 |         elif args.input_type == "json":
240 |             with open(args.infile, "r") as f:
241 |                 input_json = json.load(f)
242 |             type_err_msg = "Input JSON must be an array of objects"
243 |             assert args.keydoc, "--keydoc required with JSON input type"
244 |             # assert args.keyid, "--keyid required with JSON input type"
245 |             assert isinstance(input_json, list), type_err_msg
246 |             assert isinstance(input_json[0], dict), type_err_msg
247 |             assert args.keydoc in input_json[0], f"'{args.keydoc}' not in JSON"
248 |             # assert args.keyid in input_json[0], f"'{args.keyid}' not in JSON"
249 |             for ix, doc_data in enumerate(input_json):
250 |                 documents.append({
251 |                     "id": doc_data[args.keyid] if args.keyid else ix,
252 |                     "text": doc_data[args.keydoc]
253 |                 })
254 |     return documents
255 | 
256 | 
257 | if __name__ == "__main__":
258 |     args = parser.parse_args()
259 | 
260 |     documents = parse_input_documents(args)
261 | 
262 |     with open(args.schema_file, "r") as f:
263 |         schema = json.load(f)
264 | 
265 | 
266 |     assert not (args.continue_last and args.continue_at), \
267 |         "--continue-at and --continue-last can't be used together"
268 | 
269 |     run(documents, schema, args.outfile,
270 |         headless=args.headless,
271 |         continue_at=args.continue_at,
272 |         continue_last=args.continue_last,
273 |         browser=args.browser,
274 |     )
275 | 


--------------------------------------------------------------------------------