├── .gitignore
├── .python-version
├── LICENSE
├── README.md
├── Searching with USE.ipynb
├── nyc_docs.jsonl
├── requirements.txt
├── search.py
├── to_annoy.py
├── to_es.py
└── to_sentences.py

/.gitignore:
--------------------------------------------------------------------------------
nyc_docs-sentences15.json

--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------
3.7.2

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright 2020 Quartz

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Searching Bill de Blasio's Emails with the Universal Sentence Encoder

By Jeremy B. Merrill, [Quartz](https://www.qz.com)

As part of Quartz's participation in the [Luanda Leaks investigation](https://qz.com/se/luanda-leaks-isabel-dos-santos-angola/) in partnership with ICIJ, we built a system for searching large, heterogeneous document sets with AI. Read more about the system [here](https://qz.com/1786896/ai-for-investigations-sorting-through-the-luanda-leaks/).

Here's the code demo so you can reproduce our work. For this demo, we'll be searching [a set of a few thousand emails](https://github.com/Quartz/aistudio-doc2vec-for-investigative-journalism/blob/master/2018.05.24_BerlinRosen_Responsive_Records.pdf) between Bill de Blasio and some advisors, informally called the "Agent of the City" documents.

This isn't a full library or off-the-shelf software, but rather a demonstration that you can adapt for yourself.

This workflow scales to huge document sets. The demo documents are small, so indexing takes only a few seconds rather than hours; the Luanda Leaks investigation involved 356 gigabytes of files (and 11 gigabytes of plain text).

## how to
There are a few steps you'll have to take to run the demo.

1. Get a computer with Python 3.7. You probably want a server with a GPU, but it's not strictly necessary; without a GPU, this'll be very slow.
2. Set up Elasticsearch locally.
3. Split the provided `nyc_docs.jsonl` file (which has a text copy of every document in the document set) into sentence-size chunks by running `python to_sentences.py`. We need to split the documents into sentence-length chunks because the Universal Sentence Encoder can only handle blocks of text shorter than 128 words.
4. Index the chunks and the full documents to Elasticsearch by running `python to_es.py`.
5. Embed the chunks into vector space with USE and index those vectors with Annoy by running `python to_annoy.py`.
6. Run the "Searching with USE.ipynb" Jupyter notebook to run your own searches.


--
## about tech choices

*Universal Sentence Encoder* is a TensorFlow model that, without any additional training, embeds sentences from several languages into a shared vector representation. Basically, sentences with similar meanings have vectors that are close together.

*Annoy* is a Python library that makes approximate nearest-neighbor searches over vectors very fast. It lets us index thousands, or millions, of vectors and instantly get back the ones closest to a query vector.

Combined, these tools let us ask the computer, "What sentences mean something similar to this?" and get back high-quality results, even if the similar sentences don't share any keywords. We also index the text of each sentence to Elasticsearch, and each sentence gets a single ID number that is used in both the Annoy and Elasticsearch indexes.
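Here's a minimal sketch of that core idea, separate from the scripts in this repo. Treat it as illustrative: the sentences and the query are invented, but the module URL and the 512-dimension vectors are the same ones the scripts below use.

```python
import tensorflow as tf            # tensorflow==1.14, as in requirements.txt
import tensorflow_hub as hub
import tf_sentencepiece            # registers the ops the multilingual USE module needs
from annoy import AnnoyIndex

module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/1"

# build a TF1-style graph that embeds a batch of strings with USE
g = tf.Graph()
with g.as_default():
    text_input = tf.placeholder(dtype=tf.string, shape=[None])
    embedded_text = hub.Module(module_url)(text_input)
    init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
session = tf.Session(graph=g)
session.run(init_op)

def embed(sentences):
    return session.run(embedded_text, feed_dict={text_input: sentences})

sentences = [
    "The mayor met with his advisors about the housing plan.",
    "De Blasio discussed affordable apartments with consultants.",
    "My favorite kind of bagel is a toasted bagel.",
]

# index the sentence vectors with Annoy (USE vectors have 512 dimensions)
index = AnnoyIndex(512, "angular")
for i, vector in enumerate(embed(sentences)):
    index.add_item(i, vector)
index.build(10)  # 10 trees

# semantically similar sentences come back first, even with no keyword overlap
query_vector = embed(["city hall real estate policy"])[0]
print(index.get_nns_by_vector(query_vector, 2, include_distances=True))
```

In the real pipeline, `to_es.py` and `to_annoy.py` do this indexing at scale, and `search.py` wraps the querying.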
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
tqdm
syntok
BeautifulSoup4
w3lib
tensorflow==1.14.0
tf_sentencepiece==0.1.83
annoy
faiss
tensorflow_hub
pandas
elasticsearch_dsl==6.0.0

--------------------------------------------------------------------------------
/search.py:
--------------------------------------------------------------------------------
# this module tells us _how_ we can run searches. feel free to dig in, but it's just plumbing.

import numpy as np
import unicodedata
import csv
from os.path import basename
from to_sentences import to_short_paragraphs  # re-chunk whole seed documents the same way the index was built


def remove_accents(input_str):
    # strip accents so search-term matching isn't tripped up by diacritics
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii.decode("utf-8")


def index_or_error(list_, item):
    # like list.index, but returns "absent" instead of raising if the item isn't there
    try:
        return list_.index(item)
    except ValueError:
        return "absent"


# Chunk IDs throughout this project look like "<doc_id>c<chunk_number>" (e.g. "1234c7"),
# which is why you'll see so much .split("c") below.
class QzUSESearch():
    def __init__(self, results, search_terms, es, es_index_full_text, seed_docs=[]):
        self.results = list(results)
        self.search_terms = search_terms
        self.clean_search_terms = [remove_accents(term).lower() for term in search_terms]
        self.seed_docs = [i.split("c")[0] for i in seed_docs]
        self.es = es
        self.es_index_full_text = es_index_full_text

    def show(self, show_seed_docs=True):
        for res, dist in self.results:
            terms_in_doc = [search_term in remove_accents(res["_source"]["text"].lower()) for search_term in self.clean_search_terms]
            doc_id = res["_id"].split("c")[0]
            chunk = res["_id"].split("c")[1]
            is_seed_doc = doc_id in self.seed_docs or res["_id"] in self.seed_docs
            if is_seed_doc and not show_seed_docs:
                continue
            print(res["_id"])
            print("http://example.com/{}".format(doc_id))
            print("sanity checks: ({})".format(terms_in_doc))
            print("")

    def sanity_check(self, targets):
        # where do the documents we expected to find actually rank in the results?
        idxes = [index_or_error([chunk_res['_id'].split("c")[0] for chunk_res, dist in self.results], should_match.split("c")[0]) for should_match in targets]
        print("sanity check: these should be low-ish:", idxes, len(self.results))

    def to_csv(self, csv_fn=None):
        search_term_cln = self.search_terms[0].replace(" ", "_")
        csv_fn = csv_fn if csv_fn else f"csvs/{search_term_cln}.csv"
        print(csv_fn)
        with open(csv_fn, 'w') as csvfile:
            fieldnames = [
                "url",
                "is_seed_doc",
                "first few words of match",
                "chunk",
                "distance",
            ] + [f"matches search term (\"{search_term}\")" for search_term in self.search_terms]
            writer = csv.writer(csvfile)
            writer.writerow(fieldnames)

            for chunk_res, dist in self.results:
                doc_id = chunk_res["_id"].split("c")[0]
                chunk_str = chunk_res["_id"].split("c")[1]
                if len(chunk_str) > 0:
                    chunk = int(chunk_str)
                else:
                    chunk = None

                full_text_res = self.es.get(index=self.es_index_full_text, id=doc_id)
                url = "http://example.com/{}/{}".format(doc_id, full_text_res["_source"].get("routing", "") or '')
                clean_text = remove_accents(full_text_res["_source"]["text"].lower())
                terms_in_doc = [search_term in clean_text for search_term in self.clean_search_terms]
                is_seed_doc = doc_id in self.seed_docs or chunk_res["_id"] in self.seed_docs

                if chunk is not None:  # chunk 0 is a valid chunk
                    text = chunk_res["_source"]["text"]
                else:
                    text = full_text_res["_source"]["text"][:100]

                row = [
                    url,
                    is_seed_doc,
                    text,
                    chunk_str,
                    dist,
                ] + terms_in_doc
                writer.writerow(row)


class QzUSESearchFactory():
    def __init__(self, vector_index, idx_name, name_idx, es, es_index_full_text, es_index_chunk, generate_embeddings):
        self.idx_name = idx_name
        self.name_idx = name_idx
        self.vector_index = vector_index
        self.es = es
        self.generate_embeddings = generate_embeddings
        self.es_index_full_text = es_index_full_text
        self.es_index_chunk = es_index_chunk

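    # Seed documents passed to query_by_docs can take three forms (the IDs here are invented examples):
    #   "1234"     -- a whole document: it is re-chunked and its chunk vectors are averaged
    #   "1234c7"   -- a single chunk: its vector is looked up directly in the Annoy index
    #   "1234c3-9" -- a range of chunks: their vectors are averaged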
    def query_by_docs(self, seed_docs, search_terms=[], k=10):
        target_vectors = []
        for doc_idx in seed_docs:
            if "c" not in doc_idx:  # if it's a whole doc
                res = self.es.get(index=self.es_index_full_text, id=doc_idx.split("c")[0])
                chunks = [page for j, page in enumerate(to_short_paragraphs(res["_source"]["text"]))]
                doc_avg_vec = np.mean(np.array(self.generate_embeddings(chunks)), axis=0)
                target_vectors.append(doc_avg_vec)
            elif "-" not in doc_idx.split("c")[1]:  # if it's a single chunk, e.g. "1234c7"
                chunk_vec = self.vector_index.get_item_vector(self.name_idx[doc_idx])
                target_vectors.append(chunk_vec)
            elif "-" in doc_idx.split("c")[1]:  # if it's a range of chunks, e.g. "1234c3-9" (assumed format)
                start, end = [int(i) for i in doc_idx.split("c")[1].split("-")]
                assert start < end
                chunks_vecs = [self.vector_index.get_item_vector(self.name_idx[doc_idx.split("c")[0] + "c" + str(i)]) for i in range(start, end + 1)]
                doc_avg_vec = np.mean(np.array(chunks_vecs), axis=0)
                target_vectors.append(doc_avg_vec)
            else:
                raise ValueError(f"invalid seed doc: {doc_idx}")
        avg_vec = np.average(target_vectors, axis=0)
        docs, distances = self.query_nn_with_vec(avg_vec, k)
        return QzUSESearch(zip(docs, distances), search_terms, self.es, self.es_index_full_text, seed_docs)

    def convert_vector(self, query):
        query = [query]
        vector = self.generate_embeddings(query)[0]
        return vector

    def query_nn_with_vec(self, vector_converted, k=10):
        idxs, distances = self.vector_index.get_nns_by_vector(vector_converted, k, search_k=-1, include_distances=True)
        docs = [self.es.get(index=self.es_index_chunk, id=self.idx_name[str(doc_idx)]) for doc_idx in idxs]
        return docs, distances

    def query_nn(self, query, k=10):
        vector_converted = self.convert_vector(query)
        res = self.query_nn_with_vec(vector_converted, k)
        return res[0]

    def doc_avg(self, doc):
        # average the vectors of every chunk of a document ("<doc>c0", "<doc>c1", ...)
        n = 0
        chunk_vecs = []
        while 1:
            try:
                chunk_vecs.append(self.vector_index.get_item_vector(self.name_idx[f"{doc}c{n}"]))
            except KeyError:
                break
            n += 1
        return np.mean(np.array(chunk_vecs), axis=0)

    def docs_to_avgs(self, doc_ids):
        return [n for n in [self.doc_avg(doc_id) for doc_id in doc_ids] if not np.isnan(n).all()]

    def query_by_text(self, query, k=10):
        results = self.query_nn(query, k)
        for res in results:
            doc_id = res["_id"].split("c")[0]
            url = "https://example.com/{}/{}".format(doc_id, res["_source"].get("routing", "") or '')
            full_text_res = self.es.get(index=self.es_index_full_text, id=doc_id)
            print(full_text_res["_id"])
            print(url)
            print(res["_source"]["text"])
            print("")

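# Illustrative usage (added for this writeup, not part of the original file): the
# "Searching with USE.ipynb" notebook is expected to wire these classes up roughly
# like this. The file and index names below come from to_es.py and to_annoy.py; the
# query string is invented.
#
#   import json
#   from annoy import AnnoyIndex
#   from elasticsearch import Elasticsearch
#   from search import QzUSESearchFactory
#
#   vector_index = AnnoyIndex(512, "angular")
#   vector_index.load("nycdocs-chunk15_annoy.bin")                    # built by to_annoy.py
#   idx_name = json.load(open("nycdocs-use-chunk128_idx_name.json"))  # built by to_es.py
#   name_idx = json.load(open("nycdocs-use-chunk128_name_idx.json"))
#   es = Elasticsearch([{"host": "localhost", "port": 9200}])
#
#   searcher = QzUSESearchFactory(vector_index, idx_name, name_idx, es,
#                                 "nycdocs-use", "nycdocs-use-chunk128",
#                                 generate_embeddings)  # a USE embedding function, as defined in to_annoy.py
#   searcher.query_by_text("donations to the mayor's nonprofit", k=10)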
--------------------------------------------------------------------------------
/to_annoy.py:
--------------------------------------------------------------------------------

import faulthandler; faulthandler.enable()

import tensorflow as tf
import tensorflow_hub as hub
import tf_sentencepiece  # registers the extra ops the multilingual USE module needs
import time
from tqdm import tqdm
import json
from os import environ
import pandas as pd
import numpy as np
from tensorflow.python.framework.errors_impl import ResourceExhaustedError
from tensorflow.python.framework.errors_impl import InvalidArgumentError as TFInvalidArgumentError
import faiss  # listed in requirements; not used in this script
from annoy import AnnoyIndex


batch_size = 256
total_chunks = 37281  # number of chunks in nyc_docs-sentences15.json; get it with `wc -l` (only used for the progress bar)
trees_to_build = 10

# quick check that TensorFlow can see a GPU: try a tiny matmul on /gpu:0
try:
    with tf.device('/gpu:0'):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
        c = tf.matmul(a, b)

    with tf.Session() as sess:
        print(sess.run(c))
except (RuntimeError, TFInvalidArgumentError):
    print("no GPU present, this'll be slow, probably")


use_module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/1"

g = tf.Graph()
with g.as_default():
    text_input = tf.placeholder(dtype=tf.string, shape=[None])
    embed_module = hub.Module(use_module_url)
    embedded_text = embed_module(text_input)
    init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
g.finalize()

session = tf.Session(graph=g)
session.run(init_op)


def generate_embeddings(messages_in):
    if len(messages_in) == 0:
        return np.array([])
    return session.run(embedded_text, feed_dict={text_input: messages_in})


# sanity check
generate_embeddings(["My favorite kind of bagel is a toasted bagel."])


ES_INDEX_FULL_TEXT = "nycdocs"  # not used in this script
ES_INDEX_CHUNK = "nycdocs-chunk15"  # only used here to name the on-disk Annoy index
vector_dims = 512

doc_counter = 0

idx_name_chunk = {}  # not used here; the ID maps are written by to_es.py
name_idx_chunk = {}

def vectorize_batch_chunk(lbatch, vector_index_chunk):
    # embed every chunk in this batch and add the vectors to the Annoy index,
    # numbering them with a running counter
    global doc_counter

    doc_idxs = []
    for i in range(lbatch.shape[0]):
        doc_idxs.append(doc_counter)
        doc_counter += 1

    vectors = generate_embeddings(lbatch["text"])
    if len(vectors.shape) >= 2 and vectors.shape[1] > 0:
        for vec, page_num in zip(vectors, doc_idxs):
            vector_index_chunk.add_item(page_num, vec)

vector_index_chunk = AnnoyIndex(vector_dims, 'angular')
vector_index_chunk.on_disk_build(ES_INDEX_CHUNK + "_annoy.bin")

with tqdm(total=total_chunks) as pbar:
    for j, batch in enumerate(pd.read_json('nyc_docs-sentences15.json', lines=True, chunksize=batch_size)):
        batch["smallenough"] = batch["text"].apply(lambda x: len(x) < 100000)
        batch = batch[batch["smallenough"]]
        try:
            vectorize_batch_chunk(batch, vector_index_chunk)
        except ResourceExhaustedError:
            # the batch didn't fit in GPU memory; retry it in much smaller pieces
            minibatches = np.array_split(batch, batch_size)
            for i, minibatch in enumerate(minibatches):
                try:
                    vectorize_batch_chunk(minibatch, vector_index_chunk)
                except ResourceExhaustedError:
                    continue
        pbar.update(len(batch))

vector_index_chunk.build(trees_to_build)  # 10 trees

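# Note, added for clarity: the Annoy item numbers assigned above are just a running counter
# over nyc_docs-sentences15.json, read in the same order and with the same "smallenough"
# length filter that to_es.py uses when it writes its idx_name/name_idx maps. If you change
# the filtering, ordering, or input file in one script, make the same change in the other,
# or the vectors will be matched to the wrong chunk IDs.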
--------------------------------------------------------------------------------
/to_es.py:
--------------------------------------------------------------------------------

import faulthandler; faulthandler.enable()

import time
from tqdm import tqdm
import json
from os import environ
import pandas as pd
import numpy as np
from elasticsearch import Elasticsearch, helpers
from elasticsearch_dsl import Search

ES_INDEX_FULL_TEXT = "nycdocs-use"
FIRST = False
ES_INDEX_CHUNK = "nycdocs-use-chunk128"
vector_dims = 512
batch_size = 512
total_chunks = 37281  # get this with `wc -l nyc_docs-sentences15.json` (only used for the progress bar)
total_docs = 4251

## Put Elasticsearch credentials here
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

if not es.ping():
    raise ValueError("Connection to Elasticsearch failed")
else:
    print('Connection to Elasticsearch OK')

doc_counter = 0

# maps between Annoy item numbers and chunk IDs, written to disk below for the notebook to use
idx_name_chunk = {}
name_idx_chunk = {}

def es_index_batch_chunk(lbatch):
    global doc_counter

    records = []
    for body in lbatch.to_dict(orient='records'):
        id_ = body["_id"] + "c" + str(body["chonk"])  # "chonk" is the chunk number written by to_sentences.py
        idx_name_chunk[doc_counter] = id_
        name_idx_chunk[id_] = doc_counter
        body["page"] = doc_counter
        body["_index"] = ES_INDEX_CHUNK
        del body["smallenough"]
        body["doc_id"] = body["_id"]
        body["_id"] = id_
        records.append(body)
        doc_counter += 1
    res = helpers.bulk(es, records, chunk_size=len(records), request_timeout=200)


with tqdm(total=total_chunks) as pbar:
    for j, batch in enumerate(pd.read_json('nyc_docs-sentences15.json', lines=True, chunksize=batch_size)):
        batch["smallenough"] = batch["text"].apply(lambda x: len(x) < 100000)
        batch = batch[batch["smallenough"]]
        es_index_batch_chunk(batch)
        pbar.update(len(batch))

with open(ES_INDEX_CHUNK + "_idx_name.json", 'w') as f:
    f.write(json.dumps(idx_name_chunk))
with open(ES_INDEX_CHUNK + "_name_idx.json", 'w') as f:
    f.write(json.dumps(name_idx_chunk))

# also put the full documents into ES
with open('nyc_docs.jsonl', 'r') as reader:
    for i, line_json in tqdm(enumerate(reader), total=total_docs):
        line = json.loads(line_json)
        body = {
            "text": line["_source"]["content"][:1000000],
            "routing": line.get("_routing", None),
        }
        es.index(index=ES_INDEX_FULL_TEXT, id=line["_id"], body=body)


--------------------------------------------------------------------------------
/to_sentences.py:
--------------------------------------------------------------------------------
# encoding: utf-8
from tqdm import tqdm
import json
from bs4 import BeautifulSoup
from functools import reduce
from w3lib.html import remove_tags

import syntok.segmenter as segmenter

total_docs = 4251  # get this with `wc -l nyc_docs.jsonl` (only used for the progress bar)

total_short_paragraphs = 0
MAX_SENT_LEN = 100  # in tokens

def sentenceify(text):
    # split text into sentences (keeping the original spacing), dropping sentences of
    # MAX_SENT_LEN or more tokens and sentences that contain no letters at all
    return [sl for l in [[''.join([t.spacing + t.value for t in s]) for s in p if len(s) < MAX_SENT_LEN] for p in segmenter.analyze(text)] for sl in l if any(map(lambda x: x.isalpha(), sl))]


def clean_html(html):
    if "<" in html and ">" in html:
        try:
            soup = BeautifulSoup(html, features="html.parser")
            plist = soup.find('plist')
            if plist:
                plist.decompose()  # remove plists because ugh
            text = soup.getText()
        except Exception:
            text = remove_tags(html)
        return '. '.join(text.split("\r\n\r\n\r\n"))
    else:
        return '. '.join(html.split("\r\n\r\n\r\n"))
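# A quick, invented example of what these helpers return:
#   sentenceify("Hello there. How are you?")  ->  roughly ['Hello there.', ' How are you?']
# (one string per sentence, original spacing kept), while clean_html() strips markup, drops
# <plist> blobs, and joins text separated by runs of blank \r\n lines with ". " so the
# segmenter treats the pieces as separate sentences.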

# If a sentence is short, group it with the other short sentences around it, so you get
# runs of consecutive short sentences broken up by one-element groups of longer sentences.
def short_sentence_grouper_bean_factory(target_sentence_length):  # target length in chars
    def group_short_sentences(list_of_lists_of_sentences, next_sentence):
        if not list_of_lists_of_sentences:
            return [[next_sentence]]
        if len(next_sentence) < target_sentence_length:
            list_of_lists_of_sentences[-1].append(next_sentence)
        else:
            # a long sentence gets a group of its own, plus a fresh empty group
            # for whatever short sentences follow it
            list_of_lists_of_sentences.append([next_sentence])
            list_of_lists_of_sentences.append([])
        return list_of_lists_of_sentences
    return group_short_sentences


def overlap(document_tokens, target_length):
    """ pseudo-paginate a document by creating lists of tokens of length `target_length` that overlap by 50%

    returns a list of `target_length`-length lists of tokens, overlapping by 50%, representing all the tokens in the document
    """

    overlapped = []
    cursor = 0
    while len(' '.join(document_tokens[cursor:]).split()) >= target_length:
        overlapped.append(document_tokens[cursor:cursor+target_length])
        cursor += target_length // 2
    return overlapped


def sentences_to_short_paragraphs(group_of_sentences, target_length, min_shingle_length=10):
    """ output overlapping groups of shorter sentences

    group_of_sentences = list of strings, where each string is a sentence
    target_length = max length IN WORDS of the output chunks
    min_shingle_length = don't emit chunks that differ only by the inclusion of a sentence shorter than this
    """
    if len(group_of_sentences) == 1:
        return [' '.join(group_of_sentences[0].split())]
    sentences_as_words = [sent.split() for sent in group_of_sentences]
    # drop sentences where half or more of the "words" are single characters (usually OCR junk)
    sentences_as_words = [sentence for sentence in sentences_as_words if [len(word) for word in sentence].count(1) < (len(sentence) * 0.5)]
    paragraphs = []
    for i, sentence in enumerate(sentences_as_words[:-1]):
        if i > 0 and len(sentence) < min_shingle_length and len(sentences_as_words[i-1]) < min_shingle_length and i % 2 == 0:
            continue  # skip really short sentences if the previous one is also really short (but not so often that we lose anything)
        buff = list(sentence)  # just making a copy
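        # greedily extend this chunk with the sentences that follow it, stopping before
        # the chunk would grow past target_length words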
        for subsequent_sentence in sentences_as_words[i+1:]:
            if len(buff) + len(subsequent_sentence) <= target_length:
                buff += subsequent_sentence
            else:
                break
        paragraphs.append(buff)
    return [' '.join(graf) for graf in paragraphs]


def to_short_paragraphs(text, paragraph_len=15, min_sentence_len=8):  # paragraph_len in words, min_sentence_len in chars
    sentences = sentenceify( clean_html(text) )
    grouped_sentences = reduce(short_sentence_grouper_bean_factory(150), sentences, [])
    return [sl for l in [sentences_to_short_paragraphs(group, paragraph_len) for group in grouped_sentences if len(group) >= 2 or (len(group) > 0 and len(group[0]) > min_sentence_len)] for sl in l if sl]

paragraph_target_length = 15

if __name__ == "__main__":
    with open(f"nyc_docs-sentences{paragraph_target_length}.json", 'w') as writer:
        with open('nyc_docs.jsonl', 'r') as reader:
            for i, line_json in tqdm(enumerate(reader), total=total_docs):
                line = json.loads(line_json)
                text = line["_source"]["content"][:1000000]
                for j, page in enumerate(to_short_paragraphs(text, paragraph_target_length)):
                    total_short_paragraphs += 1
                    writer.write(json.dumps({
                        "text": page,
                        "_id": line["_id"],
                        "chonk": j,
                        # "routing": line.get("_routing", None),
                        # "path": line["_source"]["path"]
                    }) + "\n")
    print(f"total paragraphs: {total_short_paragraphs}")

--------------------------------------------------------------------------------
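A quick way to sanity-check the chunker from a Python shell (illustrative only; the sample text is invented and isn't in the document set):

    from to_sentences import to_short_paragraphs

    sample = ("The mayor met with outside advisors on Tuesday. They discussed the housing plan at length. "
              "Several donors were mentioned by name. Nobody took notes.")
    for i, chunk in enumerate(to_short_paragraphs(sample, paragraph_len=15)):
        print(i, chunk)

Each chunk that prints is at most about 15 words long, and consecutive chunks overlap; that is the same shape of output that gets written to `nyc_docs-sentences15.json` and then embedded by `to_annoy.py`.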