├── .gitignore
├── .python-version
├── LICENSE
├── README.md
├── Searching with USE.ipynb
├── nyc_docs.jsonl
├── requirements.txt
├── search.py
├── to_annoy.py
├── to_es.py
└── to_sentences.py

/.gitignore:
--------------------------------------------------------------------------------
nyc_docs-sentences15.json

--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------
3.7.2

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright 2020 Quartz

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Searching Bill de Blasio's Emails with the Universal Sentence Encoder

By Jeremy B. Merrill, [Quartz](https://www.qz.com)

As part of Quartz's participation in the [Luanda Leaks investigation](https://qz.com/se/luanda-leaks-isabel-dos-santos-angola/) in partnership with ICIJ, we built a system for searching large, heterogeneous document sets with AI. Read more about the system [here](https://qz.com/1786896/ai-for-investigations-sorting-through-the-luanda-leaks/).

Here's the code demo so you can reproduce our work. For this demo, we'll be searching [a set of a few thousand emails](https://github.com/Quartz/aistudio-doc2vec-for-investigative-journalism/blob/master/2018.05.24_BerlinRosen_Responsive_Records.pdf) between Bill de Blasio and some advisors, informally called the "Agent of the City" documents.

This isn't a full library or off-the-shelf software, but rather a demonstration that you can adapt for yourself.

This workflow scales to huge document sets. The demo documents are small, so indexing takes only a few seconds rather than hours; the Luanda Leaks investigation involved 356 gigabytes of files (and 11 gigabytes of plain text).

## how to
There are a few steps you'll have to take to run the demo.

1. Get a computer with Python 3.7. You probably want a server with a GPU, but it's not strictly necessary; without a GPU, this'll be very slow.
2. Set up Elasticsearch locally.
3. Split the provided `nyc_docs.jsonl` file (which has a text copy of every document in the document set) into sentence-size chunks by running `python to_sentences.py`. We need to split the documents into sentence-length chunks because the Universal Sentence Encoder can only handle blocks of text shorter than 128 words.
4. Index the chunks and the full documents to Elasticsearch by running `python to_es.py`.
5. Embed the chunks into vector space with USE and index those vectors with Annoy by running `python to_annoy.py`.
6. Run the "Searching with USE.ipynb" Jupyter notebook to run your own searches.


--
## about tech choices

*Universal Sentence Encoder* is a TensorFlow model that, without any additional training, embeds sentences from several languages into a shared vector representation. Basically, sentences with similar meanings have vectors that are close together.

*Annoy* is a Python library that makes approximate nearest-neighbor searches over vectors very fast. It lets us index thousands, or millions, of vectors and instantly get back the ones closest to a query vector.

Combined, these tools let us ask the computer, "What sentences mean something similar to this?" and get back high-quality results, even if the similar sentences don't share any keywords. We also index the text of each sentence to Elasticsearch, and each sentence gets a single ID number that is used in both the Annoy and Elasticsearch indexes.
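Here's a minimal sketch of that core idea, separate from the scripts in this repo. Treat it as illustrative: the sentences and the query are invented, but the module URL and the 512-dimension vectors are the same ones the scripts below use.

```python
import tensorflow as tf            # tensorflow==1.14, as in requirements.txt
import tensorflow_hub as hub
import tf_sentencepiece            # registers the ops the multilingual USE module needs
from annoy import AnnoyIndex

module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/1"

# build a TF1-style graph that embeds a batch of strings with USE
g = tf.Graph()
with g.as_default():
    text_input = tf.placeholder(dtype=tf.string, shape=[None])
    embedded_text = hub.Module(module_url)(text_input)
    init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
session = tf.Session(graph=g)
session.run(init_op)

def embed(sentences):
    return session.run(embedded_text, feed_dict={text_input: sentences})

sentences = [
    "The mayor met with his advisors about the housing plan.",
    "De Blasio discussed affordable apartments with consultants.",
    "My favorite kind of bagel is a toasted bagel.",
]

# index the sentence vectors with Annoy (USE vectors have 512 dimensions)
index = AnnoyIndex(512, "angular")
for i, vector in enumerate(embed(sentences)):
    index.add_item(i, vector)
index.build(10)  # 10 trees

# semantically similar sentences come back first, even with no keyword overlap
query_vector = embed(["city hall real estate policy"])[0]
print(index.get_nns_by_vector(query_vector, 2, include_distances=True))
```

In the real pipeline, `to_es.py` and `to_annoy.py` do this indexing at scale, and `search.py` wraps the querying.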
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
tqdm
syntok
BeautifulSoup4
w3lib
tensorflow==1.14.0
tf_sentencepiece==0.1.83
annoy
faiss
tensorflow_hub
pandas
elasticsearch_dsl==6.0.0

--------------------------------------------------------------------------------
/search.py:
--------------------------------------------------------------------------------
# this module tells us _how_ we can run searches. feel free to dig in, but it's just plumbing.

import numpy as np
import unicodedata
import csv
from os.path import basename
from to_sentences import to_short_paragraphs  # re-chunk whole seed documents the same way the index was built


def remove_accents(input_str):
    # strip accents so search-term matching isn't tripped up by diacritics
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii.decode("utf-8")


def index_or_error(list_, item):
    # like list.index, but returns "absent" instead of raising if the item isn't there
    try:
        return list_.index(item)
    except ValueError:
        return "absent"


# Chunk IDs throughout this project look like "<doc_id>c<chunk_number>" (e.g. "1234c7"),
# which is why you'll see so much .split("c") below.
class QzUSESearch():
    def __init__(self, results, search_terms, es, es_index_full_text, seed_docs=[]):
        self.results = list(results)
        self.search_terms = search_terms
        self.clean_search_terms = [remove_accents(term).lower() for term in search_terms]
        self.seed_docs = [i.split("c")[0] for i in seed_docs]
        self.es = es
        self.es_index_full_text = es_index_full_text

    def show(self, show_seed_docs=True):
        for res, dist in self.results:
            terms_in_doc = [search_term in remove_accents(res["_source"]["text"].lower()) for search_term in self.clean_search_terms]
            doc_id = res["_id"].split("c")[0]
            chunk = res["_id"].split("c")[1]
            is_seed_doc = doc_id in self.seed_docs or res["_id"] in self.seed_docs
            if is_seed_doc and not show_seed_docs:
                continue
            print(res["_id"])
            print("http://example.com/{}".format(doc_id))
            print("sanity checks: ({})".format(terms_in_doc))
            print("")

    def sanity_check(self, targets):
        # where do the documents we expected to find actually rank in the results?
        idxes = [index_or_error([chunk_res['_id'].split("c")[0] for chunk_res, dist in self.results], should_match.split("c")[0]) for should_match in targets]
        print("sanity check: these should be low-ish:", idxes, len(self.results))

    def to_csv(self, csv_fn=None):
        search_term_cln = self.search_terms[0].replace(" ", "_")
        csv_fn = csv_fn if csv_fn else f"csvs/{search_term_cln}.csv"
        print(csv_fn)
        with open(csv_fn, 'w') as csvfile:
            fieldnames = [
                "url",
                "is_seed_doc",
                "first few words of match",
                "chunk",
                "distance",
            ] + [f"matches search term (\"{search_term}\")" for search_term in self.search_terms]
            writer = csv.writer(csvfile)
            writer.writerow(fieldnames)

            for chunk_res, dist in self.results:
                doc_id = chunk_res["_id"].split("c")[0]
                chunk_str = chunk_res["_id"].split("c")[1]
                if len(chunk_str) > 0:
                    chunk = int(chunk_str)
                else:
                    chunk = None

                full_text_res = self.es.get(index=self.es_index_full_text, id=doc_id)
                url = "http://example.com/{}/{}".format(doc_id, full_text_res["_source"].get("routing", "") or '')
                clean_text = remove_accents(full_text_res["_source"]["text"].lower())
                terms_in_doc = [search_term in clean_text for search_term in self.clean_search_terms]
                is_seed_doc = doc_id in self.seed_docs or chunk_res["_id"] in self.seed_docs

                if chunk is not None:  # chunk 0 is a valid chunk
                    text = chunk_res["_source"]["text"]
                else:
                    text = full_text_res["_source"]["text"][:100]

                row = [
                    url,
                    is_seed_doc,
                    text,
                    chunk_str,
                    dist,
                ] + terms_in_doc
                writer.writerow(row)


class QzUSESearchFactory():
    def __init__(self, vector_index, idx_name, name_idx, es, es_index_full_text, es_index_chunk, generate_embeddings):
        self.idx_name = idx_name
        self.name_idx = name_idx
        self.vector_index = vector_index
        self.es = es
        self.generate_embeddings = generate_embeddings
        self.es_index_full_text = es_index_full_text
        self.es_index_chunk = es_index_chunk

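    # Seed documents passed to query_by_docs can take three forms (the IDs here are invented examples):
    #   "1234"     -- a whole document: it is re-chunked and its chunk vectors are averaged
    #   "1234c7"   -- a single chunk: its vector is looked up directly in the Annoy index
    #   "1234c3-9" -- a range of chunks: their vectors are averaged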
    def query_by_docs(self, seed_docs, search_terms=[], k=10):
        target_vectors = []
        for doc_idx in seed_docs:
            if "c" not in doc_idx:  # if it's a whole doc
                res = self.es.get(index=self.es_index_full_text, id=doc_idx.split("c")[0])
                chunks = [page for j, page in enumerate(to_short_paragraphs(res["_source"]["text"]))]
                doc_avg_vec = np.mean(np.array(self.generate_embeddings(chunks)), axis=0)
                target_vectors.append(doc_avg_vec)
            elif "-" not in doc_idx.split("c")[1]:  # if it's a single chunk, e.g. "1234c7"
                chunk_vec = self.vector_index.get_item_vector(self.name_idx[doc_idx])
                target_vectors.append(chunk_vec)
            elif "-" in doc_idx.split("c")[1]:  # if it's a range of chunks, e.g. "1234c3-9" (assumed format)
                start, end = [int(i) for i in doc_idx.split("c")[1].split("-")]
                assert start < end
                chunks_vecs = [self.vector_index.get_item_vector(self.name_idx[doc_idx.split("c")[0] + "c" + str(i)]) for i in range(start, end + 1)]
                doc_avg_vec = np.mean(np.array(chunks_vecs), axis=0)
                target_vectors.append(doc_avg_vec)
            else:
                raise ValueError(f"invalid seed doc: {doc_idx}")
        avg_vec = np.average(target_vectors, axis=0)
        docs, distances = self.query_nn_with_vec(avg_vec, k)
        return QzUSESearch(zip(docs, distances), search_terms, self.es, self.es_index_full_text, seed_docs)

    def convert_vector(self, query):
        query = [query]
        vector = self.generate_embeddings(query)[0]
        return vector

    def query_nn_with_vec(self, vector_converted, k=10):
        idxs, distances = self.vector_index.get_nns_by_vector(vector_converted, k, search_k=-1, include_distances=True)
        docs = [self.es.get(index=self.es_index_chunk, id=self.idx_name[str(doc_idx)]) for doc_idx in idxs]
        return docs, distances

    def query_nn(self, query, k=10):
        vector_converted = self.convert_vector(query)
        res = self.query_nn_with_vec(vector_converted, k)
        return res[0]

    def doc_avg(self, doc):
        # average the vectors of every chunk of a document ("<doc>c0", "<doc>c1", ...)
        n = 0
        chunk_vecs = []
        while 1:
            try:
                chunk_vecs.append(self.vector_index.get_item_vector(self.name_idx[f"{doc}c{n}"]))
            except KeyError:
                break
            n += 1
        return np.mean(np.array(chunk_vecs), axis=0)

    def docs_to_avgs(self, doc_ids):
        return [n for n in [self.doc_avg(doc_id) for doc_id in doc_ids] if not np.isnan(n).all()]

    def query_by_text(self, query, k=10):
        results = self.query_nn(query, k)
        for res in results:
            doc_id = res["_id"].split("c")[0]
            url = "https://example.com/{}/{}".format(doc_id, res["_source"].get("routing", "") or '')
            full_text_res = self.es.get(index=self.es_index_full_text, id=doc_id)
            print(full_text_res["_id"])
            print(url)
            print(res["_source"]["text"])
            print("")

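# Illustrative usage (added for this writeup, not part of the original file): the
# "Searching with USE.ipynb" notebook is expected to wire these classes up roughly
# like this. The file and index names below come from to_es.py and to_annoy.py; the
# query string is invented.
#
#   import json
#   from annoy import AnnoyIndex
#   from elasticsearch import Elasticsearch
#   from search import QzUSESearchFactory
#
#   vector_index = AnnoyIndex(512, "angular")
#   vector_index.load("nycdocs-chunk15_annoy.bin")                    # built by to_annoy.py
#   idx_name = json.load(open("nycdocs-use-chunk128_idx_name.json"))  # built by to_es.py
#   name_idx = json.load(open("nycdocs-use-chunk128_name_idx.json"))
#   es = Elasticsearch([{"host": "localhost", "port": 9200}])
#
#   searcher = QzUSESearchFactory(vector_index, idx_name, name_idx, es,
#                                 "nycdocs-use", "nycdocs-use-chunk128",
#                                 generate_embeddings)  # a USE embedding function, as defined in to_annoy.py
#   searcher.query_by_text("donations to the mayor's nonprofit", k=10)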
--------------------------------------------------------------------------------
/to_annoy.py:
--------------------------------------------------------------------------------

import faulthandler; faulthandler.enable()

import tensorflow as tf
import tensorflow_hub as hub
import tf_sentencepiece  # registers the extra ops the multilingual USE module needs
import time
from tqdm import tqdm
import json
from os import environ
import pandas as pd
import numpy as np
from tensorflow.python.framework.errors_impl import ResourceExhaustedError
from tensorflow.python.framework.errors_impl import InvalidArgumentError as TFInvalidArgumentError
import faiss  # listed in requirements; not used in this script
from annoy import AnnoyIndex


batch_size = 256
total_chunks = 37281  # number of chunks in nyc_docs-sentences15.json; get it with `wc -l` (only used for the progress bar)
trees_to_build = 10

# quick check that TensorFlow can see a GPU: try a tiny matmul on /gpu:0
try:
    with tf.device('/gpu:0'):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
        c = tf.matmul(a, b)

    with tf.Session() as sess:
        print(sess.run(c))
except (RuntimeError, TFInvalidArgumentError):
    print("no GPU present, this'll be slow, probably")


use_module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/1"

g = tf.Graph()
with g.as_default():
    text_input = tf.placeholder(dtype=tf.string, shape=[None])
    embed_module = hub.Module(use_module_url)
    embedded_text = embed_module(text_input)
    init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
g.finalize()

session = tf.Session(graph=g)
session.run(init_op)


def generate_embeddings(messages_in):
    if len(messages_in) == 0:
        return np.array([])
    return session.run(embedded_text, feed_dict={text_input: messages_in})


# sanity check
generate_embeddings(["My favorite kind of bagel is a toasted bagel."])


ES_INDEX_FULL_TEXT = "nycdocs"  # not used in this script
ES_INDEX_CHUNK = "nycdocs-chunk15"  # only used here to name the on-disk Annoy index
vector_dims = 512

doc_counter = 0

idx_name_chunk = {}  # not used here; the ID maps are written by to_es.py
name_idx_chunk = {}

def vectorize_batch_chunk(lbatch, vector_index_chunk):
    # embed every chunk in this batch and add the vectors to the Annoy index,
    # numbering them with a running counter
    global doc_counter

    doc_idxs = []
    for i in range(lbatch.shape[0]):
        doc_idxs.append(doc_counter)
        doc_counter += 1

    vectors = generate_embeddings(lbatch["text"])
    if len(vectors.shape) >= 2 and vectors.shape[1] > 0:
        for vec, page_num in zip(vectors, doc_idxs):
            vector_index_chunk.add_item(page_num, vec)

vector_index_chunk = AnnoyIndex(vector_dims, 'angular')
vector_index_chunk.on_disk_build(ES_INDEX_CHUNK + "_annoy.bin")

with tqdm(total=total_chunks) as pbar:
    for j, batch in enumerate(pd.read_json('nyc_docs-sentences15.json', lines=True, chunksize=batch_size)):
        batch["smallenough"] = batch["text"].apply(lambda x: len(x) < 100000)
        batch = batch[batch["smallenough"]]
        try:
            vectorize_batch_chunk(batch, vector_index_chunk)
        except ResourceExhaustedError:
            # the batch didn't fit in GPU memory; retry it in much smaller pieces
            minibatches = np.array_split(batch, batch_size)
            for i, minibatch in enumerate(minibatches):
                try:
                    vectorize_batch_chunk(minibatch, vector_index_chunk)
                except ResourceExhaustedError:
                    continue
        pbar.update(len(batch))

vector_index_chunk.build(trees_to_build)  # 10 trees

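# Note, added for clarity: the Annoy item numbers assigned above are just a running counter
# over nyc_docs-sentences15.json, read in the same order and with the same "smallenough"
# length filter that to_es.py uses when it writes its idx_name/name_idx maps. If you change
# the filtering, ordering, or input file in one script, make the same change in the other,
# or the vectors will be matched to the wrong chunk IDs.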
--------------------------------------------------------------------------------
/to_es.py:
--------------------------------------------------------------------------------

import faulthandler; faulthandler.enable()

import time
from tqdm import tqdm
import json
from os import environ
import pandas as pd
import numpy as np
from elasticsearch import Elasticsearch, helpers
from elasticsearch_dsl import Search

ES_INDEX_FULL_TEXT = "nycdocs-use"
FIRST = False
ES_INDEX_CHUNK = "nycdocs-use-chunk128"
vector_dims = 512
batch_size = 512
total_chunks = 37281  # get this with `wc -l nyc_docs-sentences15.json` (only used for the progress bar)
total_docs = 4251

## Put Elasticsearch credentials here
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

if not es.ping():
    raise ValueError("Connection to Elasticsearch failed")
else:
    print('Connection to Elasticsearch OK')

doc_counter = 0

# maps between Annoy item numbers and chunk IDs, written to disk below for the notebook to use
idx_name_chunk = {}
name_idx_chunk = {}

def es_index_batch_chunk(lbatch):
    global doc_counter

    records = []
    for body in lbatch.to_dict(orient='records'):
        id_ = body["_id"] + "c" + str(body["chonk"])  # "chonk" is the chunk number written by to_sentences.py
        idx_name_chunk[doc_counter] = id_
        name_idx_chunk[id_] = doc_counter
        body["page"] = doc_counter
        body["_index"] = ES_INDEX_CHUNK
        del body["smallenough"]
        body["doc_id"] = body["_id"]
        body["_id"] = id_
        records.append(body)
        doc_counter += 1
    res = helpers.bulk(es, records, chunk_size=len(records), request_timeout=200)


with tqdm(total=total_chunks) as pbar:
    for j, batch in enumerate(pd.read_json('nyc_docs-sentences15.json', lines=True, chunksize=batch_size)):
        batch["smallenough"] = batch["text"].apply(lambda x: len(x) < 100000)
        batch = batch[batch["smallenough"]]
        es_index_batch_chunk(batch)
        pbar.update(len(batch))

with open(ES_INDEX_CHUNK + "_idx_name.json", 'w') as f:
    f.write(json.dumps(idx_name_chunk))
with open(ES_INDEX_CHUNK + "_name_idx.json", 'w') as f:
    f.write(json.dumps(name_idx_chunk))

# also put the full documents into ES
with open('nyc_docs.jsonl', 'r') as reader:
    for i, line_json in tqdm(enumerate(reader), total=total_docs):
        line = json.loads(line_json)
        body = {
            "text": line["_source"]["content"][:1000000],
            "routing": line.get("_routing", None),
        }
        es.index(index=ES_INDEX_FULL_TEXT, id=line["_id"], body=body)


--------------------------------------------------------------------------------
/to_sentences.py:
--------------------------------------------------------------------------------
# encoding: utf-8
from tqdm import tqdm
import json
from bs4 import BeautifulSoup
from functools import reduce
from w3lib.html import remove_tags

import syntok.segmenter as segmenter

total_docs = 4251  # get this with `wc -l nyc_docs.jsonl` (only used for the progress bar)

total_short_paragraphs = 0
MAX_SENT_LEN = 100  # in tokens

def sentenceify(text):
    # split text into sentences (keeping the original spacing), dropping sentences of
    # MAX_SENT_LEN or more tokens and sentences that contain no letters at all
    return [sl for l in [[''.join([t.spacing + t.value for t in s]) for s in p if len(s) < MAX_SENT_LEN] for p in segmenter.analyze(text)] for sl in l if any(map(lambda x: x.isalpha(), sl))]


def clean_html(html):
    if "<" in html and ">" in html:
        try:
            soup = BeautifulSoup(html, features="html.parser")
            plist = soup.find('plist')
            if plist:
                plist.decompose()  # remove plists because ugh
            text = soup.getText()
        except Exception:
            text = remove_tags(html)
        return '. '.join(text.split("\r\n\r\n\r\n"))
    else:
        return '. '.join(html.split("\r\n\r\n\r\n"))
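# A quick, invented example of what these helpers return:
#   sentenceify("Hello there. How are you?")  ->  roughly ['Hello there.', ' How are you?']
# (one string per sentence, original spacing kept), while clean_html() strips markup, drops
# <plist> blobs, and joins text separated by runs of blank \r\n lines with ". " so the
# segmenter treats the pieces as separate sentences.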

# If a sentence is short, group it with the other short sentences around it, so you get
# runs of consecutive short sentences broken up by one-element groups of longer sentences.
def short_sentence_grouper_bean_factory(target_sentence_length):  # target length in chars
    def group_short_sentences(list_of_lists_of_sentences, next_sentence):
        if not list_of_lists_of_sentences:
            return [[next_sentence]]
        if len(next_sentence) < target_sentence_length:
            list_of_lists_of_sentences[-1].append(next_sentence)
        else:
            # a long sentence gets a group of its own, plus a fresh empty group
            # for whatever short sentences follow it
            list_of_lists_of_sentences.append([next_sentence])
            list_of_lists_of_sentences.append([])
        return list_of_lists_of_sentences
    return group_short_sentences


def overlap(document_tokens, target_length):
    """ pseudo-paginate a document by creating lists of tokens of length `target_length` that overlap by 50%

    returns a list of `target_length`-length lists of tokens, overlapping by 50%, representing all the tokens in the document
    """

    overlapped = []
    cursor = 0
    while len(' '.join(document_tokens[cursor:]).split()) >= target_length:
        overlapped.append(document_tokens[cursor:cursor+target_length])
        cursor += target_length // 2
    return overlapped


def sentences_to_short_paragraphs(group_of_sentences, target_length, min_shingle_length=10):
    """ output overlapping groups of shorter sentences

    group_of_sentences = list of strings, where each string is a sentence
    target_length = max length IN WORDS of the output chunks
    min_shingle_length = don't emit chunks that differ only by the inclusion of a sentence shorter than this
    """
    if len(group_of_sentences) == 1:
        return [' '.join(group_of_sentences[0].split())]
    sentences_as_words = [sent.split() for sent in group_of_sentences]
    # drop sentences where half or more of the "words" are single characters (usually OCR junk)
    sentences_as_words = [sentence for sentence in sentences_as_words if [len(word) for word in sentence].count(1) < (len(sentence) * 0.5)]
    paragraphs = []
    for i, sentence in enumerate(sentences_as_words[:-1]):
        if i > 0 and len(sentence) < min_shingle_length and len(sentences_as_words[i-1]) < min_shingle_length and i % 2 == 0:
            continue  # skip really short sentences if the previous one is also really short (but not so often that we lose anything)
        buff = list(sentence)  # just making a copy
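        # greedily extend this chunk with the sentences that follow it, stopping before
        # the chunk would grow past target_length words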
        for subsequent_sentence in sentences_as_words[i+1:]:
            if len(buff) + len(subsequent_sentence) <= target_length:
                buff += subsequent_sentence
            else:
                break
        paragraphs.append(buff)
    return [' '.join(graf) for graf in paragraphs]


def to_short_paragraphs(text, paragraph_len=15, min_sentence_len=8):  # paragraph_len in words, min_sentence_len in chars
    sentences = sentenceify( clean_html(text) )
    grouped_sentences = reduce(short_sentence_grouper_bean_factory(150), sentences, [])
    return [sl for l in [sentences_to_short_paragraphs(group, paragraph_len) for group in grouped_sentences if len(group) >= 2 or (len(group) > 0 and len(group[0]) > min_sentence_len)] for sl in l if sl]

paragraph_target_length = 15

if __name__ == "__main__":
    with open(f"nyc_docs-sentences{paragraph_target_length}.json", 'w') as writer:
        with open('nyc_docs.jsonl', 'r') as reader:
            for i, line_json in tqdm(enumerate(reader), total=total_docs):
                line = json.loads(line_json)
                text = line["_source"]["content"][:1000000]
                for j, page in enumerate(to_short_paragraphs(text, paragraph_target_length)):
                    total_short_paragraphs += 1
                    writer.write(json.dumps({
                        "text": page,
                        "_id": line["_id"],
                        "chonk": j,
                        # "routing": line.get("_routing", None),
                        # "path": line["_source"]["path"]
                    }) + "\n")
    print(f"total paragraphs: {total_short_paragraphs}")

--------------------------------------------------------------------------------
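A quick way to sanity-check the chunker from a Python shell (illustrative only; the sample text is invented and isn't in the document set):

    from to_sentences import to_short_paragraphs

    sample = ("The mayor met with outside advisors on Tuesday. They discussed the housing plan at length. "
              "Several donors were mentioned by name. Nobody took notes.")
    for i, chunk in enumerate(to_short_paragraphs(sample, paragraph_len=15)):
        print(i, chunk)

Each chunk that prints is at most about 15 words long, and consecutive chunks overlap; that is the same shape of output that gets written to `nyc_docs-sentences15.json` and then embedded by `to_annoy.py`.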