├── .gitignore
├── README.md
├── chunker.py
├── download.py
├── embed-tei.py
├── experimental
│   ├── batchsize.py
│   └── embed.py
├── features.py
├── fetch.py
├── filter.py
├── lancer.py
├── notebooks
│   ├── features.ipynb
│   ├── perfile.ipynb
│   ├── small_sample.ipynb
│   ├── tokenizers.ipynb
│   └── validate.ipynb
├── remove.py
├── summary.py
├── todataset.py
├── top10map.py
├── top10reduce.py
├── torched.py
├── upload.py
└── volume.py

/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
2 | *.parquet
3 | venv
4 | .DS_Store
5 | *.arrow
6 | data
7 | *.parquet
8 | *.npy
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # latent-data-modal
2 | 
3 | This repository is a set of scripts used to process and embed large datasets using on-demand infrastructure via [Modal](https://modal.com).
4 | 
5 | The first resulting dataset published is [FineWeb-edu 10BT Sample embedded with nomic-text-v1.5](https://huggingface.co/datasets/enjalot/fineweb-edu-sample-10BT-chunked-500-nomic-text-v1.5).
6 | 
7 | All of these scripts have been developed as part of my learning process to scale up my capacity for embedding large datasets.
8 | As such they aren't immediately generalizable, but they can be treated as a reference implementation. A lot of it is adapted from the [Embedding Wikipedia](https://modal.com/blog/embedding-wikipedia) tutorial.
9 | 
10 | I am hoping to improve this process and use it to scale up to the 100BT sample next. If I can get a compute sponsor I'll then take it to the entire 1.4 trillion token dataset.
11 | 
12 | 
13 | ## Process
14 | 
15 | ### [download.py](download.py)
16 | To start with, we need to download the HF dataset to a volume in Modal. This is relatively straightforward and easy to change to a different dataset.
17 | 
18 | ### [chunker.py](chunker.py)
19 | I wanted to pre-chunk my dataset since tokenizing is relatively CPU intensive, and my initial experiments with the tutorial code were bottlenecked by the chunking process. I also wanted to use actual token counts and analyze the impact of chunking on the dataset.
20 | 
21 | I found that the 9.6 million documents in the 10BT sample turned into ~25 million chunks with 10.5 billion tokens due to the 10% overlap I chose. There is an issue in the chunking code right now, which I will fix soon, where chunks of <= 50 tokens are created even though they represent pure overlap and aren't needed.
22 | 
23 | I based everything on the files in the dataset, so the 10BT sample was 99 arrow files, which allowed me to take advantage of Modal's automatic container scaling. Each file is processed by its own container, which dramatically sped up the process.
24 | 
25 | The chunking process took ~40 minutes using 100 containers and cost $5.
26 | 
27 | ### [embed-tei.py](embed-tei.py)
28 | This script uses [Text Embeddings Inference](https://huggingface.co/docs/text-embeddings-inference/en/index) (TEI) like the Wikipedia tutorial, but it loads the pre-chunked dataset and creates batches that attempt to fill the batch token limit, so we can pack many more small chunks into a single batch and speed things up (see the sketch below).
29 | 
30 | I believe I'm not quite properly utilizing TEI because I only got ~60% GPU utilization and was only using 10GB of memory on the A10G GPUs, which have 24GB available. So there is probably a way to speed this up even more. That said, it only cost ~$50 to embed the entire dataset. It did take ~12 hours because I didn't always have my full allocation of 10 GPUs available.
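
Roughly, the packing works like the sketch below. This is a simplified, illustrative version of what `batch_loader` in `embed-tei.py` does (the real code also sorts chunks by token count, carries row indices, and prepends the tokenized `clustering: ` prefix); the cost of a batch is estimated as the longest chunk's token count times the number of chunks, which approximates the padded batch size:

```python
# Minimal sketch of token-budget batch packing; the defaults mirror the
# CLIENT_BATCH_TOKEN_LIMIT and MAX_CLIENT_BATCH_SIZE constants in embed-tei.py,
# but this helper itself is illustrative, not the exact production code.
def pack_batches(chunks, token_limit=768 * 512, max_batch_size=2 * 4096):
    """chunks: list of (text, token_count) pairs, ideally sorted by token_count."""
    batches, texts, counts = [], [], []
    for text, count in chunks:
        # Estimated padded size if this chunk is added: longest sequence * batch length.
        padded = max(counts + [count]) * (len(counts) + 1)
        if padded <= token_limit and len(texts) < max_batch_size:
            texts.append(text)
            counts.append(count)
        else:
            if texts:
                batches.append(texts)
            texts, counts = [text], [count]
    if texts:
        batches.append(texts)
    return batches
```

Sorting by token count first keeps chunks of similar length together, so less of each batch's budget is spent on padding, which is presumably why the script sorts by `chunk_token_count` before packing.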
31 | 
32 | ### [summary.py](summary.py)
33 | I found it useful to quickly calculate summary statistics using the same parallel process of loading each file in its own container and performing some basic pandas calculations.
34 | 
35 | ### [fetch.py](fetch.py)
36 | I made a quick utility to download a single file to inspect locally, which was used in the [notebooks/validate.ipynb](notebooks/validate.ipynb) notebook to confirm that the embedding process was working as expected.
37 | 
38 | 
39 | ## Notebooks
40 | I'm including several notebooks that I developed in the process of learning this in case they are helpful to others.
41 | 
42 | ### [small_sample.ipynb](notebooks/small_sample.ipynb)
43 | The first thing I did was download some very small samples of the dataset and explore them with [Latent Scope](https://github.com/enjalot/latent-scope) to familiarize myself with the data and validate the idea of embedding the dataset.
44 | 
45 | ### [perfile.ipynb](notebooks/perfile.ipynb)
46 | After I struggled with the structure of the Wikipedia tutorial, I realized I could leverage the CPU parallelism of Modal to process each file in its own container. This notebook was me working out the chunking logic on a single file that I could then parallelize in the `chunker.py` script.
47 | 
48 | ### [validate.ipynb](notebooks/validate.ipynb)
49 | This notebook is me taking a look at a single file that was processed and then trying to understand why I was seeing such small chunks. It led me to realize the mistake I made of keeping around <50 token chunks (which I still need to fix in the chunker.py script...).
50 | 
51 | ## Experimental
52 | On the way to developing this I was trying to understand how to choose batch sizes and token limits. There are two scripts here:
53 | 
54 | ### [batchsize.py](experimental/batchsize.py)
55 | This script uses crude measurement techniques to see how much memory gets filled by a batch of tokens. I'm not confident in it anymore because I was able to fit a lot more tokens into the batches I submitted to `embed-tei.py` than I predicted, using an A10G instead of an H100.
56 | 
57 | ### [embed.py](experimental/embed.py)
58 | This script uses the HuggingFace transformers directly (instead of TEI) so I could have a little more control over how I was embedding. It's the same kind of code I use in Latent Scope for locally embedding smaller datasets, so it allowed me to better understand the scaling process.
59 | The problem is that it's just much slower than TEI.
60 | 
--------------------------------------------------------------------------------
/chunker.py:
--------------------------------------------------------------------------------
1 | from modal import App, Image, Volume
2 | 
3 | NUM_CPU=4
4 | MAX_TOKENS = 500
5 | # MAX_TOKENS = 120
6 | OVERLAP = 0.1 # 10% overlap when chunking
7 | BATCH_SIZE = 200 # number of rows to process per thread at once
8 | 
9 | # We first set out configuration variables for our script.
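# Note on the values above: with MAX_TOKENS = 500 and OVERLAP = 0.1, chunk_row()
# below uses an overlap of int(500 * 0.1) = 50 tokens, i.e. a stride of 450 tokens
# per chunk. A trailing 50-token chunk that is pure overlap of the previous chunk
# can still be emitted; that is the small-chunk issue mentioned in the README.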
10 | DATASET_DIR = "/data" 11 | 12 | # https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu 13 | # VOLUME = "embedding-fineweb-edu" 14 | # DATASET_SAVE ="fineweb-edu-sample-10BT" 15 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-{MAX_TOKENS}" 16 | # TEXT_KEY = "text" 17 | # KEEP_KEYS = ["id", "url", "score", "dump"] 18 | # files = [f"data-{i:05d}-of-00099.arrow" for i in range(99)] 19 | 20 | # VOLUME = "embedding-fineweb-edu" 21 | # DATASET_SAVE ="fineweb-edu-sample-100BT" 22 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-100BT-chunked-{MAX_TOKENS}" 23 | # KEEP_KEYS = ["id", "url", "score", "dump"] 24 | 25 | 26 | # https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2 27 | # VOLUME = "datasets" 28 | # DATASET_SAVE ="RedPajama-Data-V2-sample-10B" 29 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-{MAX_TOKENS}" 30 | # TEXT_KEY = "raw_content" 31 | # KEEP_KEYS = ["doc_id", "meta"] 32 | # files = [f"data-{i:05d}-of-00150.arrow" for i in range(150)] 33 | 34 | # https://huggingface.co/datasets/monology/pile-uncopyrighted 35 | # VOLUME = "datasets" 36 | # DATASET_SAVE ="pile-uncopyrighted" 37 | # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-{MAX_TOKENS}" 38 | # TEXT_KEY = "text" 39 | # KEEP_KEYS = ["meta"] 40 | # files = [f"data-{i:05d}-of-01987.arrow" for i in range(200)] 41 | 42 | #https://huggingface.co/datasets/wikimedia/wikipedia/viewer/20231101.en 43 | # VOLUME = "datasets" 44 | # DATASET_SAVE ="wikipedia-en" 45 | # DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-{MAX_TOKENS}" 46 | # TEXT_KEY = "text" 47 | # KEEP_KEYS = ["id", "url", "title"] 48 | # files = [f"data-{i:05d}-of-00041.arrow" for i in range(41)] 49 | 50 | VOLUME = "datasets" 51 | DATASET_SAVE ="medrag-pubmed" 52 | DATASET_SAVE_CHUNKED = f"medrag-pubmed-{MAX_TOKENS}" 53 | TEXT_KEY = "content" 54 | KEEP_KEYS = ["id", "title", "PMID"] 55 | files = [f"data-{i:05d}-of-00138.arrow" for i in range(138)] 56 | 57 | 58 | 59 | 60 | # MODEL_ID = "nomic-ai/nomic-embed-text-v1.5" 61 | 62 | # We define our Modal Resources that we'll need 63 | volume = Volume.from_name(VOLUME, create_if_missing=True) 64 | image = Image.debian_slim(python_version="3.9").pip_install( 65 | "datasets==2.16.1", "apache_beam==2.53.0", "transformers", "pandas", "tqdm" 66 | ) 67 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 68 | 69 | def chunk_row(row, tokenizer): 70 | # print("ROW", row) 71 | text = row[TEXT_KEY] 72 | chunks = [] 73 | 74 | # TODO: don't save an empty chunk 75 | 76 | tokens = tokenizer.encode(text) 77 | token_count = len(tokens) 78 | if token_count > MAX_TOKENS: 79 | overlap = int(MAX_TOKENS * OVERLAP) 80 | start_index = 0 81 | ci = 0 82 | while start_index < len(tokens): 83 | end_index = min(start_index + MAX_TOKENS, len(tokens)) 84 | chunk = tokens[start_index:end_index] 85 | if len(chunk) < overlap: 86 | break 87 | chunks.append({ 88 | "chunk_index": ci, 89 | "chunk_text": tokenizer.decode(chunk), 90 | "chunk_tokens": chunk, 91 | "chunk_token_count": len(chunk), 92 | **{key: row[key] for key in KEEP_KEYS} 93 | }) 94 | start_index += MAX_TOKENS - overlap 95 | ci += 1 96 | else: 97 | chunks.append({ 98 | "chunk_index": 0, 99 | "chunk_text": text, 100 | "chunk_tokens": tokens, 101 | "chunk_token_count": token_count, 102 | **{key: row[key] for key in KEEP_KEYS} 103 | }) 104 | 105 | return chunks 106 | 107 | 108 | @app.function(cpu=NUM_CPU, volumes={DATASET_DIR: volume}, timeout=3000) 109 | def process_dataset(file): 110 | import time 111 | from concurrent.futures 
import ThreadPoolExecutor, as_completed 112 | from tqdm import tqdm 113 | import pandas as pd 114 | import transformers 115 | transformers.logging.set_verbosity_error() 116 | from transformers import AutoTokenizer 117 | from datasets import load_from_disk, load_dataset 118 | 119 | tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=MAX_TOKENS) 120 | 121 | start = time.perf_counter() 122 | # Load the dataset as a Hugging Face dataset 123 | print(f"Loading dataset from {DATASET_DIR}/{DATASET_SAVE}/train/{file}") 124 | dataset = load_dataset("arrow", data_files=f"{DATASET_DIR}/{DATASET_SAVE}/train/{file}") 125 | df = pd.DataFrame(dataset['train']) 126 | print("dataset", len(df)) 127 | print(f"Dataset loaded in {time.perf_counter()-start:.2f} seconds for {file}") 128 | 129 | chunks_list = [] 130 | with ThreadPoolExecutor(max_workers=NUM_CPU) as executor: 131 | pbar = tqdm(total=len(df), desc=f"Processing Rows for {file}") 132 | 133 | # this gets called inside each thread 134 | def process_batch(batch): 135 | batch_chunks = [] 136 | for row in batch: 137 | row_chunks = chunk_row(row, tokenizer) 138 | pbar.update(1) 139 | batch_chunks.extend(row_chunks) 140 | return batch_chunks 141 | 142 | print(f"making batches for {file}") 143 | batches = [df.iloc[i:i + BATCH_SIZE].to_dict(orient="records") for i in range(0, len(df), BATCH_SIZE)] 144 | print(f"made batches for {file}") 145 | print(f"setting up futures for {file}") 146 | futures = [executor.submit(process_batch, batch) for batch in batches] 147 | print(f"in the future for {file}") 148 | for future in as_completed(futures): 149 | chunks_list.extend(future.result()) 150 | pbar.close() 151 | 152 | chunked_df = pd.DataFrame(chunks_list) 153 | file_name = file.split(".")[0] 154 | import os 155 | output_dir = f"{DATASET_DIR}/{DATASET_SAVE_CHUNKED}/train" 156 | if not os.path.exists(output_dir): 157 | os.makedirs(output_dir) 158 | print(f"saving to {output_dir}/{file_name}.parquet") 159 | chunked_df.to_parquet(f"{output_dir}/{file_name}.parquet") 160 | print(f"done with {file}, {len(chunks_list)} chunks") 161 | volume.commit() 162 | return f"All done with {file}", len(chunks_list) 163 | 164 | 165 | @app.local_entrypoint() 166 | def main(): 167 | # download_dataset.remote() 168 | # from huggingface_hub import HfFileSystem 169 | # hffs = HfFileSystem() 170 | # files = hffs.ls("datasets/HuggingFaceFW/fineweb-edu/sample/10BT", detail=False) 171 | 172 | # files = [f"data-{i:05d}-of-00989.arrow" for i in range(989)] 173 | # files = [f"data-{i:05d}-of-00011.arrow" for i in range(11)] 174 | 175 | # process_dataset.remote(file, max_tokens=MAX_TOKENS, num_cpu=NUM_CPU) 176 | for resp in process_dataset.map(files, order_outputs=False, return_exceptions=True): 177 | if isinstance(resp, Exception): 178 | print(f"Exception: {resp}") 179 | continue 180 | print(resp) 181 | 182 | 183 | -------------------------------------------------------------------------------- /download.py: -------------------------------------------------------------------------------- 1 | """ 2 | Download a dataset from HuggingFace to a modal volume 3 | s""" 4 | from modal import App, Image, Volume, Secret 5 | 6 | # We first set out configuration variables for our script. 
7 | VOLUME = "datasets" 8 | DATASET_DIR = "/data" 9 | 10 | HF_CACHE_DIR = f"{DATASET_DIR}/cache" 11 | 12 | # DATASET_NAME = "HuggingFaceFW/fineweb-edu" 13 | # SAMPLE = "100BT" 14 | # DATASET_FILES = f"sample/{SAMPLE}/*.parquet" 15 | # DATASET_SAVE =f"fineweb-edu-sample-{SAMPLE}" 16 | # VOLUME = "embedding-fineweb-edu" 17 | 18 | 19 | # DATASET_NAME = "togethercomputer/RedPajama-Data-V2" 20 | # DATASET_SAVE = "RedPajama-Data-V2-sample-10B" 21 | # DATASET_SAMPLE = "sample-10B" 22 | # DATASET_FILES = None 23 | 24 | # DATASET_NAME = "monology/pile-uncopyrighted" 25 | # DATASET_SAVE = "pile-uncopyrighted" 26 | # DATASET_SAMPLE = None 27 | # DATASET_FILES = None 28 | 29 | # DATASET_NAME = "PleIAs/common_corpus" 30 | # DATASET_SAVE = "common_corpus" 31 | # DATASET_SAMPLE = None 32 | # DATASET_FILES = None 33 | 34 | # DATASET_NAME = "bigcode/the-stack-dedup" 35 | # DATASET_SAVE = "the-stack-dedup" 36 | # DATASET_FILES = None 37 | 38 | # DATASET_NAME = "wikimedia/wikipedia" 39 | # DATASET_SAVE = "wikipedia-en" 40 | # DATASET_SAMPLE = "20231101.en" 41 | # DATASET_FILES = None 42 | 43 | DATASET_NAME = "MedRAG/pubmed" 44 | DATASET_SAVE = "medrag-pubmed" 45 | DATASET_SAMPLE = None 46 | DATASET_FILES = None 47 | 48 | 49 | 50 | 51 | # We define our Modal Resources that we'll need 52 | volume = Volume.from_name(VOLUME, create_if_missing=True) 53 | image = Image.debian_slim(python_version="3.9").pip_install( 54 | "datasets==3.2.0" 55 | ) 56 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 57 | 58 | 59 | # The default timeout is 5 minutes re: https://modal.com/docs/guide/timeouts#handling-timeouts 60 | # but we override this to 61 | # 6000s to avoid any potential timeout issues 62 | @app.function( 63 | volumes={DATASET_DIR: volume}, 64 | timeout=60000, 65 | ephemeral_disk=int(3145728), # in MiB 66 | secrets=[Secret.from_name("huggingface-secret")], 67 | ) 68 | def download_dataset(): 69 | # Redownload the dataset 70 | import time 71 | import os 72 | 73 | # Set HF cache environment variable 74 | os.environ['HF_HOME'] = HF_CACHE_DIR 75 | 76 | 77 | from datasets import load_dataset, DownloadConfig, logging 78 | logging.set_verbosity_debug() 79 | 80 | start = time.time() 81 | if DATASET_FILES: 82 | dataset = load_dataset(DATASET_NAME, data_files=DATASET_FILES, num_proc=6, trust_remote_code=True, download_config=DownloadConfig(resume_download=True, cache_dir=HF_CACHE_DIR)) 83 | elif DATASET_SAMPLE: 84 | dataset = load_dataset(DATASET_NAME, DATASET_SAMPLE, num_proc=6, trust_remote_code=True, download_config=DownloadConfig(resume_download=True, cache_dir=HF_CACHE_DIR)) 85 | else: 86 | dataset = load_dataset(DATASET_NAME, num_proc=6, trust_remote_code=True, download_config=DownloadConfig(resume_download=True, cache_dir=HF_CACHE_DIR)) 87 | end = time.time() 88 | print(f"Download complete - downloaded files in {end-start}s") 89 | 90 | dataset.save_to_disk(f"{DATASET_DIR}/{DATASET_SAVE}") 91 | volume.commit() 92 | 93 | @app.function(volumes={DATASET_DIR: volume}) 94 | def load_dataset(): 95 | import time 96 | import os 97 | 98 | # Set HF cache environment variable 99 | os.environ['HF_HOME'] = HF_CACHE_DIR 100 | 101 | 102 | from datasets import load_from_disk 103 | 104 | start = time.perf_counter() 105 | # Load the dataset as a Hugging Face dataset 106 | print(f"Loading dataset from {DATASET_DIR}/{DATASET_SAVE}") 107 | dataset = load_from_disk(f"{DATASET_DIR}/{DATASET_SAVE}") 108 | print(f"Dataset loaded in {time.perf_counter()-start:.2f} seconds") 109 | 110 | 111 | # # Sample the dataset to 
100,000 rows 112 | # print("Sampling dataset to 100,000 rows") 113 | # sampled_datasets = dataset["train"].select(range(100000)) 114 | # sampled_datasets.save_to_disk(f"{DATASET_DIR}/{DATASET_SAVE}-100k") 115 | 116 | 117 | # TODO: make a function to delete files 118 | # the 00099 files are old/wrong 119 | 120 | # TODO: make a function to load a single file from dataset 121 | 122 | @app.local_entrypoint() 123 | def main(): 124 | download_dataset.remote() 125 | # load_dataset.remote() 126 | 127 | -------------------------------------------------------------------------------- /embed-tei.py: -------------------------------------------------------------------------------- 1 | """ 2 | Embed a dataset using the HuggingFace TEI 3 | """ 4 | import os 5 | import json 6 | import time 7 | import asyncio 8 | import subprocess 9 | 10 | from modal import App, Image, Secret, Volume, build, enter, exit, gpu, method 11 | 12 | DATASET_DIR = "/data" 13 | 14 | ### CHUNKED DATASET 15 | # VOLUME = "embedding-fineweb-edu" 16 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-500" 17 | # files = [f"data-{i:05d}-of-00099.parquet" for i in range(99)] 18 | 19 | # VOLUME = "embedding-fineweb-edu" 20 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-120" 21 | # files = [f"data-{i:05d}-of-00099.parquet" for i in range(99)] 22 | 23 | # VOLUME = "datasets" 24 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-120" 25 | # files = [f"data-{i:05d}-of-00150.parquet" for i in range(150)] 26 | 27 | # VOLUME = "datasets" 28 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-500" 29 | # files = [f"data-{i:05d}-of-00150.parquet" for i in range(150)] 30 | 31 | # VOLUME = "datasets" 32 | # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-120" 33 | # # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-500" 34 | # files = [f"data-{i:05d}-of-01987.parquet" for i in range(200)] 35 | 36 | VOLUME = "datasets" 37 | DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-120" 38 | # DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-500" 39 | files = [f"data-{i:05d}-of-00041.parquet" for i in range(41)] 40 | 41 | # VOLUME = "datasets" 42 | # DATASET_SAVE_CHUNKED = f"medrag-pubmed-500" 43 | # files = [f"data-{i:05d}-of-00138.parquet" for i in range(138)] 44 | 45 | 46 | 47 | EMBEDDING_DIR = "/embeddings" 48 | 49 | #### MODEL 50 | # Tokenized version of "clustering: " prefix = [101, 9324, 2075, 1024] 51 | PREFIX = "clustering: " 52 | PREFIX_TOKEN_COUNT = 4 53 | MODEL_ID = "nomic-ai/nomic-embed-text-v1.5" 54 | 55 | # PREFIX = """ 56 | # PREFIX_TOKEN_COUNT = 0 57 | # MODEL_ID = "BAAI/bge-base-en-v1.5" 58 | 59 | # PREFIX = "" 60 | # PREFIX_TOKEN_COUNT = 0 61 | # MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2" 62 | 63 | MODEL_SLUG = MODEL_ID.split("/")[-1] 64 | 65 | MODEL_DIR = "/model" 66 | MODEL_REVISION="main" 67 | 68 | GPU_CONCURRENCY = 10 69 | BATCHER_CONCURRENCY = GPU_CONCURRENCY 70 | GPU_CONFIG = "A10G" 71 | GPU_IMAGE = "ghcr.io/huggingface/text-embeddings-inference:86-1.2" 72 | # GPU_CONFIG = gpu.A100(size="40GB") 73 | # GPU_CONFIG = gpu.A100(size="80GB") 74 | # GPU_IMAGE = "ghcr.io/huggingface/text-embeddings-inference:1.2" 75 | # GPU_CONFIG = gpu.H100() 76 | # GPU_IMAGE = "ghcr.io/huggingface/text-embeddings-inference:hopper-1.2" 77 | 78 | 79 | SENTENCE_TOKEN_LIMIT = 512 80 | CLIENT_BATCH_TOKEN_LIMIT = 768 * SENTENCE_TOKEN_LIMIT # how many tokens we put in a batch. 
limiting factor 81 | # i put the server higher but if we make the client batch too big it errors out without helpful message 82 | SERVER_BATCH_TOKEN_LIMIT = 2 * CLIENT_BATCH_TOKEN_LIMIT # how many tokens the server can handle in a batch 83 | MAX_CLIENT_BATCH_SIZE = 2 * 4096 # how many rows can be in a batch 84 | # CLIENT_BATCH_TOKEN_LIMIT = 1536 * SENTENCE_TOKEN_LIMIT # Double from 768 85 | # SERVER_BATCH_TOKEN_LIMIT = 4 * 1536 * SENTENCE_TOKEN_LIMIT # Increased server capacity 86 | 87 | # CLIENT_BATCH_TOKEN_LIMIT = 512 * SENTENCE_TOKEN_LIMIT #A100 40GB 88 | # SERVER_BATCH_TOKEN_LIMIT = 4*2048 * SENTENCE_TOKEN_LIMIT #A100 40GB 89 | 90 | LAUNCH_FLAGS = [ 91 | "--model-id", 92 | MODEL_ID, 93 | "--port", 94 | "8000", 95 | "--max-client-batch-size", 96 | str(MAX_CLIENT_BATCH_SIZE), # Increased from 20000 97 | "--max-batch-tokens", 98 | str(SERVER_BATCH_TOKEN_LIMIT), 99 | "--auto-truncate", 100 | "--dtype", 101 | "float16", 102 | "--json-output" # Add for more detailed perf metrics 103 | ] 104 | 105 | ## Dataset-Specific Configuration 106 | DATASET_READ_VOLUME = Volume.from_name( 107 | VOLUME, create_if_missing=True 108 | ) 109 | EMBEDDING_CHECKPOINT_VOLUME = Volume.from_name( 110 | "embeddings", create_if_missing=True 111 | ) 112 | 113 | def spawn_server() -> subprocess.Popen: 114 | import socket 115 | 116 | process = subprocess.Popen(["text-embeddings-router"] + LAUNCH_FLAGS) 117 | # Poll until webserver at 127.0.0.1:8000 accepts connections before running inputs. 118 | while True: 119 | try: 120 | socket.create_connection(("127.0.0.1", 8000), timeout=1).close() 121 | print("Webserver ready!") 122 | return process 123 | except (socket.timeout, ConnectionRefusedError): 124 | # Check if launcher webserving process has exited. 125 | # If so, a connection can never be made. 
126 | retcode = process.poll() 127 | if retcode is not None: 128 | raise RuntimeError( 129 | f"launcher exited unexpectedly with code {retcode}" 130 | ) 131 | 132 | 133 | tei_image = ( 134 | Image.from_registry( 135 | GPU_IMAGE, 136 | add_python="3.10", 137 | ) 138 | .dockerfile_commands("ENTRYPOINT []") 139 | .pip_install("httpx", "numpy") 140 | ) 141 | 142 | with tei_image.imports(): 143 | import numpy as np 144 | 145 | app = App( 146 | "fineweb-embeddings-tei" 147 | ) 148 | 149 | @app.cls( 150 | gpu=GPU_CONFIG, 151 | image=tei_image, 152 | max_containers=GPU_CONCURRENCY, 153 | allow_concurrent_inputs=4, # allows the batchers to queue up several requests 154 | # but if we allow too many and they get backed up it spams timeout errors 155 | retries=3, 156 | ) 157 | class TextEmbeddingsInference: 158 | # @build() 159 | # def download_model(self): 160 | # spawn_server() 161 | 162 | @enter() 163 | def open_connection(self): 164 | # If the process is running for a long time, the client does not seem to close the connections, results in a pool timeout 165 | from httpx import AsyncClient 166 | 167 | self.process = spawn_server() 168 | self.client = AsyncClient(base_url="http://127.0.0.1:8000", timeout=30) 169 | 170 | @exit() 171 | def terminate_connection(self): 172 | self.process.terminate() 173 | 174 | @method() 175 | async def embed(self, chunk_batch): 176 | texts = chunk_batch[0] 177 | res = await self.client.post("/embed", json={"inputs": texts}) 178 | try: 179 | emb = res.json() 180 | return chunk_batch, np.array(emb) 181 | except Exception as e: 182 | print(f"Error embedding", e) 183 | print("res", res) 184 | raise e 185 | 186 | @app.function( 187 | max_containers=BATCHER_CONCURRENCY, 188 | image=Image.debian_slim().pip_install( 189 | "pandas", "pyarrow", "tqdm" 190 | ), 191 | volumes={ 192 | DATASET_DIR: DATASET_READ_VOLUME, 193 | EMBEDDING_DIR: EMBEDDING_CHECKPOINT_VOLUME, 194 | }, 195 | timeout=86400, 196 | secrets=[Secret.from_name("huggingface-secret")], 197 | ) 198 | def batch_loader(file): 199 | import pandas as pd 200 | from tqdm import tqdm 201 | import time 202 | 203 | print(f"reading in {file}") 204 | file_path = f"{DATASET_DIR}/{DATASET_SAVE_CHUNKED}/train/{file}" 205 | df = pd.read_parquet(file_path) 206 | df['original_position'] = np.arange(len(df)) 207 | print(f"sorting {file}", len(df)) 208 | df = df.sort_values(by='chunk_token_count', ascending=True) 209 | # df = df[0: 80000] 210 | # df = df.reset_index(drop=True) 211 | 212 | batches_text = [] 213 | current_batch_counts = [] 214 | current_batch_text = [] 215 | batch_indices = [] 216 | current_batch_indices = [] 217 | packed = [] 218 | 219 | print("building batches for ", file, "with client batch token limit", CLIENT_BATCH_TOKEN_LIMIT) 220 | start = time.monotonic_ns() 221 | 222 | pbar = tqdm(total=len(df), desc=f"building batches for {file}") 223 | # idx is actually the original index since i didn't reset the index during sort 224 | # i just hate that its implied and had a bug when i didn't realize it 225 | for idx, row in df.iterrows(): 226 | pbar.update(1) 227 | original_idx = row['original_position'] 228 | chunk_token_count = row['chunk_token_count'] + PREFIX_TOKEN_COUNT # 4 for the prefix 229 | chunkt = PREFIX + row['chunk_text'] 230 | if not chunkt or not chunkt.strip(): 231 | print(f"WARNING: Empty chunk detected at index {original_idx}") 232 | chunkt = " " 233 | chunk_token_count = 1 234 | proposed_batch_count = current_batch_counts + [chunk_token_count] 235 | proposed_length = max(count for count in 
proposed_batch_count) * len(proposed_batch_count) 236 | 237 | if proposed_length <= CLIENT_BATCH_TOKEN_LIMIT and len(current_batch_indices) < MAX_CLIENT_BATCH_SIZE: 238 | current_batch_text.append(chunkt) 239 | current_batch_indices.append(original_idx) 240 | current_batch_counts.append(chunk_token_count) 241 | else: 242 | batches_text.append(current_batch_text) 243 | batch_indices.append(current_batch_indices) 244 | current_batch_counts = [chunk_token_count] 245 | current_batch_text = [chunkt] 246 | current_batch_indices = [original_idx] 247 | 248 | if current_batch_counts: 249 | batch_indices.append(current_batch_indices) 250 | batches_text.append(current_batch_text) 251 | 252 | 253 | duration_s = (time.monotonic_ns() - start) / 1e9 254 | print(f"batched {file} in {duration_s:.0f}s") 255 | 256 | responses = [] 257 | for batch_text, batch_indices in zip(batches_text, batch_indices): 258 | packed.append((batch_text, batch_indices)) 259 | 260 | print(f"{len(packed)} batches") 261 | 262 | pbar = tqdm(total=len(packed), desc=f"embedding {file}") 263 | model = TextEmbeddingsInference() 264 | 265 | for resp in model.embed.map( 266 | packed, 267 | order_outputs=False, 268 | return_exceptions=False 269 | ): 270 | responses.append(resp) 271 | pbar.update(1) 272 | 273 | if not os.path.exists(f"{EMBEDDING_DIR}/{DATASET_SAVE_CHUNKED}-{MODEL_SLUG}/train"): 274 | os.makedirs(f"{EMBEDDING_DIR}/{DATASET_SAVE_CHUNKED}-{MODEL_SLUG}/train", exist_ok=True) 275 | 276 | embedding_dim = responses[0][1].shape[1] 277 | embedding_path = f"{EMBEDDING_DIR}/{DATASET_SAVE_CHUNKED}-{MODEL_SLUG}/train/{file.replace('.parquet', '.npy')}" 278 | mmap_embeddings = np.memmap(embedding_path, dtype='float32', mode='w+', shape=(len(df), embedding_dim)) 279 | 280 | print("writing embeddings to disk") 281 | for batch, response in responses: 282 | for idx, embedding in zip(batch[1], response): 283 | mmap_embeddings[idx] = embedding 284 | mmap_embeddings.flush() 285 | 286 | del mmap_embeddings 287 | 288 | EMBEDDING_CHECKPOINT_VOLUME.commit() 289 | return f"done with {file}" 290 | 291 | @app.local_entrypoint() 292 | def full_job(): 293 | for resp in batch_loader.map( 294 | files, 295 | order_outputs=False, 296 | return_exceptions=True 297 | ): 298 | print(resp) 299 | 300 | print("done") 301 | 302 | -------------------------------------------------------------------------------- /experimental/batchsize.py: -------------------------------------------------------------------------------- 1 | """ 2 | Try to figure out the optimal batch size for embedding on a given GPU 3 | """ 4 | import os 5 | import json 6 | import time 7 | import asyncio 8 | import subprocess 9 | 10 | from modal import App, Image, Secret, Volume, build, enter, exit, gpu, method 11 | 12 | # We first set out configuration variables for our script. 
13 | ## Embedding Containers Configuration 14 | # GPU_CONCURRENCY = 100 15 | MODEL_ID = "nomic-ai/nomic-embed-text-v1.5" 16 | MODEL_SLUG = MODEL_ID.split("/")[-1] 17 | 18 | MODEL_DIR = "/model" 19 | MODEL_REVISION="main" 20 | 21 | GPU_CONCURRENCY = 1 22 | # GPU_CONFIG = gpu.A100(size="80GB") 23 | # GPU_CONFIG = gpu.A100(size="40GB") 24 | # GPU_CONFIG = gpu.A10G() 25 | GPU_CONFIG = gpu.H100() 26 | # BATCH_SIZE = 512 27 | BATCH_SIZE = 64 28 | # BATCH_SIZE = 128 29 | MAX_TOKENS = 8192 30 | # MAX_TOKENS = 2048 31 | 32 | 33 | ## Dataset-Specific Configuration 34 | DATASET_READ_VOLUME = Volume.from_name( 35 | "embedding-fineweb-edu", create_if_missing=True 36 | ) 37 | EMBEDDING_CHECKPOINT_VOLUME = Volume.from_name( 38 | "checkpoint", create_if_missing=True 39 | ) 40 | DATASET_DIR = "/data" 41 | # DATASET_SAVE ="fineweb-edu-sample-10BT" 42 | DATASET_SAVE ="fineweb-edu-sample-10BT-100k" 43 | CHECKPOINT_DIR = "/checkpoint" 44 | SAVE_TO_DISK = True 45 | 46 | ## Upload-Specific Configuration 47 | # DATASET_HF_UPLOAD_REPO_NAME = "enjalot/fineweb-edu-sample-10BT" 48 | DATASET_HF_UPLOAD_REPO_NAME = f"enjalot/{DATASET_SAVE}" 49 | UPLOAD_TO_HF = False 50 | 51 | 52 | def download_model_to_image(model_dir, model_name, model_revision): 53 | from huggingface_hub import snapshot_download 54 | from transformers.utils import move_cache 55 | 56 | os.makedirs(model_dir, exist_ok=True) 57 | 58 | snapshot_download( 59 | repo_id=model_name, 60 | revision=model_revision, 61 | local_dir=model_dir, 62 | ignore_patterns=["*.pt", "*.bin"], # Using safetensors 63 | ) 64 | move_cache() 65 | 66 | st_image = ( 67 | Image.debian_slim(python_version="3.10") 68 | .pip_install( 69 | "torch==2.1.2", 70 | "numpy==1.26.3", 71 | "transformers==4.39.3", 72 | "hf-transfer==0.1.6", 73 | "huggingface_hub==0.22.2", 74 | "einops==0.7.0" 75 | ) 76 | .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"}) 77 | .run_function( 78 | download_model_to_image, 79 | timeout=60 * 20, 80 | kwargs={ 81 | "model_dir": MODEL_DIR, 82 | "model_name": MODEL_ID, 83 | "model_revision": MODEL_REVISION, 84 | }, 85 | secrets=[Secret.from_name("huggingface-secret")], 86 | ) 87 | ) 88 | with st_image.imports(): 89 | import numpy as np 90 | import torch 91 | from torch.cuda.amp import autocast 92 | from transformers import AutoTokenizer, AutoModel 93 | 94 | app = App( 95 | "fineweb-embeddings-st" 96 | ) 97 | 98 | @app.cls( 99 | gpu=GPU_CONFIG, 100 | # cpu=16, 101 | concurrency_limit=GPU_CONCURRENCY, 102 | timeout=60 * 10, 103 | container_idle_timeout=60 * 10, 104 | allow_concurrent_inputs=1, 105 | image=st_image, 106 | ) 107 | class TransformerModel: 108 | @enter() 109 | def start_engine(self): 110 | # import torch 111 | # from transformers import AutoTokenizer, AutoModel 112 | 113 | self.device = torch.device("cuda") 114 | 115 | print("🥶 cold starting inference") 116 | start = time.monotonic_ns() 117 | 118 | self.model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, safe_serialization=True)#, rotary_scaling_factor=2 ) 119 | self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=MAX_TOKENS) 120 | self.model.to(self.device) 121 | self.model.eval() 122 | 123 | print(f"CUDA memory allocated: {torch.cuda.memory_allocated() / 1e6} MB") 124 | duration_s = (time.monotonic_ns() - start) / 1e9 125 | print(f"🏎️ engine started in {duration_s:.0f}s") 126 | 127 | @method() 128 | def embed(self, inputs): 129 | tok = self.tokenizer 130 | 131 | # print(torch.cuda.memory_summary(device=None, abbreviated=False)) 132 | 
print(torch.cuda.memory_summary(device=self.device, abbreviated=True)) 133 | 134 | # print(f"CUDA memory allocated before encoding: {torch.cuda.memory_allocated() / 1e6} MB") 135 | 136 | start = time.monotonic_ns() 137 | encoded_input = tok(inputs, padding=True, truncation=True, return_tensors='pt') 138 | print("encoded in", (time.monotonic_ns() - start) / 1e9) 139 | 140 | encoded_input = {key: value.to(self.device) for key, value in encoded_input.items()} 141 | # print("moved to device", (time.monotonic_ns() - start) / 1e9) 142 | # print("encoded input size", encoded_input['input_ids'].nelement() * encoded_input['input_ids'].element_size() / 1e6, "MB") 143 | 144 | # print(f"CUDA memory allocated after encoding: {torch.cuda.memory_allocated() / 1e6} MB") 145 | start = time.monotonic_ns() 146 | # print(torch.cuda.memory_summary(device=None, abbreviated=False)) 147 | with torch.no_grad():#, autocast(): 148 | # print(f"CUDA memory allocated before embedding: {torch.cuda.memory_allocated() / 1e6} MB") 149 | model_output = self.model(**encoded_input) 150 | # print(f"CUDA memory allocated after model output: {torch.cuda.memory_allocated() / 1e6} MB") 151 | # print(f"model output size: {model_output.nelement() * model_output.element_size() / 1e6} MB") 152 | embeddings = model_output[0][:, 0] 153 | # print(f"Embedding size: {embeddings.nelement() * embeddings.element_size() / 1e6} MB") 154 | # print(f"CUDA memory allocated after embedding: {torch.cuda.memory_allocated() / 1e6} MB") 155 | normalized_embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1) 156 | normalized_embeddings_cpu = normalized_embeddings.cpu().numpy() 157 | # print(f"CUDA memory allocated after got embeddings: {torch.cuda.memory_allocated() / 1e6} MB") 158 | # # Clean up torch memory 159 | # del encoded_input 160 | # del model_output 161 | # del embeddings 162 | # del normalized_embeddings 163 | # torch.cuda.empty_cache() 164 | duration_ms = (time.monotonic_ns() - start) / 1e6 165 | print(f"embedding took {duration_ms:.0f}ms") 166 | print(torch.cuda.memory_summary(device=self.device, abbreviated=True)) 167 | 168 | return inputs, normalized_embeddings_cpu 169 | 170 | 171 | 172 | @app.local_entrypoint() 173 | def full_job(): 174 | tok = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=MAX_TOKENS) 175 | batch_size = BATCH_SIZE 176 | 177 | test = "I " 178 | test = test * 1022 179 | tokens = tok.encode(test) 180 | print("tokens", len(tokens)) 181 | 182 | inputs = [test] * (384) 183 | 184 | model = TransformerModel() 185 | [inputs, embeddings] = model.embed.remote(inputs=inputs) 186 | print("done") 187 | 188 | -------------------------------------------------------------------------------- /experimental/embed.py: -------------------------------------------------------------------------------- 1 | """ 2 | Embed a dataset using a HuggingFace model, a good deal slower than TEI 3 | """ 4 | import os 5 | import json 6 | import time 7 | import asyncio 8 | import subprocess 9 | 10 | from modal import App, Image, Secret, Volume, build, enter, exit, gpu, method 11 | 12 | DATASET_DIR = "/data" 13 | DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-500" 14 | CHECKPOINT_DIR = "/checkpoint" 15 | 16 | # We first set out configuration variables for our script. 
17 | ## Embedding Containers Configuration 18 | # GPU_CONCURRENCY = 100 19 | MODEL_ID = "nomic-ai/nomic-embed-text-v1.5" 20 | MODEL_SLUG = MODEL_ID.split("/")[-1] 21 | 22 | MODEL_DIR = "/model" 23 | MODEL_REVISION="main" 24 | 25 | GPU_CONCURRENCY = 10 26 | # GPU_CONFIG = gpu.A100(size="80GB") 27 | # GPU_CONFIG = gpu.A100(size="40GB") 28 | # GPU_CONFIG = gpu.A10G() 29 | GPU_CONFIG = gpu.H100() 30 | 31 | 32 | ## Dataset-Specific Configuration 33 | DATASET_READ_VOLUME = Volume.from_name( 34 | "embedding-fineweb-edu", create_if_missing=True 35 | ) 36 | EMBEDDING_CHECKPOINT_VOLUME = Volume.from_name( 37 | "embeddings", create_if_missing=True 38 | ) 39 | def download_model_to_image(model_dir, model_name, model_revision): 40 | from huggingface_hub import snapshot_download 41 | from transformers.utils import move_cache 42 | 43 | os.makedirs(model_dir, exist_ok=True) 44 | 45 | snapshot_download( 46 | repo_id=model_name, 47 | revision=model_revision, 48 | local_dir=model_dir, 49 | ignore_patterns=["*.pt", "*.bin"], # Using safetensors 50 | ) 51 | move_cache() 52 | 53 | st_image = ( 54 | Image.debian_slim(python_version="3.10") 55 | .pip_install( 56 | "torch==2.1.2", 57 | "numpy==1.26.3", 58 | "transformers==4.39.3", 59 | "hf-transfer==0.1.6", 60 | "huggingface_hub==0.22.2", 61 | "einops==0.7.0" 62 | ) 63 | .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"}) 64 | .run_function( 65 | download_model_to_image, 66 | timeout=60 * 20, 67 | kwargs={ 68 | "model_dir": MODEL_DIR, 69 | "model_name": MODEL_ID, 70 | "model_revision": MODEL_REVISION, 71 | }, 72 | secrets=[Secret.from_name("huggingface-secret")], 73 | ) 74 | ) 75 | with st_image.imports(): 76 | import numpy as np 77 | import torch 78 | from torch.cuda.amp import autocast 79 | from transformers import AutoTokenizer, AutoModel 80 | 81 | app = App( 82 | "fineweb-embeddings-st" 83 | ) 84 | 85 | @app.cls( 86 | gpu=GPU_CONFIG, 87 | # cpu=16, 88 | concurrency_limit=GPU_CONCURRENCY, 89 | timeout=60 * 10, 90 | container_idle_timeout=60 * 10, 91 | allow_concurrent_inputs=1, 92 | image=st_image, 93 | ) 94 | class TransformerModel: 95 | @enter() 96 | def start_engine(self): 97 | # import torch 98 | # from transformers import AutoTokenizer, AutoModel 99 | 100 | self.device = torch.device("cuda") 101 | 102 | print("🥶 cold starting inference") 103 | start = time.monotonic_ns() 104 | 105 | self.model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, safe_serialization=True)#, rotary_scaling_factor=2 ) 106 | # self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=512) # MAX_TOKENS 107 | self.model.to(self.device) 108 | self.model.eval() 109 | 110 | # print(f"CUDA memory allocated: {torch.cuda.memory_allocated() / 1e6} MB") 111 | duration_s = (time.monotonic_ns() - start) / 1e9 112 | print(f"🏎️ engine started in {duration_s:.0f}s") 113 | 114 | @method() 115 | def embed(self, batch_mask_index): 116 | batch, mask, index = batch_mask_index 117 | # print(torch.cuda.memory_summary(device=self.device, abbreviated=True)) 118 | 119 | tokens_tensor = torch.tensor(batch) 120 | attention_mask = torch.tensor(mask) 121 | 122 | encoded_input = { 123 | 'input_ids': tokens_tensor.to(self.device), 124 | 'attention_mask': attention_mask.to(self.device) 125 | } 126 | # encoded_input = {key: value.to(self.device) for key, value in inputs} 127 | start = time.monotonic_ns() 128 | with torch.no_grad():#, autocast(): 129 | model_output = self.model(**encoded_input) 130 | embeddings = model_output[0][:, 0] 131 | normalized_embeddings = 
torch.nn.functional.normalize(embeddings, p=2, dim=1) 132 | normalized_embeddings_cpu = normalized_embeddings.cpu().numpy() 133 | 134 | duration_ms = (time.monotonic_ns() - start) / 1e6 135 | print(f"embedding took {duration_ms:.0f}ms") 136 | 137 | del encoded_input 138 | del model_output 139 | del embeddings 140 | del normalized_embeddings 141 | torch.cuda.empty_cache() 142 | 143 | # print(torch.cuda.memory_summary(device=self.device, abbreviated=True)) 144 | return index, normalized_embeddings_cpu 145 | 146 | 147 | 148 | @app.function( 149 | image=Image.debian_slim().pip_install( 150 | "pandas", "pyarrow", "tqdm" 151 | ), 152 | volumes={ 153 | DATASET_DIR: DATASET_READ_VOLUME, 154 | CHECKPOINT_DIR: EMBEDDING_CHECKPOINT_VOLUME, 155 | }, 156 | timeout=86400, 157 | secrets=[Secret.from_name("huggingface-secret")], 158 | ) 159 | def batch_loader(file, batch_size: int = 512 * 1024): 160 | import pandas as pd 161 | from tqdm import tqdm 162 | import time 163 | 164 | 165 | print(f"reading in {file}") 166 | file_path = f"{DATASET_DIR}/{DATASET_SAVE_CHUNKED}/train/{file}" 167 | df = pd.read_parquet(file_path) 168 | print(f"sorting {file}") 169 | df = df.sort_values(by='chunk_token_count', ascending=True) 170 | batches = [] 171 | current_batch = [] 172 | current_token_count = 0 173 | batch_indices = [] 174 | current_batch_indices = [] 175 | attention_masks = [] # List to store attention masks for each batch 176 | 177 | 178 | # Tokenized version of "clustering: " 179 | prefix = [101, 9324, 2075, 1024] 180 | 181 | print("building batches for ", file) 182 | start = time.monotonic_ns() 183 | 184 | for index, row in df.iterrows(): 185 | # chunk_token_count = row['chunk_token_count'] 186 | chunk = prefix + list(row['chunk_tokens']) 187 | proposed_batch = current_batch + [chunk] 188 | proposed_length = max(len(tokens) for tokens in proposed_batch) * len(proposed_batch) 189 | 190 | if proposed_length <= batch_size: 191 | current_batch.append(chunk) 192 | current_batch_indices.append(index) 193 | # current_token_count = proposed_length 194 | else: 195 | # Pad the current batch 196 | max_length = max(len(tokens) for tokens in current_batch) 197 | padded_batch = [tokens + [0] * (max_length - len(tokens)) for tokens in current_batch] 198 | attention_mask = [[1] * len(tokens) + [0] * (max_length - len(tokens)) for tokens in current_batch] 199 | batches.append(padded_batch) 200 | attention_masks.append(attention_mask) 201 | batch_indices.append(current_batch_indices) 202 | # Start new batch 203 | current_batch = [chunk] 204 | current_batch_indices = [index] 205 | # current_token_count = len(chunk) 206 | 207 | if current_batch: 208 | # Pad the final batch 209 | max_length = max(len(tokens) for tokens in current_batch) 210 | padded_batch = [tokens + [0] * (max_length - len(tokens)) for tokens in current_batch] 211 | attention_mask = [[1] * len(tokens) + [0] * (max_length - len(tokens)) for tokens in current_batch] 212 | 213 | batches.append(padded_batch) 214 | batch_indices.append(current_batch_indices) 215 | 216 | 217 | print("length of first batch", len(batches[0])) 218 | first_batch_length = sum(len(chunk) for chunk in batches[0]) 219 | print("Total length of all elements in the first batch:", first_batch_length) 220 | print(f"number of batches {len(batches)}") 221 | 222 | duration_s = (time.monotonic_ns() - start) / 1e9 223 | print(f"batched {file} in {duration_s:.0f}s") 224 | 225 | pbar = tqdm(total=len(batches), desc=f"embedding {file}") 226 | model = TransformerModel() 227 | 228 | responses = [] 229 | for 
resp in model.embed.map( 230 | zip(batches, attention_masks, batch_indices), 231 | order_outputs=False, 232 | return_exceptions=False 233 | ): 234 | responses.append(resp) 235 | pbar.update(1) 236 | 237 | print("zipping batches with responses") 238 | for batch_idx, response in responses: 239 | for idx, embedding in zip(batch_idx, response): 240 | df.at[idx, 'embedding'] = embedding 241 | 242 | if not os.path.exists(f"{CHECKPOINT_DIR}/{DATASET_SAVE_CHUNKED}/train"): 243 | os.makedirs(f"{CHECKPOINT_DIR}/{DATASET_SAVE_CHUNKED}/train", exist_ok=True) 244 | df.to_parquet(f"{CHECKPOINT_DIR}/{DATASET_SAVE_CHUNKED}/train/{file}") 245 | return f"done with {file}" 246 | 247 | @app.local_entrypoint() 248 | def full_job(): 249 | 250 | file = "data-00000-of-00099.parquet" 251 | 252 | batch_loader.remote(file=file, batch_size = (1024) * 512) 253 | print("done") 254 | 255 | -------------------------------------------------------------------------------- /features.py: -------------------------------------------------------------------------------- 1 | """ 2 | Extract the features for the embeddings of a dataset using a pre-trained SAE model 3 | 4 | modal run features.py 5 | """ 6 | 7 | import os 8 | import time 9 | from tqdm import tqdm 10 | from latentsae.sae import Sae 11 | from modal import App, Image, Volume, Secret, gpu, enter, method 12 | 13 | DATASET_DIR="/embeddings" 14 | VOLUME = "embeddings" 15 | 16 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4" 17 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-120-all-MiniLM-L6-v2" 18 | # DIRECTORY = f"{DATASET_DIR}/RedPajama-Data-V2-sample-10B-chunked-120-all-MiniLM-L6-v2" 19 | # DIRECTORY = f"{DATASET_DIR}/pile-uncopyrighted-chunked-120-all-MiniLM-L6-v2" 20 | # DIRECTORY = f"{DATASET_DIR}/medrag-pubmed-500-nomic-embed-text-v1.5" 21 | # FILES = [f"{DIRECTORY}/train/data-{i:05d}-of-00138.npy" for i in range(138)] 22 | DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5" 23 | FILES = [f"{DIRECTORY}/train/data-{i:05d}-of-00041.npy" for i in range(41)] 24 | SAE = "64_32" 25 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-2" 26 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3" 27 | # SAE = "64_128" 28 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}" 29 | # SAE = "64_64" 30 | 31 | SAVE_DIRECTORY = f"{DIRECTORY}-{SAE}" 32 | 33 | 34 | # MODEL_ID = "enjalot/sae-all-MiniLM-L6-v2" 35 | # D_IN = 384 36 | MODEL_ID = "enjalot/sae-nomic-text-v1.5-FineWeb-edu-100BT" 37 | MODEL_DIR = "/model" 38 | D_IN = 768 39 | MODEL_REVISION="main" 40 | 41 | # We define our Modal Resources that we'll need 42 | volume = Volume.from_name(VOLUME, create_if_missing=True) 43 | 44 | def download_model_to_image(model_dir, model_name, model_revision): 45 | from huggingface_hub import snapshot_download 46 | from transformers.utils import move_cache 47 | 48 | os.makedirs(model_dir, exist_ok=True) 49 | 50 | snapshot_download( 51 | repo_id=model_name, 52 | revision=model_revision, 53 | local_dir=model_dir, 54 | ignore_patterns=["*.pt", "*.bin"], # Using safetensors 55 | ) 56 | move_cache() 57 | 58 | st_image = ( 59 | Image.debian_slim(python_version="3.10") 60 | .pip_install( 61 | "torch==2.1.2", 62 | "numpy==1.26.3", 63 | "transformers==4.39.3", 64 | "hf-transfer==0.1.6", 65 | "huggingface_hub==0.22.2", 66 | "einops==0.7.0", 67 | "latentsae==0.1.0" 68 | ) 69 | .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"}) 70 | .run_function( 71 | 
download_model_to_image, 72 | timeout=60 * 20, 73 | kwargs={ 74 | "model_dir": MODEL_DIR, 75 | "model_name": MODEL_ID, 76 | "model_revision": MODEL_REVISION, 77 | }, 78 | secrets=[Secret.from_name("huggingface-secret")], 79 | ) 80 | ) 81 | app = App(image=st_image) # Note: prior to April 2024, "app" was called "stub" 82 | 83 | with st_image.imports(): 84 | import numpy as np 85 | import torch 86 | 87 | @app.cls( 88 | volumes={DATASET_DIR: volume}, 89 | timeout=60 * 100, 90 | scaledown_window=60 * 10, 91 | allow_concurrent_inputs=1, 92 | image=st_image, 93 | ) 94 | class SAEModel: 95 | @enter() 96 | def start_engine(self): 97 | # import torch 98 | self.device = torch.device("cpu") 99 | print("🥶 cold starting inference") 100 | start = time.monotonic_ns() 101 | self.model = Sae.load_from_hub(MODEL_ID, SAE, device=self.device) 102 | duration_s = (time.monotonic_ns() - start) / 1e9 103 | print(f"🏎️ engine started in {duration_s:.0f}s") 104 | 105 | @method() 106 | def make_features(self, file): 107 | # Redownload the dataset 108 | import time 109 | from datasets import load_dataset 110 | import torch 111 | import pandas as pd 112 | import numpy as np 113 | import time 114 | 115 | start = time.monotonic_ns() 116 | print("loading", file) 117 | # dataset = load_dataset("arrow", data_files=f"{DIRECTORY}/train/{file}") 118 | # # df = pd.read_parquet(f"{DIRECTORY}/train/{file}") 119 | # print("loaded") 120 | # df = pd.DataFrame(dataset['train']) 121 | # print("converted to dataframe") 122 | # embeddings = df['embedding'].to_numpy() 123 | # print("converted to numpy") 124 | # embeddings = np.array([np.array(e).astype(np.float32) for e in embeddings]) 125 | duration_s = (time.monotonic_ns() - start) / 1e9 126 | # read the npy memmapped file 127 | size= os.path.getsize(file) // (D_IN * 4) 128 | embeddings = np.memmap(file, 129 | dtype='float32', 130 | mode='r', 131 | shape=(size, D_IN)) 132 | print("loaded", file, "in", duration_s) 133 | 134 | start = time.monotonic_ns() 135 | print("Encoding embeddings with SAE") 136 | 137 | # batch_size = 4096 138 | batch_size = 128 139 | num_batches = (len(embeddings) + batch_size - 1) // batch_size 140 | all_acts = np.zeros((len(embeddings), 64)) 141 | all_indices = np.zeros((len(embeddings), 64)) 142 | for i in tqdm(range(num_batches), desc="Encoding batches"): 143 | batch_embeddings = embeddings[i * batch_size:(i + 1) * batch_size] 144 | batch_embeddings_tensor = torch.from_numpy(batch_embeddings).float().to(self.device) 145 | batch_features = self.model.encode(batch_embeddings_tensor) 146 | all_acts[i * batch_size:(i + 1) * batch_size] = batch_features.top_acts.detach().cpu().numpy() 147 | all_indices[i * batch_size:(i + 1) * batch_size] = batch_features.top_indices.detach().cpu().numpy() 148 | 149 | duration_s = (time.monotonic_ns() - start) / 1e9 150 | print("encoding completed", duration_s) 151 | 152 | df = pd.DataFrame() 153 | df['top_acts'] = list(all_acts) 154 | df['top_indices'] = list(all_indices) 155 | # # df.drop(columns=['embedding'], inplace=True) 156 | # if 'chunk_tokens' in df.columns: 157 | # df.drop(columns=['chunk_tokens'], inplace=True) 158 | print("features generated for", file) 159 | 160 | file_name = os.path.basename(file).split(".")[0] 161 | output_dir = f"{SAVE_DIRECTORY}/train" 162 | os.makedirs(output_dir, exist_ok=True) 163 | print(f"saving to {output_dir}/{file_name}.parquet") 164 | df.to_parquet(f"{output_dir}/{file_name}.parquet") 165 | 166 | volume.commit() 167 | return f"done with {file}" 168 | 169 | @app.local_entrypoint() 170 | def 
main(): 171 | 172 | # files = files[0:10] 173 | 174 | model = SAEModel() 175 | 176 | for resp in model.make_features.map(FILES, order_outputs=False, return_exceptions=True): 177 | if isinstance(resp, Exception): 178 | print(f"Exception: {resp}") 179 | continue 180 | print(resp) 181 | 182 | 183 | 184 | -------------------------------------------------------------------------------- /fetch.py: -------------------------------------------------------------------------------- 1 | """ 2 | fetch a file from a modal volume and write it locally 3 | """ 4 | 5 | from modal import App, Image, Volume 6 | 7 | # We first set out configuration variables for our script. 8 | # DATASET_DIR = "/data" 9 | VOLUME = "embeddings" 10 | DATASET_DIR = "/embeddings" 11 | # DATASET_NAME = "HuggingFaceFW/fineweb-edu" 12 | # DATASET_FILES = "sample/10BT/*.parquet" 13 | # DATASET_SAVE ="fineweb-edu-sample-10BT" 14 | MAX_TOKENS = 500 15 | # DATASET_SAVE = f"fineweb-edu-sample-10BT-chunked-{MAX_TOKENS}-HF4-64_32" 16 | DATASET_SAVE = f"fineweb-edu-sample-10BT-chunked-{MAX_TOKENS}-HF4-64_32-top10" 17 | # DATASET_SAVE = f"fineweb-edu-sample-10BT" 18 | # DIRECTORY = f"{DATASET_DIR}/{DATASET_SAVE}/train" 19 | DIRECTORY = f"{DATASET_DIR}/{DATASET_SAVE}" 20 | 21 | # MODEL_ID = "nomic-ai/nomic-embed-text-v1.5" 22 | 23 | # We define our Modal Resources that we'll need 24 | volume = Volume.from_name(VOLUME, create_if_missing=True) 25 | # volume = Volume.from_name("embeddings", create_if_missing=True) 26 | image = Image.debian_slim(python_version="3.9").pip_install( 27 | "datasets==2.16.1", "apache_beam==2.53.0", "transformers", "pandas", "tqdm" 28 | ) 29 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 30 | 31 | 32 | @app.function(volumes={DATASET_DIR: volume}, timeout=3000) 33 | def fetch_dataset(file): 34 | import pandas as pd 35 | from datasets import load_dataset 36 | print("loading", file) 37 | # Load the dataset as a Hugging Face dataset 38 | if file.endswith(".parquet"): 39 | df = pd.read_parquet(file) 40 | else: 41 | dataset = load_dataset("arrow", data_files=file) 42 | df = pd.DataFrame(dataset['train']) 43 | print("file loaded, returning", file) 44 | return df 45 | 46 | @app.local_entrypoint() 47 | def main(): 48 | import pandas as pd 49 | 50 | # file = "data-00000-of-00099.arrow" 51 | file = "data-00000-of-00099.parquet" 52 | # file = "data-00001-of-00099.parquet" 53 | file_path = f"{DIRECTORY}/{file}" 54 | resp = fetch_dataset.remote(file_path) 55 | if isinstance(resp, Exception): 56 | print(f"Exception: {resp}") 57 | else: 58 | print(resp) 59 | # resp.to_parquet(f"./notebooks/{file}") 60 | resp.to_parquet(f"./notebooks/top10-{file}") 61 | -------------------------------------------------------------------------------- /filter.py: -------------------------------------------------------------------------------- 1 | from modal import App, Image, Volume, Secret 2 | 3 | DATASET_DIR="/embeddings" 4 | VOLUME = "embeddings" 5 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF2" # converted the original to a dataset 6 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4" 7 | DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-100BT-chunked-500/train" 8 | 9 | # We define our Modal Resources that we'll need 10 | volume = Volume.from_name(VOLUME, create_if_missing=True) 11 | image = Image.debian_slim(python_version="3.9").pip_install( 12 | "datasets==2.16.1", "apache_beam==2.53.0" 13 | ) 14 | app = App(image=image) # Note: prior to April 2024, "app" was called 
"stub" 15 | 16 | 17 | # The default timeout is 5 minutes re: https://modal.com/docs/guide/timeouts#handling-timeouts 18 | # but we override this to 19 | # 6000s to avoid any potential timeout issues 20 | @app.function( 21 | volumes={DATASET_DIR: volume}, 22 | timeout=60000, 23 | # ephemeral_disk=2145728, # in MiB 24 | ) 25 | def filter_dataset(): 26 | # Redownload the dataset 27 | import time 28 | from datasets import load_from_disk 29 | print("loading") 30 | dataset = load_from_disk(DIRECTORY) 31 | print("filtering") 32 | filtered = dataset.filter(lambda x: x > 50, input_columns=["chunk_token_count"]) 33 | # print("sorting") 34 | # dataset.sort(column_names=["id", "chunk_index"], keep_in_memory=True) 35 | print("saving") 36 | filtered.save_to_disk(SAVE_DIRECTORY, num_shards={"train":99}) 37 | print("done!") 38 | volume.commit() 39 | 40 | @app.function( 41 | volumes={DATASET_DIR: volume}, 42 | timeout=60000, 43 | # ephemeral_disk=2145728, # in MiB 44 | ) 45 | def filter_dataset_file(file): 46 | import pandas as pd 47 | print("loading", file) 48 | df = pd.read_parquet(f"{DIRECTORY}/{file}") 49 | print("filtering", file) 50 | filtered = df[df["chunk_token_count"] > 50] 51 | print("saving", file) 52 | filtered.to_parquet(f"{DIRECTORY}/{file}") 53 | print("done!", file) 54 | volume.commit() 55 | return file 56 | 57 | 58 | 59 | 60 | @app.local_entrypoint() 61 | def main(): 62 | # filter_dataset.remote() 63 | 64 | files = [f"data-{i:05d}-of-00989.parquet" for i in range(100)] 65 | files = files[2:] 66 | for resp in filter_dataset_file.map(files, order_outputs=False, return_exceptions=True): 67 | if isinstance(resp, Exception): 68 | print(f"Exception: {resp}") 69 | continue 70 | print(resp) 71 | 72 | -------------------------------------------------------------------------------- /lancer.py: -------------------------------------------------------------------------------- 1 | """ 2 | Combine chunks, embeddings and features into a single LanceDB table. 3 | 4 | This script loops over each corresponding file: 5 | - The chunk parquet produced by chunker.py (e.g. "/data/medrag-pubmed-500/train/data-00000-of-00138.parquet") 6 | - The embedding npy file produced by features.py (e.g. "/embeddings/medrag-pubmed-500-nomic-embed-text-v1.5/train/data-00000-of-00138.npy") 7 | - The features parquet file produced by features.py (e.g. "/embeddings/medrag-pubmed-500-nomic-embed-text-v1.5-64_32/train/data-00000-of-00138.parquet") 8 | 9 | They are then concatenated (column-wise) row‐by‐row in the natural order and written to a lancedb table. 10 | 11 | Usage (from Modal CLI): 12 | modal run combine.py 13 | """ 14 | 15 | import os 16 | import time 17 | import numpy as np 18 | import pandas as pd 19 | import lancedb 20 | from modal import App, Image, Volume, enter, method, gpu 21 | 22 | # ============================================================================ 23 | # Configuration variables – adjust these to your environment/path names! 
24 | # ============================================================================ 25 | 26 | # Directories for the input files: 27 | # CHUNK_PARQUET_DIR = "/datasets/medrag-pubmed-500/train" 28 | # EMBEDDING_NPY_DIR = "/embeddings/medrag-pubmed-500-nomic-embed-text-v1.5/train" 29 | # FEATURE_PARQUET_DIR = "/embeddings/medrag-pubmed-500-nomic-embed-text-v1.5-64_32/train" 30 | CHUNK_PARQUET_DIR = "/datasets/wikipedia-en-chunked-500/train" 31 | EMBEDDING_NPY_DIR = "/embeddings/wikipedia-en-chunked-500-nomic-embed-text-v1.5/train" 32 | FEATURE_PARQUET_DIR = "/embeddings/wikipedia-en-chunked-500-nomic-embed-text-v1.5-64_32/train" 33 | 34 | 35 | # Directory (volume) where the LanceDB table will be stored. 36 | # LANCE_DB_DIR = "/lancedb/enjalot/medrag-pubmed" 37 | # LANCE_DB_DIR_INDEXED = "/lancedb/enjalot/medrag-pubmed-indexed" 38 | # TMP_LANCE_DB_DIR = "/tmp/medrag-pubmed" 39 | LANCE_DB_DIR = "/lancedb/enjalot/wikipedia-en-500" 40 | LANCE_DB_DIR_INDEXED = "/lancedb/enjalot/wikipedia-en-500-indexed" 41 | TMP_LANCE_DB_DIR = "/tmp/wikipedia-en-500" 42 | 43 | TABLE_NAME = "500-64_32" 44 | 45 | # TOTAL_FILES = 138 # total number of shards (files) 46 | TOTAL_FILES = 41 # total number of shards (files) 47 | D_EMB = 768 # embedding dimension 48 | 49 | # Volume for the lancedb storage 50 | DATASETS_VOLUME = "datasets" 51 | EMBEDDING_VOLUME = "embeddings" 52 | DB_VOLUME = "lancedb" 53 | 54 | # ============================================================================ 55 | # Modal Resources 56 | # ============================================================================ 57 | 58 | volume_db = Volume.from_name(DB_VOLUME, create_if_missing=True) 59 | volume_datasets = Volume.from_name(DATASETS_VOLUME, create_if_missing=True) 60 | volume_embeddings = Volume.from_name(EMBEDDING_VOLUME, create_if_missing=True) 61 | 62 | st_image = ( 63 | Image.debian_slim(python_version="3.10") 64 | .pip_install( 65 | "pandas", "numpy", "lancedb", "pyarrow", "torch", "tantivy" 66 | ) 67 | .env({"RUST_BACKTRACE": "1"}) 68 | ) 69 | 70 | 71 | app = App(image=st_image) 72 | 73 | # ============================================================================ 74 | # Class to combine and write data into a lancedb table 75 | # ============================================================================ 76 | 77 | @app.function(volumes={ 78 | "/datasets": volume_datasets, 79 | "/embeddings": volume_embeddings, 80 | "/lancedb": volume_db 81 | }, 82 | ephemeral_disk=int(1024*1024), # in MiB 83 | image=st_image, 84 | timeout=60*100, 85 | scaledown_window=60*10 86 | ) 87 | def combine(): 88 | """ 89 | Sequentially process each shard by reading the corresponding chunk parquet, 90 | embedding npy, and features parquet files. The data are combined (column-wise) 91 | and then appended to a single lancedb table. 92 | """ 93 | db_path = TMP_LANCE_DB_DIR 94 | print(f"Connecting to LanceDB at: {db_path}") 95 | db = lancedb.connect(db_path) 96 | 97 | for i in range(TOTAL_FILES): 98 | base_file = f"data-{i:05d}-of-{TOTAL_FILES:05d}" 99 | chunk_file = os.path.join(CHUNK_PARQUET_DIR, f"{base_file}.parquet") 100 | embedding_file = os.path.join(EMBEDDING_NPY_DIR, f"{base_file}.npy") 101 | feature_file = os.path.join(FEATURE_PARQUET_DIR, f"{base_file}.parquet") 102 | 103 | print(f"\nProcessing shard: {base_file}") 104 | start_time = time.monotonic() 105 | 106 | # Load the chunk parquet file. 
107 | try: 108 | chunk_df = pd.read_parquet(chunk_file) 109 | except Exception as e: 110 | print(f"Error reading chunk file {chunk_file}: {e}") 111 | break 112 | 113 | # Load the embeddings npy file. 114 | try: 115 | size = os.path.getsize(embedding_file) // (D_EMB * 4) 116 | embedding_np = np.memmap(embedding_file, 117 | dtype='float32', 118 | mode='r', 119 | shape=(size, D_EMB)) 120 | except Exception as e: 121 | print(f"Error reading embedding file {embedding_file}: {e}") 122 | break 123 | 124 | # Load the features parquet file. 125 | try: 126 | feature_df = pd.read_parquet(feature_file) 127 | feature_df = feature_df.rename(columns={ 128 | 'top_indices': 'sae_indices', 129 | 'top_acts': 'sae_acts' 130 | }) 131 | # Convert sae_indices from float to int for each row 132 | feature_df['sae_indices'] = feature_df['sae_indices'].apply(lambda x: [int(i) for i in x]) 133 | except Exception as e: 134 | print(f"Error reading feature file {feature_file}: {e}") 135 | break 136 | 137 | # Validate that the three sources have the same number of rows. 138 | n_chunk = len(chunk_df) 139 | n_embedding = embedding_np.shape[0] 140 | n_feature = len(feature_df) 141 | if not (n_chunk == n_embedding == n_feature): 142 | print(f"Row count mismatch in {base_file}: chunk {n_chunk}, embedding {n_embedding}, feature {n_feature}") 143 | break 144 | 145 | # Store the embedding data as a list column. (Alternatively, you could split the embedding vector into columns.) 146 | 147 | vector_column = list(embedding_np) 148 | 149 | # Combine the dataframes (resetting indices to ensure correct alignment). 150 | combined_df = pd.concat( 151 | [chunk_df.reset_index(drop=True), 152 | feature_df.reset_index(drop=True)], 153 | axis=1, 154 | ) 155 | combined_df["vector"] = vector_column 156 | combined_df["shard"] = i 157 | 158 | if i == 0: 159 | msg = f"Creating LanceDB table '{TABLE_NAME}' at {db_path} with {len(combined_df)} rows." 160 | print(msg) 161 | table = db.create_table(TABLE_NAME, combined_df) 162 | else: 163 | msg = f"Adding shard {base_file} to LanceDB table '{TABLE_NAME}' at {db_path} with {len(combined_df)} rows."
164 | print(msg) 165 | table.add(combined_df) 166 | # if i == 2: 167 | # break 168 | 169 | duration = time.monotonic() - start_time 170 | print(f"Shard {base_file} processed in {duration:.2f} seconds; {n_chunk} rows") 171 | 172 | 173 | print(f"Copying LanceDB to {LANCE_DB_DIR}") 174 | # copy the tmp lancedb directory to the volume 175 | import shutil 176 | shutil.copytree(TMP_LANCE_DB_DIR, LANCE_DB_DIR) 177 | print(f"Done!") 178 | 179 | 180 | @app.function(volumes={ 181 | "/datasets": volume_datasets, 182 | "/embeddings": volume_embeddings, 183 | "/lancedb": volume_db 184 | }, 185 | gpu="A10G", 186 | ephemeral_disk=int(1024*1024), # in MiB 187 | image=st_image, 188 | timeout=60*100, 189 | scaledown_window=60*10 190 | ) 191 | def create_indices(): 192 | import lancedb 193 | import shutil 194 | start_time = time.monotonic() 195 | print(f"Copying table {LANCE_DB_DIR} to {TMP_LANCE_DB_DIR}") 196 | shutil.copytree(LANCE_DB_DIR, TMP_LANCE_DB_DIR) 197 | duration = time.monotonic() - start_time 198 | print(f"Copying table {LANCE_DB_DIR} to {TMP_LANCE_DB_DIR} took {duration:.2f} seconds") 199 | 200 | db = lancedb.connect(TMP_LANCE_DB_DIR) 201 | table = db.open_table(TABLE_NAME) 202 | 203 | # start_time = time.monotonic() 204 | # print(f"Creating index for sae_indices on table '{TABLE_NAME}'") 205 | # table.create_scalar_index("sae_indices", index_type="LABEL_LIST") 206 | # duration = time.monotonic() - start_time 207 | # print(f"Creating index for sae_indices on table '{TABLE_NAME}' took {duration:.2f} seconds") 208 | 209 | start_time = time.monotonic() 210 | print(f"Creating FTS index for title on table '{TABLE_NAME}'") 211 | table.create_fts_index("title") 212 | duration = time.monotonic() - start_time 213 | print(f"Creating FTS index for title on table '{TABLE_NAME}' took {duration:.2f} seconds") 214 | 215 | start_time = time.monotonic() 216 | print(f"Creating ANN index for embeddings on table '{TABLE_NAME}'") 217 | partitions = int(table.count_rows() ** 0.5) * 2 218 | sub_vectors = D_EMB // 16 219 | metric = "cosine" 220 | print(f"Partitioning into {partitions} partitions, {sub_vectors} sub-vectors") 221 | table.create_index( 222 | num_partitions=partitions, 223 | num_sub_vectors=sub_vectors, 224 | metric=metric, 225 | accelerator="cuda" 226 | ) 227 | duration = time.monotonic() - start_time 228 | print(f"Creating ANN index for embeddings on table '{TABLE_NAME}' took {duration:.2f} seconds") 229 | 230 | # print(f"Deleting existing {LANCE_DB_DIR}") 231 | # shutil.rmtree(LANCE_DB_DIR, ignore_errors=True) 232 | start_time = time.monotonic() 233 | print(f"Copying table {TABLE_NAME} to {LANCE_DB_DIR_INDEXED}") 234 | shutil.copytree(TMP_LANCE_DB_DIR, LANCE_DB_DIR_INDEXED, dirs_exist_ok=True) 235 | duration = time.monotonic() - start_time 236 | print(f"Copying table {TMP_LANCE_DB_DIR} to {LANCE_DB_DIR_INDEXED} took {duration:.2f} seconds") 237 | 238 | # ============================================================================ 239 | # Modal Local Entrypoint 240 | # ============================================================================ 241 | 242 | @app.local_entrypoint() 243 | def main(): 244 | # Combine all shards and write to LanceDB. 
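# The intended flow appears to be two passes (a sketch, inferred from the
# commented-out calls below): first run with combine.remote() enabled to build
# the raw table in TMP_LANCE_DB_DIR and copy it to LANCE_DB_DIR, then re-run
# with create_indices.remote() to copy that table back to local disk and build
# the FTS and ANN indices into LANCE_DB_DIR_INDEXED.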
245 | # combine.remote() 246 | # print("done with combine, creating indices") 247 | create_indices.remote() -------------------------------------------------------------------------------- /notebooks/perfile.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "import time\n", 18 | "# import tqdm\n", 19 | "from tqdm.notebook import tqdm # Import the notebook version of tqdm\n", 20 | "\n", 21 | "from datasets import load_dataset\n", 22 | "import pandas as pd\n", 23 | "import numpy as np\n", 24 | "import huggingface_hub\n", 25 | "from huggingface_hub import HfFileSystem\n", 26 | "hffs = HfFileSystem()\n", 27 | "from concurrent.futures import ThreadPoolExecutor, as_completed\n", 28 | "\n", 29 | "import transformers\n", 30 | "transformers.logging.set_verbosity_error()\n", 31 | "from transformers import AutoTokenizer\n" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 2, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "dataset = load_dataset(\"HuggingFaceFW/fineweb-edu\", data_files=\"sample/10BT/*.parquet\", streaming=True, split=\"train\")" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 3, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "files = hffs.ls(\"datasets/HuggingFaceFW/fineweb-edu/sample/10BT\", detail=False)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 5, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['datasets/HuggingFaceFW/fineweb-edu/sample/10BT/000_00000.parquet',\n", 61 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/001_00000.parquet',\n", 62 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/002_00000.parquet',\n", 63 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/003_00000.parquet',\n", 64 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/004_00000.parquet',\n", 65 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/005_00000.parquet',\n", 66 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/006_00000.parquet',\n", 67 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/007_00000.parquet',\n", 68 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/008_00000.parquet',\n", 69 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/009_00000.parquet',\n", 70 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/010_00000.parquet',\n", 71 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/011_00000.parquet',\n", 72 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/012_00000.parquet',\n", 73 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/013_00000.parquet']" 74 | ] 75 | }, 76 | "execution_count": 5, 77 | "metadata": {}, 78 | "output_type": "execute_result" 79 | } 80 | ], 81 | "source": [ 82 | "files" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 4, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "file = files[0]" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "# df = pd.read_parquet(\"hf://\" + files[0])\n", 101 | "df = pd.read_parquet(file.split(\"/\")[-1])" 
102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "df.head()" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "# df.to_parquet(files[0].split(\"/\")[-1])" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "MAX_TOKENS = 512" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "# keep_keys = [\"id\", \"url\", \"score\", \"dump\"]" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "# tokenizer = AutoTokenizer.from_pretrained(\"bert-base-uncased\", model_max_length=MAX_TOKENS)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "# def chunk(rows):\n", 156 | "# texts = rows[\"text\"]\n", 157 | "# chunks_index = []\n", 158 | "# chunks_text = []\n", 159 | "# chunks_tokens = []\n", 160 | "# updated_token_counts = []\n", 161 | "\n", 162 | "# # Assuming you have other properties in the rows that you want to retain\n", 163 | "# keep = {key: [] for key in keep_keys}\n", 164 | "\n", 165 | "# for index, text in enumerate(texts):\n", 166 | "# tokens = tokenizer.encode(text)\n", 167 | "# token_count = len(tokens)\n", 168 | "\n", 169 | "# if token_count > MAX_TOKENS:\n", 170 | "# overlap = int(MAX_TOKENS * 0.1)\n", 171 | "# start_index = 0\n", 172 | "# ci = 0\n", 173 | "# while start_index < len(tokens):\n", 174 | "# end_index = min(start_index + MAX_TOKENS, len(tokens))\n", 175 | "# chunk = tokens[start_index:end_index]\n", 176 | "# chunks_index.append(ci)\n", 177 | "# chunks_tokens.append(chunk)\n", 178 | "# updated_token_counts.append(len(chunk))\n", 179 | "# chunks_text.append(tokenizer.decode(chunk))\n", 180 | "# # Copy other properties for each chunk\n", 181 | "# for key in keep:\n", 182 | "# keep[key].append(rows[key][index])\n", 183 | "# start_index += MAX_TOKENS - overlap\n", 184 | "# ci += 1\n", 185 | "# else:\n", 186 | "# chunks_index.append(0)\n", 187 | "# chunks_text.append(text)\n", 188 | "# chunks_tokens.append(tokens)\n", 189 | "# updated_token_counts.append(token_count)\n", 190 | "# # Copy other properties for non-chunked texts\n", 191 | "# for key in keep:\n", 192 | "# keep[key].append(rows[key][index])\n", 193 | "\n", 194 | "# keep[\"chunk_index\"] = chunks_index\n", 195 | "# keep[\"chunk_text\"] = chunks_text\n", 196 | "# keep[\"chunk_tokens\"] = chunks_tokens\n", 197 | "# keep[\"chunk_token_count\"] = updated_token_counts\n", 198 | "# return keep\n" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "def chunk_row(row, tokenizer):\n", 208 | " # print(\"ROW\", row)\n", 209 | " MAX_TOKENS = 512\n", 210 | " keep_keys = [\"id\", \"url\", \"score\", \"dump\"]\n", 211 | " text = row[\"text\"]\n", 212 | " chunks = []\n", 213 | "\n", 214 | " tokens = tokenizer.encode(text)\n", 215 | " token_count = len(tokens)\n", 216 | " if token_count > MAX_TOKENS:\n", 217 | " overlap = int(MAX_TOKENS * 0.1)\n", 218 | " start_index = 0\n", 219 | " ci = 0\n", 220 | " while start_index < len(tokens):\n", 221 | " end_index = 
min(start_index + MAX_TOKENS, len(tokens))\n", 222 | " chunk = tokens[start_index:end_index]\n", 223 | " chunks.append({\n", 224 | " \"chunk_index\": ci,\n", 225 | " \"chunk_text\": tokenizer.decode(chunk),\n", 226 | " \"chunk_tokens\": chunk,\n", 227 | " \"chunk_token_count\": len(chunk),\n", 228 | " **{key: row[key] for key in keep_keys}\n", 229 | " })\n", 230 | " start_index += MAX_TOKENS - overlap\n", 231 | " ci += 1\n", 232 | " else:\n", 233 | " chunks.append({\n", 234 | " \"chunk_index\": 0,\n", 235 | " \"chunk_text\": text,\n", 236 | " \"chunk_tokens\": tokens,\n", 237 | " \"chunk_token_count\": token_count,\n", 238 | " **{key: row[key] for key in keep_keys}\n", 239 | " })\n", 240 | "\n", 241 | " return chunks\n" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "def process_dataframe(df):\n", 251 | " chunks_list = []\n", 252 | " with ThreadPoolExecutor(max_workers=16) as executor:\n", 253 | " # Submit all rows to the executor\n", 254 | " pbar = tqdm(total=len(df), desc=\"Processing Rows\")\n", 255 | " \n", 256 | " def process_batch(batch):\n", 257 | " \n", 258 | " tokenizer = AutoTokenizer.from_pretrained(\"bert-base-uncased\", model_max_length=MAX_TOKENS)\n", 259 | " batch_chunks = []\n", 260 | " for row in batch:\n", 261 | " row_chunks = chunk_row(row, tokenizer)\n", 262 | " pbar.update(1)\n", 263 | " batch_chunks.extend(row_chunks)\n", 264 | " return batch_chunks\n", 265 | "\n", 266 | "\n", 267 | " print(\"making batches\")\n", 268 | " batch_size = 200 # Adjust batch size based on your needs\n", 269 | " batches = [df.iloc[i:i + batch_size].to_dict(orient=\"records\") for i in range(0, len(df), batch_size)]\n", 270 | " print(\"made batches\")\n", 271 | " print(\"setting up futures\")\n", 272 | " futures = [executor.submit(process_batch, batch) for batch in batches]\n", 273 | " # futures = [executor.submit(chunk_row, row) for index, row in df.iterrows()]\n", 274 | " # for future in tqdm(as_completed(futures), total=len(df), desc=\"Processing Rows\"):\n", 275 | " # chunks_list.extend(future.result())\n", 276 | " print(\"in the future\")\n", 277 | " # pbar = tqdm(total=len(df)//batch_size, desc=\"Processing Rows\")\n", 278 | " for future in as_completed(futures):\n", 279 | " chunks_list.extend(future.result())\n", 280 | " # print(len(chunks_list))\n", 281 | " # pbar.update(1) # Manually update the progress bar\n", 282 | " pbar.close()\n", 283 | " return chunks_list" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [ 292 | "# Process the DataFrame and create a new DataFrame from the list of chunks\n", 293 | "start = time.perf_counter()\n", 294 | "print(f\"Chunking text that is longer than {MAX_TOKENS} tokens\")\n", 295 | "chunked_data = process_dataframe(df)\n", 296 | "print(f\"Dataset chunked in {time.perf_counter() - start:.2f} seconds\")\n", 297 | "start = time.perf_counter()\n", 298 | "chunked_df = pd.DataFrame(chunked_data)\n", 299 | "print(f\"Dataset converted to DataFrame in {time.perf_counter() - start:.2f} seconds\")\n" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": {}, 306 | "outputs": [], 307 | "source": [ 308 | "# chunked_df.to_parquet(\"chunked-\" + file.split(\"/\")[-1])" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | 
"len(chunked_df)" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "chunked_df.head()" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": null, 332 | "metadata": {}, 333 | "outputs": [], 334 | "source": [] 335 | } 336 | ], 337 | "metadata": { 338 | "kernelspec": { 339 | "display_name": "modalenv", 340 | "language": "python", 341 | "name": "python3" 342 | }, 343 | "language_info": { 344 | "codemirror_mode": { 345 | "name": "ipython", 346 | "version": 3 347 | }, 348 | "file_extension": ".py", 349 | "mimetype": "text/x-python", 350 | "name": "python", 351 | "nbconvert_exporter": "python", 352 | "pygments_lexer": "ipython3", 353 | "version": "3.11.6" 354 | } 355 | }, 356 | "nbformat": 4, 357 | "nbformat_minor": 2 358 | } 359 | -------------------------------------------------------------------------------- /notebooks/small_sample.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "from datasets import load_dataset\n", 10 | "import pandas as pd\n", 11 | "import numpy as np\n", 12 | "\n" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 8, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "dataset = load_dataset(\"HuggingFaceFW/fineweb-edu\", data_files=\"sample/10BT/*.parquet\", streaming=True, split=\"train\")\n", 22 | "\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 9, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "dataset_head = dataset.take(10000)" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 10, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "df10k = pd.DataFrame(list(dataset_head))" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 11, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "data": { 50 | "text/html": [ 51 | "
\n", 52 | "\n", 65 | "\n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | "
textiddumpurlfile_pathlanguagelanguage_scoretoken_countscoreint_score
0The Independent Jane\\nFor all the love, romanc...<urn:uuid:0d8a309d-25c5-405d-a08a-c11239f0d717>CC-MAIN-2013-20http://austenauthors.net/the-independent-janes3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.9743208452.7500003
1Taking Play Seriously\\nBy ROBIN MARANTZ HENIG\\...<urn:uuid:316c7af5-14e1-4d0b-9576-753e17ef2cc5>CC-MAIN-2013-20http://query.nytimes.com/gst/fullpage.html?res...s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.96145910552.5625003
2How do you get HIV?\\nHIV can be passed on when...<urn:uuid:a3e140cd-7f25-48c9-a2f0-a7d0b1954e0d>CC-MAIN-2013-20http://www.childline.org.uk/Explore/SexRelatio...s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.9667571363.1250003
3CTComms sends on average 2 million emails mont...<urn:uuid:c337bcd8-6aa1-4f2d-8c48-b916442ebbee>CC-MAIN-2013-20http://www.ctt.org/resource_centre/getting_sta...s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.91060234793.2343753
4Hold the salt: UCLA engineers develop revoluti...<urn:uuid:c0b175bb-65fb-420e-a881-a80b91d00ecd>CC-MAIN-2013-20http://www.environment.ucla.edu/water/news/art...s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.92498111152.8125003
\n", 149 | "
" 150 | ], 151 | "text/plain": [ 152 | " text \\\n", 153 | "0 The Independent Jane\\nFor all the love, romanc... \n", 154 | "1 Taking Play Seriously\\nBy ROBIN MARANTZ HENIG\\... \n", 155 | "2 How do you get HIV?\\nHIV can be passed on when... \n", 156 | "3 CTComms sends on average 2 million emails mont... \n", 157 | "4 Hold the salt: UCLA engineers develop revoluti... \n", 158 | "\n", 159 | " id dump \\\n", 160 | "0 CC-MAIN-2013-20 \n", 161 | "1 CC-MAIN-2013-20 \n", 162 | "2 CC-MAIN-2013-20 \n", 163 | "3 CC-MAIN-2013-20 \n", 164 | "4 CC-MAIN-2013-20 \n", 165 | "\n", 166 | " url \\\n", 167 | "0 http://austenauthors.net/the-independent-jane \n", 168 | "1 http://query.nytimes.com/gst/fullpage.html?res... \n", 169 | "2 http://www.childline.org.uk/Explore/SexRelatio... \n", 170 | "3 http://www.ctt.org/resource_centre/getting_sta... \n", 171 | "4 http://www.environment.ucla.edu/water/news/art... \n", 172 | "\n", 173 | " file_path language language_score \\\n", 174 | "0 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.974320 \n", 175 | "1 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.961459 \n", 176 | "2 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.966757 \n", 177 | "3 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.910602 \n", 178 | "4 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.924981 \n", 179 | "\n", 180 | " token_count score int_score \n", 181 | "0 845 2.750000 3 \n", 182 | "1 1055 2.562500 3 \n", 183 | "2 136 3.125000 3 \n", 184 | "3 3479 3.234375 3 \n", 185 | "4 1115 2.812500 3 " 186 | ] 187 | }, 188 | "execution_count": 11, 189 | "metadata": {}, 190 | "output_type": "execute_result" 191 | } 192 | ], 193 | "source": [ 194 | "df10k.head()" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 12, 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "import latentscope as ls" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 13, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "Initialized env with data directory at /Users/enjalot/latent-scope-data\n" 216 | ] 217 | } 218 | ], 219 | "source": [ 220 | "ls.init(\"~/latent-scope-data\")" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 14, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "name": "stdout", 230 | "output_type": "stream", 231 | "text": [ 232 | "Loading environment variables from: /Users/enjalot/code/latent-testing/notebooks/.env\n", 233 | "DATA DIR /Users/enjalot/latent-scope-data\n", 234 | "DIRECTORY /Users/enjalot/latent-scope-data/fineweb-edu-10k\n", 235 | " text \\\n", 236 | "0 The Independent Jane\\nFor all the love, romanc... \n", 237 | "1 Taking Play Seriously\\nBy ROBIN MARANTZ HENIG\\... \n", 238 | "2 How do you get HIV?\\nHIV can be passed on when... \n", 239 | "3 CTComms sends on average 2 million emails mont... \n", 240 | "4 Hold the salt: UCLA engineers develop revoluti... \n", 241 | "\n", 242 | " id dump \\\n", 243 | "0 CC-MAIN-2013-20 \n", 244 | "1 CC-MAIN-2013-20 \n", 245 | "2 CC-MAIN-2013-20 \n", 246 | "3 CC-MAIN-2013-20 \n", 247 | "4 CC-MAIN-2013-20 \n", 248 | "\n", 249 | " url \\\n", 250 | "0 http://austenauthors.net/the-independent-jane \n", 251 | "1 http://query.nytimes.com/gst/fullpage.html?res... \n", 252 | "2 http://www.childline.org.uk/Explore/SexRelatio... \n", 253 | "3 http://www.ctt.org/resource_centre/getting_sta... \n", 254 | "4 http://www.environment.ucla.edu/water/news/art... 
\n", 255 | "\n", 256 | " file_path language language_score \\\n", 257 | "0 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.974320 \n", 258 | "1 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.961459 \n", 259 | "2 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.966757 \n", 260 | "3 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.910602 \n", 261 | "4 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.924981 \n", 262 | "\n", 263 | " token_count score int_score \n", 264 | "0 845 2.750000 3 \n", 265 | "1 1055 2.562500 3 \n", 266 | "2 136 3.125000 3 \n", 267 | "3 3479 3.234375 3 \n", 268 | "4 1115 2.812500 3 \n", 269 | " text \\\n", 270 | "9995 Here we have the inspiration for the movie tre... \n", 271 | "9996 Love and Logic Resource KitLove and Logic is a... \n", 272 | "9997 In the event of fire, people need to know exac... \n", 273 | "9998 It may be a small comfort to those planning th... \n", 274 | "9999 A 13-year-old middle school student is working... \n", 275 | "\n", 276 | " id dump \\\n", 277 | "9995 CC-MAIN-2017-26 \n", 278 | "9996 CC-MAIN-2017-26 \n", 279 | "9997 CC-MAIN-2017-26 \n", 280 | "9998 CC-MAIN-2017-26 \n", 281 | "9999 CC-MAIN-2017-26 \n", 282 | "\n", 283 | " url \\\n", 284 | "9995 https://www.hamahamaoysters.com/blogs/learn/18... \n", 285 | "9996 http://holly.rpes.schoolfusion.us/modules/cms/... \n", 286 | "9997 http://churchsafety.org.uk/information/fire/f_... \n", 287 | "9998 http://insideindustrynews.com/curiosity-gives-... \n", 288 | "9999 http://juneauempire.com/stories/120505/loc_200... \n", 289 | "\n", 290 | " file_path language \\\n", 291 | "9995 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n", 292 | "9996 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n", 293 | "9997 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n", 294 | "9998 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n", 295 | "9999 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... 
en \n", 296 | "\n", 297 | " language_score token_count score int_score \n", 298 | "9995 0.961133 368 2.875000 3 \n", 299 | "9996 0.895080 249 2.828125 3 \n", 300 | "9997 0.960923 1081 3.171875 3 \n", 301 | "9998 0.938971 141 2.968750 3 \n", 302 | "9999 0.981334 1131 2.859375 3 \n", 303 | "Index(['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score',\n", 304 | " 'token_count', 'score', 'int_score'],\n", 305 | " dtype='object')\n", 306 | "wrote /Users/enjalot/latent-scope-data/fineweb-edu-10k/input.parquet\n" 307 | ] 308 | } 309 | ], 310 | "source": [ 311 | "ls.ingest(\"fineweb-edu-10k\", df10k, \"text\")\n", 312 | "\n" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 17, 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [ 321 | "dataset100k = dataset.remove_columns([\"url\", \"file_path\", \"language_score\"])\n", 322 | "dataset_head100k = dataset100k.take(100000)\n" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 18, 328 | "metadata": {}, 329 | "outputs": [], 330 | "source": [ 331 | "df100k = pd.DataFrame(list(dataset_head100k))" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 19, 337 | "metadata": {}, 338 | "outputs": [ 339 | { 340 | "name": "stdout", 341 | "output_type": "stream", 342 | "text": [ 343 | "Loading environment variables from: /Users/enjalot/code/latent-testing/notebooks/.env\n", 344 | "DATA DIR /Users/enjalot/latent-scope-data\n", 345 | "DIRECTORY /Users/enjalot/latent-scope-data/fineweb-edu-100k\n", 346 | " text \\\n", 347 | "0 The Independent Jane\\nFor all the love, romanc... \n", 348 | "1 Taking Play Seriously\\nBy ROBIN MARANTZ HENIG\\... \n", 349 | "2 How do you get HIV?\\nHIV can be passed on when... \n", 350 | "3 CTComms sends on average 2 million emails mont... \n", 351 | "4 Hold the salt: UCLA engineers develop revoluti... \n", 352 | "\n", 353 | " id dump language \\\n", 354 | "0 CC-MAIN-2013-20 en \n", 355 | "1 CC-MAIN-2013-20 en \n", 356 | "2 CC-MAIN-2013-20 en \n", 357 | "3 CC-MAIN-2013-20 en \n", 358 | "4 CC-MAIN-2013-20 en \n", 359 | "\n", 360 | " token_count score int_score \n", 361 | "0 845 2.750000 3 \n", 362 | "1 1055 2.562500 3 \n", 363 | "2 136 3.125000 3 \n", 364 | "3 3479 3.234375 3 \n", 365 | "4 1115 2.812500 3 \n", 366 | " text \\\n", 367 | "99995 Avoid the extreme, but beware of household can... \n", 368 | "99996 The Gospel of Luke is the third of the four ca... \n", 369 | "99997 It's is short for it is or it has.\\nIts is the... \n", 370 | "99998 As more and more users gain access to the web,... \n", 371 | "99999 Equipping students to successfully navigate th... 
\n", 372 | "\n", 373 | " id dump \\\n", 374 | "99995 CC-MAIN-2013-20 \n", 375 | "99996 CC-MAIN-2013-20 \n", 376 | "99997 CC-MAIN-2013-20 \n", 377 | "99998 CC-MAIN-2013-20 \n", 378 | "99999 CC-MAIN-2013-20 \n", 379 | "\n", 380 | " language token_count score int_score \n", 381 | "99995 en 377 2.531250 3 \n", 382 | "99996 en 1755 3.125000 3 \n", 383 | "99997 en 573 2.828125 3 \n", 384 | "99998 en 648 2.750000 3 \n", 385 | "99999 en 1053 3.578125 4 \n", 386 | "Index(['text', 'id', 'dump', 'language', 'token_count', 'score', 'int_score'], dtype='object')\n", 387 | "wrote /Users/enjalot/latent-scope-data/fineweb-edu-100k/input.parquet\n" 388 | ] 389 | } 390 | ], 391 | "source": [ 392 | "ls.ingest(\"fineweb-edu-100k\", df100k, \"text\")" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": 1, 398 | "metadata": {}, 399 | "outputs": [], 400 | "source": [ 401 | "broken_ids = [] # can put some ids here to check" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 21, 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "filtered_df100k = df100k[df100k['id'].isin(broken_ids)]\n" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 22, 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "data": { 420 | "text/plain": [ 421 | "(512, 7)" 422 | ] 423 | }, 424 | "execution_count": 22, 425 | "metadata": {}, 426 | "output_type": "execute_result" 427 | } 428 | ], 429 | "source": [ 430 | "filtered_df100k.shape" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 23, 436 | "metadata": {}, 437 | "outputs": [ 438 | { 439 | "data": { 440 | "text/html": [ 441 | "
\n", 442 | "\n", 455 | "\n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | "
textiddumplanguagetoken_countscoreint_score
52736The two-button remote control is a very versat...<urn:uuid:52418580-9004-4afd-b9d6-7c991f761b06>CC-MAIN-2013-20en2223.0000003
52737for National Geographic News\\nA new population...<urn:uuid:3dc4e72f-a8c2-48b9-bae3-ea8e374b4462>CC-MAIN-2013-20en4223.8750004
52738The right to access the various documents of t...<urn:uuid:ebf0a847-f329-441d-a625-47eabeb0e52f>CC-MAIN-2013-20en19593.6718754
52739Product Type: Open-file Report\\nAuthor(s): Ell...<urn:uuid:5cd69abe-263e-4246-b7ee-3641dc5b4c17>CC-MAIN-2013-20en5302.7187503
52740BOISE, Idaho – An invasive insect commonly fou...<urn:uuid:528c738a-fa9c-4858-8683-878e210185c4>CC-MAIN-2013-20en3822.6562503
\n", 521 | "
" 522 | ], 523 | "text/plain": [ 524 | " text \\\n", 525 | "52736 The two-button remote control is a very versat... \n", 526 | "52737 for National Geographic News\\nA new population... \n", 527 | "52738 The right to access the various documents of t... \n", 528 | "52739 Product Type: Open-file Report\\nAuthor(s): Ell... \n", 529 | "52740 BOISE, Idaho – An invasive insect commonly fou... \n", 530 | "\n", 531 | " id dump \\\n", 532 | "52736 CC-MAIN-2013-20 \n", 533 | "52737 CC-MAIN-2013-20 \n", 534 | "52738 CC-MAIN-2013-20 \n", 535 | "52739 CC-MAIN-2013-20 \n", 536 | "52740 CC-MAIN-2013-20 \n", 537 | "\n", 538 | " language token_count score int_score \n", 539 | "52736 en 222 3.000000 3 \n", 540 | "52737 en 422 3.875000 4 \n", 541 | "52738 en 1959 3.671875 4 \n", 542 | "52739 en 530 2.718750 3 \n", 543 | "52740 en 382 2.656250 3 " 544 | ] 545 | }, 546 | "execution_count": 23, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "filtered_df100k.head()" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": 24, 558 | "metadata": {}, 559 | "outputs": [ 560 | { 561 | "data": { 562 | "text/plain": [ 563 | "[222,\n", 564 | " 422,\n", 565 | " 1959,\n", 566 | " 530,\n", 567 | " 382,\n", 568 | " 11986,\n", 569 | " 129,\n", 570 | " 652,\n", 571 | " 329,\n", 572 | " 4472,\n", 573 | " 1046,\n", 574 | " 453,\n", 575 | " 212,\n", 576 | " 473,\n", 577 | " 1503,\n", 578 | " 356,\n", 579 | " 307,\n", 580 | " 245,\n", 581 | " 420,\n", 582 | " 761,\n", 583 | " 392,\n", 584 | " 1327,\n", 585 | " 284,\n", 586 | " 2369,\n", 587 | " 170,\n", 588 | " 198,\n", 589 | " 1128,\n", 590 | " 592,\n", 591 | " 488,\n", 592 | " 267,\n", 593 | " 1440,\n", 594 | " 496,\n", 595 | " 373,\n", 596 | " 2140,\n", 597 | " 844,\n", 598 | " 250,\n", 599 | " 229,\n", 600 | " 597,\n", 601 | " 858,\n", 602 | " 219,\n", 603 | " 381,\n", 604 | " 787,\n", 605 | " 784,\n", 606 | " 811,\n", 607 | " 124,\n", 608 | " 251,\n", 609 | " 493,\n", 610 | " 257,\n", 611 | " 313,\n", 612 | " 619,\n", 613 | " 593,\n", 614 | " 528,\n", 615 | " 581,\n", 616 | " 707,\n", 617 | " 192,\n", 618 | " 755,\n", 619 | " 207,\n", 620 | " 885,\n", 621 | " 187,\n", 622 | " 1141,\n", 623 | " 1089,\n", 624 | " 975,\n", 625 | " 630,\n", 626 | " 306,\n", 627 | " 767,\n", 628 | " 353,\n", 629 | " 143,\n", 630 | " 774,\n", 631 | " 465,\n", 632 | " 870,\n", 633 | " 9691,\n", 634 | " 393,\n", 635 | " 429,\n", 636 | " 541,\n", 637 | " 671,\n", 638 | " 219,\n", 639 | " 599,\n", 640 | " 682,\n", 641 | " 561,\n", 642 | " 704,\n", 643 | " 788,\n", 644 | " 374,\n", 645 | " 334,\n", 646 | " 398,\n", 647 | " 348,\n", 648 | " 693,\n", 649 | " 611,\n", 650 | " 274,\n", 651 | " 753,\n", 652 | " 1326,\n", 653 | " 521,\n", 654 | " 1686,\n", 655 | " 747,\n", 656 | " 470,\n", 657 | " 332,\n", 658 | " 2011,\n", 659 | " 727,\n", 660 | " 23407,\n", 661 | " 464,\n", 662 | " 175,\n", 663 | " 751,\n", 664 | " 428,\n", 665 | " 148,\n", 666 | " 425,\n", 667 | " 200,\n", 668 | " 283,\n", 669 | " 642,\n", 670 | " 700,\n", 671 | " 771,\n", 672 | " 859,\n", 673 | " 547,\n", 674 | " 230,\n", 675 | " 1425,\n", 676 | " 1212,\n", 677 | " 680,\n", 678 | " 863,\n", 679 | " 108,\n", 680 | " 345,\n", 681 | " 187,\n", 682 | " 363,\n", 683 | " 2336,\n", 684 | " 3878,\n", 685 | " 631,\n", 686 | " 281,\n", 687 | " 256,\n", 688 | " 1811,\n", 689 | " 438,\n", 690 | " 1122,\n", 691 | " 1205,\n", 692 | " 3044,\n", 693 | " 978,\n", 694 | " 1199,\n", 695 | " 2367,\n", 696 | " 1791,\n", 697 | " 832,\n", 698 | " 608,\n", 699 | " 774,\n", 700 | " 
456,\n", 701 | " 275,\n", 702 | " 569,\n", 703 | " 1537,\n", 704 | " 5759,\n", 705 | " 889,\n", 706 | " 317,\n", 707 | " 248,\n", 708 | " 360,\n", 709 | " 3122,\n", 710 | " 1723,\n", 711 | " 429,\n", 712 | " 920,\n", 713 | " 747,\n", 714 | " 271,\n", 715 | " 851,\n", 716 | " 2007,\n", 717 | " 161,\n", 718 | " 1054,\n", 719 | " 484,\n", 720 | " 936,\n", 721 | " 700,\n", 722 | " 257,\n", 723 | " 1191,\n", 724 | " 218,\n", 725 | " 443,\n", 726 | " 866,\n", 727 | " 717,\n", 728 | " 348,\n", 729 | " 1402,\n", 730 | " 467,\n", 731 | " 2245,\n", 732 | " 122,\n", 733 | " 812,\n", 734 | " 670,\n", 735 | " 413,\n", 736 | " 1831,\n", 737 | " 2151,\n", 738 | " 367,\n", 739 | " 537,\n", 740 | " 983,\n", 741 | " 348,\n", 742 | " 3545,\n", 743 | " 887,\n", 744 | " 184,\n", 745 | " 204,\n", 746 | " 980,\n", 747 | " 227,\n", 748 | " 798,\n", 749 | " 408,\n", 750 | " 374,\n", 751 | " 243,\n", 752 | " 1821,\n", 753 | " 249,\n", 754 | " 432,\n", 755 | " 560,\n", 756 | " 334,\n", 757 | " 1389,\n", 758 | " 890,\n", 759 | " 346,\n", 760 | " 524,\n", 761 | " 313,\n", 762 | " 528,\n", 763 | " 154,\n", 764 | " 261,\n", 765 | " 1890,\n", 766 | " 471,\n", 767 | " 3951,\n", 768 | " 461,\n", 769 | " 595,\n", 770 | " 320,\n", 771 | " 676,\n", 772 | " 1002,\n", 773 | " 1871,\n", 774 | " 370,\n", 775 | " 4132,\n", 776 | " 996,\n", 777 | " 435,\n", 778 | " 1010,\n", 779 | " 308,\n", 780 | " 288,\n", 781 | " 484,\n", 782 | " 368,\n", 783 | " 405,\n", 784 | " 378,\n", 785 | " 514,\n", 786 | " 895,\n", 787 | " 232,\n", 788 | " 110,\n", 789 | " 374,\n", 790 | " 433,\n", 791 | " 788,\n", 792 | " 403,\n", 793 | " 1217,\n", 794 | " 849,\n", 795 | " 333,\n", 796 | " 126,\n", 797 | " 324,\n", 798 | " 977,\n", 799 | " 295,\n", 800 | " 1629,\n", 801 | " 319,\n", 802 | " 350,\n", 803 | " 128,\n", 804 | " 754,\n", 805 | " 779,\n", 806 | " 314,\n", 807 | " 604,\n", 808 | " 391,\n", 809 | " 242,\n", 810 | " 403,\n", 811 | " 1291,\n", 812 | " 112,\n", 813 | " 263,\n", 814 | " 128,\n", 815 | " 1620,\n", 816 | " 543,\n", 817 | " 800,\n", 818 | " 973,\n", 819 | " 552,\n", 820 | " 244,\n", 821 | " 628,\n", 822 | " 418,\n", 823 | " 428,\n", 824 | " 412,\n", 825 | " 809,\n", 826 | " 240,\n", 827 | " 940,\n", 828 | " 747,\n", 829 | " 6330,\n", 830 | " 469,\n", 831 | " 770,\n", 832 | " 188,\n", 833 | " 952,\n", 834 | " 1575,\n", 835 | " 790,\n", 836 | " 1178,\n", 837 | " 439,\n", 838 | " 4270,\n", 839 | " 834,\n", 840 | " 527,\n", 841 | " 206,\n", 842 | " 683,\n", 843 | " 541,\n", 844 | " 257,\n", 845 | " 191,\n", 846 | " 390,\n", 847 | " 267,\n", 848 | " 316,\n", 849 | " 1029,\n", 850 | " 233,\n", 851 | " 261,\n", 852 | " 3734,\n", 853 | " 799,\n", 854 | " 275,\n", 855 | " 388,\n", 856 | " 1718,\n", 857 | " 6228,\n", 858 | " 188,\n", 859 | " 367,\n", 860 | " 648,\n", 861 | " 1717,\n", 862 | " 1196,\n", 863 | " 639,\n", 864 | " 1904,\n", 865 | " 1107,\n", 866 | " 1127,\n", 867 | " 414,\n", 868 | " 341,\n", 869 | " 936,\n", 870 | " 124,\n", 871 | " 704,\n", 872 | " 359,\n", 873 | " 631,\n", 874 | " 771,\n", 875 | " 853,\n", 876 | " 892,\n", 877 | " 796,\n", 878 | " 302,\n", 879 | " 2938,\n", 880 | " 289,\n", 881 | " 1287,\n", 882 | " 3105,\n", 883 | " 3493,\n", 884 | " 812,\n", 885 | " 1861,\n", 886 | " 425,\n", 887 | " 475,\n", 888 | " 348,\n", 889 | " 241,\n", 890 | " 2461,\n", 891 | " 1359,\n", 892 | " 755,\n", 893 | " 741,\n", 894 | " 205,\n", 895 | " 145,\n", 896 | " 380,\n", 897 | " 1028,\n", 898 | " 364,\n", 899 | " 553,\n", 900 | " 301,\n", 901 | " 770,\n", 902 | " 319,\n", 903 | " 208,\n", 904 | " 1006,\n", 905 | " 559,\n", 906 | " 
334,\n", 907 | " 399,\n", 908 | " 1010,\n", 909 | " 162,\n", 910 | " 528,\n", 911 | " 1272,\n", 912 | " 348,\n", 913 | " 1823,\n", 914 | " 1690,\n", 915 | " 1991,\n", 916 | " 472,\n", 917 | " 2442,\n", 918 | " 461,\n", 919 | " 1204,\n", 920 | " 738,\n", 921 | " 267,\n", 922 | " 943,\n", 923 | " 680,\n", 924 | " 3376,\n", 925 | " 804,\n", 926 | " 701,\n", 927 | " 1482,\n", 928 | " 283,\n", 929 | " 466,\n", 930 | " 533,\n", 931 | " 170,\n", 932 | " 880,\n", 933 | " 2902,\n", 934 | " 980,\n", 935 | " 434,\n", 936 | " 1280,\n", 937 | " 580,\n", 938 | " 229,\n", 939 | " 84,\n", 940 | " 257,\n", 941 | " 286,\n", 942 | " 175,\n", 943 | " 198,\n", 944 | " 2043,\n", 945 | " 335,\n", 946 | " 240,\n", 947 | " 1517,\n", 948 | " 5200,\n", 949 | " 539,\n", 950 | " 1022,\n", 951 | " 11524,\n", 952 | " 187,\n", 953 | " 158,\n", 954 | " 658,\n", 955 | " 165,\n", 956 | " 283,\n", 957 | " 736,\n", 958 | " 195,\n", 959 | " 871,\n", 960 | " 801,\n", 961 | " 178,\n", 962 | " 1267,\n", 963 | " 112,\n", 964 | " 717,\n", 965 | " 327,\n", 966 | " 846,\n", 967 | " 253,\n", 968 | " 520,\n", 969 | " 101,\n", 970 | " 626,\n", 971 | " 945,\n", 972 | " 454,\n", 973 | " 254,\n", 974 | " 775,\n", 975 | " 520,\n", 976 | " 753,\n", 977 | " 2658,\n", 978 | " 2021,\n", 979 | " 855,\n", 980 | " 3316,\n", 981 | " 2032,\n", 982 | " 8629,\n", 983 | " 762,\n", 984 | " 3730,\n", 985 | " 1576,\n", 986 | " 328,\n", 987 | " 1115,\n", 988 | " 496,\n", 989 | " 770,\n", 990 | " 143,\n", 991 | " 133,\n", 992 | " 743,\n", 993 | " 348,\n", 994 | " 214,\n", 995 | " 580,\n", 996 | " 2310,\n", 997 | " 204,\n", 998 | " 312,\n", 999 | " 815,\n", 1000 | " 417,\n", 1001 | " 843,\n", 1002 | " 329,\n", 1003 | " 3034,\n", 1004 | " 410,\n", 1005 | " 672,\n", 1006 | " 225,\n", 1007 | " 673,\n", 1008 | " 415,\n", 1009 | " 1475,\n", 1010 | " 444,\n", 1011 | " 780,\n", 1012 | " 497,\n", 1013 | " 586,\n", 1014 | " 1161,\n", 1015 | " 1608,\n", 1016 | " 752,\n", 1017 | " 600,\n", 1018 | " 1645,\n", 1019 | " 155,\n", 1020 | " 56446,\n", 1021 | " 562,\n", 1022 | " 513,\n", 1023 | " 6647,\n", 1024 | " 660,\n", 1025 | " 112,\n", 1026 | " 1539,\n", 1027 | " 1220,\n", 1028 | " 1281,\n", 1029 | " 741,\n", 1030 | " 1078,\n", 1031 | " 474,\n", 1032 | " 864,\n", 1033 | " 182,\n", 1034 | " 244,\n", 1035 | " 1278,\n", 1036 | " 1056,\n", 1037 | " 647,\n", 1038 | " 358,\n", 1039 | " 535,\n", 1040 | " 2641,\n", 1041 | " 364,\n", 1042 | " 413,\n", 1043 | " 720,\n", 1044 | " 976,\n", 1045 | " 510,\n", 1046 | " 686,\n", 1047 | " 427,\n", 1048 | " 2311,\n", 1049 | " 238,\n", 1050 | " 4432,\n", 1051 | " 277,\n", 1052 | " 356,\n", 1053 | " 665,\n", 1054 | " 311,\n", 1055 | " 886,\n", 1056 | " 1529,\n", 1057 | " 1467,\n", 1058 | " 305,\n", 1059 | " 350,\n", 1060 | " 1839,\n", 1061 | " 316,\n", 1062 | " 1613,\n", 1063 | " 229,\n", 1064 | " 198,\n", 1065 | " 1235,\n", 1066 | " 2633,\n", 1067 | " 809,\n", 1068 | " 4255,\n", 1069 | " 1864,\n", 1070 | " 606,\n", 1071 | " 497,\n", 1072 | " 793,\n", 1073 | " 1371,\n", 1074 | " 1703]" 1075 | ] 1076 | }, 1077 | "execution_count": 24, 1078 | "metadata": {}, 1079 | "output_type": "execute_result" 1080 | } 1081 | ], 1082 | "source": [ 1083 | "filtered_df100k[\"token_count\"].to_list()" 1084 | ] 1085 | }, 1086 | { 1087 | "cell_type": "code", 1088 | "execution_count": 27, 1089 | "metadata": {}, 1090 | "outputs": [ 1091 | { 1092 | "name": "stderr", 1093 | "output_type": "stream", 1094 | "text": [ 1095 | "/var/folders/sx/rrvr6l_d5x1_g46jxlx5ypfc0000gn/T/ipykernel_76251/2136929811.py:1: SettingWithCopyWarning: \n", 1096 | "A value is trying to be 
set on a copy of a slice from a DataFrame.\n", 1097 | "Try using .loc[row_indexer,col_indexer] = value instead\n", 1098 | "\n", 1099 | "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", 1100 | " filtered_df100k['text_length'] = filtered_df100k['text'].apply(len)\n" 1101 | ] 1102 | }, 1103 | { 1104 | "data": { 1105 | "text/html": [ 1106 | "
\n", 1107 | "\n", 1120 | "\n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | "
token_counttext_length
52736222974
527374221881
52738195910870
527395302720
527403821869
.........
532436062935
532444972507
532457933819
5324613716762
5324717038438
\n", 1186 | "

512 rows × 2 columns

\n", 1187 | "
" 1188 | ], 1189 | "text/plain": [ 1190 | " token_count text_length\n", 1191 | "52736 222 974\n", 1192 | "52737 422 1881\n", 1193 | "52738 1959 10870\n", 1194 | "52739 530 2720\n", 1195 | "52740 382 1869\n", 1196 | "... ... ...\n", 1197 | "53243 606 2935\n", 1198 | "53244 497 2507\n", 1199 | "53245 793 3819\n", 1200 | "53246 1371 6762\n", 1201 | "53247 1703 8438\n", 1202 | "\n", 1203 | "[512 rows x 2 columns]" 1204 | ] 1205 | }, 1206 | "execution_count": 27, 1207 | "metadata": {}, 1208 | "output_type": "execute_result" 1209 | } 1210 | ], 1211 | "source": [ 1212 | "filtered_df100k['text_length'] = filtered_df100k['text'].apply(len)\n", 1213 | "filtered_df100k[['token_count', 'text_length']]\n" 1214 | ] 1215 | }, 1216 | { 1217 | "cell_type": "code", 1218 | "execution_count": 44, 1219 | "metadata": {}, 1220 | "outputs": [], 1221 | "source": [ 1222 | "df100k['text_length'] = df100k['text'].apply(len)\n", 1223 | "sorted_df = df100k.sort_values(by='token_count', ascending=False)\n", 1224 | "sorted_df = sorted_df[sorted_df[\"text_length\"] > 10000]\n" 1225 | ] 1226 | }, 1227 | { 1228 | "cell_type": "code", 1229 | "execution_count": 47, 1230 | "metadata": {}, 1231 | "outputs": [ 1232 | { 1233 | "data": { 1234 | "text/plain": [ 1235 | "(8444, 8)" 1236 | ] 1237 | }, 1238 | "execution_count": 47, 1239 | "metadata": {}, 1240 | "output_type": "execute_result" 1241 | } 1242 | ], 1243 | "source": [ 1244 | "sorted_df.shape" 1245 | ] 1246 | }, 1247 | { 1248 | "cell_type": "code", 1249 | "execution_count": 48, 1250 | "metadata": {}, 1251 | "outputs": [ 1252 | { 1253 | "name": "stdout", 1254 | "output_type": "stream", 1255 | "text": [ 1256 | "The smallest text_length where token_count is more than 8192 is: 2385\n" 1257 | ] 1258 | } 1259 | ], 1260 | "source": [ 1261 | "# Filter the DataFrame to find entries where token_count is more than 8000\n", 1262 | "high_token_count_df = df100k[df100k['token_count'] > 2048]\n", 1263 | "\n", 1264 | "# Find the minimum text_length from the filtered DataFrame\n", 1265 | "min_text_length = high_token_count_df['text_length'].min()\n", 1266 | "\n", 1267 | "# Print the result\n", 1268 | "print(\"The smallest text_length where token_count is more than 8192 is:\", min_text_length)\n" 1269 | ] 1270 | }, 1271 | { 1272 | "cell_type": "code", 1273 | "execution_count": 46, 1274 | "metadata": {}, 1275 | "outputs": [ 1276 | { 1277 | "data": { 1278 | "text/html": [ 1279 | "
\n", 1280 | "\n", 1293 | "\n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | " \n", 1375 | " \n", 1376 | " \n", 1377 | " \n", 1378 | " \n", 1379 | " \n", 1380 | " \n", 1381 | " \n", 1382 | " \n", 1383 | " \n", 1384 | " \n", 1385 | " \n", 1386 | " \n", 1387 | " \n", 1388 | " \n", 1389 | " \n", 1390 | " \n", 1391 | " \n", 1392 | " \n", 1393 | " \n", 1394 | " \n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | " \n", 1423 | " \n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | " \n", 1430 | " \n", 1431 | " \n", 1432 | " \n", 1433 | " \n", 1434 | " \n", 1435 | " \n", 1436 | " \n", 1437 | " \n", 1438 | " \n", 1439 | " \n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | " \n", 1461 | " \n", 1462 | " \n", 1463 | " \n", 1464 | " \n", 1465 | " \n", 1466 | " \n", 1467 | " \n", 1468 | " \n", 1469 | " \n", 1470 | " \n", 1471 | " \n", 1472 | " \n", 1473 | " \n", 1474 | " \n", 1475 | " \n", 1476 | " \n", 1477 | " \n", 1478 | " \n", 1479 | " \n", 1480 | " \n", 1481 | " \n", 1482 | " \n", 1483 | " \n", 1484 | " \n", 1485 | " \n", 1486 | " \n", 1487 | " \n", 1488 | " \n", 1489 | " \n", 1490 | " \n", 1491 | " \n", 1492 | " \n", 1493 | " \n", 1494 | " \n", 1495 | " \n", 1496 | " \n", 1497 | " \n", 1498 | " \n", 1499 | " \n", 1500 | " \n", 1501 | " \n", 1502 | " \n", 1503 | " \n", 1504 | " \n", 1505 | " \n", 1506 | " \n", 1507 | " \n", 1508 | " \n", 1509 | " \n", 1510 | " \n", 1511 | " \n", 1512 | " \n", 1513 | " \n", 1514 | " \n", 1515 | " \n", 1516 | " \n", 1517 | " \n", 1518 | " \n", 1519 | " \n", 1520 | " \n", 1521 | " \n", 1522 | " \n", 1523 | " \n", 1524 | " \n", 1525 | " \n", 1526 | " \n", 1527 | " \n", 1528 | " \n", 1529 | " \n", 1530 | " \n", 1531 | " \n", 1532 | " \n", 1533 | " \n", 1534 | " \n", 1535 | " \n", 1536 | " \n", 1537 | " \n", 1538 | " \n", 1539 | " \n", 1540 | " \n", 1541 | " \n", 1542 | " \n", 1543 | " \n", 1544 | " \n", 1545 | " 
\n", 1546 | " \n", 1547 | " \n", 1548 | " \n", 1549 | " \n", 1550 | " \n", 1551 | " \n", 1552 | " \n", 1553 | " \n", 1554 | " \n", 1555 | " \n", 1556 | " \n", 1557 | " \n", 1558 | " \n", 1559 | " \n", 1560 | " \n", 1561 | " \n", 1562 | " \n", 1563 | " \n", 1564 | " \n", 1565 | " \n", 1566 | " \n", 1567 | " \n", 1568 | " \n", 1569 | " \n", 1570 | " \n", 1571 | " \n", 1572 | " \n", 1573 | " \n", 1574 | " \n", 1575 | " \n", 1576 | " \n", 1577 | " \n", 1578 | " \n", 1579 | " \n", 1580 | " \n", 1581 | " \n", 1582 | " \n", 1583 | " \n", 1584 | " \n", 1585 | " \n", 1586 | " \n", 1587 | " \n", 1588 | " \n", 1589 | " \n", 1590 | " \n", 1591 | " \n", 1592 | " \n", 1593 | " \n", 1594 | " \n", 1595 | " \n", 1596 | " \n", 1597 | " \n", 1598 | " \n", 1599 | " \n", 1600 | " \n", 1601 | " \n", 1602 | " \n", 1603 | " \n", 1604 | " \n", 1605 | " \n", 1606 | " \n", 1607 | " \n", 1608 | " \n", 1609 | " \n", 1610 | " \n", 1611 | " \n", 1612 | " \n", 1613 | " \n", 1614 | " \n", 1615 | " \n", 1616 | " \n", 1617 | " \n", 1618 | " \n", 1619 | " \n", 1620 | " \n", 1621 | " \n", 1622 | " \n", 1623 | " \n", 1624 | " \n", 1625 | " \n", 1626 | " \n", 1627 | " \n", 1628 | " \n", 1629 | " \n", 1630 | " \n", 1631 | " \n", 1632 | " \n", 1633 | " \n", 1634 | " \n", 1635 | " \n", 1636 | " \n", 1637 | " \n", 1638 | " \n", 1639 | " \n", 1640 | " \n", 1641 | " \n", 1642 | " \n", 1643 | " \n", 1644 | " \n", 1645 | " \n", 1646 | " \n", 1647 | " \n", 1648 | " \n", 1649 | " \n", 1650 | " \n", 1651 | " \n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | " \n", 1656 | " \n", 1657 | " \n", 1658 | " \n", 1659 | " \n", 1660 | " \n", 1661 | " \n", 1662 | " \n", 1663 | " \n", 1664 | " \n", 1665 | " \n", 1666 | " \n", 1667 | " \n", 1668 | " \n", 1669 | " \n", 1670 | " \n", 1671 | " \n", 1672 | " \n", 1673 | " \n", 1674 | " \n", 1675 | " \n", 1676 | " \n", 1677 | " \n", 1678 | " \n", 1679 | " \n", 1680 | " \n", 1681 | " \n", 1682 | " \n", 1683 | " \n", 1684 | " \n", 1685 | " \n", 1686 | " \n", 1687 | " \n", 1688 | " \n", 1689 | " \n", 1690 | " \n", 1691 | " \n", 1692 | " \n", 1693 | " \n", 1694 | " \n", 1695 | " \n", 1696 | " \n", 1697 | " \n", 1698 | " \n", 1699 | " \n", 1700 | " \n", 1701 | " \n", 1702 | " \n", 1703 | " \n", 1704 | " \n", 1705 | " \n", 1706 | " \n", 1707 | " \n", 1708 | " \n", 1709 | " \n", 1710 | " \n", 1711 | " \n", 1712 | " \n", 1713 | " \n", 1714 | " \n", 1715 | " \n", 1716 | " \n", 1717 | " \n", 1718 | " \n", 1719 | " \n", 1720 | " \n", 1721 | " \n", 1722 | " \n", 1723 | " \n", 1724 | " \n", 1725 | " \n", 1726 | " \n", 1727 | " \n", 1728 | " \n", 1729 | " \n", 1730 | " \n", 1731 | " \n", 1732 | " \n", 1733 | " \n", 1734 | " \n", 1735 | " \n", 1736 | " \n", 1737 | " \n", 1738 | " \n", 1739 | " \n", 1740 | " \n", 1741 | " \n", 1742 | " \n", 1743 | " \n", 1744 | " \n", 1745 | " \n", 1746 | " \n", 1747 | " \n", 1748 | " \n", 1749 | " \n", 1750 | " \n", 1751 | " \n", 1752 | " \n", 1753 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " \n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | " \n", 1766 | " \n", 1767 | " \n", 1768 | " \n", 1769 | " \n", 1770 | " \n", 1771 | " \n", 1772 | " \n", 1773 | " \n", 1774 | " \n", 1775 | " \n", 1776 | " \n", 1777 | " \n", 1778 | " \n", 1779 | " \n", 1780 | " \n", 1781 | " \n", 1782 | " \n", 1783 | " \n", 1784 | " \n", 1785 | " \n", 1786 | " \n", 1787 | " \n", 1788 | " \n", 1789 | " \n", 1790 | " \n", 1791 | " \n", 1792 | " \n", 1793 | " \n", 1794 | " \n", 1795 | " \n", 1796 | " \n", 1797 | " \n", 1798 | " \n", 1799 | 
" \n", 1800 | " \n", 1801 | " \n", 1802 | " \n", 1803 | "
token_counttext_length
57385104023485818
8741101566485117
2646283662336015
4873881132302843
2254569087306273
5984268874328275
6690064344328697
5049359206272464
5319356446280809
2514646862203838
9891546664238568
6348446596217907
2636346018206822
2048144657165152
1975243150207965
2101142434190152
1993241801157942
7599841636197039
5407241160205458
7544640872173387
2158740787171901
6119839787165134
4189738832184465
3728038682179243
6043838664168148
5271237943180434
4951336064144425
1077035178161248
3087534214156451
2141633794108263
7688333370151322
6191332132145142
8267432022110127
2587931153139767
5024130879141398
9128530841126778
8593230799123451
6756430505110449
2380430502133703
6445330140122132
2049129787147622
9881029474104333
2377929404109206
1247628799118054
8479128792126564
1653628782134643
677028745139452
6449228731151428
5569328615126376
9663528603128460
8703528458126872
9737228073128217
1696627827121963
5428227622110453
6442227399126250
4309526805127607
1122326774107899
93826697124276
721302661695360
3581526602100388
6091026557130256
5372926423123570
2187926392116157
9746726192114295
191792597994335
7854425836110689
8618225721102530
7046325345100019
1972924953115631
9295624808110828
754902477698531
578232462493310
51502459386615
506524504110972
6487824310104386
8569924133115965
3508324130119805
11222409190348
4156023898101378
539892387456790
3344823852115978
2977923735120142
5445023715105685
3962923685101014
687423436110472
528332340790316
1448423357104001
16922333899255
233312313889524
823842312089432
691962308358244
1500822890107912
5836722855105143
262072283285131
781562280799823
3660422656103464
3205422652107154
2007522646109951
981222574101409
364652252798342
\n", 1804 | "
" 1805 | ], 1806 | "text/plain": [ 1807 | " token_count text_length\n", 1808 | "57385 104023 485818\n", 1809 | "8741 101566 485117\n", 1810 | "26462 83662 336015\n", 1811 | "48738 81132 302843\n", 1812 | "22545 69087 306273\n", 1813 | "59842 68874 328275\n", 1814 | "66900 64344 328697\n", 1815 | "50493 59206 272464\n", 1816 | "53193 56446 280809\n", 1817 | "25146 46862 203838\n", 1818 | "98915 46664 238568\n", 1819 | "63484 46596 217907\n", 1820 | "26363 46018 206822\n", 1821 | "20481 44657 165152\n", 1822 | "19752 43150 207965\n", 1823 | "21011 42434 190152\n", 1824 | "19932 41801 157942\n", 1825 | "75998 41636 197039\n", 1826 | "54072 41160 205458\n", 1827 | "75446 40872 173387\n", 1828 | "21587 40787 171901\n", 1829 | "61198 39787 165134\n", 1830 | "41897 38832 184465\n", 1831 | "37280 38682 179243\n", 1832 | "60438 38664 168148\n", 1833 | "52712 37943 180434\n", 1834 | "49513 36064 144425\n", 1835 | "10770 35178 161248\n", 1836 | "30875 34214 156451\n", 1837 | "21416 33794 108263\n", 1838 | "76883 33370 151322\n", 1839 | "61913 32132 145142\n", 1840 | "82674 32022 110127\n", 1841 | "25879 31153 139767\n", 1842 | "50241 30879 141398\n", 1843 | "91285 30841 126778\n", 1844 | "85932 30799 123451\n", 1845 | "67564 30505 110449\n", 1846 | "23804 30502 133703\n", 1847 | "64453 30140 122132\n", 1848 | "20491 29787 147622\n", 1849 | "98810 29474 104333\n", 1850 | "23779 29404 109206\n", 1851 | "12476 28799 118054\n", 1852 | "84791 28792 126564\n", 1853 | "16536 28782 134643\n", 1854 | "6770 28745 139452\n", 1855 | "64492 28731 151428\n", 1856 | "55693 28615 126376\n", 1857 | "96635 28603 128460\n", 1858 | "87035 28458 126872\n", 1859 | "97372 28073 128217\n", 1860 | "16966 27827 121963\n", 1861 | "54282 27622 110453\n", 1862 | "64422 27399 126250\n", 1863 | "43095 26805 127607\n", 1864 | "11223 26774 107899\n", 1865 | "938 26697 124276\n", 1866 | "72130 26616 95360\n", 1867 | "35815 26602 100388\n", 1868 | "60910 26557 130256\n", 1869 | "53729 26423 123570\n", 1870 | "21879 26392 116157\n", 1871 | "97467 26192 114295\n", 1872 | "19179 25979 94335\n", 1873 | "78544 25836 110689\n", 1874 | "86182 25721 102530\n", 1875 | "70463 25345 100019\n", 1876 | "19729 24953 115631\n", 1877 | "92956 24808 110828\n", 1878 | "75490 24776 98531\n", 1879 | "57823 24624 93310\n", 1880 | "5150 24593 86615\n", 1881 | "5065 24504 110972\n", 1882 | "64878 24310 104386\n", 1883 | "85699 24133 115965\n", 1884 | "35083 24130 119805\n", 1885 | "1122 24091 90348\n", 1886 | "41560 23898 101378\n", 1887 | "53989 23874 56790\n", 1888 | "33448 23852 115978\n", 1889 | "29779 23735 120142\n", 1890 | "54450 23715 105685\n", 1891 | "39629 23685 101014\n", 1892 | "6874 23436 110472\n", 1893 | "52833 23407 90316\n", 1894 | "14484 23357 104001\n", 1895 | "1692 23338 99255\n", 1896 | "23331 23138 89524\n", 1897 | "82384 23120 89432\n", 1898 | "69196 23083 58244\n", 1899 | "15008 22890 107912\n", 1900 | "58367 22855 105143\n", 1901 | "26207 22832 85131\n", 1902 | "78156 22807 99823\n", 1903 | "36604 22656 103464\n", 1904 | "32054 22652 107154\n", 1905 | "20075 22646 109951\n", 1906 | "9812 22574 101409\n", 1907 | "36465 22527 98342" 1908 | ] 1909 | }, 1910 | "metadata": {}, 1911 | "output_type": "display_data" 1912 | } 1913 | ], 1914 | "source": [ 1915 | "with pd.option_context('display.max_rows', None, 'display.max_columns', None):\n", 1916 | " display(sorted_df[['token_count', 'text_length']].head(100))\n" 1917 | ] 1918 | }, 1919 | { 1920 | "cell_type": "code", 1921 | "execution_count": null, 1922 | "metadata": {}, 1923 | 
"outputs": [], 1924 | "source": [] 1925 | } 1926 | ], 1927 | "metadata": { 1928 | "kernelspec": { 1929 | "display_name": "testing", 1930 | "language": "python", 1931 | "name": "python3" 1932 | }, 1933 | "language_info": { 1934 | "codemirror_mode": { 1935 | "name": "ipython", 1936 | "version": 3 1937 | }, 1938 | "file_extension": ".py", 1939 | "mimetype": "text/x-python", 1940 | "name": "python", 1941 | "nbconvert_exporter": "python", 1942 | "pygments_lexer": "ipython3", 1943 | "version": "3.11.6" 1944 | } 1945 | }, 1946 | "nbformat": 4, 1947 | "nbformat_minor": 2 1948 | } 1949 | -------------------------------------------------------------------------------- /notebooks/tokenizers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "/Users/enjalot/code/fineweb-modal/venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", 13 | " from .autonotebook import tqdm as notebook_tqdm\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "from transformers import AutoTokenizer\n", 19 | "import numpy as np\n", 20 | "from collections import Counter" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 2, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "\n", 30 | "def compare_tokenizers(text_samples):\n", 31 | " \"\"\"\n", 32 | " Compare tokenization results between BGE and Nomic tokenizers\n", 33 | " \n", 34 | " Args:\n", 35 | " text_samples: List of text strings to compare tokenization\n", 36 | " \n", 37 | " Returns:\n", 38 | " dict: Comparison statistics and analysis results\n", 39 | " \"\"\"\n", 40 | " # Load both tokenizers\n", 41 | " bge_tokenizer = AutoTokenizer.from_pretrained(\"BAAI/bge-base-en-v1.5\")\n", 42 | " nomic_tokenizer = AutoTokenizer.from_pretrained(\"nomic-ai/nomic-embed-text-v1.5\")\n", 43 | " \n", 44 | " results = {\n", 45 | " \"vocabulary_sizes\": {\n", 46 | " \"bge\": len(bge_tokenizer.vocab),\n", 47 | " \"nomic\": len(nomic_tokenizer.vocab),\n", 48 | " },\n", 49 | " \"samples\": []\n", 50 | " }\n", 51 | " \n", 52 | " # Compare tokenization for each sample\n", 53 | " for text in text_samples:\n", 54 | " bge_tokens = bge_tokenizer.tokenize(text)\n", 55 | " nomic_tokens = nomic_tokenizer.tokenize(text)\n", 56 | " \n", 57 | " # Get token counts\n", 58 | " bge_counts = Counter(bge_tokens)\n", 59 | " nomic_counts = Counter(nomic_tokens)\n", 60 | " \n", 61 | " # Compare token sequences\n", 62 | " sample_result = {\n", 63 | " \"text\": text,\n", 64 | " \"bge_tokens\": bge_tokens,\n", 65 | " \"nomic_tokens\": nomic_tokens,\n", 66 | " \"token_counts\": {\n", 67 | " \"bge\": len(bge_tokens),\n", 68 | " \"nomic\": len(nomic_tokens)\n", 69 | " },\n", 70 | " \"unique_tokens\": {\n", 71 | " \"bge\": len(bge_counts),\n", 72 | " \"nomic\": len(nomic_counts)\n", 73 | " },\n", 74 | " \"identical_tokenization\": bge_tokens == nomic_tokens\n", 75 | " }\n", 76 | " \n", 77 | " results[\"samples\"].append(sample_result)\n", 78 | " \n", 79 | " # Calculate overall statistics\n", 80 | " identical_count = sum(1 for r in results[\"samples\"] if r[\"identical_tokenization\"])\n", 81 | " results[\"overall_stats\"] = {\n", 82 | " \"total_samples\": len(text_samples),\n", 83 | " \"identical_tokenizations\": 
identical_count,\n", 84 | " \"identical_percentage\": (identical_count / len(text_samples)) * 100 if text_samples else 0\n", 85 | " }\n", 86 | " \n", 87 | " return results" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 3, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "\n", 97 | "def print_comparison_report(results):\n", 98 | " \"\"\"Print a formatted report of the tokenizer comparison results\"\"\"\n", 99 | " print(\"Tokenizer Comparison Report\")\n", 100 | " print(\"==========================\")\n", 101 | " print(f\"\\nVocabulary Sizes:\")\n", 102 | " print(f\"BGE: {results['vocabulary_sizes']['bge']:,} tokens\")\n", 103 | " print(f\"Nomic: {results['vocabulary_sizes']['nomic']:,} tokens\")\n", 104 | " \n", 105 | " print(f\"\\nOverall Statistics:\")\n", 106 | " print(f\"Total samples analyzed: {results['overall_stats']['total_samples']}\")\n", 107 | " print(f\"Identical tokenizations: {results['overall_stats']['identical_tokenizations']}\")\n", 108 | " print(f\"Percentage identical: {results['overall_stats']['identical_percentage']:.1f}%\")\n", 109 | " \n", 110 | " print(\"\\nDetailed Sample Analysis:\")\n", 111 | " for i, sample in enumerate(results['samples'], 1):\n", 112 | " print(f\"\\nSample {i}:\")\n", 113 | " print(f\"Text: {sample['text']}\")\n", 114 | " print(f\"BGE tokens ({sample['token_counts']['bge']}): {sample['bge_tokens']}\")\n", 115 | " print(f\"Nomic tokens ({sample['token_counts']['nomic']}): {sample['nomic_tokens']}\")\n", 116 | " print(f\"Identical: {sample['identical_tokenization']}\")" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 4, 122 | "metadata": {}, 123 | "outputs": [ 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 128 | "Tokenizer Comparison Report\n", 129 | "==========================\n", 130 | "\n", 131 | "Vocabulary Sizes:\n", 132 | "BGE: 30,522 tokens\n", 133 | "Nomic: 30,522 tokens\n", 134 | "\n", 135 | "Overall Statistics:\n", 136 | "Total samples analyzed: 3\n", 137 | "Identical tokenizations: 3\n", 138 | "Percentage identical: 100.0%\n", 139 | "\n", 140 | "Detailed Sample Analysis:\n", 141 | "\n", 142 | "Sample 1:\n", 143 | "Text: This is a test sentence.\n", 144 | "BGE tokens (6): ['this', 'is', 'a', 'test', 'sentence', '.']\n", 145 | "Nomic tokens (6): ['this', 'is', 'a', 'test', 'sentence', '.']\n", 146 | "Identical: True\n", 147 | "\n", 148 | "Sample 2:\n", 149 | "Text: Machine learning models use different tokenization approaches.\n", 150 | "BGE tokens (9): ['machine', 'learning', 'models', 'use', 'different', 'token', '##ization', 'approaches', '.']\n", 151 | "Nomic tokens (9): ['machine', 'learning', 'models', 'use', 'different', 'token', '##ization', 'approaches', '.']\n", 152 | "Identical: True\n", 153 | "\n", 154 | "Sample 3:\n", 155 | "Text: Some текст with mixed 字符 and специальные characters!\n", 156 | "BGE tokens (24): ['some', 'т', '##е', '##к', '##с', '##т', 'with', 'mixed', '[UNK]', '[UNK]', 'and', 'с', '##п', '##е', '##ц', '##и', '##а', '##л', '##ь', '##н', '##ы', '##е', 'characters', '!']\n", 157 | "Nomic tokens (24): ['some', 'т', '##е', '##к', '##с', '##т', 'with', 'mixed', '[UNK]', '[UNK]', 'and', 'с', '##п', '##е', '##ц', '##и', '##а', '##л', '##ь', '##н', '##ы', '##е', 'characters', '!']\n", 158 | "Identical: True\n" 159 | ] 160 | } 161 | ], 162 | "source": [ 163 | "\n", 164 | "# Example usage\n", 165 | "sample_texts = [\n", 166 | " \"This is a test sentence.\",\n", 167 | " \"Machine learning models use different tokenization 
approaches.\",\n", 168 | " \"Some текст with mixed 字符 and специальные characters!\",\n", 169 | "]\n", 170 | "\n", 171 | "results = compare_tokenizers(sample_texts)\n", 172 | "print_comparison_report(results)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 5, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "bge_tokenizer = AutoTokenizer.from_pretrained(\"BAAI/bge-base-en-v1.5\")\n", 182 | "nomic_tokenizer = AutoTokenizer.from_pretrained(\"nomic-ai/nomic-embed-text-v1.5\")\n" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 6, 188 | "metadata": {}, 189 | "outputs": [ 190 | { 191 | "data": { 192 | "text/plain": [ 193 | "BertTokenizerFast(name_or_path='BAAI/bge-base-en-v1.5', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={\n", 194 | "\t0: AddedToken(\"[PAD]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 195 | "\t100: AddedToken(\"[UNK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 196 | "\t101: AddedToken(\"[CLS]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 197 | "\t102: AddedToken(\"[SEP]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 198 | "\t103: AddedToken(\"[MASK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 199 | "}" 200 | ] 201 | }, 202 | "execution_count": 6, 203 | "metadata": {}, 204 | "output_type": "execute_result" 205 | } 206 | ], 207 | "source": [ 208 | "bge_tokenizer" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 7, 214 | "metadata": {}, 215 | "outputs": [ 216 | { 217 | "data": { 218 | "text/plain": [ 219 | "BertTokenizerFast(name_or_path='nomic-ai/nomic-embed-text-v1.5', vocab_size=30522, model_max_length=8192, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={\n", 220 | "\t0: AddedToken(\"[PAD]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 221 | "\t100: AddedToken(\"[UNK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 222 | "\t101: AddedToken(\"[CLS]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 223 | "\t102: AddedToken(\"[SEP]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 224 | "\t103: AddedToken(\"[MASK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 225 | "}" 226 | ] 227 | }, 228 | "execution_count": 7, 229 | "metadata": {}, 230 | "output_type": "execute_result" 231 | } 232 | ], 233 | "source": [ 234 | "nomic_tokenizer" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [] 243 | } 244 | ], 245 | "metadata": { 246 | "language_info": { 247 | "name": "python" 248 | } 249 | }, 250 | "nbformat": 4, 251 | "nbformat_minor": 2 252 | } 253 | -------------------------------------------------------------------------------- /remove.py: 
-------------------------------------------------------------------------------- 1 | from modal import App, Image, Volume, Secret 2 | 3 | DATASET_DIR="/embeddings" 4 | VOLUME = "embeddings" 5 | SAE = "64_32" 6 | 7 | # SAMPLE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3/train" 8 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10" 9 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10/combined" 10 | 11 | SAMPLE_DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500/train" 12 | DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top5" 13 | SAVE_DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top5/combined" 14 | 15 | 16 | 17 | 18 | 19 | # We define our Modal Resources that we'll need 20 | volume = Volume.from_name(VOLUME, create_if_missing=True) 21 | image = Image.debian_slim(python_version="3.9").pip_install( 22 | "pandas", "datasets==2.16.1", "apache_beam==2.53.0" 23 | ) 24 | app = App(image=image) 25 | 26 | @app.function( 27 | volumes={DATASET_DIR: volume}, 28 | timeout=60000, 29 | ) 30 | def remove_files_by_pattern(directory, pattern): 31 | """ 32 | Remove all files in the specified directory that match the given pattern. 33 | 34 | Args: 35 | directory: Directory to search for files 36 | pattern: File pattern to match (e.g., "temp*" for files starting with "temp") 37 | """ 38 | import os 39 | import glob 40 | 41 | # Get the full path pattern 42 | full_pattern = os.path.join(directory, pattern) 43 | 44 | # Find all files matching the pattern 45 | matching_files = glob.glob(full_pattern) 46 | 47 | # Count files to be removed 48 | file_count = len(matching_files) 49 | print(f"Found {file_count} files matching pattern '{pattern}' in {directory}") 50 | 51 | # Remove each file 52 | for file_path in matching_files: 53 | try: 54 | os.remove(file_path) 55 | print(f"Removed: {file_path}") 56 | except Exception as e: 57 | print(f"Error removing {file_path}: {e}") 58 | 59 | # Commit changes to the volume 60 | volume.commit() 61 | 62 | return f"Removed {file_count} files matching pattern '{pattern}'" 63 | 64 | @app.local_entrypoint() 65 | def main(): 66 | 67 | directory = "/embeddings/wikipedia-en-chunked-500-nomic-embed-text-v1.5-64_32-top10" 68 | pattern = "temp*" 69 | print(f"Removing files matching '{pattern}' from '{directory}'") 70 | result = remove_files_by_pattern.remote(directory, pattern) 71 | print(result) 72 | 73 | 74 | -------------------------------------------------------------------------------- /summary.py: -------------------------------------------------------------------------------- 1 | from modal import App, Image, Volume 2 | 3 | 4 | # We first set out configuration variables for our script. 
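# Each entry in `files` below is processed by process_dataset in its own Modal container
# via .map; the function returns a per-file dict (num_rows, total tokens, and counts of
# chunks under 2, 10, and 50 tokens) that the local entrypoint sums into overall totals.
# Usage (same pattern as the other scripts here): modal run summary.py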
5 | DATASET_DIR = "/data" 6 | # VOLUME = "embedding-fineweb-edu" 7 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-120" 8 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-500" 9 | # files = [f"data-{i:05d}-of-00099.parquet" for i in range(99)] 10 | 11 | 12 | 13 | VOLUME = "datasets" 14 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-120" 15 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-500" 16 | # files = [f"data-{i:05d}-of-00150.parquet" for i in range(150)] 17 | # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-500" 18 | # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-120" 19 | # files = [f"data-{i:05d}-of-01987.parquet" for i in range(200)] 20 | DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-120" 21 | # DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-500" 22 | files = [f"data-{i:05d}-of-00041.parquet" for i in range(41)] 23 | 24 | 25 | 26 | 27 | # MODEL_ID = "nomic-ai/nomic-embed-text-v1.5" 28 | 29 | # We define our Modal Resources that we'll need 30 | volume = Volume.from_name(VOLUME, create_if_missing=True) 31 | image = Image.debian_slim(python_version="3.9").pip_install( 32 | "datasets==2.16.1", "apache_beam==2.53.0", "transformers", "pandas", "tqdm" 33 | ) 34 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 35 | 36 | 37 | 38 | @app.function(volumes={DATASET_DIR: volume}, timeout=3000) 39 | def process_dataset(file): 40 | import time 41 | from concurrent.futures import ThreadPoolExecutor, as_completed 42 | from tqdm import tqdm 43 | import pandas as pd 44 | 45 | # Load the dataset as a Hugging Face dataset 46 | # print(f"Loading dataset from {DATASET_DIR}/{DATASET_SAVE}/train/{file}") 47 | df = pd.read_parquet(f"{DATASET_DIR}/{DATASET_SAVE_CHUNKED}/train/{file}") 48 | print("dataset", len(df)) 49 | 50 | return { 51 | "file": file, 52 | "num_rows": len(df), 53 | "tokens": df["chunk_token_count"].sum(), 54 | "less2": df[df["chunk_token_count"] < 2].shape[0], 55 | "less10": df[df["chunk_token_count"] < 10].shape[0], 56 | "less50": df[df["chunk_token_count"] < 50].shape[0], 57 | } 58 | 59 | @app.local_entrypoint() 60 | def main(): 61 | from tqdm import tqdm 62 | responses = [] 63 | for resp in process_dataset.map(files, order_outputs=False, return_exceptions=True): 64 | if isinstance(resp, Exception): 65 | print(f"Exception: {resp}") 66 | continue 67 | print(resp) 68 | responses.append(resp) 69 | 70 | total_rows = 0 71 | total_tokens = 0 72 | total_less2 = 0 73 | total_less10 = 0 74 | total_less50 = 0 75 | for resp in tqdm(responses): 76 | total_rows += resp['num_rows'] 77 | total_tokens += resp['tokens'] 78 | total_less2 += resp['less2'] 79 | total_less10 += resp['less10'] 80 | total_less50 += resp['less50'] 81 | print(f"Total rows processed: {total_rows}") 82 | print(f"Total tokens processed: {total_tokens}") 83 | print(f"Total less2: {total_less2}") 84 | print(f"Total less10: {total_less10}") 85 | print(f"Total less50: {total_less50}") 86 | 87 | 88 | -------------------------------------------------------------------------------- /todataset.py: -------------------------------------------------------------------------------- 1 | """ 2 | Turn a directory of parquet files into a HuggingFace dataset in the modal volume 3 | """ 4 | # TODO: look into keeping the parquet files as is to make the dataset 5 | 6 | 7 | from modal import App, Image, Volume, Secret 8 | 9 | DATASET_DIR="/embeddings" 10 | VOLUME = "embeddings" 11 | DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500" 12 | 
SAVE_DIRECTORY = f"{DIRECTORY}-HF2" 13 | 14 | # We define our Modal Resources that we'll need 15 | volume = Volume.from_name(VOLUME, create_if_missing=True) 16 | image = Image.debian_slim(python_version="3.9").pip_install( 17 | "datasets==2.16.1", "apache_beam==2.53.0" 18 | ) 19 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 20 | 21 | 22 | # The default timeout is 5 minutes re: https://modal.com/docs/guide/timeouts#handling-timeouts 23 | # but we override this to 24 | # 6000s to avoid any potential timeout issues 25 | @app.function( 26 | volumes={DATASET_DIR: volume}, 27 | timeout=6000, 28 | # ephemeral_disk=2145728, # in MiB 29 | secrets=[Secret.from_name("huggingface-secret")], 30 | ) 31 | def convert_dataset(): 32 | # Redownload the dataset 33 | import time 34 | from datasets import load_dataset 35 | print("loading") 36 | dataset = load_dataset("parquet", data_files=f"{DIRECTORY}/train/*.parquet") 37 | print("saving") 38 | dataset.save_to_disk(SAVE_DIRECTORY, num_shards={"train":99}) 39 | print("done!") 40 | volume.commit() 41 | 42 | 43 | @app.local_entrypoint() 44 | def main(): 45 | convert_dataset.remote() 46 | 47 | -------------------------------------------------------------------------------- /top10map.py: -------------------------------------------------------------------------------- 1 | """ 2 | For each of the parquet files with activations, find the top 10 and write to an intermediate file 3 | modal run top10map.py 4 | """ 5 | from modal import App, Image, Volume 6 | import os 7 | import time 8 | import numpy as np 9 | import pandas as pd 10 | from tqdm import tqdm 11 | import concurrent.futures 12 | from functools import partial 13 | 14 | NUM_CPU=4 15 | 16 | N=5 # the number of samples to keep per feature 17 | 18 | DATASET_DIR="/embeddings" 19 | VOLUME = "embeddings" 20 | 21 | D_IN = 768 # the dimensions from the embedding models 22 | K=64 23 | # EXPANSION = 128 24 | EXPANSION = 32 25 | SAE = f"{K}_{EXPANSION}" 26 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3" 27 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10" 28 | DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}" 29 | SAVE_DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top{N}" 30 | 31 | 32 | files = [f"data-{i:05d}-of-00041.parquet" for i in range(41)] 33 | 34 | # We define our Modal Resources that we'll need 35 | volume = Volume.from_name(VOLUME, create_if_missing=True) 36 | image = Image.debian_slim(python_version="3.9").pip_install( 37 | "datasets==2.16.1", "apache_beam==2.53.0", "transformers", "pandas", "tqdm" 38 | ) 39 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 40 | 41 | # def get_top_n_rows_by_top_act(file, top_indices, top_acts, feature): 42 | # # feature_positions = np.where(np.any(top_indices == feature, axis=1), 43 | # # np.argmax(top_indices == feature, axis=1), 44 | # # -1) 45 | # # act_values = np.where(feature_positions != -1, 46 | # # top_acts[np.arange(len(top_acts)), feature_positions], 47 | # # 0) 48 | # # top_n_indices = np.argsort(act_values)[-N:][::-1] 49 | 50 | # # Find positions where feature appears (returns a boolean mask) 51 | # feature_mask = top_indices == feature 52 | 53 | # # Get the activation values where the feature appears (all others will be 0) 54 | # act_values = np.where(feature_mask.any(axis=1), 55 | # top_acts[feature_mask].reshape(-1), 56 | # 0) 57 | 58 | # # Use partition to get 
top N indices efficiently 59 | # top_n_indices = np.argpartition(act_values, -N)[-N:] 60 | # # Sort just the top N indices 61 | # top_n_indices = top_n_indices[np.argsort(act_values[top_n_indices])[::-1]] 62 | 63 | # filtered_df = pd.DataFrame({ 64 | # "shard": file, 65 | # "index": top_n_indices, 66 | # "feature": feature, 67 | # "activation": act_values[top_n_indices] 68 | # }) 69 | # return filtered_df 70 | 71 | def get_top_n_rows_by_top_act(file, top_indices, top_acts, feature): 72 | # Use memory-efficient approach to find rows with this feature 73 | rows_with_feature = np.any(top_indices == feature, axis=1) 74 | 75 | # Only process rows that have this feature 76 | filtered_indices = top_indices[rows_with_feature] 77 | filtered_acts = top_acts[rows_with_feature] 78 | 79 | # Get positions of the feature in each row 80 | positions = np.argwhere(filtered_indices == feature) 81 | 82 | # Create array of activation values (sparse approach) 83 | row_indices = positions[:, 0] 84 | col_indices = positions[:, 1] 85 | act_values = filtered_acts[row_indices, col_indices] 86 | 87 | # Map back to original indices 88 | original_indices = np.where(rows_with_feature)[0][row_indices] 89 | 90 | # Get top N 91 | if len(act_values) > N: 92 | top_n_pos = np.argpartition(act_values, -N)[-N:] 93 | top_n_pos = top_n_pos[np.argsort(act_values[top_n_pos])[::-1]] 94 | else: 95 | # If we have fewer than N matches, take all of them 96 | top_n_pos = np.argsort(act_values)[::-1] 97 | 98 | filtered_df = pd.DataFrame({ 99 | "shard": file, 100 | "index": original_indices[top_n_pos], 101 | "feature": feature, 102 | "activation": act_values[top_n_pos] 103 | }) 104 | return filtered_df 105 | 106 | 107 | def process_feature_chunk(file, feature_ids, chunk_index): 108 | start = time.perf_counter() 109 | print(f"Loading dataset from {DIRECTORY}/train/{file}", chunk_index) 110 | 111 | # Only read the columns we need 112 | df = pd.read_parquet(f"{DIRECTORY}/train/{file}", columns=['top_indices', 'top_acts']) 113 | print(f"Dataset loaded in {time.perf_counter()-start:.2f} seconds for {file}", chunk_index) 114 | 115 | top_indices = np.array(df['top_indices'].tolist()) 116 | top_acts = np.array(df['top_acts'].tolist()) 117 | 118 | # Free up memory by deleting the DataFrame after conversion to numpy 119 | del df 120 | 121 | print(f"top_indices shape: {top_indices.shape}") 122 | print(f"top_acts shape: {top_acts.shape}") 123 | print("got numpy arrays", chunk_index) 124 | 125 | results = [] 126 | 127 | # Process each feature in this worker's batch 128 | for feature in tqdm(feature_ids, desc=f"Processing features (worker {chunk_index})", position=chunk_index): 129 | # Get the true top N rows for this feature across the entire chunk 130 | top = get_top_n_rows_by_top_act(file, top_indices, top_acts, feature) 131 | results.append(top) 132 | 133 | # Combine results for all features in this worker 134 | combined_df = pd.concat(results, ignore_index=True) 135 | 136 | # Write to a temporary file to save memory 137 | temp_file = f"{SAVE_DIRECTORY}/temp_{file}_{chunk_index}.parquet" 138 | combined_df.to_parquet(temp_file) 139 | 140 | # Free memory 141 | del top_indices, top_acts, results, combined_df 142 | 143 | return temp_file 144 | 145 | @app.function(cpu=NUM_CPU, volumes={DATASET_DIR: volume}, timeout=6000) 146 | def process_dataset(file): 147 | from concurrent.futures import ProcessPoolExecutor, as_completed 148 | 149 | # Ensure directory exists 150 | if not os.path.exists(f"{SAVE_DIRECTORY}"): 151 | os.makedirs(f"{SAVE_DIRECTORY}") 152 | 
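    # With D_IN = 768 and EXPANSION = 32 there are 768 * 32 = 24576 SAE features.
    # They are split across NUM_CPU = 4 workers (6144 features each); every worker
    # re-reads the full shard but only scans its own feature ids, writes a temporary
    # parquet, and the temp files are combined into a single output file below.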
153 | num_features = D_IN * EXPANSION 154 | 155 | # Split the features among workers - each worker handles a subset of features 156 | # but processes the ENTIRE dataset for those features 157 | features_per_worker = num_features // NUM_CPU 158 | feature_batches = [list(range(i, min(i + features_per_worker, num_features))) 159 | for i in range(0, num_features, features_per_worker)] 160 | 161 | with ProcessPoolExecutor(max_workers=NUM_CPU) as executor: 162 | futures = [executor.submit(process_feature_chunk, file, feature_batch, i) 163 | for i, feature_batch in enumerate(feature_batches)] 164 | 165 | temp_files = [] 166 | for future in as_completed(futures): 167 | temp_file = future.result() 168 | temp_files.append(temp_file) 169 | 170 | # Combine temporary files 171 | print("Combining temporary files") 172 | dfs = [] 173 | for temp_file in temp_files: 174 | dfs.append(pd.read_parquet(temp_file)) 175 | # Remove temp file after reading 176 | os.remove(temp_file) 177 | 178 | combined_df = pd.concat(dfs, ignore_index=True) 179 | combined_df.to_parquet(f"{SAVE_DIRECTORY}/{file}") 180 | volume.commit() 181 | 182 | return f"All done with {file}", len(combined_df) 183 | 184 | 185 | @app.local_entrypoint() 186 | def main(): 187 | for resp in process_dataset.map(files, order_outputs=False, return_exceptions=True): 188 | if isinstance(resp, Exception): 189 | print(f"Exception: {resp}") 190 | continue 191 | print(resp) 192 | 193 | 194 | -------------------------------------------------------------------------------- /top10reduce.py: -------------------------------------------------------------------------------- 1 | from modal import App, Image, Volume, Secret 2 | 3 | EMBEDDINGS_DIR="/embeddings" 4 | EMBEDDINGS_VOLUME = "embeddings" 5 | DATASETS_DIR="/datasets" 6 | DATASETS_VOLUME = "datasets" 7 | 8 | SAE = "64_32" 9 | 10 | # SAMPLE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3/train" 11 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10" 12 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10/combined" 13 | 14 | SAMPLE_DIRECTORY = f"{DATASETS_DIR}/wikipedia-en-chunked-500/train" 15 | SAE_DIRECTORY = f"{EMBEDDINGS_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}/train" 16 | DIRECTORY = f"{EMBEDDINGS_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top5" 17 | SAVE_DIRECTORY = f"{EMBEDDINGS_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top5/combined" 18 | 19 | 20 | 21 | 22 | 23 | # We define our Modal Resources that we'll need 24 | embeddings_volume = Volume.from_name(EMBEDDINGS_VOLUME, create_if_missing=True) 25 | datasets_volume = Volume.from_name(DATASETS_VOLUME, create_if_missing=True) 26 | image = Image.debian_slim(python_version="3.9").pip_install( 27 | "pandas", "datasets==2.16.1", "apache_beam==2.53.0" 28 | ) 29 | app = App(image=image) 30 | 31 | @app.function( 32 | volumes={DATASETS_DIR: datasets_volume, EMBEDDINGS_DIR: embeddings_volume}, 33 | timeout=60000, 34 | # ephemeral_disk=2145728, # in MiB 35 | ) 36 | def populate_indices(samples): 37 | import pandas as pd 38 | 39 | shard = samples.iloc[0]['shard'] 40 | indices = samples['index'].tolist() 41 | 42 | print("reading shard", shard, len(indices)) 43 | sample_df = pd.read_parquet(f"{SAMPLE_DIRECTORY}/{shard}") 44 | sample_df = sample_df.iloc[indices].copy() 45 | sample_df['feature'] = samples['feature'].tolist() 46 | sample_df['activation'] = samples['activation'].tolist() 47 | sample_df['top_indices'] = 
samples['top_indices'].tolist() 48 | sample_df['top_acts'] = samples['top_acts'].tolist() 49 | print("returning samples for", shard) 50 | 51 | return sample_df 52 | 53 | @app.function( 54 | volumes={ 55 | DATASETS_DIR: datasets_volume, 56 | EMBEDDINGS_DIR: embeddings_volume 57 | }, 58 | timeout=60000, 59 | # ephemeral_disk=2145728, # in MiB 60 | ) 61 | def reduce_top10_indices(directory, save_directory, sae_directory, N): 62 | import os 63 | if not os.path.exists(save_directory): 64 | os.makedirs(save_directory) 65 | 66 | files = [f for f in os.listdir(directory) if f.endswith('.parquet')] 67 | print("len files", len(files)) 68 | 69 | import pandas as pd 70 | 71 | combined_indices_path = f"{save_directory}/combined_indices.parquet" 72 | if not os.path.exists(combined_indices_path): 73 | print("creating combined_indices") 74 | all_dataframes = [] 75 | for file in files: 76 | print(f"Reading {file}") 77 | # Read from top directory 78 | df = pd.read_parquet(f"{directory}/{file}") 79 | 80 | # Read corresponding file from SAE directory to get top_indices and top_acts 81 | if os.path.exists(f"{sae_directory}/{file}"): 82 | sae_df = pd.read_parquet(f"{sae_directory}/{file}") 83 | # Ensure we have the right columns 84 | if 'top_indices' in sae_df.columns and 'top_acts' in sae_df.columns: 85 | # Match records based on feature (assuming they're in the same order) 86 | df['top_indices'] = sae_df['top_indices'] 87 | df['top_acts'] = sae_df['top_acts'] 88 | print(f"Added top_indices and top_acts columns from {file}") 89 | else: 90 | print(f"Warning: top_indices or top_acts not found in {file} from SAE directory") 91 | else: 92 | print(f"Warning: file {file} not found in SAE directory") 93 | 94 | all_dataframes.append(df) 95 | 96 | # Concatenate all DataFrames into a single DataFrame 97 | combined_df = pd.concat(all_dataframes, ignore_index=True) 98 | print("combined") 99 | combined_df.to_parquet(combined_indices_path) 100 | else: 101 | print(f"{combined_indices_path} already exists. 
Loading it.") 102 | combined_df = pd.read_parquet(combined_indices_path) 103 | 104 | combined_df = combined_df.sort_values(by=['feature', 'activation'], ascending=[True, False]) 105 | combined_df = combined_df.groupby('feature').head(N).reset_index(drop=True) 106 | print(f"writing top{N}") 107 | combined_df.to_parquet(f"{save_directory}/combined_indices_top{N}.parquet") 108 | embeddings_volume.commit() 109 | 110 | shard_counts = combined_df.groupby('shard').size().reset_index(name='count') 111 | print("shard_counts", shard_counts.head()) 112 | 113 | print("Number of shards:", len(shard_counts)) 114 | rows_by_shard = [combined_df[combined_df['shard'] == shard] for shard in combined_df['shard'].unique()] 115 | 116 | results = [] 117 | for resp in populate_indices.map(rows_by_shard, order_outputs=False, return_exceptions=True): 118 | if isinstance(resp, Exception): 119 | print(f"Exception: {resp}") 120 | continue 121 | results.append(resp) 122 | 123 | print("concatenating final results") 124 | final_df = pd.concat(results, ignore_index=True) 125 | final_df = final_df.drop(columns=['index', '__index_level_0__'], errors='ignore') 126 | print("sorting final results") 127 | final_df = final_df.sort_values(by=['feature', 'activation'], ascending=[True, False]) 128 | print("writing final results") 129 | final_df.to_parquet(f"{save_directory}/samples_top{N}.parquet") 130 | embeddings_volume.commit() 131 | return "done" 132 | 133 | 134 | # for resp in reduce_top10.map(pairs, order_outputs=False, return_exceptions=True): 135 | # if isinstance(resp, Exception): 136 | # print(f"Exception: {resp}") 137 | # continue 138 | # print(resp) 139 | 140 | 141 | 142 | @app.local_entrypoint() 143 | def main(): 144 | reduce_top10_indices.remote(DIRECTORY, SAVE_DIRECTORY, SAE_DIRECTORY, 10) 145 | 146 | 147 | -------------------------------------------------------------------------------- /torched.py: -------------------------------------------------------------------------------- 1 | """ 2 | Write the embeddings from the dataset to torch files that can be loaded quicker 3 | 4 | modal run torched.py 5 | """ 6 | 7 | from modal import App, Image, Volume, Secret 8 | 9 | DATASET_DIR="/embeddings" 10 | VOLUME = "embeddings" 11 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4" 12 | DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-100BT-chunked-500" 13 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-torched" 14 | SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-100BT-chunked-500-torched" 15 | 16 | # We define our Modal Resources that we'll need 17 | volume = Volume.from_name(VOLUME, create_if_missing=True) 18 | image = Image.debian_slim(python_version="3.9").pip_install( 19 | "datasets==2.16.1", "apache_beam==2.53.0", "tqdm", "torch", "numpy" 20 | ) 21 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 22 | 23 | # NUM_EMBEDDINGS = 25504378 24 | # SHARD_SIZE = 262144 # 2048*128 25 | 26 | @app.function( 27 | volumes={DATASET_DIR: volume}, 28 | timeout=60000, 29 | # ephemeral_disk=2145728, # in MiB 30 | ) 31 | def torch_dataset_shard(file): 32 | # Redownload the dataset 33 | import time 34 | # from datasets import load_from_disk 35 | import pandas as pd 36 | from tqdm import tqdm 37 | import torch 38 | import numpy as np 39 | import os 40 | 41 | print("loading", file) 42 | # dataset = load_from_disk(DIRECTORY) 43 | df = pd.read_parquet(f"{DIRECTORY}/train/{file}") 44 | print("loaded", file) 45 | # train_dataset = dataset["train"] 46 | 47 | # 
start_idx = shard * SHARD_SIZE 48 | # end_idx = min(start_idx + SHARD_SIZE, NUM_EMBEDDINGS) 49 | # print("reading", shard) 50 | embeddings = df["embedding"].to_numpy() 51 | embeddings = np.array([np.array(e).astype(np.float32) for e in embeddings]) 52 | # shard_embeddings = np.array(train_dataset.select(range(start_idx, end_idx))["embedding"]) 53 | # print("permuting", shard) 54 | # shard_embeddings = np.random.permutation(shard_embeddings) # {{ edit_1 }} 55 | shard = file.split(".")[0] 56 | print("saving", shard) 57 | shard_tensor = torch.tensor(embeddings, dtype=torch.float32) 58 | if not os.path.exists(f"{SAVE_DIRECTORY}"): 59 | os.makedirs(f"{SAVE_DIRECTORY}") 60 | torch.save(shard_tensor, f"{SAVE_DIRECTORY}/{shard}.pt") 61 | print("done!", shard) 62 | volume.commit() 63 | return shard 64 | 65 | @app.local_entrypoint() 66 | def main(): 67 | # num_shards = NUM_EMBEDDINGS // SHARD_SIZE + (1 if NUM_EMBEDDINGS % SHARD_SIZE != 0 else 0) 68 | # shards = list(range(num_shards)) 69 | # # torch_dataset.remote() 70 | # for resp in torch_dataset_shard.map(shards, order_outputs=False, return_exceptions=True): 71 | # if isinstance(resp, Exception): 72 | # print(f"Exception: {resp}") 73 | # continue 74 | # print(resp) 75 | 76 | files = [f"data-{i:05d}-of-00989.parquet" for i in range(989)] 77 | files = files[2:] 78 | # files = [f"data-{i:05d}-of-00099.parquet" for i in range(99)] 79 | 80 | # process_dataset.remote(file, max_tokens=MAX_TOKENS, num_cpu=NUM_CPU) 81 | for resp in torch_dataset_shard.map(files, order_outputs=False, return_exceptions=True): 82 | if isinstance(resp, Exception): 83 | print(f"Exception: {resp}") 84 | continue 85 | print(resp) 86 | 87 | 88 | -------------------------------------------------------------------------------- /upload.py: -------------------------------------------------------------------------------- 1 | """ 2 | Upload a dataset from a modal volume to HuggingFace 3 | """ 4 | from modal import App, Image, Volume, Secret 5 | 6 | # We first set out configuration variables for our script. 
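# HF_REPO is the destination dataset repo and DIRECTORY the source dataset on the volume;
# the remote function reads HF_TOKEN from the Modal secret named "huggingface-secret"
# and pushes the dataset in 99 shards, retrying up to 20 times if the push fails.
# Usage (same pattern as the other scripts here): modal run upload.py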
7 | DATASET_DIR = "/embeddings" 8 | VOLUME="embeddings" 9 | HF_REPO="enjalot/fineweb-edu-sample-10BT-chunked-500-nomic-text-v1.5-2" 10 | DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4" 11 | 12 | # We define our Modal Resources that we'll need 13 | volume = Volume.from_name(VOLUME, create_if_missing=True) 14 | image = Image.debian_slim(python_version="3.9").pip_install( 15 | "datasets==2.20.0", "huggingface_hub" 16 | ) 17 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 18 | 19 | 20 | # The default timeout is 5 minutes re: https://modal.com/docs/guide/timeouts#handling-timeouts 21 | # but we override this to 22 | # 6000s to avoid any potential timeout issues 23 | @app.function( 24 | volumes={DATASET_DIR: volume}, 25 | timeout=60000, 26 | secrets=[Secret.from_name("huggingface-secret")], 27 | ) 28 | def upload_dataset(directory, repo): 29 | import os 30 | import time 31 | 32 | from huggingface_hub import HfApi 33 | from datasets import load_from_disk 34 | 35 | 36 | api = HfApi(token=os.environ["HF_TOKEN"]) 37 | api.create_repo( 38 | repo_id=repo, 39 | private=False, 40 | repo_type="dataset", 41 | exist_ok=True, 42 | ) 43 | 44 | print("loading from disk") 45 | dataset=load_from_disk(directory) 46 | 47 | print(f"Pushing to hub {HF_REPO}") 48 | start = time.perf_counter() 49 | max_retries = 20 50 | for attempt in range(max_retries): 51 | try: 52 | # api.upload_folder( 53 | # folder_path=directory, 54 | # repo_id=repo, 55 | # repo_type="dataset", 56 | # multi_commits=True, 57 | # multi_commits_verbose=True, 58 | # ) 59 | dataset.push_to_hub(repo, num_shards={"train": 99}) 60 | break # Exit loop if upload is successful 61 | except Exception as e: 62 | if attempt < max_retries - 1: 63 | print(f"Attempt {attempt + 1} failed, retrying...") 64 | time.sleep(5) # Wait for 5 seconds before retrying 65 | else: 66 | print("Failed to upload after several attempts.") 67 | raise # Re-raise the last exception if all retries fail 68 | end = time.perf_counter() 69 | print(f"Uploaded in {end-start}s") 70 | 71 | 72 | @app.local_entrypoint() 73 | def main(): 74 | upload_dataset.remote(DIRECTORY, HF_REPO) 75 | 76 | -------------------------------------------------------------------------------- /volume.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copy a directory from one volume to another. 3 | I don't think you can use * in modal volume commands, so need to copy each file individually. 4 | Probably a better way to do this though. 
5 | 6 | python volume.py cp 7 | python volume.py rm 8 | """ 9 | import os 10 | from tqdm import tqdm 11 | 12 | def automate_volume_copy(): 13 | source_dir = "fineweb-edu-sample-10BT-chunked-500-HF4-torched" 14 | destination_dir = "fineweb-edu-sample-10BT-chunked-500-HF4-torched-shuffled" 15 | 16 | # Use tqdm to create a progress bar for the file copying process 17 | for i in tqdm(range(100), desc="Copying files"): 18 | file_index = f"{i:05d}" 19 | source_file = os.path.join(source_dir, f"shard_{file_index}.pt") 20 | destination_file = os.path.join(destination_dir, f"shard_{file_index}.pt") 21 | 22 | command = f"modal volume cp embeddings {source_file} {destination_file}" 23 | os.system(command) # Execute the command 24 | 25 | def automate_volume_rm(): 26 | source_dir = "fineweb-edu-sample-10BT-chunked-500-HF4-torched" 27 | 28 | # Use tqdm to create a progress bar for the file copying process 29 | for i in tqdm(range(100), desc="Deleting files"): 30 | file_index = f"{i:05d}" 31 | source_file = os.path.join(source_dir, f"shard_{file_index}.pt") 32 | 33 | command = f"modal volume rm embeddings {source_file}" 34 | os.system(command) # Execute the command 35 | 36 | 37 | import sys 38 | import argparse 39 | 40 | def parse_arguments(): 41 | parser = argparse.ArgumentParser(description="Copy or remove files in a volume.") 42 | parser.add_argument("command", choices=["cp", "rm"], help="Specify 'cp' to copy files or 'rm' to remove files.") 43 | return parser.parse_args() 44 | 45 | def main(): 46 | args = parse_arguments() 47 | command = args.command 48 | if command == "cp": 49 | automate_volume_copy() 50 | elif command == "rm": 51 | automate_volume_rm() 52 | else: 53 | print("Invalid command. Use 'cp' to copy or 'rm' to remove.") 54 | sys.exit(1) 55 | 56 | if __name__ == "__main__": 57 | main() 58 | --------------------------------------------------------------------------------