├── .gitignore
├── README.md
├── chunker.py
├── download.py
├── embed-tei.py
├── experimental
│   ├── batchsize.py
│   └── embed.py
├── features.py
├── fetch.py
├── filter.py
├── lancer.py
├── notebooks
│   ├── features.ipynb
│   ├── perfile.ipynb
│   ├── small_sample.ipynb
│   ├── tokenizers.ipynb
│   └── validate.ipynb
├── remove.py
├── summary.py
├── todataset.py
├── top10map.py
├── top10reduce.py
├── torched.py
├── upload.py
└── volume.py

/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
2 | *.parquet
3 | venv
4 | .DS_Store
5 | *.arrow
6 | data
7 | *.parquet
8 | *.npy
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # latent-data-modal
2 | 
3 | This repository is a set of scripts used to process and embed large datasets using on-demand infrastructure via [Modal](https://modal.com).
4 | 
5 | The first resulting dataset published is [FineWeb-edu 10BT Sample embedded with nomic-text-v1.5](https://huggingface.co/datasets/enjalot/fineweb-edu-sample-10BT-chunked-500-nomic-text-v1.5).
6 | 
7 | All of these scripts have been developed as part of my learning process to scale up my capacity for embedding large datasets.
8 | As such they aren't immediately generalizable, but they can be treated as a reference implementation. A lot of it is adapted from the [Embedding Wikipedia](https://modal.com/blog/embedding-wikipedia) tutorial.
9 | 
10 | I am hoping to improve this process and use it to scale up to the 100BT sample next. If I can get a compute sponsor I'll then take it to the entire 1.4 trillion token dataset.
11 | 
12 | 
13 | ## Process
14 | 
15 | ### [download.py](download.py)
16 | To start with, we need to download the HF dataset to a volume in Modal. This is relatively straightforward and easy to change to a different dataset.
17 | 
18 | ### [chunker.py](chunker.py)
19 | I wanted to pre-chunk my dataset since tokenizing is relatively CPU intensive, and my initial experiments with the tutorial code were bottlenecked by the chunking process. I also wanted to use actual token counts and analyze the impact of chunking on the dataset.
20 | 
21 | I found that the 9.6 million documents in the 10BT sample turned into ~25 million chunks with 10.5 billion tokens due to the 10% overlap I chose. There is an issue in the chunking code right now, which I will fix soon, where chunks of <= 50 tokens are created even though they represent pure overlap and aren't needed.
22 | 
23 | I based everything on the files in the dataset, so the 10BT sample was 99 arrow files, which allowed me to take advantage of Modal's automatic container scaling. Each file is processed by its own container, which dramatically sped up the process.
24 | 
25 | The chunking process took ~40 minutes using 100 containers and cost $5.
26 | 
27 | ### [embed-tei.py](embed-tei.py)
28 | This script uses [Text Embeddings Inference](https://huggingface.co/docs/text-embeddings-inference/en/index) (TEI) like the Wikipedia tutorial, but it loads the pre-chunked dataset and creates batches that attempt to fill the batch token limit, so we can pack many more small chunks into a single batch and speed things up (see the sketch below).
29 | 
30 | I believe I'm not quite properly utilizing TEI because I only got ~60% GPU utilization and was only using 10GB of memory on the A10G GPUs, which have 24GB available. So there is probably a way to speed this up even more. That said, it only cost ~$50 to embed the entire dataset. It did take ~12 hours because I didn't always have my full allocation of 10 GPUs available.
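
Roughly, the packing works like the sketch below. This is a simplified, illustrative version of what `batch_loader` in `embed-tei.py` does (the real code also sorts chunks by token count, carries row indices, and prepends the tokenized `clustering: ` prefix); the cost of a batch is estimated as the longest chunk's token count times the number of chunks, which approximates the padded batch size:

```python
# Minimal sketch of token-budget batch packing; the defaults mirror the
# CLIENT_BATCH_TOKEN_LIMIT and MAX_CLIENT_BATCH_SIZE constants in embed-tei.py,
# but this helper itself is illustrative, not the exact production code.
def pack_batches(chunks, token_limit=768 * 512, max_batch_size=2 * 4096):
    """chunks: list of (text, token_count) pairs, ideally sorted by token_count."""
    batches, texts, counts = [], [], []
    for text, count in chunks:
        # Estimated padded size if this chunk is added: longest sequence * batch length.
        padded = max(counts + [count]) * (len(counts) + 1)
        if padded <= token_limit and len(texts) < max_batch_size:
            texts.append(text)
            counts.append(count)
        else:
            if texts:
                batches.append(texts)
            texts, counts = [text], [count]
    if texts:
        batches.append(texts)
    return batches
```

Sorting by token count first keeps chunks of similar length together, so less of each batch's budget is spent on padding, which is presumably why the script sorts by `chunk_token_count` before packing.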
31 | 
32 | ### [summary.py](summary.py)
33 | I found it useful to quickly calculate summary statistics using the same parallel process of loading each file in its own container and performing some basic pandas calculations.
34 | 
35 | ### [fetch.py](fetch.py)
36 | I made a quick utility to download a single file to inspect locally, which was used in the [notebooks/validate.ipynb](notebooks/validate.ipynb) notebook to confirm that the embedding process was working as expected.
37 | 
38 | 
39 | ## Notebooks
40 | I'm including several notebooks that I developed in the process of learning this in case they are helpful to others.
41 | 
42 | ### [small_sample.ipynb](notebooks/small_sample.ipynb)
43 | The first thing I did was download some very small samples of the dataset and explore them with [Latent Scope](https://github.com/enjalot/latent-scope) to familiarize myself with the data and validate the idea of embedding the dataset.
44 | 
45 | ### [perfile.ipynb](notebooks/perfile.ipynb)
46 | After I struggled with the structure of the Wikipedia tutorial, I realized I could leverage the CPU parallelism of Modal to process each file in its own container. This notebook was me working out the chunking logic on a single file that I could then parallelize in the `chunker.py` script.
47 | 
48 | ### [validate.ipynb](notebooks/validate.ipynb)
49 | This notebook is me taking a look at a single file that was processed and then trying to understand why I was seeing such small chunks. It led me to realize the mistake I made of keeping around <50 token chunks (which I still need to fix in the chunker.py script...).
50 | 
51 | ## Experimental
52 | On the way to developing this I was trying to understand how to choose batch sizes and token limits. There are two scripts here:
53 | 
54 | ### [batchsize.py](experimental/batchsize.py)
55 | This script uses crude measurement techniques to see how much memory gets filled by a batch of tokens. I'm not confident in it anymore because I was able to fit a lot more tokens into the batches I submitted to `embed-tei.py` than I predicted, using an A10G instead of an H100.
56 | 
57 | ### [embed.py](experimental/embed.py)
58 | This script uses the HuggingFace transformers directly (instead of TEI) so I could have a little more control over how I was embedding. It's the same kind of code I use in Latent Scope for locally embedding smaller datasets, so it allowed me to better understand the scaling process.
59 | The problem is that it's just much slower than TEI.
60 | 
--------------------------------------------------------------------------------
/chunker.py:
--------------------------------------------------------------------------------
1 | from modal import App, Image, Volume
2 | 
3 | NUM_CPU=4
4 | MAX_TOKENS = 500
5 | # MAX_TOKENS = 120
6 | OVERLAP = 0.1 # 10% overlap when chunking
7 | BATCH_SIZE = 200 # number of rows to process per thread at once
8 | 
9 | # We first set out configuration variables for our script.
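# Note on the values above: with MAX_TOKENS = 500 and OVERLAP = 0.1, chunk_row()
# below uses an overlap of int(500 * 0.1) = 50 tokens, i.e. a stride of 450 tokens
# per chunk. A trailing 50-token chunk that is pure overlap of the previous chunk
# can still be emitted; that is the small-chunk issue mentioned in the README.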
10 | DATASET_DIR = "/data" 11 | 12 | # https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu 13 | # VOLUME = "embedding-fineweb-edu" 14 | # DATASET_SAVE ="fineweb-edu-sample-10BT" 15 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-{MAX_TOKENS}" 16 | # TEXT_KEY = "text" 17 | # KEEP_KEYS = ["id", "url", "score", "dump"] 18 | # files = [f"data-{i:05d}-of-00099.arrow" for i in range(99)] 19 | 20 | # VOLUME = "embedding-fineweb-edu" 21 | # DATASET_SAVE ="fineweb-edu-sample-100BT" 22 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-100BT-chunked-{MAX_TOKENS}" 23 | # KEEP_KEYS = ["id", "url", "score", "dump"] 24 | 25 | 26 | # https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2 27 | # VOLUME = "datasets" 28 | # DATASET_SAVE ="RedPajama-Data-V2-sample-10B" 29 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-{MAX_TOKENS}" 30 | # TEXT_KEY = "raw_content" 31 | # KEEP_KEYS = ["doc_id", "meta"] 32 | # files = [f"data-{i:05d}-of-00150.arrow" for i in range(150)] 33 | 34 | # https://huggingface.co/datasets/monology/pile-uncopyrighted 35 | # VOLUME = "datasets" 36 | # DATASET_SAVE ="pile-uncopyrighted" 37 | # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-{MAX_TOKENS}" 38 | # TEXT_KEY = "text" 39 | # KEEP_KEYS = ["meta"] 40 | # files = [f"data-{i:05d}-of-01987.arrow" for i in range(200)] 41 | 42 | #https://huggingface.co/datasets/wikimedia/wikipedia/viewer/20231101.en 43 | # VOLUME = "datasets" 44 | # DATASET_SAVE ="wikipedia-en" 45 | # DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-{MAX_TOKENS}" 46 | # TEXT_KEY = "text" 47 | # KEEP_KEYS = ["id", "url", "title"] 48 | # files = [f"data-{i:05d}-of-00041.arrow" for i in range(41)] 49 | 50 | VOLUME = "datasets" 51 | DATASET_SAVE ="medrag-pubmed" 52 | DATASET_SAVE_CHUNKED = f"medrag-pubmed-{MAX_TOKENS}" 53 | TEXT_KEY = "content" 54 | KEEP_KEYS = ["id", "title", "PMID"] 55 | files = [f"data-{i:05d}-of-00138.arrow" for i in range(138)] 56 | 57 | 58 | 59 | 60 | # MODEL_ID = "nomic-ai/nomic-embed-text-v1.5" 61 | 62 | # We define our Modal Resources that we'll need 63 | volume = Volume.from_name(VOLUME, create_if_missing=True) 64 | image = Image.debian_slim(python_version="3.9").pip_install( 65 | "datasets==2.16.1", "apache_beam==2.53.0", "transformers", "pandas", "tqdm" 66 | ) 67 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 68 | 69 | def chunk_row(row, tokenizer): 70 | # print("ROW", row) 71 | text = row[TEXT_KEY] 72 | chunks = [] 73 | 74 | # TODO: don't save an empty chunk 75 | 76 | tokens = tokenizer.encode(text) 77 | token_count = len(tokens) 78 | if token_count > MAX_TOKENS: 79 | overlap = int(MAX_TOKENS * OVERLAP) 80 | start_index = 0 81 | ci = 0 82 | while start_index < len(tokens): 83 | end_index = min(start_index + MAX_TOKENS, len(tokens)) 84 | chunk = tokens[start_index:end_index] 85 | if len(chunk) < overlap: 86 | break 87 | chunks.append({ 88 | "chunk_index": ci, 89 | "chunk_text": tokenizer.decode(chunk), 90 | "chunk_tokens": chunk, 91 | "chunk_token_count": len(chunk), 92 | **{key: row[key] for key in KEEP_KEYS} 93 | }) 94 | start_index += MAX_TOKENS - overlap 95 | ci += 1 96 | else: 97 | chunks.append({ 98 | "chunk_index": 0, 99 | "chunk_text": text, 100 | "chunk_tokens": tokens, 101 | "chunk_token_count": token_count, 102 | **{key: row[key] for key in KEEP_KEYS} 103 | }) 104 | 105 | return chunks 106 | 107 | 108 | @app.function(cpu=NUM_CPU, volumes={DATASET_DIR: volume}, timeout=3000) 109 | def process_dataset(file): 110 | import time 111 | from concurrent.futures 
import ThreadPoolExecutor, as_completed 112 | from tqdm import tqdm 113 | import pandas as pd 114 | import transformers 115 | transformers.logging.set_verbosity_error() 116 | from transformers import AutoTokenizer 117 | from datasets import load_from_disk, load_dataset 118 | 119 | tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=MAX_TOKENS) 120 | 121 | start = time.perf_counter() 122 | # Load the dataset as a Hugging Face dataset 123 | print(f"Loading dataset from {DATASET_DIR}/{DATASET_SAVE}/train/{file}") 124 | dataset = load_dataset("arrow", data_files=f"{DATASET_DIR}/{DATASET_SAVE}/train/{file}") 125 | df = pd.DataFrame(dataset['train']) 126 | print("dataset", len(df)) 127 | print(f"Dataset loaded in {time.perf_counter()-start:.2f} seconds for {file}") 128 | 129 | chunks_list = [] 130 | with ThreadPoolExecutor(max_workers=NUM_CPU) as executor: 131 | pbar = tqdm(total=len(df), desc=f"Processing Rows for {file}") 132 | 133 | # this gets called inside each thread 134 | def process_batch(batch): 135 | batch_chunks = [] 136 | for row in batch: 137 | row_chunks = chunk_row(row, tokenizer) 138 | pbar.update(1) 139 | batch_chunks.extend(row_chunks) 140 | return batch_chunks 141 | 142 | print(f"making batches for {file}") 143 | batches = [df.iloc[i:i + BATCH_SIZE].to_dict(orient="records") for i in range(0, len(df), BATCH_SIZE)] 144 | print(f"made batches for {file}") 145 | print(f"setting up futures for {file}") 146 | futures = [executor.submit(process_batch, batch) for batch in batches] 147 | print(f"in the future for {file}") 148 | for future in as_completed(futures): 149 | chunks_list.extend(future.result()) 150 | pbar.close() 151 | 152 | chunked_df = pd.DataFrame(chunks_list) 153 | file_name = file.split(".")[0] 154 | import os 155 | output_dir = f"{DATASET_DIR}/{DATASET_SAVE_CHUNKED}/train" 156 | if not os.path.exists(output_dir): 157 | os.makedirs(output_dir) 158 | print(f"saving to {output_dir}/{file_name}.parquet") 159 | chunked_df.to_parquet(f"{output_dir}/{file_name}.parquet") 160 | print(f"done with {file}, {len(chunks_list)} chunks") 161 | volume.commit() 162 | return f"All done with {file}", len(chunks_list) 163 | 164 | 165 | @app.local_entrypoint() 166 | def main(): 167 | # download_dataset.remote() 168 | # from huggingface_hub import HfFileSystem 169 | # hffs = HfFileSystem() 170 | # files = hffs.ls("datasets/HuggingFaceFW/fineweb-edu/sample/10BT", detail=False) 171 | 172 | # files = [f"data-{i:05d}-of-00989.arrow" for i in range(989)] 173 | # files = [f"data-{i:05d}-of-00011.arrow" for i in range(11)] 174 | 175 | # process_dataset.remote(file, max_tokens=MAX_TOKENS, num_cpu=NUM_CPU) 176 | for resp in process_dataset.map(files, order_outputs=False, return_exceptions=True): 177 | if isinstance(resp, Exception): 178 | print(f"Exception: {resp}") 179 | continue 180 | print(resp) 181 | 182 | 183 | -------------------------------------------------------------------------------- /download.py: -------------------------------------------------------------------------------- 1 | """ 2 | Download a dataset from HuggingFace to a modal volume 3 | s""" 4 | from modal import App, Image, Volume, Secret 5 | 6 | # We first set out configuration variables for our script. 
7 | VOLUME = "datasets" 8 | DATASET_DIR = "/data" 9 | 10 | HF_CACHE_DIR = f"{DATASET_DIR}/cache" 11 | 12 | # DATASET_NAME = "HuggingFaceFW/fineweb-edu" 13 | # SAMPLE = "100BT" 14 | # DATASET_FILES = f"sample/{SAMPLE}/*.parquet" 15 | # DATASET_SAVE =f"fineweb-edu-sample-{SAMPLE}" 16 | # VOLUME = "embedding-fineweb-edu" 17 | 18 | 19 | # DATASET_NAME = "togethercomputer/RedPajama-Data-V2" 20 | # DATASET_SAVE = "RedPajama-Data-V2-sample-10B" 21 | # DATASET_SAMPLE = "sample-10B" 22 | # DATASET_FILES = None 23 | 24 | # DATASET_NAME = "monology/pile-uncopyrighted" 25 | # DATASET_SAVE = "pile-uncopyrighted" 26 | # DATASET_SAMPLE = None 27 | # DATASET_FILES = None 28 | 29 | # DATASET_NAME = "PleIAs/common_corpus" 30 | # DATASET_SAVE = "common_corpus" 31 | # DATASET_SAMPLE = None 32 | # DATASET_FILES = None 33 | 34 | # DATASET_NAME = "bigcode/the-stack-dedup" 35 | # DATASET_SAVE = "the-stack-dedup" 36 | # DATASET_FILES = None 37 | 38 | # DATASET_NAME = "wikimedia/wikipedia" 39 | # DATASET_SAVE = "wikipedia-en" 40 | # DATASET_SAMPLE = "20231101.en" 41 | # DATASET_FILES = None 42 | 43 | DATASET_NAME = "MedRAG/pubmed" 44 | DATASET_SAVE = "medrag-pubmed" 45 | DATASET_SAMPLE = None 46 | DATASET_FILES = None 47 | 48 | 49 | 50 | 51 | # We define our Modal Resources that we'll need 52 | volume = Volume.from_name(VOLUME, create_if_missing=True) 53 | image = Image.debian_slim(python_version="3.9").pip_install( 54 | "datasets==3.2.0" 55 | ) 56 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 57 | 58 | 59 | # The default timeout is 5 minutes re: https://modal.com/docs/guide/timeouts#handling-timeouts 60 | # but we override this to 61 | # 6000s to avoid any potential timeout issues 62 | @app.function( 63 | volumes={DATASET_DIR: volume}, 64 | timeout=60000, 65 | ephemeral_disk=int(3145728), # in MiB 66 | secrets=[Secret.from_name("huggingface-secret")], 67 | ) 68 | def download_dataset(): 69 | # Redownload the dataset 70 | import time 71 | import os 72 | 73 | # Set HF cache environment variable 74 | os.environ['HF_HOME'] = HF_CACHE_DIR 75 | 76 | 77 | from datasets import load_dataset, DownloadConfig, logging 78 | logging.set_verbosity_debug() 79 | 80 | start = time.time() 81 | if DATASET_FILES: 82 | dataset = load_dataset(DATASET_NAME, data_files=DATASET_FILES, num_proc=6, trust_remote_code=True, download_config=DownloadConfig(resume_download=True, cache_dir=HF_CACHE_DIR)) 83 | elif DATASET_SAMPLE: 84 | dataset = load_dataset(DATASET_NAME, DATASET_SAMPLE, num_proc=6, trust_remote_code=True, download_config=DownloadConfig(resume_download=True, cache_dir=HF_CACHE_DIR)) 85 | else: 86 | dataset = load_dataset(DATASET_NAME, num_proc=6, trust_remote_code=True, download_config=DownloadConfig(resume_download=True, cache_dir=HF_CACHE_DIR)) 87 | end = time.time() 88 | print(f"Download complete - downloaded files in {end-start}s") 89 | 90 | dataset.save_to_disk(f"{DATASET_DIR}/{DATASET_SAVE}") 91 | volume.commit() 92 | 93 | @app.function(volumes={DATASET_DIR: volume}) 94 | def load_dataset(): 95 | import time 96 | import os 97 | 98 | # Set HF cache environment variable 99 | os.environ['HF_HOME'] = HF_CACHE_DIR 100 | 101 | 102 | from datasets import load_from_disk 103 | 104 | start = time.perf_counter() 105 | # Load the dataset as a Hugging Face dataset 106 | print(f"Loading dataset from {DATASET_DIR}/{DATASET_SAVE}") 107 | dataset = load_from_disk(f"{DATASET_DIR}/{DATASET_SAVE}") 108 | print(f"Dataset loaded in {time.perf_counter()-start:.2f} seconds") 109 | 110 | 111 | # # Sample the dataset to 
100,000 rows 112 | # print("Sampling dataset to 100,000 rows") 113 | # sampled_datasets = dataset["train"].select(range(100000)) 114 | # sampled_datasets.save_to_disk(f"{DATASET_DIR}/{DATASET_SAVE}-100k") 115 | 116 | 117 | # TODO: make a function to delete files 118 | # the 00099 files are old/wrong 119 | 120 | # TODO: make a function to load a single file from dataset 121 | 122 | @app.local_entrypoint() 123 | def main(): 124 | download_dataset.remote() 125 | # load_dataset.remote() 126 | 127 | -------------------------------------------------------------------------------- /embed-tei.py: -------------------------------------------------------------------------------- 1 | """ 2 | Embed a dataset using the HuggingFace TEI 3 | """ 4 | import os 5 | import json 6 | import time 7 | import asyncio 8 | import subprocess 9 | 10 | from modal import App, Image, Secret, Volume, build, enter, exit, gpu, method 11 | 12 | DATASET_DIR = "/data" 13 | 14 | ### CHUNKED DATASET 15 | # VOLUME = "embedding-fineweb-edu" 16 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-500" 17 | # files = [f"data-{i:05d}-of-00099.parquet" for i in range(99)] 18 | 19 | # VOLUME = "embedding-fineweb-edu" 20 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-120" 21 | # files = [f"data-{i:05d}-of-00099.parquet" for i in range(99)] 22 | 23 | # VOLUME = "datasets" 24 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-120" 25 | # files = [f"data-{i:05d}-of-00150.parquet" for i in range(150)] 26 | 27 | # VOLUME = "datasets" 28 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-500" 29 | # files = [f"data-{i:05d}-of-00150.parquet" for i in range(150)] 30 | 31 | # VOLUME = "datasets" 32 | # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-120" 33 | # # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-500" 34 | # files = [f"data-{i:05d}-of-01987.parquet" for i in range(200)] 35 | 36 | VOLUME = "datasets" 37 | DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-120" 38 | # DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-500" 39 | files = [f"data-{i:05d}-of-00041.parquet" for i in range(41)] 40 | 41 | # VOLUME = "datasets" 42 | # DATASET_SAVE_CHUNKED = f"medrag-pubmed-500" 43 | # files = [f"data-{i:05d}-of-00138.parquet" for i in range(138)] 44 | 45 | 46 | 47 | EMBEDDING_DIR = "/embeddings" 48 | 49 | #### MODEL 50 | # Tokenized version of "clustering: " prefix = [101, 9324, 2075, 1024] 51 | PREFIX = "clustering: " 52 | PREFIX_TOKEN_COUNT = 4 53 | MODEL_ID = "nomic-ai/nomic-embed-text-v1.5" 54 | 55 | # PREFIX = """ 56 | # PREFIX_TOKEN_COUNT = 0 57 | # MODEL_ID = "BAAI/bge-base-en-v1.5" 58 | 59 | # PREFIX = "" 60 | # PREFIX_TOKEN_COUNT = 0 61 | # MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2" 62 | 63 | MODEL_SLUG = MODEL_ID.split("/")[-1] 64 | 65 | MODEL_DIR = "/model" 66 | MODEL_REVISION="main" 67 | 68 | GPU_CONCURRENCY = 10 69 | BATCHER_CONCURRENCY = GPU_CONCURRENCY 70 | GPU_CONFIG = "A10G" 71 | GPU_IMAGE = "ghcr.io/huggingface/text-embeddings-inference:86-1.2" 72 | # GPU_CONFIG = gpu.A100(size="40GB") 73 | # GPU_CONFIG = gpu.A100(size="80GB") 74 | # GPU_IMAGE = "ghcr.io/huggingface/text-embeddings-inference:1.2" 75 | # GPU_CONFIG = gpu.H100() 76 | # GPU_IMAGE = "ghcr.io/huggingface/text-embeddings-inference:hopper-1.2" 77 | 78 | 79 | SENTENCE_TOKEN_LIMIT = 512 80 | CLIENT_BATCH_TOKEN_LIMIT = 768 * SENTENCE_TOKEN_LIMIT # how many tokens we put in a batch. 
limiting factor 81 | # i put the server higher but if we make the client batch too big it errors out without helpful message 82 | SERVER_BATCH_TOKEN_LIMIT = 2 * CLIENT_BATCH_TOKEN_LIMIT # how many tokens the server can handle in a batch 83 | MAX_CLIENT_BATCH_SIZE = 2 * 4096 # how many rows can be in a batch 84 | # CLIENT_BATCH_TOKEN_LIMIT = 1536 * SENTENCE_TOKEN_LIMIT # Double from 768 85 | # SERVER_BATCH_TOKEN_LIMIT = 4 * 1536 * SENTENCE_TOKEN_LIMIT # Increased server capacity 86 | 87 | # CLIENT_BATCH_TOKEN_LIMIT = 512 * SENTENCE_TOKEN_LIMIT #A100 40GB 88 | # SERVER_BATCH_TOKEN_LIMIT = 4*2048 * SENTENCE_TOKEN_LIMIT #A100 40GB 89 | 90 | LAUNCH_FLAGS = [ 91 | "--model-id", 92 | MODEL_ID, 93 | "--port", 94 | "8000", 95 | "--max-client-batch-size", 96 | str(MAX_CLIENT_BATCH_SIZE), # Increased from 20000 97 | "--max-batch-tokens", 98 | str(SERVER_BATCH_TOKEN_LIMIT), 99 | "--auto-truncate", 100 | "--dtype", 101 | "float16", 102 | "--json-output" # Add for more detailed perf metrics 103 | ] 104 | 105 | ## Dataset-Specific Configuration 106 | DATASET_READ_VOLUME = Volume.from_name( 107 | VOLUME, create_if_missing=True 108 | ) 109 | EMBEDDING_CHECKPOINT_VOLUME = Volume.from_name( 110 | "embeddings", create_if_missing=True 111 | ) 112 | 113 | def spawn_server() -> subprocess.Popen: 114 | import socket 115 | 116 | process = subprocess.Popen(["text-embeddings-router"] + LAUNCH_FLAGS) 117 | # Poll until webserver at 127.0.0.1:8000 accepts connections before running inputs. 118 | while True: 119 | try: 120 | socket.create_connection(("127.0.0.1", 8000), timeout=1).close() 121 | print("Webserver ready!") 122 | return process 123 | except (socket.timeout, ConnectionRefusedError): 124 | # Check if launcher webserving process has exited. 125 | # If so, a connection can never be made. 
126 | retcode = process.poll() 127 | if retcode is not None: 128 | raise RuntimeError( 129 | f"launcher exited unexpectedly with code {retcode}" 130 | ) 131 | 132 | 133 | tei_image = ( 134 | Image.from_registry( 135 | GPU_IMAGE, 136 | add_python="3.10", 137 | ) 138 | .dockerfile_commands("ENTRYPOINT []") 139 | .pip_install("httpx", "numpy") 140 | ) 141 | 142 | with tei_image.imports(): 143 | import numpy as np 144 | 145 | app = App( 146 | "fineweb-embeddings-tei" 147 | ) 148 | 149 | @app.cls( 150 | gpu=GPU_CONFIG, 151 | image=tei_image, 152 | max_containers=GPU_CONCURRENCY, 153 | allow_concurrent_inputs=4, # allows the batchers to queue up several requests 154 | # but if we allow too many and they get backed up it spams timeout errors 155 | retries=3, 156 | ) 157 | class TextEmbeddingsInference: 158 | # @build() 159 | # def download_model(self): 160 | # spawn_server() 161 | 162 | @enter() 163 | def open_connection(self): 164 | # If the process is running for a long time, the client does not seem to close the connections, results in a pool timeout 165 | from httpx import AsyncClient 166 | 167 | self.process = spawn_server() 168 | self.client = AsyncClient(base_url="http://127.0.0.1:8000", timeout=30) 169 | 170 | @exit() 171 | def terminate_connection(self): 172 | self.process.terminate() 173 | 174 | @method() 175 | async def embed(self, chunk_batch): 176 | texts = chunk_batch[0] 177 | res = await self.client.post("/embed", json={"inputs": texts}) 178 | try: 179 | emb = res.json() 180 | return chunk_batch, np.array(emb) 181 | except Exception as e: 182 | print(f"Error embedding", e) 183 | print("res", res) 184 | raise e 185 | 186 | @app.function( 187 | max_containers=BATCHER_CONCURRENCY, 188 | image=Image.debian_slim().pip_install( 189 | "pandas", "pyarrow", "tqdm" 190 | ), 191 | volumes={ 192 | DATASET_DIR: DATASET_READ_VOLUME, 193 | EMBEDDING_DIR: EMBEDDING_CHECKPOINT_VOLUME, 194 | }, 195 | timeout=86400, 196 | secrets=[Secret.from_name("huggingface-secret")], 197 | ) 198 | def batch_loader(file): 199 | import pandas as pd 200 | from tqdm import tqdm 201 | import time 202 | 203 | print(f"reading in {file}") 204 | file_path = f"{DATASET_DIR}/{DATASET_SAVE_CHUNKED}/train/{file}" 205 | df = pd.read_parquet(file_path) 206 | df['original_position'] = np.arange(len(df)) 207 | print(f"sorting {file}", len(df)) 208 | df = df.sort_values(by='chunk_token_count', ascending=True) 209 | # df = df[0: 80000] 210 | # df = df.reset_index(drop=True) 211 | 212 | batches_text = [] 213 | current_batch_counts = [] 214 | current_batch_text = [] 215 | batch_indices = [] 216 | current_batch_indices = [] 217 | packed = [] 218 | 219 | print("building batches for ", file, "with client batch token limit", CLIENT_BATCH_TOKEN_LIMIT) 220 | start = time.monotonic_ns() 221 | 222 | pbar = tqdm(total=len(df), desc=f"building batches for {file}") 223 | # idx is actually the original index since i didn't reset the index during sort 224 | # i just hate that its implied and had a bug when i didn't realize it 225 | for idx, row in df.iterrows(): 226 | pbar.update(1) 227 | original_idx = row['original_position'] 228 | chunk_token_count = row['chunk_token_count'] + PREFIX_TOKEN_COUNT # 4 for the prefix 229 | chunkt = PREFIX + row['chunk_text'] 230 | if not chunkt or not chunkt.strip(): 231 | print(f"WARNING: Empty chunk detected at index {original_idx}") 232 | chunkt = " " 233 | chunk_token_count = 1 234 | proposed_batch_count = current_batch_counts + [chunk_token_count] 235 | proposed_length = max(count for count in 
proposed_batch_count) * len(proposed_batch_count) 236 | 237 | if proposed_length <= CLIENT_BATCH_TOKEN_LIMIT and len(current_batch_indices) < MAX_CLIENT_BATCH_SIZE: 238 | current_batch_text.append(chunkt) 239 | current_batch_indices.append(original_idx) 240 | current_batch_counts.append(chunk_token_count) 241 | else: 242 | batches_text.append(current_batch_text) 243 | batch_indices.append(current_batch_indices) 244 | current_batch_counts = [chunk_token_count] 245 | current_batch_text = [chunkt] 246 | current_batch_indices = [original_idx] 247 | 248 | if current_batch_counts: 249 | batch_indices.append(current_batch_indices) 250 | batches_text.append(current_batch_text) 251 | 252 | 253 | duration_s = (time.monotonic_ns() - start) / 1e9 254 | print(f"batched {file} in {duration_s:.0f}s") 255 | 256 | responses = [] 257 | for batch_text, batch_indices in zip(batches_text, batch_indices): 258 | packed.append((batch_text, batch_indices)) 259 | 260 | print(f"{len(packed)} batches") 261 | 262 | pbar = tqdm(total=len(packed), desc=f"embedding {file}") 263 | model = TextEmbeddingsInference() 264 | 265 | for resp in model.embed.map( 266 | packed, 267 | order_outputs=False, 268 | return_exceptions=False 269 | ): 270 | responses.append(resp) 271 | pbar.update(1) 272 | 273 | if not os.path.exists(f"{EMBEDDING_DIR}/{DATASET_SAVE_CHUNKED}-{MODEL_SLUG}/train"): 274 | os.makedirs(f"{EMBEDDING_DIR}/{DATASET_SAVE_CHUNKED}-{MODEL_SLUG}/train", exist_ok=True) 275 | 276 | embedding_dim = responses[0][1].shape[1] 277 | embedding_path = f"{EMBEDDING_DIR}/{DATASET_SAVE_CHUNKED}-{MODEL_SLUG}/train/{file.replace('.parquet', '.npy')}" 278 | mmap_embeddings = np.memmap(embedding_path, dtype='float32', mode='w+', shape=(len(df), embedding_dim)) 279 | 280 | print("writing embeddings to disk") 281 | for batch, response in responses: 282 | for idx, embedding in zip(batch[1], response): 283 | mmap_embeddings[idx] = embedding 284 | mmap_embeddings.flush() 285 | 286 | del mmap_embeddings 287 | 288 | EMBEDDING_CHECKPOINT_VOLUME.commit() 289 | return f"done with {file}" 290 | 291 | @app.local_entrypoint() 292 | def full_job(): 293 | for resp in batch_loader.map( 294 | files, 295 | order_outputs=False, 296 | return_exceptions=True 297 | ): 298 | print(resp) 299 | 300 | print("done") 301 | 302 | -------------------------------------------------------------------------------- /experimental/batchsize.py: -------------------------------------------------------------------------------- 1 | """ 2 | Try to figure out the optimal batch size for embedding on a given GPU 3 | """ 4 | import os 5 | import json 6 | import time 7 | import asyncio 8 | import subprocess 9 | 10 | from modal import App, Image, Secret, Volume, build, enter, exit, gpu, method 11 | 12 | # We first set out configuration variables for our script. 
13 | ## Embedding Containers Configuration 14 | # GPU_CONCURRENCY = 100 15 | MODEL_ID = "nomic-ai/nomic-embed-text-v1.5" 16 | MODEL_SLUG = MODEL_ID.split("/")[-1] 17 | 18 | MODEL_DIR = "/model" 19 | MODEL_REVISION="main" 20 | 21 | GPU_CONCURRENCY = 1 22 | # GPU_CONFIG = gpu.A100(size="80GB") 23 | # GPU_CONFIG = gpu.A100(size="40GB") 24 | # GPU_CONFIG = gpu.A10G() 25 | GPU_CONFIG = gpu.H100() 26 | # BATCH_SIZE = 512 27 | BATCH_SIZE = 64 28 | # BATCH_SIZE = 128 29 | MAX_TOKENS = 8192 30 | # MAX_TOKENS = 2048 31 | 32 | 33 | ## Dataset-Specific Configuration 34 | DATASET_READ_VOLUME = Volume.from_name( 35 | "embedding-fineweb-edu", create_if_missing=True 36 | ) 37 | EMBEDDING_CHECKPOINT_VOLUME = Volume.from_name( 38 | "checkpoint", create_if_missing=True 39 | ) 40 | DATASET_DIR = "/data" 41 | # DATASET_SAVE ="fineweb-edu-sample-10BT" 42 | DATASET_SAVE ="fineweb-edu-sample-10BT-100k" 43 | CHECKPOINT_DIR = "/checkpoint" 44 | SAVE_TO_DISK = True 45 | 46 | ## Upload-Specific Configuration 47 | # DATASET_HF_UPLOAD_REPO_NAME = "enjalot/fineweb-edu-sample-10BT" 48 | DATASET_HF_UPLOAD_REPO_NAME = f"enjalot/{DATASET_SAVE}" 49 | UPLOAD_TO_HF = False 50 | 51 | 52 | def download_model_to_image(model_dir, model_name, model_revision): 53 | from huggingface_hub import snapshot_download 54 | from transformers.utils import move_cache 55 | 56 | os.makedirs(model_dir, exist_ok=True) 57 | 58 | snapshot_download( 59 | repo_id=model_name, 60 | revision=model_revision, 61 | local_dir=model_dir, 62 | ignore_patterns=["*.pt", "*.bin"], # Using safetensors 63 | ) 64 | move_cache() 65 | 66 | st_image = ( 67 | Image.debian_slim(python_version="3.10") 68 | .pip_install( 69 | "torch==2.1.2", 70 | "numpy==1.26.3", 71 | "transformers==4.39.3", 72 | "hf-transfer==0.1.6", 73 | "huggingface_hub==0.22.2", 74 | "einops==0.7.0" 75 | ) 76 | .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"}) 77 | .run_function( 78 | download_model_to_image, 79 | timeout=60 * 20, 80 | kwargs={ 81 | "model_dir": MODEL_DIR, 82 | "model_name": MODEL_ID, 83 | "model_revision": MODEL_REVISION, 84 | }, 85 | secrets=[Secret.from_name("huggingface-secret")], 86 | ) 87 | ) 88 | with st_image.imports(): 89 | import numpy as np 90 | import torch 91 | from torch.cuda.amp import autocast 92 | from transformers import AutoTokenizer, AutoModel 93 | 94 | app = App( 95 | "fineweb-embeddings-st" 96 | ) 97 | 98 | @app.cls( 99 | gpu=GPU_CONFIG, 100 | # cpu=16, 101 | concurrency_limit=GPU_CONCURRENCY, 102 | timeout=60 * 10, 103 | container_idle_timeout=60 * 10, 104 | allow_concurrent_inputs=1, 105 | image=st_image, 106 | ) 107 | class TransformerModel: 108 | @enter() 109 | def start_engine(self): 110 | # import torch 111 | # from transformers import AutoTokenizer, AutoModel 112 | 113 | self.device = torch.device("cuda") 114 | 115 | print("🥶 cold starting inference") 116 | start = time.monotonic_ns() 117 | 118 | self.model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, safe_serialization=True)#, rotary_scaling_factor=2 ) 119 | self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=MAX_TOKENS) 120 | self.model.to(self.device) 121 | self.model.eval() 122 | 123 | print(f"CUDA memory allocated: {torch.cuda.memory_allocated() / 1e6} MB") 124 | duration_s = (time.monotonic_ns() - start) / 1e9 125 | print(f"🏎️ engine started in {duration_s:.0f}s") 126 | 127 | @method() 128 | def embed(self, inputs): 129 | tok = self.tokenizer 130 | 131 | # print(torch.cuda.memory_summary(device=None, abbreviated=False)) 132 | 
print(torch.cuda.memory_summary(device=self.device, abbreviated=True)) 133 | 134 | # print(f"CUDA memory allocated before encoding: {torch.cuda.memory_allocated() / 1e6} MB") 135 | 136 | start = time.monotonic_ns() 137 | encoded_input = tok(inputs, padding=True, truncation=True, return_tensors='pt') 138 | print("encoded in", (time.monotonic_ns() - start) / 1e9) 139 | 140 | encoded_input = {key: value.to(self.device) for key, value in encoded_input.items()} 141 | # print("moved to device", (time.monotonic_ns() - start) / 1e9) 142 | # print("encoded input size", encoded_input['input_ids'].nelement() * encoded_input['input_ids'].element_size() / 1e6, "MB") 143 | 144 | # print(f"CUDA memory allocated after encoding: {torch.cuda.memory_allocated() / 1e6} MB") 145 | start = time.monotonic_ns() 146 | # print(torch.cuda.memory_summary(device=None, abbreviated=False)) 147 | with torch.no_grad():#, autocast(): 148 | # print(f"CUDA memory allocated before embedding: {torch.cuda.memory_allocated() / 1e6} MB") 149 | model_output = self.model(**encoded_input) 150 | # print(f"CUDA memory allocated after model output: {torch.cuda.memory_allocated() / 1e6} MB") 151 | # print(f"model output size: {model_output.nelement() * model_output.element_size() / 1e6} MB") 152 | embeddings = model_output[0][:, 0] 153 | # print(f"Embedding size: {embeddings.nelement() * embeddings.element_size() / 1e6} MB") 154 | # print(f"CUDA memory allocated after embedding: {torch.cuda.memory_allocated() / 1e6} MB") 155 | normalized_embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1) 156 | normalized_embeddings_cpu = normalized_embeddings.cpu().numpy() 157 | # print(f"CUDA memory allocated after got embeddings: {torch.cuda.memory_allocated() / 1e6} MB") 158 | # # Clean up torch memory 159 | # del encoded_input 160 | # del model_output 161 | # del embeddings 162 | # del normalized_embeddings 163 | # torch.cuda.empty_cache() 164 | duration_ms = (time.monotonic_ns() - start) / 1e6 165 | print(f"embedding took {duration_ms:.0f}ms") 166 | print(torch.cuda.memory_summary(device=self.device, abbreviated=True)) 167 | 168 | return inputs, normalized_embeddings_cpu 169 | 170 | 171 | 172 | @app.local_entrypoint() 173 | def full_job(): 174 | tok = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=MAX_TOKENS) 175 | batch_size = BATCH_SIZE 176 | 177 | test = "I " 178 | test = test * 1022 179 | tokens = tok.encode(test) 180 | print("tokens", len(tokens)) 181 | 182 | inputs = [test] * (384) 183 | 184 | model = TransformerModel() 185 | [inputs, embeddings] = model.embed.remote(inputs=inputs) 186 | print("done") 187 | 188 | -------------------------------------------------------------------------------- /experimental/embed.py: -------------------------------------------------------------------------------- 1 | """ 2 | Embed a dataset using a HuggingFace model, a good deal slower than TEI 3 | """ 4 | import os 5 | import json 6 | import time 7 | import asyncio 8 | import subprocess 9 | 10 | from modal import App, Image, Secret, Volume, build, enter, exit, gpu, method 11 | 12 | DATASET_DIR = "/data" 13 | DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-500" 14 | CHECKPOINT_DIR = "/checkpoint" 15 | 16 | # We first set out configuration variables for our script. 
17 | ## Embedding Containers Configuration 18 | # GPU_CONCURRENCY = 100 19 | MODEL_ID = "nomic-ai/nomic-embed-text-v1.5" 20 | MODEL_SLUG = MODEL_ID.split("/")[-1] 21 | 22 | MODEL_DIR = "/model" 23 | MODEL_REVISION="main" 24 | 25 | GPU_CONCURRENCY = 10 26 | # GPU_CONFIG = gpu.A100(size="80GB") 27 | # GPU_CONFIG = gpu.A100(size="40GB") 28 | # GPU_CONFIG = gpu.A10G() 29 | GPU_CONFIG = gpu.H100() 30 | 31 | 32 | ## Dataset-Specific Configuration 33 | DATASET_READ_VOLUME = Volume.from_name( 34 | "embedding-fineweb-edu", create_if_missing=True 35 | ) 36 | EMBEDDING_CHECKPOINT_VOLUME = Volume.from_name( 37 | "embeddings", create_if_missing=True 38 | ) 39 | def download_model_to_image(model_dir, model_name, model_revision): 40 | from huggingface_hub import snapshot_download 41 | from transformers.utils import move_cache 42 | 43 | os.makedirs(model_dir, exist_ok=True) 44 | 45 | snapshot_download( 46 | repo_id=model_name, 47 | revision=model_revision, 48 | local_dir=model_dir, 49 | ignore_patterns=["*.pt", "*.bin"], # Using safetensors 50 | ) 51 | move_cache() 52 | 53 | st_image = ( 54 | Image.debian_slim(python_version="3.10") 55 | .pip_install( 56 | "torch==2.1.2", 57 | "numpy==1.26.3", 58 | "transformers==4.39.3", 59 | "hf-transfer==0.1.6", 60 | "huggingface_hub==0.22.2", 61 | "einops==0.7.0" 62 | ) 63 | .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"}) 64 | .run_function( 65 | download_model_to_image, 66 | timeout=60 * 20, 67 | kwargs={ 68 | "model_dir": MODEL_DIR, 69 | "model_name": MODEL_ID, 70 | "model_revision": MODEL_REVISION, 71 | }, 72 | secrets=[Secret.from_name("huggingface-secret")], 73 | ) 74 | ) 75 | with st_image.imports(): 76 | import numpy as np 77 | import torch 78 | from torch.cuda.amp import autocast 79 | from transformers import AutoTokenizer, AutoModel 80 | 81 | app = App( 82 | "fineweb-embeddings-st" 83 | ) 84 | 85 | @app.cls( 86 | gpu=GPU_CONFIG, 87 | # cpu=16, 88 | concurrency_limit=GPU_CONCURRENCY, 89 | timeout=60 * 10, 90 | container_idle_timeout=60 * 10, 91 | allow_concurrent_inputs=1, 92 | image=st_image, 93 | ) 94 | class TransformerModel: 95 | @enter() 96 | def start_engine(self): 97 | # import torch 98 | # from transformers import AutoTokenizer, AutoModel 99 | 100 | self.device = torch.device("cuda") 101 | 102 | print("🥶 cold starting inference") 103 | start = time.monotonic_ns() 104 | 105 | self.model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, safe_serialization=True)#, rotary_scaling_factor=2 ) 106 | # self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=512) # MAX_TOKENS 107 | self.model.to(self.device) 108 | self.model.eval() 109 | 110 | # print(f"CUDA memory allocated: {torch.cuda.memory_allocated() / 1e6} MB") 111 | duration_s = (time.monotonic_ns() - start) / 1e9 112 | print(f"🏎️ engine started in {duration_s:.0f}s") 113 | 114 | @method() 115 | def embed(self, batch_mask_index): 116 | batch, mask, index = batch_mask_index 117 | # print(torch.cuda.memory_summary(device=self.device, abbreviated=True)) 118 | 119 | tokens_tensor = torch.tensor(batch) 120 | attention_mask = torch.tensor(mask) 121 | 122 | encoded_input = { 123 | 'input_ids': tokens_tensor.to(self.device), 124 | 'attention_mask': attention_mask.to(self.device) 125 | } 126 | # encoded_input = {key: value.to(self.device) for key, value in inputs} 127 | start = time.monotonic_ns() 128 | with torch.no_grad():#, autocast(): 129 | model_output = self.model(**encoded_input) 130 | embeddings = model_output[0][:, 0] 131 | normalized_embeddings = 
torch.nn.functional.normalize(embeddings, p=2, dim=1) 132 | normalized_embeddings_cpu = normalized_embeddings.cpu().numpy() 133 | 134 | duration_ms = (time.monotonic_ns() - start) / 1e6 135 | print(f"embedding took {duration_ms:.0f}ms") 136 | 137 | del encoded_input 138 | del model_output 139 | del embeddings 140 | del normalized_embeddings 141 | torch.cuda.empty_cache() 142 | 143 | # print(torch.cuda.memory_summary(device=self.device, abbreviated=True)) 144 | return index, normalized_embeddings_cpu 145 | 146 | 147 | 148 | @app.function( 149 | image=Image.debian_slim().pip_install( 150 | "pandas", "pyarrow", "tqdm" 151 | ), 152 | volumes={ 153 | DATASET_DIR: DATASET_READ_VOLUME, 154 | CHECKPOINT_DIR: EMBEDDING_CHECKPOINT_VOLUME, 155 | }, 156 | timeout=86400, 157 | secrets=[Secret.from_name("huggingface-secret")], 158 | ) 159 | def batch_loader(file, batch_size: int = 512 * 1024): 160 | import pandas as pd 161 | from tqdm import tqdm 162 | import time 163 | 164 | 165 | print(f"reading in {file}") 166 | file_path = f"{DATASET_DIR}/{DATASET_SAVE_CHUNKED}/train/{file}" 167 | df = pd.read_parquet(file_path) 168 | print(f"sorting {file}") 169 | df = df.sort_values(by='chunk_token_count', ascending=True) 170 | batches = [] 171 | current_batch = [] 172 | current_token_count = 0 173 | batch_indices = [] 174 | current_batch_indices = [] 175 | attention_masks = [] # List to store attention masks for each batch 176 | 177 | 178 | # Tokenized version of "clustering: " 179 | prefix = [101, 9324, 2075, 1024] 180 | 181 | print("building batches for ", file) 182 | start = time.monotonic_ns() 183 | 184 | for index, row in df.iterrows(): 185 | # chunk_token_count = row['chunk_token_count'] 186 | chunk = prefix + list(row['chunk_tokens']) 187 | proposed_batch = current_batch + [chunk] 188 | proposed_length = max(len(tokens) for tokens in proposed_batch) * len(proposed_batch) 189 | 190 | if proposed_length <= batch_size: 191 | current_batch.append(chunk) 192 | current_batch_indices.append(index) 193 | # current_token_count = proposed_length 194 | else: 195 | # Pad the current batch 196 | max_length = max(len(tokens) for tokens in current_batch) 197 | padded_batch = [tokens + [0] * (max_length - len(tokens)) for tokens in current_batch] 198 | attention_mask = [[1] * len(tokens) + [0] * (max_length - len(tokens)) for tokens in current_batch] 199 | batches.append(padded_batch) 200 | attention_masks.append(attention_mask) 201 | batch_indices.append(current_batch_indices) 202 | # Start new batch 203 | current_batch = [chunk] 204 | current_batch_indices = [index] 205 | # current_token_count = len(chunk) 206 | 207 | if current_batch: 208 | # Pad the final batch 209 | max_length = max(len(tokens) for tokens in current_batch) 210 | padded_batch = [tokens + [0] * (max_length - len(tokens)) for tokens in current_batch] 211 | attention_mask = [[1] * len(tokens) + [0] * (max_length - len(tokens)) for tokens in current_batch] 212 | 213 | batches.append(padded_batch) 214 | batch_indices.append(current_batch_indices) 215 | 216 | 217 | print("length of first batch", len(batches[0])) 218 | first_batch_length = sum(len(chunk) for chunk in batches[0]) 219 | print("Total length of all elements in the first batch:", first_batch_length) 220 | print(f"number of batches {len(batches)}") 221 | 222 | duration_s = (time.monotonic_ns() - start) / 1e9 223 | print(f"batched {file} in {duration_s:.0f}s") 224 | 225 | pbar = tqdm(total=len(batches), desc=f"embedding {file}") 226 | model = TransformerModel() 227 | 228 | responses = [] 229 | for 
resp in model.embed.map( 230 | zip(batches, attention_masks, batch_indices), 231 | order_outputs=False, 232 | return_exceptions=False 233 | ): 234 | responses.append(resp) 235 | pbar.update(1) 236 | 237 | print("zipping batches with responses") 238 | for batch_idx, response in responses: 239 | for idx, embedding in zip(batch_idx, response): 240 | df.at[idx, 'embedding'] = embedding 241 | 242 | if not os.path.exists(f"{CHECKPOINT_DIR}/{DATASET_SAVE_CHUNKED}/train"): 243 | os.makedirs(f"{CHECKPOINT_DIR}/{DATASET_SAVE_CHUNKED}/train", exist_ok=True) 244 | df.to_parquet(f"{CHECKPOINT_DIR}/{DATASET_SAVE_CHUNKED}/train/{file}") 245 | return f"done with {file}" 246 | 247 | @app.local_entrypoint() 248 | def full_job(): 249 | 250 | file = "data-00000-of-00099.parquet" 251 | 252 | batch_loader.remote(file=file, batch_size = (1024) * 512) 253 | print("done") 254 | 255 | -------------------------------------------------------------------------------- /features.py: -------------------------------------------------------------------------------- 1 | """ 2 | Extract the features for the embeddings of a dataset using a pre-trained SAE model 3 | 4 | modal run features.py 5 | """ 6 | 7 | import os 8 | import time 9 | from tqdm import tqdm 10 | from latentsae.sae import Sae 11 | from modal import App, Image, Volume, Secret, gpu, enter, method 12 | 13 | DATASET_DIR="/embeddings" 14 | VOLUME = "embeddings" 15 | 16 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4" 17 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-120-all-MiniLM-L6-v2" 18 | # DIRECTORY = f"{DATASET_DIR}/RedPajama-Data-V2-sample-10B-chunked-120-all-MiniLM-L6-v2" 19 | # DIRECTORY = f"{DATASET_DIR}/pile-uncopyrighted-chunked-120-all-MiniLM-L6-v2" 20 | # DIRECTORY = f"{DATASET_DIR}/medrag-pubmed-500-nomic-embed-text-v1.5" 21 | # FILES = [f"{DIRECTORY}/train/data-{i:05d}-of-00138.npy" for i in range(138)] 22 | DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5" 23 | FILES = [f"{DIRECTORY}/train/data-{i:05d}-of-00041.npy" for i in range(41)] 24 | SAE = "64_32" 25 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-2" 26 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3" 27 | # SAE = "64_128" 28 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}" 29 | # SAE = "64_64" 30 | 31 | SAVE_DIRECTORY = f"{DIRECTORY}-{SAE}" 32 | 33 | 34 | # MODEL_ID = "enjalot/sae-all-MiniLM-L6-v2" 35 | # D_IN = 384 36 | MODEL_ID = "enjalot/sae-nomic-text-v1.5-FineWeb-edu-100BT" 37 | MODEL_DIR = "/model" 38 | D_IN = 768 39 | MODEL_REVISION="main" 40 | 41 | # We define our Modal Resources that we'll need 42 | volume = Volume.from_name(VOLUME, create_if_missing=True) 43 | 44 | def download_model_to_image(model_dir, model_name, model_revision): 45 | from huggingface_hub import snapshot_download 46 | from transformers.utils import move_cache 47 | 48 | os.makedirs(model_dir, exist_ok=True) 49 | 50 | snapshot_download( 51 | repo_id=model_name, 52 | revision=model_revision, 53 | local_dir=model_dir, 54 | ignore_patterns=["*.pt", "*.bin"], # Using safetensors 55 | ) 56 | move_cache() 57 | 58 | st_image = ( 59 | Image.debian_slim(python_version="3.10") 60 | .pip_install( 61 | "torch==2.1.2", 62 | "numpy==1.26.3", 63 | "transformers==4.39.3", 64 | "hf-transfer==0.1.6", 65 | "huggingface_hub==0.22.2", 66 | "einops==0.7.0", 67 | "latentsae==0.1.0" 68 | ) 69 | .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"}) 70 | .run_function( 71 | 
download_model_to_image, 72 | timeout=60 * 20, 73 | kwargs={ 74 | "model_dir": MODEL_DIR, 75 | "model_name": MODEL_ID, 76 | "model_revision": MODEL_REVISION, 77 | }, 78 | secrets=[Secret.from_name("huggingface-secret")], 79 | ) 80 | ) 81 | app = App(image=st_image) # Note: prior to April 2024, "app" was called "stub" 82 | 83 | with st_image.imports(): 84 | import numpy as np 85 | import torch 86 | 87 | @app.cls( 88 | volumes={DATASET_DIR: volume}, 89 | timeout=60 * 100, 90 | scaledown_window=60 * 10, 91 | allow_concurrent_inputs=1, 92 | image=st_image, 93 | ) 94 | class SAEModel: 95 | @enter() 96 | def start_engine(self): 97 | # import torch 98 | self.device = torch.device("cpu") 99 | print("🥶 cold starting inference") 100 | start = time.monotonic_ns() 101 | self.model = Sae.load_from_hub(MODEL_ID, SAE, device=self.device) 102 | duration_s = (time.monotonic_ns() - start) / 1e9 103 | print(f"🏎️ engine started in {duration_s:.0f}s") 104 | 105 | @method() 106 | def make_features(self, file): 107 | # Redownload the dataset 108 | import time 109 | from datasets import load_dataset 110 | import torch 111 | import pandas as pd 112 | import numpy as np 113 | import time 114 | 115 | start = time.monotonic_ns() 116 | print("loading", file) 117 | # dataset = load_dataset("arrow", data_files=f"{DIRECTORY}/train/{file}") 118 | # # df = pd.read_parquet(f"{DIRECTORY}/train/{file}") 119 | # print("loaded") 120 | # df = pd.DataFrame(dataset['train']) 121 | # print("converted to dataframe") 122 | # embeddings = df['embedding'].to_numpy() 123 | # print("converted to numpy") 124 | # embeddings = np.array([np.array(e).astype(np.float32) for e in embeddings]) 125 | duration_s = (time.monotonic_ns() - start) / 1e9 126 | # read the npy memmapped file 127 | size= os.path.getsize(file) // (D_IN * 4) 128 | embeddings = np.memmap(file, 129 | dtype='float32', 130 | mode='r', 131 | shape=(size, D_IN)) 132 | print("loaded", file, "in", duration_s) 133 | 134 | start = time.monotonic_ns() 135 | print("Encoding embeddings with SAE") 136 | 137 | # batch_size = 4096 138 | batch_size = 128 139 | num_batches = (len(embeddings) + batch_size - 1) // batch_size 140 | all_acts = np.zeros((len(embeddings), 64)) 141 | all_indices = np.zeros((len(embeddings), 64)) 142 | for i in tqdm(range(num_batches), desc="Encoding batches"): 143 | batch_embeddings = embeddings[i * batch_size:(i + 1) * batch_size] 144 | batch_embeddings_tensor = torch.from_numpy(batch_embeddings).float().to(self.device) 145 | batch_features = self.model.encode(batch_embeddings_tensor) 146 | all_acts[i * batch_size:(i + 1) * batch_size] = batch_features.top_acts.detach().cpu().numpy() 147 | all_indices[i * batch_size:(i + 1) * batch_size] = batch_features.top_indices.detach().cpu().numpy() 148 | 149 | duration_s = (time.monotonic_ns() - start) / 1e9 150 | print("encoding completed", duration_s) 151 | 152 | df = pd.DataFrame() 153 | df['top_acts'] = list(all_acts) 154 | df['top_indices'] = list(all_indices) 155 | # # df.drop(columns=['embedding'], inplace=True) 156 | # if 'chunk_tokens' in df.columns: 157 | # df.drop(columns=['chunk_tokens'], inplace=True) 158 | print("features generated for", file) 159 | 160 | file_name = os.path.basename(file).split(".")[0] 161 | output_dir = f"{SAVE_DIRECTORY}/train" 162 | os.makedirs(output_dir, exist_ok=True) 163 | print(f"saving to {output_dir}/{file_name}.parquet") 164 | df.to_parquet(f"{output_dir}/{file_name}.parquet") 165 | 166 | volume.commit() 167 | return f"done with {file}" 168 | 169 | @app.local_entrypoint() 170 | def 
main(): 171 | 172 | # files = files[0:10] 173 | 174 | model = SAEModel() 175 | 176 | for resp in model.make_features.map(FILES, order_outputs=False, return_exceptions=True): 177 | if isinstance(resp, Exception): 178 | print(f"Exception: {resp}") 179 | continue 180 | print(resp) 181 | 182 | 183 | 184 | -------------------------------------------------------------------------------- /fetch.py: -------------------------------------------------------------------------------- 1 | """ 2 | fetch a file from a modal volume and write it locally 3 | """ 4 | 5 | from modal import App, Image, Volume 6 | 7 | # We first set out configuration variables for our script. 8 | # DATASET_DIR = "/data" 9 | VOLUME = "embeddings" 10 | DATASET_DIR = "/embeddings" 11 | # DATASET_NAME = "HuggingFaceFW/fineweb-edu" 12 | # DATASET_FILES = "sample/10BT/*.parquet" 13 | # DATASET_SAVE ="fineweb-edu-sample-10BT" 14 | MAX_TOKENS = 500 15 | # DATASET_SAVE = f"fineweb-edu-sample-10BT-chunked-{MAX_TOKENS}-HF4-64_32" 16 | DATASET_SAVE = f"fineweb-edu-sample-10BT-chunked-{MAX_TOKENS}-HF4-64_32-top10" 17 | # DATASET_SAVE = f"fineweb-edu-sample-10BT" 18 | # DIRECTORY = f"{DATASET_DIR}/{DATASET_SAVE}/train" 19 | DIRECTORY = f"{DATASET_DIR}/{DATASET_SAVE}" 20 | 21 | # MODEL_ID = "nomic-ai/nomic-embed-text-v1.5" 22 | 23 | # We define our Modal Resources that we'll need 24 | volume = Volume.from_name(VOLUME, create_if_missing=True) 25 | # volume = Volume.from_name("embeddings", create_if_missing=True) 26 | image = Image.debian_slim(python_version="3.9").pip_install( 27 | "datasets==2.16.1", "apache_beam==2.53.0", "transformers", "pandas", "tqdm" 28 | ) 29 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 30 | 31 | 32 | @app.function(volumes={DATASET_DIR: volume}, timeout=3000) 33 | def fetch_dataset(file): 34 | import pandas as pd 35 | from datasets import load_dataset 36 | print("loading", file) 37 | # Load the dataset as a Hugging Face dataset 38 | if file.endswith(".parquet"): 39 | df = pd.read_parquet(file) 40 | else: 41 | dataset = load_dataset("arrow", data_files=file) 42 | df = pd.DataFrame(dataset['train']) 43 | print("file loaded, returning", file) 44 | return df 45 | 46 | @app.local_entrypoint() 47 | def main(): 48 | import pandas as pd 49 | 50 | # file = "data-00000-of-00099.arrow" 51 | file = "data-00000-of-00099.parquet" 52 | # file = "data-00001-of-00099.parquet" 53 | file_path = f"{DIRECTORY}/{file}" 54 | resp = fetch_dataset.remote(file_path) 55 | if isinstance(resp, Exception): 56 | print(f"Exception: {resp}") 57 | else: 58 | print(resp) 59 | # resp.to_parquet(f"./notebooks/{file}") 60 | resp.to_parquet(f"./notebooks/top10-{file}") 61 | -------------------------------------------------------------------------------- /filter.py: -------------------------------------------------------------------------------- 1 | from modal import App, Image, Volume, Secret 2 | 3 | DATASET_DIR="/embeddings" 4 | VOLUME = "embeddings" 5 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF2" # converted the original to a dataset 6 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4" 7 | DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-100BT-chunked-500/train" 8 | 9 | # We define our Modal Resources that we'll need 10 | volume = Volume.from_name(VOLUME, create_if_missing=True) 11 | image = Image.debian_slim(python_version="3.9").pip_install( 12 | "datasets==2.16.1", "apache_beam==2.53.0" 13 | ) 14 | app = App(image=image) # Note: prior to April 2024, "app" was called 
"stub" 15 | 16 | 17 | # The default timeout is 5 minutes re: https://modal.com/docs/guide/timeouts#handling-timeouts 18 | # but we override this to 19 | # 6000s to avoid any potential timeout issues 20 | @app.function( 21 | volumes={DATASET_DIR: volume}, 22 | timeout=60000, 23 | # ephemeral_disk=2145728, # in MiB 24 | ) 25 | def filter_dataset(): 26 | # Redownload the dataset 27 | import time 28 | from datasets import load_from_disk 29 | print("loading") 30 | dataset = load_from_disk(DIRECTORY) 31 | print("filtering") 32 | filtered = dataset.filter(lambda x: x > 50, input_columns=["chunk_token_count"]) 33 | # print("sorting") 34 | # dataset.sort(column_names=["id", "chunk_index"], keep_in_memory=True) 35 | print("saving") 36 | filtered.save_to_disk(SAVE_DIRECTORY, num_shards={"train":99}) 37 | print("done!") 38 | volume.commit() 39 | 40 | @app.function( 41 | volumes={DATASET_DIR: volume}, 42 | timeout=60000, 43 | # ephemeral_disk=2145728, # in MiB 44 | ) 45 | def filter_dataset_file(file): 46 | import pandas as pd 47 | print("loading", file) 48 | df = pd.read_parquet(f"{DIRECTORY}/{file}") 49 | print("filtering", file) 50 | filtered = df[df["chunk_token_count"] > 50] 51 | print("saving", file) 52 | filtered.to_parquet(f"{DIRECTORY}/{file}") 53 | print("done!", file) 54 | volume.commit() 55 | return file 56 | 57 | 58 | 59 | 60 | @app.local_entrypoint() 61 | def main(): 62 | # filter_dataset.remote() 63 | 64 | files = [f"data-{i:05d}-of-00989.parquet" for i in range(100)] 65 | files = files[2:] 66 | for resp in filter_dataset_file.map(files, order_outputs=False, return_exceptions=True): 67 | if isinstance(resp, Exception): 68 | print(f"Exception: {resp}") 69 | continue 70 | print(resp) 71 | 72 | -------------------------------------------------------------------------------- /lancer.py: -------------------------------------------------------------------------------- 1 | """ 2 | Combine chunks, embeddings and features into a single LanceDB table. 3 | 4 | This script loops over each corresponding file: 5 | - The chunk parquet produced by chunker.py (e.g. "/data/medrag-pubmed-500/train/data-00000-of-00138.parquet") 6 | - The embedding npy file produced by features.py (e.g. "/embeddings/medrag-pubmed-500-nomic-embed-text-v1.5/train/data-00000-of-00138.npy") 7 | - The features parquet file produced by features.py (e.g. "/embeddings/medrag-pubmed-500-nomic-embed-text-v1.5-64_32/train/data-00000-of-00138.parquet") 8 | 9 | They are then concatenated (column-wise) row‐by‐row in the natural order and written to a lancedb table. 10 | 11 | Usage (from Modal CLI): 12 | modal run combine.py 13 | """ 14 | 15 | import os 16 | import time 17 | import numpy as np 18 | import pandas as pd 19 | import lancedb 20 | from modal import App, Image, Volume, enter, method, gpu 21 | 22 | # ============================================================================ 23 | # Configuration variables – adjust these to your environment/path names! 
24 | # ============================================================================ 25 | 26 | # Directories for the input files: 27 | # CHUNK_PARQUET_DIR = "/datasets/medrag-pubmed-500/train" 28 | # EMBEDDING_NPY_DIR = "/embeddings/medrag-pubmed-500-nomic-embed-text-v1.5/train" 29 | # FEATURE_PARQUET_DIR = "/embeddings/medrag-pubmed-500-nomic-embed-text-v1.5-64_32/train" 30 | CHUNK_PARQUET_DIR = "/datasets/wikipedia-en-chunked-500/train" 31 | EMBEDDING_NPY_DIR = "/embeddings/wikipedia-en-chunked-500-nomic-embed-text-v1.5/train" 32 | FEATURE_PARQUET_DIR = "/embeddings/wikipedia-en-chunked-500-nomic-embed-text-v1.5-64_32/train" 33 | 34 | 35 | # Directory (volume) where the LanceDB table will be stored. 36 | # LANCE_DB_DIR = "/lancedb/enjalot/medrag-pubmed" 37 | # LANCE_DB_DIR_INDEXED = "/lancedb/enjalot/medrag-pubmed-indexed" 38 | # TMP_LANCE_DB_DIR = "/tmp/medrag-pubmed" 39 | LANCE_DB_DIR = "/lancedb/enjalot/wikipedia-en-500" 40 | LANCE_DB_DIR_INDEXED = "/lancedb/enjalot/wikipedia-en-500-indexed" 41 | TMP_LANCE_DB_DIR = "/tmp/wikipedia-en-500" 42 | 43 | TABLE_NAME = "500-64_32" 44 | 45 | # TOTAL_FILES = 138 # total number of shards (files) 46 | TOTAL_FILES = 41 # total number of shards (files) 47 | D_EMB = 768 # embedding dimension 48 | 49 | # Volume for the lancedb storage 50 | DATASETS_VOLUME = "datasets" 51 | EMBEDDING_VOLUME = "embeddings" 52 | DB_VOLUME = "lancedb" 53 | 54 | # ============================================================================ 55 | # Modal Resources 56 | # ============================================================================ 57 | 58 | volume_db = Volume.from_name(DB_VOLUME, create_if_missing=True) 59 | volume_datasets = Volume.from_name(DATASETS_VOLUME, create_if_missing=True) 60 | volume_embeddings = Volume.from_name(EMBEDDING_VOLUME, create_if_missing=True) 61 | 62 | st_image = ( 63 | Image.debian_slim(python_version="3.10") 64 | .pip_install( 65 | "pandas", "numpy", "lancedb", "pyarrow", "torch", "tantivy" 66 | ) 67 | .env({"RUST_BACKTRACE": "1"}) 68 | ) 69 | 70 | 71 | app = App(image=st_image) 72 | 73 | # ============================================================================ 74 | # Class to combine and write data into a lancedb table 75 | # ============================================================================ 76 | 77 | @app.function(volumes={ 78 | "/datasets": volume_datasets, 79 | "/embeddings": volume_embeddings, 80 | "/lancedb": volume_db 81 | }, 82 | ephemeral_disk=int(1024*1024), # in MiB 83 | image=st_image, 84 | timeout=60*100, 85 | scaledown_window=60*10 86 | ) 87 | def combine(): 88 | """ 89 | Sequentially process each shard by reading the corresponding chunk parquet, 90 | embedding npy, and features parquet files. The data are combined (column-wise) 91 | and then appended to a single lancedb table. 92 | """ 93 | db_path = TMP_LANCE_DB_DIR 94 | print(f"Connecting to LanceDB at: {db_path}") 95 | db = lancedb.connect(db_path) 96 | 97 | for i in range(TOTAL_FILES): 98 | base_file = f"data-{i:05d}-of-{TOTAL_FILES:05d}" 99 | chunk_file = os.path.join(CHUNK_PARQUET_DIR, f"{base_file}.parquet") 100 | embedding_file = os.path.join(EMBEDDING_NPY_DIR, f"{base_file}.npy") 101 | feature_file = os.path.join(FEATURE_PARQUET_DIR, f"{base_file}.parquet") 102 | 103 | print(f"\nProcessing shard: {base_file}") 104 | start_time = time.monotonic() 105 | 106 | # Load the chunk parquet file. 
107 | try: 108 | chunk_df = pd.read_parquet(chunk_file) 109 | except Exception as e: 110 | print(f"Error reading chunk file {chunk_file}: {e}") 111 | break 112 | 113 | # Load the embeddings npy file. 114 | try: 115 | size = os.path.getsize(embedding_file) // (D_EMB * 4) 116 | embedding_np = np.memmap(embedding_file, 117 | dtype='float32', 118 | mode='r', 119 | shape=(size, D_EMB)) 120 | except Exception as e: 121 | print(f"Error reading embedding file {embedding_file}: {e}") 122 | break 123 | 124 | # Load the features parquet file. 125 | try: 126 | feature_df = pd.read_parquet(feature_file) 127 | feature_df = feature_df.rename(columns={ 128 | 'top_indices': 'sae_indices', 129 | 'top_acts': 'sae_acts' 130 | }) 131 | # Convert sae_indices from float to int for each row 132 | feature_df['sae_indices'] = feature_df['sae_indices'].apply(lambda x: [int(i) for i in x]) 133 | except Exception as e: 134 | print(f"Error reading feature file {feature_file}: {e}") 135 | break 136 | 137 | # Validate that the three sources have the same number of rows. 138 | n_chunk = len(chunk_df) 139 | n_embedding = embedding_np.shape[0] 140 | n_feature = len(feature_df) 141 | if not (n_chunk == n_embedding == n_feature): 142 | print(f"Row count mismatch in {base_file}: chunk {n_chunk}, embedding {n_embedding}, feature {n_feature}") 143 | break 144 | 145 | # Store the embedding data as a list column. (Alternatively, you could split the embedding vector into columns.) 146 | 147 | vector_column = list(embedding_np) 148 | 149 | # Combine the dataframes (resetting indices to ensure correct alignment). 150 | combined_df = pd.concat( 151 | [chunk_df.reset_index(drop=True), 152 | feature_df.reset_index(drop=True)], 153 | axis=1, 154 | ) 155 | combined_df["vector"] = vector_column 156 | combined_df["shard"] = i 157 | 158 | if i == 0: 159 | msg = f"Creating LanceDB table '{TABLE_NAME}' at {db_path} with {len(combined_df)} rows." 160 | print(msg) 161 | table = db.create_table(TABLE_NAME, combined_df) 162 | else: 163 | msg = f"Adding shard {base_file} to LanceDB table '{TABLE_NAME}' at {db_path} with {len(combined_df)} rows."
164 | print(msg) 165 | table.add(combined_df) 166 | # if i == 2: 167 | # break 168 | 169 | duration = time.monotonic() - start_time 170 | print(f"Shard {base_file} processed in {duration:.2f} seconds; {n_chunk} rows") 171 | 172 | 173 | print(f"Copying LanceDB to {LANCE_DB_DIR}") 174 | # copy the tmp lancedb directory to the volume 175 | import shutil 176 | shutil.copytree(TMP_LANCE_DB_DIR, LANCE_DB_DIR) 177 | print(f"Done!") 178 | 179 | 180 | @app.function(volumes={ 181 | "/datasets": volume_datasets, 182 | "/embeddings": volume_embeddings, 183 | "/lancedb": volume_db 184 | }, 185 | gpu="A10G", 186 | ephemeral_disk=int(1024*1024), # in MiB 187 | image=st_image, 188 | timeout=60*100, 189 | scaledown_window=60*10 190 | ) 191 | def create_indices(): 192 | import lancedb 193 | import shutil 194 | start_time = time.monotonic() 195 | print(f"Copying table {LANCE_DB_DIR} to {TMP_LANCE_DB_DIR}") 196 | shutil.copytree(LANCE_DB_DIR, TMP_LANCE_DB_DIR) 197 | duration = time.monotonic() - start_time 198 | print(f"Copying table {LANCE_DB_DIR} to {TMP_LANCE_DB_DIR} took {duration:.2f} seconds") 199 | 200 | db = lancedb.connect(TMP_LANCE_DB_DIR) 201 | table = db.open_table(TABLE_NAME) 202 | 203 | # start_time = time.monotonic() 204 | # print(f"Creating index for sae_indices on table '{TABLE_NAME}'") 205 | # table.create_scalar_index("sae_indices", index_type="LABEL_LIST") 206 | # duration = time.monotonic() - start_time 207 | # print(f"Creating index for sae_indices on table '{TABLE_NAME}' took {duration:.2f} seconds") 208 | 209 | start_time = time.monotonic() 210 | print(f"Creating FTS index for title on table '{TABLE_NAME}'") 211 | table.create_fts_index("title") 212 | duration = time.monotonic() - start_time 213 | print(f"Creating FTS index for title on table '{TABLE_NAME}' took {duration:.2f} seconds") 214 | 215 | start_time = time.monotonic() 216 | print(f"Creating ANN index for embeddings on table '{TABLE_NAME}'") 217 | partitions = int(table.count_rows() ** 0.5) * 2 218 | sub_vectors = D_EMB // 16 219 | metric = "cosine" 220 | print(f"Partitioning into {partitions} partitions, {sub_vectors} sub-vectors") 221 | table.create_index( 222 | num_partitions=partitions, 223 | num_sub_vectors=sub_vectors, 224 | metric=metric, 225 | accelerator="cuda" 226 | ) 227 | duration = time.monotonic() - start_time 228 | print(f"Creating ANN index for embeddings on table '{TABLE_NAME}' took {duration:.2f} seconds") 229 | 230 | # print(f"Deleting existing {LANCE_DB_DIR}") 231 | # shutil.rmtree(LANCE_DB_DIR, ignore_errors=True) 232 | start_time = time.monotonic() 233 | print(f"Copying table {TABLE_NAME} to {LANCE_DB_DIR_INDEXED}") 234 | shutil.copytree(TMP_LANCE_DB_DIR, LANCE_DB_DIR_INDEXED, dirs_exist_ok=True) 235 | duration = time.monotonic() - start_time 236 | print(f"Copying table {TMP_LANCE_DB_DIR} to {LANCE_DB_DIR_INDEXED} took {duration:.2f} seconds") 237 | 238 | # ============================================================================ 239 | # Modal Local Entrypoint 240 | # ============================================================================ 241 | 242 | @app.local_entrypoint() 243 | def main(): 244 | # Combine all shards and write to LanceDB. 
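# The intended flow appears to be two passes (a sketch, inferred from the
# commented-out calls below): first run with combine.remote() enabled to build
# the raw table in TMP_LANCE_DB_DIR and copy it to LANCE_DB_DIR, then re-run
# with create_indices.remote() to copy that table back to local disk and build
# the FTS and ANN indices into LANCE_DB_DIR_INDEXED.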
245 | # combine.remote() 246 | # print("done with combine, creating indices") 247 | create_indices.remote() -------------------------------------------------------------------------------- /notebooks/perfile.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "import time\n", 18 | "# import tqdm\n", 19 | "from tqdm.notebook import tqdm # Import the notebook version of tqdm\n", 20 | "\n", 21 | "from datasets import load_dataset\n", 22 | "import pandas as pd\n", 23 | "import numpy as np\n", 24 | "import huggingface_hub\n", 25 | "from huggingface_hub import HfFileSystem\n", 26 | "hffs = HfFileSystem()\n", 27 | "from concurrent.futures import ThreadPoolExecutor, as_completed\n", 28 | "\n", 29 | "import transformers\n", 30 | "transformers.logging.set_verbosity_error()\n", 31 | "from transformers import AutoTokenizer\n" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 2, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "dataset = load_dataset(\"HuggingFaceFW/fineweb-edu\", data_files=\"sample/10BT/*.parquet\", streaming=True, split=\"train\")" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 3, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "files = hffs.ls(\"datasets/HuggingFaceFW/fineweb-edu/sample/10BT\", detail=False)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 5, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "['datasets/HuggingFaceFW/fineweb-edu/sample/10BT/000_00000.parquet',\n", 61 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/001_00000.parquet',\n", 62 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/002_00000.parquet',\n", 63 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/003_00000.parquet',\n", 64 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/004_00000.parquet',\n", 65 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/005_00000.parquet',\n", 66 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/006_00000.parquet',\n", 67 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/007_00000.parquet',\n", 68 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/008_00000.parquet',\n", 69 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/009_00000.parquet',\n", 70 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/010_00000.parquet',\n", 71 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/011_00000.parquet',\n", 72 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/012_00000.parquet',\n", 73 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/013_00000.parquet']" 74 | ] 75 | }, 76 | "execution_count": 5, 77 | "metadata": {}, 78 | "output_type": "execute_result" 79 | } 80 | ], 81 | "source": [ 82 | "files" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 4, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "file = files[0]" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "# df = pd.read_parquet(\"hf://\" + files[0])\n", 101 | "df = pd.read_parquet(file.split(\"/\")[-1])" 
102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "df.head()" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "# df.to_parquet(files[0].split(\"/\")[-1])" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "MAX_TOKENS = 512" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": null, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "# keep_keys = [\"id\", \"url\", \"score\", \"dump\"]" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "# tokenizer = AutoTokenizer.from_pretrained(\"bert-base-uncased\", model_max_length=MAX_TOKENS)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "# def chunk(rows):\n", 156 | "# texts = rows[\"text\"]\n", 157 | "# chunks_index = []\n", 158 | "# chunks_text = []\n", 159 | "# chunks_tokens = []\n", 160 | "# updated_token_counts = []\n", 161 | "\n", 162 | "# # Assuming you have other properties in the rows that you want to retain\n", 163 | "# keep = {key: [] for key in keep_keys}\n", 164 | "\n", 165 | "# for index, text in enumerate(texts):\n", 166 | "# tokens = tokenizer.encode(text)\n", 167 | "# token_count = len(tokens)\n", 168 | "\n", 169 | "# if token_count > MAX_TOKENS:\n", 170 | "# overlap = int(MAX_TOKENS * 0.1)\n", 171 | "# start_index = 0\n", 172 | "# ci = 0\n", 173 | "# while start_index < len(tokens):\n", 174 | "# end_index = min(start_index + MAX_TOKENS, len(tokens))\n", 175 | "# chunk = tokens[start_index:end_index]\n", 176 | "# chunks_index.append(ci)\n", 177 | "# chunks_tokens.append(chunk)\n", 178 | "# updated_token_counts.append(len(chunk))\n", 179 | "# chunks_text.append(tokenizer.decode(chunk))\n", 180 | "# # Copy other properties for each chunk\n", 181 | "# for key in keep:\n", 182 | "# keep[key].append(rows[key][index])\n", 183 | "# start_index += MAX_TOKENS - overlap\n", 184 | "# ci += 1\n", 185 | "# else:\n", 186 | "# chunks_index.append(0)\n", 187 | "# chunks_text.append(text)\n", 188 | "# chunks_tokens.append(tokens)\n", 189 | "# updated_token_counts.append(token_count)\n", 190 | "# # Copy other properties for non-chunked texts\n", 191 | "# for key in keep:\n", 192 | "# keep[key].append(rows[key][index])\n", 193 | "\n", 194 | "# keep[\"chunk_index\"] = chunks_index\n", 195 | "# keep[\"chunk_text\"] = chunks_text\n", 196 | "# keep[\"chunk_tokens\"] = chunks_tokens\n", 197 | "# keep[\"chunk_token_count\"] = updated_token_counts\n", 198 | "# return keep\n" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "def chunk_row(row, tokenizer):\n", 208 | " # print(\"ROW\", row)\n", 209 | " MAX_TOKENS = 512\n", 210 | " keep_keys = [\"id\", \"url\", \"score\", \"dump\"]\n", 211 | " text = row[\"text\"]\n", 212 | " chunks = []\n", 213 | "\n", 214 | " tokens = tokenizer.encode(text)\n", 215 | " token_count = len(tokens)\n", 216 | " if token_count > MAX_TOKENS:\n", 217 | " overlap = int(MAX_TOKENS * 0.1)\n", 218 | " start_index = 0\n", 219 | " ci = 0\n", 220 | " while start_index < len(tokens):\n", 221 | " end_index = 
min(start_index + MAX_TOKENS, len(tokens))\n", 222 | " chunk = tokens[start_index:end_index]\n", 223 | " chunks.append({\n", 224 | " \"chunk_index\": ci,\n", 225 | " \"chunk_text\": tokenizer.decode(chunk),\n", 226 | " \"chunk_tokens\": chunk,\n", 227 | " \"chunk_token_count\": len(chunk),\n", 228 | " **{key: row[key] for key in keep_keys}\n", 229 | " })\n", 230 | " start_index += MAX_TOKENS - overlap\n", 231 | " ci += 1\n", 232 | " else:\n", 233 | " chunks.append({\n", 234 | " \"chunk_index\": 0,\n", 235 | " \"chunk_text\": text,\n", 236 | " \"chunk_tokens\": tokens,\n", 237 | " \"chunk_token_count\": token_count,\n", 238 | " **{key: row[key] for key in keep_keys}\n", 239 | " })\n", 240 | "\n", 241 | " return chunks\n" 242 | ] 243 | }, 244 | { 245 | "cell_type": "code", 246 | "execution_count": null, 247 | "metadata": {}, 248 | "outputs": [], 249 | "source": [ 250 | "def process_dataframe(df):\n", 251 | " chunks_list = []\n", 252 | " with ThreadPoolExecutor(max_workers=16) as executor:\n", 253 | " # Submit all rows to the executor\n", 254 | " pbar = tqdm(total=len(df), desc=\"Processing Rows\")\n", 255 | " \n", 256 | " def process_batch(batch):\n", 257 | " \n", 258 | " tokenizer = AutoTokenizer.from_pretrained(\"bert-base-uncased\", model_max_length=MAX_TOKENS)\n", 259 | " batch_chunks = []\n", 260 | " for row in batch:\n", 261 | " row_chunks = chunk_row(row, tokenizer)\n", 262 | " pbar.update(1)\n", 263 | " batch_chunks.extend(row_chunks)\n", 264 | " return batch_chunks\n", 265 | "\n", 266 | "\n", 267 | " print(\"making batches\")\n", 268 | " batch_size = 200 # Adjust batch size based on your needs\n", 269 | " batches = [df.iloc[i:i + batch_size].to_dict(orient=\"records\") for i in range(0, len(df), batch_size)]\n", 270 | " print(\"made batches\")\n", 271 | " print(\"setting up futures\")\n", 272 | " futures = [executor.submit(process_batch, batch) for batch in batches]\n", 273 | " # futures = [executor.submit(chunk_row, row) for index, row in df.iterrows()]\n", 274 | " # for future in tqdm(as_completed(futures), total=len(df), desc=\"Processing Rows\"):\n", 275 | " # chunks_list.extend(future.result())\n", 276 | " print(\"in the future\")\n", 277 | " # pbar = tqdm(total=len(df)//batch_size, desc=\"Processing Rows\")\n", 278 | " for future in as_completed(futures):\n", 279 | " chunks_list.extend(future.result())\n", 280 | " # print(len(chunks_list))\n", 281 | " # pbar.update(1) # Manually update the progress bar\n", 282 | " pbar.close()\n", 283 | " return chunks_list" 284 | ] 285 | }, 286 | { 287 | "cell_type": "code", 288 | "execution_count": null, 289 | "metadata": {}, 290 | "outputs": [], 291 | "source": [ 292 | "# Process the DataFrame and create a new DataFrame from the list of chunks\n", 293 | "start = time.perf_counter()\n", 294 | "print(f\"Chunking text that is longer than {MAX_TOKENS} tokens\")\n", 295 | "chunked_data = process_dataframe(df)\n", 296 | "print(f\"Dataset chunked in {time.perf_counter() - start:.2f} seconds\")\n", 297 | "start = time.perf_counter()\n", 298 | "chunked_df = pd.DataFrame(chunked_data)\n", 299 | "print(f\"Dataset converted to DataFrame in {time.perf_counter() - start:.2f} seconds\")\n" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": {}, 306 | "outputs": [], 307 | "source": [ 308 | "# chunked_df.to_parquet(\"chunked-\" + file.split(\"/\")[-1])" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": null, 314 | "metadata": {}, 315 | "outputs": [], 316 | "source": [ 317 | 
"len(chunked_df)" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "chunked_df.head()" 327 | ] 328 | }, 329 | { 330 | "cell_type": "code", 331 | "execution_count": null, 332 | "metadata": {}, 333 | "outputs": [], 334 | "source": [] 335 | } 336 | ], 337 | "metadata": { 338 | "kernelspec": { 339 | "display_name": "modalenv", 340 | "language": "python", 341 | "name": "python3" 342 | }, 343 | "language_info": { 344 | "codemirror_mode": { 345 | "name": "ipython", 346 | "version": 3 347 | }, 348 | "file_extension": ".py", 349 | "mimetype": "text/x-python", 350 | "name": "python", 351 | "nbconvert_exporter": "python", 352 | "pygments_lexer": "ipython3", 353 | "version": "3.11.6" 354 | } 355 | }, 356 | "nbformat": 4, 357 | "nbformat_minor": 2 358 | } 359 | -------------------------------------------------------------------------------- /notebooks/small_sample.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "from datasets import load_dataset\n", 10 | "import pandas as pd\n", 11 | "import numpy as np\n", 12 | "\n" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 8, 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "dataset = load_dataset(\"HuggingFaceFW/fineweb-edu\", data_files=\"sample/10BT/*.parquet\", streaming=True, split=\"train\")\n", 22 | "\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 9, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "dataset_head = dataset.take(10000)" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 10, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "df10k = pd.DataFrame(list(dataset_head))" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 11, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "data": { 50 | "text/html": [ 51 | "
\n", 52 | "\n", 65 | "\n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | "
textiddumpurlfile_pathlanguagelanguage_scoretoken_countscoreint_score
0The Independent Jane\\nFor all the love, romanc...<urn:uuid:0d8a309d-25c5-405d-a08a-c11239f0d717>CC-MAIN-2013-20http://austenauthors.net/the-independent-janes3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.9743208452.7500003
1Taking Play Seriously\\nBy ROBIN MARANTZ HENIG\\...<urn:uuid:316c7af5-14e1-4d0b-9576-753e17ef2cc5>CC-MAIN-2013-20http://query.nytimes.com/gst/fullpage.html?res...s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.96145910552.5625003
2How do you get HIV?\\nHIV can be passed on when...<urn:uuid:a3e140cd-7f25-48c9-a2f0-a7d0b1954e0d>CC-MAIN-2013-20http://www.childline.org.uk/Explore/SexRelatio...s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.9667571363.1250003
3CTComms sends on average 2 million emails mont...<urn:uuid:c337bcd8-6aa1-4f2d-8c48-b916442ebbee>CC-MAIN-2013-20http://www.ctt.org/resource_centre/getting_sta...s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.91060234793.2343753
4Hold the salt: UCLA engineers develop revoluti...<urn:uuid:c0b175bb-65fb-420e-a881-a80b91d00ecd>CC-MAIN-2013-20http://www.environment.ucla.edu/water/news/art...s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se...en0.92498111152.8125003
\n", 149 | "
" 150 | ], 151 | "text/plain": [ 152 | " text \\\n", 153 | "0 The Independent Jane\\nFor all the love, romanc... \n", 154 | "1 Taking Play Seriously\\nBy ROBIN MARANTZ HENIG\\... \n", 155 | "2 How do you get HIV?\\nHIV can be passed on when... \n", 156 | "3 CTComms sends on average 2 million emails mont... \n", 157 | "4 Hold the salt: UCLA engineers develop revoluti... \n", 158 | "\n", 159 | " id dump \\\n", 160 | "0 CC-MAIN-2013-20 \n", 161 | "1 CC-MAIN-2013-20 \n", 162 | "2 CC-MAIN-2013-20 \n", 163 | "3 CC-MAIN-2013-20 \n", 164 | "4 CC-MAIN-2013-20 \n", 165 | "\n", 166 | " url \\\n", 167 | "0 http://austenauthors.net/the-independent-jane \n", 168 | "1 http://query.nytimes.com/gst/fullpage.html?res... \n", 169 | "2 http://www.childline.org.uk/Explore/SexRelatio... \n", 170 | "3 http://www.ctt.org/resource_centre/getting_sta... \n", 171 | "4 http://www.environment.ucla.edu/water/news/art... \n", 172 | "\n", 173 | " file_path language language_score \\\n", 174 | "0 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.974320 \n", 175 | "1 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.961459 \n", 176 | "2 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.966757 \n", 177 | "3 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.910602 \n", 178 | "4 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.924981 \n", 179 | "\n", 180 | " token_count score int_score \n", 181 | "0 845 2.750000 3 \n", 182 | "1 1055 2.562500 3 \n", 183 | "2 136 3.125000 3 \n", 184 | "3 3479 3.234375 3 \n", 185 | "4 1115 2.812500 3 " 186 | ] 187 | }, 188 | "execution_count": 11, 189 | "metadata": {}, 190 | "output_type": "execute_result" 191 | } 192 | ], 193 | "source": [ 194 | "df10k.head()" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 12, 200 | "metadata": {}, 201 | "outputs": [], 202 | "source": [ 203 | "import latentscope as ls" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 13, 209 | "metadata": {}, 210 | "outputs": [ 211 | { 212 | "name": "stdout", 213 | "output_type": "stream", 214 | "text": [ 215 | "Initialized env with data directory at /Users/enjalot/latent-scope-data\n" 216 | ] 217 | } 218 | ], 219 | "source": [ 220 | "ls.init(\"~/latent-scope-data\")" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 14, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "name": "stdout", 230 | "output_type": "stream", 231 | "text": [ 232 | "Loading environment variables from: /Users/enjalot/code/latent-testing/notebooks/.env\n", 233 | "DATA DIR /Users/enjalot/latent-scope-data\n", 234 | "DIRECTORY /Users/enjalot/latent-scope-data/fineweb-edu-10k\n", 235 | " text \\\n", 236 | "0 The Independent Jane\\nFor all the love, romanc... \n", 237 | "1 Taking Play Seriously\\nBy ROBIN MARANTZ HENIG\\... \n", 238 | "2 How do you get HIV?\\nHIV can be passed on when... \n", 239 | "3 CTComms sends on average 2 million emails mont... \n", 240 | "4 Hold the salt: UCLA engineers develop revoluti... \n", 241 | "\n", 242 | " id dump \\\n", 243 | "0 CC-MAIN-2013-20 \n", 244 | "1 CC-MAIN-2013-20 \n", 245 | "2 CC-MAIN-2013-20 \n", 246 | "3 CC-MAIN-2013-20 \n", 247 | "4 CC-MAIN-2013-20 \n", 248 | "\n", 249 | " url \\\n", 250 | "0 http://austenauthors.net/the-independent-jane \n", 251 | "1 http://query.nytimes.com/gst/fullpage.html?res... \n", 252 | "2 http://www.childline.org.uk/Explore/SexRelatio... \n", 253 | "3 http://www.ctt.org/resource_centre/getting_sta... \n", 254 | "4 http://www.environment.ucla.edu/water/news/art... 
\n", 255 | "\n", 256 | " file_path language language_score \\\n", 257 | "0 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.974320 \n", 258 | "1 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.961459 \n", 259 | "2 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.966757 \n", 260 | "3 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.910602 \n", 261 | "4 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.924981 \n", 262 | "\n", 263 | " token_count score int_score \n", 264 | "0 845 2.750000 3 \n", 265 | "1 1055 2.562500 3 \n", 266 | "2 136 3.125000 3 \n", 267 | "3 3479 3.234375 3 \n", 268 | "4 1115 2.812500 3 \n", 269 | " text \\\n", 270 | "9995 Here we have the inspiration for the movie tre... \n", 271 | "9996 Love and Logic Resource KitLove and Logic is a... \n", 272 | "9997 In the event of fire, people need to know exac... \n", 273 | "9998 It may be a small comfort to those planning th... \n", 274 | "9999 A 13-year-old middle school student is working... \n", 275 | "\n", 276 | " id dump \\\n", 277 | "9995 CC-MAIN-2017-26 \n", 278 | "9996 CC-MAIN-2017-26 \n", 279 | "9997 CC-MAIN-2017-26 \n", 280 | "9998 CC-MAIN-2017-26 \n", 281 | "9999 CC-MAIN-2017-26 \n", 282 | "\n", 283 | " url \\\n", 284 | "9995 https://www.hamahamaoysters.com/blogs/learn/18... \n", 285 | "9996 http://holly.rpes.schoolfusion.us/modules/cms/... \n", 286 | "9997 http://churchsafety.org.uk/information/fire/f_... \n", 287 | "9998 http://insideindustrynews.com/curiosity-gives-... \n", 288 | "9999 http://juneauempire.com/stories/120505/loc_200... \n", 289 | "\n", 290 | " file_path language \\\n", 291 | "9995 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n", 292 | "9996 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n", 293 | "9997 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n", 294 | "9998 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n", 295 | "9999 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... 
en \n", 296 | "\n", 297 | " language_score token_count score int_score \n", 298 | "9995 0.961133 368 2.875000 3 \n", 299 | "9996 0.895080 249 2.828125 3 \n", 300 | "9997 0.960923 1081 3.171875 3 \n", 301 | "9998 0.938971 141 2.968750 3 \n", 302 | "9999 0.981334 1131 2.859375 3 \n", 303 | "Index(['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score',\n", 304 | " 'token_count', 'score', 'int_score'],\n", 305 | " dtype='object')\n", 306 | "wrote /Users/enjalot/latent-scope-data/fineweb-edu-10k/input.parquet\n" 307 | ] 308 | } 309 | ], 310 | "source": [ 311 | "ls.ingest(\"fineweb-edu-10k\", df10k, \"text\")\n", 312 | "\n" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 17, 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [ 321 | "dataset100k = dataset.remove_columns([\"url\", \"file_path\", \"language_score\"])\n", 322 | "dataset_head100k = dataset100k.take(100000)\n" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 18, 328 | "metadata": {}, 329 | "outputs": [], 330 | "source": [ 331 | "df100k = pd.DataFrame(list(dataset_head100k))" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 19, 337 | "metadata": {}, 338 | "outputs": [ 339 | { 340 | "name": "stdout", 341 | "output_type": "stream", 342 | "text": [ 343 | "Loading environment variables from: /Users/enjalot/code/latent-testing/notebooks/.env\n", 344 | "DATA DIR /Users/enjalot/latent-scope-data\n", 345 | "DIRECTORY /Users/enjalot/latent-scope-data/fineweb-edu-100k\n", 346 | " text \\\n", 347 | "0 The Independent Jane\\nFor all the love, romanc... \n", 348 | "1 Taking Play Seriously\\nBy ROBIN MARANTZ HENIG\\... \n", 349 | "2 How do you get HIV?\\nHIV can be passed on when... \n", 350 | "3 CTComms sends on average 2 million emails mont... \n", 351 | "4 Hold the salt: UCLA engineers develop revoluti... \n", 352 | "\n", 353 | " id dump language \\\n", 354 | "0 CC-MAIN-2013-20 en \n", 355 | "1 CC-MAIN-2013-20 en \n", 356 | "2 CC-MAIN-2013-20 en \n", 357 | "3 CC-MAIN-2013-20 en \n", 358 | "4 CC-MAIN-2013-20 en \n", 359 | "\n", 360 | " token_count score int_score \n", 361 | "0 845 2.750000 3 \n", 362 | "1 1055 2.562500 3 \n", 363 | "2 136 3.125000 3 \n", 364 | "3 3479 3.234375 3 \n", 365 | "4 1115 2.812500 3 \n", 366 | " text \\\n", 367 | "99995 Avoid the extreme, but beware of household can... \n", 368 | "99996 The Gospel of Luke is the third of the four ca... \n", 369 | "99997 It's is short for it is or it has.\\nIts is the... \n", 370 | "99998 As more and more users gain access to the web,... \n", 371 | "99999 Equipping students to successfully navigate th... 
\n", 372 | "\n", 373 | " id dump \\\n", 374 | "99995 CC-MAIN-2013-20 \n", 375 | "99996 CC-MAIN-2013-20 \n", 376 | "99997 CC-MAIN-2013-20 \n", 377 | "99998 CC-MAIN-2013-20 \n", 378 | "99999 CC-MAIN-2013-20 \n", 379 | "\n", 380 | " language token_count score int_score \n", 381 | "99995 en 377 2.531250 3 \n", 382 | "99996 en 1755 3.125000 3 \n", 383 | "99997 en 573 2.828125 3 \n", 384 | "99998 en 648 2.750000 3 \n", 385 | "99999 en 1053 3.578125 4 \n", 386 | "Index(['text', 'id', 'dump', 'language', 'token_count', 'score', 'int_score'], dtype='object')\n", 387 | "wrote /Users/enjalot/latent-scope-data/fineweb-edu-100k/input.parquet\n" 388 | ] 389 | } 390 | ], 391 | "source": [ 392 | "ls.ingest(\"fineweb-edu-100k\", df100k, \"text\")" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": 1, 398 | "metadata": {}, 399 | "outputs": [], 400 | "source": [ 401 | "broken_ids = [] # can put some ids here to check" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 21, 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "filtered_df100k = df100k[df100k['id'].isin(broken_ids)]\n" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 22, 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "data": { 420 | "text/plain": [ 421 | "(512, 7)" 422 | ] 423 | }, 424 | "execution_count": 22, 425 | "metadata": {}, 426 | "output_type": "execute_result" 427 | } 428 | ], 429 | "source": [ 430 | "filtered_df100k.shape" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 23, 436 | "metadata": {}, 437 | "outputs": [ 438 | { 439 | "data": { 440 | "text/html": [ 441 | "
\n", 442 | "\n", 455 | "\n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | "
textiddumplanguagetoken_countscoreint_score
52736The two-button remote control is a very versat...<urn:uuid:52418580-9004-4afd-b9d6-7c991f761b06>CC-MAIN-2013-20en2223.0000003
52737for National Geographic News\\nA new population...<urn:uuid:3dc4e72f-a8c2-48b9-bae3-ea8e374b4462>CC-MAIN-2013-20en4223.8750004
52738The right to access the various documents of t...<urn:uuid:ebf0a847-f329-441d-a625-47eabeb0e52f>CC-MAIN-2013-20en19593.6718754
52739Product Type: Open-file Report\\nAuthor(s): Ell...<urn:uuid:5cd69abe-263e-4246-b7ee-3641dc5b4c17>CC-MAIN-2013-20en5302.7187503
52740BOISE, Idaho – An invasive insect commonly fou...<urn:uuid:528c738a-fa9c-4858-8683-878e210185c4>CC-MAIN-2013-20en3822.6562503
\n", 521 | "
" 522 | ], 523 | "text/plain": [ 524 | " text \\\n", 525 | "52736 The two-button remote control is a very versat... \n", 526 | "52737 for National Geographic News\\nA new population... \n", 527 | "52738 The right to access the various documents of t... \n", 528 | "52739 Product Type: Open-file Report\\nAuthor(s): Ell... \n", 529 | "52740 BOISE, Idaho – An invasive insect commonly fou... \n", 530 | "\n", 531 | " id dump \\\n", 532 | "52736 CC-MAIN-2013-20 \n", 533 | "52737 CC-MAIN-2013-20 \n", 534 | "52738 CC-MAIN-2013-20 \n", 535 | "52739 CC-MAIN-2013-20 \n", 536 | "52740 CC-MAIN-2013-20 \n", 537 | "\n", 538 | " language token_count score int_score \n", 539 | "52736 en 222 3.000000 3 \n", 540 | "52737 en 422 3.875000 4 \n", 541 | "52738 en 1959 3.671875 4 \n", 542 | "52739 en 530 2.718750 3 \n", 543 | "52740 en 382 2.656250 3 " 544 | ] 545 | }, 546 | "execution_count": 23, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "filtered_df100k.head()" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": 24, 558 | "metadata": {}, 559 | "outputs": [ 560 | { 561 | "data": { 562 | "text/plain": [ 563 | "[222,\n", 564 | " 422,\n", 565 | " 1959,\n", 566 | " 530,\n", 567 | " 382,\n", 568 | " 11986,\n", 569 | " 129,\n", 570 | " 652,\n", 571 | " 329,\n", 572 | " 4472,\n", 573 | " 1046,\n", 574 | " 453,\n", 575 | " 212,\n", 576 | " 473,\n", 577 | " 1503,\n", 578 | " 356,\n", 579 | " 307,\n", 580 | " 245,\n", 581 | " 420,\n", 582 | " 761,\n", 583 | " 392,\n", 584 | " 1327,\n", 585 | " 284,\n", 586 | " 2369,\n", 587 | " 170,\n", 588 | " 198,\n", 589 | " 1128,\n", 590 | " 592,\n", 591 | " 488,\n", 592 | " 267,\n", 593 | " 1440,\n", 594 | " 496,\n", 595 | " 373,\n", 596 | " 2140,\n", 597 | " 844,\n", 598 | " 250,\n", 599 | " 229,\n", 600 | " 597,\n", 601 | " 858,\n", 602 | " 219,\n", 603 | " 381,\n", 604 | " 787,\n", 605 | " 784,\n", 606 | " 811,\n", 607 | " 124,\n", 608 | " 251,\n", 609 | " 493,\n", 610 | " 257,\n", 611 | " 313,\n", 612 | " 619,\n", 613 | " 593,\n", 614 | " 528,\n", 615 | " 581,\n", 616 | " 707,\n", 617 | " 192,\n", 618 | " 755,\n", 619 | " 207,\n", 620 | " 885,\n", 621 | " 187,\n", 622 | " 1141,\n", 623 | " 1089,\n", 624 | " 975,\n", 625 | " 630,\n", 626 | " 306,\n", 627 | " 767,\n", 628 | " 353,\n", 629 | " 143,\n", 630 | " 774,\n", 631 | " 465,\n", 632 | " 870,\n", 633 | " 9691,\n", 634 | " 393,\n", 635 | " 429,\n", 636 | " 541,\n", 637 | " 671,\n", 638 | " 219,\n", 639 | " 599,\n", 640 | " 682,\n", 641 | " 561,\n", 642 | " 704,\n", 643 | " 788,\n", 644 | " 374,\n", 645 | " 334,\n", 646 | " 398,\n", 647 | " 348,\n", 648 | " 693,\n", 649 | " 611,\n", 650 | " 274,\n", 651 | " 753,\n", 652 | " 1326,\n", 653 | " 521,\n", 654 | " 1686,\n", 655 | " 747,\n", 656 | " 470,\n", 657 | " 332,\n", 658 | " 2011,\n", 659 | " 727,\n", 660 | " 23407,\n", 661 | " 464,\n", 662 | " 175,\n", 663 | " 751,\n", 664 | " 428,\n", 665 | " 148,\n", 666 | " 425,\n", 667 | " 200,\n", 668 | " 283,\n", 669 | " 642,\n", 670 | " 700,\n", 671 | " 771,\n", 672 | " 859,\n", 673 | " 547,\n", 674 | " 230,\n", 675 | " 1425,\n", 676 | " 1212,\n", 677 | " 680,\n", 678 | " 863,\n", 679 | " 108,\n", 680 | " 345,\n", 681 | " 187,\n", 682 | " 363,\n", 683 | " 2336,\n", 684 | " 3878,\n", 685 | " 631,\n", 686 | " 281,\n", 687 | " 256,\n", 688 | " 1811,\n", 689 | " 438,\n", 690 | " 1122,\n", 691 | " 1205,\n", 692 | " 3044,\n", 693 | " 978,\n", 694 | " 1199,\n", 695 | " 2367,\n", 696 | " 1791,\n", 697 | " 832,\n", 698 | " 608,\n", 699 | " 774,\n", 700 | " 
456,\n", 701 | " 275,\n", 702 | " 569,\n", 703 | " 1537,\n", 704 | " 5759,\n", 705 | " 889,\n", 706 | " 317,\n", 707 | " 248,\n", 708 | " 360,\n", 709 | " 3122,\n", 710 | " 1723,\n", 711 | " 429,\n", 712 | " 920,\n", 713 | " 747,\n", 714 | " 271,\n", 715 | " 851,\n", 716 | " 2007,\n", 717 | " 161,\n", 718 | " 1054,\n", 719 | " 484,\n", 720 | " 936,\n", 721 | " 700,\n", 722 | " 257,\n", 723 | " 1191,\n", 724 | " 218,\n", 725 | " 443,\n", 726 | " 866,\n", 727 | " 717,\n", 728 | " 348,\n", 729 | " 1402,\n", 730 | " 467,\n", 731 | " 2245,\n", 732 | " 122,\n", 733 | " 812,\n", 734 | " 670,\n", 735 | " 413,\n", 736 | " 1831,\n", 737 | " 2151,\n", 738 | " 367,\n", 739 | " 537,\n", 740 | " 983,\n", 741 | " 348,\n", 742 | " 3545,\n", 743 | " 887,\n", 744 | " 184,\n", 745 | " 204,\n", 746 | " 980,\n", 747 | " 227,\n", 748 | " 798,\n", 749 | " 408,\n", 750 | " 374,\n", 751 | " 243,\n", 752 | " 1821,\n", 753 | " 249,\n", 754 | " 432,\n", 755 | " 560,\n", 756 | " 334,\n", 757 | " 1389,\n", 758 | " 890,\n", 759 | " 346,\n", 760 | " 524,\n", 761 | " 313,\n", 762 | " 528,\n", 763 | " 154,\n", 764 | " 261,\n", 765 | " 1890,\n", 766 | " 471,\n", 767 | " 3951,\n", 768 | " 461,\n", 769 | " 595,\n", 770 | " 320,\n", 771 | " 676,\n", 772 | " 1002,\n", 773 | " 1871,\n", 774 | " 370,\n", 775 | " 4132,\n", 776 | " 996,\n", 777 | " 435,\n", 778 | " 1010,\n", 779 | " 308,\n", 780 | " 288,\n", 781 | " 484,\n", 782 | " 368,\n", 783 | " 405,\n", 784 | " 378,\n", 785 | " 514,\n", 786 | " 895,\n", 787 | " 232,\n", 788 | " 110,\n", 789 | " 374,\n", 790 | " 433,\n", 791 | " 788,\n", 792 | " 403,\n", 793 | " 1217,\n", 794 | " 849,\n", 795 | " 333,\n", 796 | " 126,\n", 797 | " 324,\n", 798 | " 977,\n", 799 | " 295,\n", 800 | " 1629,\n", 801 | " 319,\n", 802 | " 350,\n", 803 | " 128,\n", 804 | " 754,\n", 805 | " 779,\n", 806 | " 314,\n", 807 | " 604,\n", 808 | " 391,\n", 809 | " 242,\n", 810 | " 403,\n", 811 | " 1291,\n", 812 | " 112,\n", 813 | " 263,\n", 814 | " 128,\n", 815 | " 1620,\n", 816 | " 543,\n", 817 | " 800,\n", 818 | " 973,\n", 819 | " 552,\n", 820 | " 244,\n", 821 | " 628,\n", 822 | " 418,\n", 823 | " 428,\n", 824 | " 412,\n", 825 | " 809,\n", 826 | " 240,\n", 827 | " 940,\n", 828 | " 747,\n", 829 | " 6330,\n", 830 | " 469,\n", 831 | " 770,\n", 832 | " 188,\n", 833 | " 952,\n", 834 | " 1575,\n", 835 | " 790,\n", 836 | " 1178,\n", 837 | " 439,\n", 838 | " 4270,\n", 839 | " 834,\n", 840 | " 527,\n", 841 | " 206,\n", 842 | " 683,\n", 843 | " 541,\n", 844 | " 257,\n", 845 | " 191,\n", 846 | " 390,\n", 847 | " 267,\n", 848 | " 316,\n", 849 | " 1029,\n", 850 | " 233,\n", 851 | " 261,\n", 852 | " 3734,\n", 853 | " 799,\n", 854 | " 275,\n", 855 | " 388,\n", 856 | " 1718,\n", 857 | " 6228,\n", 858 | " 188,\n", 859 | " 367,\n", 860 | " 648,\n", 861 | " 1717,\n", 862 | " 1196,\n", 863 | " 639,\n", 864 | " 1904,\n", 865 | " 1107,\n", 866 | " 1127,\n", 867 | " 414,\n", 868 | " 341,\n", 869 | " 936,\n", 870 | " 124,\n", 871 | " 704,\n", 872 | " 359,\n", 873 | " 631,\n", 874 | " 771,\n", 875 | " 853,\n", 876 | " 892,\n", 877 | " 796,\n", 878 | " 302,\n", 879 | " 2938,\n", 880 | " 289,\n", 881 | " 1287,\n", 882 | " 3105,\n", 883 | " 3493,\n", 884 | " 812,\n", 885 | " 1861,\n", 886 | " 425,\n", 887 | " 475,\n", 888 | " 348,\n", 889 | " 241,\n", 890 | " 2461,\n", 891 | " 1359,\n", 892 | " 755,\n", 893 | " 741,\n", 894 | " 205,\n", 895 | " 145,\n", 896 | " 380,\n", 897 | " 1028,\n", 898 | " 364,\n", 899 | " 553,\n", 900 | " 301,\n", 901 | " 770,\n", 902 | " 319,\n", 903 | " 208,\n", 904 | " 1006,\n", 905 | " 559,\n", 906 | " 
334,\n", 907 | " 399,\n", 908 | " 1010,\n", 909 | " 162,\n", 910 | " 528,\n", 911 | " 1272,\n", 912 | " 348,\n", 913 | " 1823,\n", 914 | " 1690,\n", 915 | " 1991,\n", 916 | " 472,\n", 917 | " 2442,\n", 918 | " 461,\n", 919 | " 1204,\n", 920 | " 738,\n", 921 | " 267,\n", 922 | " 943,\n", 923 | " 680,\n", 924 | " 3376,\n", 925 | " 804,\n", 926 | " 701,\n", 927 | " 1482,\n", 928 | " 283,\n", 929 | " 466,\n", 930 | " 533,\n", 931 | " 170,\n", 932 | " 880,\n", 933 | " 2902,\n", 934 | " 980,\n", 935 | " 434,\n", 936 | " 1280,\n", 937 | " 580,\n", 938 | " 229,\n", 939 | " 84,\n", 940 | " 257,\n", 941 | " 286,\n", 942 | " 175,\n", 943 | " 198,\n", 944 | " 2043,\n", 945 | " 335,\n", 946 | " 240,\n", 947 | " 1517,\n", 948 | " 5200,\n", 949 | " 539,\n", 950 | " 1022,\n", 951 | " 11524,\n", 952 | " 187,\n", 953 | " 158,\n", 954 | " 658,\n", 955 | " 165,\n", 956 | " 283,\n", 957 | " 736,\n", 958 | " 195,\n", 959 | " 871,\n", 960 | " 801,\n", 961 | " 178,\n", 962 | " 1267,\n", 963 | " 112,\n", 964 | " 717,\n", 965 | " 327,\n", 966 | " 846,\n", 967 | " 253,\n", 968 | " 520,\n", 969 | " 101,\n", 970 | " 626,\n", 971 | " 945,\n", 972 | " 454,\n", 973 | " 254,\n", 974 | " 775,\n", 975 | " 520,\n", 976 | " 753,\n", 977 | " 2658,\n", 978 | " 2021,\n", 979 | " 855,\n", 980 | " 3316,\n", 981 | " 2032,\n", 982 | " 8629,\n", 983 | " 762,\n", 984 | " 3730,\n", 985 | " 1576,\n", 986 | " 328,\n", 987 | " 1115,\n", 988 | " 496,\n", 989 | " 770,\n", 990 | " 143,\n", 991 | " 133,\n", 992 | " 743,\n", 993 | " 348,\n", 994 | " 214,\n", 995 | " 580,\n", 996 | " 2310,\n", 997 | " 204,\n", 998 | " 312,\n", 999 | " 815,\n", 1000 | " 417,\n", 1001 | " 843,\n", 1002 | " 329,\n", 1003 | " 3034,\n", 1004 | " 410,\n", 1005 | " 672,\n", 1006 | " 225,\n", 1007 | " 673,\n", 1008 | " 415,\n", 1009 | " 1475,\n", 1010 | " 444,\n", 1011 | " 780,\n", 1012 | " 497,\n", 1013 | " 586,\n", 1014 | " 1161,\n", 1015 | " 1608,\n", 1016 | " 752,\n", 1017 | " 600,\n", 1018 | " 1645,\n", 1019 | " 155,\n", 1020 | " 56446,\n", 1021 | " 562,\n", 1022 | " 513,\n", 1023 | " 6647,\n", 1024 | " 660,\n", 1025 | " 112,\n", 1026 | " 1539,\n", 1027 | " 1220,\n", 1028 | " 1281,\n", 1029 | " 741,\n", 1030 | " 1078,\n", 1031 | " 474,\n", 1032 | " 864,\n", 1033 | " 182,\n", 1034 | " 244,\n", 1035 | " 1278,\n", 1036 | " 1056,\n", 1037 | " 647,\n", 1038 | " 358,\n", 1039 | " 535,\n", 1040 | " 2641,\n", 1041 | " 364,\n", 1042 | " 413,\n", 1043 | " 720,\n", 1044 | " 976,\n", 1045 | " 510,\n", 1046 | " 686,\n", 1047 | " 427,\n", 1048 | " 2311,\n", 1049 | " 238,\n", 1050 | " 4432,\n", 1051 | " 277,\n", 1052 | " 356,\n", 1053 | " 665,\n", 1054 | " 311,\n", 1055 | " 886,\n", 1056 | " 1529,\n", 1057 | " 1467,\n", 1058 | " 305,\n", 1059 | " 350,\n", 1060 | " 1839,\n", 1061 | " 316,\n", 1062 | " 1613,\n", 1063 | " 229,\n", 1064 | " 198,\n", 1065 | " 1235,\n", 1066 | " 2633,\n", 1067 | " 809,\n", 1068 | " 4255,\n", 1069 | " 1864,\n", 1070 | " 606,\n", 1071 | " 497,\n", 1072 | " 793,\n", 1073 | " 1371,\n", 1074 | " 1703]" 1075 | ] 1076 | }, 1077 | "execution_count": 24, 1078 | "metadata": {}, 1079 | "output_type": "execute_result" 1080 | } 1081 | ], 1082 | "source": [ 1083 | "filtered_df100k[\"token_count\"].to_list()" 1084 | ] 1085 | }, 1086 | { 1087 | "cell_type": "code", 1088 | "execution_count": 27, 1089 | "metadata": {}, 1090 | "outputs": [ 1091 | { 1092 | "name": "stderr", 1093 | "output_type": "stream", 1094 | "text": [ 1095 | "/var/folders/sx/rrvr6l_d5x1_g46jxlx5ypfc0000gn/T/ipykernel_76251/2136929811.py:1: SettingWithCopyWarning: \n", 1096 | "A value is trying to be 
set on a copy of a slice from a DataFrame.\n", 1097 | "Try using .loc[row_indexer,col_indexer] = value instead\n", 1098 | "\n", 1099 | "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", 1100 | " filtered_df100k['text_length'] = filtered_df100k['text'].apply(len)\n" 1101 | ] 1102 | }, 1103 | { 1104 | "data": { 1105 | "text/html": [ 1106 | "
\n", 1107 | "\n", 1120 | "\n", 1121 | " \n", 1122 | " \n", 1123 | " \n", 1124 | " \n", 1125 | " \n", 1126 | " \n", 1127 | " \n", 1128 | " \n", 1129 | " \n", 1130 | " \n", 1131 | " \n", 1132 | " \n", 1133 | " \n", 1134 | " \n", 1135 | " \n", 1136 | " \n", 1137 | " \n", 1138 | " \n", 1139 | " \n", 1140 | " \n", 1141 | " \n", 1142 | " \n", 1143 | " \n", 1144 | " \n", 1145 | " \n", 1146 | " \n", 1147 | " \n", 1148 | " \n", 1149 | " \n", 1150 | " \n", 1151 | " \n", 1152 | " \n", 1153 | " \n", 1154 | " \n", 1155 | " \n", 1156 | " \n", 1157 | " \n", 1158 | " \n", 1159 | " \n", 1160 | " \n", 1161 | " \n", 1162 | " \n", 1163 | " \n", 1164 | " \n", 1165 | " \n", 1166 | " \n", 1167 | " \n", 1168 | " \n", 1169 | " \n", 1170 | " \n", 1171 | " \n", 1172 | " \n", 1173 | " \n", 1174 | " \n", 1175 | " \n", 1176 | " \n", 1177 | " \n", 1178 | " \n", 1179 | " \n", 1180 | " \n", 1181 | " \n", 1182 | " \n", 1183 | " \n", 1184 | " \n", 1185 | "
token_counttext_length
52736222974
527374221881
52738195910870
527395302720
527403821869
.........
532436062935
532444972507
532457933819
5324613716762
5324717038438
\n", 1186 | "

512 rows × 2 columns

\n", 1187 | "
" 1188 | ], 1189 | "text/plain": [ 1190 | " token_count text_length\n", 1191 | "52736 222 974\n", 1192 | "52737 422 1881\n", 1193 | "52738 1959 10870\n", 1194 | "52739 530 2720\n", 1195 | "52740 382 1869\n", 1196 | "... ... ...\n", 1197 | "53243 606 2935\n", 1198 | "53244 497 2507\n", 1199 | "53245 793 3819\n", 1200 | "53246 1371 6762\n", 1201 | "53247 1703 8438\n", 1202 | "\n", 1203 | "[512 rows x 2 columns]" 1204 | ] 1205 | }, 1206 | "execution_count": 27, 1207 | "metadata": {}, 1208 | "output_type": "execute_result" 1209 | } 1210 | ], 1211 | "source": [ 1212 | "filtered_df100k['text_length'] = filtered_df100k['text'].apply(len)\n", 1213 | "filtered_df100k[['token_count', 'text_length']]\n" 1214 | ] 1215 | }, 1216 | { 1217 | "cell_type": "code", 1218 | "execution_count": 44, 1219 | "metadata": {}, 1220 | "outputs": [], 1221 | "source": [ 1222 | "df100k['text_length'] = df100k['text'].apply(len)\n", 1223 | "sorted_df = df100k.sort_values(by='token_count', ascending=False)\n", 1224 | "sorted_df = sorted_df[sorted_df[\"text_length\"] > 10000]\n" 1225 | ] 1226 | }, 1227 | { 1228 | "cell_type": "code", 1229 | "execution_count": 47, 1230 | "metadata": {}, 1231 | "outputs": [ 1232 | { 1233 | "data": { 1234 | "text/plain": [ 1235 | "(8444, 8)" 1236 | ] 1237 | }, 1238 | "execution_count": 47, 1239 | "metadata": {}, 1240 | "output_type": "execute_result" 1241 | } 1242 | ], 1243 | "source": [ 1244 | "sorted_df.shape" 1245 | ] 1246 | }, 1247 | { 1248 | "cell_type": "code", 1249 | "execution_count": 48, 1250 | "metadata": {}, 1251 | "outputs": [ 1252 | { 1253 | "name": "stdout", 1254 | "output_type": "stream", 1255 | "text": [ 1256 | "The smallest text_length where token_count is more than 8192 is: 2385\n" 1257 | ] 1258 | } 1259 | ], 1260 | "source": [ 1261 | "# Filter the DataFrame to find entries where token_count is more than 8000\n", 1262 | "high_token_count_df = df100k[df100k['token_count'] > 2048]\n", 1263 | "\n", 1264 | "# Find the minimum text_length from the filtered DataFrame\n", 1265 | "min_text_length = high_token_count_df['text_length'].min()\n", 1266 | "\n", 1267 | "# Print the result\n", 1268 | "print(\"The smallest text_length where token_count is more than 8192 is:\", min_text_length)\n" 1269 | ] 1270 | }, 1271 | { 1272 | "cell_type": "code", 1273 | "execution_count": 46, 1274 | "metadata": {}, 1275 | "outputs": [ 1276 | { 1277 | "data": { 1278 | "text/html": [ 1279 | "
\n", 1280 | "\n", 1293 | "\n", 1294 | " \n", 1295 | " \n", 1296 | " \n", 1297 | " \n", 1298 | " \n", 1299 | " \n", 1300 | " \n", 1301 | " \n", 1302 | " \n", 1303 | " \n", 1304 | " \n", 1305 | " \n", 1306 | " \n", 1307 | " \n", 1308 | " \n", 1309 | " \n", 1310 | " \n", 1311 | " \n", 1312 | " \n", 1313 | " \n", 1314 | " \n", 1315 | " \n", 1316 | " \n", 1317 | " \n", 1318 | " \n", 1319 | " \n", 1320 | " \n", 1321 | " \n", 1322 | " \n", 1323 | " \n", 1324 | " \n", 1325 | " \n", 1326 | " \n", 1327 | " \n", 1328 | " \n", 1329 | " \n", 1330 | " \n", 1331 | " \n", 1332 | " \n", 1333 | " \n", 1334 | " \n", 1335 | " \n", 1336 | " \n", 1337 | " \n", 1338 | " \n", 1339 | " \n", 1340 | " \n", 1341 | " \n", 1342 | " \n", 1343 | " \n", 1344 | " \n", 1345 | " \n", 1346 | " \n", 1347 | " \n", 1348 | " \n", 1349 | " \n", 1350 | " \n", 1351 | " \n", 1352 | " \n", 1353 | " \n", 1354 | " \n", 1355 | " \n", 1356 | " \n", 1357 | " \n", 1358 | " \n", 1359 | " \n", 1360 | " \n", 1361 | " \n", 1362 | " \n", 1363 | " \n", 1364 | " \n", 1365 | " \n", 1366 | " \n", 1367 | " \n", 1368 | " \n", 1369 | " \n", 1370 | " \n", 1371 | " \n", 1372 | " \n", 1373 | " \n", 1374 | " \n", 1375 | " \n", 1376 | " \n", 1377 | " \n", 1378 | " \n", 1379 | " \n", 1380 | " \n", 1381 | " \n", 1382 | " \n", 1383 | " \n", 1384 | " \n", 1385 | " \n", 1386 | " \n", 1387 | " \n", 1388 | " \n", 1389 | " \n", 1390 | " \n", 1391 | " \n", 1392 | " \n", 1393 | " \n", 1394 | " \n", 1395 | " \n", 1396 | " \n", 1397 | " \n", 1398 | " \n", 1399 | " \n", 1400 | " \n", 1401 | " \n", 1402 | " \n", 1403 | " \n", 1404 | " \n", 1405 | " \n", 1406 | " \n", 1407 | " \n", 1408 | " \n", 1409 | " \n", 1410 | " \n", 1411 | " \n", 1412 | " \n", 1413 | " \n", 1414 | " \n", 1415 | " \n", 1416 | " \n", 1417 | " \n", 1418 | " \n", 1419 | " \n", 1420 | " \n", 1421 | " \n", 1422 | " \n", 1423 | " \n", 1424 | " \n", 1425 | " \n", 1426 | " \n", 1427 | " \n", 1428 | " \n", 1429 | " \n", 1430 | " \n", 1431 | " \n", 1432 | " \n", 1433 | " \n", 1434 | " \n", 1435 | " \n", 1436 | " \n", 1437 | " \n", 1438 | " \n", 1439 | " \n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | " \n", 1461 | " \n", 1462 | " \n", 1463 | " \n", 1464 | " \n", 1465 | " \n", 1466 | " \n", 1467 | " \n", 1468 | " \n", 1469 | " \n", 1470 | " \n", 1471 | " \n", 1472 | " \n", 1473 | " \n", 1474 | " \n", 1475 | " \n", 1476 | " \n", 1477 | " \n", 1478 | " \n", 1479 | " \n", 1480 | " \n", 1481 | " \n", 1482 | " \n", 1483 | " \n", 1484 | " \n", 1485 | " \n", 1486 | " \n", 1487 | " \n", 1488 | " \n", 1489 | " \n", 1490 | " \n", 1491 | " \n", 1492 | " \n", 1493 | " \n", 1494 | " \n", 1495 | " \n", 1496 | " \n", 1497 | " \n", 1498 | " \n", 1499 | " \n", 1500 | " \n", 1501 | " \n", 1502 | " \n", 1503 | " \n", 1504 | " \n", 1505 | " \n", 1506 | " \n", 1507 | " \n", 1508 | " \n", 1509 | " \n", 1510 | " \n", 1511 | " \n", 1512 | " \n", 1513 | " \n", 1514 | " \n", 1515 | " \n", 1516 | " \n", 1517 | " \n", 1518 | " \n", 1519 | " \n", 1520 | " \n", 1521 | " \n", 1522 | " \n", 1523 | " \n", 1524 | " \n", 1525 | " \n", 1526 | " \n", 1527 | " \n", 1528 | " \n", 1529 | " \n", 1530 | " \n", 1531 | " \n", 1532 | " \n", 1533 | " \n", 1534 | " \n", 1535 | " \n", 1536 | " \n", 1537 | " \n", 1538 | " \n", 1539 | " \n", 1540 | " \n", 1541 | " \n", 1542 | " \n", 1543 | " \n", 1544 | " \n", 1545 | " 
\n", 1546 | " \n", 1547 | " \n", 1548 | " \n", 1549 | " \n", 1550 | " \n", 1551 | " \n", 1552 | " \n", 1553 | " \n", 1554 | " \n", 1555 | " \n", 1556 | " \n", 1557 | " \n", 1558 | " \n", 1559 | " \n", 1560 | " \n", 1561 | " \n", 1562 | " \n", 1563 | " \n", 1564 | " \n", 1565 | " \n", 1566 | " \n", 1567 | " \n", 1568 | " \n", 1569 | " \n", 1570 | " \n", 1571 | " \n", 1572 | " \n", 1573 | " \n", 1574 | " \n", 1575 | " \n", 1576 | " \n", 1577 | " \n", 1578 | " \n", 1579 | " \n", 1580 | " \n", 1581 | " \n", 1582 | " \n", 1583 | " \n", 1584 | " \n", 1585 | " \n", 1586 | " \n", 1587 | " \n", 1588 | " \n", 1589 | " \n", 1590 | " \n", 1591 | " \n", 1592 | " \n", 1593 | " \n", 1594 | " \n", 1595 | " \n", 1596 | " \n", 1597 | " \n", 1598 | " \n", 1599 | " \n", 1600 | " \n", 1601 | " \n", 1602 | " \n", 1603 | " \n", 1604 | " \n", 1605 | " \n", 1606 | " \n", 1607 | " \n", 1608 | " \n", 1609 | " \n", 1610 | " \n", 1611 | " \n", 1612 | " \n", 1613 | " \n", 1614 | " \n", 1615 | " \n", 1616 | " \n", 1617 | " \n", 1618 | " \n", 1619 | " \n", 1620 | " \n", 1621 | " \n", 1622 | " \n", 1623 | " \n", 1624 | " \n", 1625 | " \n", 1626 | " \n", 1627 | " \n", 1628 | " \n", 1629 | " \n", 1630 | " \n", 1631 | " \n", 1632 | " \n", 1633 | " \n", 1634 | " \n", 1635 | " \n", 1636 | " \n", 1637 | " \n", 1638 | " \n", 1639 | " \n", 1640 | " \n", 1641 | " \n", 1642 | " \n", 1643 | " \n", 1644 | " \n", 1645 | " \n", 1646 | " \n", 1647 | " \n", 1648 | " \n", 1649 | " \n", 1650 | " \n", 1651 | " \n", 1652 | " \n", 1653 | " \n", 1654 | " \n", 1655 | " \n", 1656 | " \n", 1657 | " \n", 1658 | " \n", 1659 | " \n", 1660 | " \n", 1661 | " \n", 1662 | " \n", 1663 | " \n", 1664 | " \n", 1665 | " \n", 1666 | " \n", 1667 | " \n", 1668 | " \n", 1669 | " \n", 1670 | " \n", 1671 | " \n", 1672 | " \n", 1673 | " \n", 1674 | " \n", 1675 | " \n", 1676 | " \n", 1677 | " \n", 1678 | " \n", 1679 | " \n", 1680 | " \n", 1681 | " \n", 1682 | " \n", 1683 | " \n", 1684 | " \n", 1685 | " \n", 1686 | " \n", 1687 | " \n", 1688 | " \n", 1689 | " \n", 1690 | " \n", 1691 | " \n", 1692 | " \n", 1693 | " \n", 1694 | " \n", 1695 | " \n", 1696 | " \n", 1697 | " \n", 1698 | " \n", 1699 | " \n", 1700 | " \n", 1701 | " \n", 1702 | " \n", 1703 | " \n", 1704 | " \n", 1705 | " \n", 1706 | " \n", 1707 | " \n", 1708 | " \n", 1709 | " \n", 1710 | " \n", 1711 | " \n", 1712 | " \n", 1713 | " \n", 1714 | " \n", 1715 | " \n", 1716 | " \n", 1717 | " \n", 1718 | " \n", 1719 | " \n", 1720 | " \n", 1721 | " \n", 1722 | " \n", 1723 | " \n", 1724 | " \n", 1725 | " \n", 1726 | " \n", 1727 | " \n", 1728 | " \n", 1729 | " \n", 1730 | " \n", 1731 | " \n", 1732 | " \n", 1733 | " \n", 1734 | " \n", 1735 | " \n", 1736 | " \n", 1737 | " \n", 1738 | " \n", 1739 | " \n", 1740 | " \n", 1741 | " \n", 1742 | " \n", 1743 | " \n", 1744 | " \n", 1745 | " \n", 1746 | " \n", 1747 | " \n", 1748 | " \n", 1749 | " \n", 1750 | " \n", 1751 | " \n", 1752 | " \n", 1753 | " \n", 1754 | " \n", 1755 | " \n", 1756 | " \n", 1757 | " \n", 1758 | " \n", 1759 | " \n", 1760 | " \n", 1761 | " \n", 1762 | " \n", 1763 | " \n", 1764 | " \n", 1765 | " \n", 1766 | " \n", 1767 | " \n", 1768 | " \n", 1769 | " \n", 1770 | " \n", 1771 | " \n", 1772 | " \n", 1773 | " \n", 1774 | " \n", 1775 | " \n", 1776 | " \n", 1777 | " \n", 1778 | " \n", 1779 | " \n", 1780 | " \n", 1781 | " \n", 1782 | " \n", 1783 | " \n", 1784 | " \n", 1785 | " \n", 1786 | " \n", 1787 | " \n", 1788 | " \n", 1789 | " \n", 1790 | " \n", 1791 | " \n", 1792 | " \n", 1793 | " \n", 1794 | " \n", 1795 | " \n", 1796 | " \n", 1797 | " \n", 1798 | " \n", 1799 | 
" \n", 1800 | " \n", 1801 | " \n", 1802 | " \n", 1803 | "
token_counttext_length
57385104023485818
8741101566485117
2646283662336015
4873881132302843
2254569087306273
5984268874328275
6690064344328697
5049359206272464
5319356446280809
2514646862203838
9891546664238568
6348446596217907
2636346018206822
2048144657165152
1975243150207965
2101142434190152
1993241801157942
7599841636197039
5407241160205458
7544640872173387
2158740787171901
6119839787165134
4189738832184465
3728038682179243
6043838664168148
5271237943180434
4951336064144425
1077035178161248
3087534214156451
2141633794108263
7688333370151322
6191332132145142
8267432022110127
2587931153139767
5024130879141398
9128530841126778
8593230799123451
6756430505110449
2380430502133703
6445330140122132
2049129787147622
9881029474104333
2377929404109206
1247628799118054
8479128792126564
1653628782134643
677028745139452
6449228731151428
5569328615126376
9663528603128460
8703528458126872
9737228073128217
1696627827121963
5428227622110453
6442227399126250
4309526805127607
1122326774107899
93826697124276
721302661695360
3581526602100388
6091026557130256
5372926423123570
2187926392116157
9746726192114295
191792597994335
7854425836110689
8618225721102530
7046325345100019
1972924953115631
9295624808110828
754902477698531
578232462493310
51502459386615
506524504110972
6487824310104386
8569924133115965
3508324130119805
11222409190348
4156023898101378
539892387456790
3344823852115978
2977923735120142
5445023715105685
3962923685101014
687423436110472
528332340790316
1448423357104001
16922333899255
233312313889524
823842312089432
691962308358244
1500822890107912
5836722855105143
262072283285131
781562280799823
3660422656103464
3205422652107154
2007522646109951
981222574101409
364652252798342
\n", 1804 | "
" 1805 | ], 1806 | "text/plain": [ 1807 | " token_count text_length\n", 1808 | "57385 104023 485818\n", 1809 | "8741 101566 485117\n", 1810 | "26462 83662 336015\n", 1811 | "48738 81132 302843\n", 1812 | "22545 69087 306273\n", 1813 | "59842 68874 328275\n", 1814 | "66900 64344 328697\n", 1815 | "50493 59206 272464\n", 1816 | "53193 56446 280809\n", 1817 | "25146 46862 203838\n", 1818 | "98915 46664 238568\n", 1819 | "63484 46596 217907\n", 1820 | "26363 46018 206822\n", 1821 | "20481 44657 165152\n", 1822 | "19752 43150 207965\n", 1823 | "21011 42434 190152\n", 1824 | "19932 41801 157942\n", 1825 | "75998 41636 197039\n", 1826 | "54072 41160 205458\n", 1827 | "75446 40872 173387\n", 1828 | "21587 40787 171901\n", 1829 | "61198 39787 165134\n", 1830 | "41897 38832 184465\n", 1831 | "37280 38682 179243\n", 1832 | "60438 38664 168148\n", 1833 | "52712 37943 180434\n", 1834 | "49513 36064 144425\n", 1835 | "10770 35178 161248\n", 1836 | "30875 34214 156451\n", 1837 | "21416 33794 108263\n", 1838 | "76883 33370 151322\n", 1839 | "61913 32132 145142\n", 1840 | "82674 32022 110127\n", 1841 | "25879 31153 139767\n", 1842 | "50241 30879 141398\n", 1843 | "91285 30841 126778\n", 1844 | "85932 30799 123451\n", 1845 | "67564 30505 110449\n", 1846 | "23804 30502 133703\n", 1847 | "64453 30140 122132\n", 1848 | "20491 29787 147622\n", 1849 | "98810 29474 104333\n", 1850 | "23779 29404 109206\n", 1851 | "12476 28799 118054\n", 1852 | "84791 28792 126564\n", 1853 | "16536 28782 134643\n", 1854 | "6770 28745 139452\n", 1855 | "64492 28731 151428\n", 1856 | "55693 28615 126376\n", 1857 | "96635 28603 128460\n", 1858 | "87035 28458 126872\n", 1859 | "97372 28073 128217\n", 1860 | "16966 27827 121963\n", 1861 | "54282 27622 110453\n", 1862 | "64422 27399 126250\n", 1863 | "43095 26805 127607\n", 1864 | "11223 26774 107899\n", 1865 | "938 26697 124276\n", 1866 | "72130 26616 95360\n", 1867 | "35815 26602 100388\n", 1868 | "60910 26557 130256\n", 1869 | "53729 26423 123570\n", 1870 | "21879 26392 116157\n", 1871 | "97467 26192 114295\n", 1872 | "19179 25979 94335\n", 1873 | "78544 25836 110689\n", 1874 | "86182 25721 102530\n", 1875 | "70463 25345 100019\n", 1876 | "19729 24953 115631\n", 1877 | "92956 24808 110828\n", 1878 | "75490 24776 98531\n", 1879 | "57823 24624 93310\n", 1880 | "5150 24593 86615\n", 1881 | "5065 24504 110972\n", 1882 | "64878 24310 104386\n", 1883 | "85699 24133 115965\n", 1884 | "35083 24130 119805\n", 1885 | "1122 24091 90348\n", 1886 | "41560 23898 101378\n", 1887 | "53989 23874 56790\n", 1888 | "33448 23852 115978\n", 1889 | "29779 23735 120142\n", 1890 | "54450 23715 105685\n", 1891 | "39629 23685 101014\n", 1892 | "6874 23436 110472\n", 1893 | "52833 23407 90316\n", 1894 | "14484 23357 104001\n", 1895 | "1692 23338 99255\n", 1896 | "23331 23138 89524\n", 1897 | "82384 23120 89432\n", 1898 | "69196 23083 58244\n", 1899 | "15008 22890 107912\n", 1900 | "58367 22855 105143\n", 1901 | "26207 22832 85131\n", 1902 | "78156 22807 99823\n", 1903 | "36604 22656 103464\n", 1904 | "32054 22652 107154\n", 1905 | "20075 22646 109951\n", 1906 | "9812 22574 101409\n", 1907 | "36465 22527 98342" 1908 | ] 1909 | }, 1910 | "metadata": {}, 1911 | "output_type": "display_data" 1912 | } 1913 | ], 1914 | "source": [ 1915 | "with pd.option_context('display.max_rows', None, 'display.max_columns', None):\n", 1916 | " display(sorted_df[['token_count', 'text_length']].head(100))\n" 1917 | ] 1918 | }, 1919 | { 1920 | "cell_type": "code", 1921 | "execution_count": null, 1922 | "metadata": {}, 1923 | 
"outputs": [], 1924 | "source": [] 1925 | } 1926 | ], 1927 | "metadata": { 1928 | "kernelspec": { 1929 | "display_name": "testing", 1930 | "language": "python", 1931 | "name": "python3" 1932 | }, 1933 | "language_info": { 1934 | "codemirror_mode": { 1935 | "name": "ipython", 1936 | "version": 3 1937 | }, 1938 | "file_extension": ".py", 1939 | "mimetype": "text/x-python", 1940 | "name": "python", 1941 | "nbconvert_exporter": "python", 1942 | "pygments_lexer": "ipython3", 1943 | "version": "3.11.6" 1944 | } 1945 | }, 1946 | "nbformat": 4, 1947 | "nbformat_minor": 2 1948 | } 1949 | -------------------------------------------------------------------------------- /notebooks/tokenizers.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stderr", 10 | "output_type": "stream", 11 | "text": [ 12 | "/Users/enjalot/code/fineweb-modal/venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", 13 | " from .autonotebook import tqdm as notebook_tqdm\n" 14 | ] 15 | } 16 | ], 17 | "source": [ 18 | "from transformers import AutoTokenizer\n", 19 | "import numpy as np\n", 20 | "from collections import Counter" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 2, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "\n", 30 | "def compare_tokenizers(text_samples):\n", 31 | " \"\"\"\n", 32 | " Compare tokenization results between BGE and Nomic tokenizers\n", 33 | " \n", 34 | " Args:\n", 35 | " text_samples: List of text strings to compare tokenization\n", 36 | " \n", 37 | " Returns:\n", 38 | " dict: Comparison statistics and analysis results\n", 39 | " \"\"\"\n", 40 | " # Load both tokenizers\n", 41 | " bge_tokenizer = AutoTokenizer.from_pretrained(\"BAAI/bge-base-en-v1.5\")\n", 42 | " nomic_tokenizer = AutoTokenizer.from_pretrained(\"nomic-ai/nomic-embed-text-v1.5\")\n", 43 | " \n", 44 | " results = {\n", 45 | " \"vocabulary_sizes\": {\n", 46 | " \"bge\": len(bge_tokenizer.vocab),\n", 47 | " \"nomic\": len(nomic_tokenizer.vocab),\n", 48 | " },\n", 49 | " \"samples\": []\n", 50 | " }\n", 51 | " \n", 52 | " # Compare tokenization for each sample\n", 53 | " for text in text_samples:\n", 54 | " bge_tokens = bge_tokenizer.tokenize(text)\n", 55 | " nomic_tokens = nomic_tokenizer.tokenize(text)\n", 56 | " \n", 57 | " # Get token counts\n", 58 | " bge_counts = Counter(bge_tokens)\n", 59 | " nomic_counts = Counter(nomic_tokens)\n", 60 | " \n", 61 | " # Compare token sequences\n", 62 | " sample_result = {\n", 63 | " \"text\": text,\n", 64 | " \"bge_tokens\": bge_tokens,\n", 65 | " \"nomic_tokens\": nomic_tokens,\n", 66 | " \"token_counts\": {\n", 67 | " \"bge\": len(bge_tokens),\n", 68 | " \"nomic\": len(nomic_tokens)\n", 69 | " },\n", 70 | " \"unique_tokens\": {\n", 71 | " \"bge\": len(bge_counts),\n", 72 | " \"nomic\": len(nomic_counts)\n", 73 | " },\n", 74 | " \"identical_tokenization\": bge_tokens == nomic_tokens\n", 75 | " }\n", 76 | " \n", 77 | " results[\"samples\"].append(sample_result)\n", 78 | " \n", 79 | " # Calculate overall statistics\n", 80 | " identical_count = sum(1 for r in results[\"samples\"] if r[\"identical_tokenization\"])\n", 81 | " results[\"overall_stats\"] = {\n", 82 | " \"total_samples\": len(text_samples),\n", 83 | " \"identical_tokenizations\": 
identical_count,\n", 84 | " \"identical_percentage\": (identical_count / len(text_samples)) * 100 if text_samples else 0\n", 85 | " }\n", 86 | " \n", 87 | " return results" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 3, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "\n", 97 | "def print_comparison_report(results):\n", 98 | " \"\"\"Print a formatted report of the tokenizer comparison results\"\"\"\n", 99 | " print(\"Tokenizer Comparison Report\")\n", 100 | " print(\"==========================\")\n", 101 | " print(f\"\\nVocabulary Sizes:\")\n", 102 | " print(f\"BGE: {results['vocabulary_sizes']['bge']:,} tokens\")\n", 103 | " print(f\"Nomic: {results['vocabulary_sizes']['nomic']:,} tokens\")\n", 104 | " \n", 105 | " print(f\"\\nOverall Statistics:\")\n", 106 | " print(f\"Total samples analyzed: {results['overall_stats']['total_samples']}\")\n", 107 | " print(f\"Identical tokenizations: {results['overall_stats']['identical_tokenizations']}\")\n", 108 | " print(f\"Percentage identical: {results['overall_stats']['identical_percentage']:.1f}%\")\n", 109 | " \n", 110 | " print(\"\\nDetailed Sample Analysis:\")\n", 111 | " for i, sample in enumerate(results['samples'], 1):\n", 112 | " print(f\"\\nSample {i}:\")\n", 113 | " print(f\"Text: {sample['text']}\")\n", 114 | " print(f\"BGE tokens ({sample['token_counts']['bge']}): {sample['bge_tokens']}\")\n", 115 | " print(f\"Nomic tokens ({sample['token_counts']['nomic']}): {sample['nomic_tokens']}\")\n", 116 | " print(f\"Identical: {sample['identical_tokenization']}\")" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 4, 122 | "metadata": {}, 123 | "outputs": [ 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 128 | "Tokenizer Comparison Report\n", 129 | "==========================\n", 130 | "\n", 131 | "Vocabulary Sizes:\n", 132 | "BGE: 30,522 tokens\n", 133 | "Nomic: 30,522 tokens\n", 134 | "\n", 135 | "Overall Statistics:\n", 136 | "Total samples analyzed: 3\n", 137 | "Identical tokenizations: 3\n", 138 | "Percentage identical: 100.0%\n", 139 | "\n", 140 | "Detailed Sample Analysis:\n", 141 | "\n", 142 | "Sample 1:\n", 143 | "Text: This is a test sentence.\n", 144 | "BGE tokens (6): ['this', 'is', 'a', 'test', 'sentence', '.']\n", 145 | "Nomic tokens (6): ['this', 'is', 'a', 'test', 'sentence', '.']\n", 146 | "Identical: True\n", 147 | "\n", 148 | "Sample 2:\n", 149 | "Text: Machine learning models use different tokenization approaches.\n", 150 | "BGE tokens (9): ['machine', 'learning', 'models', 'use', 'different', 'token', '##ization', 'approaches', '.']\n", 151 | "Nomic tokens (9): ['machine', 'learning', 'models', 'use', 'different', 'token', '##ization', 'approaches', '.']\n", 152 | "Identical: True\n", 153 | "\n", 154 | "Sample 3:\n", 155 | "Text: Some текст with mixed 字符 and специальные characters!\n", 156 | "BGE tokens (24): ['some', 'т', '##е', '##к', '##с', '##т', 'with', 'mixed', '[UNK]', '[UNK]', 'and', 'с', '##п', '##е', '##ц', '##и', '##а', '##л', '##ь', '##н', '##ы', '##е', 'characters', '!']\n", 157 | "Nomic tokens (24): ['some', 'т', '##е', '##к', '##с', '##т', 'with', 'mixed', '[UNK]', '[UNK]', 'and', 'с', '##п', '##е', '##ц', '##и', '##а', '##л', '##ь', '##н', '##ы', '##е', 'characters', '!']\n", 158 | "Identical: True\n" 159 | ] 160 | } 161 | ], 162 | "source": [ 163 | "\n", 164 | "# Example usage\n", 165 | "sample_texts = [\n", 166 | " \"This is a test sentence.\",\n", 167 | " \"Machine learning models use different tokenization 
approaches.\",\n", 168 | " \"Some текст with mixed 字符 and специальные characters!\",\n", 169 | "]\n", 170 | "\n", 171 | "results = compare_tokenizers(sample_texts)\n", 172 | "print_comparison_report(results)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 5, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "bge_tokenizer = AutoTokenizer.from_pretrained(\"BAAI/bge-base-en-v1.5\")\n", 182 | "nomic_tokenizer = AutoTokenizer.from_pretrained(\"nomic-ai/nomic-embed-text-v1.5\")\n" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 6, 188 | "metadata": {}, 189 | "outputs": [ 190 | { 191 | "data": { 192 | "text/plain": [ 193 | "BertTokenizerFast(name_or_path='BAAI/bge-base-en-v1.5', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={\n", 194 | "\t0: AddedToken(\"[PAD]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 195 | "\t100: AddedToken(\"[UNK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 196 | "\t101: AddedToken(\"[CLS]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 197 | "\t102: AddedToken(\"[SEP]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 198 | "\t103: AddedToken(\"[MASK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 199 | "}" 200 | ] 201 | }, 202 | "execution_count": 6, 203 | "metadata": {}, 204 | "output_type": "execute_result" 205 | } 206 | ], 207 | "source": [ 208 | "bge_tokenizer" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 7, 214 | "metadata": {}, 215 | "outputs": [ 216 | { 217 | "data": { 218 | "text/plain": [ 219 | "BertTokenizerFast(name_or_path='nomic-ai/nomic-embed-text-v1.5', vocab_size=30522, model_max_length=8192, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={\n", 220 | "\t0: AddedToken(\"[PAD]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 221 | "\t100: AddedToken(\"[UNK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 222 | "\t101: AddedToken(\"[CLS]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 223 | "\t102: AddedToken(\"[SEP]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 224 | "\t103: AddedToken(\"[MASK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n", 225 | "}" 226 | ] 227 | }, 228 | "execution_count": 7, 229 | "metadata": {}, 230 | "output_type": "execute_result" 231 | } 232 | ], 233 | "source": [ 234 | "nomic_tokenizer" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": null, 240 | "metadata": {}, 241 | "outputs": [], 242 | "source": [] 243 | } 244 | ], 245 | "metadata": { 246 | "language_info": { 247 | "name": "python" 248 | } 249 | }, 250 | "nbformat": 4, 251 | "nbformat_minor": 2 252 | } 253 | -------------------------------------------------------------------------------- /remove.py: 
-------------------------------------------------------------------------------- 1 | from modal import App, Image, Volume, Secret 2 | 3 | DATASET_DIR="/embeddings" 4 | VOLUME = "embeddings" 5 | SAE = "64_32" 6 | 7 | # SAMPLE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3/train" 8 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10" 9 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10/combined" 10 | 11 | SAMPLE_DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500/train" 12 | DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top5" 13 | SAVE_DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top5/combined" 14 | 15 | 16 | 17 | 18 | 19 | # We define our Modal Resources that we'll need 20 | volume = Volume.from_name(VOLUME, create_if_missing=True) 21 | image = Image.debian_slim(python_version="3.9").pip_install( 22 | "pandas", "datasets==2.16.1", "apache_beam==2.53.0" 23 | ) 24 | app = App(image=image) 25 | 26 | @app.function( 27 | volumes={DATASET_DIR: volume}, 28 | timeout=60000, 29 | ) 30 | def remove_files_by_pattern(directory, pattern): 31 | """ 32 | Remove all files in the specified directory that match the given pattern. 33 | 34 | Args: 35 | directory: Directory to search for files 36 | pattern: File pattern to match (e.g., "temp*" for files starting with "temp") 37 | """ 38 | import os 39 | import glob 40 | 41 | # Get the full path pattern 42 | full_pattern = os.path.join(directory, pattern) 43 | 44 | # Find all files matching the pattern 45 | matching_files = glob.glob(full_pattern) 46 | 47 | # Count files to be removed 48 | file_count = len(matching_files) 49 | print(f"Found {file_count} files matching pattern '{pattern}' in {directory}") 50 | 51 | # Remove each file 52 | for file_path in matching_files: 53 | try: 54 | os.remove(file_path) 55 | print(f"Removed: {file_path}") 56 | except Exception as e: 57 | print(f"Error removing {file_path}: {e}") 58 | 59 | # Commit changes to the volume 60 | volume.commit() 61 | 62 | return f"Removed {file_count} files matching pattern '{pattern}'" 63 | 64 | @app.local_entrypoint() 65 | def main(): 66 | 67 | directory = "/embeddings/wikipedia-en-chunked-500-nomic-embed-text-v1.5-64_32-top10" 68 | pattern = "temp*" 69 | print(f"Removing files matching '{pattern}' from '{directory}'") 70 | result = remove_files_by_pattern.remote(directory, pattern) 71 | print(result) 72 | 73 | 74 | -------------------------------------------------------------------------------- /summary.py: -------------------------------------------------------------------------------- 1 | from modal import App, Image, Volume 2 | 3 | 4 | # We first set out configuration variables for our script. 
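# Each entry in `files` below is processed by process_dataset in its own Modal container
# via .map; the function returns a per-file dict (num_rows, total tokens, and counts of
# chunks under 2, 10, and 50 tokens) that the local entrypoint sums into overall totals.
# Usage (same pattern as the other scripts here): modal run summary.py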
5 | DATASET_DIR = "/data" 6 | # VOLUME = "embedding-fineweb-edu" 7 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-120" 8 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-500" 9 | # files = [f"data-{i:05d}-of-00099.parquet" for i in range(99)] 10 | 11 | 12 | 13 | VOLUME = "datasets" 14 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-120" 15 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-500" 16 | # files = [f"data-{i:05d}-of-00150.parquet" for i in range(150)] 17 | # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-500" 18 | # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-120" 19 | # files = [f"data-{i:05d}-of-01987.parquet" for i in range(200)] 20 | DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-120" 21 | # DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-500" 22 | files = [f"data-{i:05d}-of-00041.parquet" for i in range(41)] 23 | 24 | 25 | 26 | 27 | # MODEL_ID = "nomic-ai/nomic-embed-text-v1.5" 28 | 29 | # We define our Modal Resources that we'll need 30 | volume = Volume.from_name(VOLUME, create_if_missing=True) 31 | image = Image.debian_slim(python_version="3.9").pip_install( 32 | "datasets==2.16.1", "apache_beam==2.53.0", "transformers", "pandas", "tqdm" 33 | ) 34 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 35 | 36 | 37 | 38 | @app.function(volumes={DATASET_DIR: volume}, timeout=3000) 39 | def process_dataset(file): 40 | import time 41 | from concurrent.futures import ThreadPoolExecutor, as_completed 42 | from tqdm import tqdm 43 | import pandas as pd 44 | 45 | # Load the dataset as a Hugging Face dataset 46 | # print(f"Loading dataset from {DATASET_DIR}/{DATASET_SAVE}/train/{file}") 47 | df = pd.read_parquet(f"{DATASET_DIR}/{DATASET_SAVE_CHUNKED}/train/{file}") 48 | print("dataset", len(df)) 49 | 50 | return { 51 | "file": file, 52 | "num_rows": len(df), 53 | "tokens": df["chunk_token_count"].sum(), 54 | "less2": df[df["chunk_token_count"] < 2].shape[0], 55 | "less10": df[df["chunk_token_count"] < 10].shape[0], 56 | "less50": df[df["chunk_token_count"] < 50].shape[0], 57 | } 58 | 59 | @app.local_entrypoint() 60 | def main(): 61 | from tqdm import tqdm 62 | responses = [] 63 | for resp in process_dataset.map(files, order_outputs=False, return_exceptions=True): 64 | if isinstance(resp, Exception): 65 | print(f"Exception: {resp}") 66 | continue 67 | print(resp) 68 | responses.append(resp) 69 | 70 | total_rows = 0 71 | total_tokens = 0 72 | total_less2 = 0 73 | total_less10 = 0 74 | total_less50 = 0 75 | for resp in tqdm(responses): 76 | total_rows += resp['num_rows'] 77 | total_tokens += resp['tokens'] 78 | total_less2 += resp['less2'] 79 | total_less10 += resp['less10'] 80 | total_less50 += resp['less50'] 81 | print(f"Total rows processed: {total_rows}") 82 | print(f"Total tokens processed: {total_tokens}") 83 | print(f"Total less2: {total_less2}") 84 | print(f"Total less10: {total_less10}") 85 | print(f"Total less50: {total_less50}") 86 | 87 | 88 | -------------------------------------------------------------------------------- /todataset.py: -------------------------------------------------------------------------------- 1 | """ 2 | Turn a directory of parquet files into a HuggingFace dataset in the modal volume 3 | """ 4 | # TODO: look into keeping the parquet files as is to make the dataset 5 | 6 | 7 | from modal import App, Image, Volume, Secret 8 | 9 | DATASET_DIR="/embeddings" 10 | VOLUME = "embeddings" 11 | DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500" 12 | 
SAVE_DIRECTORY = f"{DIRECTORY}-HF2" 13 | 14 | # We define our Modal Resources that we'll need 15 | volume = Volume.from_name(VOLUME, create_if_missing=True) 16 | image = Image.debian_slim(python_version="3.9").pip_install( 17 | "datasets==2.16.1", "apache_beam==2.53.0" 18 | ) 19 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 20 | 21 | 22 | # The default timeout is 5 minutes re: https://modal.com/docs/guide/timeouts#handling-timeouts 23 | # but we override this to 24 | # 6000s to avoid any potential timeout issues 25 | @app.function( 26 | volumes={DATASET_DIR: volume}, 27 | timeout=6000, 28 | # ephemeral_disk=2145728, # in MiB 29 | secrets=[Secret.from_name("huggingface-secret")], 30 | ) 31 | def convert_dataset(): 32 | # Redownload the dataset 33 | import time 34 | from datasets import load_dataset 35 | print("loading") 36 | dataset = load_dataset("parquet", data_files=f"{DIRECTORY}/train/*.parquet") 37 | print("saving") 38 | dataset.save_to_disk(SAVE_DIRECTORY, num_shards={"train":99}) 39 | print("done!") 40 | volume.commit() 41 | 42 | 43 | @app.local_entrypoint() 44 | def main(): 45 | convert_dataset.remote() 46 | 47 | -------------------------------------------------------------------------------- /top10map.py: -------------------------------------------------------------------------------- 1 | """ 2 | For each of the parquet files with activations, find the top 10 and write to an intermediate file 3 | modal run top10map.py 4 | """ 5 | from modal import App, Image, Volume 6 | import os 7 | import time 8 | import numpy as np 9 | import pandas as pd 10 | from tqdm import tqdm 11 | import concurrent.futures 12 | from functools import partial 13 | 14 | NUM_CPU=4 15 | 16 | N=5 # the number of samples to keep per feature 17 | 18 | DATASET_DIR="/embeddings" 19 | VOLUME = "embeddings" 20 | 21 | D_IN = 768 # the dimensions from the embedding models 22 | K=64 23 | # EXPANSION = 128 24 | EXPANSION = 32 25 | SAE = f"{K}_{EXPANSION}" 26 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3" 27 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10" 28 | DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}" 29 | SAVE_DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top{N}" 30 | 31 | 32 | files = [f"data-{i:05d}-of-00041.parquet" for i in range(41)] 33 | 34 | # We define our Modal Resources that we'll need 35 | volume = Volume.from_name(VOLUME, create_if_missing=True) 36 | image = Image.debian_slim(python_version="3.9").pip_install( 37 | "datasets==2.16.1", "apache_beam==2.53.0", "transformers", "pandas", "tqdm" 38 | ) 39 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 40 | 41 | # def get_top_n_rows_by_top_act(file, top_indices, top_acts, feature): 42 | # # feature_positions = np.where(np.any(top_indices == feature, axis=1), 43 | # # np.argmax(top_indices == feature, axis=1), 44 | # # -1) 45 | # # act_values = np.where(feature_positions != -1, 46 | # # top_acts[np.arange(len(top_acts)), feature_positions], 47 | # # 0) 48 | # # top_n_indices = np.argsort(act_values)[-N:][::-1] 49 | 50 | # # Find positions where feature appears (returns a boolean mask) 51 | # feature_mask = top_indices == feature 52 | 53 | # # Get the activation values where the feature appears (all others will be 0) 54 | # act_values = np.where(feature_mask.any(axis=1), 55 | # top_acts[feature_mask].reshape(-1), 56 | # 0) 57 | 58 | # # Use partition to get 
top N indices efficiently 59 | # top_n_indices = np.argpartition(act_values, -N)[-N:] 60 | # # Sort just the top N indices 61 | # top_n_indices = top_n_indices[np.argsort(act_values[top_n_indices])[::-1]] 62 | 63 | # filtered_df = pd.DataFrame({ 64 | # "shard": file, 65 | # "index": top_n_indices, 66 | # "feature": feature, 67 | # "activation": act_values[top_n_indices] 68 | # }) 69 | # return filtered_df 70 | 71 | def get_top_n_rows_by_top_act(file, top_indices, top_acts, feature): 72 | # Use memory-efficient approach to find rows with this feature 73 | rows_with_feature = np.any(top_indices == feature, axis=1) 74 | 75 | # Only process rows that have this feature 76 | filtered_indices = top_indices[rows_with_feature] 77 | filtered_acts = top_acts[rows_with_feature] 78 | 79 | # Get positions of the feature in each row 80 | positions = np.argwhere(filtered_indices == feature) 81 | 82 | # Create array of activation values (sparse approach) 83 | row_indices = positions[:, 0] 84 | col_indices = positions[:, 1] 85 | act_values = filtered_acts[row_indices, col_indices] 86 | 87 | # Map back to original indices 88 | original_indices = np.where(rows_with_feature)[0][row_indices] 89 | 90 | # Get top N 91 | if len(act_values) > N: 92 | top_n_pos = np.argpartition(act_values, -N)[-N:] 93 | top_n_pos = top_n_pos[np.argsort(act_values[top_n_pos])[::-1]] 94 | else: 95 | # If we have fewer than N matches, take all of them 96 | top_n_pos = np.argsort(act_values)[::-1] 97 | 98 | filtered_df = pd.DataFrame({ 99 | "shard": file, 100 | "index": original_indices[top_n_pos], 101 | "feature": feature, 102 | "activation": act_values[top_n_pos] 103 | }) 104 | return filtered_df 105 | 106 | 107 | def process_feature_chunk(file, feature_ids, chunk_index): 108 | start = time.perf_counter() 109 | print(f"Loading dataset from {DIRECTORY}/train/{file}", chunk_index) 110 | 111 | # Only read the columns we need 112 | df = pd.read_parquet(f"{DIRECTORY}/train/{file}", columns=['top_indices', 'top_acts']) 113 | print(f"Dataset loaded in {time.perf_counter()-start:.2f} seconds for {file}", chunk_index) 114 | 115 | top_indices = np.array(df['top_indices'].tolist()) 116 | top_acts = np.array(df['top_acts'].tolist()) 117 | 118 | # Free up memory by deleting the DataFrame after conversion to numpy 119 | del df 120 | 121 | print(f"top_indices shape: {top_indices.shape}") 122 | print(f"top_acts shape: {top_acts.shape}") 123 | print("got numpy arrays", chunk_index) 124 | 125 | results = [] 126 | 127 | # Process each feature in this worker's batch 128 | for feature in tqdm(feature_ids, desc=f"Processing features (worker {chunk_index})", position=chunk_index): 129 | # Get the true top N rows for this feature across the entire chunk 130 | top = get_top_n_rows_by_top_act(file, top_indices, top_acts, feature) 131 | results.append(top) 132 | 133 | # Combine results for all features in this worker 134 | combined_df = pd.concat(results, ignore_index=True) 135 | 136 | # Write to a temporary file to save memory 137 | temp_file = f"{SAVE_DIRECTORY}/temp_{file}_{chunk_index}.parquet" 138 | combined_df.to_parquet(temp_file) 139 | 140 | # Free memory 141 | del top_indices, top_acts, results, combined_df 142 | 143 | return temp_file 144 | 145 | @app.function(cpu=NUM_CPU, volumes={DATASET_DIR: volume}, timeout=6000) 146 | def process_dataset(file): 147 | from concurrent.futures import ProcessPoolExecutor, as_completed 148 | 149 | # Ensure directory exists 150 | if not os.path.exists(f"{SAVE_DIRECTORY}"): 151 | os.makedirs(f"{SAVE_DIRECTORY}") 152 | 
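    # With D_IN = 768 and EXPANSION = 32 there are 768 * 32 = 24576 SAE features.
    # They are split across NUM_CPU = 4 workers (6144 features each); every worker
    # re-reads the full shard but only scans its own feature ids, writes a temporary
    # parquet, and the temp files are combined into a single output file below.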
153 | num_features = D_IN * EXPANSION 154 | 155 | # Split the features among workers - each worker handles a subset of features 156 | # but processes the ENTIRE dataset for those features 157 | features_per_worker = num_features // NUM_CPU 158 | feature_batches = [list(range(i, min(i + features_per_worker, num_features))) 159 | for i in range(0, num_features, features_per_worker)] 160 | 161 | with ProcessPoolExecutor(max_workers=NUM_CPU) as executor: 162 | futures = [executor.submit(process_feature_chunk, file, feature_batch, i) 163 | for i, feature_batch in enumerate(feature_batches)] 164 | 165 | temp_files = [] 166 | for future in as_completed(futures): 167 | temp_file = future.result() 168 | temp_files.append(temp_file) 169 | 170 | # Combine temporary files 171 | print("Combining temporary files") 172 | dfs = [] 173 | for temp_file in temp_files: 174 | dfs.append(pd.read_parquet(temp_file)) 175 | # Remove temp file after reading 176 | os.remove(temp_file) 177 | 178 | combined_df = pd.concat(dfs, ignore_index=True) 179 | combined_df.to_parquet(f"{SAVE_DIRECTORY}/{file}") 180 | volume.commit() 181 | 182 | return f"All done with {file}", len(combined_df) 183 | 184 | 185 | @app.local_entrypoint() 186 | def main(): 187 | for resp in process_dataset.map(files, order_outputs=False, return_exceptions=True): 188 | if isinstance(resp, Exception): 189 | print(f"Exception: {resp}") 190 | continue 191 | print(resp) 192 | 193 | 194 | -------------------------------------------------------------------------------- /top10reduce.py: -------------------------------------------------------------------------------- 1 | from modal import App, Image, Volume, Secret 2 | 3 | EMBEDDINGS_DIR="/embeddings" 4 | EMBEDDINGS_VOLUME = "embeddings" 5 | DATASETS_DIR="/datasets" 6 | DATASETS_VOLUME = "datasets" 7 | 8 | SAE = "64_32" 9 | 10 | # SAMPLE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3/train" 11 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10" 12 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10/combined" 13 | 14 | SAMPLE_DIRECTORY = f"{DATASETS_DIR}/wikipedia-en-chunked-500/train" 15 | SAE_DIRECTORY = f"{EMBEDDINGS_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}/train" 16 | DIRECTORY = f"{EMBEDDINGS_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top5" 17 | SAVE_DIRECTORY = f"{EMBEDDINGS_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top5/combined" 18 | 19 | 20 | 21 | 22 | 23 | # We define our Modal Resources that we'll need 24 | embeddings_volume = Volume.from_name(EMBEDDINGS_VOLUME, create_if_missing=True) 25 | datasets_volume = Volume.from_name(DATASETS_VOLUME, create_if_missing=True) 26 | image = Image.debian_slim(python_version="3.9").pip_install( 27 | "pandas", "datasets==2.16.1", "apache_beam==2.53.0" 28 | ) 29 | app = App(image=image) 30 | 31 | @app.function( 32 | volumes={DATASETS_DIR: datasets_volume, EMBEDDINGS_DIR: embeddings_volume}, 33 | timeout=60000, 34 | # ephemeral_disk=2145728, # in MiB 35 | ) 36 | def populate_indices(samples): 37 | import pandas as pd 38 | 39 | shard = samples.iloc[0]['shard'] 40 | indices = samples['index'].tolist() 41 | 42 | print("reading shard", shard, len(indices)) 43 | sample_df = pd.read_parquet(f"{SAMPLE_DIRECTORY}/{shard}") 44 | sample_df = sample_df.iloc[indices].copy() 45 | sample_df['feature'] = samples['feature'].tolist() 46 | sample_df['activation'] = samples['activation'].tolist() 47 | sample_df['top_indices'] = 
samples['top_indices'].tolist() 48 | sample_df['top_acts'] = samples['top_acts'].tolist() 49 | print("returning samples for", shard) 50 | 51 | return sample_df 52 | 53 | @app.function( 54 | volumes={ 55 | DATASETS_DIR: datasets_volume, 56 | EMBEDDINGS_DIR: embeddings_volume 57 | }, 58 | timeout=60000, 59 | # ephemeral_disk=2145728, # in MiB 60 | ) 61 | def reduce_top10_indices(directory, save_directory, sae_directory, N): 62 | import os 63 | if not os.path.exists(save_directory): 64 | os.makedirs(save_directory) 65 | 66 | files = [f for f in os.listdir(directory) if f.endswith('.parquet')] 67 | print("len files", len(files)) 68 | 69 | import pandas as pd 70 | 71 | combined_indices_path = f"{save_directory}/combined_indices.parquet" 72 | if not os.path.exists(combined_indices_path): 73 | print("creating combined_indices") 74 | all_dataframes = [] 75 | for file in files: 76 | print(f"Reading {file}") 77 | # Read from top directory 78 | df = pd.read_parquet(f"{directory}/{file}") 79 | 80 | # Read corresponding file from SAE directory to get top_indices and top_acts 81 | if os.path.exists(f"{sae_directory}/{file}"): 82 | sae_df = pd.read_parquet(f"{sae_directory}/{file}") 83 | # Ensure we have the right columns 84 | if 'top_indices' in sae_df.columns and 'top_acts' in sae_df.columns: 85 | # Match records based on feature (assuming they're in the same order) 86 | df['top_indices'] = sae_df['top_indices'] 87 | df['top_acts'] = sae_df['top_acts'] 88 | print(f"Added top_indices and top_acts columns from {file}") 89 | else: 90 | print(f"Warning: top_indices or top_acts not found in {file} from SAE directory") 91 | else: 92 | print(f"Warning: file {file} not found in SAE directory") 93 | 94 | all_dataframes.append(df) 95 | 96 | # Concatenate all DataFrames into a single DataFrame 97 | combined_df = pd.concat(all_dataframes, ignore_index=True) 98 | print("combined") 99 | combined_df.to_parquet(combined_indices_path) 100 | else: 101 | print(f"{combined_indices_path} already exists. 
Loading it.") 102 | combined_df = pd.read_parquet(combined_indices_path) 103 | 104 | combined_df = combined_df.sort_values(by=['feature', 'activation'], ascending=[True, False]) 105 | combined_df = combined_df.groupby('feature').head(N).reset_index(drop=True) 106 | print(f"writing top{N}") 107 | combined_df.to_parquet(f"{save_directory}/combined_indices_top{N}.parquet") 108 | embeddings_volume.commit() 109 | 110 | shard_counts = combined_df.groupby('shard').size().reset_index(name='count') 111 | print("shard_counts", shard_counts.head()) 112 | 113 | print("Number of shards:", len(shard_counts)) 114 | rows_by_shard = [combined_df[combined_df['shard'] == shard] for shard in combined_df['shard'].unique()] 115 | 116 | results = [] 117 | for resp in populate_indices.map(rows_by_shard, order_outputs=False, return_exceptions=True): 118 | if isinstance(resp, Exception): 119 | print(f"Exception: {resp}") 120 | continue 121 | results.append(resp) 122 | 123 | print("concatenating final results") 124 | final_df = pd.concat(results, ignore_index=True) 125 | final_df = final_df.drop(columns=['index', '__index_level_0__'], errors='ignore') 126 | print("sorting final results") 127 | final_df = final_df.sort_values(by=['feature', 'activation'], ascending=[True, False]) 128 | print("writing final results") 129 | final_df.to_parquet(f"{save_directory}/samples_top{N}.parquet") 130 | embeddings_volume.commit() 131 | return "done" 132 | 133 | 134 | # for resp in reduce_top10.map(pairs, order_outputs=False, return_exceptions=True): 135 | # if isinstance(resp, Exception): 136 | # print(f"Exception: {resp}") 137 | # continue 138 | # print(resp) 139 | 140 | 141 | 142 | @app.local_entrypoint() 143 | def main(): 144 | reduce_top10_indices.remote(DIRECTORY, SAVE_DIRECTORY, SAE_DIRECTORY, 10) 145 | 146 | 147 | -------------------------------------------------------------------------------- /torched.py: -------------------------------------------------------------------------------- 1 | """ 2 | Write the embeddings from the dataset to torch files that can be loaded quicker 3 | 4 | modal run torched.py 5 | """ 6 | 7 | from modal import App, Image, Volume, Secret 8 | 9 | DATASET_DIR="/embeddings" 10 | VOLUME = "embeddings" 11 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4" 12 | DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-100BT-chunked-500" 13 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-torched" 14 | SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-100BT-chunked-500-torched" 15 | 16 | # We define our Modal Resources that we'll need 17 | volume = Volume.from_name(VOLUME, create_if_missing=True) 18 | image = Image.debian_slim(python_version="3.9").pip_install( 19 | "datasets==2.16.1", "apache_beam==2.53.0", "tqdm", "torch", "numpy" 20 | ) 21 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 22 | 23 | # NUM_EMBEDDINGS = 25504378 24 | # SHARD_SIZE = 262144 # 2048*128 25 | 26 | @app.function( 27 | volumes={DATASET_DIR: volume}, 28 | timeout=60000, 29 | # ephemeral_disk=2145728, # in MiB 30 | ) 31 | def torch_dataset_shard(file): 32 | # Redownload the dataset 33 | import time 34 | # from datasets import load_from_disk 35 | import pandas as pd 36 | from tqdm import tqdm 37 | import torch 38 | import numpy as np 39 | import os 40 | 41 | print("loading", file) 42 | # dataset = load_from_disk(DIRECTORY) 43 | df = pd.read_parquet(f"{DIRECTORY}/train/{file}") 44 | print("loaded", file) 45 | # train_dataset = dataset["train"] 46 | 47 | # 
start_idx = shard * SHARD_SIZE 48 | # end_idx = min(start_idx + SHARD_SIZE, NUM_EMBEDDINGS) 49 | # print("reading", shard) 50 | embeddings = df["embedding"].to_numpy() 51 | embeddings = np.array([np.array(e).astype(np.float32) for e in embeddings]) 52 | # shard_embeddings = np.array(train_dataset.select(range(start_idx, end_idx))["embedding"]) 53 | # print("permuting", shard) 54 | # shard_embeddings = np.random.permutation(shard_embeddings) # {{ edit_1 }} 55 | shard = file.split(".")[0] 56 | print("saving", shard) 57 | shard_tensor = torch.tensor(embeddings, dtype=torch.float32) 58 | if not os.path.exists(f"{SAVE_DIRECTORY}"): 59 | os.makedirs(f"{SAVE_DIRECTORY}") 60 | torch.save(shard_tensor, f"{SAVE_DIRECTORY}/{shard}.pt") 61 | print("done!", shard) 62 | volume.commit() 63 | return shard 64 | 65 | @app.local_entrypoint() 66 | def main(): 67 | # num_shards = NUM_EMBEDDINGS // SHARD_SIZE + (1 if NUM_EMBEDDINGS % SHARD_SIZE != 0 else 0) 68 | # shards = list(range(num_shards)) 69 | # # torch_dataset.remote() 70 | # for resp in torch_dataset_shard.map(shards, order_outputs=False, return_exceptions=True): 71 | # if isinstance(resp, Exception): 72 | # print(f"Exception: {resp}") 73 | # continue 74 | # print(resp) 75 | 76 | files = [f"data-{i:05d}-of-00989.parquet" for i in range(989)] 77 | files = files[2:] 78 | # files = [f"data-{i:05d}-of-00099.parquet" for i in range(99)] 79 | 80 | # process_dataset.remote(file, max_tokens=MAX_TOKENS, num_cpu=NUM_CPU) 81 | for resp in torch_dataset_shard.map(files, order_outputs=False, return_exceptions=True): 82 | if isinstance(resp, Exception): 83 | print(f"Exception: {resp}") 84 | continue 85 | print(resp) 86 | 87 | 88 | -------------------------------------------------------------------------------- /upload.py: -------------------------------------------------------------------------------- 1 | """ 2 | Upload a dataset from a modal volume to HuggingFace 3 | """ 4 | from modal import App, Image, Volume, Secret 5 | 6 | # We first set out configuration variables for our script. 
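# HF_REPO is the destination dataset repo and DIRECTORY the source dataset on the volume;
# the remote function reads HF_TOKEN from the Modal secret named "huggingface-secret"
# and pushes the dataset in 99 shards, retrying up to 20 times if the push fails.
# Usage (same pattern as the other scripts here): modal run upload.py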
7 | DATASET_DIR = "/embeddings" 8 | VOLUME="embeddings" 9 | HF_REPO="enjalot/fineweb-edu-sample-10BT-chunked-500-nomic-text-v1.5-2" 10 | DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4" 11 | 12 | # We define our Modal Resources that we'll need 13 | volume = Volume.from_name(VOLUME, create_if_missing=True) 14 | image = Image.debian_slim(python_version="3.9").pip_install( 15 | "datasets==2.20.0", "huggingface_hub" 16 | ) 17 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub" 18 | 19 | 20 | # The default timeout is 5 minutes re: https://modal.com/docs/guide/timeouts#handling-timeouts 21 | # but we override this to 22 | # 6000s to avoid any potential timeout issues 23 | @app.function( 24 | volumes={DATASET_DIR: volume}, 25 | timeout=60000, 26 | secrets=[Secret.from_name("huggingface-secret")], 27 | ) 28 | def upload_dataset(directory, repo): 29 | import os 30 | import time 31 | 32 | from huggingface_hub import HfApi 33 | from datasets import load_from_disk 34 | 35 | 36 | api = HfApi(token=os.environ["HF_TOKEN"]) 37 | api.create_repo( 38 | repo_id=repo, 39 | private=False, 40 | repo_type="dataset", 41 | exist_ok=True, 42 | ) 43 | 44 | print("loading from disk") 45 | dataset=load_from_disk(directory) 46 | 47 | print(f"Pushing to hub {HF_REPO}") 48 | start = time.perf_counter() 49 | max_retries = 20 50 | for attempt in range(max_retries): 51 | try: 52 | # api.upload_folder( 53 | # folder_path=directory, 54 | # repo_id=repo, 55 | # repo_type="dataset", 56 | # multi_commits=True, 57 | # multi_commits_verbose=True, 58 | # ) 59 | dataset.push_to_hub(repo, num_shards={"train": 99}) 60 | break # Exit loop if upload is successful 61 | except Exception as e: 62 | if attempt < max_retries - 1: 63 | print(f"Attempt {attempt + 1} failed, retrying...") 64 | time.sleep(5) # Wait for 5 seconds before retrying 65 | else: 66 | print("Failed to upload after several attempts.") 67 | raise # Re-raise the last exception if all retries fail 68 | end = time.perf_counter() 69 | print(f"Uploaded in {end-start}s") 70 | 71 | 72 | @app.local_entrypoint() 73 | def main(): 74 | upload_dataset.remote(DIRECTORY, HF_REPO) 75 | 76 | -------------------------------------------------------------------------------- /volume.py: -------------------------------------------------------------------------------- 1 | """ 2 | Copy a directory from one volume to another. 3 | I don't think you can use * in modal volume commands, so need to copy each file individually. 4 | Probably a better way to do this though. 
5 | 6 | python volume.py cp 7 | python volume.py rm 8 | """ 9 | import os 10 | from tqdm import tqdm 11 | 12 | def automate_volume_copy(): 13 | source_dir = "fineweb-edu-sample-10BT-chunked-500-HF4-torched" 14 | destination_dir = "fineweb-edu-sample-10BT-chunked-500-HF4-torched-shuffled" 15 | 16 | # Use tqdm to create a progress bar for the file copying process 17 | for i in tqdm(range(100), desc="Copying files"): 18 | file_index = f"{i:05d}" 19 | source_file = os.path.join(source_dir, f"shard_{file_index}.pt") 20 | destination_file = os.path.join(destination_dir, f"shard_{file_index}.pt") 21 | 22 | command = f"modal volume cp embeddings {source_file} {destination_file}" 23 | os.system(command) # Execute the command 24 | 25 | def automate_volume_rm(): 26 | source_dir = "fineweb-edu-sample-10BT-chunked-500-HF4-torched" 27 | 28 | # Use tqdm to create a progress bar for the file copying process 29 | for i in tqdm(range(100), desc="Deleting files"): 30 | file_index = f"{i:05d}" 31 | source_file = os.path.join(source_dir, f"shard_{file_index}.pt") 32 | 33 | command = f"modal volume rm embeddings {source_file}" 34 | os.system(command) # Execute the command 35 | 36 | 37 | import sys 38 | import argparse 39 | 40 | def parse_arguments(): 41 | parser = argparse.ArgumentParser(description="Copy or remove files in a volume.") 42 | parser.add_argument("command", choices=["cp", "rm"], help="Specify 'cp' to copy files or 'rm' to remove files.") 43 | return parser.parse_args() 44 | 45 | def main(): 46 | args = parse_arguments() 47 | command = args.command 48 | if command == "cp": 49 | automate_volume_copy() 50 | elif command == "rm": 51 | automate_volume_rm() 52 | else: 53 | print("Invalid command. Use 'cp' to copy or 'rm' to remove.") 54 | sys.exit(1) 55 | 56 | if __name__ == "__main__": 57 | main() 58 | --------------------------------------------------------------------------------