├── .gitignore
├── README.md
├── chunker.py
├── download.py
├── embed-tei.py
├── experimental
│   ├── batchsize.py
│   └── embed.py
├── features.py
├── fetch.py
├── filter.py
├── lancer.py
├── notebooks
│   ├── features.ipynb
│   ├── perfile.ipynb
│   ├── small_sample.ipynb
│   ├── tokenizers.ipynb
│   └── validate.ipynb
├── remove.py
├── summary.py
├── todataset.py
├── top10map.py
├── top10reduce.py
├── torched.py
├── upload.py
└── volume.py
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__
2 | *.parquet
3 | venv
4 | .DS_Store
5 | *.arrow
6 | data
7 | *.parquet
8 | *.npy
9 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # latent-data-modal
2 |
3 | This repository is a set of scripts used to process and embed large datasets using on-demand infrastructure via [Modal](https://modal.com).
4 |
5 | The first resulting dataset published is [FineWeb-edu 10BT Sample embedded with nomic-text-v1.5](https://huggingface.co/datasets/enjalot/fineweb-edu-sample-10BT-chunked-500-nomic-text-v1.5).
6 |
7 | All of these scripts have been developed as part of my learning process to scale up my capacity for embedding large datasets.
8 | As such, they aren't immediately generalizable, but they can be treated as a reference implementation. A lot of it is adapted from the [Embedding Wikipedia](https://modal.com/blog/embedding-wikipedia) tutorial.
9 |
10 | I am hoping to improve this process and use it to scale up to the 100BT sample next. If I can get a compute sponsor, I'll then take it to the entire 1.4 trillion token dataset.
11 |
12 |
13 | ## Process
14 |
15 | ### [download.py](download.py)
16 | To start with, we need to download the HF dataset to a volume in Modal. This is relatively straightforward and easy to adapt to a different dataset.
17 |
18 | ### [chunker.py](chunker.py)
19 | I wanted to pre-chunk my dataset since tokenizing is relatively CPU intensive, and my initial experiments with the tutorial code were bottlenecked by the chunking process. I also wanted to use actual token counts and analyze the impact of chunking on the dataset.
20 |
21 | I found that the 9.6 million documents in the 10BT sample turned into ~25 million chunks with 10.5 billion tokens due to the 10% overlap I chose. There is currently an issue in the chunking code, which I will fix soon, where chunks of <= 50 tokens are created even though they are pure overlap and aren't needed.
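For reference, here is the core of the sliding window, trimmed down from the `chunk_row` function in `chunker.py` (documents at or under `MAX_TOKENS` are kept as a single chunk). The stride is `MAX_TOKENS` minus the overlap, and the current guard is also where the pure-overlap tail chunks mentioned above slip through:

```python
MAX_TOKENS = 500
OVERLAP = 0.1  # 10% overlap between consecutive chunks

def window(tokens):
    overlap = int(MAX_TOKENS * OVERLAP)   # 50 tokens
    stride = MAX_TOKENS - overlap         # each new chunk starts 450 tokens after the previous one
    chunks = []
    start = 0
    while start < len(tokens):
        chunk = tokens[start:start + MAX_TOKENS]
        if len(chunk) < overlap:   # drops tails shorter than the overlap...
            break                  # ...but a tail of exactly 50 tokens (pure overlap) is still kept
        chunks.append(chunk)
        start += stride
    return chunks
```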
22 |
23 | I based everything on files in the dataset, so the 10BT sample was 99 arrow files, which allowed me to take advantage of Modal's automatic container scaling. Each file is processed by its own container, which dramatically sped up the process.
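The fan-out itself is just Modal's `.map` over the list of file names; condensed from `chunker.py`, the shape of it is:

```python
from modal import App, Image, Volume

volume = Volume.from_name("datasets", create_if_missing=True)
image = Image.debian_slim(python_version="3.9").pip_install("datasets==2.16.1", "transformers", "pandas", "tqdm")
app = App(image=image)

@app.function(cpu=4, volumes={"/data": volume}, timeout=3000)
def process_dataset(file):
    # load one arrow file, chunk it, and write one parquet file back to the volume
    return f"done with {file}"

@app.local_entrypoint()
def main():
    files = [f"data-{i:05d}-of-00099.arrow" for i in range(99)]
    # Modal spins up roughly one container per file and streams the results back
    for resp in process_dataset.map(files, order_outputs=False, return_exceptions=True):
        print(resp)
```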
24 |
25 | The chunking process took ~40 minutes using 100 containers and cost $5.
26 |
27 | ### [embed-tei.py](embed-tei.py)
28 | This script uses [Text Embeddings Inference (TEI)](https://huggingface.co/docs/text-embeddings-inference/en/index) like the Wikipedia tutorial, but it loads the pre-chunked dataset and builds batches that attempt to fill the batch token limit, so many more small chunks can be packed into a single batch to speed things up.
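The packing heuristic is simple: sort the chunks by token count, then greedily add chunks until the padded cost of the batch (its longest chunk times the number of chunks) would exceed the client token limit. Condensed from `batch_loader` in `embed-tei.py` (the `"clustering: "` prefix tokens and the `MAX_CLIENT_BATCH_SIZE` row cap are omitted here):

```python
CLIENT_BATCH_TOKEN_LIMIT = 768 * 512  # tokens per request sent to TEI

def pack_batches(chunks):
    """chunks: list of (text, token_count) pairs, pre-sorted by token_count ascending."""
    batches, current_texts, current_counts = [], [], []
    for text, count in chunks:
        # padded cost if this chunk were added: longest chunk in the batch * batch length
        proposed = max(current_counts + [count]) * (len(current_counts) + 1)
        if proposed <= CLIENT_BATCH_TOKEN_LIMIT:
            current_texts.append(text)
            current_counts.append(count)
        else:
            batches.append(current_texts)
            current_texts, current_counts = [text], [count]
    if current_texts:
        batches.append(current_texts)
    return batches
```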
29 |
30 | I believe I'm not quite properly utilizing TEI because I only got ~60% GPU utilization and was only using 10GB of memory on the A10G GPUs, which have 24GB available, so there is probably a way to speed this up even more. That said, it only cost ~$50 to embed the entire dataset. It did take ~12 hours because I didn't always have my full allocation of 10 GPUs available.
31 |
32 | ### [summary.py](summary.py)
33 | I found it useful to quickly calculate summary statistics using the same parallel process of loading each file in its own container and performing some basic pandas calculations.
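`summary.py` isn't included in this excerpt, but the per-file step is roughly the following (a hypothetical sketch of what each container computes; the real script may track different columns), with the per-file results collected into a single DataFrame at the end:

```python
import pandas as pd

def summarize_file(path):
    # runs inside one Modal container per parquet shard (hypothetical sketch)
    df = pd.read_parquet(path)
    return {
        "file": path,
        "rows": len(df),
        "total_tokens": int(df["chunk_token_count"].sum()),
        "mean_tokens": float(df["chunk_token_count"].mean()),
    }

# wrapped as an @app.function, the results of summarize_file.map(files)
# can be collected locally with pd.DataFrame(...)
```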
34 |
35 | ### [fetch.py](fetch.py)
36 | I made a quick utility to download a single file to inspect locally, which was used in the [notebooks/validate.ipynb](notebooks/validate.ipynb) notebook to confirm that the embedding process was working as expected.
37 |
38 |
39 | ## Notebooks
40 | I'm including several notebooks that I developed in the process of learning this in case they are helpful to others.
41 |
42 | ### [small_sample.ipynb](notebooks/small_sample.ipynb)
43 | The first thing I did was download some very small samples of the dataset and explore them with [Latent Scope](https://github.com/enjalot/latent-scope) to familiarize myself with the data and validate the idea of embedding the dataset.
44 |
45 | ### [perfile.ipynb](notebooks/perfile.ipynb)
46 | After I struggled with the structure of the Wikipedia tutorial, I realized I could leverage the CPU parallelism of Modal to process each file in its own container. This notebook was me working out the chunking logic on a single file that I could then parallelize in the `chunker.py` script.
47 |
48 | ### [validate.ipynb](notebooks/validate.ipynb)
49 | This notebook is me taking a look at a single processed file and trying to understand why I was seeing such small chunks. It led me to realize the mistake I made of keeping <= 50 token chunks around (which I still need to fix in the `chunker.py` script).
50 |
51 | ## Experimental
52 | On the way to developing this I was trying to understand how to choose batch sizes and token limits. There are two scripts here:
53 |
54 | ### [batchsize.py](experimental/batchsize.py)
55 | This script uses crude measurement techniques to see how much memory gets filled by a batch of tokens. I'm not confident in it anymore because I was able to fit a lot more tokens into the batches I submitted to `embed-tei.py` than it predicted, even while using an A10G instead of an H100.
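For comparison, here is a rough analytical estimate (not taken from the script) of activation memory for a BERT-base-sized encoder like nomic-embed-text-v1.5, assuming a 768-dim, 12-layer backbone running in fp16 and ignoring attention matrices, intermediate MLP activations, and allocator overhead:

```python
def rough_activation_mb(batch_size, seq_len, hidden_dim=768, n_layers=12, bytes_per_val=2):
    # counts one hidden-state tensor per layer; real usage is noticeably higher
    return batch_size * seq_len * hidden_dim * n_layers * bytes_per_val / 1e6

print(rough_activation_mb(batch_size=384, seq_len=1024))  # ~7,200 MB before overhead
```

This only gives an order of magnitude, which is consistent with the observation above that measured numbers don't transfer cleanly between GPU types.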
56 |
57 | ### [embed.py](experimental/embed.py)
58 | This script uses the HuggingFace Transformers library directly (instead of TEI) so I could have a little more control over how I was embedding. It's the same kind of code I use in Latent Scope for locally embedding smaller datasets, so it allowed me to better understand the scaling process.
59 | The problem is that it's just much slower than TEI.
60 |
--------------------------------------------------------------------------------
/chunker.py:
--------------------------------------------------------------------------------
1 | from modal import App, Image, Volume
2 |
3 | NUM_CPU=4
4 | MAX_TOKENS = 500
5 | # MAX_TOKENS = 120
6 | OVERLAP = 0.1 # 10% overlap when chunking
7 | BATCH_SIZE = 200 # number of rows to process per thread at once
8 |
9 | # We first set our configuration variables for our script.
10 | DATASET_DIR = "/data"
11 |
12 | # https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
13 | # VOLUME = "embedding-fineweb-edu"
14 | # DATASET_SAVE ="fineweb-edu-sample-10BT"
15 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-{MAX_TOKENS}"
16 | # TEXT_KEY = "text"
17 | # KEEP_KEYS = ["id", "url", "score", "dump"]
18 | # files = [f"data-{i:05d}-of-00099.arrow" for i in range(99)]
19 |
20 | # VOLUME = "embedding-fineweb-edu"
21 | # DATASET_SAVE ="fineweb-edu-sample-100BT"
22 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-100BT-chunked-{MAX_TOKENS}"
23 | # KEEP_KEYS = ["id", "url", "score", "dump"]
24 |
25 |
26 | # https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
27 | # VOLUME = "datasets"
28 | # DATASET_SAVE ="RedPajama-Data-V2-sample-10B"
29 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-{MAX_TOKENS}"
30 | # TEXT_KEY = "raw_content"
31 | # KEEP_KEYS = ["doc_id", "meta"]
32 | # files = [f"data-{i:05d}-of-00150.arrow" for i in range(150)]
33 |
34 | # https://huggingface.co/datasets/monology/pile-uncopyrighted
35 | # VOLUME = "datasets"
36 | # DATASET_SAVE ="pile-uncopyrighted"
37 | # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-{MAX_TOKENS}"
38 | # TEXT_KEY = "text"
39 | # KEEP_KEYS = ["meta"]
40 | # files = [f"data-{i:05d}-of-01987.arrow" for i in range(200)]
41 |
42 | #https://huggingface.co/datasets/wikimedia/wikipedia/viewer/20231101.en
43 | # VOLUME = "datasets"
44 | # DATASET_SAVE ="wikipedia-en"
45 | # DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-{MAX_TOKENS}"
46 | # TEXT_KEY = "text"
47 | # KEEP_KEYS = ["id", "url", "title"]
48 | # files = [f"data-{i:05d}-of-00041.arrow" for i in range(41)]
49 |
50 | VOLUME = "datasets"
51 | DATASET_SAVE ="medrag-pubmed"
52 | DATASET_SAVE_CHUNKED = f"medrag-pubmed-{MAX_TOKENS}"
53 | TEXT_KEY = "content"
54 | KEEP_KEYS = ["id", "title", "PMID"]
55 | files = [f"data-{i:05d}-of-00138.arrow" for i in range(138)]
56 |
57 |
58 |
59 |
60 | # MODEL_ID = "nomic-ai/nomic-embed-text-v1.5"
61 |
62 | # We define our Modal Resources that we'll need
63 | volume = Volume.from_name(VOLUME, create_if_missing=True)
64 | image = Image.debian_slim(python_version="3.9").pip_install(
65 | "datasets==2.16.1", "apache_beam==2.53.0", "transformers", "pandas", "tqdm"
66 | )
67 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub"
68 |
69 | def chunk_row(row, tokenizer):
70 | # print("ROW", row)
71 | text = row[TEXT_KEY]
72 | chunks = []
73 |
74 | # TODO: don't save an empty chunk
75 |
76 | tokens = tokenizer.encode(text)
77 | token_count = len(tokens)
78 | if token_count > MAX_TOKENS:
79 | overlap = int(MAX_TOKENS * OVERLAP)
80 | start_index = 0
81 | ci = 0
82 | while start_index < len(tokens):
83 | end_index = min(start_index + MAX_TOKENS, len(tokens))
84 | chunk = tokens[start_index:end_index]
85 | if len(chunk) < overlap:
86 | break
87 | chunks.append({
88 | "chunk_index": ci,
89 | "chunk_text": tokenizer.decode(chunk),
90 | "chunk_tokens": chunk,
91 | "chunk_token_count": len(chunk),
92 | **{key: row[key] for key in KEEP_KEYS}
93 | })
94 | start_index += MAX_TOKENS - overlap
95 | ci += 1
96 | else:
97 | chunks.append({
98 | "chunk_index": 0,
99 | "chunk_text": text,
100 | "chunk_tokens": tokens,
101 | "chunk_token_count": token_count,
102 | **{key: row[key] for key in KEEP_KEYS}
103 | })
104 |
105 | return chunks
106 |
107 |
108 | @app.function(cpu=NUM_CPU, volumes={DATASET_DIR: volume}, timeout=3000)
109 | def process_dataset(file):
110 | import time
111 | from concurrent.futures import ThreadPoolExecutor, as_completed
112 | from tqdm import tqdm
113 | import pandas as pd
114 | import transformers
115 | transformers.logging.set_verbosity_error()
116 | from transformers import AutoTokenizer
117 | from datasets import load_from_disk, load_dataset
118 |
119 | tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=MAX_TOKENS)
120 |
121 | start = time.perf_counter()
122 | # Load the dataset as a Hugging Face dataset
123 | print(f"Loading dataset from {DATASET_DIR}/{DATASET_SAVE}/train/{file}")
124 | dataset = load_dataset("arrow", data_files=f"{DATASET_DIR}/{DATASET_SAVE}/train/{file}")
125 | df = pd.DataFrame(dataset['train'])
126 | print("dataset", len(df))
127 | print(f"Dataset loaded in {time.perf_counter()-start:.2f} seconds for {file}")
128 |
129 | chunks_list = []
130 | with ThreadPoolExecutor(max_workers=NUM_CPU) as executor:
131 | pbar = tqdm(total=len(df), desc=f"Processing Rows for {file}")
132 |
133 | # this gets called inside each thread
134 | def process_batch(batch):
135 | batch_chunks = []
136 | for row in batch:
137 | row_chunks = chunk_row(row, tokenizer)
138 | pbar.update(1)
139 | batch_chunks.extend(row_chunks)
140 | return batch_chunks
141 |
142 | print(f"making batches for {file}")
143 | batches = [df.iloc[i:i + BATCH_SIZE].to_dict(orient="records") for i in range(0, len(df), BATCH_SIZE)]
144 | print(f"made batches for {file}")
145 | print(f"setting up futures for {file}")
146 | futures = [executor.submit(process_batch, batch) for batch in batches]
147 | print(f"in the future for {file}")
148 | for future in as_completed(futures):
149 | chunks_list.extend(future.result())
150 | pbar.close()
151 |
152 | chunked_df = pd.DataFrame(chunks_list)
153 | file_name = file.split(".")[0]
154 | import os
155 | output_dir = f"{DATASET_DIR}/{DATASET_SAVE_CHUNKED}/train"
156 | if not os.path.exists(output_dir):
157 | os.makedirs(output_dir)
158 | print(f"saving to {output_dir}/{file_name}.parquet")
159 | chunked_df.to_parquet(f"{output_dir}/{file_name}.parquet")
160 | print(f"done with {file}, {len(chunks_list)} chunks")
161 | volume.commit()
162 | return f"All done with {file}", len(chunks_list)
163 |
164 |
165 | @app.local_entrypoint()
166 | def main():
167 | # download_dataset.remote()
168 | # from huggingface_hub import HfFileSystem
169 | # hffs = HfFileSystem()
170 | # files = hffs.ls("datasets/HuggingFaceFW/fineweb-edu/sample/10BT", detail=False)
171 |
172 | # files = [f"data-{i:05d}-of-00989.arrow" for i in range(989)]
173 | # files = [f"data-{i:05d}-of-00011.arrow" for i in range(11)]
174 |
175 | # process_dataset.remote(file, max_tokens=MAX_TOKENS, num_cpu=NUM_CPU)
176 | for resp in process_dataset.map(files, order_outputs=False, return_exceptions=True):
177 | if isinstance(resp, Exception):
178 | print(f"Exception: {resp}")
179 | continue
180 | print(resp)
181 |
182 |
183 |
--------------------------------------------------------------------------------
/download.py:
--------------------------------------------------------------------------------
1 | """
2 | Download a dataset from HuggingFace to a modal volume
3 | """
4 | from modal import App, Image, Volume, Secret
5 |
6 | # We first set our configuration variables for our script.
7 | VOLUME = "datasets"
8 | DATASET_DIR = "/data"
9 |
10 | HF_CACHE_DIR = f"{DATASET_DIR}/cache"
11 |
12 | # DATASET_NAME = "HuggingFaceFW/fineweb-edu"
13 | # SAMPLE = "100BT"
14 | # DATASET_FILES = f"sample/{SAMPLE}/*.parquet"
15 | # DATASET_SAVE =f"fineweb-edu-sample-{SAMPLE}"
16 | # VOLUME = "embedding-fineweb-edu"
17 |
18 |
19 | # DATASET_NAME = "togethercomputer/RedPajama-Data-V2"
20 | # DATASET_SAVE = "RedPajama-Data-V2-sample-10B"
21 | # DATASET_SAMPLE = "sample-10B"
22 | # DATASET_FILES = None
23 |
24 | # DATASET_NAME = "monology/pile-uncopyrighted"
25 | # DATASET_SAVE = "pile-uncopyrighted"
26 | # DATASET_SAMPLE = None
27 | # DATASET_FILES = None
28 |
29 | # DATASET_NAME = "PleIAs/common_corpus"
30 | # DATASET_SAVE = "common_corpus"
31 | # DATASET_SAMPLE = None
32 | # DATASET_FILES = None
33 |
34 | # DATASET_NAME = "bigcode/the-stack-dedup"
35 | # DATASET_SAVE = "the-stack-dedup"
36 | # DATASET_FILES = None
37 |
38 | # DATASET_NAME = "wikimedia/wikipedia"
39 | # DATASET_SAVE = "wikipedia-en"
40 | # DATASET_SAMPLE = "20231101.en"
41 | # DATASET_FILES = None
42 |
43 | DATASET_NAME = "MedRAG/pubmed"
44 | DATASET_SAVE = "medrag-pubmed"
45 | DATASET_SAMPLE = None
46 | DATASET_FILES = None
47 |
48 |
49 |
50 |
51 | # We define our Modal Resources that we'll need
52 | volume = Volume.from_name(VOLUME, create_if_missing=True)
53 | image = Image.debian_slim(python_version="3.9").pip_install(
54 | "datasets==3.2.0"
55 | )
56 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub"
57 |
58 |
59 | # The default timeout is 5 minutes re: https://modal.com/docs/guide/timeouts#handling-timeouts
60 | # but we override this to
61 | # 60,000s to avoid any potential timeout issues
62 | @app.function(
63 | volumes={DATASET_DIR: volume},
64 | timeout=60000,
65 | ephemeral_disk=int(3145728), # in MiB
66 | secrets=[Secret.from_name("huggingface-secret")],
67 | )
68 | def download_dataset():
69 | # Redownload the dataset
70 | import time
71 | import os
72 |
73 | # Set HF cache environment variable
74 | os.environ['HF_HOME'] = HF_CACHE_DIR
75 |
76 |
77 | from datasets import load_dataset, DownloadConfig, logging
78 | logging.set_verbosity_debug()
79 |
80 | start = time.time()
81 | if DATASET_FILES:
82 | dataset = load_dataset(DATASET_NAME, data_files=DATASET_FILES, num_proc=6, trust_remote_code=True, download_config=DownloadConfig(resume_download=True, cache_dir=HF_CACHE_DIR))
83 | elif DATASET_SAMPLE:
84 | dataset = load_dataset(DATASET_NAME, DATASET_SAMPLE, num_proc=6, trust_remote_code=True, download_config=DownloadConfig(resume_download=True, cache_dir=HF_CACHE_DIR))
85 | else:
86 | dataset = load_dataset(DATASET_NAME, num_proc=6, trust_remote_code=True, download_config=DownloadConfig(resume_download=True, cache_dir=HF_CACHE_DIR))
87 | end = time.time()
88 | print(f"Download complete - downloaded files in {end-start}s")
89 |
90 | dataset.save_to_disk(f"{DATASET_DIR}/{DATASET_SAVE}")
91 | volume.commit()
92 |
93 | @app.function(volumes={DATASET_DIR: volume})
94 | def load_dataset():
95 | import time
96 | import os
97 |
98 | # Set HF cache environment variable
99 | os.environ['HF_HOME'] = HF_CACHE_DIR
100 |
101 |
102 | from datasets import load_from_disk
103 |
104 | start = time.perf_counter()
105 | # Load the dataset as a Hugging Face dataset
106 | print(f"Loading dataset from {DATASET_DIR}/{DATASET_SAVE}")
107 | dataset = load_from_disk(f"{DATASET_DIR}/{DATASET_SAVE}")
108 | print(f"Dataset loaded in {time.perf_counter()-start:.2f} seconds")
109 |
110 |
111 | # # Sample the dataset to 100,000 rows
112 | # print("Sampling dataset to 100,000 rows")
113 | # sampled_datasets = dataset["train"].select(range(100000))
114 | # sampled_datasets.save_to_disk(f"{DATASET_DIR}/{DATASET_SAVE}-100k")
115 |
116 |
117 | # TODO: make a function to delete files
118 | # the 00099 files are old/wrong
119 |
120 | # TODO: make a function to load a single file from dataset
121 |
122 | @app.local_entrypoint()
123 | def main():
124 | download_dataset.remote()
125 | # load_dataset.remote()
126 |
127 |
--------------------------------------------------------------------------------
/embed-tei.py:
--------------------------------------------------------------------------------
1 | """
2 | Embed a dataset using the HuggingFace TEI
3 | """
4 | import os
5 | import json
6 | import time
7 | import asyncio
8 | import subprocess
9 |
10 | from modal import App, Image, Secret, Volume, build, enter, exit, gpu, method
11 |
12 | DATASET_DIR = "/data"
13 |
14 | ### CHUNKED DATASET
15 | # VOLUME = "embedding-fineweb-edu"
16 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-500"
17 | # files = [f"data-{i:05d}-of-00099.parquet" for i in range(99)]
18 |
19 | # VOLUME = "embedding-fineweb-edu"
20 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-120"
21 | # files = [f"data-{i:05d}-of-00099.parquet" for i in range(99)]
22 |
23 | # VOLUME = "datasets"
24 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-120"
25 | # files = [f"data-{i:05d}-of-00150.parquet" for i in range(150)]
26 |
27 | # VOLUME = "datasets"
28 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-500"
29 | # files = [f"data-{i:05d}-of-00150.parquet" for i in range(150)]
30 |
31 | # VOLUME = "datasets"
32 | # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-120"
33 | # # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-500"
34 | # files = [f"data-{i:05d}-of-01987.parquet" for i in range(200)]
35 |
36 | VOLUME = "datasets"
37 | DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-120"
38 | # DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-500"
39 | files = [f"data-{i:05d}-of-00041.parquet" for i in range(41)]
40 |
41 | # VOLUME = "datasets"
42 | # DATASET_SAVE_CHUNKED = f"medrag-pubmed-500"
43 | # files = [f"data-{i:05d}-of-00138.parquet" for i in range(138)]
44 |
45 |
46 |
47 | EMBEDDING_DIR = "/embeddings"
48 |
49 | #### MODEL
50 | # Tokenized version of "clustering: " prefix = [101, 9324, 2075, 1024]
51 | PREFIX = "clustering: "
52 | PREFIX_TOKEN_COUNT = 4
53 | MODEL_ID = "nomic-ai/nomic-embed-text-v1.5"
54 |
55 | # PREFIX = ""
56 | # PREFIX_TOKEN_COUNT = 0
57 | # MODEL_ID = "BAAI/bge-base-en-v1.5"
58 |
59 | # PREFIX = ""
60 | # PREFIX_TOKEN_COUNT = 0
61 | # MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
62 |
63 | MODEL_SLUG = MODEL_ID.split("/")[-1]
64 |
65 | MODEL_DIR = "/model"
66 | MODEL_REVISION="main"
67 |
68 | GPU_CONCURRENCY = 10
69 | BATCHER_CONCURRENCY = GPU_CONCURRENCY
70 | GPU_CONFIG = "A10G"
71 | GPU_IMAGE = "ghcr.io/huggingface/text-embeddings-inference:86-1.2"
72 | # GPU_CONFIG = gpu.A100(size="40GB")
73 | # GPU_CONFIG = gpu.A100(size="80GB")
74 | # GPU_IMAGE = "ghcr.io/huggingface/text-embeddings-inference:1.2"
75 | # GPU_CONFIG = gpu.H100()
76 | # GPU_IMAGE = "ghcr.io/huggingface/text-embeddings-inference:hopper-1.2"
77 |
78 |
79 | SENTENCE_TOKEN_LIMIT = 512
80 | CLIENT_BATCH_TOKEN_LIMIT = 768 * SENTENCE_TOKEN_LIMIT # how many tokens we put in a batch. limiting factor
81 | # I set the server limit higher, but if we make the client batch too big it errors out without a helpful message
82 | SERVER_BATCH_TOKEN_LIMIT = 2 * CLIENT_BATCH_TOKEN_LIMIT # how many tokens the server can handle in a batch
83 | MAX_CLIENT_BATCH_SIZE = 2 * 4096 # how many rows can be in a batch
84 | # CLIENT_BATCH_TOKEN_LIMIT = 1536 * SENTENCE_TOKEN_LIMIT # Double from 768
85 | # SERVER_BATCH_TOKEN_LIMIT = 4 * 1536 * SENTENCE_TOKEN_LIMIT # Increased server capacity
86 |
87 | # CLIENT_BATCH_TOKEN_LIMIT = 512 * SENTENCE_TOKEN_LIMIT #A100 40GB
88 | # SERVER_BATCH_TOKEN_LIMIT = 4*2048 * SENTENCE_TOKEN_LIMIT #A100 40GB
89 |
90 | LAUNCH_FLAGS = [
91 | "--model-id",
92 | MODEL_ID,
93 | "--port",
94 | "8000",
95 | "--max-client-batch-size",
96 | str(MAX_CLIENT_BATCH_SIZE), # Increased from 20000
97 | "--max-batch-tokens",
98 | str(SERVER_BATCH_TOKEN_LIMIT),
99 | "--auto-truncate",
100 | "--dtype",
101 | "float16",
102 | "--json-output" # Add for more detailed perf metrics
103 | ]
104 |
105 | ## Dataset-Specific Configuration
106 | DATASET_READ_VOLUME = Volume.from_name(
107 | VOLUME, create_if_missing=True
108 | )
109 | EMBEDDING_CHECKPOINT_VOLUME = Volume.from_name(
110 | "embeddings", create_if_missing=True
111 | )
112 |
113 | def spawn_server() -> subprocess.Popen:
114 | import socket
115 |
116 | process = subprocess.Popen(["text-embeddings-router"] + LAUNCH_FLAGS)
117 | # Poll until webserver at 127.0.0.1:8000 accepts connections before running inputs.
118 | while True:
119 | try:
120 | socket.create_connection(("127.0.0.1", 8000), timeout=1).close()
121 | print("Webserver ready!")
122 | return process
123 | except (socket.timeout, ConnectionRefusedError):
124 | # Check if launcher webserving process has exited.
125 | # If so, a connection can never be made.
126 | retcode = process.poll()
127 | if retcode is not None:
128 | raise RuntimeError(
129 | f"launcher exited unexpectedly with code {retcode}"
130 | )
131 |
132 |
133 | tei_image = (
134 | Image.from_registry(
135 | GPU_IMAGE,
136 | add_python="3.10",
137 | )
138 | .dockerfile_commands("ENTRYPOINT []")
139 | .pip_install("httpx", "numpy")
140 | )
141 |
142 | with tei_image.imports():
143 | import numpy as np
144 |
145 | app = App(
146 | "fineweb-embeddings-tei"
147 | )
148 |
149 | @app.cls(
150 | gpu=GPU_CONFIG,
151 | image=tei_image,
152 | max_containers=GPU_CONCURRENCY,
153 | allow_concurrent_inputs=4, # allows the batchers to queue up several requests
154 | # but if we allow too many and they get backed up it spams timeout errors
155 | retries=3,
156 | )
157 | class TextEmbeddingsInference:
158 | # @build()
159 | # def download_model(self):
160 | # spawn_server()
161 |
162 | @enter()
163 | def open_connection(self):
164 | # If the process is running for a long time, the client does not seem to close the connections, results in a pool timeout
165 | from httpx import AsyncClient
166 |
167 | self.process = spawn_server()
168 | self.client = AsyncClient(base_url="http://127.0.0.1:8000", timeout=30)
169 |
170 | @exit()
171 | def terminate_connection(self):
172 | self.process.terminate()
173 |
174 | @method()
175 | async def embed(self, chunk_batch):
176 | texts = chunk_batch[0]
177 | res = await self.client.post("/embed", json={"inputs": texts})
178 | try:
179 | emb = res.json()
180 | return chunk_batch, np.array(emb)
181 | except Exception as e:
182 |             print("Error embedding:", e)
183 | print("res", res)
184 | raise e
185 |
186 | @app.function(
187 | max_containers=BATCHER_CONCURRENCY,
188 | image=Image.debian_slim().pip_install(
189 | "pandas", "pyarrow", "tqdm"
190 | ),
191 | volumes={
192 | DATASET_DIR: DATASET_READ_VOLUME,
193 | EMBEDDING_DIR: EMBEDDING_CHECKPOINT_VOLUME,
194 | },
195 | timeout=86400,
196 | secrets=[Secret.from_name("huggingface-secret")],
197 | )
198 | def batch_loader(file):
199 | import pandas as pd
200 | from tqdm import tqdm
201 | import time
202 |
203 | print(f"reading in {file}")
204 | file_path = f"{DATASET_DIR}/{DATASET_SAVE_CHUNKED}/train/{file}"
205 | df = pd.read_parquet(file_path)
206 | df['original_position'] = np.arange(len(df))
207 | print(f"sorting {file}", len(df))
208 | df = df.sort_values(by='chunk_token_count', ascending=True)
209 | # df = df[0: 80000]
210 | # df = df.reset_index(drop=True)
211 |
212 | batches_text = []
213 | current_batch_counts = []
214 | current_batch_text = []
215 | batch_indices = []
216 | current_batch_indices = []
217 | packed = []
218 |
219 | print("building batches for ", file, "with client batch token limit", CLIENT_BATCH_TOKEN_LIMIT)
220 | start = time.monotonic_ns()
221 |
222 | pbar = tqdm(total=len(df), desc=f"building batches for {file}")
223 |     # idx is actually the original index since I didn't reset the index during the sort.
224 |     # I just hate that it's implied; it caused a bug when I didn't realize it.
225 | for idx, row in df.iterrows():
226 | pbar.update(1)
227 | original_idx = row['original_position']
228 | chunk_token_count = row['chunk_token_count'] + PREFIX_TOKEN_COUNT # 4 for the prefix
229 | chunkt = PREFIX + row['chunk_text']
230 | if not chunkt or not chunkt.strip():
231 | print(f"WARNING: Empty chunk detected at index {original_idx}")
232 | chunkt = " "
233 | chunk_token_count = 1
234 | proposed_batch_count = current_batch_counts + [chunk_token_count]
235 | proposed_length = max(count for count in proposed_batch_count) * len(proposed_batch_count)
236 |
237 | if proposed_length <= CLIENT_BATCH_TOKEN_LIMIT and len(current_batch_indices) < MAX_CLIENT_BATCH_SIZE:
238 | current_batch_text.append(chunkt)
239 | current_batch_indices.append(original_idx)
240 | current_batch_counts.append(chunk_token_count)
241 | else:
242 | batches_text.append(current_batch_text)
243 | batch_indices.append(current_batch_indices)
244 | current_batch_counts = [chunk_token_count]
245 | current_batch_text = [chunkt]
246 | current_batch_indices = [original_idx]
247 |
248 | if current_batch_counts:
249 | batch_indices.append(current_batch_indices)
250 | batches_text.append(current_batch_text)
251 |
252 |
253 | duration_s = (time.monotonic_ns() - start) / 1e9
254 | print(f"batched {file} in {duration_s:.0f}s")
255 |
256 | responses = []
257 |     for batch_text, batch_idx in zip(batches_text, batch_indices):
258 |         packed.append((batch_text, batch_idx))
259 |
260 | print(f"{len(packed)} batches")
261 |
262 | pbar = tqdm(total=len(packed), desc=f"embedding {file}")
263 | model = TextEmbeddingsInference()
264 |
265 | for resp in model.embed.map(
266 | packed,
267 | order_outputs=False,
268 | return_exceptions=False
269 | ):
270 | responses.append(resp)
271 | pbar.update(1)
272 |
273 | if not os.path.exists(f"{EMBEDDING_DIR}/{DATASET_SAVE_CHUNKED}-{MODEL_SLUG}/train"):
274 | os.makedirs(f"{EMBEDDING_DIR}/{DATASET_SAVE_CHUNKED}-{MODEL_SLUG}/train", exist_ok=True)
275 |
276 | embedding_dim = responses[0][1].shape[1]
277 | embedding_path = f"{EMBEDDING_DIR}/{DATASET_SAVE_CHUNKED}-{MODEL_SLUG}/train/{file.replace('.parquet', '.npy')}"
278 | mmap_embeddings = np.memmap(embedding_path, dtype='float32', mode='w+', shape=(len(df), embedding_dim))
279 |
280 | print("writing embeddings to disk")
281 | for batch, response in responses:
282 | for idx, embedding in zip(batch[1], response):
283 | mmap_embeddings[idx] = embedding
284 | mmap_embeddings.flush()
285 |
286 | del mmap_embeddings
287 |
288 | EMBEDDING_CHECKPOINT_VOLUME.commit()
289 | return f"done with {file}"
290 |
291 | @app.local_entrypoint()
292 | def full_job():
293 | for resp in batch_loader.map(
294 | files,
295 | order_outputs=False,
296 | return_exceptions=True
297 | ):
298 | print(resp)
299 |
300 | print("done")
301 |
302 |
--------------------------------------------------------------------------------
/experimental/batchsize.py:
--------------------------------------------------------------------------------
1 | """
2 | Try to figure out the optimal batch size for embedding on a given GPU
3 | """
4 | import os
5 | import json
6 | import time
7 | import asyncio
8 | import subprocess
9 |
10 | from modal import App, Image, Secret, Volume, build, enter, exit, gpu, method
11 |
12 | # We first set our configuration variables for our script.
13 | ## Embedding Containers Configuration
14 | # GPU_CONCURRENCY = 100
15 | MODEL_ID = "nomic-ai/nomic-embed-text-v1.5"
16 | MODEL_SLUG = MODEL_ID.split("/")[-1]
17 |
18 | MODEL_DIR = "/model"
19 | MODEL_REVISION="main"
20 |
21 | GPU_CONCURRENCY = 1
22 | # GPU_CONFIG = gpu.A100(size="80GB")
23 | # GPU_CONFIG = gpu.A100(size="40GB")
24 | # GPU_CONFIG = gpu.A10G()
25 | GPU_CONFIG = gpu.H100()
26 | # BATCH_SIZE = 512
27 | BATCH_SIZE = 64
28 | # BATCH_SIZE = 128
29 | MAX_TOKENS = 8192
30 | # MAX_TOKENS = 2048
31 |
32 |
33 | ## Dataset-Specific Configuration
34 | DATASET_READ_VOLUME = Volume.from_name(
35 | "embedding-fineweb-edu", create_if_missing=True
36 | )
37 | EMBEDDING_CHECKPOINT_VOLUME = Volume.from_name(
38 | "checkpoint", create_if_missing=True
39 | )
40 | DATASET_DIR = "/data"
41 | # DATASET_SAVE ="fineweb-edu-sample-10BT"
42 | DATASET_SAVE ="fineweb-edu-sample-10BT-100k"
43 | CHECKPOINT_DIR = "/checkpoint"
44 | SAVE_TO_DISK = True
45 |
46 | ## Upload-Specific Configuration
47 | # DATASET_HF_UPLOAD_REPO_NAME = "enjalot/fineweb-edu-sample-10BT"
48 | DATASET_HF_UPLOAD_REPO_NAME = f"enjalot/{DATASET_SAVE}"
49 | UPLOAD_TO_HF = False
50 |
51 |
52 | def download_model_to_image(model_dir, model_name, model_revision):
53 | from huggingface_hub import snapshot_download
54 | from transformers.utils import move_cache
55 |
56 | os.makedirs(model_dir, exist_ok=True)
57 |
58 | snapshot_download(
59 | repo_id=model_name,
60 | revision=model_revision,
61 | local_dir=model_dir,
62 | ignore_patterns=["*.pt", "*.bin"], # Using safetensors
63 | )
64 | move_cache()
65 |
66 | st_image = (
67 | Image.debian_slim(python_version="3.10")
68 | .pip_install(
69 | "torch==2.1.2",
70 | "numpy==1.26.3",
71 | "transformers==4.39.3",
72 | "hf-transfer==0.1.6",
73 | "huggingface_hub==0.22.2",
74 | "einops==0.7.0"
75 | )
76 | .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
77 | .run_function(
78 | download_model_to_image,
79 | timeout=60 * 20,
80 | kwargs={
81 | "model_dir": MODEL_DIR,
82 | "model_name": MODEL_ID,
83 | "model_revision": MODEL_REVISION,
84 | },
85 | secrets=[Secret.from_name("huggingface-secret")],
86 | )
87 | )
88 | with st_image.imports():
89 | import numpy as np
90 | import torch
91 | from torch.cuda.amp import autocast
92 | from transformers import AutoTokenizer, AutoModel
93 |
94 | app = App(
95 | "fineweb-embeddings-st"
96 | )
97 |
98 | @app.cls(
99 | gpu=GPU_CONFIG,
100 | # cpu=16,
101 | concurrency_limit=GPU_CONCURRENCY,
102 | timeout=60 * 10,
103 | container_idle_timeout=60 * 10,
104 | allow_concurrent_inputs=1,
105 | image=st_image,
106 | )
107 | class TransformerModel:
108 | @enter()
109 | def start_engine(self):
110 | # import torch
111 | # from transformers import AutoTokenizer, AutoModel
112 |
113 | self.device = torch.device("cuda")
114 |
115 | print("🥶 cold starting inference")
116 | start = time.monotonic_ns()
117 |
118 | self.model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, safe_serialization=True)#, rotary_scaling_factor=2 )
119 | self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=MAX_TOKENS)
120 | self.model.to(self.device)
121 | self.model.eval()
122 |
123 | print(f"CUDA memory allocated: {torch.cuda.memory_allocated() / 1e6} MB")
124 | duration_s = (time.monotonic_ns() - start) / 1e9
125 | print(f"🏎️ engine started in {duration_s:.0f}s")
126 |
127 | @method()
128 | def embed(self, inputs):
129 | tok = self.tokenizer
130 |
131 | # print(torch.cuda.memory_summary(device=None, abbreviated=False))
132 | print(torch.cuda.memory_summary(device=self.device, abbreviated=True))
133 |
134 | # print(f"CUDA memory allocated before encoding: {torch.cuda.memory_allocated() / 1e6} MB")
135 |
136 | start = time.monotonic_ns()
137 | encoded_input = tok(inputs, padding=True, truncation=True, return_tensors='pt')
138 | print("encoded in", (time.monotonic_ns() - start) / 1e9)
139 |
140 | encoded_input = {key: value.to(self.device) for key, value in encoded_input.items()}
141 | # print("moved to device", (time.monotonic_ns() - start) / 1e9)
142 | # print("encoded input size", encoded_input['input_ids'].nelement() * encoded_input['input_ids'].element_size() / 1e6, "MB")
143 |
144 | # print(f"CUDA memory allocated after encoding: {torch.cuda.memory_allocated() / 1e6} MB")
145 | start = time.monotonic_ns()
146 | # print(torch.cuda.memory_summary(device=None, abbreviated=False))
147 | with torch.no_grad():#, autocast():
148 | # print(f"CUDA memory allocated before embedding: {torch.cuda.memory_allocated() / 1e6} MB")
149 | model_output = self.model(**encoded_input)
150 | # print(f"CUDA memory allocated after model output: {torch.cuda.memory_allocated() / 1e6} MB")
151 | # print(f"model output size: {model_output.nelement() * model_output.element_size() / 1e6} MB")
152 | embeddings = model_output[0][:, 0]
153 | # print(f"Embedding size: {embeddings.nelement() * embeddings.element_size() / 1e6} MB")
154 | # print(f"CUDA memory allocated after embedding: {torch.cuda.memory_allocated() / 1e6} MB")
155 | normalized_embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
156 | normalized_embeddings_cpu = normalized_embeddings.cpu().numpy()
157 | # print(f"CUDA memory allocated after got embeddings: {torch.cuda.memory_allocated() / 1e6} MB")
158 | # # Clean up torch memory
159 | # del encoded_input
160 | # del model_output
161 | # del embeddings
162 | # del normalized_embeddings
163 | # torch.cuda.empty_cache()
164 | duration_ms = (time.monotonic_ns() - start) / 1e6
165 | print(f"embedding took {duration_ms:.0f}ms")
166 | print(torch.cuda.memory_summary(device=self.device, abbreviated=True))
167 |
168 | return inputs, normalized_embeddings_cpu
169 |
170 |
171 |
172 | @app.local_entrypoint()
173 | def full_job():
174 | tok = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=MAX_TOKENS)
175 | batch_size = BATCH_SIZE
176 |
177 | test = "I "
178 | test = test * 1022
179 | tokens = tok.encode(test)
180 | print("tokens", len(tokens))
181 |
182 | inputs = [test] * (384)
183 |
184 | model = TransformerModel()
185 | [inputs, embeddings] = model.embed.remote(inputs=inputs)
186 | print("done")
187 |
188 |
--------------------------------------------------------------------------------
/experimental/embed.py:
--------------------------------------------------------------------------------
1 | """
2 | Embed a dataset using a HuggingFace model, a good deal slower than TEI
3 | """
4 | import os
5 | import json
6 | import time
7 | import asyncio
8 | import subprocess
9 |
10 | from modal import App, Image, Secret, Volume, build, enter, exit, gpu, method
11 |
12 | DATASET_DIR = "/data"
13 | DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-500"
14 | CHECKPOINT_DIR = "/checkpoint"
15 |
16 | # We first set our configuration variables for our script.
17 | ## Embedding Containers Configuration
18 | # GPU_CONCURRENCY = 100
19 | MODEL_ID = "nomic-ai/nomic-embed-text-v1.5"
20 | MODEL_SLUG = MODEL_ID.split("/")[-1]
21 |
22 | MODEL_DIR = "/model"
23 | MODEL_REVISION="main"
24 |
25 | GPU_CONCURRENCY = 10
26 | # GPU_CONFIG = gpu.A100(size="80GB")
27 | # GPU_CONFIG = gpu.A100(size="40GB")
28 | # GPU_CONFIG = gpu.A10G()
29 | GPU_CONFIG = gpu.H100()
30 |
31 |
32 | ## Dataset-Specific Configuration
33 | DATASET_READ_VOLUME = Volume.from_name(
34 | "embedding-fineweb-edu", create_if_missing=True
35 | )
36 | EMBEDDING_CHECKPOINT_VOLUME = Volume.from_name(
37 | "embeddings", create_if_missing=True
38 | )
39 | def download_model_to_image(model_dir, model_name, model_revision):
40 | from huggingface_hub import snapshot_download
41 | from transformers.utils import move_cache
42 |
43 | os.makedirs(model_dir, exist_ok=True)
44 |
45 | snapshot_download(
46 | repo_id=model_name,
47 | revision=model_revision,
48 | local_dir=model_dir,
49 | ignore_patterns=["*.pt", "*.bin"], # Using safetensors
50 | )
51 | move_cache()
52 |
53 | st_image = (
54 | Image.debian_slim(python_version="3.10")
55 | .pip_install(
56 | "torch==2.1.2",
57 | "numpy==1.26.3",
58 | "transformers==4.39.3",
59 | "hf-transfer==0.1.6",
60 | "huggingface_hub==0.22.2",
61 | "einops==0.7.0"
62 | )
63 | .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
64 | .run_function(
65 | download_model_to_image,
66 | timeout=60 * 20,
67 | kwargs={
68 | "model_dir": MODEL_DIR,
69 | "model_name": MODEL_ID,
70 | "model_revision": MODEL_REVISION,
71 | },
72 | secrets=[Secret.from_name("huggingface-secret")],
73 | )
74 | )
75 | with st_image.imports():
76 | import numpy as np
77 | import torch
78 | from torch.cuda.amp import autocast
79 | from transformers import AutoTokenizer, AutoModel
80 |
81 | app = App(
82 | "fineweb-embeddings-st"
83 | )
84 |
85 | @app.cls(
86 | gpu=GPU_CONFIG,
87 | # cpu=16,
88 | concurrency_limit=GPU_CONCURRENCY,
89 | timeout=60 * 10,
90 | container_idle_timeout=60 * 10,
91 | allow_concurrent_inputs=1,
92 | image=st_image,
93 | )
94 | class TransformerModel:
95 | @enter()
96 | def start_engine(self):
97 | # import torch
98 | # from transformers import AutoTokenizer, AutoModel
99 |
100 | self.device = torch.device("cuda")
101 |
102 | print("🥶 cold starting inference")
103 | start = time.monotonic_ns()
104 |
105 | self.model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, safe_serialization=True)#, rotary_scaling_factor=2 )
106 | # self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", model_max_length=512) # MAX_TOKENS
107 | self.model.to(self.device)
108 | self.model.eval()
109 |
110 | # print(f"CUDA memory allocated: {torch.cuda.memory_allocated() / 1e6} MB")
111 | duration_s = (time.monotonic_ns() - start) / 1e9
112 | print(f"🏎️ engine started in {duration_s:.0f}s")
113 |
114 | @method()
115 | def embed(self, batch_mask_index):
116 | batch, mask, index = batch_mask_index
117 | # print(torch.cuda.memory_summary(device=self.device, abbreviated=True))
118 |
119 | tokens_tensor = torch.tensor(batch)
120 | attention_mask = torch.tensor(mask)
121 |
122 | encoded_input = {
123 | 'input_ids': tokens_tensor.to(self.device),
124 | 'attention_mask': attention_mask.to(self.device)
125 | }
126 | # encoded_input = {key: value.to(self.device) for key, value in inputs}
127 | start = time.monotonic_ns()
128 | with torch.no_grad():#, autocast():
129 | model_output = self.model(**encoded_input)
130 | embeddings = model_output[0][:, 0]
131 | normalized_embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
132 | normalized_embeddings_cpu = normalized_embeddings.cpu().numpy()
133 |
134 | duration_ms = (time.monotonic_ns() - start) / 1e6
135 | print(f"embedding took {duration_ms:.0f}ms")
136 |
137 | del encoded_input
138 | del model_output
139 | del embeddings
140 | del normalized_embeddings
141 | torch.cuda.empty_cache()
142 |
143 | # print(torch.cuda.memory_summary(device=self.device, abbreviated=True))
144 | return index, normalized_embeddings_cpu
145 |
146 |
147 |
148 | @app.function(
149 | image=Image.debian_slim().pip_install(
150 | "pandas", "pyarrow", "tqdm"
151 | ),
152 | volumes={
153 | DATASET_DIR: DATASET_READ_VOLUME,
154 | CHECKPOINT_DIR: EMBEDDING_CHECKPOINT_VOLUME,
155 | },
156 | timeout=86400,
157 | secrets=[Secret.from_name("huggingface-secret")],
158 | )
159 | def batch_loader(file, batch_size: int = 512 * 1024):
160 | import pandas as pd
161 | from tqdm import tqdm
162 | import time
163 |
164 |
165 | print(f"reading in {file}")
166 | file_path = f"{DATASET_DIR}/{DATASET_SAVE_CHUNKED}/train/{file}"
167 | df = pd.read_parquet(file_path)
168 | print(f"sorting {file}")
169 | df = df.sort_values(by='chunk_token_count', ascending=True)
170 | batches = []
171 | current_batch = []
172 | current_token_count = 0
173 | batch_indices = []
174 | current_batch_indices = []
175 | attention_masks = [] # List to store attention masks for each batch
176 |
177 |
178 | # Tokenized version of "clustering: "
179 | prefix = [101, 9324, 2075, 1024]
180 |
181 | print("building batches for ", file)
182 | start = time.monotonic_ns()
183 |
184 | for index, row in df.iterrows():
185 | # chunk_token_count = row['chunk_token_count']
186 | chunk = prefix + list(row['chunk_tokens'])
187 | proposed_batch = current_batch + [chunk]
188 | proposed_length = max(len(tokens) for tokens in proposed_batch) * len(proposed_batch)
189 |
190 | if proposed_length <= batch_size:
191 | current_batch.append(chunk)
192 | current_batch_indices.append(index)
193 | # current_token_count = proposed_length
194 | else:
195 | # Pad the current batch
196 | max_length = max(len(tokens) for tokens in current_batch)
197 | padded_batch = [tokens + [0] * (max_length - len(tokens)) for tokens in current_batch]
198 | attention_mask = [[1] * len(tokens) + [0] * (max_length - len(tokens)) for tokens in current_batch]
199 | batches.append(padded_batch)
200 | attention_masks.append(attention_mask)
201 | batch_indices.append(current_batch_indices)
202 | # Start new batch
203 | current_batch = [chunk]
204 | current_batch_indices = [index]
205 | # current_token_count = len(chunk)
206 |
207 | if current_batch:
208 | # Pad the final batch
209 | max_length = max(len(tokens) for tokens in current_batch)
210 | padded_batch = [tokens + [0] * (max_length - len(tokens)) for tokens in current_batch]
211 | attention_mask = [[1] * len(tokens) + [0] * (max_length - len(tokens)) for tokens in current_batch]
212 |                 attention_masks.append(attention_mask)  # keep the final batch's mask so zip() doesn't drop the last batch
213 |                 batches.append(padded_batch)
214 |                 batch_indices.append(current_batch_indices)
215 |
216 |
217 | print("length of first batch", len(batches[0]))
218 | first_batch_length = sum(len(chunk) for chunk in batches[0])
219 | print("Total length of all elements in the first batch:", first_batch_length)
220 | print(f"number of batches {len(batches)}")
221 |
222 | duration_s = (time.monotonic_ns() - start) / 1e9
223 | print(f"batched {file} in {duration_s:.0f}s")
224 |
225 | pbar = tqdm(total=len(batches), desc=f"embedding {file}")
226 | model = TransformerModel()
227 |
228 | responses = []
229 | for resp in model.embed.map(
230 | zip(batches, attention_masks, batch_indices),
231 | order_outputs=False,
232 | return_exceptions=False
233 | ):
234 | responses.append(resp)
235 | pbar.update(1)
236 |
237 | print("zipping batches with responses")
238 | for batch_idx, response in responses:
239 | for idx, embedding in zip(batch_idx, response):
240 | df.at[idx, 'embedding'] = embedding
241 |
242 | if not os.path.exists(f"{CHECKPOINT_DIR}/{DATASET_SAVE_CHUNKED}/train"):
243 | os.makedirs(f"{CHECKPOINT_DIR}/{DATASET_SAVE_CHUNKED}/train", exist_ok=True)
244 | df.to_parquet(f"{CHECKPOINT_DIR}/{DATASET_SAVE_CHUNKED}/train/{file}")
245 | return f"done with {file}"
246 |
247 | @app.local_entrypoint()
248 | def full_job():
249 |
250 | file = "data-00000-of-00099.parquet"
251 |
252 | batch_loader.remote(file=file, batch_size = (1024) * 512)
253 | print("done")
254 |
255 |
--------------------------------------------------------------------------------
/features.py:
--------------------------------------------------------------------------------
1 | """
2 | Extract the features for the embeddings of a dataset using a pre-trained SAE model
3 |
4 | modal run features.py
5 | """
6 |
7 | import os
8 | import time
9 | from tqdm import tqdm
10 | from latentsae.sae import Sae
11 | from modal import App, Image, Volume, Secret, gpu, enter, method
12 |
13 | DATASET_DIR="/embeddings"
14 | VOLUME = "embeddings"
15 |
16 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4"
17 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-120-all-MiniLM-L6-v2"
18 | # DIRECTORY = f"{DATASET_DIR}/RedPajama-Data-V2-sample-10B-chunked-120-all-MiniLM-L6-v2"
19 | # DIRECTORY = f"{DATASET_DIR}/pile-uncopyrighted-chunked-120-all-MiniLM-L6-v2"
20 | # DIRECTORY = f"{DATASET_DIR}/medrag-pubmed-500-nomic-embed-text-v1.5"
21 | # FILES = [f"{DIRECTORY}/train/data-{i:05d}-of-00138.npy" for i in range(138)]
22 | DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5"
23 | FILES = [f"{DIRECTORY}/train/data-{i:05d}-of-00041.npy" for i in range(41)]
24 | SAE = "64_32"
25 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-2"
26 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3"
27 | # SAE = "64_128"
28 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}"
29 | # SAE = "64_64"
30 |
31 | SAVE_DIRECTORY = f"{DIRECTORY}-{SAE}"
32 |
33 |
34 | # MODEL_ID = "enjalot/sae-all-MiniLM-L6-v2"
35 | # D_IN = 384
36 | MODEL_ID = "enjalot/sae-nomic-text-v1.5-FineWeb-edu-100BT"
37 | MODEL_DIR = "/model"
38 | D_IN = 768
39 | MODEL_REVISION="main"
40 |
41 | # We define our Modal Resources that we'll need
42 | volume = Volume.from_name(VOLUME, create_if_missing=True)
43 |
44 | def download_model_to_image(model_dir, model_name, model_revision):
45 | from huggingface_hub import snapshot_download
46 | from transformers.utils import move_cache
47 |
48 | os.makedirs(model_dir, exist_ok=True)
49 |
50 | snapshot_download(
51 | repo_id=model_name,
52 | revision=model_revision,
53 | local_dir=model_dir,
54 | ignore_patterns=["*.pt", "*.bin"], # Using safetensors
55 | )
56 | move_cache()
57 |
58 | st_image = (
59 | Image.debian_slim(python_version="3.10")
60 | .pip_install(
61 | "torch==2.1.2",
62 | "numpy==1.26.3",
63 | "transformers==4.39.3",
64 | "hf-transfer==0.1.6",
65 | "huggingface_hub==0.22.2",
66 | "einops==0.7.0",
67 | "latentsae==0.1.0"
68 | )
69 | .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
70 | .run_function(
71 | download_model_to_image,
72 | timeout=60 * 20,
73 | kwargs={
74 | "model_dir": MODEL_DIR,
75 | "model_name": MODEL_ID,
76 | "model_revision": MODEL_REVISION,
77 | },
78 | secrets=[Secret.from_name("huggingface-secret")],
79 | )
80 | )
81 | app = App(image=st_image) # Note: prior to April 2024, "app" was called "stub"
82 |
83 | with st_image.imports():
84 | import numpy as np
85 | import torch
86 |
87 | @app.cls(
88 | volumes={DATASET_DIR: volume},
89 | timeout=60 * 100,
90 | scaledown_window=60 * 10,
91 | allow_concurrent_inputs=1,
92 | image=st_image,
93 | )
94 | class SAEModel:
95 | @enter()
96 | def start_engine(self):
97 | # import torch
98 | self.device = torch.device("cpu")
99 | print("🥶 cold starting inference")
100 | start = time.monotonic_ns()
101 | self.model = Sae.load_from_hub(MODEL_ID, SAE, device=self.device)
102 | duration_s = (time.monotonic_ns() - start) / 1e9
103 | print(f"🏎️ engine started in {duration_s:.0f}s")
104 |
105 | @method()
106 | def make_features(self, file):
107 | # Redownload the dataset
108 | import time
109 | from datasets import load_dataset
110 | import torch
111 | import pandas as pd
112 | import numpy as np
113 | import time
114 |
115 | start = time.monotonic_ns()
116 | print("loading", file)
117 | # dataset = load_dataset("arrow", data_files=f"{DIRECTORY}/train/{file}")
118 | # # df = pd.read_parquet(f"{DIRECTORY}/train/{file}")
119 | # print("loaded")
120 | # df = pd.DataFrame(dataset['train'])
121 | # print("converted to dataframe")
122 | # embeddings = df['embedding'].to_numpy()
123 | # print("converted to numpy")
124 | # embeddings = np.array([np.array(e).astype(np.float32) for e in embeddings])
125 | duration_s = (time.monotonic_ns() - start) / 1e9
126 | # read the npy memmapped file
127 |         size = os.path.getsize(file) // (D_IN * 4)
128 | embeddings = np.memmap(file,
129 | dtype='float32',
130 | mode='r',
131 | shape=(size, D_IN))
132 | print("loaded", file, "in", duration_s)
133 |
134 | start = time.monotonic_ns()
135 | print("Encoding embeddings with SAE")
136 |
137 | # batch_size = 4096
138 | batch_size = 128
139 | num_batches = (len(embeddings) + batch_size - 1) // batch_size
140 | all_acts = np.zeros((len(embeddings), 64))
141 | all_indices = np.zeros((len(embeddings), 64))
142 | for i in tqdm(range(num_batches), desc="Encoding batches"):
143 | batch_embeddings = embeddings[i * batch_size:(i + 1) * batch_size]
144 | batch_embeddings_tensor = torch.from_numpy(batch_embeddings).float().to(self.device)
145 | batch_features = self.model.encode(batch_embeddings_tensor)
146 | all_acts[i * batch_size:(i + 1) * batch_size] = batch_features.top_acts.detach().cpu().numpy()
147 | all_indices[i * batch_size:(i + 1) * batch_size] = batch_features.top_indices.detach().cpu().numpy()
148 |
149 | duration_s = (time.monotonic_ns() - start) / 1e9
150 | print("encoding completed", duration_s)
151 |
152 | df = pd.DataFrame()
153 | df['top_acts'] = list(all_acts)
154 | df['top_indices'] = list(all_indices)
155 | # # df.drop(columns=['embedding'], inplace=True)
156 | # if 'chunk_tokens' in df.columns:
157 | # df.drop(columns=['chunk_tokens'], inplace=True)
158 | print("features generated for", file)
159 |
160 | file_name = os.path.basename(file).split(".")[0]
161 | output_dir = f"{SAVE_DIRECTORY}/train"
162 | os.makedirs(output_dir, exist_ok=True)
163 | print(f"saving to {output_dir}/{file_name}.parquet")
164 | df.to_parquet(f"{output_dir}/{file_name}.parquet")
165 |
166 | volume.commit()
167 | return f"done with {file}"
168 |
169 | @app.local_entrypoint()
170 | def main():
171 |
172 | # files = files[0:10]
173 |
174 | model = SAEModel()
175 |
176 | for resp in model.make_features.map(FILES, order_outputs=False, return_exceptions=True):
177 | if isinstance(resp, Exception):
178 | print(f"Exception: {resp}")
179 | continue
180 | print(resp)
181 |
182 |
183 |
184 |
--------------------------------------------------------------------------------
/fetch.py:
--------------------------------------------------------------------------------
1 | """
2 | fetch a file from a modal volume and write it locally
3 | """
4 |
5 | from modal import App, Image, Volume
6 |
7 | # We first set our configuration variables for our script.
8 | # DATASET_DIR = "/data"
9 | VOLUME = "embeddings"
10 | DATASET_DIR = "/embeddings"
11 | # DATASET_NAME = "HuggingFaceFW/fineweb-edu"
12 | # DATASET_FILES = "sample/10BT/*.parquet"
13 | # DATASET_SAVE ="fineweb-edu-sample-10BT"
14 | MAX_TOKENS = 500
15 | # DATASET_SAVE = f"fineweb-edu-sample-10BT-chunked-{MAX_TOKENS}-HF4-64_32"
16 | DATASET_SAVE = f"fineweb-edu-sample-10BT-chunked-{MAX_TOKENS}-HF4-64_32-top10"
17 | # DATASET_SAVE = f"fineweb-edu-sample-10BT"
18 | # DIRECTORY = f"{DATASET_DIR}/{DATASET_SAVE}/train"
19 | DIRECTORY = f"{DATASET_DIR}/{DATASET_SAVE}"
20 |
21 | # MODEL_ID = "nomic-ai/nomic-embed-text-v1.5"
22 |
23 | # We define our Modal Resources that we'll need
24 | volume = Volume.from_name(VOLUME, create_if_missing=True)
25 | # volume = Volume.from_name("embeddings", create_if_missing=True)
26 | image = Image.debian_slim(python_version="3.9").pip_install(
27 | "datasets==2.16.1", "apache_beam==2.53.0", "transformers", "pandas", "tqdm"
28 | )
29 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub"
30 |
31 |
32 | @app.function(volumes={DATASET_DIR: volume}, timeout=3000)
33 | def fetch_dataset(file):
34 | import pandas as pd
35 | from datasets import load_dataset
36 | print("loading", file)
37 | # Load the dataset as a Hugging Face dataset
38 | if file.endswith(".parquet"):
39 | df = pd.read_parquet(file)
40 | else:
41 | dataset = load_dataset("arrow", data_files=file)
42 | df = pd.DataFrame(dataset['train'])
43 | print("file loaded, returning", file)
44 | return df
45 |
46 | @app.local_entrypoint()
47 | def main():
48 | import pandas as pd
49 |
50 | # file = "data-00000-of-00099.arrow"
51 | file = "data-00000-of-00099.parquet"
52 | # file = "data-00001-of-00099.parquet"
53 | file_path = f"{DIRECTORY}/{file}"
54 | resp = fetch_dataset.remote(file_path)
55 | if isinstance(resp, Exception):
56 | print(f"Exception: {resp}")
57 | else:
58 | print(resp)
59 | # resp.to_parquet(f"./notebooks/{file}")
60 | resp.to_parquet(f"./notebooks/top10-{file}")
61 |
--------------------------------------------------------------------------------
/filter.py:
--------------------------------------------------------------------------------
1 | from modal import App, Image, Volume, Secret
2 |
3 | DATASET_DIR="/embeddings"
4 | VOLUME = "embeddings"
5 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF2" # converted the original to a dataset
6 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4"
7 | DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-100BT-chunked-500/train"  # note: filter_dataset() below also expects SAVE_DIRECTORY to be defined
8 |
9 | # We define our Modal Resources that we'll need
10 | volume = Volume.from_name(VOLUME, create_if_missing=True)
11 | image = Image.debian_slim(python_version="3.9").pip_install(
12 | "datasets==2.16.1", "apache_beam==2.53.0"
13 | )
14 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub"
15 |
16 |
17 | # The default timeout is 5 minutes re: https://modal.com/docs/guide/timeouts#handling-timeouts
18 | # but we override this to
19 | # 60,000s to avoid any potential timeout issues
20 | @app.function(
21 | volumes={DATASET_DIR: volume},
22 | timeout=60000,
23 | # ephemeral_disk=2145728, # in MiB
24 | )
25 | def filter_dataset():
26 | # Redownload the dataset
27 | import time
28 | from datasets import load_from_disk
29 | print("loading")
30 | dataset = load_from_disk(DIRECTORY)
31 | print("filtering")
32 | filtered = dataset.filter(lambda x: x > 50, input_columns=["chunk_token_count"])
33 | # print("sorting")
34 | # dataset.sort(column_names=["id", "chunk_index"], keep_in_memory=True)
35 | print("saving")
36 | filtered.save_to_disk(SAVE_DIRECTORY, num_shards={"train":99})
37 | print("done!")
38 | volume.commit()
39 |
40 | @app.function(
41 | volumes={DATASET_DIR: volume},
42 | timeout=60000,
43 | # ephemeral_disk=2145728, # in MiB
44 | )
45 | def filter_dataset_file(file):
46 | import pandas as pd
47 | print("loading", file)
48 | df = pd.read_parquet(f"{DIRECTORY}/{file}")
49 | print("filtering", file)
50 | filtered = df[df["chunk_token_count"] > 50]
51 | print("saving", file)
52 | filtered.to_parquet(f"{DIRECTORY}/{file}")
53 | print("done!", file)
54 | volume.commit()
55 | return file
56 |
57 |
58 |
59 |
60 | @app.local_entrypoint()
61 | def main():
62 | # filter_dataset.remote()
63 |
64 | files = [f"data-{i:05d}-of-00989.parquet" for i in range(100)]
65 | files = files[2:]
66 | for resp in filter_dataset_file.map(files, order_outputs=False, return_exceptions=True):
67 | if isinstance(resp, Exception):
68 | print(f"Exception: {resp}")
69 | continue
70 | print(resp)
71 |
72 |
--------------------------------------------------------------------------------
/lancer.py:
--------------------------------------------------------------------------------
1 | """
2 | Combine chunks, embeddings and features into a single LanceDB table.
3 |
4 | This script loops over each corresponding file:
5 | - The chunk parquet produced by chunker.py (e.g. "/data/medrag-pubmed-500/train/data-00000-of-00138.parquet")
6 | - The embedding npy file produced by features.py (e.g. "/embeddings/medrag-pubmed-500-nomic-embed-text-v1.5/train/data-00000-of-00138.npy")
7 | - The features parquet file produced by features.py (e.g. "/embeddings/medrag-pubmed-500-nomic-embed-text-v1.5-64_32/train/data-00000-of-00138.parquet")
8 |
9 | They are then concatenated column-wise, row by row, in their natural order and written to a LanceDB table.
10 |
11 | Usage (from Modal CLI):
12 | modal run combine.py
13 | """
14 |
15 | import os
16 | import time
17 | import numpy as np
18 | import pandas as pd
19 | import lancedb
20 | from modal import App, Image, Volume, enter, method, gpu
21 |
22 | # ============================================================================
23 | # Configuration variables – adjust these to your environment/path names!
24 | # ============================================================================
25 |
26 | # Directories for the input files:
27 | # CHUNK_PARQUET_DIR = "/datasets/medrag-pubmed-500/train"
28 | # EMBEDDING_NPY_DIR = "/embeddings/medrag-pubmed-500-nomic-embed-text-v1.5/train"
29 | # FEATURE_PARQUET_DIR = "/embeddings/medrag-pubmed-500-nomic-embed-text-v1.5-64_32/train"
30 | CHUNK_PARQUET_DIR = "/datasets/wikipedia-en-chunked-500/train"
31 | EMBEDDING_NPY_DIR = "/embeddings/wikipedia-en-chunked-500-nomic-embed-text-v1.5/train"
32 | FEATURE_PARQUET_DIR = "/embeddings/wikipedia-en-chunked-500-nomic-embed-text-v1.5-64_32/train"
33 |
34 |
35 | # Directory (volume) where the LanceDB table will be stored.
36 | # LANCE_DB_DIR = "/lancedb/enjalot/medrag-pubmed"
37 | # LANCE_DB_DIR_INDEXED = "/lancedb/enjalot/medrag-pubmed-indexed"
38 | # TMP_LANCE_DB_DIR = "/tmp/medrag-pubmed"
39 | LANCE_DB_DIR = "/lancedb/enjalot/wikipedia-en-500"
40 | LANCE_DB_DIR_INDEXED = "/lancedb/enjalot/wikipedia-en-500-indexed"
41 | TMP_LANCE_DB_DIR = "/tmp/wikipedia-en-500"
42 |
43 | TABLE_NAME = "500-64_32"
44 |
45 | # TOTAL_FILES = 138 # total number of shards (files)
46 | TOTAL_FILES = 41 # total number of shards (files)
47 | D_EMB = 768 # embedding dimension
48 |
49 | # Volume for the lancedb storage
50 | DATASETS_VOLUME = "datasets"
51 | EMBEDDING_VOLUME = "embeddings"
52 | DB_VOLUME = "lancedb"
53 |
54 | # ============================================================================
55 | # Modal Resources
56 | # ============================================================================
57 |
58 | volume_db = Volume.from_name(DB_VOLUME, create_if_missing=True)
59 | volume_datasets = Volume.from_name(DATASETS_VOLUME, create_if_missing=True)
60 | volume_embeddings = Volume.from_name(EMBEDDING_VOLUME, create_if_missing=True)
61 |
62 | st_image = (
63 | Image.debian_slim(python_version="3.10")
64 | .pip_install(
65 | "pandas", "numpy", "lancedb", "pyarrow", "torch", "tantivy"
66 | )
67 | .env({"RUST_BACKTRACE": "1"})
68 | )
69 |
70 |
71 | app = App(image=st_image)
72 |
73 | # ============================================================================
74 | # Functions to combine shards into a LanceDB table and build its indices
75 | # ============================================================================
76 |
77 | @app.function(volumes={
78 | "/datasets": volume_datasets,
79 | "/embeddings": volume_embeddings,
80 | "/lancedb": volume_db
81 | },
82 | ephemeral_disk=int(1024*1024), # in MiB
83 | image=st_image,
84 | timeout=60*100,
85 | scaledown_window=60*10
86 | )
87 | def combine():
88 | """
89 | Sequentially process each shard by reading the corresponding chunk parquet,
90 | embedding npy, and features parquet files. The data are combined (column-wise)
91 | and then appended to a single lancedb table.
92 | """
93 | db_path = TMP_LANCE_DB_DIR
94 | print(f"Connecting to LanceDB at: {db_path}")
95 | db = lancedb.connect(db_path)
96 |
97 | for i in range(TOTAL_FILES):
98 | base_file = f"data-{i:05d}-of-{TOTAL_FILES:05d}"
99 | chunk_file = os.path.join(CHUNK_PARQUET_DIR, f"{base_file}.parquet")
100 | embedding_file = os.path.join(EMBEDDING_NPY_DIR, f"{base_file}.npy")
101 | feature_file = os.path.join(FEATURE_PARQUET_DIR, f"{base_file}.parquet")
102 |
103 | print(f"\nProcessing shard: {base_file}")
104 | start_time = time.monotonic()
105 |
106 | # Load the chunk parquet file.
107 | try:
108 | chunk_df = pd.read_parquet(chunk_file)
109 | except Exception as e:
110 | print(f"Error reading chunk file {chunk_file}: {e}")
111 | break
112 |
113 | # Load the embeddings npy file.
114 | try:
115 |             size = os.path.getsize(embedding_file) // (D_EMB * 4)  # rows = file size / (dims * 4 bytes per float32)
116 | embedding_np = np.memmap(embedding_file,
117 | dtype='float32',
118 | mode='r',
119 | shape=(size, D_EMB))
120 | except Exception as e:
121 | print(f"Error reading embedding file {embedding_file}: {e}")
122 | break
123 |
124 | # Load the features parquet file.
125 | try:
126 | feature_df = pd.read_parquet(feature_file)
127 | feature_df = feature_df.rename(columns={
128 | 'top_indices': 'sae_indices',
129 | 'top_acts': 'sae_acts'
130 | })
131 | # Convert sae_indices from float to int for each row
132 | feature_df['sae_indices'] = feature_df['sae_indices'].apply(lambda x: [int(i) for i in x])
133 | except Exception as e:
134 | print(f"Error reading feature file {feature_file}: {e}")
135 | break
136 |
137 | # Validate that the three sources have the same number of rows.
138 | n_chunk = len(chunk_df)
139 | n_embedding = embedding_np.shape[0]
140 | n_feature = len(feature_df)
141 | if not (n_chunk == n_embedding == n_feature):
142 | print(f"Row count mismatch in {base_file}: chunk {n_chunk}, embedding {n_embedding}, feature {n_feature}")
143 | break
144 |
145 | # Store the embedding data as a list column. (Alternatively, you could split the embedding vector into columns.)
146 |
147 | vector_column = list(embedding_np)
148 |
149 |         # Combine the dataframes (resetting indices to ensure correct alignment).
150 | combined_df = pd.concat(
151 | [chunk_df.reset_index(drop=True),
152 | feature_df.reset_index(drop=True)],
153 | axis=1,
154 | )
155 | combined_df["vector"] = vector_column
156 | combined_df["shard"] = i
157 |
158 | if i == 0:
159 | msg = f"Creating LanceDB table '{TABLE_NAME}' at {db_path} with {len(combined_df)} rows."
160 | print(msg)
161 | table = db.create_table(TABLE_NAME, combined_df)
162 | else:
163 | msg = f"Adding shard {base_file} to LanceDB table '{TABLE_NAME}' at {db_path} with {len(combined_df)} rows."
164 | print(msg)
165 | table.add(combined_df)
166 | # if i == 2:
167 | # break
168 |
169 | duration = time.monotonic() - start_time
170 | print(f"Shard {base_file} processed in {duration:.2f} seconds; {n_chunk} rows")
171 |
172 |
173 | print(f"Copying LanceDB to {LANCE_DB_DIR}")
174 | # copy the tmp lancedb directory to the volume
175 | import shutil
176 | shutil.copytree(TMP_LANCE_DB_DIR, LANCE_DB_DIR)
177 |     print("Done!")
178 |
179 |
180 | @app.function(volumes={
181 | "/datasets": volume_datasets,
182 | "/embeddings": volume_embeddings,
183 | "/lancedb": volume_db
184 | },
185 | gpu="A10G",
186 | ephemeral_disk=int(1024*1024), # in MiB
187 | image=st_image,
188 | timeout=60*100,
189 | scaledown_window=60*10
190 | )
191 | def create_indices():
192 | import lancedb
193 | import shutil
194 | start_time = time.monotonic()
195 | print(f"Copying table {LANCE_DB_DIR} to {TMP_LANCE_DB_DIR}")
196 | shutil.copytree(LANCE_DB_DIR, TMP_LANCE_DB_DIR)
197 | duration = time.monotonic() - start_time
198 | print(f"Copying table {LANCE_DB_DIR} to {TMP_LANCE_DB_DIR} took {duration:.2f} seconds")
199 |
200 | db = lancedb.connect(TMP_LANCE_DB_DIR)
201 | table = db.open_table(TABLE_NAME)
202 |
203 | # start_time = time.monotonic()
204 | # print(f"Creating index for sae_indices on table '{TABLE_NAME}'")
205 | # table.create_scalar_index("sae_indices", index_type="LABEL_LIST")
206 | # duration = time.monotonic() - start_time
207 | # print(f"Creating index for sae_indices on table '{TABLE_NAME}' took {duration:.2f} seconds")
208 |
209 | start_time = time.monotonic()
210 | print(f"Creating FTS index for title on table '{TABLE_NAME}'")
211 | table.create_fts_index("title")
212 | duration = time.monotonic() - start_time
213 | print(f"Creating FTS index for title on table '{TABLE_NAME}' took {duration:.2f} seconds")
214 |
215 | start_time = time.monotonic()
216 | print(f"Creating ANN index for embeddings on table '{TABLE_NAME}'")
217 |     partitions = int(table.count_rows() ** 0.5) * 2  # IVF partitions: roughly 2 * sqrt(row count)
218 |     sub_vectors = D_EMB // 16  # PQ sub-vectors: 768 / 16 = 48
219 | metric = "cosine"
220 | print(f"Partitioning into {partitions} partitions, {sub_vectors} sub-vectors")
221 | table.create_index(
222 | num_partitions=partitions,
223 | num_sub_vectors=sub_vectors,
224 | metric=metric,
225 | accelerator="cuda"
226 | )
227 | duration = time.monotonic() - start_time
228 | print(f"Creating ANN index for embeddings on table '{TABLE_NAME}' took {duration:.2f} seconds")
229 |
230 | # print(f"Deleting existing {LANCE_DB_DIR}")
231 | # shutil.rmtree(LANCE_DB_DIR, ignore_errors=True)
232 | start_time = time.monotonic()
233 | print(f"Copying table {TABLE_NAME} to {LANCE_DB_DIR_INDEXED}")
234 | shutil.copytree(TMP_LANCE_DB_DIR, LANCE_DB_DIR_INDEXED, dirs_exist_ok=True)
235 | duration = time.monotonic() - start_time
236 | print(f"Copying table {TMP_LANCE_DB_DIR} to {LANCE_DB_DIR_INDEXED} took {duration:.2f} seconds")
237 |
238 | # ============================================================================
239 | # Modal Local Entrypoint
240 | # ============================================================================
241 |
242 | @app.local_entrypoint()
243 | def main():
244 | # Combine all shards and write to LanceDB.
245 | # combine.remote()
246 | # print("done with combine, creating indices")
247 | create_indices.remote()
--------------------------------------------------------------------------------
/notebooks/perfile.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [
8 | {
9 | "name": "stderr",
10 | "output_type": "stream",
11 | "text": [
12 | "None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.\n"
13 | ]
14 | }
15 | ],
16 | "source": [
17 | "import time\n",
18 | "# import tqdm\n",
19 | "from tqdm.notebook import tqdm # Import the notebook version of tqdm\n",
20 | "\n",
21 | "from datasets import load_dataset\n",
22 | "import pandas as pd\n",
23 | "import numpy as np\n",
24 | "import huggingface_hub\n",
25 | "from huggingface_hub import HfFileSystem\n",
26 | "hffs = HfFileSystem()\n",
27 | "from concurrent.futures import ThreadPoolExecutor, as_completed\n",
28 | "\n",
29 | "import transformers\n",
30 | "transformers.logging.set_verbosity_error()\n",
31 | "from transformers import AutoTokenizer\n"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 2,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "dataset = load_dataset(\"HuggingFaceFW/fineweb-edu\", data_files=\"sample/10BT/*.parquet\", streaming=True, split=\"train\")"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 3,
46 | "metadata": {},
47 | "outputs": [],
48 | "source": [
49 | "files = hffs.ls(\"datasets/HuggingFaceFW/fineweb-edu/sample/10BT\", detail=False)"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 5,
55 | "metadata": {},
56 | "outputs": [
57 | {
58 | "data": {
59 | "text/plain": [
60 | "['datasets/HuggingFaceFW/fineweb-edu/sample/10BT/000_00000.parquet',\n",
61 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/001_00000.parquet',\n",
62 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/002_00000.parquet',\n",
63 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/003_00000.parquet',\n",
64 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/004_00000.parquet',\n",
65 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/005_00000.parquet',\n",
66 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/006_00000.parquet',\n",
67 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/007_00000.parquet',\n",
68 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/008_00000.parquet',\n",
69 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/009_00000.parquet',\n",
70 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/010_00000.parquet',\n",
71 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/011_00000.parquet',\n",
72 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/012_00000.parquet',\n",
73 | " 'datasets/HuggingFaceFW/fineweb-edu/sample/10BT/013_00000.parquet']"
74 | ]
75 | },
76 | "execution_count": 5,
77 | "metadata": {},
78 | "output_type": "execute_result"
79 | }
80 | ],
81 | "source": [
82 | "files"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": 4,
88 | "metadata": {},
89 | "outputs": [],
90 | "source": [
91 | "file = files[0]"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": null,
97 | "metadata": {},
98 | "outputs": [],
99 | "source": [
100 | "# df = pd.read_parquet(\"hf://\" + files[0])\n",
101 | "df = pd.read_parquet(file.split(\"/\")[-1])"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": null,
107 | "metadata": {},
108 | "outputs": [],
109 | "source": [
110 | "df.head()"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {},
117 | "outputs": [],
118 | "source": [
119 | "# df.to_parquet(files[0].split(\"/\")[-1])"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {},
126 | "outputs": [],
127 | "source": [
128 | "MAX_TOKENS = 512"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": null,
134 | "metadata": {},
135 | "outputs": [],
136 | "source": [
137 | "# keep_keys = [\"id\", \"url\", \"score\", \"dump\"]"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": null,
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "# tokenizer = AutoTokenizer.from_pretrained(\"bert-base-uncased\", model_max_length=MAX_TOKENS)"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": null,
152 | "metadata": {},
153 | "outputs": [],
154 | "source": [
155 | "# def chunk(rows):\n",
156 | "# texts = rows[\"text\"]\n",
157 | "# chunks_index = []\n",
158 | "# chunks_text = []\n",
159 | "# chunks_tokens = []\n",
160 | "# updated_token_counts = []\n",
161 | "\n",
162 | "# # Assuming you have other properties in the rows that you want to retain\n",
163 | "# keep = {key: [] for key in keep_keys}\n",
164 | "\n",
165 | "# for index, text in enumerate(texts):\n",
166 | "# tokens = tokenizer.encode(text)\n",
167 | "# token_count = len(tokens)\n",
168 | "\n",
169 | "# if token_count > MAX_TOKENS:\n",
170 | "# overlap = int(MAX_TOKENS * 0.1)\n",
171 | "# start_index = 0\n",
172 | "# ci = 0\n",
173 | "# while start_index < len(tokens):\n",
174 | "# end_index = min(start_index + MAX_TOKENS, len(tokens))\n",
175 | "# chunk = tokens[start_index:end_index]\n",
176 | "# chunks_index.append(ci)\n",
177 | "# chunks_tokens.append(chunk)\n",
178 | "# updated_token_counts.append(len(chunk))\n",
179 | "# chunks_text.append(tokenizer.decode(chunk))\n",
180 | "# # Copy other properties for each chunk\n",
181 | "# for key in keep:\n",
182 | "# keep[key].append(rows[key][index])\n",
183 | "# start_index += MAX_TOKENS - overlap\n",
184 | "# ci += 1\n",
185 | "# else:\n",
186 | "# chunks_index.append(0)\n",
187 | "# chunks_text.append(text)\n",
188 | "# chunks_tokens.append(tokens)\n",
189 | "# updated_token_counts.append(token_count)\n",
190 | "# # Copy other properties for non-chunked texts\n",
191 | "# for key in keep:\n",
192 | "# keep[key].append(rows[key][index])\n",
193 | "\n",
194 | "# keep[\"chunk_index\"] = chunks_index\n",
195 | "# keep[\"chunk_text\"] = chunks_text\n",
196 | "# keep[\"chunk_tokens\"] = chunks_tokens\n",
197 | "# keep[\"chunk_token_count\"] = updated_token_counts\n",
198 | "# return keep\n"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "def chunk_row(row, tokenizer):\n",
208 | " # print(\"ROW\", row)\n",
209 | " MAX_TOKENS = 512\n",
210 | " keep_keys = [\"id\", \"url\", \"score\", \"dump\"]\n",
211 | " text = row[\"text\"]\n",
212 | " chunks = []\n",
213 | "\n",
214 | " tokens = tokenizer.encode(text)\n",
215 | " token_count = len(tokens)\n",
216 | " if token_count > MAX_TOKENS:\n",
217 | " overlap = int(MAX_TOKENS * 0.1)\n",
218 | " start_index = 0\n",
219 | " ci = 0\n",
220 | " while start_index < len(tokens):\n",
221 | " end_index = min(start_index + MAX_TOKENS, len(tokens))\n",
222 | " chunk = tokens[start_index:end_index]\n",
223 | " chunks.append({\n",
224 | " \"chunk_index\": ci,\n",
225 | " \"chunk_text\": tokenizer.decode(chunk),\n",
226 | " \"chunk_tokens\": chunk,\n",
227 | " \"chunk_token_count\": len(chunk),\n",
228 | " **{key: row[key] for key in keep_keys}\n",
229 | " })\n",
230 | " start_index += MAX_TOKENS - overlap\n",
231 | " ci += 1\n",
232 | " else:\n",
233 | " chunks.append({\n",
234 | " \"chunk_index\": 0,\n",
235 | " \"chunk_text\": text,\n",
236 | " \"chunk_tokens\": tokens,\n",
237 | " \"chunk_token_count\": token_count,\n",
238 | " **{key: row[key] for key in keep_keys}\n",
239 | " })\n",
240 | "\n",
241 | " return chunks\n"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": null,
247 | "metadata": {},
248 | "outputs": [],
249 | "source": [
250 | "def process_dataframe(df):\n",
251 | " chunks_list = []\n",
252 | " with ThreadPoolExecutor(max_workers=16) as executor:\n",
253 | " # Submit all rows to the executor\n",
254 | " pbar = tqdm(total=len(df), desc=\"Processing Rows\")\n",
255 | " \n",
256 | " def process_batch(batch):\n",
257 | " \n",
258 | " tokenizer = AutoTokenizer.from_pretrained(\"bert-base-uncased\", model_max_length=MAX_TOKENS)\n",
259 | " batch_chunks = []\n",
260 | " for row in batch:\n",
261 | " row_chunks = chunk_row(row, tokenizer)\n",
262 | " pbar.update(1)\n",
263 | " batch_chunks.extend(row_chunks)\n",
264 | " return batch_chunks\n",
265 | "\n",
266 | "\n",
267 | " print(\"making batches\")\n",
268 | " batch_size = 200 # Adjust batch size based on your needs\n",
269 | " batches = [df.iloc[i:i + batch_size].to_dict(orient=\"records\") for i in range(0, len(df), batch_size)]\n",
270 | " print(\"made batches\")\n",
271 | " print(\"setting up futures\")\n",
272 | " futures = [executor.submit(process_batch, batch) for batch in batches]\n",
273 | " # futures = [executor.submit(chunk_row, row) for index, row in df.iterrows()]\n",
274 | " # for future in tqdm(as_completed(futures), total=len(df), desc=\"Processing Rows\"):\n",
275 | " # chunks_list.extend(future.result())\n",
276 | " print(\"in the future\")\n",
277 | " # pbar = tqdm(total=len(df)//batch_size, desc=\"Processing Rows\")\n",
278 | " for future in as_completed(futures):\n",
279 | " chunks_list.extend(future.result())\n",
280 | " # print(len(chunks_list))\n",
281 | " # pbar.update(1) # Manually update the progress bar\n",
282 | " pbar.close()\n",
283 | " return chunks_list"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": null,
289 | "metadata": {},
290 | "outputs": [],
291 | "source": [
292 | "# Process the DataFrame and create a new DataFrame from the list of chunks\n",
293 | "start = time.perf_counter()\n",
294 | "print(f\"Chunking text that is longer than {MAX_TOKENS} tokens\")\n",
295 | "chunked_data = process_dataframe(df)\n",
296 | "print(f\"Dataset chunked in {time.perf_counter() - start:.2f} seconds\")\n",
297 | "start = time.perf_counter()\n",
298 | "chunked_df = pd.DataFrame(chunked_data)\n",
299 | "print(f\"Dataset converted to DataFrame in {time.perf_counter() - start:.2f} seconds\")\n"
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": null,
305 | "metadata": {},
306 | "outputs": [],
307 | "source": [
308 | "# chunked_df.to_parquet(\"chunked-\" + file.split(\"/\")[-1])"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": null,
314 | "metadata": {},
315 | "outputs": [],
316 | "source": [
317 | "len(chunked_df)"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": null,
323 | "metadata": {},
324 | "outputs": [],
325 | "source": [
326 | "chunked_df.head()"
327 | ]
328 | },
329 | {
330 | "cell_type": "code",
331 | "execution_count": null,
332 | "metadata": {},
333 | "outputs": [],
334 | "source": []
335 | }
336 | ],
337 | "metadata": {
338 | "kernelspec": {
339 | "display_name": "modalenv",
340 | "language": "python",
341 | "name": "python3"
342 | },
343 | "language_info": {
344 | "codemirror_mode": {
345 | "name": "ipython",
346 | "version": 3
347 | },
348 | "file_extension": ".py",
349 | "mimetype": "text/x-python",
350 | "name": "python",
351 | "nbconvert_exporter": "python",
352 | "pygments_lexer": "ipython3",
353 | "version": "3.11.6"
354 | }
355 | },
356 | "nbformat": 4,
357 | "nbformat_minor": 2
358 | }
359 |
--------------------------------------------------------------------------------
/notebooks/small_sample.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 2,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "from datasets import load_dataset\n",
10 | "import pandas as pd\n",
11 | "import numpy as np\n",
12 | "\n"
13 | ]
14 | },
15 | {
16 | "cell_type": "code",
17 | "execution_count": 8,
18 | "metadata": {},
19 | "outputs": [],
20 | "source": [
21 | "dataset = load_dataset(\"HuggingFaceFW/fineweb-edu\", data_files=\"sample/10BT/*.parquet\", streaming=True, split=\"train\")\n",
22 | "\n"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 9,
28 | "metadata": {},
29 | "outputs": [],
30 | "source": [
31 | "dataset_head = dataset.take(10000)"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 10,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "df10k = pd.DataFrame(list(dataset_head))"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 11,
46 | "metadata": {},
47 | "outputs": [
48 | {
49 | "data": {
151 | "text/plain": [
152 | " text \\\n",
153 | "0 The Independent Jane\\nFor all the love, romanc... \n",
154 | "1 Taking Play Seriously\\nBy ROBIN MARANTZ HENIG\\... \n",
155 | "2 How do you get HIV?\\nHIV can be passed on when... \n",
156 | "3 CTComms sends on average 2 million emails mont... \n",
157 | "4 Hold the salt: UCLA engineers develop revoluti... \n",
158 | "\n",
159 | " id dump \\\n",
160 | "0 CC-MAIN-2013-20 \n",
161 | "1 CC-MAIN-2013-20 \n",
162 | "2 CC-MAIN-2013-20 \n",
163 | "3 CC-MAIN-2013-20 \n",
164 | "4 CC-MAIN-2013-20 \n",
165 | "\n",
166 | " url \\\n",
167 | "0 http://austenauthors.net/the-independent-jane \n",
168 | "1 http://query.nytimes.com/gst/fullpage.html?res... \n",
169 | "2 http://www.childline.org.uk/Explore/SexRelatio... \n",
170 | "3 http://www.ctt.org/resource_centre/getting_sta... \n",
171 | "4 http://www.environment.ucla.edu/water/news/art... \n",
172 | "\n",
173 | " file_path language language_score \\\n",
174 | "0 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.974320 \n",
175 | "1 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.961459 \n",
176 | "2 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.966757 \n",
177 | "3 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.910602 \n",
178 | "4 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.924981 \n",
179 | "\n",
180 | " token_count score int_score \n",
181 | "0 845 2.750000 3 \n",
182 | "1 1055 2.562500 3 \n",
183 | "2 136 3.125000 3 \n",
184 | "3 3479 3.234375 3 \n",
185 | "4 1115 2.812500 3 "
186 | ]
187 | },
188 | "execution_count": 11,
189 | "metadata": {},
190 | "output_type": "execute_result"
191 | }
192 | ],
193 | "source": [
194 | "df10k.head()"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 12,
200 | "metadata": {},
201 | "outputs": [],
202 | "source": [
203 | "import latentscope as ls"
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": 13,
209 | "metadata": {},
210 | "outputs": [
211 | {
212 | "name": "stdout",
213 | "output_type": "stream",
214 | "text": [
215 | "Initialized env with data directory at /Users/enjalot/latent-scope-data\n"
216 | ]
217 | }
218 | ],
219 | "source": [
220 | "ls.init(\"~/latent-scope-data\")"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 14,
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "name": "stdout",
230 | "output_type": "stream",
231 | "text": [
232 | "Loading environment variables from: /Users/enjalot/code/latent-testing/notebooks/.env\n",
233 | "DATA DIR /Users/enjalot/latent-scope-data\n",
234 | "DIRECTORY /Users/enjalot/latent-scope-data/fineweb-edu-10k\n",
235 | " text \\\n",
236 | "0 The Independent Jane\\nFor all the love, romanc... \n",
237 | "1 Taking Play Seriously\\nBy ROBIN MARANTZ HENIG\\... \n",
238 | "2 How do you get HIV?\\nHIV can be passed on when... \n",
239 | "3 CTComms sends on average 2 million emails mont... \n",
240 | "4 Hold the salt: UCLA engineers develop revoluti... \n",
241 | "\n",
242 | " id dump \\\n",
243 | "0 CC-MAIN-2013-20 \n",
244 | "1 CC-MAIN-2013-20 \n",
245 | "2 CC-MAIN-2013-20 \n",
246 | "3 CC-MAIN-2013-20 \n",
247 | "4 CC-MAIN-2013-20 \n",
248 | "\n",
249 | " url \\\n",
250 | "0 http://austenauthors.net/the-independent-jane \n",
251 | "1 http://query.nytimes.com/gst/fullpage.html?res... \n",
252 | "2 http://www.childline.org.uk/Explore/SexRelatio... \n",
253 | "3 http://www.ctt.org/resource_centre/getting_sta... \n",
254 | "4 http://www.environment.ucla.edu/water/news/art... \n",
255 | "\n",
256 | " file_path language language_score \\\n",
257 | "0 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.974320 \n",
258 | "1 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.961459 \n",
259 | "2 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.966757 \n",
260 | "3 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.910602 \n",
261 | "4 s3://commoncrawl/crawl-data/CC-MAIN-2013-20/se... en 0.924981 \n",
262 | "\n",
263 | " token_count score int_score \n",
264 | "0 845 2.750000 3 \n",
265 | "1 1055 2.562500 3 \n",
266 | "2 136 3.125000 3 \n",
267 | "3 3479 3.234375 3 \n",
268 | "4 1115 2.812500 3 \n",
269 | " text \\\n",
270 | "9995 Here we have the inspiration for the movie tre... \n",
271 | "9996 Love and Logic Resource KitLove and Logic is a... \n",
272 | "9997 In the event of fire, people need to know exac... \n",
273 | "9998 It may be a small comfort to those planning th... \n",
274 | "9999 A 13-year-old middle school student is working... \n",
275 | "\n",
276 | " id dump \\\n",
277 | "9995 CC-MAIN-2017-26 \n",
278 | "9996 CC-MAIN-2017-26 \n",
279 | "9997 CC-MAIN-2017-26 \n",
280 | "9998 CC-MAIN-2017-26 \n",
281 | "9999 CC-MAIN-2017-26 \n",
282 | "\n",
283 | " url \\\n",
284 | "9995 https://www.hamahamaoysters.com/blogs/learn/18... \n",
285 | "9996 http://holly.rpes.schoolfusion.us/modules/cms/... \n",
286 | "9997 http://churchsafety.org.uk/information/fire/f_... \n",
287 | "9998 http://insideindustrynews.com/curiosity-gives-... \n",
288 | "9999 http://juneauempire.com/stories/120505/loc_200... \n",
289 | "\n",
290 | " file_path language \\\n",
291 | "9995 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n",
292 | "9996 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n",
293 | "9997 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n",
294 | "9998 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n",
295 | "9999 s3://commoncrawl/crawl-data/CC-MAIN-2017-26/se... en \n",
296 | "\n",
297 | " language_score token_count score int_score \n",
298 | "9995 0.961133 368 2.875000 3 \n",
299 | "9996 0.895080 249 2.828125 3 \n",
300 | "9997 0.960923 1081 3.171875 3 \n",
301 | "9998 0.938971 141 2.968750 3 \n",
302 | "9999 0.981334 1131 2.859375 3 \n",
303 | "Index(['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score',\n",
304 | " 'token_count', 'score', 'int_score'],\n",
305 | " dtype='object')\n",
306 | "wrote /Users/enjalot/latent-scope-data/fineweb-edu-10k/input.parquet\n"
307 | ]
308 | }
309 | ],
310 | "source": [
311 | "ls.ingest(\"fineweb-edu-10k\", df10k, \"text\")\n",
312 | "\n"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": 17,
318 | "metadata": {},
319 | "outputs": [],
320 | "source": [
321 | "dataset100k = dataset.remove_columns([\"url\", \"file_path\", \"language_score\"])\n",
322 | "dataset_head100k = dataset100k.take(100000)\n"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": 18,
328 | "metadata": {},
329 | "outputs": [],
330 | "source": [
331 | "df100k = pd.DataFrame(list(dataset_head100k))"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": 19,
337 | "metadata": {},
338 | "outputs": [
339 | {
340 | "name": "stdout",
341 | "output_type": "stream",
342 | "text": [
343 | "Loading environment variables from: /Users/enjalot/code/latent-testing/notebooks/.env\n",
344 | "DATA DIR /Users/enjalot/latent-scope-data\n",
345 | "DIRECTORY /Users/enjalot/latent-scope-data/fineweb-edu-100k\n",
346 | " text \\\n",
347 | "0 The Independent Jane\\nFor all the love, romanc... \n",
348 | "1 Taking Play Seriously\\nBy ROBIN MARANTZ HENIG\\... \n",
349 | "2 How do you get HIV?\\nHIV can be passed on when... \n",
350 | "3 CTComms sends on average 2 million emails mont... \n",
351 | "4 Hold the salt: UCLA engineers develop revoluti... \n",
352 | "\n",
353 | " id dump language \\\n",
354 | "0 CC-MAIN-2013-20 en \n",
355 | "1 CC-MAIN-2013-20 en \n",
356 | "2 CC-MAIN-2013-20 en \n",
357 | "3 CC-MAIN-2013-20 en \n",
358 | "4 CC-MAIN-2013-20 en \n",
359 | "\n",
360 | " token_count score int_score \n",
361 | "0 845 2.750000 3 \n",
362 | "1 1055 2.562500 3 \n",
363 | "2 136 3.125000 3 \n",
364 | "3 3479 3.234375 3 \n",
365 | "4 1115 2.812500 3 \n",
366 | " text \\\n",
367 | "99995 Avoid the extreme, but beware of household can... \n",
368 | "99996 The Gospel of Luke is the third of the four ca... \n",
369 | "99997 It's is short for it is or it has.\\nIts is the... \n",
370 | "99998 As more and more users gain access to the web,... \n",
371 | "99999 Equipping students to successfully navigate th... \n",
372 | "\n",
373 | " id dump \\\n",
374 | "99995 CC-MAIN-2013-20 \n",
375 | "99996 CC-MAIN-2013-20 \n",
376 | "99997 CC-MAIN-2013-20 \n",
377 | "99998 CC-MAIN-2013-20 \n",
378 | "99999 CC-MAIN-2013-20 \n",
379 | "\n",
380 | " language token_count score int_score \n",
381 | "99995 en 377 2.531250 3 \n",
382 | "99996 en 1755 3.125000 3 \n",
383 | "99997 en 573 2.828125 3 \n",
384 | "99998 en 648 2.750000 3 \n",
385 | "99999 en 1053 3.578125 4 \n",
386 | "Index(['text', 'id', 'dump', 'language', 'token_count', 'score', 'int_score'], dtype='object')\n",
387 | "wrote /Users/enjalot/latent-scope-data/fineweb-edu-100k/input.parquet\n"
388 | ]
389 | }
390 | ],
391 | "source": [
392 | "ls.ingest(\"fineweb-edu-100k\", df100k, \"text\")"
393 | ]
394 | },
395 | {
396 | "cell_type": "code",
397 | "execution_count": 1,
398 | "metadata": {},
399 | "outputs": [],
400 | "source": [
401 | "broken_ids = [] # can put some ids here to check"
402 | ]
403 | },
404 | {
405 | "cell_type": "code",
406 | "execution_count": 21,
407 | "metadata": {},
408 | "outputs": [],
409 | "source": [
410 | "filtered_df100k = df100k[df100k['id'].isin(broken_ids)]\n"
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": 22,
416 | "metadata": {},
417 | "outputs": [
418 | {
419 | "data": {
420 | "text/plain": [
421 | "(512, 7)"
422 | ]
423 | },
424 | "execution_count": 22,
425 | "metadata": {},
426 | "output_type": "execute_result"
427 | }
428 | ],
429 | "source": [
430 | "filtered_df100k.shape"
431 | ]
432 | },
433 | {
434 | "cell_type": "code",
435 | "execution_count": 23,
436 | "metadata": {},
437 | "outputs": [
438 | {
439 | "data": {
523 | "text/plain": [
524 | " text \\\n",
525 | "52736 The two-button remote control is a very versat... \n",
526 | "52737 for National Geographic News\\nA new population... \n",
527 | "52738 The right to access the various documents of t... \n",
528 | "52739 Product Type: Open-file Report\\nAuthor(s): Ell... \n",
529 | "52740 BOISE, Idaho – An invasive insect commonly fou... \n",
530 | "\n",
531 | " id dump \\\n",
532 | "52736 CC-MAIN-2013-20 \n",
533 | "52737 CC-MAIN-2013-20 \n",
534 | "52738 CC-MAIN-2013-20 \n",
535 | "52739 CC-MAIN-2013-20 \n",
536 | "52740 CC-MAIN-2013-20 \n",
537 | "\n",
538 | " language token_count score int_score \n",
539 | "52736 en 222 3.000000 3 \n",
540 | "52737 en 422 3.875000 4 \n",
541 | "52738 en 1959 3.671875 4 \n",
542 | "52739 en 530 2.718750 3 \n",
543 | "52740 en 382 2.656250 3 "
544 | ]
545 | },
546 | "execution_count": 23,
547 | "metadata": {},
548 | "output_type": "execute_result"
549 | }
550 | ],
551 | "source": [
552 | "filtered_df100k.head()"
553 | ]
554 | },
555 | {
556 | "cell_type": "code",
557 | "execution_count": 24,
558 | "metadata": {},
559 | "outputs": [
560 | {
561 | "data": {
562 | "text/plain": [
563 | "[222,\n",
564 | " 422,\n",
565 | " 1959,\n",
566 | " 530,\n",
567 | " 382,\n",
568 | " 11986,\n",
569 | " 129,\n",
570 | " 652,\n",
571 | " 329,\n",
572 | " 4472,\n",
573 | " 1046,\n",
574 | " 453,\n",
575 | " 212,\n",
576 | " 473,\n",
577 | " 1503,\n",
578 | " 356,\n",
579 | " 307,\n",
580 | " 245,\n",
581 | " 420,\n",
582 | " 761,\n",
583 | " 392,\n",
584 | " 1327,\n",
585 | " 284,\n",
586 | " 2369,\n",
587 | " 170,\n",
588 | " 198,\n",
589 | " 1128,\n",
590 | " 592,\n",
591 | " 488,\n",
592 | " 267,\n",
593 | " 1440,\n",
594 | " 496,\n",
595 | " 373,\n",
596 | " 2140,\n",
597 | " 844,\n",
598 | " 250,\n",
599 | " 229,\n",
600 | " 597,\n",
601 | " 858,\n",
602 | " 219,\n",
603 | " 381,\n",
604 | " 787,\n",
605 | " 784,\n",
606 | " 811,\n",
607 | " 124,\n",
608 | " 251,\n",
609 | " 493,\n",
610 | " 257,\n",
611 | " 313,\n",
612 | " 619,\n",
613 | " 593,\n",
614 | " 528,\n",
615 | " 581,\n",
616 | " 707,\n",
617 | " 192,\n",
618 | " 755,\n",
619 | " 207,\n",
620 | " 885,\n",
621 | " 187,\n",
622 | " 1141,\n",
623 | " 1089,\n",
624 | " 975,\n",
625 | " 630,\n",
626 | " 306,\n",
627 | " 767,\n",
628 | " 353,\n",
629 | " 143,\n",
630 | " 774,\n",
631 | " 465,\n",
632 | " 870,\n",
633 | " 9691,\n",
634 | " 393,\n",
635 | " 429,\n",
636 | " 541,\n",
637 | " 671,\n",
638 | " 219,\n",
639 | " 599,\n",
640 | " 682,\n",
641 | " 561,\n",
642 | " 704,\n",
643 | " 788,\n",
644 | " 374,\n",
645 | " 334,\n",
646 | " 398,\n",
647 | " 348,\n",
648 | " 693,\n",
649 | " 611,\n",
650 | " 274,\n",
651 | " 753,\n",
652 | " 1326,\n",
653 | " 521,\n",
654 | " 1686,\n",
655 | " 747,\n",
656 | " 470,\n",
657 | " 332,\n",
658 | " 2011,\n",
659 | " 727,\n",
660 | " 23407,\n",
661 | " 464,\n",
662 | " 175,\n",
663 | " 751,\n",
664 | " 428,\n",
665 | " 148,\n",
666 | " 425,\n",
667 | " 200,\n",
668 | " 283,\n",
669 | " 642,\n",
670 | " 700,\n",
671 | " 771,\n",
672 | " 859,\n",
673 | " 547,\n",
674 | " 230,\n",
675 | " 1425,\n",
676 | " 1212,\n",
677 | " 680,\n",
678 | " 863,\n",
679 | " 108,\n",
680 | " 345,\n",
681 | " 187,\n",
682 | " 363,\n",
683 | " 2336,\n",
684 | " 3878,\n",
685 | " 631,\n",
686 | " 281,\n",
687 | " 256,\n",
688 | " 1811,\n",
689 | " 438,\n",
690 | " 1122,\n",
691 | " 1205,\n",
692 | " 3044,\n",
693 | " 978,\n",
694 | " 1199,\n",
695 | " 2367,\n",
696 | " 1791,\n",
697 | " 832,\n",
698 | " 608,\n",
699 | " 774,\n",
700 | " 456,\n",
701 | " 275,\n",
702 | " 569,\n",
703 | " 1537,\n",
704 | " 5759,\n",
705 | " 889,\n",
706 | " 317,\n",
707 | " 248,\n",
708 | " 360,\n",
709 | " 3122,\n",
710 | " 1723,\n",
711 | " 429,\n",
712 | " 920,\n",
713 | " 747,\n",
714 | " 271,\n",
715 | " 851,\n",
716 | " 2007,\n",
717 | " 161,\n",
718 | " 1054,\n",
719 | " 484,\n",
720 | " 936,\n",
721 | " 700,\n",
722 | " 257,\n",
723 | " 1191,\n",
724 | " 218,\n",
725 | " 443,\n",
726 | " 866,\n",
727 | " 717,\n",
728 | " 348,\n",
729 | " 1402,\n",
730 | " 467,\n",
731 | " 2245,\n",
732 | " 122,\n",
733 | " 812,\n",
734 | " 670,\n",
735 | " 413,\n",
736 | " 1831,\n",
737 | " 2151,\n",
738 | " 367,\n",
739 | " 537,\n",
740 | " 983,\n",
741 | " 348,\n",
742 | " 3545,\n",
743 | " 887,\n",
744 | " 184,\n",
745 | " 204,\n",
746 | " 980,\n",
747 | " 227,\n",
748 | " 798,\n",
749 | " 408,\n",
750 | " 374,\n",
751 | " 243,\n",
752 | " 1821,\n",
753 | " 249,\n",
754 | " 432,\n",
755 | " 560,\n",
756 | " 334,\n",
757 | " 1389,\n",
758 | " 890,\n",
759 | " 346,\n",
760 | " 524,\n",
761 | " 313,\n",
762 | " 528,\n",
763 | " 154,\n",
764 | " 261,\n",
765 | " 1890,\n",
766 | " 471,\n",
767 | " 3951,\n",
768 | " 461,\n",
769 | " 595,\n",
770 | " 320,\n",
771 | " 676,\n",
772 | " 1002,\n",
773 | " 1871,\n",
774 | " 370,\n",
775 | " 4132,\n",
776 | " 996,\n",
777 | " 435,\n",
778 | " 1010,\n",
779 | " 308,\n",
780 | " 288,\n",
781 | " 484,\n",
782 | " 368,\n",
783 | " 405,\n",
784 | " 378,\n",
785 | " 514,\n",
786 | " 895,\n",
787 | " 232,\n",
788 | " 110,\n",
789 | " 374,\n",
790 | " 433,\n",
791 | " 788,\n",
792 | " 403,\n",
793 | " 1217,\n",
794 | " 849,\n",
795 | " 333,\n",
796 | " 126,\n",
797 | " 324,\n",
798 | " 977,\n",
799 | " 295,\n",
800 | " 1629,\n",
801 | " 319,\n",
802 | " 350,\n",
803 | " 128,\n",
804 | " 754,\n",
805 | " 779,\n",
806 | " 314,\n",
807 | " 604,\n",
808 | " 391,\n",
809 | " 242,\n",
810 | " 403,\n",
811 | " 1291,\n",
812 | " 112,\n",
813 | " 263,\n",
814 | " 128,\n",
815 | " 1620,\n",
816 | " 543,\n",
817 | " 800,\n",
818 | " 973,\n",
819 | " 552,\n",
820 | " 244,\n",
821 | " 628,\n",
822 | " 418,\n",
823 | " 428,\n",
824 | " 412,\n",
825 | " 809,\n",
826 | " 240,\n",
827 | " 940,\n",
828 | " 747,\n",
829 | " 6330,\n",
830 | " 469,\n",
831 | " 770,\n",
832 | " 188,\n",
833 | " 952,\n",
834 | " 1575,\n",
835 | " 790,\n",
836 | " 1178,\n",
837 | " 439,\n",
838 | " 4270,\n",
839 | " 834,\n",
840 | " 527,\n",
841 | " 206,\n",
842 | " 683,\n",
843 | " 541,\n",
844 | " 257,\n",
845 | " 191,\n",
846 | " 390,\n",
847 | " 267,\n",
848 | " 316,\n",
849 | " 1029,\n",
850 | " 233,\n",
851 | " 261,\n",
852 | " 3734,\n",
853 | " 799,\n",
854 | " 275,\n",
855 | " 388,\n",
856 | " 1718,\n",
857 | " 6228,\n",
858 | " 188,\n",
859 | " 367,\n",
860 | " 648,\n",
861 | " 1717,\n",
862 | " 1196,\n",
863 | " 639,\n",
864 | " 1904,\n",
865 | " 1107,\n",
866 | " 1127,\n",
867 | " 414,\n",
868 | " 341,\n",
869 | " 936,\n",
870 | " 124,\n",
871 | " 704,\n",
872 | " 359,\n",
873 | " 631,\n",
874 | " 771,\n",
875 | " 853,\n",
876 | " 892,\n",
877 | " 796,\n",
878 | " 302,\n",
879 | " 2938,\n",
880 | " 289,\n",
881 | " 1287,\n",
882 | " 3105,\n",
883 | " 3493,\n",
884 | " 812,\n",
885 | " 1861,\n",
886 | " 425,\n",
887 | " 475,\n",
888 | " 348,\n",
889 | " 241,\n",
890 | " 2461,\n",
891 | " 1359,\n",
892 | " 755,\n",
893 | " 741,\n",
894 | " 205,\n",
895 | " 145,\n",
896 | " 380,\n",
897 | " 1028,\n",
898 | " 364,\n",
899 | " 553,\n",
900 | " 301,\n",
901 | " 770,\n",
902 | " 319,\n",
903 | " 208,\n",
904 | " 1006,\n",
905 | " 559,\n",
906 | " 334,\n",
907 | " 399,\n",
908 | " 1010,\n",
909 | " 162,\n",
910 | " 528,\n",
911 | " 1272,\n",
912 | " 348,\n",
913 | " 1823,\n",
914 | " 1690,\n",
915 | " 1991,\n",
916 | " 472,\n",
917 | " 2442,\n",
918 | " 461,\n",
919 | " 1204,\n",
920 | " 738,\n",
921 | " 267,\n",
922 | " 943,\n",
923 | " 680,\n",
924 | " 3376,\n",
925 | " 804,\n",
926 | " 701,\n",
927 | " 1482,\n",
928 | " 283,\n",
929 | " 466,\n",
930 | " 533,\n",
931 | " 170,\n",
932 | " 880,\n",
933 | " 2902,\n",
934 | " 980,\n",
935 | " 434,\n",
936 | " 1280,\n",
937 | " 580,\n",
938 | " 229,\n",
939 | " 84,\n",
940 | " 257,\n",
941 | " 286,\n",
942 | " 175,\n",
943 | " 198,\n",
944 | " 2043,\n",
945 | " 335,\n",
946 | " 240,\n",
947 | " 1517,\n",
948 | " 5200,\n",
949 | " 539,\n",
950 | " 1022,\n",
951 | " 11524,\n",
952 | " 187,\n",
953 | " 158,\n",
954 | " 658,\n",
955 | " 165,\n",
956 | " 283,\n",
957 | " 736,\n",
958 | " 195,\n",
959 | " 871,\n",
960 | " 801,\n",
961 | " 178,\n",
962 | " 1267,\n",
963 | " 112,\n",
964 | " 717,\n",
965 | " 327,\n",
966 | " 846,\n",
967 | " 253,\n",
968 | " 520,\n",
969 | " 101,\n",
970 | " 626,\n",
971 | " 945,\n",
972 | " 454,\n",
973 | " 254,\n",
974 | " 775,\n",
975 | " 520,\n",
976 | " 753,\n",
977 | " 2658,\n",
978 | " 2021,\n",
979 | " 855,\n",
980 | " 3316,\n",
981 | " 2032,\n",
982 | " 8629,\n",
983 | " 762,\n",
984 | " 3730,\n",
985 | " 1576,\n",
986 | " 328,\n",
987 | " 1115,\n",
988 | " 496,\n",
989 | " 770,\n",
990 | " 143,\n",
991 | " 133,\n",
992 | " 743,\n",
993 | " 348,\n",
994 | " 214,\n",
995 | " 580,\n",
996 | " 2310,\n",
997 | " 204,\n",
998 | " 312,\n",
999 | " 815,\n",
1000 | " 417,\n",
1001 | " 843,\n",
1002 | " 329,\n",
1003 | " 3034,\n",
1004 | " 410,\n",
1005 | " 672,\n",
1006 | " 225,\n",
1007 | " 673,\n",
1008 | " 415,\n",
1009 | " 1475,\n",
1010 | " 444,\n",
1011 | " 780,\n",
1012 | " 497,\n",
1013 | " 586,\n",
1014 | " 1161,\n",
1015 | " 1608,\n",
1016 | " 752,\n",
1017 | " 600,\n",
1018 | " 1645,\n",
1019 | " 155,\n",
1020 | " 56446,\n",
1021 | " 562,\n",
1022 | " 513,\n",
1023 | " 6647,\n",
1024 | " 660,\n",
1025 | " 112,\n",
1026 | " 1539,\n",
1027 | " 1220,\n",
1028 | " 1281,\n",
1029 | " 741,\n",
1030 | " 1078,\n",
1031 | " 474,\n",
1032 | " 864,\n",
1033 | " 182,\n",
1034 | " 244,\n",
1035 | " 1278,\n",
1036 | " 1056,\n",
1037 | " 647,\n",
1038 | " 358,\n",
1039 | " 535,\n",
1040 | " 2641,\n",
1041 | " 364,\n",
1042 | " 413,\n",
1043 | " 720,\n",
1044 | " 976,\n",
1045 | " 510,\n",
1046 | " 686,\n",
1047 | " 427,\n",
1048 | " 2311,\n",
1049 | " 238,\n",
1050 | " 4432,\n",
1051 | " 277,\n",
1052 | " 356,\n",
1053 | " 665,\n",
1054 | " 311,\n",
1055 | " 886,\n",
1056 | " 1529,\n",
1057 | " 1467,\n",
1058 | " 305,\n",
1059 | " 350,\n",
1060 | " 1839,\n",
1061 | " 316,\n",
1062 | " 1613,\n",
1063 | " 229,\n",
1064 | " 198,\n",
1065 | " 1235,\n",
1066 | " 2633,\n",
1067 | " 809,\n",
1068 | " 4255,\n",
1069 | " 1864,\n",
1070 | " 606,\n",
1071 | " 497,\n",
1072 | " 793,\n",
1073 | " 1371,\n",
1074 | " 1703]"
1075 | ]
1076 | },
1077 | "execution_count": 24,
1078 | "metadata": {},
1079 | "output_type": "execute_result"
1080 | }
1081 | ],
1082 | "source": [
1083 | "filtered_df100k[\"token_count\"].to_list()"
1084 | ]
1085 | },
1086 | {
1087 | "cell_type": "code",
1088 | "execution_count": 27,
1089 | "metadata": {},
1090 | "outputs": [
1091 | {
1092 | "name": "stderr",
1093 | "output_type": "stream",
1094 | "text": [
1095 | "/var/folders/sx/rrvr6l_d5x1_g46jxlx5ypfc0000gn/T/ipykernel_76251/2136929811.py:1: SettingWithCopyWarning: \n",
1096 | "A value is trying to be set on a copy of a slice from a DataFrame.\n",
1097 | "Try using .loc[row_indexer,col_indexer] = value instead\n",
1098 | "\n",
1099 | "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
1100 | " filtered_df100k['text_length'] = filtered_df100k['text'].apply(len)\n"
1101 | ]
1102 | },
1103 | {
1104 | "data": {
1189 | "text/plain": [
1190 | " token_count text_length\n",
1191 | "52736 222 974\n",
1192 | "52737 422 1881\n",
1193 | "52738 1959 10870\n",
1194 | "52739 530 2720\n",
1195 | "52740 382 1869\n",
1196 | "... ... ...\n",
1197 | "53243 606 2935\n",
1198 | "53244 497 2507\n",
1199 | "53245 793 3819\n",
1200 | "53246 1371 6762\n",
1201 | "53247 1703 8438\n",
1202 | "\n",
1203 | "[512 rows x 2 columns]"
1204 | ]
1205 | },
1206 | "execution_count": 27,
1207 | "metadata": {},
1208 | "output_type": "execute_result"
1209 | }
1210 | ],
1211 | "source": [
1212 | "filtered_df100k['text_length'] = filtered_df100k['text'].apply(len)\n",
1213 | "filtered_df100k[['token_count', 'text_length']]\n"
1214 | ]
1215 | },
1216 | {
1217 | "cell_type": "code",
1218 | "execution_count": 44,
1219 | "metadata": {},
1220 | "outputs": [],
1221 | "source": [
1222 | "df100k['text_length'] = df100k['text'].apply(len)\n",
1223 | "sorted_df = df100k.sort_values(by='token_count', ascending=False)\n",
1224 | "sorted_df = sorted_df[sorted_df[\"text_length\"] > 10000]\n"
1225 | ]
1226 | },
1227 | {
1228 | "cell_type": "code",
1229 | "execution_count": 47,
1230 | "metadata": {},
1231 | "outputs": [
1232 | {
1233 | "data": {
1234 | "text/plain": [
1235 | "(8444, 8)"
1236 | ]
1237 | },
1238 | "execution_count": 47,
1239 | "metadata": {},
1240 | "output_type": "execute_result"
1241 | }
1242 | ],
1243 | "source": [
1244 | "sorted_df.shape"
1245 | ]
1246 | },
1247 | {
1248 | "cell_type": "code",
1249 | "execution_count": 48,
1250 | "metadata": {},
1251 | "outputs": [
1252 | {
1253 | "name": "stdout",
1254 | "output_type": "stream",
1255 | "text": [
1256 |     "The smallest text_length where token_count is more than 2048 is: 2385\n"
1257 | ]
1258 | }
1259 | ],
1260 | "source": [
1261 |     "# Filter the DataFrame to find entries where token_count is more than 2048\n",
1262 | "high_token_count_df = df100k[df100k['token_count'] > 2048]\n",
1263 | "\n",
1264 | "# Find the minimum text_length from the filtered DataFrame\n",
1265 | "min_text_length = high_token_count_df['text_length'].min()\n",
1266 | "\n",
1267 | "# Print the result\n",
1268 |     "print(\"The smallest text_length where token_count is more than 2048 is:\", min_text_length)\n"
1269 | ]
1270 | },
1271 | {
1272 | "cell_type": "code",
1273 | "execution_count": 46,
1274 | "metadata": {},
1275 | "outputs": [
1276 | {
1277 | "data": {
1806 | "text/plain": [
1807 | " token_count text_length\n",
1808 | "57385 104023 485818\n",
1809 | "8741 101566 485117\n",
1810 | "26462 83662 336015\n",
1811 | "48738 81132 302843\n",
1812 | "22545 69087 306273\n",
1813 | "59842 68874 328275\n",
1814 | "66900 64344 328697\n",
1815 | "50493 59206 272464\n",
1816 | "53193 56446 280809\n",
1817 | "25146 46862 203838\n",
1818 | "98915 46664 238568\n",
1819 | "63484 46596 217907\n",
1820 | "26363 46018 206822\n",
1821 | "20481 44657 165152\n",
1822 | "19752 43150 207965\n",
1823 | "21011 42434 190152\n",
1824 | "19932 41801 157942\n",
1825 | "75998 41636 197039\n",
1826 | "54072 41160 205458\n",
1827 | "75446 40872 173387\n",
1828 | "21587 40787 171901\n",
1829 | "61198 39787 165134\n",
1830 | "41897 38832 184465\n",
1831 | "37280 38682 179243\n",
1832 | "60438 38664 168148\n",
1833 | "52712 37943 180434\n",
1834 | "49513 36064 144425\n",
1835 | "10770 35178 161248\n",
1836 | "30875 34214 156451\n",
1837 | "21416 33794 108263\n",
1838 | "76883 33370 151322\n",
1839 | "61913 32132 145142\n",
1840 | "82674 32022 110127\n",
1841 | "25879 31153 139767\n",
1842 | "50241 30879 141398\n",
1843 | "91285 30841 126778\n",
1844 | "85932 30799 123451\n",
1845 | "67564 30505 110449\n",
1846 | "23804 30502 133703\n",
1847 | "64453 30140 122132\n",
1848 | "20491 29787 147622\n",
1849 | "98810 29474 104333\n",
1850 | "23779 29404 109206\n",
1851 | "12476 28799 118054\n",
1852 | "84791 28792 126564\n",
1853 | "16536 28782 134643\n",
1854 | "6770 28745 139452\n",
1855 | "64492 28731 151428\n",
1856 | "55693 28615 126376\n",
1857 | "96635 28603 128460\n",
1858 | "87035 28458 126872\n",
1859 | "97372 28073 128217\n",
1860 | "16966 27827 121963\n",
1861 | "54282 27622 110453\n",
1862 | "64422 27399 126250\n",
1863 | "43095 26805 127607\n",
1864 | "11223 26774 107899\n",
1865 | "938 26697 124276\n",
1866 | "72130 26616 95360\n",
1867 | "35815 26602 100388\n",
1868 | "60910 26557 130256\n",
1869 | "53729 26423 123570\n",
1870 | "21879 26392 116157\n",
1871 | "97467 26192 114295\n",
1872 | "19179 25979 94335\n",
1873 | "78544 25836 110689\n",
1874 | "86182 25721 102530\n",
1875 | "70463 25345 100019\n",
1876 | "19729 24953 115631\n",
1877 | "92956 24808 110828\n",
1878 | "75490 24776 98531\n",
1879 | "57823 24624 93310\n",
1880 | "5150 24593 86615\n",
1881 | "5065 24504 110972\n",
1882 | "64878 24310 104386\n",
1883 | "85699 24133 115965\n",
1884 | "35083 24130 119805\n",
1885 | "1122 24091 90348\n",
1886 | "41560 23898 101378\n",
1887 | "53989 23874 56790\n",
1888 | "33448 23852 115978\n",
1889 | "29779 23735 120142\n",
1890 | "54450 23715 105685\n",
1891 | "39629 23685 101014\n",
1892 | "6874 23436 110472\n",
1893 | "52833 23407 90316\n",
1894 | "14484 23357 104001\n",
1895 | "1692 23338 99255\n",
1896 | "23331 23138 89524\n",
1897 | "82384 23120 89432\n",
1898 | "69196 23083 58244\n",
1899 | "15008 22890 107912\n",
1900 | "58367 22855 105143\n",
1901 | "26207 22832 85131\n",
1902 | "78156 22807 99823\n",
1903 | "36604 22656 103464\n",
1904 | "32054 22652 107154\n",
1905 | "20075 22646 109951\n",
1906 | "9812 22574 101409\n",
1907 | "36465 22527 98342"
1908 | ]
1909 | },
1910 | "metadata": {},
1911 | "output_type": "display_data"
1912 | }
1913 | ],
1914 | "source": [
1915 | "with pd.option_context('display.max_rows', None, 'display.max_columns', None):\n",
1916 | " display(sorted_df[['token_count', 'text_length']].head(100))\n"
1917 | ]
1918 | },
1919 | {
1920 | "cell_type": "code",
1921 | "execution_count": null,
1922 | "metadata": {},
1923 | "outputs": [],
1924 | "source": []
1925 | }
1926 | ],
1927 | "metadata": {
1928 | "kernelspec": {
1929 | "display_name": "testing",
1930 | "language": "python",
1931 | "name": "python3"
1932 | },
1933 | "language_info": {
1934 | "codemirror_mode": {
1935 | "name": "ipython",
1936 | "version": 3
1937 | },
1938 | "file_extension": ".py",
1939 | "mimetype": "text/x-python",
1940 | "name": "python",
1941 | "nbconvert_exporter": "python",
1942 | "pygments_lexer": "ipython3",
1943 | "version": "3.11.6"
1944 | }
1945 | },
1946 | "nbformat": 4,
1947 | "nbformat_minor": 2
1948 | }
1949 |
--------------------------------------------------------------------------------
/notebooks/tokenizers.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [
8 | {
9 | "name": "stderr",
10 | "output_type": "stream",
11 | "text": [
12 | "/Users/enjalot/code/fineweb-modal/venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
13 | " from .autonotebook import tqdm as notebook_tqdm\n"
14 | ]
15 | }
16 | ],
17 | "source": [
18 | "from transformers import AutoTokenizer\n",
19 | "import numpy as np\n",
20 | "from collections import Counter"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 2,
26 | "metadata": {},
27 | "outputs": [],
28 | "source": [
29 | "\n",
30 | "def compare_tokenizers(text_samples):\n",
31 | " \"\"\"\n",
32 | " Compare tokenization results between BGE and Nomic tokenizers\n",
33 | " \n",
34 | " Args:\n",
35 | " text_samples: List of text strings to compare tokenization\n",
36 | " \n",
37 | " Returns:\n",
38 | " dict: Comparison statistics and analysis results\n",
39 | " \"\"\"\n",
40 | " # Load both tokenizers\n",
41 | " bge_tokenizer = AutoTokenizer.from_pretrained(\"BAAI/bge-base-en-v1.5\")\n",
42 | " nomic_tokenizer = AutoTokenizer.from_pretrained(\"nomic-ai/nomic-embed-text-v1.5\")\n",
43 | " \n",
44 | " results = {\n",
45 | " \"vocabulary_sizes\": {\n",
46 | " \"bge\": len(bge_tokenizer.vocab),\n",
47 | " \"nomic\": len(nomic_tokenizer.vocab),\n",
48 | " },\n",
49 | " \"samples\": []\n",
50 | " }\n",
51 | " \n",
52 | " # Compare tokenization for each sample\n",
53 | " for text in text_samples:\n",
54 | " bge_tokens = bge_tokenizer.tokenize(text)\n",
55 | " nomic_tokens = nomic_tokenizer.tokenize(text)\n",
56 | " \n",
57 | " # Get token counts\n",
58 | " bge_counts = Counter(bge_tokens)\n",
59 | " nomic_counts = Counter(nomic_tokens)\n",
60 | " \n",
61 | " # Compare token sequences\n",
62 | " sample_result = {\n",
63 | " \"text\": text,\n",
64 | " \"bge_tokens\": bge_tokens,\n",
65 | " \"nomic_tokens\": nomic_tokens,\n",
66 | " \"token_counts\": {\n",
67 | " \"bge\": len(bge_tokens),\n",
68 | " \"nomic\": len(nomic_tokens)\n",
69 | " },\n",
70 | " \"unique_tokens\": {\n",
71 | " \"bge\": len(bge_counts),\n",
72 | " \"nomic\": len(nomic_counts)\n",
73 | " },\n",
74 | " \"identical_tokenization\": bge_tokens == nomic_tokens\n",
75 | " }\n",
76 | " \n",
77 | " results[\"samples\"].append(sample_result)\n",
78 | " \n",
79 | " # Calculate overall statistics\n",
80 | " identical_count = sum(1 for r in results[\"samples\"] if r[\"identical_tokenization\"])\n",
81 | " results[\"overall_stats\"] = {\n",
82 | " \"total_samples\": len(text_samples),\n",
83 | " \"identical_tokenizations\": identical_count,\n",
84 | " \"identical_percentage\": (identical_count / len(text_samples)) * 100 if text_samples else 0\n",
85 | " }\n",
86 | " \n",
87 | " return results"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 3,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "\n",
97 | "def print_comparison_report(results):\n",
98 | " \"\"\"Print a formatted report of the tokenizer comparison results\"\"\"\n",
99 | " print(\"Tokenizer Comparison Report\")\n",
100 | " print(\"==========================\")\n",
101 | " print(f\"\\nVocabulary Sizes:\")\n",
102 | " print(f\"BGE: {results['vocabulary_sizes']['bge']:,} tokens\")\n",
103 | " print(f\"Nomic: {results['vocabulary_sizes']['nomic']:,} tokens\")\n",
104 | " \n",
105 | " print(f\"\\nOverall Statistics:\")\n",
106 | " print(f\"Total samples analyzed: {results['overall_stats']['total_samples']}\")\n",
107 | " print(f\"Identical tokenizations: {results['overall_stats']['identical_tokenizations']}\")\n",
108 | " print(f\"Percentage identical: {results['overall_stats']['identical_percentage']:.1f}%\")\n",
109 | " \n",
110 | " print(\"\\nDetailed Sample Analysis:\")\n",
111 | " for i, sample in enumerate(results['samples'], 1):\n",
112 | " print(f\"\\nSample {i}:\")\n",
113 | " print(f\"Text: {sample['text']}\")\n",
114 | " print(f\"BGE tokens ({sample['token_counts']['bge']}): {sample['bge_tokens']}\")\n",
115 | " print(f\"Nomic tokens ({sample['token_counts']['nomic']}): {sample['nomic_tokens']}\")\n",
116 | " print(f\"Identical: {sample['identical_tokenization']}\")"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": 4,
122 | "metadata": {},
123 | "outputs": [
124 | {
125 | "name": "stdout",
126 | "output_type": "stream",
127 | "text": [
128 | "Tokenizer Comparison Report\n",
129 | "==========================\n",
130 | "\n",
131 | "Vocabulary Sizes:\n",
132 | "BGE: 30,522 tokens\n",
133 | "Nomic: 30,522 tokens\n",
134 | "\n",
135 | "Overall Statistics:\n",
136 | "Total samples analyzed: 3\n",
137 | "Identical tokenizations: 3\n",
138 | "Percentage identical: 100.0%\n",
139 | "\n",
140 | "Detailed Sample Analysis:\n",
141 | "\n",
142 | "Sample 1:\n",
143 | "Text: This is a test sentence.\n",
144 | "BGE tokens (6): ['this', 'is', 'a', 'test', 'sentence', '.']\n",
145 | "Nomic tokens (6): ['this', 'is', 'a', 'test', 'sentence', '.']\n",
146 | "Identical: True\n",
147 | "\n",
148 | "Sample 2:\n",
149 | "Text: Machine learning models use different tokenization approaches.\n",
150 | "BGE tokens (9): ['machine', 'learning', 'models', 'use', 'different', 'token', '##ization', 'approaches', '.']\n",
151 | "Nomic tokens (9): ['machine', 'learning', 'models', 'use', 'different', 'token', '##ization', 'approaches', '.']\n",
152 | "Identical: True\n",
153 | "\n",
154 | "Sample 3:\n",
155 | "Text: Some текст with mixed 字符 and специальные characters!\n",
156 | "BGE tokens (24): ['some', 'т', '##е', '##к', '##с', '##т', 'with', 'mixed', '[UNK]', '[UNK]', 'and', 'с', '##п', '##е', '##ц', '##и', '##а', '##л', '##ь', '##н', '##ы', '##е', 'characters', '!']\n",
157 | "Nomic tokens (24): ['some', 'т', '##е', '##к', '##с', '##т', 'with', 'mixed', '[UNK]', '[UNK]', 'and', 'с', '##п', '##е', '##ц', '##и', '##а', '##л', '##ь', '##н', '##ы', '##е', 'characters', '!']\n",
158 | "Identical: True\n"
159 | ]
160 | }
161 | ],
162 | "source": [
163 | "\n",
164 | "# Example usage\n",
165 | "sample_texts = [\n",
166 | " \"This is a test sentence.\",\n",
167 | " \"Machine learning models use different tokenization approaches.\",\n",
168 | " \"Some текст with mixed 字符 and специальные characters!\",\n",
169 | "]\n",
170 | "\n",
171 | "results = compare_tokenizers(sample_texts)\n",
172 | "print_comparison_report(results)"
173 | ]
174 | },
175 | {
176 | "cell_type": "code",
177 | "execution_count": 5,
178 | "metadata": {},
179 | "outputs": [],
180 | "source": [
181 | "bge_tokenizer = AutoTokenizer.from_pretrained(\"BAAI/bge-base-en-v1.5\")\n",
182 | "nomic_tokenizer = AutoTokenizer.from_pretrained(\"nomic-ai/nomic-embed-text-v1.5\")\n"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": 6,
188 | "metadata": {},
189 | "outputs": [
190 | {
191 | "data": {
192 | "text/plain": [
193 | "BertTokenizerFast(name_or_path='BAAI/bge-base-en-v1.5', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={\n",
194 | "\t0: AddedToken(\"[PAD]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
195 | "\t100: AddedToken(\"[UNK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
196 | "\t101: AddedToken(\"[CLS]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
197 | "\t102: AddedToken(\"[SEP]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
198 | "\t103: AddedToken(\"[MASK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
199 | "}"
200 | ]
201 | },
202 | "execution_count": 6,
203 | "metadata": {},
204 | "output_type": "execute_result"
205 | }
206 | ],
207 | "source": [
208 | "bge_tokenizer"
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": 7,
214 | "metadata": {},
215 | "outputs": [
216 | {
217 | "data": {
218 | "text/plain": [
219 | "BertTokenizerFast(name_or_path='nomic-ai/nomic-embed-text-v1.5', vocab_size=30522, model_max_length=8192, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={\n",
220 | "\t0: AddedToken(\"[PAD]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
221 | "\t100: AddedToken(\"[UNK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
222 | "\t101: AddedToken(\"[CLS]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
223 | "\t102: AddedToken(\"[SEP]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
224 | "\t103: AddedToken(\"[MASK]\", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),\n",
225 | "}"
226 | ]
227 | },
228 | "execution_count": 7,
229 | "metadata": {},
230 | "output_type": "execute_result"
231 | }
232 | ],
233 | "source": [
234 | "nomic_tokenizer"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": null,
240 | "metadata": {},
241 | "outputs": [],
242 | "source": []
243 | }
244 | ],
245 | "metadata": {
246 | "language_info": {
247 | "name": "python"
248 | }
249 | },
250 | "nbformat": 4,
251 | "nbformat_minor": 2
252 | }
253 |
--------------------------------------------------------------------------------
/remove.py:
--------------------------------------------------------------------------------
1 | from modal import App, Image, Volume, Secret
2 |
3 | DATASET_DIR="/embeddings"
4 | VOLUME = "embeddings"
5 | SAE = "64_32"
6 |
7 | # SAMPLE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3/train"
8 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10"
9 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10/combined"
10 |
11 | SAMPLE_DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500/train"
12 | DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top5"
13 | SAVE_DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top5/combined"
14 |
15 |
16 |
17 |
18 |
19 | # We define our Modal Resources that we'll need
20 | volume = Volume.from_name(VOLUME, create_if_missing=True)
21 | image = Image.debian_slim(python_version="3.9").pip_install(
22 | "pandas", "datasets==2.16.1", "apache_beam==2.53.0"
23 | )
24 | app = App(image=image)
25 |
26 | @app.function(
27 | volumes={DATASET_DIR: volume},
28 | timeout=60000,
29 | )
30 | def remove_files_by_pattern(directory, pattern):
31 | """
32 | Remove all files in the specified directory that match the given pattern.
33 |
34 | Args:
35 | directory: Directory to search for files
36 | pattern: File pattern to match (e.g., "temp*" for files starting with "temp")
37 | """
38 | import os
39 | import glob
40 |
41 | # Get the full path pattern
42 | full_pattern = os.path.join(directory, pattern)
43 |
44 | # Find all files matching the pattern
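   |     # (glob is non-recursive here, so only files directly inside `directory` that match
   |     #  the pattern are picked up, e.g. the temp_*.parquet intermediates from top10map.py)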
45 | matching_files = glob.glob(full_pattern)
46 |
47 | # Count files to be removed
48 | file_count = len(matching_files)
49 | print(f"Found {file_count} files matching pattern '{pattern}' in {directory}")
50 |
51 | # Remove each file
52 | for file_path in matching_files:
53 | try:
54 | os.remove(file_path)
55 | print(f"Removed: {file_path}")
56 | except Exception as e:
57 | print(f"Error removing {file_path}: {e}")
58 |
59 | # Commit changes to the volume
60 | volume.commit()
61 |
62 | return f"Removed {file_count} files matching pattern '{pattern}'"
63 |
64 | @app.local_entrypoint()
65 | def main():
66 |
67 | directory = "/embeddings/wikipedia-en-chunked-500-nomic-embed-text-v1.5-64_32-top10"
68 | pattern = "temp*"
69 | print(f"Removing files matching '{pattern}' from '{directory}'")
70 | result = remove_files_by_pattern.remote(directory, pattern)
71 | print(result)
72 |
73 |
74 |
--------------------------------------------------------------------------------
/summary.py:
--------------------------------------------------------------------------------
1 | from modal import App, Image, Volume
2 |
3 |
4 | # We first set out configuration variables for our script.
5 | DATASET_DIR = "/data"
6 | # VOLUME = "embedding-fineweb-edu"
7 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-120"
8 | # DATASET_SAVE_CHUNKED = f"fineweb-edu-sample-10BT-chunked-500"
9 | # files = [f"data-{i:05d}-of-00099.parquet" for i in range(99)]
10 |
11 |
12 |
13 | VOLUME = "datasets"
14 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-120"
15 | # DATASET_SAVE_CHUNKED = f"RedPajama-Data-V2-sample-10B-chunked-500"
16 | # files = [f"data-{i:05d}-of-00150.parquet" for i in range(150)]
17 | # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-500"
18 | # DATASET_SAVE_CHUNKED = f"pile-uncopyrighted-chunked-120"
19 | # files = [f"data-{i:05d}-of-01987.parquet" for i in range(200)]
20 | DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-120"
21 | # DATASET_SAVE_CHUNKED = f"wikipedia-en-chunked-500"
22 | files = [f"data-{i:05d}-of-00041.parquet" for i in range(41)]
23 |
24 |
25 |
26 |
27 | # MODEL_ID = "nomic-ai/nomic-embed-text-v1.5"
28 |
29 | # We define our Modal Resources that we'll need
30 | volume = Volume.from_name(VOLUME, create_if_missing=True)
31 | image = Image.debian_slim(python_version="3.9").pip_install(
32 | "datasets==2.16.1", "apache_beam==2.53.0", "transformers", "pandas", "tqdm"
33 | )
34 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub"
35 |
36 |
37 |
38 | @app.function(volumes={DATASET_DIR: volume}, timeout=3000)
39 | def process_dataset(file):
40 | import time
41 | from concurrent.futures import ThreadPoolExecutor, as_completed
42 | from tqdm import tqdm
43 | import pandas as pd
44 |
45 | # Load the dataset as a Hugging Face dataset
46 | # print(f"Loading dataset from {DATASET_DIR}/{DATASET_SAVE}/train/{file}")
47 | df = pd.read_parquet(f"{DATASET_DIR}/{DATASET_SAVE_CHUNKED}/train/{file}")
48 | print("dataset", len(df))
49 |
50 | return {
51 | "file": file,
52 | "num_rows": len(df),
53 | "tokens": df["chunk_token_count"].sum(),
54 | "less2": df[df["chunk_token_count"] < 2].shape[0],
55 | "less10": df[df["chunk_token_count"] < 10].shape[0],
56 | "less50": df[df["chunk_token_count"] < 50].shape[0],
57 | }
58 |
59 | @app.local_entrypoint()
60 | def main():
61 | from tqdm import tqdm
62 | responses = []
63 | for resp in process_dataset.map(files, order_outputs=False, return_exceptions=True):
64 | if isinstance(resp, Exception):
65 | print(f"Exception: {resp}")
66 | continue
67 | print(resp)
68 | responses.append(resp)
69 |
70 | total_rows = 0
71 | total_tokens = 0
72 | total_less2 = 0
73 | total_less10 = 0
74 | total_less50 = 0
75 | for resp in tqdm(responses):
76 | total_rows += resp['num_rows']
77 | total_tokens += resp['tokens']
78 | total_less2 += resp['less2']
79 | total_less10 += resp['less10']
80 | total_less50 += resp['less50']
81 | print(f"Total rows processed: {total_rows}")
82 | print(f"Total tokens processed: {total_tokens}")
83 | print(f"Total less2: {total_less2}")
84 | print(f"Total less10: {total_less10}")
85 | print(f"Total less50: {total_less50}")
86 |
87 |
88 |
--------------------------------------------------------------------------------
/todataset.py:
--------------------------------------------------------------------------------
1 | """
2 | Turn a directory of parquet files into a HuggingFace dataset in the modal volume
3 | """
4 | # TODO: look into keeping the parquet files as is to make the dataset
5 |
6 |
7 | from modal import App, Image, Volume, Secret
8 |
9 | DATASET_DIR="/embeddings"
10 | VOLUME = "embeddings"
11 | DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500"
12 | SAVE_DIRECTORY = f"{DIRECTORY}-HF2"
13 |
14 | # We define our Modal Resources that we'll need
15 | volume = Volume.from_name(VOLUME, create_if_missing=True)
16 | image = Image.debian_slim(python_version="3.9").pip_install(
17 | "datasets==2.16.1", "apache_beam==2.53.0"
18 | )
19 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub"
20 |
21 |
22 | # The default timeout is 5 minutes, per https://modal.com/docs/guide/timeouts#handling-timeouts,
23 | # but we override it here to 6000s
24 | # to avoid any potential timeout issues
25 | @app.function(
26 | volumes={DATASET_DIR: volume},
27 | timeout=6000,
28 | # ephemeral_disk=2145728, # in MiB
29 | secrets=[Secret.from_name("huggingface-secret")],
30 | )
31 | def convert_dataset():
32 | # Redownload the dataset
33 | import time
34 | from datasets import load_dataset
35 | print("loading")
36 | dataset = load_dataset("parquet", data_files=f"{DIRECTORY}/train/*.parquet")
37 | print("saving")
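   |     # num_shards keeps the saved dataset at 99 train shards, mirroring the
   |     # data-XXXXX-of-00099.parquet layout of the chunked input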
38 | dataset.save_to_disk(SAVE_DIRECTORY, num_shards={"train":99})
39 | print("done!")
40 | volume.commit()
41 |
42 |
43 | @app.local_entrypoint()
44 | def main():
45 | convert_dataset.remote()
46 |
47 |
--------------------------------------------------------------------------------
/top10map.py:
--------------------------------------------------------------------------------
1 | """
2 | For each of the parquet files with SAE activations, find the top N rows per feature and write them to an intermediate file
3 | modal run top10map.py
4 | """
5 | from modal import App, Image, Volume
6 | import os
7 | import time
8 | import numpy as np
9 | import pandas as pd
10 | from tqdm import tqdm
11 | import concurrent.futures
12 | from functools import partial
13 |
14 | NUM_CPU=4
15 |
16 | N=5 # the number of samples to keep per feature
17 |
18 | DATASET_DIR="/embeddings"
19 | VOLUME = "embeddings"
20 |
21 | D_IN = 768 # the dimensions from the embedding models
22 | K=64
23 | # EXPANSION = 128
24 | EXPANSION = 32
25 | SAE = f"{K}_{EXPANSION}"
26 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3"
27 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10"
28 | DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}"
29 | SAVE_DIRECTORY = f"{DATASET_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top{N}"
30 |
31 |
32 | files = [f"data-{i:05d}-of-00041.parquet" for i in range(41)]
33 |
34 | # We define our Modal Resources that we'll need
35 | volume = Volume.from_name(VOLUME, create_if_missing=True)
36 | image = Image.debian_slim(python_version="3.9").pip_install(
37 | "datasets==2.16.1", "apache_beam==2.53.0", "transformers", "pandas", "tqdm"
38 | )
39 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub"
40 |
41 | # def get_top_n_rows_by_top_act(file, top_indices, top_acts, feature):
42 | # # feature_positions = np.where(np.any(top_indices == feature, axis=1),
43 | # # np.argmax(top_indices == feature, axis=1),
44 | # # -1)
45 | # # act_values = np.where(feature_positions != -1,
46 | # # top_acts[np.arange(len(top_acts)), feature_positions],
47 | # # 0)
48 | # # top_n_indices = np.argsort(act_values)[-N:][::-1]
49 |
50 | # # Find positions where feature appears (returns a boolean mask)
51 | # feature_mask = top_indices == feature
52 |
53 | # # Get the activation values where the feature appears (all others will be 0)
54 | # act_values = np.where(feature_mask.any(axis=1),
55 | # top_acts[feature_mask].reshape(-1),
56 | # 0)
57 |
58 | # # Use partition to get top N indices efficiently
59 | # top_n_indices = np.argpartition(act_values, -N)[-N:]
60 | # # Sort just the top N indices
61 | # top_n_indices = top_n_indices[np.argsort(act_values[top_n_indices])[::-1]]
62 |
63 | # filtered_df = pd.DataFrame({
64 | # "shard": file,
65 | # "index": top_n_indices,
66 | # "feature": feature,
67 | # "activation": act_values[top_n_indices]
68 | # })
69 | # return filtered_df
70 |
71 | def get_top_n_rows_by_top_act(file, top_indices, top_acts, feature):
72 | # Use memory-efficient approach to find rows with this feature
73 | rows_with_feature = np.any(top_indices == feature, axis=1)
74 |
75 | # Only process rows that have this feature
76 | filtered_indices = top_indices[rows_with_feature]
77 | filtered_acts = top_acts[rows_with_feature]
78 |
79 | # Get positions of the feature in each row
80 | positions = np.argwhere(filtered_indices == feature)
81 |
82 | # Create array of activation values (sparse approach)
83 | row_indices = positions[:, 0]
84 | col_indices = positions[:, 1]
85 | act_values = filtered_acts[row_indices, col_indices]
86 |
87 | # Map back to original indices
88 | original_indices = np.where(rows_with_feature)[0][row_indices]
89 |
90 | # Get top N
91 | if len(act_values) > N:
92 | top_n_pos = np.argpartition(act_values, -N)[-N:]
93 | top_n_pos = top_n_pos[np.argsort(act_values[top_n_pos])[::-1]]
94 | else:
95 | # If we have fewer than N matches, take all of them
96 | top_n_pos = np.argsort(act_values)[::-1]
97 |
98 | filtered_df = pd.DataFrame({
99 | "shard": file,
100 | "index": original_indices[top_n_pos],
101 | "feature": feature,
102 | "activation": act_values[top_n_pos]
103 | })
104 | return filtered_df
105 |
106 |
107 | def process_feature_chunk(file, feature_ids, chunk_index):
108 | start = time.perf_counter()
109 | print(f"Loading dataset from {DIRECTORY}/train/{file}", chunk_index)
110 |
111 | # Only read the columns we need
112 | df = pd.read_parquet(f"{DIRECTORY}/train/{file}", columns=['top_indices', 'top_acts'])
113 | print(f"Dataset loaded in {time.perf_counter()-start:.2f} seconds for {file}", chunk_index)
114 |
115 | top_indices = np.array(df['top_indices'].tolist())
116 | top_acts = np.array(df['top_acts'].tolist())
117 |
118 | # Free up memory by deleting the DataFrame after conversion to numpy
119 | del df
120 |
121 | print(f"top_indices shape: {top_indices.shape}")
122 | print(f"top_acts shape: {top_acts.shape}")
123 | print("got numpy arrays", chunk_index)
124 |
125 | results = []
126 |
127 | # Process each feature in this worker's batch
128 | for feature in tqdm(feature_ids, desc=f"Processing features (worker {chunk_index})", position=chunk_index):
129 | # Get the true top N rows for this feature across the entire chunk
130 | top = get_top_n_rows_by_top_act(file, top_indices, top_acts, feature)
131 | results.append(top)
132 |
133 | # Combine results for all features in this worker
134 | combined_df = pd.concat(results, ignore_index=True)
135 |
136 | # Write to a temporary file to save memory
137 | temp_file = f"{SAVE_DIRECTORY}/temp_{file}_{chunk_index}.parquet"
138 | combined_df.to_parquet(temp_file)
139 |
140 | # Free memory
141 | del top_indices, top_acts, results, combined_df
142 |
143 | return temp_file
144 |
145 | @app.function(cpu=NUM_CPU, volumes={DATASET_DIR: volume}, timeout=6000)
146 | def process_dataset(file):
147 | from concurrent.futures import ProcessPoolExecutor, as_completed
148 |
149 | # Ensure directory exists
150 | if not os.path.exists(f"{SAVE_DIRECTORY}"):
151 | os.makedirs(f"{SAVE_DIRECTORY}")
152 |
153 | num_features = D_IN * EXPANSION
154 |
155 | # Split the features among workers - each worker handles a subset of features
156 | # but processes the ENTIRE dataset for those features
157 | features_per_worker = num_features // NUM_CPU
158 | feature_batches = [list(range(i, min(i + features_per_worker, num_features)))
159 | for i in range(0, num_features, features_per_worker)]
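    |     # With D_IN=768 and EXPANSION=32 this is 24,576 features, i.e. 6,144 per worker at
    |     # NUM_CPU=4; each worker re-reads the whole file but only scans its own feature batch.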
160 |
161 | with ProcessPoolExecutor(max_workers=NUM_CPU) as executor:
162 | futures = [executor.submit(process_feature_chunk, file, feature_batch, i)
163 | for i, feature_batch in enumerate(feature_batches)]
164 |
165 | temp_files = []
166 | for future in as_completed(futures):
167 | temp_file = future.result()
168 | temp_files.append(temp_file)
169 |
170 | # Combine temporary files
171 | print("Combining temporary files")
172 | dfs = []
173 | for temp_file in temp_files:
174 | dfs.append(pd.read_parquet(temp_file))
175 | # Remove temp file after reading
176 | os.remove(temp_file)
177 |
178 | combined_df = pd.concat(dfs, ignore_index=True)
179 | combined_df.to_parquet(f"{SAVE_DIRECTORY}/{file}")
180 | volume.commit()
181 |
182 | return f"All done with {file}", len(combined_df)
183 |
184 |
185 | @app.local_entrypoint()
186 | def main():
187 | for resp in process_dataset.map(files, order_outputs=False, return_exceptions=True):
188 | if isinstance(resp, Exception):
189 | print(f"Exception: {resp}")
190 | continue
191 | print(resp)
192 |
193 |
194 |
--------------------------------------------------------------------------------
/top10reduce.py:
--------------------------------------------------------------------------------
1 | from modal import App, Image, Volume, Secret
2 |
3 | EMBEDDINGS_DIR="/embeddings"
4 | EMBEDDINGS_VOLUME = "embeddings"
5 | DATASETS_DIR="/datasets"
6 | DATASETS_VOLUME = "datasets"
7 |
8 | SAE = "64_32"
9 |
10 | # SAMPLE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3/train"
11 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10"
12 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-{SAE}-3-top10/combined"
13 |
14 | SAMPLE_DIRECTORY = f"{DATASETS_DIR}/wikipedia-en-chunked-500/train"
15 | SAE_DIRECTORY = f"{EMBEDDINGS_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}/train"
16 | DIRECTORY = f"{EMBEDDINGS_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top5"
17 | SAVE_DIRECTORY = f"{EMBEDDINGS_DIR}/wikipedia-en-chunked-500-nomic-embed-text-v1.5-{SAE}-top5/combined"
18 |
19 |
20 |
21 |
22 |
23 | # We define our Modal Resources that we'll need
24 | embeddings_volume = Volume.from_name(EMBEDDINGS_VOLUME, create_if_missing=True)
25 | datasets_volume = Volume.from_name(DATASETS_VOLUME, create_if_missing=True)
26 | image = Image.debian_slim(python_version="3.9").pip_install(
27 | "pandas", "datasets==2.16.1", "apache_beam==2.53.0"
28 | )
29 | app = App(image=image)
30 |
31 | @app.function(
32 | volumes={DATASETS_DIR: datasets_volume, EMBEDDINGS_DIR: embeddings_volume},
33 | timeout=60000,
34 | # ephemeral_disk=2145728, # in MiB
35 | )
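   | # `samples` holds the top-N rows that point into a single shard: 'shard' names the source
   | # parquet file and 'index' is the row position within it, so .iloc below recovers the
   | # original chunk rows before the feature/activation columns are attached.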
36 | def populate_indices(samples):
37 | import pandas as pd
38 |
39 | shard = samples.iloc[0]['shard']
40 | indices = samples['index'].tolist()
41 |
42 | print("reading shard", shard, len(indices))
43 | sample_df = pd.read_parquet(f"{SAMPLE_DIRECTORY}/{shard}")
44 | sample_df = sample_df.iloc[indices].copy()
45 | sample_df['feature'] = samples['feature'].tolist()
46 | sample_df['activation'] = samples['activation'].tolist()
47 | sample_df['top_indices'] = samples['top_indices'].tolist()
48 | sample_df['top_acts'] = samples['top_acts'].tolist()
49 | print("returning samples for", shard)
50 |
51 | return sample_df
52 |
53 | @app.function(
54 | volumes={
55 | DATASETS_DIR: datasets_volume,
56 | EMBEDDINGS_DIR: embeddings_volume
57 | },
58 | timeout=60000,
59 | # ephemeral_disk=2145728, # in MiB
60 | )
61 | def reduce_top10_indices(directory, save_directory, sae_directory, N):
62 | import os
63 | if not os.path.exists(save_directory):
64 | os.makedirs(save_directory)
65 |
66 | files = [f for f in os.listdir(directory) if f.endswith('.parquet')]
67 | print("len files", len(files))
68 |
69 | import pandas as pd
70 |
71 | combined_indices_path = f"{save_directory}/combined_indices.parquet"
72 | if not os.path.exists(combined_indices_path):
73 | print("creating combined_indices")
74 | all_dataframes = []
75 | for file in files:
76 | print(f"Reading {file}")
77 | # Read from top directory
78 | df = pd.read_parquet(f"{directory}/{file}")
79 |
80 | # Read corresponding file from SAE directory to get top_indices and top_acts
81 | if os.path.exists(f"{sae_directory}/{file}"):
82 | sae_df = pd.read_parquet(f"{sae_directory}/{file}")
83 | # Ensure we have the right columns
84 | if 'top_indices' in sae_df.columns and 'top_acts' in sae_df.columns:
85 | # Match records based on feature (assuming they're in the same order)
86 | df['top_indices'] = sae_df['top_indices']
87 | df['top_acts'] = sae_df['top_acts']
88 | print(f"Added top_indices and top_acts columns from {file}")
89 | else:
90 | print(f"Warning: top_indices or top_acts not found in {file} from SAE directory")
91 | else:
92 | print(f"Warning: file {file} not found in SAE directory")
93 |
94 | all_dataframes.append(df)
95 |
96 | # Concatenate all DataFrames into a single DataFrame
97 | combined_df = pd.concat(all_dataframes, ignore_index=True)
98 | print("combined")
99 | combined_df.to_parquet(combined_indices_path)
100 | else:
101 | print(f"{combined_indices_path} already exists. Loading it.")
102 | combined_df = pd.read_parquet(combined_indices_path)
103 |
104 | combined_df = combined_df.sort_values(by=['feature', 'activation'], ascending=[True, False])
105 | combined_df = combined_df.groupby('feature').head(N).reset_index(drop=True)
106 | print(f"writing top{N}")
107 | combined_df.to_parquet(f"{save_directory}/combined_indices_top{N}.parquet")
108 | embeddings_volume.commit()
109 |
110 | shard_counts = combined_df.groupby('shard').size().reset_index(name='count')
111 | print("shard_counts", shard_counts.head())
112 |
113 | print("Number of shards:", len(shard_counts))
114 | rows_by_shard = [combined_df[combined_df['shard'] == shard] for shard in combined_df['shard'].unique()]
115 |
116 | results = []
117 | for resp in populate_indices.map(rows_by_shard, order_outputs=False, return_exceptions=True):
118 | if isinstance(resp, Exception):
119 | print(f"Exception: {resp}")
120 | continue
121 | results.append(resp)
122 |
123 | print("concatenating final results")
124 | final_df = pd.concat(results, ignore_index=True)
125 | final_df = final_df.drop(columns=['index', '__index_level_0__'], errors='ignore')
126 | print("sorting final results")
127 | final_df = final_df.sort_values(by=['feature', 'activation'], ascending=[True, False])
128 | print("writing final results")
129 | final_df.to_parquet(f"{save_directory}/samples_top{N}.parquet")
130 | embeddings_volume.commit()
131 | return "done"
132 |
133 |
134 | # for resp in reduce_top10.map(pairs, order_outputs=False, return_exceptions=True):
135 | # if isinstance(resp, Exception):
136 | # print(f"Exception: {resp}")
137 | # continue
138 | # print(resp)
139 |
140 |
141 |
142 | @app.local_entrypoint()
143 | def main():
144 | reduce_top10_indices.remote(DIRECTORY, SAVE_DIRECTORY, SAE_DIRECTORY, 10)
145 |
146 |
147 |
--------------------------------------------------------------------------------
/torched.py:
--------------------------------------------------------------------------------
1 | """
2 | Write the embeddings from the dataset to torch files that can be loaded quicker
3 |
4 | modal run torched.py
5 | """
6 |
7 | from modal import App, Image, Volume, Secret
8 |
9 | DATASET_DIR="/embeddings"
10 | VOLUME = "embeddings"
11 | # DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4"
12 | DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-100BT-chunked-500"
13 | # SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4-torched"
14 | SAVE_DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-100BT-chunked-500-torched"
15 |
16 | # We define our Modal Resources that we'll need
17 | volume = Volume.from_name(VOLUME, create_if_missing=True)
18 | image = Image.debian_slim(python_version="3.9").pip_install(
19 | "datasets==2.16.1", "apache_beam==2.53.0", "tqdm", "torch", "numpy"
20 | )
21 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub"
22 |
23 | # NUM_EMBEDDINGS = 25504378
24 | # SHARD_SIZE = 262144 # 2048*128
25 |
26 | @app.function(
27 | volumes={DATASET_DIR: volume},
28 | timeout=60000,
29 | # ephemeral_disk=2145728, # in MiB
30 | )
31 | def torch_dataset_shard(file):
32 | # Redownload the dataset
33 | import time
34 | # from datasets import load_from_disk
35 | import pandas as pd
36 | from tqdm import tqdm
37 | import torch
38 | import numpy as np
39 | import os
40 |
41 | print("loading", file)
42 | # dataset = load_from_disk(DIRECTORY)
43 | df = pd.read_parquet(f"{DIRECTORY}/train/{file}")
44 | print("loaded", file)
45 | # train_dataset = dataset["train"]
46 |
47 | # start_idx = shard * SHARD_SIZE
48 | # end_idx = min(start_idx + SHARD_SIZE, NUM_EMBEDDINGS)
49 | # print("reading", shard)
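   |     # the parquet column stores one list per row; stack them into a dense
   |     # (num_rows, embedding_dim) float32 matrix before building the tensor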
50 | embeddings = df["embedding"].to_numpy()
51 | embeddings = np.array([np.array(e).astype(np.float32) for e in embeddings])
52 | # shard_embeddings = np.array(train_dataset.select(range(start_idx, end_idx))["embedding"])
53 | # print("permuting", shard)
54 |     # shard_embeddings = np.random.permutation(shard_embeddings)
55 | shard = file.split(".")[0]
56 | print("saving", shard)
57 | shard_tensor = torch.tensor(embeddings, dtype=torch.float32)
58 | if not os.path.exists(f"{SAVE_DIRECTORY}"):
59 | os.makedirs(f"{SAVE_DIRECTORY}")
60 | torch.save(shard_tensor, f"{SAVE_DIRECTORY}/{shard}.pt")
61 | print("done!", shard)
62 | volume.commit()
63 | return shard
64 |
65 | @app.local_entrypoint()
66 | def main():
67 | # num_shards = NUM_EMBEDDINGS // SHARD_SIZE + (1 if NUM_EMBEDDINGS % SHARD_SIZE != 0 else 0)
68 | # shards = list(range(num_shards))
69 | # # torch_dataset.remote()
70 | # for resp in torch_dataset_shard.map(shards, order_outputs=False, return_exceptions=True):
71 | # if isinstance(resp, Exception):
72 | # print(f"Exception: {resp}")
73 | # continue
74 | # print(resp)
75 |
76 | files = [f"data-{i:05d}-of-00989.parquet" for i in range(989)]
77 | files = files[2:]
78 | # files = [f"data-{i:05d}-of-00099.parquet" for i in range(99)]
79 |
80 | # process_dataset.remote(file, max_tokens=MAX_TOKENS, num_cpu=NUM_CPU)
81 | for resp in torch_dataset_shard.map(files, order_outputs=False, return_exceptions=True):
82 | if isinstance(resp, Exception):
83 | print(f"Exception: {resp}")
84 | continue
85 | print(resp)
86 |
87 |
88 |
--------------------------------------------------------------------------------
/upload.py:
--------------------------------------------------------------------------------
1 | """
2 | Upload a dataset from a modal volume to HuggingFace
3 | """
4 | from modal import App, Image, Volume, Secret
5 |
6 | # We first set out configuration variables for our script.
7 | DATASET_DIR = "/embeddings"
8 | VOLUME="embeddings"
9 | HF_REPO="enjalot/fineweb-edu-sample-10BT-chunked-500-nomic-text-v1.5-2"
10 | DIRECTORY = f"{DATASET_DIR}/fineweb-edu-sample-10BT-chunked-500-HF4"
11 |
12 | # We define our Modal Resources that we'll need
13 | volume = Volume.from_name(VOLUME, create_if_missing=True)
14 | image = Image.debian_slim(python_version="3.9").pip_install(
15 | "datasets==2.20.0", "huggingface_hub"
16 | )
17 | app = App(image=image) # Note: prior to April 2024, "app" was called "stub"
18 |
19 |
20 | # The default timeout is 5 minutes, per https://modal.com/docs/guide/timeouts#handling-timeouts,
21 | # but we override it here to 60000s
22 | # to avoid any potential timeout issues
23 | @app.function(
24 | volumes={DATASET_DIR: volume},
25 | timeout=60000,
26 | secrets=[Secret.from_name("huggingface-secret")],
27 | )
28 | def upload_dataset(directory, repo):
29 | import os
30 | import time
31 |
32 | from huggingface_hub import HfApi
33 | from datasets import load_from_disk
34 |
35 |
36 | api = HfApi(token=os.environ["HF_TOKEN"])
37 | api.create_repo(
38 | repo_id=repo,
39 | private=False,
40 | repo_type="dataset",
41 | exist_ok=True,
42 | )
43 |
44 | print("loading from disk")
45 | dataset=load_from_disk(directory)
46 |
47 | print(f"Pushing to hub {HF_REPO}")
48 | start = time.perf_counter()
49 | max_retries = 20
50 | for attempt in range(max_retries):
51 | try:
52 | # api.upload_folder(
53 | # folder_path=directory,
54 | # repo_id=repo,
55 | # repo_type="dataset",
56 | # multi_commits=True,
57 | # multi_commits_verbose=True,
58 | # )
59 | dataset.push_to_hub(repo, num_shards={"train": 99})
60 | break # Exit loop if upload is successful
61 | except Exception as e:
62 | if attempt < max_retries - 1:
63 | print(f"Attempt {attempt + 1} failed, retrying...")
64 | time.sleep(5) # Wait for 5 seconds before retrying
65 | else:
66 | print("Failed to upload after several attempts.")
67 | raise # Re-raise the last exception if all retries fail
68 | end = time.perf_counter()
69 | print(f"Uploaded in {end-start}s")
70 |
71 |
72 | @app.local_entrypoint()
73 | def main():
74 | upload_dataset.remote(DIRECTORY, HF_REPO)
75 |
76 |
--------------------------------------------------------------------------------
/volume.py:
--------------------------------------------------------------------------------
1 | """
2 | Copy a directory from one volume to another.
3 | I don't think you can use * in modal volume commands, so we need to copy each file individually.
4 | Probably a better way to do this though.
5 |
6 | python volume.py cp
7 | python volume.py rm
8 | """
9 | import os
10 | from tqdm import tqdm
11 |
12 | def automate_volume_copy():
13 | source_dir = "fineweb-edu-sample-10BT-chunked-500-HF4-torched"
14 | destination_dir = "fineweb-edu-sample-10BT-chunked-500-HF4-torched-shuffled"
15 |
16 | # Use tqdm to create a progress bar for the file copying process
17 | for i in tqdm(range(100), desc="Copying files"):
18 | file_index = f"{i:05d}"
19 | source_file = os.path.join(source_dir, f"shard_{file_index}.pt")
20 | destination_file = os.path.join(destination_dir, f"shard_{file_index}.pt")
21 |
22 | command = f"modal volume cp embeddings {source_file} {destination_file}"
23 | os.system(command) # Execute the command
24 |
25 | def automate_volume_rm():
26 | source_dir = "fineweb-edu-sample-10BT-chunked-500-HF4-torched"
27 |
28 |     # Use tqdm to create a progress bar for the file deletion process
29 | for i in tqdm(range(100), desc="Deleting files"):
30 | file_index = f"{i:05d}"
31 | source_file = os.path.join(source_dir, f"shard_{file_index}.pt")
32 |
33 | command = f"modal volume rm embeddings {source_file}"
34 | os.system(command) # Execute the command
35 |
36 |
37 | import sys
38 | import argparse
39 |
40 | def parse_arguments():
41 | parser = argparse.ArgumentParser(description="Copy or remove files in a volume.")
42 | parser.add_argument("command", choices=["cp", "rm"], help="Specify 'cp' to copy files or 'rm' to remove files.")
43 | return parser.parse_args()
44 |
45 | def main():
46 | args = parse_arguments()
47 | command = args.command
48 | if command == "cp":
49 | automate_volume_copy()
50 | elif command == "rm":
51 | automate_volume_rm()
52 | else:
53 | print("Invalid command. Use 'cp' to copy or 'rm' to remove.")
54 | sys.exit(1)
55 |
56 | if __name__ == "__main__":
57 | main()
58 |
--------------------------------------------------------------------------------