├── images
│   ├── ray_data.png
│   └── ray.drawio.png
└── README.md

/images/ray_data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/npuichigo/blazing-fast-io-tutorial/HEAD/images/ray_data.png

--------------------------------------------------------------------------------
/images/ray.drawio.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/npuichigo/blazing-fast-io-tutorial/HEAD/images/ray.drawio.png

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Blazing Fast IO Tutorial

## TL;DR
Hugging Face Datasets and Ray Data are all you need.

![Ray Data](images/ray_data.png)

Load a dataset from the [Hugging Face Hub](https://huggingface.co/datasets) or from your own [loading script](https://huggingface.co/docs/datasets/create_dataset#from-local-files):
```python
from datasets import load_dataset

# From the Hugging Face Hub
dataset = load_dataset("lj_speech", split="train")

# From a loading script
dataset = load_dataset("path/to/your/script.py", split="train")
```

Convert the data to [Parquet](https://parquet.apache.org/) files. Hugging Face actually does this for you already if you load the auto-converted Parquet branch:
```python
ds = load_dataset("lj_speech", revision="refs/convert/parquet", data_dir="main")
```

For those who want to do it themselves, here is the code.

**Remember: Audio and Image files may not be embedded in the Parquet files if you call `dataset.to_parquet()` directly.**
```python
import os

from datasets import config, load_dataset
from datasets.features.features import require_decoding
from datasets.table import embed_table_storage
from datasets.utils.py_utils import convert_file_size_to_int
from tqdm import tqdm

# `dataset` is the Dataset loaded above, e.g.:
dataset = load_dataset("lj_speech", split="train")

data_dir = "output_parquets"
max_shard_size = "500MB"

# Columns (e.g. Audio, Image) whose external files need to be embedded as bytes.
decodable_columns = [
    k for k, v in dataset.features.items() if require_decoding(v, ignore_decode_attribute=True)
]
dataset_nbytes = dataset._estimate_nbytes()
max_shard_size = convert_file_size_to_int(max_shard_size or config.MAX_SHARD_SIZE)
num_shards = max(int(dataset_nbytes / max_shard_size) + 1, 1)
shards = (dataset.shard(num_shards=num_shards, index=i, contiguous=True) for i in range(num_shards))


# Embed Audio and Image files as bytes in the Parquet files
def shards_with_embedded_external_files(shards):
    for shard in shards:
        shard_format = shard.format
        shard = shard.with_format("arrow")
        shard = shard.map(
            embed_table_storage,
            batched=True,
            batch_size=1000,
            keep_in_memory=True,
        )
        shard = shard.with_format(**shard_format)
        yield shard


if decodable_columns:
    shards = shards_with_embedded_external_files(shards)

os.makedirs(data_dir, exist_ok=True)

for index, shard in tqdm(
    enumerate(shards),
    desc="Save the dataset shards",
    total=num_shards,
):
    shard_path = f"{data_dir}/{index:05d}-of-{num_shards:05d}.parquet"
    shard.to_parquet(shard_path)
```
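If you want to double-check that the media bytes really got embedded, a minimal sketch like the following can help; it reuses the `data_dir` and `num_shards` variables from the snippet above and simply reads one shard back with PyArrow:
```python
# Optional sanity check: inspect the first shard with PyArrow to confirm the
# audio column carries embedded bytes rather than local file paths.
import pyarrow.parquet as pq

table = pq.read_table(f"{data_dir}/{0:05d}-of-{num_shards:05d}.parquet")
print(table.schema)    # the "audio" column should be a struct with "bytes" and "path" fields
print(table.num_rows)
```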
Now we can load the Parquet files with [Ray Data](https://docs.ray.io/en/latest/data/data.html) to take advantage of its distributed computing capabilities. Remember to use your dataset's [Features](https://huggingface.co/docs/datasets/about_dataset_features) to decode the data correctly.
```python
import datasets
import ray

# Read the Parquet files (optionally with an explicit parallelism A)
ds = ray.data.read_parquet("output_parquets")  # , parallelism=A
# Decode with the dataset's Features (optionally with B CPUs per worker and autoscaling concurrency from C to D)
features = datasets.get_dataset_config_info("lj_speech").features
ds = ds.map(features.decode_example)  # , num_cpus=B, concurrency=(C, D)
```

Now you can use Ray Data's [API](https://docs.ray.io/en/latest/data/user-guide.html) to process your data efficiently and in parallel in a distributed way, even in batches on GPUs.

Finally, feed the data to your favorite deep learning framework for training. Most importantly, the data can easily be dispatched to multiple workers in distributed training without manually sharding by `rank` and `world_size`.
```python
# https://docs.ray.io/en/latest/data/iterating-over-data.html#splitting-datasets-for-distributed-parallel-training
import ray

@ray.remote
class Worker:

    def train(self, data_iterator):
        for batch in data_iterator.iter_batches(batch_size=8):
            pass

ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
workers = [Worker.remote() for _ in range(4)]
shards = ds.streaming_split(n=4, equal=True)
ray.get([w.train.remote(s) for w, s in zip(workers, shards)])
```

## Motivation
TensorFlow Datasets provides a complete solution from data generation to data loading, with [tfds](https://www.tensorflow.org/datasets) and [tf.data](https://www.tensorflow.org/api_docs/python/tf/data). Benefiting from its computational graph design, data loading is natively asynchronous and parallel, scheduled by the runtime to make full use of the available resources for maximum throughput.

Compared to TensorFlow Datasets, what PyTorch provides is at first little more than a toy. `torch.utils.data.Dataset` assumes random access to the underlying storage layer, which is not true for most storage systems. To make it perform well, you usually have to add an abstraction layer, such as a cache.

`IterableDataset` is a good attempt at solving this problem, but it's not enough. Data loading is still synchronous, and the parallelism is coarse-grained: `num_workers` is the only knob, which is not flexible enough. For example, you may want to add more workers to one [DataPipe](https://pytorch.org/data/beta/torchdata.datapipes.iter.html), say the [Mapper](https://pytorch.org/data/beta/generated/torchdata.datapipes.iter.Mapper.html#torchdata.datapipes.iter.Mapper), to speed up data processing while keeping the other workers unchanged. That is not possible unless you design your own DataPipe and handle the parallelism manually, perhaps with something like `unordered_imap`.

After the development of pytorch/data was [frozen](https://github.com/pytorch/data/issues/1196), I couldn't wait to find a more performant solution for data loading, especially for large-scale datasets.

### So what do we want from data loading?
1. Asynchronous and parallel data loading.
2. Fine-grained parallelism control.
3. GPU processing support.
4. Distributed data dispatching.
5. Easy scaling of computing resources up and down.

[Ray Data](https://docs.ray.io/en/latest/data/overview.html#data-overview) is a good fit for these requirements. It even pushes things further, using the actor model to provide a distributed data loading/processing solution. Reading the docs is enough to see the benefits, or refer to the Q&A below. But the part that attracts me most is the fine-grained parallelism control.
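To make that concrete, here is a minimal sketch of per-operator tuning in Ray Data. It assumes a recent Ray release whose `map` accepts `num_cpus` and `concurrency` (as in the decoding snippet above); `HeavyTransform` is a hypothetical placeholder for your own expensive step:
```python
import ray


class HeavyTransform:
    """Hypothetical expensive per-row transform (e.g. resampling audio)."""

    def __call__(self, row: dict) -> dict:
        # ... do the expensive work here ...
        return row


ds = ray.data.read_parquet("output_parquets")

# Cheap step: plain Ray tasks with the default parallelism.
ds = ds.map(lambda row: row)

# Expensive step: its own autoscaling actor pool of 2 to 8 workers, each
# reserving one CPU, while every other step is left untouched.
ds = ds.map(HeavyTransform, num_cpus=1, concurrency=(2, 8))
```
Contrast this with a PyTorch `DataLoader`, where a single `num_workers` knob applies to the whole pipeline.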

Now it's Ray's responsibility to schedule the concrete data reading and processing tasks so that all the computing resources are fully used. You don't need to worry about the details of distributed computing; just focus on your data processing logic.

That's the story of how I came to Ray Data.

## Q & A
### Why Hugging Face Datasets?
1. The Hugging Face Hub has become the de facto standard for sharing datasets. It hosts a large number of datasets that are useful for deep learning research and applications.
2. Hugging Face Datasets inherits its design from [TensorFlow Datasets](https://www.tensorflow.org/datasets) and provides a standard way to define a new dataset. Users can understand a dataset at a glance from its loading script, which is much better than countless ad-hoc data conversion scripts.
```python
class LJSpeech(datasets.GeneratorBasedBuilder):

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "audio": datasets.Audio(sampling_rate=22050),
                    "file": datasets.Value("string"),
                    "text": datasets.Value("string"),
                    "normalized_text": datasets.Value("string"),
                }
            ),
            supervised_keys=("file", "text"),
            homepage=_URL,
            citation=_CITATION,
            task_templates=[AutomaticSpeechRecognition(audio_column="audio", transcription_column="text")],
        )
```
3. Even without Ray, Hugging Face Datasets is already blazing fast, thanks to the [Arrow](https://arrow.apache.org/) storage format. Arrow also shares some history with Ray's development (Ray's Plasma object store was contributed to the Arrow project), so the two are a good match in terms of format compatibility.
4. It's easy to generate Parquet files from a Hugging Face dataset. We use Parquet to balance storage and loading cost, and it's a natural format for Ray Data to consume. (For the relationship between Arrow and Parquet, see [here](https://arrow.apache.org/faq/).)

### Why Ray Data?
1. Distributed data processing is hard; Ray Data makes it easy. It's a high-level API for distributed computing, so you don't need to worry about the low-level details yourself.
2. The actor model is a good fit for distributed data processing. Ray Data is built on top of Ray, a distributed computing framework based on the actor model, which makes it easy to scale computing resources up and down.

### Why not ray.data.from_huggingface?
1. Loading directly from a Hugging Face dataset is not parallelized, according to Ray's documentation.
--------------------------------------------------------------------------------