├── images
│   ├── ray_data.png
│   └── ray.drawio.png
└── README.md

/images/ray_data.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/npuichigo/blazing-fast-io-tutorial/HEAD/images/ray_data.png

--------------------------------------------------------------------------------
/images/ray.drawio.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/npuichigo/blazing-fast-io-tutorial/HEAD/images/ray.drawio.png

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Blazing Fast IO Tutorial

## TL;DR
Hugging Face Datasets and Ray Data are all you need.

![Ray Data](images/ray_data.png)

Load a dataset from the [Hugging Face Hub](https://huggingface.co/datasets) or from your own [loading script](https://huggingface.co/docs/datasets/create_dataset#from-local-files):
```python
from datasets import load_dataset

# From the Hugging Face Hub
dataset = load_dataset("lj_speech", split="train")

# From a loading script
dataset = load_dataset("path/to/your/script.py", split="train")
```

Convert the data to [Parquet](https://parquet.apache.org/) files. Hugging Face actually does this for you already if you load the auto-converted Parquet branch:
```python
ds = load_dataset("lj_speech", revision="refs/convert/parquet", data_dir="main")
```

For those who want to do it themselves, here is the code.

**Remember: Audio and Image files may not be embedded in the Parquet files if you call `dataset.to_parquet()` directly.**
```python
import os

from datasets import config, load_dataset
from datasets.features.features import require_decoding
from datasets.table import embed_table_storage
from datasets.utils.py_utils import convert_file_size_to_int
from tqdm import tqdm

# `dataset` is the Dataset loaded above, e.g.:
dataset = load_dataset("lj_speech", split="train")

data_dir = "output_parquets"
max_shard_size = "500MB"

# Columns (e.g. Audio, Image) whose external files need to be embedded as bytes.
decodable_columns = [
    k for k, v in dataset.features.items() if require_decoding(v, ignore_decode_attribute=True)
]
dataset_nbytes = dataset._estimate_nbytes()
max_shard_size = convert_file_size_to_int(max_shard_size or config.MAX_SHARD_SIZE)
num_shards = max(int(dataset_nbytes / max_shard_size) + 1, 1)
shards = (dataset.shard(num_shards=num_shards, index=i, contiguous=True) for i in range(num_shards))


# Embed Audio and Image files as bytes in the Parquet files
def shards_with_embedded_external_files(shards):
    for shard in shards:
        shard_format = shard.format
        shard = shard.with_format("arrow")
        shard = shard.map(
            embed_table_storage,
            batched=True,
            batch_size=1000,
            keep_in_memory=True,
        )
        shard = shard.with_format(**shard_format)
        yield shard


if decodable_columns:
    shards = shards_with_embedded_external_files(shards)

os.makedirs(data_dir, exist_ok=True)

for index, shard in tqdm(
    enumerate(shards),
    desc="Save the dataset shards",
    total=num_shards,
):
    shard_path = f"{data_dir}/{index:05d}-of-{num_shards:05d}.parquet"
    shard.to_parquet(shard_path)
```
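If you want to double-check that the media bytes really got embedded, a minimal sketch like the following can help; it reuses the `data_dir` and `num_shards` variables from the snippet above and simply reads one shard back with PyArrow:
```python
# Optional sanity check: inspect the first shard with PyArrow to confirm the
# audio column carries embedded bytes rather than local file paths.
import pyarrow.parquet as pq

table = pq.read_table(f"{data_dir}/{0:05d}-of-{num_shards:05d}.parquet")
print(table.schema)    # the "audio" column should be a struct with "bytes" and "path" fields
print(table.num_rows)
```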
Now we can load the Parquet files with [Ray Data](https://docs.ray.io/en/latest/data/data.html) to take advantage of its distributed computing capabilities. Remember to use your dataset's [Features](https://huggingface.co/docs/datasets/about_dataset_features) to decode the data correctly.
```python
import datasets
import ray

# Read the Parquet files (optionally with an explicit parallelism A)
ds = ray.data.read_parquet("output_parquets")  # , parallelism=A
# Decode with the dataset's Features (optionally with B CPUs per worker and autoscaling concurrency from C to D)
features = datasets.get_dataset_config_info("lj_speech").features
ds = ds.map(features.decode_example)  # , num_cpus=B, concurrency=(C, D)
```

Now you can use Ray Data's [API](https://docs.ray.io/en/latest/data/user-guide.html) to process your data efficiently and in parallel in a distributed way, even in batches on GPUs.

Finally, feed the data to your favorite deep learning framework for training. Most importantly, the data can easily be dispatched to multiple workers in distributed training without manually sharding by `rank` and `world_size`.
```python
# https://docs.ray.io/en/latest/data/iterating-over-data.html#splitting-datasets-for-distributed-parallel-training
import ray

@ray.remote
class Worker:

    def train(self, data_iterator):
        for batch in data_iterator.iter_batches(batch_size=8):
            pass

ds = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
workers = [Worker.remote() for _ in range(4)]
shards = ds.streaming_split(n=4, equal=True)
ray.get([w.train.remote(s) for w, s in zip(workers, shards)])
```

## Motivation
TensorFlow Datasets provides a complete solution from data generation to data loading, with [tfds](https://www.tensorflow.org/datasets) and [tf.data](https://www.tensorflow.org/api_docs/python/tf/data). Benefiting from its computational graph design, data loading is natively asynchronous and parallel, scheduled by the runtime to make full use of the available resources for maximum throughput.

Compared to TensorFlow Datasets, what PyTorch provides is at first little more than a toy. `torch.utils.data.Dataset` assumes random access to the underlying storage layer, which is not true for most storage systems. To make it perform well, you usually have to add an abstraction layer, such as a cache.

`IterableDataset` is a good attempt at solving this problem, but it's not enough. Data loading is still synchronous, and the parallelism is coarse-grained: `num_workers` is the only knob, which is not flexible enough. For example, you may want to add more workers to one [DataPipe](https://pytorch.org/data/beta/torchdata.datapipes.iter.html), say the [Mapper](https://pytorch.org/data/beta/generated/torchdata.datapipes.iter.Mapper.html#torchdata.datapipes.iter.Mapper), to speed up data processing while keeping the other workers unchanged. That is not possible unless you design your own DataPipe and handle the parallelism manually, perhaps with something like `unordered_imap`.

After the development of pytorch/data was [frozen](https://github.com/pytorch/data/issues/1196), I couldn't wait to find a more performant solution for data loading, especially for large-scale datasets.

### So what do we want from data loading?
1. Asynchronous and parallel data loading.
2. Fine-grained parallelism control.
3. GPU processing support.
4. Distributed data dispatching.
5. Easy scaling of computing resources up and down.

[Ray Data](https://docs.ray.io/en/latest/data/overview.html#data-overview) is a good fit for these requirements. It even pushes things further, using the actor model to provide a distributed data loading/processing solution. Reading the docs is enough to see the benefits, or refer to the Q&A below. But the part that attracts me most is the fine-grained parallelism control.
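To make that concrete, here is a minimal sketch of per-operator tuning in Ray Data. It assumes a recent Ray release whose `map` accepts `num_cpus` and `concurrency` (as in the decoding snippet above); `HeavyTransform` is a hypothetical placeholder for your own expensive step:
```python
import ray


class HeavyTransform:
    """Hypothetical expensive per-row transform (e.g. resampling audio)."""

    def __call__(self, row: dict) -> dict:
        # ... do the expensive work here ...
        return row


ds = ray.data.read_parquet("output_parquets")

# Cheap step: plain Ray tasks with the default parallelism.
ds = ds.map(lambda row: row)

# Expensive step: its own autoscaling actor pool of 2 to 8 workers, each
# reserving one CPU, while every other step is left untouched.
ds = ds.map(HeavyTransform, num_cpus=1, concurrency=(2, 8))
```
Contrast this with a PyTorch `DataLoader`, where a single `num_workers` knob applies to the whole pipeline.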

Now it's Ray's responsibility to schedule the concrete data reading and processing tasks so that all the computing resources are fully used. You don't need to worry about the details of distributed computing; just focus on your data processing logic.

That's the story of how I came to Ray Data.

## Q & A
### Why Hugging Face Datasets?
1. The Hugging Face Hub has become the de facto standard for sharing datasets. It hosts a large number of datasets that are useful for deep learning research and applications.
2. Hugging Face Datasets inherits its design from [TensorFlow Datasets](https://www.tensorflow.org/datasets) and provides a standard way to define a new dataset. Users can understand a dataset at a glance from its loading script, which is much better than countless ad-hoc data conversion scripts.
```python
class LJSpeech(datasets.GeneratorBasedBuilder):

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "audio": datasets.Audio(sampling_rate=22050),
                    "file": datasets.Value("string"),
                    "text": datasets.Value("string"),
                    "normalized_text": datasets.Value("string"),
                }
            ),
            supervised_keys=("file", "text"),
            homepage=_URL,
            citation=_CITATION,
            task_templates=[AutomaticSpeechRecognition(audio_column="audio", transcription_column="text")],
        )
```
3. Even without Ray, Hugging Face Datasets is already blazing fast, thanks to the [Arrow](https://arrow.apache.org/) storage format. Arrow also shares some history with Ray's development (Ray's Plasma object store was contributed to the Arrow project), so the two are a good match in terms of format compatibility.
4. It's easy to generate Parquet files from a Hugging Face dataset. We use Parquet to balance storage and loading cost, and it's a natural format for Ray Data to consume. (For the relationship between Arrow and Parquet, see [here](https://arrow.apache.org/faq/).)

### Why Ray Data?
1. Distributed data processing is hard; Ray Data makes it easy. It's a high-level API for distributed computing, so you don't need to worry about the low-level details yourself.
2. The actor model is a good fit for distributed data processing. Ray Data is built on top of Ray, a distributed computing framework based on the actor model, which makes it easy to scale computing resources up and down.

### Why not ray.data.from_huggingface?
1. Loading directly from a Hugging Face dataset is not parallelized, according to Ray's documentation.
--------------------------------------------------------------------------------