# Thesios: Google Synthesized I/O Traces for Storage Servers and Disks

This repository describes I/O traces of Google storage servers and disks
synthesized by Thesios. Thesios synthesizes representative I/O traces by
combining down-sampled I/O traces collected from multiple disks (HDDs) attached
to multiple storage servers in Google's distributed storage system.

Please refer to our [paper](https://dl.acm.org/doi/10.1145/3620666.3651337) for
more details on the synthesis methodology, validation, and example use cases. If
you use this data in your research, please cite:

```
@inproceedings{thesios,
  author = {Phothilimthana, Phitchaya Mangpo and Kadekodi, Saurabh and Ghodrati, Soroush and Moon, Selene and Maas, Martin},
  title = {Thesios: Synthesizing Accurate Counterfactual I/O Traces from I/O Samples},
  year = {2024},
  isbn = {9798400703867},
  publisher = {Association for Computing Machinery},
  url = {https://doi.org/10.1145/3620666.3651337},
  doi = {10.1145/3620666.3651337},
  series = {ASPLOS '24}
}
```

## Data Description

The data includes I/O traces from 2024/01/15 to 2024/03/15 from three different
clusters (with different types of traffic). A one-day trace is stored as sharded
CSV files, named `{cluster}_{disk_size}/{yyyymmdd}/data*`. Each trace contains
I/O requests to a storage server with a single disk (HDD) attached.
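
Because each one-day trace is split across several `data*` shards, a reader
usually wants to treat one day's directory as a single stream of requests. The
following is a minimal sketch of that, not an official loader. It assumes the
shards have already been downloaded locally, are uncompressed, and each begin
with a header row of field names; none of this is confirmed by this README. The
helper names `iter_requests` and `count_by_field` are hypothetical.

```python
import csv
import glob
import os
from collections import Counter


def iter_requests(day_dir):
    """Yield one dict per I/O request from every shard in a one-day directory."""
    # Shards follow the `data*` naming pattern; sort for a deterministic order.
    for shard in sorted(glob.glob(os.path.join(day_dir, "data*"))):
        with open(shard, newline="") as f:
            # Assumption: each shard starts with a header row of field names.
            yield from csv.DictReader(f)


def count_by_field(day_dir, field):
    """Tally the requests of a one-day trace by the value of one column."""
    return Counter(row[field] for row in iter_requests(day_dir))
```

For example, `count_by_field("<cluster>_<disk_size>/20240115", "op_type")` would
tally requests by operation type for one day, with the placeholder path replaced
by a real local directory. Rows are streamed one at a time, so a full day does
not need to fit in memory.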

| Field                            | Description                               |
| -------------------------------- | ----------------------------------------- |
| *--- collected fields ---*       |                                           |
| `filename`                       | Local filename (hashed)                   |
| `application`                    | Application owner of the file (hashed except for `spanner`, `bigtable`, or `blobstore`) |
| `file_offset`                    | File offset                               |
| `c_time`                         | Inode change time in Unix seconds since epoch |
| `io_zone`                        | `WARM`, `COLD`, or `UNKNOWN`              |
| `redundancy_type`                | `REPLICATED` or `ERASURE_CODED`           |
| `op_type`                        | `READ` or `WRITE`                         |
| `service_class`                  | Request's priority: `LATENCY_SENSITIVE`, `THROUGHPUT_ORIENTED`, or `OTHER` |
| `from_flash_cache`               | Whether the request is from flash cache   |
| `cache_hit`                      | Whether the request is served by the server's buffer cache: hit = 1, miss = 0, write = -1 (n/a) |
| `request_io_size_bytes`          | Size of the request in bytes              |
| `response_io_size_bytes`         | Size of the response in bytes             |
| `disk_io_size_bytes`             | Size of the disk operation in bytes (0 for cache hits and writes) |
| `start_time`                     | Request's arrival time at the server in Unix seconds since epoch |
| `disk_time`                      | Disk read time in seconds (0 for cache hits and writes) |
| *--- synthesized fields ---*     |                                           |
| `simulated_disk_start_time`      | Start time of the disk read (disk-level) in Unix seconds since epoch (0 for cache hits and writes) |
| `simulated_latency`              | Latency (server-level) in seconds from arrival time to response time at the server. For cache-miss reads, the value is adjusted by a simulator. For cache hits and writes, the value is the real measurement from the original trace. |

### Disclaimers

1.  The synthesized traces do not include I/O requests to archival files stored
    in the innermost partition of a disk. However, the number of requests to
    archival files is negligible compared to accesses to non-archival files.
2.  The disk size in the folder name is just one instance of one disk type
    in the cluster.
3.  I/O that is absorbed by the flash caching layer is not included in these
    traces.


## Download

The dataset is located at https://console.cloud.google.com/storage/browser/thesios-io-traces.
You can use the `gcloud`, `wget`, or `curl` command-line utilities to download
files, or download them directly from the webpage.


## Contact

For questions or concerns, contact Mangpo Phothilimthana (`mangpo AT google DOT com`) or Saurabh Kadekodi (`saukad AT google DOT com`).


## CC-BY License

This work is licensed under the Creative Commons Attribution 4.0 International
License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative
Commons, PO Box 1866, Mountain View, CA 94042, USA.