├── .github ├── dependabot.yml ├── mergify.yml └── workflows │ └── rust.yml ├── .gitignore ├── CHANGELOG.md ├── Cargo.lock ├── Cargo.toml ├── README.md ├── ci ├── before_deploy.sh ├── install.sh ├── script.sh └── utils.sh ├── src ├── cli.rs ├── error.rs ├── io.rs ├── lib.rs ├── main.rs ├── split.rs └── split │ ├── single.rs │ ├── splits.rs │ ├── splitter.rs │ └── writer.rs └── tests ├── cli_tests.rs └── cmd ├── help-split.stderr ├── help-split.stdout ├── help-split.toml ├── help.stderr ├── help.stdout └── help.toml /.github/dependabot.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | updates: 3 | - package-ecosystem: cargo 4 | directory: "/" 5 | schedule: 6 | interval: daily 7 | open-pull-requests-limit: 10 8 | reviewers: 9 | - sd2k 10 | assignees: 11 | - sd2k 12 | groups: 13 | rust-dependencies: 14 | patterns: 15 | - "*" 16 | -------------------------------------------------------------------------------- /.github/mergify.yml: -------------------------------------------------------------------------------- 1 | pull_request_rules: 2 | - name: Automatic merge for Dependabot pull requests 3 | conditions: 4 | - author=dependabot[bot] 5 | actions: 6 | merge: 7 | method: squash 8 | 9 | - name: Automatic update to the main branch for pull requests 10 | conditions: 11 | - -conflict # skip PRs with conflicts 12 | - -draft # skip GH draft PRs 13 | - -author=dependabot[bot] # skip dependabot PRs 14 | actions: 15 | update: 16 | -------------------------------------------------------------------------------- /.github/workflows/rust.yml: -------------------------------------------------------------------------------- 1 | name: Rust 2 | 3 | on: 4 | push: 5 | pull_request: 6 | branches: 7 | - master 8 | 9 | jobs: 10 | build: 11 | 12 | runs-on: ubuntu-latest 13 | 14 | steps: 15 | - uses: actions/checkout@v2 16 | - name: Install stable Rust with clippy and rustfmt 17 | uses: actions-rs/toolchain@v1 18 | with: 19 | profile: 
minimal 20 | toolchain: stable 21 | override: true 22 | components: rustfmt, clippy 23 | - name: Run fmt 24 | run: cargo fmt -- --check 25 | - name: Run clippy 26 | run: cargo clippy -- -D warnings 27 | - name: Run tests 28 | run: cargo test --verbose 29 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | /deployment 2 | /target 3 | **/*.rs.bk 4 | 5 | # Sublime Text 6 | *.sublime-project 7 | *.sublime-workspace 8 | 9 | *.csv 10 | *.csv.gz 11 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | 3 | All notable changes to this project will be documented in this file. 4 | 5 | The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), 6 | and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 7 | 8 | ## [Unreleased] 9 | 10 | ## [0.4.0] - 2020-05-12 11 | ### Added 12 | 13 | - Add a flag (`-n / --no-header`) to treat the input as if there is no header row (i.e. to avoid sending the first row to each split / chunk). 14 | 15 | ### Changed 16 | 17 | - Changed the defaults to **not** decompress inputs / compress outputs. This is a breaking change but should be a less surprising default. 18 | - Explicitly add jemalloc as the global allocator. 19 | 20 | ## [0.3.0] - 2019-09-24 21 | ### Added 22 | 23 | - Add a flag (`--csv`) to parse input as CSV rather than just treating as newline delimited. This is only really needed if files contain embedded newlines, and will impact performance, so should be used sparingly! 24 | - Add a short flag for uncompressed output (`-U`). 25 | 26 | ### Fixed 27 | 28 | - Allow proportions of 1.0 to be specified. 
29 | 30 | ## [0.2.2] - 2018-11-14 31 | ### Fixed 32 | 33 | - Fix an off-by-one error when there are unknown total rows. 34 | - Fix a bug where the header wasn't sent to additional chunks. 35 | 36 | ### Added 37 | 38 | - Added examples to README. 39 | 40 | ## [0.2.1] - 2018-11-09 41 | ### Changed 42 | 43 | - Updated dependencies ready for first crates.io release. 44 | - Internal crate modifications. 45 | 46 | ## [0.2.0] - 2018-10-30 47 | ### Fixed 48 | 49 | - Improve errors if proportion is less than 0.0 or greater than 1.0. 50 | 51 | ### Changed 52 | 53 | - Don't try to infer compression from input. 54 | 55 | ## [0.1.0] - 2018-10-18 56 | ### Added 57 | 58 | - First version of the crate. 59 | 60 | [Unreleased]: https://github.com/sd2k/ttv/compare/v0.4.0...HEAD 61 | [0.4.0]: https://github.com/sd2k/ttv/compare/v0.3.0...v0.4.0 62 | [0.3.0]: https://github.com/sd2k/ttv/compare/v0.2.2...v0.3.0 63 | [0.2.2]: https://github.com/sd2k/ttv/compare/v0.2.1...v0.2.2 64 | [0.2.1]: https://github.com/sd2k/ttv/compare/v0.2.0...v0.2.1 65 | [0.2.0]: https://github.com/sd2k/ttv/compare/v0.1.0...v0.2.0 66 | [0.1.0]: https://github.com/sd2k/ttv/releases/tag/v0.1.0 67 | -------------------------------------------------------------------------------- /Cargo.lock: -------------------------------------------------------------------------------- 1 | # This file is automatically @generated by Cargo. 2 | # It is not intended for manual editing. 
3 | version = 4 4 | 5 | [[package]] 6 | name = "adler2" 7 | version = "2.0.0" 8 | source = "registry+https://github.com/rust-lang/crates.io-index" 9 | checksum = "512761e0bb2578dd7380c6baaa0f4ce03e84f95e960231d1dec8bf4d7d6e2627" 10 | 11 | [[package]] 12 | name = "aho-corasick" 13 | version = "1.1.2" 14 | source = "registry+https://github.com/rust-lang/crates.io-index" 15 | checksum = "b2969dcb958b36655471fc61f7e416fa76033bdd4bfed0678d8fee1e2d07a1f0" 16 | dependencies = [ 17 | "memchr", 18 | ] 19 | 20 | [[package]] 21 | name = "anstream" 22 | version = "0.6.18" 23 | source = "registry+https://github.com/rust-lang/crates.io-index" 24 | checksum = "8acc5369981196006228e28809f761875c0327210a891e941f4c683b3a99529b" 25 | dependencies = [ 26 | "anstyle", 27 | "anstyle-parse", 28 | "anstyle-query", 29 | "anstyle-wincon", 30 | "colorchoice", 31 | "is_terminal_polyfill", 32 | "utf8parse", 33 | ] 34 | 35 | [[package]] 36 | name = "anstyle" 37 | version = "1.0.10" 38 | source = "registry+https://github.com/rust-lang/crates.io-index" 39 | checksum = "55cc3b69f167a1ef2e161439aa98aed94e6028e5f9a59be9a6ffb47aef1651f9" 40 | 41 | [[package]] 42 | name = "anstyle-parse" 43 | version = "0.2.2" 44 | source = "registry+https://github.com/rust-lang/crates.io-index" 45 | checksum = "317b9a89c1868f5ea6ff1d9539a69f45dffc21ce321ac1fd1160dfa48c8e2140" 46 | dependencies = [ 47 | "utf8parse", 48 | ] 49 | 50 | [[package]] 51 | name = "anstyle-query" 52 | version = "1.0.0" 53 | source = "registry+https://github.com/rust-lang/crates.io-index" 54 | checksum = "5ca11d4be1bab0c8bc8734a9aa7bf4ee8316d462a08c6ac5052f888fef5b494b" 55 | dependencies = [ 56 | "windows-sys 0.48.0", 57 | ] 58 | 59 | [[package]] 60 | name = "anstyle-wincon" 61 | version = "3.0.6" 62 | source = "registry+https://github.com/rust-lang/crates.io-index" 63 | checksum = "2109dbce0e72be3ec00bed26e6a7479ca384ad226efdd66db8fa2e3a38c83125" 64 | dependencies = [ 65 | "anstyle", 66 | "windows-sys 0.59.0", 67 | ] 68 | 69 | [[package]] 70 
| name = "atty" 71 | version = "0.2.14" 72 | source = "registry+https://github.com/rust-lang/crates.io-index" 73 | checksum = "d9b39be18770d11421cdb1b9947a45dd3f37e93092cbf377614828a319d5fee8" 74 | dependencies = [ 75 | "hermit-abi", 76 | "libc", 77 | "winapi", 78 | ] 79 | 80 | [[package]] 81 | name = "autocfg" 82 | version = "1.1.0" 83 | source = "registry+https://github.com/rust-lang/crates.io-index" 84 | checksum = "d468802bab17cbc0cc575e9b053f41e72aa36bfa6b7f55e3529ffa43161b97fa" 85 | 86 | [[package]] 87 | name = "automod" 88 | version = "1.0.14" 89 | source = "registry+https://github.com/rust-lang/crates.io-index" 90 | checksum = "edf3ee19dbc0a46d740f6f0926bde8c50f02bdbc7b536842da28f6ac56513a8b" 91 | dependencies = [ 92 | "proc-macro2", 93 | "quote", 94 | "syn 2.0.87", 95 | ] 96 | 97 | [[package]] 98 | name = "bitflags" 99 | version = "1.3.2" 100 | source = "registry+https://github.com/rust-lang/crates.io-index" 101 | checksum = "bef38d45163c2f1dde094a7dfd33ccf595c92905c8f8f4fdc18d06fb1037718a" 102 | 103 | [[package]] 104 | name = "bitflags" 105 | version = "2.4.1" 106 | source = "registry+https://github.com/rust-lang/crates.io-index" 107 | checksum = "327762f6e5a765692301e5bb513e0d9fef63be86bbc14528052b1cd3e6f03e07" 108 | 109 | [[package]] 110 | name = "bumpalo" 111 | version = "3.16.0" 112 | source = "registry+https://github.com/rust-lang/crates.io-index" 113 | checksum = "79296716171880943b8470b5f8d03aa55eb2e645a4874bdbb28adb49162e012c" 114 | 115 | [[package]] 116 | name = "cc" 117 | version = "1.0.83" 118 | source = "registry+https://github.com/rust-lang/crates.io-index" 119 | checksum = "f1174fb0b6ec23863f8b971027804a42614e347eafb0a95bf0b12cdae21fc4d0" 120 | dependencies = [ 121 | "libc", 122 | ] 123 | 124 | [[package]] 125 | name = "cfg-if" 126 | version = "1.0.0" 127 | source = "registry+https://github.com/rust-lang/crates.io-index" 128 | checksum = "baf1de4339761588bc0619e3cbc0120ee582ebb74b53b4efbf79117bd2da40fd" 129 | 130 | [[package]] 131 | name = 
"clap" 132 | version = "3.2.25" 133 | source = "registry+https://github.com/rust-lang/crates.io-index" 134 | checksum = "4ea181bf566f71cb9a5d17a59e1871af638180a18fb0035c92ae62b705207123" 135 | dependencies = [ 136 | "atty", 137 | "bitflags 1.3.2", 138 | "clap_derive", 139 | "clap_lex", 140 | "indexmap 1.9.3", 141 | "once_cell", 142 | "strsim", 143 | "termcolor", 144 | "textwrap", 145 | "yaml-rust", 146 | ] 147 | 148 | [[package]] 149 | name = "clap_derive" 150 | version = "3.2.25" 151 | source = "registry+https://github.com/rust-lang/crates.io-index" 152 | checksum = "ae6371b8bdc8b7d3959e9cf7b22d4435ef3e79e138688421ec654acf8c81b008" 153 | dependencies = [ 154 | "heck", 155 | "proc-macro-error", 156 | "proc-macro2", 157 | "quote", 158 | "syn 1.0.109", 159 | ] 160 | 161 | [[package]] 162 | name = "clap_lex" 163 | version = "0.2.4" 164 | source = "registry+https://github.com/rust-lang/crates.io-index" 165 | checksum = "2850f2f5a82cbf437dd5af4d49848fbdfc27c157c3d010345776f952765261c5" 166 | dependencies = [ 167 | "os_str_bytes", 168 | ] 169 | 170 | [[package]] 171 | name = "colorchoice" 172 | version = "1.0.0" 173 | source = "registry+https://github.com/rust-lang/crates.io-index" 174 | checksum = "acbf1af155f9b9ef647e42cdc158db4b64a1b61f743629225fde6f3e0be2a7c7" 175 | 176 | [[package]] 177 | name = "console" 178 | version = "0.15.7" 179 | source = "registry+https://github.com/rust-lang/crates.io-index" 180 | checksum = "c926e00cc70edefdc64d3a5ff31cc65bb97a3460097762bd23afb4d8145fccf8" 181 | dependencies = [ 182 | "encode_unicode", 183 | "lazy_static", 184 | "libc", 185 | "unicode-width 0.1.11", 186 | "windows-sys 0.45.0", 187 | ] 188 | 189 | [[package]] 190 | name = "content_inspector" 191 | version = "0.2.4" 192 | source = "registry+https://github.com/rust-lang/crates.io-index" 193 | checksum = "b7bda66e858c683005a53a9a60c69a4aca7eeaa45d124526e389f7aec8e62f38" 194 | dependencies = [ 195 | "memchr", 196 | ] 197 | 198 | [[package]] 199 | name = "crc32fast" 200 | version 
= "1.3.2" 201 | source = "registry+https://github.com/rust-lang/crates.io-index" 202 | checksum = "b540bd8bc810d3885c6ea91e2018302f68baba2129ab3e88f32389ee9370880d" 203 | dependencies = [ 204 | "cfg-if", 205 | ] 206 | 207 | [[package]] 208 | name = "crossbeam-deque" 209 | version = "0.8.3" 210 | source = "registry+https://github.com/rust-lang/crates.io-index" 211 | checksum = "ce6fd6f855243022dcecf8702fef0c297d4338e226845fe067f6341ad9fa0cef" 212 | dependencies = [ 213 | "cfg-if", 214 | "crossbeam-epoch", 215 | "crossbeam-utils", 216 | ] 217 | 218 | [[package]] 219 | name = "crossbeam-epoch" 220 | version = "0.9.15" 221 | source = "registry+https://github.com/rust-lang/crates.io-index" 222 | checksum = "ae211234986c545741a7dc064309f67ee1e5ad243d0e48335adc0484d960bcc7" 223 | dependencies = [ 224 | "autocfg", 225 | "cfg-if", 226 | "crossbeam-utils", 227 | "memoffset", 228 | "scopeguard", 229 | ] 230 | 231 | [[package]] 232 | name = "crossbeam-utils" 233 | version = "0.8.16" 234 | source = "registry+https://github.com/rust-lang/crates.io-index" 235 | checksum = "5a22b2d63d4d1dc0b7f1b6b2747dd0088008a9be28b6ddf0b1e7d335e3037294" 236 | dependencies = [ 237 | "cfg-if", 238 | ] 239 | 240 | [[package]] 241 | name = "csv" 242 | version = "1.3.1" 243 | source = "registry+https://github.com/rust-lang/crates.io-index" 244 | checksum = "acdc4883a9c96732e4733212c01447ebd805833b7275a73ca3ee080fd77afdaf" 245 | dependencies = [ 246 | "csv-core", 247 | "itoa", 248 | "ryu", 249 | "serde", 250 | ] 251 | 252 | [[package]] 253 | name = "csv-core" 254 | version = "0.1.11" 255 | source = "registry+https://github.com/rust-lang/crates.io-index" 256 | checksum = "5efa2b3d7902f4b634a20cae3c9c4e6209dc4779feb6863329607560143efa70" 257 | dependencies = [ 258 | "memchr", 259 | ] 260 | 261 | [[package]] 262 | name = "dunce" 263 | version = "1.0.4" 264 | source = "registry+https://github.com/rust-lang/crates.io-index" 265 | checksum = "56ce8c6da7551ec6c462cbaf3bfbc75131ebbfa1c944aeaa9dab51ca1c5f0c3b" 
266 | 267 | [[package]] 268 | name = "either" 269 | version = "1.9.0" 270 | source = "registry+https://github.com/rust-lang/crates.io-index" 271 | checksum = "a26ae43d7bcc3b814de94796a5e736d4029efb0ee900c12e2d54c993ad1a1e07" 272 | 273 | [[package]] 274 | name = "encode_unicode" 275 | version = "0.3.6" 276 | source = "registry+https://github.com/rust-lang/crates.io-index" 277 | checksum = "a357d28ed41a50f9c765dbfe56cbc04a64e53e5fc58ba79fbc34c10ef3df831f" 278 | 279 | [[package]] 280 | name = "env_filter" 281 | version = "0.1.2" 282 | source = "registry+https://github.com/rust-lang/crates.io-index" 283 | checksum = "4f2c92ceda6ceec50f43169f9ee8424fe2db276791afde7b2cd8bc084cb376ab" 284 | dependencies = [ 285 | "log", 286 | "regex", 287 | ] 288 | 289 | [[package]] 290 | name = "env_logger" 291 | version = "0.11.6" 292 | source = "registry+https://github.com/rust-lang/crates.io-index" 293 | checksum = "dcaee3d8e3cfc3fd92428d477bc97fc29ec8716d180c0d74c643bb26166660e0" 294 | dependencies = [ 295 | "anstream", 296 | "anstyle", 297 | "env_filter", 298 | "humantime", 299 | "log", 300 | ] 301 | 302 | [[package]] 303 | name = "equivalent" 304 | version = "1.0.1" 305 | source = "registry+https://github.com/rust-lang/crates.io-index" 306 | checksum = "5443807d6dff69373d433ab9ef5378ad8df50ca6298caf15de6e52e24aaf54d5" 307 | 308 | [[package]] 309 | name = "errno" 310 | version = "0.3.5" 311 | source = "registry+https://github.com/rust-lang/crates.io-index" 312 | checksum = "ac3e13f66a2f95e32a39eaa81f6b95d42878ca0e1db0c7543723dfe12557e860" 313 | dependencies = [ 314 | "libc", 315 | "windows-sys 0.48.0", 316 | ] 317 | 318 | [[package]] 319 | name = "fastrand" 320 | version = "2.0.1" 321 | source = "registry+https://github.com/rust-lang/crates.io-index" 322 | checksum = "25cbce373ec4653f1a01a31e8a5e5ec0c622dc27ff9c4e6606eefef5cbbed4a5" 323 | 324 | [[package]] 325 | name = "filetime" 326 | version = "0.2.22" 327 | source = "registry+https://github.com/rust-lang/crates.io-index" 328 | 
checksum = "d4029edd3e734da6fe05b6cd7bd2960760a616bd2ddd0d59a0124746d6272af0" 329 | dependencies = [ 330 | "cfg-if", 331 | "libc", 332 | "redox_syscall", 333 | "windows-sys 0.48.0", 334 | ] 335 | 336 | [[package]] 337 | name = "flate2" 338 | version = "1.1.1" 339 | source = "registry+https://github.com/rust-lang/crates.io-index" 340 | checksum = "7ced92e76e966ca2fd84c8f7aa01a4aea65b0eb6648d72f7c8f3e2764a67fece" 341 | dependencies = [ 342 | "crc32fast", 343 | "miniz_oxide", 344 | ] 345 | 346 | [[package]] 347 | name = "getrandom" 348 | version = "0.3.1" 349 | source = "registry+https://github.com/rust-lang/crates.io-index" 350 | checksum = "43a49c392881ce6d5c3b8cb70f98717b7c07aabbdff06687b9030dbfbe2725f8" 351 | dependencies = [ 352 | "cfg-if", 353 | "libc", 354 | "wasi", 355 | "windows-targets 0.52.6", 356 | ] 357 | 358 | [[package]] 359 | name = "glob" 360 | version = "0.3.1" 361 | source = "registry+https://github.com/rust-lang/crates.io-index" 362 | checksum = "d2fabcfbdc87f4758337ca535fb41a6d701b65693ce38287d856d1674551ec9b" 363 | 364 | [[package]] 365 | name = "hashbrown" 366 | version = "0.12.3" 367 | source = "registry+https://github.com/rust-lang/crates.io-index" 368 | checksum = "8a9ee70c43aaf417c914396645a0fa852624801b24ebb7ae78fe8272889ac888" 369 | 370 | [[package]] 371 | name = "hashbrown" 372 | version = "0.14.1" 373 | source = "registry+https://github.com/rust-lang/crates.io-index" 374 | checksum = "7dfda62a12f55daeae5015f81b0baea145391cb4520f86c248fc615d72640d12" 375 | 376 | [[package]] 377 | name = "heck" 378 | version = "0.4.1" 379 | source = "registry+https://github.com/rust-lang/crates.io-index" 380 | checksum = "95505c38b4572b2d910cecb0281560f54b440a19336cbbcb27bf6ce6adc6f5a8" 381 | 382 | [[package]] 383 | name = "hermit-abi" 384 | version = "0.1.19" 385 | source = "registry+https://github.com/rust-lang/crates.io-index" 386 | checksum = "62b467343b94ba476dcb2500d242dadbb39557df889310ac77c5d99100aaac33" 387 | dependencies = [ 388 | "libc", 389 | ] 
390 | 391 | [[package]] 392 | name = "humantime" 393 | version = "2.1.0" 394 | source = "registry+https://github.com/rust-lang/crates.io-index" 395 | checksum = "9a3a5bfb195931eeb336b2a7b4d761daec841b97f947d34394601737a7bba5e4" 396 | 397 | [[package]] 398 | name = "humantime-serde" 399 | version = "1.1.1" 400 | source = "registry+https://github.com/rust-lang/crates.io-index" 401 | checksum = "57a3db5ea5923d99402c94e9feb261dc5ee9b4efa158b0315f788cf549cc200c" 402 | dependencies = [ 403 | "humantime", 404 | "serde", 405 | ] 406 | 407 | [[package]] 408 | name = "indexmap" 409 | version = "1.9.3" 410 | source = "registry+https://github.com/rust-lang/crates.io-index" 411 | checksum = "bd070e393353796e801d209ad339e89596eb4c8d430d18ede6a1cced8fafbd99" 412 | dependencies = [ 413 | "autocfg", 414 | "hashbrown 0.12.3", 415 | ] 416 | 417 | [[package]] 418 | name = "indexmap" 419 | version = "2.0.2" 420 | source = "registry+https://github.com/rust-lang/crates.io-index" 421 | checksum = "8adf3ddd720272c6ea8bf59463c04e0f93d0bbf7c5439b691bca2987e0270897" 422 | dependencies = [ 423 | "equivalent", 424 | "hashbrown 0.14.1", 425 | ] 426 | 427 | [[package]] 428 | name = "indicatif" 429 | version = "0.17.11" 430 | source = "registry+https://github.com/rust-lang/crates.io-index" 431 | checksum = "183b3088984b400f4cfac3620d5e076c84da5364016b4f49473de574b2586235" 432 | dependencies = [ 433 | "console", 434 | "number_prefix", 435 | "portable-atomic", 436 | "unicode-width 0.2.0", 437 | "web-time", 438 | ] 439 | 440 | [[package]] 441 | name = "is_terminal_polyfill" 442 | version = "1.70.1" 443 | source = "registry+https://github.com/rust-lang/crates.io-index" 444 | checksum = "7943c866cc5cd64cbc25b2e01621d07fa8eb2a1a23160ee81ce38704e97b8ecf" 445 | 446 | [[package]] 447 | name = "itoa" 448 | version = "1.0.9" 449 | source = "registry+https://github.com/rust-lang/crates.io-index" 450 | checksum = "af150ab688ff2122fcef229be89cb50dd66af9e01a4ff320cc137eecc9bacc38" 451 | 452 | [[package]] 453 | 
name = "jemalloc-sys" 454 | version = "0.5.4+5.3.0-patched" 455 | source = "registry+https://github.com/rust-lang/crates.io-index" 456 | checksum = "ac6c1946e1cea1788cbfde01c993b52a10e2da07f4bac608228d1bed20bfebf2" 457 | dependencies = [ 458 | "cc", 459 | "libc", 460 | ] 461 | 462 | [[package]] 463 | name = "jemallocator" 464 | version = "0.5.4" 465 | source = "registry+https://github.com/rust-lang/crates.io-index" 466 | checksum = "a0de374a9f8e63150e6f5e8a60cc14c668226d7a347d8aee1a45766e3c4dd3bc" 467 | dependencies = [ 468 | "jemalloc-sys", 469 | "libc", 470 | ] 471 | 472 | [[package]] 473 | name = "js-sys" 474 | version = "0.3.72" 475 | source = "registry+https://github.com/rust-lang/crates.io-index" 476 | checksum = "6a88f1bda2bd75b0452a14784937d796722fdebfe50df998aeb3f0b7603019a9" 477 | dependencies = [ 478 | "wasm-bindgen", 479 | ] 480 | 481 | [[package]] 482 | name = "lazy_static" 483 | version = "1.4.0" 484 | source = "registry+https://github.com/rust-lang/crates.io-index" 485 | checksum = "e2abad23fbc42b3700f2f279844dc832adb2b2eb069b2df918f455c4e18cc646" 486 | 487 | [[package]] 488 | name = "libc" 489 | version = "0.2.169" 490 | source = "registry+https://github.com/rust-lang/crates.io-index" 491 | checksum = "b5aba8db14291edd000dfcc4d620c7ebfb122c613afb886ca8803fa4e128a20a" 492 | 493 | [[package]] 494 | name = "linked-hash-map" 495 | version = "0.5.6" 496 | source = "registry+https://github.com/rust-lang/crates.io-index" 497 | checksum = "0717cef1bc8b636c6e1c1bbdefc09e6322da8a9321966e8928ef80d20f7f770f" 498 | 499 | [[package]] 500 | name = "linux-raw-sys" 501 | version = "0.4.10" 502 | source = "registry+https://github.com/rust-lang/crates.io-index" 503 | checksum = "da2479e8c062e40bf0066ffa0bc823de0a9368974af99c9f6df941d2c231e03f" 504 | 505 | [[package]] 506 | name = "log" 507 | version = "0.4.27" 508 | source = "registry+https://github.com/rust-lang/crates.io-index" 509 | checksum = "13dc2df351e3202783a1fe0d44375f7295ffb4049267b0f3018346dc122a1d94" 510 | 
511 | [[package]] 512 | name = "memchr" 513 | version = "2.6.4" 514 | source = "registry+https://github.com/rust-lang/crates.io-index" 515 | checksum = "f665ee40bc4a3c5590afb1e9677db74a508659dfd71e126420da8274909a0167" 516 | 517 | [[package]] 518 | name = "memoffset" 519 | version = "0.9.0" 520 | source = "registry+https://github.com/rust-lang/crates.io-index" 521 | checksum = "5a634b1c61a95585bd15607c6ab0c4e5b226e695ff2800ba0cdccddf208c406c" 522 | dependencies = [ 523 | "autocfg", 524 | ] 525 | 526 | [[package]] 527 | name = "miniz_oxide" 528 | version = "0.8.5" 529 | source = "registry+https://github.com/rust-lang/crates.io-index" 530 | checksum = "8e3e04debbb59698c15bacbb6d93584a8c0ca9cc3213cb423d31f760d8843ce5" 531 | dependencies = [ 532 | "adler2", 533 | ] 534 | 535 | [[package]] 536 | name = "normalize-line-endings" 537 | version = "0.3.0" 538 | source = "registry+https://github.com/rust-lang/crates.io-index" 539 | checksum = "61807f77802ff30975e01f4f071c8ba10c022052f98b3294119f3e615d13e5be" 540 | 541 | [[package]] 542 | name = "number_prefix" 543 | version = "0.4.0" 544 | source = "registry+https://github.com/rust-lang/crates.io-index" 545 | checksum = "830b246a0e5f20af87141b25c173cd1b609bd7779a4617d6ec582abaf90870f3" 546 | 547 | [[package]] 548 | name = "once_cell" 549 | version = "1.18.0" 550 | source = "registry+https://github.com/rust-lang/crates.io-index" 551 | checksum = "dd8b5dd2ae5ed71462c540258bedcb51965123ad7e7ccf4b9a8cafaa4a63576d" 552 | 553 | [[package]] 554 | name = "os_pipe" 555 | version = "1.1.4" 556 | source = "registry+https://github.com/rust-lang/crates.io-index" 557 | checksum = "0ae859aa07428ca9a929b936690f8b12dc5f11dd8c6992a18ca93919f28bc177" 558 | dependencies = [ 559 | "libc", 560 | "windows-sys 0.48.0", 561 | ] 562 | 563 | [[package]] 564 | name = "os_str_bytes" 565 | version = "6.6.1" 566 | source = "registry+https://github.com/rust-lang/crates.io-index" 567 | checksum = 
"e2355d85b9a3786f481747ced0e0ff2ba35213a1f9bd406ed906554d7af805a1" 568 | 569 | [[package]] 570 | name = "portable-atomic" 571 | version = "1.4.3" 572 | source = "registry+https://github.com/rust-lang/crates.io-index" 573 | checksum = "31114a898e107c51bb1609ffaf55a0e011cf6a4d7f1170d0015a165082c0338b" 574 | 575 | [[package]] 576 | name = "ppv-lite86" 577 | version = "0.2.17" 578 | source = "registry+https://github.com/rust-lang/crates.io-index" 579 | checksum = "5b40af805b3121feab8a3c29f04d8ad262fa8e0561883e7653e024ae4479e6de" 580 | 581 | [[package]] 582 | name = "proc-macro-error" 583 | version = "1.0.4" 584 | source = "registry+https://github.com/rust-lang/crates.io-index" 585 | checksum = "da25490ff9892aab3fcf7c36f08cfb902dd3e71ca0f9f9517bea02a73a5ce38c" 586 | dependencies = [ 587 | "proc-macro-error-attr", 588 | "proc-macro2", 589 | "quote", 590 | "syn 1.0.109", 591 | "version_check", 592 | ] 593 | 594 | [[package]] 595 | name = "proc-macro-error-attr" 596 | version = "1.0.4" 597 | source = "registry+https://github.com/rust-lang/crates.io-index" 598 | checksum = "a1be40180e52ecc98ad80b184934baf3d0d29f979574e439af5a55274b35f869" 599 | dependencies = [ 600 | "proc-macro2", 601 | "quote", 602 | "version_check", 603 | ] 604 | 605 | [[package]] 606 | name = "proc-macro2" 607 | version = "1.0.89" 608 | source = "registry+https://github.com/rust-lang/crates.io-index" 609 | checksum = "f139b0662de085916d1fb67d2b4169d1addddda1919e696f3252b740b629986e" 610 | dependencies = [ 611 | "unicode-ident", 612 | ] 613 | 614 | [[package]] 615 | name = "quote" 616 | version = "1.0.35" 617 | source = "registry+https://github.com/rust-lang/crates.io-index" 618 | checksum = "291ec9ab5efd934aaf503a6466c5d5251535d108ee747472c3977cc5acc868ef" 619 | dependencies = [ 620 | "proc-macro2", 621 | ] 622 | 623 | [[package]] 624 | name = "rand" 625 | version = "0.9.1" 626 | source = "registry+https://github.com/rust-lang/crates.io-index" 627 | checksum = 
"9fbfd9d094a40bf3ae768db9361049ace4c0e04a4fd6b359518bd7b73a73dd97" 628 | dependencies = [ 629 | "rand_chacha", 630 | "rand_core", 631 | ] 632 | 633 | [[package]] 634 | name = "rand_chacha" 635 | version = "0.9.0" 636 | source = "registry+https://github.com/rust-lang/crates.io-index" 637 | checksum = "d3022b5f1df60f26e1ffddd6c66e8aa15de382ae63b3a0c1bfc0e4d3e3f325cb" 638 | dependencies = [ 639 | "ppv-lite86", 640 | "rand_core", 641 | ] 642 | 643 | [[package]] 644 | name = "rand_core" 645 | version = "0.9.0" 646 | source = "registry+https://github.com/rust-lang/crates.io-index" 647 | checksum = "b08f3c9802962f7e1b25113931d94f43ed9725bebc59db9d0c3e9a23b67e15ff" 648 | dependencies = [ 649 | "getrandom", 650 | "zerocopy", 651 | ] 652 | 653 | [[package]] 654 | name = "rayon" 655 | version = "1.10.0" 656 | source = "registry+https://github.com/rust-lang/crates.io-index" 657 | checksum = "b418a60154510ca1a002a752ca9714984e21e4241e804d32555251faf8b78ffa" 658 | dependencies = [ 659 | "either", 660 | "rayon-core", 661 | ] 662 | 663 | [[package]] 664 | name = "rayon-core" 665 | version = "1.12.1" 666 | source = "registry+https://github.com/rust-lang/crates.io-index" 667 | checksum = "1465873a3dfdaa8ae7cb14b4383657caab0b3e8a0aa9ae8e04b044854c8dfce2" 668 | dependencies = [ 669 | "crossbeam-deque", 670 | "crossbeam-utils", 671 | ] 672 | 673 | [[package]] 674 | name = "redox_syscall" 675 | version = "0.3.5" 676 | source = "registry+https://github.com/rust-lang/crates.io-index" 677 | checksum = "567664f262709473930a4bf9e51bf2ebf3348f2e748ccc50dea20646858f8f29" 678 | dependencies = [ 679 | "bitflags 1.3.2", 680 | ] 681 | 682 | [[package]] 683 | name = "regex" 684 | version = "1.10.2" 685 | source = "registry+https://github.com/rust-lang/crates.io-index" 686 | checksum = "380b951a9c5e80ddfd6136919eef32310721aa4aacd4889a8d39124b026ab343" 687 | dependencies = [ 688 | "aho-corasick", 689 | "memchr", 690 | "regex-automata", 691 | "regex-syntax", 692 | ] 693 | 694 | [[package]] 695 | name 
= "regex-automata" 696 | version = "0.4.3" 697 | source = "registry+https://github.com/rust-lang/crates.io-index" 698 | checksum = "5f804c7828047e88b2d32e2d7fe5a105da8ee3264f01902f796c8e067dc2483f" 699 | dependencies = [ 700 | "aho-corasick", 701 | "memchr", 702 | "regex-syntax", 703 | ] 704 | 705 | [[package]] 706 | name = "regex-syntax" 707 | version = "0.8.2" 708 | source = "registry+https://github.com/rust-lang/crates.io-index" 709 | checksum = "c08c74e62047bb2de4ff487b251e4a92e24f48745648451635cec7d591162d9f" 710 | 711 | [[package]] 712 | name = "rustix" 713 | version = "0.38.19" 714 | source = "registry+https://github.com/rust-lang/crates.io-index" 715 | checksum = "745ecfa778e66b2b63c88a61cb36e0eea109e803b0b86bf9879fbc77c70e86ed" 716 | dependencies = [ 717 | "bitflags 2.4.1", 718 | "errno", 719 | "libc", 720 | "linux-raw-sys", 721 | "windows-sys 0.48.0", 722 | ] 723 | 724 | [[package]] 725 | name = "ryu" 726 | version = "1.0.15" 727 | source = "registry+https://github.com/rust-lang/crates.io-index" 728 | checksum = "1ad4cc8da4ef723ed60bced201181d83791ad433213d8c24efffda1eec85d741" 729 | 730 | [[package]] 731 | name = "same-file" 732 | version = "1.0.6" 733 | source = "registry+https://github.com/rust-lang/crates.io-index" 734 | checksum = "93fc1dc3aaa9bfed95e02e6eadabb4baf7e3078b0bd1b4d7b6b0b68378900502" 735 | dependencies = [ 736 | "winapi-util", 737 | ] 738 | 739 | [[package]] 740 | name = "scopeguard" 741 | version = "1.2.0" 742 | source = "registry+https://github.com/rust-lang/crates.io-index" 743 | checksum = "94143f37725109f92c262ed2cf5e59bce7498c01bcc1502d7b9afe439a4e9f49" 744 | 745 | [[package]] 746 | name = "serde" 747 | version = "1.0.189" 748 | source = "registry+https://github.com/rust-lang/crates.io-index" 749 | checksum = "8e422a44e74ad4001bdc8eede9a4570ab52f71190e9c076d14369f38b9200537" 750 | dependencies = [ 751 | "serde_derive", 752 | ] 753 | 754 | [[package]] 755 | name = "serde_derive" 756 | version = "1.0.189" 757 | source = 
"registry+https://github.com/rust-lang/crates.io-index" 758 | checksum = "1e48d1f918009ce3145511378cf68d613e3b3d9137d67272562080d68a2b32d5" 759 | dependencies = [ 760 | "proc-macro2", 761 | "quote", 762 | "syn 2.0.87", 763 | ] 764 | 765 | [[package]] 766 | name = "serde_spanned" 767 | version = "0.6.6" 768 | source = "registry+https://github.com/rust-lang/crates.io-index" 769 | checksum = "79e674e01f999af37c49f70a6ede167a8a60b2503e56c5599532a65baa5969a0" 770 | dependencies = [ 771 | "serde", 772 | ] 773 | 774 | [[package]] 775 | name = "shlex" 776 | version = "1.3.0" 777 | source = "registry+https://github.com/rust-lang/crates.io-index" 778 | checksum = "0fda2ff0d084019ba4d7c6f371c95d8fd75ce3524c3cb8fb653a3023f6323e64" 779 | 780 | [[package]] 781 | name = "similar" 782 | version = "2.3.0" 783 | source = "registry+https://github.com/rust-lang/crates.io-index" 784 | checksum = "2aeaf503862c419d66959f5d7ca015337d864e9c49485d771b732e2a20453597" 785 | 786 | [[package]] 787 | name = "snapbox" 788 | version = "0.6.21" 789 | source = "registry+https://github.com/rust-lang/crates.io-index" 790 | checksum = "96dcfc4581e3355d70ac2ee14cfdf81dce3d85c85f1ed9e2c1d3013f53b3436b" 791 | dependencies = [ 792 | "anstream", 793 | "anstyle", 794 | "content_inspector", 795 | "dunce", 796 | "filetime", 797 | "libc", 798 | "normalize-line-endings", 799 | "os_pipe", 800 | "similar", 801 | "snapbox-macros", 802 | "tempfile", 803 | "wait-timeout", 804 | "walkdir", 805 | "windows-sys 0.59.0", 806 | ] 807 | 808 | [[package]] 809 | name = "snapbox-macros" 810 | version = "0.3.10" 811 | source = "registry+https://github.com/rust-lang/crates.io-index" 812 | checksum = "16569f53ca23a41bb6f62e0a5084aa1661f4814a67fa33696a79073e03a664af" 813 | dependencies = [ 814 | "anstream", 815 | ] 816 | 817 | [[package]] 818 | name = "strsim" 819 | version = "0.10.0" 820 | source = "registry+https://github.com/rust-lang/crates.io-index" 821 | checksum = 
"73473c0e59e6d5812c5dfe2a064a6444949f089e20eec9a2e5506596494e4623" 822 | 823 | [[package]] 824 | name = "syn" 825 | version = "1.0.109" 826 | source = "registry+https://github.com/rust-lang/crates.io-index" 827 | checksum = "72b64191b275b66ffe2469e8af2c1cfe3bafa67b529ead792a6d0160888b4237" 828 | dependencies = [ 829 | "proc-macro2", 830 | "quote", 831 | "unicode-ident", 832 | ] 833 | 834 | [[package]] 835 | name = "syn" 836 | version = "2.0.87" 837 | source = "registry+https://github.com/rust-lang/crates.io-index" 838 | checksum = "25aa4ce346d03a6dcd68dd8b4010bcb74e54e62c90c573f394c46eae99aba32d" 839 | dependencies = [ 840 | "proc-macro2", 841 | "quote", 842 | "unicode-ident", 843 | ] 844 | 845 | [[package]] 846 | name = "tempfile" 847 | version = "3.8.0" 848 | source = "registry+https://github.com/rust-lang/crates.io-index" 849 | checksum = "cb94d2f3cc536af71caac6b6fcebf65860b347e7ce0cc9ebe8f70d3e521054ef" 850 | dependencies = [ 851 | "cfg-if", 852 | "fastrand", 853 | "redox_syscall", 854 | "rustix", 855 | "windows-sys 0.48.0", 856 | ] 857 | 858 | [[package]] 859 | name = "termcolor" 860 | version = "1.3.0" 861 | source = "registry+https://github.com/rust-lang/crates.io-index" 862 | checksum = "6093bad37da69aab9d123a8091e4be0aa4a03e4d601ec641c327398315f62b64" 863 | dependencies = [ 864 | "winapi-util", 865 | ] 866 | 867 | [[package]] 868 | name = "textwrap" 869 | version = "0.16.0" 870 | source = "registry+https://github.com/rust-lang/crates.io-index" 871 | checksum = "222a222a5bfe1bba4a77b45ec488a741b3cb8872e5e499451fd7d0129c9c7c3d" 872 | 873 | [[package]] 874 | name = "thiserror" 875 | version = "2.0.12" 876 | source = "registry+https://github.com/rust-lang/crates.io-index" 877 | checksum = "567b8a2dae586314f7be2a752ec7474332959c6460e02bde30d702a66d488708" 878 | dependencies = [ 879 | "thiserror-impl", 880 | ] 881 | 882 | [[package]] 883 | name = "thiserror-impl" 884 | version = "2.0.12" 885 | source = "registry+https://github.com/rust-lang/crates.io-index" 886 
| checksum = "7f7cf42b4507d8ea322120659672cf1b9dbb93f8f2d4ecfd6e51350ff5b17a1d" 887 | dependencies = [ 888 | "proc-macro2", 889 | "quote", 890 | "syn 2.0.87", 891 | ] 892 | 893 | [[package]] 894 | name = "toml_datetime" 895 | version = "0.6.6" 896 | source = "registry+https://github.com/rust-lang/crates.io-index" 897 | checksum = "4badfd56924ae69bcc9039335b2e017639ce3f9b001c393c1b2d1ef846ce2cbf" 898 | dependencies = [ 899 | "serde", 900 | ] 901 | 902 | [[package]] 903 | name = "toml_edit" 904 | version = "0.22.13" 905 | source = "registry+https://github.com/rust-lang/crates.io-index" 906 | checksum = "c127785850e8c20836d49732ae6abfa47616e60bf9d9f57c43c250361a9db96c" 907 | dependencies = [ 908 | "indexmap 2.0.2", 909 | "serde", 910 | "serde_spanned", 911 | "toml_datetime", 912 | "winnow", 913 | ] 914 | 915 | [[package]] 916 | name = "trycmd" 917 | version = "0.15.9" 918 | source = "registry+https://github.com/rust-lang/crates.io-index" 919 | checksum = "a8b5cf29388862aac065d6597ac9c8e842d1cc827cb50f7c32f11d29442eaae4" 920 | dependencies = [ 921 | "anstream", 922 | "automod", 923 | "glob", 924 | "humantime", 925 | "humantime-serde", 926 | "rayon", 927 | "serde", 928 | "shlex", 929 | "snapbox", 930 | "toml_edit", 931 | ] 932 | 933 | [[package]] 934 | name = "ttv" 935 | version = "0.4.0" 936 | dependencies = [ 937 | "clap", 938 | "csv", 939 | "env_logger", 940 | "flate2", 941 | "indicatif", 942 | "jemallocator", 943 | "log", 944 | "rand", 945 | "rand_chacha", 946 | "rayon", 947 | "thiserror", 948 | "trycmd", 949 | ] 950 | 951 | [[package]] 952 | name = "unicode-ident" 953 | version = "1.0.12" 954 | source = "registry+https://github.com/rust-lang/crates.io-index" 955 | checksum = "3354b9ac3fae1ff6755cb6db53683adb661634f67557942dea4facebec0fee4b" 956 | 957 | [[package]] 958 | name = "unicode-width" 959 | version = "0.1.11" 960 | source = "registry+https://github.com/rust-lang/crates.io-index" 961 | checksum = 
"e51733f11c9c4f72aa0c160008246859e340b00807569a0da0e7a1079b27ba85" 962 | 963 | [[package]] 964 | name = "unicode-width" 965 | version = "0.2.0" 966 | source = "registry+https://github.com/rust-lang/crates.io-index" 967 | checksum = "1fc81956842c57dac11422a97c3b8195a1ff727f06e85c84ed2e8aa277c9a0fd" 968 | 969 | [[package]] 970 | name = "utf8parse" 971 | version = "0.2.1" 972 | source = "registry+https://github.com/rust-lang/crates.io-index" 973 | checksum = "711b9620af191e0cdc7468a8d14e709c3dcdb115b36f838e601583af800a370a" 974 | 975 | [[package]] 976 | name = "version_check" 977 | version = "0.9.4" 978 | source = "registry+https://github.com/rust-lang/crates.io-index" 979 | checksum = "49874b5167b65d7193b8aba1567f5c7d93d001cafc34600cee003eda787e483f" 980 | 981 | [[package]] 982 | name = "wait-timeout" 983 | version = "0.2.0" 984 | source = "registry+https://github.com/rust-lang/crates.io-index" 985 | checksum = "9f200f5b12eb75f8c1ed65abd4b2db8a6e1b138a20de009dacee265a2498f3f6" 986 | dependencies = [ 987 | "libc", 988 | ] 989 | 990 | [[package]] 991 | name = "walkdir" 992 | version = "2.4.0" 993 | source = "registry+https://github.com/rust-lang/crates.io-index" 994 | checksum = "d71d857dc86794ca4c280d616f7da00d2dbfd8cd788846559a6813e6aa4b54ee" 995 | dependencies = [ 996 | "same-file", 997 | "winapi-util", 998 | ] 999 | 1000 | [[package]] 1001 | name = "wasi" 1002 | version = "0.13.3+wasi-0.2.2" 1003 | source = "registry+https://github.com/rust-lang/crates.io-index" 1004 | checksum = "26816d2e1a4a36a2940b96c5296ce403917633dff8f3440e9b236ed6f6bacad2" 1005 | dependencies = [ 1006 | "wit-bindgen-rt", 1007 | ] 1008 | 1009 | [[package]] 1010 | name = "wasm-bindgen" 1011 | version = "0.2.95" 1012 | source = "registry+https://github.com/rust-lang/crates.io-index" 1013 | checksum = "128d1e363af62632b8eb57219c8fd7877144af57558fb2ef0368d0087bddeb2e" 1014 | dependencies = [ 1015 | "cfg-if", 1016 | "once_cell", 1017 | "wasm-bindgen-macro", 1018 | ] 1019 | 1020 | [[package]] 1021 | 
name = "wasm-bindgen-backend" 1022 | version = "0.2.95" 1023 | source = "registry+https://github.com/rust-lang/crates.io-index" 1024 | checksum = "cb6dd4d3ca0ddffd1dd1c9c04f94b868c37ff5fac97c30b97cff2d74fce3a358" 1025 | dependencies = [ 1026 | "bumpalo", 1027 | "log", 1028 | "once_cell", 1029 | "proc-macro2", 1030 | "quote", 1031 | "syn 2.0.87", 1032 | "wasm-bindgen-shared", 1033 | ] 1034 | 1035 | [[package]] 1036 | name = "wasm-bindgen-macro" 1037 | version = "0.2.95" 1038 | source = "registry+https://github.com/rust-lang/crates.io-index" 1039 | checksum = "e79384be7f8f5a9dd5d7167216f022090cf1f9ec128e6e6a482a2cb5c5422c56" 1040 | dependencies = [ 1041 | "quote", 1042 | "wasm-bindgen-macro-support", 1043 | ] 1044 | 1045 | [[package]] 1046 | name = "wasm-bindgen-macro-support" 1047 | version = "0.2.95" 1048 | source = "registry+https://github.com/rust-lang/crates.io-index" 1049 | checksum = "26c6ab57572f7a24a4985830b120de1594465e5d500f24afe89e16b4e833ef68" 1050 | dependencies = [ 1051 | "proc-macro2", 1052 | "quote", 1053 | "syn 2.0.87", 1054 | "wasm-bindgen-backend", 1055 | "wasm-bindgen-shared", 1056 | ] 1057 | 1058 | [[package]] 1059 | name = "wasm-bindgen-shared" 1060 | version = "0.2.95" 1061 | source = "registry+https://github.com/rust-lang/crates.io-index" 1062 | checksum = "65fc09f10666a9f147042251e0dda9c18f166ff7de300607007e96bdebc1068d" 1063 | 1064 | [[package]] 1065 | name = "web-time" 1066 | version = "1.1.0" 1067 | source = "registry+https://github.com/rust-lang/crates.io-index" 1068 | checksum = "5a6580f308b1fad9207618087a65c04e7a10bc77e02c8e84e9b00dd4b12fa0bb" 1069 | dependencies = [ 1070 | "js-sys", 1071 | "wasm-bindgen", 1072 | ] 1073 | 1074 | [[package]] 1075 | name = "winapi" 1076 | version = "0.3.9" 1077 | source = "registry+https://github.com/rust-lang/crates.io-index" 1078 | checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419" 1079 | dependencies = [ 1080 | "winapi-i686-pc-windows-gnu", 1081 | 
"winapi-x86_64-pc-windows-gnu", 1082 | ] 1083 | 1084 | [[package]] 1085 | name = "winapi-i686-pc-windows-gnu" 1086 | version = "0.4.0" 1087 | source = "registry+https://github.com/rust-lang/crates.io-index" 1088 | checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6" 1089 | 1090 | [[package]] 1091 | name = "winapi-util" 1092 | version = "0.1.6" 1093 | source = "registry+https://github.com/rust-lang/crates.io-index" 1094 | checksum = "f29e6f9198ba0d26b4c9f07dbe6f9ed633e1f3d5b8b414090084349e46a52596" 1095 | dependencies = [ 1096 | "winapi", 1097 | ] 1098 | 1099 | [[package]] 1100 | name = "winapi-x86_64-pc-windows-gnu" 1101 | version = "0.4.0" 1102 | source = "registry+https://github.com/rust-lang/crates.io-index" 1103 | checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f" 1104 | 1105 | [[package]] 1106 | name = "windows-sys" 1107 | version = "0.45.0" 1108 | source = "registry+https://github.com/rust-lang/crates.io-index" 1109 | checksum = "75283be5efb2831d37ea142365f009c02ec203cd29a3ebecbc093d52315b66d0" 1110 | dependencies = [ 1111 | "windows-targets 0.42.2", 1112 | ] 1113 | 1114 | [[package]] 1115 | name = "windows-sys" 1116 | version = "0.48.0" 1117 | source = "registry+https://github.com/rust-lang/crates.io-index" 1118 | checksum = "677d2418bec65e3338edb076e806bc1ec15693c5d0104683f2efe857f61056a9" 1119 | dependencies = [ 1120 | "windows-targets 0.48.5", 1121 | ] 1122 | 1123 | [[package]] 1124 | name = "windows-sys" 1125 | version = "0.59.0" 1126 | source = "registry+https://github.com/rust-lang/crates.io-index" 1127 | checksum = "1e38bc4d79ed67fd075bcc251a1c39b32a1776bbe92e5bef1f0bf1f8c531853b" 1128 | dependencies = [ 1129 | "windows-targets 0.52.6", 1130 | ] 1131 | 1132 | [[package]] 1133 | name = "windows-targets" 1134 | version = "0.42.2" 1135 | source = "registry+https://github.com/rust-lang/crates.io-index" 1136 | checksum = "8e5180c00cd44c9b1c88adb3693291f1cd93605ded80c250a75d472756b4d071" 1137 | 
dependencies = [ 1138 | "windows_aarch64_gnullvm 0.42.2", 1139 | "windows_aarch64_msvc 0.42.2", 1140 | "windows_i686_gnu 0.42.2", 1141 | "windows_i686_msvc 0.42.2", 1142 | "windows_x86_64_gnu 0.42.2", 1143 | "windows_x86_64_gnullvm 0.42.2", 1144 | "windows_x86_64_msvc 0.42.2", 1145 | ] 1146 | 1147 | [[package]] 1148 | name = "windows-targets" 1149 | version = "0.48.5" 1150 | source = "registry+https://github.com/rust-lang/crates.io-index" 1151 | checksum = "9a2fa6e2155d7247be68c096456083145c183cbbbc2764150dda45a87197940c" 1152 | dependencies = [ 1153 | "windows_aarch64_gnullvm 0.48.5", 1154 | "windows_aarch64_msvc 0.48.5", 1155 | "windows_i686_gnu 0.48.5", 1156 | "windows_i686_msvc 0.48.5", 1157 | "windows_x86_64_gnu 0.48.5", 1158 | "windows_x86_64_gnullvm 0.48.5", 1159 | "windows_x86_64_msvc 0.48.5", 1160 | ] 1161 | 1162 | [[package]] 1163 | name = "windows-targets" 1164 | version = "0.52.6" 1165 | source = "registry+https://github.com/rust-lang/crates.io-index" 1166 | checksum = "9b724f72796e036ab90c1021d4780d4d3d648aca59e491e6b98e725b84e99973" 1167 | dependencies = [ 1168 | "windows_aarch64_gnullvm 0.52.6", 1169 | "windows_aarch64_msvc 0.52.6", 1170 | "windows_i686_gnu 0.52.6", 1171 | "windows_i686_gnullvm", 1172 | "windows_i686_msvc 0.52.6", 1173 | "windows_x86_64_gnu 0.52.6", 1174 | "windows_x86_64_gnullvm 0.52.6", 1175 | "windows_x86_64_msvc 0.52.6", 1176 | ] 1177 | 1178 | [[package]] 1179 | name = "windows_aarch64_gnullvm" 1180 | version = "0.42.2" 1181 | source = "registry+https://github.com/rust-lang/crates.io-index" 1182 | checksum = "597a5118570b68bc08d8d59125332c54f1ba9d9adeedeef5b99b02ba2b0698f8" 1183 | 1184 | [[package]] 1185 | name = "windows_aarch64_gnullvm" 1186 | version = "0.48.5" 1187 | source = "registry+https://github.com/rust-lang/crates.io-index" 1188 | checksum = "2b38e32f0abccf9987a4e3079dfb67dcd799fb61361e53e2882c3cbaf0d905d8" 1189 | 1190 | [[package]] 1191 | name = "windows_aarch64_gnullvm" 1192 | version = "0.52.6" 1193 | source = 
"registry+https://github.com/rust-lang/crates.io-index" 1194 | checksum = "32a4622180e7a0ec044bb555404c800bc9fd9ec262ec147edd5989ccd0c02cd3" 1195 | 1196 | [[package]] 1197 | name = "windows_aarch64_msvc" 1198 | version = "0.42.2" 1199 | source = "registry+https://github.com/rust-lang/crates.io-index" 1200 | checksum = "e08e8864a60f06ef0d0ff4ba04124db8b0fb3be5776a5cd47641e942e58c4d43" 1201 | 1202 | [[package]] 1203 | name = "windows_aarch64_msvc" 1204 | version = "0.48.5" 1205 | source = "registry+https://github.com/rust-lang/crates.io-index" 1206 | checksum = "dc35310971f3b2dbbf3f0690a219f40e2d9afcf64f9ab7cc1be722937c26b4bc" 1207 | 1208 | [[package]] 1209 | name = "windows_aarch64_msvc" 1210 | version = "0.52.6" 1211 | source = "registry+https://github.com/rust-lang/crates.io-index" 1212 | checksum = "09ec2a7bb152e2252b53fa7803150007879548bc709c039df7627cabbd05d469" 1213 | 1214 | [[package]] 1215 | name = "windows_i686_gnu" 1216 | version = "0.42.2" 1217 | source = "registry+https://github.com/rust-lang/crates.io-index" 1218 | checksum = "c61d927d8da41da96a81f029489353e68739737d3beca43145c8afec9a31a84f" 1219 | 1220 | [[package]] 1221 | name = "windows_i686_gnu" 1222 | version = "0.48.5" 1223 | source = "registry+https://github.com/rust-lang/crates.io-index" 1224 | checksum = "a75915e7def60c94dcef72200b9a8e58e5091744960da64ec734a6c6e9b3743e" 1225 | 1226 | [[package]] 1227 | name = "windows_i686_gnu" 1228 | version = "0.52.6" 1229 | source = "registry+https://github.com/rust-lang/crates.io-index" 1230 | checksum = "8e9b5ad5ab802e97eb8e295ac6720e509ee4c243f69d781394014ebfe8bbfa0b" 1231 | 1232 | [[package]] 1233 | name = "windows_i686_gnullvm" 1234 | version = "0.52.6" 1235 | source = "registry+https://github.com/rust-lang/crates.io-index" 1236 | checksum = "0eee52d38c090b3caa76c563b86c3a4bd71ef1a819287c19d586d7334ae8ed66" 1237 | 1238 | [[package]] 1239 | name = "windows_i686_msvc" 1240 | version = "0.42.2" 1241 | source = 
"registry+https://github.com/rust-lang/crates.io-index" 1242 | checksum = "44d840b6ec649f480a41c8d80f9c65108b92d89345dd94027bfe06ac444d1060" 1243 | 1244 | [[package]] 1245 | name = "windows_i686_msvc" 1246 | version = "0.48.5" 1247 | source = "registry+https://github.com/rust-lang/crates.io-index" 1248 | checksum = "8f55c233f70c4b27f66c523580f78f1004e8b5a8b659e05a4eb49d4166cca406" 1249 | 1250 | [[package]] 1251 | name = "windows_i686_msvc" 1252 | version = "0.52.6" 1253 | source = "registry+https://github.com/rust-lang/crates.io-index" 1254 | checksum = "240948bc05c5e7c6dabba28bf89d89ffce3e303022809e73deaefe4f6ec56c66" 1255 | 1256 | [[package]] 1257 | name = "windows_x86_64_gnu" 1258 | version = "0.42.2" 1259 | source = "registry+https://github.com/rust-lang/crates.io-index" 1260 | checksum = "8de912b8b8feb55c064867cf047dda097f92d51efad5b491dfb98f6bbb70cb36" 1261 | 1262 | [[package]] 1263 | name = "windows_x86_64_gnu" 1264 | version = "0.48.5" 1265 | source = "registry+https://github.com/rust-lang/crates.io-index" 1266 | checksum = "53d40abd2583d23e4718fddf1ebec84dbff8381c07cae67ff7768bbf19c6718e" 1267 | 1268 | [[package]] 1269 | name = "windows_x86_64_gnu" 1270 | version = "0.52.6" 1271 | source = "registry+https://github.com/rust-lang/crates.io-index" 1272 | checksum = "147a5c80aabfbf0c7d901cb5895d1de30ef2907eb21fbbab29ca94c5b08b1a78" 1273 | 1274 | [[package]] 1275 | name = "windows_x86_64_gnullvm" 1276 | version = "0.42.2" 1277 | source = "registry+https://github.com/rust-lang/crates.io-index" 1278 | checksum = "26d41b46a36d453748aedef1486d5c7a85db22e56aff34643984ea85514e94a3" 1279 | 1280 | [[package]] 1281 | name = "windows_x86_64_gnullvm" 1282 | version = "0.48.5" 1283 | source = "registry+https://github.com/rust-lang/crates.io-index" 1284 | checksum = "0b7b52767868a23d5bab768e390dc5f5c55825b6d30b86c844ff2dc7414044cc" 1285 | 1286 | [[package]] 1287 | name = "windows_x86_64_gnullvm" 1288 | version = "0.52.6" 1289 | source = 
"registry+https://github.com/rust-lang/crates.io-index" 1290 | checksum = "24d5b23dc417412679681396f2b49f3de8c1473deb516bd34410872eff51ed0d" 1291 | 1292 | [[package]] 1293 | name = "windows_x86_64_msvc" 1294 | version = "0.42.2" 1295 | source = "registry+https://github.com/rust-lang/crates.io-index" 1296 | checksum = "9aec5da331524158c6d1a4ac0ab1541149c0b9505fde06423b02f5ef0106b9f0" 1297 | 1298 | [[package]] 1299 | name = "windows_x86_64_msvc" 1300 | version = "0.48.5" 1301 | source = "registry+https://github.com/rust-lang/crates.io-index" 1302 | checksum = "ed94fce61571a4006852b7389a063ab983c02eb1bb37b47f8272ce92d06d9538" 1303 | 1304 | [[package]] 1305 | name = "windows_x86_64_msvc" 1306 | version = "0.52.6" 1307 | source = "registry+https://github.com/rust-lang/crates.io-index" 1308 | checksum = "589f6da84c646204747d1270a2a5661ea66ed1cced2631d546fdfb155959f9ec" 1309 | 1310 | [[package]] 1311 | name = "winnow" 1312 | version = "0.6.8" 1313 | source = "registry+https://github.com/rust-lang/crates.io-index" 1314 | checksum = "c3c52e9c97a68071b23e836c9380edae937f17b9c4667bd021973efc689f618d" 1315 | dependencies = [ 1316 | "memchr", 1317 | ] 1318 | 1319 | [[package]] 1320 | name = "wit-bindgen-rt" 1321 | version = "0.33.0" 1322 | source = "registry+https://github.com/rust-lang/crates.io-index" 1323 | checksum = "3268f3d866458b787f390cf61f4bbb563b922d091359f9608842999eaee3943c" 1324 | dependencies = [ 1325 | "bitflags 2.4.1", 1326 | ] 1327 | 1328 | [[package]] 1329 | name = "yaml-rust" 1330 | version = "0.4.5" 1331 | source = "registry+https://github.com/rust-lang/crates.io-index" 1332 | checksum = "56c1936c4cc7a1c9ab21a1ebb602eb942ba868cbd44a99cb7cdc5892335e1c85" 1333 | dependencies = [ 1334 | "linked-hash-map", 1335 | ] 1336 | 1337 | [[package]] 1338 | name = "zerocopy" 1339 | version = "0.8.14" 1340 | source = "registry+https://github.com/rust-lang/crates.io-index" 1341 | checksum = "a367f292d93d4eab890745e75a778da40909cab4d6ff8173693812f79c4a2468" 1342 | 
dependencies = [ 1343 | "zerocopy-derive", 1344 | ] 1345 | 1346 | [[package]] 1347 | name = "zerocopy-derive" 1348 | version = "0.8.14" 1349 | source = "registry+https://github.com/rust-lang/crates.io-index" 1350 | checksum = "d3931cb58c62c13adec22e38686b559c86a30565e16ad6e8510a337cedc611e1" 1351 | dependencies = [ 1352 | "proc-macro2", 1353 | "quote", 1354 | "syn 2.0.87", 1355 | ] 1356 | -------------------------------------------------------------------------------- /Cargo.toml: -------------------------------------------------------------------------------- 1 | [package] 2 | name = "ttv" 3 | description = "Create train, test and validation sets from text files" 4 | version = "0.4.0" 5 | authors = ["Ben Sully "] 6 | repository = "https://github.com/sd2k/ttv" 7 | keywords = ["cli", "data", "machine-learning"] 8 | readme = "README.md" 9 | edition = "2021" 10 | license = "MIT/Apache-2.0" 11 | 12 | [dependencies] 13 | clap = { version = "3.1.15", features = ["derive", "yaml"] } 14 | csv = "1.3.1" 15 | env_logger = "0.11.6" 16 | flate2 = "1.1.1" 17 | indicatif = "0.17.11" 18 | jemallocator = "0.5.4" 19 | log = "0.4.27" 20 | rand = "0.9.1" 21 | rand_chacha = "0.9.0" 22 | rayon = "1.10.0" 23 | thiserror = "2.0.12" 24 | 25 | [dev-dependencies] 26 | trycmd = "0.15.9" 27 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Dependabot Status](https://api.dependabot.com/badges/status?host=github&repo=sd2k/ttv)](https://dependabot.com) 2 | 3 | ttv - create train, test, validation sets 4 | ========================================= 5 | 6 | ttv is a command line tool for splitting large files up into chunks suitable for train/test/validation splits for machine learning. It arose from the need to split files that were too large to fit into memory to split, and the desire to do it in a clean way. 7 | 8 | `ttv` requires Rust 2021. 
9 | 10 | Installation 11 | ------------ 12 | 13 | Build using `cargo build --release` to get a binary at `./target/release/ttv`. Copy it to a directory on your `PATH` to use it. 14 | 15 | Usage 16 | ----- 17 | 18 | Run `ttv --help` to get help, or infer what you can from one of these examples: 19 | 20 | # Split CSV file into two sets of a fixed number of rows 21 | $ ttv split data.csv --rows=train=9000 --rows=test=1000 22 | 23 | # Accepts gzipped data (pass -d to decompress). Comma-separated shorthand. As many splits as you like! 24 | $ ttv split data.csv.gz --rows=train=65000,validation=15000,test=15000 -d 25 | 26 | # Alternatively, specify proportion-based splits. 27 | $ ttv split data.csv --prop=train=0.8,test=0.2 28 | 29 | # When using proportions, include the total rows to get a progress bar 30 | $ ttv split data.csv --prop=train=0.8,test=0.2 --total-rows=1234 31 | 32 | # Accepts data from stdin, compressed or not (use '-' as the input and give an output prefix) 33 | $ cat data.csv | ttv split - --rows=test=10000,train=90000 --output-prefix data 34 | $ cat data.csv.gz | ttv split - --rows=test=10000,train=90000 --output-prefix data -d 35 | 36 | # Using pigz for faster decompression 37 | $ pigz -dc data.csv.gz | ttv split - --prop=test=0.1,train=0.9 --chunk-size 5000 --output-prefix data 38 | 39 | # Split outputs into chunks for faster writing/reading later 40 | $ ttv split data.csv.gz --rows=test=100000,train=900000 --chunk-size 5000 -d 41 | 42 | # Outputs are written uncompressed by default; add -C to gzip them 43 | $ ttv split data.csv.gz --prop=test=0.5,train=0.5 -d -C 44 | 45 | # Reproducible splits using seed 46 | $ ttv split data.csv.gz --prop=test=0.5,train=0.5 --chunk-size 1000 --seed 5330 -d 47 | 48 | Development 49 | ----------- 50 | 51 | You'll need a recent stable Rust toolchain and Cargo (CI builds with stable, plus the `rustfmt` and `clippy` components). Then just hack away as normal. 
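The proportion-based examples above can be read as "draw a number in [0, 1) for each row and send the row to the first split whose cumulative proportion exceeds it". The following is a minimal, dependency-free sketch of that idea, not the code ttv ships: `Proportion`, `Lcg`, `pick` and `assign` are hypothetical names, and the small linear congruential generator is only a stand-in for the seeded ChaCha generator that ttv uses, so the actual assignments will differ from the tool's.

```rust
/// A named split and the share of rows it should receive (hypothetical type).
struct Proportion {
    name: &'static str,
    share: f64,
}

/// Tiny linear congruential generator, used only to keep the sketch free of
/// dependencies. It is a stand-in, not the generator ttv uses.
struct Lcg(u64);

impl Lcg {
    /// Return the next pseudo-random value in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Keep the top 53 bits so the value converts to f64 without rounding.
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Pick the first split whose cumulative share exceeds `draw`.
/// Returns `None` when the shares sum to less than 1 and `draw` lands in the gap.
fn pick<'a>(splits: &'a [Proportion], draw: f64) -> Option<&'a str> {
    let mut cumulative = 0.0;
    for split in splits {
        cumulative += split.share;
        if draw < cumulative {
            return Some(split.name);
        }
    }
    None
}

/// Assign `rows` rows to splits; the same seed always gives the same result.
fn assign(splits: &[Proportion], seed: u64, rows: usize) -> Vec<Option<&str>> {
    let mut rng = Lcg(seed);
    (0..rows).map(|_| pick(splits, rng.next_f64())).collect()
}
```

Calling `assign` twice with the same seed returns identical assignments, which is the property `--seed` relies on. Because each row is drawn independently, the resulting split sizes are only approximately proportional; use `--rows` when exact counts matter.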
52 | -------------------------------------------------------------------------------- /ci/before_deploy.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # package the build artifacts 4 | # heavily inspired by https://github.com/BurntSushi/ripgrep/blob/master/ci/before_deploy.sh 5 | 6 | set -ex 7 | 8 | . "$(dirname $0)/utils.sh" 9 | 10 | # Generate artifacts for release 11 | mk_artifacts() { 12 | cargo build --target "$TARGET" --release 13 | } 14 | 15 | mk_tarball() { 16 | # Create a temporary dir that contains our staging area. 17 | # $tmpdir/$name is what eventually ends up as the deployed archive. 18 | local tmpdir="$(mktemp -d)" 19 | local name="${PROJECT_NAME}-${TRAVIS_TAG}-${TARGET}" 20 | local staging="$tmpdir/$name" 21 | mkdir "$staging" 22 | 23 | # The deployment directory is where the final archive will reside. 24 | # This path is known by the .travis.yml configuration. 25 | local out_dir="$(pwd)/deployment" 26 | mkdir -p "$out_dir" 27 | 28 | cp "target/$TARGET/release/ttv" "$staging/ttv" 29 | strip "$staging/ttv" 30 | 31 | (cd "$tmpdir" && tar czf "$out_dir/$name.tar.gz" "$name") 32 | rm -rf "$tmpdir" 33 | } 34 | 35 | main() { 36 | mk_artifacts 37 | mk_tarball 38 | } 39 | 40 | main 41 | -------------------------------------------------------------------------------- /ci/install.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # install stuff needed for the `script` phase 4 | # heavily inspired by https://github.com/BurntSushi/ripgrep/blob/master/ci/install.sh 5 | 6 | # Where rustup gets installed. 7 | export PATH="$PATH:$HOME/.cargo/bin" 8 | 9 | set -ex 10 | 11 | . 
"$(dirname $0)/utils.sh" 12 | 13 | install_rustup() { 14 | curl https://sh.rustup.rs -sSf \ 15 | | sh -s -- -y --default-toolchain="$TRAVIS_RUST_VERSION" 16 | rustc -V 17 | cargo -V 18 | } 19 | 20 | install_targets() { 21 | if [ $(host) != "$TARGET" ]; then 22 | rustup target add $TARGET 23 | fi 24 | } 25 | 26 | configure_cargo() { 27 | local prefix=$(gcc_prefix) 28 | if [ -n "${prefix}" ]; then 29 | local gcc_suffix= 30 | if [ -n "$GCC_VERSION" ]; then 31 | gcc_suffix="-$GCC_VERSION" 32 | fi 33 | local gcc="${prefix}gcc${gcc_suffix}" 34 | 35 | # information about the cross compiler 36 | "${gcc}" -v 37 | 38 | # tell cargo which linker to use for cross compilation 39 | mkdir -p .cargo 40 | cat >>.cargo/config <, 44 | 45 | #[clap( 46 | short = 'p', 47 | long = "prop", 48 | required_unless_present = "rows", 49 | conflicts_with = "rows", 50 | help = "Specify splits by proportion of rows", 51 | use_value_delimiter = true 52 | )] 53 | pub prop: Vec, 54 | 55 | #[clap( 56 | short = 'n', 57 | long = "no-header", 58 | help = "Don't treat the first row as a header" 59 | )] 60 | pub no_header: bool, 61 | 62 | #[clap( 63 | short = 'c', 64 | long = "chunk-size", 65 | help = "Maximum number of rows per output chunk" 66 | )] 67 | pub chunk_size: Option, 68 | 69 | #[clap( 70 | short = 't', 71 | long = "total-rows", 72 | help = "Number of rows in input file. Used for progress when using proportion splits" 73 | )] 74 | pub total_rows: Option, 75 | 76 | #[clap(short = 's', long = "seed", help = "RNG seed, for reproducibility")] 77 | pub seed: Option, 78 | 79 | #[clap( 80 | long = "csv", 81 | help = "Parse input as CSV. Only needed if rows contain embedded newlines - will impact performance." 82 | )] 83 | pub csv: bool, 84 | 85 | #[clap( 86 | parse(from_os_str), 87 | help = "Data to split, optionally gzip compressed. 
If '-', read from stdin" 88 | )] 89 | pub input: PathBuf, 90 | 91 | #[clap( 92 | short = 'o', 93 | long = "output-prefix", 94 | parse(from_os_str), 95 | required_if_eq("input", "-"), 96 | help = "Output filename prefix. Only used if reading from stdin" 97 | )] 98 | pub output_prefix: Option, 99 | 100 | #[clap( 101 | short = 'd', 102 | long = "decompress-input", 103 | help = "Decompress input from gzip format" 104 | )] 105 | pub decompress_input: bool, 106 | 107 | #[clap( 108 | short = 'C', 109 | long = "compressed-output", 110 | help = "Compress output files using gzip" 111 | )] 112 | pub compress_output: bool, 113 | } 114 | -------------------------------------------------------------------------------- /src/error.rs: -------------------------------------------------------------------------------- 1 | use thiserror::Error; 2 | 3 | use crate::split::ProportionSplit; 4 | 5 | /// Error type in ttv. 6 | #[derive(Debug, Error)] 7 | pub enum Error { 8 | #[error("empty file")] 9 | EmptyFile, 10 | #[error("invalid split specification: {0}")] 11 | InvalidSplitSpecification(String), 12 | #[error("invalid splits: {0:?}")] 13 | InvalidSplits(Vec), 14 | 15 | #[error("proportion too low: {0}")] 16 | ProportionTooLow(String), 17 | #[error("proportion too high: {0}")] 18 | ProportionTooHigh(String), 19 | 20 | #[error("error parsing CSV: {0}")] 21 | CsvError(csv::Error), 22 | #[error("I/O error: {0}")] 23 | IoError(std::io::Error), 24 | #[error("error parsing float: {0}")] 25 | ParseFloatError(std::num::ParseFloatError), 26 | #[error("error parsing int: {0}")] 27 | ParseIntError(std::num::ParseIntError), 28 | #[error("internal error: {0}")] 29 | SendError(std::sync::mpsc::SendError), 30 | } 31 | 32 | pub type Result = std::result::Result; 33 | 34 | impl From for Error { 35 | fn from(error: std::num::ParseFloatError) -> Self { 36 | Error::ParseFloatError(error) 37 | } 38 | } 39 | 40 | impl From for Error { 41 | fn from(error: std::num::ParseIntError) -> Self { 42 | 
Error::ParseIntError(error) 43 | } 44 | } 45 | 46 | impl From for Error { 47 | fn from(error: std::io::Error) -> Self { 48 | Error::IoError(error) 49 | } 50 | } 51 | 52 | impl From for Error { 53 | fn from(error: csv::Error) -> Self { 54 | Error::CsvError(error) 55 | } 56 | } 57 | 58 | impl From> for Error { 59 | fn from(error: std::sync::mpsc::SendError) -> Self { 60 | Error::SendError(error) 61 | } 62 | } 63 | -------------------------------------------------------------------------------- /src/io.rs: -------------------------------------------------------------------------------- 1 | use std::fs::File; 2 | use std::io::{BufReader, Read, Write}; 3 | use std::path::Path; 4 | 5 | use flate2::read::GzDecoder; 6 | use flate2::write::GzEncoder; 7 | 8 | use crate::error::Result; 9 | 10 | pub type OutputWriter = Box; 11 | 12 | #[derive(Clone, Copy, Debug)] 13 | pub enum Compression { 14 | Uncompressed, 15 | GzipCompression, 16 | } 17 | 18 | pub trait LineReader { 19 | fn read_line(&mut self) -> Option>; 20 | } 21 | 22 | impl LineReader for csv::Reader> { 23 | fn read_line(&mut self) -> Option> { 24 | let mut record = csv::ByteRecord::with_capacity(1024, 100); 25 | match self.read_byte_record(&mut record) { 26 | Ok(read) if read => { 27 | let curs = std::io::Cursor::new(Vec::with_capacity(1024)); 28 | let mut writer = csv::Writer::from_writer(curs); 29 | writer.write_byte_record(&record).unwrap(); 30 | let s = String::from_utf8(writer.into_inner().unwrap().into_inner()).unwrap(); 31 | Some(Ok(s)) 32 | } 33 | Ok(_) => None, 34 | Err(e) => Some(Err(e.into())), 35 | } 36 | } 37 | } 38 | 39 | impl LineReader for BufReader> { 40 | fn read_line(&mut self) -> Option> { 41 | let mut buf = String::with_capacity(1024); 42 | match std::io::BufRead::read_line(self, &mut buf) { 43 | Ok(0) => None, 44 | Ok(_) => Some(Ok(buf)), 45 | Err(e) => Some(Err(e.into())), 46 | } 47 | } 48 | } 49 | 50 | pub fn open_data>( 51 | path: P, 52 | compression: Compression, 53 | csv_builder: Option, 54 
| ) -> Result> { 55 | // Read from stdin if input is '-', else try to open the provided file. 56 | let reader: Box = match path.as_ref().to_str() { 57 | Some("-") => Box::new(std::io::stdin()), 58 | Some(p) => Box::new(File::open(p)?), 59 | _ => unreachable!(), 60 | }; 61 | 62 | let reader: Box = match compression { 63 | Compression::Uncompressed => reader, 64 | Compression::GzipCompression => Box::new(GzDecoder::new(reader)), 65 | }; 66 | 67 | let reader: Box = match csv_builder { 68 | Some(builder) => Box::new(builder.from_reader(reader)), 69 | None => Box::new(BufReader::with_capacity(1024 * 1024, reader)), 70 | }; 71 | Ok(reader) 72 | } 73 | 74 | pub fn open_output>(path: P, compression: Compression) -> Result { 75 | let file = File::create(path)?; 76 | let writer: OutputWriter = match compression { 77 | Compression::GzipCompression => Box::new(GzEncoder::new(file, Default::default())), 78 | Compression::Uncompressed => Box::new(file), 79 | }; 80 | Ok(writer) 81 | } 82 | -------------------------------------------------------------------------------- /src/lib.rs: -------------------------------------------------------------------------------- 1 | pub mod cli; 2 | mod error; 3 | mod io; 4 | mod split; 5 | 6 | pub use { 7 | crate::error::{Error, Result}, 8 | crate::io::Compression, 9 | crate::split::SplitterBuilder, 10 | }; 11 | -------------------------------------------------------------------------------- /src/main.rs: -------------------------------------------------------------------------------- 1 | use clap::StructOpt; 2 | use jemallocator::Jemalloc; 3 | 4 | use ttv::{cli, Compression, Result, SplitterBuilder}; 5 | 6 | #[global_allocator] 7 | static GLOBAL: Jemalloc = Jemalloc; 8 | 9 | fn main() -> Result<()> { 10 | env_logger::init(); 11 | let opt = cli::Opt::parse(); 12 | match opt.cmd { 13 | cli::Command::Split(x) => { 14 | let mut splitter = SplitterBuilder::new(&x.input, x.rows, x.prop)?; 15 | if x.decompress_input { 16 | splitter = 
splitter.input_compression(Compression::GzipCompression); 17 | } 18 | if x.compress_output { 19 | splitter = splitter.output_compression(Compression::GzipCompression); 20 | } 21 | if x.csv { 22 | splitter = splitter.csv(true); 23 | } 24 | if x.no_header { 25 | splitter = splitter.has_header(false); 26 | } 27 | if let Some(seed) = x.seed { 28 | splitter = splitter.seed(seed); 29 | } 30 | if let Some(output_prefix) = x.output_prefix { 31 | splitter = splitter.output_prefix(output_prefix); 32 | } 33 | if let Some(chunk_size) = x.chunk_size { 34 | splitter = splitter.chunk_size(chunk_size); 35 | } 36 | if let Some(total_rows) = x.total_rows { 37 | splitter = splitter.total_rows(total_rows); 38 | } 39 | splitter.build()?.run()?; 40 | } 41 | }; 42 | Ok(()) 43 | } 44 | -------------------------------------------------------------------------------- /src/split.rs: -------------------------------------------------------------------------------- 1 | mod single; 2 | mod splits; 3 | mod splitter; 4 | mod writer; 5 | 6 | pub use self::single::{ProportionSplit, RowSplit}; 7 | pub use self::splitter::SplitterBuilder; 8 | -------------------------------------------------------------------------------- /src/split/single.rs: -------------------------------------------------------------------------------- 1 | use std::ops::Deref; 2 | use std::str::FromStr; 3 | 4 | use crate::error::{Error, Result}; 5 | 6 | /// Represents a single 'split' of data 7 | pub trait Split { 8 | /// Get the name of the split. 9 | fn name(&self) -> &str; 10 | } 11 | 12 | /// A split based on a proportion. 13 | #[derive(Clone, Debug)] 14 | pub struct ProportionSplit { 15 | /// The split name. Will be used as the filename for the split. 16 | name: String, 17 | /// The proportion of data that should be directed to this split. 
18 | pub proportion: f64, 19 | } 20 | 21 | impl Split for ProportionSplit { 22 | fn name(&self) -> &str { 23 | &self.name 24 | } 25 | } 26 | 27 | impl FromStr for ProportionSplit { 28 | type Err = Error; 29 | 30 | /// Create a ProportionSplit from a string specification, such as 31 | /// "train=0.8". 32 | fn from_str(spec: &str) -> Result<Self> { 33 | let split: Vec<&str> = spec.split('=').collect(); 34 | if split.len() != 2 { 35 | return Err(Error::InvalidSplitSpecification(spec.to_string())); 36 | } 37 | let proportion = split[1] 38 | .parse::<f64>() 39 | .map_err(|_| Error::InvalidSplitSpecification(spec.to_string()))?; 40 | if proportion <= 0.0 { 41 | return Err(Error::ProportionTooLow(spec.to_string())); 42 | } else if proportion > 1.0 { 43 | return Err(Error::ProportionTooHigh(spec.to_string())); 44 | } 45 | Ok(ProportionSplit { 46 | name: split[0].to_string(), 47 | proportion, 48 | }) 49 | } 50 | } 51 | 52 | /// A split based on a number of rows. 53 | #[derive(Clone, Debug)] 54 | pub struct RowSplit { 55 | /// The split name. Will be used as the filename for the split. 56 | name: String, 57 | /// The total number of rows to send to this split. 58 | /// Stored as an f64 for optimization reasons. 59 | pub total: f64, 60 | /// The number of rows sent to this split so far. 61 | pub done: f64, 62 | } 63 | 64 | impl Split for RowSplit { 65 | fn name(&self) -> &str { 66 | &self.name 67 | } 68 | } 69 | 70 | impl FromStr for RowSplit { 71 | type Err = Error; 72 | 73 | /// Create a RowSplit from a string specification, such as 74 | /// "train=9000". 
75 | fn from_str(spec: &str) -> Result<Self> { 76 | let split: Vec<&str> = spec.split('=').collect(); 77 | if split.len() != 2 { 78 | return Err(Error::InvalidSplitSpecification(spec.to_string())); 79 | } 80 | let total = split[1] 81 | .parse::<u64>() 82 | .map(|total| total as f64) 83 | .map_err(|_| Error::InvalidSplitSpecification(spec.to_string()))?; 84 | Ok(RowSplit { 85 | name: split[0].to_string(), 86 | total, 87 | done: 0.0, 88 | }) 89 | } 90 | } 91 | 92 | pub enum SplitEnum { 93 | Rows(RowSplit), 94 | Proportion(ProportionSplit), 95 | } 96 | 97 | impl Deref for SplitEnum { 98 | type Target = dyn Split; 99 | fn deref(&self) -> &Self::Target { 100 | match self { 101 | SplitEnum::Rows(r) => r, 102 | SplitEnum::Proportion(p) => p, 103 | } 104 | } 105 | } 106 | -------------------------------------------------------------------------------- /src/split/splits.rs: -------------------------------------------------------------------------------- 1 | use std::ops::Deref; 2 | 3 | use rand::prelude::*; 4 | use rand_chacha::ChaChaRng; 5 | 6 | use crate::error::{Error, Result}; 7 | use crate::split::single::{ProportionSplit, RowSplit, Split}; 8 | 9 | pub enum SplitSelection<'a> { 10 | Some(&'a str), 11 | None, 12 | Done, 13 | } 14 | 15 | pub trait SplitSelector { 16 | fn get_split(&mut self, rng: &mut ChaChaRng) -> SplitSelection; 17 | } 18 | 19 | /// Splits defined using proportions.
20 | #[derive(Debug, Default)] 21 | pub struct ProportionSplits { 22 | pub splits: Vec<ProportionSplit>, 23 | } 24 | 25 | impl SplitSelector for ProportionSplits { 26 | fn get_split(&mut self, rng: &mut ChaChaRng) -> SplitSelection { 27 | let random: f64 = rng.random(); 28 | let mut total = 0.0; 29 | for split in &self.splits { 30 | total += split.proportion; 31 | if random < total { 32 | return SplitSelection::Some(split.name()); 33 | } 34 | } 35 | SplitSelection::None 36 | } 37 | } 38 | 39 | impl Deref for ProportionSplits { 40 | type Target = Vec<ProportionSplit>; 41 | fn deref(&self) -> &Self::Target { 42 | &self.splits 43 | } 44 | } 45 | 46 | impl TryFrom<Vec<ProportionSplit>> for ProportionSplits { 47 | type Error = Error; 48 | fn try_from(splits: Vec<ProportionSplit>) -> Result<Self> { 49 | let total = splits.iter().fold(0.0, |x, p| x + p.proportion); 50 | if total > 1.0 { 51 | return Err(Error::InvalidSplits(splits)); 52 | } 53 | Ok(ProportionSplits { splits }) 54 | } 55 | } 56 | 57 | /// Splits defined using rows. 58 | #[derive(Debug, Default)] 59 | pub struct RowSplits { 60 | pub splits: Vec<RowSplit>, 61 | /// The total number of rows in all splits combined 62 | total: f64, 63 | } 64 | 65 | impl SplitSelector for RowSplits { 66 | fn get_split(&mut self, rng: &mut ChaChaRng) -> SplitSelection { 67 | let random: f64 = rng.random(); 68 | let random = random * self.total; 69 | 70 | let mut total = 0.0; 71 | let unfinished_splits = self.splits.iter_mut().filter(|s| s.done < s.total); 72 | 73 | for split in unfinished_splits { 74 | total += split.total; 75 | if random < total { 76 | split.done += 1.0; 77 | if split.done >= split.total { 78 | self.total -= split.total; 79 | } 80 | return SplitSelection::Some(split.name()); 81 | } 82 | } 83 | SplitSelection::Done 84 | } 85 | } 86 | 87 | impl Deref for RowSplits { 88 | type Target = Vec<RowSplit>; 89 | fn deref(&self) -> &Self::Target { 90 | &self.splits 91 | } 92 | } 93 | 94 | impl From<Vec<RowSplit>> for RowSplits { 95 | fn from(splits: Vec<RowSplit>) -> Self { 96 | let total = splits.iter().fold(0.0, |x, y| x + y.total); 97 |
RowSplits { splits, total } 98 | } 99 | } 100 | 101 | /// Either RowSplits or ProportionSplits, determined at runtime depending 102 | /// on the user's input. 103 | pub enum Splits { 104 | Rows(RowSplits), 105 | Proportions(ProportionSplits), 106 | } 107 | 108 | impl Deref for Splits { 109 | type Target = dyn SplitSelector; 110 | fn deref(&self) -> &Self::Target { 111 | match self { 112 | Splits::Rows(r) => r, 113 | Splits::Proportions(r) => r, 114 | } 115 | } 116 | } 117 | 118 | impl Splits { 119 | /// Get a random split. 120 | pub fn get_split(&mut self, rng: &mut ChaChaRng) -> SplitSelection { 121 | match self { 122 | Splits::Rows(rows) => rows.get_split(rng), 123 | Splits::Proportions(rows) => rows.get_split(rng), 124 | } 125 | } 126 | } 127 | -------------------------------------------------------------------------------- /src/split/splitter.rs: -------------------------------------------------------------------------------- 1 | use std::collections::HashMap; 2 | use std::path::{Path, PathBuf}; 3 | 4 | use indicatif::{MultiProgress, ProgressBar, ProgressStyle}; 5 | use log::{debug, info}; 6 | use rand::prelude::*; 7 | use rand_chacha::ChaChaRng; 8 | 9 | use crate::error::{Error, Result}; 10 | use crate::io::{open_data, Compression}; 11 | use crate::split::{ 12 | single::{ProportionSplit, RowSplit, Split, SplitEnum}, 13 | splits::{SplitSelection, Splits}, 14 | writer::SplitWriter, 15 | }; 16 | 17 | pub struct SplitterBuilder { 18 | /// The path to the input file 19 | input: PathBuf, 20 | /// The desired splits 21 | splits: Splits, 22 | /// The seed used for randomisation 23 | seed: Option<u64>, 24 | /// The prefix for the output file(s) 25 | output_prefix: Option<PathBuf>, 26 | /// The maximum size of each chunk 27 | chunk_size: Option<u64>, 28 | /// The total number of rows 29 | total_rows: Option<u64>, 30 | /// Compression for input files 31 | input_compression: Compression, 32 | /// Compression for output files 33 | output_compression: Compression, 34 | /// Is the input CSV?
35 | csv: bool, 36 | /// Does the input have headers? 37 | /// 38 | /// Note: defaults to true. 39 | has_header: bool, 40 | } 41 | 42 | impl SplitterBuilder { 43 | pub fn new<P: AsRef<Path>>( 44 | input: &P, 45 | row_splits: Vec<RowSplit>, 46 | prop_splits: Vec<ProportionSplit>, 47 | ) -> Result<Self> { 48 | let splits = if row_splits.is_empty() { 49 | Splits::Proportions(prop_splits.try_into()?) 50 | } else { 51 | Splits::Rows(row_splits.into()) 52 | }; 53 | Ok(SplitterBuilder { 54 | input: input.as_ref().to_path_buf(), 55 | splits, 56 | seed: None, 57 | output_prefix: None, 58 | chunk_size: None, 59 | total_rows: None, 60 | input_compression: Compression::Uncompressed, 61 | output_compression: Compression::Uncompressed, 62 | csv: false, 63 | has_header: true, 64 | }) 65 | } 66 | 67 | #[must_use] 68 | pub fn seed(mut self, seed: u64) -> Self { 69 | self.seed = Some(seed); 70 | self 71 | } 72 | 73 | #[must_use] 74 | pub fn output_prefix(mut self, output_prefix: PathBuf) -> Self { 75 | self.output_prefix = Some(output_prefix); 76 | self 77 | } 78 | 79 | #[must_use] 80 | pub fn chunk_size(mut self, chunk_size: u64) -> Self { 81 | self.chunk_size = Some(chunk_size); 82 | self 83 | } 84 | 85 | #[must_use] 86 | pub fn total_rows(mut self, total_rows: u64) -> Self { 87 | self.total_rows = Some(total_rows); 88 | self 89 | } 90 | 91 | #[must_use] 92 | pub fn input_compression(mut self, input_compression: Compression) -> Self { 93 | self.input_compression = input_compression; 94 | self 95 | } 96 | 97 | #[must_use] 98 | pub fn output_compression(mut self, output_compression: Compression) -> Self { 99 | self.output_compression = output_compression; 100 | self 101 | } 102 | 103 | #[must_use] 104 | pub fn csv(mut self, csv: bool) -> Self { 105 | self.csv = csv; 106 | self 107 | } 108 | 109 | #[must_use] 110 | pub fn has_header(mut self, has_header: bool) -> Self { 111 | self.has_header = has_header; 112 | self 113 | } 114 | 115 | pub fn build(self) -> Result<Splitter> { 116 | let rng = match self.seed { 117 | Some(s) =>
ChaChaRng::seed_from_u64(s), 118 | None => ChaChaRng::from_os_rng(), 119 | }; 120 | Ok(Splitter { 121 | input: self.input, 122 | rng, 123 | splits: self.splits, 124 | output_prefix: self.output_prefix, 125 | chunk_size: self.chunk_size, 126 | total_rows: self.total_rows, 127 | input_compression: self.input_compression, 128 | output_compression: self.output_compression, 129 | csv: self.csv, 130 | has_header: self.has_header, 131 | }) 132 | } 133 | } 134 | 135 | pub struct Splitter { 136 | /// The path to the input file 137 | input: PathBuf, 138 | /// The desired splits 139 | splits: Splits, 140 | /// The stateful random number generator. 141 | rng: ChaChaRng, 142 | /// The prefix for the output file(s) 143 | output_prefix: Option<PathBuf>, 144 | /// The maximum size of each chunk 145 | chunk_size: Option<u64>, 146 | /// The total number of rows 147 | total_rows: Option<u64>, 148 | /// Compression for input files 149 | input_compression: Compression, 150 | /// Compression for output files 151 | output_compression: Compression, 152 | /// Is the input CSV? 153 | csv: bool, 154 | /// Does the input have headers? 155 | /// 156 | /// Note: defaults to true.
157 | has_header: bool, 158 | } 159 | 160 | impl Splitter { 161 | pub fn run(mut self) -> Result<()> { 162 | let multi = MultiProgress::new(); 163 | 164 | // Use a slightly different progress bar depending on the situation 165 | let progress: HashMap<String, ProgressBar> = match (&self.splits, self.total_rows) { 166 | (Splits::Proportions(p), Some(t)) => p 167 | .splits 168 | .iter() 169 | .map(|p| { 170 | let name = p.name().to_string(); 171 | let style = ProgressStyle::default_bar() 172 | .template("{msg:<10}: [{elapsed_precise}] {bar:40.cyan/blue} {pos:>7}/~{len:7} (ETA: {eta_precise})") 173 | .expect("valid indicatif template") 174 | .progress_chars("█▉▊▋▌▍▎▏ "); 175 | let split_total = p.proportion * t as f64; 176 | let pb = multi.add(ProgressBar::new(split_total as u64)); 177 | pb.set_message(name.clone()); 178 | pb.set_style(style); 179 | (name, pb) 180 | }) 181 | .collect(), 182 | (Splits::Proportions(p), None) => p 183 | .splits 184 | .iter() 185 | .map(|p| { 186 | let name = p.name().to_string(); 187 | let style = ProgressStyle::default_bar() 188 | .template("{msg:<10}: [{elapsed_precise}] {spinner:.green} {pos:>7}") 189 | .expect("valid indicatif template"); 190 | let pb = multi.add(ProgressBar::new_spinner()); 191 | pb.set_style(style); 192 | pb.set_message(name.clone()); 193 | (name, pb) 194 | }) 195 | .collect(), 196 | (Splits::Rows(r), _) => r 197 | .splits 198 | .iter() 199 | .map(|r| { 200 | let name = r.name().to_string(); 201 | let style = ProgressStyle::default_bar() 202 | .template("{msg:<10}: [{elapsed_precise}] {bar:40.cyan/blue} {pos:>7}/{len:7} (ETA: {eta_precise})") 203 | .expect("valid indicatif template") 204 | .progress_chars("█▉▊▋▌▍▎▏ "); 205 | let pb = multi.add(ProgressBar::new(r.total as u64)); 206 | pb.set_message(name.clone()); 207 | pb.set_style(style); 208 | (name, pb) 209 | }) 210 | .collect() 211 | }; 212 | 213 | let mut senders = HashMap::new(); 214 | let mut chunk_writers = Vec::new(); 215 | let output_path = match self.output_prefix { 216 |
Some(ref f) => f.clone(), 217 | None => self.input.clone(), 218 | }; 219 | match &self.splits { 220 | Splits::Proportions(p) => { 221 | for split in p.iter() { 222 | let split = SplitEnum::Proportion((*split).clone()); 223 | let (split_sender, mut split_chunk_writers) = SplitWriter::new( 224 | &output_path, 225 | &split, 226 | self.chunk_size, 227 | self.total_rows, 228 | self.output_compression, 229 | )?; 230 | senders.insert(split.name().to_string(), split_sender); 231 | chunk_writers.append(&mut split_chunk_writers); 232 | } 233 | } 234 | Splits::Rows(r) => { 235 | for split in r.iter() { 236 | let split = SplitEnum::Rows((*split).clone()); 237 | let (split_sender, mut split_chunk_writers) = SplitWriter::new( 238 | &output_path, 239 | &split, 240 | self.chunk_size, 241 | self.total_rows, 242 | self.output_compression, 243 | )?; 244 | senders.insert(split.name().to_string(), split_sender); 245 | chunk_writers.append(&mut split_chunk_writers); 246 | } 247 | } 248 | }; 249 | 250 | let pool = rayon::ThreadPoolBuilder::new() 251 | .num_threads(chunk_writers.len() + 2) 252 | .thread_name(|num| format!("thread-{num}")) 253 | .start_handler(|num| debug!("thread {} starting", num)) 254 | .exit_handler(|num| debug!("thread {} finishing", num)) 255 | .build() 256 | .unwrap(); 257 | 258 | pool.scope(move |scope| { 259 | info!("Reading data from {}", self.input.to_str().unwrap()); 260 | let reader_builder = if self.csv { 261 | let mut reader_builder = csv::ReaderBuilder::new(); 262 | reader_builder.has_headers(false); 263 | Some(reader_builder) 264 | } else { 265 | None 266 | }; 267 | let mut reader = open_data(&self.input, self.input_compression, reader_builder)?; 268 | 269 | if self.has_header { 270 | info!("Writing header to files"); 271 | let header = match reader.read_line() { 272 | Some(h) => h?, 273 | None => return Err(Error::EmptyFile), 274 | }; 275 | for sender in senders.values_mut() { 276 | sender.send_all(&header)?; 277 | } 278 | } 279 | 280 | let has_header = 
self.has_header; 281 | { 282 | for writer in chunk_writers { 283 | scope.spawn(move |_| { 284 | // In most cases each writer will only deal with 285 | // one chunk. But if we're only told a proportion and 286 | // a chunk size (and no total rows), we'll be writing 287 | // to two files at once, and we'll need to switch to a 288 | // new file if we go over the chunk size. 289 | let mut chunk_id = writer.chunk_id; 290 | let mut rows_sent_to_chunk = 0; 291 | let mut file = writer.output(chunk_id).expect("Could not open file"); 292 | let mut header: Header<String> = if has_header { 293 | Header::None 294 | } else { 295 | Header::Disabled 296 | }; 297 | for row in writer.receiver.iter() { 298 | if header == Header::None { 299 | header = Header::Some(row.clone()); 300 | } 301 | if let Some(chunk_size) = writer.chunk_size { 302 | if rows_sent_to_chunk > (chunk_size) { 303 | // add one for header 304 | // This should only ever happen if we weren't 305 | // able to pre-calculate how many chunks were 306 | // needed 307 | chunk_id = chunk_id.map(|c| c + 2); 308 | file = writer.output(chunk_id).expect("Could not open file"); 309 | if let Header::Some(h) = header.as_ref() { 310 | writer 311 | .handle_row(&mut file, h) 312 | .expect("Could not write row to file"); 313 | } 314 | rows_sent_to_chunk = 1 315 | } 316 | } 317 | writer 318 | .handle_row(&mut file, &row) 319 | .expect("Could not write row to file"); 320 | rows_sent_to_chunk += 1; 321 | } 322 | }) 323 | } 324 | } 325 | 326 | info!("Reading lines"); 327 | while let Some(record) = reader.read_line() { 328 | let split = self.splits.get_split(&mut self.rng); 329 | match split { 330 | SplitSelection::Some(split) => { 331 | match senders.get_mut(split).unwrap().send(record.unwrap()) { 332 | Ok(_) => progress[split].inc(1), 333 | Err(e) => return Err(e), 334 | } 335 | } 336 | SplitSelection::None => continue, 337 | SplitSelection::Done => break, 338 | } 339 | } 340 | progress.values().for_each(|f| f.finish()); 341 | info!("Finished
writing to files"); 342 | 343 | for (_, sender) in senders { 344 | sender.finish(); 345 | } 346 | Ok(()) 347 | })?; 348 | Ok(()) 349 | } 350 | } 351 | 352 | #[derive(Debug, PartialEq)] 353 | enum Header<T> { 354 | None, 355 | Some(T), 356 | Disabled, 357 | } 358 | 359 | impl Header<String> { 360 | fn as_ref(&self) -> Header<&str> { 361 | match self { 362 | Header::None => Header::None, 363 | Header::Disabled => Header::Disabled, 364 | Header::Some(s) => Header::Some(s.as_str()), 365 | } 366 | } 367 | } 368 | -------------------------------------------------------------------------------- /src/split/writer.rs: -------------------------------------------------------------------------------- 1 | use std::fs::create_dir_all; 2 | use std::io::Write; 3 | use std::path::{Path, PathBuf}; 4 | use std::sync::mpsc::{Receiver, SyncSender}; 5 | 6 | use super::single::SplitEnum; 7 | use crate::error::Result; 8 | use crate::io; 9 | 10 | /// Accepts rows assigned to a split and writes them in an appropriate way. 11 | /// 12 | /// If a max chunk size has been specified it will round-robin the rows between 13 | /// the chunks. 14 | pub(crate) struct SplitWriter { 15 | /// Sending halves of channels. 16 | /// 17 | /// We use a SyncSender here because we may end up reading much faster 18 | /// than writing, and we need to limit the size of the buffers. 19 | chunk_senders: Vec<SyncSender<String>>, 20 | 21 | /// Index of the chunk_sender which should receive the next row. 22 | next_index: usize, 23 | } 24 | 25 | impl SplitWriter { 26 | pub fn new( 27 | path: &Path, 28 | split: &SplitEnum, 29 | chunk_size: Option<u64>, 30 | total_rows: Option<u64>, 31 | compression: io::Compression, 32 | ) -> Result<(Self, Vec<ChunkWriter>)> { 33 | let n_chunks = match (split, chunk_size, total_rows) { 34 | // Just use one sender since there is no chunking required. 35 | (_, None, _) => 1, 36 | 37 | // Create one sender per chunk.
38 | (SplitEnum::Rows(r), Some(c), _) => (r.total / c as f64).ceil() as u64, 39 | 40 | // TODO: 41 | // We don't know how many chunks will be required. Create two 42 | // chunks; we'll fix this later. 43 | (SplitEnum::Proportion(_), Some(_), None) => 2, 44 | 45 | // Use as many senders as we estimate there will be chunks for this 46 | // split. 47 | (SplitEnum::Proportion(p), Some(c), Some(t)) => { 48 | ((t as f64) * p.proportion / c as f64).ceil() as u64 + 1 49 | } 50 | }; 51 | 52 | let mut chunk_senders = Vec::new(); 53 | let mut chunk_writers = Vec::new(); 54 | for chunk_id in 0..n_chunks { 55 | let (sender, receiver) = std::sync::mpsc::sync_channel(100); 56 | chunk_senders.push(sender); 57 | 58 | let chunk_id = if n_chunks == 1 { None } else { Some(chunk_id) }; 59 | let chunk_writer = ChunkWriter::new( 60 | path.to_path_buf(), 61 | split.name().to_string(), 62 | compression, 63 | chunk_id, 64 | chunk_size, 65 | receiver, 66 | ); 67 | chunk_writers.push(chunk_writer); 68 | } 69 | 70 | Ok(( 71 | SplitWriter { 72 | chunk_senders, 73 | next_index: 0, 74 | }, 75 | chunk_writers, 76 | )) 77 | } 78 | 79 | /// Send a row to this split. 80 | /// 81 | /// The sender will assign it to the correct chunk (if there was no maximum 82 | /// chunk size specified, there is effectively only one chunk!) 83 | /// This will round-robin through the chunks. 84 | pub fn send(&mut self, row: String) -> Result<bool> { 85 | match self.chunk_senders.get(self.next_index) { 86 | Some(sender) => { 87 | sender.send(row)?; 88 | self.next_index += 1; 89 | } 90 | None => { 91 | // Start again at the next chunk 92 | self.chunk_senders[0].send(row)?; 93 | self.next_index = 1; 94 | } 95 | } 96 | Ok(true) 97 | } 98 | 99 | /// Send a row to all chunks of this split. 100 | /// 101 | /// Used for the header row. 102 | pub fn send_all(&mut self, row: &str) -> Result<()> { 103 | for sender in &self.chunk_senders { 104 | sender.send(row.to_string())?
105 | } 106 | Ok(()) 107 | } 108 | 109 | pub fn finish(self) { 110 | for sender in self.chunk_senders { 111 | drop(sender); 112 | } 113 | } 114 | } 115 | 116 | /// Writes rows to files once they've been assigned to a split. 117 | pub struct ChunkWriter { 118 | path: PathBuf, 119 | name: String, 120 | compression: io::Compression, 121 | pub chunk_id: Option<u64>, 122 | pub chunk_size: Option<u64>, 123 | pub receiver: Receiver<String>, 124 | } 125 | 126 | impl ChunkWriter { 127 | fn new( 128 | path: PathBuf, 129 | name: String, 130 | compression: io::Compression, 131 | chunk_id: Option<u64>, 132 | chunk_size: Option<u64>, 133 | receiver: Receiver<String>, 134 | ) -> Self { 135 | ChunkWriter { 136 | path, 137 | name, 138 | compression, 139 | chunk_id, 140 | chunk_size, 141 | receiver, 142 | } 143 | } 144 | 145 | pub fn output(&self, chunk_id: Option<u64>) -> Result<io::OutputWriter> { 146 | let mut filename = self.path.clone(); 147 | let original_filename = self.path.file_stem().unwrap(); 148 | filename.pop(); 149 | filename.push(&self.name); 150 | create_dir_all(&filename)?; 151 | let chunk_part = match chunk_id { 152 | None => "".to_string(), 153 | Some(c) => format!(".{c:0>4}"), 154 | }; 155 | let extension = match self.compression { 156 | io::Compression::GzipCompression => ".gz", 157 | io::Compression::Uncompressed => "", 158 | }; 159 | filename.push(format!( 160 | "{}.{}{}.csv{}", 161 | original_filename.to_string_lossy(), 162 | &self.name, 163 | chunk_part, 164 | extension, 165 | )); 166 | io::open_output(filename, self.compression) 167 | } 168 | /// Handle writing of a row to this chunk.
169 | pub fn handle_row(&self, file: &mut io::OutputWriter, row: &str) -> Result<()> { 170 | file.write_all(row.as_bytes())?; 171 | Ok(()) 172 | } 173 | } 174 | -------------------------------------------------------------------------------- /tests/cli_tests.rs: -------------------------------------------------------------------------------- 1 | #[test] 2 | fn cli_tests() { 3 | trycmd::TestCases::new().case("tests/cmd/*.toml"); 4 | } 5 | -------------------------------------------------------------------------------- /tests/cmd/help-split.stderr: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sd2k/ttv/9cdc49e4bb8640085b080adc49773df9becadbe7/tests/cmd/help-split.stderr -------------------------------------------------------------------------------- /tests/cmd/help-split.stdout: -------------------------------------------------------------------------------- 1 | ttv-split 2 | Split dataset into two or more files for test/train/validation sets 3 | 4 | USAGE: 5 | ttv split [OPTIONS] 6 | 7 | ARGS: 8 | Data to split, optionally gzip compressed. If '-', read from stdin 9 | 10 | OPTIONS: 11 | -c, --chunk-size 12 | Maximum number of rows per output chunk 13 | 14 | -C, --compressed-output 15 | Compress output files using gzip 16 | 17 | --csv 18 | Parse input as CSV. Only needed if rows contain embedded newlines - will impact 19 | performance. 20 | 21 | -d, --decompress-input 22 | Decompress input from gzip format 23 | 24 | -h, --help 25 | Print help information 26 | 27 | -n, --no-header 28 | Don't treat the first row as a header 29 | 30 | -o, --output-prefix 31 | Output filename prefix. Only used if reading from stdin 32 | 33 | -p, --prop 34 | Specify splits by proportion of rows 35 | 36 | -r, --rows 37 | Specify splits by number of rows 38 | 39 | -s, --seed 40 | RNG seed, for reproducibility 41 | 42 | -t, --total-rows 43 | Number of rows in input file. 
Used for progress when using proportion splits 44 | -------------------------------------------------------------------------------- /tests/cmd/help-split.toml: -------------------------------------------------------------------------------- 1 | bin.name = "ttv" 2 | args = "split -h" 3 | -------------------------------------------------------------------------------- /tests/cmd/help.stderr: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sd2k/ttv/9cdc49e4bb8640085b080adc49773df9becadbe7/tests/cmd/help.stderr -------------------------------------------------------------------------------- /tests/cmd/help.stdout: -------------------------------------------------------------------------------- 1 | ttv 2 | Flexibly create test, train and validation sets 3 | 4 | USAGE: 5 | ttv [OPTIONS] 6 | 7 | OPTIONS: 8 | -h, --help Print help information 9 | -v Set the level of verbosity 10 | 11 | SUBCOMMANDS: 12 | help Print this message or the help of the given subcommand(s) 13 | split Split dataset into two or more files for test/train/validation sets 14 | -------------------------------------------------------------------------------- /tests/cmd/help.toml: -------------------------------------------------------------------------------- 1 | bin.name = "ttv" 2 | args = "-h" 3 | --------------------------------------------------------------------------------
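Illustrative note (not a file in the repository): the proportion-based selection in `src/split/splits.rs` walks a running total of the proportions and returns the first split whose cumulative bound exceeds a uniform draw. The sketch below isolates that idea using only the standard library. `Selection` and `select_split` are hypothetical names, and the draw is passed in as a parameter instead of coming from `ChaChaRng`, so the behaviour is easy to check by hand.

```rust
/// Outcome of routing one row. A simplified, hypothetical stand-in for the
/// crate's `SplitSelection` (without the `Done` variant used by row splits).
#[derive(Debug, PartialEq)]
enum Selection<'a> {
    Some(&'a str),
    None,
}

/// Pick a split for a uniform draw in [0, 1) by walking cumulative proportions.
/// Rows whose draw falls beyond the summed proportions are assigned to no split.
fn select_split<'a>(splits: &[(&'a str, f64)], draw: f64) -> Selection<'a> {
    let mut cumulative = 0.0;
    for &(name, proportion) in splits {
        cumulative += proportion;
        if draw < cumulative {
            return Selection::Some(name);
        }
    }
    Selection::None
}

fn main() {
    // Proportions sum to 0.8, so roughly 20% of rows would be dropped.
    let splits = [("train", 0.6), ("test", 0.2)];
    for draw in [0.1, 0.7, 0.95] {
        println!("{draw} -> {:?}", select_split(&splits, draw));
    }
}
```

Because the proportions are allowed to sum to less than 1.0, the `None` case is a normal outcome and not an error: it is how rows end up excluded from every output file.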