├── 01.intro.md
├── 02.why.md
├── 03.current_impl.md
├── 04.rust_impl.md
├── 05.future.md
├── LICENSE
├── README.md
├── abstract.md
└── poster
    ├── figures
        ├── arch.svg
        ├── arch_cpp.png
        ├── arch_cpp.svg
        ├── arch_rust.png
        └── arch_rust.svg
    ├── poster.pdf
    └── poster.sla

/01.intro.md:
--------------------------------------------------------------------------------
1 | # Intro
2 |
3 | Python has a mature ecosystem for extensions using C/C++, with Cython being part of the standard toolset for scientific programming. Even so, C/C++ still have many drawbacks, ranging from smaller annoyances (like library packaging, versioning and build systems) to serious ones like buffer overflows and undefined behavior leading to security issues. Rust is a systems programming language trying to avoid many of the C/C++ pitfalls, on top of providing a good development workflow and memory safety guarantees. This work presents a way to write extensions in Rust and use them in Python, using sourmash [Brown and Irber, 2016] as an example.
4 |
5 | sourmash implements MinHash [Broder, 1997], a method for estimating the similarity of two or more datasets, expanding on the work pioneered by Mash [Ondov et al., 2016]. It is available as a CLI and a Python library.
6 |
--------------------------------------------------------------------------------
/02.why.md:
--------------------------------------------------------------------------------
1 | # Why Rust?
2 |
3 | While Rust doesn't aim to be a scientific language, its focus on being a general-purpose language allows a phenomenon similar to what happened with Python: people from many areas pushed the language in different directions (system scripting, web development, numerical programming...), creating an environment where developers can combine it all in their systems.
4 |
5 | Rust brings many best practices to the default experience, such as integrated package management with Cargo (which also supports documentation, testing and benchmarking). Some of them are not viable in C/C++ due to the widespread adoption of both languages and their backward compatibility guarantees. At the same time, because Rust was initially developed to be integrated incrementally into the Firefox browser engine, it tries to keep as much compatibility as possible with C/C++.
6 |
7 | Rust also has a minimal runtime (like C, unlike Python), making it a good candidate for embedding into other software or even for situations where strict control of resources is required (microcontrollers and embedded systems, for example).
8 |
--------------------------------------------------------------------------------
/03.current_impl.md:
--------------------------------------------------------------------------------
1 | # Current implementation
2 |
3 | [![](poster/figures/arch_cpp.svg?sanitize=true)](poster/figures/arch_cpp.svg)
4 |
5 | ## Pros
6 |
7 | - Cython is a superset of Python
8 | - Mature codebases for example usage and best practices
9 | - Lower overhead to call C/C++ code
10 | - NumPy integration
11 | - Nice gradual path to migrate performance-intensive code from Python to C/C++
12 |
13 | ## Cons
14 |
15 | - Cython C++ integration has some corner cases and missing features
16 | - Need to rewrite header declarations (pxd file)
17 | - Errors can be cryptic (do they happen at the C/C++, Cython or Python level?)
18 | - Many C/C++ build system combinations
19 | - Vendored dependencies (no package management)
20 | - One wheel per OS and Python version
21 |
22 | ## Further reading
23 |
24 | - [The current minhash implementation (header file)][1]
25 | - [The pxd (descriptions for Cython)][2]
26 | - [The pyx (implementation in Cython) to make low level code more pythonic][3]
27 |
28 | [1]: https://github.com/dib-lab/sourmash/blob/ab67c0bd9bbc12aa4bb4bc90533042d74ced914b/sourmash/kmer_min_hash.hh
29 | [2]: https://github.com/dib-lab/sourmash/blob/ab67c0bd9bbc12aa4bb4bc90533042d74ced914b/sourmash/_minhash.pxd
30 | [3]: https://github.com/dib-lab/sourmash/blob/ab67c0bd9bbc12aa4bb4bc90533042d74ced914b/sourmash/_minhash.pyx
31 |
--------------------------------------------------------------------------------
/04.rust_impl.md:
--------------------------------------------------------------------------------
1 | # Rust implementation
2 |
3 | [![](poster/figures/arch_rust.svg?sanitize=true)](poster/figures/arch_rust.svg)
4 |
5 | ## Pros
6 |
7 | - Cargo and crates.io for package management
8 | - FFI interface is reusable in other languages
9 | - Auto-generated C header (cbindgen) and low level bindings (CFFI)
10 | - Works for PyPy too
11 | - One wheel per OS (universal)
12 |
13 | ## Cons
14 |
15 | - Fewer projects using Rust extensions
16 | - FFI overhead when calling C code
17 | - No gradual transition from Python to Rust code
18 | - Fewer bioinformatics libraries available
19 | - No NumPy integration
20 | - Low level abstraction ("what C can represent")
21 |
22 | ## Further reading
23 |
24 | - [The Rust implementation][1]
25 | - [The .h header with definitions created by cbindgen][2]
26 | - [The Python implementation using CFFI to make low level code more pythonic][3]
27 | - [PR on the sourmash repo for tracking progress][4]
28 |
29 | [1]: https://github.com/luizirber/sourmash-rust
30 | [2]: https://github.com/luizirber/sourmash-rust/blob/920809c4dee692f83f40cad08b018ad4cd859c72/target/sourmash.h
31 | [3]: https://github.com/dib-lab/sourmash/pull/424/files#diff-2032759ae736988ba55922787626efb5
32 | [4]: https://github.com/dib-lab/sourmash/pull/424
33 |
--------------------------------------------------------------------------------
/05.future.md:
--------------------------------------------------------------------------------
1 | # Future work
2 |
3 | This proof of concept focused on replacing the C++ parts with Rust. All the sourmash tests are passing, but there are still many improvements to be made. The performance in most benchmarks is very close to the C++ implementation, but since performance wasn't the initial goal of the experiment there are many opportunities to make it faster.
4 |
5 | Another goal is to be able to use the core functionality of sourmash in browsers. A previous experiment focused on implementing a compatible package in JavaScript, but it led to split codebases and an increased maintenance burden. The Rust implementation makes it possible to target WebAssembly and generate a JavaScript package wrapping it, with the added benefit of avoiding some JavaScript shortcomings (like the lack of 64-bit integer support).
6 |
7 | The Rust library implements basic compatibility with Finch sketches [Bovee and Greenfield, 2018], allowing data to be shared between both MinHash implementations. Many of the other sourmash methods (search, gather) are not available in Rust yet, but this already allows using other MinHash sketches with them.
8 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 |
3 | Copyright (c) 2018, Luiz Irber
4 | All rights reserved.
5 |
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 |
9 | * Redistributions of source code must retain the above copyright notice, this
10 |   list of conditions and the following disclaimer.
11 |
12 | * Redistributions in binary form must reproduce the above copyright notice,
13 |   this list of conditions and the following disclaimer in the documentation
14 |   and/or other materials provided with the distribution.
15 |
16 | * Neither the name of the copyright holder nor the names of its
17 |   contributors may be used to endorse or promote products derived from
18 |   this software without specific prior written permission.
19 |
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Oxidizing Python: writing extensions in Rust
2 |
3 | [Luiz Carlos Irber Júnior](https://github.com/luizirber)
4 |
5 | - Department of Population Health and Reproduction, University of California, Davis, USA
6 |
7 | [![DOI](https://img.shields.io/badge/DOI-10.7490%2Ff1000research.1115726.1-orange.svg)](https://dx.doi.org/10.7490/f1000research.1115726.1)
8 |
9 | Poster presented at [GCCBOSC 2018][1] and [SciPy 2018][2].
10 |
11 | ## Abstract
12 |
13 | Python has a mature ecosystem for extensions using C/C++,
14 | with Cython being part of the standard toolset for scientific programming.
15 | Even so, C/C++ still have many drawbacks,
16 | ranging from smaller annoyances (like library packaging, versioning and build systems)
17 | to serious ones like buffer overflows and undefined behavior leading to security issues.
18 |
19 | Rust is a systems programming language trying to avoid many of the C/C++ pitfalls,
20 | on top of providing a good development workflow and memory safety guarantees.
21 |
22 | This work presents a way to write extensions in Rust and use them in Python,
23 | using sourmash as an example.
24 |
25 | ## Table of Contents
26 |
27 | - [Introduction](01.intro.md)
28 | - [Why Rust?](02.why.md)
29 | - [Current implementation](03.current_impl.md)
30 | - [Rust implementation](04.rust_impl.md)
31 | - [Future work](05.future.md)
32 | - [References](#references)
33 | - Appendices
34 |   - [Submitted abstract](abstract.md)
35 |   - [The final poster](poster/poster.pdf)
36 |
37 | ## References
38 |
39 | - Broder, Andrei Z. 1997. “On the Resemblance and Containment of Documents.” In Compression and Complexity of Sequences 1997. Proceedings, 21–29. IEEE. http://ieeexplore.ieee.org/abstract/document/666900/
40 | - Ondov, Brian D., Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren, and Adam M. Phillippy. 2016. “Mash: Fast Genome and Metagenome Distance Estimation Using MinHash.” Genome Biology 17: 132. https://dx.doi.org/10.1186/s13059-016-0997-x
41 | - Bovee, Roderick, and Nick Greenfield. 2018. “Finch: A Tool Adding Dynamic Abundance Filtering to Genomic MinHashing.” The Journal of Open Source Software. https://dx.doi.org/10.21105/joss.00505
42 | - Brown, C. Titus, and Luiz Irber. 2016. “sourmash: A Library for MinHash Sketching of DNA.” The Journal of Open Source Software 1 (5). https://dx.doi.org/10.21105/joss.00027
43 |
44 |
45 | [1]: https://gccbosc2018.sched.com/event/FEWp/b23-oxidizing-python-writing-extensions-in-rust
46 | [2]: https://scipy2018.scipy.org/ehome/index.php?eventid=299527&tabid=712461&cid=2233543&sessionid=21618890&sessionchoice=1&
47 |
--------------------------------------------------------------------------------
/abstract.md:
--------------------------------------------------------------------------------
1 | # Oxidizing Python: writing extensions in Rust
2 |
3 | Python has a mature ecosystem for extensions using C/C++,
4 | with Cython being part of the standard toolset for scientific programming.
5 | Even so, C/C++ still have many drawbacks,
6 | ranging from smaller annoyances (like library packaging, versioning and build systems)
7 | to serious ones like buffer overflows and undefined behavior leading to security issues.
8 |
9 | Rust is a systems programming language trying to avoid many of the C/C++ pitfalls,
10 | on top of providing a good development workflow and memory safety guarantees.
11 |
12 | This work presents a way to write extensions in Rust and use them in Python,
13 | using sourmash as an example.
14 |
--------------------------------------------------------------------------------
/poster/figures/arch_cpp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luizirber/2018-python-rust/eac4d0f52726482f2a91f65588fdc8dcb1e020cb/poster/figures/arch_cpp.png
--------------------------------------------------------------------------------
/poster/figures/arch_cpp.svg:
--------------------------------------------------------------------------------
[SVG figure: architecture diagram of the current C++/Cython implementation; markup not recoverable. Text labels: screed, ijson, sourmash, khmer, ..., commands.py, search.py, gather.py, Handwritten code, Auto-generated code, compiled extension, Vendored deps, murmurhash3, kmer_min_hash.hh, pxd, pyx, minhashmodule]
--------------------------------------------------------------------------------
/poster/figures/arch_rust.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luizirber/2018-python-rust/eac4d0f52726482f2a91f65588fdc8dcb1e020cb/poster/figures/arch_rust.png
--------------------------------------------------------------------------------
/poster/poster.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luizirber/2018-python-rust/eac4d0f52726482f2a91f65588fdc8dcb1e020cb/poster/poster.pdf
--------------------------------------------------------------------------------