├── CITATION.cff
├── README.md
├── motivation.md
└── reference.md

/CITATION.cff:
--------------------------------------------------------------------------------
cff-version: 1.2.0
message: If you use this software, please cite it as below.
title: ADL Functionality Benchmarks Index
abstract: A list of example analysis challenges and solved implementations of them in various languages
authors:
  - family-names: Proffitt
    given-names: Mason
  - family-names: Müller
    given-names: Ingo
  - family-names: Graur
    given-names: Dan
  - family-names: Adamec
    given-names: Mat
  - family-names: Ling
    given-names: Jerry
  - family-names: David
    given-names: Pieter
  - family-names: Guiraud
    given-names: Enrico
  - family-names: Binet
    given-names: Sebastien
doi: 10.5281/zenodo.5131286
repository-code: "https://github.com/iris-hep/adl-benchmarks-index"

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
[![DOI](https://zenodo.org/badge/188481137.svg)](https://zenodo.org/badge/latestdoi/188481137)

Introduction
============

This repository maintains a list of common, agreed-upon benchmark analysis tasks that can be used to exemplify, test, and compare different languages and approaches used for analysis. Also listed here are public data files on which to run these benchmarks and the repositories of actual implementations of these benchmarks.

Functionality benchmarks
========================

1. Plot the ETmiss of all events.
1. Plot the pT of all jets.
1. Plot the pT of jets with |η| < 1.
1. Plot the ETmiss of events that have at least two jets with pT > 40 GeV.
1. Plot the ETmiss of events that have an opposite-charge muon pair with an invariant mass between 60 and 120 GeV.
1. For events with at least three jets, plot the pT of the trijet four-momentum that has the invariant mass closest to 172.5 GeV in each event and plot the maximum b-tagging discriminant value among the jets in this trijet.
1. Plot the scalar sum in each event of the pT of jets with pT > 30 GeV that are not within 0.4 in ΔR of any light lepton with pT > 10 GeV.
1. For events with at least three light leptons and a same-flavor opposite-charge light lepton pair, find such a pair that has the invariant mass closest to 91.2 GeV in each event and plot the transverse mass of the system consisting of the missing transverse momentum and the highest-pT light lepton not in this pair.

For the motivations behind these benchmarks, see [motivation.md](motivation.md). For a technical reference of the terms used in the benchmarks, see [reference.md](reference.md).

Input data files
================

* [Converted to NanoAOD](https://github.com/cms-opendata-analyses/AOD2NanoAODOutreachTool) from [2012 CMS open data](http://opendata.cern.ch/record/6021):
  * root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root (16 GiB, 53 million events)

Language implementations
========================

|Repository|Language|Description|
|----------|--------|-----------|
|[opendata-benchmarks](https://github.com/root-project/opendata-benchmarks)|RDataFrame|RDataFrame is a component of [ROOT](https://root.cern/) that provides a high-level interface for analyzing [TTrees](https://root.cern.ch/doc/master/classTTree.html) and other data formats. Each task is solved twice: with a simpler syntax suited to interpreted ROOT macros and with a fully compiled C++ syntax for best performance.|
|[nail](https://github.com/arizzi/nail/tree/master/benchmarks)|NAIL (Natural Analysis Implementation Language)||
|[groot](https://github.com/go-hep/examples/tree/master/groot/bench-opendata)|[Go](https://golang.org)|Part of the [Go-HEP](https://go-hep.org/) project, `groot` is a pure Go package that provides read/write access to ROOT files|
|[coffea](https://github.com/CoffeaTeam/coffea-benchmarks/tree/master)|Python + Numpy|[Coffea](https://github.com/CoffeaTeam/coffea) builds on numpy and awkward-array for columnar data analysis in Python|
|[bamboo](https://github.com/pieterdavid/bamboo-adl-benchmarks)|Python + RDataFrame|The [bamboo](https://gitlab.cern.ch/cp3-cms/bamboo) analysis framework provides a high-level Python interface to RDataFrame (technically an embedded domain-specific language)|
|[queryosity](https://github.com/taehyounpark/queryosity-benchmarks)| C++ | [Queryosity](https://queryosity.readthedocs.io/en/latest/) is a (semi-)structured data analysis library with support for arbitrary data types. |
|[Rumble](https://github.com/RumbleDB/hep-iris-benchmark-jsoniq)|[JSONiq](https://www.jsoniq.org/) (an [XQuery](https://en.wikipedia.org/wiki/XQuery) dialect for [JSON](https://en.wikipedia.org/wiki/JSON) data)|Most data in ROOT files can be exposed in the JSON data model and can thus be processed by JSONiq. This implementation is targeted to be run on [Rumble](https://rumbledb.org/), a JSONiq implementation on top of Spark, but could be run by any other JSONiq processor.|
|[BigQuery](https://github.com/RumbleDB/iris-hep-benchmark-bigquery)|[BigQuery's dialect](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax) of [SQL](https://en.wikipedia.org/wiki/SQL)|SQL is arguably the most widespread language for querying structured data. Since SQL:1999, it supports arrays and structured types and is thus, in principle, suited for typical HEP analyses, though not many implementations support these features. BigQuery's dialect is based on SQL:2011, supports the mentioned features, and has a few additional language constructs that make queries more concise.|
|[PrestoDB](https://github.com/RumbleDB/iris-hep-benchmark-presto)|[PrestoDB's dialect](https://prestodb.io/docs/current/sql/select.html) of [SQL](https://en.wikipedia.org/wiki/SQL) |Like BigQuery, Presto has some support for arrays and structured types; however, it only has limited support for nested queries and a more verbose syntax than BigQuery.|
|[Amazon Athena](https://github.com/RumbleDB/iris-hep-benchmark-athena)|[Athena's dialect](https://docs.aws.amazon.com/athena/latest/ug/ddl-sql-reference.html) of [SQL](https://en.wikipedia.org/wiki/SQL)|Athena is a fully-managed Query-as-a-Service system based on PrestoDB with attractive scalability and pricing but a few more limitations than Presto (most importantly, no support for user-defined functions).|
|[SQL++ (AsterixDB)](https://github.com/RumbleDB/iris-hep-benchmark-sqlpp)|[SQL++](https://asterixdb.apache.org/docs/0.9.6/sqlpp/manual.html)|[AsterixDB](https://asterixdb.apache.org/) is a Big Data platform specialized for semi-structured data. Its query language is thus designed to deal with nested data intuitively.|
|[UnROOT.jl](https://github.com/Moelf/ADLBenchmark.jl)|[Julia](https://julialang.org/)|Pure Julia implementation utilizing packages developed by [JuliaHEP](https://github.com/JuliaHEP/) as a demonstration of ease of use, flexibility, and peak performance at the same time for end-user analysis.|
|[Snowflake](https://github.com/DanGraur/iris-hep-benchmark-snowflake)|[Snowflake's](https://docs.snowflake.com/en/sql-reference-commands) dialect of [SQL](https://en.wikipedia.org/wiki/SQL)|Snowflake is a fully-managed Query-as-a-Service system that boasts high performance and scalability as a pure in-cloud database. Moreover, Snowflake adds support for the powerful [`VARIANT`](https://docs.snowflake.com/en/sql-reference/data-types-semistructured#variant) data type, specifically designed to efficiently store and process semi-structured data.|


Adding new benchmarks, data, or implementations
===============================================

* Additional benchmarks or public data files can be suggested as GitHub issues on this project to start a discussion within the [HSF Data Analysis Working Group](https://hepsoftwarefoundation.org/workinggroups/dataanalysis.html) community.
* Suggested modifications to the layout of this repository are also welcome as new GitHub issues.
* If you would like to add a repository with a new implementation of the benchmarks, go ahead and submit a pull request with the proposed changes.

--------------------------------------------------------------------------------
/motivation.md:
--------------------------------------------------------------------------------
Functionality motivations for benchmarks
=======================================================

| Benchmark description | Motivation |
|-----------------------|------------|
| Plot the ETmiss of all events. | Loop over events and get an event-level variable. |
| Plot the pT of all jets. | Loop over an array in each event. |
| Plot the pT of jets with \|η\| < 1. | Loop over an array that is filtered. |
| Plot the ETmiss of events that have at least two jets with pT > 40 GeV. | Loop over an array and aggregate the results to filter at the event level. |
| Plot the ETmiss of events that have an opposite-charge muon pair with an invariant mass between 60 and 120 GeV. | Loop on pairs of objects in one collection and do four-vector algebra. |
| For events with at least three jets, plot the pT of the trijet four-momentum that has the invariant mass closest to 172.5 GeV in each event and plot the maximum b-tagging discriminant value among the jets in this trijet. | Loop over combinations of objects in the same collection and extract a property of the combinations other than the key used to sort them. |
| Plot the scalar sum in each event of the pT of jets with pT > 30 GeV that are not within 0.4 in ΔR of any light lepton with pT > 10 GeV. | Loop over two different collections. |
| For events with at least three light leptons and a same-flavor opposite-charge light lepton pair, find such a pair that has the invariant mass closest to 91.2 GeV in each event and plot the transverse mass of the system consisting of the missing transverse momentum and the highest-pT light lepton not in this pair. | Perform a task for which the formulation in an imperative language is easy but the translation to a functional query language may be less clear or inefficient. |

--------------------------------------------------------------------------------
/reference.md:
--------------------------------------------------------------------------------
# Technical reference for the benchmarks

## Glossary

### b-tagging discriminant

A value assigned to a jet that is correlated with the probability that the jet originated from a [bottom quark](https://en.wikipedia.org/wiki/Bottom_quark). See [b-tagging](https://en.wikipedia.org/wiki/B-tagging).

NanoAOD branch: `Jet_btag`

### ΔR

Equal to sqrt(Δη² + Δφ²), where Δη and Δφ are the differences in η and φ between two four-momenta. The most common metric used for distance between four-momenta in hadron colliders.

### η

[Pseudorapidity](https://en.wikipedia.org/wiki/Pseudorapidity), a unitless geometric coordinate used in hadron colliders.

NanoAOD branches: `_eta`

### Event

A single bunch crossing. Each entry in a NanoAOD `TTree` corresponds to one event. See [Event (particle physics)](https://en.wikipedia.org/wiki/Event_\(particle_physics\)).

### Flavor

A species of elementary particle, including its [antiparticle](https://en.wikipedia.org/wiki/Antiparticle). For example, electrons and positrons are of the same flavor, but muons are another flavor. See [Flavour](https://en.wikipedia.org/wiki/Flavour_\(particle_physics\)).

### Four-momentum

A [four-vector](https://en.wikipedia.org/wiki/Four-vector) with units of momentum. See [Four-momentum](https://en.wikipedia.org/wiki/Four-momentum). Four-momentum in events is often specified by pT, η, φ, and mass, but addition of four-momenta is only straightforward in the Cartesian coordinates E, px, py, and pz. A nice summary of the relationships between these can be found [here](https://energyflow.network/docs/utils/).
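
As a minimal sketch of the conversion described above, the following Python converts (pT, η, φ, mass) to Cartesian components using the standard relations px = pT·cos(φ), py = pT·sin(φ), pz = pT·sinh(η), and E = sqrt(px² + py² + pz² + m²), adds the components, and recovers the invariant mass and pT of the sum. The function names (`to_cartesian`, `add_four_momenta`, `invariant_mass`, `transverse_momentum`) are illustrative assumptions and do not come from NanoAOD or any of the listed implementations.

```python
import math


def to_cartesian(pt, eta, phi, mass):
    """Convert (pT, eta, phi, mass) to (E, px, py, pz), all in GeV."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return (energy, px, py, pz)


def add_four_momenta(*vectors):
    """Component-wise sum of any number of (E, px, py, pz) tuples."""
    return tuple(sum(components) for components in zip(*vectors))


def invariant_mass(energy, px, py, pz):
    """sqrt(E^2 - |p|^2), clamped at zero to absorb rounding error."""
    m2 = energy * energy - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))


def transverse_momentum(energy, px, py, pz):
    """Magnitude of the momentum projected onto the xy-plane."""
    return math.hypot(px, py)


# Hypothetical example: two massless, back-to-back particles.
first = to_cartesian(45.6, 0.0, 0.0, 0.0)
second = to_cartesian(45.6, 0.0, math.pi, 0.0)
pair = add_four_momenta(first, second)
pair_mass = invariant_mass(*pair)
pair_pt = transverse_momentum(*pair)
```

The same three-step pattern (convert, sum, evaluate) applies to the muon pair, trijet, and lepton pair benchmarks; only the number of summed four-momenta changes.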

NanoAOD branches: pT (in GeV): `_pt`, η: `_eta`, φ (in radians): `_phi`, Mass (in GeV): `_mass`

### GeV

Gigaelectronvolt, a unit of energy. Also used as shorthand for GeV/c (a unit of momentum) and GeV/c² (a unit of mass). See [Electronvolt](https://en.wikipedia.org/wiki/Electronvolt). In NanoAOD files, all energies, momenta, and masses are in GeV.

### Invariant mass

Equal to sqrt(E² - p²), where E and p are the energy and the magnitude of the momentum, respectively, of a given four-momentum. See [Invariant mass](https://en.wikipedia.org/wiki/Invariant_mass).

### Jet

A cone of particles emitted by a collision. A jet is represented by a four-momentum, which is a sum over the constituent particles. See [Jet (particle physics)](https://en.wikipedia.org/wiki/Jet_\(particle_physics\)).

NanoAOD branches: `Jet_`

### Light lepton

An electron, positron, or muon. See [Lepton](https://en.wikipedia.org/wiki/Lepton). "Light" means less massive than a tau lepton. Neutrinos are also light leptons, but they are generally not detectable in collider experiments. Leptons are represented by their four-momentum and their charge.

NanoAOD branches: Electrons and positrons: `Electron_`, Muons: `Muon_`
(Positrons are considered to be electrons with a positive charge.)

### Missing transverse momentum

The negative vector sum of the transverse momenta of all objects in an event. The magnitude of this vector is written as ETmiss, also called missing ET (MET). See [Missing energy](https://en.wikipedia.org/wiki/Missing_energy).

NanoAOD branches: Magnitude (in GeV): `MET_pt`, φ (in radians): `MET_phi`

### NanoAOD

A ROOT-based file format used by the [CMS experiment](https://en.wikipedia.org/wiki/Compact_Muon_Solenoid). It consists of a [TTree](https://root.cern.ch/doc/master/classTTree.html) named `Events` with [TBranches](https://root.cern.ch/doc/master/classTBranch.html) corresponding to the properties of the events.

### φ

[Azimuthal angle](https://en.wikipedia.org/wiki/Spherical_coordinate_system), a spherical coordinate.

NanoAOD branches (in radians): `_phi`

### ROOT

A [data analysis software framework](https://root.cern.ch/) for high energy physics (HEP). Most HEP data is stored in [ROOT files](https://root.cern.ch/doc/master/classTFile.html), a format produced by the ROOT framework.

### Transverse mass

The transverse mass of a system of missing transverse momentum and a lepton is equal to sqrt(2 * pT(lepton) * ETmiss * (1 - cos(Δφ))), where pT(lepton) is the pT of the lepton and Δφ is the difference in φ between the lepton and the missing transverse momentum.

### Transverse momentum

The projection of the momentum onto the xy-plane. The magnitude is written as pT.

NanoAOD branches: Magnitude (in GeV): `_pt`, φ (in radians): `_phi`

### Trijet

A group of three jets. The four-momentum of a trijet is the sum of the three jet four-momenta.

--------------------------------------------------------------------------------
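
The ΔR and transverse mass formulas in the reference.md glossary can be sketched in Python as follows. This is an illustration under stated assumptions, not code from any of the listed implementations: the function names (`delta_phi`, `delta_r`, `transverse_mass`) are hypothetical, and the sketch assumes that Δφ should be wrapped into [-π, π) before it enters ΔR, which is the usual convention because φ is periodic.

```python
import math


def delta_phi(phi1, phi2):
    """Difference phi1 - phi2 in radians, wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi


def delta_r(eta1, phi1, eta2, phi2):
    """sqrt(d_eta^2 + d_phi^2) using the wrapped azimuthal difference."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


def transverse_mass(lepton_pt, lepton_phi, met_pt, met_phi):
    """sqrt(2 * pT(lepton) * ETmiss * (1 - cos(d_phi))), in GeV."""
    dphi = delta_phi(lepton_phi, met_phi)
    return math.sqrt(2.0 * lepton_pt * met_pt * (1.0 - math.cos(dphi)))


# Hypothetical values: a 40 GeV lepton back-to-back with 40 GeV of ETmiss.
mt_back_to_back = transverse_mass(40.0, 0.0, 40.0, math.pi)

# Two directions on either side of the phi = +/- pi boundary are close in
# azimuth, even though their raw phi values differ by 6 radians.
dr_across_boundary = delta_r(0.0, 3.0, 0.0, -3.0)
```

The wrapping does not change the transverse mass, since the cosine is periodic, but it matters for ΔR: without it, objects near φ = ±π would appear far apart and a cut such as "within 0.4 in ΔR" would miss them.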