├── agility ├── agile-tsdb.png ├── traditional-tsdb.png └── agility.md ├── elasticsearch ├── fst.png ├── bit-packing.jpg ├── es-sharding.png ├── three-steps.png ├── es-document-block.png ├── int-columnar-store.jpg ├── multi-column-index.jpg ├── column-vs-row-oriented-database.png └── elasticsearch.md ├── existing-solutions ├── rdbms-index.png ├── covering-index.png ├── opentsdb-scan.jpg ├── opentsdb-tsuid.jpg ├── opentsdb-compact.png ├── non-covering-index.png ├── vividcortext-query.jpg ├── mysql-clustered-index.png ├── opentsdb-tsuid-mapping.jpg ├── vivicortex-primary-key.jpg ├── partition-table-per-day.png ├── vividcortext-metric-name.jpg ├── partition-table-query-plan.png └── existing-solutions.md ├── time-series ├── tsdb-functionality.png └── time-series.md ├── SUMMARY.md ├── .gitignore ├── README.md ├── how-to └── how-to.md └── indexing └── indexing.md /agility/agile-tsdb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/agility/agile-tsdb.png -------------------------------------------------------------------------------- /elasticsearch/fst.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/elasticsearch/fst.png -------------------------------------------------------------------------------- /agility/traditional-tsdb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/agility/traditional-tsdb.png -------------------------------------------------------------------------------- /elasticsearch/bit-packing.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/elasticsearch/bit-packing.jpg -------------------------------------------------------------------------------- /elasticsearch/es-sharding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/elasticsearch/es-sharding.png -------------------------------------------------------------------------------- /elasticsearch/three-steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/elasticsearch/three-steps.png -------------------------------------------------------------------------------- /elasticsearch/es-document-block.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/elasticsearch/es-document-block.png -------------------------------------------------------------------------------- /existing-solutions/rdbms-index.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/rdbms-index.png -------------------------------------------------------------------------------- /time-series/tsdb-functionality.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/time-series/tsdb-functionality.png -------------------------------------------------------------------------------- /elasticsearch/int-columnar-store.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/elasticsearch/int-columnar-store.jpg -------------------------------------------------------------------------------- /elasticsearch/multi-column-index.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/elasticsearch/multi-column-index.jpg -------------------------------------------------------------------------------- /existing-solutions/covering-index.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/covering-index.png -------------------------------------------------------------------------------- /existing-solutions/opentsdb-scan.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/opentsdb-scan.jpg -------------------------------------------------------------------------------- /existing-solutions/opentsdb-tsuid.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/opentsdb-tsuid.jpg -------------------------------------------------------------------------------- /existing-solutions/opentsdb-compact.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/opentsdb-compact.png -------------------------------------------------------------------------------- /existing-solutions/non-covering-index.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/non-covering-index.png -------------------------------------------------------------------------------- /existing-solutions/vividcortext-query.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/vividcortext-query.jpg -------------------------------------------------------------------------------- /existing-solutions/mysql-clustered-index.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/mysql-clustered-index.png -------------------------------------------------------------------------------- /existing-solutions/opentsdb-tsuid-mapping.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/opentsdb-tsuid-mapping.jpg -------------------------------------------------------------------------------- /existing-solutions/vivicortex-primary-key.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/vivicortex-primary-key.jpg -------------------------------------------------------------------------------- /existing-solutions/partition-table-per-day.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/partition-table-per-day.png -------------------------------------------------------------------------------- 
/existing-solutions/vividcortext-metric-name.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/vividcortext-metric-name.jpg -------------------------------------------------------------------------------- /elasticsearch/column-vs-row-oriented-database.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/elasticsearch/column-vs-row-oriented-database.png -------------------------------------------------------------------------------- /existing-solutions/partition-table-query-plan.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taowen/tsdb-book/HEAD/existing-solutions/partition-table-query-plan.png -------------------------------------------------------------------------------- /SUMMARY.md: -------------------------------------------------------------------------------- 1 | # Summary 2 | 3 | * [Introduction](README.md) 4 | * [Time Series](time-series/time-series.md) 5 | * [Agility](agility/agility.md) 6 | * [Existing Solutions](existing-solutions/existing-solutions.md) 7 | * [Elasticsearch](elasticsearch/elasticsearch.md) 8 | * [How-to](how-to/how-to.md) 9 | * [Indexing](indexing/indexing.md) 10 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Node rules: 2 | ## Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files) 3 | .grunt 4 | 5 | ## Dependency directory 6 | ## Commenting this out is preferred by some people, see 7 | ## https://docs.npmjs.com/misc/faq#should-i-check-my-node_modules-folder-into-git 8 | node_modules 9 | 10 | # Book build output 11 | _book 12 | 13 | # eBook build output 14 | *.epub 15 | *.mobi 16 | *.pdf 17 | /.idea 18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | In Search of Agile Time Series Database 2 | ======= 3 | 4 | Time series data is everywhere. There are a lot of databases claim to be good time series database (TSDB). Which of them is good for your use case? This book is about my quest on searching for a "agile" time series database. 5 | 6 | There is another title for this book: "Why Elasticsearch is awesome for time series data from inside out" or "What traditional database has tried but still not quite nailed it". We will go through what is required for a time series database, what have been tried before, why Elasticsearch is so great, and how can be use it to make our job better. -------------------------------------------------------------------------------- /agility/agility.md: -------------------------------------------------------------------------------- 1 | # Agility 2 | What is agility? To understand that, we need to know how traditional TSDB works. 3 | 4 | ![](traditional-tsdb.png) 5 | 6 | 7 | What you report is what gets stored. What you stored, is what you can query. The schema is exactly same from the raw data to the chart on monitor. Also, the shema is limited to two dimension, key being the timestamp, value being a numeric. What if you need to change the chart to view? Well, you have to go back to the source, modify and wait for new data. What if you need to aggregate up the curves? 
It is either too slow or impossible to do. Normally you end up doing the aggregation yourself (using tools such as StatsD), then putting the aggregated data into the TSDB. 8 | 9 | This is what an agile TSDB looks like: 10 | 11 | ![](agile-tsdb.png) 12 | 13 | The input data and the stored data can be multi-dimensional. They do not need to match exactly: some transformation and aggregation can happen at the input phase. One input can be transformed in different ways, just like a materialized view in an RDBMS. At query time, fast aggregation is executed on the fly, giving us two-dimensional data back to render the charts on the monitor. This way, if we want to change how the data is viewed to gain different insights from different perspectives, we change the query or the input phase; we do not need to go all the way back to the origin. Think about re-deploying a collection script on thousands of machines: that is not the agile cycle we have in mind. (A query sketch at the end of this chapter makes this pattern concrete.) 14 | 15 | # Database Selection 16 | 17 | Existing solutions available on the market can be roughly divided into 4 categories: 18 | 19 | 1. Old school: RRD Tool, Graphite 20 | 1. Fast K/V: opentsdb, kairosDB, influxdb, blueflood 21 | 1. TSDB that can aggregate: citusdb, elasticsearch, druid 22 | 1. Future: VectorWise(Actian), Alenka, mapD, blazingdb, gpudb 23 | 24 | Old school TSDBs were invented by field ops people to solve immediate issues. They cannot aggregate at query time and cannot hold very long history without lossy roll-ups. They are still valid, practical solutions to practical problems. But hey, who builds a database from the filesystem up these days? 25 | 26 | Fast K/V solutions are backed by a solid distributed K/V database instead of relying on the local file system directly. Opentsdb builds on hbase; kairosDB and blueflood build on cassandra. Influxdb is more ambitious: it is built on local leveldb and does the networked, distributed computation part itself. They are all good at keeping long history, and some even have built-in materialized view support (fan-out). What is missing is on-the-fly aggregation. In general, they limit you to storing two-dimensional data in the database. 27 | 28 | A TSDB that can actually aggregate is the key to agility. Cloudflare is using citusdb as the backbone of their analytics (https://blog.cloudflare.com/scaling-out-postgresql-for-cloudflare-analytics-using-citusdb/). My choice is elasticsearch. Both share the same architecture: compact columnar on-disk storage, with a fast, distributed aggregation pipeline built in as a first-class citizen. Solutions that rely on an external K/V store, by contrast, run into data locality issues when aggregating. Others share the same view on this: http://blog.parsely.com/post/1928/cass/ , and not surprisingly we all use Elasticsearch https://www.elastic.co/blog/pythonic-analytics-with-elasticsearch 29 | 30 | The fourth category of database is the future. Databases leveraging the SIMD instructions of the CPU (from MMX and SSE to AVX), or the many cores of a GPU, can aggregate all of the data insanely fast. VectorWise was the pioneer in using SIMD and keeping most of the hot data in the CPU L2 cache. More and more databases are popping up in this area, especially to take advantage of GPUs. But there is no good open source solution ready for production yet. 
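To make the agile pattern concrete, here is a minimal sketch in SQL terms (the `login` table and its columns are hypothetical, used only for illustration): raw multi-dimensional records are stored as-is, and the two-dimensional curve for a chart is computed at query time, so changing the chart only means changing the query, not the collection script.

```
-- store the raw, multi-dimensional record unchanged
INSERT INTO login (timestamp, biz_id, plat, os, login_count)
VALUES ('2015-07-05 12:05:00', '42', 'wx', 'android', 3);

-- aggregate on the fly at query time; swap GROUP BY plat for GROUP BY os
-- to look at the same data from a different perspective
SELECT plat, SUM(login_count)
FROM login
WHERE timestamp BETWEEN '2015-07-05 00:00:00' AND '2015-07-06 00:00:00'
GROUP BY plat;
```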
-------------------------------------------------------------------------------- /time-series/time-series.md: -------------------------------------------------------------------------------- 1 | # Record Data Model 2 | 3 | Before we dive in, it is better to clarify what time series data is. Examples of time series data are stock prices in a stock market, the temperature of the weather, or the CPU usage of a host. Obviously, time series data should contain a field called timestamp. I would argue that potentially everything can have a timestamp attached to it. Time series is not just about the extra timestamp field of the data, but more about how we query it. When we think of something as a time series, the query we use always includes a time range, and the result of the query always contains a timestamp field. A complete data model for a time series should be: 4 | 5 | ``` 6 | [timestamp],[d1],[d2]...[dn],[v1],[v2]...[vn] 7 | # d1 ~ dn are dimensions, like ip, idc, country 8 | # v1 ~ vn are values, like cpu_usage, free_memory_bytes 9 | ``` 10 | Values should always be numeric; a host name should never be a value. Values are optional: it is perfectly valid to have a time series record without any value. For example, we can have a record saying that some user logged into the website at some specific time. The record has no value associated with it, but from a series of such records we can tell how many people logged in during the last minute. 11 | 12 | Dimensions are mostly categorical, such as the type of user or the user name. However, some numeric fields can be viewed as dimensions as well, such as the age of a user. There is no clear distinction between what is a dimension and what is a value. In many contexts, the combination of dimensions makes a record unique for that timestamp. For example, for a given cpu at a given time, there is only one cpu usage value. 13 | 14 | It is worth mentioning that in the context of many time series databases, the data model comes in a limited form: 15 | 16 | ``` 17 | [metric_name],[timestamp],[value] 18 | # metric_name is a string such as sys.loadavg.1m 19 | # value is a numeric 20 | ``` 21 | It is very common to encode many dimensions into the metric name if the database forces us to do so. 22 | 23 | # Characteristics 24 | 25 | A time series database is called a TSDB, and is separated from a normal database for very good reasons. It is a special kind of database to support operational intelligence and the Internet of Things (IoT). This kind of data can be characterized as: 26 | 27 | * extreme volume/velocity: at the magnitude of 1 billion records per day 28 | * low latency: operational intelligence requires very fast query times, unlike traditional BI, where report generation can tolerate hours 29 | * low value: the data should be stored and queried very cheaply, since most machine-generated operational data is not very interesting on its own 30 | 31 | Storing and querying a large volume of low-value data cheaply presents very unique technical challenges. Most of these challenges are unmet by existing solutions, and we are seeing a lot of novel solutions popping up recently. 32 | 33 | # Functionality 34 | 35 | Functionality-wise, a TSDB is not that different from a traditional relational database management system (RDBMS). We can build a TSDB upon a traditional RDBMS like mysql or postgresql. For a mysql database with 1 million rows, queries return within a second, and it holds that volume of data without a problem. 
What mordern TDSB can provide us is nearly the same capability (except sub query or join) provided by mysql, but with a 100x ~ 1000x size of data per day (100 million to 1 billion rows), and maintain linear scalability for infinite number of days of data given enough machines provided. 36 | 37 | ![](tsdb-functionality.png) 38 | 39 | The functionality of a TSDB can be listed as 4 types: 40 | 41 | * input interface: like INSERT INTO sql 42 | * raw data query: like SELECT xxx FROM xxx WHERE 43 | * realtime aggregation: like SELECT xxx FROM xxx GROUP BY xxx 44 | * materialized view: like INSERT INTO xxx SELECT xxx, or CREATE VIEW xxx 45 | 46 | Materialized view is to transform the shape of one table into another form, so that slow query can be faster (because less rows), or impossible query can be possible (like ip translated to province, then we can group by province). Materialized view was implemented in RDBMS as view (by CREATE VIEW statement), and the data can be auto-refreshed in the background (https://en.wikipedia.org/wiki/Materialized_view). 47 | 48 | The primary missing features in TSDB are: 49 | 50 | * no sub query: the realtime aggregation is very limited for performance reason. Sub query (SELECT xxx FROM (SELECT xxx FROM some_table)) is not a common feature in TSDB. 51 | * no join: join at query time is considered slow, so TSDB do not allow join in query time. One work around is to join at the input time (like in TRT-Storm phase). 52 | * no post aggregation: one common SQL feature not supported by TSDB is HAVING statement. HAVING will do the filtering after aggregation 53 | 54 | # Kafka/Storm/Spark 55 | 56 | There is no database that support full functionality. In real setting it is common to combine several technologies to form a working solution: 57 | 58 | * Kafka is used as the queue to time series database 59 | * Storm is used as a realtime pipeline to ingest the data, and do the transformation to several form. This provides the input interface and materlized view 60 | * Database is used to hold the data and support querys. All databases can support raw data querys. Some of the databases can support realtime aggregation 61 | * Spark can be used on top of database to provide scalable and complex aggregation or even realtime machine learning 62 | 63 | The total scope of a TSDB is very big. Here we do not discuss the usage of kafka/storm/spark. 64 | -------------------------------------------------------------------------------- /how-to/how-to.md: -------------------------------------------------------------------------------- 1 | Knowing all those great features provided by Elasticsearch. Now, we need to build a time series database using Elasticsearch. There are two thoughts: 2 | 3 | * Design a Elasticsearch mapping that maximizes the aggregation performance and minimize the storage size. 
4 | * Pre-compute the common aggregation, so that for most of query the aggregation is already done 5 | 6 | Mapping 7 | ------------- 8 | 9 | This is what the final mapping looks like: 10 | 11 | ``` 12 | { 13 | "login":{ 14 | "_source":{ 15 | "enabled":false 16 | }, 17 | "_all":{ 18 | "enabled":false 19 | }, 20 | "properties":{ 21 | "login_records":{ 22 | "properties":{ 23 | "login_os":{ 24 | "index":"not_analyzed", 25 | "fielddata":{ 26 | "format":"doc_values" 27 | }, 28 | "doc_values":true, 29 | "type":"string" 30 | }, 31 | "login_plat":{ 32 | "index":"not_analyzed", 33 | "fielddata":{ 34 | "format":"doc_values" 35 | }, 36 | "doc_values":true, 37 | "type":"string" 38 | }, 39 | "login_timestamp":{ 40 | "fielddata":{ 41 | "format":"doc_values" 42 | }, 43 | "format":"dateOptionalTime", 44 | "doc_values":true, 45 | "type":"date" 46 | }, 47 | "login_cc_set":{ 48 | "index":"not_analyzed", 49 | "fielddata":{ 50 | "format":"doc_values" 51 | }, 52 | "doc_values":true, 53 | "type":"string" 54 | }, 55 | "login_count":{ 56 | "index":"no", 57 | "fielddata":{ 58 | "format":"doc_values" 59 | }, 60 | "doc_values":true, 61 | "type":"integer" 62 | }, 63 | "login_biz_id":{ 64 | "index":"not_analyzed", 65 | "fielddata":{ 66 | "format":"doc_values" 67 | }, 68 | "doc_values":true, 69 | "type":"string" 70 | } 71 | }, 72 | "type":"nested" 73 | }, 74 | "login_min_timestamp":{ 75 | "fielddata":{ 76 | "format":"disabled" 77 | }, 78 | "format":"dateOptionalTime", 79 | "doc_values":true, 80 | "type":"date" 81 | }, 82 | "login_max_timestamp":{ 83 | "fielddata":{ 84 | "format":"disabled" 85 | }, 86 | "format":"dateOptionalTime", 87 | "doc_values":true, 88 | "type":"date" 89 | } 90 | } 91 | } 92 | } 93 | ``` 94 | 95 | The fields are: 96 | 97 | ``` 98 | timestamp, biz_id, plat, os, cc_set, count 99 | ``` 100 | 101 | timestamp, biz_id, plat, os, cc_set are the dimensional fields 102 | 103 | count is the measurement field or value field. 104 | 105 | Here is a explanation on why the default mapping looks like this: 106 | 107 | * _source is disabled: avoid duplicated storage, save disk space 108 | * _all is disabled: no query need _all 109 | * index not analyzed: we do not need full text search. not analyzed means the filter need to be speeded up by index 110 | * fielddata docvalues: the column data is stored on disk in docvalues format. the default format is in memory cache, which is consumes too much memory. docvalues can be preceived as something like parquet columnar file format. 111 | * login_records as nested documents: one elasticsearch doc contains several nested document. Each nested document is one record (a.k.a data point). nested document is physically stored together to compact the storage size. 112 | min/max timestamp: used to locate the root level document without un-nest the nested document 113 | * login_xxx prefix: all fields is prefixed by the type name (login). In Elasticsearch, mapping within same index share same lucece segment file. If the field of different mapping has same name but different type or other properties, error will happen. 114 | 115 | Clustered Fields 116 | ----------------- 117 | 118 | One optimization can be used on the default mapping is to "pull up" some of the fields from nested document to parent document. For example we can pull up the biz_id field, so that records for the same biz_id of some time range will be combined into one parent document, but records of different biz_id will always be separated. 
119 | 120 | ``` 121 | { 122 | "login":{ 123 | "_source":{ 124 | "enabled":false 125 | }, 126 | "_all":{ 127 | "enabled":false 128 | }, 129 | "properties":{ 130 | "login_records":{ 131 | "properties":{ 132 | "login_os":{ 133 | "index":"not_analyzed", 134 | "fielddata":{ 135 | "format":"doc_values" 136 | }, 137 | "doc_values":true, 138 | "type":"string" 139 | }, 140 | "login_plat":{ 141 | "index":"not_analyzed", 142 | "fielddata":{ 143 | "format":"doc_values" 144 | }, 145 | "doc_values":true, 146 | "type":"string" 147 | }, 148 | "login_timestamp":{ 149 | "fielddata":{ 150 | "format":"doc_values" 151 | }, 152 | "format":"dateOptionalTime", 153 | "doc_values":true, 154 | "type":"date" 155 | }, 156 | "login_cc_set":{ 157 | "index":"not_analyzed", 158 | "fielddata":{ 159 | "format":"doc_values" 160 | }, 161 | "doc_values":true, 162 | "type":"string" 163 | }, 164 | "login_count":{ 165 | "index":"no", 166 | "fielddata":{ 167 | "format":"doc_values" 168 | }, 169 | "doc_values":true, 170 | "type":"integer" 171 | } 172 | }, 173 | "type":"nested" 174 | }, 175 | "login_biz_id":{ 176 | "index":"not_analyzed", 177 | "fielddata":{ 178 | "format":"doc_values" 179 | }, 180 | "doc_values":true, 181 | "type":"string" 182 | } 183 | "login_min_timestamp":{ 184 | "fielddata":{ 185 | "format":"disabled" 186 | }, 187 | "format":"dateOptionalTime", 188 | "doc_values":true, 189 | "type":"date" 190 | }, 191 | "login_max_timestamp":{ 192 | "fielddata":{ 193 | "format":"disabled" 194 | }, 195 | "format":"dateOptionalTime", 196 | "doc_values":true, 197 | "type":"date" 198 | } 199 | } 200 | } 201 | } 202 | ``` 203 | 204 | Pre-computation 205 | ------------------------- 206 | 207 | We create one index for one day data. The indices will looks like: 208 | 209 | * logs-2015-07-05 210 | * logs-2015-07-06 211 | * logs-2015-07-07 212 | 213 | We can pre-compute the data to a separate indices: 214 | 215 | * logs-precomputed-2015-07-05 216 | * logs-precomputed-2015-07-06 217 | * logs-precomputed-2015-07-07 218 | 219 | In above mapping, there are four dimensions: 220 | 221 | ``` 222 | biz_id, cc_set, plat, os 223 | ``` 224 | 225 | In the pre-computed data, we can omit the cc_set, plat, os dimensions, so that query the total login for one biz_id can be faster. 
226 | 227 | ``` 228 | { 229 | "login_precomputed_biz_id":{ 230 | "_source":{ 231 | "enabled":false 232 | }, 233 | "_all":{ 234 | "enabled":false 235 | }, 236 | "properties":{ 237 | "login_precomputed_biz_id_biz_id":{ 238 | "index":"not_analyzed", 239 | "fielddata":{ 240 | "format":"doc_values" 241 | }, 242 | "doc_values":true, 243 | "type":"string" 244 | }, 245 | "login_precomputed_biz_id_precomputed_max_of_count":{ 246 | "index":"no", 247 | "fielddata":{ 248 | "format":"doc_values" 249 | }, 250 | "doc_values":true, 251 | "type":"integer" 252 | }, 253 | "login_precomputed_biz_id_precomputed_sum_of_count":{ 254 | "index":"no", 255 | "fielddata":{ 256 | "format":"doc_values" 257 | }, 258 | "doc_values":true, 259 | "type":"integer" 260 | }, 261 | "login_precomputed_biz_id_precomputed_count":{ 262 | "index":"no", 263 | "fielddata":{ 264 | "format":"doc_values" 265 | }, 266 | "doc_values":true, 267 | "type":"integer" 268 | }, 269 | "login_precomputed_biz_id_precomputed_min_of_count":{ 270 | "index":"no", 271 | "fielddata":{ 272 | "format":"doc_values" 273 | }, 274 | "doc_values":true, 275 | "type":"integer" 276 | }, 277 | "login_precomputed_biz_id_timestamp":{ 278 | "fielddata":{ 279 | "format":"doc_values" 280 | }, 281 | "format":"dateOptionalTime", 282 | "doc_values":true, 283 | "type":"date" 284 | } 285 | } 286 | } 287 | } 288 | ``` 289 | 290 | Indexing & Querying 291 | -------------------- 292 | 293 | Using this format, indexing and querying will not be straight forward. It is certainly unlike 294 | 295 | ``` 296 | INSERT INTO login(biz_id, cc_set, plat, os, cunt) VALUES(...) 297 | ``` 298 | 299 | There should be two service wrap Elasticsearch to provide more usable index/query api. 300 | 301 | 302 | -------------------------------------------------------------------------------- /indexing/indexing.md: -------------------------------------------------------------------------------- 1 | 2 | # Indexing 3 | 4 | There are many things can be improved in the indexing stage. A regular setup looks like this: 5 | 6 | ``` 7 | log file => local log harvester => remote log parser => kafka 8 | kafka => log indexer => elasticsearch indexing node => elasticsearch shards => lucene docvalues 9 | ``` 10 | 11 | It is a terribly long process. It might be fragile, but that is not the main issue. We can always use excessive monitoring to fix any broken process. The main issue I can not bear with is the huge amount of data copying that can be saved. The raw indexing performance should be way more efficient if we can do it right. A lot of hardware and bandwidth cost can be saved. But how? 12 | 13 | If we look at the final destination of the pipeline, it is the lucene doc-values setting in the file system. The file format is very compact, if you have a column of long values, it is just a bunch of numbers one by one, packed together. Essentially it is a file representation of ```long[]```. So the indexing process is all about filling a bunch of ```long[]```, each column will have a separate array to fill. For example: 14 | 15 | ``` 16 | String[] symbolColumn = new String[1024 * 1024]; 17 | long[] marketCapColumn = new long[1024 * 1024]; 18 | long[] lastSaleColumn = new long[1024 * 1024]; 19 | symbolColumn[0] = 'AAPL'; 20 | marketColumn[0] = 52269; 21 | lastSaleColumn[0] = 94; 22 | symbolColumn[1] = 'GOOG'; 23 | marketColumn[1] = 43234; 24 | lastSaleColumn[1] = 66; 25 | ``` 26 | When we ship log, it is very natural to accumulate a block of rows in memory and ship together. 
If we can organize the memory block in column-stride fashion, then the bytes can be directly dumped into the final destination with maximum speed. 27 | 28 | ``` 29 | long[] output = new long[1024 * 1024]; 30 | for (int i=0; i``` as each row, and ```List>``` as a bunch of rows. This is not the most efficent way of dealing with time series data or OLAP. If we bring back strong schema, and build a new set of tooling leverage the fact we can use ```long[]``` instead of ```Map Local log harvester => remote log parser 40 | 41 | ### Packet Encoding 42 | 43 | We collect logs from where data is generated. At those places, CPU cycles are much precious, and should be saved to better serve our customer. So it is general best practices to do minimum parsing and just ship the data to remote side for further processing. Some parsing is necessary as we have to send the data in small chunks and shuffle them to different remote servers. If the chunk was not splitted in the right boundary, then the remote server can not handle the data properly. The most commonly used boundary is "\n" to separate log lines. 44 | 45 | Not all log files are not structured. Some log file such as the binlog of mysql server, it is very compact and structured. "\n" is just a less-sophisticated form of packet encoding to convey the meaning of "event". There are much formal ways to encode a "event" into a packet. If the data source can generate stream of events in easy to parse packet form, the performance can be improved. For example: 46 | 47 | ``` 48 | [packe_size][... remaining bytes ...][packet_size][... remaining bytes ...] 49 | ``` 50 | It is very common to have a header containing the size of packet. So that it is easy to chop out the bytes from the stream without needing to understand the whole structure. "\n" is not the best solution here. 51 | 52 | ### Inter-process Communication 53 | 54 | ``` 55 | Data source =IPC=> Harvester =RPC=> Remote 56 | ``` 57 | Log file is a means of IPC between data source process and harvester process. Use file to pass events from one process to another might not be very efficient, but modern Linux system optimized the pattern efficient enough. 58 | We can implement the IPC side a unix domain socket. Then the harvester can be a TCP proxy to relay events from data source to remote. If one data source correspond to only one remote, then the process can be as fast as direct copy. 59 | 60 | ### Shuffling 61 | 62 | It is common requirement to shuffle the data in the process. For example, in the log file 63 | ``` 64 | LOGIN,xxxx 65 | LOGOUT,xxx 66 | LOGIN,xxx 67 | LOGIN,xxx 68 | ``` 69 | The data of login and logout might need to be send to different kafka topic. So the harvester or the remote parser will need to shuffle the data from same source to different destination. It is better to generate several log files from the upstream. The cost of picking and shuffling is just for programming convenience. After all, the events are normally generated from different site of the code, why not just output to different log files? 70 | 71 | ### Status 72 | 73 | Current agent to collect data for Elasticsearch is assuming the data is unstructured log, and trying to put as many features in it. The result is slow agent with too many CPU used when the volume is high. We need a new agent for structured and typed data to handle vast amount of data for time series and IOT. 
The agent can be 74 | 75 | * a unix domain socket server listen at 127.0.0.1 76 | * binary compact packet encoding, without need to split at "\n" 77 | * allow upstream to specify the destination in the packet header 78 | * or let upstream to specify the destination in the TCP connection level, so the harvester agent can just forward the stream to remote server 79 | 80 | ## local log harvester => remote log parser => kafka 81 | 82 | ### Packet Parsing 83 | 84 | Regex parsing is slow. JSON parsing is faster. Binary encoding such as msgpack or avro is much much better. If the packet is already in json or even avro, the log parser is optional then. We can store input bytes directly in kafka. 85 | 86 | ### Shuffling 87 | 88 | The cost of remote log parser is not just parsing some string according some complex regex. It also introduce a lot of copying. Think about: 89 | 90 | ``` 91 | data source machines1 => log parser machine 92 | data source machines2 => log parser machine 93 | ... 94 | data source machines1000 => log parser machine 95 | ``` 96 | 97 | It is easily to overwhelm the central parsing machine. So we distribute the load 98 | ``` 99 | data source machines1 => log parser machine1 100 | data source machines2 => log parser machine1 101 | ... 102 | data source machines1000 => log parser machine10 103 | ``` 104 | But the backend kafka server is also distributed. 105 | ``` 106 | data source[K] => log parser[L] => kafka[M] 107 | ``` 108 | The data will copy from data source cluster to log parser cluster and then to kafka cluster. It is very hard to co-locate the log parser and kafka in the same machine, as the whole point of log parsing could be shuffle different event to different topic in kafka. Then we are looking at double the bandwidth cost compared to kafka only solution. 109 | 110 | ### Small Packet 111 | 112 | If the event is tiny, store it as a document in kafka will be costly. It is common to batch a bunch of events into a larger packet and store as one in kafka. In this batching process, we can reogranized the data from row oriented to column oridented. 113 | 114 | ### Status 115 | 116 | It is best practice to have a central logstash cluster setup in the middle to parse the logs. It has huge cost in terms of parsing and bandwidth. If we have structured and typed data already, we can skip this process altogether, just store the raw bytes in the right kafka topic, from the beginning the data stream. The reason we do not normally write kafka directly, but rather use a log based parsing solution is: 117 | 118 | * we do not want to slow down the application server 119 | * file is reliable, in case of network partition, we can hold up 120 | 121 | This can be solved by a better local agent discussed above. It can be 122 | 123 | * async 124 | * use batching internally 125 | * memory mapped file based queueing 126 | 127 | In case of failure, the memory mapped file can provide the safety we need. Essentially, we implement some kind of write-ahead logging using local file system, and store the data async to remote kafka cluster. 128 | 129 | ## Log file => Kafka 130 | 131 | Kafka is just another form of log file. We use a local agent implement write ahead logging to support reliable transfer stream of events from local machine to central kafka. It should be reliable and fast this way. 132 | 133 | But why we need to copy log file from one form to another? Well, kafka is in the central server and have well defined topics to consume from. 
The benefit of having a publicly accessible event stream outweighs the cost of log shipping. 134 | 135 | We just need a better agent for structured data. 136 | 137 | ## kafka => log indexer => elasticsearch nodes 138 | 139 | ### JSON 140 | 141 | Elasticsearch only speaks JSON. Any format must be converted to JSON before it can be indexed. If we want to be fast, we need to modify elasticsearch to let it speak another language. 142 | 143 | ### Data shuffling 144 | 145 | If the log indexer is a node client, then the data is just copied from kafka to the indexer, then from the indexer to the es nodes. If the log indexer is not a node client, then the data will be copied three times. 146 | 147 | ``` 148 | kafka => log indexer => elasticsearch indexing node => elasticsearch shard WAL => lucene 149 | ``` 150 | 151 | Between kafka and the final lucene file there are too many stages. If the data is structured and typed, then what is the difference between kafka and the WAL of an elasticsearch shard? 152 | 153 | ### Status 154 | 155 | There is currently no way to avoid copying data when using both kafka and Elasticsearch. It is not possible to co-locate a kafka partition and an elasticsearch shard on the same node. In theory we could use a kafka partition on the same machine as the WAL of the elasticsearch shard. 156 | 157 | 158 | ## elasticsearch node => lucene index writer => lucene doc-values 159 | 160 | The lucene index writer is also row oriented. The same document is parsed and represented by different objects on its way from elasticsearch to lucene. Indexing a large volume of data will create a lot of GC pressure. 161 | Also, the doc-values are processed as ```long[]``` in the codec, but the interface to the codec is ```List```, which incurs a lot of boxing cost. 162 | 163 | ## Summary 164 | 165 | ``` 166 | Log => Kafka => Lucene Doc-values 167 | ``` 168 | 169 | kafka, elasticsearch and lucene are three things produced by different parties, which leaves us with a great cost to translate data from one form to another. In theory, it should be possible to have a bunch of ```long[]``` land in kafka in a binary compact form and be stored directly in the elasticsearch index as lucene doc-values. We need: 170 | 171 | * an agent to provide a reliable async facade to kafka 172 | * an agent to batch up events in blocks, re-organizing the data from row to column 173 | * to modify kafka to serve as the elasticsearch WAL 174 | * to modify elasticsearch to use kafka as its WAL 175 | * to modify lucene to provide a fast lane for java primitive values and to index documents in a column-oriented fashion 176 | 177 | If we truly believe Elasticsearch can be a great time-series database, we need better support for the "INSERT" statement. 178 | 179 | 180 | -------------------------------------------------------------------------------- /existing-solutions/existing-solutions.md: -------------------------------------------------------------------------------- 1 | There are a lot of existing solutions. They all have some good internal design that makes them particularly good at some aspect, by pushing the underlying storage to its limit. Learning from this existing experience is necessary to build a better solution. The following databases all share the same record data model: 2 | 3 | ``` 4 | [metric_name],[timestamp],[value] 5 | ``` 6 | 7 | All of them can handle this kind of data very well and can return the records within a time range very fast. 8 | 9 | # Opentsdb 10 | 11 | Opentsdb is a good representative of a fast K/V based solution. 
It makes very good use of underlying HBase features to maximize the performance and minimized the cost to store large volume of data. 12 | 13 | ## TSUID 14 | 15 | http://opentsdb.net/docs/build/html/user_guide/uids.html 16 | 17 | Opentsdb support metric name, tags. But metric name and tags are not stored in each data point to save space. Instead, each metric name + tags combination will be assigned with a TSUID (time series unique id), then each data point only need to store TSUID with it. When querying the data, first translate the metric name and tags to TSUID, then load the data points using TSUID. 18 | 19 | ![](opentsdb-tsuid-mapping.jpg) 20 | 21 | This is the bi-directional mapping table for TSUID and raw metric name + tags 22 | 23 | ![](opentsdb-tsuid.jpg) 24 | 25 | Metric name + tags will be combined to a long TSUID. take the above example 26 | 27 | ``` 28 | proc.loadavg.1m host=web42 pool=static 29 | metric_name tagk=tagv tagk=tagv .... 30 | 052 001 028 047 001 31 | 052001028047001 is the final TSUID used 32 | ``` 33 | 34 | By default, UIDs are encoded on 3 bytes in storage, giving a maximum unique ID of 16,777,215 for each UID type. Given tag value is also encoded as TSUID, when the tag value contains something like QQ number, very quickly we can run out of 16 million limit. To change the 3 bytes limit to 4 bytes or more, requires recompile Opentsdb itself, and existing data will not be compatible. 35 | 36 | Scanning all data with same metric name but different tags is possible. TSUID is part of the hbase row key, so scan the sequential rows can load metric at one timestamp of all tags. Take the above example, search rows start with 052 can give us all data of metric "proc.loadavg.1m" 37 | 38 | The key idea is using table lookup (or dictionary encoding) to compress low cardinal repeated values (the dimensions of time series data). It is a common technique to compress string data in analytical database. Opentsdb TSUID is just a sepcial purpose implementation. 39 | 40 | ## HBase Scan 41 | 42 | HBase can scan a large number of records in very short time. The scan limited to scan along the sort order defined by row key. Defining the right row key to support the query, so that scan can be most optimal is the key to fast HBase usage. Essentially, Rowkey in HBase is the index in database terms, and the only index. Not only for fast single record retrival, but also for large range scan. 43 | 44 | This is the rowkey for opentsdb 45 | 46 | ![](opentsdb-scan.jpg) 47 | 48 | Rowkey is in the format: metric_name+timestamp then 8 optional tagk,tagv pair 49 | 50 | So query "proc.loadavg.1m" between 12:05 and 13:00 is a very fast range scan to batch load all the data. As the data is first sorted by metric name and then sorted by timestamp. Data of other metric will not affect the query performance. Data at yesterday will not affect the query performance. 51 | 52 | But if the query is about "proc.loadavg.1m{host=web1}" then we have to scan all "proc.loadavg.1m" within the time range, then filter out host=web1 in memory. If there are 1 million hosts per minute, then we need to load 1 millions records for 1 minute, but only keep 1 record, which is very slow. So a common techique is to encode the tagk/tagv in the metric itself. Then we will have metric like "proc.loadavg.1m.host.web1". 53 | 54 | ## Compact 55 | 56 | The idea of compact to store 1 minute even 1 hour data per row, instead of 1 second data per row. So we can reduce the number of rows to speed up the query performance. 
But to keep inserts fast, we are not going to rewrite the row for every insert. Instead, inserts happen at the 1 second level, and periodically the 1 second data is "compacted" into 1 minute or 1 hour rows. 57 | 58 | ![](opentsdb-compact.png) 59 | 60 | The per-row cost is too heavy (for example, the dimensions need to be stored again and again), so it is better to pay the cost once and get as much as possible out of a single row. Note that even the per-column cost is too high: opentsdb does not leverage the multi-column feature, instead everything is packed into the same column. 61 | 62 | ## Summary of Opentsdb 63 | 64 | Opentsdb is missing the following features: 65 | 66 | * Opentsdb only supports one value per data point 67 | * Opentsdb forces one timestamp to have only one dimension combination (for example, if proc.loadavg.1m already has a value for 12:05, inserting another one will yield an error when reading) 68 | * Opentsdb can not filter out values based on tags (dimensions) quickly 69 | * Aggregation is done outside of HBase, so network traffic becomes a problem when aggregating a lot of data 70 | * Aggregation is done on a single node, not leveraging map reduce 71 | 72 | # PostgreSQL 73 | 74 | PostgreSQL is a very standard RDBMS. A lot of people are using it to store time series data. 75 | 76 | ## B-tree index is useless for time series 77 | 78 | An RDBMS has a heap-like row store. You can load one row using its row id (primary key) from the row store very fast. However, if the query does not use the primary key, the database will scan the whole table to find the rows we want. To avoid the full table scan, a b-tree based secondary index is introduced: the index is looked up to find out which rows are required. The process can be illustrated as: 79 | 80 | ![](rdbms-index.png) 81 | 82 | The b-tree index is looked up to find the row ids, then the row ids are used to load the actual data from the row store. Using a row id to load data is "random access" of the on-disk data. A b-tree index is optimized to load a small number of rows, so that the "random access" penalty is negligible compared to the benefit of filtering out exactly the rows needed. However, this optimization does not work for time series data. After filtering by the b-tree index there are many thousands of rows to fetch from the row store, so the "random access" cost becomes very high. A lot of the time, the database query planner will decide to fall back to a full table scan instead of using the index in this scenario. 83 | 84 | ## Covering index is also useless 85 | 86 | There is one feature of rdbms indexes worth mentioning: the covering index (in mysql terms), also known as index only scan (in postgresql terms). Without a covering index, the query hits the b-tree index, then loads the rows from the row store in two steps: 87 | 88 | ![](non-covering-index.png) 89 | 90 | If the columns used by the filters of the query are all "covered" by the index, and the selected columns of the query are all "covered" by the index, then the query can be served from the index only; there is no need to go back to the row store to fetch the original rows to compute the query. 91 | 92 | ![](covering-index.png) 93 | 94 | A covering index is useful to optimize one or two frequent queries. To optimize a query over columns A,B,C, the index must contain the A,B,C columns in the right order. Separate indexes on A, B and C can not serve as a covering index. We can not rely on covering indexes to give us good performance on arbitrary queries. 
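As a sketch (the metrics table and index here are made up for illustration, MySQL-flavoured syntax), a covering index for the classic metric query would look like this; the moment a query touches a column outside the index, the optimization is lost:

```
-- composite index that "covers" both the filter columns and the selected columns
CREATE INDEX idx_metric_ts_value ON metrics (metric_name, ts, value);

-- can be answered from the index alone (covering index / index only scan),
-- because metric_name, ts and value are all present in the index
SELECT ts, value
FROM metrics
WHERE metric_name = 'proc.loadavg.1m'
  AND ts BETWEEN '2015-07-05 12:05:00' AND '2015-07-05 13:00:00';

-- selecting any extra column (e.g. host) forces lookups back into the row store
```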
95 | 96 | ## Partition as coarse grained timestamp based index 97 | 98 | Another commonly used technique is partitioning. The simplest way is just to create a new table to store data for every day. 99 | 100 | ![](partition-table-per-day.png) 101 | 102 | When query a partitioned table, the query planner can know the time range filter will cover how many partitions. Then when doing full table scan the filter out the rows, only the partitions actually covered by the filters will be touched. Then the result for each partition is collected then merged. This is a query plan generated by PostgreSQL: 103 | 104 | ![](partition-table-query-plan.png) 105 | 106 | Essentially, table partitioned by the timestamp is using the timestamp as a coarse grained index. It is very useful to maintain a constant speed for time series data query, even the historic data keeps growing. 107 | 108 | ## Summary of Postgresql 109 | 110 | Using timestamp based partition table is the most common optimization, it is not enough 111 | 112 | * When there are many metric, not possible to create table for every metric. Storing metric name as column will bloat up the partion table, finding values for particular metric will scan over other metric 113 | * Aggregation is done in single node, no built-in distributed computation support 114 | 115 | In 9.5, the BRIN index might be used to speed up the metric lookup: https://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.5#BRIN_Indexes 116 | 117 | # Mongodb 118 | 119 | There are also a lot of people trying to fit their hammers into any kind of holes, mongodb is no exception. Mongodb is architecturally very similar to tranditional RDBMS, but the table per day partitioning optimization does not apply to Mongodb, as mongodb has no native partitioning support, other than you roll your own at the application level. 120 | 121 | But Mongodb has two great feature: 122 | 123 | * the data model is very flexible. One document can hold many sub documents 124 | * if no size increase, document can be updated inplace very fast 125 | 126 | Leveraging these two features of the database engine, one can design the schema like this 127 | 128 | ``` 129 | { 130 | timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"), 131 | type: “memory_used”, 132 | values: { 133 | 0: 999999, 134 | … 135 | 37: 1000000, 136 | 38: 1500000, 137 | … 138 | 59: 2000000 139 | } 140 | } 141 | ``` 142 | The idea is for every new data point, we update the document for the minute, instead of inserting a new document until the next minute. Same can be applied for one hour or one day. This way, when querying the data for the minute/hour/day, we just need to load one big document which is fast. 143 | 144 | The problem of storing like this is: aggregation will be very hard. Mongodb aggregation framework does not allow aggregate data of document properties. 145 | 146 | # Mysql 147 | 148 | VividCortext is a mysql performance monitoring company lead by mysql performance expert Baron Schwartz. You can bet he knows how to use mysql well. Not surprisingly, everyone just fits their hammers into any holes, VividCortex is using mysql to store time series data for their monitoring business. 
149 | 150 | Baron Schewartz is kind enough to share their secret on how to push mysql to its limit as fast TSDB: 151 | 152 | * video: https://www.youtube.com/watch?v=ldyKJL_IVl8 153 | * slides: http://www.slideshare.net/vividcortex/vividcortex-building-a-timeseries-database-in-mysql 154 | * video (improved): https://www.youtube.com/watch?v=kTD1NuQXr4k&t=5h49m27s 155 | * slides (improved): http://www.slideshare.net/vividcortex/scaling-vividortexs-big-data-systems-on-mysql 156 | * another related slides: http://www.slideshare.net/vividcortex/catena-a-highperformance-time-series-data 157 | 158 | ## Clustered Index 159 | 160 | Mysql Innodb storage engine has a feature called "clustered index". Normally the primary key of the table is the clustered index. The clustered index key or primary key is stored physcially on disk in sorted order. So that scan a sequence of primary key is very fast, just like hbase where rows are sorted by row key. 161 | 162 | ![](mysql-clustered-index.png) 163 | 164 | Using clustered index the row store of mysql is also a b-tree index. For in-depth details on how clustered index works: https://www.simple-talk.com/sql/learn-sql-server/effective-clustered-indexes/ 165 | 166 | ![](vivicortex-primary-key.jpg) 167 | 168 | The primary key they use is "host.metric,timestamp", so that accessing a range of data points for a host/metric combination is optimized. The primary key design is exactly the same as opentsdb hbase row key design. 169 | 170 | ## Sparse Metric 171 | 172 | A common problem when using metric name to represent every dimension is there will be a lot of metrics. Some user of graphite or opentsdb used Elasticsearch to build a database of metric name to allow search. VividCortex has similar problem. They collect metric or every database queries. So metric for a specific type of queries might not present in certain time ranges. If they need to answer this: 173 | 174 | ![](vividcortext-query.jpg) 175 | 176 | ```Rank all metrics matching pattern X from B to C, limit N``` 177 | 178 | With the primary key design mentioned above, there is no easy way to figure out what are the metrics present in time range B to C? 179 | 180 | ![](vividcortext-metric-name.jpg) 181 | 182 | This problem is sovled by building a secondary index for metric name using external redis database. They ned to first query redis to see how many metrics present in [B, C] time range, then use the metric name to query mysql for the actual data. Just like other people using Elasticsearch for similar purpose, which is UGLY... 183 | 184 | ## Compact 185 | 186 | Just like everybody else (opentsdb, mongodb), compact many data points into a single row is a great way to boost performance. VividCortex compact many data points in a single row, at cost of losing the ability to query using raw SQL. Developers at vividcortex have to internal service to query the TSDB, as the service code understands how the compaction works. Without compaction, there will be too many rows to be stored and queried efficiently in mysql. 187 | 188 | # Summary 189 | 190 | Traditional RDBMS and Fast K/V database can make great TSDB. If we can know all the query we will make, and with clever design, everything can fit into the metric name centric data model. But if you are wondering how to query a large number of metrics in one time and select the top N or aggregate them up, you would better to review if you can choose a better metric name design. 
191 | If you can not know the querys ahead of time, or there is no good way to normalize or de-normalize so that metric name centric model can fit, then let's explore other options. -------------------------------------------------------------------------------- /elasticsearch/elasticsearch.md: -------------------------------------------------------------------------------- 1 | In this chapter, we explore the features provided by Elasticsearch that can be used to build a agile time series database. We can see how good experience from existing solutions can be applied to Elasticsearch, also how to mitigate the problems of existing solutions. 2 | 3 | # Architecture 4 | 5 | To answer query like 6 | 7 | ``` 8 | SELECT AVG(age) FROM login WHERE timestamp > xxx AND timestamp < yyy AND site = 'zzz' 9 | ``` 10 | 11 | The query need to go through three things (in RDBMS terms): 12 | 13 | ![](three-steps.png) 14 | 15 | In Mysql, the three steps are: 16 | 17 | 1. look up the filters in its b-tree index, and found a list of row ids 18 | 2. use the row ids to retrieve rows from the primary row store 19 | 3. compute the group by and projections in memory from the loaded rows 20 | 21 | Using Elasticsearch as a database is very similar: 22 | 23 | 1. look up the filters in its inverted index, found a list of document ids 24 | 2. use the document ids to retrieve the values of columns mentioned in projections 25 | 3. compute the group by and projections in memory from the loaded documents 26 | 27 | Conceptually despite Elasticsearch is a full-text index not a relational database, the process are nearly identical. The differences are: 28 | 29 | * Elasticsearch is using inverted index instead of b-tree index to process filters 30 | * Mysql row is called document in Elaticsearch 31 | * Mysql has only one way to load the whole row from row store using row ids, Elaticsearch has 4 different ways to load the values using document ids 32 | 33 | Elaticsearch is faster and more scalable than mysql for time series data because: 34 | 35 | * Inverted index is more efficient to filter out large number of documents than b-tree index 36 | * Elasticsearch has a way to load large number of values using the document ids efficiently, mysql does not which makes its b-tree index useless in this context 37 | * Elasticsearch has a fast and sharded in memory computation layer to scale out, and there are optimization to bring in SIMD to optimize the speed further 38 | 39 | Let's go through them one by one, I am sure you will love Elasticsearch. 40 | 41 | # Finite State Transducers (FST) 42 | 43 | There is one article that explains inverted index very well: http://blog.parsely.com/post/1691/lucene/ 44 | 45 | The example from the article illustrated what is inverted index very well: 46 | 47 | ``` 48 | doc1={"tag": "big data"} 49 | doc2={"tag": "big data"} 50 | doc3={"tag": "small data"} 51 | ``` 52 | 53 | The inverted index for the documents looks like: 54 | 55 | ``` 56 | big=[doc1,doc2] 57 | data=[doc1,doc2,doc3] 58 | small=[doc3] 59 | ``` 60 | b-tree index is not that different from end-user perspective, b-tree also invert the rowid=>value to value=>rowid, so that find rows contains certain value can be very fast. 61 | 62 | There is a great article on how inverted is stored and queried using Finite State Transducers 63 | http://www.cnblogs.com/forfuture1978/p/3945755.html 64 | 65 | ![](fst.png) 66 | 67 | For people do not have time to read the article or lucene source code. It works like English dictionary. 
# Numeric Range Query

Numeric data can also be indexed. It looks like:

```
49=[doc31]
50=[doc40,doc41]
51=[doc53]
...
```

This is fine for a point query, but a range query would be slow. An optimization is applied to speed up range queries:

```
49=[doc31]
50=[doc40,doc41]
50x75=[doc40,doc41,doc53,doc78,doc99,...]
51=[doc53]
...
```

To query 50 ~ 100:

```
50x75 OR 76x99 OR 100
```

There is a setting called precision_step (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#number) that controls how many extra terms are generated. If no range query is required, we can set it to the maximum to save disk space.

# Bitset

In MySQL, if you have two columns and a query like "plat='wx' AND os='android'", then a b-tree index on plat and another on os is not enough. At runtime, MySQL has to pick the more selective of the two indexes and leave the other one unused. There is a great presentation on how indexes work in MySQL: http://www.slideshare.net/vividcortex/optimizing-mysql-queries-with-indexes

![](multi-column-index.jpg)

To index both plat and os, you have to create a multi-column index that includes both, and even the ordering of the columns matters: an index on (plat, os) and an index on (os, plat) are different.

In Elasticsearch, the inverted indexes are composable. You only need to index os and plat separately. At runtime, both indexes can be used and combined to speed up the query. The filters plat='wx' and os='android' can even be cached separately to speed up future queries.

The trick is called "roaring bitmap" (http://roaringbitmap.org/). The Druid database has a detailed article on how the magic works: http://druid.io/blog/2012/09/21/druid-bitmap-compression.html

There are two challenges in using a bitset for the filter result. If we have 1 million documents, then for a given filter (for example plat='wx') there will be 1 million 1/0 results. Representing 1 million boolean results with minimum size is not an easy job.

Compressing sparse bitsets is an active research area, because the filter result is sometimes very sparse. For example, user="wentao" will likely have only 1 hit out of 1 million documents; storing nearly 1 million zeros would be a waste. The simplest compression is to represent a run of 1s or 0s in a compact form.

For example: ```11111111 10001000 11110001 11100010 11111111 11111111```

The last two words are all 1s, so the sequence can be compressed to

```
header section: 00001
content section: 11111111 10001000 11110001 11100010 10000010
```

The other challenge is doing AND/OR/NOT calculations on the compressed bitsets. If we had to uncompress them to do the boolean logic, the compression would not help much to save memory. Algorithms have been invented to do AND/OR/NOT directly on the compressed data.

The actual compression algorithm used in Elasticsearch (Lucene is the underlying storage) is very complex and efficient:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/util/SparseFixedBitSet.java
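As a toy illustration of the idea (not Lucene's actual SparseFixedBitSet, nor the roaring or word-aligned formats sketched above), the snippet below stores each cached filter result as a plain Python integer used as a bitset, so AND/OR/NOT become single integer operations. The even/every-third matching pattern and the document count are made up for the example.

```python
# A filter result over N documents is conceptually a bitset: bit i is 1 when
# document i matches. Python integers can serve as arbitrarily long bitsets.
N = 1_000  # imagine 1 million documents in a real index

# Pretend every even document has plat='wx' and every third has os='android'.
plat_wx = sum(1 << doc_id for doc_id in range(0, N, 2))
os_android = sum(1 << doc_id for doc_id in range(0, N, 3))

# Combining cached filters is just bitwise logic on the bitsets.
both = plat_wx & os_android                 # plat='wx' AND os='android'
either = plat_wx | os_android               # plat='wx' OR os='android'
print(bin(both & 0b111111111111))           # the lowest 12 documents of the AND

# A very selective filter (e.g. user='wentao') is almost all zeros, which is why
# real implementations (WAH, roaring bitmaps, Lucene's SparseFixedBitSet) compress
# runs of identical bits instead of storing one bit per document.
user_wentao = 1 << 777                      # a single matching document
print((both & user_wentao) != 0)            # is that document in the combined result?
```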
The biggest benefit Elasticsearch gives us is the ability to index a lot of columns separately. At query time, the filters on different columns can each be evaluated to a bitset and cached efficiently; the final filter result is then just the AND/OR/NOT of those individual bitsets. There is no composite index in Elasticsearch, unlike MySQL.

# DocValues

In MySQL, once you have a bunch of row ids filtered from the b-tree index, loading them from the row store is a very time-consuming "random access" process. Similarly, once we have a bunch of document ids filtered from the inverted index, how do we load the other fields of those documents efficiently?

There are 4 ways to load document values in Elasticsearch. Take the field "user_age" as an example:

* Using the _source "stored field": the document is stored as JSON, which has to be parsed to get the field "user_age"
* Using the user_age "stored field", if it is stored
* If user_age is indexed in the inverted index, Elasticsearch will un-invert the index to form a document_id=>user_age mapping in memory (called the "field cache"), then load from the "field cache"
* If user_age has DocValues, the field can be loaded from the DocValues file

DocValues is the real game changer here. It turns Elasticsearch into a serious player in the analytical database market. DocValues is a column-oriented data store, optimized for batch loading.

![](int-columnar-store.jpg)

Above is how an int-typed field gets stored on disk. There is a separate file for each DocValues field, and within the file the values of the documents are stored contiguously. Each document occupies a fixed size to store its value for that field, so the memory-mapped file looks like an array in memory, and seeking to the values of particular documents is a very fast operation. Batch loading a contiguous range of documents is also a fast sequential read.

There is a very good video on DocValues:

* video: https://www.youtube.com/watch?v=JLUgEpJcG40
* slides: http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene

MySQL is a row-oriented database, which means the rows are stored one after another, each row with all of its fields in it. In a row-oriented database there is no good way to load the values of just one column out of 100. In a column-oriented database it is very easy to load only the needed values. The column-oriented layout is also friendly to the CPU cache, because the values read are exactly the values to be computed.

![](column-vs-row-oriented-database.png)

# Compression: bit packing

It is worth mentioning that Elasticsearch also compresses the stored DocValues. One of the techniques is called bit packing: numeric values can be stored in less disk space by using fewer bits per value. The key observation is that even if the individual ints are large, the values in a series are often very close to each other. For example:

```101, 102, 105```

If we subtract 100 from each value, it becomes

```1, 2, 5```

and those values can be stored with fewer bits on disk.

![](bit-packing.jpg)
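Here is a minimal sketch of the principle: subtract a base from a block of values (the block minimum here, rather than the round number 100 used above) and pack the remainders with just enough bits per value. It only illustrates the idea; Lucene's actual packed-ints encoders are more sophisticated.

```python
def bit_pack(values):
    """Pack a block of ints as (base, bits_per_value, payload)."""
    base = min(values)
    deltas = [v - base for v in values]
    bits = max(deltas).bit_length() or 1       # bits needed for the largest delta
    payload = 0
    for i, d in enumerate(deltas):             # concatenate fixed-width slots
        payload |= d << (i * bits)
    return base, bits, payload

def bit_unpack(base, bits, payload, count):
    mask = (1 << bits) - 1
    return [base + ((payload >> (i * bits)) & mask) for i in range(count)]

packed = bit_pack([101, 102, 105])
print(packed)                                  # (101, 3, 264): 3 bits per value instead of 32
print(bit_unpack(*packed, count=3))            # [101, 102, 105]
```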
# Compression: dictionary encoding

Storing repeated string values in DocValues does not store the same value again and again. Recall the TSUID in OpenTSDB: because the metric name is very repetitive and long (many bytes), instead of storing the metric name on every data point, we translate it into a numeric id to save disk space. Elasticsearch DocValues does this translation internally for us, so there is no need to apply this optimization trick in the application layer.

For example, if we have 3 documents

```
doc[0] = "aardvark"
doc[1] = "beaver"
doc[2] = "aardvark"
```

they will be stored as

```
doc[0] = 0
doc[1] = 1
doc[2] = 0

term[0] = "aardvark"
term[1] = "beaver"
```

# Block Join

This is also called nested documents or a document block.

![](es-document-block.png)

This is how it is indexed:

```
curl -XPOST 'localhost:9200/products/product' -d '{
    "name" : "Polo shirt",
    "description" : "Made of 100% cotton",
    "offers" : [
        {
            "color" : "red",
            "size" : "s",
            "price" : 999
        },
        {
            "color" : "red",
            "size" : "m",
            "price" : 1099
        },
        {
            "color" : "blue",
            "size" : "s",
            "price" : 999
        }
    ]
}'
```

(For Elasticsearch to actually index the offers as separate nested documents, the offers field has to be declared with type "nested" in the mapping.)

Nested documents give us three-fold goodness:

* The ability to control the on-disk layout of documents. If we index a bunch of documents as nested documents of one parent, they are stored physically next to each other. Since data within a time range is usually accessed together, it makes sense to pack the data for a time range together as a series of nested documents. In this sense nested documents work like the clustered index in MySQL.
* The parent document count is much smaller, so the inverted index lookup can be faster at the parent level.
* We can pull common fields up into the parent document so that we do not need to repeat them in the nested documents. For example, all data for the same application can share an app_id in the parent document, so each nested data point does not need to mention app_id again.

# Index as partition

Compared with MySQL:

* mysql => database => table => rows
* elasticsearch => index => mapping => documents

At first glance, you might think index=database and mapping=table. Actually, mappings are NOT physically isolated from each other. In practice, the index is used like a MySQL table to physically partition the time series data. For example, if we have 3 days of logs, we create 3 indices:

* logs-2013-02-22
* logs-2013-02-21
* logs-2013-02-20

When we search, we can specify the indices included in the scope to speed up the query:

```
$ curl -XGET localhost:9200/logs-2013-02-22,logs-2013-02-21/Errors/_search?query="q:Error Message"
```

Notice that in Elasticsearch we can partition the data into time-based indices, but there is no built-in support to select the indices for us based on the time range filter. This is unlike PostgreSQL, where we can SELECT * FROM logs WHERE timestamp > xxx AND timestamp < yyy without knowing that there are actually 3 physical tables beneath the logs view.
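Since the application has to pick the time-based indices itself, a small helper usually does it. Below is a minimal sketch: given a start and end date it produces the daily index names to put into the search URL. The logs-YYYY-MM-DD naming follows the example above; the helper itself is hypothetical, not an Elasticsearch API.

```python
from datetime import date, timedelta

def indices_for_range(start: date, end: date, prefix: str = "logs-"):
    """Return the daily index names covering [start, end], inclusive."""
    day, names = start, []
    while day <= end:
        names.append(prefix + day.strftime("%Y-%m-%d"))
        day += timedelta(days=1)
    return names

names = indices_for_range(date(2013, 2, 21), date(2013, 2, 22))
print(",".join(names))
# logs-2013-02-21,logs-2013-02-22
# This comma-separated list is what goes into the URL of the _search request,
# e.g. curl -XGET 'localhost:9200/logs-2013-02-21,logs-2013-02-22/_search?...'
```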
# Distributed Computation

Most of the goodness above is built into the underlying Lucene storage engine, but the distributed computation part is genuinely Elasticsearch's own. Because Elasticsearch builds the distributed part for us, we do not need to shard our database ourselves, and we do not need to send queries to the "related" nodes and collect the results back for further aggregation. All of these things have been taken care of.

![](es-sharding.png)

The sharding is not perfect. The master election does not use ZooKeeper, which makes it prone to split-brain problems. From the crate.io blog we can learn how the aggregation could be improved, or we could use crate.io instead (which builds on top of Elasticsearch): https://crate.io/blog/crate_data_elasticsearch/

```
Elasticsearch currently supports the HyperLogLog aggregations, whereas Crate.IO supports accurate aggregations. Also Elasticsearch scatters the queries to all nodes, gathering the responses and doing the aggregations afterwards which results in high memory consumption on the node that is handling the client request (and so doing the aggregation).

Crate distributes the collected results to the whole cluster using a simple modulo based hashing, and as a result uses the complete memory of the cluster for merging. Think of it as some kind of distributed map/reduce.
```

But nevertheless, Elasticsearch has a distributed computation layer that works.

# Map/reduce story

Elasticsearch has a very good map/reduce story as well. There are two ways to do it:

* use Elasticsearch's own distributed computation as the map/reduce engine, and plug scripts into the calculation process
* use Elasticsearch as a data store with good filtering support, and build the distributed computation layer with Spark

This is how Elasticsearch can be used as an RDD in Spark (SharedIndex is a helper from the example project this snippet comes from):

```
// requires the elasticsearch-hadoop connector on the classpath
import org.apache.hadoop.io.MapWritable
import org.apache.hadoop.mapred.JobConf
import org.elasticsearch.hadoop.mr.EsInputFormat

// Set our query
jobConf.set("es.query", query)
// Create an RDD of the tweets
val currentTweets = sc.hadoopRDD(jobConf,
    classOf[EsInputFormat[Object, MapWritable]],
    classOf[Object], classOf[MapWritable])
// Convert to a format we can work with
val tweets = currentTweets.map{ case (key, value) =>
    SharedIndex.mapWritableToInput(value) }
// Extract the hashtags
val hashTags = tweets.flatMap{t =>
    t.getOrElse("hashTags", "").split(" ")
}
```

The filter is pushed down to be executed by Elasticsearch; the aggregation part is done by Spark.

For using Elasticsearch itself as a map/reduce engine, the feature is called "Scripted Metric Aggregation": https://www.elastic.co/guide/en/elasticsearch/reference/1.6/search-aggregations-metrics-scripted-metric-aggregation.html

Compared to a NoSQL database like MongoDB, which claims to support map/reduce using JavaScript, I think plugging Java code into Elasticsearch/Lucene is a much better interop story than integrating a C++ database engine with a JavaScript engine.
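For reference, a scripted metric aggregation request follows the init/map/combine/reduce shape described at the link above. The sketch below only builds such a request body in Python and prints it; the index name, the latency field, and the sum-of-latency example are assumptions for illustration, and the script strings would need to match whatever scripting language your cluster version has enabled.

```python
import json

# A sum-of-latency aggregation expressed as a scripted metric (map/reduce inside ES).
# map_script and combine_script run per shard; reduce_script runs once on the
# coordinating node over the per-shard results.
request_body = {
    "size": 0,
    "aggs": {
        "total_latency": {
            "scripted_metric": {
                "init_script": "_agg.values = []",
                "map_script": "_agg.values.add(doc['latency'].value)",
                "combine_script": "sum = 0; for (v in _agg.values) { sum += v }; return sum",
                "reduce_script": "total = 0; for (s in _aggs) { total += s }; return total",
            }
        }
    },
}

# POST this body to e.g. localhost:9200/logs-2013-02-22/_search to run it.
print(json.dumps(request_body, indent=2))
```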
# Future: Off-heap

Elasticsearch uses a lot of Java heap for caching. One important future direction is to move much of that cache off the heap. Here is a video on this effort in a fork of Solr (another Lucene-based search engine):

* video: https://www.youtube.com/watch?v=dDywZDQJQ3o&index=43&list=PLU6n9Voqu_1FM8nmVwiWWDRtsEjlPqhgP
* slides: http://www.slideshare.net/lucidworks/native-code-off-heap-data-structures-for-solr-yonik-seeley

# Future: SIMD

Elasticsearch is limited by the JVM, which makes it hard to leverage SIMD CPU instructions to aggregate data faster. There are smart people working on bringing SIMD acceleration into Lucene and Elasticsearch, with very promising results:

* video: https://berlinbuzzwords.de/session/fast-decompression-lucene-codec
* post: http://blog.griddynamics.com/2015/02/proposing-simd-codec-for-lucene.html

# Success Stories

To my knowledge there are two companies using Elasticsearch as a time series database. There must be many more doing so that I am simply not aware of.

* Bloomberg: http://www.slideshare.net/lucidworks/search-analytics-component-presented-by-steven-bower-bloomberg-lp
* Parse.ly: https://www.elastic.co/blog/pythonic-analytics-with-elasticsearch

Because so many people now use Elasticsearch as an analytics database rather than a search engine, the company behind Elasticsearch is now called just "Elastic", without the "search".

As we can see from the above, all sorts of optimizations have already been done for you by Lucene and Elasticsearch. We can make a list to compare:

* OpenTSDB TSUID: OpenTSDB compresses the metric name into a TSUID using a dictionary. In Lucene, the corresponding optimization is the FST used to store the inverted index; if a long string is not indexed but stored as DocValues, Lucene will also compress it with dictionary encoding.
* OpenTSDB HBase scan: OpenTSDB relies on the physical layout of the HBase data files to scan sequentially. In Lucene, this is done by the column-oriented DocValues files.
* OpenTSDB compaction: OpenTSDB compacts the data points of a time range into one column of one row, to reduce storage size and speed up scanning. This can be achieved with Lucene's nested documents.
* PostgreSQL time-based partitioning: PostgreSQL can create a table per day, then combine the tables into one view. Elasticsearch can also do time-based partitioning, but when querying, the application needs to be aware of the time range and actively select the indices to use (unlike a PostgreSQL partitioned table).
* RDBMS b-tree index: the inverted index is fast, and DocValues is column-oriented, unlike a row-oriented RDBMS, which cannot make good use of its b-tree index when the number of rows to fetch is huge.
* MySQL covering index: Elasticsearch does not support covering indexes; DocValues remedies that pain.
* MySQL clustered index: a clustered index stores related rows physically together to speed up queries. Elasticsearch can use nested documents as its clustered index.


--------------------------------------------------------------------------------