├── LICENSE
├── README.md
└── images
    └── osm-logo.png


/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 snuup
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # FlatMap - a binary file format for OpenStreetMap data
2 |
3 |
4 |
5 |
6 | ![OpenStreetMap Logo](/images/osm-logo.png)
7 |
8 |
9 |
10 |
11 | ## The Movie
12 | https://media.ccc.de/v/fossgis2022-14162-flatmap-ein-dateiformat-fr-osm-daten
13 |
14 |
15 | ## Purpose
16 |
17 | ### Efficient data access
18 | is the primary goal of this data format. Operations that shall be fast are
19 | - getnode (id)
20 | - getway (id)
21 | - getrelation (id)
22 | - iterate all nodes
23 | - iterate all ways
24 | - iterate all relations
25 | - getnodes (wayid)
26 |
27 | getnode/way/relation are log(n)
28 | iterators locate the first item in log(n) and are then linear in the number of iterated elements
29 | getnodes (wayid) accesses the way in log(n); iterating its nodes is linear in the number of nodes
30 |
31 | ### Compactness
32 | supports efficiency, because smaller data often leads to faster data access. Being compact makes flatmap a useful format for the transfer of osm datasets, making it an alternative to the prevailing pbf file format. [See also](https://github.com/snuup/flatmap/wiki/Compactness)
33 |
34 |
35 |
36 | ## File Format
37 |
38 | A flat-map-file holds these elements:
39 |
40 | | name | mandatory |
41 | |:-|:-|
42 | | file-header | yes |
43 | | nodes-blocks | no |
44 | | ways-blocks | no |
45 | | relations-blocks | no |
46 | | nodes-blocktable | no |
47 | | ways-blocktable | no |
48 | | relations-blocktable | no |
49 | | string-table | no |
50 |
51 | The dependencies between these elements, following the links inside them, and their numbers of occurrences are:
52 |
53 | ```
54 | file-header (1)
55 |   -> nodes-blocktable (0|1)
56 |     -> nodes-blocks (0..*)
57 |   -> ways-blocktable (0|1)
58 |     -> ways-blocks (0..*)
59 |   -> relations-blocktable (0|1)
60 |     -> relations-blocks (0..*)
61 |   -> string-table (0|1)
62 | ```
63 |
64 | There is no necessity to store all nodes/ways/relations-blocks as a sequence; the only requirement is that all links are valid. While flatmap is designed as a simple read-only file format, its use as an updatable file format is also envisaged, and this flexibility of allocation becomes relevant then.
65 |
66 |
67 | ## Sorting
68 |
69 | The blocktables and the blocks together are considered a B-tree data structure, more precisely a B+tree where all data is in the leaves. The blocks are the leaves, the 3 blocktables are the 3 roots. Accordingly, the blocktable entries (DataBlockEntry) must be ordered by their first-id values, and the elements inside the blocks must be ordered as well.
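
The ordering requirement is what makes the log(n) lookups possible. The following is a minimal Python sketch, not the reference implementation: it assumes a blocktable has already been read into a list of `(first_id, block_link)` pairs, and the function name `find_block` and the sample table are hypothetical.

```python
def find_block(blocktable, wanted_id):
    """Return the link of the block that may contain wanted_id, or None.

    blocktable: list of (first_id, block_link) tuples sorted by first_id.
    This in-memory form of the DataBlockEntry array is an assumption.
    """
    lo, hi = 0, len(blocktable)
    # Binary search for the first entry whose first_id is greater than wanted_id.
    while lo < hi:
        mid = (lo + hi) // 2
        if blocktable[mid][0] <= wanted_id:
            lo = mid + 1
        else:
            hi = mid
    if lo == 0:
        return None  # wanted_id is smaller than every first-id in the table
    # The entry just before that position is the last one with first_id <= wanted_id.
    return blocktable[lo - 1][1]


# Made-up table: three blocks whose first ids are 10, 500 and 9000.
table = [(10, 4096), (500, 8192), (9000, 12288)]
print(find_block(table, 777))
print(find_block(table, 3))
```

The search only tells which block could hold the id; whether the id is actually present is decided by a second ordered search inside that block.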
70 |
71 |
72 | ## File-Header
73 |
74 | | name | size | type | comment |
75 | |:-------------------------|------|--------------------|-|
76 | | magic file identifier | 4 | uint32 | 0xF1AD8ABB ("flat map") |
77 | | file-format-version | 4 | uint32 | 0x00000001 |
78 | | nodes-blocks-count | 8 | uint64 | number of nodes blocks |
79 | | nodes-blocks-start | 8 | uint64 | link to first DataBlockEntry |
80 | | ways-blocks-count | 8 | uint64 | number of ways blocks |
81 | | ways-blocks-start | 8 | uint64 | link to first DataBlockEntry |
82 | | relations-blocks-count | 8 | uint64 | number of relations blocks |
83 | | relations-blocks-start | 8 | uint64 | link to first DataBlockEntry |
84 | | string-table-header | 32 | | as defined below |
85 |
86 | The mandatory file-header points to all other elements. **Links** are expressed as absolute file positions; a value of 0 means that no such element exists.
87 |
88 | ## DataBlockEntry
89 | | name | size | type | comment |
90 | |:-|:-|:-|:-|
91 | | first-id | 8 | uint64 | first node/way/relation id |
92 | | data-block | 8 | uint64 | link to node/way/relation block |
93 |
94 | ## Stringtable-Header
95 |
96 | | name | size | type | comment |
97 | |-|-|-|-|
98 | | count | 8 | uint64 | number of strings |
99 | | strings | 8 | uint64 | link to stream of Strings |
100 | | alpha-sorted | 8 | uint64 | link to alphanumerically sorted StringIndexEntry[] |
101 | | sid-sorted | 8 | uint64 | link to numerically sorted StringIndexEntry[] |
102 |
103 | ## String
104 |
105 | | name | type | comment |
106 | |:-|:-|:-|
107 | | length | int32 | varint encoded number of utf8 bytes |
108 | | text | utf8[] | |
109 |
110 | **maximum value needed**: Currently, the largest strings have a length of 765 utf8 bytes; one of them is the description here: https://www.openstreetmap.org/relation/12372925 .
111 |
112 | **encoding**: 99.67% of the strings have a length <= 127, so that varint encoding uses 1 byte for the length field.
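
To illustrate why lengths up to 127 fit into one byte, here is a hedged Python sketch of a String record. The README only says "7-bit encoding"; the byte layout used here (low 7 bits first, high bit set when another byte follows, as in LEB128 and Protocol Buffers) is an assumption, and all function names are hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative int with 7 payload bits per byte.

    Assumed layout: least significant group first, high bit = "more bytes follow".
    """
    if value < 0:
        raise ValueError("varint sketch handles non-negative values only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)         # last group
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_string(text):
    """String record: varint length in utf8 bytes, followed by the utf8 bytes."""
    raw = text.encode("utf-8")
    return encode_varint(len(raw)) + raw


def decode_string(data, pos=0):
    """Read one String record at pos; return (text, next_pos)."""
    length, pos = decode_varint(data, pos)
    return data[pos:pos + length].decode("utf-8"), pos + length


record = encode_string("highway")
print(len(record))            # 1 length byte + 7 text bytes
print(decode_string(record))
```

Under this assumption a length of 127 takes one byte and the current maximum of 765 takes two.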
113 |
114 |
115 | ## StringIndexEntry
116 | | name | type | comment |
117 | |:-|:-|:-|
118 | | sid | int32 | string-id |
119 | | String | uint64 | link to string |
120 |
121 | This structure is used by both string indexes: by string-id and by string (alphanumeric).
122 |
123 | ## Nodes-Block
124 |
125 | | name | size | type | comment |
126 | |:-|:-|:-|:-|
127 | | **header** |
128 | | nodecount-minus-1 | 1 | byte | 255 means a count of 256 |
129 | | id-size | 1 | byte | datatype of the local-ids array: 1, 2, 4 or 8 bytes |
130 | | tag-size | 1 | byte | datatype of the tagsizes array: 1, 2, 4 or 8 bytes |
131 | | **arrays** |
132 | | local-ids | | int[] | 1, 2, 4 or 8 byte sized integer |
133 | | lon-lat | | LonLat[] | lon-lat of the node |
134 | | tag-sizes | | int[] | 1, 2, 4 or 8 byte sized integer |
135 | | **streams** |
136 | | tag-stream | | | stream of varint encoded key-value sequence for all nodes |
137 |
138 | Adaptive data structures are applied at the block level. In the node-block this applies to local-ids and tag-sizes: both are stored in arrays whose integer type is the smallest one that can hold the largest value in the array.
139 |
140 | **local-ids** are the offsets of the node-ids from the id of the first node in the block. The value for the first node is not stored in the block but in the DataBlockEntry for the block.
141 |
142 | **tag-sizes** is an array that holds the length of the tag-stream for each node. To find the start of the tag-stream for the i-th node, the sum of the
143 | tag-stream-lengths of the preceding nodes has to be computed.
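
The two rules above (smallest sufficient integer type, and summing the preceding tag-sizes) can be sketched in Python as follows. This is only an illustration under stated assumptions: the arrays are treated as unsigned, the function names are hypothetical, and the sample data is made up.

```python
def smallest_int_size(values):
    """Smallest of 1, 2, 4 or 8 bytes that can hold the largest value.

    Assumption: values are stored unsigned; the README only says "int[]".
    """
    largest = max(values, default=0)
    for size in (1, 2, 4, 8):
        if largest < 1 << (8 * size):
            return size
    raise ValueError("value does not fit into 8 bytes")


def tag_slice(tag_stream, tag_sizes, i):
    """Return the bytes of the tag-stream that belong to the i-th element."""
    start = sum(tag_sizes[:i])  # sum of the lengths of all preceding elements
    return tag_stream[start:start + tag_sizes[i]]


# Made-up block with four nodes; the second node has no tags.
tag_sizes = [3, 0, 5, 2]
tag_stream = bytes(range(10))

print(smallest_int_size(tag_sizes))
print(tag_slice(tag_stream, tag_sizes, 2))
```

Because a block holds at most 256 elements, the summation is bounded, so random access inside a block stays cheap even without stored offsets.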
144 |
145 | ## LonLat
146 | | name | size | type | comment |
147 | |:-|:-|:-|:-|
148 | | lon | 4 | int | longitude in 100 nanodegrees |
149 | | lat | 4 | int | latitude in 100 nanodegrees |
150 |
151 |
152 | ## Ways-Block
153 |
154 | | name | size | type | comment |
155 | |:-|:-|:-|:-|
156 | | **header** |
157 | | waycount-minus-1 | 1 | byte | 255 means a count of 256 |
158 | | id-size | 1 | byte | datatype of the local-ids array: 1, 2, 4 or 8 bytes |
159 | | tag-size | 1 | byte | datatype of the tagsizes array: 1, 2, 4 or 8 bytes |
160 | | nodes-size | 1 | byte | datatype of the nodesizes array: 1, 2, 4 or 8 bytes |
161 | | tagstream-size | 4 | uint32 | size of the whole tag-stream, so that a reader can jump to the node-stream |
162 | | **arrays** |
163 | | local-ids | | int[] | 1, 2, 4 or 8 byte sized integer |
164 | | tag-sizes | | int[] | 1, 2, 4 or 8 byte sized integer |
165 | | nodes-sizes | | int[] | 1, 2, 4 or 8 byte sized integer |
166 | | **streams** |
167 | | tag-stream | | | stream of varint encoded key-value sequence for all ways |
168 | | node-stream | | | stream of nodes, see below |
169 |
170 | The node-stream stores the way-nodes of all ways. This includes the node-ids as well as the lon/lats of those nodes. This implements the locations-on-ways design in the variant where node-ids are kept. See also https://github.com/osmlab/osm-data-model for a discussion of this design.
171 |
172 | Tagless nodes that are embedded in ways in this manner do not need to be stored as nodes by themselves, although the file format does not mandate omitting them.
173 |
174 | The first node and the follower nodes are encoded as follows:
175 |
176 | ## First-Node
177 |
178 | | name | size | type | comment |
179 | |:-|:-|:-|:-|
180 | | id | 5 | uint64 | only lower 5 bytes are stored |
181 | | lon | 4 | int32 | in 100 nanodegrees |
182 | | lat | 4 | int32 | in 100 nanodegrees |
183 |
184 |
185 | There are currently around 842 million ways in the database (https://taginfo.openstreetmap.org/reports/database_statistics).
The size of the first node affects the file size by several percent. Encoding it as a varint with 7-bit encoding currently leads to an average of 4.98 bytes per id, which is around 40 bits. This is larger than
186 | the size needed for the maximum node-id, which is 34 bits. In the long run we expect this id to be uniformly distributed across the node-id domain. Making the id type a
187 | 5 byte sized integer value should be safe for many years and avoids surprises of varint encoding. Further research could measure the skewness and how it
188 | evolves over time, and investigate other variable integer encodings, like 15-bit or 31-bit codings.
189 |
190 | ## Follower-Node
191 | | name | type | comment |
192 | |:-----|:-----|:--------|
193 | | id | uint64 | varint delta coded node-id |
194 | | lon | int32 | varint zigzag delta coded lon |
195 | | lat | int32 | varint zigzag delta coded lat |
196 |
197 |
198 |
199 | ## Relations-Block
200 |
201 | | name | size | type | comment |
202 | |:-|:-|:-|:-|
203 | | **header** |
204 | | relationcount-minus-1 | 1 | byte | 255 means a count of 256 |
205 | | id-size | 1 | byte | datatype of the local-ids array: 1, 2, 4 or 8 bytes |
206 | | tag-size | 1 | byte | datatype of the tagsizes array: 1, 2, 4 or 8 bytes |
207 | | members-size | 1 | byte | datatype of the membersizes array: 1, 2, 4 or 8 bytes |
208 | | tagstream-size | 4 | uint32 | size of the whole tag-stream, so that a reader can jump to the members-stream |
209 | | **arrays** |
210 | | local-ids | | int[] | 1, 2, 4 or 8 byte sized integer |
211 | | tag-sizes | | int[] | 1, 2, 4 or 8 byte sized integer |
212 | | members-sizes | | int[] | 1, 2, 4 or 8 byte sized integer |
213 | | **streams** |
214 | | tag-stream | | Tag[] | stream of varint encoded key-value sequence for all relations |
215 | | members-stream | | Member[] | stream of members, see below |
216 |
217 |
218 | ## Member
219 |
220 | | name | type | comment |
221 | |:-|:-|:-|
222 | | member-id | uint64 varint | node/way/rel id |
223 | | role | uint32 varint | string-id |
224 |
| member-type | MemberType | see below |
225 |
226 | ## MemberType
227 | 1 = Node
228 | 2 = Way
229 | 3 = Relation
230 |
231 |
232 | ## Encodings
233 |
234 | ### varint encoding
235 | means 7-bit encoding
236 |
237 | ### zigzag encoding
238 | means 7-bit zigzag encoding
239 |


--------------------------------------------------------------------------------
/images/osm-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/snuup/flatmap/8351b704a504e90c1eaa3e33b5f9ebb732f090d5/images/osm-logo.png


--------------------------------------------------------------------------------