├── LICENSE
├── README.md
└── images
    └── osm-logo.png
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 snuup
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # FlatMap - a binary file format for OpenStreetMap data
2 |
3 |
4 |
5 |
6 | 
7 |
8 |
9 |
10 |
11 | ## The Movie
12 | https://media.ccc.de/v/fossgis2022-14162-flatmap-ein-dateiformat-fr-osm-daten
13 |
14 |
15 | ## Purpose
16 |
17 | ### Efficient data access
18 | is the primary goal of this data format. Operations that shall be fast are
19 | - getnode (id)
20 | - getway (id)
21 | - getrelation (id)
22 | - iterate all nodes
23 | - iterate all ways
24 | - iterate all relations
25 | - getnodes (wayid)
26 |
27 | - getnode/getway/getrelation take O(log n).
28 | - Iterators locate the first item in O(log n) and are then linear in the number of iterated elements.
29 | - getnodes (wayid) accesses the way in O(log n); iterating its nodes is linear in the number of nodes.
30 |
31 | ### Compactness
32 | supports efficiency, because smaller data often leads to faster data access. Being compact also makes flatmap a useful format for transferring OSM datasets, and thus an alternative to the prevailing pbf file format. [See also](https://github.com/snuup/flatmap/wiki/Compactness)
33 |
34 |
35 |
36 | ## File Format
37 |
38 | A flat-map-file holds these elements:
39 |
40 | | name | mandatory |
41 | |:-|:-|
42 | | file-header | yes |
43 | | nodes-blocks | no |
44 | | ways-blocks | no |
45 | | relations-blocks | no |
46 | | nodes-blocktable | no |
47 | | ways-blocktable | no |
48 | | relations-blocktable | no |
49 | | string-table | no |
50 |
51 | The links between these elements, and the number of occurrences of each element, are:
52 |
53 | ```
54 | file-header (1)
55 | -> nodes-blocktable (0|1)
56 | -> nodes-blocks (0..*)
57 | -> ways-blocktable (0|1)
58 | -> ways-blocks (0..*)
59 | -> relations-blocktable (0|1)
60 | -> relations-blocks (0..*)
61 | -> string-table (0|1)
62 | ```
63 |
64 | There is no necessity to store all nodes/ways/relations-blocks as a sequence; the only requirement is that all links are valid. While flatmap is designed as a simple read-only file format, its use as an updatable file format is also envisaged, and this flexibility of allocation becomes relevant then.
65 |
66 |
67 | ## Sorting
68 |
69 | The blocktables and the blocks together form a B-tree data structure, more precisely a B+tree, where all data is in the leaves. The blocks are the leaves and the 3 blocktables are the 3 roots. Accordingly, BlockTableEntries must be ordered by their first-id values, and the elements inside the blocks must be ordered as well.
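
The lookup that this ordering enables can be sketched in Python. This is a minimal sketch, not the reference implementation: `find_block` is a hypothetical helper, the sample values are made up, and it assumes a blocktable has already been read into two parallel lists (the first-id values and the block links).

```python
from bisect import bisect_right

def find_block(first_ids, links, element_id):
    """Return the link of the only block that can contain element_id, or None.

    first_ids must be sorted ascending; links[i] belongs to first_ids[i].
    """
    # Position of the last entry whose first-id is <= element_id.
    i = bisect_right(first_ids, element_id) - 1
    if i < 0:
        return None  # smaller than the first id of the first block
    return links[i]

# Made-up blocktable with three blocks.
first_ids = [1, 300, 900]
links = [4096, 8192, 12288]
```

The binary search inspects O(log n) entries; a second search inside the selected block (at most 256 elements) then locates the element itself.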
70 |
71 |
72 | ## File-Header
73 |
74 | | name | size | type | comment |
75 | |:-------------------------|------|--------------------|-|
76 | | magic file identifier | 4 | uint32 | 0xF1AD8ABB, "flat map"
77 | | file-format-version | 4 | uint32 | 0x00000001 |
78 | | nodes-blocks-count | 8 | uint64 | number of nodes blocks
79 | | nodes-blocks-start | 8 | uint64 | link to first DataBlockEntry
80 | | ways-blocks-count | 8 | uint64 | number of ways blocks
81 | | ways-blocks-start | 8 | uint64 | link to first DataBlockEntry
82 | | relations-blocks-count | 8 | uint64 | number of relations blocks
83 | | relations-blocks-start | 8 | uint64 | link to first DataBlockEntry
84 | | string-table-header | 32 | | as defined below
85 |
86 | The mandatory file-header points to all other elements. **Links** are expressed as absolute file positions, a value of 0 means that no such element exists.
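
Reading the header could look like the following sketch. The byte order is not stated above, so little-endian is an assumption here, as are the name `parse_file_header` and the dictionary layout of the result.

```python
import struct

# ASSUMPTION: little-endian. Two uint32, six uint64, then the
# 32-byte string-table header (four uint64) = 88 bytes in total.
HEADER = struct.Struct("<II6Q4Q")
MAGIC = 0xF1AD8ABB

def parse_file_header(buf):
    """Decode the file-header found at the start of buf."""
    (magic, version,
     nodes_count, nodes_start,
     ways_count, ways_start,
     rels_count, rels_start,
     str_count, str_strings, str_alpha, str_sid) = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("not a flatmap file")
    return {
        "version": version,
        # (count, link) pairs; a link of 0 means the element does not exist.
        "nodes": (nodes_count, nodes_start),
        "ways": (ways_count, ways_start),
        "relations": (rels_count, rels_start),
        "strings": (str_count, str_strings, str_alpha, str_sid),
    }
```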
87 |
88 | ## DataBlockEntry
89 | | name | size | type | comment
90 | |:-|:-|:-|:-|
91 | | first-id | 8 | uint64 | first node/way/relation id
92 | | data-block | 8 | uint64 | link to node/way/relation block
93 |
94 | ## Stringtable-Header
95 |
96 | | name | size | type | comment |
97 | |-|-|-|-|
98 | | count | 8 | uint64 | number of strings
99 | | strings | 8 | uint64 | link to stream of Strings
100 | | alpha-sorted | 8 | uint64 | link to alphanumerically sorted StringIndexEntry[]
101 | | sid-sorted | 8 | uint64 | link to numerically sorted StringIndexEntry[]
102 |
103 | ## String
104 |
105 | | name | type | comment |
106 | |:-|:-|:-|
107 | | length | int32 | varint encoded number of utf8 bytes
108 | | text | utf8[] |
109 |
110 | **maximum value needed**: Currently, the largest strings have a length of 765 utf8 bytes; one of them is the description here: https://www.openstreetmap.org/relation/12372925 .
111 |
112 | **encoding**: 99.67% of the strings have a length <= 127, so varint encoding needs only 1 byte for their length field.
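
A length-prefixed string can be decoded as sketched below. The sketch assumes that 7-bit varint encoding stores the least significant 7 bits first and uses the high bit of each byte as a continuation flag (the common LEB128 layout); `read_varint` and `read_string` are hypothetical names.

```python
def read_varint(buf, pos):
    """Decode an unsigned 7-bit varint; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7

def read_string(buf, pos):
    """Decode one String record; return (text, next_pos)."""
    length, pos = read_varint(buf, pos)
    end = pos + length
    return buf[pos:end].decode("utf-8"), end
```

Under this assumption, a length of up to 127 takes one byte and the current maximum of 765 takes two.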
113 |
114 |
115 | ## StringIndexEntry
116 | | name | type | comment |
117 | |:-|:-|:-|
118 | | sid | int32 | string-id
119 | | String | uint64 | link to string
120 |
121 | This structure is used by both string indexes: by string-id and by string (alphanumeric).
122 |
123 | ## Nodes-Block
124 |
125 | |name | size | type | comment |
126 | |:-|:-|:-|:-|
127 | |**header**
128 | | nodecount-minus-1 | 1 | byte | 255 means a count of 256 |
129 | | id-size | 1 | byte | datatype of the local-ids array: 1, 2, 4 or 8 |
130 | | tag-size | 1 | byte | datatype of the tagsizes array: 1, 2, 4 or 8 |
131 | |**arrays**
132 | | local-ids | | int[] | 1, 2, 4 or 8 byte sized integer
133 | | lon-lat | | LonLat[] | lon-lat of the node
134 | | tag-sizes | | int[] | 1, 2, 4 or 8 byte sized integer
135 | |**streams**
136 | | tag-stream | | | stream of varint encoded key-value sequence for all nodes
137 |
138 | Adaptive data structures are applied at the block level: In the node-block this applies to local-ids and tag-sizes: Both are stored in arrays whose integer type is minimized such that it can hold the largest value.
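
Choosing the element size for such an array can be sketched as follows. The sketch assumes the values are stored as unsigned integers, which the tables above do not state explicitly; `minimal_int_size` is a hypothetical name.

```python
def minimal_int_size(values):
    """Return the smallest of 1, 2, 4 or 8 bytes that can hold max(values)."""
    largest = max(values, default=0)
    for size in (1, 2, 4, 8):
        if largest < 1 << (8 * size):
            return size
    raise ValueError("value does not fit into 8 bytes")
```

The result is what a writer would store in the id-size or tag-size header byte of the block.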
139 |
140 | **local-ids** are the offsets of the node-ids from the id of the first node in the block. The id of the first node is not stored in the block but in the data-block-entry for the block.
141 |
142 | **tag-sizes** is an array that holds the length of the tag-stream for each node. To find the start of the tag-stream for the i-th node, the sum of the
143 | tag-stream-lengths of the preceding nodes has to be computed.
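
The offset computation described above amounts to a prefix sum. A minimal sketch with hypothetical helper names and made-up sizes:

```python
from itertools import accumulate

def tag_range(tag_sizes, i):
    """Return (start, end) of the i-th node's tags within the tag-stream."""
    start = sum(tag_sizes[:i])  # lengths of all preceding nodes
    return start, start + tag_sizes[i]

def tag_starts(tag_sizes):
    """Return the start offset of every node's tags in one pass."""
    return [end - size for end, size in zip(accumulate(tag_sizes), tag_sizes)]

sizes = [3, 0, 5, 2]  # the second node has no tags
```

Since a block holds at most 256 nodes, the sum for a single lookup stays small; `tag_starts` is the variant to use when iterating all nodes of a block.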
144 |
145 | ## LonLat
146 | | name | size | type | comment |
147 | |:-|:-|:-|:-|
148 | | lon | 4 | int | longitude in 100 nanodegrees
149 | | lat | 4 | int | latitude in 100 nanodegrees
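
One unit is 100 nanodegrees, i.e. 10^-7 degrees, so the full range of ±180 degrees fits into a signed 32-bit integer. A small sketch of the conversion, with hypothetical helper names:

```python
UNITS_PER_DEGREE = 10_000_000  # 1 degree = 10^7 units of 100 nanodegrees

def to_fixed(degrees):
    """Convert degrees to the stored integer representation."""
    return round(degrees * UNITS_PER_DEGREE)

def to_degrees(fixed):
    """Convert a stored integer back to degrees."""
    return fixed / UNITS_PER_DEGREE
```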
150 |
151 |
152 | ## Ways-Block
153 |
154 | | name | size | type | comment |
155 | |:-|:-|:-|:-|
156 | |**header**
157 | | waycount-minus-1 | 1 | byte | 255 means a count of 256 |
158 | | id-size | 1 | byte | datatype of the local-ids array: 1, 2, 4 or 8 |
159 | | tag-size | 1 | byte | datatype of the tagsizes array: 1, 2, 4 or 8 |
160 | | nodes-size | 1 | byte | datatype of the nodesizes array: 1, 2, 4 or 8 |
161 | | tagstream-size | 4 | uint32 | size of the whole tag-stream, so that a reader can jump to the node-stream
162 | |**arrays**
163 | | local-ids | | int[] | 1, 2, 4 or 8 byte sized integer
164 | | tag-sizes | | int[] | 1, 2, 4 or 8 byte sized integer
165 | | nodes-sizes | | int[] | 1, 2, 4 or 8 byte sized integer
166 | |**streams**
167 | | tag-stream | | | stream of varint encoded key-value sequence for all ways
168 | | node-stream | | | stream of nodes, see below
169 |
170 | The node-stream stores the way-nodes of all ways. This includes the node-ids as well as the lon/lats of those nodes. This implements the locations-on-ways design in the variant where node-ids are kept. See also https://github.com/osmlab/osm-data-model for a discussion of this design.
171 |
172 | Tagless nodes that are embedded in ways in this manner do not need to be stored as separate nodes, although the file format does not mandate omitting them.
173 |
174 | The first node and the follower nodes are encoded as follows:
175 |
176 | ## First-Node
177 |
178 | | name | size | type | comment |
179 | |:-|:-|:-|:-|
180 | | id | 5 | uint64 | only lower 5 bytes are stored
181 | | lon | 4 | int32 | in 100 nanodegrees
182 | | lat | 4 | int32 | in 100 nanodegrees
183 |
184 |
185 | There are currently around 842 million ways in the database (https://taginfo.openstreetmap.org/reports/database_statistics). The size of the first node therefore affects the file size by several percent. Encoding the id as a varint with 7-bit encoding currently leads to an average of 4.98 bytes per id, which is around 40 bits. This is larger than
186 | the size needed for the maximum node-id, which is 34 bits. In the long run we expect this id to be uniformly distributed across the node-id domain. Making the id type a
187 | 5-byte integer should be safe for many years and avoids the surprises of varint encoding. Further research could measure skewness, how skewness
188 | evolves over time, and investigate other variable integer encodings, like 15-bit or 31-bit codings.
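
The fixed 13-byte First-Node record can be sketched as below. Little-endian byte order is an assumption, and `write_first_node` and `read_first_node` are hypothetical names.

```python
import struct

def write_first_node(node_id, lon, lat):
    """Encode a First-Node: 5-byte id followed by two int32 values."""
    if not 0 <= node_id < 1 << 40:
        raise ValueError("node id does not fit into 5 bytes")
    return node_id.to_bytes(5, "little") + struct.pack("<ii", lon, lat)

def read_first_node(buf, pos):
    """Decode a First-Node at pos; return ((id, lon, lat), next_pos)."""
    node_id = int.from_bytes(buf[pos:pos + 5], "little")
    lon, lat = struct.unpack_from("<ii", buf, pos + 5)
    return (node_id, lon, lat), pos + 13
```

Five bytes hold ids below 2^40, which leaves 6 bits of headroom over the 34 bits needed today.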
189 |
190 | ## Follower-Node
191 | | name | type | comment |
192 | |:-----|:-----|:--------|
193 | | id | uint64 | varint delta coded node-id
194 | | lon | int32 | varint zigzag delta coded lon
195 | | lat | int32 | varint zigzag delta coded lat
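
Decoding the delta coded coordinates of the follower nodes could look like this sketch. It covers lon/lat only, takes values that have already been varint-decoded to unsigned integers, and assumes that each delta is relative to the preceding node of the same way, starting from the First-Node; the function names are hypothetical.

```python
def zigzag_decode(n):
    """Map 0, 1, 2, 3, 4, ... back to 0, -1, 1, -2, 2, ..."""
    return (n >> 1) ^ -(n & 1)

def decode_coords(first, zigzag_deltas):
    """Rebuild absolute coordinates from the first value and zigzag deltas."""
    coords = [first]
    for z in zigzag_deltas:
        coords.append(coords[-1] + zigzag_decode(z))
    return coords
```

Neighbouring way-nodes are usually close together, so the deltas are small and mostly fit into one or two varint bytes.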
196 |
197 |
198 |
199 | ## Relations-Block
200 |
201 | | name | size | type | comment |
202 | |:-|:-|:-|:-|
203 | |**header**
204 | | relationcount-minus-1 | 1 | byte | 255 means a count of 256 |
205 | | id-size | 1 | byte | datatype of the local-ids array: 1, 2, 4 or 8 |
206 | | tag-size | 1 | byte | datatype of the tagsizes array: 1, 2, 4 or 8 |
207 | | members-size | 1 | byte | datatype of the membersizes array: 1, 2, 4 or 8 |
208 | | tagstream-size | 4 | uint32 | size of the whole tag-stream, so that a reader can jump to the members-stream
209 | |**arrays**
210 | | local-ids | | int[] | 1, 2, 4 or 8 byte sized integer
211 | | tag-sizes | | int[] | 1, 2, 4 or 8 byte sized integer
212 | | members-sizes | | int[] | 1, 2, 4 or 8 byte sized integer
213 | |**streams**
214 | | tag-stream | | Tag[] | stream of varint encoded key-value sequence for all relations
215 | | members-stream | | Member[] | stream of members, see below
216 |
217 |
218 | ## Member
219 |
220 | | name | type | comment |
221 | |:-|:-|:-|
222 | | member-id | uint64 varint | node/way/rel id
223 | | role | uint32 varint | string-id
224 | | member-type | MemberType | see below
225 |
226 | ## MemberType
227 | - 1 = Node
228 | - 2 = Way
229 | - 3 = Relation
230 |
231 |
232 | ## Encodings
233 |
234 | ### varint encoding
235 | means 7-bit encoding
236 |
237 | ### zigzag encoding
238 | means 7-bit zigzag encoding
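
The encoder side of both encodings can be sketched as follows. The byte layout is an assumption (least significant 7 bits first, high bit set on every byte except the last, as in LEB128 and Protocol Buffers), and `encode_varint` and `zigzag_encode` are hypothetical names.

```python
def zigzag_encode(n):
    """Map 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ... so small negatives stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def encode_varint(n):
    """Encode an unsigned integer with 7 payload bits per byte."""
    if n < 0:
        raise ValueError("zigzag-encode signed values first")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # more bytes follow
        n >>= 7
    out.append(n)  # last byte, high bit clear
    return bytes(out)
```

Under this layout any value of 2^28 or more needs at least 5 bytes, which fits the average of 4.98 bytes per node id reported above.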
239 |
--------------------------------------------------------------------------------
/images/osm-logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/snuup/flatmap/8351b704a504e90c1eaa3e33b5f9ebb732f090d5/images/osm-logo.png
--------------------------------------------------------------------------------