YEAR: 2021
COPYRIGHT HOLDER: Chris Jochem
Reading/writing with sfarrow and how it works.
NEWS.md

- arrow (reported by @jonkeane).
- New `find_geom` parameter in `read_sf_dataset()` adds any geometry columns to the `arrow_dplyr_query`. The default is `FALSE`, which keeps the previous behaviour.
- Cleaning documentation and preparing for CRAN submission.
- `st_write_feather()` and `st_read_feather()` allow similar functionality to read/write to .feather formats with `sf` objects.
- `arrow` 2.0.0: properties to `st_write_parquet()` are deprecated.
- New `write_sf_dataset()` and `read_sf_dataset()` to handle partitioned datasets. These also work with `dplyr` and grouped variables to define partitions.
- New vignettes added for documentation of all functions.
- `st_write_parquet()` now warns users that the geo metadata format may change.
Source: R/st_arrow.R (arrow_to_sf.Rd)

Helper function to convert 'data.frame' to `sf`

    arrow_to_sf(tbl, metadata)

| Argument | Description |
|---|---|
| tbl | |
| metadata | |

Value: object of `sf` with CRS and geometry columns
Source: R/st_arrow.R (create_metadata.Rd)

Create standardised geo metadata for Parquet files

    create_metadata(df)

| Argument | Description |
|---|---|
| df | object of class |

Value: JSON formatted list with geo-metadata

Reference for metadata standard: https://github.com/geopandas/geo-arrow-spec. This is compatible with GeoPandas Parquet files.
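To make the shape of such a list concrete, here is a minimal base R sketch of geo metadata for a single geometry column. The field names (`primary_column`, `columns`, `crs`, `bbox`, `encoding`, `schema_version`, `creator`) are an assumption based on version 0.1.0 of the referenced specification and are not stated on this page. `sketch_geo_metadata()` is a hypothetical helper, not the package's `create_metadata()`, and the CRS string is a placeholder.

```r
# Hypothetical helper: builds a geo-metadata list for one geometry column.
# Field names are assumed from version 0.1.0 of the geo-arrow-spec.
sketch_geo_metadata <- function(geom_col = "geometry",
                                crs_wkt = NA_character_,
                                bbox = c(-180, -90, 180, 90)) {
  stopifnot(is.character(geom_col), length(geom_col) == 1,
            is.numeric(bbox), length(bbox) == 4)

  # one entry per geometry column: CRS, bounding box and binary encoding
  column_entry <- list(crs = crs_wkt, bbox = bbox, encoding = "WKB")

  list(
    primary_column = geom_col,
    columns = stats::setNames(list(column_entry), geom_col),
    schema_version = "0.1.0",
    creator = list(library = "example")  # placeholder creator name
  )
}

# placeholder CRS text; a real file would carry the full WKT definition
meta <- sketch_geo_metadata("geometry", crs_wkt = "<WKT string for the CRS>")
str(meta, max.level = 2)
```

Keeping one entry per geometry column under `columns` is what lets a reader restore several geometry columns, each with its own CRS.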
Source: R/st_arrow.R (encode_wkb.Rd)

Convert `sfc` geometry columns into a WKB binary format

    encode_wkb(df)

| Argument | Description |
|---|---|
| df | |

Value: `data.frame` with binary geometry column(s)

Allows for more than one geometry column in `sfc` format
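The conversion described above can be sketched with `sf::st_as_binary()`, which turns an `sfc` column into a list of raw WKB vectors. This is only an illustration under the assumption that `sf` is installed, not the package's own `encode_wkb()` code. The example data and the names `pts`, `df` and `wkb_df` are made up for the sketch.

```r
wkb_df <- NULL  # stays NULL when sf is not available

if (requireNamespace("sf", quietly = TRUE)) {
  # small example object with one geometry column
  pts <- sf::st_sfc(sf::st_point(c(0, 0)), sf::st_point(c(1, 1)), crs = 4326)
  df <- sf::st_sf(id = 1:2, geometry = pts)

  # find every column that holds sfc geometries (there may be several)
  geom_cols <- names(df)[vapply(df, inherits, logical(1), what = "sfc")]

  # drop the sf class, then replace each geometry column with its WKB form
  wkb_df <- as.data.frame(df)
  for (col in geom_cols) {
    wkb_df[[col]] <- sf::st_as_binary(df[[col]])
  }

  # each element of the converted column is now a raw vector
  print(class(wkb_df$geometry[[1]]))
}
```

Looping over all detected `sfc` columns, rather than only the active geometry, is what allows more than one geometry column to be encoded.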
Source: R/st_arrow.R (read_sf_dataset.Rd)

Read an Arrow multi-file dataset and create `sf` object

    read_sf_dataset(dataset, find_geom = FALSE)

| Argument | Description |
|---|---|
| dataset | a |
| find_geom | logical. Only needed when returning a subset of columns. Should all available geometry columns be selected and added to the dataset query without being named? Default is `FALSE`. |

Value: object of class `sf`
This function is primarily for use after opening a dataset with `arrow::open_dataset`. Users can then query the arrow Dataset using `dplyr` methods such as `filter` or `select`. Passing the resulting query to this function will parse the datasets and create an `sf` object. The function expects consistent geographic metadata to be stored with the dataset in order to create `sf` objects.
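When a query selects only some attribute columns, the geometry column is not part of the query unless it is named. The sketch below shows how `find_geom = TRUE` might be used in that case. It assumes the `sf`, `arrow`, `dplyr` and `sfarrow` packages are installed and reuses the North Carolina data shipped with `sf`. The object names (`nc_sub`, `q`) are illustrative only.

```r
nc_sub <- NULL  # stays NULL when the required packages are missing
pkgs <- c("sf", "arrow", "dplyr", "sfarrow")

if (all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
  nc$group <- sample(1:3, nrow(nc), replace = TRUE)

  # write a partitioned dataset to a temporary directory
  tf <- tempfile()
  sfarrow::write_sf_dataset(dplyr::group_by(nc, group), path = tf)

  # select two attribute columns without naming the geometry column
  ds <- arrow::open_dataset(tf)
  q <- dplyr::select(ds, NAME, group)

  # find_geom = TRUE asks for the geometry columns to be added to the query
  nc_sub <- sfarrow::read_sf_dataset(q, find_geom = TRUE)

  unlink(tf, recursive = TRUE)
}
```

With the default `find_geom = FALSE`, the geometry column would need to be included in the `select()` call explicitly.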
Examples:

    # read spatial object
    nc <- sf::st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)

    # create random grouping
    nc$group <- sample(1:3, nrow(nc), replace = TRUE)

    # use dplyr to group the dataset. %>% also allowed
    nc_g <- dplyr::group_by(nc, group)

    # write out to parquet datasets
    tf <- tempfile()  # create temporary location
    on.exit(unlink(tf, recursive = TRUE))  # tf is a directory
    # partitioning determined by dplyr 'group_vars'
    write_sf_dataset(nc_g, path = tf)
    #> Warning: This is an initial implementation of Parquet/Feather file support and
    #> geo metadata. This is tracking version 0.1.0 of the metadata
    #> (https://github.com/geopandas/geo-arrow-spec). This metadata
    #> specification may change and does not yet make stability promises. We
    #> do not yet recommend using this in a production setting unless you are
    #> able to rewrite your Parquet/Feather files.

    # list the partition files that were written
    list.files(tf, recursive = TRUE)
    #> [1] "group=1/part-0.parquet" "group=2/part-1.parquet" "group=3/part-2.parquet"

    # open parquet files from dataset
    ds <- arrow::open_dataset(tf)

    # create a query. %>% also allowed
    q <- dplyr::filter(ds, group == 1)

    # read the dataset (piping syntax also works)
    nc_d <- read_sf_dataset(dataset = q)

    nc_d
    #> Simple feature collection with 33 features and 15 fields
    #> Geometry type: MULTIPOLYGON
    #> Dimension:     XY
    #> Bounding box:  xmin: -83.98855 ymin: 33.94867 xmax: -75.45698 ymax: 36.58965
    #> Geodetic CRS:  NAD27
    #> First 10 features:
    #>     AREA PERIMETER CNTY_ CNTY_ID       NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
    #> 1  0.114     1.442  1825    1825       Ashe 37009  37009        5  1091     1
    #> 2  0.070     2.968  1831    1831  Currituck 37053  37053       27   508     1
    #> 3  0.124     1.428  1837    1837     Stokes 37169  37169       85  1612     1
    #> 4  0.114     1.352  1838    1838    Caswell 37033  37033       17  1035     2
    #> 5  0.153     1.616  1839    1839 Rockingham 37157  37157       79  4449    16
    #> 6  0.072     1.085  1842    1842      Vance 37181  37181       91  2180     4
    #> 7  0.064     1.213  1892    1892      Avery 37011  37011        6   781     0
    #> 8  0.086     1.267  1893    1893     Yadkin 37197  37197       99  1269     1
    #> 9  0.128     1.554  1897    1897   Franklin 37069  37069       35  1399     2
    #> 10 0.142     1.640  1913    1913       Nash 37127  37127       64  4021     8
    #>    NWBIR74 BIR79 SID79 NWBIR79 group                       geometry
    #> 1       10  1364     0      19     1 MULTIPOLYGON (((-81.47276 3...
    #> 2      123   830     2     145     1 MULTIPOLYGON (((-76.00897 3...
    #> 3      160  2038     5     176     1 MULTIPOLYGON (((-80.02567 3...
    #> 4      550  1253     2     597     1 MULTIPOLYGON (((-79.53051 3...
    #> 5     1243  5386     5    1369     1 MULTIPOLYGON (((-79.53051 3...
    #> 6     1179  2753     6    1492     1 MULTIPOLYGON (((-78.49252 3...
    #> 7        4   977     0       5     1 MULTIPOLYGON (((-81.94135 3...
    #> 8       65  1568     1      76     1 MULTIPOLYGON (((-80.49554 3...
    #> 9      736  1863     0     950     1 MULTIPOLYGON (((-78.25455 3...
    #> 10    1851  5189     7    2274     1 MULTIPOLYGON (((-78.18693 3...
`sfarrow`: An R package for reading/writing simple feature (`sf`) objects from/to Arrow parquet/feather files with `arrow`

Source: R/sfarrow.R (sfarrow.Rd)

Simple features are a popular, standardised way to create spatial vector data with a list-type geometry column. Parquet is a standard column-oriented file format from the Apache Parquet project (https://parquet.apache.org/), designed for fast reads and writes and supported by Apache Arrow. `sfarrow` is designed to support the reading and writing of simple features in `sf` objects from/to Parquet files (.parquet) and Feather files (.feather) within `R`. A key goal of `sfarrow` is to support interoperability of spatial data in files between `R` and `Python` through the use of standardised metadata.

Coordinate reference and geometry field information for `sf` objects is stored in standard metadata tables within the files. The metadata are based on a standard representation (Version 0.1.0, reference: https://github.com/geopandas/geo-arrow-spec). This is compatible with the format used by the Python library `GeoPandas` for reading and writing Parquet/Feather files. Note to users: this metadata format is not yet stable for production use and may change in the future.

This work was undertaken by Chris Jochem, a member of the WorldPop Research Group at the University of Southampton (https://www.worldpop.org/).
Read a Feather file. Uses standard metadata information to identify geometry columns and coordinate reference system information.

    st_read_feather(dsn, col_select = NULL, ...)

| Argument | Description |
|---|---|
| dsn | character file path to a data source |
| col_select | A character vector of column names to keep. Default is `NULL` |
| ... | additional parameters to pass to |

Value: object of class `sf`

Reference for the metadata used: https://github.com/geopandas/geo-arrow-spec. These are standard with the Python `GeoPandas` library.
Examples:

    # load Natural Earth low-res dataset.
    # Created in Python with GeoPandas.to_feather()
    path <- system.file("extdata", package = "sfarrow")

    world <- st_read_feather(file.path(path, "world.feather"))

    world
    #> Simple feature collection with 177 features and 5 fields
    #> Geometry type: GEOMETRY
    #> Dimension:     XY
    #> Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.64513
    #> Geodetic CRS:  WGS 84
    #> First 10 features:
    #>      pop_est     continent                     name iso_a3 gdp_md_est
    #> 1     920938       Oceania                     Fiji    FJI  8.374e+03
    #> 2   53950935        Africa                 Tanzania    TZA  1.506e+05
    #> 3     603253        Africa                W. Sahara    ESH  9.065e+02
    #> 4   35623680 North America                   Canada    CAN  1.674e+06
    #> 5  326625791 North America United States of America    USA  1.856e+07
    #> 6   18556698          Asia               Kazakhstan    KAZ  4.607e+05
    #> 7   29748859          Asia               Uzbekistan    UZB  2.023e+05
    #> 8    6909701       Oceania         Papua New Guinea    PNG  2.802e+04
    #> 9  260580739          Asia                Indonesia    IDN  3.028e+06
    #> 10  44293293 South America                Argentina    ARG  8.794e+05
    #>                          geometry
    #> 1  MULTIPOLYGON (((180 -16.067...
    #> 2  POLYGON ((33.90371 -0.95, 3...
    #> 3  POLYGON ((-8.66559 27.65643...
    #> 4  MULTIPOLYGON (((-122.84 49,...
    #> 5  MULTIPOLYGON (((-122.84 49,...
    #> 6  POLYGON ((87.35997 49.21498...
    #> 7  POLYGON ((55.96819 41.30864...
    #> 8  MULTIPOLYGON (((141.0002 -2...
    #> 9  MULTIPOLYGON (((141.0002 -2...
    #> 10 MULTIPOLYGON (((-68.63401 -...
Read a Parquet file. Uses standard metadata information to identify geometry columns and coordinate reference system information.

    st_read_parquet(dsn, col_select = NULL, props = NULL, ...)

| Argument | Description |
|---|---|
| dsn | character file path to a data source |
| col_select | A character vector of column names to keep. Default is `NULL` |
| props | Now deprecated in |
| ... | additional parameters to pass to |

Value: object of class `sf`

Reference for the metadata used: https://github.com/geopandas/geo-arrow-spec. These are standard with the Python `GeoPandas` library.
Examples:

    # load Natural Earth low-res dataset.
    # Created in Python with GeoPandas.to_parquet()
    path <- system.file("extdata", package = "sfarrow")

    world <- st_read_parquet(file.path(path, "world.parquet"))

    world
    #> Simple feature collection with 177 features and 5 fields
    #> Geometry type: GEOMETRY
    #> Dimension:     XY
    #> Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.64513
    #> Geodetic CRS:  WGS 84
    #> First 10 features:
    #>      pop_est     continent                     name iso_a3 gdp_md_est
    #> 1     920938       Oceania                     Fiji    FJI  8.374e+03
    #> 2   53950935        Africa                 Tanzania    TZA  1.506e+05
    #> 3     603253        Africa                W. Sahara    ESH  9.065e+02
    #> 4   35623680 North America                   Canada    CAN  1.674e+06
    #> 5  326625791 North America United States of America    USA  1.856e+07
    #> 6   18556698          Asia               Kazakhstan    KAZ  4.607e+05
    #> 7   29748859          Asia               Uzbekistan    UZB  2.023e+05
    #> 8    6909701       Oceania         Papua New Guinea    PNG  2.802e+04
    #> 9  260580739          Asia                Indonesia    IDN  3.028e+06
    #> 10  44293293 South America                Argentina    ARG  8.794e+05
    #>                          geometry
    #> 1  MULTIPOLYGON (((180 -16.067...
    #> 2  POLYGON ((33.90371 -0.95, 3...
    #> 3  POLYGON ((-8.66559 27.65643...
    #> 4  MULTIPOLYGON (((-122.84 49,...
    #> 5  MULTIPOLYGON (((-122.84 49,...
    #> 6  POLYGON ((87.35997 49.21498...
    #> 7  POLYGON ((55.96819 41.30864...
    #> 8  MULTIPOLYGON (((141.0002 -2...
    #> 9  MULTIPOLYGON (((141.0002 -2...
    #> 10 MULTIPOLYGON (((-68.63401 -...
Convert a simple features spatial object from `sf` and write to a Feather file using `write_feather`. Geometry columns (type `sfc`) are converted to well-known binary (WKB) format.

    st_write_feather(obj, dsn, ...)

| Argument | Description |
|---|---|
| obj | object of class |
| dsn | data source name. A path and file name with .feather extension |
| ... | additional options to pass to |

Value: `obj` invisibly
Examples:

    # read spatial object
    nc <- sf::st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)

    # create temp file
    tf <- tempfile(fileext = '.feather')
    on.exit(unlink(tf))

    # write out object
    st_write_feather(obj = nc, dsn = tf)
    #> Warning: This is an initial implementation of Parquet/Feather file support and
    #> geo metadata. This is tracking version 0.1.0 of the metadata
    #> (https://github.com/geopandas/geo-arrow-spec). This metadata
    #> specification may change and does not yet make stability promises. We
    #> do not yet recommend using this in a production setting unless you are
    #> able to rewrite your Parquet/Feather files.

    # In Python, read the new file with geopandas.read_feather(...)
    # read back into R
    nc_f <- st_read_feather(tf)
Convert a simple features spatial object from `sf` and write to a Parquet file using `write_parquet`. Geometry columns (type `sfc`) are converted to well-known binary (WKB) format.

    st_write_parquet(obj, dsn, ...)

| Argument | Description |
|---|---|
| obj | object of class |
| dsn | data source name. A path and file name with .parquet extension |
| ... | additional options to pass to |

Value: `obj` invisibly
Examples:

    # read spatial object
    nc <- sf::st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)

    # create temp file
    tf <- tempfile(fileext = '.parquet')
    on.exit(unlink(tf))

    # write out object
    st_write_parquet(obj = nc, dsn = tf)
    #> Warning: This is an initial implementation of Parquet/Feather file support and
    #> geo metadata. This is tracking version 0.1.0 of the metadata
    #> (https://github.com/geopandas/geo-arrow-spec). This metadata
    #> specification may change and does not yet make stability promises. We
    #> do not yet recommend using this in a production setting unless you are
    #> able to rewrite your Parquet/Feather files.

    # In Python, read the new file with geopandas.read_parquet(...)
    # read back into R
    nc_p <- st_read_parquet(tf)
Source: R/st_arrow.R (validate_metadata.Rd)

Basic checking of key geo metadata columns

    validate_metadata(metadata)

| Argument | Description |
|---|---|
| metadata | list for geo metadata |

Value: None. Throws an error and stops execution
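As an illustration of this kind of check, the base R sketch below stops with an error when expected entries are missing from a geo metadata list. `sketch_validate_metadata()` is a hypothetical stand-in, and the specific keys it tests (`primary_column`, `columns`, `crs`, `encoding`) are assumptions drawn from the referenced metadata specification. The real `validate_metadata()` may check different things.

```r
# Hypothetical checker; not the package's validate_metadata().
sketch_validate_metadata <- function(metadata) {
  if (!is.list(metadata) || length(metadata) == 0) {
    stop("No geo metadata found.", call. = FALSE)
  }

  # top-level keys assumed to be required
  missing_keys <- setdiff(c("primary_column", "columns"), names(metadata))
  if (length(missing_keys) > 0) {
    stop("Geo metadata is missing: ",
         paste(missing_keys, collapse = ", "), call. = FALSE)
  }

  # the primary geometry column must be described under 'columns'
  if (!metadata$primary_column %in% names(metadata$columns)) {
    stop("Primary geometry column is not described in 'columns'.",
         call. = FALSE)
  }

  # every geometry column needs a CRS entry and an encoding entry
  for (col in names(metadata$columns)) {
    if (!all(c("crs", "encoding") %in% names(metadata$columns[[col]]))) {
      stop("Column '", col, "' needs 'crs' and 'encoding' entries.",
           call. = FALSE)
    }
  }
  invisible(NULL)
}

good <- list(primary_column = "geometry",
             columns = list(geometry = list(crs = NA, encoding = "WKB")))
sketch_validate_metadata(good)  # passes silently

bad <- list(columns = list())
result <- tryCatch(sketch_validate_metadata(bad),
                   error = function(e) conditionMessage(e))
result
```

Returning nothing on success and raising an error on failure matches the documented behaviour, so a caller can run the check before attempting any conversion.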