
```
netcdf G2G_DailyRiverFlow_NATURAL_RCM01_19801201_20801130 {
dimensions:
    Time = UNLIMITED ; // (36000 currently) --------> The 36000, together with the dates in the filename, tells us this is 100 years of daily data.
    RCM = 1 ; --------> This file only contains one 'RCM' (which stands for 'regional climate model')
    Northing = 1000 ;
    Easting = 700 ; --------> Northing and Easting suggest this is gridded data
variables:
    string RCM(RCM) ;
    float Northing(Northing) ;
        Northing:_FillValue = NaNf ;
        Northing:standard_name = "Northing" ;
        Northing:axis = "Y" ;
        Northing:units = "GB National Grid" ;
    float Easting(Easting) ;
        Easting:_FillValue = NaNf ;
        Easting:standard_name = "Easting" ;
        Easting:axis = "X" ;
        Easting:units = "GB National Grid" ;
    float Time(Time) ;
        Time:_FillValue = NaNf ;
        Time:standard_name = "Time" ;
        Time:axis = "T" ;
        Time:units = "days since 1961-01-01" ;
        Time:calendar = "360_day" ;
    float dmflow(RCM, Time, Northing, Easting) ; --------> The main data variable. It is effectively 3-dimensional (RCM is a singleton dimension)
        dmflow:_FillValue = -999.f ;
        dmflow:units = "m3 s-1" ;
        dmflow:standard_name = "dmflow" ;
        dmflow:long_name = "Daily mean river flow" ;
        dmflow:missing_value = -999.f ;

// global attributes:
        :_NCProperties = "version=2,netcdf=4.8.1,hdf5=1.12.2" ;
}
```

This information tells us we have 100 years of daily, gridded data for a single RCM in this file. From the other filenames:

```
G2G_DailyRiverFlow_NATURAL_RCM01_19801201_20801130.nc
G2G_DailyRiverFlow_NATURAL_RCM04_19801201_20801130.nc
G2G_DailyRiverFlow_NATURAL_RCM05_19801201_20801130.nc
G2G_DailyRiverFlow_NATURAL_RCM06_19801201_20801130.nc
G2G_DailyRiverFlow_NATURAL_RCM07_19801201_20801130.nc
G2G_DailyRiverFlow_NATURAL_RCM08_19801201_20801130.nc
G2G_DailyRiverFlow_NATURAL_RCM09_19801201_20801130.nc
G2G_DailyRiverFlow_NATURAL_RCM10_19801201_20801130.nc
G2G_DailyRiverFlow_NATURAL_RCM11_19801201_20801130.nc
G2G_DailyRiverFlow_NATURAL_RCM12_19801201_20801130.nc
G2G_DailyRiverFlow_NATURAL_RCM13_19801201_20801130.nc
G2G_DailyRiverFlow_NATURAL_RCM15_19801201_20801130.nc
```

we can deduce that each file contains a single RCM but is otherwise identical.
Therefore, to combine these files into a single zarr dataset we need to concatenate them along the RCM dimension.
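
If you want to confirm this from Python rather than by reading the header dump above, a quick sanity check with xarray looks something like the sketch below (the bare filename is illustrative; point it at wherever the files live on your system):

```
import xarray as xr

# Open one of the input files and inspect its dimensions.
ds = xr.open_dataset('G2G_DailyRiverFlow_NATURAL_RCM01_19801201_20801130.nc')
print(ds.dims)         # expect RCM: 1, Time: 36000, Northing: 1000, Easting: 700
print(ds.dmflow.dims)  # ('RCM', 'Time', 'Northing', 'Easting')
```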

Now that we know these key pieces of information, the next step is to create a workflow or 'recipe' that does this.

## The Workflow

The first step is to define a 'ConcatDim' object, which holds: the name of the dimension along which we want to concatenate the files; the values of that dimension in the files (in the order we'd like them concatenated); and, optionally, the number of elements of that dimension within each file, if it is constant (e.g. 30 for monthly files on a 360-day calendar).

```
from pangeo_forge_recipes.patterns import ConcatDim
RCMs = ['01', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '15']
RCM_concat_dim = ConcatDim("RCM", RCMs, nitems_per_file=1)
```
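
As an aside, to illustrate the optional `nitems_per_file` argument: if the data had instead been split into monthly files on the 360-day calendar, a purely hypothetical time ConcatDim might look like this:

```
from pangeo_forge_recipes.patterns import ConcatDim

# Hypothetical monthly file keys, invented for illustration only.
months = ['198012', '198101', '198102']
time_concat_dim = ConcatDim("Time", months, nitems_per_file=30)  # 30 days per month on a 360-day calendar
```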

Next, we define a function that translates a given RCM value (e.g. '04') into a file path. The function must take one argument per Combine Dimension, and each argument's name must match the name of the corresponding Combine Dimension.

```
import os

indir = '/users/sgsys/matbro/object_storage/object_storage/data/preproc'
pre = 'G2G_DailyRiverFlow_NATURAL_RCM'
suf = '_19801201_20801130.nc'

def make_path(RCM):
    return os.path.join(indir, pre + RCM + suf)
```
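
It is worth quickly checking that the function builds the paths we expect (the comment below just spells out what the string concatenation produces):

```
make_path('04')
# '/users/sgsys/matbro/object_storage/object_storage/data/preproc/G2G_DailyRiverFlow_NATURAL_RCM04_19801201_20801130.nc'
```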

Then this function and the ConcatDim object are combined into a FilePattern object:

```
from pangeo_forge_recipes.patterns import FilePattern
pattern = FilePattern(make_path, RCM_concat_dim)
```

Before running the full workflow it is a good idea to test it on a subset of the files. pangeo-forge-recipes has a built-in method for this:

```
pattern_pruned = pattern.prune()
```
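
To see exactly what the pruned pattern will feed into the workflow, you can iterate over its items; `prune()` keeps only the first few entries (two by default, at the time of writing):

```
# Each item is an (index, file path) pair describing one input file.
for index, fname in pattern_pruned.items():
    print(index, fname)
```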

Next, we specify where the converted data should be written and the chunk sizes it should be converted to. Here we choose to chunk the data yearly in time and by 100 km in space, as that is the optimum chunk size when both temporal-heavy and spatial-heavy analyses are likely to be carried out on the data (see ['Thinking Chunky Thoughts'](#thinking-chunky-thoughts---data-chunking-strategies)).

```
target_root = '/users/sgsys/matbro/object_storage/object_storage/data/output'  ## output folder
tn = 'test.zarr'  ## output filename

target_chunks = {'RCM': 1,
                 'Time': 360,
                 'Northing': 100,
                 'Easting': 100}  ## length of each dimension of the desired chunks
```
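
As a rough sanity check on this choice, each chunk of dmflow holds 1 × 360 × 100 × 100 float32 values, i.e. about 14 MB uncompressed, which sits comfortably in the commonly recommended range of roughly 10-100 MB per chunk:

```
import numpy as np

# Uncompressed size of one chunk of the float32 dmflow variable.
chunk_bytes = 1 * 360 * 100 * 100 * np.dtype('float32').itemsize
print(f'{chunk_bytes / 1e6:.1f} MB per chunk')  # ~14.4 MB
```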

The next step is to create the workflow/recipe we would like to run. pangeo-forge-recipes uses Apache Beam as its backend and requires the workflow to be specified using Apache Beam syntax, which looks a little unusual compared to most other Python code but is simple enough to understand: each step in the workflow is separated by a '|'. pangeo-forge-recipes provides workflow transforms that wrap the more fiddly Apache Beam code, simplifying the workflow massively. We use Beam's 'Create' function to initialise the workflow and build the Beam workflow object containing our file pattern, pass this to the pangeo-forge-recipes transform 'OpenWithXarray' (which, as you might expect, uses the xarray module to open our NetCDF files), and then to 'StoreToZarr' (which also does what you would expect). At each step we pass in the options we decided on above. In short, the workflow does the following: find the NetCDF files and their metadata --> open them with xarray --> convert/rechunk and write to disk as a zarr file.

```
import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenWithXarray, StoreToZarr

transforms = (
    beam.Create(pattern_pruned.items())
    | OpenWithXarray(file_type=pattern_pruned.file_type)
    | StoreToZarr(
        target_root=target_root,
        store_name=tn,
        combine_dims=pattern.combine_dim_keys,
        target_chunks=target_chunks,
    )
)
```

We haven't actually run the workflow yet, just set it up ready to be run. To run it:

```
with beam.Pipeline() as p:
    p | transforms
```

If you have access to multiple cores or a cluster/HPC, it is easy to run the pipeline in parallel, which will likely speed up the process dramatically. To do this, use:

```
from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(direct_num_workers=8, direct_running_mode="multi_processing")
with beam.Pipeline(options=beam_options) as p:
    p | transforms
```
changing 'direct_num_workers' to the number of workers (i.e. cores) you wish to use.
You can also run the pipeline on a SLURM cluster in this way; see [scripts/convert_G2G_beam.sbatch](scripts/convert_G2G_beam.sbatch).

The resulting dataset is a zarr datastore, chunked as we specified earlier:
```
import xarray as xr
xr.open_dataset('/work/scratch-pw2/mattjbr/testoutput.zarr')
```


```
import zarr
tzar = zarr.open('/work/scratch-pw2/mattjbr/testoutput.zarr/dmflow')
tzar.info
```

Pangeo-forge has greater capabilities than we have shown here: it can read/write directly to/from object storage, concatenate over multiple dimensions, and be customized to perform arbitrary pre- or post-processing tasks. More information and examples are available on the [pangeo-forge website](https://pangeo-forge.readthedocs.io/en/latest/index.html).
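
For example, concatenating over more than one dimension just means passing `FilePattern` more than one combine dimension and giving the path-building function a matching argument for each. The sketch below is purely hypothetical (the decade splitting and file-naming scheme are invented for illustration):

```
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

# Hypothetical: files split by RCM *and* by decade-long time slices.
decades = ['19801201_19901130', '19901201_20001130']
RCM_dim = ConcatDim("RCM", RCMs)
Time_dim = ConcatDim("Time", decades)

def make_path_2d(RCM, Time):
    # Invented naming scheme, for illustration only.
    return f'/some/dir/G2G_DailyRiverFlow_NATURAL_RCM{RCM}_{Time}.nc'

pattern_2d = FilePattern(make_path_2d, RCM_dim, Time_dim)
```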

Some further notes on [Apache Beam](https://beam.apache.org/). It is a backend designed to run large and complex data workflows in parallel. To achieve the best performance it has several 'runners', which translate Beam workflows into workflows specific to the compute architecture they run on. By default Beam uses a generic runner that can run on any architecture (the 'Direct Runner'), and that is what is used in the example here. Because of its generic nature, the Direct Runner does not have optimal performance. Furthermore, it is geared more towards testing than running full workloads, and so raises more errors than the other runners. This becomes a particular problem when writing files directly to object storage, as this is done over HTTP, which can often time out or drop out, stalling the workflow. The main other runners developed at the time of writing are for Google Cloud Dataflow, Apache Flink and Apache Spark. Fortunately, manually uploading data from disk to object storage outside of the pangeo-forge recipe is not too complicated.
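
Switching to one of those runners is, in principle, just a change of pipeline options; the sketch below shows roughly what this looks like for the Flink runner (the cluster address is illustrative, and the required Flink set-up and extra Beam dependencies are not shown or tested here):

```
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Assumes a reachable Flink cluster and the matching Beam Flink runner dependencies.
beam_options = PipelineOptions(runner="FlinkRunner", flink_master="localhost:8081")
with beam.Pipeline(options=beam_options) as p:
    p | transforms
```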

## The upload

To upload data we use the [s4cmd](https://github.com/bloomreach/s4cmd) command-line tool. This can be safely installed into the same conda environment as pangeo-forge-recipes using pip:
```
pip install s4cmd
```

Create a '.s3cfg' file in the home directory of whichever Linux/compute system you are on, containing your s3 access key/token and secret key (essentially a username and password, respectively), like so:
```
[default]
access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
secret_key = yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
```

List the buckets in your object store with
```
s4cmd ls --endpoint endpoint_url
```
where endpoint_url is the web address/'endpoint' of your object storage.

Make a new bucket in your object store with
```
s4cmd mb s3://<bucket_name> --endpoint endpoint_url
```
replacing `<bucket_name>` with the name of your new bucket.