├── LICENSE
├── README.md
├── build
├── clean
├── default.config
├── mosx
│   ├── MesoPy.py
│   ├── __init__.py
│   ├── bufr
│   │   ├── __init__.py
│   │   └── methods.py
│   ├── configspec
│   ├── estimators.py
│   ├── model
│   │   ├── __init__.py
│   │   ├── model.py
│   │   ├── predictors.py
│   │   └── scorers.py
│   ├── obs
│   │   ├── __init__.py
│   │   └── methods.py
│   ├── util.py
│   └── verification
│       ├── __init__.py
│       └── methods.py
├── performance
├── run
├── validate
└── verify
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 Jonathan Weyn
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # MOS-X
2 |
3 | MOS-X is a machine learning-based forecasting model built in Python and designed to produce output tailored for the [WxChallenge](http://www.wxchallenge.com) weather forecasting competition.
4 | It uses an external executable to download and process time-height profiles of model data from the National Centers for Environmental Prediction (NCEP) Global Forecast System (GFS) and North American Mesoscale (NAM) models.
5 | These data, along with surface observations from MesoWest, are used to train any of scikit-learn's ML algorithms to predict tomorrow's high temperature, low temperature, peak 2-minute sustained wind speed, and rain total.
6 |
7 | ## Installing
8 |
9 | ### Requirements
10 |
11 | - Python 2.7 (no Python 3 yet, and probably never, because this is a toy project)
12 | - A workstation with a recent Linux installation... sorry, that's all that will work with the next item...
13 | - [BUFRgruven](http://strc.comet.ucar.edu/software/bgruven/) - for model data
14 | - An API key for [MesoWest](https://synopticlabs.org/api/mesonet/) - unfortunately the API now has a limited free tier. MOS-X currently does a poor job of caching data, so large data sets will exceed the free limit; use with caution.
15 | - A decent amount of free disk space - some of the models are > 1 GB pickle files... not to mention all the BUFKIT files...
16 |
17 | ### Python packages - easier with conda
18 |
19 | - NumPy
20 | - scipy
21 | - pandas
22 | - ConfigObj (and validate)
23 | - ulmo (use conda-forge)
24 | - the excellent [scikit-learn](http://scikit-learn.org/stable/index.html)
25 |
26 | ### Installation
27 |
28 | Nothing to do, really. Just make sure the scripts in the main directory (`build`, `run`, `verify`, `validate`, and `performance`) are executable, for example:
29 |
30 | `chmod +x build run verify validate performance`
31 |
32 | ## Building a model
33 |
34 | 1. The first thing to do is to set up the config file for the particular site to forecast for. The `default.config` file has a good number of comments describing how to do that. Parameters that are not marked 'optional' or given a default value must be specified.
35 |     - The parameter `climo_station_id` is now automatically generated!
36 |     - It is not recommended to use the upper-air sounding data option. In my testing, adding sounding data made no difference to the skill of the models, but YMMV. Use with caution. I don't test it.
37 | 2. Once the config is set up, build the model using `build <config_file>`. The config reader will automatically look for `<config_file>.config` too, so if you're like me and like to call your config files `KSEA.config`, it's handy to just pass `KSEA`.
38 |     - Depending on how much training data is requested, it may take several hours for BUFRgruven to download everything.
39 |     - Actually building the scikit-learn model, however, takes only 10 minutes for a 1000-tree random forest on a 16-core machine.
40 |
41 | ## Running the model
42 |
43 | - Run the model for tomorrow with `run <config_file>`, or give it any date to run on (see the end-to-end sketch at the bottom of this README).
44 | - Verify the model prediction against the truth and against GFS and NAM MOS products with `verify <config_file>`.
45 | - The `validate` script is essentially a glorified `verify` over an entire user-specified range of dates.
46 |
47 | ## Some notes on advanced model configurations
48 |
49 | - There is built-in functionality for building a model that predicts a time series of hourly temperature, relative humidity, wind speed, and rain for the forecast period in addition to the daily values. While handy for getting an idea of the temporal variation of predicted weather, it has limited use and makes the pickled model file much larger.
50 | - Rain forecasting is difficult for an ML model: rain values are highly non-normally distributed. There is the option to use a post-processor model, which is another random forest, trained on the distribution of output from the base model's trees. It improves the rain forecast a little, particularly by doing a better job of predicting 0 on sunny days.
51 | - Rain forecasting can now be done in three different ways: `quantity`, which is the standard prediction of an actual daily rain total; `pop`, the probability of precipitation; and `categorical`, which uses the MOS categories.
52 |
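As an end-to-end sketch, a typical session might look like the following (the `KSEA` station ID and config file name are hypothetical, and the first `build` can take hours while BUFRgruven downloads data):

```
# assumes a KSEA.config file exists in the current directory
./build KSEA      # download training data and train the estimator
./run KSEA        # predict tomorrow's high/low/wind/rain
./verify KSEA     # compare the forecast against obs and GFS/NAM MOS
./validate KSEA   # verification over the date range in [Validate]
```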
--------------------------------------------------------------------------------
/build:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 | #
3 | # Copyright (c) 2018 Jonathan Weyn
4 | #
5 | # See the file LICENSE for your rights.
6 | #
7 |
8 | """
9 | Build the estimator for the MOS-X model.
10 | """ 11 | 12 | import sys 13 | import mosx 14 | import pickle 15 | from optparse import OptionParser 16 | from multiprocessing import Process 17 | 18 | 19 | def get_command_options(): 20 | parser = OptionParser() 21 | parser.add_option('-e', '--use-existing-files', dest='use_existing', action='store_true', default=False, 22 | help='Use existing BUFR, obs, and verification files: use with caution!') 23 | (opts, args) = parser.parse_args() 24 | return opts, args 25 | 26 | 27 | # Get the config dictionary 28 | 29 | options, arguments = get_command_options() 30 | use_existing = options.use_existing 31 | try: 32 | config_file = arguments[0] 33 | except IndexError: 34 | print('Required argument (config file) not provided.') 35 | sys.exit(1) 36 | config = mosx.util.get_config(config_file) 37 | 38 | 39 | # Retrieve data; parallelize BUFR and OBS 40 | 41 | bufr_file = '%s/%s_bufr_train.pkl' % (config['SITE_ROOT'], config['station_id']) 42 | obs_file = '%s/%s_obs_train.pkl' % (config['SITE_ROOT'], config['station_id']) 43 | verif_file = '%s/%s_verif_train.pkl' % (config['SITE_ROOT'], config['station_id']) 44 | predictor_file = '%s/%s_predictors_train.pkl' % (config['SITE_ROOT'], config['station_id']) 45 | 46 | 47 | def get_bufr(): 48 | print('\n--- MOS-X build: initiating BUFR data retrieval...\n') 49 | mosx.bufr.bufr(config, bufr_file) 50 | 51 | 52 | def get_obs(): 53 | print('\n--- MOS-X build: initiating OBS data retrieval...\n') 54 | mosx.obs.obs(config, obs_file) 55 | 56 | 57 | if not use_existing: 58 | if __name__ == '__main__': 59 | p1 = Process(target=get_bufr) 60 | p1.start() 61 | p2 = Process(target=get_obs) 62 | p2.start() 63 | p1.join() 64 | p2.join() 65 | 66 | print('\n--- MOS-X build: retrieving VERIF data...\n') 67 | if config['Obs']['use_climo_wind']: 68 | mosx.verification.verification(config, verif_file) 69 | else: 70 | mosx.verification.verification(config, verif_file, use_climo=False, use_cf6=False) 71 | 72 | # Re-format predictors even if using existing raw data 73 | print('\n--- MOS-X build: formatting predictor and target data...\n') 74 | mosx.model.format_predictors(config, bufr_file, obs_file, verif_file, predictor_file) 75 | 76 | 77 | # Train the estimator and save it 78 | 79 | print('\n--- MOS-X build: generating and training estimator...\n') 80 | 81 | p_test, t_test, r_test = mosx.model.train(config, predictor_file, config['Model']['estimator_file'], test_size=1) 82 | 83 | 84 | # Test the predictor 85 | 86 | print('\n--- MOS-X build: predicting for a test of 1 day...\n') 87 | 88 | print('Loading estimator file %s' % config['Model']['estimator_file']) 89 | with open(config['Model']['estimator_file'], 'rb') as handle: 90 | estimator = pickle.load(handle) 91 | rain_tuning = config['Model'].get('Rain tuning', None) 92 | 93 | # Make a prediction 94 | expected = t_test 95 | if rain_tuning is not None and mosx.util.to_bool(rain_tuning.get('use_raw_rain', False)): 96 | predicted = estimator.predict(p_test, rain_array=r_test) 97 | else: 98 | predicted = estimator.predict(p_test) 99 | 100 | num_test = t_test.shape[0] 101 | for t in range(num_test): 102 | print('\nFor day %d, the predicted forecast is' % t) 103 | print('%0.0f/%0.0f/%0.0f/%0.2f' % tuple(predicted[t, :4])) 104 | print(' while the verification was') 105 | print('%0.0f/%0.0f/%0.0f/%0.2f' % tuple(expected[t, :4])) 106 | -------------------------------------------------------------------------------- /clean: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | 
# 3 | # Copyright (c) 2018 Jonathan Weyn 4 | # 5 | # See the file LICENSE for your rights. 6 | # 7 | 8 | """ 9 | Clean the archived files for a particular site. 10 | """ 11 | 12 | import os 13 | import sys 14 | import mosx 15 | from optparse import OptionParser 16 | 17 | 18 | def get_command_options(): 19 | parser = OptionParser() 20 | parser.add_option('-s', '--station-id', dest='station_id', action='store', type='string', default='', 21 | help='Station ID to clean (required!)') 22 | parser.add_option('-d', '--no-remove-site-data', dest='data_flag', action='store_true', default=False, 23 | help="Don't delete files in site_data") 24 | parser.add_option('-b', '--remove-bufr', dest='bufr_flag', action='store_true', default=False, 25 | help='Delete archived BUFKIT files') 26 | parser.add_option('-u', '--remove-upper-air', dest='sndg_flag', action='store_true', default=False, 27 | help='Delete archived sounding (upper air) files') 28 | parser.add_option('-m', '--remove-model', dest='model_flag', action='store_true', default=False, 29 | help='Delete estimator files *mosx*.pkl') 30 | parser.add_option('-v', '--verbose', dest='verbose', action='store_true', default=False, 31 | help='Print some extra statements') 32 | (opts, args) = parser.parse_args() 33 | return opts, args 34 | 35 | 36 | options, arguments = get_command_options() 37 | station_id, data_flag, bufr_flag, sndg_flag, model_flag, verbose = (options.station_id, options.data_flag, 38 | options.bufr_flag, options.sndg_flag, 39 | options.model_flag, options.verbose) 40 | try: 41 | config_file = arguments[0] 42 | except IndexError: 43 | print('Required argument (config file) not provided.') 44 | sys.exit(1) 45 | config = mosx.util.get_config(config_file) 46 | 47 | bufr_data_dir = config['BUFR']['bufr_data_dir'] 48 | sndg_data_dir = config['Obs']['sounding_data_dir'] 49 | 50 | if station_id == '': 51 | print('\nStation ID (-s) required! 
Use --help or -h for full options.')
52 |     sys.exit(1)
53 |
54 | station_id = station_id.upper()
55 | print('mosx_clean: Cleaning files for station %s' % station_id)
56 | print(' Files to delete:')
57 | if not data_flag:
58 |     print(' Site data in %s' % config['SITE_ROOT'])
59 | if bufr_flag:
60 |     print(' BUFKIT archives in %s' % bufr_data_dir)
61 | if sndg_flag:
62 |     print(' Sounding files in %s' % sndg_data_dir)
63 | if model_flag:
64 |     print(" Model estimator in config or files containing 'mosx'")
65 |
66 | rm_command = 'rm -f'
67 |
68 | if not data_flag:
69 |     if verbose:
70 |         print('mosx_clean: deleting %s files in %s' % (station_id, config['SITE_ROOT']))
71 |     listing = os.listdir(config['SITE_ROOT'])
72 |     if not model_flag:
73 |         # Build a filtered list instead of calling listing.remove() inside the
74 |         # loop, which skips elements while iterating
75 |         estimator_file_name = config['Model']['estimator_file'].split('/')[-1]
76 |         listing = [ll for ll in listing if ll != estimator_file_name and
77 |                    not ('mosx' in ll.lower() and station_id in ll.upper())]
78 |     else:
79 |         if verbose:
80 |             print('mosx_clean: deleting model estimator files')
81 |     for f in listing:
82 |         if not f.upper().startswith(station_id):
83 |             continue
84 |         command = '%s %s/%s' % (rm_command, config['SITE_ROOT'], f)
85 |         if verbose:
86 |             print(command)
87 |         os.system(command)
88 |
89 | if bufr_flag:
90 |     if verbose:
91 |         print('mosx_clean: deleting bufr files in %s' % bufr_data_dir)
92 |     command = '%s %s/*%s*' % (rm_command, bufr_data_dir, station_id.lower())
93 |     if verbose:
94 |         print(command)
95 |     os.system(command)
96 |
97 | if sndg_flag:
98 |     if verbose:
99 |         print('mosx_clean: deleting sounding files in %s' % sndg_data_dir)
100 |     command = '%s %s/%s*' % (rm_command, sndg_data_dir, station_id)
101 |     if verbose:
102 |         print(command)
103 |     os.system(command)
104 |
105 | print('\nDone.')
106 |
--------------------------------------------------------------------------------
/default.config:
--------------------------------------------------------------------------------
1 | ########################################################################################################################
2 | # This is the configuration file for the MOS-X model. Parameters specified here are used to build and run the
3 | # machine learning model to predict a few weather parameters at a given station location.
4 | ########################################################################################################################
5 | # Global parameters go here.
6 |
7 | # The root directory of this program
8 | MOSX_ROOT =
9 |
10 | # The root directory of the BUFRgruven executable program
11 | BUFR_ROOT =
12 |
13 | # The directory where site-specific data files are saved (defaults to %MOSX_ROOT/site_data)
14 | SITE_ROOT =
15 |
16 | # The 4-letter station ID of the location
17 | station_id = KAUS
18 |
19 | # Lowest pressure level for vertical profiles. For stations at high elevation, the lowest pressure level may have to be
20 | # higher because data will be missing at higher pressures.
21 | lowest_p_level = 950
22 |
23 | # Starting and ending dates for the model training. BUFR data are available beginning 2010-01-01. If the parameter
24 | # 'is_season' is True, then the program will assume that you want seasonally-subset training data beginning in the year
25 | # of 'data_start_date' and ending in the year of 'data_end_date', with seasons spanning from the start day to the end
26 | # day. Dates should be provided as YYYYMMDD.
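# For example (hypothetical dates): with is_season = True, data_start_date = 20101120, and
# data_end_date = 20160409, training uses the seasons 2010-11-20 to 2011-04-09,
# 2011-11-20 to 2012-04-09, and so on, ending with 2015-11-20 to 2016-04-09.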
27 | data_start_date = 20081120
28 | data_end_date = 20160409
29 | is_season = True
30 |
31 | # This is the UTC hour at which the forecast period starts on any given day. The period ends 24 hours later. (Defaults
32 | # to 6.)
33 | forecast_hour_start = 6
34 |
35 | # Provide here the hourly resolution of the time series prediction. Must be an integer from 1 to 6 that divides evenly
36 | # into 24 (so 5 is not allowed). Ignored if the parameter 'predict_timeseries' under 'Model' is False. (Defaults to 3.)
37 | time_series_interval = 3
38 |
39 | # API token for MesoWest data (required). To get an API key, visit https://synopticlabs.org/api/
40 | meso_token =
41 |
42 | # Produce verbose output
43 | verbose = True
44 |
45 |
46 | ########################################################################################################################
47 | # In this section, provide parameters used to control the retrieval of BUFR profile model data.
48 |
49 | [BUFR]
50 |
51 | # (Optional) The station ID used in BUFR for the station (defaults to station_id, may be different though)
52 | bufr_station_id =
53 |
54 | # (Optional) Data directory where BUFKIT files are saved (defaults to %SITE_ROOT/bufkit)
55 | bufr_data_dir =
56 |
57 | # (Optional) Path to bufrgruven executable (defaults to %BUFR_ROOT/bufr_gruven.pl)
58 | bufrgruven =
59 |
60 | # BUFR models; should be a list
61 | models = GFS, NAM
62 |
63 |
64 | ########################################################################################################################
65 | # In this section, provide parameters to control the retrieval of observations.
66 |
67 | [Obs]
68 |
69 | # Option to use upper-air sounding data
70 | use_soundings = False
71 |
72 | # Upper-air sounding station ID, required if use_soundings is True
73 | sounding_station_id = FWD
74 |
75 | # (Optional) Data directory where sounding files are saved (defaults to %SITE_ROOT/soundings)
76 | sounding_data_dir =
77 |
78 | # Set this option to False to disable retrieval of NCDC and CF6 data for max wind values. It may need to be
79 | # disabled if the forecast start hour is not near 6 UTC, otherwise should be True. (Defaults to True.)
80 | use_climo_wind = True
81 |
82 |
83 | ########################################################################################################################
84 | # In this section, provide the machine learning model parameters.
85 |
86 | [Model]
87 |
88 | # Save the model estimator using pickle to this file
89 | estimator_file = %(SITE_ROOT)s/%(station_id)s_mosx.pkl
90 |
91 | # The base scikit-learn regressor
92 | regressor = ensemble.RandomForestRegressor
93 |
94 | # If True, trains a separate estimator for each weather parameter
95 | train_individual = True
96 |
97 | # If True, also predict a meteorology time series for the next day, in addition to high/low/wind/rain
98 | predict_timeseries = False
99 |
100 | # This is the type of rain forecast. 'quantity' predicts an actual amount of rain in inches, 'categorical'
101 | # predicts a probabilistic category of rain (a la MOS), and 'pop' (probability of precipitation) predicts the
102 | # fractional chance that there will be ANY measurable precipitation.
103 | rain_forecast_type = pop
104 |
105 | # Keyword arguments passed to the base regressor
106 | [[Parameters]]
107 | n_estimators = 1000
108 | max_features = 0.75
109 | n_jobs = 2
110 | verbose = 1
111 |
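# As a sketch of what the two settings above mean in Python: the 'regressor' string names the
# scikit-learn class, and the [[Parameters]] block is passed to it as keyword arguments, i.e.
# sklearn.ensemble.RandomForestRegressor(n_estimators=1000, max_features=0.75, n_jobs=2, verbose=1)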
112 | # This section, if present, enables a post-processing algorithm to be trained on the raw rain predictions from a
113 | # native ensemble regressor. This is usually desirable because rain has a very non-normal distribution and is
114 | # therefore tricky for standard algorithms to predict. The parameter 'rain_estimator' is a string, just like
115 | # 'regressor' above, which determines the scikit-learn algorithm to use (classifiers are an option!). The parameter
116 | # 'use_raw_rain' determines whether the raw rainfall estimates from the BUFR models are also used as features for
117 | # the rain post-processor, in addition to the model ensemble statistics. Any other parameters provided here are
118 | # passed as kwargs to the initialization of the processor's scikit-learn algorithm. Due to certain methods
119 | # used in Bootstrapping, use_raw_rain is currently not compatible with the Bootstrapping option below.
120 | # [[Rain tuning]]
121 | # rain_estimator = ensemble.RandomForestRegressor
122 | # use_raw_rain = False
123 | # n_estimators = 100
124 |
125 | # Ada boosting may improve the performance of a model by selectively increasing the training weight of samples
126 | # with a large training error. If Ada boosting is desired, provide here any parameters passed to the Ada class;
127 | # otherwise, remove or comment out this subsection.
128 | # [[Ada boosting]]
129 | # n_estimators = 50
130 |
131 | # This last option allows for bootstrapping development of an ensemble of ML models. The training set is split
132 | # according to a few options here, then an ensemble of n_members is generated by training on individual splits.
133 | # Comment out this section to disable bootstrapping.
134 | # [[Bootstrapping]]
135 | # n_members = 10
136 | # # The number of training samples per split (if int), or the fraction (if float)
137 | # n_samples_split = 0.1
138 | # # If 1, each split contains no sample present in any other split. Also overrides n_samples_split and sets it
139 | # # as the maximum available per split. Otherwise, set to 0.
140 | # unique_splits = 0
141 |
142 |
143 | ########################################################################################################################
144 | # There are a few parameters here for validation of the model.
145 |
146 | [Validate]
147 |
148 | # Start and end dates for the validation
149 | start_date = 20161120
150 | end_date = 20170409
151 |
152 |
153 | ########################################################################################################################
154 | # The run executable allows for an upload to an FTP or SFTP server, for example, to post data to a website. The forecast
155 | # data are aggregated over all runs and uploaded as one or more of the file types listed below, and if a plot
156 | # of forecast ensemble distributions is requested, that plot is also uploaded.
157 |
158 | [Upload]
159 |
160 | # Type of file to upload forecasts in. Can be 'pickle' (one file for all forecasts), 'json' (one file per forecast
161 | # date), 'uw_text' (one file per forecast date, only a short version with high/pop), or a list of several.
162 | file_type = json
163 |
164 | # User name, server, and directory on server. Prompts for password, or set an ssh key. The forecast_directory
165 | # points to where the forecasts are uploaded, while the plot_directory points to where plots are uploaded. If
166 | # 'username' and 'server' are both empty, then assumes local directories are specified.
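# For example (all values hypothetical):
# username = forecaster
# server = sftp.example.com
# forecast_directory = /var/www/forecasts
# plot_directory = /var/www/plots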
167 | username =
168 | server =
169 | forecast_directory =
170 | plot_directory =
171 |
--------------------------------------------------------------------------------
/mosx/__init__.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright (c) 2018 Jonathan Weyn
3 | #
4 | # See the file LICENSE for your rights.
5 | #
6 |
7 | """
8 | MOS-X python package. Tools for building, running, and validating a machine learning model for predicting tomorrow's
9 | weather. 'MOS-X' represents an intelligent improvement to the traditional Model Output Statistics of the National
10 | Weather Service and other meteorological institutions.
11 |
12 | ============
13 | Version history
14 | ------
15 | 2017-10-26: Version 0.0
16 | 2017-11-03: Version 0.2
17 |     -- added upper-air sounding data from metpy
18 | 2017-11-07: Version 0.3
19 |     -- fixed rain not counted in cumulative values from skipped intervals in bufkit profiles
20 |     -- fixed issue with end year when season overlaps to new year
21 | 2017-11-22: Version 0.3.1
22 |     -- added optional directory to save bufkit files
23 |     -- added saving of sounding files
24 |     -- added options for random forest object
25 |     -- edit return_precip option in mosx_predictors to handle multiple days
26 |     -- add return_dates option to mosx_predictors
27 | 2017-12-06: Version 0.4
28 |     -- added bufkit model predictions of high, low, and max wind
29 |     -- added option to select neural network regressor
30 |     -- changed verification to ONLY use CF6/climo wind if available
31 |     -- added option to not use existing sounding files in mosx_obs
32 |     -- change upper_air to not save nan soundings
33 |     -- improved handling of missing verification values with missing cf6 wind
34 |     -- added options to ignore cf6 and/or climo in mosx_verif
35 | 2018-01-16: Version 0.5
36 |     -- added training design to individually train a forest for each weather parameter
37 |     -- added the option to use any scikit-learn regressor
38 |     -- added use_soundings option to optionally omit sounding data
39 |     -- changed pressure levels to 925, 850, 750, 600
40 |     -- pondered adding model times 18Z and 00Z before forecast start time
41 |     -- changed obs dt to 3 hrs
42 |     -- changed bufkit surface dt to 3 hrs
43 |     -- added methodology for generating diagnostic variables from BUFR
44 |     -- added 'temperature advection index' using above
45 | 2018-01-18: Version 0.6
46 |     -- added ability to forecast hourly time series
47 |     -- added class for time series estimator
48 |     -- fixed bug where last 6 hours were not included in obs verifications at the end of a season
49 |     -- improved handling of array conversion of obs data by requiring pandas to export OrderedDict and
50 |        using get_array commands
51 |     -- fixed a bug where daily verification would fail when verification is unavailable for a day
52 | 2018-03-06: Version 0.6.1
53 |     -- fixed critical bug where obs dataframe was not sorted when running for only one day
54 | 2018-03-12: Version 0.6.2
55 |     -- added support for the Ada Boosting estimator wrapper
56 | 2018-03-14: Version 0.7.0
57 |     -- completely changed the structure of the project to be modular
58 | 2018-03-20: Version 0.7.1
59 |     -- fixed a bug that would result in an extra set of API dates when is_season is False
60 |     -- fixed some issues related to the use of the config file
61 | 2018-03-21: Version 0.8.0
62 |     -- added scorers
63 |     -- added learning curves in performance metrics; more to come
64 | 2018-03-27: Version 0.8.1
65 |     -- fixed an error in util.generate_dates that
failed to produce all the dates when is_season is False 66 | 2018-04-10: Version 0.9.0 67 | -- better implementation of base estimator attributes in TimeSeriesEstimator and RainTuningEstimator classes 68 | -- added submodule 'predict' for unified predictions 69 | -- improved handling of the raw precipitation values to allow them to be used in rain tuning 70 | -- added config option for the starting hour of a forecast day 71 | -- added config option to predict probability of precipitation rather than quantity 72 | -- added automatic fetching of CF6 files 73 | -- added automatic retrieval of climo_station_id (removed from default.config) 74 | -- added time_series_interval parameter to output coarser time series 75 | -- added option to disable climo/cf6 wind retrieval, for non-WxChallenge purposes 76 | 2018-05-03: Version 0.10.0 77 | -- moved special estimator classes to mosx.estimators 78 | -- added bootstrapping training estimator 79 | -- added ability to select the type of estimator for rain tuning 80 | -- re-organized the 'train' and 'predict' modules to 'model' 81 | 2018-06-11: Version 0.10.2 82 | -- fixed an error in obs retrieval that retrieved one data point too many 83 | -- added many plots to 'performance' 84 | 2018-10-22: Version 0.10.3 85 | -- added option to write more than one file type 86 | 87 | """ 88 | 89 | import bufr 90 | import estimators 91 | import obs 92 | import model 93 | import util 94 | import verification 95 | 96 | __version__ = '0.10.3' 97 | -------------------------------------------------------------------------------- /mosx/bufr/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing BUFR data. 9 | """ 10 | 11 | from .methods import * 12 | -------------------------------------------------------------------------------- /mosx/bufr/methods.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing BUFR data. 9 | """ 10 | 11 | from mosx.util import generate_dates, get_array 12 | from collections import OrderedDict 13 | import re 14 | import os 15 | import pickle 16 | import numpy as np 17 | from datetime import datetime, timedelta 18 | from scipy import interpolate 19 | 20 | 21 | def bufkit_parser_time_height(config, file_name, interval=1, start_dt=None, end_dt=None): 22 | """ 23 | By Luke Madaus. Modified by jweyn. 24 | Returns a dictionary of time-height profiles from a BUFKIT file, with profiles interpolated to a basic set of 25 | pressure levels. 
26 | 27 | :param config: 28 | :param file_name: str: full path to bufkit file name 29 | :param interval: int: process data every 'interval' hours 30 | :param start_dt: datetime: starting time for data processing 31 | :param end_dt: datetime: ending time for data processing 32 | :return: dict: ugly dictionary of processed values 33 | """ 34 | # Open the file 35 | infile = open(file_name, 'r') 36 | 37 | profile = OrderedDict() 38 | 39 | # Find the block that contains the description of what everything is (header information) 40 | block_lines = [] 41 | inblock = False 42 | block_found = False 43 | for line in infile: 44 | if line.startswith('PRES TMPC') and not block_found: 45 | # We've found the line that starts the header info 46 | inblock = True 47 | block_lines.append(line) 48 | elif inblock: 49 | # Keep appending lines until we start hitting numbers 50 | if re.match('^\d{3}|^\d{4}', line): 51 | inblock = False 52 | block_found = True 53 | else: 54 | block_lines.append(line) 55 | 56 | # Now compute the remaining number of variables 57 | re_string = '' 58 | for line in block_lines: 59 | dum_num = len(line.split()) 60 | for n in range(dum_num): 61 | re_string = re_string + '(-?\d{1,5}.\d{2}) ' 62 | re_string = re_string[:-1] # Get rid of the trailing space 63 | re_string = re_string + '\r\n' 64 | 65 | # Compile this re_string for more efficient re searches 66 | block_expr = re.compile(re_string) 67 | 68 | # Now get corresponding indices of the variables we need 69 | full_line = '' 70 | for r in block_lines: 71 | full_line = full_line + r[:-2] + ' ' 72 | # Now split it 73 | varlist = re.split('[ /]', full_line) 74 | # Get rid of trailing space 75 | varlist = varlist[:-1] 76 | 77 | # Variables we want 78 | vars_desired = ['TMPC', 'DWPC', 'UWND', 'VWND', 'HGHT'] 79 | 80 | # Pressure levels to interpolate to 81 | plevs = [600, 750, 850, 925] 82 | plevs = [p for p in plevs if p <= float(config['lowest_p_level'])] 83 | 84 | # We now need to break everything up into a chunk for each 85 | # forecast date and time 86 | with open(file_name) as infile: 87 | blocks = infile.read().split('STID') 88 | for block in blocks: 89 | interp_plevs = [] 90 | header = block 91 | if header.split()[0] != '=': 92 | continue 93 | fcst_time = re.search('TIME = (\d{6}/\d{4})', header).groups()[0] 94 | fcst_dt = datetime.strptime(fcst_time, '%y%m%d/%H%M') 95 | if start_dt is not None and fcst_dt < start_dt: 96 | continue 97 | if end_dt is not None and fcst_dt > end_dt: 98 | break 99 | if fcst_dt.hour % interval != 0: 100 | continue 101 | temp_vars = OrderedDict() 102 | for var in varlist: 103 | temp_vars[var] = [] 104 | temp_vars['PRES'] = [] 105 | for block_match in block_expr.finditer(block): 106 | vals = block_match.groups() 107 | for val, name in zip(vals, varlist): 108 | if float(val) == -9999.: 109 | temp_vars[name].append(np.nan) 110 | else: 111 | temp_vars[name].append(float(val)) 112 | 113 | # Unfortunately, bufkit values aren't always uniformly distributed. 
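# For example, a profile may report 1000, 975, 962, ... hPa at one forecast hour and a
# slightly different set at the next, so each desired variable is linearly interpolated
# (scipy interp1d) onto the fixed 'plevs' list defined above.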
114 | final_vars = OrderedDict() 115 | cur_plevs = temp_vars['PRES'] 116 | cur_plevs.reverse() 117 | for var in varlist[1:]: 118 | if var in (vars_desired + ['SKNT', 'DRCT']): 119 | values = temp_vars[var] 120 | values.reverse() 121 | interp_plevs = list(plevs) 122 | num_plevs = len(interp_plevs) 123 | f = interpolate.interp1d(cur_plevs, values, bounds_error=False) 124 | interp_vals = f(interp_plevs) 125 | interp_array = np.full((len(plevs)), np.nan) 126 | # Array almost certainly missing values at high pressures 127 | interp_array[:num_plevs] = interp_vals 128 | interp_vals = list(interp_array) 129 | interp_plevs = list(plevs) # use original array 130 | interp_vals.reverse() 131 | interp_plevs.reverse() 132 | if var == 'SKNT': 133 | wspd = np.array(interp_vals) 134 | if var == 'DRCT': 135 | wdir = np.array(interp_vals) 136 | if var in vars_desired: 137 | final_vars[var] = interp_vals 138 | final_vars['PRES'] = interp_plevs 139 | if 'UWND' not in final_vars.keys(): 140 | final_vars['UWND'] = list(wspd * np.sin(wdir * np.pi/180. - np.pi)) 141 | if 'VWND' not in final_vars.keys(): 142 | final_vars['VWND'] = list(wspd * np.cos(wdir * np.pi/180. - np.pi)) 143 | profile[fcst_dt] = final_vars 144 | 145 | return profile 146 | 147 | 148 | def bufkit_parser_surface(file_name, interval=1, start_dt=None, end_dt=None): 149 | """ 150 | By Luke Madaus. Modified by jweyn. 151 | Returns a dictionary of surface data from a BUFKIT file. 152 | 153 | :param file_name: str: full path to bufkit file name 154 | :param interval: int: process data every 'interval' hours 155 | :param start_dt: datetime: starting time for data processing 156 | :param end_dt: datetime: ending time for data processing 157 | :return: dict: ugly dictionary of processed values 158 | """ 159 | # Load the file 160 | infile = open(file_name, 'r') 161 | sfc_dict = OrderedDict() 162 | 163 | block_lines = [] 164 | inblock = False 165 | for line in infile: 166 | if re.search('SELV', line): 167 | try: # jweyn 168 | elev = re.search('SELV = -?(\d{1,4})', line).groups()[0] # jweyn: -? 
169 |                 elev = float(elev)
170 |             except:
171 |                 elev = 0.0
172 |         if line.startswith('STN YY'):
173 |             # We've found the line that starts the header info
174 |             inblock = True
175 |             block_lines.append(line)
176 |         elif inblock:
177 |             # Keep appending lines until we start hitting numbers
178 |             if re.search('\d{6}', line):
179 |                 inblock = False
180 |             else:
181 |                 block_lines.append(line)
182 |
183 |     # Build an re search pattern based on this
184 |     # We know the first two parts of the section are station id num and date
185 |     re_string = "(\d{6}|\w{4}) (\d{6})/(\d{4})"
186 |     # Now compute the remaining number of variables
187 |     dum_num = len(block_lines[0].split()) - 2
188 |     for n in range(dum_num):
189 |         re_string = re_string + " (-?\d{1,4}.\d{2})"
190 |     re_string = re_string + '\r\n'
191 |     for line in block_lines[1:]:
192 |         dum_num = len(line.split())
193 |         for n in range(dum_num):
194 |             re_string = re_string + '(-?\d{1,4}.\d{2}) '
195 |         re_string = re_string[:-1]  # Get rid of the trailing space
196 |         re_string = re_string + '\r\n'
197 |
198 |     # Compile this re_string for more efficient re searches
199 |     block_expr = re.compile(re_string)
200 |
201 |     # Now get corresponding indices of the variables we need
202 |     full_line = ''
203 |     for r in block_lines:
204 |         full_line = full_line + r[:-2] + ' '
205 |     # Now split it
206 |     varlist = re.split('[ /]', full_line)
207 |
208 |     with open(file_name) as infile:
209 |         # Now loop through all blocks that match the search pattern we defined above
210 |         blocknum = -1
211 |         # For rain total in missing times
212 |         temp_rain = 0.
213 |         # For max temp, min temp, max wind, total rain
214 |         t_max = -150.
215 |         t_min = 150.
216 |         w_max = 0.
217 |         r_total = 0.
218 |         for block_match in block_expr.finditer(infile.read()):
219 |             blocknum += 1
220 |             # Split out the match into each component number
221 |             vals = block_match.groups()
222 |             # Check for missing values: vals is a tuple of strings from re.groups(), so
223 |             # rebuild it as a list, flagging the -9999. missing-value sentinel as NaN
224 |             vals = [np.nan if v.startswith('-9999') else v for v in vals]
225 |
226 |             # Set the time
227 |             dt = '20' + vals[varlist.index('YYMMDD')] + vals[varlist.index('HHMM')]
228 |             validtime = datetime.strptime(dt, '%Y%m%d%H%M')
229 |
230 |             # Check that time is within the period we want
231 |             if start_dt is not None and validtime < start_dt:
232 |                 continue
233 |             if end_dt is not None and validtime > end_dt:
234 |                 break
235 |
236 |             # Check for max daily values!
237 |             t_max = max(t_max, float(vals[varlist.index('T2MS')]))
238 |             t_min = min(t_min, float(vals[varlist.index('T2MS')]))
239 |             uwind = float(vals[varlist.index('UWND')])
240 |             vwind = float(vals[varlist.index('VWND')])
241 |             wspd = np.sqrt(uwind ** 2 + vwind ** 2)
242 |             w_max = max(w_max, wspd)
243 |
244 |             if validtime.hour % interval != 0:
245 |                 # Still need to get cumulative precipitation!
246 | try: 247 | temp_precip = float(vals[varlist.index('P01M')]) 248 | except: 249 | temp_precip = float(vals[varlist.index('P03M')]) 250 | if np.isnan(temp_precip): 251 | temp_precip = 0.0 252 | temp_rain += temp_precip 253 | continue 254 | 255 | sfc_dict[validtime] = OrderedDict() 256 | sfc_dict[validtime]['WSPD'] = wspd 257 | sfc_dict[validtime]['UWND'] = uwind 258 | sfc_dict[validtime]['VWND'] = vwind 259 | sfc_dict[validtime]['PRES'] = float(vals[varlist.index('PRES')]) 260 | sfc_dict[validtime]['TMPC'] = float(vals[varlist.index('T2MS')]) 261 | sfc_dict[validtime]['DWPC'] = float(vals[varlist.index('TD2M')]) 262 | sfc_dict[validtime]['HCLD'] = float(vals[varlist.index('HCLD')]) 263 | sfc_dict[validtime]['MCLD'] = float(vals[varlist.index('MCLD')]) 264 | sfc_dict[validtime]['LCLD'] = float(vals[varlist.index('LCLD')]) 265 | # Could be 3 hour or 1 hour precip 266 | try: 267 | precip = float(vals[varlist.index('P01M')]) 268 | except: 269 | precip = float(vals[varlist.index('P03M')]) 270 | # Make sure precip is not nan 271 | if np.isnan(precip): 272 | precip = 0.0 273 | # Add the temporary value from uneven intervals 274 | precip += temp_rain 275 | # Also do cumulative sum in r_total 276 | previous_time = validtime - timedelta(hours=interval) 277 | if previous_time in sfc_dict.keys(): 278 | sfc_dict[validtime]['PRCP'] = precip 279 | r_total += precip 280 | else: 281 | # We want zero at first time: precip in LAST hour or three 282 | sfc_dict[validtime]['PRCP'] = 0.0 283 | # Reset temp_rain after having added it 284 | temp_rain = 0. 285 | 286 | daily = [t_max, t_min, w_max, r_total] 287 | return sfc_dict, daily 288 | 289 | 290 | def bufr_retrieve(bufr, bufarg): 291 | """ 292 | Call bufrgruven to retrieve BUFR files. 293 | 294 | :param bufr: str: bufrgruven executable path 295 | :param bufarg: dict: dictionary of arguments passed to bufrgruven 296 | :return: 297 | """ 298 | argstring = '' 299 | for key, value in bufarg.items(): 300 | argstring += ' --%s %s' % (key, value) 301 | result = os.system('%s %s' % (bufr, argstring)) 302 | return result 303 | 304 | 305 | def bufr(config, output_file=None, cycle='18'): 306 | """ 307 | Generates model data from BUFKIT profiles and saves to a file, which can later be retrieved for either training 308 | data or model run data. 309 | 310 | :param config: 311 | :param output_file: str: output file path 312 | :param cycle: str: model cycle (init hour) 313 | :return: 314 | """ 315 | bufr_station_id = config['BUFR']['bufr_station_id'] 316 | # Base arguments dictionary. dset and date will be modified iteratively. 
317 | bufarg = { 318 | 'dset': '', 319 | 'date': '', 320 | 'cycle': cycle, 321 | 'stations': bufr_station_id.lower(), 322 | 'noascii': '', 323 | 'nozipit': '', 324 | 'noverbose': '', 325 | 'prepend': '' 326 | } 327 | if config['verbose']: 328 | print('\n') 329 | bufr_default_dir = '%s/metdat/bufkit' % config['BUFR_ROOT'] 330 | bufr_data_dir = config['BUFR']['bufr_data_dir'] 331 | if not(os.path.isdir(bufr_data_dir)): 332 | os.makedirs(bufr_data_dir) 333 | bufrgruven = config['BUFR']['bufrgruven'] 334 | if config['verbose']: 335 | print('bufr: using BUFKIT files in %s' % bufr_data_dir) 336 | bufr_format = '%s/%s%s.%s_%s.buf' 337 | missing_dates = [] 338 | models = config['BUFR']['bufr_models'] 339 | model_names = config['BUFR']['models'] 340 | start_date = datetime.strptime(config['data_start_date'], '%Y%m%d') - timedelta(days=1) 341 | end_date = datetime.strptime(config['data_end_date'], '%Y%m%d') - timedelta(days=1) 342 | dates = generate_dates(config, start_date=start_date, end_date=end_date) 343 | for date in dates: 344 | bufarg['date'] = datetime.strftime(date, '%Y%m%d') 345 | if date.year < 2010: 346 | if config['verbose']: 347 | print('bufr: skipping BUFR data for %s; data starts in 2010.' % bufarg['date']) 348 | continue 349 | if config['verbose']: 350 | print('bufr: date: %s' % bufarg['date']) 351 | 352 | for m in range(len(models)): 353 | if config['verbose']: 354 | print('bufr: trying to retrieve BUFR data for %s...' % model_names[m]) 355 | bufr_new_name = bufr_format % (bufr_data_dir, bufarg['date'], '%02d' % int(bufarg['cycle']), 356 | model_names[m], bufarg['stations']) 357 | if os.path.isfile(bufr_new_name): 358 | if config['verbose']: 359 | print('bufr: file %s already exists; skipping!' % bufr_new_name) 360 | break 361 | 362 | if type(models[m]) == list: 363 | for model in models[m]: 364 | try: 365 | bufarg['dset'] = model 366 | bufr_retrieve(bufrgruven, bufarg) 367 | bufr_name = bufr_format % (bufr_default_dir, bufarg['date'], '%02d' % int(bufarg['cycle']), 368 | model, bufarg['stations']) 369 | bufr_file = open(bufr_name) 370 | bufr_file.close() 371 | os.rename(bufr_name, bufr_new_name) 372 | if config['verbose']: 373 | print('bufr: BUFR file found for %s at date %s.' % (model, bufarg['date'])) 374 | print('bufr: writing BUFR file: %s' % bufr_new_name) 375 | break 376 | except: 377 | if config['verbose']: 378 | print('bufr: BUFR file for %s at date %s not retrieved.' % (model, bufarg['date'])) 379 | else: 380 | try: 381 | model = models[m] 382 | bufarg['dset'] = model 383 | bufr_retrieve(bufrgruven, bufarg) 384 | bufr_name = bufr_format % (bufr_default_dir, bufarg['date'], '%02d' % int(bufarg['cycle']), 385 | bufarg['dset'], bufarg['stations']) 386 | bufr_file = open(bufr_name) 387 | bufr_file.close() 388 | os.rename(bufr_name, bufr_new_name) 389 | if config['verbose']: 390 | print('bufr: BUFR file found for %s at date %s.' % (model, bufarg['date'])) 391 | print('bufr: writing BUFR file: %s' % bufr_new_name) 392 | except: 393 | if config['verbose']: 394 | print('bufr: BUFR file for %s at date %s not retrieved.' 
% (model, bufarg['date'])) 395 | if not (os.path.isfile(bufr_new_name)): 396 | print('bufr: warning: no BUFR file found for model %s at date %s' % ( 397 | model_names[m], bufarg['date'])) 398 | missing_dates.append((date, model_names[m])) 399 | 400 | # Process data 401 | print('\n') 402 | bufr_dict = OrderedDict({'PROF': OrderedDict(), 'SFC': OrderedDict(), 'DAY': OrderedDict()}) 403 | for model in model_names: 404 | bufr_dict['PROF'][model] = OrderedDict() 405 | bufr_dict['SFC'][model] = OrderedDict() 406 | bufr_dict['DAY'][model] = OrderedDict() 407 | 408 | for date in dates: 409 | date_str = datetime.strftime(date, '%Y%m%d') 410 | verif_date = date + timedelta(days=1) 411 | start_dt = verif_date + timedelta(hours=config['forecast_hour_start']) 412 | end_dt = verif_date + timedelta(hours=config['forecast_hour_start'] + 24) 413 | for model in model_names: 414 | if (date, model) in missing_dates: 415 | if config['verbose']: 416 | print('bufr: skipping %s data for %s; file missing.' % (model, date_str)) 417 | continue 418 | if config['verbose']: 419 | print('bufr: processing %s data for %s' % (model, date_str)) 420 | bufr_name = bufr_format % (bufr_data_dir, date_str, '%02d' % int(bufarg['cycle']), model, 421 | bufarg['stations']) 422 | if not (os.path.isfile(bufr_name)): 423 | if config['verbose']: 424 | print('bufr: skipping %s data for %s; file missing.' % (model, date_str)) 425 | continue 426 | profile = bufkit_parser_time_height(config, bufr_name, 6, start_dt, end_dt) 427 | sfc, daily = bufkit_parser_surface(bufr_name, 3, start_dt, end_dt) 428 | # Drop 'PRES' variable which is useless 429 | for key, values in profile.items(): 430 | values.pop('PRES', None) 431 | profile[key] = values 432 | bufr_dict['PROF'][model][verif_date] = profile 433 | bufr_dict['SFC'][model][verif_date] = sfc 434 | bufr_dict['DAY'][model][verif_date] = daily 435 | 436 | # Export data 437 | if output_file is None: 438 | output_file = '%s/%s_bufr.pkl' % (config['SITE_ROOT'], config['station_id']) 439 | if config['verbose']: 440 | print('bufr: -> exporting to %s' % output_file) 441 | with open(output_file, 'wb') as handle: 442 | pickle.dump(bufr_dict, handle, protocol=pickle.HIGHEST_PROTOCOL) 443 | 444 | return 445 | 446 | 447 | def process(config, bufr, advection_diagnostic=True): 448 | """ 449 | Imports the data contained in a bufr dictionary and returns a time-by-x numpy array for use in mosx_predictors. The 450 | first dimension is date; all other dimensions are first extracted using get_array and then one-dimensionalized. 
451 | 452 | :param config: 453 | :param bufr: dict: dictionary of processed BUFR data 454 | :param advection_diagnostic: bool: if True, add temperature advection diagnostic to the data 455 | :return: ndarray: array of formatted BUFR predictor values 456 | """ 457 | if config['verbose']: 458 | print('bufr.process: processing array for BUFR data...') 459 | # PROF part of the BUFR data 460 | bufr_prof = get_array(bufr['PROF']) 461 | bufr_dims = list(range(len(bufr_prof.shape))) 462 | bufr_dims[0] = 1 463 | bufr_dims[1] = 0 464 | bufr_prof = bufr_prof.transpose(bufr_dims) 465 | bufr_shape = bufr_prof.shape 466 | bufr_reshape = [bufr_shape[0]] + [np.cumprod(bufr_shape[1:])[-1]] 467 | bufr_prof = bufr_prof.reshape(tuple(bufr_reshape)) 468 | # SFC part of the BUFR data 469 | bufr_sfc = get_array(bufr['SFC']) 470 | bufr_dims = list(range(len(bufr_sfc.shape))) 471 | bufr_dims[0] = 1 472 | bufr_dims[1] = 0 473 | bufr_sfc = bufr_sfc.transpose(bufr_dims) 474 | bufr_shape = bufr_sfc.shape 475 | bufr_reshape = [bufr_shape[0]] + [np.cumprod(bufr_shape[1:])[-1]] 476 | bufr_sfc = bufr_sfc.reshape(tuple(bufr_reshape)) 477 | # DAY part of the BUFR data 478 | bufr_day = get_array(bufr['DAY']) 479 | bufr_dims = list(range(len(bufr_day.shape))) 480 | bufr_dims[0] = 1 481 | bufr_dims[1] = 0 482 | bufr_day = bufr_day.transpose(bufr_dims) 483 | bufr_shape = bufr_day.shape 484 | bufr_reshape = [bufr_shape[0]] + [np.cumprod(bufr_shape[1:])[-1]] 485 | bufr_day = bufr_day.reshape(tuple(bufr_reshape)) 486 | bufr_out = np.concatenate((bufr_prof, bufr_sfc, bufr_day), axis=1) 487 | # Fix missing values 488 | bufr_out[bufr_out < -1000.] = np.nan 489 | if advection_diagnostic: 490 | advection_array = temp_advection(bufr) 491 | bufr_out = np.concatenate((bufr_out, advection_array), axis=1) 492 | return bufr_out 493 | 494 | 495 | def temp_advection(bufr): 496 | """ 497 | Produces an array of temperature advection diagnostic from the bufr dictionary. The diagnostic is a simple 498 | calculation of the strength of backing or veering of winds with height, based on thermal wind approximations, 499 | between the lowest two profile levels retained in the bufr dictionary. Searches for UWND and VWND keys. 500 | IMPORTANT: expects the keys of each model to match dates; i.e., expects bufr output from find_matching_dates. This 501 | is to ensure num_samples is correct. 502 | 503 | :param bufr: dict: dictionary of processed BUFR data 504 | :return: advection_array: array of num_samples-by-num_features of advection diagnostic 505 | """ 506 | bufr_prof = bufr['PROF'] 507 | models = bufr_prof.keys() 508 | num_models = len(models) 509 | dates = bufr_prof[bufr_prof.keys()[0]].keys() 510 | num_dates = len(dates) 511 | num_times = len(bufr_prof[bufr_prof.keys()[0]][dates[0]].keys()) 512 | num_features = num_models * num_times 513 | 514 | advection_array = np.zeros((num_dates, num_features)) 515 | 516 | def advection_index(V1, V2): 517 | """ 518 | The advection index measures the strength of veering/backing of wind. 519 | :param V1: array wind vector at lower model level 520 | :param V2: array wind vector at higher model level 521 | :return: index of projection of (V2 - V1) onto V1 522 | """ 523 | proj = V2 - np.dot(V1, V2) * V1 / np.linalg.norm(V1) 524 | diff = V1 - V2 525 | sign = np.sign(np.arctan2(diff[1], diff[0])) 526 | return sign * np.linalg.norm(proj) 527 | 528 | # Here comes the giant ugly loop. 
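# Each sample is one forecast date; for each model and each forecast hour within that
# date, the advection index from the lowest two retained profile levels becomes one feature,
# giving num_models * num_times features per sample.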
529 |     sample = 0
530 |     for date in dates:
531 |         feature = 0
532 |         for model in models:
533 |             for eval_date in bufr_prof[model][date].keys():
534 |                 u = bufr_prof[model][date][eval_date]['UWND']
535 |                 v = bufr_prof[model][date][eval_date]['VWND']
536 |                 try:
537 |                     V1 = np.array([u[0], v[0]])
538 |                     V2 = np.array([u[1], v[1]])
539 |                 except IndexError:
540 |                     print('Not enough wind levels available for advection calculation; omitting...')
541 |                     return
542 |                 advection_array[sample, feature] = advection_index(V1, V2)
543 |                 feature += 1
544 |         sample += 1
545 |
546 |     return advection_array
547 |
--------------------------------------------------------------------------------
/mosx/configspec:
--------------------------------------------------------------------------------
1 | MOSX_ROOT = string()
2 | BUFR_ROOT = string()
3 | SITE_ROOT = string(default='')
4 | station_id = string(min=4, max=4)
5 | climo_station_id = string(min=11, max=11, default='')
6 | lowest_p_level = float()
7 | data_start_date = string(min=8, max=8)
8 | data_end_date = string(min=8, max=8)
9 | is_season = boolean(default=False)
10 | forecast_hour_start = integer(0, 24, default=6)
11 | time_series_interval = integer(1, 6, default=3)
12 | meso_token = string()
13 | verbose = boolean(default=False)
14 | [BUFR]
15 | bufr_station_id = string(min=3, max=4, default='')
16 | bufr_data_dir = string(default='')
17 | bufrgruven = string(default='')
18 | models = force_list()
19 | [Obs]
20 | use_soundings = boolean(default=False)
21 | sounding_station_id = string(min=3, max=3)
22 | sounding_data_dir = string(default='')
23 | use_climo_wind = boolean(default=True)
24 | [Model]
25 | estimator_file = string()
26 | regressor = string(default='ensemble.RandomForestRegressor')
27 | train_individual = boolean(default=False)
28 | predict_timeseries = boolean(default=False)
29 | rain_forecast_type = option('pop', 'categorical', 'quantity', default='pop')
30 | [[Parameters]]
31 | [Validate]
32 | start_date = string(min=8, max=8)
33 | end_date = string(min=8, max=8)
34 | [Upload]
35 | file_type = force_list(default=list('pickle'))
36 | username = string(default='')
37 | server = string(default='')
38 | forecast_directory = string(default='~')
39 | plot_directory = string(default='~')
--------------------------------------------------------------------------------
/mosx/estimators.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright (c) 2018 Jonathan Weyn
3 | #
4 | # See the file LICENSE for your rights.
5 | #
6 |
7 | """
8 | Special classes containing MOS-X-specific scikit-learn estimators. These estimators are designed to perform specialized
9 | functions such as adding time series prediction, creating sub-estimators to tune the rain prediction, and so on. They
10 | are NOT fully compatible with scikit-learn because it is impossible to link all of the sklearn attributes of the base
11 | estimators to these meta-estimators. It might be possible, in the future, to make these meta-classes inherit from
12 | the base sklearn estimator class.
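A minimal usage sketch (hypothetical, not part of the original module: assumes predictor and
target arrays exist, with the four daily targets high/low/wind/rain in columns 0-3 so that the
rain post-processor can read column 3):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.multioutput import MultiOutputRegressor

    base = MultiOutputRegressor(RandomForestRegressor(n_estimators=100))
    tuned = RainTuningEstimator(base, rain_estimator='ensemble.RandomForestRegressor',
                                n_estimators=100)
    tuned.fit(p_train, t_train)
    predicted = tuned.predict(p_test)  # column 3 is replaced by the tuned rain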
13 | """ 14 | 15 | import numpy as np 16 | from sklearn.preprocessing import LabelBinarizer 17 | from sklearn.multioutput import MultiOutputRegressor 18 | from sklearn.pipeline import Pipeline 19 | from sklearn.model_selection import KFold, train_test_split 20 | from copy import deepcopy 21 | from .util import get_object 22 | 23 | 24 | class TimeSeriesEstimator(object): 25 | """ 26 | Wrapper class for containing separately-trained daily and timeseries estimators. 27 | """ 28 | def __init__(self, daily_estimator, timeseries_estimator): 29 | self.daily_estimator = daily_estimator 30 | self.timeseries_estimator = timeseries_estimator 31 | # Inherit attributes from the daily estimator by default. 32 | # Apparently only 'steps' and 'memory' are in __dict__ for a Pipeline. BS. 33 | for attr in self.daily_estimator.__dict__.keys(): 34 | try: 35 | setattr(self, attr, getattr(self.daily_estimator, attr)) 36 | except AttributeError: 37 | pass 38 | # Apparently still have to do this 39 | if isinstance(self.daily_estimator, Pipeline): 40 | self.named_steps = self.daily_estimator.named_steps 41 | else: 42 | self.named_steps = None 43 | try: # Likely only works if model has been fitted 44 | self.estimators_ = self.daily_estimator.estimators_ 45 | except AttributeError: 46 | self.estimators_ = None 47 | self.array_form = True 48 | if not hasattr(self, 'verbose'): 49 | self.verbose = 1 50 | 51 | def fit(self, predictor_array, verification_array, **kwargs): 52 | """ 53 | Fit both the daily and the timeseries estimators. 54 | 55 | :param predictor_array: ndarray-like: predictor features 56 | :param verification_array: ndarray-like: truth values 57 | :param kwargs: kwargs passed to fit methods 58 | :return: 59 | """ 60 | if self.verbose > 0: 61 | print('TimeSeriesEstimator: fitting DAILY estimator') 62 | self.daily_estimator.fit(predictor_array, verification_array[:, :4], **kwargs) 63 | if self.verbose > 0: 64 | print('TimeSeriesEstimator: fitting TIMESERIES estimator') 65 | self.timeseries_estimator.fit(predictor_array, verification_array[:, 4:], **kwargs) 66 | 67 | def predict(self, predictor_array, **kwargs): 68 | """ 69 | Predict from both the daily and timeseries estimators. Returns an array if self.array_form is True, 70 | otherwise returns a dictionary (not implemented yet). 71 | 72 | :param predictor_array: num_samples x num_features 73 | :param kwargs: kwargs passed to predict methods 74 | :return: array or dictionary of predicted values 75 | """ 76 | daily = self.daily_estimator.predict(predictor_array, **kwargs) 77 | timeseries = self.timeseries_estimator.predict(predictor_array, **kwargs) 78 | if self.array_form: 79 | return np.concatenate((daily, timeseries), axis=1) 80 | 81 | 82 | class RainTuningEstimator(object): 83 | """ 84 | This class extends an estimator to include a separately-trained post-processing random forest for the daily rainfall 85 | prediction. Standard algorithms generally do a poor job of predicting a variable that has such a non-normal 86 | probability distribution as daily rainfall (which is dominated by 0s). 87 | """ 88 | def __init__(self, estimator, rain_estimator='ensemble.RandomForestRegressor', **kwargs): 89 | """ 90 | Initialize an instance of an estimator with a rainfall post-processor. 
91 | 92 | :param estimator: sklearn estimator or TimeSeriesEstimator with an estimators_ attribute 93 | :param rain_estimator: str: type of sklearn estimator to use for rain processing 94 | :param kwargs: passed to sklearn rain estimator 95 | """ 96 | self.base_estimator = estimator 97 | if isinstance(estimator, TimeSeriesEstimator): 98 | self.daily_estimator = self.base_estimator.daily_estimator 99 | else: 100 | self.daily_estimator = self.base_estimator 101 | # Inherit attributes from the base estimator 102 | for attr in self.base_estimator.__dict__.keys(): 103 | try: 104 | setattr(self, attr, getattr(self.base_estimator, attr)) 105 | except AttributeError: 106 | pass 107 | try: 108 | self.estimators_ = self.base_estimator.estimators_ 109 | except AttributeError: 110 | pass 111 | self.rain_estimator_name = rain_estimator 112 | self.is_classifier = ('Classifi' in self.rain_estimator_name) 113 | Regressor = get_object('sklearn.%s' % rain_estimator) 114 | self.rain_estimator = Regressor(**kwargs) 115 | if isinstance(self.daily_estimator, Pipeline): 116 | self.named_steps = self.daily_estimator.named_steps 117 | self._forest = self.daily_estimator.named_steps['regressor'] 118 | self._imputer = self.daily_estimator.named_steps['imputer'] 119 | else: 120 | self.named_steps = None 121 | self._imputer = None 122 | self._forest = self.daily_estimator 123 | if not hasattr(self, 'verbose'): 124 | self.verbose = 1 125 | 126 | def _get_tree_rain_prediction(self, X): 127 | # Get predictions from individual trees. 128 | num_samples = X.shape[0] 129 | if self._imputer is not None: 130 | X = self._imputer.transform(X) 131 | if isinstance(self._forest, MultiOutputRegressor): 132 | num_trees = len(self._forest.estimators_[0].estimators_) 133 | predicted_rain = np.zeros((num_samples, num_trees)) 134 | for s in range(num_samples): 135 | Xs = X[s].reshape(1, -1) 136 | for t in range(num_trees): 137 | try: 138 | predicted_rain[s, t] = self._forest.estimators_[3].estimators_[t].predict(Xs) 139 | except AttributeError: 140 | # Work around the 2-D array of estimators for GBTrees 141 | predicted_rain[s, t] = self._forest.estimators_[3].estimators_[t][0].predict(Xs) 142 | else: 143 | num_trees = len(self._forest.estimators_) 144 | predicted_rain = np.zeros((num_samples, num_trees)) 145 | for s in range(num_samples): 146 | Xs = X[s].reshape(1, -1) 147 | for t in range(num_trees): 148 | try: 149 | predicted_rain[s, t] = self._forest.estimators_[t].predict(Xs) 150 | except AttributeError: 151 | # Work around an error in sklearn where GBTrees have length-1 ndarrays... 152 | predicted_rain[s, t] = self._forest.estimators_[t][0].predict(Xs) 153 | return predicted_rain 154 | 155 | def _get_distribution(self, p_rain): 156 | # Get the mean, std, and number of 0 forecasts from the estimator. 157 | mean = np.mean(p_rain, axis=1) 158 | std = np.std(p_rain, axis=1) 159 | zero_frac = 1. * np.sum(p_rain < 0.01, axis=1) / p_rain.shape[1] 160 | return np.stack((mean, std, zero_frac), axis=1) 161 | 162 | def fit(self, predictor_array, verification_array, rain_array=None, **kwargs): 163 | """ 164 | Fit the estimator and the post-processor. 
165 | 166 | :param predictor_array: ndarray-like: predictor features 167 | :param verification_array: ndarray-like: truth values 168 | :param rain_array: ndarray-like: raw rain from the models 169 | :param kwargs: passed to the estimator's 'fit' method 170 | :return: 171 | """ 172 | # First, fit the estimator as usual 173 | self.base_estimator.fit(predictor_array, verification_array, **kwargs) 174 | 175 | # Now generate the distribution information from the individual trees in the forest 176 | if self.verbose > 0: 177 | print('RainTuningEstimator: getting ensemble rain predictions') 178 | predicted_rain = self._get_tree_rain_prediction(predictor_array) 179 | rain_distribution = self._get_distribution(predicted_rain) 180 | # If raw rain values are given, add those to the distribution 181 | if rain_array is not None: 182 | rain_distribution = np.concatenate((rain_distribution, rain_array), axis=1) 183 | 184 | # Fit the rain estimator 185 | if self.verbose > 0: 186 | print('RainTuningEstimator: fitting rain post-processor') 187 | # # If we're using a classifier, then we may need to binarize the labels 188 | # if self.is_classifier: 189 | # lb = LabelBinarizer() 190 | # rain_targets = lb.fit_transform(verification_array[:, 3]) 191 | # else: 192 | rain_targets = verification_array[:, 3] 193 | self.rain_estimator.fit(rain_distribution, rain_targets) 194 | 195 | def predict(self, predictor_array, rain_tuning=True, rain_array=None, **kwargs): 196 | """ 197 | Return a prediction from the estimator with post-processed rain. 198 | 199 | :param predictor_array: ndarray-like: predictor features 200 | :param rain_tuning: bool: toggle option to disable rain tuning in prediction 201 | :param rain_array: ndarray-like: raw rain values from models. Must be provided if fit() was called using raw 202 | rain values and rain_tuning is True. 203 | :param kwargs: passed to estimator's 'predict' method 204 | :return: array of predictions 205 | """ 206 | # Predict with the estimator as usual 207 | predicted = self.base_estimator.predict(predictor_array, **kwargs) 208 | 209 | # Now get the tuned rain 210 | if rain_tuning: 211 | if self.verbose > 0: 212 | print('RainTuningEstimator: tuning rain prediction') 213 | # Get the distribution from individual trees 214 | predicted_rain = self._get_tree_rain_prediction(predictor_array) 215 | rain_distribution = self._get_distribution(predicted_rain) 216 | if rain_array is not None: 217 | rain_distribution = np.concatenate((rain_distribution, rain_array), axis=1) 218 | tuned_rain = self.rain_estimator.predict(rain_distribution) 219 | predicted[:, 3] = tuned_rain 220 | 221 | return predicted 222 | 223 | def predict_proba(self, predictor_array, rain_tuning=True, rain_array=None, **kwargs): 224 | """ 225 | Return a prediction from the estimator with post-processed rain, with a probability of rainfall. Should only 226 | be used if rain_forecast_type is 'pop'. 227 | 228 | :param predictor_array: ndarray-like: predictor features 229 | :param rain_tuning: bool: toggle option to disable rain tuning in prediction 230 | :param rain_array: ndarray-like: raw rain values from models. Must be provided if fit() was called using raw 231 | rain values and rain_tuning is True. 
232 | :param kwargs: passed to estimator's 'predict' method 233 | :return: array of predictions 234 | """ 235 | # Predict with the estimator as usual 236 | predicted = self.base_estimator.predict(predictor_array, **kwargs) 237 | 238 | # Do the probabilistic prediction for rain 239 | if rain_tuning: 240 | if self.verbose > 0: 241 | print('RainTuningEstimator: tuning rain prediction') 242 | # Get the distribution from individual trees 243 | predicted_rain = self._get_tree_rain_prediction(predictor_array) 244 | rain_distribution = self._get_distribution(predicted_rain) 245 | if rain_array is not None: 246 | rain_distribution = np.concatenate((rain_distribution, rain_array), axis=1) 247 | tuned_rain = self.rain_estimator.predict_proba(rain_distribution) 248 | predicted[:, 3] = np.sum(tuned_rain[:, 1:], axis=1) 249 | 250 | return predicted 251 | 252 | def predict_rain_proba(self, predictor_array, rain_array=None): 253 | """ 254 | Get the raw categorical probabilistic prediction from the rain post-processor. 255 | :param predictor_array: ndarray-like: predictor features 256 | :param rain_array: ndarray-like: raw rain values from models. Must be provided if fit() was called using raw 257 | rain values. 258 | :return: array of categorical rain predictions 259 | """ 260 | if self.verbose > 0: 261 | print('RainTuningEstimator: getting probabilistic rain prediction') 262 | # Get the distribution from individual trees 263 | predicted_rain = self._get_tree_rain_prediction(predictor_array) 264 | rain_distribution = self._get_distribution(predicted_rain) 265 | if rain_array is not None: 266 | rain_distribution = np.concatenate((rain_distribution, rain_array), axis=1) 267 | categorical_rain = self.rain_estimator.predict_proba(rain_distribution) 268 | return categorical_rain 269 | 270 | 271 | class BootStrapEnsembleEstimator(object): 272 | """ 273 | This class implements a bootstrapping technique to generate a small ensemble of identical algorithms trained on 274 | a random subset of the training data. Options include partitioning the training data evenly, so that no algorithm 275 | has any sample in any other algorithm's training set, or completely randomly, so that there may be an arbitrary 276 | amount of overlap in the individual algorithms' training sets. 277 | """ 278 | def __init__(self, estimator, n_members=10, n_samples_split=0.1, unique_splits=False): 279 | """ 280 | Initialize instance of a BootStrapEnsembleEstimator. 281 | :param estimator: object: base estimator object (scikit-learn or one of those in this module) 282 | :param n_members: int: number of bootstrapped ensemble members (individual trained models) 283 | :param n_samples_split: float or int: if float, then gives the fraction of the training set that is used to 284 | form the training set for any given ensemble member. If int, then gives the exact number of samples used. 285 | Ignored if unique_splits is True, because then each training set has the maximum number of samples possible to 286 | be unique. 
287 | :param unique_splits: bool: if True, each split contains a unique set of samples not present in any other split 288 | """ 289 | self.base_estimator = estimator 290 | self.n_members = n_members 291 | self.estimators_ = np.empty((n_members,), dtype=np.object) 292 | for n in range(n_members): 293 | self.estimators_[n] = deepcopy(estimator) 294 | self.n_samples_split = n_samples_split 295 | self.unique_splits = unique_splits 296 | if not hasattr(self.base_estimator, 'verbose'): 297 | self.verbose = 1 298 | else: 299 | self.verbose = self.base_estimator.verbose 300 | 301 | def get_splits(self, X, y): 302 | X_splits = [] 303 | y_splits = [] 304 | if self.unique_splits: 305 | kf = KFold(n_splits=self.n_members, shuffle=True) 306 | for train_index, test_index in kf.split(X): 307 | X_splits.append(X[test_index]) # test is unique; train is all other tests 308 | y_splits.append(y[test_index]) 309 | else: 310 | for split_count in range(self.n_members): 311 | X_split, xt, y_split, yt = train_test_split(X, y, train_size=self.n_samples_split, shuffle=True) 312 | X_splits.append(X_split) 313 | y_splits.append(y_split) 314 | return X_splits, y_splits 315 | 316 | def fit(self, predictor_array, verification_array, **kwargs): 317 | """ 318 | Fit the bootstrapped algorithms. 319 | 320 | :param predictor_array: ndarray-like: predictor features 321 | :param verification_array: ndarray-like: truth values 322 | :param kwargs: kwargs passed to fit methods 323 | :return: 324 | """ 325 | predictor_splits, verification_splits = self.get_splits(predictor_array, verification_array) 326 | for est in range(self.n_members): 327 | if self.verbose: 328 | print('BootStrapEnsembleEstimator: training ensemble member %d of %d' % (est + 1, self.n_members)) 329 | self.estimators_[est].fit(predictor_splits[est], verification_splits[est], **kwargs) 330 | 331 | def predict(self, predictor_array, **kwargs): 332 | """ 333 | Predict from the bootstrapped algorithms. Gives an ensemble mean. 334 | 335 | :param predictor_array: ndarray-like: predictor features 336 | :param kwargs: passed to estimator's 'predict' method 337 | :return: 338 | """ 339 | prediction = [] 340 | for est in range(self.n_members): 341 | if self.verbose: 342 | print('BootStrapEnsembleEstimator: predicting from ensemble member %d of %d' % 343 | (est + 1, self.n_members)) 344 | prediction.append(self.estimators_[est].predict(predictor_array, **kwargs)) 345 | 346 | return np.mean(np.array(prediction), axis=0) 347 | 348 | def predict_rain_proba(self, predictor_array, rain_array=None): 349 | """ 350 | If the base estimator is a RainTuningEstimator, yields an average prediction for categorical rain 351 | probabilities. 352 | :param predictor_array: ndarray-like: predictor features 353 | :param rain_array: ndarray-like: raw rain values from models. Must be provided if fit() was called using raw 354 | rain values. 
355 | :return: 356 | """ 357 | prediction = [] 358 | for est in range(self.n_members): 359 | try: 360 | prediction.append(self.estimators_[est].predict_rain_proba(predictor_array, rain_array=rain_array)) 361 | except AttributeError: 362 | raise AttributeError("'%s' cannot predict rain category probabilities; use RainTuningEstimator" % 363 | type(self.base_estimator)) 364 | 365 | # Fix the shape of the arrays, in case some predictions don't have all categories of rain 366 | rain_dims = [p.shape[1] for p in prediction] 367 | max_dim = np.max(rain_dims) 368 | new_prediction = [] 369 | for p in prediction: 370 | while p.shape[1] < max_dim: 371 | p = np.c_[p, np.zeros(p.shape[0])] 372 | new_prediction.append(p) 373 | 374 | return np.mean(np.array(new_prediction), axis=0) 375 | -------------------------------------------------------------------------------- /mosx/model/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for formatting predictor data and training scikit-learn models. 9 | """ 10 | 11 | from model import * 12 | from predictors import * 13 | from scorers import * 14 | -------------------------------------------------------------------------------- /mosx/model/model.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for training scikit-learn models. 9 | """ 10 | 11 | from datetime import datetime, timedelta 12 | import pandas as pd 13 | 14 | from mosx.util import get_object, to_bool, dewpoint 15 | from mosx.estimators import TimeSeriesEstimator, RainTuningEstimator, BootStrapEnsembleEstimator 16 | import pickle 17 | import numpy as np 18 | 19 | 20 | def build_estimator(config): 21 | """ 22 | Build the estimator object from the parameters in config. 23 | 24 | :param config: 25 | :return: 26 | """ 27 | regressor = config['Model']['regressor'] 28 | sklearn_kwargs = config['Model']['Parameters'] 29 | train_individual = config['Model']['train_individual'] 30 | ada_boost = config['Model'].get('Ada boosting', None) 31 | rain_tuning = config['Model'].get('Rain tuning', None) 32 | bootstrap = config['Model'].get('Bootstrapping', None) 33 | Regressor = get_object('sklearn.%s' % regressor) 34 | if config['verbose']: 35 | print('build_estimator: using sklearn.%s as estimator...' 
% regressor) 36 | 37 | from sklearn.preprocessing import Imputer 38 | from sklearn.preprocessing import StandardScaler as Scaler 39 | from sklearn.pipeline import Pipeline 40 | 41 | # Create and train the learning algorithm 42 | if config['verbose']: 43 | print('build_estimator: here are the parameters passed to the learning algorithm...') 44 | print(sklearn_kwargs) 45 | 46 | # Create the pipeline list 47 | pipeline = [("imputer", Imputer(missing_values=np.nan, strategy="mean", axis=0))] 48 | if config['Model']['predict_timeseries']: 49 | pipeline_timeseries = [("imputer", Imputer(missing_values=np.nan, strategy="mean", axis=0))] 50 | 51 | if not (regressor.startswith('ensemble')): 52 | # Need to add feature scaling 53 | pipeline.append(("scaler", Scaler())) 54 | if config['Model']['predict_timeseries']: 55 | pipeline_timeseries.append(("scaler", Scaler())) 56 | 57 | # Create the regressor object 58 | regressor_obj = Regressor(**sklearn_kwargs) 59 | if ada_boost is not None: 60 | if config['verbose']: 61 | print('build_estimator: using Ada boosting...') 62 | from sklearn.ensemble import AdaBoostRegressor 63 | regressor_obj = AdaBoostRegressor(regressor_obj, **ada_boost) 64 | if train_individual: 65 | if config['verbose']: 66 | print('build_estimator: training separate models for each parameter...') 67 | from sklearn.multioutput import MultiOutputRegressor 68 | multi_regressor = MultiOutputRegressor(regressor_obj, 4) 69 | pipeline.append(("regressor", multi_regressor)) 70 | else: 71 | pipeline.append(("regressor", regressor_obj)) 72 | if config['Model']['predict_timeseries']: 73 | pipeline_timeseries.append(("regressor", regressor_obj)) 74 | 75 | # Make the final estimator with a Pipeline 76 | if config['Model']['predict_timeseries']: 77 | estimator = TimeSeriesEstimator(Pipeline(pipeline), Pipeline(pipeline_timeseries)) 78 | else: 79 | estimator = Pipeline(pipeline) 80 | 81 | if rain_tuning is not None and regressor.startswith('ensemble'): 82 | if config['verbose']: 83 | print('build_estimator: using rain tuning...') 84 | rain_kwargs = rain_tuning.copy() 85 | rain_kwargs.pop('use_raw_rain', None) 86 | estimator = RainTuningEstimator(estimator, **rain_kwargs) 87 | 88 | # Add bootstrapping if requested 89 | if bootstrap is not None: 90 | if config['verbose']: 91 | print('build_estimator: using bootstrapping ensemble...') 92 | estimator = BootStrapEnsembleEstimator(estimator, **bootstrap) 93 | 94 | return estimator 95 | 96 | 97 | def build_train_data(config, predictor_file, no_obs=False, no_models=False, test_size=0): 98 | """ 99 | Build the array of training (and optionally testing) data. 
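A usage sketch (the file name is illustrative):

    p_train, t_train, r_train, p_test, t_test, r_test = build_train_data(
        config, 'KSEA_predictors.pkl', test_size=50)

With test_size == 0 (the default), only (predictors, targets, rain) is returned, and rain is None unless the rain tuning 'use_raw_rain' option is enabled.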
100 | 101 | :param config: 102 | :param predictor_file: 103 | :param no_obs: 104 | :param no_models: 105 | :param test_size: 106 | :return: 107 | """ 108 | from sklearn.model_selection import train_test_split 109 | 110 | if config['verbose']: 111 | print('build_train_data: reading predictor file') 112 | rain_tuning = config['Model'].get('Rain tuning', None) 113 | with open(predictor_file, 'rb') as handle: 114 | data = pickle.load(handle) 115 | 116 | # Select data 117 | if no_obs and no_models: 118 | no_obs = False 119 | no_models = False 120 | if no_obs: 121 | if config['verbose']: 122 | print('build_train_data: not using observations to train') 123 | predictors = data['BUFKIT'] 124 | elif no_models: 125 | if config['verbose']: 126 | print('build_train_data: not using models to train') 127 | predictors = data['OBS'] 128 | else: 129 | predictors = np.concatenate((data['BUFKIT'], data['OBS']), axis=1) 130 | if rain_tuning is not None and to_bool(rain_tuning.get('use_raw_rain', False)): 131 | predictors = np.concatenate((predictors, data.rain), axis=1) 132 | rain_shape = data.rain.shape[-1] 133 | targets = data['VERIF'] 134 | 135 | if test_size > 0: 136 | p_train, p_test, t_train, t_test = train_test_split(predictors, targets, test_size=test_size) 137 | if rain_tuning is not None and to_bool(rain_tuning.get('use_raw_rain', False)): 138 | r_train = p_train[:, -1*rain_shape:] 139 | p_train = p_train[:, :-1*rain_shape] 140 | r_test = p_test[:, -1 * rain_shape:] 141 | p_test = p_test[:, :-1 * rain_shape] 142 | else: 143 | r_train = None 144 | r_test = None 145 | return p_train, t_train, r_train, p_test, t_test, r_test 146 | else: 147 | if rain_tuning is not None and to_bool(rain_tuning.get('use_raw_rain', False)): 148 | return predictors, targets, data.rain 149 | else: 150 | return predictors, targets, None 151 | 152 | 153 | def train(config, predictor_file, estimator_file=None, no_obs=False, no_models=False, test_size=0): 154 | """ 155 | Generate and train a scikit-learn machine learning estimator. The estimator object is saved as a pickle so that it 156 | may be imported and used for predictions at any time. 
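A typical call, assuming a config object from mosx.util.get_config (file names are illustrative):

    config = get_config('KSEA.config')
    train(config, 'KSEA_predictors.pkl', estimator_file='KSEA_mosx.pkl')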
157 | 
158 | :param config: 
159 | :param predictor_file: str: full path to saved file of predictor data 
160 | :param estimator_file: str: full path to output model file 
161 | :param no_obs: bool: if True, generates the model with no OBS data 
162 | :param no_models: bool: if True, generates the model with no BUFR data 
163 | :param test_size: int: if > 0, returns a subset of the training data of size 'test_size' to test on 
164 | :return: if test_size > 0, returns the held-out test predictors, targets, and raw rain values 
165 | """ 
166 | estimator = build_estimator(config) 
167 | rain_tuning = config['Model'].get('Rain tuning', None) 
168 | if test_size > 0: 
169 | p_train, t_train, r_train, p_test, t_test, r_test = build_train_data(config, predictor_file, no_obs=no_obs, 
170 | no_models=no_models, test_size=test_size) 
171 | else: 
172 | p_train, t_train, r_train = build_train_data(config, predictor_file, no_obs=no_obs, no_models=no_models) 
173 | 
174 | print('train: training the estimator') 
175 | if rain_tuning is not None and to_bool(rain_tuning.get('use_raw_rain', False)): 
176 | estimator.fit(p_train, t_train, rain_array=r_train) 
177 | else: 
178 | estimator.fit(p_train, t_train) 
179 | 
180 | if estimator_file is None: 
181 | estimator_file = '%s/%s_mosx.pkl' % (config['MOSX_ROOT'], config['station_id']) 
182 | print('train: -> exporting to %s' % estimator_file) 
183 | with open(estimator_file, 'wb') as handle: 
184 | pickle.dump(estimator, handle, protocol=pickle.HIGHEST_PROTOCOL) 
185 | 
186 | if test_size > 0: 
187 | return p_test, t_test, r_test 
188 | return 
189 | 
190 | 
191 | def _plot_learning_curve(estimator, X, y, ylim=None, cv=None, scoring=None, title=None, n_jobs=1, 
192 | train_sizes=np.linspace(.1, 1.0, 5)): 
193 | import matplotlib.pyplot as plt 
194 | from sklearn.model_selection import learning_curve 
195 | 
196 | fig = plt.figure() 
197 | if title is not None: 
198 | plt.title(title) 
199 | if ylim is not None: 
200 | plt.ylim(*ylim) 
201 | plt.xlabel("Training examples") 
202 | plt.ylabel("Score") 
203 | train_sizes, train_scores, test_scores = learning_curve( 
204 | estimator, X, y, cv=cv, scoring=scoring, n_jobs=n_jobs, train_sizes=train_sizes) 
205 | train_scores_mean = np.mean(train_scores, axis=1) 
206 | train_scores_std = np.std(train_scores, axis=1) 
207 | test_scores_mean = np.mean(test_scores, axis=1) 
208 | test_scores_std = np.std(test_scores, axis=1) 
209 | plt.grid() 
210 | 
211 | plt.fill_between(train_sizes, train_scores_mean - train_scores_std, 
212 | train_scores_mean + train_scores_std, alpha=0.1, 
213 | color="r") 
214 | plt.fill_between(train_sizes, test_scores_mean - test_scores_std, 
215 | test_scores_mean + test_scores_std, alpha=0.1, color="g") 
216 | plt.plot(train_sizes, train_scores_mean, 'o-', color="r", 
217 | label="Training score") 
218 | plt.plot(train_sizes, test_scores_mean, 'o-', color="g", 
219 | label="Cross-validation score") 
220 | 
221 | plt.legend(loc="best") 
222 | return fig 
223 | 
224 | 
225 | def plot_learning_curve(config, predictor_file, no_obs=False, no_models=False, ylim=None, cv=None, scoring=None, 
226 | title=None, n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)): 
227 | """ 
228 | Generate a simple plot of the test and training learning curve. 
From scikit-learn: 
229 | http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html 
230 | 
231 | Parameters 
232 | ---------- 
233 | config : 
234 | 
235 | predictor_file : string 
236 | Full path to file containing predictor data 
237 | 
238 | no_obs : boolean 
239 | Train model without observations 
240 | 
241 | no_models : boolean 
242 | Train model without model data 
243 | 
244 | ylim : tuple, shape (ymin, ymax), optional 
245 | Defines minimum and maximum yvalues plotted. 
246 | 
247 | cv : int, cross-validation generator or an iterable, optional 
248 | Determines the cross-validation splitting strategy. 
249 | Possible inputs for cv are: 
250 | - None, to use the default 3-fold cross-validation, 
251 | - integer, to specify the number of folds. 
252 | - An object to be used as a cross-validation generator. 
253 | - An iterable yielding train/test splits. 
254 | 
255 | For integer/None inputs, if ``y`` is binary or multiclass, 
256 | :class:`StratifiedKFold` is used. If the estimator is not a classifier 
257 | or if ``y`` is neither binary nor multiclass, :class:`KFold` is used. 
258 | 
259 | Refer to the scikit-learn User Guide for the various 
260 | cross-validators that can be used here. 
261 | 
262 | scoring : 
263 | Scoring function for the error calculation; should be a scikit-learn scorer object 
264 | 
265 | title : string 
266 | Title for the chart. 
267 | 
268 | n_jobs : integer, optional 
269 | Number of jobs to run in parallel (default 1). 
270 | 
271 | train_sizes : iterable, optional 
272 | Sequence of subsets of training data used in learning curve plot 
273 | """ 
274 | estimator = build_estimator(config) 
275 | X, y, _ = build_train_data(config, predictor_file, no_obs=no_obs, no_models=no_models)  # raw rain values unused 
276 | fig = _plot_learning_curve(estimator, X, y, ylim=ylim, cv=cv, scoring=scoring, title=title, n_jobs=n_jobs, 
277 | train_sizes=train_sizes) 
278 | return fig 
279 | 
280 | 
281 | def combine_train_test(config, train_file, test_file, no_obs=False, no_models=False, return_count_test=True): 
282 | """ 
283 | Concatenates the arrays of predictors and verification values from the train file and the test file. Useful for 
284 | implementing cross-validation using scikit-learn's methods and the SplitConsecutive class. 
285 | 
286 | :param config: 
287 | :param train_file: str: full path to predictor file for training 
288 | :param test_file: str: full path to predictor file for validation 
289 | :param no_obs: bool: if True, generates the model with no OBS data 
290 | :param no_models: bool: if True, generates the model with no BUFR data 
291 | :param return_count_test: bool: if True, also returns the number of samples in the test set (see SplitConsecutive) 
292 | :return: predictors, verifications: concatenated arrays of predictors and verification values; count: number of 
293 | samples in the test set 
294 | """ 
295 | p_train, t_train, _ = build_train_data(config, train_file, no_obs=no_obs, no_models=no_models)  # rain unused 
296 | p_test, t_test, _ = build_train_data(config, test_file, no_obs=no_obs, no_models=no_models) 
297 | p_combined = np.concatenate((p_train, p_test), axis=0) 
298 | t_combined = np.concatenate((t_train, t_test), axis=0) 
299 | if return_count_test: 
300 | return p_combined, t_combined, t_test.shape[0] 
301 | else: 
302 | return p_combined, t_combined 
303 | 
304 | 
305 | class SplitConsecutive(object): 
306 | """ 
307 | Implements a split method to subset a training set into train and test sets, using the first or last n samples in 
308 | the set. 
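This makes it possible to cross-validate against a block of consecutive dates, e.g. with sklearn.model_selection.cross_val_score (a sketch; the scorer is illustrative):

    p, t, n_test = combine_train_test(config, train_file, test_file)
    cv = SplitConsecutive(first=False, n_samples=n_test)
    scores = cross_val_score(estimator, p, t, scoring=wxchallenge_scorer(), cv=cv)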
309 | """ 310 | 311 | def __init__(self, first=False, n_samples=0.2): 312 | """ 313 | Create an instance of SplitConsecutive. 314 | 315 | :param first: bool: if True, gets test data from the beginning of the data set; otherwise from the end 316 | :param n_samples: float or int: if float, subsets a fraction (0 to 1) of the data into the test set; if int, 317 | subsets a specific number of samples. 318 | """ 319 | if type(first) is not bool: 320 | raise TypeError("'first' must be a boolean type.") 321 | try: 322 | n_samples = int(n_samples) 323 | except: 324 | pass 325 | if type(n_samples) is float and (n_samples <= 0. or n_samples >= 1.): 326 | raise ValueError("if float, 'n_samples' must be between 0 and 1.") 327 | if type(n_samples) is not float and type(n_samples) is not int: 328 | raise TypeError("'n_samples' must be float or int type.") 329 | self.first = first 330 | self.n_samples = n_samples 331 | self.n_splits = 1 332 | 333 | def split(self, X, y=None, groups=None): 334 | """ 335 | Produces arrays of indices to use for model and test splits. 336 | 337 | :param X: array-like, shape (samples, features): predictor data 338 | :param y: array-like, shape (samples, outputs) or None: verification data; ignored 339 | :param groups: ignored 340 | :return: model, test: 1-D arrays of sample indices in the model and test sets 341 | """ 342 | num_samples = X.shape[0] 343 | indices = np.arange(0, num_samples, 1, dtype=np.int32) 344 | if type(self.n_samples) is float: 345 | self.n_samples = int(np.round(num_samples * self.n_samples)) 346 | if self.first: 347 | test = indices[:self.n_samples] 348 | train = indices[self.n_samples:] 349 | else: 350 | test = indices[-self.n_samples:] 351 | train = indices[:num_samples - self.n_samples] 352 | yield train, test 353 | 354 | def get_n_splits(self, X=None, y=None, groups=None): 355 | """ 356 | Return the number of splits. Dummy function for compatibility. 357 | 358 | :param X: ignored 359 | :param y: ignored 360 | :param groups: ignored 361 | :return: 362 | """ 363 | return self.n_splits 364 | 365 | 366 | def predict_all(config, predictor_file, ensemble=False, time_series_date=None, naive_rain_correction=False, 367 | round_result=False, **kwargs): 368 | """ 369 | Predict forecasts from the estimator in config. Also return probabilities and time series. 
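A sketch of inspecting the ensemble spread (assuming a forest-based estimator and an illustrative file name):

    predicted, all_predicted, ts = predict_all(config, 'KSEA_predictors.pkl', ensemble=True)
    high_spread = all_predicted[0, 0, :].std()  # spread of tree predictions for the high temperature

all_predicted has shape num_samples x 4 x num_trees for the four daily variables (high, low, wind, rain).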
370 | 
371 | :param config: 
372 | :param predictor_file: str: file containing predictor data from mosx.model.format_predictors 
373 | :param ensemble: bool: if True, also return the predictions of each tree in the estimator (num_samples x 4 x num_trees) 
374 | :param time_series_date: datetime: if set, returns a time series prediction from the estimator, where the datetime 
375 | provided is the day the forecast is for (only works for single-day runs, or assumes last day) 
376 | :param naive_rain_correction: bool: if True, applies manual tuning to the rain forecast 
377 | :param round_result: bool: if True, rounds the predicted estimate 
378 | :param kwargs: passed to the estimator's 'predict' method 
379 | :return: 
380 | predicted: ndarray: num_samples x num_predicted_variables predictions 
381 | all_predicted: ndarray: num_samples x num_predicted_variables x num_ensemble_members predictions for all trees 
382 | predicted_timeseries: DataFrame: time series for final sample 
383 | """ 
384 | # Load the predictor data and estimator 
385 | with open(predictor_file, 'rb') as handle: 
386 | predictor_data = pickle.load(handle) 
387 | rain_tuning = config['Model'].get('Rain tuning', None) 
388 | if config['verbose']: 
389 | print('predict: loading estimator %s' % config['Model']['estimator_file']) 
390 | with open(config['Model']['estimator_file'], 'rb') as handle: 
391 | estimator = pickle.load(handle) 
392 | 
393 | predictors = np.concatenate((predictor_data['BUFKIT'], predictor_data['OBS']), axis=1) 
394 | if config['Model']['rain_forecast_type'] == 'pop' and getattr(estimator, 'is_classifier', False): 
395 | predict_method = estimator.predict_proba 
396 | else: 
397 | predict_method = estimator.predict 
398 | if rain_tuning is not None and to_bool(rain_tuning.get('use_raw_rain', False)): 
399 | predicted = predict_method(predictors, rain_array=predictor_data.rain, **kwargs) 
400 | else: 
401 | predicted = predict_method(predictors, **kwargs) 
402 | precip = predictor_data.rain 
403 | 
404 | # Check for precipitation override 
405 | if naive_rain_correction: 
406 | for day in range(predicted.shape[0]): 
407 | if sum(precip[day]) < 0.01: 
408 | if config['verbose']: 
409 | print('predict: warning: overriding MOS-X rain prediction of %0.2f on day %s with 0' % 
410 | (predicted[day, 3], day)) 
411 | predicted[day, 3] = 0. 
412 | elif predicted[day, 3] > max(precip[day]) or predicted[day, 3] < min(precip[day]): 
413 | if config['verbose']: 
414 | print('predict: warning: overriding MOS-X prediction of %0.2f on day %s with model mean' % 
415 | (predicted[day, 3], day)) 
416 | predicted[day, 3] = max(0., np.mean(np.append(precip[day], predicted[day, 3])))  # mean of model values and prediction 
417 | else: 
418 | # At least make sure we aren't predicting negative values... 
419 | predicted[:, 3][predicted[:, 3] < 0] = 0.0 420 | 421 | # Round off daily values, if selected 422 | if round_result: 423 | predicted[:, :3] = np.round(predicted[:, :3]) 424 | predicted[:, 3] = np.round(predicted[:, 3], 2) 425 | 426 | # If probabilities are requested and available, get the results from each tree 427 | if ensemble: 428 | num_samples = predictors.shape[0] 429 | if not hasattr(estimator, 'named_steps'): 430 | forest = estimator 431 | else: 432 | imputer = estimator.named_steps['imputer'] 433 | forest = estimator.named_steps['regressor'] 434 | predictors = imputer.transform(predictors) 435 | # If we generated our own ensemble by bootstrapping, it must be treated as such 436 | if config['Model']['train_individual'] and config['Model'].get('Bootstrapping', None) is None: 437 | num_trees = len(forest.estimators_[0].estimators_) 438 | all_predicted = np.zeros((num_samples, 4, num_trees)) 439 | for v in range(4): 440 | for t in range(num_trees): 441 | try: 442 | all_predicted[:, v, t] = forest.estimators_[v].estimators_[t].predict(predictors) 443 | except AttributeError: 444 | # Work around the 2-D array of estimators for GBTrees 445 | all_predicted[:, v, t] = forest.estimators_[v].estimators_[t][0].predict(predictors) 446 | else: 447 | num_trees = len(forest.estimators_) 448 | all_predicted = np.zeros((num_samples, 4, num_trees)) 449 | for t in range(num_trees): 450 | try: 451 | all_predicted[:, :, t] = forest.estimators_[t].predict(predictors)[:, :4] 452 | except AttributeError: 453 | # Work around the 2-D array of estimators for GBTrees 454 | all_predicted[:, :, t] = forest.estimators_[t][0].predict(predictors)[:, :4] 455 | else: 456 | all_predicted = None 457 | 458 | if config['Model']['predict_timeseries']: 459 | if time_series_date is None: 460 | date_now = datetime.utcnow() 461 | time_series_date = datetime(date_now.year, date_now.month, date_now.day) + timedelta(days=1) 462 | print('predict: warning: set time series start date to %s (was unspecified)' % time_series_date) 463 | num_hours = int(24 / config['time_series_interval']) + 1 464 | predicted_array = predicted[-1, 4:].reshape((4, num_hours)).T 465 | # Get dewpoint 466 | predicted_array[:, 2] = dewpoint(predicted_array[:, 0], predicted_array[:, 2]) 467 | times = pd.date_range(time_series_date.replace(hour=6), periods=num_hours, 468 | freq='%dH' % config['time_series_interval']).to_pydatetime().tolist() 469 | variables = ['temperature', 'rain', 'dewpoint', 'windSpeed'] 470 | round_dict = {'temperature': 0, 'rain': 2, 'dewpoint': 0, 'windSpeed': 0} 471 | predicted_timeseries = pd.DataFrame(predicted_array, index=times, columns=variables) 472 | predicted_timeseries = predicted_timeseries.round(round_dict) 473 | else: 474 | predicted_timeseries = None 475 | 476 | return predicted, all_predicted, predicted_timeseries 477 | 478 | 479 | def predict(config, predictor_file, naive_rain_correction=False, round=False, **kwargs): 480 | """ 481 | Predict forecasts from the estimator in config. Only returns daily values. 
482 | 483 | :param config: 484 | :param predictor_file: str: file containing predictor data from mosx.model.format_predictors 485 | :param naive_rain_correction: bool: if True, applies manual tuning to the rain forecast 486 | :param round: bool: if True, rounds the predicted estimate 487 | :param kwargs: passed to the estimator's 'predict' method 488 | :return: 489 | """ 490 | 491 | predicted, all_predicted, predicted_timeseries = predict_all(config, predictor_file, 492 | naive_rain_correction=naive_rain_correction, 493 | round_result=round, **kwargs) 494 | return predicted 495 | 496 | 497 | def predict_rain_proba(config, predictor_file): 498 | """ 499 | Predict probabilistic rain forecasts for 'pop' or 'categorical' types. 500 | 501 | :param config: 502 | :param predictor_file: str: file containing predictor data from mosx.model.format_predictors 503 | :return: 504 | """ 505 | if config['Model']['rain_forecast_type'] not in ['pop', 'categorical']: 506 | raise TypeError("'quantity' rain forecast is not probabilistic, cannot get probabilities") 507 | rain_tuning = config['Model'].get('Rain tuning', None) 508 | if rain_tuning is None: 509 | raise TypeError('Probabilistic rain forecasts are only possible with a RainTuningEstimator') 510 | 511 | # Load the predictor data and estimator 512 | with open(predictor_file, 'rb') as handle: 513 | predictor_data = pickle.load(handle) 514 | if config['verbose']: 515 | print('predict: loading estimator %s' % config['Model']['estimator_file']) 516 | with open(config['Model']['estimator_file'], 'rb') as handle: 517 | estimator = pickle.load(handle) 518 | 519 | predictors = np.concatenate((predictor_data['BUFKIT'], predictor_data['OBS']), axis=1) 520 | if to_bool(rain_tuning.get('use_raw_rain', False)): 521 | rain_proba = estimator.predict_rain_proba(predictors, rain_array=predictor_data.rain) 522 | else: 523 | rain_proba = estimator.predict_rain_proba(predictors) 524 | 525 | return rain_proba 526 | -------------------------------------------------------------------------------- /mosx/model/predictors.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing input predictor data. 9 | """ 10 | 11 | import numpy as np 12 | import mosx.bufr 13 | import mosx.obs 14 | import mosx.verification 15 | from mosx.util import unpickle, find_matching_dates 16 | import pickle 17 | 18 | 19 | class PredictorDict(dict): 20 | """ 21 | Special class extending dict to add an attribute containing raw precipitation values, for precipitation-aware 22 | estimator configurations. 23 | """ 24 | def __init__(self, *args, **kwargs): 25 | super(PredictorDict, self).__init__(*args, **kwargs) 26 | self.rain = None 27 | 28 | def add_rain(self, rain_array): 29 | """ 30 | Add an array of raw rain values to the dict. If the dictionary contains BUFKIT array, checks that the sample 31 | size is correct. 32 | :param rain_array: 33 | :return: 34 | """ 35 | rain_array[np.isnan(rain_array)] = 0. 
36 | 37 | if 'BUFKIT' in self.keys(): 38 | if isinstance(self['BUFKIT'], np.ndarray): 39 | if self['BUFKIT'].shape[0] != rain_array.shape[0]: 40 | raise ValueError('rain_array and BUFKIT array must have the same sample size; got %s and %s' % 41 | (rain_array.shape[0], self['BUFKIT'].shape[0])) 42 | self.rain = rain_array 43 | 44 | 45 | def format_predictors(config, bufr_file, obs_file, verif_file, output_file=None, return_dates=False): 46 | """ 47 | Generates a complete date-by-x array of data for ingestion into the machine learning estimator. verif_file may be 48 | None if creating a set to run the model. 49 | 50 | :param config: 51 | :param bufr_file: str: full path to the saved file of BUFR data 52 | :param obs_file: str: full path to the saved file of OBS data 53 | :param verif_file: str: full path to the saved file of VERIF data 54 | :param output_file: str: full path to output predictors file 55 | :param return_dates: if True, returns all of the matching dates used to produce the predictor arrays 56 | :return: optionally a list of dates and a list of lists of precipitation values 57 | """ 58 | bufr, obs, verif = unpickle(bufr_file, obs_file, verif_file) 59 | bufr, obs, verif, all_dates = find_matching_dates(bufr, obs, verif, return_data=True) 60 | bufr_array = mosx.bufr.process(config, bufr) 61 | obs_array = mosx.obs.process(config, obs) 62 | verif_array = mosx.verification.process(config, verif) 63 | 64 | export_dict = { 65 | 'BUFKIT': bufr_array, 66 | 'OBS': obs_array, 67 | 'VERIF': verif_array 68 | } 69 | export_dict = PredictorDict(export_dict) 70 | 71 | # Get raw precipitation values and add them to the PredictorDict 72 | precip_list = [] 73 | for date in all_dates: 74 | precip = [] 75 | for model in bufr['DAY'].keys(): 76 | precip.append(bufr['DAY'][model][date][-1] / 25.4) # mm to inches 77 | precip_list.append(precip) 78 | export_dict.add_rain(np.array(precip_list)) 79 | 80 | if output_file is None: 81 | output_file = '%s/%s_predictors.pkl' % (config['site_directory'], config['station_id']) 82 | print('predictors: -> exporting to %s' % output_file) 83 | with open(output_file, 'wb') as handle: 84 | pickle.dump(export_dict, handle, protocol=pickle.HIGHEST_PROTOCOL) 85 | 86 | if return_dates: 87 | return all_dates 88 | else: 89 | return 90 | -------------------------------------------------------------------------------- /mosx/model/scorers.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Scoring techniques for model evaluation. 9 | """ 10 | 11 | import numpy as np 12 | from sklearn.metrics import make_scorer 13 | 14 | 15 | def wxchallenge_error(y, y_pred, average=True, no_rain=False): 16 | """ 17 | Returns the forecast error as measured by the WxChallenge. 
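The total is the absolute high temperature error, plus the absolute low temperature error, plus half the absolute wind error, plus a banded rain penalty accumulated for every 0.01" between the forecast and observed totals: 0.4 points per 0.01" below 0.10", 0.3 from 0.10" to 0.24", 0.2 from 0.25" to 0.49", and 0.1 at or above 0.50". For example, truth (62, 48, 12, 0.00) scored against forecast (64, 47, 14, 0.15) gives 2 + 1 + 0.5 * 2 + (10 * 0.4 + 5 * 0.3) = 9.5.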
18 | 19 | :param y: n-by-4 array of truth values 20 | :param y_pred: n-by-4 array of predicted values 21 | :param average: bool: if True, returns the average error per sample 22 | :param no_rain: bool: if True, does not count rain error 23 | :return: float: cumulative error 24 | """ 25 | if y.shape != y_pred.shape: 26 | raise ValueError("y and y_pred must have the same shape") 27 | if len(y.shape) > 2: 28 | raise ValueError("got too many dimensions for y and y_pred (expected 2)") 29 | if (y.shape != (4,)) and y.shape[1] != 4: 30 | raise ValueError('last dimension of y and y_pred should be length 4') 31 | 32 | if len(y.shape) == 1: 33 | y = np.array([y]) 34 | y_pred = np.array([y_pred]) 35 | 36 | high_error = np.sum(np.abs(y[:, 0] - y_pred[:, 0])) 37 | low_error = np.sum(np.abs(y[:, 1] - y_pred[:, 1])) 38 | wind_error = 0.5 * np.sum(np.abs(y[:, 2] - y_pred[:, 2])) 39 | rain_error = 0. 40 | for sample in range(y.shape[0]): 41 | y_rain = y[sample, 3] 42 | y_pred_rain = y_pred[sample, 3] 43 | rain_min = int(100.*min(y_rain, y_pred_rain)) 44 | rain_max = int(100.*max(y_rain, y_pred_rain)) 45 | while rain_min < rain_max: 46 | if rain_min < 10: 47 | rain_error += 0.4 48 | elif rain_min < 25: 49 | rain_error += 0.3 50 | elif rain_min < 50: 51 | rain_error += 0.2 52 | else: 53 | rain_error += 0.1 54 | rain_min += 1 55 | 56 | result = high_error + low_error + wind_error 57 | if not no_rain: 58 | result += rain_error 59 | if average: 60 | result /= y.shape[0] 61 | return result 62 | 63 | 64 | def wxchallenge_scorer(**kwargs): 65 | """ 66 | Return a scikit-learn scorer object for forecast error as measured by WxChallenge. 67 | 68 | :param kwargs: parameters passed to the WxChallenge error function 69 | no_rain: if True, does not count rain error 70 | :return: 71 | """ 72 | scorer = make_scorer(wxchallenge_error, greater_is_better=False, **kwargs) 73 | return scorer 74 | -------------------------------------------------------------------------------- /mosx/obs/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing OBS data. 9 | """ 10 | 11 | from .methods import * 12 | -------------------------------------------------------------------------------- /mosx/obs/methods.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing OBS data. 9 | """ 10 | 11 | import os 12 | import numpy as np 13 | import pandas as pd 14 | from datetime import datetime, timedelta 15 | import pickle 16 | from collections import OrderedDict 17 | from mosx.MesoPy import Meso 18 | from metpy.io import get_upper_air_data 19 | from metpy.calc import interp 20 | from mosx.util import generate_dates, get_array 21 | 22 | 23 | def upper_air(config, date, use_nan_sounding=False, use_existing=True, save=True): 24 | """ 25 | Retrieves upper-air data and interpolates to pressure levels. If use_nan_sounding is True, then if a retrieval 26 | error occurs, a blank sounding will be returned instead of an error. 
27 | 28 | :param config: 29 | :param date: datetime 30 | :param use_nan_sounding: bool: if True, use sounding of NaNs instead of raising an error 31 | :param use_existing: bool: preferentially use existing soundings in sounding_data_dir 32 | :param save: bool: if True, save processed soundings to sounding_data_dir 33 | :return: 34 | """ 35 | variables = ['height', 'temperature', 'dewpoint', 'u_wind', 'v_wind'] 36 | 37 | # Define levels for interpolation: same as model data, except omitting lowest_p_level 38 | plevs = [600, 750, 850, 925] 39 | pres_interp = [p for p in plevs if p <= config['lowest_p_level']] 40 | 41 | # Try retrieving the sounding, first checking for existing 42 | if config['verbose']: 43 | print('upper_air: retrieving sounding for %s' % datetime.strftime(date, '%Y%m%d%H')) 44 | nan_sounding = False 45 | retrieve_sounding = False 46 | sndg_data_dir = config['Obs']['sounding_data_dir'] 47 | if not(os.path.isdir(sndg_data_dir)): 48 | os.makedirs(sndg_data_dir) 49 | sndg_file = '%s/%s_SNDG_%s.pkl' % (sndg_data_dir, config['station_id'], datetime.strftime(date, '%Y%m%d%H')) 50 | if use_existing: 51 | try: 52 | with open(sndg_file, 'rb') as handle: 53 | data = pickle.load(handle) 54 | if config['verbose']: 55 | print(' Read from file.') 56 | except: 57 | retrieve_sounding = True 58 | else: 59 | retrieve_sounding = True 60 | if retrieve_sounding: 61 | try: 62 | dset = get_upper_air_data(date, config['Obs']['sounding_station_id']) 63 | except: 64 | # Try again 65 | try: 66 | dset = get_upper_air_data(date, config['Obs']['sounding_station_id']) 67 | except: 68 | if use_nan_sounding: 69 | if config['verbose']: 70 | print('upper_air: warning: unable to retrieve sounding; using nan.') 71 | nan_sounding = True 72 | else: 73 | raise ValueError('error retrieving sounding for %s' % date) 74 | 75 | # Retrieve pressure for interpolation to fixed levels 76 | if not nan_sounding: 77 | pressure = dset.variables['pressure'] 78 | pres = np.array([p.magnitude for p in list(pressure)]) # units are hPa 79 | 80 | # Get variables and interpolate; add to dictionary 81 | data = OrderedDict() 82 | for var in variables: 83 | if not nan_sounding: 84 | var_data = dset.variables[var] 85 | var_array = np.array([v.magnitude for v in list(var_data)]) 86 | var_interp = interp(pres_interp, pres, var_array) 87 | data[var] = var_interp.tolist() 88 | else: 89 | data[var] = [np.nan] * len(pres_interp) 90 | 91 | # Save 92 | if save and not nan_sounding: 93 | with open(sndg_file, 'wb') as handle: 94 | pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL) 95 | 96 | return data 97 | 98 | 99 | def get_obs_hourly(config, api_dates, vars_api, units): 100 | """ 101 | Retrieve hourly obs data in a pd dataframe. In order to ensure that there is no missing hourly indices, use 102 | dataframe.reindex on each retrieved dataframe. 
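Each element of api_dates is a (start, end) tuple of '%Y%m%d%H%M' strings as produced by mosx.util.generate_dates(config, api=True), e.g. ('201510010000', '201603310000') for a single season (the dates here are illustrative).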
103 | 104 | :param api_dates: dates from generate_dates 105 | :param vars_api: str: string formatted for api call var parameter 106 | :param units: str: string formatted for api call units parameter 107 | :return: pd.DataFrame: formatted hourly obs DataFrame 108 | """ 109 | # Initialize Meso 110 | m = Meso(token=config['meso_token']) 111 | if config['verbose']: 112 | print('get_obs_hourly: MesoPy initialized for station %s' % config['station_id']) 113 | 114 | # Retrieve data 115 | obs_final = pd.DataFrame() 116 | for api_date in api_dates: 117 | if config['verbose']: 118 | print('get_obs_hourly: retrieving data from %s to %s' % api_date) 119 | obs = m.timeseries(stid=config['station_id'], start=api_date[0], end=api_date[1], vars=vars_api, units=units, 120 | hfmetars='0') 121 | obspd = pd.DataFrame.from_dict(obs['STATION'][0]['OBSERVATIONS']) 122 | 123 | # Rename columns to requested vars 124 | obs_var_names = obs['STATION'][0]['SENSOR_VARIABLES'] 125 | obs_var_keys = list(obs_var_names.keys()) 126 | col_names = list(map(''.join, obspd.columns.values)) 127 | for c in range(len(col_names)): 128 | col = col_names[c] 129 | for k in range(len(obs_var_keys)): 130 | key = obs_var_keys[k] 131 | if col == list(obs_var_names[key].keys())[0]: 132 | col_names[c] = key 133 | obspd.columns = col_names 134 | 135 | # Change datetime column to datetime object 136 | dateobj = pd.to_datetime(obspd['date_time']) 137 | obspd['date_time'] = dateobj 138 | datename = 'date_time' 139 | obspd = obspd.rename(columns={'date_time': datename}) 140 | 141 | # Reformat data into hourly obs 142 | # Find mode of minute data: where the hourly metars are 143 | if config['verbose']: 144 | print('get_obs_hourly: finding METAR observation times...') 145 | minutes = [] 146 | for row in obspd.iterrows(): 147 | date = row[1][datename] 148 | minutes.append(date.minute) 149 | minute_count = np.bincount(np.array(minutes)) 150 | rev_count = minute_count[::-1] 151 | minute_mode = minute_count.size - rev_count.argmax() - 1 152 | 153 | if config['verbose']: 154 | print('get_obs_hourly: finding hourly data...') 155 | obs_hourly = obspd[pd.DatetimeIndex(obspd[datename]).minute == minute_mode] 156 | obs_hourly = obs_hourly.set_index(datename) 157 | 158 | # May not have precip if none is recorded 159 | try: 160 | obs_hourly['precip_accum_one_hour'].fillna(0.0, inplace=True) 161 | except KeyError: 162 | obs_hourly['precip_accum_one_hour'] = 0.0 163 | 164 | # Need to reorder the column names 165 | obs_hourly.sort_index(axis=1, inplace=True) 166 | 167 | # Remove any duplicate rows 168 | obs_hourly = obs_hourly[~obs_hourly.index.duplicated(keep='last')] 169 | 170 | # Re-index by hourly. Fills missing with NaNs. Try to interpolate the NaNs. 
171 | expected_start = datetime.strptime(api_date[0], '%Y%m%d%H%M').replace(minute=minute_mode) 172 | expected_end = datetime.strptime(api_date[1], '%Y%m%d%H%M') 173 | expected_times = pd.date_range(expected_start, expected_end, freq='H').to_pydatetime() 174 | obs_hourly = obs_hourly.reindex(expected_times) 175 | obs_hourly = obs_hourly.interpolate(limit=3) 176 | 177 | obs_final = pd.concat((obs_final, obs_hourly)) 178 | 179 | # Remove any duplicate rows from concatenation 180 | obs_final = obs_final[~obs_final.index.duplicated(keep='last')] 181 | 182 | return obs_final 183 | 184 | 185 | def reindex_hourly(df, start, end, interval, end_23z=False, use_rain_max=False): 186 | 187 | def last(values): 188 | return values.iloc[-1] 189 | 190 | if end_23z: 191 | new_end = pd.Timestamp(end.to_pydatetime() - timedelta(hours=1)) 192 | else: 193 | new_end = end 194 | period = pd.date_range(start, end, freq='%dH' % interval) 195 | # Create a column with the new index an ob falls into 196 | df['period'] = (df.index.values > period.values[..., np.newaxis]).sum(0) 197 | df['DateTime'] = df.index.values 198 | aggregate = OrderedDict() 199 | col_names = df.columns.values 200 | for col in col_names: 201 | if not(col.lower().startswith('precip')) and not(col.lower().startswith('rain')): 202 | aggregate[col] = last 203 | else: 204 | if use_rain_max: 205 | aggregate[col] = np.max 206 | else: 207 | aggregate[col] = np.sum 208 | df_reindex = df.loc[start:new_end].groupby('period').agg(aggregate) 209 | try: 210 | df_reindex = df_reindex.drop('period', axis=1) 211 | except (ValueError, KeyError): 212 | pass 213 | df_reindex = df_reindex.set_index('DateTime') 214 | 215 | return df_reindex 216 | 217 | 218 | def obs(config, output_file=None, num_hours=24, interval=3, use_nan_sounding=False, use_existing_sounding=True): 219 | """ 220 | Generates observation data from MesoWest and UCAR soundings and saves to a file, which can later be retrieved for 221 | either training data or model run data. 
222 | 
223 | :param config: 
224 | :param output_file: str: output file path 
225 | :param num_hours: int: number of hours to retrieve obs 
226 | :param interval: int: retrieve obs every 'interval' hours 
227 | :param use_nan_sounding: bool: if True, uses a sounding of NaNs rather than omitting a day if sounding is missing 
228 | :param use_existing_sounding: bool: if True, preferentially uses saved soundings in sounding_data_dir 
229 | :return: 
230 | """ 
231 | if output_file is None: 
232 | output_file = '%s/%s_obs.pkl' % (config['SITE_ROOT'], config['station_id']) 
233 | 
234 | start_date = datetime.strptime(config['data_start_date'], '%Y%m%d') - timedelta(hours=num_hours) 
235 | dates = generate_dates(config) 
236 | api_dates = generate_dates(config, api=True, start_date=start_date) 
237 | 
238 | # Look for desired variables 
239 | vars_request = ['air_temp', 'altimeter', 'precip_accum_one_hour', 'relative_humidity', 
240 | 'wind_speed', 'wind_direction'] 
241 | 
242 | # Add variables to the api request 
243 | vars_api = '' 
244 | for var in vars_request: 
245 | vars_api += var + ',' 
246 | vars_api = vars_api[:-1] 
247 | 
248 | # Units 
249 | units = 'temp|f,precip|in,speed|kts' 
250 | 
251 | # Retrieve station data 
252 | obs_hourly = get_obs_hourly(config, api_dates, vars_api, units) 
253 | 
254 | # Retrieve upper-air sounding data 
255 | if config['verbose']: 
256 | print('obs: retrieving upper-air sounding data') 
257 | soundings = OrderedDict() 
258 | if config['Obs']['use_soundings']: 
259 | for date in dates: 
260 | soundings[date] = OrderedDict() 
261 | start_date = date - timedelta(days=1) # get the previous day's soundings 
262 | for hour in [0, 12]: 
263 | sounding_date = start_date + timedelta(hours=hour) 
264 | try: 
265 | sounding = upper_air(config, sounding_date, use_nan_sounding, use_existing=use_existing_sounding) 
266 | soundings[date][sounding_date] = sounding 
267 | except: 
268 | print('obs: warning: problem retrieving soundings for %s' % datetime.strftime(date, '%Y%m%d')) 
269 | soundings.pop(date) 
270 | break 
271 | 
272 | # Create dictionary of days 
273 | if config['verbose']: 
274 | print('obs: converting to output dictionary') 
275 | obs_export = OrderedDict({'SFC': OrderedDict(), 
276 | 'SNDG': OrderedDict()}) 
277 | for date in dates: 
278 | if config['Obs']['use_soundings'] and date not in soundings.keys(): 
279 | continue 
280 | # Need to ensure we use the right intervals to have 22:5? Z obs 
281 | start = pd.Timestamp((date - timedelta(hours=num_hours))) 
282 | end = pd.Timestamp(date) 
283 | obs_export['SFC'][date] = reindex_hourly(obs_hourly, start, end, interval, 
284 | end_23z=True).to_dict(into=OrderedDict) 
285 | if config['Obs']['use_soundings']: 
286 | obs_export['SNDG'][date] = soundings[date] 
287 | 
288 | # Export final data 
289 | if config['verbose']: 
290 | print('obs: -> exporting to %s' % output_file) 
291 | with open(output_file, 'wb') as handle: 
292 | pickle.dump(obs_export, handle, protocol=pickle.HIGHEST_PROTOCOL) 
293 | 
294 | return 
295 | 
296 | 
297 | def process(config, obs): 
298 | """ 
299 | Returns a numpy array of obs for use in mosx_predictors. The first dimension is date; all other dimensions are 
300 | serialized. 
301 | 302 | :param config: 303 | :param obs: dict: dictionary of processed obs data 304 | :return: 305 | """ 306 | if config['verbose']: 307 | print('obs.process: processing array for obs data...') 308 | 309 | # Surface observations 310 | sfc = obs['SFC'] 311 | num_days = len(sfc.keys()) 312 | variables = sorted(sfc[sfc.keys()[0]].keys()) 313 | sfc_array = get_array(sfc) 314 | sfc_array_r = np.reshape(sfc_array, (num_days, -1)) 315 | 316 | # Sounding observations 317 | if config['Obs']['use_soundings']: 318 | sndg_array = get_array(obs['SNDG']) 319 | # num_days should be the same first dimension 320 | sndg_array_r = np.reshape(sndg_array, (num_days, -1)) 321 | return np.hstack((sfc_array_r, sndg_array_r)) 322 | else: 323 | return sfc_array_r 324 | -------------------------------------------------------------------------------- /mosx/util.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Utilities for the MOS-X model. 9 | """ 10 | 11 | import os 12 | import numpy as np 13 | import pandas as pd 14 | from datetime import datetime, timedelta 15 | import pickle 16 | 17 | 18 | # ==================================================================================================================== # 19 | # Classes 20 | # ==================================================================================================================== # 21 | 22 | 23 | # ==================================================================================================================== # 24 | # Config functions 25 | # ==================================================================================================================== # 26 | 27 | def walk_kwargs(section, key): 28 | value = section[key] 29 | try: 30 | section[key] = int(value) 31 | except (TypeError, ValueError): 32 | try: 33 | section[key] = float(value) 34 | except (TypeError, ValueError): 35 | pass 36 | 37 | 38 | def get_config(config_path): 39 | """ 40 | Retrieve the config object from config_path. 
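The '.config' suffix may be omitted; if the file is not found as given, config_path + '.config' is tried. A minimal call:

    config = get_config('KSEA')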
41 | 42 | :param config_path: str: full path to config file 43 | :return: 44 | """ 45 | import configobj 46 | from validate import Validator 47 | 48 | dir_path = os.path.dirname(os.path.realpath(__file__)) 49 | config_spec = '%s/configspec' % dir_path 50 | 51 | try: 52 | config = configobj.ConfigObj(config_path, configspec=config_spec, file_error=True) 53 | except IOError: 54 | try: 55 | config = configobj.ConfigObj(config_path+'.config', configspec=config_spec, file_error=True) 56 | except IOError: 57 | print('Error: unable to open configuration file %s' % config_path) 58 | raise 59 | except configobj.ConfigObjError as e: 60 | print('Error while parsing configuration file %s' % config_path) 61 | print("*** Reason: '%s'" % e) 62 | raise 63 | 64 | config.validate(Validator()) 65 | 66 | # Make sure site_directory is there 67 | if config['SITE_ROOT'] == '': 68 | config['SITE_ROOT'] = '%(MOSX_ROOT)s/site_data' 69 | 70 | # Make sure BUFR parameters have defaults 71 | if config['BUFR']['bufr_station_id'] == '': 72 | config['BUFR']['bufr_station_id'] = '%(station_id)s' 73 | if config['BUFR']['bufr_data_dir'] == '': 74 | config['BUFR']['bufr_data_dir'] = '%(SITE_ROOT)s/bufkit' 75 | if config['BUFR']['bufrgruven'] == '': 76 | config['BUFR']['bufrgruven'] = '%(BUFR_ROOT)s/bufr_gruven.pl' 77 | 78 | # Make sure Obs parameters have defaults 79 | if config['Obs']['sounding_data_dir'] == '': 80 | config['Obs']['sounding_data_dir'] = '%(SITE_ROOT)s/soundings' 81 | 82 | # Add in a list for BUFR models 83 | config['BUFR']['bufr_models'] = [] 84 | for model in config['BUFR']['models']: 85 | if model.upper() == 'GFS': 86 | config['BUFR']['bufr_models'].append(['gfs3', 'gfs']) 87 | else: 88 | config['BUFR']['bufr_models'].append(model.lower()) 89 | 90 | # Convert kwargs, Rain tuning, Ada boosting, and Bootstrapping, if available, to int or float types 91 | config['Model']['Parameters'].walk(walk_kwargs) 92 | try: 93 | config['Model']['Ada boosting'].walk(walk_kwargs) 94 | except KeyError: 95 | pass 96 | try: 97 | config['Model']['Rain tuning'].walk(walk_kwargs) 98 | except KeyError: 99 | pass 100 | try: 101 | config['Model']['Bootstrapping'].walk(walk_kwargs) 102 | except KeyError: 103 | pass 104 | 105 | return config 106 | 107 | 108 | # ==================================================================================================================== # 109 | # Utility functions 110 | # ==================================================================================================================== # 111 | 112 | def get_object(module_class): 113 | """ 114 | Given a string with a module class name, it imports and returns the class. 115 | This function (c) Tom Keffer, weeWX. 116 | """ 117 | 118 | # Split the path into its parts 119 | parts = module_class.split('.') 120 | # Strip off the classname: 121 | module = '.'.join(parts[:-1]) 122 | # Import the top level module 123 | mod = __import__(module) 124 | # Recursively work down from the top level module to the class name. 125 | # Be prepared to catch an exception if something cannot be found. 126 | try: 127 | for part in parts[1:]: 128 | mod = getattr(mod, part) 129 | except AttributeError: 130 | # Can't find something. 
Give a more informative error message: 131 | raise AttributeError("Module '%s' has no attribute '%s' when searching for '%s'" % 132 | (mod.__name__, part, module_class)) 133 | return mod 134 | 135 | 136 | def generate_dates(config, api=False, start_date=None, end_date=None, api_add_hour=0): 137 | """ 138 | Returns all of the dates requested from the config. If api is True, then returns a list of (start_date, end_date) 139 | tuples split by year in strings formatted for the MesoWest API call. If api is False, then returns a list of all 140 | dates as datetime objects. start_date and end_date are available as options as certain calls require addition of 141 | some data for prior days. 142 | 143 | :param config: 144 | :param api: bool: if True, returns dates formatted for MesoWest API call 145 | :param start_date: str: starting date in config file format (YYYYMMDD) 146 | :param end_date: str: ending date in config file format (YYYYMMDD) 147 | :param api_add_hour: int: add this number of hours to the end of the call, useful for getting up to 6Z on last day 148 | :return: 149 | """ 150 | if start_date is None: 151 | start_date = datetime.strptime(config['data_start_date'], '%Y%m%d') 152 | if end_date is None: 153 | end_date = datetime.strptime(config['data_end_date'], '%Y%m%d') 154 | start_dt = start_date 155 | end_dt = end_date 156 | if start_dt > end_dt: 157 | raise ValueError('Start date must be before end date; check MOSX_INFILE.') 158 | end_year = end_dt.year + 1 159 | time_added = timedelta(hours=api_add_hour) 160 | all_dates = [] 161 | if config['is_season']: 162 | if start_dt > datetime(start_dt.year, end_dt.month, end_dt.day): 163 | # Season crosses new year 164 | end_year -= 1 165 | for year in range(start_dt.year, end_year): 166 | if start_dt > datetime(start_dt.year, end_dt.month, end_dt.day): 167 | # Season crosses new year 168 | year2 = year + 1 169 | else: 170 | year2 = year 171 | if api: 172 | year_start = datetime.strftime(datetime(year, start_dt.month, start_dt.day), '%Y%m%d0000') 173 | year_end = datetime.strftime(datetime(year2, end_dt.month, end_dt.day) + time_added, '%Y%m%d%H00') 174 | all_dates.append((year_start, year_end)) 175 | else: 176 | year_dates = pd.date_range(datetime(year, start_dt.month, start_dt.day), 177 | datetime(year2, end_dt.month, end_dt.day), freq='D') 178 | for date in year_dates: 179 | all_dates.append(date.to_pydatetime()) 180 | 181 | else: 182 | if api: 183 | for year in range(start_dt.year, end_year): 184 | if year == start_dt.year: 185 | year_start = datetime.strftime(datetime(year, start_dt.month, start_dt.day), '%Y%m%d0000') 186 | else: 187 | year_start = datetime.strftime(datetime(year, 1, 1), '%Y%m%d0000') 188 | if year == end_dt.year: 189 | year_end = datetime.strftime(datetime(year, end_dt.month, end_dt.day) + time_added, '%Y%m%d%H00') 190 | else: 191 | year_end = datetime.strftime(datetime(year+1, 1, 1) + time_added, '%Y%m%d%H00') 192 | all_dates.append((year_start, year_end)) 193 | else: 194 | pd_dates = pd.date_range(start_dt, end_dt, freq='D') 195 | for date in pd_dates: 196 | all_dates.append(date.to_pydatetime()) 197 | return all_dates 198 | 199 | 200 | def find_matching_dates(bufr, obs, verif, return_data=False): 201 | """ 202 | Finds dates which match in all three dictionaries. If return_data is True, returns the input dictionaries with only 203 | common dates retained. verif may be None if running the model. 
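A typical sequence when assembling predictors (file names are illustrative):

    bufr, obs, verif = unpickle('KSEA_bufr.pkl', 'KSEA_obs.pkl', 'KSEA_verif.pkl')
    bufr, obs, verif, dates = find_matching_dates(bufr, obs, verif, return_data=True)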
204 | 
205 |     :param bufr: dict: dictionary of processed BUFR data
206 |     :param obs: dict: dictionary of processed OBS data
207 |     :param verif: dict: dictionary of processed VERIFICATION data
208 |     :param return_data: bool: if True, returns edited data dictionaries containing only matching dates' data
209 |     :return: list of dates[, new BUFR, OBS, and VERIF dictionaries]
210 |     """
211 |     obs_dates = obs['SFC'].keys()
212 |     if verif is not None:
213 |         verif_dates = verif.keys()
214 |     # For BUFR dates, find the dates common to all models
215 |     bufr_dates_list = [bufr['SFC'][key].keys() for key in bufr['SFC'].keys()]
216 |     bufr_dates = bufr_dates_list[0]
217 |     for m in range(1, len(bufr_dates_list)):
218 |         bufr_dates = set(bufr_dates).intersection(set(bufr_dates_list[m]))
219 |     if verif is not None:
220 |         all_dates = (set(verif_dates).intersection(set(obs_dates))).intersection(bufr_dates)
221 |     else:
222 |         all_dates = set(obs_dates).intersection(bufr_dates)
223 |     if len(all_dates) == 0:
224 |         raise ValueError('Sorry, no matching dates found in data!')
225 |     print('find_matching_dates: found %d matching dates.' % len(all_dates))
226 |     if return_data:
227 |         for lev in ['SFC', 'PROF', 'DAY']:
228 |             for model in bufr[lev].keys():
229 |                 for date in bufr[lev][model].keys():
230 |                     if date not in all_dates:
231 |                         bufr[lev][model].pop(date, None)
232 |         for date in obs_dates:
233 |             if date not in all_dates:
234 |                 obs['SFC'].pop(date, None)
235 |                 obs['SNDG'].pop(date, None)
236 |         if verif is not None:
237 |             for date in verif_dates:
238 |                 if date not in all_dates:
239 |                     verif.pop(date, None)
240 |         return bufr, obs, verif, sorted(list(all_dates))
241 |     else:
242 |         return sorted(list(all_dates))
243 | 
244 | 
245 | def get_array(dictionary):
246 |     """
247 |     Transforms a nested dictionary into an n-dimensional numpy array, assuming that each nested sub-dictionary has
248 |     the same structure and that the values of the innermost dictionary are either lists or single float values.
249 |     Function _get_array is its recursive helper.
250 | 
251 |     :param dictionary:
252 |     :return:
253 |     """
254 |     dim_list = []
255 |     d = dictionary
256 |     while isinstance(d, dict):
257 |         dim_list.append(len(d.keys()))
258 |         d = d.values()[0]
259 |     try:
260 |         dim_list.append(len(d))
261 |     except TypeError:  # innermost values are scalars, not lists
262 |         pass
263 |     out_array = np.full(dim_list, np.nan, dtype=np.float64)
264 |     _get_array(dictionary, out_array)
265 |     return out_array
266 | 
267 | 
268 | def _get_array(dictionary, out_array):
269 |     if dictionary == {}:  # in case there's an empty dict for any reason
270 |         return
271 |     if isinstance(dictionary.values()[0], list):
272 |         for i, L in enumerate(dictionary.values()):
273 |             out_array[i, :] = np.asarray(L)
274 |     elif isinstance(dictionary.values()[0], float):
275 |         for i, L in enumerate(dictionary.values()):
276 |             out_array[i] = L
277 |     else:
278 |         for i, d in enumerate(dictionary.values()):
279 |             _get_array(d, out_array[i, :])
280 | 
281 | 
282 | def unpickle(bufr_file, obs_file, verif_file):
283 |     """
284 |     Shortcut function to unpickle bufr, obs, and verif files all at once. verif_file may be None if running the model.
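| 
|     Example (hypothetical file paths):
| 
|     >>> bufr, obs, verif = unpickle('KSEA_bufr.pkl', 'KSEA_obs.pkl', 'KSEA_verif.pkl')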
285 | 
286 |     :param bufr_file: str: full path to pickled BUFR data file
287 |     :param obs_file: str: full path to pickled OBS data file
288 |     :param verif_file: str: full path to pickled VERIFICATION data file
289 |     :return:
290 |     """
291 |     print('util: loading BUFKIT data from %s' % bufr_file)
292 |     with open(bufr_file, 'rb') as handle:
293 |         bufr = pickle.load(handle)
294 |     print('util: loading OBS data from %s' % obs_file)
295 |     with open(obs_file, 'rb') as handle:
296 |         obs = pickle.load(handle)
297 |     if verif_file is not None:
298 |         print('util: loading VERIFICATION data from %s' % verif_file)
299 |         with open(verif_file, 'rb') as handle:
300 |             verif = pickle.load(handle)
301 |     else:
302 |         verif = None
303 |     return bufr, obs, verif
304 | 
305 | 
306 | def get_ghcn_stid(config):
307 |     """
308 |     After code by Luke Madaus.
309 | 
310 |     Gets the GHCN station ID from the 4-letter station ID.
311 |     """
312 |     main_addr = 'ftp://ftp.ncdc.noaa.gov/pub/data/noaa'
313 | 
314 |     site_directory = config['SITE_ROOT']
315 |     stid = config['station_id']
316 |     # Check to see that isd-history.txt exists
317 |     stations_file = 'isd-history.txt'
318 |     stations_filename = '%s/%s' % (site_directory, stations_file)
319 |     if not os.path.exists(stations_filename):
320 |         print('get_ghcn_stid: downloading site name database')
321 |         try:
322 |             from urllib2 import urlopen
323 |             response = urlopen('%s/%s' % (main_addr, stations_file))
324 |             with open(stations_filename, 'w') as f:
325 |                 f.write(response.read())
326 |         except BaseException as e:
327 |             print('get_ghcn_stid: unable to download site name database')
328 |             print("*** Reason: '%s'" % str(e))
329 | 
330 |     # Now open this file and look for our site ID
331 |     site_found = False
332 |     infile = open(stations_filename, 'r')
333 |     station_wbans = []
334 |     station_ghcns = []
335 |     for line in infile:
336 |         if stid.upper() in line:
337 |             linesp = line.split()
338 |             if (not linesp[0].startswith('99999') and not site_found
339 |                     and not linesp[1].startswith('99999')):
340 |                 try:
341 |                     site_wban = int(linesp[0])
342 |                     station_ghcn = int(linesp[1])
343 |                     # site_found = True
344 |                     print('get_ghcn_stid: site found for %s (%s)' %
345 |                           (stid, station_ghcn))
346 |                     station_wbans.append(site_wban)
347 |                     station_ghcns.append(station_ghcn)
348 |                 except ValueError:  # skip lines where the IDs are not integers
349 |                     continue
350 |     if len(station_wbans) == 0:
351 |         raise ValueError('get_ghcn_stid error: no station found for %s' % stid)
352 | 
353 |     # Format station as USW...
354 |     usw_format = 'USW000%05d'
355 |     return usw_format % station_ghcns[0]
356 | 
357 | 
358 | # ==================================================================================================================== #
359 | # Conversion functions
360 | # ==================================================================================================================== #
361 | 
362 | def dewpoint(T, RH):
363 |     """
364 |     Calculates dewpoint from T in Fahrenheit and RH in percent, using the Magnus approximation.
365 |     """
366 | 
367 |     def FtoC(T):
368 |         return (T - 32.) / 9. * 5.
369 | 
370 |     def CtoF(T):
371 |         return 9. / 5. * T + 32.
372 | 
373 |     b = 17.67
374 |     c = 243.5  # deg C
375 | 
376 |     def gamma(T, RH):
377 |         return np.log(RH / 100.) + b * T / (c + T)
378 | 
379 |     T = FtoC(T)
380 |     TD = c * gamma(T, RH) / (b - gamma(T, RH))
381 |     return CtoF(TD)
382 | 
383 | 
384 | def to_bool(x):
385 |     """Convert an object to boolean.
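| 
|     Accepts 'true'/'yes'/'false'/'no' strings (case-insensitive), booleans, and anything int() can coerce;
|     any other value raises ValueError.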
386 | 387 | Examples: 388 | >>> print to_bool('TRUE') 389 | True 390 | >>> print to_bool(True) 391 | True 392 | >>> print to_bool(1) 393 | True 394 | >>> print to_bool('FALSE') 395 | False 396 | >>> print to_bool(False) 397 | False 398 | >>> print to_bool(0) 399 | False 400 | >>> print to_bool('Foo') 401 | Traceback (most recent call last): 402 | ValueError: Unknown boolean specifier: 'Foo'. 403 | >>> print to_bool(None) 404 | Traceback (most recent call last): 405 | ValueError: Unknown boolean specifier: 'None'. 406 | 407 | This function (c) Tom Keffer, weeWX. 408 | """ 409 | try: 410 | if x.lower() in ['true', 'yes']: 411 | return True 412 | elif x.lower() in ['false', 'no']: 413 | return False 414 | except AttributeError: 415 | pass 416 | try: 417 | return bool(int(x)) 418 | except (ValueError, TypeError): 419 | pass 420 | raise ValueError("Unknown boolean specifier: '%s'." % x) 421 | -------------------------------------------------------------------------------- /mosx/verification/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing VERIFICATION data. 9 | """ 10 | 11 | from .methods import * 12 | -------------------------------------------------------------------------------- /mosx/verification/methods.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing VERIFICATION data. 9 | """ 10 | 11 | import os 12 | import re 13 | import numpy as np 14 | import pandas as pd 15 | from datetime import datetime, timedelta 16 | import pickle 17 | import requests 18 | from collections import OrderedDict 19 | from mosx.MesoPy import Meso 20 | from mosx.obs.methods import get_obs_hourly, reindex_hourly 21 | from mosx.util import generate_dates, get_array, get_ghcn_stid 22 | 23 | 24 | def get_cf6_files(config, num_files=1): 25 | """ 26 | After code by Luke Madaus 27 | 28 | Retrieves CF6 climate verification data released by the NWS. Parameter num_files determines how many recent files 29 | are downloaded. 30 | """ 31 | 32 | # Create directory if it does not exist 33 | site_directory = config['SITE_ROOT'] 34 | 35 | # Construct the web url address. Check if a special 3-letter station ID is provided. 
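|     # For example, with the default fallback below, a station_id of 'KSEA' yields the 3-letter ID 'SEA', so the
|     # request URL becomes http://forecast.weather.gov/product.php?site=NWS&issuedby=SEA&product=CF6&format=TXT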
36 | nws_url = 'http://forecast.weather.gov/product.php?site=NWS&issuedby=%s&product=CF6&format=TXT' 37 | try: 38 | stid3 = config['station_id3'] 39 | except KeyError: 40 | stid3 = config['station_id'][1:].upper() 41 | nws_url = nws_url % stid3 42 | 43 | # Determine how many files (iterations of product) we want to fetch 44 | if num_files == 1: 45 | if config['verbose']: 46 | print('get_cf6_files: retrieving latest CF6 file for %s' % config['station_id']) 47 | else: 48 | if config['verbose']: 49 | print('get_cf6_files: retrieving %s archived CF6 files for %s' % (num_files, config['station_id'])) 50 | 51 | # Fetch files 52 | for r in range(1, num_files + 1): 53 | # Format the web address: goes through 'versions' on NWS site which correspond to increasingly older files 54 | version = 'version=%d&glossary=0' % r 55 | nws_site = '&'.join((nws_url, version)) 56 | response = requests.get(nws_site) 57 | cf6_data = response.text 58 | 59 | # Remove the header 60 | try: 61 | body_and_footer = cf6_data.split('CXUS')[1] # Mainland US 62 | except IndexError: 63 | try: 64 | body_and_footer = cf6_data.split('CXHW')[1] # Hawaii 65 | except IndexError: 66 | body_and_footer = cf6_data.split('CXAK')[1] # Alaska 67 | body_and_footer_lines = body_and_footer.splitlines() 68 | if len(body_and_footer_lines) <= 2: 69 | body_and_footer = cf6_data.split('000')[2] 70 | 71 | # Remove the footer 72 | body = body_and_footer.split('[REMARKS]')[0] 73 | 74 | # Find the month and year of the file 75 | current_year = re.search('YEAR: *(\d{4})', body).groups()[0] 76 | try: 77 | current_month = re.search('MONTH: *(\D{3,9})', body).groups()[0] 78 | current_month = current_month.strip() # Gets rid of newlines and whitespace 79 | datestr = '%s %s' % (current_month, current_year) 80 | file_date = datetime.strptime(datestr, '%B %Y') 81 | except: # Some files have a different formatting, although this may be fixed now. 82 | current_month = re.search('MONTH: *(\d{2})', body).groups()[0] 83 | current_month = current_month.strip() 84 | datestr = '%s %s' % (current_month, current_year) 85 | file_date = datetime.strptime(datestr, '%m %Y') 86 | 87 | # Write to a temporary file, check if output file exists, and if so, make sure the new one has more data 88 | datestr = file_date.strftime('%Y%m') 89 | filename = '%s/%s_%s.cli' % (site_directory, config['station_id'].upper(), datestr) 90 | temp_file = '%s/temp.cli' % site_directory 91 | with open(temp_file, 'w') as out: 92 | out.write(body) 93 | 94 | def file_len(file_name): 95 | with open(file_name) as f: 96 | for i, l in enumerate(f): 97 | pass 98 | return i + 1 99 | 100 | if os.path.isfile(filename): 101 | old_file_len = file_len(filename) 102 | new_file_len = file_len(temp_file) 103 | if old_file_len < new_file_len: 104 | if config['verbose']: 105 | print('get_cf6_files: overwriting %s' % filename) 106 | os.remove(filename) 107 | os.rename(temp_file, filename) 108 | else: 109 | if config['verbose']: 110 | print('get_cf6_files: %s already exists' % filename) 111 | else: 112 | if config['verbose']: 113 | print('get_cf6_files: writing %s' % filename) 114 | os.rename(temp_file, filename) 115 | 116 | 117 | def _cf6_wind(config): 118 | """ 119 | After code by Luke Madaus 120 | 121 | This function is used internally only. 122 | 123 | Generates wind verification values from climate CF6 files stored in SITE_ROOT. These files can be generated 124 | externally by get_cf6_files.py. 
This function is not necessary if climo data from _climo_wind is found, except for 125 | recent values which may not be in the NCDC database yet. 126 | 127 | :param config: 128 | :return: dict: wind values from CF6 files 129 | """ 130 | 131 | if config['verbose']: 132 | print('_cf6_wind: searching for CF6 files in %s' % config['SITE_ROOT']) 133 | allfiles = os.listdir(config['SITE_ROOT']) 134 | filelist = [f for f in allfiles if f.startswith(config['station_id'].upper()) and f.endswith('.cli')] 135 | filelist.sort() 136 | if len(filelist) == 0: 137 | raise IOError('No CF6 files found.') 138 | if config['verbose']: 139 | print('_cf6_wind: found %d CF6 files.' % len(filelist)) 140 | 141 | # Interpret CF6 files 142 | if config['verbose']: 143 | print('_cf6_wind: reading CF6 files') 144 | cf6_values = {} 145 | for file in filelist: 146 | year, month = re.search('(\d{4})(\d{2})', file).groups() 147 | infile = open('%s/%s' % (config['SITE_ROOT'], file), 'r') 148 | for line in infile: 149 | matcher = re.compile( 150 | '( \d|\d{2}) ( \d{2}|-\d{2}| \d| -\d|\d{3})') 151 | if matcher.match(line): 152 | # We've found an ob line! 153 | lsp = line.split() 154 | day = int(lsp[0]) 155 | curdt = datetime(int(year), int(month), day) 156 | cf6_values[curdt] = {} 157 | # Wind 158 | if lsp[11] == 'M': 159 | cf6_values[curdt]['wind'] = 0.0 160 | else: 161 | cf6_values[curdt]['wind'] = float(lsp[11]) * 0.868976 162 | 163 | return cf6_values 164 | 165 | 166 | def _climo_wind(config, dates=None): 167 | """ 168 | Fetches climatological wind data using ulmo package to retrieve NCDC archives. 169 | 170 | :param config: 171 | :param dates: list of datetime objects 172 | :return: dict: dictionary of wind values 173 | """ 174 | import ulmo 175 | 176 | if config['verbose']: 177 | print('_climo_wind: fetching data from NCDC (may take a while)...') 178 | v = 'WSF2' 179 | wind_dict = {} 180 | D = ulmo.ncdc.ghcn_daily.get_data(get_ghcn_stid(config), as_dataframe=True, elements=[v]) 181 | 182 | if dates is None: 183 | dates = list(D[v].index.to_timestamp().to_pydatetime()) 184 | for date in dates: 185 | wind_dict[date] = {'wind': D[v].loc[date]['value'] / 10. * 1.94384} 186 | 187 | return wind_dict 188 | 189 | 190 | def pop_rain(series): 191 | """ 192 | Converts a series of rain values into 0 or 1 depending on whether there is measurable rain 193 | :param series: 194 | :return: 195 | """ 196 | new_series = series.copy() 197 | new_series[series >= 0.01] = 1. 198 | new_series[series < 0.01] = 0. 199 | return new_series 200 | 201 | 202 | def categorical_rain(series): 203 | """ 204 | Converts a series of rain values into categorical precipitation quantities a la MOS. 205 | :param series: 206 | :return: 207 | """ 208 | new_series = series.copy() 209 | for j in range(len(series)): 210 | if series.iloc[j] < 0.01: 211 | new_series.iloc[j] = 0. 212 | elif series.iloc[j] < 0.10: 213 | new_series.iloc[j] = 1. 214 | elif series.iloc[j] < 0.25: 215 | new_series.iloc[j] = 2. 216 | elif series.iloc[j] < 0.50: 217 | new_series.iloc[j] = 3. 218 | elif series.iloc[j] < 1.00: 219 | new_series.iloc[j] = 4. 220 | elif series.iloc[j] < 2.00: 221 | new_series.iloc[j] = 5. 222 | elif series.iloc[j] >= 2.00: 223 | new_series.iloc[j] = 6. 224 | else: # missing, or something else that's strange 225 | new_series.iloc[j] = 0. 
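|     # Resulting categories (a la MOS): 0: <0.01", 1: 0.01-0.09", 2: 0.10-0.24", 3: 0.25-0.49",
|     # 4: 0.50-0.99", 5: 1.00-1.99", 6: >=2.00". Missing values also map to category 0.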
226 | return new_series 227 | 228 | 229 | def verification(config, output_file=None, use_cf6=True, use_climo=True, force_rain_quantity=False): 230 | """ 231 | Generates verification data from MesoWest and saves to a file, which is used to train the model and check test 232 | results. 233 | 234 | :param config: 235 | :param output_file: str: path to output file 236 | :param use_cf6: bool: if True, uses wind values from CF6 files 237 | :param use_climo: bool: if True, uses wind values from NCDC climatology 238 | :param force_rain_quantity: if True, returns the actual quantity of rain (rather than POP); useful for validation 239 | files 240 | :return: 241 | """ 242 | if output_file is None: 243 | output_file = '%s/%s_verif.pkl' % (config['SITE_ROOT'], config['station_id']) 244 | 245 | dates = generate_dates(config) 246 | api_dates = generate_dates(config, api=True, api_add_hour=config['forecast_hour_start'] + 24) 247 | 248 | # Read new data for daily values 249 | m = Meso(token=config['meso_token']) 250 | 251 | if config['verbose']: 252 | print('verification: MesoPy initialized for station %s' % config['station_id']) 253 | print('verification: retrieving latest obs and metadata') 254 | latest = m.latest(stid=config['station_id']) 255 | obs_list = list(latest['STATION'][0]['SENSOR_VARIABLES'].keys()) 256 | 257 | # Look for desired variables 258 | vars_request = ['air_temp', 'wind_speed', 'precip_accum_one_hour'] 259 | vars_option = ['air_temp_low_6_hour', 'air_temp_high_6_hour', 'precip_accum_six_hour'] 260 | 261 | # Add variables to the api request if they exist 262 | if config['verbose']: 263 | print('verification: searching for 6-hourly variables...') 264 | for var in vars_option: 265 | if var in obs_list: 266 | if config['verbose']: 267 | print('verification: found variable %s, adding to data' % var) 268 | vars_request += [var] 269 | vars_api = '' 270 | for var in vars_request: 271 | vars_api += var + ',' 272 | vars_api = vars_api[:-1] 273 | 274 | # Units 275 | units = 'temp|f,precip|in,speed|kts' 276 | 277 | # Retrieve data 278 | obspd = pd.DataFrame() 279 | for api_date in api_dates: 280 | if config['verbose']: 281 | print('verification: retrieving data from %s to %s' % api_date) 282 | obs = m.timeseries(stid=config['station_id'], start=api_date[0], end=api_date[1], vars=vars_api, units=units) 283 | obspd = pd.concat((obspd, pd.DataFrame.from_dict(obs['STATION'][0]['OBSERVATIONS'])), ignore_index=True) 284 | 285 | # Rename columns to requested vars 286 | obs_var_names = obs['STATION'][0]['SENSOR_VARIABLES'] 287 | obs_var_keys = list(obs_var_names.keys()) 288 | col_names = list(map(''.join, obspd.columns.values)) 289 | for c in range(len(col_names)): 290 | col = col_names[c] 291 | for k in range(len(obs_var_keys)): 292 | key = obs_var_keys[k] 293 | if col == list(obs_var_names[key].keys())[0]: 294 | col_names[c] = key 295 | obspd.columns = col_names 296 | 297 | # Make sure we have columns for all requested variables 298 | for var in vars_request: 299 | if var not in col_names: 300 | obspd = obspd.assign(**{var: np.nan}) 301 | 302 | # Change datetime column to datetime object, subtract 6 hours to use 6Z days 303 | if config['verbose']: 304 | print('verification: setting time back %d hours for daily statistics' % config['forecast_hour_start']) 305 | dateobj = pd.to_datetime(obspd['date_time']) - timedelta(hours=config['forecast_hour_start']) 306 | obspd['date_time'] = dateobj 307 | datename = 'date_time_minus_%d' % config['forecast_hour_start'] 308 | obspd = 
obspd.rename(columns={'date_time': datename}) 309 | 310 | # Reformat data into hourly and daily 311 | # Hourly 312 | def hour(dates): 313 | date = dates.iloc[0] 314 | return datetime(date.year, date.month, date.day, date.hour) 315 | 316 | def last(values): 317 | return values.iloc[-1] 318 | 319 | aggregate = {datename: hour} 320 | if 'air_temp_high_6_hour' in vars_request and 'air_temp_low_6_hour' in vars_request: 321 | aggregate['air_temp_high_6_hour'] = np.max 322 | aggregate['air_temp_low_6_hour'] = np.min 323 | aggregate['air_temp'] = {'air_temp_max': np.max, 'air_temp_min': np.min} 324 | if 'precip_accum_six_hour' in vars_request: 325 | aggregate['precip_accum_six_hour'] = np.max 326 | aggregate['wind_speed'] = np.max 327 | aggregate['precip_accum_one_hour'] = np.max 328 | 329 | if config['verbose']: 330 | print('verification: grouping data by hour for hourly observations') 331 | # Note that obs in hour H are reported at hour H, not H+1 332 | obs_hourly = obspd.groupby([pd.DatetimeIndex(obspd[datename]).year, 333 | pd.DatetimeIndex(obspd[datename]).month, 334 | pd.DatetimeIndex(obspd[datename]).day, 335 | pd.DatetimeIndex(obspd[datename]).hour]).agg(aggregate) 336 | # Rename columns 337 | col_names = obs_hourly.columns.values 338 | col_names_new = [] 339 | for c in range(len(col_names)): 340 | if col_names[c][0] == 'air_temp': 341 | col_names_new.append(col_names[c][1]) 342 | else: 343 | col_names_new.append(col_names[c][0]) 344 | 345 | obs_hourly.columns = col_names_new 346 | 347 | # Daily 348 | def day(dates): 349 | date = dates.iloc[0] 350 | return datetime(date.year, date.month, date.day) 351 | 352 | aggregate[datename] = day 353 | aggregate['air_temp_min'] = np.min 354 | aggregate['air_temp_max'] = np.max 355 | aggregate['precip_accum_six_hour'] = np.sum 356 | try: 357 | aggregate.pop('air_temp') 358 | except: 359 | pass 360 | 361 | if config['verbose']: 362 | print('verification: grouping data by day for daily verifications') 363 | obs_daily = obs_hourly.groupby([pd.DatetimeIndex(obs_hourly[datename]).year, 364 | pd.DatetimeIndex(obs_hourly[datename]).month, 365 | pd.DatetimeIndex(obs_hourly[datename]).day]).agg(aggregate) 366 | 367 | if config['verbose']: 368 | print('verification: checking matching dates for daily obs and CF6') 369 | if use_climo: 370 | try: 371 | climo_values = _climo_wind(config, dates) 372 | except BaseException as e: 373 | if config['verbose']: 374 | print("verification: warning: '%s' while reading climo data" % str(e)) 375 | climo_values = {} 376 | else: 377 | if config['verbose']: 378 | print('verification: not using climo.') 379 | climo_values = {} 380 | if use_cf6: 381 | num_months = min((datetime.utcnow() - dates[0]).days / 30, 24) 382 | try: 383 | get_cf6_files(config, num_months) 384 | except BaseException as e: 385 | if config['verbose']: 386 | print("verification: warning: '%s' while getting CF6 files" % str(e)) 387 | try: 388 | cf6_values = _cf6_wind(config) 389 | except BaseException as e: 390 | if config['verbose']: 391 | print("verification: warning: '%s' while reading CF6 files" % str(e)) 392 | cf6_values = {} 393 | else: 394 | if config['verbose']: 395 | print('verification: not using CF6.') 396 | cf6_values = {} 397 | climo_values.update(cf6_values) # CF6 has precedence 398 | count_rows = 0 399 | for index, row in obs_daily.iterrows(): 400 | date = row[datename] 401 | if date in climo_values.keys(): 402 | count_rows += 1 403 | obs_wind = row['wind_speed'] 404 | cf6_wind = climo_values[date]['wind'] 405 | if not (np.isnan(cf6_wind)): 406 | 
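|                 # Prefer the CF6/climo wind value unless the MesoWest ob is much (>= 5 kt) higher,
|                 # in which case the ob is kept and a warning is printed below.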
if obs_wind - cf6_wind >= 5: 407 | print('verification: warning: obs wind for %s much larger than wind from cf6/climo; using obs' % 408 | date) 409 | else: 410 | obs_daily.loc[index, 'wind_speed'] = cf6_wind 411 | else: 412 | count_rows -= 1 413 | if config['verbose']: 414 | print('verification: found %d matching rows.' % count_rows) 415 | 416 | # Round 417 | round_dict = {'wind_speed': 0} 418 | if 'air_temp_high_6_hour' in vars_request: 419 | round_dict['air_temp_high_6_hour'] = 0 420 | if 'air_temp_low_6_hour' in vars_request: 421 | round_dict['air_temp_low_6_hour'] = 0 422 | round_dict['air_temp_max'] = 0 423 | round_dict['air_temp_min'] = 0 424 | if 'precip_accum_six_hour' in vars_request: 425 | round_dict['precip_accum_six_hour'] = 2 426 | round_dict['precip_accum_one_hour'] = 2 427 | obs_daily = obs_daily.round(round_dict) 428 | 429 | # Generation of final output data 430 | if config['verbose']: 431 | print('verification: generating final verification dictionary...') 432 | if 'air_temp_high_6_hour' in vars_request: 433 | obs_daily.rename(columns={'air_temp_high_6_hour': 'Tmax'}, inplace=True) 434 | else: 435 | obs_daily.rename(columns={'air_temp_max': 'Tmax'}, inplace=True) 436 | if 'air_temp_low_6_hour' in vars_request: 437 | obs_daily.rename(columns={'air_temp_low_6_hour': 'Tmin'}, inplace=True) 438 | else: 439 | obs_daily.rename(columns={'air_temp_min': 'Tmin'}, inplace=True) 440 | if 'precip_accum_six_hour' in vars_request: 441 | obs_daily.rename(columns={'precip_accum_six_hour': 'Rain'}, inplace=True) 442 | else: 443 | obs_daily.rename(columns={'precip_accum_one_hour': 'Rain'}, inplace=True) 444 | obs_daily.rename(columns={'wind_speed': 'Wind'}, inplace=True) 445 | 446 | # Deal with the rain depending on the type of forecast requested 447 | obs_daily['Rain'].fillna(0.0, inplace=True) 448 | if config['Model']['rain_forecast_type'] == 'pop' and not force_rain_quantity: 449 | obs_daily.loc[:, 'Rain'] = pop_rain(obs_daily['Rain']) 450 | elif config['Model']['rain_forecast_type'] == 'categorical' and not force_rain_quantity: 451 | obs_daily.loc[:, 'Rain'] = categorical_rain(obs_daily['Rain']) 452 | 453 | # Set the date time index and retain only desired columns 454 | obs_daily = obs_daily.rename(columns={datename: 'date_time'}) 455 | obs_daily = obs_daily.set_index('date_time') 456 | if config['verbose']: 457 | print('verification: -> exporting to %s' % output_file) 458 | export_cols = ['Tmax', 'Tmin', 'Wind', 'Rain'] 459 | for col in obs_daily.columns: 460 | if col not in export_cols: 461 | obs_daily.drop(col, 1, inplace=True) 462 | 463 | # If a time series is desired, then get hourly data 464 | if config['Model']['predict_timeseries']: 465 | 466 | # Look for desired variables 467 | vars_request = ['air_temp', 'relative_humidity', 'wind_speed', 'precip_accum_one_hour'] 468 | 469 | # Add variables to the api request 470 | vars_api = '' 471 | for var in vars_request: 472 | vars_api += var + ',' 473 | vars_api = vars_api[:-1] 474 | 475 | # Units 476 | units = 'temp|f,precip|in,speed|kts' 477 | 478 | # Retrieve data 479 | obs_hourly_verify = get_obs_hourly(config, api_dates, vars_api, units) 480 | 481 | # Fix rainfall for categorical and time accumulation 482 | rain_column = 'precip_last_%d_hour' % config['time_series_interval'] 483 | obs_hourly_verify.rename(columns={'precip_accum_one_hour': rain_column}, inplace=True) 484 | if config['Model']['rain_forecast_type'] == 'pop' and not force_rain_quantity: 485 | if config['verbose']: 486 | print("verification: using 'pop' rain") 487 | 
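|             # Convert hourly rain amounts to a 0/1 measurable-rain series so the time series target matches the daily 'pop' target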
obs_hourly_verify.loc[:, rain_column] = pop_rain(obs_hourly_verify[rain_column]) 488 | use_rain_max = True 489 | elif config['Model']['rain_forecast_type'] == 'categorical' and not force_rain_quantity: 490 | if config['verbose']: 491 | print("verification: using 'categorical' rain") 492 | obs_hourly_verify.loc[:, rain_column] = categorical_rain(obs_hourly_verify[rain_column]) 493 | use_rain_max = True 494 | else: 495 | use_rain_max = False 496 | 497 | # Export final data 498 | export_dict = OrderedDict() 499 | for date in dates: 500 | try: 501 | day_dict = obs_daily.loc[date].to_dict(into=OrderedDict) 502 | except KeyError: 503 | continue 504 | if np.any(np.isnan(day_dict.values())): 505 | if config['verbose']: 506 | print('verification: warning: omitting day %s; missing data' % date) 507 | continue # No verification can have missing values 508 | if config['Model']['predict_timeseries']: 509 | start = pd.Timestamp(date + timedelta(hours=(config['forecast_hour_start'] - 510 | config['time_series_interval']))) 511 | end = pd.Timestamp(date + timedelta(hours=config['forecast_hour_start'] + 24)) 512 | try: 513 | series = reindex_hourly(obs_hourly_verify, start, end, config['time_series_interval'], 514 | use_rain_max=use_rain_max) 515 | except KeyError: 516 | # No values for the day 517 | if config['verbose']: 518 | print('verification: warning: omitting day %s; missing data' % date) 519 | continue 520 | if series.isnull().values.any(): 521 | if config['verbose']: 522 | print('verification: warning: omitting day %s; missing data' % date) 523 | continue 524 | series_dict = OrderedDict(series.to_dict(into=OrderedDict)) 525 | day_dict.update(series_dict) 526 | export_dict[date] = day_dict 527 | with open(output_file, 'wb') as handle: 528 | pickle.dump(export_dict, handle, protocol=pickle.HIGHEST_PROTOCOL) 529 | 530 | return 531 | 532 | 533 | def process(config, verif): 534 | """ 535 | Returns a numpy array of verification data for use in mosx_predictors. The first dimension is date, the second is 536 | variable. 537 | 538 | :param config: 539 | :param verif: dict: dictionary of processed verification data; may be None 540 | :return: ndarray: array of processed verification targets 541 | """ 542 | if verif is not None: 543 | if config['verbose']: 544 | print('verification.process: processing array for verification data') 545 | num_days = len(verif.keys()) 546 | variables = ['Tmax', 'Tmin', 'Wind', 'Rain'] 547 | day_verif_array = np.full((num_days, len(variables)), np.nan, dtype=np.float64) 548 | for d in range(len(verif.keys())): 549 | date = verif.keys()[d] 550 | for v in range(len(variables)): 551 | var = variables[v] 552 | day_verif_array[d, v] = verif[date][var] 553 | if config['Model']['predict_timeseries']: 554 | hour_verif = OrderedDict(verif) 555 | for date in hour_verif.keys(): 556 | for variable in variables: 557 | hour_verif[date].pop(variable, None) 558 | hour_verif_array = get_array(hour_verif) 559 | hour_verif_array = np.reshape(hour_verif_array, (num_days, -1)) 560 | verif_array = np.concatenate((day_verif_array, hour_verif_array), axis=1) 561 | return verif_array 562 | else: 563 | return day_verif_array 564 | else: 565 | return None 566 | -------------------------------------------------------------------------------- /performance: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # 3 | # Copyright (c) 2018 Jonathan Weyn 4 | # 5 | # See the file LICENSE for your rights. 
6 | # 7 | 8 | """ 9 | Evaluate performance metrics for a MOS-X model. These functions should only be used after the data files in 'build' and 10 | 'validate' have been created. 11 | """ 12 | 13 | import mosx 14 | import numpy as np 15 | import pandas as pd 16 | import os 17 | import sys 18 | import pickle 19 | import string 20 | from optparse import OptionParser 21 | from datetime import datetime 22 | import warnings 23 | from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error, r2_score 24 | from sklearn.preprocessing import LabelBinarizer 25 | import matplotlib 26 | import matplotlib.pyplot as plt 27 | import matplotlib.gridspec as gs 28 | from matplotlib import rc 29 | try: 30 | from properscoring import crps_ensemble 31 | crps = True 32 | except ImportError: 33 | crps = False 34 | 35 | 36 | # Set matplotlib rc parameters 37 | rc('font', **{'family': 'Times New Roman'}) 38 | matplotlib.rcParams['mathtext.fontset'] = 'custom' 39 | matplotlib.rcParams['mathtext.rm'] = 'Times New Roman' 40 | matplotlib.rcParams['mathtext.it'] = 'Times New Roman:italic' 41 | matplotlib.rcParams['mathtext.bf'] = 'Times New Roman:bold' 42 | matplotlib.rcParams.update({'font.size': 10}) 43 | 44 | # Suppress warnings 45 | # warnings.filterwarnings("ignore") 46 | 47 | 48 | def get_command_options(): 49 | parser = OptionParser() 50 | parser.add_option('-t', '--naive-rain-correction', dest='tune_rain', action='store_true', default=False, 51 | help='Use the raw precipitation from GFS/NAM to override or average with MOS-X') 52 | parser.add_option('-e', '--ensemble', dest='ensemble', action='store_true', default=False, 53 | help='Calculate ensemble statistics (for a valid ensemble estimator)') 54 | parser.add_option('-r', '--rain-post-average', dest='avg_rain', action='store_true', default=False, 55 | help='If using a RainTuningEstimator, this will average the raw estimation from an ensemble' 56 | 'with that of the rain tuning post-processor') 57 | parser.add_option('-L', '--learning-curve', dest='learning', action='store_true', default=False, 58 | help='Generate a learning curve by re-training the model (time consuming)') 59 | parser.add_option('-S', '--spread-skill', dest='spread_skill', action='store_true', default=False, 60 | help='Plot spread-skill relationship for an ensemble (if -e is enabled)') 61 | parser.add_option('-H', '--rank-histogram', dest='histogram', action='store_true', default=False, 62 | help='Plot rank histograms for an ensemble (if -e is enabled)') 63 | parser.add_option('-E', '--error-distribution', dest='error_plot', action='store_true', default=False, 64 | help='Plot error distributions') 65 | (opts, args) = parser.parse_args() 66 | return opts, args 67 | 68 | 69 | def brier_score(y, x): 70 | num_categories = max(x.shape[1], y.shape[1]) 71 | while x.shape[1] < num_categories: 72 | x = np.c_[x, np.zeros(x.shape[0])] 73 | while y.shape[1] < num_categories: 74 | y = np.c_[y, np.zeros(y.shape[0])] 75 | score = np.sum((x - y) ** 2, axis=1) 76 | return np.mean(score / num_categories) 77 | 78 | 79 | def mean_variance(X): 80 | return np.mean(np.var(X, axis=-1), axis=0) 81 | 82 | 83 | def plot_spread_skill(X, y, grid=False, one_to_one=True, width=6, height=6): 84 | 85 | fig = plt.figure() 86 | fig.set_size_inches(width, height) 87 | rows = 2 88 | columns = 2 89 | gs1 = gs.GridSpec(rows, columns) 90 | gs1.update(wspace=0.18, hspace=0.18) 91 | colors = [c['color'] for c in list(matplotlib.rcParams['axes.prop_cycle'])] 92 | params = ['High', 'Low', 'Wind', 'Rain'] 93 
| limits = [20., 20., 10., 2.] 94 | 95 | var = np.var(X, axis=-1) 96 | ens_mean = np.mean(X, axis=-1) 97 | err = (ens_mean - y) ** 2 98 | 99 | for plot_num in range(len(params)): 100 | ax = plt.subplot(gs1[plot_num]) 101 | plt.scatter(err[:, plot_num], var[:, plot_num], c=colors[plot_num], s=max(width, height)/2.) 102 | plt.xlim([0., limits[plot_num]]) 103 | plt.ylim([0., limits[plot_num]]) 104 | if grid: 105 | plt.grid(grid, linestyle='--', color=[0.8, 0.8, 0.8]) 106 | if one_to_one: 107 | curr_xlim = ax.get_xlim() 108 | curr_ylim = ax.get_ylim() 109 | max_limit = max(np.max(curr_xlim), np.max(curr_ylim)) 110 | min_limit = min(np.min(curr_xlim), np.min(curr_ylim)) 111 | plt.plot([min_limit, max_limit], [min_limit, max_limit], 'k') 112 | if (plot_num + 1) % columns == 1: 113 | plt.ylabel('spread') 114 | if (plot_num + 1) > columns * (rows - 1): 115 | plt.xlabel('error') 116 | letters = list(string.ascii_lowercase) 117 | panel_label = '%s) %s' % (letters[plot_num], params[plot_num]) 118 | ax.annotate(panel_label, xycoords='axes fraction', xy=(0.05, 0.95), horizontalalignment='left', 119 | verticalalignment='top') 120 | 121 | return fig 122 | 123 | 124 | def plot_rank_histogram(X, y, width=6, height=6): 125 | fig = plt.figure() 126 | fig.set_size_inches(width, height) 127 | gs1 = gs.GridSpec(2, 2) 128 | gs1.update(wspace=0.18, hspace=0.18) 129 | colors = [c['color'] for c in list(matplotlib.rcParams['axes.prop_cycle'])] 130 | params = ['High', 'Low', 'Wind', 'Rain'] 131 | 132 | def plot_histogram(subplot_num, x, unit=None, facecolor='b', bins=None, align='left'): 133 | global fig 134 | ax = plt.subplot(subplot_num) 135 | if bins is None and unit is not None: 136 | bins = max(int(np.nanmax(x) / unit - np.nanmin(x) / unit), 1) 137 | n, bins, patches = plt.hist(x, bins=bins, facecolor=facecolor, normed=True, align=align, ) 138 | ylim = ax.get_ylim() 139 | if ylim[1] - np.nanmax(n) < 0.005: 140 | ax.set_ylim([ylim[0], ylim[1] + 0.005]) 141 | if unit is None: 142 | unit = 1. 143 | ax.set_yticklabels(['{:.1f}'.format(100. * l * unit) for l in plt.yticks()[0]]) 144 | if plot_num % 2 == 0: 145 | ax.set_ylabel('frequency (%)') 146 | letters = list(string.ascii_lowercase) 147 | panel_label = '%s) %s' % (letters[plot_num], params[plot_num]) 148 | ax.annotate(panel_label, xycoords='axes fraction', xy=(0.50, 0.95), horizontalalignment='center', 149 | verticalalignment='top') 150 | return 151 | 152 | for plot_num in range(4): 153 | rank = np.sum(X[:, plot_num, :].T < y[:, plot_num], axis=0).T 154 | plot_histogram(gs1[plot_num], rank, facecolor=colors[plot_num]) 155 | 156 | return fig 157 | 158 | 159 | def plot_error(X, y, width=6, height=6): 160 | fig = plt.figure() 161 | fig.set_size_inches(width, height) 162 | gs1 = gs.GridSpec(2, 2) 163 | gs1.update(wspace=0.18, hspace=0.18) 164 | colors = [c['color'] for c in list(matplotlib.rcParams['axes.prop_cycle'])] 165 | params = ['High', 'Low', 'Wind', 'Rain'] 166 | 167 | def plot_histogram(subplot_num, x, unit=None, facecolor='b', bins=None, align='left'): 168 | global fig 169 | ax = plt.subplot(subplot_num) 170 | if bins is None and unit is not None: 171 | bins = max(int(np.nanmax(x) / unit - np.nanmin(x) / unit), 1) 172 | n, bins, patches = plt.hist(x, bins=bins, facecolor=facecolor, normed=True, align=align, ) 173 | ylim = ax.get_ylim() 174 | if ylim[1] - np.nanmax(n) < 0.005: 175 | ax.set_ylim([ylim[0], ylim[1] + 0.005]) 176 | if unit is None: 177 | unit = 1. 178 | ax.set_yticklabels(['{:.1f}'.format(100. 
* l * unit) for l in plt.yticks()[0]]) 179 | if plot_num % 2 == 0: 180 | ax.set_ylabel('frequency (%)') 181 | letters = list(string.ascii_lowercase) 182 | panel_label = '%s) %s' % (letters[plot_num], params[plot_num]) 183 | ax.annotate(panel_label, xycoords='axes fraction', xy=(0.05, 0.95), horizontalalignment='left', 184 | verticalalignment='top') 185 | return 186 | 187 | for plot_num in range(4): 188 | error = X[:, plot_num] - y[:, plot_num] 189 | plot_histogram(gs1[plot_num], error, facecolor=colors[plot_num], unit=1) 190 | 191 | return fig 192 | 193 | 194 | # Get options and config 195 | 196 | options, arguments = get_command_options() 197 | 198 | try: 199 | config_file = arguments[0] 200 | except IndexError: 201 | print('Required argument (config file) not provided.') 202 | sys.exit(1) 203 | config = mosx.util.get_config(config_file) 204 | 205 | predict_timeseries = config['Model']['predict_timeseries'] 206 | if predict_timeseries: 207 | config['Model']['predict_timeseries'] = False 208 | 209 | predictor_file = '%s/%s_CV_%s_predictors.pkl' % (config['SITE_ROOT'], config['station_id'], 210 | config['Validate']['end_date']) 211 | if not(os.path.isfile(predictor_file)): 212 | print("Cannot find validation predictors file '%s'. Please run 'validate' first." % predictor_file) 213 | sys.exit(1) 214 | 215 | 216 | # Open predictors file from validate 217 | 218 | with open(predictor_file, 'rb') as handle: 219 | predictors = pickle.load(handle) 220 | predictor_array = np.concatenate((predictors['BUFKIT'], predictors['OBS']), axis=1) 221 | true_array = predictors['VERIF'] 222 | 223 | 224 | # Make the prediction 225 | print('\n--- MOS-X performance: making the forecast...\n') 226 | predicted, all_predicted, ps = mosx.model.predict_all(config, predictor_file, ensemble=options.ensemble, 227 | naive_rain_correction=options.tune_rain) 228 | if options.avg_rain: 229 | print('Using average of raw and rain-tuned precipitation forecasts') 230 | no_tuned_predictions = mosx.model.predict_all(config, predictor_file, naive_rain_correction=options.tune_rain, 231 | rain_tuning=False) 232 | predicted = np.mean([predicted, no_tuned_predictions[0]], axis=0) 233 | 234 | 235 | # Calculate general performance scores 236 | 237 | print('\n--- MOS-X performance: generating global performance scores...\n') 238 | multi = 'raw_values' 239 | scores = np.nan * np.zeros((7, 4)) 240 | scores[0] = explained_variance_score(true_array[:, :4], predicted[:, :4], multioutput=multi) 241 | scores[1] = mean_absolute_error(true_array[:, :4], predicted[:, :4], multioutput=multi) 242 | scores[2] = mean_squared_error(true_array[:, :4], predicted[:, :4], multioutput=multi) 243 | scores[3] = r2_score(true_array[:, :4], predicted[:, :4], multioutput=multi) 244 | 245 | if config['Model']['rain_forecast_type'] in ['pop', 'categorical']: 246 | # Try to get probabilities for each category of rain 247 | print('MOS-X performance: making probabilistic rain predictions...') 248 | rain_prob = mosx.model.predict_rain_proba(config, predictor_file) 249 | lb = LabelBinarizer() 250 | rain_categories = lb.fit_transform(true_array[:, 3]) 251 | scores[4, 3] = brier_score(rain_categories, rain_prob) 252 | 253 | if options.ensemble: 254 | scores[5] = mean_variance(all_predicted[:, :4, :]) 255 | 256 | if options.ensemble and crps: 257 | scores[6] = np.mean(crps_ensemble(true_array[:, :4], all_predicted[:, :4, :], axis=-1), axis=0) 258 | 259 | scores_df = pd.DataFrame(scores) 260 | score_names = ['Explained variance score', 'Mean absolute bias', 'Mean squared 
bias', 261 | 'R^2 coefficient of determination', 'Brier score', 'Mean ensemble variance', 'CRPS'] 262 | score_columns = ['High', 'Low', 'Wind', 'Rain'] 263 | scores_df.index = score_names 264 | scores_df.columns = score_columns 265 | 266 | print('\n') 267 | print(scores_df) 268 | 269 | 270 | # Optional learning curve plot 271 | 272 | if options.learning: 273 | print("\n--- MOS-X performance: exporting learning curve figure to 'learning_curve.pdf'...\n") 274 | train_file = '%s/%s_predictors_train.pkl' % (config['SITE_ROOT'], config['station_id']) 275 | predictor_file = '%s/%s_CV_%s_predictors.pkl' % (config['SITE_ROOT'], config['station_id'], 276 | config['Validate']['end_date']) 277 | predictors, targets, n_samples_test = mosx.model.combine_train_test(config, train_file, predictor_file, 278 | return_count_test=True) 279 | cv = mosx.model.SplitConsecutive(first=False, n_samples=n_samples_test) 280 | scorer = mosx.model.wxchallenge_scorer(no_rain=True) 281 | figure = mosx.model._plot_learning_curve(mosx.model.build_estimator(config), predictors, targets, cv=cv, 282 | scoring=scorer) 283 | plt.savefig('learning_curve.pdf', bbox_inches='tight') 284 | 285 | 286 | # Optional spread-skill plot 287 | 288 | if options.spread_skill and options.ensemble: 289 | print("\n--- MOS-X performance: exporting spread-skill figure to 'spread_skill.pdf'...\n") 290 | figure = plot_spread_skill(all_predicted, true_array, ) 291 | plt.savefig('spread_skill.pdf', bbox_inches='tight') 292 | elif options.spread_skill: 293 | print("warning: '--spread-skill' option enabled but no ensemble predictions (--ensemble) were enabled!") 294 | 295 | 296 | # Optional rank histograms plot 297 | 298 | if options.histogram and options.ensemble: 299 | print("\n--- MOS-X performance: exporting rank histogram figure to 'rank_histogram.pdf'...\n") 300 | figure = plot_rank_histogram(all_predicted, true_array, ) 301 | plt.savefig('rank_histogram.pdf', bbox_inches='tight') 302 | elif options.histogram: 303 | print("warning: '--rank-histogram' option enabled but no ensemble predictions (--ensemble) were enabled!") 304 | 305 | 306 | # Optional error distribution plot 307 | 308 | if options.error_plot: 309 | print("\n--- MOS-X performance: exporting error distribution figure to 'error_distribution.pdf'...\n") 310 | figure = plot_error(predicted, true_array, ) 311 | plt.savefig('error_distribution.pdf', bbox_inches='tight') 312 | -------------------------------------------------------------------------------- /run: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # 3 | # Copyright (c) 2018 Jonathan Weyn 4 | # 5 | # See the file LICENSE for your rights. 6 | # 7 | 8 | """ 9 | Run the MOS-X model either initialized at 23Z on any given day or for tomorrow. 
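| 
| Example usage (a sketch, assuming an executable `run` script and a config file KSEA.config as in the README):
| 
|     run KSEA -d 20180115 -w -p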
10 | """ 11 | 12 | import os 13 | import sys 14 | import mosx 15 | import numpy as np 16 | from shutil import copy2 17 | from collections import OrderedDict 18 | from optparse import OptionParser 19 | from datetime import datetime, timedelta 20 | import pickle 21 | import json 22 | 23 | # Suppress warnings 24 | import warnings 25 | warnings.filterwarnings("ignore") 26 | 27 | 28 | def get_command_options(): 29 | parser = OptionParser() 30 | parser.add_option('-d', '--date', dest='datestr', action='store', type='string', default='tomorrow', 31 | help='Date to run model for, YYYYMMDD (default=tomorrow)') 32 | parser.add_option('-t', '--naive-rain-correction', dest='tune_rain', action='store_true', default=False, 33 | help='Use the raw precipitation from GFS/NAM to override or average with MOS-X') 34 | parser.add_option('-r', '--rain-post-average', dest='avg_rain', action='store_true', default=False, 35 | help='If using a RainTuningEstimator, this will average the raw estimation from an ensemble' 36 | 'with that of the rain tuning post-processor') 37 | parser.add_option('-w', '--write', dest='write_flag', action='store_true', default=False, 38 | help='Write to a pickle file') 39 | parser.add_option('-f', '--write-file', dest='write_file', action='store', type='string', default='default', 40 | help='If -w is enabled, write to this file (default $SITE_ROOT/$(station_id)_MOSX_fcst). ' 41 | 'File extension is added based on file type.') 42 | parser.add_option('-u', '--upload', dest='upload', action='store_true', default=False, 43 | help='Upload upload forecast output to server in config') 44 | parser.add_option('-p', '--probabilities', dest='prob', action='store_true', default=False, 45 | help='Calculate and plot probability distributions') 46 | (opts, args) = parser.parse_args() 47 | return opts, args 48 | 49 | 50 | # Figure out the date 51 | 52 | options, arguments = get_command_options() 53 | datestr, write_flag, write_base, upload, prob = (options.datestr, options.write_flag, options.write_file, 54 | options.upload, options.prob) 55 | try: 56 | config_file = arguments[0] 57 | except IndexError: 58 | print('Required argument (config file) not provided.') 59 | sys.exit(1) 60 | config = mosx.util.get_config(config_file) 61 | 62 | if datestr == 'tomorrow': 63 | date = datetime.utcnow() 64 | # BUFR cycle 65 | cycle = str(6 * (((date.hour + 24 - 5) % 24) // 6)) 66 | verif_date = datetime(date.year, date.month, date.day) + timedelta(days=1) 67 | else: 68 | cycle = '18' 69 | try: 70 | verif_date = datetime.strptime(datestr, '%Y%m%d') 71 | except: 72 | raise ValueError('Invalid date format entered (use YYYYMMDD).') 73 | 74 | # Override the INFILE values 75 | new_start_date = datetime.strftime(verif_date, '%Y%m%d') 76 | new_end_date = datetime.strftime(verif_date, '%Y%m%d') 77 | config['data_start_date'] = new_start_date 78 | config['data_end_date'] = new_end_date 79 | 80 | 81 | # Retrieve data 82 | 83 | bufr_file = '%s/%s_%s_bufr.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date) 84 | print('\n--- MOS-X run: retrieving BUFR data...\n') 85 | print('Using model cycle %sZ' % cycle) 86 | mosx.bufr.bufr(config, bufr_file, cycle=cycle) 87 | 88 | obs_file = '%s/%s_%s_obs.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date) 89 | print('\n--- MOS-X run: retrieving OBS data...\n') 90 | mosx.obs.obs(config, obs_file, use_nan_sounding=True, use_existing_sounding=False) 91 | 92 | 93 | # Format data 94 | 95 | predictor_file = '%s/%s_%s_predictors.pkl' % (config['SITE_ROOT'], 
config['station_id'], new_end_date) 96 | print('\n--- MOS-X run: formatting predictor data...\n') 97 | mosx.model.format_predictors(config, bufr_file, obs_file, None, predictor_file) 98 | 99 | 100 | # Make a prediction! 101 | 102 | print('\n--- MOS-X run: making the forecast...\n') 103 | predicted, all_predicted, predicted_timeseries = mosx.model.model.predict_all(config, predictor_file, ensemble=prob, 104 | time_series_date=verif_date, 105 | naive_rain_correction=options.tune_rain, 106 | round_result=not prob) 107 | if options.avg_rain: 108 | no_tuned_predictions = mosx.model.model.predict_all(config, predictor_file, ensemble=prob, 109 | time_series_date=verif_date, 110 | naive_rain_correction=options.tune_rain, round_result=not prob, 111 | rain_tuning=False) 112 | 113 | predicted = np.squeeze(predicted) 114 | if options.avg_rain: 115 | no_tuned_rain = no_tuned_predictions[0][0, 3] 116 | print('Adjusting rain estimate to average of raw (%0.2f) and tuned (%0.2f) values' % 117 | (no_tuned_rain, predicted[3])) 118 | predicted[3] = np.mean([predicted[3], no_tuned_rain]) 119 | 120 | 121 | # Print forecast! 122 | 123 | print("\nRain forecast type: '%s'" % config['Model']['rain_forecast_type']) 124 | 125 | print('\nFor day %s at %s, the predicted forecast is' % (new_end_date, 126 | config['station_id'])) 127 | print('%0.0f/%0.0f/%0.0f/%0.2f' % tuple(predicted[:4])) 128 | 129 | if prob: 130 | predicted_std = np.squeeze(np.std(all_predicted, axis=-1)) 131 | predicted_display = [] 132 | for v in range(4): 133 | predicted_display.append(predicted[v]) 134 | predicted_display.append(predicted_std[v]) 135 | print('\nPredicted forecast with standard deviation is') 136 | print('%0.1f+/-%0.1f | %0.1f+/-%0.1f | %0.1f+/-%0.1f | %0.3f+/-%0.3f' % tuple(predicted_display)) 137 | print('\nNote that the reported rain above may be adjusted from raw model output.') 138 | 139 | if config['Model']['predict_timeseries']: 140 | print('\nPredicted time series:') 141 | print(predicted_timeseries) 142 | 143 | 144 | # Write the forecast, if requested 145 | 146 | if write_flag: 147 | file_types = config['Upload']['file_type'] 148 | write_files = [] 149 | if not isinstance(file_types, list): 150 | file_types = [file_types] 151 | 152 | for file_type in file_types: 153 | if file_type == 'pickle': 154 | if write_base == 'default': 155 | write_file = '%s/%s_MOSX_fcst.pkl' % (config['SITE_ROOT'], config['station_id']) 156 | else: 157 | write_file = '%s.pkl' % write_base 158 | print('\nForecast write requested, writing to file %s' % write_file) 159 | # Check if pickle file already exists 160 | try: 161 | with open(write_file, 'rb') as handle: 162 | data = pickle.load(handle) 163 | except: 164 | data = OrderedDict() 165 | 166 | data[verif_date] = { 167 | 'high': np.round(predicted[0]), 168 | 'low': np.round(predicted[1]), 169 | 'wind': np.round(predicted[2]), 170 | 'precip': np.round(predicted[3], 2) 171 | } 172 | 173 | if config['Model']['predict_timeseries']: 174 | data[verif_date].update(predicted_timeseries.to_dict(into=OrderedDict)) 175 | 176 | with open(write_file, 'wb') as handle: 177 | pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL) 178 | 179 | elif file_type == 'uw_text': 180 | write_file = 'MOS-X.%s' % new_end_date 181 | print('\nForecast write requested, writing to file %s' % write_file) 182 | if config['Model']['rain_forecast_type'] != 'pop': 183 | print('Warning: model rain prediction is not probability! Expect unusual output!') 184 | high = np.round(predicted[0]) 185 | rain = np.round(10. 
* predicted[3]) 186 | line = 'MOS-X,%0.0f,%0.0f' % (high, rain) 187 | with open(write_file, 'w') as f: 188 | f.write(line) 189 | 190 | elif file_type == 'json': 191 | if write_base == 'default': 192 | write_file = '%s/MOSX_%s_%s.json' % (config['SITE_ROOT'], config['station_id'], new_end_date) 193 | else: 194 | write_file = '%s.json' % write_base 195 | print('\nForecast write requested, writing to file %s' % write_file) 196 | data = OrderedDict() 197 | data['daily'] = { 198 | 'high': np.round(predicted[0]), 199 | 'low': np.round(predicted[1]), 200 | 'wind': np.round(predicted[2]), 201 | 'precip': np.round(predicted[3], 2) 202 | } 203 | 204 | if config['Model']['predict_timeseries']: 205 | data['hourly'] = predicted_timeseries.to_dict(into=OrderedDict, orient='list') 206 | data['hourly']['DateTime'] = [str(p) for p in predicted_timeseries.index] 207 | 208 | with open(write_file, 'w') as f: 209 | json.dump(data, f) 210 | 211 | write_files.append(write_file) 212 | 213 | 214 | # Make plots of probability distributions 215 | 216 | if prob: 217 | # Imports 218 | import matplotlib 219 | import matplotlib.pyplot as plt 220 | import matplotlib.gridspec as gs 221 | import matplotlib.ticker as ticker 222 | from scipy.stats import norm 223 | matplotlib.rcParams.update({'font.size': 9}) 224 | 225 | prob_file = '%s/%s_MOSX_prob.svg' % (config['SITE_ROOT'], config['station_id']) 226 | 227 | def plot_probabilities(predicted): 228 | fig = plt.figure() 229 | fig.set_size_inches(8, 6) 230 | gs1 = gs.GridSpec(2, 2) 231 | gs1.update(wspace=0.18, hspace=0.18) 232 | 233 | def plot_histogram(subplot_num, x, unit=1., facecolor='b', bins=None, align='left', 234 | xtickint=None, decimals=1, title=None): 235 | global fig 236 | ax = plt.subplot(subplot_num) 237 | 238 | if bins is None: 239 | bins = max(int(np.nanmax(x)/unit - np.nanmin(x)/unit), 1) 240 | n, bins, patches = plt.hist(x, bins=bins, facecolor=facecolor, normed=True, align=align,) 241 | x_axis = np.linspace(np.nanmin(x), np.nanmax(x), 101) 242 | x_mean = np.nanmean(x) 243 | x_std = np.nanstd(x) 244 | normal = norm.pdf(x_axis, x_mean, x_std) 245 | plt.plot(x_axis, normal) 246 | if xtickint is not None: 247 | ax.xaxis.set_major_locator(ticker.MultipleLocator(xtickint)) 248 | ylim = ax.get_ylim() 249 | if ylim[1] - np.nanmax(n) < 0.005: 250 | ax.set_ylim([ylim[0], ylim[1]+0.005]) 251 | ax.set_yticklabels(['{:.1f}'.format(100.*l*unit) for l in plt.yticks()[0]]) 252 | if plot_num % 2 == 0: 253 | ax.set_ylabel('Frequency (%)') 254 | plt.axvline(x_mean, linewidth=1.5, color='black') 255 | formatter = '%0.{:d}f'.format(decimals) 256 | text = ('Mean: %s\nStd: %s' % (formatter, formatter)) % (x_mean, x_std) 257 | plt.text(x_mean+unit/2, 0.9*np.nanmax(n), text, fontsize=8) 258 | if title is not None: 259 | ax.set_title(title) 260 | return 261 | 262 | colors = [(0.1, 0.6, 0.4), 263 | (0.6, 0.1, 0.4), 264 | (0.2, 0.4, 0.8), 265 | (0.8, 0.7, 0.1)] 266 | titles = ['High temperature', 'Low temperature', 'Max 2-min wind', 'Precipitation'] 267 | for plot_num in range(4): 268 | if plot_num != 3: 269 | unit = 1. 270 | decimals = 1 271 | elif config['Model']['rain_forecast_type'] == 'pop': 272 | unit = 0.1 273 | decimals = 2 274 | elif config['Model']['rain_forecast_type'] == 'categorical': 275 | unit = 1. 
276 | decimals = 1 277 | elif config['Model']['rain_forecast_type'] == 'quantity': 278 | unit = 0.05 279 | decimals = 3 280 | plot_histogram(gs1[plot_num], predicted[plot_num, :], unit, facecolor=colors[plot_num], 281 | decimals=decimals, title=titles[plot_num]) 282 | 283 | verif_date_str = datetime.strftime(verif_date, '%Y/%m/%d') 284 | plt.suptitle('MOS-X probability distributions, %s on %s' % (config['station_id'], verif_date_str)) 285 | plt.savefig(prob_file, dpi=200) 286 | 287 | plot_probabilities(np.squeeze(all_predicted)) 288 | 289 | 290 | # Upload, if requested 291 | 292 | if upload and write_flag: 293 | print('\nUpload requested...') 294 | account = config['Upload']['username'] 295 | server = config['Upload']['server'] 296 | forecast_dir = config['Upload']['forecast_directory'] 297 | plot_dir = config['Upload']['plot_directory'] 298 | if account == '' and server == '': 299 | if forecast_dir != '': 300 | for write_file in write_files: 301 | copy2(write_file, '%s/%s' % (forecast_dir, write_file.split('/')[-1])) 302 | else: 303 | print('No local directory specified for forecast upload, aborting') 304 | if prob and plot_dir != '': 305 | copy2(prob_file, '%s/%s' % (plot_dir, prob_file.split('/')[-1])) 306 | else: 307 | print('No local directory specified for plot upload, aborting') 308 | elif account == '' or server == '': 309 | print('Invalid username and/or server in config file, aborting') 310 | else: 311 | for write_file in write_files: 312 | result = os.system('scp %s %s@%s:%s' % (write_file, account, server, forecast_dir)) 313 | if prob: 314 | os.system('scp %s %s@%s:%s' % (prob_file, account, server, plot_dir)) 315 | print('Upload finished with system exit status %s' % result) 316 | -------------------------------------------------------------------------------- /validate: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # 3 | # Copyright (c) 2018 Jonathan Weyn 4 | # 5 | # See the file LICENSE for your rights. 6 | # 7 | 8 | """ 9 | Perform cross-validation of the MOS-X model for a range of dates specified in the config. 
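| 
| Example usage (a sketch, assuming an executable `validate` script and a config file KSEA.config as in the README):
| 
|     validate KSEA -w -f MOSX_CV.csv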
10 | """ 11 | 12 | import mosx 13 | import os 14 | import sys 15 | import numpy as np 16 | from optparse import OptionParser 17 | from datetime import datetime 18 | import pickle 19 | 20 | # Suppress warnings 21 | import warnings 22 | warnings.filterwarnings("ignore") 23 | 24 | 25 | def get_command_options(): 26 | parser = OptionParser() 27 | parser.add_option('-o', '--overwrite', dest='overwrite', action='store_true', default=False, 28 | help='Overwrite any existing BUFR, obs, and verification files') 29 | parser.add_option('-t', '--naive-rain-correction', dest='tune_rain', action='store_true', default=False, 30 | help='Use the raw precipitation from GFS/NAM to override or average with MOS-X') 31 | parser.add_option('-r', '--rain-post-average', dest='avg_rain', action='store_true', default=False, 32 | help='If using a RainTuningEstimator, this will average the raw estimation from an ensemble' 33 | 'with that of the rain tuning post-processor') 34 | parser.add_option('-w', '--write', dest='write_flag', action='store_true', default=False, help='Write a CSV file') 35 | default_file = './MOSX_CV.csv' 36 | parser.add_option('-f', '--write-file', dest='write_file', action='store', type='string', default=default_file, 37 | help=('If -w is enabled, write to this file (default %s)' % default_file)) 38 | (opts, args) = parser.parse_args() 39 | return opts, args 40 | 41 | 42 | # Figure out the date 43 | 44 | options, arguments = get_command_options() 45 | overwrite, write_flag, write_file = options.overwrite, options.write_flag, options.write_file 46 | try: 47 | config_file = arguments[0] 48 | except IndexError: 49 | print('Required argument (config file) not provided.') 50 | sys.exit(1) 51 | config = mosx.util.get_config(config_file) 52 | 53 | cycle = '18' 54 | 55 | start_date = datetime.strptime(config['Validate']['start_date'], '%Y%m%d') 56 | end_date = datetime.strptime(config['Validate']['end_date'], '%Y%m%d') 57 | 58 | # Override the INFILE values 59 | new_start_date = datetime.strftime(start_date, '%Y%m%d') 60 | new_end_date = datetime.strftime(end_date, '%Y%m%d') 61 | config['data_start_date'] = new_start_date 62 | config['data_end_date'] = new_end_date 63 | 64 | 65 | # Retrieve data 66 | 67 | bufr_file = '%s/%s_CV_%s_bufr.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date) 68 | print('\n--- MOS-X validate: retrieving BUFR data...\n') 69 | if os.path.isfile(bufr_file) and not overwrite: 70 | print('Using existing BUFR file %s' % bufr_file) 71 | print('If issues occur, delete this file and try again') 72 | else: 73 | print('Using model cycle %sZ' % cycle) 74 | mosx.bufr.bufr(config, bufr_file, cycle=cycle) 75 | 76 | obs_file = '%s/%s_CV_%s_obs.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date) 77 | print('\n--- MOS-X validate: retrieving OBS data...\n') 78 | if os.path.isfile(obs_file) and not overwrite: 79 | print('Using existing obs file %s' % obs_file) 80 | print('If issues occur, delete this file and try again') 81 | else: 82 | mosx.obs.obs(config, obs_file, use_nan_sounding=False) 83 | 84 | verif_file = '%s/%s_CV_%s_verif.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date) 85 | print('\n--- MOS-X validate: retrieving VERIF data...\n') 86 | if os.path.isfile(verif_file) and not overwrite: 87 | print('Using existing verif file %s' % verif_file) 88 | print('If issues occur, delete this file and try again') 89 | else: 90 | mosx.verification.verification(config, verif_file, use_climo=config['Obs']['use_climo_wind'], 91 | 
96 | # Format data
97 | 
98 | predictor_file = '%s/%s_CV_%s_predictors.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date)
99 | print('\n--- MOS-X validate: formatting predictor data...\n')
100 | all_dates = mosx.model.format_predictors(config, bufr_file, obs_file, verif_file, predictor_file,
101 |                                          return_dates=True)
102 | 
103 | 
104 | # Make a prediction!
105 | 
106 | predicted = mosx.model.predict(config, predictor_file, naive_rain_correction=options.tune_rain)
107 | if options.avg_rain:
108 |     print('Using average of raw and rain-tuned precipitation forecasts')
109 |     no_tuned_predictions = mosx.model.predict(config, predictor_file, naive_rain_correction=options.tune_rain,
110 |                                               rain_tuning=False)
111 |     predicted = np.mean([predicted, no_tuned_predictions], axis=0)
112 | 
113 | # Print forecasts!
114 | 
115 | print("\nRain forecast type: '%s'" % config['Model']['rain_forecast_type'])
116 | 
117 | print('\nDay,verification,forecast')
118 | for day in range(len(all_dates)):
119 |     date = all_dates[day]
120 |     day_verif = [verif[date][v] for v in ['Tmax', 'Tmin', 'Wind', 'Rain']]
121 |     verif_str = '%0.0f/%0.0f/%0.0f/%0.2f' % tuple(day_verif)
122 |     fcst_str = '%0.0f/%0.0f/%0.0f/%0.2f' % tuple(predicted[day, :4])
123 |     print('%s,%s,%s' % (date, verif_str, fcst_str))
124 | 
125 | 
126 | # Write the forecast, if requested
127 | 
128 | if write_flag:
129 |     print('\nForecast write requested, writing to file %s' % write_file)
130 | 
131 |     with open(write_file, 'w') as f:
132 |         print >> f, 'date,verification,forecast'
133 |         for day in range(len(all_dates)):
134 |             date = all_dates[day]
135 |             day_verif = [verif[date][v] for v in ['Tmax', 'Tmin', 'Wind', 'Rain']]
136 |             verif_str = '%0.0f/%0.0f/%0.0f/%0.2f' % tuple(day_verif)
137 |             fcst_str = '%0.0f/%0.0f/%0.0f/%0.2f' % tuple(predicted[day, :4])
138 |             print >> f, ('%s,%s,%s' % (date, verif_str, fcst_str))
--------------------------------------------------------------------------------
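The CSV written with `-w` packs each day's Tmax/Tmin/Wind/Rain into a single slash-separated field per column. A short pandas sketch for splitting those back into numeric columns, assuming the default output file name and that every row is fully populated:

```python
import pandas as pd

# Read the cross-validation CSV written by `validate <config> -w`
df = pd.read_csv('MOSX_CV.csv')
variables = ['Tmax', 'Tmin', 'Wind', 'Rain']
for col in ('verification', 'forecast'):
    # Split 'Tmax/Tmin/Wind/Rain' strings into four numeric columns
    parts = df[col].str.split('/', expand=True).astype(float)
    parts.columns = ['%s_%s' % (col, v) for v in variables]
    df = df.join(parts)
print(df.head())
```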
10 | """ 11 | 12 | import mosx 13 | import sys 14 | import os 15 | import numpy as np 16 | import pandas as pd 17 | from urllib import urlopen 18 | from collections import OrderedDict 19 | from optparse import OptionParser 20 | from datetime import datetime, timedelta 21 | import pickle 22 | 23 | # Suppress warnings 24 | import warnings 25 | warnings.filterwarnings("ignore") 26 | 27 | 28 | def get_command_options(): 29 | parser = OptionParser() 30 | parser.add_option('-s', '--start-date', dest='startstr', action='store', type='string', default='weekago', 31 | help='Starting verification date, YYYYMMDD (default=today)') 32 | parser.add_option('-e', '--end-date', dest='endstr', action='store', type='string', default='today', 33 | help='Ending verification date, YYYYMMDD (default=weekago)') 34 | parser.add_option('-t', '--naive-rain-correction', dest='tune_rain', action='store_true', default=False, 35 | help='Use the raw precipitation from GFS/NAM to override or average with MOS-X') 36 | parser.add_option('-r', '--rain-post-average', dest='avg_rain', action='store_true', default=False, 37 | help='If using a RainTuningEstimator, this will average the raw estimation from an ensemble' 38 | 'with that of the rain tuning post-processor') 39 | parser.add_option('-m', '--verify-mos', dest='mos_flag', action='store_true', default=False, 40 | help="Also retrieve GFS and NAM MOS forecasts for comparison") 41 | parser.add_option('-N', '--no-verify', dest='no_verif_flag', action='store_true', default=False, 42 | help="Don't do verification, just get the model forecasts") 43 | parser.add_option('-o', '--overwrite', dest='overwrite', action='store_true', default=False, 44 | help='Overwrite any existing BUFR, obs, and verification files') 45 | parser.add_option('-w', '--write', dest='write_flag', action='store_true', default=False, 46 | help='Write a CSV file') 47 | default_file = './MOS-X_verification.csv' 48 | parser.add_option('-f', '--write-file', dest='write_file', action='store', type='string', default=default_file, 49 | help=('If -w is enabled, write to this file (default %s)' % default_file)) 50 | (opts, args) = parser.parse_args() 51 | return opts, args 52 | 53 | 54 | def mos_qpf_interpret(qpf): 55 | """ 56 | Interprets a pandas Series of QPF by average estimates 57 | 58 | :param qpf: Series of q06 or q12 from MOS 59 | :return: precip: Series of average estimated precipitation 60 | """ 61 | translator = {0: 0.0, 62 | 1: 0.05, 63 | 2: 0.15, 64 | 3: 0.35, 65 | 4: 0.75, 66 | 5: 1.5, 67 | 6: 2.5} 68 | new_qpf = qpf.copy() 69 | for j in range(len(qpf)): 70 | try: 71 | new_qpf.iloc[j] = translator[int(qpf.iloc[j])] 72 | except: 73 | new_qpf.iloc[j] = 0.0 74 | return new_qpf 75 | 76 | 77 | def retrieve_mos(model, init_date, forecast_date): 78 | """ 79 | Retrieve MOS data. 
77 | def retrieve_mos(model, init_date, forecast_date):
78 |     """
79 |     Retrieve MOS data.
80 | 
81 |     :param model: model name (GFS or NAM)
82 |     :param init_date: datetime of model initialization
83 |     :param forecast_date: datetime of forecast
84 |     :return: dict of high, low, max wind, total rain for next 6Z--6Z
85 |     """
86 | 
87 |     # Create daily return dict
88 |     daily = OrderedDict()
89 | 
90 |     base_url = 'http://mesonet.agron.iastate.edu/mos/csv.php?station=%s&runtime=%s&model=%s'
91 |     formatted_date = init_date.strftime('%Y-%m-%d%%20%H:00')
92 |     url = base_url % (config['station_id'], formatted_date, model)
93 |     response = urlopen(url)
94 |     df = pd.read_csv(response, index_col=False)
95 |     date_index = pd.to_datetime(df['ftime'])
96 |     df['datetime'] = date_index
97 |     df = df.set_index('datetime')
98 | 
99 |     # Remove duplicate rows
100 |     df = df.drop_duplicates()
101 | 
102 |     # Fix rain
103 |     df['q06'] = mos_qpf_interpret(df['q06'])
104 | 
105 |     forecast_start = forecast_date.replace(hour=6)
106 |     forecast_end = forecast_start + timedelta(days=1)
107 | 
108 |     # Some parameters need to include the forecast start; others, like total rain and 6-hour maxes, don't
109 |     try:
110 |         iloc_start_include = df.index.get_loc(forecast_start)
111 |         iloc_start_exclude = iloc_start_include + 1
112 |     except KeyError:
113 |         print('Error getting start time index in db; check data.')
114 |         return
115 |     try:
116 |         iloc_end = df.index.get_loc(forecast_end) + 1
117 |     except KeyError:
118 |         print('Error getting end time index in db; check data.')
119 |         return
120 | 
121 |     raw_high = df.iloc[iloc_start_include:iloc_end]['tmp'].max()
122 |     raw_low = df.iloc[iloc_start_include:iloc_end]['tmp'].min()
123 |     nx_high = df.iloc[iloc_start_exclude:iloc_end]['n_x'].max()
124 |     nx_low = df.iloc[iloc_start_exclude:iloc_end]['n_x'].min()
125 |     daily['Tmax'] = np.nanmax([raw_high, nx_high])
126 |     daily['Tmin'] = np.nanmin([raw_low, nx_low])
127 |     daily['Wind'] = df.iloc[iloc_start_include:iloc_end]['wsp'].max()
128 |     daily['Rain'] = df.iloc[iloc_start_exclude:iloc_end]['q06'].sum()
129 | 
130 |     return daily
131 | 
132 | 
133 | # Parameters
134 | 
135 | options, arguments = get_command_options()
136 | 
137 | try:
138 |     config_file = arguments[0]
139 | except IndexError:
140 |     print('Required argument (config file) not provided.')
141 |     sys.exit(1)
142 | config = mosx.util.get_config(config_file)
143 | 
144 | if options.endstr == 'today':
145 |     date = datetime.utcnow()
146 |     # BUFR cycle
147 |     cycle = '18'
148 |     if date.hour < 6:
149 |         end_date = datetime(date.year, date.month, date.day) - timedelta(days=2)
150 |     else:
151 |         end_date = datetime(date.year, date.month, date.day) - timedelta(days=1)
152 |     if options.no_verif_flag:
153 |         end_date += timedelta(days=2)
154 | else:
155 |     cycle = '18'
156 |     try:
157 |         end_date = datetime.strptime(options.endstr, '%Y%m%d')
158 |     except ValueError:
159 |         raise ValueError('Invalid date format entered (use YYYYMMDD).')
160 | 
161 | if options.startstr == 'weekago':
162 |     date = datetime.utcnow()
163 |     start_date = datetime(date.year, date.month, date.day) - timedelta(days=7)
164 | else:
165 |     try:
166 |         start_date = datetime.strptime(options.startstr, '%Y%m%d')
167 |     except ValueError:
168 |         raise ValueError('Invalid date format entered (use YYYYMMDD).')
169 | 
170 | # Override the INFILE values
171 | new_start_date = datetime.strftime(start_date, '%Y%m%d')
172 | new_end_date = datetime.strftime(end_date, '%Y%m%d')
173 | config['data_start_date'] = new_start_date
174 | config['data_end_date'] = new_end_date
175 | 
176 | 
177 | # Retrieve data
178 | 
179 | bufr_file = '%s/%s_verify_%s_bufr.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date)
180 | print('\n--- MOS-X verify: retrieving BUFR data...\n')
181 | if os.path.isfile(bufr_file) and not options.overwrite:
182 |     print('Using existing BUFR file %s' % bufr_file)
183 |     print('If issues occur, delete this file and try again')
184 | else:
185 |     print('Using model cycle %sZ' % cycle)
186 |     mosx.bufr.bufr(config, bufr_file, cycle=cycle)
187 | 
188 | 
189 | obs_file = '%s/%s_verify_%s_obs.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date)
190 | print('\n--- MOS-X verify: retrieving OBS data...\n')
191 | if os.path.isfile(obs_file) and not options.overwrite:
192 |     print('Using existing obs file %s' % obs_file)
193 |     print('If issues occur, delete this file and try again')
194 | else:
195 |     mosx.obs.obs(config, obs_file, use_nan_sounding=False)
196 | 
197 | if not options.no_verif_flag:
198 |     verif_file = '%s/%s_verify_%s_verif.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date)
199 |     print('\n--- MOS-X verify: retrieving VERIF data...\n')
200 |     if os.path.isfile(verif_file) and not options.overwrite:
201 |         print('Using existing verif file %s' % verif_file)
202 |         print('If issues occur, delete this file and try again')
203 |     else:
204 |         mosx.verification.verification(config, verif_file, use_climo=False, use_cf6=config['Obs']['use_climo_wind'])
205 |     with open(verif_file, 'rb') as handle:
206 |         verif = pickle.load(handle)
207 | else:
208 |     verif_file = None
209 | 
210 | 
211 | # Format data
212 | 
213 | predictor_file = '%s/%s_verify_%s_predictors.pkl' % (config['SITE_ROOT'], config['station_id'],
214 |                                                      new_end_date)
215 | print('\n--- MOS-X verify: formatting predictor data...\n')
216 | all_dates = mosx.model.format_predictors(config, bufr_file, obs_file, verif_file, predictor_file,
217 |                                          return_dates=True)
218 | 
219 | 
220 | # Make a prediction!
221 | 
222 | predicted = mosx.model.predict(config, predictor_file, naive_rain_correction=options.tune_rain)
223 | if options.avg_rain:
224 |     print('Using average of raw and rain-tuned precipitation forecasts')
225 |     no_tuned_predictions = mosx.model.predict_all(config, predictor_file, naive_rain_correction=options.tune_rain,
226 |                                                   rain_tuning=False)
227 |     predicted = np.mean([predicted, no_tuned_predictions[0]], axis=0)
228 | 
229 | # Get GFS and NAM MOS forecasts, if desired
230 | 
231 | if options.mos_flag:
232 |     gfs = OrderedDict()
233 |     nam = OrderedDict()
234 |     mos_mean = OrderedDict()
235 |     for current_date in all_dates:
236 |         init_date = current_date - timedelta(days=1)
237 |         init_date = init_date.replace(hour=12)
238 |         mos_array = np.full((2, 4), np.nan)
239 | 
240 |         # GFS
241 |         print('Retrieving %s data initialized at %s' % ('GFS MOS', init_date))
242 |         daily = retrieve_mos('GFS', init_date, current_date)
243 |         if daily is not None:
244 |             gfs[current_date] = daily
245 |             mos_array[0, :] = daily.values()
246 | 
247 |         # NAM
248 |         print('Retrieving %s data initialized at %s' % ('NAM MOS', init_date))
249 |         daily = retrieve_mos('NAM', init_date, current_date)
250 |         if daily is not None:
251 |             nam[current_date] = daily
252 |             mos_array[1, :] = daily.values()
253 | 
254 |         # Ensemble mean
255 |         mos_array = np.nanmean(mos_array, axis=0)
256 |         # Map the ensemble-mean array back to named daily values
257 |         mos_mean[current_date] = {
258 |             'Tmax': mos_array[0],
259 |             'Tmin': mos_array[1],
260 |             'Wind': mos_array[2],
261 |             'Rain': mos_array[3]
262 |         }
263 | 
264 | 
265 | 
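Because either MOS product can be missing on a given day, the ensemble mean above is an `np.nanmean` over a 2x4 array whose rows are GFS and NAM and whose columns are Tmax, Tmin, Wind, and Rain; a row left as NaN simply drops out of the column means. A small self-contained illustration (the numbers are placeholders):

```python
import numpy as np

# Row 0 = GFS, row 1 = NAM; columns = Tmax, Tmin, Wind, Rain
mos_array = np.full((2, 4), np.nan)
mos_array[0, :] = [62.0, 44.0, 15.0, 0.10]  # GFS available
# NAM row stays NaN, as when retrieve_mos returns None
print(np.nanmean(mos_array, axis=0))  # -> approximately [62. 44. 15. 0.1]
```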
266 | # Print forecasts!
267 | 
268 | print("\nRain forecast type: '%s'" % config['Model']['rain_forecast_type'])
269 | 
270 | forecast_format = '%0.0f/%0.0f/%0.0f/%0.2f'
271 | variables = ['Tmax', 'Tmin', 'Wind', 'Rain']
272 | if options.mos_flag:
273 |     print('\nDay, verification, GFS MOS, NAM MOS, MOS MEAN, MOS-X')
274 | else:
275 |     print('\nDay, verification, MOS-X')
276 | for day in range(len(all_dates)):
277 |     date = all_dates[day]
278 |     if not options.no_verif_flag:
279 |         verif_str = forecast_format % tuple([verif[date][v] for v in variables])
280 |     else:
281 |         verif_str = ''
282 |     mosx_str = forecast_format % tuple(predicted[day, :4])
283 |     if options.mos_flag:
284 |         try:
285 |             gfs_str = forecast_format % tuple([gfs[date][v] for v in variables])
286 |         except KeyError:
287 |             gfs_str = ''
288 |         try:
289 |             nam_str = forecast_format % tuple([nam[date][v] for v in variables])
290 |         except KeyError:
291 |             nam_str = ''
292 |         try:
293 |             mos_mean_str = forecast_format % tuple([mos_mean[date][v] for v in variables])
294 |         except KeyError:
295 |             mos_mean_str = ''
296 |         print('%s,%s,%s,%s,%s,%s' % (date, verif_str, gfs_str, nam_str, mos_mean_str, mosx_str))
297 |     else:
298 |         print('%s,%s,%s' % (date, verif_str, mosx_str))
299 | 
300 | 
301 | # Write the forecast, if requested
302 | 
303 | if options.write_flag:
304 |     print('\nForecast write requested, writing to file %s' % options.write_file)
305 | 
306 |     with open(options.write_file, 'w') as f:
307 |         if options.mos_flag:
308 |             print >> f, 'date,verification,GFS MOS,NAM MOS,MOS MEAN,MOS-X'
309 |         else:
310 |             print >> f, 'date,verification,MOS-X'
311 |         for day in range(len(all_dates)):
312 |             date = all_dates[day]
313 |             if not options.no_verif_flag:
314 |                 verif_str = forecast_format % tuple([verif[date][v] for v in variables])
315 |             else:
316 |                 verif_str = ''
317 |             mosx_str = forecast_format % tuple(predicted[day, :4])
318 |             if options.mos_flag:
319 |                 try:
320 |                     gfs_str = forecast_format % tuple([gfs[date][v] for v in variables])
321 |                 except KeyError:
322 |                     gfs_str = ''
323 |                 try:
324 |                     nam_str = forecast_format % tuple([nam[date][v] for v in variables])
325 |                 except KeyError:
326 |                     nam_str = ''
327 |                 try:
328 |                     mos_mean_str = forecast_format % tuple([mos_mean[date][v] for v in variables])
329 |                 except KeyError:
330 |                     mos_mean_str = ''
331 |                 print >> f, '%s,%s,%s,%s,%s,%s' % (date, verif_str, gfs_str, nam_str, mos_mean_str, mosx_str)
332 |             else:
333 |                 print >> f, '%s,%s,%s' % (date, verif_str, mosx_str)
--------------------------------------------------------------------------------
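Once `verify` has been run with `-w`, the written CSV can be scored directly. A short sketch computing the mean absolute error of MOS-X against verification, assuming the default output file name and that every row has both fields populated (i.e., `-N` was not used):

```python
import pandas as pd

# Score the CSV written by `verify <config> -w`
df = pd.read_csv('MOS-X_verification.csv')
fcst = df['MOS-X'].str.split('/', expand=True).astype(float)
obs = df['verification'].str.split('/', expand=True).astype(float)
mae = (fcst - obs).abs().mean()
mae.index = ['Tmax', 'Tmin', 'Wind', 'Rain']
print(mae)
```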