├── LICENSE
├── README.md
├── build
├── clean
├── default.config
├── mosx
│   ├── MesoPy.py
│   ├── __init__.py
│   ├── bufr
│   │   ├── __init__.py
│   │   └── methods.py
│   ├── configspec
│   ├── estimators.py
│   ├── model
│   │   ├── __init__.py
│   │   ├── model.py
│   │   ├── predictors.py
│   │   └── scorers.py
│   ├── obs
│   │   ├── __init__.py
│   │   └── methods.py
│   ├── util.py
│   └── verification
│       ├── __init__.py
│       └── methods.py
├── performance
├── run
├── validate
└── verify
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 Jonathan Weyn
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # MOS-X
2 |
3 | MOS-X is a machine learning-based forecasting model built in Python and designed to produce output tailored for the [WxChallenge](http://www.wxchallenge.com) weather forecasting competition.
4 | It uses an external executable to download and process time-height profiles of model data from the National Centers for Environmental Prediction (NCEP) Global Forecast System (GFS) and North American Mesoscale (NAM) models.
5 | These data, along with surface observations from MesoWest, are used to train any of scikit-learn's ML algorithms to predict tomorrow's high temperature, low temperature, peak 2-minute sustained wind speed, and rain total.
6 |
7 | ## Installing
8 |
9 | ### Requirements
10 |
11 | - Python 2.7 (no Python 3 yet, and probably never, because this is a toy project)
12 | - A workstation with a recent Linux installation... sorry, that's all that will work with the next item...
13 | - [BUFRgruven](http://strc.comet.ucar.edu/software/bgruven/) - for model data
14 | - An API key for [MesoWest](https://synopticlabs.org/api/mesonet/) - unfortunately the API now has a limited free tier. MOS-X currently does a poor job of caching data, so large data sets will exceed the free limit; use with caution.
15 | - A decent amount of free disk space - some of the models are > 1 GB pickle files... not to mention all the BUFKIT files...
16 |
17 | ### Python packages - easier with conda
18 |
19 | - NumPy
20 | - scipy
21 | - pandas
22 | - ConfigObj (and validate)
23 | - ulmo (use conda-forge)
24 | - the excellent [scikit-learn](http://scikit-learn.org/stable/index.html)
25 |
26 | ### Installation
27 |
28 | Nothing to do, really. Just make sure the scripts in the main directory (`build`, `run`, `verify`, `validate`, and `performance`) are executable, for example:
29 |
30 | `chmod +x build run verify validate performance`
31 |
32 | ## Building a model
33 |
34 | 1. The first thing to do is to set up the config file for the particular site to forecast for. The `default.config` file has a good number of comments describing how to do that. Parameters that are not marked 'optional' or given a default value must be specified.
35 |     - The parameter `climo_station_id` is now automatically generated!
36 |     - It is not recommended to use the upper-air sounding data option. In my testing, adding sounding data made no difference to the skill of the models, but YMMV. Use with caution. I don't test it.
37 | 2. Once the config is set up, build the model using `build <config_file>`. The config reader will automatically look for `<config_file>.config` too, so if you're like me and like to call your config files `KSEA.config`, it's handy to just pass `KSEA`.
38 |     - Depending on how much training data is requested, it may take several hours for BUFRgruven to download everything.
39 |     - Actually building the scikit-learn model, however, takes only 10 minutes for a 1000-tree random forest on a 16-core machine.
40 |
41 | ## Running the model
42 |
43 | - Run the model for tomorrow with `run <config_file>`, or give it any date to run on (see the end-to-end sketch at the bottom of this README).
44 | - Verify the model prediction against the truth and against GFS and NAM MOS products with `verify <config_file>`.
45 | - The `validate` script is essentially a glorified `verify` over an entire user-specified range of dates.
46 |
47 | ## Some notes on advanced model configurations
48 |
49 | - There is built-in functionality for building a model that predicts a time series of hourly temperature, relative humidity, wind speed, and rain for the forecast period in addition to the daily values. While handy for getting an idea of the temporal variation of predicted weather, it has limited use and makes the pickled model file much larger.
50 | - Rain forecasting is difficult for an ML model: rain values are highly non-normally distributed. There is the option to use a post-processor model, which is another random forest, trained on the distribution of output from the base model's trees. It improves the rain forecast a little, particularly by doing a better job of predicting 0 on sunny days.
51 | - Rain forecasting can now be done in three different ways: `quantity`, which is the standard prediction of an actual daily rain total; `pop`, the probability of precipitation; and `categorical`, which uses the MOS categories.
52 |
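As an end-to-end sketch, a typical session might look like the following (the `KSEA` station ID and config file name are hypothetical, and the first `build` can take hours while BUFRgruven downloads data):

```
# assumes a KSEA.config file exists in the current directory
./build KSEA      # download training data and train the estimator
./run KSEA        # predict tomorrow's high/low/wind/rain
./verify KSEA     # compare the forecast against obs and GFS/NAM MOS
./validate KSEA   # verification over the date range in [Validate]
```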
--------------------------------------------------------------------------------
/build:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python2
2 | #
3 | # Copyright (c) 2018 Jonathan Weyn
4 | #
5 | # See the file LICENSE for your rights.
6 | #
7 |
8 | """
9 | Build the estimator for the MOS-X model.
10 | """ 11 | 12 | import sys 13 | import mosx 14 | import pickle 15 | from optparse import OptionParser 16 | from multiprocessing import Process 17 | 18 | 19 | def get_command_options(): 20 | parser = OptionParser() 21 | parser.add_option('-e', '--use-existing-files', dest='use_existing', action='store_true', default=False, 22 | help='Use existing BUFR, obs, and verification files: use with caution!') 23 | (opts, args) = parser.parse_args() 24 | return opts, args 25 | 26 | 27 | # Get the config dictionary 28 | 29 | options, arguments = get_command_options() 30 | use_existing = options.use_existing 31 | try: 32 | config_file = arguments[0] 33 | except IndexError: 34 | print('Required argument (config file) not provided.') 35 | sys.exit(1) 36 | config = mosx.util.get_config(config_file) 37 | 38 | 39 | # Retrieve data; parallelize BUFR and OBS 40 | 41 | bufr_file = '%s/%s_bufr_train.pkl' % (config['SITE_ROOT'], config['station_id']) 42 | obs_file = '%s/%s_obs_train.pkl' % (config['SITE_ROOT'], config['station_id']) 43 | verif_file = '%s/%s_verif_train.pkl' % (config['SITE_ROOT'], config['station_id']) 44 | predictor_file = '%s/%s_predictors_train.pkl' % (config['SITE_ROOT'], config['station_id']) 45 | 46 | 47 | def get_bufr(): 48 | print('\n--- MOS-X build: initiating BUFR data retrieval...\n') 49 | mosx.bufr.bufr(config, bufr_file) 50 | 51 | 52 | def get_obs(): 53 | print('\n--- MOS-X build: initiating OBS data retrieval...\n') 54 | mosx.obs.obs(config, obs_file) 55 | 56 | 57 | if not use_existing: 58 | if __name__ == '__main__': 59 | p1 = Process(target=get_bufr) 60 | p1.start() 61 | p2 = Process(target=get_obs) 62 | p2.start() 63 | p1.join() 64 | p2.join() 65 | 66 | print('\n--- MOS-X build: retrieving VERIF data...\n') 67 | if config['Obs']['use_climo_wind']: 68 | mosx.verification.verification(config, verif_file) 69 | else: 70 | mosx.verification.verification(config, verif_file, use_climo=False, use_cf6=False) 71 | 72 | # Re-format predictors even if using existing raw data 73 | print('\n--- MOS-X build: formatting predictor and target data...\n') 74 | mosx.model.format_predictors(config, bufr_file, obs_file, verif_file, predictor_file) 75 | 76 | 77 | # Train the estimator and save it 78 | 79 | print('\n--- MOS-X build: generating and training estimator...\n') 80 | 81 | p_test, t_test, r_test = mosx.model.train(config, predictor_file, config['Model']['estimator_file'], test_size=1) 82 | 83 | 84 | # Test the predictor 85 | 86 | print('\n--- MOS-X build: predicting for a test of 1 day...\n') 87 | 88 | print('Loading estimator file %s' % config['Model']['estimator_file']) 89 | with open(config['Model']['estimator_file'], 'rb') as handle: 90 | estimator = pickle.load(handle) 91 | rain_tuning = config['Model'].get('Rain tuning', None) 92 | 93 | # Make a prediction 94 | expected = t_test 95 | if rain_tuning is not None and mosx.util.to_bool(rain_tuning.get('use_raw_rain', False)): 96 | predicted = estimator.predict(p_test, rain_array=r_test) 97 | else: 98 | predicted = estimator.predict(p_test) 99 | 100 | num_test = t_test.shape[0] 101 | for t in range(num_test): 102 | print('\nFor day %d, the predicted forecast is' % t) 103 | print('%0.0f/%0.0f/%0.0f/%0.2f' % tuple(predicted[t, :4])) 104 | print(' while the verification was') 105 | print('%0.0f/%0.0f/%0.0f/%0.2f' % tuple(expected[t, :4])) 106 | -------------------------------------------------------------------------------- /clean: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | 
# 3 | # Copyright (c) 2018 Jonathan Weyn 4 | # 5 | # See the file LICENSE for your rights. 6 | # 7 | 8 | """ 9 | Clean the archived files for a particular site. 10 | """ 11 | 12 | import os 13 | import sys 14 | import mosx 15 | from optparse import OptionParser 16 | 17 | 18 | def get_command_options(): 19 | parser = OptionParser() 20 | parser.add_option('-s', '--station-id', dest='station_id', action='store', type='string', default='', 21 | help='Station ID to clean (required!)') 22 | parser.add_option('-d', '--no-remove-site-data', dest='data_flag', action='store_true', default=False, 23 | help="Don't delete files in site_data") 24 | parser.add_option('-b', '--remove-bufr', dest='bufr_flag', action='store_true', default=False, 25 | help='Delete archived BUFKIT files') 26 | parser.add_option('-u', '--remove-upper-air', dest='sndg_flag', action='store_true', default=False, 27 | help='Delete archived sounding (upper air) files') 28 | parser.add_option('-m', '--remove-model', dest='model_flag', action='store_true', default=False, 29 | help='Delete estimator files *mosx*.pkl') 30 | parser.add_option('-v', '--verbose', dest='verbose', action='store_true', default=False, 31 | help='Print some extra statements') 32 | (opts, args) = parser.parse_args() 33 | return opts, args 34 | 35 | 36 | options, arguments = get_command_options() 37 | station_id, data_flag, bufr_flag, sndg_flag, model_flag, verbose = (options.station_id, options.data_flag, 38 | options.bufr_flag, options.sndg_flag, 39 | options.model_flag, options.verbose) 40 | try: 41 | config_file = arguments[0] 42 | except IndexError: 43 | print('Required argument (config file) not provided.') 44 | sys.exit(1) 45 | config = mosx.util.get_config(config_file) 46 | 47 | bufr_data_dir = config['BUFR']['bufr_data_dir'] 48 | sndg_data_dir = config['Obs']['sounding_data_dir'] 49 | 50 | if station_id == '': 51 | print('\nStation ID (-s) required! 
Use --help or -h for full options.')
52 |     sys.exit(1)
53 |
54 | station_id = station_id.upper()
55 | print('mosx_clean: Cleaning files for station %s' % station_id)
56 | print(' Files to delete:')
57 | if not data_flag:
58 |     print(' Site data in %s' % config['SITE_ROOT'])
59 | if bufr_flag:
60 |     print(' BUFKIT archives in %s' % bufr_data_dir)
61 | if sndg_flag:
62 |     print(' Sounding files in %s' % sndg_data_dir)
63 | if model_flag:
64 |     print(" Model estimator in config or files containing 'mosx'")
65 |
66 | rm_command = 'rm -f'
67 |
68 | if not data_flag:
69 |     if verbose:
70 |         print('mosx_clean: deleting %s files in %s' % (station_id, config['SITE_ROOT']))
71 |     listing = os.listdir(config['SITE_ROOT'])
72 |     if not model_flag:
73 |         # Build a filtered list instead of calling listing.remove() inside the
74 |         # loop, which skips elements while iterating
75 |         estimator_file_name = config['Model']['estimator_file'].split('/')[-1]
76 |         listing = [ll for ll in listing if ll != estimator_file_name and
77 |                    not ('mosx' in ll.lower() and station_id in ll.upper())]
78 |     else:
79 |         if verbose:
80 |             print('mosx_clean: deleting model estimator files')
81 |     for f in listing:
82 |         if not f.upper().startswith(station_id):
83 |             continue
84 |         command = '%s %s/%s' % (rm_command, config['SITE_ROOT'], f)
85 |         if verbose:
86 |             print(command)
87 |         os.system(command)
88 |
89 | if bufr_flag:
90 |     if verbose:
91 |         print('mosx_clean: deleting bufr files in %s' % bufr_data_dir)
92 |     command = '%s %s/*%s*' % (rm_command, bufr_data_dir, station_id.lower())
93 |     if verbose:
94 |         print(command)
95 |     os.system(command)
96 |
97 | if sndg_flag:
98 |     if verbose:
99 |         print('mosx_clean: deleting sounding files in %s' % sndg_data_dir)
100 |     command = '%s %s/%s*' % (rm_command, sndg_data_dir, station_id)
101 |     if verbose:
102 |         print(command)
103 |     os.system(command)
104 |
105 | print('\nDone.')
106 |
--------------------------------------------------------------------------------
/default.config:
--------------------------------------------------------------------------------
1 | ########################################################################################################################
2 | # This is the configuration file for the MOS-X model. Parameters specified here are used to build and run the
3 | # machine learning model to predict a few weather parameters at a given station location.
4 | ########################################################################################################################
5 | # Global parameters go here.
6 |
7 | # The root directory of this program
8 | MOSX_ROOT =
9 |
10 | # The root directory of the BUFRgruven executable program
11 | BUFR_ROOT =
12 |
13 | # The directory where site-specific data files are saved (defaults to %MOSX_ROOT/site_data)
14 | SITE_ROOT =
15 |
16 | # The 4-letter station ID of the location
17 | station_id = KAUS
18 |
19 | # Lowest pressure level for vertical profiles. For stations at high elevation, the lowest pressure level may have to be
20 | # higher because data will be missing at higher pressures.
21 | lowest_p_level = 950
22 |
23 | # Starting and ending dates for the model training. BUFR data are available beginning 2010-01-01. If the parameter
24 | # 'is_season' is True, then the program will assume that you want seasonally-subset training data beginning in the year
25 | # of 'data_start_date' and ending in the year of 'data_end_date', with seasons spanning from the start day to the end
26 | # day. Dates should be provided as YYYYMMDD.
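# For example (hypothetical dates): with is_season = True, data_start_date = 20101120, and
# data_end_date = 20160409, training uses the seasons 2010-11-20 to 2011-04-09,
# 2011-11-20 to 2012-04-09, and so on, ending with 2015-11-20 to 2016-04-09.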
27 | data_start_date = 20081120
28 | data_end_date = 20160409
29 | is_season = True
30 |
31 | # This is the UTC hour at which the forecast period starts on any given day. The period ends 24 hours later. (Defaults
32 | # to 6.)
33 | forecast_hour_start = 6
34 |
35 | # Provide here the hourly resolution of the time series prediction. Must be an integer from 1 to 6 that divides evenly
36 | # into 24 (so 5 is not allowed). Ignored if the parameter 'predict_timeseries' under 'Model' is False. (Defaults to 3.)
37 | time_series_interval = 3
38 |
39 | # API token for MesoWest data (required). To get an API key, visit https://synopticlabs.org/api/
40 | meso_token =
41 |
42 | # Produce verbose output
43 | verbose = True
44 |
45 |
46 | ########################################################################################################################
47 | # In this section, provide parameters used to control the retrieval of BUFR profile model data.
48 |
49 | [BUFR]
50 |
51 | # (Optional) The station ID used in BUFR for the station (defaults to station_id, may be different though)
52 | bufr_station_id =
53 |
54 | # (Optional) Data directory where BUFKIT files are saved (defaults to %SITE_ROOT/bufkit)
55 | bufr_data_dir =
56 |
57 | # (Optional) Path to bufrgruven executable (defaults to %BUFR_ROOT/bufr_gruven.pl)
58 | bufrgruven =
59 |
60 | # BUFR models; should be a list
61 | models = GFS, NAM
62 |
63 |
64 | ########################################################################################################################
65 | # In this section, provide parameters to control the retrieval of observations.
66 |
67 | [Obs]
68 |
69 | # Option to use upper-air sounding data
70 | use_soundings = False
71 |
72 | # Upper-air sounding station ID, required if use_soundings is True
73 | sounding_station_id = FWD
74 |
75 | # (Optional) Data directory where sounding files are saved (defaults to %SITE_ROOT/soundings)
76 | sounding_data_dir =
77 |
78 | # Set this option to False to disable retrieval of NCDC and CF6 data for max wind values. It may need to be
79 | # disabled if the forecast start hour is not near 6 UTC, otherwise should be True. (Defaults to True.)
80 | use_climo_wind = True
81 |
82 |
83 | ########################################################################################################################
84 | # In this section, provide the machine learning model parameters.
85 |
86 | [Model]
87 |
88 | # Save the model estimator using pickle to this file
89 | estimator_file = %(SITE_ROOT)s/%(station_id)s_mosx.pkl
90 |
91 | # The base scikit-learn regressor
92 | regressor = ensemble.RandomForestRegressor
93 |
94 | # If True, trains a separate estimator for each weather parameter
95 | train_individual = True
96 |
97 | # If True, also predict a meteorology time series for the next day, in addition to high/low/wind/rain
98 | predict_timeseries = False
99 |
100 | # This is the type of rain forecast. 'quantity' predicts an actual amount of rain in inches, 'categorical'
101 | # predicts a probabilistic category of rain (a la MOS), and 'pop' (probability of precipitation) predicts the
102 | # fractional chance that there will be ANY measurable precipitation.
103 | rain_forecast_type = pop
104 |
105 | # Keyword arguments passed to the base regressor
106 | [[Parameters]]
107 | n_estimators = 1000
108 | max_features = 0.75
109 | n_jobs = 2
110 | verbose = 1
111 |
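# As a sketch of what the two settings above mean in Python: the 'regressor' string names the
# scikit-learn class, and the [[Parameters]] block is passed to it as keyword arguments, i.e.
# sklearn.ensemble.RandomForestRegressor(n_estimators=1000, max_features=0.75, n_jobs=2, verbose=1)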
112 | # This section, if present, enables a post-processing algorithm to be trained on the raw rain predictions from a
113 | # native ensemble regressor. This is usually desirable because rain has a very non-normal distribution and is
114 | # therefore tricky for standard algorithms to predict. The parameter 'rain_estimator' is a string, just like
115 | # 'regressor' above, which determines the scikit-learn algorithm to use (classifiers are an option!). The parameter
116 | # 'use_raw_rain' determines whether the raw rainfall estimates from the BUFR models are also used as features for
117 | # the rain post-processor, in addition to the model ensemble statistics. Any other parameters provided here are
118 | # passed as kwargs to the initialization of the processor's scikit-learn algorithm. Due to certain methods
119 | # used in Bootstrapping, use_raw_rain is currently not compatible with the Bootstrapping option below.
120 | # [[Rain tuning]]
121 | # rain_estimator = ensemble.RandomForestRegressor
122 | # use_raw_rain = False
123 | # n_estimators = 100
124 |
125 | # Ada boosting may improve the performance of a model by selectively increasing the training weight of samples
126 | # with a large training error. If Ada boosting is desired, provide here any parameters passed to the Ada class;
127 | # otherwise, remove or comment out this subsection.
128 | # [[Ada boosting]]
129 | # n_estimators = 50
130 |
131 | # This last option allows for bootstrapping development of an ensemble of ML models. The training set is split
132 | # according to a few options here, then an ensemble of n_members is generated by training on individual splits.
133 | # Comment out this section to disable bootstrapping.
134 | # [[Bootstrapping]]
135 | # n_members = 10
136 | # # The number of training samples per split (if int), or the fraction (if float)
137 | # n_samples_split = 0.1
138 | # # If 1, each split contains no sample present in any other split. Also overrides n_samples_split and sets it
139 | # # as the maximum available per split. Otherwise, set to 0.
140 | # unique_splits = 0
141 |
142 |
143 | ########################################################################################################################
144 | # There are a few parameters here for validation of the model.
145 |
146 | [Validate]
147 |
148 | # Start and end dates for the validation
149 | start_date = 20161120
150 | end_date = 20170409
151 |
152 |
153 | ########################################################################################################################
154 | # The run executable allows for an upload to an FTP or SFTP server, for example, to post data to a website. The forecast
155 | # data are aggregated over all runs and uploaded as one or more of the file types listed below, and if a plot
156 | # of forecast ensemble distributions is requested, that plot is also uploaded.
157 |
158 | [Upload]
159 |
160 | # Type of file to upload forecasts in. Can be 'pickle' (one file for all forecasts), 'json' (one file per forecast
161 | # date), 'uw_text' (one file per forecast date, only a short version with high/pop), or a list of several.
162 | file_type = json
163 |
164 | # User name, server, and directory on server. Prompts for password, or set an ssh key. The forecast_directory
165 | # points to where the forecasts are uploaded, while the plot_directory points to where plots are uploaded. If
166 | # 'username' and 'server' are both empty, then assumes local directories are specified.
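# For example (all values hypothetical):
# username = forecaster
# server = sftp.example.com
# forecast_directory = /var/www/forecasts
# plot_directory = /var/www/plots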
167 | username =
168 | server =
169 | forecast_directory =
170 | plot_directory =
171 |
--------------------------------------------------------------------------------
/mosx/__init__.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright (c) 2018 Jonathan Weyn
3 | #
4 | # See the file LICENSE for your rights.
5 | #
6 |
7 | """
8 | MOS-X python package. Tools for building, running, and validating a machine learning model for predicting tomorrow's
9 | weather. 'MOS-X' represents an intelligent improvement to the traditional Model Output Statistics of the National
10 | Weather Service and other meteorological institutions.
11 |
12 | ============
13 | Version history
14 | ------
15 | 2017-10-26: Version 0.0
16 | 2017-11-03: Version 0.2
17 |     -- added upper-air sounding data from metpy
18 | 2017-11-07: Version 0.3
19 |     -- fixed rain not counted in cumulative values from skipped intervals in bufkit profiles
20 |     -- fixed issue with end year when season overlaps to new year
21 | 2017-11-22: Version 0.3.1
22 |     -- added optional directory to save bufkit files
23 |     -- added saving of sounding files
24 |     -- added options for random forest object
25 |     -- edit return_precip option in mosx_predictors to handle multiple days
26 |     -- add return_dates option to mosx_predictors
27 | 2017-12-06: Version 0.4
28 |     -- added bufkit model predictions of high, low, and max wind
29 |     -- added option to select neural network regressor
30 |     -- changed verification to ONLY use CF6/climo wind if available
31 |     -- added option to not use existing sounding files in mosx_obs
32 |     -- change upper_air to not save nan soundings
33 |     -- improved handling of missing verification values with missing cf6 wind
34 |     -- added options to ignore cf6 and/or climo in mosx_verif
35 | 2018-01-16: Version 0.5
36 |     -- added training design to individually train a forest for each weather parameter
37 |     -- added the option to use any scikit-learn regressor
38 |     -- added use_soundings option to optionally omit sounding data
39 |     -- changed pressure levels to 925, 850, 750, 600
40 |     -- pondered adding model times 18Z and 00Z before forecast start time
41 |     -- changed obs dt to 3 hrs
42 |     -- changed bufkit surface dt to 3 hrs
43 |     -- added methodology for generating diagnostic variables from BUFR
44 |     -- added 'temperature advection index' using above
45 | 2018-01-18: Version 0.6
46 |     -- added ability to forecast hourly time series
47 |     -- added class for time series estimator
48 |     -- fixed bug where last 6 hours were not included in obs verifications at the end of a season
49 |     -- improved handling of array conversion of obs data by requiring pandas to export OrderedDict and
50 |        using get_array commands
51 |     -- fixed a bug where daily verification would fail when verification is unavailable for a day
52 | 2018-03-06: Version 0.6.1
53 |     -- fixed critical bug where obs dataframe was not sorted when running for only one day
54 | 2018-03-12: Version 0.6.2
55 |     -- added support for the Ada Boosting estimator wrapper
56 | 2018-03-14: Version 0.7.0
57 |     -- completely changed the structure of the project to be modular
58 | 2018-03-20: Version 0.7.1
59 |     -- fixed a bug that would result in an extra set of API dates when is_season is False
60 |     -- fixed some issues related to the use of the config file
61 | 2018-03-21: Version 0.8.0
62 |     -- added scorers
63 |     -- added learning curves in performance metrics; more to come
64 | 2018-03-27: Version 0.8.1
65 |     -- fixed an error in util.generate_dates that
failed to produce all the dates when is_season is False 66 | 2018-04-10: Version 0.9.0 67 | -- better implementation of base estimator attributes in TimeSeriesEstimator and RainTuningEstimator classes 68 | -- added submodule 'predict' for unified predictions 69 | -- improved handling of the raw precipitation values to allow them to be used in rain tuning 70 | -- added config option for the starting hour of a forecast day 71 | -- added config option to predict probability of precipitation rather than quantity 72 | -- added automatic fetching of CF6 files 73 | -- added automatic retrieval of climo_station_id (removed from default.config) 74 | -- added time_series_interval parameter to output coarser time series 75 | -- added option to disable climo/cf6 wind retrieval, for non-WxChallenge purposes 76 | 2018-05-03: Version 0.10.0 77 | -- moved special estimator classes to mosx.estimators 78 | -- added bootstrapping training estimator 79 | -- added ability to select the type of estimator for rain tuning 80 | -- re-organized the 'train' and 'predict' modules to 'model' 81 | 2018-06-11: Version 0.10.2 82 | -- fixed an error in obs retrieval that retrieved one data point too many 83 | -- added many plots to 'performance' 84 | 2018-10-22: Version 0.10.3 85 | -- added option to write more than one file type 86 | 87 | """ 88 | 89 | import bufr 90 | import estimators 91 | import obs 92 | import model 93 | import util 94 | import verification 95 | 96 | __version__ = '0.10.3' 97 | -------------------------------------------------------------------------------- /mosx/bufr/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing BUFR data. 9 | """ 10 | 11 | from .methods import * 12 | -------------------------------------------------------------------------------- /mosx/bufr/methods.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing BUFR data. 9 | """ 10 | 11 | from mosx.util import generate_dates, get_array 12 | from collections import OrderedDict 13 | import re 14 | import os 15 | import pickle 16 | import numpy as np 17 | from datetime import datetime, timedelta 18 | from scipy import interpolate 19 | 20 | 21 | def bufkit_parser_time_height(config, file_name, interval=1, start_dt=None, end_dt=None): 22 | """ 23 | By Luke Madaus. Modified by jweyn. 24 | Returns a dictionary of time-height profiles from a BUFKIT file, with profiles interpolated to a basic set of 25 | pressure levels. 
26 | 27 | :param config: 28 | :param file_name: str: full path to bufkit file name 29 | :param interval: int: process data every 'interval' hours 30 | :param start_dt: datetime: starting time for data processing 31 | :param end_dt: datetime: ending time for data processing 32 | :return: dict: ugly dictionary of processed values 33 | """ 34 | # Open the file 35 | infile = open(file_name, 'r') 36 | 37 | profile = OrderedDict() 38 | 39 | # Find the block that contains the description of what everything is (header information) 40 | block_lines = [] 41 | inblock = False 42 | block_found = False 43 | for line in infile: 44 | if line.startswith('PRES TMPC') and not block_found: 45 | # We've found the line that starts the header info 46 | inblock = True 47 | block_lines.append(line) 48 | elif inblock: 49 | # Keep appending lines until we start hitting numbers 50 | if re.match('^\d{3}|^\d{4}', line): 51 | inblock = False 52 | block_found = True 53 | else: 54 | block_lines.append(line) 55 | 56 | # Now compute the remaining number of variables 57 | re_string = '' 58 | for line in block_lines: 59 | dum_num = len(line.split()) 60 | for n in range(dum_num): 61 | re_string = re_string + '(-?\d{1,5}.\d{2}) ' 62 | re_string = re_string[:-1] # Get rid of the trailing space 63 | re_string = re_string + '\r\n' 64 | 65 | # Compile this re_string for more efficient re searches 66 | block_expr = re.compile(re_string) 67 | 68 | # Now get corresponding indices of the variables we need 69 | full_line = '' 70 | for r in block_lines: 71 | full_line = full_line + r[:-2] + ' ' 72 | # Now split it 73 | varlist = re.split('[ /]', full_line) 74 | # Get rid of trailing space 75 | varlist = varlist[:-1] 76 | 77 | # Variables we want 78 | vars_desired = ['TMPC', 'DWPC', 'UWND', 'VWND', 'HGHT'] 79 | 80 | # Pressure levels to interpolate to 81 | plevs = [600, 750, 850, 925] 82 | plevs = [p for p in plevs if p <= float(config['lowest_p_level'])] 83 | 84 | # We now need to break everything up into a chunk for each 85 | # forecast date and time 86 | with open(file_name) as infile: 87 | blocks = infile.read().split('STID') 88 | for block in blocks: 89 | interp_plevs = [] 90 | header = block 91 | if header.split()[0] != '=': 92 | continue 93 | fcst_time = re.search('TIME = (\d{6}/\d{4})', header).groups()[0] 94 | fcst_dt = datetime.strptime(fcst_time, '%y%m%d/%H%M') 95 | if start_dt is not None and fcst_dt < start_dt: 96 | continue 97 | if end_dt is not None and fcst_dt > end_dt: 98 | break 99 | if fcst_dt.hour % interval != 0: 100 | continue 101 | temp_vars = OrderedDict() 102 | for var in varlist: 103 | temp_vars[var] = [] 104 | temp_vars['PRES'] = [] 105 | for block_match in block_expr.finditer(block): 106 | vals = block_match.groups() 107 | for val, name in zip(vals, varlist): 108 | if float(val) == -9999.: 109 | temp_vars[name].append(np.nan) 110 | else: 111 | temp_vars[name].append(float(val)) 112 | 113 | # Unfortunately, bufkit values aren't always uniformly distributed. 
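# For example, a profile may report 1000, 975, 962, ... hPa at one forecast hour and a
# slightly different set at the next, so each desired variable is linearly interpolated
# (scipy interp1d) onto the fixed 'plevs' list defined above.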
114 | final_vars = OrderedDict() 115 | cur_plevs = temp_vars['PRES'] 116 | cur_plevs.reverse() 117 | for var in varlist[1:]: 118 | if var in (vars_desired + ['SKNT', 'DRCT']): 119 | values = temp_vars[var] 120 | values.reverse() 121 | interp_plevs = list(plevs) 122 | num_plevs = len(interp_plevs) 123 | f = interpolate.interp1d(cur_plevs, values, bounds_error=False) 124 | interp_vals = f(interp_plevs) 125 | interp_array = np.full((len(plevs)), np.nan) 126 | # Array almost certainly missing values at high pressures 127 | interp_array[:num_plevs] = interp_vals 128 | interp_vals = list(interp_array) 129 | interp_plevs = list(plevs) # use original array 130 | interp_vals.reverse() 131 | interp_plevs.reverse() 132 | if var == 'SKNT': 133 | wspd = np.array(interp_vals) 134 | if var == 'DRCT': 135 | wdir = np.array(interp_vals) 136 | if var in vars_desired: 137 | final_vars[var] = interp_vals 138 | final_vars['PRES'] = interp_plevs 139 | if 'UWND' not in final_vars.keys(): 140 | final_vars['UWND'] = list(wspd * np.sin(wdir * np.pi/180. - np.pi)) 141 | if 'VWND' not in final_vars.keys(): 142 | final_vars['VWND'] = list(wspd * np.cos(wdir * np.pi/180. - np.pi)) 143 | profile[fcst_dt] = final_vars 144 | 145 | return profile 146 | 147 | 148 | def bufkit_parser_surface(file_name, interval=1, start_dt=None, end_dt=None): 149 | """ 150 | By Luke Madaus. Modified by jweyn. 151 | Returns a dictionary of surface data from a BUFKIT file. 152 | 153 | :param file_name: str: full path to bufkit file name 154 | :param interval: int: process data every 'interval' hours 155 | :param start_dt: datetime: starting time for data processing 156 | :param end_dt: datetime: ending time for data processing 157 | :return: dict: ugly dictionary of processed values 158 | """ 159 | # Load the file 160 | infile = open(file_name, 'r') 161 | sfc_dict = OrderedDict() 162 | 163 | block_lines = [] 164 | inblock = False 165 | for line in infile: 166 | if re.search('SELV', line): 167 | try: # jweyn 168 | elev = re.search('SELV = -?(\d{1,4})', line).groups()[0] # jweyn: -? 
169 |                 elev = float(elev)
170 |             except:
171 |                 elev = 0.0
172 |         if line.startswith('STN YY'):
173 |             # We've found the line that starts the header info
174 |             inblock = True
175 |             block_lines.append(line)
176 |         elif inblock:
177 |             # Keep appending lines until we start hitting numbers
178 |             if re.search('\d{6}', line):
179 |                 inblock = False
180 |             else:
181 |                 block_lines.append(line)
182 |
183 |     # Build an re search pattern based on this
184 |     # We know the first two parts of the section are station id num and date
185 |     re_string = "(\d{6}|\w{4}) (\d{6})/(\d{4})"
186 |     # Now compute the remaining number of variables
187 |     dum_num = len(block_lines[0].split()) - 2
188 |     for n in range(dum_num):
189 |         re_string = re_string + " (-?\d{1,4}.\d{2})"
190 |     re_string = re_string + '\r\n'
191 |     for line in block_lines[1:]:
192 |         dum_num = len(line.split())
193 |         for n in range(dum_num):
194 |             re_string = re_string + '(-?\d{1,4}.\d{2}) '
195 |         re_string = re_string[:-1]  # Get rid of the trailing space
196 |         re_string = re_string + '\r\n'
197 |
198 |     # Compile this re_string for more efficient re searches
199 |     block_expr = re.compile(re_string)
200 |
201 |     # Now get corresponding indices of the variables we need
202 |     full_line = ''
203 |     for r in block_lines:
204 |         full_line = full_line + r[:-2] + ' '
205 |     # Now split it
206 |     varlist = re.split('[ /]', full_line)
207 |
208 |     with open(file_name) as infile:
209 |         # Now loop through all blocks that match the search pattern we defined above
210 |         blocknum = -1
211 |         # For rain total in missing times
212 |         temp_rain = 0.
213 |         # For max temp, min temp, max wind, total rain
214 |         t_max = -150.
215 |         t_min = 150.
216 |         w_max = 0.
217 |         r_total = 0.
218 |         for block_match in block_expr.finditer(infile.read()):
219 |             blocknum += 1
220 |             # Split out the match into each component number
221 |             vals = block_match.groups()
222 |             # Check for missing values: vals is a tuple of strings from re.groups(), so
223 |             # rebuild it as a list, flagging the -9999. missing-value sentinel as NaN
224 |             vals = [np.nan if v.startswith('-9999') else v for v in vals]
225 |
226 |             # Set the time
227 |             dt = '20' + vals[varlist.index('YYMMDD')] + vals[varlist.index('HHMM')]
228 |             validtime = datetime.strptime(dt, '%Y%m%d%H%M')
229 |
230 |             # Check that time is within the period we want
231 |             if start_dt is not None and validtime < start_dt:
232 |                 continue
233 |             if end_dt is not None and validtime > end_dt:
234 |                 break
235 |
236 |             # Check for max daily values!
237 |             t_max = max(t_max, float(vals[varlist.index('T2MS')]))
238 |             t_min = min(t_min, float(vals[varlist.index('T2MS')]))
239 |             uwind = float(vals[varlist.index('UWND')])
240 |             vwind = float(vals[varlist.index('VWND')])
241 |             wspd = np.sqrt(uwind ** 2 + vwind ** 2)
242 |             w_max = max(w_max, wspd)
243 |
244 |             if validtime.hour % interval != 0:
245 |                 # Still need to get cumulative precipitation!
246 | try: 247 | temp_precip = float(vals[varlist.index('P01M')]) 248 | except: 249 | temp_precip = float(vals[varlist.index('P03M')]) 250 | if np.isnan(temp_precip): 251 | temp_precip = 0.0 252 | temp_rain += temp_precip 253 | continue 254 | 255 | sfc_dict[validtime] = OrderedDict() 256 | sfc_dict[validtime]['WSPD'] = wspd 257 | sfc_dict[validtime]['UWND'] = uwind 258 | sfc_dict[validtime]['VWND'] = vwind 259 | sfc_dict[validtime]['PRES'] = float(vals[varlist.index('PRES')]) 260 | sfc_dict[validtime]['TMPC'] = float(vals[varlist.index('T2MS')]) 261 | sfc_dict[validtime]['DWPC'] = float(vals[varlist.index('TD2M')]) 262 | sfc_dict[validtime]['HCLD'] = float(vals[varlist.index('HCLD')]) 263 | sfc_dict[validtime]['MCLD'] = float(vals[varlist.index('MCLD')]) 264 | sfc_dict[validtime]['LCLD'] = float(vals[varlist.index('LCLD')]) 265 | # Could be 3 hour or 1 hour precip 266 | try: 267 | precip = float(vals[varlist.index('P01M')]) 268 | except: 269 | precip = float(vals[varlist.index('P03M')]) 270 | # Make sure precip is not nan 271 | if np.isnan(precip): 272 | precip = 0.0 273 | # Add the temporary value from uneven intervals 274 | precip += temp_rain 275 | # Also do cumulative sum in r_total 276 | previous_time = validtime - timedelta(hours=interval) 277 | if previous_time in sfc_dict.keys(): 278 | sfc_dict[validtime]['PRCP'] = precip 279 | r_total += precip 280 | else: 281 | # We want zero at first time: precip in LAST hour or three 282 | sfc_dict[validtime]['PRCP'] = 0.0 283 | # Reset temp_rain after having added it 284 | temp_rain = 0. 285 | 286 | daily = [t_max, t_min, w_max, r_total] 287 | return sfc_dict, daily 288 | 289 | 290 | def bufr_retrieve(bufr, bufarg): 291 | """ 292 | Call bufrgruven to retrieve BUFR files. 293 | 294 | :param bufr: str: bufrgruven executable path 295 | :param bufarg: dict: dictionary of arguments passed to bufrgruven 296 | :return: 297 | """ 298 | argstring = '' 299 | for key, value in bufarg.items(): 300 | argstring += ' --%s %s' % (key, value) 301 | result = os.system('%s %s' % (bufr, argstring)) 302 | return result 303 | 304 | 305 | def bufr(config, output_file=None, cycle='18'): 306 | """ 307 | Generates model data from BUFKIT profiles and saves to a file, which can later be retrieved for either training 308 | data or model run data. 309 | 310 | :param config: 311 | :param output_file: str: output file path 312 | :param cycle: str: model cycle (init hour) 313 | :return: 314 | """ 315 | bufr_station_id = config['BUFR']['bufr_station_id'] 316 | # Base arguments dictionary. dset and date will be modified iteratively. 
317 | bufarg = { 318 | 'dset': '', 319 | 'date': '', 320 | 'cycle': cycle, 321 | 'stations': bufr_station_id.lower(), 322 | 'noascii': '', 323 | 'nozipit': '', 324 | 'noverbose': '', 325 | 'prepend': '' 326 | } 327 | if config['verbose']: 328 | print('\n') 329 | bufr_default_dir = '%s/metdat/bufkit' % config['BUFR_ROOT'] 330 | bufr_data_dir = config['BUFR']['bufr_data_dir'] 331 | if not(os.path.isdir(bufr_data_dir)): 332 | os.makedirs(bufr_data_dir) 333 | bufrgruven = config['BUFR']['bufrgruven'] 334 | if config['verbose']: 335 | print('bufr: using BUFKIT files in %s' % bufr_data_dir) 336 | bufr_format = '%s/%s%s.%s_%s.buf' 337 | missing_dates = [] 338 | models = config['BUFR']['bufr_models'] 339 | model_names = config['BUFR']['models'] 340 | start_date = datetime.strptime(config['data_start_date'], '%Y%m%d') - timedelta(days=1) 341 | end_date = datetime.strptime(config['data_end_date'], '%Y%m%d') - timedelta(days=1) 342 | dates = generate_dates(config, start_date=start_date, end_date=end_date) 343 | for date in dates: 344 | bufarg['date'] = datetime.strftime(date, '%Y%m%d') 345 | if date.year < 2010: 346 | if config['verbose']: 347 | print('bufr: skipping BUFR data for %s; data starts in 2010.' % bufarg['date']) 348 | continue 349 | if config['verbose']: 350 | print('bufr: date: %s' % bufarg['date']) 351 | 352 | for m in range(len(models)): 353 | if config['verbose']: 354 | print('bufr: trying to retrieve BUFR data for %s...' % model_names[m]) 355 | bufr_new_name = bufr_format % (bufr_data_dir, bufarg['date'], '%02d' % int(bufarg['cycle']), 356 | model_names[m], bufarg['stations']) 357 | if os.path.isfile(bufr_new_name): 358 | if config['verbose']: 359 | print('bufr: file %s already exists; skipping!' % bufr_new_name) 360 | break 361 | 362 | if type(models[m]) == list: 363 | for model in models[m]: 364 | try: 365 | bufarg['dset'] = model 366 | bufr_retrieve(bufrgruven, bufarg) 367 | bufr_name = bufr_format % (bufr_default_dir, bufarg['date'], '%02d' % int(bufarg['cycle']), 368 | model, bufarg['stations']) 369 | bufr_file = open(bufr_name) 370 | bufr_file.close() 371 | os.rename(bufr_name, bufr_new_name) 372 | if config['verbose']: 373 | print('bufr: BUFR file found for %s at date %s.' % (model, bufarg['date'])) 374 | print('bufr: writing BUFR file: %s' % bufr_new_name) 375 | break 376 | except: 377 | if config['verbose']: 378 | print('bufr: BUFR file for %s at date %s not retrieved.' % (model, bufarg['date'])) 379 | else: 380 | try: 381 | model = models[m] 382 | bufarg['dset'] = model 383 | bufr_retrieve(bufrgruven, bufarg) 384 | bufr_name = bufr_format % (bufr_default_dir, bufarg['date'], '%02d' % int(bufarg['cycle']), 385 | bufarg['dset'], bufarg['stations']) 386 | bufr_file = open(bufr_name) 387 | bufr_file.close() 388 | os.rename(bufr_name, bufr_new_name) 389 | if config['verbose']: 390 | print('bufr: BUFR file found for %s at date %s.' % (model, bufarg['date'])) 391 | print('bufr: writing BUFR file: %s' % bufr_new_name) 392 | except: 393 | if config['verbose']: 394 | print('bufr: BUFR file for %s at date %s not retrieved.' 
% (model, bufarg['date'])) 395 | if not (os.path.isfile(bufr_new_name)): 396 | print('bufr: warning: no BUFR file found for model %s at date %s' % ( 397 | model_names[m], bufarg['date'])) 398 | missing_dates.append((date, model_names[m])) 399 | 400 | # Process data 401 | print('\n') 402 | bufr_dict = OrderedDict({'PROF': OrderedDict(), 'SFC': OrderedDict(), 'DAY': OrderedDict()}) 403 | for model in model_names: 404 | bufr_dict['PROF'][model] = OrderedDict() 405 | bufr_dict['SFC'][model] = OrderedDict() 406 | bufr_dict['DAY'][model] = OrderedDict() 407 | 408 | for date in dates: 409 | date_str = datetime.strftime(date, '%Y%m%d') 410 | verif_date = date + timedelta(days=1) 411 | start_dt = verif_date + timedelta(hours=config['forecast_hour_start']) 412 | end_dt = verif_date + timedelta(hours=config['forecast_hour_start'] + 24) 413 | for model in model_names: 414 | if (date, model) in missing_dates: 415 | if config['verbose']: 416 | print('bufr: skipping %s data for %s; file missing.' % (model, date_str)) 417 | continue 418 | if config['verbose']: 419 | print('bufr: processing %s data for %s' % (model, date_str)) 420 | bufr_name = bufr_format % (bufr_data_dir, date_str, '%02d' % int(bufarg['cycle']), model, 421 | bufarg['stations']) 422 | if not (os.path.isfile(bufr_name)): 423 | if config['verbose']: 424 | print('bufr: skipping %s data for %s; file missing.' % (model, date_str)) 425 | continue 426 | profile = bufkit_parser_time_height(config, bufr_name, 6, start_dt, end_dt) 427 | sfc, daily = bufkit_parser_surface(bufr_name, 3, start_dt, end_dt) 428 | # Drop 'PRES' variable which is useless 429 | for key, values in profile.items(): 430 | values.pop('PRES', None) 431 | profile[key] = values 432 | bufr_dict['PROF'][model][verif_date] = profile 433 | bufr_dict['SFC'][model][verif_date] = sfc 434 | bufr_dict['DAY'][model][verif_date] = daily 435 | 436 | # Export data 437 | if output_file is None: 438 | output_file = '%s/%s_bufr.pkl' % (config['SITE_ROOT'], config['station_id']) 439 | if config['verbose']: 440 | print('bufr: -> exporting to %s' % output_file) 441 | with open(output_file, 'wb') as handle: 442 | pickle.dump(bufr_dict, handle, protocol=pickle.HIGHEST_PROTOCOL) 443 | 444 | return 445 | 446 | 447 | def process(config, bufr, advection_diagnostic=True): 448 | """ 449 | Imports the data contained in a bufr dictionary and returns a time-by-x numpy array for use in mosx_predictors. The 450 | first dimension is date; all other dimensions are first extracted using get_array and then one-dimensionalized. 
451 | 452 | :param config: 453 | :param bufr: dict: dictionary of processed BUFR data 454 | :param advection_diagnostic: bool: if True, add temperature advection diagnostic to the data 455 | :return: ndarray: array of formatted BUFR predictor values 456 | """ 457 | if config['verbose']: 458 | print('bufr.process: processing array for BUFR data...') 459 | # PROF part of the BUFR data 460 | bufr_prof = get_array(bufr['PROF']) 461 | bufr_dims = list(range(len(bufr_prof.shape))) 462 | bufr_dims[0] = 1 463 | bufr_dims[1] = 0 464 | bufr_prof = bufr_prof.transpose(bufr_dims) 465 | bufr_shape = bufr_prof.shape 466 | bufr_reshape = [bufr_shape[0]] + [np.cumprod(bufr_shape[1:])[-1]] 467 | bufr_prof = bufr_prof.reshape(tuple(bufr_reshape)) 468 | # SFC part of the BUFR data 469 | bufr_sfc = get_array(bufr['SFC']) 470 | bufr_dims = list(range(len(bufr_sfc.shape))) 471 | bufr_dims[0] = 1 472 | bufr_dims[1] = 0 473 | bufr_sfc = bufr_sfc.transpose(bufr_dims) 474 | bufr_shape = bufr_sfc.shape 475 | bufr_reshape = [bufr_shape[0]] + [np.cumprod(bufr_shape[1:])[-1]] 476 | bufr_sfc = bufr_sfc.reshape(tuple(bufr_reshape)) 477 | # DAY part of the BUFR data 478 | bufr_day = get_array(bufr['DAY']) 479 | bufr_dims = list(range(len(bufr_day.shape))) 480 | bufr_dims[0] = 1 481 | bufr_dims[1] = 0 482 | bufr_day = bufr_day.transpose(bufr_dims) 483 | bufr_shape = bufr_day.shape 484 | bufr_reshape = [bufr_shape[0]] + [np.cumprod(bufr_shape[1:])[-1]] 485 | bufr_day = bufr_day.reshape(tuple(bufr_reshape)) 486 | bufr_out = np.concatenate((bufr_prof, bufr_sfc, bufr_day), axis=1) 487 | # Fix missing values 488 | bufr_out[bufr_out < -1000.] = np.nan 489 | if advection_diagnostic: 490 | advection_array = temp_advection(bufr) 491 | bufr_out = np.concatenate((bufr_out, advection_array), axis=1) 492 | return bufr_out 493 | 494 | 495 | def temp_advection(bufr): 496 | """ 497 | Produces an array of temperature advection diagnostic from the bufr dictionary. The diagnostic is a simple 498 | calculation of the strength of backing or veering of winds with height, based on thermal wind approximations, 499 | between the lowest two profile levels retained in the bufr dictionary. Searches for UWND and VWND keys. 500 | IMPORTANT: expects the keys of each model to match dates; i.e., expects bufr output from find_matching_dates. This 501 | is to ensure num_samples is correct. 502 | 503 | :param bufr: dict: dictionary of processed BUFR data 504 | :return: advection_array: array of num_samples-by-num_features of advection diagnostic 505 | """ 506 | bufr_prof = bufr['PROF'] 507 | models = bufr_prof.keys() 508 | num_models = len(models) 509 | dates = bufr_prof[bufr_prof.keys()[0]].keys() 510 | num_dates = len(dates) 511 | num_times = len(bufr_prof[bufr_prof.keys()[0]][dates[0]].keys()) 512 | num_features = num_models * num_times 513 | 514 | advection_array = np.zeros((num_dates, num_features)) 515 | 516 | def advection_index(V1, V2): 517 | """ 518 | The advection index measures the strength of veering/backing of wind. 519 | :param V1: array wind vector at lower model level 520 | :param V2: array wind vector at higher model level 521 | :return: index of projection of (V2 - V1) onto V1 522 | """ 523 | proj = V2 - np.dot(V1, V2) * V1 / np.linalg.norm(V1) 524 | diff = V1 - V2 525 | sign = np.sign(np.arctan2(diff[1], diff[0])) 526 | return sign * np.linalg.norm(proj) 527 | 528 | # Here comes the giant ugly loop. 
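# Each sample is one forecast date; for each model and each forecast hour within that
# date, the advection index from the lowest two retained profile levels becomes one feature,
# giving num_models * num_times features per sample.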
529 |     sample = 0
530 |     for date in dates:
531 |         feature = 0
532 |         for model in models:
533 |             for eval_date in bufr_prof[model][date].keys():
534 |                 u = bufr_prof[model][date][eval_date]['UWND']
535 |                 v = bufr_prof[model][date][eval_date]['VWND']
536 |                 try:
537 |                     V1 = np.array([u[0], v[0]])
538 |                     V2 = np.array([u[1], v[1]])
539 |                 except IndexError:
540 |                     print('Not enough wind levels available for advection calculation; omitting...')
541 |                     return
542 |                 advection_array[sample, feature] = advection_index(V1, V2)
543 |                 feature += 1
544 |         sample += 1
545 |
546 |     return advection_array
547 |
--------------------------------------------------------------------------------
/mosx/configspec:
--------------------------------------------------------------------------------
1 | MOSX_ROOT = string()
2 | BUFR_ROOT = string()
3 | SITE_ROOT = string(default='')
4 | station_id = string(min=4, max=4)
5 | climo_station_id = string(min=11, max=11, default='')
6 | lowest_p_level = float()
7 | data_start_date = string(min=8, max=8)
8 | data_end_date = string(min=8, max=8)
9 | is_season = boolean(default=False)
10 | forecast_hour_start = integer(0, 24, default=6)
11 | time_series_interval = integer(1, 6, default=3)
12 | meso_token = string()
13 | verbose = boolean(default=False)
14 | [BUFR]
15 | bufr_station_id = string(min=3, max=4, default='')
16 | bufr_data_dir = string(default='')
17 | bufrgruven = string(default='')
18 | models = force_list()
19 | [Obs]
20 | use_soundings = boolean(default=False)
21 | sounding_station_id = string(min=3, max=3)
22 | sounding_data_dir = string(default='')
23 | use_climo_wind = boolean(default=True)
24 | [Model]
25 | estimator_file = string()
26 | regressor = string(default='ensemble.RandomForestRegressor')
27 | train_individual = boolean(default=False)
28 | predict_timeseries = boolean(default=False)
29 | rain_forecast_type = option('pop', 'categorical', 'quantity', default='pop')
30 | [[Parameters]]
31 | [Validate]
32 | start_date = string(min=8, max=8)
33 | end_date = string(min=8, max=8)
34 | [Upload]
35 | file_type = force_list(default=list('pickle'))
36 | username = string(default='')
37 | server = string(default='')
38 | forecast_directory = string(default='~')
39 | plot_directory = string(default='~')
--------------------------------------------------------------------------------
/mosx/estimators.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright (c) 2018 Jonathan Weyn
3 | #
4 | # See the file LICENSE for your rights.
5 | #
6 |
7 | """
8 | Special classes containing MOS-X-specific scikit-learn estimators. These estimators are designed to perform specialized
9 | functions such as adding time series prediction, creating sub-estimators to tune the rain prediction, and so on. They
10 | are NOT fully compatible with scikit-learn because it is impossible to link all of the sklearn attributes of the base
11 | estimators to these meta-estimators. It might be possible, in the future, to make these meta-classes inherit from
12 | the base sklearn estimator class.
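A minimal usage sketch (hypothetical, not part of the original module: assumes predictor and
target arrays exist, with the four daily targets high/low/wind/rain in columns 0-3 so that the
rain post-processor can read column 3):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.multioutput import MultiOutputRegressor

    base = MultiOutputRegressor(RandomForestRegressor(n_estimators=100))
    tuned = RainTuningEstimator(base, rain_estimator='ensemble.RandomForestRegressor',
                                n_estimators=100)
    tuned.fit(p_train, t_train)
    predicted = tuned.predict(p_test)  # column 3 is replaced by the tuned rain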
13 | """ 14 | 15 | import numpy as np 16 | from sklearn.preprocessing import LabelBinarizer 17 | from sklearn.multioutput import MultiOutputRegressor 18 | from sklearn.pipeline import Pipeline 19 | from sklearn.model_selection import KFold, train_test_split 20 | from copy import deepcopy 21 | from .util import get_object 22 | 23 | 24 | class TimeSeriesEstimator(object): 25 | """ 26 | Wrapper class for containing separately-trained daily and timeseries estimators. 27 | """ 28 | def __init__(self, daily_estimator, timeseries_estimator): 29 | self.daily_estimator = daily_estimator 30 | self.timeseries_estimator = timeseries_estimator 31 | # Inherit attributes from the daily estimator by default. 32 | # Apparently only 'steps' and 'memory' are in __dict__ for a Pipeline. BS. 33 | for attr in self.daily_estimator.__dict__.keys(): 34 | try: 35 | setattr(self, attr, getattr(self.daily_estimator, attr)) 36 | except AttributeError: 37 | pass 38 | # Apparently still have to do this 39 | if isinstance(self.daily_estimator, Pipeline): 40 | self.named_steps = self.daily_estimator.named_steps 41 | else: 42 | self.named_steps = None 43 | try: # Likely only works if model has been fitted 44 | self.estimators_ = self.daily_estimator.estimators_ 45 | except AttributeError: 46 | self.estimators_ = None 47 | self.array_form = True 48 | if not hasattr(self, 'verbose'): 49 | self.verbose = 1 50 | 51 | def fit(self, predictor_array, verification_array, **kwargs): 52 | """ 53 | Fit both the daily and the timeseries estimators. 54 | 55 | :param predictor_array: ndarray-like: predictor features 56 | :param verification_array: ndarray-like: truth values 57 | :param kwargs: kwargs passed to fit methods 58 | :return: 59 | """ 60 | if self.verbose > 0: 61 | print('TimeSeriesEstimator: fitting DAILY estimator') 62 | self.daily_estimator.fit(predictor_array, verification_array[:, :4], **kwargs) 63 | if self.verbose > 0: 64 | print('TimeSeriesEstimator: fitting TIMESERIES estimator') 65 | self.timeseries_estimator.fit(predictor_array, verification_array[:, 4:], **kwargs) 66 | 67 | def predict(self, predictor_array, **kwargs): 68 | """ 69 | Predict from both the daily and timeseries estimators. Returns an array if self.array_form is True, 70 | otherwise returns a dictionary (not implemented yet). 71 | 72 | :param predictor_array: num_samples x num_features 73 | :param kwargs: kwargs passed to predict methods 74 | :return: array or dictionary of predicted values 75 | """ 76 | daily = self.daily_estimator.predict(predictor_array, **kwargs) 77 | timeseries = self.timeseries_estimator.predict(predictor_array, **kwargs) 78 | if self.array_form: 79 | return np.concatenate((daily, timeseries), axis=1) 80 | 81 | 82 | class RainTuningEstimator(object): 83 | """ 84 | This class extends an estimator to include a separately-trained post-processing random forest for the daily rainfall 85 | prediction. Standard algorithms generally do a poor job of predicting a variable that has such a non-normal 86 | probability distribution as daily rainfall (which is dominated by 0s). 87 | """ 88 | def __init__(self, estimator, rain_estimator='ensemble.RandomForestRegressor', **kwargs): 89 | """ 90 | Initialize an instance of an estimator with a rainfall post-processor. 
91 | 92 | :param estimator: sklearn estimator or TimeSeriesEstimator with an estimators_ attribute 93 | :param rain_estimator: str: type of sklearn estimator to use for rain processing 94 | :param kwargs: passed to sklearn rain estimator 95 | """ 96 | self.base_estimator = estimator 97 | if isinstance(estimator, TimeSeriesEstimator): 98 | self.daily_estimator = self.base_estimator.daily_estimator 99 | else: 100 | self.daily_estimator = self.base_estimator 101 | # Inherit attributes from the base estimator 102 | for attr in self.base_estimator.__dict__.keys(): 103 | try: 104 | setattr(self, attr, getattr(self.base_estimator, attr)) 105 | except AttributeError: 106 | pass 107 | try: 108 | self.estimators_ = self.base_estimator.estimators_ 109 | except AttributeError: 110 | pass 111 | self.rain_estimator_name = rain_estimator 112 | self.is_classifier = ('Classifi' in self.rain_estimator_name) 113 | Regressor = get_object('sklearn.%s' % rain_estimator) 114 | self.rain_estimator = Regressor(**kwargs) 115 | if isinstance(self.daily_estimator, Pipeline): 116 | self.named_steps = self.daily_estimator.named_steps 117 | self._forest = self.daily_estimator.named_steps['regressor'] 118 | self._imputer = self.daily_estimator.named_steps['imputer'] 119 | else: 120 | self.named_steps = None 121 | self._imputer = None 122 | self._forest = self.daily_estimator 123 | if not hasattr(self, 'verbose'): 124 | self.verbose = 1 125 | 126 | def _get_tree_rain_prediction(self, X): 127 | # Get predictions from individual trees. 128 | num_samples = X.shape[0] 129 | if self._imputer is not None: 130 | X = self._imputer.transform(X) 131 | if isinstance(self._forest, MultiOutputRegressor): 132 | num_trees = len(self._forest.estimators_[0].estimators_) 133 | predicted_rain = np.zeros((num_samples, num_trees)) 134 | for s in range(num_samples): 135 | Xs = X[s].reshape(1, -1) 136 | for t in range(num_trees): 137 | try: 138 | predicted_rain[s, t] = self._forest.estimators_[3].estimators_[t].predict(Xs) 139 | except AttributeError: 140 | # Work around the 2-D array of estimators for GBTrees 141 | predicted_rain[s, t] = self._forest.estimators_[3].estimators_[t][0].predict(Xs) 142 | else: 143 | num_trees = len(self._forest.estimators_) 144 | predicted_rain = np.zeros((num_samples, num_trees)) 145 | for s in range(num_samples): 146 | Xs = X[s].reshape(1, -1) 147 | for t in range(num_trees): 148 | try: 149 | predicted_rain[s, t] = self._forest.estimators_[t].predict(Xs) 150 | except AttributeError: 151 | # Work around an error in sklearn where GBTrees have length-1 ndarrays... 152 | predicted_rain[s, t] = self._forest.estimators_[t][0].predict(Xs) 153 | return predicted_rain 154 | 155 | def _get_distribution(self, p_rain): 156 | # Get the mean, std, and number of 0 forecasts from the estimator. 157 | mean = np.mean(p_rain, axis=1) 158 | std = np.std(p_rain, axis=1) 159 | zero_frac = 1. * np.sum(p_rain < 0.01, axis=1) / p_rain.shape[1] 160 | return np.stack((mean, std, zero_frac), axis=1) 161 | 162 | def fit(self, predictor_array, verification_array, rain_array=None, **kwargs): 163 | """ 164 | Fit the estimator and the post-processor. 
165 | 166 | :param predictor_array: ndarray-like: predictor features 167 | :param verification_array: ndarray-like: truth values 168 | :param rain_array: ndarray-like: raw rain from the models 169 | :param kwargs: passed to the estimator's 'fit' method 170 | :return: 171 | """ 172 | # First, fit the estimator as usual 173 | self.base_estimator.fit(predictor_array, verification_array, **kwargs) 174 | 175 | # Now generate the distribution information from the individual trees in the forest 176 | if self.verbose > 0: 177 | print('RainTuningEstimator: getting ensemble rain predictions') 178 | predicted_rain = self._get_tree_rain_prediction(predictor_array) 179 | rain_distribution = self._get_distribution(predicted_rain) 180 | # If raw rain values are given, add those to the distribution 181 | if rain_array is not None: 182 | rain_distribution = np.concatenate((rain_distribution, rain_array), axis=1) 183 | 184 | # Fit the rain estimator 185 | if self.verbose > 0: 186 | print('RainTuningEstimator: fitting rain post-processor') 187 | # # If we're using a classifier, then we may need to binarize the labels 188 | # if self.is_classifier: 189 | # lb = LabelBinarizer() 190 | # rain_targets = lb.fit_transform(verification_array[:, 3]) 191 | # else: 192 | rain_targets = verification_array[:, 3] 193 | self.rain_estimator.fit(rain_distribution, rain_targets) 194 | 195 | def predict(self, predictor_array, rain_tuning=True, rain_array=None, **kwargs): 196 | """ 197 | Return a prediction from the estimator with post-processed rain. 198 | 199 | :param predictor_array: ndarray-like: predictor features 200 | :param rain_tuning: bool: toggle option to disable rain tuning in prediction 201 | :param rain_array: ndarray-like: raw rain values from models. Must be provided if fit() was called using raw 202 | rain values and rain_tuning is True. 203 | :param kwargs: passed to estimator's 'predict' method 204 | :return: array of predictions 205 | """ 206 | # Predict with the estimator as usual 207 | predicted = self.base_estimator.predict(predictor_array, **kwargs) 208 | 209 | # Now get the tuned rain 210 | if rain_tuning: 211 | if self.verbose > 0: 212 | print('RainTuningEstimator: tuning rain prediction') 213 | # Get the distribution from individual trees 214 | predicted_rain = self._get_tree_rain_prediction(predictor_array) 215 | rain_distribution = self._get_distribution(predicted_rain) 216 | if rain_array is not None: 217 | rain_distribution = np.concatenate((rain_distribution, rain_array), axis=1) 218 | tuned_rain = self.rain_estimator.predict(rain_distribution) 219 | predicted[:, 3] = tuned_rain 220 | 221 | return predicted 222 | 223 | def predict_proba(self, predictor_array, rain_tuning=True, rain_array=None, **kwargs): 224 | """ 225 | Return a prediction from the estimator with post-processed rain, with a probability of rainfall. Should only 226 | be used if rain_forecast_type is 'pop'. 227 | 228 | :param predictor_array: ndarray-like: predictor features 229 | :param rain_tuning: bool: toggle option to disable rain tuning in prediction 230 | :param rain_array: ndarray-like: raw rain values from models. Must be provided if fit() was called using raw 231 | rain values and rain_tuning is True. 
232 | :param kwargs: passed to estimator's 'predict' method 233 | :return: array of predictions 234 | """ 235 | # Predict with the estimator as usual 236 | predicted = self.base_estimator.predict(predictor_array, **kwargs) 237 | 238 | # Do the probabilistic prediction for rain 239 | if rain_tuning: 240 | if self.verbose > 0: 241 | print('RainTuningEstimator: tuning rain prediction') 242 | # Get the distribution from individual trees 243 | predicted_rain = self._get_tree_rain_prediction(predictor_array) 244 | rain_distribution = self._get_distribution(predicted_rain) 245 | if rain_array is not None: 246 | rain_distribution = np.concatenate((rain_distribution, rain_array), axis=1) 247 | tuned_rain = self.rain_estimator.predict_proba(rain_distribution) 248 | predicted[:, 3] = np.sum(tuned_rain[:, 1:], axis=1) 249 | 250 | return predicted 251 | 252 | def predict_rain_proba(self, predictor_array, rain_array=None): 253 | """ 254 | Get the raw categorical probabilistic prediction from the rain post-processor. 255 | :param predictor_array: ndarray-like: predictor features 256 | :param rain_array: ndarray-like: raw rain values from models. Must be provided if fit() was called using raw 257 | rain values. 258 | :return: array of categorical rain predictions 259 | """ 260 | if self.verbose > 0: 261 | print('RainTuningEstimator: getting probabilistic rain prediction') 262 | # Get the distribution from individual trees 263 | predicted_rain = self._get_tree_rain_prediction(predictor_array) 264 | rain_distribution = self._get_distribution(predicted_rain) 265 | if rain_array is not None: 266 | rain_distribution = np.concatenate((rain_distribution, rain_array), axis=1) 267 | categorical_rain = self.rain_estimator.predict_proba(rain_distribution) 268 | return categorical_rain 269 | 270 | 271 | class BootStrapEnsembleEstimator(object): 272 | """ 273 | This class implements a bootstrapping technique to generate a small ensemble of identical algorithms trained on 274 | a random subset of the training data. Options include partitioning the training data evenly, so that no algorithm 275 | has any sample in any other algorithm's training set, or completely randomly, so that there may be an arbitrary 276 | amount of overlap in the individual algorithms' training sets. 277 | """ 278 | def __init__(self, estimator, n_members=10, n_samples_split=0.1, unique_splits=False): 279 | """ 280 | Initialize instance of a BootStrapEnsembleEstimator. 281 | :param estimator: object: base estimator object (scikit-learn or one of those in this module) 282 | :param n_members: int: number of bootstrapped ensemble members (individual trained models) 283 | :param n_samples_split: float or int: if float, then gives the fraction of the training set that is used to 284 | form the training set for any given ensemble member. If int, then gives the exact number of samples used. 285 | Ignored if unique_splits is True, because then each training set has the maximum number of samples possible to 286 | be unique. 
287 | :param unique_splits: bool: if True, each split contains a unique set of samples not present in any other split 288 | """ 289 | self.base_estimator = estimator 290 | self.n_members = n_members 291 | self.estimators_ = np.empty((n_members,), dtype=np.object) 292 | for n in range(n_members): 293 | self.estimators_[n] = deepcopy(estimator) 294 | self.n_samples_split = n_samples_split 295 | self.unique_splits = unique_splits 296 | if not hasattr(self.base_estimator, 'verbose'): 297 | self.verbose = 1 298 | else: 299 | self.verbose = self.base_estimator.verbose 300 | 301 | def get_splits(self, X, y): 302 | X_splits = [] 303 | y_splits = [] 304 | if self.unique_splits: 305 | kf = KFold(n_splits=self.n_members, shuffle=True) 306 | for train_index, test_index in kf.split(X): 307 | X_splits.append(X[test_index]) # test is unique; train is all other tests 308 | y_splits.append(y[test_index]) 309 | else: 310 | for split_count in range(self.n_members): 311 | X_split, xt, y_split, yt = train_test_split(X, y, train_size=self.n_samples_split, shuffle=True) 312 | X_splits.append(X_split) 313 | y_splits.append(y_split) 314 | return X_splits, y_splits 315 | 316 | def fit(self, predictor_array, verification_array, **kwargs): 317 | """ 318 | Fit the bootstrapped algorithms. 319 | 320 | :param predictor_array: ndarray-like: predictor features 321 | :param verification_array: ndarray-like: truth values 322 | :param kwargs: kwargs passed to fit methods 323 | :return: 324 | """ 325 | predictor_splits, verification_splits = self.get_splits(predictor_array, verification_array) 326 | for est in range(self.n_members): 327 | if self.verbose: 328 | print('BootStrapEnsembleEstimator: training ensemble member %d of %d' % (est + 1, self.n_members)) 329 | self.estimators_[est].fit(predictor_splits[est], verification_splits[est], **kwargs) 330 | 331 | def predict(self, predictor_array, **kwargs): 332 | """ 333 | Predict from the bootstrapped algorithms. Gives an ensemble mean. 334 | 335 | :param predictor_array: ndarray-like: predictor features 336 | :param kwargs: passed to estimator's 'predict' method 337 | :return: 338 | """ 339 | prediction = [] 340 | for est in range(self.n_members): 341 | if self.verbose: 342 | print('BootStrapEnsembleEstimator: predicting from ensemble member %d of %d' % 343 | (est + 1, self.n_members)) 344 | prediction.append(self.estimators_[est].predict(predictor_array, **kwargs)) 345 | 346 | return np.mean(np.array(prediction), axis=0) 347 | 348 | def predict_rain_proba(self, predictor_array, rain_array=None): 349 | """ 350 | If the base estimator is a RainTuningEstimator, yields an average prediction for categorical rain 351 | probabilities. 352 | :param predictor_array: ndarray-like: predictor features 353 | :param rain_array: ndarray-like: raw rain values from models. Must be provided if fit() was called using raw 354 | rain values. 
355 | :return: 356 | """ 357 | prediction = [] 358 | for est in range(self.n_members): 359 | try: 360 | prediction.append(self.estimators_[est].predict_rain_proba(predictor_array, rain_array=rain_array)) 361 | except AttributeError: 362 | raise AttributeError("'%s' cannot predict rain category probabilities; use RainTuningEstimator" % 363 | type(self.base_estimator)) 364 | 365 | # Fix the shape of the arrays, in case some predictions don't have all categories of rain 366 | rain_dims = [p.shape[1] for p in prediction] 367 | max_dim = np.max(rain_dims) 368 | new_prediction = [] 369 | for p in prediction: 370 | while p.shape[1] < max_dim: 371 | p = np.c_[p, np.zeros(p.shape[0])] 372 | new_prediction.append(p) 373 | 374 | return np.mean(np.array(new_prediction), axis=0) 375 | -------------------------------------------------------------------------------- /mosx/model/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for formatting predictor data and training scikit-learn models. 9 | """ 10 | 11 | from model import * 12 | from predictors import * 13 | from scorers import * 14 | -------------------------------------------------------------------------------- /mosx/model/model.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for training scikit-learn models. 9 | """ 10 | 11 | from datetime import datetime, timedelta 12 | import pandas as pd 13 | 14 | from mosx.util import get_object, to_bool, dewpoint 15 | from mosx.estimators import TimeSeriesEstimator, RainTuningEstimator, BootStrapEnsembleEstimator 16 | import pickle 17 | import numpy as np 18 | 19 | 20 | def build_estimator(config): 21 | """ 22 | Build the estimator object from the parameters in config. 23 | 24 | :param config: 25 | :return: 26 | """ 27 | regressor = config['Model']['regressor'] 28 | sklearn_kwargs = config['Model']['Parameters'] 29 | train_individual = config['Model']['train_individual'] 30 | ada_boost = config['Model'].get('Ada boosting', None) 31 | rain_tuning = config['Model'].get('Rain tuning', None) 32 | bootstrap = config['Model'].get('Bootstrapping', None) 33 | Regressor = get_object('sklearn.%s' % regressor) 34 | if config['verbose']: 35 | print('build_estimator: using sklearn.%s as estimator...' 
% regressor) 36 | 37 | from sklearn.preprocessing import Imputer 38 | from sklearn.preprocessing import StandardScaler as Scaler 39 | from sklearn.pipeline import Pipeline 40 | 41 | # Create and train the learning algorithm 42 | if config['verbose']: 43 | print('build_estimator: here are the parameters passed to the learning algorithm...') 44 | print(sklearn_kwargs) 45 | 46 | # Create the pipeline list 47 | pipeline = [("imputer", Imputer(missing_values=np.nan, strategy="mean", axis=0))] 48 | if config['Model']['predict_timeseries']: 49 | pipeline_timeseries = [("imputer", Imputer(missing_values=np.nan, strategy="mean", axis=0))] 50 | 51 | if not (regressor.startswith('ensemble')): 52 | # Need to add feature scaling 53 | pipeline.append(("scaler", Scaler())) 54 | if config['Model']['predict_timeseries']: 55 | pipeline_timeseries.append(("scaler", Scaler())) 56 | 57 | # Create the regressor object 58 | regressor_obj = Regressor(**sklearn_kwargs) 59 | if ada_boost is not None: 60 | if config['verbose']: 61 | print('build_estimator: using Ada boosting...') 62 | from sklearn.ensemble import AdaBoostRegressor 63 | regressor_obj = AdaBoostRegressor(regressor_obj, **ada_boost) 64 | if train_individual: 65 | if config['verbose']: 66 | print('build_estimator: training separate models for each parameter...') 67 | from sklearn.multioutput import MultiOutputRegressor 68 | multi_regressor = MultiOutputRegressor(regressor_obj, 4) 69 | pipeline.append(("regressor", multi_regressor)) 70 | else: 71 | pipeline.append(("regressor", regressor_obj)) 72 | if config['Model']['predict_timeseries']: 73 | pipeline_timeseries.append(("regressor", regressor_obj)) 74 | 75 | # Make the final estimator with a Pipeline 76 | if config['Model']['predict_timeseries']: 77 | estimator = TimeSeriesEstimator(Pipeline(pipeline), Pipeline(pipeline_timeseries)) 78 | else: 79 | estimator = Pipeline(pipeline) 80 | 81 | if rain_tuning is not None and regressor.startswith('ensemble'): 82 | if config['verbose']: 83 | print('build_estimator: using rain tuning...') 84 | rain_kwargs = rain_tuning.copy() 85 | rain_kwargs.pop('use_raw_rain', None) 86 | estimator = RainTuningEstimator(estimator, **rain_kwargs) 87 | 88 | # Add bootstrapping if requested 89 | if bootstrap is not None: 90 | if config['verbose']: 91 | print('build_estimator: using bootstrapping ensemble...') 92 | estimator = BootStrapEnsembleEstimator(estimator, **bootstrap) 93 | 94 | return estimator 95 | 96 | 97 | def build_train_data(config, predictor_file, no_obs=False, no_models=False, test_size=0): 98 | """ 99 | Build the array of training (and optionally testing) data. 
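A usage sketch (the file name is illustrative):

    p_train, t_train, r_train, p_test, t_test, r_test = build_train_data(
        config, 'KSEA_predictors.pkl', test_size=50)

With test_size == 0 (the default), only (predictors, targets, rain) is returned, and rain is None unless the rain tuning 'use_raw_rain' option is enabled.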
100 | 101 | :param config: 102 | :param predictor_file: 103 | :param no_obs: 104 | :param no_models: 105 | :param test_size: 106 | :return: 107 | """ 108 | from sklearn.model_selection import train_test_split 109 | 110 | if config['verbose']: 111 | print('build_train_data: reading predictor file') 112 | rain_tuning = config['Model'].get('Rain tuning', None) 113 | with open(predictor_file, 'rb') as handle: 114 | data = pickle.load(handle) 115 | 116 | # Select data 117 | if no_obs and no_models: 118 | no_obs = False 119 | no_models = False 120 | if no_obs: 121 | if config['verbose']: 122 | print('build_train_data: not using observations to train') 123 | predictors = data['BUFKIT'] 124 | elif no_models: 125 | if config['verbose']: 126 | print('build_train_data: not using models to train') 127 | predictors = data['OBS'] 128 | else: 129 | predictors = np.concatenate((data['BUFKIT'], data['OBS']), axis=1) 130 | if rain_tuning is not None and to_bool(rain_tuning.get('use_raw_rain', False)): 131 | predictors = np.concatenate((predictors, data.rain), axis=1) 132 | rain_shape = data.rain.shape[-1] 133 | targets = data['VERIF'] 134 | 135 | if test_size > 0: 136 | p_train, p_test, t_train, t_test = train_test_split(predictors, targets, test_size=test_size) 137 | if rain_tuning is not None and to_bool(rain_tuning.get('use_raw_rain', False)): 138 | r_train = p_train[:, -1*rain_shape:] 139 | p_train = p_train[:, :-1*rain_shape] 140 | r_test = p_test[:, -1 * rain_shape:] 141 | p_test = p_test[:, :-1 * rain_shape] 142 | else: 143 | r_train = None 144 | r_test = None 145 | return p_train, t_train, r_train, p_test, t_test, r_test 146 | else: 147 | if rain_tuning is not None and to_bool(rain_tuning.get('use_raw_rain', False)): 148 | return predictors, targets, data.rain 149 | else: 150 | return predictors, targets, None 151 | 152 | 153 | def train(config, predictor_file, estimator_file=None, no_obs=False, no_models=False, test_size=0): 154 | """ 155 | Generate and train a scikit-learn machine learning estimator. The estimator object is saved as a pickle so that it 156 | may be imported and used for predictions at any time. 
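A typical call, assuming a config object from mosx.util.get_config (file names are illustrative):

    config = get_config('KSEA.config')
    train(config, 'KSEA_predictors.pkl', estimator_file='KSEA_mosx.pkl')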
157 | 
158 | :param config: 
159 | :param predictor_file: str: full path to saved file of predictor data 
160 | :param estimator_file: str: full path to output model file 
161 | :param no_obs: bool: if True, generates the model with no OBS data 
162 | :param no_models: bool: if True, generates the model with no BUFR data 
163 | :param test_size: int: if > 0, returns a subset of the training data of size 'test_size' to test on 
164 | :return: if test_size > 0, returns the held-out test predictors, targets, and raw rain values 
165 | """ 
166 | estimator = build_estimator(config) 
167 | rain_tuning = config['Model'].get('Rain tuning', None) 
168 | if test_size > 0: 
169 | p_train, t_train, r_train, p_test, t_test, r_test = build_train_data(config, predictor_file, no_obs=no_obs, 
170 | no_models=no_models, test_size=test_size) 
171 | else: 
172 | p_train, t_train, r_train = build_train_data(config, predictor_file, no_obs=no_obs, no_models=no_models) 
173 | 
174 | print('train: training the estimator') 
175 | if rain_tuning is not None and to_bool(rain_tuning.get('use_raw_rain', False)): 
176 | estimator.fit(p_train, t_train, rain_array=r_train) 
177 | else: 
178 | estimator.fit(p_train, t_train) 
179 | 
180 | if estimator_file is None: 
181 | estimator_file = '%s/%s_mosx.pkl' % (config['MOSX_ROOT'], config['station_id']) 
182 | print('train: -> exporting to %s' % estimator_file) 
183 | with open(estimator_file, 'wb') as handle: 
184 | pickle.dump(estimator, handle, protocol=pickle.HIGHEST_PROTOCOL) 
185 | 
186 | if test_size > 0: 
187 | return p_test, t_test, r_test 
188 | return 
189 | 
190 | 
191 | def _plot_learning_curve(estimator, X, y, ylim=None, cv=None, scoring=None, title=None, n_jobs=1, 
192 | train_sizes=np.linspace(.1, 1.0, 5)): 
193 | import matplotlib.pyplot as plt 
194 | from sklearn.model_selection import learning_curve 
195 | 
196 | fig = plt.figure() 
197 | if title is not None: 
198 | plt.title(title) 
199 | if ylim is not None: 
200 | plt.ylim(*ylim) 
201 | plt.xlabel("Training examples") 
202 | plt.ylabel("Score") 
203 | train_sizes, train_scores, test_scores = learning_curve( 
204 | estimator, X, y, cv=cv, scoring=scoring, n_jobs=n_jobs, train_sizes=train_sizes) 
205 | train_scores_mean = np.mean(train_scores, axis=1) 
206 | train_scores_std = np.std(train_scores, axis=1) 
207 | test_scores_mean = np.mean(test_scores, axis=1) 
208 | test_scores_std = np.std(test_scores, axis=1) 
209 | plt.grid() 
210 | 
211 | plt.fill_between(train_sizes, train_scores_mean - train_scores_std, 
212 | train_scores_mean + train_scores_std, alpha=0.1, 
213 | color="r") 
214 | plt.fill_between(train_sizes, test_scores_mean - test_scores_std, 
215 | test_scores_mean + test_scores_std, alpha=0.1, color="g") 
216 | plt.plot(train_sizes, train_scores_mean, 'o-', color="r", 
217 | label="Training score") 
218 | plt.plot(train_sizes, test_scores_mean, 'o-', color="g", 
219 | label="Cross-validation score") 
220 | 
221 | plt.legend(loc="best") 
222 | return fig 
223 | 
224 | 
225 | def plot_learning_curve(config, predictor_file, no_obs=False, no_models=False, ylim=None, cv=None, scoring=None, 
226 | title=None, n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)): 
227 | """ 
228 | Generate a simple plot of the test and training learning curve. 
From scikit-learn: 
229 | http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html 
230 | 
231 | Parameters 
232 | ---------- 
233 | config : 
234 | 
235 | predictor_file : string 
236 | Full path to file containing predictor data 
237 | 
238 | no_obs : boolean 
239 | Train model without observations 
240 | 
241 | no_models : boolean 
242 | Train model without model data 
243 | 
244 | ylim : tuple, shape (ymin, ymax), optional 
245 | Defines minimum and maximum yvalues plotted. 
246 | 
247 | cv : int, cross-validation generator or an iterable, optional 
248 | Determines the cross-validation splitting strategy. 
249 | Possible inputs for cv are: 
250 | - None, to use the default 3-fold cross-validation, 
251 | - integer, to specify the number of folds. 
252 | - An object to be used as a cross-validation generator. 
253 | - An iterable yielding train/test splits. 
254 | 
255 | For integer/None inputs, if ``y`` is binary or multiclass, 
256 | :class:`StratifiedKFold` is used. If the estimator is not a classifier 
257 | or if ``y`` is neither binary nor multiclass, :class:`KFold` is used. 
258 | 
259 | Refer to the scikit-learn User Guide for the various 
260 | cross-validators that can be used here. 
261 | 
262 | scoring : 
263 | Scoring function for the error calculation; should be a scikit-learn scorer object 
264 | 
265 | title : string 
266 | Title for the chart. 
267 | 
268 | n_jobs : integer, optional 
269 | Number of jobs to run in parallel (default 1). 
270 | 
271 | train_sizes : iterable, optional 
272 | Sequence of subsets of training data used in learning curve plot 
273 | """ 
274 | estimator = build_estimator(config) 
275 | X, y, _ = build_train_data(config, predictor_file, no_obs=no_obs, no_models=no_models)  # raw rain values unused 
276 | fig = _plot_learning_curve(estimator, X, y, ylim=ylim, cv=cv, scoring=scoring, title=title, n_jobs=n_jobs, 
277 | train_sizes=train_sizes) 
278 | return fig 
279 | 
280 | 
281 | def combine_train_test(config, train_file, test_file, no_obs=False, no_models=False, return_count_test=True): 
282 | """ 
283 | Concatenates the arrays of predictors and verification values from the train file and the test file. Useful for 
284 | implementing cross-validation using scikit-learn's methods and the SplitConsecutive class. 
285 | 
286 | :param config: 
287 | :param train_file: str: full path to predictor file for training 
288 | :param test_file: str: full path to predictor file for validation 
289 | :param no_obs: bool: if True, generates the model with no OBS data 
290 | :param no_models: bool: if True, generates the model with no BUFR data 
291 | :param return_count_test: bool: if True, also returns the number of samples in the test set (see SplitConsecutive) 
292 | :return: predictors, verifications: concatenated arrays of predictors and verification values; count: number of 
293 | samples in the test set 
294 | """ 
295 | p_train, t_train, _ = build_train_data(config, train_file, no_obs=no_obs, no_models=no_models)  # rain unused 
296 | p_test, t_test, _ = build_train_data(config, test_file, no_obs=no_obs, no_models=no_models) 
297 | p_combined = np.concatenate((p_train, p_test), axis=0) 
298 | t_combined = np.concatenate((t_train, t_test), axis=0) 
299 | if return_count_test: 
300 | return p_combined, t_combined, t_test.shape[0] 
301 | else: 
302 | return p_combined, t_combined 
303 | 
304 | 
305 | class SplitConsecutive(object): 
306 | """ 
307 | Implements a split method to subset a training set into train and test sets, using the first or last n samples in 
308 | the set. 
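This makes it possible to cross-validate against a block of consecutive dates, e.g. with sklearn.model_selection.cross_val_score (a sketch; the scorer is illustrative):

    p, t, n_test = combine_train_test(config, train_file, test_file)
    cv = SplitConsecutive(first=False, n_samples=n_test)
    scores = cross_val_score(estimator, p, t, scoring=wxchallenge_scorer(), cv=cv)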
309 | """ 310 | 311 | def __init__(self, first=False, n_samples=0.2): 312 | """ 313 | Create an instance of SplitConsecutive. 314 | 315 | :param first: bool: if True, gets test data from the beginning of the data set; otherwise from the end 316 | :param n_samples: float or int: if float, subsets a fraction (0 to 1) of the data into the test set; if int, 317 | subsets a specific number of samples. 318 | """ 319 | if type(first) is not bool: 320 | raise TypeError("'first' must be a boolean type.") 321 | try: 322 | n_samples = int(n_samples) 323 | except: 324 | pass 325 | if type(n_samples) is float and (n_samples <= 0. or n_samples >= 1.): 326 | raise ValueError("if float, 'n_samples' must be between 0 and 1.") 327 | if type(n_samples) is not float and type(n_samples) is not int: 328 | raise TypeError("'n_samples' must be float or int type.") 329 | self.first = first 330 | self.n_samples = n_samples 331 | self.n_splits = 1 332 | 333 | def split(self, X, y=None, groups=None): 334 | """ 335 | Produces arrays of indices to use for model and test splits. 336 | 337 | :param X: array-like, shape (samples, features): predictor data 338 | :param y: array-like, shape (samples, outputs) or None: verification data; ignored 339 | :param groups: ignored 340 | :return: model, test: 1-D arrays of sample indices in the model and test sets 341 | """ 342 | num_samples = X.shape[0] 343 | indices = np.arange(0, num_samples, 1, dtype=np.int32) 344 | if type(self.n_samples) is float: 345 | self.n_samples = int(np.round(num_samples * self.n_samples)) 346 | if self.first: 347 | test = indices[:self.n_samples] 348 | train = indices[self.n_samples:] 349 | else: 350 | test = indices[-self.n_samples:] 351 | train = indices[:num_samples - self.n_samples] 352 | yield train, test 353 | 354 | def get_n_splits(self, X=None, y=None, groups=None): 355 | """ 356 | Return the number of splits. Dummy function for compatibility. 357 | 358 | :param X: ignored 359 | :param y: ignored 360 | :param groups: ignored 361 | :return: 362 | """ 363 | return self.n_splits 364 | 365 | 366 | def predict_all(config, predictor_file, ensemble=False, time_series_date=None, naive_rain_correction=False, 367 | round_result=False, **kwargs): 368 | """ 369 | Predict forecasts from the estimator in config. Also return probabilities and time series. 
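A sketch of inspecting the ensemble spread (assuming a forest-based estimator and an illustrative file name):

    predicted, all_predicted, ts = predict_all(config, 'KSEA_predictors.pkl', ensemble=True)
    high_spread = all_predicted[0, 0, :].std()  # spread of tree predictions for the high temperature

all_predicted has shape num_samples x 4 x num_trees for the four daily variables (high, low, wind, rain).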
370 | 
371 | :param config: 
372 | :param predictor_file: str: file containing predictor data from mosx.model.format_predictors 
373 | :param ensemble: bool: if True, also return the predictions of each tree in the estimator (num_samples x 4 x num_trees) 
374 | :param time_series_date: datetime: if set, returns a time series prediction from the estimator, where the datetime 
375 | provided is the day the forecast is for (only works for single-day runs, or assumes last day) 
376 | :param naive_rain_correction: bool: if True, applies manual tuning to the rain forecast 
377 | :param round_result: bool: if True, rounds the predicted estimate 
378 | :param kwargs: passed to the estimator's 'predict' method 
379 | :return: 
380 | predicted: ndarray: num_samples x num_predicted_variables predictions 
381 | all_predicted: ndarray: num_samples x num_predicted_variables x num_ensemble_members predictions for all trees 
382 | predicted_timeseries: DataFrame: time series for final sample 
383 | """ 
384 | # Load the predictor data and estimator 
385 | with open(predictor_file, 'rb') as handle: 
386 | predictor_data = pickle.load(handle) 
387 | rain_tuning = config['Model'].get('Rain tuning', None) 
388 | if config['verbose']: 
389 | print('predict: loading estimator %s' % config['Model']['estimator_file']) 
390 | with open(config['Model']['estimator_file'], 'rb') as handle: 
391 | estimator = pickle.load(handle) 
392 | 
393 | predictors = np.concatenate((predictor_data['BUFKIT'], predictor_data['OBS']), axis=1) 
394 | if config['Model']['rain_forecast_type'] == 'pop' and getattr(estimator, 'is_classifier', False): 
395 | predict_method = estimator.predict_proba 
396 | else: 
397 | predict_method = estimator.predict 
398 | if rain_tuning is not None and to_bool(rain_tuning.get('use_raw_rain', False)): 
399 | predicted = predict_method(predictors, rain_array=predictor_data.rain, **kwargs) 
400 | else: 
401 | predicted = predict_method(predictors, **kwargs) 
402 | precip = predictor_data.rain 
403 | 
404 | # Check for precipitation override 
405 | if naive_rain_correction: 
406 | for day in range(predicted.shape[0]): 
407 | if sum(precip[day]) < 0.01: 
408 | if config['verbose']: 
409 | print('predict: warning: overriding MOS-X rain prediction of %0.2f on day %s with 0' % 
410 | (predicted[day, 3], day)) 
411 | predicted[day, 3] = 0. 
412 | elif predicted[day, 3] > max(precip[day]) or predicted[day, 3] < min(precip[day]): 
413 | if config['verbose']: 
414 | print('predict: warning: overriding MOS-X prediction of %0.2f on day %s with model mean' % 
415 | (predicted[day, 3], day)) 
416 | predicted[day, 3] = max(0., np.mean(np.append(precip[day], predicted[day, 3])))  # mean of model values and prediction 
417 | else: 
418 | # At least make sure we aren't predicting negative values... 
419 | predicted[:, 3][predicted[:, 3] < 0] = 0.0 420 | 421 | # Round off daily values, if selected 422 | if round_result: 423 | predicted[:, :3] = np.round(predicted[:, :3]) 424 | predicted[:, 3] = np.round(predicted[:, 3], 2) 425 | 426 | # If probabilities are requested and available, get the results from each tree 427 | if ensemble: 428 | num_samples = predictors.shape[0] 429 | if not hasattr(estimator, 'named_steps'): 430 | forest = estimator 431 | else: 432 | imputer = estimator.named_steps['imputer'] 433 | forest = estimator.named_steps['regressor'] 434 | predictors = imputer.transform(predictors) 435 | # If we generated our own ensemble by bootstrapping, it must be treated as such 436 | if config['Model']['train_individual'] and config['Model'].get('Bootstrapping', None) is None: 437 | num_trees = len(forest.estimators_[0].estimators_) 438 | all_predicted = np.zeros((num_samples, 4, num_trees)) 439 | for v in range(4): 440 | for t in range(num_trees): 441 | try: 442 | all_predicted[:, v, t] = forest.estimators_[v].estimators_[t].predict(predictors) 443 | except AttributeError: 444 | # Work around the 2-D array of estimators for GBTrees 445 | all_predicted[:, v, t] = forest.estimators_[v].estimators_[t][0].predict(predictors) 446 | else: 447 | num_trees = len(forest.estimators_) 448 | all_predicted = np.zeros((num_samples, 4, num_trees)) 449 | for t in range(num_trees): 450 | try: 451 | all_predicted[:, :, t] = forest.estimators_[t].predict(predictors)[:, :4] 452 | except AttributeError: 453 | # Work around the 2-D array of estimators for GBTrees 454 | all_predicted[:, :, t] = forest.estimators_[t][0].predict(predictors)[:, :4] 455 | else: 456 | all_predicted = None 457 | 458 | if config['Model']['predict_timeseries']: 459 | if time_series_date is None: 460 | date_now = datetime.utcnow() 461 | time_series_date = datetime(date_now.year, date_now.month, date_now.day) + timedelta(days=1) 462 | print('predict: warning: set time series start date to %s (was unspecified)' % time_series_date) 463 | num_hours = int(24 / config['time_series_interval']) + 1 464 | predicted_array = predicted[-1, 4:].reshape((4, num_hours)).T 465 | # Get dewpoint 466 | predicted_array[:, 2] = dewpoint(predicted_array[:, 0], predicted_array[:, 2]) 467 | times = pd.date_range(time_series_date.replace(hour=6), periods=num_hours, 468 | freq='%dH' % config['time_series_interval']).to_pydatetime().tolist() 469 | variables = ['temperature', 'rain', 'dewpoint', 'windSpeed'] 470 | round_dict = {'temperature': 0, 'rain': 2, 'dewpoint': 0, 'windSpeed': 0} 471 | predicted_timeseries = pd.DataFrame(predicted_array, index=times, columns=variables) 472 | predicted_timeseries = predicted_timeseries.round(round_dict) 473 | else: 474 | predicted_timeseries = None 475 | 476 | return predicted, all_predicted, predicted_timeseries 477 | 478 | 479 | def predict(config, predictor_file, naive_rain_correction=False, round=False, **kwargs): 480 | """ 481 | Predict forecasts from the estimator in config. Only returns daily values. 
482 | 483 | :param config: 484 | :param predictor_file: str: file containing predictor data from mosx.model.format_predictors 485 | :param naive_rain_correction: bool: if True, applies manual tuning to the rain forecast 486 | :param round: bool: if True, rounds the predicted estimate 487 | :param kwargs: passed to the estimator's 'predict' method 488 | :return: 489 | """ 490 | 491 | predicted, all_predicted, predicted_timeseries = predict_all(config, predictor_file, 492 | naive_rain_correction=naive_rain_correction, 493 | round_result=round, **kwargs) 494 | return predicted 495 | 496 | 497 | def predict_rain_proba(config, predictor_file): 498 | """ 499 | Predict probabilistic rain forecasts for 'pop' or 'categorical' types. 500 | 501 | :param config: 502 | :param predictor_file: str: file containing predictor data from mosx.model.format_predictors 503 | :return: 504 | """ 505 | if config['Model']['rain_forecast_type'] not in ['pop', 'categorical']: 506 | raise TypeError("'quantity' rain forecast is not probabilistic, cannot get probabilities") 507 | rain_tuning = config['Model'].get('Rain tuning', None) 508 | if rain_tuning is None: 509 | raise TypeError('Probabilistic rain forecasts are only possible with a RainTuningEstimator') 510 | 511 | # Load the predictor data and estimator 512 | with open(predictor_file, 'rb') as handle: 513 | predictor_data = pickle.load(handle) 514 | if config['verbose']: 515 | print('predict: loading estimator %s' % config['Model']['estimator_file']) 516 | with open(config['Model']['estimator_file'], 'rb') as handle: 517 | estimator = pickle.load(handle) 518 | 519 | predictors = np.concatenate((predictor_data['BUFKIT'], predictor_data['OBS']), axis=1) 520 | if to_bool(rain_tuning.get('use_raw_rain', False)): 521 | rain_proba = estimator.predict_rain_proba(predictors, rain_array=predictor_data.rain) 522 | else: 523 | rain_proba = estimator.predict_rain_proba(predictors) 524 | 525 | return rain_proba 526 | -------------------------------------------------------------------------------- /mosx/model/predictors.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing input predictor data. 9 | """ 10 | 11 | import numpy as np 12 | import mosx.bufr 13 | import mosx.obs 14 | import mosx.verification 15 | from mosx.util import unpickle, find_matching_dates 16 | import pickle 17 | 18 | 19 | class PredictorDict(dict): 20 | """ 21 | Special class extending dict to add an attribute containing raw precipitation values, for precipitation-aware 22 | estimator configurations. 23 | """ 24 | def __init__(self, *args, **kwargs): 25 | super(PredictorDict, self).__init__(*args, **kwargs) 26 | self.rain = None 27 | 28 | def add_rain(self, rain_array): 29 | """ 30 | Add an array of raw rain values to the dict. If the dictionary contains BUFKIT array, checks that the sample 31 | size is correct. 32 | :param rain_array: 33 | :return: 34 | """ 35 | rain_array[np.isnan(rain_array)] = 0. 
36 | 37 | if 'BUFKIT' in self.keys(): 38 | if isinstance(self['BUFKIT'], np.ndarray): 39 | if self['BUFKIT'].shape[0] != rain_array.shape[0]: 40 | raise ValueError('rain_array and BUFKIT array must have the same sample size; got %s and %s' % 41 | (rain_array.shape[0], self['BUFKIT'].shape[0])) 42 | self.rain = rain_array 43 | 44 | 45 | def format_predictors(config, bufr_file, obs_file, verif_file, output_file=None, return_dates=False): 46 | """ 47 | Generates a complete date-by-x array of data for ingestion into the machine learning estimator. verif_file may be 48 | None if creating a set to run the model. 49 | 50 | :param config: 51 | :param bufr_file: str: full path to the saved file of BUFR data 52 | :param obs_file: str: full path to the saved file of OBS data 53 | :param verif_file: str: full path to the saved file of VERIF data 54 | :param output_file: str: full path to output predictors file 55 | :param return_dates: if True, returns all of the matching dates used to produce the predictor arrays 56 | :return: optionally a list of dates and a list of lists of precipitation values 57 | """ 58 | bufr, obs, verif = unpickle(bufr_file, obs_file, verif_file) 59 | bufr, obs, verif, all_dates = find_matching_dates(bufr, obs, verif, return_data=True) 60 | bufr_array = mosx.bufr.process(config, bufr) 61 | obs_array = mosx.obs.process(config, obs) 62 | verif_array = mosx.verification.process(config, verif) 63 | 64 | export_dict = { 65 | 'BUFKIT': bufr_array, 66 | 'OBS': obs_array, 67 | 'VERIF': verif_array 68 | } 69 | export_dict = PredictorDict(export_dict) 70 | 71 | # Get raw precipitation values and add them to the PredictorDict 72 | precip_list = [] 73 | for date in all_dates: 74 | precip = [] 75 | for model in bufr['DAY'].keys(): 76 | precip.append(bufr['DAY'][model][date][-1] / 25.4) # mm to inches 77 | precip_list.append(precip) 78 | export_dict.add_rain(np.array(precip_list)) 79 | 80 | if output_file is None: 81 | output_file = '%s/%s_predictors.pkl' % (config['site_directory'], config['station_id']) 82 | print('predictors: -> exporting to %s' % output_file) 83 | with open(output_file, 'wb') as handle: 84 | pickle.dump(export_dict, handle, protocol=pickle.HIGHEST_PROTOCOL) 85 | 86 | if return_dates: 87 | return all_dates 88 | else: 89 | return 90 | -------------------------------------------------------------------------------- /mosx/model/scorers.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Scoring techniques for model evaluation. 9 | """ 10 | 11 | import numpy as np 12 | from sklearn.metrics import make_scorer 13 | 14 | 15 | def wxchallenge_error(y, y_pred, average=True, no_rain=False): 16 | """ 17 | Returns the forecast error as measured by the WxChallenge. 
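The total is the absolute high temperature error, plus the absolute low temperature error, plus half the absolute wind error, plus a banded rain penalty accumulated for every 0.01" between the forecast and observed totals: 0.4 points per 0.01" below 0.10", 0.3 from 0.10" to 0.24", 0.2 from 0.25" to 0.49", and 0.1 at or above 0.50". For example, truth (62, 48, 12, 0.00) scored against forecast (64, 47, 14, 0.15) gives 2 + 1 + 0.5 * 2 + (10 * 0.4 + 5 * 0.3) = 9.5.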
18 | 19 | :param y: n-by-4 array of truth values 20 | :param y_pred: n-by-4 array of predicted values 21 | :param average: bool: if True, returns the average error per sample 22 | :param no_rain: bool: if True, does not count rain error 23 | :return: float: cumulative error 24 | """ 25 | if y.shape != y_pred.shape: 26 | raise ValueError("y and y_pred must have the same shape") 27 | if len(y.shape) > 2: 28 | raise ValueError("got too many dimensions for y and y_pred (expected 2)") 29 | if (y.shape != (4,)) and y.shape[1] != 4: 30 | raise ValueError('last dimension of y and y_pred should be length 4') 31 | 32 | if len(y.shape) == 1: 33 | y = np.array([y]) 34 | y_pred = np.array([y_pred]) 35 | 36 | high_error = np.sum(np.abs(y[:, 0] - y_pred[:, 0])) 37 | low_error = np.sum(np.abs(y[:, 1] - y_pred[:, 1])) 38 | wind_error = 0.5 * np.sum(np.abs(y[:, 2] - y_pred[:, 2])) 39 | rain_error = 0. 40 | for sample in range(y.shape[0]): 41 | y_rain = y[sample, 3] 42 | y_pred_rain = y_pred[sample, 3] 43 | rain_min = int(100.*min(y_rain, y_pred_rain)) 44 | rain_max = int(100.*max(y_rain, y_pred_rain)) 45 | while rain_min < rain_max: 46 | if rain_min < 10: 47 | rain_error += 0.4 48 | elif rain_min < 25: 49 | rain_error += 0.3 50 | elif rain_min < 50: 51 | rain_error += 0.2 52 | else: 53 | rain_error += 0.1 54 | rain_min += 1 55 | 56 | result = high_error + low_error + wind_error 57 | if not no_rain: 58 | result += rain_error 59 | if average: 60 | result /= y.shape[0] 61 | return result 62 | 63 | 64 | def wxchallenge_scorer(**kwargs): 65 | """ 66 | Return a scikit-learn scorer object for forecast error as measured by WxChallenge. 67 | 68 | :param kwargs: parameters passed to the WxChallenge error function 69 | no_rain: if True, does not count rain error 70 | :return: 71 | """ 72 | scorer = make_scorer(wxchallenge_error, greater_is_better=False, **kwargs) 73 | return scorer 74 | -------------------------------------------------------------------------------- /mosx/obs/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing OBS data. 9 | """ 10 | 11 | from .methods import * 12 | -------------------------------------------------------------------------------- /mosx/obs/methods.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing OBS data. 9 | """ 10 | 11 | import os 12 | import numpy as np 13 | import pandas as pd 14 | from datetime import datetime, timedelta 15 | import pickle 16 | from collections import OrderedDict 17 | from mosx.MesoPy import Meso 18 | from metpy.io import get_upper_air_data 19 | from metpy.calc import interp 20 | from mosx.util import generate_dates, get_array 21 | 22 | 23 | def upper_air(config, date, use_nan_sounding=False, use_existing=True, save=True): 24 | """ 25 | Retrieves upper-air data and interpolates to pressure levels. If use_nan_sounding is True, then if a retrieval 26 | error occurs, a blank sounding will be returned instead of an error. 
27 | 28 | :param config: 29 | :param date: datetime 30 | :param use_nan_sounding: bool: if True, use sounding of NaNs instead of raising an error 31 | :param use_existing: bool: preferentially use existing soundings in sounding_data_dir 32 | :param save: bool: if True, save processed soundings to sounding_data_dir 33 | :return: 34 | """ 35 | variables = ['height', 'temperature', 'dewpoint', 'u_wind', 'v_wind'] 36 | 37 | # Define levels for interpolation: same as model data, except omitting lowest_p_level 38 | plevs = [600, 750, 850, 925] 39 | pres_interp = [p for p in plevs if p <= config['lowest_p_level']] 40 | 41 | # Try retrieving the sounding, first checking for existing 42 | if config['verbose']: 43 | print('upper_air: retrieving sounding for %s' % datetime.strftime(date, '%Y%m%d%H')) 44 | nan_sounding = False 45 | retrieve_sounding = False 46 | sndg_data_dir = config['Obs']['sounding_data_dir'] 47 | if not(os.path.isdir(sndg_data_dir)): 48 | os.makedirs(sndg_data_dir) 49 | sndg_file = '%s/%s_SNDG_%s.pkl' % (sndg_data_dir, config['station_id'], datetime.strftime(date, '%Y%m%d%H')) 50 | if use_existing: 51 | try: 52 | with open(sndg_file, 'rb') as handle: 53 | data = pickle.load(handle) 54 | if config['verbose']: 55 | print(' Read from file.') 56 | except: 57 | retrieve_sounding = True 58 | else: 59 | retrieve_sounding = True 60 | if retrieve_sounding: 61 | try: 62 | dset = get_upper_air_data(date, config['Obs']['sounding_station_id']) 63 | except: 64 | # Try again 65 | try: 66 | dset = get_upper_air_data(date, config['Obs']['sounding_station_id']) 67 | except: 68 | if use_nan_sounding: 69 | if config['verbose']: 70 | print('upper_air: warning: unable to retrieve sounding; using nan.') 71 | nan_sounding = True 72 | else: 73 | raise ValueError('error retrieving sounding for %s' % date) 74 | 75 | # Retrieve pressure for interpolation to fixed levels 76 | if not nan_sounding: 77 | pressure = dset.variables['pressure'] 78 | pres = np.array([p.magnitude for p in list(pressure)]) # units are hPa 79 | 80 | # Get variables and interpolate; add to dictionary 81 | data = OrderedDict() 82 | for var in variables: 83 | if not nan_sounding: 84 | var_data = dset.variables[var] 85 | var_array = np.array([v.magnitude for v in list(var_data)]) 86 | var_interp = interp(pres_interp, pres, var_array) 87 | data[var] = var_interp.tolist() 88 | else: 89 | data[var] = [np.nan] * len(pres_interp) 90 | 91 | # Save 92 | if save and not nan_sounding: 93 | with open(sndg_file, 'wb') as handle: 94 | pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL) 95 | 96 | return data 97 | 98 | 99 | def get_obs_hourly(config, api_dates, vars_api, units): 100 | """ 101 | Retrieve hourly obs data in a pd dataframe. In order to ensure that there is no missing hourly indices, use 102 | dataframe.reindex on each retrieved dataframe. 
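Each element of api_dates is a (start, end) tuple of '%Y%m%d%H%M' strings as produced by mosx.util.generate_dates(config, api=True), e.g. ('201510010000', '201603310000') for a single season (the dates here are illustrative).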
103 | 104 | :param api_dates: dates from generate_dates 105 | :param vars_api: str: string formatted for api call var parameter 106 | :param units: str: string formatted for api call units parameter 107 | :return: pd.DataFrame: formatted hourly obs DataFrame 108 | """ 109 | # Initialize Meso 110 | m = Meso(token=config['meso_token']) 111 | if config['verbose']: 112 | print('get_obs_hourly: MesoPy initialized for station %s' % config['station_id']) 113 | 114 | # Retrieve data 115 | obs_final = pd.DataFrame() 116 | for api_date in api_dates: 117 | if config['verbose']: 118 | print('get_obs_hourly: retrieving data from %s to %s' % api_date) 119 | obs = m.timeseries(stid=config['station_id'], start=api_date[0], end=api_date[1], vars=vars_api, units=units, 120 | hfmetars='0') 121 | obspd = pd.DataFrame.from_dict(obs['STATION'][0]['OBSERVATIONS']) 122 | 123 | # Rename columns to requested vars 124 | obs_var_names = obs['STATION'][0]['SENSOR_VARIABLES'] 125 | obs_var_keys = list(obs_var_names.keys()) 126 | col_names = list(map(''.join, obspd.columns.values)) 127 | for c in range(len(col_names)): 128 | col = col_names[c] 129 | for k in range(len(obs_var_keys)): 130 | key = obs_var_keys[k] 131 | if col == list(obs_var_names[key].keys())[0]: 132 | col_names[c] = key 133 | obspd.columns = col_names 134 | 135 | # Change datetime column to datetime object 136 | dateobj = pd.to_datetime(obspd['date_time']) 137 | obspd['date_time'] = dateobj 138 | datename = 'date_time' 139 | obspd = obspd.rename(columns={'date_time': datename}) 140 | 141 | # Reformat data into hourly obs 142 | # Find mode of minute data: where the hourly metars are 143 | if config['verbose']: 144 | print('get_obs_hourly: finding METAR observation times...') 145 | minutes = [] 146 | for row in obspd.iterrows(): 147 | date = row[1][datename] 148 | minutes.append(date.minute) 149 | minute_count = np.bincount(np.array(minutes)) 150 | rev_count = minute_count[::-1] 151 | minute_mode = minute_count.size - rev_count.argmax() - 1 152 | 153 | if config['verbose']: 154 | print('get_obs_hourly: finding hourly data...') 155 | obs_hourly = obspd[pd.DatetimeIndex(obspd[datename]).minute == minute_mode] 156 | obs_hourly = obs_hourly.set_index(datename) 157 | 158 | # May not have precip if none is recorded 159 | try: 160 | obs_hourly['precip_accum_one_hour'].fillna(0.0, inplace=True) 161 | except KeyError: 162 | obs_hourly['precip_accum_one_hour'] = 0.0 163 | 164 | # Need to reorder the column names 165 | obs_hourly.sort_index(axis=1, inplace=True) 166 | 167 | # Remove any duplicate rows 168 | obs_hourly = obs_hourly[~obs_hourly.index.duplicated(keep='last')] 169 | 170 | # Re-index by hourly. Fills missing with NaNs. Try to interpolate the NaNs. 
171 | expected_start = datetime.strptime(api_date[0], '%Y%m%d%H%M').replace(minute=minute_mode) 172 | expected_end = datetime.strptime(api_date[1], '%Y%m%d%H%M') 173 | expected_times = pd.date_range(expected_start, expected_end, freq='H').to_pydatetime() 174 | obs_hourly = obs_hourly.reindex(expected_times) 175 | obs_hourly = obs_hourly.interpolate(limit=3) 176 | 177 | obs_final = pd.concat((obs_final, obs_hourly)) 178 | 179 | # Remove any duplicate rows from concatenation 180 | obs_final = obs_final[~obs_final.index.duplicated(keep='last')] 181 | 182 | return obs_final 183 | 184 | 185 | def reindex_hourly(df, start, end, interval, end_23z=False, use_rain_max=False): 186 | 187 | def last(values): 188 | return values.iloc[-1] 189 | 190 | if end_23z: 191 | new_end = pd.Timestamp(end.to_pydatetime() - timedelta(hours=1)) 192 | else: 193 | new_end = end 194 | period = pd.date_range(start, end, freq='%dH' % interval) 195 | # Create a column with the new index an ob falls into 196 | df['period'] = (df.index.values > period.values[..., np.newaxis]).sum(0) 197 | df['DateTime'] = df.index.values 198 | aggregate = OrderedDict() 199 | col_names = df.columns.values 200 | for col in col_names: 201 | if not(col.lower().startswith('precip')) and not(col.lower().startswith('rain')): 202 | aggregate[col] = last 203 | else: 204 | if use_rain_max: 205 | aggregate[col] = np.max 206 | else: 207 | aggregate[col] = np.sum 208 | df_reindex = df.loc[start:new_end].groupby('period').agg(aggregate) 209 | try: 210 | df_reindex = df_reindex.drop('period', axis=1) 211 | except (ValueError, KeyError): 212 | pass 213 | df_reindex = df_reindex.set_index('DateTime') 214 | 215 | return df_reindex 216 | 217 | 218 | def obs(config, output_file=None, num_hours=24, interval=3, use_nan_sounding=False, use_existing_sounding=True): 219 | """ 220 | Generates observation data from MesoWest and UCAR soundings and saves to a file, which can later be retrieved for 221 | either training data or model run data. 
222 | 
223 | :param config: 
224 | :param output_file: str: output file path 
225 | :param num_hours: int: number of hours to retrieve obs 
226 | :param interval: int: retrieve obs every 'interval' hours 
227 | :param use_nan_sounding: bool: if True, uses a sounding of NaNs rather than omitting a day if sounding is missing 
228 | :param use_existing_sounding: bool: if True, preferentially uses saved soundings in sounding_data_dir 
229 | :return: 
230 | """ 
231 | if output_file is None: 
232 | output_file = '%s/%s_obs.pkl' % (config['SITE_ROOT'], config['station_id']) 
233 | 
234 | start_date = datetime.strptime(config['data_start_date'], '%Y%m%d') - timedelta(hours=num_hours) 
235 | dates = generate_dates(config) 
236 | api_dates = generate_dates(config, api=True, start_date=start_date) 
237 | 
238 | # Look for desired variables 
239 | vars_request = ['air_temp', 'altimeter', 'precip_accum_one_hour', 'relative_humidity', 
240 | 'wind_speed', 'wind_direction'] 
241 | 
242 | # Add variables to the api request 
243 | vars_api = '' 
244 | for var in vars_request: 
245 | vars_api += var + ',' 
246 | vars_api = vars_api[:-1] 
247 | 
248 | # Units 
249 | units = 'temp|f,precip|in,speed|kts' 
250 | 
251 | # Retrieve station data 
252 | obs_hourly = get_obs_hourly(config, api_dates, vars_api, units) 
253 | 
254 | # Retrieve upper-air sounding data 
255 | if config['verbose']: 
256 | print('obs: retrieving upper-air sounding data') 
257 | soundings = OrderedDict() 
258 | if config['Obs']['use_soundings']: 
259 | for date in dates: 
260 | soundings[date] = OrderedDict() 
261 | start_date = date - timedelta(days=1) # get the previous day's soundings 
262 | for hour in [0, 12]: 
263 | sounding_date = start_date + timedelta(hours=hour) 
264 | try: 
265 | sounding = upper_air(config, sounding_date, use_nan_sounding, use_existing=use_existing_sounding) 
266 | soundings[date][sounding_date] = sounding 
267 | except: 
268 | print('obs: warning: problem retrieving soundings for %s' % datetime.strftime(date, '%Y%m%d')) 
269 | soundings.pop(date) 
270 | break 
271 | 
272 | # Create dictionary of days 
273 | if config['verbose']: 
274 | print('obs: converting to output dictionary') 
275 | obs_export = OrderedDict({'SFC': OrderedDict(), 
276 | 'SNDG': OrderedDict()}) 
277 | for date in dates: 
278 | if config['Obs']['use_soundings'] and date not in soundings.keys(): 
279 | continue 
280 | # Need to ensure we use the right intervals to have 22:5? Z obs 
281 | start = pd.Timestamp((date - timedelta(hours=num_hours))) 
282 | end = pd.Timestamp(date) 
283 | obs_export['SFC'][date] = reindex_hourly(obs_hourly, start, end, interval, 
284 | end_23z=True).to_dict(into=OrderedDict) 
285 | if config['Obs']['use_soundings']: 
286 | obs_export['SNDG'][date] = soundings[date] 
287 | 
288 | # Export final data 
289 | if config['verbose']: 
290 | print('obs: -> exporting to %s' % output_file) 
291 | with open(output_file, 'wb') as handle: 
292 | pickle.dump(obs_export, handle, protocol=pickle.HIGHEST_PROTOCOL) 
293 | 
294 | return 
295 | 
296 | 
297 | def process(config, obs): 
298 | """ 
299 | Returns a numpy array of obs for use in mosx_predictors. The first dimension is date; all other dimensions are 
300 | serialized. 
301 | 302 | :param config: 303 | :param obs: dict: dictionary of processed obs data 304 | :return: 305 | """ 306 | if config['verbose']: 307 | print('obs.process: processing array for obs data...') 308 | 309 | # Surface observations 310 | sfc = obs['SFC'] 311 | num_days = len(sfc.keys()) 312 | variables = sorted(sfc[sfc.keys()[0]].keys()) 313 | sfc_array = get_array(sfc) 314 | sfc_array_r = np.reshape(sfc_array, (num_days, -1)) 315 | 316 | # Sounding observations 317 | if config['Obs']['use_soundings']: 318 | sndg_array = get_array(obs['SNDG']) 319 | # num_days should be the same first dimension 320 | sndg_array_r = np.reshape(sndg_array, (num_days, -1)) 321 | return np.hstack((sfc_array_r, sndg_array_r)) 322 | else: 323 | return sfc_array_r 324 | -------------------------------------------------------------------------------- /mosx/util.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Utilities for the MOS-X model. 9 | """ 10 | 11 | import os 12 | import numpy as np 13 | import pandas as pd 14 | from datetime import datetime, timedelta 15 | import pickle 16 | 17 | 18 | # ==================================================================================================================== # 19 | # Classes 20 | # ==================================================================================================================== # 21 | 22 | 23 | # ==================================================================================================================== # 24 | # Config functions 25 | # ==================================================================================================================== # 26 | 27 | def walk_kwargs(section, key): 28 | value = section[key] 29 | try: 30 | section[key] = int(value) 31 | except (TypeError, ValueError): 32 | try: 33 | section[key] = float(value) 34 | except (TypeError, ValueError): 35 | pass 36 | 37 | 38 | def get_config(config_path): 39 | """ 40 | Retrieve the config object from config_path. 
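The '.config' suffix may be omitted; if the file is not found as given, config_path + '.config' is tried. A minimal call:

    config = get_config('KSEA')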
41 | 42 | :param config_path: str: full path to config file 43 | :return: 44 | """ 45 | import configobj 46 | from validate import Validator 47 | 48 | dir_path = os.path.dirname(os.path.realpath(__file__)) 49 | config_spec = '%s/configspec' % dir_path 50 | 51 | try: 52 | config = configobj.ConfigObj(config_path, configspec=config_spec, file_error=True) 53 | except IOError: 54 | try: 55 | config = configobj.ConfigObj(config_path+'.config', configspec=config_spec, file_error=True) 56 | except IOError: 57 | print('Error: unable to open configuration file %s' % config_path) 58 | raise 59 | except configobj.ConfigObjError as e: 60 | print('Error while parsing configuration file %s' % config_path) 61 | print("*** Reason: '%s'" % e) 62 | raise 63 | 64 | config.validate(Validator()) 65 | 66 | # Make sure site_directory is there 67 | if config['SITE_ROOT'] == '': 68 | config['SITE_ROOT'] = '%(MOSX_ROOT)s/site_data' 69 | 70 | # Make sure BUFR parameters have defaults 71 | if config['BUFR']['bufr_station_id'] == '': 72 | config['BUFR']['bufr_station_id'] = '%(station_id)s' 73 | if config['BUFR']['bufr_data_dir'] == '': 74 | config['BUFR']['bufr_data_dir'] = '%(SITE_ROOT)s/bufkit' 75 | if config['BUFR']['bufrgruven'] == '': 76 | config['BUFR']['bufrgruven'] = '%(BUFR_ROOT)s/bufr_gruven.pl' 77 | 78 | # Make sure Obs parameters have defaults 79 | if config['Obs']['sounding_data_dir'] == '': 80 | config['Obs']['sounding_data_dir'] = '%(SITE_ROOT)s/soundings' 81 | 82 | # Add in a list for BUFR models 83 | config['BUFR']['bufr_models'] = [] 84 | for model in config['BUFR']['models']: 85 | if model.upper() == 'GFS': 86 | config['BUFR']['bufr_models'].append(['gfs3', 'gfs']) 87 | else: 88 | config['BUFR']['bufr_models'].append(model.lower()) 89 | 90 | # Convert kwargs, Rain tuning, Ada boosting, and Bootstrapping, if available, to int or float types 91 | config['Model']['Parameters'].walk(walk_kwargs) 92 | try: 93 | config['Model']['Ada boosting'].walk(walk_kwargs) 94 | except KeyError: 95 | pass 96 | try: 97 | config['Model']['Rain tuning'].walk(walk_kwargs) 98 | except KeyError: 99 | pass 100 | try: 101 | config['Model']['Bootstrapping'].walk(walk_kwargs) 102 | except KeyError: 103 | pass 104 | 105 | return config 106 | 107 | 108 | # ==================================================================================================================== # 109 | # Utility functions 110 | # ==================================================================================================================== # 111 | 112 | def get_object(module_class): 113 | """ 114 | Given a string with a module class name, it imports and returns the class. 115 | This function (c) Tom Keffer, weeWX. 116 | """ 117 | 118 | # Split the path into its parts 119 | parts = module_class.split('.') 120 | # Strip off the classname: 121 | module = '.'.join(parts[:-1]) 122 | # Import the top level module 123 | mod = __import__(module) 124 | # Recursively work down from the top level module to the class name. 125 | # Be prepared to catch an exception if something cannot be found. 126 | try: 127 | for part in parts[1:]: 128 | mod = getattr(mod, part) 129 | except AttributeError: 130 | # Can't find something. 
Give a more informative error message: 131 | raise AttributeError("Module '%s' has no attribute '%s' when searching for '%s'" % 132 | (mod.__name__, part, module_class)) 133 | return mod 134 | 135 | 136 | def generate_dates(config, api=False, start_date=None, end_date=None, api_add_hour=0): 137 | """ 138 | Returns all of the dates requested from the config. If api is True, then returns a list of (start_date, end_date) 139 | tuples split by year in strings formatted for the MesoWest API call. If api is False, then returns a list of all 140 | dates as datetime objects. start_date and end_date are available as options as certain calls require addition of 141 | some data for prior days. 142 | 143 | :param config: 144 | :param api: bool: if True, returns dates formatted for MesoWest API call 145 | :param start_date: str: starting date in config file format (YYYYMMDD) 146 | :param end_date: str: ending date in config file format (YYYYMMDD) 147 | :param api_add_hour: int: add this number of hours to the end of the call, useful for getting up to 6Z on last day 148 | :return: 149 | """ 150 | if start_date is None: 151 | start_date = datetime.strptime(config['data_start_date'], '%Y%m%d') 152 | if end_date is None: 153 | end_date = datetime.strptime(config['data_end_date'], '%Y%m%d') 154 | start_dt = start_date 155 | end_dt = end_date 156 | if start_dt > end_dt: 157 | raise ValueError('Start date must be before end date; check MOSX_INFILE.') 158 | end_year = end_dt.year + 1 159 | time_added = timedelta(hours=api_add_hour) 160 | all_dates = [] 161 | if config['is_season']: 162 | if start_dt > datetime(start_dt.year, end_dt.month, end_dt.day): 163 | # Season crosses new year 164 | end_year -= 1 165 | for year in range(start_dt.year, end_year): 166 | if start_dt > datetime(start_dt.year, end_dt.month, end_dt.day): 167 | # Season crosses new year 168 | year2 = year + 1 169 | else: 170 | year2 = year 171 | if api: 172 | year_start = datetime.strftime(datetime(year, start_dt.month, start_dt.day), '%Y%m%d0000') 173 | year_end = datetime.strftime(datetime(year2, end_dt.month, end_dt.day) + time_added, '%Y%m%d%H00') 174 | all_dates.append((year_start, year_end)) 175 | else: 176 | year_dates = pd.date_range(datetime(year, start_dt.month, start_dt.day), 177 | datetime(year2, end_dt.month, end_dt.day), freq='D') 178 | for date in year_dates: 179 | all_dates.append(date.to_pydatetime()) 180 | 181 | else: 182 | if api: 183 | for year in range(start_dt.year, end_year): 184 | if year == start_dt.year: 185 | year_start = datetime.strftime(datetime(year, start_dt.month, start_dt.day), '%Y%m%d0000') 186 | else: 187 | year_start = datetime.strftime(datetime(year, 1, 1), '%Y%m%d0000') 188 | if year == end_dt.year: 189 | year_end = datetime.strftime(datetime(year, end_dt.month, end_dt.day) + time_added, '%Y%m%d%H00') 190 | else: 191 | year_end = datetime.strftime(datetime(year+1, 1, 1) + time_added, '%Y%m%d%H00') 192 | all_dates.append((year_start, year_end)) 193 | else: 194 | pd_dates = pd.date_range(start_dt, end_dt, freq='D') 195 | for date in pd_dates: 196 | all_dates.append(date.to_pydatetime()) 197 | return all_dates 198 | 199 | 200 | def find_matching_dates(bufr, obs, verif, return_data=False): 201 | """ 202 | Finds dates which match in all three dictionaries. If return_data is True, returns the input dictionaries with only 203 | common dates retained. verif may be None if running the model. 
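A typical sequence when assembling predictors (file names are illustrative):

    bufr, obs, verif = unpickle('KSEA_bufr.pkl', 'KSEA_obs.pkl', 'KSEA_verif.pkl')
    bufr, obs, verif, dates = find_matching_dates(bufr, obs, verif, return_data=True)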
204 | 
205 |     :param bufr: dict: dictionary of processed BUFR data
206 |     :param obs: dict: dictionary of processed OBS data
207 |     :param verif: dict: dictionary of processed VERIFICATION data
208 |     :param return_data: bool: if True, returns edited data dictionaries containing only matching dates' data
209 |     :return: list of dates[, new BUFR, OBS, and VERIF dictionaries]
210 |     """
211 |     obs_dates = obs['SFC'].keys()
212 |     if verif is not None:
213 |         verif_dates = verif.keys()
214 |     # For BUFR dates, find the dates common to all models
215 |     bufr_dates_list = [bufr['SFC'][key].keys() for key in bufr['SFC'].keys()]
216 |     bufr_dates = bufr_dates_list[0]
217 |     for m in range(1, len(bufr_dates_list)):
218 |         bufr_dates = set(bufr_dates).intersection(set(bufr_dates_list[m]))
219 |     if verif is not None:
220 |         all_dates = (set(verif_dates).intersection(set(obs_dates))).intersection(bufr_dates)
221 |     else:
222 |         all_dates = set(obs_dates).intersection(bufr_dates)
223 |     if len(all_dates) == 0:
224 |         raise ValueError('Sorry, no matching dates found in data!')
225 |     print('find_matching_dates: found %d matching dates.' % len(all_dates))
226 |     if return_data:
227 |         for lev in ['SFC', 'PROF', 'DAY']:
228 |             for model in bufr[lev].keys():
229 |                 for date in bufr[lev][model].keys():
230 |                     if date not in all_dates:
231 |                         bufr[lev][model].pop(date, None)
232 |         for date in obs_dates:
233 |             if date not in all_dates:
234 |                 obs['SFC'].pop(date, None)
235 |                 obs['SNDG'].pop(date, None)
236 |         if verif is not None:
237 |             for date in verif_dates:
238 |                 if date not in all_dates:
239 |                     verif.pop(date, None)
240 |         return bufr, obs, verif, sorted(list(all_dates))
241 |     else:
242 |         return sorted(list(all_dates))
243 | 
244 | 
245 | def get_array(dictionary):
246 |     """
247 |     Transforms a nested dictionary into an n-dimensional numpy array, assuming that each nested sub-dictionary has
248 |     the same structure and that the values of the innermost dictionary are either lists or single float values.
249 |     Function _get_array is its recursive helper.
250 | 
251 |     :param dictionary:
252 |     :return:
253 |     """
254 |     dim_list = []
255 |     d = dictionary
256 |     while isinstance(d, dict):
257 |         dim_list.append(len(d.keys()))
258 |         d = d.values()[0]
259 |     try:
260 |         dim_list.append(len(d))
261 |     except TypeError:  # innermost values are scalars, not lists
262 |         pass
263 |     out_array = np.full(dim_list, np.nan, dtype=np.float64)
264 |     _get_array(dictionary, out_array)
265 |     return out_array
266 | 
267 | 
268 | def _get_array(dictionary, out_array):
269 |     if dictionary == {}:  # in case there's an empty dict for any reason
270 |         return
271 |     if isinstance(dictionary.values()[0], list):
272 |         for i, L in enumerate(dictionary.values()):
273 |             out_array[i, :] = np.asarray(L)
274 |     elif isinstance(dictionary.values()[0], float):
275 |         for i, L in enumerate(dictionary.values()):
276 |             out_array[i] = L
277 |     else:
278 |         for i, d in enumerate(dictionary.values()):
279 |             _get_array(d, out_array[i, :])
280 | 
281 | 
282 | def unpickle(bufr_file, obs_file, verif_file):
283 |     """
284 |     Shortcut function to unpickle bufr, obs, and verif files all at once. verif_file may be None if running the model.
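| 
|     Example (hypothetical file paths):
| 
|     >>> bufr, obs, verif = unpickle('KSEA_bufr.pkl', 'KSEA_obs.pkl', 'KSEA_verif.pkl')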
285 | 
286 |     :param bufr_file: str: full path to pickled BUFR data file
287 |     :param obs_file: str: full path to pickled OBS data file
288 |     :param verif_file: str: full path to pickled VERIFICATION data file
289 |     :return:
290 |     """
291 |     print('util: loading BUFKIT data from %s' % bufr_file)
292 |     with open(bufr_file, 'rb') as handle:
293 |         bufr = pickle.load(handle)
294 |     print('util: loading OBS data from %s' % obs_file)
295 |     with open(obs_file, 'rb') as handle:
296 |         obs = pickle.load(handle)
297 |     if verif_file is not None:
298 |         print('util: loading VERIFICATION data from %s' % verif_file)
299 |         with open(verif_file, 'rb') as handle:
300 |             verif = pickle.load(handle)
301 |     else:
302 |         verif = None
303 |     return bufr, obs, verif
304 | 
305 | 
306 | def get_ghcn_stid(config):
307 |     """
308 |     After code by Luke Madaus.
309 | 
310 |     Gets the GHCN station ID from the 4-letter station ID.
311 |     """
312 |     main_addr = 'ftp://ftp.ncdc.noaa.gov/pub/data/noaa'
313 | 
314 |     site_directory = config['SITE_ROOT']
315 |     stid = config['station_id']
316 |     # Check to see that isd-history.txt exists
317 |     stations_file = 'isd-history.txt'
318 |     stations_filename = '%s/%s' % (site_directory, stations_file)
319 |     if not os.path.exists(stations_filename):
320 |         print('get_ghcn_stid: downloading site name database')
321 |         try:
322 |             from urllib2 import urlopen
323 |             response = urlopen('%s/%s' % (main_addr, stations_file))
324 |             with open(stations_filename, 'w') as f:
325 |                 f.write(response.read())
326 |         except BaseException as e:
327 |             print('get_ghcn_stid: unable to download site name database')
328 |             print("*** Reason: '%s'" % str(e))
329 | 
330 |     # Now open this file and look for our site ID
331 |     site_found = False
332 |     infile = open(stations_filename, 'r')
333 |     station_wbans = []
334 |     station_ghcns = []
335 |     for line in infile:
336 |         if stid.upper() in line:
337 |             linesp = line.split()
338 |             if (not linesp[0].startswith('99999') and not site_found
339 |                     and not linesp[1].startswith('99999')):
340 |                 try:
341 |                     site_wban = int(linesp[0])
342 |                     station_ghcn = int(linesp[1])
343 |                     # site_found = True
344 |                     print('get_ghcn_stid: site found for %s (%s)' %
345 |                           (stid, station_ghcn))
346 |                     station_wbans.append(site_wban)
347 |                     station_ghcns.append(station_ghcn)
348 |                 except ValueError:  # skip lines where the IDs are not integers
349 |                     continue
350 |     if len(station_wbans) == 0:
351 |         raise ValueError('get_ghcn_stid error: no station found for %s' % stid)
352 | 
353 |     # Format station as USW...
354 |     usw_format = 'USW000%05d'
355 |     return usw_format % station_ghcns[0]
356 | 
357 | 
358 | # ==================================================================================================================== #
359 | # Conversion functions
360 | # ==================================================================================================================== #
361 | 
362 | def dewpoint(T, RH):
363 |     """
364 |     Calculates dewpoint from T in Fahrenheit and RH in percent, using the Magnus approximation.
365 |     """
366 | 
367 |     def FtoC(T):
368 |         return (T - 32.) / 9. * 5.
369 | 
370 |     def CtoF(T):
371 |         return 9. / 5. * T + 32.
372 | 
373 |     b = 17.67
374 |     c = 243.5  # deg C
375 | 
376 |     def gamma(T, RH):
377 |         return np.log(RH / 100.) + b * T / (c + T)
378 | 
379 |     T = FtoC(T)
380 |     TD = c * gamma(T, RH) / (b - gamma(T, RH))
381 |     return CtoF(TD)
382 | 
383 | 
384 | def to_bool(x):
385 |     """Convert an object to boolean.
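| 
|     Accepts 'true'/'yes'/'false'/'no' strings (case-insensitive), booleans, and anything int() can coerce;
|     any other value raises ValueError.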
386 | 387 | Examples: 388 | >>> print to_bool('TRUE') 389 | True 390 | >>> print to_bool(True) 391 | True 392 | >>> print to_bool(1) 393 | True 394 | >>> print to_bool('FALSE') 395 | False 396 | >>> print to_bool(False) 397 | False 398 | >>> print to_bool(0) 399 | False 400 | >>> print to_bool('Foo') 401 | Traceback (most recent call last): 402 | ValueError: Unknown boolean specifier: 'Foo'. 403 | >>> print to_bool(None) 404 | Traceback (most recent call last): 405 | ValueError: Unknown boolean specifier: 'None'. 406 | 407 | This function (c) Tom Keffer, weeWX. 408 | """ 409 | try: 410 | if x.lower() in ['true', 'yes']: 411 | return True 412 | elif x.lower() in ['false', 'no']: 413 | return False 414 | except AttributeError: 415 | pass 416 | try: 417 | return bool(int(x)) 418 | except (ValueError, TypeError): 419 | pass 420 | raise ValueError("Unknown boolean specifier: '%s'." % x) 421 | -------------------------------------------------------------------------------- /mosx/verification/__init__.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing VERIFICATION data. 9 | """ 10 | 11 | from .methods import * 12 | -------------------------------------------------------------------------------- /mosx/verification/methods.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2018 Jonathan Weyn 3 | # 4 | # See the file LICENSE for your rights. 5 | # 6 | 7 | """ 8 | Methods for processing VERIFICATION data. 9 | """ 10 | 11 | import os 12 | import re 13 | import numpy as np 14 | import pandas as pd 15 | from datetime import datetime, timedelta 16 | import pickle 17 | import requests 18 | from collections import OrderedDict 19 | from mosx.MesoPy import Meso 20 | from mosx.obs.methods import get_obs_hourly, reindex_hourly 21 | from mosx.util import generate_dates, get_array, get_ghcn_stid 22 | 23 | 24 | def get_cf6_files(config, num_files=1): 25 | """ 26 | After code by Luke Madaus 27 | 28 | Retrieves CF6 climate verification data released by the NWS. Parameter num_files determines how many recent files 29 | are downloaded. 30 | """ 31 | 32 | # Create directory if it does not exist 33 | site_directory = config['SITE_ROOT'] 34 | 35 | # Construct the web url address. Check if a special 3-letter station ID is provided. 
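|     # For example, with the default fallback below, a station_id of 'KSEA' yields the 3-letter ID 'SEA', so the
|     # request URL becomes http://forecast.weather.gov/product.php?site=NWS&issuedby=SEA&product=CF6&format=TXT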
36 | nws_url = 'http://forecast.weather.gov/product.php?site=NWS&issuedby=%s&product=CF6&format=TXT' 37 | try: 38 | stid3 = config['station_id3'] 39 | except KeyError: 40 | stid3 = config['station_id'][1:].upper() 41 | nws_url = nws_url % stid3 42 | 43 | # Determine how many files (iterations of product) we want to fetch 44 | if num_files == 1: 45 | if config['verbose']: 46 | print('get_cf6_files: retrieving latest CF6 file for %s' % config['station_id']) 47 | else: 48 | if config['verbose']: 49 | print('get_cf6_files: retrieving %s archived CF6 files for %s' % (num_files, config['station_id'])) 50 | 51 | # Fetch files 52 | for r in range(1, num_files + 1): 53 | # Format the web address: goes through 'versions' on NWS site which correspond to increasingly older files 54 | version = 'version=%d&glossary=0' % r 55 | nws_site = '&'.join((nws_url, version)) 56 | response = requests.get(nws_site) 57 | cf6_data = response.text 58 | 59 | # Remove the header 60 | try: 61 | body_and_footer = cf6_data.split('CXUS')[1] # Mainland US 62 | except IndexError: 63 | try: 64 | body_and_footer = cf6_data.split('CXHW')[1] # Hawaii 65 | except IndexError: 66 | body_and_footer = cf6_data.split('CXAK')[1] # Alaska 67 | body_and_footer_lines = body_and_footer.splitlines() 68 | if len(body_and_footer_lines) <= 2: 69 | body_and_footer = cf6_data.split('000')[2] 70 | 71 | # Remove the footer 72 | body = body_and_footer.split('[REMARKS]')[0] 73 | 74 | # Find the month and year of the file 75 | current_year = re.search('YEAR: *(\d{4})', body).groups()[0] 76 | try: 77 | current_month = re.search('MONTH: *(\D{3,9})', body).groups()[0] 78 | current_month = current_month.strip() # Gets rid of newlines and whitespace 79 | datestr = '%s %s' % (current_month, current_year) 80 | file_date = datetime.strptime(datestr, '%B %Y') 81 | except: # Some files have a different formatting, although this may be fixed now. 82 | current_month = re.search('MONTH: *(\d{2})', body).groups()[0] 83 | current_month = current_month.strip() 84 | datestr = '%s %s' % (current_month, current_year) 85 | file_date = datetime.strptime(datestr, '%m %Y') 86 | 87 | # Write to a temporary file, check if output file exists, and if so, make sure the new one has more data 88 | datestr = file_date.strftime('%Y%m') 89 | filename = '%s/%s_%s.cli' % (site_directory, config['station_id'].upper(), datestr) 90 | temp_file = '%s/temp.cli' % site_directory 91 | with open(temp_file, 'w') as out: 92 | out.write(body) 93 | 94 | def file_len(file_name): 95 | with open(file_name) as f: 96 | for i, l in enumerate(f): 97 | pass 98 | return i + 1 99 | 100 | if os.path.isfile(filename): 101 | old_file_len = file_len(filename) 102 | new_file_len = file_len(temp_file) 103 | if old_file_len < new_file_len: 104 | if config['verbose']: 105 | print('get_cf6_files: overwriting %s' % filename) 106 | os.remove(filename) 107 | os.rename(temp_file, filename) 108 | else: 109 | if config['verbose']: 110 | print('get_cf6_files: %s already exists' % filename) 111 | else: 112 | if config['verbose']: 113 | print('get_cf6_files: writing %s' % filename) 114 | os.rename(temp_file, filename) 115 | 116 | 117 | def _cf6_wind(config): 118 | """ 119 | After code by Luke Madaus 120 | 121 | This function is used internally only. 122 | 123 | Generates wind verification values from climate CF6 files stored in SITE_ROOT. These files can be generated 124 | externally by get_cf6_files.py. 
This function is not necessary if climo data from _climo_wind is found, except for 125 | recent values which may not be in the NCDC database yet. 126 | 127 | :param config: 128 | :return: dict: wind values from CF6 files 129 | """ 130 | 131 | if config['verbose']: 132 | print('_cf6_wind: searching for CF6 files in %s' % config['SITE_ROOT']) 133 | allfiles = os.listdir(config['SITE_ROOT']) 134 | filelist = [f for f in allfiles if f.startswith(config['station_id'].upper()) and f.endswith('.cli')] 135 | filelist.sort() 136 | if len(filelist) == 0: 137 | raise IOError('No CF6 files found.') 138 | if config['verbose']: 139 | print('_cf6_wind: found %d CF6 files.' % len(filelist)) 140 | 141 | # Interpret CF6 files 142 | if config['verbose']: 143 | print('_cf6_wind: reading CF6 files') 144 | cf6_values = {} 145 | for file in filelist: 146 | year, month = re.search('(\d{4})(\d{2})', file).groups() 147 | infile = open('%s/%s' % (config['SITE_ROOT'], file), 'r') 148 | for line in infile: 149 | matcher = re.compile( 150 | '( \d|\d{2}) ( \d{2}|-\d{2}| \d| -\d|\d{3})') 151 | if matcher.match(line): 152 | # We've found an ob line! 153 | lsp = line.split() 154 | day = int(lsp[0]) 155 | curdt = datetime(int(year), int(month), day) 156 | cf6_values[curdt] = {} 157 | # Wind 158 | if lsp[11] == 'M': 159 | cf6_values[curdt]['wind'] = 0.0 160 | else: 161 | cf6_values[curdt]['wind'] = float(lsp[11]) * 0.868976 162 | 163 | return cf6_values 164 | 165 | 166 | def _climo_wind(config, dates=None): 167 | """ 168 | Fetches climatological wind data using ulmo package to retrieve NCDC archives. 169 | 170 | :param config: 171 | :param dates: list of datetime objects 172 | :return: dict: dictionary of wind values 173 | """ 174 | import ulmo 175 | 176 | if config['verbose']: 177 | print('_climo_wind: fetching data from NCDC (may take a while)...') 178 | v = 'WSF2' 179 | wind_dict = {} 180 | D = ulmo.ncdc.ghcn_daily.get_data(get_ghcn_stid(config), as_dataframe=True, elements=[v]) 181 | 182 | if dates is None: 183 | dates = list(D[v].index.to_timestamp().to_pydatetime()) 184 | for date in dates: 185 | wind_dict[date] = {'wind': D[v].loc[date]['value'] / 10. * 1.94384} 186 | 187 | return wind_dict 188 | 189 | 190 | def pop_rain(series): 191 | """ 192 | Converts a series of rain values into 0 or 1 depending on whether there is measurable rain 193 | :param series: 194 | :return: 195 | """ 196 | new_series = series.copy() 197 | new_series[series >= 0.01] = 1. 198 | new_series[series < 0.01] = 0. 199 | return new_series 200 | 201 | 202 | def categorical_rain(series): 203 | """ 204 | Converts a series of rain values into categorical precipitation quantities a la MOS. 205 | :param series: 206 | :return: 207 | """ 208 | new_series = series.copy() 209 | for j in range(len(series)): 210 | if series.iloc[j] < 0.01: 211 | new_series.iloc[j] = 0. 212 | elif series.iloc[j] < 0.10: 213 | new_series.iloc[j] = 1. 214 | elif series.iloc[j] < 0.25: 215 | new_series.iloc[j] = 2. 216 | elif series.iloc[j] < 0.50: 217 | new_series.iloc[j] = 3. 218 | elif series.iloc[j] < 1.00: 219 | new_series.iloc[j] = 4. 220 | elif series.iloc[j] < 2.00: 221 | new_series.iloc[j] = 5. 222 | elif series.iloc[j] >= 2.00: 223 | new_series.iloc[j] = 6. 224 | else: # missing, or something else that's strange 225 | new_series.iloc[j] = 0. 
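|     # Resulting categories (a la MOS): 0: <0.01", 1: 0.01-0.09", 2: 0.10-0.24", 3: 0.25-0.49",
|     # 4: 0.50-0.99", 5: 1.00-1.99", 6: >=2.00". Missing values also map to category 0.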
226 | return new_series 227 | 228 | 229 | def verification(config, output_file=None, use_cf6=True, use_climo=True, force_rain_quantity=False): 230 | """ 231 | Generates verification data from MesoWest and saves to a file, which is used to train the model and check test 232 | results. 233 | 234 | :param config: 235 | :param output_file: str: path to output file 236 | :param use_cf6: bool: if True, uses wind values from CF6 files 237 | :param use_climo: bool: if True, uses wind values from NCDC climatology 238 | :param force_rain_quantity: if True, returns the actual quantity of rain (rather than POP); useful for validation 239 | files 240 | :return: 241 | """ 242 | if output_file is None: 243 | output_file = '%s/%s_verif.pkl' % (config['SITE_ROOT'], config['station_id']) 244 | 245 | dates = generate_dates(config) 246 | api_dates = generate_dates(config, api=True, api_add_hour=config['forecast_hour_start'] + 24) 247 | 248 | # Read new data for daily values 249 | m = Meso(token=config['meso_token']) 250 | 251 | if config['verbose']: 252 | print('verification: MesoPy initialized for station %s' % config['station_id']) 253 | print('verification: retrieving latest obs and metadata') 254 | latest = m.latest(stid=config['station_id']) 255 | obs_list = list(latest['STATION'][0]['SENSOR_VARIABLES'].keys()) 256 | 257 | # Look for desired variables 258 | vars_request = ['air_temp', 'wind_speed', 'precip_accum_one_hour'] 259 | vars_option = ['air_temp_low_6_hour', 'air_temp_high_6_hour', 'precip_accum_six_hour'] 260 | 261 | # Add variables to the api request if they exist 262 | if config['verbose']: 263 | print('verification: searching for 6-hourly variables...') 264 | for var in vars_option: 265 | if var in obs_list: 266 | if config['verbose']: 267 | print('verification: found variable %s, adding to data' % var) 268 | vars_request += [var] 269 | vars_api = '' 270 | for var in vars_request: 271 | vars_api += var + ',' 272 | vars_api = vars_api[:-1] 273 | 274 | # Units 275 | units = 'temp|f,precip|in,speed|kts' 276 | 277 | # Retrieve data 278 | obspd = pd.DataFrame() 279 | for api_date in api_dates: 280 | if config['verbose']: 281 | print('verification: retrieving data from %s to %s' % api_date) 282 | obs = m.timeseries(stid=config['station_id'], start=api_date[0], end=api_date[1], vars=vars_api, units=units) 283 | obspd = pd.concat((obspd, pd.DataFrame.from_dict(obs['STATION'][0]['OBSERVATIONS'])), ignore_index=True) 284 | 285 | # Rename columns to requested vars 286 | obs_var_names = obs['STATION'][0]['SENSOR_VARIABLES'] 287 | obs_var_keys = list(obs_var_names.keys()) 288 | col_names = list(map(''.join, obspd.columns.values)) 289 | for c in range(len(col_names)): 290 | col = col_names[c] 291 | for k in range(len(obs_var_keys)): 292 | key = obs_var_keys[k] 293 | if col == list(obs_var_names[key].keys())[0]: 294 | col_names[c] = key 295 | obspd.columns = col_names 296 | 297 | # Make sure we have columns for all requested variables 298 | for var in vars_request: 299 | if var not in col_names: 300 | obspd = obspd.assign(**{var: np.nan}) 301 | 302 | # Change datetime column to datetime object, subtract 6 hours to use 6Z days 303 | if config['verbose']: 304 | print('verification: setting time back %d hours for daily statistics' % config['forecast_hour_start']) 305 | dateobj = pd.to_datetime(obspd['date_time']) - timedelta(hours=config['forecast_hour_start']) 306 | obspd['date_time'] = dateobj 307 | datename = 'date_time_minus_%d' % config['forecast_hour_start'] 308 | obspd = 
obspd.rename(columns={'date_time': datename}) 309 | 310 | # Reformat data into hourly and daily 311 | # Hourly 312 | def hour(dates): 313 | date = dates.iloc[0] 314 | return datetime(date.year, date.month, date.day, date.hour) 315 | 316 | def last(values): 317 | return values.iloc[-1] 318 | 319 | aggregate = {datename: hour} 320 | if 'air_temp_high_6_hour' in vars_request and 'air_temp_low_6_hour' in vars_request: 321 | aggregate['air_temp_high_6_hour'] = np.max 322 | aggregate['air_temp_low_6_hour'] = np.min 323 | aggregate['air_temp'] = {'air_temp_max': np.max, 'air_temp_min': np.min} 324 | if 'precip_accum_six_hour' in vars_request: 325 | aggregate['precip_accum_six_hour'] = np.max 326 | aggregate['wind_speed'] = np.max 327 | aggregate['precip_accum_one_hour'] = np.max 328 | 329 | if config['verbose']: 330 | print('verification: grouping data by hour for hourly observations') 331 | # Note that obs in hour H are reported at hour H, not H+1 332 | obs_hourly = obspd.groupby([pd.DatetimeIndex(obspd[datename]).year, 333 | pd.DatetimeIndex(obspd[datename]).month, 334 | pd.DatetimeIndex(obspd[datename]).day, 335 | pd.DatetimeIndex(obspd[datename]).hour]).agg(aggregate) 336 | # Rename columns 337 | col_names = obs_hourly.columns.values 338 | col_names_new = [] 339 | for c in range(len(col_names)): 340 | if col_names[c][0] == 'air_temp': 341 | col_names_new.append(col_names[c][1]) 342 | else: 343 | col_names_new.append(col_names[c][0]) 344 | 345 | obs_hourly.columns = col_names_new 346 | 347 | # Daily 348 | def day(dates): 349 | date = dates.iloc[0] 350 | return datetime(date.year, date.month, date.day) 351 | 352 | aggregate[datename] = day 353 | aggregate['air_temp_min'] = np.min 354 | aggregate['air_temp_max'] = np.max 355 | aggregate['precip_accum_six_hour'] = np.sum 356 | try: 357 | aggregate.pop('air_temp') 358 | except: 359 | pass 360 | 361 | if config['verbose']: 362 | print('verification: grouping data by day for daily verifications') 363 | obs_daily = obs_hourly.groupby([pd.DatetimeIndex(obs_hourly[datename]).year, 364 | pd.DatetimeIndex(obs_hourly[datename]).month, 365 | pd.DatetimeIndex(obs_hourly[datename]).day]).agg(aggregate) 366 | 367 | if config['verbose']: 368 | print('verification: checking matching dates for daily obs and CF6') 369 | if use_climo: 370 | try: 371 | climo_values = _climo_wind(config, dates) 372 | except BaseException as e: 373 | if config['verbose']: 374 | print("verification: warning: '%s' while reading climo data" % str(e)) 375 | climo_values = {} 376 | else: 377 | if config['verbose']: 378 | print('verification: not using climo.') 379 | climo_values = {} 380 | if use_cf6: 381 | num_months = min((datetime.utcnow() - dates[0]).days / 30, 24) 382 | try: 383 | get_cf6_files(config, num_months) 384 | except BaseException as e: 385 | if config['verbose']: 386 | print("verification: warning: '%s' while getting CF6 files" % str(e)) 387 | try: 388 | cf6_values = _cf6_wind(config) 389 | except BaseException as e: 390 | if config['verbose']: 391 | print("verification: warning: '%s' while reading CF6 files" % str(e)) 392 | cf6_values = {} 393 | else: 394 | if config['verbose']: 395 | print('verification: not using CF6.') 396 | cf6_values = {} 397 | climo_values.update(cf6_values) # CF6 has precedence 398 | count_rows = 0 399 | for index, row in obs_daily.iterrows(): 400 | date = row[datename] 401 | if date in climo_values.keys(): 402 | count_rows += 1 403 | obs_wind = row['wind_speed'] 404 | cf6_wind = climo_values[date]['wind'] 405 | if not (np.isnan(cf6_wind)): 406 | 
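|                 # Prefer the CF6/climo wind value unless the MesoWest ob is much (>= 5 kt) higher,
|                 # in which case the ob is kept and a warning is printed below.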
if obs_wind - cf6_wind >= 5: 407 | print('verification: warning: obs wind for %s much larger than wind from cf6/climo; using obs' % 408 | date) 409 | else: 410 | obs_daily.loc[index, 'wind_speed'] = cf6_wind 411 | else: 412 | count_rows -= 1 413 | if config['verbose']: 414 | print('verification: found %d matching rows.' % count_rows) 415 | 416 | # Round 417 | round_dict = {'wind_speed': 0} 418 | if 'air_temp_high_6_hour' in vars_request: 419 | round_dict['air_temp_high_6_hour'] = 0 420 | if 'air_temp_low_6_hour' in vars_request: 421 | round_dict['air_temp_low_6_hour'] = 0 422 | round_dict['air_temp_max'] = 0 423 | round_dict['air_temp_min'] = 0 424 | if 'precip_accum_six_hour' in vars_request: 425 | round_dict['precip_accum_six_hour'] = 2 426 | round_dict['precip_accum_one_hour'] = 2 427 | obs_daily = obs_daily.round(round_dict) 428 | 429 | # Generation of final output data 430 | if config['verbose']: 431 | print('verification: generating final verification dictionary...') 432 | if 'air_temp_high_6_hour' in vars_request: 433 | obs_daily.rename(columns={'air_temp_high_6_hour': 'Tmax'}, inplace=True) 434 | else: 435 | obs_daily.rename(columns={'air_temp_max': 'Tmax'}, inplace=True) 436 | if 'air_temp_low_6_hour' in vars_request: 437 | obs_daily.rename(columns={'air_temp_low_6_hour': 'Tmin'}, inplace=True) 438 | else: 439 | obs_daily.rename(columns={'air_temp_min': 'Tmin'}, inplace=True) 440 | if 'precip_accum_six_hour' in vars_request: 441 | obs_daily.rename(columns={'precip_accum_six_hour': 'Rain'}, inplace=True) 442 | else: 443 | obs_daily.rename(columns={'precip_accum_one_hour': 'Rain'}, inplace=True) 444 | obs_daily.rename(columns={'wind_speed': 'Wind'}, inplace=True) 445 | 446 | # Deal with the rain depending on the type of forecast requested 447 | obs_daily['Rain'].fillna(0.0, inplace=True) 448 | if config['Model']['rain_forecast_type'] == 'pop' and not force_rain_quantity: 449 | obs_daily.loc[:, 'Rain'] = pop_rain(obs_daily['Rain']) 450 | elif config['Model']['rain_forecast_type'] == 'categorical' and not force_rain_quantity: 451 | obs_daily.loc[:, 'Rain'] = categorical_rain(obs_daily['Rain']) 452 | 453 | # Set the date time index and retain only desired columns 454 | obs_daily = obs_daily.rename(columns={datename: 'date_time'}) 455 | obs_daily = obs_daily.set_index('date_time') 456 | if config['verbose']: 457 | print('verification: -> exporting to %s' % output_file) 458 | export_cols = ['Tmax', 'Tmin', 'Wind', 'Rain'] 459 | for col in obs_daily.columns: 460 | if col not in export_cols: 461 | obs_daily.drop(col, 1, inplace=True) 462 | 463 | # If a time series is desired, then get hourly data 464 | if config['Model']['predict_timeseries']: 465 | 466 | # Look for desired variables 467 | vars_request = ['air_temp', 'relative_humidity', 'wind_speed', 'precip_accum_one_hour'] 468 | 469 | # Add variables to the api request 470 | vars_api = '' 471 | for var in vars_request: 472 | vars_api += var + ',' 473 | vars_api = vars_api[:-1] 474 | 475 | # Units 476 | units = 'temp|f,precip|in,speed|kts' 477 | 478 | # Retrieve data 479 | obs_hourly_verify = get_obs_hourly(config, api_dates, vars_api, units) 480 | 481 | # Fix rainfall for categorical and time accumulation 482 | rain_column = 'precip_last_%d_hour' % config['time_series_interval'] 483 | obs_hourly_verify.rename(columns={'precip_accum_one_hour': rain_column}, inplace=True) 484 | if config['Model']['rain_forecast_type'] == 'pop' and not force_rain_quantity: 485 | if config['verbose']: 486 | print("verification: using 'pop' rain") 487 | 
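|             # Convert hourly rain amounts to a 0/1 measurable-rain series so the time series target matches the daily 'pop' target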
obs_hourly_verify.loc[:, rain_column] = pop_rain(obs_hourly_verify[rain_column]) 488 | use_rain_max = True 489 | elif config['Model']['rain_forecast_type'] == 'categorical' and not force_rain_quantity: 490 | if config['verbose']: 491 | print("verification: using 'categorical' rain") 492 | obs_hourly_verify.loc[:, rain_column] = categorical_rain(obs_hourly_verify[rain_column]) 493 | use_rain_max = True 494 | else: 495 | use_rain_max = False 496 | 497 | # Export final data 498 | export_dict = OrderedDict() 499 | for date in dates: 500 | try: 501 | day_dict = obs_daily.loc[date].to_dict(into=OrderedDict) 502 | except KeyError: 503 | continue 504 | if np.any(np.isnan(day_dict.values())): 505 | if config['verbose']: 506 | print('verification: warning: omitting day %s; missing data' % date) 507 | continue # No verification can have missing values 508 | if config['Model']['predict_timeseries']: 509 | start = pd.Timestamp(date + timedelta(hours=(config['forecast_hour_start'] - 510 | config['time_series_interval']))) 511 | end = pd.Timestamp(date + timedelta(hours=config['forecast_hour_start'] + 24)) 512 | try: 513 | series = reindex_hourly(obs_hourly_verify, start, end, config['time_series_interval'], 514 | use_rain_max=use_rain_max) 515 | except KeyError: 516 | # No values for the day 517 | if config['verbose']: 518 | print('verification: warning: omitting day %s; missing data' % date) 519 | continue 520 | if series.isnull().values.any(): 521 | if config['verbose']: 522 | print('verification: warning: omitting day %s; missing data' % date) 523 | continue 524 | series_dict = OrderedDict(series.to_dict(into=OrderedDict)) 525 | day_dict.update(series_dict) 526 | export_dict[date] = day_dict 527 | with open(output_file, 'wb') as handle: 528 | pickle.dump(export_dict, handle, protocol=pickle.HIGHEST_PROTOCOL) 529 | 530 | return 531 | 532 | 533 | def process(config, verif): 534 | """ 535 | Returns a numpy array of verification data for use in mosx_predictors. The first dimension is date, the second is 536 | variable. 537 | 538 | :param config: 539 | :param verif: dict: dictionary of processed verification data; may be None 540 | :return: ndarray: array of processed verification targets 541 | """ 542 | if verif is not None: 543 | if config['verbose']: 544 | print('verification.process: processing array for verification data') 545 | num_days = len(verif.keys()) 546 | variables = ['Tmax', 'Tmin', 'Wind', 'Rain'] 547 | day_verif_array = np.full((num_days, len(variables)), np.nan, dtype=np.float64) 548 | for d in range(len(verif.keys())): 549 | date = verif.keys()[d] 550 | for v in range(len(variables)): 551 | var = variables[v] 552 | day_verif_array[d, v] = verif[date][var] 553 | if config['Model']['predict_timeseries']: 554 | hour_verif = OrderedDict(verif) 555 | for date in hour_verif.keys(): 556 | for variable in variables: 557 | hour_verif[date].pop(variable, None) 558 | hour_verif_array = get_array(hour_verif) 559 | hour_verif_array = np.reshape(hour_verif_array, (num_days, -1)) 560 | verif_array = np.concatenate((day_verif_array, hour_verif_array), axis=1) 561 | return verif_array 562 | else: 563 | return day_verif_array 564 | else: 565 | return None 566 | -------------------------------------------------------------------------------- /performance: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # 3 | # Copyright (c) 2018 Jonathan Weyn 4 | # 5 | # See the file LICENSE for your rights. 
6 | # 7 | 8 | """ 9 | Evaluate performance metrics for a MOS-X model. These functions should only be used after the data files in 'build' and 10 | 'validate' have been created. 11 | """ 12 | 13 | import mosx 14 | import numpy as np 15 | import pandas as pd 16 | import os 17 | import sys 18 | import pickle 19 | import string 20 | from optparse import OptionParser 21 | from datetime import datetime 22 | import warnings 23 | from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error, r2_score 24 | from sklearn.preprocessing import LabelBinarizer 25 | import matplotlib 26 | import matplotlib.pyplot as plt 27 | import matplotlib.gridspec as gs 28 | from matplotlib import rc 29 | try: 30 | from properscoring import crps_ensemble 31 | crps = True 32 | except ImportError: 33 | crps = False 34 | 35 | 36 | # Set matplotlib rc parameters 37 | rc('font', **{'family': 'Times New Roman'}) 38 | matplotlib.rcParams['mathtext.fontset'] = 'custom' 39 | matplotlib.rcParams['mathtext.rm'] = 'Times New Roman' 40 | matplotlib.rcParams['mathtext.it'] = 'Times New Roman:italic' 41 | matplotlib.rcParams['mathtext.bf'] = 'Times New Roman:bold' 42 | matplotlib.rcParams.update({'font.size': 10}) 43 | 44 | # Suppress warnings 45 | # warnings.filterwarnings("ignore") 46 | 47 | 48 | def get_command_options(): 49 | parser = OptionParser() 50 | parser.add_option('-t', '--naive-rain-correction', dest='tune_rain', action='store_true', default=False, 51 | help='Use the raw precipitation from GFS/NAM to override or average with MOS-X') 52 | parser.add_option('-e', '--ensemble', dest='ensemble', action='store_true', default=False, 53 | help='Calculate ensemble statistics (for a valid ensemble estimator)') 54 | parser.add_option('-r', '--rain-post-average', dest='avg_rain', action='store_true', default=False, 55 | help='If using a RainTuningEstimator, this will average the raw estimation from an ensemble' 56 | 'with that of the rain tuning post-processor') 57 | parser.add_option('-L', '--learning-curve', dest='learning', action='store_true', default=False, 58 | help='Generate a learning curve by re-training the model (time consuming)') 59 | parser.add_option('-S', '--spread-skill', dest='spread_skill', action='store_true', default=False, 60 | help='Plot spread-skill relationship for an ensemble (if -e is enabled)') 61 | parser.add_option('-H', '--rank-histogram', dest='histogram', action='store_true', default=False, 62 | help='Plot rank histograms for an ensemble (if -e is enabled)') 63 | parser.add_option('-E', '--error-distribution', dest='error_plot', action='store_true', default=False, 64 | help='Plot error distributions') 65 | (opts, args) = parser.parse_args() 66 | return opts, args 67 | 68 | 69 | def brier_score(y, x): 70 | num_categories = max(x.shape[1], y.shape[1]) 71 | while x.shape[1] < num_categories: 72 | x = np.c_[x, np.zeros(x.shape[0])] 73 | while y.shape[1] < num_categories: 74 | y = np.c_[y, np.zeros(y.shape[0])] 75 | score = np.sum((x - y) ** 2, axis=1) 76 | return np.mean(score / num_categories) 77 | 78 | 79 | def mean_variance(X): 80 | return np.mean(np.var(X, axis=-1), axis=0) 81 | 82 | 83 | def plot_spread_skill(X, y, grid=False, one_to_one=True, width=6, height=6): 84 | 85 | fig = plt.figure() 86 | fig.set_size_inches(width, height) 87 | rows = 2 88 | columns = 2 89 | gs1 = gs.GridSpec(rows, columns) 90 | gs1.update(wspace=0.18, hspace=0.18) 91 | colors = [c['color'] for c in list(matplotlib.rcParams['axes.prop_cycle'])] 92 | params = ['High', 'Low', 'Wind', 'Rain'] 93 
| limits = [20., 20., 10., 2.] 94 | 95 | var = np.var(X, axis=-1) 96 | ens_mean = np.mean(X, axis=-1) 97 | err = (ens_mean - y) ** 2 98 | 99 | for plot_num in range(len(params)): 100 | ax = plt.subplot(gs1[plot_num]) 101 | plt.scatter(err[:, plot_num], var[:, plot_num], c=colors[plot_num], s=max(width, height)/2.) 102 | plt.xlim([0., limits[plot_num]]) 103 | plt.ylim([0., limits[plot_num]]) 104 | if grid: 105 | plt.grid(grid, linestyle='--', color=[0.8, 0.8, 0.8]) 106 | if one_to_one: 107 | curr_xlim = ax.get_xlim() 108 | curr_ylim = ax.get_ylim() 109 | max_limit = max(np.max(curr_xlim), np.max(curr_ylim)) 110 | min_limit = min(np.min(curr_xlim), np.min(curr_ylim)) 111 | plt.plot([min_limit, max_limit], [min_limit, max_limit], 'k') 112 | if (plot_num + 1) % columns == 1: 113 | plt.ylabel('spread') 114 | if (plot_num + 1) > columns * (rows - 1): 115 | plt.xlabel('error') 116 | letters = list(string.ascii_lowercase) 117 | panel_label = '%s) %s' % (letters[plot_num], params[plot_num]) 118 | ax.annotate(panel_label, xycoords='axes fraction', xy=(0.05, 0.95), horizontalalignment='left', 119 | verticalalignment='top') 120 | 121 | return fig 122 | 123 | 124 | def plot_rank_histogram(X, y, width=6, height=6): 125 | fig = plt.figure() 126 | fig.set_size_inches(width, height) 127 | gs1 = gs.GridSpec(2, 2) 128 | gs1.update(wspace=0.18, hspace=0.18) 129 | colors = [c['color'] for c in list(matplotlib.rcParams['axes.prop_cycle'])] 130 | params = ['High', 'Low', 'Wind', 'Rain'] 131 | 132 | def plot_histogram(subplot_num, x, unit=None, facecolor='b', bins=None, align='left'): 133 | global fig 134 | ax = plt.subplot(subplot_num) 135 | if bins is None and unit is not None: 136 | bins = max(int(np.nanmax(x) / unit - np.nanmin(x) / unit), 1) 137 | n, bins, patches = plt.hist(x, bins=bins, facecolor=facecolor, normed=True, align=align, ) 138 | ylim = ax.get_ylim() 139 | if ylim[1] - np.nanmax(n) < 0.005: 140 | ax.set_ylim([ylim[0], ylim[1] + 0.005]) 141 | if unit is None: 142 | unit = 1. 143 | ax.set_yticklabels(['{:.1f}'.format(100. * l * unit) for l in plt.yticks()[0]]) 144 | if plot_num % 2 == 0: 145 | ax.set_ylabel('frequency (%)') 146 | letters = list(string.ascii_lowercase) 147 | panel_label = '%s) %s' % (letters[plot_num], params[plot_num]) 148 | ax.annotate(panel_label, xycoords='axes fraction', xy=(0.50, 0.95), horizontalalignment='center', 149 | verticalalignment='top') 150 | return 151 | 152 | for plot_num in range(4): 153 | rank = np.sum(X[:, plot_num, :].T < y[:, plot_num], axis=0).T 154 | plot_histogram(gs1[plot_num], rank, facecolor=colors[plot_num]) 155 | 156 | return fig 157 | 158 | 159 | def plot_error(X, y, width=6, height=6): 160 | fig = plt.figure() 161 | fig.set_size_inches(width, height) 162 | gs1 = gs.GridSpec(2, 2) 163 | gs1.update(wspace=0.18, hspace=0.18) 164 | colors = [c['color'] for c in list(matplotlib.rcParams['axes.prop_cycle'])] 165 | params = ['High', 'Low', 'Wind', 'Rain'] 166 | 167 | def plot_histogram(subplot_num, x, unit=None, facecolor='b', bins=None, align='left'): 168 | global fig 169 | ax = plt.subplot(subplot_num) 170 | if bins is None and unit is not None: 171 | bins = max(int(np.nanmax(x) / unit - np.nanmin(x) / unit), 1) 172 | n, bins, patches = plt.hist(x, bins=bins, facecolor=facecolor, normed=True, align=align, ) 173 | ylim = ax.get_ylim() 174 | if ylim[1] - np.nanmax(n) < 0.005: 175 | ax.set_ylim([ylim[0], ylim[1] + 0.005]) 176 | if unit is None: 177 | unit = 1. 178 | ax.set_yticklabels(['{:.1f}'.format(100. 
* l * unit) for l in plt.yticks()[0]]) 179 | if plot_num % 2 == 0: 180 | ax.set_ylabel('frequency (%)') 181 | letters = list(string.ascii_lowercase) 182 | panel_label = '%s) %s' % (letters[plot_num], params[plot_num]) 183 | ax.annotate(panel_label, xycoords='axes fraction', xy=(0.05, 0.95), horizontalalignment='left', 184 | verticalalignment='top') 185 | return 186 | 187 | for plot_num in range(4): 188 | error = X[:, plot_num] - y[:, plot_num] 189 | plot_histogram(gs1[plot_num], error, facecolor=colors[plot_num], unit=1) 190 | 191 | return fig 192 | 193 | 194 | # Get options and config 195 | 196 | options, arguments = get_command_options() 197 | 198 | try: 199 | config_file = arguments[0] 200 | except IndexError: 201 | print('Required argument (config file) not provided.') 202 | sys.exit(1) 203 | config = mosx.util.get_config(config_file) 204 | 205 | predict_timeseries = config['Model']['predict_timeseries'] 206 | if predict_timeseries: 207 | config['Model']['predict_timeseries'] = False 208 | 209 | predictor_file = '%s/%s_CV_%s_predictors.pkl' % (config['SITE_ROOT'], config['station_id'], 210 | config['Validate']['end_date']) 211 | if not(os.path.isfile(predictor_file)): 212 | print("Cannot find validation predictors file '%s'. Please run 'validate' first." % predictor_file) 213 | sys.exit(1) 214 | 215 | 216 | # Open predictors file from validate 217 | 218 | with open(predictor_file, 'rb') as handle: 219 | predictors = pickle.load(handle) 220 | predictor_array = np.concatenate((predictors['BUFKIT'], predictors['OBS']), axis=1) 221 | true_array = predictors['VERIF'] 222 | 223 | 224 | # Make the prediction 225 | print('\n--- MOS-X performance: making the forecast...\n') 226 | predicted, all_predicted, ps = mosx.model.predict_all(config, predictor_file, ensemble=options.ensemble, 227 | naive_rain_correction=options.tune_rain) 228 | if options.avg_rain: 229 | print('Using average of raw and rain-tuned precipitation forecasts') 230 | no_tuned_predictions = mosx.model.predict_all(config, predictor_file, naive_rain_correction=options.tune_rain, 231 | rain_tuning=False) 232 | predicted = np.mean([predicted, no_tuned_predictions[0]], axis=0) 233 | 234 | 235 | # Calculate general performance scores 236 | 237 | print('\n--- MOS-X performance: generating global performance scores...\n') 238 | multi = 'raw_values' 239 | scores = np.nan * np.zeros((7, 4)) 240 | scores[0] = explained_variance_score(true_array[:, :4], predicted[:, :4], multioutput=multi) 241 | scores[1] = mean_absolute_error(true_array[:, :4], predicted[:, :4], multioutput=multi) 242 | scores[2] = mean_squared_error(true_array[:, :4], predicted[:, :4], multioutput=multi) 243 | scores[3] = r2_score(true_array[:, :4], predicted[:, :4], multioutput=multi) 244 | 245 | if config['Model']['rain_forecast_type'] in ['pop', 'categorical']: 246 | # Try to get probabilities for each category of rain 247 | print('MOS-X performance: making probabilistic rain predictions...') 248 | rain_prob = mosx.model.predict_rain_proba(config, predictor_file) 249 | lb = LabelBinarizer() 250 | rain_categories = lb.fit_transform(true_array[:, 3]) 251 | scores[4, 3] = brier_score(rain_categories, rain_prob) 252 | 253 | if options.ensemble: 254 | scores[5] = mean_variance(all_predicted[:, :4, :]) 255 | 256 | if options.ensemble and crps: 257 | scores[6] = np.mean(crps_ensemble(true_array[:, :4], all_predicted[:, :4, :], axis=-1), axis=0) 258 | 259 | scores_df = pd.DataFrame(scores) 260 | score_names = ['Explained variance score', 'Mean absolute bias', 'Mean squared 
bias', 261 | 'R^2 coefficient of determination', 'Brier score', 'Mean ensemble variance', 'CRPS'] 262 | score_columns = ['High', 'Low', 'Wind', 'Rain'] 263 | scores_df.index = score_names 264 | scores_df.columns = score_columns 265 | 266 | print('\n') 267 | print(scores_df) 268 | 269 | 270 | # Optional learning curve plot 271 | 272 | if options.learning: 273 | print("\n--- MOS-X performance: exporting learning curve figure to 'learning_curve.pdf'...\n") 274 | train_file = '%s/%s_predictors_train.pkl' % (config['SITE_ROOT'], config['station_id']) 275 | predictor_file = '%s/%s_CV_%s_predictors.pkl' % (config['SITE_ROOT'], config['station_id'], 276 | config['Validate']['end_date']) 277 | predictors, targets, n_samples_test = mosx.model.combine_train_test(config, train_file, predictor_file, 278 | return_count_test=True) 279 | cv = mosx.model.SplitConsecutive(first=False, n_samples=n_samples_test) 280 | scorer = mosx.model.wxchallenge_scorer(no_rain=True) 281 | figure = mosx.model._plot_learning_curve(mosx.model.build_estimator(config), predictors, targets, cv=cv, 282 | scoring=scorer) 283 | plt.savefig('learning_curve.pdf', bbox_inches='tight') 284 | 285 | 286 | # Optional spread-skill plot 287 | 288 | if options.spread_skill and options.ensemble: 289 | print("\n--- MOS-X performance: exporting spread-skill figure to 'spread_skill.pdf'...\n") 290 | figure = plot_spread_skill(all_predicted, true_array, ) 291 | plt.savefig('spread_skill.pdf', bbox_inches='tight') 292 | elif options.spread_skill: 293 | print("warning: '--spread-skill' option enabled but no ensemble predictions (--ensemble) were enabled!") 294 | 295 | 296 | # Optional rank histograms plot 297 | 298 | if options.histogram and options.ensemble: 299 | print("\n--- MOS-X performance: exporting rank histogram figure to 'rank_histogram.pdf'...\n") 300 | figure = plot_rank_histogram(all_predicted, true_array, ) 301 | plt.savefig('rank_histogram.pdf', bbox_inches='tight') 302 | elif options.histogram: 303 | print("warning: '--rank-histogram' option enabled but no ensemble predictions (--ensemble) were enabled!") 304 | 305 | 306 | # Optional error distribution plot 307 | 308 | if options.error_plot: 309 | print("\n--- MOS-X performance: exporting error distribution figure to 'error_distribution.pdf'...\n") 310 | figure = plot_error(predicted, true_array, ) 311 | plt.savefig('error_distribution.pdf', bbox_inches='tight') 312 | -------------------------------------------------------------------------------- /run: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # 3 | # Copyright (c) 2018 Jonathan Weyn 4 | # 5 | # See the file LICENSE for your rights. 6 | # 7 | 8 | """ 9 | Run the MOS-X model either initialized at 23Z on any given day or for tomorrow. 
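| 
| Example usage (a sketch, assuming an executable `run` script and a config file KSEA.config as in the README):
| 
|     run KSEA -d 20180115 -w -p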
10 | """ 11 | 12 | import os 13 | import sys 14 | import mosx 15 | import numpy as np 16 | from shutil import copy2 17 | from collections import OrderedDict 18 | from optparse import OptionParser 19 | from datetime import datetime, timedelta 20 | import pickle 21 | import json 22 | 23 | # Suppress warnings 24 | import warnings 25 | warnings.filterwarnings("ignore") 26 | 27 | 28 | def get_command_options(): 29 | parser = OptionParser() 30 | parser.add_option('-d', '--date', dest='datestr', action='store', type='string', default='tomorrow', 31 | help='Date to run model for, YYYYMMDD (default=tomorrow)') 32 | parser.add_option('-t', '--naive-rain-correction', dest='tune_rain', action='store_true', default=False, 33 | help='Use the raw precipitation from GFS/NAM to override or average with MOS-X') 34 | parser.add_option('-r', '--rain-post-average', dest='avg_rain', action='store_true', default=False, 35 | help='If using a RainTuningEstimator, this will average the raw estimation from an ensemble' 36 | 'with that of the rain tuning post-processor') 37 | parser.add_option('-w', '--write', dest='write_flag', action='store_true', default=False, 38 | help='Write to a pickle file') 39 | parser.add_option('-f', '--write-file', dest='write_file', action='store', type='string', default='default', 40 | help='If -w is enabled, write to this file (default $SITE_ROOT/$(station_id)_MOSX_fcst). ' 41 | 'File extension is added based on file type.') 42 | parser.add_option('-u', '--upload', dest='upload', action='store_true', default=False, 43 | help='Upload upload forecast output to server in config') 44 | parser.add_option('-p', '--probabilities', dest='prob', action='store_true', default=False, 45 | help='Calculate and plot probability distributions') 46 | (opts, args) = parser.parse_args() 47 | return opts, args 48 | 49 | 50 | # Figure out the date 51 | 52 | options, arguments = get_command_options() 53 | datestr, write_flag, write_base, upload, prob = (options.datestr, options.write_flag, options.write_file, 54 | options.upload, options.prob) 55 | try: 56 | config_file = arguments[0] 57 | except IndexError: 58 | print('Required argument (config file) not provided.') 59 | sys.exit(1) 60 | config = mosx.util.get_config(config_file) 61 | 62 | if datestr == 'tomorrow': 63 | date = datetime.utcnow() 64 | # BUFR cycle 65 | cycle = str(6 * (((date.hour + 24 - 5) % 24) // 6)) 66 | verif_date = datetime(date.year, date.month, date.day) + timedelta(days=1) 67 | else: 68 | cycle = '18' 69 | try: 70 | verif_date = datetime.strptime(datestr, '%Y%m%d') 71 | except: 72 | raise ValueError('Invalid date format entered (use YYYYMMDD).') 73 | 74 | # Override the INFILE values 75 | new_start_date = datetime.strftime(verif_date, '%Y%m%d') 76 | new_end_date = datetime.strftime(verif_date, '%Y%m%d') 77 | config['data_start_date'] = new_start_date 78 | config['data_end_date'] = new_end_date 79 | 80 | 81 | # Retrieve data 82 | 83 | bufr_file = '%s/%s_%s_bufr.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date) 84 | print('\n--- MOS-X run: retrieving BUFR data...\n') 85 | print('Using model cycle %sZ' % cycle) 86 | mosx.bufr.bufr(config, bufr_file, cycle=cycle) 87 | 88 | obs_file = '%s/%s_%s_obs.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date) 89 | print('\n--- MOS-X run: retrieving OBS data...\n') 90 | mosx.obs.obs(config, obs_file, use_nan_sounding=True, use_existing_sounding=False) 91 | 92 | 93 | # Format data 94 | 95 | predictor_file = '%s/%s_%s_predictors.pkl' % (config['SITE_ROOT'], 
config['station_id'], new_end_date) 96 | print('\n--- MOS-X run: formatting predictor data...\n') 97 | mosx.model.format_predictors(config, bufr_file, obs_file, None, predictor_file) 98 | 99 | 100 | # Make a prediction! 101 | 102 | print('\n--- MOS-X run: making the forecast...\n') 103 | predicted, all_predicted, predicted_timeseries = mosx.model.model.predict_all(config, predictor_file, ensemble=prob, 104 | time_series_date=verif_date, 105 | naive_rain_correction=options.tune_rain, 106 | round_result=not prob) 107 | if options.avg_rain: 108 | no_tuned_predictions = mosx.model.model.predict_all(config, predictor_file, ensemble=prob, 109 | time_series_date=verif_date, 110 | naive_rain_correction=options.tune_rain, round_result=not prob, 111 | rain_tuning=False) 112 | 113 | predicted = np.squeeze(predicted) 114 | if options.avg_rain: 115 | no_tuned_rain = no_tuned_predictions[0][0, 3] 116 | print('Adjusting rain estimate to average of raw (%0.2f) and tuned (%0.2f) values' % 117 | (no_tuned_rain, predicted[3])) 118 | predicted[3] = np.mean([predicted[3], no_tuned_rain]) 119 | 120 | 121 | # Print forecast! 122 | 123 | print("\nRain forecast type: '%s'" % config['Model']['rain_forecast_type']) 124 | 125 | print('\nFor day %s at %s, the predicted forecast is' % (new_end_date, 126 | config['station_id'])) 127 | print('%0.0f/%0.0f/%0.0f/%0.2f' % tuple(predicted[:4])) 128 | 129 | if prob: 130 | predicted_std = np.squeeze(np.std(all_predicted, axis=-1)) 131 | predicted_display = [] 132 | for v in range(4): 133 | predicted_display.append(predicted[v]) 134 | predicted_display.append(predicted_std[v]) 135 | print('\nPredicted forecast with standard deviation is') 136 | print('%0.1f+/-%0.1f | %0.1f+/-%0.1f | %0.1f+/-%0.1f | %0.3f+/-%0.3f' % tuple(predicted_display)) 137 | print('\nNote that the reported rain above may be adjusted from raw model output.') 138 | 139 | if config['Model']['predict_timeseries']: 140 | print('\nPredicted time series:') 141 | print(predicted_timeseries) 142 | 143 | 144 | # Write the forecast, if requested 145 | 146 | if write_flag: 147 | file_types = config['Upload']['file_type'] 148 | write_files = [] 149 | if not isinstance(file_types, list): 150 | file_types = [file_types] 151 | 152 | for file_type in file_types: 153 | if file_type == 'pickle': 154 | if write_base == 'default': 155 | write_file = '%s/%s_MOSX_fcst.pkl' % (config['SITE_ROOT'], config['station_id']) 156 | else: 157 | write_file = '%s.pkl' % write_base 158 | print('\nForecast write requested, writing to file %s' % write_file) 159 | # Check if pickle file already exists 160 | try: 161 | with open(write_file, 'rb') as handle: 162 | data = pickle.load(handle) 163 | except: 164 | data = OrderedDict() 165 | 166 | data[verif_date] = { 167 | 'high': np.round(predicted[0]), 168 | 'low': np.round(predicted[1]), 169 | 'wind': np.round(predicted[2]), 170 | 'precip': np.round(predicted[3], 2) 171 | } 172 | 173 | if config['Model']['predict_timeseries']: 174 | data[verif_date].update(predicted_timeseries.to_dict(into=OrderedDict)) 175 | 176 | with open(write_file, 'wb') as handle: 177 | pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL) 178 | 179 | elif file_type == 'uw_text': 180 | write_file = 'MOS-X.%s' % new_end_date 181 | print('\nForecast write requested, writing to file %s' % write_file) 182 | if config['Model']['rain_forecast_type'] != 'pop': 183 | print('Warning: model rain prediction is not probability! Expect unusual output!') 184 | high = np.round(predicted[0]) 185 | rain = np.round(10. 
* predicted[3]) 186 | line = 'MOS-X,%0.0f,%0.0f' % (high, rain) 187 | with open(write_file, 'w') as f: 188 | f.write(line) 189 | 190 | elif file_type == 'json': 191 | if write_base == 'default': 192 | write_file = '%s/MOSX_%s_%s.json' % (config['SITE_ROOT'], config['station_id'], new_end_date) 193 | else: 194 | write_file = '%s.json' % write_base 195 | print('\nForecast write requested, writing to file %s' % write_file) 196 | data = OrderedDict() 197 | data['daily'] = { 198 | 'high': np.round(predicted[0]), 199 | 'low': np.round(predicted[1]), 200 | 'wind': np.round(predicted[2]), 201 | 'precip': np.round(predicted[3], 2) 202 | } 203 | 204 | if config['Model']['predict_timeseries']: 205 | data['hourly'] = predicted_timeseries.to_dict(into=OrderedDict, orient='list') 206 | data['hourly']['DateTime'] = [str(p) for p in predicted_timeseries.index] 207 | 208 | with open(write_file, 'w') as f: 209 | json.dump(data, f) 210 | 211 | write_files.append(write_file) 212 | 213 | 214 | # Make plots of probability distributions 215 | 216 | if prob: 217 | # Imports 218 | import matplotlib 219 | import matplotlib.pyplot as plt 220 | import matplotlib.gridspec as gs 221 | import matplotlib.ticker as ticker 222 | from scipy.stats import norm 223 | matplotlib.rcParams.update({'font.size': 9}) 224 | 225 | prob_file = '%s/%s_MOSX_prob.svg' % (config['SITE_ROOT'], config['station_id']) 226 | 227 | def plot_probabilities(predicted): 228 | fig = plt.figure() 229 | fig.set_size_inches(8, 6) 230 | gs1 = gs.GridSpec(2, 2) 231 | gs1.update(wspace=0.18, hspace=0.18) 232 | 233 | def plot_histogram(subplot_num, x, unit=1., facecolor='b', bins=None, align='left', 234 | xtickint=None, decimals=1, title=None): 235 | global fig 236 | ax = plt.subplot(subplot_num) 237 | 238 | if bins is None: 239 | bins = max(int(np.nanmax(x)/unit - np.nanmin(x)/unit), 1) 240 | n, bins, patches = plt.hist(x, bins=bins, facecolor=facecolor, normed=True, align=align,) 241 | x_axis = np.linspace(np.nanmin(x), np.nanmax(x), 101) 242 | x_mean = np.nanmean(x) 243 | x_std = np.nanstd(x) 244 | normal = norm.pdf(x_axis, x_mean, x_std) 245 | plt.plot(x_axis, normal) 246 | if xtickint is not None: 247 | ax.xaxis.set_major_locator(ticker.MultipleLocator(xtickint)) 248 | ylim = ax.get_ylim() 249 | if ylim[1] - np.nanmax(n) < 0.005: 250 | ax.set_ylim([ylim[0], ylim[1]+0.005]) 251 | ax.set_yticklabels(['{:.1f}'.format(100.*l*unit) for l in plt.yticks()[0]]) 252 | if plot_num % 2 == 0: 253 | ax.set_ylabel('Frequency (%)') 254 | plt.axvline(x_mean, linewidth=1.5, color='black') 255 | formatter = '%0.{:d}f'.format(decimals) 256 | text = ('Mean: %s\nStd: %s' % (formatter, formatter)) % (x_mean, x_std) 257 | plt.text(x_mean+unit/2, 0.9*np.nanmax(n), text, fontsize=8) 258 | if title is not None: 259 | ax.set_title(title) 260 | return 261 | 262 | colors = [(0.1, 0.6, 0.4), 263 | (0.6, 0.1, 0.4), 264 | (0.2, 0.4, 0.8), 265 | (0.8, 0.7, 0.1)] 266 | titles = ['High temperature', 'Low temperature', 'Max 2-min wind', 'Precipitation'] 267 | for plot_num in range(4): 268 | if plot_num != 3: 269 | unit = 1. 270 | decimals = 1 271 | elif config['Model']['rain_forecast_type'] == 'pop': 272 | unit = 0.1 273 | decimals = 2 274 | elif config['Model']['rain_forecast_type'] == 'categorical': 275 | unit = 1. 
276 | decimals = 1 277 | elif config['Model']['rain_forecast_type'] == 'quantity': 278 | unit = 0.05 279 | decimals = 3 280 | plot_histogram(gs1[plot_num], predicted[plot_num, :], unit, facecolor=colors[plot_num], 281 | decimals=decimals, title=titles[plot_num]) 282 | 283 | verif_date_str = datetime.strftime(verif_date, '%Y/%m/%d') 284 | plt.suptitle('MOS-X probability distributions, %s on %s' % (config['station_id'], verif_date_str)) 285 | plt.savefig(prob_file, dpi=200) 286 | 287 | plot_probabilities(np.squeeze(all_predicted)) 288 | 289 | 290 | # Upload, if requested 291 | 292 | if upload and write_flag: 293 | print('\nUpload requested...') 294 | account = config['Upload']['username'] 295 | server = config['Upload']['server'] 296 | forecast_dir = config['Upload']['forecast_directory'] 297 | plot_dir = config['Upload']['plot_directory'] 298 | if account == '' and server == '': 299 | if forecast_dir != '': 300 | for write_file in write_files: 301 | copy2(write_file, '%s/%s' % (forecast_dir, write_file.split('/')[-1])) 302 | else: 303 | print('No local directory specified for forecast upload, aborting') 304 | if prob and plot_dir != '': 305 | copy2(prob_file, '%s/%s' % (plot_dir, prob_file.split('/')[-1])) 306 | else: 307 | print('No local directory specified for plot upload, aborting') 308 | elif account == '' or server == '': 309 | print('Invalid username and/or server in config file, aborting') 310 | else: 311 | for write_file in write_files: 312 | result = os.system('scp %s %s@%s:%s' % (write_file, account, server, forecast_dir)) 313 | if prob: 314 | os.system('scp %s %s@%s:%s' % (prob_file, account, server, plot_dir)) 315 | print('Upload finished with system exit status %s' % result) 316 | -------------------------------------------------------------------------------- /validate: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # 3 | # Copyright (c) 2018 Jonathan Weyn 4 | # 5 | # See the file LICENSE for your rights. 6 | # 7 | 8 | """ 9 | Perform cross-validation of the MOS-X model for a range of dates specified in the config. 
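| 
| Example usage (a sketch, assuming an executable `validate` script and a config file KSEA.config as in the README):
| 
|     validate KSEA -w -f MOSX_CV.csv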
10 | """ 11 | 12 | import mosx 13 | import os 14 | import sys 15 | import numpy as np 16 | from optparse import OptionParser 17 | from datetime import datetime 18 | import pickle 19 | 20 | # Suppress warnings 21 | import warnings 22 | warnings.filterwarnings("ignore") 23 | 24 | 25 | def get_command_options(): 26 | parser = OptionParser() 27 | parser.add_option('-o', '--overwrite', dest='overwrite', action='store_true', default=False, 28 | help='Overwrite any existing BUFR, obs, and verification files') 29 | parser.add_option('-t', '--naive-rain-correction', dest='tune_rain', action='store_true', default=False, 30 | help='Use the raw precipitation from GFS/NAM to override or average with MOS-X') 31 | parser.add_option('-r', '--rain-post-average', dest='avg_rain', action='store_true', default=False, 32 | help='If using a RainTuningEstimator, this will average the raw estimation from an ensemble' 33 | 'with that of the rain tuning post-processor') 34 | parser.add_option('-w', '--write', dest='write_flag', action='store_true', default=False, help='Write a CSV file') 35 | default_file = './MOSX_CV.csv' 36 | parser.add_option('-f', '--write-file', dest='write_file', action='store', type='string', default=default_file, 37 | help=('If -w is enabled, write to this file (default %s)' % default_file)) 38 | (opts, args) = parser.parse_args() 39 | return opts, args 40 | 41 | 42 | # Figure out the date 43 | 44 | options, arguments = get_command_options() 45 | overwrite, write_flag, write_file = options.overwrite, options.write_flag, options.write_file 46 | try: 47 | config_file = arguments[0] 48 | except IndexError: 49 | print('Required argument (config file) not provided.') 50 | sys.exit(1) 51 | config = mosx.util.get_config(config_file) 52 | 53 | cycle = '18' 54 | 55 | start_date = datetime.strptime(config['Validate']['start_date'], '%Y%m%d') 56 | end_date = datetime.strptime(config['Validate']['end_date'], '%Y%m%d') 57 | 58 | # Override the INFILE values 59 | new_start_date = datetime.strftime(start_date, '%Y%m%d') 60 | new_end_date = datetime.strftime(end_date, '%Y%m%d') 61 | config['data_start_date'] = new_start_date 62 | config['data_end_date'] = new_end_date 63 | 64 | 65 | # Retrieve data 66 | 67 | bufr_file = '%s/%s_CV_%s_bufr.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date) 68 | print('\n--- MOS-X validate: retrieving BUFR data...\n') 69 | if os.path.isfile(bufr_file) and not overwrite: 70 | print('Using existing BUFR file %s' % bufr_file) 71 | print('If issues occur, delete this file and try again') 72 | else: 73 | print('Using model cycle %sZ' % cycle) 74 | mosx.bufr.bufr(config, bufr_file, cycle=cycle) 75 | 76 | obs_file = '%s/%s_CV_%s_obs.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date) 77 | print('\n--- MOS-X validate: retrieving OBS data...\n') 78 | if os.path.isfile(obs_file) and not overwrite: 79 | print('Using existing obs file %s' % obs_file) 80 | print('If issues occur, delete this file and try again') 81 | else: 82 | mosx.obs.obs(config, obs_file, use_nan_sounding=False) 83 | 84 | verif_file = '%s/%s_CV_%s_verif.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date) 85 | print('\n--- MOS-X validate: retrieving VERIF data...\n') 86 | if os.path.isfile(verif_file) and not overwrite: 87 | print('Using existing verif file %s' % verif_file) 88 | print('If issues occur, delete this file and try again') 89 | else: 90 | mosx.verification.verification(config, verif_file, use_climo=config['Obs']['use_climo_wind'], 91 | 
96 | # Format data
97 | 
98 | predictor_file = '%s/%s_CV_%s_predictors.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date)
99 | print('\n--- MOS-X validate: formatting predictor data...\n')
100 | all_dates = mosx.model.format_predictors(config, bufr_file, obs_file, verif_file, predictor_file,
101 |                                          return_dates=True)
102 | 
103 | 
104 | # Make a prediction!
105 | 
106 | predicted = mosx.model.predict(config, predictor_file, naive_rain_correction=options.tune_rain)
107 | if options.avg_rain:
108 |     print('Using average of raw and rain-tuned precipitation forecasts')
109 |     no_tuned_predictions = mosx.model.predict(config, predictor_file, naive_rain_correction=options.tune_rain,
110 |                                               rain_tuning=False)
111 |     predicted = np.mean([predicted, no_tuned_predictions], axis=0)
112 | 
113 | # Print forecasts!
114 | 
115 | print("\nRain forecast type: '%s'" % config['Model']['rain_forecast_type'])
116 | 
117 | print('\nDay,verification,forecast')
118 | for day in range(len(all_dates)):
119 |     date = all_dates[day]
120 |     day_verif = [verif[date][v] for v in ['Tmax', 'Tmin', 'Wind', 'Rain']]
121 |     verif_str = '%0.0f/%0.0f/%0.0f/%0.2f' % tuple(day_verif)
122 |     fcst_str = '%0.0f/%0.0f/%0.0f/%0.2f' % tuple(predicted[day, :4])
123 |     print('%s,%s,%s' % (date, verif_str, fcst_str))
124 | 
125 | 
126 | # Write the forecast, if requested
127 | 
128 | if write_flag:
129 |     print('\nForecast write requested, writing to file %s' % write_file)
130 | 
131 |     with open(write_file, 'w') as f:
132 |         print >> f, 'date,verification,forecast'
133 |         for day in range(len(all_dates)):
134 |             date = all_dates[day]
135 |             day_verif = [verif[date][v] for v in ['Tmax', 'Tmin', 'Wind', 'Rain']]
136 |             verif_str = '%0.0f/%0.0f/%0.0f/%0.2f' % tuple(day_verif)
137 |             fcst_str = '%0.0f/%0.0f/%0.0f/%0.2f' % tuple(predicted[day, :4])
138 |             print >> f, ('%s,%s,%s' % (date, verif_str, fcst_str))
--------------------------------------------------------------------------------
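The CSV written with `-w` packs each day's Tmax/Tmin/Wind/Rain into a single slash-separated field per column. A short pandas sketch for splitting those back into numeric columns, assuming the default output file name and that every row is fully populated:

```python
import pandas as pd

# Read the cross-validation CSV written by `validate <config> -w`
df = pd.read_csv('MOSX_CV.csv')
variables = ['Tmax', 'Tmin', 'Wind', 'Rain']
for col in ('verification', 'forecast'):
    # Split 'Tmax/Tmin/Wind/Rain' strings into four numeric columns
    parts = df[col].str.split('/', expand=True).astype(float)
    parts.columns = ['%s_%s' % (col, v) for v in variables]
    df = df.join(parts)
print(df.head())
```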
10 | """ 11 | 12 | import mosx 13 | import sys 14 | import os 15 | import numpy as np 16 | import pandas as pd 17 | from urllib import urlopen 18 | from collections import OrderedDict 19 | from optparse import OptionParser 20 | from datetime import datetime, timedelta 21 | import pickle 22 | 23 | # Suppress warnings 24 | import warnings 25 | warnings.filterwarnings("ignore") 26 | 27 | 28 | def get_command_options(): 29 | parser = OptionParser() 30 | parser.add_option('-s', '--start-date', dest='startstr', action='store', type='string', default='weekago', 31 | help='Starting verification date, YYYYMMDD (default=today)') 32 | parser.add_option('-e', '--end-date', dest='endstr', action='store', type='string', default='today', 33 | help='Ending verification date, YYYYMMDD (default=weekago)') 34 | parser.add_option('-t', '--naive-rain-correction', dest='tune_rain', action='store_true', default=False, 35 | help='Use the raw precipitation from GFS/NAM to override or average with MOS-X') 36 | parser.add_option('-r', '--rain-post-average', dest='avg_rain', action='store_true', default=False, 37 | help='If using a RainTuningEstimator, this will average the raw estimation from an ensemble' 38 | 'with that of the rain tuning post-processor') 39 | parser.add_option('-m', '--verify-mos', dest='mos_flag', action='store_true', default=False, 40 | help="Also retrieve GFS and NAM MOS forecasts for comparison") 41 | parser.add_option('-N', '--no-verify', dest='no_verif_flag', action='store_true', default=False, 42 | help="Don't do verification, just get the model forecasts") 43 | parser.add_option('-o', '--overwrite', dest='overwrite', action='store_true', default=False, 44 | help='Overwrite any existing BUFR, obs, and verification files') 45 | parser.add_option('-w', '--write', dest='write_flag', action='store_true', default=False, 46 | help='Write a CSV file') 47 | default_file = './MOS-X_verification.csv' 48 | parser.add_option('-f', '--write-file', dest='write_file', action='store', type='string', default=default_file, 49 | help=('If -w is enabled, write to this file (default %s)' % default_file)) 50 | (opts, args) = parser.parse_args() 51 | return opts, args 52 | 53 | 54 | def mos_qpf_interpret(qpf): 55 | """ 56 | Interprets a pandas Series of QPF by average estimates 57 | 58 | :param qpf: Series of q06 or q12 from MOS 59 | :return: precip: Series of average estimated precipitation 60 | """ 61 | translator = {0: 0.0, 62 | 1: 0.05, 63 | 2: 0.15, 64 | 3: 0.35, 65 | 4: 0.75, 66 | 5: 1.5, 67 | 6: 2.5} 68 | new_qpf = qpf.copy() 69 | for j in range(len(qpf)): 70 | try: 71 | new_qpf.iloc[j] = translator[int(qpf.iloc[j])] 72 | except: 73 | new_qpf.iloc[j] = 0.0 74 | return new_qpf 75 | 76 | 77 | def retrieve_mos(model, init_date, forecast_date): 78 | """ 79 | Retrieve MOS data. 
77 | def retrieve_mos(model, init_date, forecast_date):
78 |     """
79 |     Retrieve MOS data.
80 | 
81 |     :param model: model name (GFS or NAM)
82 |     :param init_date: datetime of model initialization
83 |     :param forecast_date: datetime of forecast
84 |     :return: dict of high, low, max wind, total rain for next 6Z--6Z
85 |     """
86 | 
87 |     # Create daily return dict
88 |     daily = OrderedDict()
89 | 
90 |     base_url = 'http://mesonet.agron.iastate.edu/mos/csv.php?station=%s&runtime=%s&model=%s'
91 |     formatted_date = init_date.strftime('%Y-%m-%d%%20%H:00')
92 |     url = base_url % (config['station_id'], formatted_date, model)
93 |     response = urlopen(url)
94 |     df = pd.read_csv(response, index_col=False)
95 |     date_index = pd.to_datetime(df['ftime'])
96 |     df['datetime'] = date_index
97 |     df = df.set_index('datetime')
98 | 
99 |     # Remove duplicate rows
100 |     df = df.drop_duplicates()
101 | 
102 |     # Fix rain
103 |     df['q06'] = mos_qpf_interpret(df['q06'])
104 | 
105 |     forecast_start = forecast_date.replace(hour=6)
106 |     forecast_end = forecast_start + timedelta(days=1)
107 | 
108 |     # Some parameters need to include the forecast start; others, like total rain and 6-hour maxes, don't
109 |     try:
110 |         iloc_start_include = df.index.get_loc(forecast_start)
111 |         iloc_start_exclude = iloc_start_include + 1
112 |     except KeyError:
113 |         print('Error getting start time index in db; check data.')
114 |         return
115 |     try:
116 |         iloc_end = df.index.get_loc(forecast_end) + 1
117 |     except KeyError:
118 |         print('Error getting end time index in db; check data.')
119 |         return
120 | 
121 |     raw_high = df.iloc[iloc_start_include:iloc_end]['tmp'].max()
122 |     raw_low = df.iloc[iloc_start_include:iloc_end]['tmp'].min()
123 |     nx_high = df.iloc[iloc_start_exclude:iloc_end]['n_x'].max()
124 |     nx_low = df.iloc[iloc_start_exclude:iloc_end]['n_x'].min()
125 |     daily['Tmax'] = np.nanmax([raw_high, nx_high])
126 |     daily['Tmin'] = np.nanmin([raw_low, nx_low])
127 |     daily['Wind'] = df.iloc[iloc_start_include:iloc_end]['wsp'].max()
128 |     daily['Rain'] = df.iloc[iloc_start_exclude:iloc_end]['q06'].sum()
129 | 
130 |     return daily
131 | 
132 | 
133 | # Parameters
134 | 
135 | options, arguments = get_command_options()
136 | 
137 | try:
138 |     config_file = arguments[0]
139 | except IndexError:
140 |     print('Required argument (config file) not provided.')
141 |     sys.exit(1)
142 | config = mosx.util.get_config(config_file)
143 | 
144 | if options.endstr == 'today':
145 |     date = datetime.utcnow()
146 |     # BUFR cycle
147 |     cycle = '18'
148 |     if date.hour < 6:
149 |         end_date = datetime(date.year, date.month, date.day) - timedelta(days=2)
150 |     else:
151 |         end_date = datetime(date.year, date.month, date.day) - timedelta(days=1)
152 |     if options.no_verif_flag:
153 |         end_date += timedelta(days=2)
154 | else:
155 |     cycle = '18'
156 |     try:
157 |         end_date = datetime.strptime(options.endstr, '%Y%m%d')
158 |     except ValueError:
159 |         raise ValueError('Invalid date format entered (use YYYYMMDD).')
160 | 
161 | if options.startstr == 'weekago':
162 |     date = datetime.utcnow()
163 |     start_date = datetime(date.year, date.month, date.day) - timedelta(days=7)
164 | else:
165 |     try:
166 |         start_date = datetime.strptime(options.startstr, '%Y%m%d')
167 |     except ValueError:
168 |         raise ValueError('Invalid date format entered (use YYYYMMDD).')
169 | 
170 | # Override the INFILE values
171 | new_start_date = datetime.strftime(start_date, '%Y%m%d')
172 | new_end_date = datetime.strftime(end_date, '%Y%m%d')
173 | config['data_start_date'] = new_start_date
174 | config['data_end_date'] = new_end_date
175 | 
176 | 
177 | # Retrieve data
178 | 
179 | bufr_file = '%s/%s_verify_%s_bufr.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date)
180 | print('\n--- MOS-X verify: retrieving BUFR data...\n')
181 | if os.path.isfile(bufr_file) and not options.overwrite:
182 |     print('Using existing BUFR file %s' % bufr_file)
183 |     print('If issues occur, delete this file and try again')
184 | else:
185 |     print('Using model cycle %sZ' % cycle)
186 |     mosx.bufr.bufr(config, bufr_file, cycle=cycle)
187 | 
188 | 
189 | obs_file = '%s/%s_verify_%s_obs.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date)
190 | print('\n--- MOS-X verify: retrieving OBS data...\n')
191 | if os.path.isfile(obs_file) and not options.overwrite:
192 |     print('Using existing obs file %s' % obs_file)
193 |     print('If issues occur, delete this file and try again')
194 | else:
195 |     mosx.obs.obs(config, obs_file, use_nan_sounding=False)
196 | 
197 | if not options.no_verif_flag:
198 |     verif_file = '%s/%s_verify_%s_verif.pkl' % (config['SITE_ROOT'], config['station_id'], new_end_date)
199 |     print('\n--- MOS-X verify: retrieving VERIF data...\n')
200 |     if os.path.isfile(verif_file) and not options.overwrite:
201 |         print('Using existing verif file %s' % verif_file)
202 |         print('If issues occur, delete this file and try again')
203 |     else:
204 |         mosx.verification.verification(config, verif_file, use_climo=False, use_cf6=config['Obs']['use_climo_wind'])
205 |     with open(verif_file, 'rb') as handle:
206 |         verif = pickle.load(handle)
207 | else:
208 |     verif_file = None
209 | 
210 | 
211 | # Format data
212 | 
213 | predictor_file = '%s/%s_verify_%s_predictors.pkl' % (config['SITE_ROOT'], config['station_id'],
214 |                                                      new_end_date)
215 | print('\n--- MOS-X verify: formatting predictor data...\n')
216 | all_dates = mosx.model.format_predictors(config, bufr_file, obs_file, verif_file, predictor_file,
217 |                                          return_dates=True)
218 | 
219 | 
220 | # Make a prediction!
221 | 
222 | predicted = mosx.model.predict(config, predictor_file, naive_rain_correction=options.tune_rain)
223 | if options.avg_rain:
224 |     print('Using average of raw and rain-tuned precipitation forecasts')
225 |     no_tuned_predictions = mosx.model.predict_all(config, predictor_file, naive_rain_correction=options.tune_rain,
226 |                                                   rain_tuning=False)
227 |     predicted = np.mean([predicted, no_tuned_predictions[0]], axis=0)
228 | 
229 | # Get GFS and NAM MOS forecasts, if desired
230 | 
231 | if options.mos_flag:
232 |     gfs = OrderedDict()
233 |     nam = OrderedDict()
234 |     mos_mean = OrderedDict()
235 |     for current_date in all_dates:
236 |         init_date = current_date - timedelta(days=1)
237 |         init_date = init_date.replace(hour=12)
238 |         mos_array = np.full((2, 4), np.nan)
239 | 
240 |         # GFS
241 |         print('Retrieving %s data initialized at %s' % ('GFS MOS', init_date))
242 |         daily = retrieve_mos('GFS', init_date, current_date)
243 |         if daily is not None:
244 |             gfs[current_date] = daily
245 |             mos_array[0, :] = daily.values()
246 | 
247 |         # NAM
248 |         print('Retrieving %s data initialized at %s' % ('NAM MOS', init_date))
249 |         daily = retrieve_mos('NAM', init_date, current_date)
250 |         if daily is not None:
251 |             nam[current_date] = daily
252 |             mos_array[1, :] = daily.values()
253 | 
254 |         # Ensemble mean
255 |         mos_array = np.nanmean(mos_array, axis=0)
256 |         # Map the ensemble-mean array back to named daily values
257 |         mos_mean[current_date] = {
258 |             'Tmax': mos_array[0],
259 |             'Tmin': mos_array[1],
260 |             'Wind': mos_array[2],
261 |             'Rain': mos_array[3]
262 |         }
263 | 
264 | 
265 | 
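Because either MOS product can be missing on a given day, the ensemble mean above is an `np.nanmean` over a 2x4 array whose rows are GFS and NAM and whose columns are Tmax, Tmin, Wind, and Rain; a row left as NaN simply drops out of the column means. A small self-contained illustration (the numbers are placeholders):

```python
import numpy as np

# Row 0 = GFS, row 1 = NAM; columns = Tmax, Tmin, Wind, Rain
mos_array = np.full((2, 4), np.nan)
mos_array[0, :] = [62.0, 44.0, 15.0, 0.10]  # GFS available
# NAM row stays NaN, as when retrieve_mos returns None
print(np.nanmean(mos_array, axis=0))  # -> approximately [62. 44. 15. 0.1]
```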
266 | # Print forecasts!
267 | 
268 | print("\nRain forecast type: '%s'" % config['Model']['rain_forecast_type'])
269 | 
270 | forecast_format = '%0.0f/%0.0f/%0.0f/%0.2f'
271 | variables = ['Tmax', 'Tmin', 'Wind', 'Rain']
272 | if options.mos_flag:
273 |     print('\nDay, verification, GFS MOS, NAM MOS, MOS MEAN, MOS-X')
274 | else:
275 |     print('\nDay, verification, MOS-X')
276 | for day in range(len(all_dates)):
277 |     date = all_dates[day]
278 |     if not options.no_verif_flag:
279 |         verif_str = forecast_format % tuple([verif[date][v] for v in variables])
280 |     else:
281 |         verif_str = ''
282 |     mosx_str = forecast_format % tuple(predicted[day, :4])
283 |     if options.mos_flag:
284 |         try:
285 |             gfs_str = forecast_format % tuple([gfs[date][v] for v in variables])
286 |         except KeyError:
287 |             gfs_str = ''
288 |         try:
289 |             nam_str = forecast_format % tuple([nam[date][v] for v in variables])
290 |         except KeyError:
291 |             nam_str = ''
292 |         try:
293 |             mos_mean_str = forecast_format % tuple([mos_mean[date][v] for v in variables])
294 |         except KeyError:
295 |             mos_mean_str = ''
296 |         print('%s,%s,%s,%s,%s,%s' % (date, verif_str, gfs_str, nam_str, mos_mean_str, mosx_str))
297 |     else:
298 |         print('%s,%s,%s' % (date, verif_str, mosx_str))
299 | 
300 | 
301 | # Write the forecast, if requested
302 | 
303 | if options.write_flag:
304 |     print('\nForecast write requested, writing to file %s' % options.write_file)
305 | 
306 |     with open(options.write_file, 'w') as f:
307 |         if options.mos_flag:
308 |             print >> f, 'date,verification,GFS MOS,NAM MOS,MOS MEAN,MOS-X'
309 |         else:
310 |             print >> f, 'date,verification,MOS-X'
311 |         for day in range(len(all_dates)):
312 |             date = all_dates[day]
313 |             if not options.no_verif_flag:
314 |                 verif_str = forecast_format % tuple([verif[date][v] for v in variables])
315 |             else:
316 |                 verif_str = ''
317 |             mosx_str = forecast_format % tuple(predicted[day, :4])
318 |             if options.mos_flag:
319 |                 try:
320 |                     gfs_str = forecast_format % tuple([gfs[date][v] for v in variables])
321 |                 except KeyError:
322 |                     gfs_str = ''
323 |                 try:
324 |                     nam_str = forecast_format % tuple([nam[date][v] for v in variables])
325 |                 except KeyError:
326 |                     nam_str = ''
327 |                 try:
328 |                     mos_mean_str = forecast_format % tuple([mos_mean[date][v] for v in variables])
329 |                 except KeyError:
330 |                     mos_mean_str = ''
331 |                 print >> f, '%s,%s,%s,%s,%s,%s' % (date, verif_str, gfs_str, nam_str, mos_mean_str, mosx_str)
332 |             else:
333 |                 print >> f, '%s,%s,%s' % (date, verif_str, mosx_str)
--------------------------------------------------------------------------------
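Once `verify` has been run with `-w`, the written CSV can be scored directly. A short sketch computing the mean absolute error of MOS-X against verification, assuming the default output file name and that every row has both fields populated (i.e., `-N` was not used):

```python
import pandas as pd

# Score the CSV written by `verify <config> -w`
df = pd.read_csv('MOS-X_verification.csv')
fcst = df['MOS-X'].str.split('/', expand=True).astype(float)
obs = df['verification'].str.split('/', expand=True).astype(float)
mae = (fcst - obs).abs().mean()
mae.index = ['Tmax', 'Tmin', 'Wind', 'Rain']
print(mae)
```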