├── .travis.yml ├── Functions Documentation.md ├── Functions Examples └── examples.py ├── LICENSE ├── README.md ├── dataset └── rain.csv ├── declustered.png ├── install_pot.py ├── nocluster.png ├── paper.bib ├── paper.md ├── requirements.txt ├── result_CDF.png ├── result_MODSCALE.png ├── result_MRL.png ├── result_SHAPE.png ├── result_pdf.png ├── result_pp.png ├── result_qq.png ├── result_retlvl.png ├── setup.py ├── tests ├── __init__.py ├── declustering_test.py ├── entropy_test.py ├── gpdfit_test.py ├── lmom_dist_test.py ├── lmom_sample_test.py ├── non_central_moments_test.py └── return_value_test.py └── thresholdmodeling ├── __init__.py └── thresh_modeling.py /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | # We don't actually use the Travis Python, but this keeps it organized. 4 | - "3.7" 5 | install: 6 | - sudo apt-get update 7 | # We do this conditionally because it saves us some downloading if the 8 | # version is the same. 9 | - if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then 10 | wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh -O miniconda.sh; 11 | else 12 | wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh; 13 | fi 14 | - bash miniconda.sh -b -p $HOME/miniconda 15 | - source "$HOME/miniconda/etc/profile.d/conda.sh" 16 | - hash -r 17 | - conda config --set always_yes yes --set changeps1 no 18 | - conda update -q conda 19 | # Useful for debugging any issues with conda 20 | - conda info -a 21 | 22 | # Replace dep1 dep2 ... with your dependencies 23 | - conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION 24 | - conda activate test-environment 25 | - conda install r 26 | - conda install -c r rpy2=2.9.4 27 | - python setup.py install 28 | 29 | script: 30 | - python setup.py test 31 | 32 | -------------------------------------------------------------------------------- /Functions Documentation.md: -------------------------------------------------------------------------------- 1 | # Functions Documentation 2 | 3 | This file presents the documentation of the functions provided by the ``thresholdmodeling`` package. 4 | 5 | ## Threshold Selection 6 | * **``MRL(sample, alpha)``** : It plots the Mean Residual Life function. ``sample`` is a 1-D array of the observations and ``alpha`` is a float representing the significance level. 7 | * **``Parameter_Stability_plot(sample, alpha)``** : It plots the two graphics of the parameter stability analysis, for the shape and the modified scale parameters. ``sample`` is a 1-D array of the observations and ``alpha`` is a float representing the significance level. 8 | 9 | ## Model Fit 10 | * **``gpdfit(sample, threshold, fit_method)``** : This function fits the given data to a GPD model and prints the GPD estimates to the terminal. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold and ``fit_method`` is one of the following fit methods (string format): 'mle', 'mple', 'moments', 'pwmu', 'pwmb', 'mdpd', 'med', 'pickands', 'lme' and 'mgf' for the maximum likelihood, maximum penalized likelihood, moments, unbiased probability weighted moments, biased probability weighted moments, minimum density power divergence, medians, Pickands', likelihood moment and maximum goodness-of-fit estimators, respectively.
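For a quick sense of the fitting step, a minimal sketch using the repository's example dataset and the maximum likelihood estimator is shown below (the URL and the threshold value of 30 come from the repository's own examples):

```python
from thresholdmodeling import thresh_modeling
import pandas as pd

# Daily rainfall dataset shipped with the repository
url = ('https://raw.githubusercontent.com/iagolemos1/'
       'thresholdmodeling/master/dataset/rain.csv')
data = pd.read_csv(url).values.ravel()  # 1-D array of observations

thresh_modeling.gpdfit(data, 30, 'mle')  # fit a GPD to the excesses over u = 30
```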
11 | 12 | ## Model Checking 13 | * **``gpdpdf(sample, threshold, fit_method, bin_method, alpha)``** : This function returns the GPD probability density function plot with the normalized empirical histogram. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold, ``fit_method`` is one of the following fit methods (string format): 'mle', 'mple', 'moments', 'pwmu', 'pwmb', 'mdpd', 'med', 'pickands', 'lme' and 'mgf' (for more information see **Model Fit**), ``bin_method`` is one of the following methods to compute the number of bins of a histogram: 'sturges', 'doane', 'scott', 'fd' (Freedman-Diaconis estimator), 'stone', 'rice' and 'sqrt', and ``alpha`` is the significance level. 14 | 15 | * **``gpdcdf(sample, threshold, fit_method, alpha)``** : This function returns the GPD cumulative distribution function plot with the empirical points and the confidence bands based on the Dvoretzky–Kiefer–Wolfowitz method. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold, ``fit_method`` is one of the following fit methods (string format): 'mle', 'mple', 'moments', 'pwmu', 'pwmb', 'mdpd', 'med', 'pickands', 'lme' and 'mgf' (for more information see **Model Fit**), and ``alpha`` is the significance level. 16 | 17 | * **``qqplot(sample, threshold, fit_method, alpha)``** : This function returns the quantile-quantile plot with the confidence bands based on the Kolmogorov-Smirnov two-sample test. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold, ``fit_method`` is one of the following fit methods (string format): 'mle', 'mple', 'moments', 'pwmu', 'pwmb', 'mdpd', 'med', 'pickands', 'lme' and 'mgf' (for more information see **Model Fit**), and ``alpha`` is the significance level. 18 | 19 | * **``ppplot(sample, threshold, fit_method, alpha)``** : This function returns the probability-probability plot with the confidence bands based on the Dvoretzky–Kiefer–Wolfowitz method. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold, ``fit_method`` is one of the following fit methods (string format): 'mle', 'mple', 'moments', 'pwmu', 'pwmb', 'mdpd', 'med', 'pickands', 'lme' and 'mgf' (for more information see **Model Fit**), and ``alpha`` is the significance level. 20 | 21 | * **``survival_function(sample, threshold, fit_method, alpha)``** : This function returns the survival function plot (1-CDF) with the empirical points and the confidence bands based on the Dvoretzky–Kiefer–Wolfowitz method. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold, ``fit_method`` is one of the following fit methods (string format): 'mle', 'mple', 'moments', 'pwmu', 'pwmb', 'mdpd', 'med', 'pickands', 'lme' and 'mgf' (for more information see **Model Fit**), and ``alpha`` is the significance level. 22 | 23 | * **``lmomplot(sample, threshold)``** : This function returns the L-Skewness against L-Kurtosis plot using the Generalized Pareto normalization. ``sample`` is a 1-D array of the observations and ``threshold`` is the chosen threshold. **Warning**: This plot is very difficult to interpret. 24 | 25 | ## Model Diagnostics and Return Level Analysis 26 | * **``return_value(sample, threshold, alpha, block_size, return_period, fit_method)``** : This function returns the return level for the given argument ``return_period``, with a confidence interval based on the Delta Method. It also draws the return level plot based on the block size (usually annual), with confidence bands based on the Delta Method and the empirical points. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold, ``alpha`` is the significance level, ``block_size`` is the number of observations that make up a block; for example, if the interest is an annual analysis of daily data, ``block_size`` should be 365, so that each block represents one year. ``return_period`` is the exact return period for which you want to compute the return level, and ``fit_method`` is one of the following fit methods (string format): 'mle', 'mple', 'moments', 'pwmu', 'pwmb', 'mdpd', 'med', 'pickands', 'lme' and 'mgf' (for more information see **Model Fit**). 27 | 28 | ## Declustering and Data Visualization 29 | 30 | * **``decluster(sample, threshold, block_size)``** : This function returns two graphics: the data against the unit of the return period (days, for example), and the declustered data based on the block size, taking the maximum of each block. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold and ``block_size`` is the number of observations that will be part of a cluster; for example, if the dataset is daily and the idea is to cluster by month, ``block_size`` should be 30. 31 | 32 | ## Further Functions for Additional Analysis 33 | 34 | * **``non_central_moments(sample, threshold, fit_method)``** : This function returns the non-central moments estimated from the model. 35 | ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold and ``fit_method`` is one of the following fit methods (string format): 'mle', 'mple', 'moments', 'pwmu', 'pwmb', 'mdpd', 'med', 'pickands', 'lme' and 'mgf' (for more information see **Model Fit**). 36 | 37 | * **``lmom_dist(sample, threshold, fit_method)``** : This function returns the L-moments estimated from the model. 38 | ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold and ``fit_method`` is one of the following fit methods (string format): 'mle', 'mple', 'moments', 'pwmu', 'pwmb', 'mdpd', 'med', 'pickands', 'lme' and 'mgf' (for more information see **Model Fit**). 39 | 40 | * **``lmom_sample(sample)``** : This function returns the L-moments estimated from the sample. ``sample`` is a 1-D array of the observations. 41 | 42 | * **``entropy(sample, b, threshold, fit_method)``** : This function returns the differential entropy of the model in nats. ``sample`` is a 1-D array of the observations, ``b`` must be equal to 'e' (changing it makes no difference to the result; it only indicates that the base of the logarithm is Euler's number), ``threshold`` is the chosen threshold and ``fit_method`` is one of the following fit methods (string format): 'mle', 'mple', 'moments', 'pwmu', 'pwmb', 'mdpd', 'med', 'pickands', 'lme' and 'mgf' (for more information see **Model Fit**).
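For reference, the quantities computed by ``return_value`` and ``entropy`` follow the standard GPD expressions, matching the implementation in the repository's test files (here $u$ is the threshold, $\sigma$ and $\xi$ the GPD scale and shape, $\zeta_u$ the proportion of observations above $u$, and $m$ the return period in observation units):

```latex
x_m = u + \frac{\sigma}{\xi}\left[(m\,\zeta_u)^{\xi} - 1\right] % return level for return period m
h   = \ln\sigma + \xi + 1                                       % differential entropy in nats
```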
43 | 44 | -------------------------------------------------------------------------------- /Functions Examples/examples.py: -------------------------------------------------------------------------------- 1 | from thresholdmodeling import thresh_modeling 2 | import pandas as pd 3 | 4 | 5 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv' 6 | df = pd.read_csv(url, error_bad_lines=False) 7 | data = df.values.ravel() 8 | 9 | 10 | thresh_modeling.MRL(data, 0.05) 11 | thresh_modeling.Parameter_Stability_plot(data, 0.05) 12 | thresh_modeling.gpdfit(data, 30, 'mle') 13 | thresh_modeling.gpdpdf(data, 30, 'mle', 'sturges', 0.05) 14 | thresh_modeling.qqplot(data,30, 'mle', 0.05) 15 | thresh_modeling.ppplot(data, 30, 'mle', 0.05) 16 | thresh_modeling.gpdcdf(data, 30, 'mle', 0.05) 17 | thresh_modeling.return_value(data, 30, 0.05, 365, 36500, 'mle') 18 | thresh_modeling.survival_function(data, 30, 'mle', 0.05) 19 | thresh_modeling.non_central_moments(data, 30, 'mle') 20 | thresh_modeling.lmom_dist(data, 30, 'mle') 21 | thresh_modeling.lmom_sample(data) 22 | thresh_modeling.lmomplot(data, 30) 23 | thresh_modeling.decluster(data, 30, 30) 24 | thresh_modeling.entropy(data, 'e', 30, 'mle') 25 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU LESSER GENERAL PUBLIC LICENSE 2 | Version 3, 29 June 2007 3 | 4 | Copyright (C) 2007 Free Software Foundation, Inc. 5 | Everyone is permitted to copy and distribute verbatim copies 6 | of this license document, but changing it is not allowed. 7 | 8 | 9 | This version of the GNU Lesser General Public License incorporates 10 | the terms and conditions of version 3 of the GNU General Public 11 | License, supplemented by the additional permissions listed below. 12 | 13 | 0. Additional Definitions. 14 | 15 | As used herein, "this License" refers to version 3 of the GNU Lesser 16 | General Public License, and the "GNU GPL" refers to version 3 of the GNU 17 | General Public License. 18 | 19 | "The Library" refers to a covered work governed by this License, 20 | other than an Application or a Combined Work as defined below. 21 | 22 | An "Application" is any work that makes use of an interface provided 23 | by the Library, but which is not otherwise based on the Library. 24 | Defining a subclass of a class defined by the Library is deemed a mode 25 | of using an interface provided by the Library. 26 | 27 | A "Combined Work" is a work produced by combining or linking an 28 | Application with the Library. The particular version of the Library 29 | with which the Combined Work was made is also called the "Linked 30 | Version". 31 | 32 | The "Minimal Corresponding Source" for a Combined Work means the 33 | Corresponding Source for the Combined Work, excluding any source code 34 | for portions of the Combined Work that, considered in isolation, are 35 | based on the Application, and not on the Linked Version. 36 | 37 | The "Corresponding Application Code" for a Combined Work means the 38 | object code and/or source code for the Application, including any data 39 | and utility programs needed for reproducing the Combined Work from the 40 | Application, but excluding the System Libraries of the Combined Work. 41 | 42 | 1. Exception to Section 3 of the GNU GPL. 43 | 44 | You may convey a covered work under sections 3 and 4 of this License 45 | without being bound by section 3 of the GNU GPL. 46 | 47 | 2. 
Conveying Modified Versions. 48 | 49 | If you modify a copy of the Library, and, in your modifications, a 50 | facility refers to a function or data to be supplied by an Application 51 | that uses the facility (other than as an argument passed when the 52 | facility is invoked), then you may convey a copy of the modified 53 | version: 54 | 55 | a) under this License, provided that you make a good faith effort to 56 | ensure that, in the event an Application does not supply the 57 | function or data, the facility still operates, and performs 58 | whatever part of its purpose remains meaningful, or 59 | 60 | b) under the GNU GPL, with none of the additional permissions of 61 | this License applicable to that copy. 62 | 63 | 3. Object Code Incorporating Material from Library Header Files. 64 | 65 | The object code form of an Application may incorporate material from 66 | a header file that is part of the Library. You may convey such object 67 | code under terms of your choice, provided that, if the incorporated 68 | material is not limited to numerical parameters, data structure 69 | layouts and accessors, or small macros, inline functions and templates 70 | (ten or fewer lines in length), you do both of the following: 71 | 72 | a) Give prominent notice with each copy of the object code that the 73 | Library is used in it and that the Library and its use are 74 | covered by this License. 75 | 76 | b) Accompany the object code with a copy of the GNU GPL and this license 77 | document. 78 | 79 | 4. Combined Works. 80 | 81 | You may convey a Combined Work under terms of your choice that, 82 | taken together, effectively do not restrict modification of the 83 | portions of the Library contained in the Combined Work and reverse 84 | engineering for debugging such modifications, if you also do each of 85 | the following: 86 | 87 | a) Give prominent notice with each copy of the Combined Work that 88 | the Library is used in it and that the Library and its use are 89 | covered by this License. 90 | 91 | b) Accompany the Combined Work with a copy of the GNU GPL and this license 92 | document. 93 | 94 | c) For a Combined Work that displays copyright notices during 95 | execution, include the copyright notice for the Library among 96 | these notices, as well as a reference directing the user to the 97 | copies of the GNU GPL and this license document. 98 | 99 | d) Do one of the following: 100 | 101 | 0) Convey the Minimal Corresponding Source under the terms of this 102 | License, and the Corresponding Application Code in a form 103 | suitable for, and under terms that permit, the user to 104 | recombine or relink the Application with a modified version of 105 | the Linked Version to produce a modified Combined Work, in the 106 | manner specified by section 6 of the GNU GPL for conveying 107 | Corresponding Source. 108 | 109 | 1) Use a suitable shared library mechanism for linking with the 110 | Library. A suitable mechanism is one that (a) uses at run time 111 | a copy of the Library already present on the user's computer 112 | system, and (b) will operate properly with a modified version 113 | of the Library that is interface-compatible with the Linked 114 | Version. 
115 | 116 | e) Provide Installation Information, but only if you would otherwise 117 | be required to provide such information under section 6 of the 118 | GNU GPL, and only to the extent that such information is 119 | necessary to install and execute a modified version of the 120 | Combined Work produced by recombining or relinking the 121 | Application with a modified version of the Linked Version. (If 122 | you use option 4d0, the Installation Information must accompany 123 | the Minimal Corresponding Source and Corresponding Application 124 | Code. If you use option 4d1, you must provide the Installation 125 | Information in the manner specified by section 6 of the GNU GPL 126 | for conveying Corresponding Source.) 127 | 128 | 5. Combined Libraries. 129 | 130 | You may place library facilities that are a work based on the 131 | Library side by side in a single library together with other library 132 | facilities that are not Applications and are not covered by this 133 | License, and convey such a combined library under terms of your 134 | choice, if you do both of the following: 135 | 136 | a) Accompany the combined library with a copy of the same work based 137 | on the Library, uncombined with any other library facilities, 138 | conveyed under the terms of this License. 139 | 140 | b) Give prominent notice with the combined library that part of it 141 | is a work based on the Library, and explaining where to find the 142 | accompanying uncombined form of the same work. 143 | 144 | 6. Revised Versions of the GNU Lesser General Public License. 145 | 146 | The Free Software Foundation may publish revised and/or new versions 147 | of the GNU Lesser General Public License from time to time. Such new 148 | versions will be similar in spirit to the present version, but may 149 | differ in detail to address new problems or concerns. 150 | 151 | Each version is given a distinguishing version number. If the 152 | Library as you received it specifies that a certain numbered version 153 | of the GNU Lesser General Public License "or any later version" 154 | applies to it, you have the option of following the terms and 155 | conditions either of that published version or of any later version 156 | published by the Free Software Foundation. If the Library as you 157 | received it does not specify a version number of the GNU Lesser 158 | General Public License, you may choose any version of the GNU Lesser 159 | General Public License ever published by the Free Software Foundation. 160 | 161 | If the Library as you received it specifies that a proxy can decide 162 | whether future versions of the GNU Lesser General Public License shall 163 | apply, that proxy's public statement of acceptance of any version is 164 | permanent authorization for you to choose that version for the 165 | Library. 166 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3661338.svg)](https://doi.org/10.5281/zenodo.3661338) 2 | [![DOI](https://joss.theoj.org/papers/10.21105/joss.02013/status.svg)](https://doi.org/10.21105/joss.02013) 3 | 4 | # ```thresholdmodeling```: A Python package for modeling excesses over a threshold using the Peak-Over-Threshold Method and the Generalized Pareto Distribution 5 | 6 | This package is intended for those who wish to conduct an extreme values analysis. 
It provides the whole toolkit necessary to create a threshold model in a simple and efficient way, presenting the main methods of the Peak-Over-Threshold approach and the fit to the Generalized Pareto Distribution. 7 | 8 | In this repository you can find the main files of the package, the [Functions Documentation](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md), the [dataset](https://github.com/iagolemos1/thresholdmodeling/blob/master/dataset/rain.csv) used in some examples, the [paper](https://github.com/iagolemos1/thresholdmodeling/blob/master/paper.md) submitted to the [Journal of Open Source Software](https://joss.theoj.org/) and some tutorials. 9 | 10 | # Installing Package 11 | **It is necessary to have an internet connection and to use the Anaconda distribution (Python 3).** 12 | 13 | * For installing Anaconda on Linux, go to [this link](https://docs.anaconda.com/anaconda/install/linux/). For installing on Windows, go to [this one](https://docs.anaconda.com/anaconda/install/windows/). For installing on macOS, go to [this one](https://docs.anaconda.com/anaconda/install/mac-os/). 14 | 15 | * For creating your own environment by using the terminal or Anaconda Prompt, go [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands). 16 | 17 | ## Windows Users 18 | First, it is necessary to install R in your environment. Since ``rpy2`` (a Python dependency of thresholdmodeling) does not have Windows support, installing it via ``pip install thresholdmodeling`` will result in an error; the same occurs with ``pip install rpy2``. It is therefore necessary to download it from an unofficial website: 19 | https://www.lfd.uci.edu/~gohlke/pythonlibs/ 20 | There, find the rpy2 release that works on your machine and install it manually: open the Anaconda Prompt in the download folder and run a line like this (the exact name depends on the downloaded file): 21 | ``` 22 | pip install rpy2‑2.9.5‑cp37‑cp37m‑win_amd64.whl 23 | ``` 24 | **Or** you can install it from the Anaconda Prompt by activating your environment and running: 25 | ``` 26 | conda activate my_env 27 | conda install r 28 | conda install -c r rpy2=2.9.4 29 | ``` 30 | After that, ``rpy2`` and ``R`` will be installed on your machine. Follow the next steps. 31 | 32 | For installing the package, just use the following command in your Anaconda Prompt (it is already on PyPI): 33 | ``` 34 | pip install thresholdmodeling 35 | ``` 36 | The other Python dependencies needed to run the software will be installed automatically with this command. 37 | 38 | Once the package is installed, it is necessary to run these lines in your IDE to install the ``POT`` ``R`` package (the package that our software uses, by means of ``rpy2``, to compute the GPD estimates): 39 | ```python 40 | from rpy2.robjects.packages import importr 41 | import rpy2.robjects.packages as rpackages 42 | 43 | base = importr('base') 44 | utils = importr('utils') 45 | utils.chooseCRANmirror(ind=1) 46 | utils.install_packages('POT') #installing POT package 47 | ``` 48 | 49 | ## Linux Users 50 | First, run these lines in your terminal in order to install R and the ``rpy2`` package in your environment: 51 | ``` 52 | conda activate my_env (my_env is your environment name) 53 | conda install r 54 | conda install -c r rpy2=2.9.4 55 | ``` 56 | After installing R and ``rpy2``, find your anaconda directory and locate the environment folder.
It should be somewhere like ~/anaconda3/envs/my_env. Open the terminal in this folder and run this line (the other dependencies will be installed automatically): 57 | ``` 58 | pip install thresholdmodeling 59 | ``` 60 | Once the package is installed, it is necessary to run these lines in your IDE to install the ``POT`` ``R`` package (the package that our software uses, by means of ``rpy2``, to compute the GPD estimates): 61 | 62 | ```python 63 | from rpy2.robjects.packages import importr 64 | import rpy2.robjects.packages as rpackages 65 | 66 | base = importr('base') 67 | utils = importr('utils') 68 | utils.chooseCRANmirror(ind=1) 69 | utils.install_packages('POT') #installing POT package 70 | ``` 71 | Or, it is possible to download this [file](https://github.com/iagolemos1/thresholdmodeling/blob/master/install_pot.py) and run it in your IDE to install ``POT``. 72 | # User's guide and Reproducibility 73 | In the file [example](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Examples/examples.py) it is possible to see how the package should be used. The [Functions Documentation](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md) provides complete documentation on how to use the functions presented in the package. 74 | 75 | As a tutorial on how to use the package and its results, a guide is presented below, following the example in Coles's [book](https://www.springer.com/gp/book/9781852334598) with the [Daily Rainfall in South-West England](https://github.com/iagolemos1/thresholdmodeling/blob/master/dataset/rain.csv) dataset. 76 | 77 | ## Threshold Selection 78 | First, it is necessary to conduct a threshold analysis using the first two functions of the package, ``MRL`` and ``Parameter_Stability_plot``, in order to select a reasonable threshold value. 79 | Running this: 80 | ```python 81 | from thresholdmodeling import thresh_modeling #importing package 82 | import pandas as pd #importing pandas 83 | 84 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv' #saving url 85 | df = pd.read_csv(url, error_bad_lines=False) #getting data 86 | data = df.values.ravel() #turning data into an array 87 | 88 | thresh_modeling.MRL(data, 0.05) 89 | thresh_modeling.Parameter_Stability_plot(data, 0.05) 90 | ``` 91 | The results must be: 92 | 93 | ![](result_MRL.png) 94 | 95 | ![](result_SHAPE.png) 96 | 97 | ![](result_MODSCALE.png) 98 | 99 | Then, by analysing the three graphics, it is reasonable to take the threshold value as 30.
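For intuition, the Mean Residual Life curve is simply the average excess over each candidate threshold; a minimal sketch of that computation, independent of the package and with plotting and confidence bands omitted (the helper name here is just for illustration), is:

```python
import numpy as np

def mean_residual_life(sample, thresholds):
    """Mean excess over each candidate threshold u (bare-bones sketch)."""
    sample = np.asarray(sample)
    return [np.mean(sample[sample > u] - u) for u in thresholds]

# e.g.: mean_residual_life(data, np.linspace(0, data.max() - 1, 100))
```

A roughly linear MRL curve above some value of u suggests that a GPD is appropriate above u, which is how the value 30 is read off the plots above.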
100 | 101 | ## Model Fit 102 | Once the threshold value is defined, it is possible to fit the dataset to a GPD model by using the function ``gpdfit``, running the following line with the maximum likelihood estimation method: 103 | 104 | ```python 105 | thresh_modeling.gpdfit(data, 30, 'mle') 106 | ``` 107 | 108 | The results in the terminal must look like: 109 | ``` 110 | Estimator: MLE 111 | 112 | Deviance: 970.1874 113 | 114 | AIC: 974.1874 115 | 116 | 117 | Varying Threshold: FALSE 118 | 119 | 120 | Threshold Call: 30L 121 | 122 | Number Above: 152 123 | 124 | Proportion Above: 0.0087 125 | 126 | 127 | Estimates 128 | 129 | scale shape 130 | 131 | 7.4411 0.1845 132 | 133 | 134 | Standard Error Type: observed 135 | 136 | 137 | Standard Errors 138 | 139 | scale shape 140 | 141 | 0.9587 0.1012 142 | 143 | 144 | Asymptotic Variance Covariance 145 | 146 | scale shape 147 | 148 | scale 0.91920 -0.06554 149 | 150 | shape -0.06554 0.01025 151 | 152 | 153 | Optimization Information 154 | 155 | Convergence: successful 156 | 157 | Function Evaluations: 14 158 | 159 | Gradient Evaluations: 6 160 | ``` 161 | These are the GPD model estimates using the maximum likelihood estimator. 162 | 163 | ## Model Checking 164 | Once the GPD model is defined, it is necessary to verify whether the model is reasonable and describes the empirical observations well. Plots such as the probability density function, cumulative distribution function, quantile-quantile and probability-probability plots can show us whether the model is good. It is possible to obtain these plots using some functions of the package: ``gpdpdf``, ``gpdcdf``, ``qqplot`` and ``ppplot``. By running these lines: 165 | ```python 166 | thresh_modeling.gpdpdf(data, 30, 'mle', 'sturges', 0.05) 167 | thresh_modeling.gpdcdf(data, 30, 'mle', 0.05) 168 | thresh_modeling.qqplot(data,30, 'mle', 0.05) 169 | thresh_modeling.ppplot(data, 30, 'mle', 0.05) 170 | ``` 171 | The results must be: 172 | 173 | ![](result_pdf.png) 174 | 175 | ![](result_CDF.png) 176 | 177 | ![](result_qq.png) 178 | 179 | ![](result_pp.png) 180 | 181 | Once it is verified that the theoretical model describes the empirical observations very well, the next step is to use the main tool of the extreme value approach: extrapolation over the unit of the return period. 182 | 183 | ## Return Value Analysis 184 | The first thing that must be defined is: what is the unit of the return period? In this example, the unit is days because the observations are **daily**, but in other applications, like corrosion engineering, the unit may be the number of observations. 185 | 186 | Using the function ``return_value``, it is possible to obtain two things: 187 | * **1** : The return value for a given return period and; 188 | * **2** : The return level plot, which works very well as a model diagnostic. 189 | 190 | By running this line (go to [Model Diagnostics and Return Level Analysis](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md#model-diagnostics-and-return-level-analysis) for more information about the function): 191 | ```python 192 | thresh_modeling.return_value(data, 30, 0.05, 365, 36500, 'mle') 193 | ``` 194 | This means the return period for which we want the return value is 36500 days, or 100 years, and with the 365 we are saying that the number of observations per year is 365.
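Since the return period is expressed in observation units, converting from years is a single multiplication (a sketch using the values above):

```python
# Converting a return period in years to observation units (daily data)
block_size = 365                    # observations per year
years = 100
return_period = years * block_size  # 36500, the value passed to return_value
```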
195 | 196 | The results must be: 197 | 198 | ![](result_retlvl.png) 199 | 200 | ``` 201 | The return value for the given return period is 106.34386649996667 ± 40.86691363790978 202 | ``` 203 | Hence, from the graphic, it is possible to say that the theoretical model is very well fitted. 204 | It was also possible to compute the 100-year return value. In other words, the rainfall precipitation expected once every 100 years must be between 65.4770 and 147.2108 mm. 205 | 206 | ## Declustering 207 | Stuart Coles, in his [book](https://www.springer.com/gp/book/9781852334598), says that if the extremes of a stationary series show a tendency to occur in clusters, another practice is needed to model these values. That practice is declustering: group the data into clusters and keep only the maximum of each one. For this example, it is clear that, at least initially, the dataset is not organized in clusters. With the function ``decluster`` it is possible to observe the dataset plotted against its unit of return period, but it is also possible to cluster it using a given block size (in this example it will be monthly, so the block size will be 30 days) and then decluster it by taking the maximum of each block. 208 | 209 | By running this line: 210 | ```python 211 | thresh_modeling.decluster(data, 30, 30) 212 | ``` 213 | The result must be: 214 | 215 | ![](nocluster.png) 216 | 217 | ![](declustered.png) 218 | 219 | It is important to note that after declustering, the unit of the return period changes (to monthly). From the first graph it is possible to observe that, at least initially, there is no clustering pattern. However, this does not mean that it is not possible to decluster the dataset with a given block size, as the second graphic shows. 220 | 221 | If it is necessary to decluster the dataset, the second series, shown in the declustered graphic, must be used. 222 | 223 | ## Further Functions 224 | The other functions that are not in this tutorial can be used as shown in the [test](https://github.com/iagolemos1/thresholdmodeling/blob/master/Test/test.py) file. The description of each one is in the [Functions Documentation](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md). 225 | 226 | ## Doubts 227 | If you have any doubts about the package, don't hesitate to contact me. 228 | 229 | # General License 230 | 231 | Copyright (c) 2019 Iago Pereira Lemos 232 | 233 | This program is free software: you can redistribute it and/or modify 234 | it under the terms of the GNU General Public License as published by 235 | the Free Software Foundation, either version 3 of the License, or 236 | (at your option) any later version. 237 | 238 | This program is distributed in the hope that it will be useful, 239 | but WITHOUT ANY WARRANTY; without even the implied warranty of 240 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 241 | GNU General Public License for more details. 242 | 243 | You should have received a copy of the GNU General Public License 244 | along with this program. If not, see 245 | 246 | # Referencing 247 | For referencing the repository, use the following code: 248 | ``` 249 | @misc{thresholdmodeling, 250 | author = {Iago P. Lemos and Antonio Marcos G.
Lima and Marcus Antonio Viana Duarte}, 251 | title = {thresholdmodeling package}, 252 | month = feb, 253 | year = 2020, 254 | doi = {10.5281/zenodo.3661338}, 255 | version = {0.0.1}, 256 | publisher = {Zenodo}, 257 | url = {https://github.com/iagolemos1/thresholdmodeling} 258 | } 259 | ``` 260 | # Background 261 | I am a mechanical engineering undergraduate student at the Federal University of Uberlândia, and this package was made in the Acoustics and Vibration Laboratory, in the School of Mechanical Engineering. 262 | 263 | -------------------------------------------------------------------------------- /declustered.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/declustered.png -------------------------------------------------------------------------------- /install_pot.py: -------------------------------------------------------------------------------- 1 | from rpy2.robjects.packages import importr 2 | import rpy2.robjects.packages as rpackages 3 | 4 | base = importr('base') 5 | utils = importr('utils') 6 | utils.chooseCRANmirror(ind=1) 7 | utils.install_packages('POT') #installing POT package -------------------------------------------------------------------------------- /nocluster.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/nocluster.png -------------------------------------------------------------------------------- /paper.bib: -------------------------------------------------------------------------------- 1 | @Book{coles, 2 | author = {S. Coles}, 3 | title = {An {I}ntroduction to {S}tatistical {M}odeling of {E}xtreme {V}alues}, 4 | year = {2001}, 5 | edition = {1st}, 6 | publisher = {Springer}, 7 | address = {London}, 8 | doi = {10.1007/978-1-4471-3675-0}, 9 | } 10 | @Manual{POT, 11 | title = {\pkg{POT}: Generalized {P}areto {D}istribution and {P}eaks {O}ver {T}hreshold}, 12 | author = {Mathieu Ribatet and Christophe Dutang}, 13 | year = {2019}, 14 | note = {\proglang{R} package version 1.1-7}, 15 | url = {https://cran.r-project.org/web/packages/POT/index.html}, 16 | } 17 | @Manual{extremes, 18 | title = {\pkg{extRemes}: {E}xtreme {V}alue {A}nalysis}, 19 | author = {Eric Gilleland}, 20 | year = {2019}, 21 | note = {\proglang{R} package version 2.0-11}, 22 | url = {https://cran.r-project.org/web/packages/extRemes/index.html}, 23 | } 24 | @Manual{evd, 25 | title = {\pkg{evd}: Functions for {E}xtreme {V}alue {D}istributions}, 26 | author = {Alec Stephenson}, 27 | year = {2018}, 28 | note = {\proglang{R} package version 2.3-3}, 29 | url = {https://cran.r-project.org/web/packages/evd/index.html}, 30 | } 31 | 32 | @Manual{ismev, 33 | title = {\pkg{ismev}: An {I}ntroduction to {S}tatistical {M}odeling of {E}xtreme {V}alues}, 34 | author = {Janet E. Heffernan and Alec G.
Stephenson}, 35 | year = {2018}, 36 | note = {\proglang{R} package version 1.42}, 37 | url = {https://cran.r-project.org/web/packages/ismev/index.html}, 38 | } 39 | 40 | @thesis{tan, 41 | author = {Hwei-Yang Tan}, 42 | title = {Analysis of {C}orrosion {D}ata for 43 | {I}ntegrity {A}ssessments}, 44 | type = {Thesis for the Degree of Doctor of Philosophy}, 45 | year = {2017}, 46 | institution = {Brunel University London}, 47 | date = {2017}, 48 | } 49 | 50 | @unpublished{esther, 51 | author = {Esther Bommier}, 52 | title = {Peaks-{O}ver-{T}hreshold {M}odelling of 53 | {E}nvironmental {D}ata}, 54 | note = {Examensarbete i matematik, Uppsala University}, 55 | year = {2014},} 56 | 57 | @unpublished{max, 58 | author = {Max Rydman}, 59 | title = {Application of the {P}eaks-{O}ver-{T}hreshold 60 | {M}ethod on {I}nsurance {D}ata}, 61 | note = {Examensarbete i matematik, Uppsala University}, 62 | year = {2018},} 63 | 64 | @Article{katz, 65 | author = {Richard W. Katz and Marc B. Parlange and Philippe Naveau}, 66 | title = {Statistics of extremes in hydrology}, 67 | journal = {Advances in Water Resources}, 68 | year = {2002}, 69 | volume = {25}, 70 | number = {8--12}, 71 | pages = {1287--1304}, 72 | doi = {10.1016/S0309-1708(02)00056-8}, 73 | } 74 | 75 | @Book{hosking, 76 | author = {J. R. M. Hosking and J. R. Wallis}, 77 | title = {Regional {F}requency {A}nalysis: {A}n {A}pproach {B}ased on {L}-{M}oments.}, 78 | year = {1997}, 79 | edition = {1st}, 80 | publisher = {Cambridge University Press}, 81 | address = {Cambridge}, 82 | doi = {10.1017/CBO9780511529443}, 83 | } 84 | 85 | @Article{scarf, 86 | author = {Philip A. Scarf and Patrick J. Laycock}, 87 | title = {Applications of {E}xtreme {V}alue {T}heory 88 | in {C}orrosion {E}ngineering}, 89 | journal = {Journal of Research of the National Institute of Standards and Technology}, 90 | year = {1994}, 91 | volume = {99}, 92 | number = {4}, 93 | pages = {313--320}, 94 | doi = {10.6028/jres.099.028}, 95 | } 96 | 97 | @Article{evpot, 98 | author = {Soheil S. Far and Ahmad K. A. 
Wahab}, 99 | title = {Evaluation of {P}eaks-{O}ver-{T}hreshold {M}ethod}, 100 | journal = {Ocean Science}, 101 | year = {2016}, 102 | volume = {99}, 103 | number = {4}, 104 | pages = {313--320}, 105 | doi = {10.5194/os-2016-47} 106 | 107 | } 108 | 109 | @online{scipy, 110 | author = {Scipy}, 111 | title = {scipy.stats.genpareto}, 112 | year = {2019}, 113 | url = {https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.genpareto.html}, 114 | } 115 | @online{kiko, 116 | author = {Kiko Correoso}, 117 | title = {scikit-extremes}, 118 | year = {2019}, 119 | url = {https://github.com/kikocorreoso/scikit-extremes}, 120 | } 121 | -------------------------------------------------------------------------------- /paper.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'thresholdmodeling: A Python package for modeling excesses over a threshold using the Peak-Over-Threshold Method and the Generalized Pareto Distribution' 3 | tags: 4 | - Python 5 | - Threshold Models 6 | - Peak-Over-Threshold Method 7 | - Generalized Pareto Distribution 8 | - Statistical Modeling 9 | authors: 10 | - name: Iago Pereira Lemos 11 | orcid: 0000-0002-5829-7711 12 | affiliation: "1, 2, 3" 13 | 14 | - name: Antônio Marcos Gonçalves Lima 15 | orcid: 0000-0003-0170-6083 16 | affiliation: "4, 2, 3" 17 | 18 | - name: Marcus Antônio Viana Duarte 19 | orcid: 0000-0002-8166-5666 20 | affiliation: "4, 1, 2, 3" 21 | affiliations: 22 | - name: Acoustics and Vibration Laboratory 23 | index: 1 24 | - name: School of Mechanical Engineering 25 | index: 2 26 | - name: Federal University of Uberlândia 27 | index: 3 28 | - name: Associate Professor 29 | index: 4 30 | 31 | date: 06 January, 2020 32 | bibliography: paper.bib 33 | --- 34 | 35 | # Summary 36 | 37 | Extreme value analysis has emerged as one of the most important disciplines 38 | for the applied sciences when dealing with reduced datasets and when the main idea is to 39 | extrapolate the observations over a given time. By using a threshold model with an asymptotic characterization, it is possible to work with the Generalized Pareto Distribution (GPD) [@coles] and use it to model the stochastic behavior of a process at an unusual level, either a maximum or a minimum. For example, consider a large dataset of wind velocity in Florida, USA, during a certain period of time. It is possible to model this process and to quantify the probability of extreme events, for example hurricanes, which are maximum observations of wind velocity, in a time of interest using the return value tool. 40 | 41 | In this context, this package provides a complete toolkit to conduct a threshold model analysis, from the beginning phase of selecting the threshold, going through the model fit, model checking, and return value analysis. Moreover, statistical moments functions are provided. In the case of extremes of dependent sequences it is also possible to conduct a declustering analysis. 42 | 43 | In a software context, it is possible to see a strong community working with ``R`` packages like ``POT`` [@POT], ``evd`` [@evd], and ``extRemes`` [@extremes] that are used for complete extreme value modeling. 44 | In ``Python``, it is possible to find ``scikit-extremes`` [@kiko], which does not contain threshold models yet. Another package is ``scipy``, which has the ``genpareto`` [@scipy] functions, but this does not provide any Peak-Over-Threshold modeling functions, since it is not possible to define a threshold using this package.
Moreover, this package brings to the community of scientists, engineers, and any other interested person and programmer the possibility of conducting an extreme value analysis using a strong, consolidated and high-level programming language, given the importance of the extreme value theory approach for statistical analysis in corrosion engineering (see @scarf and @tan), hydrology (see @katz), environmental data analysis (see @max and @esther) and many other fields of natural sciences and engineering. (For a large number of additional applications, see @coles p. 1.) 45 | 46 | Hence, the ``thresholdmodeling`` package presents numerous functions to model the stochastic behavior of an extreme process. For a complete introduction to the fifteen package functions, it is crucial to go to the [Functions Documentation](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md) on the [GitHub page](https://github.com/iagolemos1/thresholdmodeling). 47 | 48 | # Package Features 49 | 50 | ## Threshold Selection 51 | * **Mean Residual Life Plot**: It is possible to plot the Mean Residual Life function as it is defined in @coles; 52 | 53 | * **Parameter Stability Plot**: Also, it is possible to obtain the two parameter stability plots of the GPD: the Shape Parameter Stability Plot and the Modified Scale Parameter Stability Plot, which is defined from a reparametrization of the GPD scale parameter. (See @coles for a complete theoretical introduction to these two plots.) 54 | 55 | ## Model Fit 56 | * **Fit the GPD Model**: Fitting a given dataset to a GPD model using several fit methods (see [**Model Fit**](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md#model-fit)). 57 | 58 | ## Model Checking 59 | * **Probability Density Function, Cumulative Distribution Function, Quantile-Quantile and Probability-Probability Plots**: Plots the theoretical probability density function with the normalized empirical histograms for a given dataset, using several bin methods (see [``gpdpdf``](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md#model-fit)). 60 | Also, the theoretical CDF in comparison to the empirical one, with the Dvoretzky–Kiefer–Wolfowitz confidence bands, can be drawn. 61 | In addition, the QQ and PP plots, comparing the sample and the theoretical values, can be obtained, where the first uses the Kolmogorov-Smirnov two-sample test for getting the confidence bands while the second uses the Dvoretzky–Kiefer–Wolfowitz method; 62 | 63 | * **L-Moments Plots**: L-Skewness against L-Kurtosis plot for given threshold values using the Generalized Pareto parametrization. Be warned, L-Moments plots are really difficult to interpret. See @POT and @hosking for more details. 64 | 65 | ## Model Diagnostics and Return Level Analysis 66 | * **Return Level Computation and Plot**: Computing the return value for a given return period is also possible, with a confidence interval obtained by the Delta Method [@coles]. Furthermore, a return level plot is provided, using the Delta Method in order to obtain the confidence bands. For comparison, the empirical return level estimates are also plotted. 67 | 68 | ## Declustering and Data Visualization 69 | It is possible to visualize the data against the unit of the return period.
In the case of extremes of dependent sequences, for a given empirical rule (a number of days, for example), it is possible to cluster the dataset and, by taking the maximum observation of each cluster, a declustering of maxima is done. 70 | 71 | ## Further Functions 72 | It is also possible to compute sample L-Moments, model L-Moments, non-central moments, differential entropy, and the survival function plot. 73 | 74 | ## Installation 75 | 76 | For installation instructions, see the [README](https://github.com/iagolemos1/thresholdmodeling/blob/master/README.md) on the GitHub page. 77 | 78 | # Reproducibility and User's Guide 79 | 80 | The repository on the [GitHub page](https://github.com/iagolemos1/thresholdmodeling) contains a link to 81 | the dataset: Daily Rainfall in the South-West of England from 1914 to 1962. 82 | It can be used to test the software, in order to verify its results and compare them with the ones presented in @coles. For a more detailed tutorial on the use of each function, go to the [Test](https://github.com/iagolemos1/thresholdmodeling/blob/master/Test/test.py) directory. 83 | 84 | A minimal example of how to use the software and obtain some of the results presented by @coles is given below. For information about the functions employed, see the [Functions Documentation](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md) and for more details on reproducibility, see the [README](https://github.com/iagolemos1/thresholdmodeling/blob/master/README.md). 85 | 86 | ```python 87 | from thresholdmodeling import thresh_modeling 88 | import pandas as pd 89 | 90 | url = ('https://raw.githubusercontent.com/iagolemos1' 91 | '/thresholdmodeling/master/dataset/rain.csv') 92 | df = pd.read_csv(url, error_bad_lines=False) 93 | data = df.values.ravel() 94 | 95 | thresh_modeling.MRL(data, 0.05) 96 | thresh_modeling.return_value(data, 30, 0.05, 365, 36500, 'mle') 97 | ``` 98 | ![](result_MRL.png) 99 | 100 | **Fig. 1:** Mean Residual Life Plot for the daily rainfall dataset. 101 | 102 | ![](result_retlvl.png) 103 | 104 | **Fig. 2:** Return level plot with the empirical estimates of the return level and the confidence bands based on the Delta Method. 105 | 106 | Also, for the given return period (100 years), the software presents the following results in the terminal: 107 | ``` 108 | The return value for the given return period is 106.3439 ± 40.8669 109 | ``` 110 | 111 | For more details, the documentation on the [GitHub page](https://github.com/iagolemos1/thresholdmodeling) is up-to-date. 112 | 113 | # Acknowledgements 114 | 115 | The authors would like to thank the School of Mechanical Engineering at the Federal University of Uberlândia, as well as CNPq and CAPES, for the financial support for this research.
116 | 117 | # References 118 | 119 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | scipy 2 | numpy 3 | pandas 4 | thresholdmodeling 5 | matplotlib 6 | 7 | -------------------------------------------------------------------------------- /result_CDF.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_CDF.png -------------------------------------------------------------------------------- /result_MODSCALE.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_MODSCALE.png -------------------------------------------------------------------------------- /result_MRL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_MRL.png -------------------------------------------------------------------------------- /result_SHAPE.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_SHAPE.png -------------------------------------------------------------------------------- /result_pdf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_pdf.png -------------------------------------------------------------------------------- /result_pp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_pp.png -------------------------------------------------------------------------------- /result_qq.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_qq.png -------------------------------------------------------------------------------- /result_retlvl.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_retlvl.png -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | 3 | with open("README.md", "r") as fh: 4 | long_description = fh.read() 5 | 6 | setuptools.setup( 7 | name="thresholdmodeling", 8 | version="0.0.1", 9 | author="Iago Pereira Lemos", 10 | author_email="lemosiago123@gmail.com", 11 | description="This package is intended for those who wish to conduct an extreme values analysis. It provides the whole toolkit necessary to create a threshold model in a simple and efficient way, presenting the main methods of the Peak-Over-Threshold Method and the fit to the Generalized Pareto Distribution.
To install and use it, go to https://github.com/iagolemos1/thresholdmodeling", 12 | long_description=long_description, 13 | long_description_content_type="text/markdown", 14 | url="https://github.com/iagolemos1/thresholdmodeling", 15 | packages=['thresholdmodeling'], 16 | test_suite = 'tests', 17 | classifiers=[ 18 | "Programming Language :: Python :: 3", 19 | 'License :: OSI Approved :: GNU General Public License (GPL)', 20 | "Operating System :: OS Independent", 21 | ], 22 | python_requires='>=3.6', 23 | install_requires= ['numpy','scipy','rpy2','matplotlib','seaborn']) 24 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /tests/declustering_test.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | import numpy as np 3 | import matplotlib.pyplot as plt 4 | 5 | def decluster(sample, threshold, block_size): #function to decluster the dataset into period blocks 6 | period_unit = np.arange(1, len(sample)+1, 1) #period array 7 | threshold_array = np.ones(len(sample))*threshold 8 | nob = int(len(sample)/block_size) #number of blocks 9 | clust = np.zeros((nob, block_size)) #initialization of the cluster matrix (rows: clusters; columns: observations) 10 | #Algorithm to cluster 11 | k = 0 12 | for i in range(0, nob): 13 | for j in range(0, block_size): 14 | clust[i][j] = sample[j+k] 15 | k = j + k + 1 16 | 17 | block_max = np.amax(clust, 1) #getting the max of each block and declustering 18 | 19 | period_unit_block = np.arange(0, len(block_max), 1) #array of periods for each block 20 | threshold_block_array = np.ones(len(block_max))*threshold 21 | 22 | #Plot real dataset 23 | plt.figure(11) 24 | plt.scatter(period_unit, sample) 25 | plt.plot(period_unit, threshold_array, label = 'Threshold', color = 'red') 26 | plt.legend() 27 | plt.xlabel('Period Unit') 28 | plt.ylabel('Data') 29 | plt.title('Sample dataset per Period Unit') 30 | 31 | #Plot declustered data 32 | plt.figure(12) 33 | plt.scatter(period_unit_block, block_max) 34 | plt.plot(period_unit_block, threshold_block_array, label = 'Threshold', color = 'red') 35 | plt.legend() 36 | plt.xlabel('Period Unit') 37 | plt.ylabel('Declustered Data') 38 | plt.title('Declustered dataset per Period Unit') 39 | plt.show() 40 | 41 | return(block_max) 42 | 43 | class TestFun(unittest.TestCase): 44 | def test_declustering(self): 45 | """ 46 | Testing if the function will return exactly the points it should 47 | """ 48 | data = [1, 1.5, 1.2, 4, 4.5, 4.2, 8, 8.5, 8.2, 12, 12.5, 12.2] 49 | #The code will cluster the data into four blocks of size 3 and take the maximum from each one. 50 | # From the data, we can say that the resulting array will be [1.5, 4.5, 8.5, 12.5] 51 | result = decluster(data, 0, 3) 52 | resultreal = np.array([1.5, 4.5, 8.5, 12.5]) 53 | for i in range(len(result)): 54 | self.assertEqual(result[i], resultreal[i]) 55 | 56 | 57 | if __name__ == '__main__': 58 | unittest.main() 59 | 60 | -------------------------------------------------------------------------------- /tests/entropy_test.py: -------------------------------------------------------------------------------- 1 | from thresholdmodeling import thresh_modeling 2 | import pandas as pd 3 | import unittest 4 | from scipy.stats import genpareto 5 | import math as mt 6 | 7 | url =
'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv' 8 | df = pd.read_csv(url, error_bad_lines=False) 9 | 10 | def entropy(sample, b, threshold, fit_method): #Get the entropy of the distribution 11 | [shape, scale, sample, sample_excess, sample_over_thresh] = thresh_modeling.gpdfit(sample, threshold, fit_method) 12 | h = mt.log(scale) + shape + 1 13 | print('The differential entropy is {} nats.'.format(h)) 14 | return (h, shape, scale) 15 | 16 | class TestFun(unittest.TestCase): 17 | def test_entropy(self): 18 | """ 19 | Testing the differential entropy computation 20 | """ 21 | data = df.values.ravel() 22 | result = entropy(data, 'e', 30, 'mle') 23 | #testing 24 | self.assertEqual(result[0], genpareto.entropy(result[1], 30, result[2])) 25 | 26 | 27 | if __name__ == '__main__': 28 | unittest.main() 29 | -------------------------------------------------------------------------------- /tests/gpdfit_test.py: -------------------------------------------------------------------------------- 1 | from thresholdmodeling import thresh_modeling 2 | import pandas as pd 3 | import unittest 4 | 5 | 6 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv' 7 | df = pd.read_csv(url, error_bad_lines=False) 8 | 9 | class TestFun(unittest.TestCase): 10 | def test_fit(self): 11 | """ 12 | Test that it can fit the array to a GPD, comparing with the values presented by Coles 13 | """ 14 | data = df.values.ravel() 15 | result = thresh_modeling.gpdfit(data, 30, 'mle') 16 | #testing shape parameter (0.1845 in Coles) 17 | self.assertEqual(round(result[0],3), 0.185) 18 | #testing scale parameter (7.4411 in Coles) 19 | self.assertEqual(round(result[1],2), 7.44) 20 | 21 | if __name__ == '__main__': 22 | unittest.main() -------------------------------------------------------------------------------- /tests/lmom_dist_test.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | 3 | def lmom_dist(shape, scale, threshold): 4 | #The package function was changed a little just to take given parameters as input instead of estimating them from a sample. 5 | #The math is exactly the same.
6 | t_1 = threshold + scale*(1+shape) 7 | t_2 = scale/((1+shape)*(2+shape)) 8 | t_3 = (1 - shape)/(3 + shape) 9 | t_4 = ((1 - shape)*(2 - shape))/((3 + shape)*(4 + shape)) 10 | return (t_1, t_2, t_3, t_4) 11 | 12 | class TestFun(unittest.TestCase): 13 | def test_lmom_dist(self): 14 | """ 15 | Testing if the function will return the right moments for the given parameters: 16 | Shape = 1 17 | Scale = 1 18 | Threshold = 1 19 | """ 20 | result = lmom_dist(1, 1, 1) 21 | #testing 22 | self.assertEqual(result[0], 3) 23 | self.assertEqual(round(result[1],4), 0.1667) 24 | self.assertEqual(result[2], 0) 25 | self.assertEqual(result[3], 0) 26 | 27 | 28 | if __name__ == '__main__': 29 | unittest.main() 30 | 31 | -------------------------------------------------------------------------------- /tests/lmom_sample_test.py: -------------------------------------------------------------------------------- 1 | from rpy2.robjects.packages import importr 2 | from rpy2.robjects.vectors import FloatVector 3 | from thresholdmodeling import thresh_modeling 4 | import pandas as pd 5 | import unittest 6 | 7 | 8 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv' 9 | df = pd.read_csv(url, error_bad_lines=False) 10 | 11 | class TestFun(unittest.TestCase): 12 | def test_lmom_sample(self): 13 | """ 14 | Testing L-moments from sample 15 | """ 16 | data = df.values.ravel() 17 | POT = importr('POT') #importing POT package 18 | POTLmonsample = POT.samlmu(FloatVector(data), 4) 19 | result = thresh_modeling.lmom_sample(data) 20 | #testing 21 | self.assertEqual(round(result[0],4), round(POTLmonsample[0],4)) 22 | self.assertEqual(round(result[1],4), round(POTLmonsample[1],4)) 23 | self.assertEqual(round(result[2],4), round(POTLmonsample[2],4)) 24 | self.assertEqual(round(result[3],4), round(POTLmonsample[3],4)) 25 | 26 | if __name__ == '__main__': 27 | unittest.main() -------------------------------------------------------------------------------- /tests/non_central_moments_test.py: -------------------------------------------------------------------------------- 1 | from thresholdmodeling import thresh_modeling 2 | import pandas as pd 3 | import unittest 4 | from scipy.stats import genpareto 5 | 6 | 7 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv' 8 | df = pd.read_csv(url, error_bad_lines=False) 9 | 10 | class TestFun(unittest.TestCase): 11 | def test_non_central_moments(self): 12 | """ 13 | Testing the non-central moments computation 14 | """ 15 | data = df.values.ravel() 16 | result = thresh_modeling.non_central_moments(data, 30, 'mle') 17 | par = thresh_modeling.gpdfit(data, 30, 'mle') 18 | #testing 19 | self.assertEqual(result[0], genpareto.stats(par[0], 30, par[1], 'mvsk')[0]) 20 | self.assertEqual(result[1], genpareto.stats(par[0], 30, par[1], 'mvsk')[1]) 21 | self.assertEqual(result[2], genpareto.stats(par[0], 30, par[1], 'mvsk')[2]) 22 | self.assertEqual(result[3], genpareto.stats(par[0], 30, par[1], 'mvsk')[3]) 23 | 24 | 25 | 26 | if __name__ == '__main__': 27 | unittest.main() -------------------------------------------------------------------------------- /tests/return_value_test.py: -------------------------------------------------------------------------------- 1 | #Getting main packages 2 | import numpy as np 3 | import matplotlib.pyplot as plt 4 | from scipy.stats import norm 5 | import seaborn as sns; sns.set(style = 'whitegrid') 6 | from scipy.stats import genpareto 7 | import pandas as pd 8 | import math as mt 9 | import
10 | 
11 | #Getting main packages from R in order to apply the maximum likelihood function
12 | from rpy2.robjects.packages import importr
13 | from rpy2.robjects.vectors import FloatVector
14 | 
15 | POT = importr('POT') #importing POT package
16 | import unittest
17 | 
18 | 
19 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv'
20 | df = pd.read_csv(url, error_bad_lines=False)
21 | data = df.values.ravel()
22 | 
23 | def return_value(sample_real, threshold, alpha, block_size, return_period, fit_method): #return level plot and return level estimate
24 |     sample = np.sort(sample_real)
25 |     sample_excess = []
26 |     sample_over_thresh = []
27 |     for data in sample:
28 |         if data > threshold+0.00001:
29 |             sample_excess.append(data - threshold)
30 |             sample_over_thresh.append(data)
31 | 
32 |     rdata = FloatVector(sample)
33 |     fit = POT.fitgpd(rdata, threshold, est = fit_method) #fit data
34 |     shape = fit[0][1]
35 |     scale = fit[0][0]
36 | 
37 |     #Computing the return level for a given return period, with the confidence interval estimated by the Delta Method
38 |     m = return_period
39 |     Eu = len(sample_over_thresh)/len(sample) #exceedance rate over the threshold
40 |     x_m = threshold + (scale/shape)*(((m*Eu)**shape) - 1)
41 | 
42 |     #Solving the Delta Method: a, b and c are the partial derivatives of x_m with respect to the exceedance rate, the scale and the shape; d and e-h are the corresponding variance and covariance terms
43 |     d = Eu*(1-Eu)/len(sample)
44 |     e = fit[3][0]
45 |     f = fit[3][1]
46 |     g = fit[3][2]
47 |     h = fit[3][3]
48 |     a = (scale*(m**shape))*(Eu**(shape-1))
49 |     b = (shape**-1)*(((m*Eu)**shape) - 1)
50 |     c = (-scale*(shape**-2))*((m*Eu)**shape - 1) + (scale*(shape**-1))*((m*Eu)**shape)*mt.log(m*Eu)
51 |     CI = (norm.ppf(1-(alpha/2))*((((a**2)*d) + (b*((c*g) + (e*b))) + (c*((b*f) + (c*h))))**0.5))
52 | 
53 |     print('The return value for the given return period is {} \u00B1 {}'.format(x_m, CI))
54 | 
55 | 
56 |     ny = block_size #defining how many observations make up a block (usually annual)
57 |     N_year = return_period/block_size #N_year represents the number of years based on the given return_period
58 | 
59 |     for i in range(0, len(sample)):
60 |         if sample[i] > threshold + 0.0001:
61 |             i_initial = i
62 |             break
63 | 
64 |     p = np.arange(i_initial,len(sample))/(len(sample)) #Getting Plotting Position points
65 |     N = 1/(ny*(1 - p)) #transforming plotting position points to years
66 | 
67 |     year_array = np.arange(min(N), N_year+0.1, 0.1) #defining a year array
68 | 
69 |     #Algorithm to compute the return level and the confidence intervals for plotting
70 |     z_N = []
71 |     CI_z_N_high_year = []
72 |     CI_z_N_low_year = []
73 |     for year in year_array:
74 |         z_N.append(threshold + (scale/shape)*(((year*ny*Eu)**shape) - 1))
75 |         a = (scale*((year*ny)**shape))*(Eu**(shape-1))
76 |         b = (shape**-1)*((((year*ny)*Eu)**shape) - 1)
77 |         c = (-scale*(shape**-2))*(((year*ny)*Eu)**shape - 1) + (scale*(shape**-1))*(((year*ny)*Eu)**shape)*mt.log((year*ny)*Eu)
78 |         CIyear = (norm.ppf(1-(alpha/2))*((((a**2)*d) + (b*((c*g) + (e*b))) + (c*((b*f) + (c*h))))**0.5))
79 |         CI_z_N_high_year.append(threshold + (scale/shape)*(((year*ny*Eu)**shape) - 1) + CIyear)
80 |         CI_z_N_low_year.append(threshold + (scale/shape)*(((year*ny*Eu)**shape) - 1) - CIyear)
81 | 
82 |     #Plotting Return Level
83 |     plt.figure(8)
84 |     plt.plot(year_array, CI_z_N_high_year, linestyle='--', color='red', alpha = 0.8, lw = 0.9, label = 'Confidence Bands')
85 |     plt.plot(year_array, CI_z_N_low_year, linestyle='--', color='red', alpha = 0.8, lw = 0.9)
86 |     plt.plot(year_array, z_N, color = 'black', label = 'Theoretical Return Level')
87 |     plt.scatter(N, sample_over_thresh, label = 'Empirical Return Level')
88 |     plt.xscale('log')
89 |     plt.xlabel('Return Period')
90 |     plt.ylabel('Return Level')
91 |     plt.title('Return Level Plot')
92 |     plt.legend()
93 |     plt.show()
94 |     return (x_m, CI)
95 | 
96 | class TestFun(unittest.TestCase):
97 |     def test_return_value(self):
98 |         """
99 |         Testing the return level and its confidence interval against the values presented by Coles
100 |         """
101 |         data = df.values.ravel()
102 |         result = return_value(data, 30, 0.05, 365, 36500, 'mle')
103 |         #testing the return level
104 |         self.assertEqual(round(result[0],1), 106.3)
105 |         #testing the confidence interval
106 |         self.assertEqual(round(result[1],1), 40.9)
107 | 
108 | if __name__ == '__main__':
109 |     unittest.main()
--------------------------------------------------------------------------------
/thresholdmodeling/__init__.py:
--------------------------------------------------------------------------------
 1 | from .thresh_modeling import MRL
 2 | from .thresh_modeling import Parameter_Stability_plot
 3 | from .thresh_modeling import gpdfit
 4 | from .thresh_modeling import gpdpdf
 5 | from .thresh_modeling import qqplot
 6 | from .thresh_modeling import ppplot
 7 | from .thresh_modeling import gpdcdf
 8 | from .thresh_modeling import return_value
 9 | from .thresh_modeling import survival_function
10 | from .thresh_modeling import non_central_moments
11 | from .thresh_modeling import lmom_dist
12 | from .thresh_modeling import lmom_sample
13 | from .thresh_modeling import lmomplot
14 | from .thresh_modeling import decluster
15 | from .thresh_modeling import entropy
--------------------------------------------------------------------------------
/thresholdmodeling/thresh_modeling.py:
--------------------------------------------------------------------------------
 1 | #########################################################################
 2 | #Copyright (c) 2019 Iago Pereira Lemos
 3 | 
 4 | #This program is free software: you can redistribute it and/or modify
 5 | #it under the terms of the GNU General Public License as published by
 6 | #the Free Software Foundation, either version 3 of the License, or
 7 | #(at your option) any later version.
 8 | 
 9 | #This program is distributed in the hope that it will be useful,
10 | #but WITHOUT ANY WARRANTY; without even the implied warranty of
11 | #MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
12 | #GNU General Public License for more details.
13 | 
14 | #You should have received a copy of the GNU General Public License
15 | #along with this program. If not, see <https://www.gnu.org/licenses/>.
16 | 
17 | ##########################################################################
18 | 
19 | #Two functions for getting the plots for defining the threshold
20 | #in order to model a Generalized Pareto Distribution.
21 | #
22 | #The MRL function plots the Mean Residual Life function.
23 | #The Parameter_Stability_plot function plots the shape and the modified scale
24 | #parameters against the threshold values, u.
25 | #
26 | #Both functions require the sample array and the significance level.
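#
#A minimal usage sketch (the dataset path and the threshold of 30 below are
#illustrative only; any 1-D sample array works):
#
#   from thresholdmodeling import thresh_modeling
#   import pandas as pd
#   data = pd.read_csv('dataset/rain.csv').values.ravel()
#   thresh_modeling.MRL(data, 0.05)          #pick a threshold from the MRL plot
#   thresh_modeling.gpdfit(data, 30, 'mle')  #then fit the GPD above it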
27 | 
28 | 
29 | #Getting main packages
30 | import numpy as np
31 | import matplotlib.pyplot as plt
32 | from scipy.stats import norm
33 | import seaborn as sns; sns.set(style = 'whitegrid')
34 | from scipy.stats import genpareto
35 | import pandas as pd
36 | import math as mt
37 | import scipy.special as sm
38 | 
39 | #Getting main packages from R in order to apply the maximum likelihood function
40 | from rpy2.robjects.packages import importr
41 | from rpy2.robjects.vectors import FloatVector
42 | 
43 | 
44 | 
45 | base = importr('base')
46 | utils = importr('utils')
47 | utils.chooseCRANmirror(ind=1)
48 | utils.install_packages('POT') #installing the R POT package on first import
49 | 
50 | POT = importr('POT') #importing POT package
51 | 
52 | def MRL(sample, alpha): #MRL function
53 | 
54 |     #Defining the threshold array and its step
55 |     step = np.quantile(sample, .995)/60
56 |     threshold = np.arange(0, max(sample), step=step)
57 |     z_inverse = norm.ppf(1-(alpha/2))
58 | 
59 |     #Initialization of arrays
60 |     mrl_array = [] #mean of excesses initialization
61 |     CImrl = [] #confidence interval for the excesses initialization
62 | 
63 |     #The first loop gets the mean residual life for each threshold value and the
64 |     #second one gets the confidence intervals for the plot
65 |     for u in threshold:
66 |         excess = [] #initialization of the excesses array for each loop
67 |         for data in sample:
68 |             if data > u:
69 |                 excess.append(data - u) #adding excesses to the excesses array
70 |         mrl_array.append(np.mean(excess)) #adding the mean of the excesses to the mean excesses array
71 |         std_loop = np.std(excess) #getting the standard deviation in the loop
72 |         CImrl.append(z_inverse*std_loop/(len(excess)**0.5)) #getting the confidence interval
73 | 
74 |     CI_Low = [] #initialization of the low confidence interval array
75 |     CI_High = [] #initialization of the high confidence interval array
76 | 
77 |     #Loop to add the confidence intervals to the plot arrays
78 |     for i in range(0, len(mrl_array)):
79 |         CI_Low.append(mrl_array[i] - CImrl[i])
80 |         CI_High.append(mrl_array[i] + CImrl[i])
81 | 
82 |     #Plot MRL
83 |     plt.figure(1)
84 |     sns.lineplot(x = threshold, y = mrl_array)
85 |     plt.fill_between(threshold, CI_Low, CI_High, alpha = 0.4)
86 |     plt.xlabel('u')
87 |     plt.ylabel('Mean Excesses')
88 |     plt.title('Mean Residual Life Plot')
89 |     plt.show()
90 | 
91 | def Parameter_Stability_plot(sample, alpha): #Parameter stability plot function
92 |     #Defining the threshold array
93 |     step = np.quantile(sample, .995)/45
94 |     threshold = np.arange(
95 |         0, np.quantile(sample, .999), step = step, dtype='float32')
96 | 
97 |     #Transforming the sample into an R array
98 |     rdata = FloatVector(sample)
99 | 
100 |     #Initialization of some main arrays
101 |     stdshape = [] #standard deviation of the shape parameter initialization
102 |     shape = [] #shape parameter initialization
103 |     scale = [] #scale parameter initialization
104 |     mod_scale = [] #modified scale parameter initialization
105 |     CI_shape = [] #confidence interval of the shape parameter
106 |     CI_mod_scale = [] #confidence interval of the modified scale
107 |     z = norm.ppf(1-(alpha/2))
108 | 
109 |     #Getting parameters and CIs for both plots
110 |     for u in threshold:
111 |         fit = POT.fitgpd(rdata, u.item(), est = 'mle') #fitting the distribution with the POT package using the MLE method
112 |         shape.append(fit[0][1]) #adding the shape parameter to the respective array
113 |         scale.append(fit[0][0]) #adding the scale parameter to the respective array
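        #The modified scale parameter reparameterizes the threshold-dependent GPD scale
        #as sigma* = sigma_u - shape*u; if the GPD holds above some threshold u0, sigma*
        #and the shape should be roughly constant for all u > u0 (Coles, 2001), which is
        #what the two stability plots below check.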
114 |         stdshape.append(fit[1][1]) #adding the shape standard deviation to the respective array
115 |         CI_shape.append(fit[1][1]*z) #getting the values of the confidence interval for plotting
116 |         mod_scale.append(fit[0][0] - (fit[0][1]*u)) #getting the modified scale parameter
117 |         Var_mod_scale = (fit[3][0] - (u*fit[3][2]) - u*(fit[3][1] - (fit[3][3]*u))) #solving the Delta Method
118 |         #in order to get the variance of the modified scale parameter
119 |         CI_mod_scale.append((Var_mod_scale**0.5)*z) #getting the confidence interval for the
120 |         #modified scale parameter
121 | 
122 |     #Plotting the shape parameter against the u values
123 |     plt.figure(2)
124 |     plt.errorbar(threshold, shape, yerr = CI_shape, fmt = 'o' )
125 |     plt.xlabel('u')
126 |     plt.ylabel('Shape Parameter')
127 |     plt.title('Shape Parameter Stability Plot')
128 | 
129 |     #Plotting the modified scale parameter against the u values
130 |     plt.figure(3)
131 |     plt.errorbar(threshold, mod_scale, yerr = CI_mod_scale, fmt = 'o')
132 |     plt.xlabel('u')
133 |     plt.ylabel('Modified Scale Parameter')
134 |     plt.title('Modified Scale Parameter Stability Plot')
135 | 
136 |     plt.show()
137 | 
138 | def gpdfit(sample, threshold, fit_method):
139 |     sample = np.sort(sample)
140 |     sample_excess = []
141 |     sample_over_thresh = []
142 |     for data in sample:
143 |         if data > threshold+0.00001:
144 |             sample_excess.append(data - threshold) #getting an array of excesses
145 |             sample_over_thresh.append(data) #getting an array of values over the threshold
146 |     rdata = FloatVector(sample)
147 |     fit = POT.fitgpd(rdata, threshold, est = fit_method) #fit the data to the distribution
148 |     shape = fit[0][1]
149 |     scale = fit[0][0]
150 |     print(fit) #show the GPD fit estimates
151 | 
152 |     return(shape, scale, sample, sample_excess, sample_over_thresh)
153 | 
154 | def gpdpdf(sample, threshold, fit_method, bin_method, alpha): #get the PDF plot with a histogram to diagnose the model
155 |     [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method) #Fit the data
156 |     x_points = np.arange(0, max(sample), 0.001) #define a range of points for drawing the pdf
157 |     pdf = genpareto.pdf(x_points, shape, loc=0, scale=scale) #get the pdf values
158 | 
159 |     #Plotting PDF
160 |     plt.figure(4)
161 |     plt.xlabel('Data')
162 |     plt.ylabel('PDF')
163 |     plt.title('Data Probability Density Function')
164 |     plt.plot(x_points, pdf, color = 'black', label = 'Theoretical PDF')
165 |     plt.hist(sample_excess, bins = bin_method, density = True) #draw histograms
166 |     plt.legend()
167 |     plt.show()
168 | 
169 | def qqplot(sample, threshold, fit_method, alpha): #get the Quantile-Quantile plot to diagnose the model
170 |     [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method) #fit data
171 |     i_initial = 0
172 |     p = []
173 |     n = len(sample)
174 |     sample = np.sort(sample)
175 |     for i in range(0, n):
176 |         if sample[i] > threshold + 0.0001:
177 |             i_initial = i #get the index of the first observation over the threshold
178 |             k = i - 1
179 |             break
180 | 
181 |     for i in range(i_initial, n):
182 |         p.append((i - 0.35)/(n)) #using the index, compute the empirical probabilities by the Hosking plotting position estimator
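    #p holds plotting positions for the full ordered sample; p0, just below, is the
    #plotting position of the last observation at or below the threshold, so that
    #(pth - p0)/(1 - p0) rescales each point to a probability conditional on exceeding
    #the threshold before the GPD quantile function is applied.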
183 | 
184 |     p0 = (k - 0.35)/(n)
185 | 
186 |     quantiles = []
187 |     for pth in p:
188 |         quantiles.append(threshold + ((scale/shape)*(((1-((pth-p0)/(1-p0)))**-shape) - 1))) #getting the theoretical quantiles array
189 | 
190 |     n = len(sample_over_thresh)
191 |     y = np.arange(1,n+1)/n #getting empirical quantiles
192 | 
193 |     #Kolmogorov-Smirnov Test for getting the confidence interval
194 |     K = (-0.5*mt.log(alpha/2))**0.5
195 |     M = (len(p)**2/(2*len(p)))**0.5
196 |     CI_qq_high = []
197 |     CI_qq_low = []
198 |     for prob in y:
199 |         F1 = prob - K/M
200 |         F2 = prob + K/M
201 |         CI_qq_low.append(threshold + ((scale/shape)*(((1-((F1)/(1)))**-shape) - 1)))
202 |         CI_qq_high.append(threshold + ((scale/shape)*(((1-((F2)/(1)))**-shape) - 1)))
203 | 
204 |     #Plotting QQ
205 |     plt.figure(5)
206 |     sns.regplot(quantiles, sample_over_thresh, ci = None, line_kws={'color':'black','label':'Regression Line'})
207 |     plt.axis('square')
208 |     plt.plot(sample_over_thresh, CI_qq_low, linestyle='--', color='red', alpha = 0.5, lw = 0.8, label = 'Kolmogorov-Smirnov Confidence Bands')
209 |     plt.legend()
210 |     plt.plot(sample_over_thresh, CI_qq_high, linestyle='--', color='red', alpha = 0.5, lw = 0.8)
211 |     plt.xlabel('Theoretical GPD Quantiles')
212 |     plt.ylabel('Sample Quantiles')
213 |     plt.title('Q-Q Plot')
214 |     plt.show()
215 | 
216 | def ppplot(sample, threshold, fit_method, alpha): #probability-probability plot to diagnose the model
217 |     [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method) #fit the data
218 |     n = len(sample_over_thresh)
219 |     #Getting empirical probabilities
220 |     y = np.arange(1,n+1)/n
221 |     #Getting theoretical probabilities
222 |     cdf_pp = genpareto.cdf(sample_over_thresh, shape, loc=threshold, scale=scale)
223 | 
224 |     #Getting Confidence Intervals using the Dvoretzky–Kiefer–Wolfowitz method
225 |     i_initial = 0
226 |     n = len(sample)
227 |     for i in range(0, n):
228 |         if sample[i] > threshold + 0.0001:
229 |             i_initial = i
230 |             break
231 |     F1 = []
232 |     F2 = []
233 |     for i in range(i_initial,len(sample)):
234 |         e = (((mt.log(2/alpha))/(2*len(sample_over_thresh)))**0.5)
235 |         F1.append(y[i-i_initial] - e)
236 |         F2.append(y[i-i_initial] + e)
237 | 
238 |     #Plotting PP
239 |     plt.figure(6)
240 |     sns.regplot(y, cdf_pp, ci = None, line_kws={'color':'black', 'label':'Regression Line'})
241 |     plt.plot(y, F1, linestyle='--', color='red', alpha = 0.5, lw = 0.8, label = 'Dvoretzky–Kiefer–Wolfowitz Confidence Bands')
242 |     plt.plot(y, F2, linestyle='--', color='red', alpha = 0.5, lw = 0.8)
243 |     plt.legend()
244 |     plt.title('P-P Plot')
245 |     plt.xlabel('Empirical Probability')
246 |     plt.ylabel('Theoretical Probability')
247 |     plt.show()
248 | 
249 | def gpdcdf(sample, threshold, fit_method, alpha): #plot the GPD cdf with empirical points
250 |     [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method) #fit the data
251 | 
252 |     n = len(sample_over_thresh)
253 |     y = np.arange(1,n+1)/n #empirical probabilities
254 | 
255 |     i_initial = 0
256 |     n = len(sample)
257 |     for i in range(0, n):
258 |         if sample[i] > threshold + 0.0001:
259 |             i_initial = i
260 |             break
261 | 
262 |     #Computing the confidence interval with the Dvoretzky–Kiefer–Wolfowitz method based on the empirical points
263 |     F1 = []
264 |     F2 = []
265 |     for i in range(i_initial,len(sample)):
266 |         e = (((mt.log(2/alpha))/(2*len(sample_over_thresh)))**0.5)
267 |         F1.append(y[i-i_initial] - e)
268 |         F2.append(y[i-i_initial] + e)
269 | 
270 |     x_points = np.arange(0, max(sample), 0.001) #generating points at which to evaluate the cdf
271 |     cdf = genpareto.cdf(x_points, shape, loc=threshold, scale=scale) #getting the theoretical cdf
272 | 
273 |     #Plotting cdf
274 |     plt.figure(7)
275 |     plt.plot(x_points, cdf, color = 'black', label='Theoretical CDF')
276 |     plt.xlabel('Data')
277 |     plt.ylabel('CDF')
278 |     plt.title('Data Cumulative Distribution Function')
279 |     plt.scatter(sorted(sample_over_thresh), y, label='Empirical CDF')
280 |     plt.plot(sorted(sample_over_thresh), F1, linestyle='--', color='red', alpha = 0.8, lw = 0.9, label = 'Dvoretzky–Kiefer–Wolfowitz Confidence Bands')
281 |     plt.plot(sorted(sample_over_thresh), F2, linestyle='--', color='red', alpha = 0.8, lw = 0.9)
282 |     plt.legend()
283 |     plt.show()
284 | 
285 | def return_value(sample_real, threshold, alpha, block_size, return_period, fit_method): #return level plot and return level estimate
286 |     sample = np.sort(sample_real)
287 |     sample_excess = []
288 |     sample_over_thresh = []
289 |     for data in sample:
290 |         if data > threshold+0.00001:
291 |             sample_excess.append(data - threshold)
292 |             sample_over_thresh.append(data)
293 | 
294 |     rdata = FloatVector(sample)
295 |     fit = POT.fitgpd(rdata, threshold, est = fit_method) #fit data
296 |     shape = fit[0][1]
297 |     scale = fit[0][0]
298 | 
299 |     #Computing the return level for a given return period, with the confidence interval estimated by the Delta Method
300 |     m = return_period
301 |     Eu = len(sample_over_thresh)/len(sample) #exceedance rate over the threshold
302 |     x_m = threshold + (scale/shape)*(((m*Eu)**shape) - 1)
303 | 
304 |     #Solving the Delta Method: a, b and c are the partial derivatives of x_m with respect to the exceedance rate, the scale and the shape; d and e-h are the corresponding variance and covariance terms
305 |     d = Eu*(1-Eu)/len(sample)
306 |     e = fit[3][0]
307 |     f = fit[3][1]
308 |     g = fit[3][2]
309 |     h = fit[3][3]
310 |     a = (scale*(m**shape))*(Eu**(shape-1))
311 |     b = (shape**-1)*(((m*Eu)**shape) - 1)
312 |     c = (-scale*(shape**-2))*((m*Eu)**shape - 1) + (scale*(shape**-1))*((m*Eu)**shape)*mt.log(m*Eu)
313 |     CI = (norm.ppf(1-(alpha/2))*((((a**2)*d) + (b*((c*g) + (e*b))) + (c*((b*f) + (c*h))))**0.5))
314 | 
315 |     print('The return value for the given return period is {} \u00B1 {}'.format(x_m, CI))
316 | 
317 | 
318 |     ny = block_size #defining how many observations make up a block (usually annual)
319 |     N_year = return_period/block_size #N_year represents the number of years based on the given return_period
320 | 
321 |     for i in range(0, len(sample)):
322 |         if sample[i] > threshold + 0.0001:
323 |             i_initial = i
324 |             break
325 | 
326 |     p = np.arange(i_initial,len(sample))/(len(sample)) #Getting Plotting Position points
327 |     N = 1/(ny*(1 - p)) #transforming plotting position points to years
328 | 
329 |     year_array = np.arange(min(N), N_year+0.1, 0.1) #defining a year array
330 | 
331 |     #Algorithm to compute the return level and the confidence intervals for plotting
332 |     z_N = []
333 |     CI_z_N_high_year = []
334 |     CI_z_N_low_year = []
335 |     for year in year_array:
336 |         z_N.append(threshold + (scale/shape)*(((year*ny*Eu)**shape) - 1))
337 |         a = (scale*((year*ny)**shape))*(Eu**(shape-1))
338 |         b = (shape**-1)*((((year*ny)*Eu)**shape) - 1)
339 |         c = (-scale*(shape**-2))*(((year*ny)*Eu)**shape - 1) + (scale*(shape**-1))*(((year*ny)*Eu)**shape)*mt.log((year*ny)*Eu)
340 |         CIyear = (norm.ppf(1-(alpha/2))*((((a**2)*d) + (b*((c*g) + (e*b))) + (c*((b*f) + (c*h))))**0.5))
341 |         CI_z_N_high_year.append(threshold + (scale/shape)*(((year*ny*Eu)**shape) - 1) + CIyear)
342 |         CI_z_N_low_year.append(threshold + (scale/shape)*(((year*ny*Eu)**shape) - 1) - CIyear)
343 | 
344 |     #Plotting Return Level
345 |     plt.figure(8)
346 |     plt.plot(year_array, CI_z_N_high_year, linestyle='--', color='red', alpha = 0.8, lw = 0.9, label = 'Confidence Bands')
347 |     plt.plot(year_array, CI_z_N_low_year, linestyle='--', color='red', alpha = 0.8, lw = 0.9)
348 |     plt.plot(year_array, z_N, color = 'black', label = 'Theoretical Return Level')
349 |     plt.scatter(N, sample_over_thresh, label = 'Empirical Return Level')
350 |     plt.xscale('log')
351 |     plt.xlabel('Return Period')
352 |     plt.ylabel('Return Level')
353 |     plt.title('Return Level Plot')
354 |     plt.legend()
355 | 
356 |     plt.show()
357 | 
358 | def survival_function(sample, threshold, fit_method, alpha): #Plot the survival function, (1 - cdf)
359 |     [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method)
360 | 
361 |     n = len(sample_over_thresh)
362 |     y_surv = 1 - np.arange(1,n+1)/n
363 | 
364 |     i_initial = 0
365 | 
366 |     n = len(sample)
367 |     for i in range(0, n):
368 |         if sample[i] > threshold + 0.0001:
369 |             i_initial = i
370 |             break
371 |     #Computing the confidence interval with the Dvoretzky–Kiefer–Wolfowitz method
372 |     F1 = []
373 |     F2 = []
374 |     for i in range(i_initial,len(sample)):
375 |         e = (((mt.log(2/alpha))/(2*len(sample_over_thresh)))**0.5)
376 |         F1.append(y_surv[i-i_initial] - e)
377 |         F2.append(y_surv[i-i_initial] + e)
378 | 
379 |     x_points = np.arange(0, max(sample), 0.001)
380 |     surv_func = 1 - genpareto.cdf(x_points, shape, loc=threshold, scale=scale)
381 | 
382 |     #Plotting survival function
383 |     plt.figure(9)
384 |     plt.plot(x_points, surv_func, color = 'black', label='Theoretical Survival Function')
385 |     plt.xlabel('Data')
386 |     plt.ylabel('Survival Function')
387 |     plt.title('Data Survival Function Plot')
388 |     plt.scatter(sorted(sample_over_thresh), y_surv, label='Empirical Survival Function')
389 |     plt.plot(sorted(sample_over_thresh), F1, linestyle='--', color='red', alpha = 0.8, lw = 0.9, label = 'Dvoretzky–Kiefer–Wolfowitz Confidence Bands')
390 |     plt.plot(sorted(sample_over_thresh), F2, linestyle='--', color='red', alpha = 0.8, lw = 0.9)
391 |     plt.legend()
392 |     plt.show()
393 | 
394 | def non_central_moments(sample, threshold, fit_method): #Getting the non-central moments using scipy's genpareto distribution
395 |     [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method)
396 |     [Mean, Variance, Skewness, Kurtosis] = genpareto.stats(shape, threshold, scale, moments = 'mvsk')
397 |     print('Non-Central Moments estimated from the distribution:\nMean: {} \nVariance: {} \nSkewness: {} \nKurtosis: {} \n'.format(Mean, Variance, Skewness, Kurtosis))
398 |     return (Mean, Variance, Skewness, Kurtosis)
399 | 
400 | def lmom_dist(sample, threshold, fit_method): #Getting the L-moments from the distribution
401 |     [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method)
402 |     t_1 = threshold + scale/(1+shape) #the L-mean of the GPD is u + scale/(1+shape)
403 |     t_2 = scale/((1+shape)*(2+shape))
404 |     t_3 = (1 - shape)/(3 + shape)
405 |     t_4 = ((1 - shape)*(2 - shape))/((3 + shape)*(4 + shape))
406 |     print('L-Moments estimated from the distribution:\nL-Mean: {} \nL-Scale: {} \nL-Skewness: {} \nL-Kurtosis: {} \n'.format(t_1, t_2, t_3, t_4))
407 |     return (t_1, t_2, t_3, t_4)
408 | 
409 | def lmom_sample(sample): #Algorithm to compute the first four L-moments from the sample
410 |     sample = np.sort(sample)
411 |     n = len(sample)
412 | 
413 |     #first moment
414 |     l1 = np.sum(sample) / sm.comb(n, 1, exact=True)
415 | 
416 |     #second moment
417 |     comb1 = range(n)
418 |     coefl2 = 0.5 / sm.comb(n, 2, exact=True)
419 |     sum_xtrans = sum([(comb1[i] - comb1[n - i - 1]) * sample[i] for i in range(n)])
420 |     l2 = coefl2 * sum_xtrans
421 | 
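    #The weighted sums above and below implement Hosking's direct sample L-moment
    #estimators, which weight the ordered observations with binomial coefficients;
    #since l3 and l4 are divided by l2, they are returned as the ratios
    #tau_3 = lambda_3/lambda_2 (L-skewness) and tau_4 = lambda_4/lambda_2 (L-kurtosis).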
422 |     #third moment
423 |     comb3 = [sm.comb(i, 2, exact=True) for i in range(n)]
424 |     coefl3 = 1.0 / 3.0 / sm.comb(n, 3, exact=True)
425 |     sum_xtrans = sum([(comb3[i] - 2 * comb1[i] * comb1[n - i - 1] + comb3[n - i - 1]) * sample[i] for i in range(n)])
426 |     l3 = coefl3 * sum_xtrans / l2
427 | 
428 |     #fourth moment
429 |     comb5 = [sm.comb(i, 3, exact=True) for i in range(n)]
430 |     coefl4 = 0.25 / sm.comb(n, 4, exact=True)
431 |     sum_xtrans = sum(
432 |         [(comb5[i] - 3 * comb3[i] * comb1[n - i - 1] + 3 * comb1[i] * comb3[n - i - 1] - comb5[n - i - 1]) * sample[i]
433 |          for i in range(n)])
434 |     l4 = coefl4 * sum_xtrans / l2
435 | 
436 |     print('L-Moments estimated from the sample:\nL-Mean: {} \nL-Scale: {} \nL-Skewness: {} \nL-Kurtosis: {} \n'.format(l1, l2, l3, l4))
437 | 
438 |     return(l1, l2, l3, l4)
439 | 
440 | def lmomplot(sample, threshold): #Plotting the empirical L-skewness and L-kurtosis against the theoretical GPD curve
441 |     #to diagnose the choice of u.
442 |     def lmom_sample2(sample): #local copy of lmom_sample, without the printing
443 |         sample = np.sort(sample)
444 |         n = len(sample)
445 | 
446 |         #first moment
447 |         l1 = np.sum(sample) / sm.comb(n, 1, exact=True)
448 | 
449 |         #second moment
450 |         comb1 = range(n)
451 |         coefl2 = 0.5 / sm.comb(n, 2, exact=True)
452 |         sum_xtrans = sum([(comb1[i] - comb1[n - i - 1]) * sample[i] for i in range(n)])
453 |         l2 = coefl2 * sum_xtrans
454 | 
455 |         #third moment
456 |         comb3 = [sm.comb(i, 2, exact=True) for i in range(n)]
457 |         coefl3 = 1.0 / 3.0 / sm.comb(n, 3, exact=True)
458 |         sum_xtrans = sum([(comb3[i] - 2 * comb1[i] * comb1[n - i - 1] + comb3[n - i - 1]) * sample[i] for i in range(n)])
459 |         l3 = coefl3 * sum_xtrans / l2
460 | 
461 |         #fourth moment
462 |         comb5 = [sm.comb(i, 3, exact=True) for i in range(n)]
463 |         coefl4 = 0.25 / sm.comb(n, 4, exact=True)
464 |         sum_xtrans = sum(
465 |             [(comb5[i] - 3 * comb3[i] * comb1[n - i - 1] + 3 * comb1[i] * comb3[n - i - 1] - comb5[n - i - 1]) * sample[i]
466 |              for i in range(n)])
467 |         l4 = coefl4 * sum_xtrans / l2
468 |         return(l1, l2, l3, l4)
469 | 
470 |     threshold_array = np.arange(0, threshold + (threshold/3), 0.5) #defining a threshold array to compute the
471 |     #different L-moments from the sample
472 |     sample = np.sort(sample)
473 |     skewness_sample = []
474 |     kurtosis_sample = []
475 |     #Algorithm to compute the L-moments for each threshold
476 |     for u in threshold_array:
477 |         sample_over_thresh = []
478 |         for data in sample:
479 |             if data > u+0.00001:
480 |                 sample_over_thresh.append(data)
481 |         [l1, l2, l3, l4] = lmom_sample2(sample_over_thresh)
482 |         skewness_sample.append(l3)
483 |         kurtosis_sample.append(l4)
484 | 
485 |     skewness_theo = np.arange(0,1+0.1,0.1) #defining the theoretical L-skewness
486 |     kurtosis_theo = (skewness_theo*(1 + 5*skewness_theo))/(5 + skewness_theo) #theoretical L-kurtosis of the GPD, tau_4 = tau_3*(1+5*tau_3)/(5+tau_3)
487 | 
488 |     #Plotting L-moments
489 |     plt.figure(10)
490 |     plt.scatter(skewness_sample, kurtosis_sample, label = 'Empirical')
491 |     plt.plot(skewness_theo, kurtosis_theo, color = 'black', label = 'Theoretical')
492 |     plt.legend()
493 |     plt.xlabel('L-Skewness')
494 |     plt.ylabel('L-Kurtosis')
495 |     plt.title('L-Moments Plot')
496 |     plt.show()
497 | 
498 | def decluster(sample, threshold, block_size): #function to decluster the dataset into period blocks
499 |     period_unit = np.arange(1, len(sample)+1, 1) #period array
500 |     threshold_array = np.ones(len(sample))*threshold
501 |     nob = int(len(sample)/block_size) #number of blocks
502 |     clust = np.zeros((nob, block_size)) #initialization of the cluster matrix (rows: clusters; columns: observations)
503 |     #Algorithm to cluster the observations into blocks
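    #Each row of clust receives block_size consecutive observations, so the sample is
    #split into nob non-overlapping blocks in time order; taking the row maxima below
    #then reduces every block to its single largest value (block-maxima declustering).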
504 |     k = 0
505 |     for i in range(0, nob):
506 |         for j in range(0, block_size):
507 |             clust[i][j] = sample[j+k]
508 |         k = j + k + 1
509 | 
510 |     block_max = np.amax(clust, 1) #getting the max of each block and declustering
511 | 
512 |     period_unit_block = np.arange(0, len(block_max), 1) #array of periods for each block
513 |     threshold_block_array = np.ones(len(block_max))*threshold
514 | 
515 |     #Plot the real dataset
516 |     plt.figure(11)
517 |     plt.scatter(period_unit, sample)
518 |     plt.plot(period_unit, threshold_array, label = 'Threshold', color = 'red')
519 |     plt.legend()
520 |     plt.xlabel('Period Unit')
521 |     plt.ylabel('Data')
522 |     plt.title('Sample dataset per Period Unit')
523 | 
524 |     #Plot the declustered data
525 |     plt.figure(12)
526 |     plt.scatter(period_unit_block, block_max)
527 |     plt.plot(period_unit_block, threshold_block_array, label = 'Threshold', color = 'red')
528 |     plt.legend()
529 |     plt.xlabel('Period Unit')
530 |     plt.ylabel('Declustered Data')
531 |     plt.title('Declustered dataset per Period Unit')
532 |     plt.show()
533 | 
534 | def entropy(sample, b, threshold, fit_method): #Get the entropy of the fitted distribution (b is reserved for a log base and is currently unused; the result is in nats)
535 |     [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method)
536 |     h = mt.log(scale) + shape + 1 #closed-form differential entropy of the GPD
537 |     print('The differential entropy is {} nats.'.format(h))
538 |     return h
539 | 
--------------------------------------------------------------------------------