├── .travis.yml
├── Functions Documentation.md
├── Functions Examples
│   └── examples.py
├── LICENSE
├── README.md
├── dataset
│   └── rain.csv
├── declustered.png
├── install_pot.py
├── nocluster.png
├── paper.bib
├── paper.md
├── requirements.txt
├── result_CDF.png
├── result_MODSCALE.png
├── result_MRL.png
├── result_SHAPE.png
├── result_pdf.png
├── result_pp.png
├── result_qq.png
├── result_retlvl.png
├── setup.py
├── tests
│   ├── __init__.py
│   ├── declustering_test.py
│   ├── entropy_test.py
│   ├── gpdfit_test.py
│   ├── lmom_dist_test.py
│   ├── lmom_sample_test.py
│   ├── non_central_moments_test.py
│   └── return_value_test.py
└── thresholdmodeling
    ├── __init__.py
    └── thresh_modeling.py
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 | python:
3 | # We don't actually use the Travis Python, but this keeps it organized.
4 | - "3.7"
5 | install:
6 | - sudo apt-get update
7 | # We do this conditionally because it saves us some downloading if the
8 | # version is the same.
9 | - if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
10 | wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh -O miniconda.sh;
11 | else
12 | wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
13 | fi
14 | - bash miniconda.sh -b -p $HOME/miniconda
15 | - source "$HOME/miniconda/etc/profile.d/conda.sh"
16 | - hash -r
17 | - conda config --set always_yes yes --set changeps1 no
18 | - conda update -q conda
19 | # Useful for debugging any issues with conda
20 | - conda info -a
21 |
 22 |   # Create the test environment and install the dependencies
23 | - conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION
24 | - conda activate test-environment
25 | - conda install r
26 | - conda install -c r rpy2=2.9.4
27 | - python setup.py install
28 |
29 | script:
30 | - python setup.py test
31 |
32 |
--------------------------------------------------------------------------------
/Functions Documentation.md:
--------------------------------------------------------------------------------
1 | # Functions Documentation
2 |
3 | This file documents the functions provided by the ``thresholdmodeling`` package.
4 |
5 | ## Threshold Selection
6 | * **``MRL(sample, alpha)``** : Plots the Mean Residual Life function. ``sample`` is a 1-D array of the observations and ``alpha`` is a float setting the significance level (e.g. 0.05 for 95% confidence bands).
7 | * **``Parameter_Stability_plot(sample, alpha)``** : Plots the two parameter stability plots, for the shape and the modified scale parameters. ``sample`` is a 1-D array of the observations and ``alpha`` is the significance level. A usage sketch is given below.
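A minimal threshold-selection sketch, mirroring the calls in [examples.py](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Examples/examples.py); the ``data`` array loaded here is reused by the sketches in the sections below:

```python
from thresholdmodeling import thresh_modeling
import pandas as pd

# Load the bundled daily-rainfall dataset as a 1-D array
url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv'
data = pd.read_csv(url).values.ravel()

thresh_modeling.MRL(data, 0.05)                       # Mean Residual Life plot
thresh_modeling.Parameter_Stability_plot(data, 0.05)  # shape and modified scale stability plots
```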
8 |
9 | ## Model Fit
10 | * **``gpdfit(sample, threshold, fit_method)``** : Fits the given data to a GPD model and shows the GPD estimates in the terminal. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold and ``fit_method`` is one of the following fit methods (string format): 'mle', 'mple', 'moments', 'pwmu', 'pwmb', 'mdpd', 'med', 'pickands', 'lme' and 'mgf' for the maximum likelihood, maximum penalized likelihood, moments, unbiased probability weighted moments, biased probability weighted moments, minimum density power divergence, medians, Pickands', likelihood moment and maximum goodness-of-fit estimators, respectively. See the sketch below.
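For example, fitting the rainfall data at threshold 30 with maximum likelihood, as in examples.py (``data`` as loaded above):

```python
thresh_modeling.gpdfit(data, 30, 'mle')  # prints the GPD estimates, standard errors and fit diagnostics
```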
11 |
12 | ## Model Checking
13 | * **``gpdpdf(sample, threshold, fit_method, bin_method, alpha)``** : Returns the GPD probability density function plot with the normalized empirical histogram. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold, ``fit_method`` is one of the fit methods listed in **Model Fit**, ``bin_method`` is one of the following methods to compute the number of histogram bins: 'sturges', 'doane', 'scott', 'fd' (Freedman-Diaconis estimator), 'stone', 'rice' and 'sqrt', and ``alpha`` is the significance level.
14 |
15 | * **``gpdcdf(sample, threshold, fit_method, alpha)``** : Returns the GPD cumulative distribution function plot with the empirical points and confidence bands based on the Dvoretzky–Kiefer–Wolfowitz method. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold, ``fit_method`` is one of the fit methods listed in **Model Fit**, and ``alpha`` is the significance level.
16 |
17 | * **``qqplot(sample, threshold, fit_method, alpha)``** : Returns the quantile-quantile plot with confidence bands based on the Kolmogorov-Smirnov two-sample test. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold, ``fit_method`` is one of the fit methods listed in **Model Fit**, and ``alpha`` is the significance level.
18 |
19 | * **``ppplot(sample, threshold, fit_method, alpha)``** : Returns the probability-probability plot with confidence bands based on the Dvoretzky–Kiefer–Wolfowitz method. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold, ``fit_method`` is one of the fit methods listed in **Model Fit**, and ``alpha`` is the significance level.
20 |
21 | * **``survival_function(sample, threshold, fit_method, alpha)``** : Returns the survival function plot (1 - CDF) with empirical points and confidence bands based on the Dvoretzky–Kiefer–Wolfowitz method. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold, ``fit_method`` is one of the fit methods listed in **Model Fit**, and ``alpha`` is the significance level.
22 |
23 | * **``lmomplot(sample, threshold)``** : Returns the L-Skewness against L-Kurtosis plot using the Generalized Pareto normalization. ``sample`` is a 1-D array of the observations and ``threshold`` is the chosen threshold. **Warning**: this plot is very difficult to interpret. A combined sketch of the model-checking calls is given below.
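The model-checking calls from examples.py, at threshold 30 with an MLE fit and ``alpha = 0.05`` (``data`` as loaded above):

```python
thresh_modeling.gpdpdf(data, 30, 'mle', 'sturges', 0.05)  # density vs. normalized histogram
thresh_modeling.gpdcdf(data, 30, 'mle', 0.05)             # CDF with DKW confidence bands
thresh_modeling.qqplot(data, 30, 'mle', 0.05)             # quantile-quantile plot
thresh_modeling.ppplot(data, 30, 'mle', 0.05)             # probability-probability plot
thresh_modeling.survival_function(data, 30, 'mle', 0.05)  # survival function (1 - CDF)
thresh_modeling.lmomplot(data, 30)                        # L-Skewness vs. L-Kurtosis
```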
24 |
25 | ## Model Diagnostics and Return Level Analysis
26 | * **``return_value(sample, threshold, alpha, block_size, return_period, fit_method)``** : Returns the return level for the given ``return_period`` with a confidence interval based on the Delta Method, and draws the return level plot based on the block size (usually annual) with Delta Method confidence bands and empirical points. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold, ``alpha`` is the significance level, and ``block_size`` is the number of observations in a block; for example, for an annual analysis ``block_size`` should represent a year, so for daily data it should be 365. ``return_period`` is the return period for which the return level is computed and ``fit_method`` is one of the fit methods listed in **Model Fit**. See the sketch below.
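For daily data, a 100-year return level is obtained with ``block_size = 365`` and ``return_period = 36500`` (days), as in examples.py:

```python
thresh_modeling.return_value(data, 30, 0.05, 365, 36500, 'mle')
```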
27 |
28 | ## Declustering and Data Visualization
29 |
30 | * **``decluster(sample, threshold, block_size)``** : Returns two plots: the data against the unit of the return period (days, for example), and the declustered data based on the block size and the maximum of each block. ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold and ``block_size`` is the number of observations in a cluster; for example, if the dataset is daily and the idea is to cluster by month, ``block_size`` should be 30. See the sketch below.
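A monthly declustering of the daily rainfall data, as in examples.py:

```python
thresh_modeling.decluster(data, 30, 30)  # 30-day blocks, declustered by block maxima
```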
31 |
32 | ## Further Functions for Additional Analysis
33 |
34 | * **``non_central_moments(sample, threshold, fit_method)``** : Returns the non-central moments estimated from the model.
35 | ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold and ``fit_method`` is one of the fit methods listed in **Model Fit**.
36 |
37 | * **``lmom_dist(sample, threshold, fit_method)``** : Returns the L-moments estimated from the model.
38 | ``sample`` is a 1-D array of the observations, ``threshold`` is the chosen threshold and ``fit_method`` is one of the fit methods listed in **Model Fit**.
39 |
40 | * **``lmom_sample(sample)``** : Returns the L-moments estimated from the sample. ``sample`` is a 1-D array of the observations.
41 |
42 | * **``entropy(sample, b, threshold, fit_method)``** : Returns the differential entropy of the model in nats. ``sample`` is a 1-D array of the observations, ``b`` must be 'e' (changing it makes no difference to the result; it is just there to indicate Euler's number as the logarithm base), ``threshold`` is the chosen threshold and ``fit_method`` is one of the fit methods listed in **Model Fit**. A combined sketch of these calls is given below.
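The corresponding calls from examples.py (``data`` as loaded above):

```python
thresh_modeling.non_central_moments(data, 30, 'mle')  # non-central moments of the fitted model
thresh_modeling.lmom_dist(data, 30, 'mle')            # L-moments of the fitted model
thresh_modeling.lmom_sample(data)                     # L-moments of the sample
thresh_modeling.entropy(data, 'e', 30, 'mle')         # differential entropy in nats
```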
43 |
44 |
--------------------------------------------------------------------------------
/Functions Examples/examples.py:
--------------------------------------------------------------------------------
1 | from thresholdmodeling import thresh_modeling
2 | import pandas as pd
3 |
4 |
5 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv'
6 | df = pd.read_csv(url, error_bad_lines=False) # error_bad_lines was deprecated in pandas 1.3; use on_bad_lines='skip' on newer versions
7 | data = df.values.ravel()
8 |
9 |
10 | thresh_modeling.MRL(data, 0.05)
11 | thresh_modeling.Parameter_Stability_plot(data, 0.05)
12 | thresh_modeling.gpdfit(data, 30, 'mle')
13 | thresh_modeling.gpdpdf(data, 30, 'mle', 'sturges', 0.05)
14 | thresh_modeling.qqplot(data,30, 'mle', 0.05)
15 | thresh_modeling.ppplot(data, 30, 'mle', 0.05)
16 | thresh_modeling.gpdcdf(data, 30, 'mle', 0.05)
17 | thresh_modeling.return_value(data, 30, 0.05, 365, 36500, 'mle')
18 | thresh_modeling.survival_function(data, 30, 'mle', 0.05)
19 | thresh_modeling.non_central_moments(data, 30, 'mle')
20 | thresh_modeling.lmom_dist(data, 30, 'mle')
21 | thresh_modeling.lmom_sample(data)
22 | thresh_modeling.lmomplot(data, 30)
23 | thresh_modeling.decluster(data, 30, 30)
24 | thresh_modeling.entropy(data, 'e', 30, 'mle')
25 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | GNU LESSER GENERAL PUBLIC LICENSE
2 | Version 3, 29 June 2007
3 |
4 | Copyright (C) 2007 Free Software Foundation, Inc.
5 | Everyone is permitted to copy and distribute verbatim copies
6 | of this license document, but changing it is not allowed.
7 |
8 |
9 | This version of the GNU Lesser General Public License incorporates
10 | the terms and conditions of version 3 of the GNU General Public
11 | License, supplemented by the additional permissions listed below.
12 |
13 | 0. Additional Definitions.
14 |
15 | As used herein, "this License" refers to version 3 of the GNU Lesser
16 | General Public License, and the "GNU GPL" refers to version 3 of the GNU
17 | General Public License.
18 |
19 | "The Library" refers to a covered work governed by this License,
20 | other than an Application or a Combined Work as defined below.
21 |
22 | An "Application" is any work that makes use of an interface provided
23 | by the Library, but which is not otherwise based on the Library.
24 | Defining a subclass of a class defined by the Library is deemed a mode
25 | of using an interface provided by the Library.
26 |
27 | A "Combined Work" is a work produced by combining or linking an
28 | Application with the Library. The particular version of the Library
29 | with which the Combined Work was made is also called the "Linked
30 | Version".
31 |
32 | The "Minimal Corresponding Source" for a Combined Work means the
33 | Corresponding Source for the Combined Work, excluding any source code
34 | for portions of the Combined Work that, considered in isolation, are
35 | based on the Application, and not on the Linked Version.
36 |
37 | The "Corresponding Application Code" for a Combined Work means the
38 | object code and/or source code for the Application, including any data
39 | and utility programs needed for reproducing the Combined Work from the
40 | Application, but excluding the System Libraries of the Combined Work.
41 |
42 | 1. Exception to Section 3 of the GNU GPL.
43 |
44 | You may convey a covered work under sections 3 and 4 of this License
45 | without being bound by section 3 of the GNU GPL.
46 |
47 | 2. Conveying Modified Versions.
48 |
49 | If you modify a copy of the Library, and, in your modifications, a
50 | facility refers to a function or data to be supplied by an Application
51 | that uses the facility (other than as an argument passed when the
52 | facility is invoked), then you may convey a copy of the modified
53 | version:
54 |
55 | a) under this License, provided that you make a good faith effort to
56 | ensure that, in the event an Application does not supply the
57 | function or data, the facility still operates, and performs
58 | whatever part of its purpose remains meaningful, or
59 |
60 | b) under the GNU GPL, with none of the additional permissions of
61 | this License applicable to that copy.
62 |
63 | 3. Object Code Incorporating Material from Library Header Files.
64 |
65 | The object code form of an Application may incorporate material from
66 | a header file that is part of the Library. You may convey such object
67 | code under terms of your choice, provided that, if the incorporated
68 | material is not limited to numerical parameters, data structure
69 | layouts and accessors, or small macros, inline functions and templates
70 | (ten or fewer lines in length), you do both of the following:
71 |
72 | a) Give prominent notice with each copy of the object code that the
73 | Library is used in it and that the Library and its use are
74 | covered by this License.
75 |
76 | b) Accompany the object code with a copy of the GNU GPL and this license
77 | document.
78 |
79 | 4. Combined Works.
80 |
81 | You may convey a Combined Work under terms of your choice that,
82 | taken together, effectively do not restrict modification of the
83 | portions of the Library contained in the Combined Work and reverse
84 | engineering for debugging such modifications, if you also do each of
85 | the following:
86 |
87 | a) Give prominent notice with each copy of the Combined Work that
88 | the Library is used in it and that the Library and its use are
89 | covered by this License.
90 |
91 | b) Accompany the Combined Work with a copy of the GNU GPL and this license
92 | document.
93 |
94 | c) For a Combined Work that displays copyright notices during
95 | execution, include the copyright notice for the Library among
96 | these notices, as well as a reference directing the user to the
97 | copies of the GNU GPL and this license document.
98 |
99 | d) Do one of the following:
100 |
101 | 0) Convey the Minimal Corresponding Source under the terms of this
102 | License, and the Corresponding Application Code in a form
103 | suitable for, and under terms that permit, the user to
104 | recombine or relink the Application with a modified version of
105 | the Linked Version to produce a modified Combined Work, in the
106 | manner specified by section 6 of the GNU GPL for conveying
107 | Corresponding Source.
108 |
109 | 1) Use a suitable shared library mechanism for linking with the
110 | Library. A suitable mechanism is one that (a) uses at run time
111 | a copy of the Library already present on the user's computer
112 | system, and (b) will operate properly with a modified version
113 | of the Library that is interface-compatible with the Linked
114 | Version.
115 |
116 | e) Provide Installation Information, but only if you would otherwise
117 | be required to provide such information under section 6 of the
118 | GNU GPL, and only to the extent that such information is
119 | necessary to install and execute a modified version of the
120 | Combined Work produced by recombining or relinking the
121 | Application with a modified version of the Linked Version. (If
122 | you use option 4d0, the Installation Information must accompany
123 | the Minimal Corresponding Source and Corresponding Application
124 | Code. If you use option 4d1, you must provide the Installation
125 | Information in the manner specified by section 6 of the GNU GPL
126 | for conveying Corresponding Source.)
127 |
128 | 5. Combined Libraries.
129 |
130 | You may place library facilities that are a work based on the
131 | Library side by side in a single library together with other library
132 | facilities that are not Applications and are not covered by this
133 | License, and convey such a combined library under terms of your
134 | choice, if you do both of the following:
135 |
136 | a) Accompany the combined library with a copy of the same work based
137 | on the Library, uncombined with any other library facilities,
138 | conveyed under the terms of this License.
139 |
140 | b) Give prominent notice with the combined library that part of it
141 | is a work based on the Library, and explaining where to find the
142 | accompanying uncombined form of the same work.
143 |
144 | 6. Revised Versions of the GNU Lesser General Public License.
145 |
146 | The Free Software Foundation may publish revised and/or new versions
147 | of the GNU Lesser General Public License from time to time. Such new
148 | versions will be similar in spirit to the present version, but may
149 | differ in detail to address new problems or concerns.
150 |
151 | Each version is given a distinguishing version number. If the
152 | Library as you received it specifies that a certain numbered version
153 | of the GNU Lesser General Public License "or any later version"
154 | applies to it, you have the option of following the terms and
155 | conditions either of that published version or of any later version
156 | published by the Free Software Foundation. If the Library as you
157 | received it does not specify a version number of the GNU Lesser
158 | General Public License, you may choose any version of the GNU Lesser
159 | General Public License ever published by the Free Software Foundation.
160 |
161 | If the Library as you received it specifies that a proxy can decide
162 | whether future versions of the GNU Lesser General Public License shall
163 | apply, that proxy's public statement of acceptance of any version is
164 | permanent authorization for you to choose that version for the
165 | Library.
166 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [](https://doi.org/10.5281/zenodo.3661338)
2 | [](https://doi.org/10.21105/joss.02013)
3 |
4 | # ```thresholdmodeling```: A Python package for modeling excesses over a threshold using the Peak-Over-Threshold Method and the Generalized Pareto Distribution
5 |
6 | This package is intended for those who wish to conduct an extreme value analysis. It provides the whole toolkit necessary to create a threshold model in a simple and efficient way, covering the main methods of the Peak-Over-Threshold approach and the fit of the Generalized Pareto Distribution.
7 |
8 | In this repository you can find the main files of the package, the [Functions Documentation](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md), the [dataset](https://github.com/iagolemos1/thresholdmodeling/blob/master/dataset/rain.csv) used in some examples, the [paper](https://github.com/iagolemos1/thresholdmodeling/blob/master/paper.md) submitted to the [Journal of Open Source Software](https://joss.theoj.org/) and some tutorials.
9 |
10 | # Installing Package
11 | **An internet connection and the Anaconda distribution (Python 3) are required.**
12 |
13 | * For installing Anaconda on Linux, go to [this link](https://docs.anaconda.com/anaconda/install/linux/). For installing on Windows, go to [this one](https://docs.anaconda.com/anaconda/install/windows/). For installing on macOS, go to [this one](https://docs.anaconda.com/anaconda/install/mac-os/).
14 |
15 | * For creating your own environment by using the terminal or Anaconda Prompt, go [here](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands).
16 |
17 | ## Windows Users
18 | First, you will need to install R in your environment. Since ``rpy2`` (a Python dependency of thresholdmodeling) has no official Windows support, ``pip install thresholdmodeling`` will result in an error, and the same occurs with ``pip install rpy2``. You therefore need to download it from an unofficial website:
19 | https://www.lfd.uci.edu/~gohlke/pythonlibs/
20 | Here, find the rpy2 release that works on your machine and install it manually by opening the Anaconda Prompt in the download folder and running a line like this (the exact name depends on the downloaded file):
21 | ```
22 | pip install rpy2‑2.9.5‑cp37‑cp37m‑win_amd64.whl
23 | ```
24 | **Or** you can install it from the Anaconda Prompt by activating your environment and running:
25 | ```
26 | conda activate my_env
27 | conda install r
28 | conda install -c r rpy2=2.9.4
29 | ```
30 | After that, ``rpy2`` and ``R`` will be installed on your machine. Follow the next steps.
31 |
32 | To install the package, just use the following command in your Anaconda Prompt (the package is already on PyPI):
33 | ```
34 | pip install thresholdmodeling
35 | ```
36 | The other Python dependencies needed to run the software will be installed automatically with this command.
37 |
38 | Once the package is installed, run these lines in your IDE to install the ``POT`` R package (which our software uses, via ``rpy2``, to compute the GPD estimates):
39 | ```python
40 | from rpy2.robjects.packages import importr
41 | import rpy2.robjects.packages as rpackages
42 |
43 | base = importr('base')
44 | utils = importr('utils')
45 | utils.chooseCRANmirror(ind=1)
46 | utils.install_packages('POT') #installing POT package
47 | ```
48 |
49 | ## Linux Users
50 | First, run these lines in your terminal to install R and the ``rpy2`` package in your environment:
51 | ```
52 | conda activate my_env  # my_env is your environment name
53 | conda install r
54 | conda install -c r rpy2=2.9.4
55 | ```
56 | After installing R and ``rpy2``, find your Anaconda directory and, inside it, the environment folder; it should be somewhere like ~/anaconda3/envs/my_env. Open a terminal in this folder and run this line (the other dependencies will be installed automatically):
57 | ```
58 | pip install thresholdmodeling
59 | ```
60 | Once the package is installed, run these lines in your IDE to install the ``POT`` R package (which our software uses, via ``rpy2``, to compute the GPD estimates):
61 |
62 | ```python
63 | from rpy2.robjects.packages import importr
64 | import rpy2.robjects.packages as rpackages
65 |
66 | base = importr('base')
67 | utils = importr('utils')
68 | utils.chooseCRANmirror(ind=1)
69 | utils.install_packages('POT') #installing POT package
70 | ```
71 | Or you can download this [file](https://github.com/iagolemos1/thresholdmodeling/blob/master/install_pot.py) and run it in your IDE to install ``POT``.
72 | # User's guide and Reproducibility
73 | The [examples](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Examples/examples.py) file shows how the package should be used, and the [Functions Documentation](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md) gives a complete description of how to use each function in the package.
74 |
75 | To provide a tutorial on how to use the package and its results, a guide is presented below, following the example in Coles's [book](https://www.springer.com/gp/book/9781852334598) with the [Daily Rainfall in South-West England](https://github.com/iagolemos1/thresholdmodeling/blob/master/dataset/rain.csv) dataset.
76 |
77 | ## Threshold Selection
78 | First, it is necessary to conduct a threshold analysis using the first two functions of the package, ``MRL`` and ``Parameter_Stability_plot``, in order to select a reasonable threshold value.
79 | Running this:
80 | ```python
81 | from thresholdmodeling import thresh_modeling #importing package
82 | import pandas as pd #importing pandas
83 |
84 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv' #saving url
85 | df = pd.read_csv(url, error_bad_lines=False) #getting data
86 | data = df.values.ravel() #turning data into an array
87 |
88 | thresh_modeling.MRL(data, 0.05)
89 | thresh_modeling.Parameter_Stability_plot(data, 0.05)
90 | ```
91 | The results should be:
92 |
93 | 
94 |
95 | 
96 |
97 | 
98 |
99 | Then, by analysing the three plots, it is reasonable to take the threshold value as 30.
100 |
101 | ## Model Fit
102 | Once the threshold value is defined, it is possible to fit the dataset to a GPD model using the ``gpdfit`` function, running the following line with the maximum likelihood estimation method:
103 |
104 | ```python
105 | thresh_modeling.gpdfit(data, 30, 'mle')
106 | ```
107 |
108 | The terminal output should look like:
109 | ```
110 | Estimator: MLE
111 |
112 | Deviance: 970.1874
113 |
114 | AIC: 974.1874
115 |
116 |
117 | Varying Threshold: FALSE
118 |
119 |
120 | Threshold Call: 30L
121 |
122 | Number Above: 152
123 |
124 | Proportion Above: 0.0087
125 |
126 |
127 | Estimates
128 |
129 | scale shape
130 |
131 | 7.4411 0.1845
132 |
133 |
134 | Standard Error Type: observed
135 |
136 |
137 | Standard Errors
138 |
139 | scale shape
140 |
141 | 0.9587 0.1012
142 |
143 |
144 | Asymptotic Variance Covariance
145 |
146 | scale shape
147 |
148 | scale 0.91920 -0.06554
149 |
150 | shape -0.06554 0.01025
151 |
152 |
153 | Optimization Information
154 |
155 | Convergence: successful
156 |
157 | Function Evaluations: 14
158 |
159 | Gradient Evaluations: 6
160 | ```
161 | These are the GPD model estimates using the maximum likelihood estimator.
162 |
163 | ## Model Checking
164 | Once the GPD model is defined, it is necessary to verify whether the model is reasonable and describes the empirical observations well. Plots such as the probability density function, cumulative distribution function, quantile-quantile and probability-probability plots can show whether the model is adequate. These plots are obtained with the ``gpdpdf``, ``gpdcdf``, ``qqplot`` and ``ppplot`` functions. By running these lines:
165 | ```python
166 | thresh_modeling.gpdpdf(data, 30, 'mle', 'sturges', 0.05)
167 | thresh_modeling.gpdcdf(data, 30, 'mle', 0.05)
168 | thresh_modeling.qqplot(data,30, 'mle', 0.05)
169 | thresh_modeling.ppplot(data, 30, 'mle', 0.05)
170 | ```
171 | The results should be:
172 |
173 | 
174 |
175 | 
176 |
177 | 
178 |
179 | 
180 |
181 | Having verified that the theoretical model describes the empirical observations very well, the next step is to use the main tool of the extreme value approach: extrapolation over the unit of the return period.
182 |
183 | ## Return Value Analysis
184 | The first thing that must be defined is: what is the unit of the return period? In this example, the unit is days, because the observations are **daily**, but in other applications, like corrosion engineering, the unit may be the number of observations.
185 |
186 | Using the ``return_value`` function, it is possible to obtain two pieces of information:
187 | * **1** : The return value for a given return period and;
188 | * **2** : The return level plot, which also works very well as a model diagnostic.
189 |
190 | By running this line (go to [Model Diagnostics and Return Level Analysis](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md#model-diagnostics-and-return-level-analysis) for more information about the function):
191 | ```python
192 | thresh_modeling.return_value(data, 30, 0.05, 365, 36500, 'mle')
193 | ```
194 | That is, the return period for which we want the exact return value is 36500 days, or 100 years, and the 365 says that there are 365 observations per year.
195 |
196 | The results should be:
197 |
198 | 
199 |
200 | ```
201 | The return value for the given return period is 106.34386649996667 ± 40.86691363790978
202 | ```
203 | Hence, from the plot, it is possible to say that the theoretical model is very well fitted.
204 | It was also possible to compute the 100-year return value. In other words, the rainfall precipitation expected once every 100 years lies between 65.4770 and 147.2108 mm.
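As a quick check, the interval endpoints follow directly from the printed value and its half-width (a small arithmetic sketch, not a package call):

```python
# Confidence interval endpoints: return level ± half-width
rl, hw = 106.34386649996667, 40.86691363790978
print(rl - hw, rl + hw)  # ~65.4770 and ~147.2108
```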
205 |
206 | ## Declustering
207 | Stuart Coles, in his [book](https://www.springer.com/gp/book/9781852334598), says that if the extremes of a stationary series tend to occur in clusters, a further practice is needed to model these values. That practice is declustering: group the data into clusters and decluster by taking the maximum of each one. For this example, it is clear that, at least initially, the dataset is not organized in clusters. With the ``decluster`` function it is possible to observe the dataset plotted against its unit of return period, and also to cluster it using a given block size (in this example the clustering is monthly, so the block size is 30 days) and then decluster it by taking the maximum of each block.
208 |
209 | By running this line:
210 | ```python
211 | thresh_modeling.decluster(data, 30, 30)
212 | ```
213 | The result should be:
214 |
215 | 
216 |
217 | 
218 |
219 | It is important to note that after declustering the unit of the return period changes (to months). The first plot shows that, at least initially, there is no clustering pattern. However, this does not mean that the dataset cannot be declustered with a given block size, as the second plot shows.
220 |
221 | If it is necessary to decluster the dataset, the declustered series shown in the second plot should be used.
222 |
223 | ## Further Functions
224 | The functions not covered in this tutorial can be used as shown in the [examples](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Examples/examples.py) file. The description of each one is in the [Functions Documentation](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md).
225 |
226 | ## Doubts
227 | If you have any doubts about the package, don't hesitate to contact me.
228 |
229 | # General License
230 |
231 | Copyright (c) 2019 Iago Pereira Lemos
232 |
233 | This program is free software: you can redistribute it and/or modify
234 | it under the terms of the GNU General Public License as published by
235 | the Free Software Foundation, either version 3 of the License, or
236 | (at your option) any later version.
237 |
238 | This program is distributed in the hope that it will be useful,
239 | but WITHOUT ANY WARRANTY; without even the implied warranty of
240 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
241 | GNU General Public License for more details.
242 |
243 | You should have received a copy of the GNU General Public License
244 | along with this program. If not, see <https://www.gnu.org/licenses/>.
245 |
246 | # Referencing
247 | For referencing the repository, use the following code:
248 | ```
249 | @misc{thresholdmodeling,
250 | author = {Iago P. Lemos and Antonio Marcos G. Lima and Marcus Antonio Viana Duarte},
251 | title = {thresholdmodeling package},
252 | month = Feb,
253 | year = 2020,
254 | doi = {10.5281/zenodo.3661338},
255 | version = {0.0.1},
256 | publisher = {Zenodo},
257 | url = {https://github.com/iagolemos1/thresholdmodeling}
258 | }
259 | ```
260 | # Background
261 | I am a mechanical engineering undergraduate student at the Federal University of Uberlândia, and this package was made in the Acoustics and Vibration Laboratory of the School of Mechanical Engineering.
262 |
263 |
--------------------------------------------------------------------------------
/declustered.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/declustered.png
--------------------------------------------------------------------------------
/install_pot.py:
--------------------------------------------------------------------------------
1 | from rpy2.robjects.packages import importr
2 | import rpy2.robjects.packages as rpackages
3 |
4 | base = importr('base')
5 | utils = importr('utils')
6 | utils.chooseCRANmirror(ind=1)
7 | utils.install_packages('POT') #installing POT package
--------------------------------------------------------------------------------
/nocluster.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/nocluster.png
--------------------------------------------------------------------------------
/paper.bib:
--------------------------------------------------------------------------------
1 | @Book{coles,
2 | author = {S. Coles},
3 | title = {An {I}ntroduction to {S}tatistical {M}odeling of {E}xtreme {V}alues},
4 | year = {2001},
5 | edition = {1st},
6 | publisher = {Springer},
7 | address = {London},
8 | doi = {10.1007/978-1-4471-3675-0},
9 | }
10 | @Manual{POT,
11 | title = {\pkg{POT}: Generalized {P}areto {D}istribution and {P}eaks {O}ver {T}hreshold},
12 | author = {Mathieu Ribatet and Christophe Dutang},
13 | year = {2019},
14 | note = {\proglang{R} package version 1.1-7},
15 | url = {https://cran.r-project.org/web/packages/POT/index.html},
16 | }
17 | @Manual{extremes,
18 | title = {\pkg{extRemes}: {E}xtreme {V}alue {A}nalysis},
19 | author = {Eric Gilleland},
20 | year = {2019},
21 | note = {\proglang{R} package version 2.0-11},
22 | url = {https://cran.r-project.org/web/packages/extRemes/index.html},
23 | }
24 | @Manual{evd,
25 | title = {\pkg{evd}: Functions for {E}xtreme {V}alue {D}istributions},
26 | author = {Alec Stephenson},
27 | year = {2018},
28 | note = {\proglang{R} package version 2.3-3},
29 | url = {https://cran.r-project.org/web/packages/evd/index.html},
30 | }
31 |
32 | @Manual{ismev,
33 | title = {\pkg{ismev}: An {I}ntroduction to {S}tatistical {M}odeling of {E}xtreme {V}alues},
34 | author = {Janet E. Heffernan and Alec G. Stephenson},
35 | year = {2018},
36 | note = {\proglang{R} package version 1.42},
37 | url = {https://cran.r-project.org/web/packages/ismev/index.html},
38 | }
39 |
40 | @thesis{tan,
41 | author = {Hwei-Yang Tan},
42 | title = {Analysis of {C}orrosion {D}ata for
43 | {I}ntegrity {A}ssessments},
44 | type = {Thesis for the Degree of Doctor of Philosophy},
45 | year = {2017},
46 | institution = {Brunel University London},
47 | date = {2017},
48 | }
49 |
50 | @unpublished{esther,
51 | author = {Esther Bommier},
52 | title = {Peaks-{O}ver-{T}hreshold {M}odelling of
53 | {E}nvironmental {D}ata},
54 | note = {Examensarbete i matematik, Uppsala University},
55 | year = {2014},}
56 |
57 | @unpublished{max,
58 | author = {Max Rydman},
59 | title = {Application of the {P}eaks-{O}ver-{T}hreshold
60 | {M}ethod on {I}nsurance {D}ata},
61 | note = {Examensarbete i matematik, Uppsala University},
62 | year = {2018},}
63 |
64 | @Article{katz,
65 | author = {Richard W. Katz and Marc B. Parlange and Philippe Naveau},
66 | title = {Statistics of extremes in hydrology},
67 | journal = {Advances in Water Resources},
68 | year = {2002},
69 | volume = {25},
70 | number = {8--12},
71 | pages = {1287--1304},
72 | doi = {10.1016/S0309-1708(02)00056-8},
73 | }
74 |
75 | @Book{hosking,
76 | author = {J. R. M. Hosking and J. R. Wallis},
77 | title = {Regional {F}requency {A}nalysis: {A}n {A}pproach {B}ased on {L}-{M}oments.},
78 | year = {1997},
79 | edition = {1st},
80 | publisher = {Cambridge University Press},
81 | address = {Cambridge},
82 | doi = {10.1017/CBO9780511529443},
83 | }
84 |
85 | @Article{scarf,
86 | author = {Philip A. Scarf and Patrick J. Laycock},
87 | title = {Applications of {E}xtreme {V}alue {T}heory
88 | in {C}orrosion {E}ngineering},
89 | journal = {Journal of Research of the National Institute of Standards and Technology},
90 | year = {1994},
91 | volume = {99},
92 | number = {4},
93 | pages = {313--320},
94 | doi = {10.6028/jres.099.028},
95 | }
96 |
97 | @Article{evpot,
98 | author = {Soheil S. Far and Ahmad K. A. Wahab},
99 | title = {Evaluation of {P}eaks-{O}ver-{T}hreshold {M}ethod},
100 | journal = {Ocean Science},
101 | year = {2016},
102 | volume = {99},
103 | number = {4},
104 | pages = {313--320},
105 | doi = {10.5194/os-2016-47}
106 |
107 | }
108 |
109 | @online{scipy,
110 | author = {Scipy},
111 | title = {scipy.stats.genpareto},
112 | year = {2019},
113 | url = {https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.genpareto.html},
114 | }
115 | @online{kiko,
116 | author = {Kiko Correoso},
117 | title = {scikit-extremes},
118 | year = {2019},
119 | url = {https://github.com/kikocorreoso/scikit-extremes},
120 | }
121 |
--------------------------------------------------------------------------------
/paper.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: 'thresholdmodeling: A Python package for modeling excesses over a threshold using the Peak-Over-Threshold Method and the Generalized Pareto Distribution'
3 | tags:
4 | - Python
5 | - Threshold Models
6 | - Peak-Over-Threshold Method
7 | - Generalized Pareto Distribution
  - Statistical Modeling
9 | authors:
10 | - name: Iago Pereira Lemos
11 | orcid: 0000-0002-5829-7711
12 | affiliation: "1, 2, 3"
13 |
14 | - name: Antônio Marcos Gonçalves Lima
15 | orcid: 0000-0003-0170-6083
16 | affiliation: "4, 2, 3"
17 |
18 | - name: Marcus Antônio Viana Duarte
19 | orcid: 0000-0002-8166-5666
20 | affiliation: "4, 1, 2, 3"
21 | affiliations:
22 | - name: Acoustics and Vibration Laboratory
23 | index: 1
24 | - name: School of Mechanical Engineering
25 | index: 2
26 | - name: Federal University of Uberlândia
27 | index: 3
28 | - name: Associate Professor
29 | index: 4
30 |
31 | date: 06 January, 2020
32 | bibliography: paper.bib
33 | ---
34 |
35 | # Summary
36 |
37 | Extreme value analysis has emerged as one of the most important disciplines
38 | for the applied sciences when dealing with reduced datasets and when the main idea is to
39 | extrapolate the observations over a given time. By using a threshold model with an asymptotic characterization, it is possible to work with the Generalized Pareto Distribution (GPD) [@coles] and use it to model the stochastic behavior of a process at an unusual level, either a maximum or a minimum. For example, consider a large dataset of wind velocity in Florida, USA, during a certain period of time. It is possible to model this process and to quantify the probability of extreme events, for example hurricanes, which are maximum observations of wind velocity, in a time of interest using the return value tool.
40 |
41 | In this context, this package provides a complete toolkit to conduct a threshold model analysis, from the initial phase of selecting the threshold, through model fitting and checking, to return value analysis. Moreover, statistical moments functions are provided. In the case of extremes of dependent sequences, it is also possible to conduct a declustering analysis.
42 |
43 | In a software context, it is possible to see a strong community working with ``R`` packages like ``POT`` [@POT], ``evd`` [@evd], and ``extRemes`` [@extremes] that are used for complete extreme value modeling.
44 | In ``Python``, it is possible to find ``scikit-extremes`` [@kiko], which does not contain threshold models yet. Another package is ``scipy``, which has the ``genpareto`` [@scipy] functions, but these do not provide any Peak-Over-Threshold modeling functions since it is not possible to define a threshold using this package. Moreover, this package brings to the community of scientists, engineers, and any other interested person and programmer the possibility of conducting an extreme value analysis using a strong, consolidated and high-level programming language, given the importance of the extreme value theory approach for statistical analysis in corrosion engineering (see @scarf and @tan), hydrology (see @katz), environmental data analysis (see @max and @esther) and many other fields of natural sciences and engineering. (For a large number of additional applications, see @coles p. 1.)
45 |
46 | Hence, the ``thresholdmodeling`` package presents numerous functions to model the stochastic behavior of an extreme process. For a complete introduction to the fifteen package functions, go to the [Functions Documentation](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md) on the [GitHub page](https://github.com/iagolemos1/thresholdmodeling).
47 |
48 | # Package Features
49 |
50 | ## Threshold Selection
51 | * **Mean Residual Life Plot**: It is possible to plot the Mean Residual Life function as it is defined in @coles;
52 |
53 | * **Parameter Stability Plot**: Also, it is possible to obtain the two parameter stability plots of the GPD: the Shape Parameter Stability Plot and the Modified Scale Parameter Stability Plot, the latter defined from a reparametrization of the GPD scale parameter. (See @coles for a complete theoretical introduction to these two plots.)
54 |
55 | ## Model Fit
56 | * **Fit the GPD Model**: Fitting a given dataset to a GPD model using some fit methods (see [**Model Fit**](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md#model-fit)).
57 |
58 | ## Model Checking
59 | * **Probability Density Function, Cumulative Distribution Function, Quantile-Quantile and Probability-Probability Plots**: Plots the theoretical probability density function with the normalized empirical histograms for a given dataset, using some bin methods (see [``gpdpdf``](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md#model-fit)).
60 | Also, the theoretical CDF in comparison to the empirical one with the Dvoretzky–Kiefer–Wolfowitz confidence bands can be drawn.
61 | In addition, the QQ and PP plots, comparing the sample and the theoretical values, can be obtained; the first uses the Kolmogorov-Smirnov two-sample test for the confidence bands, while the second uses the Dvoretzky–Kiefer–Wolfowitz method;
62 |
63 | * **L-Moments Plots**: L-Skewness against L-Kurtosis plot for given threshold values using the Generalized Pareto parametrization. Be warned: L-Moments plots are really difficult to interpret. See @POT and @hosking for more details.
64 |
65 | ## Model Diagnostics and Return Level Analysis
66 | * **Return Level Computation and Plot**: Computing a return level for a given return period is possible, with a confidence interval obtained by the Delta Method [@coles]. Furthermore, a return level plot is provided, using the Delta Method to obtain the confidence bands; for comparison, the empirical return levels are also plotted.
67 |
68 | ## Declustering and Data Visualization
It is possible to visualize the data over the unit of the return period. In the case of extremes of dependent sequences, for a given empirical rule (a number of days, for example), it is possible to cluster the dataset and, by taking the maximum observation of each cluster, decluster the maxima.
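For instance, with the rainfall data used in the example below, a monthly declustering (30-observation blocks above a threshold of 30) is a single call (a usage sketch mirroring the repository examples):

```python
thresh_modeling.decluster(data, 30, 30)
```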
70 |
71 | ## Further Functions
72 | It is also possible to compute sample L-Moments, model L-Moments, non-central moments, differential entropy, and the survival function plot.
73 |
74 | ## Installation
75 |
76 | For installation instructions, see the [README](https://github.com/iagolemos1/thresholdmodeling/blob/master/README.md) on the GitHub page.
77 |
78 | # Reproducibility and User's Guide
79 |
80 | The repository on the [GitHub page](https://github.com/iagolemos1/thresholdmodeling) contains a link to
81 | the dataset: Daily Rainfall in the South-West of England from 1914 to 1962.
82 | It can be used to test the software, verifying its results and comparing them with the ones presented in @coles. For a more detailed tutorial on the use of each function, see the [examples](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Examples/examples.py) file.
83 |
84 | A minimal example of how to use the software and obtain some of the results presented by @coles is given below. For information about the functions employed, see the [Functions Documentation](https://github.com/iagolemos1/thresholdmodeling/blob/master/Functions%20Documentation.md) and for more details on reproducibility, see the [README](https://github.com/iagolemos1/thresholdmodeling/blob/master/README.md).
85 |
86 | ```python
87 | from thresholdmodeling import thresh_modeling
88 | import pandas as pd
89 |
90 | url = ('https://raw.githubusercontent.com/iagolemos1'
91 |        '/thresholdmodeling/master/dataset/rain.csv')
92 | df = pd.read_csv(url, error_bad_lines=False)
93 | data = df.values.ravel()
94 |
95 | thresh_modeling.MRL(data, 0.05)
96 | thresh_modeling.return_value(data, 30, 0.05, 365, 36500, 'mle')
97 | ```
98 | 
99 |
100 | **Fig. 1:** Mean Residual Life Plot for the daily rainfall dataset.
101 |
102 | 
103 |
104 | **Fig. 2:** Return level plot with the empirical estimates of the return level and the confidence bands based on the Delta Method.
105 |
106 | Also, for the given return period (100 years), the software presents the following results in the terminal:
107 | ```
108 | The return value for the given return period is 106.3439 ± 40.8669
109 | ```
110 |
111 | For more details, the documentation on the [GitHub page](https://github.com/iagolemos1/thresholdmodeling) is up-to-date.
112 |
113 | # Acknowledgements
114 |
115 | The authors would like to thank the School of Mechanical Engineering at the Federal University of Uberlândia, and CNPq and CAPES, for the financial support of this research.
116 |
117 | # References
118 |
119 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | scipy
2 | numpy
3 | pandas
4 | thresholdmodeling
5 | matplotlib
6 |
7 |
--------------------------------------------------------------------------------
/result_CDF.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_CDF.png
--------------------------------------------------------------------------------
/result_MODSCALE.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_MODSCALE.png
--------------------------------------------------------------------------------
/result_MRL.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_MRL.png
--------------------------------------------------------------------------------
/result_SHAPE.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_SHAPE.png
--------------------------------------------------------------------------------
/result_pdf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_pdf.png
--------------------------------------------------------------------------------
/result_pp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_pp.png
--------------------------------------------------------------------------------
/result_qq.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_qq.png
--------------------------------------------------------------------------------
/result_retlvl.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/199390b82dbb62bc6247621d9a9e191f33d7fb3e/result_retlvl.png
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import setuptools
2 |
3 | with open("README.md", "r") as fh:
4 | long_description = fh.read()
5 |
6 | setuptools.setup(
7 | name="thresholdmodeling",
8 | version="0.0.1",
9 | author="Iago Pereira Lemos",
10 | author_email="lemosiago123@gmail.com",
11 | description="This package is intended for those who wish to conduct an extreme value analysis. It provides the whole toolkit necessary to create a threshold model in a simple and efficient way, presenting the main methods of the Peak-Over-Threshold approach and the fit of the Generalized Pareto Distribution. To install and use it, go to https://github.com/iagolemos1/thresholdmodeling",
12 | long_description=long_description,
13 | long_description_content_type="text/markdown",
14 | url="https://github.com/iagolemos1/thresholdmodeling",
15 | packages=['thresholdmodeling'],
16 | test_suite='tests',
17 | classifiers=[
18 | "Programming Language :: Python :: 3",
19 | 'License :: OSI Approved :: GNU General Public License (GPL)',
20 | "Operating System :: OS Independent",
21 | ],
22 | python_requires='>=3.6',
23 | install_requires= ['numpy','scipy','rpy2','matplotlib','seaborn'])
24 |
--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/tests/declustering_test.py:
--------------------------------------------------------------------------------
1 | import unittest
2 | import numpy as np
3 | import matplotlib.pyplot as plt
4 |
5 | def decluster(sample, threshold, block_size): #function to decluster the dataset toward period blocks
6 | period_unit = np.arange(1, len(sample)+1, 1) #period array
7 | threshold_array = np.ones(len(sample))*threshold
8 | nob = int(len(sample)/block_size) #number of blocks
9 | clust = np.zeros((nob, block_size)) #initialization of the cluster matrix (rows: cluster; columns: observations)
10 | #Algorithm to cluster
11 | k = 0
12 | for i in range(0, nob):
13 | for j in range(0, block_size):
14 | clust[i][j] = sample[j+k]
15 | k = j + k + 1
16 |
17 | block_max = np.amax(clust, 1) #getting max of each block and declustering
18 |
19 | period_unit_block = np.arange(0, len(block_max), 1) #array of period for each block
20 | threshold_block_array = np.ones(len(block_max))*threshold
21 |
22 | #Plot real dataset
23 | plt.figure(11)
24 | plt.scatter(period_unit, sample)
25 | plt.plot(period_unit, threshold_array, label = 'Threshold', color = 'red')
26 | plt.legend()
27 | plt.xlabel('Period Unit')
28 | plt.ylabel('Data')
29 | plt.title('Sample dataset per Period Unit')
30 |
31 | #Plot declustered data
32 | plt.figure(12)
33 | plt.scatter(period_unit_block, block_max)
34 | plt.plot(period_unit_block, threshold_block_array, label = 'Threshold', color = 'red')
35 | plt.legend()
36 | plt.xlabel('Period Unit')
37 | plt.ylabel('Declustered Data')
38 | plt.title('Declustered dataset per Period Unit')
39 | plt.show()
40 |
41 | return(block_max)
42 |
43 | class TestFun(unittest.TestCase):
44 | def test_declustering(self):
45 | """
46 | Testing if the function will return exactly the points it should
47 | """
48 | data = [1, 1.5, 1.2, 4, 4.5, 4.2, 8, 8.5, 8.2, 12, 12.5, 12.2]
49 | #The code will cluster the data into four blocks of size 3 and take the maximum from each one
50 | # From data, we can say that the resulting array will be [1.5, 4.5, 8.5, 12.5]
51 | result = decluster(data, 0, 3)
52 | resultreal = np.array([1.5, 4.5, 8.5, 12.5])
53 | for i in range(len(result)):
54 | self.assertEqual(result[i], resultreal[i])
55 |
56 |
57 | if __name__ == '__main__':
58 | unittest.main()
59 |
60 |
--------------------------------------------------------------------------------
/tests/entropy_test.py:
--------------------------------------------------------------------------------
1 | from thresholdmodeling import thresh_modeling
2 | import pandas as pd
3 | import unittest
4 | from scipy.stats import genpareto
5 | import math as mt
6 |
7 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv'
8 | df = pd.read_csv(url, error_bad_lines=False)
9 |
10 | def entropy(sample, b, threshold, fit_method): #Get the entropy of the distribution
11 | [shape, scale, sample, sample_excess, sample_over_thresh] = thresh_modeling.gpdfit(sample, threshold, fit_method)
12 | h = mt.log(scale) + shape + 1
13 | print('The differential entropy is {} nats.'.format(h))
14 | return (h, shape, scale)
15 |
16 | class TestFun(unittest.TestCase):
17 | def test_entropy(self):
18 | """
19 | Testing the differential entropy computation
20 | """
21 | data = df.values.ravel()
22 | result = entropy(data, 'e', 30, 'mle')
23 | #testing
24 | self.assertEqual(result[0], genpareto.entropy(result[1], 30, result[2]))
25 |
26 |
27 | if __name__ == '__main__':
28 | unittest.main()
29 |
--------------------------------------------------------------------------------
/tests/gpdfit_test.py:
--------------------------------------------------------------------------------
1 | from thresholdmodeling import thresh_modeling
2 | import pandas as pd
3 | import unittest
4 |
5 |
6 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv'
7 | df = pd.read_csv(url, error_bad_lines=False)
8 |
9 | class TestFun(unittest.TestCase):
10 | def test_fit(self):
11 | """
12 | Test that it can fit the array to a GPD comparing to the values presented by Coles
13 | """
14 | data = df.values.ravel()
15 | result = thresh_modeling.gpdfit(data, 30, 'mle')
16 | #testing shape parameter (gpdfit returns [shape, scale, ...])
17 | self.assertEqual(round(result[0],3), 0.185)
18 | #testing scale parameter
19 | self.assertEqual(round(result[1],2), 7.44)
20 |
21 | if __name__ == '__main__':
22 | unittest.main()
--------------------------------------------------------------------------------
/tests/lmom_dist_test.py:
--------------------------------------------------------------------------------
1 | import unittest
2 |
3 | def lmom_dist(shape, scale, threshold):
4 | #The package function was changed a little just to take given parameters as input rather than estimating them from a sample
5 | #The math is exactly the same.
6 | t_1 = threshold + scale*(1+shape)
7 | t_2 = scale/((1+shape)*(2+shape))
8 | t_3 = (1 - shape)/(3 + shape)
9 | t_4 = ((1 - shape)*(2 - shape))/((3 + shape)*(4 + shape))
10 | return (t_1, t_2, t_3, t_4)
11 |
12 | class TestFun(unittest.TestCase):
13 | def test_lmom_dist(self):
14 | """
15 | Testing if the function will return the right moments for the given parameters:
16 | Shape = 1
17 | Scale = 1
18 | Threshold = 1
19 | """
20 | result = lmom_dist(1, 1, 1)
21 | #testing
22 | self.assertEqual(result[0], 3)
23 | self.assertEqual(round(result[1],4), 0.1667)
24 | self.assertEqual(result[2], 0)
25 | self.assertEqual(result[3], 0)
26 |
27 |
28 | if __name__ == '__main__':
29 | unittest.main()
30 |
31 |
--------------------------------------------------------------------------------
/tests/lmom_sample_test.py:
--------------------------------------------------------------------------------
1 | from rpy2.robjects.packages import importr
2 | from rpy2.robjects.vectors import FloatVector
3 | from thresholdmodeling import thresh_modeling
4 | import pandas as pd
5 | import unittest
6 |
7 |
8 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv'
9 | df = pd.read_csv(url, error_bad_lines=False)
10 |
11 | class TestFun(unittest.TestCase):
12 | def test_lmom_sample(self):
13 | """
14 | Testing L-moments from sample
15 | """
16 | data = df.values.ravel()
17 | POT = importr('POT') #importing POT package
18 | POTLmonsample = POT.samlmu(FloatVector(data), 4)
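# samlmu is assumed here to follow the usual (l1, l2, t3, t4) convention (the first
# two sample L-moments followed by the L-moment ratios), which is also what
# lmom_sample returns, so the two outputs can be compared element-wise below.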
19 | result = thresh_modeling.lmom_sample(data)
20 | #testing
21 | self.assertEqual(round(result[0],4), round(POTLmonsample[0],4))
22 | self.assertEqual(round(result[1],4), round(POTLmonsample[1],4))
23 | self.assertEqual(round(result[2],4), round(POTLmonsample[2],4))
24 | self.assertEqual(round(result[3],4), round(POTLmonsample[3],4))
25 |
26 | if __name__ == '__main__':
27 | unittest.main()
--------------------------------------------------------------------------------
/tests/non_central_moments_test.py:
--------------------------------------------------------------------------------
1 | from thresholdmodeling import thresh_modeling
2 | import pandas as pd
3 | import unittest
4 | from scipy.stats import genpareto
5 |
6 |
7 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv'
8 | df = pd.read_csv(url, error_bad_lines=False)
9 |
10 | class TestFun(unittest.TestCase):
11 | def test_non_central_moments(self):
12 | """
13 | Testing the non-central moments computation
14 | """
15 | data = df.values.ravel()
16 | result = thresh_modeling.non_central_moments(data, 30, 'mle')
17 | par = thresh_modeling.gpdfit(data, 30, 'mle')
18 | #testing
19 | self.assertEqual(result[0], genpareto.stats(par[0], 30, par[1], 'mvsk')[0])
20 | self.assertEqual(result[1], genpareto.stats(par[0], 30, par[1], 'mvsk')[1])
21 | self.assertEqual(result[2], genpareto.stats(par[0], 30, par[1], 'mvsk')[2])
22 | self.assertEqual(result[3], genpareto.stats(par[0], 30, par[1], 'mvsk')[3])
23 |
24 |
25 |
26 | if __name__ == '__main__':
27 | unittest.main()
--------------------------------------------------------------------------------
/tests/return_value_test.py:
--------------------------------------------------------------------------------
1 | #Getting main packages
2 | import numpy as np
3 | import matplotlib.pyplot as plt
4 | from scipy.stats import norm
5 | import seaborn as sns; sns.set(style = 'whitegrid')
6 | from scipy.stats import genpareto
7 | import pandas as pd
8 | import math as mt
9 | import scipy.special as sm
10 |
11 | #Getting main packages from R in order to apply the maximum likelihood function
12 | from rpy2.robjects.packages import importr
13 | from rpy2.robjects.vectors import FloatVector
14 |
15 | POT = importr('POT') #importing POT package
16 | import unittest
17 |
18 |
19 | url = 'https://raw.githubusercontent.com/iagolemos1/thresholdmodeling/master/dataset/rain.csv'
20 | df = pd.read_csv(url, error_bad_lines=False)
21 | data = df.values.ravel()
22 |
23 | def return_value(sample_real, threshold, alpha, block_size, return_period, fit_method): #local copy of the package function: return value plot and estimate, modified to return (x_m, CI) for testing
24 | sample = np.sort(sample_real)
25 | sample_excess = []
26 | sample_over_thresh = []
27 | for data in sample:
28 | if data > threshold+0.00001:
29 | sample_excess.append(data - threshold)
30 | sample_over_thresh.append(data)
31 |
32 | rdata = FloatVector(sample)
33 | fit = POT.fitgpd(rdata, threshold, est = fit_method) #fit data
34 | shape = fit[0][1]
35 | scale = fit[0][0]
36 |
37 | #Computing the return value for a given return period with the confidence interval estimated by the Delta Method
38 | m = return_period
39 | Eu = len(sample_over_thresh)/len(sample)
40 | x_m = threshold + (scale/shape)*(((m*Eu)**shape) - 1)
41 |
42 | #Solving the delta method
43 | d = Eu*(1-Eu)/len(sample)
44 | e = fit[3][0]
45 | f = fit[3][1]
46 | g = fit[3][2]
47 | h = fit[3][3]
48 | a = (scale*(m**shape))*(Eu**(shape-1))
49 | b = (shape**-1)*(((m*Eu)**shape) - 1)
50 | c = (-scale*(shape**-2))*((m*Eu)**shape - 1) + (scale*(shape**-1))*((m*Eu)**shape)*mt.log(m*Eu)
51 | CI = (norm.ppf(1-(alpha/2))*((((a**2)*d) + (b*((c*g) + (e*b))) + (c*((b*f) + (c*h))))**0.5))
52 |
53 | print('The return value for the given return period is {} \u00B1 {}'.format(x_m, CI))
54 |
55 |
56 | ny = block_size #defining how many observations make up one block (usually a year)
57 | N_year = return_period/block_size #N_year represents the number of years based on the given return_period
58 |
59 | for i in range(0, len(sample)):
60 | if sample[i] > threshold + 0.0001:
61 | i_initial = i
62 | break
63 |
64 | p = np.arange(i_initial,len(sample))/(len(sample)) #Getting Plotting Position points
65 | N = 1/(ny*(1 - p)) #transforming plotting position points to years
66 |
67 | year_array = np.arange(min(N), N_year+0.1, 0.1) #defining a year array
68 |
69 | #Algorithm to compute the return value and the confidence intervals for plotting
70 | z_N = []
71 | CI_z_N_high_year = []
72 | CI_z_N_low_year = []
73 | for year in year_array:
74 | z_N.append(threshold + (scale/shape)*(((year*ny*Eu)**shape) - 1))
75 | a = (scale*((year*ny)**shape))*(Eu**(shape-1))
76 | b = (shape**-1)*((((year*ny)*Eu)**shape) - 1)
77 | c = (-scale*(shape**-2))*(((year*ny)*Eu)**shape - 1) + (scale*(shape**-1))*(((year*ny)*Eu)**shape)*mt.log((year*ny)*Eu)
78 | CIyear = (norm.ppf(1-(alpha/2))*((((a**2)*d) + (b*((c*g) + (e*b))) + (c*((b*f) + (c*h))))**0.5))
79 | CI_z_N_high_year.append(threshold + (scale/shape)*(((year*ny*Eu)**shape) - 1) + CIyear)
80 | CI_z_N_low_year.append(threshold + (scale/shape)*(((year*ny*Eu)**shape) - 1) - CIyear)
81 |
82 | #Plotting Return Value
83 | plt.figure(8)
84 | plt.plot(year_array, CI_z_N_high_year, linestyle='--', color='red', alpha = 0.8, lw = 0.9, label = 'Confidence Bands')
85 | plt.plot(year_array, CI_z_N_low_year, linestyle='--', color='red', alpha = 0.8, lw = 0.9)
86 | plt.plot(year_array, z_N, color = 'black', label = 'Theoretical Return Level')
87 | plt.scatter(N, sample_over_thresh, label = 'Empirical Return Level')
88 | plt.xscale('log')
89 | plt.xlabel('Return Period')
90 | plt.ylabel('Return Level')
91 | plt.title('Return Level Plot')
92 | plt.legend()
93 | plt.show()
94 | return (x_m, CI)
95 |
96 | class TestFun(unittest.TestCase):
97 | def test_return_value(self):
98 | """
99 | Testing the return value and its confidence interval based on the values presented by Coles
100 | """
101 | data = df.values.ravel()
102 | result = return_value(data, 30, 0.05, 365, 36500, 'mle')
103 | #testing return value
104 | self.assertEqual(round(result[0],1), 106.3)
105 | #testing confidence interval
106 | self.assertEqual(round(result[1],1), 40.9)
107 |
108 | if __name__ == '__main__':
109 | unittest.main()
--------------------------------------------------------------------------------
/thresholdmodeling/__init__.py:
--------------------------------------------------------------------------------
1 | from . thresh_modeling import MRL
2 | from . thresh_modeling import Parameter_Stability_plot
3 | from . thresh_modeling import gpdfit
4 | from . thresh_modeling import gpdpdf
5 | from . thresh_modeling import qqplot
6 | from . thresh_modeling import ppplot
7 | from . thresh_modeling import gpdcdf
8 | from . thresh_modeling import return_value
9 | from . thresh_modeling import survival_function
10 | from . thresh_modeling import non_central_moments
11 | from . thresh_modeling import lmom_dist
12 | from . thresh_modeling import lmom_sample
13 | from . thresh_modeling import lmomplot
14 | from . thresh_modeling import decluster
15 | from . thresh_modeling import entropy
16 |
--------------------------------------------------------------------------------
/thresholdmodeling/thresh_modeling.py:
--------------------------------------------------------------------------------
1 | #########################################################################
2 | #Copyright (c) 2019 Iago Pereira Lemos
3 |
4 | #This program is free software: you can redistribute it and/or modify
5 | #it under the terms of the GNU General Public License as published by
6 | #the Free Software Foundation, either version 3 of the License, or
7 | #(at your option) any later version.
8 |
9 | #This program is distributed in the hope that it will be useful,
10 | #but WITHOUT ANY WARRANTY; without even the implied warranty of
11 | #MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
12 | #GNU General Public License for more details.
13 |
14 | #You should have received a copy of the GNU General Public License
15 | #along with this program. If not, see <https://www.gnu.org/licenses/>.
16 |
17 | ##########################################################################
18 |
19 | #Two functions for getting the plots for defining the threshold
20 | #in order to model a Generalized Pareto Distribution.
21 | #
22 | #MRL Function plots the Mean Residual Life function.
23 | #The Parameter_Stability_plot plots the shape and the modified scale
24 | #parameters against the threshold values, u.
25 | #
26 | #Both functions need the sample array and the significance level.
27 |
28 |
29 | #Getting main packages
30 | import numpy as np
31 | import matplotlib.pyplot as plt
32 | from scipy.stats import norm
33 | import seaborn as sns; sns.set(style = 'whitegrid')
34 | from scipy.stats import genpareto
35 | import pandas as pd
36 | import math as mt
37 | import scipy.special as sm
38 |
39 | #Getting main packages from R in order to apply the maximum likelihood function
40 | from rpy2.robjects.packages import importr
41 | from rpy2.robjects.vectors import FloatVector
42 | import rpy2.robjects.packages as rpackages
43 | 
44 |
45 | base = importr('base')
46 | utils = importr('utils')
47 | utils.chooseCRANmirror(ind=1)
48 | utils.install_packages('POT') #installing POT package
49 |
50 | POT = importr('POT') #importing POT package
51 |
52 | def MRL(sample, alpha): #MRL function
53 |
54 | #Defining the threshold array and its step
55 | step = np.quantile(sample, .995)/60
56 | threshold = np.arange(0, max(sample), step=step)
57 | z_inverse = norm.ppf(1-(alpha/2))
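# Background note (standard peaks-over-threshold theory, added for clarity): if the
# excesses over a valid threshold u0 follow a GPD with shape < 1, the mean excess
#     E[X - u | X > u] = (scale_u0 + shape*(u - u0))/(1 - shape)
# is linear in u, so a suitable threshold is one above which the plot drawn below
# looks approximately linear. The band computed below is the CLT-based interval
# mean +/- z*std/sqrt(n) for each candidate threshold.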
58 |
59 | #Initialization of arrays
60 | mrl_array = [] #mean of excesses initialization
61 | CImrl = [] #confidence interval for the excesses initialization
62 |
63 | #First Loop for getting the mean residual life for each threshold value and the
64 | #second one getting the confidence intervals for the plot
65 | for u in threshold:
66 | excess = [] #initialization of the excesses array for each loop
67 | for data in sample:
68 | if data > u:
69 | excess.append(data - u) #adding excesses to the excesses array
70 | mrl_array.append(np.mean(excess)) #adding the mean of the excesses in the mean excesses array
71 | std_loop = np.std(excess) #getting standard deviation in the loop
72 | CImrl.append(z_inverse*std_loop/(len(excess)**0.5)) #getting confidence interval
73 |
74 | CI_Low = [] #initialization of the low confidence interval array
75 | CI_High = [] #initialization of the high confidence interval array
76 |
77 | #Loop to add the confidence intervals to the plot arrays
78 | for i in range(0, len(mrl_array)):
79 | CI_Low.append(mrl_array[i] - CImrl[i])
80 | CI_High.append(mrl_array[i] + CImrl[i])
81 |
82 | #Plot MRL
83 | plt.figure(1)
84 | sns.lineplot(x = threshold, y = mrl_array)
85 | plt.fill_between(threshold, CI_Low, CI_High, alpha = 0.4)
86 | plt.xlabel('u')
87 | plt.ylabel('Mean Excesses')
88 | plt.title('Mean Residual Life Plot')
89 | plt.show()
90 |
91 | def Parameter_Stability_plot(sample, alpha): #Parameter stability plot function
92 | #Defining Threshold array
93 | step = np.quantile(sample, .995)/45
94 | threshold = np.arange(
95 | 0, np.quantile(sample, .999), step = step, dtype='float32')
96 |
97 | #Transforming sample in a R array
98 | rdata = FloatVector(sample)
99 |
100 | #Initialization of some main arrays
101 | stdshape = [] #standard deviation of the shape parameter initialization
102 | shape = [] #shape parameter initialization
103 | scale = [] #scale parameter initialization
104 | mod_scale = [] #modified scale parameter initialization
105 | CI_shape = [] #confidence interval of the shape parameter
106 | CI_mod_scale = [] #confidence interval of the modified scale
107 | z = norm.ppf(1-(alpha/2))
108 |
109 | #Getting parameters and CI's for both plots
110 | for u in threshold:
111 | fit = POT.fitgpd(rdata, u.item(), est = 'mle') #fitting distribution using POT package with the MLE method
112 | shape.append(fit[0][1]) #adding the shape parameter to the respective array
113 | scale.append(fit[0][0]) #adding the scale parameter to the respective array
114 | stdshape.append(fit[1][1]) #adding the shape standard deviation to the respective array
115 | CI_shape.append(fit[1][1]*z) #getting the values of the confidence interval for plotting
116 | mod_scale.append(fit[0][0] - (fit[0][1]*u)) #getting the modified scale parameter
117 | Var_mod_scale = (fit[3][0] - (u*fit[3][2]) - u*(fit[3][1] - (fit[3][3]*u))) #solving the Delta method
118 | #in order to get the variance of the modified scale parameter
119 | CI_mod_scale.append((Var_mod_scale**0.5)*z) #getting the confidence interval for the
120 | #modified scale parameter
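# Note on the delta method used above: the modified scale sigma* = scale - shape*u
# is linear in (scale, shape) with gradient (1, -u), so its variance is
#     Var(sigma*) = Var(scale) - 2*u*Cov(scale, shape) + u**2*Var(shape),
# which is the expression evaluated from the flattened covariance matrix fit[3].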
121 |
122 | #Plotting shape parameter against u values
123 | plt.figure(2)
124 | plt.errorbar(threshold, shape, yerr = CI_shape, fmt = 'o' )
125 | plt.xlabel('u')
126 | plt.ylabel('Shape Parameter')
127 | plt.title('Shape Parameter Stability Plot')
128 |
129 | #Plotting modified scale parameter against u values
130 | plt.figure(3)
131 | plt.errorbar(threshold, mod_scale, yerr = CI_mod_scale, fmt = 'o')
132 | plt.xlabel('u')
133 | plt.ylabel('Modified Scale Parameter')
134 | plt.title('Modified Scale Parameter Stability Plot')
135 |
136 | plt.show()
137 |
138 | def gpdfit(sample, threshold, fit_method):
139 | sample = np.sort(sample)
140 | sample_excess = []
141 | sample_over_thresh = []
142 | for data in sample:
143 | if data > threshold+0.00001:
144 | sample_excess.append(data - threshold) #getting an excesses array
145 | sample_over_thresh.append(data) #getting an array with values over the threshold
146 | rdata = FloatVector(sample)
147 | fit = POT.fitgpd(rdata, threshold, est = fit_method) #fit the data to the distribution
148 | shape = fit[0][1]
149 | scale = fit[0][0]
150 | print(fit) #show the GPD fit estimates
151 |
152 | return(shape, scale, sample, sample_excess, sample_over_thresh)
153 |
154 | def gpdpdf(sample, threshold, fit_method, bin_method, alpha): #get PDF plot with histogram to diagnose the model
155 | [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method) #Fit the data
156 | x_points = np.arange(0, max(sample), 0.001) #define a range of points for drawing the pdf
157 | pdf = genpareto.pdf(x_points, shape, loc=0, scale=scale) #get the pdf values
158 |
159 | #Plotting PDF
160 | plt.figure(4)
161 | plt.xlabel('Data')
162 | plt.ylabel('PDF')
163 | plt.title('Data Probability Density Function')
164 | plt.plot(x_points, pdf, color = 'black', label = 'Theoretical PDF')
165 | plt.hist(sample_excess, bins = bin_method, density = True) #draw histograms
166 | plt.legend()
167 | plt.show()
168 |
169 | def qqplot(sample, threshold, fit_method, alpha): #get Quantile-Quantile plot to diagnose the model
170 | [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method) #fit data
171 | i_initial = 0
172 | p = []
173 | n = len(sample)
174 | sample = np.sort(sample)
175 | for i in range(0, n):
176 | if sample[i] > threshold + 0.0001:
177 | i_initial = i #get the index of the first observation over the threshold
178 | k = i - 1
179 | break
180 |
181 | for i in range(i_initial, n):
182 | p.append((i - 0.35)/(n)) #using the index, compute the empirical probabilities with the Hosking Plotting Position Estimator.
183 |
184 | p0 = (k - 0.35)/(n)
185 |
186 | quantiles = []
187 | for pth in p:
188 | quantiles.append(threshold + ((scale/shape)*(((1-((pth-p0)/(1-p0)))**-shape) - 1))) #getting the theoretical quantiles array
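# The expression above is the GPD quantile function
#     x_F = u + (scale/shape)*((1 - F)**(-shape) - 1)
# applied to F = (pth - p0)/(1 - p0), i.e. the plotting positions rescaled to the
# conditional distribution of the exceedances over the threshold.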
189 |
190 | n = len(sample_over_thresh)
191 | y = np.arange(1,n+1)/n #getting empirical cumulative probabilities for the confidence bands
192 |
193 | #Kolmogorov-Smirnov Test for getting the confidence interval
194 | K = (-0.5*mt.log(alpha/2))**0.5
195 | M = (len(p)**2/(2*len(p)))**0.5
196 | CI_qq_high = []
197 | CI_qq_low = []
198 | for prob in y:
199 | F1 = prob - K/M
200 | F2 = prob + K/M
201 | CI_qq_low.append(threshold + ((scale/shape)*(((1-((F1)/(1)))**-shape) - 1)))
202 | CI_qq_high.append(threshold + ((scale/shape)*(((1-((F2)/(1)))**-shape) - 1)))
203 |
204 | #Plotting QQ
205 | plt.figure(5)
206 | sns.regplot(quantiles, sample_over_thresh, ci = None, line_kws={'color':'black','label':'Regression Line'})
207 | plt.axis('square')
208 | plt.plot(sample_over_thresh, CI_qq_low, linestyle='--', color='red', alpha = 0.5, lw = 0.8, label = 'Kolmogorov-Smirnov Confidence Bands')
209 | plt.legend()
210 | plt.plot(sample_over_thresh, CI_qq_high, linestyle='--', color='red', alpha = 0.5, lw = 0.8)
211 | plt.xlabel('Theoretical GPD Quantiles')
212 | plt.ylabel('Sample Quantiles')
213 | plt.title('Q-Q Plot')
214 | plt.show()
215 |
216 | def ppplot(sample, threshold, fit_method, alpha): #probability-probability plot to diagnose the model
217 | [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method) #fit the data
218 | n = len(sample_over_thresh)
219 | #Getting empirical probabilities
220 | y = np.arange(1,n+1)/n
221 | #Getting theoretical probabilities
222 | cdf_pp = genpareto.cdf(sample_over_thresh, shape, loc=threshold, scale=scale)
223 |
224 | #Getting Confidence Intervals using the Dvoretzky–Kiefer–Wolfowitz method
225 | i_initial = 0
226 | n = len(sample)
227 | for i in range(0, n):
228 | if sample[i] > threshold + 0.0001:
229 | i_initial = i
230 | break
231 | F1 = []
232 | F2 = []
233 | for i in range(i_initial,len(sample)):
234 | e = (((mt.log(2/alpha))/(2*len(sample_over_thresh)))**0.5)
235 | F1.append(y[i-i_initial] - e)
236 | F2.append(y[i-i_initial] + e)
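# The half-width e above follows from the Dvoretzky-Kiefer-Wolfowitz inequality,
# P(sup_x |F_n(x) - F(x)| > e) <= 2*exp(-2*n*e**2): setting the right-hand side
# equal to alpha and solving gives e = sqrt(ln(2/alpha)/(2*n)), with n the number
# of observations over the threshold.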
237 |
238 | #Plotting PP
239 | plt.figure(6)
240 | sns.regplot(y, cdf_pp, ci = None, line_kws={'color':'black', 'label':'Regression Line'})
241 | plt.plot(y, F1, linestyle='--', color='red', alpha = 0.5, lw = 0.8, label = 'Dvoretzky–Kiefer–Wolfowitz Confidence Bands')
242 | plt.plot(y, F2, linestyle='--', color='red', alpha = 0.5, lw = 0.8)
243 | plt.legend()
244 | plt.title('P-P Plot')
245 | plt.xlabel('Empirical Probability')
246 | plt.ylabel('Theoretical Probability')
247 | plt.show()
248 |
249 | def gpdcdf(sample, threshold, fit_method, alpha): #plot gpd cdf with empirical points
250 | [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method) #fit the data
251 |
252 | n = len(sample_over_thresh)
253 | y = np.arange(1,n+1)/n #empirical probabilities
254 |
255 | i_initial = 0
256 | n = len(sample)
257 | for i in range(0, n):
258 | if sample[i] > threshold + 0.0001:
259 | i_initial = i
260 | break
261 |
262 | #Computing confidence interval with the Dvoretzky–Kiefer–Wolfowitz method based on the empirical points
263 | F1 = []
264 | F2 = []
265 | for i in range(i_initial,len(sample)):
266 | e = (((mt.log(2/alpha))/(2*len(sample_over_thresh)))**0.5)
267 | F1.append(y[i-i_initial] - e)
268 | F2.append(y[i-i_initial] + e)
269 |
270 | x_points = np.arange(0, max(sample), 0.001) #generating points to apply in the cdf
271 | cdf = genpareto.cdf(x_points, shape, loc=threshold, scale=scale) #getting theoretical cdf
272 |
273 | #Plotting cdf
274 | plt.figure(7)
275 | plt.plot(x_points, cdf, color = 'black', label='Theoretical CDF')
276 | plt.xlabel('Data')
277 | plt.ylabel('CDF')
278 | plt.title('Data Cumulative Distribution Function')
279 | plt.scatter(sorted(sample_over_thresh), y, label='Empirical CDF')
280 | plt.plot(sorted(sample_over_thresh), F1, linestyle='--', color='red', alpha = 0.8, lw = 0.9, label = 'Dvoretzky–Kiefer–Wolfowitz Confidence Bands')
281 | plt.plot(sorted(sample_over_thresh), F2, linestyle='--', color='red', alpha = 0.8, lw = 0.9)
282 | plt.legend()
283 | plt.show()
284 |
285 | def return_value(sample_real, threshold, alpha, block_size, return_period, fit_method): #return value plot and return value estimate
286 | sample = np.sort(sample_real)
287 | sample_excess = []
288 | sample_over_thresh = []
289 | for data in sample:
290 | if data > threshold+0.00001:
291 | sample_excess.append(data - threshold)
292 | sample_over_thresh.append(data)
293 |
294 | rdata = FloatVector(sample)
295 | fit = POT.fitgpd(rdata, threshold, est = fit_method) #fit data
296 | shape = fit[0][1]
297 | scale = fit[0][0]
298 |
299 | #Computing the return value for a given return period with the confidence interval estimated by the Delta Method
300 | m = return_period
301 | Eu = len(sample_over_thresh)/len(sample)
302 | x_m = threshold + (scale/shape)*(((m*Eu)**shape) - 1)
303 |
304 | #Solving the delta method
305 | d = Eu*(1-Eu)/len(sample)
306 | e = fit[3][0]
307 | f = fit[3][1]
308 | g = fit[3][2]
309 | h = fit[3][3]
310 | a = (scale*(m**shape))*(Eu**(shape-1))
311 | b = (shape**-1)*(((m*Eu)**shape) - 1)
312 | c = (-scale*(shape**-2))*((m*Eu)**shape - 1) + (scale*(shape**-1))*((m*Eu)**shape)*mt.log(m*Eu)
313 | CI = (norm.ppf(1-(alpha/2))*((((a**2)*d) + (b*((c*g) + (e*b))) + (c*((b*f) + (c*h))))**0.5))
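# The confidence half-width above is the delta method written out term by term:
# for x_m = u + (scale/shape)*((m*Eu)**shape - 1), the gradient with respect to
# (Eu, scale, shape) is (a, b, c) as computed above, d = Var(Eu), and (e, f, g, h)
# are the entries of the (scale, shape) covariance matrix fit[3], so the quadratic
# form expands to Var(x_m) ~ a**2*d + b**2*e + b*c*(f + g) + c**2*h and
# CI = z_(1-alpha/2)*sqrt(Var(x_m)).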
314 |
315 | print('The return value for the given return period is {} \u00B1 {}'.format(x_m, CI))
316 |
317 |
318 | ny = block_size #defining how many observations make up one block (usually a year)
319 | N_year = return_period/block_size #N_year represents the number of years based on the given return_period
320 |
321 | for i in range(0, len(sample)):
322 | if sample[i] > threshold + 0.0001:
323 | i_initial = i
324 | break
325 |
326 | p = np.arange(i_initial,len(sample))/(len(sample)) #Getting Plotting Position points
327 | N = 1/(ny*(1 - p)) #transforming plotting position points to years
328 |
329 | year_array = np.arange(min(N), N_year+0.1, 0.1) #defining a year array
330 |
331 | #Algorithm to compute the return value and the confidence intervals for plotting
332 | z_N = []
333 | CI_z_N_high_year = []
334 | CI_z_N_low_year = []
335 | for year in year_array:
336 | z_N.append(threshold + (scale/shape)*(((year*ny*Eu)**shape) - 1))
337 | a = (scale*((year*ny)**shape))*(Eu**(shape-1))
338 | b = (shape**-1)*((((year*ny)*Eu)**shape) - 1)
339 | c = (-scale*(shape**-2))*(((year*ny)*Eu)**shape - 1) + (scale*(shape**-1))*(((year*ny)*Eu)**shape)*mt.log((year*ny)*Eu)
340 | CIyear = (norm.ppf(1-(alpha/2))*((((a**2)*d) + (b*((c*g) + (e*b))) + (c*((b*f) + (c*h))))**0.5))
341 | CI_z_N_high_year.append(threshold + (scale/shape)*(((year*ny*Eu)**shape) - 1) + CIyear)
342 | CI_z_N_low_year.append(threshold + (scale/shape)*(((year*ny*Eu)**shape) - 1) - CIyear)
343 |
344 | #Plotting Return Value
345 | plt.figure(8)
346 | plt.plot(year_array, CI_z_N_high_year, linestyle='--', color='red', alpha = 0.8, lw = 0.9, label = 'Confidence Bands')
347 | plt.plot(year_array, CI_z_N_low_year, linestyle='--', color='red', alpha = 0.8, lw = 0.9)
348 | plt.plot(year_array, z_N, color = 'black', label = 'Theoretical Return Level')
349 | plt.scatter(N, sample_over_thresh, label = 'Empirical Return Level')
350 | plt.xscale('log')
351 | plt.xlabel('Return Period')
352 | plt.ylabel('Return Level')
353 | plt.title('Return Level Plot')
354 | plt.legend()
355 |
356 | plt.show()
357 | return (x_m, CI)
358 | def survival_function(sample, threshold, fit_method, alpha): #Plot the survival function, (1 - cdf)
359 | [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method)
360 |
361 | n = len(sample_over_thresh)
362 | y_surv = 1 - np.arange(1,n+1)/n
363 |
364 | i_initial = 0
365 |
366 | n = len(sample)
367 | for i in range(0, n):
368 | if sample[i] > threshold + 0.0001:
369 | i_initial = i
370 | break
371 | #Computing confidence interval with the Dvoretzky–Kiefer–Wolfowitz
372 | F1 = []
373 | F2 = []
374 | for i in range(i_initial,len(sample)):
375 | e = (((mt.log(2/alpha))/(2*len(sample_over_thresh)))**0.5)
376 | F1.append(y_surv[i-i_initial] - e)
377 | F2.append(y_surv[i-i_initial] + e)
378 |
379 | x_points = np.arange(0, max(sample), 0.001)
380 | surv_func = 1 - genpareto.cdf(x_points, shape, loc=threshold, scale=scale)
381 |
382 | #Plotting survival function
383 | plt.figure(9)
384 | plt.plot(x_points, surv_func, color = 'black', label='Theoretical Survival Function')
385 | plt.xlabel('Data')
386 | plt.ylabel('Survival Function')
387 | plt.title('Data Survival Function Plot')
388 | plt.scatter(sorted(sample_over_thresh), y_surv, label='Empirical Survival Function')
389 | plt.plot(sorted(sample_over_thresh), F1, linestyle='--', color='red', alpha = 0.8, lw = 0.9, label = 'Dvoretzky–Kiefer–Wolfowitz Confidence Bands')
390 | plt.plot(sorted(sample_over_thresh), F2, linestyle='--', color='red', alpha = 0.8, lw = 0.9)
391 | plt.legend()
392 | plt.show()
393 |
394 | def non_central_moments(sample, threshold, fit_method): #Getting non-central moments using scipy's genpareto distribution
395 | [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method)
396 | [Mean, Variance, Skewness, Kurtosis]= genpareto.stats(shape, threshold, scale, moments = 'mvsk')
397 | print('Non-Central Moments estimated from the distribution:\nMean: {} \nVariance: {} \nSkewness: {} \nKurtosis: {} \n'.format(Mean, Variance, Skewness, Kurtosis))
398 | return (Mean, Variance, Skewness, Kurtosis)
399 |
400 | def lmom_dist(sample, threshold, fit_method): #Getting the l-moments from the distribution
401 | [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method)
402 | t_1 = threshold + scale*(1+shape)
403 | t_2 = scale/((1+shape)*(2+shape))
404 | t_3 = (1 - shape)/(3 + shape)
405 | t_4 = ((1 - shape)*(2 - shape))/((3 + shape)*(4 + shape))
406 | print('L-Moments estimated from the distribution:\nL-Mean: {} \nL-Variance: {} \nL-Skewness: {} \nL-Kurtosis: {} \n'.format(t_1, t_2, t_3, t_4))
407 | return (t_1, t_2, t_3, t_4)
408 |
409 | def lmom_sample(sample): #Algorithm to compute the first four L-moments from the sample
410 | sample = np.sort(sample)
411 | n = len(sample)
412 |
413 | #first moment
414 | l1 = np.sum(sample) / sm.comb(n, 1, exact=True)
415 |
416 | #second moment
417 | comb1 = range(n)
418 | coefl2 = 0.5 / sm.comb(n, 2, exact=True)
419 | sum_xtrans = sum([(comb1[i] - comb1[n - i - 1]) * sample[i] for i in range(n)])
420 | l2 = coefl2 * sum_xtrans
421 |
422 | #third moment
423 | comb3 = [sm.comb(i, 2, exact=True) for i in range(n)]
424 | coefl3 = 1.0 / 3.0 / sm.comb(n, 3, exact=True)
425 | sum_xtrans = sum([(comb3[i] - 2 * comb1[i] * comb1[n - i - 1] + comb3[n - i - 1]) * sample[i] for i in range(n)])
426 | l3 = coefl3 * sum_xtrans / l2
427 |
428 | #fourth moment
429 | comb5 = [sm.comb(i, 3, exact=True) for i in range(n)]
430 | coefl4 = 0.25 / sm.comb(n, 4, exact=True)
431 | sum_xtrans = sum(
432 | [(comb5[i] - 3 * comb3[i] * comb1[n - i - 1] + 3 * comb1[i] * comb3[n - i - 1] - comb5[n - i - 1]) * sample[i]
433 | for i in range(n)])
434 | l4 = coefl4 * sum_xtrans / l2
435 |
436 | print('L-Moments estimated from the sample:\nL-Mean: {} \nL-Variance: {} \nL-Skewness: {} \nL-Kurtosis: {} \n'.format(l1, l2, l3, l4))
437 |
438 | return(l1, l2, l3, l4)
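# A note on the quantities returned above: l1 and l2 are the direct (unbiased)
# sample L-moment estimators built from order statistics and binomial coefficients,
# while the third and fourth values are divided by l2, so they are the L-moment
# ratios t3 = l3/l2 (L-skewness) and t4 = l4/l2 (L-kurtosis); these ratios are what
# lmomplot compares against the theoretical GPD curve.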
439 |
440 | def lmomplot(sample, threshold): #Plotting the empirical L-skewness and L-kurtosis against the theoretical curve
441 | #to diagnose the choice of the threshold u.
442 | def lmom_sample2(sample):
443 | sample = np.sort(sample)
444 | n = len(sample)
445 |
446 | #first moment
447 | l1 = np.sum(sample) / sm.comb(n, 1, exact=True)
448 |
449 | #second moment
450 | comb1 = range(n)
451 | coefl2 = 0.5 / sm.comb(n, 2, exact=True)
452 | sum_xtrans = sum([(comb1[i] - comb1[n - i - 1]) * sample[i] for i in range(n)])
453 | l2 = coefl2 * sum_xtrans
454 |
455 | #third moment
456 | comb3 = [sm.comb(i, 2, exact=True) for i in range(n)]
457 | coefl3 = 1.0 / 3.0 / sm.comb(n, 3, exact=True)
458 | sum_xtrans = sum([(comb3[i] - 2 * comb1[i] * comb1[n - i - 1] + comb3[n - i - 1]) * sample[i] for i in range(n)])
459 | l3 = coefl3 * sum_xtrans / l2
460 |
461 | #fourth moment
462 | comb5 = [sm.comb(i, 3, exact=True) for i in range(n)]
463 | coefl4 = 0.25 / sm.comb(n, 4, exact=True)
464 | sum_xtrans = sum(
465 | [(comb5[i] - 3 * comb3[i] * comb1[n - i - 1] + 3 * comb1[i] * comb3[n - i - 1] - comb5[n - i - 1]) * sample[i]
466 | for i in range(n)])
467 | l4 = coefl4 * sum_xtrans / l2
468 | return(l1, l2, l3, l4)
469 |
470 | threshold_array = np.arange(0, threshold + (threshold/3), 0.5) #defining a threshold array to compute the
471 | #different l-moments from the sample
472 | sample = np.sort(sample)
473 | skewness_sample = []
474 | kurtosis_sample =[]
475 | #Algorithm to compute the l-moments for each threshold
476 | for u in threshold_array:
477 | sample_over_thresh = []
478 | for data in sample:
479 | if data > u+0.00001:
480 | sample_over_thresh.append(data)
481 | [l1, l2, l3, l4] = lmom_sample2(sample_over_thresh)
482 | skewness_sample.append(l3)
483 | kurtosis_sample.append(l4)
484 |
485 | skewness_theo = np.arange(0,1+0.1,0.1) #defining theoretical l-skewness
486 | kurtosis_theo = (skewness_theo*(1 + 5*skewness_theo))/(5 + skewness_theo) #theoretical kurtosis of the gpd
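# The curve above is the known L-moment ratio relation for the GPD,
#     tau4 = tau3*(1 + 5*tau3)/(5 + tau3),
# so the empirical (L-skewness, L-kurtosis) pairs should fall close to it when the
# exceedances over u are well modeled by a GPD.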
487 |
488 | #Plotting l-moments
489 | plt.figure(10)
490 | plt.scatter(skewness_sample, kurtosis_sample, label = 'Empirical')
491 | plt.plot(skewness_theo, kurtosis_theo, color = 'black', label = 'Theoretical')
492 | plt.legend()
493 | plt.xlabel('L-Skewness')
494 | plt.ylabel('L-Kurtosis')
495 | plt.title('L-Moments Plot')
496 | plt.show()
497 |
498 | def decluster(sample, threshold, block_size): #function to decluster the dataset into period blocks
499 | period_unit = np.arange(1, len(sample)+1, 1) #period array
500 | threshold_array = np.ones(len(sample))*threshold
501 | nob = int(len(sample)/block_size) #number of blocks
502 | clust = np.zeros((nob, block_size)) #initialization of the cluster matrix (rows: cluster; columns: observations)
503 | #Algorithm to cluster
504 | k = 0
505 | for i in range(0, nob):
506 | for j in range(0, block_size):
507 | clust[i][j] = sample[j+k]
508 | k = j + k + 1
509 |
510 | block_max = np.amax(clust, 1) #getting max of each block and declustering
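# Illustration (hypothetical data, not part of the original code): for
# sample = [1, 1.5, 0.5, 3, 4.5, 2, 8.5, 7, 6, 12.5, 10, 11] and block_size = 3,
# clust holds the rows [1, 1.5, 0.5], [3, 4.5, 2], [8.5, 7, 6], [12.5, 10, 11]
# and block_max = [1.5, 4.5, 8.5, 12.5], i.e. one maximum per period block.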
511 |
512 | period_unit_block = np.arange(0, len(block_max), 1) #array of period for each block
513 | threshold_block_array = np.ones(len(block_max))*threshold
514 |
515 | #Plot real dataset
516 | plt.figure(11)
517 | plt.scatter(period_unit, sample)
518 | plt.plot(period_unit, threshold_array, label = 'Threshold', color = 'red')
519 | plt.legend()
520 | plt.xlabel('Period Unit')
521 | plt.ylabel('Data')
522 | plt.title('Sample dataset per Period Unit')
523 |
524 | #Plot declustered data
525 | plt.figure(12)
526 | plt.scatter(period_unit_block, block_max)
527 | plt.plot(period_unit_block, threshold_block_array, label = 'Threshold', color = 'red')
528 | plt.legend()
529 | plt.xlabel('Period Unit')
530 | plt.ylabel('Declustered Data')
531 | plt.title('Declustered dataset per Period Unit')
532 | plt.show()
533 | return block_max
534 | def entropy(sample, b, threshold, fit_method): #Get the differential entropy of the distribution (the log base b is currently unused; base e, i.e. nats, is assumed)
535 | [shape, scale, sample, sample_excess, sample_over_thresh] = gpdfit(sample, threshold, fit_method)
536 | h = mt.log(scale) + shape + 1
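# The closed form above comes from integrating -f(x)*ln(f(x)) over the GPD support,
# which gives h = ln(scale) + shape + 1 in nats; scipy's genpareto.entropy
# evaluates the same expression.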
537 | print('The differential entropy is {} nats.'.format(h))
538 | return h
539 |
--------------------------------------------------------------------------------