Merge pull request #7 from arfogg/code_tidying

Code tidying
arfogg · Aug 13, 2024 · 7dafc93 · 7dafc93
2 parents c82b838 + d2c9e7c
commit 7dafc93
Show file tree

Hide file tree

Showing 14 changed files with 1,713 additions and 927 deletions.
diff --git a/README.md b/README.md
@@ -11,26 +11,13 @@
 [![Issues](https://img.shields.io/github/issues/arfogg/bi_multi_variate_eva.svg)](https://github.com/arfogg/bi_multi_variate_eva/issues)
 ![Contributions welcome](https://img.shields.io/badge/contributions-welcome-orange.svg)
 
-
-
-
-Code to run bivariate and multivariate extreme value analysis on generic data - work in progress.
+Python package to run bivariate and multivariate extreme value analysis on generic data - work in progress.
 
 **License:** CC0-1.0
 
 **Support:** please [create an issue](https://github.com/arfogg/bi_multi_variate_eva/issues) or contact [arfogg](https://github.com/arfogg) directly. Any input on the code / issues found are greatly appreciated and will help to improve the software.
 
 ## Ongoing tasks
-Code Improvement
-- [ ] Make an overarching function to run all commands
-- [ ] Make code PEP 8 compliant
-- [ ] Make code importable in one statement e.g. import "bi_multi_variate_eva as BiEVA"
-- [x] Create fit_params class
-
-Organisation
-- [ ] Make subfolders for Bi- and Multi- EVA
-- [ ] Create requirements.txt with packages and versions
-- [ ] Note requirements.txt in README
 - [ ] Include Dáire acknowledgement statement
 
 ## Table of Contents
@@ -42,26 +29,38 @@ Organisation
    * [(3) Extract extrema](#3-extract-extrema) 
    * [(4) Fit a model to the extrema](#4-fit-a-model-to-the-extrema) 
    * [(5) Transform extrema data to uniform margins](#5-transform-extrema-data-to-uniform-margins) 
-   * [(6) Fit a copula to both sets of extrema](#6-fit-a-copula-to-both-sets-of-extrema) 
-   * [(7) Take a sample from the copula](#7-take-a-sample-from-the-copula) 
-   * [(8) Plot diagnostic to assess copula fit](#8-plot-diagnostic-to-assess-copula-fit) 
-   * [(9) Plot return period as a function of two variables](#9-plot-return-period-as-a-function-of-two-variables) 
+   * [(6) Bootstrapping the extrema](#6-bootstrapping-the-extrema)
+   * [(7) Fit a copula to both sets of extrema](#7-fit-a-copula-to-both-sets-of-extrema) 
+   * [(8) Take a sample from the copula](#8-take-a-sample-from-the-copula) 
+   * [(9) Plot diagnostic to assess copula fit](#9-plot-diagnostic-to-assess-copula-fit) 
+   * [(10) Plot return period as a function of two variables](#10-plot-return-period-as-a-function-of-two-variables) 
 - [Multivariate Analysis](#multivariate-analysis)
 - [Acknowledgements](#acknowledgements)
 
 ## Required Packages
 
 scipy, numpy, matplotlib, pandas, copulas
 
-Install copulas using pip or see documentation [here](https://pypi.org/project/copulas/)
+See [environment.yml](environment.yml) for details.
 
+Install copulas using pip or see documentation [here](https://pypi.org/project/copulas/).
 
-## Installing the code
+## Using the code
 
-First, the code must be downloaded using `git clone https://github.com/arfogg/bi_multi_variate_eva`
+Download the code:
+```python
+git clone https://github.com/arfogg/bi_multi_variate_eva
+```
+
+Importing:
+```python
+from bi_multi_variate_eva import *
+```
 
 ## Bivariate Analysis
 
+An example walkthrough of running the Bivariate Extreme Value Analysis on two variables x and y. For theory, a recommended text is: Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. Springer.
+
 #### (1) Getting your data ready
 
 Make sure there are no datagaps in your timeseries. You can either remove the rows, or interpolate. This requires an data-expert user decision so is **not** included in this package. You must do this before using the package. For some functions, the data is parsed as a pandas.DataFrame.
@@ -71,93 +70,132 @@ Make sure there are no datagaps in your timeseries. You can either remove the ro
 The function `plot_extremal_dependence_coefficient` within `determine_AD_AI` creates a diagnostic plot to examine asymptotic dependence/independence.
 
 For example:
+
 ```python
-determine_AD_AI.plot_extremal_dependence_coefficient(x, y, "X", "Y", "(units)", "(units)")
+fig, ax_data, ax_data_unif, ax_edc, chi, chi_lower_q, chi_upper_q = \
+    determine_AD_AI.plot_extremal_dependence_coefficient(x, y, x_bs_um, y_bs_um, n_bootstrap,
+                                                         "X", "Y", "(units)", "(units)")
 ```
 
+Timeseries (`np.array` or `pd.Series`) of x and y are parsed, with their bootstraps (transformed to uniform margins), number of bootstraps and strings for plot labels.
+
 #### (3) Extract extrema
 
-Extract extremes for both X and Y using `detect_extremes.find_block_maxima`. Analysis on points above threshold maxima yet to be implemented.
+Extract extremes for both X and Y using `detect_extremes.find_joint_block_maxima`. Analysis on points above threshold maxima yet to be implemented.
 
 For example:
 ```python
-x_extremes_df=detect_extremes.find_block_maxima(df,'x',df_time_tag='datetime',block_size=block_size,extremes_type='high')
-y_extremes_df=detect_extremes.find_block_maxima(df,'y',df_time_tag='datetime',block_size=block_size,extremes_type='high')
+empty_blocks, x_extreme_t, x_extreme, y_extreme_t, y_extreme = \
+            detect_extremes.find_joint_block_maxima(data_df, 'x', 'y')
+
+x_extremes_df = pd.DataFrame({'datetime':x_extreme_t, 'extreme':x_extreme})
+y_extremes_df = pd.DataFrame({'datetime':y_extreme_t, 'extreme':y_extreme})    
 ```
 
+A dataframe of evenly sampled x and y are parsed, with their respective dataframe column names. These are transformed to individual parameter DataFrames.
+
+
 #### (4) Fit a model to the extrema
 
-Fit a GEVD or Gumbel distribution to both sets of extrema (i.e. for x and y) using `fit_model_to_extremes.fit_gevd_or_gumbel`.
+Fit a GEVD or Gumbel distribution to both sets of extrema (i.e. for x and y) using `gevd_fitter` class.
 
 For example:
 ```python
-x_gevd_fit_params=fit_model_to_extremes.fit_gevd_or_gumbel(x_extremes_df, 'BM', 'high','extreme',df_time_tag='datetime',fitting_type='Emcee', block_size=block_size)
-y_gevd_fit_params=fit_model_to_extremes.fit_gevd_or_gumbel(y_extremes_df, 'BM', 'high','extreme',df_time_tag='datetime',fitting_type='Emcee', block_size=block_size)
+x_gevd_fit = gevd_fitter(x_extremes_df.extreme)
+y_gevd_fit = gevd_fitter(y_extremes_df.extreme)
 ```
 
+By initialising the `gevd_fitter` class, a GEVD or Gumbel model is fit to the extrema. Fitting information is stored in the object.
+
 #### (5) Transform extrema data to uniform margins
 
-Transform x and y extrema from data scale (as it looks on the instrument) to uniform margins empirically using `transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically` or using the cumulative distribution function with `transform_uniform_margins.transform_from_data_scale_to_uniform_margins_using_CDF`.
+Transform x and y extrema from data scale (as it looks on the instrument) to uniform margins. This happens within the `gevd_fitter` class.
+
+You can plot a diagnostic about the transformation of one of the variables using `transform_uniform_margins.plot_diagnostic`.
+
+#### (6) Bootstrapping the extrema
+
+To facilitate error calculation, we bootstrap the extrema. For each of the N bootstraps, a random selection of indices from between 0 to n_extrema-1 is chosen (where n_extrema is the number of extrema in each dataset). This set of indices is used to select points from both x and y. This ensures joint selection, so we retain the physical link between x and y.
 
 For example:
 ```python
-# Empirically
-x_extremes_unif_empirical=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically(x_extremes_df.extreme)
-y_extremes_unif_empirical=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically(y_extremes_df.extreme)
-# Using cumulative distribution function
-x_extremes_unif=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_using_CDF(x_extremes_df.extreme, x_gevd_fit_params,distribution=x_gevd_fit_params.distribution_name[0])
-y_extremes_unif=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_using_CDF(y_extremes_df.extreme, y_gevd_fit_params,distribution=y_gevd_fit_params.distribution_name[0])
+x_bootstrap = np.full((n_extrema, N), np.nan)
+y_bootstrap = np.full((n_extrema, N), np.nan)
+
+for i in range(N):
+    # Select indices to get bootstraps from
+    ind = np.random.choice(np.linspace(0, n_extrema-1, n_extrema), n_extrema)
+    x_bootstrap[:, i] = x_extremes_df.extreme.iloc[ind]
+    y_bootstrap[:, i] = y_extremes_df.extreme.iloc[ind]
 ```
 
-You can plot a diagnostic about the transformation of one of the variables using `transform_uniform_margins.plot_diagnostic`:
+By then using a `bootstrap_gevd_fit` object, GEVD or Gumbel fits are estimated for each bootstrap.
+
 ```python
-fig_um_x,ax_um_x=transform_uniform_margins.plot_diagnostic(x_extremes_df.extreme, x_extremes_unif_empirical, x_extremes_unif, x_gevd_fit_params, 'X')
+x_bs_gevd_fit = bootstrap_gevd_fit(x_bootstrap, x_gevd_fit)
+y_bs_gevd_fit = bootstrap_gevd_fit(y_bootstrap, y_gevd_fit)
 ```
 
-#### (6) Fit a copula to both sets of extrema
+#### (7) Fit a copula to both sets of extrema
 
 Fit a copula to x and y extrema using `fit_copula_to_extremes.fit_copula_bivariate`.
 
 For example:
 ```python
-copula=fit_copula_to_extremes.fit_copula_bivariate(x_extremes_unif, y_extremes_unif, 'X', 'Y')
+copula = fit_copula_to_extremes.fit_copula_bivariate(x_extremes_unif, y_extremes_unif, 'X', 'Y')
 ```
-
-#### (7) Take a sample from the copula
 
-Using your copula from (6), extract a sample, e.g.: `copula_sample=copula.sample(100)`.
+#### (8) Take a sample from the copula
+
+Using your copula from (6), extract a sample, e.g.: `copula_sample = copula.sample(100)`.
 
 Transform that sample back to data scale:
 ```python
-x_sample_in_data_scale=transform_uniform_margins.transform_from_uniform_margins_to_data_scale(copula_sample[:,0], x_gevd_fit_params)
-y_sample_in_data_scale=transform_uniform_margins.transform_from_uniform_margins_to_data_scale(copula_sample[:,0], y_gevd_fit_params)
+x_sample = transform_uniform_margins.\
+                transform_from_uniform_margins_to_data_scale(copula_sample[:, 0], x_gevd_fit)
+y_sample = transform_uniform_margins.\
+                transform_from_uniform_margins_to_data_scale(copula_sample[:, 0], y_gevd_fit)
 ```
 
-#### (8) Plot diagnostic to assess copula fit
+#### (9) Plot diagnostic to assess copula fit
 
 To plot histograms of the copula in data scale (with GEVD/Gumbel fitted to observed extrema overplotted) and on uniform margins, use `transform_uniform_margins.plot_copula_diagnostic`. 
 
 For example:
 ```python
-fig_copula_1d,ax_copula_1d=transform_uniform_margins.plot_copula_diagnostic(copula_sample[:,0], copula_sample[:,1], x_sample_in_data_scale, y_sample_in_data_scale, x_gevd_fit_params, y_gevd_fit_params, 'X', 'Y')
+fig_copula_1d, ax_copula_1d = transform_uniform_margins.\
+                                    plot_copula_diagnostic(copula_sample[:, 0], copula_sample[:, 1],
+                                                           x_sample, y_sample, x_gevd_fit, y_gevd_fit,
+                                                           'X', 'Y')
 ```
 
 Alternatively, to compare the 2D distributions of the observed extrema and copula sample, use `fit_copula_to_extremes.qualitative_copula_fit_check_bivariate`.
 
 For example:
 ```python
-fig_copula_2d,ax_copula_2d=fit_copula_to_extremes.qualitative_copula_fit_check_bivariate(x_extremes_df.extreme, y_extremes_df.extreme, x_sample_in_data_scale, y_sample_in_data_scale, 'X', 'Y')
+fig_copula_2d, ax_copula_2d = fit_copula_to_extremes.\
+                    qualitative_copula_fit_check_bivariate(x_extremes_df.extreme, y_extremes_df.extreme,
+                                                           x_sample, y_sample, 'X', 'Y')
 ```
 
-#### (9) Plot return period as a function of two variables
+#### (10) Plot return period as a function of two variables
 
 To plot the return period as a function of x and y, with standard contours.
 
 For example:
 ```python
-fig_return_period,ax_return_period=calculate_return_periods_values.plot_return_period_as_function_x_y(copula,np.nanmin(x_extremes_df.extreme),np.nanmax(x_extremes_df.extreme),np.nanmin(y_extremes_df.extreme),np.nanmax(y_extremes_df.extreme),'X','Y', x_gevd_fit_params, y_gevd_fit_params, 'X (units)', 'Y (units)', n_samples=1000,block_size=block_size)
+fig_rp, ax_rp = calculate_return_periods_values.\
+                        plot_return_period_as_function_x_y(copula,
+                                                           np.nanmin(x_extremes_df.extreme),
+                                                           np.nanmax(x_extremes_df.extreme),
+                                                           np.nanmin(y_extremes_df.extreme),
+                                                           np.nanmax(y_extremes_df.extreme),
+                                                           'X', 'Y', 'X (units)', 'Y (units)',
+                                                           bs_copula_arr, N)
 ```
 
+Where bs_copula_arr is a list of copulae fit to each bootstrap, which is used to calculate confidence intervals.
+
 ## Multivariate Analysis
 
 To be completed

diff --git a/__init__.py b/__init__.py
@@ -3,8 +3,11 @@
 Created on Tue Aug  6 11:36:17 2024
 
 @author: A R Fogg
+
+Initialisation file which imports code.
 """
 
+# Import package code
 from .src.bivariate import bootstrap_data
 from .src.bivariate import calculate_return_periods_values
 from .src.bivariate import detect_extremes

diff --git a/environment.yml b/environment.yml
@@ -0,0 +1,9 @@
+dependencies:
+  - python=3.8.8
+  - pip
+  - scipy=1.10.1
+  - numpy=1.23.5
+  - matplotlib=3.7.5
+  - pandas=1.5.2
+  - pip:
+    - copulas=0.9.0