diff --git a/README.md b/README.md
index bd3a707..f05de50 100644
--- a/README.md
+++ b/README.md
@@ -11,26 +11,13 @@
 [![Issues](https://img.shields.io/github/issues/arfogg/bi_multi_variate_eva.svg)](https://github.com/arfogg/bi_multi_variate_eva/issues)
 ![Contributions welcome](https://img.shields.io/badge/contributions-welcome-orange.svg)
-
-
-
-Code to run bivariate and multivariate extreme value analysis on generic data - work in progress.
+Python package to run bivariate and multivariate extreme value analysis on generic data - work in progress.

**License:** CC0-1.0

**Support:** please [create an issue](https://github.com/arfogg/bi_multi_variate_eva/issues) or contact [arfogg](https://github.com/arfogg) directly. Any input on the code or issues found is greatly appreciated and will help to improve the software.

## Ongoing tasks

-Code Improvement
-- [ ] Make an overarching function to run all commands
-- [ ] Make code PEP 8 compliant
-- [ ] Make code importable in one statement e.g. import "bi_multi_variate_eva as BiEVA"
-- [x] Create fit_params class
-
-Organisation
-- [ ] Make subfolders for Bi- and Multi- EVA
-- [ ] Create requirements.txt with packages and versions
-- [ ] Note requirements.txt in README
 - [ ] Include Dáire acknowledgement statement

## Table of Contents
@@ -42,10 +29,11 @@ Organisation
   * [(3) Extract extrema](#3-extract-extrema)
   * [(4) Fit a model to the extrema](#4-fit-a-model-to-the-extrema)
   * [(5) Transform extrema data to uniform margins](#5-transform-extrema-data-to-uniform-margins)
-   * [(6) Fit a copula to both sets of extrema](#6-fit-a-copula-to-both-sets-of-extrema)
-   * [(7) Take a sample from the copula](#7-take-a-sample-from-the-copula)
-   * [(8) Plot diagnostic to assess copula fit](#8-plot-diagnostic-to-assess-copula-fit)
-   * [(9) Plot return period as a function of two variables](#9-plot-return-period-as-a-function-of-two-variables)
+   * [(6) Bootstrapping the extrema](#6-bootstrapping-the-extrema)
+   * [(7) Fit a copula to both sets of extrema](#7-fit-a-copula-to-both-sets-of-extrema)
+   * [(8) Take a sample from the copula](#8-take-a-sample-from-the-copula)
+   * [(9) Plot diagnostic to assess copula fit](#9-plot-diagnostic-to-assess-copula-fit)
+   * [(10) Plot return period as a function of two variables](#10-plot-return-period-as-a-function-of-two-variables)
- [Multivariate Analysis](#multivariate-analysis)
- [Acknowledgements](#acknowledgements)
@@ -53,15 +41,26 @@ Organisation

scipy, numpy, matplotlib, pandas, copulas

-Install copulas using pip or see documentation [here](https://pypi.org/project/copulas/)
+See [environment.yml](environment.yml) for details.
+Install copulas using pip or see documentation [here](https://pypi.org/project/copulas/).

-## Installing the code
+## Using the code

-First, the code must be downloaded using `git clone https://github.com/arfogg/bi_multi_variate_eva`
+Download the code:
+```bash
+git clone https://github.com/arfogg/bi_multi_variate_eva
+```
+
+Importing:
+```python
+from bi_multi_variate_eva import *
+```

## Bivariate Analysis

+Below is an example walkthrough, running the Bivariate Extreme Value Analysis on two variables x and y. For theory, a recommended text is: Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. Springer.
+
#### (1) Getting your data ready

Make sure there are no data gaps in your timeseries. You can either remove the rows, or interpolate. This requires a data-expert user decision so is **not** included in this package. You must do this before using the package.
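+For example, a minimal sketch of both options using pandas (here `df`, its `datetime` column, and the column names `x` and `y` are illustrative, not part of the package):
+```python
+import pandas as pd
+
+# Option 1: remove rows with a gap in either variable
+df = df.dropna(subset=['x', 'y'])
+
+# Option 2: interpolate across gaps in time
+df = df.set_index('datetime').interpolate(method='time').reset_index()
+```
+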
For some functions, the data is parsed as a pandas.DataFrame.
@@ -71,93 +70,132 @@ Make sure there are no datagaps in your timeseries. You can either remove the ro

The function `plot_extremal_dependence_coefficient` within `determine_AD_AI` creates a diagnostic plot to examine asymptotic dependence/independence. For example:
+
```python
-determine_AD_AI.plot_extremal_dependence_coefficient(x, y, "X", "Y", "(units)", "(units)")
+fig, ax_data, ax_data_unif, ax_edc, chi, chi_lower_q, chi_upper_q = \
+    determine_AD_AI.plot_extremal_dependence_coefficient(x, y, x_bs_um, y_bs_um, n_bootstrap,
+                                                         "X", "Y", "(units)", "(units)")
```
+Timeseries (`np.array` or `pd.Series`) of x and y are parsed, with their bootstraps (transformed to uniform margins), the number of bootstraps, and strings for plot labels.
+
#### (3) Extract extrema

-Extract extremes for both X and Y using `detect_extremes.find_block_maxima`. Analysis on points above threshold maxima yet to be implemented.
+Extract extremes for both X and Y using `detect_extremes.find_joint_block_maxima`. Analysis of points above a threshold is yet to be implemented.

For example:
```python
-x_extremes_df=detect_extremes.find_block_maxima(df,'x',df_time_tag='datetime',block_size=block_size,extremes_type='high')
-y_extremes_df=detect_extremes.find_block_maxima(df,'y',df_time_tag='datetime',block_size=block_size,extremes_type='high')
+empty_blocks, x_extreme_t, x_extreme, y_extreme_t, y_extreme = \
+    detect_extremes.find_joint_block_maxima(data_df, 'x', 'y')
+
+x_extremes_df = pd.DataFrame({'datetime':x_extreme_t, 'extreme':x_extreme})
+y_extremes_df = pd.DataFrame({'datetime':y_extreme_t, 'extreme':y_extreme})
```
+A dataframe of evenly sampled x and y is parsed, with the respective dataframe column names. The outputs are then arranged into individual parameter DataFrames.
+
+
#### (4) Fit a model to the extrema

-Fit a GEVD or Gumbel distribution to both sets of extrema (i.e. for x and y) using `fit_model_to_extremes.fit_gevd_or_gumbel`.
+Fit a GEVD or Gumbel distribution to both sets of extrema (i.e. for x and y) using the `gevd_fitter` class.

For example:
```python
-x_gevd_fit_params=fit_model_to_extremes.fit_gevd_or_gumbel(x_extremes_df, 'BM', 'high','extreme',df_time_tag='datetime',fitting_type='Emcee', block_size=block_size)
-y_gevd_fit_params=fit_model_to_extremes.fit_gevd_or_gumbel(y_extremes_df, 'BM', 'high','extreme',df_time_tag='datetime',fitting_type='Emcee', block_size=block_size)
+x_gevd_fit = gevd_fitter(x_extremes_df.extreme)
+y_gevd_fit = gevd_fitter(y_extremes_df.extreme)
```
+By initialising the `gevd_fitter` class, a GEVD or Gumbel model is fit to the extrema. Fitting information is stored in the object.
+
#### (5) Transform extrema data to uniform margins

-Transform x and y extrema from data scale (as it looks on the instrument) to uniform margins empirically using `transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically` or using the cumulative distribution function with `transform_uniform_margins.transform_from_data_scale_to_uniform_margins_using_CDF`.
+Transform x and y extrema from data scale (as it looks on the instrument) to uniform margins. This happens within the `gevd_fitter` class.
+
+You can plot a diagnostic about the transformation of one of the variables using `transform_uniform_margins.plot_diagnostic`.
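+A minimal sketch (the `gevd_fitter` attribute names are those documented in `gevd_fitter.py`; the exact `plot_diagnostic` call signature is assumed here and may differ):
+```python
+fig_um_x, ax_um_x = transform_uniform_margins.\
+    plot_diagnostic(x_gevd_fit.extremes, x_gevd_fit.extremes_unif_empirical,
+                    x_gevd_fit.extremes_unif_CDF, x_gevd_fit, 'X')
+```
+
+#### (6) Bootstrapping the extrema
+
+To facilitate error calculation, we bootstrap the extrema.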
For each of the N bootstraps, a random selection of indices from 0 to n_extrema-1 is chosen (where n_extrema is the number of extrema in each dataset). This set of indices is used to select points from both x and y. This ensures joint selection, so we retain the physical link between x and y.

For example:
```python
-# Empirically
-x_extremes_unif_empirical=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically(x_extremes_df.extreme)
-y_extremes_unif_empirical=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically(y_extremes_df.extreme)
-# Using cumulative distribution function
-x_extremes_unif=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_using_CDF(x_extremes_df.extreme, x_gevd_fit_params,distribution=x_gevd_fit_params.distribution_name[0])
-y_extremes_unif=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_using_CDF(y_extremes_df.extreme, y_gevd_fit_params,distribution=y_gevd_fit_params.distribution_name[0])
+x_bootstrap = np.full((n_extrema, N), np.nan)
+y_bootstrap = np.full((n_extrema, N), np.nan)
+
+for i in range(N):
+    # Select indices to get bootstraps from
+    ind = np.random.choice(n_extrema, size=n_extrema)
+    x_bootstrap[:, i] = x_extremes_df.extreme.iloc[ind]
+    y_bootstrap[:, i] = y_extremes_df.extreme.iloc[ind]
```

-You can plot a diagnostic about the transformation of one of the variables using `transform_uniform_margins.plot_diagnostic`:
+By then using a `bootstrap_gevd_fit` object, GEVD or Gumbel fits are estimated for each bootstrap.
+
```python
-fig_um_x,ax_um_x=transform_uniform_margins.plot_diagnostic(x_extremes_df.extreme, x_extremes_unif_empirical, x_extremes_unif, x_gevd_fit_params, 'X')
+x_bs_gevd_fit = bootstrap_gevd_fit(x_bootstrap, x_gevd_fit)
+y_bs_gevd_fit = bootstrap_gevd_fit(y_bootstrap, y_gevd_fit)
```

-#### (6) Fit a copula to both sets of extrema
+#### (7) Fit a copula to both sets of extrema

Fit a copula to x and y extrema using `fit_copula_to_extremes.fit_copula_bivariate`. For example:
```python
-copula=fit_copula_to_extremes.fit_copula_bivariate(x_extremes_unif, y_extremes_unif, 'X', 'Y')
+copula = fit_copula_to_extremes.fit_copula_bivariate(x_extremes_unif, y_extremes_unif, 'X', 'Y')
```
+Here `x_extremes_unif` and `y_extremes_unif` are the extrema on uniform margins, e.g. the `extremes_unif_CDF` attribute of each `gevd_fitter`.
-
-#### (7) Take a sample from the copula
-Using your copula from (6), extract a sample, e.g.: `copula_sample=copula.sample(100)`.
+#### (8) Take a sample from the copula
+
+Using your copula from (7), extract a sample, e.g.: `copula_sample = copula.sample(100)`.

Transform that sample back to data scale:
```python
-x_sample_in_data_scale=transform_uniform_margins.transform_from_uniform_margins_to_data_scale(copula_sample[:,0], x_gevd_fit_params)
-y_sample_in_data_scale=transform_uniform_margins.transform_from_uniform_margins_to_data_scale(copula_sample[:,0], y_gevd_fit_params)
+x_sample = transform_uniform_margins.\
+    transform_from_uniform_margins_to_data_scale(copula_sample[:, 0], x_gevd_fit)
+y_sample = transform_uniform_margins.\
+    transform_from_uniform_margins_to_data_scale(copula_sample[:, 1], y_gevd_fit)
```

-#### (8) Plot diagnostic to assess copula fit
+#### (9) Plot diagnostic to assess copula fit

To plot histograms of the copula in data scale (with GEVD/Gumbel fitted to observed extrema overplotted) and on uniform margins, use `transform_uniform_margins.plot_copula_diagnostic`.
For example:
```python
-fig_copula_1d,ax_copula_1d=transform_uniform_margins.plot_copula_diagnostic(copula_sample[:,0], copula_sample[:,1], x_sample_in_data_scale, y_sample_in_data_scale, x_gevd_fit_params, y_gevd_fit_params, 'X', 'Y')
+fig_copula_1d, ax_copula_1d = transform_uniform_margins.\
+    plot_copula_diagnostic(copula_sample[:, 0], copula_sample[:, 1],
+                           x_sample, y_sample, x_gevd_fit, y_gevd_fit,
+                           'X', 'Y')
```

Alternatively, to compare the 2D distributions of the observed extrema and copula sample, use `fit_copula_to_extremes.qualitative_copula_fit_check_bivariate`. For example:
```python
-fig_copula_2d,ax_copula_2d=fit_copula_to_extremes.qualitative_copula_fit_check_bivariate(x_extremes_df.extreme, y_extremes_df.extreme, x_sample_in_data_scale, y_sample_in_data_scale, 'X', 'Y')
+fig_copula_2d, ax_copula_2d = fit_copula_to_extremes.\
+    qualitative_copula_fit_check_bivariate(x_extremes_df.extreme, y_extremes_df.extreme,
+                                           x_sample, y_sample, 'X', 'Y')
```

-#### (9) Plot return period as a function of two variables
+#### (10) Plot return period as a function of two variables

To plot the return period as a function of x and y, with standard contours, use `calculate_return_periods_values.plot_return_period_as_function_x_y`. For example:
```python
-fig_return_period,ax_return_period=calculate_return_periods_values.plot_return_period_as_function_x_y(copula,np.nanmin(x_extremes_df.extreme),np.nanmax(x_extremes_df.extreme),np.nanmin(y_extremes_df.extreme),np.nanmax(y_extremes_df.extreme),'X','Y', x_gevd_fit_params, y_gevd_fit_params, 'X (units)', 'Y (units)', n_samples=1000,block_size=block_size)
+fig_rp, ax_rp = calculate_return_periods_values.\
+    plot_return_period_as_function_x_y(copula,
+                                       np.nanmin(x_extremes_df.extreme),
+                                       np.nanmax(x_extremes_df.extreme),
+                                       np.nanmin(y_extremes_df.extreme),
+                                       np.nanmax(y_extremes_df.extreme),
+                                       'X', 'Y', 'X (units)', 'Y (units)',
+                                       bs_copula_arr, N)
```
+Here `bs_copula_arr` is a list of copulae, one fit to each bootstrap (e.g. by running `fit_copula_to_extremes.fit_copula_bivariate` on each bootstrap's extrema on uniform margins); these are used to calculate confidence intervals.
+
## Multivariate Analysis

To be completed
diff --git a/__init__.py b/__init__.py
index e43b01e..45d303f 100644
--- a/__init__.py
+++ b/__init__.py
@@ -3,8 +3,11 @@
Created on Tue Aug  6 11:36:17 2024

@author: A R Fogg
+
+Initialisation file which imports code.
"""

+# Import package code
from .src.bivariate import bootstrap_data
from .src.bivariate import calculate_return_periods_values
from .src.bivariate import detect_extremes
diff --git a/environment.yml b/environment.yml
new file mode 100644
index 0000000..82713ac
--- /dev/null
+++ b/environment.yml
@@ -0,0 +1,9 @@
+dependencies:
+  - python=3.8.8
+  - pip
+  - scipy=1.10.1
+  - numpy=1.23.5
+  - matplotlib=3.7.5
+  - pandas=1.5.2
+  - pip:
+    - copulas==0.9.0
\ No newline at end of file
diff --git a/src/bivariate/bootstrap_data.py b/src/bivariate/bootstrap_data.py
index 08a5a92..19032c8 100644
--- a/src/bivariate/bootstrap_data.py
+++ b/src/bivariate/bootstrap_data.py
@@ -3,6 +3,8 @@
Created on Mon Mar 25 15:38:34 2024

@author: A R Fogg
+
+Functions to bootstrap data.
"""

import numpy as np
@@ -10,12 +12,13 @@
from . import transform_uniform_margins

+
def produce_single_bootstrap_sample(data, extract_length):
    """
-    Function to product a bootstrapping sample from one given
+    Function to produce a bootstrapping sample from one given
    data set.
-    Based on code by Dr Dáire Healy here -
+    Based on code by Dr Dáire Healy here -
    https://github.com/arfogg/bi_multi_variate_eva/blob/c8f7911b5aa911a074a33b862224563460357ce3/r_code/bootstrapped_chi.R

    Parameters
@@ -33,35 +36,38 @@ def produce_single_bootstrap_sample(data, extract_length):
        A sample of length data.size of bootstrapped data.

    """
-    # data must be np.array, df col doesn't work
-
+
+    # Initialise empty list
    bootstrap_sample = []
-
+
    # While our output is shorter than the input data
    #   we will continue adding data
    while len(bootstrap_sample) < len(data):
-
+
        # Choose a random start and end point from the
        #   input data to resample
-        start_point = int( np.random.choice(data.size, size=1)[0] )
-        end_point = int( start_point + np.random.geometric(1.0 / extract_length, size=1)[0] )
-
+        start_point = int(np.random.choice(data.size, size=1)[0])
+        end_point = int(start_point + np.random.geometric(1.0 / extract_length,
+                                                          size=1)[0])
+
        # If not beyond the end of the data, append
        #   some data to the new array
        if end_point < len(data):
-            bootstrap_sample=np.append(bootstrap_sample,data[start_point:end_point])
-
-    # Check we aren't longer that the original sample
+            bootstrap_sample = np.append(bootstrap_sample,
+                                         data[start_point:end_point])
+
+    # Ensure output sample isn't longer than the original sample
    bootstrap_sample = np.array(bootstrap_sample[0:len(data)])
-
+
    return bootstrap_sample

+
def produce_dual_bootstrap_sample(data_x, data_y, extract_length):
    """
-    Function to product a bootstrapping sample from a given
+    Function to produce a bootstrapping sample from a given
    two parameter data set.

-    Based on code by Dr Dáire Healy here -
+    Based on code by Dr Dáire Healy here -
    https://github.com/arfogg/bi_multi_variate_eva/blob/c8f7911b5aa911a074a33b862224563460357ce3/r_code/bootstrapped_chi.R

    Parameters
@@ -83,31 +89,36 @@ def produce_dual_bootstrap_sample(data_x, data_y, extract_length):
    bootstrap_sample_y : np.array
        A sample of length data.size of bootstrapped y data.
""" - + + # Initialise empty bootstrap lists bootstrap_sample_x = [] bootstrap_sample_y = [] - + # While our output is shorter than the input data # we will continue adding data while len(bootstrap_sample_x) < len(data_x): - + # Choose a random start and end point from the # input data to resample - start_point = int( np.random.choice(data_x.size, size=1)[0] ) - end_point = int( start_point + np.random.geometric(1.0 / extract_length, size=1)[0] ) - + start_point = int(np.random.choice(data_x.size, size=1)[0]) + end_point = int(start_point + np.random.geometric(1.0 / extract_length, + size=1)[0]) + # If not beyond the end of the data, append # some data to the new array if end_point < len(data_x): - bootstrap_sample_x=np.append(bootstrap_sample_x,data_x[start_point:end_point]) - bootstrap_sample_y=np.append(bootstrap_sample_y,data_y[start_point:end_point]) - - # Check we aren't longer that the original sample + bootstrap_sample_x = np.append(bootstrap_sample_x, + data_x[start_point:end_point]) + bootstrap_sample_y = np.append(bootstrap_sample_y, + data_y[start_point:end_point]) + + # Ensure output sample isn't longer that the original sample bootstrap_sample_x = np.array(bootstrap_sample_x[0:len(data_x)]) bootstrap_sample_y = np.array(bootstrap_sample_y[0:len(data_y)]) - + return bootstrap_sample_x, bootstrap_sample_y + def iterative_dual_bootstrap(data_x, data_y, extract_length, n_iterations=100): """ Create a matrix of many bootstraps of the same length, from @@ -134,21 +145,28 @@ def iterative_dual_bootstrap(data_x, data_y, extract_length, n_iterations=100): bootstrap_sample_y : np.array Bootstrap sample of shape data_y.size x n_iterations. """ - + print('Producing a bootstrapped sample - may be slow') print('Data length: ', data_x.size) print('Number of iterations requested: ', n_iterations) - - bootstrap_sample_x=np.full((data_x.size,n_iterations), np.nan) - bootstrap_sample_y=np.full((data_y.size,n_iterations), np.nan) + + # Initialise empty bootstrap matrix + bootstrap_sample_x = np.full((data_x.size, n_iterations), np.nan) + bootstrap_sample_y = np.full((data_y.size, n_iterations), np.nan) + + # Loop through number of parsed iterations for i in range(n_iterations): - bootstrap_sample_x[:,i], bootstrap_sample_y[:,i]=produce_dual_bootstrap_sample(data_x, data_y, extract_length) - + # Produce bootstrap + bootstrap_sample_x[:, i], bootstrap_sample_y[:, i] \ + = produce_dual_bootstrap_sample(data_x, data_y, extract_length) + return bootstrap_sample_x, bootstrap_sample_y -def iterative_dual_bootstrap_um(data_x, data_y, extract_length, n_iterations=100): + +def iterative_dual_bootstrap_um(data_x, data_y, extract_length, + n_iterations=100): """ - Create a matrix of many bootstraps of the same length, + Create a matrix of many bootstraps of the same length, and return data in both data scale and on uniform margins. Parameters @@ -167,38 +185,57 @@ def iterative_dual_bootstrap_um(data_x, data_y, extract_length, n_iterations=100 Returns ------- - bootstrap_sample_ds : np.array - Bootstrap sample of shape data.size x n_iterations in - data scale. - bootstrap_sample_um : np.array - Bootstrap sample of shape data.size x n_iterations on - uniform margins. + bootstrap_sample_x_ds : np.array + Bootstrap sample of x, of shape data_x.size x + n_iterations in data scale. + bootstrap_sample_x_um : np.array + Bootstrap sample of x, of shape data_x.size x + n_iterations on uniform margins. + bootstrap_sample_y_ds : np.array + Bootstrap sample of y, of shape data_y.size x + n_iterations in data scale. 
+    bootstrap_sample_y_um : np.array
+        Bootstrap sample of y, of shape data_y.size x
+        n_iterations on uniform margins.
+
    """
-    print('Producing a bootstrapped sample in data scale and on uniform margins - may be slow')
+    print('Producing a bootstrapped sample in data scale and on uniform')
+    print('margins - may be slow')
    print('Data length: ', data_x.size)
    print('Number of iterations requested: ', n_iterations)
-
-    bootstrap_sample_x_ds=np.full((data_x.size,n_iterations), np.nan)
-    bootstrap_sample_x_um=np.full((data_x.size,n_iterations), np.nan)
-    bootstrap_sample_y_ds=np.full((data_y.size,n_iterations), np.nan)
-    bootstrap_sample_y_um=np.full((data_y.size,n_iterations), np.nan)
-
+    # Initialise empty bootstrap matrices for x and y
+    bootstrap_sample_x_ds = np.full((data_x.size, n_iterations), np.nan)
+    bootstrap_sample_x_um = np.full((data_x.size, n_iterations), np.nan)
+    bootstrap_sample_y_ds = np.full((data_y.size, n_iterations), np.nan)
+    bootstrap_sample_y_um = np.full((data_y.size, n_iterations), np.nan)
+
+    # Loop through the number of parsed iterations
    for i in range(n_iterations):
-
-        bootstrap_sample_x_ds[:,i], bootstrap_sample_y_ds[:,i]=produce_dual_bootstrap_sample(data_x, data_y, extract_length)
-        bootstrap_sample_x_um[:,i]=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically(bootstrap_sample_x_ds[:,i], plot=False)
-        bootstrap_sample_y_um[:,i]=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically(bootstrap_sample_y_ds[:,i], plot=False)
-
-    return bootstrap_sample_x_ds, bootstrap_sample_x_um, bootstrap_sample_y_ds, bootstrap_sample_y_um,
+
+        # Produce dual bootstrap sample
+        bootstrap_sample_x_ds[:, i], bootstrap_sample_y_ds[:, i] = \
+            produce_dual_bootstrap_sample(data_x, data_y, extract_length)
+
+        # Transform data to uniform margins
+        bootstrap_sample_x_um[:, i] = transform_uniform_margins. \
+            transform_from_data_scale_to_uniform_margins_empirically(
+                bootstrap_sample_x_ds[:, i], plot=False)
+        bootstrap_sample_y_um[:, i] = transform_uniform_margins. \
+            transform_from_data_scale_to_uniform_margins_empirically(
+                bootstrap_sample_y_ds[:, i], plot=False)
+
+    return bootstrap_sample_x_ds, bootstrap_sample_x_um, \
+        bootstrap_sample_y_ds, bootstrap_sample_y_um

+
def produce_n_bootstrap_sample(df, cols, extract_length):
    """
-    Function to product a bootstrapping sample from a given
-    two parameter data set.
+    Function to produce a bootstrapping sample from a given
+    n parameter data set.

-    Based on code by Dr Dáire Healy here -
+    Based on code by Dr Dáire Healy here -
    https://github.com/arfogg/bi_multi_variate_eva/blob/c8f7911b5aa911a074a33b862224563460357ce3/r_code/bootstrapped_chi.R

    Parameters
@@ -215,31 +252,48 @@ def produce_n_bootstrap_sample(df, cols, extract_length):
    Returns
    -------
    bootstrap_df : pd.DataFrame
        Dataframe with columns defined by cols, with
-        bootstrapped data.
+        bootstrapped data.
+
    """
-    bootstrap_df=pd.DataFrame(columns=cols)
+    # Initialise bootstrap DataFrame
+    bootstrap_df = pd.DataFrame(columns=cols)
+
+    # While our output is shorter than the input data
+    #   we will continue adding data
    while len(bootstrap_df) < len(df):
-        start_point = int( np.random.choice(len(df), size=1)[0] )
-        end_point = int( start_point + np.random.geometric(1.0 / extract_length, size=1)[0] )
-
+
+        # Choose a random start and end point from the
+        #   input data to resample
+        start_point = int(np.random.choice(len(df), size=1)[0])
+        end_point = int(start_point + np.random.geometric(1.0 / extract_length,
+                                                          size=1)[0])
+
+        # If not beyond the end of the data, append
+        #   some data to the new array
        if end_point < len(df):
-            loop_df=pd.DataFrame({})
+
+            loop_df = pd.DataFrame({})
+
+            # Loop through the columns to be bootstrapped
            for col in cols:
-                loop_df[col]=df[col].iloc[start_point:end_point]
-            bootstrap_df=pd.concat([bootstrap_df,loop_df], ignore_index=True)
-
-    bootstrap_df=bootstrap_df.iloc[:len(df)].copy(deep=True)
-
-    return bootstrap_df
+                loop_df[col] = df[col].iloc[start_point:end_point]
+
+            bootstrap_df = pd.concat([bootstrap_df, loop_df],
+                                     ignore_index=True)
+
+    # Ensure output sample isn't longer than the original sample
+    bootstrap_df = bootstrap_df.iloc[:len(df)].copy(deep=True)
+
+    return bootstrap_df

+
def produce_n_bootstrap_sample_um(df, cols, extract_length):
    """
-    Function to product a bootstrapping sample from a given
+    Function to produce a bootstrapping sample from a given
    n parameter data set.

-    Based on code by Dr Dáire Healy here -
+    Based on code by Dr Dáire Healy here -
    https://github.com/arfogg/bi_multi_variate_eva/blob/c8f7911b5aa911a074a33b862224563460357ce3/r_code/bootstrapped_chi.R

    Parameters
@@ -256,30 +310,47 @@ def produce_n_bootstrap_sample_um(df, cols, extract_length):
    Returns
    -------
    bootstrap_df : pd.DataFrame
        Dataframe with columns defined by cols, with
-        bootstrapped data in data scale.
+        bootstrapped data in data scale.
    bootstrap_df_um : pd.DataFrame
        Dataframe with columns defined by cols, with
-        bootstrapped data on uniform margins.
+        bootstrapped data on uniform margins.
+
    """
-
-    bootstrap_df=pd.DataFrame(columns=cols)
+
+    # Initialise bootstrap DataFrame
+    bootstrap_df = pd.DataFrame(columns=cols)
+
+    # While our output is shorter than the input data
+    #   we will continue adding data
    while len(bootstrap_df) < len(df):
-        start_point = int( np.random.choice(len(df), size=1)[0] )
-        end_point = int( start_point + np.random.geometric(1.0 / extract_length, size=1)[0] )
-
+
+        # Choose a random start and end point from the
+        #   input data to resample
+        start_point = int(np.random.choice(len(df), size=1)[0])
+        end_point = int(start_point + np.random.geometric(1.0 / extract_length,
+                                                          size=1)[0])
+
+        # If not beyond the end of the data, append
+        #   some data to the new array
        if end_point < len(df):
-            loop_df=pd.DataFrame({})
+
+            loop_df = pd.DataFrame({})
+
+            # Loop through the columns to be bootstrapped
            for col in cols:
-                loop_df[col]=df[col].iloc[start_point:end_point]
-            bootstrap_df=pd.concat([bootstrap_df,loop_df], ignore_index=True)
-
-    bootstrap_df=bootstrap_df.iloc[:len(df)].copy(deep=True)
-
-    bootstrap_df_um=pd.DataFrame({})
+                loop_df[col] = df[col].iloc[start_point:end_point]
+
+            bootstrap_df = pd.concat([bootstrap_df, loop_df],
+                                     ignore_index=True)
+
+    # Ensure output sample isn't longer than the original sample
+    bootstrap_df = bootstrap_df.iloc[:len(df)].copy(deep=True)
+
+    # Transform to uniform margins
+    bootstrap_df_um = pd.DataFrame({})
    for col in cols:
-        bootstrap_df_um[col]=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically(bootstrap_df[col], plot=False)
-
+        bootstrap_df_um[col] = transform_uniform_margins. \
+            transform_from_data_scale_to_uniform_margins_empirically(
+                bootstrap_df[col], plot=False)
+
    return bootstrap_df, bootstrap_df_um
-
-
\ No newline at end of file
diff --git a/src/bivariate/bootstrap_gevd_fit.py b/src/bivariate/bootstrap_gevd_fit.py
index 869bd3e..1644d9c 100644
--- a/src/bivariate/bootstrap_gevd_fit.py
+++ b/src/bivariate/bootstrap_gevd_fit.py
@@ -4,16 +4,18 @@

@author: A R Fogg

-Bootstrap GEVD fit class definition
+Bootstrap GEVD fit class definition.
"""

from . import return_period_plot_1d
from .gevd_fitter import gevd_fitter

+
class bootstrap_gevd_fit():
    """
    A class which contains bootstrapped data and their fits.
    """
+
    def __init__(self, bootstrap_extrema, true_fit):
        """
        Initialise class

        Parameters
        ----------
        bootstrap_extrema : np.array
-            Bootstrapped extrema. Of shape n_extrema x
+            Bootstrapped extrema. Of shape n_extrema x
            n_bootstraps.
-
+        true_fit : gevd_fitter class
+            Object containing GEVD fitting information for the true data.
+            Contains attributes listed below. See gevd_fitter.py for
+            definition.
+            extremes : np.array
+                List of extrema the model is fit to.
+            extremes_unif_empirical : np.array
+                Parsed extrema converted to uniform margins empirically.
+            extremes_unif_CDF : np.array
+                Parsed extrema converted to uniform margins using the fitted
+                GEVD or Gumbel distribution.
+            distribution_name : string
+                String name of fitted distribution. Either 'genextreme' or
+                'gumbel_r'.
+            distribution : scipy.rv_continuous
+                Empty / generic distribution of selected distribution.
+            frozen_dist : frozen scipy.rv_continuous
+                Frozen version of fitted distribution.
+            shape_ : float
+                Fitted shape parameter.
+            location : float
+                Fitted location parameter.
+            scale : float
+                Fitted scale parameter.
+            formatted_dist_name : string
+                A formatted version of distribution name for plot labels.
+ aic : float + Akaike Information Criterion for fit. + fit_guess : dictionary + Dictionary containing guess initial parameters + for fitting the distribution. Keys 'c' for shape, + 'scale', and 'loc' for location. Returns ------- None. @@ -31,25 +64,44 @@ def __init__(self, bootstrap_extrema, true_fit): """ # Store the bootstrapped data self.bs_data = bootstrap_extrema - + # Store the number of extrema and bootstraps self.n_extrema = bootstrap_extrema.shape[0] self.n_ci_iterations = bootstrap_extrema.shape[1] - + # Store the fit to the true data self.true_fit = true_fit - + + # Fit distributions to each bootstrap self.fit_gevd() - + def fit_gevd(self): - + """ + Fit a GEVD model to each individual bootstrapped + dataset, and store an array of gevd_fitter classes. + + Returns + ------- + None. + + """ + + # Initialise empty list gevd_fitter_arr = [] + + # Looping through individual bootstraps for i in range(self.n_ci_iterations): - gevd_fitter_arr.append( gevd_fitter(self.bs_data[:,i], dist = self.true_fit.distribution_name) ) - + # Create and append a gevd_fitter for each bootstrap + gevd_fitter_arr.append(gevd_fitter( + self.bs_data[:, i], + dist=self.true_fit.distribution_name, + fit_guess=self.true_fit.params_dict)) + + # Assign the list of gevd_fitters to this class self.gevd_fitter_arr = gevd_fitter_arr - - def calc_return_levels(self, distribution_name, block_size, bs_return_periods, true_fit): + + def calc_return_levels(self, distribution_name, block_size, + bs_return_periods, true_fit): """ Calculate the return levels for these bootstrapped extrema. @@ -63,30 +115,56 @@ def calc_return_levels(self, distribution_name, block_size, bs_return_periods, t bs_return_periods : np.array Return periods (in years) to calculate the levels at. true_fit : gevd_fitter class - Gevd_fitter object for the fit to true extrema. - + Object containing GEVD fitting information for the true data. + Contains attributes listed below. See gevd_fitter.py for + definition. + extremes : np.array + List of extrema the model is fit to. + extremes_unif_empirical : np.array + Parsed extrema converted to uniform margins empirically. + extremes_unif_CDF : np.array + Parsed extrema converted to uniform margins using the fitted + GEVD or Gumbel distribution. + distribution_name : string + String name of fitted distribution. Either 'genextreme' or + 'gumbel_r'. + distribution : scipy.rv_continuous + Empty / generic distribution of selected distribution. + frozen_dist : frozen scipy.rv_continuous + Frozen version of fitted distribution. + shape_ : float + Fitted shape parameter. + location : float + Fitted location parameter. + scale : float + Fitted scale parameter. + formatted_dist_name : string + A formatted version of distribution name for plot labels. + aic : float + Akaike Information Criterion for fit. + fit_guess : dictionary + Dictionary containing guess initial parameters + for fitting the distribution. Keys 'c' for shape, + 'scale', and 'loc' for location. Returns ------- None. 
""" - + # Store the parsed information self.distribution_name = distribution_name self.block_size = block_size self.periods = bs_return_periods - + # Calculate the return levels - levels, shape_, location, scale = return_period_plot_1d.return_level_bootstrapped_data(self, - self.n_ci_iterations, - bs_return_periods) - - # levels, shape_, location, scale = return_period_plot_1d.return_level_bootstrapped_data(self.bs_data, - # self.n_ci_iterations, self.distribution_name, - # self.block_size, bs_return_periods, true_fit) + levels, shape_, location, scale = \ + return_period_plot_1d.\ + return_level_bootstrapped_data(self, self.n_ci_iterations, + bs_return_periods) # Store the return levels and associated fit information self.levels = levels self.shape_ = shape_ self.location = location - self.scale = scale \ No newline at end of file + self.scale = scale diff --git a/src/bivariate/calculate_return_periods_values.py b/src/bivariate/calculate_return_periods_values.py index aadd7b1..e833b31 100644 --- a/src/bivariate/calculate_return_periods_values.py +++ b/src/bivariate/calculate_return_periods_values.py @@ -3,9 +3,10 @@ Created on Fri Jul 21 14:40:39 2023 @author: A R Fogg + +Functions to calculate return periods. """ -#import copulas import pandas as pd import numpy as np import matplotlib.pyplot as plt @@ -14,7 +15,9 @@ from . import transform_uniform_margins from . import plotting_utils -def calculate_return_period(copula, sample_grid, block_size=pd.to_timedelta("365.2425D")): + +def calculate_return_period(copula, sample_grid, + block_size=pd.to_timedelta("365.2425D")): """ Calculate the return period for a given list of sample x and y @@ -23,44 +26,74 @@ def calculate_return_period(copula, sample_grid, block_size=pd.to_timedelta("365 copula : copulas copula Copula that has been fit to some data sample_grid : pd.DataFrame - Two columns with names same as copula, containing x and y values + Two columns with names same as copula, containing x and y values to find the return period for. block_size : pd.Timedelta, optional - Size over which block maxima have been found. The default + Size over which block maxima have been found. The default is pd.to_timedelta("365.2425D"). Returns ------- return_period : np.array - Return periods for given sample. + Return periods for given sample in years. """ - + print('Calculating the return period over parsed copula and sample') - + # Calculate the CDF value for each point in sample - CDF=copula.cumulative_distribution(sample_grid) - + CDF = copula.cumulative_distribution(sample_grid) + # Calculate the number of extremes in a year - n_extremes_per_year=pd.to_timedelta("365.2425D")/block_size - + n_extremes_per_year = pd.to_timedelta("365.2425D")/block_size + # Calculate the return period (in years!) # See Coles 2001 textbook pages 81-82 - return_period=(1.0/n_extremes_per_year)*(1.0/(1-CDF)) - + return_period = (1.0/n_extremes_per_year)*(1.0/(1-CDF)) + return return_period + def estimate_return_period_ci(bs_copula_arr, n_bootstrap, - sample_grid, block_size=pd.to_timedelta("365.2425D"), + sample_grid, + block_size=pd.to_timedelta("365.2425D"), ci_percentiles=[2.5, 97.5]): - + """ + Function to estimate the Confidence Interval over the 2D + return period matrix using bootstraps. + Parameters + ---------- + bs_copula_arr : list + List of len n_bootstrap containing copulas.bivariate objects for each + individual bootstrap. + n_bootstrap : int + Number of bootstraps parsed. 
+ sample_grid : pd.DataFrame + Two columns with names same as copula, containing x and y values + to find the return period for. + block_size : pd.Timedelta, optional + Size over which block maxima have been found. The default + is pd.to_timedelta("365.2425D"). + ci_percentiles : list, optional + Upper and lower percentiles for the confidence interval + plots. The default is [2.5, 97.5]. + + Returns + ------- + rp : np.array + Array containing the calculated return period across all bootstraps. + ci : np.array + Confidence interval on each rp value. + + """ + # Initialise empty return period matrix rp = np.full((sample_grid.shape[0], n_bootstrap), np.nan) for i in range(n_bootstrap): - print('Bootstrap ',i) - rp[:,i] = calculate_return_period(bs_copula_arr[i], sample_grid, - block_size=block_size) - + print('Bootstrap ', i) + rp[:, i] = calculate_return_period(bs_copula_arr[i], sample_grid, + block_size=block_size) + # Looping through each grid pixel print('Calculating confidence interval') ci = np.full(sample_grid.shape, np.nan) @@ -68,60 +101,109 @@ def estimate_return_period_ci(bs_copula_arr, n_bootstrap, # First, select the Bootstraps where return_period is finite # Infinite return_period means CDF->1, which indicates # the bootstrap didn't contain the full extent of the actual - # observed extrema. Hence the copula CDF was saying "out + # observed extrema. Hence the copula CDF was saying "out # of bounds!" to the requested X/Y point. So, we use only # finite return periods to calculate the CI, and retain the # number of bootstraps contributing to each point. - rp_clean_index, = np.where(np.isfinite(rp[j,:])) - rp_clean = rp[j,rp_clean_index] if rp_clean_index.size > 0 else rp[j,:] - ci[j,:] = np.percentile(rp_clean, ci_percentiles) + rp_clean_index, = np.where(np.isfinite(rp[j, :])) + rp_clean = rp[j, rp_clean_index] if rp_clean_index.size > 0 \ + else rp[j, :] + ci[j, :] = np.percentile(rp_clean, ci_percentiles) - return rp, ci + return rp, ci -def generate_sample_grid(min_x, max_x, min_y, max_y, +def generate_sample_grid(min_x, max_x, min_y, max_y, x_name, y_name, n_samples=1000): + """ + Generate x vs y grid for 2D return period plot. + + Parameters + ---------- + min_x : float + Minimum value for x. + max_x : float + Maximum value for x. + min_y : float + Minimum value for y. + max_y : float + Maximum value for y. + x_name : string + Tag for x values. + y_name : string + Tag for y values. + n_samples : int, optional + Number of points in x and y direction for the matrix. The default + is 1000. + + Returns + ------- + sample_grid : np.array + X and y points (on uniform margins) for the plotting grid. Raveled + onto 2 column matrix. + xv_ds : np.array + X points for pcolormesh grid in data scale. + yv_ds : np.array + Y points for pcolormesh grid in data scale. + mid_point_x_ds : np.array + Mid x point of bin to be used to calculate return period. + mid_point_y_ds : np.array + Mid y point of bin to be used to calculate return period. + + """ + # Create a sample - sample_um=pd.DataFrame({x_name:transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically( - np.linspace(min_x,max_x,n_samples)), - y_name:transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically( - np.linspace(min_y,max_y,n_samples))}) - sample_ds=pd.DataFrame({x_name:np.linspace(min_x,max_x,n_samples), - y_name:np.linspace(min_y,max_y,n_samples)}) - + sample_um = \ + pd.DataFrame({x_name: + transform_uniform_margins. 
+                  transform_from_data_scale_to_uniform_margins_empirically(
+                      np.linspace(min_x, max_x, n_samples)),
+                  y_name:
+                  transform_uniform_margins.
+                  transform_from_data_scale_to_uniform_margins_empirically(
+                      np.linspace(min_y, max_y, n_samples))})
+    sample_ds = pd.DataFrame({x_name: np.linspace(min_x, max_x, n_samples),
+                              y_name: np.linspace(min_y, max_y, n_samples)})
+
    # Create sample grid
-    xv_um, yv_um = np.meshgrid(sample_um[x_name], sample_um[y_name]) #uniform margins
-    xv_ds, yv_ds = np.meshgrid(sample_ds[x_name], sample_ds[y_name]) #data scale
+    # On uniform margins
+    xv_um, yv_um = np.meshgrid(sample_um[x_name], sample_um[y_name])
+    # In data scale
+    xv_ds, yv_ds = np.meshgrid(sample_ds[x_name], sample_ds[y_name])
    # mesh grid on uniform margins for calculating, in data scale
    #   for plotting
-
+
    # Determine mid point of each pixel to calculate return
    #   period for
-    mid_point_x_um=(xv_um[1:,1:]+xv_um[:-1,:-1])/2
-    mid_point_y_um=(yv_um[1:,1:]+yv_um[:-1,:-1])/2
-    mid_point_x_ds=(xv_ds[1:,1:]+xv_ds[:-1,:-1])/2
-    mid_point_y_ds=(yv_ds[1:,1:]+yv_ds[:-1,:-1])/2
-
+    mid_point_x_um = (xv_um[1:, 1:]+xv_um[:-1, :-1])/2
+    mid_point_y_um = (yv_um[1:, 1:]+yv_um[:-1, :-1])/2
+    mid_point_x_ds = (xv_ds[1:, 1:]+xv_ds[:-1, :-1])/2
+    mid_point_y_ds = (yv_ds[1:, 1:]+yv_ds[:-1, :-1])/2
+
    # Reshape
-    raveled_mid_point_x=mid_point_x_um.ravel()
-    raveled_mid_point_y=mid_point_y_um.ravel()
-    sample_grid=np.array([raveled_mid_point_x,raveled_mid_point_y]).T
+    raveled_mid_point_x = mid_point_x_um.ravel()
+    raveled_mid_point_y = mid_point_y_um.ravel()
+    sample_grid = np.array([raveled_mid_point_x, raveled_mid_point_y]).T
+
+    return sample_grid, xv_ds, yv_ds, mid_point_x_ds, mid_point_y_ds

-    return sample_grid, xv_ds, yv_ds, mid_point_x_ds, mid_point_y_ds

-def plot_return_period_as_function_x_y(copula, min_x, max_x, min_y, max_y,
-                                       x_name, y_name,
-                                       x_label, y_label,
+def plot_return_period_as_function_x_y(copula,
+                                       min_x, max_x, min_y, max_y,
+                                       x_name, y_name,
+                                       x_label, y_label,
                                       bs_copula_arr, n_bootstrap,
-                                       sample_grid=None,
-                                       xv_ds=None, yv_ds=None,
-                                       mid_point_x_ds=None, mid_point_y_ds=None,
-                                       return_period=None, ci=None,# n=None, n_samples=1000, block_size=pd.to_timedelta("365.2425D"),
-                                       contour_levels=[1/12,0.5,1.0,10.0],
-                                       contour_colors=['white','white','white','black'],
+                                       n_samples=1000,
+                                       block_size=pd.to_timedelta("365.2425D"),
+                                       sample_grid=None,
+                                       xv_ds=None, yv_ds=None,
+                                       mid_point_x_ds=None,
+                                       mid_point_y_ds=None,
+                                       return_period=None, ci=None,
+                                       contour_levels=[1/12, 0.5, 1.0, 10.0],
+                                       contour_colors=['white', 'white',
+                                                       'white', 'black'],
                                       lower_ax_limit_contour_index=1,
                                       ci_percentiles=[2.5, 97.5],
                                       fontsize=15):
@@ -151,23 +233,50 @@ def plot_return_period_as_function_x_y(copula, min_x, max_x, min_y, max_y,
        Name for y, used for labelling plots.
    bs_copula_arr : list of copulas copulae
        Python list of length n_bootstrap containing
-        copulae for each bootstrapped set of extrema.
+        copulas.bivariate object for each bootstrapped set of extrema.
    n_bootstrap : int
        Number of bootstrapped extrema generated.
    n_samples : int, optional
-        Number of points for x and y axes. So return period is evaluated
+        Number of points for x and y axes. So return period is evaluated
        across n_samples x n_samples size grid. The default is 1000.
    block_size : pd.timedelta, optional
-        Block size used in the block maxima extreme selection. The
+        Block size used in the block maxima extreme selection. The
        default is pd.to_timedelta("365.2425D").
+    sample_grid : np.array, optional
+        X and y points (on uniform margins) for the plotting grid. Raveled
+        onto 2 column matrix.
The default is None - in this case, the parameter
+        is calculated.
+    xv_ds : np.array, optional
+        X points for pcolormesh grid in data scale. The default is None - in
+        this case, the parameter is calculated.
+    yv_ds : np.array, optional
+        Y points for pcolormesh grid in data scale. The default is None - in
+        this case, the parameter is calculated.
+    mid_point_x_ds : np.array, optional
+        Mid x point of bin to be used to calculate return period. The default
+        is None - in this case, the parameter is calculated.
+    mid_point_y_ds : np.array, optional
+        Mid y point of bin to be used to calculate return period. The default
+        is None - in this case, the parameter is calculated.
+    return_period : np.array, optional
+        Return period matrix. The default is None - in this case, the
+        parameter is calculated.
+    ci : np.array, optional
+        Confidence interval matrix. The default is None - in this case, the
+        parameter is calculated.
    contour_levels : list, optional
-        Return period values at which contours will be drawn. The
+        Return period values at which contours will be drawn. The
        default is [1/12,0.5,1.0,10.0].
+    contour_colors : list, optional
+        List of string names for contour colors. The default is ['white',
+        'white', 'white', 'black'].
    lower_ax_limit_contour_index : int, optional
        Used to decide the lower axes limits for x and y. The default is 1.
    ci_percentiles : list, optional
        Upper and lower percentiles for the confidence interval
        plots. The default is [2.5, 97.5].
+    fontsize : float, optional
+        Fontsize to be applied to all labels. The default is 15.

    Returns
    -------
@@ -177,117 +286,140 @@ def plot_return_period_as_function_x_y(copula, min_x, max_x, min_y, max_y,
        Axes within fig.

    """
-
+    print('Creating a 2D return period plot with confidence intervals')
+
    # Adjust fontsize for all text
    plt.rcParams['font.size'] = fontsize
-
+
    # Create sample_grid etc if not parsed in
-    if (sample_grid is None) | (xv_ds is None) | (yv_ds is None) | (mid_point_x_ds is None) | (mid_point_y_ds is None):
-        sample_grid, xv_ds, yv_ds, mid_point_x_ds, mid_point_y_ds = generate_sample_grid(
-            min_x, max_x, min_y, max_y,
+    if (sample_grid is None) | (xv_ds is None) | (yv_ds is None) \
+            | (mid_point_x_ds is None) | (mid_point_y_ds is None):
+        sample_grid, xv_ds, yv_ds, mid_point_x_ds, mid_point_y_ds = \
+            generate_sample_grid(
+                min_x, max_x, min_y, max_y,
                x_name, y_name, n_samples=n_samples)
-
+
    # Calculate return period if not parsed
    if return_period is None:
-        return_period=calculate_return_period(copula, sample_grid, block_size=block_size)
+        return_period = calculate_return_period(copula, sample_grid,
+                                                block_size=block_size)

    # Reshape for mesh grid
-    shaped_return_period=return_period.reshape(mid_point_x_ds.shape)
+    shaped_return_period = return_period.reshape(mid_point_x_ds.shape)

    # Calculate confidence intervals if needed
    if (ci is None):
        rp, ci = estimate_return_period_ci(bs_copula_arr, n_bootstrap,
-                                       sample_grid, block_size=block_size,
-                                       ci_percentiles=ci_percentiles)
+                                           sample_grid, block_size=block_size,
+                                           ci_percentiles=ci_percentiles)

    # Initialise plot
-    fig,ax=plt.subplots(nrows=1, ncols=3, figsize=(24,6))
-
+    fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(24, 6))
+
    # ----- RETURN PERIOD -----
    rp_cbar_norm = colors.LogNorm(vmin=np.quantile(shaped_return_period, 0.1),
-                                  vmax=np.quantile(shaped_return_period, 0.999))
+                                  vmax=np.quantile(shaped_return_period,
+                                                   0.999))
+
    # Plot return period as function of x and y in data scale
-    pcm=ax[0].pcolormesh(xv_ds,yv_ds,shaped_return_period, cmap='plasma',
-                         norm =
rp_cbar_norm) - + pcm = ax[0].pcolormesh(xv_ds, yv_ds, shaped_return_period, cmap='plasma', + norm=rp_cbar_norm) + # Contours - contours = ax[0].contour(mid_point_x_ds, mid_point_y_ds, shaped_return_period, contour_levels, - colors=contour_colors) + contours = ax[0].contour(mid_point_x_ds, mid_point_y_ds, + shaped_return_period, contour_levels, + colors=contour_colors) for c in contours.collections: c.set_clip_on(True) - clabels = plotting_utils.contour_labels(contours, xpad=max_x*0.005, sides=['left', 'right'], - color='black', ha='left', va='center') + clabels = plotting_utils.contour_labels(contours, xpad=max_x*0.005, + sides=['left', 'right'], + color='black', ha='left', + va='center') # Colourbar - cbar=fig.colorbar(pcm, ax=ax[0], extend='max', label='Return Period (years)', - pad=0.06) - cbar.add_lines(contours) - + cbar = fig.colorbar(pcm, ax=ax[0], extend='max', + label='Return Period (years)', pad=0.06) + cbar.add_lines(contours) + # Some Decor ax[0].set_xlabel(x_label) ax[0].set_ylabel(y_label) - ax[0].set_title('(a) Return Period\n'+r'($\tau_{max}$ = '+str(round(np.nanmax(shaped_return_period),2))+' years)') + ax[0].set_title('(a) Return Period\n'+r'($\tau_{max}$ = ' + + str(round(np.nanmax(shaped_return_period), 2))+' years)') # ----- LOWER QUANTILE ----- - # Plot lower quantile as function of x and y in data scale - lower_quantile = ci[:,0].reshape(mid_point_x_ds.shape) + lower_quantile = ci[:, 0].reshape(mid_point_x_ds.shape) lci_cbar_norm = colors.LogNorm(vmin=np.nanquantile(lower_quantile, 0.1), - vmax=np.nanquantile(lower_quantile, 0.999)) - pcm_lq=ax[1].pcolormesh(xv_ds,yv_ds,lower_quantile, cmap='magma', - norm = lci_cbar_norm) + vmax=np.nanquantile(lower_quantile, 0.999)) + pcm_lq = ax[1].pcolormesh(xv_ds, yv_ds, lower_quantile, cmap='magma', + norm=lci_cbar_norm) + # Contours - contours_lq = ax[1].contour(mid_point_x_ds, mid_point_y_ds, lower_quantile, contour_levels, - colors=contour_colors) + contours_lq = ax[1].contour(mid_point_x_ds, mid_point_y_ds, lower_quantile, + contour_levels, colors=contour_colors) for c in contours_lq.collections: c.set_clip_on(True) - clabels = plotting_utils.contour_labels(contours_lq, xpad=max_x*0.005, sides=['left', 'right'], - color='black', ha='left', va='center') + clabels = plotting_utils.contour_labels(contours_lq, xpad=max_x*0.005, + sides=['left', 'right'], + color='black', ha='left', + va='center') # Colourbar - cbar=fig.colorbar(pcm_lq, ax=ax[1], extend='max', label='Return Period (years)') + cbar = fig.colorbar(pcm_lq, ax=ax[1], extend='max', + label='Return Period (years)') cbar.add_lines(contours_lq) + # Some Decor ax[1].set_xlabel(x_label) ax[1].set_ylabel(y_label) - - ax[1].set_title('(b) '+str(ci_percentiles[1]-ci_percentiles[0])+'% Confidence Interval (lower)\n'+r'($\tau_{max}$ = '+str(round(np.nanmax(lower_quantile),2))+' years)') - + + ax[1].set_title('(b) ' + str(ci_percentiles[1]-ci_percentiles[0]) + + '% Confidence Interval (lower)\n' + r'($\tau_{max}$ = ' + + str(round(np.nanmax(lower_quantile), 2)) + ' years)') + # ----- UPPER QUANTILE ----- # Plot upper quantile as function of x and y in data scale - upper_quantile = ci[:,1].reshape(mid_point_x_ds.shape) + upper_quantile = ci[:, 1].reshape(mid_point_x_ds.shape) uci_cbar_norm = colors.LogNorm(vmin=np.nanquantile(upper_quantile, 0.1), - vmax=np.nanquantile(upper_quantile, 0.999)) - pcm_uq=ax[2].pcolormesh(xv_ds,yv_ds,upper_quantile, cmap='magma', - norm = uci_cbar_norm) + vmax=np.nanquantile(upper_quantile, 0.999)) + pcm_uq = ax[2].pcolormesh(xv_ds, yv_ds, 
upper_quantile, cmap='magma', + norm=uci_cbar_norm) # Contours - contours_uq = ax[2].contour(mid_point_x_ds, mid_point_y_ds, upper_quantile, contour_levels, - colors=contour_colors) + contours_uq = ax[2].contour(mid_point_x_ds, mid_point_y_ds, upper_quantile, + contour_levels, colors=contour_colors) for c in contours_uq.collections: c.set_clip_on(True) - clabels = plotting_utils.contour_labels(contours_uq, xpad=max_x*0.005, sides=['left', 'right'], - color='black', ha='left', va='center') + clabels = plotting_utils.contour_labels(contours_uq, xpad=max_x*0.005, + sides=['left', 'right'], + color='black', ha='left', + va='center') # Colourbar - cbar=fig.colorbar(pcm_uq, ax=ax[2], extend='max', label='Return Period (years)') - cbar.add_lines(contours_uq) - + cbar = fig.colorbar(pcm_uq, ax=ax[2], extend='max', + label='Return Period (years)') + cbar.add_lines(contours_uq) + # Some Decor ax[2].set_xlabel(x_label) ax[2].set_ylabel(y_label) - ax[2].set_title('(c) '+str(ci_percentiles[1]-ci_percentiles[0])+'% Confidence Interval (upper)\n'+r'($\tau_{max}$ = '+str(round(np.nanmax(upper_quantile),2))+' years)') + ax[2].set_title('(c) ' + str(ci_percentiles[1]-ci_percentiles[0]) + + '% Confidence Interval (upper)\n' + r'($\tau_{max}$ = ' + + str(round(np.nanmax(upper_quantile), 2)) + ' years)') # Set x and y limits # Work out x and y limits based on parsed contour index - xy_contour_limit=contours.allsegs[lower_ax_limit_contour_index][0] - xlim=np.min(xy_contour_limit[:,0])*0.9 - ylim=np.min(xy_contour_limit[:,1])*0.9 - for i,a in np.ndenumerate(ax): + xy_contour_limit = contours.allsegs[lower_ax_limit_contour_index][0] + xlim = np.min(xy_contour_limit[:, 0])*0.9 + ylim = np.min(xy_contour_limit[:, 1])*0.9 + for i, a in np.ndenumerate(ax): a.set_xlim(left=xlim) - a.set_ylim(bottom=ylim) + a.set_ylim(bottom=ylim) fig.tight_layout() - - return fig, ax \ No newline at end of file + + return fig, ax diff --git a/src/bivariate/detect_extremes.py b/src/bivariate/detect_extremes.py index 997a404..71b520d 100644 --- a/src/bivariate/detect_extremes.py +++ b/src/bivariate/detect_extremes.py @@ -3,42 +3,45 @@ Created on Tue Jul 18 16:54:38 2023 @author: A R Fogg + +Functions to detect extrema in data. """ import pandas as pd import numpy as np -def find_joint_block_maxima(data, df_x_tag, df_y_tag, + +def find_joint_block_maxima(data, df_x_tag, df_y_tag, df_time_tag='datetime', - md_fr = 0.5, + md_fr=0.5, block_size=pd.to_timedelta("365.2425D"), extremes_type='high'): """ Find block maxima for two variables, ensuring the output has maxima for both datasets over the same block in each row - + Please note this function was inspired by pyextremes. Please see https://github.com/georgebv/pyextremes/tree/master - + Parameters ---------- data : pandas.DataFrame DataFrame containing a column with pd.Timestamp and two data columns with names df_x_tag and df_y_tag. df_x_tag : string - Tag describing the column of data which extremes should be extracted + Tag describing the column of data which extremes should be extracted from. df_y_tag : string - Tag describing the column of data which extremes should be extracted + Tag describing the column of data which extremes should be extracted from. df_time_tag : string, optional - Tag describing the column of data which contains pd.Timestamp. + Tag describing the column of data which contains pd.Timestamp. The default is 'datetime'. md_fr : float Any block with greater than md_fr*100 percentage missing data is returned as an empty slice. The default is 0.5. 
    block_size : pd.Timedelta, optional
-        Block size over which individual extremes are found. The default is
+        Block size over which individual extremes are found. The default is
        pd.to_timedelta("365.2425D").
    extremes_type : string, optional
        If extremes_type='high' block maxima are found. If extremes_type='low',
@@ -59,43 +62,48 @@ def find_joint_block_maxima(data, df_x_tag, df_y_tag,
        Value of the y extrema.

    """
-
+
    # First, check the parsed columns exist in data
    if set([df_x_tag, df_y_tag, df_time_tag]).issubset(data.columns):
-        print('Extracting extremes for ',df_x_tag, ' and ', df_y_tag)
+        print('Extracting extremes for ', df_x_tag, ' and ', df_y_tag)
    else:
        print('ERROR: detect_extremes.find_joint_block_maxima')
-        print('       Either '+df_x_tag+' or '+df_y_tag+' or '+df_time_tag+' does not exist in dataframe')
+        print('       Either ' + df_x_tag + ' or ' + df_y_tag + ' or '
+              + df_time_tag + ' does not exist in dataframe')
        print('       Exiting...')
-        raise NameError(df_x_tag+' or '+df_y_tag+' or '+df_time_tag+' does not exist in dataframe')
-
+        raise NameError(df_x_tag + ' or ' + df_y_tag + ' or ' + df_time_tag
+                        + ' does not exist in dataframe')
+
    # Check the parsed extremes type is valid
    if extremes_type not in ['high', 'low']:
        print('ERROR: invalid extremes_type entered.')
        print('       Valid options are "high" or "low".')
        print('       Exiting...')
-        raise ValueError(extremes_type + " is not a valid option for extremes_type!")
-
+        raise ValueError(extremes_type
+                         + " is not a valid option for extremes_type!")
+
    # Select extreme selection function
    if extremes_type == "high":
        extreme_selection_func = np.nanargmax
    else:
        extreme_selection_func = np.nanargmin
-
+
    # Define the time blocks, rounding up so the last block may be short!
-    n_time_blocks = int( np.ceil( (data[df_time_tag].max()-data[df_time_tag].min()) / block_size ) )
+    n_time_blocks = int(np.ceil((data[df_time_tag].max() -
+                                 data[df_time_tag].min()) / block_size))
    time_blocks = pd.interval_range(
-        start = data[df_time_tag].iloc[0],
-        freq = block_size,
-        periods = n_time_blocks,
-        closed = "left")
-
-    empty_blocks, x_extreme_t, x_extreme, y_extreme_t, y_extreme = [], [], [], [], []
+        start=data[df_time_tag].iloc[0],
+        freq=block_size,
+        periods=n_time_blocks,
+        closed="left")
+
+    empty_blocks, x_extreme_t, x_extreme, y_extreme_t, y_extreme = \
+        [], [], [], [], []

    for block in time_blocks:
        # Loop through defined time blocks, and select
        #   a subset of the DataFrame
-        df_slice = data.loc[ (data[df_time_tag] >= block.left)
-                            & (data[df_time_tag] < block.right) ]
+        df_slice = data.loc[(data[df_time_tag] >= block.left)
+                            & (data[df_time_tag] < block.right)]

        # If there's data
        if len(df_slice) > 0:
@@ -103,8 +111,8 @@ def find_joint_block_maxima(data, df_x_tag, df_y_tag,
            #   in the slice
            x_fr = df_slice[df_x_tag].isna().sum()/len(df_slice)
            y_fr = df_slice[df_y_tag].isna().sum()/len(df_slice)
-
-            if (x_fr > md_fr) | (y_fr > x_fr) :
+
+            if (x_fr > md_fr) | (y_fr > md_fr):
                # Do not record an extreme for this interval,
                #   instead output as a missing interval
                empty_blocks.append(block)
@@ -113,12 +121,12 @@ def find_joint_block_maxima(data, df_x_tag, df_y_tag,
                x_idx = extreme_selection_func(df_slice[df_x_tag])
                x_extreme_t.append(df_slice[df_time_tag].iloc[x_idx])
                x_extreme.append(df_slice[df_x_tag].iloc[x_idx])
-
+
                y_idx = extreme_selection_func(df_slice[df_y_tag])
                y_extreme_t.append(df_slice[df_time_tag].iloc[y_idx])
                y_extreme.append(df_slice[df_y_tag].iloc[y_idx])
-
+
            else:
                empty_blocks.append(block)
-
+
    return empty_blocks, x_extreme_t, x_extreme, y_extreme_t, y_extreme
diff
--git a/src/bivariate/determine_AD_AI.py b/src/bivariate/determine_AD_AI.py index a8c875c..4dedfb8 100644 --- a/src/bivariate/determine_AD_AI.py +++ b/src/bivariate/determine_AD_AI.py @@ -4,7 +4,8 @@ @author: A R Fogg -based on R code sent by Dr Daire Healy +Calculate the extremal dependence coefficient. +Based on R code sent by Dr Daire Healy. """ import numpy as np @@ -15,13 +16,16 @@ from . import transform_uniform_margins -def plot_extremal_dependence_coefficient(x_data, y_data, x_bs_um, y_bs_um, bootstrap_n_iterations, - x_name, y_name, x_units, y_units, csize=17, + +def plot_extremal_dependence_coefficient(x_data, y_data, x_bs_um, y_bs_um, + bootstrap_n_iterations, + x_name, y_name, x_units, y_units, + fontsize=17, percentiles=[2.5, 97.5], - quantiles=np.linspace(0,0.99,100), - edc_ylims=[0,1]): + quantiles=np.linspace(0, 0.99, 100), + edc_ylims=[0, 1]): """ - Function to create a diagnostic plot to determine whether a pair of + Function to create a diagnostic plot to determine whether a pair of variables are asymptotically dependent. Parameters @@ -31,16 +35,16 @@ def plot_extremal_dependence_coefficient(x_data, y_data, x_bs_um, y_bs_um, boots y_data : np.array or pd.Series Timeseries of y parameter. x_bs_um : np.array - Bootstrapped dataset of x_data, as in output from - bootstrap_data.iterative_bootstrap_um. Of shape x_data.size x + Bootstrapped dataset of x_data, as in output from + bootstrap_data.iterative_bootstrap_um. Of shape x_data.size x bootstrap_n_iterations. y_bs_um : np.array - Bootstrapped dataset of y_data, as in output from - bootstrap_data.iterative_bootstrap_um. Of shape y_data.size x + Bootstrapped dataset of y_data, as in output from + bootstrap_data.iterative_bootstrap_um. Of shape y_data.size x bootstrap_n_iterations. bootstrap_n_iterations : int Number of iterations used to generate the bootstrapped - dataset. So, x_bs_um will be of shape x_data.size x + dataset. So, x_bs_um will be of shape x_data.size x bootstrap_n_iterations. x_name : string String name for axes labelling for x. @@ -50,10 +54,10 @@ def plot_extremal_dependence_coefficient(x_data, y_data, x_bs_um, y_bs_um, boots String name for axes labelling for x units. y_units : string String name for axes labelling for y units. - csize : int, optional + fontsize : int, optional Fontsize for text on output plot. The default is 17. percentiles : list, optional - Percentiles to evaluate the [lower, upper] + Percentiles to evaluate the [lower, upper] error at. The default is [2.5, 97.5]. quantiles : np.array, optional Quantiles to evaluate chi at. The default @@ -61,19 +65,20 @@ def plot_extremal_dependence_coefficient(x_data, y_data, x_bs_um, y_bs_um, boots edc_ylims : list, optional Bottom and top limits for the extremal dependence coefficient axis. The default is [0,1]. - + Returns ------- fig : matplotlib figure - Figure containing ax_data, ax_data_unif, ax_edc. + Figure containing ax_data, ax_data_unif, ax_edc. ax_data : matplotlib axes Axes containing a 2D histogram comparing x and y in data scale. ax_data_unif : matplotlib axes Axes containing a 2D histogram comparing x and y on uniform margins. ax_edc : matplotlib axes - Axes showing the extremal dependence coefficient as a function of quantile. - np.min(chi) : float - Minimum value of the extremal dependence coefficient. + Axes showing the extremal dependence coefficient as a function + of quantile. + chi : list + Chi as a function of parsed quantiles. chi_lower_q : np.array Value of lower limit on chi (at percentiles[0]) as a function of quantiles. 
@@ -85,79 +90,98 @@ def plot_extremal_dependence_coefficient(x_data, y_data, x_bs_um, y_bs_um, boots # Makes fig.tight_layout() and colorbar work together mpl.rcParams['figure.constrained_layout.use'] = False - + # Define plotting window for diagnostic plot - fig=plt.figure(figsize=(15,11)) - gs=GridSpec(2,2,figure=fig) + fig = plt.figure(figsize=(15, 11)) + gs = GridSpec(2, 2, figure=fig) # Define three axes to plot the data on - ax_data=fig.add_subplot(gs[0,0]) - ax_data_unif=fig.add_subplot(gs[0,1]) - ax_edc=fig.add_subplot(gs[1,:]) - + ax_data = fig.add_subplot(gs[0, 0]) + ax_data_unif = fig.add_subplot(gs[0, 1]) + ax_edc = fig.add_subplot(gs[1, :]) + # Plot the original input data - h_data=ax_data.hist2d(x_data,y_data, bins=50, density=True, norm='log') - cb_data=fig.colorbar(h_data[3],ax=ax_data) + h_data = ax_data.hist2d(x_data, y_data, bins=50, density=True, norm='log') + cb_data = fig.colorbar(h_data[3], ax=ax_data) + # Formatting - ax_data.set_xlabel(x_name+' '+x_units, fontsize=csize) - ax_data.set_ylabel(y_name+' '+y_units, fontsize=csize) + ax_data.set_xlabel(x_name + ' ' + x_units, fontsize=fontsize) + ax_data.set_ylabel(y_name + ' ' + y_units, fontsize=fontsize) for label in (ax_data.get_xticklabels() + ax_data.get_yticklabels()): - label.set_fontsize(csize) - cb_data.ax.tick_params(labelsize=csize) - cb_data.set_label("Normalised occurrence", fontsize=csize) - t_data=ax_data.text(0.06, 0.94, '(a)', transform=ax_data.transAxes, fontsize=csize, va='top', ha='left') + label.set_fontsize(fontsize) + cb_data.ax.tick_params(labelsize=fontsize) + cb_data.set_label("Normalised occurrence", fontsize=fontsize) + t_data = ax_data.text(0.06, 0.94, '(a)', transform=ax_data.transAxes, + fontsize=fontsize, va='top', ha='left') t_data.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) # Transform the variables into uniform margins - x_unif=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically(x_data,plot=False) - y_unif=transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically(y_data,plot=False) - + x_unif = transform_uniform_margins.\ + transform_from_data_scale_to_uniform_margins_empirically(x_data, + plot=False) + y_unif = transform_uniform_margins.\ + transform_from_data_scale_to_uniform_margins_empirically(y_data, + plot=False) + # Plot uniform margins data - h_unif=ax_data_unif.hist2d(x_unif, y_unif, bins=50, density=True) - cb_unif=fig.colorbar(h_unif[3],ax=ax_data_unif) + h_unif = ax_data_unif.hist2d(x_unif, y_unif, bins=50, density=True) + cb_unif = fig.colorbar(h_unif[3], ax=ax_data_unif) + # Formatting - ax_data_unif.set_xlabel(x_name+" on uniform margins", fontsize=csize) - ax_data_unif.set_ylabel(y_name+" on uniform margins", fontsize=csize) - for label in (ax_data_unif.get_xticklabels() + ax_data_unif.get_yticklabels()): - label.set_fontsize(csize) - cb_unif.ax.tick_params(labelsize=csize) - cb_unif.set_label("Normalised occurrence", fontsize=csize) - - t_unif=ax_data_unif.text(0.06, 0.94, '(b)', transform=ax_data_unif.transAxes, fontsize=csize, va='top', ha='left') + ax_data_unif.set_xlabel(x_name+" on uniform margins", fontsize=fontsize) + ax_data_unif.set_ylabel(y_name+" on uniform margins", fontsize=fontsize) + for label in (ax_data_unif.get_xticklabels() + + ax_data_unif.get_yticklabels()): + label.set_fontsize(fontsize) + cb_unif.ax.tick_params(labelsize=fontsize) + cb_unif.set_label("Normalised occurrence", fontsize=fontsize) + + t_unif = ax_data_unif.text(0.06, 0.94, '(b)', + 
transform=ax_data_unif.transAxes, + fontsize=fontsize, va='top', ha='left') t_unif.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - - # Calculate the extremal dependence coefficient (chi) over quantiles u - chi=calculate_extremal_dependence_coefficient(quantiles,x_unif,y_unif) - + + # Calculate the extremal dependence coefficient (chi) over quantiles u + chi = calculate_extremal_dependence_coefficient(quantiles, x_unif, y_unif) + # Calculate errors on chi - chi_lower_q, chi_upper_q, bootstrap_chi = calculate_upper_lower_lim_chi(quantiles, - x_bs_um, y_bs_um, bootstrap_n_iterations, percentiles=percentiles) + chi_lower_q, chi_upper_q, bootstrap_chi =\ + calculate_upper_lower_lim_chi(quantiles, x_bs_um, y_bs_um, + bootstrap_n_iterations, + percentiles=percentiles) # Plot chi as a function of quantiles - ax_edc.plot(quantiles,chi, color='black', linewidth=2.0, label='$\chi$') + ax_edc.plot(quantiles, chi, color='black', linewidth=2.0, label='$\chi$') # Plot error shade - ax_edc.fill_between(quantiles, chi_lower_q, chi_upper_q, alpha=0.5, color='grey', - label=("%.1f" % (percentiles[1]-percentiles[0]))+"% CI") + ax_edc.fill_between(quantiles, chi_lower_q, chi_upper_q, alpha=0.5, + color='grey', + label=("%.1f" % (percentiles[1]-percentiles[0])) + + "% CI") + # Formatting - ax_edc.set_xlabel("Quantiles", fontsize=csize) - ax_edc.set_ylabel("Extremal Dependence Coefficient, $\chi$", fontsize=csize) + ax_edc.set_xlabel("Quantiles", fontsize=fontsize) + ax_edc.set_ylabel("Extremal Dependence Coefficient, $\chi$", + fontsize=fontsize) for label in (ax_edc.get_xticklabels() + ax_edc.get_yticklabels()): - label.set_fontsize(csize) - t_edc=ax_edc.text(0.03, 0.94, '(c)', transform=ax_edc.transAxes, fontsize=csize, va='top', ha='left') - t_edc.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - + label.set_fontsize(fontsize) + t_edc = ax_edc.text(0.03, 0.94, '(c)', transform=ax_edc.transAxes, + fontsize=fontsize, va='top', ha='left') + t_edc.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) + # Set ylimits ax_edc.set_ylim(edc_ylims) - - ax_edc.legend(loc='lower left', fontsize=csize) + + # Legend + ax_edc.legend(loc='lower left', fontsize=fontsize) fig.tight_layout() - + return fig, ax_data, ax_data_unif, ax_edc, chi, chi_lower_q, chi_upper_q - + + def calculate_extremal_dependence_coefficient(quantiles, x_unif, y_unif): """ - Calculate the extremal dependence coefficient for + Calculate the extremal dependence coefficient for two parameters x and y provided on uniform margins. Parameters @@ -177,15 +201,20 @@ def calculate_extremal_dependence_coefficient(quantiles, x_unif, y_unif): """ - chi=[] + # Initialise empty chi list + chi = [] for i in range(quantiles.size): - top,=np.where((x_unif>quantiles[i]) & (y_unif>quantiles[i])) - bottom,=np.where(x_unif>quantiles[i]) - chi.append( (top.size)/(bottom.size) ) - + # For each quantile, calculate chi + top, = np.where((x_unif > quantiles[i]) & (y_unif > quantiles[i])) + bottom, = np.where(x_unif > quantiles[i]) + chi.append((top.size)/(bottom.size)) + return chi -def calculate_upper_lower_lim_chi(quantiles, x_bs_um, y_bs_um, bootstrap_n_iterations, percentiles=[2.5, 97.5]): + +def calculate_upper_lower_lim_chi(quantiles, x_bs_um, y_bs_um, + bootstrap_n_iterations, + percentiles=[2.5, 97.5]): """ Function to calculate the upper and lower boundary for error shade on chi, based on bootstrapped data. 
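The loop above is the empirical estimator of chi(u) = P(Y > u | X > u) on uniform margins. A quick runnable check of the limiting cases, assuming `determine_AD_AI` is imported; identical series give chi = 1 at every quantile, while independent series give chi falling towards 0 as the quantile approaches 1:

```python
import numpy as np

quantiles = np.linspace(0, 0.99, 100)
rng = np.random.default_rng(1)
u = rng.uniform(size=5000)
v = rng.uniform(size=5000)

# Perfectly dependent: every entry is 1.0
chi_dep = determine_AD_AI.calculate_extremal_dependence_coefficient(
    quantiles, u, u)
# Independent: chi(u) is approximately 1 - u
chi_ind = determine_AD_AI.calculate_extremal_dependence_coefficient(
    quantiles, u, v)
print(chi_dep[-1], chi_ind[-1])
```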
@@ -195,17 +224,17 @@ def calculate_upper_lower_lim_chi(quantiles, x_bs_um, y_bs_um, bootstrap_n_itera
     quantiles : np.array
         Quantiles over which to evaluate the error.
     x_bs_um : np.array
-        Bootstrapped x values. Of shape n_data x 
+        Bootstrapped x values. Of shape n_data x
         bootstrap_n_iterations.
     y_bs_um : np.array
-        Bootstrapped y values. Of shape n_data x 
+        Bootstrapped y values. Of shape n_data x
         bootstrap_n_iterations.
     bootstrap_n_iterations : int
         Number of iterations to calculate chi over,
         i.e. number of bootstrapped datasets provided.
     percentiles : list, optional
-        Percentiles to evaluate the [lower, upper] 
+        Percentiles to evaluate the [lower, upper]
         error at. The default is [2.5, 97.5].

     Returns
@@ -221,18 +250,25 @@ def calculate_upper_lower_lim_chi(quantiles, x_bs_um, y_bs_um, bootstrap_n_itera
         quantiles.size x bootstrap_n_iterations.

     """
-
-    print('Estimating chi over '+str(bootstrap_n_iterations)+' bootstrapped iterations - may be slow')
-    bts_chi=np.full((quantiles.size,bootstrap_n_iterations), np.nan)
+
+    print('Estimating chi over ' + str(bootstrap_n_iterations) +
+          ' bootstrapped iterations - may be slow')
+
+    # Initialise empty array for bootstrapped chi
+    bts_chi = np.full((quantiles.size, bootstrap_n_iterations), np.nan)

     # Calculate chi over each set of bootstrapped x and y
     for i in range(bootstrap_n_iterations):
-        bts_chi[:,i]=calculate_extremal_dependence_coefficient(quantiles, x_bs_um[:,i], y_bs_um[:,i])
-
-    lower_q=np.full(quantiles.size, np.nan)
-    upper_q=np.full(quantiles.size, np.nan)
+        bts_chi[:, i] = \
+            calculate_extremal_dependence_coefficient(quantiles,
+                                                      x_bs_um[:, i],
+                                                      y_bs_um[:, i])
+
+    # Initialise empty arrays for upper and lower limits
+    lower_q = np.full(quantiles.size, np.nan)
+    upper_q = np.full(quantiles.size, np.nan)
+
+    # Calculate the errors from these bootstrapped chi
     for j in range(quantiles.size):
-        lower_q[j], upper_q[j]=np.percentile(bts_chi[j,:],percentiles)
-
+        lower_q[j], upper_q[j] = np.percentile(bts_chi[j, :], percentiles)
+
     return lower_q, upper_q, bts_chi
-
\ No newline at end of file
diff --git a/src/bivariate/fit_copula_to_extremes.py b/src/bivariate/fit_copula_to_extremes.py
index 9dc75cf..03b2bec 100644
--- a/src/bivariate/fit_copula_to_extremes.py
+++ b/src/bivariate/fit_copula_to_extremes.py
@@ -3,6 +3,8 @@ Created on Wed Jul 19 13:40:02 2023

 @author: A R Fogg
+
+Functions to fit copulae to data.
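A minimal end-to-end sketch of the `copulas` API that `fit_copula_bivariate` wraps, fitting a Gumbel copula to toy extrema on uniform margins and drawing a sample; the input arrays are placeholders for real extrema such as `gevd_fitter.extremes_unif_CDF`:

```python
import numpy as np
from copulas.bivariate import Bivariate

# Placeholder extrema on uniform margins (strongly dependent toy data)
rng = np.random.default_rng(0)
x_unif = rng.uniform(size=100)
y_unif = np.clip(x_unif + rng.normal(0., 0.05, size=100), 0.001, 0.999)

copula = Bivariate(copula_type='gumbel')
copula.fit(np.array([x_unif, y_unif]).T)

# Sample from the fitted copula (also on uniform margins)
sample = copula.sample(1000)
print(sample.shape)  # (1000, 2)
```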
""" import numpy as np @@ -10,6 +12,7 @@ from copulas.bivariate import Bivariate + def fit_copula_bivariate(x_extremes, y_extremes, x_name, y_name): """ Function to fit a copula to x and y extremes @@ -31,7 +34,7 @@ def fit_copula_bivariate(x_extremes, y_extremes, x_name, y_name): fitted copula """ - + # First, check same number of extremes in x and y if np.array(x_extremes).size == np.array(y_extremes).size: print('Fitting copula to parsed extremes') @@ -39,21 +42,23 @@ def fit_copula_bivariate(x_extremes, y_extremes, x_name, y_name): print('ERROR: fit_copula_to_extremes.fit_copula_bivariate') print('x_extremes and y_extremes must have the same length') raise NameError('x_extremes and y_extremes must have the same length') - + # Format the extremes to how copulas wants them - copula_arr=np.array([x_extremes,y_extremes]).T - - # Initialise the copula - testing with gumbel + copula_arr = np.array([x_extremes, y_extremes]).T + + # Initialise the copula - using Gumbel # (options are clayton, frank, gumbel or independence) - copula=Bivariate(copula_type='gumbel') - + copula = Bivariate(copula_type='gumbel') + # Fit the copula to the extremes copula.fit(copula_arr) - + return copula -def qualitative_copula_fit_check_bivariate(x_extremes, y_extremes, x_sample, - y_sample, x_name, y_name): + +def qualitative_copula_fit_check_bivariate(x_extremes, y_extremes, + x_sample, y_sample, + x_name, y_name): """ Function to do a qualitative diagnostic plot for copula fit. @@ -80,67 +85,71 @@ def qualitative_copula_fit_check_bivariate(x_extremes, y_extremes, x_sample, Axes from fig. """ - - csize=15 - fig, ax=plt.subplots(ncols=3, figsize=(27,7)) - + + fontsize = 15 + fig, ax = plt.subplots(ncols=3, figsize=(27, 7)) + # OBSERVED EXTREMES - h_data=ax[0].hist2d(x_extremes, y_extremes, bins=[10,10], cmap='magma', cmin=1) - + h_data = ax[0].hist2d(x_extremes, y_extremes, bins=[10, 10], + cmap='magma', cmin=1) + # Some decor - ax[0].set_xlabel(x_name, fontsize=csize) - ax[0].set_ylabel(y_name, fontsize=csize) + ax[0].set_xlabel(x_name, fontsize=fontsize) + ax[0].set_ylabel(y_name, fontsize=fontsize) for label in (ax[0].get_xticklabels() + ax[0].get_yticklabels()): - label.set_fontsize(csize) + label.set_fontsize(fontsize) ax[0].set_facecolor('darkgray') - t_data=ax[0].text(0.06, 0.94, '(a)', transform=ax[0].transAxes, fontsize=csize, va='top', ha='left') + t_data = ax[0].text(0.06, 0.94, '(a)', transform=ax[0].transAxes, + fontsize=fontsize, va='top', ha='left') t_data.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - ax[0].set_title('Observed Extremes', fontsize=csize) - + ax[0].set_title('Observed Extremes', fontsize=fontsize) + # Colourbar - cbar_data=fig.colorbar(h_data[3], ax=ax[0]) - cbar_data.set_label('Occurrence', fontsize=csize) - cbar_data.ax.tick_params(labelsize=csize) + cbar_data = fig.colorbar(h_data[3], ax=ax[0]) + cbar_data.set_label('Occurrence', fontsize=fontsize) + cbar_data.ax.tick_params(labelsize=fontsize) # COPULA SAMPLE - h_sample=ax[1].hist2d(x_sample, y_sample, bins=[10,10], cmap='magma', cmin=1) - + h_sample = ax[1].hist2d(x_sample, y_sample, bins=[10, 10], cmap='magma', + cmin=1) + # Some decor - ax[1].set_xlabel(x_name, fontsize=csize) - ax[1].set_ylabel(y_name, fontsize=csize) + ax[1].set_xlabel(x_name, fontsize=fontsize) + ax[1].set_ylabel(y_name, fontsize=fontsize) for label in (ax[1].get_xticklabels() + ax[1].get_yticklabels()): - label.set_fontsize(csize) + label.set_fontsize(fontsize) ax[1].set_facecolor('darkgray') - t_data=ax[1].text(0.06, 0.94, '(b)', 
transform=ax[1].transAxes, fontsize=csize, va='top', ha='left') + t_data = ax[1].text(0.06, 0.94, '(b)', transform=ax[1].transAxes, + fontsize=fontsize, va='top', ha='left') t_data.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - ax[1].set_title('Copula Sample (data scale)', fontsize=csize) - + ax[1].set_title('Copula Sample (data scale)', fontsize=fontsize) # Colourbar - cbar_sample=fig.colorbar(h_sample[3], ax=ax[1]) - cbar_sample.set_label('Occurrence', fontsize=csize) - cbar_sample.ax.tick_params(labelsize=csize) - + cbar_sample = fig.colorbar(h_sample[3], ax=ax[1]) + cbar_sample.set_label('Occurrence', fontsize=fontsize) + cbar_sample.ax.tick_params(labelsize=fontsize) + # Both both sets of component-wise maxima as a scatter plot - ax[2].plot(x_extremes, y_extremes, marker='^', color="indigo", label='Observations', - fillstyle='none', linewidth=0.) - ax[2].plot(x_sample, y_sample, marker='*', color="darkgoldenrod", label="Copula Sample", - fillstyle='none', linewidth=0.) + ax[2].plot(x_extremes, y_extremes, marker='^', color="indigo", + label='Observations', fillstyle='none', linewidth=0.) + ax[2].plot(x_sample, y_sample, marker='*', color="darkgoldenrod", + label="Copula Sample", fillstyle='none', linewidth=0.) + # Some decor - ax[2].set_xlabel(x_name, fontsize=csize) - ax[2].set_ylabel(y_name, fontsize=csize) + ax[2].set_xlabel(x_name, fontsize=fontsize) + ax[2].set_ylabel(y_name, fontsize=fontsize) for label in (ax[2].get_xticklabels() + ax[2].get_yticklabels()): - label.set_fontsize(csize) - t_data=ax[2].text(0.06, 0.94, '(c)', transform=ax[2].transAxes, fontsize=csize, va='top', ha='left') + label.set_fontsize(fontsize) + t_data = ax[2].text(0.06, 0.94, '(c)', transform=ax[2].transAxes, + fontsize=fontsize, va='top', ha='left') t_data.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - ax[2].set_title('Comparison', fontsize=csize) - ax[2].legend(fontsize=csize) - + ax[2].set_title('Comparison', fontsize=fontsize) + ax[2].legend(fontsize=fontsize) + # Make axes limits sample as for panel0 for panels 1 and 2 ax[1].set_xlim(ax[0].get_xlim()) ax[1].set_ylim(ax[0].get_ylim()) ax[2].set_xlim(ax[0].get_xlim()) - ax[2].set_ylim(ax[0].get_ylim()) - + ax[2].set_ylim(ax[0].get_ylim()) + return fig, ax - \ No newline at end of file diff --git a/src/bivariate/gevd_fitter.py b/src/bivariate/gevd_fitter.py index dfc4e27..4dbd241 100644 --- a/src/bivariate/gevd_fitter.py +++ b/src/bivariate/gevd_fitter.py @@ -4,7 +4,7 @@ @author: A R Fogg -GEVD fit class definition +GEVD fit class definition. """ import numpy as np @@ -14,60 +14,78 @@ from . import transform_uniform_margins + class gevd_fitter(): """ A class which contains model fitting information based on input extrema. """ - - def __init__(self, extremes, dist = None): + + def __init__(self, extremes, dist=None, fit_guess={}): """ - Initialise the gevd_fitter class. Fits a GEVD or + Initialise the gevd_fitter class. Fits a GEVD or Gumbel distribution. Parameters ---------- extremes : np.array or pd.Series List of extrema to fit a model to. - dist : string + dist : string, optional Distribution to use for fitting. If == None, - best fitting distribution is chosen using + best fitting distribution is chosen using select_distribution. Valid options 'genextreme' - or 'gumbel_r'. + or 'gumbel_r'. The default is None. + fit_guess : dictionary, optional + Dictionary containing guess initial parameters + for fitting the distribution. Keys 'c' for shape, + 'scale', and 'loc' for location. The default is + {}. 
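A short sketch of how the new `fit_guess` argument might be used, assuming `gevd_fitter` is imported from this module and with placeholder parameter values; internally the shape guess is passed positionally to scipy's `fit`, with `loc` and `scale` as keywords:

```python
import numpy as np
from scipy.stats import genextreme

# Placeholder extrema drawn from a known GEVD
extremes = genextreme.rvs(-0.1, loc=10., scale=2., size=60,
                          random_state=1)

# Default: best of GEVD / Gumbel is chosen by AIC
fit = gevd_fitter(extremes)

# Forcing a GEVD, with initial parameter guesses
fit = gevd_fitter(extremes, dist='genextreme',
                  fit_guess={'c': -0.1, 'loc': 10., 'scale': 2.})
print(fit.shape_, fit.location, fit.scale)
```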
Returns ------- None. """ - + # Store the extrema self.extremes = np.array(extremes) - + # Fit a model - self.fit_model(dist = dist) - + self.fit_model(dist=dist, fit_guess=fit_guess) + # Convert extrema to uniform margins empirically - self.extremes_unif_empirical = transform_uniform_margins.transform_from_data_scale_to_uniform_margins_empirically(self.extremes) - + self.extremes_unif_empirical = transform_uniform_margins.\ + transform_from_data_scale_to_uniform_margins_empirically( + self.extremes) + # Convert extrema to uniform margins using CDF - self.extremes_unif_CDF = transform_uniform_margins.transform_from_data_scale_to_uniform_margins_using_CDF(self.extremes, self) + self.extremes_unif_CDF = transform_uniform_margins.\ + transform_from_data_scale_to_uniform_margins_using_CDF( + self.extremes, self) + + # Define dictionary of fit parameters + self.params_dict = {'c': self.shape_, + 'scale': self.scale, + 'loc': self.location} - def fit_model(self, dist = None): + def fit_model(self, dist=None, fit_guess={}): """ Fit a GEVD or Gumbel to the parsed extrema. Parameters ---------- - extremes : np.array or pd.Series - List of extrema to fit a model to. - dist : string + dist : string, optional Distribution to use for fitting. If == None, - best fitting distribution is chosen using + best fitting distribution is chosen using select_distribution. Valid options 'genextreme' - or 'gumbel_r'. - + or 'gumbel_r'. The default is None. + fit_guess : dictionary, optional + Dictionary containing guess initial parameters + for fitting the distribution. Keys 'c' for shape, + 'scale', and 'loc' for location. The default is + {}. + Returns ------- None. @@ -79,30 +97,41 @@ def fit_model(self, dist = None): else: # Use parsed distribution self.distribution_name = dist - self.distribution = genextreme if dist == 'genextreme' else gumbel_r - + self.distribution = genextreme if dist == 'genextreme' \ + else gumbel_r + # Fit model - fitted_params = self.distribution.fit(self.extremes) - + if fit_guess == {}: + fitted_params = self.distribution.fit(self.extremes) + else: + # Make a copy of fit_guess so global distribution isn't changed + fit_guess = fit_guess.copy() + args = fit_guess.pop('c') + # Different inputs for different distributions + if self.distribution_name == 'genextreme': + fitted_params = self.distribution.fit(self.extremes, args, + **fit_guess) + elif self.distribution_name == 'gumbel_r': + fitted_params = self.distribution.fit(self.extremes, + **fit_guess) # Freeze the fitted model self.frozen_dist = self.distribution(*fitted_params) - + # Assign other parameters per model if self.distribution_name == 'genextreme': self.shape_ = fitted_params[0] self.location = fitted_params[1] self.scale = fitted_params[2] self.formatted_dist_name = 'GEVD' - + elif self.distribution_name == 'gumbel_r': self.shape_ = 0. self.location = fitted_params[0] self.scale = fitted_params[1] self.formatted_dist_name = 'Gumbel' - + # Assign AIC self.aic = self.akaike_info_criterion(self.extremes, self.frozen_dist) - def select_distribution(self): """ @@ -111,38 +140,37 @@ def select_distribution(self): Parameters ---------- - extremes : np.array or pd.Series - List of extrema to fit a model to. + None. Returns ------- None. 
""" - + # Define loop lists distributions = [genextreme, gumbel_r] names = ['genextreme', 'gumbel_r'] aic_arr = [] for dist in distributions: - # Fit the model + # Fit the model params = dist.fit(self.extremes) - + # Freeze distribution frozen = dist(*params) - + # Calculate AIC aic = self.akaike_info_criterion(self.extremes, frozen) aic_arr.append(aic) - + # Find model with lowest AIC min_index, = np.where(aic_arr == np.min(aic_arr)) - + # Assign the selected distribution to the class self.distribution_name = names[min_index[0]] self.distribution = distributions[min_index[0]] - + def akaike_info_criterion(self, data, model): """ Calculate AIC. @@ -166,8 +194,9 @@ def akaike_info_criterion(self, data, model): loglikelyhood = np.sum(model.logpdf(data)) k = len(model.args) n = len(data) - + # AIC with modification for small sample size - aic = (2. * k) - (2. * loglikelyhood) + ( ( (2. * k**2) + (2. * k) ) / (n - k - 1) ) - + aic = (2. * k) - (2. * loglikelyhood) + \ + (((2. * k**2) + (2. * k)) / (n - k - 1)) + return aic diff --git a/src/bivariate/plotting_utils.py b/src/bivariate/plotting_utils.py index 5ce07f5..d71ffa0 100644 --- a/src/bivariate/plotting_utils.py +++ b/src/bivariate/plotting_utils.py @@ -2,54 +2,59 @@ """ Created on Fri Jun 14 10:55:29 2024 -@author: S J Walker, adapted by A R Fogg +@author for contour_labels: S J Walker, adapted by A R Fogg +else: A R Fogg + +Functions for plotting. """ import numpy as np -def contour_labels(contour, xpad=None, ypad=None, sides=['left', 'right'], rtol=.1, - fmt=lambda x: f'{x}', x_splits=None, y_splits=None, **text_kwargs): + +def contour_labels(contour, xpad=None, ypad=None, sides=['left', 'right'], + rtol=.1, fmt=lambda x: f'{x}', + x_splits=None, y_splits=None, **text_kwargs): """ - Function for creating labels for contour lines at the + Function for creating labels for contour lines at the edge of the subplot. - + Adapted from code by Simon J Walker see - https://github.com/08walkersj/Plotting_Tools/blob/dd8c63f1b349518ea448ee59411dba1715d5eac2/src/Plotting_Tools/Contour_labels.py + https://github.com/08walkersj/Plotting_Tools/blob/dd8c63f1b349518ea448ee59411dba1715d5eac2/src/Plotting_Tools/Contour_labels.py Parameters ---------- contour : matplotlib.contour matplotlib contour object. xpad : float/int, optional - shift the left or right labels. The default is + shift the left or right labels. The default is None and uses .15% of absolute max of limits. ypad : float/int, optional - shift the bottom or top labels. The default is + shift the bottom or top labels. The default is None and uses .15% of absolute max of limits. sides : list/string, optional - sides the labels are for. The default is + sides the labels are for. The default is ['left', 'right']. rtol : int/float, optional - Defines how close the line must be to the axes - limit for a label to be made. The default is .1, - which is 10% of the limit away. Passed to + Defines how close the line must be to the axes + limit for a label to be made. The default is .1, + which is 10% of the limit away. Passed to numpy.isclose and atol=0. fmt : definition, optional - format for the label string The default is + format for the label string The default is lambda x: f'{x}'. x_splits : list/str, optional - Used to define the area where labels should be - found. The default is None. This is useful of - the contour line intersects the axes limits in - more than one place. Can define a string as - positive (only look at the postive side of the - x axis) or negative. 
Alternatively string can - be provided as e.g. 'x>10' or any string with - boolean design where only x is used. If multiple - sides will be done to all if x_splits is a list + Used to define the area where labels should be + found. The default is None. This is useful of + the contour line intersects the axes limits in + more than one place. Can define a string as + positive (only look at the postive side of the + x axis) or negative. Alternatively string can + be provided as e.g. 'x>10' or any string with + boolean design where only x is used. If multiple + sides will be done to all if x_splits is a list is not provided. Use None to not activate this. y_splits : TYPE, optional - Same as x_splits but for the y axis. The default + Same as x_splits but for the y axis. The default is None. **text_kwargs : dictionary kwargs to be passed to matplotlib.axes.text. @@ -62,110 +67,121 @@ def contour_labels(contour, xpad=None, ypad=None, sides=['left', 'right'], rtol= Returns ------- list - list of text objects. If more than one side - provided then will be a list of lists where - the first list is a list of text objects + list of text objects. If more than one side + provided then will be a list of lists where + the first list is a list of text objects corresponding to the first side provided etc. """ ax = contour.axes if xpad is None: - xpad= np.max(np.abs(ax.get_xlim()))*.015 + xpad = np.max(np.abs(ax.get_xlim()))*.015 if ypad is None: - ypad= np.max(np.abs(ax.get_ylim()))*.015 + ypad = np.max(np.abs(ax.get_ylim()))*.015 try: if isinstance(sides, (str, np.str)): - sides= [sides] + sides = [sides] except AttributeError: if isinstance(sides, (str)): - sides= [sides] + sides = [sides] if x_splits is None or isinstance(x_splits, (str, np.str)): - x_splits= [x_splits]*len(sides) + x_splits = [x_splits]*len(sides) if y_splits is None or isinstance(y_splits, (str, np.str)): - y_splits= [y_splits]*len(sides) - labels= [[] for i in range(len(sides))] + y_splits = [y_splits]*len(sides) + labels = [[] for i in range(len(sides))] for collection, level in zip(contour.collections, contour.levels): if not len(collection.get_paths()): continue - for i, (side, x_split, y_split) in enumerate(zip(sides, x_splits, y_splits)): - x, y= np.concatenate([path.vertices for path in collection.get_paths()], axis=0).T - if side=='left': - if y_split=='negative': - x= x[y<0] - y= y[y<0] - elif y_split=='positive': - x= x[y>0] - y= y[y>0] - elif not y_split is None and ('<' in y_split or '>' in y_split): - x= x[eval(y_split)] - y= y[eval(y_split)] - if not np.any(np.isclose(x, ax.get_xlim()[0], atol=0, rtol=rtol)): + for i, (side, x_split, y_split) in enumerate(zip(sides, + x_splits, y_splits)): + x, y = np.concatenate([path.vertices for path in + collection.get_paths()], axis=0).T + if side == 'left': + if y_split == 'negative': + x = x[y < 0] + y = y[y < 0] + elif y_split == 'positive': + x = x[y > 0] + y = y[y > 0] + elif not y_split is None and ('<' in y_split or + '>' in y_split): + x = x[eval(y_split)] + y = y[eval(y_split)] + if not np.any(np.isclose(x, ax.get_xlim()[0], atol=0, + rtol=rtol)): continue - y= y[np.isclose(x, ax.get_xlim()[0], atol=0, rtol=rtol)] - x= x[np.isclose(x, ax.get_xlim()[0], atol=0, rtol=rtol)] - y= y[np.argmin(abs(x-ax.get_xlim()[0]))] - x= x[np.argmin(abs(x-ax.get_xlim()[0]))] - x= ax.get_xlim()[0] - Xpad=xpad*-1 - Ypad=0 - elif side=='right': - if y_split=='negative': - x= x[y<0] - y= y[y<0] - elif y_split=='positive': - x= x[y>0] - y= y[y>0] - elif not y_split is None and ('<' in y_split or 
'>' in y_split): - x= x[eval(y_split)] - y= y[eval(y_split)] - if not np.any(np.isclose(x, ax.get_xlim()[1], atol=0, rtol=rtol)): + y = y[np.isclose(x, ax.get_xlim()[0], atol=0, rtol=rtol)] + x = x[np.isclose(x, ax.get_xlim()[0], atol=0, rtol=rtol)] + y = y[np.argmin(abs(x-ax.get_xlim()[0]))] + x = x[np.argmin(abs(x-ax.get_xlim()[0]))] + x = ax.get_xlim()[0] + Xpad = xpad*-1 + Ypad = 0 + elif side == 'right': + if y_split == 'negative': + x = x[y < 0] + y = y[y < 0] + elif y_split == 'positive': + x = x[y > 0] + y = y[y > 0] + elif not y_split is None and ('<' in y_split or + '>' in y_split): + x = x[eval(y_split)] + y = y[eval(y_split)] + if not np.any(np.isclose(x, ax.get_xlim()[1], + atol=0, rtol=rtol)): continue - y= y[np.isclose(x, ax.get_xlim()[1], atol=0, rtol=rtol)] - x= x[np.isclose(x, ax.get_xlim()[1], atol=0, rtol=rtol)] - y= y[np.argmin(abs(x-ax.get_xlim()[1]))] - x= x[np.argmin(abs(x-ax.get_xlim()[1]))] - x= ax.get_xlim()[1] - Xpad=xpad - Ypad=0 - elif side=='bottom': - if x_split=='negative': - y= y[x<0] - x= x[x<0] - elif x_split=='positive': - y= y[x>0] - x= x[x>0] - elif not x_split is None and ('<' in x_split or '>' in x_split): - y= y[eval(x_split)] - x= x[eval(x_split)] - if not np.any(np.isclose(y, ax.get_ylim()[0], atol=0, rtol=rtol)): + y = y[np.isclose(x, ax.get_xlim()[1], atol=0, rtol=rtol)] + x = x[np.isclose(x, ax.get_xlim()[1], atol=0, rtol=rtol)] + y = y[np.argmin(abs(x-ax.get_xlim()[1]))] + x = x[np.argmin(abs(x-ax.get_xlim()[1]))] + x = ax.get_xlim()[1] + Xpad = xpad + Ypad = 0 + elif side == 'bottom': + if x_split == 'negative': + y = y[x < 0] + x = x[x < 0] + elif x_split == 'positive': + y = y[x > 0] + x = x[x > 0] + elif not x_split is None and ('<' in x_split or + '>' in x_split): + y = y[eval(x_split)] + x = x[eval(x_split)] + if not np.any(np.isclose(y, ax.get_ylim()[0], + atol=0, rtol=rtol)): continue - x= x[np.isclose(y, ax.get_ylim()[0], atol=0, rtol=rtol)] - y= y[np.isclose(y, ax.get_ylim()[0], atol=0, rtol=rtol)] - x= x[np.argmin(abs(y-ax.get_ylim()[0]))] - y= y[np.argmin(abs(y-ax.get_ylim()[0]))] - y= ax.get_ylim()[0] - Xpad=0 - Ypad=ypad*-1 - elif side=='top': - if x_split=='negative': - y= y[x<0] - x= x[x<0] - elif x_split=='positive': - y= y[x>0] - x= x[x>0] - elif not x_split is None and ('<' in x_split or '>' in x_split): - y= y[eval(x_split)] - x= x[eval(x_split)] - if not np.any(np.isclose(y, ax.get_ylim()[1], atol=0, rtol=rtol)): + x = x[np.isclose(y, ax.get_ylim()[0], atol=0, rtol=rtol)] + y = y[np.isclose(y, ax.get_ylim()[0], atol=0, rtol=rtol)] + x = x[np.argmin(abs(y-ax.get_ylim()[0]))] + y = y[np.argmin(abs(y-ax.get_ylim()[0]))] + y = ax.get_ylim()[0] + Xpad = 0 + Ypad = ypad*-1 + elif side == 'top': + if x_split == 'negative': + y = y[x < 0] + x = x[x < 0] + elif x_split == 'positive': + y = y[x > 0] + x = x[x > 0] + elif not x_split is None and ('<' in x_split or + '>' in x_split): + y = y[eval(x_split)] + x = x[eval(x_split)] + if not np.any(np.isclose(y, ax.get_ylim()[1], + atol=0, rtol=rtol)): continue - x= x[np.isclose(y, ax.get_ylim()[1], atol=0, rtol=rtol)] - y= y[np.isclose(y, ax.get_ylim()[1], atol=0, rtol=rtol)] - x= x[np.argmin(abs(y-ax.get_ylim()[1]))] - y= y[np.argmin(abs(y-ax.get_ylim()[1]))] - y= ax.get_ylim()[1] - Xpad=0 - Ypad=ypad - labels[i].append(ax.text(x+Xpad, y+Ypad, fmt(level), **text_kwargs)) - if len(sides)==1: + x = x[np.isclose(y, ax.get_ylim()[1], atol=0, rtol=rtol)] + y = y[np.isclose(y, ax.get_ylim()[1], atol=0, rtol=rtol)] + x = x[np.argmin(abs(y-ax.get_ylim()[1]))] + y = 
y[np.argmin(abs(y-ax.get_ylim()[1]))] + y = ax.get_ylim()[1] + Xpad = 0 + Ypad = ypad + labels[i].append(ax.text(x+Xpad, y+Ypad, fmt(level), + **text_kwargs)) + if len(sides) == 1: return labels[0] - return labels \ No newline at end of file + return labels diff --git a/src/bivariate/qq_plot.py b/src/bivariate/qq_plot.py index b195418..fa88833 100644 --- a/src/bivariate/qq_plot.py +++ b/src/bivariate/qq_plot.py @@ -5,35 +5,37 @@ @author: A R Fogg Based on the private code found on github -https://github.com/arfogg/qq_plot +https://github.com/arfogg/qq_plot. +Functions to create QQ axes under various conditions. """ import numpy as np from scipy.stats import linregress -def qq_data_vs_data(x_dist, y_dist, ax, quantiles=np.linspace(0,100,101), - marker='^', fillstyle='none', color='darkcyan', title='', - fit_linear=False, legend_pos='upper left'): + +def qq_data_vs_data(x_dist, y_dist, ax, quantiles=np.linspace(0, 100, 101), + marker='^', fillstyle='none', color='darkcyan', title='', + fit_linear=False, legend_pos='upper left'): """ Takes in two datasets and plots a Quantile-Quantile plot comparing the two distributions. Parameters ---------- - x_dist : np.array or list + x_dist : np.array or list Set of observations forming x distribution. Will be on the x axis. - y_dist : TYPE + y_dist : np.array or list Set of observations forming y distribution. Will be on the y axis. ax : matplotlib axis object Axis object to do the plotting on. quantiles : np.array, optional - Quantiles to evaluate the distributions at for + Quantiles to evaluate the distributions at for the QQ plot. The default is np.linspace(0,100,101). marker : string, optional - Markerstyle for the points, feeds into ax.plot. For + Markerstyle for the points, feeds into ax.plot. For valid options see matplotlib.markers. The default is '^'. fillstyle : string, optional Fillstyle for the markers. The default is 'none'. @@ -42,44 +44,44 @@ def qq_data_vs_data(x_dist, y_dist, ax, quantiles=np.linspace(0,100,101), title : string, optional Title for the axis. The default is ''. fit_linear : bool, optional - If fit_linear==True, a linear fit to the quantiles + If fit_linear==True, a linear fit to the quantiles is calculated and overplotted. The default is False. legend_pos : string, optional - Position to place the legend. The default is 'upper + Position to place the legend. The default is 'upper left'. Returns ------- - ax + ax : matplotlib axis object Matplotlib axis object containing the plotting. 
""" - + print('Welcome to the Quantile-Quantile plotting program') - # Calculate the quantiles - x_q=np.nanpercentile(x_dist, quantiles) - y_q=np.nanpercentile(y_dist, quantiles) + x_q = np.nanpercentile(x_dist, quantiles) + y_q = np.nanpercentile(y_dist, quantiles) # Plot the quantiles - ax.plot(x_q,y_q, linewidth=0.0, - marker=marker, fillstyle=fillstyle, color=color, label='QQ') + ax.plot(x_q, y_q, linewidth=0.0, marker=marker, fillstyle=fillstyle, + color=color, label='QQ') - # Draw a y=x line min_value = min([min(ax.get_xlim()), min(ax.get_ylim())]) max_value = max([max(ax.get_xlim()), max(ax.get_ylim())]) - - ax.plot( [min_value, max_value], [min_value, max_value], linewidth=1.0, linestyle='--', color='black', label='y=x') - + ax.plot([min_value, max_value], [min_value, max_value], + linewidth=1.0, linestyle='--', color='black', label='y=x') + + # Add the title ax.set_title(title) - + # Fit a linear trend to the data if fit_linear: - lin_fit=linregress(x_q,y_q) - ax.plot(x_q, lin_fit.intercept + lin_fit.slope*x_q, color=color, linewidth=1.0, - label=str(float('%.4g' % lin_fit.slope))+'x + '+ str(float('%.4g' % lin_fit.intercept))) + lin_fit = linregress(x_q, y_q) + ax.plot(x_q, lin_fit.intercept + lin_fit.slope*x_q, color=color, + linewidth=1.0, label=str(float('%.4g' % lin_fit.slope)) + + 'x + ' + str(float('%.4g' % lin_fit.intercept))) ax.legend(loc=legend_pos) print('Returning QQ axis') return ax @@ -87,10 +89,10 @@ def qq_data_vs_data(x_dist, y_dist, ax, quantiles=np.linspace(0,100,101), ax.legend(loc=legend_pos) print('Returning Q-Q axis') return ax - - -def qq_data_vs_model(ax, extrema_ds, extrema_um_empirical, gevd_fitter, - marker='^', fillstyle='none', color='darkcyan', title='', + + +def qq_data_vs_model(ax, extrema_ds, extrema_um_empirical, gevd_fitter, + marker='^', fillstyle='none', color='darkcyan', title='', legend_pos='center left'): """ Function to generate a QQ plot comparing a GEVD model with @@ -108,7 +110,7 @@ def qq_data_vs_model(ax, extrema_ds, extrema_um_empirical, gevd_fitter, gevd_fitter : gevd_fitter class Object containing fitting information. marker : string, optional - Markerstyle for the points, feeds into ax.plot. For + Markerstyle for the points, feeds into ax.plot. For valid options see matplotlib.markers. The default is '^'. fillstyle : string, optional Fillstyle for the markers. The default is 'none'. @@ -117,8 +119,8 @@ def qq_data_vs_model(ax, extrema_ds, extrema_um_empirical, gevd_fitter, title : string, optional Title for the axis. The default is ''. legend_pos : string, optional - Position to place the legend. The default is 'center - left'. + Position to place the legend. The default is 'center + left'. Returns ------- @@ -126,22 +128,24 @@ def qq_data_vs_model(ax, extrema_ds, extrema_um_empirical, gevd_fitter, Axis containing the QQ plot. 
""" - + print('Welcome to the Quantile-Quantile plotting program') - + + # Generate model QQ points model_qq = gevd_fitter.frozen_dist.ppf(extrema_um_empirical) - ax.plot(extrema_ds, model_qq, linewidth=0.0, - marker=marker, fillstyle=fillstyle, color=color, label='QQ') - + # Plot QQ points + ax.plot(extrema_ds, model_qq, linewidth=0.0, marker=marker, + fillstyle=fillstyle, color=color, label='QQ') + # Draw a y=x line min_value = min([min(ax.get_xlim()), min(ax.get_ylim())]) max_value = max([max(ax.get_xlim()), max(ax.get_ylim())]) - - ax.plot( [min_value, max_value], [min_value, max_value], linewidth=1.0, linestyle='--', color='black', label='y=x') - - # Decor + ax.plot([min_value, max_value], [min_value, max_value], linewidth=1.0, + linestyle='--', color='black', label='y=x') + + # Title and legend ax.set_title(title) ax.legend(loc=legend_pos) - - return ax \ No newline at end of file + + return ax diff --git a/src/bivariate/return_period_plot_1d.py b/src/bivariate/return_period_plot_1d.py index 5121fb6..cae9253 100644 --- a/src/bivariate/return_period_plot_1d.py +++ b/src/bivariate/return_period_plot_1d.py @@ -3,30 +3,61 @@ Created on Thu Mar 21 10:31:24 2024 @author: A R Fogg + +Functions to visualise return periods and values. """ import numpy as np import pandas as pd -from scipy.stats import genextreme -from scipy.stats import gumbel_r from scipy import stats -def return_period_plot(gevd_fitter, bootstrap_gevd_fit, block_size, data_tag, data_units_fm, ax, - csize=15, line_color='darkcyan', ci_percentiles=[2.5, 97.5]): +def return_period_plot(gevd_fitter, bootstrap_gevd_fit, block_size, data_tag, + data_units_fm, ax, csize=15, line_color='darkcyan', + ci_percentiles=[2.5, 97.5]): """ Function to generate a return period plot. Parameters ---------- gevd_fitter : gevd_fitter class - Object containing GEVD fitting information. + Object containing GEVD fitting information for the true data. + Contains attributes listed below. See gevd_fitter.py for + definition. + extremes : np.array + List of extrema the model is fit to. + extremes_unif_empirical : np.array + Parsed extrema converted to uniform margins empirically. + extremes_unif_CDF : np.array + Parsed extrema converted to uniform margins using the fitted + GEVD or Gumbel distribution. + distribution_name : string + String name of fitted distribution. Either 'genextreme' or + 'gumbel_r'. + distribution : scipy.rv_continuous + Empty / generic distribution of selected distribution. + frozen_dist : frozen scipy.rv_continuous + Frozen version of fitted distribution. + shape_ : float + Fitted shape parameter. + location : float + Fitted location parameter. + scale : float + Fitted scale parameter. + formatted_dist_name : string + A formatted version of distribution name for plot labels. + aic : float + Akaike Information Criterion for fit. + fit_guess : dictionary + Dictionary containing guess initial parameters + for fitting the distribution. Keys 'c' for shape, + 'scale', and 'loc' for location. bootstrap_gevd_fit : bootstrap_gevd_fit class see bootstrap_gevd_fit.py. Contains attributes listed below. bs_data : np.array - Bootstrapped extrema of dimensions n_extrema x + Bootstrapped extrema of dimensions n_extrema x n_ci_iterations. n_extrema : int Number of true extrema. @@ -64,7 +95,7 @@ def return_period_plot(gevd_fitter, bootstrap_gevd_fit, block_size, data_tag, da line. The default is 'darkcyan'. ci_percentiles : list, optional Pair of floats which define the upper and lower - percentiles of the confidence interval shade. 
+ percentiles of the confidence interval shade. The default is [2.5, 97.5]. Returns @@ -73,36 +104,49 @@ def return_period_plot(gevd_fitter, bootstrap_gevd_fit, block_size, data_tag, da Return period plot. """ - + print('Creating a Return Period plot') - + # Plot observed extremes as a function of their return period - empirical_return_period=calculate_return_period_empirical(gevd_fitter.extremes, block_size) - ax.plot(empirical_return_period, gevd_fitter.extremes, linewidth=0.0, marker='^', fillstyle='none', color='black', label='Observations') - + empirical_return_period = \ + calculate_return_period_empirical(gevd_fitter.extremes, block_size) + ax.plot(empirical_return_period, gevd_fitter.extremes, linewidth=0.0, + marker='^', fillstyle='none', color='black', label='Observations') + # Overplot the model return value as a function of period - model_data=np.linspace(np.nanmin(gevd_fitter.extremes)*0.8,np.nanmax(gevd_fitter.extremes)*1.2,201) - model_return_period=calculate_return_period_CDF(model_data, gevd_fitter, block_size) - ax.plot(model_return_period, model_data, linewidth=1.5, color=line_color, label=gevd_fitter.formatted_dist_name) - + model_data = np.linspace(np.nanmin(gevd_fitter.extremes)*0.8, + np.nanmax(gevd_fitter.extremes)*1.2, 201) + model_return_period = calculate_return_period_CDF(model_data, gevd_fitter, + block_size) + ax.plot(model_return_period, model_data, linewidth=1.5, color=line_color, + label=gevd_fitter.formatted_dist_name) + # Overplot confidence interval - plot_ind,=np.where(bootstrap_gevd_fit.periods <= np.max(model_return_period)) - percentiles=np.percentile(bootstrap_gevd_fit.levels[plot_ind,], ci_percentiles, axis=1).T - ax.plot(bootstrap_gevd_fit.periods[plot_ind], percentiles, color='grey', linestyle="--", linewidth=1.0) - ax.fill_between(bootstrap_gevd_fit.periods[plot_ind],percentiles[:,0],percentiles[:,1],color='grey',alpha=0.5, - label=str(ci_percentiles[1]-ci_percentiles[0])+'% CI') - + plot_ind, = np.where(bootstrap_gevd_fit.periods <= + np.max(model_return_period)) + percentiles = np.percentile(bootstrap_gevd_fit.levels[plot_ind, ], + ci_percentiles, axis=1).T + ax.plot(bootstrap_gevd_fit.periods[plot_ind], percentiles, color='grey', + linestyle="--", linewidth=1.0) + ax.fill_between(bootstrap_gevd_fit.periods[plot_ind], percentiles[:, 0], + percentiles[:, 1], color='grey', alpha=0.5, + label=str(ci_percentiles[1]-ci_percentiles[0]) + '% CI') + # Some decor ax.set_xlabel('Return Period (years)') - ax.set_ylabel(str(data_tag)+' observed at least once\nper return period ('+str(data_units_fm)+')') + ax.set_ylabel(str(data_tag) + + ' observed at least once\nper return period (' + + str(data_units_fm) + ')') ax.set_xscale('log') - ax.legend(loc='lower right') + ax.legend(loc='lower right') ax.set_title('Return Period Plot') return ax -def return_period_table_ax(ax, gevd_fitter, block_size, data_units_fm, bootstrap_gevd_fit, - periods=np.array([2, 5, 10, 15, 20, 25, 50, 100]), + +def return_period_table_ax(ax, gevd_fitter, block_size, data_units_fm, + bootstrap_gevd_fit, + periods=np.array([2, 5, 10, 15, 20, 25, 50, 100]), ci_percentiles=[2.5, 97.5]): """ Function to generate a table of return levels, at @@ -113,7 +157,37 @@ def return_period_table_ax(ax, gevd_fitter, block_size, data_units_fm, bootstrap ax : matplotlib axis object The axes to plot the return period table onto. gevd_fitter : gevd_fitter class - Object containing GEVD fitting information. + Object containing GEVD fitting information for the true data. + Contains attributes listed below. 
See gevd_fitter.py for + definition. + extremes : np.array + List of extrema the model is fit to. + extremes_unif_empirical : np.array + Parsed extrema converted to uniform margins empirically. + extremes_unif_CDF : np.array + Parsed extrema converted to uniform margins using the fitted + GEVD or Gumbel distribution. + distribution_name : string + String name of fitted distribution. Either 'genextreme' or + 'gumbel_r'. + distribution : scipy.rv_continuous + Empty / generic distribution of selected distribution. + frozen_dist : frozen scipy.rv_continuous + Frozen version of fitted distribution. + shape_ : float + Fitted shape parameter. + location : float + Fitted location parameter. + scale : float + Fitted scale parameter. + formatted_dist_name : string + A formatted version of distribution name for plot labels. + aic : float + Akaike Information Criterion for fit. + fit_guess : dictionary + Dictionary containing guess initial parameters + for fitting the distribution. Keys 'c' for shape, + 'scale', and 'loc' for location. block_size : pd.Timedelta Size chosen for block maxima selection. data_units_fm : string @@ -122,7 +196,7 @@ def return_period_table_ax(ax, gevd_fitter, block_size, data_units_fm, bootstrap see bootstrap_gevd_fit.py. Contains attributes listed below. bs_data : np.array - Bootstrapped extrema of dimensions n_extrema x + Bootstrapped extrema of dimensions n_extrema x n_ci_iterations. n_extrema : int Number of true extrema. @@ -146,11 +220,11 @@ def return_period_table_ax(ax, gevd_fitter, block_size, data_units_fm, bootstrap Fitted scale parameter for each bootstrap. periods : np.array, optional Array of return periods (in years) to evaluate - the return level at. - The default is np.array([2, 5, 10, 15, 20, 25, 50, 100]). + the return level at. The default is + np.array([2, 5, 10, 15, 20, 25, 50, 100]). ci_percentiles : list, optional Pair of floats which define the upper and lower - percentiles of the confidence interval shade. + percentiles of the confidence interval shade. The default is [2.5, 97.5]. 
Returns @@ -160,27 +234,34 @@ def return_period_table_ax(ax, gevd_fitter, block_size, data_units_fm, bootstrap """ print('Creating a table of return periods and values') - + # Calculate return level levels = calculate_return_value_CDF(periods, gevd_fitter, block_size) - + # Confidence intervals - period_ind=np.full(periods.size, np.nan) + period_ind = np.full(periods.size, np.nan) for i in range(period_ind.size): - period_ind[i],=np.where(bootstrap_gevd_fit.periods == periods[i]) - period_ind=period_ind.astype('int') - percentiles=np.percentile(bootstrap_gevd_fit.levels[period_ind,], ci_percentiles, axis=1).T - - table_df=pd.DataFrame({"period\n(years)":periods, - "level\n("+data_units_fm+")":["%.2f" % v for v in levels], - "-"+str("%.0f" % (ci_percentiles[1]-ci_percentiles[0]))+"% CI":["%.2f" % v for v in percentiles[:,0]], - "+"+str("%.0f" % (ci_percentiles[1]-ci_percentiles[0]))+"% CI":["%.2f" % v for v in percentiles[:,1]] - }) - - table = ax.table(cellText=table_df.values, colLabels=table_df.columns, loc='center') - table.scale(1,2) + period_ind[i], = np.where(bootstrap_gevd_fit.periods == periods[i]) + period_ind = period_ind.astype('int') + percentiles = np.percentile(bootstrap_gevd_fit.levels[period_ind, ], + ci_percentiles, axis=1).T + + table_df = pd.DataFrame({"period\n(years)": periods, + "level\n(" + data_units_fm+")": + ["%.2f" % v for v in levels], + "-" + str("%.0f" % (ci_percentiles[1] - + ci_percentiles[0])) + "% CI": + ["%.2f" % v for v in percentiles[:, 0]], + "+" + str("%.0f" % (ci_percentiles[1] - + ci_percentiles[0])) + "% CI": + ["%.2f" % v for v in percentiles[:, 1]] + }) + + table = ax.table(cellText=table_df.values, colLabels=table_df.columns, + loc='center') + table.scale(1, 2) ax.axis('off') - + return ax @@ -188,23 +269,18 @@ def calculate_return_period_empirical(data, block_size): """ Function to calculate the return period of provided extrema based on exceedance probability. - + Pr_exceedance = 1 - (rank / (n + 1) ) rank - the ranking of the ordered data n - number of data points - + tau = 1 / (Pr_exceedance * n_ex_per_year) - - !!!!!!!!!!!!!!! for theory. Parameters ---------- data : np.array Extrema / y values for return period plot. - fit_params : pd.DataFrame - Contains columns as returned by - fit_model_to_extremes.fit_gevd_or_gumbel. block_size : pd.Timedelta Size chosen for block maxima selection. @@ -214,27 +290,27 @@ def calculate_return_period_empirical(data, block_size): Return period for data in years. """ - + # Calculate the number of extrema per year based on block_size - extrema_per_year=pd.to_timedelta("365.2425D")/block_size - + extrema_per_year = pd.to_timedelta("365.2425D")/block_size + # Rank the data - rank=stats.rankdata(data) - + rank = stats.rankdata(data) + # Calculate exceedance probability - exceedance_prob=1.-((rank)/(data.size + 1.0)) - + exceedance_prob = 1. - ((rank)/(data.size + 1.0)) + # Calculate Return Period - tau=(1.0/(exceedance_prob*extrema_per_year)) + tau = (1.0/(exceedance_prob*extrema_per_year)) return tau + def calculate_return_period_CDF(data, gevd_fitter, block_size): - """ Function to calculate the return period of provided extrema based on CDF model. - + See Coles 2001 pg 81-82 for theory. Parameters @@ -243,7 +319,37 @@ def calculate_return_period_CDF(data, gevd_fitter, block_size): Extrema / y values for return period plot. gevd_fitter : gevd_fitter class - Object containing GEVD fitting information. + Object containing GEVD fitting information for the true data. + Contains attributes listed below. 
See gevd_fitter.py for + definition. + extremes : np.array + List of extrema the model is fit to. + extremes_unif_empirical : np.array + Parsed extrema converted to uniform margins empirically. + extremes_unif_CDF : np.array + Parsed extrema converted to uniform margins using the fitted + GEVD or Gumbel distribution. + distribution_name : string + String name of fitted distribution. Either 'genextreme' or + 'gumbel_r'. + distribution : scipy.rv_continuous + Empty / generic distribution of selected distribution. + frozen_dist : frozen scipy.rv_continuous + Frozen version of fitted distribution. + shape_ : float + Fitted shape parameter. + location : float + Fitted location parameter. + scale : float + Fitted scale parameter. + formatted_dist_name : string + A formatted version of distribution name for plot labels. + aic : float + Akaike Information Criterion for fit. + fit_guess : dictionary + Dictionary containing guess initial parameters + for fitting the distribution. Keys 'c' for shape, + 'scale', and 'loc' for location. block_size : pd.Timedelta Size chosen for block maxima selection. @@ -253,14 +359,17 @@ def calculate_return_period_CDF(data, gevd_fitter, block_size): Return period for data in years. """ - + # Calculate the number of extrema per year based on block_size - extrema_per_year=pd.to_timedelta("365.2425D")/block_size - - tau = 1.0 / ( (extrema_per_year) * (1.0 - gevd_fitter.frozen_dist.cdf(data)) ) + extrema_per_year = pd.to_timedelta("365.2425D")/block_size + + # Calculate tau + tau = 1.0 / ((extrema_per_year) * + (1.0 - gevd_fitter.frozen_dist.cdf(data))) return tau + def calculate_return_value_CDF(periods, gevd_fitter, block_size): """ Function to calculate the return levels based on a list of @@ -272,7 +381,37 @@ def calculate_return_value_CDF(periods, gevd_fitter, block_size): The return periods (in years) to evaluate the return levels over. gevd_fitter : gevd_fitter class - Object containing GEVD fitting information. + Object containing GEVD fitting information for the true data. + Contains attributes listed below. See gevd_fitter.py for + definition. + extremes : np.array + List of extrema the model is fit to. + extremes_unif_empirical : np.array + Parsed extrema converted to uniform margins empirically. + extremes_unif_CDF : np.array + Parsed extrema converted to uniform margins using the fitted + GEVD or Gumbel distribution. + distribution_name : string + String name of fitted distribution. Either 'genextreme' or + 'gumbel_r'. + distribution : scipy.rv_continuous + Empty / generic distribution of selected distribution. + frozen_dist : frozen scipy.rv_continuous + Frozen version of fitted distribution. + shape_ : float + Fitted shape parameter. + location : float + Fitted location parameter. + scale : float + Fitted scale parameter. + formatted_dist_name : string + A formatted version of distribution name for plot labels. + aic : float + Akaike Information Criterion for fit. + fit_guess : dictionary + Dictionary containing guess initial parameters + for fitting the distribution. Keys 'c' for shape, + 'scale', and 'loc' for location. block_size : pd.Timedelta Size chosen for block maxima selection. @@ -282,35 +421,56 @@ def calculate_return_value_CDF(periods, gevd_fitter, block_size): Calculated return levels. 
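`calculate_return_value_CDF` inverts the fitted CDF: with m extrema per year, the level for a return period of tau years is the 1 - 1/(m * tau) quantile. A worked sketch with a frozen scipy GEVD standing in for `gevd_fitter.frozen_dist`:

```python
import numpy as np
import pandas as pd
from scipy.stats import genextreme

# Annual block maxima, so one extremum per year
block_size = pd.to_timedelta("365.2425D")
extrema_per_year = pd.to_timedelta("365.2425D") / block_size

frozen_dist = genextreme(-0.1, loc=10., scale=2.)
periods = np.array([2, 10, 100])

# level = ppf(1 - 1 / (periods * extrema_per_year))
levels = frozen_dist.ppf(1. - 1. / (periods * extrema_per_year))
print(levels)  # return level grows with return period
```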
""" - - extrema_per_year=pd.to_timedelta("365.2425D")/block_size - - return_periods=periods*extrema_per_year - + + # Calculate the number of extrema each year + extrema_per_year = pd.to_timedelta("365.2425D")/block_size + + # Calculate return period + return_periods = periods*extrema_per_year + + # Calculate return level levels = gevd_fitter.frozen_dist.ppf(1.-(1./return_periods)) - + return levels -def return_level_bootstrapped_data(bootstrap_gevd_fit, n_bootstrap, return_periods): + +def return_level_bootstrapped_data(bootstrap_gevd_fit, n_bootstrap, + return_periods): """ - + Calculate return levels for bootstrapped data. Parameters ---------- - bs_data : np.array - Bootstrapped extrema. Of shape number - of extrema x n_bootstrap. + bootstrap_gevd_fit : bootstrap_gevd_fit class + see bootstrap_gevd_fit.py. Contains attributes + listed below. + bs_data : np.array + Bootstrapped extrema of dimensions n_extrema x + n_ci_iterations. + n_extrema : int + Number of true extrema. + n_ci_iterations : int + Number of bootstraps. + distribution_name : str + Name of fitted distribution. Valid options + 'genextreme' and 'gumbel_r'. + block_size : pd.Timedelta + Size of block for maxima detection. + periods : np.array + Return periods for calculated levels. + levels : np.array + Return levels per period per bootstrap. Of shape + periods.size x n_ci_iterations. + shape_ : np.array + Fitted shape parameter for each bootstrap. + location : np.array + Fitted location parameter for each bootstrap. + scale : np.array + Fitted scale parameter for each bootstrap. n_bootstrap : int - Number of desired bootstraps. - distribution_name : string - Distribution to fit to for bootstraps. - block_size : pd.Timedelta - Size over which block maxima have been found, - e.g. pd.to_timedelta("365.2425D"). + Number of desired bootstraps. return_periods : np.array Return periods to calculate return levels at. - true_fit_params : gevd_fitter class - See gevd_fitter.py. 
Returns ------- @@ -326,53 +486,24 @@ def return_level_bootstrapped_data(bootstrap_gevd_fit, n_bootstrap, return_perio """ - # # For GEVD distribution - # if distribution_name == 'genextreme': - # shape_=np.full(n_bootstrap, np.nan) - # location=np.full(n_bootstrap, np.nan) - # scale=np.full(n_bootstrap, np.nan) - # # Fit GEVD for each bootstrap iteration - # for i in range(n_bootstrap): - # shape_[i],location[i],scale[i]=genextreme.fit(bs_data[:,i], - # true_fit_params.shape_, - # loc=true_fit_params.location, - # scale=true_fit_params.scale) - - # # For Gumbel distribution - # elif distribution_name == 'gumbel_r': - # shape_=np.full(n_bootstrap, 0.0) - # location=np.full(n_bootstrap, np.nan) - # scale=np.full(n_bootstrap, np.nan) - # # Fit Gumbel for each bootstrap iteration - # for i in range(n_bootstrap): - # location[i],scale[i]=gumbel_r.fit(bs_data[:,i], - # loc=true_fit_params.location, - # scale=true_fit_params.scale) - - - shape_=np.full(n_bootstrap, np.nan) - location=np.full(n_bootstrap, np.nan) - scale=np.full(n_bootstrap, np.nan) - levels=np.full((return_periods.size, n_bootstrap), np.nan) - + # Initialise empty arrays + shape_ = np.full(n_bootstrap, np.nan) + location = np.full(n_bootstrap, np.nan) + scale = np.full(n_bootstrap, np.nan) + levels = np.full((return_periods.size, n_bootstrap), np.nan) + for i in range(n_bootstrap): # Store the params shape_[i] = bootstrap_gevd_fit.gevd_fitter_arr[i].shape_ location[i] = bootstrap_gevd_fit.gevd_fitter_arr[i].location scale[i] = bootstrap_gevd_fit.gevd_fitter_arr[i].scale - + # For each bootstrap iteration, calculate the return level - # as a function of the parsed return periods, given the + # as a function of the parsed return periods, given the # iterated fitting parameters - levels[:,i]=calculate_return_value_CDF(return_periods, - bootstrap_gevd_fit.gevd_fitter_arr[i], - bootstrap_gevd_fit.block_size) - # levels[:,i]=calculate_return_value_CDF(return_periods, - # pd.DataFrame({'distribution_name':distribution_name, - # 'shape_':shape_[i], - # 'location':location[i], - # 'scale':scale[i]}, index=[0]), - # block_size) + levels[:, i] = \ + calculate_return_value_CDF(return_periods, + bootstrap_gevd_fit.gevd_fitter_arr[i], + bootstrap_gevd_fit.block_size) return levels, shape_, location, scale - \ No newline at end of file diff --git a/src/bivariate/transform_uniform_margins.py b/src/bivariate/transform_uniform_margins.py index 0ff27b6..e18d5c3 100644 --- a/src/bivariate/transform_uniform_margins.py +++ b/src/bivariate/transform_uniform_margins.py @@ -3,6 +3,8 @@ Created on Tue Jul 18 16:44:18 2023 @author: A R Fogg + +Functions to transform between uniform margins and data scale. """ import scipy @@ -13,6 +15,7 @@ from . import qq_plot from . import return_period_plot_1d + def transform_from_data_scale_to_uniform_margins_empirically(data, plot=False): """ Transform the data to uniform margins empirically @@ -21,9 +24,9 @@ def transform_from_data_scale_to_uniform_margins_empirically(data, plot=False): Parameters ---------- data : np.array - + plot : BOOL, optional - If plot == True, plots the distributions of data in data + If plot == True, plots the distributions of data in data scale and on uniform margins. The default is False. 
Returns @@ -32,27 +35,32 @@ def transform_from_data_scale_to_uniform_margins_empirically(data, plot=False): Data on uniform margins """ - + # Transform the variables to uniform margins - data_unif=scipy.stats.rankdata(data)/(data.size+1) + data_unif = scipy.stats.rankdata(data)/(data.size + 1) + + # Plot comparison histograms + if plot is True: - if plot==True: - - fig,ax=plt.subplots(ncols=2,figsize=(8,4)) - - ax[0].hist(data, bins=25, density=True, rwidth=0.8, color='cornflowerblue') + fig, ax = plt.subplots(ncols=2, figsize=(8, 4)) + + ax[0].hist(data, bins=25, density=True, rwidth=0.8, + color='cornflowerblue') ax[0].set_ylabel('Normalised Occurrence') ax[0].set_xlabel('Data in data scale') - - ax[1].hist(data_unif, bins=25, density=True, rwidth=0.8, color='darkorange') + + ax[1].hist(data_unif, bins=25, density=True, rwidth=0.8, + color='darkorange') ax[1].set_ylabel('Normalised Occurrence') ax[1].set_xlabel('Data on uniform margins') - + plt.show() return data_unif -def transform_from_data_scale_to_uniform_margins_using_CDF(data, gevd_fitter, plot=False): + +def transform_from_data_scale_to_uniform_margins_using_CDF(data, gevd_fitter, + plot=False): """ Transforms the data to uniform margins by plugging into the CDF (a probability integral transform) @@ -61,14 +69,43 @@ def transform_from_data_scale_to_uniform_margins_using_CDF(data, gevd_fitter, pl where u is on uniform margins, x is in data scale Citation for this equation Coles (2001) page 47 - Parameters ---------- data : np.array Data in data scale. gevd_fitter : gevd_fitter class - See gevd_fitter.py. + Object containing GEVD fitting information for the true data. + Contains attributes listed below. See gevd_fitter.py for + definition. + extremes : np.array + List of extrema the model is fit to. + extremes_unif_empirical : np.array + Parsed extrema converted to uniform margins empirically. + extremes_unif_CDF : np.array + Parsed extrema converted to uniform margins using the fitted + GEVD or Gumbel distribution. + distribution_name : string + String name of fitted distribution. Either 'genextreme' or + 'gumbel_r'. + distribution : scipy.rv_continuous + Empty / generic distribution of selected distribution. + frozen_dist : frozen scipy.rv_continuous + Frozen version of fitted distribution. + shape_ : float + Fitted shape parameter. + location : float + Fitted location parameter. + scale : float + Fitted scale parameter. + formatted_dist_name : string + A formatted version of distribution name for plot labels. + aic : float + Akaike Information Criterion for fit. + fit_guess : dictionary + Dictionary containing guess initial parameters + for fitting the distribution. Keys 'c' for shape, + 'scale', and 'loc' for location. plot : BOOL, optional If plot == True, plots the distributions of data in data scale and on uniform margins. The default is False. 
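Both directions of this transformation are probability integral transforms through the fitted model: u = F(x) maps data scale to uniform margins, and x is recovered via the inverse CDF. A round-trip sketch with a frozen scipy GEVD in place of `gevd_fitter.frozen_dist`:

```python
import numpy as np
from scipy.stats import genextreme

frozen_dist = genextreme(-0.1, loc=10., scale=2.)
data = frozen_dist.rvs(size=1000, random_state=7)

data_unif = frozen_dist.cdf(data)       # data scale -> uniform margins
recovered = frozen_dist.ppf(data_unif)  # uniform margins -> data scale

assert np.allclose(recovered, data)
```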
@@ -79,29 +116,34 @@ def transform_from_data_scale_to_uniform_margins_using_CDF(data, gevd_fitter, pl
         Data transformed to uniform margins
 
     """
-    
+
+    # Calculate data on uniform margins
     data_unif = gevd_fitter.frozen_dist.cdf(data)
-    
-    if plot==True:
-        fig,ax=plt.subplots(ncols=2,figsize=(8,4))
-        
-        ax[0].hist(data, bins=25, density=True, rwidth=0.8, color='cornflowerblue')
+
+    # Plot comparison
+    if plot is True:
+        fig, ax = plt.subplots(ncols=2, figsize=(8, 4))
+
+        ax[0].hist(data, bins=25, density=True, rwidth=0.8,
+                   color='cornflowerblue')
         ax[0].set_ylabel('Normalised Occurrence')
         ax[0].set_xlabel('Data in data scale')
-        
-        ax[1].hist(data_unif, bins=25, density=True, rwidth=0.8, color='darkorange')
+
+        ax[1].hist(data_unif, bins=25, density=True, rwidth=0.8,
+                   color='darkorange')
         ax[1].set_ylabel('Normalised Occurrence')
         ax[1].set_xlabel('Data on uniform margins')
-        
+
         plt.show()
-    
+
     return data_unif
-    
-def transform_from_uniform_margins_to_data_scale(data_unif, gevd_fitter, plot=False):
+
+
+def transform_from_uniform_margins_to_data_scale(data_unif, gevd_fitter,
+                                                 plot=False):
     """
     Transform the data from uniform margins back to data scale
-    using the CDF.
-    CDF Distribution is G(x) = some formula
-    G(x) = u
-    where u is on uniform margins, x is in data scale
+    using the inverse CDF (percent point function).
+    The fitted CDF satisfies G(x) = u, so x = G^(-1)(u),
+    where u is on uniform margins and x is in data scale.
 
     Parameters
     ----------
     data_unif : np.array
         Data on uniform margins.
     gevd_fitter : gevd_fitter class
-        See gevd_fitter.py.
+        Object containing GEVD fitting information for the true data.
+        Contains attributes listed below. See gevd_fitter.py for
+        definition.
+        extremes : np.array
+            List of extrema the model is fit to.
+        extremes_unif_empirical : np.array
+            Parsed extrema converted to uniform margins empirically.
+        extremes_unif_CDF : np.array
+            Parsed extrema converted to uniform margins using the fitted
+            GEVD or Gumbel distribution.
+        distribution_name : string
+            String name of fitted distribution. Either 'genextreme' or
+            'gumbel_r'.
+        distribution : scipy.rv_continuous
+            Empty / generic distribution of selected distribution.
+        frozen_dist : frozen scipy.rv_continuous
+            Frozen version of fitted distribution.
+        shape_ : float
+            Fitted shape parameter.
+        location : float
+            Fitted location parameter.
+        scale : float
+            Fitted scale parameter.
+        formatted_dist_name : string
+            A formatted version of distribution name for plot labels.
+        aic : float
+            Akaike Information Criterion for fit.
+        fit_guess : dictionary
+            Dictionary containing guess initial parameters
+            for fitting the distribution. Keys 'c' for shape,
+            'scale', and 'loc' for location.
-    plot : TYPE, optional
-        DESCRIPTION. The default is False.
+    plot : BOOL, optional
+        If plot == True, plots the distributions of data in data
+        scale and on uniform margins. The default is False.
 
     Returns
     -------
-    data : TYPE
-        DESCRIPTION.
+    data : np.array
+        Data transformed to data scale.
""" - + + # Transform data data = gevd_fitter.frozen_dist.ppf(data_unif) - - if plot==True: - fig,ax=plt.subplots(ncols=2,figsize=(8,4)) - - ax[0].hist(data, bins=25, density=True, rwidth=0.8, color='cornflowerblue') + + # Plot comparison + if plot is True: + fig, ax = plt.subplots(ncols=2, figsize=(8, 4)) + + ax[0].hist(data, bins=25, density=True, rwidth=0.8, + color='cornflowerblue') ax[0].set_ylabel('Normalised Occurrence') ax[0].set_xlabel('Data in data scale') - - ax[1].hist(data_unif, bins=25, density=True, rwidth=0.8, color='darkorange') + + ax[1].hist(data_unif, bins=25, density=True, rwidth=0.8, + color='darkorange') ax[1].set_ylabel('Normalised Occurrence') ax[1].set_xlabel('Data on uniform margins') - + plt.show() - + return data + def plot_diagnostic(gevd_fitter, bootstrap_gevd_fit, data_tag, data_units_fm, - block_size, um_bins=np.linspace(0,1,11)): + block_size, um_bins=np.linspace(0, 1, 11)): """ Function to plot the PDF of extremes and the fitted distribution (left), and comparing the empirically and CDF determined data on uniform @@ -150,12 +227,42 @@ def plot_diagnostic(gevd_fitter, bootstrap_gevd_fit, data_tag, data_units_fm, Parameters ---------- gevd_fitter : gevd_fitter class - Object containing GEVD fitting information. + Object containing GEVD fitting information for the true data. + Contains attributes listed below. See gevd_fitter.py for + definition. + extremes : np.array + List of extrema the model is fit to. + extremes_unif_empirical : np.array + Parsed extrema converted to uniform margins empirically. + extremes_unif_CDF : np.array + Parsed extrema converted to uniform margins using the fitted + GEVD or Gumbel distribution. + distribution_name : string + String name of fitted distribution. Either 'genextreme' or + 'gumbel_r'. + distribution : scipy.rv_continuous + Empty / generic distribution of selected distribution. + frozen_dist : frozen scipy.rv_continuous + Frozen version of fitted distribution. + shape_ : float + Fitted shape parameter. + location : float + Fitted location parameter. + scale : float + Fitted scale parameter. + formatted_dist_name : string + A formatted version of distribution name for plot labels. + aic : float + Akaike Information Criterion for fit. + fit_guess : dictionary + Dictionary containing guess initial parameters + for fitting the distribution. Keys 'c' for shape, + 'scale', and 'loc' for location. bootstrap_gevd_fit : bootstrap_gevd_fit class see bootstrap_gevd_fit.py. Contains attributes listed below. bs_data : np.array - Bootstrapped extrema of dimensions n_extrema x + Bootstrapped extrema of dimensions n_extrema x n_ci_iterations. n_extrema : int Number of true extrema. @@ -185,95 +292,123 @@ def plot_diagnostic(gevd_fitter, bootstrap_gevd_fit, data_tag, data_units_fm, Size over which block maxima have been found, e.g. pd.to_timedelta("365.2425D"). um_bins : np.array - array defining the edges of the bins for the + array defining the edges of the bins for the uniform margins histograms Returns ------- - None. + fig : matplotlib figure + Generated figure. + ax : matplotlib axes + Generated axes. 
""" + # Initialise figure and axes - fig,ax=plt.subplots(ncols=3, nrows=2, figsize=(16.5,10.5)) - + fig, ax = plt.subplots(ncols=3, nrows=2, figsize=(16.5, 10.5)) + # Plot normalised histogram of extremes - ax[0,0].hist(gevd_fitter.extremes, bins=np.linspace(np.nanmin(gevd_fitter.extremes),np.nanmax(gevd_fitter.extremes),25), density=True, rwidth=0.8, color='darkgrey', label='extremes') - + ax[0, 0].hist(gevd_fitter.extremes, + bins=np.linspace(np.nanmin(gevd_fitter.extremes), + np.nanmax(gevd_fitter.extremes), 25), + density=True, rwidth=0.8, color='darkgrey', label='extremes') + # Initialise arrays - model_x=np.linspace(np.nanmin(gevd_fitter.extremes),np.nanmax(gevd_fitter.extremes), 100) + model_x = np.linspace(np.nanmin(gevd_fitter.extremes), + np.nanmax(gevd_fitter.extremes), 100) model_y = gevd_fitter.frozen_dist.pdf(model_x) - + # Plot the PDF against x - ax[0,0].plot(model_x, model_y, color='darkmagenta', label=gevd_fitter.formatted_dist_name) - + ax[0, 0].plot(model_x, model_y, color='darkmagenta', + label=gevd_fitter.formatted_dist_name) + # Some decor - ax[0,0].set_ylabel('Normalised Occurrence') - ax[0,0].set_xlabel(data_tag + ' in data scale ('+data_units_fm+')') - t=ax[0,0].text(0.06, 0.94, '(a)', transform=ax[0,0].transAxes, va='top', ha='left') + ax[0, 0].set_ylabel('Normalised Occurrence') + ax[0, 0].set_xlabel(data_tag + ' in data scale (' + data_units_fm + ')') + t = ax[0, 0].text(0.06, 0.94, '(a)', transform=ax[0, 0].transAxes, + va='top', ha='left') t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - ax[0,0].set_title(gevd_fitter.formatted_dist_name+' fit for '+data_tag) + ax[0, 0].set_title(gevd_fitter.formatted_dist_name + ' fit for ' + + data_tag) # Plot normalised histograms of different uniform margins data - ax[0,1].hist(gevd_fitter.extremes_unif_CDF, bins=um_bins, density=True, rwidth=0.8, color='darkorange', label='using CDF') - ax[0,1].hist(gevd_fitter.extremes_unif_empirical, bins=um_bins, density=True, rwidth=0.8, color='grey', alpha=0.5, label='empirical') - + ax[0, 1].hist(gevd_fitter.extremes_unif_CDF, bins=um_bins, density=True, + rwidth=0.8, color='darkorange', label='using CDF') + ax[0, 1].hist(gevd_fitter.extremes_unif_empirical, bins=um_bins, + density=True, rwidth=0.8, color='grey', alpha=0.5, + label='empirical') + # Some decor - ax[0,1].set_ylabel('Normalised Occurrence') - ax[0,1].set_xlabel(data_tag + ' on uniform margins') - ax[0,1].legend(loc='upper right') - t=ax[0,1].text(0.06, 0.94, '(b)', transform=ax[0,1].transAxes, va='top', ha='left') + ax[0, 1].set_ylabel('Normalised Occurrence') + ax[0, 1].set_xlabel(data_tag + ' on uniform margins') + ax[0, 1].legend(loc='upper right') + t = ax[0, 1].text(0.06, 0.94, '(b)', transform=ax[0, 1].transAxes, + va='top', ha='left') t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - ax[0,1].set_title('Data on uniform margins') - ax[0,0].legend(loc='upper right') - + ax[0, 1].set_title('Data on uniform margins') + ax[0, 0].legend(loc='upper right') + # QQ plot comparing the extremes and their PDF - ax[1,0]=qq_plot.qq_data_vs_model(ax[1,0], gevd_fitter.extremes, gevd_fitter.extremes_unif_empirical, gevd_fitter, - marker='^', fillstyle='none', color='darkmagenta', title='', - legend_pos='center left') - + ax[1, 0] = qq_plot.qq_data_vs_model(ax[1, 0], gevd_fitter.extremes, + gevd_fitter.extremes_unif_empirical, + gevd_fitter, marker='^', + fillstyle='none', color='darkmagenta', + title='', legend_pos='center left') + # Some decor - ax[1,0].set_xlabel('Extremes') - 
ax[1,0].set_ylabel('Fitted '+gevd_fitter.formatted_dist_name+' distribution') - t=ax[1,0].text(0.06, 0.94, '(d)', transform=ax[1,0].transAxes, va='top', ha='left') - t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - + ax[1, 0].set_xlabel('Extremes') + ax[1, 0].set_ylabel('Fitted ' + gevd_fitter.formatted_dist_name + + ' distribution') + t = ax[1, 0].text(0.06, 0.94, '(d)', transform=ax[1, 0].transAxes, + va='top', ha='left') + t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) + # QQ plot comparing the uniform margins distributions - ax[1,1]=qq_plot.qq_data_vs_data(gevd_fitter.extremes_unif_empirical, gevd_fitter.extremes_unif_CDF, - ax[1,1], quantiles=np.linspace(0,100,26), - legend_pos='center left', color='darkorange') - + ax[1, 1] = qq_plot.qq_data_vs_data(gevd_fitter.extremes_unif_empirical, + gevd_fitter.extremes_unif_CDF, ax[1, 1], + quantiles=np.linspace(0, 100, 26), + legend_pos='center left', + color='darkorange') + # Some decor - ax[1,1].set_xlabel('Empirical') - ax[1,1].set_ylabel('CDF') - t=ax[1,1].text(0.06, 0.94, '(e)', transform=ax[1,1].transAxes, va='top', ha='left') + ax[1, 1].set_xlabel('Empirical') + ax[1, 1].set_ylabel('CDF') + t = ax[1, 1].text(0.06, 0.94, '(e)', transform=ax[1, 1].transAxes, + va='top', ha='left') t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) # Return period plot - ax[0,2]=return_period_plot_1d.return_period_plot(gevd_fitter, bootstrap_gevd_fit, block_size, - data_tag, data_units_fm, - ax[0,2], csize=15, line_color='darkmagenta') - + ax[0, 2] = return_period_plot_1d.\ + return_period_plot(gevd_fitter, bootstrap_gevd_fit, block_size, + data_tag, data_units_fm, ax[0, 2], csize=15, + line_color='darkmagenta') + # Some decor - t=ax[0,2].text(0.06, 0.94, '(c)', transform=ax[0,2].transAxes, va='top', ha='left') + t = ax[0, 2].text(0.06, 0.94, '(c)', transform=ax[0, 2].transAxes, + va='top', ha='left') t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) # Return values table - ax[1,2]=return_period_plot_1d.return_period_table_ax(ax[1,2], gevd_fitter, block_size, data_units_fm, bootstrap_gevd_fit) - + ax[1, 2] = return_period_plot_1d.\ + return_period_table_ax(ax[1, 2], gevd_fitter, block_size, + data_units_fm, bootstrap_gevd_fit) + # Some decor - t=ax[1,2].text(0.03, 0.90, '(f)', transform=ax[1,2].transAxes, va='top', ha='left') - t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) + t = ax[1, 2].text(0.03, 0.90, '(f)', transform=ax[1, 2].transAxes, + va='top', ha='left') + t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) fig.tight_layout() plt.show() - + return fig, ax - -def plot_copula_diagnostic(copula_x_sample, copula_y_sample, + + +def plot_copula_diagnostic(copula_x_sample, copula_y_sample, x_sample_data_scale, y_sample_data_scale, x_gevd_fitter, y_gevd_fitter, - x_name, y_name, um_bins=np.linspace(0,1,11)): + x_name, y_name, um_bins=np.linspace(0, 1, 11)): """ Function to plot diagnostic to assess copula fit. @@ -288,16 +423,77 @@ def plot_copula_diagnostic(copula_x_sample, copula_y_sample, y_sample_data_scale : np.array or pd.Series Random sample of y transformed to data scale. x_gevd_fitter : gevd_fitter class - See gevd_fitter.py. + Object containing GEVD fitting information for the true data. + Contains attributes listed below. See gevd_fitter.py for + definition. + extremes : np.array + List of extrema the model is fit to. + extremes_unif_empirical : np.array + Parsed extrema converted to uniform margins empirically. 
+        extremes_unif_CDF : np.array
+            Parsed extrema converted to uniform margins using the fitted
+            GEVD or Gumbel distribution.
+        distribution_name : string
+            String name of fitted distribution. Either 'genextreme' or
+            'gumbel_r'.
+        distribution : scipy.rv_continuous
+            Empty / generic distribution of selected distribution.
+        frozen_dist : frozen scipy.rv_continuous
+            Frozen version of fitted distribution.
+        shape_ : float
+            Fitted shape parameter.
+        location : float
+            Fitted location parameter.
+        scale : float
+            Fitted scale parameter.
+        formatted_dist_name : string
+            A formatted version of distribution name for plot labels.
+        aic : float
+            Akaike Information Criterion for fit.
+        fit_guess : dictionary
+            Dictionary containing guess initial parameters
+            for fitting the distribution. Keys 'c' for shape,
+            'scale', and 'loc' for location.
     y_gevd_fitter : gevd_fitter class
-        See gevd_fitter.py.
+        Object containing GEVD fitting information for the true data.
+        Contains attributes listed below. See gevd_fitter.py for
+        definition.
+        extremes : np.array
+            List of extrema the model is fit to.
+        extremes_unif_empirical : np.array
+            Parsed extrema converted to uniform margins empirically.
+        extremes_unif_CDF : np.array
+            Parsed extrema converted to uniform margins using the fitted
+            GEVD or Gumbel distribution.
+        distribution_name : string
+            String name of fitted distribution. Either 'genextreme' or
+            'gumbel_r'.
+        distribution : scipy.rv_continuous
+            Empty / generic distribution of selected distribution.
+        frozen_dist : frozen scipy.rv_continuous
+            Frozen version of fitted distribution.
+        shape_ : float
+            Fitted shape parameter.
+        location : float
+            Fitted location parameter.
+        scale : float
+            Fitted scale parameter.
+        formatted_dist_name : string
+            A formatted version of distribution name for plot labels.
+        aic : float
+            Akaike Information Criterion for fit.
+        fit_guess : dictionary
+            Dictionary containing guess initial parameters
+            for fitting the distribution. Keys 'c' for shape,
+            'scale', and 'loc' for location.
     x_name : string
         String name for x. Used for labelling plots.
     y_name : string
         String name for y. Used for labelling plots.
-    um_bins : np.array
-        array defining the edges of the bins for the
-        uniform margins histograms
+    um_bins : np.array, optional
+        Array defining the edges of the bins for the
+        uniform margins histograms. The default is
+        np.linspace(0, 1, 11).
 
     Returns
     -------
@@ -307,87 +503,113 @@ def plot_copula_diagnostic(copula_x_sample, copula_y_sample,
-        Four axes within fig.
+        Six axes within fig.
""" - - fig, ax = plt.subplots(nrows=2,ncols=3, figsize=(16.5,10.5)) - + + fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(16.5, 10.5)) + # FOR X PARAMETER - # Plot normalised histogram of copula sample on uniform margins - ax[0,0].hist(copula_x_sample, bins=um_bins, density=True, rwidth=0.8, color='darkorange', label=x_name+' copula sample\n(uniform margins)') + ax[0, 0].hist(copula_x_sample, bins=um_bins, density=True, rwidth=0.8, + color='darkorange', label=x_name + + ' copula sample\n(uniform margins)') # Some decor - ax[0,0].set_xlabel('Copula sample for '+x_name) - ax[0,0].set_ylabel('Normalised Occurrence') - ax[0,0].set_title('Copula sample on uniform margins') - t=ax[0,0].text(0.06, 0.94, '(a)', transform=ax[0,0].transAxes, va='top', ha='left') + ax[0, 0].set_xlabel('Copula sample for '+x_name) + ax[0, 0].set_ylabel('Normalised Occurrence') + ax[0, 0].set_title('Copula sample on uniform margins') + t = ax[0, 0].text(0.06, 0.94, '(a)', transform=ax[0, 0].transAxes, + va='top', ha='left') t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - ax[0,0].legend(loc='upper right') - + ax[0, 0].legend(loc='upper right') + # Plot normalised histogram of copula sample in data scale - ax[0,1].hist(x_sample_data_scale, bins=25, density=True, rwidth=0.8, color='darkgray', label=x_name+' copula\nsample\n(data scale)') - + ax[0, 1].hist(x_sample_data_scale, bins=25, density=True, rwidth=0.8, + color='darkgray', label=x_name + + ' copula\nsample\n(data scale)') + # Overplot distribution - model_x=np.linspace(np.nanmin(x_sample_data_scale),np.nanmax(x_sample_data_scale), 100) + model_x = np.linspace(np.nanmin(x_sample_data_scale), + np.nanmax(x_sample_data_scale), 100) model_y = x_gevd_fitter.frozen_dist.pdf(model_x) - ax[0,1].plot(model_x, model_y, color='darkmagenta', label=x_gevd_fitter.formatted_dist_name) - + ax[0, 1].plot(model_x, model_y, color='darkmagenta', + label=x_gevd_fitter.formatted_dist_name) + # Some decor - ax[0,1].set_xlabel('Data scale for '+x_name) - ax[0,1].set_ylabel('Normalised Occurrence') - ax[0,1].set_title('Copula sample vs '+x_gevd_fitter.formatted_dist_name+' (data scale)') - t=ax[0,1].text(0.06, 0.94, '(b)', transform=ax[0,1].transAxes, va='top', ha='left') + ax[0, 1].set_xlabel('Data scale for ' + x_name) + ax[0, 1].set_ylabel('Normalised Occurrence') + ax[0, 1].set_title('Copula sample vs ' + + x_gevd_fitter.formatted_dist_name + ' (data scale)') + t = ax[0, 1].text(0.06, 0.94, '(b)', transform=ax[0, 1].transAxes, + va='top', ha='left') t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - ax[0,1].legend(loc='upper right') - + ax[0, 1].legend(loc='upper right') + # QQ plot comparing Copula sample in data scale with GEVD fit - ax[0,2]=qq_plot.qq_data_vs_model(ax[0,2], x_sample_data_scale, copula_x_sample, - x_gevd_fitter, marker='^', fillstyle='none', - color='darkmagenta', title='Copula sample vs '+x_gevd_fitter.formatted_dist_name+' (QQ)', - legend_pos='center left') - ax[0,2].set_xlabel('Copula sample in data scale') - ax[0,2].set_ylabel(x_gevd_fitter.formatted_dist_name+' fitted to observations') - t=ax[0,2].text(0.06, 0.94, '(c)', transform=ax[0,2].transAxes, va='top', ha='left') + ax[0, 2] = qq_plot.qq_data_vs_model(ax[0, 2], x_sample_data_scale, + copula_x_sample, x_gevd_fitter, + marker='^', fillstyle='none', + color='darkmagenta', + title='Copula sample vs ' + + x_gevd_fitter.formatted_dist_name + + ' (QQ)', legend_pos='center left') + ax[0, 2].set_xlabel('Copula sample in data scale') + ax[0, 
2].set_ylabel(x_gevd_fitter.formatted_dist_name + + ' fitted to observations') + t = ax[0, 2].text(0.06, 0.94, '(c)', transform=ax[0, 2].transAxes, + va='top', ha='left') t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - + # FOR Y PARAMETER # Plot normalised histogram of copula sample on uniform margins - ax[1,0].hist(copula_y_sample, bins=um_bins, density=True, rwidth=0.8, color='darkorange', label=y_name+' copula sample\n(uniform margins)') + ax[1, 0].hist(copula_y_sample, bins=um_bins, density=True, rwidth=0.8, + color='darkorange', label=y_name + + ' copula sample\n(uniform margins)') # Some decor - ax[1,0].set_xlabel('Copula sample for '+y_name) - ax[1,0].set_ylabel('Normalised Occurrence') - ax[1,0].set_title('Copula sample on uniform margins') - t=ax[1,0].text(0.06, 0.94, '(d)', transform=ax[1,0].transAxes, va='top', ha='left') + ax[1, 0].set_xlabel('Copula sample for '+y_name) + ax[1, 0].set_ylabel('Normalised Occurrence') + ax[1, 0].set_title('Copula sample on uniform margins') + t = ax[1, 0].text(0.06, 0.94, '(d)', transform=ax[1, 0].transAxes, + va='top', ha='left') t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - ax[1,0].legend(loc='upper right') + ax[1, 0].legend(loc='upper right') # Plot normalised histogram of copula sample in data scale - ax[1,1].hist(y_sample_data_scale, bins=25, density=True, rwidth=0.8, color='darkgray', label=y_name+' copula\nsample\n(data scale)') - + ax[1, 1].hist(y_sample_data_scale, bins=25, density=True, rwidth=0.8, + color='darkgray', label=y_name + + ' copula\nsample\n(data scale)') + # Overplot distribution - model_x=np.linspace(np.nanmin(y_sample_data_scale),np.nanmax(y_sample_data_scale), 100) + model_x = np.linspace(np.nanmin(y_sample_data_scale), + np.nanmax(y_sample_data_scale), 100) model_y = y_gevd_fitter.frozen_dist.pdf(model_x) - ax[1,1].plot(model_x,model_y, color='darkmagenta', label=y_gevd_fitter.formatted_dist_name) - + ax[1, 1].plot(model_x, model_y, color='darkmagenta', + label=y_gevd_fitter.formatted_dist_name) + # Some decor - ax[1,1].set_xlabel('Data scale for '+y_name) - ax[1,1].set_ylabel('Normalised Occurrence') - ax[1,1].set_title('Copula sample vs '+y_gevd_fitter.formatted_dist_name+' (data scale)') - t=ax[1,1].text(0.06, 0.94, '(e)', transform=ax[1,1].transAxes, va='top', ha='left') + ax[1, 1].set_xlabel('Data scale for ' + y_name) + ax[1, 1].set_ylabel('Normalised Occurrence') + ax[1, 1].set_title('Copula sample vs ' + + y_gevd_fitter.formatted_dist_name + ' (data scale)') + t = ax[1, 1].text(0.06, 0.94, '(e)', transform=ax[1, 1].transAxes, + va='top', ha='left') t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - ax[1,1].legend(loc='upper right') - + ax[1, 1].legend(loc='upper right') + # QQ plot comparing Copula sample in data scale with GEVD fit - ax[1,2]=qq_plot.qq_data_vs_model(ax[1,2], y_sample_data_scale, copula_y_sample, - y_gevd_fitter, marker='^', fillstyle='none', - color='darkmagenta', title='Copula sample vs '+y_gevd_fitter.formatted_dist_name+' (QQ)', - legend_pos='center left') - ax[1,2].set_xlabel('Copula sample in data scale') - t=ax[1,2].text(0.06, 0.94, '(f)', transform=ax[1,2].transAxes, va='top', ha='left') + ax[1, 2] = qq_plot.qq_data_vs_model(ax[1, 2], y_sample_data_scale, + copula_y_sample, y_gevd_fitter, + marker='^', fillstyle='none', + color='darkmagenta', + title='Copula sample vs ' + + y_gevd_fitter.formatted_dist_name + + ' (QQ)', legend_pos='center left') + ax[1, 2].set_xlabel('Copula sample in data scale') + t = ax[1, 2].text(0.06, 0.94, 
'(f)', transform=ax[1, 2].transAxes, + va='top', ha='left') t.set_bbox(dict(facecolor='white', alpha=0.5, edgecolor='grey')) - + fig.tight_layout() - + plt.show() - - return fig, ax \ No newline at end of file + + return fig, ax
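
To round off this module, here is a hedged usage sketch for `plot_copula_diagnostic` (an editorial example, not part of the diff). It assumes a copula has already been fitted to the x and y extrema on uniform margins, that its `sample` method returns n pairs of uniform-margin values (as the `copulas` bivariate classes do), and that `x_gevd_fit` / `y_gevd_fit` are the `gevd_fitter` objects from the earlier steps of the walkthrough:

```python
import numpy as np

from bi_multi_variate_eva import *

# Assumed to exist from earlier steps:
#   copula     - copula fitted to the x and y extrema on uniform margins
#   x_gevd_fit - gevd_fitter for the x extrema
#   y_gevd_fit - gevd_fitter for the y extrema
n_samples = 1000
copula_sample = np.asarray(copula.sample(n_samples))
copula_x_sample = copula_sample[:, 0]
copula_y_sample = copula_sample[:, 1]

# Map each margin back to data scale through its fitted model
x_sample_data_scale = transform_uniform_margins.\
    transform_from_uniform_margins_to_data_scale(copula_x_sample, x_gevd_fit)
y_sample_data_scale = transform_uniform_margins.\
    transform_from_uniform_margins_to_data_scale(copula_y_sample, y_gevd_fit)

# Six-panel diagnostic comparing the copula sample with the fitted margins
fig, ax = transform_uniform_margins.plot_copula_diagnostic(
    copula_x_sample, copula_y_sample,
    x_sample_data_scale, y_sample_data_scale,
    x_gevd_fit, y_gevd_fit, 'X', 'Y')
```

The returned `fig` and `ax` can then be saved or further annotated by the caller, since the function now returns them rather than only displaying the plot.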