Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Package: fit discretised distributions to match expected distributional properties #4

Open
thibautjombart opened this issue Apr 6, 2020 · 10 comments
Labels
medium_complexity Can be completed by 1 person in a few (<5) days. medium_priority Essential for COVID19 analytics new_package Create a new R package

Comments

@thibautjombart
Copy link
Contributor

Package: fit discretised distributions to match expected distributional properties

Description

Disclaimer: if this exists already, please feel free to point it out here.

What we need: fit a discretised distribution (distcrete object) so that it matches user-specified properties e.g. quantiles (typically the median, and 25% / 75% quantiles)

The package should include documentation, simple reproducible examples, unit tests, and use good practices and standards recommended by RECON or similar.

Specs (option 1)

This version would be slightly easier to use, but more constrained as we explicitely force fitting using quantiles. The method for calculating distances from targets is likely pre-programmed in the function, so only a few options will be available (likely: least square, absolute values, etc).

  • inputs:
    • name: a single character indicating the family of a distribution; I would recommend using gamma and weibull for testing purposes, as these are often the distributions we end up using; see name in ?distcrete::distcrete
    • target_quantiles: a numeric vector providing quantile used for fitting (between 0 and 1)
    • target_values: a numeric vector providing quantiles values to be fitted to
    • fit_method: a single character indicating which fitting method should be used (default to least squares)
    • ... other arguments passed to distcrete::distcrete
  • outputs:
    • a distcrete object with quantile properties matching the user-specified quantiles as closely as possible

Specs (option 2)

This version would be more flexible targets and fitting are passed as user-specified functions.

  • inputs:
    • name: a single character indicating the family of a distribution; I would recommend using gamma and weibull for testing purposes, as these are often the distributions we end up using; see name in ?distcrete::distcrete
    • summary: a function with a single argument x calculating a distributional summary of its input
    • distance: a function with 2 arguments x and y, calculating a dissimilarity between x and y, as a single numeric, finite value
    • ... other arguments passed to distcrete::distcrete
  • outputs:
    • a distcrete object with quantile properties matching the user-specified quantiles as closely as possible

Relevant related packages

Starting point

The following code generates a discretised weibull as a distcrete object, generates a sample of 1e5 values, and calculates the sum of squared distances from reference quantiles:

## define targets
target_quantiles <- c(.25, .5, .75)
target_values <- c(3, 5, 7)


## make distcrete object
x <- distcrete::distcrete("weibull", shape = 2, scale = 4, w = 0.5, interval = 1)

## check quantiles
x_quantiles <- quantile(x$r(1e5), probs = target_quantiles)
sum_square <- sum((x_quantiles - target_values)^2)

Further tweaks will be putting this insided an optimiser. The interface should allow accessing all arguments of distcrete::distcrete in addition to arguments used to specify targets and fitting

Impact

Many epidemiologically relevant delays are shared in the literature as means, medians, IQR, 95%CI etc. However, it can be very tricky to turn these into discretised distributions which we can then use for forecasting, estimating transmissibility etc. A simple utilitary function implementing this would be a great help for this.

Proposed timeline

  • first viable product: as soon as possible; proposed: 25th April
  • first stable version (CRAN release): less urgent, the main thing is to have this package, tested, documented, and publicly accessible

Focal point

@thibautjombart

@thibautjombart thibautjombart added medium_complexity Can be completed by 1 person in a few (<5) days. medium_priority Essential for COVID19 analytics new_package Create a new R package labels Apr 6, 2020
@samclifford
Copy link

I'm not sure that least squares is the best measure of goodness of fit. The documentation of fitdistrplus::fitdist() indicates that goodness of fit measures tend not to be appropriate for discrete distributions. Is it possible to write out a d, p and q function for the discredited gamma? If so, this could perhap fit into the fitdistrplus framework.

@thibautjombart
Copy link
Contributor Author

Note that I am talking of fitting to summaries of the distribution (e.g. the quantile), not the data, so the function to minimize will be a measure of distance between expected summaries and observed ones. Sorry if you got that, just making sure :)

Good point re fitdistrplus. distcrete objects are lists with $d $r $q $p components.

@johnurbanik
Copy link

I'd second just adding some scaffolding around fitdistrplus... it's a pretty solidly designed package.

@lguillier
Copy link

lguillier commented Apr 9, 2020

The rriskDistributions might also be related to this issue . This package proposes several build get."distr".par functions that will return the parameters of the "distr" distribution where the pth percentiles match with the quantiles q.
https://cran.r-project.org/web/packages/rriskDistributions/index.html

@znmeb
Copy link

znmeb commented Apr 10, 2020

Note that I am talking of fitting to summaries of the distribution (e.g. the quantile), not the data, so the function to minimize will be a measure of distance between expected summaries and observed ones. Sorry if you got that, just making sure :)

Do you have or could you get the raw data? There are a couple of approaches you could use if so. See https://cran.rstudio.com/web/packages/PDQutils/index.html and https://cran.r-project.org/web/packages/gld/.

@thibautjombart
Copy link
Contributor Author

Note that I am talking of fitting to summaries of the distribution (e.g. the quantile), not the data, so the function to minimize will be a measure of distance between expected summaries and observed ones. Sorry if you got that, just making sure :)

Do you have or could you get the raw data? There are a couple of approaches you could use if so. See https://cran.rstudio.com/web/packages/PDQutils/index.html and https://cran.r-project.org/web/packages/gld/.

Nope, this project really is for cases where we don't have data, only some distributional summaries.

@thibautjombart
Copy link
Contributor Author

@samclifford @johnurbanik I thought fitdistrplus was only implementing procedures for fitting distributions to data, which would be a different use-case. I may not be up to date. Is there a specific function to create distributions matching distributional summaries you can point to?

@thibautjombart
Copy link
Contributor Author

The rriskDistributions might also be related to this issue . This package proposes several build get."distr".par functions that will return the parameters of the "distr" distribution where the pth percentiles match with the quantiles q.
https://cran.r-project.org/web/packages/rriskDistributions/index.html

Yes, this is very close. I have just given it a try and it seems to work very nicely. Only caveat is the discretisation may actually change results a bit, so I am not sure if the optimal parametrisation to match the continuous distribution remains optimal to match against the discrete version. I'd guess it is not in the general case, but should be pretty close?

Example below for a GAmma with median 7 and IQR 3-11:

library(rriskDistributions)
library(distcrete)

## get parameters of a Gamma with median 7, IQR 3-11
pars <- get.gamma.par(p = c(.25,  .5, .75), q = c(3, 7, 11), plot = FALSE)
#> The fitting procedure 'L-BFGS-B' was successful!
#> $par
#> [1] 1.3647791 0.1615613
#> 
#> $value
#> [1] 0.0001973927
#> 
#> $counts
#> function gradient 
#>       42       42 
#> 
#> $convergence
#> [1] 0
#> 
#> $message
#> [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
pars
#>     shape      rate 
#> 1.3647791 0.1615613

## check that it atcually works
samp_cont <- rgamma(1e5, shape = pars["shape"], rate = pars["rate"])
summary(samp_cont)
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#>  0.00099  3.16772  6.51986  8.46700 11.65060 76.12839

## discretise the distribution
x <- distcrete("gamma", shape = pars["shape"], rate = pars["rate"], interval = 1)
samp_disc <- x$r(1e5)
summary(samp_disc)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   0.000   3.000   7.000   8.468  12.000  92.000

Created on 2020-04-14 by the reprex package (v0.3.0)

@thibautjombart
Copy link
Contributor Author

Still, given how far rriskDistributions gets us, I think this project will make a more meaningful contribution if we go for option 2, i.e. a more general-purpose package to create distcrete objects matching any summary function.

@thibautjombart
Copy link
Contributor Author

This could be a starting point for option 2:
https://github.com/thibautjombart/misc/blob/master/R/distcreate.R

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
medium_complexity Can be completed by 1 person in a few (<5) days. medium_priority Essential for COVID19 analytics new_package Create a new R package
Projects
None yet
Development

No branches or pull requests

5 participants