-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Package: fit discretised distributions to match expected distributional properties #4
Comments
I'm not sure that least squares is the best measure of goodness of fit. The documentation of fitdistrplus::fitdist() indicates that goodness of fit measures tend not to be appropriate for discrete distributions. Is it possible to write out a d, p and q function for the discredited gamma? If so, this could perhap fit into the fitdistrplus framework. |
Note that I am talking of fitting to summaries of the distribution (e.g. the quantile), not the data, so the function to minimize will be a measure of distance between expected summaries and observed ones. Sorry if you got that, just making sure :) Good point re |
I'd second just adding some scaffolding around fitdistrplus... it's a pretty solidly designed package. |
The rriskDistributions might also be related to this issue . This package proposes several build get."distr".par functions that will return the parameters of the "distr" distribution where the pth percentiles match with the quantiles q. |
Do you have or could you get the raw data? There are a couple of approaches you could use if so. See https://cran.rstudio.com/web/packages/PDQutils/index.html and https://cran.r-project.org/web/packages/gld/. |
Nope, this project really is for cases where we don't have data, only some distributional summaries. |
@samclifford @johnurbanik I thought |
Yes, this is very close. I have just given it a try and it seems to work very nicely. Only caveat is the discretisation may actually change results a bit, so I am not sure if the optimal parametrisation to match the continuous distribution remains optimal to match against the discrete version. I'd guess it is not in the general case, but should be pretty close? Example below for a GAmma with median 7 and IQR 3-11: library(rriskDistributions)
library(distcrete)
## get parameters of a Gamma with median 7, IQR 3-11
pars <- get.gamma.par(p = c(.25, .5, .75), q = c(3, 7, 11), plot = FALSE)
#> The fitting procedure 'L-BFGS-B' was successful!
#> $par
#> [1] 1.3647791 0.1615613
#>
#> $value
#> [1] 0.0001973927
#>
#> $counts
#> function gradient
#> 42 42
#>
#> $convergence
#> [1] 0
#>
#> $message
#> [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
pars
#> shape rate
#> 1.3647791 0.1615613
## check that it atcually works
samp_cont <- rgamma(1e5, shape = pars["shape"], rate = pars["rate"])
summary(samp_cont)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00099 3.16772 6.51986 8.46700 11.65060 76.12839
## discretise the distribution
x <- distcrete("gamma", shape = pars["shape"], rate = pars["rate"], interval = 1)
samp_disc <- x$r(1e5)
summary(samp_disc)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.000 3.000 7.000 8.468 12.000 92.000 Created on 2020-04-14 by the reprex package (v0.3.0) |
Still, given how far rriskDistributions gets us, I think this project will make a more meaningful contribution if we go for option 2, i.e. a more general-purpose package to create |
This could be a starting point for option 2: |
Package: fit discretised distributions to match expected distributional properties
Description
Disclaimer: if this exists already, please feel free to point it out here.
What we need: fit a discretised distribution (
distcrete
object) so that it matches user-specified properties e.g. quantiles (typically the median, and 25% / 75% quantiles)The package should include documentation, simple reproducible examples, unit tests, and use good practices and standards recommended by RECON or similar.
Specs (option 1)
This version would be slightly easier to use, but more constrained as we explicitely force fitting using quantiles. The method for calculating distances from targets is likely pre-programmed in the function, so only a few options will be available (likely: least square, absolute values, etc).
name
: a singlecharacter
indicating the family of a distribution; I would recommend usinggamma
andweibull
for testing purposes, as these are often the distributions we end up using; seename
in?distcrete::distcrete
target_quantiles
: anumeric
vector providing quantile used for fitting (between 0 and 1)target_values
: anumeric
vector providing quantiles values to be fitted tofit_method
: a singlecharacter
indicating which fitting method should be used (default to least squares)...
other arguments passed todistcrete::distcrete
distcrete
object with quantile properties matching the user-specified quantiles as closely as possibleSpecs (option 2)
This version would be more flexible targets and fitting are passed as user-specified functions.
name
: a singlecharacter
indicating the family of a distribution; I would recommend usinggamma
andweibull
for testing purposes, as these are often the distributions we end up using; seename
in?distcrete::distcrete
summary
: afunction
with a single argumentx
calculating a distributional summary of its inputdistance
: afunction
with 2 argumentsx
andy
, calculating a dissimilarity betweenx
andy
, as a singlenumeric
, finite value...
other arguments passed todistcrete::distcrete
distcrete
object with quantile properties matching the user-specified quantiles as closely as possibleRelevant related packages
Starting point
The following code generates a discretised weibull as a
distcrete
object, generates a sample of 1e5 values, and calculates the sum of squared distances from reference quantiles:Further tweaks will be putting this insided an optimiser. The interface should allow accessing all arguments of
distcrete::distcrete
in addition to arguments used to specify targets and fittingImpact
Many epidemiologically relevant delays are shared in the literature as means, medians, IQR, 95%CI etc. However, it can be very tricky to turn these into discretised distributions which we can then use for forecasting, estimating transmissibility etc. A simple utilitary function implementing this would be a great help for this.
Proposed timeline
Focal point
@thibautjombart
The text was updated successfully, but these errors were encountered: