diff --git a/man/mphcrm.Rd b/man/mphcrm.Rd
index 18742db..5178272 100644
--- a/man/mphcrm.Rd
+++ b/man/mphcrm.Rd
@@ -46,9 +46,7 @@ can be specified as \code{NULL}, or ignored.}
 \itemize{
 \item \code{"exact"}. The timing is exact, the transition occured at the end of the observation interval.
 \item \code{"interval"}. The transition occured some time during the observation interval. This model
- can be notoriously hard to estimate due to unfavourable numerics. It could be worthwhile to
- lower the control parameter \code{ll.improve} to 0.0001 or something, to ensure it does not
- give up to early.
+ can be notoriously hard to estimate due to unfavourable numerics.
 \item \code{"none"}. There is no timing, the transition occured, or not. A logit model is used.
 }}

diff --git a/man/mphcrm.callback.Rd b/man/mphcrm.callback.Rd
index f231a02..3fed4e1 100644
--- a/man/mphcrm.callback.Rd
+++ b/man/mphcrm.callback.Rd
@@ -10,7 +10,7 @@ mphcrm.callback(fromwhere, opt, dataset, control, ...)
 \item{fromwhere}{a string which identifies which step in the algorithm it is called from.
 \code{fromwhere=='full'} means that it is a full estimation of all the parameters. There are also other codes,
 when adding a point, when removing duplicate points. When some optimization is completed it is called with the
-return status from \code{\link{optim}} (and in some occasions from \code{\link[nloptr]{nloptr}}.}
+return status from \code{\link{optim}} (and on some occasions from \code{\link[nloptr]{nloptr}}).}

 \item{opt}{Typically the result of a call to \code{\link{optim}}.}

diff --git a/vignettes/whatmph.Rnw b/vignettes/whatmph.Rnw
index 100d1c9..903fed9 100644
--- a/vignettes/whatmph.Rnw
+++ b/vignettes/whatmph.Rnw
@@ -245,6 +245,40 @@ the mixture distribution can be way off. It is good practice to inspect the
 mixture distribution for such extreme points, and go back to a previous iteration
 (which has slightly worse likelihood) without such extreme points.

+\subsection{Overparameterization?}
+We did note in \cite{GRK07} that picking the number of points which yields
+the lowest AIC gives satisfactory results. We did not, however, have any theoretical justification
+for doing this, and still don't. It is easy to look at the estimate with the lowest AIC:
+<<>>=
+summary(fit[[which.min(sapply(fit,AIC))]])
+@
+AIC is generally used to pick a model which is parsimonious, but still explains the
+data well, i.e.\ to avoid overparameterization. AIC has an interpretation as the distance
+between the model and reality. However, since the points are not found in a canonical order,
+the AIC is not really well defined in these models. Another estimation on the
+same data may find the points in a different order, with different log likelihoods along the way,
+resulting in another set of points having the lowest AIC in the new estimation.
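+The AICs of the successive estimates are in any case easy to inspect, for instance:
+<<>>=
+sapply(fit, AIC)
+@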
+
+Also, keep in mind that the method of using a discrete distribution
+in this way is \emph{not} an ``approximation''. When the likelihood can't be improved by
+adding more points, we have actually reached the likelihood which would result if we were to
+integrate with the true distribution, be it discrete or continuous. This is the content of
+\cite[Theorems 3.1 and 4.1]{lindsay83}. Thus, there is really no theoretical risk
+of adding ``too many'' points to the distribution, save for those which may result from
+numerical problems.
+
+The drawback of the method is that when the likelihood can't be improved by adding more
+points, the purely technical reason is that something degenerates, at the very least
+the Fisher matrix. That is, we will necessarily run into numerical problems at the
+end of the estimation, and this may result in irrelevant points with very low probabilities
+being added, because they seem to increase the likelihood by some small amount due to numerical
+inaccuracies. This may be the case if the algorithm has run several iterations at the end
+which barely moved the likelihood.
+
 \section{More options}
 \subsection{Interval timing}
 The example above had exactly recorded time. For some applications we do have that, while
@@ -449,10 +483,13 @@ around the program, I have collected them here with their defaults. Some of them
  In general, when using parallelization, one should make sure that
  the cpus are not overbooked and that the nodes you are running on
  are approximately equally fast. There is some overhead when
- distributing the computation among many nodes, it will usually run
- faster if using \code{threads} on a single node, or few nodes, both
- because the thread algorithm uses shared memory and because the
- division of data between the threads are dynamic.
+ distributing the computation among many nodes, depending on how the
+ nodes are interconnected, and due to somewhat sub-optimal
+ implementations in package \pkg{parallel} which e.g.\ does not
+ utilize collective MPI-calls. \code{mphcrm} will usually run faster
+ if using \code{threads} on a single node, or a few nodes, both because
+ the thread algorithm uses shared memory and because the division of
+ data between the threads is dynamic.

 \item{\code{nodeshares=NULL}.} A numeric vector. When running on a cluster, the default
  is to share the data equally among the nodes, or in the case \code{threads} is a vector of
@@ -467,6 +504,12 @@ around the program, I have collected them here with their defaults. Some of them
  vector will be divided by their sum, so it is not necessary that they
  sum to 1, it is their relative sizes which matter.

+\item{\code{fisher=\Sexpr{def$fisher}}.} A logical. Should the Fisher matrix be computed? You will normally
+ want this; it is used to compute the standard errors. However, it may take some time if you have
+ a large number of covariates. If you do not need standard errors, e.g.\ if you do a bootstrap or Monte Carlo
+ simulation or similar, you may switch off the computation of the Fisher matrix and save some time.
+\item{\code{gradient=\Sexpr{def$gradient}}.} A logical. As \code{fisher}, but for the gradient. I doubt much
+ time can be saved by switching this off.
 \end{itemize}

 \bibliographystyle{apalike}