Sampling does not converge with default NUTS sampler #527
-
Hi @jCalderTravis, the issues we are experiencing here stem from the gradient of the likelihood becoming unreliable when the non-decision time parameter approaches the shortest observed RTs. Our current attempt to mitigate this is to start the chains at very small parameter values. Simultaneously, we are trying to figure out how to improve the likelihood formulation so that it has more robust gradients in those regions. The suggestion to use the slice sampler derives from the insight that it is really the gradient, not the basic likelihood evaluation itself, that is the problem here. Slice samplers don't use gradients (which of course makes them worse in other respects, including speed, otherwise we wouldn't care about gradients), so they shouldn't suffer the same fate. Your concerns about equal starting points have some merit; let me share a few thoughts below.
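For concreteness, here is a minimal sketch of what the two workarounds might look like in a plain PyMC model. The toy model, parameter names, and sampler settings below are illustrative assumptions, not this package's actual model-building API:

```python
import pymc as pm

# Toy model containing only the parameters relevant to this discussion;
# the real model would be built through the package's own interface.
with pm.Model() as model:
    t = pm.Uniform("t", lower=0.0, upper=1.0)  # non-decision time (s)
    v = pm.Normal("v", mu=0.0, sigma=2.0)      # drift rate

    # Option A: gradient-free slice sampling instead of the default NUTS.
    idata_slice = pm.sample(step=pm.Slice(), draws=1000, tune=1000)

    # Option B: keep NUTS but initialise t at a very small value.
    idata_nuts = pm.sample(initvals={"t": 0.01}, draws=1000, tune=1000)
```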
Best,
-
Thank you to the developers for a very exciting new code package, and for all the associated documentation.
We have experienced similar convergence issues to those that have been discussed elsewhere (#306, #323, #413, #471). If I understand correctly, these are likely caused by issues with the "t" (non-decision time) parameter when its value is close to a response time (RT) observed in the dataset.
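To make the failure mode concrete, one quick check is to compare the shortest observed RT with the sampled (or intended initial) values of t. The sketch below uses made-up stand-in data and draws, since I only want to illustrate the comparison itself:

```python
import numpy as np
import pandas as pd

# Stand-ins for the real quantities: a trial-level data frame with an "rt"
# column (seconds) and an array of posterior draws for t. In practice these
# would come from the actual dataset and the sampler's output
# (e.g. idata.posterior["t"].values.ravel()).
rng = np.random.default_rng(0)
data = pd.DataFrame({"rt": rng.uniform(0.3, 1.2, size=500)})
t_draws = rng.normal(0.35, 0.05, size=4000)

min_rt = data["rt"].min()
frac_near = np.mean(t_draws > 0.95 * min_rt)
print(f"Shortest RT: {min_rt:.3f} s; "
      f"fraction of t draws within 5% of it: {frac_near:.2%}")
```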
In the linked discussions, several solutions have been proposed, including switching to the slice sampler and initialising the t parameter at very small values.
My collaborator has found that the slice sampler works very well, although unfortunately it is prohibitively slow for hierarchical models (an open issue; #388).
Regarding setting initial values for the t parameter to very small values, I am unsure about this approach for two reasons. First, I wondered whether this could bias the Gelman-Rubin statistic: if all chains start at the same place then, even if sampling fails, the between-chain variance will be very small, making it appear as if convergence has been achieved. Second, real non-decision times can be quite high, depending on the task. For example, van den Berg et al. (2016) report non-decision times of about 400 ms, and real datasets could certainly contain lapse trials with RTs shorter than this. If sampling runs into problems as soon as the non-decision time parameter approaches the shortest RTs, and the chain is initialised with a very low value for t, then the true value of t will likely never be reached.
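One partial mitigation I have considered for the first concern is that PyMC's `pm.sample` accepts a separate initial-value dictionary per chain, so chains could start small but not identical, which should preserve some of the Gelman-Rubin diagnostic's power. A rough sketch under that assumption (the toy model and the specific values are purely illustrative):

```python
import arviz as az
import pymc as pm

# One initial-value dict per chain: small but deliberately different starting
# points for t, so between-chain variance is not artificially suppressed.
init_per_chain = [{"t": 0.01}, {"t": 0.05}, {"t": 0.10}, {"t": 0.20}]

with pm.Model():
    t = pm.Uniform("t", lower=0.0, upper=1.0)  # stand-in for the real model
    idata = pm.sample(chains=4, initvals=init_per_chain, draws=1000, tune=1000)

print(az.rhat(idata, var_names=["t"]))
```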
I wondered whether these concerns with setting the initial value for t are legitimate, or whether I might have misunderstood something. I would be grateful for any thoughts.