
Auto-regression #119

Open
gdalle opened this issue Nov 6, 2024 · 7 comments

Comments

@gdalle
Owner

gdalle commented Nov 6, 2024

It would be interesting to see if we can implement HMMs where the emission also depends on the previous emission.

@fausto-mpj

fausto-mpj commented Nov 12, 2024

Well, I'm doing an AR-HMM for protein conformation in my dissertation and I'm using your code as a base. Initially, I'm only doing an AR(1)-HMM(1) with a categorical observation distribution for simplicity's sake (as I only need a proof of concept), but I'm willing to generalize to an AR(n)-HMM(m), so...

Right now I'm trying to adapt your Viterbi into an n-best Viterbi, and afterward I want to adapt both so that they run not from 1 to N but from k to N, given the k-1 previous values of the parameter process, i.e. $\arg\max P(X_{k}, \ldots, X_{N} \vert X_{1}, \ldots, X_{k-1}, Y_{1}, \ldots, Y_{N})$.

(Of course, the second one only differs from the regular Viterbi if the MC is of higher order and/or non-homogeneous. If I understood your code correctly, I can make a non-homogeneous MC using control dependency, adding a control variable to transition_matrix, but for higher order I'm still trying to figure out how much I have to tinker.)
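For concreteness, the k-to-N decoding described above can be sketched as follows. This is a minimal Python illustration with plain dense arrays, not package code: a Viterbi pass whose first k-1 states are clamped to a known prefix, which with an empty prefix reduces to the ordinary Viterbi algorithm.

```python
import numpy as np

def viterbi_from_k(init, A, B, ys, prefix):
    """Most likely suffix X_k, ..., X_N given a clamped prefix X_1, ..., X_{k-1}.
    init: P(X_1); A[i, j]: transition probs; B[i, y]: emission probs; ys: observations.
    With an empty prefix this is the ordinary Viterbi algorithm."""
    N, n = len(ys), A.shape[0]
    k = len(prefix) + 1
    if k == 1:
        delta = np.log(init) + np.log(B[:, ys[0]])   # time 1, nothing clamped
    else:
        # start the recursion at time k, conditioning on the known X_{k-1}
        delta = np.log(A[prefix[-1]]) + np.log(B[:, ys[k - 1]])
    back = []
    for t in range(max(k, 1), N):
        scores = delta[:, None] + np.log(A)          # scores[i, j]: end in i, move to j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + np.log(B[:, ys[t]])
    path = [int(delta.argmax())]                     # backtrack the maximizing suffix
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

init = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
assert viterbi_from_k(init, A, B, [1, 1, 1], []) == [1, 1, 1]  # ordinary Viterbi
assert viterbi_from_k(init, A, B, [0, 1], [0]) == [0]          # suffix X_2 given X_1 = 0
```

The clamped prefix only enters through the row $A[x_{k-1}]$ used to initialize $\delta$, which is why the recursion itself is unchanged.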

To be honest, right now my code is pure spaghetti and still very experimental, but if you have any interest in this, give me a shout and we can try to make this work.

@gdalle
Owner Author

gdalle commented Nov 13, 2024

Hi @fausto-mpj, glad to hear you're using HiddenMarkovModels.jl!

It should be possible to adapt my code and take into account the previous observation obs_seq[t-1] when computing likelihoods. That way we could support AR(1)-HMM(1). If you want to give it a try, I can guide you.

After that, I don't know whether the jump in complexity from memory-1 to memory-k is justified. In practice, you can always simulate memory-k with memory-1 by encoding the state (or the observation, or both) as a tuple including previous values.
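The tuple-encoding trick can be sketched as follows (a standalone Python illustration with plain arrays, not package code): a memory-2 chain on $n$ states becomes a memory-1 chain on the $n^2$ pairs $(X_{t-1}, X_t)$.

```python
import numpy as np

def lift_second_order(T2):
    """Lift a memory-2 transition tensor T2[i, j, k] = P(X_t = k | X_{t-2} = i, X_{t-1} = j)
    to a first-order transition matrix on pair states z = (X_{t-1}, X_t)."""
    n = T2.shape[0]
    T1 = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # the pair (i, j) can only move to a pair whose first component is j
                T1[i * n + j, j * n + k] = T2[i, j, k]
    return T1

# toy memory-2 chain on {0, 1}
T2 = np.zeros((2, 2, 2))
T2[:, :, 0] = 0.7  # P(next = 0 | any history) = 0.7
T2[:, :, 1] = 0.3
T1 = lift_second_order(T2)
assert np.allclose(T1.sum(axis=1), 1.0)  # rows of the lifted matrix are distributions
assert T1[0, 2] == 0.0                   # pair (0, 0) cannot jump to pair (1, 0)
```

Most entries of the lifted matrix are structurally zero, which is why a sparse representation pays off as the memory grows.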

What do you think?

@fausto-mpj

fausto-mpj commented Nov 13, 2024

To be honest, I didn't check if your package supported sparse matrices, but I just found out that it does. Nice!

Definitely, a higher-order MC $\{X_{t}\}$ with lag $l$ and support $H$, such that $P(X_{t} \vert X_{[1:t-1]}) = P(X_{t} \vert X_{t-1},\ldots,X_{t-l})$, is equivalent to a first-order MC $\{Z_{t}\}$ with $Z_{t} = (X_{t-l+1},\ldots,X_{t})$ on $H^{l}$. The main issue is that the transition matrix for $Z_{t}$ grows exponentially, but with sparse matrices this is not a huge problem, as most entries are $0$. Zucchini, MacDonald & Langrock (2016) also briefly mention Raftery's MTD as a way to efficiently infer the parameters of this higher-order MC, but I don't know whether it is straightforward to implement, nor how to adapt the emission matrix (or any equivalent construct) and its serial dependency to work with the $Z_{t}$ transition matrix.
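As a rough sketch of the MTD idea (my reading of Raftery's model, not code from the book): each lag contributes through a single shared transition matrix $Q$, weighted by mixture coefficients $\lambda_g$ on the simplex, so the parameter count is $n^2 + l$ instead of exponential in $l$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 3, 2
Q = rng.random((n, n))
Q /= Q.sum(axis=1, keepdims=True)   # one shared row-stochastic transition matrix
lam = np.array([0.6, 0.4])          # lag weights on the simplex

def mtd_conditional(history):
    """P(X_t = . | X_{t-1}, ..., X_{t-l}) as a mixture over lags (Raftery's MTD).
    history[g] is the value of X_{t-1-g}."""
    return sum(lam[g] * Q[history[g]] for g in range(l))

p = mtd_conditional([0, 2])
assert np.isclose(p.sum(), 1.0)  # a valid distribution, with only n^2 + l parameters
```

Each conditional is a convex combination of rows of $Q$, so it is automatically a probability distribution for any history.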

In an AR($n$)-HMM($m$) model, with $\{X_{t}\}$ the parameter process and $\{Y_{t}\}$ the observation process, we have, for $t > \max(m, n)$:

$$ P(X_{t}\vert X_{[1:t-1]}, Y_{[1:t-1]}) = P(X_{t}\vert X_{t-1},\ldots,X_{t-m}), \qquad P(Y_{t}\vert X_{[1:t]}, Y_{[1:t-1]}) = P(Y_{t}\vert X_{t}, Y_{t-1}, \ldots, Y_{t-n}) $$

For a sequence of length $N$, the factorization theorem tells us that

$$ P(X_{[1:N]}, Y_{[1:N]}) = P(X_{1}) \, P(Y_{1}\vert X_{1}) \prod_{i=2}^{m} P(X_{i} \vert X_{[1:i-1]}) \prod_{i=2}^{n} P(Y_{i} \vert X_{i}, Y_{[1:i-1]}) \prod_{i=m+1}^{N} P(X_{i} \vert X_{[i-m:i-1]}) \prod_{i=n+1}^{N} P(Y_{i} \vert X_{i}, Y_{[i-n:i-1]}). $$

For some of these quantities the estimation is quite similar to what Forward and Forward-Backward already do. However, I don't know if $P(Y_{1}\vert X_{1})$, $\prod_{i=2}^{m} P(X_{i} \vert X_{[1:i-1]})$, and $\prod_{i=2}^{n} P(Y_{i} \vert X_{i}, Y_{[1:i-1]})$ are correct without further adjustment to the algorithms. And, of course, the EM for this is probably a different beast. Maybe it's better to begin with AR(1)-HMM(1), and then AR(2)-HMM(1) or AR(1)-HMM(2), just as a sanity test before doing it for general $n,m$.
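A quick numerical sanity check of this factorization in the simplest case $m = n = 1$ (a standalone Python sketch with made-up toy parameters): summing the factorized joint over every trajectory must give exactly 1.

```python
import itertools
import numpy as np

# toy AR(1)-HMM(1): 2 hidden states, 2 observation symbols
init = np.array([0.6, 0.4])                 # P(X_1)
A = np.array([[0.9, 0.1], [0.2, 0.8]])      # P(X_t | X_{t-1})
B0 = np.array([[0.7, 0.3], [0.4, 0.6]])     # P(Y_1 | X_1), no previous observation
# B[i, j, y] = P(Y_t = y | X_t = i, Y_{t-1} = j)
B = np.array([[[0.8, 0.2], [0.5, 0.5]],
              [[0.3, 0.7], [0.1, 0.9]]])

def joint(xs, ys):
    """Joint probability of one full trajectory via the factorization with m = n = 1."""
    p = init[xs[0]] * B0[xs[0], ys[0]]
    for t in range(1, len(ys)):
        p *= A[xs[t - 1], xs[t]] * B[xs[t], ys[t - 1], ys[t]]
    return p

# the factorization defines a proper distribution: all trajectories sum to 1
N = 3
total = sum(joint(xs, ys)
            for xs in itertools.product(range(2), repeat=N)
            for ys in itertools.product(range(2), repeat=N))
assert np.isclose(total, 1.0)
```

The same brute-force enumeration is a handy ground truth for debugging $\alpha_t$ in the adapted Forward pass on short sequences.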

Right now I'm using the previous observation as a control variable, but I will look into your suggestion to use obs_seq[t-1], which would at least save me from using a control variable for information that is already there.
I'm going to read Hadar & Messer (2009) on higher-order HMMs and Boys & Henderson (2004) on Bayesian HMMs right now, to see what kind of adjustment is most viable for your preexisting code.

Afterwards, I have some questions to ask, because I'm getting a slightly different $\alpha_{t}$ and I'm pretty sure I'm missing something, such as a normalizing constant. I tried to create a trellis graph and find the k shortest paths using Yen's algorithm (just to use as ground truth for the $n$-best Viterbi), but I'm getting an answer different from both the sequence that maximizes the local decoding at each step and the global decoding.

@gdalle
Owner Author

gdalle commented Nov 13, 2024

My proposal is to handle the AR(1)-HMM(1) case in the package itself, and then let users translate from AR(n)-HMM(m) to AR(1)-HMM(1) by tuple-ifying whatever they need.

We will be able to do this for inference without too much trouble. The only requirement is to generalize obs_distributions(hmm, control) into obs_distributions(hmm, control, previous_obs). Ideally I'd like this generalization to be optional, so as to avoid bothering most users, who don't care about the AR part.
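To illustrate the optional-third-argument idea, here is a hypothetical Python mock-up (the class and method names are mine, not the package's API): the three-argument form falls back to ignoring the previous observation, so non-AR models are untouched, while an AR(1) Gaussian model overrides it to shift each state's mean.

```python
from scipy.stats import norm

class HMMBase:
    """Hypothetical sketch of the fallback idea, not the package's real interface."""
    def obs_distributions(self, control):
        raise NotImplementedError

    def obs_distributions_prev(self, control, prev_obs):
        # sensible default: non-AR models simply ignore the previous observation
        return self.obs_distributions(control)

class GaussianARHMM(HMMBase):
    def __init__(self, means, sigma, phi):
        self.means, self.sigma, self.phi = means, sigma, phi

    def obs_distributions(self, control):
        # used at t = 1, when there is no previous observation
        return [norm(m, self.sigma) for m in self.means]

    def obs_distributions_prev(self, control, prev_obs):
        # AR(1) emission: shift each state's mean by phi * previous observation
        return [norm(m + self.phi * prev_obs, self.sigma) for m in self.means]

hmm = GaussianARHMM([0.0, 1.0], 1.0, 0.5)
dists = hmm.obs_distributions_prev(None, 2.0)
assert dists[0].mean() == 1.0  # 0.0 + 0.5 * 2.0
```

The point of the default method is exactly the "avoid bothering most users" goal: a model that never defines the three-argument form still works everywhere the inference code passes prev_obs.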

I don't think we'll be able to code a generic learning function though, for the same reason as with controls: the M update is no longer explicit with AR dependencies, except in very special cases (discrete observations would probably work). So we might have to accept that the user needs to write fit! themselves.

@fausto-mpj

fausto-mpj commented Nov 15, 2024

Right, I've been tinkering with the code and hit some obstacles, so I might need some help. I've added an optional boolean parameter ar to HMM, with default value ar = false. So, for every case in which the user doesn't want an AR-HMM, the whole code goes as usual.

The main problem I'm facing is with obs_distributions. Here things are wildly different depending on whether the observation process $\{Y_{t}\}$ has finite support or not. If ar is false, then we need a vector dists with $\vert X_{t} \vert$ components, in which the $i$-th component is a distribution for $Y_{t} \vert X_{t} = i$. All nice and quiet, and it doesn't matter whether $\mathrm{supp}(Y_{t})$ is finite or not.

However, for ar equal to true, the first problem is that we don't have a generic default way to determine $P(Y_{1} \vert X_{1})$ for the AR-HMM at the initial step, so the user should provide $\vert X_{t} \vert$ extra observation distributions. But how? This question led me to the second problem.

If $\mathrm{supp}(Y_{t})$ is finite, then dists is a vector of $\vert X_{t} \vert$ components, in which the $i$-th component is itself a vector of $\vert Y_{t} \vert + 1$ sub-components, and the $j$-th sub-component is a distribution for $Y_{t} \vert X_{t} = i, Y_{t-1} = j$ (except for $j = \vert Y_{t} \vert + 1$, which is a distribution for $Y_{t} \vert X_{t} = i$). In this case, obs_distributions is simply managing the indices based on the value or existence of a previous observation.

If $\mathrm{supp}(Y_{t})$ is infinite, then dists is a vector of $\vert X_{t} \vert$ components, in which the $i$-th component is a distribution for $Y_{t} \vert X_{t} = i, Y_{t-1}$. In this case, obs_distributions must be specialized to do something with the previous_obs, as there is no generic default way to use the previous observation. For example, a Normal distribution could use the previous observation to shift its mean parameter. Here there is no need for the extra observation distributions, because we can assume the previous value was nothing and adjust the function so that it still does its job.
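The finite-support bookkeeping above (one extra slot per state, holding the initial distribution $P(Y_1 \vert X_1 = i)$) can be sketched as follows, a standalone Python illustration using 0-based indices and random toy parameters:

```python
import numpy as np

n_states, n_obs = 2, 3
# dists[i][j] = categorical P(Y_t | X_t = i, Y_{t-1} = j) for j < n_obs,
# and dists[i][n_obs] = initial P(Y_1 | X_1 = i) (the extra slot from the discussion)
rng = np.random.default_rng(1)
dists = rng.random((n_states, n_obs + 1, n_obs))
dists /= dists.sum(axis=2, keepdims=True)

def emission(i, prev_obs=None):
    """Pick the right conditional distribution for state i: the extra slot when
    there is no previous observation, the prev_obs-indexed one otherwise."""
    j = n_obs if prev_obs is None else prev_obs
    return dists[i][j]

assert np.isclose(emission(0).sum(), 1.0)     # initial distribution, no previous obs
assert np.isclose(emission(1, 2).sum(), 1.0)  # conditional on Y_{t-1} = 2
```

This is just index management, as the comment says; the infinite-support case replaces the table lookup with a function of prev_obs.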

But this is getting confusing. Ideally, there should be a way that works for both cases. I was looking at the MixtureModel type from Distributions.jl, unifying every case as a mixture and using its built-in functions ncomponents, component and probs to do whatever is needed inside obs_distributions... but this would add a dependency. Do you have any other idea? Any suggestion?

@gdalle
Owner Author

gdalle commented Nov 15, 2024

As you might have noticed, the inference code in this package is completely agnostic to whether $Y$ has discrete or continuous values. Only the learning part is specific to the observation distribution. Thus I think we should strive to maintain this level of generality.

I also don't think we should modify the built-in HMM, because it is the simplest model and should not be overcomplicated. Instead, anyone using an AutoRegressive HMM will have to subtype AbstractHMM manually, if only because the dependency $\mathbb{P}(Y_t | X_t, Y_{t-1})$ cannot be captured by a one-size-fits-all struct.

I created a branch gd/ar with some initial suggestions to get you started. Essentially, I propose to add a type parameter to AbstractHMM{ar} with possible values true and false. I also add a positional argument to obs_distributions, whose signature is now

obs_distributions(hmm, control, prev_obs)

with sensible fallbacks. I have started to adapt the main inference routines to pass the previous observation to obs_distributions, but I have probably missed a lot of spots (I didn't run the code).

I don't have a good answer yet for what happens at the first time step. What do they usually do in the literature?

@fausto-mpj

fausto-mpj commented Nov 16, 2024

Nice. I've just cloned the branch and will look into it.

In the literature, I found the following:

  1. Autoregressive Hidden Markov Models for the Early Detection of Neonatal Sepsis (2014): The observation process is finite and discrete, so they modeled it as a first-order MC with transition matrix $(t_{i,j}^{(k)}) = P(Y_{t} = j \vert Y_{t-1} = i, X_{t} = k)$ and initial distribution $u_{i}^{(k)}(1) = P(Y_{1} = i \vert X_{1} = k)$. For these quantities, they used a symmetric Dirichlet prior and obtained MAP estimates via Dirichlet-Categorical conjugacy. They made a small adjustment to $\alpha_{t}$ and $\beta_{t}$ in the Forward and Forward-Backward algorithms... but the adjustment is just adding the dependency, no additional terms or anything like that. However, this doesn't work in continuous cases...

  2. Autoregressive Higher-Order Hidden Markov Models: Exploiting Local Chromosomal Dependencies in the Analysis of Tumor Expression Profiles (2014): The observation process is continuous, so they modeled it as an $M$-th-order autoregressive Normal distribution with mean $\mu_{t}^{(i)} = \mu^{(i)} + \sum_{m=1}^{M} c_{i,m} \cdot o_{t-m} \cdot \delta_{t,m}$, for $x_{t} = i$, where the $c_{i,m}$ are weights measuring the impact of the previous observations and $\delta_{t,m}$ is an indicator function equal to $0$ if $t-m < 1$ and $1$ otherwise. Pretty standard stuff: when there is no previous observation, just truncate it.

  3. Stochastic models for heterogeneous DNA sequences (1989): The observation process is finite and discrete, so he modeled it as a counting process and avoided the problem of estimating $P(Y_{1} \vert X_{1})$ by assuming that the sequence is circular: $y_{0} \equiv y_{N}$, $x_{0} \equiv x_{N}$. For a long enough sequence, this is good enough, I guess... but it does feel like gambiarra (an untranslatable word; roughly, a makeshift hack).

  4. Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models (2002): The observation process is finite and discrete, so they modeled it as an $M$-th-order counting process. However, they simply used EM to estimate the emission parameters, taking the MLEs of the categorical distributions (given the first $M$ observations) as initial estimates. Again, when there are no previous observations, they just truncate and let the EM fill in the missing parts.

  5. Mixture and Hidden Markov Models with R (2022): In Chapter 4, Section 4.8.3, the authors argue that truncating is not that simple when the values are not Missing at Random. They exhibit an example where including a simple model for the missing value gives better parameter estimates.
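The truncated AR mean from reference 2 is easy to sketch (a standalone Python illustration; mu_i, c_i and obs are made-up toy values, and obs is 0-indexed so $o_{t-m}$ is obs[t-m-1]):

```python
import numpy as np

def ar_mean(mu_i, c_i, obs, t):
    """mu_t^{(i)} = mu^{(i)} + sum_m c_{i,m} * o_{t-m} * delta_{t,m},
    where delta_{t,m} = 0 whenever t - m < 1 (truncation at the sequence start)."""
    M = len(c_i)
    total = mu_i
    for m in range(1, M + 1):
        if t - m >= 1:  # delta_{t,m}: skip lags that fall before the first observation
            total += c_i[m - 1] * obs[t - m - 1]
    return total

obs = [1.0, 2.0, 3.0]
# at t = 1 every lag is truncated, so the mean is just mu^{(i)}
assert ar_mean(0.5, [0.3, 0.2], obs, 1) == 0.5
# at t = 3 with M = 2: 0.5 + 0.3 * o_2 + 0.2 * o_1 = 0.5 + 0.6 + 0.2
assert np.isclose(ar_mean(0.5, [0.3, 0.2], obs, 3), 1.3)
```

Truncation is thus purely a matter of dropping unavailable lag terms from the sum, which is what makes it the default in most of the papers above.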

I believe that encouraging the user to provide a distribution for $P(Y_{1} \vert X_{1})$ is the best choice. If the user fails to do so, maybe fall back to truncation. Or maybe a special case for discrete/categorical distributions... But then, how to do the truncation after the user has given the parameters of the model? A simple solution is to begin from $X_{2}$ (having observed the first two values of the observation sequence), so the initial distribution is updated to $u^{*}(1) = u(1) \cdot T$, and then everything goes exactly as it already does, with the first step in Forward being $u_{i}^{*}(1) \cdot e_{i,j}^{k}$. Any suggestion?
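The proposed warm start $u^{*}(1) = u(1) \cdot T$ is cheap to check numerically (a Python sketch with toy numbers):

```python
import numpy as np

u1 = np.array([0.6, 0.4])                # initial distribution u(1) = P(X_1)
T = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition matrix

# start the recursion at X_2: the updated initial distribution is u*(1) = u(1) . T
u_star = u1 @ T
assert np.allclose(u_star, [0.62, 0.38])
assert np.isclose(u_star.sum(), 1.0)     # still a probability distribution
```

Since $T$ is row-stochastic, $u^{*}(1)$ automatically remains a probability vector, so the rest of the Forward recursion needs no other change.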
