Depth of auto-generated MSAs #39

Open
max-overath opened this issue Nov 21, 2024 · 10 comments
Comments

@max-overath

The MSAs that got generated for my predictions only contain a couple of sequences.
Is this due to limitations of the MMseqs2 server, or can this be adjusted?

@jwohlwend
Owner

Yeah, that'd be due to the server. Maybe try DeepMSA (https://zhanggroup.org/DeepMSA/) and see if you get a much deeper MSA? Maybe your sequence just doesn't have many homologues?

@Overathed

Thanks for the answer! Yes, when using DeepMSA I get a much deeper MSA. However, it is also deeper when using ColabFold, which should also use an MMseqs2 server if I'm not mistaken?

@jwohlwend
Owner

Would you mind sharing your input config? I can take a look.

@heol1

heol1 commented Nov 21, 2024

Is the query sequence a hetero-multimeric protein? If so, I had the same issue.
ColabFold queries the MMseqs2 API twice: once for each chain individually and once in "pair" mode.
https://github.com/sokrypton/ColabFold/blob/e2ca9e8f992cd65c986de5b64885d5572d8b8ad9/colabfold/batch.py#L817-L857
In contrast, Boltz's current compute_msa implementation calls the API only once, in "pair" mode:

msa = run_mmseqs2(list(data.values()), msa_dir, use_pairing=len(data) > 1)

This might be the reason why you have a shallow MSA...
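
To make the difference concrete, here is a minimal sketch of the two-query pattern described above. It is only an illustration, not Boltz's or ColabFold's actual code: `query` is a hypothetical callable standing in for whatever talks to the MSA server, and its signature (a list of sequences plus a pairing flag, returning one MSA string per sequence) is an assumption made for this example, not the real `run_mmseqs2` interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical server-query signature (an assumption for this sketch):
# takes the chain sequences and a pairing flag, returns one MSA string per chain.
QueryFn = Callable[[List[str], bool], List[str]]


@dataclass
class ChainMsa:
    unpaired: str  # MSA from querying the chain without pairing
    paired: str    # MSA from the "pair" mode query ("" for a single chain)


def compute_msas(chains: Dict[str, str], query: QueryFn) -> Dict[str, ChainMsa]:
    """Query once without pairing, and once more in pair mode if there are several chains."""
    names = list(chains)
    seqs = [chains[name] for name in names]

    # First query: unpaired MSAs, which are typically the deeper ones.
    unpaired = query(seqs, False)

    # Second query: paired MSAs, only meaningful for multi-chain inputs.
    if len(seqs) > 1:
        paired = query(seqs, True)
    else:
        paired = [""] * len(seqs)

    return {
        name: ChainMsa(unpaired=u, paired=p)
        for name, u, p in zip(names, unpaired, paired)
    }
```

The sketch keeps the paired and unpaired results separate on purpose; how the two should be combined before being fed to the model is a separate question that this example does not try to answer.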

@jadolfbr

jadolfbr commented Nov 21, 2024 via email

@jwohlwend
Owner

Looking into it, will report back.

@max-overath
Author

@heol1 yes, exactly, it's a hetero-multimer. When I run the chains individually, I get much deeper MSAs.

@max-overath
Author

@jwohlwend input fasta for reference:

>A|protein
QVQLQESGGGLVQAGGSLRLSCAGSGDALGSYTMGWFRQAPGGGRDLVAQISVDGSSTYHLDSVRGRFTASRDNAKNTVYLEMNSLNSEDTAVYYCAAAPLLRGNYDYWGQGTQVTVSS
>B|protein
IRCFITPDITSKDCPNGHVCYTKTWCDAFCSIRGKRVDLGCAATCPTVKTGVDIQCCSTDNCNPFPTRKRP
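
For anyone who wants to reproduce the "run the chains individually" comparison, here is a small sketch that splits a multi-chain FASTA like the one above into one single-chain FASTA string per chain. The helper names are made up for this example, and it assumes the `>ID|type` header style shown in the input; it only does text handling and does not call Boltz.

```python
from typing import Dict


def split_fasta(text: str) -> Dict[str, str]:
    """Return {header: sequence} for each record in a FASTA string."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequences may be wrapped over several lines; join them.
            records[header] += line
    return records


def per_chain_fastas(text: str) -> Dict[str, str]:
    """Return {chain_id: single-record FASTA text}, keyed by the part before '|'."""
    return {
        header.split("|")[0]: f">{header}\n{sequence}\n"
        for header, sequence in split_fasta(text).items()
    }
```

Each returned string can be written to its own file and predicted separately to compare the single-chain MSA depth against the multimer run.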

@amelie-iska

I'm having similar issues. I tried this:

version: 1  # Optional, defaults to 1
sequences:
  - protein:
      id: A
      sequence: MGDWSALGRLLDKVQAYSTAGGKVWLSVLFIFRILLLGTAVESAWGDEQSAFVCNTQQPGCENVCYDKSFPISHVRFWVLQIIFVSTPTLLYLAHVFYLMRKEEKLNRKEEELKMVQNEGGNVDMHLKQIEIKKFKYGLEEHGKVKMRGGLLRTYIISILFKSVFEVGFIIIQWYMYGFSLSAIYTCKRDPCPHQVDCFLSRPTEKTIFIWFMLIVSIVSLALNIIELFYVTYKSIKDGIKGKKDPFSATNDAVISGKECGSPKYAYFNGCSSPTAPMSPPGYKLVTGERNPSSCRNYNKQASEQNWANYSAEQNRMGQAGSTISNTHAQPFDFSDEHQNTKKMAPGHEMQPLTILDQRPSSRASSHASSRPRPDDLEI
  - protein:
      id: B
      sequence: MGTFEEVP

with the command:

boltz predict examples/multimer.yaml --recycling_steps 20  --diffusion_samples 10 --use_msa_server

and the MSAs for the individual chains, as well as the paired MSA, contain only a single sequence each. Using ColabFold, I get much deeper MSAs (and much better predictions).
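
A quick way to put a number on "depth" when comparing tools is to count the sequences in each MSA. The sketch below assumes the MSA is available as A3M/FASTA-style text, with one `>` header line per sequence (as in the `.a3m` files ColabFold writes); the function name is made up for this example, and MSAs stored in another format would need a different reader.

```python
def msa_depth(a3m_text: str) -> int:
    """Count sequences in A3M/FASTA-style text by counting '>' header lines."""
    return sum(1 for line in a3m_text.splitlines() if line.startswith(">"))
```

A result of 1 means the MSA contains only the query sequence, which is the single-sequence case described above.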

@paul-goldsmith

Just adding another voice to this - I'm also finding the auto-generated MSA to be very shallow (single sequence) using two proteins and the --use_msa_server flag.

7 participants