- Prerequisites
- Installation
- Usage
- Results
- AlphaFold pulldown
- Updating Alphafold
- Troubleshooting
- Citing
This is not meant to be an exhaustive guide on how to use ibex and a linux terminal, so if you're completely unfamiliar with these concepts, I recommend you to take a look at the following resources first:
-
What is Ibex?
Ibex is a computer cluster that you can access to run "jobs" or execute programs that require large amounts of memory, CPU or GPU power. If you want to learn more you can check out the Ibex training site or the user guide.
You can connect to ibex from your Terminal application (Mac) with the following command (without the brackets around your username):
ssh [your_username]@glogin.ibex.kaust.edu.sa
To connect in Windows, you need an ssh client application such as PuTTY.
-
What is a terminal?
If you're not familiar with the use of a Unix terminal/shell, you would greatly benefit from watching this introductory workshop by the KAUST Visualization Core Lab, inspired from this short course by Software Carpentry.
Also check out this "linux cheat sheet" for a list of common commands. The ones in the section called "Files & Navigating" will be the most useful to you.
-
How do I transfer files between ibex and my computer?
Since the update to WekaIO file system it is not possible to connect to ibex through the Finder application, so you will have to transfer files from your local terminal to the destination in ibex with either the scp or rsync commands.
Example:
rsync -trvhP /local/path/to/fasta/sequence.fasta <username>@glogin.ibex.kaust.edu.sa:/ibex/user/<username>/path/to/store/sequence.fasta
You need to have Miniconda installed in your home directory in ibex. If you already have it installed, you can skip to "Install Alphafold":
-
Login to Ibex through your Terminal application (mac) with the command:
ssh [your_username]@glogin.ibex.kaust.edu.sa
For logging in on Windows, you need an ssh client application such as PuTTY.
-
Download the Miniconda installation script, and run it.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh bash Miniconda3-latest-Linux-x86_64.sh
Note: The installation assistant will prompt you for input in a few steps, to accept the license agreement, and if you want to set a custom installation directory. You can just accept the defaults. Near the end, when it asks you the following:
Do you wish the installer to initialize Miniconda3 by running conda init?
In here you should type
yes
and press enter. -
Load modifications onto your current session:
source ~/.bashrc
-
Done! You should see either
(miniconda3)
or(base)
in you terminal before the prompt, which means conda was installed correctly and thebase
environment is activated. If you are still unsure about the process, you can checkout the following video in the KAUST Visualization Core Lab YouTube channel with the detailed steps.
Follow these steps to install your instance of AlphaFold in Ibex:
-
Login to Ibex through your Terminal application (mac) with the command:
ssh [your_username]@glogin.ibex.kaust.edu.sa
For logging in on Windows, you need an ssh client application such as PuTTY.
-
Go to your user directory:
cd /ibex/user/[your_username]/
-
Clone the following repositories, and then
cd
into thealphafold
directory.git clone https://github.com/strubelab/alphafold.git git clone https://github.com/guzmanfj/Executor.git cd alphafold
-
Create a virtual environmnet to work on, and activate it. This step will take 15-30 minutes, so make sure you don't lose your connection to ibex.
conda env create --prefix ./env --file environment.yml
Once the environment is created, make sure to activate it afterwards:
conda activate ./env
-
Install the newly downloaded
alphafold
andExecutor
packages in your current environment:pip install -e ../Executor pip install -e .
-
Done! The alphafold wrapper should be installed now. Check that you can use the
alphafold_wrapper
command by runningalphafold_wrapper --help
usage: alphafold_wrapper [-h] [--input INPUT [INPUT ...]] --destination DESTINATION [--gpus GPUS] [--time TIME] [--mem MEM] [--recycles RECYCLES] [--cpus CPUS] [--models_to_relax {all,best,none}] [--use_precomputed_msas] [--mail MAIL] [--multimer_predictions_per_model MULTIMER_PREDICTIONS_PER_MODEL] [--gpu_type {v100,a100}] Takes one or more FASTA files with amino acid sequences (one for each model), and submits a job array to ibex to run AlphaFold on each file. options: -h, --help show this help message and exit --input INPUT [INPUT ...] Fasta file(s) with the sequence(s) to model. Provide one fasta file for each model that you want to produce. --destination DESTINATION Path for saving the resulting AlphaFold models. It will create one subdirectory for each model. Also will contain the sequence files as they were submitted to ibex, the script that was submitted and the ibex stdout files. --gpus GPUS Number of GPUs to request to ibex. It is likely that AlphaFold only ever uses 1. (default=1) --time TIME Time in minutes to give to each job. (default="auto") --mem MEM Memory in GB to allocate to each job. (default="auto") --recycles RECYCLES Only for monomeric models. Set the number of times to recycle the output through the network. More recycles might help to mildly improve the quality of the models in some cases. Default is 3, max recommended is 10. Multimeric models run up to 20 recycles, but will stop if the difference in pairwise distances is less than 0.5. --cpus CPUS Number of CPUs to request. The MSA programs are set to use 8, which is the default here. --models_to_relax {all,best,none} The models to run the final relaxation step on. If `all`, all models are relaxed, which may be time consuming. If `best`, only the most confident model is relaxed. If `none`, relaxation is not run. Defaults to `best`. --use_precomputed_msas Set this flag if you want to reuse the MSA outputs from a previous run, for example if there was an error after the MSA step and you want to start the same run again, or if you want to make the same model but with a different number of recycles. Make sure to copy the old results to a new directory if you don't want to overwrite them. --mail MAIL Email to send notifications about the job progess in ibex. --multimer_predictions_per_model MULTIMER_PREDICTIONS_PER_MODEL Number of multimeric predictions to make for each of the 5 ML models that AlphaFold runs. The total number of structures (pdb files) predicted will be 5 times this number (e.g. `--multimer_predictions_per_model 5` will give 25 structures in total). Defaults to 1. NOTE: if you indicate more than 1, you might have to increase the time to be requested with the --time argument as well. --gpu_type {v100,a100} Whether to use V100 or A100 GPU. If you don't know what this is, leave the default. (Default=v100) --old_uniclust Use the older version of UniClust database (uniclust30_2018_08) to calculate the MSAs with HHblits.
-
Login to ibex, navigate to the
alphafold
directory installed earlier, and activate the conda environment:ssh [your_username]@glogin.ibex.kaust.edu.sa cd /ibex/user/[your_username]/alphafold # the exact location might vary depending # on where you cloned the repository conda activate ./env
-
Run the
alphafold_wrapper
program:There are basically two scenarios for making models:
A. Make one or multiple models from scratch:
alphafold_wrapper --input ./data/model1.fasta data/model2.fasta --destination ./results --recycles 6 --mail [email protected] --models_to_relax all
-
--input
: One or more fasta files for modeling. One modeling job will be launched for each fasta file separately. If you want to model a multimer, give all the fasta sequences of the multimeric chains in the same file. For example, if you want to make a complex A2B3 (two subunits of chain A and 3 subunits of chain B), the fasta file will look like this:>A TGTGTGASDFASDFASDFWERT >A TGTGTGASDFASDFASDFWERT >B KLKFDASFTERPLKLFDAS >B KLKFDASFTERPLKLFDAS >B KLKFDASFTERPLKLFDAS
-
--destination
: The directory where the outputs will be saved. One directory will be created inside for each fasta file provided in--input
.NOTE: You have to give a different
--destination
directory for each time you run alphafold, or otherwise the previous files might get overwritten. -
--recycles
: Number of times to recycle the output through AlphaFold. 3 is the default. This argument only works for monomers.For multimeric models, alphafold will do up to 20 recycles with a possible early stop when the difference in pairwise distances is less than 0.5 (Angstroms, I believe).
-
--mail
: Email to receive notifications from ibex about the job status.Your email is only written to the script that is submitted to ibex, found in
destination_directory/out_ibex/script.sh
, and is not saved or recorded anywhere else. -
--models_to_relax
: New in version 2.3.1, this argument allows you to perform relaxation only for the highest-confidence model ("best
"), all models ("all
"), or no models ("none
") to speed up the execution. Default is "best
".
B. Make a model reusing the MSAs from another run:
Since calcualting the multiple sequence alignments for your protein(s) takes a long time, you might want to reuse those results in a new run, in case the previous one failed at a subsequent step (e.g. if the ibex job runs out of time or memory), or if you want to make the same model but with a different number of recycles. For this, you need to leave the output directory as it is, run alphafold again with the same
--input
and--destination
parameters, and include the--use_precomputed_msas
flag and others you might want to change (e.g. a different number of recycles). Since this will overwrite the old results in the--destination
directory except for the MSAs, you can copy the entire directory somewhere else if you want to keep them.alphafold_wrapper --input data/model1.fasta --destination ./results --use_precomputed_msas --recycles 12
-
-
Check the job status:
squeue -u [user_name]
When the job is finished (you can't see it in the list when running the previous command), look at the destination folder and confirm that all the files are present:
ls -lh results/ drwxr-xr-x 4 guzmanfj g-guzmanfj 26 Jun 22 19:15 [protein_id]-[multimeric_state]/ drwxr-xr-x 2 guzmanfj g-guzmanfj 5 Jun 22 23:43 out_ibex/ drwxr-xr-x 2 guzmanfj g-guzmanfj 3 Jun 22 18:47 sequences/
ls -lh results/[protein_id]-[multimeric_state] total 63M -rw-r--r-- 1 guzmanfj g-guzmanfj 136 Jun 22 18:49 [protein_id]-[multimeric_state].fasta -rw-r--r-- 1 guzmanfj g-guzmanfj 327K Jun 22 23:46 features.pkl drwxr-xr-x 2 guzmanfj g-guzmanfj 4 Jun 22 19:03 msas/ drwxr-xr-x 2 guzmanfj g-guzmanfj 9 Jun 22 19:15 plots/ -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:59 ranked_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:59 ranked_1.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:59 ranked_2.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:59 ranked_3.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:59 ranked_4.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 440 Jun 22 23:59 ranking_debug.json -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:50 relaxed_model_1_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:52 relaxed_model_2_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:55 relaxed_model_3_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:57 relaxed_model_4_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:59 relaxed_model_5_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 13M Jun 22 23:49 result_model_1_ptm_pred_0.pkl -rw-r--r-- 1 guzmanfj g-guzmanfj 13M Jun 22 23:52 result_model_2_ptm_pred_0.pkl -rw-r--r-- 1 guzmanfj g-guzmanfj 13M Jun 22 23:54 result_model_3_ptm_pred_0.pkl -rw-r--r-- 1 guzmanfj g-guzmanfj 13M Jun 22 23:56 result_model_4_ptm_pred_0.pkl -rw-r--r-- 1 guzmanfj g-guzmanfj 13M Jun 22 23:58 result_model_5_ptm_pred_0.pkl -rw-r--r-- 1 guzmanfj g-guzmanfj 938 Jun 22 23:59 timings.json -rw-r--r-- 1 guzmanfj g-guzmanfj 71K Jun 22 23:49 unrelaxed_model_1_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 71K Jun 22 23:52 unrelaxed_model_2_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 71K Jun 22 23:54 unrelaxed_model_3_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 71K Jun 22 23:56 unrelaxed_model_4_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 71K Jun 22 23:58 unrelaxed_model_5_ptm_pred_0.pdb
-
Check the results
After a successful run, you will find the following inside of the
--destination
directory:ls -lh results/ drwxr-xr-x 4 guzmanfj g-guzmanfj 26 Jun 22 19:15 [protein_id]-[multimeric_state]/ drwxr-xr-x 2 guzmanfj g-guzmanfj 5 Jun 22 23:43 out_ibex/ drwxr-xr-x 2 guzmanfj g-guzmanfj 3 Jun 22 18:47 sequences/
-
One folder for each fasta file in
--input
containing the resulting models. These directories will be named after the sequence IDs found in the fasta headers, along with the number of subunits for that sequence. For example, for the A2B3 fasta file shown above, the corresponding directory with the results would be namedA-2_B-3
. -
The
out_ibex
directory with the stdout from ibex, which will be useful in case you need to troubleshoot. -
The
sequences
directory with the sequences that were submitted to ibex. Useful for debugging only.
Inside each model directory you will find:
ls -lh results/[protein_id]-[multimeric_state] total 63M -rw-r--r-- 1 guzmanfj g-guzmanfj 136 Jun 22 18:49 [protein_id]-[multimeric_state].fasta -rw-r--r-- 1 guzmanfj g-guzmanfj 327K Jun 22 23:46 features.pkl drwxr-xr-x 2 guzmanfj g-guzmanfj 4 Jun 22 19:03 msas/ drwxr-xr-x 2 guzmanfj g-guzmanfj 9 Jun 22 19:15 plots/ -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:59 ranked_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:59 ranked_1.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:59 ranked_2.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:59 ranked_3.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:59 ranked_4.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 440 Jun 22 23:59 ranking_debug.json -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:50 relaxed_model_1_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:52 relaxed_model_2_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:55 relaxed_model_3_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:57 relaxed_model_4_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 142K Jun 22 23:59 relaxed_model_5_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 13M Jun 22 23:49 result_model_1_ptm_pred_0.pkl -rw-r--r-- 1 guzmanfj g-guzmanfj 13M Jun 22 23:52 result_model_2_ptm_pred_0.pkl -rw-r--r-- 1 guzmanfj g-guzmanfj 13M Jun 22 23:54 result_model_3_ptm_pred_0.pkl -rw-r--r-- 1 guzmanfj g-guzmanfj 13M Jun 22 23:56 result_model_4_ptm_pred_0.pkl -rw-r--r-- 1 guzmanfj g-guzmanfj 13M Jun 22 23:58 result_model_5_ptm_pred_0.pkl -rw-r--r-- 1 guzmanfj g-guzmanfj 938 Jun 22 23:59 timings.json -rw-r--r-- 1 guzmanfj g-guzmanfj 71K Jun 22 23:49 unrelaxed_model_1_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 71K Jun 22 23:52 unrelaxed_model_2_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 71K Jun 22 23:54 unrelaxed_model_3_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 71K Jun 22 23:56 unrelaxed_model_4_ptm_pred_0.pdb -rw-r--r-- 1 guzmanfj g-guzmanfj 71K Jun 22 23:58 unrelaxed_model_5_ptm_pred_0.pdb
-
The
pdb
models (ranked*.pdb
andrelaxed*.pdb
). -
A
plots
directory with plots of the different quality scores. The model ranking in these plots is done with the pLDDT scores for monomers, or the PTM score for multimers (actually a combination of interface-PTM and PTM). These rankings are the same as for the models calledranked_#.pdb
, and the names in the plots' headings correspond to the models inrelaxed_model_#_ptm_pred_0.pdb
.
-
2D pictures of the proteins colored by pLDDT score, indicating the rank and the model name.
plots/rank_[01234]_model_[model_name].png
Metric for assessing inter-domain accuracy. See the full explanation at the bottom of every entry in the AF database (e.g. https://alphafold.ebi.ac.uk/entry/Q5VSL9)
plots/pae.png
plots/predicted_contacts.png
plots/predicted_distances.png
plots/plddts.png
The numbers in parenthesis are the average pLDDTs for the whole sequence.
The ones you want to look at will be ranked_*.pdb
.
This library contains scripts to perform mass-multimerization using AlphaFold. To achieve this it is necessary to break down the modeling process into several steps:
This step will make the MSAs and look for structural templates, if needed. Afterwards it will make the necessary calculations and processing steps to convert this information into the features that are fed into AlphaFold Multimer. The command is as follows:
make_features --input [fasta_file] --destination [output_directory] --mail [your_email]
To see a detailed explanation of the parameters run make_features --help
.
This command will submit to ibex an array with up to max_jobs
to calculate the features in parallel. Once all the jobs are finished (run squeue -u [your_username]
to check), you can check the results by using the same command but adding the --check_only
flag. This will print out any errors that might have occurred during the process. If there are no errors, you can proceed to the next step.
make_features --input [fasta_file] --destination [output_directory] --mail [your_email] --check_only
The most common case for mass multimerization is to model one bait protein against many candidates (possibly an entire proteome). For this, the make_models
command will take the features from the previous step and use them to make the models. The simplest way to run the command is as follows:
make_complexes --bait [fasta_file] --candidates [fasta_file] --features_dir [features_directory] --destination [output_directory] --mail [your_email]
This command will create two models for each complex in the --destination
directory. If you are making more than a few hundred models, it is highly recommended to use the --screen_mode
flag that will indicate the program to only save to disk the quality scores (iptm and ptm) to a text file. This will save a lot of disk space. Also use the --models_to_relax none
parameter to skip the relaxation step and save time. The command would look like this:
make_complexes --bait [fasta_file] --candidates [fasta_file] --features_dir [features_directory] --destination [output_directory] --mail [your_email] --screen_mode --models_to_relax none
To see a detailed explanation of the parameters run make_complexes --help
.
This will submit to ibex an array with up to max_jobs
to make the complexes in parallel (up to 24 will be running at the same time due to the max GPU per-user limitation in ibex). When there are no more jobs running, you can check the results by using the same command but adding the --check_only
flag. This will print out any errors that might have occurred during the process. If there are no errors, you can proceed to the next step.
make_complexes --bait [fasta_file] --candidates [fasta_file] --features_dir [features_directory] --destination [output_directory] --mail [your_email] --check_only
If you used the --screen_mode
flag when making the complexes, you now only have the quality scores for each model. Through the next steps you will rank the models according to their quality scores, select the best ones and re-make them with full outputs.
score_complexes --models_dir [directory_with_models]
Again, running score_complexes --help
will give you more information.
By default, this command will create a scores
directory inside the --models_dir
directory, with the following files:
top_candidates.fasta
: The fasta file with the sequences of the top candidates, sorted by their quality scores.scores.csv
: The quality scores for every model.seeds.pkl
: The random seeds used to make the top models. This file will be used in the next step.iptms.png
: A box plot of the interface-ptm scores for every model.
After you successfully ran the previous step, you can now run once again the make_complexes
command but this time with the --seeds
parameter to re-make the top models with full outputs, and also ommitting the --screen_mode
flag.
make_complexes --bait [fasta_file] --candidates [path/to/top_candidates.fasta] --features_dir [features_directory] --destination [new_output_directory] --mail [your_email] --models_to_relax best --seeds [path/to/seeds.pkl]
After all the jobs are finished, run the same command but with the --check_only
flag to make sure everything went well. If there are no errors, you have your top models!!. See the Results section above to find more details about the outputs.
I make modifications to the code in the wrapper every now and then, also to integrate DeepMind's latest updates to AlphaFold. So, if it's been a while since you downloaded it, I recommend running the following commands to update your installation.
While inside the alphafold
directory run:
conda activate ./env
git pull
pip install -e .
Done! You should have the latest version now and can use it as usual.
-
I don't see my job when typing
squeue -u [my_username]
in Ibex, and the output directory is empty/doesn't have all the files with the results.If this happens, something might have gone wrong with the AlphaFold execution. You should take a look at the file with the standard output from Ibex, located in the
out_ibex
directory. To look at this file you could do:$ tail out_ibex/AlphafoldIbex.[jobid].out
Take a look at the output and it will tell you if the program was stopped due to time or memory limitations, or if there was some other error. In the first two cases, you would re-run the alphafold command but asking for more time and/or memory with the
--time
and--mem
arguments.From my own experience, the following values should work for the corresponding total sequence length (sum of all sequences in the input):
~60 residues
--time 60 --mem 64
~350 residues
--time 120 --mem 64
~1,000 residues
--time 600 --mem 128
-
Why don't I just ask for as many resources as possible to prevent the program from failing?
If you ask for more resources than your job will need, you might have to wait longer for those resources to become free and be assigned to you. Also you would prevent other users from accessing those resources for as long as your job is running. Remember that Ibex is used by many other people :)
If you use AlphaFold-Ibex in your research, please cite this repository as follows:
- Francisco J. Guzmán-Vega and Stefan T. Arold, ‘Strubelab/Alphafold-Ibex: Scaling up Protein Modeling with AlphaFold for Slurm-Based HPC Clusters’, GitHub repository, 2024. DOI: 10.5281/zenodo.10548184