-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract the residue numbers of a patch #70
Comments
Hi @TobiasPol Any luck? I would be interested in this as well. |
Hey @celalp, not yet unfortunately. Do you have any idea for the solution or do you know if it is even possible? |
I'm not so sure if we can with certanity, I have a tiny pdb that I've been using to understand the code in greater depth, seems like the reduction in the number of vertices from the original msms output happens during the fixmesh function. There is a whole bunch of joining and collapsing happening during that. This seems to remove the vertex information and create new vertices and faces and there is no tracking of what's going on not just in the code but also from the point of view of pymesh, (pymesh dont care). There are couple of things I'm looking into now,
I think I'm going to go with 2 and just say hey, these are the closest ones, we are dealing with atom coordinates and Ca's of aas pretty far from one another and I'm hoping that it would be right or very close to right. To clarify this needs to be done on the vertex coordinates before you convert them to polar, I'm thinking about writing couple of functions that does the above calculations and saves them in a class of sorts and for I/O think and hdf5 per protein/chain while a bit expensive might be the simplest to implement. This is my interpretation of the code and considering the spare commenting dont take this as "legal advice" and if you have any ideas I would love to hear. Honestly, all i want to do is compare protein surfaces at the moment, I just want to see if patch x from protein a is similar to patch y from protein b and then get the aas involved. When I first started I was so sure that I would find something simple enough but I guess coordinate and rotation invariance requirement makes it challenging. I'm hesitant to use learned features because I will not be working with human/mouse/even animal proteins for this specific project. Sorry for the wall of text and rant. |
Thank you for your detailed answer! That helps me a lot. I would prefer to have an output containing the two comparative patches with the coordinates and the respective residue numbers. Something like this:
Unfortunately, I only have this table and not the code behind it. And as you say, the code is very poorly documented. |
may I ask where you got this table? I was thinking more along the lines of getting closest atom to each vertex on the processed mesh, then it's just a matter of a lookup for each vertex in each patch. I would love to have an actual truth set even for a handful of cases to test out the code. |
@TobiasPol Ok, following the authors' lead, here is a small snippet that finds the n closest amino acids. If you can verify this using your data I would be grateful from sklearn.neighbors import KDTree
def map_aas(old_verts, new_verts, names, num_inter):
"""
old_verts: vertices from the initial mesh from compute MSMS
new_verts: vertices from the output of fixmesh
names: names1 from again compute MSMS
num_inter: how many closest aas to return
returns an ndarray of new_verts rows and num_inter columns, the values indicate the aa number coming from computeMSMS output
"""
kdt = KDTree(old_verts)
dists, result = kdt.query(new_verts, k=num_inter)
aas=np.array([int(name.split("_")[1]) for name in names])
closest_aas=np.zeros_like(dists)
for i in range(num_inter):
closest_aas[:, i]=aas[result[:, i]]
return closest_aas |
I had a similar question when using MaSIF-site when I was trying to map the predicted sites to residues in the input structure. I wrote a script to map vertices (with a high probability of being a binding site) to nearby residues in the input structure. Not sure if this is helpful with MaSIF-search. The script prints the mapped residues along with a PyMOL CLI command to select them. '''
Requires biopandas to parse pdb file
pip install biopandas
Usage:
python map-highprob-vertices-to-residues.py \
--job_name 1cz8_V \
--data_preparation /path/to/masif_output/data_preparation \
--output /path/to/masif_output/pred_output \
--prob_thr 0.7 --radius 1.2
'''
import argparse
import numpy as np
from typing import List
from pathlib import Path
from pprint import pprint
from biopandas.pdb import PandasPdb
def match_atoms(target_coords: np.ndarray, query_coords: np.ndarray, radius: float) -> List[int]:
"""
Find atoms that are within a radius from the high-prob vertices
Args:
target_coords(np.ndarray): shape (N, 3), coordinates of the vertices
query_coords (np.ndarray): shape (N, 3), coordinates of the atoms
radius (float) : radius in Å cutoff to retrieve atoms close to vertices
Returns:
idx (List[int]): indices of the atoms IN THE QUERY_COORDS
that are within a radius from the vertices
"""
from scipy.spatial import cKDTree
tree = cKDTree(query_coords) # indexing the atom coordinates
# get atoms that are within a radius from the vertices
idx = tree.query_ball_point(target_coords, r=radius)
# flatten the list of lists
idx = [item for sublist in idx for item in sublist]
# remove duplicates
idx = list(set(idx))
return idx
# input
def cli():
parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument("--job_name", type=str, help="job name assigned to the inference run, e.g. 1cz8_V")
parser.add_argument("--data_preparation", type=str, help="path to data_preparation folder")
parser.add_argument("--output", type=str, help="path to the inference output folder")
parser.add_argument("--prob_thr", type=float, default=0.7,
help="probability threshold to filter vertices")
parser.add_argument("--radius", type=float, default=1.2,
help="radius in Å to retrieve atoms close to vertices")
args = parser.parse_args()
return args
if __name__ == "__main__":
args = cli()
job_name = args.job_name
inp_path = Path(args.data_preparation)
out_path = Path(args.output)
# --------------------
# file paths
# --------------------
# pred probabilities of each vertex being part of a binding site
pred_fp = out_path.joinpath('all_feat_3l', 'pred_data', f'pred_{job_name}.npy')
# input pdb file to masif-site
in_pdb_fp = inp_path.joinpath('01-benchmark_pdbs', f'{job_name}.pdb')
# --------------------
# read pred files
# --------------------
# Load the predicted surface numpy file.
# If running inside a container, this is located by default under `/masif/data/masif_site/output/all_feat_3l/pred_surfaces`
# e.g. `/masif/data/masif_site/output/all_feat_3l/pred_surfaces/1cz8_V.npy`
probs = np.load(pred_fp)
# Load vertex coordinates from precomputed files
# If running in the container, this is located under `/masif/data/masif_site/data_preparation/04a-precomputation_9A/precomputation/`
# e.g. `/masif/data/masif_site/data_preparation/04a-precomputation_9A/precomputation/1cz8_V`
# - `1cz8_V` is the job name I used for this example.
precompute_folder = inp_path.joinpath('04a-precomputation_9A', 'precomputation', job_name)
# We only need the coordinates, other features are not needed in this notebook
p1_X = np.load(precompute_folder/'p1_X.npy')
p1_Y = np.load(precompute_folder/'p1_Y.npy')
p1_Z = np.load(precompute_folder/'p1_Z.npy')
print(f'p1_X.shape: {p1_X.shape}')
print(f'p1_Y.shape: {p1_Y.shape}')
print(f'p1_Z.shape: {p1_Z.shape}')
# --------------------
# map high prob vertices to
# residues in the structure
# --------------------
# use biopandas to parse the pdb file
pdb = PandasPdb().read_pdb(in_pdb_fp)
# convert to dataframe
atom_df = pdb.df['ATOM']
# add node_id in the format of [chain_id]:[residue_name]:[residue_number]:[insertion]
atom_df['node_id'] = atom_df['chain_id'] + ':' + atom_df['residue_name'] + ':' + atom_df['residue_number'].astype(str) + ':' + atom_df['insertion']
# remove the tailing space and colon in the node_id if insertion is empty
atom_df['node_id'] = atom_df['node_id'].str.replace(r':\s*$', '', regex=True)
# --------------------
# Find atoms close to the
# predicted surface vertices
# --------------------
# params
prob_thr = args.prob_thr # prob thr to filter vertices
radius = args.radius # radius in Å to retrieve atoms close to vertices
# get the coordinates of the vertices (shape N_vertices, 3)
vertices_coords = np.concatenate([p1_X.reshape(-1, 1),
p1_Y.reshape(-1, 1),
p1_Z.reshape(-1, 1)], axis=1)
# get vertex with the probability greater than a threshold e.g. 0.9
_, idx = np.where(probs > prob_thr)
# get the coordinates of the vertices with the probability greater than a threshold
hp_coords = vertices_coords[idx]
# atom coordinates from the pdb file
atom_coords = atom_df.loc[:, ['x_coord', 'y_coord', 'z_coord']].to_numpy()
# atom df row indices
idx = match_atoms(hp_coords, atom_coords, radius)
# pymol selection string
resis = atom_df.iloc[idx].drop_duplicates(subset=['node_id']).node_id.map(lambda x: x.split(':')[-1]).unique()
chain_id = atom_df.chain_id.values[0]
sel_str = f'select pred, ///{chain_id}/{"+".join(resis)}'
print('PyMOL selection string (copy paste it in PyMOL CLI):')
print(sel_str)
# print residues
print('Residue IDs:')
for i in atom_df.iloc[idx].sort_values('residue_number')['node_id'].to_list():
print(i) Output
|
@celalp and @biochunan thanks for your replies! I will take a look at it and verify the results. |
Quick Update: There is a problem with the Python version. I see that Pymesh comes with Python 3.6.11, but I need at least 3.7. Does anyone happen to have a working Dockerfile? |
Mine was directly pulled from Docker hub according to the tutorial (https://github.com/LPDI-EPFL/masif/blob/master/docker_tutorial.md). I had problems setting up local environment for the dependencies, so I changed to from inside the container instead and it works well for me. docker pull pablogainza/masif:latest |
@biochunan
|
Hi @TobiasPol, apologies that I misunderstood your last question.
I have attached an example for your reference. You just need to |
I am currently working with MaSIF-search and am looking for a way to extract the residual numbers of a patch to validate the results. Does anyone know if this is possible?
The text was updated successfully, but these errors were encountered: