Hi,
Thanks for making the model available.
I have been playing with the model and noticed that, when predicting over a DNA sequence, the last token is usually not the one in the original sequence: the predictions tend to have some extra nucleotides at the end.
Am I missing something? Is this the expected behavior? Is there an expected nucleotide input length that fixes it?
import torch
from torch.nn.functional import softmax

for dna in sequences:
    # Truncate to the first 128 nucleotides and tokenize
    dna = dna[:128]
    inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
    outputs = model(inputs)
    logits = outputs.logits

    # Apply softmax to convert logits to probabilities
    probabilities = softmax(logits, dim=-1)
    # Choose the most likely token for each position
    predicted_token_ids = torch.argmax(probabilities, dim=-1)
    print('original tokens', inputs)
    print('predicted tokens', predicted_token_ids)
    print()

    # Convert the token ids back to nucleotides, skipping the leading special token
    predicted_sequences = [tokenizer.decode(token_ids) for token_ids in predicted_token_ids[:, 1:]]
    original = [tokenizer.decode(token_ids) for token_ids in inputs]
    print('Original ', dna)
    print('Predicted', ' '.join(predicted_sequences).replace(' ', ''))
    print()
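One thing that may be worth checking (this is an assumption on my side, not something confirmed from the model card): if the tokenizer works on fixed-length k-mers (6-mers are common in DNA language models), a 128-nt input is not a multiple of 6 (128 = 21 * 6 + 2), so the tail of the sequence gets tokenized as shorter single-nucleotide tokens, while the model can still predict full 6-mer tokens at those positions. That would explain the extra nucleotides at the end. A minimal sketch to test this, reusing the `tokenizer` and `model` from the snippet above; `K = 6` is an assumed k-mer size, adjust if yours differs:

# Sketch: does the length mismatch disappear when the input length
# is a multiple of the tokenizer's k-mer size?
K = 6  # ASSUMED k-mer size; change to match your tokenizer

dna = "ACGT" * 32                    # 128 nt, NOT a multiple of 6
trimmed = dna[:len(dna) // K * K]    # 126 nt, a multiple of 6

for seq in (dna, trimmed):
    inputs = tokenizer(seq, return_tensors='pt')["input_ids"]
    predicted = torch.argmax(model(inputs).logits, dim=-1)
    # Decode the predictions, dropping the leading special token and spaces
    decoded = tokenizer.decode(predicted[0, 1:]).replace(' ', '')
    print(f"input length {len(seq):>3} -> decoded prediction length {len(decoded):>3}")

If the trimmed sequence round-trips to the same length while the 128-nt one comes back longer, truncating inputs to a multiple of the k-mer size should fix the behavior.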