Basic tokenizer #6
tokenizer
Outdated
@@ -0,0 +1,24 @@
import numpy as np

def tokenizer(
Function names should use imperative language, for example tokenize rather than tokenizer.
tokenizer
Outdated
for x in nucleotide_sequence:
    if x == "A" or x == "a":
        arr.append([0,counter]) # Map A to 0
    elif x == "C" or x == "c":
        arr.append([1,counter]) # Map C to 1
    elif x == "G" or x == "g":
        arr.append([2,counter]) # Map G to 2
    elif x == "T" or x == "t":
        arr.append([3,counter]) # Map T to 3
Could be faster here. Try defining a dict {'A': 0, 'C': 1, 'G': 2, 'T': 3} as a "Rosetta Stone" for tokenizing the base pairs
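A minimal sketch of that suggestion, not the PR's actual code. The names `NUCLEOTIDE_TO_INDEX` and `tokenize` are hypothetical, the position counter is assumed to start at 0, and uppercasing the input is just one way to keep the case-insensitive behaviour of the original branches. Note that a plain dict lookup raises `KeyError` on any other character, whereas the if/elif chain silently skips it.

```python
# Hypothetical lookup table: one dict replaces the if/elif chain.
NUCLEOTIDE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def tokenize(nucleotide_sequence):
    """Map each nucleotide to [token, position], as the diff above does."""
    arr = []
    # enumerate() supplies the position that `counter` tracked by hand
    # (assumed to start at 0).
    for counter, x in enumerate(nucleotide_sequence.upper()):
        # Raises KeyError for characters outside A, C, G, T.
        arr.append([NUCLEOTIDE_TO_INDEX[x], counter])
    return arr
```

Each lookup is a single hash access instead of up to eight string comparisons, which is where the speed-up would come from.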
tokenizer
Outdated
    elif x == "T" or x == "t":
        arr.append([3,counter]) # Map T to 3
    counter += 1 # Increment counter
np_arr = np.array(arr) # Convert finished array to a numpy array
This breaks the single-responsibility principle. The function name says it tokenizes a string, not that it also constructs a NumPy array.
Not doing position embeddings at this point. Waiting on wet lab feedback on where to set zero position.
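One possible way to address the single-responsibility point in this thread, sketched under assumptions: `tokenize` and `to_array` are hypothetical names, positions are left out because position embeddings are on hold, and the input is assumed to be uppercase A, C, G, T only.

```python
import numpy as np

# Hypothetical mapping; assumes uppercase input only.
MAPPINGS = {"A": 0, "C": 1, "G": 2, "T": 3}


def tokenize(nucleotide_sequence):
    """Only tokenize: return a plain Python list of integer tokens."""
    tokens = []
    for x in nucleotide_sequence:
        tokens.append(MAPPINGS[x])
    return tokens


def to_array(tokens):
    """Separate step: convert a token list into a NumPy array."""
    return np.array(tokens)


# The caller decides whether a NumPy array is needed at all.
np_arr = to_array(tokenize("ACGT"))
```

With the conversion split out, the tokenizer can be tested without NumPy, and positions can be added later as their own step once the zero position is settled.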
    "G" : 2,
    "T" : 3
} # Declares the integers that correspond to each nucleotide
# counter = 0
Remove the counter for clarity
for x in nucleotide_sequence:
    arr.append(mappings[x]) # Adds a converted nucleotide to the end of the array
This can be replaced with a list comprehension, something like [mappings[x] for x in sequence]. List comprehensions are generally faster than a for loop that calls append on each iteration.
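A small side-by-side sketch of the two forms, assuming the same four-entry `mappings` dict as in the diff; the function names `tokenize_loop` and `tokenize_comprehension` and the sample sequence are hypothetical and only there to show that both produce the same list.

```python
# Assumed to match the mapping declared in the diff above.
mappings = {"A": 0, "C": 1, "G": 2, "T": 3}


def tokenize_loop(sequence):
    """Explicit loop, as in the current diff."""
    arr = []
    for x in sequence:
        arr.append(mappings[x])
    return arr


def tokenize_comprehension(sequence):
    """Same result, built in a single expression."""
    return [mappings[x] for x in sequence]


sequence = "GATTACA"
# Both forms yield identical token lists.
assert tokenize_loop(sequence) == tokenize_comprehension(sequence)
```

The comprehension avoids the repeated attribute lookup and call to `arr.append`, which is usually where the speed difference comes from; measuring with `timeit` on a realistic sequence length would confirm whether it matters here.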
No description provided.