Basic tokenizer #6

MacDoc77 · 2021-01-16T20:22:06Z

No description provided.

Lev1ty · 2021-01-16T20:27:00Z

tokenizer

@@ -0,0 +1,24 @@
+import numpy as np
+
+def tokenizer(


Function names should use imperative verbs, e.g. tokenize rather than tokenizer.

Lev1ty · 2021-01-16T20:28:22Z

tokenizer

+    for x in nucleotide_sequence:
+        if x == "A" or x == "a":
+            arr.append([0,counter]) # Map A to 0
+        elif x == "C" or x == "c":
+            arr.append([1,counter]) # Map C to 1
+        elif x == "G" or x == "g":
+            arr.append([2,counter]) # Map G to 2
+        elif x == "T" or x == "t":
+            arr.append([3,counter]) # Map T to 3


This could be faster and shorter. Try defining a dict {'A': 0, 'C': 1, 'G': 2, 'T': 3} as a "Rosetta Stone" for tokenizing the nucleotides, then look each character up in it instead of walking the if/elif chain.
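A minimal sketch of the dict lookup suggested above, for illustration only. The names NUCLEOTIDE_TO_TOKEN and tokenize are hypothetical (the PR only proposes the dict itself), and folding case with upper() is an assumption made here to cover the lowercase branches of the original if/elif chain.

```python
# Hypothetical names; the review comment only suggests the dict literal.
NUCLEOTIDE_TO_TOKEN = {"A": 0, "C": 1, "G": 2, "T": 3}


def tokenize(nucleotide_sequence):
    """Map each nucleotide character to its integer token."""
    tokens = []
    for nucleotide in nucleotide_sequence:
        # upper() folds "a" and "A" together, replacing the paired comparisons.
        tokens.append(NUCLEOTIDE_TO_TOKEN[nucleotide.upper()])
    return tokens


print(tokenize("ACgt"))  # [0, 1, 2, 3]
```

One behavioural difference to be aware of: the if/elif chain silently skips characters other than A, C, G, T, while this lookup raises KeyError on them (for example on "N"). Whether that is desirable depends on how ambiguous bases should be handled.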

Lev1ty · 2021-01-16T20:29:49Z

tokenizer

+        elif x == "T" or x == "t":
+            arr.append([3,counter]) # Map T to 3
+        counter += 1 # Increment counter
+    np_arr = np.array(arr) # Convert finished array to a numpy array


Breaks the single-responsibility principle. The function name says it tokenizes a string; it should not also construct a NumPy array.
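One possible way to split the two responsibilities, sketched under assumptions: MAPPINGS, tokenize and tokens_to_array are hypothetical names, not code from this PR. The tokenizer returns a plain list, and a separate helper does the NumPy conversion.

```python
MAPPINGS = {"A": 0, "C": 1, "G": 2, "T": 3}  # hypothetical name


def tokenize(nucleotide_sequence):
    """Only tokenizes: returns a plain list of integer tokens."""
    return [MAPPINGS[nucleotide] for nucleotide in nucleotide_sequence.upper()]


def tokens_to_array(tokens):
    """Only converts: wraps an existing token list in a NumPy array."""
    # Imported here so that tokenize() itself carries no NumPy dependency.
    import numpy as np

    return np.array(tokens)
```

A caller that needs an array would write tokens_to_array(tokenize("ACGT")), while a caller that only needs tokens never touches NumPy.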

Lev1ty

Not doing position embeddings at this point. Waiting on wet lab feedback on where to set zero position.

Lev1ty · 2021-01-16T20:54:14Z

tokenizer

+        "G" : 2,
+        "T" : 3
+    } # Declares the integers that correspond to each nucleotide
+    # counter = 0


Remove the commented-out counter for clarity.

Lev1ty · 2021-01-16T20:54:57Z

tokenizer

+    for x in nucleotide_sequence:
+        arr.append(mappings[x]) # Adds a converted nucleotide to the end of the array


This can be replaced with a list comprehension, something like [mappings[x] for x in sequence]. List comprehensions are generally faster than an equivalent for loop that calls append on every iteration.
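A small sketch comparing the two forms, assuming the mappings dict from the diff. The function names are hypothetical, and the timing lines are only a rough way to check the speed claim on a given machine, so no expected numbers are shown.

```python
import timeit

mappings = {"A": 0, "C": 1, "G": 2, "T": 3}


def tokenize_with_loop(sequence):
    """The loop-and-append form from the diff."""
    arr = []
    for x in sequence:
        arr.append(mappings[x])
    return arr


def tokenize_with_comprehension(sequence):
    """The list-comprehension form suggested in the review."""
    return [mappings[x] for x in sequence]


sequence = "ACGT" * 1000

# Both forms must produce the same tokens.
assert tokenize_with_loop(sequence) == tokenize_with_comprehension(sequence)

# Rough timing; results vary by machine and Python version.
for fn in (tokenize_with_loop, tokenize_with_comprehension):
    seconds = timeit.timeit(lambda: fn(sequence), number=200)
    print(f"{fn.__name__}: {seconds:.4f} s")
```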

MacDoc77 and others added 2 commits January 16, 2021 15:17

Basic tokenizer

e2b6efc

Delete launch.json

39bb969

Lev1ty reviewed Jan 16, 2021

MacDoc77 linked an issue Jan 16, 2021 that may be closed by this pull request

Make tokenizer for gene sequences #2

Open

Lev1ty reviewed Jan 16, 2021

Lev1ty requested changes Jan 16, 2021

MacDoc77 added 2 commits January 16, 2021 15:51

Update tokenizer

4056fdb

Merge branch 'vincent' of https://github.com/igemmcmaster/genome-tran…

3035c0b

…sformer into vincent

MacDoc77 merged commit 3ad1c75 into main Jan 16, 2021

Lev1ty requested changes Jan 16, 2021

Lev1ty mentioned this pull request Jan 16, 2021

Revert "Basic tokenizer" #7

Merged
