Archie #22
# I think these parameters are NOT necessary for ArchIE - just retained for the signature of preprocess.process_data
match_bonus = 1
max_mismatch = 1
mismatch_penalty = 1
These parameters are necessary for calculating the S* scores, which are also input features in the logistic regression.
match_bonus = 5000
max_mismatch = 5
mismatch_penalty = -10000
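To illustrate why these three values matter, here is a minimal sketch of a pairwise S*-style score. It assumes the commonly described scoring rule (bonus plus base-pair distance for a perfect genotype match, a fixed penalty for a small number of mismatches, negative infinity otherwise). The function name `pairwise_sstar_score` and its arguments are hypothetical, and this is a simplification, not the code in `preprocess.process_data`.

```python
import math

def pairwise_sstar_score(geno_dist, bp_dist,
                         match_bonus=5000, max_mismatch=5,
                         mismatch_penalty=-10000):
    """Hypothetical, simplified S*-style score for one pair of variants.

    geno_dist: number of individuals whose genotypes differ between the two variants
    bp_dist:   physical distance between the two variants in base pairs
    """
    if geno_dist == 0:
        # Perfectly congruent variants are rewarded, more so when far apart.
        return match_bonus + bp_dist
    if geno_dist <= max_mismatch:
        # A few mismatches are tolerated but penalized.
        return mismatch_penalty
    # Too many mismatches: the pair can never be part of the best subset.
    return -math.inf
```

With `match_bonus = 1`, `max_mismatch = 1`, and `mismatch_penalty = 1`, a mismatching pair would score the same as the bonus term, so the resulting S* feature would no longer separate congruent from incongruent variants.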
mut_num, hap_num = tgt_gt.shape
iv = np.ones((hap_num, 1))
counts = tgt_gt*np.matmul(tgt_gt, iv)
spectra = np.array([np.bincount(np.array(counts[:,idx] > 0).astype('int8'), minlength=hap_num+1) for idx in range(hap_num)])
Line 70 is not correct. The spectra here only contain "singletons", because `counts[:,idx] > 0` returns a boolean array, and `False` and `True` values are treated as 0s and 1s.
Therefore, what this line finds is the total number of mutations that each haplotype carries, not a frequency spectrum.
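A small toy example may make the difference concrete. The genotype matrix below is made up for illustration, and the "fixed" variant (dropping `> 0` and casting to `int64`) is one plausible reading of the intended behavior, not necessarily the change that was merged.

```python
import numpy as np

# Toy genotype matrix (made up): 4 mutations (rows) x 3 haplotypes (columns).
tgt_gt = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [1, 1, 1],
                   [0, 0, 1]], dtype=np.int64)

mut_num, hap_num = tgt_gt.shape
iv = np.ones((hap_num, 1), dtype=np.int64)
# counts[m, h] is the sample count of mutation m if haplotype h carries it, else 0.
counts = tgt_gt * np.matmul(tgt_gt, iv)

# Original expression: the boolean mask collapses every count to 0 or 1,
# so only bins 0 and 1 can ever be filled.
buggy = np.array([np.bincount((counts[:, idx] > 0).astype('int8'),
                              minlength=hap_num + 1)
                  for idx in range(hap_num)])

# Assumed fix: bin the counts themselves.
fixed = np.array([np.bincount(counts[:, idx].astype('int64'),
                              minlength=hap_num + 1)
                  for idx in range(hap_num)])

print(buggy)  # bin 1 holds the number of mutations on each haplotype
print(fixed)  # bin k holds the number of carried mutations seen k times in the sample
```

In both versions bin 0 counts the mutations a haplotype does not carry.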
The negative number problem was caused by `astype('int8')`. The range of `int8` is -128 to 127, so when simulating large sample sizes (e.g. 100 diploid target individuals, i.e. 200 haplotypes), counts above 127 wrap around to negative numbers. To fix it, we can change `astype('int8')` to `astype('int64')`.
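The wrap-around can be reproduced in isolation. This is a minimal sketch using made-up integer counts, not data from the simulations:

```python
import numpy as np

# Made-up allele counts; 128 and 200 do not fit into int8.
counts = np.array([1, 127, 128, 200], dtype=np.int64)

wrapped = counts.astype('int8')   # values above 127 wrap to negative numbers
safe = counts.astype('int64')     # values are preserved

# np.bincount rejects negative input, so the int8 version fails outright.
try:
    np.bincount(wrapped)
    bincount_failed = False
except ValueError:
    bincount_failed = True

print(wrapped, safe, bincount_failed)
```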
# reading of data, preprocessing (i.e., calculating statistics), and obtaining & labeling of true tracts
for replicate1, folder in enumerate(os.listdir(output_dir)):
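Using `enumerate` makes the order of replicates in the feature table inconsistent with the order of the folders containing the simulated data, because the index comes from the arbitrary order returned by `os.listdir`, not from the folder itself.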
This makes it difficult to check the correctness of the calculation for different features.
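One possible way to keep the two orders aligned is to derive the replicate number from the folder name and sort on it. The sketch below assumes, hypothetically, that each folder name ends in its replicate number (e.g. `rep10`); the helper `replicate_id` and the folder names are illustrative only.

```python
import os
import re
import tempfile

def replicate_id(folder_name):
    """Return the trailing integer of a folder name (hypothetical naming scheme)."""
    match = re.search(r"(\d+)$", folder_name)
    if match is None:
        raise ValueError(f"no replicate number in folder name: {folder_name!r}")
    return int(match.group(1))

with tempfile.TemporaryDirectory() as output_dir:
    # Create folders in scrambled order to mimic an arbitrary os.listdir result.
    for rep in (10, 2, 0, 1):
        os.mkdir(os.path.join(output_dir, f"rep{rep}"))

    # Sort numerically so that rep2 comes before rep10.
    folders = sorted(os.listdir(output_dir), key=replicate_id)
    replicates = [replicate_id(folder) for folder in folders]

print(folders, replicates)
```

Using the number parsed from the name as the replicate label, and not a running index, lets each row of the feature table be traced back to its folder.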
dist_skew = sps.skew(tgt_dist, axis=1)
dist_kurtosis = sps.kurtosis(tgt_dist, axis=1)
`skew` and `kurtosis` may occasionally be `np.nan`; we could replace `nan` with 0.
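A minimal sketch of that replacement with `np.nan_to_num`, using a hand-written array in place of the real `sps.skew` output:

```python
import numpy as np

# Stand-in for a skew or kurtosis vector with one undefined entry.
dist_skew = np.array([0.5, np.nan, -1.2])

# Replace only nan with 0; finite values are left unchanged.
dist_skew_clean = np.nan_to_num(dist_skew, nan=0.0)

print(dist_skew_clean)
```

Note that `np.nan_to_num` also replaces infinities with large finite numbers by default, which should not matter here as long as the statistics are either finite or `nan`.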