Archie #22

Closed
wants to merge 35 commits into from

xin-huang
Owner

No description provided.

Comment on lines +264 to +267
#I think these parameters are NOT necessary for ArchIE - just retained for the signature of preprocess.process_data
match_bonus = 1
max_mismatch = 1
mismatch_penalty = 1
Owner Author

These parameters are necessary for calculating the S* scores, which are also input features in the logistic regression.

match_bonus = 5000
max_mismatch = 5
mismatch_penalty = -10000

mut_num, hap_num = tgt_gt.shape
iv = np.ones((hap_num, 1))
counts = tgt_gt*np.matmul(tgt_gt, iv)
spectra = np.array([np.bincount(np.array(counts[:,idx] > 0).astype('int8'), minlength=hap_num+1) for idx in range(hap_num)])
Owner Author

@xin-huang xin-huang Dec 24, 2023

Line 70 is not correct. The spectra here only contain "singletons", because counts[:,idx] > 0 returns a boolean array, and False and True are treated as 0 and 1.

Owner Author

Therefore, what this line actually finds is the total number of mutations that each haplotype carries.

Owner Author

The negative-number problem was caused by astype('int8'): the range of int8 is only -128 to 127, so when simulating large sample sizes (e.g. 100 diploid target individuals, i.e. 200 haplotypes), counts can overflow and negative numbers may occur. To fix it, we can change astype('int8') to astype('int64').
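
To make both issues concrete, here is a small sketch on a toy genotype matrix (4 mutations by 3 haplotypes, invented for illustration). `spectra_boolean` mirrors the reviewed line; `spectra_counts` is one possible reading of the fix, assuming the intent is to bin the actual allele counts, so it drops the `> 0` comparison and casts to int64. Integer arrays are used throughout so the int8 wrap-around is easy to reproduce.

```python
import numpy as np

def spectra_boolean(tgt_gt):
    """As in the reviewed line: `> 0` reduces every count to 0 or 1."""
    mut_num, hap_num = tgt_gt.shape
    iv = np.ones((hap_num, 1), dtype=np.int64)
    counts = tgt_gt * np.matmul(tgt_gt, iv)
    return np.array([
        np.bincount((counts[:, idx] > 0).astype('int8'), minlength=hap_num + 1)
        for idx in range(hap_num)
    ])

def spectra_counts(tgt_gt):
    """Assumed fix: bin the allele counts themselves, stored as int64."""
    mut_num, hap_num = tgt_gt.shape
    iv = np.ones((hap_num, 1), dtype=np.int64)
    counts = tgt_gt * np.matmul(tgt_gt, iv)
    return np.array([
        np.bincount(counts[:, idx].astype('int64'), minlength=hap_num + 1)
        for idx in range(hap_num)
    ])

# Toy data: rows are mutations, columns are haplotypes.
tgt_gt = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [1, 1, 1],
                   [0, 0, 1]], dtype=np.int64)

buggy = spectra_boolean(tgt_gt)  # column 1 holds only "mutations carried"
fixed = spectra_counts(tgt_gt)   # column k holds mutations seen in k haplotypes

# An allele count of 200 does not fit into int8 and wraps around.
wrapped = int(np.array([200], dtype=np.int64).astype('int8')[0])
```

In `buggy`, every row is nonzero only in its first two entries, which is the "singleton only" pattern described above. `wrapped` is negative, and `np.bincount` rejects negative inputs, which matches the failure seen with large sample sizes.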

Comment on lines +280 to +281
#reading of data, preprocessing - i.e., calculating statistics -, and obtaining & labeling of true tracts
for replicate1, folder in enumerate(os.listdir(output_dir)):
Owner Author

Using enumerate over os.listdir(output_dir) makes the order of replicates in the feature table inconsistent with the order of the folders containing the simulated data.

Owner Author

This makes it difficult to check the correctness of the calculation for different features.
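
One way to keep the two in sync is to take the replicate number from the folder name rather than from the loop counter, since `os.listdir` returns entries in arbitrary order. The sketch below assumes folder names end in the replicate number (e.g. `rep_2`); that naming scheme and the helper `replicate_folders` are hypothetical and would need adapting to the actual simulation output.

```python
import os
import re
import tempfile

def replicate_folders(output_dir):
    """Return (replicate_id, folder) pairs sorted by the numeric id.

    Assumes each replicate folder name ends in its replicate number.
    """
    pairs = []
    for folder in os.listdir(output_dir):
        match = re.search(r"(\d+)$", folder)
        if match and os.path.isdir(os.path.join(output_dir, folder)):
            pairs.append((int(match.group(1)), folder))
    return sorted(pairs)

# Demonstration on a temporary directory with out-of-order folder names.
with tempfile.TemporaryDirectory() as output_dir:
    for name in ("rep_10", "rep_2", "rep_0"):
        os.mkdir(os.path.join(output_dir, name))
    ordered = replicate_folders(output_dir)
```

Looping over `replicate_folders(output_dir)` instead of `enumerate(os.listdir(output_dir))` means the replicate column in the feature table always names the folder the row was computed from.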

Comment on lines +150 to +151
dist_skew = sps.skew(tgt_dist, axis=1)
dist_kurtosis = sps.kurtosis(tgt_dist, axis=1)
Owner Author

skew and kurtosis may occasionally be np.nan; we could replace nan with 0.
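
A minimal sketch of that replacement, using plain NumPy arrays as stand-ins for the values returned by `sps.skew` and `sps.kurtosis` (the numbers are invented, and the helper name `nan_to_zero` is hypothetical). `np.where` is used so that only nan is touched and any other values stay as they are.

```python
import numpy as np

def nan_to_zero(values):
    """Return a copy of `values` with nan entries replaced by 0."""
    values = np.asarray(values, dtype=float)
    return np.where(np.isnan(values), 0.0, values)

# Stand-ins for per-haplotype statistics, one of which is undefined.
dist_skew = nan_to_zero(np.array([0.3, np.nan, -1.2]))
dist_kurtosis = nan_to_zero(np.array([np.nan, 2.5, 0.1]))
```

Whether 0 is the right fill value depends on how the downstream model treats these features, so it may be worth checking how often nan occurs before settling on it.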

@xin-huang xin-huang closed this Feb 11, 2024
@xin-huang xin-huang deleted the archie branch February 15, 2024 22:52