Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

- Start of ToxicRmod commit #211

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from
Draft
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
51 changes: 51 additions & 0 deletions ToxicR workinprogress.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
library("devtools")
#requires dependency Cmake
#devtools::install_github("NIEHS/ToxicR")
library("ToxicR")
library("dplyr")
library("PMCMRplus")
library(parallel)
library("pbmcapply")
#bmdoutput <- read.table("~/storage/biomarker_input_.txt",
# header=FALSE, sep = "\t", row.names = 1)
bmdoutput <- read.table("~/storage/ToxicR mod/data for BMDExpress log2 CPM Males.txt",
header=FALSE, sep = "\t", row.names = 1)
#
#bmdoutput <- bmdoutput[-3,-c(1:3,5:9,11:15,17:18)]
#
bmdoutput <- as.data.frame(t(bmdoutput))
bmdoutput <-bmdoutput[,-1]
bmdoutput[-1] <- lapply(bmdoutput[-1], function(x) {
as.numeric(as.character(x))
})

dose_column <-as.numeric(bmdoutput[,1])
start.time <- Sys.time()
Func <- function(resp_bygene_column){
trendtestres<- williamsTest(resp_bygene_column ~ dose_column, alternative = c("greater", "less"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raiduhs pointed out a downside to this, we can't set alpha with this implementation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

He also gave this a test at his end and found that it might need to be run for "greater" and "less" as separate tests, can you confirm whether you get both alternative hypotheses output?

> williamsTest(-1*x ~ g, alternative = c("greater", "less"))
 
         Williams trend test
 
data:  -1 * x by g 
alternative hypothesis:  greater

H0
               t'-value df t'-crit decision alpha
mu1 - ctr <= 0   -1.435 12   1.782   accept  0.05
mu2 - ctr <= 0   -1.435 12   1.873   accept  0.05
---
> williamsTest(-1*x ~ g, alternative = "less")
 
         Williams trend test
 
data:  -1 * x by g 
alternative hypothesis:  less
 
H0

               t'-value df t'-crit decision alpha
mu1 - ctr >= 0    1.357 12   1.782   accept  0.05
mu2 - ctr >= 0    2.950 12   1.873   reject  0.05
---

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the code to test both greater and less, I am also check the ToxicR williamtrendtest and unfortunately it also does not allow for choosing an alpha value, I think we might be revamping the williamtrendtest in general anyways based on the BMDexpress source files I have sent you

trendestresdf <- data_frame("pass/fail value" = trendtestres$crit.value[,1] - trendtestres$statistic[,1])
#only generate model when significant based on williams test
if (any(trendestresdf$`pass/fail value` <0)){
model<-single_continuous_fit(dose_column,resp_bygene_column)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like you're specifying any model; I guess the default is Hill?

The ideal way to approach this would be to wrap the whole function so that it can iterate over model types (ideally also in parallel)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the default was hill, I have implemented modelling of all 5 available models, and to save the best one based on The maximum value of the likelihod/posterior, I read that this is an assessment of the goodness of fit and a higher maximum would be considered the best model, please correct me if I am wrong. I have also implemented this in parallel processing (thank you for teaching me about it, it has changed my world), I used nested mclapplys but was worried about how it would be affected given the cores distributed to the workers (maybe there is an optimal split?), I also used foreach dopar, based on this discussion: https://stackoverflow.com/questions/51707443/should-mclapply-calls-be-nested. Indeed after testing, foreach dopar is slightly faster than nested mclapply with maximum cores used for both.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output file also now contains the gene the bmd and the model that was the best used

#capture p-value output from ToxicR for goodness of fit
modeloutput<- capture.output(summary(model))
all_numbers_mean_adequate_pvalue <- lapply(modeloutput[9:10], function(line) {
as.numeric(tail(regmatches(line, gregexpr("\\b\\d+\\.\\d+\\b", line))[[1]],n=1))
})
#only return model if goodness of fit is significant
if(!any(as.data.frame(all_numbers_mean_adequate_pvalue) >0.05)){
return(as.numeric(model$bmd[1]))
}
}
}

result <- pbmclapply(bmdoutput[,2:ncol(bmdoutput)], function(col) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can speed this up much more by passing mc.cores (usually a good idea to use num CPUs - 1 - see some examples and discussion here: https://stackoverflow.com/questions/28954991/whether-to-use-the-detectcores-function-in-r-to-specify-the-number-of-cores-for)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my apologies I thought I read that it automatically used the maximum amount of cores but you are correct including detectcores greatly improves speed!

return(Func(col))
})

#structuring of results to remove NULLs and have rows for output
signficant_BMD_table <- t(as.data.frame(Filter(Negate(is.null), result)))
end.time <- Sys.time()
time.taken <- round(end.time - start.time,2)
time.taken
write.table(signficant_BMD_table, file ="~/storage/toxicR output.txt", sep=" ", col.names = TRUE)
Loading