I'm trying to understand locfitByCluster a bit more. Currently, I'm using it for some clusters that can have large gaps and I'm seeing the following warning several times:
Warning in lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth, :
procv: no points with non-zero weight
I made the following small reproducible example. It generates the above warning 3 times. I looked at the first cluster where this happened and ran it manually.
library('bumphunter')
set.seed(20160428)
l <- 100
y <- sample(0:4, size = l, replace = TRUE)
x <- sort(sample(seq_len(3000 * l), size = l))
cluster <- clusterMaker(rep('chr1', l), x, maxGap = 3000)
length(unique(cluster))
test <- locfitByCluster(y = y, x = x, cluster = cluster, weights = NULL, minNum = 7, bpSpan = 1000, minInSpan = 0)
table(test$smoothed)
## Runs fine with a higher minInSpan value (also has warnings with minInSpan = 1)
test2 <- locfitByCluster(y = y, x = x, cluster = cluster, weights = NULL, minNum = 7, bpSpan = 1000, minInSpan = 2)
table(test2$smoothed)
## Identical in this case, but not in my larger use case
identical(test$fitted, test2$fitted)
## Run pieces of
## https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L53-L93
y <- matrix(y, ncol = 1)
bpSpan = 1000
minNum = 7
minInSpan = 0
weights = NULL
verbose = TRUE
weights <- matrix(1, nrow = nrow(y), ncol = ncol(y))
Indexes <- split(seq(along = cluster), cluster)
clusterL <- sapply(Indexes, length)
smoothed <- rep(TRUE, nrow(y))
Index <- Indexes[[2]]
nn <- minInSpan / length(Index)
j <- 1
sdata <- data.frame(pos = x[Index],
    y = y[Index, j],
    weights = weights[Index, j])
fit <- locfit(y ~ lp(pos, nn = nn, h = bpSpan), data = sdata,
    weights = weights, family = "gaussian", maxk = 10000)
## Explore positions
x[cluster == 2]
## Reproducibility info
proc.time()
message(Sys.time())
options(width = 120)
devtools::session_info()
You'll note that if you change how y is created, to let's say y <- rnorm(l, mean = 100), the above example no longer produces any warnings.
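Since the clusters here can contain large gaps, it may help to check how wide the within-cluster gaps are relative to bpSpan. Below is a minimal base-R sketch under stated assumptions: cluster_id is a stand-in that splits the sorted positions wherever the gap exceeds maxGap (it is not bumphunter::clusterMaker itself), and the 2 * bpSpan threshold is only a heuristic. If a gap is wider than 2 * bpSpan, the midpoint of that gap has no location within bpSpan of it. Whether that is what triggers the locfit warning is a guess, not something confirmed here.

```r
# Same positions as in the reproducible example above.
set.seed(20160428)
l <- 100
x <- sort(sample(seq_len(3000 * l), size = l))
maxGap <- 3000
bpSpan <- 1000

# Assumption: a new cluster starts wherever consecutive sorted positions are
# more than maxGap apart (a stand-in for clusterMaker on a single chromosome).
cluster_id <- cumsum(c(TRUE, diff(x) > maxGap))

# Largest gap between neighbouring positions inside each cluster
# (0 for clusters with a single position).
max_gap <- vapply(
    split(x, cluster_id),
    function(pos) if (length(pos) > 1) as.numeric(max(diff(pos))) else 0,
    numeric(1)
)

# Heuristic flag: clusters with a gap whose midpoint is farther than bpSpan
# from every location in the cluster.
wide_gap <- max_gap > 2 * bpSpan
table(wide_gap)
```

Clusters flagged in wide_gap would be the first ones worth running through locfit manually.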
Looking at the help of locfitByCluster I see the following:
minNum
Clusters with fewer than minNum locations will not be smoothed
minInSpan
Only smooth the region if there are at least this many locations in the span.
However, I only see minInSpan play a role at https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L76 which doesn't seem to match the description to me, especially since length(Index) is the length of the cluster, not of the sub-regions in the cluster. Or maybe you meant minInSpan: Only smooth the cluster if there are at least this many locations in the span.
The warning is eventually produced by locfit::locfit, in particular by lines 88-91 of procv.c (from the package source at https://cran.rstudio.com/web/packages/locfit/index.html), which read:
case LF_NOPT:
WARN(("procv: no points with non-zero weight"));
set_default_like(&lf->fp,v);
return(k);
In this particular small example, the returned values are the same. But in my larger use case they are not. I'm looking at base-resolution F-statistics (that's my y) while using minNum = 100 since the read length is 100 for this data set.
The full code is here although the data is not there.
I think that I should use minInSpan equal to minNum (100 in my use case). However, the description of minInSpan doesn't match what it does. I know minInSpan / length(Index) is passed to locfit::lp(nn) whose description is:
nn
Nearest neighbor component of the smoothing parameter. Default value is 0.7, unless either h or adpen are provided, in which case the default is 0.
This leaves me confused. Maybe there is something I'm missing on how minInSpan actually enforces the minimum number of locations needed before smoothing. I'm guessing nn should be between 0 and 1, but looking at my data, in some cases length(Index) is smaller than 100, which would lead to nn values greater than 1. Edit: actually https://github.com/ririzarr/bumphunter/blob/17eac01a3dd57ba8d496b6687bc55fcb29bea54d/R/smooth.R#L75 would pick these up since clusterL[i] is less than minNum = 100.
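To make the arithmetic concrete, here is a small sketch of the nn values that minInSpan / length(Index) yields for a few combinations. The cluster lengths and minInSpan values are illustrative assumptions, not taken from my data.

```r
# Illustrative values only (assumptions, not from the real data set).
minInSpan <- c(0, 1, 2, 100)
cluster_len <- c(7, 50, 100, 500)

# nn as computed in smooth.R: minInSpan divided by the cluster length.
nn <- outer(minInSpan, cluster_len, "/")
dimnames(nn) <- list(minInSpan = minInSpan, clusterLength = cluster_len)
nn

# Combinations where nn exceeds 1, i.e. the cluster has fewer locations
# than minInSpan.
nn > 1
```

With minInSpan = 0 every entry is 0, so only h = bpSpan would set the bandwidth. With minInSpan = 100, nn exceeds 1 only for clusters shorter than 100 locations, which are the ones the minNum check would already exclude when minNum is also 100.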
I'll appreciate it if you can help me figure out what is the proper minInSpan value to use. I know that I can run the code right now without changing minInSpan, but it doesn't feel right to have hundreds of warnings.
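To narrow down which clusters are responsible when there are hundreds of warnings, one option is to tally warnings per cluster with withCallingHandlers. The sketch below uses a hypothetical fit_one function and made-up positions as a stand-in for the per-cluster locfit call; it is not the bumphunter or locfit implementation, and its warning condition is an assumption chosen only to make the example run.

```r
# Hypothetical stand-in for the per-cluster fit: warns when the cluster has
# a gap wider than 2 * h. The real call would be the locfit fit.
fit_one <- function(pos, h) {
    if (length(pos) > 1 && any(diff(pos) > 2 * h)) {
        warning("procv: no points with non-zero weight")
    }
    invisible(NULL)
}

# Made-up clusters of positions, keyed by cluster id.
clusters <- list(
    `1` = c(1, 500, 900),
    `2` = c(1, 2500, 2600),
    `3` = c(10, 20, 5000)
)

# Count the warnings raised while fitting each cluster, muffling them so
# they do not flood the console.
warn_count <- vapply(clusters, function(pos) {
    n <- 0L
    withCallingHandlers(
        fit_one(pos, h = 1000),
        warning = function(w) {
            n <<- n + 1L
            invokeRestart("muffleWarning")
        }
    )
    n
}, integer(1))
warn_count
```

Replacing fit_one with the real per-cluster call would give a named vector showing which clusters warn and how often.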
Thanks!
Leo
lcolladotor added a commit to LieberInstitute/dbFinder that referenced this issue on Feb 6, 2018