Performance issue generating jobs #267

npklein · 2016-11-15T13:42:58Z

With a dataset that has multiple parameter files, parsing one of the parameter files takes a very long time. The issue can be reproduced by running generateJobs_phasing.sh from #266.

fdlk · 2016-12-05T12:58:27Z

First-look comment:
Parameter files chromosomes_X_Y.csv and chromosome_chunks.csv contain related data.
Is there a reason why the CHR column from chromosomes_X_Y.csv cannot be merged into chromosome_chunks.csv, i.e.:

CHR, chromosomeChunk
1, 1:1-5500000
1, 1:4500001-10500000
1, 1:9500001-15500000
1, 1:14500001-20500000
[...]
2, 2:1-5500000
2, 2:4500001-10500000
2, 2:9500001-15500000
2, 2:14500001-20500000
[...]

I'd expect that to speed things up by a factor of 25 or so.
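A minimal sketch of why merging could help, assuming (the thread does not confirm this) that values from two separate parameter files are combined as a Cartesian product. The toy data reuses the chunk names from the example above; the variable names are made up for illustration.

```python
from itertools import product

# Toy data; the real parameter files hold many more rows.
chromosomes = ["1", "2"]
chunks = [
    "1:1-5500000", "1:4500001-10500000", "1:9500001-15500000", "1:14500001-20500000",
    "2:1-5500000", "2:4500001-10500000", "2:9500001-15500000", "2:14500001-20500000",
]

# Assumption: two separate files are combined as a Cartesian product,
# so every CHR value is paired with every chunk, including unrelated ones.
crossed = [{"CHR": c, "chromosomeChunk": k} for c, k in product(chromosomes, chunks)]

# Merged file: each chunk row carries its own CHR value.
merged = [{"CHR": k.split(":")[0], "chromosomeChunk": k} for k in chunks]

reduction = len(crossed) // len(merged)
print(len(crossed), len(merged), reduction)  # 16 8 2
```

Under that assumption the reduction factor equals the number of distinct CHR values, which would fit the estimate above if chromosomes_X_Y.csv lists roughly 25 of them.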

fdlk · 2016-12-05T14:40:49Z

Talked with Niek and Freerk. Two questions:

1. Is parameter solving slower than it should be because it fails to sufficiently collapse the problem, i.e. does it solve the same parameter value many times, once for each different version of the independent parameters?
2. Why does the above example give such a long #list of chromosomeChunks, and why does it stop doing so if you add #list CHR?

fdlk · 2016-12-07T11:40:22Z

Answer to number 2:

Behaviour of #list parameters is not completely specified, but what specification exists can be found here: http://molgenis.github.io/pipelines/mc-parameters#3listsofparameters

Behaviour depends on which file the parameters are defined in (!). I find this odd and impractical: I'd think you should be able to specify the parameter space any way you like, and it should then be collapsed for each step script depending on the parameters defined in that script.
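The per-step collapsing described here could look roughly like the sketch below. This is an illustration of the idea only, not how Molgenis Compute implements #list; `collapse`, `parameter_space` and the sample names are hypothetical.

```python
from itertools import product


def collapse(rows, step_params):
    """Project rows onto step_params and drop duplicates, keeping first-seen order."""
    seen = set()
    collapsed = []
    for row in rows:
        key = tuple(row[name] for name in step_params)
        if key not in seen:
            seen.add(key)
            collapsed.append(dict(zip(step_params, key)))
    return collapsed


# Hypothetical full parameter space: every sample combined with every chunk row.
samples = ["sampleA", "sampleB"]
chunk_rows = [
    {"CHR": "1", "chromosomeChunk": "1:1-5500000"},
    {"CHR": "1", "chromosomeChunk": "1:4500001-10500000"},
    {"CHR": "2", "chromosomeChunk": "2:1-5500000"},
    {"CHR": "2", "chromosomeChunk": "2:4500001-10500000"},
]
parameter_space = [{"sample": s, **row} for s, row in product(samples, chunk_rows)]

# Each step only sees the distinct combinations of the parameters it uses.
per_chromosome = collapse(parameter_space, ["CHR"])
per_chunk = collapse(parameter_space, ["CHR", "chromosomeChunk"])
per_sample_chunk = collapse(parameter_space, ["sample", "CHR", "chromosomeChunk"])
print(len(per_chromosome), len(per_chunk), len(per_sample_chunk))  # 2 4 8
```

With this approach the number of jobs for a step depends only on the parameters that step uses, not on which file defines them or in what order.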

I suspect the reason for this dependence is that the implementation of #list filters the parameters in the original file.

Things to do:

Create a simpler specification of how #list parameters should work and implement it.

For now:

Realise that experimenting with the order of parameters, adding #list of unused parameters, and moving parameters to and from separate files may wildly change behaviour, for better or for worse.

fdlk · 2016-12-07T11:41:55Z

Answer to 1: Parameter solving is no longer done using FreeMarker and looks to me to be reasonably efficient. The slowness comes from generating far too many jobs for this combination of chromosome, chunk and sample.
