Performance issue generating jobs #267

Open
npklein opened this issue Nov 15, 2016 · 4 comments

npklein commented Nov 15, 2016

With a dataset where I have multiple parameter files, parsing one of the parameter files takes very long. The issue can be reproduced by running generateJobs_phasing.sh from #266.

fdlk commented Dec 5, 2016

First look comment:
Parameter files chromosomes_X_Y.csv and chromosome_chunks.csv contain related data.
Is there a reason why the CHR column from chromosomes_X_Y.csv cannot be merged into chromosome_chunks.csv, i.e.:

CHR, chromosomeChunk
1, 1:1-5500000
1, 1:4500001-10500000
1, 1:9500001-15500000
1, 1:14500001-20500000
[...]
2, 2:1-5500000
2, 2:4500001-10500000
2, 2:9500001-15500000
2, 2:14500001-20500000
[...]

I'd expect that to speed things up by a factor of 25 or so.
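
To illustrate where a speed-up of that order could come from, here is a minimal Python sketch. It assumes that values from separate parameter files are combined as a full cross product, which is my reading of the behaviour and not something confirmed from the code. The data is a shortened, hypothetical stand-in for the two files.

```python
from itertools import product

# Hypothetical, shortened stand-ins for the two parameter files.
chromosomes = ["1", "2"]  # CHR column of chromosomes_X_Y.csv
chunks = [                # chromosomeChunk column of chromosome_chunks.csv
    "1:1-5500000",
    "1:4500001-10500000",
    "2:1-5500000",
    "2:4500001-10500000",
]

# Separate files (assumption): every CHR is paired with every chunk.
crossed = list(product(chromosomes, chunks))

# Merged file: CHR sits on the same row as its chunk, so each chunk occurs once.
merged = [(chunk.split(":")[0], chunk) for chunk in chunks]

print(len(crossed), len(merged))
```

Under that assumption the crossed set is larger than the merged one by a factor equal to the number of distinct CHR values.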

fdlk commented Dec 5, 2016

Talked with Niek and Freerk. Two questions:

  1. Is parameter solving slower than it should be because it fails to sufficiently collapse the problem, i.e. does it solve the same parameter value many times, once for each combination of independent parameters?
  2. Why does the above example give such a long #list of chromosomeChunks, and why does it stop doing so if you add #list CHR?

fdlk commented Dec 7, 2016

Answer to number 2:

The behaviour of #list parameters is not completely specified, but what specification exists can be found here: http://molgenis.github.io/pipelines/mc-parameters#3listsofparameters

The behaviour depends on which file the parameters are defined in(!). I find this odd and impractical: I'd think you should be able to specify the parameter space any way you like, and it should then be collapsed for each step script depending on the parameters defined in that script.

I suspect the reason for this dependence is that the implementation of #list filters the parameters in the original file.

Things to do:

  • Create simpler specification on how #list parameters should work and implement it.

For now:

  • Realise that experimenting with the order of parameters and adding #list of unused parameters and moving parameters to and from separate files may wildly change behaviour. For better and for worse.
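
As a starting point for such a specification, here is a minimal Python sketch of the per-step collapsing I have in mind. It is not how the current implementation works. The function `collapse` and its arguments `scalars` and `lists` are hypothetical names, and the rule it encodes (one job per distinct combination of the plain parameters, with the #list parameters gathered per job) is an assumption about the desired behaviour.

```python
def collapse(rows, scalars, lists):
    """Collapse parameter rows for one step.

    rows    -- the full parameter space, one dict per row
    scalars -- names of parameters the step uses as plain values
    lists   -- names of parameters the step declares as #list
    Returns a dict mapping each distinct scalar combination to its lists.
    """
    jobs = {}
    for row in rows:
        key = tuple(row[name] for name in scalars)
        # First time this combination is seen: start empty lists for it.
        job = jobs.setdefault(key, {name: [] for name in lists})
        for name in lists:
            job[name].append(row[name])
    return jobs


# Hypothetical, shortened parameter space.
rows = [
    {"CHR": "1", "chromosomeChunk": "1:1-5500000"},
    {"CHR": "1", "chromosomeChunk": "1:4500001-10500000"},
    {"CHR": "2", "chromosomeChunk": "2:1-5500000"},
]

jobs = collapse(rows, scalars=["CHR"], lists=["chromosomeChunk"])
for key, job in jobs.items():
    print(key, job)
```

With a rule like this the result depends only on which parameters the step uses, not on which file they were defined in or in what order.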

fdlk commented Dec 7, 2016

Answer to 1: Parameter solving is no longer done using FreeMarker and looks reasonably efficient to me. The slowness comes from generating far too many jobs for this combination of chromosome, chunk and sample.
