Hi,
we have recently successfully trained a model for a plant species sequenced on the MinION using an R9.4 flowcell. We have also sequenced the same plant species on the MinION with an R10.3 flowcell and successfully trained a model with those data.
We have now sequenced the same plant (again) on a PromethION R10.4 flowcell, but are running into an error when attempting to train a model:
* Taiyaki version 5.1.0
* Platform is Linux-4.15.0-38-generic-x86_64-with-debian-buster-sid
* PyTorch version 1.2.0
* CUDA version 10.0.130 on device GeForce GTX 1080 Ti
* Command line:
* "/opt/kgapps/taiyaki/bin/train_flipflop.py resume2/model_checkpoint_00018.checkpoint mapped_reads_2.hdf5 --min_sub_batch_size 48 --outdir resume3 --lr_max 0.00160 --niteration 40000 --lr_cosine_iters 30000 --overwrite --device 0
* Started on 2020-09-25 08:40:06.741154
* Loading data from mapped_reads_2.hdf5
* Per read file MD5 62e8f6baab6b7ca1d1c046bdaed7e933
* Reads not filtered by id
* Using alphabet definition: canonical alphabet ACGT and no modified bases
* Loaded 14191 reads.
* Reading network from resume2/model_checkpoint_00018.checkpoint
* Network has 10683280 parameters.
* Loaded standard (canonical bases-only) model.
* Dumping initial model
* Sampled 100000 chunks: median(mean_dwell)=9.20, mad(mean_dwell)=0.89
* Learning rate goes like cosine from lr_max to lr_min over 30000.0 iterations.
* At start, train for 200 batches at warm-up learning rate 0.0001
* Standard loss reporting from 141 validation reads held out of training.
* Standard loss report: chunk length = 5500 & sub-batch size = 48 for 10 sub-batches.
* Gradient L2 norm cap will be upper 0.05 quantile of the last 100 norms.
* Training
.................................................. 1 0.10118 0.10477 116.30s (164.95 ksample/s 18.39 kbase/s) lr=1.00e-04 22.8% chunks filtered
.................................................. 2 0.10152 0.10432 116.79s (164.35 ksample/s 18.31 kbase/s) lr=1.00e-04 23.1% chunks filtered
.................................................. 3 0.10123 0.10401 110.55s (173.69 ksample/s 19.35 kbase/s) lr=1.00e-04 22.4% chunks filtered
.................................................. 4 0.10249 0.10369 117.33s (163.65 ksample/s 18.22 kbase/s) lr=1.00e-04 22.5% chunks filtered
............................Traceback (most recent call last):
File "/opt/kgapps/taiyaki/bin/train_flipflop.py", line 4, in <module>
__import__('pkg_resources').run_script('taiyaki==5.1.0', 'train_flipflop.py')
File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1446, in run_script
exec(code, namespace, namespace)
File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/taiyaki-5.1.0-py3.7-linux-x86_64.egg/EGG-INFO/scripts/train_flipflop.py", line 624, in <module>
main()
File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/taiyaki-5.1.0-py3.7-linux-x86_64.egg/EGG-INFO/scripts/train_flipflop.py", line 541, in main
mod_factor_t, calc_grads = True )
File "/opt/kgapps/taiyaki/lib/python3.7/site-packages/taiyaki-5.1.0-py3.7-linux-x86_64.egg/EGG-INFO/scripts/train_flipflop.py", line 247, in calculate_loss
outputs, seqs, seqlens, sharpen)
File "taiyaki/ctc/ctc.pyx", line 88, in taiyaki.ctc.ctc.FlipFlopCRF.forward
File "taiyaki/ctc/ctc.pyx", line 62, in taiyaki.ctc.ctc.crf_flipflop_grad
AssertionError: Input not finite
If we resume training from the checkpoint, we run into the same error again some time later.
Do you have any idea what is going on here, and what we are doing wrong?
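To rule out a data problem on our side, a quick check of the mapped-reads file could look like the sketch below. This is a minimal, illustrative script, not taiyaki tooling; it assumes h5py and numpy are available and that the file follows the usual taiyaki mapped-read layout (a top-level "Reads" group with one subgroup per read holding a "Dacs" signal dataset), which may differ between taiyaki versions.

```python
# check_mapped_reads.py -- minimal sketch, not part of taiyaki.
# Assumes the mapped-reads HDF5 has a "Reads" group containing one
# subgroup per read with a "Dacs" raw-signal dataset.
import sys

import h5py
import numpy as np


def reads_with_nonfinite_signal(path):
    """Return ids of reads whose raw signal contains NaN or +/-inf values."""
    bad_reads = []
    with h5py.File(path, "r") as h5:
        for read_id, read_group in h5["Reads"].items():
            signal = np.asarray(read_group["Dacs"][()], dtype=np.float64)
            if not np.all(np.isfinite(signal)):
                bad_reads.append(read_id)
    return bad_reads


if __name__ == "__main__":
    bad = reads_with_nonfinite_signal(sys.argv[1])
    print("{} reads with non-finite signal values".format(len(bad)))
    for read_id in bad:
        print(read_id)
```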
This is an area of active research internally. Currently the best solution/workaround is to decrease --lr_max and increase --niteration (and possibly --lr_cosine_iters as well).
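For example, adapting the command from the log above, such a rerun might look like the following (the specific values and the output directory are only illustrative and would need tuning for your dataset):

```
train_flipflop.py resume2/model_checkpoint_00018.checkpoint mapped_reads_2.hdf5 \
    --min_sub_batch_size 48 --outdir resume4 --overwrite --device 0 \
    --lr_max 0.00080 --niteration 80000 --lr_cosine_iters 60000
```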