============================== (Pending) Release Notes: v1.00 ==============================
C++ API:
Support for new training algorithms:
Support for new network structures:
Support for new layers:
- Multi-dimensional reduction (requires cuTENSOR)
Python front-end:
Performance optimizations:
Model portability & usability:
Experiments & Applications:
Internal features:
I/O & data ingestion:
- Added a new Python dataset reader for simple, flexible, and distconv-supported
Python data readers.
Build system:
Bug fixes:
- Fixed a bug where Adam added an extra eps to the gradients, resulting in training
instability for gradients approaching the scale of eps. This was previously added
to avoid the performance penalty of denormalized values on CPU.
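For reference, a minimal sketch of one textbook Adam step, in which eps appears only in the denominator and is never added to the gradient. This is an illustration in plain Python, not LBANN's implementation; the function name adam_step and the default hyperparameters are assumptions.

```python
import math

def adam_step(param, grad, m, v, step,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update for a single scalar parameter (hypothetical helper)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** step)             # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    # eps only guards the division; the gradient itself is left untouched
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

With a zero gradient the parameter does not move, and a gradient near the scale of eps still produces a bounded step.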
Retired features:
============================== Release Notes: v0.104 ==============================
C++ API:
Support for new training algorithms:
Support for new network structures:
- Added GPT-3 transformers and training recipes
Support for new layers:
- Select operator (set tensor value based on predicate)
- Model parallelism for channel-wise fully-connected layers
Python front-end:
- Support for PyTorch Module conversion to LBANN graphs (requires PyTorch 2.0
or newer, compiled with PyTorch Dynamo)
Performance optimizations:
- Support in-place computations for capable layers as a memory optimization
- Allow distconv-enabled convolution and batchnorm layers to reuse their
input activations as error signals as a memory optimization if the parent
layer does not need its activations in the backward pass. This optimization
can be disabled by setting the environment variable
DISTCONV_DISABLE_MEM_OPT=1.
- Added support for selective weight sharding (also known as
Fully-Sharded Data Parallelism, or FSDP). To enable, set sharded=true
on weight objects.
- Allow distconv to be disabled at runtime with LBANN_DISABLE_DISTCONV=1.
- Activations are now deallocated when no longer needed via a reference counter,
disable with LBANN_DISABLE_ACT_GC=1.
- Added option for LBANN to set the number of OMP threads to modest
default (4) if the environment doesn't specify anything.
- Save memory on backpropagation by not replicating gradients between
GradientManager and data_type_optimizer
- Save more memory in FSDP by synchronizing previous outstanding
async communication calls and freeing up local gradient contributions
- FSDP: release full weight views after backprop
- Batching heads in multi-head attention into single operations
instead of on a per-head basis
- Stacking the weights and biases for queries/keys/values in
self-attention
- Support for automatic mixed precision.
Model portability & usability:
- Added support for profiling with Caliper
Experiments & Applications:
- Updated CosmoFlow model to automatically scale the model
architecture and parallelism with input size.
- Added a PyTorch reference implementation of CosmoFlow.
Internal features:
- Removed the mini_batch_size parameter from the following functions
in the layer class hierarchy: fp_setup_inputs, fp_setup_outputs, bp_setup_gradient_wrt_inputs
and the distconv_adapter class: fp_setup, bp_setup
- Support global and local gradient norm clipping with the clip_gradient_norm callback
- Interactive progress bar with the progress_bar callback
- Evaluate progress callback allows for periodic monitoring during
training with independent data set (intra-epoch evaluation)
- Detailed memory usage profiling with the memory_profiler callback
- Refactored subgraph parallelism
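A rough sketch of the difference between global and local gradient norm clipping. It assumes "global" means one norm computed over all weights' gradients together and "local" means one norm per weights object; the helper names are hypothetical and this is not the clip_gradient_norm callback's code.

```python
import math

def _scale(norm, max_norm):
    # Shrink only when the norm exceeds the limit
    return min(1.0, max_norm / norm) if norm > 0.0 else 1.0

def clip_global(grads, max_norm):
    """Scale every gradient by one factor from the combined L2 norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    s = _scale(total, max_norm)
    return [[g * s for g in vec] for vec in grads]

def clip_local(grads, max_norm):
    """Scale each gradient vector independently by its own L2 norm."""
    out = []
    for vec in grads:
        s = _scale(math.sqrt(sum(g * g for g in vec)), max_norm)
        out.append([g * s for g in vec])
    return out
```

Global clipping preserves the relative magnitudes of gradients across weights, while local clipping bounds each one separately.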
I/O & data readers:
- Renamed percent_of_data_to_use more accurately to fraction_of_data_to_use.
- DataReaderMetaData, training_dr_linearized_data_size, and num_parallel_readers
were removed from the model and layer API, and instead reside in the data
ingestion pipeline.
- Fixed implementation of background I/O to achieve better decoupling
of background data fetch. Can be enabled / disabled with runtime
flag.
- Set the default number of I/O threads to 4
- Changed the I/O and transform pipeline to use a bank of RNGs that
is now indexed by the sample ID in the load sequence, rather than the
I/O thread ID. This eliminates variability when using different
numbers of I/O threads.
- Moved state tracking current position in a data set from the data
reader to the dataset class.
- Split the I/O RNGs into two banks: one for training and one for all
other execution modes.
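To illustrate why indexing the RNG bank by sample position removes the dependence on the I/O thread count, here is a small simulation. It is only a sketch: the bank sizing, seeding, and round-robin thread assignment are assumptions, and draw_noise is a hypothetical name.

```python
import random

def draw_noise(num_samples, num_threads, index_by_sample, seed=0):
    """Simulate one mini-batch in which each sample draws one random value."""
    bank_size = num_samples if index_by_sample else num_threads
    bank = [random.Random(seed + i) for i in range(bank_size)]
    out = [None] * num_samples
    for thread_id in range(num_threads):
        # Round-robin assignment of samples to simulated I/O threads
        for pos in range(thread_id, num_samples, num_threads):
            rng = bank[pos] if index_by_sample else bank[thread_id]
            out[pos] = rng.random()
    return out

# Same values regardless of thread count when indexed by sample position
same = draw_noise(8, 2, True) == draw_noise(8, 4, True)
# Values change with thread count when indexed by thread ID
differs = draw_noise(8, 2, False) != draw_noise(8, 4, False)
```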
Build system:
- Updated build script to use CachedCMakeProject mode, which should
simplify the overall workflow
- Set a default time limit for CI tests to avoid unnecessary stalls
Bug fixes:
- Fixed a bug where in-place layers sometimes attached a locked view
of a matrix to a mutable view.
- Fixed a bug when trying to use the legacy HDF5 data reader without data store.
- Fixed concurrency bugs in the data store
- Fixed DistConv memory optimization bug
Retired features:
- Support for autoencoder strategy in the summarize images callback was removed
- Removed deprecated Layer protobuf fields: weight_data,
num_neurons_from_data_reader
- Removed support for calculating a global mini-batch across multiple
models using the imcomm callback or multiple trainers. The
mini-batch is now strictly contained to a single model in a single
trainer. This deprecates an unused (and old) multi-model
execution mode using imcomm callback that predated LTFB.
- Removed the notion of effective mini-batch size versus current mini-batch size.
- Remove world master mini-batch adjustment.
- Remove model offset field. No longer necessary since data sets do not span models.
- Remove the cached value of the current mini-batch size from the SGD
execution context. It is now only cached in the model.
- Removed the imcomm "inter-model" callback
- Removed the num-parallel-readers parameter to the I/O subsystem.
This eliminates an older version of I/O parallelism that relied on
a non-data-parallel I/O buffer and had different ranks fetching
entire mini-batches. It is superseded by standard data-parallel I/O.
============================== Release Notes: v0.103 ==============================
C++ API:
- Added ability to load models and run inference from external C++ applications
- Added inference-only execution algorithm
Support for new training algorithms:
- 2nd-order optimization with K-FAC.
Currently supports fully-connected, convolution, and GRU layers.
- Added Sub-graph parallelism support for multi-branch architectures
(split, slice, sum, and concat layers)
- Data + sub-graph parallelism for in-core models (D&SP and D&SP-cSub)
- Initial sub-graph parallelism support for common layers in
Transformers
- Model topology mutation in LTFB/PBT
- Added sub-grid parallelism support for K-FAC using primary and
secondary grids
- Truncation selection exchange for LTFB/PBT
- Regularized evolution for LTFB/PBT
- Hyperparameter grid search
- Multi-GAN training algorithm with multiple discriminators
Support for new network structures:
- Edge-conditioned graph neural networks
- RoBERTa with pretrained weights
Support for new layers:
- Added support for 2D Matrices for Scatter and Gather layers
- Added support for distributed Scatter and Gather layers
- DistConv enabled 3D MatMul
- Added image rotation layer and composite image transformation layer
(rotate, shear, translate)
- Added distributed tensor parallelism with channelwise decomposition for channelwise fully connected layer
- Added "binary-with-constant" operators
- Updated deconvolution layer to match PyTorch's API
- Updated identity layer to copy tensors to enable tensor parallelism in subsequent layers in the compute graph
- Added IdentityZero layer that allows alternating generator/discriminator
updates for training GANs.
- Added an External layer that enables separately-compiled library to be loaded dynamically
- Added support for labels_only mode on data-parallel cross entropy layer
Python front-end:
- Added support for building and launching jobs on Fugaku
- Added Riken as a known compute center
- Added Perlmutter as a known compute center
- Added support for PJM as job launcher
- Unified convolution/deconvolution interface to better approximate PyTorch.
- Added circular (periodic) padding transformation for 2D and 3D tensors
- Added support for Flux job scheduler
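As an illustration of what circular (periodic) padding does to a 2D tensor, the following plain-Python sketch wraps values around each edge. It assumes the same pad width on every side and is not the front-end's API; circular_pad_2d is a hypothetical helper.

```python
def circular_pad_2d(image, pad):
    """Pad a 2D list on all sides by wrapping around the opposite edge."""
    h, w = len(image), len(image[0])
    return [
        [image[(r - pad) % h][(c - pad) % w] for c in range(w + 2 * pad)]
        for r in range(h + 2 * pad)
    ]

padded = circular_pad_2d([[1, 2],
                          [3, 4]], pad=1)
# padded == [[4, 3, 4, 3],
#            [2, 1, 2, 1],
#            [4, 3, 4, 3],
#            [2, 1, 2, 1]]
```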
Performance optimizations:
- Enabled the input layers to use a view of the I/O buffers in the
buffered data coordinator
- Use default-allocated GPU memory for long-lived buffers
- Optimized GPU kernels for entry-wise operators
- Optionally use default-allocated GPU memory for long-lived buffers
Model portability & usability:
- Weight initialization from NumPy files
- Expanded layer documentation
Experiments & Applications:
- Example for training Transformer model with D&SP and D&SP-cSub
- PROBIESNet model for HRRL data
- Cosmo 3D GAN
- MNIST GAN
- Image GAN
- Example Distributed Graph Convolutions Networks
- NASNet
- RoBERTa
Internal features:
- Added operator class
- Added AlternateUpdates callback to be used with IdentityZero layers for
training GANs.
- Added support for serializing network architectures to protobuf format.
- Reformatted headers and implementation files for a more IWYU paradigm.
- General support for ROCm-enabled DistConv
- Support for use of libfabric plugin for RCCL and NCCL
- Framework-wide improvements in support for ROCm and MIOpen
- Callback for alternating optimizer layer update
- Command line argument to hang the LBANN application for debugging
- Add a cuTT/hipTT backend to the permute layer
- Add a permute layer utilizing cuTENSOR for the permute implementation
- Weight initializer from NumPy file
I/O & data readers:
- Updated SMILES data reader to use sample lists
- Added explicitly managed buffered reading and local unpacking for the
SMILES data reader to minimize file access
- Sample lists with integral indices can use range format (start ... end)
- Added a new extensible HDF5 data reader that uses a data set schema
and experiment schema files to define how the data is represented.
This allows the user to change the representation of data without
changing the data reader.
- Changed the input layer to take a data field and only produce a
single output. Currently valid data fields are samples, labels,
and responses.
- Added support for using arbitrary field names with HDF5 data reader.
- Updated the data coordinator and data readers to
take dynamic data fields rather than fixed fields. Input buffers
are no longer allocated for fields that are not used in active
models.
- Added support in the generic data reader and synthetic data reader
classes for arbitrary data fields.
- Added support for data readers to return full Conduit nodes to the
Data Coordinator.
- Data coordinator can now directly return packed data fields to
input layers.
- Added padding and cutout transformations
Build system:
- Added support for using upstream Spack repositories
- Added support to reuse existing Spack environments, which
significantly decreases the startup time of running a CI job
- Enforce consistent GPU targets in Spack environment
- Switched from Bamboo to GitLab CI framework
- Added support for a modular GitLab CI script on Lassen
- Added CI pipelines for +distconv and +nvshmem where appropriate
Bug fixes:
- Fixed GPU kernels that launched with more blocks than allowed
- Fixed build and runtime errors with DistConv
- Use correct LayerNorm operation in "Attention Is All You Need"
Transformer
- Fixed a bug where the input layer performed unnecessary memory
allocations.
- Bug fixes within Cosmoflow and U-Net models
- Fixed a bug in the GPU-based computation of the batchnorm
statistics
- Patch for when distconv'd input layer is followed by non-distconv layer
- Bugfix input layer activations: Fixed the input layer so that it
would only resize the activation matrix if it wasn't already setup
to be a view of the data_coordinator's matrix. This addresses a
significant performance bug in the data ingestion where the
activation matrix was a view into the data coordinator's internal buffers.
- Fixed bad convolution parameters producing incorrect layer shapes.
- Enabling tensor copy on distconv-enabled Identity layer
- General cleanup and improvement in the coverage and robustness of
CI testing
- Fix buffer overflow in SMILES data reader
- Fix a bug in TSE
- Do not construct bias weights when not needed in conv and FC modules
- Use tournament set in LTFB with truncation selection exchange
- Cleanup data reader tests memory leaks
- Fixed a buffer overrun, heap overflow, and double allocation of the
data store in the SMILES data reader
- Match LayerNorm and InstanceNorm layers to PyTorch
- Make sure GPU grid dims are valid in slice/concat layers
- Fixed incorrect matrix ordering in K-FAC for conv layer
- Bugfix for polynomial learning rate schedule
Retired features:
============================== Release Notes: v0.102 ==============================
Support for new training algorithms:
- LTFB is now a first-class training algorithm.
- LTFB now allows multiple metrics. The local algorithm is favored by
each trainer and a partner model must win every metric to be declared
the tournament winner.
- The batched iterative optimizer (sgd_training_algorithm) was
refactored for consistency.
- Improved documentation of training algorithm infrastructure.
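The multi-metric tournament rule above can be sketched as follows. This assumes each metric has a known direction (higher or lower is better) and that ties go to the local model; the function name and dictionary layout are hypothetical and not part of LBANN's API.

```python
def partner_wins(local_scores, partner_scores, higher_is_better):
    """Return True only if the partner model is strictly better on every metric."""
    for name, local in local_scores.items():
        partner = partner_scores[name]
        better = partner > local if higher_is_better[name] else partner < local
        if not better:
            return False  # the local model is favored on any tie or loss
    return True

direction = {"accuracy": True, "loss": False}
local = {"accuracy": 0.90, "loss": 0.30}
mixed = partner_wins(local, {"accuracy": 0.92, "loss": 0.35}, direction)
sweep = partner_wins(local, {"accuracy": 0.92, "loss": 0.25}, direction)
```

Here mixed is False because the partner loses on loss, while sweep is True because the partner wins both metrics.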
Support for new network structures:
- ATOM WAE model - character-based Wasserstein Autoencoder
- Community GAN model for graph data sets
Support for new layers:
- "DFTAbs" layer that computes the absolute value of the channel-wise
DFT of the input data
- Added support for 3D Matrix Multiplication
- Added scatter and gather neural network layers
- CPU-based GRU layers using oneDNN
- Added batch-wise reduce-sum
- ArcFace loss
Python front-end:
- Added 3D U-Net Model
- Added Cosmoflow Model
- Ported CANDLE Pilot1 models
- Support nvprof
- Added channelwise fully connected layer
- Added support for non square kernels, padding, stride, and
dilation for the convolution module
- Support for OpenMPI launcher
Performance optimizations:
- Use cuDNN 8 RNN API and CUDA Graphs in GRU layer
- Cache CUDA Graphs for each active mini-batch size
- Tuned performance of slice, concatenate, and tessellate layers on
ARM processors
- Parallelize computation of Gaussian random numbers
- Optimizing tessellate, concatenate, and slice layers on CPU
Experiments & Applications:
- Added experiment scripts for ATOM cWAE Gordon Bell simulations
- LBANN-ATOM model inference and analysis
Internal features:
- Wrapper classes for CUDA Graphs API
- Elementary examples of using complex numbers
- cuDNN handles are now wrapped in RAII management classes
- Improved HWLOC compatibility for v1.11 and v2.x
- Added an enum type of visitor hooks that will eventually be used to
allow callbacks or other visitors to operate at user defined hook
points
- Changed checkpoint logic to checkpoint at the start of epochs
and changed the naming scheme to use the callback phase (visitor
hook) in the name rather than the current execution context.
- Added in-memory binary model exchange for LTFB.
- Added support for ROCm and MIOpen
- Added support for oneDNN
- Updated the bamboo test environment to use local executable rather
than hard coded executables
- Overhauled and refactored serialization throughout code to use
Cereal serialization library
- Significant cleanup and refactoring of code base to improve compile
times. Moving to ensure that code adheres to standard split of
header between declaration and implementation functions (for
templated code). Specifically focused on serialization functions
and comm class. Reduced dependencies through over reaching header
inclusions.
- The relationship of execution_contexts and training_algorithms was
clarified. There is still work to do here.
- Added DistConv tests for both convolution and pooling layers
- Support padding in distributed embedding layer
- Added dump model graph callback
- Added perturb learning rate callback
- Added batched inference algorithm
- Switched ATOM tests to use CPU embedding and tessellate layers to
minimize noise
I/O & data readers:
- Experimental data reader that generates graph random walks with
HavoqGT
- Added explicit tournament execution mode
- Added support to split training data reader into validation and
tournament readers
- node2vec data reader
Build system:
- Hydrogen v1.5.0+
- Aluminum v0.5.0+
- DiHydrogen v0.2.0 is required
- C++14 or newer standard with CUDA (CMake: "-DCMAKE_CUDA_STANDARD=14")
- OpenCV is now an optional dependency via CMake "LBANN_WITH_VISION"
- CNPY is now an optional dependency via CMake "LBANN_WITH_CNPY"
- Adds support in the build_lbann.sh script for concretizing extra
packages with the primary LBANN installation
- New features in the build script to setup / configure the build
environment, but stop and allow the user to manually add extra
packages
- Add a set of user-focused build scripts that use the main
build_lbann.sh script to setup good defaults on known systems
- Added application specific build scripts for users such as ATOM
- Added support for pulling from Spack mirrors and setting them up
- Split embedded Python support from Python Front End
- Switched Spack-based build script to use Spack's clingo concretizer
Bug fixes:
- Fixed a bug where LBANN didn't set the Hydrogen RNG seed
- Fixed both the CosmoFlow and U-Net models in the PFE and addressed
issues in the data reader and data coordinator.
- Fixed the HDF5 data reader to properly specify the supported I/O
types
- Fixed calculation of the linearized response size
- Fixed the data coordinator's interface to input_layer
- Fixed error with deterministic execution of dropout layers
Retired features:
- Removed deprecated JAG leader mode which was made obsolete when the
data reader moved into the data coordinator
- Removed the deprecated partitioned data reader modes that were used
to partition and overlap data sets for multiple models
- Removed deprecated ActivationDescriptor class
============================== Release Notes: v0.101 ==============================
Support for new training algorithms:
Support for new network structures:
- ATOM VAE model
- Graph neural networks
- Graph Convolutional Networks (GCN)
- 3D U-Net Model
Support for new layers:
- Implemented optimized GRU layer using cuDNN kernel
- Graph Layers: GCN, GIN, Graph, GatedGraph
Python front-end:
- Support for Graph and Graph Convolutional Networks
- Added support for OLCF data center (Summit)
Performance optimizations:
- Optimize CUDA kernel for tensor reordering in GRU layer
- Enabled TensorCore optimization for GRU layer
- GCN and Graph layers also have a faster Dense variant which only utilizes Matrix Multiplication
Model portability & usability:
- Added Users Quickstart section to documentation including PyTorch
to LBANN mini-tutorial
- Added section on callbacks with detailed instructions on summarize
images callback
Internal features:
- Support for double data type in distributed embedding layer
- Support for large number of channels in GPU batchnorm layer
- Modified LTFB so that NaNs lose tournaments
- Improved numerical stability of reconstruction loss in ATOM VAE
model
- Skip bad gradients in Adam
I/O & data readers:
- Added support for ImageNet data reader to use sample lists
- Refactored sample list code to be more flexible and generalize
beyond JAG data reader
- Added support for slab-based I/O in HDF5 data reader required by
DistConv implementations of CosmoFlow 3D volumes
- Extended slab-based HDF5 data reader to support labels and
reconstruction modes for use with U-Net architecture
Datasets:
- Added two graph datasets (MNIST, and PROTEINS)
Build system and Dependent Libraries:
- Hydrogen 1.4.0
- Aluminum 0.4.0
- Spack v0.15.4+ (Requires new format for environments)
- cuDNN 8.0.2
- Require C++14
- Added Spack build support for OLCF data center (Summit)
Bug fixes:
- Properly reset data coordinator after each LTFB round
- Fixed bug in weights proxy when weights buffer is reallocated
- Bugfix for smiles data reader bound checking and simple LTFB data
distribution
- Eliminated a race condition observed in VAE ATOM model with SMILES
data reader. Added a barrier after each data store mini-batch
exchange -- avoid race between non-blocking sends and receives and
later GPU kernel communication.
Retired features:
============================== Release Notes: v0.100 ==============================
Support for new network structures:
- 3D molecular generation models for Metal Organic Frameworks from the CoRE MOF Database.
- 3D CosmoFlow Model
- DenseNet
- ATOM LSTM model
- RAS state classifier
- node2vec
- Transformer and other attention-based models
- ExaGAN (formerly CosmoGAN)
- MaCC ICF surrogate model
Applications:
- Created a directory of example applications, deprecating the "model zoo" directory
Support for new layers:
- Embedding layer
- Distributed embedding layer
- Channel-wise scale/bias layer
- Entry-wise scale/bias layer
- Gated-Recurrent Units (GRU)
- Entry-wise batchnorm
- Argmax, Argmin, and one-hot layers
- Layer norm
- Deconvolution layer (transposed convolution)
- Layers for channel-wise operations (channel-wise fully-connected, channel-wise softmax, channel-wise scale/bias, instance norm)
- Matrix multiply layer
Python front-end:
- Can now configure contrib launcher with environment variables
- Added NERSC compute center
- Per-layer specification of compute device (CPU or GPU)
- Option to write custom batch scripts with Python front-end
Performance optimizations:
- Parallelized Python data reader with "multiprocessing" module
- Fuse batchnorm stats allreduces in FP/BP.
- Tuned concatenate and slice layer
- Dynamically allocate and free memory for layer error signals (halves LBANN's memory footprint)
Model portability & usability:
- Bamboo tests for individual layers
Internal features:
- Added support for DistConv features (distributed, generalized,
parallel convolution)
- Added support for NVSHMEM 1.0 API (used in distributed embedding
layer and DistConv halo exchange)
- Support for multiple data types per model (per-layer)
- Support for per-layer mixed-precision weight training and inference,
includes per-weight object and objective function mixed-precision.
- Improved how and when the RNGs are initialized
- Callback to dump images to TensorBoard
- Callback to save model weights (useful to export to PyTorch)
- Callback to save top K models (LTFB)
- Improved run-to-run reproducibility by initializing weights in alphabetical order
- Moved models from model_zoo directory to applications directory
- Cleanup and refactoring of callbacks and layer instantiation
- Grouped batchnorm statistics
- Callback to print model description
- Refactored trainer and training-state out of the model class
- Support for transposing data in matrix multiply layers
- Added DiHydrogen tensor and DistConv library
- Added parallel strategy to layer class to support DistConv
- LBANN inference mode supports loading models from multiple directories
- Cleanup of checkpoint and restart logic
I/O & data readers:
- Added in-memory data store that caches samples in CPU memory. It can be loaded
during the first epoch or preloaded
- Added new "transform" data preprocessing ingestion pipeline
- Added sample list format for specifying data sets
- Introduced data coordinator that manages data readers and extracts them from
the input layers
- Data store is able to checkpoint / spill its contents to local disk
- Data reader for SMILES strings
Build system:
- Hydrogen 1.3.4
- Aluminum 0.3.3
- Improved documentation on read the docs (RTD)
- Robust support for using Spack as a build system around CMake
- Identified compute centers for specifying build and run dependencies
- Added Catch2-based tests
Bug fixes:
- Fixed path resolution for dump weights, save model, and checkpoint callbacks
- Added mutexes for preloading the data store
- Fixed the LTFB exchange to include all ADAM optimizer state
- Fixed the mapping of I/O RNGs to I/O processing threads to ensure
consistent and correct multi-threaded performance
Retired features:
- Moving MNIST data reader is replaced by the Python data reader
- ASCII data reader is deprecated
============================== Release Notes: v0.99 ==============================
Support for new training algorithms:
- Improvements to LTFB infrastructure (including transfer of SGD and Adam hyperparameters)
Support for new network structures:
- Support for Wide ResNets
Support for new layers:
Python front-end:
- Python front-end for generating neural network architectures (lbann namespace):
including layers, objective functions, callbacks, metrics, and optimizers.
- Python interface for launching (SLURM or LSF) jobs on HPC systems
- Support for running LBANN experiments and capturing experimental output
- Network templates for AlexNet, LeNet, arbitrary ResNet models, and Wide ResNet models
- Python scripts for LeNet, AlexNet, and (Wide) ResNets in model zoo.
Performance optimizations:
- GPU implementation of RMSprop optimizer.
- cuDNN convolution algorithms are determined by empirically measuring
performance rather than using heuristics.
- Avoid setting up unused bias weights.
- Perform gradient accumulations in-place when possible.
Model portability & usability:
Internal features:
- Weight gradient allreduces are in-place rather than on a staging buffer.
- Fully connected and convolution layers only create bias weights when
needed.
- Optimizer exposes gradient buffers so they can be updated in-place.
- Added callback support to explicitly save model
- Min-max metric for reporting on multiple LTFB trainers
- Cleanup of Hydrogen interface to match Hydrogen v1.2.0
- Added type-erased matrix class for internal refactoring
- Make CUB always log performance critical events
I/O & data readers:
- Python data reader that interacts with an embedded Python session.
- Optimized data store to provide preload option
- Extended data store to operate with Cosmoflow-numpy data reader
Build system:
- Added documentation for how users can use Spack to install LBANN
either directly or via environments.
- Conduit is a required dependency.
- Provided Spack environment for installing LBANN as a user
- Improved documentation on lbann.readthedocs.io
- CMake installs a module file in the installation directory that
sets up PATH and PYTHONPATH variables appropriately
Bug fixes:
- Models can now be copied or setup multiple times.
- Fixed incorrect weight initialization with multiple trainers.
- Updated I/O random number generators to be C++ thread safe (rather than OpenMP)
- Added an I/O random number generator for preprocessing that is independent
of the data sequence RNG.
- Fixed initialization order of RNGs and multiple models / trainers.
- General fixes for I/O and LTFB interaction.
Retired features:
- "Zero" layer (hack for early GAN implementation).
- Removed data reader specific implementations of data store (in favor of Conduit-based
data store)
============================== Release Notes: v0.98.1 ==============================
Bug Fixes:
- Added missing header
============================== Release Notes: v0.98 ==============================
Support for new training algorithms:
- Hyperparameter exploration with Adam optimizers
- LTFB can perform inter-trainer communication via checkpoint files
Support for new network structures:
- Wasserstein autoencoder
Support for new layers:
- Squared difference
- Tessellate
- Clamp
Performance optimizations:
- Added support for node-local batch normalization
Model portability & usability:
- Added prototype Python front end for generating model prototext files
that is inspired by PyTorch's interface
- Created Python library of networks and modules used for prototext
generation
- Support for exporting and importing models in ONNX format
- Output dumping callback exports in CSV, TSV, .npy, or .npz formats
- Added dedicated inference front end
Internal features:
- Expanded layer documentation
- Utility class for nicely formatted descriptions
- Switched to using ReadTheDocs for documentation which uses a
combination of doxygen, breathe, and sphinx
- Provided distinction between trainer and model objects
- Added a generic factory template
- Refactored front-end functionality into library class
I/O & data readers:
- Overhauled the I/O system to use an independent background thread
pool for fetching data
- Added support for data set metadata file that provides both schema
and normalization values unique to a given data set. Demonstrated
use in JAG Conduit data reader.
- Added support for an index list based approach for describing the
samples to use in the training and testing. Note that this is
currently only supported in the JAG Conduit data reader
- Created a general-purpose data store that operates on generic
Conduit node data structures. This should provide an extensible
and generic approach for holding and exchanging data between
epochs. Note that this is currently only supported in the JAG
Conduit data reader.
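As a rough illustration of the background-thread data fetching described above, the following is a minimal sketch in plain Python, not LBANN's implementation. The names `fetch_sample` and `prefetched_batches` are hypothetical, and the "sample" is a made-up dictionary. It assumes the goal is to have the next mini-batch loading on a thread pool while the current one is being consumed.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_sample(index):
    """Hypothetical stand-in for reading and preprocessing one sample."""
    return {"index": index, "data": [index * 0.5] * 4}


def prefetched_batches(indices, batch_size, num_threads=4):
    """Yield mini-batches while the next one is fetched in the background."""
    batches = [indices[i:i + batch_size]
               for i in range(0, len(indices), batch_size)]
    if not batches:
        return
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Start fetching the first batch before any compute happens.
        pending = [pool.submit(fetch_sample, i) for i in batches[0]]
        for next_batch in batches[1:] + [None]:
            # Block only until the current batch is fully loaded.
            current = [f.result() for f in pending]
            # Queue the following batch so it loads while the caller works.
            if next_batch is not None:
                pending = [pool.submit(fetch_sample, i) for i in next_batch]
            else:
                pending = []
            yield current


if __name__ == "__main__":
    for batch in prefetched_batches(list(range(7)), batch_size=3):
        print([sample["index"] for sample in batch])
```

The pool is independent of whatever thread consumes the batches, which is the property the release note highlights; the pool size and batch layout here are arbitrary choices for the sketch.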
Build system:
- Support for using the Spack environments feature when building
Retired features:
- Removed deprecated objective functions and target layer
- Removed distributed I/O buffer layer; it has been superseded by the
background I/O threads
============================== Release Notes: v0.97.1 ==============================
Bug Fixes:
- Removed deprecated header file include
============================== Release Notes: v0.97 ==============================
Support for new layers:
- Mean absolute error and L1 norm
- GPU implementation for activation layers
- Log sigmoid and softsign
- Channel-wise mean (temporary kludge)
Model portability & usability:
- Hints for layer output dimensions
- Confusion matrix callback
- Metric checking callback
Internal features:
- Removed target-layer-based features from model zoo
- Layer unit tests check for expected output values
Retired features:
- Smooth ReLU, bent identity, and swish layers
- Target-layer-based metrics
- Target-layer-based models (sequential, greedy layer-wise autoencoder, Siamese)
============================== Release Notes: v0.96 ==============================
Support for new layers:
- Log softmax
- Basic math functions
- Weights layer, which outputs a weights tensor
- L2 norm squared
- Binary cross entropy loss and sigmoid binary cross entropy loss
- Boolean accuracy, Boolean false negative rate, Boolean false positive rate
- Bilinear resize
- Variance and covariance
- Dilated and grouped convolution (GPU only)
Performance optimizations:
- Optimized GPU model-parallel softmax layer
Model portability & usability:
- Option for weight initialization with user-provided list of values
- Callback to save any layer output as an image
Internal features:
- Provide compile time option to selectively disable OpenMP for data fetching loop
- Thrust calls no longer involve the default CUDA stream
I/O & data readers:
- Reworked jag_conduit data reader:
- Support the updated JAG simulation data output format
- Use direct HDF5 I/O for on-demand data loading with Conduit
- Ingest a unique set of data files per instance
- Allow exclusive data partitioning among multiple trainers
- Multi-channel images
- Normalization of JAG data
- Interface to select images of specific views and time indices
- Interface to describe how to slice JAG data
- Avoid redundant fetching and incoherent random number pulls in the group of local data readers
- Improved threading performance by preallocating scratch space for loading samples
Build system:
- Support cross-compilation configurations in superbuild and SetupProtobuf
============================== Release Notes: v0.95 ==============================
Support for new training algorithms:
- Generative Adversarial Networks (GAN)
Support for new network structures:
- Variational Autoencoders
- GAN
- CycleGAN
- Combined Autoencoders with CycleGAN
- Deep Recurrent Attention Model (DRAM), Ba et al. (2015)
- Video Recurrent Attention Model (VRAM)
Support for new layers:
- Optimized Top-K accuracy (CPU, GPU)
- Crop (CPU, GPU)
- Sort (CPU, GPU), in both ascending and descending order
- Absolute value (CPU, GPU)
- Mean-squared (CPU, GPU)
- Top-K categorical accuracy (CPU, GPU)
- Cross-entropy (CPU, GPU)
- Stop gradient (CPU, GPU)
Performance optimizations:
- Use pinned memory for CPU activation matrices
- Non-blocking GPU computation of objective functions and metrics
- Refactored weight matrices and weight initialization
- Manage GPU workspace buffers with memory pool
- Slice and concatenation layers emit matrix views if possible
- Used more fine-grained asynchronous calls when using Aluminum Library
- Minimized GPU stream synchronization events per call
- Improved / minimized synchronization events when using a single GPU
- Fixed GPU workspace size
- GPU implementation of Adagrad optimizer
- GPU model-parallel softmax
- Optimized local CUDA kernel implementations
- Support for distributed matrices with arbitrary alignment
Model portability & usability:
- Keras to LBANN prototext conversion tool
Internal features:
- Support for multiple objective functions and metrics per network with arbitrary placement
- Objective functions represented as layers
- Metrics represented as layers
- Introduced evaluation layer construct
- Ability to freeze specific layers for pre-training / fine-tuning
- Refactoring tensor setup in setup, forward prop, and back prop
- Layers store matrices in private smart pointers
- Model automatically inserts evaluation layers where needed
- Copy Layer activations between models
- Annotated GPU profiling output with training phases
- Fixed initialization of Comm object and Grid objects when using multiple models
- General code cleanup, refactoring, and various bug fixes.
- All layers overwrite error signal matrices
- NCCL backend is now implemented via Aluminum Library
- MPI calls are routed through the LBANN Comm object into Hydrogen or Aluminum
- Provide runtime statistics summary from every rank
- Reworked LBANN to use Hydrogen to manage GPU memory
- GPU allocations now via CUB memory pool
- Fixed Spack build interaction with Hydrogen Library
I/O & data readers:
- Support for Conduit objects with HDF5 formatting
- In-memory and locally offloaded data store
- Data Store can hold the entire training set in memory (or node-local storage)
- Data store will shuffle data samples between epochs and present samples to input layer
- Updated synthetic data reader
- Modified data readers to handle bad samples in JAG conduit data
- Reworked the I/O layers (input and target) so that the input layer produces both the
sample and label / response if necessary.
- Target layer is being deprecated
- Updated image data reader to use cv::imdecode to accelerate image load times
- Allow users to specify an array of data sources for the independent/dependent
variables via prototext
============================== Release Notes: v0.94 ==============================
Support for new training algorithms:
- Back-Propagation Through Time (BPTT)
-- Recurrent Neural Networks (RNN)
-- Long Short-Term Memories (LSTM)
- Generative Adversarial Networks (GAN)
- Variational autoencoders
- Convolutional autoencoders
- Fine tuning of pretrained networks
-- Flexible weight freezing
- Context-prediction network (Siamese network)
- Livermore Tournament Fast Batch learning (LTFB)
- Variable mini-batch sizes
Support for new network structures:
- Directed Acyclic Graph (DAG) networks
- Residual networks
- Modular and composable objective functions
- Multiple metrics
- Shared weight matrices
- (BETA) New evaluation layer that can be attached to any point of the DAG
- Motifs (compound, reused network patterns)
Support for new layers:
- Learning:
-- Deconvolution
- Metrics:
-- Top K Categorical accuracy, Pearson correlation, Mean absolute deviation
- Loss Functions:
-- Cross Entropy with Uncertainty, Geometric negative log likelihood
-- Poisson Negative log likelihood, Polya Negative Log Likelihood
- Optimizers:
-- Hypergradient Adam
- Transform Layers:
-- Concatenation, Noise, Unpooling, Pooling, Reshape, Slice, Split, Sum
- Regularizer:
-- Batch Normalization, Selu Dropout, Local Response Normalization (LRN)
- Activations:
-- Leaky Relu, Smooth Relu, Elu, Scaled Elu, Softplus, Atan,
-- Bent Identity, Exponential
Performance optimizations:
- GPU acceleration for most layers
- NCCL 2.X
- Optimized communication patterns
- Asynchronous weight updates
- Asynchronous metric and objective function updates
- Batch normalization (global and local)
- L2 normalization
- Adaptive Quantization (inter-model)
Model portability & usability:
- Portable checkpoints / recovery
- Distributed checkpoint / recovery
- Network visualization
- Export LBANN to TensorFlow format
Internal features:
- Gradient checking
- Network representation using tensor dimensions
- Bamboo continuous integration (CI)
- Improved data processing pipeline
New data readers:
- Numpy
- CSV
- Methods for merging multiple features and samples across files
- CANDLE Pilot 2
- CANDLE Pilot 1 Combo
- ICF JAG
Integration with Hydrogen, an optimized distributed, dense linear algebra
library. Hydrogen is a fork of the Elemental library. Hydrogen optimizes for:
distributed matrices with elemental and block distributions, BLAS, LAPACK,
distributed and local matrix management.
Integration with optimized all-reduce communication library Aluminum. Aluminum
provides custom reduction patterns, customized CUDA reduction kernels,
and asynchronous communication operators. It uses MPI, MPI w/GPUdirect, or NCCL
as back-end libraries. Aluminum enables us to effectively use non-blocking
all-reduces during backprop/optimization.
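To make the idea of non-blocking all-reduces during backprop concrete, here is a small simulation in plain Python. It does not use Aluminum, MPI, or NCCL. A thread pool stands in for the communication library, and `allreduce_mean` and `backprop_with_overlap` are hypothetical names. The assumption is that each layer's gradient reduction is started as soon as that layer's gradients exist and is only waited on right before the optimizer step.

```python
from concurrent.futures import ThreadPoolExecutor


def allreduce_mean(per_rank_grads):
    """Stand-in for an all-reduce: average one layer's gradient over ranks."""
    num_ranks = len(per_rank_grads)
    return [sum(vals) / num_ranks for vals in zip(*per_rank_grads)]


def backprop_with_overlap(layer_grads_by_rank):
    """layer_grads_by_rank[layer][rank] is that rank's gradient vector.

    Backprop visits layers from last to first. Each reduction is launched
    without waiting, so it can proceed while earlier layers are processed.
    """
    num_layers = len(layer_grads_by_rank)
    handles = {}
    with ThreadPoolExecutor(max_workers=2) as comm:
        for layer in reversed(range(num_layers)):
            # "Non-blocking" start: submit() returns a handle immediately.
            handles[layer] = comm.submit(allreduce_mean,
                                         layer_grads_by_rank[layer])
            # ... the next layer's backward computation would run here ...
        # Wait for completion only when the optimizer needs the results.
        return [handles[layer].result() for layer in range(num_layers)]


if __name__ == "__main__":
    grads = [
        [[1.0, 2.0], [3.0, 4.0]],  # layer 0: gradients from rank 0 and rank 1
        [[0.0], [2.0]],            # layer 1: gradients from rank 0 and rank 1
    ]
    print(backprop_with_overlap(grads))
```

In a real run the benefit comes from communication proceeding concurrently with GPU compute; the sketch only shows the start-early, wait-late ordering.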
Additionally, we have added support for an online, distributed data store. When
enabled, LBANN is able to ingest all of the training data set in a distributed
manner across all ranks. Each data store is then able to serve its portion of
a mini-batch, dynamically moving data to the necessary ranks in the model (based
on the mini-batch data distribution).
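The data store behavior described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not LBANN's data store: the round-robin ownership rule, the rule that position i of a mini-batch is consumed by rank i modulo the rank count, and the names `owner_of`, `build_stores`, and `serve_minibatch` are all hypothetical.

```python
def owner_of(sample, num_ranks):
    """Assumed ownership rule: samples are dealt round-robin to ranks."""
    return sample % num_ranks


def build_stores(num_samples, num_ranks):
    """Each rank ingests only the samples it owns."""
    stores = [dict() for _ in range(num_ranks)]
    for s in range(num_samples):
        stores[owner_of(s, num_ranks)][s] = f"sample-{s}"
    return stores


def serve_minibatch(stores, minibatch, num_ranks):
    """Route each sample from its owner to the rank that will consume it.

    Assumption: position i of the mini-batch is consumed by rank
    i % num_ranks. Returns the data received by each rank and the list
    of (owner, consumer, sample) transfers that crossed ranks.
    """
    received = [[] for _ in range(num_ranks)]
    moves = []
    for pos, s in enumerate(minibatch):
        owner = owner_of(s, num_ranks)
        consumer = pos % num_ranks
        received[consumer].append(stores[owner][s])
        if owner != consumer:
            moves.append((owner, consumer, s))
    return received, moves


if __name__ == "__main__":
    stores = build_stores(num_samples=8, num_ranks=2)
    received, moves = serve_minibatch(stores, [5, 2, 7, 4], num_ranks=2)
    print(received)
    print(moves)
```

Because the mini-batch is shuffled each epoch, the owner and the consumer of a sample generally differ, which is why samples have to move between ranks at serve time.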
============================== Release Notes: v0.93 ==============================
This release contains a major refactoring / overhaul of the code base.
Key highlights include:
- Moving layer design into smaller simpler layers that have a single
compute behavior per layer. Specifically, linear combination of the
inputs, non-linear activations, and regularizers now exist as their
own layers.
- Layers now have a template parameter that specifies the data layout
for the distributed matrices.
- Prototext interface for specifying neural network models and data
readers is nearly fully functional.
- Code now adheres to internal coding style as outlined in
README_coding_style.txt
- Dead code has been eliminated and layer file hierarchy has been
cleaned up.
============================== Release Notes: v0.92 ==============================
New features include (but are not limited to):
- Full support for convolutional and pooling layers
- GPU acceleration of local Elemental GEMM operations
- Improved network and data reader support
-- Alexnet
-- VGG
-- CIFAR-10
- Added a suite of regularizers, objective functions, and metrics, including:
-- Batch normalization
-- Drop-out
-- L2
- Dramatically improved performance of inter-model communication
- Added suite of image preprocessing routines
============================== Release Notes: v0.91 ==============================
Incorporates a number of changes throughout the LBANN code base. In
particular there is a new build system that tries to have LBANN
download all of the dependencies into its build tree, and compile them
locally. Additional improvements include optimizations in the data
parallel, multiple model training framework, support for convolutional
layers, and general bug fixes.
============================== Release Notes: v0.90 ==============================
Initial release of the LBANN toolkit.