## [3.0.7]

### Changed
- Improved training speed by using `torch.nn.functional.multi_head_attention_forward` for self- and encoder-attention
  during training. This requires reorganizing the parameter layout of the key-value input projections,
  as the current Sockeye attention interleaves them for faster inference (see the layout sketch after this list).
  Attention masks (both source masks and autoregressive masks) need some shape adjustments, as the requirements
  of the fused MHA op differ slightly.
  - Non-interleaved format for the joint key-value input projection parameters: `in_features=hidden, out_features=2*hidden` -> shape `(2*hidden, hidden)`.
  - Interleaved format for the joint key-value input projection stores key and value parameters grouped by heads: shape `((num_heads * 2 * hidden_per_head), hidden)`.
  - Models save and load key-value projection parameters in interleaved format.
  - When `model.training == True`, key-value projection parameters are converted to
    non-interleaved format for `torch.nn.functional.multi_head_attention_forward`.
  - When `model.training == False`, i.e. `model.eval()` is called, key-value projection
    parameters are converted back into interleaved format in place.
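The snippet below is a minimal sketch of the two layouts and the round-trip conversion, assuming the non-interleaved weight stacks the key projection rows on top of the value projection rows and the interleaved layout groups each head's key rows with its value rows. The helper names (`interleave_kv`, `deinterleave_kv`) and the demo call to `torch.nn.functional.multi_head_attention_forward` are illustrative only, not Sockeye's actual implementation:

```python
import torch
import torch.nn.functional as F

def interleave_kv(w_kv: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Hypothetical helper: non-interleaved (2*hidden, hidden) -> interleaved
    ((num_heads * 2 * hidden_per_head), hidden), grouping each head's key rows
    with its value rows. Assumes keys occupy the first `hidden` rows and values
    the last `hidden` rows of the non-interleaved weight."""
    hidden = w_kv.shape[1]
    hidden_per_head = hidden // num_heads
    k, v = w_kv.split(hidden, dim=0)
    k = k.reshape(num_heads, hidden_per_head, hidden)
    v = v.reshape(num_heads, hidden_per_head, hidden)
    return torch.cat([k, v], dim=1).reshape(2 * hidden, hidden)

def deinterleave_kv(w_kv: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Hypothetical inverse of interleave_kv: produces the flat key-over-value
    layout that the fused MHA op expects."""
    hidden = w_kv.shape[1]
    hidden_per_head = hidden // num_heads
    k, v = w_kv.reshape(num_heads, 2 * hidden_per_head, hidden).split(hidden_per_head, dim=1)
    return torch.cat([k.reshape(hidden, hidden), v.reshape(hidden, hidden)], dim=0)

# Round-trip check: de-interleaving then interleaving recovers the original weight.
hidden, num_heads, seq_len, batch = 8, 2, 5, 3
w_kv_interleaved = torch.randn(2 * hidden, hidden)
w_kv_flat = deinterleave_kv(w_kv_interleaved, num_heads)
assert torch.equal(interleave_kv(w_kv_flat, num_heads), w_kv_interleaved)

# Feeding the non-interleaved weights to the fused op during training.
# Inputs are time-major (seq_len, batch, hidden); the autoregressive mask must be
# a (tgt_len, src_len) additive mask with -inf above the diagonal.
x = torch.randn(seq_len, batch, hidden)
w_q = torch.randn(hidden, hidden)
w_out = torch.randn(hidden, hidden)
in_proj_weight = torch.cat([w_q, w_kv_flat], dim=0)  # (3*hidden, hidden): q, k, v rows
causal_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

out, _ = F.multi_head_attention_forward(
    x, x, x, hidden, num_heads,
    in_proj_weight, None,      # fused q/k/v weight, no input-projection bias
    None, None, False, 0.0,    # bias_k, bias_v, add_zero_attn, dropout_p
    w_out, None,               # output projection weight and bias
    training=True, need_weights=False, attn_mask=causal_mask)
assert out.shape == (seq_len, batch, hidden)
```

In Sockeye itself the conversion is triggered by toggling `model.training` (e.g. via `model.eval()`), so the on-disk parameter format stays interleaved.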
## [3.0.6]

### Fixed
- Fixed a checkpoint decoder issue that prevented using `bleu` as `--optimized-metric` for distributed training (#995).