This repository has been archived by the owner on Feb 2, 2024. It is now read-only.

Initial base for AltFP #2

Open
wants to merge 16 commits into base: main
Changes from 8 commits
67 changes: 67 additions & 0 deletions doc/riscv-bfloat16-appx-rationale.adoc
@@ -0,0 +1,67 @@
[appendix]
[[BFloat16_appx_rationale]]
= Extension Rationale

== Format Rationale
Various choices were made in the RISC-V BFloat16 format and behavior.
Some of these choices are allowed by IEEE-754, while others are deviations
from the standard.

=== Rounding Modes

==== Round to odd
Round to odd is not a '754-supported rounding mode. However, it avoids the double
rounding that can occur when a result is accumulated in a wider format and then
converted to a narrower format before subsequent use.
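
As an illustration only --- not part of the specification --- the narrowing of an FP32
value to BFloat16 with round to odd can be sketched in C as follows. The function name
and the purely bit-level approach are assumptions made for this sketch.

[source,c]
----
#include <stdint.h>
#include <string.h>

// Illustrative sketch: narrow an FP32 value to BFloat16 using round to odd
// ("jamming"): truncate, then force the result's LSB to 1 if any discarded
// bit was set. NaN and infinity inputs are not treated specially here.
static uint16_t f32_to_bf16_rto(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          // reinterpret the FP32 encoding
    uint16_t top = (uint16_t)(bits >> 16);   // keep the 16 most-significant bits
    if (bits & 0xFFFFu)                      // was anything discarded?
        top |= 1u;                           // ...then jam the result's LSB to 1
    return top;
}
----

Round to odd is useful as the rounding mode of an intermediate narrowing step: an
inexact jammed result always has an odd significand, so a subsequent round-to-nearest
step cannot land on a spurious tie, and the two-step result matches a single direct
rounding (given enough intermediate precision).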

==== Round to nearest - even
Round to nearest, ties to even is the default '754 rounding mode. It is unbiased
and minimizes rounding error.
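
For comparison, a common bit-level sketch of narrowing FP32 to BFloat16 with round to
nearest, ties to even (again purely illustrative; NaN inputs would need separate
handling, and the helper name is made up for this sketch):

[source,c]
----
#include <stdint.h>
#include <string.h>

// Illustrative sketch: narrow FP32 to BFloat16 with round-to-nearest-even by
// adding a rounding bias before truncating. NaNs are not handled here.
static uint16_t f32_to_bf16_rne(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint32_t lsb = (bits >> 16) & 1u;   // LSB of the part that will be kept
    bits += 0x7FFFu + lsb;              // bias so that exact ties round to even
    return (uint16_t)(bits >> 16);
}
----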

=== Subnormal Handling

=== NaN handling

=== Zeros and Infinities

== Instruction Rationale

This section contains various rationale, design notes and usage
recommendations for the instructions in the BFloat16 extension.
It also tries to record how the designs of instructions were
derived, or where they were contributed from.

=== Conversion Instructions


The most common and important conversion instructions are between BFloat16 and FP32
(Single Precision).

We chose not to have direct conversions between BFloat16 and other formats, as these
can typically be performed by a combination of existing instructions.

.Notes to software developers
[NOTE,caption="SH"]
====
In some cases, for example converting from FP64 to BFloat16, there can be double rounding.
It is up to software to eliminate such sources of error if this is important to the
application.
====
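
To make the double-rounding hazard concrete, here is one worked example (the value is
constructed purely for illustration). Rounding first to FP32 and then to BFloat16, with
round to nearest, ties to even (RNE) at each step, gives a different answer than a
single direct rounding:

[latexmath]
++++
\begin{aligned}
x &= 1 + 2^{-8} + 2^{-30} && \text{(exactly representable in FP64)}\\
\operatorname{RNE}_{\mathrm{FP32}}(x) &= 1 + 2^{-8} && \text{(the } 2^{-30} \text{ tail is discarded)}\\
\operatorname{RNE}_{\mathrm{BF16}}\bigl(1 + 2^{-8}\bigr) &= 1 && \text{(an exact tie between BF16 neighbours; rounds to even)}\\
\operatorname{RNE}_{\mathrm{BF16}}(x) &= 1 + 2^{-7} && \text{(a single direct rounding rounds up)}
\end{aligned}
++++

Performing the intermediate narrowing with round to odd instead, and only the final
narrowing with RNE, avoids this effect because a jammed intermediate result can never
sit exactly on a BFloat16 tie.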

=== FMA

Fused multiply-add.

=== Dot Product

Is the intention here to support dot product as a packed-SIMD style operation, or an application of FMA?


Upon reading the rest of the spec, it seems like this is intended to be used in packed-SIMD style ops. The spec should probably also define other operations on multiple packed BF16 operands (at least a note on how to load/store them - probably using standard FLW/FLD?) and how this is intended to work in general; I assume it's useful to think about this for all operations, since just using 16 bits of a 32 or 64 bit register seems a bit wasteful.

Author

Yes, the intention is for dot product operations, as that is what BF16 is usually used for. It was requested that we start with the base and then we can move on to operations. I anticipate that these operations will be most useful in Vector.


Somewhat inaptly named, yet very useful instructions.
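
As a rough sketch of the packed, widening dot-product style of operation discussed in
the conversation above (illustrative only: how BF16 operands are packed, the helper
names, and the rounding of the intermediate sums are all assumptions of this sketch,
not definitions made by this draft):

[source,c]
----
#include <stdint.h>
#include <string.h>

// Widen a BFloat16 encoding to FP32 exactly: BF16 is the upper half of FP32.
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

// Illustrative semantics of a two-element widening dot product:
// acc + a0*b0 + a1*b1. Each BF16 x BF16 product fits exactly in an FP32
// significand (ignoring overflow/underflow at the exponent extremes);
// here the two additions each round in FP32.
static float bf16_dot2_acc(uint32_t a_pair, uint32_t b_pair, float acc)
{
    float a0 = bf16_to_f32((uint16_t)(a_pair & 0xFFFFu));
    float a1 = bf16_to_f32((uint16_t)(a_pair >> 16));
    float b0 = bf16_to_f32((uint16_t)(b_pair & 0xFFFFu));
    float b1 = bf16_to_f32((uint16_t)(b_pair >> 16));
    return acc + a0 * b0 + a1 * b1;
}
----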


.Notes to software developers
[NOTE,caption="SH"]
====
Significant speedup

E Pluribus Unum
====

47 changes: 47 additions & 0 deletions doc/riscv-bfloat16-audience.adoc
@@ -0,0 +1,47 @@
[[crypto_scalar_audience]]
=== Intended Audience
THIS IS VERY PRELIMINARY - TO BE UPDATED

Floating-point arithmetic is a specialised subject, requiring people with many different
backgrounds to cooperate in its correct and efficient implementation.
Where possible, we have written this specification to be understandable by
all, though we recognize that the motivations and references to
algorithms or other specifications and standards may be unfamiliar to those
who are not domain experts.

This specification anticipates being read and acted on by various people
with different backgrounds.
We have tried to capture these backgrounds
here, with a brief explanation of what we expect them to know, and how
it relates to the specification.
We hope this aids people's understanding of which aspects of the specification
are particularly relevant to them, and which they may (safely!) ignore or
pass to a colleague.

Software developers::
These are the people we expect to write code using the instructions
in this specification.
The motivations for the instructions we include should be fairly obvious to them,
and they should be familiar with most of the algorithms and outside standards to
which we refer.

Computer architects::
We expect architects to have a floating-point background.
We also expect architects to be able to examine our instructions
for implementation issues, understand how the instructions will be used
in context, and advise on how best to fit the required functionality
to the ISA interface.

Digital design engineers & micro-architects::
These are the people who will implement the specification inside a
core. Floating-point expertise is assumed as not all of the corner
cases are pointed out in the specification.

Verification engineers::
Responsible for ensuring the correct implementation of the extension
in hardware.


These are by no means the only people concerned with the specification,
but they are the ones we considered most while writing it.

73 changes: 73 additions & 0 deletions doc/riscv-bfloat16-format.adoc
@@ -0,0 +1,73 @@
[[bfloat16_format]]
== BFloat16 Operand Format

BFloat16 bits::
[wavedrom, , svg]
....
{reg:[
{bits: 7, name: 'frac'},
{bits: 8, name: 'expo'},
{bits: 1, name: 'S'},
]}
....

IEEE Compliance: While BFloat16 (BF16) is not an IEEE-754 _standard_ format, it is a valid floating point format as defined by the standard. There are three parameters that specify a format: radix (b), number of digits in the significand (p), and maximum exponent (emax).
For BF16 these values are:

[%autowidth]
.BFloat16 parameters
|===
|radix (b)|2
|significand (p)|8
|emax|127
|===


.Obligatory Floating Point Format Table
[cols = "1,1,1,1,1,1,1,1"]
|===
|Format|Sign Bits|Expo Bits|Fraction Bits|Padded 0s|Encoding Bits|Expo Max/Bias|Expo Min

|FP16    |1| 5|10| 0|16|  15| -14
|BFloat16|1| 8| 7| 0|16| 127|-126

I think you got these switched around: BF16 is the same as FP32, but with 16 fewer fraction bits: https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus

Author

You are correct. Somehow I swapped them.

|TF32 |1| 8|10|13|32| 127|-126
|FP32    |1| 8|23| 0|32| 127|-126

FP32 only has 8 exponent bits, not 9 - as it's written now the sum of bits would be 33.
There is an implied 1 bit in there so technically you get the effect of 33 bits, but that goes into the fraction 🙂

Author

Thanks for catching the typo. I will fix.

|FP64 |1|11|52| 0|64|1023|-1022
|FP128 |1|15|112|0|128|16,383|-16,382
|===
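
As a small, purely illustrative decoder of the BFloat16 fields summarized above (1 sign
bit, 8 exponent bits with bias 127, 7 fraction bits); the function name and output
format are made up for this sketch:

[source,c]
----
#include <stdint.h>
#include <stdio.h>

// Illustrative decoder for a BFloat16 encoding.
static void bf16_decode(uint16_t h)
{
    unsigned sign = h >> 15;           // 1 sign bit
    unsigned expo = (h >> 7) & 0xFFu;  // 8 exponent bits, bias 127
    unsigned frac = h & 0x7Fu;         // 7 fraction bits

    if (expo == 0xFFu)
        printf("%s%s\n", sign ? "-" : "+", frac ? "NaN" : "infinity");
    else if (expo == 0)
        printf("%s(%u/128) * 2^-126 (subnormal or zero)\n",
               sign ? "-" : "+", frac);
    else
        printf("%s(1 + %u/128) * 2^%d\n",
               sign ? "-" : "+", frac, (int)expo - 127);
}
----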

== BFloat16 behaviors

=== Subnormal Numbers:
Floating-point values that are too small to be represented as normal numbers, but can still be represented by using the format's smallest exponent with a zero integer bit and one or more leading 0s --- and one or
more 1s --- in the trailing fractional bits, are called subnormal numbers. Basically, the idea is that there is


Author

Thanks for catching. I fixed this but hadn't added it before committing. The latest pull request should look better (for this file anyway).


No problem, I am trying to do a full review and listing typos / questions along the way.

a trade-off of precision to support _gradual underflow_.

In RISC-V BFloat16, all subnormal BFloat16 inputs are treated as zero and subnormal outputs are flushed to zero. The sign of the original value is retained. This is not consistent with '754, but it has been found to be a suitable alternative in many workloads. Furthermore, with BFloat16's relatively large exponent range, subnormals add little value.
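
A sketch of the input flushing described above (illustrative only; real hardware would
do this as part of operand unpacking, and the helper name is an assumption):

[source,c]
----
#include <stdint.h>

// Illustrative sketch: treat a subnormal BFloat16 input as zero, keeping the
// original sign. A BF16 encoding is subnormal when its exponent field is zero
// and its fraction field is non-zero.
static uint16_t bf16_flush_subnormal_input(uint16_t h)
{
    unsigned expo = (h >> 7) & 0xFFu;
    unsigned frac = h & 0x7Fu;
    if (expo == 0 && frac != 0)
        return h & 0x8000u;   // signed zero: keep only the sign bit
    return h;
}
----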


=== Infinities:
Infinities are used to represent values that are too large to be represented by the target format. These are usually produced as a result of overflows (depending on the rounding mode), but can also be provided as inputs. Infinities have a sign associated with them: there are positive infinities and negative infinities.


Infinities are important for keeping meaningless results from being operated upon.

=== NaNs

NaN stands for Not a Number. These are provided as the result of an operation when it cannot be represented
as a number or infinity. For example, taking the square root of -1 will result in a NaN because
there is no real number that can represent the result. NaNs can also be used as inputs.

There are two types of NaNs: signalling and quiet. Signalling NaNs are provided as input data, since no computational instruction will ever produce this kind of NaN. Operating on a signalling NaN raises the invalid operation exception. Operating on a quiet NaN usually does not cause an exception.

NaNs include a sign bit, but the bit has no meaning.

NaNs are important for keeping meaningless results from being operated upon. It is best to retain them. As IEEE allows, operations should return the canonical NaN rather than be required to propagate the payload.
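
A sketch of the NaN handling described above (illustrative only; the canonical quiet
NaN encoding shown, 0x7FC0, is assumed here as the BFloat16 analogue of the FP32
canonical NaN 0x7FC00000):

[source,c]
----
#include <stdbool.h>
#include <stdint.h>

// A BF16 encoding is a NaN when the exponent field is all ones and the
// fraction field is non-zero; it is signalling when the fraction MSB is 0.
static bool bf16_is_nan(uint16_t h)
{
    return ((h >> 7) & 0xFFu) == 0xFFu && (h & 0x7Fu) != 0;
}

static bool bf16_is_signalling_nan(uint16_t h)
{
    return bf16_is_nan(h) && (h & 0x40u) == 0;
}

// Illustrative canonicalization: any NaN result is returned as the assumed
// canonical quiet NaN 0x7FC0, i.e. the payload is not propagated.
static uint16_t bf16_canonicalize(uint16_t h)
{
    return bf16_is_nan(h) ? 0x7FC0u : h;
}
----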

=== Rounding Modes:
In general, the default IEEE rounding mode (round to nearest, ties to even) works for arithmetic cases. There are some special cases where a particular instruction benefits from a different rounding mode (e.g., convert to integer, widening multiply-accumulate) - we can address this on those specific instructions.

Comment on lines +78 to +79


Does this mean we intend to have a static rounding mode forced to RNE by default and only allow static (opcode) or dynamic (CSR) rounding-mode selection on a specific subset of instructions? This seems to be in contradiction with the F and D extensions and should be justified here IMHO.

Author

Yes, these instructions would have a static rounding mode that is not overridable. Yes, this is different from '754. However, it is a common simplification (just like flushing subnormals). If someone needs more control over the rounding mode they can run in SP (F).

I agree that we will need to provide a detailed justification in the specification for this simplification.

=== Handling exceptions
Default exception handling, as defined by IEEE, is a simple and effective approach to producing results in exceptional cases. For the coder to be able to see what has happened, and take further action if needed, the BFloat16 instructions need to set floating-point exception flags the same way as all other floating-point instructions in RISC-V.
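
In practical terms, nothing traps under default exception handling; software inspects
the accrued (sticky) flags afterwards. A minimal C sketch using the standard fenv
interface (an FP32 operation stands in here, since BFloat16 operations are not directly
expressible in portable C):

[source,c]
----
#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    feclearexcept(FE_ALL_EXCEPT);

    volatile float x = -1.0f;
    volatile float r = sqrtf(x);   // invalid operation: produces a quiet NaN
    (void)r;

    // The sticky flag records what happened; software decides whether to act.
    if (fetestexcept(FE_INVALID))
        printf("invalid-operation flag is set\n");
    return 0;
}
----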


This formulation may not be future proof, we may want to cite explicitly the basic floating-point extensions here.

Author

Which area are you concerned about: the rounding mode, default exception handling, or both?

Should the need arise, an extension could be added that allows the rounding mode to be changed by the CSR.

The handling of exceptions via the IEEE default is common across RISC-V. Is this what you mean about citing the basic floating-point extensions?
At some point there might be a TG that creates trapped exceptions for FP instructions - but right now only default is supported.


I am concerned that another floating-point extension may be introduced with a different way of managing FP exception flags, making "as all other floating-point instructions ..." misleading. So mentioning explicitly that we intend to manage them as extensions F and D (and Q) do clarifies things if such an extension should appear at some point. I agree that the use of such a remark may be limited.


45 changes: 45 additions & 0 deletions doc/riscv-bfloat16-introduction.adoc
@@ -0,0 +1,45 @@
[[BFloat16_introduction]]
== Introduction

When FP16 (officially called binary16) was first introduced by the IEEE-754 standard,
it was just an interchange format. It was intended as a space/bandwidth efficient
encoding that would be used to transfer information. This is in line with the Zfhmin
proposed extension.

However, some applications (notably graphics) found that the smaller
precision and dynamic range were sufficient for their domain. So, FP16 started to see
widespread adoption as an arithmetic format. This is in line with the Zfh
proposed extension.

While it was not the intention of '754 to have FP16 be an arithmetic format, it is
supported by the standard. Even though the '754 committee recognized that FP16 was
gaining popularity, the committee decided to hold off on making it a basic format
in the 2019 release. This means that a '754 compliant implementation of binary
floating point, which needs to support at least one basic format, cannot support
only FP16 - it needs to support at least one of binary32, binary64, and binary128.

Experts working in machine learning noticed that FP16 was a much more compact way of
storing operands and often provided sufficient precision for them. However, they also
found that results were more accurate when intermediate values were accumulated in a higher precision.


Does "better" mean more accurate here?

Author

Yes, it does. I will clarify and elaborate (a little). Also, thanks for catching the typos.

The final computations were then typically converted back into the more compact FP16
encoding. This approach has become very common in inference, where the weights and
activations are stored in FP16 encodings. There was the added benefit that smaller
multipliers could be built, thanks to FP16's smaller number of significand bits. At this
point, widening multiply-accumulate instructions became much more common. Also, more
complicated dot-product instructions started to appear, including those that store two
FP16 numbers in a 32-bit register, multiply them by another pair of FP16 numbers in
another register, add the two products to an FP32 accumulator value in a third register,
and return an FP32 result.

Experts working in machine learning at Google who continued to work with FP32 values
noted that the least significant 16 bits of their significands were not always needed
for good results, even in training. They proposed a truncated version of FP32, which was
the 16 most significant bits of the FP32 encoding. This format was named BFloat16
(or BF16). The B in BFloat16 stands for Brain. Not only did they find that the number of
significant bits in BF16 tended to be sufficient for their work (despite being fewer than
in FP16), but it was very easy for them to reuse their existing data; FP32 numbers could
be readily rounded to BF16 with a minimal amount of work. Furthermore, the even smaller
number of BF16 significand bits enabled even smaller multipliers to be built. As with
FP16, widening multiply-accumulate and dot-product BF16 instructions started to
proliferate.

15 changes: 15 additions & 0 deletions doc/riscv-bfloat16-policies.adoc
@@ -0,0 +1,15 @@
[[crypto_scalar_policies]]
=== Policies

In creating this proposal, we tried to adhere to the following
policies:

* Provide a RISC-V BFloat16 definition that makes sense for how we expect
these operands to be used in real applications.
* Provide the basic instructions that allow implementations to leverage the
benefits of the BFloat16 format +
** reduced storage space - A BFloat16 operand consumes half the space of an FP32 operand +
** higher effective storage bandwidth - Two BFloat16 operands can be transferred at the same rate as one FP32 operand +
** higher computational throughput - Two BFloat16 multiplies can be performed with less logic than one FP32 multiply +


We could even add that one BFloat16 multiply can be done with less logic than one FP16 multiply (mostly due to multiplier area gains).

* Provide consistency with other approaches when this doesn't interfere with
the above.