
Support for tritongpu.upcast_mxfp operation #2700

Draft

wants to merge 5 commits into main

Conversation

@etiotto (Contributor) commented Nov 13, 2024

Add initial support for the new tritongpu upcast_mxfp operation.

@etiotto etiotto self-assigned this Nov 13, 2024
@etiotto etiotto linked an issue Nov 13, 2024 that may be closed by this pull request
@etiotto (Contributor, Author) commented Nov 13, 2024

Note: Merging upstream to 1cf7b1b31cde8c62611e421becd4648c7284d76c should make this PR smaller (changes to the NVIDIA and AMD implementations of upcast_mxfp would come in from the merge).

// Convert each value, which is an int8 containing 2 packed mxfp4 values,
// into 2 standalone bf16 values, returned as a pair (high 4 bits, low 4 bits).
std::pair<Value, Value> convertMxfp4x2ToBf16x2(RewriterBase &rewriter,
                                               Location loc, Value v);
@etiotto (Contributor, Author) commented:
Note: This is identical to upstream code as of commit 1cf7b1b
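
For readers unfamiliar with the format: e2m1 (mxfp4) packs a sign bit, a 2-bit exponent (bias 1), and a 1-bit mantissa into each nibble. Below is a minimal scalar sketch of what the conversion amounts to bitwise; it is illustrative only (the real helper emits LLVM IR through the rewriter), and the function names here are hypothetical.

#include <cstdint>
#include <utility>

// Decode one e2m1 nibble into bf16 bits. E2M1: 1 sign, 2 exponent bits
// (bias 1), 1 mantissa bit; bf16: 1 sign, 8 exponent bits (bias 127),
// 7 mantissa bits.
static uint16_t e2m1ToBf16Bits(uint8_t nibble) {
  uint16_t sign = (nibble >> 3) & 0x1;
  uint16_t exp = (nibble >> 1) & 0x3;
  uint16_t man = nibble & 0x1;
  if (exp == 0 && man == 0)
    return sign << 15;                      // +/- zero
  if (exp == 0)
    return (sign << 15) | ((127 - 1) << 7); // subnormal: 0.5 = 2^-1
  // Normal values: rebias the exponent (1 -> 127); the single mantissa bit
  // becomes the top bit of bf16's 7-bit mantissa.
  return (sign << 15) | ((exp - 1 + 127) << 7) | (man << 6);
}

// Unpack an int8 holding two packed mxfp4 values into a
// (high 4 bits, low 4 bits) pair, mirroring the declaration above.
static std::pair<uint16_t, uint16_t> mxfp4x2ToBf16x2(uint8_t v) {
  return {e2m1ToBf16Bits(v >> 4), e2m1ToBf16Bits(v & 0xF)};
}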


return {v0, v1};
}
SmallVector<Value> convertMxfp4x2ToBf16x2(RewriterBase &rewriter, Location loc,
@etiotto (Contributor, Author) commented:
Note: This is identical to upstream code as of commit 1cf7b1b

@@ -19,17 +19,6 @@ using namespace mlir::triton::gpu;

namespace {

Value mxfpScaleBf16(RewriterBase &rewriter, Location loc, Value v,
@etiotto (Contributor, Author) commented:
Note: This is identical to upstream code as of commit 1cf7b1b
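
The mx scale itself is an e8m0 value: a bare 8-bit biased exponent encoding 2^(scale - 127). Since bf16 also uses exponent bias 127, dropping the scale byte into a bf16 exponent field yields exactly that power of two, so the scaling reduces to a single multiply. A minimal scalar sketch of this idea (hypothetical names; the real helper builds LLVM IR):

#include <cstdint>
#include <cstring>

// Reinterpret bf16 bits as a float by widening into the top half of an f32.
static float bf16BitsToFloat(uint16_t b) {
  uint32_t w = uint32_t(b) << 16;
  float f;
  std::memcpy(&f, &w, sizeof f);
  return f;
}

// Scale a bf16 value by an e8m0 scale, i.e. multiply by 2^(scale - 127).
// The scale byte placed in bf16's exponent field is exactly that power of
// two. (0xFF encodes NaN in e8m0; not handled in this sketch.)
static float applyMxScale(uint16_t vBf16, uint8_t scale) {
  uint16_t scaleBf16 = uint16_t(scale) << 7; // exponent-only bf16 number
  return bf16BitsToFloat(vBf16) * bf16BitsToFloat(scaleBf16);
}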

@@ -30,47 +30,6 @@ class UpcastMXFPOpPattern : public ConvertOpToLLVMPattern<UpcastMXFPOp> {
: ConvertOpToLLVMPattern<UpcastMXFPOp>(typeConverter, benefit),
targetInfo(targetInfo) {}

llvm::SmallVector<Value> unpackFP4Elements(Location loc,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 1cf7b1b

@etiotto etiotto marked this pull request as ready for review November 14, 2024 14:12
@whitneywhtsang (Contributor) commented:

> Note: Merging upstream to 1cf7b1b31cde8c62611e421becd4648c7284d76c should make this PR smaller (changes to the NVIDIA and AMD implementations of upcast_mxfp would come in from the merge).

Merging in #2707.

@victor-eds (Contributor) left a comment:

As this is just copying upstream code, LGTM. Since in previous cases we found that relying on logical bitwise operations for this kind of operation was slower, does it make sense to open a ticket to change the code in the future?

auto parentEncoding = oldEncoding.getParent();

// Note: For Intel the dot operands layout's kWidth parameter must
// match the parent's dpas layout opsPerChannel. Given that the kWidth
@chengjunlu (Contributor) commented Nov 18, 2024:

opsPerChannel is defined by the HW DPAS instruction.
I think we should align opsPerChannel with the result scalar type of the UpcastMXFPOp instead of doubling the size:
fp16/bf16 -> opsPerChannel=2

Otherwise the lowering of UpcastMXFPOp might be ambiguous.
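
As an illustration of the proposed rule (a sketch, assuming the usual Intel DPAS convention that a channel is 32 bits wide; the function name is hypothetical):

// Derive opsPerChannel from the scalar type of the UpcastMXFPOp *result*
// (bf16 after the upcast), not from the narrower packed source type.
// A 32-bit DPAS channel holds 32 / bitWidth operands:
// tf32 (32-bit) -> 1, fp16/bf16 (16-bit) -> 2, int8 (8-bit) -> 4.
unsigned opsPerChannelFor(unsigned resultBitWidth) {
  return 32 / resultBitWidth;
}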

if (fpType == ScaleDotElemType::E2M1)
xVals = LLVM::convertMxfp4x2ToBf16x2(rewriter, loc, xVals);

// Each thread owns elements of 4 mxfp vectors so we need 4 scales
@chengjunlu (Contributor) commented Nov 18, 2024:
Better still, we need to make sure that the layout is the one expected for a DotOp with DPAS as its parent, and that the layout conversion from the source operand to the destination operand is supported as well.

@etiotto etiotto marked this pull request as draft November 26, 2024 14:22
Development

Successfully merging this pull request may close these issues.

Implement support for TritonGPU::UpcastMXFPOp for Intel XPU BE