[float8nocompile] Add alternate Triton kernels for FP8 conversion which use atomic_max-based algo instead of reduction-based algo #1071