Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize MaxAndArgmax to all Commutative Operations and Datatypes and all Destination Tensor Sizes #334

Open
wants to merge 34 commits into
base: master
Choose a base branch
from

Conversation

obilaniu
Copy link
Contributor

@obilaniu obilaniu commented Jan 25, 2017

Added multiple new reduction ops, enumerated in enum ga_reduce_op, and also heavily refactored internally in preparation to add a generator for "small code model" kernels, which will be used for reductions with a small destination.

typedef enum _ga_reduce_op {
	GA_REDUCE_SUM,             /*        +        */
	GA_REDUCE_PROD,            /*        *        */
	GA_REDUCE_PRODNZ,          /*        * (!=0)  */
	GA_REDUCE_MIN,             /*      min()      */
	GA_REDUCE_MAX,             /*      max()      */
	GA_REDUCE_ARGMIN,          /*     argmin()    */
	GA_REDUCE_ARGMAX,          /*     argmax()    */
	GA_REDUCE_MINANDARGMIN,    /* min(), argmin() */
	GA_REDUCE_MAXANDARGMAX,    /* max(), argmax() */
	GA_REDUCE_AND,             /*        &        */
	GA_REDUCE_OR,              /*        |        */
	GA_REDUCE_XOR,             /*        ^        */
	GA_REDUCE_ALL,             /*     &&/all()    */
	GA_REDUCE_ANY,             /*     ||/any()    */
} ga_reduce_op;

@obilaniu obilaniu force-pushed the smallredux branch 2 times, most recently from 3f9b8c4 to d80fc0e Compare January 25, 2017 10:16
Copy link
Member

@abergeron abergeron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this is WIP, but I still have a couple of nits I identified.

GA_REDUCE_XOR, /* ^ */
GA_REDUCE_ALL, /* &&/all() */
GA_REDUCE_ANY, /* ||/any() */
} ga_reduce_op;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably go in a reduction.h header.





Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this should also probably go into a reduction.h header.

strb_appendf(&ctx->s, "typedef %s S;/* The type of the source array. */\n", ctx->srcTypeStr);
strb_appendf(&ctx->s, "typedef %s T;/* The type of the destination array. */\n", ctx->dstTypeStr);
strb_appendf(&ctx->s, "typedef %s A;/* The type of the destination argument array. */\n", ctx->dstArgTypeStr);
strb_appendf(&ctx->s, "typedef %s X;/* The type of the indices: signed 32/64-bit. */\n", ctx->idxTypeStr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There has been new development that makes the kernel compile/call code responsible for handling this.

You can always use ga_size for this and everything will work out.

strb_appends(&ctx->s, "\n");
strb_appends(&ctx->s, "\n");
strb_appends(&ctx->s, "\n");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are using nvrtc there cannot be any includes except for explicitly supported ones (and non are at the moment).

So this function is probably useless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discovered that by myself, but forgot to nuke this function.

*/

static int reduxGetSumInit (int typecode, const char** property){
if(typecode == GA_POINTER ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STYLE: Space between if and (

ctx->blockSize [0] = ctx->blockSize [1] = ctx->blockSize [2] = 1;
ctx->gridSize [0] = ctx->gridSize [1] = ctx->gridSize [2] = 1;
ctx->chunkSize [0] = ctx->chunkSize [1] = ctx->chunkSize [2] = 1;
for(i=0;i<MAX_HW_DIMS;i++){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STYLE: space between for and (

@@ -256,113 +801,492 @@ static int maxandargmaxCheckargs (maxandargmax_ctx* ctx){
ctx->nds = ctx->src->nd;
ctx->ndr = ctx->reduxLen;
ctx->ndd = ctx->nds - ctx->ndr;
strb_ensure(&ctx->s, 5*1024);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you preallocate so large a buffer? Is the kernel you're generating really 5kb?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the kernel really is ~5KB for the 3D testcases.


return ctx->ret;

return reduxSelectModel(ctx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that it is convenient to do tail calls everywhere like this, but C doesn't have tail-call optimization (or at least not all the time).

It also feels inappropriate that calling reduxCheckArgs ends up compiling the reduction kernel and calling it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my previous version I tried to do all of the error handling in the highest-level function. That doesn't really scale as I add more and more and deeper functions, since each such function has to be able to detect the condition and return, and all its callers have to also detect a condition, abort and pass on the error until the highest level.

So I instead arranged the code in CPS style, and every function can abort the entire GpuArray_reduction() by tail-calling reduxCleanup() with whichever is the correct GA_ERROR_CODE as argument. A nice way of having clean "exception" handling in C.

CPS is somewhat unidiomatic C, but GCC does optimize my tailcalls from callq to jmpq in Release mode, or inlines them whole. And there is a net readability win from all the error handling code I've eliminated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then at least rename the functions so that it is clear that they do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abergeron Alright, how shall I call them then? reduxDoXYZ()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. Something like reduxCheckArgsAndRest()? I don't like it very much ...

static void reduxAppendFuncGetInitVal (redux_ctx* ctx){
strb_appends(&ctx->s, "/**\n");
strb_appends(&ctx->s, " * Initial value function.\n");
strb_appends(&ctx->s, " */\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know it outputting comments in a generated kernel is useful at all. Perhaps you can keep this as an actual comment right here.

strb_appends(&ctx->s, "}\n");
strb_appends(&ctx->s, "\n");
strb_appends(&ctx->s, "\n");
strb_appends(&ctx->s, "\n");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this a single string. Don't use an append per line, it is mostly inefficient.

@obilaniu
Copy link
Contributor Author

@abergeron I've cleaned up the code given your feedback. It's still a WIP and has numerous bugs. However, I'm aware of most of what remains to be done.


static inline void strb_init(strb* sb){
const strb s = STRB_STATIC_INIT;
*sb = s;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also be a memset(b, 0, sizeof(strb)). We can assume that it is ok at this level.

Copy link
Contributor Author

@obilaniu obilaniu Jan 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abergeron I think it would be wise to keep it as is, Don't Repeat Yourself and all that. If STRB_STATIC_INIT changes in such a way that memset() no longer suffices, this function will continue working. Otherwise there's two different places that must be changed, and one can forget one or the other.

At any rate, for the present strb, the compiler can and will optimize memset() to just

xor eax, eax
movq [sb   ], rax
movq [sb+ 8], rax
movq [sb+16], rax
retq

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it optimize the current form too?

Copy link
Contributor Author

@obilaniu obilaniu Jan 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abergeron As a matter of fact, a Release-mode build with GCC inlined strb_init(), then recognized that strb, along with everything else I set to zero explicitly in reduxCheckArgs(), could be memsetted to 0:

Dump of assembler code for function GpuArray_reduction:
   0x00007ffff74f5300 <+0>:     push   %r15
   0x00007ffff74f5302 <+2>:     xor    %eax,%eax           # Zero RAX
   0x00007ffff74f5304 <+4>:     push   %r14
   0x00007ffff74f5306 <+6>:     push   %r13
   0x00007ffff74f5308 <+8>:     push   %r12
   0x00007ffff74f530a <+10>:    mov    %rcx,%r12
   0x00007ffff74f530d <+13>:    mov    $0x2f,%ecx          # How many words to clear
   0x00007ffff74f5312 <+18>:    push   %rbp
   0x00007ffff74f5313 <+19>:    mov    %edi,%ebp
   0x00007ffff74f5315 <+21>:    push   %rbx
   0x00007ffff74f5316 <+22>:    sub    $0x1b8,%rsp
   0x00007ffff74f531d <+29>:    test   %r12,%r12
   0x00007ffff74f5320 <+32>:    lea    0x30(%rsp),%rdi     # Base pointer for memset
   0x00007ffff74f5325 <+37>:    lea    0x30(%rsp),%rbx
   0x00007ffff74f532a <+42>:    rep stos %rax,%es:(%rdi)   # Memset to 0.
   0x00007ffff74f532d <+45>:    mov    %ebp,0x30(%rsp)
   0x00007ffff74f5331 <+49>:    mov    %rsi,0x38(%rsp)
   0x00007ffff74f5336 <+54>:    mov    %rdx,0x40(%rsp)
   0x00007ffff74f533b <+59>:    mov    %r12,0x48(%rsp)
   0x00007ffff74f5340 <+64>:    mov    %r8d,0x50(%rsp)
   0x00007ffff74f5345 <+69>:    mov    %r9,0x58(%rsp)
   0x00007ffff74f534a <+74>:    movq   $0x1,0x138(%rsp)
   0x00007ffff74f5356 <+86>:    movq   $0x1,0x150(%rsp)
   0x00007ffff74f5362 <+98>:    movq   $0x1,0x168(%rsp)
   0x00007ffff74f536e <+110>:   movq   $0x1,0x140(%rsp)
   0x00007ffff74f537a <+122>:   movq   $0x1,0x158(%rsp)
   0x00007ffff74f5386 <+134>:   movq   $0x1,0x170(%rsp)
   0x00007ffff74f5392 <+146>:   movq   $0x1,0x148(%rsp)
   0x00007ffff74f539e <+158>:   movq   $0x1,0x160(%rsp)
   0x00007ffff74f53aa <+170>:   movq   $0x1,0x178(%rsp)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm good.

@abergeron
Copy link
Member

Is there anything missing from this?

@obilaniu
Copy link
Contributor Author

@abergeron Yes; The codegen breaks somewhat at random depending on the kernel's required list of arguments and the tensor dimensionalities involved. The number of cases is large. Where it happens is in cases like generating a , to separate arguments to a generated function, but a part of the argument list is empty (e.g. because the dst tensor is 0-dim, or the kernel doesn't need a dstArg argument, ...), so you get consecutive commas and an angry runtime-compile error.

@abergeron
Copy link
Member

Ok

@obilaniu obilaniu force-pushed the smallredux branch 2 times, most recently from ca571e4 to 51c559d Compare March 3, 2017 18:07
@obilaniu
Copy link
Contributor Author

obilaniu commented Mar 5, 2017

@abergeron @nouiz You're going to be very happy:

  • ALL reductions are now implemented, using the large code model.
  • ALL testcases pass.
  • The code is a bit cleaner after heavy-duty refactoring of the code-gen functions,
  • .. which required that I add a new utility util/srcgen.h to make printing lists of arguments or summands easier,
  • ... which required that I add strb_appendv(), which is to strb_appendf() as vfprintf() is to fprintf().

Now, the Jenkins bot apparently has an older libcheck than Travis, because I'm using ck_assert_double_eq_tol() and Jenkins vomits all over that while everything is just fine with Travis.

In several places there are notices /* BUG: Small code model not implemented */. At those places (currently unreachable) I must add code-gen for the small code model. This will require great care and some help from the CLUDA language in providing atomic*() portably.

Similarly, for the pre-scalar/post-scalar op fusions there are notices at the appropriate places for where they will hook in.

Copy link
Member

@abergeron abergeron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some style changes and a few nits but overall ok.

I didn't run the tests though.



/* Enumerations */
enum srcb_state{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space before {.



/* Functions */
static inline void srcbInit (srcb* s, strb* sb){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No spaces between name and ( and space before {.

s->empty = empty;
}
static inline void srcbEndList(srcb* s){
if(s->numElems == 0){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space after if and space before {.

* Initialize at runtime an strb.
*/

static inline void strb_init(strb* sb){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space before {.

#endif
va_end(ap);

va_end(apSave);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the va_end inside the else branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The va_end() is in the correct place.

  • va_copy()ing a va_list is equivalent to va_start()ing it followed by an equivalent number of va_arg() calls.
  • Every va_start() must be matched by a va_end().
  • va_copy() doesn't exist on older Visual Studios, but in the Visual Studios where it's implemented it's implemented as just a straight assignment.

Therefore va_end() should be unconditional or else it's UB (although on x86-64 it's actually a no-op)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is technically UB to just assign and reuse va_lists like you did.

If we start from the assumption that a va_list is just a pointer on MSVC and we can just copy it then let's see it through and not use va_end either since it doesn't matter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not fixed.

GPUARRAY_PUBLIC int GpuArray_any (GpuArray* dst,
const GpuArray* src,
unsigned reduxLen,
const unsigned* reduxList);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these functions could (and probably should) be macros.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abergeron One thing that I've really liked about the current design is that I can breakpoint one of these functions without breakpointing on any others. It was very helpful when debugging using the test-suite.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true, but it costs in code size and number of exported symbols.

For debugging you can always use conditional breakpoints on the op to only stop for those you are interested in. I don't really mind making debugging a bit harder here.


static int reduxGetSumInit (int typecode, const char** property){
if (typecode == GA_POINTER ||
typecode == GA_BUFFER){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use typecode < 0. The "special" values will always be below 0.

return GA_UNSUPPORTED_ERROR;
}

return GA_NO_ERROR;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that an additional property on the types for min/max values would be better than this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abergeron Doable by hacking the Python generator script a bit.

@obilaniu obilaniu force-pushed the smallredux branch 2 times, most recently from 061c174 to 4188424 Compare May 30, 2017 14:52
@obilaniu obilaniu force-pushed the smallredux branch 2 times, most recently from 52d8d6e to 2c4a3cb Compare June 12, 2017 16:17
@obilaniu
Copy link
Contributor Author

@abergeron @nouiz @lamblin Great success.

Code now passes all tests (apparently deterministically, since no failures in multiple tries) on my machine and leto13. But still waiting for Travis results to triple-check.

Key points:

  • My kernels are fully deterministic given:
    • An identical number of SMs
    • Identical max local workgroup size
    • Type, dimensions and strides of all input/output tensors.
  • For all-reduction of a 100M-element float32 vector to a scalar:
    • Global memory bandwidth:
      • On my laptop's GTX 765M, I achieve 86% of the max in-practice bandwidth (~46.6 GB/s).
      • cuDNN does get about 100% though (53 GB/s).
      • Apart from this best-case, Nvidia Visual Profiler reports that a dozen of my generated kernels get in excess of 20 GB/s, most get > 10 GB/s and a few stragglers get > 3 GB/s.
    • Workspace:
      • For that same reduction, however, I use exactly 2*sizeof(TK0)*BlocksSpawned*D bytes of memory, where, in this case, TK0 is float32, BlocksSpawned is 64*NUMBER_OF_SMs and D is the number of writebacks per block, which in this case is 1. On a GTX765M, this results in 256 blocks of 1024/4=256 threads being spawned, and the workspace size is just 2KB. My workspace size is thus O(k) in the number k of simultaneously-executing threads on the GPU.
      • But cuDNN requests 400MB of workspace for the same reduction.
  • cuDNN is limited to tensors with no more than 2**31 - 1 elements and strides (measured in elements) representable in a 32-bit signed integer. I am almost completely unconstrained; Practically the only assumption I make is that the tensor will have at most one axis of length > 2**31 - 1, while the strides (measured in bytes) can be 64-bit integers.
  • I enforce a max block size 4 times smaller than the maximum allowed since that gives better performance.
  • You may customize almost all types used in my kernel through a new GpuReductionAttr API.
  • It is in principle possible to add arbitrary pre/post scalar operations. In fact, my code is capable of generalizing GpuElemwise.

@obilaniu obilaniu changed the title Generalize MaxAndArgmax to multiple ops and prepare groundwork for small-destination reductions Generalize MaxAndArgmax to all Commutative Operations and Datatypes and all Destination Tensor Sizes Jul 14, 2017
@obilaniu obilaniu force-pushed the smallredux branch 5 times, most recently from 6c2abc8 to 6d9e62d Compare July 25, 2017 16:23
#endif
va_end(ap);

va_end(apSave);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not fixed.

*
* It is assumed that at most one axis will ever be of length > 2**31-1. The
* assumption is believed safe because no GPU or similar accelerator presently
* on Earth has the capacity to store or process 2**62-element tensors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is true, what about the counterexample of an array with super-large broadcasted dimensions? Or is there special code to handle that?

Copy link
Contributor Author

@obilaniu obilaniu Jul 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abergeron When a user broadcasts a tensor on two or more axes, each of which are of length >= 2^31, he implicitly accepts a computational cost of 2^62 FLOP's. The most powerful GPUs on Earth right now have a throughput of 1e10 float32 additions or multiplications per second, and so would chug through such a tensor of 2^62 elements in a mere 15 years. I'm unaware of a realistic environment and usecase for a 15-year computation.

int reduxLen;
const int* reduxList;
GpuReductionAttr grAttr;
gpucontext* gpuCtx;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already have a gpucontext in GpuReductionAttr

/* Source code Generator. */
strb s;
srcb srcGen;
char kName[256];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need such a large name? Also, why don't you just use a fixed name?

srcb srcGen;
char kName[256];
char* kSourceCode;
size_t kSourceCodeLen;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need those two members, they are available from either the strb or srcb above.

/* Workspace */
if (reduxGenKernelRequiresWspace(gr)){
fn(gr, GA_BUFFER, "GLOBAL_MEM char* restrict", "W", 0, user);
if (reduxGenKern
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what is the type of B (because the kernel code is impossible to follow, but since left is 64 bits this will always do a 64-bit modulo. I don't know how often this code is run, but using GA_SIZE would allow this to be 32 bit for smaller arrays.

/* Workspace */
if (reduxGenKernelRequiresWspace(gr)){
fn(gr, GA_BUFFER, "GLOBAL_MEM char* restrict", "W", 0, user);
if (reduxGenKern
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These pointers need to be tagged with their memory space.

/* Workspace */
if (reduxGenKernelRequiresWspace(gr)){
fn(gr, GA_BUFFER, "GLOBAL_MEM char* restrict", "W", 0, user);
if (reduxGenKern
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of printing to the output, you should set the error message with the new error API.

/* Workspace */
if (reduxGenKernelRequiresWspace(gr)){
fn(gr, GA_BUFFER, "GLOBAL_MEM char* restrict", "W", 0, user);
if (reduxGenKern
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is horribly fragile code. How is this a good idea?

/* Workspace */
if (reduxGenKernelRequiresWspace(gr)){
fn(gr, GA_BUFFER, "GLOBAL_MEM char* restrict", "W", 0, user);
if (reduxGenKern
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you have function aliases like this? Can't you just reuse the wrapped function?

It allows initializing at runtime an strb. This can't always be done at
compile-time, for instance if it is dynamically allocated.
All tests pass, but currently the codegen is locked to the large code
model (the small code model has most of the groundwork laid down but has
several extra complexities which haven't yet been implemented, like
atomic reduction operators.
Clang and MSVC correctly recognize that all paths to the allegedly-
uninitialized variables are, in fact, dominated by their initialization.
40% of tests still failing, and the code has a wierd smell to it that I
really don't appreciate.
All tests now pass except summation, which fails to meet tolerance.
- Subtract 0.5 from random numbers, so they sum to 0 in expectation.
- Increase tolerance from 1e-5 to 1e-4 just for summation.
Now, all the veryhighrank tests pass and the others fail for an unknown
reason.
They are overkill but seem to fix the problems with the testcases, at
least so far.
There is now not a single -Wdeclaration-after-statement warning
origination in that file.
@obilaniu
Copy link
Contributor Author

@abergeron In case this interests you I rebased against master, so there are now no conflicts and the codebase now builds, runs and passes tests again.

On Jenkins I see that for some reason or other ck_assert_ptr_nonnull() no longer works properly. It did use to, AFAIK. Was libcheck downgraded? For now I simply re-#define it to another libcheck API, but I'm still surprised.

Given your fp16-related changes I may have to review certain data-access macros, but otherwise this PR is still as good as it was last month.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants