MatrixMultiplication example ERROR : clBuildProgram -> Returned: -11 #397

SirYwell · 2024-04-29T17:09:44Z

Describe the bug

A clear and concise description of what the bug is.

I'm getting the following error:

[TornadoVM-OCL-JNI] ERROR : clBuildProgram -> Returned: -11
uk.ac.manchester.tornado.api.exceptions.TornadoRuntimeException: [ERROR] Generated Kernel is NULL.

when trying to run the MatrixMultiplication example using Docker.

My current assumption is that this is caused by my GPU not supporting FP16 (from my understanding, output of clinfo here: https://gist.github.com/SirYwell/bdc347db5c4b5f66e2c664666fb0313f), but TornadoVM unconditionally emitting

TornadoVM/tornado-drivers/opencl/src/main/java/uk/ac/manchester/tornado/drivers/opencl/graal/asm/OCLAssembler.java

Line 77 in e8faf81

emitLine("#pragma OPENCL EXTENSION cl_khr_fp16 : enable ");

As I haven't yet figured out how I can build and run TornadoVM on NixOS directly, I can't test if changing that already fixes the problem.

How To Reproduce

After cloning the docker-tornado repository, I'm running ./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication --fullDebug -pk

The output can be found here: https://gist.github.com/SirYwell/d9ae4b5393de135ec15429c54d031820

Note: I made a few adjustments to make things work:
pom.xml: I changed the version from tornado-api and tornado-matrices from 1.0.3-dev to 1.0.3.
MatrixMultiplication.java: I changed WARMING_UP_ITERATIONS from 100 to 1 to reduce output.
run_nvidia_openjdk.sh: I changed #!/bin/bash to #!/usr/bin/env bash to make it work on NixOS. I also added the --gpus all flag to the docker command.

Expected behavior

Expecting the example to run.

Computing system setup (please complete the following information):

Output of ./run_nvidia_openjdk.sh tornado --devices:

./run_nvidia_openjdk.sh tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 1
  Tornado device=0:0  (DEFAULT)
        OPENCL --  [NVIDIA CUDA] -- NVIDIA GeForce RTX 2070 SUPER
                Global Memory Size: 7.8 GB
                Local Memory Size: 48.0 KB
                Workgroup Dimensions: 3
                Total Number of Block Threads: [1024]
                Max WorkGroup Configuration: [1024, 1024, 64]
                Device OpenCL C version: OpenCL C 1.2

Output of ./run_nvidia_openjdk.sh tornado --version:

version=1.0.3
branch=master
commit=02cb3f7

Backends installed: 
         - opencl

OS: NixOS 23.11, but running via Docker as provided

Additional context

I'm using the nvidia-container-toolkit as described in beehive-lab/docker-tornadovm#8, but I don't think it is related to this issue.

The text was updated successfully, but these errors were encountered:

jjfumero · 2024-04-29T17:29:12Z

Ok, sorry I was part of it since I jump first to the docker-tornado repo. I will take a look and let you know.

jjfumero · 2024-04-29T17:44:57Z

My take here is that it is related to the thread-block being used.

Task info: s0.t0
	Backend           : OPENCL
	Device            : NVIDIA GeForce RTX 3070 CL_DEVICE_TYPE_GPU (available)
	Dims              : 2
	Global work offset: [0, 0]
	Global work size  : [512, 512]
	Local  work size  : [32, 32, 1]
	Number of workgroups  : [16, 16]

[TornadoVM-OCL-JNI] ERROR : [JNI] uk.ac.manchester.tornado.drivers.opencl> notify error:
clEnqueueNDRangeKernel -> Returned: [JNI] uk.ac.manchester.tornado.drivers.opencl> CL_OUT_OF_RESOURCES error executing CL_COMMAND_NDRANGE_KERNEL on NVIDIA GeForce RTX 3070 (Device 0).

Tornado does not fail during compilation, but during the kernel dispatch.
See related issue (thread conversation): #356

If I select blocks of 16x16, then I get correct executions:

Task info: s0.t0
	Backend           : OPENCL
	Device            : NVIDIA GeForce RTX 3070 CL_DEVICE_TYPE_GPU (available)
	Dims              : 2
	Global work offset: [0, 0, 0]
	Global work size  : [512, 512, 1]
	Local  work size  : [16, 16, 1]
	Number of workgroups  : [32, 32, 1]

The diff:

diff --git a/example/pom.xml b/example/pom.xml
index 30c2487..785dc2d 100644
--- a/example/pom.xml
+++ b/example/pom.xml
@@ -29,12 +29,12 @@
     <dependency>
       <groupId>tornado</groupId>
       <artifactId>tornado-api</artifactId>
-      <version>1.0.3-dev</version>
+      <version>1.0.4-dev</version>
     </dependency>
     <dependency>
       <groupId>tornado</groupId>
       <artifactId>tornado-matrices</artifactId>
-      <version>1.0.3-dev</version>
+      <version>1.0.4-dev</version>
     </dependency>
     <dependency>
       <groupId>junit</groupId>
diff --git a/example/src/main/java/example/MatrixMultiplication.java b/example/src/main/java/example/MatrixMultiplication.java
index a37b0d6..d97439e 100644
--- a/example/src/main/java/example/MatrixMultiplication.java
+++ b/example/src/main/java/example/MatrixMultiplication.java
@@ -19,6 +19,7 @@ package example;
 import java.util.Random;
 import java.util.stream.IntStream;
 
+import uk.ac.manchester.tornado.api.*;
 import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
 import uk.ac.manchester.tornado.api.TaskGraph;
 import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
@@ -88,7 +89,12 @@ public class MatrixMultiplication {
 
         ImmutableTaskGraph immutableTaskGraph = taskGraph.snapshot();
         TornadoExecutionPlan executor = new TornadoExecutionPlan(immutableTaskGraph);
-        executor.withWarmUp();
+
+        WorkerGrid workerGrid = new WorkerGrid2D(matrixA.getNumRows(), matrixA.getNumColumns());
+        GridScheduler gridScheduler = new GridScheduler("s0.t0", workerGrid);
+        workerGrid.setLocalWork(16, 16, 1);
+
+        executor.withGridScheduler(gridScheduler).withWarmUp();
 
         // 1. Warm up Tornado
         for (int i = 0; i < WARMING_UP_ITERATIONS; i++) {
diff --git a/example/target/example-1.0-SNAPSHOT.jar b/example/target/example-1.0-SNAPSHOT.jar
index df83b80..9fff7be 100644
Binary files a/example/target/example-1.0-SNAPSHOT.jar and b/example/target/example-1.0-SNAPSHOT.jar differ

Funny, beacuse on my native system, I can run with blocks of 32x32, so this is a driver issue within the Docker image.

jjfumero · 2024-04-29T17:47:53Z

I pushed the "fix" in the tornado-docker repo: beehive-lab/docker-tornadovm@cb5f48c

jjfumero · 2024-04-29T17:55:06Z

For reference:

./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication 1024
WARNING: Using incubator modules: jdk.incubator.vector
Computing MxM of 1024x1024
[TornadoVM] Warning: The loop bounds will be configured by the GridScheduler. Check the grid by using the flag --threadInfo.
	Single Threaded CPU Execution: 0.92 GFlops, Total time = 2326 ms
	Streams Execution: 10.28 GFlops, Total time = 209 ms
	TornadoVM Execution on GPU (Accelerated): 1073.74 GFlops, Total Time = 2 ms
	Speedup: 1163.0x
	Verification true

I am using a new docker image for a new TornadoVM version coming up tomorrow ;-)

SirYwell · 2024-04-29T19:22:52Z

Thanks for looking into it so quickly. I'll try out tomorrow with the new version and report back!

jjfumero · 2024-04-30T08:41:55Z

New docker images: https://hub.docker.com/r/beehivelab/tornadovm-nvidia-openjdk

SirYwell · 2024-04-30T17:18:25Z

Hi @jjfumero, I tried with the new docker image, but I'm still getting the same error.
Running ./run_nvidia_openjdk.sh tornado --version now outputs

version=1.0.4
branch=master
commit=585574e

Backends installed: 
         - opencl

so I assume it's running the correct version.

Is there any way I can get more information out of OpenCL? Or is there any information I can provide that helps you? It's also not completely unrealistic that there is something wrong due to NixOS, but my understanding how things interact with each other isn't deep enough here. (I'm currently trying to manually build TornadoVM from source, but it fails due to cmake not finding the jni/jawt header files, which I also don't understand as they are present)

jjfumero · 2024-04-30T18:28:42Z

The TornadoVM automatic installer should bring all necessary dependencies (Java, cmake, maven, etc).

I do not know if there is something specific for NixOS, but to install TonadoVM from source, can you try the following:

Assuming you want the OpenCL backend and you have the NVIDIA Driver installed for your NVIDIA GPU:

$ ./bin/tornadovm-installer  --jdk jdk21 --backend=opencl

stratika · 2024-04-30T19:49:56Z

In my system (Ubuntu 23.10), I tested the script that failed for you and it is working:

./run_nvidia_openjdk.sh tornado -cp example/target/example-1.0-SNAPSHOT.jar example.MatrixMultiplication --fullDebug -pk
WARNING: Using incubator modules: jdk.incubator.vector
Computing MxM of 512x512
[INFO] Loading Backend: uk.ac.manchester.tornado.drivers.opencl.OCLTornadoDriverProvider@52aa2946
TornadoGraph dependency matrix...
+----+---------------+
|  5 [data]| <none>
|----+---------------+
|  6 [data]| <none>
|----+---------------+
|  7 [data]| <none>
|----+---------------+
|  8 [data]| <none>
|----+---------------+
|  9 [data]| 10
|----+---------------+
| 10 [task]| 6 7 8
|----+---------------+
| 11 [data]| 10
|----+---------------+
| 12 [data]| 11
|----+---------------+
| 13 [data]| 11
|----+---------------+
| 14 [data]| 11
|----+---------------+

-----------------------------------
Device Table:
[0]:  [NVIDIA CUDA] -- NVIDIA RTX A2000 8GB Laptop GPU
Constant Table:
[0]: 512
Object Table:
[0]: 0x16f7c8c1 MatrixFloat <512 x 512>
[1]: 0x573906eb MatrixFloat <512 x 512>
[2]: 0x4ebff610 MatrixFloat <512 x 512>
Task Table:
[0]: task s0.t0 - matrixMultiplication
-----------------------------------
-----------------------------------
TaskGraph:
[0]: constant 0
[1]: object 0
[2]: object 1
[3]: object 2
[4]: context device=0, [ 5 6 7 8 10 11 12 13 14 ]
[5]: persist node
[6]: copy in object 0
[7]: copy in object 1
[8]: copy in object 2
[9]: dependent write on object 2 by task 10
[10]: task=0, args=[ 6 7 8 0 ]
[11]: copy out object 2 after task 10
[12]: deallocate object 1 after 11
[13]: deallocate object 2 after 11
[14]: deallocate object 3 after 11
-----------------------------------
[TornadoVM] Warning: The loop bounds will be configured by the GridScheduler. Check the grid by using the flag --threadInfo.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable  
#pragma OPENCL EXTENSION cl_khr_fp16 : enable  
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable  
__kernel void matrixMultiplication(__global long *_kernel_context, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics, __global uchar *A, __global uchar *B, __global uchar *C, __private int size)
{
  int i_3, i_34, i_28, i_9, i_41, i_8, i_11, i_10, i_42, i_37, i_4, i_17, i_13, i_12, i_15, i_14, i_27, i_20; 
  long l_39, l_29, l_30, l_21, l_22, l_38; 
  float f_24, f_16, f_32, f_33; 
  ulong ul_19, ul_18, ul_23, ul_25, ul_26, ul_31, ul_1, ul_0, ul_35, ul_2, ul_5, ul_36, ul_7, ul_6, ul_40; 

  // BLOCK 0
  ul_0  =  (ulong) A;
  ul_1  =  (ulong) B;
  ul_2  =  (ulong) C;
  i_3  =  get_global_size(0);
  i_4  =  get_global_size(1);
  ul_5  =  ul_2 + 32L;
  ul_6  =  ul_1 + 32L;
  ul_7  =  ul_0 + 32L;
  i_8  =  get_global_id(0);
  i_9  =  get_global_id(1);
  i_10  =  _kernel_context[0];
  // BLOCK 1 MERGES [0 8 ]
  i_11  =  i_9;
  for(;i_11 < i_10;)
  {
    // BLOCK 2
    i_12  =  i_11 << 9;
    i_13  =  i_12 + 6;
    // BLOCK 3 MERGES [2 7 ]
    i_14  =  i_8;
    for(;i_14 < i_10;)
    {
      // BLOCK 4
      i_15  =  i_14 + 6;
      // BLOCK 5 MERGES [4 6 ]
      f_16  =  0.0F;
      i_17  =  0;
      for(;i_17 < i_10;)
      {
        // BLOCK 6
        ul_18  =  *((__global ulong *) ul_7);
        ul_19  =  ul_0 + ul_18;
        i_20  =  i_13 + i_17;
        l_21  =  (long) i_20;
        l_22  =  l_21 << 2;
        ul_23  =  ul_19 + l_22;
        f_24  =  *((__global float *) ul_23);
        ul_25  =  *((__global ulong *) ul_6);
        ul_26  =  ul_1 + ul_25;
        i_27  =  i_17 << 9;
        i_28  =  i_27 + i_15;
        l_29  =  (long) i_28;
        l_30  =  l_29 << 2;
        ul_31  =  ul_26 + l_30;
        f_32  =  *((__global float *) ul_31);
        f_33  =  fma(f_24, f_32, f_16);
        i_34  =  i_17 + 1;
        f_16  =  f_33;
        i_17  =  i_34;
      }  // B6
      // BLOCK 7
      ul_35  =  *((__global ulong *) ul_5);
      ul_36  =  ul_2 + ul_35;
      i_37  =  i_14 + i_13;
      l_38  =  (long) i_37;
      l_39  =  l_38 << 2;
      ul_40  =  ul_36 + l_39;
      *((__global float *) ul_40)  =  f_16;
      i_41  =  i_3 + i_14;
      i_14  =  i_41;
    }  // B7
    // BLOCK 8
    i_42  =  i_4 + i_11;
    i_11  =  i_42;
  }  // B8
  // BLOCK 9
  return;
}  //  kernel

	Single Threaded CPU Execution: 1.64 GFlops, Total time = 164 ms
	Streams Execution: 14.91 GFlops, Total time = 18 ms
	TornadoVM Execution on GPU (Accelerated): 268.44 GFlops, Total Time = 1 ms
	Speedup: 164.0x
	Verification true
cleanup: programs  ..........0.000256351 s
cleanup: context   ..........0.000006982 s
cleanup: total     ..........0.000263333 s

./run_nvidia_openjdk.sh tornado --version
version=1.0.4
branch=master
commit=585574e

Backends installed: 
	 - opencl

I share the same opinion with @jjfumero, it will be easier to understand the problem with your platform if you can install it locally and run the example without docker. Do you have any modified source code?

SirYwell · 2024-04-30T20:53:14Z

I do not know if there is something specific for NixOS

Sadly there is, dynamically linked binaries typically cannot be executed without patching. That's why I tried the manual installation, I guess I have to figure out what's up with it not finding JNI headers... I'll let you know when I get it to run.

I'm just confused that it doesn't work in the docker container, especially as it finds the device.

stratika · 2024-04-30T21:04:59Z

I do not know if there is something specific for NixOS

Sadly there is, dynamically linked binaries typically cannot be executed without patching. That's why I tried the manual installation, I guess I have to figure out what's up with it not finding JNI headers... I'll let you know when I get it to run.

I'm just confused that it doesn't work in the docker container, especially as it finds the device.

It seems that the problem is in the driver when it compiles the generated kernel from TornadoVM. The CL_BUILD_PROGRAM_FAILURE is returned if clBuildProgram does not return till the build of the kernel is completed. My guess would be that the driver does not communicate correctly with the underlying driver in your local system. This container is build based on the nvidia/opencl image.

I would suggest you to try one of the polyglot images, if possible. Those images are bigger, but I had installed the NVIDIA OpenCL driver in the container image manually. For example the tornadovm-polyglot-graalpy. Also, this container is build on a commit point prior to the latest release, but it may help to see if the driver in the container fails.

SirYwell · 2024-04-30T21:30:38Z

I would suggest you to try one of the polyglot images, if possible. Those images are bigger, but I had installed the NVIDIA OpenCL driver in the container image manually. For example the tornadovm-polyglot-graalpy. Also, this container is build on a commit point prior to the latest release, but it may help to see if the driver in the container fails.

Thanks, I gave that a try and ran ./polyglotImages/polyglot-graalpy/tornadovm-polyglot-nvidia.sh tornado --printKernel --truffle python example/polyglot-examples/mxmWithTornadoVM.py. This seems to result in the same error as before sadly. (PS it looks like the docs are outdated, they reference a tornadovm-polyglot.sh file instead)

jjfumero · 2024-05-01T04:52:15Z

Going back to the initial problem:

[TornadoVM-OCL-JNI] ERROR : clBuildProgram -> Returned: -11

Which is different from the error I encountered. To debug your error, I suggest 2 things:

Run with --debug in TornadoVM to obtain the clBuild error messages.
Use the kernel that TornadoVM generates and compiling it for your own system. You can build a C++ program that just builds and compiles OpenCL and obtain the info. You can take this program as inspiration: https://github.com/jjfumero/scripts/tree/master/opencl/compileKernel

SirYwell · 2024-05-01T08:23:46Z

Thanks for the pointers. https://gist.github.com/SirYwell/d9ae4b5393de135ec15429c54d031820 already contains the output from running with --fullDebug (but only 1 iteration instead of 100).

The 2. seems to work, I can compile the example kernel and the one from the output I got from the MatrixMultiplication example. The compileKernel program crashes, but from my debugging, that only happens after the kernelBin.bin is already written (after the output of https://github.com/jjfumero/scripts/blob/c8e52c3e83bb7db529ab11f9ae1d61e738792d8d/opencl/compileKernel/compileKernel.cpp#L302 )

jjfumero self-assigned this Apr 29, 2024

jjfumero added the bug Something isn't working label Apr 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MatrixMultiplication example ERROR : clBuildProgram -> Returned: -11 #397

MatrixMultiplication example ERROR : clBuildProgram -> Returned: -11 #397

SirYwell commented Apr 29, 2024

jjfumero commented Apr 29, 2024

jjfumero commented Apr 29, 2024 •

edited

Loading

jjfumero commented Apr 29, 2024

jjfumero commented Apr 29, 2024

SirYwell commented Apr 29, 2024

jjfumero commented Apr 30, 2024

SirYwell commented Apr 30, 2024

jjfumero commented Apr 30, 2024

stratika commented Apr 30, 2024

SirYwell commented Apr 30, 2024

stratika commented Apr 30, 2024

SirYwell commented Apr 30, 2024

jjfumero commented May 1, 2024 •

edited

Loading

SirYwell commented May 1, 2024

MatrixMultiplication example ERROR : clBuildProgram -> Returned: -11 #397

MatrixMultiplication example ERROR : clBuildProgram -> Returned: -11 #397

Comments

SirYwell commented Apr 29, 2024

Describe the bug

How To Reproduce

Expected behavior

Computing system setup (please complete the following information):

Additional context

jjfumero commented Apr 29, 2024

jjfumero commented Apr 29, 2024 • edited Loading

jjfumero commented Apr 29, 2024

jjfumero commented Apr 29, 2024

SirYwell commented Apr 29, 2024

jjfumero commented Apr 30, 2024

SirYwell commented Apr 30, 2024

jjfumero commented Apr 30, 2024

stratika commented Apr 30, 2024

SirYwell commented Apr 30, 2024

stratika commented Apr 30, 2024

SirYwell commented Apr 30, 2024

jjfumero commented May 1, 2024 • edited Loading

SirYwell commented May 1, 2024

jjfumero commented Apr 29, 2024 •

edited

Loading

jjfumero commented May 1, 2024 •

edited

Loading