Compiling blas3.cu with clang++ #245

singam-sanjay · 2017-09-14T16:28:12Z

I'd like to compile blas3.cu with clang++ (yes, clang++ can compile CUDA!) instead of nvcc to compare the performance of the prod kernels each compiler produces. I've built clang and llvm from source on the release50 branch of each repository and tried building the program with:

clang++ -DVIENNACL_WITH_CUDA -I/home/seabed/Software/viennacl-dev -I/usr/local/cuda/include ../examples/tutorial/blas3.cu -o examples/tutorial/blas3-clang-cuda -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -lpthread -lboost_chrono -lboost_date_time -lboost_serialization -lboost_system -lboost_thread -lboost_atomic -lpthread -O3 -Xcuda-ptxas "-O3 -m64 -fmad true"

The command failed because the __shfl_xor intrinsic is reported as an undeclared identifier:

In file included from ../examples/tutorial/blas3.cu:56:
In file included from viennacl-dev/viennacl/matrix.hpp:29:
In file included from viennacl-dev/viennacl/linalg/sparse_matrix_operations.hpp:37:
In file included from viennacl-dev/viennacl/linalg/cuda/sparse_matrix_operations.hpp:35:
viennacl-dev/viennacl/linalg/cuda/spgemm_rmerge.hpp:146:32: error: use of undeclared identifier '__shfl_xor'
    min_index = min(min_index, __shfl_xor((int)min_index, (int)i));
                               ^
viennacl-dev/viennacl/linalg/cuda/spgemm_rmerge.hpp:235:21: error: use of undeclared identifier '__shfl_xor'
    output_value += __shfl_xor((int)output_value, (int)i);

The error persists even after including the header files that declare and define the intrinsic:

--- a/examples/tutorial/blas3.cpp
+++ b/examples/tutorial/blas3.cpp
@@ -30,10 +30,14 @@
+#include <sm_30_intrinsics.h>
+#include <sm_30_intrinsics.hpp>

Please suggest corrections for this strategy.


karlrupp · 2017-09-14T16:33:26Z

The way to fix this with nvcc is to specify the correct arch (e.g. -arch=sm_50). You probably need to do the same for clang.
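For context on why the arch matters: the two failing lines in spgemm_rmerge.hpp are warp-level butterfly reductions, and __shfl_xor only exists on devices of compute capability 3.0 and above, so a build targeting an older default arch does not see it. Here is a minimal host-side sketch in plain C++ of what those lines compute. It simulates the lane exchange with an array. The names kWarpSize, shfl_xor_sim, warp_min and warp_sum are made up for illustration and are not ViennaCL code, and the loop bounds are an assumption, since the error output only shows the loop bodies.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Assumption: a warp has 32 lanes, as on current NVIDIA GPUs.
constexpr std::size_t kWarpSize = 32;
using Warp = std::array<int, kWarpSize>;

// Host-side stand-in for __shfl_xor: every lane reads the value
// currently held by the lane whose index is (lane XOR mask).
Warp shfl_xor_sim(const Warp& v, std::size_t mask) {
  Warp out{};
  for (std::size_t lane = 0; lane < kWarpSize; ++lane)
    out[lane] = v[lane ^ mask];
  return out;
}

// Butterfly min-reduction, mirroring
//   min_index = min(min_index, __shfl_xor((int)min_index, (int)i));
// After the 5 rounds (i = 16, 8, 4, 2, 1) every lane holds the warp-wide minimum.
Warp warp_min(Warp v) {
  for (std::size_t i = kWarpSize / 2; i >= 1; i /= 2) {
    const Warp other = shfl_xor_sim(v, i);
    for (std::size_t lane = 0; lane < kWarpSize; ++lane)
      v[lane] = std::min(v[lane], other[lane]);
  }
  return v;
}

// Butterfly sum-reduction, mirroring
//   output_value += __shfl_xor((int)output_value, (int)i);
// After the 5 rounds every lane holds the warp-wide sum.
Warp warp_sum(Warp v) {
  for (std::size_t i = kWarpSize / 2; i >= 1; i /= 2) {
    const Warp other = shfl_xor_sim(v, i);
    for (std::size_t lane = 0; lane < kWarpSize; ++lane)
      v[lane] += other[lane];
  }
  return v;
}
```

Calling warp_min or warp_sum on a Warp filled with per-lane values returns an array in which all 32 entries are equal to the minimum or the sum of the inputs. On the GPU the same exchange happens in registers without going through shared memory, which is why the kernels rely on the intrinsic.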

Btw: The fast BLAS3 kernels in ViennaCL are in the OpenCL backend. I haven't backported them to the CUDA backend yet.

singam-sanjay · 2017-09-15T13:20:08Z

That worked!! Thanks!

singam-sanjay · 2017-09-15T13:24:44Z

But aren't these kernels part of the CUDA backend?

karlrupp · 2017-09-15T13:59:28Z

Yes, they are. But these are not as fast as the kernels generated by the OpenCL backend.

singam-sanjay · 2017-09-18T07:04:15Z

I modified blas3.cpp to default to OPENCL_MEMORY when VIENNACL_WITH_OPENCL is defined, and added a new "blas3-ocl" target to compile the example for OpenCL:

--- a/examples/tutorial/blas3.cpp
+++ b/examples/tutorial/blas3.cpp
@@ -140,7 +140,10 @@ ScalarType scaleToNbitIntIfInt(int n_bits)
 */
 int main()
 {
+#ifdef VIENNACL_WITH_OPENCL
+       viennacl::backend::default_memory_type(viennacl::OPENCL_MEMORY);
+#endif

--- a/examples/tutorial/CMakeLists.txt
+++ b/examples/tutorial/CMakeLists.txt
@@ -104,6 +104,12 @@ if (ENABLE_CUDA)
 
 endif (ENABLE_CUDA)
 
+if (ENABLE_UBLAS AND ENABLE_OPENCL)
+  include_directories(${Boost_INCLUDE_DIRS})
+  add_executable(blas3-ocl blas3.cpp)
+  set_target_properties(blas3-ocl PROPERTIES COMPILE_FLAGS "-g -DVIENNACL_WITH_OPENCL")
+  target_link_libraries(blas3-ocl ${Boost_LIBRARIES} ${OPENCL_LIBRARIES})
+endif ()

For matrices of size 128x16384 (A) and 16384x128 (B), the OpenCL code runs roughly 6x slower than the CUDA code:
OpenCL: 0.036686 secs
CUDA: 0.00591053 secs

System setup:
Ubuntu 16.04 x86_64
Quadro K1200
CUDA SDK, Drivers and OpenCL packages : nvidia-opencl-dev nvidia-375 nvidia-opencl-icd-375 cuda-8-0

Can the slowdown be attributed to the NVIDIA GPU?

singam-sanjay closed this as completed Sep 15, 2017

singam-sanjay reopened this Sep 15, 2017
