Compiling blas3.cu with clang++ #245

Open
singam-sanjay opened this issue Sep 14, 2017 · 5 comments

Comments

@singam-sanjay

I'd like to compile blas3.cu with clang++ (yes, clang++ can compile CUDA) instead of nvcc to compare the performance of the prod kernels produced. I've built clang and llvm from source on the release50 branch of each repository and tried building the program with:

clang++ -DVIENNACL_WITH_CUDA -I/home/seabed/Software/viennacl-dev -I/usr/local/cuda/include ../examples/tutorial/blas3.cu -o examples/tutorial/blas3-clang-cuda -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -lpthread -lboost_chrono -lboost_date_time -lboost_serialization -lboost_system -lboost_thread -lboost_atomic -lpthread -O3 -Xcuda-ptxas "-O3 -m64 -fmad true"

The command failed because __shfl_xor is reported as an undeclared identifier:

In file included from ../examples/tutorial/blas3.cu:56:
In file included from viennacl-dev/viennacl/matrix.hpp:29:
In file included from viennacl-dev/viennacl/linalg/sparse_matrix_operations.hpp:37:
In file included from viennacl-dev/viennacl/linalg/cuda/sparse_matrix_operations.hpp:35:
viennacl-dev/viennacl/linalg/cuda/spgemm_rmerge.hpp:146:32: error: use of undeclared identifier '__shfl_xor'
    min_index = min(min_index, __shfl_xor((int)min_index, (int)i));
                               ^
viennacl-dev/viennacl/linalg/cuda/spgemm_rmerge.hpp:235:21: error: use of undeclared identifier '__shfl_xor'
    output_value += __shfl_xor((int)output_value, (int)i);

The error persists even after including the header files that declare and define the intrinsic,

--- a/examples/tutorial/blas3.cpp
+++ b/examples/tutorial/blas3.cpp
@@ -30,10 +30,14 @@
+#include <sm_30_intrinsics.h>
+#include <sm_30_intrinsics.hpp>

Please suggest corrections for this strategy.
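
For context, the two lines flagged in the error output look like steps of a warp-level butterfly reduction: each lane combines its own value with the value held by the lane whose index differs by an XOR mask. The following is a minimal host-side C++ sketch of that pattern, not ViennaCL's code. It assumes a 32-lane warp and an enclosing loop that halves the mask each step (the error excerpt does not show the loop); `kWarpSize`, `shuffle_xor`, `warp_min` and `warp_sum` are illustrative names.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Assumption: a full 32-lane warp, modelled as a plain array on the CPU.
constexpr std::size_t kWarpSize = 32;
using Warp = std::array<int, kWarpSize>;

// CPU model of one XOR shuffle: lane l reads the value held by lane (l ^ mask).
Warp shuffle_xor(const Warp& values, std::size_t mask) {
    Warp out{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = values[lane ^ mask];
    }
    return out;
}

// Butterfly minimum: after log2(kWarpSize) steps every lane holds the warp-wide minimum.
Warp warp_min(Warp values) {
    for (std::size_t mask = kWarpSize / 2; mask >= 1; mask /= 2) {
        const Warp other = shuffle_xor(values, mask);
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            values[lane] = std::min(values[lane], other[lane]);
        }
    }
    return values;
}

// Butterfly sum: same exchange pattern, accumulating instead of taking the minimum.
Warp warp_sum(Warp values) {
    for (std::size_t mask = kWarpSize / 2; mask >= 1; mask /= 2) {
        const Warp other = shuffle_xor(values, mask);
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            values[lane] += other[lane];
        }
    }
    return values;
}
```

On the GPU the exchange happens in registers without shared memory, which is why this code path depends on the shuffle intrinsic being available for the target architecture.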

@karlrupp
Collaborator

The way to fix this with nvcc is to specify the correct arch (e.g. -arch=sm_50). You probably need to do the same for clang.

Btw: The fast BLAS3 kernels in ViennaCL are in the OpenCL backend. I haven't backported them to the CUDA backend yet.
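
On the arch suggestion: as far as I can tell, the CUDA headers only declare warp shuffle intrinsics such as __shfl_xor when the device compilation targets compute capability 3.0 or higher, which would explain why including sm_30_intrinsics.h by hand made no difference. clang's counterpart to nvcc's -arch is --cuda-gpu-arch (for example --cuda-gpu-arch=sm_50); the thread does not record the exact flag that was used. The host-side C++ sketch below only illustrates the guard pattern. `DEMO_GPU_ARCH`, `DEMO_HAS_SHUFFLE`, `demo_shfl_xor` and `demo_shuffle_declared` are hypothetical names, not the real macros or intrinsics.

```cpp
// DEMO_GPU_ARCH is a hypothetical stand-in for the architecture macro that a
// device compiler defines from its arch flag. Default here: an older target.
#ifndef DEMO_GPU_ARCH
#define DEMO_GPU_ARCH 200
#endif

#if DEMO_GPU_ARCH >= 300
// Only visible when the target architecture is new enough.
inline int demo_shfl_xor(int value, int /*lane_mask*/) { return value; }
#define DEMO_HAS_SHUFFLE 1
#else
// On older targets the declaration above does not exist at all, so any call
// to demo_shfl_xor fails with "use of undeclared identifier".
#define DEMO_HAS_SHUFFLE 0
#endif

// Reports whether the guarded declaration was compiled in.
constexpr bool demo_shuffle_declared() { return DEMO_HAS_SHUFFLE == 1; }
```

Building the sketch with -DDEMO_GPU_ARCH=500 makes the declaration visible, which mirrors the effect of passing a newer arch to the device compiler.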

@singam-sanjay
Author

That worked! Thanks!

@singam-sanjay
Author

But aren't these kernels part of the CUDA backend?

@singam-sanjay reopened this Sep 15, 2017
@karlrupp
Collaborator

Yes, they are. But these are not as fast as the kernels generated by the OpenCL backend.

@singam-sanjay
Author

singam-sanjay commented Sep 18, 2017

I modified blas3.cpp to default to OPENCL_MEMORY when VIENNACL_WITH_OPENCL is defined, and added a new "blas3-ocl" target to compile the example for OpenCL:

--- a/examples/tutorial/blas3.cpp
+++ b/examples/tutorial/blas3.cpp
@@ -140,7 +140,10 @@ ScalarType scaleToNbitIntIfInt(int n_bits)
 */
 int main()
 {
+#ifdef VIENNACL_WITH_OPENCL
+       viennacl::backend::default_memory_type(viennacl::OPENCL_MEMORY);
+#endif

--- a/examples/tutorial/CMakeLists.txt
+++ b/examples/tutorial/CMakeLists.txt
@@ -104,6 +104,12 @@ if (ENABLE_CUDA)
 
 endif (ENABLE_CUDA)
 
+if (ENABLE_UBLAS AND ENABLE_OPENCL)
+  include_directories(${Boost_INCLUDE_DIRS})
+  add_executable(blas3-ocl blas3.cpp)
+  set_target_properties(blas3-ocl PROPERTIES COMPILE_FLAGS "-g -DVIENNACL_WITH_OPENCL")
+  target_link_libraries(blas3-ocl ${Boost_LIBRARIES} ${OPENCL_LIBRARIES})
+endif ()

For matrices of size 128x16384 (A) and 16384x128 (B), the OpenCL code runs slower than the CUDA code:
OpenCL: 0.036686 s
CUDA: 0.00591053 s

System setup:
Ubuntu 16.04 x86_64
Quadro K1200
CUDA SDK, Drivers and OpenCL packages : nvidia-opencl-dev nvidia-375 nvidia-opencl-icd-375 cuda-8-0

Can the slowdown be attributed to the NVIDIA GPU?
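
To put the two timings on a common scale, they can be converted into a slowdown factor and an approximate throughput. This is a rough C++ sketch that assumes each timing covers exactly one product C = A * B and uses the usual 2*M*N*K operation count for a dense matrix product. The function and constant names are illustrative and are not part of ViennaCL.

```cpp
#include <cstdint>

// Timings reported above, in seconds.
constexpr double kOpenClSeconds = 0.036686;
constexpr double kCudaSeconds   = 0.00591053;

// Problem size: A is M x K, B is K x N, so C = A * B is M x N.
constexpr std::int64_t kM = 128;
constexpr std::int64_t kK = 16384;
constexpr std::int64_t kN = 128;

// How many times longer the slower run took.
double slowdown(double slow_seconds, double fast_seconds) {
    return slow_seconds / fast_seconds;
}

// Approximate GFLOP/s, counting one multiply and one add per inner-product term.
double gemm_gflops(std::int64_t m, std::int64_t n, std::int64_t k, double seconds) {
    const double flops = 2.0 * static_cast<double>(m) * static_cast<double>(n) *
                         static_cast<double>(k);
    return flops / seconds / 1e9;
}
```

Under these assumptions, `slowdown(kOpenClSeconds, kCudaSeconds)` comes out at a little over 6, and `gemm_gflops(kM, kN, kK, ...)` gives roughly 91 GFLOP/s for the CUDA run versus roughly 15 GFLOP/s for the OpenCL run.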
