link libnvblas? #17

cboettig · 2019-03-09T01:02:12Z

libnvblas.so gets installed with the existing cuda libraries. Apparently this can be enabled as the drop-in BLAS library for R, and is smart enough to let openblas handle things and only take over when it can provide significant acceleration(?)

EDIT

Haven't found great documentation on setup or performance, but looks like this can be done as a one-off at runtime by setting LD_PRELOAD and configuring the fallback to openblas:

## create config file:
echo "NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
NVBLAS_GPU_LIST ALL" > /etc/nvblas.conf

Run R with these env vars:

NVBLAS_CONFIG_FILE=/etc/nvblas.conf LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.9.0 R

Will have to benchmark a bit, but maybe worth adding this into our cuda/base setup @noamross ?

The text was updated successfully, but these errors were encountered:

eddelbuettel · 2019-03-09T01:44:00Z

Will have to benchmark a bit

Seconded. We should definitely document that it is there, but I am not convinced it will always be a winner. Then again I am also often wrong when guessing :)

cboettig · 2019-03-09T04:38:24Z

Yeah, it's not clear to me what the appropriate benchmark comparison is -- obviously the difference between a given operation on GPU vs CPU depends a lot on exactly what GPU vs what CPU you have on the platform.

That said, I imagine people will really only be deploying the rocker/cuda images on machines with significant GPUs available, if not on hardware explicitly optimized for GPU use (e.g. GPU-type instances on AWS). I do see some substantial improvement in low-level linear algebra operations, things like calculating determinant can see a factor of 10. For typical R use I doubt a lot of operations would see things like that, but then this image is already aimed at more specialized applications intended for GPU anyway.

Note in this experimental repo we have the cpu-based rocker/ml as well as the rocker/ml-gpu, only the latter builds on rocker/cuda and would thus get the GPU blas. of course a lot the specialized ML packages (xgboost, h2o keras) are either already linking these libs (via their calls to python or java), or else doing other gpu-optimized algorithms, but having rocker/cuda support GPU blas out of the box could make it a useful image to users where the GPU linear algebra is useful in contexts wholly apart from the ML packages.

noamross · 2019-03-12T13:35:41Z

Agreed that we should benchmark but in principle it seems a reasonable default for the cuda-based images. If you have an experimental fork with a script I'll get to it on our hardware, and maybe others (@MarkEdmondson1234), can give it a go, too?

cboettig · 2019-03-12T18:44:39Z

@noamross Thanks!

Yes, I think I have an experimental version of this on the nvblas branch on the cuda/base/Dockerfile. (Help testing would be great since I just had to send my System76 desktop with my GPU back to the shop for weird crashing behavior :-( ).

So one thing is that I'm following NVIDIA's advice to use LD_PRELOAD instead of re-linking. Like they say, you don't want to set LD_PRELOAD globally, since then it would get set before every shell command run on the system, so I cribbed this approach to load it just before the R, Rscript, and rserver sessions:

ml/cuda/base/Dockerfile

Lines 89 to 105 in 87726cf

    
           RUN mv /usr/local/bin/R /usr/local/bin/R_ && \ 
        
               mv /usr/local/bin/Rscript /usr/local/bin/Rscript_ && \ 
        
               echo "#!/bin/sh 
        
                     \nLD_PRELOAD=$CUDA_BLAS /usr/local/bin/R_ \"\$@\"" \ 
        
           		  > /usr/local/bin/R && \ 
        
           	chmod +x /usr/local/bin/R && \ 
        
               echo "#!/bin/sh 
        
                     \nLD_PRELOAD=$CUDA_BLAS /usr/local/bin/Rscript_ \"\$@\"" \ 
        
           		  > /usr/local/bin/Rscript && \ 
        
           	chmod +x /usr/local/bin/Rscript 
        
           RUN echo "#!/usr/bin/with-contenv bash \ 
        
                     \n## load /etc/environment vars first: \ 
        
             		  \n for line in \$( cat /etc/environment ) ; do export \$line ; done \ 
        
           		  \n export$LD_PRELOAD=$CUDA_BLAS \ 
        
                     \n exec /usr/lib/rstudio-server/bin/rserver --server-daemonize 0" \ 
        
                     > /etc/services.d/rstudio/run

I'm really not sure that's the best way to do this. If we're adding it to the library, it probably makes more sense to configure it directly as the system's blas, but I'd have to refresh on how to do that (particularly in a non-interactive session like the Dockerfile). @eddelbuettel has loads more experience with linking blas libraries and all and can probably give us some pointers (perhaps after recovering from the horror of seeing LD_PRELOAD approach above?).

I did give this a quick run on my system before sending it back and the results were impressive for basic matrix multiplication and determinants, particularly compared to default (non-parallel) blas. For openblas it depended more on how many CPU threads and much memory was available to the CPU relative to your GPU, but notably it was never slower linking the GPU libraries (perhaps because the nvblas-conf file already links the openblas cpu libs as the fallback anyway). But could use more testing; and I haven't run this exact dockerfile yet (or run in the RStudio mode), I was just running interactively on the machine...

eddelbuettel · 2019-03-12T19:21:05Z

Sorry to hear about the crashes. Frustrating.

My experience with "plugging BLAS in and out" is/was limited to system others made that already supported it :) I.e. the Debian BLAS maintainer had this brilliant idea of using the interchangeable nature of BLAS/LAPACK along the 'dpkg-alterntatives' mechanism of setting and adjusting softlinks to really make it swappable. In that we could lean on that scheme and try to fold NVidia's BLAS into it.

Otherwise LD_PRELOAD does the same: by rejigging the search order, you get your preferred BLAS in lieu of a default. So in that sense what you did here should do the trick.

MarkEdmondson1234 · 2019-03-12T19:44:32Z

Would be happy to do some benchmarking but would need some demo code to run as BLAS etc all over my head :)

eddelbuettel · 2019-03-12T19:48:45Z

Roughly a hundred years ago I did just that in what is now this repo using an existing R benchmark package / script. If memory serves then Colin's benchmarkme package uses the same. It all goes back to an original old script by Simon U. Can you start off that?

MarkEdmondson1234 · 2019-03-12T19:55:24Z

Looks good!

restonslacker · 2019-03-13T10:20:45Z

I'm getting: ``` Error response from daemon: Dockerfile parse error line 92: unknown instruction: \NLD_PRELOAD=$CUDA_BLAS ``` when I run `docker build .` ``` me@mybox :~/test_docker/ml/cuda/base$ nvidia-smi Wed Mar 13 10:18:27 2019 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 384.145 Driver Version: 384.145 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |===============================+======================+======================| | 0 TITAN X (Pascal) Off | 00000000:03:00.0 Off | N/A | | 0% 47C P0 54W / 250W | 0MiB / 12188MiB | 2% Default | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+ me@mybox:~/test_docker/ml/cuda/base$ sudo docker version Client: Version:17.12.0-ce API version:1.35 Go version:go1.9.2 Git commit:c97c6d6 Built:Wed Dec 27 20:11:19 2017 OS/Arch:linux/amd64 Server: Engine: Version:17.12.0-ce API version:1.35 (minimum version 1.12) Go version:go1.9.2 Git commit:c97c6d6 Built:Wed Dec 27 20:09:53 2017 OS/Arch:linux/amd64 Experimental:false ```

…

On Tue, Mar 12, 2019 at 3:55 PM Mark ***@***.***> wrote: Looks good! — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#17 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AFi1C-m2bFpGQKUPew5cK51QeUAn5ywuks5vWAYsgaJpZM4bmYjz> .

cboettig · 2019-03-13T15:57:09Z

@restonslacker whoops, that was just a typo in the Dockerfile (apparently you can't escape a literal ! while using double quotes for $VARS....) should be fixed now

lezwright · 2021-01-21T16:01:27Z

Hi, did you have any chance with the LE_PRELOAD and R? When I use this approach I can hardly engage the GPU.

cboettig · 2021-01-21T21:49:23Z

This example should run on the GPU using our docker images (e.g. rocker/ml) with NVIDIA BLAS.

Note that this is obviously hardware-dependent -- in particular, NVIDIA BLAS uses a configuration that enables a fall-back to CPU-BLAS if it decides the problem size is too large for the GPU. Also note that there's non-trivial overhead in moving the data from CPU to GPU, which can often swamp the time saved in the actual GPU-based computation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

link libnvblas? #17

link libnvblas? #17

cboettig commented Mar 9, 2019 •

edited

Loading

eddelbuettel commented Mar 9, 2019

cboettig commented Mar 9, 2019

noamross commented Mar 12, 2019

cboettig commented Mar 12, 2019

eddelbuettel commented Mar 12, 2019

MarkEdmondson1234 commented Mar 12, 2019

eddelbuettel commented Mar 12, 2019

MarkEdmondson1234 commented Mar 12, 2019

restonslacker commented Mar 13, 2019 via email

cboettig commented Mar 13, 2019

lezwright commented Jan 21, 2021

cboettig commented Jan 21, 2021

link libnvblas? #17

link libnvblas? #17

Comments

cboettig commented Mar 9, 2019 • edited Loading

eddelbuettel commented Mar 9, 2019

cboettig commented Mar 9, 2019

noamross commented Mar 12, 2019

cboettig commented Mar 12, 2019

eddelbuettel commented Mar 12, 2019

MarkEdmondson1234 commented Mar 12, 2019

eddelbuettel commented Mar 12, 2019

MarkEdmondson1234 commented Mar 12, 2019

restonslacker commented Mar 13, 2019 via email

cboettig commented Mar 13, 2019

lezwright commented Jan 21, 2021

cboettig commented Jan 21, 2021

cboettig commented Mar 9, 2019 •

edited

Loading