link libnvblas? #17
Seconded. We should definitely document that it is there, but I am not convinced it will always be a winner. Then again I am also often wrong when guessing :)
Yeah, it's not clear to me what the appropriate benchmark comparison is -- obviously the difference between a given operation on GPU vs CPU depends a lot on exactly what GPU and what CPU you have on the platform. That said, I imagine people will really only be deploying the rocker/cuda images on machines with significant GPUs available, if not on hardware explicitly optimized for GPU use (e.g. GPU-type instances on AWS). I do see some substantial improvement in low-level linear algebra operations; things like calculating a determinant can be roughly 10x faster. For typical R use I doubt a lot of operations would see speedups like that, but then this image is already aimed at more specialized applications intended for GPU anyway. Note in this experimental repo we have the CPU-based images as well.
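For a concrete sense of the kind of operation involved, here is a minimal timing snippet (the matrix size is arbitrary; run it under each BLAS setup you want to compare):

```
# Time a determinant on a 2000 x 2000 matrix; repeat under each BLAS setup.
Rscript -e 'm <- matrix(rnorm(4e6), 2000); print(system.time(determinant(m)))'
```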
Agreed that we should benchmark, but in principle it seems a reasonable default for the cuda images.
@noamross Thanks! Yes, I think I have an experimental version of this in already. So one thing is that I'm following NVIDIA's advice to use `LD_PRELOAD` (see Lines 89 to 105 in 87726cf).
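For anyone reading along, a sketch of the kind of configuration involved, with assumed paths (NVBLAS_CPU_BLAS_LIB is mandatory and names the CPU BLAS that handles anything NVBLAS doesn't intercept):

```
# Sketch only -- library paths differ by image.
cat > /etc/nvblas.conf <<'EOF'
NVBLAS_LOGFILE /tmp/nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so.0
NVBLAS_GPU_LIST 0
EOF
```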
I'm really not sure that's the best way to do this. If we're adding it to the library, it probably makes more sense to configure it directly as the system's BLAS, but I'd have to refresh on how to do that (particularly in a non-interactive session like the Dockerfile). @eddelbuettel has loads more experience with linking BLAS libraries and can probably give us some pointers (perhaps after recovering from the horror of seeing this setup). I did give this a quick run on my system before sending it back, and the results were impressive for basic matrix multiplication and determinants, particularly compared to the default (non-parallel) BLAS. For openblas it depended more on how many CPU threads and how much memory were available to the CPU relative to your GPU, but notably it was never slower linking the GPU libraries (perhaps because the nvblas.conf file already lists the openblas CPU libs as the fallback anyway). But it could use more testing; and I haven't run this exact Dockerfile yet (or run it in the RStudio mode), I was just running interactively on the machine...
Sorry to hear about the crashes. Frustrating. My experience with "plugging BLAS in and out" is/was limited to systems others made that already supported it :) I.e. the Debian BLAS maintainer had this brilliant idea of using the interchangeable nature of BLAS/LAPACK along with the dpkg 'alternatives' mechanism of setting and adjusting softlinks to really make it swappable. In that case we could lean on that scheme and try to fold NVIDIA's BLAS into it. Otherwise ...
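For a flavor of that scheme (a sketch of the stock Debian commands, not something libnvblas currently plugs into; the exact alternative name varies by release):

```
# List the BLAS implementations registered as alternatives
update-alternatives --list libblas.so.3-x86_64-linux-gnu
# Interactively repoint the system-wide libblas.so.3 softlink
sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
```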
Would be happy to do some benchmarking but would need some demo code to run, as BLAS etc. is all over my head :)
Roughly a hundred years ago I did just that in what is now this repo, using an existing R benchmark package / script. If memory serves, Colin's benchmarkme package uses the same. It all goes back to an original old script by Simon U. Can you start off from that?
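If it helps get started, a one-off run of that benchmark via Colin's benchmarkme package (assuming it installs cleanly; run once per BLAS setup and compare):

```
# Standard R benchmark suite, descended from Simon U.'s original script
Rscript -e 'if (!requireNamespace("benchmarkme", quietly = TRUE)) install.packages("benchmarkme")'
Rscript -e 'print(benchmarkme::benchmark_std(runs = 3))'
```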
Looks good!
I'm getting:
```
Error response from daemon: Dockerfile parse error line 92: unknown instruction: \NLD_PRELOAD=$CUDA_BLAS
```
when I run `docker build .`
```
me@mybox:~/test_docker/ml/cuda/base$ nvidia-smi
Wed Mar 13 10:18:27 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   47C    P0    54W / 250W |      0MiB / 12188MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

me@mybox:~/test_docker/ml/cuda/base$ sudo docker version
Client:
 Version:      17.12.0-ce
 API version:  1.35
 Go version:   go1.9.2
 Git commit:   c97c6d6
 Built:        Wed Dec 27 20:11:19 2017
 OS/Arch:      linux/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:09:53 2017
  OS/Arch:      linux/amd64
  Experimental: false
```
@restonslacker whoops, that was just a typo in the Dockerfile (apparently you can't escape a literal newline there).
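For the record, a sketch of an ENV block that avoids that parse error (the variable names come from the error message above; the libnvblas path is an assumption):

```
# A bare trailing backslash continues an instruction; a literal "\n" does not.
ENV CUDA_BLAS=/usr/local/cuda/lib64/libnvblas.so
# Referenced in a separate ENV instruction so $CUDA_BLAS is already defined.
ENV LD_PRELOAD=$CUDA_BLAS \
    NVBLAS_CONFIG_FILE=/etc/nvblas.conf
```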
Hi, did you have any luck with the LD_PRELOAD approach and R? When I use this approach I can hardly engage the GPU.
This example should run on the GPU using our docker images (e.g. the rocker/cuda images). Note that this is obviously hardware-dependent -- in particular, NVIDIA BLAS uses a configuration that enables a fall-back to CPU BLAS if it decides the problem size is too large for the GPU. Also note that there's non-trivial overhead in moving the data from CPU to GPU, which can often swamp the time saved in the actual GPU-based computation.
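As a minimal stand-in for the example (the image tag and the nvidia-docker2 --runtime flag are assumptions tied to the setup discussed in this thread):

```
# Large matrix multiply; BLAS3 calls like this are what NVBLAS intercepts.
docker run --runtime=nvidia --rm rocker/cuda \
  Rscript -e 'a <- matrix(rnorm(1e7), 5000); print(system.time(a %*% t(a)))'
```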
libnvblas.so gets installed with the existing cuda libraries. Apparently it can be enabled as a drop-in BLAS library for R, and is smart enough to let openblas handle things and only take over when it can provide significant acceleration (?)
EDIT: Haven't found great documentation on setup or performance, but it looks like this can be done as a one-off at runtime by setting `LD_PRELOAD` and configuring the fallback to openblas. Run R with these env vars:
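(A sketch with assumed paths; libnvblas ships with the CUDA toolkit, and nvblas.conf must name the openblas fallback as above.)

```
# Assumed paths: adjust to where libnvblas.so and nvblas.conf actually live.
LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so \
NVBLAS_CONFIG_FILE=/etc/nvblas.conf \
R
```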
Will have to benchmark a bit, but maybe worth adding this into our cuda/base setup, @noamross?