Getting core dumps on "real" workloads #41
Comments
I also tried running as root to see whether it was a permissions problem. I don't think it is: I still get a core dump with sudo.
I just upgraded to the most recent kernel modules and related packages (e.g., hip_hcc 1.5.18081, rocblas 0.13.2.1, compute-firmware 1.7.18, rock-dkms 1.7.148, rocm-opencl 1.2.0.2018041722, among many others). I tried a clean recompile and test and still get the same errors (including those from #19), but now I get a print-out of the stack trace:
Hi @davclark, it could be a user bits configuration issue. Please use the following command to launch our official hipCaffe docker image:
The basic ROCm image works with the test program. I believe I successfully ran the above image (after correcting unconfine -> unconfined). However, I still get a core dump on the tests, in addition to a number of warnings like this:
The stack trace for the core dump:
This seems more like a test configuration issue, though... Note that I am not building hipCaffe, just trying to run the test program that's already there in the image.
Hi @davclark, thanks for the further information.
Looking at the info above, I should clarify that I am actually on 16.04.4. I can't find any info on how to "downgrade" to 16.04.3; it seems you're either on the HWE branch or not. I understand that there is some issue with ROCm and the 16.04.4 kernel (currently 4.13.0-39-generic - strangely, uname -a reports 16.04.1, even though lsb_release reports 16.04.4)? I've got rocm (including rock / rockt, etc.) installed via AMD's rocm PPA.

In any case: first, I verified that the ROCm image seems to work for me on Docker, including compiling and running the "vector-copy" program.

To reproduce the above failure, I simply run your docker command above (fixing a typo):

Then, I run

Is that what you meant? Happy to provide more info. I can certainly also try switching to the HWE-edge kernel (which is ported from 18.04).
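The version mismatch described above can be inspected with a short script. This is a minimal sketch using only standard tools (`uname`, `lsb_release`); the variable names are illustrative. It assumes that the "16.04.1" seen in `uname -a` comes from the kernel build string, which on Ubuntu HWE kernels typically carries a `~16.04.1` package suffix rather than the distribution point release.

```shell
#!/bin/sh
# Kernel release that is actually running, e.g. 4.13.0-39-generic
kernel_release="$(uname -r)"

# Kernel build string; assumption: on HWE kernels this is where a
# "~16.04.1" package suffix usually appears
kernel_build="$(uname -v)"

# Distribution description; lsb_release may not be installed everywhere
if command -v lsb_release >/dev/null 2>&1; then
  distro="$(lsb_release -ds)"
else
  distro="unknown (lsb_release not installed)"
fi

printf 'kernel release: %s\n' "$kernel_release"
printf 'kernel build:   %s\n' "$kernel_build"
printf 'distribution:   %s\n' "$distro"
```

If the build string and the distribution description disagree only in that suffix, the two commands are probably reporting different things rather than contradicting each other.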
Another idea is that perhaps I should not be using the HWE kernel? I'll see what happens if I use the base LTS kernel... but this makes me think to ask whether there are any other expected settings I may be missing.
Hi @davclark, we know some hipCaffe direct tests can fail. That's normal; even upstream Caffe cannot pass all of its direct tests.
All example models fail. For example, setting up and trying to run the MNIST example results in a core dump. I'm more concerned about core dumps than tolerance violations! E.g. from the hipCaffe dir: ./data/mnist/get_mnist.sh
Hi @davclark, those samples should execute fine. Let's focus on the MNIST sample for now. Could you try upgrading to ROCm 1.7.2? It's publicly available now. Then, switch to the ROCm 1.7.2 docker image: If the issue remains, please provide the complete failure log and the output:
You all are moving fast, I see! By the time I was able to try this, the rocm repo had updated to 1.8, so that's what I got after an update just now. The 1.7.2 docker image works fine (again, there was a typo: seccomp=unconfined was missing the "d" at the end). In case it's useful, my kernel was at the following version (I'm on the HWE kernel): 4.13.0-41-generic. Thank you!
Issue summary
I initially reported in #19 that I am seeing test failures as well as core dumps, but I'm reporting only on the core dumps here for now.
In short, NetTest/0.TestReshape, as well as my attempts at running the MNIST and CaffeNet examples, all end with a core dump. Data for CIFAR-10 has an integrity problem... (I've been out of the game long enough that I'm not sure how to get the stack trace with gdb... I'm happy to look into this further.)
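For the gdb question above, one possible approach is to run the failing binary under gdb in batch mode so that a backtrace is printed at the crash. This is a sketch only: `run_with_backtrace` is a hypothetical helper, not part of hipCaffe, and the test binary path in the usage comment assumes the upstream Caffe Makefile layout, which may differ in the hipCaffe build or docker image.

```shell
#!/bin/sh
# Hypothetical helper (not part of hipCaffe): run a command under gdb in
# batch mode and print a backtrace once the program stops or crashes.
run_with_backtrace() {
  if ! command -v gdb >/dev/null 2>&1; then
    echo "gdb not found; install it first" >&2
    return 127
  fi
  # -batch: exit when the commands finish; -ex: run the given gdb command;
  # --args: treat everything that follows as the program and its arguments
  gdb -batch -ex run -ex bt --args "$@"
}

# Example usage (path is an assumption based on upstream Caffe's layout):
#   run_with_backtrace ./build/test/test_all.testbin \
#       --gtest_filter='NetTest/0.TestReshape'
```

Running the program directly under gdb avoids depending on where the system writes core files, which on Ubuntu may be redirected by the crash reporter.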
Steps to reproduce
If you are having difficulty building Caffe or training a model, please ask the caffe-users mailing list. If you are reporting a build error that seems to be due to a bug in Caffe, please attach your build configuration (either Makefile.config or CMakeCache.txt) and the output of the make (or cmake) command.
Your system configuration
Operating System: Ubuntu 16.04.3 LTS, Linux kernel 4.13.0, ROCm & DKMS from the ROCm PPA.
GPU: AMD RX 580, drivers from rocm PPA
CPU: Threadripper 1900X on X399 chipset
Compiler: (I think this is the one you want?) hcc version=1.2.18063-7e18c64-ac8732c-710f135, workweek (YYWWD) = 18063
BLAS: ROCBLAS (I assume - that's the default in Makefile.config)
Python or MATLAB version (for pycaffe and matcaffe respectively): standard Ubuntu Python 2.7, no matlab