test_all.testbin failed on ThresholdLayerTest & RNNLayerTest #19

dhzhd1 · 2017-10-25T21:31:13Z

Issue summary

After running ./build/test/test_all.testbin, the following tests failed:

ThresholdLayerTest/3.Test
Error Message:
src/caffe/test/test_threshold_layer.cpp:67: Failure
Expected: (bottom_data[i]) > (threshold_), actual: -0.635736 vs 0
src/caffe/test/test_threshold_layer.cpp:67: Failure
Expected: (bottom_data[i]) > (threshold_), actual: -0.363372 vs 0
src/caffe/test/test_threshold_layer.cpp:64: Failure
......
RNNLayerTest/2.TestForward
Error Message:
src/caffe/test/test_rnn_layer.cpp:156: Failure
Expected: (this->blob_top_.cpu_data()[i]) != (top_copy.cpu_data()[t * top_count + i]), actual: -0 vs -0 t = 1; i = 0
src/caffe/test/test_rnn_layer.cpp:156: Failure
Expected: (this->blob_top_.cpu_data()[i]) != (top_copy.cpu_data()[t * top_count + i]), actual: -0 vs -0 t = 1; i = 1
src/caffe/test/test_rnn_layer.cpp:156: Failure
Expected: (this->blob_top_.cpu_data()[i]) != (top_copy.cpu_data()[t * top_count + i]), actual: 0 vs 0 t = 1; i = 2
......
RNNLayerTest/2.TestGradient
Error Message:
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.37764036655426025, which exceeds threshold_ * scale, where computed_gradient evaluates to -0.37764036655426025, estimated_gradient evaluates to 0, and threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = 0.24414941668510437; objective+ = -0; objective- = -0
... ...
RNNLayerTest/2.TestGradientNonZeroCont
Error Message:
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.37112760543823242, which exceeds threshold_ * scale, where computed_gradient evaluates to 0.37112760543823242, estimated_gradient evaluates to 0, and threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.18513253331184387; objective+ = -0; objective- = -0
... ....
RNNLayerTest/2.TestGradientNonZeroContBufferSize2
Error Message:
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.14533787965774536, which exceeds threshold_ * scale, where computed_gradient evaluates to -0.14533787965774536, estimated_gradient evaluates to 0, and threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.17294931411743164; objective+ = 0; objective- = 0
... ...
RNNLayerTest/2.TestGradientNonZeroContBufferSize2WithStaticInput
Error Message:
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.14564625918865204, which exceeds threshold_ * scale, where computed_gradient evaluates to 0.14564625918865204, estimated_gradient evaluates to 0, and threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = 0.20095519721508026; objective+ = -0; objective- = -0
... ...
RNNLayerTest/3.TestForward
Error Message:
MIOpen Error: /data/repo/MIOpen/src/ocl/activ_ocl.cpp:45: Only alpha=1 and beta=0 is supported
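
A note on reading the gradient-check failures above: in every RNNLayerTest/2 case the debug line shows objective+ and objective- both equal to 0 (or -0). A numerical gradient estimated from those two values is therefore 0, and the reported difference is just the magnitude of computed_gradient. The sketch below reproduces that arithmetic. It assumes the estimate is a central difference. The function names and the `step` parameter are illustrative and are not taken from test_gradient_check_util.hpp.

```cpp
#include <cassert>
#include <cmath>

// Central-difference estimate of d(objective)/d(feat), built from the two
// objective values printed in the "debug:" lines (objective+ and objective-).
// `step` is the perturbation applied to feat; its value is an assumption here,
// since the report does not state it.
double estimate_gradient(double objective_plus, double objective_minus,
                         double step) {
  assert(step > 0.0);
  return (objective_plus - objective_minus) / (2.0 * step);
}

// Mirrors the shape of the logged failure message: the check fails when the
// two gradients differ by more than `tolerance` (the logged
// "threshold_ * scale", about 0.001 in this report).
bool gradient_check_fails(double computed_gradient, double estimated_gradient,
                          double tolerance) {
  return std::fabs(computed_gradient - estimated_gradient) > tolerance;
}
```

For the first RNNLayerTest/2.TestGradient failure, `estimate_gradient(-0.0, -0.0, step)` is 0 for any positive `step`, and `gradient_check_fails(-0.37764036655426025, 0.0, 0.001)` is true. This suggests the forward output does not respond to the perturbed input at all, which points at the forward pass rather than at the backward pass alone.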

Steps to reproduce

Build test_all.testbin following README.ROCm.md. All of the prerequisite packages have been installed, and LD_LIBRARY_PATH and PATH have been set up.

Your system configuration

GPU: AMD MI25
Operating system: Ubuntu 16.04.3 64bit
Compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
CUDA version (if applicable):
CUDNN version (if applicable):
BLAS: USE_ROCBLAS := 1
Python or MATLAB version (for pycaffe and matcaffe respectively): python 2.7.12
Other:
miopen-hip 1.1.4
miopengemm 1.1.5
rocm-libs 1.6.180

parallelo · 2017-10-26T20:40:18Z

Thanks for the report, @dhzhd1. We'll take a look.

yige-hu · 2018-02-26T02:32:52Z

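Hi Jeff @parallelo,

I'm still observing the 6th failure from the original report (RNNLayerTest/2.TestGradientNonZeroContBufferSize2WithStaticInput), as well as the MIOpen error. My configuration: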
Operating System: Ubuntu 16.04.3 LTS, Linux kernel 4.13.0
GPU: AMD RX 580
ROCm backend.

./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.40885764360427856, which exceeds threshold_ * scale, where
computed_gradient evaluates to 0.40885764360427856,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,27,6,47; feat = 0.20945753157138824; objective+ = -0; objective- = -0
[  FAILED  ] RNNLayerTest/2.TestGradientNonZeroContBufferSize2WithStaticInput, where TypeParam = caffe::GPUDevice<float> (20358 ms)
....
MIOpen Error: /data/repo/MIOpen/src/ocl/activ_ocl.cpp:47: Only alpha=1 and beta=0 is supported
F0225 20:25:48.964237  9249 cudnn_tanh_layer_hip.cpp:23] Check failed: status == miopenStatusSuccess (7 vs. 0)  miopenStatusUnknownError
*** Check failure stack trace: ***
    @     0x7f2b521295cd  google::LogMessage::Fail()
    @     0x7f2b5212b433  google::LogMessage::SendToLog()
    @     0x7f2b5212915b  google::LogMessage::Flush()
    @     0x7f2b5212be1e  google::LogMessageFatal::~LogMessageFatal()
    @          0x1547cce  caffe::CuDNNTanHLayer<>::Forward_gpu()
    @           0x4f7967  caffe::Layer<>::Forward()
    @          0x1b3a137  caffe::Net<>::ForwardFromTo()
    @          0x1c3ab1a  caffe::RecurrentLayer<>::Forward_gpu()
    @           0x4f7967  caffe::Layer<>::Forward()
    @           0x5b73f2  caffe::RNNLayerTest_TestForward_Test<>::TestBody()
    @          0x108fd14  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @          0x108fbd6  testing::Test::Run()
    @          0x1090d21  testing::TestInfo::Run()
    @          0x1091577  testing::TestCase::Run()
    @          0x1097c57  testing::internal::UnitTestImpl::RunAllTests()
    @          0x1097694  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @          0x1097649  testing::UnitTest::Run()
    @          0x2006fda  main
    @     0x7f2b4d5e4830  __libc_start_main
    @          0x2006479  _start
    @              (nil)  (unknown)
Aborted (core dumped)

Thanks,
Yige

davclark · 2018-04-16T02:18:05Z

I'm also getting a number of failures that seem to be in the same ballpark. I built from a just-updated checkout of the rocrand branch, following the instructions exactly, with no changes to Makefile.config. My config is basically the same as @yige-hu's:

Operating System: Ubuntu 16.04.3 LTS, Linux kernel 4.13.0
GPU: AMD RX 580, drivers from rocm PPA
CPU: Threadripper 1900X on X399 chipset
ROCm backend.

One thing I'm seeing is this "0 ms" note on most of the failures. I'm guessing the operation is simply not running.

Anyway, please let me know if this is a good place to post or if there's a better place!

[ RUN      ] EmbedLayerTest/0.TestGradientWithBias
src/caffe/test/test_embed_layer.cpp:183: Failure
Value of: 1
Expected: 0
[  FAILED  ] EmbedLayerTest/0.TestGradientWithBias, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] EmbedLayerTest/1.TestGradient
src/caffe/test/test_embed_layer.cpp:158: Failure
Value of: 1
Expected: 0
[  FAILED  ] EmbedLayerTest/1.TestGradient, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] EmbedLayerTest/2.TestGradient
src/caffe/test/test_embed_layer.cpp:158: Failure
Value of: 1
Expected: 0
[  FAILED  ] EmbedLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] EmbedLayerTest/3.TestGradientWithBias
src/caffe/test/test_embed_layer.cpp:183: Failure
Value of: 1
Expected: 0
[  FAILED  ] EmbedLayerTest/3.TestGradientWithBias, where TypeParam = caffe::GPUDevice<double> (0 ms)

[ RUN      ] MaxPoolingDropoutTest/2.TestBackward
src/caffe/test/test_maxpool_dropout_layers.cpp:124: Failure
Expected: (sum_with_dropout) >= (sum), actual: 22 vs 36
[  FAILED  ] MaxPoolingDropoutTest/2.TestBackward, where TypeParam = caffe::GPUDevice<float> (2 ms)

[ RUN      ] ConvolutionLayerTest/0.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/0.TestSimple3DConvolution, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/0.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/0.TestDilated3DConvolution, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/0.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/0.TestNDAgainst2D, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/0.TestGradient3D
src/caffe/test/test_convolution_layer.cpp:792: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/0.TestGradient3D, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/1.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/1.TestSimple3DConvolution, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/1.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/1.TestDilated3DConvolution, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/1.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/1.TestNDAgainst2D, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/1.TestGradient3D
src/caffe/test/test_convolution_layer.cpp:792: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/1.TestGradient3D, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/2.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/2.TestSimple3DConvolution, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/2.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/2.TestDilated3DConvolution, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/2.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/2.TestNDAgainst2D, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/2.TestGradient3D
src/caffe/test/test_convolution_layer.cpp:792: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/2.TestGradient3D, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/3.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/3.TestSimple3DConvolution, where TypeParam = caffe::GPUDevice<double> (1 ms)

[ RUN      ] ConvolutionLayerTest/3.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/3.TestDilated3DConvolution, where TypeParam = caffe::GPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/3.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/3.TestNDAgainst2D, where TypeParam = caffe::GPUDevice<double> (0 ms)

[ RUN      ] NeuronLayerTest/2.TestBNLLGradient
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.054580926895141602, which exceeds threshold_ * scale, where
computed_gradient evaluates to 1,
estimated_gradient evaluates to 0.9454190731048584, and
threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.10926593095064163; objective+ = 1.289490818977356; objective- = 1.2705824375152588
<snipped a bunch more like that...>
[  FAILED  ] NeuronLayerTest/2.TestBNLLGradient, where TypeParam = caffe::GPUDevice<float> (67 ms)

[ RUN      ] NeuronLayerTest/3.TestDropoutHalf
src/caffe/test/test_neuron_layer.cpp:87: Failure
The difference between empirical_dropout_ratio and dropout_ratio is 0.5, which exceeds 1.96 * std_error, where
empirical_dropout_ratio evaluates to 1,
dropout_ratio evaluates to 0.5, and
1.96 * std_error evaluates to 0.089461353392063625.
[  FAILED  ] NeuronLayerTest/3.TestDropoutHalf, where TypeParam = caffe::GPUDevice<double> (1 ms)

[ RUN      ] NeuronLayerTest/3.TestDropoutThreeQuarters
src/caffe/test/test_neuron_layer.cpp:87: Failure
The difference between empirical_dropout_ratio and dropout_ratio is 0.25, which exceeds 1.96 * std_error, where
empirical_dropout_ratio evaluates to 1,
dropout_ratio evaluates to 0.75, and
1.96 * std_error evaluates to 0.077475803251365008.
[  FAILED  ] NeuronLayerTest/3.TestDropoutThreeQuarters, where TypeParam = caffe::GPUDevice<double> (1 ms)

[ RUN      ] NeuronLayerTest/3.TestBNLLGradient
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.091322861619979268, which exceeds threshold_ * scale, where
computed_gradient evaluates to 1,
estimated_gradient evaluates to 0.90867713838002073, and
threshold_ * scale evaluates to 0.001.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.18315754813835755; objective+ = 1.220623351064388; objective- = 1.2024498082967876
<again snipping many repeats...>
[  FAILED  ] NeuronLayerTest/3.TestBNLLGradient, where TypeParam = caffe::GPUDevice<double> (63 ms)

[ RUN      ] NetTest/0.TestReshape
Segmentation fault (core dumped)

You can see that there's a core dump there at the end!
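
The two NeuronLayerTest/3 dropout failures above can be checked by hand. The logged bound "1.96 * std_error" is consistent with a binomial standard error over 120 elements. That count is inferred from the logged numbers and is not stated anywhere in this thread. The helper names below are hypothetical. This is a sketch of the check as the log describes it, not the code in test_neuron_layer.cpp.

```cpp
#include <cassert>
#include <cmath>

// Half-width of the interval reported in the log as "1.96 * std_error",
// treating each element's drop as an independent Bernoulli trial.
// `count` is the number of elements in the blob; 120 is an inferred value.
double dropout_bound(double dropout_ratio, int count) {
  assert(count > 0);
  const double std_error =
      std::sqrt(dropout_ratio * (1.0 - dropout_ratio) / count);
  return 1.96 * std_error;
}

// True when the observed fraction of dropped elements is further from the
// configured ratio than the bound allows, which is the condition the log
// reports at test_neuron_layer.cpp:87.
bool dropout_check_fails(double empirical_ratio, double dropout_ratio,
                         int count) {
  return std::fabs(empirical_ratio - dropout_ratio) >
         dropout_bound(dropout_ratio, count);
}
```

With these assumptions, `dropout_bound(0.5, 120)` is about 0.0895 and `dropout_bound(0.75, 120)` is about 0.0775, matching the logged bounds to roughly seven decimal places. An empirical_dropout_ratio of 1 reads as every element being dropped, which would fit a random mask that is not being generated properly on the GPU path, though the thread does not confirm this.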

davclark · 2018-04-16T02:42:15Z

I just updated to the hip branch, which doesn't seem to have many meaningful changes over the rocrand branch. The same errors persist. I can also report that MNIST and CaffeNet both fail with core dumps.

parallelo · 2018-04-16T18:05:34Z

@davclark - Thanks for the heads-up. Please open a new ticket for the core dumps, as that appears to be a separate issue.

davclark mentioned this issue Apr 16, 2018

Getting core dumps on "real" workloads #41

Closed
