This repository has been archived by the owner on May 3, 2024. It is now read-only.

test_all.testbin failed on ThresholdLayerTest & RNNLayerTest #19

Open
dhzhd1 opened this issue Oct 25, 2017 · 5 comments

@dhzhd1

dhzhd1 commented Oct 25, 2017

Issue summary

After running ./build/test/test_all.testbin, the following tests failed:

  1. ThresholdLayerTest/3.Test
    Error Message:
    src/caffe/test/test_threshold_layer.cpp:67: Failure
    Expected: (bottom_data[i]) > (threshold_), actual: -0.635736 vs 0
    src/caffe/test/test_threshold_layer.cpp:67: Failure
    Expected: (bottom_data[i]) > (threshold_), actual: -0.363372 vs 0
    src/caffe/test/test_threshold_layer.cpp:64: Failure
    ......

  2. RNNLayerTest/2.TestForward
    Error Message:
    src/caffe/test/test_rnn_layer.cpp:156: Failure
    Expected: (this->blob_top_.cpu_data()[i]) != (top_copy.cpu_data()[t * top_count + i]), actual: -0 vs -0 t = 1; i = 0
    src/caffe/test/test_rnn_layer.cpp:156: Failure
    Expected: (this->blob_top_.cpu_data()[i]) != (top_copy.cpu_data()[t * top_count + i]), actual: -0 vs -0 t = 1; i = 1
    src/caffe/test/test_rnn_layer.cpp:156: Failure
    Expected: (this->blob_top_.cpu_data()[i]) != (top_copy.cpu_data()[t * top_count + i]), actual: 0 vs 0 t = 1; i = 2
    ......

  3. RNNLayerTest/2.TestGradient
    Error Message:
    ./include/caffe/test/test_gradient_check_util.hpp:175: Failure
    The difference between computed_gradient and estimated_gradient is 0.37764036655426025, which exceeds threshold_ * scale, where computed_gradient evaluates to -0.37764036655426025, estimated_gradient evaluates to 0, and threshold_ * scale evaluates to 0.0010000000474974513.
    debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = 0.24414941668510437; objective+ = -0; objective- = -0
    ... ...

  4. RNNLayerTest/2.TestGradientNonZeroCont
    Error Message:
    ./include/caffe/test/test_gradient_check_util.hpp:175: Failure
    The difference between computed_gradient and estimated_gradient is 0.37112760543823242, which exceeds threshold_ * scale, where computed_gradient evaluates to 0.37112760543823242, estimated_gradient evaluates to 0, and threshold_ * scale evaluates to 0.0010000000474974513.
    debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.18513253331184387; objective+ = -0; objective- = -0
    ... ....

  5. RNNLayerTest/2.TestGradientNonZeroContBufferSize2
    Error Message:
    ./include/caffe/test/test_gradient_check_util.hpp:175: Failure
    The difference between computed_gradient and estimated_gradient is 0.14533787965774536, which exceeds threshold_ * scale, where computed_gradient evaluates to -0.14533787965774536, estimated_gradient evaluates to 0, and threshold_ * scale evaluates to 0.0010000000474974513.
    debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.17294931411743164; objective+ = 0; objective- = 0
    ... ...

  6. RNNLayerTest/2.TestGradientNonZeroContBufferSize2WithStaticInput
    Error Message:
    ./include/caffe/test/test_gradient_check_util.hpp:175: Failure
    The difference between computed_gradient and estimated_gradient is 0.14564625918865204, which exceeds threshold_ * scale, where computed_gradient evaluates to 0.14564625918865204, estimated_gradient evaluates to 0, and threshold_ * scale evaluates to 0.0010000000474974513.
    debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = 0.20095519721508026; objective+ = -0; objective- = -0
    ... ...

  7. RNNLayerTest/3.TestForward
    Error Message:
    MIOpen Error: /data/repo/MIOpen/src/ocl/activ_ocl.cpp:45: Only alpha=1 and beta=0 is supported
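
The gradient failures in items 3 to 6 all report objective+ and objective- as zero. The sketch below is a minimal central-difference gradient check showing why a forward pass that always returns zero gives estimated_gradient = 0 and fails against any nonzero computed gradient. It assumes the checker perturbs one input by a small step and compares the estimate with the analytic gradient. The function names, step size, and scale rule are illustrative assumptions, not Caffe's exact code.

```python
def estimate_gradient(objective, x, step=1e-2):
    """Central-difference estimate of d(objective)/dx at x."""
    objective_plus = objective(x + step)
    objective_minus = objective(x - step)
    return (objective_plus - objective_minus) / (2.0 * step)


def gradient_matches(computed, estimated, threshold=1e-3):
    """Compare gradients with a tolerance scaled by their magnitude (floor of 1)."""
    scale = max(abs(computed), abs(estimated), 1.0)
    return abs(computed - estimated) <= threshold * scale


def healthy_forward(x):
    # Hypothetical objective whose true derivative is 2 * x.
    return x * x


def broken_forward(x):
    # Models a forward pass that silently outputs zero, as in the logs above.
    return 0.0


x = 0.24414941668510437          # "feat" value from the first TestGradient failure
computed = -0.37764036655426025  # computed_gradient reported in the same failure

healthy_estimate = estimate_gradient(healthy_forward, x)
broken_estimate = estimate_gradient(broken_forward, x)

print(gradient_matches(2 * x, healthy_estimate))    # analytic vs. estimate, working case
print(gradient_matches(computed, broken_estimate))  # reported gradient vs. zero estimate
```

If this reading is right, the gradient failures are a symptom of the same all-zero forward output that RNNLayerTest/2.TestForward reports, rather than separate bugs in the backward pass.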

Steps to reproduce

Build test_all.testbin according to README.ROCm.md. All of the prerequisite packages have been installed, and LD_LIBRARY_PATH and PATH have been set up.

Your system configuration

GPU: AMD MI25
Operating system: Ubuntu 16.04.3 64bit
Compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
CUDA version (if applicable):
CUDNN version (if applicable):
BLAS: USE_ROCBLAS := 1
Python or MATLAB version (for pycaffe and matcaffe respectively): python 2.7.12
Other:
miopen-hip 1.1.4
miopengemm 1.1.5
rocm-libs 1.6.180

@parallelo
Contributor

Thanks for the report, @dhzhd1. We'll take a look.

@yige-hu

yige-hu commented Feb 26, 2018

Hi Jeff @parallelo ,

I'm still seeing the 6th failure. My configuration:
Operating System: Ubuntu 16.04.3 LTS, Linux kernel 4.13.0
GPU: AMD RX 580
ROCm backend.

./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.40885764360427856, which exceeds threshold_ * scale, where
computed_gradient evaluates to 0.40885764360427856,
estimated_gradient evaluates to 0, and
threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,27,6,47; feat = 0.20945753157138824; objective+ = -0; objective- = -0
[  FAILED  ] RNNLayerTest/2.TestGradientNonZeroContBufferSize2WithStaticInput, where TypeParam = caffe::GPUDevice<float> (20358 ms)
....
MIOpen Error: /data/repo/MIOpen/src/ocl/activ_ocl.cpp:47: Only alpha=1 and beta=0 is supported
F0225 20:25:48.964237  9249 cudnn_tanh_layer_hip.cpp:23] Check failed: status == miopenStatusSuccess (7 vs. 0)  miopenStatusUnknownError
*** Check failure stack trace: ***
    @     0x7f2b521295cd  google::LogMessage::Fail()
    @     0x7f2b5212b433  google::LogMessage::SendToLog()
    @     0x7f2b5212915b  google::LogMessage::Flush()
    @     0x7f2b5212be1e  google::LogMessageFatal::~LogMessageFatal()
    @          0x1547cce  caffe::CuDNNTanHLayer<>::Forward_gpu()
    @           0x4f7967  caffe::Layer<>::Forward()
    @          0x1b3a137  caffe::Net<>::ForwardFromTo()
    @          0x1c3ab1a  caffe::RecurrentLayer<>::Forward_gpu()
    @           0x4f7967  caffe::Layer<>::Forward()
    @           0x5b73f2  caffe::RNNLayerTest_TestForward_Test<>::TestBody()
    @          0x108fd14  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @          0x108fbd6  testing::Test::Run()
    @          0x1090d21  testing::TestInfo::Run()
    @          0x1091577  testing::TestCase::Run()
    @          0x1097c57  testing::internal::UnitTestImpl::RunAllTests()
    @          0x1097694  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @          0x1097649  testing::UnitTest::Run()
    @          0x2006fda  main
    @     0x7f2b4d5e4830  __libc_start_main
    @          0x2006479  _start
    @              (nil)  (unknown)
Aborted (core dumped)

Thanks,
Yige
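
The MIOpen error above refers to two scaling factors. The sketch below assumes MIOpen follows the cuDNN-style convention in which an activation call blends its result into the output buffer as y = alpha * f(x) + beta * y; that convention is an assumption here, not something confirmed in this thread. Under it, the only supported case, alpha=1 and beta=0, reduces to a plain activation. The function name is hypothetical.

```python
import math


def activation_forward(x, y_prior, alpha=1.0, beta=0.0, fn=math.tanh):
    """Blend fn(x) into an existing output buffer: alpha * fn(x) + beta * y_prior."""
    return [alpha * fn(xi) + beta * yi for xi, yi in zip(x, y_prior)]


x = [-1.0, 0.0, 0.5]
stale_output = [9.0, 9.0, 9.0]  # whatever the output buffer held before the call

# alpha=1, beta=0: the prior buffer contents are ignored, output is plain tanh(x).
plain = activation_forward(x, stale_output)

# Any other pair mixes in scaled or stale values; per the error message,
# such a request is rejected rather than computed.
blended = activation_forward(x, stale_output, alpha=0.5, beta=1.0)

print(plain)
print(blended)
```

Under that assumption, the check failure in cudnn_tanh_layer_hip.cpp would mean the layer passed scaling factors other than 1 and 0 to the activation call.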

@davclark

I'm also getting a number of failures that seem to be in the same ballpark. I built from a just-updated checkout of the rocrand branch and followed the instructions exactly, with no changes to Makefile.config. My setup is basically the same as @yige-hu's:

Operating System: Ubuntu 16.04.3 LTS, Linux kernel 4.13.0
GPU: AMD RX 580, drivers from rocm PPA
CPU: Threadripper 1900X on X399 chipset
ROCm backend.

One thing I'm seeing is this "0 ms" note on most of the failures. I'm guessing the operation is simply not running.
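
A quick way to check the "0 ms" pattern is to tally the `[  FAILED  ]` lines by elapsed time. The sketch below parses lines in the googletest format that appears in the log in this comment. The helper name and the short sample list are illustrative; real input would be read from a saved log file.

```python
import re
from collections import Counter

# Matches e.g. "[  FAILED  ] Suite/0.Name, where TypeParam = caffe::CPUDevice<float> (0 ms)"
FAILED_LINE = re.compile(
    r"^\[\s+FAILED\s+\]\s+(?P<suite>[^/]+)/\d+\.(?P<test>\w+),"
    r"\s+where TypeParam = (?P<type>.+?)\s+\((?P<ms>\d+) ms\)$"
)


def summarize_failures(lines):
    """Return (failures per suite, number of failures that took 0 ms)."""
    per_suite = Counter()
    zero_ms = 0
    for line in lines:
        match = FAILED_LINE.match(line.strip())
        if match is None:
            continue  # skip "[ RUN ]" lines, assertion text, and anything else
        per_suite[match.group("suite")] += 1
        if int(match.group("ms")) == 0:
            zero_ms += 1
    return per_suite, zero_ms


sample = [
    "[ RUN      ] EmbedLayerTest/0.TestGradientWithBias",
    "[  FAILED  ] EmbedLayerTest/0.TestGradientWithBias, where TypeParam = caffe::CPUDevice<float> (0 ms)",
    "[  FAILED  ] MaxPoolingDropoutTest/2.TestBackward, where TypeParam = caffe::GPUDevice<float> (2 ms)",
    "[  FAILED  ] NeuronLayerTest/2.TestBNLLGradient, where TypeParam = caffe::GPUDevice<float> (67 ms)",
]

per_suite, zero_ms = summarize_failures(sample)
print(dict(per_suite), zero_ms)
```

Splitting the counts by the TypeParam group as well would show whether the 0 ms failures are limited to one device type or hit CPU and GPU alike.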

Anyway, please let me know if this is a good place to post or if there's a better place!

[ RUN      ] EmbedLayerTest/0.TestGradientWithBias
src/caffe/test/test_embed_layer.cpp:183: Failure
Value of: 1
Expected: 0
[  FAILED  ] EmbedLayerTest/0.TestGradientWithBias, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] EmbedLayerTest/1.TestGradient
src/caffe/test/test_embed_layer.cpp:158: Failure
Value of: 1
Expected: 0
[  FAILED  ] EmbedLayerTest/1.TestGradient, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] EmbedLayerTest/2.TestGradient
src/caffe/test/test_embed_layer.cpp:158: Failure
Value of: 1
Expected: 0
[  FAILED  ] EmbedLayerTest/2.TestGradient, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] EmbedLayerTest/3.TestGradientWithBias
src/caffe/test/test_embed_layer.cpp:183: Failure
Value of: 1
Expected: 0
[  FAILED  ] EmbedLayerTest/3.TestGradientWithBias, where TypeParam = caffe::GPUDevice<double> (0 ms)

[ RUN      ] MaxPoolingDropoutTest/2.TestBackward
src/caffe/test/test_maxpool_dropout_layers.cpp:124: Failure
Expected: (sum_with_dropout) >= (sum), actual: 22 vs 36
[  FAILED  ] MaxPoolingDropoutTest/2.TestBackward, where TypeParam = caffe::GPUDevice<float> (2 ms)

[ RUN      ] ConvolutionLayerTest/0.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/0.TestSimple3DConvolution, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/0.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/0.TestDilated3DConvolution, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/0.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/0.TestNDAgainst2D, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/0.TestGradient3D
src/caffe/test/test_convolution_layer.cpp:792: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/0.TestGradient3D, where TypeParam = caffe::CPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/1.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/1.TestSimple3DConvolution, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/1.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/1.TestDilated3DConvolution, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/1.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/1.TestNDAgainst2D, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/1.TestGradient3D
src/caffe/test/test_convolution_layer.cpp:792: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/1.TestGradient3D, where TypeParam = caffe::CPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/2.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/2.TestSimple3DConvolution, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/2.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/2.TestDilated3DConvolution, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/2.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/2.TestNDAgainst2D, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/2.TestGradient3D
src/caffe/test/test_convolution_layer.cpp:792: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/2.TestGradient3D, where TypeParam = caffe::GPUDevice<float> (0 ms)

[ RUN      ] ConvolutionLayerTest/3.TestSimple3DConvolution
src/caffe/test/test_convolution_layer.cpp:397: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/3.TestSimple3DConvolution, where TypeParam = caffe::GPUDevice<double> (1 ms)

[ RUN      ] ConvolutionLayerTest/3.TestDilated3DConvolution
src/caffe/test/test_convolution_layer.cpp:448: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/3.TestDilated3DConvolution, where TypeParam = caffe::GPUDevice<double> (0 ms)

[ RUN      ] ConvolutionLayerTest/3.TestNDAgainst2D
src/caffe/test/test_convolution_layer.cpp:718: Failure
Value of: 1
Expected: 0
[  FAILED  ] ConvolutionLayerTest/3.TestNDAgainst2D, where TypeParam = caffe::GPUDevice<double> (0 ms)

[ RUN      ] NeuronLayerTest/2.TestBNLLGradient
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.054580926895141602, which exceeds threshold_ * scale, where
computed_gradient evaluates to 1,
estimated_gradient evaluates to 0.9454190731048584, and
threshold_ * scale evaluates to 0.0010000000474974513.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.10926593095064163; objective+ = 1.289490818977356; objective- = 1.2705824375152588
<snipped a bunch more like that...>
[  FAILED  ] NeuronLayerTest/2.TestBNLLGradient, where TypeParam = caffe::GPUDevice<float> (67 ms)

[ RUN      ] NeuronLayerTest/3.TestDropoutHalf
src/caffe/test/test_neuron_layer.cpp:87: Failure
The difference between empirical_dropout_ratio and dropout_ratio is 0.5, which exceeds 1.96 * std_error, where
empirical_dropout_ratio evaluates to 1,
dropout_ratio evaluates to 0.5, and
1.96 * std_error evaluates to 0.089461353392063625.
[  FAILED  ] NeuronLayerTest/3.TestDropoutHalf, where TypeParam = caffe::GPUDevice<double> (1 ms)

[ RUN      ] NeuronLayerTest/3.TestDropoutThreeQuarters
src/caffe/test/test_neuron_layer.cpp:87: Failure
The difference between empirical_dropout_ratio and dropout_ratio is 0.25, which exceeds 1.96 * std_error, where
empirical_dropout_ratio evaluates to 1,
dropout_ratio evaluates to 0.75, and
1.96 * std_error evaluates to 0.077475803251365008.
[  FAILED  ] NeuronLayerTest/3.TestDropoutThreeQuarters, where TypeParam = caffe::GPUDevice<double> (1 ms)

[ RUN      ] NeuronLayerTest/3.TestBNLLGradient
./include/caffe/test/test_gradient_check_util.hpp:175: Failure
The difference between computed_gradient and estimated_gradient is 0.091322861619979268, which exceeds threshold_ * scale, where
computed_gradient evaluates to 1,
estimated_gradient evaluates to 0.90867713838002073, and
threshold_ * scale evaluates to 0.001.
debug: (top_id, top_data_id, blob_id, feat_id)=0,0,0,0; feat = -0.18315754813835755; objective+ = 1.220623351064388; objective- = 1.2024498082967876
<again snipping many repeats...>
[  FAILED  ] NeuronLayerTest/3.TestBNLLGradient, where TypeParam = caffe::GPUDevice<double> (63 ms)

[ RUN      ] NetTest/0.TestReshape
Segmentation fault (core dumped)

You can see that there's a core dump there at the end!
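
The two dropout failures above compare an empirical ratio against a 95% bound. The sketch below assumes the test uses the binomial standard error sqrt(p * (1 - p) / n). Both the formula and the element count n = 120 are assumptions; n was chosen because it reproduces both logged bounds to about six decimal places.

```python
import math


def dropout_bound(dropout_ratio, count, z=1.96):
    """Bound on |empirical - expected| dropout ratio under a binomial model."""
    std_error = math.sqrt(dropout_ratio * (1.0 - dropout_ratio) / count)
    return z * std_error


count = 120  # assumed element count; chosen because it reproduces the logged bounds

for ratio in (0.5, 0.75):
    bound = dropout_bound(ratio, count)
    deviation = abs(1.0 - ratio)  # the logs report empirical_dropout_ratio = 1
    print(ratio, round(bound, 6), deviation > bound)
```

An empirical ratio of exactly 1 means every element was counted as dropped, so the deviation (0.5 and 0.25) is far outside either bound. That is a wholesale failure, not a borderline statistical miss.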

@davclark

I just updated to the hip branch, which doesn't seem to have many meaningful changes over the rocrand branch. The same errors persist. The MNIST and CaffeNet examples also fail, both with core dumps.

@parallelo
Contributor

@davclark - Thanks for the heads-up. Please open a new ticket for the core dumps, as that appears to be a separate issue.
