operator() is horrendously slow #75

Open
shiozaki opened this issue Feb 9, 2015 · 8 comments

Comments

@shiozaki
Contributor

shiozaki commented Feb 9, 2015

I know this is not a bug, but it has to be reported. The MP2 assembly step becomes much slower if I use BTAS's native operator().

(I previously wrote 50-100, but it seems that was not quite right. It is slow, though.)

@evaleev
Member

evaleev commented Feb 9, 2015 via email

@shiozaki
Contributor Author

shiozaki commented Feb 9, 2015

See above. I will measure the timing more accurately.

@shiozaki
Contributor Author

shiozaki commented Feb 9, 2015

The BTAS code I am talking about is the following, which does not seem to be written for efficiency. The timing follows (it takes a while, as the entire BAGEL has to be recompiled).

      /// accesses element using its index, given as a pack of integers
      template<typename index0, typename... _args>
      typename std::enable_if<std::is_integral<index0>::value, const_reference>::type
      operator() (const index0& first, const _args&... rest) const
      {
        typedef typename common_signed_type<index0, typename index_type::value_type>::type ctype;
        auto indexv = {static_cast<ctype>(first), static_cast<ctype>(rest)...};
        index_type index = array_adaptor<index_type>::construct(indexv.size());
        std::copy(std::begin(indexv), std::end(indexv), std::begin(index));
        return storage_[ range_.ordinal(index) ];
      }

@shiozaki
Contributor Author

shiozaki commented Feb 9, 2015

Here is the code in BAGEL; I am talking about the loop in the middle (again, the timing follows):

    for (int n = 0; n != nloop; ++n) {
      // take care of data. The communication should be hidden
      if (n+ncache < nloop)
        cache.block(n+ncache, n-1);

      const int i = get<0>(cache.task(n));
      const int j = get<1>(cache.task(n));
      if (i < 0 || j < 0) continue;
      cache.data_wait(n);

      shared_ptr<const Matrix> iblock = cache(i);
      shared_ptr<const Matrix> jblock = cache(j);
      const Matrix mat(*iblock % *jblock); // V
      Matrix mat2 = mat; // 2T-T^t
      if (i != j) {
        mat2 *= 2.0;
        mat2 -= *mat.transpose();
      }   
      Matrix mat3 = mat; // T

      for (int a = 0; a != nvirt; ++a) {
        for (int b = 0; b != nvirt; ++b) {
          const double denom = -eig(a+nocc)+eig(i)-eig(b+nocc)+eig(j);
          mat2(b,a) /= denom;
          mat3(b,a) /= denom;
        }   
      }   

      const double fac = i == j ? 1.0 : 2.0;
      ecorr += mat.dot_product(mat2) * fac;
      dmp2->add_block(2.0, nocca, nocca, nvirt, nvirt, mat2 % mat3);
      if (i != j)
        dmp2->add_block(2.0, nocca, nocca, nvirt, nvirt, mat2 ^ mat3);
    }   

@evaleev
Member

evaleev commented Feb 9, 2015

It is most probably due to dynamic memory allocation in index construction. When you know the number of dimensions, parametrize the tensor by std::array; I do this when using BTAS with Libint.

@shiozaki
Contributor Author

shiozaki commented Feb 9, 2015

Sorry, this could be partly my fault. I will figure it out (also, I should default to std::array now).

@evaleev
Member

evaleev commented Feb 9, 2015

We definitely need an alias for a constexpr number of dimensions... perhaps something like this:

    /// Tensor with const number of dimensions
    template <typename _T,
              size_t _N,
              CBLAS_ORDER _Order = CblasRowMajor,
              class _Storage = btas::DEFAULT::storage<_T>,
              class = typename std::enable_if<std::is_same<_T, typename _Storage::value_type>::value>::type
              >
    using TensorNd = Tensor<_T,
                            RangeNd<_Order, std::array<long, _N>, btas::BoxOrdinal<_Order, std::array<long, _N>>>,
                            _Storage
                            >;

@shiozaki
Contributor Author

shiozaki commented Feb 9, 2015

View::begin() seems very slow, which was perhaps responsible for my problem. Just for your info.
