Performance of Numeric.AD.Mode.Reverse.Double
#96
Comments
Good question! I think the punchline you'll see is, as you say, that the comparison of forward vs. reverse mode will switch when you consider R^n -> R for large enough n. Your example is R -> R, which is where forward mode shines, as it also will algorithmically on R -> R^n.
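As an illustrative aside (not from the thread), the following minimal sketch makes the dimensionality point concrete; the `sumOfSquares` objective is an assumption for illustration. Reverse mode returns the whole gradient of an R^n -> R function from one `grad` call, while forward-mode `diff` is the natural fit for R -> R.

```haskell
{-# LANGUAGE ImportQualifiedPost #-}
import Numeric.AD.Mode.Forward.Double qualified as FwdD
import Numeric.AD.Mode.Reverse.Double qualified as RevD

-- Illustrative R^n -> R objective, kept polymorphic so either mode can differentiate it.
sumOfSquares :: Num a => [a] -> a
sumOfSquares = sum . map (\x -> x * x)

main :: IO ()
main = do
  let xs = [1 .. 1000] :: [Double]
  -- Reverse mode: the full 1000-element gradient in a single sweep.
  print (sum (RevD.grad sumOfSquares xs))
  -- Forward mode: one derivative of an R -> R function per call.
  print (FwdD.diff (\x -> x * x + 1) 3)
```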
I looked into this performance problem a bit. Here's a slightly modified benchmark from the top post. I used instruction counting: it's easy to measure, almost deterministic, and I believe it makes a decent proxy in this case.

```haskell
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE ImportQualifiedPost #-}
import Test.Tasty.PAPI

import Numeric.AD.Mode.Forward qualified as Fwd
import Numeric.AD.Mode.Forward.Double qualified as FwdD
import Numeric.AD.Mode.Reverse qualified as Rev
import Numeric.AD.Mode.Reverse.Double qualified as RevD

{-# INLINE poly #-}
poly :: Floating a => a -> a
poly x = go (1000 :: Int) 0 where
  go 0 !a = a
  go n !a = go (n - 1) (a + x ^ n)

main :: IO ()
main = defaultMain
  [ bench "eval" $ whnf poly p
  , bench "Fwd"  $ whnf (Fwd.diff poly) p
  , bench "FwdD" $ whnf (FwdD.diff poly) p
  , bench "Rev"  $ whnf (Rev.diff poly) p
  , bench "RevD" $ whnf (RevD.diff poly) p
  ]
  where
    p = 1.02 :: Double
```

Here are the benchmark results for different implementations:
What could we understand from this:
I am using AD for gradient-based optimization and need better performance than I am currently getting. I noticed that some work has gone into improving the `.Double` specializations recently, so I did some experiments with the latest master (85aee3c). My setup is as follows: I am using GHC 8.10.5 and LLVM 12.0.1 and compiled with `-O2 -fllvm`. I also set the `+ffi` switch for the `ad` package. I get the following results (full details):

Using NCG instead of LLVM, the results are similar, with slightly longer execution times. I am not sure why regular evaluation times also change with different modes.

I am very happy with `Numeric.AD.Mode.Forward.Double`, as it causes barely any overhead over regular evaluation. While `Numeric.AD.Mode.Reverse.Double` is significantly faster than its generic counterpart, its 50x slowdown is still a long shot from the promise of "automatic differentiation typically only decreases performance by a small multiplier". In particular, it allocates a lot of intermediate memory. Since the reverse-mode tape is implemented in C via FFI (which I presume is not counted by Haskell's GC), I suspect that the 770 MB that are allocated indicate that there is still some boxing going on.

Since I am doing gradient-based optimization, I would like to use reverse mode. Am I doing something wrong here? Is there something that can be done to bring its performance more in line with how `Numeric.AD.Mode.Forward.Double` behaves? Or is this simply a consequence of the additional complexity and bookkeeping of reverse-mode AD that just cannot be avoided and is only justified by its better performance for gradients of high dimensionality?
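For reference, a gradient-descent loop in the spirit of the use case described above might look like the minimal sketch below. This is my own illustration, not code from this issue; the `rosenbrock` objective, step size, and iteration count are arbitrary assumptions. It shows the R^n -> R shape, i.e. repeated `RevD.grad` calls, where reverse mode is expected to pay for its per-call overhead.

```haskell
{-# LANGUAGE ImportQualifiedPost #-}
import Numeric.AD.Mode.Reverse.Double qualified as RevD

-- Illustrative R^2 -> R objective; any Traversable container of parameters would do.
rosenbrock :: Num a => [a] -> a
rosenbrock [x, y] = (1 - x) ^ (2 :: Int) + 100 * (y - x * x) ^ (2 :: Int)
rosenbrock _      = error "rosenbrock: expects exactly two parameters"

-- One step of plain gradient descent: x' = x - eta * grad f x.
step :: Double -> [Double] -> [Double]
step eta xs = zipWith (\x g -> x - eta * g) xs (RevD.grad rosenbrock xs)

main :: IO ()
main = print (iterate (step 1e-3) [-1.2, 1.0] !! 5000)
```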