[Q] A few questions on additional variable metadata and C++ codegen #812

rok-cesnovar opened this issue Feb 11, 2021 · 21 comments

@rok-cesnovar
Member

I have a few questions on how to improve the implementations for the upcoming varmat and OpenCL PRs and make them more maintainable. This would also clean up how we do the transform-MIR passes for OpenCL.

  • How do we store metadata for a variable in order to change the C++ codegen?

Example:

For matrix[N, N] a; in the parameters/TP/model block, the default codegen is

Eigen::Matrix<local_scalar_t__, -1, -1> a;

but with "varmat" we may want to generate that as

conditional_var_value_t<local_scalar_t__, Eigen::MatrixXd> a;

if applicable (conditional_var_value_t<local_scalar_t__, Eigen::MatrixXd> becomes var_value<Eigen::MatrixXd> when local_scalar_t__ is var).
Additionally if we want this parameter to reside on the GPU, that would be

conditional_var_value_t<local_scalar_t__, matrix_cl<double>> a;

We need info on what the underlying C++ type will be, both in the codegen and in the transform-MIR passes where we try to figure out whether varmat/OpenCL is applicable for a variable. The current OpenCL implementation and also my development branches rely on name suffixes (opencl__ and varmat__) to store this info during the transform-MIR phase and to generate the different C++. But that is not ideal and not very maintainable; one possible alternative is sketched below.
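
To make this concrete, here is a minimal sketch (in OCaml, since stanc3 is written in it) of what carrying this information as explicit metadata instead of a name suffix could look like. None of these type or function names exist in stanc3; they are hypothetical stand-ins for illustration only.

(* Hypothetical sketch: a representation tag carried on each declaration,
   consulted both by the transform-MIR passes and by the C++ code generator. *)
type representation =
  | MatVar (* default: Eigen::Matrix<local_scalar_t__, -1, -1> *)
  | VarMat (* conditional_var_value_t<local_scalar_t__, Eigen::MatrixXd> *)
  | OpenCL (* conditional_var_value_t<local_scalar_t__, matrix_cl<double>> *)

type decl =
  { name : string
  ; repr : representation
  }

(* Codegen pattern-matches on the tag instead of inspecting name suffixes. *)
let cpp_type_of (d : decl) : string =
  match d.repr with
  | MatVar -> "Eigen::Matrix<local_scalar_t__, -1, -1>"
  | VarMat -> "conditional_var_value_t<local_scalar_t__, Eigen::MatrixXd>"
  | OpenCL -> "conditional_var_value_t<local_scalar_t__, matrix_cl<double>>"

A transform-MIR pass that decides a variable can live on the OpenCL device would then just rewrite its repr field rather than renaming the variable.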

  • How do we best represent whether a given Stan Math function signature is supported by varmat and OpenCL?

We need lists of supported functions for both varmat and OpenCL. One idea would be to just keep separate lists, but I was hoping we could encode this in the Stan Math signatures somehow (a rough sketch of what that could look like is below).
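
As a strawman, attaching backend support to each signature entry rather than keeping separate lists might look roughly like this. Again, these names are made up for illustration; this is not the actual stanc3 signature table.

(* Hypothetical sketch: each Stan Math signature records which alternative
   backends/representations support it, so varmat and OpenCL share one table. *)
type backend = VarMat | OpenCL

type signature =
  { name : string
  ; args : string list (* simplified: argument types as strings *)
  ; return_type : string
  ; supports : backend list (* empty means CPU matvar only *)
  }

let cholesky =
  { name = "cholesky_decompose"
  ; args = [ "matrix" ]
  ; return_type = "matrix"
  ; supports = [ VarMat; OpenCL ]
  }

(* The transform-MIR passes can then query support uniformly. *)
let supports_backend (s : signature) (b : backend) : bool = List.mem b s.supports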

@VMatthijs
Member

VMatthijs commented Feb 11, 2021

Ah, I guess it would also depend on the design in the surface language, right? What are you planning to do in that respect? I'd always had a design in mind where we add types like matrix_gpu and vector_gpu to the Stan language. We could then implement explicit casts
vector_gpu to_gpu(vector v)
vector from_gpu(vector_gpu v)
etc.
I think explicit casts are a better language design than implicit conversions, as they signal to the user that there is a cost to transferring data to the GPU.

And we could implement any functions that operate on the GPU with the corresponding types then, e.g.
matrix_gpu cholesky_decompose(matrix_gpu A)
as a new signature in addition to the existing
matrix cholesky_decompose(matrix A).

Then, we would just propagate these types throughout the AST and MIR. Would that work or is it not at all related to what you're thinking about?

@VMatthijs
Member

(Sorry if what I'm saying is naive. I've been out of the loop for a year, so I have no idea what the current state of GPU support is.)

@rok-cesnovar
Member Author

Sorry, I should have explained all of this a bit better. There are two projects nearing their final stages: one is varmat and the other is OpenCL support.

  1. Varmat:

This introduces a new way of representing matrices, vectors and row vectors in Stan Math. Instead of Eigen::Matrix<var, -1, -1> (a matrix of vars; the shorthand we use is matvar), the type for matrix is var_value<Eigen::MatrixXd> (a var of a matrix; the shorthand is varmat). This new autodiff type is a lot faster in most cases, but should not be used for variables whose individual elements are accessed, as that can cause a slowdown. If there are no such accesses, varmat will always be faster than matvar. More on that in the design doc and an issue. Our current plan is not to add a new type for this but rather to make it part of the transform-MIR phase; the reasons are given in the design doc. We can always revisit this, as it has not been merged into stanc3 yet. This has no direct connection to the OpenCL work.

The majority of Stan Math functions already support varmat; support for some will be added in the near future, and some functions will probably never support it (for example, functions where we do not define the reverse mode ourselves and just use an underlying Eigen implementation).

  2. OpenCL:

The backend currently supports all _lpmf and _lpdf functions and the majority of other functions. The unsupported functions are listed in stan-dev/math#1854, plus the _cdf/_lcdf functions.

We also have some support in stanc3 for OpenCL, right now only for distributions, as those return scalars and we don't need to worry about what does or does not stay on the OpenCL device. The OpenCL transformations are part of the transform-MIR phase.

What currently happens is as follows:

data {
  int<lower=0> N;
  int<lower=0> n_redcards[N];
  int<lower=0> n_games[N];
  vector[N] rating;
}
parameters {
  vector[2] beta;
}
model {
  beta[1] ~ normal(0, 10);
  beta[2] ~ normal(0, 1);
  target += binomial_logit_lpmf(n_redcards | n_games, beta[1] + beta[2] * rating);
}

binomial_logit_lpmf is identified as a function that will use OpenCL, so the data variables n_redcards and n_games are copied to the OpenCL device in the model constructor (into variables n_redcards_opencl__ and n_games_opencl__), and the transform-MIR pass then rewrites the model block to essentially this (pseudocode):

beta[1] ~ normal(0, 10);
beta[2] ~ normal(0, 1);
target += binomial_logit_lpmf(n_redcards_opencl__ | n_games_opencl__, to_matrix_cl(beta[1] + beta[2] * rating));

This is fine and can provide nice speedups, in the ~5x range for moderate sizes. But we could do more by also supporting other functions. For example:

beta[1] ~ normal(0, 10);
beta[2] ~ normal(0, 1);
target += binomial_logit_lpmf(n_redcards_opencl__ | n_games_opencl__, beta[1] + beta[2] * rating_opencl__);

gets us a bit more:

[figure: benchmark comparing the two variants, showing the additional speedup]

(This was done with develop CmdStan and my local branch of stanc3 that does the additional transform-MIR steps.)

The plan was to try to hide these computational details from the user (they just switch the flag on). I could still be convinced otherwise, though I think the fact that Stan hides a lot of these computational details from the user is a big plus compared to other tools.

@VMatthijs
Member

Very cool! (Sorry, just seeing this now. I need to make sure github emails end up in my inbox.)

I very much agree that the strength of Stan is that it is a high level language and that the user does not have to think much about low level details. Part of my motivation with stanc3 was also to push further in that direction by automating various optimizations and program transformations. (Of course, just reproducing the basic functionality of stanc2 took most of the time I had on my contract. Though I wrote a fair amount of static analysis and optimization code, it wasn't hooked up yet when my contract ended.)

In this particular case, aren't there situations where you'd end up copying back and forth a lot of stuff to the GPU to the point that it's not giving you a speedup anymore? I can see that if all the (high dimensional) arguments are data, you'd probably always want to run the operation on the GPU if possible, as you only need to copy the data there once. But what happens if some of the high dimensional arguments are parameters? Wouldn't the copying get quite expensive in that case? It looks like it's still worth it in the case above with the rating parameter, but aren't there cases where the cost of copying is simply too high? (Perhaps not, just playing the devil's advocate here.)

Also, could you maybe explain a bit more about the difference between the rating and rating_opencl__ cases above? I'm not sure I understand the difference. rating would probably need to get moved to the GPU at some point anyway, right? Or does the top one only perform part of the binomial_logit_lpmf calculation on the GPU and the rest on the CPU?

@VMatthijs
Member

Also, those speed-ups are insane! So cool! This'll allow Bayesian inference to scale to totally new applications once it's done.

@VMatthijs
Member

Even with this design where the compiler essentially decides what gets sent to GPU, perhaps it's useful to just have explicit types for GPU-matrices etc in the MIR? They wouldn't be in the AST then, as the user cannot write them, but one of the optimization passes could then optimize some of the code to operate on GPU-matrices instead and implement the required explicit casts to transfer data to the GPU?
(I say GPU here, but I realize it's a more general opencl thing.)

@rok-cesnovar
Member Author

In this particular case, aren't there situations where you'd end up copying back and forth a lot of stuff to the GPU to the point that it's not giving you a speedup anymore?

Yes, you wouldn't just send B to the GPU to do A = transpose(B).

Right now I split functions into two categories:

  • fast functions: functions we know are faster on the GPU standalone (meaning: copy all arguments, run on the device, and copy the results and adjoints (if needed) back). These are all the lpdf/lpmf functions and a subset of regular math functions.

  • others: all OpenCL-supported functions that don't fall into the first category but should be used if any of their arguments are already on the GPU. We treat data as being on the GPU, as we can move it there without cost.

I then do a pass that moves all fast functions to the GPU, followed by subsequent passes that move additional functions to the GPU if their arguments are already there. A final pass moves any remaining parameters from the GPU back to the CPU (the decision rule is sketched below).
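
To spell out the per-call decision rule that these passes apply, here is a small OCaml sketch; the type and function names are placeholders for illustration, not the actual stanc3 code.

(* Hypothetical sketch of the decision rule described above. *)
type opencl_support =
  | Fast            (* faster on the device even standalone: lpdf/lpmf, etc. *)
  | WorthIfOnDevice (* worth it only if some argument already lives on the device *)
  | Unsupported

(* Data counts as "already on the device": it is copied there once, in the
   model constructor, so the transfer cost is negligible during sampling. *)
let run_on_device (support : opencl_support) ~(any_arg_on_device : bool) : bool =
  match support with
  | Fast -> true
  | WorthIfOnDevice -> any_arg_on_device
  | Unsupported -> false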

See this summary of benchmarks: https://github.com/rok-cesnovar/stan-math-opencl-benchmarks/blob/main/summaries/radeonvii_vs_i7_distributions.md. For data arguments we exclude the data transfer (the transfer happens once per sampling run, so it can be ignored); parameter data transfers are included. If any of the parameters are already on the GPU/OpenCL device, the speedup of that function is similar to the all-data numbers, which is also what we are seeing in the example figure I posted above.

For example, for binomial_lpmf(int[], int[], vector) we have this figure:
[figure: binomial_lpmf GPU vs CPU speedup for different data/parameter argument combinations]

If we do beta[1] + beta[2] * rating on the GPU we get closer to the ideal case. If we don't, then we are equal to the data, data, param case.

You can see that almost all lpxf functions are faster on the GPU even in the worst case. The worst case is when all arguments are parameters and you copy them to the device just for that single lpxf call. This probably isn't a common case, as there is usually at least some expression applied to the parameters, or one of the arguments is data.

Also, could you maybe explain a bit more about the difference between the rating and rating_opencl__ cases above?

The _opencl__ suffix denotes that the variable is already on the OpenCL device, so it represents a matrix_cl<T> object if data, or var_value<matrix_cl<double>> if a parameter. We use the opencl__ suffix because a user isn't allowed to write it in Stan. So what happens is that the C++ model gains three new members:

matrix_cl<int> n_redcards_opencl__;
matrix_cl<int> n_games_opencl__;
matrix_cl<double> rating_opencl__;

and the constructor gains these at the end:

n_redcards_opencl__ = to_matrix_cl(n_redcards);
n_games_opencl__ = to_matrix_cl(n_games);
rating_opencl__ = to_matrix_cl(rating);

So for:

target += binomial_logit_lpmf(n_redcards_opencl__ | n_games_opencl__, beta[1] + beta[2] * rating_opencl__);

we run

beta[2] * rating_opencl__

on the GPU (there is a multiply overload taking a scalar data/parameter on the CPU and anything on the GPU), then also do the addition on the GPU, and finally call this signature in C++:

binomial_logit_lpmf(matrix_cl<int>, matrix_cl<int>, var_value<matrix_cl<double>>);

Even with this design where the compiler essentially decides what gets sent to GPU, perhaps it's useful to just have explicit types for GPU-matrices etc in the MIR?

Oh, I didn't realize that was an option. I would be interested in that, yes, for both this and varmat. We currently work with these __ suffixes, which are cumbersome and not a great design, I guess. It was enough to get this off the ground.

@rok-cesnovar
Member Author

There are similar figures to the binomial one for almost all lpdf/lpmf functions: https://unilj-my.sharepoint.com/:f:/g/personal/rokcesnovar_fri1_uni-lj_si/EgPP9IsCdFJJg0Ygm4_4-HcB9FOozC3FayacTGXeaT6LEg?e=ULRnwA

Not sure anyone is interested but posting anyway :)

@VMatthijs
Member

There are similar figures to the binomial one for almost all lpdf/lpmf functions: https://unilj-my.sharepoint.com/:f:/g/personal/rokcesnovar_fri1_uni-lj_si/EgPP9IsCdFJJg0Ygm4_4-HcB9FOozC3FayacTGXeaT6LEg?e=ULRnwA

Not sure anyone is interested but posting anyway :)

Of course people are interested! This is so cool! Is there any reason the GLMs are missing? Aren't those things you typically would want to do on the GPU?

@VMatthijs
Member

Even with this design where the compiler essentially decides what gets sent to GPU, perhaps it's useful to just have explicit types for GPU-matrices etc in the MIR?

Oh, I didn't realize that was an option. I would be interested in that, yes, for both this and varmat. We currently work with these __ suffixes, which are cumbersome and not a great design, I guess. It was enough to get this off the ground.

Yeah, I see no reason why you couldn't just add some types that are only in the MIR and not in the AST. At the very least, doing this in a type-safe way using some kind of tags rather than suffixes seems like a good idea. It's one of the things OCaml gives us, I guess: easy variant types.
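
For what it's worth, a minimal sketch of that idea (purely illustrative; a simplified stand-in rather than the actual stanc3 type definitions): the MIR's matrix type could gain constructors that the parser never produces, so only optimization passes introduce them.

(* Hypothetical, simplified stand-in for the MIR's matrix type; the real
   stanc3 types are richer. The last two constructors would exist only in
   the MIR: the AST/parser never produces them, only optimization passes do. *)
type mir_matrix_type =
  | HostMatrix   (* ordinary matrix, matvar representation *)
  | VarMatrix    (* var_value<Eigen::MatrixXd>, i.e. varmat *)
  | DeviceMatrix (* matrix_cl / var_value<matrix_cl<double>> on the OpenCL device *)

(* A pass that changes a variable's representation also has to insert explicit
   transfers (e.g. a to_matrix_cl call) wherever representations meet. *)
let transfer_needed ~(from : mir_matrix_type) ~(to_ : mir_matrix_type) : bool =
  from <> to_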

@rok-cesnovar
Member Author

rok-cesnovar commented Feb 18, 2021

This is so cool! Is there any reason the GLMs are missing? Aren't those things you typically would want to do on the GPU?

GLMs are missing because we already knew those are always faster (we did those benchmarks a long time ago), but I have been meaning to run the same benchmarks for them as well. Will add.

Will see if I can make the MIR-only type approach work; if not, I will bug you :) Thanks!

@VMatthijs
Member

@rok-cesnovar , thanks for the explanation of your two pass approach! That's so cool! Really clever to take this approach of first moving some of the stuff to GPU that should always be on GPU, and then seeing what's already on GPU and what further optimizations that enables! That makes sense.

@rok-cesnovar
Member Author

rok-cesnovar commented Feb 18, 2021

Thanks! It's a bit conservative. We have been circling around this idea for about two years, which is how long it took to support all the functions and build the kernel generator. Now for the fun language stuff.

@VMatthijs
Member

VMatthijs commented Feb 18, 2021

Also, I see that you guys have been looking at doing automatic kernel fusion. Is that implemented in Stan already? Is the fusion done in C++ or in OCaml?

I'd also love to understand how the GPU primitives interact with the reduce_sum. Is there any meaningful interaction? Is there a more general reduce (a parallel fold) for any associative operation, or is there only a sum?

@rok-cesnovar
Member Author

rok-cesnovar commented Feb 18, 2021

I bugged @seantalts a year and a half ago with this, which in hindsight was too early :) Underestimated how long the backend would take.

Also, I see that you guys have been looking at doing automatic kernel fusion. Is that implemented in Stan already?

Yes, that is implemented, and I think most of the distributions are implemented using it (some require separate manual kernels). @t4c1 is the mastermind behind that operation. It's done in C++ with expression templating. There is a presentation on it from last year: https://www.youtube.com/watch?v=0wUxv6Mv61g&ab_channel=IWOCL%26SYCLcon

For examples of its use, see a distribution (https://github.com/stan-dev/math/blob/develop/stan/math/opencl/prim/bernoulli_logit_lpmf.hpp) or a non-distribution function (https://github.com/stan-dev/math/blob/develop/stan/math/opencl/rev/mdivide_left_tri_low.hpp). The rest of the kernel generator code is at https://github.com/stan-dev/math/tree/develop/stan/math/opencl/kernel_generator

I'd also love to understand how the GPU primitives interact with the reduce_sum.

We haven't really tried it with reduce_sum, though we do know that splitting a problem into chunks, and thus having a few separate data transfers and GPU kernel invocations, can help with efficiency because GPU kernel invocations can mask the data transfers. So if we are lucky, while thread 1 transfers its data, thread 2 starts its computation, and so on.

We know this helps because we have seen that the GPU vs CPU speedup increases with more chains: the speedup is higher for 10 chains than for 4, meaning the GPU is more efficient with more chains (up to a limit, obviously). The chart below shows, for example, how many chains you can run in the time of a single CPU chain as N increases:

[figure: number of chains that can be run on the GPU in the time of a single CPU chain, as a function of N]

This was done for the same example model from above (the binomial_logit_lpmf).

@rok-cesnovar
Member Author

There are also some docs on the kernel fusion available: https://mc-stan.org/math/d2/d3c/group__opencl__kernel__generator.html

@VMatthijs
Member

Cool! Thanks! It's pretty unbelievable what you guys have been able to achieve in 2 years. :-D

@rok-cesnovar
Member Author

rok-cesnovar commented Feb 18, 2021

I warned Sean about Tadej at the Cambridge StanCon :)

@VMatthijs
Member

@rok-cesnovar , about your two optimization passes for automating what gets moved to GPU: are they implemented/prototyped yet? If not, perhaps I can help with that?
What OCaml stuff can you guys still use some help with?

@rok-cesnovar
Member Author

I have a minimal prototype for it; I will clean it up and make a PR early next week so we can discuss. If you think it's good enough, we can start with that. I would definitely appreciate help with it: improving it, using the MIR type approach you mentioned, reviewing, whatever you want. My goal is to have it ready for the May release, which now finally seems like a realistic goal.

What OCaml stuff can you guys still use some help with?

  • varmat

    I wrote a bit about this one above. It's essentially a somewhat similar story to GPU/OpenCL: automatically and conservatively
    identify where the varmat types can be used. The current plan for the implementation is written here:
    https://github.com/stan-dev/design-docs/blob/master/designs/0005-static-matrices.md#stanc3-implementation
    For this I also have a minimal prototype that we used to evaluate the backend, though it's even more minimal than the OpenCL one. A conservative version here should be easy; if we want to be more proactive with varmat types it gets trickier.

  • complex types

    This should be a straightforward stanc3 task of adding complex_real and complex_vector types. Steve has done some preliminary work in Add a complex type to the stan language #660. There is still some leftover work in the backend.

  • help figuring out the best way to store comments in the AST (Store comments in AST #603). This would make the auto-format and print-canonical features more useful.

These are some off the top of my head.

@VMatthijs
Member

Sounds good, @rok-cesnovar! Just let me know when your PR is ready or if you want to talk anything through with me. Also always happy to do a video call or look over the code. It'd be fun to help out with this and get into stanc3 a bit again.

@WardBrian added the question label on Nov 7, 2022