Refactor graph building to reduce duplication #3591
Conversation
Great, thank you for initiating this. From a quick skim, I like it overall, but will look deeper and provide more meaningful feedback in the following days.
Add layer, layers, graph building methods
Hopefully the benefit here will be a bit more obvious. The graph building functions are reduced to:

```cpp
static struct ggml_cgraph * llm_build_llama(
         llama_context & lctx,
     const llama_batch & batch) {
    llm_build_llama_ctx bctx(lctx, batch);
    return bctx.build();
}

static struct ggml_cgraph * llm_build_baichaun(
         llama_context & lctx,
     const llama_batch & batch) {
    llm_build_baichuan_ctx bctx(lctx, batch);
    return bctx.build();
}
```

(Basically they could just be eliminated, but I haven't gotten to that point.)

I fixed a dumb error where I was using the wrong RMS value; LLaMA now runs with identical results. I'll test Baichuan tomorrow and convert a few more model types over. There are a bunch that have very minor changes compared to LLaMA. There are also some where the difference is more extensive (like Falcon) - I probably wouldn't have that inherit from the LLaMA version.
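To make the builder-struct approach a bit more concrete, here is a minimal sketch of what such a struct could look like. The member names and the way the graph is split into methods are assumptions for illustration only, not the actual code in this PR.

```cpp
// Illustrative sketch only -- names, members, and the method split are
// assumptions, not the actual PR code.
struct llm_build_llama_ctx {
    llama_context     & lctx;
    const llama_batch & batch;

    llm_build_llama_ctx(llama_context & lctx, const llama_batch & batch)
        : lctx(lctx), batch(batch) {}

    virtual ~llm_build_llama_ctx() = default;

    // Pieces of the graph live in virtual methods so that derived
    // architectures only override the parts that actually differ.
    virtual struct ggml_tensor * build_attn (struct ggml_tensor * cur, int il);
    virtual struct ggml_tensor * build_layer(struct ggml_tensor * cur, int il);
    virtual struct ggml_cgraph * build();
};
```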
Baichuan and Refact are very close to base LLaMA and differ only by a few lines of code. The remaining models have significantly larger changes. Before I start looking at those, I think I'll wait for a bit more feedback/direction (I also want to make sure I'm really on the right track).

Unfortunately, I'm making these changes without really having much understanding of the model architecture - basically just diffing the graph building functions. So I think someone who actually understands this stuff could build on the general structure once it's in place and do a much better job of splitting things into logical methods/modules. That's also why I'm not going too crazy splitting stuff up, since my choices probably aren't going to be optimal. Uh, also, my naming choices for the structures/methods are absolute garbage. Happy to take suggestions for a better approach.
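As a rough illustration of the "few lines of difference" point, a derived builder could reuse everything from the LLaMA struct and override a single method. The names here are hypothetical, matching the sketch above rather than the PR code.

```cpp
// Hypothetical: Baichuan inherits the LLaMA builder and overrides only the
// attention step that differs; everything else is reused as-is.
struct llm_build_baichuan_ctx : llm_build_llama_ctx {
    using llm_build_llama_ctx::llm_build_llama_ctx; // reuse the constructor

    struct ggml_tensor * build_attn(struct ggml_tensor * cur, int il) override;
};
```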
@ggerganov If you get a chance, could you take a look at this again after the progress since your last comment? I don't necessarily need a full review, just an indication that the current direction is still good, or that changes are needed and I should wait for a more complete review before putting more work into it.

If the direction is good, what I'd probably do is convert the remaining model types over to this structure but not "de-dup" them, since they diverge from base LLaMA in much more complicated ways than the ones I've done so far. So this would basically be adding the infrastructure for someone more knowledgeable to split them into logical chunks. At that point, aside from small stuff like maybe changing naming conventions, I'd consider this pull done.

edit: I guess I should also look at

I feel like there's probably a nicer way to do this that isn't necessarily (only) at the model type level. Instead, maybe we'd still have similar model structs but also something like a RoPE attention module, an ALiBi attention module, etc., and those might also have some model-specific chunks or settings that can be changed or swapped in/out. However, that probably isn't something I can do on my own. If there's anyone who wants to collaborate and basically make the decisions about how to divide stuff up, I can do the grunt work of refactoring.
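For reference, the module idea mentioned above could look something like the sketch below: a small interface for the positional handling in attention, with RoPE and ALiBi variants that a model builder composes rather than inherits. Everything here is a hypothetical sketch, not llama.cpp API.

```cpp
// Hypothetical composition-based alternative to inheritance: the positional
// handling of attention is a swappable module. Names are illustrative only.
struct llm_attn_module {
    virtual ~llm_attn_module() = default;
    // Applies position handling (e.g. rotary embeddings or an ALiBi bias)
    // to the attention inputs/scores for layer `il`.
    virtual struct ggml_tensor * apply(struct ggml_context * ctx,
                                       struct ggml_tensor  * cur,
                                       int il) = 0;
};

struct llm_attn_rope : llm_attn_module {
    struct ggml_tensor * apply(struct ggml_context * ctx,
                               struct ggml_tensor  * cur,
                               int il) override;
};

struct llm_attn_alibi : llm_attn_module {
    struct ggml_tensor * apply(struct ggml_context * ctx,
                               struct ggml_tensor  * cur,
                               int il) override;
};
```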
Can't make a call atm. There are some nice things like reusing certain sections, but there are also indirections due to the inheritance which make it difficult to follow the arch implementations IMO.

Another approach would be to start factoring out the common sections into standalone stateless functions and de-dup the code from the inside-out, for example. I'll get to this at some point after merging #3624

Edit: For sure I can say that Baichuan and Refact inheriting the LLaMA arch is not a good idea. I can see why it is done - to reuse the common code - but it does not make sense when thinking about these objects in a more abstract way.
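A minimal sketch of the "standalone stateless function" direction, assuming a helper that factors out the RMS-norm + weight-scale section that every architecture repeats. The name and signature are illustrative assumptions, not existing llama.cpp helpers.

```cpp
// Illustrative stateless helper: the common RMS-norm + scale section factored
// out so every build function can call it. Name and signature are assumptions.
static struct ggml_tensor * llm_build_norm(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,
        struct ggml_tensor  * norm_weight,
        float                 eps) {
    cur = ggml_rms_norm(ctx, cur, eps);    // normalize
    cur = ggml_mul(ctx, cur, norm_weight); // apply learned scale
    return cur;
}
```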
Thanks, I'll hold off on further changes here for now then. (Might mess around with a different approach if I get the time.)
qwen.cpp has an interesting approach to using ggml, with each logical block of the transformer split into a different class. I think they tried to replicate the PyTorch style with modules. It is probably overkill for what we need, but there are some interesting ideas there.
https://github.com/li-plus/chatglm.cpp is similar to https://github.com/QwenLM/qwen.cpp, but more interesting. It splits the logic into different classes and uses template classes to build different models. But the reason for this approach may be that ChatGLM's architecture is closer to Baichuan's.
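Roughly, the pattern those projects use looks like the sketch below: each logical block is a class with a forward() method, and a templated layer composes them. This is only an illustration of the style, not code from either repository.

```cpp
// Illustration of the PyTorch-style module pattern: each logical block is a
// class with forward(), and a templated layer composes them. Not actual
// qwen.cpp / chatglm.cpp code.
template <typename Attention, typename MLP>
struct transformer_layer {
    Attention attn;
    MLP       mlp;

    struct ggml_tensor * forward(struct ggml_context * ctx, struct ggml_tensor * x) {
        x = attn.forward(ctx, x); // self-attention block
        x = mlp.forward(ctx, x);  // feed-forward block
        return x;
    }
};
```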
Looking to maybe make some progress on #3382 here. This is just a very early preview of the general approach.
Running the model doesn't quite work, so something is wrong in the graph building. It's coherent but apparently the model forgot some of the rules of English: "so, Reynard devised a plan to trick the fisherman.he put on a show and started to laugh.the fisherman walked toward him, pretending to be a friend. as they got closer, Reynard sprang up and bited off the mans hand. with a snap of his jaw, he took the fishermans fish away."
In the next day or so hopefully I'll fix that issue (I doubt it'll be anything too complicated) and implement at least Baichuan or something. It should basically just be a couple of lines of code and a struct that inherits from the `llm_llama_build_ctx` one. At that point it'll be more obvious how this is actually doing something useful.

Does the general approach look promising? The C++ish stuff is hidden away and doesn't get exported to the public API.
(By the way, just in case it isn't clear, the iteration through the layers in `llm_build_llama` will be moving into the struct; I just didn't get to that point yet. So there will probably be a method to run a full layer, and one to run all the layers that calls it.)
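For clarity, that planned split might look roughly like the following, assuming the struct stores the layer count as an `n_layer` member; the names are only illustrative.

```cpp
// Hypothetical shape of the planned refactor: build_layer() builds one
// transformer layer, build_layers() loops over all of them. Assumes the
// struct stores the layer count as n_layer; names are illustrative.
struct ggml_tensor * llm_llama_build_ctx::build_layers(struct ggml_tensor * cur) {
    for (int il = 0; il < n_layer; ++il) {
        cur = build_layer(cur, il); // one full transformer layer
    }
    return cur;
}
```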