
Refactor graph building to reduce duplication #3591

Closed
wants to merge 5 commits

Conversation

KerfuffleV2
Collaborator

@KerfuffleV2 commented Oct 12, 2023

Looking to maybe make some progress on #3382 here. This is just a very early preview of the general approach.

Running the model doesn't quite work, so something is wrong in the graph building. It's coherent but apparently the model forgot some of the rules of English: "so, Reynard devised a plan to trick the fisherman.he put on a show and started to laugh.the fisherman walked toward him, pretending to be a friend. as they got closer, Reynard sprang up and bited off the mans hand. with a snap of his jaw, he took the fishermans fish away."

In the next day or so I'll hopefully fix that issue (I doubt it'll be anything too complicated) and implement at least Baichuan or something. It should basically just be a couple of lines of code and a struct that inherits from the llm_llama_build_ctx one. At that point it'll be more obvious how this is actually doing something useful.

Does the general approach look promising? The C++ish stuff is hidden away and doesn't get exported to the public API.

(By the way, just in case it isn't clear, the layer iteration in llm_build_llama will be moving into the struct as well; I just didn't get to that point yet. So there will probably be a method to run a full layer, and one to run all the layers that calls it.)
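To give a rough idea of the shape I have in mind, here's a sketch (the struct names and build() are from my branch, but build_layer() and the members shown are just made up for illustration):

// Very rough sketch - build_layer() and the members shown here are
// hypothetical names, not the actual layout in this branch.
struct llm_llama_build_ctx {
    llama_context     & lctx;
    const llama_batch & batch;

    llm_llama_build_ctx(llama_context & lctx, const llama_batch & batch)
        : lctx(lctx), batch(batch) {}
    virtual ~llm_llama_build_ctx() = default;

    // Builds one transformer layer; arch-specific structs override only the
    // parts that differ from base LLaMA.
    virtual struct ggml_tensor * build_layer(int il, struct ggml_tensor * inp);

    // Runs build_layer() for every layer, adds the output head and returns
    // the finished graph.
    struct ggml_cgraph * build();
};

// An architecture that's close to LLaMA would then only need something like:
struct llm_build_baichuan_ctx : public llm_llama_build_ctx {
    using llm_llama_build_ctx::llm_llama_build_ctx;

    struct ggml_tensor * build_layer(int il, struct ggml_tensor * inp) override;
};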

@KerfuffleV2 added the demo (Demonstrate some concept or idea, not intended to be merged) label on Oct 12, 2023
@ggerganov
Owner

Great, thank you for initiating this. From a quick skim, I like it overall, but will look deeper and provide more meaningful feedback in the following days.

@KerfuffleV2
Collaborator Author

Hopefully the benefit here will be a bit more obvious. The graph building functions are reduced to

static struct ggml_cgraph * llm_build_llama(
        llama_context & lctx,
        const llama_batch & batch) {
    llm_build_llama_ctx bctx(lctx, batch);
    return bctx.build();
}

static struct ggml_cgraph * llm_build_baichaun(
        llama_context & lctx,
        const llama_batch & batch) {
    llm_build_baichuan_ctx bctx(lctx, batch);
    return bctx.build();
}

(Basically they could just be eliminated, but I haven't gotten to that point.)

I fixed a dumb error where I was using the wrong RMS value; LLaMA now runs with identical results. I'll test Baichuan tomorrow and convert a few more model types over. There are a bunch that have very minor changes compared to LLaMA. There are also some where the difference is more extensive (like Falcon) - I probably wouldn't have that one inherit from the LLaMA version.

@KerfuffleV2
Collaborator Author

Baichuan and Refact are very close to base LLaMA and differ only by a few lines of code. The remaining models have significantly larger changes. Before I start looking at those, I think I'll wait for a bit more feedback/direction (also want to make sure I'm really on the right track).

Unfortunately, I'm making these changes without really having much understanding of the model architectures. I'm basically just diffing the graph-building functions, so I think someone who actually understands this stuff could build on the general structure once it's in place and do a much better job of splitting things into logical methods/modules. That's also why I'm not going too crazy splitting stuff up - my choices probably aren't going to be optimal.

Uh, also, my naming choices for the structures/methods are absolute garbage. Happy to take suggestions for a better approach.

@KerfuffleV2 marked this pull request as ready for review October 13, 2023 06:02
@KerfuffleV2
Collaborator Author

KerfuffleV2 commented Oct 15, 2023

@ggerganov If you get a chance, could you take another look at this given the progress since your last comment? I don't necessarily need a full review - just an indication that the current direction is still good, or that changes are needed and I should wait for a more complete review before putting more work into it.

If so, what I'd probably do is convert the remaining model types over to that structure but not "de-dup" them, since they diverge from base LLaMA in much more complicated ways than the ones I've done so far. So this would basically be adding the infrastructure for someone more knowledgeable to split them into logical chunks. At that point, aside from small stuff like maybe changing naming conventions, I'd consider this pull done. Edit: I guess I should also look at llm_load_tensors; that function is insane (and shouldn't really require specialized knowledge to reduce duplication).

I feel like there's probably a nicer way to do this that isn't necessarily (only) at the model-type level. Like, instead, maybe we still have similar model structs but also something like a RoPE attention module, an ALiBi attention module, etc., and those might also have some model-specific chunks or settings that can be changed or swapped in/out. However, that probably isn't something I can do on my own. If there's anyone who wants to collaborate and basically make the decisions about how to divide stuff up, I can do the grunt work of refactoring.
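Something like this is roughly what I'm imagining, purely as an illustration (none of these names exist anywhere, just sketching the idea):

// Hypothetical sketch of the "module" idea - all names here are made up.
struct llm_attn_module {
    virtual ~llm_attn_module() = default;

    // Builds the attention block for one layer and returns the resulting tensor.
    virtual struct ggml_tensor * build(struct ggml_context * ctx0,
                                       struct ggml_tensor  * cur, int il) = 0;
};

// Position handling is the part that gets swapped in/out per model.
struct llm_attn_rope : llm_attn_module {
    struct ggml_tensor * build(struct ggml_context * ctx0,
                               struct ggml_tensor  * cur, int il) override;
};

struct llm_attn_alibi : llm_attn_module {
    struct ggml_tensor * build(struct ggml_context * ctx0,
                               struct ggml_tensor  * cur, int il) override;
};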

@ggerganov
Owner

ggerganov commented Oct 15, 2023

Can't make a call atm. There are some nice things, like reusing certain sections, but there are also indirections due to the inheritance which make it difficult to follow the arch implementations IMO.

Another approach would be to start factoring out the common sections into standalone stateless functions and de-dup the code from the inside out. For example, KQ_pos, KQ_mask, etc. could be such simple functions. The FFN layer could be another set of functions for the different variations of FFN. Basically, try to factor out as much as possible into stateless functions so that each llm_build_xxx() becomes as short as possible and easier to read. Might want to sketch that and think some more about maybe combining the two approaches in some way.
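E.g. the SwiGLU-style FFN used by LLaMA could become a small stateless helper, roughly like this (just a sketch to show the idea - the name llm_build_ffn_silu and the exact signature are not decided):

// Sketch of a standalone, stateless helper for the SwiGLU-style FFN:
// down( silu(gate(x)) * up(x) )
static struct ggml_tensor * llm_build_ffn_silu(
        struct ggml_context * ctx0,
        struct ggml_tensor  * cur,     // input activations
        struct ggml_tensor  * w_gate,  // gate projection weight
        struct ggml_tensor  * w_up,    // up projection weight
        struct ggml_tensor  * w_down)  // down projection weight
{
    struct ggml_tensor * gate = ggml_silu(ctx0, ggml_mul_mat(ctx0, w_gate, cur));
    struct ggml_tensor * up   = ggml_mul_mat(ctx0, w_up, cur);

    cur = ggml_mul(ctx0, gate, up);
    cur = ggml_mul_mat(ctx0, w_down, cur);

    return cur;
}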

I'll get to this at some point after merging #3624

Edit: For sure I can say that having Baichuan and Refact inherit the LLaMA arch is not a good idea. I can see why it's done (to reuse the common code), but it does not make sense when thinking about these objects in a more abstract way.

@KerfuffleV2
Collaborator Author

Thanks, I'll hold off on further changes here for now then. (Might mess around with a different approach if I get the time.)

@slaren
Collaborator

slaren commented Oct 15, 2023

qwen.cpp has an interesting approach to using ggml, with each logical block of the transformer split into a different class. I think that they tried to replicate the pytorch style with modules. It is probably overkill for what we need, but there are some interesting ideas there.

@yansh97

yansh97 commented Oct 18, 2023

> qwen.cpp has an interesting approach to using ggml, with each logical block of the transformer split into a different class. I think that they tried to replicate the pytorch style with modules. It is probably overkill for what we need, but there are some interesting ideas there.

https://github.com/li-plus/chatglm.cpp is similar to https://github.com/QwenLM/qwen.cpp, but more interesting. It splits the logic into different classes and uses template classes to build different models. But the reason for this approach may be that ChatGLM's architecture is closer to Baichuan's.
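To illustrate the flavor of that approach (a loose sketch only - these are not chatglm.cpp's real class names or API):

// Loose sketch of the template-based composition idea: a generic transformer
// block parameterized on its attention and MLP types. All names are made up.
template <typename Attention, typename MLP>
struct BasicBlock {
    Attention attn;
    MLP       mlp;

    struct ggml_tensor * forward(struct ggml_context * ctx0, struct ggml_tensor * x) {
        // Residual connections around the attention and MLP sub-blocks
        // (norm layers omitted to keep the sketch short).
        x = ggml_add(ctx0, x, attn.forward(ctx0, x));
        x = ggml_add(ctx0, x, mlp.forward(ctx0, x));
        return x;
    }
};

// Different models then become different instantiations, e.g.:
// using LlamaLikeBlock = BasicBlock<RoPESelfAttention, SwiGLUMLP>;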
