Refactor graph building to reduce duplication #3591
Conversation
Great, thank you for initiating this. From a quick skim, I like it overall, but will look deeper and provide more meaningful feedback in the following days.
Add layer, layers, graph building methods
Hopefully the benefit here will be a bit more obvious. The graph building functions are reduced to:

```cpp
static struct ggml_cgraph * llm_build_llama(
         llama_context & lctx,
     const llama_batch & batch) {
    llm_build_llama_ctx bctx(lctx, batch);
    return bctx.build();
}

static struct ggml_cgraph * llm_build_baichaun(
         llama_context & lctx,
     const llama_batch & batch) {
    llm_build_baichuan_ctx bctx(lctx, batch);
    return bctx.build();
}
```

(Basically they could just be eliminated, but I haven't gotten to that point.)

I fixed a dumb error where I was using the wrong RMS value; LLaMA now runs with identical results. I'll test Baichuan tomorrow and convert a few more model types over. There are a bunch that have very minor changes compared to LLaMA. There are also some where the difference is more extensive (like Falcon) - I probably wouldn't have that inherit from the LLaMA version.
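To make the builder-struct approach a bit more concrete, here is a minimal sketch of what such a struct could look like. The member names and the way the graph is split into methods are assumptions for illustration only, not the actual code in this PR.

```cpp
// Illustrative sketch only -- names, members, and the method split are
// assumptions, not the actual PR code.
struct llm_build_llama_ctx {
    llama_context     & lctx;
    const llama_batch & batch;

    llm_build_llama_ctx(llama_context & lctx, const llama_batch & batch)
        : lctx(lctx), batch(batch) {}

    virtual ~llm_build_llama_ctx() = default;

    // Pieces of the graph live in virtual methods so that derived
    // architectures only override the parts that actually differ.
    virtual struct ggml_tensor * build_attn (struct ggml_tensor * cur, int il);
    virtual struct ggml_tensor * build_layer(struct ggml_tensor * cur, int il);
    virtual struct ggml_cgraph * build();
};
```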
Baichuan and Refact are very close to base LLaMA and differ only by a few lines of code. The remaining models have significantly larger changes. Before I start looking at those, I think I'll wait for a bit more feedback/direction (I also want to make sure I'm really on the right track).

Unfortunately, I'm making these changes without really having much understanding of the model architecture - basically just diffing the graph building functions. So I think someone who actually understands this stuff could build on the general structure once it's in place and do a much better job of splitting things into logical methods/modules. That's also why I'm not going too crazy splitting stuff up, since my choices probably aren't going to be optimal. Uh, also, my naming choices for the structures/methods are absolute garbage. Happy to take suggestions for a better approach.
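As a rough illustration of the "few lines of difference" point, a derived builder could reuse everything from the LLaMA struct and override a single method. The names here are hypothetical, matching the sketch above rather than the PR code.

```cpp
// Hypothetical: Baichuan inherits the LLaMA builder and overrides only the
// attention step that differs; everything else is reused as-is.
struct llm_build_baichuan_ctx : llm_build_llama_ctx {
    using llm_build_llama_ctx::llm_build_llama_ctx; // reuse the constructor

    struct ggml_tensor * build_attn(struct ggml_tensor * cur, int il) override;
};
```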
@ggerganov If you get a chance, could you take a look at this again after the progress since your last comment? I don't necessarily need a full review, just an indication that the current direction is still good, or that changes are needed and I should wait for a more complete review before putting more work into it.

If the direction is good, what I'd probably do is convert the remaining model types over to this structure but not "de-dup" them, since they diverge from base LLaMA in much more complicated ways than the ones I've done so far. So this would basically be adding the infrastructure for someone more knowledgeable to split them into logical chunks. At that point, aside from small stuff like maybe changing naming conventions, I'd consider this pull done.

edit: I guess I should also look at

I feel like there's probably a nicer way to do this that isn't necessarily (only) at the model type level. Instead, maybe we'd still have similar model structs but also something like a RoPE attention module, an ALiBi attention module, etc., and those might also have some model-specific chunks or settings that can be changed or swapped in/out. However, that probably isn't something I can do on my own. If there's anyone who wants to collaborate and basically make the decisions about how to divide stuff up, I can do the grunt work of refactoring.
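For reference, the module idea mentioned above could look something like the sketch below: a small interface for the positional handling in attention, with RoPE and ALiBi variants that a model builder composes rather than inherits. Everything here is a hypothetical sketch, not llama.cpp API.

```cpp
// Hypothetical composition-based alternative to inheritance: the positional
// handling of attention is a swappable module. Names are illustrative only.
struct llm_attn_module {
    virtual ~llm_attn_module() = default;
    // Applies position handling (e.g. rotary embeddings or an ALiBi bias)
    // to the attention inputs/scores for layer `il`.
    virtual struct ggml_tensor * apply(struct ggml_context * ctx,
                                       struct ggml_tensor  * cur,
                                       int il) = 0;
};

struct llm_attn_rope : llm_attn_module {
    struct ggml_tensor * apply(struct ggml_context * ctx,
                               struct ggml_tensor  * cur,
                               int il) override;
};

struct llm_attn_alibi : llm_attn_module {
    struct ggml_tensor * apply(struct ggml_context * ctx,
                               struct ggml_tensor  * cur,
                               int il) override;
};
```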
Can't make a call atm. There are some nice things like reusing certain sections, but there are also indirections due to the inheritance which make it difficult to follow the arch implementations IMO.

Another approach would be to start factoring out the common sections into standalone stateless functions and de-dup the code from the inside-out, for example. I'll get to this at some point after merging #3624

Edit: For sure I can say that Baichuan and Refact inheriting the LLaMA arch is not a good idea. I can see why it is done - to reuse the common code - but it does not make sense when thinking about these objects in a more abstract way.
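A minimal sketch of the "standalone stateless function" direction, assuming a helper that factors out the RMS-norm + weight-scale section that every architecture repeats. The name and signature are illustrative assumptions, not existing llama.cpp helpers.

```cpp
// Illustrative stateless helper: the common RMS-norm + scale section factored
// out so every build function can call it. Name and signature are assumptions.
static struct ggml_tensor * llm_build_norm(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,
        struct ggml_tensor  * norm_weight,
        float                 eps) {
    cur = ggml_rms_norm(ctx, cur, eps);    // normalize
    cur = ggml_mul(ctx, cur, norm_weight); // apply learned scale
    return cur;
}
```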
Thanks, I'll hold off on further changes here for now then. (Might mess around with a different approach if I get the time.)
qwen.cpp has an interesting approach to using ggml, with each logical block of the transformer split into a different class. I think they tried to replicate the PyTorch style with modules. It is probably overkill for what we need, but there are some interesting ideas there.
https://github.com/li-plus/chatglm.cpp is similar to https://github.com/QwenLM/qwen.cpp, but more interesting. It splits the logic into different classes and uses template classes to build different models. But the reason for this approach may be that ChatGLM's architecture is closer to Baichuan's.
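Roughly, the pattern those projects use looks like the sketch below: each logical block is a class with a forward() method, and a templated layer composes them. This is only an illustration of the style, not code from either repository.

```cpp
// Illustration of the PyTorch-style module pattern: each logical block is a
// class with forward(), and a templated layer composes them. Not actual
// qwen.cpp / chatglm.cpp code.
template <typename Attention, typename MLP>
struct transformer_layer {
    Attention attn;
    MLP       mlp;

    struct ggml_tensor * forward(struct ggml_context * ctx, struct ggml_tensor * x) {
        x = attn.forward(ctx, x); // self-attention block
        x = mlp.forward(ctx, x);  // feed-forward block
        return x;
    }
};
```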
Looking to maybe make some progress on #3382 here. This is just a very early preview of the general approach.
Running the model doesn't quite work, so something is wrong in the graph building. It's coherent but apparently the model forgot some of the rules of English: "so, Reynard devised a plan to trick the fisherman.he put on a show and started to laugh.the fisherman walked toward him, pretending to be a friend. as they got closer, Reynard sprang up and bited off the mans hand. with a snap of his jaw, he took the fishermans fish away."
In the next day or so hopefully I'll fix that issue (I doubt it'll be anything too complicated) and implement at least Baichuan or something. It should basically just be a couple of lines of code and a struct that inherits from the `llm_llama_build_ctx` one. At that point it'll be more obvious how this is actually doing something useful.

Does the general approach look promising? The C++ish stuff is hidden away and doesn't get exported to the public API.
(By the way, just in case it isn't clear, the iteration through the layers in `llm_build_llama` will be moving into the struct; I just didn't get to that point yet. So there will probably be a method to run a full layer, and one to run all the layers that calls it.)
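For clarity, that planned split might look roughly like the following, assuming the struct stores the layer count as an `n_layer` member; the names are only illustrative.

```cpp
// Hypothetical shape of the planned refactor: build_layer() builds one
// transformer layer, build_layers() loops over all of them. Assumes the
// struct stores the layer count as n_layer; names are illustrative.
struct ggml_tensor * llm_llama_build_ctx::build_layers(struct ggml_tensor * cur) {
    for (int il = 0; il < n_layer; ++il) {
        cur = build_layer(cur, il); // one full transformer layer
    }
    return cur;
}
```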