-
I have been trying to understand the RPC architecture and I've come up with this:
It looks like RPC does not keep track of this padding, so it ends up not mirroring what the (assuming CUDA) backend is actually doing. As a result, it passes wrong tensor information to whatever code coordinates memory allocation and tensor splitting. I guess this is why it doesn't work for Qwen? Is this correct, or am I barking up the wrong tree here?
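A toy sketch of the mismatch I mean, not taken from ggml: it assumes a backend that adds padding bytes whenever a row length is not a multiple of 512 elements, and a client that computes sizes without that padding. The function names and the block layout (32 elements stored in 18 bytes) are illustrative assumptions only.

```python
PAD = 512  # row padding granularity assumed for the backend


def row_bytes(n_elements: int, block_elems: int = 32, block_bytes: int = 18) -> int:
    """Bytes needed for n_elements in a hypothetical block-quantized layout."""
    assert n_elements % block_elems == 0, "row must be a whole number of blocks"
    return n_elements // block_elems * block_bytes


def unpadded_size(ne0: int, n_rows: int) -> int:
    """Size a client would compute if it ignored the backend's padding."""
    return row_bytes(ne0) * n_rows


def padded_size(ne0: int, n_rows: int) -> int:
    """Size a padding backend would request (hypothetical rule: pad once, at the end)."""
    size = unpadded_size(ne0, n_rows)
    if ne0 % PAD != 0:
        size += row_bytes(PAD - ne0 % PAD)
    return size


if __name__ == "__main__":
    ne0, n_rows = 29568, 8192  # 29568 is the intermediate_size discussed in this thread
    client = unpadded_size(ne0, n_rows)
    backend = padded_size(ne0, n_rows)
    print(f"client thinks: {client} bytes, backend wants: {backend} bytes")
    print(f"mismatch: {backend - client} bytes")
```

If the two sides disagree like this, anything that plans allocations or splits tensors from the client-side number would be working with the wrong sizes.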
-
Example of this problem with RPC server accessed over SSH tunnel (it is not actually running on 127.0.0.1).
-
Hello!
I have been experimenting using the following machine configuration:
I have been attempting to run/test the following models. I had to comment out line 467 of llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp (at commit ebdee94).
Based on the line I commented out, I suspect this is because Qwen2.5-72b has an intermediate_size of 29568, which is not divisible by 512 (29568 = 57 × 512 + 384).
If this is the reason, is it possible to get Qwen2.5 working over RPC by implementing CUDA-like padding to a multiple of 512 in ggml-rpc.cpp?
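For reference, the arithmetic behind this guess can be checked with a few lines of Python. This is only a sketch: the helper name pad_to_multiple is made up for illustration, and the 512 granularity is my assumption about what the CUDA backend pads to.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the next multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


INTERMEDIATE_SIZE = 29568  # Qwen2.5-72b intermediate_size, as quoted above
PAD = 512                  # assumed padding granularity

remainder = INTERMEDIATE_SIZE % PAD
padded = pad_to_multiple(INTERMEDIATE_SIZE, PAD)

print(f"remainder: {remainder}")                  # non-zero, so not divisible by 512
print(f"padded size: {padded}")
print(f"extra elements: {padded - INTERMEDIATE_SIZE}")
```

So each row would need 128 extra elements of padding to reach the next multiple of 512.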
I think this RPC functionality is extremely cool. It's a lot more lightweight and configurable for enthusiasts than the options in other engines, which seem geared towards setting up production inference clusters, since they all seem to rely on a Docker + Ray combo.