Replies: 1 comment
Hi! There are also a few problems that will make upstreaming challenging, but I'll deal with those in due time.
Documentation is available in the repo. I just wrote it, so it's a quite up-to-date view: https://github.com/marty1885/llama.cpp/blob/metalium-support/docs/backend/Metalium.md There's a list of op support requests in TTNN, which I hope they can get to soon: https://github.com/tenstorrent/tt-metal/issues?q=is%3Aissue+state%3Aopen+label%3AGGML
Feel free to try my codebase. But again, it's quite slow as of now, until the operator support is added in TTNN. Contributions are totally welcome, though IMO most of what can be done is already done; lately I spend more time fixing problems in TTNN itself. With that said, one note:
I'll be speaking at FOSDEM 2025 about this effort. Please come if you happen to be in Brussels. I'd love to hear what people think!
I was going to post a feature request, but the issue template said to post here first. Hopefully I'm not the only one interested in this!
Background
Tenstorrent makes some AI accelerator cards (Grayskull and Wormhole) that connect to a host system over PCIe and are purchasable for reasonable amounts of money (IMO the price would be a lot more reasonable if it was reduced by 50%, but I digress). Probably the most interesting feature of these cards is that the Wormhole variant can be linked to other Wormhole cards over multiple 400G Ethernet links to make one big accelerator with lots of GDDR6 memory--not unlike how Nvidia GPUs can be linked together to share memory over NVLink. Tenstorrent sells some prebuilt systems with this configuration (TT-LoudBox and TT-QuietBox), each including four Wormhole n300 cards with a combined total of 96 GB of device DRAM--enough to run an LLM with 80 billion parameters.
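To sanity-check the "96 GB is enough for an 80B-parameter LLM" claim: that only works with quantized weights, since the weights alone at F16 would already exceed the DRAM. Here's a rough back-of-the-envelope sketch; the bytes-per-weight figures are my own approximations for common GGUF quantization formats, not exact llama.cpp numbers, and KV cache and activations are ignored.

```python
# Approximate bytes per weight for common GGUF quant formats (assumptions).
BYTES_PER_PARAM = {
    "F16": 2.0,       # full half-precision
    "Q8_0": 1.0625,   # ~8.5 bits/weight
    "Q4_K_M": 0.5625, # ~4.5 bits/weight
}

def model_size_gb(n_params: float, quant: str) -> float:
    """Approximate weight size in GB; ignores KV cache and activations."""
    return n_params * BYTES_PER_PARAM[quant] / 1e9

dram_gb = 96  # four Wormhole n300 cards, 24 GB each
for quant in BYTES_PER_PARAM:
    size = model_size_gb(80e9, quant)
    print(f"80B @ {quant}: {size:.0f} GB -> fits in {dram_gb} GB: {size < dram_gb}")
# F16 needs ~160 GB (doesn't fit); Q8_0 ~85 GB and Q4_K_M ~45 GB do fit.
```

So an 80B model fits comfortably at 4-bit quantization and just barely at 8-bit, which matches the "enough to run an LLM with 80 billion parameters" framing.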
Request
I have a Wormhole n300 card, and I'd like to get it working with ollama, which means it needs to work with llama.cpp first. I'm hoping to be able to benchmark it and compare the performance to my 4090.
Existing efforts
@marty1885 appears to be working on a backend for this already here, though I'm not sure what the current status of it is.
Other information
Tenstorrent offers a cloud service with instances that have multiple Grayskull and Wormhole cards. I can't afford to drop $15k on a TT-QuietBox just to find out whether or not the performance is good, but depending on how much the cloud access costs it might be possible to do development and performance testing without breaking the bank.