Hardware for local inference?

droopy4096@lemmy.ca · 2 months ago

Hardware for local inference?

BlameThePeacock@lemmy.ca · 2 months ago

I’m running Qwen 3.6 35B A3B (the MoE model) on an Nvidia GPU with 8 GB of VRAM and 32 GB of RAM. With tweaking (and Turboquant) I’ve got it up to 30-40 tokens per second and a 260k context. It’s very usable. I’ve seen people report success with dual 3060 cards, but you’re still talking $1000-1500 for that kind of setup even if you have parts of it already.

sobchak@programming.dev · 2 months ago

The trend I see is Mac Minis with a lot of unified memory, though the buyers are typically very well-off people. Prices for even old GPUs like 3090s are ridiculous now. I don’t think connecting 2 machines over Ethernet would work well, but putting 2 GPUs in a single machine does.

ffhein@lemmy.world · 2 months ago

I bought a used 3090 two years ago, and back then they were usually listed for €800-1000 in my country. I thought I was lucky to find one for €700 after searching for a few months, and I don’t think they’ve ever been cheaper than this here. There are definitely fewer of them available now, but you can still buy one for €950 (and possibly even lower if you’re patient). So prices have gone up, but IMO not by ridiculous amounts like RAM.

robber@lemmy.ml · edit-2 2 months ago

To add some practical advice:

It depends on what you mean by more advanced models. I run Qwen3.6-27b on 48GB VRAM across 3 cards (RTX 2000e Ada), and with the recent software optimizations merged into llama.cpp (tensor parallelism & MTP) I get around 30 tokens per second in generation. I use the model through openwebui for (agentic) web research and simple Q&A mostly and I’m quite happy with what it can do.

If you want something similar, maybe look at one or two second-hand V100 PCIe 32GB cards. Or something from the Intel Arc Pro series, if you don’t mind the software support lagging behind a bit (as in less optimized).

Also it might be worth reading up on the difference between dense and MoE models, if you’re new to that. For MoE models, if your system RAM is fast enough, it’s often viable to offload the “experts” (the largest parts of such models) to RAM, reducing VRAM capacity needs. Note that server motherboards with e.g. octa-channel RAM have a huge advantage over consumer boards (making DDR4 interesting despite the slower speed per module).
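As a rough back-of-the-envelope sketch of why channel count matters so much: theoretical peak memory bandwidth is channels × transfer rate × 8 bytes (a DDR channel is 64 bits wide), and generation speed is roughly capped by how fast the active weights can be read once per token. The function names below are made up for illustration, and the 4 bits per weight is an assumed quantization, not a measurement; real-world numbers will be lower.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak RAM bandwidth in GB/s (64-bit, i.e. 8-byte, channels)."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


def max_tokens_per_s(bandwidth_gbs: float, active_params_billions: float,
                     bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_read_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gbs / gb_read_per_token


if __name__ == "__main__":
    # Assumed example configs: consumer dual-channel vs server octa-channel DDR4-3200.
    for label, channels in [("dual-channel DDR4-3200", 2), ("octa-channel DDR4-3200", 8)]:
        bw = peak_bandwidth_gbs(channels, 3200)
        # 3B active parameters (an "A3B" MoE) at an assumed 4 bits per weight.
        ceiling = max_tokens_per_s(bw, active_params_billions=3, bits_per_weight=4)
        print(f"{label}: {bw:.1f} GB/s peak, at most ~{ceiling:.0f} tokens/s from RAM")
```

This is only a ceiling: it ignores compute, cache effects, and whatever stays in VRAM, but it shows why eight slower channels can beat two faster ones.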

And to address your last question: while I have no direct experience, I’ve seen posts online about people connecting Strix Halo or DGX Spark devices, but usually via a 10+ Gbit/s switch, as the interconnect is crucial (unless you just want to load balance).

Self-hosting LLMs is a very fun thing to do, but also a time- and money-consuming rabbit hole. You might wanna check out the LocalLlama community over at sh.itjust.works.

Edit: typos

solrize@lemmy.ml · 2 months ago

Unless you’re going to really run a lot, this is an area where vast.ai is probably more affordable than mucking with hardware.
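To put a number on "a lot", here is a minimal break-even sketch. Every figure in it is a placeholder assumption (the rental rate is not a quoted vast.ai price, and the power draw and electricity price are made up), so plug in your own numbers.

```python
def break_even_hours(hardware_cost: float, rental_per_hour: float,
                     power_kw: float = 0.0, electricity_per_kwh: float = 0.0) -> float:
    """Hours of use after which owning the hardware becomes cheaper than renting."""
    own_cost_per_hour = power_kw * electricity_per_kwh
    saving_per_hour = rental_per_hour - own_cost_per_hour
    if saving_per_hour <= 0:
        # Running your own box costs at least as much per hour as renting.
        return float("inf")
    return hardware_cost / saving_per_hour


if __name__ == "__main__":
    # Hypothetical inputs: $1000 of hardware, $0.50/h rental,
    # 0.35 kW draw under load, $0.30 per kWh.
    hours = break_even_hours(1000, 0.50, power_kw=0.35, electricity_per_kwh=0.30)
    print(f"Break-even after about {hours:.0f} hours of use")
```

If the result is more hours than you expect to use in a couple of years, renting is probably the cheaper route.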

droopy4096@lemmy.ca · 2 months ago

Thank you, folks. Your input gives me a decent starting point. I’ll start digging based on the info and experiences shared; maybe I can find someone locally selling an old GPU with enough VRAM for cheap.

Sonalder@lemmy.ml · 2 months ago

The sad truth is that Apple Silicon, especially the Ultra chips, is the champion of local inference. Using oMLX instead of ollama gets the most out of it.

In my region older Mac Studios are hard to find, but maybe you will be luckier than I am.

worhui@lemmy.world · 2 months ago

RAM is a big driver of what models you can run, with VRAM at a premium. Equipping 2 separate boxes with enough RAM to load advanced models may be more expensive than just equipping one faster machine.

On the larger models, even with SSD swap, I can’t get them to fully load on my 16 GB of RAM.

droopy4096@lemmy.ca · 2 months ago

Well, I intend to scavenge for parts, as I can’t really afford today’s prices. And since I don’t really know what I should grab as minimum specs, I don’t even know what to look for. I could try to look for old(er) gaming rigs people sell, or maybe there are some business workstations sold in bulk. Either way, knowing the minimum viable set of specs for running Qwen or Claude locally would be helpful.

xylogx@lemmy.world · 2 months ago

What size model? I can run 8-billion-parameter models on my GeForce RTX 3070 with 8 GB of VRAM. Bigger models need more memory. For $1-2k you can upgrade to a 16 or 32 GB video card. For $3k you can get a Framework Desktop with 128 GB of unified memory. For $6k you can get a DGX Spark with a Blackwell chip and 128 GB of unified memory. A Mac Mini or Mac Studio is also a good choice in this price range.
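For a quick sanity check on "what fits where", a rough rule of thumb is weights in GB ≈ parameters (in billions) × bits per weight / 8, plus some headroom for the KV cache and runtime buffers. The sketch below uses that rule; the 1.5 GB overhead is an assumed placeholder (it grows with context length), and the helper names are made up.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float, memory_gb: float,
         overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed fixed overhead fit in the given memory."""
    return weight_gb(params_billions, bits_per_weight) + overhead_gb <= memory_gb


if __name__ == "__main__":
    # Example: an 8B model at 4-bit and 16-bit, and a 35B model at 4-bit.
    for params, bits in [(8, 4), (8, 16), (35, 4)]:
        size = weight_gb(params, bits)
        print(f"{params}B @ {bits}-bit: ~{size:.1f} GB weights, "
              f"fits in 8 GB: {fits(params, bits, 8)}, "
              f"fits in 32 GB: {fits(params, bits, 32)}")
```

By this estimate an 8B model at 4-bit is about 4 GB of weights, which is why it runs on an 8 GB card, while the same model at 16-bit does not fit.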

theunknownmuncher@lemmy.world · 2 months ago

I use Instinct MI60 GPUs. They offer pretty decent performance for local LLMs. Connecting multiple computers is going to be impractical because of the severe bandwidth bottleneck.
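To get a feel for the size of that bottleneck, here is a small sketch comparing Ethernet links with a PCIe 4.0 x16 slot (16 GT/s per lane with 128b/130b encoding). These are theoretical link rates only, the 100 MB payload is an arbitrary example, and the function names are hypothetical.

```python
def ethernet_gbs(gbit_per_s: float) -> float:
    """Ethernet line rate converted from Gbit/s to GB/s (ignores protocol overhead)."""
    return gbit_per_s / 8


def pcie_gbs(gt_per_s_per_lane: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Theoretical one-direction PCIe bandwidth in GB/s."""
    return gt_per_s_per_lane * lanes * encoding / 8


def transfer_ms(megabytes: float, link_gbs: float) -> float:
    """Milliseconds needed to move a payload of the given size over the link."""
    return megabytes / link_gbs


if __name__ == "__main__":
    links = {
        "1 Gbit/s Ethernet": ethernet_gbs(1),
        "10 Gbit/s Ethernet": ethernet_gbs(10),
        "PCIe 4.0 x16": pcie_gbs(16, 16),
    }
    for name, gbs in links.items():
        print(f"{name}: {gbs:.2f} GB/s, 100 MB takes {transfer_ms(100, gbs):.1f} ms")
```

Latency per message adds to this on top, which hurts even more when the machines have to sync for every generated token.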