1w ago

So image generation is where it's at?

Total noob to this space, correct me if I'm wrong. I'm looking at getting new hardware for inference and I'm open to AMD, NVIDIA or even Apple Silicon.

It feels like consumer hardware comparatively gives you more value generating images than trying to run chatbots. Like, the models you can run at home are just dumb to talk to. But they can generate images of comparable quality to online services if you're willing to wait a bit longer.

Like, gpt-oss-120b, assuming you can spare 80GB of memory, is still not GPT-5. But Flux Schnell is still Flux Schnell, right? So if diffusion is the thing, NVIDIA wins right now.

Other options might even be better for other uses, but chatbots are comparatively hard to justify. Maybe for more specific cases like code completion with zero latency or building a voice assistant, I guess.

Am I too off the mark?

13 comments

Framework has an AI machine on the market.
I haven't used it myself but perhaps it's worth looking into for your project.
https://frame.work/gb/en/desktop?tab=machine-learning
- I'm aware of it, seems cool. But I don't think AMD fully supports the ML data types that can be used in diffusion and therefore it's slower than NVIDIA.
  
  Slower? Yes. But the alternative to a Framework Desktop for home use is a 30-40k Nvidia GPU, so I'm fine with slow.
  Not to mention that it is more than fast enough for common use cases: https://github.com/geerlingguy/ollama-benchmark/issues/21#issuecomment-3164570956
  
  I wonder if that's a limitation of mesa?
  Could it be possible with amdvlk?
I run a 14B model that is not too dumb, and definitely worth having as an offline local backup. I also use my NVIDIA 4080 with 16GB VRAM for image and video generation of adequate quality. I'd still say you get better quality from the closed models in some areas, and many open models require far too much VRAM for consumer hardware, but in general all of these use cases work locally, just a bit worse than closed online models. Except voice, which can be just as good.
There is a koboldcpp-rocm fork. Koboldcpp itself has basic image generation. https://github.com/YellowRoseCx/koboldcpp-rocm
- I'm not sure I follow.
  Koboldcpp is set up to call out to an external generator for the images. It isn't doing the image generation computation itself.
  Like, you create a prompt and hand it to koboldcpp. It then computes a textual response. Part of that response is a prompt intended for an image generative AI. It feeds the prompt to something like Stable Diffusion or ComfyUI, and that does the image generation. It takes the output and displays it inline with the text it's generated for you in the KoboldCPP web page. You run both the image generator and koboldcpp side-by-side.
  What OP is complaining about is that he feels that consumer hardware (by which you're probably talking GPUs with up to about 24GB of VRAM) doesn't have enough memory to run large LLMs and get a chatbot on par with what typical cloud-based services are running. He is okay with the image generation side.
  Llama.cpp can split a model across multiple GPUs. In theory, you can run quad 4090s or quad RX 7900 XTXs. Each of those cards has 24GB. The RX 7900 XTXs are maybe $1k each. I'm pretty sure that the 4090s used to go for $1.5k-$2k, but it looks like they're currently about $3k on Amazon. So $4k for the AMD route, and $12k for the Nvidia route.
  For the 7900s, that's about 1420 watts, disregarding the rest of the system. For the 4090s, 1800W. A standard US household circuit is 15A or 20A at 120V, so 1800W or 2400W, which means that in the US you're probably running close to circuit limitations. There are apparently some computer systems that use dual PSUs that can feed off multiple circuits. You'd need a power supply capable of feeding that, and given that this is considerably more heat than a lot of space heaters, probably cooling.
  That'll get you 96GB of VRAM (assuming, as is possible with llama.cpp, that your problem is one that can be split across multiple GPUs). Whether or not that's reasonable consumer hardware may be debatable, but unless you start going to dedicated AI compute hardware, which costs more, that's about what you have to work with.
  There's also the approach some people have taken of using machines with unified memory to get more memory for the GPU. OP mentioned Apple hardware (like a Mac Studio, at up to 192GB), and I mentioned the AMD AI Max machines (at up to 128GB) that Framework is selling. Those probably aren't going to be able to crunch as quickly as dedicated hardware for your given problem, but they're a way to get parallel compute hardware with a lot of memory for less money.
  Running cloud-based will save money if whatever you're doing doesn't have your hardware constantly in use, since otherwise you're paying for idle hardware. That is, it's definitely going to be cheaper to run things sharing hardware if your use case is intermittent use, like a chatbot.
  Llama.cpp does support clustering multiple machines (dunno about for training). I have not done this myself, and if you're thinking about buying hardware to do that, I'd probably look into whether the software you want to run can actually run in that kind of environment, and what kind of performance penalties you're looking at, but it's one possibility.
  But I think that the short answer to "can I locally run hardware that can do what the bleeding edge of cloud-based services are doing" is "yes, but not inexpensively". You are going to need to accept paying more for hardware than you would if you were sharing the cost with other people, performance tradeoffs if you're going to have less-beefy hardware doing the crunching, or quality-of-output tradeoffs if you want to use smaller models or otherwise limit memory usage.
  I personally find running smaller models than GPT-5 locally to have value. But...OP might not; for him, the quality tradeoff might not be acceptable. I also don't mind things running more slowly, but that might not be an acceptable tradeoff for OP. I am not willing to pay for the kind of hardware that cloud-based commercial AI services are using to do AI compute, but it is possible to get that (well, barring any availability issues) if one throws enough money at it.
  It's also possible to use something like vast.ai to rent remote cloud-based parallel compute hardware, if one is comfortable with trusting remote hardware and hoping that whoever is running that hardware isn't snorfling up data from one's compute job (I'd guess probably not, but one never knows, once one's data goes out into the broader world). That's not local, but it might be preferable to using ChatGPT or whatever service, which will be logging user chats.
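The quad-GPU estimate earlier in this comment can be written out as a small sketch. The per-card wattages (355W and 450W) are simply the quoted totals divided by four, the prices are the rough figures mentioned above, and the `GpuBuild` class and `circuit_watts` helper are made-up names for illustration. It ignores CPU, drives, fans, and PSU losses, so treat the headroom numbers as optimistic.

```python
from dataclasses import dataclass


@dataclass
class GpuBuild:
    """Hypothetical container for the rough per-card figures in the comment."""
    name: str
    cards: int
    vram_gb_per_card: int
    watts_per_card: int
    usd_per_card: int

    def total_vram_gb(self) -> int:
        return self.cards * self.vram_gb_per_card

    def total_watts(self) -> int:
        return self.cards * self.watts_per_card

    def total_usd(self) -> int:
        return self.cards * self.usd_per_card


def circuit_watts(amps: int, volts: int = 120) -> int:
    """Nominal capacity of a household circuit (US 120V assumed)."""
    return amps * volts


# Figures taken from the comment; not measured or current prices.
builds = [
    GpuBuild("4x RX 7900 XTX", 4, 24, 355, 1000),
    GpuBuild("4x RTX 4090", 4, 24, 450, 3000),
]

for build in builds:
    print(f"{build.name}: {build.total_vram_gb()}GB VRAM, "
          f"{build.total_watts()}W, ${build.total_usd()}")
    for amps in (15, 20):
        # GPU draw only; the rest of the system eats into this margin.
        headroom = circuit_watts(amps) - build.total_watts()
        print(f"  {amps}A circuit: {headroom}W left before the nominal limit")
```

Swapping in other card counts or prices shows quickly where a single circuit stops being enough.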
  
  No, you can run SD and Flux-based models inside koboldcpp itself. You can try it out using the original koboldcpp in Google Colab. It loads GGUF models. Related discussion on Reddit: https://www.reddit.com/r/StableDiffusion/comments/1gsdygl/koboldcpp_now_supports_generating_images_locally/
  Edit: Sorry, I kinda missed the point, maybe I was sleepy when writing that comment. Yeah, I agree that LLMs need a lot of memory to run, which is one of their downsides. I remember someone doing a comparison showing that an API with token-based pricing is cheaper than running the model locally. But running image generation locally is cheaper than an API with step+megapixel pricing.
"It feels like consumer hardware comparatively gives you more value generating images than trying to run chatbots"
While I personally get more use out of the hardware that way, you're also posting this to an LLM community. You're probably going to get people who do use LLMs more.
I also don't think that "Image diffusion models small, LLM models large" is likely some sort of constant --- I'm sure that the image generation people can make use of larger models --- and the hardware is going to be a moving target. Those Framework Desktop machines have up to 128GB of unified memory, for example.
- That's a good point, but it seems that there are several ways to make models fit in smaller memory hardware. But there aren't many options to compensate for not having the ML data types that allow NVIDIA to be like 8x faster sometimes.
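
One common way to make a model fit in less memory is to store its weights at lower precision. A rough sketch of the arithmetic, assuming weight memory is just parameter count times bits per weight: the 14B and 120B sizes come from models mentioned in this thread, while the bit widths (16, 8, 4) and the `weight_memory_gb` helper are illustrative assumptions. It ignores the KV cache, activations, and the extra scale data that real quantized formats store, so actual usage will be higher.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


# Model sizes mentioned in the thread; bit widths are illustrative.
for params in (14, 120):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{params}B params at {bits}-bit: ~{gb:.0f}GB for weights")
```

By this estimate a 14B model at 4-bit needs roughly 7GB for weights, which is why it can sit on a 16GB card, while a 120B model stays well out of reach of a single 24GB GPU even at 4-bit.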
