From the abstract: "Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}."
This would allow larger models on limited resources. However, it isn't a quantization method you can convert existing models to after the fact. It seems models need to be trained from scratch this way, and so far they only went as far as 3B parameters. The paper isn't that long, and it seems they didn't release the models. It builds on the BitNet paper from October 2023.
"the matrix multiplication of BitNet only involves integer addition, which saves orders of energy cost for LLMs." (no floating point matrix multiplication necessary)
"1-bit LLMs have a much lower memory footprint from both a capacity and bandwidth standpoint"
As far as I understand, their contribution is to apply what has proven to work well in the Llama architecture to what BitNet does, and to add a '0' as a third possible weight value. Maybe you just don't need that much text to explain it, just the statistics.
They claim it scales like an FP16 Llama model does... So unless their judgement/maths is wrong, it should hold up. I can't comment on that. But I'd like it if that were true...
I think we're already getting there. Lots of newer phones include AI accelerators. And all the companies advertise AI. I don't think they're made to run LLMs, but anyway. Llama.cpp already runs on phones. And the limiting factor seems to be the RAM. I've tried Microsoft's "phi-2", quantized and on slow hardware, and it's surprisingly capable for such a small model. Something like a ternary model would significantly cut down on the amount of RAM being used, which would allow loading larger models while also making inference faster, everywhere. So I'd say yes. And it would also allow me to load a more intelligent model on my PC.
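To put rough numbers on the RAM argument, here is a back-of-the-envelope sketch. It only counts the weights and assumes an ideal encoding of log2(3) bits per ternary weight; activations, the KV cache and any packing overhead are ignored. The function name and the model sizes are just my own illustration, not something from the paper.

```python
import math

# Information content of one ternary weight {-1, 0, 1}: log2(3), about 1.58 bits.
TERNARY_BITS = math.log2(3)

def weight_memory_gb(n_params, bits_per_param):
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Hypothetical model sizes, chosen only for illustration.
for n_params in (3e9, 7e9, 70e9):
    fp16 = weight_memory_gb(n_params, 16)
    ternary = weight_memory_gb(n_params, TERNARY_BITS)
    print(f"{n_params / 1e9:.0f}B params: FP16 {fp16:.1f} GB, ternary {ternary:.1f} GB")
```

So under these assumptions the weights shrink by roughly a factor of ten compared to FP16, which is the difference between fitting in a phone's RAM or not.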
I think doing away with the multiplications inside the matrix multiplications is also a big deal, but it has few consequences as of today. You'd first need to re-design the chips to take advantage of that. And local inference is typically limited by memory bandwidth, not multiplication speed. At least as far as I understand.
I'd say if this is true, it allows for a big improvement in parameter count for all kinds of use-cases. But I've also come to the conclusion that there might be a caveat to that. Maybe the training is prohibitively expensive. I don't really know, at this point there is too much speculation going on and I'm not really an expert.
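A toy sketch of why ternary weights make the multiplications disappear: with every weight in {-1, 0, 1}, each term of a dot product is either added, subtracted or skipped. This is plain Python for illustration, it assumes the activations are already integers, and `ternary_matvec` is a name I made up. It is not how the paper or any real kernel implements it.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, 1}) by a vector
    using only additions and subtractions."""
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # weight +1: add the activation
            elif w == -1:
                acc -= xi      # weight -1: subtract the activation
            # weight 0: the activation is skipped entirely
        out.append(acc)
    return out

W = [[1, 0, -1],
     [-1, 1, 1]]
x = [3, 5, 2]
result = ternary_matvec(W, x)

# Same result as an ordinary matrix-vector product with multiplications.
reference = [sum(w * xi for w, xi in zip(row, x)) for row in W]
print(result, reference)
```

On today's hardware this doesn't buy much, since the multipliers are already there; the gain would come from chips built around adders.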
Reading up on the speculation on the internet: There must be a caveat... There is probably a reason why they only trained up to 3B parameter models... I mean the team has the name Microsoft underneath and they should have access to enough GPUs. Maybe the training is super (computationally) expensive.
Sure, I meant considerably more expensive than current methods... It's not really a downside if it's as expensive as other methods, because of the huge benefits it has after training is finished (at inference).
If it's just that, the next base/foundation models would surely be designed with this in mind. And companies would soon pick up on it, since the initial investment in training would pay off quickly. And then you have something like an 8x competitive advantage.
They claim it performs at 1.58 bits about as well as something with 16 bits. I don't quite get your question. It seems we can do with less precision / different maths and arrive at the same quality. The total count of parameters isn't affected. But the numbers now don't take 16 bits each, but less.
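On where the odd number in "b1.58" comes from: a weight with three possible values carries log2(3) ≈ 1.58 bits of information. You can't store a fraction of a bit directly, but you can get close by packing several weights together. A minimal sketch, assuming a simple base-3 packing of five weights per byte (3^5 = 243 fits into 256), which works out to 1.6 bits per weight. The function names are my own and I don't know what storage format the authors actually use.

```python
import math

def pack_trits(trits):
    """Pack five ternary weights (-1, 0, 1) into one byte as a base-3 number."""
    assert len(trits) == 5
    value = 0
    for t in trits:
        value = value * 3 + (t + 1)   # map -1, 0, 1 to the digits 0, 1, 2
    return value                       # always in 0..242, so it fits in one byte

def unpack_trits(value):
    """Inverse of pack_trits: recover the five ternary weights from one byte."""
    trits = []
    for _ in range(5):
        trits.append(value % 3 - 1)    # digit 0, 1, 2 back to -1, 0, 1
        value //= 3
    return trits[::-1]                 # digits come out last-first, so reverse

weights = [1, 0, -1, -1, 1]
packed = pack_trits(weights)
print(packed, unpack_trits(packed))
print(f"ideal: {math.log2(3):.3f} bits/weight, this packing: {8 / 5} bits/weight")
```

A simpler layout with 2 bits per weight would waste a bit more space but is easier to decode, and compared to 16 bits that is where a factor of 8 would come from.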
They said theirs is "comparable with the 8-bit models".
It's all tradeoffs. It isn't clear to me where you should allocate your compute/memory budget. I've noticed that full 7B 16-bit models often produce better results for me than some much larger quantized models. It will be interesting to find the sweet spot.