What's the deal with LlamaCPP and caching?

I'm curious what it is doing from a top-down perspective.

I've been playing with a 70B chat model that has several datasets trained on top of Llama 2. There are some unusual features somewhere in this LLM, and I am not sure what was trained versus something else (unusual layers?). The model has built-in roleplaying stories I've never seen other models perform. These stories are not in the Oobabooga Textgen WebUI. The model can do stuff like a Roman gladiator story, and some NSFW stuff. These are not very realistic stories and play out with the depth of a child's video game. They are structured rigidly, like they are coming from a hidden system context.

Like with the gladiator story, it plays out like Tekken on the original PlayStation. No amount of dialogue context about how real gladiators fought will change the story flow. I tried modifying it by adding that gladiators were mostly nonlethal fighters and showmen, more closely aligned with the wrestler-actors that were popular in the 80's and 90's, but no amount of input into the dialogue or system contexts changed the story from a constant series of lethal encounters. These stories could override pretty much anything I added to the system context in Textgen.

There was one story that turned an escape room into objectification of women, and another where name-1 is basically a Loki-like character that makes the user question what is really happening by taking on elements in the system context but changing them slightly. Like I had 5 characters in the system context and it shifted between them circumstantially in a storytelling fashion that was highly intentional with each shift. (I know exactly what a bad system context can do, and what errors look like in practice, especially with this model.) I am 100% certain these are either (over)trained or programmatic in nature. Asking the model to generate a list of built-in roleplaying stories creates a similar list of stories the couple of times I cared to ask. I try to stay away from these "built-in" roleplays as they all seem rather poorly written. I think this model does far better when I write the entire story in the system context. One of the main things the built-in stories do that surprises me is maintaining a consistent set of character identities and features throughout the story. Like the user can pick a trident or gladius, drop into a dialogue that is far longer than the batch size, and then return with the same weapon in the next fight. Normally, I expect that kind of persistence would only happen if the detail was added to the system context.

Is this behavior part of some deeper layer of llama.cpp that I do not see in the Python version or the Textgen source? For example, is there an additional persistent context stored in the cache?


15 comments

What hardware do you have to run 70B and how long does generating take?
- Just a laptop with a 12th-gen i7, a 16GB 3080Ti, and 64GB of DDR5 system memory.
  
  That's a juicy amount of memory for just a laptop.
  Interesting, the fosai site made it appear like 70B models are near impossible to run, requiring 40GB of VRAM, but I suppose it can work with less, just slower.
  The VRAM of your GPU seems to be the biggest factor. That's a reason why, while my current GPU is dying, I can't get myself to spend on a mere 12GB 4070 Ti.
  
  Definitely go for 16gb or greater for the GPU if at all possible. I wrote my own little script to watch the vram usage and temperature that polls the Nvidia kernel driver every 5 seconds then relaxes to polling every ~20 seconds if the usage and temp stay stable within reasonable limits. This is how I dial in the actual max layers to offload onto the GPU along with the maximum batch size I can get away with. Maximizing the offloaded layers can make a big difference in the token generation speed. On the 70B, each layer can sometimes be somewhere between 1.0-2.0 GB when added. It can be weird though. The layers that are offloaded don't always seem to be equal in the models I use. So like, you might have 12 layers that take up 9GBV, at 19 layers you're at 14.5GBV, but then at 20 layers you're at 16.1GBV and it crashes upon loading. There is a working buffer too and this can be hard to see and understand, at least in Oobabooga Textgen WebUI. The model may initially load, but when you do the first prompt submission everything crashes because there is not enough vram for the working buffer. Watching the GPU memory use in real time makes this much more clear. In my experience, the difference in the number of offloaded layers is disproportionately better at 16GBV versus 12 or 8. I would bet the farm that 24GBV would show a similar disproportionate improvement.
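For anyone who wants to try the same kind of watcher, here is a minimal sketch of the idea, not the actual script described above. It assumes `nvidia-smi` is on the PATH and reads the first GPU only. The thresholds, intervals, and function names are all made up for illustration.

```python
import subprocess
import time

# Assumed values for illustration; tune them for your own card.
FAST_INTERVAL_S = 5    # polling interval while readings are still moving
SLOW_INTERVAL_S = 20   # relaxed interval once readings settle
STABLE_MIB = 256       # VRAM change still counted as "stable"
STABLE_TEMP_C = 3      # temperature change still counted as "stable"
STABLE_POLLS = 3       # consecutive stable polls before relaxing

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_sample(line):
    """Parse one CSV line such as '14850, 16384, 71' into integers."""
    used, total, temp = (int(field.strip()) for field in line.split(","))
    return {"used_mib": used, "total_mib": total, "temp_c": temp}


def is_stable(prev, cur):
    """True when VRAM use and temperature barely moved since the last poll."""
    return (abs(cur["used_mib"] - prev["used_mib"]) <= STABLE_MIB
            and abs(cur["temp_c"] - prev["temp_c"]) <= STABLE_TEMP_C)


def next_interval(stable_count):
    """Poll fast until readings have been stable for a few polls in a row."""
    return SLOW_INTERVAL_S if stable_count >= STABLE_POLLS else FAST_INTERVAL_S


def monitor():
    """Print readings for the first GPU until interrupted with Ctrl+C."""
    prev, stable_count = None, 0
    while True:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        cur = parse_sample(out.stdout.splitlines()[0])
        stable_count = stable_count + 1 if prev and is_stable(prev, cur) else 0
        print(f"{cur['used_mib']}/{cur['total_mib']} MiB, {cur['temp_c']} C")
        prev = cur
        time.sleep(next_interval(stable_count))
```

Call `monitor()` in a second terminal while the model loads and during the first prompt, then raise the offloaded layer count until the used figure gets uncomfortably close to the total.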
  The 3080Ti variant is available on laptops from 2022. The Ti variant is VERY important, as there were many 3080 laptops that only have 8GBV. The Ti variant has 16GBV. You can source something like the Aorus YE5 for less than $2k second hand. The only (Linux) nuisances are a lack of control over UEFI keys, and that full control over the RGB keyboard is only available through Windows stalkerware. Personally, I wish I had gotten a machine with more addressable system memory. Some of the ASUS ROG laptops have 96GB of system memory.
  I would not get a laptop with a card like this again though. Just get a cheap laptop and run AI on a machine like a tower with the absolute max possible. If I could be gifted the opportunity to get an AI machine again, I would build a hardcore workstation focusing on the maximum number of cores on enterprise hardware with the most recent AVX512 architecture I can afford. I would also get something with max memory channels and 512GB+ system memory, then I would try throwing a 24GBV consumer level GPU into that. The primary limitation with the CPU is the L2 to L1 cache bus width bottleneck. You want an architecture that maximizes this throughput. With 512GB of system memory I bet it would be possible to load the 180B Falcon model, so at least it is maybe possible to run everything that is currently available in a format that can be tuned and modified to some extent. My 70B quantized models are fun to play with, but I am not aware of a way to train them because I can't load the full model. I must load the prequantized GGUF that uses llama.cpp and can split the model between the CPU and GPU.
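A rough way to sanity-check the "can I load the full model" question is to multiply the parameter count by the bytes per weight. This is only a back-of-the-envelope sketch: the function names and the 16GB headroom figure are assumptions for illustration, and real usage adds context buffers and runtime overhead on top of the weights.

```python
def model_size_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring overhead."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, memory_gb, headroom_gb=16):
    """True if the weights fit while leaving some assumed headroom free."""
    return model_size_gb(params_billion, bits_per_weight) + headroom_gb <= memory_gb


# A 70B model at 16 bits per weight needs about 140 GB for the weights alone,
# which is why it does not load unquantized into 64 GB of system memory.
print(model_size_gb(70, 16), fits(70, 16, 64))
# At roughly 4 bits per weight the same model is about 35 GB.
print(model_size_gb(70, 4), fits(70, 4, 64))
```

Plugging in a model in the couple-hundred-billion parameter range at 16 bits shows why 512GB of system memory is the kind of number that comes up for unquantized weights.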
  
  First and foremost, thank you so much for your detailed information, I really appreciate the depth.
  I am currently in the market for a GPU, and running bigger LLMs is something I really want to get into. Currently I can run 7B and sometimes 13B quantized models super slowly on a Ryzen 5 5600 with 32GB of system RAM. If I offload just a single layer to my cranky 8GB RTX 2070, it crashes.
  A main issue I have is that I use a lot of software that benefits from CUDA, and Stable Diffusion also heavily prefers Nvidia cards, so looking at AMD isn't even an option, regardless of how anti-consumer Nvidia's prices seem to be.
  I've looked at the 4070 and 4070 Ti, but they are limited to just 12GB of VRAM, and like I feared, that just won't do for this use case. That leaves me with only the 80-series cards that have 16GB, still very low for such a high price considering how cheap it would be for Nvidia to just provide more.
  I have spent the entire week looking for a good Black Friday deal, but I guess I am settling on waiting for the 40 Super series to be released in January to just maybe obtain a 4080 Super 20GB... if Nvidia is so kind as to release such a thing without requiring me to sell my firstborn for it.
  You mentioned combining a 24GB VRAM consumer GPU with 512GB of system RAM. Is that because there is no consumer GPU with more than 24GB of VRAM, or because you believe system RAM makes the most actual difference?
  It's already pretty enlightening to hear that CPU and system RAM remain important for LLMs even with a beefy GPU.
  I always thought the goal was to run it 100% on the GPU, but maybe that explains why fosai talks about dual 3090s for hardware requirements while running on the CPU is actually slower but works fine.
  I am hoping to swap that R5 5600X for an R7 5700G for the extra cores and integrated graphics, so I can dedicate the discrete GPU fully without losing any of it to the OS.
  I am probably a long way from upgrading my RAM. Currently I have 4x8GB sticks. I hoped not to need a new motherboard and RAM for at least 4 more years.
  
  GPUs are improving in architecture to a small extent across generations, but that is limited in its relevance to AI stuff. Most GPUs are not made primarily for AI.
  Here is the fundamental compute architecture in a nutshell, as an impromptu class... On a fundamental level, every core of a CPU is almost like an old personal computer from the early days of the microprocessor. It is kinda like an Apple II's 6502 in every core. The whole multi-core structure is like a bunch of those Apple II's working together in a way. If you've ever seen the motherboard for one of these computers or had any interest in building breadboard computers, there are a lot of other chips that are needed to support the actual microprocessor. Almost all of these chips that were needed in the past are still needed and are indeed present inside the CPU. This has made building complete computers much simpler.
  You may have seen the classic quote (now ancient meme) attributed to Bill Gates, that computers will never need more than 640KB of memory. This has to do with how many locations in memory can be directly addressed by the CPU. The spirit of this problem, how much memory can be directly addressed by the processor, is still around today. This problem is one reason why system memory is slow compared to on-die caches. The physical distance plays a big role, but each processor is still limited in how much memory it can address directly. The solution is really quite simple. If you only have, let's say, 4 bits to address memory locations in binary (0000b), then you are able to count from 0 to 15 (1111b = 15) and can access the contents stored in those 16 locations. This is a physical hardware input/output thing where there are physical wires coming out from the die. The solution to get more physical storage space is simply to use the last memory location as an additional register that tells you what additional things need to be done to access more memory. So if the location at 1111b is a register, and that register has 4 bits, we lost an addressable memory location, so we only have 15 available locations in directly addressable memory. But if we look at the contents of memory location 1111b and then use that to engage some external circuitry that will hold this bit state (so like 0001b is detected, and external circuits are used to hold that extra 1 bit high), now we effectively have two 4-bit fields, 0000 & 0000, for addressing (minus the register location in each bank). But with the major caveat that we have to do a bunch of extra work to access those additional locations. The earliest personal computers with processors like the 6502 manually created this kind of memory extension on the circuit board.
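The trick described above is usually called bank switching. Here is a toy sketch of it in Python, assuming a 4-bit address bus where the top address (1111b) is a bank register rather than storage. The class and its names are invented for illustration and do not model any specific machine.

```python
class BankedMemory:
    """Toy 4-bit address bus with a bank register at address 0b1111."""

    ADDRESS_BITS = 4
    BANK_REGISTER = 0b1111  # the highest address selects the bank

    def __init__(self):
        self.window = 2 ** self.ADDRESS_BITS - 1   # 15 directly usable addresses
        self.banks = 2 ** self.ADDRESS_BITS        # a 4-bit register picks 1 of 16 banks
        self.cells = [[0] * self.window for _ in range(self.banks)]
        self.bank = 0

    def write(self, address, value):
        if address == self.BANK_REGISTER:
            # The "external circuitry" latches these bits until the next change.
            self.bank = value & 0b1111
        else:
            self.cells[self.bank][address] = value

    def read(self, address):
        if address == self.BANK_REGISTER:
            return self.bank
        return self.cells[self.bank][address]

    @property
    def total_locations(self):
        return self.window * self.banks


mem = BankedMemory()
mem.write(0b0011, 7)       # stored in bank 0
mem.write(0b1111, 0b0001)  # extra work: switch to bank 1 first
mem.write(0b0011, 9)       # same address, different physical cell
mem.write(0b1111, 0b0000)  # switch back
print(mem.read(0b0011), mem.total_locations)
```

The same 4 address lines now reach 240 cells instead of 15, at the cost of an extra write every time the program needs a different bank.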
  Later computers of the next few generations used a more powerful memory control chip that handled all of the extra bits the CPU could not directly address, without taking so much CPU time to manage the memory, and it started to allow other peripherals to store stuff directly in memory without involving the CPU. To this day, memory is fundamentally accessed the same way in modern computers. The processor has a limited amount of address space it can access, and a peripheral memory controller tries to make the block of memory the processor sees as relevant as possible, as fast as it can. This is a problem when you want to do something all at once that is much larger than this addressing structure can move through.
  So why not just add more addressing pins? Speed and power are the main issues. When you start getting all bits set high, it uses a lot of power and it starts to impact the die in terms of heat and electrical properties (this is as far as my hobbyist understanding takes me comfortably).
  This is where we get to the GPU. A GPU basically doesn't have a memory controller like a CPU. A GPU is very limited in other ways as far as instruction architecture and overall speed. However, a GPU combines memory directly with compute hardware. This means the memory size is directly correlated with the compute hardware. These are some of the largest chunks of silicon you can buy and they are produced on cutting edge fab nodes from the foundries. It isn't market gatekeeping like it may seem at first. Things like how Nvidia sells a 3080 and 3080Ti with 8 and 16 GBV is just garbage marketing idiots ruling the consumer world. In reality the 16GBV version is twice the silicon of the 8GBV.
  The main bottleneck for addressing space, as previously mentioned, is the L2 to L1 bus width and speed. That is hard info to come across.
  The AVX instructions are SIMD (vector) instructions that happen to suit AI-type workloads well. Llama.cpp supports several of these instruction sets. This is ISA, or instruction set architecture, aka the assembly language level. It means this can work much more quickly when a single instruction call can do a complex task. In this case, an AVX512 instruction is supposed to load and process 512 bits all at one time. In practice, it seems some implementations may do two passes of 256 bits for one instruction, but my only familiarity with this is from reading a blog post a couple of months ago about benchmarks and AVX512. This instruction set is mostly found on enterprise (server-class) hardware, or in other words a true workstation (a tower with a server-like motherboard and enterprise-level CPU and memory), although some consumer CPUs include it too.
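To see which AVX variants a given machine actually reports, one option on x86 Linux is to read the `flags` line of `/proc/cpuinfo`. This is a small sketch under that assumption (other operating systems and architectures expose this differently), and the helper names are made up for illustration.

```python
def simd_flags(cpuinfo_text):
    """Return the AVX-related flags found in the text of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            return sorted(f for f in flags if f.startswith("avx"))
    return []


def values_per_register(register_bits, element_bits=32):
    """How many values of a given width one vector register holds."""
    return register_bits // element_bits


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the kernel's CPU description (x86 Linux assumed)."""
    with open(path) as handle:
        return handle.read()


# One 512-bit register holds 16 32-bit floats, versus 8 for 256-bit AVX2.
print(values_per_register(512), values_per_register(256))
```

On a Linux box, `simd_flags(read_cpuinfo())` lists entries such as `avx2` or `avx512f` when the CPU and kernel expose them.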
  I can't say how much this hardware can or can not do. I only know about what I have tried. Indeed, no one I have heard of is marketing their server CPU hardware with AVX512 as a way to run AI in the cloud. This may be due to power efficiency, or it may just be impractical.
  The 24GBV consumer level cards are the largest practically available. The lowest enterprise level card I know of is the A6000 at 48GBV. That will set you back around $3K used and in dubious condition. You can get two new 24GBV consumer cards for that much. If you look at the enterprise gold standard of A100/H100s, you're going to spend $15K for 80GBV. With consumer cards, if you could find a tower that costs $1k and fits 2 cards, then for roughly that money (about $16k) you could get 4 computers, 8 GPUs, and have 192GBV. I think the only reason for the enterprise cards is major training of large models with massive datasets.
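The consumer-card arithmetic can be checked with a few lines. This sketch takes the figures from the comment (two 24GBV cards for about $3K, so $1.5K each, and a $1k tower that holds 2 cards) as given; the function name is invented and prices are obviously assumptions that move around.

```python
def consumer_build(budget, tower_cost, card_cost, cards_per_tower, vram_per_card_gb):
    """Count how many identical towers a budget covers and the total VRAM."""
    per_machine = tower_cost + cards_per_tower * card_cost
    machines = budget // per_machine
    gpus = machines * cards_per_tower
    return {
        "machines": machines,
        "gpus": gpus,
        "vram_gb": gpus * vram_per_card_gb,
        "spent": machines * per_machine,
    }


# Each tower with two cards comes to $4k, so a strict $15k covers three of them.
print(consumer_build(15_000, 1_000, 1_500, 2, 24))
# Four towers (8 GPUs, 192 GBV in total) need $16k.
print(consumer_build(16_000, 1_000, 1_500, 2, 24))
```

Note that the VRAM here is spread across separate machines and cards, so it is a total, not one pool a single model can use without extra work.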
  The reason I think a workstation setup is maybe a good deal for larger models is simply the ability to load large models into memory at a ~$2k price point. I am curious if I could do training for a 70B on a setup like this.
  A laptop with my setup is rather ridiculous. The battery life is a joke with the GPU running. Like I can't use it for 1 hour with AI on the battery. If I want to train a LoRA, I have to put it in front of a window AC unit that is turned up to max cool and leave it there for hours. Almost everything AI is already set up to run on a server/web browser. I like the laptop because I'm disabled with a bedside stand that makes a laptop ergonomic for me. Even with this limitation, a dedicated AI desktop would have been better.
  As far as I can tell, running AI on the CPU does not need super fast clock speeds; it needs more data bus width. This means more cores are better, but not the consumer nonsense of many cores on a single system memory bus channel.
  Hope this helps with the fundamentals outside of the consumer marketing BS.
  I would not expect anything black Friday related to be relevant IMO.
  
  Are you secretly Buildzoid from Actually Hardcore Overclocking?
  I feel like I mentally leveled up just from reading that! I am not sure how to apply all of it to my desktop upgrade plans, but as a lifelong learner, you just pushed me a lot closer to one day fully understanding how computers compute.
  I really enjoyed reading it. <3
  
  Thanks, I never know if I am totally wasting my time with this kind of thing. Feel free to ask questions or talk any time. I got into Arduino and breadboard computer stuff after a broken neck and back 10 years ago. I figured it was something to waste time on while recovering and the interest kinda stuck. I don't know a ton but I'm dumb and can usually over explain anything I think I know.
  As far as compute, learn about the arithmetic logic unit (ALU). That is where the magic happens as far as the fundamentals are concerned. Almost everything else is just registers (aka memory), and these are just arbitrarily assigned to tasks. Like one is holding the next location in running software (program counter), others are for flags with special meaning like interrupts for hardware or software that mean special things if bits are high or low. Ultimately everything getting moved around is just arbitrary meaning applied to memory locations built into the processor. The magic is in the ALU because it is the one place where "stuff" happens like math, comparisons of register values, logic; the fun stuff is all in the ALU.
  Ben Eater's YT stuff is priceless for his exploration of how computers really work at this level.
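To make the ALU idea concrete, here is a toy 8-bit version in Python. It is only a sketch of the concept: the operation names and the two flags are picked for illustration and are not modeled on any particular chip.

```python
def alu(op, a, b, bits=8):
    """Toy ALU: returns (result, flags) for two register values."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    if op == "ADD":
        raw = a + b
    elif op == "SUB":
        raw = a + ((~b) & mask) + 1  # subtraction as two's complement addition
    elif op == "AND":
        raw = a & b
    elif op == "OR":
        raw = a | b
    elif op == "XOR":
        raw = a ^ b
    else:
        raise ValueError(f"unknown operation: {op}")
    result = raw & mask
    # Flags are just more bits that the rest of the processor can test.
    flags = {"zero": result == 0, "carry": raw > mask}
    return result, flags


# 200 + 100 does not fit in 8 bits, so the result wraps and carry is set.
print(alu("ADD", 200, 100))
# Comparing two registers is a subtraction where only the zero flag matters.
print(alu("SUB", 5, 5))
```

Everything else in the processor is about getting the right values into `a` and `b` and deciding what to do with the result and the flags.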
  
  I don't think 512GB of RAM gives you any benefit over, let's say, 96 or 128 GB (in this case). A model and your software are only so big, and the rest of the RAM is just free and sits there unused. What matters for this use case is the bandwidth to get the data from RAM into your CPU. So you need to pay attention to using all channels and pairing the modules correctly. And of course buy fast DDR5 RAM. (But you could end up with lots of RAM anyway if you take it seriously. A dual-CPU AMD Epyc board has like 16 DIMM slots. So you end up with 128GB even if you just buy 8GB modules.)
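A rough way to put numbers on the bandwidth point: theoretical DRAM bandwidth is channels times transfer rate times 8 bytes per 64-bit channel, and a common rule of thumb (an assumption here, not a measurement) is that generating one token reads roughly all of the weights once. The function names and the 40GB model size are illustrative only.

```python
def dram_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s, assuming 64-bit (8-byte) channels."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


def tokens_per_second_ceiling(bandwidth_gbs, model_gb):
    """Rule-of-thumb upper bound: every token streams the whole model once."""
    return bandwidth_gbs / model_gb


# Dual-channel DDR5-4800 (typical consumer) versus an 8-channel server board.
consumer = dram_bandwidth_gbs(2, 4800)
server = dram_bandwidth_gbs(8, 4800)
# An assumed 40 GB quantized model; real speeds will be lower than these ceilings.
print(consumer, tokens_per_second_ceiling(consumer, 40))
print(server, tokens_per_second_ceiling(server, 40))
```

The ceiling scales with the number of populated channels, not with how many gigabytes are installed, which is the point being made above.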
  For other people I have another recommendation: There are cloud services available and you can rent a beefy machine for a few dollars an hour. You can just rent a machine with a 16GB VRAM NVidia. Or 24GB and even 48 or 80GB of VRAM. You can also do training there. I sometimes use runpod.io but there are others, too. Way cheaper than buying a $35,000 Nvidia H100 yourself.
