Not a Onepiece fan, but... get the Discord to plug Lemmy?
Almost all my TV fandoms that live on Reddit seem to be deteriorating and migrating to Discord. I think people from a Discord would post here just to escape that dreadful format.
Trump thanks you for your vote.
It's easy to forget how bad it can get.
One day I wandered into /r/kotakuinaction via some linked comment on the Tomb Raider animation and the Fallout TV series, and... yeah. I remembered.
And that's a pretty mild example.
There's another trolley coming, and she's trying to stop it from running over people again by crashing the first trolley into it.
Sounds like it'd be nice if you had real control over the car's software, and you could roll it back.
This... also makes me a little more wary driving around Teslas in traffic.
Why does Trump always get a pass?
It's like everyone against him has to be an utter saint, and one wrong move? Welp, voting for Trump, I guess.
And yes, strategically... this makes no sense for anyone who cares about what's happening in Israel.
What Shyamalan movie?
(Seriously, it's nothing like the movie).
And it's fine, its entertaining and a spectacle with some emotional moments. I mean, it depends what else is in your TV queue, as there's a TON to watch these days, but I wouldn't skip it just because it's not ATLA.
I want them to dive into exactly that in NATLA, more than they already have.
Maybe it's just me, but I'm really enjoying it as an 'AU' that explores some off-screen scenes and implications from the original. I think people are getting too hung up on it being 'not ATLA' (just like LoK).
He is really betting the farm on Trump getting elected.
Like, he is set if he is, and screwed if he isn't.
I just despise the format and the intense engagement/attention optimization over just finding what you want. It buries information so you have to spend time in it.
Honestly, they should fight fire with fire. Another vision model (like Qwen VL) would catch this.
You can ask it "does this image seem fake?" and it would look at it, reason something out and conclude it's fake, instead of... I dunno, looking for smaller patterns or whatever their internal model does.
He's just waiting for Trump to win, so he'll be pardoned, I guess.
Especially since it means there could be more.
Ants too.
No, from what I've seen it falls off below 4bpw (just more slowly than other models) and makes ~2.25-bit quants somewhat usable instead of totally impractical, much like AQLM.
You are thinking of BitNet, which (so far, despite many tries) requires models to be trained that way from scratch to be effective.
“In terms of what Trump will do for Ukraine, it’s kind of up in the air, I don't believe he will necessarily provide weapons and help so to speak, but what he will do is pressure (Russian President Vladimir Putin),” Lashchyk told the Kyiv Independent.
“I believe that Trump will negotiate some kind of end to the war, do I believe Ukraine will have to cede some of our ground in order for the killing to stop, probably, I’m totally not okay with that but I have no choice.”
I think this is the root problem with... well, everything Trump. Supporters idolize him and see him as some kind of superhero, as if he's really that different (in this case) from the entire US government that's already been "pressuring Putin."
And the problem with lots of American culture in general. Idolization is not good. Seeing people as heroes is not good. No one on Lemmy wants to hear this, but it happens in the Democrats too.
Shouldn't they be in a rush, instead?
They basically have 7 days (or really till January) before NATO could be destabilized and delivering weapons becomes far more complicated.
I saw someone outside the US comment on this, and I don't think people outside the US (and apparently many US voters) are aware of just how anti-Ukraine Trump will be.
Common sense would say "he's gonna be hawkish, right?" They don't know about this: https://en.wikipedia.org/wiki/Trump–Ukraine_scandal
None of those were even close to flagships, they were all short term experiments.
YouTube is the basis for much of Google's portfolio and a steady moneymaker, not an upstart like those.
NATO does drills and exercises, and I'm sure everything is inspected before it's sent. They can just quietly throw away what doesn't work and dodge that bullet.
I see a lot of talk of Ollama here, which I personally don't like because:
- The quantizations they use tend to be suboptimal.
- It abstracts away llama.cpp in a way that, frankly, leaves a lot of performance and quality on the table.
- It abstracts away things you should really know about when hosting LLMs.
- I don't like some things about the devs. I won't rant, but I especially don't like the hint that they're cooking up something commercial.
So, here's a quick guide to get away from Ollama.
- First step: pick your OS. Windows is fine, but if you're setting something up fresh, Linux is best. I favor CachyOS in particular, for its great Python performance. If you use Windows, be sure to enable hardware-accelerated scheduling and disable shared memory.
- Ensure the latest version of CUDA (or ROCm, if you're on AMD) is installed. Linux is great for this, as many distros package them for you.
- Install Python 3.11.x or 3.12.x (or at least whatever your distro supports), plus git. If on Linux, also install your distro's "build tools" package.
Now for actually installing the runtime. There are a great number of inference engines supporting different quantizations; forgive the Reddit link, but see: https://old.reddit.com/r/LocalLLaMA/comments/1fg3jgr/a_large_table_of_inference_engines_and_supported/
As far as I'm concerned, three matter to "home" hosters on consumer GPUs:
- Exllama (and by extension TabbyAPI): a very fast, very memory-efficient "GPU only" runtime. Supports AMD via ROCm and Nvidia via CUDA: https://github.com/theroyallab/tabbyAPI
- Aphrodite Engine: while not strictly as VRAM-efficient, it's much faster with parallel API calls, reasonably efficient at very short context, and supports just about every quantization under the sun, plus more exotic models than exllama. AMD/Nvidia only: https://github.com/PygmalionAI/Aphrodite-engine
- This fork of kobold.cpp, which supports more fine-grained KV cache quantization (we will get to that). It supports CPU offloading and, I think, Apple Metal: https://github.com/Nexesenex/croco.cpp
Now, there are also reasons I don't like llama.cpp, but one of the big ones is that sometimes its model implementations have... quality-degrading issues, or odd bugs. Hence I would generally recommend TabbyAPI if you have enough VRAM to avoid offloading to CPU, and can figure out how to set it up. So:
- Open a terminal and run `git clone https://github.com/theroyallab/tabbyAPI.git`
- `cd tabbyAPI`
- Follow this guide for setting up a Python venv and installing PyTorch and TabbyAPI: https://github.com/theroyallab/tabbyAPI/wiki/01.-Getting-Started#installing
  This can go wrong; if anyone gets stuck, I can help with that.
- Next, figure out how much VRAM you have.
- Figure out how much "context" you want, aka how much text the LLM can ingest. If a model has a context length of, say, "8K", that means it can take up to 8K tokens as input, which is somewhat fewer than 8K words. Not all tokenizers are the same: some, like Qwen 2.5's, fit nearly a word per token, while others are more in the ballpark of half a word per token or less.
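The word/token ratios above can be turned into quick context-budget arithmetic. A tiny sketch; the ratios are ballpark assumptions, not measured tokenizer stats:

```python
# Quick context-budget math. The words-per-token ratios are ballpark
# assumptions, not measured tokenizer statistics.

def words_that_fit(context_tokens: int, words_per_token: float) -> int:
    """Estimate how many words fit in a given context window."""
    return int(context_tokens * words_per_token)

# An efficient tokenizer (~0.9 words/token) vs. a less efficient one (~0.5)
print(words_that_fit(8192, 0.9))  # roughly 7372 words
print(words_that_fit(8192, 0.5))  # roughly 4096 words
```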
- Keep in mind that the advertised context length of many models is an outright lie, see: https://github.com/hsiehjackson/RULER
- Exllama has a feature called "KV cache quantization" that can dramatically shrink the VRAM that an LLM's "context" takes up. Unlike llama.cpp's, its Q4 cache is basically lossless, and on a model like Command-R, an 80K+ context can take up less than 4GB! It's essential to enable Q4 or Q6 cache to squeeze as much LLM as you can into your GPU.
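The savings are easy to ballpark, since KV cache size scales linearly with bytes per element. This sketch uses illustrative GQA-model dimensions (40 layers, 8 KV heads, head dim 128), not Command-R's exact config:

```python
# Why cache quantization matters: KV cache bytes scale linearly with
# context length and bytes per element. Model dimensions here are
# illustrative for a GQA model, not any real model's exact config.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: float) -> float:
    # factor of 2 = one K tensor and one V tensor per layer
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem
    return total_bytes / 1024**3

fp16 = kv_cache_gib(40, 8, 128, 80_000, 2.0)  # unquantized cache
q4 = kv_cache_gib(40, 8, 128, 80_000, 0.5)    # ~4-bit (Q4) cache
print(f"FP16 cache: {fp16:.1f} GiB, Q4 cache: {q4:.1f} GiB")
```

With these assumed dimensions, the Q4 cache lands around 3 GiB for an 80K context, a quarter of the FP16 size, which is consistent with the "80K+ in under 4GB" figure above.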
- With that in mind, you can search huggingface for your desired model. Since we are using TabbyAPI, we want to search for "exl2" quantizations: https://huggingface.co/models?sort=modified&search=exl2
- There are all sorts of finetunes... and a lot of straight-up garbage. But here are some general recommendations based on total VRAM:
- 4GB: A very small quantization of Qwen 2.5 7B. Or maybe Llama 3B.
- 6GB: IMO Llama 3.1 8B is best here. There are many finetunes of it depending on what you want (horny chat, tool usage, math, whatever). For coding, I would recommend Qwen 7B coder instead: https://huggingface.co/models?sort=trending&search=qwen+7b+exl2
- 8GB-12GB: Qwen 2.5 14B is king! Unlike its 7B counterpart, I find the 14B version of the model incredible for its size, and it will squeeze into this VRAM pool (albeit with very short context/tight quantization on the 8GB cards). I would recommend trying Arcee's new distillation in particular: https://huggingface.co/bartowski/SuperNova-Medius-exl2
- 16GB: Mistral 22B, Mistral Coder 22B, and very tight quantizations of Qwen 2.5 34B are possible. Honorable mention goes to InternLM 2.5 20B, which is alright even at 128K context.
- 20GB-24GB: Command-R 2024 35B is excellent for "in context" work, like asking questions about long documents, continuing long stories, anything involving working "with" the text you feed to an LLM rather than pulling from its internal knowledge pool. It's also quite good at longer contexts, out to 64K-80K more or less, all of which fits in 24GB. Otherwise, stick to Qwen 2.5 34B, which still has a very respectable 32K native context and a rather mediocre 64K "extended" context via YaRN: https://huggingface.co/DrNicefellow/Qwen2.5-32B-Instruct-4.25bpw-exl2
- 32GB: same as 24GB, just with a higher-bpw quantization. But this is also the threshold where low-bpw quantizations of Qwen 2.5 72B (at short context) start to make sense.
- 48GB: Llama 3.1 70B (for longer context) or Qwen 2.5 72B (for 32K context or less).
Again, browse huggingface and pick an exl2 quantization that will cleanly fill your VRAM pool + the amount of context you want to specify in TabbyAPI. Many quantizers such as bartowski will list how much space they take up, but you can also just look at the available file size.
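A quick back-of-envelope way to check whether a quant plus its cache fits on your card. The model size, bpw, and cache overhead here are illustrative assumptions; real files vary a bit:

```python
# Back-of-envelope VRAM fit check. Model size, bpw, and cache overhead
# are illustrative assumptions; real quant files vary a bit.

def model_gib(params_billions: float, bpw: float) -> float:
    """Approximate weight size of a quantized model in GiB."""
    return params_billions * 1e9 * bpw / 8 / 1024**3

vram_gib = 24.0
weights = model_gib(32, 4.25)  # e.g. a 32B model at 4.25 bpw
cache = 3.0                    # assumed Q4 KV cache + overhead
fits = weights + cache < vram_gib
print(f"weights ~{weights:.1f} GiB, fits in {vram_gib:.0f} GiB: {fits}")
```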
- Now... you have to download the model. Bartowski has instructions here, but I prefer to use this nifty standalone tool instead: https://github.com/bodaay/HuggingFaceModelDownloader
- Put it in your TabbyAPI models folder, and follow the documentation on the wiki.
- There are a lot of options. Some to keep in mind: chunk_size (higher than 2048 processes long contexts faster but takes up a lot of VRAM; lower saves a little VRAM), cache_mode (use Q4 for long context, Q6/Q8 for short context if you have room), max_seq_len (your context length), tensor_parallel (faster inference with 2 identical GPUs), and max_batch_size (parallel processing if you have multiple users hitting the TabbyAPI server, at the cost of more VRAM).
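As a rough sketch, those options end up in TabbyAPI's YAML config. The exact schema lives in the project's example config.yml, so treat the keys and layout here as illustrative:

```yaml
# Illustrative config fragment; check TabbyAPI's example config.yml
# for the exact schema. Values just encode the tradeoffs above.
model:
  model_name: MyModel-exl2       # folder inside your models directory
  max_seq_len: 32768             # your context length
  cache_mode: Q4                 # Q4 for long context, Q6/Q8 for short
  chunk_size: 2048               # higher = faster long-context ingestion, more VRAM
  tensor_parallel: false         # true with 2 identical GPUs
  max_batch_size: 1              # raise for multiple simultaneous users, costs VRAM
```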
- Now... pick your frontend. The TabbyAPI wiki has a good compilation of community projects, but Open Web UI is very popular right now: https://github.com/open-webui/open-webui
  I personally use exui: https://github.com/turboderp/exui
- And be careful with your sampling settings when using LLMs. Different models behave differently, but one of the most common mistakes is using "old" sampling parameters on new models. In general, keep temperature very low (<0.1, or even zero) and repetition penalty low (1.01?) unless you need long, creative responses. If available in your UI, enable DRY sampling to tamp down repetition without "dumbing down" the model with too much temperature or repetition penalty. Always use a MinP of 0.05 or higher and disable other samplers. This is especially important for Chinese models like Qwen, as MinP cuts out "wrong language" answers from the response.
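The sampling advice above, sketched as an OpenAI-style chat payload. Parameter names like min_p and repetition_penalty vary by server, and the model name is a placeholder:

```python
# Conservative sampler settings shaped as an OpenAI-style chat payload.
# min_p / repetition_penalty support and naming vary by server, and
# the model name is a placeholder.
payload = {
    "model": "my-local-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.1,          # very low unless you want creative output
    "min_p": 0.05,               # prunes unlikely (e.g. wrong-language) tokens
    "repetition_penalty": 1.01,  # keep mild; prefer DRY sampling if available
}
print(payload["temperature"], payload["min_p"], payload["repetition_penalty"])
```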
- Now, once this is all set up and running, I'd recommend throttling your GPU, as it simply doesn't need its full core speed to maximize its inference speed while generating. For my 3090, I use something like `sudo nvidia-smi -pl 290`, which throttles it down from 420W to 290W.
Sorry for the wall of text! I can keep going, discussing kobold.cpp/llama.cpp, Aphrodite, exotic quantization and other niches like that if anyone is interested.
cross-posted from: https://lemmy.world/post/19925986
https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e
Qwen 2.5 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B just came out, with some variants in some sizes just for math or coding, and base models too.
All Apache licensed, all 128K context, and the 128K seems legit (unlike Mistral).
And it's pretty sick, with a tokenizer that's more efficient than Mistral's or Cohere's and benchmark scores even better than llama 3.1 or mistral in similar sizes, especially with newer metrics like MMLU-Pro and GPQA.
I am running 34B locally, and it seems super smart!
As long as the benchmarks aren't straight up lies/trained, this is massive, and just made a whole bunch of models obsolete.
Get usable quants here:
GGUF: https://huggingface.co/bartowski?search_models=qwen2.5
EXL2: https://huggingface.co/models?sort=modified&search=exl2+qwen2.5
Obviously there's not a lot of love for OpenAI and other corporate API generative AI here, but how does the community feel about self hosted models? Especially stuff like the Linux Foundation's Open Model Initiative?
I feel like a lot of people just don't know there are Apache/CC-BY-NC licensed "AI" they can run on sane desktops, right now, that are incredible. I'm thinking of the most recent Command-R, specifically. I can run it on one GPU, and it blows expensive API models away, and it's mine to use.
And there are efforts to kill the power cost of inference and training with stuff like matrix-multiplication-free models, open source and legally licensed datasets, cheap training... and OpenAI and such want to shut down all of this because it breaks their monopoly, where they can just outspend everyone scaling, stealing data and destroying the planet. And it's actually a threat to them.
Again, I feel like corporate social media vs. the fediverse is a good analogy, where one is kinda destroying the planet and the other, while still niche, problematic and a WIP, kills a lot of the downsides.
cross-posted from: https://lemmy.world/post/19242887
I can run the full 131K context with a 3.75bpw quantization, and still a very long one at 4bpw. And it should barely be fine-tunable in unsloth as well.
It's pretty much perfect! Unlike the last iteration, they're using very aggressive GQA, which makes the context small, and it feels really smart at long context stuff like storytelling, RAG, document analysis and things like that (whereas Gemma 27B and Mistral Code 22B are probably better suited to short chats/code).
> Senior U.S., Qatari, Egyptian and Israeli officials will meet on Thursday under intense pressure to reach a breakthrough on the Gaza hostage and ceasefire deal.
> The heads of the Israeli security and intelligence services told Netanyahu at the meeting on Wednesday that time is running out to reach a deal and emphasized that delay and insistence on certain positions in the negotiations could cost the lives of hostages, a senior Israeli official said.
HP is apparently testing these upcoming APUs in a single, 8-core configuration.
The Geekbench 5 ST score is around 2100, which is crazy... but not what I really care about. Strix Halo will have a 256-bit memory bus and 40 CUs, which will make it a monster for local LLM inference.
I am praying AMD sells these things in embedded motherboards with a 128GB+ memory config. Especially in an 8-core config, as I'd rather not burn money and TDP on a 16 core version.
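Token generation is mostly memory-bandwidth bound, which is why that 256-bit bus matters. A rough upper-bound sketch; the transfer rate and model size are hypothetical assumptions:

```python
# Rough bandwidth math for a 256-bit memory bus.
# Transfer rate (MT/s) and model size are hypothetical assumptions.

def bandwidth_gbs(bus_bits: int, mtps: int) -> float:
    """Peak bandwidth in GB/s: bus width in bytes times transfers per second."""
    return bus_bits / 8 * mtps / 1000

def tokens_per_sec(bw_gbs: float, model_gb: float) -> float:
    """Upper bound on generation speed: each token reads every weight once."""
    return bw_gbs / model_gb

bw = bandwidth_gbs(256, 8000)  # 256-bit bus at a hypothetical 8000 MT/s
print(f"{bw:.0f} GB/s, ~{tokens_per_sec(bw, 40):.1f} tok/s on a 40 GB model")
```

At the same transfer rate, a typical dual-channel 128-bit desktop setup would land at half that, which is why a wide-bus APU with lots of memory is so appealing.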
cross-posted from: https://lemmy.world/post/16629163
I cross posted this from c/Avatar, but I am a Trekkie too and don't like this one bit.
FYI previous articles seemed to imply the Sony deal is dead.
Supposedly for petty personal reasons:
> The woman who controls the company, Shari Redstone, snatched defeat from the jaws of victory last week as she scuttled a planned merger with David Ellison's Skydance Media.
> Redstone had spent six months negotiating a complicated deal that would have given control of Paramount to Ellison and RedBird Capital, only to call it off as it neared the finish line.
> The chief reason for her decision: Her reluctance to let go of a family heirloom she fought very hard to get.
The fandom doesn't want to talk about it, but the Avatar franchise is in trouble.
Sony signed a non-disclosure agreement with Paramount allowing deal talks to begin, but they'll not be focused on a $26 billion bid for the whole company.
Avatar Studios seems to be part of Paramount Media, aka the "pay television channels" that I assume Sony is not interested in: https://en.wikipedia.org/wiki/Paramount_Global
And in light of this article: https://deadline.com/2024/05/paramount-sale-hollywood-studio-takeover-history-lessons-1235910245/
That doesn't look good for Avatar Studios. If they are left behind in a Sony sale, it seems the probability of them getting shut down (or just going down with whatever is left of Paramount) is very high.
The article is a very fast read because it's Axios, but in a nutshell, either:
- Skydance gets Paramount intact, but possibly with financial trouble and selling some IP.
- Sony gets Paramount, but restructures the company and also possibly sells some parts.
- Nothing happens... and Paramount continues its downward spiral, probably accelerated by a failed sale.
The can of worms opened today, as Paramount is now officially open to a buyout from Sony.
I don't like this at all. Avatar is a high-budget IP, animesque fantasy, and not historically, provably profitable like Star Trek/Spongebob. Avatar Studios is a real candidate to be chopped off.
As the title says. This includes any visual media, including all 7 Books and other stuff.
What kind of screen do you watch it on? What sound setup? What source?
Screen poll: https://strawpoll.com/e6Z28M9aqnN
Source poll: https://strawpoll.com/Q0ZpRmzaVnM
I'm asking this because:
A: I'm curious how this fandom generally consumes the shows
B: I theorize this may have an impact on the experience. Avatar is an audiovisual feast, and I find I get caught up in the art/music more than many viewers seem to. LoK in particular is like a totally different show with high-bitrate HD vs. a bad stream.