What on Earth is the NIH thinking right now?
I mean, what if a moon landing skeptic took over NASA? It’s like that. They literally produce this mountain of evidence and organize all this stuff, and… yeah.
Don’t jinx it.
Especially not if they somehow coincidentally get some government funding.
I’d posit the algorithm has turned it into a monster.
Attention should be dictated more by chronological order and what others retweet, not by what some black box thinks will keep you glued to the screen, and it felt like more of the former in the old days. This is a subtle but very significant change.
On the other hand, the track record of old social networks is not great.
And it’s reasonable to posit Twitter is deep into the enshittification cycle.
The localllama/local LLM community ridicules AMD basically every single day.
They have the hardware. They have 90% of the software. Then they waste it with absolutely nonsensical business decisions, like they are actively trying to avoid the market.
Two phone calls from Lisa Su (one to OEMs lifting VRAM restrictions, another to engineers yelling “someone fix these random bugs with flash attention and torchtune, now”) would absolutely revolutionize the AI space, just for a start… and apparently they couldn’t care less. It’s mind-boggling.
Still perfectly runnable in kobold.cpp. There was a whole community built up around it with Pygmalion.
It is as dumb as dirt though. IMO that is going back too far.
People still run, or even continue pretraining, Llama 2 for that reason, as its training data is pre-slop.
The Facebook/Mastodon format is much better for individuals, no? And Reddit/Lemmy for niches, as long as they’re supplemented by a wiki or something.
And Tumblr. The way content gets spread organically, rather than with an algorithm, is actually super nice.
IMO Twitter’s original premise, of letting novel, original, but very short thoughts fly into the ether, has been so thoroughly corrupted that it can’t really come back. It’s entertaining and engaging, but an awful format for actually exchanging important information, like Discord.
This is called prompt engineering, and it’s been studied objectively and extensively. There are papers where many different personas are benchmarked, or even dynamically created, genetic-algorithm style.
You’re still limited by the underlying LLM though, especially something as dry and hyper-sanitized as OpenAI’s API models.
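For a concrete picture, here’s a rough sketch of what those persona-benchmark papers do, assuming an OpenAI-compatible endpoint; the base_url, model name, personas, and the toy eval/scoring below are all made-up placeholders, not from any specific paper:

    # Rough sketch: benchmark system-prompt "personas" against a tiny eval set.
    # Assumes an OpenAI-compatible server (e.g. a local llama.cpp/vLLM endpoint);
    # the base_url, model name, questions, and scoring are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    personas = {
        "neutral": "You are a helpful assistant.",
        "skeptic": "You are a blunt reviewer who challenges every claim before answering.",
        "teacher": "You are a patient teacher who reasons step by step.",
    }

    eval_set = [("What is 17 * 24?", "408")]  # toy stand-in for a real benchmark

    def score(persona_prompt: str) -> float:
        correct = 0
        for question, answer in eval_set:
            reply = client.chat.completions.create(
                model="local-model",
                messages=[
                    {"role": "system", "content": persona_prompt},
                    {"role": "user", "content": question},
                ],
                temperature=0,
            ).choices[0].message.content
            correct += answer in reply
        return correct / len(eval_set)

    for name, prompt in personas.items():
        print(name, score(prompt))

The “genetic algorithm” variants basically mutate and recombine the best-scoring prompts and re-score them, instead of hand-writing the personas.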
To add to this:
All LLMs absolutely have a sycophancy bias. It’s what the model is built to do. Even wildly unhinged local ones tend to ‘agree’ or hedge, generally speaking, if they have any instruction tuning.
Base models can be better in this respect, as their only goal is ostensibly “complete this paragraph,” like a naive improv actor, but even that’s kind of diminished now because so much ChatGPT output is leaking into training data. And users aren’t exposed to base models unless they’re local LLM nerds.
I don’t know when the goalpost got moved.
Ken Paxton, at least?
BTW, as I wrote that post, Qwen 2.5 Coder 32B came out.
Now a single 3090 can beat GPT-4o, and do it way faster! In coding, specifically.
Yep.
32B fits on a “consumer” 3090, and I use it every day.
72B will fit neatly on 2025 APUs, though we may have an even better update by then.
I’ve been using local llms for a while, but Qwen 2.5, specifically 32B and up, really feels like an inflection point to me.
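If anyone wants to try it, here’s roughly what that looks like with llama-cpp-python; the GGUF filename and quant level are placeholders, but a ~4-bit quant is what lets a 32B model squeeze into 24 GB:

    # Minimal sketch: load a ~4-bit GGUF quant of Qwen 2.5 Coder 32B fully on the GPU.
    # The model path is a placeholder; needs llama-cpp-python built with CUDA.
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen2.5-coder-32b-instruct-q4_k_m.gguf",  # placeholder filename
        n_gpu_layers=-1,   # offload every layer; a Q4 quant of 32B fits in ~24 GB
        n_ctx=8192,        # context length; raise it if VRAM allows
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
        max_tokens=512,
        temperature=0.2,
    )
    print(out["choices"][0]["message"]["content"])

Full GPU offload (n_gpu_layers=-1) is the whole point; once layers spill into system RAM, generation speed falls off a cliff.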
Yeah, well Alibaba nearly (and sometimes) beat GPT-4 with a comparatively microscopic model you can run on a desktop. And released a whole series of them. For free! With a tiny fraction of the GPUs any of the American trainers have.
Bigger is not better, but OpenAI has also just lost their creative edge, and all Altman’s talk about scaling up training with trillions of dollars is a massive con.
o1 is kind of a joke; CoT and reflection strategies have been known for a while. You can do it for free yourself, to an extent, and some models have tried to fine-tune this in: https://github.com/codelion/optillm
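To illustrate how little magic there is, here’s a bare-bones draft/critique/revise loop against any OpenAI-compatible endpoint (local llama.cpp and vLLM servers expose one); the URL and model name are placeholders, and this is just the general idea, not optillm’s actual implementation:

    # Bare-bones "draft, critique, revise" reflection loop.
    # base_url and model are placeholders for whatever local server you run.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
    MODEL = "local-model"

    def ask(prompt: str) -> str:
        return client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content

    question = "A train travels 120 km in 1.5 hours. What is its average speed?"
    draft = ask(f"Think step by step, then answer:\n{question}")
    critique = ask(f"Question:\n{question}\n\nDraft answer:\n{draft}\n\nList any mistakes or gaps in the draft.")
    final = ask(f"Question:\n{question}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\nWrite a corrected final answer.")
    print(final)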
But one sad thing OpenAI has seemingly accomplished is to “salt” the open LLM space. There’s way less hacky experimentation going on than there used to be, which makes me sad, as many of the scene’s “old” innovations still run circles around OpenAI.
One can’t offload “usable” LLMs without tons of memory bandwidth and plenty of RAM. It’s just not physically possible.
You can run small models like Phi pretty quick, but I don’t think people will be satisfied with that for copilot, even as basic autocomplete.
About 2x the memory bandwidth of Intel’s current IGPs is the threshold where offloading becomes viable, IMO. And that’s exactly what AMD/Apple are producing.
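The napkin math, assuming token generation is memory-bandwidth-bound (every weight gets read roughly once per generated token); the bandwidth figures below are illustrative, not exact specs:

    # Rough upper bound: tokens/sec ~ memory bandwidth / bytes read per token,
    # and for dense models every weight is read about once per token generated.
    def max_tokens_per_sec(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
        model_bytes_gb = params_b * bytes_per_param  # e.g. ~0.55 bytes/param for a 4-bit quant with overhead
        return bandwidth_gb_s / model_bytes_gb

    # Illustrative numbers, not exact specs:
    for name, bw in [("128-bit DDR5 IGP", 90), ("Apple M-series Pro-class", 200), ("RTX 3090", 936)]:
        print(f"{name}: ~{max_tokens_per_sec(bw, 32, 0.55):.0f} tok/s ceiling for a 4-bit 32B model")

At IGP-class bandwidth a quantized 32B model crawls along at a few tokens per second; double the bandwidth and it starts to feel usable.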
My level of worry hasn’t lowered in years…
But honestly? Low on the totem pole. Even with Trumpy governments.
Things like engagement optimized social media warping people’s minds for profit, the internet outside of apps dying before our eyes, Sam Altman/OpenAI trying to squelch open source generative models so we’re dependent on their Earth burning plans, blatant, open collusion with the govt, everything turning into echo chambers… There are just too many disasters for me to even worry about the government spying on me.
If I lived in China or Russia, the story would be different. I know, I know. But even now, I’m confident I can give the U.S. president the middle finger in my country, whereas I’d really be more scared for my life in more authoritarian strongman regions.
The localllama crowd is supremely unimpressed with Intel, not just because of software issues but because they just don’t have beefy enough designs, like Apple does, and AMD will soon enough. Even the latest chips are simply not fast enough for a “smart” model, and the A770 doesn’t have enough VRAM to be worth the trouble.
They made some good contributions to runtimes, but seeing how they fired a bunch of engineers, I’m not sure that will continue.
I wouldn’t call that “large.”
Strix Halo (256 bit LPDDR5X, 40 AMD CUs) is where I’d start calling integrated graphics “large.” Intel is going to remain a laughing stock in the gaming world without bigger designs than their little 128-bit IGPs.
If they wanna abandon discrete GPUs… OK.
But they need graphics. They should make M Pro/Max-ish integrated GPUs like AMD is already planning to do, with wide buses, instead of topping out at bottom-end configs.
They could turn around and sell them as GPU-accelerated servers too, like the market is begging for right now.
The biotech industry is already extremely nervous: https://www.axios.com/2024/11/15/rfk-jr-uncertainty-biotech-startups
They don’t like this at all; the hope is that RFK just focuses on other stuff.
Or… it could make investor money fly away and collapse the US biotech industry. Great.