
GLM-5.2 is the new leading open weights model on Artificial Analysis

by himata4113 | 327 points | 147 comments | 2026-06-17 04:12:00 Central


Comments

kristopolous
I have a script that ranks these based on codingindex from
Artificial Analysis. All it does is pull a JSON from their
main table page and parse out the fields I care about
(coding).

There used to be a mailing list associated with it but eh
... there wasn't much interest. I use the script every day
though.

Current partial output:

score age  size   name
47.1  58   large  Kimi K2.6
47.5  54   large  DeepSeek V4 Pro (Reasoning, Max Effort)
47.5  70   -      Muse Spark
47.6  132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
47.8  205  -      Claude Opus 4.5 (Reasoning)
48.1  132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
48.6  55   -      GPT-5.5 (Non-reasoning)
48.7  188  -      GPT-5.2 (xhigh)
50.1  29   -      Qwen3.7 Max
50.7  1    large  GLM-5.2 (max)
50.9  120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
51.5  92   -      GPT-5.4 mini (xhigh)
52.1  55   -      GPT-5.5 (low)
52.5  62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
53.1  132  -      GPT-5.3 Codex (xhigh)
53.1  62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
55.5  118  -      Gemini 3.1 Pro Preview
56.2  55   -      GPT-5.5 (medium)
56.7  20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
57.2  104  -      GPT-5.4 (xhigh)
58.5  55   -      GPT-5.5 (high)
59.1  55   -      GPT-5.5 (xhigh)
62    8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)

Run it like so:

$ curl day50.dev/art-analysis.sh | bash

Official repo where it lives:
https://github.com/day50-dev/aa-eval-email

Some key takeaways:
* Open models are on about a 4-7 month lag right now,
  depending on how you want to measure it.
* If this keeps up, you might see an open-weights model
  doing Claude Fable 5 level work before the new year.

If people sign up for the free mailing list (that just does
this) I'll go and put it back on ... emails when new model
evals drop - it was pretty useful.
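For anyone who would rather not pipe curl into bash, here is a rough Python sketch of the same ranking idea. It is not the actual art-analysis.sh. The field names (`name`, `coding_index`, `age_days`, `size`) and the JSON layout are assumptions for illustration, and the real Artificial Analysis payload may look quite different. Only the three sample rows are taken from the output above.

```python
import json

# Hypothetical payload: field names and layout are assumptions,
# not the real Artificial Analysis JSON. Values come from the table above.
SAMPLE = json.loads("""
[
  {"name": "GLM-5.2 (max)", "coding_index": 50.7, "age_days": 1, "size": "large"},
  {"name": "Kimi K2.6", "coding_index": 47.1, "age_days": 58, "size": "large"},
  {"name": "Qwen3.7 Max", "coding_index": 50.1, "age_days": 29, "size": null}
]
""")


def rank(models, best_last=True, max_age=None):
    """Sort by coding score; best_last=True keeps the best model at the bottom."""
    rows = [m for m in models if max_age is None or m["age_days"] <= max_age]
    return sorted(rows, key=lambda m: m["coding_index"], reverse=not best_last)


def format_row(m):
    """Render one row in the score/age/size/name layout."""
    return f'{m["coding_index"]:<5} {m["age_days"]:<4} {m["size"] or "-":<6} {m["name"]}'


if __name__ == "__main__":
    print("score age  size   name")
    for model in rank(SAMPLE):
        print(format_row(model))
```

Passing `best_last=False` gives the best-on-top order, and `max_age=30` keeps only models from the last 30 days.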

  > papersail
score age  size   name
62.0  8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
59.1  55   -      GPT-5.5 (xhigh)
58.5  55   -      GPT-5.5 (high)
57.2  104  -      GPT-5.4 (xhigh)
56.7  20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
56.2  55   -      GPT-5.5 (medium)
55.5  118  -      Gemini 3.1 Pro Preview
53.1  132  -      GPT-5.3 Codex (xhigh)
53.1  62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
52.5  62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
52.1  55   -      GPT-5.5 (low)
51.5  92   -      GPT-5.4 mini (xhigh)
50.9  120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
50.7  1    large  GLM-5.2 (max)
50.1  29   -      Qwen3.7 Max
48.7  188  -      GPT-5.2 (xhigh)
48.6  55   -      GPT-5.5 (Non-reasoning)
48.1  132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
47.8  205  -      Claude Opus 4.5 (Reasoning)

    > > tcp_handshaker
Short comments...
- GPT 5.5 is consistently the best, an opinion which gets
  me constant downvotes here from the Anthropic Marketeer
  strike force...
- China is going to eat the US's lunch on AI.
- What have European universities and companies been
  doing? It's as if, in a parallel past/future, Nikola
  Tesla and Edison had created flying Cyberpunk machines
  while European researchers were getting together to
  request EU funds for an investigation into how to breed
  faster horses.
- If Zuckerberg could be fired, after spending a total of
  $235 billion on AI and having NOTHING to show for
  it... should he be fired?

      > > > Certhas
None of these models come from universities, European or
otherwise.

Mistral is clearly not competing for a frontier model at
the moment. Whether this is due to a lack of VC funds, a
lack of technical ability, or the former arising from the
latter would be interesting to know.

The top models are from startups. Among the FAANG only
Google managed to get a frontier model, and they literally
invented the architecture and have more money than they
can possibly spend to throw at the problem. Facebook
shows that even ungodly amounts of money don't get you
there though.

So why did no EU-based startups succeed while two US
startups succeeded? I agree that that's a very important
question the EU should ask. The Internet revolution was
driven by US companies, and now AI will be as well, with
Chinese open weights mixed in. The EU consistently cannot
turn its considerable economic output into fast-moving
tech firms.

      > > > kristopolous
They did Muse Spark ... it's not garbage.

Also what are they building it for? I'd think it's
to serve ads better or something like that.
Maybe Muse Spark fits Facebook's needs
perfectly...

  > alecco
Consider using decrementing score order (best on top)
    > > kristopolous
then I'd have to scroll up over 500 lines after
running it every time to see what I care about.

But if that's your thing, here you go:
https://github.com/day50-dev/aa-eval-email/commit/1853be6461...

Add an argument (any argument) and it will be sorted as
you specified. It just works as a toggle flipping the
order ... so literally any string will do.

The original link has been updated accordingly with the
new code.

      > > > datadrivenangel
Have it print paginated or just top 10?
        > > > > kristopolous
only the small ones:
$ ./art-analysis.sh | grep small

or maybe just the qwen:
$ ./art-analysis.sh | grep Qwen

only the ones in the past 30 days:
$ ./art-analysis.sh | awk '$2 < 31'

I use it in pipes like this.
  > slig
Thanks for sharing. I'm curious: why didn't you sort
with the score descending?

    > > kristopolous
Because it's currently 511 lines. Why would I want
to scroll up to see the stuff I care about? Don't
you want the relevant stuff to be right there in
front of you?

      > > > duckmysick
I do and that's why I pipe the output to `head
-n 20` or use `LIMIT 20` in SQL.

That aside, this is a good script you're running. Thanks.

        > > > > tasuki
But maybe you decide you want to see more.
It makes perfect sense for a cli tool to
output the most interesting piece of info
last: then you can decide on the fly
whether you want to scroll up or not.

    > > fridder
Not OP but if you run this from the CLI it does
make the ordering make a little more sense

    > > snsnbsne
Because programmers can't figure out how to have a
CLI that prints in a normal order, with the newest
stuff on top instead of on the bottom.

Set up a fresh new large monitor. Open CLI. Run command.
Watch output at the bottom of your screen. Keep
watching the bottom of your screen for the rest of
the day.

Sure you can tile windows and it helps but
come on. Just have the command/input section in
the bottom and the "output" on top. Keep the
command bit on the bottom.

Tiberium
It seems to really be a nice step-up and is getting quite
close to the frontier. I wish they'd start focusing on the
reasoning efficiency now, though. I have a simple
(relatively) test task to evaluate LLMs: writing a simple
math evaluator library in Nim (it's about 400-600 lines
total max), and GLM 5.2 (xhigh which maps to max effort)
spent over 15 minutes (!) reasoning, spending about 45k
tokens, before it finally wrote the first file.

I know it's hard to improve on that, but now that their
models are good enough at raw intelligence, I think this
should become a higher priority task.

Currently on https://artificialanalysis.ai/#output-tokens
GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5
high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k.
GPT 5.5 is extremely reasoning efficient.

Of course if you convert those values to actual request
cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but
speed matters for a lot of people, I think.
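To make the token-to-cost conversion concrete, a small sketch using the average output-token counts quoted above. No real prices are assumed; the `10.0` in the example is a placeholder, and input tokens are ignored entirely, so treat this as an output-side estimate only.

```python
# Average output tokens per task, as quoted in the comment above.
AVG_OUTPUT_TOKENS = {
    "GPT 5.5 xhigh": 16_000,
    "GPT 5.5 high": 10_000,
    "Fable 5": 33_000,
    "Opus 4.8": 41_000,
    "GLM 5.2": 42_000,
}


def output_cost_usd(tokens, usd_per_million):
    """Output-side cost of one task at a given price per million tokens."""
    return tokens / 1_000_000 * usd_per_million


def break_even_price(tokens_a, price_a, tokens_b):
    """Price per million tokens at which model B costs the same per task as model A."""
    return tokens_a * price_a / tokens_b


if __name__ == "__main__":
    placeholder_price = 10.0  # hypothetical USD per million output tokens
    limit = break_even_price(
        AVG_OUTPUT_TOKENS["GPT 5.5 xhigh"],
        placeholder_price,
        AVG_OUTPUT_TOKENS["GLM 5.2"],
    )
    print(f"GLM 5.2 breaks even at {limit:.2f} USD per million output tokens")
```

Whatever the actual prices are, the ratio is fixed by the token counts: GLM 5.2 is only cheaper per task than GPT 5.5 xhigh if its output price is below 16/42, roughly 38%, of GPT 5.5 xhigh's.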

  > vorticalbox
This is a problem I find with Opus: it will spend so
long thinking, then going "but wait what if".

To the point where I stop it and simply tell it to "start
writing code, you can work it out as you go along".

Seems writer's block also affects LLMs.

    > > mikeocool
Seriously. Whenever I read the thinking output I
get mad and turn down effort to medium or low.

Just output the code and we'll work through it!

I feel similarly about having codex review claude's
plans. I don't think I've ever seen it catch a
major issue. It just points out things that would
have inevitably been addressed during
implementation anyway.

    > > giancarlostoro
I usually have Claude build a plan first, then I
put it into an XML file it updates with phases.
Usually we talk about some of those tasks, and
then once it's good and I like it, I have Claude
implement the plan.

Another thing I tell Claude to do is to not guess, but
look at documentation. It messes up a lot less. It might
use some tokens reading docs, but at least it has a
higher success rate code-wise.

        > > > > giancarlostoro
Apparently because of how Claude is
trained, even the system level prompts go
through as XML, it works better with XML
"prompting" so I figured I could have it
write plans in XML. I need to update my
ticketing tool to output XML maybe by
default.

https://www.reddit.com/r/ClaudeAI/comments/1psxuv7/anthropic...

    > > epolanski
Fable was 20 times worse on that.

It's clear it was the vibe coding model: like no other
model before it, it fully turned you into its assistant
instead of the other way around.

      > > > RyanHamilton
Could it be possible, these firms are
optimizing for two things: a) Better
performance. b) Gathering data from you to
further improve performance later. I've also
found the huge amount of planning rather than
iteration frustrating. I've felt like I'm
teaching a junior!

        > > > > epolanski
I think they simply optimize around E2E
benchmarks, none of those benchmarks is
designed as multi turn assistance to the
user, but going from a prompt straight to
the final solution.

    > > thinkingtoilet
I've been having success with Opus but you REALLY
have to tame it. Long prompts that list what files
to look at, relationships between entities, etc...
I went from regularly hitting my daily limit to
almost never hitting it. Oh, and also I was being
lazy with small changes and stopping that helped a
lot too. As you said, it gets in these loops where
it's just churning and if you don't stop it it can
go on for way too long.

  > benjiro29
GLM 5.2 Max = Opus 4.8 Max in thinking behavior. The
thinking chain is so similar, and so is the amount of
token usage on the output.

If you want reasonable token usage, you need to run GLM
5.2 at High. There is little drop in quality from Max to
High (for most tasks), and it cuts token usage by 2 to
2.5x. GLM 5.2 Max is really something you only need for
complex tasks.

In essence, GLM 5.2 is Opus 4.8's little brother, at a
way, WAY cheaper price.

There has been really no training on Opus models going
on, really, none I tell you! /sarcasm

    > > vitalyan123
distillation of thinking models is not
particularly effective - both "Open"AI and
Misanthropic don't show you the real chain of
thought, only its severely downscaled version.
both do everything in their power to combat such
outrageous copyright infringement, so the bulk of
unethically scraped data the Chinese have is from
several generations ago.

      > > > duskdozer
> such outrageous copyright infringement

Sarcasm, considering the source of
their own training data?

        > > > > orphea
Narrator: it was sarcasm, indeed.
  > h14h
Hopefully the recent work Moonshot did with Kimi K2.7
Code trickles in to the other open-model labs.

Per AA, while K2.7 Code is roughly on par w/ K2.6 in
terms of intelligence, it uses half the output tokens to
get there.

  > robmccoll
That's interesting. I gave nearly the same task to
Gemma4 31b as a test yesterday. Write a symbolic math
engine in Typescript that can perform evaluation and
simple expression reductions over +-/*(). It performed
the task correctly with minimal reasoning - much fewer
reasoning tokens than output tokens.

  > bertili
This is GLM 5.2 Max. GLM 5.2 High uses less than
half[1] the tokens.

[1] https://z.ai/blog/glm-5.2

    > > Tiberium
Yes, but the Artificial Analysis result is also
from GLM 5.2 (max), not high.

      > > > andai
They have this with a lot of models, measuring
only the max setting, while the one you'd
actually want to use for most tasks is much
lower.

        > > > > epolanski
For the brief period we had Fable, I
never had to use it above medium.

Low nailed the overwhelming majority of
mundane tasks on its own, medium was good
for more complex stuff.

  > cmrdporcupine
> Of course if you convert those values to actual
> request cost, GLM 5.2 will probably beat GPT 5.5/Opus
> 4.8, but speed matters for a lot of people, I think.

GLM5.2 ends up being far more expensive than I
thought it would be when I tried it on openrouter. I
ground through $5 USD worth of tokens quite
quickly.

And this was high, not max.

mrngld
Artificial Analysis coding benchmark shows GLM5.1 on high
pretty close to GPT5.5 xhigh in cost to run, with GPT5.5
on medium significantly less expensive. Compared to GPT5.5
medium, GLM5.1 xhigh is twice the cost and half the
intelligence. They don't have GLM5.2 on there yet, but
that'd be a big gap to bridge.

https://artificialanalysis.ai/agents/coding-agents?coding-ag...

I thought I was "holding it wrong" until
DeepSWE came along -- personally it seems to match my own
experiences pretty well. Really makes me wonder how
legitimate some of the internet noise is about open
models. There are surely some use cases for them, not
everything needs the absolute frontier (GPT5.5 on low is
awesome), but if you want to be near the frontier everyone
needs to be honest about the fact that we're only talking
about Opus, Fable, GPT5.5.

  > cmrdporcupine
I gave GLM 5.2 a spin on openrouter yesterday and it
was mostly fine but it racked up $5 in token use in 30
minutes of (relatively slow) work.

It's easily 4x the cost of DeepSeek V4 but I didn't
actually feel the results were that much better. I had
GPT 5.5 in Codex review it after it was done and there
was plenty of slop to go around.

Having better luck with MiniMax M3, from a cost/benefit
ratio.

    > > pjerem
I really like DeepSeek V4 Pro. It's pretty smart
and I get so much usage out of it on a $20 Ollama
cloud plan.

With a good harness, that's my favorite
model for any personal project. I use Opus 4.8 at
work because I don't have to pay for it and of
course I love it, but DeepSeek is like 80% there
for one tenth of the price.

    > > zooming
Try MiMo-2.5, I'm having astonishing success with
it in opencode for cents per day. Not even the pro
model.

m-dot-reviews
For anyone who's interested, I've put together a simple
site for sharing ratings/opinions on models at a
task-specific granularity. https://model.reviews/

The idea is that benchmark score comparisons are useful
for a large cross-product comparison across models + their
settings, but less useful if you're looking for the best
model for <your-specific-task>. So I thought having a
place to review and comment could be beneficial to people.

I'm not sure how best to get the corpus bootstrapped (i.e.
people will likely only visit/post on the site if there's
already activity), so posting it here for anyone who'd
like to contribute.

unrvl22
Why aren't more people talking about this? It's literally
Opus 4.7 quality at stupid prices. I know providers who
are offering this at unlimited tokens for $50 a month.
Some are even offering API rates at 3x lower than the
official ZAI api rates which are already like 10x cheaper
than Opus. (Crof and Umans btw)

This is a huge blow to Anthropic/OpenAI/Google and a
massive win for the rest of the world. The official API
prices and speeds mean nothing for open source models.

  > stanac
> Some are even offering API rates at 3x lower than
> the official ZAI api rates

Looking at openrouter [1], some of the cheaper offerings
are for quantized models. Not sure how much intelligence
is lost in quantization. And they are not 3 times
cheaper. Where did you find 3x lower prices for APIs? I
am considering skipping open router and using them
directly for that price.

edit: I see, croft [2] 8bit for $0.50/$0.08/$2.20

[1]: https://openrouter.ai/z-ai/glm-5.2
[2]: https://ai.nahcrof.com/pricing

    > > benjiro29
Neuralwatt ... When you reverse calculate the
actual energy usage / price on a token basis, the
gap is large.

I do not have GLM 5.2 numbers because the whole default
max setting is overkill. But GLM 5.1 numbers had it at
12x cheaper than API rates, and about 2.5x more tokens
vs Z.ai's own subscription service.

Yes, it's FP8, but let's be honest, do we know for sure
that even Z.ai runs at FP16? I learned a long time ago
with Claude and Codex how much cheating happens on model
levels, even from the big boys.

    > > scrlk
IME, unquantised -> FP8 is pretty much lossless.
What matters more is having an unquantized KV
cache - using an FP8 KV cache can result in a
significant drop in quality.

  > CuriouslyC
Be careful about unofficial providers, a lot of them
misconfigure models or stealth quantize them. For a
while the difference between Kimi on the official API
and most third party providers was 20-40%.

    > > thehamkercat
Kimi K2 had a vendor verifier:
https://github.com/MoonshotAI/K2-Vendor-Verifier
(there's a table which shows comparison between
vendors)

Also, it seems there's a general one as
well (for all kimi models?):
https://github.com/MoonshotAI/Kimi-Vendor-Verifier

    > > cedws
OpenRouter should be penalising or banning for
this.

      > > > alecco
Would that align with their VC-backed
incentives?

      > > > kilroy123
This is my biggest complaint about OpenRouter
and I'm a fan. Might be pretty tough at scale?

    > > unrvl22
the 2 I mentioned both have a fairly large
following, who run benchmarks and absolutely will
spot issues.

  > embedding-shape
> Why aren't more people talking about this?

Wasn't this released like 2 days ago? Everyone is still
evaluating and playing around with it, and things like
the submission are just starting to come out. Give it
some days at least before jumping to conclusions, ideally
weeks.

  > Schiendelman
To answer the question in your first sentence -
because it's VERY computationally (ha) expensive as a
human being to keep up with all the options. It's also
very hard to figure out how to run a model like this.
There's no installer. If you really really care, which
99% of people do not, you have to google a guide, and
then find out it's out of date...

I've tried a number of these, and the learning curve is
very steep compared to "install Claude Code and pay
$100/mo". There is no way saving me $50/month matters
compared to figuring that out.

    > > andai
But it just works with Claude Code? They have a
guide on their website.
https://docs.z.ai/devpack/tool/claude

Here's my setup. I add this to my .bashrc:

export ZAI_API_KEY="your_key_here"
alias claudez='ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]" ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.7" ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7" claude'

Then I just run claudez

pro tip: the same thing works with deepseek
https://api-docs.deepseek.com/guides/anthropic_api

Even more pro tip: Claude Code can set this up for
you haha

      > > > Schiendelman
Sure, I'm not saying I, a software engineer,
cannot do this. I'm saying it's significant
onboarding friction.Unless this were a massive
differentiator, people aren't going to be
"talking about it" the way GP suggests!

        > > > > fc417fc802
You're seriously suggesting that setting
up opencode or tweaking your claude code
config or etc is too much trouble to be
worth saving $50 /mo? That's absurd.
Doubly so when the audience in question is
already using LLMs so ... just ask your
existing LLM for help if it seems
daunting.

          > > > > > Schiendelman
I'm not just suggesting that, I'm
trying to be crystal clear: it's a gap
that probably cuts TAM by 95% or more.
Most LLM users are not software
engineers. Even those that are don't
care enough to muck with their
settings to try out a model. Keep in
mind I'm not answering the question
"Is this hard to install?" - I'm
answering the question "Why aren't
people talking about this?"

            > > > > > > donohoe
I would broadly agree with this
(based on years of dealing
directly with user-facing UX and
setup steps). Small hurdles, even
easy ones, create larger barriers
to adoption than you'd think.

            > > > > > > fc417fc802
Doesn't pass the sniff test.
Casuals messing around already go
to far more trouble to set up
openclaw or comfyui or what have
you.

            > > > > > > Schiendelman
What percentage of "casuals"? ;)
            > > > > > > neonstatic
"Casuals" just use the web
interface from the provider, which
Z.ai also has

        > > > > skeledrew
The friction is near 0 when you can ask
another LLM to set it up for you.

  > cedws
In my org everyone is extremely Claude-pilled to the
point you'd think it's the only LLM that exists,
purely because it caters to non-engineers within
enterprises.

  > knollimar
Isn't it closer to sonnet?
    > > redox99
Definitely opus level for coding.
      > > > smith7018
Do you have benchmarks or at least anecdotes
to back that up? I'm not arguing with you; I
would just love to see some proof that open
models are getting as good as Anthropic's
models.

  > unrvl22
I cancelled my claude sub after realizing I can burn
300m tokens a day of this quality, for $50 a month.

  > Hamuko
I'm not that interested in models that I can't run on
my desktop for ~0€, which is my AI budget.

    > > andai
Electricity cost seems to be about $30/month for a
32B model on a GPU. It's probably better on Apple
hardware.

https://github.com/QuantiusBenignus/Zshelf/discussions/2

Not accounting for hardware, of course :)

      > > > NorwegianDude
The price, processed tokens, and output can be
anything, it just depends on what GPU it is.

Nvidia GPUs are much more efficient than
Apple hardware for inference (and training).

      > > > Hamuko
My Mac Studio uses about 60-80 watts whenever
I'm running a model (as measured by the system
metrics), so it's less than 2 kWh/day at full
blast. Electricity is like 0.125 €/kWh, so
that 24-hour period would be <0.25 €.

Not accounting for hardware in my costs, since I
didn't buy my hardware for running models.
Running models is just something it can do in
addition to what I got it for.
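The arithmetic behind those figures, as a quick sketch. The wattage range and the 0.125 €/kWh rate are the numbers from the comment above. Running flat out for 24 hours is the worst case, and hardware cost is left out, as stated.

```python
def daily_energy_kwh(watts, hours=24.0):
    """Energy used in kWh for a constant draw over the given hours."""
    return watts * hours / 1000.0


def daily_cost_eur(watts, eur_per_kwh, hours=24.0):
    """Electricity cost in EUR for that energy at a flat rate."""
    return daily_energy_kwh(watts, hours) * eur_per_kwh


if __name__ == "__main__":
    for watts in (60, 80):
        print(
            watts, "W:",
            round(daily_energy_kwh(watts), 2), "kWh/day,",
            round(daily_cost_eur(watts, 0.125), 2), "EUR/day",
        )
```

At 80 W that comes to 1.92 kWh and 0.24 € per day, matching the "less than 2 kWh" and "<0.25 €" figures. At 60 W it is 1.44 kWh and 0.18 €.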

    > > igravious
Cool beans. You're not the target audience then.
      > > > Hamuko
Did I claim I was? I just said why I and
people like me are not talking about it.

        > > > > simianwords
and he said its cool
  > anuramat
> unlimited tokens for $50 a month

link?

> Why

imho everything but opus produces unusable code (fable was
even better...), eg gpt5.5 seems to write the absolute
worst code that still technically solves the problem;
tbh I'd be totally willing to trade "raw intelligence"
for "code taste"

more labs need to figure out whatever anthropic did to
destroy everybody else on frontiercode bench

simonw
I was surprised that GLM 5.1/5.2 are not vision models -
they are text input only.

That's actually pretty uncommon these days. All of the
OpenAI/Anthropic/Gemini models accept images, and so do
the other leading open weight families - Gemma 4, Qwen
3.6, Kimi 2.x.

In GLM's case image input would be useful because it's a
model that scores very highly for tasks like web design,
but without image input it can't take a screenshot and
output HTML+CSS.

Don't get me wrong, GLM is a phenomenal model, but the
image thing is a bit of a gap.

  > _pdp_
I don't see this being such a big gap. There are some
use-cases for sure but apart from UX/UI work it is not
really needed. Besides, none of the frontier models
can replicate actual images - they can only approximate,
at least in my own experience.

    > > simonw
One of my tests for a new model is dumping in a
screenshot of a web page and seeing if it can
recreate it from scratch in HTML and CSS.

Even the local models I run on my Mac are getting
surprisingly good at that now.

CuriouslyC
I've been playing with this model a fair amount over the
last 24 hours, and I can confirm it's quite capable, while
being a little bit verbose (I've seen it reconsider things
3-4 times in thinking traces before deciding on a path
forward), and not being quite as good as GPT5.5 at working
through complex abstract requirements.

Honestly it's good enough that I feel comfortable
recommending a Z.AI sub + a $20/mo OpenAI sub for all but
the most AI pilled multi-orchestrators, or the die hard
Claude fans. GLM writing + GPT reviewing/debugging feels
pretty unlimited and minimally worse than just doing
everything in GPT with the $200/mo plan.

  > Havoc
> while being a little bit verbose

Discovered today that they set reasoning effort to max by
default. So that's probably why

  > andai
This is my workflow. And then once a day I copy paste
the code into the free Claude Sonnet so it comes out
actually readable.

  > igravious
After having got a taste of Fable 5, for me Opus 4.8
doesn't cut it any more -- and I don't know how to put
this, I don't know if it's just me, but its
rhetorical flourishes are starting to really grate on
me, never mind that it is at times deliberately
weasel-wordy and economical with the truth until
pressed. Opus 4.8 is definitely a stronger coding
agent than DeepSeek 4.0 or Kimi 2.7, succeeding where
they flounder and fail, but its way of expressing
itself conversationally is making me reconsider my
subscription ...

    > > elwebmaster
You are not alone. How about GPT 5.5? Does it come
close to Fable 5?

      > > > theplumber
GPT 5.5 xhigh is smarter than Fable, but Fable, like
Opus 4.8, is faster and seems more "agentic". It's easy
to test this. Build a fairly complex piece of software
with Claude (Opus or Fable). Review the commits with
both Claude and GPT 5.5 xhigh. You can see that Fable is
still sloppy(er) compared to GPT. You can test it the
other way around as well (drive the dev with GPT and
review with GPT and Claude). You get the same result.

Claude has an edge though, and that's in building more
beautiful user interfaces.

      > > > fragmede
5.5 is pretty good. It's no Fable though. It
is definitely better than opus tho.

CubsFan1060
Knowing very little about how to run these, how close are
we to medium or larger businesses starting to buy hardware
to run models like this to keep the models local?

It's expensive, and not as capable as the frontier models,
but would have some pretty big benefits around privacy and
agency.

  > wongarsu
I know of multiple businesses in Europe that have been
doing that for a while with 70B models, and are
upgrading hardware to run the new crop of 700B-1T
models (really started around Kimi K2, but buying and
hosting that kind of hardware takes time).

Not everyone is willing (or even legally able) to send
their trade secrets to OpenAI or Anthropic.

    > > CubsFan1060
What kind of hardware/price does it take to run
those?

      > > > bitmasher9
Nvidia will sell you an entire server rack
ready for inference. Or maybe you can roll out
your own Blackwell based system.

We're approaching a world where running a primer
frontier model is possible on a workstation; we'll
probably have something under $30k that looks like a
desktop for Nvidia's next generation. It sounds
expensive, until you look at your Anthropic bill.

It's similar unit economics to cloud computing for the
open models. You can save a ton on the expenses by
buying the hardware, but it requires a lot of
in-house expertise, and you get the most value
if you keep the system operating around the
clock. The big kink is open models are usually
2 quarters behind frontier, and your
competitors are probably trying to get access
to Mythos.

      > > > wongarsu
For an 8-bit quant (what people call "near
lossless") you are looking at something like
4xMI350X, which comes out to about $150k after
adding the rest of the server. More if you go
with Nvidia instead of AMD.

But prices are changing rapidly, and not for the better
  > MikhailTal
This is not a new situation. This was also happening
when good vision models like AlexNet were coming
through, especially for OCR. Companies had a choice
between cloud or self hosting with GPUs. But it turns
out the problem is usage patterns.

Your usage will peak during certain timezone work hours
(even if you are a huge multinational company most of
your engineers/users tend to be from only a few
locations), so then you have a bunch of GPUs doing
nothing the rest of the day.

Especially with latency sensitive stuff, this is a
decades old tradeoff problem, it's not unique to LLMs.

  > Havoc
It's a ~750B model so still a hell of a lot of
vram

Would need to be a pretty determined medium biz

  > moffkalast
So far there seems to be one major use-case for
complete privacy, and that is legal work. You don't
need top of the line models to search vast amounts of
text in discovery and it needs to be completely
confidential. There's quite a few lawyers over on
r/localllama showing off their multi-GPU builds.
Coincidentally they also have the vast funding
required for it.

  > petesergeant
Unless you have genuine national security concerns,
you'd be better off just negotiating a commercial
agreement with privacy protections with a couple of
existing vendors.

    > > CubsFan1060
I think that's true until it isn't, which may end
up being the problem. Fable/Mythos doesn't fall
under the ZDR agreements with Anthropic. And I'm
curious if others will follow suit.

    > > tancop
if you can afford the investment you get stable
low costs for years with better security (at least
if your cyber team is good). its even better in
regulated industries where some vendors might add
a premium for hipaa/soc/pci dss compliance to the
point its a lot cheaper to self host. for a
smaller business its not worth it and you should
just use a hosted open model.

      > > > petesergeant
> to the point its a lot cheaper to self host

I'm pretty skeptical, especially given
typical utilization patterns. Do you have
numbers, or is this just vibes?

  > re-thc
> how close are we to medium or larger businesses
> starting to buy hardware to run models like this to
> keep the models local?

Years.

Even Microsoft said they don't have enough for Github
and need to call Amazon.

Getting a few even at decent prices is hard.
Unless the shortage goes down...

tensegrist
> On the Intelligence vs. Cost per Task Pareto Frontier:
> GLM-5.2 is on the Pareto frontier of the Intelligence vs
> Cost per Task chart, with the lowest cost per task among
> models at its intelligence level. GLM-5.2 costs ~$0.46 per
> task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31),
> MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)

am i missing something?

  > OtherShrezzing
I think they've just picked poor peer examples.
Instead of choosing other models near 5.2 on the
intelligence scale, they've picked some open models
from further down the scale.

  > xiaoyu2006
Some models are heavily subsidized. Total params &
active params are better measurement of inference
cost.

    > > simianwords
No models are subsidised -- there are lots of
third party hosting services that will still run
at breakeven/profit. (except Deepseek after
discount)

      > > > stymaar
> No models are subsidised
We have no proof in either direction; it's not like
we have access to their detailed financial numbers.
And the pricing itself muddies the water, as input
tokens that are already in the KV cache are
practically free for the provider, whereas other
tokens are expensive. So they could still make money
overall thanks to people having multi-turn
conversations (and as such, paying multiple times for
the same token), but lose money on actual compute
done.
> there are lots of third party hosting services that
will still run at breakeven/profit.
How can you be sure that they are making profit
directly from token price, and are not billing at
marginal cost (i.e. electricity price, without
counting the cost of the GPUs) and aiming to make a
profit later on from the valuable training data that
they are collecting in the process?
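The "paying multiple times for the same token" point can be made concrete with a little arithmetic. This is a simplified sketch under stated assumptions: every turn resends the full history as input, all of it is billed at the input rate, and a perfect prefix cache means only each turn's new tokens need fresh compute. The function name and the token counts are illustrative only.

```python
def multi_turn_token_counts(new_tokens_per_turn):
    """Return (billed_input_tokens, uncached_tokens) for one conversation.

    Assumes the whole history is resent and billed on every turn, and
    that only the tokens added in a turn miss the KV cache.
    """
    billed = 0
    context = 0
    for new in new_tokens_per_turn:
        context += new      # history grows by this turn's new tokens
        billed += context   # the whole history is billed as input again
    return billed, context


# Ten turns, each adding 1,000 new tokens to the conversation.
billed, uncached = multi_turn_token_counts([1000] * 10)
print(billed, uncached)  # 55000 10000
```

Under these assumptions the provider bills 55,000 input tokens while only 10,000 ever missed the cache, which is why per-token price alone says little about the margin.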

XCSme
In my tests[0] GLM-5.2 is not much better than GLM-5, and
overall DeepSeek V4 Flash seems to be the better/more
cost-effective choice:
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...

  > XCSme
I think the problem, as can also be seen on other
benchmarks, is that most models nowadays are focused
more and more purely on tool calling and coding.
This means that models are losing more and more
general and domain-specific knowledge.
Look at those graphs on Artificial Analysis; GLM-5.1
still performs similarly or better:
AA-Omniscience Accuracy:
https://i.snipboard.io/5DYmpx.jpg
IFBench: https://i.snipboard.io/74kg0R.jpg
I still feel like models have not been getting any
smarter for a few months now; they just changed their
training to focus more on some areas than others,
shifting the intelligence from one place to another,
not necessarily increasing the overall intelligence or
"AGI" score.

  > sourcecodeplz
man, i love dsv4-flash but i found its weaknesses in
complex projects with multiple moving parts. tried
kimi 2.6 and it understood and could work on the task.
bigger is better..

xiaoyu2006
This open source model is quite near SOTA with only
700B/40B MoE. Truly efficient.

rahidz
Correct me if I'm wrong, but neither DeepSeek nor GLM has
image input modality. This makes them less useful when
looking at UIs, photos, screenshots, etc., doesn't it? Or
do they have alternate ways of doing so?

  > dryarzeg
Yes, you are right (as far as I'm aware). For things
where you need the LLM to look at screenshots, photos
or other images you can use Kimi-K2.6/K2.7:
comparable pricing, somewhat comparable performance
and quality. You can probably even combine two models
(e.g. Kimi and GLM) in one agent, using Kimi for
multimodal inputs and GLM for everything else,
although 1) I'm not sure whether this would cause some
kind of context poisoning, with low-quality patterns
leaking into the better-performing model (e.g. in some
cases Kimi may be worse than GLM, but GLM, when
following up, may adopt the same reasoning patterns as
Kimi, undermining its own performance), and 2) I'm not
quite sure if it's possible with the tools currently
available (I'm not really into agentic or chatbot
stuff, to be honest).
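For what it's worth, the routing part of that idea is simple to sketch. The snippet below is a minimal illustration, not a real agent: the message shape, `make_router`, and the two stand-in model functions are all hypothetical, and a real setup would call the providers' APIs in their place. It also does nothing about the context-poisoning concern in point 1.

```python
from typing import Callable

# Hypothetical message shape: {"text": "...", "images": [b"..."]}
Message = dict


def make_router(vision_model: Callable[[Message], str],
                text_model: Callable[[Message], str]) -> Callable[[Message], str]:
    """Send messages that carry images to vision_model, the rest to text_model."""
    def route(message: Message) -> str:
        if message.get("images"):
            return vision_model(message)
        return text_model(message)
    return route


# Stand-in functions; real code would call the hosted models here.
def fake_kimi(message: Message) -> str:
    return "kimi: " + message["text"]


def fake_glm(message: Message) -> str:
    return "glm: " + message["text"]


route = make_router(fake_kimi, fake_glm)
print(route({"text": "fix this bug"}))
print(route({"text": "what is in this screenshot?", "images": [b"\x89PNG"]}))
```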

  > mordae
They do not and it sucks for certain tasks.
It also means that if they actually trained with
vision, they'd be on par with Anthropic models, as
vision seems to improve model performance across the
board even for non-vision tasks.

    > > osti
Many other open source models have vision but they
don't compare to GLM in terms of coding quality.
So I don't think it's because of vision that the
frontier models are better, it's more that they
are probably just much bigger models.

  > adrian_b
That's right, but there are other recent open weights
and relatively big LLMs that are multimodal, e.g.
MiniMax-M3.
With open weights LLMs, it is affordable to use many
different models, each for whatever it is best at.
Moreover, for analyzing "UIs, photos, screenshots,
etc." there are small models that can be run locally
on smartphones or laptops, e.g. IBM
granite-vision-4.1-4B, certain Google Gemma 4 variants
and certain Qwen variants, whose output you can use as
input for a big LLM, in order to accomplish some more
complex task.
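The last step, feeding a small vision model's output into a big text-only LLM, might look roughly like this. It is only a sketch: `describe_then_reason` and both stand-in callables are hypothetical names, and the prompt layout is an assumption rather than anything these models require.

```python
def describe_then_reason(image_bytes, task, describe_image, ask_llm):
    """Two-stage pipeline: caption the image locally, then hand the
    text description plus the task to a text-only LLM."""
    description = describe_image(image_bytes)
    prompt = f"Image description:\n{description}\n\nTask: {task}"
    return ask_llm(prompt)


# Stand-ins for a local vision model and a hosted text-only LLM.
def fake_vision_model(image_bytes):
    return f"a screenshot of a login form ({len(image_bytes)} bytes)"


def fake_llm(prompt):
    return "LLM received:\n" + prompt


print(describe_then_reason(b"\x89PNG", "list the form fields",
                           fake_vision_model, fake_llm))
```

The big model only ever sees text, so its answer is limited by whatever the small model chose to put in the description.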

  > Havoc
They have a separate VL model but never tried it

kingstnap
According to many benchmarks this model is straight up
frontier level and Zai seriously cooked. Some of these
numbers are incredible.
Excited to see if this turns out to be an open-weight
Opus 4.5 or better.

  > andai
The only benchmark that matters is your actual task.
I've had models that benched poorly but performed
great. And I constantly see models near the top of
AA that are terrible.
There doesn't necessarily seem to be a lot of overlap
between benchmarks and real world usage. (Let alone
common sense!)
As far as they go, though, these harder benchmarks
match my experience more closely:
https://deepswe.datacurve.ai/
and https://cognition.ai/blog/frontier-code
Where we see "top" models drop way down in score when
given longer tasks.
That being said, I've had a reasonably pleasant time
with GLM-5.2 so far. (And have had an OK time with
DeepSeek as well.)
By the time I'm done testing all the Chinese models,
they'll be obsolete :)

zftnb666
Open-weight models are winning. The gap with closed models
is now measured in months, not years.

ramon156
I've made a comment before that 5.1 will sometimes get
stuck looping over a simple decision or statement. It will
basically contradict itself and then not realize that one
option is the definite option. Sometimes it's two
statements that aren't even exclusive. Nonetheless, a lot
of tokens get wasted on this.
I haven't extensively used 5.2 yet, but it seems a lot
better.

dizhn
FYI.. This is coming with 3mil GLM 5.2 tokens right now.
(Needs login. Google SSO fine) https://zcode.z.ai/en

Pragmata
So this basically means we will have a near opus level
model able to be run locally in the next couple of months,
right?
Qwen 3.6 27B is already pretty good, but it should be
possible to get a better option now that runs on the same
hardware, right?

  > segmondy
Why wait for the next few months? There are plenty of
better models that you can run today locally.
Qwen3.5-397B beats Qwen3.6-27B. MiniMax2.7 is a
longrun horizon monster. (I haven't given 3 much of a
try yet). KimiK2.6/2.7, MiMoV2.5/MiMoV2.5-Pro and
GLM5.1 will wreck Qwen3.6-27B any day on any task.

  > XCSme
Which Opus?
GLM-5.2 is already close to Opus-4.7 level:
https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...