wongarsu It does really well on "AA-Omniscience Non-Hallucination
Rate", far higher than DeepSeek, GPT 5.5 or Fable. I
really like that benchmark because it's one of the few
benchmarks that allows LLMs to elect not to answer if they
are unsure and punishes them for trying to bullshit their
way through the benchmark.
|
> andai This implies that other benchmarks (for which every AI
provider is optimizing?) are actively encouraging
bullshitting?
|
hemkeshr Local models are already useful today. The next milestone
is getting this level of performance onto truly affordable
hardware.
|
XCSme I also tested it[0]: quite similar to GLM 5, a few percent
better, 30% faster and 50% more expensive.
[0]: https://aibenchy.com/?q=glm
|
> XCSme PS: Just added a cool feature: you can filter the
leaderboard for multiple models at once by separating them
with commas, like: https://aibenchy.com/?q=glm,claude
|
> lousken Still 1/4 of the price of Anthropic and OpenAI models,
though.
|
lanycrost It's always nice to see open source models growing. I
hope we will get good performance on lower-tier
hardware some day.
|
theturtletalks I want to trust their benchmarks but when they have Muse
Spark over GPT-5.5, it gives me pause.
|
sourcecodeplz Still quite verbose at 140M output tokens, but this is on
max thinking. High should do better.
|
ChrisArchitect Some more discussion:
https://news.ycombinator.com/item?id=48567759
|
DeathArrow One or two more releases and they will reach Fable level.
|
> vitalyan123 by then there will be Fable 5.21, again 5% ahead of
every other SotA while still only 500% the size.
|