Benchmarks, Bots, and Billion‑Dollar Bonuses
Overview
AI researchers expose exploitable benchmarks, AI‑generated propaganda spreads, and Meta rolls out near‑billion‑dollar executive bonuses. Meanwhile, the community watches a satirical game overrun by bots and a surge of high‑performance local LLM releases.
Hacker News Stories
How We Broke Top AI Agent Benchmarks: And What Comes Next
333 points · 86 comments · by Anon84
The Berkeley RDI team built an automated agent that audited eight leading AI‑agent benchmarks and discovered that each could be gamed to achieve near‑perfect scores without solving any task. The paper details concrete exploits—such as a 10‑line conftest file that forces SWE‑bench tests to pass and a fake curl wrapper that nets 100 % on Terminal‑Bench. The authors argue that benchmark scores are currently unreliable indicators of real capability and call for more robust evaluation designs.
Interesting Points
- Every major AI agent benchmark can be exploited to achieve near‑perfect scores without solving any task.
Top Comment Threads
- ggillas (5 replies) -- The commenter praises the paper as a “phenomenal” expose of benchmark exploits, noting that the authors achieved near‑perfect scores by sending empty JSON objects or trojanizing binaries, and urges the community to redesign evaluations to verify actual solutions.
We spoke to the man making viral Lego‑style AI videos for Iran
67 points · 62 comments · by breve
BBC’s investigation reveals a small media outfit, Explosive Media, producing Lego‑style AI videos that serve as propaganda for Iran. The clips depict exaggerated scenes—dying children, fighter jets, and a caricatured Donald Trump—to portray Iran as a victim of a “global oppressor.” The creator, who asked to be called “Mr Explosive,” admits the Iranian regime is a paying customer, though he initially claimed independence. The pieces have been shared millions of times across social platforms, amplifying the propaganda effect.
Interesting Points
- The Lego‑style AI videos are being used as powerful propaganda for Iran, featuring US President Donald Trump and other figures.
Top Comment Threads
- drnick1 (5 replies) -- The commenter argues that Iran’s regime prioritises self‑preservation over its people’s welfare, suggesting that diverting resources from nuclear and missile programs to development would dramatically improve the country’s situation.
Show HN: Hormuz Havoc, a satirical game that got overrun by AI bots in 24 hours
52 points · 16 comments · by kupadapuku
Hormuz Havoc is a browser‑based satirical game that puts the player in the shoes of a U.S. president navigating a crisis in the Strait of Hormuz. Within a day of launch, autonomous AI bots flooded the service, overwhelming the server and effectively “overrunning” the game. The developer observed a mix of brute‑force bots and more sophisticated agents that guessed moves in under a second, highlighting how quickly AI can dominate low‑stakes online services.
Interesting Points
- The satirical game was overrun by AI bots within 24 hours, crashing the servers.
Top Comment Threads
- xg15 (4 replies) -- A user notes that the game’s approval‑rating mechanic actually matters, sparking a brief discussion about how the satire mirrors real‑world political approval dynamics.
Meta is set to pay its top AI executives almost a billion each in bonuses
46 points · 28 comments · by seekdeep
Meta announced a compensation plan that could award its top AI leaders bonuses approaching $1 billion each, contingent on meeting aggressive AI performance targets. The move underscores the company’s race to out‑spend rivals in AI talent and infrastructure. Critics argue the payouts are excessive given ongoing concerns about AI safety and societal impact.
Interesting Points
- Meta plans to award its top AI executives bonuses approaching $1 billion each, tied to hitting AI performance targets.
Top Comment Threads
- mrlonglong (2 replies) -- A comment calls for the abolition of billionaires, arguing that such wealth concentration is incompatible with a fair society.
AI Job Loss Tracker
24 points · 21 comments · by gnabgib
The AI Job Loss Tracker aggregates daily reports of layoffs attributed to AI automation across sectors. The dashboard shows a steady rise in AI‑linked cuts since early 2025, but a recent plateau suggests that hype‑driven “step‑change” expectations have not yet translated into massive workforce reductions. The site also highlights the difficulty of distinguishing genuine AI‑driven layoffs from broader cost‑cutting measures.
Interesting Points
- The tracker aggregates real‑time data on AI‑linked job losses, showing a plateau despite hype about AI productivity gains.
Top Comment Threads
- yakattak (2 replies) -- A user points out that many companies label layoffs as AI‑related for shareholder optics, but the underlying causes are often mismanagement.
"AI polls" are fake polls
24 points · 5 comments · by 7777777phil
A Substack post argues that many AI‑generated “polls” circulating online are fabricated, lacking any real methodology or sample. The author demonstrates how LLMs can be prompted to produce convincing but entirely invented poll numbers, warning readers to treat such data with skepticism.
Interesting Points
- AI‑generated polls are being exposed as fabricated, lacking real methodology.
Top Comment Threads
- danslo (5 replies) -- A commenter jokes that the blog itself might have been written by AI, underscoring the blurred line between genuine analysis and AI‑generated text.
Reddit Stories
Is "live AI video generation" a meaningful technical category or just a marketing term? [R]
122 points · 3 comments · r/MachineLearning · by u/Tall_Bumblebee1341
The poster asks whether “live AI video generation” should be considered a distinct technical field or merely a marketing buzzword. They argue that genuine real‑time video inference, where a model continuously processes a live input stream, differs fundamentally from fast batch video generation in architecture, latency constraints, and hardware requirements.
Interesting Points
- Live AI video generation is distinguished from fast video generation by real‑time frame‑by‑frame inference constraints.
Top Comment Threads
- u/Beginning-Window-115 (63 points · permalink) -- A user laments buying an M5 Pro 48 GB Mac instead of the higher‑end M5 Max, implying the hardware limits for live video work.
Minimax M2.7 Released
375 points · 128 comments · r/LocalLLaMA · by u/decrement--
MiniMax AI announced the release of the M2.7 model, a 27‑billion‑parameter LLM. The model is distributed under a non‑commercial license, limiting commercial use, and is positioned as a competitor to other large‑scale open models. Early benchmarks show strong performance on standard language tasks, though the licensing restriction has drawn criticism from the open‑source community.
Interesting Points
- MiniMax AI released the M2.7 model, a 27‑billion‑parameter LLM with a non‑commercial license.
Top Comment Threads
- u/Beginning-Window-115 (63 points · permalink) -- A user regrets buying the M5 Pro 48 GB instead of the M5 Max 128 GB, hinting at hardware constraints for running the new model.
- u/coder543 (57 points · permalink) -- A commenter notes disappointment that the model is released under a non‑commercial license, limiting broader adoption.
DFlash speculative decoding on Apple Silicon : 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)
275 points · 43 comments · r/LocalLLaMA · by u/No_Shift_4543
The author shares a benchmark where DFlash speculative decoding on an Apple M5 Max reaches 85 tokens per second on the Qwen 3.5‑9B model—a 3.3× speedup over the baseline. The post includes a small code snippet and a link to the repository, inviting the community to integrate the technique into llama.cpp.
Interesting Points
- DFlash achieves 85 tokens/s on Apple Silicon, a 3.3× speedup for Qwen3.5‑9B.
Top Comment Threads
It looks like there are no plans for smaller GLM models
260 points · 104 comments · r/LocalLLaMA · by u/jacek2023
A community member notes that the developers of the GLM family have announced no intention to release smaller, more accessible variants of the model. This raises concerns about accessibility for researchers without large‑scale compute resources.
Interesting Points
- There are currently no plans to release smaller GLM models, limiting accessibility.
Top Comment Threads
- u/Beginning-Window-115 (63 points · permalink) -- A user laments hardware limitations that would prevent them from running larger GLM models.
If you haven't yet given Gemma 4 a go...do it today
258 points · 108 comments · r/LocalLLaMA · by u/No-Anchovies
The poster recommends trying out the newly released Gemma 4 models (4 B and 9 B parameters) on local hardware. They claim the models run fast on Apple Silicon and produce code generation quality comparable to early Gemini Pro releases, making them a compelling alternative to larger, cloud‑only models.
Interesting Points
- Gemma 4 (4‑ or 9‑billion parameters) runs efficiently on local hardware, matching early Gemini Pro code generation quality.
Top Comment Threads
- u/Beginning-Window-115 (63 points · permalink) -- A user mentions hardware constraints that could affect running Gemma 4 locally.
Report generated in 4m 43s.