
Reddit AI - 2026-04-18

1. What People Are Talking About

1.1 Qwen3.6-35B-A3B: Day-Two Testing Confirms a Local Model Milestone (🡕)

The Qwen3.6 release from April 17 generated the single largest concentration of hands-on testing and configuration sharing in the LocalLLaMA community this cycle. At least ten posts in the top 65 cover Qwen3.6 directly, with a combined score above 3,500 and more than 1,400 comments.

u/Local-Cardiologist-5 gave the model a tower defense game build task using MCP screenshots for self-verification. The model reportedly caught and fixed its own bugs autonomously, including canvas rendering issues, while running at 120 tok/s on an RTX 3090 via llama.cpp (Qwen3.6. This is it., score 838, 350 comments). u/cviperr33 (score 47): "It literally fixed the broken code or projects I had hit a wall with gemma for days, and it solved it in like 5 mins and then explained why gemma failed."

u/onil_gova confirmed the performance gains are real but emphasized proper configuration -- particularly enabling preserve_thinking -- while running workloads typically reserved for Opus and Codex on an M5 Max 128GB (qwen3.6 performance jump is real, score 580, 227 comments).

Artificial Analysis benchmark chart showing Qwen3.6 positioning among open and closed models

u/Epicguru described it as "the first local model that actually feels worth the effort" -- running Q8 on a 5090 + 4090 with full 260K context at 170 tok/s, and noting that asking the model to review its own changes catches errors 9 times out of 10 (Qwen 3.6 is the first local model that actually feels worth the effort for me, score 364, 123 comments). u/Better-Struggle9958 (score 345) offered the weary counterpoint: "every release same posts."

Hardware optimization posts filled out the picture. u/marlang achieved 79 tok/s on an RTX 5070 Ti + 9800X3D with 128K context, identifying the --n-cpu-moe flag as critical (RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s, score 315, 85 comments).

RTX 5070 Ti performance metrics showing 79 tok/s with Qwen3.6 at 128K context
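
The hybrid CPU/GPU setup described above can be sketched as a small helper that assembles a `llama-server` command line. This is a minimal sketch, not u/marlang's actual configuration: the model filename and the offload count of 20 are placeholders, since the post summary gives neither. The `--model`, `--ctx-size`, and `--n-cpu-moe` flags are standard llama.cpp options.

```python
import shlex
from typing import List


def build_llama_server_cmd(model_path: str, ctx_size: int, n_cpu_moe: int) -> List[str]:
    """Assemble a llama-server argument list for a MoE model with partial CPU offload."""
    if ctx_size <= 0 or n_cpu_moe < 0:
        raise ValueError("ctx_size must be positive and n_cpu_moe non-negative")
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),    # context window in tokens
        "--n-cpu-moe", str(n_cpu_moe),  # keep MoE expert weights of the first N layers on the CPU
    ]


# Placeholder path and offload count (assumptions, not values from the post).
cmd = build_llama_server_cmd("qwen3.6-35b-a3b.gguf", ctx_size=128 * 1024, n_cpu_moe=20)
print(shlex.join(cmd))
```

The right `--n-cpu-moe` value depends on available VRAM: a higher count moves more expert weights to system RAM, trading speed for headroom.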

u/Lowkey_LokiSN ran head-to-head tests and declared that Qwen 3.6 35B "crushes" Gemma 4 26B (score 270, 100 comments). u/simracerman reported Qwen3.6 solving coding problems its 27B predecessor could not, including resolving accumulated technical debt in a budgeting app in one or two attempts on a 5070 Ti 16GB (Qwen3.6-35B-A3B solved coding problems Qwen3.5-27B couldn't, score 131, 52 comments).

u/Striking-Swim6702 published the most comprehensive compatibility test: Qwen3.6 achieved 100% tool calling pass rates across all five agent frameworks tested (Hermes Agent, PydanticAI, LangChain, smolagents, OpenClaude) at 100 tok/s on an M3 Ultra, while DeepSeek-R1 managed only 40-55% and Llama 3.3 45-67% (Qwen 3.6 vs 6 other models across 5 agent frameworks, score 59, 16 comments).

u/Big_Mix_4044 demonstrated that Qwen3.6 maintains context inside its chain-of-thought across turns -- but only with preserve_thinking enabled (Qwen3.6 is maintaining context inside the CoT, score 128, 40 comments). u/TheCTRL (score 42) shared the LM Studio fix: add `` to the Jinja template.

Qwen3.6 chat demonstrating context preservation across turns when preserve_thinking is enabled

Not all reports were positive. u/Lorian0x7 (score 12) called it "an overtrained slop machine capable of regurgitating overused code" that could not create a wiki from a 300-page document. u/havnar- (score 6) found the Opus 4.6-distilled Qwen3.5-35B-A3B still better for their use case. The community demanded the dense 27B variant -- u/GrungeWerX pointed out it won the community vote but was not released (When is Qwen 3.6 27B dropping?, score 162, 57 comments). u/Fabix84 (score 143): "It's pretty clear they had already decided which model to release beforehand and were just hoping the poll would confirm it."

Qwen community poll showing the dense 27B variant won the vote

Discussion insight: The volume of hardware-specific configuration sharing (llama.cpp flags, sampling parameters, quantization levels, CPU offload strategies) indicates the local LLM community has shifted from "does it work?" to "how do I optimize it?" Qwen3.6's 3B active parameters make it uniquely accessible -- running on everything from 16GB laptops to dual-GPU setups.

Comparison to prior day: On April 17, Qwen3.6 was freshly released with benchmark data and early user reports. Today the community delivered systematic testing: agent framework compatibility matrices, hardware optimization guides, head-to-head comparisons against Gemma 4, and the preserve_thinking fix becoming standard advice. The model's position as the current local model of choice is solidified, though demand for the 27B dense variant is loud.


1.2 Claude Opus 4.7 Regression: Quantitative Evidence Hardens (🡕)

The Opus 4.7 backlash intensified, with the day's highest-scoring post providing the sharpest benchmark evidence yet. At least eight posts in the top 65 cover the Opus 4.7 regression, with a combined score above 3,800.

u/seencoding posted the most damaging data point: on the NYT Connections Extended Benchmark (940 puzzles), Opus 4.7 (high reasoning) scored 41.0% -- down from Opus 4.6's 94.7%, a 53.7-point collapse. Opus 4.7 without reasoning scored 15.3%, dead last among 62 models (opus 4.7 scores a 41.0% on the nyt connections extended benchmark, score 1073, 158 comments). u/Klutzy-Snow8016 (score 48) identified a key contributor: Anthropic increased safety refusals -- on puzzles it allowed to be evaluated, it scored 90.9%, still below 4.6's 94.7%. The benchmark creator confirmed this finding.
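
The three percentages above fit together arithmetically. The sketch below assumes that every puzzle the model declined to answer counts as a zero toward the overall score; that scoring rule is an assumption, not something the post summary states. Under it, the overall and answered-only scores imply the share of puzzles that went unevaluated.

```python
def implied_unevaluated_share(overall_pct: float, answered_pct: float) -> float:
    """Fraction of puzzles not evaluated, assuming each one scores zero."""
    return 1.0 - overall_pct / answered_pct


opus_46 = 94.7           # Opus 4.6 score, from the post
opus_47_overall = 41.0   # Opus 4.7 (high reasoning), all 940 puzzles
opus_47_answered = 90.9  # Opus 4.7 on the puzzles it allowed to be evaluated

headline_drop = opus_46 - opus_47_overall
unevaluated = implied_unevaluated_share(opus_47_overall, opus_47_answered)
gap_on_answered = opus_46 - opus_47_answered

print(f"headline drop: {headline_drop:.1f} points")
print(f"implied unevaluated share: {unevaluated:.1%}")
print(f"gap on answered puzzles: {gap_on_answered:.1f} points")
```

Read this way, most of the 53.7-point drop comes from refusals, and the implied share matches the 54.9% refusal rate cited elsewhere in this digest. The residual 3.8-point gap on answered puzzles is the part attributable to capability.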

u/Neurogence posted that Claude power users "unanimously agree" Opus 4.7 is a regression, the first such consensus for any Opus release (Claude Power Users Unanimously Agree That Opus 4.7 Is A Serious Regression, score 925, 181 comments). u/Many_Consequence_337 (score 181): "It's the adaptive thinking that's fucked, the model never uses it." u/danivl (score 157) laid out the economic theory: "4.7 is actually a worse version of 4.6, but cheaper to run...burns through tokens way faster."

u/ENT_Alam ran Opus 4.6 and 4.7 through MineBench, a 3D voxel construction benchmark, at ~$275 total cost. The results showed Opus 4.7 producing more detailed builds with higher block counts but not consistently better-looking output -- "more sideways than better," as u/Financial_Weather_35 (score 49) put it (Differences Between Opus 4.6 and Opus 4.7 on MineBench, score 593, 75 comments). The post noted Opus 4.7's behavioral changes may explain inconsistencies: "More literal instruction following...it will not silently generalize an instruction."

u/Important-Farmer-846 posted Opus 4.7 text category rankings showing mixed performance across categories (Claude Opus 4.7 Text Category Rankings, score 99).

Opus 4.7 text category rankings showing position across different task types

u/exordin26 provided the counter-narrative: Opus 4.7 narrowly leads Artificial Analysis while using significantly fewer tokens than Opus 4.6 (score 201, 57 comments). u/ethotopia (score 93) was unimpressed: "Hot take: Gemini 3.1 and 4.7 being at the top shows how bad this benchmark is for real world use."

Artificial Analysis leaderboard showing Opus 4.7 narrowly leading with fewer tokens

u/lemon07r, author of the SanityHarness coding eval, spent $120 in API credits testing Opus 4.7 and published a detailed account of persistent hallucination and gaslighting behavior. He dubbed it "Gaslightus-4.7": "I've never seen a model hallucinate this badly, this often...it is SOO persistent about being wrong when you try to correct it, no matter how much evidence you provide it tries to gaslight you till the end" (Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1 tested in coding, score 56).

GPT-5.4 screenshot confirming Opus 4.7 gave fabricated instructions, rating it 90% wrong

u/ObjectivePresent4162 catalogued four specific failures: confident hallucination on pricing data, adaptive reasoning defaulting to low effort, making unrequested changes while ignoring requested ones, and faster token burn (After using Opus 4.7... yes, performance drop is real, score 76, 29 comments). u/JulioMcLaughlin2, a PhD student in theoretical math and physics, described spiraling self-corrections and rapid token exhaustion on the $20 plan (Opus 4.7 is terrible, score 254, 125 comments).

On the positive side, u/Savannah_Carter494 tested all three frontier models on UI design and found Opus 4.7 produced the most polished output, with Gemini 3.1 Pro showing better detail adherence and GPT 5.4 in the middle (Opus 4.7 vs Gemini 3.1 Pro vs GPT 5.4, score 169, 38 comments).

UI design comparison showing Opus 4.7, Gemini 3.1 Pro, and GPT 5.4 mobile app mockups side by side

Discussion insight: The regression evidence is now multi-dimensional: NYT Connections (language puzzles), MineBench (spatial reasoning), SanityHarness (real-world coding), and multiple practitioner reports. The emerging picture is that Opus 4.7 gained on coding/SWE benchmarks and token efficiency while losing on generalization, reasoning, creative tasks, and user trust. The refusal rate spike on innocuous content adds a policy dimension to the technical regression.

Comparison to prior day: On April 17, the regression narrative was establishing itself with early benchmark drops and user frustration. Today, the NYT Connections benchmark crash (94.7% to 41.0%) provided the sharpest single data point yet, and the community consensus hardened from "wait and see" to documented disappointment. The Artificial Analysis lead and UI design strengths remain as counterpoints but are outweighed by the volume of regression evidence.


1.3 LLM Consciousness: DeepMind Scientist Challenges the Path (🡒)

u/Worldly_Evidence9113 posted a slide from Google DeepMind Senior Scientist Alexander Lerchner's paper arguing that LLMs can never achieve consciousness -- not even in 100 years -- due to what he calls the "Abstraction Fallacy" (Google DeepMind's Senior Scientist challenges the idea that LLMs can achieve consciousness, score 753, 544 comments).

Slide from Lerchner's paper on the Abstraction Fallacy arguing LLMs confuse linguistic abstraction with phenomenal experience

The paper (available on PhilPapers, linked by u/Electrical-Way6083) argues there must be a "mapmaker" -- a subjective experiencer -- that LLMs fundamentally lack. The 544-comment discussion was sharply split. u/wiglafofpinwick (score 747) offered the sardonic observation: "Looks like his 10+ years of academic research on computational neuroscience + 14 years with DeepMind is not enough to make claims in this topic, but our redditors know it better." u/Rain_On (score 56) criticized the paper for ignoring existing philosophical work: "I'm so tired of scientists writing philosophical works whilst ignoring the entire body of philosophical work that has come before."

Discussion insight: The post generated the highest comment count of the day (544), indicating deep engagement with consciousness as a topic even in technically oriented AI subreddits. The tension between deference to scientific authority and critique of philosophical naivety mirrors broader disagreements about whether consciousness is an engineering problem or a philosophical one.


1.4 AI Geopolitics: China Gains, Europe Hedges (🡒)

u/fortune shared the Stanford HAI 2026 AI Index report finding China has "nearly erased" America's AI lead. The Arena score gap between the top US model (Anthropic's Claude Opus 4.6) and China's Dola-Seed 2.0 narrowed to just 39 points, or 2.7% (China has "nearly erased" America's lead in AI, score 420, 139 comments). u/valtor2 (score 34) pushed back: "Anyone playing around with the US models and the Chinese models extensively can tell this is not true."

In Europe, u/AlertTangerine posted a Forbes profile of Mistral's strategy: building a $14 billion empire "by not being American" -- positioning sovereignty as a feature rather than competing on frontier model performance (How France's Mistral Built A $14 Billion AI Empire, score 178). u/EmbarrassedStudent10 posted news of the UK's $675M "Sovereign AI" fund, which targets "pick and shovel" niches like AI agents, drug discovery, and hardware optimization rather than building frontier models (UK launches $675M Sovereign AI fund, score 113, 39 comments). u/thhvancouver (score 25): "That's cute. Meanwhile Microsoft has committed $40 Billions to AI infrastructure in the European Data Boundary."

Separately, u/ShreckAndDonkey123 reported that OpenAI executive Kevin Weil is leaving as the company's Science Division is dissolved (OpenAI Executive Kevin Weil Is Leaving, score 308, 45 comments).

Comparison to prior day: On April 17, the report covered the UK Sovereign AI fund and Stanford AI Index as emerging stories. Today both gained engagement, and the Mistral profile adds a European sovereignty narrative distinct from the US-China competition.


1.5 Quantization Engineering and Model Infrastructure (🡕)

u/danielhanchen from Unsloth published Qwen3.6 KLD performance benchmarks showing Unsloth quants on the Pareto frontier in 21 of 22 sizes. The post also documented a confirmed CUDA 13.2 bug causing gibberish on all providers' low-bit quants, with NVIDIA confirming a fix coming in CUDA 13.3 (Qwen3.6 GGUF Benchmarks, score 497, 108 comments).

Unsloth KLD GGUF benchmark chart for Qwen3.6-35B-A3B showing Pareto frontier across quant providers

NVIDIA confirmation that CUDA 13.2 bug causing gibberish on low-bit quants will be fixed in CUDA 13.3

The post also addressed community criticism about frequent re-uploads, attributing 95% of cases to upstream issues (llama.cpp bugs, Google's Gemma template changes, MiniMax NaN issues).
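
For readers unfamiliar with the KLD metric and the Pareto framing used in the Unsloth post, here is a minimal sketch. It assumes KLD means the KL divergence between the full-precision model's next-token distribution and the quantized model's, and that a quant is on the Pareto frontier when no other quant is both smaller and closer to the reference. All logits, sizes, and KLD values are hypothetical illustrations, not Unsloth's data.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(P || Q) in nats, with P as the reference (full-precision) distribution."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def pareto_frontier(points: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """Keep (name, size_gb, kld) entries that no other entry beats on both axes."""
    frontier = []
    for name, size, kld in points:
        dominated = any(
            s <= size and k <= kld and (s < size or k < kld)
            for _, s, k in points
        )
        if not dominated:
            frontier.append((name, size, kld))
    return sorted(frontier, key=lambda t: t[1])


# Hypothetical logits for one token position.
reference = softmax([2.0, 1.0, 0.1])
quantized = softmax([1.9, 1.1, 0.1])
kld = kl_divergence(reference, quantized)

# Hypothetical quants: (name, size in GB, mean KLD).
quants = [("A", 10.0, 0.050), ("B", 12.0, 0.030), ("C", 12.5, 0.045), ("D", 16.0, 0.010)]
frontier = pareto_frontier(quants)
print([name for name, _, _ in frontier])
```

In this toy data, quant C is dominated by B, which is both smaller and lower in KLD, so C falls off the frontier.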

u/pmttyji announced PrismML's Ternary Bonsai family -- 1.58-bit models using ternary weights {-1, 0, +1} at 8B, 4B, and 1.7B parameters, achieving 9x memory reduction versus FP16 (Ternary Bonsai: Top intelligence at 1.58 bits, score 355, 83 comments). But u/WeGoToMars7 challenged the claims: Bonsai-8B at 782MB was only 29% smaller than Gemma 4 E2B at Q4_K_M (1104MB) while performing substantially worse -- and the ternary model was 33% larger (Bonsai models are pure hype, score 169, 62 comments).

Side-by-side comparison showing Bonsai-8B giving incorrect answers compared to Gemma 4 E2B

u/KaroYadgar (score 87) noted Bonsai is built on Qwen3, not Qwen3.5, so its issues may stem from the base model rather than quantization. u/DefNattyBoii (score 15) accused PrismML of "intellectual dishonesty" for comparing against obsolete full-weight models instead of quantized current-generation models.
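
The size figures in the Bonsai dispute can be checked with a few lines of arithmetic. The sketch uses only the file sizes quoted above, plus the idealized ratio between 16-bit and 1.58-bit weights. That ratio is a theoretical upper bound that ignores any tensors kept at higher precision, which is an assumption about why the advertised figure is 9x rather than roughly 10x.

```python
def percent_smaller(size_mb: float, baseline_mb: float) -> float:
    """How much smaller size_mb is than baseline_mb, in percent."""
    return (1.0 - size_mb / baseline_mb) * 100.0


bonsai_mb = 782        # Bonsai file size quoted by u/WeGoToMars7
gemma_q4_mb = 1104     # Gemma 4 E2B at Q4_K_M, same source

saving = percent_smaller(bonsai_mb, gemma_q4_mb)
ideal_ratio = 16 / 1.58  # FP16 bits per weight over ternary bits per weight

print(f"Bonsai is {saving:.1f}% smaller than the Q4_K_M baseline")
print(f"idealized FP16-to-ternary ratio: {ideal_ratio:.1f}x")
```

The point of the critique is visible in the numbers: a 9x reduction against FP16 shrinks to under 30% once the baseline is a current 4-bit quant.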

u/nathandreamfast published the deepest abliteration analysis to date: a week-long forensic comparison of HauhauCS, Heretic, and Huihui uncensoring techniques across five Qwen models, using KL divergence, benchmark suites, and weight analysis (Abliterlitics, score 94, 50 comments). Key finding: HauhauCS's "lossless" claim is contradicted at scale, with TruthfulQA dropping 8.2 points on the 27B model. Heretic emerged as the most consistent performer across all sizes. Full results published on HuggingFace.

u/Otis43 reported Cloudflare open-sourcing Unweight, a lossless compression system reducing LLM size by 15-22% without accuracy loss, saving ~3GB VRAM on Llama-3.1-8B on H100 GPUs (Cloudflare open-sources lossless LLM compression tool, score 114, 12 comments).
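
As a rough consistency check on the Unweight figures, the sketch below assumes Llama-3.1-8B has about 8.03 billion parameters stored in BF16 at 2 bytes each. Both values are assumptions, not stated in the post. Under them, a 15-22% reduction brackets the reported ~3GB saving.

```python
PARAMS = 8.03e9          # assumed parameter count for Llama-3.1-8B
BYTES_PER_PARAM = 2      # assumed BF16 storage

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
low_saving_gb = weights_gb * 0.15
high_saving_gb = weights_gb * 0.22

print(f"weights: {weights_gb:.1f} GB")
print(f"saving at 15-22%: {low_saving_gb:.1f} to {high_saving_gb:.1f} GB")
```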

Discussion insight: The quantization ecosystem is professionalizing. Unsloth's systematic benchmarking, Cloudflare's lossless compression, and the Abliterlitics forensic analysis all represent engineering rigor beyond hobbyist tinkering. The CUDA 13.2 bug affecting all providers underscores how fragile the local inference stack remains despite its rapid maturation.

Comparison to prior day: On April 17, the Unsloth GGUF post and Ternary Bonsai were early in their engagement cycle. Today the Bonsai skepticism post provides direct counter-evidence, and the Abliterlitics analysis adds the deepest uncensoring comparison yet published.


1.6 Robotics: Running, Sprinting, Sensing (🡒)

u/heart-aroni posted video of a Unitree H1 accelerating from jogging to running during a test for the Beijing humanoid robot half-marathon scheduled for April 19 (Unitree H1 accelerating from jogging to running, score 813, 87 comments). u/JoelMahon (score 190): "if I was running away from it, outpacing it in jogging mode, and then it sped up whilst I had my head turned around to check how far away it was...I'd shit myself." u/kgurniak91 (score 24): "1 year ago most robots were either falling down every 2 meters or required a team of people running alongside them with a controller."

u/Recoil42 shared Hesai's announcement of the world's first full-color LiDAR chip, achieving pixel-level native fusion of color perception and distance measurement without post-stitching of camera and LiDAR data (Hesai releases full-color LiDAR chip, score 260, 22 comments). The ETX series supports up to 4,320 laser channels and is expected to enter mass production in H2 2026.

Hesai full-color LiDAR colored point cloud showing a street scene with vehicles and pedestrians

Comparison to prior day: On April 17, robotics coverage focused on Figure.AI's balance recovery policy and the 88% home task failure rate. Today the focus shifts to speed (Unitree's jogging-to-running transition) and perception (Hesai's full-color LiDAR), continuing the theme of rapid capability gains in specific domains.


2. What Frustrates People

Claude Opus 4.7 Regression on Non-Coding Tasks

Severity: High. The strongest frustration signal of the day, continuing from April 17 with harder evidence. The NYT Connections Extended Benchmark score dropped from 94.7% to 41.0% (u/seencoding, score 1073). SanityHarness real-world coding tests found persistent hallucination -- u/lemon07r spent $120 testing and dubbed it "Gaslightus-4.7" (score 56). u/ObjectivePresent4162 catalogued confident hallucination on pricing, adaptive reasoning defaulting to low effort, making unrequested changes, and faster token burn (score 76). The 54.9% refusal rate on innocuous benchmark questions compounds the problem. Coping strategies: sticking with Opus 4.6, switching to GPT-5.4, moving to local models.

Opus 4.7 Token Economics and Adaptive Reasoning

Severity: High. u/Accomplished-Code-54 (score 68): "Plus the extra 40% of token usage per prompt (due to the new tokenizer), it's just abysmal." u/JulioMcLaughlin2 described spiraling self-corrections hitting usage limits on the $20 plan (score 254). The adaptive reasoning system defaults to low effort for most queries. u/NewConfusion9480 (score 56), a CS instructor, found 4.7 "notably worse than 4.5 was, much less 4.6" on course content generation. Coping: explicit effort-level commands, switching to Sonnet for routine tasks, migrating to local inference.

Local Model Configuration Complexity

Severity: Medium. Every Qwen3.6 post generated a stream of configuration questions. u/No-Marionberry-772 (score 83): "what stack are you using for software? I'd love to get a proper local setup going but I've had trouble figuring out what I should actually be using." The preserve_thinking flag, --n-cpu-moe flag, sampling parameters, and quantization choices all require model-specific tuning. u/Clean_Initial_9618 (score 20) asked how to use Q5_K_XL on 16GB VRAM. The number of detailed config posts suggests this is a recurring cost that compounds with each model release.

Benchmaxxing vs Real-World Performance

Severity: Medium. The gap between leaderboard scores and practitioner experience continues to widen. u/ethotopia (score 93): "Gemini 3.1 and 4.7 being at the top [of Artificial Analysis] shows how bad this benchmark is for real world use." u/ResidentPositive4122 (score 34) captured the hypocrisy: "This sub when a new SotA jumps on artificial analysis - 'this is the worst benchmark possible.' This sub when a new open model jumps on artificial analysis - 'this is the one!!!'" u/DefNattyBoii (score 15) accused Bonsai of "intellectual dishonesty" for comparing against obsolete models.

Vibe Coding Hype vs Reality

Severity: Low. u/mhamza_hashim documented the gap between "$1M vibe coding" content and reality: "That's not a business. That's a prototype" (Every time I open YouTube, someone is making $1M with vibe coding, score 30, 45 comments). u/GetawayDriving (score 55): "They're not even selling lottery tickets, they're selling instructions on how to buy a lottery ticket."


3. What People Wish Existed

Model Upgrades That Do Not Regress

The Opus 4.7 saga crystallizes a recurring wish. u/Valnar (score 99): "I thought the saying was that 'this is the worst it will be'?" u/Loose_General4018 (score 119): "Nobody cares that it scores 6 points higher on some leaderboard when it's fumbling multi-step engineering tasks it handled fine two versions ago. Vibes on benchmarks does not equal vibes in production." The community wants frontier models where coding gains do not come at the cost of reasoning, language, and creative tasks. No product addresses this directly. Opportunity: direct.

Qwen 3.6 27B Dense Model

u/GrungeWerX voiced the demand directly: the 27B won the community vote but was not released (score 162). u/zsydeepsky (score 24): "if 3.6-27B can retain the advantage 3.5-27B has compared to 3.5-35B-A3B then this would be truly a Claude-4.6-sonnet running on your own machine." u/-Ellary- (score 16) noted the MoE 35B model feels comparable to "really light models, close to 9-12b dense" in depth. The demand is urgent among users who value reasoning density over speed.

Shared GPU Configuration Database

Every new model release restarts the tuning cycle. The volume of per-GPU config posts (llama-server flags, sampling parameters, quantization choices, CPU offload strategies) across the Qwen3.6 testing wave suggests a community-maintained config registry would save thousands of collective hours. u/No-Marionberry-772 (score 83): "what stack are you using for software? I'd love to get a proper local setup going but I've had trouble figuring out what I should actually be using." Opportunity: direct, partially addressed by scattered forum posts.
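
One way such a registry could be structured is sketched below. The schema and class names are hypothetical; the three sample entries use only figures reported in this digest, with fields left empty where the posts did not give a value.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ConfigEntry:
    """One reported model/hardware configuration (hypothetical schema)."""
    model: str
    hardware: str
    tokens_per_sec: float
    context: Optional[str] = None   # as reported, e.g. "128K"; None if not stated
    quant: Optional[str] = None
    flags: Tuple[str, ...] = ()     # flag names only; values were not reported
    source: str = ""


class ConfigRegistry:
    """In-memory registry searchable by model and hardware substring."""

    def __init__(self) -> None:
        self._entries: List[ConfigEntry] = []

    def add(self, entry: ConfigEntry) -> None:
        self._entries.append(entry)

    def find(self, model: str, hardware_contains: str = "") -> List[ConfigEntry]:
        """Return matching entries, fastest first."""
        needle = hardware_contains.lower()
        matches = [
            e for e in self._entries
            if e.model == model and needle in e.hardware.lower()
        ]
        return sorted(matches, key=lambda e: e.tokens_per_sec, reverse=True)


MODEL = "Qwen3.6-35B-A3B"
registry = ConfigRegistry()
registry.add(ConfigEntry(MODEL, "RTX 5070 Ti + 9800X3D", 79, context="128K",
                         flags=("--n-cpu-moe",), source="u/marlang"))
registry.add(ConfigEntry(MODEL, "RTX 3090", 120, source="u/Local-Cardiologist-5"))
registry.add(ConfigEntry(MODEL, "RTX 5090 + RTX 4090", 170, context="260K",
                         quant="Q8", source="u/Epicguru"))

for entry in registry.find(MODEL):
    print(entry.hardware, entry.tokens_per_sec, entry.source)
```

A real registry would also need versioning for the inference engine and driver, since issues like the CUDA 13.2 bug make a config valid only for a specific stack.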

Honest Benchmarks That Match Real Use

u/lemon07r built SanityHarness specifically because standard benchmarks fail to capture real coding agent behavior, publishing 145 results across models (sanityboard.lr7.dev). u/Striking-Swim6702 built a full agent framework compatibility matrix. The Opus 4.7 debacle -- scoring well on standard benchmarks while regressing on practical tasks -- sharpens the demand for task-specific, reproducible evaluation. Opportunity: competitive, with SanityHarness and MineBench as emerging entrants.

Larger Ternary/Ultra-Low-Bit Models

u/Silver_Bug8527 (score 102) on the Bonsai thread: "Bonsai 35B when?" u/Kaljuuntuva_Teppo (score 9): "Too bad we are limited to small models. Something that better utilizes 24-32 GB consumer GPU's would be preferable." PrismML's current offerings stop at 8B. Applying ternary quantization to 20-40B base models would be a significant advance. Opportunity: aspirational.


4. Tools and Methods in Use

| Tool | Category | Sentiment | Strengths | Limitations |
|---|---|---|---|---|
| Claude Opus 4.7 | LLM (frontier) | (-) | Leads Artificial Analysis; more token-efficient than 4.6; strong UI generation | NYT Connections crash (94.7% to 41.0%); persistent hallucination/gaslighting; adaptive reasoning defaults to low effort; 40% more tokens per prompt via new tokenizer; 54.9% refusal rate on innocuous content |
| Qwen3.6-35B-A3B | LLM (local MoE) | (+) | 3B active params; Apache 2.0; 100% tool calling across 5 frameworks; 79-170 tok/s on consumer GPUs; 262K context | Some adherence issues; verbose reasoning; does not respect read-only mode in Plan mode; 27B dense variant not yet released |
| Claude Opus 4.6 | LLM (frontier) | (+/-) | Still preferred by many for reasoning and creative tasks | Reports of degradation coinciding with 4.7 launch; compute reallocation suspected |
| GPT-5.4 | LLM (frontier) | (+) | Leads FrontierMath; competitive on UI generation; cited as sanity-check against Opus 4.7 errors | Not the strongest on language puzzles (93.6% on NYT Connections vs Gemini's 98.4%) |
| Gemini 3.1 Pro | LLM (frontier) | (+/-) | #1 on NYT Connections Extended (98.4%); leads multiple benchmarks | Criticized as "unusable for agentic business work" despite benchmark dominance |
| Unsloth GGUFs | Quantization | (+) | Pareto-optimal KLD in 21/22 sizes for Qwen3.6; transparent CUDA 13.2 bug reporting; MiniMax NaN investigation | Re-uploads required for upstream issues; some community suspicion about competitive motivations |
| llama.cpp | Inference engine | (+) | Standard for local inference; preserve_thinking support; --n-cpu-moe flag enables hybrid CPU/GPU MoE | Config tuning required per model and GPU; no shared database |
| OpenCode | Coding agent | (+) | Preferred by multiple testers for local model coding; SanityHarness and agent framework tests built on it | Requires per-provider configuration |
| Ternary Bonsai | Edge model | (+/-) | 1.58-bit ternary weights; 9x memory reduction vs FP16 | Built on Qwen3 not Qwen3.5; independent testing shows weaker than Gemma 4 E2B; only MLX format; "intellectual dishonesty" in benchmarks |
| Heretic | Abliteration tool | (+) | Most consistent uncensoring across all model sizes; lowest KL divergence on 27B; surgical approach | Sometimes retains soft refusals |
| LM Studio | Inference UI | (+) | Popular for local model management; Jinja template editing for preserve_thinking | Default settings may not be optimal for new models |
| Kimi K2.6-Code-Preview | LLM (hosted) | (+) | Rated above GLM 5.1 on SanityHarness; early access showing promise | API not yet available; CLI-only |
| Grok 4.3 Beta | LLM (frontier) | (+/-) | New tier from xAI | $300/month pricing draws skepticism |

The dominant migration pattern from April 17 continues: practitioners moving from hosted frontier models to local inference, driven by Opus 4.7 regression, token economics, and Qwen3.6's competitive performance on consumer hardware. The Qwen family's 100% tool calling across all tested frameworks is a significant practical advantage over non-Qwen local models.


5. What People Are Building

| Project | Who built it | What it does | Problem it solves | Stack | Stage | Links |
|---|---|---|---|---|---|---|
| MineBench | u/ENT_Alam | 3D voxel construction benchmark for LLM spatial reasoning | Standard benchmarks miss spatial/creative capabilities | Minecraft-style voxel JSON, Glicko scoring, tool mode | Shipped, 13+ models tested | minebench.ai, GitHub |
| SanityHarness | u/lemon07r | Multi-language coding agent evaluation across 6 languages | Standard coding benchmarks miss real agent behavior | Go, Docker, bubblewrap sandbox, OpenCode | Shipped, 145 results | sanityboard.lr7.dev, GitHub |
| OpenCode Kimi plugin | u/lemon07r | Kimi K2.6-Code-Preview support for OpenCode | CLI-only Kimi access; missing OpenCode integration | OpenCode plugin, OAuth headers | Released | GitHub |
| Research-to-webapp skill | u/dreamai87 | Converts research papers to web apps with agentic tool calls | Manual research-to-prototype workflow | Qwen3.6 Q2_K_XL, llama-server, 16GB VRAM laptop | Shipped, 58 calls 98.3% success | GitHub |
| Abliterlitics forensics | u/nathandreamfast | Forensic benchmark comparing abliteration techniques | Claims of "lossless" uncensoring lack verification | RTX 5090 + 4090, lm-evaluation-harness, vLLM, HarmBench | Published on HuggingFace | HuggingFace collection |
| Agent framework matrix | u/Striking-Swim6702 | 7-model, 5-framework compatibility test on Apple Silicon | No cross-framework agent compatibility data | Rapid-MLX, M3 Ultra, Hermes/PydanticAI/LangChain/smolagents/OpenClaude | Published | Post |
| Zagreus/Nesso SLMs | u/kazzus78 | 0.4B multilingual models for European languages | No small models optimized for Italian/Spanish/French/Portuguese | 64 A100s, Datatrove, Nanotron, Axolotl, ~1T tokens | Released on HuggingFace | GitHub |
| Budgeting app via Qwen3.6 | u/simracerman | Full budgeting app replacing decade-old cloud service | Cloud-based budgeting app lock-in | Qwen3.6 Q5_K_XL, 5070 Ti 16GB, OpenCode | Working, ongoing | Post |

u/ENT_Alam's MineBench continues to evolve as a community-favorite creative benchmark, now with 13+ models tested including both Opus 4.6 and 4.7. The MIT-licensed tool uses natural language prompts and evaluates 3D coordinate JSON output via head-to-head voting with a Glicko-style rating system.
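
To illustrate how head-to-head votes can turn into ratings, here is a minimal sketch of the textbook Glicko-1 update. It is not MineBench's actual code, which the post describes only as "Glicko-style", and it omits the step that inflates rating deviation over idle periods. Each vote is treated as one game with a score of 1, 0.5, or 0.

```python
import math
from typing import List, Tuple

Q = math.log(10) / 400.0


def g(rd: float) -> float:
    """Attenuation factor: discounts results against opponents with uncertain ratings."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q ** 2 * rd ** 2 / math.pi ** 2)


def expected_score(r: float, r_opp: float, rd_opp: float) -> float:
    """Expected score against an opponent with rating r_opp and deviation rd_opp."""
    return 1.0 / (1.0 + 10 ** (-g(rd_opp) * (r - r_opp) / 400.0))


def glicko1_update(r: float, rd: float,
                   results: List[Tuple[float, float, float]]) -> Tuple[float, float]:
    """Update (rating, deviation) from a list of (opp_rating, opp_rd, score) results."""
    inv_d2 = 0.0
    surprise = 0.0
    for r_opp, rd_opp, score in results:
        e = expected_score(r, r_opp, rd_opp)
        weight = g(rd_opp)
        inv_d2 += Q ** 2 * weight ** 2 * e * (1.0 - e)
        surprise += weight * (score - e)
    precision = 1.0 / rd ** 2 + inv_d2
    return r + (Q / precision) * surprise, math.sqrt(1.0 / precision)


# Worked example from Glickman's description of the system:
# a 1500-rated player (RD 200) wins once and loses twice.
new_r, new_rd = glicko1_update(1500, 200, [(1400, 30, 1), (1550, 100, 0), (1700, 300, 0)])
print(round(new_r), round(new_rd, 1))
```

Compared with plain Elo, the rating deviation lets a newly added model move quickly while its rating is uncertain and then settle as votes accumulate.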

u/kazzus78's Zagreus/Nesso project stands out as a rare from-scratch training report at small scale. The technical report documents the full pipeline -- tokenization, Slurm orchestration, distributed training on 64 A100s with ~1 trillion tokens, and post-training -- for 0.4B parameter models targeting European languages. The Nesso-0.4B-agentic variant showed particular strength on Italian tasks.

LLM-as-judge comparison showing Zagreus and Nesso models against Qwen baselines on Italian and English tasks

The day's builder activity clusters around two patterns: evaluation infrastructure (MineBench, SanityHarness, Abliterlitics, agent framework matrix) and practical local-model applications (research-to-webapp, budgeting app). Qwen models appear in most of these projects, either as the model under test or as the engine behind the application.


6. New and Notable

Kimi K2.6 Teased by Moonshot AI

u/Namra_7 posted teaser images for Kimi K2.6, the next model from Moonshot AI (KIMI K2.6 SOON !!, score 247, 48 comments). u/FriskyFennecFox (score 36) praised K2.5's strengths: 1T total parameters with 32B active, QAT-by-design for cheap API pricing, strong image understanding, and a modified MIT license. u/lemon07r already received early access to K2.6-Code-Preview and rated it slightly above GLM 5.1 on SanityHarness, with API support expected next week.

Elephant Alpha: Mystery Model at #1 on OpenRouter

u/i_hate_bharat raised the question: a 100B parameter model called "Elephant Alpha" has been sitting at #1 on OpenRouter doing ~250 tps with 256K context and strong coding performance, but no one knows who made it (has anyone figured out whose model Elephant Alpha is yet?, score 84, 23 comments). Poor Chinese language support rules out Qwen/DeepSeek. The community speculated about Cohere or a new startup.

Cloudflare Open-Sources Lossless LLM Compression

Cloudflare released Unweight, a lossless compression system reducing LLM size by 15-22% without sacrificing output accuracy, saving ~3GB VRAM on Llama-3.1-8B on H100 GPUs (u/Otis43, score 114). GPU kernels are open-sourced, with plans to extend compression to attention weights.

Zero-shot World Models: Learning Like a Child

u/FaeriaManic posted a paper introducing the Zero-shot World Model (ZWM), which matches state-of-the-art models on visual-cognitive tasks when trained on a single child's visual experience -- zero-shot, with no task-specific training (Zero-shot World Models Are Developmentally Efficient Learners, score 141, 27 comments). Authors from Stanford (Aw, Kotar, Lee, Kim, et al.). Code release expected by end-April 2026. arXiv: 2604.10333, GitHub.

Zero-shot World Models architecture diagram showing BabyZWM training on child visual experience data

Claude Design Announced

u/MassiveWasabi posted Anthropic's announcement of Claude Design, a new product for making prototypes, slides, and one-pagers by talking to Claude (Introducing Claude Design, score 99, 14 comments). Early engagement was minimal.

"Harness" Emerges as Standard Terminology

u/jacek2023 asked whether "harness" is a new buzzword (Is harness a new buzzword?, score 127, 107 comments). u/vaksninus (score 116): "It's a good way to describe the code used to employ models like Claude Code." u/GraciousMule (score 9): "It's replaced Wrapper...The labs want friendlier consumer facing language than 'cognitive stack.'" The terminology shift reflects the maturing agentic ecosystem.


7. Where the Opportunities Are

[+++] Independent model quality monitoring service -- The Opus 4.7 regression is now documented across three independent benchmarks plus structured practitioner reports: NYT Connections (94.7% to 41.0%), MineBench (lateral move at $275 cost), SanityHarness ($120 in API credits finding persistent hallucination), and user failure catalogs. No product independently monitors hosted model quality at the inference level or alerts users to regression. The growing distrust of provider benchmarks creates a gap for a trusted third-party monitoring service. Evidence from sections 1.2, 2.

[+++] Shared local model configuration registry -- Ten Qwen3.6 posts in one day generated hundreds of hardware-specific configuration questions and answers scattered across Reddit comments. Critical flags like preserve_thinking, --n-cpu-moe, sampling parameters, and quantization choices vary by GPU, VRAM, and use case. A community-maintained, searchable database of model-hardware-config combinations would save thousands of collective hours with each model release. Evidence from sections 1.1, 2, 3.

[++] Agent framework compatibility testing as a service -- u/Striking-Swim6702's compatibility matrix revealed stark differences: Qwen models hit 100% tool calling across all frameworks while DeepSeek-R1 managed 40-55%. This data is invaluable for practitioners choosing models and frameworks but exists only as a one-off Reddit post. A continuously updated service mapping model-framework compatibility would serve the growing agentic coding community. Evidence from section 1.1.

[++] Ultra-low-bit quantization for medium-to-large models -- Ternary Bonsai demonstrated the concept at 8B but community demand is for 20-40B+ models. The Abliterlitics analysis showed quantization technique choice matters enormously at scale. Combining ternary or extreme quantization with current-generation base models (Qwen3.5/3.6, Gemma 4) on consumer GPUs is an open engineering problem. Evidence from section 1.5.

[+] Sovereign AI infrastructure consulting -- Both the UK ($675M fund) and France (Mistral at $14B) are betting on AI sovereignty as a feature. The UK fund specifically targets "pick and shovel" niches. Consulting or tooling that helps sovereign AI initiatives provision compute, evaluate models, and comply with data residency requirements would serve an emerging market. Evidence from section 1.4.

[+] Lossless model compression tooling -- Cloudflare's Unweight achieves 15-22% size reduction. Extending this to attention weights and combining it with quantization could meaningfully expand what fits on consumer hardware. Evidence from section 1.5.
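
The first opportunity above, independent quality monitoring, comes down to comparing a model's score on a fixed benchmark across versions and alerting when the drop exceeds a threshold. A minimal sketch follows. The `Score` type, the `find_regressions` helper, the 10-point threshold, and the version labels are all hypothetical; only the 94.7% and 41.0% NYT Connections figures come from the reports summarized here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    benchmark: str
    version: str      # hypothetical label, for illustration only
    accuracy: float   # percent, 0-100


def find_regressions(before, after, threshold=10.0):
    """Return (benchmark, drop) pairs where accuracy fell by more than `threshold` points."""
    baseline = {s.benchmark: s.accuracy for s in before}
    alerts = []
    for s in after:
        if s.benchmark in baseline:
            drop = baseline[s.benchmark] - s.accuracy
            if drop > threshold:
                alerts.append((s.benchmark, round(drop, 1)))
    return alerts


# Figures reported for the NYT Connections benchmark in this digest.
before = [Score("NYT Connections", "previous", 94.7)]
after = [Score("NYT Connections", "opus-4.7", 41.0)]
alerts = find_regressions(before, after)
print(alerts)
```

A real service would also need repeated runs to separate regression from sampling noise, plus a fixed prompt set that providers cannot tune against; the sketch covers only the comparison step.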


8. Takeaways

  1. Qwen3.6-35B-A3B dominated day-two community testing with at least ten posts and 3,500+ combined score, cementing its position as the current local model of choice. Its 100% tool calling across five agent frameworks, 79-170 tok/s performance range on consumer GPUs, and the preserve_thinking fix for multi-turn coherence represent a practical milestone for local inference. (Qwen3.6. This is it., Qwen 3.6 vs 6 other models)

  2. Claude Opus 4.7 regression evidence hardened with the sharpest data point yet: a 53.7-point drop on the NYT Connections Extended Benchmark (94.7% to 41.0%). Combined with the SanityHarness "Gaslightus-4.7" label, practitioner failure catalogs, and 54.9% refusal rates on innocuous content, the community consensus has shifted from skepticism to documented disappointment. (opus 4.7 scores 41.0%, Opus 4.7 Is A Serious Regression)

  3. The quantization ecosystem is professionalizing. Unsloth's Pareto-optimal GGUF benchmarks, the Abliterlitics forensic analysis contradicting "lossless" uncensoring claims at scale, and Cloudflare's open-source lossless compression all demonstrate engineering rigor beyond hobbyist tinkering. The confirmed CUDA 13.2 bug affecting all providers' low-bit quants shows how fragile the stack remains. (Qwen3.6 GGUF Benchmarks, Abliterlitics)

  4. The "benchmaxxed" critique gained its most concrete evidence. Opus 4.7 leads Artificial Analysis while crashing on NYT Connections. Gemini 3.1 Pro tops NYT Connections at 98.4% while users call it unusable for agentic work. Bonsai benchmarks drew accusations of intellectual dishonesty. The gap between leaderboard performance and practitioner satisfaction continues to widen. (Artificial Analysis, Bonsai pure hype)

  5. AI geopolitics is fragmenting along three axes: the US-China gap narrowing to 39 Arena points per Stanford HAI, European sovereignty bets (Mistral at $14B, UK's $675M fund), and organizational turbulence at OpenAI with a key executive departure and Science Division dissolution. (China nearly erased US lead, UK Sovereign AI, OpenAI exec leaving)

  6. Builder activity clustered around evaluation infrastructure. Four of the day's highlighted projects (MineBench, SanityHarness, Abliterlitics, agent framework matrix) are evaluation tools rather than applications. This reflects a community recognizing that trustworthy benchmarks are now the bottleneck, not raw model capability. (MineBench, SanityHarness)

  7. The LLM consciousness debate drew the highest comment count of the day (544) after a DeepMind Senior Scientist argued LLMs can never achieve consciousness. The community split between deference to expertise and critique of philosophical naivety, a sign that consciousness remains one of AI's most engaging topics even in technically oriented subreddits. (Abstraction Fallacy)

  8. Kimi K2.6 and the mystery "Elephant Alpha" model signal a broadening competitive field. Moonshot AI is building on K2.5's strong reception, while an unidentified 100B model at #1 on OpenRouter suggests new entrants are shipping before announcing. The local model ecosystem now has enough depth that practitioner choice is constrained not by availability but by evaluation bandwidth. (KIMI K2.6 SOON, Elephant Alpha)