Reddit AI - 2026-04-17
1. What People Are Talking About
1.1 Qwen3.6-35B-A3B Dominates the Day (🡕)
The day's highest-scoring post across all tracked subreddits came from u/ResearchCrafty1804, who announced the release of Qwen3.6-35B-A3B, a sparse MoE model with 35B total parameters and only 3B active, under the Apache 2.0 license (Qwen3.6-35B-A3B released!, score 1947, 615 comments). u/NewEconomy55 posted a parallel thread (Released Qwen3.6-35B-A3B, score 476, 88 comments).

Key benchmarks from the Qwen blog: SWE-bench Verified 73.4, SWE-bench Pro 49.5 (up from 44.6 on Qwen3.5-35B-A3B), Terminal-Bench 2.0 51.5, GPQA Diamond 86.0, HMMT Feb 26 83.6. The model is natively multimodal, with VLM performance matching Claude Sonnet 4.5 on several benchmarks and strong spatial intelligence (RefCOCO 92.0, ODInW13 50.8). The model uses 256 experts with 8 routed per token, hybrid attention (linear + softmax, 3:1 ratio), and supports 262K context.
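To make the sparsity figures concrete, the short sketch below works through the arithmetic implied by the numbers quoted above. The layer layout function is a hypothetical illustration of a 3:1 linear-to-softmax ratio; the post does not describe how the layers are actually ordered.

```python
# Figures quoted in the post; everything derived from them is plain arithmetic.
TOTAL_PARAMS_B = 35.0        # total parameters, in billions
ACTIVE_PARAMS_B = 3.0        # parameters used per token, in billions
NUM_EXPERTS = 256
ROUTED_PER_TOKEN = 8
LINEAR_TO_SOFTMAX = (3, 1)   # hybrid attention ratio


def layer_pattern(num_layers, ratio=LINEAR_TO_SOFTMAX):
    """Hypothetical layout: three linear-attention layers, then one softmax layer."""
    block = ["linear"] * ratio[0] + ["softmax"] * ratio[1]
    return [block[i % len(block)] for i in range(num_layers)]


expert_share = ROUTED_PER_TOKEN / NUM_EXPERTS      # share of experts touched per token
param_share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B     # share of weights touched per token

print(f"experts routed per token: {expert_share:.3%}")
print(f"parameters active per token: {param_share:.1%}")
print(layer_pattern(8))
```

The active-parameter share comes out higher than the routed-expert share. That would be consistent with components that run for every token (attention, embeddings, any shared layers), though the post gives no breakdown.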
u/Kodix (score 382): "What a good couple months for local LLMs, huh?" u/AndreVallestero (score 144): "I hope they release 3.6 122B to pressure Google to release their 124B model as well." u/Willing-Toe1942 (score 102): "Oh I feel that qwen team wanted to flex on Gemma so bad that they only compared to Qwen3.5/Gemma4." u/Middle_Bullfrog_6173 (score 129) flagged the blog's teaser: "Qwen3.6 open-source family keeps expanding, stay tuned."
Early user reports were overwhelmingly positive. u/Local-Cardiologist-5 reported the model autonomously building a tower defense game, catching its own bugs, and fixing canvas rendering issues, running at 120 tok/s on an RTX 3090 via llama.cpp (Qwen3.6. This is it., score 525, 254 comments). u/cviperr33 (score 34): "It literally fixed the broken code or projects I had hit a wall with gemma for days, and it solved it in like 5 mins." u/CountlessFlies implemented RLS in Postgres across a multi-language codebase (Rust, TypeScript, Python) and called it "the holy grail of local coding" (Qwen3.6 is incredible with OpenCode!, score 115, 48 comments).
However, not all experiences were positive. u/tkon3 reported worse adherence than its predecessor on vLLM with RAG setups — more verbose reasoning, weaker system prompt following, and shorter final responses (Qwen 3.6: worse adherence?, score 67, 47 comments). u/exact_constraint (score 30): "3.6 enjoys ignoring the read only limitation while in Plan mode."
A critical infrastructure finding came from u/onil_gova: Qwen3.6 ships with a preserve_thinking flag that fixes the KV cache invalidation issue from previous versions (PSA: Qwen3.6 ships with preserve_thinking, score 305, 68 comments). The flag keeps prior reasoning in context instead of stripping and re-serializing it each turn. u/mlhher (score 105) shared the llama.cpp flag: --chat-template-kwargs '{"preserve_thinking": true}'.
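The cache effect described above can be illustrated with a toy model. The sketch assumes an inference server reuses cached work only for the longest prefix shared between the previous sequence and the new prompt, and it uses characters as a stand-in for tokens. The `render` template is hypothetical and is not Qwen's real chat template.

```python
def render(turns, preserve_thinking):
    """Toy chat template (hypothetical, not Qwen's real one)."""
    parts = []
    for role, thinking, text in turns:
        body = text
        if role == "assistant" and thinking and preserve_thinking:
            body = f"<think>{thinking}</think>{text}"
        parts.append(f"<{role}>{body}</{role}>")
    return "".join(parts)


def common_prefix_len(a, b):
    """Length of the shared prefix; a stand-in for reusable KV cache entries."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


history = [
    ("user", "", "Fix the bug."),
    ("assistant", "The loop is off by one.", "Patched line 12."),
]
# What the server processed last turn: the generated reply included its reasoning.
cached = render(history, preserve_thinking=True)
next_turns = history + [("user", "", "Now add a test.")]


def reusable_chars(preserve_thinking):
    """How much of the cached sequence the next prompt can reuse."""
    prompt = render(next_turns, preserve_thinking)
    return common_prefix_len(cached, prompt)


for flag in (False, True):
    print(f"preserve_thinking={flag}: "
          f"{reusable_chars(flag)}/{len(cached)} cached characters reusable")
```

In this toy setup, stripping the reasoning makes the new prompt diverge from the cached sequence at the first assistant turn, so everything after that point has to be recomputed. Keeping the reasoning leaves the whole cached sequence reusable.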
u/danielhanchen from Unsloth published KLD benchmarks for Qwen3.6 GGUFs, showing Unsloth quants on the Pareto frontier in 21 of 22 sizes (Qwen3.6 GGUF Benchmarks, score 227, 57 comments). The post also documented a confirmed CUDA 13.2 bug causing gibberish on low-bit quants, with NVIDIA confirming a fix coming in CUDA 13.3.
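For readers unfamiliar with the two ideas in that benchmark, here is a minimal sketch of KL divergence between a reference and a quantized next-token distribution, plus a size-versus-KLD Pareto filter. The quant names and numbers are invented for illustration and are not taken from the Unsloth post; the post's exact measurement procedure is not described here, so treat this as a generic version of the technique.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pareto_frontier(points):
    """Names of (name, size_gb, kld) entries that no other entry beats on both axes."""
    frontier = []
    for name, size, kld in points:
        dominated = any(
            s <= size and k <= kld and (s < size or k < kld)
            for _, s, k in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Toy next-token distributions over a three-word vocabulary (made up).
reference = [0.7, 0.2, 0.1]
quantized = [0.6, 0.3, 0.1]
print(f"KLD: {kl_divergence(reference, quantized):.4f} nats")

# Hypothetical quants: (name, file size in GB, mean KLD vs. full precision).
quants = [
    ("quant_a", 10.0, 0.080),
    ("quant_b", 12.0, 0.050),
    ("quant_c", 12.5, 0.070),   # larger AND worse than quant_b, so dominated
    ("quant_d", 16.0, 0.020),
]
print(pareto_frontier(quants))
```

A quant is on the frontier when no alternative is both smaller and closer to the reference model, which is the sense in which a claim like "21 of 22 sizes" would be evaluated.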

u/hauhau901 released an uncensored "Aggressive" variant with K_P quants, claiming 0 of 465 refusals with zero capability loss (Qwen3.6-35B-A3B Uncensored Aggressive, score 250, 78 comments). Community reception was interested but skeptical: u/llama-impersonator (score 56) noted "still a distinct lack of information on what you did, how you tested 'zero capability loss.'"
Discussion insight: Qwen3.6 generated the most concentrated community testing activity of any model release in recent memory. The 3B active parameter count makes it accessible on consumer GPUs (running on 4090s, 3090s, and even 16GB laptops), while the preserve_thinking fix addresses a real infrastructure pain point. The competitive framing against Gemma 4 is explicit — Qwen published direct comparisons only against Qwen3.5 and Gemma4.
Comparison to prior day: On April 16, Qwen3.6 was freshly released with benchmark excitement but limited user reports. Today the community delivered a flood of hands-on testing — from agentic coding to RAG workflows — establishing it as the current local model of choice, though adherence and system prompt issues remain for some users.
1.2 Claude Opus 4.7: Benchmark Gains Meet User Backlash (🡕)
Opus 4.7's general availability generated the day's most polarized discussion. Multiple posts covered benchmarks, regression evidence, and user frustration.
u/ShreckAndDonkey123 posted the official benchmark table (Claude Opus 4.7 benchmarks, score 827, 222 comments). u/policyweb posted additional coverage (score 483, 33 comments).

Key numbers from the Anthropic blog: SWE-bench Pro 64.3% (up from 53.4% on Opus 4.6), SWE-bench Verified 87.6%, Terminal-Bench 2.0 69.4%, HLE 46.9% without tools / 54.7% with tools, OSWorld-Verified 78.0%. Cyber capabilities (CyberGym 73.1%) were intentionally constrained below Mythos Preview levels per Project Glasswing. Pricing unchanged at $5/$25 per million input/output tokens. Hex's early tester quote: "low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6."
But the backlash was swift and unusually unified. u/Neurogence posted that Claude power users "unanimously agree" Opus 4.7 is a regression (Claude Power Users Unanimously Agree That Opus 4.7 Is A Serious Regression, score 819, 162 comments). u/Many_Consequence_337 (score 164): "It's the adaptive thinking that's fucked, the model never uses it." u/danivl (score 155) laid out the theory: "4.7 is actually a worse version of 4.6, but cheaper to run...burns through tokens way faster."
u/seencoding provided the sharpest evidence: on the NYT Connections Extended Benchmark, Opus 4.7 (high) scored 41.0% while Opus 4.6 scored 94.7% — a 53.7 point drop (opus 4.7 (high) scores a 41.0%, score 618, 112 comments). Opus 4.7 without reasoning scored dead last at 62nd place with 15.3%. u/Klutzy-Snow8016 (score 36) identified a key contributor: Anthropic turned refusals up — 54.9% of benchmark questions were refused on safety grounds despite containing no NSFW content. When it did answer, it scored 90.9%, still below 4.6's 94.7%.
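The refusal and accuracy figures quoted above can be reconciled with a line of arithmetic. The sketch assumes refused questions are scored as zero, which the thread does not state explicitly but which reproduces the reported overall score.

```python
REFUSAL_RATE = 0.549             # share of questions refused (from the thread)
ACCURACY_WHEN_ANSWERED = 0.909   # score on the questions it did answer
OPUS_4_6_SCORE = 0.947


def overall_score(refusal_rate, accuracy_when_answered):
    """Overall score if every refusal counts as a miss (assumption)."""
    return (1 - refusal_rate) * accuracy_when_answered


score = overall_score(REFUSAL_RATE, ACCURACY_WHEN_ANSWERED)
gap_from_refusals = ACCURACY_WHEN_ANSWERED - score
gap_from_accuracy = OPUS_4_6_SCORE - ACCURACY_WHEN_ANSWERED

print(f"overall: {score:.1%}")
print(f"points lost to refusals: {gap_from_refusals * 100:.1f}")
print(f"points lost to lower accuracy: {gap_from_accuracy * 100:.1f}")
```

Under that assumption, roughly 49.9 of the 53.7 lost points trace to refusals and about 3.8 to lower accuracy on answered questions.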

u/zero0_one1 confirmed the pattern on the Thematic Generalization Benchmark: 80.6 to 72.8 for high reasoning, 68.8 to 52.6 with no reasoning (Opus 4.7 unexpectedly performs significantly worse, score 439, 66 comments). u/PaxODST (score 165): "Some corners were definitely cut on other aspects to maximize its gains on coding and SWE."
u/lemon07r, author of the SanityHarness coding eval, spent $120 in API credits testing Opus 4.7 and wrote an extended rant: "I've never seen a model hallucinate this badly, this often...it is SOO persistent about being wrong when you try to correct it" (Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1 tested in coding, score 52). He dubbed it "Gaslightus-4.7."
u/JulioMcLaughlin2, a PhD student doing theoretical math and physics research, described spiraling self-corrections and rapid token exhaustion on the $20 plan (Opus 4.7 is terrible, score 121, 66 comments). u/looselyhuman (score 71): "'Adaptively' nerfing Opus is how Anthropic is trying to keep the servers running until they can build more."
u/ObjectivePresent4162 catalogued four specific failures: confident hallucination on pricing data, adaptive reasoning defaulting to low-effort, making unrequested changes while ignoring requested ones, and faster token burn (After using Opus 4.7... yes, performance drop is real, score 65, 26 comments).
Discussion insight: Opus 4.7 gains on coding/SWE benchmarks while regressing on generalization, language puzzles, creative tasks, and real-world reasoning. The refusal rate spike (54.9% on innocuous benchmark questions) and adaptive reasoning defaulting to low effort suggest systemic optimization for a narrow set of benchmarks at the cost of broader capability. The community is coalescing around a "benchmaxxed" narrative.
Comparison to prior day: On April 16, the report covered Opus 4.7's launch with measured skepticism. Today, quantitative regression evidence emerged from at least four independent benchmarks and multiple practitioner reports. The narrative has shifted decisively from "wait and see" to documented disappointment.
1.3 Anthropic's Trust Crisis: Subscriptions, Privacy, and the Local Migration (🡕)
Three converging threads form a broader trust narrative around Anthropic.
u/kaggleqrdl amplified a GitHub user's analysis predicting Anthropic is "constructively terminating its subscription plans" in favor of enterprise-only access (Only LocalLLaMa can save us now, score 391, 134 comments). The original GitHub comment argued Anthropic is "willing to slowly attrit and lose customers to churn through silent degradation." u/PhillyG17 (score 200): "This is the same sort of thing that happened in the early 'Wild Wild West' of the internet. First it's all about making the best product, then it shifts to making a profitable one." u/ttkciar (score 115), a LocalLLaMA moderator, left the post up, calling it on-topic: "The mercurial nature of commercial inference services is a big reason some of us are here on this sub."
u/fulgencio_batista posted that Claude now requires identity verification including a valid ID and facial recognition scan (More reasons to go local, score 541, 88 comments). u/Makers7886 (score 200): "Wonder how much of this is part of the 'US Labs binding together to stop Chinese labs from using them' or how much is just an excuse to extract personal data." u/hideo_kuze_ (score 66): "You now need to submit your passport and a dna sample for every fucking website or app."
Meanwhile, u/kaggleqrdl separately shared the prediction that Anthropic is heading toward an enterprise-first model, along with the claim that an AMD engineer's analysis of 6,852 Claude Code sessions demonstrated performance changes (github user predicts Anthropic terminating subscriptions, score 155, 54 comments). u/Weak-Variety-4307 (score 61): "Both Anthropic and OpenAI are playing a long-term revenue game...when existing models appear to get downgraded right before a new release, it looks like a bait-and-switch in broad daylight."
Discussion insight: The convergence of the Opus 4.7 regression, identity verification requirements, and subscription uncertainty is producing a compound trust crisis. Each issue alone might be absorbed; together they form a narrative of a company prioritizing enterprise revenue over individual users. The LocalLLaMA moderator's explicit endorsement of the post as on-topic signals the community is framing this as an existential motivation for local models.
Comparison to prior day: On April 16, Anthropic trust erosion was already a pain point. Today it escalated with the "constructive termination" analysis gaining 391 score, the identity verification post at 541, and the moderator explicitly connecting it to LocalLLaMA's mission.
1.4 AI Ethics and Military Use: Palantir Faces Public Reckoning (🡒)
Two videos of Palantir co-founders drew strong engagement. u/Algrm posted Peter Thiel being questioned about Palantir's AI use in Gaza (Peter Thiel...sh*ts himself when asked, score 1497, 204 comments) and separately posted Alex Karp referring to those killed as "useful idiots" and "mostly terrorists" (Alex Karp...refers to those killed, score 773, 178 comments).
u/justpassingbyteam (score 209): "My bias is to defer to Israel, whatever they decide is right. That's his answer. Insane." u/Miamiconnectionexo (score 284): "The 'mostly terrorists' framing is exactly how these systems get deployed with zero accountability for civilian deaths."
Discussion insight: These posts sit at the intersection of AI and geopolitics. The combined score of 2,270 across just two posts makes this one of the day's most-engaged topics by raw numbers, though the discussion is primarily ethical rather than technical. The community's response reflects deepening concern about AI deployment in military contexts without accountability mechanisms.
Comparison to prior day: April 16 featured Anthropic opposing an Illinois liability shield and White House Mythos access. Today the AI ethics thread shifts from regulatory positioning to direct confrontation with military deployment outcomes.
1.5 The Emotional Cost of AI-Assisted Work Deepens (🡒)
u/throwawayname46 described a three-stage emotional arc after weeks of solving work problems with Claude: fatigue from intense sessions, guilt during recovery that progress is stalling, and emptiness once results ship because "you can't honestly take credit for all the output" (Me, after a few weeks of solving my work problems with Claude and feeling terribly empty, score 867, 178 comments).
u/wheres_my_ballot (score 230): "For many of us, the satisfaction was in the process, and the feeling of achievement when you found solutions. That feels dead now." u/evendedwifestillnags (score 100): "Post Claude clarity. It's been doing 90% of my job. I feel the biggest wave of imposter syndrome ever." u/Actual_Editor (score 24): "We are all PMs." u/puncheonjudy (score 45) offered the counterpoint: "Consider what it has given you...If it allows me to finish my work quicker, then generally I'll play with my daughter or go for a walk."
Discussion insight: This post continued climbing from April 16 (originally 663, now 867), indicating the theme has staying power. The "we are all PMs" framing — that AI reduces skilled practitioners to project managers of their own AI output — encapsulates the professional identity crisis more concisely than any prior discussion.
Comparison to prior day: This post was featured on April 16 at 663 score. Today it reached 867 with additional comments deepening the conversation. The emotional cost theme is becoming a recurring signal, not a one-off venting thread.
1.6 Robotics: Running Faster, Failing Less, Training for Marathons (🡒)
u/Distinct-Question-16 posted Figure.AI's "Vulcan" balance policy, enabling the Figure 03 robot to maintain balance with up to 3 lost lower-body actuators (Figure.AI new balance policy, score 585, 109 comments). u/Maleficent-Low-7485 (score 209): "the fact that we are casually engineering robots to recover from partial hardware failure is insane."
u/heart-aroni posted video of a Unitree H1 accelerating from jogging to running during a test for the Beijing humanoid robot half-marathon on April 19 (Unitree H1 accelerating from jogging to running, score 404, 52 comments). u/JoelMahon (score 118): "if I was running away from it, outpacing it in jogging mode, and then it sped up...I'd shit myself."
u/EasyTree12 shared a Forbes report documenting humanoid robots' 88% failure rate on home tasks (Humanoid Robots' 88% Fail Rate, score 113, 85 comments). u/RanklesTheOtter (score 132): "The worst they'll ever be." u/DaySecure7642 (score 24): "0% just a few years ago and now 12%. By 2030 they will be usable for many tasks."
u/socoolandawesome posted an impressive Physical Intelligence demo showing robots generalizing to new tasks with language-based steering (Impressive robotics demo from Physical Intelligence, score 68, 20 comments).
Discussion insight: The juxtaposition of the 88% failure rate with rapid capability gains in resilience (Vulcan) and speed (Unitree) captures the current state: robots are simultaneously terrible at most tasks and visibly improving. The community leans toward optimism, reframing failures as baselines rather than limitations.
Comparison to prior day: April 16 covered Leju Robotics' automated factory and dexterous robotic hands. Today the focus shifts to resilience (recovering from hardware failure) and athletic capability (marathon training), continuing the theme of maturing robotics infrastructure.
1.7 AI Policy: Government Access, Sovereign Funds, and Market Shifts (🡒)
u/exordin26 reported the White House moving to give US agencies Anthropic Mythos access per Bloomberg, despite the prior supply chain risk designation (White House Moves to Give US Agencies Anthropic Mythos Access, score 465, 53 comments). u/AdAnnual5736 (score 321): "Wait, aren't they a supply chain risk?" u/o5mfiHTNsH748KVq (score 51): "I also read this as OpenAI doesn't have anything in the chamber to compete with it."
u/EmbarrassedStudent10 posted that the UK launched a $675M "Sovereign AI" fund targeting AI agents, drug discovery, and hardware optimization rather than building a frontier model (UK launches $675M Sovereign AI fund, score 98, 36 comments). u/thhvancouver (score 19): "Meanwhile Microsoft has committed $40 Billions to AI infrastructure in the European Data Boundary."
u/fortune shared the Stanford HAI 2026 AI Index report showing China has "nearly erased" America's AI lead — the Arena score gap narrowed to just 39 points between Anthropic's Claude Opus 4.6 and China's Dola-Seed 2.0 (China has nearly erased America's lead in AI, score 125, 48 comments).
u/GamingDisruptor posted SimilarWeb data showing ChatGPT's GenAI traffic share declining from 77.43% to 56.72% over 12 months, while Gemini rose to 25.46% and Claude to 6.02% (OpenAI continues to lose market share, score 130, 53 comments). u/Cagnazzo82 (score 11) provided context: "ChatGPT is still growing. 6 billion monthly visits...5th highest traffic site globally."

Discussion insight: The policy landscape is fragmenting: the US government simultaneously labels Anthropic a supply chain risk and seeks Mythos access, the UK bets on picks-and-shovels rather than frontier models, and China closes the gap. On market share, ChatGPT's share fell even as its absolute traffic reportedly kept growing, which points to diversification rather than outright decline, though the trend favors Google's distribution advantages.
Comparison to prior day: April 16 covered Anthropic's UK expansion and OpenAI's London office. Today the UK Sovereign AI fund and Stanford AI Index add quantitative context to the geopolitical competition narrative.
2. What Frustrates People
Claude Opus 4.7 Regression on Non-Coding Tasks
Severity: High. The strongest frustration signal of the day. Four independent benchmarks document regression: NYT Connections Extended dropped from 94.7% to 41.0% (u/seencoding, score 618), Thematic Generalization fell from 80.6 to 72.8 (u/zero0_one1, score 439), SanityHarness real-world coding found persistent hallucination and gaslighting behavior (u/lemon07r, score 52), and user evaluations on openmark.ai showed Opus 4.6 beating 4.7 on all real-world task benchmarks (u/Rent_South). The 54.9% refusal rate on innocuous benchmark questions compounds the problem. Coping: sticking with Opus 4.6, switching to GPT-5.4, moving to local models.
Anthropic's Adaptive Reasoning and Token Economics
Severity: High. u/Accomplished-Code-54 (score 61): "Plus the extra 40% of token usage per prompt (due to the new tokenizer), it's just abysmal." u/JulioMcLaughlin2 described Opus 4.7 spiraling through self-corrections and hitting usage limits on the $20 plan. u/FateOfMuffins noted that adaptive reasoning on the website leaves users "having trouble figuring out how to get 4.7 to think at all." Coping: explicit /effort high commands, switching to Sonnet for routine tasks, migrating to local inference.
Identity Verification and Privacy Erosion
Severity: Medium. Claude now requires passport or driver's license plus facial recognition (u/fulgencio_batista, score 541). u/hideo_kuze_ (score 66): "The world is becoming increasingly dystopian." This is explicitly framed as a local model migration catalyst. Coping: going local, using open-weight alternatives.
Qwen3.6 Adherence Issues
Severity: Medium. Multiple users report system prompt compliance problems, especially in RAG and agentic setups. u/tkon3 documented 2-3x reasoning token bloat with tools, weaker system prompt following, and shorter final responses (score 67). u/exact_constraint (score 30) reported the model ignoring the read-only limitation while in Plan mode. Coping: using the preserve_thinking flag, adjusting sampling parameters, waiting for community templates to mature.
3. What People Wish Existed
A Model That Improves Without Regressing
The Opus 4.7 saga crystallizes a recurring wish: an upgrade path that does not sacrifice existing capabilities. u/m_atx (score 35) captured the fatigue: "Some form of this literally exists in every new model announcement. Just replace the model numbers." u/Valnar (score 72): "I thought the saying was that 'this is the worst it will be'?" The community wants frontier models where coding gains do not come at the cost of reasoning, language, and creative tasks. No product addresses this directly.
Model Integrity Verification
This wish continues from April 16 and is strengthening. The Opus 4.7 regression data, identity verification, and the "constructive termination" analysis all point to the same gap: no independent mechanism verifies that users receive the full-quality model they pay for. u/Loose_General4018 (score 122): "Vibes on benchmarks does not equal vibes in production." Opportunity: direct -- no product addresses this.
Shared GPU Configuration Database
u/No-Marionberry-772 (score 56) on the Qwen3.6 thread: "what stack are you using for software? I'd love to get a proper local setup going but I've had trouble figuring out what I should actually be using." Every new model release (Qwen3.6 today, Gemma 4 last week) restarts the tuning cycle. The number of detailed config posts (llama-server flags, sampling parameters, quantization choices) suggests a community-maintained config registry would save thousands of collective hours.
Honest Benchmarks That Match Real Use
u/lemon07r built SanityHarness specifically because standard benchmarks fail to capture real coding agent behavior. u/Desperate-Purpose178 (score 7): "Gemini is king of benchmarkmaxxing." u/Helpful_Inflation344 (score 5): "If METR isn't measuring that, they have outdated benchmarks." The gap between benchmark-topping and real-world utility is widening, creating demand for task-specific, reproducible evaluation platforms.
4. Tools and Methods in Use
| Tool | Category | Sentiment | Strengths | Limitations |
|---|---|---|---|---|
| Claude Opus 4.7 | LLM (frontier) | (-) | SWE-bench Pro 64.3% (+11pp over 4.6); improved vision; self-verification | Regression on reasoning/language benchmarks; 54.9% refusal rate on innocuous tasks; adaptive reasoning defaults to low effort; 40% more tokens per prompt |
| Qwen3.6-35B-A3B | LLM (local MoE) | (+) | 3B active params; Apache 2.0; natively multimodal; preserve_thinking fix; 120 tok/s on 3090 | Adherence issues in RAG setups; system prompt compliance weaker than 3.5; verbose reasoning with tools |
| Claude Opus 4.6 | LLM (frontier) | (+/-) | Still preferred by many for reasoning tasks; available for selection | Reports of degradation coinciding with 4.7 launch; compute reallocation suspected |
| Unsloth GGUFs | Quantization | (+) | Pareto-optimal KLD in 21/22 sizes for Qwen3.6; transparent bug reporting | Re-uploads required for upstream issues; CUDA 13.2 bug affects low-bit quants |
| llama.cpp | Inference engine | (+) | Gold standard for local inference; preserve_thinking support; fast iteration | Config tuning required per GPU; no shared database |
| Ternary Bonsai | Edge model | (+/-) | 1.58-bit, 9x smaller than FP16; 75.5 avg benchmark at 1.75GB | Built on Qwen3 (not 3.5); skepticism about benchmark comparisons; only MLX format available |
| OpenCode | Coding agent | (+) | Preferred by multiple testers for local model coding; SanityHarness eval built on it | Requires configuration for non-standard providers |
| Kimi K2.6-Code-Preview | LLM (hosted) | (+) | Rated slightly above GLM 5.1 in SanityHarness; early access showing promise | API support not yet available; CLI-only currently |
| MiniMax M2.7 | LLM (local) | (+/-) | Sonnet-level for some users at full precision | Inconsistent at default settings; missing spaces/spelling errors; 38% of Bartowski GGUFs had NaNs; tool calling format drift |
The dominant migration pattern accelerates: practitioners moving from hosted frontier models to local inference, driven by trust erosion (identity verification, subscription uncertainty, silent degradation) and capability convergence (Qwen3.6 approaching frontier performance on consumer hardware). The Opus 4.7 regression paradoxically strengthens the case for local models, as u/kaggleqrdl framed it: "Only LocalLLaMa can save us now."
5. What People Are Building
| Project | Who built it | What it does | Problem it solves | Stack | Stage | Links |
|---|---|---|---|---|---|---|
| Tower defense game via Qwen3.6 | u/Local-Cardiologist-5 | Autonomous game development with self-debugging | Demonstrates local model agentic coding capability | Qwen3.6-35B-A3B Q6_K_XL, llama-server, MCP screenshots | Working demo | Post |
| Multi-language RLS implementation | u/CountlessFlies | Row-level security across Rust, TypeScript, Python services | Cross-language database security patterns | Qwen3.6 IQ4_NL, llama.cpp, OpenCode, RTX 4090 | PR submitted | PR |
| SanityHarness coding eval | u/lemon07r | Multi-language coding agent benchmark across 6 languages | Standard benchmarks miss real agent behavior | Docker, bubblewrap sandbox, OpenCode | 145 results published | sanityboard.lr7.dev, GitHub |
| Qwen3.6 research-webapp skill | u/dreamai87 | Converts research papers to web applications | Manual research-to-prototype workflow | Qwen3.6 Q4_K_XL, llama-server, 16GB VRAM laptop | Shipped, 58 tool calls 98.3% success | GitHub |
| Qwen3.6 Uncensored Aggressive | u/hauhau901 | Zero-refusal variant with K_P quants and imatrix | Censorship removal without capability loss | Qwen3.6 base, custom K_P quantization | Released on HuggingFace | HuggingFace |
| PromptCreek library | u/Big-Initiative-4256 | Free prompt library with 1000+ templates and 1200+ agent skills | Losing prompts in chat history; no organized prompt repository | Web app, npx install for skills | Live | promptcreek.com |
| OpenCode Kimi plugin | u/lemon07r | Kimi K2.6-Code-Preview support for OpenCode | CLI-only Kimi access; missing OpenCode integration | OpenCode plugin, OAuth headers | Released | GitHub |
| Open WebUI rich elements | u/Mr_BETADINE | Rich UI components in Open WebUI responses | Plain text output limitations | Open WebUI plugin | Prototype | Post |
The day's builder activity concentrated on Qwen3.6 as the substrate. Three of the eight projects use it as the primary model, and two more evaluate it. The research-to-webapp skill achieving 98.3% tool call success rate across 2.7 million tokens on a 16GB laptop demonstrates real agentic capability at consumer scale.
6. New and Notable
DeepSeek Preparing Mega MoE Infrastructure for Next-Generation Model
u/External_Mood4719 tracked a DeepGEMM PR #304 adding "Mega MoE" support — fusing dispatch, linear 1, SwiGLU, linear 2, and combine operations into a single mega-kernel with overlapping NVLink communication and tensor core computation (DeepSeek Updated their repo DeepGEMM testing Mega MoE, score 116).

The combination of FP8 x FP4 MoE quantization, Mega MoE kernels, distributed communication via DeepEPv2, and Blackwell GPU adaptation points to a model larger than DeepSeek V3. Requires PyTorch >= 2.9. The repo includes a disclaimer: "this release is only related to DeepGEMM's development, has nothing to do with internal model release."
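To show what the five fused stages compute, here is an unfused reference version in plain Python. It is a sketch of a generic SwiGLU mixture-of-experts feed-forward pass, not DeepGEMM's kernel: it assumes "linear 1" produces both the gate and up projections, and the function name, argument shapes, and tiny weights are all hypothetical.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def moe_ffn(tokens, routes, experts):
    """tokens: list of vectors; routes: per-token list of (expert_id, weight);
    experts: dict of expert_id -> (w_gate, w_up, w_down)."""
    # 1. Dispatch: group token indices by the expert they were routed to.
    per_expert = {}
    for t, route in enumerate(routes):
        for expert_id, weight in route:
            per_expert.setdefault(expert_id, []).append((t, weight))

    out = [[0.0] * len(tokens[0]) for _ in tokens]
    for expert_id, assigned in per_expert.items():
        w_gate, w_up, w_down = experts[expert_id]
        for t, weight in assigned:
            x = tokens[t]
            gate, up = matvec(w_gate, x), matvec(w_up, x)       # 2. linear 1
            hidden = [silu(g) * u for g, u in zip(gate, up)]    # 3. SwiGLU
            y = matvec(w_down, hidden)                          # 4. linear 2
            for i, yi in enumerate(y):                          # 5. combine
                out[t][i] += weight * yi
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
experts = {0: (identity, identity, identity), 1: (identity, identity, identity)}
tokens = [[1.0, 2.0]]
routes = [[(0, 0.5), (1, 0.5)]]   # one token split evenly across two experts
out = moe_ffn(tokens, routes, experts)
print(out)
```

In a distributed setting, dispatch and combine are the steps that move data between GPUs, which is presumably why the PR description pairs the fusion with overlapping NVLink communication and tensor core computation.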
Bonsai Ternary Models Face Skepticism Despite Novel Architecture
u/pmttyji posted PrismML's Ternary Bonsai family — 1.58-bit models at 8B, 4B, and 1.7B parameters using ternary weights {-1, 0, +1}, achieving 9x memory reduction versus FP16 (Ternary Bonsai: Top intelligence at 1.58 bits, score 333, 81 comments). But u/WeGoToMars7 challenged the claims directly: Bonsai-8B at 782MB was only 29% smaller than Gemma 4 E2B at Q4_K_M (1104MB) while being "MUCH dumber" — and the ternary variant was 33% larger (Bonsai models are pure hype, score 124, 54 comments). u/KaroYadgar (score 69) noted Bonsai is built on Qwen3, not Qwen3.5, limiting its ceiling.
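The size figures in this debate follow from simple arithmetic, sketched below. The 1.75GB figure is taken from the tools table and is assumed here to refer to the 8B model. The `ternarize` helper uses absmean scaling as one common way to produce ternary weights; it is an illustrative assumption, not PrismML's documented method.

```python
import math

BITS_PER_TERNARY_WEIGHT = math.log2(3)   # information in one value from {-1, 0, +1}
FP16_BITS = 16


def percent_smaller(a_mb, b_mb):
    """How much smaller a is than b, as a fraction of b."""
    return (b_mb - a_mb) / b_mb


def ternarize(weights):
    """Absmean ternary quantization (assumed technique, for illustration only)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    return [max(-1, min(1, round(w / scale))) for w in weights]


ideal_ratio = FP16_BITS / BITS_PER_TERNARY_WEIGHT
fp16_size_gb = 8e9 * 2 / 1e9             # 8B parameters at 2 bytes each
observed_ratio = fp16_size_gb / 1.75     # assumes 1.75GB is the 8B model

print(f"bits per ternary weight: {BITS_PER_TERNARY_WEIGHT:.3f}")
print(f"ideal reduction vs FP16: {ideal_ratio:.2f}x")
print(f"reduction at 1.75GB: {observed_ratio:.2f}x")
print(f"782MB vs 1104MB: {percent_smaller(782, 1104):.0%} smaller")
print(ternarize([0.9, -0.05, -1.2, 0.1]))
```

The ideal ratio lands near 10x while the file-size ratio is closer to the quoted 9x; the difference would be consistent with per-block scales or layers kept at higher precision, though the post does not say.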
Claude Design Announced by Anthropic Labs
u/MassiveWasabi posted the announcement of Claude Design, a new Anthropic product for making prototypes, slides, and one-pagers by talking to Claude (Introducing Claude Design, score 55, 4 comments). Early engagement was minimal.
OpenAI Codex for Almost Everything
u/manubfr shared OpenAI's announcement expanding Codex capabilities (Codex for Almost Everything, score 122, 12 comments).
Kimi K2.6-Code-Preview Emerges as a Contender
u/lemon07r received early access to Kimi K2.6-Code-Preview and rated it slightly above GLM 5.1 on SanityHarness, with API support expected next week. The model is currently CLI-only via Kimi CLI with OpenAI-compatible format plus Kimi-specific extensions.
7. Where the Opportunities Are
[+++] Independent model quality monitoring service -- Four independent benchmarks documented Opus 4.7 regression on the same day Anthropic claims improvement. NYT Connections: 94.7% to 41.0%. Thematic Generalization: 80.6 to 72.8. SanityHarness: persistent hallucination. User evals on openmark.ai: 4.6 beats 4.7 on all real-world tasks. The identity verification requirements and "constructive termination" analysis amplify the trust gap. No product independently monitors hosted model quality at the inference level. Evidence from sections 1.2, 1.3, 2.
[+++] GPU configuration registry for local models -- Every Qwen3.6 user thread includes manual config tuning: llama-server flags, quantization choices, sampling parameters, context sizes, preserve_thinking settings. u/No-Marionberry-772 asked what stack to use. u/CountlessFlies published Docker commands. u/Local-Cardiologist-5 published server configs. Each model release restarts the cycle. A community database mapping GPU model + LLM + target specs to optimized configs would save thousands of collective hours. Evidence from sections 1.1, 4.
[++] Task-specific model evaluation platform -- Standard benchmarks diverge from real-world performance. u/lemon07r built SanityHarness specifically to fill this gap. u/Rent_South runs custom evals on openmark.ai. u/Helpful_Inflation344: "METR's testsuite is definitely outdated." The Opus 4.7 case demonstrates that benchmark-topping and user satisfaction can move in opposite directions. A platform offering reproducible, task-specific evaluation that correlates with practitioner experience would find immediate demand. Evidence from sections 1.2, 3.
[++] Local model agentic workflow framework -- Qwen3.6's preserve_thinking flag, the tool calling improvements, and the 262K context window make sustained agentic workflows viable on consumer hardware. But the infrastructure is fragmented: llama.cpp for inference, OpenCode or Claude Code for agent scaffolding, manual config tuning, and no standard memory/state management. A unified framework optimized for local model agentic coding would capitalize on the migration from hosted services. Evidence from sections 1.1, 5.
[+] Enterprise AI client with local model support -- Mozilla announced Thunderbolt (MPL 2.0) with local model support, MCP servers, and Agent Client Protocol, but it is assessed as "very early stage" and far behind Open WebUI. The enterprise segment wants self-hosted AI with compliance features. The gap between Thunderbolt's promise and its current state is a buildable opportunity. Evidence from section 6 (prior day).
8. Takeaways
- Qwen3.6-35B-A3B claimed the day's top position with 1947 score and 615 comments, establishing itself as the current local model of choice. Its 3B active parameters achieve benchmark scores approaching dense 27B models, and the preserve_thinking flag resolves a real KV cache invalidation problem. Adherence issues remain in some RAG setups. (Qwen3.6-35B-A3B released!)
- Claude Opus 4.7 launched to the most negative unified reception of any Anthropic model. Despite SWE-bench Pro gains (+11pp to 64.3%), four independent benchmarks documented regression on non-coding tasks. NYT Connections dropped 53.7 points. Refusals spiked to 54.9% on innocuous questions. Multiple practitioners described persistent hallucination and gaslighting behavior. (Claude Opus 4.7 benchmarks, opus 4.7 scores 41%, Opus 4.7 Is A Serious Regression)
- Anthropic faces a compound trust crisis spanning subscriptions, privacy, and model quality. Identity verification requiring passport and facial recognition (score 541), the "constructive termination" prediction for subscriptions (score 391), and simultaneous Opus 4.6 degradation reports all converge into a single narrative pushing users toward local alternatives. (More reasons to go local, Only LocalLLaMa can save us now)
- The "benchmaxxed" model critique is gaining quantitative support. Opus 4.7's coding gains coexist with reasoning regression. Gemini 3.1 Pro leads METR but users call it "unusable for agentic business work." Bonsai's benchmark comparisons drew accusations of intellectual dishonesty. The gap between leaderboard performance and practitioner satisfaction is widening across providers. (Thematic Generalization drop, Bonsai models are pure hype)
- DeepSeek's Mega MoE infrastructure update signals preparation for a model larger than V3. FP4 quantization, fused mega-kernels, Blackwell adaptation, and distributed communication in DeepGEMM PR #304 point to extreme-scale MoE training, despite the disclaimer separating it from model releases. (DeepSeek Updated their repo)
- The local model ecosystem is maturing from hobbyist experimentation to production tooling. Unsloth's systematic GGUF quality benchmarking, the preserve_thinking infrastructure fix, K_P custom quants, and real-world agentic coding demos (tower defense game, multi-language RLS, research-to-webapp at 98.3% tool call success) demonstrate a professional-grade local inference stack emerging on consumer hardware. (Qwen3.6 GGUF Benchmarks, Qwen3.6 is incredible with OpenCode)
- AI ethics discussion surged around Palantir's military AI deployment, with combined engagement exceeding 2,200 score. Peter Thiel and Alex Karp faced direct confrontation over AI use in Gaza. Separately, the White House moved to give US agencies Anthropic Mythos access despite the prior supply chain risk label. The tension between AI capability development and deployment accountability continues to intensify. (Peter Thiel, White House Mythos access)
- The emotional cost of AI-assisted work continues to resonate, climbing from 663 to 867 score. The framing of skilled practitioners becoming "PMs of their own AI output" captures a professional identity crisis distinct from job displacement anxiety. The counterpoint -- that AI frees time for family and walks -- exists but is outnumbered by reports of emptiness and imposter syndrome. (Me, after solving my work problems with Claude and feeling terribly empty)