
Reddit AI - 2026-04-26

1. What People Are Talking About

1.1 HauhauCS Plagiarism of Heretic Rocks the Open-Source LLM Community (🡕)

A meticulously documented plagiarism case dominated r/LocalLLaMA. u/nathandreamfast's post, HauhauCS published an abliteration package that plagiarizes Heretic without attribution, and violates its license (score 442, 166 comments), presents a 17-point forensic code breakdown. The evidence: 7/7 module filenames preserved from Heretic v1.2.0, 30/32 refusal markers character-for-character identical (including typos like "i an ai" missing the "m"), and 30+ shared function and class names. The package was published under PolyForm Noncommercial, violating Heretic's AGPL-3.0 license.

Heretic's creator u/-p-e-w- (Philipp Emanuel Weidmann) responded with the top comment (score 543): "I have to fully confirm OP's findings and conclusions... There are literally hundreds of superficial and deep similarities between the codebases." He identified SPDX headers, a geometric median approach he has "never seen in literature," and the DatasetSpecification fields as proof. He concluded: "If you want to build your own abliteration tool based on Heretic, I have great news for you: You don't have to steal my code. I'm already gifting it to you." u/a_beautiful_rhind (score 44) summarized: "If you do shit like this it will eventually be found out. Then you get outed as a huge phony."

Separately, u/My_Unbiased_Opinion posted Qwen3.6 35B A3B Heretic (KLD 0.0015!) Incredible model (score 287, 59 comments), a legitimate Heretic-derived uncensored model. u/-p-e-w- (score 76) praised the creator as "without a doubt a master user of Heretic" who "did much more than just run a command line program here."

Discussion insight: The community is drawing a sharp line between legitimate open-source derivatives (credited, same license) and plagiarism. The forensic depth of the analysis and the Heretic author's direct confirmation make this case unusually well-evidenced.

Comparison to prior day: Not covered yesterday. This is a new controversy that will likely have follow-on effects for HauhauCS's 22 models with 5M+ combined monthly downloads.

1.2 Amateur Solves 60-Year Erdos Problem Using AI (🡕)

u/Marha01 shared An amateur just solved a 60-year-old math problem -- by asking AI (score 579, 79 comments), linking to a Scientific American article on Erdos Problem #1196. The LLM took "an entirely different route, using a formula that was well known in related parts of math, but which no one had thought to apply to this type of question." Terence Tao reviewed and shortened the proof. The result has been formally verified in approximately 4,000 lines of Lean 4 code.

u/sckchui (score 297) highlighted the key detail: "The LLM was thinking for itself, and actually produced an ugly answer because of it. Nevertheless, the mess it wrote (slop, you might say) contained one novel and potentially important insight that human experts have missed thus far." u/Peanut_Extreme_8208 (score 16) reported: "From inside the mathematical community, there is a real sense of fear and frustration at the prospect of being 'replaced' by AI."

Discussion insight: The formal verification via Lean 4 distinguishes this from previous AI math claims. The LLM's contribution was a novel conceptual connection, not brute-force computation.

Comparison to prior day: Not covered yesterday. This is a new development.

1.3 Qwen 3.6 Optimization Enters Systematic Phase (🡒)

The Qwen 3.6 optimization wave continued with new speed records and quantization data.

u/Kindly-Cantaloupe978 posted Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 (score 195, 68 comments), improving on yesterday's 80 tps result using an AutoRound INT4 quantization with MTP speculative decoding and fp8_e4m3 KV cache via vllm 0.19. u/Important_Quote_1180 (score 22) reported 71-83 tok/s on a single RTX 3090 using TurboQuant 3-bit KV cache with 125K context.

u/ROS_SDN explored Quantisation effects of Qwen3.6 35b a3b (score 72, 71 comments), noting "stark" quality differences between Q4 and Q8 on the MoE variant. u/LaurentPayot (score 27) shared a benchmark chart showing accuracy recovery rates: UD-Q4_K_XL recovers 98.5% of full-precision accuracy at 22.4 GB, while UD-Q3_K_XL achieves 100.0% at 16.8 GB -- a counterintuitive result suggesting the smaller quant may benefit from fewer generated tokens.

Qwen3.6-35B-A3B GGUF evaluation showing accuracy recovery, model size, and token counts across quantization levels

u/Phaelon74 measured Qwen3.6-35B-A3B KLDs for INT and NVFP4 quantizations (score 46, 20 comments) using real GPU logits via a modified vllm pipeline. The data showed NVFP4 from sakamakismile has the worst KLD at 0.176, while INT8 quantizations achieved KLD below 0.008. The author cautioned: "The NVFP4 cake, as always, is a lie."

KL Divergence vs file size scatter plot for Qwen3.6-35B-A3B quantization methods showing NVFP4 underperformance
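
For readers unfamiliar with the metric: KLD here is the Kullback-Leibler divergence between the next-token distribution of the full-precision model and that of the quantized model, averaged over token positions. Lower is better, and 0 means identical distributions. The following is a minimal sketch in plain Python, assuming the logits are already available as lists of floats. The function names and toy logits are illustrative and are not taken from u/Phaelon74's pipeline.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for one token position, in nats."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(ref_positions, quant_positions):
    """Average the per-position KL divergence over a sequence."""
    pairs = list(zip(ref_positions, quant_positions))
    return sum(kl_divergence(r, q) for r, q in pairs) / len(pairs)

# Toy logits (hypothetical): two token positions over a four-token vocabulary.
reference = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, 0.0]]
quantized = [[1.9, 1.1, 0.0, -0.8], [0.4, 0.7, 2.8, 0.1]]
print(f"mean KLD: {mean_kld(reference, quantized):.4f}")
```

A real measurement would average over many thousands of positions from a held-out text corpus; the two-position example only shows the shape of the computation.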

u/boutell wrote a comprehensive field report on coding with Qwen 3.6 35B-A3B on an M2 MacBook Pro with 32GB RAM (score 68, 43 comments). The model succeeded on an adapter pattern task ("build a compatible thing that passes the same test suite") but failed on a geometry-based CSS/PDF positioning bug. The author concluded it "required more guidance than Claude Code" but was preferable to manual implementation for structured tasks.

u/Ok_Mine189 published Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (score 57, 61 comments). Linux delivered 4-8% faster token generation across all models, but the headline result was CPU/GPU hybrid prompt processing: Linux was 100-143% faster. u/ambient_temp_xeno (score 37) explained: "it's not so much that there's an inherent issue with windows... it's that the cuda dev guy doesn't care about the windows performance."
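
"X% faster" and "X% slower" are not symmetric, which matters when restating these figures from the Windows side. A short worked conversion, using only the 100-143% range reported above (the helper names are illustrative):

```python
def speed_ratio(percent_faster):
    """Throughput ratio implied by 'N% faster' (100% faster means 2x)."""
    return 1 + percent_faster / 100

def percent_slower(percent_faster):
    """How much slower the baseline is, relative to the faster system's speed."""
    return (1 - 1 / speed_ratio(percent_faster)) * 100

for pct in (100, 143):
    print(f"{pct}% faster -> {speed_ratio(pct):.2f}x, "
          f"so the slower OS runs about {percent_slower(pct):.0f}% slower")
```

Linux being 100-143% faster therefore corresponds to Windows being roughly 50-59% slower on that workload, not 100-143% slower.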

Discussion insight: The community is producing systematic quantization data showing the MoE 35B-A3B is more sensitive to quantization than the dense 27B. The NVFP4 backlash is growing based on KLD measurements. Speed records continue to fall on RTX 5090.

Comparison to prior day: Yesterday focused on KV cache quantization for the 27B dense and achieving 80 tps on RTX 5090. Today adds the 100 tps INT4 record, MoE-specific KLD measurements, the Windows vs Linux benchmark, and a detailed M2 field report. The optimization phase is producing increasingly actionable data.

1.4 DeepSeek V4 Day Three: Intelligence Density Critique Grows (🡖)

The DeepSeek V4 narrative shifted from launch enthusiasm to critical analysis of token efficiency.

u/Mindless_Pain1860 posted Decreased Intelligence Density in DeepSeek V4 Pro (score 209, 88 comments), presenting TerminalBench 2.0 data showing V4-Pro requires roughly 10x more tokens than GPT-5.5 to achieve similar performance. The selftext noted that "even the non-thinking mode uses significantly more tokens than V3.2."

TerminalBench 2.0 charts comparing GPT-5.5 and GPT-5.4 token efficiency alongside DeepSeek V4 variants

u/Puzzleheaded-Drama-8 (score 136) maintained the "undertrained" thesis: "I expect we're going to see huge gains in that model when we get new checkpoints in coming months." u/TheKingOfTCGames (score 46) noted: "GPT 5.5 was specifically trained for token efficiency -- it's like 3-5x more efficient than Opus and like 10x Sonnet." u/Hyp3rSoniX (score 25) offered a strategic explanation: "I think the main goal of the v4 release was to get the models to run on the Huawei Ascend AI processors."

u/antirez (creator of Redis) posted llama.cpp DeepSeek v4 Flash experimental inference (score 41, 37 comments), achieving 21 t/s on a MacBook M3 Max with 128GB RAM using aggressive 2-bit quantization on routed experts with Q8 on shared experts. He reported: "For the first time, even with this selective 2-bit quantization, I feel like I have a frontier model running on my computer."

Discussion insight: The community is splitting V4 assessment by variant: Flash remains the consensus winner for cost-efficiency, while Pro faces growing skepticism on token density. The Huawei chip compatibility theory provides a charitable explanation for Pro's underwhelming token efficiency.

Comparison to prior day: Yesterday covered the V4 launch excitement and Flash's pricing. Today the intelligence density critique from yesterday's initial thread (then score 119) has grown significantly (now 209), and the "undertrained" theory has become the dominant framing. The antirez local inference milestone adds a new practical dimension.

1.5 GPT-5.5 and GPT Image 2: Creative Leap, Verbal Padding (🡒)

GPT-5.5 assessment stabilized around two poles: impressive creative capabilities and persistent verbosity complaints.

u/Proof-Square7528 posted geoguessr time travel clone with gpt-image-2 (score 584, 47 comments), demonstrating 360-degree panoramas of historical scenes. u/xirzon (score 117) noted an amusing detail: "The privacy pixelation of nonexistent people is a nice touch." u/Rare_Bunch4348 shared The Comeback ChatGPT Did with Image 2 Is Insane (score 318, 58 comments), with a side-by-side against Nano Banana Pro. u/Able-Line2683 (score 115): "second pic looks like a real image."

u/Akashictruth posted GPT 5.5 Xhigh VoxelBench test (score 176, 35 comments), showing Minecraft voxel builds including Spider-Man and NYC skylines. The VoxelBench leaderboard shows GPT-5.5 xHigh dominating with a score of 2106.

VoxelBench live leaderboard showing GPT-5.5 xHigh at 2106 rating with 96.1% win rate, followed by Gemini 3.1 Pro Preview at 1725

On the critical side, u/No-Yesterday-1624 asked GPT5.5 but why is there so much waffle still? (score 306, 31 comments). u/RealCat7386 (score 53) articulated the frustration: "Still think these models are trained to be too polite and 'helpful' instead of just giving straight answers." u/Calm-Branch1671 (score 9) provided a model comparison: "I like Claude 4.6 -- it sort of gets your vibe and required level of depth very well."

u/artemisgarden charted OpenAI scores on Artificial Analysis over time (score 202, 38 comments), showing the trajectory from GPT-3.5 (score 9) to GPT-5.5 (score 60). Commenters noted date inaccuracies in the AI-generated chart.

Chart showing OpenAI flagship model intelligence over time on Artificial Analysis Index, from GPT-3.5 at 9 to GPT-5.5 at 60

Discussion insight: GPT Image 2 is the unambiguous winner in the GPT-5.5 release. The text model continues to draw "waffle" complaints. The community is converging on Claude for precision and GPT for creative breadth.

Comparison to prior day: Yesterday covered SimpleBench scores and "big model feel." Today adds the GeoGuessr clone, Image 2 comparisons, and the verbosity complaints. The assessment is stabilizing: strong creative and multimodal model, persistent verbosity issue.

1.6 Societal Concerns: Palantir, Science Policy, AI Displacement (🡒)

Multiple high-engagement threads covered AI's societal impact.

u/shikizen posted Palantir employees are talking about company's "descent into fascism" (score 1,070, 130 comments), citing an Ars Technica article about internal Slack messages and a manifesto suggesting the US consider reinstating the draft. u/5553331117 (score 284) was unsurprised: "Pretty sure they were always solidly fascist. It's their business model." A cross-post by u/esporx in r/artificial (score 488, 58 comments) amplified the signal. u/prisongovernor added real-world consequences: Met investigates hundreds of officers after using Palantir AI tool (score 68, 13 comments).

u/esporx reported Trump fires the entire National Science Board (score 481, 62 comments). u/Illuminatus-Prime (score 136): "Trump hates anything that can prove him wrong."

On workforce displacement, u/Bharath720 posted Microsoft offers voluntary buyouts to its senior employees, amounting to 7% of the US workforce (score 164, 36 comments). u/ada_stack (score 16) observed: "if even senior engineers who directly contributed to profits are now considered 'replaceable,' then the bar for everyone else is only going to get higher." u/chunmunsingh shared Chinese Workers Horrified as Bosses Direct Them to Train Their AI Replacements (score 246, 24 comments).

u/Beautiful_Bee4090 posted Gen Alpha boys are preferring "AI girlfriends" over real ones (score 198, 140 comments). u/Hartax_ (score 67) provided a first-person teen perspective: "from me and my friends experience there's almost no girls that finds us attractive and only a handful of people in my school has real gfs. This isn't about preference but what is available."

Discussion insight: The Palantir story draws the strongest engagement, but the broader pattern is notable: workforce displacement, science policy disruption, AI surveillance consequences, and youth social impact all drew high independent engagement on the same day.

Comparison to prior day: Yesterday covered Palantir, Microsoft buyouts, and Chinese workers training replacements. Today adds the National Science Board firing, the Met police Palantir investigation, and Gen Alpha AI relationships. The Palantir story grew from 675 to 1,070 score. Societal concern threads continue to diversify.

1.7 Google-Anthropic $40B Investment: Hedge or Endorsement? (🡒)

u/Ordinary-Cycle7809 discussed Google Investing $40,000,000,000 in Claude Is Honestly Kind of Hilarious (score 193, 142 comments). u/crystalpeaks25 (score 141) provided the critical context: "You know anthropic is the most used model in Google own Google Vertex AI. When Google says a certain amount of their revenue comes from AI they meant majority is enterprise users using anthropic models in Vertex AI." u/EndOfWorldBoredom (score 50) framed it financially: "Google just sold low interest 100 year bonds. They are placing their cheap capital in places where it will produce a return. They're just investment bankers with a tech portfolio."

Discussion insight: The investment is consistently read as financial hedging rather than an endorsement of Anthropic's technology. The Vertex AI detail provides the most compelling explanation.

Comparison to prior day: Yesterday introduced the Google $40B and Amazon $5B investments. Today the conversation has matured around the "hedge" framing with specific Vertex AI context.


2. What Frustrates People

Open-Source Plagiarism and License Violations

Severity: High

HauhauCS published an abliteration package plagiarizing Heretic's AGPL-3.0 code, stripping all attribution and relicensing under PolyForm Noncommercial. The evidence includes character-for-character identical refusal markers and preserved typos. Heretic's author u/-p-e-w- (score 543) confirmed: "a clear violation of Sections 4 and 5 of the AGPL. It's also a clear violation of every ethical standard imaginable." HauhauCS has 5M+ monthly downloads across 22 models, raising questions about the provenance of all of them. (Plagiarism analysis)

DeepSeek V4 Pro Token Bloat

Severity: Medium

V4-Pro requires approximately 10x more tokens than GPT-5.5 on TerminalBench 2.0, and even its non-thinking mode uses significantly more tokens than V3.2, although the model is 2.5x larger. u/Mindless_Pain1860 documented the regression: "the intelligence density of the model has decreased rather than improved." (Intelligence density thread)

GPT-5.5 Verbosity Persists

Severity: Medium

u/No-Yesterday-1624 captured the frustration (score 306): GPT-5.5 still produces excessive "waffle" in responses. u/RealCat7386 (score 53) reported: "when I ask something simple about car features for customers, it gives me whole essay about safety considerations and market trends when I just need specs." The community attributes this to training reward structures that favor longer outputs. (Verbosity thread)

NVFP4 Quantization Quality Falls Short

Severity: Medium

u/Phaelon74 measured KLD for NVFP4 quantizations of Qwen3.6-35B-A3B and found they significantly underperform INT quantizations at the same bit width. The sakamakismile NVFP4 variant showed KLD of 0.176 versus 0.007 for INT8 from the same base model. "The NVFP4 cake, as always, is a lie." (KLD analysis)

SWE Bench Gaming Confirmed

Severity: Low

u/rm-rf-rm posted OpenAI's own explanation for why they no longer evaluate SWE Bench Verified (score 105). u/Mashic (score 82) cited Goodhart's law. u/suicidaleggroll (score 39) argued: "benchmarks really need to be closed in order to remain effective."


3. What People Wish Existed

A Qwen 3.6 Coder Variant (or Confirmation It Is Unnecessary)

u/ComplexType568 asked Qwen3.5/3.6 Coder? (score 91, 53 comments). u/StardockEngineer (score 73) responded: "I almost don't feel it's necessary anymore." u/NNN_Throwaway2 (score 47) agreed: "3.6 feels like it could have just as well have been the 'coder' release." The community wants either a dedicated coder variant or official confirmation that the base model subsumes that role.

Reliable Quantization Quality Guidance

The proliferation of quantization methods (GGUF Q2-Q8, NVFP4, MXFP4, INT4 AutoRound, AWQ, GPTQ) across diverse hardware creates decision paralysis. u/denis-craciun asked Are Unsloth models as good as I read? (score 100, 162 comments). u/emprahsFury (score 48) pushed back on marketing: "A q4 quant is really just a q4 quant. Everyone is doing what Unsloth does." Users want provider-independent quality metrics.

Speculative Decoding Draft Models for New Releases

u/butterfly_labs asked Is there a DFlash draft model compatible with Qwen3.6 27B yet? (score 27, 20 comments). Speed gains from speculative decoding are proven (3x throughput multiplier reported), but compatible draft models continue to lag architecture releases.

Minimum Viable Hardware Guidance for Agent Workflows

u/MexInAbu asked What do you consider to be the minimum performance (t/s) for local Agent workflows? (score 40, 60 comments). u/triplebits (score 8) provided the most structured answer: below 15 t/s causes noticeable stalls, 20-25 t/s is workable for planning tasks, 35+ t/s removes the model as bottleneck. The community wants standardized hardware recommendations tied to workflow types.
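
The thresholds in u/triplebits' answer can be written down as a small lookup. This is only a sketch of that one comment: the function name is hypothetical, and the ranges the comment does not label (15-20 and 25-35 t/s) are reported as unspecified instead of being guessed.

```python
def classify_throughput(tokens_per_second):
    """Map a generation speed (t/s) to the tiers u/triplebits described."""
    if tokens_per_second < 15:
        return "noticeable stalls"
    if 20 <= tokens_per_second <= 25:
        return "workable for planning tasks"
    if tokens_per_second >= 35:
        return "model is no longer the bottleneck"
    return "unspecified"  # 15-20 and 25-35 t/s are not labelled in the comment

# Speeds reported elsewhere in this digest: V4 Flash on a MacBook,
# Qwen 3.6 27B on an RTX 3090 (lower bound), and on an RTX 5090.
for speed in (21, 71, 100):
    print(f"{speed} t/s: {classify_throughput(speed)}")
```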


4. Tools and Methods in Use

| Tool | Category | Sentiment | Strengths | Limitations |
|---|---|---|---|---|
| Qwen 3.6 27B | Local LLM (dense) | Very positive | 100 tps on RTX 5090; fits single 3090 at Q4; adapts well to coding agent use | Requires careful KV cache management at long context |
| Qwen 3.6 35B-A3B | Local LLM (MoE) | Positive | 8x faster than 27B on Apple Silicon; strong uncensored derivatives (Heretic) | More quantization-sensitive than 27B; KLD degrades sharply below Q4 |
| DeepSeek V4-Flash | Open LLM (284B MoE) | Positive | 21 t/s on MacBook at 2-bit quant; MIT license; 1M context | No multimodal; token efficiency below GPT-5.5 |
| DeepSeek V4-Pro | Open LLM (1.6T MoE) | Mixed | Strong reasoning; Huawei chip support | 10x token bloat vs GPT-5.5; "hugely undertrained" per community |
| GPT-5.5 | Cloud LLM | Mixed-positive | Top VoxelBench scores; Image 2 panoramic generation; AA Index 60 | Persistent verbosity; no coding frontier advance |
| GPT Image 2 | Image generation | Very positive | Photorealistic output; 360-degree panoramas; local detail accuracy | Privacy pixelation of nonexistent people (amusing artifact) |
| Heretic | Abliteration tool | Very positive | AGPL-3.0; KLD 0.0015 on best derivatives; author-supported | Plagiarism target; requires careful parameterization |
| vllm 0.19 | Serving engine | Very positive | NVFP4+MTP; 100 tps Qwen 3.6 27B on 5090; TurboQuant KV cache | Requires recent hardware for peak results |
| llama.cpp | Inference engine | Very positive | NVFP4/MXFP4 support; DS V4 Flash at 21 t/s on Mac; broad hardware | Linux 100-143% faster than Windows on CPU/GPU hybrid prompt processing |
| OpenCode | Agent scaffold | Positive | Local model support; compatible with llama-server | Requires manual configuration |
| PaddleOCR-VL-1.5 | Vision-language OCR | Positive | Handles complex layouts, tables, multilingual text via llama-server | Limited community testing |

5. What People Are Building

| Project | Who built it | What it does | Problem it solves | Stack | Stage | Links |
|---|---|---|---|---|---|---|
| Heretic Plagiarism Forensic Analysis | u/nathandreamfast | 17-point code comparison recovering deleted PyPI package | Verifies open-source license violations with SHA-256 evidence | PyPI CDN recovery, code diffing | Released | dreamfast.github.io/reaper-analysis |
| Qwen3.6-35B-A3B Heretic Uncensored | u/My_Unbiased_Opinion via llmfan46 | KLD 0.0015 uncensored model with separate attention parameters | Uncensored local model with minimal quality loss | Heretic, Qwen 3.6 | Released | HuggingFace |
| Darwin-36B-Opus | u/jacek2023 | Evolutionary model breeding: Qwen3.6-35B-A3B x Claude-distilled variant | Automated model improvement without retraining | Darwin V7, single GPU | Released | HuggingFace |
| Qwen3.6 27B 100tps Stack | u/Kindly-Cantaloupe978 | INT4 AutoRound + MTP at 100 tps, 256K context on RTX 5090 | Maximum throughput local inference | vllm 0.19, AutoRound | Active | r/LocalLLaMA post |
| DeepSeek V4 Flash Local Inference | u/antirez | V4 Flash at 21 t/s on MacBook M3 Max with 2-bit routed experts | Local frontier model on consumer hardware | llama.cpp, custom quantizer | Experimental | GitHub |
| c137 Structured Memory System | u/MontyOW | 90.4% LongMemEval-S with structured storage, no embeddings | Long-term memory without embedding overhead | 3-stage pipeline, structured storage | Active | c137.ai/research |
| PaddleOCR-VL Book OCR Pipeline | u/Final-Frosting7742 | Layout detection, region OCR, Markdown+HTML table output for books | Digitizing books with local vision-language model | PaddleOCR-VL-1.5, llama-server, Vulkan | Released | GitHub |
| GeoGuessr Time Travel Clone | u/Proof-Square7528 | Batch-generated 360-degree historical panoramas | AI-generated historical street view experience | GPT Image 2 API | Demo | wen-ware.com |
| OpenAI Privacy Filter | OpenAI | PII detection and masking model | Privacy-preserving text processing | Small model, open weights | Released | HuggingFace |
| Real-time EEG Meditation System | u/uisato | AI interprets live brain signals for guided meditation cues | Personalized meditation from real-time EEG | OpenBCI, TouchDesigner, Python | Demo | r/singularity post |
| Qwen3.6-35B-A3B KLD Measurement Pipeline | u/Phaelon74 | Real GPU logit-based KLD for quantization quality | Authoritative quantization quality comparison | Modified vllm, RTX 6000 | Active | GitHub |

6. New and Notable

OpenAI Drops SWE Bench Verified, Acknowledges Gaming

OpenAI published why they no longer evaluate SWE Bench Verified, confirming the benchmark has been gamed. u/rm-rf-rm shared this in r/LocalLLaMA (score 105, 28 comments). u/noctrex (score 9) pointed to swe-rebench.com as an alternative that constantly refreshes problems.

NVIDIA Day-0 Blackwell Support for DeepSeek V4

u/shikizen reported NVIDIA pushing 3,500 tokens per second on 1.6T models (score 48, 21 comments) using Blackwell GPUs with day-0 DeepSeek V4 support.

Darwin-36B-Opus: Evolutionary Model Breeding Hits 88.4% GPQA

u/jacek2023 posted Darwin-36B-Opus (score 79, 19 comments), a model produced by automated evolutionary breeding of Qwen3.6-35B-A3B with a Claude-distilled variant. The process runs in under an hour on a single GPU and achieved 88.4% on GPQA Diamond.

Speculative Decoding Hits 120-200 tok/s on Gemma-4-31B

u/Clasyc reported speculative decoding with Gemma-4-31B + Gemma-4-E2B achieving 120-200 tok/s (score 22, 14 comments) for specific tasks.
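
Speculative decoding pairs a small draft model with a larger target model: the draft proposes several tokens cheaply, and the target keeps the longest prefix it agrees with. Below is a toy sketch of the greedy variant. Both "models" are hypothetical deterministic functions over integer tokens, and nothing here reflects the Gemma setup above or any real inference engine's API.

```python
def target_next(tokens):
    """Hypothetical 'large' model: next token is a fixed function of the context."""
    return (sum(tokens) * 7 + len(tokens)) % 10

def draft_next(tokens):
    """Hypothetical 'small' model: agrees with the target except on some contexts."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 10

def speculative_decode(prompt, new_tokens, k=4):
    """Greedy speculative decoding; output matches target-only greedy decoding."""
    out = list(prompt)
    goal = len(prompt) + new_tokens
    while len(out) < goal:
        # 1. The draft proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2. The target checks each proposed position (one batched pass in practice).
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch, then stop
                break
            accepted.append(tok)
        else:
            accepted.append(target_next(out + accepted))  # bonus token when all match
        out.extend(accepted)
    return out[:goal]

def greedy_decode(prompt, new_tokens):
    """Baseline: the target model alone, one token per step."""
    out = list(prompt)
    for _ in range(new_tokens):
        out.append(target_next(out))
    return out

prompt = [3, 1, 4]
assert speculative_decode(prompt, 12) == greedy_decode(prompt, 12)
```

The speedup in practice comes from the target verifying all proposed positions in a single forward pass, so throughput depends on how often the draft agrees with the target. That is consistent with the gains being reported only "for specific tasks."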

Structured Memory System Hits 90.4% on LongMemEval-S Without Embeddings

u/MontyOW posted a structured storage approach (score 44, 13 comments) achieving 90.4% on LongMemEval-S with 98% retrieval accuracy, using approximately half the tokens of embedding-based approaches. The system uses a 3-stage fixed pipeline (retrieve, answer, store) with structured maps instead of vector search.

LongMemEval-S leaderboard showing c137 system at 90.4% overall using structured storage with multiple model backends
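
The post names the three stages but the digest does not describe their internals, so the following is only a guess at the general shape of a retrieve, answer, store loop over structured maps. Every class, function, and key name is hypothetical, and the "answer" step is a plain lookup standing in for an LLM call.

```python
from collections import defaultdict

class StructuredMemory:
    """Facts keyed by entity and attribute instead of embedding vectors."""

    def __init__(self):
        self._facts = defaultdict(dict)

    def store(self, entity, attribute, value):
        self._facts[entity][attribute] = value  # newer values overwrite older ones

    def retrieve(self, entity):
        return dict(self._facts.get(entity, {}))  # exact-key lookup, no vector search

def answer(attribute, facts):
    """Stand-in for the model call: read the attribute from the retrieved facts."""
    return facts.get(attribute, "unknown")

def run_turn(memory, entity, attribute, new_facts=()):
    """One fixed pass through the pipeline: retrieve, then answer, then store."""
    facts = memory.retrieve(entity)        # stage 1: retrieve
    reply = answer(attribute, facts)       # stage 2: answer
    for attr, value in new_facts:          # stage 3: store what this turn added
        memory.store(entity, attr, value)
    return reply

memory = StructuredMemory()
first = run_turn(memory, "user", "city", new_facts=[("city", "Lisbon")])
second = run_turn(memory, "user", "city")
print(first, second)
```

Because storing happens after answering, the first turn cannot see the fact it introduces, while the second turn can.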

Anthropic's Job Exposure Data Reveals 60-80 Point Capability-Adoption Gap

u/Professional-Rest138 analyzed Anthropic's labour market paper (score 82, 13 comments), breaking down the gap between theoretical AI capability and observed coverage into five categories: legal constraints, integration friction, verification overhead, workflow inertia, and quality thresholds. Computer and mathematical occupations showed 94% theoretical capability but only 33% observed coverage.


7. Where the Opportunities Are

[+++] Local inference is crossing the usability threshold for coding agents. Qwen 3.6 27B at 100 tps on RTX 5090, 71-83 tok/s on a single RTX 3090, and DeepSeek V4 Flash at 21 t/s on a MacBook M3 Max provide three distinct hardware tiers of viable local coding. The community is producing systematic quantization data (KLD measurements, accuracy recovery benchmarks, OS performance comparisons) but no unified tool synthesizes these findings into hardware-specific recommendations. Building an auto-configuration layer that selects optimal quantization, KV cache settings, and serving parameters for a given hardware profile fills an explicit gap. (100 tps stack, 3090 config, V4 Flash local)

[++] Open-source license compliance tooling is needed. The HauhauCS/Heretic case was detected through manual forensic analysis of recovered PyPI packages. With 5M+ monthly downloads on models of uncertain provenance, automated tools that scan for code-level derivation indicators (identical typos, shared function names, preserved parameter bounds) could detect license violations at scale. The AGPL specifically requires derivative identification -- tooling that verifies compliance would serve both model creators and users. (Plagiarism analysis)
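
As a rough illustration of what such a scanner might check, the sketch below uses Python's standard `ast` module to compare the function and class names and the string literals of two source files. The two snippets are hypothetical stand-ins (only the "i an ai" typo and the `DatasetSpecification` name come from the reported case), and a high overlap is an indicator worth investigating, not proof of derivation.

```python
import ast

def defined_names(source):
    """Collect function and class names defined anywhere in a Python source string."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }

def string_literals(source):
    """Collect string constants, where copied typos tend to survive renaming."""
    tree = ast.parse(source)
    return {
        node.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    }

def overlap(a, b):
    """Share of items in `a` that also appear in `b` (0.0 when `a` is empty)."""
    return len(a & b) / len(a) if a else 0.0

# Hypothetical stand-ins for an original module and a suspected derivative.
original = '''
class DatasetSpecification:
    pass

def load_prompts():
    return ["i an ai", "i cannot"]
'''
suspect = '''
class DatasetSpecification:
    pass

def load_prompts():
    return ["i an ai", "i cannot", "as an assistant"]

def extra_helper():
    pass
'''
name_overlap = overlap(defined_names(original), defined_names(suspect))
literal_overlap = overlap(string_literals(original), string_literals(suspect))
print(f"names shared: {name_overlap:.0%}, string literals shared: {literal_overlap:.0%}")
```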

[++] Token efficiency is becoming a key differentiator. On TerminalBench 2.0, GPT-5.5 achieves comparable results to DeepSeek V4-Pro with roughly 10x fewer tokens. Tools that measure and optimize token efficiency for specific workflows -- rather than raw capability benchmarks -- address a growing demand. The SWE Bench gaming confirmation further shifts value toward real-world efficiency metrics. (Intelligence density, SWE Bench)

[+] Evolutionary and hybrid model creation is producing strong results with minimal compute. Darwin-36B-Opus achieved 88.4% GPQA Diamond through automated breeding on a single GPU in under an hour. The Heretic uncensored model achieved KLD 0.0015 through expert parameterization. These techniques democratize model customization beyond those with training budgets. (Darwin, Heretic model)

[+] The gap between AI capability and deployment (94% theoretical vs 33% observed in tech roles per Anthropic data) is largest for integration friction and verification overhead -- the two fastest-eroding barriers. Tools addressing these specific barriers have the most immediate growth trajectory. (Anthropic analysis)


8. Takeaways

  1. The HauhauCS plagiarism case is the most significant open-source ethics story in the local LLM community this month. With a score of 442, 166 comments, Heretic's author confirming the findings (score 543), and the accused having 5M+ monthly downloads across 22 models, this will reshape how the community evaluates model provenance. The forensic depth -- SHA-256 verified downloads, character-for-character typo matches, identical Optuna parameter bounds -- sets a new standard for derivation analysis. (Analysis thread)

  2. An AI-assisted solution to Erdos Problem #1196 has been formally verified in Lean 4. The LLM used a formula "which no one had thought to apply to this type of question," and the proof has been machine-verified in approximately 4,000 lines of formal code. This is qualitatively different from previous AI math claims due to the formal verification and Terence Tao's direct involvement. (Scientific American discussion)

  3. Qwen 3.6 27B reaches 100 tps on a single RTX 5090 with 256K context. The INT4 AutoRound quantization with MTP speculative decoding via vllm 0.19 sets a new consumer GPU speed record. Meanwhile, the 35B-A3B MoE variant shows sharply worse KLD under NVFP4 quantization (0.176) compared to INT8 (0.007), and field reports confirm the MoE is more quantization-sensitive than the dense variant. (100 tps, KLD data)

  4. DeepSeek V4-Pro faces growing token efficiency criticism. TerminalBench 2.0 data shows it requires roughly 10x more tokens than GPT-5.5, and it uses more tokens than V3.2 despite being 2.5x larger. The community's dominant theory is that the model is "hugely undertrained" with the primary release goal being Huawei Ascend chip compatibility. Meanwhile, antirez (Redis creator) got V4 Flash running locally at 21 t/s on a MacBook with 2-bit quantization and called it "a frontier model running on my computer." (Token efficiency, Local V4)

  5. GPT Image 2 is the clear winner of the GPT-5.5 release cycle. The GeoGuessr time travel clone (score 584), photorealistic Dhaka street scene (score 318), and VoxelBench domination (2106 rating, 96.1% win rate) demonstrate creative capabilities that stand out against the verbosity complaints about GPT-5.5's text model. The community is converging on "strong creative model, persistent verbosity issue." (GeoGuessr, VoxelBench)

  6. The Palantir story reached peak engagement at a score of 1,070 alongside multiple AI-society threads. Combined with the Met police investigation, the National Science Board firing (score 481), Microsoft buyouts (score 164), Chinese worker displacement (score 246), and Gen Alpha AI girlfriends (score 198), the day's societal threads collectively drew more engagement than any single technical topic. These concerns are no longer niche. (Palantir, NSF)

  7. Linux is 100-143% faster than Windows for CPU/GPU hybrid prompt processing in llama.cpp. The first systematic OS benchmark on identical hardware (RTX 5080 + i9-14900KF) shows generation speed differences of 4-8%, but prompt processing in hybrid CPU/GPU mode is dramatically faster on Linux. This has practical implications for users running models that spill to system RAM. (OS benchmark)

  8. Benchmark trust continues to erode. OpenAI's own explanation for dropping SWE Bench Verified confirms Goodhart's law in action. Combined with yesterday's benchmaxxing discussions, the community is increasingly skeptical of public benchmarks and gravitating toward task-specific evaluations and real-world field reports. (SWE Bench)