Twitter AI Agent - 2026-05-25

1. What People Are Talking About

1.1 Agent memory is moving out of the context window and into reusable files

The clearest May 25 pattern was that builders no longer want every tool, MCP, and note stuffed into live context. They want cold storage, selective loading, and reusable skill artifacts. At least four posts pointed in the same direction: Obsidian notes, Markdown skill memories, transferable skill libraries, and resource packs for harness design.

@EXM7777 described (289 likes, 15 replies, 16,368 views, 437 bookmarks) using Obsidian as a memory layer for tools that should stay searchable but not permanently loaded into Claude Code or Codex. The strongest reply sharpened the idea into architecture language: keep tool memory in cold storage, keep active execution context hot, and only pull the few pieces a task actually needs.
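The cold-storage idea can be sketched in a few lines. The example below is only an illustration under stated assumptions, not @EXM7777's actual setup: the vault is assumed to be a folder of Markdown notes, plain keyword overlap stands in for whatever search a real workflow uses, and the names `LoadedNote` and `select_notes` are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class LoadedNote:
    path: Path   # which file was pulled into hot context
    score: int   # how many task keywords the note matched
    reason: str  # provenance: why this note was loaded
    text: str    # the note body that would be added to the prompt


def select_notes(vault: Path, task: str, limit: int = 3) -> list[LoadedNote]:
    """Rank Markdown notes in cold storage by keyword overlap with the task.

    Assumption: naive substring matching is a stand-in for real search.
    """
    # Ignore very short words so "the" or "a" do not match everything.
    terms = {word.lower() for word in task.split() if len(word) > 3}
    hits = []
    for path in sorted(vault.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        lowered = text.lower()
        matched = sorted(term for term in terms if term in lowered)
        if matched:
            reason = "matched: " + ", ".join(matched)
            hits.append(LoadedNote(path, len(matched), reason, text))
    # Best matches first; ties broken by file name for stable output.
    hits.sort(key=lambda note: (-note.score, note.path.name))
    return hits[:limit]
```

Everything in the vault stays searchable, but only the top `limit` notes reach the prompt, and each one carries a `reason` string that records why it was loaded.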

@GithubProjects highlighted (81 likes, 4 replies, 7,735 views, 93 bookmarks) Acontext as a Markdown-native skill memory layer. The Acontext repo describes it as “Agent Skills as a Memory Layer,” and the README card shown in the attached image adds stronger maturity signals than the tweet alone: website and docs links, published PyPI and npm packages, and passing core, API, and CLI tests.

[Image: Acontext README card showing website/docs links, published packages, and passing tests]

@DanKornas shared (34 likes, 4 replies, 1,684 views, 44 bookmarks) SkillX, and the SkillX repo says it turns successful trajectories into reusable planning, functional, and atomic skills that can be plugged into weaker agents without retraining. @tom_doerr added (55 likes, 4 replies, 3,110 views, 69 bookmarks) a 200+ item harness engineering resource list, which turned the same theme into a public map of files, tools, and practices.

Discussion insight: Replies did not ask for larger windows. They asked for better separation between archive and runtime, clearer provenance on why a tool was loaded, and reusable files that survive across sessions and agents.

Comparison to prior day: May 24 already pushed editable memory and selective loading. May 25 made the file-based version more explicit by pairing Obsidian cold storage with shippable skill libraries like Acontext and SkillX.

1.2 Harness engineering became a concrete design discipline rather than a slogan

The second major theme was that agent builders are now talking about prompt, context, and harness layers as separate engineering surfaces. The strongest posts were not model launches. They were diagrams, implementation guides, and articles about what reliable agent environments need around the model.

@sjsandeep_jain argued (97 likes, 10 replies, 1,406 views, 45 bookmarks) that the “biggest AI skill” is now designing the system around the model, not prompt engineering in isolation. The attached diagram makes the distinction explicit: prompt engineering shapes one request, context engineering curates what stays in the window, and harness engineering wraps gather, act, verify, and retry into a machine.
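The gather, act, verify, and retry loop from the diagram can be written down as a small control structure. This is a minimal sketch under assumptions, not the diagram author's implementation: `run_harness` and its three callables are hypothetical names, and a real harness would add logging, timeouts, and budget limits.

```python
from typing import Callable, Optional


def run_harness(
    gather: Callable[[str], str],
    act: Callable[[str, str], str],
    verify: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 3,
) -> tuple[Optional[str], list[str]]:
    """Run gather -> act -> verify, retrying with the failure attached.

    Returns (accepted output or None, list of verifier errors seen).
    All three callables are supplied by the caller (hypothetical interface).
    """
    errors: list[str] = []
    feedback = ""
    for _ in range(max_attempts):
        context = gather(task + feedback)  # gather: decide what enters the window
        output = act(task, context)        # act: one model or tool call
        error = verify(output)             # verify: None means the check passed
        if error is None:
            return output, errors
        errors.append(error)
        # retry: feed the verifier's complaint into the next gather step
        feedback = f"\nPrevious attempt failed: {error}"
    return None, errors
```

The prompt lives inside `act`, context curation lives inside `gather`, and the harness is the loop around both, which matches the three layers the post separates.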

[Image: Harness engineering diagram separating prompt, context, and gather-act-verify loops]

@bibryam called out (101 likes, 5 replies, 5,569 views, 134 bookmarks) an OpenAI article as “a gold mine for harness engineers,” emphasizing environments agents can reliably operate in, mechanical encoding of engineering taste, and observability that agents can read. The reply thread added the missing caution: once observability becomes agent-readable, the execution boundary becomes part of the security model too.

@harjtaggar summarized (120 likes, 14 replies, 9,916 views, 26 bookmarks) the lived experience more bluntly: every agent project starts simple and then ends up deep in retrieval quality, context engineering, and cross-modal eval loops. @tom_doerr backed that up (55 likes, 4 replies, 3,110 views, 69 bookmarks) with the Picrew harness resource list, which the repo describes as an awesome list of projects, tools, benchmarks, and practical guides.

Discussion insight: The practical question was no longer “what prompt works?” It was “what gets gathered, what stays loaded, how is it verified, and what happens when the harness itself fails?”

Comparison to prior day: On May 24, harness engineering was becoming a shared vocabulary. On May 25, the conversation shifted toward resource libraries, implementation diagrams, and infra guidance builders could apply immediately.

1.3 Verification-heavy agent systems broke into the mainstream discussion

The most notable breakout signal was that agentic systems were being discussed in domains where correctness has to be checked, not merely asserted. Formal proof search, formalized code generation, and security-oriented agent suites all showed up in the same daily set.

@pushmeet reported (696 likes, 45 replies, 40,683 views, 181 bookmarks) that Google DeepMind’s AlphaProof Nexus solved 9 open Erdős problems, 44 OEIS problems, a 15-year-old algebraic geometry problem, and a 7-year-old min-max optimization question. The public AlphaProof Nexus results repo contains Lean proofs for the solved problems, and the attached proof table matters because it shows these are not vague “AI solved math” claims but machine-checkable outputs tied to explicit proof techniques.

[Image: Proof table listing AlphaProof Nexus conjecture IDs and proof techniques for solved problems]

@getjonwithit introduced (83 likes, 4 replies, 7,220 views, 49 bookmarks) a coding and formal-verification agent for computational physics and applied mathematics that aims to generate DSL code, formalize correctness properties in Lean, Isabelle, or Rocq, and then compile provably correct C. The replies made the real constraint explicit: formal correctness of code is not the same thing as correctness of the underlying physics model, so verification still has to be scoped carefully.

@The_Cyber_News shared (56 likes, 3 replies, 2,815 views, 30 bookmarks) Pentest Agent Suite, and the linked Cyber Security News article says the open-source package spans 50 specialized security agents, 26 slash commands, 19 CLI tools, and a cross-IDE installer across seven coding platforms. That is a strong builder signal because it packages security review as a structured agent surface rather than another generic assistant.

Discussion insight: The most persuasive agent claims were the ones tied to a checker: Lean, a proof assistant, repo rules, or a security workflow. The dataset rewarded verifiability more than raw autonomy.
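The "tied to a checker" idea can be illustrated as a gate that refuses any output no checker has vouched for. This is a sketch under assumptions: `Verdict` and `gate` are hypothetical names, and the checkers are plain Python callables standing in for a proof assistant, repo rules, or a security workflow.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    accepted: bool
    passed: list[str] = field(default_factory=list)  # checkers that vouched for the output
    failed: list[str] = field(default_factory=list)  # checkers that rejected it


def gate(output: str, checkers: dict[str, Callable[[str], bool]]) -> Verdict:
    """Accept an output only if every named checker passes.

    With no checkers at all the output is rejected: a claim that was
    merely asserted, not checked, does not get through.
    """
    passed = [name for name, check in checkers.items() if check(output)]
    failed = [name for name in checkers if name not in passed]
    return Verdict(accepted=bool(checkers) and not failed, passed=passed, failed=failed)
```

Keeping the checker names in the verdict means a reader can later see which check stood behind an accepted result.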

Comparison to prior day: May 24 emphasized trust surfaces like disputes, dashboards, and review gates. May 25 pushed that same instinct into mathematically checked proofs, formalized code paths, and security-specific agent frameworks.

2. What Frustrates People

Context bloat still makes agents worse instead of better

Severity: High. @EXM7777 wrote (289 likes, 15 replies, 16,368 views, 437 bookmarks) that stacking more skills, MCPs, and context into Claude Code or Codex makes them slower and less predictable. @harjtaggar said (120 likes, 14 replies, 9,916 views, 26 bookmarks) that agent projects quickly collapse into retrieval and evaluation complexity, and @GithubProjects promoted (81 likes, 4 replies, 7,735 views, 93 bookmarks) Acontext precisely as an alternative to opaque memory stuffing. The visible workaround is selective loading and external memory stores, but that pushes more architecture work onto the builder. Worth building for: yes — the pain is recurring, operational, and already shaping how people use agents day to day.

Reliability work is still hiding under the label “agent engineering”

Severity: High. @sjsandeep_jain showed (97 likes, 10 replies, 1,406 views, 45 bookmarks) that prompt, context, and harness concerns now have to be engineered separately, while @bibryam argued (101 likes, 5 replies, 5,569 views, 134 bookmarks) that the real work lives in environment design, mechanical feedback loops, and agent-readable observability. @harjtaggar summed up (120 likes, 14 replies, 9,916 views, 26 bookmarks) the frustration directly: people start expecting a quick build and end up in retrieval quality and cross-modal eval loops instead. Worth building for: yes — this is the main reason “simple” agent ideas still take real engineering effort.

Open-source agents still need friendlier control rooms

Severity: Medium. @hasantoxr said (31 likes, 6 replies, 1,757 views, 37 bookmarks) that Hermes Desktop exists because terminal-first agents hide too much state, break quietly, and make ordinary setup unnecessarily technical. The replies were supportive but skeptical: one called the interface exactly what open-source agents need, while another warned that a GUI just makes failure easier to watch. Worth building for: yes — the workflow pain is practical and tied to broader adoption, even if the UI alone will not solve deeper reliability problems.

3. What People Wish Existed

Searchable cold storage that stays outside the live context window

This was the most direct need in the data. @EXM7777 wanted (289 likes, 15 replies, 16,368 views, 437 bookmarks) a growing Obsidian-backed library of tools that only get loaded when the task calls for them, and the strongest reply made the same point in system terms: cold storage for options, hot context for execution. Acontext and SkillX are partial answers, but the demand is for memory that is searchable, versioned, and cheap to keep outside the prompt. Opportunity: direct.
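One cheap way to get the "versioned" part without any special tooling is to fingerprint the skill files by content. The sketch below assumes skills are Markdown files in a folder; `snapshot` and `changed` are hypothetical helpers, not part of Acontext, SkillX, or Obsidian.

```python
import hashlib
from pathlib import Path


def snapshot(skill_dir: Path) -> dict[str, str]:
    """Map each Markdown file's relative path to the SHA-256 of its contents."""
    return {
        path.relative_to(skill_dir).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(skill_dir.rglob("*.md"))
    }


def changed(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List files that were added, removed, or edited between two snapshots."""
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))
```

Storing a snapshot next to each agent run makes it possible to say which version of a skill the agent actually read, and putting the same folder under a version control system would serve the same purpose.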

Agent outputs that can be checked, replayed, and trusted

The clearest signal came from verification-heavy builders. @pushmeet shared (696 likes, 45 replies, 40,683 views, 181 bookmarks) AlphaProof Nexus and linked to formal proofs, @getjonwithit framed (83 likes, 4 replies, 7,220 views, 49 bookmarks) formal verification as a core feature of a new coding agent, and @The_Cyber_News pointed to (56 likes, 3 replies, 2,815 views, 30 bookmarks) a whole security-agent suite. The missing layer is not more confident text. It is outputs that can be checked by proof systems, repo rules, or security gates. Opportunity: direct and competitive.

Control rooms that make open-source agents usable for normal people

@hasantoxr argued (31 likes, 6 replies, 1,757 views, 37 bookmarks) that Hermes Desktop matters because most open-source agents still expose users to setup friction, hidden state, and terminal complexity. That is a practical product need rather than an aspirational one. The competition here will likely be intense because the underlying agent capability is increasingly open, while the interface and recovery story are still weak. Opportunity: direct and competitive.

4. Tools and Methods in Use

Tool	Category	Sentiment	Strengths	Limitations
Obsidian	External memory layer	(+)	Keeps tools searchable outside live context; fits selective loading workflows	Manual curation and freshness still matter
Acontext	Skill memory layer	(+)	Editable Markdown skills, no API lock-in, ZIP export, visible package maturity	Still early-stage by package versions and community size
SkillX	Skill knowledge-base framework	(+)	Three-level skill hierarchy, automatic distillation from trajectories, transferable libraries	Research-heavy workflow; still early for mainstream teams
AlphaProof Nexus	Formal proof agent	(+)	Machine-checkable Lean proofs and public results repo make claims verifiable	Narrow domain and high formalization burden
awesome-agent-harness	Resource library	(+)	Large public map of harness projects, benchmarks, and guides	A reference list, not an execution system
Pentest Agent Suite	Security agent framework	(+/-)	Specialized agents, slash commands, MCP infrastructure, cross-tool installer	Evidence here comes through a news writeup and one tweet, not broad operator feedback
Hermes Desktop	Agent control surface	(+/-)	Makes setup, memory, tools, providers, schedules, and logs easier to manage	UI helps visibility, but replies still questioned whether it fixes deeper failure modes

Overall sentiment favored file-based memory, explicit structure, and systems that expose what they are doing. Satisfaction was strongest when the tool reduced hidden state or improved reusability, and mixed when the surface area expanded faster than reliability proof. The shared workaround is consistent across the day’s posts: keep the working set small, version reusable knowledge, and put some kind of checker around the output.

5. What People Are Building

Project	Who built it	What it does	Problem it solves	Stack	Stage	Links
Acontext	@GithubProjects / memodb-io	Stores agent learnings as editable Markdown skill files	Replaces opaque memory stores with readable, portable skills	JavaScript, PyPI/npm packages, file-based skills	Shipped	repo, post
SkillX	@DanKornas / zjunlp	Builds reusable planning, functional, and atomic skill libraries from agent experience	Stops agents from rediscovering the same tool-use patterns every run	Python, trajectory distillation, hierarchical skill KB	Alpha	repo, post
AlphaProof Nexus results	@pushmeet / Google DeepMind	Publishes Lean proofs and prose proofs for solved open math problems	Gives agentic proof-search claims a public, checkable artifact	Gemini-powered proof search, Lean, natural-language proof outputs	Alpha	repo, post
Pentest Agent Suite	@The_Cyber_News	Open-source bug-bounty and security agent framework across seven coding tools	Packages offensive-security workflows into a reusable agent stack	50 security agents, MCP infrastructure, 19 CLI tools	Beta	article, post
Hermes Desktop	@hasantoxr	Desktop interface for setup, chat, memory, skills, tools, schedules, and logs around Hermes Agent	Makes terminal-first open-source agents easier to operate and recover	Desktop UI, provider setup, memory/tools control, scheduling	Beta	post

Acontext and SkillX were the clearest memory-layer builds, but they solve adjacent problems. Acontext packages learnings into human-readable files and exports them across environments, while SkillX turns successful trajectories into a structured hierarchy of reusable skills. The shared trigger is the same frustration visible across sections 1-4: agents keep relearning what teams already know.

AlphaProof Nexus and Pentest Agent Suite show the same packaging instinct in verification-heavy work. One turns proof search into public Lean artifacts; the other turns offensive security into a multi-agent framework with slash commands and tool infrastructure. The repeated build pattern is not “another chatbot.” It is a reusable wrapper around a domain where the output has to be checked.

Hermes Desktop represents the usability layer around the same trend. Its importance in this dataset is not raw novelty but the fact that open-source agent builders are now spending effort on interfaces, schedules, logs, and recovery rather than only more autonomy.

6. New and Notable

AlphaProof Nexus made “checked agent output” tangible

@pushmeet reported (696 likes, 45 replies, 40,683 views, 181 bookmarks) that AlphaProof Nexus solved 9 open Erdős problems and 44 OEIS problems, and the public results repo matters because it publishes the Lean proofs themselves. That is notable because it shifts the credibility test from “do you believe the demo?” to “can you inspect the formal artifact?”

Harness engineering resources are getting packaged for self-study

@tom_doerr shared (55 likes, 4 replies, 3,110 views, 69 bookmarks) a 200+ item resource list, and @sjsandeep_jain shared (97 likes, 10 replies, 1,406 views, 45 bookmarks) a widely circulated visual explanation of prompt, context, and harness layers. Together they show that harness engineering is no longer a niche internal concept; it is being turned into public curriculum.

7. Where the Opportunities Are

[+++] Versioned agent memory with selective loading — Multiple sections converge here. @EXM7777 pushed (289 likes, 15 replies, 16,368 views, 437 bookmarks) for Obsidian-backed cold storage, Acontext packages skills as Markdown, and SkillX turns trajectories into reusable KBs. The need is strong because builders want the same thing from different directions: less context bloat without losing reusable knowledge.

[++] Verification-first agent tooling — AlphaProof Nexus, @getjonwithit (83 likes, 4 replies, 7,220 views, 49 bookmarks), and Pentest Agent Suite all point toward agent systems whose output has to be checked by a proof assistant, code rules, or security workflow. The signal is moderate rather than universal, but it is one of the clearest quality thresholds in the day’s data.

[++] Agent control rooms and recovery surfaces — @hasantoxr made the case (31 likes, 6 replies, 1,757 views, 37 bookmarks) for a desktop wrapper around Hermes because hidden state and terminal-only setup still block adoption. The opportunity is moderate because every open-source agent now needs some combination of install, logs, memory inspection, and scheduling UX.

[+] Harness engineering education — @tom_doerr published (55 likes, 4 replies, 3,110 views, 69 bookmarks) a large resource map and @sjsandeep_jain made the conceptual split (97 likes, 10 replies, 1,406 views, 45 bookmarks) legible. The signal is emerging rather than dominant, but the language is clearly hardening into a teachable discipline.

8. Takeaways

The winning memory pattern is “cold archive, hot execution.” @EXM7777 showed (289 likes, 15 replies, 16,368 views, 437 bookmarks) why builders are moving tools out of live context and into searchable notes, while Acontext packages the same idea as Markdown skills.
Harness engineering is being treated like real systems work, not prompt decoration. @sjsandeep_jain diagrammed (97 likes, 10 replies, 1,406 views, 45 bookmarks) the split between prompt, context, and harness layers, and @bibryam argued (101 likes, 5 replies, 5,569 views, 134 bookmarks) that environment design and observability are the real leverage points.
Verification is becoming a first-class selling point for agent systems. @pushmeet reported (696 likes, 45 replies, 40,683 views, 181 bookmarks) formally checked math proofs, and @The_Cyber_News surfaced (56 likes, 3 replies, 2,815 views, 30 bookmarks) a security-agent framework that packages review into the stack.
Reusable skill libraries are emerging as the durable layer above raw trajectories. @DanKornas shared (34 likes, 4 replies, 1,684 views, 44 bookmarks) SkillX’s hierarchical knowledge base approach, and @tom_doerr mapped (55 likes, 4 replies, 3,110 views, 69 bookmarks) a wider harness ecosystem around the same idea.
Open-source agents are finally getting user-facing control surfaces. @hasantoxr argued (31 likes, 6 replies, 1,757 views, 37 bookmarks) that Hermes Desktop matters because setup, memory, tools, schedules, and logs are still too hidden in terminal-first tools.