
Phase 6 — The Learning Loop (Soul)

Status: 🚧 in progress · Hermes background-review and curator parity rows are now tracked explicitly; core automatic promotion/scoring remains planned. 6.K prompt evaluation + optimization rows are validated.

Completion lane: Phase 6 is Lane 6 — Learning Loop. It depends on the Phase 5.F skills substrate and should not begin with live LLM skill extraction. Ship detector, storage, extractor schema, retrieval, feedback, and operator surfaces as separate fixture-backed rows.

The Learning Loop has two layers. Hermes now defines the compatibility floor: after-turn background review forks can update memory/skills, and the curator can maintain agent-created skills over time. Gormes ports those user-visible contracts first, then adds Gormes-native evidence gates for detection, scoring, retrieval, and promotion.

“Agents are not prompts. They are systems. Memory + skills > raw model intelligence.”

| Subphase | Status | Deliverable |
|---|---|---|
| 6.A — Complexity Detector | 🚧 partial | Hermes background-review fork lifecycle is row-backed; deterministic local trigger signals remain planned |
| 6.B — Skill Extractor | ⏳ planned | LLM-assisted pattern distillation from the conversation + tool-call trace, with fake-model fixtures and secret/noise rejection gates |
| 6.C — Skill Storage Format | ⏳ planned | Portable, human-editable SKILL.md with versioned metadata, provenance, review state, and atomic writes |
| 6.D — Skill Retrieval + Matching | ⏳ planned | Hybrid lexical + Phase 3 semantic lookup for relevant reviewed skills at turn start, plus optional Code Cathedral II-style code-context evidence after the base scorer is stable |
| 6.E — Feedback Loop | ⏳ planned | Hermes curator auxiliary model slot plus curator state transitions and run reports, then skill-use outcomes, explicit operator feedback, and auditable weight adjustments |
| 6.F — Skill Surface (TUI + Telegram) | 🚧 partial | Hermes curator CLI surface plus browse, edit, disable, and review skills from the TUI or messaging edge after store/feedback contracts are stable |
| 6.K — Self-Evolution Engine (GEPA) | 🚧 partial | Prompt evaluation harness and iterative prompt mutation/scoring loop are validated; behavioral pattern extraction remains planned |
| 6.L — Composable Skill Execution (Voyager) | ⏳ planned | Sandbox executable skills, dependency resolution, and validation remain future rows |

Upstream Hermes at b816fd4e2 makes the learning loop concrete in four places:

  • run_agent.py spawns an after-turn background review fork with active runtime credentials, memory+skills toolsets only, auto-deny approval behavior, parent-session attribution, isolated prompt history, cleanup, and one user-visible Self-improvement review summary.
  • agent/curator.py and hermes_cli/curator.py add autonomous skill maintenance: interval/paused gates, first-run defer, activity-based active/stale/archived transitions, pinned-skill safeguards, dry-run reports, backups, rollback/restore, and hermes curator status/run/control commands.
  • auxiliary.curator is a first-class auxiliary model slot. It participates in Hermes’ auxiliary picker/dashboard allowlists, falls back to the main model when set to auto or partially configured, and preserves legacy curator.auxiliary config with deprecation evidence.
  • skill_manage can mutate supporting files under references/, templates/, scripts/, and assets/, route patch calls to those files, refuse pinned skills, thread absorbed_into delete intent, and mark only background-review-created skills as agent-created for later curator maintenance.

Gormes already has several prerequisites: base skill_manage create/edit/patch/delete, skills_list, skill_view, validated SKILL.md storage, skill retrieval fixtures, and the memory+skills-only background review toolset policy. The missing rows are the support-file/curator-intent skill_manage surface, background review fork lifecycle, curator state/report engine, and curator CLI.

Phase 5.F (Skills system) was previously scoped as “port the upstream Python skills plumbing”. That’s mechanical. Phase 6 is the algorithm on top — detecting complexity, distilling patterns, scoring feedback. It depends on 5.F (needs the storage format), but it’s not the same work.

Positioning: Hermes-compatible self-improvement, Go-native safety gates. Hermes defines the background-review and curator behavior users can observe. Gormes keeps those semantics while making the scheduler, reports, skill storage, and operator controls testable without Python runtime assumptions.

Skills are code-like runtime assets, not loose notes. The current skill rows show the value of procedural knowledge with resolver checks and conformance tests. Hermes shows the value and risk of large skill surfaces injected into prompts. Gormes should combine the useful parts:

  • active skills require valid metadata, triggers, exclusions, provenance, and review state;
  • disabled or unreviewed skills never enter prompt injection;
  • resolver routes have fixtures for confusing user phrases;
  • skill selection records are tied to turn outcome and operator feedback;
  • generated skill drafts are inactive until reviewed;
  • updates preserve version history and source evidence;
  • secret stripping and one-off task rejection are mandatory gates.
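As a rough illustration of the first two rules, the check below decides whether a skill may enter prompt injection and returns a reason when it may not. It is a sketch under assumptions: the `SkillMeta` fields, the review-state values, and the reason strings are hypothetical, and it treats an empty exclusions list as valid because the text does not say otherwise.

```go
package main

import "fmt"

// ReviewState is a hypothetical review marker stored with each skill.
type ReviewState string

const (
	ReviewDraft    ReviewState = "draft"
	ReviewApproved ReviewState = "approved"
)

// SkillMeta holds the metadata this sketch inspects; field names are assumptions.
type SkillMeta struct {
	Name       string
	Triggers   []string
	Exclusions []string // may be empty in this sketch
	Provenance string
	Review     ReviewState
	Disabled   bool
}

// Injectable reports whether a skill may be injected into a prompt.
// When it returns false, the second value explains which gate failed.
func Injectable(m SkillMeta) (bool, string) {
	switch {
	case m.Disabled:
		return false, "disabled"
	case m.Review != ReviewApproved:
		return false, "not reviewed"
	case m.Name == "":
		return false, "missing name"
	case len(m.Triggers) == 0:
		return false, "missing triggers"
	case m.Provenance == "":
		return false, "missing provenance"
	}
	return true, ""
}

func main() {
	candidates := []SkillMeta{
		{Name: "release-notes", Triggers: []string{"write release notes"}, Provenance: "operator", Review: ReviewApproved},
		{Name: "generated-draft", Triggers: []string{"rotate keys"}, Provenance: "background-review", Review: ReviewDraft},
		{Name: "retired", Triggers: []string{"old flow"}, Provenance: "operator", Review: ReviewApproved, Disabled: true},
	}
	for _, c := range candidates {
		ok, reason := Injectable(c)
		fmt.Printf("%s injectable=%v %s\n", c.Name, ok, reason)
	}
}
```

Returning the failing gate alongside the boolean keeps the decision explainable, which matters if selection records are later tied to turn outcomes and operator feedback.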

The code-context retrieval rows keep the useful shape: qualified symbols, parent-scope chunks, call-graph edges, and two-pass retrieval. For Gormes this is a retrieval evidence lesson, not a runtime dependency. Phase 6.D now keeps that drift as a small blocked row: define synthetic code-context evidence and fan-out caps that the skill scorer can explain before any tree-sitter, WASM grammar, or repo-wide backfill decision.
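A fan-out cap over synthetic code-context evidence could look roughly like the sketch below. Everything here is an assumption for illustration: `Budget`, `Evidence`, and `Expand` are hypothetical names, the call graph is a hand-written map rather than parser output, and the two passes are simply "seeds first, then direct call-graph neighbours".

```go
package main

import "fmt"

// Budget is a hypothetical fan-out cap with an overflow signal.
type Budget struct {
	limit int
	used  int
}

// NewBudget returns a budget that allows at most limit items.
func NewBudget(limit int) *Budget { return &Budget{limit: limit} }

// Take reserves one unit and reports false once the cap is reached.
func (b *Budget) Take() bool {
	if b.used >= b.limit {
		return false
	}
	b.used++
	return true
}

// Reset clears the counter, for example at the start of a new turn.
func (b *Budget) Reset() { b.used = 0 }

// Evidence is one explainable retrieval hit.
type Evidence struct {
	Symbol string // qualified symbol, e.g. "store.Open"
	Reason string // why the scorer included it
}

// Expand collects seed symbols first, then their direct call-graph neighbours,
// and stops as soon as the budget overflows. truncated reports the overflow.
func Expand(seeds []string, edges map[string][]string, b *Budget) (out []Evidence, truncated bool) {
	seen := map[string]bool{}
	add := func(sym, reason string) bool {
		if seen[sym] {
			return true // duplicates cost nothing
		}
		if !b.Take() {
			return false
		}
		seen[sym] = true
		out = append(out, Evidence{Symbol: sym, Reason: reason})
		return true
	}
	for _, s := range seeds {
		if !add(s, "seed") {
			return out, true
		}
	}
	for _, s := range seeds {
		for _, n := range edges[s] {
			if !add(n, "called by "+s) {
				return out, true
			}
		}
	}
	return out, false
}

func main() {
	edges := map[string][]string{"store.Open": {"store.migrate", "store.lock", "store.audit"}}
	ev, truncated := Expand([]string{"store.Open"}, edges, NewBudget(3))
	for _, e := range ev {
		fmt.Printf("%s (%s)\n", e.Symbol, e.Reason)
	}
	fmt.Println("truncated:", truncated)
}
```

Because each hit carries its reason and the overflow is reported explicitly, a scorer built on this shape can explain both what it used and what it dropped.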

The learning loop is allowed to draft and improve skills only after the storage, resolver, review, and feedback records are testable. Otherwise “self-improving” becomes unreviewed prompt mutation.

Do not begin Phase 6 with live LLM extraction. The dependency order is:

  1. 6.F skill_manage support-file and curator intent actions — port Hermes write_file, remove_file, support-file patching, pinned-skill refusal, absorbed_into delete declarations, usage/provenance updates, and optional agent-created guard rollback with temp skill roots.
  2. 6.A background review fork lifecycle — port Hermes runtime inheritance, memory+skills-only toolset restriction, summary attribution, and cleanup with fake review workers.
  3. 6.C storage extension — extend the Phase 2.G store with versioned metadata, provenance, review state, and atomic writes before generated skills can persist.
  4. 6.E curator auxiliary model slot — port Hermes auxiliary.curator default registration, main-model fallback, canonical override precedence, legacy fallback, blank credential stripping, and no-secret-leak evidence.
  5. 6.E curator state/report engine — port Hermes first-run defer, interval/paused gates, activity transitions, dry-run/report behavior, and pinned/manual safeguards before exposing the command.
  6. 6.F curator CLI — make gormes curator available only after it can read real native curator state and reports.
  7. 6.A deterministic detector — prove local trigger signals are explainable and replayable from transcript/tool-call fixtures.
  8. 6.B extractor schema — use fake model outputs to prove accepted/rejected skill drafts, secret stripping, and one-off task rejection.
  9. 6.D retrieval scorer — combine lexical and semantic signals while excluding disabled or unreviewed skills from prompt injection.
  10. 6.E feedback records — persist outcomes before any automatic promotion/demotion or weight change.
  11. 6.F operator surfaces — expose review/edit/disable flows only after the underlying store and feedback records are stable.
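Step 4 in the order above is mostly a resolution rule, which can be sketched as follows. This is an assumption-heavy illustration rather than the Hermes behavior verbatim: the `ModelRef` and `Config` shapes are hypothetical, "auto" is assumed to be a literal value in either field, and consulting the legacy key only when the canonical one is unusable is one plausible reading of "canonical override precedence, legacy fallback".

```go
package main

import (
	"fmt"
	"strings"
)

// ModelRef names a provider and model; the shape is an assumption.
type ModelRef struct {
	Provider string
	Model    string
}

// usable reports whether the slot names a concrete provider and model.
// A blank field (partially configured) or the literal "auto" is not usable.
func (m ModelRef) usable() bool {
	p, n := strings.TrimSpace(m.Provider), strings.TrimSpace(m.Model)
	if p == "" || n == "" {
		return false
	}
	return !strings.EqualFold(p, "auto") && !strings.EqualFold(n, "auto")
}

// Config holds the three places a curator model may come from.
type Config struct {
	Main      ModelRef
	Canonical ModelRef // auxiliary.curator
	Legacy    ModelRef // curator.auxiliary (legacy key)
}

// Resolution records which source won, so reports can show deprecation evidence.
type Resolution struct {
	Model      ModelRef
	Source     string
	Deprecated bool
}

// ResolveCurator picks canonical, then legacy, then the main model.
func ResolveCurator(c Config) Resolution {
	switch {
	case c.Canonical.usable():
		return Resolution{Model: c.Canonical, Source: "auxiliary.curator"}
	case c.Legacy.usable():
		return Resolution{Model: c.Legacy, Source: "curator.auxiliary", Deprecated: true}
	default:
		return Resolution{Model: c.Main, Source: "main"}
	}
}

func main() {
	main := ModelRef{Provider: "example", Model: "main-model"}
	cases := []Config{
		{Main: main, Canonical: ModelRef{Provider: "auto"}},
		{Main: main, Canonical: ModelRef{Provider: "example", Model: "small-model"}},
		{Main: main, Legacy: ModelRef{Provider: "example", Model: "legacy-model"}},
	}
	for _, c := range cases {
		r := ResolveCurator(c)
		fmt.Printf("source=%s model=%s deprecated=%v\n", r.Source, r.Model.Model, r.Deprecated)
	}
}
```

The resolution result carries no credentials at all, which is one simple way to keep a fixture-backed test from leaking secrets into reports.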

Phase 6 should use Goncho as the learning loop’s durable memory substrate, not create another store. The safe contract is four seams:

  1. Recall input — turn-time learning signals read through the existing Phase 3 recall path and <memory-context> fence, not direct table scans.
  2. Honcho-compatible tools — model-visible memory introspection keeps the public honcho_profile, honcho_search, honcho_context, honcho_reasoning, and honcho_conclude names registered by internal/gonchotools.
  3. Outcome writes — skill-use results, curator conclusions, and retained facts become Goncho conclusions or memory-category writes with provenance, review state, and tombstone/rollback evidence rather than unreviewed prompt mutations.
  4. Diagnostics — operator-facing reports should prefer existing Goncho recall traces, queue status, and memory-status surfaces before adding a new learning-loop dashboard.
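The outcome-write seam can be pictured as an append-only record with mandatory provenance, replayed later to derive counts. The sketch below is in-memory and purely illustrative: `Outcome`, `Ledger`, and `Tally` are hypothetical names and do not represent the Goncho API or its storage.

```go
package main

import (
	"errors"
	"fmt"
)

// Outcome is a hypothetical skill-use record; field names are assumptions.
type Outcome struct {
	SkillID    string
	TurnID     string
	Success    bool
	Provenance string // where the record came from, e.g. a session or report id
	Reviewed   bool   // only reviewed records count toward scoring
}

// Ledger is an append-only, in-memory stand-in for durable storage.
type Ledger struct {
	entries []Outcome
}

// Append rejects records that lack identity or provenance.
func (l *Ledger) Append(o Outcome) error {
	if o.SkillID == "" || o.TurnID == "" || o.Provenance == "" {
		return errors.New("outcome needs skill id, turn id, and provenance")
	}
	l.entries = append(l.entries, o)
	return nil
}

// Tally replays the ledger and counts reviewed outcomes for one skill.
func (l *Ledger) Tally(skillID string) (successes, total int) {
	for _, o := range l.entries {
		if o.SkillID != skillID || !o.Reviewed {
			continue
		}
		total++
		if o.Success {
			successes++
		}
	}
	return successes, total
}

func main() {
	var l Ledger
	records := []Outcome{
		{SkillID: "deploy", TurnID: "t1", Success: true, Provenance: "session-1", Reviewed: true},
		{SkillID: "deploy", TurnID: "t2", Success: false, Provenance: "session-2", Reviewed: true},
		{SkillID: "deploy", TurnID: "t3", Success: true, Provenance: "session-3"}, // unreviewed
		{SkillID: "deploy", TurnID: "t4", Success: true},                          // no provenance
	}
	for _, r := range records {
		if err := l.Append(r); err != nil {
			fmt.Println("rejected:", err)
		}
	}
	s, n := l.Tally("deploy")
	fmt.Printf("deploy: %d/%d reviewed successes\n", s, n)
}
```

Deriving counts by replay rather than storing a mutable weight means any later promotion or demotion can be audited against the records that produced it.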

Hermes defines the compatibility floor here: background review is restricted to memory and skills tools, and hermes curator owns user-visible state/report semantics. OpenClaw memory behavior is donor evidence only: bounded hidden recall, graceful no-plugin degradation, memory_search/memory_get QA, lazy QMD startup, and plugin diagnostics can harden Gormes reports, but they do not replace the Hermes/Honcho contract.

The GEPA lane is now test-backed but remains offline and deterministic:

  • Prompt evaluation harness is complete. internal/llm/prompt_evaluator.go evaluates prompt variants against injected scenario runners, records task_success, tool_accuracy, and response_quality on a 1-5 scale, and aggregates variant scores. internal/llm/eval_scenarios.go provides a 10-scenario local corpus.
  • Iterative prompt mutation and scoring loop is complete. internal/llm/prompt_optimizer.go generates bounded tool-selection, response-quality, task-decomposition, and command-safety mutations, scores them through the harness, and stops on convergence, perfect score, or budget.
  • Behavioral pattern extraction from session logs is still planned. Do not promote prompt mutations from live logs until the extractor row has fixture-backed success/anti-pattern evidence and operator review rules.
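The three stop conditions named above (convergence, perfect score, budget) can be illustrated with a small deterministic loop. This is not the code in internal/llm/prompt_optimizer.go: `Scorer`, `Mutator`, `Optimize`, the marker-based demo scorer, and the epsilon parameter are hypothetical stand-ins chosen so the example runs offline.

```go
package main

import (
	"fmt"
	"strings"
)

// MaxScore is the top of the 1-5 scale.
const MaxScore = 5.0

// Scorer returns an aggregate score for a prompt; it stands in for the harness.
type Scorer func(prompt string) float64

// Mutator proposes a bounded set of variants of a prompt.
type Mutator func(prompt string) []string

// Result records the winning prompt and why the loop stopped.
type Result struct {
	Best   string
	Score  float64
	Rounds int
	Stop   string // "perfect", "converged", or "budget"
}

// Optimize greedily keeps the best-scoring variant each round.
func Optimize(seed string, score Scorer, mutate Mutator, maxRounds int, epsilon float64) Result {
	best, bestScore := seed, score(seed)
	rounds := 0
	for rounds < maxRounds {
		if bestScore >= MaxScore {
			return Result{Best: best, Score: bestScore, Rounds: rounds, Stop: "perfect"}
		}
		rounds++
		prev := bestScore
		for _, cand := range mutate(best) {
			if s := score(cand); s > bestScore {
				best, bestScore = cand, s
			}
		}
		if bestScore-prev < epsilon {
			return Result{Best: best, Score: bestScore, Rounds: rounds, Stop: "converged"}
		}
	}
	stop := "budget"
	if bestScore >= MaxScore {
		stop = "perfect"
	}
	return Result{Best: best, Score: bestScore, Rounds: rounds, Stop: stop}
}

// markers are toy stand-ins for the four mutation families.
var markers = []string{"[tools]", "[quality]", "[steps]", "[safety]"}

// demoScore awards one point per marker present, starting from 1.
func demoScore(prompt string) float64 {
	s := 1.0
	for _, m := range markers {
		if strings.Contains(prompt, m) {
			s++
		}
	}
	return s
}

// demoMutate appends each marker the prompt does not yet contain.
func demoMutate(prompt string) []string {
	var out []string
	for _, m := range markers {
		if !strings.Contains(prompt, m) {
			out = append(out, prompt+" "+m)
		}
	}
	return out
}

func main() {
	r := Optimize("base", demoScore, demoMutate, 10, 0.01)
	fmt.Printf("stop=%s rounds=%d score=%.1f\n", r.Stop, r.Rounds, r.Score)
}
```

With the toy scorer the loop gains one point per round and stops with a perfect score after four rounds; lowering `maxRounds` to 2 makes it stop on budget instead. Injecting the scorer and mutator as functions is what keeps a loop like this offline and deterministic under test.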

Hermes owns the background-review and curator contracts; Gormes owns the Go-native implementation and safety evidence. Automatic scoring/promotion rows remain Gormes-native unless a later Hermes source introduces a stricter contract. Surrounding plumbing has donors:

| Phase 6 problem | Donor file | Notes |
|---|---|---|
| 6.A complexity detector — bounded transcript-size budget | axe/internal/budget/budget.go | Per-turn counter + overflow signal |
| 6.A complexity detector — append-only signal log | engram/internal/mcp/activity.go | Audit shape, redaction |
| 6.B extractor schema — secret stripping at ingest boundary | nanobot/pkg/agents/truncate.go | Sanitize/truncate before persistence |
| 6.C skill storage — versioned metadata + atomic writes | engram/internal/persistence/store/store.go | DDL + migration helpers |
| 6.C skill storage — sanitized artifact paths for stored evidence | axe/internal/artifact/tracker.go | Path-traversal guard |
| 6.D retrieval scorer — bounded fan-out cap for code-context evidence | axe/internal/budget/budget.go | Reset + overflow signal |
| 6.D retrieval scorer — provenance-aware ranking signals | engram/internal/persistence/store/relations.go | Provenance edges (scoped, supersedes) |
| 6.E feedback records — outcome ledger before promotion/demotion | engram/internal/mcp/activity.go | Append-only outcome log |
| 6.F operator review surfaces — workflow agent pattern | adk-go/agent/workflowagents/... | Loop / sequential / parallel primitives |

Route through the gormes-references skill (development-skills/gormes-references/SKILL.md) before re-deriving any of these shapes.