engram

Q-Learning Router

Adaptive strategy selection through reinforcement learning

Engram uses a hierarchical Q-Learning router for adaptive search parameter selection. The router learns from agent feedback and, over time, optimizes its strategy for the agent's specific usage patterns.

4 decision levels

The router makes decisions at four independent levels. Each level has its own Q-table.

1. Search Strategy

Selects the similarity threshold for filtering results.

Action            Description
high_threshold    Only highly relevant results
medium_threshold  Balance of precision and recall
low_threshold     Maximum coverage

2. LLM Selection

Selects the LLM model for auxiliary tasks (HyDE, scoring).

Action     Description
cheap      Fast and inexpensive model
balanced   Balance of speed and quality
expensive  Maximum quality

3. Contextualization

How to prepare results for the agent.

Action     Description
raw        Return results as-is
summarize  Summarize via LLM

4. Proactivity

Level of proactivity during search.

Action     Description
passive    Only explicitly requested results
proactive  Additional warnings and related records

Epsilon-greedy exploration

The router uses an epsilon-greedy strategy:

  • With probability epsilon (default 0.15): a random action is selected (exploration)
  • With probability 1 - epsilon: the best action from the Q-table is selected (exploitation)

This balances using proven strategies with exploring new ones.
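The selection rule above can be sketched in Python. This is a minimal illustration; the function name, Q-table layout, and action list are hypothetical, not Engram's actual API:

```python
import random

def select_action(q_table: dict, state: str, actions: list, epsilon: float = 0.15) -> str:
    """Epsilon-greedy selection over one router level's Q-table."""
    if random.random() < epsilon:
        # Exploration: try a random action to gather fresh feedback.
        return random.choice(actions)
    # Exploitation: pick the action with the highest learned Q-value.
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

q = {("debug", "high_threshold"): 0.7, ("debug", "low_threshold"): 0.2}
actions = ["high_threshold", "medium_threshold", "low_threshold"]
print(select_action(q, "debug", actions, epsilon=0.0))  # → high_threshold
```

With epsilon forced to 0 the choice is purely greedy; at the default 0.15, roughly one decision in seven is a random probe.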

Mode detection

The router automatically detects the agent's working mode from query keywords:

Mode          Keywords                                            Priority
debug         bug, error, stack, trace, crash, fix, issue, panic  1 (highest)
plan          plan, estimate, risk, assess, schedule, timeline    2
architecture  design, choose, structure, framework, pattern       3
review        review, refactor, improve, clean, lint, optimize    4
coding        implement, code, function, method, class, feature   5
routine       update, version, config, dependency, setup, init    6 (lowest)

Each mode has its own defaults for all 4 levels and defines memory type priorities.
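A keyword match honoring the priority order above could look like the sketch below. The default mode and the word-splitting approach are assumptions; Engram's real implementation may tokenize and score differently:

```python
# Modes in priority order (highest first), with their trigger keywords.
MODE_KEYWORDS = [
    ("debug", {"bug", "error", "stack", "trace", "crash", "fix", "issue", "panic"}),
    ("plan", {"plan", "estimate", "risk", "assess", "schedule", "timeline"}),
    ("architecture", {"design", "choose", "structure", "framework", "pattern"}),
    ("review", {"review", "refactor", "improve", "clean", "lint", "optimize"}),
    ("coding", {"implement", "code", "function", "method", "class", "feature"}),
    ("routine", {"update", "version", "config", "dependency", "setup", "init"}),
]

def detect_mode(query: str, default: str = "coding") -> str:
    """Return the highest-priority mode whose keywords appear in the query."""
    words = set(query.lower().split())
    for mode, keywords in MODE_KEYWORDS:  # first match wins: highest priority
        if words & keywords:
            return mode
    return default

print(detect_mode("fix the panic in the parser"))  # → debug
```

Because the list is scanned in priority order, a query mixing "fix" and "refactor" resolves to debug, not review.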

Feedback loop

Router learning cycle:

  1. memory_search — router selects strategy based on Q-table
  2. Search results are returned to the agent, and the fact that each record was shown is recorded in feedback_tracking
  3. memory_judge — agent rates usefulness of found records
  4. Rating is passed to the router as reward
  5. Q-table updates: Q(s,a) += alpha * (reward - Q(s,a))
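The update in step 5 is a running average that moves Q(s,a) a fraction alpha toward each observed reward. A sketch with illustrative names (only the formula and the default alpha come from this document):

```python
ALPHA = 0.1  # learning rate

def update_q(q_table: dict, state: str, action: str, reward: float) -> float:
    """Apply Q(s,a) += alpha * (reward - Q(s,a)) and return the new value."""
    old = q_table.get((state, action), 0.0)
    new = old + ALPHA * (reward - old)
    q_table[(state, action)] = new
    return new

q = {}
update_q(q, "debug", "high_threshold", 1.0)  # 0.0 + 0.1 * (1.0 - 0.0) = 0.1
update_q(q, "debug", "high_threshold", 1.0)  # 0.1 + 0.1 * (1.0 - 0.1) = 0.19
```

Repeated high rewards pull the value asymptotically toward 1.0; a low or missing judge score pulls it back down.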

Scoring guide

Score     Interpretation
0.8-1.0   Directly solved the problem
0.5-0.7   Useful context
0.1-0.4   Tangentially related
no judge  Implicit low-usefulness signal

Q-table

Stored in the SQLite q_table table:

CREATE TABLE q_table (
    router_level TEXT,    -- search, llm, context, proactivity
    state TEXT,           -- mode: debug, architecture, coding...
    action TEXT,          -- level-specific action
    value REAL,           -- Q-value
    update_count INTEGER  -- number of updates
);

The learning rate alpha (default 0.1) determines learning speed. The higher the update_count, the more stable the Q-table value.
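A minimal sketch of applying one update against the schema above, using Python's built-in sqlite3 module. The select-then-update flow and function name are assumptions about how persistence might work, not Engram's actual code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE q_table (
    router_level TEXT,
    state TEXT,
    action TEXT,
    value REAL,
    update_count INTEGER
)""")

def persist_update(level: str, state: str, action: str,
                   reward: float, alpha: float = 0.1) -> float:
    """Read the current Q-value, apply the update rule, and write it back."""
    key = (level, state, action)
    row = conn.execute(
        "SELECT value, update_count FROM q_table "
        "WHERE router_level = ? AND state = ? AND action = ?", key).fetchone()
    if row is None:
        value, count = 0.0, 0
        conn.execute("INSERT INTO q_table VALUES (?, ?, ?, ?, ?)", key + (0.0, 0))
    else:
        value, count = row
    value += alpha * (reward - value)  # Q(s,a) += alpha * (reward - Q(s,a))
    conn.execute(
        "UPDATE q_table SET value = ?, update_count = ? "
        "WHERE router_level = ? AND state = ? AND action = ?",
        (value, count + 1) + key)
    return value

persist_update("search", "debug", "high_threshold", 1.0)
print(conn.execute("SELECT value, update_count FROM q_table").fetchone())  # → (0.1, 1)
```

Tracking update_count alongside the value is what lets a reader judge stability: a Q-value backed by hundreds of updates is far less likely to shift than one backed by a handful.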