We propose using Thompson Sampling over a discrete set of weight presets to optimize retrieval scoring in agent memory systems, with reward signals derived from LLM self-assessment of input quality rather than output quality. Across 2,200 episodes spanning two experiments (1,000 and 1,200 episodes), we identify two orthogonal mechanisms: output-driven efficiency (up to 19.7% token reduction) and adaptive retrieval (26.3% NDCG@5 improvement). The input-assessment design exhibits zero rating drift, avoiding reward hacking.
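The core loop can be sketched as standard Beta-Bernoulli Thompson Sampling over a small preset grid. The presets, reward mapping, and update rule below are illustrative assumptions, not the paper's exact configuration; the self-assessment signal is modeled here as a binary reward.

```python
import random

class ThompsonSelector:
    """Beta-Bernoulli Thompson Sampling over discrete retrieval
    weight presets (hypothetical presets; the paper's actual grid
    and reward mapping are not reproduced here)."""

    def __init__(self, presets, seed=0):
        self.presets = presets
        self.alpha = [1.0] * len(presets)  # Beta posterior: successes + 1
        self.beta = [1.0] * len(presets)   # Beta posterior: failures + 1
        self.rng = random.Random(seed)

    def select(self):
        # Draw one sample from each preset's Beta posterior and
        # pick the preset whose sample is highest.
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Binary reward, e.g. an LLM self-assessment of the
        # retrieved context's quality mapped to {0, 1}.
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

# Hypothetical presets: (semantic, recency, importance) weights.
presets = [(0.8, 0.1, 0.1), (0.5, 0.3, 0.2), (0.2, 0.4, 0.4)]
ts = ThompsonSelector(presets, seed=42)

# Simulated environment: preset 1 yields reward 70% of the time,
# the others 30%. The selector should converge on preset 1.
env = random.Random(7)
true_p = [0.3, 0.7, 0.3]
for _ in range(300):
    arm = ts.select()
    ts.update(arm, env.random() < true_p[arm])

pulls = [int(a + b - 2) for a, b in zip(ts.alpha, ts.beta)]
print(pulls)
```

With no gradients involved, the only state per preset is two counters, which makes the approach cheap to run alongside an agent's normal memory-retrieval path.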
Retrieval quality improvement (NDCG@5) over fixed-weight baseline
Mean reciprocal rank (MRR) improvement
Output token reduction from structured feedback
Watch as Thompson Sampling learns which retrieval weight preset works best. Each line tracks one of 12 weight configurations as the system processes 300 tasks. Click a preset in the legend to toggle it.
@article{dirocco2026gradient,
  title={Gradient-Free Retrieval Weight Learning via Thompson Sampling with LLM Self-Assessment},
  author={DiRocco, Alfonso},
  year={2026},
  note={arXiv preprint}
}