Gradient-Free Retrieval Weight Learning via Thompson Sampling with LLM Self-Assessment

Alfonso A. V. DiRocco

arXiv Preprint

We propose Thompson Sampling over discrete weight presets to optimize retrieval scoring in agent memory systems, with reward signals derived from LLM self-assessment of input quality rather than output quality. Across 2,200 episodes (1,000 + 1,200), we identify two orthogonal mechanisms: output-driven efficiency (up to 19.7% token reduction) and adaptive retrieval (a 26.3% NDCG@5 improvement). The input-assessment design exhibits zero rating drift, avoiding reward hacking.
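The core loop described above can be sketched as a Beta-Bernoulli bandit: each discrete weight preset is an arm, and the binary reward is a thresholded LLM self-assessment of the retrieved context. This is a minimal illustration, not the paper's implementation; the preset fields and the binary reward signal are assumptions for the example.

```python
import random

# Illustrative weight presets over hypothetical retrieval scoring terms
# (recency / relevance / importance); the paper's 12 presets are not listed
# here, so these three are placeholders.
PRESETS = [
    {"recency": 0.2, "relevance": 0.6, "importance": 0.2},
    {"recency": 0.4, "relevance": 0.4, "importance": 0.2},
    {"recency": 0.1, "relevance": 0.8, "importance": 0.1},
]

class ThompsonSampler:
    def __init__(self, n_arms):
        # Beta(1, 1) uniform prior on each preset's success probability.
        self.alpha = [1.0] * n_arms
        self.beta = [1.0] * n_arms

    def select(self):
        # Draw one plausible success rate per arm; play the best draw.
        draws = [random.betavariate(a, b)
                 for a, b in zip(self.alpha, self.beta)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, reward):
        # reward in {0, 1}: e.g. the LLM's self-assessment of the
        # *retrieved input's* quality, thresholded to a binary signal.
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

sampler = ThompsonSampler(len(PRESETS))
arm = sampler.select()
weights = PRESETS[arm]          # score retrieval with this preset
sampler.update(arm, reward=1)   # feed the quality signal back
```

Because no gradients flow through the retrieval scorer, the scorer itself can be any black box; only the scalar reward reaches the learner.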

26.3% NDCG@5

Retrieval quality improvement over fixed-weight baseline

35.9% MRR

Mean reciprocal rank improvement

19.7% Tokens

Output token reduction from structured feedback
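For reference, the two ranking metrics reported above are standard. Here is a minimal sketch of NDCG@k and MRR for ranked retrieval lists; the graded-relevance inputs are illustrative, not the paper's evaluation data.

```python
import math

def ndcg_at_k(relevances, k=5):
    # relevances: graded relevance of retrieved items, in ranked order.
    dcg = sum(rel / math.log2(i + 2)
              for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2)
               for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(ranked_hits):
    # ranked_hits: one boolean list per query, True = relevant item.
    total = 0.0
    for hits in ranked_hits:
        for rank, hit in enumerate(hits, start=1):
            if hit:
                total += 1.0 / rank
                break
    return total / len(ranked_hits)
```

A perfect ranking gives `ndcg_at_k` a score of 1.0; MRR averages the reciprocal rank of the first relevant hit across queries.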

Figure: Thompson Sampling learning curves. Each line tracks one of 12 retrieval weight presets as the system processes 300 tasks, showing which preset the sampler converges on.

@article{dirocco2026gradient,
  title={Gradient-Free Retrieval Weight Learning via Thompson Sampling with LLM Self-Assessment},
  author={DiRocco, Alfonso A. V.},
  year={2026},
  note={arXiv preprint}
}