Reinforcement Learning from Compiler
and Language Server Feedback

A CLI‑first, agent‑native orchestrator that turns compiler and language‑server feedback into deterministic analysis bundles, guarded edit previews, and RLCSF process rewards.

Yifan Zhang

Princeton University  •  October 24, 2025  •  Revised May 1, 2026

LSP (Pyright)  •  Deterministic Bundles  •  Selector DSL  •  RLCSF Reward  •  Replay & CI  •  Safe Mutations

Abstract

Coding agents fail when text-level guesses outrun program facts: they hallucinate APIs, drift to the wrong symbol, and apply edits without evidence that the workspace remains valid. Compilers, type checkers, and language servers expose the missing supervision signal through diagnostics, symbol resolution, type information, references, and refactoring preconditions.

We introduce Reinforcement Learning from Compiler and Language Server Feedback (RLCSF) and Lanser‑CLI, a CLI-first orchestration layer that makes this signal usable for agents and CI. RLCSF treats each tool interaction as a transition and computes a shaped process reward from deterministic changes in diagnostics, selector confidence, and edit safety.

Lanser‑CLI turns ephemeral LSP sessions into replayable Analysis Bundles with pinned environment metadata and stable content hashes. Its main mechanisms are robust selectors beyond file:line:col, deterministic bundle normalization, preview-first guarded mutations, and a reward functional whose potential-based component is replayable under frozen snapshots.


Motivation: Bridging the Agent–Feedback Gap

Language servers were designed for interactive IDEs, not autonomous optimization loops. At agent scale, four requirements become first-class:

  1. Determinism & Replay: equivalent requests should produce byte-stable artifacts after response normalization, version pinning, and content hashing.
  2. Robust Addressing: agents need selectors that survive edits, expose ambiguity, and make encoding conventions explicit.
  3. Safety for Mutations: refactors must be previewed, confined to the workspace, checked for conflicts, and recoverable when application fails.
  4. Process Supervision: intermediate tool feedback should become a dense, verifiable signal correlated with successful repair and refactoring.

Lanser‑CLI closes these gaps by transforming interactive compiler and language-server sessions into verifiable artifacts. The resulting interface gives LLM agents protocol grounding: model speculation is replaced by machine-checked facts, and each adjacent pair of bundles can be scored by a deterministic reward functional.

Architecture Overview

Figure: The Lanser‑CLI orchestrator mediates JSON‑RPC between the agent, a pinned language server or compiler-backed analyzer, and the workspace, and emits deterministic bundles.

Environment Capture

Each bundle records: {toolVersion, serverVersion, positionEncoding, pythonExe, pythonVersion, venvPath, configDigest, platform}.

Contracts & Invariants

  • Deterministic list order: (uri, sL, sC, eL, eC) with stable tie‑breakers.
  • bundleId is a SHA‑256 over a JCS‑canonicalized subset of fields that excludes volatile timestamps and run-local trace data.
  • Replay under a frozen snapshot, pinned tool version, fixed configuration, and deterministic analyzer semantics yields byte-stable hash-domain contents.

{
  "version": "1.2",
  "bundleId": "sha256:…",
  "status": "ok",
  "request": {"cmd": "definition", "selector": {...}},
  "resolution": {"original": "...", "resolved": {...}, "disambiguation": [...]},
  "facts": {"definitions": [...], "hover": {...}, "provenance": "lsp"},
  "environment": {"tool":{"name":"pyright","version":"1.1.407"},
    "positionEncoding":"utf-16","python":{"version":"3.12.0"}},
  "meta": {"sorting_keys": ["uri","range[0]","range[1]","range[2]","range[3]"]}
}
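
The ordering and hashing contracts above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Lanser‑CLI implementation: the names `VOLATILE_KEYS`, `sort_locations`, and `bundle_id` are hypothetical, the excluded keys are guesses at what "volatile" covers, and `json.dumps` with sorted keys only approximates JCS (RFC 8785), which additionally fixes number serialization and sorts keys by UTF-16 code units.

```python
import hashlib
import json

# Hypothetical set of top-level keys kept out of the hash domain.
VOLATILE_KEYS = {"bundleId", "timestamp", "trace"}


def sort_locations(locations):
    """Order locations by (uri, sL, sC, eL, eC); each range is [sL, sC, eL, eC]."""
    return sorted(locations, key=lambda loc: (loc["uri"], *loc["range"]))


def bundle_id(bundle):
    """SHA-256 over a canonical JSON encoding of the non-volatile fields."""
    stable = {k: v for k, v in bundle.items() if k not in VOLATILE_KEYS}
    canonical = json.dumps(
        stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two bundles that differ only in key order or in an excluded field hash to the same identifier, which is the property replay verification relies on.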

Selectors & Repositioning

Agents address intent rather than bytes. The PositionSpec union supports cursor, range, symbolic, AST‑path, and content‑anchor forms with an optional docVersion.

// Structured PositionSpec subset
{ kind:"cursor", uri, line, col, indexing:"utf-16|utf-8|codepoint" }
{ kind:"range", uri, start:[l,c], end:[l,c] }
{ kind:"symbol", qualname:"pkg.mod:Class.method", role:"def|sig|body|doc", overload:0 }
{ kind:"ast", path:[["module","pkg.mod"],["class","C"],["def","m"]] }
{ kind:"anchor", uri, snippet:"def load_data(", ctx:24, hash:"sha1:…" }

Canonical strings for the CLI:

# Cursor / range
src/app.py@L42:C7
src/app.py@R(42,7->44,1)

# Symbolic
py://pkg.mod#Class.method:body
py://pkg.mod#function_name:sig

# AST path subset
ast://[module=pkg.mod]/[class=Class]/[def=method]/name[1]

# Content anchor: snippet + context N chars
anchor://src/app.py#"def load_data("?ctx=24
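
As a rough illustration of how the canonical strings map onto the structured forms, the sketch below parses the cursor, range, and symbolic variants with regular expressions. The grammar is inferred only from the examples above and may be narrower than the real one; `parse_selector` and the three pattern names are hypothetical, and the AST-path and anchor forms are deliberately left unhandled.

```python
import re
from typing import Optional

# Patterns inferred from the example strings; the real grammar may differ.
CURSOR_RE = re.compile(r"^(?P<uri>.+)@L(?P<line>\d+):C(?P<col>\d+)$")
RANGE_RE = re.compile(
    r"^(?P<uri>.+)@R\((?P<sl>\d+),(?P<sc>\d+)->(?P<el>\d+),(?P<ec>\d+)\)$"
)
SYMBOL_RE = re.compile(
    r"^py://(?P<module>[\w.]+)#(?P<name>[\w.]+)(?::(?P<role>def|sig|body|doc))?$"
)


def parse_selector(text: str) -> Optional[dict]:
    """Return a PositionSpec-like dict, or None for unsupported forms."""
    m = CURSOR_RE.match(text)
    if m:
        return {"kind": "cursor", "uri": m["uri"],
                "line": int(m["line"]), "col": int(m["col"])}
    m = RANGE_RE.match(text)
    if m:
        return {"kind": "range", "uri": m["uri"],
                "start": [int(m["sl"]), int(m["sc"])],
                "end": [int(m["el"]), int(m["ec"])]}
    m = SYMBOL_RE.match(text)
    if m:
        return {"kind": "symbol",
                "qualname": f'{m["module"]}:{m["name"]}',
                "role": m["role"]}
    return None
```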

Indexing Semantics

Lanser‑CLI negotiates positionEncoding with the server, preferring LSP utf‑16, while also supporting utf‑8 and codepoint CLI I/O. When server and CLI encodings differ, both coordinate systems can be surfaced in verbose traces; bundles retain server coordinates.
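
To make the encoding difference concrete, here is a minimal sketch of converting a column offset between the three unit systems. It assumes 0-based offsets within a single line, as in LSP positions; `convert_col` is a hypothetical helper, not part of the Lanser‑CLI interface.

```python
def convert_col(line_text: str, col: int, src: str, dst: str) -> int:
    """Convert a 0-based column between 'utf-8', 'utf-16', and 'codepoint' units."""

    def units(ch: str, encoding: str) -> int:
        if encoding == "codepoint":
            return 1
        if encoding == "utf-8":
            return len(ch.encode("utf-8"))
        if encoding == "utf-16":
            return len(ch.encode("utf-16-le")) // 2  # 2 bytes per code unit
        raise ValueError(f"unknown encoding: {encoding}")

    consumed = 0  # offset walked so far, in source units
    out = 0       # the same offset, in destination units
    for ch in line_text:
        if consumed >= col:
            break
        consumed += units(ch, src)
        out += units(ch, dst)
    if consumed != col:
        raise ValueError("column is past end of line or splits a character")
    return out
```

A character outside the Basic Multilingual Plane occupies one codepoint, two UTF-16 code units, and four UTF-8 bytes, so the same cursor can carry three different column numbers; rejecting offsets that split a character avoids silently snapping to a neighbouring position.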

Relocate Algorithm (deterministic)

Algorithm: Relocate(selector s, workspace W, optional snapshot v)
1. If v present and W has exact map(s, v): return [(map(s, v), 1.0, certified)].
2. If s.kind ∈ {symbol, ast}: resolve via module graph + parser → candidates A.
3. If s.kind = anchor: fuzzy match k-grams within context → candidates H.
4. Score non-certified candidates by:
   0.5·ast + 0.2·module + 0.2·tokenJaccard + 0.1·proximity.
5. Sort descending by score, then ascending by (uri, range). If empty → E/NOT_FOUND.
6. If top score is low or margin is small: attach disambiguation evidence.
7. Return top-k candidates.
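
Steps 4 to 7 can be sketched as follows. The feature weights come from the algorithm above; everything else is an assumption made for illustration: the `Candidate` type, the `rank` function, and especially the `min_score` and `min_margin` thresholds, since the text only says "low" and "small". Certified candidates from step 1 are not modelled.

```python
from dataclasses import dataclass

# Feature weights from step 4 of Relocate.
WEIGHTS = {"ast": 0.5, "module": 0.2, "token_jaccard": 0.2, "proximity": 0.1}


@dataclass(frozen=True)
class Candidate:
    uri: str
    span: tuple  # (sL, sC, eL, eC)
    ast: float
    module: float
    token_jaccard: float
    proximity: float

    @property
    def score(self) -> float:
        return sum(w * getattr(self, name) for name, w in WEIGHTS.items())


def rank(candidates, k=3, min_score=0.6, min_margin=0.1):
    """Return (top-k candidates, needs_disambiguation).

    min_score and min_margin are hypothetical thresholds.
    """
    if not candidates:
        raise LookupError("E/NOT_FOUND")
    # Descending score, then ascending (uri, span) as a stable tie-breaker.
    ordered = sorted(candidates, key=lambda c: (-c.score, c.uri, c.span))
    margin = ordered[0].score - ordered[1].score if len(ordered) > 1 else float("inf")
    needs_disambiguation = ordered[0].score < min_score or margin < min_margin
    return ordered[:k], needs_disambiguation
```

Sorting on a total key rather than on score alone is what keeps the output order identical across runs when two candidates tie.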

RLCSF Rewards from Compiler and Language Server Feedback

RLCSF exposes a shaped reward that is computable online and replayable from deterministic bundles. Let $D_t$ be the diagnostic count over the recorded scope, $S_t\in[0,1]$ be safety readiness, $A_t\in[0,1]$ be top selector-resolution confidence, and $E_t\in\{0,1\}$ indicate a structured tool error.

$$ r^{\mathrm{csf}}_t = w_D(D_{t-1}-D_t) + w_S(S_t-S_{t-1}) + w_A(A_t-A_{t-1}) - w_EE_t. $$

This is the undiscounted ($\gamma = 1$) form of the potential-based RLCSF reward. It credits diagnostic reduction, safety improvement, and selector-confidence improvement, while penalizing structured tool failures.

// Example 1: diagnostic reduction, safe apply, confident resolution
weights: (wD,wS,wA,wE)=(0.5,0.4,0.1,0.5)
D_{t-1}=5, D_t=2
S_{t-1}=0, S_t=1
A_{t-1}=0.70, A_t=0.94
E_t=0

r_t = 0.5*(5-2) + 0.4*(1-0) + 0.1*(0.94-0.70) - 0.5*0
    = 1.924

// Example 2: ambiguous selector and apply conflict
weights: (wD,wS,wA,wE)=(0.5,0.4,0.1,0.5)
D_{t-1}=7, D_t=7
S_{t-1}=0, S_t=0
A_{t-1}=0.62, A_t=0.62
E_t=1

r_t = 0.5*0 + 0.4*0 + 0.1*0 - 0.5*1
    = -0.5

The reward is a shaping signal, not a replacement for terminal task success. Because it is computed from adjacent deterministic bundle contents and fixed weights, lanser trace replay can recover the same reward offline for evaluation and counterfactual policy analysis.
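
The reward formula is small enough to state directly in code. The sketch below is illustrative only: `State`, `Weights`, and `rlcsf_reward` are hypothetical names, and the default weights are simply the ones used in the two worked examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    diagnostics: int   # D_t: diagnostic count over the recorded scope
    safety: float      # S_t in [0, 1]
    confidence: float  # A_t in [0, 1]


@dataclass(frozen=True)
class Weights:
    # Defaults taken from the worked examples, not from a fixed specification.
    w_d: float = 0.5
    w_s: float = 0.4
    w_a: float = 0.1
    w_e: float = 0.5


def rlcsf_reward(prev: State, curr: State, tool_error: int,
                 w: Weights = Weights()) -> float:
    """Undiscounted RLCSF reward for the transition prev -> curr."""
    return (
        w.w_d * (prev.diagnostics - curr.diagnostics)
        + w.w_s * (curr.safety - prev.safety)
        + w.w_a * (curr.confidence - prev.confidence)
        - w.w_e * tool_error
    )


# Example 1 and Example 2 from the text.
r1 = rlcsf_reward(State(5, 0.0, 0.70), State(2, 1.0, 0.94), tool_error=0)
r2 = rlcsf_reward(State(7, 0.0, 0.62), State(7, 0.0, 0.62), tool_error=1)
```

Because the first three terms are differences of the same quantity at adjacent steps, they telescope: over an error-free trajectory the summed reward depends only on the first and last states, so intermediate detours cannot inflate the total.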

Editing & Safety Envelope

  • Preview‑by‑default: dry‑run produces diffs; apply requires --apply.
  • Workspace jail: edits are confined to the project root with allow/deny path filters.
  • Git guardrails: a clean worktree is required unless explicitly overridden with --allow-dirty.
  • Staged apply: preflight validation, conflict detection, temporary files, fsync, and per-file rename(2) replacement.
  • Conflict recovery: optional git apply --3way and structured E/APPLY_CONFLICT reports.
  • Indexing & encoding checks: dual coordinates can be surfaced in verbose mode.

Algorithm: PreviewThenApply (Guarded Rename)
1. Assert clean git or --allow-dirty.
2. prepareRename(s) → must pass; else error.
3. rename(s, newName) → WorkspaceEdit preview; emit unified diff.
4. If --apply: apply with workspace jail + allow/deny filters.
5. On conflict → E/APPLY_CONFLICT.
6. Notify server via didChange; emit success bundle.
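
Two of the guardrails, the workspace jail and the staged per-file replacement, can be sketched with the Python standard library. This is a minimal illustration under assumptions, not the tool's implementation: `inside_jail`, `atomic_write`, and the `.lanser-` temp prefix are hypothetical, allow/deny filters and conflict detection are omitted, and the parent directory of the target is assumed to exist.

```python
import os
import tempfile
from pathlib import Path


def inside_jail(root: Path, rel_path: str) -> bool:
    """True if rel_path, resolved against root, stays inside root."""
    root = root.resolve()
    target = (root / rel_path).resolve()
    return target == root or root in target.parents


def atomic_write(root: Path, rel_path: str, new_text: str) -> Path:
    """Write new_text to root/rel_path via a temp file, fsync, and rename."""
    if not inside_jail(root, rel_path):
        raise PermissionError(f"edit escapes workspace root: {rel_path}")
    target = (root.resolve() / rel_path).resolve()
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=target.parent, prefix=".lanser-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(new_text)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, target)  # rename(2) semantics on POSIX
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return target
```

Resolving paths before the containment check is what catches `..` segments and absolute paths; writing to a sibling temp file first means a failed write leaves the original file untouched.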

Interface: CLI, Tracing, and Bundles

Navigation (read‑only)

# Definition / References / Hover / Symbols / Diagnostics
lanser def py://pkg.mod#Class.method:sig --json
lanser references anchor://src/app.py#"def load_data("?ctx=24 --json
lanser hover src/app.py@L42:C7 --json
lanser symbols --json
lanser diagnostics --json

Safe Mutations

lanser prepare-rename py://pkg.mod#load_data:def --json
lanser rename py://pkg.mod#load_data:def read_data --dry-run
lanser rename py://pkg.mod#load_data:def read_data --apply

Batch & Replay

# JSONL batch execution
lanser batch --in requests.jsonl --out bundles.jsonl --trace-file trace.jsonl

# Deterministic replay of a recorded trace
lanser trace replay --trace-file trace.jsonl --verify

Schema Contracts

# Export and validate JSON contracts for selectors, bundles, and fixtures
lanser schema export --out schemas/
lanser schema validate bundle.json
lanser schema validate-batch fixtures/

Bundle Envelope (with RLCSF process reward)

{
  "version": "1.2",
  "bundleId": "sha256:…",
  "status": "ok",
  "request": {"cmd":"definition","selector": {...}},
  "resolution": {"original":"py://pkg.mod#load_data:def",
                 "resolved": {...}, "disambiguation":[...]},
  "facts": {"definitions":[...], "hover": {...}, "provenance":"lsp"},
  "edits": {"workspaceEdit": null, "diff": null},
  "processReward": {
    "version":"rl-csf-v1",
    "previousBundleId":"sha256:…",
    "r":1.924,
    "components":{"diag_delta":3,"safety_delta":1,
                  "confidence_delta":0.24,"tool_error":0},
    "weights":{"wD":0.5,"wS":0.4,"wA":0.1,"wE":0.5,"gamma":1.0},
    "source":"compiler+lsp",
    "explanation":"Reward computed over adjacent frozen bundles."
  },
  "environment": {"tool":{"name":"pyright","version":"1.1.407"},
                  "positionEncoding":"utf-16","python":{"version":"3.12.0"}},
  "capabilities": {"partialResult": false, "cancellable": true},
  "meta": {"exit_code":0,
           "sorting_keys":["uri","range[0]","range[1]","range[2]","range[3]"]}
}

Citation

If you find this work useful, please cite:

@article{zhang2025rlcsf,
  title   = {Reinforcement Learning from Compiler and Language Server Feedback},
  author  = {Zhang, Yifan},
  journal = {arXiv preprint arXiv:2510.22907},
  year    = {2025},
}