hyle: Salience-Aware Context Management for Autonomous Code Assistants
A Rust-native implementation with multi-tier compression and cognitive architecture
Introduction
Large language models excel at code generation but struggle with extended development sessions. Context windows, while expanding (from 4K to 200K+ tokens), remain finite. Previous approaches either truncate history aggressively—losing critical context about design decisions—or maintain full context at prohibitive cost.
hyle addresses this through salience-aware context management: a four-tier system that prioritizes recent, error-containing, and task-relevant content while compressing or discarding peripheral information. The key insight is that not all context is equally valuable: a compilation error from 30 seconds ago is more salient than a successful file read from 10 minutes ago.
This paper makes the following contributions:
- A formal model for salience scoring in code assistant contexts
- A multi-tier compression scheme with configurable budget allocation
- A cognitive architecture separating execution, summarization, and validation
- An open-source implementation with comprehensive safety mechanisms
System Architecture
The system is organized into six primary modules, each described below.
main.rs - Entry Point
Parses CLI arguments using clap, initializes logging, and dispatches to either TUI mode or HTTP server mode. Handles configuration loading from ~/.config/hyle/config.json and environment variable overrides.
pub fn main() -> Result<()> {
    let cli = Cli::parse();
    init_logging(cli.verbose)?;
    match cli.mode {
        Mode::Interactive => ui::run(cli.into())?,
        Mode::Server { port } => server::run(port)?,
        Mode::Once { prompt } => agent::run_once(&prompt)?,
    }
    Ok(())
}
ui.rs - Terminal User Interface
Implements a 20Hz event loop using ratatui. Non-blocking design polls keyboard input, background task completion, and API streaming simultaneously. Maintains render state separately from application state for flicker-free updates.
loop {
    // Poll at 50ms intervals (~20Hz)
    if event::poll(Duration::from_millis(50))? {
        handle_input(event::read()?)?;
    }
    // Check background tasks
    while let Ok(msg) = bg_rx.try_recv() {
        process_background_result(msg)?;
    }
    // Render current state
    terminal.draw(|f| render_ui(f, &state))?;
}
agent.rs - Execution Loop
Core agentic loop that parses tool calls from model output, executes them, and feeds results back. Implements the "think-act-observe" cycle with configurable iteration limits and stuck detection.
while iterations < MAX_ITERATIONS {
    let response = client.complete(&messages).await?;
    if let Some(tools) = parse_tool_calls(&response) {
        let results = execute_tools(tools).await?;
        messages.push(tool_results_message(results));
    } else {
        // No tool calls means the task is complete
        break;
    }
    iterations += 1;
}
tools.rs - Tool Implementations
Five core tools: read, write, patch, bash, and search. All file operations use atomic semantics. Bash commands are checked against a blocklist before execution.
pub enum Tool {
    Read { path: PathBuf },
    Write { path: PathBuf, content: String },
    Patch { path: PathBuf, search: String, replace: String },
    Bash { command: String, timeout: Option<Duration> },
    Search { pattern: String, path: Option<PathBuf> },
}
client.rs - API Client
Server-Sent Events (SSE) streaming client for OpenRouter API. Handles rate limiting with exponential backoff, automatic model fallback, and token counting for budget management.
pub async fn stream_completion(
    &self,
    messages: &[Message],
) -> Result<impl Stream<Item = Result<Event>>> {
    let req = self.build_request(messages)?;
    let stream = reqwest::Client::new()
        .post(&self.endpoint)
        .headers(self.headers())
        .json(&req)
        .send()
        .await?
        .bytes_stream()
        .map(parse_sse_event);
    Ok(stream)
}
server.rs - HTTP API
Optional REST API for IDE integrations. Exposes endpoints for session management, prompt submission, and status queries. Uses axum with tower middleware for request logging.
let app = Router::new()
    .route("/v1/chat", post(handle_chat))
    .route("/v1/sessions", get(list_sessions))
    .route("/v1/sessions/:id", get(get_session))
    .route("/health", get(|| async { "ok" }))
    .layer(TraceLayer::new_for_http());
2.1 Threading Model
The system uses a hybrid async/sync architecture. The main event loop is synchronous for predictable TUI timing, while API calls and file I/O use Tokio's async runtime via spawn_blocking.
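The split can be sketched with the standard library alone: in hyle the background thread would host a Tokio runtime (using spawn_blocking for file I/O), while the synchronous side polls a channel on its render cadence. A plain thread stands in for the runtime here, and all names (`BgMsg`, `run_hybrid`) are illustrative.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

enum BgMsg {
    Done(String),
}

// The synchronous side: poll the channel without ever blocking the event loop.
fn run_hybrid() -> String {
    let (tx, rx) = mpsc::channel::<BgMsg>();

    // Background worker; in hyle this thread would drive a Tokio runtime.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(20)); // simulate slow I/O
        let _ = tx.send(BgMsg::Done("result".into()));
    });

    loop {
        match rx.try_recv() {
            Ok(BgMsg::Done(s)) => return s,
            // Nothing ready: sleep one 50ms tick, matching the TUI poll interval.
            Err(mpsc::TryRecvError::Empty) => thread::sleep(Duration::from_millis(50)),
            Err(mpsc::TryRecvError::Disconnected) => return String::new(),
        }
    }
}
```

Because the render loop only ever calls `try_recv`, a slow API response can never stall a frame; it simply arrives on a later tick.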
Context Management
Context is allocated across four tiers based on computed salience scores. The allocation adapts dynamically based on task phase and error state.
| Tier | Budget | Content | Compression |
|---|---|---|---|
| Focus | 40% | Current task, last tool results, errors | None |
| Recent | 30% | Last 2-3 exchanges, active decisions | Light |
| Summary | 20% | Older exchanges, key facts | Heavy |
| Background | 10% | Project structure, conventions | Minimal |
Definition 1: Salience Score
For a message m with age t (seconds since creation), the salience score S(m) is computed as:
S(m) = w_r · R(t) + w_e · E(m) + w_k · K(m) + w_f · F(m)

where:

- R(t) = exp(-t / τ): recency, exponential decay with τ = 300s
- E(m) = 1 if contains_error(m), else 0: error boost, errors are highly salient
- K(m) = |keywords(m) ∩ task|: keyword overlap with the current task
- F(m) = 1 if m references the currently focused files, else 0

Default weights: w_r = 0.4, w_e = 0.3, w_k = 0.2, w_f = 0.1.
Messages are sorted by salience score and allocated to tiers in descending order until each tier's token budget is exhausted.
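Definition 1 translates almost directly into code. The sketch below assumes a simplified `Message` with precomputed features; the field and helper names are illustrative, not hyle's actual types.

```rust
// Illustrative salience scoring; fields and names are assumptions.
struct Message {
    age_secs: f64,          // t: seconds since creation
    is_error: bool,         // E(m): contains a compiler/tool error
    keyword_overlap: usize, // K(m): |keywords(m) ∩ task|
    references_focus: bool, // F(m): touches currently focused files
}

const TAU: f64 = 300.0; // recency decay constant (seconds)
const W_R: f64 = 0.4;
const W_E: f64 = 0.3;
const W_K: f64 = 0.2;
const W_F: f64 = 0.1;

fn salience(m: &Message) -> f64 {
    let r = (-m.age_secs / TAU).exp();
    let e = if m.is_error { 1.0 } else { 0.0 };
    let k = m.keyword_overlap as f64;
    let f = if m.references_focus { 1.0 } else { 0.0 };
    W_R * r + W_E * e + W_K * k + W_F * f
}
```

With these weights, the motivating example from the introduction holds: an error from 30 seconds ago scores roughly 0.66, while a routine message from 10 minutes ago scores about 0.05.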
Algorithm 2: Tiered Compression
fn compress_context(messages: Vec<Message>, budget: usize) -> Vec<Message> {
    let mut output = Vec::new();
    let mut remaining = budget;

    // Sort by salience, highest first (salience is a float, so partial_cmp)
    let mut sorted = messages;
    sorted.sort_by(|a, b| {
        salience(b)
            .partial_cmp(&salience(a))
            .unwrap_or(std::cmp::Ordering::Equal)
    });

    for msg in sorted {
        let tier = assign_tier(&msg, &output);
        let compressed = match tier {
            Tier::Focus => msg.clone(), // No compression
            Tier::Recent => light_compress(&msg),
            Tier::Summary => summarize(&msg),
            Tier::Background => extract_facts(&msg),
        };
        let tokens = count_tokens(&compressed);
        if tokens <= remaining {
            output.push(compressed);
            remaining -= tokens;
        }
    }
    output
}
3.1 Compression Strategies
Each tier uses a different compression strategy optimized for its purpose:
- Light compression: Remove redundant whitespace, truncate large code blocks to first/last 10 lines with ellipsis
- Heavy compression: Use a free summarization model to extract key decisions and outcomes
- Fact extraction: Pattern-match for file paths, function names, and configuration values
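The truncation step of light compression is simple enough to show in full. This is a sketch of the first/last-N-lines strategy described above; the function name is illustrative.

```rust
// Keep the first and last `keep` lines of an oversized block and elide
// the middle, recording how many lines were dropped.
fn truncate_block(text: &str, keep: usize) -> String {
    let lines: Vec<&str> = text.lines().collect();
    if lines.len() <= 2 * keep {
        return text.to_string(); // small enough: leave untouched
    }
    let head = lines[..keep].join("\n");
    let tail = lines[lines.len() - keep..].join("\n");
    let elided = lines.len() - 2 * keep;
    format!("{head}\n… [{elided} lines elided] …\n{tail}")
}
```

Keeping both ends matters for code: the head usually carries the signature and imports, the tail the return value or error path.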
Safety Mechanisms
Given the autonomous nature of the system, safety is paramount. We implement defense in depth with multiple layers.
4.1 Command Blocklist
const BLOCKED_PATTERNS: &[&str] = &[
"rm -rf /", "rm -r /", "rm --recursive /",
":(){ :|:& };:", // fork bomb
"dd if=/dev/zero", // disk overwrite
"dd if=/dev/random",
"mkfs.", // filesystem format
"chmod -R 777 /", // permission disasters
"> /dev/sda", // direct disk write
"curl | sh", "wget | sh", // remote code execution
"curl | bash", "wget | bash",
];
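Vetting a candidate command can then be a substring scan over this list. The sketch below abbreviates the pattern list for self-containment and assumes no shell parsing or whitespace normalization, which a production checker would likely add.

```rust
// Abbreviated pattern list for the sketch; see the full list above.
const BLOCKED_PATTERNS: &[&str] = &[
    "rm -rf /",
    ":(){ :|:& };:", // fork bomb
    "mkfs.",         // filesystem format
];

// Reject a command if any blocked pattern appears anywhere in it.
fn is_blocked(command: &str) -> bool {
    BLOCKED_PATTERNS.iter().any(|p| command.contains(p))
}
```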
4.2 Atomic File Operations
All file writes follow an atomic protocol to prevent partial writes and enable recovery.
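A minimal sketch of such a protocol: write to a temporary sibling file, flush it to disk, then rename over the target. The helper name and temp-file naming scheme are illustrative.

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// A rename within the same directory is atomic on POSIX filesystems, so a
// reader sees either the old file or the new one, never a partial write.
fn atomic_write(path: &Path, content: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(content)?;
        f.sync_all()?; // flush to disk before the rename
    }
    fs::rename(&tmp, path) // atomically replace the target
}
```

If the process crashes mid-write, the worst case is a stale `.tmp` file beside an intact original, which is trivially recoverable.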
4.3 Loop Detection
The validator model monitors for stuck states by comparing recent tool calls. If the same operation is attempted 3+ times without progress, the system surfaces a clarifying question.
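The comparison itself can be sketched as a check over the trailing window of tool calls. Representing a tool call as its serialized string is an assumption made for the sketch; hyle's real comparison may be structural.

```rust
// Flag a stuck state when the last `threshold` tool calls are identical.
fn is_stuck(recent: &[String], threshold: usize) -> bool {
    if threshold < 2 || recent.len() < threshold {
        return false;
    }
    let tail = &recent[recent.len() - threshold..];
    // All adjacent pairs in the tail equal => every entry in the tail is equal.
    tail.windows(2).all(|w| w[0] == w[1])
}
```

A threshold of 3 matches the "3+ times without progress" rule above.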
Evaluation
| Metric | Value | Notes |
|---|---|---|
| Test coverage | 364 tests | Unit, integration, and property tests |
| Binary size | ~10MB | Release build, stripped |
| Startup time | <250ms | Cold start to interactive prompt |
| TUI refresh rate | 20Hz | 50ms polling interval |
| Memory usage | ~30MB | Idle, single session |
| Supported models | 35+ | Via OpenRouter |
5.1 Context Efficiency
In a 2-hour refactoring session involving 47 files, the salience-aware compression maintained task coherence while using only 38% of the naive full-context approach's token budget.