hyle: Salience-Aware Context Management for Autonomous Code Assistants
A Rust-native implementation with multi-tier compression and cognitive architecture
Introduction
Large language models excel at code generation but struggle with extended development sessions. Context windows, while expanding (from 4K to 200K+ tokens), remain finite. Previous approaches either truncate history aggressively—losing critical context about design decisions—or maintain full context at prohibitive cost.
hyle addresses this through salience-aware context management: a four-tier system that prioritizes recent, error-containing, and task-relevant content while compressing or discarding peripheral information. The key insight is that not all context is equally valuable: a compilation error from 30 seconds ago is more salient than a successful file read from 10 minutes ago.
This paper makes the following contributions:
- A formal model for salience scoring in code assistant contexts
- A multi-tier compression scheme with configurable budget allocation
- A cognitive architecture separating execution, summarization, and validation
- An open-source implementation with comprehensive safety mechanisms
System Architecture
The system is organized into six primary modules, each described below.
main.rs - Entry Point
Parses CLI arguments using clap, initializes logging, and dispatches to either TUI mode or HTTP server mode. Handles configuration loading from ~/.config/hyle/config.json and environment variable overrides.
pub fn main() -> Result<()> {
    let cli = Cli::parse();
    init_logging(cli.verbose)?;
    match cli.mode {
        Mode::Interactive => ui::run(cli.into())?,
        Mode::Server { port } => server::run(port)?,
        Mode::Once { prompt } => agent::run_once(&prompt)?,
    }
    Ok(())
}
ui.rs - Terminal User Interface
Implements a 20Hz event loop using ratatui. Non-blocking design polls keyboard input, background task completion, and API streaming simultaneously. Maintains render state separately from application state for flicker-free updates.
loop {
    // Poll at 50ms intervals (~20Hz)
    if event::poll(Duration::from_millis(50))? {
        handle_input(event::read()?)?;
    }
    // Check background tasks
    while let Ok(msg) = bg_rx.try_recv() {
        process_background_result(msg)?;
    }
    // Render current state
    terminal.draw(|f| render_ui(f, &state))?;
}
agent.rs - Execution Loop
Core agentic loop that parses tool calls from model output, executes them, and feeds results back. Implements the "think-act-observe" cycle with configurable iteration limits and stuck detection.
while iterations < MAX_ITERATIONS {
    let response = client.complete(&messages).await?;
    if let Some(tools) = parse_tool_calls(&response) {
        let results = execute_tools(tools).await?;
        messages.push(tool_results_message(results));
    } else {
        // No tool calls means the task is complete
        break;
    }
    iterations += 1;
}
tools.rs - Tool Implementations
Five core tools: read, write, patch, bash, and search. All file operations use atomic semantics. Bash commands are checked against a blocklist before execution.
pub enum Tool {
    Read { path: PathBuf },
    Write { path: PathBuf, content: String },
    Patch { path: PathBuf, search: String, replace: String },
    Bash { command: String, timeout: Option<Duration> },
    Search { pattern: String, path: Option<PathBuf> },
}
client.rs - API Client
Server-Sent Events (SSE) streaming client for OpenRouter API. Handles rate limiting with exponential backoff, automatic model fallback, and token counting for budget management.
pub async fn stream_completion(
    &self,
    messages: &[Message],
) -> Result<impl Stream<Item = Result<Event>>> {
    let req = self.build_request(messages)?;
    let stream = reqwest::Client::new()
        .post(&self.endpoint)
        .headers(self.headers())
        .json(&req)
        .send()
        .await?
        .bytes_stream()
        .map(parse_sse_event);
    Ok(stream)
}
server.rs - HTTP API
Optional REST API for IDE integrations. Exposes endpoints for session management, prompt submission, and status queries. Uses axum with tower middleware for request logging.
let app = Router::new()
    .route("/v1/chat", post(handle_chat))
    .route("/v1/sessions", get(list_sessions))
    .route("/v1/sessions/:id", get(get_session))
    .route("/health", get(|| async { "ok" }))
    .layer(TraceLayer::new_for_http());
2.1 Threading Model
The system uses a hybrid async/sync architecture. The main event loop is synchronous for predictable TUI timing, while API calls and file I/O use Tokio's async runtime via spawn_blocking.
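The split can be sketched with the standard library alone: in hyle the background thread would host a Tokio runtime (using spawn_blocking for file I/O), while the synchronous side polls a channel on its render cadence. A plain thread stands in for the runtime here, and all names (`BgMsg`, `run_hybrid`) are illustrative.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

enum BgMsg {
    Done(String),
}

// The synchronous side: poll the channel without ever blocking the event loop.
fn run_hybrid() -> String {
    let (tx, rx) = mpsc::channel::<BgMsg>();

    // Background worker; in hyle this thread would drive a Tokio runtime.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(20)); // simulate slow I/O
        let _ = tx.send(BgMsg::Done("result".into()));
    });

    loop {
        match rx.try_recv() {
            Ok(BgMsg::Done(s)) => return s,
            // Nothing ready: sleep one 50ms tick, matching the TUI poll interval.
            Err(mpsc::TryRecvError::Empty) => thread::sleep(Duration::from_millis(50)),
            Err(mpsc::TryRecvError::Disconnected) => return String::new(),
        }
    }
}
```

Because the render loop only ever calls `try_recv`, a slow API response can never stall a frame; it simply arrives on a later tick.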
Context Management
Context is allocated across four tiers based on computed salience scores. The allocation adapts dynamically based on task phase and error state.
| Tier | Budget | Content | Compression |
|---|---|---|---|
| Focus | 40% | Current task, last tool results, errors | None |
| Recent | 30% | Last 2-3 exchanges, active decisions | Light |
| Summary | 20% | Older exchanges, key facts | Heavy |
| Background | 10% | Project structure, conventions | Minimal |
Definition 1: Salience Score
For a message m with age t (seconds since creation), the salience score S(m) is computed as:
S(m) = w_r · R(t) + w_e · E(m) + w_k · K(m) + w_f · F(m)

where:

- R(t) = exp(-t / τ): recency, exponential decay with τ = 300s
- E(m) = 1 if contains_error(m), else 0: error boost, errors are highly salient
- K(m) = |keywords(m) ∩ task|: keyword overlap with the current task
- F(m) = 1 if m references the currently focused files, else 0

Default weights: w_r = 0.4, w_e = 0.3, w_k = 0.2, w_f = 0.1.
Messages are sorted by salience score and allocated to tiers in descending order until each tier's token budget is exhausted.
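Definition 1 translates almost directly into code. The sketch below assumes a simplified `Message` with precomputed features; the field and helper names are illustrative, not hyle's actual types.

```rust
// Illustrative salience scoring; fields and names are assumptions.
struct Message {
    age_secs: f64,          // t: seconds since creation
    is_error: bool,         // E(m): contains a compiler/tool error
    keyword_overlap: usize, // K(m): |keywords(m) ∩ task|
    references_focus: bool, // F(m): touches currently focused files
}

const TAU: f64 = 300.0; // recency decay constant (seconds)
const W_R: f64 = 0.4;
const W_E: f64 = 0.3;
const W_K: f64 = 0.2;
const W_F: f64 = 0.1;

fn salience(m: &Message) -> f64 {
    let r = (-m.age_secs / TAU).exp();
    let e = if m.is_error { 1.0 } else { 0.0 };
    let k = m.keyword_overlap as f64;
    let f = if m.references_focus { 1.0 } else { 0.0 };
    W_R * r + W_E * e + W_K * k + W_F * f
}
```

With these weights, the motivating example from the introduction holds: an error from 30 seconds ago scores roughly 0.66, while a routine message from 10 minutes ago scores about 0.05.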
Algorithm 2: Tiered Compression
fn compress_context(messages: Vec<Message>, budget: usize) -> Vec<Message> {
    let mut output = Vec::new();
    let mut remaining = budget;

    // Sort by salience, highest first (salience is a float, so partial_cmp)
    let mut sorted = messages;
    sorted.sort_by(|a, b| {
        salience(b)
            .partial_cmp(&salience(a))
            .unwrap_or(std::cmp::Ordering::Equal)
    });

    for msg in sorted {
        let tier = assign_tier(&msg, &output);
        let compressed = match tier {
            Tier::Focus => msg.clone(), // No compression
            Tier::Recent => light_compress(&msg),
            Tier::Summary => summarize(&msg),
            Tier::Background => extract_facts(&msg),
        };
        let tokens = count_tokens(&compressed);
        if tokens <= remaining {
            output.push(compressed);
            remaining -= tokens;
        }
    }
    output
}
3.1 Compression Strategies
Each tier uses a different compression strategy optimized for its purpose:
- Light compression: Remove redundant whitespace, truncate large code blocks to first/last 10 lines with ellipsis
- Heavy compression: Use a free summarization model to extract key decisions and outcomes
- Fact extraction: Pattern-match for file paths, function names, and configuration values
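The truncation step of light compression is simple enough to show in full. This is a sketch of the first/last-N-lines strategy described above; the function name is illustrative.

```rust
// Keep the first and last `keep` lines of an oversized block and elide
// the middle, recording how many lines were dropped.
fn truncate_block(text: &str, keep: usize) -> String {
    let lines: Vec<&str> = text.lines().collect();
    if lines.len() <= 2 * keep {
        return text.to_string(); // small enough: leave untouched
    }
    let head = lines[..keep].join("\n");
    let tail = lines[lines.len() - keep..].join("\n");
    let elided = lines.len() - 2 * keep;
    format!("{head}\n… [{elided} lines elided] …\n{tail}")
}
```

Keeping both ends matters for code: the head usually carries the signature and imports, the tail the return value or error path.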
Safety Mechanisms
Given the autonomous nature of the system, safety is paramount. We implement defense in depth with multiple layers.
4.1 Command Blocklist
const BLOCKED_PATTERNS: &[&str] = &[
"rm -rf /", "rm -r /", "rm --recursive /",
":(){ :|:& };:", // fork bomb
"dd if=/dev/zero", // disk overwrite
"dd if=/dev/random",
"mkfs.", // filesystem format
"chmod -R 777 /", // permission disasters
"> /dev/sda", // direct disk write
"curl | sh", "wget | sh", // remote code execution
"curl | bash", "wget | bash",
];
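Vetting a candidate command can then be a substring scan over this list. The sketch below abbreviates the pattern list for self-containment and assumes no shell parsing or whitespace normalization, which a production checker would likely add.

```rust
// Abbreviated pattern list for the sketch; see the full list above.
const BLOCKED_PATTERNS: &[&str] = &[
    "rm -rf /",
    ":(){ :|:& };:", // fork bomb
    "mkfs.",         // filesystem format
];

// Reject a command if any blocked pattern appears anywhere in it.
fn is_blocked(command: &str) -> bool {
    BLOCKED_PATTERNS.iter().any(|p| command.contains(p))
}
```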
4.2 Atomic File Operations
All file writes follow an atomic protocol to prevent partial writes and enable recovery.
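A minimal sketch of such a protocol: write to a temporary sibling file, flush it to disk, then rename over the target. The helper name and temp-file naming scheme are illustrative.

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// A rename within the same directory is atomic on POSIX filesystems, so a
// reader sees either the old file or the new one, never a partial write.
fn atomic_write(path: &Path, content: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(content)?;
        f.sync_all()?; // flush to disk before the rename
    }
    fs::rename(&tmp, path) // atomically replace the target
}
```

If the process crashes mid-write, the worst case is a stale `.tmp` file beside an intact original, which is trivially recoverable.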
4.3 Loop Detection
The validator model monitors for stuck states by comparing recent tool calls. If the same operation is attempted 3+ times without progress, the system surfaces a clarifying question.
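The comparison itself can be sketched as a check over the trailing window of tool calls. Representing a tool call as its serialized string is an assumption made for the sketch; hyle's real comparison may be structural.

```rust
// Flag a stuck state when the last `threshold` tool calls are identical.
fn is_stuck(recent: &[String], threshold: usize) -> bool {
    if threshold < 2 || recent.len() < threshold {
        return false;
    }
    let tail = &recent[recent.len() - threshold..];
    // All adjacent pairs in the tail equal => every entry in the tail is equal.
    tail.windows(2).all(|w| w[0] == w[1])
}
```

A threshold of 3 matches the "3+ times without progress" rule above.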
Evaluation
| Metric | Value | Notes |
|---|---|---|
| Test coverage | 364 tests | Unit, integration, and property tests |
| Binary size | ~10MB | Release build, stripped |
| Startup time | <250ms | Cold start to interactive prompt |
| TUI refresh rate | 20Hz | 50ms polling interval |
| Memory usage | ~30MB | Idle, single session |
| Supported models | 35+ | Via OpenRouter |
5.1 Context Efficiency
In a 2-hour refactoring session involving 47 files, the salience-aware compression maintained task coherence while using only 38% of the naive full-context approach's token budget.