Observability

Lithos v0.1.8 introduced comprehensive observability: structured JSON logging, OpenTelemetry tracing, Prometheus-compatible metrics, and read audit logging. v0.2.1 extends this with full LCMA pipeline metrics coverage and the --telemetry-console developer shortcut.


Structured Logging

All log output is structured JSON by default. Set the log level via the LITHOS_LOG_LEVEL environment variable:

LITHOS_LOG_LEVEL=info   # default
LITHOS_LOG_LEVEL=debug  # includes link resolution + slug computation traces
LITHOS_LOG_LEVEL=warn

Example log entry:

{
  "timestamp": "2026-04-11T06:00:00.123Z",
  "level": "INFO",
  "logger": "lithos.knowledge",
  "message": "document written",
  "document_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
  "agent": "research-agent",
  "trace_id": "1234abcd",
  "span_id": "5678ef01"
}

trace_id and span_id are populated automatically when OTEL tracing is enabled (see below), enabling correlation between traces and log lines in tools like Grafana Loki.
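To illustrate the correlation, here is a minimal sketch that groups log lines by trace_id, assuming each line is one JSON object shaped like the example entry above. The helper name group_by_trace and the second sample message are hypothetical, not part of Lithos.

```python
import json
from collections import defaultdict


def group_by_trace(lines):
    """Group parsed log entries by trace_id.

    Lines that are not JSON objects, and entries without a trace_id
    (for example, logs written while tracing is disabled), are skipped.
    """
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output
        if not isinstance(entry, dict):
            continue
        trace_id = entry.get("trace_id")
        if trace_id:
            grouped[trace_id].append(entry)
    return dict(grouped)


# Sample lines; only the first mirrors the documented example entry.
sample = [
    '{"level": "INFO", "message": "document written", "trace_id": "1234abcd", "span_id": "5678ef01"}',
    '{"level": "INFO", "message": "index updated", "trace_id": "1234abcd", "span_id": "9abc0002"}',
    '{"level": "INFO", "message": "server started"}',
    "not json",
]

for trace_id, entries in group_by_trace(sample).items():
    print(trace_id, [e["message"] for e in entries])
# prints: 1234abcd ['document written', 'index updated']
```

In practice a log backend such as Loki performs this join for you; the sketch only shows what the two fields make possible.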

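In Docker, configure the container's logging driver so your log aggregator can collect the output. The example below uses the json-file driver and caps log size at five files of 50 MB each: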

# docker-compose.override.yml
services:
  lithos:
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "5"
    environment:
      LITHOS_LOG_LEVEL: info

OpenTelemetry Tracing

Lithos uses a push-only OTEL model: all telemetry (traces, metrics, logs) is exported via OTLP to an external collector. There is no /metrics scrape endpoint — metrics flow through the OTLP pipeline to your collector (e.g. OpenTelemetry Collector → Prometheus remote-write).

To enable export to an OTEL collector:

OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_SERVICE_NAME=lithos

Per-signal overrides are supported:

OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://tempo:4318
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://otel-collector:4318
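The sketch below shows how these variables are usually resolved, assuming Lithos follows the standard OpenTelemetry OTLP/HTTP exporter semantics (this page does not confirm that). Under those semantics a per-signal variable wins over the generic one and is used verbatim, while the generic endpoint is treated as a base URL to which a signal path such as v1/traces is appended. The function name resolve_otlp_http_endpoint is hypothetical.

```python
import os

# Signal paths appended to the generic base URL for OTLP over HTTP.
SIGNAL_PATHS = {"traces": "v1/traces", "metrics": "v1/metrics", "logs": "v1/logs"}


def resolve_otlp_http_endpoint(signal, env=None):
    """Return the URL an OTLP/HTTP exporter would use for one signal.

    Assumes standard OpenTelemetry behaviour; Lithos may differ.
    """
    env = os.environ if env is None else env
    specific = env.get(f"OTEL_EXPORTER_OTLP_{signal.upper()}_ENDPOINT")
    if specific:
        return specific  # per-signal URL is used as-is, no path appended
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if not base:
        return None  # nothing configured for this signal
    return base.rstrip("/") + "/" + SIGNAL_PATHS[signal]


# Values taken from the examples above.
example_env = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4318",
    "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": "http://tempo:4318",
}

for signal in ("traces", "metrics", "logs"):
    print(signal, "->", resolve_otlp_http_endpoint(signal, example_env))
```

If the standard semantics apply, a per-signal URL for OTLP/HTTP normally needs its full path (for example, ending in /v1/traces), so it is worth checking whether your backend accepts the bare host and port.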

Or in Docker:

# docker-compose.override.yml
services:
  lithos:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4318
      OTEL_SERVICE_NAME: lithos
      LITHOS_LOG_LEVEL: info

Spans are created for:

- Each MCP tool call (e.g., lithos_write, lithos_search, lithos_retrieve)
- LCMA pipeline stages (scouts, rerank, consolidation)
- Embedding operations
- Index writes and queries
- Graph traversal

Developer shortcut: --telemetry-console

For local debugging without a collector, use the --telemetry-console flag to stream spans and metrics to stdout:

lithos serve --transport sse --port 8765 --telemetry-console

This enables in-process OTEL console exporters — no OTEL_EXPORTER_OTLP_ENDPOINT required.

Grafana stack

If you run Grafana + Tempo + Loki + Prometheus, Lithos integrates with the full stack via the OTLP push path:

- Traces → Tempo via OTLP
- Logs → Loki (structured JSON, with trace correlation)
- Metrics → OpenTelemetry Collector → Prometheus remote-write (or direct OTLP receiver)


Metrics

Lithos emits Prometheus-compatible metrics via OTLP push — metrics are exported to your OTEL collector, which forwards them to Prometheus (or another metrics backend). There is no direct /metrics scrape endpoint.

Available Metrics

| Metric | Type | Description |
|---|---|---|
| lithos_knowledge_write_duration_ms | Histogram | Write operation latency (ms) |
| lithos_documents_total | Gauge | Total documents in knowledge base |
| lithos_chunks_total | Gauge | Total embedding chunks in ChromaDB |
| lithos_agents_total | Gauge | Total registered agents |
| lithos_open_tasks_total | Gauge | Current open coordination tasks |
| lithos_startup_duration_seconds | Histogram | Server startup duration (s) |
| lithos_file_watcher_events_total | Counter | File system events processed |
| lithos_event_bus_subscriber_drops_total | Counter | Event bus messages dropped (slow subscriber) |
| lithos_event_bus_buffer_utilisation | Gauge | Event bus buffer fill fraction (0–1) |
| lithos_tool_calls_total | Counter | MCP tool calls (labelled by tool name) |
| lithos_tool_errors_total | Counter | MCP tool errors (labelled by tool name) |
| lithos_sse_active_clients | Gauge | Currently connected SSE clients |
| lithos_sse_events_delivered_total | Counter | Total SSE events delivered |
| lithos_lcma_retrieve_duration_ms | Histogram | LCMA lithos_retrieve end-to-end latency (ms) |
| lithos_lcma_scout_hits_total | Counter | LCMA scout results (labelled by scout type) |
| lithos_lcma_rerank_duration_ms | Histogram | LCMA rerank stage latency (ms) |

Collector → Prometheus setup

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]

Grafana Dashboard

Example PromQL queries once metrics are flowing via OTLP:

# Write latency P99
histogram_quantile(0.99, sum(rate(lithos_knowledge_write_duration_ms_bucket[5m])) by (le))

# Tool call rate
sum(rate(lithos_tool_calls_total[1m])) by (tool)

# Error rate per tool
sum(rate(lithos_tool_errors_total[1m])) by (tool)

# SSE clients
lithos_sse_active_clients

# Event bus health (drops indicate slow subscribers)
rate(lithos_event_bus_subscriber_drops_total[5m])

# LCMA retrieve latency P95
histogram_quantile(0.95, sum(rate(lithos_lcma_retrieve_duration_ms_bucket[5m])) by (le))
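To clarify what the histogram_quantile queries above estimate, here is a simplified Python sketch of the same idea: find the bucket containing the requested rank, then interpolate linearly inside it. It works on raw cumulative counts rather than per-second rates, and the bucket boundaries and counts are invented for illustration; they are not Lithos defaults.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with math.inf as the last bound. A simplified sketch of the
    PromQL function, not a full reimplementation.
    """
    if not buckets or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            if count == prev_count:
                return prev_bound  # empty bucket, nothing to interpolate
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical write-latency buckets in ms: (le, cumulative count).
write_latency = [(5, 10), (10, 60), (25, 90), (50, 99), (100, 100), (math.inf, 100)]

print(histogram_quantile(0.5, write_latency))   # 9.0
print(histogram_quantile(0.75, write_latency))  # 17.5
```

Because the result is interpolated between bucket bounds, its accuracy depends on how finely the buckets are spaced around the quantile of interest. The unit of the result is the unit of the bounds, which is milliseconds for the _ms histograms.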

Read Audit Log

Every lithos_read call is appended to an audit log at <data_dir>/.lithos/read_audit.jsonl.

Example entry:

{"timestamp": "2026-04-11T06:00:00Z", "document_id": "f47ac10b-...", "agent": "research-agent", "path": "python-asyncio-gather-patterns.md"}

The audit log is append-only and never rotated by Lithos itself. Use standard log rotation (logrotate, cron) if you need size management.

To tail the audit log in real time:

tail -f /path/to/data/.lithos/read_audit.jsonl | jq .
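For offline analysis, a short Python sketch can summarise the same file. It assumes each line is one JSON object with the fields shown in the example entry; the function name reads_per_agent and the second agent name are hypothetical.

```python
import json
from collections import Counter


def reads_per_agent(lines):
    """Count audit entries per agent, skipping blank or malformed lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written last line
        if isinstance(entry, dict):
            counts[entry.get("agent", "unknown")] += 1
    return counts


# Sample lines; the first is the documented example entry.
sample = [
    '{"timestamp": "2026-04-11T06:00:00Z", "document_id": "f47ac10b-...", "agent": "research-agent", "path": "python-asyncio-gather-patterns.md"}',
    '{"timestamp": "2026-04-11T06:00:05Z", "document_id": "f47ac10b-...", "agent": "research-agent", "path": "python-asyncio-gather-patterns.md"}',
    '{"timestamp": "2026-04-11T06:01:00Z", "document_id": "f47ac10b-...", "agent": "review-agent", "path": "python-asyncio-gather-patterns.md"}',
]

for agent, count in reads_per_agent(sample).most_common():
    print(agent, count)

# Against a real file (the path is a placeholder):
# with open("/path/to/data/.lithos/read_audit.jsonl") as f:
#     print(reads_per_agent(f))
```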

lithos_stats — Health Indicators

Since v0.1.8, lithos_stats returns a health block alongside its statistics:

{
  "documents": 142,
  "chunks": 1893,
  "agents": 5,
  "active_tasks": 3,
  "open_claims": 2,
  "tags": 48,
  "duplicate_urls": 0,
  "health": {
    "overall": "pass",
    "index": {"status": "pass", "detail": "Tantivy index OK"},
    "embedding": {"status": "pass", "detail": "ChromaDB OK, model loaded"},
    "coordination": {"status": "pass", "detail": "3 active tasks, 2 open claims"}
  }
}

Health status values: "pass" | "warn" | "fail".

Use this from agents to verify the server is fully operational before starting a long batch:

stats = lithos_stats()
health = stats.get("health", {})  # absent on servers older than v0.1.8
if health.get("overall") != "pass":
    raise RuntimeError(f"Lithos not healthy: {health}")
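If the server may still be warming up, a retry loop can be more forgiving than a single check. The sketch below assumes you can pass in a callable that returns the lithos_stats payload shown above; the helpers wait_until_healthy and failing_components are hypothetical and not part of Lithos, and the sample responses are made up.

```python
import time


def failing_components(stats):
    """Return the health components whose status is not "pass"."""
    health = stats.get("health") or {}
    return {
        name: component
        for name, component in health.items()
        if isinstance(component, dict) and component.get("status") != "pass"
    }


def wait_until_healthy(get_stats, attempts=5, delay_s=2.0, sleep=time.sleep):
    """Poll get_stats() until health.overall is "pass", else raise RuntimeError."""
    last = {}
    for attempt in range(attempts):
        last = get_stats()
        if (last.get("health") or {}).get("overall") == "pass":
            return last
        if attempt < attempts - 1:
            sleep(delay_s)  # wait before the next poll
    raise RuntimeError(
        f"Lithos not healthy after {attempts} attempts: {failing_components(last)}"
    )


# Simulated responses standing in for real lithos_stats() calls.
responses = iter([
    {"health": {"overall": "warn", "embedding": {"status": "warn", "detail": "model loading"}}},
    {"health": {"overall": "pass"}},
])

stats = wait_until_healthy(lambda: next(responses), attempts=3, sleep=lambda s: None)
print(stats["health"]["overall"])  # pass
```

Whether "warn" should block a batch is a policy choice; this sketch treats anything other than "pass" as not ready, matching the single check above.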