Observability: Tracing and Metrics in the Hive

This document describes the OpenTelemetry (OTel) tracing and Prometheus metrics implementation for the Aura Platform.

Overview

The Aura Platform now includes full-stack observability across:

  • API Gateway (FastAPI)

  • Core Service (gRPC + SQLAlchemy + LangChain)

  • Database (PostgreSQL via SQLAlchemy)

  • LLM Calls (Mistral AI via LangChain)

Architecture

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  API Client │───▶│ API Gateway │───▶│ Core Service│
│             │    │  (FastAPI)  │    │   (gRPC)    │
└─────────────┘    └─────────────┘    └──────┬──────┘
                                             │
                               ┌─────────────┴─────────────┐
                               ▼                           ▼
                        ┌─────────────┐             ┌─────────────┐
                        │  Database   │             │  Mistral AI │
                        │ (PostgreSQL)│             │ (LangChain) │
                        └─────────────┘             └─────────────┘

All components export traces to Jaeger via OTLP (the OpenTelemetry Protocol).

Metrics Architecture (The Hive Vital Signs)

In addition to tracing, the hive monitors its vital signs using Prometheus.

The Core Service exposes metrics such as CPU usage, memory consumption, and caching health, which are scraped by Prometheus and used by the GetSystemStatus gRPC method.
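
A minimal sketch of how these vital signs could be published with prometheus_client is shown below. The metric names follow the Key Metrics list later in this document; the use of psutil, the update helper, and the scrape port are assumptions, not the project's actual code.

    import psutil                                   # assumed helper for CPU/memory sampling
    from prometheus_client import Gauge, start_http_server

    # Gauges named after the vital signs listed under "Key Metrics" below.
    cpu_usage_percent = Gauge("cpu_usage_percent", "Average CPU load of the Core Service")
    memory_usage_mb = Gauge("memory_usage_mb", "Resident memory of the Core Service in MB")
    cache_hit_rate = Gauge("cache_hit_rate", "Hit rate of the Redis semantic cache")
    negotiation_count = Gauge("negotiation_count", "Number of active negotiations")

    def sample_vital_signs() -> None:
        """Refresh the process-level gauges; called periodically or from GetSystemStatus."""
        cpu_usage_percent.set(psutil.cpu_percent(interval=None))
        memory_usage_mb.set(psutil.Process().memory_info().rss / 1_000_000)

    start_http_server(9100)  # assumed port for the Prometheus scrape endpoint (/metrics)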

Configuration

Environment Variables

Both services use these environment variables:
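
The concrete values are set per service in compose.yml; a representative sketch (the jaeger hostname and the exact service names are assumptions):

    # OTLP endpoint that traces are exported to (gRPC, port 4317)
    OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
    # Logical service name shown in Jaeger (api-gateway or core-service)
    OTEL_SERVICE_NAME=api-gateway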

Docker Compose

The compose.yml file already includes the Jaeger service and proper environment variables:
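
For reference, the Jaeger entry in such a compose file typically looks like the sketch below; compose.yml itself is authoritative, and the image tag and port mappings here are assumptions.

    jaeger:
      image: jaegertracing/all-in-one:latest
      environment:
        - COLLECTOR_OTLP_ENABLED=true   # enables OTLP ingest on older Jaeger versions
      ports:
        - "16686:16686"   # Jaeger UI
        - "4317:4317"     # OTLP over gRPC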

Implementation Details

Telemetry Initialization

Both services initialize OpenTelemetry in their main.py files:
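
The exact code lives in the two main.py files; the sketch below shows the usual shape of such an initializer, assuming the standard OpenTelemetry SDK and the OTLP/gRPC exporter (function and variable names are illustrative).

    import os

    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    def init_telemetry(service_name: str) -> None:
        """Configure tracing for one service; fall back to console export on failure."""
        provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
        try:
            exporter = OTLPSpanExporter(
                endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://jaeger:4317")
            )
        except Exception:
            exporter = ConsoleSpanExporter()  # matches the documented console fallback
        provider.add_span_processor(BatchSpanProcessor(exporter))
        trace.set_tracer_provider(provider)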

Instrumentation

API Gateway (api-gateway/src/main.py)

  • FastAPI Instrumentation: Automatic tracing of HTTP requests

  • gRPC Client Instrumentation: Distributed tracing context propagation
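
Wiring both of these up takes only a few lines; the sketch below assumes the stock opentelemetry-instrumentation-fastapi and opentelemetry-instrumentation-grpc packages rather than the project's exact code.

    from fastapi import FastAPI
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
    from opentelemetry.instrumentation.grpc import GrpcInstrumentorClient

    app = FastAPI()

    FastAPIInstrumentor.instrument_app(app)  # spans for every incoming HTTP request
    GrpcInstrumentorClient().instrument()    # injects trace context into outgoing gRPC calls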

Core Service (core-service/src/main.py)

  • gRPC Server Instrumentation: Automatic tracing of gRPC methods

  • SQLAlchemy Instrumentation: Database query tracing

  • LangChain Instrumentation: LLM call tracing
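
As a sketch, the first two instrumentations look like the snippet below (the database URL is a placeholder). The LangChain instrumentation is registered the same way through its own instrumentor package, whose exact name is project-specific and not shown here.

    from opentelemetry.instrumentation.grpc import GrpcInstrumentorServer
    from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:pass@db/aura")  # placeholder DSN

    GrpcInstrumentorServer().instrument()               # spans for every incoming gRPC method
    SQLAlchemyInstrumentor().instrument(engine=engine)  # spans for every SQL statement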

Logging Correlation

Both services include OpenTelemetry context in structlog output:
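
A minimal sketch of the kind of processor that achieves this is shown below; the processor and field names are illustrative, not the project's exact implementation.

    import structlog
    from opentelemetry import trace

    def add_otel_context(logger, method_name, event_dict):
        """Copy the active span's IDs into every log event when a valid span context exists."""
        ctx = trace.get_current_span().get_span_context()
        if ctx.is_valid:
            event_dict["trace_id"] = format(ctx.trace_id, "032x")
            event_dict["span_id"] = format(ctx.span_id, "016x")
        return event_dict

    structlog.configure(
        processors=[
            add_otel_context,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.JSONRenderer(),
        ]
    )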

Example log output:
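
With a processor like the one above, a correlated log line looks roughly like this (all values are placeholders):

    {"event": "negotiation_started", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7", "timestamp": "2024-01-01T00:00:00Z"}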

Usage

Running with Docker
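
With the provided compose.yml, the whole stack, including Jaeger, comes up with a single command (add --build after changing service code):

    docker compose up -d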

Accessing Jaeger UI

After starting the services, open the Jaeger UI in a browser; with the default Jaeger all-in-one port mapping it is served at http://localhost:16686.

Expected Traces

You should see traces for:

  1. API Gateway Requests: /v1/search, /v1/negotiate

  2. gRPC Calls: NegotiationService.Negotiate, NegotiationService.Search

  3. Database Queries: SQLAlchemy queries to PostgreSQL

  4. LLM Calls: Mistral AI inference calls made through LangChain

Key Metrics

The following vital signs are monitored:

  • cpu_usage_percent: Average CPU load across the hive cells.

  • memory_usage_mb: Average memory footprint of the Core Engine.

  • cache_hit_rate: Success rate of the Semantic Nectar (Redis).

  • negotiation_count: Number of active economic interactions.

Testing

Run the telemetry test:

This will:

  1. Test API Gateway endpoints

  2. Generate traces for analysis

  3. Provide instructions for viewing traces in Jaeger
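
If you prefer to generate traces by hand instead of running the test script, a couple of requests against the gateway endpoints are enough; the sketch below assumes the gateway listens on localhost:8000, and the HTTP method and request bodies are placeholders.

    import requests

    BASE = "http://localhost:8000"  # assumed gateway address

    requests.post(f"{BASE}/v1/search", json={"query": "example search"})    # placeholder body
    requests.post(f"{BASE}/v1/negotiate", json={"offer": "example offer"})  # placeholder body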

Troubleshooting

No Traces Appearing

  1. Check Jaeger is running: docker ps | grep jaeger

  2. Verify OTLP endpoint: Ensure OTEL_EXPORTER_OTLP_ENDPOINT points to the correct Jaeger instance

  3. Check service logs: Look for telemetry initialization messages

  4. Verify network connectivity: Services should be able to reach Jaeger on port 4317

Common Issues

  • Port conflicts: Ensure port 4317 is not used by other services

  • Environment variables: Verify both services have proper OTel environment variables

  • Dependency conflicts: Ensure all OpenTelemetry packages are compatible versions

  • Fallback behavior: If OTLP fails, traces are logged to console (check service logs)

Debugging Commands
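
A few commands that cover the checks above (the compose service names are assumptions):

    # Confirm Jaeger is up and exposing the OTLP port
    docker ps | grep jaeger

    # Watch for the telemetry initialization messages at startup
    docker compose logs -f api-gateway core-service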

Error Handling

The implementation includes robust error handling:

  • Fallback to console logging if OTLP export fails

  • Graceful degradation if OpenTelemetry initialization fails

  • Input validation for configuration settings

  • Safe logging context that doesn't break if OTel is unavailable

Performance Optimization

Sampling Configuration

For high-volume environments, consider adding sampling:
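
A minimal sketch using the SDK's built-in samplers, assuming a parent-based ratio sampler and an example rate of 10%:

    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

    # Keep roughly 10% of root traces; child spans follow their parent's decision.
    provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))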

Batch Processor Tuning

Adjust batch processor settings for your workload:
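
The knobs live on BatchSpanProcessor; the values below are the SDK defaults, shown only as a starting point, and the console exporter is a stand-in for the OTLP exporter created at startup.

    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    exporter = ConsoleSpanExporter()  # stand-in; the services use the OTLP exporter here

    processor = BatchSpanProcessor(
        exporter,
        max_queue_size=2048,          # spans buffered in memory before dropping
        schedule_delay_millis=5000,   # how often a batch is flushed
        max_export_batch_size=512,    # spans per export call
    )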

Resource Usage Monitoring

Monitor OpenTelemetry resource usage:
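
A simple approach is to compare container resource usage with tracing enabled and disabled, for example with:

    docker stats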

Advanced Configuration

Custom Span Attributes

Add business context to traces:
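
For example (the span name and attribute keys below are hypothetical, not part of the current implementation):

    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("negotiate") as span:
        span.set_attribute("negotiation.offer_count", 3)            # hypothetical attribute
        span.set_attribute("negotiation.strategy", "tit-for-tat")   # hypothetical attribute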

Context Propagation

Manual context propagation for async tasks:
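
A sketch of carrying the caller's trace context into a background asyncio task (names are illustrative):

    import asyncio

    from opentelemetry import context, trace

    tracer = trace.get_tracer(__name__)

    async def background_work(parent_ctx) -> None:
        token = context.attach(parent_ctx)  # make the caller's context current in this task
        try:
            with tracer.start_as_current_span("background-work"):
                await asyncio.sleep(0)      # placeholder for real work
        finally:
            context.detach(token)

    async def handler() -> None:
        parent_ctx = context.get_current()  # capture context before handing off to the task
        await asyncio.create_task(background_work(parent_ctx))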

Security Considerations

Sensitive Data

Avoid logging sensitive data in span attributes:
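
Prefer opaque identifiers over payloads; the helper below is a hypothetical illustration:

    from opentelemetry import trace

    def tag_current_user(user_id: str) -> None:
        span = trace.get_current_span()
        span.set_attribute("user.id", user_id)        # OK: opaque identifier
        # Avoid attributes that put PII or prompts into trace storage, e.g.:
        # span.set_attribute("user.email", email)     # avoid
        # span.set_attribute("llm.prompt", prompt)    # avoid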

Network Security

  • Ensure OTLP endpoint uses TLS in production

  • Restrict Jaeger UI access to authorized personnel

  • Consider network policies for inter-service communication

Migration Guide

From No Tracing to OpenTelemetry

  1. Start with basic instrumentation (current implementation)

  2. Add custom spans for critical business operations

  3. Implement sampling for high-volume endpoints

  4. Add metrics for performance monitoring

  5. Set up alerts based on trace patterns

Upgrading OpenTelemetry Versions

When upgrading, move all opentelemetry-* packages (API, SDK, exporter, and instrumentations) to compatible versions in one step; mixing versions is the most common source of the dependency conflicts noted above.

Dependencies

The following OpenTelemetry packages are used:
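
The authoritative pins are in each service's dependency file; based on the instrumentation described above, the set likely includes:

    opentelemetry-api
    opentelemetry-sdk
    opentelemetry-exporter-otlp
    opentelemetry-instrumentation-fastapi
    opentelemetry-instrumentation-grpc
    opentelemetry-instrumentation-sqlalchemy

plus a LangChain instrumentation package, which is distributed by third parties rather than the core OpenTelemetry project.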

Performance Considerations

  • Batch processing: Traces are batched before export to reduce network overhead

  • Sampling: Consider adding sampling for high-volume environments

  • Resource usage: OpenTelemetry adds minimal overhead to request processing

Future Enhancements

  • Add metrics collection alongside tracing

  • Implement custom span attributes for business context

  • Add health checks and monitoring for telemetry pipeline

  • Consider adding baggage for additional context propagation
