Health Check Implementation

This document describes the health check endpoints and probes implemented in the Aura platform.

Overview

Health checks are implemented at three levels:

  1. API Gateway - HTTP health endpoints

  2. Core Service - gRPC Health Checking Protocol

  3. Docker Compose & Kubernetes - Container orchestration probes

API Gateway Health Endpoints

/healthz - Liveness Probe

Purpose: Simple alive check - is the process running?

Response:

{
  "status": "ok"
}

HTTP Status: 200 OK

Use case: Kubernetes liveness probe - restart if failing


/readyz - Readiness Probe

Purpose: Full dependency check - is the service ready to handle traffic?

Response (healthy):

Response (unhealthy):

HTTP Status:

  • 200 OK if ready

  • 503 Service Unavailable if not ready

Use case: Kubernetes readiness probe - remove from load balancer if failing


/health - Detailed Status

Purpose: Human-readable health information with component status

Response:

HTTP Status: 200 OK

Use case: Monitoring dashboards, debugging, status pages
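The three gateway endpoints described above can be sketched as one pure routing function. This is a minimal illustration, not the gateway's actual code: the `handle` and `run_checks` names, the `checks` mapping, and the body shapes for /readyz and /health are assumptions. Only the paths, the /healthz body, and the status codes come from this document.

```python
from typing import Callable, Dict, Tuple

# Hypothetical dependency checks: name -> callable returning True when healthy.
Checks = Dict[str, Callable[[], bool]]


def run_checks(checks: Checks) -> Dict[str, str]:
    """Run every check; a raised exception counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception:
            results[name] = "failed"
    return results


def handle(path: str, checks: Checks) -> Tuple[int, dict]:
    """Return (HTTP status, JSON body) for a health request."""
    if path == "/healthz":
        # Liveness: no dependency calls, a responding process is enough.
        return 200, {"status": "ok"}
    if path == "/readyz":
        results = run_checks(checks)
        ready = all(state == "ok" for state in results.values())
        # Body shape is an assumption; the 200/503 split is from the text.
        body = {"status": "ready" if ready else "not_ready", "checks": results}
        return (200 if ready else 503), body
    if path == "/health":
        # Detailed status always answers 200; failures appear in the body.
        return 200, {"components": run_checks(checks)}
    return 404, {"error": "not found"}
```

Keeping /healthz free of dependency calls means a core service outage takes the gateway out of rotation through /readyz without also triggering restarts.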


Core Service gRPC Health

The Core Service implements the standard gRPC Health Checking Protocol.

Implementation Details

Service: grpc.health.v1.Health

Check Method: Verifies database connectivity

  • Executes SELECT 1 against PostgreSQL

  • Returns SERVING if successful

  • Returns NOT_SERVING if the database is unreachable

Watch Method: Not implemented (returns UNIMPLEMENTED)
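The Check logic above can be illustrated with a small function that maps a `SELECT 1` round trip to a status name. This is a sketch under stated assumptions: `check_database` and `connect` are hypothetical names, and the stdlib `sqlite3` module stands in for PostgreSQL so the example runs anywhere. In the real service the returned values would presumably be the `HealthCheckResponse.SERVING` and `NOT_SERVING` enum members from `grpc_health.v1.health_pb2`.

```python
import sqlite3

SERVING = "SERVING"
NOT_SERVING = "NOT_SERVING"


def check_database(connect) -> str:
    """Map a `SELECT 1` round trip to a gRPC health status name.

    `connect` is any zero-argument callable returning a DB-API connection.
    """
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute("SELECT 1")
            cur.fetchone()
        finally:
            conn.close()
        return SERVING
    except Exception:
        # Any connection or query error is reported as NOT_SERVING.
        return NOT_SERVING


def unreachable():
    # Simulates a database that cannot be reached.
    raise ConnectionError("database unreachable")


print(check_database(lambda: sqlite3.connect(":memory:")))  # SERVING
print(check_database(unreachable))                          # NOT_SERVING
```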

Testing with grpc_health_probe


Kubernetes Probes

Gateway Deployment Probes

Liveness Probe:

  • Endpoint: GET /healthz

  • Initial delay: 10s

  • Period: 10s

  • Timeout: 5s

  • Failure threshold: 3 (30s total before restart)

Readiness Probe:

  • Endpoint: GET /readyz

  • Initial delay: 5s

  • Period: 5s

  • Timeout: 3s

  • Failure threshold: 2 (10s total before removing from LB)

Startup Probe:

  • Endpoint: GET /healthz

  • Initial delay: 0s

  • Period: 2s

  • Failure threshold: 30 (60s total startup time allowed)

Core Service Deployment Probes

Liveness Probe:

  • Type: gRPC

  • Port: 50051

  • Initial delay: 15s

  • Period: 20s

  • Timeout: 5s

  • Failure threshold: 3

Readiness Probe:

  • Type: gRPC

  • Port: 50051

  • Initial delay: 5s

  • Period: 10s

  • Timeout: 3s

  • Failure threshold: 2
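The "total" figures quoted for the gateway probes are period multiplied by failure threshold. The sketch below applies the same arithmetic to every probe listed in this section, including the core service probes, for which no totals are stated. The `Probe` class is a hypothetical helper, and the result is an approximation that ignores per-request timeouts and the initial delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    period_s: int
    failure_threshold: int
    initial_delay_s: int = 0

    def failure_window_s(self) -> int:
        """Approximate time from first failed check to Kubernetes acting."""
        return self.period_s * self.failure_threshold


# Values copied from the probe settings listed in this section.
PROBES = {
    ("gateway", "liveness"): Probe(period_s=10, failure_threshold=3, initial_delay_s=10),
    ("gateway", "readiness"): Probe(period_s=5, failure_threshold=2, initial_delay_s=5),
    ("gateway", "startup"): Probe(period_s=2, failure_threshold=30, initial_delay_s=0),
    ("core", "liveness"): Probe(period_s=20, failure_threshold=3, initial_delay_s=15),
    ("core", "readiness"): Probe(period_s=10, failure_threshold=2, initial_delay_s=5),
}

for (service, kind), probe in PROBES.items():
    print(f"{service:8} {kind:10} {probe.failure_window_s()}s")
```

By this arithmetic the core service tolerates about 60s of liveness failures and 20s of readiness failures, twice the gateway's windows.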


Docker Compose Health Checks

Database (PostgreSQL)

Core Service

Note: Requires grpc_health_probe binary in the Docker image.

API Gateway

Note: Requires curl binary in the Docker image.

Frontend


Testing Health Endpoints

Manual Testing
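One way to exercise the gateway endpoints by hand is a small stdlib client. This is a sketch, not the project's own tooling: `probe` is a hypothetical helper, and the gateway address you pass in depends on your environment.

```python
import json
import urllib.error
import urllib.request


def probe(base_url: str, path: str, timeout: float = 3.0):
    """GET base_url + path and return (status_code, parsed_json_or_None)."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; a 503 from /readyz is still a valid answer.
        status, raw = err.code, err.read()
    try:
        return status, json.loads(raw)
    except ValueError:
        # Body was empty or not JSON.
        return status, None
```

Calling `probe(base, "/readyz")` against a gateway whose dependencies are down should return 503 with the response body rather than raise. A gateway that is not reachable at all raises `OSError` (connection refused, timeout).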

Automated Testing

Run the comprehensive test suite:

Kubernetes Testing


Probe Configuration Rationale

Liveness Probe

  • Purpose: Detect deadlocks, infinite loops, crashes

  • Failure threshold: 3 consecutive failures = 30s before restart on the gateway (10s period), 60s on the core service (20s period)

  • Why: Prevent premature restarts due to temporary slowness

Readiness Probe

  • Purpose: Manage traffic routing based on dependency health

  • Failure threshold: 2 consecutive failures = 10s before removal from the LB on the gateway (5s period), 20s on the core service (10s period)

  • Why: Quickly remove unhealthy instances from load balancer

Startup Probe

  • Purpose: Protect slow-starting containers from premature restarts

  • Failure threshold: 30 failures at a 2s period = 60s startup window

  • Why: Allow sufficient time for service initialization and dependency checks


Monitoring and Alerting

  1. Health check success rate: Track percentage of successful health checks

  2. Time to ready: Measure how long services take to become ready

  3. Probe failure count: Alert on repeated probe failures

  4. Dependency status: Monitor core_service connectivity from gateway

Prometheus Integration (Future)

Consider adding a /metrics endpoint with:

  • health_check_total{endpoint, status} - Counter of health checks

  • health_check_duration_seconds{endpoint} - Histogram of check duration

  • dependency_status{service} - Gauge of dependency health (0/1)
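To make the proposed metrics concrete, the sketch below tracks the counter and the gauge in memory and renders them in the Prometheus text exposition format. It is illustrative only: `HealthMetrics` is a hypothetical class, the histogram is omitted for brevity, and a real /metrics endpoint would more likely use a client library such as `prometheus_client`.

```python
from collections import Counter


class HealthMetrics:
    def __init__(self):
        self.checks = Counter()   # (endpoint, status) -> number of checks
        self.dependencies = {}    # service -> 1 (healthy) or 0 (unhealthy)

    def record_check(self, endpoint: str, status: str) -> None:
        self.checks[(endpoint, status)] += 1

    def set_dependency(self, service: str, healthy: bool) -> None:
        self.dependencies[service] = 1 if healthy else 0

    def render(self) -> str:
        """Render the current values as Prometheus text exposition lines."""
        lines = ["# TYPE health_check_total counter"]
        for (endpoint, status), count in sorted(self.checks.items()):
            lines.append(
                f'health_check_total{{endpoint="{endpoint}",status="{status}"}} {count}'
            )
        lines.append("# TYPE dependency_status gauge")
        for service, value in sorted(self.dependencies.items()):
            lines.append(f'dependency_status{{service="{service}"}} {value}')
        return "\n".join(lines) + "\n"
```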


Troubleshooting

Gateway readiness probe failing

  1. Check core service connectivity:

  2. Verify core service is running:

  3. Test gRPC connectivity manually:
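As a first step before any gRPC-level test, a plain TCP check can show whether the core service port is reachable from the gateway's network. This is a sketch: `can_connect` is a hypothetical helper, and the hostname you pass depends on your deployment. Port 50051 is the one listed in the probe configuration. A successful connection shows only that the port is open, not that the health service reports SERVING.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False
```

For example, `can_connect("localhost", 50051)` returning False points to a network or process problem rather than a failing database check.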

Core service health check failing

  1. Check database connectivity:

  2. Verify PostgreSQL is accessible:

  3. Test database connection:

Docker Compose health checks failing

  1. Check service logs:

  2. Verify health check command works:

  3. Check if required binaries are present:


Security Considerations

  1. No authentication: Health endpoints are unauthenticated by design so that Kubernetes and load balancers can reach them

  2. Minimal information: Endpoints expose only necessary status information

  3. No secrets: Health responses never include credentials or sensitive data

  4. Rate limiting: Consider adding light rate limiting to prevent abuse


Dependencies

Python Packages

  • grpcio-health-checking>=1.76.0 - gRPC health service implementation

Docker Images

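API Gateway Dockerfile should include the curl binary (used by its Docker Compose health check).

Core Service Dockerfile should include the grpc_health_probe binary (used by its Docker Compose health check).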

