This document describes the health check endpoints and probes implemented in the Aura platform.
Health checks are implemented at three levels:
API Gateway - HTTP health endpoints
Core Service - gRPC Health Checking Protocol
Docker Compose & Kubernetes - Container orchestration probes
API Gateway Health Endpoints
/healthz - Liveness Probe
Purpose : Simple alive check - is the process running?
Response :
```json
{
  "status": "ok"
}
```
HTTP Status : 200 OK
Use case : Kubernetes liveness probe - restart if failing
/readyz - Readiness Probe
Purpose : Full dependency check - is service ready to handle traffic?
Response (healthy) :
Response (unhealthy) :
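The response bodies were omitted above. A plausible shape for both cases, assuming the gateway reports its core_service dependency (the exact field names are assumptions, not confirmed by the source):

```json
{
  "status": "ready",
  "checks": {
    "core_service": "ok"
  }
}
```

```json
{
  "status": "not_ready",
  "checks": {
    "core_service": "unreachable"
  }
}
```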
HTTP Status :
200 OK if ready
503 Service Unavailable if not ready
Use case : Kubernetes readiness probe - remove from load balancer if failing
/health - Detailed Status
Purpose : Human-readable health information with component status
Response :
HTTP Status : 200 OK
Use case : Monitoring dashboards, debugging, status pages
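The detailed response body was omitted above. A sketch of what a human-readable component breakdown might look like (component names are illustrative assumptions):

```json
{
  "status": "ok",
  "components": {
    "core_service": "ok",
    "database": "ok"
  }
}
```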
Core Service gRPC Health
The Core Service implements the standard gRPC Health Checking Protocol.
Implementation Details
Service : grpc.health.v1.Health
Check Method : Verifies database connectivity
Executes SELECT 1 against PostgreSQL
Returns SERVING if successful
Returns NOT_SERVING if database unreachable
Watch Method : Not implemented (returns UNIMPLEMENTED)
Testing with grpc_health_probe
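The probe invocation was omitted here. A typical check, assuming the Core Service listens on localhost:50051 (the address is an assumption):

```shell
# Empty -service flag checks overall server health
grpc_health_probe -addr=localhost:50051
# Exit code 0 means SERVING; non-zero means NOT_SERVING or a connection failure
```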
Kubernetes Probes
Gateway Deployment Probes
Liveness Probe :
Failure threshold: 3 (30s total before restart)
Readiness Probe :
Failure threshold: 2 (10s total before removing from LB)
Startup Probe :
Failure threshold: 30 (60s total startup time allowed)
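The probe specs themselves were omitted above. A sketch consistent with the stated thresholds (30s liveness, 10s readiness, 60s startup), assuming the gateway serves HTTP on port 8080:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 2
  failureThreshold: 30
```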
Core Service Deployment Probes
Liveness Probe :
Readiness Probe :
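Since the Core Service speaks gRPC rather than HTTP, both probes would use exec-style checks via grpc_health_probe. A sketch, assuming the binary lives at /bin/grpc_health_probe and the service listens on port 50051 (both assumptions):

```yaml
livenessProbe:
  exec:
    command: ["/bin/grpc_health_probe", "-addr=:50051"]
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  exec:
    command: ["/bin/grpc_health_probe", "-addr=:50051"]
  periodSeconds: 5
  failureThreshold: 2
```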
Docker Compose Health Checks
Database (PostgreSQL)
Note : Requires grpc_health_probe binary in the Docker image.
Note : Requires curl binary in the Docker image.
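The Compose healthcheck definitions were omitted above. A sketch covering all three services; service names, ports, and the postgres user are assumptions:

```yaml
services:
  db:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
  core:
    healthcheck:
      # requires grpc_health_probe in the image (see note above)
      test: ["CMD", "/bin/grpc_health_probe", "-addr=localhost:50051"]
      interval: 10s
      timeout: 5s
      retries: 3
  gateway:
    healthcheck:
      # requires curl in the image (see note above)
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
      interval: 10s
      timeout: 5s
      retries: 3
```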
Testing Health Endpoints
Automated Testing
Run the comprehensive test suite:
Kubernetes Testing
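The kubectl commands were omitted here. Typical checks, assuming an `app=gateway` label and port 8080 (both assumptions):

```shell
# Inspect pod readiness and restart counts
kubectl get pods -l app=gateway

# Look for probe failure events on a specific pod
kubectl describe pod <gateway-pod>

# Exercise the endpoints directly from inside the pod
kubectl exec <gateway-pod> -- curl -s http://localhost:8080/readyz
```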
Probe Configuration Rationale
Liveness Probe
Purpose : Detect deadlocks, infinite loops, crashes
Failure threshold : 3 failures = 30s before restart
Why : Prevent premature restarts due to temporary slowness
Readiness Probe
Purpose : Manage traffic routing based on dependency health
Failure threshold : 2 failures = 10s before removing from LB
Why : Quickly remove unhealthy instances from load balancer
Startup Probe
Purpose : Protect slow-starting containers from premature restarts
Failure threshold : 30 failures = 60s startup window
Why : Allow sufficient time for service initialization and dependency checks
Monitoring and Alerting
Recommended Metrics
Health check success rate : Track percentage of successful health checks
Time to ready : Measure how long services take to become ready
Probe failure count : Alert on repeated probe failures
Dependency status : Monitor core_service connectivity from gateway
Prometheus Integration (Future)
Consider adding a /metrics endpoint with:
health_check_total{endpoint, status} - Counter of health checks
health_check_duration_seconds{endpoint} - Histogram of check duration
dependency_status{service} - Gauge of dependency health (0/1)
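The metric names above could be declared with the prometheus_client package along these lines (a sketch; the recorded label values are illustrative):

```python
from prometheus_client import Counter, Gauge, Histogram

# Counter of health checks by endpoint and outcome
health_check_total = Counter(
    "health_check_total", "Total health checks", ["endpoint", "status"]
)
# Histogram of check duration per endpoint
health_check_duration_seconds = Histogram(
    "health_check_duration_seconds", "Health check duration", ["endpoint"]
)
# Gauge of dependency health: 1 = healthy, 0 = unhealthy
dependency_status = Gauge(
    "dependency_status", "Dependency health (0/1)", ["service"]
)

# Example: record one successful /readyz check taking 12 ms
health_check_total.labels(endpoint="/readyz", status="ok").inc()
health_check_duration_seconds.labels(endpoint="/readyz").observe(0.012)
dependency_status.labels(service="core_service").set(1)
```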
Troubleshooting
Gateway readiness probe failing
Check core service connectivity:
Verify core service is running:
Test gRPC connectivity manually:
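The diagnostic commands were omitted above. One way to work through the three steps, assuming the service names, ports, and labels used here (all assumptions):

```shell
# 1. Check core service connectivity as the gateway reports it
curl -s http://localhost:8080/readyz

# 2. Verify the core service is running
kubectl get pods -l app=core-service

# 3. Test gRPC connectivity manually
grpc_health_probe -addr=core-service:50051
```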
Core service health check failing
Check database connectivity:
Verify PostgreSQL is accessible:
Test database connection:
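The commands for these steps were omitted. A sketch, assuming a host named `postgres` and the default `postgres` user (both assumptions):

```shell
# Verify PostgreSQL is accepting connections
pg_isready -h postgres -p 5432

# Run the same query the health check executes
psql -h postgres -U postgres -c "SELECT 1"
```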
Docker Compose health checks failing
Verify health check command works:
Check if required binaries are present:
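The commands for these checks were omitted. A sketch, assuming Compose services named `gateway` and `core` (assumptions):

```shell
# Run the configured health check command by hand inside each container
docker compose exec gateway curl -f http://localhost:8080/healthz

# Confirm the required binaries are present in the images
docker compose exec gateway which curl
docker compose exec core which grpc_health_probe
```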
Security Considerations
No authentication : Health endpoints are public by design for K8s/LB access
Minimal information : Endpoints expose only necessary status information
No secrets : Health responses never include credentials or sensitive data
Rate limiting : Consider adding light rate limiting to prevent abuse
Python Packages
grpcio-health-checking>=1.76.0 - gRPC health service implementation
API Gateway Dockerfile should include:
Core Service Dockerfile should include:
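The Dockerfile fragments were omitted above. A sketch of what each image would need; the base image package manager, install paths, and the pinned grpc_health_probe release are assumptions:

```dockerfile
# API Gateway image: curl is needed by the Compose healthcheck
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Core Service image: fetch the grpc_health_probe binary (pin the release you need)
ARG GRPC_HEALTH_PROBE_VERSION=v0.4.19
RUN wget -qO /bin/grpc_health_probe \
    "https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/${GRPC_HEALTH_PROBE_VERSION}/grpc_health_probe-linux-amd64" \
    && chmod +x /bin/grpc_health_probe
```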