Monitoring Basics
EduIDE's monitoring stack is built on Prometheus and Grafana, managed through the theia-monitoring chart and the Rancher monitoring system (cattle-monitoring-system). This page describes the key signals to watch, what normal and degraded states look like, and how to verify platform health proactively.
Monitoring infrastructure
The theia-monitoring chart deploys Prometheus ServiceMonitor resources that configure scrape targets for each environment namespace. Grafana dashboards are discovered from the cattle-dashboards namespace.
Namespaces in scope:
- theia-prod
- theia-staging
- test1, test2, test3
If a new environment namespace is added, update the targetNamespaces and sessionNamespaces lists in the monitoring chart values before deploying.
Shared build cache metrics
The theia-shared-cache (Gradle build cache) exposes Prometheus metrics at /metrics via a Redis Exporter sidecar. Enable the ServiceMonitor for it in the chart values:
metrics:
  serviceMonitor:
    enabled: true
Key signals
Session launch latency
The most user-visible signal. A session launch that takes longer than ~10 seconds indicates one of the following:
- All pre-warmed instances are consumed (increase minInstances)
- The cluster is under node pressure (check CPU/memory on nodes)
- The operator is backlogged in reconciliation
Monitor: time from POST /service to a reachable session URL.
Alert threshold: p95 > 15 seconds sustained for 5 minutes.
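The threshold check can be sketched in shell as follows. This is a minimal illustration, assuming launch durations (in seconds) have already been collected one per line; how they are collected is not defined on this page, and the function names `p95` and `launch_latency_alert` are hypothetical. The "sustained for 5 minutes" condition is left to the alerting system.

```shell
# Sketch only: durations are assumed to be collected elsewhere, one value
# (seconds) per line. Function names are illustrative, not part of EduIDE.

# Print the nearest-rank 95th percentile of the numbers read from stdin.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      idx = int((95 * NR + 99) / 100)   # ceil(0.95 * NR) in integer arithmetic
      print v[idx]
    }'
}

# Exit 0 (alert) when the p95 of stdin exceeds the threshold in $1 (default 15).
launch_latency_alert() {
  threshold="${1:-15}"
  value="$(p95)" || return 2
  awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v > t) }'
}
```

For example, `launch_latency_alert 15 < durations.txt && echo "p95 above threshold"`, where `durations.txt` is a hypothetical file of measured launch times.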
Session availability
The fraction of launch requests that succeed. A drop indicates cluster instability, image pull failures, or storage attachment problems.
Monitor: ratio of successful session starts to total start attempts.
Alert threshold: success rate < 95% over a 10-minute window.
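As a rough sketch of this ratio, the helper below turns two counters into a percentage and compares it with the 95% threshold. The counters are assumed to come from whatever records launch attempts over the 10-minute window; `success_rate` and `availability_alert` are hypothetical names.

```shell
# Sketch only: the two counters are assumed to come from whatever records
# launch attempts (logs, metrics); their source is not defined on this page.

# Print the success rate as a percentage with two decimals.
# Usage: success_rate <successful_starts> <total_attempts>
success_rate() {
  awk -v ok="$1" -v total="$2" 'BEGIN {
    if (total <= 0) exit 1            # no attempts in the window: rate undefined
    printf "%.2f\n", 100 * ok / total
  }'
}

# Exit 0 (alert) when the rate falls below the threshold in $3 (default 95).
availability_alert() {
  rate="$(success_rate "$1" "$2")" || return 2
  awk -v r="$rate" -v t="${3:-95}" 'BEGIN { exit !(r < t) }'
}
```

A window with no launch attempts returns a non-zero status without printing a rate, so a quiet period is not reported as an outage.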
Pod memory utilisation
Individual session pods have a memory limit (e.g., 3000M for java-17-latest). Pods that run consistently near their limit are likely to be OOM-killed, which surfaces as unexpected session terminations.
Monitor: container_memory_working_set_bytes for session pods vs. limits.memory.
Alert threshold: > 85% of memory limit for > 5 minutes.
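One possible way to express this comparison is a PromQL query sent to the Prometheus HTTP API. The sketch below assumes that kube-state-metrics is installed (it exports `kube_pod_container_resource_limits`) and that `PROM_URL`, a hypothetical variable, points at a reachable Prometheus endpoint. It matches every container in the namespace; narrowing it to session pods needs a label selector that this page does not specify, and the 5-minute duration would be handled by the alert rule.

```shell
# Sketch only: PROM_URL is a hypothetical variable pointing at a reachable
# Prometheus endpoint; the query is one possible way to express the check.

# Print a PromQL expression for containers above 85% of their memory limit.
# kube_pod_container_resource_limits is exported by kube-state-metrics.
memory_pressure_query() {
  ns="${1:-theia-prod}"
  printf '%s' "container_memory_working_set_bytes{namespace=\"$ns\",container!=\"\"}"
  printf '%s' " / on (namespace, pod, container) "
  printf '%s' "kube_pod_container_resource_limits{namespace=\"$ns\",resource=\"memory\"}"
  printf '%s\n' " > 0.85"
}

# Run the query through the Prometheus HTTP API (instant query endpoint).
memory_pressure() {
  curl -fsS -G "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=$(memory_pressure_query "$1")"
}
```

An empty result set from `memory_pressure theia-prod` means no container is currently above 85% of its limit.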
Authentication error rate
Failed authentication attempts spike during misconfiguration (e.g., after a Keycloak change) or during an attack. Normal rate should be near zero for legitimate users.
Monitor: HTTP 401 and 403 response rates on the OAuth2 proxy and service.
Alert threshold: > 5% of requests returning 401/403 over a 5-minute window.
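For a quick manual check, the share of 401/403 responses can be computed from a stream of status codes. This sketch assumes the codes have already been extracted, one per line, for example from OAuth2 proxy access logs; the extraction step depends on the log format and is not shown, and `auth_error_pct` is a hypothetical name.

```shell
# Sketch only: input is assumed to be one HTTP status code per line, e.g.
# extracted from OAuth2 proxy access logs (extraction step not shown).

# Print the percentage of responses that are 401 or 403.
auth_error_pct() {
  awk '
    $1 == 401 || $1 == 403 { denied++ }
    NF { total++ }                     # count non-empty lines only
    END {
      if (total == 0) exit 1
      printf "%.2f\n", 100 * denied / total
    }'
}
```

A result above 5.00 for a 5-minute slice of logs corresponds to the alert threshold above.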
Workspace storage usage
Track PVC count and total storage consumption against the namespace quota. Approaching the quota hard limit will prevent new workspace creation.
Monitor: kube_resourcequota for persistentvolumeclaims and requests.storage.
Alert threshold: > 80% of quota consumed.
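The same check can be approximated directly from the cluster. The sketch below covers only the PVC-count quota, since storage quantities such as `500Gi` would need unit parsing; the function names `quota_pct` and `pvc_quota` are hypothetical.

```shell
# Sketch only: covers the PVC-count quota; storage quantities such as "500Gi"
# would need unit parsing, which is omitted here.

# Read "name used hard" lines and print each quota's usage percentage,
# flagging rows above the threshold in $1 (default 80).
quota_pct() {
  awk -v t="${1:-80}" '
    NF == 3 && $3 > 0 {
      pct = 100 * $2 / $3
      printf "%s %.0f%% %s\n", $1, pct, (pct > t ? "ALERT" : "ok")
    }'
}

# List used and hard PVC counts for every ResourceQuota in a namespace.
pvc_quota() {
  kubectl get resourcequota -n "${1:-theia-prod}" -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.used.persistentvolumeclaims}{" "}{.status.hard.persistentvolumeclaims}{"\n"}{end}'
}
```

Usage: `pvc_quota theia-prod | quota_pct 80`. Quotas that do not define a PVC limit produce incomplete rows and are skipped.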
Build cache hit rate
A low cache hit rate for the Gradle shared cache degrades CI build times but does not affect user sessions directly.
Monitor: gradle_cache_cache_hits / (gradle_cache_cache_hits + gradle_cache_cache_misses).
Informational threshold: < 50% over a 24-hour window warrants investigation.
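A minimal sketch of the ratio, assuming the two counters appear under exactly these names in the Prometheus text output of the cache's /metrics endpoint and that lines carry no trailing timestamp. Because the counters are cumulative, this yields the hit rate since the exporter started rather than over a 24-hour window; `cache_hit_rate` is a hypothetical name and the scrape URL is left to the caller.

```shell
# Sketch only: metric names are taken from the formula above; the scrape URL
# is deployment-specific and therefore left to the caller.

# Read Prometheus text-format metrics on stdin and print the hit rate (%).
cache_hit_rate() {
  awk '
    $1 ~ /^gradle_cache_cache_hits([{].*)?$/   { hits   += $NF }
    $1 ~ /^gradle_cache_cache_misses([{].*)?$/ { misses += $NF }
    END {
      if (hits + misses == 0) exit 1   # no traffic yet: rate undefined
      printf "%.2f\n", 100 * hits / (hits + misses)
    }'
}
```

Usage: `curl -s "$CACHE_METRICS_URL" | cache_hit_rate`, where `CACHE_METRICS_URL` is a hypothetical variable holding the address of the /metrics endpoint.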
Health check procedures
Service health
# Public ping (requires service auth token, not admin token)
curl https://service.theia.artemis.cit.tum.de/service/{appId}
# Admin ping (requires admin API token)
curl -H "X-Admin-Api-Token: $ADMIN_API_TOKEN" \
https://service.theia.artemis.cit.tum.de/service/admin/{appId}
Both return true when the service is healthy.
Operator health
kubectl get pods -n theia-prod -l app=operator
kubectl logs -n theia-prod -l app=operator --tail=50
The operator runs 3 replicas in production. If fewer than 3 are Running, investigate immediately.
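The replica check can be scripted along these lines. The sketch parses the default `kubectl get pods --no-headers` columns (NAME READY STATUS RESTARTS AGE) and reuses the namespace and label from the commands above; `count_running` and `check_operator` are hypothetical names.

```shell
# Sketch only: parses the default `kubectl get pods --no-headers` columns
# (NAME READY STATUS RESTARTS AGE); the expected count of 3 is from this page.

# Count pods whose STATUS column is Running.
count_running() {
  awk '$3 == "Running" { n++ } END { print n + 0 }'
}

# Exit non-zero when fewer operator pods are Running than expected.
check_operator() {
  expected="${1:-3}"
  running="$(kubectl get pods -n theia-prod -l app=operator --no-headers | count_running)"
  echo "operator pods Running: $running/$expected"
  [ "$running" -ge "$expected" ]
}
```

Run `check_operator` without arguments for the production default of 3 replicas; a non-zero exit status means the operator needs attention.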
Session pod health
# Count running session pods
kubectl get pods -n theia-prod --field-selector=status.phase=Running | grep -c session
# Find stuck or crash-looping pods
kubectl get pods -n theia-prod | grep -E 'CrashLoopBackOff|Error|Pending'
Build cache health
# Readiness check
curl https://cache.theia.artemis.cit.tum.de/health
# Liveness check
curl https://cache.theia.artemis.cit.tum.de/ping
Both endpoints return 200 OK when healthy.
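For scripted use, the two checks can be wrapped so that the status code is compared explicitly. This is a sketch using the URLs above; `http_status` and `check_cache_health` are hypothetical names, and the 10-second timeout is an arbitrary choice.

```shell
# Sketch only: wraps the two curl checks above and compares the status code.

# Print the HTTP status code for a URL ("000" if the request fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Exit non-zero if either cache endpoint does not answer 200.
check_cache_health() {
  base="https://cache.theia.artemis.cit.tum.de"
  rc=0
  for path in /health /ping; do
    code="$(http_status "$base$path")"
    echo "$path -> $code"
    [ "$code" = "200" ] || rc=1
  done
  return "$rc"
}
```

Both endpoints are always probed, so the output shows which of the two checks failed.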
Grafana dashboards
Dashboards are deployed to the cattle-dashboards namespace. To access them:
- Open the Rancher UI and navigate to the monitoring section.
- Look for dashboards prefixed with theia-.
- The main session dashboard shows launch latency, active sessions, and pod resource usage per namespace.
If dashboards are missing after a new environment is added, verify that the monitoring chart has been redeployed with the updated namespace list.
Routine health check cadence
| Check | Frequency | Method |
|---|---|---|
| Session launch smoke test | Daily | Launch a session manually and verify it starts |
| Pod status overview | Daily | kubectl get pods -n theia-prod |
| Resource quota utilisation | Weekly | kubectl describe resourcequota -n theia-prod |
| PVC growth rate | Weekly | Compare PVC count to previous week |
| Alert rule review | Monthly | Confirm alert thresholds are still appropriate for current cohort size |
| Dashboard coverage | On namespace addition | Verify new namespaces appear in Grafana |