Service Monitoring
Guide to building a basic monitoring stack for self-hosted services and infrastructure
created: 2026-03-14
updated: 2026-03-14
tags: #monitoring #self-hosting #observability
Introduction
Monitoring turns a self-hosted environment from a collection of services into an operable system. At minimum, that means collecting metrics, checking service availability, and alerting on failures that need human action.
Purpose
This guide focuses on:
- Host and service metrics
- Uptime checks
- Dashboards and alerting
- Monitoring coverage for common homelab services
Architecture Overview
A small monitoring stack often includes:
- Prometheus for scraping metrics
- Exporters such as node_exporter for host metrics
- Blackbox probing for endpoint availability
- Grafana for dashboards
- Alertmanager for notifications
Typical flow:
Exporter or target -> Prometheus -> Grafana dashboards
Prometheus alerts -> Alertmanager -> notification channel
Step-by-Step Guide
1. Start with host metrics
Install node_exporter on important Linux hosts or run it in a controlled containerized setup.
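One containerized option is to run node_exporter via Docker Compose with the host filesystem mounted read-only. A sketch (the image tag and mount layout may need adjusting for your environment):

```yaml
services:
  node_exporter:
    image: prom/node-exporter:latest
    command:
      - "--path.rootfs=/host"   # read host metrics from the mounted root
    network_mode: host          # expose metrics on the host's port 9100
    pid: host
    volumes:
      - /:/host:ro,rslave       # host filesystem, read-only
    restart: unless-stopped
```

Running with host networking and PID namespace keeps the metrics close to what a bare-metal install would report.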
2. Scrape targets from Prometheus
Example scrape config:
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "server-01.internal.example:9100"
          - "server-02.internal.example:9100"
3. Add endpoint checks
Use a blackbox probe or equivalent to test HTTPS and TCP reachability for user-facing services.
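With the Prometheus blackbox exporter, the probe job rewrites each target into a request parameter and points the scrape at the exporter itself. A sketch, assuming the exporter is reachable at blackbox-exporter:9115 and an http_2xx module is defined in its blackbox.yml:

```yaml
scrape_configs:
  - job_name: blackbox-https
    metrics_path: /probe
    params:
      module: [http_2xx]        # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - "https://app.internal.example"   # illustrative user-facing URL
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target         # pass the URL as ?target=
      - source_labels: [__param_target]
        target_label: instance               # keep the URL as the instance label
      - target_label: __address__
        replacement: "blackbox-exporter:9115"  # scrape the exporter, not the URL
```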
4. Add dashboards and alerts
Alert only on conditions that require action, such as:
- Host down
- Disk nearly full
- Backup job missing
- TLS certificate near expiry
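Two of these conditions can be sketched as Prometheus rules; the thresholds and label filters below are illustrative, not recommendations:

```yaml
groups:
  - name: capacity
    rules:
      - alert: DiskNearlyFull
        # Fires when less than 10% of a filesystem remains (ignoring tmpfs/overlay)
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) < 0.10
        for: 15m
        labels:
          severity: warning
      - alert: TLSCertExpiringSoon
        # Relies on blackbox HTTPS probes; fires within 14 days of expiry
        expr: (probe_ssl_earliest_cert_expiry - time()) < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
```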
Configuration Example
Example alert concept:
groups:
  - name: infrastructure
    rules:
      - alert: HostDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
Troubleshooting Tips
Metrics are missing for one host
- Check exporter health on that host
- Confirm firewall rules allow scraping
- Verify the target name and port in the Prometheus config
Alerts are noisy
- Add for durations to avoid alerting on short blips
- Remove alerts that never trigger action
- Tune thresholds per service class rather than globally
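Alertmanager grouping and repeat intervals can also cut notification volume. A sketch with illustrative values and an assumed email receiver:

```yaml
route:
  receiver: default
  group_by: [alertname, instance]
  group_wait: 30s        # batch related alerts before the first notification
  group_interval: 5m     # wait before notifying about new alerts in a group
  repeat_interval: 4h    # avoid re-sending unchanged alerts too often
receivers:
  - name: default
    email_configs:
      - to: "ops@example.com"   # hypothetical notification address
```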
Dashboards look healthy while the service is down
- Add blackbox checks in addition to internal metrics
- Monitor the reverse proxy or external entry point, not only the app process
- Track backups and certificate expiry separately from CPU and RAM
Best Practices
- Monitor the services users depend on, not only the hosts they run on
- Keep alert volume low enough that alerts remain meaningful
- Document the owner and response path for each critical alert
- Treat backup freshness and certificate expiry as first-class signals
- Start simple, then add coverage where operational pain justifies it