Category: Monitoring

  • Grafana Monitoring for the Arkade Cluster

    Let's take a light, bubbly kubernetes project like arkade retro gaming and ramp up the fun by adding monitoring! Bleck. Seriously though, I'd like to learn more about monitoring a kubernetes cluster, so let's get started. Today (erm, this week) I'll build that into my home cluster.

    Setup

    I recently set up the grafana/prometheus combination to monitor my main server. The server runs pretty much idle and never really gets overloaded, but I do want to monitor drive temps on the server RAID. I also want to monitor network usage of the WAN NIC, just to compare with the usage report from my ISP.


    To get the network monitor, I set up a custom vnstat -> telegraf -> prometheus chain, which was sort of interesting (but *really* not required for my cluster, so skip over it if you want). The kubernetes content you crave continues below.

    vnstat

    $ sudo apt install vnstat
    $ sudo systemctl enable vnstat
    $ sudo systemctl start vnstat
    

    I played with vnstat commands for a while to reduce the number of interfaces it was tracking:

    $ sudo vnstat --remove --iface gar0 --force
    $ sudo vnstat --add --iface wan 
    $ sudo vnstat --add --iface lan 
    ...
    $ vnstat wan -m 
    
     wan  /  monthly
    
            month        rx      |     tx      |    total    |   avg. rate
         ------------------------+-------------+-------------+---------------
           2025-08     99.15 GiB |   23.86 GiB |  123.01 GiB |  394.52 kbit/s
           2025-09    394.20 GiB |  169.98 GiB |  564.18 GiB |   21.31 Mbit/s
         ------------------------+-------------+-------------+---------------
         estimated      4.39 TiB |    1.89 TiB |    6.28 TiB |
    

    Let that run for a while and soon you can dump traffic stats for the various NICs in the linux box. Side note: my linux server is also my router, so I've renamed the interface adapters 'wan', 'lan', 'wifi', etc. depending on what they connect to. 'wan' is external traffic to the modem, 'lan' is all internal traffic (lan is a bridge), and 'wifi' goes off to my wireless access point (wifi lives in the lan bridge).
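
    vnstat can also dump everything as JSON, which is what the telegraf script in the next section parses. Trimmed down to just the fields that script uses (and just the wan entry, with the byte counts that show up again in the curl check further down), the output looks roughly like this:

    $ vnstat --json
    {
      "interfaces": [
        {
          "name": "wan",
          "traffic": {
            "total": { "rx": 534410207039, "tx": 210284865396 }
          },
          ...
        },
        ...
      ]
    }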

    telegraf

    Now set up telegraf to collect the vnstat reports and expose them for prometheus.

    $ sudo apt install telegraf
    $ cat /etc/telegraf/telegraf.d/vnstat.conf
    [[inputs.exec]]
      commands = ["/usr/local/bin/vnstat-telegraf.sh"]
      timeout = "5s"
      data_format = "influx"
    $ cat /usr/local/bin/vnstat-telegraf.sh
    #!/bin/bash
    # Emit one influx line per interface with the running rx/tx byte totals from vnstat
    for IFACE in wan lan wifi fam off bond0
    do
        vnstat --json | jq -r --arg iface "$IFACE" '
            .interfaces[] | select(.name==$iface) |
            .traffic.total as $total |
            "vnstat,interface=\($iface) rx_bytes=\($total.rx),tx_bytes=\($total.tx)"
        '
    done
    $ sudo systemctl enable telegraf
    $ sudo systemctl start telegraf
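
    One piece not shown in that drop-in: for telegraf to serve a /metrics endpoint on port 9273 (which the curl check below assumes), the config also needs a prometheus_client output somewhere, roughly:

    [[outputs.prometheus_client]]
      listen = ":9273"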

    Then check the config with a quick curl:

    $ curl --silent localhost:9273/metrics | grep wan
    vnstat_rx_bytes{host="www.hobosuit.com",interface="wan"} 5.34410207039e+11
    vnstat_tx_bytes{host="www.hobosuit.com",interface="wan"} 2.10284865396e+11

    prometheus

    Now install prometheus and hookup telegraf:

    $ sudo apt install prometheus 
    $ tail /etc/prometheus/prometheus.yml

      - job_name: node
        # If prometheus-node-exporter is installed, grab stats about the local
        # machine by default.
        static_configs:
          - targets: ['localhost:9100']

      - job_name: 'telegraf-vnstat'
        static_configs:
          - targets: ['localhost:9273']
    $ sudo systemctl restart prometheus
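
    It's worth confirming prometheus actually sees both scrape jobs; its targets API makes that easy (assuming jq is installed), and both jobs should report as up:

    $ curl --silent localhost:9090/api/v1/targets | \
        jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'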

    Prometheus comes preconfigured with the node exporter job, which includes the drive temp data I'm looking for, and I've added the telegraf-vnstat job to pull in the network usage stats. Check the node exporter with curl like this:

    $ curl --silent localhost:9100/metrics | grep smartmon_temperature_celsius_raw
    # HELP smartmon_temperature_celsius_raw_value SMART metric temperature_celsius_raw_value
    # TYPE smartmon_temperature_celsius_raw_value gauge
    smartmon_temperature_celsius_raw_value{disk="/dev/sda",smart_id="194",type="sat"} 35
    smartmon_temperature_celsius_raw_value{disk="/dev/sdb",smart_id="194",type="sat"} 40
    smartmon_temperature_celsius_raw_value{disk="/dev/sdc",smart_id="194",type="sat"} 41
    smartmon_temperature_celsius_raw_value{disk="/dev/sdd",smart_id="194",type="sat"} 42
    smartmon_temperature_celsius_raw_value{disk="/dev/sde",smart_id="194",type="sat"} 36
    smartmon_temperature_celsius_raw_value{disk="/dev/sdf",smart_id="194",type="sat"} 40

    Grafana

    I set up grafana using docker-compose:

    $ cat docker-compose.yaml 
    version: '3.8'
    
    services:
      influxdb:
        image: influxdb:latest
        container_name: influxdb
        ports:
          - "8086:8086"
        volumes:
          - /grafana/influxdb-storage:/var/lib/influxdb2
        environment:
          - DOCKER_INFLUXDB_INIT_MODE=setup
          - DOCKER_INFLUXDB_INIT_USERNAME=admin
          - DOCKER_INFLUXDB_INIT_PASSWORD=thatsmypurse
          - DOCKER_INFLUXDB_INIT_ORG=my_org
          - DOCKER_INFLUXDB_INIT_BUCKET=my_bucket
          - DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=idontknowyou
    
      grafana:
        image: grafana/grafana:latest
        container_name: grafana
        ports:
          - "3000:3000"
        volumes:
          - /grafana/grafana-storage:/var/lib/grafana
        environment:
          - GF_SECURITY_ADMIN_USER=thatsmypurse
          - GF_SECURITY_ADMIN_PASSWORD=idontknowyou
        depends_on:
          - influxdb
    
    networks:
      default:
        driver: bridge
        driver_opts:
          com.docker.network.bridge.name: br-grafana
        ipam:
          config:
            - subnet: 172.27.0.0/24
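
    With the compose file in place, the pair comes up with a single command (docker compose or docker-compose, depending on the install):

    $ docker compose up -d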
    

    Set a DataSource

    Finally, I can log in to grafana, set up a datasource, and build some dashboards.
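
    The datasource is just the prometheus instance running on the host. I clicked it in through the UI, but it could also be dropped into grafana's provisioning directory (/etc/grafana/provisioning/datasources inside the container, which would need its own volume mount in the compose file); roughly this, with the host address as a placeholder:

    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://<host-ip>:9090   # wherever prometheus is listening
        isDefault: true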

    Wait What?

    I can't build all that nonsense every time I want to set up a node in a kubernetes cluster, plus I don't really need the wan usage stuff. Luckily 'there's a helm chart for that'. prometheus-community/kube-prometheus-stack puts the prometheus node exporter on each node (as a daemonset) and takes care of starting grafana with lots of nice preconfigured dashboards.

    Dump the values.yaml

    It’s a big complicated chart, so I found it helpful to dump the values.yaml file to study.

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm show values prometheus-community/kube-prometheus-stack > prom_values.yaml
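
    I mostly grepped the dump for the knobs I ended up overriding, for example:

    $ grep -nE 'retention|storageSpec|adminPassword' prom_values.yaml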

    Ultimately I came up with this script to install the chart into the mon namespace:

    #!/bin/bash

    . ./functions.sh

    NAMESPACE=mon

    info "Setup prometheus community helm repo"
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update

    info "Install prometheus-community chart into '$NAMESPACE' namespace"
    helm upgrade --install prom prometheus-community/kube-prometheus-stack \
        --namespace $NAMESPACE \
        --create-namespace \
        --set grafana.adminUser=${GRAFANA_ADMIN} \
        --set grafana.adminPassword=${GRAFANA_PASSWORD} \
        --set grafana.persistence.type=pvc \
        --set grafana.persistence.enabled=true \
        --set grafana.persistence.storageClass=longhorn \
        --set "grafana.persistence.accessModes={ReadWriteMany}" \
        --set grafana.persistence.size=8Gi \
        --set grafana.resources.requests.memory=512Mi \
        --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=longhorn \
        --set "prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes={ReadWriteMany}" \
        --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=4Gi \
        --set prometheus.prometheusSpec.retention=14d \
        --set prometheus.prometheusSpec.retentionSize=8GiB \
        --set prometheus.prometheusSpec.resources.requests.memory=2Gi

    info "Wait for pods to come up in '$NAMESPACE' namespace"
    pod_wait $NAMESPACE

    info "Setup cert-manager for the grafana server"
    kubectl apply -f prom_Certificate.yaml

    info "Setup Traefik ingress route for grafana"
    kubectl apply -f prom_IngressRoute.yaml

    info "Wait for https://grafana to be available"
    # grafana redirects to the login screen, so a status code of 200 or 302 means it's ready
    https_wait https://grafana/login '200|302'

    I added persistence using longhorn volumes and set up my usual self-signed certificate to secure the traefik IngressRoute to the grafana UI. Here are the other bits and pieces of yaml:

    $ cat prom_Certificate.yaml
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: prom-grafana
      namespace: mon
    spec:
      secretName: prom-grafana-cert-secret # <=== Name of secret where the generated certificate will be stored.
      dnsNames:
        - "grafana"
      issuerRef:
        name: hobo-intermediate-ca1-issuer
        kind: ClusterIssuer

    $ cat prom_IngressRoute.yaml
    apiVersion: traefik.io/v1alpha1
    kind: IngressRoute
    metadata:
      name: grafana
      namespace: mon
      annotations:
        cert-manager.io/cluster-issuer: hobo-intermediate-ca1-issuer
        cert-manager.io/common-name: grafana
    spec:
      entryPoints:
        - websecure
      routes:
        - kind: Rule
          match: Host(`grafana`)
          priority: 10
          services:
            - name: prom-grafana
              port: 80
      tls:
        secretName: prom-grafana-cert-secret
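
    A quick sanity check that cert-manager issued the certificate and traefik picked up the route (output omitted):

    $ kubectl -n mon get certificate prom-grafana
    $ kubectl -n mon get secret prom-grafana-cert-secret
    $ kubectl -n mon get ingressroute grafana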

    ‘Monitoring’ ain’t easy

    Plus or minus some trial and error, I adjusted my cluster setup scripts to bring up the usual pieces; this new monitoring chunk (prom.sh) is the only new bit:

    $ cat setup.sh 
    #!/bin/bash

    . ./functions.sh
    ./cert-manager.sh
    ./argocd.sh

    kubectl create ns games
    ./longhorn.sh

    ./prom.sh

    … and then one of the nodes went catatonic – he’s dead Jim.

    Monitoring is Expensive

    No shock there: there's no free lunch with cluster monitoring. At the very least, you have to dedicate some hardware. In this case, my little cluster just couldn't handle the memory requirements of grafana / prometheus: this cluster has 4GB of memory on the control plane and 2GB on each node. I fixed it by digging out a Pi5 8GB that I had just bought and adding it as a node, so with the single board computer and accessories, monitoring will cost about $150.
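
    A quick way to see the squeeze is to compare the memory requests against what a node actually has ('node1' is a placeholder for one of the little nodes):

    $ kubectl describe node node1 | grep -A 6 'Allocated resources'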

    Finally Cluster Monitoring

    Pretty, when it's not crashing my main workloads. This gives me a start on learning to love monitoring, and maybe even alerting.

    A late addition to the setup was a model label on each node. The Pi5 computers have more RAM, so I set affinity to prefer them for the larger deployments to make things run smoother (the label itself is applied with kubectl; see the commands after the helm install). The updated helm install command looked like this:

    helm upgrade --install prom prometheus-community/kube-prometheus-stack \
        --namespace $NAMESPACE \
        --create-namespace \
        --set grafana.adminUser=${GRAFANA_ADMIN} \
        --set grafana.adminPassword=${GRAFANA_PASSWORD} \
        --set grafana.persistence.type=pvc \
        --set grafana.persistence.enabled=true \
        --set grafana.persistence.storageClass=longhorn \
        --set "grafana.persistence.accessModes={ReadWriteMany}" \
        --set grafana.persistence.size=8Gi \
        --set grafana.resources.requests.memory=512Mi \
        --set-json 'grafana.affinity={"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"weight":100,"preference":{"matchExpressions":[{"key":"model","operator":"In","values":["Pi5"]}]}}]}}' \
        --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=longhorn \
        --set "prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.accessModes={ReadWriteMany}" \
        --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=4Gi \
        --set prometheus.prometheusSpec.retention=14d \
        --set prometheus.prometheusSpec.retentionSize=8GiB \
        --set prometheus.prometheusSpec.resources.requests.memory=2Gi \
        --set-json 'prometheus.prometheusSpec.affinity={"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"weight":100,"preference":{"matchExpressions":[{"key":"model","operator":"In","values":["Pi5"]}]}}]}}'
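
    The model label the affinity rules key on is just a plain node label, applied with kubectl (node names here are placeholders):

    $ kubectl label node pi5-node-1 model=Pi5
    $ kubectl label node pi5-node-2 model=Pi5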

    -Sandy