Kuboard Etcd Failure: Operations Notes

Problem description

The etcd image that Kuboard uses by default has a 2 GB backend storage limit. When the database reaches that limit, etcd raises a NOSPACE alarm.

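Resolution

Change the startup parameters of the original image to raise the etcd backend storage quota.

Pull the original image and modify its entrypoint file: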

docker pull eipwork/etcd-host:3.4.16-2
docker create eipwork/etcd-host:3.4.16-2
docker cp <container>:/docker-entrypoint.sh .
vim docker-entrypoint.sh
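
The `<container>` placeholder above is the ID that `docker create` prints. A minimal sketch of the same extraction that captures the ID automatically and removes the temporary container afterwards; the function name `extract_entrypoint` is hypothetical and not part of the original procedure:

```shell
#!/bin/sh
# Sketch: copy /docker-entrypoint.sh out of an image without starting it.
# extract_entrypoint is an illustrative helper name, not an existing tool.
extract_entrypoint() {
  image="$1"
  # docker create prints the ID of the new (stopped) container
  cid=$(docker create "$image") || return 1
  docker cp "$cid":/docker-entrypoint.sh ./docker-entrypoint.sh
  status=$?
  # remove the temporary container whether or not the copy succeeded
  docker rm "$cid" >/dev/null
  return "$status"
}

# Usage (after docker pull):
#   extract_entrypoint eipwork/etcd-host:3.4.16-2
```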

Append two flags to the end of the etcd command:

etcd --name ${HOSTNAME} \
  --listen-peer-urls http://${HOSTIP}:2382 \
  --listen-client-urls http://${HOSTIP}:2381 \
  --advertise-client-urls http://${HOSTIP}:2381 \
  --initial-advertise-peer-urls http://${HOSTIP}:2382 \
  --initial-cluster-token kuboard-etcd-cluster-1 \
  --initial-cluster ${PEERS} \
  --initial-cluster-state new \
  --snapshot-count=10000 \
  --log-level=info \
  --logger=zap \
  --data-dir /data \
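  --auto-compaction-retention=1 \
  --quota-backend-bytes=8388608000
# --auto-compaction-retention=1: compact the key history automatically (periodic mode, 1 hour retention)
# --quota-backend-bytes=8388608000: raise the backend storage quota to about 8 GB (8000 MiB)
# The comments are kept below the command: a comment line between continued lines would end the command early.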
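
The quota value is easier to audit when it is derived instead of typed by hand. A minimal sketch using shell arithmetic; the variable names are illustrative and do not appear in the original entrypoint script. It shows that 8388608000 is 8000 MiB, slightly below a full 8 GiB:

```shell
#!/bin/sh
# Illustrative only: derive the value passed to --quota-backend-bytes.
mib=$((1024 * 1024))
quota_bytes=$((8000 * mib))      # the value used in the entrypoint (8000 MiB)
full_8gib=$((8 * 1024 * mib))    # exactly 8 GiB, for comparison
echo "quota_bytes=$quota_bytes"
echo "full_8gib=$full_8gib"
```

Either value can be passed to the flag; the notes above keep the author's original number.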

Rebuild the image with the following Dockerfile:

FROM eipwork/etcd-host:3.4.16-2
COPY ./docker-entrypoint.sh /docker-entrypoint.sh

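After switching to the new image, the alarm still has to be cleared manually.

First, increase the liveness probe's initial delay so that the container is not restarted before the alarm can be cleared: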

kubectl -n kuboard edit statefulset kuboard-etcd

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 2381
            scheme: HTTP
          initialDelaySeconds: 30 # raise this value temporarily
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

Enter the container, check the endpoint status, and clear the alarm:

kubectl -n kuboard exec -it kuboard-etcd-0 -- sh
ETCDCTL_API=3 etcdctl  --endpoints="http://127.0.0.1:2381" --write-out=table endpoint status
ETCDCTL_API=3 etcdctl  --endpoints="http://127.0.0.1:2381" alarm disarm
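
It can help to confirm that the alarm is actually active before disarming it. A minimal sketch, assuming it runs inside the etcd container with the same endpoint as above; `disarm_if_nospace` is a hypothetical helper name, and the check relies on `etcdctl alarm list` printing a line containing `alarm:NOSPACE` for an active alarm:

```shell
#!/bin/sh
# Sketch: disarm only when a NOSPACE alarm is reported.
# ENDPOINT matches the client URL used in the commands above.
ENDPOINT="http://127.0.0.1:2381"

disarm_if_nospace() {
  # "alarm list" prints one line per active alarm, e.g. "memberID:... alarm:NOSPACE"
  if ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINT" alarm list | grep -q NOSPACE; then
    ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINT" alarm disarm
  else
    echo "no NOSPACE alarm active"
  fi
}

# Usage inside the container:
#   disarm_if_nospace
```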

Once the alarm is cleared, restore the original probe settings.