Discussion:
[PVE-User] node not rebooted after corosync crash
Dmitry Petuhov
2018-08-15 08:41:43 UTC
A week ago, corosync suddenly crashed on one of my PVE nodes.

-------------->8=========
corosync[4701]: error   [TOTEM ] FAILED TO RECEIVE
corosync[4701]:  [TOTEM ] FAILED TO RECEIVE
corosync[4701]: notice  [TOTEM ] A new membership (10.19.92.53:1992) was
formed. Members left: 1 2 4
corosync[4701]: notice  [TOTEM ] Failed to receive the leave message.
failed: 1 2 4
corosync[4701]:  [TOTEM ] A new membership (10.19.92.53:1992) was
formed. Members left: 1 2 4
corosync[4701]:  [TOTEM ] Failed to receive the leave message. failed: 1 2 4
corosync[4701]: notice  [QUORUM] This node is within the non-primary
component and will NOT provide any services.
corosync[4701]: notice  [QUORUM] Members[1]: 3
corosync[4701]: notice  [MAIN  ] Completed service synchronization,
ready to provide service.
corosync[4701]:  [QUORUM] This node is within the non-primary component
and will NOT provide any services.
corosync[4701]:  [QUORUM] Members[1]: 3
corosync[4701]:  [MAIN  ] Completed service synchronization, ready to
provide service.
kernel: [29187555.500409] dlm: closing connection to node 2
corosync[4701]: notice  [TOTEM ] A new membership (10.19.92.51:2000) was
formed. Members joined: 1 2 4
corosync[4701]:  [TOTEM ] A new membership (10.19.92.51:2000) was
formed. Members joined: 1 2 4
corosync[4701]: notice  [QUORUM] This node is within the primary
component and will provide service.
corosync[4701]: notice  [QUORUM] Members[4]: 1 2 3 4
corosync[4701]: notice  [MAIN  ] Completed service synchronization,
ready to provide service.
corosync[4701]:  [QUORUM] This node is within the primary component and
will provide service.
corosync[4701]: notice  [CFG   ] Killed by node 1: dlm_controld
corosync[4701]: error   [MAIN  ] Corosync Cluster Engine exiting with
status -1 at cfg.c:530.
corosync[4701]:  [QUORUM] Members[4]: 1 2 3 4
corosync[4701]:  [MAIN  ] Completed service synchronization, ready to
provide service.
dlm_controld[688]: 29187298 daemon node 4 stateful merge
dlm_controld[688]: 29187298 receive_start 4:6 add node with started_count 2
dlm_controld[688]: 29187298 daemon node 1 stateful merge
dlm_controld[688]: 29187298 receive_start 1:5 add node with started_count 4
dlm_controld[688]: 29187298 daemon node 2 stateful merge
dlm_controld[688]: 29187298 receive_start 2:17 add node with
started_count 13
corosync[4701]:  [CFG   ] Killed by node 1: dlm_controld
corosync[4701]:  [MAIN  ] Corosync Cluster Engine exiting with status -1
at cfg.c:530.
dlm_controld[688]: 29187298 cpg_dispatch error 2
dlm_controld[688]: 29187298 process_cluster_cfg cfg_dispatch 2
dlm_controld[688]: 29187298 cluster is down, exiting
dlm_controld[688]: 29187298 process_cluster quorum_dispatch 2
dlm_controld[688]: 29187298 daemon cpg_dispatch error 2
systemd[1]: corosync.service: Main process exited, code=exited,
status=255/n/a
systemd[1]: corosync.service: Unit entered failed state.
systemd[1]: corosync.service: Failed with result 'exit-code'.
kernel: [29187556.903177] dlm: closing connection to node 4
kernel: [29187556.906730] dlm: closing connection to node 3
dlm_controld[688]: 29187298 abandoned lockspace hp-big-gfs
kernel: [29187556.924279] dlm: dlm user daemon left 1 lockspaces
-------------->8=========


But the node did not reboot.

I use WATCHDOG_MODULE=ipmi_watchdog. The watchdog is still running:


-------------->8=========

# ipmitool mc watchdog get
Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x10
Initial Countdown:      10 sec
Present Countdown:      9 sec

-------------->8=========


The only service that went down is corosync.
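(A quick way to double-check that, assuming stock systemd unit names on a
PVE 5.x node:)

-------------->8=========

# systemctl --failed
# systemctl is-active pve-cluster pve-ha-lrm pve-ha-crm corosync

-------------->8=========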


-------------->8=========

# pveversion --verbose
proxmox-ve: 5.0-21 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-31 (running version: 5.0-31/27769b1f)
pve-kernel-4.10.17-2-pve: 4.10.17-20
pve-kernel-4.10.17-3-pve: 4.10.17-21
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-5
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.11-pve17~bpo90
gfs2-utils: 3.1.9-2
openvswitch-switch: 2.7.0-2
ceph: 12.2.0-pve1

-------------->8=========


I also have GFS2 in this cluster, which did not stop working after the
corosync crash (that is what scares me the most).
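
(A rough way to see that the filesystem is still mounted even though
dlm_controld has exited, assuming gfs2-utils and the dlm tools are
installed:)

-------------->8=========

# findmnt -t gfs2
# dlm_tool ls    # presumably fails to connect now that dlm_controld is gone

-------------->8=========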


Shouldn't the node reboot when corosync fails, and why can it keep
running? Or does the node only fence itself when it has HA VMs, and just
stay up as it is when there are only regular autostarted VMs and no HA
machines present?
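
(For what it's worth, a rough way to check whether any HA resources are
configured at all, assuming the standard PVE tooling; resources.cfg may
simply not exist if HA has never been set up:)

-------------->8=========

# ha-manager status
# cat /etc/pve/ha/resources.cfg

-------------->8=========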
Josh Knight
2018-08-17 14:30:34 UTC
The ipmi_watchdog is a hardware watchdog which the OS pokes to keep happy.
If the OS hangs/crashes and therefore fails to poke it, then the IPMI
watchdog will reset the system. It will not catch the case of an
individual daemon/process, like corosync, hanging/crashing on the system.
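
If you want to see what (if anything) is actually keeping that watchdog
armed, something along these lines should work (a rough sketch, assuming a
stock PVE 5.x install):

-------------->8=========

# fuser -v /dev/watchdog    # which process holds the watchdog device open
# systemctl status watchdog-mux pve-ha-lrm pve-ha-crm

-------------->8=========

As long as whatever opened the device keeps resetting that 10 second
countdown, the BMC will never fire the hard reset, regardless of what
corosync itself is doing.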