Discussion:
[PVE-User] Cluster doesn't recover automatically after blackout
Eneko Lacunza
2018-08-01 09:02:18 UTC
Hi all,

This morning there was quite a long blackout which powered off a cluster
of 3 Proxmox 5.1 servers.

All 3 servers are the same make and model, so they need the same amount
of time to boot.

When the power came back, the servers started correctly but corosync
couldn't set up a quorum. Timing of events:

07:57:10 corosync start
07:57:15 first pmxcfs error quorum_initialize_failed: 2
07:57:52 network up
07:58:40 Corosync timeout
07:59:57 time sync works

What I can see is that the network switch booted more slowly than the
servers, but the network was nonetheless operational about 45s before
corosync gave up trying to set up a quorum.

I can also see that internet access wasn't back until 1 minute after
the corosync timeout (the time sync event).

A simple restart of pve-cluster at about 9:50 restored the cluster to
its normal state.

Is this expected? I expected that corosync would set up a quorum once
the network was operational...

# pveversion -v
proxmox-ve: 5.1-41 (running kernel: 4.13.13-6-pve)
pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
pve-kernel-4.13.13-6-pve: 4.13.13-41
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 12.2.2-pve1
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-common-perl: 5.0-28
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-17
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-2
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-11
pve-cluster: 5.0-20
pve-container: 2.0-19
pve-docs: 5.1-16
pve-firewall: 3.0-5
pve-firmware: 2.0-3
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.9.1-9
pve-xtermjs: 1.0-2
qemu-server: 5.0-22
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.6-pve1~bpo9

Thanks a lot
Eneko
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Alwin Antreich
2018-08-01 10:56:05 UTC
Hi,
Post by Eneko Lacunza
Hi all,
This morning there was quite a long blackout which powered off a cluster of
3 Proxmox 5.1 servers.
All 3 servers are the same make and model, so they need the same amount of
time to boot.
When the power came back, the servers started correctly but corosync
couldn't set up a quorum.
I recommend against servers returning automatically to their previous power
state after a power loss. A manual start-up is better, as by then the admin
has made sure power is back to normal operation. This also reduces the
chance of breakage if there are subsequent power or hardware failures.
Post by Eneko Lacunza
07:57:10 corosync start
07:57:15 first pmxcfs error quorum_initialize_failed: 2
07:57:52 network up
07:58:40 Corosync timeout
07:59:57 time sync works
What I can see is that the network switch booted more slowly than the
servers, but the network was nonetheless operational about 45s before
corosync gave up trying to set up a quorum.
I can also see that internet access wasn't back until 1 minute after
the corosync timeout (the time sync event).
A simple restart of pve-cluster at about 9:50 restored the cluster to its
normal state.
Is this expected? I expected that corosync would set up a quorum once the
network was operational...
When was multicast working again? That might have taken longer, as IGMP
snooping and the querier on the switch might simply take longer to come
back up.
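
You could also test multicast between the nodes directly, for example
with omping (just a sketch; replace the hostnames with your three nodes
and install omping on all of them first):

# omping -c 600 -i 1 -q proxmox1 proxmox2 proxmox3

If multicast only starts working some minutes after the switch boots,
that would explain why corosync could not form a quorum in time.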


--
Cheers,
Alwin
Eneko Lacunza
2018-08-01 11:40:34 UTC
Hi Alwin,

Post by Alwin Antreich
I recommend against servers returning automatically to their previous power
state after a power loss. A manual start-up is better, as by then the admin
has made sure power is back to normal operation. This also reduces the
chance of breakage if there are subsequent power or hardware failures.
This is an off-site place with no knowledgeable sysadmins and the servers
don't have remote management cards. I'm sure they would screw up the boot :)
I'm afraid we have to take the risk. :)
Post by Alwin Antreich
When was multicast working again? That might have taken longer, as IGMP
snooping and the querier on the switch might simply take longer to come
back up.
I don't have that info (or I don't know how to look it up in the logs;
/var/log/corosync is empty). I'm trying to plan an intentional blackout to
test things again with technicians on-site; we could get more info that day.

Switch is an HPE 1820-24G J9980A; it's L2 but quite dumb. We have several
18x0 switches deployed with good results so far.
Alwin Antreich
2018-08-01 11:57:09 UTC
Post by Eneko Lacunza
Hi Alwin,
Post by Alwin Antreich
Post by Eneko Lacunza
Hi all,
This morning there was quite a long blackout which powered off a cluster of
3 Proxmox 5.1 servers.
All 3 servers are the same make and model, so they need the same amount of
time to boot.
When the power came back, the servers started correctly but corosync
couldn't set up a quorum.
I recommend against servers returning automatically to their previous power
state after a power loss. A manual start-up is better, as by then the admin
has made sure power is back to normal operation. This also reduces the
chance of breakage if there are subsequent power or hardware failures.
This is an off-site place with no knowledgeable sysadmins and the servers
don't have remote management cards. I'm sure they would screw up the boot :)
I'm afraid we have to take the risk. :)
A boot delay, if the servers have such a setting, or switchable UPS power
plugs might help. :)
Post by Eneko Lacunza
Post by Alwin Antreich
Post by Eneko Lacunza
07:57:10 corosync start
07:57:15 first pmxcfs error quorum_initialize_failed: 2
07:57:52 network up
07:58:40 Corosync timeout
07:59:57 time sync works
What I can see is that the network switch booted more slowly than the
servers, but the network was nonetheless operational about 45s before
corosync gave up trying to set up a quorum.
I can also see that internet access wasn't back until 1 minute after
the corosync timeout (the time sync event).
A simple restart of pve-cluster at about 9:50 restored the cluster to its
normal state.
Is this expected? I expected that corosync would set up a quorum once the
network was operational...
When was multicast working again? That might have taken longer, as IGMP
snooping and the querier on the switch might simply take longer to come
back up.
I don't have that info (or I don't know how to look it up in the logs;
/var/log/corosync is empty). I'm trying to plan an intentional blackout to
test things again with technicians on-site; we could get more info that day.
Corosync writes to syslog; there should be more to find there.
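For example (assuming a default Debian setup, where the journal is not
persistent across reboots):

# journalctl -u corosync
# grep -i corosync /var/log/syslog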
Post by Eneko Lacunza
Switch is an HPE 1820-24G J9980A; it's L2 but quite dumb. We have several
18x0 switches deployed with good results so far.
The switch may hold a log that shows its startup process.
Eneko Lacunza
2018-08-01 14:12:19 UTC
Hi,
Post by Alwin Antreich
Post by Eneko Lacunza
Post by Alwin Antreich
Post by Eneko Lacunza
Hi all,
This morning there was quite a long blackout which powered off a cluster of
3 Proxmox 5.1 servers.
All 3 servers are the same make and model, so they need the same amount of
time to boot.
When the power came back, the servers started correctly but corosync
couldn't set up a quorum.
I recommend against servers returning automatically to their previous power
state after a power loss. A manual start-up is better, as by then the admin
has made sure power is back to normal operation. This also reduces the
chance of breakage if there are subsequent power or hardware failures.
This is an off-site place with no knowledgeable sysadmins and the servers
don't have remote management cards. I'm sure they would screw up the boot :)
I'm afraid we have to take the risk. :)
A boot delay, if the servers have such a setting, or switchable UPS power
plugs might help. :)
Yes, I can do that at the GRUB level, that's no problem. But first I have
to know the correct amount of delay ;)
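
Something like this in /etc/default/grub, with a value comfortably longer
than the switch's boot time (180 is just a placeholder until we measure
it):

GRUB_TIMEOUT=180

and then:

# update-grub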
Post by Alwin Antreich
Post by Eneko Lacunza
Post by Alwin Antreich
Post by Eneko Lacunza
07:57:10 corosync start
07:57:15 first pmxcfs error quorum_initialize_failed: 2
07:57:52 network up
07:58:40 Corosync timeout
07:59:57 time sync works
What I can see is that the network switch booted more slowly than the
servers, but the network was nonetheless operational about 45s before
corosync gave up trying to set up a quorum.
I can also see that internet access wasn't back until 1 minute after
the corosync timeout (the time sync event).
A simple restart of pve-cluster at about 9:50 restored the cluster to its
normal state.
Is this expected? I expected that corosync would set up a quorum once the
network was operational...
When was multicast working again? That might have taken longer, as IGMP
snooping and the querier on the switch might simply take longer to come
back up.
I don't have that info (or I don't know how to look it up in the logs;
/var/log/corosync is empty). I'm trying to plan an intentional blackout to
test things again with technicians on-site; we could get more info that day.
Corosync writes to syslog; there should be more to find there.
There doesn't seem to be any more, as far as I can see:
# grep corosync /var/log/syslog
Aug  1 07:57:11 proxmox1 corosync[1697]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Aug  1 07:57:11 proxmox1 corosync[1697]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Aug  1 07:57:11 proxmox1 corosync[1697]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Aug  1 07:57:11 proxmox1 corosync[1697]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Aug  1 07:58:40 proxmox1 systemd[1]: corosync.service: Start operation timed out. Terminating.
Aug  1 07:58:40 proxmox1 systemd[1]: corosync.service: Unit entered failed state.
Aug  1 07:58:40 proxmox1 systemd[1]: corosync.service: Failed with result 'timeout'.
Aug  1 09:51:35 proxmox1 corosync[32220]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.

This last line is our manual pve-cluster restart.
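Roughly:

# systemctl restart pve-cluster corosync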
Post by Alwin Antreich
Post by Eneko Lacunza
Switch is an HPE 1820-24G J9980A; it's L2 but quite dumb. We have several
18x0 switches deployed with good results so far.
The switch may hold a log that shows its startup process.
It seems it was disabled; we have enabled it now.

Thanks
Eneko
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Klaus Darilion
2018-08-07 13:54:02 UTC
Just some rant: I think the solutions presented here are the wrong
approach to this problem. An HA cluster should recover automatically from
such simple failures (power loss, network outage) to achieve HA. If
manual intervention is necessary, then the whole thing should not be
called an "HA" cluster.

I know corosync is picky and, for example, does not start when a
configured network interface is not yet available. Hence, corosync
should be restarted automatically if it fails.
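
For example, an override could raise the start timeout and let systemd
retry without limit (an untested sketch; the 90s between corosync start
at 07:57:10 and the timeout at 07:58:40 looks like systemd's default
TimeoutStartSec of 90s):

# systemctl edit corosync

[Unit]
StartLimitIntervalSec=0

[Service]
TimeoutStartSec=10min
Restart=on-failure
RestartSec=30

With that, systemd would keep trying to start corosync until the network
(and multicast) is really back, instead of giving up after one attempt.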

Introducing a "sleep" until the network is available is also a dirty
workaround; the problem is the cluster software. The software should
try to form a cluster endlessly (why should "HA" software give up?).
Would a Mars rover give up and shut down if it could not ping Earth
for a few days? Probably not; it would keep trying endlessly.

regards
Klaus
Post by Eneko Lacunza
Hi all,
This morning there was quite a long blackout which powered off a cluster
of 3 Proxmox 5.1 servers.
All 3 servers are the same make and model, so they need the same amount
of time to boot.
When the power came back, the servers started correctly but corosync
couldn't set up a quorum.
07:57:10 corosync start
07:57:15 first pmxcfs error quorum_initialize_failed: 2
07:57:52 network up
07:58:40 Corosync timeout
07:59:57 time sync works
What I can see is that the network switch booted more slowly than the
servers, but the network was nonetheless operational about 45s before
corosync gave up trying to set up a quorum.
I can also see that internet access wasn't back until 1 minute after
the corosync timeout (the time sync event).
A simple restart of pve-cluster at about 9:50 restored the cluster to
its normal state.
Is this expected? I expected that corosync would set up a quorum once
the network was operational...
Thanks a lot
Eneko