Discussion:
[PVE-User] Cluster doesn't recover automatically after blackout
Eneko Lacunza
2018-08-01 09:02:18 UTC
Hi all,

This morning there was quite a long blackout which powered off a cluster
of 3 Proxmox 5.1 servers.

All 3 servers are the same make and model, so they need the same amount
of time to boot.

When the power came back, the servers started correctly but corosync
couldn't set up a quorum. Timing of events:

07:57:10 corosync start
07:57:15 first pmxcfs error quorum_initialize_failed: 2
07:57:52 network up
07:58:40 Corosync timeout
07:59:57 time sync works

What I can see is that the network switch booted more slowly than the
servers, but the network was nonetheless operational about 45s before
corosync gave up trying to set up a quorum.

I can also see that internet access wasn't back until 1 minute after
the corosync timeout (the time sync event).

A simple restart of pve-cluster at about 9:50 restored the cluster to
its normal state.

Is this expected? I expected that corosync would set up a quorum once
the network was operational...

# pveversion -v
proxmox-ve: 5.1-41 (running kernel: 4.13.13-6-pve)
pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
pve-kernel-4.13.13-6-pve: 4.13.13-41
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 12.2.2-pve1
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-common-perl: 5.0-28
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-17
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-2
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-11
pve-cluster: 5.0-20
pve-container: 2.0-19
pve-docs: 5.1-16
pve-firewall: 3.0-5
pve-firmware: 2.0-3
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.9.1-9
pve-xtermjs: 1.0-2
qemu-server: 5.0-22
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.6-pve1~bpo9

Thanks a lot
Eneko
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Alwin Antreich
2018-08-01 10:56:05 UTC
Hi,
Post by Eneko Lacunza
Hi all,
This morning there was quite a long blackout which powered off a cluster of
3 Proxmox 5.1 servers.
All 3 servers are the same make and model, so they need the same amount of
time to boot.
When the power came back, the servers started correctly but corosync
couldn't set up a quorum.
I recommend against servers returning automatically to their previous power
state after a power loss. A manual start-up is better, as by then the admin
has made sure power is back to normal operation. This also reduces the
chance of breakage if there are subsequent power or hardware failures.
Post by Eneko Lacunza
07:57:10 corosync start
07:57:15 first pmxcfs error quorum_initialize_failed: 2
07:57:52 network up
07:58:40 Corosync timeout
07:59:57 time sync works
What I can see is that the network switch booted more slowly than the
servers, but the network was nonetheless operational about 45s before
corosync gave up trying to set up a quorum.
I can also see that internet access wasn't back until 1 minute after
the corosync timeout (the time sync event).
A simple restart of pve-cluster at about 9:50 restored the cluster to its
normal state.
Is this expected? I expected that corosync would set up a quorum once the
network was operational...
When was multicast working again? That might have taken longer, as IGMP
snooping and the querier on the switch might simply take longer to come
back up.
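
You could also test multicast between the nodes directly, for example
with omping (just a sketch; replace the hostnames with your three nodes
and install omping on all of them first):

# omping -c 600 -i 1 -q proxmox1 proxmox2 proxmox3

If multicast only starts working some minutes after the switch boots,
that would explain why corosync could not form a quorum in time.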


--
Cheers,
Alwin
Eneko Lacunza
2018-08-01 11:40:34 UTC
Hi Alwin,

Post by Alwin Antreich
I recommend against servers returning automatically to their previous power
state after a power loss. A manual start-up is better, as by then the admin
has made sure power is back to normal operation. This also reduces the
chance of breakage if there are subsequent power or hardware failures.
This is an off-site place with no knowledgeable sysadmins and the servers
don't have remote management cards. I'm sure they would screw up the boot :)
I'm afraid we have to take the risk. :)
Post by Alwin Antreich
When was multicast working again? That might have taken longer, as IGMP
snooping and the querier on the switch might simply take longer to come
back up.
I don't have that info (or I don't know how to look it up in the logs;
/var/log/corosync is empty). I'm trying to plan an intentional blackout to
test things again with technicians on-site; we could get more info that day.

Switch is an HPE 1820-24G J9980A; it's L2 but quite dumb. We have several
18x0 switches deployed with good results so far.
Alwin Antreich
2018-08-01 11:57:09 UTC
Post by Eneko Lacunza
Hi Alwin,
Post by Alwin Antreich
Post by Eneko Lacunza
Hi all,
This morning there was quite a long blackout which powered off a cluster of
3 Proxmox 5.1 servers.
All 3 servers are the same make and model, so they need the same amount of
time to boot.
When the power came back, the servers started correctly but corosync
couldn't set up a quorum.
I recommend against servers returning automatically to their previous power
state after a power loss. A manual start-up is better, as by then the admin
has made sure power is back to normal operation. This also reduces the
chance of breakage if there are subsequent power or hardware failures.
This is an off-site place with no knowledgeable sysadmins and the servers
don't have remote management cards. I'm sure they would screw up the boot :)
I'm afraid we have to take the risk. :)
A boot delay, if the servers have such a setting, or switchable UPS power
plugs might help. :)
Post by Eneko Lacunza
Post by Alwin Antreich
Post by Eneko Lacunza
07:57:10 corosync start
07:57:15 first pmxcfs error quorum_initialize_failed: 2
07:57:52 network up
07:58:40 Corosync timeout
07:59:57 time sync works
What I can see is that the network switch booted more slowly than the
servers, but the network was nonetheless operational about 45s before
corosync gave up trying to set up a quorum.
I can also see that internet access wasn't back until 1 minute after
the corosync timeout (the time sync event).
A simple restart of pve-cluster at about 9:50 restored the cluster to its
normal state.
Is this expected? I expected that corosync would set up a quorum once the
network was operational...
When was multicast working again? That might have taken longer, as IGMP
snooping and the querier on the switch might simply take longer to come
back up.
I don't have that info (or I don't know how to look it up in the logs;
/var/log/corosync is empty). I'm trying to plan an intentional blackout to
test things again with technicians on-site; we could get more info that day.
Corosync writes to syslog; there should be more to find there.
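For example (assuming a default Debian setup, where the journal is not
persistent across reboots):

# journalctl -u corosync
# grep -i corosync /var/log/syslog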
Post by Eneko Lacunza
Switch is an HPE 1820-24G J9980A; it's L2 but quite dumb. We have several
18x0 switches deployed with good results so far.
The switch may hold a log that shows its startup process.
Eneko Lacunza
2018-08-01 14:12:19 UTC
Hi,
Post by Alwin Antreich
Post by Eneko Lacunza
Post by Alwin Antreich
Post by Eneko Lacunza
Hi all,
This morning there was quite a long blackout which powered off a cluster of
3 Proxmox 5.1 servers.
All 3 servers are the same make and model, so they need the same amount of
time to boot.
When the power came back, the servers started correctly but corosync
couldn't set up a quorum.
I recommend against servers returning automatically to their previous power
state after a power loss. A manual start-up is better, as by then the admin
has made sure power is back to normal operation. This also reduces the
chance of breakage if there are subsequent power or hardware failures.
This is an off-site place with no knowledgeable sysadmins and the servers
don't have remote management cards. I'm sure they would screw up the boot :)
I'm afraid we have to take the risk. :)
A boot delay, if the servers have such a setting, or switchable UPS power
plugs might help. :)
Yes, I can do that at the GRUB level, that's no problem. But first I have
to know the correct amount of delay ;)
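
Something like this in /etc/default/grub, with a value comfortably longer
than the switch's boot time (180 is just a placeholder until we measure
it):

GRUB_TIMEOUT=180

and then:

# update-grub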
Post by Alwin Antreich
Post by Eneko Lacunza
Post by Alwin Antreich
Post by Eneko Lacunza
07:57:10 corosync start
07:57:15 first pmxcfs error quorum_initialize_failed: 2
07:57:52 network up
07:58:40 Corosync timeout
07:59:57 time sync works
What I can see is that the network switch booted more slowly than the
servers, but the network was nonetheless operational about 45s before
corosync gave up trying to set up a quorum.
I can also see that internet access wasn't back until 1 minute after
the corosync timeout (the time sync event).
A simple restart of pve-cluster at about 9:50 restored the cluster to its
normal state.
Is this expected? I expected that corosync would set up a quorum once the
network was operational...
When was multicast working again? That might have taken longer, as IGMP
snooping and the querier on the switch might simply take longer to come
back up.
I don't have that info (or I don't know how to look it up in the logs;
/var/log/corosync is empty). I'm trying to plan an intentional blackout to
test things again with technicians on-site; we could get more info that day.
Corosync writes to syslog; there should be more to find there.
There doesn't seem to be any more, as far as I can see:
# grep corosync /var/log/syslog
Aug  1 07:57:11 proxmox1 corosync[1697]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Aug  1 07:57:11 proxmox1 corosync[1697]: notice  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.
Aug  1 07:57:11 proxmox1 corosync[1697]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Aug  1 07:57:11 proxmox1 corosync[1697]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
Aug  1 07:58:40 proxmox1 systemd[1]: corosync.service: Start operation timed out. Terminating.
Aug  1 07:58:40 proxmox1 systemd[1]: corosync.service: Unit entered failed state.
Aug  1 07:58:40 proxmox1 systemd[1]: corosync.service: Failed with result 'timeout'.
Aug  1 09:51:35 proxmox1 corosync[32220]:  [MAIN  ] Corosync Cluster Engine ('2.4.2-dirty'): started and ready to provide service.

This last line is our manual pve-cluster restart.
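Roughly:

# systemctl restart pve-cluster corosync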
Post by Alwin Antreich
Post by Eneko Lacunza
Switch is an HPE 1820-24G J9980A; it's L2 but quite dumb. We have several
18x0 switches deployed with good results so far.
The switch may hold a log that shows its startup process.
It seems it was disabled; we have enabled it now.

Thanks
Eneko
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Klaus Darilion
2018-08-07 13:54:02 UTC
Just some rant: I think the solutions presented here are the wrong
approach to this problem. An HA cluster should recover automatically from
such simple failures (power loss, network outage) to achieve HA. If
manual intervention is necessary, then the whole thing should not be
called an "HA" cluster.

I know corosync is picky and, for example, does not start when a
configured network interface is not yet available. Hence, corosync
should be restarted automatically if it fails.
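
For example, an override could raise the start timeout and let systemd
retry without limit (an untested sketch; the 90s between corosync start
at 07:57:10 and the timeout at 07:58:40 looks like systemd's default
TimeoutStartSec of 90s):

# systemctl edit corosync

[Unit]
StartLimitIntervalSec=0

[Service]
TimeoutStartSec=10min
Restart=on-failure
RestartSec=30

With that, systemd would keep trying to start corosync until the network
(and multicast) is really back, instead of giving up after one attempt.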

Introducing a "sleep" until the network is available is also a dirty
workaround; the problem is the cluster software. The software should
try to form a cluster endlessly (why should "HA" software give up?).
Would a Mars rover give up and shut down if it could not ping Earth
for a few days? Probably not; it would keep trying endlessly.

regards
Klaus
Post by Eneko Lacunza
Hi all,
This morning there was quite a long blackout which powered off a cluster
of 3 Proxmox 5.1 servers.
All 3 servers are the same make and model, so they need the same amount
of time to boot.
When the power came back, the servers started correctly but corosync
couldn't set up a quorum.
07:57:10 corosync start
07:57:15 first pmxcfs error quorum_initialize_failed: 2
07:57:52 network up
07:58:40 Corosync timeout
07:59:57 time sync works
What I can see is that the network switch booted more slowly than the
servers, but the network was nonetheless operational about 45s before
corosync gave up trying to set up a quorum.
I can also see that internet access wasn't back until 1 minute after
the corosync timeout (the time sync event).
A simple restart of pve-cluster at about 9:50 restored the cluster to
its normal state.
Is this expected? I expected that corosync would set up a quorum once
the network was operational...
Thanks a lot
Eneko