Eneko Lacunza
2018-08-01 09:02:18 UTC
Hi all,
This morning a fairly long blackout powered off a cluster of 3 Proxmox
5.1 servers.
All 3 servers are the same make and model, so they need the same amount
of time to boot.
When the power came back, the servers started correctly but corosync
couldn't set up a quorum. Event timeline:
07:57:10 corosync start
07:57:15 first pmxcfs error quorum_initialize_failed: 2
07:57:52 network up
07:58:40 Corosync timeout
07:59:57 time sync works
What I can see is that the network switch booted more slowly than the
servers, but the network was nonetheless operational about 45s before
corosync gave up trying to set up a quorum.
I can also see that internet access wasn't back until about 1 minute
after the corosync timeout (the time sync event).
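For reference, the intervals between the events above can be checked
with a short Python sketch. The timestamps are copied from the list
above; the dictionary keys and the helper name seconds_between are only
labels chosen for illustration, and all events are assumed to fall on
the same day.

```python
from datetime import datetime

# Timestamps copied from the event list above (assumed same day, local time).
events = {
    "corosync start": "07:57:10",
    "first pmxcfs error": "07:57:15",
    "network up": "07:57:52",
    "corosync timeout": "07:58:40",
    "time sync works": "07:59:57",
}


def seconds_between(first, second):
    """Return the number of seconds from event `first` to event `second`."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(events[first], fmt)
    end = datetime.strptime(events[second], fmt)
    return int((end - start).total_seconds())


# How long corosync ran before the network came up.
network_delay = seconds_between("corosync start", "network up")
# How long the network was up before corosync timed out.
window = seconds_between("network up", "corosync timeout")
# How long after the corosync timeout time sync started working.
internet_lag = seconds_between("corosync timeout", "time sync works")

print(f"corosync start -> network up:      {network_delay}s")
print(f"network up -> corosync timeout:     {window}s")
print(f"corosync timeout -> time sync works: {internet_lag}s")
```

So the network was reachable for a little under a minute before
corosync timed out, and time sync only worked a bit more than a minute
after that.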
A simple restart of pve-cluster at about 9:50 restored the cluster to
its normal state.
Is this expected? I expected corosync to set up a quorum once the
network was operational.
# pveversion -v
proxmox-ve: 5.1-41 (running kernel: 4.13.13-6-pve)
pve-manager: 5.1-46 (running version: 5.1-46/ae8241d4)
pve-kernel-4.13.13-6-pve: 4.13.13-41
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 12.2.2-pve1
corosync: 2.4.2-pve3
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-common-perl: 5.0-28
libpve-guest-common-perl: 2.0-14
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-17
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-2
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-11
pve-cluster: 5.0-20
pve-container: 2.0-19
pve-docs: 5.1-16
pve-firewall: 3.0-5
pve-firmware: 2.0-3
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.9.1-9
pve-xtermjs: 1.0-2
qemu-server: 5.0-22
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.6-pve1~bpo9
Thanks a lot
Eneko
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es