Discussion:
[PVE-User] I lost the cluster communication in a 10 nodes cluster
Denis Morejon
2018-10-12 16:57:50 UTC
All 10 nodes lost communication with each other, after working fine
for a month. They all run version 5.1.


All nodes have the same date/time and show a status like this:

***@proxmox11:~# pvecm status

Quorum information
------------------
Date:             Fri Oct 12 11:55:59 2018
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000007
Ring ID:          7/60372
Quorate:          No

Votequorum information
----------------------
Expected votes:   10
Highest expected: 10
Total votes:      1
Quorum:           6 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000007          1 192.168.80.11 (local)
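
To make the numbers in this output easier to read, here is a minimal Python sketch that pulls the vote counts out of the text and reports how many votes are missing. The helper names `parse_status` and `summarize` are hypothetical, and the parsing only assumes the simple "Key: value" layout seen above; it is not an official Proxmox tool.

```python
import re

# Sample text copied from the `pvecm status` output shown above.
SAMPLE = """\
Quorum information
------------------
Date:             Fri Oct 12 11:55:59 2018
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000007
Ring ID:          7/60372
Quorate:          No

Votequorum information
----------------------
Expected votes:   10
Highest expected: 10
Total votes:      1
Quorum:           6 Activity blocked
"""


def parse_status(text):
    """Collect 'Key: value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^([A-Za-z ]+):\s+(.*)$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


def summarize(fields):
    """Extract the vote counts and compute how many votes are missing."""
    expected = int(fields["Expected votes"])
    total = int(fields["Total votes"])
    # The Quorum line may carry a trailing note such as "Activity blocked".
    quorum = int(fields["Quorum"].split()[0])
    return {
        "expected": expected,
        "total": total,
        "quorum": quorum,
        "missing": max(quorum - total, 0),
        "quorate": fields["Quorate"].lower() == "yes",
    }


if __name__ == "__main__":
    print(summarize(parse_status(SAMPLE)))
```

For the output above, the node sees only its own vote, so it is 5 votes short of the quorum of 6.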
Thomas Lamprecht
2018-10-15 07:57:08 UTC
The 10 nodes lost the communication with each other. And they were working fine for a month. They all have version 5.1.
Any environment changes? E.g., a switch change or a software update
(which could then block multicast)?

Can you also check whether the omping test still goes through:
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_cluster_network
Denis Morejon
2018-10-15 16:26:01 UTC
I upgraded all the Proxmox nodes (from both the Debian repo and the
Proxmox repo) and all is fine again!

Thank you.
Denis Morejon
2018-10-15 16:46:33 UTC
Is multicast communication the main cause of Proxmox cluster file system
problems?

Why do date and time sometimes have something to do with cluster errors?

From my point of view, cluster communication errors are the most
critical errors, since they affect all VMs, keeping them from starting
again because there is no quorum.

Are there any tips (or steps) to fix it or to avoid it?
Denis Morejon
2018-10-18 16:21:51 UTC
I lost the cluster communication again.

I have been using Proxmox since version 1, and this is the first time it
has bothered me so much!

- All 10 nodes have the same version

(pve-manager/5.2-9/4b30e8f9 (running kernel: 4.13.13-2-pve))

- They all have the same date/time (a mismatch there is one of the
things that could make them lose communication)

- The environment is identical (no new switch, no new server)


And why did all these nodes lose communication at the same time? With
10 nodes, at least 5 would have to have problems for the cluster to lose
quorum and then the connection. Is that true?

I think it is something related to this Proxmox version.

What should I do?
Alwin Antreich
2018-10-19 07:36:11 UTC
Hi,
Post by Denis Morejon
I lost the cluster communication again.
I have been using Proxmox since version 1, and this is the first time It
bothers me so much!
- All the 10 nodes have the same version
(pve-manager/5.2-9/4b30e8f9 (running kernel: 4.13.13-2-pve))
Is there a reason why you use an old kernel? 4.15.x is now the main kernel.
Post by Denis Morejon
- They all have the same date/time (a mismatch there is one of the
things that could make them lose communication)
- The environment is identical (no new switch, no new server)
And why did all these nodes lose communication at the same time? With
10 nodes, at least 5 would have to have problems for the cluster to lose
quorum and then the connection. Is that true?
Actually, it is (10/2)-1 = 4 nodes that can have trouble without losing quorum;
one partition needs to be bigger than the rest, i.e. hold at least 6 of the 10 votes.
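
The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every node carries exactly one vote (as in the status output earlier in the thread) and plain majority voting is used, without any special votequorum options. The function names are hypothetical.

```python
def quorum(expected_votes):
    """Votes needed for a strict majority of the expected votes."""
    return expected_votes // 2 + 1


def max_failures(expected_votes):
    """Nodes that may be unreachable while the remainder stays quorate."""
    return expected_votes - quorum(expected_votes)


def is_quorate(partition_votes, expected_votes):
    """True if a partition holding this many votes has quorum."""
    return partition_votes >= quorum(expected_votes)


if __name__ == "__main__":
    n = 10
    print("quorum:", quorum(n))
    print("tolerated failures:", max_failures(n))
    # An even 5/5 split leaves neither half with a majority.
    print("5 of 10 quorate:", is_quorate(5, n))
    # A single isolated node, as in the status output, is far from quorate.
    print("1 of 10 quorate:", is_quorate(1, n))
```

With 10 expected votes this gives a quorum of 6, matching the "Quorum: 6" line in the status output, and 4 tolerated failures. Note that the (n/2)-1 shortcut only holds for an even node count; for an odd count such as 3, the same functions give a quorum of 2 and 1 tolerated failure.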
Post by Denis Morejon
I think it is something related to this proxmox version.
What to do ?
As Thomas stated, check your multicast traffic. Corosync uses multicast for
its cluster communication, and the cluster filesystem sits on top of
corosync. So, if corosync is not working, neither is pmxcfs.

--
Cheers,
Alwin
p***@elchaka.de
2018-10-27 08:39:35 UTC
Hi Denis,

I don't know why it happens, but I would try switching to unicast. This helped me in the past when I had a similar issue. I thought it had to be an issue with the kernel, where some default value had changed...

- Mehmet
Gerald Brandt
2018-10-27 15:01:52 UTC
That happened to me a few versions back on a 3 node cluster. I had to
switch heartbeat from multicast to unicast to keep things stable. There
were no changes in any of my equipment when this happened.


Gerald
Lindsay Mathieson
2018-10-27 23:17:26 UTC
Post by Gerald Brandt
That happened to me a few versions back on a 3 node cluster. I had to
switch heartbeat from multicast to unicast to keep things stable.
There were no changes in any of my equipment when this happened.
Same here, been running unicast since.
--
Lindsay