[PVE-User] Cluster disaster
Dhaussy Alexandre
2016-11-09 13:46:50 UTC
Hello,

I have a big problem on my cluster (1500 HA VMs); storage is LVM + SAN (around 70 PVs, 2000 LVs).
Problems began when adding a new node to the cluster…

All nodes crashed and rebooted (this happened yesterday).
After some work I managed to get everything back online, but some nodes stayed down (hardware problem).
3 or 4 nodes are still powered off because we don't know if they caused the issue..

So this morning we tried to add all the nodes back; I believe someone did something wrong…
Everything has been rebooted again, but now the VMs can't start.
Only 2 of 7 nodes seem to start VMs; all other nodes seem stuck.

I have quorum with 7 nodes. I tried to reboot the master, but nothing happens in the GUI (the master stays the same, even though it is powered down).

Is there any way to force the election of a new master?
After rebooting all nodes… I'm out of ideas…
I wanted to remove the VMs from HA and start them locally, but I can't even do that (nothing happens).

/etc/pve/nodes/proxmoxt21/lrm_status:{"timestamp":1478698319,"results":{},"state":"wait_for_agent_lock","mode":"active"}
/etc/pve/nodes/proxmoxt25/lrm_status:{"state":"wait_for_agent_lock","results":{},"timestamp":1478698315,"mode":"active"}
/etc/pve/nodes/proxmoxt26/lrm_status:{"results":{},"mode":"active","timestamp":1478693656,"state":"wait_for_agent_lock"}
/etc/pve/nodes/proxmoxt30/lrm_status:{"mode":"active","results":{},"state":"wait_for_agent_lock","timestamp":1478698319}
/etc/pve/nodes/proxmoxt31/lrm_status:{"state":"wait_for_agent_lock","results":{},"timestamp":1478698319,"mode":"active"}
/etc/pve/nodes/proxmoxt34/lrm_status:{"state":"wait_for_agent_lock","results":{},"timestamp":1478698318,"mode":"active"}

***@proxmoxt21:~# pvecm status
Quorum information
------------------
Date: Wed Nov 9 14:40:17 2016
Quorum provider: corosync_votequorum
Nodes: 8
Node ID: 0x00000003
Ring ID: 9/988
Quorate: Yes

Votequorum information
----------------------
Expected votes: 13
Highest expected: 13
Total votes: 8
Quorum: 7
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000009 1 10.98.187.11
0x0000000a 1 10.98.187.12
0x0000000b 1 10.98.187.15
0x0000000c 1 10.98.187.36
0x00000003 1 10.98.187.40 (local)
0x00000001 1 10.98.187.41
0x00000002 1 10.98.187.42
0x00000008 1 10.98.187.47


Dietmar Maurer
2016-11-09 15:13:34 UTC
Post by Dhaussy Alexandre
I wanted to remove vms from HA and start the vms locally, but I can’t even do
that (nothing happens.)
How do you do that exactly (on the GUI)? You should be able to start them
manually afterwards.
Dhaussy Alexandre
2016-11-09 15:29:37 UTC
I tried to remove them from HA in the GUI, but nothing happens.
There are some services in "error" or "fence" state.

Now I tried to remove the non-working nodes from the cluster... but I
still see those nodes in /etc/pve/ha/manager_status.
Thomas Lamprecht
2016-11-09 16:00:14 UTC
Hi,
Post by Dhaussy Alexandre
I try to remove from ha in the gui, but nothing happends.
There are some services in "error" or "fence" state.
Now i tried to remove the non-working nodes from the cluster... but i
still see those nodes in /etc/pve/ha/manager_status.
Can you post the manager status please?

Also, are pve-ha-lrm and pve-ha-crm up and running without any error
on all nodes, at least on those in the quorate partition?

check with:
systemctl status pve-ha-lrm
systemctl status pve-ha-crm

If not, restart them, and if it's still problematic please post the output
of the systemctl status call (if it's the same on all nodes, one output should be enough).
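
For example, a minimal sketch to restart and re-check both HA services in one go on a node (plain systemctl usage, nothing else assumed):

systemctl restart pve-ha-lrm pve-ha-crm
systemctl status pve-ha-lrm pve-ha-crm --no-pager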
Post by Dhaussy Alexandre
Post by Dietmar Maurer
Post by Dhaussy Alexandre
I wanted to remove vms from HA and start the vms locally, but I can’t even do
that (nothing happens.)
You can remove them from HA by emptying the HA resource file (this also deletes
comments and group settings, but if you need to start them _now_ that shouldn't be a problem)

echo "" > /etc/pve/ha/resources.cfg

Afterwards you should be able to start them manually.
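
For example, a minimal sketch to then start one guest by hand (vmid 100 is just a placeholder):

qm start 100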
Dhaussy Alexandre
2016-11-09 16:33:43 UTC
Typo

- delnode on known NON-working nodes.
- delnode on known now-working nodes.
Dhaussy Alexandre
2016-11-09 16:40:21 UTC
Sorry my old message was too big...

Thanks for the input !...

I have attached manager_status files.
.old is the original file, and .new is the file I have modified and put
in /etc/pve/ha.

I know this is bad, but here's what I've done:

- delnode on known NON-working nodes.
- rm -Rf /etc/pve/nodes/x for all NON-working nodes.
- replace all NON-working nodes with working nodes in
/etc/pve/ha/manager_status.
- mv the VM .conf files into the proper node directory
(/etc/pve/nodes/x/qemu-server/), matching /etc/pve/ha/manager_status.
- restart pve-ha-crm and pve-ha-lrm on all nodes.

Now on several nodes I have these messages:

nov. 09 17:08:19 proxmoxt34 pve-ha-crm[26200]: status change startup => wait_for_quorum
nov. 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed: Transport endpoint is not connected
nov. 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed: Connection refused
nov. 09 17:12:04 proxmoxt34 pve-ha-crm[26200]: ipcc_send_rec failed: Connection refused

nov. 09 17:08:22 proxmoxt34 pve-ha-lrm[26282]: status change startup => wait_for_agent_lock
nov. 09 17:12:07 proxmoxt34 pve-ha-lrm[26282]: ipcc_send_rec failed: Transport endpoint is not connected

We are also investigating a possible network problem..
Dhaussy Alexandre
2016-11-09 17:05:46 UTC
I have done a cleanup of resources with echo "" >
/etc/pve/ha/resources.cfg

It seems to have resolved all problems with the inconsistent status of
lrm/crm in the GUI.

A new master has been elected. The manager_status file has been
cleaned up.
All nodes are idle or active.

I am re-adding all VMs to HA with "ha-manager add".
Seems to work now... :-/
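
For example, a minimal sketch for one guest (vm:100 is just a placeholder service ID):

ha-manager add vm:100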
Thomas Lamprecht
2016-11-09 19:54:04 UTC
Post by Dhaussy Alexandre
I have done a cleanup of ressources with echo "" >
/etc/pve/ha/resources.cfg
It seems to have resolved all problems with inconsistent status of
lrm/lcm in the GUI.
Good. The logs would be interesting to see what went wrong, but I do not
know how much I can skim through them, as your setup is not small and there
may be a lot of noise from the outage in there.

If you have time you may send me the log file(s) generated by:

journalctl --since "-2 days" -u corosync -u pve-ha-lrm -u pve-ha-crm -u pve-cluster > pve-log-$(hostname).log

(adapt the "-2 days" accordingly; it also understands something like "-1 day 3 hours")

Send them directly to my address (the list does not accept bigger attachments;
the limit is something like 20-20 kb AFAIK).
I cannot promise any deep examination, but I can skim through them and
look what happened in the HA stack, maybe I see something obvious.
Post by Dhaussy Alexandre
A new master have been elected. The manager_status file have been
cleaned up.
All nodes are idle or active.
I am re-starting all vms in ha with "ha manager add".
Seems to work now... :-/
Post by Dhaussy Alexandre
Sorry my old message was too big...
Thanks for the input !...
I have attached manager_status files.
.old is the original file, and .new is the file i have modified and put
in /etc/pve/ha.
- delnode on known NON-working nodes.
- rm -Rf /etc/pve/nodes/x for all NON-working nodes.
- replace all NON-working nodes with working nodes in
/etc/pve/ha/manager_status
- mv VM.conf files in the proper node directory
(/etc/pve/nodes/x/qemu-server/) in reference to /etc/pve/ha/manager_status
- restart pve-ha-crm and pve-ha-lrm on all nodes
nov. 09 17:08:19 proxmoxt34 pve-ha-crm[26200]: status change startup =>
wait_for_quorum
Transport endpoint is not connected
Connection refused
Connection refused
This means that something with the cluster filesystem (pve-cluster) was not OK.
Those messages weren't there previously?
Post by Dhaussy Alexandre
Post by Dhaussy Alexandre
nov. 09 17:08:22 proxmoxt34 pve-ha-lrm[26282]: status change startup =>
wait_for_agent_lock
Transport endpoint is not connected
We are also investigating a possible network problem..
Multicast properly working?
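
(If in doubt, a quick way to verify multicast between the cluster nodes is omping; a sketch, assuming omping is installed and node1/node2/node3 stand in for your node names:

omping -c 10000 -i 0.001 -F -q node1 node2 node3

Run it on all listed nodes at the same time; lost multicast responses point to a snooping/querier problem.)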
Dhaussy Alexandre
2016-11-09 22:46:53 UTC
I had yet another outage...
BUT now everything is back online! Yay!

So I think I had (at least) two problems:

1 - When installing/upgrading a node.

If the node sees all the SAN storage LUNs before install, the Debian
partitioner tries to scan all the LUNs.
This causes almost all nodes to reboot (not sure why; maybe it causes
latency in the LVM cluster, or a problem with a lock somewhere.)

The same thing happens when f*$king os-prober runs on a kernel upgrade.
It scans all LVs and causes node reboots. So now I make sure of this in
/etc/default/grub => GRUB_DISABLE_OS_PROBER=true
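
(For reference, a minimal sketch: with GRUB_DISABLE_OS_PROBER=true set in /etc/default/grub, os-prober is skipped the next time the grub config is regenerated; to apply and verify it right away you can run

update-grub

by hand and check that no os-prober scan shows up in its output.)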

2 - There seems to be a bug in the LRM.

Tonight I have seen timeouts of qmstart tasks in /var/log/pve/tasks/active.
Just after the timeouts, the LRM was kind of stuck, doing nothing.
Services began to start again after I restarted the service; anyway, a
few seconds later the nodes got fenced.

I think the timeouts are due to a bottleneck in our storage switches; I
have a few messages like this:

Nov 9 22:34:40 proxmoxt25 kernel: [ 5389.318716] qla2xxx [0000:08:00.1]-801c:2: Abort command issued nexus=2:2:28 -- 1 2002.
Nov 9 22:34:41 proxmoxt25 kernel: [ 5390.482259] qla2xxx [0000:08:00.1]-801c:2: Abort command issued nexus=2:1:28 -- 1 2002.

So when all nodes rebooted, I may have hit the bottleneck, then the LRM
bug, and all HA services were frozen... (this happened several times.)


Thanks again for the help.
Alexandre.
Thomas Lamprecht
2016-11-10 10:39:53 UTC
Post by Dhaussy Alexandre
I had again another outage...
BUT now everything is back online ! yay !
1 - When installing/upgrading a node.
If the node sees all SAN storages LUN before install, debian
partitionner tries to scan all LUNs..
This causes almost all nodes to reboot (not sure why, maybe it causes
latency in lvm cluster, or a problem with a lock somewhere..)
Same thing happens when f*$king os_prober spawns out on kernel upgrade.
It scans all LVs and causes nodes reboots. So now i make sure of this in
/etc/default/grub => GRUB_DISABLE_OS_PROBER=true
Yes, os-prober is _bad_ and may even corrupt some filesystems under some
conditions, AFAIK.
The Proxmox VE ISO does not include it for this reason.
Post by Dhaussy Alexandre
2 - There seems to be a bug in lrm.
Tonight i have seen timeouts in qmstarts in /var/log/pve/tasks/active.
Just after the timeouts, lrm was kind of stuck doing nothing.
If it's doing nothing, it would be interesting to see which state it is in,
because if it's already online and active the watchdog must trigger if
it is stuck for ~60 seconds or more.
Post by Dhaussy Alexandre
Services began to start again after i restarted the service, anyway a
few seconds after, the nodes got fenced.
Hmm, this means the watchdog was already running out.
Post by Dhaussy Alexandre
I think the timeouts are due to a bottleneck in our storage switches:
Nov 9 22:34:40 proxmoxt25 kernel: [ 5389.318716] qla2xxx
[0000:08:00.1]-801c:2: Abort command issued nexus=2:2:28 -- 1 2002.
Nov 9 22:34:41 proxmoxt25 kernel: [ 5390.482259] qla2xxx
[0000:08:00.1]-801c:2: Abort command issued nexus=2:1:28 -- 1 2002.
So when all nodes rebooted, i may have hit the bottleneck, then the lrm
bug, and all ha services were frozen... (happened several times.)
Yeah, I looked a bit through the logs of two of your nodes; it looks like the
system hit quite some bottlenecks.
The CRM/LRM often run into 'loop took too long' errors, and the filesystem is
also sometimes not writable.
You have some huge corosync retransmit lists in some of the logs.

Where does your cluster communication happen, hopefully not on the storage network?


A few general hints:

The HA stack does not like it when somebody moves the VM configs around
for a VM in the started/migrate state.
If it's stopped, it's OK, as then it can fix up the VM location. Otherwise
it cannot simply fix up the location, as it does not know whether the resource
still runs on the (old) node.

Modifying the manager status does not work if a manager is currently
elected.
The manager reads it only on its transition from slave to master, to get
the last state into memory.
After that it just writes it out, so that on a master re-election the new
master has the most current state.

So if something as bad as this happens again, I'd do the following:

If no master election happens, but there is a quorate partition of nodes
and you are sure that their pve-ha-crm service is up and running (else
restart it first), you can try to trigger an instant master re-election by
deleting the old master's lock (which may not yet be invalid through
timeout):
rmdir /etc/pve/priv/lock/ha_manager_lock/

If then a master election happens you should be fine and the HA stack
will do its work and recover.

If you have to move the VMs you should disable them first; "ha-manager
disable SID" does that quite well in a lot of problematic situations,
as it just edits the resources.cfg.
If this does not work, you have no quorum or pve-cluster has a problem,
which both mean HA recovery cannot take place on this node one way or
the other.
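
For example, a minimal sketch for a single service (vm:100 is just a placeholder service ID):

ha-manager disable vm:100

After that the HA stack will not try to start or recover that resource until it is enabled again.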
Dhaussy Alexandre
2016-11-11 14:56:32 UTC
I really hope to find an explanation for all this mess,
because I'm not very confident right now..

So far, if I understand all this correctly... I'm not very fond of how the watchdog behaves with the CRM/LRM.
To make a comparison with PVE 3 (Red Hat cluster), fencing happened at the corosync/cluster communication level, but not at the resource manager level.

On PVE 3, I found several times that rgmanager was stuck.
I just had to find the culprit process (usually pve status), kill it, et voila.
But it never caused an outage.
Post by Thomas Lamprecht
Post by Dhaussy Alexandre
2 - There seems to be a bug in lrm.
Tonight i have seen timeouts in qmstarts in /var/log/pve/tasks/active.
Just after the timeouts, lrm was kind of stuck doing nothing.
If it's doing nothing it would be interesting to see in which state it is.
Because if it's already online and active the watchdog must trigger if
it is stuck for ~60 seconds or more.
I'll try to grab some info if it happens again.
Post by Thomas Lamprecht
Hmm, this means the watchdog was already running out.
Do you have a hint why there are no messages in the logs when the watchdog actually seems to trigger fencing?
Because when a node suddenly reboots, I can't be sure whether it was the watchdog, a hardware bug, a kernel bug or whatever..
Post by Thomas Lamprecht
Yeah I looked a bit through logs of two of your nodes, it looks like the
system hit quite some bottle necks..
CRM/LRM run often in 'loop took to long' errors the filesystem also is
sometimes not writable.
You have in some logs some huge retransmit list from corosync.
Yes, there were a lot of retransmits on "9 Nov 14:56".
This matches when we tried to switch network paths, because at that time the nodes did not seem to talk to each other correctly (lrm waiting for quorum.)

Anyway, I need to triple-check (again) IGMP snooping on all network switches,
plus check the HP blade Virtual Connects and firmware..
Post by Thomas Lamprecht
Where does your cluster communication happens, not on the storage network?
Storage is on Fibre Channel.
Cluster communication happens on a dedicated network VLAN (shared with VMware).
I also use another VLAN for live migrations.
Dhaussy Alexandre
2016-11-11 15:28:09 UTC
Post by Dhaussy Alexandre
Do you have a hint why there is no messages in the logs when watchdog
actually seems to trigger fencing ?
Because when a node suddently reboots, i can't be sure if it's the watchdog,
a hardware bug, kernel bug or whatever..
Responding to myself, I find this interesting:

Nov 8 10:39:01 proxmoxt35 corosync[35250]: [TOTEM ] A new membership (10.xx.xx.11:684) was formed. Members joined: 13
Nov 8 10:39:58 proxmoxt35 watchdog-mux[28239]: client watchdog expired - disable watchdog updates

Nov 8 10:39:01 proxmoxt31 corosync[23483]: [TOTEM ] A new membership (10.xx.xx.11:684) was formed. Members joined: 13
Nov 8 10:40:01 proxmoxt31 watchdog-mux[22395]: client watchdog expired - disable watchdog updates

Nov 8 10:39:01 proxmoxt30 corosync[24634]: [TOTEM ] A new membership (10.xx.xx.11:684) was formed. Members joined: 13
Nov 8 10:40:00 proxmoxt30 watchdog-mux[23492]: client watchdog expired - disable watchdog updates


Nov 9 10:05:41 proxmoxt20 corosync[42543]: [TOTEM ] A new membership (10.xx.xx.11:796) was formed. Members left: 7
Nov 9 10:05:46 proxmoxt20 corosync[42543]: [TOTEM ] A new membership (10.xx.xx.11:800) was formed. Members joined: 7
Nov 9 10:06:42 proxmoxt20 watchdog-mux[41401]: client watchdog expired - disable watchdog updates

Nov 9 10:05:41 proxmoxt21 corosync[16184]: [TOTEM ] A new membership (10.xx.xx.11:796) was formed. Members left: 7
Nov 9 10:05:46 proxmoxt21 corosync[16184]: [TOTEM ] A new membership (10.xx.xx.11:800) was formed. Members joined: 7
Nov 9 10:06:42 proxmoxt21 watchdog-mux[42853]: client watchdog expired - disable watchdog updates

Nov 9 10:05:41 proxmoxt30 corosync[16159]: [TOTEM ] A new membership (10.xx.xx.11:796) was formed. Members left: 7
Nov 9 10:05:46 proxmoxt30 corosync[16159]: [TOTEM ] A new membership (10.xx.xx.11:800) was formed. Members joined: 7
Nov 9 10:06:42 proxmoxt30 watchdog-mux[43148]: client watchdog expired - disable watchdog updates

Nov 9 10:05:41 proxmoxt31 corosync[16297]: [TOTEM ] A new membership (10.xx.xx.11:796) was formed. Members left: 7
Nov 9 10:05:46 proxmoxt31 corosync[16297]: [TOTEM ] A new membership (10.xx.xx.11:800) was formed. Members joined: 7
Nov 9 10:06:42 proxmoxt31 watchdog-mux[42761]: client watchdog expired - disable watchdog updates

Nov 9 10:05:41 proxmoxt34 corosync[41330]: [TOTEM ] A new membership (10.xx.xx.11:796) was formed. Members left: 7
Nov 9 10:05:46 proxmoxt34 corosync[41330]: [TOTEM ] A new membership (10.xx.xx.11:800) was formed. Members joined: 7
Nov 9 10:06:42 proxmoxt34 watchdog-mux[40262]: client watchdog expired - disable watchdog updates

Nov 9 10:05:41 proxmoxt35 corosync[16158]: [TOTEM ] A new membership (10.xx.xx.11:796) was formed. Members left: 7
Nov 9 10:05:46 proxmoxt35 corosync[16158]: [TOTEM ] A new membership (10.xx.xx.11:800) was formed. Members joined: 7
Nov 9 10:06:42 proxmoxt35 watchdog-mux[42684]: client watchdog expired - disable watchdog updates
Michael Rasmussen
2016-11-11 15:31:54 UTC
A long shot: do you have a hardware watchdog enabled in the BIOS?
Dhaussy Alexandre
2016-11-11 16:44:08 UTC
Post by Michael Rasmussen
A long shot. Do you have a hardware watchdog enabled in bios?
I didn't modify any BIOS parameters, except power management.
So I believe it's enabled.

The hpwdt module (HP iLO watchdog) is not loaded.
HP ASR is enabled (10 min timeout.)
ipmi_watchdog is blacklisted.
nmi_watchdog is enabled => I have seen "please disable this" in the Proxmox wiki, but there is no explanation of why you should do it. :)
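
(If it matters: a minimal sketch for turning the NMI watchdog off, assuming it is only controlled via the kernel command line here, would be adding nmi_watchdog=0 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, running update-grub and rebooting; the current state can be checked with

cat /proc/sys/kernel/nmi_watchdog

where 1 means it is enabled.)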
Dietmar Maurer
2016-11-11 16:43:23 UTC
Post by Dhaussy Alexandre
Nov 8 10:39:01 proxmoxt35 corosync[35250]: [TOTEM ] A new membership
(10.xx.xx.11:684) was formed. Members joined: 13
Nov 8 10:39:58 proxmoxt35 watchdog-mux[28239]: client watchdog expired -
disable watchdog updates
you lost quorum, and the watchdog expired - that is how the watchdog based
fencing works.
Dhaussy Alexandre
2016-11-11 17:41:20 UTC
Post by Dietmar Maurer
you lost quorum, and the watchdog expired - that is how the watchdog
based fencing works.
I don't expect to lose quorum when _one_ node joins or leaves the cluster.

Nov 8 10:38:58 proxmoxt20 pmxcfs[22537]: [status] notice: update cluster info (cluster name pxmcluster, version = 14)
Nov 8 10:39:01 proxmoxt20 corosync[22577]: [TOTEM ] A new membership (10.98.187.11:684) was formed. Members joined: 13
Nov 8 10:39:01 proxmoxt20 corosync[22577]: [QUORUM] Members[13]: 9 10 11 13 4 12 3 1 2 5 6 7 8
Nov 8 10:39:59 proxmoxt20 watchdog-mux[23964]: client watchdog expired - disable watchdog updates

Nov 8 10:39:01 proxmoxt35 corosync[35250]: [TOTEM ] A new membership (10.98.187.11:684) was formed. Members joined: 13
Nov 8 10:39:01 proxmoxt35 corosync[35250]: [QUORUM] Members[13]: 9 10 11 13 4 12 3 1 2 5 6 7 8
Nov 8 10:39:58 proxmoxt35 watchdog-mux[28239]: client watchdog expired - disable watchdog updates
Dietmar Maurer
2016-11-11 18:43:39 UTC
Post by Dhaussy Alexandre
Post by Dietmar Maurer
you lost quorum, and the watchdog expired - that is how the watchdog
based fencing works.
I don't expect to loose quorum when _one_ node joins or leave the cluster.
This was probably a long time before - but I have not read through the whole
logs ...
Dhaussy Alexandre
2016-11-14 10:50:57 UTC
On November 11, 2016 at 6:41 PM Dhaussy Alexandre
Post by Dhaussy Alexandre
Post by Dietmar Maurer
you lost quorum, and the watchdog expired - that is how the watchdog
based fencing works.
I don't expect to loose quorum when _one_ node joins or leave the cluster.
This was probably a long time before - but I have not read through the whole
logs ...
That makes no sense to me..
The fact is: everything had been working fine for weeks.

What I can see in the logs is: several sudden reboots of cluster nodes,
exactly one minute after one node joined and/or left the cluster.
I see no problems with corosync/lrm/crm before that.
This leads me to a probable network (multicast) malfunction.

I did a bit of homework, reading the wiki about ha-manager..

What I understand so far is that every state/service change from the LRM
must be acknowledged (cluster-wide) by the CRM master.
So if a multicast disruption occurs, and I assume the LRM wouldn't be able to
talk to the CRM MASTER, then it also couldn't reset the watchdog, am I
right?

Another thing; I have checked my network configuration, and the cluster IP
is set on a Linux bridge...
By default multicast_snooping is set to 1 on a Linux bridge, so I think
there's a good chance this is the source of my problems...
Note that we don't use IGMP snooping; it is disabled on almost all
network switches.
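
To rule the bridge out, a minimal sketch of what I plan to try (assuming the cluster IP sits on vmbr0; the bridge name may differ):

cat /sys/class/net/vmbr0/bridge/multicast_snooping
echo 0 > /sys/class/net/vmbr0/bridge/multicast_snooping

The echo only lasts until reboot; a post-up line for the bridge in /etc/network/interfaces would make it persistent.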

Plus I found a post by A. Derumier (yes, 3 years old..); he had
similar issues with a bridge and multicast.
http://pve.proxmox.com/pipermail/pve-devel/2013-March/006678.html
Thomas Lamprecht
2016-11-14 11:33:27 UTC
Post by Dhaussy Alexandre
On November 11, 2016 at 6:41 PM Dhaussy Alexandre
Post by Dhaussy Alexandre
Post by Dietmar Maurer
you lost quorum, and the watchdog expired - that is how the watchdog
based fencing works.
I don't expect to loose quorum when _one_ node joins or leave the cluster.
This was probably a long time before - but I have not read through the whole
logs ...
That makes no sense to me..
The fact is : everything have been working fine for weeks.
What i can see in the logs is : several reboots of cluster nodes
suddently, and exactly one minute after one node joining and/or leaving
the cluster.
The watchdog is set to a 60 second timeout, meaning that the cluster leave caused
quorum loss, or other problems (you said you had multicast problems around that
time); thus the LRM stopped updating the watchdog, so one minute later it reset
all the nodes which had left the quorate partition.
Post by Dhaussy Alexandre
I see no problems with corosync/lrm/crm before that.
This leads me to a probable network (multicast) malfunction.
I did a bit of homeworks reading the wiki about ha manager..
What i understand so far, is that every state/service change from LRM
must be acknowledged (cluster-wise) by CRM master.
Yes and no; the LRM and CRM are two state machines with synced inputs,
but that holds mainly for human-triggered commands and the resulting
communication.
Meaning that commands like start, stop, migrate may not go through from
the CRM to the LRM. Fencing and such stuff works nonetheless, else it
would be a major design flaw :)
Post by Dhaussy Alexandre
So if a multicast disruption occurs, and i assume LRM wouldn't be able
talk to the CRM MASTER, then it also couldn't reset the watchdog, am i
right ?
No, the watchdog runs on each node and is CRM independent.
As a watchdog is normally not able to serve more than one client, we wrote
the watchdog-mux (multiplexer).
This is a very simple C program which opens the watchdog with a
60 second timeout and allows multiple clients (at the moment the CRM
and LRM) to connect to it.
If a client does not reset the dog for about 10 seconds, IIRC, the
watchdog-mux disables watchdog updates on the real watchdog.
After that a node reset will happen *when* the dog runs out of time,
not instantly.

So if the LRM cannot communicate (i.e. has no quorum) it will stop
updating the dog and thus trigger, independent of what the CRM says or does.
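
(A quick way to check whether that is what happened on a node, a sketch assuming the default setup where watchdog-mux runs as its own systemd unit:

journalctl -u watchdog-mux -b -1

A "client watchdog expired - disable watchdog updates" entry shortly before the reset means the clients stopped feeding the dog.)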
Post by Dhaussy Alexandre
Another thing ; i have checked my network configuration, the cluster ip
is set on a linux bridge...
By default multicast_snooping is set to 1 on linux bridge, so i think it
there's a good chance this is the source of my problems...
Note that we don't use IGMP snooping, it is disabled on almost all
network switchs.
Yes, multicast snooping has to be configured (recommended) or else turned off on the switch.
That's stated in some wiki articles, various forum posts and our docs, here:
http://pve.proxmox.com/pve-docs/chapter-pvecm.html#cluster-network-requirements

Hope that helps a bit understanding. :)

cheers,
Thomas
Dhaussy Alexandre
2016-11-14 13:46:40 UTC
Post by Thomas Lamprecht
Hope that helps a bit understanding. :)
Sure, thank you for clearing things up. :)
I wish I had done this before, but I learned a lot in the last few days...
Dhaussy Alexandre
2016-11-22 16:35:08 UTC
...sequel to those thrilling adventures...
I _still_ have problems with nodes not joining the cluster properly after rebooting...

Here's what we did last night:

- Stopped ALL VMs (just to ensure no corruption happens in case of unexpected reboots...)
- Patched qemu from 2.6.1 to 2.6.2 to fix live migration issues.
- Removed the bridge (cluster network) on all nodes to fix the multicast issues (11 nodes total.)
- Patched all BIOSes and firmware (HP blade/HP iLO/Ethernet/Fibre Channel card) (13 nodes total.)
- Rebooted all nodes, one, two, or three servers simultaneously.

So far we had absolutely no problems; corosync stayed quorate and all nodes left and rejoined the cluster successfully.

- Added 2 nodes to the cluster, no problem at all...
- Started two VMs on two nodes, then cut the network on those nodes.
- As expected, the watchdog did its job killing the two nodes, and the VMs were relocated.... so far so good!

_Except_ that the two nodes were never able to join the cluster again after rebooting...

LVM takes so long to scan all PVs/LVs that, somehow, I believe it ends in an inconsistency when systemd starts the cluster services.
On the other nodes, I can actually see that corosync does a quick join/leave (and fails) right after booting...

Nov 22 02:07:52 proxmoxt21 corosync[22342]: [TOTEM ] A new membership (10.98.x.x:1492) was formed. Members joined: 10
Nov 22 02:07:52 proxmoxt21 corosync[22342]: [TOTEM ] A new membership (10.98.x.x:1496) was formed. Members left: 10
Nov 22 02:07:52 proxmoxt21 corosync[22342]: [CPG ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]: [CPG ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]: [CPG ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]: [CPG ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]: [CPG ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]: [CPG ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]: [CPG ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]: [CPG ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]: [CPG ] downlist left_list: 0 received in state 2
Nov 22 02:07:52 proxmoxt21 corosync[22342]: [QUORUM] Members[10]: 9 11 5 4 12 3 1 2 6 8
Nov 22 02:07:52 proxmoxt21 corosync[22342]: [MAIN ] Completed service synchronization, ready to provide service.

I tried several reboots... same problem. :(
I ended up removing the two freshly added nodes from the cluster, and restarted all VMs.

I don't know how, but I feel that every node I add to the cluster slows down the LVM scan a little more... until it ends up interfering with the cluster services at boot...
Recall that I have about 1500 VMs, 1600 LVs, 70 PVs on external SAN storage...

_Now_ I have a serious lead that this issue could be related to a known race condition between udev and multipath.
I have had this issue previously, but I didn't think it would interact with and cause issues for the cluster services... what do you think?
See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=799781

I quickly tried the workaround suggested here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=799781#32
(remove this rule from udev: ACTION=="add|change", SUBSYSTEM=="block", RUN+="/sbin/multipath -v0 /dev/$name")
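
(To locate that rule on a node, a small sketch; the exact rules file name may differ between multipath-tools versions:

grep -rl 'multipath -v0' /lib/udev/rules.d /etc/udev/rules.d

and then comment out the RUN+= line in the file it reports.)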

I can tell it boots _much_ faster, but I will need to give it another try and do proper testing to see if it fixes my issue...
Anyhow, I'm open to suggestions or thoughts that could enlighten me...

(And sorry for the long story)

Michael Rasmussen
2016-11-22 16:56:08 UTC
On Tue, 22 Nov 2016 16:35:08 +0000
Post by Dhaussy Alexandre
I don't know how, but i feel that every node i add to the cluster currently slows down LVM scan a little more...until it ends up interfering with cluster services at boot...
Maybe you need to tune the filter rules in /etc/lvm/lvm.conf.

My own rules as an inspiration:
# Do not scan ZFS zvols (to avoid problems on ZFS zvols snapshots)
global_filter = [ "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|" ]

# Only scan for volumes on local disk and on iSCSI target from Qnap NAS. Block scanning from all
# other block devices.
filter = [ "a|ata-OCZ-AGILITY3_OCZ-QMZN8K4967DA9NGO.*|", "a|scsi-36001405e38e9f02ddef9d4573db7a0d0|", "r|.*|" ]
--
Hilsen/Regards
Michael Rasmussen

Dhaussy Alexandre
2016-11-22 17:12:27 UTC
Post by Michael Rasmussen
On Tue, 22 Nov 2016 16:35:08 +0000
Post by Dhaussy Alexandre
I don't know how, but i feel that every node i add to the cluster currently slows down LVM scan a little more...until it ends up interfering with cluster services at boot...
Maybe you need to tune the filter rules in /etc/lvm/lvm.conf.
Yep, I already tuned the filters in the LVM config; before that I had "duplicate
PV" messages because of the multipath devices.
Anyway, if I'm not wrong, LVM still has a lot of LVs to activate at boot.

nov. 22 02:16:21 proxmoxt34 lvm[7279]: 1644 logical volume(s) in volume
group "T_proxmox_1" now active
nov. 22 02:16:21 proxmoxt34 lvm[7279]: 2 logical volume(s) in volume
group "proxmoxt34-vg" now active
Michael Rasmussen
2016-11-22 17:48:54 UTC
Have you tested your filter rules?
Dhaussy Alexandre
2016-11-22 18:04:39 UTC
Post by Michael Rasmussen
Have you tested your filter rules?
Yes, I set this filter at install:

global_filter = [ "r|sd[b-z].*|", "r|disk|", "r|dm-.*|",
"r|vm.*disk.*|", "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|", "a|.*|" ]
Michael Rasmussen
2016-11-22 18:18:44 UTC
On Tue, 22 Nov 2016 18:04:39 +0000
Post by Dhaussy Alexandre
Post by Michael Rasmussen
Have you tested your filter rules?
global_filter = [ "r|sd[b-z].*|", "r|disk|", "r|dm-.*|",
"r|vm.*disk.*|", "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|", "a|.*|" ]
Does vgscan and lvscan list the expected?
--
Hilsen/Regards
Michael Rasmussen

Dhaussy Alexandre
2016-11-22 18:47:51 UTC
Post by Michael Rasmussen
Post by Dhaussy Alexandre
Post by Michael Rasmussen
Have you tested your filter rules?
global_filter = [ "r|sd[b-z].*|", "r|disk|", "r|dm-.*|",
"r|vm.*disk.*|", "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|", "a|.*|" ]
Does vgscan and lvscan list the expected?
Seems to.

***@proxmoxt20:~# vgscan
Reading all physical volumes. This may take a while...
Found volume group "T_proxmox_1" using metadata type lvm2
Found volume group "pve" using metadata type lvm2

***@proxmoxt20:~# lvscan
ACTIVE '/dev/T_proxmox_1/vm-106-disk-1' [116,00 GiB] inherit
ACTIVE '/dev/T_proxmox_1/vm-108-disk-1' [106,00 GiB] inherit
inactive '/dev/T_proxmox_1/vm-109-disk-1' [116,00 GiB] inherit
ACTIVE '/dev/T_proxmox_1/vm-110-disk-1' [116,00 GiB] inherit
ACTIVE '/dev/T_proxmox_1/vm-111-disk-1' [116,00 GiB] inherit
................
....cut.....
................
ACTIVE '/dev/T_proxmox_1/vm-451-disk-2' [90,00 GiB] inherit
ACTIVE '/dev/T_proxmox_1/vm-451-disk-3' [90,00 GiB] inherit
ACTIVE '/dev/T_proxmox_1/vm-1195-disk-2' [128,00 GiB] inherit
ACTIVE '/dev/T_proxmox_1/vm-138-disk-1' [106,00 GiB] inherit
ACTIVE '/dev/T_proxmox_1/vm-517-disk-1' [101,00 GiB] inherit
ACTIVE '/dev/pve/swap' [7,63 GiB] inherit
ACTIVE '/dev/pve/root' [95,37 GiB] inherit
ACTIVE '/dev/pve/data' [174,46 GiB] inherit

Dietmar Maurer
2016-11-14 11:34:02 UTC
Post by Dhaussy Alexandre
What i understand so far, is that every state/service change from LRM
must be acknowledged (cluster-wise) by CRM master.
So if a multicast disruption occurs, and i assume LRM wouldn't be able
talk to the CRM MASTER, then it also couldn't reset the watchdog, am i
right ?
Nothing happens as long as you have quorum. And if I understand you
correctly, you never lost quorum on those nodes?
Dhaussy Alexandre
2016-11-14 13:25:18 UTC
Post by Dietmar Maurer
Post by Dhaussy Alexandre
What i understand so far, is that every state/service change from LRM
must be acknowledged (cluster-wise) by CRM master.
So if a multicast disruption occurs, and i assume LRM wouldn't be able
talk to the CRM MASTER, then it also couldn't reset the watchdog, am i
right ?
Nothing happens as long as you have quorum. And if I understand you
correctly, you never lost quorum on those nodes?
As far as can be told from the log files, yes.