Discussion:
[PVE-User] HA Timing question
Klaus Darilion
2018-09-07 08:27:40 UTC
Hi!

I have an HA question about what happens when the cluster network partitions.

E.g. 3 nodes. VM100 is running on node 3.

Suddenly the network breaks and node3 is isolated. Hence, node3 is alone
without quorum, while node1+2 form a new group with quorum.

What happens now exactly when HA is configured for VM100?

According to https://pve.proxmox.com/wiki/High_Availability node 3 will
reboot after 60 seconds ("When a cluster member determines that it is no
longer in the cluster quorum, the LRM waits for a new quorum to form. As
long as there is no quorum the node cannot reset the watchdog. This will
trigger a reboot after the watchdog then times out, this happens after
60 seconds.")

But what is the timing for starting VM100 on another node? Is it
guaranteed that this only happens after 60 seconds? (avoiding concurrent
access to the shared storage, and service-network on node3 may still be
functional although the cluster network broke)

Thanks
Klaus
Dietmar Maurer
2018-09-07 08:35:16 UTC
Post by Klaus Darilion
What happens now exactly when HA is configured for VM100?
According to https://pve.proxmox.com/wiki/High_Availability node 3 will
reboot after 60 seconds ("When a cluster member determines that it is no
longer in the cluster quorum, the LRM waits for a new quorum to form. As
long as there is no quorum the node cannot reset the watchdog. This will
trigger a reboot after the watchdog then times out, this happens after
60 seconds.")
exactly
Post by Klaus Darilion
But what is the timing for starting VM100 on another node? Is it
guaranteed that this only happens after 60 seconds?
yes, that is the idea.
Klaus Darilion
2018-09-07 14:28:43 UTC
Post by Dietmar Maurer
Post by Klaus Darilion
But what is the timing for starting VM100 on another node? Is it
guaranteed that this only happens after 60 seconds?
yes, that is the idea.
I don't understand how this is achieved. Is there a 60s timer somewhere
before a VM is started on some other node? Where exactly, in case I need to
tune this? E.g. if I would like such reboots and VM starts to happen only
after 5 minutes of cluster problems.

Are there some other not yet mentioned relevant timers in Proxmox
(besides the timers in Corosync)?

Thanks
Klaus
Dietmar Maurer
2018-09-07 14:40:30 UTC
Post by Klaus Darilion
Post by Dietmar Maurer
Post by Klaus Darilion
But what is the timing for starting VM100 on another node? Is it
guaranteed that this only happens after 60 seconds?
yes, that is the idea.
I don't understand how this is achieved. Is there a 60s timer somewhere
before a VM is started on some other node?
In short, there is a distributed locking mechanism based on corosync.
Post by Klaus Darilion
Where exactly in case I need to
tune this?
You cannot tune this.
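
To illustrate the general idea of a lock that gives a timing guarantee, here
is a minimal sketch of a time-limited ("lease") lock. It is not the Proxmox
implementation: the class LeaseLock, its method names and the quorate flag are
hypothetical, and the 120-second timeout is taken from the "2 minutes" figure
mentioned later in this thread. The only point shown is that a holder which
cannot renew (no quorum) keeps blocking others until the timeout has passed.

```python
from typing import Optional


class LeaseLock:
    """Toy lease lock (hypothetical, not Proxmox code).

    The holder must renew the lock while quorate; another node may
    take it over only once the timeout since the last update has passed.
    """

    def __init__(self, timeout: float = 120.0) -> None:
        self.timeout = timeout          # seconds; "2 minutes" per the thread
        self.holder: Optional[str] = None
        self.last_update: Optional[float] = None

    def try_acquire(self, node: str, now: float, quorate: bool) -> bool:
        """Acquire or renew the lock; returns True on success."""
        if not quorate:
            # Assumption: a node without quorum cannot write the lock at all.
            return False
        expired = (
            self.last_update is None
            or now - self.last_update >= self.timeout
        )
        if self.holder == node or expired:
            self.holder = node
            self.last_update = now
            return True
        return False


lock = LeaseLock()
lock.try_acquire("node3", now=0, quorate=True)    # node3 holds the lock
lock.try_acquire("node3", now=10, quorate=False)  # isolated: renewal fails
lock.try_acquire("node1", now=60, quorate=True)   # too early: still refused
lock.try_acquire("node1", now=120, quorate=True)  # lease expired: takeover
```

In this toy model the other side never needs to contact the isolated node; it
only has to wait until the lease has run out.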
Thomas Lamprecht
2018-09-10 06:14:02 UTC
Post by Klaus Darilion
Post by Dietmar Maurer
Post by Klaus Darilion
But what is the timing for starting VM100 on another node? Is it
guaranteed that this only happens after 60 seconds?
yes, that is the idea.
I don't understand how this is achieved. Is there a 60s timer somewhere
before a VM is started on some other node? Where exactly, in case I need to
tune this? E.g. if I would like such reboots and VM starts to happen only
after 5 minutes of cluster problems.
Ha, I guess you're the first who wants to increase this delay; most want it
to be on the order of mere seconds.

The problem is that several timers are involved. First, there's the
$fence_delay in the HA::NodeStatus module, which is the delay after which a
node gets marked as dead and to be fenced.
Then there's the node's watchdog which, even if you increase the delay above,
will still trigger if the node is not quorate for 60 seconds, so this would
need changing too. The locks are per node and time out 2 minutes after the
last update. As a node (or the current manager) can only do something while
it holds its lock, increasing the time here should not be too problematic,
theoretically, but this is not tested at all.
I'm just telling you what is where, not encouraging you. If you still want to
hack around: great, but I wouldn't recommend starting in production :)
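
As a rough illustration of how the two fixed numbers above relate, the
following sketch computes when the watchdog would reset an isolated node and
when its lock would time out. It assumes, as a simplification not stated in
this thread, that the last watchdog reset and the last lock update happen at
the same moment; the function name failover_timeline is hypothetical, and
$fence_delay is left out because its value is not given here.

```python
WATCHDOG_TIMEOUT = 60   # seconds without quorum until the watchdog fires
LOCK_TIMEOUT = 120      # seconds after the last update until the lock expires


def failover_timeline(last_update: int) -> dict:
    """Return key points in time (seconds) for an isolated node.

    Simplifying assumption: the node's last watchdog reset and its last
    lock update both happened at `last_update`.
    """
    watchdog_fires_at = last_update + WATCHDOG_TIMEOUT
    lock_free_at = last_update + LOCK_TIMEOUT
    return {
        "watchdog_fires_at": watchdog_fires_at,
        "lock_free_at": lock_free_at,
        # Time between the node being reset and its lock becoming available.
        "safety_margin": lock_free_at - watchdog_fires_at,
    }


timeline = failover_timeline(0)
```

Under these assumptions the isolated node is reset a full minute before
another node could take over its lock, which is why raising only one of the
values would not be enough.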
Post by Klaus Darilion
Are there some other not yet mentioned relevant timers in Proxmox
(besides the timers in Corosync)?
Maybe give our HA documentation, especially the "How It Works"[0] and
"Fencing"[1] chapters, a read.

[0]: https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#_how_it_works
[1]: https://pve.proxmox.com/pve-docs/chapter-ha-manager.html#ha_manager_fencing