[PVE-User] HA Failover if shared storage fails on one Node
Martin Holub
2018-10-17 11:05:27 UTC
Hi,

I am currently testing the HA features on a 6-node cluster with NetApp
storage, using iSCSI with multipath configured on all nodes. I have now tested
what happens if, for whatever reason, both links fail (by shutting down the
interfaces on one blade). Unfortunately, although I had configured HA for
my test VM, Proxmox does not seem to recognize the storage outage and
therefore neither migrated the VM to a different blade nor removed that
node from the cluster (either by resetting it or by fencing it some other
way). Any hints on how to get that solved?

Thanks,
Martin
Gilberto Nunes
2018-10-17 11:11:10 UTC
Hi

How about node priority?
Take a look at section 14.5.2 in this doc:

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_configuration_10
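As a rough illustration of what node priority in an HA group means, here is a simplified Python model. It is an assumption, not Proxmox's actual code: among the group members the cluster currently considers online, the service goes to the one with the highest priority. The tie-break by node name and the node names themselves are hypothetical. Note that the model only changes placement once a node is treated as offline.

```python
from typing import Dict, Optional, Set


def pick_node(priorities: Dict[str, int], online: Set[str]) -> Optional[str]:
    """Simplified model (assumption) of HA group node priority:
    prefer the online member with the highest priority."""
    candidates = [n for n in priorities if n in online]
    if not candidates:
        return None  # no group member is online
    # Highest priority first; on equal priority, the alphabetically
    # smallest name wins (a hypothetical tie-break for determinism).
    return min(candidates, key=lambda n: (-priorities[n], n))


group = {"node1": 2, "node2": 1, "node3": 1}
print(pick_node(group, {"node1", "node2", "node3"}))  # node1
print(pick_node(group, {"node2", "node3"}))           # node2
```

A node that has lost its storage but is still online keeps winning this selection, which is why priorities alone do not address a storage-only failure.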
---
Gilberto Nunes Ferreira

(47) 3025-5907
(47) 99676-7530 - Whatsapp / Telegram

Skype: gilberto.nunes36
_______________________________________________
pve-user mailing list
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
Martin Holub
2018-10-17 11:19:34 UTC
Not sure I understood what you mean with that reference, but since
Proxmox does not detect that the storage is unreachable on that specific
cluster node, how are HA groups supposed to work around this?

Best,
Martin
Martin Holub
2018-10-17 11:26:57 UTC
Hi,

In my specific test case I was simulating that only one out of 6 nodes
loses connectivity to the shared storage, so the other 5 could still
access the data. In my opinion, Proxmox should somehow be able to
detect that and fence that node, causing a migration (depending on the
HA configuration, of course) to the other nodes.

Best,
Martin
Post by Gilberto Nunes
Perhaps I wasn't able to understand your issue...
But if the storage crashes, there is no way to migrate the VM from one node
to another, since Proxmox cannot find the VM image...
Sorry if I don't see clearly what is happening.
---
Gilberto Nunes Ferreira
(47) 3025-5907
(47) 99676-7530 - Whatsapp / Telegram
Skype: gilberto.nunes36
Mark Adams
2018-10-17 11:29:04 UTC
What interface is your cluster communication (corosync) running over? AFAIK,
that is the link that needs to become unavailable to trigger a VM start on
another node.

Basically, the other nodes in the cluster need to see a problem with
the node. If it is still communicating over whichever interface you have
the cluster communication on, then as far as the cluster is concerned the
node is still up. If you only lose access to your storage, your VM will
still be running in memory.

I don't believe there is any separate storage-specific monitoring in
Proxmox that could trigger a move to another node. If there is, I'm sure
someone else on the list will advise.
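The behaviour described above can be captured in a small Python model. This is a simplification written as an assumption, not Proxmox's HA implementation: whether a node's services are recovered elsewhere depends only on whether the node is still part of the cluster via corosync, and the storage state is never consulted. The NodeState type, its fields, and the node names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeState:
    name: str
    in_cluster: bool   # still seen by the other nodes via corosync
    storage_ok: bool   # can still reach the shared storage


def needs_recovery(node: NodeState) -> bool:
    # Simplified model (assumption): only loss of cluster communication
    # marks a node as failed; storage_ok is deliberately not consulted.
    return not node.in_cluster


def nodes_to_recover(nodes: List[NodeState]) -> List[str]:
    """Names of nodes whose HA services would be started elsewhere."""
    return [n.name for n in nodes if needs_recovery(n)]


# Storage links down but corosync link up (node1) -> nothing happens.
nodes = [NodeState("node1", in_cluster=True, storage_ok=False),
         NodeState("node2", in_cluster=False, storage_ok=True)]
print(nodes_to_recover(nodes))  # ['node2']
```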

Regards,
Mark
Martin Holub
2018-10-17 11:33:16 UTC
Hi,

We have dedicated links for the storage and the cluster communication,
so if only the storage links fail, corosync keeps working. Maybe I
need to create some watchdog myself for that specific case, but let's
wait and see whether there is really nothing in Proxmox to handle that scenario.
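A minimal Python sketch of the kind of watchdog mentioned here. Everything in it is hypothetical and not part of Proxmox: the probe and the on_failure action are injected callables, and what the action should actually do (for example, taking the node out of service so that HA recovers the VMs on another node) is left open. A real probe would also need its own timeout, since I/O against an unreachable multipath iSCSI LUN may block rather than fail.

```python
from typing import Callable


class StorageWatchdog:
    """Hypothetical helper: calls on_failure once after `threshold`
    consecutive failed storage probes."""

    def __init__(self, probe: Callable[[], bool],
                 on_failure: Callable[[], None], threshold: int = 3) -> None:
        self.probe = probe            # returns True if storage looks reachable
        self.on_failure = on_failure  # placeholder action, e.g. trigger fencing
        self.threshold = threshold
        self.failures = 0
        self.triggered = False

    def check(self) -> bool:
        """Run one probe; return True if the storage looked reachable."""
        try:
            ok = bool(self.probe())
        except OSError:
            ok = False  # treat I/O errors from the probe as a failed check
        if ok:
            self.failures = 0  # any success resets the counter
            return True
        self.failures += 1
        if self.failures >= self.threshold and not self.triggered:
            self.triggered = True  # fire the action only once
            self.on_failure()
        return False


# Simulated probe results: one success, then the storage links go down.
results = iter([True, False, False, False, False])
events = []
wd = StorageWatchdog(lambda: next(results), lambda: events.append("fence"))
for _ in range(5):
    wd.check()
print(events)  # ['fence']
```

Requiring several consecutive failures before acting is meant to avoid fencing a node on a single transient path hiccup that multipath would otherwise ride out.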

Best,
Martin
Mark Schouten
2018-10-17 11:31:09 UTC
Post by Martin Holub
my test VM, Proxmox does not seem to recognize the storage outage and
therefore neither migrated the VM to a different blade nor removed that
node from the cluster (either by resetting it or by fencing it some other
way). Any hints on how to get that solved?
HA detects outages between the Proxmox nodes, not whether storage is
reachable.
--
Mark Schouten | Tuxis Internet Engineering
KvK: 61527076 | http://www.tuxis.nl/
T: 0318 200208 | ***@tuxis.nl