[PVE-User] HA Failover if shared storage fails on one Node

Discussion:

Martin Holub

2018-10-17 11:05:27 UTC

Hi,

I am currently testing the HA features on a 6-node cluster with NetApp
storage, with iSCSI and multipath configured on all nodes. I tested
what happens if, for any reason, both links fail (by shutting down the
interfaces on one blade). Unfortunately, although I had configured HA for
my test VM, Proxmox does not seem to recognize the storage outage, so it
neither migrated the VM to a different blade nor removed that
node from the cluster (by resetting it or fencing it in some other
way). Any hints on how to solve that?

Thanks,
Martin

Gilberto Nunes

2018-10-17 11:11:10 UTC

Hi

How about node priority?
Look at section 14.5.2 in this doc:

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_configuration_10
---
Gilberto Nunes Ferreira

(47) 3025-5907
(47) 99676-7530 - Whatsapp / Telegram

Skype: gilberto.nunes36

_______________________________________________
pve-user mailing list
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user

Martin Holub

2018-10-17 11:19:34 UTC

Not sure if I understood what you mean with that reference, but since
Proxmox does not detect that the storage is unreachable on that specific
cluster node, how are HA groups supposed to work around this?

Best,
Martin

Martin Holub

2018-10-17 11:26:57 UTC

Hi,

In my specific test case I was simulating only one out of 6 nodes
losing connectivity to the shared storage, so the other 5 could still
access the data. In my opinion Proxmox should somehow be able to
detect that and fence that node, causing a migration (depending on the
HA configuration, of course) to the other nodes.

Best,
Martin

Post by Gilberto Nunes
Perhaps I wasn't able to understand your issue...
But if the storage crashes, there is no way to migrate from one node to another, since
Proxmox cannot find the VM image...
Sorry if I don't see clearly what is happening.

Mark Adams

2018-10-17 11:29:04 UTC

What interface is your cluster communication (corosync) running over? AFAIK,
this is the link that needs to be unavailable to initiate a VM start on
another node.

Basically, the other nodes in the cluster need to see a problem with
the node. If it's still communicating over whichever interface carries
the cluster communication, then as far as the cluster is concerned the node is
still up. If you just lose access to your storage, then your VM will still
be running in memory.

I don't believe there is any separate storage-specific monitoring in
Proxmox that could trigger a move to another node. If there is, I'm sure
someone else on the list will advise.

Regards,
Mark

Post by Martin Holub

Not sure if I understood what you mean with that reference, but since
Proxmox does not detect that the storage is unreachable on that specific
cluster node, how are HA groups supposed to work around this?
Best,
Martin

Martin Holub

2018-10-17 11:33:16 UTC

Hi,

We have dedicated links for the storage and the cluster communication,
so if only the storage links fail, corosync is still working. Maybe I
need to create a watchdog myself for that specific case, but let's
wait and see whether there is really nothing in Proxmox to handle that scenario.

Best,
Martin
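
For illustration, here is a minimal sketch of what such a home-grown storage watchdog could look like. It is not part of Proxmox and nobody in this thread posted it. The function names, the thresholds and the choice of a hard reset as the fencing action are all assumptions. The idea follows from the replies in this thread: HA only reacts when the other nodes stop hearing from a node, so the node has to take itself out once its storage stays unreachable.

```python
import os
import time
from typing import Callable, Optional


def can_read_device(path: str, nbytes: int = 4096) -> bool:
    """Return True if the first bytes of `path` can be read without an I/O error.

    Caveat: a read may be served from the page cache, and a hung path can
    block instead of failing. A real check would need to account for both.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return False
    try:
        return len(os.read(fd, nbytes)) > 0
    except OSError:
        return False
    finally:
        os.close(fd)


def hard_reset() -> None:
    """Assumed fencing action: immediate reboot via the Linux SysRq trigger.

    Needs root and does not sync disks. This is an illustrative choice only.
    """
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("b")


def storage_watchdog(
    check: Callable[[], bool],
    fence: Callable[[], None],
    max_failures: int = 3,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    max_rounds: Optional[int] = None,
) -> bool:
    """Call `check` every `interval` seconds and fence after repeated failures.

    `fence` is called once `check` has failed `max_failures` times in a row.
    A successful check resets the counter. Returns True if the node was
    fenced, False if `max_rounds` elapsed first (None means run forever).
    """
    failures = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        if check():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                fence()
                return True
        sleep(interval)
    return False
```

On a node this might be started as `storage_watchdog(lambda: can_read_device("/dev/mapper/mpatha"), hard_reset)`, where the device path is a hypothetical multipath device name. Requiring several consecutive failures avoids resetting the node on a short path flap. Whether the HA stack then recovers the VM on another node still depends on the HA configuration.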


Mark Schouten

2018-10-17 11:31:09 UTC

Post by Martin Holub
my test VM, Proxmox does not seem to recognize the storage outage, so it
neither migrated the VM to a different blade nor removed that
node from the cluster (by resetting it or fencing it in some other
way). Any hints on how to solve that?

HA detects outages between the Proxmox nodes, not whether storage is
reachable.

--
Mark Schouten | Tuxis Internet Engineering
KvK: 61527076 | http://www.tuxis.nl/
T: 0318 200208 | ***@tuxis.nl
