Dhaussy Alexandre
2016-11-09 13:46:50 UTC
Hello,
I have a big problem on my cluster (1500 HA VMs); storage is LVM on a SAN (around 70 PVs, 2000 LVs).
Problems began when adding a new node to the cluster…
All nodes crashed and rebooted (this happened yesterday).
After some work I managed to get everything back online, but some nodes stayed down (hardware problems).
3 or 4 nodes are still powered off because we don't know whether they caused the issue.
So this morning we tried to add all the nodes back; I believe someone did something wrong…
Everything has been rebooted again, but now the VMs can't start.
Only 2 of 7 nodes seem to start VMs; all the other nodes seem stuck.
I have quorum with 7 nodes. I tried to reboot the master, but nothing happens in the GUI (the master stays the same, even though it is powered down).
Is there any way to force the election of a new master?
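(For context, this is roughly what I was considering to force a hand-over; it's only a sketch, and I'm assuming the manager lock is held via the standard pve-ha-crm service and reflected in /etc/pve/ha/manager_status, so please correct me if that's wrong:)

# check which node the CRM currently considers master, and its state
cat /etc/pve/ha/manager_status

# restart the CRM on the surviving nodes so one of them can try to
# acquire the manager lock and take over from the dead master
systemctl restart pve-ha-crm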
After rebooting all nodes… I'm out of ideas…
I wanted to remove the VMs from HA and start them locally, but I can't even do that (nothing happens).
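(Roughly what I attempted, where vm:100 just stands in for a real VMID:)

# take a guest out of HA management...
ha-manager remove vm:100

# ...then try to start it directly on the local node
qm start 100

For reference, here is the LRM state on the stuck nodes: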
/etc/pve/nodes/proxmoxt21/lrm_status:{"timestamp":1478698319,"results":{},"state":"wait_for_agent_lock","mode":"active"}
/etc/pve/nodes/proxmoxt25/lrm_status:{"state":"wait_for_agent_lock","results":{},"timestamp":1478698315,"mode":"active"}
/etc/pve/nodes/proxmoxt26/lrm_status:{"results":{},"mode":"active","timestamp":1478693656,"state":"wait_for_agent_lock"}
/etc/pve/nodes/proxmoxt30/lrm_status:{"mode":"active","results":{},"state":"wait_for_agent_lock","timestamp":1478698319}
/etc/pve/nodes/proxmoxt31/lrm_status:{"state":"wait_for_agent_lock","results":{},"timestamp":1478698319,"mode":"active"}
/etc/pve/nodes/proxmoxt34/lrm_status:{"state":"wait_for_agent_lock","results":{},"timestamp":1478698318,"mode":"active"}
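(Every stuck node sits in wait_for_agent_lock. Would it be safe to just restart the local resource manager on one of them? Something like this, assuming the standard pve-ha-lrm unit is the right thing to poke:)

# restart the LRM on a stuck node and check what it reports
systemctl restart pve-ha-lrm
systemctl status pve-ha-lrm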
***@proxmoxt21:~# pvecm status
Quorum information
------------------
Date:             Wed Nov  9 14:40:17 2016
Quorum provider:  corosync_votequorum
Nodes:            8
Node ID:          0x00000003
Ring ID:          9/988
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   13
Highest expected: 13
Total votes:      8
Quorum:           7
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000009          1 10.98.187.11
0x0000000a          1 10.98.187.12
0x0000000b          1 10.98.187.15
0x0000000c          1 10.98.187.36
0x00000003          1 10.98.187.40 (local)
0x00000001          1 10.98.187.41
0x00000002          1 10.98.187.42
0x00000008          1 10.98.187.47
☹
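One more thing I noticed: expected votes is 13 but only 8 nodes are up. Since the remaining nodes will stay powered off for now, would it be reasonable to lower the expected votes so the quorum math matches reality? Something like the following, assuming it's safe to do while those nodes stay down:

# tell votequorum to expect only the 8 currently-live nodes
# (only valid while the powered-off nodes stay off!)
pvecm expected 8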