Ml Ml
2018-08-23 06:57:48 UTC
Hello,
I could use some hints/help, since one cluster has been letting me down since
29.07.2018.
That's when one of my three nodes started to freeze and stop.
In syslog the last entries are:
Aug 21 02:33:00 node10 systemd[1]: Starting Proxmox VE replication runner...
Aug 21 02:33:01 node10 systemd[1]: Started Proxmox VE replication runner.
Aug 21 02:33:01 node10 CRON[1870491]: (root) CMD (/usr/bin/puppet
agent -vt --color false --logdest /var/log/puppet/agent.log
1>/dev/null)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
or:
Aug 22 16:11:12 node08 pmxcfs[5227]: [dcdb] notice: cpg_send_message
retried 1 times
Aug 22 16:11:12 node08 pmxcfs[5227]: [status] notice: members: 1/5227, 2/5058
Aug 22 16:11:12 node08 pmxcfs[5227]: [status] notice: starting data
syncronisation
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
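Both excerpts simply stop mid-write. One thing that might still be recoverable: if persistent journaling were enabled (it is not by default on Debian/PVE; /var/log/journal has to exist first), the previous boot could be inspected directly after the reset:

journalctl --list-boots   # list recorded boots; -1 is the one before the crash
journalctl -b -1 -e       # jump to the end of the previous boot's journal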
I already posted it here:
https://forum.proxmox.com/threads/periodic-node-crash-freeze.46407/
It happened at:
29.07.2018 node09 / pve 4.4
07.08.2018 node08 / pve 4.4 (then I decided to upgrade)
21.08.2018 node10 / pve 5.2
22.08.2018 node08 / pve 5.2
...and I am getting nervous now, since there are 60 important VMs on it.
As you can see, it happened across multiple nodes with different PVE versions.
Memtest is okay.
From what I could find by googling, the "^@^@^@^@^@^" runs (NUL bytes) appear
in syslog because the log file could not be fully written to disk before the
node died?
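To confirm that theory, the NUL bytes can at least be counted (quick check, assuming the standard syslog path):

tr -dc '\0' < /var/log/syslog | wc -c   # count NUL bytes in the log

And since the interesting last seconds never make it to the local disk, forwarding syslog to another machine might catch them (rsyslog UDP forwarding; the target IP here is just an example):

echo '*.* @192.168.1.100:514' > /etc/rsyslog.d/90-remote.conf
systemctl restart rsyslog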
Maybe something triggers the totem/watchdog machinery, which then ends in
disaster?
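If the watchdog is what resets the node, that would at least explain why the log ends so abruptly: a hard watchdog reset leaves no time to flush anything. Some things worth checking on the nodes (standard PVE service and module names):

lsmod | grep -e softdog -e ipmi_watchdog              # which watchdog module is active
systemctl status watchdog-mux pve-ha-lrm pve-ha-crm   # watchdog multiplexer and HA services
corosync-quorumtool -s                                # current quorum / membership view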
My ideas from here:
- disable corosync/totem and see if the problems stop (rough sketch below)
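Rough sketch of that test (the order matters: stopping corosync while the HA services are still armed can itself trigger the watchdog):

systemctl stop pve-ha-lrm pve-ha-crm   # disarm the HA watchdog first
systemctl stop corosync                # then stop cluster communication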
Do you have any ideas that could help narrow the problem down?
My setup is a 3-node cluster (node08, node09, node10) with Ceph.
I have 4 other 3-node clusters running just fine.
Thanks a lot.
Mario