[PVE-User] periodic Node Crash/freeze

Discussion:

Ml Ml

2018-08-23 06:57:48 UTC

Hello,

I could use some hints/help: one of my clusters has been letting me down
since 29.07.2018.
That's when one of my three nodes started to freeze and stop.

In syslog the last entries are:

Aug 21 02:33:00 node10 systemd[1]: Starting Proxmox VE replication runner...
Aug 21 02:33:01 node10 systemd[1]: Started Proxmox VE replication runner.
Aug 21 02:33:01 node10 CRON[1870491]: (root) CMD (/usr/bin/puppet agent -vt --color false --logdest /var/log/puppet/agent.log 1>/dev/null)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^

or:

Aug 22 16:11:12 node08 pmxcfs[5227]: [dcdb] notice: cpg_send_message retried 1 times
Aug 22 16:11:12 node08 pmxcfs[5227]: [status] notice: members: 1/5227, 2/5058
Aug 22 16:11:12 node08 pmxcfs[5227]: [status] notice: starting data syncronisation
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

I already posted it here:
https://forum.proxmox.com/threads/periodic-node-crash-freeze.46407/

It happened at:
29.07.2018 node09 / pve 4.4
07.08.2018 node08 / pve 4.4 (then I decided to upgrade)
21.08.2018 node10 / pve 5.2
22.08.2018 node08 / pve 5.2

...and I am getting nervous now, since there are 60 important VMs on it.
As you can see, it happened across multiple nodes with different PVE versions.
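
To see whether the freezes follow a regular period, the gaps between the dates listed above can be computed. This is only a minimal sketch using the four dates from the list; the helper name `gaps_in_days` is made up for illustration.

```python
from datetime import date

# Crash dates and nodes exactly as listed in the post (day.month.year).
crashes = [
    (date(2018, 7, 29), "node09"),
    (date(2018, 8, 7), "node08"),
    (date(2018, 8, 21), "node10"),
    (date(2018, 8, 22), "node08"),
]


def gaps_in_days(events):
    """Return the number of days between each pair of consecutive events."""
    days = [day for day, _node in events]
    return [(later - earlier).days for earlier, later in zip(days, days[1:])]


print(gaps_in_days(crashes))  # prints [9, 14, 1]
```

The gaps of 9, 14 and 1 days do not show a fixed interval, so a strictly time-based trigger seems less likely than a recurring condition.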

Memtest is okay.

From what I found by searching, the "^@^@^@^@^@^" sequences appear in syslog
because the file could not be fully written to disk?
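
"^@" is how pagers and editors usually display a NUL byte (0x00). A common explanation, not confirmed for this cluster, is that the node stopped before the log data was flushed, so the file contains zeros where text should have been. Under that assumption, the last text line before each NUL run marks roughly when a node froze. The following is a rough sketch for pulling those lines out of a log file; the function names are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: each freeze leaves a run of NUL bytes (shown as "^@") in the log.
NUL_RUN = re.compile(rb"\x00+")


def last_lines_before_nul_runs(data: bytes) -> List[Tuple[str, int]]:
    """For every NUL run, return (last text line before it, run length)."""
    results = []
    for match in NUL_RUN.finditer(data):
        # Text up to the run, without the trailing newline of the last line.
        before = data[:match.start()].rstrip(b"\n")
        # Keep only the final line; drop NULs left over from an earlier run.
        last = before.rsplit(b"\n", 1)[-1].replace(b"\x00", b"")
        results.append((last.decode("utf-8", errors="replace"), len(match.group())))
    return results


def scan_file(path: str) -> List[Tuple[str, int]]:
    """Read a log file in binary mode and scan it for NUL runs."""
    with open(path, "rb") as handle:
        return last_lines_before_nul_runs(handle.read())
```

Calling `scan_file` on a copy of the syslog from each node (the path depends on the system) would give one entry per NUL run, which can then be compared with the crash dates listed above.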

Maybe something triggers some totem/watchdog stuff which then ends in
a disaster?

My ideas from here:
- disable corosync/totem and see if the problems stop

Do you have any ideas that could help narrow the problem down?

My setup is a 3-node cluster (node08, node09, node10) with Ceph.

I have 4 other 3-node clusters running just fine.

Thanks a lot.

Mario

Woods, Ken A (DNR)

2018-08-23 10:19:32 UTC

Why did you decide to not use multicast?

_______________________________________________
pve-user mailing list
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user

Klaus Darilion

2018-08-25 17:32:17 UTC

Maybe it is related to a kernel update.
Regards,
Klaus
