Marcus Haarmann
2018-07-27 09:02:05 UTC
Hi experts,
We are using a Proxmox cluster with underlying Ceph storage.
Versions are pve 5.2-2 with kernel 4.15.18-1-pve and Ceph Luminous 12.2.5.
We are running a couple of VMs and also containers there.
Network: 3 virtual NICs (bonded in balance-alb mode); Ceph uses 2 bonded 10 GBit interfaces (public/cluster separated).
During the nightly backup, the backup stalls. At the same time, we get lots of messages like these in dmesg:
[137612.371311] libceph: mon0 192.168.16.31:6789 session established
[137643.090541] libceph: mon0 192.168.16.31:6789 session lost, hunting for new mon
[137643.091383] libceph: mon1 192.168.16.32:6789 session established
[137673.810526] libceph: mon1 192.168.16.32:6789 session lost, hunting for new mon
[137673.811388] libceph: mon2 192.168.16.34:6789 session established
[137704.530567] libceph: mon2 192.168.16.34:6789 session lost, hunting for new mon
[137704.531363] libceph: mon0 192.168.16.31:6789 session established
[137735.250593] libceph: mon0 192.168.16.31:6789 session lost, hunting for new mon
[137735.251352] libceph: mon1 192.168.16.32:6789 session established
[137765.970608] libceph: mon1 192.168.16.32:6789 session lost, hunting for new mon
[137765.971544] libceph: mon0 192.168.16.31:6789 session established
[137796.690605] libceph: mon0 192.168.16.31:6789 session lost, hunting for new mon
[137796.691412] libceph: mon1 192.168.16.32:6789 session established
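To quantify how regular the session drops are, here is a minimal Python sketch that pairs each "session established" line with the following "session lost" line for the same monitor. The function name session_durations and the regular expression are my own, written against the exact message format shown above; other kernels may word these messages differently.

```python
import re

# Matches the libceph lines shown above; the pattern is an assumption
# based only on this excerpt of dmesg output.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+libceph:\s+(?P<mon>mon\d+)\s+\S+\s+"
    r"session\s+(?P<event>established|lost)"
)


def session_durations(log_text):
    """Return a list of (mon, seconds) for each established -> lost pair."""
    started = {}    # mon name -> timestamp of the last "established" line
    durations = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        ts = float(match.group("ts"))
        mon = match.group("mon")
        if match.group("event") == "established":
            started[mon] = ts
        elif mon in started:
            durations.append((mon, round(ts - started.pop(mon), 3)))
    return durations


SAMPLE = """\
[137612.371311] libceph: mon0 192.168.16.31:6789 session established
[137643.090541] libceph: mon0 192.168.16.31:6789 session lost, hunting for new mon
[137643.091383] libceph: mon1 192.168.16.32:6789 session established
[137673.810526] libceph: mon1 192.168.16.32:6789 session lost, hunting for new mon
"""

if __name__ == "__main__":
    for mon, seconds in session_durations(SAMPLE):
        print(mon, seconds)
```

On the lines above, every session lasts roughly 30.7 seconds before it is lost, and the next session is established within about a millisecond.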
We have been searching for the cause for a while, since the blocked backup is not easy to recover from (unblocking does not help;
the only way out is to stop and migrate to a different server, since the rbd device seems to block).
It seems to be related to the ceph messages.
We found the following patch related to these messages (which may lead to a blocking state in kernel):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7b4c443d139f1d2b5570da475f7a9cbcef86740c
We have tried to patch the kernel ourselves, but this was not successful.
Although I presume the real error situation is related to a network problem, it would be nice to have an
official backport of this patch in the pve kernel.
Could somebody do that? (It is only one line of code.)
We are also trying to change the bonding mode, because the network connection seems to be unstable;
maybe this solves the issue.
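To compare the bond state before and after changing the mode, a small Python sketch like the following could summarize the status text that the Linux bonding driver exposes under /proc/net/bonding/. The function name parse_bond_status, the bond name bond0, and the sample text are illustrative assumptions, not output from our servers; only the "Bonding Mode", "Slave Interface", "MII Status" and "Link Failure Count" keys are relied on.

```python
def parse_bond_status(text):
    """Extract the bonding mode and per-slave link state from bonding status text."""
    info = {"mode": None, "slaves": []}
    current = None  # the slave section currently being parsed, if any
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            info["mode"] = value
        elif key == "Slave Interface":
            current = {"name": value, "mii_status": None, "link_failures": None}
            info["slaves"].append(current)
        elif current is not None and key == "MII Status":
            current["mii_status"] = value
        elif current is not None and key == "Link Failure Count":
            current["link_failures"] = int(value)
    return info


# Illustrative sample only (hypothetical interface names and counters).
SAMPLE_STATUS = """\
Bonding Mode: adaptive load balancing
MII Status: up

Slave Interface: eth0
MII Status: up
Link Failure Count: 0

Slave Interface: eth1
MII Status: up
Link Failure Count: 3
"""

if __name__ == "__main__":
    status = parse_bond_status(SAMPLE_STATUS)
    print(status["mode"])
    for slave in status["slaves"]:
        print(slave["name"], slave["mii_status"], slave["link_failures"])
```

On a real host the input would be read from a file such as /proc/net/bonding/bond0 (the bond name is an assumption). A link failure count that keeps rising on one slave would point towards the network rather than the kernel.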
Thank you very much and best regards,
Marcus Haarmann