Discussion:
[PVE-User] Hanging storage tasks in all RH based VMs after update
Uwe Sauter
2018-05-02 20:27:39 UTC
Hi all,

I updated my cluster this morning (version info at the end of this mail) and rebooted all hosts sequentially, live migrating VMs
between hosts. (Six hosts connected via 10GbE, all participating in a Ceph cluster.)

Since then I have been experiencing hanging storage tasks inside the VMs (e.g. jbd2 on VMs with ext4 or xfsaild on VMs with XFS).
This goes so far that auditd fills the dmesg buffer with messages like:

[14109.375608] audit_log_start: 23 callbacks suppressed
[14109.376496] audit: audit_backlog=70 > audit_backlog_limit=64
[14109.377213] audit: audit_lost=2274 audit_rate_limit=0 audit_backlog_limit=64
[14109.377954] audit: backlog limit exceeded

Performance is massively reduced on those VMs. The VMs all run up-to-date CentOS 7.4 with the QEMU guest agent running.
This also happens if a VM is shut down and started again.
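
(As an aside, the audit messages themselves can be quieted inside the guests by raising the audit backlog limit, roughly as below; that only hides the log spam, it does nothing about the underlying I/O stalls.)

# auditctl -b 8192

To make it persistent, "-b 8192" would typically go into /etc/audit/rules.d/audit.rules on CentOS 7.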


Does anyone else see this happening? Any thoughts on the cause, any proposals for a fix?


Thanks,

Uwe




# pveversion -v
proxmox-ve: 5.1-43 (running kernel: 4.15.15-1-pve)
pve-manager: 5.1-52 (running version: 5.1-52/ba597a64)
pve-kernel-4.13: 5.1-44
pve-kernel-4.15: 5.1-3
pve-kernel-4.15.15-1-pve: 4.15.15-6
pve-kernel-4.13.16-2-pve: 4.13.16-47
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-5-pve: 4.13.13-38
ceph: 12.2.4-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-4
libpve-common-perl: 5.0-30
libpve-guest-common-perl: 2.0-15
libpve-http-server-perl: 2.0-8
libpve-storage-perl: 5.0-19
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-2
lxcfs: 3.0.0-1
novnc-pve: 0.6-4
proxmox-widget-toolkit: 1.0-15
pve-cluster: 5.0-26
pve-container: 2.0-22
pve-docs: 5.1-17
pve-firewall: 3.0-8
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-4
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-3
qemu-server: 5.0-25
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.7-pve1~bpo9
Lindsay Mathieson
2018-05-02 23:50:18 UTC
Post by Uwe Sauter
I updated my cluster this morning (version info see end of mail) and
rebooted all hosts sequentially, live migrating VMs between hosts.
(Six hosts connected via 10GbE, all participating in a Ceph cluster.)
What's your Ceph status? It's probably doing a massive backfill after the
rolling reboot. That will kill your I/O.
--
Lindsay
Uwe Sauter
2018-05-03 05:26:53 UTC
Hi Lindsay,
Post by Uwe Sauter
I updated my cluster this morning (version info see end of mail) and rebooted all hosts sequentially, live migrating
VMs between hosts. (Six hosts connected via 10GbE, all participating in a Ceph cluster.)
What's your Ceph status? It's probably doing a massive backfill after the rolling reboot. That will kill your I/O.
Backfilling shouldn't be the cause as I always run "ceph osd set noout" before rebooting the servers.
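
For reference, the per-host procedure is roughly the following (just a sketch, the VM migration step is handled separately):

# ceph osd set noout
(live-migrate the VMs away, reboot the host, wait until its OSDs are back up and in)
# ceph osd unset noout

The noout flag keeps the rebooting host's OSDs from being marked out, which is what would otherwise trigger backfill.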

When I last checked yesterday it was HEALTH_OK but now it is:

# ceph status
cluster:
id: 982484e6-69bf-490c-9b3a-942a179e759b
health: HEALTH_WARN
15 slow requests are blocked > 32 sec

services:
mon: 6 daemons, quorum 0,1,2,3,px-echo-cluster,px-foxtrott-cluster
mgr: px-alpha-cluster(active), standbys: px-bravo-cluster, px-charlie-cluster, px-echo-cluster, px-delta-cluster,
px-foxtrott-cluster
osd: 24 osds: 24 up, 24 in

data:
pools: 2 pools, 576 pgs
objects: 111k objects, 401 GB
usage: 1287 GB used, 11046 GB / 12334 GB avail
pgs: 576 active+clean

io:
client: 60071 B/s wr, 0 op/s rd, 2 op/s wr


I'll look into the slow requests and let you know.
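
To narrow the slow requests down I'll probably start roughly like this ("ceph daemon" has to be run on the node hosting the OSD in question; osd.7 is just an example id):

# ceph health detail
# ceph daemon osd.7 dump_blocked_ops
# ceph daemon osd.7 dump_historic_ops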


Thanks,

Uwe
Mark Schouten
2018-05-03 07:29:30 UTC
Hi,
     mon: 6 daemons, quorum 0,1,2,3,px-echo-cluster,px-foxtrott-cluster
Not a response to your current issue, but an even number of monitors is
not recommended. Also, with a six-node cluster, I would personally be
happy with three.
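
If you want to go down to three, monitors can be removed one at a time, waiting for quorum to settle in between. With plain Ceph tooling that is roughly the following (monitor name taken from your status output; the pveceph wrapper has a corresponding destroy command as well):

# ceph mon remove px-foxtrott-cluster
# ceph quorum_status

Also remember to drop the removed monitor from ceph.conf so clients stop trying to contact it.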

Just my 2 cts.
--
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten | Tuxis Internet Engineering
KvK: 61527076 | http://www.tuxis.nl/
T: 0318 200208 | ***@tuxis.nl
Uwe Sauter
2018-05-03 05:36:01 UTC
One thing I just saw:

The update installed pve-kernel-4.15.15-1-pve. Before I had pve-kernel-4.13.16-2-pve running.

Are there known issues with 4.15.15?

I'll reboot into 4.13.16 when I get to work to see if this fixes things.
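
To boot the old kernel just once without changing the default, something like this should work (the menu entry title below is a guess and needs to be checked against /boot/grub/grub.cfg; grub-reboot also requires GRUB_DEFAULT=saved in /etc/default/grub):

# grub-reboot 'Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 4.13.16-2-pve'
# reboot

Alternatively the kernel can simply be picked from the GRUB menu on the console.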

Regards,

Uwe
Post by Uwe Sauter
I updated my cluster this morning (version info see end of mail) and rebooted all hosts sequentially, live migrating
VMs between hosts. (Six hosts connected via 10GbE, all participating in a Ceph cluster.)
What's your Ceph status? It's probably doing a massive backfill after the rolling reboot. That will kill your I/O.
Uwe Sauter
2018-05-03 08:41:29 UTC
Looks like this was caused by pve-kernel-4.15.15-1-pve. After rebooting into pve-kernel-4.13.16-2-pve, performance is back to normal.

Hopefully the next kernel update will address this.
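
To keep apt from pulling further 4.15 builds onto the hosts in the meantime, something like this should do (just a sketch; the hold can be lifted again with "apt-mark unhold" once a fixed kernel is out):

# apt-mark hold pve-kernel-4.15
# uname -r

uname -r simply confirms which kernel is actually running after the reboot.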


Regards,

Uwe
Post by Uwe Sauter
Hi all,
I updated my cluster this morning (version info see end of mail) and rebooted all hosts sequentially, live migrating VMs between
hosts. (Six hosts connected via 10GbE, all participating in a Ceph cluster.)
Since then I have been experiencing hanging storage tasks inside the VMs (e.g. jbd2 on VMs with ext4 or xfsaild on VMs with XFS). This goes so
far that auditd fills the dmesg buffer with messages like:
[14109.375608] audit_log_start: 23 callbacks suppressed
[14109.376496] audit: audit_backlog=70 > audit_backlog_limit=64
[14109.377213] audit: audit_lost=2274 audit_rate_limit=0 audit_backlog_limit=64
[14109.377954] audit: backlog limit exceeded
Performance is massively reduced on those VMs. The VMs all run up-to-date CentOS 7.4 with the QEMU guest agent running. This also happens
if a VM is shut down and started again.
Does anyone else see this happening? Any thoughts on the cause, any proposals for a fix?
Thanks,
    Uwe