Discussion:
[PVE-User] vzdump causing complete box hang
Gavin Henry
2018-07-29 09:39:03 UTC
Permalink
Hi all,

Once a week when we run vzdump on some database VMs we get the
following and the whole box hang and needs a cold remote reboot. Any
suggestions? We're seeing no other issues, no disk issues on the RAID,
no filesystem issues:

Jul 29 05:15:03 leela vzdump[280112]: <***@pam> starting task
UPID:leela:0004465A:038A102F:5B5D4D57:vzdump::***@pam:
Jul 29 05:15:03 leela vzdump[280154]: INFO: starting new backup job:
vzdump --mailnotification always --quiet 1 --mode snapshot --compress
lzo --storage Backups --all 1
Jul 29 05:15:03 leela pmxcfs[6094]: [status] notice: received log
Jul 29 05:15:03 leela vzdump[280154]: INFO: Starting Backup of VM 151 (openvz)
Jul 29 05:15:04 leela kernel: EXT3-fs: barriers disabled
Jul 29 05:15:04 leela kernel: kjournald starting. Commit interval 5 seconds
Jul 29 05:15:04 leela kernel: EXT3-fs (dm-2): using internal journal
Jul 29 05:15:04 leela kernel: ext3_orphan_cleanup: deleting
unreferenced inode 132645031

<snip>

Jul 29 05:15:04 leela kernel: EXT3-fs (dm-2): 45 orphan inodes deleted
Jul 29 05:15:04 leela kernel: EXT3-fs (dm-2): recovery complete
Jul 29 05:15:04 leela kernel: EXT3-fs (dm-2): mounted filesystem with
ordered data mode
Jul 29 05:17:01 leela /USR/SBIN/CRON[282355]: (root) CMD ( cd / &&
run-parts --report /etc/cron.hourly)
Jul 29 05:17:23 leela vzdump[280154]: INFO: Finished Backup of VM 151 (00:02:20)
Jul 29 05:17:24 leela vzdump[280154]: INFO: Starting Backup of VM 153 (openvz)
Jul 29 05:17:24 leela kernel: EXT3-fs: barriers disabled
Jul 29 05:17:24 leela kernel: kjournald starting. Commit interval 5 seconds
Jul 29 05:17:24 leela kernel: EXT3-fs (dm-2): using internal journal

<snip>

Jul 29 05:24:33 leela kernel: Buffer I/O error on device dm-2, logical
block 433061993
Jul 29 05:24:33 leela kernel: lost page write due to I/O error on dm-2
Jul 29 05:25:02 leela /USR/SBIN/CRON[291615]: (root) CMD (command -v
debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 29 05:28:59 leela kernel: EXT3-fs (dm-2): error: ext3_put_super:
Couldn't clean up the journal
Jul 29 05:29:00 leela vzdump[280154]: ERROR: Backup of VM 153 failed -
command '(cd /mnt/vzsnap0/private/153;find . '(' -regex '^\.$' ')' -o
'(' -type 's' -prune ')' -o -print0|sed 's/\\/\\\\/g'|tar cpf -
--totals --sparse --numeric-owner --no-recursion --one-file-system
./etc/vzdump/vps.conf --null -T -|lzop)
/mnt/replicated/backups/dump/vzdump-openvz-153-2018_07_29-05_17_24.tar.dat'
failed: exit code 2
Jul 29 05:29:00 leela vzdump[280154]: INFO: Starting Backup of VM 157 (openvz)
Jul 29 05:29:01 leela kernel: EXT3-fs: barriers disabled
Jul 29 05:29:01 leela kernel: kjournald starting. Commit interval 5 seconds
Jul 29 05:29:01 leela kernel: EXT3-fs (dm-2): using internal journal
Jul 29 05:29:01 leela kernel: ext3_orphan_cleanup: deleting
unreferenced inode 132645031

<snip>

Jul 29 05:29:01 leela kernel: ext3_orphan_cleanup: deleting
unreferenced inode 122814467
Jul 29 05:29:01 leela kernel: EXT3-fs (dm-2): 45 orphan inodes deleted
Jul 29 05:29:01 leela kernel: EXT3-fs (dm-2): recovery complete
Jul 29 05:29:01 leela kernel: EXT3-fs (dm-2): mounted filesystem with
ordered data mode
Jul 29 05:30:04 leela ansible-setup: Invoked with filter=*
gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10
Jul 29 05:32:05 leela ansible-setup: Invoked with filter=*
gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10
Jul 29 05:35:02 leela /USR/SBIN/CRON[304114]: (root) CMD (command -v
debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 29 05:35:03 leela kernel: device-mapper: snapshots: Invalidating
snapshot: Unable to allocate exception.

<snip>

Jul 29 05:35:48 leela kernel: JBD: I/O error detected when updating
journal superblock for dm-2.
Jul 29 05:35:54 leela kernel: EXT3-fs (dm-2): error:
ext3_journal_start_sb: Detected aborted journal
Jul 29 05:35:54 leela kernel: EXT3-fs (dm-2): error: remounting
filesystem read-only
Jul 29 05:36:40 leela kernel: EXT3-fs (dm-2): error: ext3_put_super:
Couldn't clean up the journal
Jul 29 05:36:41 leela vzdump[280154]: ERROR: Backup of VM 157 failed -
command '(cd /mnt/vzsnap0/private/157;find . '(' -regex '^\.$' ')' -o
'(' -type 's' -prune ')' -o -print0|sed 's/\\/\\\\/g'|tar cpf -
--totals --sparse --numeric-owner --no-recursion --one-file-system
./etc/vzdump/vps.conf --null -T -|lzop)
/mnt/replicated/backups/dump/vzdump-openvz-157-2018_07_29-05_29_00.tar.dat'
failed: exit code 2
Jul 29 05:36:41 leela vzdump[280154]: INFO: Starting Backup of VM 161 (openvz)
Jul 29 05:36:42 leela kernel: EXT3-fs: barriers disabled
Jul 29 05:36:42 leela kernel: kjournald starting. Commit interval 5 seconds
Jul 29 05:36:42 leela kernel: EXT3-fs (dm-2): using internal journal
Jul 29 05:35:42 leela kernel: EXT3-fs error (device dm-2):
ext3_get_inode_loc: unable to read inode block - inode=15024130,
block=60096514
Jul 29 05:35:42 leela kernel: Buffer I/O error on device dm-2, logical block 0
Jul 29 05:35:42 leela kernel: lost page write due to I/O error on dm-2
Jul 29 05:35:42 leela kernel: EXT3-fs (dm-2): I/O error while writing superblock
Jul 29 05:35:42 leela kernel: EXT3-fs (dm-2): error in
ext3_reserve_inode_write: IO failure
Jul 29 05:35:42 leela kernel: Buffer I/O error on device dm-2, logical block 0
Jul 29 05:35:42 leela kernel: lost page write due to I/O error on dm-2
Jul 29 05:35:42 leela kernel: EXT3-fs (dm-2): I/O error while writing superblock
Jul 29 05:35:43 leela kernel: EXT3-fs error (device dm-2):
ext3_get_inode_loc: unable to read inode block - inode=23347202,
block=93388802
Jul 29 05:35:43 leela kernel: Buffer I/O error on device dm-2, logical block 0
Jul 29 05:35:43 leela kernel: lost page write due to I/O error on dm-2
Jul 29 05:35:43 leela kernel: EXT3-fs (dm-2): I/O error while writing superblock
Jul 29 05:35:43 leela kernel: EXT3-fs (dm-2): error in
ext3_reserve_inode_write: IO failure
Jul 29 05:35:43 leela kernel: Buffer I/O error on device dm-2, logical block 0

They are getting dumped on to a glusterfs replicated set up which is
shared with another proxmox in the same rack in a two node cluster.

Thanks,
Gavin.
Gavin Henry
2018-07-29 14:21:02 UTC
Permalink
I’d start by figure out what’s going on with with all the errors on DM-2
Yep, that's:

dmsetup info
Name: pve-data-real
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 2
Event number: 0
Major, minor: 253, 2
Number of targets: 1
UUID: LVM-RDfBSyRHSFN0z6KxboxgFoJfPn4nD1tFumUZGz4X9NTfxAhccnDZ6Y3oQ0EM72HI-real

dm-2

mount:

/dev/mapper/pve-data on /var/lib/vz type ext3
(rw,relatime,errors=continue,user_xattr,acl,barrier=0,data=ordered)

dmsetup ls --tree
pve-swap (253:1)
└─ (8:3)
pve-root (253:0)
└─ (8:3)
pve-data (253:3)
└─pve-data-real (253:2)
└─ (8:3)
pve-vzsnap--leela--0 (253:5)
├─pve-vzsnap--leela--0-cow (253:4)
│ └─ (8:3)
└─pve-data-real (253:2)
└─ (8:3)

ls -lha /dev/mapper/
total 0
drwxr-xr-x 2 root root 180 Jul 29 08:06 .
drwxr-xr-x 16 root root 6.3K Jul 29 08:14 ..
crw------- 1 root root 10, 61 Jul 29 08:06 control
lrwxrwxrwx 1 root root 7 Jul 29 08:06 pve-data -> ../dm-3
lrwxrwxrwx 1 root root 7 Jul 29 08:06 pve-data-real -> ../dm-2
lrwxrwxrwx 1 root root 7 Jul 29 08:06 pve-root -> ../dm-0
lrwxrwxrwx 1 root root 7 Jul 29 08:06 pve-swap -> ../dm-1
lrwxrwxrwx 1 root root 7 Jul 29 08:06 pve-vzsnap--leela--0 -> ../dm-5
lrwxrwxrwx 1 root root 7 Jul 29 08:06 pve-vzsnap--leela--0-cow -> ../dm-4

Some downtime with a full fsck I think...
Gavin Henry
2018-07-29 14:50:08 UTC
Permalink
I notice that I removed this lvm last week and it's back. What's
creating it again? The proxmox lvm snapshot tool?

vgdisplay
/dev/pve/vzsnap-leela-0: read failed after 0 of 4096 at
2212310745088: Input/output error
/dev/pve/vzsnap-leela-0: read failed after 0 of 4096 at
2212310802432: Input/output error
/dev/pve/vzsnap-leela-0: read failed after 0 of 4096 at 0: Input/output error
/dev/pve/vzsnap-leela-0: read failed after 0 of 4096 at 4096:
Input/output error
--- Volume group ---
VG Name pve
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 1038
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 4
Open LV 3
Max PV 0
Cur PV 1
Act PV 1
VG Size 2.18 TiB
PE Size 4.00 MiB
Total PE 571999
Alloc PE / Size 568160 / 2.17 TiB
Free PE / Size 3839 / 15.00 GiB
VG UUID RDfBSy-RHSF-N0z6-Kxbo-xgFo-JfPn-4nD1tF


lvremove /dev/pve/vzsnap-leela-0
/dev/pve/vzsnap-leela-0: read failed after 0 of 4096 at
2212310745088: Input/output error
/dev/pve/vzsnap-leela-0: read failed after 0 of 4096 at
2212310802432: Input/output error
/dev/pve/vzsnap-leela-0: read failed after 0 of 4096 at 0: Input/output error
/dev/pve/vzsnap-leela-0: read failed after 0 of 4096 at 4096:
Input/output error
Do you really want to remove and DISCARD active logical volume
vzsnap-leela-0? [y/n]: y
Logical volume "vzsnap-leela-0" successfully removed
Gavin Henry
2018-07-29 21:09:59 UTC
Permalink
dm-2 is fine. Did a full forced fsck on it. No disk issues either. Very strange.
Loading...