Discussion:
[PVE-User] Snapshot rollback slow
Marcus Haarmann
2018-08-29 15:56:51 UTC
Hi,

we have a small Proxmox cluster running on top of Ceph.
Versions: Proxmox VE 5.2.6, Ceph 12.2.5 (Luminous)
Hardware: 4 machines, dual Xeon E5, 128 GB RAM each
local SATA (RAID 1) for the OS
local SSDs for the OSDs (2 OSDs per machine, no RAID here)
4x 10 GBit (copper) NICs

We ran into the following situation:
A VM snapshot was created before a risky installation, so that it could be reverted.
The installation was done, and a rollback to the snapshot was initiated (because something went wrong).
However, the rollback took more than an hour, and during that time the whole cluster
was reacting very slowly.
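(For reference: the rollback was started from the Proxmox GUI. As far as I understand, for RBD-backed disks
this ends up as a plain rbd rollback, roughly like the following, where pool, image and snapshot names are
only examples:

  rbd snap rollback rbd/vm-100-disk-1@before-install   # rewrites the image object by object

If that is correct, the duration scales roughly with the image size, not with how much actually changed.)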
We tried to find the reason for this, and it looks like an I/O bottleneck.
For some reason the main I/O was done by two local OSD processes (on the same host where the VM was running).
According to iostat, the transfer rate was only about 30 MB/s per OSD disk, but utilization was at 100% (whatever that means exactly).
The underlying SSDs are not damaged and normally deliver significantly higher throughput.
The OSDs are based on Filestore/XFS (we ran into some problems with Bluestore and decided to go back to Filestore).
There were a lot of read/write operations running in parallel at that time.
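A rough way to see what kind of load the OSD disks are under (generic sysstat invocation, nothing specific
to this cluster):

  iostat -xm 1
  # watch w/s, the average request size and %util:
  # 100% util at only ~30 MB/s usually means many small, random requests
  # (high IOPS, small request size), not a broken or slow disk

So my reading is that the disks were saturated by IOPS, not by bandwidth.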

Normal cluster operation is relatively smooth; only copying machines affects I/O, but in that case we see
transfer rates of > 200 MB/s in iostat. (This is not very fast for these SSDs from my point of view,
but then it is not purely sequential write.)
Also, I/O utilization does not get near 100% when a copy action is running.
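To get a baseline for what the pool can do outside of Proxmox, something like rados bench could be used
(the pool name is only an example; --no-cleanup plus the final cleanup leave the pool as it was):

  rados bench -p rbd 60 write -t 16 --no-cleanup   # 60 s of 4 MB writes, 16 in flight
  rados bench -p rbd 60 rand -t 16                 # random reads against the objects written above
  rados -p rbd cleanup                             # remove the benchmark objects

That would show whether ~200 MB/s is really the limit of these SSDs under Ceph, or just what one copy job
happens to reach.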

SSD and SATA disks are on separate controllers.

Any ideas what to tune for better snapshot rollback performance?
I am not sure how the placement of the snapshot data is handled by Proxmox or Ceph.

Under the hood there are RBD images, which are snapshotted. So it should be up to the Ceph logic
where the snapshot data ends up (maybe depending on the initial layout of the original image)?
Would the CRUSH map influence that?
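One thing that can be checked is where the image's objects (and therefore its snapshot data) actually live.
As far as I understand, RBD snapshots are copy-on-write clones of the same RADOS objects, so they stay in
the same PGs as the original image, and those PGs are placed by the pool's CRUSH rule. Roughly (pool and
image names are only examples):

  rbd info rbd/vm-100-disk-1                   # note the block_name_prefix
  rados -p rbd ls | grep <block_name_prefix> | head
  ceph osd map rbd <one-of-those-objects>      # shows the PG and the acting OSDs
  ceph osd crush rule dump                     # the rule used for placement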

Also, live backup takes snapshots as far as I can see. We have had very strange lock-ups during running
backups in the past (mostly gone since the disks were put on separate controllers).

Could this have the same cause?

Another thing we found is the following (not on all hosts):
[614673.831726] libceph: mon1 192.168.16.32:6789 session lost, hunting for new mon
[614673.848249] libceph: mon2 192.168.16.34:6789 session established
[614704.551754] libceph: mon2 192.168.16.34:6789 session lost, hunting for new mon
[614704.552729] libceph: mon1 192.168.16.32:6789 session established
[614735.271779] libceph: mon1 192.168.16.32:6789 session lost, hunting for new mon
[614735.272339] libceph: mon2 192.168.16.34:6789 session established

This points to a kernel problem which is still not solved (the fix has not been backported to 4.15).
I am not sure whether this is a reaction to the Ceph problem or the cause of it.
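To narrow down whether the kernel client is involved at all: as far as I know these libceph messages only
come from the kernel RBD/CephFS client, while qemu normally talks to the cluster through librbd. A few
things that could be checked (the mon id below assumes it matches the local hostname, adjust as needed):

  grep -B1 krbd /etc/pve/storage.cfg        # is the 'krbd' option set on the RBD storage?
  ceph features                             # release/feature summary of all connected clients
  ceph daemon mon.$(hostname -s) sessions   # run on a mon host: who is connected to this mon

During heavy I/O the session losses might also just be a symptom (missed keepalives) rather than the cause.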

Any thoughts on this ?

Marcus Haarmann
Yannis Milios
2018-08-29 17:02:49 UTC
Can’t comment on the I/O issues, but with regard to the snapshot rollback, I
would personally prefer to clone the snapshot instead of rolling back. It
has proven much faster for me when recovering in emergencies.
Then, after recovering, to release the clone from its snapshot
reference, you can flatten the clone.
You can find this in the Ceph docs.
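Roughly, with example pool/image/snapshot names (see the rbd man page and the Ceph docs for the details):

  rbd snap protect rbd/vm-100-disk-1@before-install      # clones need a protected snapshot
  rbd clone rbd/vm-100-disk-1@before-install rbd/vm-100-disk-1-restore
  # point the VM at the new image (or rename it over the old one) and start it, then
  # detach the clone from its parent in the background:
  rbd flatten rbd/vm-100-disk-1-restore
  rbd snap unprotect rbd/vm-100-disk-1@before-install    # once no clone references it any more

The clone is usable immediately because it is copy-on-write; only the flatten does the bulk copying, and
that can run while the VM is already up again.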
--
Sent from Gmail Mobile
Marco Gaiarin
2018-08-29 17:09:35 UTC
Mandi! Marcus Haarmann
In that message you wrote...
Post by Marcus Haarmann
However, the rollback took more than an hour, and during that time the whole cluster
was reacting very slowly.
AFAIK this is a 'feature' of Ceph (again AFAIK fixed, or at least better handled, in
recent Ceph versions and on Bluestore): a deletion (of a volume, or a
snapshot rollback) triggers a 'write amplification'.

In older Ceph versions there are some workarounds, for example the throttles below...
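The usual knobs were the snapshot-trim throttles, e.g. (option names should be checked against your exact
Ceph release, and the values are only a starting point):

  ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'
  ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 1'
  ceph tell osd.* injectargs '--osd_snap_trim_priority 1'
  # make it permanent in ceph.conf under [osd], e.g.:  osd snap trim sleep = 0.1

These only tame the background trimming after deletes/snapshot removal; they do not make the rollback
itself faster.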


Sorry for not being more precise; have a look at the Ceph mailing list archives for more
detailed info...
--
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
Associazione ``La Nostra Famiglia'' http://www.lanostrafamiglia.it/
Polo FVG - Via della Bontà, 7 - 33078 - San Vito al Tagliamento (PN)
marco.gaiarin(at)lanostrafamiglia.it t +39-0434-842711 f +39-0434-842797
