Marcus Haarmann
2018-08-29 15:56:51 UTC
Hi,
we have a small Proxmox cluster running on top of Ceph.
Versions: Proxmox 5.2.6, Ceph 12.2.5 (Luminous)
Hardware: 4 machines, dual Xeon E5, 128 GB RAM
local SATA (RAID1) for the OS
local SSDs for the OSDs (2 OSDs per machine, no RAID here)
4x 10 GBit (copper) NICs
We came upon the following situation:
A VM snapshot was created before a risky installation, so that the installation could be reverted.
The installation was done and, because something went wrong, a rollback to the snapshot was initiated.
However, the snapshot rollback took more than an hour, and during this time the whole cluster
was reacting extremely slowly.
We tried to find out the reason for this, and it looks like an I/O bottleneck.
For some reason, most of the I/O went to two local OSD processes (on the same host where the VM was running).
iostat showed a transfer rate of only about 30 MB/s per OSD disk, yet %util was at 100% (whatever that is supposed to tell us).
The underlying SSDs are not damaged and normally deliver significantly higher throughput.
The OSDs are based on filestore/XFS (we ran into some problems with BlueStore and decided to go back to filestore).
During the rollback there were a lot of read/write operations running in parallel.
Normal cluster operation is relatively smooth; only copying VMs puts noticeable load on I/O, but in that case we see
transfer rates of > 200 MB/s in iostat (not very fast for these SSDs from my point of view,
but then it is not purely sequential writing).
Also, I/O utilization does not get anywhere near 100% while a copy is running.
The SSD and SATA disks are on separate controllers.
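To cross-check the iostat numbers, the disk counters can also be sampled directly. The following
is only a minimal Python sketch (assuming Linux, psutil installed, and "sdb"/"sdc" as placeholder
device names for the OSD SSDs), not part of our actual tooling:

import time
import psutil

DISKS = ["sdb", "sdc"]   # placeholder device names for the OSD SSDs
INTERVAL = 5             # seconds between samples

prev = psutil.disk_io_counters(perdisk=True)
while True:
    time.sleep(INTERVAL)
    cur = psutil.disk_io_counters(perdisk=True)
    for dev in DISKS:
        mb_per_s = (cur[dev].write_bytes - prev[dev].write_bytes) / INTERVAL / 1e6
        # busy_time is in milliseconds on Linux, roughly comparable to iostat %util
        busy_pct = (cur[dev].busy_time - prev[dev].busy_time) / (INTERVAL * 10.0)
        print("%s: %6.1f MB/s written, ~%5.1f%% busy" % (dev, mb_per_s, busy_pct))
    prev = cur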
Any ideas what to tune for better snapshot rollback performance?
I am not sure how the placement of the snapshot data is decided, whether by Proxmox or by Ceph.
Under the hood these are RBD devices which are snapshotted, so it should be up to the Ceph logic
where the snapshot data ends up (maybe depending on the initial layout of the original device)?
Would the CRUSH map influence that?
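To at least see where the image data (and therefore its snapshot copies) lands, something like the
following could be used to map a few of the image's data objects to PGs/OSDs. This is only a rough
sketch: the pool name "rbd" and image name "vm-100-disk-1" are made up, and it assumes the
python-rados/python-rbd bindings plus the ceph CLI are available. As far as I understand, snapshot
data is kept in clone objects next to the head objects in the same PGs, so the mapping below should
also show where the rollback I/O goes:

import subprocess
import rados
import rbd

POOL = "rbd"              # placeholder pool name
IMAGE = "vm-100-disk-1"   # placeholder image name

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx(POOL)
    with rbd.Image(ioctx, IMAGE) as image:
        info = image.stat()                 # includes block_name_prefix and num_objs
        prefix = info["block_name_prefix"]
        if isinstance(prefix, bytes):       # bytes under the Python 3 bindings
            prefix = prefix.decode()
        # Sample the first few data objects; snapshot clones share these names and PGs.
        for i in range(min(info["num_objs"], 8)):
            obj = "%s.%016x" % (prefix, i)
            out = subprocess.check_output(["ceph", "osd", "map", POOL, obj])
            print(out.decode().strip())
    ioctx.close()
finally:
    cluster.shutdown()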
Live backup also takes snapshots, as far as I can see. We have had very strange locking issues
during running backups in the past (mostly gone since the disks were put on separate controllers).
Could this have the same cause?
Another thing we found is the following (not on all hosts):
[614673.831726] libceph: mon1 192.168.16.32:6789 session lost, hunting for new mon
[614673.848249] libceph: mon2 192.168.16.34:6789 session established
[614704.551754] libceph: mon2 192.168.16.34:6789 session lost, hunting for new mon
[614704.552729] libceph: mon1 192.168.16.32:6789 session established
[614735.271779] libceph: mon1 192.168.16.32:6789 session lost, hunting for new mon
[614735.272339] libceph: mon2 192.168.16.34:6789 session established
This points to a known kernel problem which is still not fixed for us (the fix has not been backported to 4.15).
I am not sure whether this is a symptom of the Ceph problem or the cause of it.
Any thoughts on this?
Marcus Haarmann