Discussion:
[PVE-User] Poor CEPH performance? or normal?
Mark Adams
2018-07-25 00:19:59 UTC
Permalink
Hi All,

I have a proxmox 5.1 + ceph cluster of 3 nodes, each with 12 x WD 10TB GOLD
drives. Network is 10Gbps on X550-T2, separate network for the ceph cluster.

I have 1 VM currently running on this cluster, which is debian stretch with
a zpool on it. I'm zfs sending into it, but only getting around ~15MiB/s
write speed. Does this sound right? It seems very slow to me.

Not only that, but when this zfs send is running - I can not do any
parallel sends to any other zfs datasets inside of the same VM. They just
seem to hang, then eventually say "dataset is busy".

Any pointers or insights greatly appreciated!

Thanks
Alwin Antreich
2018-07-25 06:10:59 UTC
Permalink
Hi,
Post by Mark Adams
Hi All,
I have a proxmox 5.1 + ceph cluster of 3 nodes, each with 12 x WD 10TB GOLD
drives. Network is 10Gbps on X550-T2, separate network for the ceph cluster.
Do a rados bench for testing the cluster performance, spinners are not
fast.
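For example, something along these lines ('rbd' below is just a placeholder
pool name - use one of your own pools or a throwaway test pool):

    # 60s of 4MB object writes, 16 threads, keep the objects for the read tests
    rados bench -p rbd 60 write --no-cleanup
    # sequential and random reads of the objects written above
    rados bench -p rbd 60 seq
    rados bench -p rbd 60 rand
    # remove the benchmark objects afterwards
    rados -p rbd cleanup

That gives you a baseline for the raw cluster, independent of the VM.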
Post by Mark Adams
I have 1 VM currently running on this cluster, which is debian stretch with
a zpool on it. I'm zfs sending in to it, but only getting around ~15MiB/s
write speed. does this sound right? it seems very slow to me.
Never ever use a CoW filesystem on top of another CoW system. This doubles
the writes that need to be made.
Post by Mark Adams
Not only that, but when this zfs send is running - I can not do any
parallel sends to any other zfs datasets inside of the same VM. They just
seem to hang, then eventually say "dataset is busy".
Ceph already gives you the possibility of snapshots. You can let PVE do
this through CLI or GUI.
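For example, from the CLI (VMID 100 is just a placeholder):

    qm snapshot 100 before-change
    qm listsnapshot 100
    qm rollback 100 before-change

or directly on the image with 'rbd snap create <pool>/<image>@<name>' if you
need it outside of PVE.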

--
Cheers,
Alwin
Mark Adams
2018-07-25 09:20:57 UTC
Permalink
Hi Alwin,
Post by Alwin Antreich
Hi,
Post by Mark Adams
Hi All,
I have a proxmox 5.1 + ceph cluster of 3 nodes, each with 12 x WD 10TB GOLD
drives. Network is 10Gbps on X550-T2, separate network for the ceph cluster.
Do a rados bench for testing the cluster performance, spinners are not
fast.
This was a typo - I'm actually on 5.2-1. I'll give rados bench a try to see
what it comes back with.
Post by Alwin Antreich
Post by Mark Adams
I have 1 VM currently running on this cluster, which is debian stretch with
a zpool on it. I'm zfs sending into it, but only getting around ~15MiB/s
write speed. Does this sound right? It seems very slow to me.
Never ever use a CoW filesystem on top of another CoW system. This doubles
the writes that need to be made.
Post by Mark Adams
Not only that, but when this zfs send is running - I can not do any
parallel sends to any other zfs datasets inside of the same VM. They just
seem to hang, then eventually say "dataset is busy".
Ceph already gives you the possibility of snapshots. You can let PVE do
this through CLI or GUI.
The problem with this is the required features. I need an HA cluster, and
zfs doesn't support this - so ceph is ideal. However, I also need "restore
previous versions" usable inside a file server VM in samba, which ceph
snapshots at the VM layer are no use for... unless there is some other
smart way of doing this I don't know about!

I guess my main question is: are there any other config hints to speed this
up, whether in ceph or in ZFS inside the VM? And is the blocking of other IO
normal with ceph when the "max" write speed is being reached? That bit doesn't
seem right to me.
Post by Alwin Antreich
--
Cheers,
Alwin
Regards,
Mark
Mark Adams
2018-07-26 10:25:45 UTC
Permalink
Hi Ronny,

Thanks for your suggestions. Do you know if it is possible to change an
existing rbd pool to striping? or does this have to be done on first setup?

Regards,
Mark
Post by Mark Adams
Hi All,
I have a proxmox 5.1 + ceph cluster of 3 nodes, each with 12 x WD 10TB GOLD
drives. Network is 10Gbps on X550-T2, separate network for the ceph cluster.
I have 1 VM currently running on this cluster, which is debian stretch with
a zpool on it. I'm zfs sending into it, but only getting around ~15MiB/s
write speed. Does this sound right? It seems very slow to me.
Not only that, but when this zfs send is running - I can not do any
parallel sends to any other zfs datasets inside of the same VM. They just
seem to hang, then eventually say "dataset is busy".
Any pointers or insights greatly appreciated!
Post by Ronny Aasen
Greetings
Alwin gave you some good advice about filesystems and VMs; I wanted to
say a little about ceph.
With 3 nodes, and the default and recommended size=3 pools, you can not
tolerate any node failures. IOW, if you lose a node, or need to do
lengthy maintenance on it, you are running degraded. I always have a
4th "failure domain" node, so my cluster can self-heal (one of ceph's
killer features) from a node failure. Your cluster should be
3+[how-many-node-failures-I-want-to-be-able-to-survive-and-still-operate-sanely]
nodes.
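(You can check what a pool is set to with, e.g.:

    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

'rbd' being just an example pool name.)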
Spinning OSDs with bluestore benefit greatly from SSD DB/WALs. If your
OSDs have their DB/WAL on-disk, you can gain a lot of performance by moving
the DB/WAL to an SSD or better.
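E.g. when creating a bluestore OSD you can point the DB (the WAL lives inside
it unless split off onto its own device) at a partition or LV on a faster
device, roughly:

    ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p1

the device names above are only an illustration, not a recommendation.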
Ceph gains performance with scale (number of OSD nodes). So while ceph's
aggregate performance is awesome, an individual single thread will not
be amazing. A given set of data will exist on all 3 nodes, and you will
hit 100% of nodes with any write. So by using ceph with 3 nodes you
give ceph the worst case for performance. E.g.
with 4 nodes a write would hit 75%, with 6 nodes it would hit 50% of the
cluster. You see where this is going...
But a single write will only hit one disk on each of the 3 nodes, and will
not have better performance than the disk it hits. You can cheat more
performance with rbd caching, and it is important for performance to get
a higher queue depth. AFAIK zfs send uses a queue depth of 1, for ceph the
worst possible. You may have some success by buffering on one or both
ends of the transfer [1].
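For example, something along these lines (dataset and host names are
placeholders):

    zfs send tank/data@snap | mbuffer -s 128k -m 1G | \
        ssh target-vm 'mbuffer -s 128k -m 1G | zfs receive pool/data'

the buffers on both ends smooth out the bursty send/receive pattern so ceph
sees a more continuous write stream.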
If the VM has an RBD disk, you may (or may not) benefit from rbd fancy
striping [2], since operations can hit more OSDs in parallel.
good luck
Ronny Aasen
[1]
https://everycity.co.uk/alasdair/2010/07/using-mbuffer-to-speed-up-slow-zfs-send-zfs-receive/
[2] http://docs.ceph.com/docs/master/architecture/#data-striping
Adam Thompson
2018-07-27 13:46:14 UTC
Permalink
rbd striping is a per-image setting; you may need to make a new rbd
image and migrate the data.
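E.g. the new image could be created with something like (pool/image names
and stripe parameters below are only illustrative, not a recommendation):

    rbd create mypool/newimage --size 10T --object-size 4M --stripe-unit 1M --stripe-count 8

then copy the data across and point the VM config at the new image.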
Post by Mark Adams
Thanks for your suggestions. Do you know if it is possible to change an
existing rbd pool to striping? or does this have to be done on first setup?
Please be aware that striping will not result in any increased
performance, if you are using "safe" I/O modes, i.e. your VM waits for a
successful flush-to-disk after every sector. In that scenario, CEPH
will never give you write performance equal to a local disk because
you're limited to the bandwidth of a single remote disk [subsystem]
*plus* the network round-trip latency, which even if measured in
microseconds, still adds up.

Based on my experience with this and other distributed storage systems,
I believe you will likely find that you get large write-performance
gains by:

1. use the largest possible block size during writes. 512B sectors are
the worst-case scenario for any remote storage. Try to write in chunks
of *at least* 1 MByte, and it's not unreasonable nowadays to write in
chunks of 64MB or larger. The rationale here is that you're spending
more time sending data, and less time waiting for ACKs. The more you
can tilt that in favor of data, the better off you are. (There are
downsides to huge sector/block/chunk sizes, though - this isn't a "free
lunch" scenario. See #5.) A quick way to see the block-size effect is
sketched after these points.

2. relax your write-consistency requirements. If you can tolerate the
small risk with "Write Back" you should see better performance,
especially during burst writes. During large sequential writes, there
are not many ways to violate the laws of physics, and CEPH automatically
amplifies your writes by (in your case) a factor of 2x due to
replication.

3. switch to storage devices with the best possible local write speed,
for OSDs. OSDs are limited by the performance of the underlying device
or virtual device. (e.g. it's totally possible to run OSDs on a
hardware RAID6 controller)

4. Avoid CoW-on-CoW. Write amplification means you'll lose around 50%
of your IOPS and/or I/O bandwidth for each level of CoW nesting,
depending on workload. So don't put CEPH OSDs on, say, BTRFS or ZFS
filesystems. A worst-case scenario would be something like running a VM
using ZFS on top of CEPH, where the OSDs are located on BTRFS
filesystems, which are in turn virtual devices hosted on ZFS filesystems.
Welcome to 1980's storage performance, in that case! (I did it without
realizing once... seriously, 5 MBps sequential writes was a good day!)
FWIW, CoW filesystems are generally awesome - just not when stacked. A
sufficiently fast external NAS running ZFS with VMs stored over NFS can
provide decent performance, *if* tuned correctly. iX Systems, for
example, spends a lot of time & effort making this work well, including
some lovely HA NAS appliances.

5. Remember the triangle. You can optimize a distributed storage system
for any TWO of: a) cost, b) resiliency/reliability/HA, or c) speed.
(This is a specific case of the traditional good/fast/cheap:pick-any-2
adage.)
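As a rough illustration of #1 (paths are made up; run inside the VM):

    # flush after every 512-byte write - worst case, expect this to crawl
    dd if=/dev/zero of=/tank/ddtest bs=512 count=10000 oflag=dsync
    # flush after every 4 MiB write - far fewer round trips per megabyte
    dd if=/dev/zero of=/tank/ddtest bs=4M count=256 oflag=dsync

Not a real benchmark, but it makes the per-request overhead very visible.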


I'm not sure I'm saying anything new here, I may have just summarized
the discussion, but the points remain valid.

Good luck with your performance problems.
-Adam
Mark Adams
2018-07-28 11:00:15 UTC
Permalink
Hi Adam,

Thanks for your great round up there - Your points are excellent.

What I ended up doing a few days ago (apologies, have been too busy to
respond) was setting rbd cache = true under each client in the ceph.conf -
this got me from 15MB/s up to about 70MB/s. I then set the disk holding the
zfs dataset to writeback cache in proxmox (as you noted) and that has
bumped it up to about 130MB/s - which I am happy with for this setup.
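For reference, roughly what that looks like (the VM ID and storage/volume
names below are just examples, not exact copies of my config):

    # /etc/ceph/ceph.conf on the PVE nodes
    [client]
         rbd cache = true

    # set the VM disk to writeback, same as selecting it in the GUI
    qm set 100 -scsi0 ceph-pool:vm-100-disk-1,cache=writeback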

Regards,
Mark
Adam Thompson
2018-07-29 14:41:24 UTC
Permalink
That's a pretty good result. You now have some very small windows where recently-written data could be lost, but in most applications not unreasonably so.
In exchange, you get very good throughput for spinning rust.
(FWIW, I gave up on CEPH because my nodes only have 2Gbps network each, but I am seeing similar speeds with local ZFS+ZIL+L2ARC on 15k SAS drives. These are older systems, obviously.)
Thanks for sharing your solution!
-Adam
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.