Discussion: [PVE-User] Ceph or Gluster
Mohamed Sadok Ben Jazia
2016-04-22 10:42:00 UTC
Hello list,
In order to set up a highly scalable Proxmox infrastructure with a number
of clusters, I plan to use a distributed storage system, and I have some
questions:
1- I have a choice between Ceph and Gluster; which is better for Proxmox?
2- Is it better to install one of those systems on the nodes themselves or
on separate servers?
3- Can this architecture deliver a stable product, with VM and LXC
migration (not live migration), and store backups and snapshots, ISO
files and LXC container templates?
Thank you
Eneko Lacunza
2016-04-22 11:06:06 UTC
Hi Mohamed,
Post by Mohamed Sadok Ben Jazia
Hello list,
In order to set up a highly scalable Proxmox infrastructure with a number
of clusters, I plan to use a distributed storage system, and I have some
questions:
1- I have a choice between Ceph and Gluster; which is better for Proxmox?
I have no experience with Gluster; Ceph has been great for our use.
Post by Mohamed Sadok Ben Jazia
2- Is it better to install one of those systems on the nodes themselves or
on separate servers?
Better on separate servers, but it works quite well on the same systems if
the load is OK. The Proxmox Ceph Server integration is very nice and saves
a lot of work.
Post by Mohamed Sadok Ben Jazia
3- Can this architecture deliver a stable product, with VM and LXC
migration (not live migration), and store backups and snapshots, ISO
files and LXC container templates?
In order to use Ceph for backups and ISO/templates, you'll have to use
CephFS. It is considered experimental in the Ceph version currently shipped
with Proxmox (Hammer), but today a new Ceph LTS release (Jewel) came out
that marks CephFS stable and production-ready. I think this will be
integrated into Proxmox shortly; the developers are discussing it on the
mailing list today.

I use NFS for backups and ISO/templates. For our storage needs this is
enough.
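
For example, something along these lines from the CLI would define both
storages (the storage names, monitor IPs and NFS export path are only
placeholders, not our actual setup):

  # RBD storage for VM disks (assumes a pool named "rbd" already exists)
  pvesm add rbd ceph-vms --pool rbd \
      --monhost "10.10.10.1 10.10.10.2 10.10.10.3" --content images
  # NFS storage for backups, ISO images and container templates
  pvesm add nfs nfs-backup --server 10.10.10.50 \
      --export /srv/backups --content backup,iso,vztmpl
  pvesm status   # check that both storages show up as active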

Cheers
Eneko
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611
943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Mohamed Sadok Ben Jazia
2016-04-22 13:00:47 UTC
Thank you Eneko,
I read on the Proxmox forum that distributed storage needs 10Gbit or faster
on the local network, and a dedicated network.
Could you detail the infrastructure you use, to see if it matches those
conditions?
Brian ::
2016-04-22 13:31:52 UTC
Hi Mohamed

10Gbps or faster at a minimum, or you will have pain. Even with 4
nodes and 4 spinner disks in each node you will be maxing out a
1Gbps network. For any backfills or when adding new OSDs you don't want
to be waiting on 1Gbps ethernet speeds.

Give Ceph a dedicated 10Gbps network for its traffic at a minimum and you
will have nice results.
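
A rough back-of-envelope (the per-disk figure below is an assumption, not
a measurement) shows why:

  # aggregate spinner bandwidth per node vs. what a 1Gbps link can carry
  disks_per_node=4
  mb_per_disk=120     # assume ~120 MB/s sequential per spinning disk
  link_mb=117         # ~1Gbps expressed as MB/s of payload
  echo "node disk bw: $((disks_per_node * mb_per_disk)) MB/s vs link: ${link_mb} MB/s"
  # ~480 MB/s of disk behind a ~117 MB/s pipe: recovery and backfill will
  # queue on the network long before the disks are busy.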



Lindsay Mathieson
2016-04-22 14:02:19 UTC
Post by Brian ::
10Gbps or faster at a minimum, or you will have pain. Even with 4
nodes and 4 spinner disks in each node you will be maxing out a
1Gbps network.
Can't say I saw that on our cluster.

- 3 Nodes
- 3 OSD's per Node
- SSD journals for each OSD.
- 2*1G Eth in LACP Bond dedicated to Ceph
- 1G Admin Net
- Size = 3 (Replica 3)

Never came close to maxing out a 1Gbps connection; write throughput and
IOPS are terrible. Read is pretty good though.

Currently trialling Gluster 3.7.11 on ZFS bricks, replica 3 also. Triple
the throughput and IOPS I was getting with Ceph, it maxes out the 2*1G
connection, and it also seems to deal with VM I/O spikes better, not
letting other VMs be stalled.

Not convinced it's as robust as Ceph yet; give it a few more weeks. It
does cope very well with failover and brick heals (using 64MB shards).
--
Lindsay Mathieson
Brian ::
2016-04-22 21:50:57 UTC
Hi Lindsay,

With NVMe journals on a 3 node, 4 OSD cluster, if I do a quick dd of a
1GB file in a VM I see 2.34Gbps on the storage network straight away,
so if I were only using 1Gbps here the network would be a bottleneck.
If I perform the same in 2 VMs, traffic hits 4.19Gbps on the storage
network.

The throughput in the VM is 1073741824 bytes (1.1 GB) copied, 3.43556
s, 313 MB/s (R=3)

Would be very interested in hearing more about your gluster setup.. I
don't know anything about it - how many nodes are involved?





Lindsay Mathieson
2016-04-23 00:36:41 UTC
Post by Brian ::
With NVME journals on a 3 node 4 OSD cluster
Well, your hardware is rather better than mine :) I'm just using
consumer-grade SSDs for journals, which won't have anywhere near the
performance of NVMe.
Post by Brian ::
if I do a quick dd of a
1GB file in a VM I see 2.34Gbps on the storage network straight away,
so if I were only using 1Gbps here the network would be a bottleneck.
If I perform the same in 2 VMs, traffic hits 4.19Gbps on the storage
network.
The throughput in the VM is 1073741824 bytes (1.1 GB) copied, 3.43556
s, 313 MB/s (R=3)
dd isn't really a good test of throughput; it's too easy for the kernel
and filesystem to optimise it away. bonnie++ or even CrystalDiskMark (in a
Windows VM) would be interesting.
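
For reference, something like this inside the VM (mount point and size are
placeholders; the size should exceed the VM's RAM so caching can't hide
the storage):

  bonnie++ -d /mnt/test -s 8G -n 0 -u root
  # or, for a cruder number that at least bypasses the page cache:
  dd if=/dev/zero of=/mnt/test/ddfile bs=1M count=1024 oflag=direct,dsync
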
Post by Brian ::
Would be very interested in hearing more about your gluster setup.. I
don't know anything about it - how many nodes are involved?
POOMA U summary:

Red Hat offers two cluster filesystems, Ceph and Gluster - Gluster
actually predates Ceph, though Ceph definitely gets more attention now.

http://gluster.org,

Gluster replicates a file system directly, whereas Ceph RBD is a pure
block-based replication system (ignoring RGW etc). CephFS only reached
stable status in the latest release, but RBD is a good match for
block-based VM images. Like Ceph, Gluster has a direct block-level
interface for VM images (gfapi) integrated with qemu, which offers better
performance than FUSE-based filesystems.

One of the problems with gluster used to be its file-based replication
and healing process - it had no way of tracking block changes, so when a
node was down and a large VM image was written to, it would have to scan
and compare the entire multi-GB file for changes when the node came back
up. A non-issue for ceph, where block devices are stored in 4MB chunks
and it tracks which chunks have changed.

However, in version 3.7 gluster introduced sharded volumes, where files
are stored in shards. The shard size is configurable and defaults to 4MB.
That has brought gluster heal performance and resource usage into the same
league as ceph, though ceph is still slightly faster I think.
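
A sketch of what that looks like on the command line (hostnames, brick
paths and the volume name are made up; option names as of gluster 3.7):

  gluster volume create vmstore replica 3 \
      node1:/tank/brick node2:/tank/brick node3:/tank/brick
  gluster volume set vmstore features.shard on
  gluster volume set vmstore features.shard-block-size 64MB
  gluster volume start vmstore
  gluster volume info vmstore   # confirm the shard options took effect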

One huge problem I've noticed with ceph is snapshot speed. For me, via
proxmox, ceph rbd live snapshots were unusably slow: sluggish to take,
and rolling back a snapshot would take literally hours. Same problem
with restoring backups. Deal breaker for me. Gluster can use qcow2
images, and snapshot rollbacks take a couple of minutes at worst.
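
For anyone comparing, these are the operations in question (the VMID,
image and snapshot names are only examples); the backing store decides
how long the rollback actually takes:

  qm snapshot 101 before-upgrade
  qm rollback 101 before-upgrade
  # the same thing done directly against RBD:
  rbd snap create rbd/vm-101-disk-1@before-upgrade
  rbd snap rollback rbd/vm-101-disk-1@before-upgrade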

My hardware setup:

3 Proxmox nodes; VMs and ceph/gluster on all 3.

Node 1:
- Xeon E5-2620
- 64GB RAM
- ZFS RAID10
- SSD log & cache
- 4 * 3TB WD Red
- 3 * 1GB Eth

Node 2:
- 2 * Xeon E5-2660
- 64GB RAM
- ZFS RAID10
- SSD log & cache
- 4 * 3TB WD Red
- 3 * 1GB Eth

Node 3:
- Xeon E5-2620
- 64GB RAM
- ZFS RAID10,
- SSD log & cache
- 6 * 600GB VelociRaptor
- 2 * 3TB WD Red
- 2 * 1GB Eth


Originally ceph had all the disks to itself (xfs underneath); now ceph
and gluster are both running off ZFS pools while I evaluate gluster.
Currently half the VMs are running off gluster. Not ideal, as there is a
certain amount of overhead in running both.
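
For completeness, a pool like the ones above is created roughly like this
(pool name and device names are placeholders):

  zpool create -o ashift=12 tank \
      mirror /dev/sda /dev/sdb \
      mirror /dev/sdc /dev/sdd \
      log /dev/sde1 \
      cache /dev/sde2
  zfs create tank/brick    # dataset used as the gluster brick
  zpool status tank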

gluster - basically the same overall setup as ceph:
- replica 3
- 64MB shard size
- caching etc is all handled by ZFS


Crucial things for me:
- Stability. Does it crash a lot? :)
- Robustness: how well does it cope with node crashes, network outages etc.
- Performance: raw speed and IOPS.
- Snapshots: how easy it is to snapshot and roll back VMs. Not an issue
for everyone, but we run a lot of dev and testing VMs where easy access
to multiple snapshots is important.
- Backups: how easy it is to back up and *restore*.


cheers,
--
Lindsay Mathieson
Eneko Lacunza
2016-04-25 07:24:46 UTC
Hi,
Post by Lindsay Mathieson
Post by Brian ::
With NVME journals on a 3 node 4 OSD cluster
Well, your hardware is rather better than mine :) I'm just using
consumer-grade SSDs for journals, which won't have anywhere near the
performance of NVMe.
So there you have the reason for your bad write performance with Ceph.
You have to choose the journal SSDs carefully, otherwise you can be
better off even without SSDs... (yes, I made this very mistake too!)
What brand/model?

[...]
Post by Lindsay Mathieson
One huge problem I've noticed with ceph is snapshot speed. For me, via
proxmox, ceph rbd live snapshots were unusably slow: sluggish to take,
and rolling back a snapshot would take literally hours. Same problem
with restoring backups. Deal breaker for me. Gluster can use qcow2
images, and snapshot rollbacks take a couple of minutes at worst.
I don't usually use snapshots, but in Proxmox a restore to Ceph storage
is quite sloooooow too. (It's almost insulting really, a pity because
the integration is very good).


Cheers
Eneko
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611
943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Lindsay Mathieson
2016-04-25 08:41:21 UTC
Post by Eneko Lacunza
So there you have the reason for your bad write performance with Ceph.
You have to choose the journal SSDs carefully, otherwise you can be
better off even without SSDs... (yes, I made this very mistake too!)
What brand/model?
Yah I know :( Intel 530s initially, then Samsung 850 Pros. The 3700s
were out of our budget range for the size of our cluster.

Wasn't too happy with the direction Ceph is taking either; to me the
required hardware and complexity seem to be moving out of the range of
the small business setup, and I'm not happy with the concept of them
moving to bare disk access (Bluestore).
--
Lindsay Mathieson
Eneko Lacunza
2016-04-25 09:14:16 UTC
Post by Lindsay Mathieson
Post by Eneko Lacunza
So there you have the reason for your bad write performance with
Ceph. You have to choose the journal SSDs carefully, otherwise you can
be better off even without SSDs... (yes, I made this very mistake too!)
What brand/model?
Yah I know :( Intel 530s initially, then Samsung 850 Pros. The 3700s
were out of our budget range for the size of our cluster.
Wasn't too happy with the direction Ceph is taking either; to me the
required hardware and complexity seem to be moving out of the range
of the small business setup, and I'm not happy with the concept of them
moving to bare disk access (Bluestore).
The latest Bluestore bare-HDD performance data seems quite good, but
Bluestore needs time to reach production quality. I think the path is
good, and with Proxmox Ceph Server, management is very easy... ;)
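
For the record, the whole Proxmox-side workflow is roughly the following
(the network and device names are examples, and option spellings may
differ slightly between versions):

  pveceph install                        # install the ceph packages
  pveceph init --network 10.10.10.0/24   # once per cluster, writes ceph.conf
  pveceph createmon                      # on each (small-cluster) node
  pveceph createosd /dev/sdb -journal_dev /dev/sde   # OSD with SSD journal
  ceph -s                                # check cluster health afterwards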

What hardware requirements are out of the range of SBS?

Cheers
Eneko
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611
943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Alexandre DERUMIER
2016-04-25 15:00:07 UTC
Post by Eneko Lacunza
Post by Lindsay Mathieson
I'm not happy with the concept of them
moving to bare disk access (Bluestore).
Well, with Bluestore there is no more journal and no more double write.
So even without an SSD, performance will be better.

You can still have an SSD for fast acks, but something like an Intel
S3500 should be enough, because it won't have to write very much.



Lindsay Mathieson
2016-04-23 00:44:17 UTC
Post by Brian ::
Would be very interested in hearing more about your gluster setup.. I
don't know anything about it - how many nodes are involved?
I'd be interested in your ceph setup as well:
- Version?
- Rolled out using proxmox tools? (pveceph etc)
- underlying filesystem for the OSD's? (xfs etc)
- nobarrier disabled?


cheers,
--
Lindsay Mathieson
Lindsay Mathieson
2016-04-23 00:45:46 UTC
Also, what sort of iowait percentages are you seeing?
--
Lindsay Mathieson
Eneko Lacunza
2016-04-25 07:20:06 UTC
Hi,
Post by Brian ::
With NVMe journals on a 3 node, 4 OSD cluster, if I do a quick dd of a
1GB file in a VM I see 2.34Gbps on the storage network straight away,
so if I were only using 1Gbps here the network would be a bottleneck.
If I perform the same in 2 VMs, traffic hits 4.19Gbps on the storage
network.
The throughput in the VM is 1073741824 bytes (1.1 GB) copied, 3.43556
s, 313 MB/s (R=3)
Would be very interested in hearing more about your gluster setup.. I
don't know anything about it - how many nodes are involved?
Sorry, I don't think using "dd" gives any meaningful benchmark for
virtualization storage. :-)

Cheers
Eneko
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611
943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Alexandre DERUMIER
2016-04-23 15:23:46 UTC
@Lindsay
Post by Brian ::
Post by Lindsay Mathieson
Never came close to maxing out a 1Gbps connection; write throughput and
IOPS are terrible. Read is pretty good though.
Well, your hardware is rather better than mine :) I'm just using
consumer-grade SSDs for journals, which won't have anywhere near the
performance of NVMe.
Please, don't use consumer SSDs for the ceph journal, they suck (badly) at D_SYNC writes.

Please read this blog:

http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
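
The test in that post boils down to small synchronous direct writes,
roughly like this (the device name is a placeholder, and this is
destructive for whatever is on the target):

  # WARNING: writes straight to the device; only use a spare/empty SSD
  dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync
  # good journal SSDs sustain tens of MB/s here; many consumer models
  # drop to a few MB/s, which then caps every OSD behind them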


Eneko Lacunza
2016-04-22 13:44:12 UTC
Hi Mohamed,
Post by Mohamed Sadok Ben Jazia
Thank you Eneko,
I read on the Proxmox forum that distributed storage needs 10Gbit or faster
on the local network, and a dedicated network.
Could you detail the infrastructure you use, to see if it matches those
conditions?
We only have small/tiny clusters (3 clusters). I'll detail the one in
our office, which is the "biggest".

4 Proxmox 4.x nodes:
- 1 node for backup/helper: 2 cores, 16GB RAM, NFS export for
backups/ISO
- 3 main nodes: 4 cores (one with HT), 32 GB RAM, 2x1gbit ethernet
- Execute VMs (46 total, 17 running right now)
- Ceph storage: each node has
- 1 MDS daemon
- 3x1TB OSD disk
- 1xSSD disk for OSD journals

We are using one 1gbit interface for VM access and the Proxmox cluster;
the other 1gbit port is for the ceph public/private networks. We're using
size=2 for Ceph (two replicas of each piece of data).

With this setup we only saturate the ceph network at special backup
times (weekends); normally the network is below 20% average use.

Of course, this depends on your loads; also, if you're planning nodes
with more OSDs, it is easier to saturate the network.

Also, you have to take failures into account: what happens when
something breaks (a node, an OSD disk), and how long it will take until
the replica count is restored. Here, too, your network can have a big
impact.

What size of storage/virtualization load are you planning?

Cheers
Eneko
--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611
943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Mohamed Sadok Ben Jazia
2016-04-22 13:58:04 UTC
As it will be a large infrastructure, the number of nodes will increase
as demand increases.
The network will mainly be needed for backups (I'm not sure yet whether to
dedicate a node to backups only), and also for VM and LXC migration, to
use the nodes in an optimal way.
Good bandwidth is also needed when moving container storage, so that it
stays user-friendly for a user who wants to upgrade their container and we
have to move it to another free node.
Any recommendation is welcome.
Thank you
Angel Docampo
2016-04-25 07:59:45 UTC
Post by Mohamed Sadok Ben Jazia
Hello list,
In order to set up a highly scalable Proxmox infrastructure with a number
of clusters, I plan to use a distributed storage system, and I have some
questions:
1- I have a choice between Ceph and Gluster; which is better for Proxmox?
We use Gluster because we developed a custom translator for it, the
disperse translator (now in the official branch), and we have tested
Gluster extensively over the last 5 years. It is far from perfect, but it
works just fine, and even though it has its pros and cons, we like the
simplicity of setting up a cluster, and reinforced with ZFS it gets a
boost in performance.
Post by Mohamed Sadok Ben Jazia
2- Is it better to install one of those systems on the nodes themselves or
on separate servers?
Best practice says they should be on separate servers, but we have
them on the same Proxmox cluster. In our scenario ZFS "eats" a great part
of the RAM, so we need to have a lot of it, depending on the size of the
data in the volume (ZFS needs about 1GB of RAM per TB to be fine) and the
number of VMs you want to put on it. At least 64GB, but 128GB should be
fine.
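
If needed, the ARC ceiling can be pinned so ZFS and the VMs don't fight
over RAM; a sketch for Linux (the 16GB figure is only an example):

  echo "options zfs zfs_arc_max=$((16 * 1024 * 1024 * 1024))" \
      > /etc/modprobe.d/zfs.conf
  update-initramfs -u    # so the limit also applies from early boot
  # takes effect after reboot; the current value is visible in:
  cat /sys/module/zfs/parameters/zfs_arc_max
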
Post by Mohamed Sadok Ben Jazia
3- Can this architecture deliver a stable product, with VM and LXC
migration (not live migration), and store backups and snapshots, ISO
files and LXC container templates?
Yes, no problem.
--
Angel Docampo
Datalab Tecnologia, s.a.
Castillejos, 352 - 08025 Barcelona
Tel. 93 476 69 14 - Ext: 114
Mob. 670.299.381