Discussion:
[PVE-User] shared LVM on host-based mirrored iSCSI LUNs
Stefan Sänger
2012-04-23 11:05:32 UTC
Hi everybody,

After coping with the cluster problems, I am back with another question.
Let me first describe the setup:

I have two small NAS boxes running openfiler each providing an iSCSI
LUN. These LUNs are exactly the same size.
I connected both LUNs to all three of my Proxmox servers (pve1, pve2 and
pve3).

In order to implement host-based mirroring I installed mdadm on each
Proxmox server and created a RAID1 (/dev/md0) on pve1 using both LUNs.
Then I created a physical volume on that RAID and a new volume group
(/dev/vgiscsi).

After restarting mdadm and executing pvscan, the volume group was visible
on all three servers.
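
For reference, the whole setup boiled down to something like this; the
LUN device names (/dev/sdb and /dev/sdc) are just examples and may
differ per node:

  # on pve1: build the RAID1 from the two iSCSI LUNs and put LVM on top
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
  pvcreate /dev/md0
  vgcreate vgiscsi /dev/md0

  # on pve2 and pve3: assemble the existing array and rescan for PVs
  mdadm --assemble /dev/md0 /dev/sdb /dev/sdc
  pvscan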

I added that volume group as a Proxmox storage on the first server and
set the "shared" flag, and then I installed / restored several virtual
machines onto it, using that volume group as storage.
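
If I remember correctly, the resulting entry in /etc/pve/storage.cfg
looks roughly like this (same names as above):

  lvm: vgiscsi
          vgname vgiscsi
          content images
          shared 1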

And now I'm amazed, because I did several tests with live migration,
adding VMs to HA, simulated node failures etc.

Since I did not configure any locking for mdadm, I figured that mdadm
would end up corrupting the contents of the logical volumes used as
virtual hard disks.

But to my surprise fsck did not reveal any errors.

So my question is: is there some locking already in place that I just
missed? clvm is installed but obviously not used; /etc/lvm/lvm.conf is
set to file-based locking and the locking_dir is local to every server.
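
For completeness, these are the relevant settings in my
/etc/lvm/lvm.conf (the defaults, as far as I can tell):

  # in the "global" section of /etc/lvm/lvm.conf
  locking_type = 1                  # file-based locking
  locking_dir = "/var/lock/lvm"     # a directory local to each node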

Neither /etc/default/mdadm nor /etc/mdadm/mdadm.conf contains any hint
about locking, so I wonder whether I have simply been lucky not to
encounter any errors, or whether I missed something.


TIA, Stefan
Dietmar Maurer
2012-04-23 11:21:48 UTC
Post by Stefan Sänger
Since I did not configure any locking for mdadm, I figured that mdadm
would end up corrupting the contents of the logical volumes used as
virtual hard disks.
But to my surprise fsck did not reveal any errors.
So my question is: is there some locking already in place that I just missed?
clvm is installed but obviously not used; /etc/lvm/lvm.conf is set to
file-based locking and the locking_dir is local to every server.
Yes, we have cluster wide locking, as long as you use the pve tools to manage storage.

- Dietmar
Stefan Sänger
2012-04-23 15:37:41 UTC
Hi Dietmar,
Post by Dietmar Maurer
Since I did not configure any locking for mdadm, I figured that mdadm
would end up corrupting the contents of the logical volumes used as
virtual hard disks.
But to my surprise fsck did not reveal any errors.
So my question is: is there some locking already in place that I just missed?
clvm is installed but obviously not used; /etc/lvm/lvm.conf is set to
file-based locking and the locking_dir is local to every server.
Yes, we have cluster wide locking, as long as you use the pve tools to manage storage.
Well - as far as I understand it, that cluster-wide locking is what
makes DRBD or plain iSCSI/FC targets unproblematic.

What I am really not sure about is the different case of a RAID setup
with mdadm on top of two iSCSI targets.

The fault scenario I am thinking about is this:

Node pve1 is running a VM managed by HA when it crashes. At the moment
of the crash the VM was writing data to its hard disk. In normal
operation the data is passed to LVM, which passes it to md - and md
writes the data to each RAID member disk.

I suspect there is a chance that the last write operation only
succeeded on one of the RAID members.

Now the cluster starts its work and will do two things: fence the failed
node and start the VM on another node, let's say pve2.

Since the restart of pve1 initiated by fencing takes some time, and the
VM is booted on pve2 before that restart completes, the RAID metadata
will most likely still state "clean" when pve1 reconnects to the storage
- so that will not be a problem.

But looking at the physical extents used by the logical volume, the
situation is different: the last write may have made it to only one of
the members, so those extents now hold different data on the two LUNs.
When data is read from a RAID1 volume, md is supposed to do round-robin
reads in order to speed up disk access. I believe there is roughly a
50/50 chance of which RAID member an extent will be read from, so it is
not defined whether the correct data will be returned. Or am I missing
something here?
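
One thing I could do to at least see whether the members' metadata has
diverged after such a crash is to compare the md superblocks directly
(device names are placeholders again):

  mdadm --examine /dev/sdb | egrep 'Events|State'
  mdadm --examine /dev/sdc | egrep 'Events|State'

But that only catches metadata differences, not a single write that made
it to just one member - which is exactly what worries me.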

The cluster-wide locking works at the LVM layer. But my concern this
time is one layer further down: mdadm.


Stefan
Dietmar Maurer
2012-04-23 16:06:28 UTC
Post by Stefan Sänger
The cluster-wide locking works at the LVM layer. But my concern this
time is one layer further down: mdadm.
A normal write does not lock the device anyway - but maybe I do not understand your concern?

- Dietmar
Dietmar Maurer
2012-04-23 16:08:43 UTC
Post by Stefan Sänger
But my concern this time is
one layer further down: mdadm.
And yes, disk caches and software RAID are always a problem - that is one reason why we do not support software RAID.

- Dietmar
Flavio Stanchina
2012-04-23 16:10:51 UTC
Post by Stefan Sänger
I have two small NAS boxes running openfiler each providing an iSCSI
LUN. These LUNs are exactly the same size.
I connected both LUNs to all three of my Proxmox servers (pve1, pve2 and
pve3).
In order to implement host-based mirroring I installed mdadm on each
Proxmox server and created a RAID1 (/dev/md0) on pve1 using both LUNs.
Then I created a physical volume on that RAID and a new volume group
(/dev/vgiscsi).
After restarting mdadm and executing pvscan, the volume group was visible
on all three servers.
Not safe, as far as I know. It would be just like using a
non-distributed filesystem such as extX on shared storage: md is not
meant to be used in this way; there is no locking between multiple
nodes. While I can't think of a sure way to break it, I wouldn't feel
safe using it in production.

Use DRBD between the two NAS boxes -- or whatever kind of realtime
mirroring OpenFiler has to offer -- to mirror the disks, then use
multipath to expose both ends to the VM hosts.
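
Just as a rough sketch - the WWID below is a placeholder, and this
assumes both targets present the same SCSI ID, which you have to arrange
on the target side:

  # /etc/multipath.conf
  defaults {
          user_friendly_names yes
  }
  multipaths {
          multipath {
                  wwid  36001405aabbccdd0000000000000000a
                  alias sanvol
          }
  }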
--
Flavio Stanchina
Informatica e Servizi
Trento - Italy
Stefan Sänger
2012-04-25 15:52:59 UTC
Hi,

I did some more research and here is what I found out...
Post by Flavio Stanchina
Not safe, as far as I know.
That was my first guess right away. My iSCSI setup is just a test
environment to see what is possible - but unfortunately there is a
production system with FC, i.e. two FC SAN boxes, on which host-based
mirroring is supposed to be implemented.

BTW: That hardware was not my decision, and right now I am basically
trying to figure out what will be the best way to go on with this problem...
Post by Flavio Stanchina
It would be just like using a
non-distributed filesystem such as extX on shared storage: md is not
meant to be used in this way, there is no locking between multiple
nodes.
Well - it is not really like using a file system on shared storage.
As long as everything is working and the RAID has been synced once, it
is not really a problem to connect the RAID LUNs to another host. The
other host will simply discover a clean RAID, find the LVM information
and go along with that.

The interesting part is that LVM in fact acts as a kind of locking
mechanism here: every logical volume is only used by a single VM, and
that single VM can only run on one host at a time. So there is a clear
mapping of physical extents to virtual machines, and hence there is no
data corruption, as every host system only writes to extents it is
allowed to.

But in case of a failover, when one of the hosts goes down, the other
hosts are not aware of the RAID state, since every host keeps its own
copy of the RAID metadata.

And a write command issued by a VM to the logical volume means that md
has to issue two write commands - one to each LUN. Since there is no
communication about the RAID state between the hosts, there is no way
to even get a consistent view of that state.

What is more, reading from a clean, synced RAID1 is supposed to be done
round-robin just like RAID0 - without checking the mirrored block.

So if something has been written to only one RAID member, it is a
coin-flip whether you will read that data or not. And that means that
even if an fsck inside the VM thinks everything is fine, it may not be :(
Post by Flavio Stanchina
While I can't think of a sure way to break it, I wouldn't feel
safe to use it in production.
Well, I think I have just described a plausible way in which it can break.
And that leads me to the next question:

Instead of using RAID to do the mirroring, LVM itself should be able to
take care of this. I will do some tests, but maybe you guys around here
have some good ideas about it.

So my next test will be:

- deleting the RAID
- disconnecting the iSCSI-Targets from all nodes but one
- creating a single physical volume on each LUN
- creating the volume group using -cy (--clustered=yes) with both LUNs
- probably the tricky point: creating the logical volume manually
using lvcreate -m 1
- adding that volume to the virtual machine

I am not sure about some lvcreate options like --mirrorlog yet, and not
sure if it will work anyway. But I think I should give it a try...
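
In commands, the plan looks roughly like this; device, VG and LV names
are placeholders, and the -L size and --mirrorlog option are just what I
intend to try first:

  # tear down the md RAID (on the one node that still sees the LUNs)
  mdadm --stop /dev/md0
  mdadm --zero-superblock /dev/sdb /dev/sdc

  # one PV per LUN, then a clustered VG spanning both
  pvcreate /dev/sdb /dev/sdc
  vgcreate -cy vgiscsi /dev/sdb /dev/sdc

  # mirrored LV, one copy on each LUN, mirror log kept in memory
  lvcreate -m 1 --mirrorlog core -L 32G -n vm-100-disk-1 vgiscsi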
Post by Flavio Stanchina
Use DRBD between the two NAS boxes -- or whatever kind of realtime
mirroring OpenFiler has to offer -- to mirror the disks, then use
multipath to expose both ends to the VM hosts.
As mentioned above, this is basically some research on how to implement
host-based mirroring. I did not come up with this requirement, but since
I have been using Proxmox VE for some time now I would prefer using it
here as well.



Stefan
Stefan Sänger
2012-04-30 19:07:09 UTC
Hi, I'm replying to myself quite a lot, but I think that some of my
thoughts and findings might be interesting...
Post by Stefan Sänger
Instead of using RAID to do the mirroring, LVM itself should be able to
take care of this. I will do some tests, but maybe you guys around here
have some good ideas about it.
- deleting the RAID
- disconnecting the iSCSI-Targets from all nodes but one
- creating a single physical volume on each LUN
Up to this point everything went as expected.
Post by Stefan Sänger
- creating the volume group using -cy (--clustered=yes) with both LUNs
That just meant a little bit more to do:

- I had to change /etc/default/clvm to enable clvmd.
- I changed the locking type in /etc/lvm/lvm.conf from 1 (file based)
to 3 (cluster builtin) - see the snippet below this list.
- I created a new VM from WebUI, which created a logical volume.
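
For reference, the change that actually matters is in /etc/lvm/lvm.conf
on every node, followed by a restart of the cluster LVM daemon (the init
script is called clvm on my Debian-based nodes):

  # in the "global" section of /etc/lvm/lvm.conf
  locking_type = 3    # built-in cluster-wide locking via clvmd (was 1)

  # afterwards
  /etc/init.d/clvm restart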
Post by Stefan Sänger
- probably the tricky point: creating the logical volume manually
using lvcreate -m 1
My intention was to create the volume itself from the WebUI, which
worked fine. Next I wanted to convert the volume to a mirrored volume.
Unfortunately I got the message
"Shared cluster mirrors are not available."

So I decided to dig a little deeper here first and discovered that
cmirror is missing.

After having a look at the tarball from
ftp://sources.redhat.com/pub/lvm2/LVM2.2.02.95.tgz it looks like the
packaged lvm2 has been built without --enable-cmirrord.

For the moment I might be stuck here, and I'm thinking about compiling
lvm myself.
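
If I go down that road, the relevant part should be something like this
in the LVM2 source tree - assuming the cluster manager to build clvmd
against is cman, which is what the PVE cluster stack uses as far as I
know:

  ./configure --with-clvmd=cman --enable-cmirrord
  make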

Is there a repository available with the sources for the PVE packages?
Everything in http://download.proxmox.com/sources/ looks like it is not
from the current release (just guessing from the file dates).



regards, Stefan
Dietmar Maurer
2012-05-01 06:59:31 UTC
Post by Stefan Sänger
Is there a repository available with the sources for the PVE packages?
Everything in http://download.proxmox.com/sources/ looks like it is not
from the current release (just guessing from the file dates).
http://git.proxmox.com
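
For example (the repository names can be browsed there):

  git clone git://git.proxmox.com/git/pve-storage.git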
