Discussion:
[PVE-User] ceph disk replace
lists
2018-09-04 08:56:11 UTC
Permalink
Hi,

We had a broken disk, and needed to replace it. In the process we
noticed something (to us) surprising:

Steps taken:

- set noscrub / nodeep-scrub from the cli
- stop the OSD from pve gui
- out the OSD from pve gui
- wait for data rebalance and HEALTH_OK

When again HEALTH_OK:
- remove the OSD from pve gui
but at this point ceph started rebalancing again, which to us was
unexpected..?

It is now rebalancing nicely, but can we prevent this data movement next
time..? (and HOW?)

And a question on adding the new OSD:

I tried putting in the new filestore OSD, with an SSD journal, but it failed with:
create OSD on /dev/sdj (xfs)
using device '/dev/sdl' for journal
ceph-disk: Error: journal specified but not allowed by osd backend
TASK ERROR: command 'ceph-disk prepare --zap-disk --fs-type xfs --cluster ceph --cluster-uuid 1397f1dc-7d94-43ea-ab12-8f8792eee9c1 --journal-dev /dev/sdj /dev/sdl' failed: exit code 1
The device /dev/sdl is my journal SSD; it already has a partition for
each journal (7 partitions currently), and of course pve should create a
new partition for the journal and *not* use the whole disk, as the above
command appears to try..?
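
If we understand ceph-disk correctly (an assumption on our side), when it
is handed the whole journal device it should simply carve out a new
journal partition, sized by the journal setting in ceph.conf, something
like:

[osd]
# assumed value; the size in MB of each journal partition ceph-disk creates
osd journal size = 5120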

Any ideas..?

MJ
Lindsay Mathieson
2018-09-04 09:40:49 UTC
Permalink
Post by lists
- set noscrub / nodeep-scrub from the cli
- stop the OSD from pve gui
- out the OSD from pve gui
- wait for data rebalance and HEALTH_OK
- remove the OSD from pve gui
but at this point ceph started rebalancing again, which to us was
unexpected..?
That's what it's meant to do. If you want to prevent a rebalance while
doing disk maintenance, you need to set *noout* (ceph osd set noout).
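
In practice that flag-based maintenance pattern looks roughly like this
(plain ceph CLI; the OSD id is a placeholder, adjust to your own):

ceph osd set noout            # down OSDs are not marked out, so no backfill starts
systemctl stop ceph-osd@<id>  # or stop the OSD via the PVE GUI
# ... replace the disk ...
ceph osd unset noout          # return to normal out/rebalance behaviour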
--
Lindsay
lists
2018-09-04 09:42:08 UTC
Permalink
Post by Lindsay Mathieson
That's what it's meant to do. If you want to prevent a rebalance while
doing disk maintenance, you need to set *noout* (ceph osd set noout).
Ah right..!

Can I set that now, to make it stop rebalancing, or should I just let it
finish, and then add the new disk?

And any idea about the issue while adding the new OSD/journal..?

MJ
Lindsay Mathieson
2018-09-04 09:48:06 UTC
Permalink
Post by lists
Can I set that now, to make it stop rebalancing, or should I just let
it finish, and then add the new disk?
Should be fine - what sort of replication level are you running? 3 copies?
Post by lists
And any idea about the issue while adding the new OSD/journal..?
Sorry, not familiar with ceph-disk utilities.
--
Lindsay
Mark Schouten
2018-09-04 09:48:30 UTC
Permalink
Post by lists
- remove the OSD from pve gui
but at this point ceph started rebalancing again, which to us was
unexpected..?
It is now rebalancing nicely, but can we prevent this data movement next
time..? (and HOW?)
I think you can't prevent it. Removing an OSD changes the CRUSH-map,
and thus rebalances. However, there is no need to wait for it to finish
rebalancing again.

You can add the new OSD directly after removing the old one.
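
For what it's worth, the removal presumably boils down to something like
this under the hood (plain ceph CLI, osd.X is a placeholder; shown only
to illustrate why the CRUSH map changes):

ceph osd out osd.X            # data is remapped away from the OSD (first rebalance)
ceph osd crush remove osd.X   # its weight leaves the CRUSH map (second rebalance)
ceph auth del osd.X           # drop its cephx key
ceph osd rm osd.X             # remove the OSD id from the cluster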
--
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten | Tuxis Internet Engineering
KvK: 61527076 | http://www.tuxis.nl/
T: 0318 200208 | ***@tuxis.nl
Gilberto Nunes
2018-09-04 09:57:39 UTC
Permalink
On the other hand, what if he uses the command:

ceph osd crush reweight osd.X 0

in order to "mark" that OSD as not to be used?

---
Gilberto Nunes Ferreira

(47) 3025-5907
(47) 99676-7530 - Whatsapp / Telegram

Skype: gilberto.nunes36
Post by Mark Schouten
Post by lists
- remove the OSD from pve gui
but at this point ceph started rebalancing again, which to us was
unexpected..?
It is now rebalancing nicely, but can we prevent this data movement next
time..? (and HOW?)
I think you can't prevent it. Removing an OSD changes the CRUSH-map,
and thus rebalances. However, there is no need to wait for it to finish
rebalancing again.
You can add the new OSD directly after removing the old one.
--
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten | Tuxis Internet Engineering
KvK: 61527076 | http://www.tuxis.nl/
lists
2018-09-04 10:15:04 UTC
Permalink
Hi,

Thanks for the interesting discussion.
create OSD on /dev/sdj (xfs)
using device '/dev/sdl' for journal
ceph-disk: Error: journal specified but not allowed by osd backend
TASK ERROR: command 'ceph-disk prepare --zap-disk --fs-type xfs --cluster ceph --cluster-uuid 1397f1dc-7d94-43ea-ab12-8f8792eee9c1 --journal-dev /dev/sdj /dev/sdl' failed: exit code 1
/dev/sdj is the new OSD, as seen by the OS, and /dev/sdl is our journal
ssd. The journal ssd currently has 7 partitions, 5GB each, holding the
journal for 7 OSDs on that host, all added using the pve gui. PVE
created each partition automatically.

However, it looks as if this time pve tries to use the WHOLE ssd for a
journal..? Should it not create an 8th partition on /dev/sdl..?
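
For completeness, this is the sort of thing one can run to double-check
the current state of the journal SSD (standard tooling, device names as
above) - see the output below:

ceph-disk list      # which partitions back which OSD data/journal
fdisk -l /dev/sdl   # raw view of the partition table on the journal SSD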

MJ
lists
2018-09-04 11:16:57 UTC
Permalink
Hi,
...skipped less relevant stuff
/dev/sdg1 ceph data, active, cluster ceph, osd.22, journal /dev/sdl7
/dev/sdh1 ceph data, active, cluster ceph, osd.21, journal /dev/sdl6
/dev/sdi1 ceph data, active, cluster ceph, osd.20, journal /dev/sdl5
/dev/sdj other, unknown
/dev/sdl8 other
/dev/sdl1 ceph journal, for /dev/sda1
/dev/sdl2 ceph journal, for /dev/sdb1
/dev/sdl3 ceph journal, for /dev/sdc1
/dev/sdl4 ceph journal, for /dev/sdd1
/dev/sdl7 ceph journal, for /dev/sdg1
/dev/sdl6 ceph journal, for /dev/sdh1
/dev/sdl5 ceph journal, for /dev/sdi1
Disk /dev/sdl: 447.1 GiB, 480103981056 bytes, 937703088 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: D6AE7F88-3C21-4C98-BDAC-1140554D8BB2
Device Start End Sectors Size Type
/dev/sdl1 2048 10487807 10485760 5G unknown
/dev/sdl2 10487808 20973567 10485760 5G unknown
/dev/sdl3 20973568 31459327 10485760 5G unknown
/dev/sdl4 31459328 41945087 10485760 5G unknown
/dev/sdl5 41945088 52430847 10485760 5G unknown
/dev/sdl6 52430848 62916607 10485760 5G unknown
/dev/sdl7 62916608 73402367 10485760 5G unknown
Looking at the above, fdisk lists only seven partitions on the journal
device /dev/sdl, but ceph-disk lists an 8th "other" partition on /dev/sdl.

Perhaps that is why ceph-disk complains "Error: journal specified but
not allowed by osd backend".
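
One way to cross-check the two views of /dev/sdl (plain GPT tooling,
nothing Ceph-specific):

sgdisk --print /dev/sdl    # what the GPT on disk actually contains
partprobe /dev/sdl         # ask the kernel to re-read the partition table
grep sdl /proc/partitions  # what the kernel currently believes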

Ideas?

MJ
mj
2018-09-04 18:45:47 UTC
Permalink
It turned out to be caused by this line in our ceph.conf:

setuser match path = /var/lib/ceph/$type/$cluster-$id

which we forgot to remove after the jewel upgrade and completing the
chown-ing of the OSD data.

After we removed the line from ceph.conf, no restart or anything, we
could add the OSD/journal as expected.
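
For the archives, the offending fragment looked like this (shown under
[osd] here; adjust to wherever it sits in your ceph.conf):

[osd]
# left over from the jewel upgrade; once the OSD data has been chown'ed
# to ceph:ceph this line is no longer needed and should be removed
setuser match path = /var/lib/ceph/$type/$cluster-$id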

Hope this helps someone else in the future :-)

MJ
p***@elchaka.de
2018-09-07 22:11:46 UTC
Permalink
Hi all,
Post by Gilberto Nunes
ceph osd crush reweight osd.X 0
That is the way you should go. But do it in a gentle way so there
shouldn't be much impact for your clients - gently drain the CRUSH
weight of the OSD in question down to "0".

This way you only have one rebalance instead of two!
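
A minimal sketch of that gradual drain (osd.X and the step values are only
placeholders; start from whatever weight "ceph osd tree" shows):

ceph osd crush reweight osd.X 0.5   # lower the CRUSH weight one step at a time
# wait for backfill to settle / HEALTH_OK, then continue
ceph osd crush reweight osd.X 0.2
ceph osd crush reweight osd.X 0     # at weight 0 the OSD holds no data
# removing the now empty OSD afterwards should not move data again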

Hth
- Mehmet
Post by Gilberto Nunes
in order to "mark" that OSD as not to be used?
---
Gilberto Nunes Ferreira
(47) 3025-5907
(47) 99676-7530 - Whatsapp / Telegram
Skype: gilberto.nunes36
Post by Mark Schouten
Post by lists
- remove the OSD from pve gui
but at this point ceph started rebalancing again, which to us was
unexpected..?
It is now rebalancing nicely, but can we prevent this data movement next
time..? (and HOW?)
I think you can't prevent it. Removing an OSD changes the CRUSH-map,
and thus rebalances. However, there is no need to wait for it to finish
rebalancing again.
You can add the new OSD directly after removing the old one.
--
Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
Mark Schouten | Tuxis Internet Engineering
KvK: 61527076 | http://www.tuxis.nl/
mj
2018-09-08 18:00:11 UTC
Permalink
Hi,
Post by p***@elchaka.de
That is the way you should go. But do it in a gentle way so there shouldn't be much impact for your clients - gently drain the CRUSH weight of the OSD in question down to "0".
This way you only have one rebalance instead of two!
Thanks, noted :-)
