ZFS I got an NVMe U.2 SSD (actually two) to use as ZIL - now what?

So, I got a pair of Solidigm SSDPF2SQ800GZ that I need to use as mirrored ZILs.
It reports its size as 800166076416 bytes.

I understand I should create a namespace (which I understand to be something like a partition) on it.

However, I don't seem to get the syntax and the man-page conveniently omits any examples.

Bash:
(server-prod </root>) 0 # nvmecontrol ns create -s 150031139328 -c 200041519104 -f 0 nvme0         
nvmecontrol: namespace creation failed: Invalid Field
(server-prod </root>) 74 # nvmecontrol ns create -s 150031139328 -c 200041519104 -f 2 nvme0
nvmecontrol: namespace creation failed: Invalid Field
(server-prod </root>) 74 #
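A likely culprit (not verified on this hardware): the NVMe Namespace Management command defines its size and capacity fields (NSZE/NCAP) in logical blocks, not bytes, so nvmecontrol's -s/-c probably expect LBA counts. Converting the intended byte sizes, assuming the 512-byte LBA Format #00:

```python
# Hypothetical fix: convert byte sizes to LBA counts, assuming the
# 512-byte LBA Format #00 and that -s/-c are given in logical blocks.
lba_size = 512
size_bytes = 150_031_139_328   # intended namespace size from the post
cap_bytes = 200_041_519_104    # intended namespace capacity

size_lbas = size_bytes // lba_size
cap_lbas = cap_bytes // lba_size
print(size_lbas, cap_lbas)  # 293029569 390706092
```

If that guess is right, the invocation would be something like `nvmecontrol ns create -s 293029569 -c 390706092 -f 0 nvme0`.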

taken from here:

What value should -f be?

What else should I provide?

I intend to use about a quarter of the SSD as ZIL and the rest for L2ARC.

This is 14.1-RELEASE-p6, for the record.
 
Indeed the use case seems to be to break up a drive into smaller units. How ZIL would benefit from that I don't know.


for logical isolation, multi-tenancy, security isolation (encryption per namespace), write protecting a namespace for recovery purposes, overprovisioning to improve write performance and endurance and so on.
 
How does that fit with a mirrored ZIL?

OK you are going to mirror drives for ZIL. That is recommended. But 800GB drives? That is so overkill it is silly (expensive too).
What is your anticipated pool size? Spinners or SSD? Installed memory?

I intend to use about a quarter of the SSD as ZIL and the rest for L2ARC.
I don't like the sound of that. Mirror two drives then divide? I want dedicated ZIL.
Mirror ZIL, separate drive for L2ARC.
Those dinky Intel Optane drives would make nice ZIL.
 
No. Think of it this way: the I/O funnel is only so wide, divvy it up all you want.

And on topic: do you want L2ARC on the same device as the SLOG, even with different namespaces?

I say no. It still sucks down overall IOPS. Max out motherboard memory instead of L2ARC disk?
 
I would guess no, and probably each controller is hardwired to an x4 PCIe lane set. Whether bifurcation is required would be a good question.

These were mostly found in Oracle machines, as seen used on eBay. There was a similar prior generation too, the P3608.
 
Supermicro wouldn't sell us Optanes for these servers. And 800GB was apparently the smallest they would sell us.

OK, I think I've read that splitting the SSD that forms the ZIL isn't a good idea. I still want to underprovision them.

Any idea how the command is supposed to work?
 
Indeed the use case seems to be to break up a drive into smaller units. How ZIL would benefit from that I don't know.

As I understood, a namespace is mandatory. Even on my laptop, with an NVMe that does not support namespaces, I was forced to create a single namespace before I could use it.

OK you are going to mirror drives for ZIL. That is recommended. But 800GB drives? That is so overkill it is silly (expensive too).
It's a matter of sizing. These are 100'000 cycle devices. So the sizing is for a pool with sustained 500 MB/sec synchronous write load.
No comment on the price.
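For what it's worth, that sizing claim can be sanity-checked with back-of-envelope arithmetic (endurance and load figures taken from the post above):

```python
# Back-of-envelope endurance check for an 800 GB SLOG device.
capacity_bytes = 800_166_076_416   # reported drive size
write_cycles = 100_000             # endurance figure from the post
total_writable = capacity_bytes * write_cycles   # ~80 PB

rate = 500 * 10**6                 # 500 MB/s sustained sync writes
years = total_writable / rate / (365 * 24 * 3600)
print(round(years, 1))             # roughly 5 years at full load
```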

I don't like the sound of that. Mirror two drives then divide? I want dedicated ZIL.
Mirror ZIL, separate drive for L2ARC.
L2ARC cannot be mirrored. However L2ARC and ZIL can coexist on same device: mirror the two pieces of ZIL, just combine those for L2ARC.
Use gpart to partition this, and forget about the namespacing (create only one default full-size namespace if required).

Obviously this is not optimal performance-wise, because one kind of traffic interferes with the other, but if the overall bandwidth, queue depth and responsiveness of the device can cope with that, it should work.
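The gpart approach could look roughly like this; device names, GPT labels and the 16G SLOG size are placeholders, not a verified recipe:

```shell
# One default full-size namespace per drive, then slice with gpart.
# nvd0/nvd1 (or nda0/nda1 on newer FreeBSD) are hypothetical names.
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -s 16G -l slog0 nvd0
gpart add -t freebsd-zfs -l l2arc0 nvd0
# ...same on nvd1 with labels slog1/l2arc1, then:
zpool add tank log mirror gpt/slog0 gpt/slog1
zpool add tank cache gpt/l2arc0 gpt/l2arc1
```

Note the cache (L2ARC) vdevs are simply striped; only the log vdev is mirrored.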

So if you split by namespace, do you get separate PCIe devices? As opposed to just different disk devices?
Nope. You get /dev/nvmeXnsY. (And I didn't manage to get a nvme running without at least /dev/nvme0ns1)
 
"-f" selects the lbaf (LBA format), or (frm) when used with the format option; it should be chosen depending on what the disk supports. You can check it with "identify".
LBA Format #00: Data Size: 512 Metadata Size: 0 Performance: Better
LBA Format #01: Data Size: 512 Metadata Size: 8 Performance: Degraded
LBA Format #02: Data Size: 4096 Metadata Size: 0 Performance: Best
LBA Format #03: Data Size: 4096 Metadata Size: 8 Performance: Good
LBA Format #04: Data Size: 4096 Metadata Size: 64 Performance: Degraded
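On FreeBSD the supported formats can be read back with nvmecontrol itself; device names here are examples:

```shell
# Controller data, including whether namespace management is supported:
nvmecontrol identify nvme0
# Namespace data, including the supported LBA formats listed above:
nvmecontrol identify nvme0ns1
```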
 
How do I know which field it objects to?

Code:
(server-prod </root>) 0 # nvmecontrol ns create -s 150031139328 -c 200041519104 -L 0 -d 0 nvme0   
nvmecontrol: namespace creation failed: Invalid Field
 
Do you have a need for synchronous writes? Did you know that the ZIL is only used for synchronous writes?

When you move the ZIL to a Separate Intent Log (SLOG), the SLOG does not need to be large. To quote Jude & Lucas:

"The sysctl vfs.zfs.dirty_data_max gives the maximum possible amount of in-flight data. FreeBSD 10's ZFS defaults to using a ZIL with a size equal to one-tenth of the system RAM"

I have also heard it repeated often that the ZIL can never benefit from being larger than main memory.

So 800GB would seem somewhat generous...
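A common rule of thumb (a sketch, not gospel): the SLOG only ever holds a few transaction groups' worth of in-flight sync writes, so a few seconds of write throughput bounds the useful size:

```python
# SLOG sizing sketch: the log only holds a few txgs of in-flight
# sync writes before they are committed to the main pool.
rate = 500 * 10**6        # assumed sync write load, 500 MB/s
txg_interval = 5          # default txg timeout in seconds
txgs_held = 3             # headroom: a few outstanding txgs
slog_bytes = rate * txg_interval * txgs_held
print(slog_bytes / 10**9)  # 7.5 (GB) -- a far cry from 800 GB
```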

These days, people seem to ignore that advice that the ZIL needs to be protected from data loss. You should be aware that if your ZIL media do not have power loss protection, and you lose power, your ZFS pool may be irretrievably corrupted. That's not a risk I would have taken in any place where I worked.

If that were my system, and I needed synchronous writes, I would get some small media with power loss protection to mirror for the ZIL. To deploy the 800GB media, you could then figure out if you might benefit from an L2ARC (some do not), and potentially also consider a special VDEV for pool metadata.
 
Do you have a need for synchronous writes? Did you know that the ZIL is only used for synchronous writes?
exactly this.

In almost all real-world scenarios that still reside on spinning rust for some reason, there is absolutely no need for a ZIL/SLOG device. Same goes for L2ARC (which often is 'recommended' on systems with low amounts of RAM, but actually increases the memory pressure...).
What you *almost always* want for a spinning-rust pool is a 'special' device that holds the metadata and maybe small files - because this generates a lot of random I/O, which performs particularly badly on spinning drives - i.e. so badly that listing a few hundred snapshots can take a few dozen minutes on a busy pool...
 
I now managed to create a namespace with the "nvme" cli.
However, I am not sure if it actually worked.

Code:
(server-prod </root>) 0 # /usr/local/sbin/nvme create-ns /dev/nvme0 -s 150031139328 -c 200041519104 -d 0 -m 0 -f 0
0xc0484e41: opc: 0xd fuse: 0 cid 0 nsid:0 cmd2: 0 cmd3: 0
          : cdw10: 0 cdw11: 0 cdw12: 0 cdw13: 0
          : cdw14: 0 cdw15: 0 len: 0x1000 is_read: 1
<--- 0 cid: 0 status 0x8004
create-ns: Success, created nsid:-2147221504
(server-prod </root>) 0 # nvme attach-ns /dev/nvme0 --namespace-id=1 -controllers=0                               
0xc0484e41: opc: 0x15 fuse: 0 cid 0 nsid:0x1 cmd2: 0 cmd3: 0
          : cdw10: 0 cdw11: 0 cdw12: 0 cdw13: 0
          : cdw14: 0 cdw15: 0 len: 0x1000 is_read: 1
<--- 0 cid: 0 status 0x8004
attach-ns: Success, nsid:1
(server-prod </root>) 0 # nvme list /dev/nvme0                                              
(server-prod </root>) 0 #

(from here)
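That "Success, created nsid:-2147221504" looks suspicious. Interpreting the signed value as an unsigned 32-bit number suggests the completion status (the same 0x8004 seen in the raw output) leaked into the nsid field, i.e. the creation probably failed despite the "Success" text:

```python
# Decode the bogus nsid reported by nvme create-ns.
nsid = -2147221504
raw = nsid & 0xFFFFFFFF
print(hex(raw))        # 0x80040000
print(hex(raw >> 16))  # 0x8004 -- the same status code logged above
```

That would also explain why `nvme list /dev/nvme0` shows nothing afterwards.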

The SSDs are enterprise SSDs; they are supposed to write their cache to NVRAM in case of power loss...

Output from dmesg after boot:
Code:
ixl0: fw 9.20.71847 api 1.15 nvm 9.00 etid 8000d299 oem 1.268.0
ixl1: fw 9.20.71847 api 1.15 nvm 9.00 etid 8000d299 oem 1.268.0
nvme0: <Generic NVMe Device> mem 0xbde10000-0xbde13fff irq 16 at device 0.0 numa-domain 0 on pci10
nvme1: <Generic NVMe Device> mem 0xbdd10000-0xbdd13fff irq 16 at device 0.0 numa-domain 0 on pci11
nvme0: SET_FEATURES (09) sqid:0 cid:15 nsid:0 cdw10:0000000b cdw11:0000031f
nvme0: INVALID_FIELD (00/02) crd:0 m:0 dnr:1 p:1 sqid:0 cid:15 cdw0:0
nvme1: SET_FEATURES (09) sqid:0 cid:15 nsid:0 cdw10:0000000b cdw11:0000031f
nvme1: INVALID_FIELD (00/02) crd:0 m:0 dnr:1 p:1 sqid:0 cid:15 cdw0:0
[167] nvme0: IDENTIFY (06) sqid:0 cid:10 nsid:0 cdw10:00000011 cdw11:00000000
[167] nvme0: INVALID NAMESPACE OR FORMAT (00/0b) crd:0 m:0 dnr:1 p:0 sqid:0 cid:10 cdw0:0
[208] nvme0: NAMESPACE_MANAGEMENT (0d) sqid:0 cid:10 nsid:0 cdw10:00000000 cdw11:00000000
[208] nvme0: INVALID_FIELD (00/02) crd:0 m:0 dnr:1 p:0 sqid:0 cid:10 cdw0:0
 
This is a server (well, two servers actually) that receives syslog data over the network and keeps it for 400 days.
Each server has about 3k clients it receives data from.
 
OK, I got this resolved (sort of) by booting grml and installing Solidigm's "sst" util and using that to create namespaces.
They look almost "right".

What I created:
Code:
Size:                        209715200 blocks
Capacity:                    209715200 blocks
Utilization:                 209715200 blocks
Thin Provisioning:           Not Supported
Number of LBA Formats:       2
Current LBA Format:          LBA Format #00
Metadata Capabilities
  Extended:                  Not Supported
  Separate:                  Not Supported
Data Protection Caps:        Not Supported
Data Protection Settings:    Not Enabled
Multi-Path I/O Capabilities: Not Supported
Reservation Capabilities:    Not Supported
Format Progress Indicator:   Not Supported
Deallocate Logical Block:    Read 00h, Write Zero, Guard CRC
Optimal I/O Boundary:        256 blocks
NVM Capacity:                111669149696 bytes
Globally Unique Identifier:  0100000000000000c8d6b7ac39250050
IEEE EUI64:                  c8d6b7ac39250000
LBA Format #00: Data Size:   512  Metadata Size:     0  Performance: Good

What the default looks like:

Code:
Size:                        1562824368 blocks
Capacity:                    1562824368 blocks
Utilization:                 1562824368 blocks
Thin Provisioning:           Not Supported
Number of LBA Formats:       2
Current LBA Format:          LBA Format #00
Metadata Capabilities
  Extended:                  Not Supported
  Separate:                  Not Supported
Data Protection Caps:        Not Supported
Data Protection Settings:    Not Enabled
Multi-Path I/O Capabilities: Not Supported
Reservation Capabilities:    Not Supported
Format Progress Indicator:   Not Supported
Deallocate Logical Block:    Read 00h, Write Zero, Guard CRC
Optimal I/O Boundary:        256 blocks
NVM Capacity:                800166076416 bytes
Globally Unique Identifier:  0100000000000000c8d6b7b2923b0050
IEEE EUI64:                  c8d6b7b2923b0000
LBA Format #00: Data Size:   512  Metadata Size:     0  Performance: Good
LBA Format #01: Data Size:  4096  Metadata Size:     0  Performance: Good

Not sure if it makes a difference.
 
I have also heard it repeated often that the ZIL can never benefit from being larger than main memory.
ZIL cannot get larger than ARC, because it holds only the cached writes in the ARC.

So 800GB would seem somewhat generous...

These days, people seem to ignore that advice that the ZIL needs to be protected from data loss. You should be aware that if your ZIL media do not have power loss protection, and you lose power, your ZFS pool may be irretrievably corrupted. That's not a risk I would have taken in any place where I worked.
I don't think so. Power loss protection does not protect from data corruption, it only increases the speed of the device.

A device with power loss protection can ignore the CACHE FLUSH command, because the DRAM will be flushed after power loss anyway.
A device without power loss protection must on any CACHE FLUSH command save the DRAM to persistence, before returning. And that slows things down significantly.

And these beasts here, being flagship enterprise cache solutions, certainly have power loss protection.

In almost all real-world scenarios that still reside on spinning rust for some reason, there is absolutely no need for a ZIL/SLOG device.
NFS is completely synchronous.

Same goes for L2ARC (which often is 'recommended' on systems with low amounts of RAM, but actually increases the memory pressure...).
I don't know who brought that up, as it is bogus. L2ARC needs 0.5-1% of its size as additional RAM. So the math is simple. And when you don't get a database working set into RAM, but do get it into L2ARC, the effect is some x10 to x100 speedup.

What you *almost always* want for a spinning-rust-pool is a 'special' device that holds the metadata and maybe small files
These are designed for the special use cases of dRAID, as dRAID is ineffective for small files. And indeed they have beneficial effects in many other cases also.
 
I don't know who brought that up, as it is bogus. L2ARC needs 0.5-1% of its size as additional RAM. So the math is simple. And when you don't get a database working set into RAM, but do get it into L2ARC, the effect is some x10 to x100 speedup.
L2ARC needs another mapping table in memory - so it reduces the memory available to conventional ARC. If you are under constant memory pressure, adding a L2ARC will worsen your problems - so unless you completely maxed out your system with RAM, you should never think about adding L2ARC devices.
In most cases L2ARC is deployed on systems with far too little memory for what the system should handle, and people are complaining that performance got worse...


NFS is completely synchronous.
If you don't do async mounts, which is recommended in pretty much every 'best practices' guide (and IIRC even in the manpage). Also, the OP never said anything about NFS but about a syslog server; so it's safe to assume that writes are mostly local: huge numbers of very small (async!) writes which are usually aggregated before being committed to disk. This pattern of many small changes will generate tons of metadata, so a special device would still be the first thing I'd go for.
I'd also argue that tuning the transaction sizes might also be an option on those hosts.
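For reference, the knobs in question can be inspected (and, with care, tuned) via sysctl on FreeBSD; these are read-only examples, as useful values depend on the workload:

```shell
# Current transaction-group sizing knobs:
sysctl vfs.zfs.dirty_data_max
sysctl vfs.zfs.txg.timeout
```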


These are designed for the special use cases of dRAID, as dRAID is ineffective for small files. And indeed they have beneficial effects in many other cases also.
the special device has nothing to do with dRAID

zpool(8)
Code:
Special Allocation Class
       The allocations in the special class are    dedicated  to  specific     block
       types.    By  default this includes all metadata,    the indirect blocks of
       user data, and any dedup    data.  The class can also  be  provisioned  to
       accept a    limited    percentage of small file data blocks.
(taken from the 12.4 manpage - no idea in which one of the umpteen dozen fragments of that manpage this section has vanished in the hopelessly scattered manpages from 13 onwards...)

I've used special devices several times back on spinning-disk pools, especially on very busy hosts. We had a backup NAS in a branch that would take >20 minutes to list all snapshots from the spinning disks; with a pair of special devices (plain 'slow' SATA-SSDs) this went down to ~30 seconds. So IMHO adding a special device is pretty much always the best first step to increase pool performance, unless you *specifically* have a use case that produces a ton of sync writes which *really* should be handled synchronously (e.g. iSCSI targets for VMs).
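For completeness, adding such a mirrored special vdev and opting small blocks into it looks roughly like this (pool and partition names are hypothetical):

```shell
# The special vdev must be redundant -- losing it makes
# the whole pool unusable.
zpool add tank special mirror gpt/special0 gpt/special1
# Optionally route small file blocks to the special vdev as well:
zfs set special_small_blocks=32K tank
```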
 
the special device has nothing to do with dRAID

It's also worth noting that dRAID requires fixed stripe widths—not the dynamic widths supported by traditional RAIDz1 and RAIDz2 vdevs....This discrepancy gets worse the higher the values of d+p get—a draid2:8:1 would require a whopping 40KiB for the same metadata block! .... For this reason, the special allocation vdev is very useful in pools with dRAID vdevs—when a pool with draid2:8:1 and a 3-wide special needs to store a 4KiB metadata block, it does so in only 12KiB on the special, instead of 40KiB on the draid2:8:1.
 
L2ARC needs another mapping table in memory - so it reduces the memory available to conventional ARC. If you are under constant memory pressure, adding a L2ARC will worsen your problems - so unless you completely maxed out your system with RAM, you should never think about adding L2ARC devices.
In most cases L2ARC is deployed on systems with far too little memory for what the system should handle, and people are complaining that performance got worse...
Practical usecase: 64 GB installed memory, 1 TB database with some 250 GB regular working set.

Now you cannot just increase the installed memory to 256 GB - and anything below that will not have the benefit.

You can however add 256 GB L2ARC, and thus get the entire working set cached. This will use 2-3 GB ram for the headers - but that doesn't hurt at all, because the ram doesn't do much good at that stage.
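The header arithmetic can be sketched; the exact per-record header size varies by OpenZFS version, so the ~96 bytes and the 8K database recordsize below are assumptions:

```python
# L2ARC header overhead sketch.
l2arc_bytes = 256 * 2**30   # 256 GiB cache device
recordsize = 8 * 2**10      # 8K records, typical for a database
header_bytes = 96           # assumed per-record header size
records = l2arc_bytes // recordsize
overhead = records * header_bytes
print(overhead / 2**30)     # 3.0 (GiB) -- matching the 2-3 GB estimate
```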
 
I don't think so. Power loss protection does not protect from data corruption, it only increases the speed of the device.

A device with power loss protection can ignore the CACHE FLUSH command, because the DRAM will be flushed after power loss anyway.
A device without power loss protection must on any CACHE FLUSH command save the DRAM to persistence, before returning. And that slows things down significantly.
You are correct, and I retract the assertion that you must have PLP for reliable function.

My view that Power Loss Protection (PLP) was mandatory for reliable function came from a time when I was building my ZFS server, a decade ago, and some consumer grade SSDs had buggy firmware.

However it's worth affirming that a SLOG without PLP will be very slow.
 
As I understood, namespace is mandatory. Even on my laptop and with a nvme that does not support namespaces, I was forced to create a single namespace before I could use it.
I have never had to set a namespace on any of my NVMe drives. I am not sure why the discrepancy here.
I have worked with probably 30 models of NVMe in the last 8 years, new and used.

The OP here has set a second namespace when I don't see why. Please someone explain what I missed?

Is this something new? Eight years at this and I never heard of an NVMe shipping without a namespace.
 