ZFS I got an NVMe U.2 SSD (actually two) to use as ZIL - now what?

So, I got a pair of Solidigm SSDPF2SQ800GZ that I need to use as mirrored ZILs.
It reports its size as 800166076416 bytes.

I understand I should create a namespace (which I understand to be something like a partition) on it.

However, I don't seem to get the syntax and the man-page conveniently omits any examples.

Bash:
(server-prod </root>) 0 # nvmecontrol ns create -s 150031139328 -c 200041519104 -f 0 nvme0         
nvmecontrol: namespace creation failed: Invalid Field
(server-prod </root>) 74 # nvmecontrol ns create -s 150031139328 -c 200041519104 -f 2 nvme0
nvmecontrol: namespace creation failed: Invalid Field
(server-prod </root>) 74 #
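A likely culprit (not verified on this hardware): the NVMe Namespace Management command defines its size and capacity fields (NSZE/NCAP) in logical blocks, not bytes, so nvmecontrol's -s/-c probably expect LBA counts. Converting the intended byte sizes, assuming the 512-byte LBA Format #00:

```python
# Hypothetical fix: convert byte sizes to LBA counts, assuming the
# 512-byte LBA Format #00 and that -s/-c are given in logical blocks.
lba_size = 512
size_bytes = 150_031_139_328   # intended namespace size from the post
cap_bytes = 200_041_519_104    # intended namespace capacity

size_lbas = size_bytes // lba_size
cap_lbas = cap_bytes // lba_size
print(size_lbas, cap_lbas)  # 293029569 390706092
```

If that guess is right, the invocation would be something like `nvmecontrol ns create -s 293029569 -c 390706092 -f 0 nvme0`.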

taken from here:

What value should -f be?

What else should I provide?

I intend to use about a quarter of the SSD as ZIL and the rest for L2ARC.

This is 14.1-RELEASE-p6, for the record.
 
Indeed the use case seems to be to break up a drive into smaller units. How ZIL would benefit from that I don't know.


for logical isolation, multi-tenancy, security isolation (encryption per namespace), write protecting a namespace for recovery purposes, overprovisioning to improve write performance and endurance and so on.
 
How does that fit with a mirrored ZIL?

OK you are going to mirror drives for ZIL. That is recommended. But 800GB drives? That is so overkill it is silly (expensive too).
What is your anticipated pool size? Spinners or SSD? Installed memory?

I intend to use about a quarter of the SSD as ZIL and the rest for L2ARC.
I don't like the sound of that. Mirror two drives then divide? I want dedicated ZIL.
Mirror ZIL, separate drive for L2ARC.
Those dinky Intel Optane drives would make nice ZIL.
 
No. Think of it this way: the I/O funnel is only so wide, divvy it up all you want.

And on topic: do you want L2ARC on the same device as the SLOG, even with different namespaces?

I say no. It still sucks down overall IOPS. Max out motherboard memory instead of L2ARC disk?
 
I would guess no, and probably each controller is hardwired to an x4 PCIe lane set. Whether bifurcation is required would be a good question.

These were mostly found in Oracle machines, as seen used on eBay. There was a similar prior generation too, the P3608.
 
Supermicro wouldn't sell us Optanes for these servers. And 800GB was apparently the smallest they would sell us.

OK, I think I've read that splitting the SSD that forms the ZIL isn't a good idea. I still want to underprovision them.

Any idea how the command is supposed to work?
 
Indeed the use case seems to be to break up a drive into smaller units. How ZIL would benefit from that I don't know.

As I understood, a namespace is mandatory. Even on my laptop, with an NVMe that does not support namespaces, I was forced to create a single namespace before I could use it.

OK you are going to mirror drives for ZIL. That is recommended. But 800GB drives? That is so overkill it is silly (expensive too).
It's a matter of sizing. These are 100'000 cycle devices. So the sizing is for a pool with sustained 500 MB/sec synchronous write load.
No comment on the price.
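For what it's worth, that sizing claim can be sanity-checked with back-of-envelope arithmetic (endurance and load figures taken from the post above):

```python
# Back-of-envelope endurance check for an 800 GB SLOG device.
capacity_bytes = 800_166_076_416   # reported drive size
write_cycles = 100_000             # endurance figure from the post
total_writable = capacity_bytes * write_cycles   # ~80 PB

rate = 500 * 10**6                 # 500 MB/s sustained sync writes
years = total_writable / rate / (365 * 24 * 3600)
print(round(years, 1))             # roughly 5 years at full load
```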

I don't like the sound of that. Mirror two drives then divide? I want dedicated ZIL.
Mirror ZIL, separate drive for L2ARC.
L2ARC cannot be mirrored. However L2ARC and ZIL can coexist on same device: mirror the two pieces of ZIL, just combine those for L2ARC.
Use gpart to partition this, and forget about the namespacing (create only one default full-size namespace if required).

Obviously this is not optimal performance-wise, because one kind of traffic interferes with the other, but if the overall bandwidth, queue depth and responsiveness of the device can cope with that, it should work.
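The gpart approach could look roughly like this; device names, GPT labels and the 16G SLOG size are placeholders, not a verified recipe:

```shell
# One default full-size namespace per drive, then slice with gpart.
# nvd0/nvd1 (or nda0/nda1 on newer FreeBSD) are hypothetical names.
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -s 16G -l slog0 nvd0
gpart add -t freebsd-zfs -l l2arc0 nvd0
# ...same on nvd1 with labels slog1/l2arc1, then:
zpool add tank log mirror gpt/slog0 gpt/slog1
zpool add tank cache gpt/l2arc0 gpt/l2arc1
```

Note the cache (L2ARC) vdevs are simply striped; only the log vdev is mirrored.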

So if you split by namespace, do you get separate PCIe devices? As opposed to just different disk devices?
Nope. You get /dev/nvmeXnsY. (And I didn't manage to get a nvme running without at least /dev/nvme0ns1)
 
"-f" selects the lbaf (LBA format), or (frm) when used with the format option; it should be chosen depending on what the disk supports. You can check it with "identify".
LBA Format #00: Data Size: 512 Metadata Size: 0 Performance: Better
LBA Format #01: Data Size: 512 Metadata Size: 8 Performance: Degraded
LBA Format #02: Data Size: 4096 Metadata Size: 0 Performance: Best
LBA Format #03: Data Size: 4096 Metadata Size: 8 Performance: Good
LBA Format #04: Data Size: 4096 Metadata Size: 64 Performance: Degraded
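On FreeBSD the supported formats can be read back with nvmecontrol itself; device names here are examples:

```shell
# Controller data, including whether namespace management is supported:
nvmecontrol identify nvme0
# Namespace data, including the supported LBA formats listed above:
nvmecontrol identify nvme0ns1
```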
 
How do I know which field it objects to?

Code:
(server-prod </root>) 0 # nvmecontrol ns create -s 150031139328 -c 200041519104 -L 0 -d 0 nvme0   
nvmecontrol: namespace creation failed: Invalid Field
 
Do you have a need for synchronous writes? Did you know that the ZIL is only used for synchronous writes?

When you move the ZIL to a Separate Intent Log (SLOG), the SLOG does not need to be large. To quote Jude & Lucas:

"The sysctl vfs.zfs.dirty_data_max gives the maximum possible amount of in-flight data. FreeBSD 10's ZFS defaults to using a ZIL with a size equal to one-tenth of the system RAM"

I have also heard it repeated often that the ZIL can never benefit from being larger than main memory.

So 800GB would seem somewhat generous...
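A common rule of thumb (a sketch, not gospel): the SLOG only ever holds a few transaction groups' worth of in-flight sync writes, so a few seconds of write throughput bounds the useful size:

```python
# SLOG sizing sketch: the log only holds a few txgs of in-flight
# sync writes before they are committed to the main pool.
rate = 500 * 10**6        # assumed sync write load, 500 MB/s
txg_interval = 5          # default txg timeout in seconds
txgs_held = 3             # headroom: a few outstanding txgs
slog_bytes = rate * txg_interval * txgs_held
print(slog_bytes / 10**9)  # 7.5 (GB) -- a far cry from 800 GB
```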

These days, people seem to ignore that advice that the ZIL needs to be protected from data loss. You should be aware that if your ZIL media do not have power loss protection, and you lose power, your ZFS pool may be irretrievably corrupted. That's not a risk I would have taken in any place where I worked.

If that were my system, and I needed synchronous writes, I would get some small media with power loss protection to mirror for the ZIL. To deploy the 800GB media, you could then figure out if you might benefit from an L2ARC (some do not), and potentially also consider a special VDEV for pool metadata.
 
Do you have a need for synchronous writes? Did you know that the ZIL is only used for synchronous writes?
exactly this.

In almost all real-world scenarios that still reside on spinning rust for some reason, there is absolutely no need for a ZIL/SLOG device. Same goes for L2ARC (which often is 'recommended' on systems with low amounts of RAM, but actually increases the memory pressure...).
What you *almost always* want for a spinning-rust pool is a 'special' device that holds the metadata and maybe small files - because this generates a lot of random I/O, which performs particularly badly on spinning drives - i.e. so badly that listing a few hundred snapshots can take a few dozen minutes on a busy pool...
 
I now managed to create a namespace with the "nvme" cli.
However, I am not sure if it actually worked.

Code:
(server-prod </root>) 0 # /usr/local/sbin/nvme create-ns /dev/nvme0 -s 150031139328 -c 200041519104 -d 0 -m 0 -f 0
0xc0484e41: opc: 0xd fuse: 0 cid 0 nsid:0 cmd2: 0 cmd3: 0
          : cdw10: 0 cdw11: 0 cdw12: 0 cdw13: 0
          : cdw14: 0 cdw15: 0 len: 0x1000 is_read: 1
<--- 0 cid: 0 status 0x8004
create-ns: Success, created nsid:-2147221504
(server-prod </root>) 0 # nvme attach-ns /dev/nvme0 --namespace-id=1 -controllers=0                               
0xc0484e41: opc: 0x15 fuse: 0 cid 0 nsid:0x1 cmd2: 0 cmd3: 0
          : cdw10: 0 cdw11: 0 cdw12: 0 cdw13: 0
          : cdw14: 0 cdw15: 0 len: 0x1000 is_read: 1
<--- 0 cid: 0 status 0x8004
attach-ns: Success, nsid:1
(server-prod </root>) 0 # nvme list /dev/nvme0                                              
(server-prod </root>) 0 #

(from here)
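That "Success, created nsid:-2147221504" looks suspicious. Interpreting the signed value as an unsigned 32-bit number suggests the completion status (the same 0x8004 seen in the raw output) leaked into the nsid field, i.e. the creation probably failed despite the "Success" text:

```python
# Decode the bogus nsid reported by nvme create-ns.
nsid = -2147221504
raw = nsid & 0xFFFFFFFF
print(hex(raw))        # 0x80040000
print(hex(raw >> 16))  # 0x8004 -- the same status code logged above
```

That would also explain why `nvme list /dev/nvme0` shows nothing afterwards.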

The SSDs are enterprise SSDs; they are supposed to write their cache to NVRAM in case of power loss...

Output from dmesg after boot:
Code:
ixl0: fw 9.20.71847 api 1.15 nvm 9.00 etid 8000d299 oem 1.268.0
ixl1: fw 9.20.71847 api 1.15 nvm 9.00 etid 8000d299 oem 1.268.0
nvme0: <Generic NVMe Device> mem 0xbde10000-0xbde13fff irq 16 at device 0.0 numa-domain 0 on pci10
nvme1: <Generic NVMe Device> mem 0xbdd10000-0xbdd13fff irq 16 at device 0.0 numa-domain 0 on pci11
nvme0: SET_FEATURES (09) sqid:0 cid:15 nsid:0 cdw10:0000000b cdw11:0000031f
nvme0: INVALID_FIELD (00/02) crd:0 m:0 dnr:1 p:1 sqid:0 cid:15 cdw0:0
nvme1: SET_FEATURES (09) sqid:0 cid:15 nsid:0 cdw10:0000000b cdw11:0000031f
nvme1: INVALID_FIELD (00/02) crd:0 m:0 dnr:1 p:1 sqid:0 cid:15 cdw0:0
[167] nvme0: IDENTIFY (06) sqid:0 cid:10 nsid:0 cdw10:00000011 cdw11:00000000
[167] nvme0: INVALID NAMESPACE OR FORMAT (00/0b) crd:0 m:0 dnr:1 p:0 sqid:0 cid:10 cdw0:0
[208] nvme0: NAMESPACE_MANAGEMENT (0d) sqid:0 cid:10 nsid:0 cdw10:00000000 cdw11:00000000
[208] nvme0: INVALID_FIELD (00/02) crd:0 m:0 dnr:1 p:0 sqid:0 cid:10 cdw0:0
 
This is a server (well, two servers actually) that receives syslog data over the network and keeps it for 400 days.
Each server has about 3k clients it receives data from.
 
OK, I got this resolved (sort of) by booting grml and installing Solidigm's "sst" util and using that to create namespaces.
They look almost "right".

What I created:
Code:
Size:                        209715200 blocks
Capacity:                    209715200 blocks
Utilization:                 209715200 blocks
Thin Provisioning:           Not Supported
Number of LBA Formats:       2
Current LBA Format:          LBA Format #00
Metadata Capabilities
  Extended:                  Not Supported
  Separate:                  Not Supported
Data Protection Caps:        Not Supported
Data Protection Settings:    Not Enabled
Multi-Path I/O Capabilities: Not Supported
Reservation Capabilities:    Not Supported
Format Progress Indicator:   Not Supported
Deallocate Logical Block:    Read 00h, Write Zero, Guard CRC
Optimal I/O Boundary:        256 blocks
NVM Capacity:                111669149696 bytes
Globally Unique Identifier:  0100000000000000c8d6b7ac39250050
IEEE EUI64:                  c8d6b7ac39250000
LBA Format #00: Data Size:   512  Metadata Size:     0  Performance: Good

What the default looks like:

Code:
Size:                        1562824368 blocks
Capacity:                    1562824368 blocks
Utilization:                 1562824368 blocks
Thin Provisioning:           Not Supported
Number of LBA Formats:       2
Current LBA Format:          LBA Format #00
Metadata Capabilities
  Extended:                  Not Supported
  Separate:                  Not Supported
Data Protection Caps:        Not Supported
Data Protection Settings:    Not Enabled
Multi-Path I/O Capabilities: Not Supported
Reservation Capabilities:    Not Supported
Format Progress Indicator:   Not Supported
Deallocate Logical Block:    Read 00h, Write Zero, Guard CRC
Optimal I/O Boundary:        256 blocks
NVM Capacity:                800166076416 bytes
Globally Unique Identifier:  0100000000000000c8d6b7b2923b0050
IEEE EUI64:                  c8d6b7b2923b0000
LBA Format #00: Data Size:   512  Metadata Size:     0  Performance: Good
LBA Format #01: Data Size:  4096  Metadata Size:     0  Performance: Good

Not sure if it makes a difference.
 
I have also heard it repeated often that the ZIL can never benefit from being larger than main memory.
ZIL cannot get larger than ARC, because it holds only the cached writes in the ARC.

So 800GB would seem somewhat generous...

These days, people seem to ignore that advice that the ZIL needs to be protected from data loss. You should be aware that if your ZIL media do not have power loss protection, and you lose power, your ZFS pool may be irretrievably corrupted. That's not a risk I would have taken in any place where I worked.
I don't think so. Power loss protection does not protect from data corruption, it only increases the speed of the device.

A device with power loss protection can ignore the CACHE FLUSH command, because the DRAM will be flushed after power loss anyway.
A device without power loss protection must on any CACHE FLUSH command save the DRAM to persistence, before returning. And that slows things down significantly.

And these beasts here, being flagship enterprise cache solutions, certainly have power loss protection.

In almost all real-world scenarios that still reside on spinning rust for some reason, there is absolutely no need for a ZIL/SLOG device.
NFS is completely synchronous.

Same goes for L2ARC (which often is 'recommended' on systems with low amounts of RAM, but actually increases the memory pressure...).
I don't know who brought that up, as it is bogus. L2ARC needs 0.5-1% of its size as additional RAM. So the math is simple. And when you don't get a database working set into RAM, but do get it into L2ARC, the effect is some x10 to x100 speedup.

What you *almost always* want for a spinning-rust-pool is a 'special' device that holds the metadata and maybe small files
These are designed for the special use cases of dRAID, as dRAID is ineffective for small files. And indeed they have beneficial effects in many other cases also.
 
I don't know who brought that up, as it is bogus. L2ARC needs 0.5-1% of its size as additional RAM. So the math is simple. And when you don't get a database working set into RAM, but do get it into L2ARC, the effect is some x10 to x100 speedup.
L2ARC needs another mapping table in memory - so it reduces the memory available to conventional ARC. If you are under constant memory pressure, adding a L2ARC will worsen your problems - so unless you completely maxed out your system with RAM, you should never think about adding L2ARC devices.
In most cases L2ARC is deployed on systems with far too little memory for what the system should handle, and people are complaining that performance got worse...


NFS is completely synchronous.
If you don't do async mounts, which is recommended in pretty much every 'best practices' guide (and IIRC even in the manpage). Also, the OP never said anything about NFS but about a syslog server; so it's safe to assume that writes are mostly local: huge numbers of very small (async!) writes which are usually aggregated before being committed to disk. This pattern of many small changes will generate tons of metadata, so a special device would still be the first thing I'd go for.
I'd also argue that tuning the transaction sizes might also be an option on those hosts.
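For reference, the knobs in question can be inspected (and, with care, tuned) via sysctl on FreeBSD; these are read-only examples, as useful values depend on the workload:

```shell
# Current transaction-group sizing knobs:
sysctl vfs.zfs.dirty_data_max
sysctl vfs.zfs.txg.timeout
```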


These are designed for the special use cases of dRAID, as dRAID is ineffective for small files. And indeed they have beneficial effects in many other cases also.
the special device has nothing to do with dRAID

zpool(8)
Code:
Special Allocation Class
       The allocations in the special class are    dedicated  to  specific     block
       types.    By  default this includes all metadata,    the indirect blocks of
       user data, and any dedup    data.  The class can also  be  provisioned  to
       accept a    limited    percentage of small file data blocks.
(taken from the 12.4 manpage - no idea in which one of the umpteen dozen fragments of that manpage this section has vanished in the hopelessly scattered manpages from 13 onwards...)

I've used special devices several times back on spinning-disk pools, especially on very busy hosts. We had a backup NAS in a branch that would take >20 minutes to list all snapshots from the spinning disks; with a pair of special devices (plain 'slow' SATA-SSDs) this went down to ~30 seconds. So IMHO adding a special device is pretty much always the best first step to increase pool performance, unless you *specifically* have a use case that produces a ton of sync writes which *really* should be handled synchronously (e.g. iSCSI targets for VMs).
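For completeness, adding such a mirrored special vdev and opting small blocks into it looks roughly like this (pool and partition names are hypothetical):

```shell
# The special vdev must be redundant -- losing it makes
# the whole pool unusable.
zpool add tank special mirror gpt/special0 gpt/special1
# Optionally route small file blocks to the special vdev as well:
zfs set special_small_blocks=32K tank
```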
 
the special device has nothing to do with dRAID

It's also worth noting that dRAID requires fixed stripe widths—not the dynamic widths supported by traditional RAIDz1 and RAIDz2 vdevs....This discrepancy gets worse the higher the values of d+p get—a draid2:8:1 would require a whopping 40KiB for the same metadata block! .... For this reason, the special allocation vdev is very useful in pools with dRAID vdevs—when a pool with draid2:8:1 and a 3-wide special needs to store a 4KiB metadata block, it does so in only 12KiB on the special, instead of 40KiB on the draid2:8:1.
 
L2ARC needs another mapping table in memory - so it reduces the memory available to conventional ARC. If you are under constant memory pressure, adding a L2ARC will worsen your problems - so unless you completely maxed out your system with RAM, you should never think about adding L2ARC devices.
In most cases L2ARC is deployed on systems with far too little memory for what the system should handle, and people are complaining that performance got worse...
Practical usecase: 64 GB installed memory, 1 TB database with some 250 GB regular working set.

Now you cannot just increase the installed memory to 256 GB - and anything below that will not have the benefit.

You can however add 256 GB L2ARC, and thus get the entire working set cached. This will use 2-3 GB ram for the headers - but that doesn't hurt at all, because the ram doesn't do much good at that stage.
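The header arithmetic can be sketched; the exact per-record header size varies by OpenZFS version, so the ~96 bytes and the 8K database recordsize below are assumptions:

```python
# L2ARC header overhead sketch.
l2arc_bytes = 256 * 2**30   # 256 GiB cache device
recordsize = 8 * 2**10      # 8K records, typical for a database
header_bytes = 96           # assumed per-record header size
records = l2arc_bytes // recordsize
overhead = records * header_bytes
print(overhead / 2**30)     # 3.0 (GiB) -- matching the 2-3 GB estimate
```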
 
I don't think so. Power loss protection does not protect from data corruption, it only increases the speed of the device.

A device with power loss protection can ignore the CACHE FLUSH command, because the DRAM will be flushed after power loss anyway.
A device without power loss protection must on any CACHE FLUSH command save the DRAM to persistence, before returning. And that slows things down significantly.
You are correct, and I retract the assertion that you must have PLP for reliable function.

My view that Power Loss Protection (PLP) was mandatory for reliable function came from a time when I was building my ZFS server, a decade ago, and some consumer grade SSDs had buggy firmware.

However it's worth affirming that a SLOG without PLP will be very slow.
 
As I understood, namespace is mandatory. Even on my laptop and with a nvme that does not support namespaces, I was forced to create a single namespace before I could use it.
I have never had to set a namespace on any of my NVMe drives. I am not sure why the discrepancy here.
I have worked with probably 30 models of NVMe in the last 8 years, new and used.

The OP here has set a second namespace when I don't see why. Please someone explain what I missed?

Is this something new? Eight years at this and I never heard of an NVMe shipping without a namespace.
 