ZFS: Why does periodic(8) scrub ZFS every 35 days by default?

Deleted member 76849

Guest
In /etc/defaults/periodic.conf, there's this part about periodic ZFS scrubs:

Code:
# 800.scrub-zfs
daily_scrub_zfs_enable="NO"
daily_scrub_zfs_pools=""                        # empty string selects all pools
daily_scrub_zfs_default_threshold="35"          # days between scrubs
#daily_scrub_zfs_${poolname}_threshold="35"     # pool specific threshold

Why is daily_scrub_zfs_default_threshold set to 35 days (5 weeks) by default? I thought the recommendation was to scrub every 7 days (1 week)?
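
For context: if one prefers weekly scrubs, the defaults can presumably be overridden in /etc/periodic.conf rather than by editing /etc/defaults/periodic.conf; something like the following should do it (the pool name "tank" in the per-pool line is just an example):
Code:
# /etc/periodic.conf
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="7"       # scrub roughly once a week
#daily_scrub_zfs_tank_threshold="14"        # optional per-pool override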
 
Scrubbing is for wimps anyway...
Ha!

My opinion on scrubbing:
The daily scrubbing is disabled by default.
The intent is to make sure all the internal data structures are consistent with the data on the vdevs.
Scrubbing takes resources (time) and can degrade performance while the scrub is running.

How often to scrub can be roughly related to the quality of the devices in the vdevs.
An old "rule of thumb" was "higher quality (enterprise, datacenter) devices scrub less often, consumer scrub more often".
Not sure if it really applies anymore, maybe SSD/NVMe vs spinning devices makes a difference.

Me, since I'm a little paranoid, I'll manually run a scrub maybe 2 or 3 times a year. I've never seen it report that it fixed any errors, so I may just be prematurely wearing things out.
I have seen a running scrub trigger timeouts on devices, which led me to keep an eye on a device and maybe replace it sooner.
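
For reference, kicking one off by hand and watching it is just the following (the pool name is an example):
Code:
# zpool scrub tank        # start a scrub of pool "tank"
# zpool status tank       # shows scrub progress and any repaired errors
# zpool scrub -s tank     # cancel a running scrub if it gets in the way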
 
An expensive way of keeping an eye on memory-based storage. Don't scrub them.
Scrubbing: isn't it mostly a read operation? Read, verify the checksums are correct, fix only if wrong? Erase cycles are typically the killer of memory-based devices; reads shouldn't affect longevity, and writes depend on erases.

That's all why I said "maybe... makes a difference".
And the occasional timeouts I've seen triggered have been on spinning devices.

Physical problems are not likely to be fixed by scrubbing, and scrubbing may trigger latent physical problems (timeouts). In the end it boils down to this: it's your system, do what makes you comfortable.
 
Scrubbing may be for wimps, but wimps are the people who get to keep their data. Scrubbing is actually really important for that. Scrubbing is, as cracauer already said, to find problems earlier. Before more and bigger problems can accumulate, and overwhelm whatever redundancy mechanism is there to deal with problems (redundancy can be RAID, backups, or the redundancy inherent in file system metadata). This applies to detecting disk drive hardware problems (on both spinning rust and SSDs), silent data (or metadata) corruptions (which exist, and in a large enough system are measurably common), and software bugs and user errors. In a nutshell, the reliability of a redundant storage system depends critically on the time required to repair damage (which is why the MTTR is discussed in the appendix of the original RAID paper), and scrubbing allows one to begin repairs sooner, effectively reducing the MTTR. There is a really nice paper published by some people at NetApp about 15 years ago demonstrating how scrubbing increases data durability.

Now, how often or quickly should one scrub? That's a tradeoff between three things. The more often one scrubs, the sooner one detects problems, so from that viewpoint one should scrub as fast as possible. On the other hand, scrubbing can interfere with the performance of the real user workload (and use more energy = carbon footprint, if one worries about such things). And on certain hardware types, the reads that scrubbing does can cause wear on the drive itself, so scrubbing overly quickly may actually be counter-productive.

Whether reads are bad for the devices depends on fine details. On SSDs, at the hardware (flash chip) level, reads do not contribute to wearout. But a continuous stream of reads may keep the controller overly busy, preventing it from doing garbage collection (internal compaction) efficiently, and inefficient compaction at the last minute may require more block erases (depends on the fine details of how the FTL is implemented). On spinning disks, it was long thought that reads do not create wear on heads and platters. We now know this to be false: even reading a disk will cause transfer of lubricant between heads and platters and make the surface less flat (although the effect might be worse for writes, due to the reduced fly height when writing), which is why current disks have limits on the total IO throughput to maintain warranty coverage (commonly 550 TB/year). This seems to hold both for conventional disks and for shingled (SMR) disks, but there is anecdotal evidence that SMR-enabled hardware may be more reliable when used correctly (full zone writes). In the near future, when HAMR/MAMR disks come into production, it is likely the balance will flip again, with writes having a much bigger effect than reads.

To my knowledge, there is no published study on the optimal tradeoff between scrub frequency and durability. Even if it were published, the result would depend on a precarious balance between the mechanisms for data destruction (vibration, temperature cycles causing hardware problems, the quality of the storage software stack) and the thoroughness of the scrub (for example, does it validate checksums and the whole metadata graph, like fsck), so the results could not be generalized.
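
To put rough numbers on that workload-rating point: scrubbing a full drive of, say, 14 TB once a week means about 52 × 14 ≈ 730 TB of reads per year from scrubbing alone, already above a 550 TB/year rating, while a monthly scrub of the same drive is about 12 × 14 ≈ 170 TB/year (the 14 TB figure is just an illustrative size).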

For amateur and small-scale users, the much bigger question is: Does scrub interfere with file system performance to an unacceptable level? The answer is clearly: it depends on hardware (how many disks, how fast, what disk interface, CPU performance), and performance expectations (do I need near-perfect speed, or is getting 10 MByte/s reliably good enough). Just as one example, on my home server (very small, 2 enterprise near-line disks in a mirror, 1.8 GHz Atom CPU and 4 gig of RAM), the system became unpleasantly slow when scrubbing. I had to adjust ZFS scrub as follows to get the system to feel responsive:
Code:
vfs.zfs.no_scrub_prefetch: leave at default 0
vfs.zfs.scrub_delay: 20 (default 4)
vfs.zfs.scan_idle: 1000 (default 50)
YMMV, so don't follow those values blindly. With these settings, I scrub the SSD-based file systems every 3 days, and the disk-based 3TB mirror file system once a week, and all scrubs finish in less than 24 hours (typically much less, I start them after 1am, and they are done by morning). I know I should probably change the schedule to do the disk every 2-3 weeks, but my silly little scrubbing script only looks at the weekday, and I haven't had the 10 minutes needed to teach it to count to 3.
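
In case it helps anyone else, a sketch of what "counting to 3" could look like, run daily from cron (the pool name and interval are just examples, and the epoch-day arithmetic is only one way to do it):
Code:
#!/bin/sh
# scrub-every-n-days.sh - hypothetical sketch: start a scrub only every Nth day
POOL="tank"           # example pool name
INTERVAL_DAYS=21      # example interval
# Days since the Unix epoch; scrub only when it is a multiple of the interval.
epoch_days=$(( $(date +%s) / 86400 ))
if [ $(( $epoch_days % $INTERVAL_DAYS )) -eq 0 ]; then
    zpool scrub "$POOL"
fi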

And: I have one "remote" backup disk attached to my server (the disk is physically 2m away from the server, inside a thick-walled safe). I know that scrubbing that disk for an hour increases its temperature, from typically 32 to over 40 degrees. Is that a bad thing? I don't know; the effect of temperature on disks is complex.
 
I think a resilver can take up to a day.
Nope - longer than that, based on my experience. Here is from when I replaced a bad drive on one of my fileservers last year.
Code:
tingo@kg-f6$ zpool status z6
  pool: z6
 state: ONLINE
  scan: resilvered 3.51T in 2 days 16:33:57 with 0 errors on Tue Sep 20 08:39:09 2022
config:

    NAME                                            STATE     READ WRITE CKSUM
    z6                                              ONLINE       0     0     0
      raidz3-0                                      ONLINE       0     0     0
        gptid/ab074aeb-3691-11ed-aacb-7085c239f419  ONLINE       0     0     0
        gptid/2226f441-9579-11e7-9009-7085c239f419  ONLINE       0     0     0
        gptid/231416ea-9579-11e7-9009-7085c239f419  ONLINE       0     0     0
        gptid/23fdb526-9579-11e7-9009-7085c239f419  ONLINE       0     0     0
        gptid/24edb679-9579-11e7-9009-7085c239f419  ONLINE       0     0     0
        gptid/25d23441-9579-11e7-9009-7085c239f419  ONLINE       0     0     0
        gptid/26bf7deb-9579-11e7-9009-7085c239f419  ONLINE       0     0     0
        gptid/27aac2e7-9579-11e7-9009-7085c239f419  ONLINE       0     0     0

errors: No known data errors
 
We now know this to be false, even reading a disk will cause transfer of lubricant between heads and platters and making the surface less flat (although the effect might be worse for writes, due to the reduced fly height when writing), which is why current disks have limits on the total IO throughput to maintain warranty coverage (commonly 550 TB/year).
Can you explain this phenomenon further? Or point to publications documenting how wear and tear occurs? In a working disk, the platter rotates all the time and the reading head also reads all the time, if only to keep track of the track it is on. How would an intentional reading differ from an idle reading?
 
Nope - longer than that, based on my experience. Here is from when I replaced a bad drive on one of my fileservers last year.
raidz is horribly slow when it comes to scrubs and resilvers (and also slow and inefficient otherwise...)
That's why you should always use mirrors for small pools, also because using a *single* raidz vdev is usually a bad idea if you want to get at least some performance from your pool...
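
For illustration, a small mirrored pool is as simple as the following (device names are just placeholders):
Code:
# zpool create tank mirror ada1 ada2      # 2-way mirror
# zpool add tank mirror ada3 ada4         # grow the pool by adding another mirror vdev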


regarding scrubs:
I'm running them monthly on all pools, regardless of whether it's SATA, SAS or NVMe, flash or spinning disks (actually I have only one NAS left with spinning rust...)
Especially for spinning rust, scrubs usually report errors long before the drive firmware admits there's something wrong. So yes - I'd consider them useful, but I wouldn't agonize over what exact interval is best for a given type of storage - the arbitrary choice of 'a month' is as good as the similarly arbitrary choice of 'N weeks'...

More important than the type of storage is how 'warm' the data on the pool is - if all of the data in a pool is read (or modified) within a relatively short time, scrubs are almost unnecessary and can be scheduled rarely. A pool with mostly stale data (e.g. long-term backups) should be scrubbed more frequently to detect and repair bitrot.
 
Can you explain this phenomenon further? Or point to publications documenting how wear and tear occurs? In a working disk, the platter rotates all the time and the reading head also reads all the time, if only to keep track of the track it is on. How would an intentional reading differ from an idle reading?
Sadly, I don't know any publication that explains this (in the open literature, that is; with an NDA with the disk manufacturers, you have access to more information).

In the old days, the head was relatively far above the platter, and its height was not actively controlled. In those days, a head crash meant that the head went through the iron oxide layer (which used to be brown), leaving a silver-colored streak and exposing the aluminum material of the platter itself. Today, platters are made of glass, and the magnetic layer is silvery (not brown, being cobalt based). The fly height of the head is adjusted, and it is different for idle (when the read amplifiers and decoders are actually turned off), read, and write. In particular for writing, the fly height is reduced even further. Now couple that with the fact that today's platters have a layer of lubricant on them, and that the heads fly exceedingly low. What really happens is that there is a non-zero chance in normal operation of the head picking up a little bit of lubricant. Usually it drops the lubricant off on the platter again, but if this happens too often, it leaves hills and valleys. Supposedly, this effect is similar to what happens when roads get "washboards", meaning waves form. I've also seen micrographs of disk surfaces where engineers use skiing terms like "moguls" to identify hills in the lubricant.

raidz is horribly slow when it comes to scrubs and resilvers (and also slow and inefficient otherwise...)
That's why you should always use mirrors for small pools, also because using a *single* raidz vdev is usually a bad idea if you want to get at least some performance from your pool...
I actually disagree with your wording. Clearly, parity-based RAID (whether ZFS's RAID-Z or the traditional RAID-5, 6, ...) can potentially be slower, in particular in the presence of small updates. But there are several counterarguments. First, it is much more space-efficient. For example, if you have 10 disks and use mirroring only, you will get 5 disks' worth of capacity and can tolerate a single disk fault. If you use RAID-5 or RAID-Z, you get 9 disks' worth of capacity, and can still tolerate a single fault. With a larger set of disks, that argument gets even stronger: mirroring is a factor of 2 overhead. The argument gets even stronger when you build systems that tolerate multiple faults (which for good data durability is a de-facto requirement when storing large amounts of data, as disk reliability hasn't kept up with disk size): 3-way mirrors can tolerate 2 faults, but the overhead increases to 3x; with a parity-based code on a 10-disk wide array, the overhead is only 20%. For systems larger than small amateur storage (with 2-3 disks), the cost of mirroring is high. In many cases, that cost is not justified by the performance increase.
 
ralphbsz
Iron oxides have long been discontinued as a magnetic layer. They were replaced with Co-alloys long before head height control began. And the magnetic layer then became thinner, making it at least as susceptible to mechanical damage. And the same will apply to FePt layers, which are just coming into use.

Do you think that whether you have a glass or aluminum substrate has anything to do with the durability of the disk and the read and write operations performed? Could you describe this further? The advantage of glass is low thermal expansion coefficients, which makes it easier to increase the recording density, especially in 2.5" drives. Have I missed anything else?

Can you explain how the head doesn't lose the servo signal and start knocking when it stops reading?


The lubricant layer was used back in the 1990s. Possible contamination of the heads is random and limiting the operating time does not protect against these events. Moreover, what would self-cleaning of the heads look like during operation?

The lubricant is distributed on the surface under the influence of centrifugal force. Were the disk images you saw taken during operation or during idle time?
 
A lot of detailed questions. No, I don't know whether a glass substrate helps or hurts durability. And I'm not 100% sure that they are used on all drives.

When the disk is truly idle, it doesn't even have to servo; that's a waste of energy. It can just raise the head high, and turn the servo (and read amplifiers) off. Regaining servo control is a relatively fast operation, since it is done during normal read/write IO all the time.

Today's platters still have a lubricant layer. In the old days, it was applied by "spinners", as you describe, which is the same process used to put a thin chemical layer on a semiconductor wafer: put a little glob of liquid in the middle of the platter, spin it (at a speed carefully calculated to balance centrifugal force and surface tension), and you get a pretty uniform layer of chemical. I don't know whether today's lubricant is applied that way, or using a solvent bath, or by vapor deposition (which is used for the undercoat and the magnetic layer). I've been told that today's lubricant layer is not like a grease or oil, but more like the tough and durable teflon coating on a frying pan (and to be clear: when I say "more like", that doesn't mean it's the same kind or thickness as on a frying pan).

What causes surface damage is an interesting question, and I don't know all the details. The bits I do know: it is a mix of foreign-object damage and the head getting too close to the platter and picking up bits of the topmost (lubricant) layer. Conversely, the heads can also deposit some of that lubricant back onto the platter, without damaging the underlying magnetic layer. Like that, the heads may contribute to the surface being non-flat.

And the only time I've seen the flatness of the platter being studied was offline, with the platters removed and under an STM (scanning tunneling microscope). It's interesting that one can take pictures of a platter and then, using knowledge of the geometry of the bits on the surface, figure out roughly at what sector numbers one should expect damage, which then correlates nicely with error logs from the disk.
 
I don't know how accurate this is, but I tend to think of scrubbing as a tracing algorithm, somewhat similar to garbage collection: starting from "live" roots you follow all the pointers until you reach all the live data. In the case of a copying collector, the live data is moved to a new space. In a scrub, it just verifies that the computed checksum of the content of a block matches the checksum stored in the block pointer. So as disks get filled up, there is more and more live data and it takes longer and longer. In both cases concurrent writes complicate things. Of course this is from a 10,000 foot level PoV.

Ideally I would like to see incremental scrub to be active *all* the time but costing no more than a tiny fraction of cpu & io overhead.
 
Ideally I would like to see incremental scrub to be active *all* the time but costing no more than a tiny fraction of cpu & io overhead.
If reading from disk was 100% guaranteed to not contribute to disk wear-out and failure ...
If reading from disk didn't use any energy ...
If disks were really good at prioritizing real foreground requests (such as user reads and writes) over background activity (like scrub and defragmentation), with minimal performance penalty on foreground IO ...
... then we should be scrubbing all the time, or doing other useful background activity.

But that's not exactly the world we live in. None of the three if statements is completely true today; and looking at the trend, the first two are getting worse, and there is not terribly much progress on the third. So we have to adjust scrubbing frequency, balancing risk and reward. Doing this systematically is really hard. For the amateur and people with small systems, every few weeks seems plausible.
 
Rough sketch: let us say there are 32M blocks (block size 128 KB, so about 4 terabytes of data) and you want the scrub to complete in a month. That translates to scanning fewer than 13 blocks/sec on average (roughly 33.5 million blocks over about 2.6 million seconds). That is pretty low.
 
Scrub should skip blocks that have been recently read or written. This would significantly speed up scrubbing of pools with live data and, as a side effect, extend the life of the disks. I think this would be a really good new feature for ZFS.

When the disk is truly idle, it doesn't even have to servo; that's a waste of energy. It can just raise the head high, and turn the servo (and read amplifiers) off. Regaining servo control is a relatively fast operation, since it is done during normal read/write IO all the time.
I found a conflicting post on this matter in the past. Does this behavior, of not reading servo tracks during idle time, apply to drives from specific or all vendors?
 
Scrub should skip blocks that have been recently read or written.
I vaguely remember that about 15 years ago, someone was discussing keeping a list of how "old" blocks were (how recently accessed), and then scrubbing the oldest blocks first. Don't remember where I heard about it (research conference like FAST? colleagues brainstorming?). This is the kind of thing that a grad student could prototype as part of a PhD thesis, get a degree and a research paper out of it, and if it works, it could flow into production systems.

Does this behavior, of not reading servo tracks during idle time, apply to drives from specific or all vendors?
That's another really good question, to which I don't know the answer. It could very well be that the behavior I describe applies to nearline (enterprise) drives, while laptop (2.5") drives behave differently. The whole area of power management (how the drive saves power when idle) depends a lot on the intended use of the drive: some drives spin down sooner, some park the head more, and so on.
 
The main purpose is to find bad drives earlier

I have SMART monitoring of the drives also.

I find the drives' internal monitoring tends to find early problems before they start showing up at the OS/ZFS levels.

SMART watches that internal monitoring of the drives and notifies me when there is an issue. At that point, the drive is likely using spare sectors as replacements for the bad ones, and the OS/ZFS have not yet been informed.
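
If anyone wants to replicate this, a minimal setup with sysutils/smartmontools could look roughly like the following (the device name and mail address are placeholders):
Code:
# /usr/local/etc/smartd.conf
# -a: monitor all SMART attributes; -m: mail on problems; -M test: send a test mail at startup
/dev/ada0 -a -m root@localhost -M test

# /etc/rc.conf
smartd_enable="YES"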

That's been my experience. YMMV. :)
 
I find the drives' internal monitoring tends to find early problems before they start showing up at the OS/ZFS levels.
That's partially correct. Often, SMART monitoring will find hardware problems. In particular, if one carefully looks at the trend (growth) in the number of remapped sectors, one can often suspect that the drive will have problems soon.

But this doesn't always happen. In the early 2000s, there was a published high-statistics study by some authors from Google that showed that roughly half the time, drive failure was predictable from SMART data. The other half of the time, drive failure was NOT predicted by SMART. Furthermore, SMART only looks at one particular failure mechanism, which is failure at the head/platter level; it doesn't help with interface failures, firmware and software bugs, off-track writes, and many other syndromes. Scrubbing catches many of those.

SMART does not replace scrubbing. If it did, nobody would scrub, yet all large proprietary enterprise-grade storage systems implement some form of scrubbing. They wouldn't do that (and they wouldn't spend the IO workload and energy consumption) if scrubbing were worthless. And what's good for the gander is good for the goose: small system users and amateurs can also benefit from scrubbing, and ZFS happens to deliver it. If we care about our data, we should use whatever tools are appropriate, and IMHO scrubbing has a good return on investment.
 
That's partially correct. Often, SMART monitoring will find hardware problems. In particular, if one carefully looks at the trend (growth) in the number of remapped sectors, one can often suspect that the drive will have problems soon.

But this doesn't always happen ...

That's why I said "tends to" and not "always." ;)
 