ZFS Why does periodic(8) scrub ZFS every 35 days by default?

Scrubbing and S.M.A.R.T. tests focus on fundamentally different areas of interest. By monitoring changes in S.M.A.R.T. values, you can check the health of your drive and see whether it is OK or deteriorating. That may help you decide whether to replace a drive preemptively, although it is only an estimate: as has been said, about half of hard drives fail unexpectedly, and I have some drives that keep working for a long time despite very bad S.M.A.R.T. values.
For SSDs, S.M.A.R.T. is worthless except for total bytes written.
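For automating that kind of attribute watching, here is a minimal sketch using smartmontools from ports; the device name, test schedule and mail target are only placeholders, so adapt them:

Code:
# Let smartd watch attribute changes, run scheduled self-tests and mail on trouble.
# (Comment out any DEVICESCAN line already in smartd.conf, or it overrides the rest.)
cat >> /usr/local/etc/smartd.conf <<'EOF'
/dev/ada0 -a -o on -S on -W 4,45,55 -s (S/../.././02|L/../../6/03) -m root
EOF
# -s schedule: short self-test daily at 02:00, long (full-surface) test Saturdays at 03:00
sysrc smartd_enable="YES"
service smartd start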

On the other hand, ZFS scrubbing is mainly used to detect and, on redundant storage, correct bit rot. The more data you have and the longer you store it, the more likely it is that some bits will flip over time. Here's a brilliant example of a wimp who turned on checksums but never scrubbed:
View: https://www.youtube.com/watch?v=Npu7jkJk5nM
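For the record, kicking a scrub off by hand and checking what it found is just (pool name is an example):

Code:
zpool scrub tank
zpool status -v tank   # the "scan:" line shows progress/result, the CKSUM column shows per-device errors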


SMART only looks at one particular failure mechanism, which is failure at the head/platter level; it doesn't help with interface failures, firmware and software bugs, off-track writes, and many other syndromes.
To be precise, S.M.A.R.T. can indeed detect interface errors. If the value of attribute C7/199 (UDMA CRC errors) increases, something is wrong between the controller and the disk. Usually, replacing or reseating the signal cable solves the problem.
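A quick way to check for that is to look at the attribute directly (199 is the usual ID; the exact name varies a bit between vendors):

Code:
smartctl -A /dev/ada0 | grep -E 'ID#|^199 '   # a growing raw value points at the cable/controller, not the platters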
 
Scrubbing and S.M.A.R.T. tests focus on fundamentally different areas of interest....

Yup.

In my experience, so far, only SMART has flagged a problem. I replaced the drive, and all has been fine.

I've not yet had scrubbing find any errors. But I have only a small 10TB home server....

Thanks for the follow-up.
 
I think in about 10 years of using FreeBSD and ZFS on a home server (with 3 disks being used by ZFS, two in a mirror pair, the third one an unreplicated file system), I've had one episode of scrub errors. I think it happened when there was a hardware problem with one of the disks combined with a power outage crash (my UPS situation isn't always perfect, batteries fail). I think the total number of disk replacements on my ZFS home server has been 3 due to disk or interface failures.

Professionally, I have worked on very large systems (not FreeBSD nor ZFS), with tens of thousands to millions of disks. On those, scrub errors are found all the time, and auto-corrected. The root cause of scrub errors is not always diagnosed; it is commonly accepted that the bulk of them are caused by software bugs, but examples of what are clearly disk errors (the drive returning wrong data) are not uncommon.
 
So don't think of scrub like brushing your teeth (twice a day) but more like cleaning your shower (before company comes over)?
 
If reading from disk was 100% guaranteed to not contribute to disk wear-out and failure ...
If reading from disk didn't use any energy ...
If disks were really good at prioritizing real foreground requests (such as user reads and writes) over background activity (like scrub and defragmentation), with minimal performance penalty on foreground IO ...
... then we should be scrubbing all the time, or doing other useful background activity.

If scrubbing were so problematic, why would disks do it internally all the time?

Let's clarify:
  1. there is the scrub from ZFS. It scans only those sectors that are actually in use. Runs as configured from periodic (see the knobs sketched right after this list).
  2. there is a scrub of the entire surface, aka "SMART extended offline". Runs whenever you invoke it via smartctl.
  3. there is another scrub of the entire surface, run by the device itself on its own initiative, about every one or two weeks. There is no way to configure it, nor to see the outcome. But it will increment 197/Current_Pending when it hits a defective sector.
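For the first two, a minimal sketch of the knobs involved (pool and device names are placeholders; the periodic(8) variables are the stock ones from /etc/defaults/periodic.conf):

Code:
# 1. ZFS scrub driven by periodic(8); the stock threshold of 35 days is where
#    the interval in the thread title comes from.
sysrc -f /etc/periodic.conf daily_scrub_zfs_enable="YES"
sysrc -f /etc/periodic.conf daily_scrub_zfs_default_threshold="35"   # days between scrubs (35 is already the default)
# daily_scrub_zfs_pools can restrict it to specific pools; empty means all imported pools.

# 2. Full-surface SMART scrub, started by hand:
smartctl -t long /dev/ada0   # the result shows up later under "Self-test execution status" in smartctl -a /dev/ada0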
Device in question:
Model Family: HGST Ultrastar 7K6000
Device Model: HGST HUS726040ALA610

Not the very newest, but probably enterprise grade.

That one supports EPC, so it should go into Idle_c (low RPM) if so configured. During its internal scrub it doesn't, for some ~9 hours.
 
So don't think of scrub like brushing your teeth (twice a day) but more like cleaning your shower (before company comes over)?
It's more like preventive medicine. Like looking at moles on the skin, to make sure they're not skin cancer. Except that check is done pretty rarely. So perhaps it's more like a patient with serious diabetes checking their blood sugar once or twice a day.

If scrubbing were so problematic, why would disks do it internally all the time?
Scrubbing is not problematic at all, it is a wonderful thing. Except it's not free. Which means that we need to find a compromise between doing it often enough to get most of the benefit (of finding errors) and doing it rarely enough to avoid most of the cost (in latency, energy usage, and disk wear-out). For most things in life there is an 80-20 rule: 80% of the benefit comes from 20% of the investment. But I don't know what that curve looks like for scrubbing (and it is one of the open research questions that I'd love to have time to work on).

For disks doing scrub internally, the rules are somewhat different. The disk knows when each physical sector was last read, it knows detailed error rates (including the number of ECC corrections from partial reads) which give it a good indication of the health of platter and head, it knows whether it is busy or not. It can be more aggressive at scheduling scrubs, with less overhead.

That brings up another interesting research question: The upper layer (file system) has knowledge about data on disk. For example, it knows the value of it (is it file data versus metadata, is it redundant, how important is it to the end user), and in many cases it has a good idea of how often it is going to be read in the future, and when it is going to be overwritten = deleted. As an example, if it knew that some data is going to be overwritten in 10 minutes, unlikely to ever be read again, and 3-way replicated with copies on multiple continents, it could skip scrubbing it, since damage to it is probably irrelevant. On the other hand, the lowest layer (the disk drive) also has knowledge about data on disk: hardware error rates, platter and head health, and such. It can also make an intelligent choice to invest scrubbing resources on things that have the highest payoff. If one combined the knowledge from the upper and the lower layer, could one do scrubbing more efficiently?

This may sound like an irrelevant and mostly academic question. And indeed, giving a good answer would probably require several PhD theses to be written. But at a large scale, this question actually matters. My educated guess is that the total electricity / energy cost (or CO2 footprint, closely related) for the FAANGs to do scrubbing is measured in hundreds of M$. Optimizing this would lead to significant savings.
 
If I may add to this great discussion....

My use of ZFS on FreeBSD is only my little home server, going back a decade, maybe even two. Nothing more, nothing less. For me, it has "just worked."

But the quite informative comments that resulted from my SMART comment of a week or so ago have helped me to understand a lot of what is going on "behind the scenes."

To which I say, thank-you.

I learned stuff. :)
 
Glass platters are mainly used in 2.5" drives. This is probably due to lower thermal expansion coefficients. Aluminum platters are still used in 3.5" drives.

How would the disk know where the head is? Many factors influence the position of the actuator, hence the need to constantly monitor the servo signal. Otherwise, the actuator will drift under the influence of factors such as air movement or tension on the cable connecting the head block with the PCB. Therefore, your explanation is completely unconvincing to me.

The carbon polymer layer has also been used for a long time. It is different from a layer of lubricant. Teflon cannot be used in hard drives because, in order for it to stick to a platter the way it sticks to a frying pan, the surface would have to be porous, and applying a layer of Teflon would require appropriate sintering. Otherwise, the Teflon layer would detach under the influence of the rotational speed. Because the magnetic layer must be as smooth as possible, other solutions are necessary, but the protective layer was also known several decades ago.

A foreign body is not needed to damage the surface. Just contact of the head with the surface is enough. The story about collecting and depositing grease is unbelievable to me because it does not fit with hydrodynamic knowledge.

Yes - micromagnetic tests are carried out on a platter removed from the disk. Yes, it is possible to determine what is stored on the disk this way, but this process requires long-term surface imaging, assembling hundreds of thousands of graphic files into a coherent image of the magnetization, and then correct synchronization and decoding of the signal. Do you actually know anyone doing this type of research, or do you just know it's a possibility?
 
Folks, let's get back to the essential question: does a mechanical disk wear out from reading?

What we know is that Seagate (maybe others too) at one time came up with limiting the warranty on desktop drives to 2400 operational hours per year and 55 TB data transfer per year. Given that a proper surface analysis on a 3TB drive would already transfer 24 TB, that is not so very much.
In my zoo there is an ST3000DM008-2DM166 that came to me used (in a bundle with other things), with about 5 months of operation and an incredible read count of some 2000 TB. It has been in operation ever since and now looks like this:

Code:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE                                                           
  9 Power_On_Hours          -O--CK   052   052   000    -    42471                                                               
241 Total_LBAs_Written      ------   100   253   000    -    141089127653                                                        
242 Total_LBAs_Read         ------   100   253   000    -    4680652614937
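Converting those raw counters, assuming the usual 512-byte logical sectors on this model:

Code:
echo "scale=1; 4680652614937 * 512 / 10^12" | bc   # ~2396 TB read
echo "scale=1; 141089127653 * 512 / 10^12" | bc    # ~72 TB written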

Let's recap: historically the unix community was the one with the most profound knowledge about hardware. Because, while other OSes would rely on device drivers provided by the manufacturers, unix kernels would have their own device drivers, written by the unix folks. The technical specs of the devices would go into these drivers, i.e. real engineering happened.
But then, with the firmware layer getting more and more elaborate, manufacturers started to consider technical specs as "trade secrets", and instead of devices they would sell "solutions" - which, altogether, is just elegant wording for telling the customer only bullshit.

The NAS community found proof that a (cheaper) WD Green can be used like a (more expensive) WD Red if only the aggressive idle parking is removed from the configuration (which is hidden behind a "vendor specific" page). So, might the data read limit also be one of those artificially created product differentiations?
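(For reference, the knob in question on the WD Greens is the "idle3" head-parking timer. The usual way to poke it is the third-party idle3-tools utility; the flags below are from memory, so treat them as an assumption and check the man page before running anything:)

Code:
idle3ctl -g /dev/ada1   # show the current idle3 (head-parking) timer -- flags are an assumption, verify first
idle3ctl -d /dev/ada1   # disable it; the drive needs a power cycle afterwards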
Anyway, so-called "video disks" support 24/7 operation and might not even have Attributes 241+242. Apparently these have no problem with the amount of data. In fact, nobody really knows what the difference would be between a video disk handling camera data and a NAS disk streaming videos, except that NAS users are willing to pay more.

The knowledge, however, while it has disappeared from the unix community, has not disappeared entirely. It is now in Russia, kept alive by reverse engineering.
 
Folks, let's get back to the essential question: does a mechanical disk wear out from reading?
The disk drive vendors say that it does. Which is why they provide a warranty only if reading is below certain limits (the 55 and 550 TB that are talked about). The exact mechanism is a trade secret. I know some hints about it (even if Kaleron disagrees).

In my zoo there is an ST3000DM008-2DM166 that came to me used (in a bundle with other things), with about 5 months of operation and an incredible read count of some 2000 TB. It has been in operation ever since and now looks like this:
This proves that going over 55 or 550 TB is not an immediate death sentence. But we didn't expect that anyway. Disk drives have an AFR of about 0.5 to 1% (that is both specified and measured, and the measurements are in decent agreement with the manufacturers' specs, within a small factor). The disk drive market has very low profit margins (because there is massive competition between the vendors), so a few percent change in profit is make or break. If a few percent of all drives sold came back as warranty claims (above and beyond the number expected from AFR and bathtub aging models), that would kill the profits. So, for example, if reading 2000 TB in the first half year increased the AFR from the expected 0.5% to 5%, you would most likely not notice that on your sample of one disk, but Seagate would go bankrupt if it honored those warranty claims. So your measurement on a sample of one does not contradict Seagate's limit of 55 TB/year; the statistics are simply too low.
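A rough back-of-the-envelope, using the power-on hours from the smartctl output above and treating the AFR as a constant yearly survival probability:

Code:
# Chance that a single drive is still alive after ~4.8 years (42471 power-on hours)
# under the two hypothetical failure rates:
awk 'BEGIN { y = 42471 / 8760;
             printf "0.5%% AFR: %.0f%%   5%% AFR: %.0f%%\n",
                    100 * 0.995^y, 100 * 0.95^y }'
# prints roughly "0.5% AFR: 98%   5% AFR: 78%" -- one surviving drive cannot tell the two apart.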

Let's recap: historically the unix community was the one with the most profound knowledge about hardware.
Definitely false. Historically, the best knowledge of disk hardware resided with the companies that made disk drives. And for the longest time that was IBM, CDC, HP, DEC, and a few smaller ones (Memorex, Maxtor, Quantum, Fujitsu, Hitachi, Toshiba; Seagate came significantly later). Many of these companies were also among the largest manufacturers of computers and systems: most computer makers in "IBM + the BUNCH" made their own disks. Furthermore, the disk drive manufacturers in those days were very willing to share data with their BIG customers (for example Data General or Prime). So the deepest knowledge about disk drives has always been in deeply vertically integrated companies, such as IBM, HP, DEC, CDC and so on.

Much of this works differently today. There are really only 2-1/2 manufacturers left on the supply side (Toshiba has small market share). On the demand side, nearly all enterprise-grade disks are used by a handful of very large customers (the FAANG = hyper scalers). There is excellent vertical integration of knowledge within that stack. For example, I'm sure that the engineers from companies like Facebook and Amazon spend a lot of time with engineers from Seagate and WD (I have never worked for those examples).

In the early days of Unix (when development was being done at Bell Labs and at Berkeley), the development groups there probably had an easy time getting detailed technical information, just like other large hardware users could. Today this works differently, and much of it is not (and cannot be) visible to consumers of open source software. For example, if an engineer from a company such as Amazon or Google or Microsoft contributes to the Linux kernel, and that engineer is working from detailed knowledge they have received from disk drive vendors, the person running Linux would never find out. In the case of FreeBSD, there is just much less development happening, and in particular less of it is being done by people employed by large companies (IBM, the FAANG and such).
 
How would the disk know where the head is?
When the disk is idle (not reading or writing), it doesn't need to know where the head is accurately. When the next IO arrives, it needs to know how to get to the new desired track, but turning the read data path on is fast, and the drive will get track servo information very quickly (using partial reads). So when idle (with an expectation that idleness will continue), there really is no need to servo.

The story about collecting and depositing grease is unbelievable to me because it does not fit with hydrodynamic knowledge.
It is one of the explanations given to me for why doing reads causes wear. The other explanation is micro head crashes: not serious enough to completely destroy the magnetic layer, but enough to cause surface imperfections (perhaps in the polymer layer?) that affect things ... but how would they affect things? Disturb fly height?

Yes - micromagnetic tests are carried out on a platter removed from the disk. Yes, it is possible to determine what is stored on the disk this way, but this process requires long-term surface imaging, assembling hundreds of thousands of graphic files into a coherent image of the magnetization, and then correct synchronization and decoding of the signal. Do you actually know anyone doing this type of research, or do you just know it's a possibility?
There are multiple different things. To begin with, with a magnetic STM (spin polarized), one can read individual bits on a platter. As you said, trying to read data this way is very tedious, but doable. I've only seen it done in research labs (to characterize platters and heads), not as a form of data recovery. Whether it is financially viable as data recovery, I don't know.

The other thing is using much coarser microscopy to characterize surface imperfections. This can be done optically (typically in the near UV, using dark field illumination), and it can be done using electron microscopes (both regular and STM). As an example, when I had to deal with a series of disk drives that had shown unusually high early failures (measured in "daily failure rates", which were single-digit percent!), I was at a company that was integrating disk drives into large systems. We sent a few failed disks to the drive manufacturer, and they did the following: They first prepared maps of the platters to show where the errors were (they typically form patterns), and then zoomed in on some of the failures, so one could see foreign objects, long scratches, and waviness of the surface finish. And eventually, at the highest zoom level and with magnetic imaging, you could see individual bits, and then see areas where the bits were clearly mechanically wiped out.

And to be clear: I did not work in disk drive manufacturing or R&D. This is user information.
 
How would the disk know where the head is? Many factors influence the position of the actuator, hence the need to constantly monitor the servo signal. Otherwise, the actuator will drift under the influence of factors such as air movement or tension on the cable connecting the head block with the PCB.
I don't get it. If the hard drive finds the servo tracks during a cold start, why should it be a problem to find them again after idling? I'm asking as an HDD user.
 
raidz is horribly slow when it comes to scrubs and resilvers (and also slow and inefficient otherwise...)
That's why you should always use mirrors for small pools, also because using a *single* raidz vdev is usually a bad idea if you want to get at least some performance from your pool...

No, it is very fast if your hardware is right. At home, I have a ZFS-based file server with an external SAS JBOD enclosure that limits throughput to 6 Gbit/s. I am using WD Gold/Ultrastar drives, an older LSI HBA (not a RAID controller), currently 16 drives in two RAIDZ2 pools with eight drives each, 64 GB RAM and two Intel E5-2690 (48 cores). Scrub runs close to the theoretical limit of 6 Gbit/s. Resilver runs at the maximum speed the drives can mechanically handle, a good 250 MB/s.

My experience is: if scrubbing or resilvering is slow, there is something wrong with your hardware setup or, potentially, with the ZFS configuration (e.g. a wrong ashift value).
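For anyone chasing a slow scrub, a few quick things to look at (pool name is an example):

Code:
zpool status -v tank        # the "scan:" line shows the current scrub/resilver rate
zdb -C tank | grep ashift   # per-vdev ashift; 12 is what you want on 4K-sector drives
gstat -p                    # per-disk busy % -- one sick disk drags down the whole raidz vdev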
 
doesn't need to know where the head is accurately.

why should it be a problem to find them again after idling?
Does the disk knock its heads in idle? This would be the consequence if the disk in idle stopped reading and tracking the servo signal.
micro head crashes: not serious enough to completely destroy the magnetic layer, but enough to cause surface imperfections

Any contact between the head and the surface may cause damage to the heads and the surface, but this is unrelated to normal read and write operations and in no way justifies limiting those operations.
read individual bits on a platter.

The data is encoded, so imaging individual magnetic domains is not the same as reading bits, but a detailed explanation is a large topic.
Moreover, production defects and defects resulting from emergency situations also do not justify limiting read and write operations. These are random situations. If there is a reason for limiting read and write operations, it is different, but I can't pinpoint it at the moment. If the limiting applied only to writing and only to drives with energy-assisted recording (HAMR/MAMR), I would look for it on the side of heat dissipation problems, but at the moment I don't have enough data to comment on it.
 
Does the disk knock its heads in idle? This would be the consequence if the disk in idle stopped reading and tracking the servo signal.
I don't know what you mean by "knock" here. This is what I've heard: When in idle, the actuator may drift, but not terribly much, due to the lack of active servoing; that addresses heads moving back and forth.

The flying height of the head is actively adjusted: it's high above the platter when idle or seeking fast, low for reading, and even lower for writing. How is the flying height adjusted? As far as I've heard (and many of my sources are from Hitachi = WD, so other manufacturers may have different techniques), it works somewhat like an airplane. An airplane has a force acting on it which pushes it down (gravity), balanced by a force pushing it up (the aerodynamic lift of the wing). It is exactly the same in a disk drive, where the force pushing the head toward the platter is the spring tension of the actuator arm, and the force pushing it away from the platter is aerodynamic lift (which is why disk drives can't work in a vacuum). How do both planes and heads adjust their height? By adjusting the shape of the wing. In a plane, that is done with a movable surface at the trailing edge of the wing (a flap or aileron, but I'm no airplane expert). On disk heads, it is done by slightly heating the trailing edge of the head with an electric heater, which causes it to change shape a little bit, changing the aerodynamics and the lifting force. Obviously, this effect is very small ... but since the flying height of heads today is about 3-5 nm, it doesn't take much to adjust it. I think in idle or seek mode the heater is turned off, and the head rises up.

In the old days, heads were designed to touch the surface of the platter, when they were parked during spindown (typically on a parking track that's on the inner surface of the disk). The problem with that is contamination (transfer of material between disk and head) and stiction. So what disk drives do today is to withdraw the head completely from the platter, using an off-ramp: a tiny piece of plastic that forces the heads way up in the air and off the platter when the actuator goes to the parking position. This has a very important side effect: It used to be that the platters (or at least the parking position) had to be textured slightly, so when the head lands on it, you don't get stiction. With the off-ramp parking technology, the platters can now be polished smooth, which allows lower fly height and therefore much higher bit density. But the price of much lower (and actively controlled) fly height is that occasionally, there will be contact, which then causes contamination ... and supposedly that is what causes disk wearout during IO. And according to what I hear, a slight contact between head and platter is not rare, and not immediately catastrophic. In the old days, such contact was called a "head crash", and caused the otherwise brown disk to get a bright aluminum ring on it (or worse). Today a slight contact is normal, but causes very slight wear.

If someone knows more details why disk drives experience wear when performing IO, I would be glad to hear it.

Speaking of "or worse" on head crashes, here is a disk crash horror story: In the mid 80s, our VAX used a top-loading (washing machine) disk drive, probably around 60 MB, I think OEM'ed from Control Data. We had a spectacular malfunction, where the spindle holding the platters became a little loose, which caused ALL heads (there were probably a dozen of them) to crash simultaneously. Not just crash, but rotate 90 degrees, so the head assemblies were now wedged between adjoining platters. The remaining momentum of the spinning platters managed to pull all heads out of the actuator mechanism halfway, and bend them so they couldn't be retracted to the parking position. It also made a horrible screeching noise.

Then field service (a.k.a. field circus) showed up. Normal operating procedure if the heads have been retracted: Open the top, remove the pack (all the platters) by pulling it up (that's a normal disk change). Won't work, the platters are being held by the heads wedged in perpendicular, and the platters couldn't be rotated. Abnormal procedure if the heads have not retracted: Open the front, find the end of the actuator, manually pull the heads back. Won't work, the heads are stuck between the platters and bent. So we could go neither up nor out. The first attempt was to go to the department machine shop, get some big pipe wrenches and vise grips, and try to yank the actuator out. Didn't work, there wasn't enough of the actuator left to grip. What was eventually done was to use a hammer to bend the whole platter mechanism sideways, then reach in (between platter and actuator) with a long screwdriver to beat the head assemblies up enough that they could be moved completely out of the way, and eventually free the platters enough to unbolt things. There were metal shavings everywhere. The whole top half of the drive had to be replaced, leaving only the drive motor and the electronics below. Field service had a really bad day.
 
When the heads do not read the servo signal (cannot find it or the quality of the read signal does not allow its interpretation), the actuator begins to move chaotically from limiter to limiter. When it hits the limiter, it makes a knocking noise. This is the behavior of a disk with damaged heads (deformed sliders), a demagnetized disk and a disk in any other situation when the head does not read the servo signal. The actuator is subjected to many different forces and to keep it in a stable position, you must appropriately regulate the current fed to the coil. Appropriately, that is, you need to know exactly how. PES (Positioning Error Signal) stored in servo sectors is used for this purpose. And tracking this signal is why the head must read all the time regardless of whether the disk is performing I/O or not. If you read any publication about signal processing or servomechanics of hard drives, you will easily understand that the head must read all the time, and the statement that it is turned off at idle is nonsense.

Yes, the head's flight height is adjustable. Yes, its regulation uses the balance of the elastic force and the lifting force of the slider. Yes, the regulation uses thermal elements that influence the deformation of the slider. This is controlled by examining the amplitude of the servo signal. This is another reason why the head must read all the time regardless of I/O operations.

Disassemble any disk in which the heads are parked on the platter. You will easily notice that only the parking zone is textured and the rest of the surface is smooth.
Yes - parking on an external ramp is safer, which is why manufacturers are moving in this direction.

Your story about the disk failure is interesting, but it is a very old disk and this case does not contribute anything to the workload rate limit considerations. We still do not find a rational justification for this parameter. The contacts between the heads and the platters themselves are random and only loosely related to input/output operations. The idea of the read channel being turned off at idle can hopefully be put to rest. My English is mainly for reading, and communicating in this language is difficult for me, so if you need a more extensive explanation of this issue, I suggest you consult the literature.
 