ZFS dataset size is out of control, zfs send/recv hangs on random datasets

I did find this thread that was related but I have snapshots.
https://forums.freebsd.org/threads/zfs-dataset-is-occupying-more-space-than-the-actual-data.83901/

A few months ago I had a ZFS corruption issue with the server in question and re-partitioned.

Since then I've been having zfs send | zfs recv backups that hang every few days.

Well, this one dataset had a lot of variation in the zfs list -t snap REFER column. Over the hours it would climb from 50G to 155G, then go back down, then climb to 100G and go back down, etc. But right now, the latest snap shows 320G!

Code:
root(4)smtp:~ # df -h /zsmtp_jail/postfix
Filesystem            Size    Used   Avail Capacity  Mounted on
zsmtp_jail/postfix    919G    320G    599G    35%    /zsmtp_jail/postfix

root(4)smtp:~ # du -hs /zsmtp_jail/postfix/
 55G    /zsmtp_jail/postfix/

root(4)smtp:~ # zfs list
NAME                                                      USED  AVAIL  REFER  MOUNTPOINT
zsmtp_jail/postfix                                        320G   620G   320G  /zsmtp_jail/postfix

When I enter the jail and run du I get the same 55G.

Why the wild size discrepancy? 919G? 320G? 55G?

I renamed my backup dataset to preserve all my snaps and destroyed all the snaps on the live server. Now I get:

Code:
root(4)smtp:~ # zfs list zsmtp_jail/postfix
NAME                 USED  AVAIL  REFER  MOUNTPOINT
zsmtp_jail/postfix   320G   620G   320G  /zsmtp_jail/postfix


zpool status shows:
Code:
  pool: zsmtp_jail
 state: ONLINE
  scan: scrub repaired 0B in 00:27:09 with 0 errors on Fri May 24 04:22:09 2024    <--TODAY!
config:

        NAME        STATE     READ WRITE CKSUM
        zsmtp_jail  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nda1p3  ONLINE       0     0     0
            nda2p3  ONLINE       0     0     0

1/ As far as I know, the pool is fine (says zfs) and there is nothing that I can do to fix the dataset. Any info to the contrary is welcome.
2/ As far as I know, the size discrepancy is indicative of a real problem since there is *NOT* 320G in the dataset now that all the snaps are destroyed. Perhaps the snap space takes time to be re-calculated? I would like to know how zfs handles this. Is it fixed on the next scrub? I looked and found no reference to scrub recalculating sizes. Any info welcome, especially if I can trigger the recalculation.
3/ In the last month I have had forty-six backups hang on zfs send/recv. About a third of those are from a server that is backing up to itself from an SSD to a hard drive, so I don't think SSH has anything to do with it. The datasets vary.
for example:
back/smtp/zsmtp/usr/src
back/smtp/zsmtp/var/crash
back/aujail/jail
back/smtp/zsmtp/var
zgep_back/zgep/var/crash
back/smtp/zsmtp_jail/jmusicbot
zgep_back/zgep/ROOT
zgep_back/zgep/var
zgep_back/zgep/var/crash
zgep_back/zgep/var/crash
zgep_back/zgep/var/crash
zgep_back/zgep/usr
back/smtp/zsmtp/usr

I would love to know:
Why do these datasets hang on zfs send/recv?
Is there any command I can run to find the datasets that are in a state where they could or would hang?
Is there a way to 'clean' them so they don't hang anymore?
Should I be thinking of a new zfs pool again? Should I only move the data in by rsync and not zfs send/recv?
Any thoughts appreciated.
4/ If the total size of the files in a dataset with no snapshots is around 50G but zfs reports 320G, is there some command that shows what the extra space is for?
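
If I'm reading zfsprops(7) right, the closest thing to such a command might be the per-dataset space breakdown properties; something like this (just a sketch, I haven't verified it explains my gap):
Code:
# USEDSNAP = held only by snapshots, USEDDS = live data, USEDCHILD = descendants
zfs list -o space zsmtp_jail/postfix
zfs get usedbysnapshots,usedbydataset,usedbyrefreservation,usedbychildren zsmtp_jail/postfix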
 
Code:
zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
zsmtp_jail/postfix      55.5G   564G  55.5G  /zsmtp_jail/postfix
zsmtp_jail/postfix_xld   320G   564G  55.2G  /zsmtp_jail/postfix_xld
The new dataset is only 55G and transferring properly. I've retained my snapshots by simply renaming the old dataset on the backup server. I've kept the bad dataset in case anyone has any ideas; I would like to figure out a better way out of this mess. Note that the REFER column on the bad dataset now shows only 55G instead of 320G!

Why would it do that? Did it happen when I unmounted it?

That reminds me, I had to do a zfs unmount -f to force it to unmount. It wouldn't unmount without it. I ran various commands to see what was open:

fstat | grep postfix
procstat -fa | grep postfix
sh -c "ps ax -o pid=|xargs procstat -f 2>/dev/null" | grep postfix

At first, fail2ban was showing some files open:
root(4)smtp:~ # fstat | grep postfix
root python3.9 2025 14 /zsmtp_jail/postfix 937589 -rw-r----- 1255190
root python3.9 2025 18 /zsmtp_jail/postfix 647378 drwxr-xr-x 60
root(4)smtp:~ # sh -c "ps ax -o pid=|xargs procstat -f 2>/dev/null" | grep postfix
2025 python3.9 14 v r -----n-- 2 0 - /zsmtp_jail/postfix/var/log/maillog
2025 python3.9 18 v d -----n-- 2 0 - /zsmtp_jail/postfix/var/log

But shutting down fail2ban cleared those. Maybe I should have shut down fail2ban before shutting down the jail.

Anyway zfs unmount -f worked.

Still interested in any information about how I can have a USED of 320G and a REFER of 55.2G. Meanwhile, the snapshots still show a REFER of 320G:
Code:
root(4)smtp:~ # zfs list -t snap zsmtp_jail/postfix_xld
NAME                                    USED  AVAIL  REFER  MOUNTPOINT
zsmtp_jail/postfix_xld@2024-05-24.12   10.5M      -   320G  -
zsmtp_jail/postfix_xld@2024-05-24.13   13.4M      -   320G  -

If it is clear to anyone that there is a bug here, let me know and I will try to report it.

Thank you all for any help you can offer.
 
Now, just four days later, I had a panic while deleting some snaps on zsmtp_jail/nginx

To boot, we had to add this to /boot/loader.conf:
vfs.zfs.recover=1

I'm going to give nginx the rename-rsync treatment and try to reboot without ZFS recovery.
 
Ok, I failed to put the -r on the zfs destroy so the bad dataset still existed.

After destroying it for real and removing the recovery setting, it still panics.

Next I created new datasets on a backup drive and wiped the whole zpool.

I rsynced everything back and rebooted.

ZFS with two mirrors isn't as reliable as I expected. I would love to know what I'm doing wrong. This is the second time I've had this problem with this machine.
 
Thank you for your reply cracauer!

I am certainly with you that the common denominator is this machine. However, I have no other indication that the hardware is at fault; it's been rock solid other than the ZFS crashes.

I am also left with other questions that I have no way to answer.

Is there a way to track down the discrepancy between REFER and du?
- Should I add a periodic test to detect such a discrepancy as an indication of an unhealthy zpool?
If there is a problem with ZFS, why is there no way to classify or detect it?
- Replacing the zpool fixed it, but since I didn't have a way to find the problem, I don't know what happened or how to detect or prevent it.
My ZFS backup script is pretty complex. Dealing with send hangs all the time complicated it further. Now I need to incorporate resume tokens. Is backing up ZFS usually this complex?

Obviously I have multiple servers and love ZFS; our whole infrastructure is based on it: a handful of servers and dozens of jails. But the lack of information in the face of a factual problem has me wondering if I fell asleep during ZFS 101.
 
I dunno about REFER and du, but I'd still start memtest86ing that thingie.

Are you sure you don't have snapshots or clones there?
 
We went through memtest the last time this happened. ECC memory. It all passed testing, so we took half of it out; that way we halved our chances that it was a memory issue. Now we have the same issue, so I'm inclined to say it's not memory, but who knows?

I need to set up another server anyway so I'm thinking I'll transfer production to the new one and free this server up for some testing.

No snapshots, since I had deliberately deleted them all and then checked again. I rarely use clones, so I didn't expect any.

I do expect there is some kind of undocumented calculation delay. I'll test that soon since I am interested to know myself if that is a thing.

I will post the results here.
 
Some things that may help others help you:
  • CPU and RAM of the two machines.
  • OS/kernel version `uname -a` or at least `uname -KU`. FreeBSD 13.3 was released with a particularly ugly ZFS memory release issue which can make things become incredibly slow; not sure if it can cause a complete freeze too. I don't know if it was fixed, but it's better to use 13.2 or 14.0 if it wasn't. Hanging, I would presume, is a bug or hardware issue rather than just a corrupted pool.
  • ZFS version if different such as from ports.
  • ashift value for pool `zdb -C zsmtp_jail|grep ashift`.
  • Commands used for the send and receive of the pools; an interrupted receive does not list as a snapshot but you can list the token to resume it.
  • Any differences between the sending and receiving pools: versions, properties, differences in running hardware.
  • Other ZFS space measurements: `zfs list -ro space zsmtp_jail` or `zfs list -t snapshot -ro name,used -s used zsmtp_jail`.
  • Any other ZFS properties that are not default or that could be worth examining: `zpool get all zsmtp_jail | grep -v default` and `zfs get -r all zsmtp_jail | egrep -v 'default|inherited'`.
I don't have a good explanation for the changing sizes; my best guess is that intermediate snapshots are playing a role. Other possible explanations for unexpected space use that I can think of: copies turned on for data, compression being lowered, block cloning in use (cloned blocks don't transfer as clones, if I recall), many files smaller than ashift, or incomplete `zfs recv -s` transfer(s) remaining. Things like raidz's unexpectedly higher allocation overhead shouldn't apply if this is just a mirrored pool.
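
A quick way to rule most of those out at once would be something like the following (dataset name taken from your output above; check the receiving side too for a leftover partial receive):
Code:
# non-default copies/compression/recordsize, plus any resume token from an interrupted recv -s
zfs get -r copies,compression,compressratio,recordsize,receive_resume_token zsmtp_jail/postfix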

Sizes do have a calculation delay; compressed data's size is not known until after compression has been performed. Writes (and deletes) are cached for at most `sysctl vfs.zfs.txg.timeout` seconds before steps to commit them to disk are taken. Destroying snapshots takes time before the command returns; I'd still give it at least an additional txg.timeout seconds before even thinking of checking on space.
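
As a rough way to take that delay out of the picture, you could force outstanding transaction groups to commit before looking at the numbers (sketch):
Code:
sysctl vfs.zfs.txg.timeout            # default 5 seconds
zpool sync zsmtp_jail                 # flush pending txgs to disk
zfs list -o space zsmtp_jail/postfix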

You shouldn't need rsync or other tools to transfer data unless you are trying to do things ZFS cannot do during a transfer, like rewriting some data structures: activating block cloning on blocks that are the same (currently undone by zfs send/recv), increasing/decreasing record size, etc.

Though du isn't recommended for figuring out how a pool has been used, you may want to also compare results with its -A flag. `zdb -d`, when given a dataset, can output things helpful for walking further through that dataset. You can follow it up with an object # to examine that object, and more 'd's give more detail. I wouldn't call it user friendly, and it requires understanding ZFS structures to make proper sense of its output.
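
The general shape of that, using your dataset as an example (the object number is whatever `zdb -d` printed for the object you want to look at):
Code:
zdb -d zsmtp_jail/postfix_xld            # one line per object: number, type, size
zdb -dddd zsmtp_jail/postfix_xld OBJNUM  # more d's = more detail on one object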

Without knowing why the system freezes during the transfer, you cannot know what can/cannot be transferred without causing a freeze. If a zfs scrub comes back clean, then your data should be intact. If you still have filesystem corruption (very unlikely but not impossible) then you would need to replace impacted data from backup, possibly going as far as destroying+recreating the pool.

As common as RAM issues are, RAM is not the only part of a system that can go unstable. Over the past few years, I ended up teaching computer technicians that CPUs are having issues commonly enough that they are no longer the last thing to get checked during troubleshooting. A bad power supply, faulty or poorly designed accessories, etc. can wreak havoc too. No single test program checks for all possible faults, and some faults are only brought out under certain conditions (humidity, temperature, load on multiple components, etc.).

ZFS isn't designed to make unreliable hardware reliable. Checksums help identify corrupted data, and fix it if there are copies. Your data gets contained in ZFS data blocks and those blocks all have checksums, but your data itself does not. If data is corrupted before it is checksummed, you have corrupted data on disk that is marked as valid. Filesystem bugs could also write bad data, depending on where in the pipeline they occur. If you think the hardware is unstable, then that needs to be addressed.
 
Mirror176, thank you for your detailed and informative response. I did not post specs because I wanted info about ZFS regardless of hardware, since two machines were affected, but since you were so thorough, here we go...

smtp is a 2020 ASRock X570 Taichi with a Seasonic Focus power supply. The power supply is not original; there was a recall on the original.
CPU is AMD Ryzen 9 5950X Vermeer (Zen 3) 16-Core 3.4 GHz Socket AM4
RAM was replaced with ECC a year or so ago; I don't have the part number on hand.
FreeBSD smtp.wfprod.com 14.0-RELEASE-p6 FreeBSD 14.0-RELEASE-p6 #0: Tue Mar 26 20:26:20 UTC 2024 root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
Intel 58GB Optane 800P M.2 2280 3D Xpoint PCIe SSD SSDPEK1A058GA for swap
Corsair MP600 PRO NH for zpools
zfs-2.2.0-FreeBSD_g95785196f
zfs-kmod-2.2.0-FreeBSD_g95785196f

gep is a 2024 ASRock B650M-HDV/M.2 Socket AM5 Micro ATX with a CORSAIR SF850L SFX Power Supply
CPU is AMD Ryzen 5 7600 6-Core 3.8 GHz Socket AM5 65W
RAM is 16GB Kingston 4800MHz CL40 DDR5
Two SSDs are Corsair MP600 PRO NH M.2 2280 2TB PCIe 4.0 x4 3D TLC CSSD-F2000GBMP600PNH
FreeBSD geproducts.net 14.0-RELEASE-p6 FreeBSD 14.0-RELEASE-p6 #0: Tue Mar 26 20:26:20 UTC 2024 root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
zfs-2.2.0-FreeBSD_g95785196f
zfs-kmod-2.2.0-FreeBSD_g95785196f

I installed from the same ISO image on a USB stick and built both machines by hand using pkg only. No source at all.
ashift on both production pools is 12. One of the backups is 9, the other is 12.

All the zfs commands that had a problem were of the form:
Code:
${ssh} sudo zfs send -c ${rds}@${date} | zfs recv -Fu ${lds}
but the script did have a case where it did an incremental of the form:
Code:
${ssh} sudo zfs send -cI ${rds}@${hdate} ${rds}@${date} | zfs recv -Fu "${lds}"
However, I have rewritten the script using resumable sends of the form:
Code:
${ssh} timeout ${time} sudo zfs send -c -t ${token} | zfs recv -Fsu back/${bpfx}/${prds}
${ssh} timeout ${time} sudo zfs send -c ${prds}@${date} | zfs recv -Fsu back/${bpfx}/${prds}
${ssh} timeout ${time} sudo zfs send -cI "${prds}@${lbs_date}" "${prds}@${date}" | zfs recv -Fsu "back/${bpfx}/${prds}"
This new version of the script has crashed the smtp server once so far, but overall it is better written, so I'm keeping it. However, when it crashed the server, the backup script did not terminate; periodic reported that the previous hourly was still running. So I have decided to put a timeout on the zfs recv as well.
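
For completeness, the branch that decides between resuming and starting fresh now looks roughly like this (same variables as above, simplified, and with the recv timeout added):
Code:
# the resume token lives on the receiving dataset (local side)
token=$(zfs get -H -o value receive_resume_token "back/${bpfx}/${prds}")
if [ "${token}" != "-" ]; then
    ${ssh} timeout ${time} sudo zfs send -c -t "${token}" | timeout ${time} zfs recv -Fsu "back/${bpfx}/${prds}"
else
    ${ssh} timeout ${time} sudo zfs send -cI "${prds}@${lbs_date}" "${prds}@${date}" | timeout ${time} zfs recv -Fsu "back/${bpfx}/${prds}"
fi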

zdb diff for zsmtp_jail to back (different machines):
< name: 'back'
---
> name: 'zsmtp_jail'
...
< vdev_children: 1
---
> vdev_children: 2
< type: 'disk'
---
> type: 'mirror'
< path: '/dev/ada0p3'
< whole_disk: 1
---
> whole_disk: 0
> children[0]:
> id: 0
> path: '/dev/nda1p3'
> whole_disk: 1
> DTL: 13483
> create_txg: 4
> children[1]:
> type: 'disk'
> id: 1
> path: '/dev/nda2p3'
> whole_disk: 1
> DTL: 12598
> create_txg: 4
> children[1]:
> type: 'indirect'
> whole_disk: 0
> metaslab_array: 0
> metaslab_shift: 34
> ashift: 12
> is_log: 0
> non_allocating: 1
> create_txg: 18
I removed GUIDs and such for brevity.

zdb diff for gep (both pools on same machine)
< name: 'zgep'
---
> name: 'zgep_back'
< txg: 265556
---
> txg: 229907
---
< type: 'mirror'
---
> type: 'disk'
< whole_disk: 0
---
> path: '/dev/ada0p3'
> whole_disk: 1
< metaslab_shift: 29
< ashift: 9
---
> metaslab_shift: 30
> ashift: 12
< children[0]:
< type: 'disk'
< id: 0
< path: '/dev/nda0p2'
< whole_disk: 1
< DTL: 281
< create_txg: 4
< children[1]:
< type: 'disk'
< id: 1
< path: '/dev/nda1p2'
< whole_disk: 1
< create_txg: 4
zfs get for zsmtp_jail and back (different machines)
< zsmtp_jail size 1.77T -
< zsmtp_jail capacity 41% -
---
> back size 10.8T -
> back capacity 37% -
---
< zsmtp_jail free 1.03T -
< zsmtp_jail allocated 756G -
---
> back free 6.72T -
> back allocated 4.10T -
< zsmtp_jail fragmentation 0% -
---
> back fragmentation 3% -
< zsmtp_jail load_guid 1809226621819765726 -
---
> back load_guid 9429038092753376069 -
< zsmtp_jail feature@device_removal active local
< zsmtp_jail feature@obsolete_counts active local
---
> back feature@device_removal enabled local
> back feature@obsolete_counts enabled local
< zsmtp_jail feature@zilsaxattr active local
---
> back feature@zilsaxattr enabled local

zfs get for zgep and zgep_back (same machine)
< zgep creation Tue Apr 30 13:57 2024 -
< zgep used 1.62G -
< zgep available 65.7G -
< zgep referenced 26K -
< zgep compressratio 2.10x -
---
> zgep_back creation Thu May 2 12:59 2024 -
> zgep_back used 1.72G -
> zgep_back available 143G -
> zgep_back referenced 96K -
> zgep_back compressratio 2.03x -
< zgep mountpoint /zgep default
---
> zgep_back mountpoint /jback local
< zgep guid 1722107665142296273 -
---
> zgep_back guid 7431549006592437355 -
< zgep usedbydataset 26K -
< zgep usedbychildren 1.62G -
---
> zgep_back usedbydataset 96K -
> zgep_back usedbychildren 1.72G -
< zgep written 0 -
< zgep logicalused 3.23G -
< zgep logicalreferenced 13K -
---
> zgep_back written 96K -
> zgep_back logicalused 3.24G -
> zgep logicalreferenced 42.5K -
< zgep snapshots_changed Fri May 31 3:59:00 2024 -


If by 'intermediate snapshots' you mean what I understand as 'incremental', then yes, that does happen in certain situations, but my observation of the size problem was after all snaps were deleted. I was not using resumable streams until this very last crash on Saturday, so that should not have been an issue. I do very much appreciate your list of possible size sources. Thank you.

sysctl vfs.zfs.txg.timeout is set to five seconds on the smtp server with the observed size issues. I waited several minutes and checked several times so I don't think that was an issue.

I did rsync because I considered that a corrupted pool might 'zfs send' corruption. I stand corrected.

zdb -dd was exciting to learn about, thank you.

We have no idea why we are getting panics on smtp but we do have core dumps. see: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=278958

"If you still have filesystem corruption (very unlikely but not impossible) then you would need to replace impacted data from backup, possibly going as far as destroying+recreating the pool."
I thought that this situation was supposed to be theoretical, hence my consternation. As far as I understand, the panic situations are during ZFS operations on smtp. But the hangs are on two completely different backup servers (one local, one remote).
However, gep has been up for two weeks now (no panics) and is backing up locally (no ssh), and until I changed the script to use timeout/resume it was hanging every couple of days. I agree that the trouble with smtp may be instability, but gep is experiencing the same hangs in zfs send/recv, does not crash, and shares no components with smtp.

I agree, this problem involves multiple variables and hardware is a possibility that is welded to the table.

"ZFS isn't designed to make unreliable hardware become reliable."
I agree. I also appreciated your clarification of checksums and the possibility of filesystem bugs.

Facts on the table:
Still confused by the zfs send/recv hangs, but timeout has mitigated them for now. No action needed.
I need a plan to detect hangs despite the mitigation, or zpool corruption may go unaddressed.
smtp has helped me improve our backup setup to be more resilient, but at great cost. I will replace the hardware.
I have discarded the idea of comparing du against REFER, since it wouldn't work with snapshots.
Corruption of this nature will not be noticed on our backup zpools, since those datasets don't get sent anywhere. If the issue is ZFS and not hardware, then it would be useful to figure out another way to detect datasets that may hang. If the problem spreads, I will consider learning about zdb and ZFS data structures.
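
My first thought for detecting hangs despite the mitigation is a stale-lockfile check run from cron; a rough sketch (path and threshold are made up):
Code:
#!/bin/sh
# backup script touches this file on each run; alert if it looks stuck
LOCK=/var/run/zfs_backup.lock
if [ -f "${LOCK}" ] && [ -n "$(find "${LOCK}" -mmin +180)" ]; then
    logger -p daemon.err "zfs backup appears hung: ${LOCK} older than 3 hours"
fi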


Thank you very much Mirror176. You are a gentleman and a scholar.
 
New development: smtp rebooted last night and since then I am getting stuck processes.

Code:
root            6    0.0  0.0      0  10608  -  DL   Wed04       7:18.01 [zfskern]
root         6688    0.0  0.0  20132   9384  -  Ss   08:32       0:00.00 sudo zfs snap zsmtp/ROOT/14.0-RELEASE-p5_2024-04-15_042933@2024-06-06.04
root         6689    0.0  0.0  19508   8668  -  D    08:32       0:00.00 zfs snap zsmtp/ROOT/14.0-RELEASE-p5_2024-04-15_042933@2024-06-06.04
root        12829    0.0  0.0  19508   8808  -  I    04:27       0:00.00 zfs destroy zsmtp/var/audit@2024-06-03.16
root        14550    0.0  0.0  20132   9384  -  Is   04:32       0:00.00 sudo zfs snap zsmtp/ROOT@2024-06-06.04
root        14551    0.0  0.0  19508   8660  -  D    04:32       0:00.00 zfs snap zsmtp/ROOT@2024-06-06.04
root        19212    0.0  0.0  20132   9380  -  Is   09:04       0:00.00 sudo zfs snap zsmtp/ROOT/14.0-RELEASE-p5_2024-04-15_042933@2024-06-06.04
root        19213    0.0  0.0  19508   8676  -  D    09:04       0:00.00 zfs snap zsmtp/ROOT/14.0-RELEASE-p5_2024-04-15_042933@2024-06-06.04
root        19353    0.0  0.0  20132   9388  -  Is   09:04       0:00.01 sudo zfs snap zsmtp/ROOT/14.0-RELEASE-p6_2024-05-07_124518@2024-06-06.04
root        19354    0.0  0.0  19508   8668  -  D    09:04       0:00.00 zfs snap zsmtp/ROOT/14.0-RELEASE-p6_2024-05-07_124518@2024-06-06.04
root        19710    0.0  0.0  20132   9396  -  Is   04:44       0:00.01 sudo zfs snap zsmtp/ROOT@2024-06-06.04
root        19711    0.0  0.0  19508   8672  -  D    04:44       0:00.00 zfs snap zsmtp/ROOT@2024-06-06.04
root        19774    0.0  0.0  20132   9372  -  Is   09:05       0:00.01 sudo zfs snap zsmtp/ROOT/default@2024-06-06.04
root        19775    0.0  0.0  19508   8652  -  D    09:05       0:00.00 zfs snap zsmtp/ROOT/default@2024-06-06.04
root        21352    0.0  0.0  12796   2436  1  S+   09:09       0:00.00 grep zfs
 
Do you have the newest firmware installed on both the mobo and the SSDs? Is any CPU microcode patching applied (recommended)?
If powerd is running, have you tried disabling it?
I've seen some reports of some SSDs not playing nice with ZFS, though I haven't seen any reports about your models. (Example: https://github.com/openzfs/zfs/discussions/14793)
Have you monitored temperature on your SSDs during load?
 
zpool status
Code:
  pool: zsmtp
 state: ONLINE
  scan: scrub repaired 0B in 00:01:07 with 0 errors on Thu Jun  6 04:27:42 2024
config:

        NAME        STATE     READ WRITE CKSUM
        zsmtp       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nda2p2  ONLINE       0     0     0
            nda1p2  ONLINE       0     0     0

errors: No known data errors

  pool: zsmtp_back
 state: ONLINE
  scan: scrub repaired 0B in 02:26:19 with 0 errors on Thu Jun  6 06:52:56 2024
config:

        NAME        STATE     READ WRITE CKSUM
        zsmtp_back  ONLINE       0     0     0
          ada0p2    ONLINE       0     0     0

errors: No known data errors

  pool: zsmtp_jail
 state: ONLINE
  scan: scrub repaired 0B in 00:16:24 with 0 errors on Thu Jun  6 04:43:02 2024
remove: Removal of vdev 1 copied 228G in 0h11m, completed on Tue May 28 12:07:49 2024
        600K memory used for removed device mappings
config:

        NAME          STATE     READ WRITE CKSUM
        zsmtp_jail    ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            nda1p3    ONLINE       0     0     0
            nda2p3    ONLINE       0     0     0

errors: No known data errors
 
diizzy, I apologize, I will answer your questions shortly.

I shut down all the jails and most other services, but the stuck zfs processes wouldn't quit. I did a shutdown and it got stuck, so I hit the reset button. When it came back up it had this to say (it didn't show in dmesg; I had to use scroll lock on the box):

Setting hostuuid: -long guid-
Setting hostid: 0xaaa5b29c
Starting file system checks:
/dev/nda1p1: 4 files, 255 Mib free (522964 clusters)
FIXED
/dev/nda1p1: MARKING FILE SYSTEM CLEAN
Mounting local filesystems:.
Autoloading module: acpi_wmi
Autoloading module: if_iwlwifi
Intel(R) Wireless WiFi based driver for FreeBSD <- this line is bold and does appear in dmesg
Autoloading module: intpm
...

nda1p1 is the EFI partition in the mirror.

Weird, right?
I did a grep in /var/log and couldn't find these lines. I only saw this because I was watching the reboot process, expecting to have a bad zpool. If anyone has a way to make these lines go to a log, let me know.
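
One thing I may try, assuming the stock /etc/syslog.conf still ships with the commented-out console line: enable console logging so these messages end up in a file.
Code:
# in /etc/syslog.conf, uncomment (or add):
#   console.info                /var/log/console.log
touch /var/log/console.log
chmod 600 /var/log/console.log
service syslogd restart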
 
diizzy:
Do you have the newest firmware installed on both the mobo and the SSDs? I don't have the latest firmware on the motherboard, but I did have the latest non-beta last I checked. Checking now, I'm pretty sure I can do an update. I've never checked the SSDs before; I will have to schedule that this weekend. Thank you very much for the suggestion.

Is any CPU microcode patching applied (recommended)? I do the boot update and the rc.conf update.

If powerd is running, have you tried disabling it? It is running. I had not tried disabling it. I will give that a try.

I've seen some reports of some SSDs not playing nice with ZFS, though I haven't seen any reports about your models. (Example: https://github.com/openzfs/zfs/discussions/14793) I did just buy another drive so we can have the exact same model on both sides of the mirror. Will do that very soon.

Have you monitored temperature on your SSDs during load? I do monitor temp with smartd. 57C was a recent high on nvme2. I never figured out what happened, but it hasn't happened again. Usually 40C is the max.

Thank you very much, diizzy, for taking the time to help. I will get right on your suggestions.
 
For a while we were buried in errors. We tried a bunch of different things, but there were so many problems that I turned off our second backup. The problems continued, but at a much slower pace, so we kept going. Switching out the older SSD for a new one seemed to have a big effect, but it was not the final solution, as we still had a lot of hangs in the backup system (but no crashes in twelve days).

Our remote backups were running through OpenVPN and ssh. The ssh setup was using ProxyCommand to limit bandwidth and ControlMaster to establish a connection that lasted the entire backup script. Though I was using timeout on the zfs send and zfs recv, I was not using timeout on the ssh connection. I decided to drop ControlMaster, make individual connections, and use timeout on them. The backups have not hung in several days.
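
The invocations now look roughly like this (the ssh options and hostname here are illustrative, not a copy of the script):
Code:
ssh="timeout ${time} ssh -o ControlMaster=no -o ConnectTimeout=15 -o ServerAliveInterval=30 backup@smtp.example.com"
${ssh} sudo zfs send -c "${prds}@${date}" | timeout ${time} zfs recv -Fsu "back/${bpfx}/${prds}"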

There is still a lot of work to do but there aren't daily issues anymore. Thank you to everyone who was kind enough to offer their assistance. I really appreciate it. I feel we're in a place where we can start to walk back changes and actually fix some issues.
 
Thanks for the follow-up. AFAIK powerd doesn't work correctly with AM5, and there are some reports that AMD's CPU frequency boost technology works without it, but I haven't verified that myself. My Ryzen AM5 box runs fine these days. :)
 
I did mess with powerd at some point but it is on right now. I think it is time I got into this. Thank you diizzy for pointing this out.

My (possibly incorrect) impression of powerd is that it does a lot of things, like powering off devices with no driver, and that it manages the CPU to prevent overheating. So I am reluctant to turn it off unless I can prove it is a problem.

I consulted the handbook on this and to my surprise I had not followed the relevant instructions.

I added amdtemp_load="YES" to /boot/loader.conf and rebooted. Now I get temperature reporting from sysctl:

Code:
 # sysctl dev.cpu.0
dev.cpu.0.cx_method: C1/hlt C2/io
dev.cpu.0.cx_usage_counters: 15589 0
dev.cpu.0.cx_usage: 100.00% 0.00% last 46628us
dev.cpu.0.cx_lowest: C1
dev.cpu.0.cx_supported: C1/1/1 C2/2/18
dev.cpu.0.freq_levels: 3400/3740 2800/2800 2200/1980
dev.cpu.0.freq: 2200
dev.cpu.0.temperature: 29.6C
dev.cpu.0.%parent: acpi0
dev.cpu.0.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.0.%location: handle=\_PR_.C000
dev.cpu.0.%driver: cpu
dev.cpu.0.%desc: ACPI CPU

Which is cool. Maybe temperature wasn't being monitored due to that module not being loaded?
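
For future reference, I believe the module can also be loaded on the fly without a reboot (untested by me):
Code:
kldload amdtemp
sysctl dev.cpu.0.temperature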

I also didn't know you could pass flags to powerd. This system typically runs with a lot of CPU to spare. I think if I run two backups at once I can cause it to trip into max performance and see what happens.
 
I set up a test by stopping the powerd service and running it manually with powerd -v.

Then I installed cpuburn and set up a script like so:

Code:
#!/bin/sh
timeout 5 burnMMX &
timeout 5 burnMMX &
timeout 5 burnMMX &
timeout 5 burnBX &
timeout 5 burnBX &
timeout 5 burnBX &

The freq reported by powerd -v went from 2200 MHz to 3400 MHz; then, after the burns were complete, it went to 2800, and a few seconds later back to 2200.

With powerd stopped, the frequency stays at 3400 whether under load or not. I'm not sure if the reported 3400 is real. If anyone has any info on this, I am all ears.
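
To see whether that 3400 is real under load, my plan is to just watch the sysctls while the burn script runs, something like:
Code:
while :; do sysctl -n dev.cpu.0.freq dev.cpu.0.temperature; sleep 1; done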

I think I am going to leave powerd on for now until I have a way to crash the system on demand. Then it will be worthwhile to try turning it off again.

Thank you, diizzy. I would not have found out about amdtemp_load="YES" or the powerd options without your suggestions.
 
Throttling back CPU frequency isn't the normal approach to keeping a CPU from burning itself up; it is more of a last resort. If that step is needed, then your cooling system is faulty or undersized. I don't think it is required for protective thermal throttling to happen. Your system should also power off if temperatures get too high without throttling helping, but that is a disorderly shutdown, so it should be strongly avoided. Usually systems need lower frequency to have noticeably less CPU heat output, and lower heat leads to lower fan speed, so you may be able to get an idea by monitoring CPU fan speed by ear. On my old hardware I've had issues with idprio/renice causing processes to run a LOT slower if the system isn't doing other tasks at the same time, because clock speeds weren't being increased to do the work (ex: make/sh using 200 new process IDs per couple of seconds while compiling ports with poudriere, instead of 2,000 per couple of seconds for the same task without idprio/renice).

If you have overclocking features in UEFI/BIOS, then you may be able to lower voltages and/or clock speeds to help keep thermals under control, but such tweaks need to be tested just like an overclock would be.

Last I checked, cpuburn is old and limited to stressing older CPU instructions. Different instructions load CPUs differently, and some are particularly stressful to some CPU models and good at bringing them to their knees. Some overclockers have previously avoided running certain tests because they didn't want to admit that their overclock wasn't stable with such high-stress instructions. You can search for focused tests like cpuburn, or commonly used tests like math/mprime, or you can just fire up common and uncommon loads you would normally run, but as more extreme examples. Sometimes you may find that heat from the GPU or drives also needs to be factored in, or that the extra power draw, independent of heat, brings out issues.

Heating and drawing power on non-SSD components may also put the SSD in a problematic scenario that stressing the SSD alone won't cause.

There are other tests more commonly available on boot media or under Windows. If you expect a crash is probable, testing from boot media means you won't be crashing while your main OS is booted, but a crash could still lead to rogue instructions causing data corruption. Having backups and/or disconnecting disks is still a safer bet to avoid that very unlikely outcome.

Though the above could be relevant to non-software crashes, it doesn't explain why the filesystem size would fluctuate so much. Common culprits are still ashift, pool RAID layout, checkpoints/snapshots/clones, and other pool options like compression/copies/dedup being changed. If reviewing all of the pool's settings isn't revealing anything, then I'd start looking into things like data getting scattered in strange/excessive ways, which requires more zdb knowledge than I can share.
 
Thank you for your reply, Mirror176. I am definitely up to my neck in a multivariate problem here. I had actually forgotten that this thread was about ZFS filesystem size!

Earlier, regarding powerd, I was just checking if it worked as intended. diizzy indicated that he didn't think it worked on AM5, so I needed to check. But I agree with all your points about cpuburn, real-world loads, SSDs, and crashes.

My plan right now is to get my backup script dialed in and then scale the backups back up again to see what happens, which I think is a strategy you would agree with. Now that things seem stable, I hope that any issues that crop up will be easier to diagnose.

Regarding idprio/renice, I've looked at those before, but in the context of backups I worry about locking ZFS resources for longer than necessary.
 