ZFS: The pool metadata is corrupted, how to get your data out of a corrupted zpool?

Hi,

Currently I have a zpool that does not work anymore. The name of the pool is zroot. The VM was running FreeBSD 14.1 and got into a faulty state when the host hung. Since "ZFS is always better than no ZFS" (Dan Langille), I would like to learn from this broken state.

The gpart show command shows disk ada0 with partition 1 as boot, partition 2 as swap, and partition 3 as ZFS.

Everything is in one single pool named zroot, no redundancy, but I have backups. I have lost some data, but that's my own problem; I should have taken a backup sooner or used another setup.

Partition 3, the ZFS partition, decrypts without problems with geli.

zpool import shows:
- ZFS filesystem version: 5
- ZFS storage pool version: features support (5000)

zpool import zroot prints "failed to load zpool zroot" four times, and then:

Cannot import zroot: I/O error
Destroy or re-create the pool from a backup source.

zdb -e zroot

This shows a lot of data on the screen. When it arrives at the dataset zroot/var/log, it shows the ZIL header, then objects 0, -1, -2 and -3, then three lines of Dnode slots data, then dmu_object_next() = 97
and then on the next line: Abort trap

When running zdb -ul /dev/ada0p3.eli, uberblocks 0 through 30 are shown.
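For reference, a minimal sketch of how the uberblock TXGs could be listed to pick a rewind target (assuming the decrypted provider is /dev/ada0p3.eli; the placeholder <txg> is whatever TXG you choose):

Code:
# list the uberblocks with their transaction group numbers and timestamps
zdb -ul /dev/ada0p3.eli | grep -E 'txg|timestamp'

# a read-only rewind import to one of those TXGs would then look like:
zpool import -o readonly=on -f -T <txg> zroot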

When doing zpool import -F -T 3647666 -o zroot (which was actually a typo, because I did not add any parameter for -o), I do get an explanation:
state: FAULTED
status: The pool metadata is corrupted
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system but can be imported using the '-f' flag.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-72

zpool import -f -T 3647666 zroot
cannot import zroot: one or more devices is currently unavailable

1. Could it have helped if I had created a checkpoint somewhere in the past?
2. Could a ZIL or SLOG have helped? Or should I rather have used redundancy like a mirror or raidz?
3. zdb crashes on zroot/var/log, a directory that is used a lot, certainly when something goes wrong. I often compare ZFS with how databases such as Oracle work. So, in this case, could ZFS not just roll back to a previous transaction where the zpool was still OK? Why is everything now faulted or broken?
4. I have the impression that somewhere something went wrong, but a lot of data is still on that disk. Can absolutely nothing be recovered from it? I may be wrong, but I have the impression that because something went wrong in one tiny spot, everything is gone. Perhaps some datasets in the zpool are still fine; why can I not export those?

I have not tried the -X option yet; I am waiting for advice from this forum.

Thanks in advance for your help
 
This shows a lot of data on the screen. When it arrives at the dataset zroot/var/log, it shows the ZIL header, then objects 0, -1, -2 and -3, then three lines of Dnode slots data, then dmu_object_next() = 97
and then on the next line: Abort trap
Are you sure you did not add a ZIL LOG device to your pool? Have you checked whether your pool devices are damaged, for example with smartctl?
 
I am sure that I did not have a ZIL LOG. The zroot was created with the default installer available when booting from the CD-ROM in VirtualBox. I don't think smartctl gives much for virtual disks. I think that because the host hung some sectors might be damaged, but the main part is, as far as I can tell, fine.
 
Sorry, I didn't read that you were running the system in a VM, so disregard the smartctl part. On the other hand, you can clone the VM to have a copy of the original state, and then you can safely run the -X option. As for point 2: correct, if you had configured your pool with raidz replication or a mirror, you could have recovered the data from the other healthy unit.

The following also occurs to me: with zdb you can read the blocks and their content, though I don't remember the exact invocation. I don't know whether that could help here, given the object headers you show above, but possibly a more experienced user can help you.
 
zpool import zroot prints "failed to load zpool zroot" four times, and then:

Cannot import zroot: I/O error
You should stop right there: the disk you are trying to import the pool from has I/O errors. Before going any further, let's try to debug those, and perhaps fix them.

Look for I/O error messages in dmesg or /var/log/messages. What do they say? Check the disk with smartctl; you might see some non-zero error counters there. The most likely cause of your problem is very simple: your disk has failed, there are some blocks on the disk that are not readable, and those are vitally necessary for ZFS to work. If that's true, smartctl should tell you that you have permanent read errors. It could also be that you have a solvable problem, like a bad SATA cable connection or a bad power supply. The exact text of the error messages might help us distinguish those possibilities.
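For example, a minimal sketch (the device name ada0 and the smartmontools package are assumptions for a typical single-disk setup):

Code:
# kernel / CAM error messages for the disk
dmesg | grep -i ada0
grep -i ada0 /var/log/messages

# SMART health, error log and error counters
smartctl -a /dev/ada0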
 
[...] Or should I rather have used redundancy like a mirror or raidz?
One of its fundamental qualities is that ZFS guards your stored data exceptionally well.

Any bad data ZFS encounters will be reported. You won't get the chance to work with incorrect data from storage.

When the bad data happens to be user data*, ZFS will try to remedy that if and when there is redundancy available. For that it relies on:
  1. mirrors;
  2. RAIDZ1, 2, or 3;
  3. copies=2 or copies=3 set for the data of a filesystem (= ZFS dataset), see zfsprops.
#1 & #2 safeguard against a total disk drive failure (of course, a mirror made up of two partitions on the same disk won't help there).

#3 I consider a lesser solution: it won't safeguard against a total disk failure or a disk controller failure, and it also slows down disk I/O when writing. It may be your only direct recourse to redundancy when using, for example, a laptop with only one storage option. In that case I also recommend frequent backups, perhaps in the form of a lot of snapshots sent externally.
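As an illustrative sketch (pool and device names are hypothetical), the three options look like this:

Code:
# 1. a two-way mirror
zpool create tank mirror ada1 ada2

# 2. RAIDZ1 across three disks (RAIDZ2/3 add more parity)
zpool create tank raidz1 ada1 ada2 ada3

# 3. extra copies within a single-disk pool, per dataset
zfs set copies=2 tank/important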

What matters is that the actual underlying storage has redundancy. VMs, or applications that use for example a ZVOL as a block device, can rely on the ZFS redundancy underneath that ZVOL, even when they do not use ZFS as their "direct" filesystem.

A single disk without redundancy that reports errors resulting from corrupted data on the disk means that the pool is lost. Your recourse then is to re-create the pool and restore from backup. For some this is a stark contrast to the "normal situation" as experienced with, for example, UFS, where one could try to resolve errors by deploying tools like fsck(8); that, AFAIK, does not repair any user data.

ZFS is not intended to do internal repairs where there is no redundancy. From all that I've read and know, that is also not in the works for any future version: it is just not designed to do that. You may be able to successfully deploy more or less dangerous options such as -f and others such as those referred to here, at increasing risk.

There is, one could argue, zdb(8), but that is a debugging tool and not a repair tool. Trying to use it effectively as a repair tool requires in-depth knowledge of ZFS and its internal structures.



1. Could it have helped if I had created a checkpoint somewhere in the past?
2. Could a ZIL or SLOG have helped?
On a user level, checkpoints can guard against recent user errors (with restrictions) or be used for testing purposes; rewinding comes at the cost of losing the most recent changes. Creating one is akin to taking a snapshot, but on a pool-wide basis.
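A short sketch of the checkpoint workflow on a healthy pool (hypothetical commands, not something to run on the damaged pool):

Code:
# create a pool-wide checkpoint
zpool checkpoint zroot

# later: discard it ...
zpool checkpoint -d zroot

# ... or rewind the whole pool to it at import time
zpool export zroot
zpool import --rewind-to-checkpoint zroot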

The ZIL is a specific ZFS data structure in memory and on disk. The on-disk ZIL is intended to function as quick temporary storage for data that must be saved in a non-volatile manner but has not yet been written to its final location on disk as part of the normal ZFS data structures. It will be relied upon (= read) only in the case of a failure such as a power failure; upon reboot, when importing the pool, the ZIL data on disk will be replayed and written to its final destination. A SLOG device is just a separate external device that extends/moves the ZIL functionality to that device. Some further info:
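For illustration, adding and removing a separate log device might look like this (the device name is hypothetical):

Code:
# move the ZIL to a separate, fast, power-loss-protected device
zpool add zroot log ada1p1

# a log device can be removed again later
zpool remove zroot ada1p1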

___
* For meta data, it relies on its built-in redundancy (two or three copies) in order to remedy problems.
 
You should stop right there: the disk you are trying to import the pool from has I/O errors. Before going any further, let's try to debug those, and perhaps fix them.

Look for I/O error messages in dmesg or /var/log/messages. What do they say? Check the disk with smartctl; you might see some non-zero error counters there. The most likely cause of your problem is very simple: your disk has failed, there are some blocks on the disk that are not readable, and those are vitally necessary for ZFS to work. If that's true, smartctl should tell you that you have permanent read errors. It could also be that you have a solvable problem, like a bad SATA cable connection or a bad power supply. The exact text of the error messages might help us distinguish those possibilities.
Hi Ralphbsz,
Thanks for your reaction.
Maybe it is my mistake: I cannot boot from the pool, so I was trying to import it in single-user mode. Currently I cannot see dmesg or /var/log/messages. I will try smartctl during the day, but I remember from the past that it does not give much for virtual disks. A cable or a bad power supply is probably also not the cause, as it's a VirtualBox environment, it's a VM. Let me try that smartctl to be sure.
Best Regards,
 
To Erichans:
Thanks for your comments, interesting to read.

Apart from the fact that in the past I chose a single disk, so as not to lose storage space to mirroring, maybe I could have done it differently. I could have chosen two pools: pool1 (zroot) with the OS and pool2 with the data. Perhaps then my data pool would have been OK and I could have just reinstalled the OS. The annoying part is that some applications by default use the user/home folders to store some files. I prefer to have all my data on the data pool and the whole OS in zroot. I would probably have to change those applications' default write location, something like that.

I actually use the VM to replicate all the data to another server, with snapshots and by sending those snapshots over. That is what I would prefer to do: use the ZFS features rather than, for example, a sync program. But then I need a baseline snapshot to send from one to the other. In other words, suppose you have a dataset with books: you need to keep all your books on both VM1 and SERVER1. Now that I think about it, that's not really true. I could also just create a dataset small-list-of-books-from-vm and make sure the list stays small, send it over to the server, move the books there from small-list-of-books-from-vm to big-list-of-all-books, throw them away on the VM, and sync again; then the small list is empty on both VM and server, waiting for new books. Then my VM would become smaller and I could set up mirroring for the data pool. I am not 100% sure, but even if the data pool is on the same physical machine, I think the chance of recovering from a host crash in a VM is still higher when there is a mirror.
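A sketch of that replication flow, with hypothetical dataset and host names (a full baseline first, incrementals afterwards):

Code:
# one-time baseline
zfs snapshot zdata/books@base
zfs send zdata/books@base | ssh server1 zfs receive -u backup/books

# later runs only send the changes since the previous snapshot
zfs snapshot zdata/books@today
zfs send -i @base zdata/books@today | ssh server1 zfs receive -u backup/books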

I have no space left on the machine, so I am going to start another PC to move the VirtualBox files to. There I can try the -X option on a copy, just to try to get back some files if possible.

The more problems the more you learn :)
 
Update:
- The following command works from single-user mode once the geli disk has been decrypted:

zpool import -F -X -o readonly=on zroot


But then everything is read-only.

So how do I get some of the data out of ZFS?
It should be possible to do it with zdb -B.

I did the following steps:
  1. Export the pool again:
    zpool export zroot
  2. Set up an IP in the same subnet as the machine you want to send your data to:
    ifconfig em1 inet 192.168.129.4 netmask 255.255.255.0
  3. It should be possible to send the data to another machine, but I get an error, so I am doing something wrong:
    zdb -B zroot/media -e | ssh root@192.168.129.5 "zdb -R zroot/media"
I am curious to get it working, and to see what the data will look like.
 
Update:
- The following command works from single-user mode once the geli disk has been decrypted:

zpool import -F -X -o readonly=on zroot
I did the following steps:
  1. Export the pool again:
    zpool export zroot
  2. Set up an IP in the same subnet as the machine you want to send your data to:
    ifconfig em1 inet 192.168.129.4 netmask 255.255.255.0
  3. First I have to get the object ID:
    zdb -d zroot/media/book -e
  4. Then I can copy it to another machine:
    zdb -B zroot/115478 -e | ssh root@192.168.129.5 "cat > /zroot/test.snap"
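(Side note, as an assumption on top of the steps above: the saved file should be an ordinary ZFS send stream, so it can be inspected before doing anything else with it.)

Code:
# summarize the records in the saved stream
zstream dump < /zroot/test.snap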
But now I want to try it with zdb -R:
-R, --read-block=poolname vdev:offset:[lsize/]psize[:flags]
    Read and display a block from the specified device. By default the block is displayed as a hex dump, but see the description of the r flag, below.
    The block is specified in terms of a colon-separated tuple: vdev (an integer vdev identifier), offset (the offset within the vdev), size (the physical size, or logical size / physical size) of the block to read and, optionally, flags (a set of flags, described below).

So the next steps are:
  1. What is my vdev identifier? How do I find it?
  2. What is the offset within the vdev?
  3. psize: the physical size of the block
Thinking...
 
It should be possible to send the data to another machine, but I get an error, so I am doing something wrong:
zdb -B zroot/media -e | ssh root@192.168.129.5 "zdb -R zroot/media"
OpenZFS 2.2, zdb -R ... :
Code:
-R, --read-block=poolname vdev:offset:[lsize/]psize[:flags]
    Read and display a block from the specified device. 
    By default the block is displayed as a hex dump, but see the description of the r flag, below.
zdb -R tries to read from a block device; the quoted command tries to feed it a serialized stream instead: zdb -B: "Generate a backup stream, similar to zfs send [...]".

You have no redundancy, so you have only one vdev: see Virtual Devices (vdevs)

You might (I admit, there's only a small chance) find more cutting-edge reference material in the master branch at OpenZFS, for zdb: master. To actually use any new items you'd have to boot with ZFS built from the master branch.

Some hooks into further documentation:
  1. Using zdb to peer into how ZFS stores files on disk by Chris Siebenmann
  2. zfsondisk from Matthew Ahrens' github. Contains
    - ZFS On-Disk Specification - Draft; from 2006 by SUN: ondiskformatfinal.odt
    - various ZFS internals docs
  3. Matt Ahrens - Lecture on OpenZFS read and write code paths
    - Lecture 6 of Marshall Kirk McKusick's class 2016; intro: "What is the ZFS storage system?"
#2 can keep you occupied for many rainy nights. For some help with ZFS code internals, #3 might be useful, though it is not geared toward zdb use.
Happy debugging!

___
P.S. It would be helpful, when you report something that produces an error, to post the command(s) given together with their output in a
[code] ... [/code] block. However, I must admit that when going deep into ZFS you'll find more direct ZFS expertise online at the OpenZFS github site than here; I suggest you prepare thoroughly :)
 
Thanks! You are right, zdb -B produces a backup stream, and I don't need zdb -R. It seems the normal zfs receive can handle backup streams. I need to try it out to be sure, but it looks like I can send it to a normal new dataset on the destination. Perhaps it's also possible to send to the same destination dataset.

Hehehehe, indeed, it will probably take some time. But it's good that I am punished; it's my own fault for not taking extra backups or using another setup.

Also thanks for the interesting articles and links!
 
I have tested it quickly on a new VM

- The following command works from single-user mode once the geli disk has been decrypted:

zpool import -F -X -o readonly=on zroot

I did the following steps:
  1. Export the pool again:
    zpool export zroot
  2. Set up an IP in the same subnet as the machine you want to send your data to:
    ifconfig em1 inet 192.168.129.4 netmask 255.255.255.0
  3. First I have to get the object ID:
    zdb -d zroot/media/book -e
  4. Then I can copy it to another machine:
    zdb -B zroot/115478 -e | ssh root@192.168.129.5 "zfs receive zroot/media/book1"
When opening /zroot/media/book1, I see the content of the zroot/media/book dataset of the zpool that was corrupted. So that is what worked on a small dataset.

Going to test it later on all datasets.
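A hypothetical loop to automate that over all datasets (an untested sketch: it assumes the dataset names were saved to a file while the pool was still imported read-only, that the pool has been exported again so zdb -e works, and that the parent datasets already exist on the receiving side):

Code:
#!/bin/sh
# /tmp/datasets.txt was created earlier with: zfs list -H -o name -r zroot
while read -r ds; do
    # parse the object set ID out of the "Dataset <name> [ZPL], ID <n>, ..." line
    id=$(zdb -d "$ds" -e | awk '/^Dataset /{gsub(",", "", $5); print $5; exit}')
    [ -n "$id" ] || { echo "no ID found for $ds"; continue; }
    zdb -B "zroot/$id" -e | \
        ssh root@192.168.129.5 "zfs receive -u backup/${ds#zroot/}" || echo "failed: $ds"
done < /tmp/datasets.txt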
 
The articles that Erichans shared reminded me of the procedure to recover files through blocks. I am performing the tests on a Solaris VM, since I am traveling and it is the only thing I have.

Code:
:~# echo 'My text file' > file.txt

xv0:~# ls -i
184497 file.txt

xv0:~# zdb -dddddd rpool/ROOT/solaris 184497
Dataset rpool/ROOT/solaris [ZPL], ID 41, cr_txg 8, 5.81G, 178946 objects, rootbp DVA[0]=<0:412e34800:200:STD:1> DVA[1]=<0:1545e2800:200:STD:1> [L0 DMU objset] fletcher4 lzjb LE unique unencrypted size=800L/200P birth=6205L/6205P fill=178946 contiguous 2-copy cksum=1ae5596076:86ac63817fe:1769ccadef625:2f2c03601ac448

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
    184497    1    16K    512    512    512  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes |
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /root/file.txt
        uid     0
        gid     0
        atime   Sat Aug 10 10:36:44 2024
        mtime   Sat Aug 10 10:36:44 2024
        ctime   Sat Aug 10 10:36:44 2024
        crtime  Sat Aug 10 10:36:44 2024
        gen     6191
        mode    0100644
        size    13
        parent  354
        links   1
        pflags  0x40800000204
Indirect blocks:
                 0 L0 0:0x41260ae00:0x200 0x200L/0x200P F=1 B=6191/6191 ---

                segment [000000000000000000, 0x0000000000000200) size   512
                
xv0:~# zdb -R rpool 0:0x41260ae00:0x200
Found vdev: /dev/dsk/c1d0s1
DVA[0]=<0:41260ae00:200:STD:1> [L0 deduplicated block] off uncompressed LE unique unencrypted size=200L/200P birth=4L/4P fill=0 contiguous 1-copy cksum=0:0:0:0
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  207478657420794d  0000000a656c6966  My text file....
000010:  0000000000000000  0000000000000000  ................
*

xv0:~# dd if=/dev/dsk/c1d0s1 iseek=34164823 bs=512 count=1 | dd bs=13 count=1 | od -C
1+0 records in
1+0 records out
1+0 records in
1+0 records out
0000000   M   y       t   e   x   t       f   i   l   e  \n
0000015

xv0:~# dd if=/dev/dsk/c1d0s1 iseek=34164823 bs=512 count=1 | dd bs=13 count=1 | md5sum
1+0 records in
1+0 records out
1+0 records in
1+0 records out
6f777b2d13f36205ddc749fe98194aab  -

xv0:~# md5sum file.txt
6f777b2d13f36205ddc749fe98194aab  file.txt

That is the procedure I have followed, but I have a question about the formula described in the following document; there is a small thing that I do not understand. Possibly it is the heat that is making me not think clearly.

ZFS ondiskformat

In section 2.1 of the document we can see the formula to obtain the offset where the beginning of our file is located on the disk. But I do not understand why it does a bit shift to the left; shouldn't it be to the right?


Code:
physical block address = (offset << 9) + 0x400000 (4MB)

In my case:

(0x41260ae00 + 0x400000) / 512 = 34164823

Following the procedure in the document gives an erroneous result (the 512 could also be replaced by a >> 9); possibly it is a misunderstanding on my part, or I forgot something.

Thanks.
 
Possibly it is the heat that is making me not think clearly. [...]

In section 2.1 of the document we can see the formula to obtain the offset where the beginning of our file is located on the disk. But I do not understand why it does a bit shift to the left,
Must be the heat.

As per section 2.1 of the doc: the value must be shifted over (<<) by 9 (2^9 = 512); a left shift by 9 in binary equals a multiplication by 512.
Try it out yourself here
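For example, with sh arithmetic:

Code:
# a left shift by 9 multiplies by 512
$ echo $(( 1 << 9 ))
512
$ echo $(( 5 << 9 ))
2560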
 
Must be the heat.

As per section 2.1 of the doc: the value must be shifted over (<<) by 9 (2^9 = 512); a left shift by 9 in binary equals a multiplication by 512.
Try it out yourself here
I understand that this logical shift is equal to multiplying by 512: 1 shift equals ×2, 2 shifts ×4, 3 shifts ×8, etc. But shouldn't it be a bit shift to the right? If I follow the formula in the document, the offset is incorrect: when I try to access it with dd, it doesn't point to anything. The procedure I followed would be:

Code:
(0x41260ae00 + 0x400000) >> 9 = 34164823

Otherwise, if I perform the multiplication as suggested in the document, the offset is incorrect. That was my doubt; I may have misunderstood some previous step. I am not multiplying as the document suggests, but dividing.
 
Right: a SUN document, which can be found at Matthew Ahrens' (co-inventor of ZFS) github site, gives a formula that calculates an offset by a left shift. You use an input value for that and it doesn't give you the expected result; therefore, you suggest, the calculation in the document is likely/possibly incorrect.

You suggest the left shift should perhaps be a right shift or a division (those are not the same). Why? Because it's sort of an inverse operation, or did they just invert the character as a typo and nobody noticed? I don't see how that would lead to a meaningful value for an offset, even though your overall calculations seem to lead you to the correct data on the disk.
 
Which od did you use? od(1) does not specify -C as an option; using -C produces an error on my 14.1-RELEASE.
I am using Solaris 11.4; I am traveling and it is the only thing I had on my laptop. I have also created a VM with FreeBSD 14.1-RELEASE in case I have to comment on more things in the thread.

I share with you the documentation that I have read about od; there you can see the -C option. On FreeBSD I usually use hexdump(1), but it is not in the base system of Solaris 11.4, so I had to use od.

od(1) - Solaris 11.4
 
Following the procedure in the document gives an erroneous result,
The "problem" with the formula in the SUN document is not the formula itself but the input you used.

The problem is not easy to spot, and probably only source-code analysis of zdb(8), or retrieval of the offset value contained in the on-disk blkptr_t structure by means other than zdb(8), could have cleared this up.

zdb(8) prints the related Data Virtual Address (DVA) as shown in your output:
Rich (BB code):
xv0:~# zdb -dddddd rpool/ROOT/solaris 184497
Dataset rpool/ROOT/solaris [ZPL], ID 41, cr_txg 8, 5.81G, 178946 
   [...]
Indirect blocks:
                 0 L0 0:0x41260ae00:0x200 0x200L/0x200P F=1 B=6191/6191 ---
You expected 0x41260ae00 to be the offset value as stored on disk: it is not!

ZFS DVA offsets are in 512-byte blocks on disk but zdb misleads you about them by Chris Siebenmann (see there for more details, including the twitter responses of Matthew Ahrens to his "ZFS DVA offsets"). Quote from that blog article (my bolded underlining):
That is to say, when zdb prints ZFS DVAs it is not showing you the actual on-disk representation, or a lightly decoded version of it; instead, the offset is silently converted from its on-disk form of 512-byte blocks to a version in bytes.
[...]
So, to summarize: on disk, ZFS DVA offsets are in units of 512-byte blocks, with offset 0 (on each disk) starting after a 4 Mbyte header. In addition, zdb prints offsets (and sizes) in units of bytes, not their on disk 512-byte blocks (in hexadecimal), as (probably) do other things. If zdb says that a given DVA is '0:7ea00:400', that is a byte offset of 518656 bytes and a byte size of 1024 bytes. Zdb is decoding these for you from their on disk form. If a kernel message talks about DVA '0:7ea00:400' it's also most likely using byte offsets, as zdb does.
7ea00 is exactly the result of "offset << 9" as in:
To find the physical block byte offset from the beginning of a slice, the value inside offset must be shifted over (<<) by 9 (2^9 = 512) [...]
in the SUN document, as further shown in the formula:
physical block address = (offset << 9) + 0x400000 (4MB)
This means that, given the value as output by zdb, you only have to add 0x400000 to it to obtain the physical block address.

To then get dd(1) to show the test file, as in:
Code:
xv0:~# dd if=/dev/dsk/c1d0s1 iseek=34164823 bs=512 count=1 | dd bs=13 count=1 | od -C
you must calculate the correct start value and provide that as the value of the iseek option. For that calculation you correctly applied the / 512:
Code:
[...]
In my case:

(0x41260ae00 + 0x400000) / 512 = 34164823
Note that you apply that division to the sum, not just to the offset itself.
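Re-doing that arithmetic in sh, with the raw on-disk value made explicit (this is just a restatement of the numbers above):

Code:
# zdb printed the offset already converted to bytes:
$ echo $(( 0x41260ae00 ))
17488195072
# the raw on-disk offset field (in 512-byte sectors) would therefore be:
$ echo $(( 0x41260ae00 >> 9 ))
34156631
# the SUN formula applied to that raw value: (offset << 9) + 0x400000
$ echo $(( (34156631 << 9) + 0x400000 ))
17492389376
# divided by 512 this gives the sector number used for dd's iseek:
$ echo $(( 17492389376 / 512 ))
34164823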

Perhaps part of your confusion is equating (in a very direct manner) the multiplication by 512 (the left shift) in:
physical block address = (offset << 9) + 0x400000 (4MB)
with the 512 in your calculation under "In my case:". These two values are of course both related to the blocks on disk to which they refer, but they are each used for a different purpose.

As to zdb(8), the following references, including Chris Siebenmann's blog, I found helpful:
 
Interesting to approach it via the blocks.

I tried to copy the different datasets and got errors on the top level, zroot, and also on zroot/var/log.

So the next steps are:
  1. Set up a clean install of FreeBSD on a new VM.
  2. Take the latest backup.
  3. Transfer the backup to the new VM.
  4. Transfer the latest data of the corrupt VM via zdb -B to the new VM (with the latest backup).
    1. Only copy the datasets that did not have any errors (all except the top-level zroot and zroot/var/log).
At each of the steps, take a ZFS snapshot and a VM snapshot, so I can easily undo the situation in case something goes wrong.
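For the ZFS part of that, a minimal sketch (snapshot names are hypothetical):

Code:
# before each step, take a recursive snapshot of the new pool
zfs snapshot -r zroot@before-step1

# if a step goes wrong, roll the affected dataset back
zfs rollback zroot/media/book@before-step1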

I have to double-check, but I am not sure: the /root home folder, where GPG keys are stored, is that located under zroot/ROOT/default? Or does that come via zroot, the highest level ...
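One way to check where /root actually lives once the new system is up (as far as I know the default installer layout gives /root no dataset of its own, so it would be part of the boot environment dataset, but the commands below will confirm it):

Code:
# list every dataset and its mountpoint
zfs list -o name,mountpoint -r zroot

# or ask directly which filesystem contains /root
df /root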
 