Hi. I'm building a new system for a small data warehouse and have been testing disk performance in various zpool configurations using up to 14 drives. Every configuration seems to be performing as expected except for sequential reads across mirror sets.
Could somebody please either correct my expectations or provide some performance tips? I've used ZFS for years, but this is the first time that system performance really matters.
Here's a basic summary of my findings.
Code:
Raw/single-drive speed:
  write  180 MiB/s
  read   200 MiB/s

9x1 striped set (9 disks total, no redundancy)
  write 1440 MiB/s (160 MiB/s per data drive)
  read  1691 MiB/s (188 MiB/s per data drive)

8x RAIDz2 (8 disks total, 2 parity drives)
  write  922 MiB/s (154 MiB/s per data drive)
  read  1031 MiB/s (172 MiB/s per data drive)

4x2 mirror (8 disks total, 2 drives per mirror)
  write  613 MiB/s (152 MiB/s per mirror set - expecting 4x write gain)
  read   804 MiB/s (101 MiB/s per data drive - expecting 8x read gain)

3x3 mirror (9 disks total, 3 drives per mirror)
  write  442 MiB/s (147 MiB/s per mirror set - expecting 3x write gain)
  read   686 MiB/s ( 76 MiB/s per data drive - expecting 9x read gain)
Overall, I was impressed with how ZFS stripes reads and writes across VDEVs, and I was surprised at how well the various RAIDz configurations performed. Everything in those configurations performed as expected based on single-drive performance (allowing for some overhead and the various parity levels).
Writing to a mirrored VDEV seems acceptable at ~150 MiB/s per mirror set. However, resilvering and scrubbing both ran at >180 MiB/s, so I imagine faster sequential write speeds are theoretically attainable.
Read performance from mirrored VDEVs is disappointing. I would have expected something close to, say, >80% of the theoretical maximum, but the results show more like 55%: the 4x2 pool reads at 804 MiB/s from 8 drives that each managed ~188 MiB/s in the plain striped test (roughly 1500 MiB/s in aggregate), and the 3x3 pool does even worse.
You can see that reads from the 9-drive pool of triple mirrors are actually slower than reads from the 8-drive pool of 2-way mirrors, which seems fundamentally wrong. And comparing the 8-drive RAIDz2 pool (which has only 6 data drives) with the 8-drive 2-way mirror pool, you would expect the mirrors to outpace the RAIDz, yet the opposite is true.
It seems that, for reads, each mirror set performs roughly as if it were a single drive rather than a striped set. Monitoring reads with 'zpool iostat -v' and 'gstat' anecdotally confirms that the individual drives are under-utilized in the mirrored configurations (55-65% busy, whereas under RAIDz the per-drive utilization sits at a fairly constant 99-100%). Random seeks are definitely better with mirrors but don't seem to improve further with triple mirrors, and performance does not improve with multiple concurrent reader threads.
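The monitoring was essentially just the following ("tank" stands in for the real pool name):
Code:
# Per-vdev / per-disk throughput during a sequential read, updated every second
zpool iostat -v tank 1

# GEOM-level view of per-disk busy % and throughput
gstat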
Can anybody familiar with the internals of the FreeBSD ZFS implementation shed some light on whether this is expected behavior? It seems that a lot of read performance is being left on the table. Is there any detailed information about this?
Otherwise, are there any settings I should look at tweaking? I know I could present a set of hardware-RAID mirrors to ZFS and use those instead (that should give the expected boost, but I would prefer not to for many reasons).
The numbers are making me seriously consider RAIDz over mirrors for at least part of the database - which seems all kinds of wrong.
These values are for sequential reads/writes of data at the beginning ("fast part") of the disks. They are just very basic dd tests, with /dev/zero or /dev/random as the source for writes; I've listed only the 1 MiB block-size results. I do have more comprehensive tests, and their results are consistent with the values above.
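The commands were roughly of this form; the file path and size here are placeholders, and compression is off so /dev/zero isn't simply compressed away:
Code:
# Sequential write: 1 MiB blocks from /dev/zero (or /dev/random)
dd if=/dev/zero of=/tank/test/seqfile bs=1M count=65536

# Sequential read of the same file back to /dev/null
# (assumes the file isn't already sitting in the ARC)
dd if=/tank/test/seqfile of=/dev/null bs=1M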
I'm running a fresh install of FreeBSD 11.0 on new commodity hardware, using new 3 TB SATA HDDs (4096-byte sectors) connected to a mixture of onboard and PCIe SATA adaptors.
I have confirmed that the data was spread fairly evenly across all drives during the tests, and the read totals per drive were essentially identical. "zdb -m" confirms that the data sits at the beginning of each disk.
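That check was simply (pool name again a placeholder):
Code:
# Dump metaslab allocation per vdev; allocations concentrated in the
# lowest-offset metaslabs mean the test data is at the start of the disks
zdb -m tank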
Most system/ZFS settings were left at their defaults, other than the following (a rough sketch of how these are applied follows the list):
ahci_load="YES"
vfs.vmiodirenable
vfs.zfs.min_auto_ashift=12
ashift: 12 (confirmed via "zdb", and partition alignment also looks correct)
compression=off
atime=off
recordsize={various} - a larger recordsize does improve throughput, but only by ~10%
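For completeness, this is roughly how the above are applied ("tank/db" is a placeholder dataset name; the min_auto_ashift sysctl needs to be in place before the pool is created for it to matter):
Code:
# /boot/loader.conf
ahci_load="YES"

# New vdevs get ashift=12 (set before "zpool create")
sysctl vfs.zfs.min_auto_ashift=12

# Dataset properties
zfs set compression=off tank/db
zfs set atime=off tank/db
zfs set recordsize=1M tank/db    # one of several values tried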