Huge ZFS Storage Server :-)

Design of a huge future storage server with FreeBSD and ZFS.

The goal is to improve this design with your help and recommendations...
Thanks in advance for your time and help!

I will update this content as design improvements are made.

---------------------------------------------------------------------------------------------------
Section Hardware:

Dell PowerEdge R440
2x Intel® Xeon® Silver 4112 2.6G
6x 16GB RDIMM, 2666MT/s RDIMMs
QLogic FastLinQ 41112 Dual Port 10GbE SFP+ Adapter, PCIe Low Profile
Dell PERC H330 RAID Controller
2x 240GB SSD SATA Mixed Use 6Gbps 512e 2.5in Hot plug (Hardware RAID1 -> da0)
Dell SAS 12Gbps Host Bus Adapter External Controller (more info 1, 2, 3, 4, 5, 6)

Dell PowerVault ME484 JBOD, HBA
42x (NL-SAS, 3.5-inch, 7.2K, 10TB)
-> da1, da2, da3, ... da42

---------------------------------------------------------------------------------------------------
Section FreeBSD:

Boot the PowerEdge R440 from USB:
F11 = Boot Manager
One-shot BIOS Boot Menu
[Hard drive] Disk connected to front USB 2: DataTraveler 2.0

Root-on-ZFS Automatic Partitioning
Binary package

[Image: FreeBSD boot loader menu (bsdinstall-newboot-loader-menu.png)]


3. Escape to loader prompt

Load the appropriate driver for the 'Dell PERC H330 RAID Controller' so the installer can see the disk:

OK set hw.mfi.mrsas_enable="1"
OK boot


Install with the defaults.

At the end of the installation, at the Manual Configuration dialog, choose to open a shell:

< Yes >

Make the drivers load at every boot:

# echo 'hw.mfi.mrsas_enable="1"' >> /boot/device.hints
# echo 'mrsas_load="YES"' >> /boot/loader.conf
# echo 'if_qlnxe_load="YES"' >> /boot/loader.conf
# shutdown -r now
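
After the reboot, a quick sanity check that the tunable is set and the drivers are in place (a suggestion only; if mrsas is compiled into the kernel it will not show up as a separate module, and the ql0/ql1 names are specific to this adapter):

# kenv hw.mfi.mrsas_enable
# kldstat
# ifconfig ql0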



# freebsd-version

Code:
12.1-RELEASE

# freebsd-update fetch
# freebsd-update install
# shutdown -r now
# freebsd-version

Code:
12.1-RELEASE-p1

View the partitions of da0

# gpart show da0

Code:
=>       40  467664816  da0  GPT  (223G)
         40       1024    1  freebsd-boot  (512K)
       1064        984       - free -  (492K)
       2048  134217728    2  freebsd-swap  (64G)
  134219776  333443072    3  freebsd-zfs  (159G)
  467662848       2008       - free -  (1.0M)

Because the driver configuration for the 10Gb network cards gave me some problems, I leave the working parameters here:


# cat /etc/rc.conf

Code:
# QLogic FastLinQ 41112 Dual Port 10GbE SFP+ Adapter, PCIe Low Profile
# 31.7.2. Failover Mode
#
ifconfig_ql0="up"
ifconfig_ql1="up"
cloned_interfaces="lagg0"
#
# IPv4
ifconfig_lagg0="laggproto failover laggport ql0 laggport ql1 172.16.3.31/16"
defaultrouter="172.16.1.1"
#
# IPv6
ifconfig_lagg0_ipv6="inet6 2001:470:1f2b:be::31/64"
ipv6_defaultrouter="2001:0470:1f2b:be::1"
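
These settings are applied at boot. To apply them without rebooting, the network can be restarted (this briefly drops connectivity, so do it from the console or iDRAC):

# service netif restart
# service routing restart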


# ifconfig lagg0

Code:
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO>
        ether 34:80:0d:5d:d4:a6
        inet 172.16.3.31 netmask 0xffff0000 broadcast 172.16.255.255
        inet6 fe80::3680:dff:fe5d:d4a6%lagg0 prefixlen 64 scopeid 0x6
        inet6 2001:470:1f2b:be::31 prefixlen 64
        laggproto failover lagghash l2,l3,l4
        laggport: ql0 flags=5<MASTER,ACTIVE>
        laggport: ql1 flags=0<>
        groups: lagg
        media: Ethernet autoselect
        status: active
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

Update:
With 96GB of swap, FreeBSD reports this warning at boot:
'WARNING: reducing swap size to maximum of 65536MB per unit'

The final swap size is therefore 64GB.
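
The swap that is actually active can be confirmed with swapinfo(8):

# swapinfo -h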

FreeBSD 12.1 comes with 'vfs.zfs.min_auto_ashift=12' by default.

Check the value

# sysctl vfs.zfs.min_auto_ashift

Code:
vfs.zfs.min_auto_ashift: 12

-------------------------------------------------
Only required for FreeBSD <= 12.0

Check the value
# sysctl vfs.zfs.min_auto_ashift
Code:
vfs.zfs.min_auto_ashift: 9

(recommendation from gkontos)
Set 'vfs.zfs.min_auto_ashift=12' before creating the pool.

Change the value from 9 to 12
# sysctl vfs.zfs.min_auto_ashift=12
Code:
vfs.zfs.min_auto_ashift: 9 -> 12

Make the change permanent
# echo 'vfs.zfs.min_auto_ashift="12"' >> /etc/sysctl.conf
-------------------------------------------------

---------------------------------------------------------------------------------------------------
Section Disks distribution:

See what disks FreeBSD detects:


# egrep 'da[0-9]|cd[0-9]' /var/run/dmesg.boot

Code:
...
da0: <DELL PERC H330 Adp 4.30> Fixed Direct Access SPC-3 SCSI device
...
da1: <SEAGATE ST10000NM0256 TT55> Fixed Direct Access SPC-4 SCSI device
...
da2: <SEAGATE ST10000NM0256 TT55> Fixed Direct Access SPC-4 SCSI device
...
da42: <SEAGATE ST10000NM0256 TT55> Fixed Direct Access SPC-4 SCSI device
...

Strangely, FreeBSD reports devices up to da84 (the maximum drive count of the ME484). Perhaps the ME484 presents its empty slots to the HBA as disks, or perhaps each dual-ported NL-SAS drive is simply visible through both enclosure modules, which would also give exactly twice the number of disks; this requires more investigation.
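
camcontrol(8) can help with that investigation. Listing the devices and comparing the serial numbers of a low and a high da number should show whether the extra entries are empty slots or a second path to the same disk (a suggestion, not something verified on this setup):

# camcontrol devlist
# camcontrol inquiry da1 -S
# camcontrol inquiry da43 -S

If da1 and da43 report the same serial number, the extra entries are second SAS paths rather than extra disks, and gmultipath(8) is the tool to manage them.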

Dell PERC H330 RAID Controller
# cat /var/run/dmesg.boot | grep PERC
Code:
da0: <DELL PERC H330 Adp 4.30> Fixed Direct Access SPC-3 SCSI device

Dell SAS 12Gbps Host Bus Adapter External Controller
# cat /var/run/dmesg.boot | grep LSI
Code:
mpr0: <Avago Technologies (LSI) SAS3008> port 0xc000-0xc0ff mem 0xe1100000-0xe110ffff,0xe1000000-0xe10fffff irq 88 at device 0.0 numa-domain 1 on pci12
mpr0: Firmware: 16.00.08.00, Driver: 23.00.00.00-fbsd

Operating System (Hardware RAID1)
da0 -> FreeBSD 12.1 amd64

Pool storage.
da1, da2, ... da21 -> vdev (first vdev)
da22, da23, ... da42 -> vdev (second vdev)

Pool storage diagram.
42x HDD 10TB, 2 striped 21x raidz3 (raid7), ~ 289TB

{ da1, da2, ... da21 }       { da22, da23, ... da42 }
          |                             |
    vdev (raidz3)                 vdev (raidz3)
          |                             |
-------------------------------------------------------------
| ZFS Pool 289TB approx.                                     |
-------------------------------------------------------------
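
Rough arithmetic behind the figures: each 21-disk raidz3 vdev keeps 18 disks of data and 3 of parity, so the pool has 2 x 18 = 36 data disks, about 360TB of raw data space (~327TiB). ZFS reports somewhat less usable space because it accounts for worst-case raidz padding and keeps a small internal reservation, which is how the ~289TB figure comes about (see the zpool list and df output further down).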

Setting up disks with FreeBSD:

Delete a previous GPT partition scheme.
Later versions of gpart(8) have a -F (force) option for destroy that makes things quicker.

# gpart destroy -F da1
# gpart destroy -F da2
...
# gpart destroy -F da42


Create GPT partition scheme.

# gpart create -s GPT da1
# gpart create -s GPT da2
...
# gpart create -s GPT da42


Create the freebsd-zfs partition on each disk with 1M alignment (proper sector alignment for 4K sector drives or SSDs) and a GPT label matching the device name.

# gpart add -t freebsd-zfs -b 1M -l da1 da1
# gpart add -t freebsd-zfs -b 1M -l da2 da2
...
# gpart add -t freebsd-zfs -b 1M -l da42 da42
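
Since the same three steps repeat for all 42 disks, they can also be scripted instead of typed one by one. A minimal sh loop, assuming the data disks really are da1 through da42 (double-check the device list first, because a wrong number here wipes the wrong disk):

# for i in $(seq 1 42); do gpart destroy -F da$i; gpart create -s GPT da$i; gpart add -t freebsd-zfs -b 1M -l da$i da$i; done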


View the partitions

# gpart show

Code:
=>       40  467664816  da0  GPT  (223G)
         40       1024    1  freebsd-boot  (512K)
       1064        984       - free -  (492K)
       2048  134217728    2  freebsd-swap  (64G)
  134219776  333443072    3  freebsd-zfs  (159G)
  467662848       2008       - free -  (1.0M)

=>         40  19134414768  da1  GPT  (8.9T)
           40         2008       - free -  (1.0M)
         2048  19134412760    1  freebsd-zfs  (8.9T)

=>         40  19134414768  da2  GPT  (8.9T)
           40         2008       - free -  (1.0M)
         2048  19134412760    1  freebsd-zfs  (8.9T)
...
=>         40  19134414768  da42  GPT  (8.9T)
           40         2008        - free -  (1.0M)
         2048  19134412760     1  freebsd-zfs  (8.9T)

View the partition labels.

# gpart show -l

Code:
=>       40  467664816  da0  GPT  (223G)
         40       1024    1  gptboot0  (512K)
       1064        984       - free -  (492K)
       2048  134217728    2  swap0  (64G)
  134219776  333443072    3  zfs0  (159G)
  467662848       2008       - free -  (1.0M)

=>         40  19134414768  da1  GPT  (8.9T)
           40         2008       - free -  (1.0M)
         2048  19134412760    1  da1  (8.9T)

=>         40  19134414768  da2  GPT  (8.9T)
           40         2008       - free -  (1.0M)
         2048  19134412760    1  da2  (8.9T)
...
=>         40  19134414768  da42  GPT  (8.9T)
           40         2008        - free -  (1.0M)
         2048  19134412760     1  da42  (8.9T)

Create the pool called 'storage' with the two raidz3 (raid7) vdevs.
Using the GPT label names (the most recommended).

# zpool create storage \
raidz3 \
gpt/da1 gpt/da2 gpt/da3 gpt/da4 gpt/da5 gpt/da6 gpt/da7 \
gpt/da8 gpt/da9 gpt/da10 gpt/da11 gpt/da12 gpt/da13 gpt/da14 \
gpt/da15 gpt/da16 gpt/da17 gpt/da18 gpt/da19 gpt/da20 gpt/da21 \
raidz3 \
gpt/da22 gpt/da23 gpt/da24 gpt/da25 gpt/da26 gpt/da27 gpt/da28 \
gpt/da29 gpt/da30 gpt/da31 gpt/da32 gpt/da33 gpt/da34 gpt/da35 \
gpt/da36 gpt/da37 gpt/da38 gpt/da39 gpt/da40 gpt/da41 gpt/da42


See the pool status.

# zpool status storage

Code:
  pool: storage
state: ONLINE
  scan: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        storage         ONLINE       0     0     0
          raidz3-0      ONLINE       0     0     0
            gpt/da1   ONLINE       0     0     0
            gpt/da2   ONLINE       0     0     0
            gpt/da3   ONLINE       0     0     0
            gpt/da4   ONLINE       0     0     0
            gpt/da5   ONLINE       0     0     0
            gpt/da6   ONLINE       0     0     0
            gpt/da7   ONLINE       0     0     0
            gpt/da8   ONLINE       0     0     0
            gpt/da9   ONLINE       0     0     0
            gpt/da10  ONLINE       0     0     0
            gpt/da11  ONLINE       0     0     0
            gpt/da12  ONLINE       0     0     0
            gpt/da13  ONLINE       0     0     0
            gpt/da14  ONLINE       0     0     0
            gpt/da15  ONLINE       0     0     0
            gpt/da16  ONLINE       0     0     0
            gpt/da17  ONLINE       0     0     0
            gpt/da18  ONLINE       0     0     0
            gpt/da19  ONLINE       0     0     0
            gpt/da20  ONLINE       0     0     0
            gpt/da21  ONLINE       0     0     0
          raidz3-1      ONLINE       0     0     0
            gpt/da22  ONLINE       0     0     0
            gpt/da23  ONLINE       0     0     0
            gpt/da24  ONLINE       0     0     0
            gpt/da25  ONLINE       0     0     0
            gpt/da26  ONLINE       0     0     0
            gpt/da27  ONLINE       0     0     0
            gpt/da28  ONLINE       0     0     0
            gpt/da29  ONLINE       0     0     0
            gpt/da30  ONLINE       0     0     0
            gpt/da31  ONLINE       0     0     0
            gpt/da32  ONLINE       0     0     0
            gpt/da33  ONLINE       0     0     0
            gpt/da34  ONLINE       0     0     0
            gpt/da35  ONLINE       0     0     0
            gpt/da36  ONLINE       0     0     0
            gpt/da37  ONLINE       0     0     0
            gpt/da38  ONLINE       0     0     0
            gpt/da39  ONLINE       0     0     0
            gpt/da40  ONLINE       0     0     0
            gpt/da41  ONLINE       0     0     0
            gpt/da42  ONLINE       0     0     0

errors: No known data errors

See the pool list.

# zpool list storage

Code:
NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage   374T  3.88M   374T        -         -     0%     0%  1.00x  ONLINE  -

See the pool mounted.

# df -h | egrep 'Filesystem|storage '

Code:
Filesystem            Size    Used   Avail Capacity  Mounted on
storage               289T    282K    289T     0%    /storage
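
Note the difference between the two numbers: zpool list shows the raw pool size (42 partitions of ~8.9TiB each, parity included, ~374TiB), while df and zfs list show the estimated usable space after raidz3 parity, raidz allocation padding, and the small space ZFS reserves for itself, hence ~289TiB. Both values are expected for this layout.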

Creation of ZFS datasets (file systems).

# zfs create storage/datasetname
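
Since this box will serve Samba and NFS (discussed further down in the thread), useful properties can be set when a dataset is created. A small sketch with a hypothetical dataset name:

# zfs create -o compression=lz4 -o atime=off storage/shares
# zfs set sharenfs=on storage/shares

sharenfs=on exports the dataset with default options via mountd; the NFS server itself still needs nfs_server_enable="YES" in /etc/rc.conf.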


Bonnie++ benchmarks

# bonnie++ -u root -r 1024 -s 98304 -d /storage -f -b -n 1 -c 4

Code:
Version  1.98       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
FreeBSD 96G::4            655m  81  398m  90            623m  99 726.4  28

Bonnie++ benchmark summary:
w=655MB/s, rw=398MB/s, r=623MB/s

---------------------------------------------------------------------------------------------------
Section Pool expansion:

Useful as an example for a future expansion: the ME484 holds up to 84 HDDs per enclosure, and up to 3 more enclosures can be added for a maximum of 4 (84 HDDs x 4 enclosures), so the pool grows by adding more vdevs.

Delete a previous GPT partition scheme.
Later versions of gpart(8) have a -F (force) option for destroy that makes things quicker.

# gpart destroy -F da43
# gpart destroy -F da44
...
# gpart destroy -F da63


Create GPT partition scheme.

# gpart create -s GPT da43
# gpart create -s GPT da44
...
# gpart create -s GPT da63


Create the freebsd-zfs partition on each new disk with 1M alignment (proper sector alignment for 4K sector drives or SSDs) and a GPT label matching the device name.

# gpart add -t freebsd-zfs -b 1M -l da43 da43
# gpart add -t freebsd-zfs -b 1M -l da44 da44
...
# gpart add -t freebsd-zfs -b 1M -l da63 da63


Add the new raidz3 (raid7) vdev of 21 disks to the existing pool called 'storage'.
Using the GPT label names (the most recommended).

# zpool add storage \
raidz3 \
gpt/da43 gpt/da44 gpt/da45 gpt/da46 gpt/da47 gpt/da48 gpt/da49 \
gpt/da50 gpt/da51 gpt/da52 gpt/da53 gpt/da54 gpt/da55 gpt/da56 \
gpt/da57 gpt/da58 gpt/da59 gpt/da60 gpt/da61 gpt/da62 gpt/da63
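
Tip: a raidz vdev cannot be removed from a pool once added, so before running zpool add for real it is worth previewing the resulting layout with the dry-run flag:

# zpool add -n storage raidz3 gpt/da43 gpt/da44 ... gpt/da63

(same disk list as above; -n only prints the configuration that would result, without modifying the pool)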


See the pool mounted and the new size.

# df -h | egrep 'Filesystem|storage '

Code:
Filesystem            Size    Used   Avail Capacity  Mounted on
storage               435T    281K    435T     0%    /storage

---------------------------------------------------------------------------------------------------
Section Tips:

Other useful command examples.

Export a pool that is not in use.
# zpool export storage
Import a pool to bring it back into use.
# zpool import storage
A built-in monitoring system can display pool I/O statistics in real time.
Press Ctrl+C to stop this continuous monitoring.
# zpool iostat [-v] [pool] ... [interval [count]]
# zpool iostat -v storage 5 100

ZFS features: create a dataset on this pool with compression, plus other useful command examples.

To create a dataset on this pool with compression enabled.
# zfs create storage/compressed
# zfs set compression=lz4 storage/compressed
Compression can be disabled with.
# zfs set compression=off storage/compressed
To unmount a file system.
# zfs umount storage/compressed
To re-mount the file system.
# zfs mount storage/compressed
Status can be viewed with.
# zfs list storage
The name of a dataset can be changed with.
# zfs rename storage/oldname storage/newname
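
Snapshots are also worth having in the toolbox on a file server. A few examples (the dataset and snapshot names are only placeholders):

To create a snapshot.
# zfs snapshot storage/compressed@before-cleanup
To list the existing snapshots.
# zfs list -t snapshot
To roll the dataset back to a snapshot.
# zfs rollback storage/compressed@before-cleanup
To delete a snapshot that is no longer needed.
# zfs destroy storage/compressed@before-cleanup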

---------------------------------------------------------------------------------------------------
Section Upgrading FreeBSD:

Useful as an example for a future FreeBSD upgrade, for example from 12.1-RELEASE to 12.2-RELEASE.

Upgrade FreeBSD.

# freebsd-update fetch
# freebsd-update install
# freebsd-update upgrade -r 12.2-RELEASE
# freebsd-update install
# shutdown -r now
# freebsd-update install
# pkg-static upgrade -f
# freebsd-update install
# shutdown -r now


Check the upgrade of FreeBSD.

# freebsd-version


Upgrade the pool called 'zroot' (Root-on-ZFS); since it is the boot pool, also update the boot code.

# zpool upgrade zroot
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0


Upgrade the pool called 'storage'.

# zpool upgrade storage


Check the upgrade of all pools.

# zpool upgrade


---------------------------------------------------------------------------------------------------
Section Documentation:

---------------------------------------------------------------------------------------------------
 
For proper alignment of the pool(s) you also need to sysctl vfs.zfs.min_auto_ashift=12
That needs to be done before creating them. You can check by running zdb | grep ashift and expect a value of 12

I am not sure if it is default in 12.1
 

$ freebsd-version
12.0-RELEASE-p11
$
$ zdb | grep ashift
ashift: 12

Apparently that comes by default in 12.0-RELEASE; I will have to wait a few days to find out how it is in 12.1-RELEASE.

Thank you for your support.

Update:

You are correct, at some point in time I put 'vfs.zfs.min_auto_ashift=12' in the file /etc/sysctl.conf
 
zdb(8) shows information about existing pools.

The default on FreeBSD 12.0-RELEASE is still
Code:
% sysctl vfs.zfs.min_auto_ashift
vfs.zfs.min_auto_ashift: 9
So before you create a new pool of native 4k (or mixed) drives, do as gkontos said.
 
Samba and NFS will run on this storage server.

I am thinking of installing FreeBSD with 'Root-on-ZFS Automatic Partitioning'.

For the swap size, how much would be recommended if I have 96GB of RAM?

Thank you.
 
My server has 96GB and I'm using 16GB swapsize (there's around 15% in use). But I run lots of VMs on it. It's better to initially configure more than you actually need as it's a little difficult to increase the size of the swap partition(s) once everything has been set up.

One thing to keep in mind during the install. When you select the disks, ZFS and whatnot, you can configure the size of the swap partition. Keep in mind that this is for each disk. So if you create a mirror with 2 disks and select 8GB for swap you will end up with 16GB total swap (8GB per disk), unless you opt to also mirror the swap.
 
SirDice

Following your recommendation, for 96GB of RAM is it OK to use 96GB of swap?
Or is the old rule of RAM x 2 better?
96GB RAM x 2 = 192GB swap
 
The old rule of thumb goes out the window after about 32GB of RAM. That rule was fine when machines had a lot less memory but is severe overkill nowadays.
 