Networking much better in 13.1 (upgraded from 12.2)

Our FreeBSD router was getting taxed at 15 Gbps -- it started dropping packets, we got customer complaints, etc.

I upgraded to 13.1, turned on hyperthreading, and upped the hw.cxgbe queues from 8 to 16, and the box is working fine. I'm not sure if this is solely due to the hyperthreading or if the reworked networking stack in 13.1 is that much better. Posting the good news here for others who may have throughput issues.
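For reference, these knobs live in /boot/loader.conf. A minimal sketch of the sort of thing we set (verify the tunable names against cxgbe(4) and smp(4) on your release -- older releases split the queue tunables out per link speed, e.g. hw.cxgbe.nrxq10g):

```
# /boot/loader.conf -- sketch only, check names against cxgbe(4) on your release
machdep.hyperthreading_allowed="1"  # re-enable SMT (we had it off per the calomel advice)
hw.cxgbe.nrxq="16"                  # rx queues per port, up from 8
hw.cxgbe.ntxq="16"                  # tx queues per port, up from 8
```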

Also interesting: in the attached graphs, note that interrupts tanked when bandwidth and load went up in FreeBSD 12 -- that is when we started to get packet loss.

Background: three Chelsio cards, dual 10-core CPUs
t6nex0: <Chelsio T62100-LP-CR> numa-domain 0 on pci10
t5nex0: <Chelsio T540-LP-CR> numa-domain 1 on pci14
t5nex1: <Chelsio T540-LP-CR> numa-domain 1 on pci16
CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
We have a chelsio_affinity script that is 'NUMA aware' and binds the interrupts to different CPU cores. Without hyperthreading, we could only support 8 queues per port; with hyperthreading, we can do 16 queues (a power of 2 and less than the number of cores on a CPU).
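For the curious, here is a minimal sketch of the idea behind such a script. This is hypothetical, not our actual chelsio_affinity -- the IRQ numbers and domain layout below are made up, and in practice you would parse the NIC's interrupt vectors out of `vmstat -ai` rather than hard-code them:

```shell
#!/bin/sh
# Hypothetical sketch of NUMA-aware interrupt pinning (not the real script).
# Each queue's IRQ is bound to a core in the NIC's NUMA domain, wrapping
# with modulo so the queue count can exceed the core count.
FIRST_CPU=${FIRST_CPU:-0}   # first core ID of the NIC's NUMA domain
NCPUS=${NCPUS:-16}          # cores (incl. SMT threads) in that domain

bind_irq() {
    # $1 = IRQ number, $2 = queue index. Prints the cpuset(1) command
    # so you can eyeball it first -- pipe the output to sh to apply.
    cpu=$((FIRST_CPU + $2 % NCPUS))
    echo "cpuset -l $cpu -x $1"
}

# Example: pin four queue vectors (IRQs 100-103 are placeholders)
q=0
for irq in 100 101 102 103; do
    bind_irq "$irq" "$q"
    q=$((q + 1))
done
```

The modulo wrap is what makes "16 queues, pinned within one domain" work without ever landing a queue on a core from the other CPU.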

We had followed the https://calomel.org/freebsd_network_tuning.html advice to not use hyperthreading on FreeBSD 12.
 

Attachments

  • Upgrade-122-to-13.gif (68.9 KB)
Why did you choose to put the cards on different NUMA domains?
I put my three Chelsio T540s on one domain and all my NVMe on the other.
I would expect a performance penalty if threads have to jump CPUs. QPI maxes out at what? Plus overhead.
 
What is your testing platform?
Without hyperthreading, we could only support 8 queues per port; with hyperthreading, we can do 16 queues
So I guess this is the advantage of using both CPUs' PCIe slots (NUMA domains)? You get more queues?
 
Why did you choose to put the cards on different NUMA domains?
I put my three Chelsio T540s on one domain and all my NVMe on the other.
I would expect a performance penalty if threads have to jump CPUs. QPI maxes out at what? Plus overhead.
Two T5 cards in one domain, one T6 in the other. We were maxing out the cores in one CPU (htop would show CPUs 11-20 pegged).

This box is just a router, no disk activity, so spreading the queue processing across CPUs is our goal.
 
Which software did you use to make the packet loss graph ?
I didn't post a packet loss graph; rather, the last graph is context switches / interrupts per second. When load was high, interrupts stopped being processed as quickly. If you divide bandwidth by interrupts, we were getting 'more bandwidth per interrupt', but there was also loss. Latency was not as good either.
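To illustrate the arithmetic (the figures below are invented for the example, not read off our graphs):

```shell
#!/bin/sh
# Illustrative only: bits carried per interrupt = throughput / interrupt rate.
bits_per_intr() {
    # $1 = throughput in bits/sec, $2 = interrupts/sec
    awk -v bps="$1" -v irq="$2" 'BEGIN { printf "%.0f\n", bps / irq }'
}

bits_per_intr 15000000000 50000   # e.g. 15 Gbps at 50k intr/s -> 300000
```

The point is just that a rising bits-per-interrupt figure under load, combined with the loss, was our hint that interrupt processing was falling behind.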

How did we detect loss? Our customer support queue! ;)

The tool that we used to verify was just plain old ping. At peak times, traffic through that router was lossy. Turning off our Amazon peer (it turns out a lot of traffic comes from them) made the traffic drop and pings return at 100%. This router is connected to an IX, and we had just started peering with Amazon -- that bumped traffic through the IX from 10 Gbps up to 15 Gbps.
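A throwaway sketch of how you can watch that with nothing but ping (the sed pattern assumes FreeBSD ping's "X.X% packet loss" summary line, and the target address is a placeholder):

```shell
#!/bin/sh
# Extract the loss percentage from ping -q output read on stdin.
loss_pct() {
    sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Usage (commented out here; 192.0.2.1 is a placeholder address):
# ping -q -c 100 192.0.2.1 | loss_pct
```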
 
I upgraded to 13.1, turned on hyperthreading, and upped the hw.cxgbe queues from 8 to 16, and the box is working fine. I'm not sure if this is solely due to the hyperthreading or if the reworked networking stack in 13.1 is that much better.

So you changed four variables at once:
- upgraded to 13.1
- turned on hyperthreading
- locked the NICs to certain NUMA domains and bound their interrupts to certain CPU cores
- tuned the hw.cxgbe queues

Please show your measurements from before and after turning hyperthreading on.

How did you find the exact source of your packet loss problem?

I'm asking because traffic is constantly growing, and in certain cases (national holidays, worldwide sporting events, local wars, nationwide disasters, major movie releases, worldwide technology events, growing video traffic from the next COVID-19 wave, etc.) you may hit the same packet loss problem again...
Right now +30% of total traffic gives you packet loss, but what happens in the future, when a peak reaches +50...+70% and you can't pull the router out of service?
 