Why would `make world` slow down with dual channel RAM?

kldload smbus smb?
What I have in my kernel:
Code:
> kldstat -vv | grep smb | grep -v ich | grep -v acpi
                212 smbus/smb
                211 nexus/smbios
                125 smbus/jedec_dimm
                118 iicsmb/smbus
                117 iicbus/iicsmb
"ichsmb" is intel stuff, not applicable to AMD. "nexus/smbios" no idea either. The others should be there.
There are 2 modules for AMD systems: kldload amdpm amdsmb
Do you have any smb devices showing up? ls /dev/smb*

Things like memory organisation (number of ranks, device width, etc.) also influence how well they'll operate together, aside from timings.

EDIT: I have edited my message a bit. Be sure to re-read.

I loaded amdpm, no change. I don't have amdsmb. I see it referenced in src, though.

ETA: found it, had typo. But didn't help.
 
Yup, of no help then I guess. Last check, no need to answer because I'm 99.9% certain it's fine: you are running the tool as root, right? It doesn't work for me as a non-root user.
Unless that laptop has an internal IPMI (easily checked with ipmitool) it'll be no dice. sysutils/cpu-x also doesn't show detailed memory information.
 
I don't know how the Steam bench works, but does it use a large enough memory range that it reliably defeats cache efficiency? If not, differences in access patterns could explain the different results between the Steam bench and `make world`. Possibly not applicable these days, but long ago I encountered a case where quite small differences in memory timing affected performance (adding a SIMM made the PC slow, and replacing it with another unit of the same product fixed the issue). Yes, it is a quite old story, from before DIMMs even appeared. It could have been caused by the trace layout of the memory bus on the motherboard, though.
 

It's Stream, not Steam.
I kind of trust it because it reacted to dual channel in the expected way. It doesn't seem to care too much about timings.
 
One thing comes to mind.
The Stream bench reads in sequence. Compiling is random access. Maybe you need to write your own memory test code that does a "random" memory walk, using the same seed so runs are comparable. I trust you can type out such a little thing in no time.

The reason for this is that dual-channel mode is faster for sequential reads but not necessarily for random access.
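
If you want to try that, a minimal sketch of such a walk could look like the following (buffer size, iteration count and seed are arbitrary choices here, and error handling is kept to a bare minimum):
Code:
/* memwalk.c - rough random-access memory walk with a fixed seed.
 * Sketch only: buffer size and iteration count are arbitrary picks.
 * Build: cc -O2 -o memwalk memwalk.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUFSZ (1UL << 30)   /* 1 GiB, far bigger than any cache */
#define ITERS (1UL << 26)   /* number of random single-byte reads */

int
main(void)
{
    unsigned char *buf;
    unsigned long i, sum = 0;
    struct timespec t0, t1;
    double secs;

    if ((buf = malloc(BUFSZ)) == NULL) {
        perror("malloc");
        return (1);
    }
    memset(buf, 1, BUFSZ);   /* fault all pages in before timing */

    srandom(42);             /* fixed seed: identical walk every run */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        sum += buf[(size_t)random() % BUFSZ];  /* one cache line per read */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lu random reads in %.2f s, %.1f Mreads/s (sum %lu)\n",
        ITERS, secs, ITERS / secs / 1e6, sum);
    free(buf);
    return (0);
}

Run it once per memory configuration; the fixed seed keeps the access pattern identical between runs, so the reads/s figures are directly comparable.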
 

I could use my SSD seek benchmark, just on an mmap()ed area instead of a file or raw device.
 
I could use my SSD seek benchmark
Good idea.

I saw such things a few years ago; investigation on some desktop machines and notebooks showed that the scheme is like this:

DATAPATH: [cpu (calculating) ---> memory controller (reordering) ---> memory modules ---> mem. controller (reordering) ---> cpu (calc)] This is good for large blocks of data accessed in linear sequence (e.g. games).
DATAPATH: [cpu (calculating) ---> mem. controller (no time spent on reordering, just write) ---> mem. module ---> ... ] This is good for tasks with small blocks and random access (some archivers, for example).

Of course, as in the SMT/HT case, performance depends on the task.
 
I could use my SSD seek benchmark, just on an mmap()ed area instead of a file or raw device.
That might work, or it might not. The granularity of the SSD benchmark is likely a block (512 bytes), while we are looking at L1/L2/L3 cache line sizes. The speed-up of dual channel comes from reading neighboring locations, so you read L3 block #n and get #n+1 (almost) for free. At least the row addressing delays in the memory module are skipped there.

A crazy random pattern of tree walks, accessing no more than 32 or 64 bytes at a time, will most likely not profit from that but instead suffer additional wait states from the dual-channel logic. It's been a long time since I dipped my toes into that, so take it with a grain of salt; technology may have changed since then.
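
To check whether the access size matters as suspected, one could replay the same random offsets over an mmap()ed anonymous region, once reading 64 bytes per hit and once 512 bytes; a rough sketch (region size and read count are arbitrary):
Code:
/* granularity.c - random reads of 64 B vs 512 B chunks over an mmap()ed
 * anonymous region. Sketch only: region size and read count are arbitrary.
 * Build: cc -O2 -o granularity granularity.c
 */
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define REGION (1UL << 30)   /* 1 GiB */
#define READS  (1UL << 24)

static unsigned long
walk(unsigned char *mem, size_t chunk)
{
    unsigned long i, sum = 0;
    size_t off, j;

    srandom(42);    /* same seed -> same offsets for both chunk sizes */
    for (i = 0; i < READS; i++) {
        off = ((size_t)random() % (REGION / chunk)) * chunk;
        for (j = 0; j < chunk; j += 64)   /* one load per cache line */
            sum += mem[off + j];
    }
    return (sum);
}

int
main(void)
{
    unsigned char *mem;
    size_t chunk;
    struct timespec t0, t1;
    double secs;

    mem = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_PRIVATE, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return (1);
    }
    memset(mem, 1, REGION);   /* fault everything in before timing */

    for (chunk = 64; chunk <= 512; chunk *= 8) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        unsigned long sum = walk(mem, chunk);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("chunk %4zu B: %.2f s (sum %lu)\n", chunk, secs, sum);
    }
    return (0);
}

If dual channel mostly pays off for neighboring lines, the 512-byte runs should gain more from the second channel than the 64-byte runs.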
 
IIRC memtest is also doing some random patterns during a test cycle.
It also shows the current and average L1/2/3 and memory speeds - so if there's a difference between sequential and random access it might show in those numbers, or if there's something weird going on with partial dual-channel and single-channel access as was suggested with 'asymmetrical' memory sizes.

(the SSD benchmark is still a neat idea - I just thought I'd point at the obvious solution, which one might lose sight of while figuring out a neat hack. At least that's what often happens to me...)
 
New run with different things in the DIMM slot. hw.physmem=32G
Code:
step1a-dimm-none.log:     1:38:11 5891.61 real 87696.15 user 3794.37 sys 1552% CPU 164077/375876811 faults
step2a-dimm-16gb2666.log: 1:41:54 6114.02 real 92023.45 user 3189.69 sys 1557% CPU 121769/374617851 faults
step3a-dimm-16gb3200.log: 1:47:54 6474.81 real 97003.19 user 3335.20 sys 1549% CPU 128338/374642879 faults
step4a-dimm-32gb3200.log: 1:34:19 5659.53 real 84759.50 user 3099.47 sys 1552% CPU 121760/374626641 faults
And Stream results:
Code:
step1a-dimm-none.stream:    Triad:      12374.8597       0.0039       0.0039       0.0043
step1b-dimm-none.stream:    Triad:      12374.0991       0.0039       0.0039       0.0039
step2a-dimm-16gb2666.stream:Triad:      20227.7295       0.0024       0.0024       0.0035
step2b-dimm-16gb2666.stream:Triad:      20158.8657       0.0025       0.0024       0.0035
step3a-dimm-16gb3200.stream:Triad:      20185.1406       0.0024       0.0024       0.0035
step3b-dimm-16gb3200.stream:Triad:      20470.4211       0.0024       0.0023       0.0035
step4a-dimm-32gb3200.stream:Triad:      17297.5850       0.0028       0.0028       0.0030
step4b-dimm-32gb3200.stream:Triad:      15727.4113       0.0032       0.0031       0.0032

I conclude:
  • Dual channel works as expected according to Stream
  • The 32 GB module is just "better" somehow
  • Both 16 GB modules have some sort of "problem"
  • What the Lenovo support person said seems correct - partial dual channel on the 32 GB module
 
Here are the timings and the bandwidth according to memtest.

The bandwidth confirms that the 32 GB module does not run in dual channel. Why `make world` is faster on it remains unexplained.

(attached: three photos of the memtest timing and bandwidth screens)
 
Is /usr/obj on ZFS? If yes, then even though it's using a single channel, it may be avoiding the slower SSD accesses because more RAM is available for the ARC.

Compare the ARC stats of your various configurations using systat -z.
 

Sorry, no ZFS involved. The 48 GB configuration had hw.physmem=32G for `make world`.
 
Why is your memory speed so low? Did you check the CPU for overheating and thermal throttling?

@3200 you should get 25GB/s single channel or 44-45GB/s dual channel
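(For reference, that follows from the bus width: DDR4-3200 does 3200 MT/s over a 64-bit channel, so 3200 × 8 bytes ≈ 25.6 GB/s theoretical per channel and 51.2 GB/s for two channels, of which the quoted 44-45 GB/s is roughly what real hardware achieves.)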

Well, thermals have been tight in this test, witness the fact that this laptop gets faster with the case not screwed tight. But the last runs are with case open and CPU temp < 65 C.

I plan to repeat this line of testing with a desktop. Just waiting for my in-memory seekbench.
 
It is starting to look like cheap, slow RAM.
The "unknown" in the SPD tells me they don't have enough pride in their RAM to even put their name in it.
 

But it's all three. Crucial, Lenovo and G.Skill. Maybe they don't know what they put in and are just being honest :)

Seriously, now that you mention it, that is a bit odd. Dmidecode also doesn't have that info.
 
Look up how single-rank and dual-rank RAM (chips) performs, as that's likely part of what you're seeing.
As for timings, they're set to the lowest common denominator at a given frequency; since part of the memory is soldered, you're very likely looking at JEDEC timings.
A 2666 MHz stick will for sure perform worse than 3200 MHz; the soldered memory is 3200 MHz, and for whatever reason the BIOS runs everything at 2666 MHz, either because the stick reports incompatible timings or because it simply only supports 2666 MHz at best.

`make world` is likely faster because it can utilize more memory (including caching); if you log memory usage you'll likely see more memory in use overall during the process. You have 16 threads at your disposal, so more than 2 GB per process will certainly help the build overall.
 
As for benchmarking without running "make buildworld", the built-in benchmark in 7-zip probably gives you a good enough idea of overall computing performance with different sticks.

FWIW, I get ~15500 running the Stream benchmark on my laptop with 8 GB built in and a 32 GB stick (Intel 11th gen) on FreeBSD in a Hyper-V VM.
 
I tend to agree now that there is something screwed up with this particular laptop.

I have the T14 gen 1 AMD (in this thread) and a T14s gen 1 AMD; both have identical Ryzen 7 PRO 4750U CPUs with PC3200 DDR4 RAM. I bought the two of them so that I could benchmark Linux against FreeBSD without rebooting.

But a base comparison of `make world` shows the T14s much faster:
Code:
t14.log:  1:52:54 6774.04 real 51593.02 user 1498.92 sys 783% CPU 146956/360062078 faults
t14s.log: 1:38:47 5927.77 real 44797.57 user 1237.49 sys 776% CPU 156087/359931383 faults

(this is a different FreeBSD build and drive, not comparable to earlier runs)

Both are limited to 16 GB RAM. I have no explanation for the different number of major page faults.

I hate laptops. They have limited utility for precise benchmarking. AMD also has no tool like i7z, so I can't even check the clock speed they are running at :mad:
 
Both with identical Userland/kernel?

Yes, this is the same drive for both runs (M.2 NVMe in an SSK USB enclosure).

I wonder whether I should open the T14 case again :D But seriously, I'll re-paste the heatsink on that one this weekend, as rightfully recommended.
 
I don't think there is, to be honest. As long as you don't have identical hardware, the results between the two just show differences between the products, which may very well be the case here. Given the difference in hardware design it's not surprising that they may have different TDP limits, thermal characteristics, etc. I guess you could first check whether both are running in "Performance mode", but that only pushes the power limits to the max of what the manufacturer has defined.

https://www.youtube.com/watch?v=M8RTMaHxnkc


Looking a bit more shows that the T14 indeed runs hotter in general and seems to clock lower too.

AMD does have performance tools, but FreeBSD's support for Ryzen is sparse. You can check the clock frequency using the sysctls dev.cpu.0.freq_levels and dev.cpu.0.freq.
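
To watch for throttling over a whole build, a tiny poller around that sysctl is enough; a minimal sketch (assumes cpufreq(4) has attached so dev.cpu.0.freq is present, and the one-second interval is an arbitrary choice):
Code:
/* freqwatch.c - print the current CPU frequency once a second.
 * Minimal sketch; assumes cpufreq(4) attached so dev.cpu.0.freq exists.
 * Build: cc -o freqwatch freqwatch.c
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int mhz;
    size_t len;

    for (;;) {
        len = sizeof(mhz);
        if (sysctlbyname("dev.cpu.0.freq", &mhz, &len, NULL, 0) == -1) {
            perror("sysctlbyname");
            return (1);
        }
        printf("%d MHz\n", mhz);
        fflush(stdout);
        sleep(1);
    }
}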

...and in case you're wondering, my Intel 11th-gen laptop has quite aggressive power limits, so I'm nowhere near the "top performance" the CPU can do.

Edit: You might be able to pull out some information using https://www.freshports.org/sysutils/RyzenAdj/
 