Why would `make world` slow down with dual channel RAM?

kldload smbus smb?
What I have in my kernel:
Code:
> kldstat -vv | grep smb | grep -v ich | grep -v acpi
                212 smbus/smb
                211 nexus/smbios
                125 smbus/jedec_dimm
                118 iicsmb/smbus
                117 iicbus/iicsmb
"ichsmb" is intel stuff, not applicable to AMD. "nexus/smbios" no idea either. The others should be there.
There are 2 modules for AMD systems: kldload amdpm amdsmb
Do you have any smb devices showing up? ls /dev/smb*

Things like memory organisation (number of ranks, device width, etc.) also influence how well they'll operate together, aside from timings.

EDIT: I have edited my message a bit. Be sure to re-read.

I loaded amdpm, no change. I don't have amdsmb. I see it referenced in src, though.

ETA: found it, had typo. But didn't help.
 
Yup, of no help then I guess. Last check, no need to answer because I'm 99.9% certain it's fine: you are running the tool as root, right? It doesn't work for me as a non-root user.
Unless that laptop has an internal IPMI (easily checked with ipmitool) it'll be no dice. sysutils/cpu-x also doesn't show detailed memory information.
 
I don't know how the Steam bench works, but does it use a large enough memory range that it reliably defeats cache efficiency? If not, differences in access patterns could explain the different results between the Steam bench and `make world`. Possibly not applicable these days, but long ago I encountered a case where quite small differences in memory timing affected performance (adding a SIMM made the PC slow, and replacing it with another unit of the same product fixed the issue). Yes, it is a quite old story, from before DIMMs even appeared. It could have been caused by the trace layout of the memory bus on the motherboard, though.
 

It's Stream, not Steam.
I kind of trust it because it reacted to dual channel in the expected way. It doesn't seem to care too much about timings.
 
One thing comes to mind.
The Stream bench reads in sequence. Compiling is random access. Maybe you need to write your own memory test code that does a "random" memory walk, using the same seed so runs are comparable. I trust you can type out such a little thing in no time.

The reason for this is that dual-channel mode is faster for sequential reads but not necessarily for random access.
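
If you want to try that, a minimal sketch of such a walk could look like the following (buffer size, iteration count and seed are arbitrary choices here, and error handling is kept to a bare minimum):
Code:
/* memwalk.c - rough random-access memory walk with a fixed seed.
 * Sketch only: buffer size and iteration count are arbitrary picks.
 * Build: cc -O2 -o memwalk memwalk.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUFSZ (1UL << 30)   /* 1 GiB, far bigger than any cache */
#define ITERS (1UL << 26)   /* number of random single-byte reads */

int
main(void)
{
    unsigned char *buf;
    unsigned long i, sum = 0;
    struct timespec t0, t1;
    double secs;

    if ((buf = malloc(BUFSZ)) == NULL) {
        perror("malloc");
        return (1);
    }
    memset(buf, 1, BUFSZ);   /* fault all pages in before timing */

    srandom(42);             /* fixed seed: identical walk every run */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        sum += buf[(size_t)random() % BUFSZ];  /* one cache line per read */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lu random reads in %.2f s, %.1f Mreads/s (sum %lu)\n",
        ITERS, secs, ITERS / secs / 1e6, sum);
    free(buf);
    return (0);
}

Run it once per memory configuration; the fixed seed keeps the access pattern identical between runs, so the reads/s figures are directly comparable.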
 

I could use my SSD seek benchmark, just on an mmap()ed area instead of a file or raw device.
 
I could use my SSD seek benchmark
Good idea.

I saw such things a few years ago; investigation on some desktop machines and notebooks showed that the scheme is like this:

DATAPATH: [cpu (calculating) ---> memory controller (reordering) ---> memory modules ---> mem. controller (reordering) ---> cpu (calc)] This is good for large blocks of data accessed in linear sequence (e.g. games).
DATAPATH: [cpu (calculating) ---> mem. controller (no time spent on reordering, just write) ---> mem. module ---> ... ] This is good for tasks with small blocks and random access (some archivers, for example).

Of course, as in the SMT/HT case, performance depends on the task.
 
I could use my SSD seek benchmark, just on an mmap()ed area instead of a file or raw device.
That might work, or it might not. The granularity of the SSD benchmark is likely a block (512 bytes), while we are looking at L1/L2/L3 cache line sizes. The speed-up of dual channel comes from reading neighboring locations, so you read L3 block #n and get #n+1 (almost) for free. At least the row addressing delays in the memory module are skipped there.

A crazy random pattern of tree walks, accessing no more than 32 or 64 bytes at a time, will most likely not profit from that but instead suffer additional wait states from the dual-channel logic. It's been a long time since I dipped my toes into that, so take it with a grain of salt; technology may have changed since then.
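
To check whether the access size matters as suspected, one could replay the same random offsets over an mmap()ed anonymous region, once reading 64 bytes per hit and once 512 bytes; a rough sketch (region size and read count are arbitrary):
Code:
/* granularity.c - random reads of 64 B vs 512 B chunks over an mmap()ed
 * anonymous region. Sketch only: region size and read count are arbitrary.
 * Build: cc -O2 -o granularity granularity.c
 */
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define REGION (1UL << 30)   /* 1 GiB */
#define READS  (1UL << 24)

static unsigned long
walk(unsigned char *mem, size_t chunk)
{
    unsigned long i, sum = 0;
    size_t off, j;

    srandom(42);    /* same seed -> same offsets for both chunk sizes */
    for (i = 0; i < READS; i++) {
        off = ((size_t)random() % (REGION / chunk)) * chunk;
        for (j = 0; j < chunk; j += 64)   /* one load per cache line */
            sum += mem[off + j];
    }
    return (sum);
}

int
main(void)
{
    unsigned char *mem;
    size_t chunk;
    struct timespec t0, t1;
    double secs;

    mem = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_PRIVATE, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return (1);
    }
    memset(mem, 1, REGION);   /* fault everything in before timing */

    for (chunk = 64; chunk <= 512; chunk *= 8) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        unsigned long sum = walk(mem, chunk);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("chunk %4zu B: %.2f s (sum %lu)\n", chunk, secs, sum);
    }
    return (0);
}

If dual channel mostly pays off for neighboring lines, the 512-byte runs should gain more from the second channel than the 64-byte runs.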
 
IIRC memtest is also doing some random patterns during a test cycle.
It also shows the current and average L1/2/3 and memory speeds - so if there's a difference between sequential and random access it might show in those numbers, or if there's something weird going on with partial dual-channel and single-channel access as was suggested with 'asymmetrical' memory sizes.

(the SSD benchmark is still a neat idea - I just thought I'd point at the obvious solution, which one might lose sight of while figuring out a neat hack. At least that's what often happens to me...)
 
New run with different things in the DIMM slot. hw.physmem=32G
Code:
step1a-dimm-none.log:     1:38:11 5891.61 real 87696.15 user 3794.37 sys 1552% CPU 164077/375876811 faults
step2a-dimm-16gb2666.log: 1:41:54 6114.02 real 92023.45 user 3189.69 sys 1557% CPU 121769/374617851 faults
step3a-dimm-16gb3200.log: 1:47:54 6474.81 real 97003.19 user 3335.20 sys 1549% CPU 128338/374642879 faults
step4a-dimm-32gb3200.log: 1:34:19 5659.53 real 84759.50 user 3099.47 sys 1552% CPU 121760/374626641 faults
And Stream results:
Code:
step1a-dimm-none.stream:    Triad:      12374.8597       0.0039       0.0039       0.0043
step1b-dimm-none.stream:    Triad:      12374.0991       0.0039       0.0039       0.0039
step2a-dimm-16gb2666.stream:Triad:      20227.7295       0.0024       0.0024       0.0035
step2b-dimm-16gb2666.stream:Triad:      20158.8657       0.0025       0.0024       0.0035
step3a-dimm-16gb3200.stream:Triad:      20185.1406       0.0024       0.0024       0.0035
step3b-dimm-16gb3200.stream:Triad:      20470.4211       0.0024       0.0023       0.0035
step4a-dimm-32gb3200.stream:Triad:      17297.5850       0.0028       0.0028       0.0030
step4b-dimm-32gb3200.stream:Triad:      15727.4113       0.0032       0.0031       0.0032

I conclude:
  • Dual channel works as expected according to Stream
  • The 32 GB module is just "better" somehow
  • Both 16 GB modules have some sort of "problem"
  • What the Lenovo support person said seems correct - partial dual channel on the 32 GB module
 
Here are the timings and the bandwidth according to memtest.

The bandwidth confirms that the 32 GB module does not run in dual channel. Why `make world` is faster on it remains unexplained.

(attached: three photos of the memtest timing and bandwidth screens)
 
Is /usr/obj on ZFS? If yes, then even though it's using a single channel, it may be avoiding the slower SSD accesses because more RAM is available for the ARC.

Compare the ARC stats of your various configurations using systat -z.
 

Sorry, no ZFS involved. The 48 GB configuration had hw.physmem=32G for `make world`.
 
Why is your memory speed so low? Did you check the CPU for overheating and thermal throttling?

@3200 you should get 25GB/s single channel or 44-45GB/s dual channel
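(For reference, that follows from the bus width: DDR4-3200 does 3200 MT/s over a 64-bit channel, so 3200 × 8 bytes ≈ 25.6 GB/s theoretical per channel and 51.2 GB/s for two channels, of which the quoted 44-45 GB/s is roughly what real hardware achieves.)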

Well, thermals have been tight in this test, witness the fact that this laptop gets faster with the case not screwed tight. But the last runs are with case open and CPU temp < 65 C.

I plan to repeat this line of testing with a desktop. Just waiting for my in-memory seekbench.
 
It is starting to look like cheap, slow RAM.
The "unknown" in the SPD tells me they don't have enough pride in their RAM to even put their name in it.
 

But it's all three. Crucial, Lenovo and G.Skill. Maybe they don't know what they put in and are just being honest :)

Seriously, now that you mention it, that is a bit odd. Dmidecode also doesn't have that info.
 
Look up how single-rank and dual-rank RAM (chips) performs, as that's likely part of what you're seeing.
As for timings, they're set to the lowest common denominator at a given frequency; since part of the memory is soldered, you're very likely looking at JEDEC timings.
A 2666 MHz stick will for sure perform worse than 3200 MHz; the soldered memory is 3200 MHz, and for whatever reason the BIOS runs everything at 2666 MHz, either because the stick reports incompatible timings or because it simply only supports 2666 MHz at best.

`make world` is likely faster because it can utilize more memory (including caching); if you log memory usage you'll likely see more memory in use overall during the process. You have 16 threads at your disposal, so more than 2 GB per process will certainly help the build overall.
 
As for benchmarking without running "make buildworld", the built-in benchmark in 7-zip probably gives you a good enough idea of overall computing performance with different sticks.

FWIW, I get ~15500 running the Stream benchmark on my laptop with 8 GB built in and a 32 GB stick (Intel 11th gen) on FreeBSD in a Hyper-V VM.
 
I tend to agree now that there is something screwed up with this particular laptop.

I have the T14 gen 1 AMD (in this thread) and a T14s gen 1 AMD; both have identical Ryzen 7 PRO 4750U CPUs with PC3200 DDR4 RAM. I bought the two of them so that I could benchmark Linux against FreeBSD without rebooting.

But a base comparison of `make world` shows the T14s much faster:
Code:
t14.log:  1:52:54 6774.04 real 51593.02 user 1498.92 sys 783% CPU 146956/360062078 faults
t14s.log: 1:38:47 5927.77 real 44797.57 user 1237.49 sys 776% CPU 156087/359931383 faults

(this is a different FreeBSD build and drive, not comparable to earlier runs)

Both are limited to 16 GB RAM. I have no explanation for the different number of major page faults.

I hate laptops. They have limited utility for precise benchmarking. AMD also has no tool like i7z, so I can't even check the clock speed they are running at :mad:
 
Both with identical Userland/kernel?

Yes, this is the same drive for both runs (M.2 NVMe in an SSK USB enclosure).

I wonder whether I should open the T14 case again :D But seriously, I'll re-paste the heatsink on that one this weekend, as rightfully recommended.
 
I don't think there is, to be honest. As long as you don't have identical hardware, the results between the two just show differences between the products, which may very well be the case here. Given the difference in hardware design it's not surprising that they may have different TDP limits, thermal characteristics, etc. I guess you could first check whether both are running in "Performance mode", but that only pushes the power limits to the max of what the manufacturer has defined.

https://www.youtube.com/watch?v=M8RTMaHxnkc


Looking a bit more shows that the T14 indeed runs hotter in general and seems to clock lower too.

AMD does have performance tools, but FreeBSD's support for Ryzen is sparse. You can check the clock frequency using the sysctls dev.cpu.0.freq_levels and dev.cpu.0.freq.
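
To watch for throttling over a whole build, a tiny poller around that sysctl is enough; a minimal sketch (assumes cpufreq(4) has attached so dev.cpu.0.freq is present, and the one-second interval is an arbitrary choice):
Code:
/* freqwatch.c - print the current CPU frequency once a second.
 * Minimal sketch; assumes cpufreq(4) attached so dev.cpu.0.freq exists.
 * Build: cc -o freqwatch freqwatch.c
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int mhz;
    size_t len;

    for (;;) {
        len = sizeof(mhz);
        if (sysctlbyname("dev.cpu.0.freq", &mhz, &len, NULL, 0) == -1) {
            perror("sysctlbyname");
            return (1);
        }
        printf("%d MHz\n", mhz);
        fflush(stdout);
        sleep(1);
    }
}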

...and in case you're wondering, my Intel 11th-gen laptop has quite aggressive power limits, so I'm nowhere near the "top performance" the CPU can do.

Edit: You might be able to pull out some information using https://www.freshports.org/sysutils/RyzenAdj/
 