Why would `make world` slow down with dual channel RAM?

kldload smbus smb?
What I have in my kernel:
Code:
> kldstat -vv | grep smb | grep -v ich | grep -v acpi
                212 smbus/smb
                211 nexus/smbios
                125 smbus/jedec_dimm
                118 iicsmb/smbus
                117 iicbus/iicsmb
"ichsmb" is intel stuff, not applicable to AMD. "nexus/smbios" no idea either. The others should be there.
There are 2 modules for AMD systems: kldload amdpm amdsmb
Do you have any smb devices showing up? ls /dev/smb*
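Besides the /dev/smb* nodes (which the userland tools need), the jedec_dimm(4) instances normally have to be wired up with hints before any SPD data shows up in sysctl. A rough sketch of what that looks like, if I remember the jedec_dimm(4) examples correctly; the bus number, the 0xa0 address and the slot label below are placeholders, not values from this thread:
Code:
# /boot/device.hints (illustrative values only)
hint.jedec_dimm.0.at="smbus0"
hint.jedec_dimm.0.addr="0xa0"
hint.jedec_dimm.0.slotid="DIMM 0"
After a reboot (or re-loading the modules) the DIMM details should then appear under the dev.jedec_dimm sysctl tree.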

Things like memory organisation (number of ranks, device width, etc.) also influence how well the modules will operate together, aside from timings.

EDIT: I have edited my message a bit. Be sure to re-read.

I loaded amdpm, no change. I don't have amdsmb. I see it referenced in src, though.

ETA: found it, I had a typo. But it didn't help.
 
Yup, of no help then I guess. Last check, no need to answer because I'm 99.9% certain it's fine: you are running the tool as root, right? It doesn't work for me as a non-root user.
Unless that laptop has an internal IPMI (easily checked with ipmitool) it'll be no dice. sysutils/cpu-x also doesn't show detailed memory information.
 
I don't know how the Steam bench works, but does it use a memory range large enough to reliably "kill" cache efficiency? If not, differences in access patterns could explain the different results between the Steam bench and `make world`. Possibly not applicable nowadays, but long ago I encountered a case where quite a small difference in memory timing affected performance: adding a SIMM made the PC slow, and replacing it with the same product, just a different unit, fixed the issue. Yes, it is quite an old story, from before DIMMs even appeared. It could have been caused by the design (trace patterns) of the memory bus on the motherboard, though.
 
I don't know how the Steam bench works, but does it use a memory range large enough to reliably "kill" cache efficiency? If not, differences in access patterns could explain the different results between the Steam bench and `make world`. Possibly not applicable nowadays, but long ago I encountered a case where quite a small difference in memory timing affected performance: adding a SIMM made the PC slow, and replacing it with the same product, just a different unit, fixed the issue. Yes, it is quite an old story, from before DIMMs even appeared. It could have been caused by the design (trace patterns) of the memory bus on the motherboard, though.

It's Stream, not Steam.
I kind of trust it because it reacted to dual channel in the expected way. It doesn't seem to care too much about timings.
 
One thing comes to mind.
The Stream bench reads in sequence. Compiling is random access. Maybe you need to write your own memory test code that does a "random" memory walk, with the same seed on every run for comparison. I trust you can type out such a little thing in no time.

The reason for this is that dual channel mode is faster for sequential reads but not necessarily for random access.
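A minimal sketch of such a random-walk test, in case it helps; nothing here is from the thread, and the buffer size, step count and seed are arbitrary assumptions. It builds one big pointer cycle through a buffer much larger than L3 and then chases it, so every load depends on the previous one:
Code:
/* rwalk.c -- hypothetical random memory walk (pointer chase).
 * Build: cc -O2 -o rwalk rwalk.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NSLOTS  (64UL * 1024 * 1024)    /* 64M pointers = 512 MB, >> L3 */
#define NSTEPS  (256UL * 1024 * 1024)   /* number of dependent loads    */

int
main(void)
{
    void **slots = malloc(NSLOTS * sizeof(void *));
    size_t *perm = malloc(NSLOTS * sizeof(size_t));
    size_t i, j, tmp;
    struct timespec t0, t1;

    if (slots == NULL || perm == NULL)
        return (1);

    srandom(12345);                     /* same seed for every run */

    /* Fisher-Yates shuffle of the slot indices ... */
    for (i = 0; i < NSLOTS; i++)
        perm[i] = i;
    for (i = NSLOTS - 1; i > 0; i--) {
        j = (size_t)random() % (i + 1);
        tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
    }
    /* ... turned into a single pointer cycle through the buffer. */
    for (i = 0; i < NSLOTS; i++)
        slots[perm[i]] = &slots[perm[(i + 1) % NSLOTS]];

    void **p = &slots[perm[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < NSTEPS; i++)
        p = *p;                         /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f ns per dependent load (%p)\n", sec * 1e9 / NSTEPS, (void *)p);
    return (0);
}
Because of the load-to-load dependency this measures memory latency rather than bandwidth, which is closer to what pointer-heavy compile workloads see than Stream's sequential sweeps.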
 
One thing comes to mind.
The Stream bench reads in sequence. Compiling is random access. Maybe you need to write your own memory test code that does a "random" memory walk, with the same seed on every run for comparison. I trust you can type out such a little thing in no time.

The reason for this is that dual channel mode is faster for sequential reads but not necessarily for random access.

I could use my SSD seek benchmark, just on an mmap()ed area instead of a file or raw device.
 
I could use my SSD seek benchmark
Good idea.

I saw such things a few years ago; investigation with some desktop machines and notebooks showed that the scheme is like this:

DATAPATH: [cpu (calculating) ---> memory controller (reordering) ---> memory modules ---> mem. controller (reordering) ---> cpu (calc)] This is good for large blocks of data in a linear sequence (e.g. games).
DATAPATH: [cpu (calculating) ---> mem. controller (no time spent on reordering, just write) ---> mem. module ---> ... ] This is good for tasks with small blocks and random access (some archivers, for example).

Of course, as in the SMT/HT case, performance depends on the task.
 
I could use my SSD seek benchmark, just on an mmap()ed area instead of a file or raw device.
That might work, or not. The granularity of the SSD benchmark is likely a block (512 bytes), while we are looking at L1/L2/L3 cache line sizes. The speed-up of dual channel comes from reading at neighboring locations, so you read L3 block #n and get #n+1 (almost) for free. At least the row addressing delays in the memory module will be skipped there.

A crazy random pattern of tree walks, accessing no more than 32 or 64 bytes at once, will most likely not profit from that but suffer additional wait states from the dual channel logic. It's been a long time since I dipped my toes into that, so take it with a grain of salt. Technology could have changed since then.
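For the granularity point, here is a variant of the mmap() idea that touches one (assumed 64-byte) cache line per access; again just a sketch with made-up sizes, not the actual SSD seek benchmark:
Code:
/* mmwalk.c -- hypothetical random cache-line reads over an mmap()ed area.
 * Build: cc -O2 -o mmwalk mmwalk.c
 */
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define REGION  (1UL << 30)             /* 1 GB anonymous region, >> L3 */
#define LINE    64UL                    /* assumed cache line size      */
#define NLOADS  (64UL * 1024 * 1024)

int
main(void)
{
    char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_PRIVATE, -1, 0);
    struct timespec t0, t1;
    volatile unsigned long sink = 0;
    size_t i, nlines = REGION / LINE;

    if (buf == MAP_FAILED)
        return (1);
    memset(buf, 1, REGION);             /* fault in every page up front */

    srandom(12345);                     /* fixed seed, comparable runs  */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < NLOADS; i++) {
        size_t line = (size_t)random() % nlines;    /* pick a random line */
        sink += *(unsigned long *)(buf + line * LINE);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f ns per random line read (sink %lu)\n",
        sec * 1e9 / NLOADS, sink);
    return (0);
}
Unlike the pointer chase above, these loads are independent, so the CPU can overlap several of them; comparing both against a sequential sweep of the same buffer should show how much each access pattern actually gets out of dual channel.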
 
IIRC memtest also does some random patterns during a test cycle.
It also shows the current and average L1/L2/L3 and memory speeds - so if there's a difference between sequential and random access it might show in those numbers, or if there's something weird going on with partial dual-channel and single-channel access, as was suggested with 'asymmetrical' memory sizes.

(The SSD benchmark is still a neat idea - I just thought I'd point at the obvious solution, which one might lose sight of while figuring out a neat hack. At least that's what often happens to me...)
 
New run with different things in the DIMM slot. hw.physmem=32G
Code:
step1a-dimm-none.log:     1:38:11 5891.61 real 87696.15 user 3794.37 sys 1552% CPU 164077/375876811 faults
step2a-dimm-16gb2666.log: 1:41:54 6114.02 real 92023.45 user 3189.69 sys 1557% CPU 121769/374617851 faults
step3a-dimm-16gb3200.log: 1:47:54 6474.81 real 97003.19 user 3335.20 sys 1549% CPU 128338/374642879 faults
step4a-dimm-32gb3200.log: 1:34:19 5659.53 real 84759.50 user 3099.47 sys 1552% CPU 121760/374626641 faults
And Stream results:
Code:
step1a-dimm-none.stream:    Triad:      12374.8597       0.0039       0.0039       0.0043
step1b-dimm-none.stream:    Triad:      12374.0991       0.0039       0.0039       0.0039
step2a-dimm-16gb2666.stream:Triad:      20227.7295       0.0024       0.0024       0.0035
step2b-dimm-16gb2666.stream:Triad:      20158.8657       0.0025       0.0024       0.0035
step3a-dimm-16gb3200.stream:Triad:      20185.1406       0.0024       0.0024       0.0035
step3b-dimm-16gb3200.stream:Triad:      20470.4211       0.0024       0.0023       0.0035
step4a-dimm-32gb3200.stream:Triad:      17297.5850       0.0028       0.0028       0.0030
step4b-dimm-32gb3200.stream:Triad:      15727.4113       0.0032       0.0031       0.0032

I conclude:
  • Dual channel works as expected according to Stream
  • The 32 GB module is just "better" somehow
  • Both 16 GB modules have some sort of "problem"
  • What the Lenovo support person said seems correct - partial dual channel on the 32 GB module
 
Here are the timings and the bandwidth according to memtest.

The bandwidth confirms that the 32 GB module does not run in dual channel. Why `make world` is faster on it remains unexplained.

(Attached photos of the memtest screens: PXL_20241120_174728892._small.jpg, PXL_20241120_173834444._small.jpg, PXL_20241120_165815381._small.jpg)
 
Here are the timings and the bandwidth according to memtest.

The bandwidth confirms that the 32 GB module does not run in dual channel. Why `make world` is faster on it remains unexplained.
Is /usr/obj on ZFS? If yes, then even though it's using a single channel, it avoids the slower SSD accesses because more RAM is available for the ARC.

Compare the arcstats of your various configurations using systat -z.
 
Is /usr/obj on ZFS? If yes, then even though it's using a single channel, it avoids the slower SSD accesses because more RAM is available for the ARC.

Compare the arcstats of your various configurations using systat -z.

Sorry, no ZFS involved. The 48 GB configuration had hw.physmem=32 GB for `make world`.
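For reference, capping the usable RAM this way is normally done with the hw.physmem loader tunable; a sketch of the corresponding /boot/loader.conf line, with the 32G value taken from the posts above:
Code:
# /boot/loader.conf
hw.physmem="32G"   # ignore physical memory above 32 GB for these runs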
 