Why would `make world` slow down with dual channel RAM?

kldload smbus smb?
What I have in my kernel:
Code:
> kldstat -vv | grep smb | grep -v ich | grep -v acpi
                212 smbus/smb
                211 nexus/smbios
                125 smbus/jedec_dimm
                118 iicsmb/smbus
                117 iicbus/iicsmb
"ichsmb" is intel stuff, not applicable to AMD. "nexus/smbios" no idea either. The others should be there.
There are 2 modules for AMD systems: kldload amdpm amdsmb
Do you have any smb devices showing up? ls /dev/smb*
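Besides the /dev/smb* nodes (which the userland tools need), the jedec_dimm(4) instances normally have to be wired up with hints before any SPD data shows up in sysctl. A rough sketch of what that looks like, if I remember the jedec_dimm(4) examples correctly; the bus number, the 0xa0 address and the slot label below are placeholders, not values from this thread:
Code:
# /boot/device.hints (illustrative values only)
hint.jedec_dimm.0.at="smbus0"
hint.jedec_dimm.0.addr="0xa0"
hint.jedec_dimm.0.slotid="DIMM 0"
After a reboot (or re-loading the modules) the DIMM details should then appear under the dev.jedec_dimm sysctl tree.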

Things like memory organisation (number of ranks, device width, etc.) also influence how well the modules will operate together, aside from timings.

EDIT: I have edited my message a bit. Be sure to re-read.

I loaded amdpm, no change. I don't have amdsmb. I see it referenced in src, though.

ETA: found it, I had a typo. But it didn't help.
 
Yup, of no help then I guess. Last check, no need to answer because I'm 99.9% certain it's fine: you are running the tool as root, right? It doesn't work for me as a non-root user.
Unless that laptop has an internal IPMI (easily checked with ipmitool) it'll be no dice. sysutils/cpu-x also doesn't show detailed memory information.
 
I don't know how the Steam bench works, but does it use a memory range large enough to reliably "kill" cache efficiency? If not, differences in access patterns could explain the different results between the Steam bench and `make world`. Possibly not applicable nowadays, but long ago I encountered a case where quite a small difference in memory timing affected performance: adding a SIMM made the PC slow, and replacing it with the same product, just a different unit, fixed the issue. Yes, it is quite an old story, from before DIMMs even appeared. It could have been caused by the design (trace patterns) of the memory bus on the motherboard, though.
 
I don't know how the Steam bench works, but does it use a memory range large enough to reliably "kill" cache efficiency? If not, differences in access patterns could explain the different results between the Steam bench and `make world`. Possibly not applicable nowadays, but long ago I encountered a case where quite a small difference in memory timing affected performance: adding a SIMM made the PC slow, and replacing it with the same product, just a different unit, fixed the issue. Yes, it is quite an old story, from before DIMMs even appeared. It could have been caused by the design (trace patterns) of the memory bus on the motherboard, though.

It's Stream, not Steam.
I kind of trust it because it reacted to dual channel in the expected way. It doesn't seem to care too much about timings.
 
One thing comes to mind.
The Stream bench reads in sequence. Compiling is random access. Maybe you need to write your own memory test code that does a "random" memory walk, with the same seed on every run for comparison. I trust you can type out such a little thing in no time.

The reason for this is that dual channel mode is faster for sequential reads but not necessarily for random access.
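A minimal sketch of such a random-walk test, in case it helps; nothing here is from the thread, and the buffer size, step count and seed are arbitrary assumptions. It builds one big pointer cycle through a buffer much larger than L3 and then chases it, so every load depends on the previous one:
Code:
/* rwalk.c -- hypothetical random memory walk (pointer chase).
 * Build: cc -O2 -o rwalk rwalk.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NSLOTS  (64UL * 1024 * 1024)    /* 64M pointers = 512 MB, >> L3 */
#define NSTEPS  (256UL * 1024 * 1024)   /* number of dependent loads    */

int
main(void)
{
    void **slots = malloc(NSLOTS * sizeof(void *));
    size_t *perm = malloc(NSLOTS * sizeof(size_t));
    size_t i, j, tmp;
    struct timespec t0, t1;

    if (slots == NULL || perm == NULL)
        return (1);

    srandom(12345);                     /* same seed for every run */

    /* Fisher-Yates shuffle of the slot indices ... */
    for (i = 0; i < NSLOTS; i++)
        perm[i] = i;
    for (i = NSLOTS - 1; i > 0; i--) {
        j = (size_t)random() % (i + 1);
        tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
    }
    /* ... turned into a single pointer cycle through the buffer. */
    for (i = 0; i < NSLOTS; i++)
        slots[perm[i]] = &slots[perm[(i + 1) % NSLOTS]];

    void **p = &slots[perm[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < NSTEPS; i++)
        p = *p;                         /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f ns per dependent load (%p)\n", sec * 1e9 / NSTEPS, (void *)p);
    return (0);
}
Because of the load-to-load dependency this measures memory latency rather than bandwidth, which is closer to what pointer-heavy compile workloads see than Stream's sequential sweeps.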
 
One thing comes to mind.
The Stream bench reads in sequence. Compiling is random access. Maybe you need to write your own memory test code that does a "random" memory walk, with the same seed on every run for comparison. I trust you can type out such a little thing in no time.

The reason for this is that dual channel mode is faster for sequential reads but not necessarily for random access.

I could use my SSD seek benchmark, just on an mmap()ed area instead of a file or raw device.
 
I could use my SSD seek benchmark
Good idea.

I saw such things a few years ago; investigation with some desktop machines and notebooks showed that the scheme is like this:

DATAPATH: [cpu (calculating) ---> memory controller (reordering) ---> memory modules ---> mem. controller (reordering) ---> cpu (calc)] This is good for large blocks of data in a linear sequence (e.g. games).
DATAPATH: [cpu (calculating) ---> mem. controller (no time spent on reordering, just write) ---> mem. module ---> ... ] This is good for tasks with small blocks and random access (some archivers, for example).

Of course, as in the SMT/HT case, performance depends on the task.
 
I could use my SSD seek benchmark, just on an mmap()ed area instead of a file or raw device.
That might work, or not. The granularity of the SSD benchmark is likely a block (512 bytes), while we are looking at L1/L2/L3 cache line sizes. The speed-up of dual channel comes from reading at neighboring locations, so you read L3 block #n and get #n+1 (almost) for free. At least the row addressing delays in the memory module will be skipped there.

A crazy random pattern of tree walks, accessing no more than 32 or 64 bytes at once, will most likely not profit from that but suffer additional wait states from the dual channel logic. It's been a long time since I dipped my toes into that, so take it with a grain of salt. Technology could have changed since then.
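For the granularity point, here is a variant of the mmap() idea that touches one (assumed 64-byte) cache line per access; again just a sketch with made-up sizes, not the actual SSD seek benchmark:
Code:
/* mmwalk.c -- hypothetical random cache-line reads over an mmap()ed area.
 * Build: cc -O2 -o mmwalk mmwalk.c
 */
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define REGION  (1UL << 30)             /* 1 GB anonymous region, >> L3 */
#define LINE    64UL                    /* assumed cache line size      */
#define NLOADS  (64UL * 1024 * 1024)

int
main(void)
{
    char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_PRIVATE, -1, 0);
    struct timespec t0, t1;
    volatile unsigned long sink = 0;
    size_t i, nlines = REGION / LINE;

    if (buf == MAP_FAILED)
        return (1);
    memset(buf, 1, REGION);             /* fault in every page up front */

    srandom(12345);                     /* fixed seed, comparable runs  */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < NLOADS; i++) {
        size_t line = (size_t)random() % nlines;    /* pick a random line */
        sink += *(unsigned long *)(buf + line * LINE);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f ns per random line read (sink %lu)\n",
        sec * 1e9 / NLOADS, sink);
    return (0);
}
Unlike the pointer chase above, these loads are independent, so the CPU can overlap several of them; comparing both against a sequential sweep of the same buffer should show how much each access pattern actually gets out of dual channel.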
 
IIRC memtest also does some random patterns during a test cycle.
It also shows the current and average L1/L2/L3 and memory speeds - so if there's a difference between sequential and random access it might show in those numbers, or if there's something weird going on with partial dual-channel and single-channel access, as was suggested with 'asymmetrical' memory sizes.

(The SSD benchmark is still a neat idea - I just thought I'd point at the obvious solution, which one might lose sight of while figuring out a neat hack. At least that's what often happens to me...)
 
New run with different things in the DIMM slot. hw.physmem=32G
Code:
step1a-dimm-none.log:     1:38:11 5891.61 real 87696.15 user 3794.37 sys 1552% CPU 164077/375876811 faults
step2a-dimm-16gb2666.log: 1:41:54 6114.02 real 92023.45 user 3189.69 sys 1557% CPU 121769/374617851 faults
step3a-dimm-16gb3200.log: 1:47:54 6474.81 real 97003.19 user 3335.20 sys 1549% CPU 128338/374642879 faults
step4a-dimm-32gb3200.log: 1:34:19 5659.53 real 84759.50 user 3099.47 sys 1552% CPU 121760/374626641 faults
And Stream results:
Code:
step1a-dimm-none.stream:    Triad:      12374.8597       0.0039       0.0039       0.0043
step1b-dimm-none.stream:    Triad:      12374.0991       0.0039       0.0039       0.0039
step2a-dimm-16gb2666.stream:Triad:      20227.7295       0.0024       0.0024       0.0035
step2b-dimm-16gb2666.stream:Triad:      20158.8657       0.0025       0.0024       0.0035
step3a-dimm-16gb3200.stream:Triad:      20185.1406       0.0024       0.0024       0.0035
step3b-dimm-16gb3200.stream:Triad:      20470.4211       0.0024       0.0023       0.0035
step4a-dimm-32gb3200.stream:Triad:      17297.5850       0.0028       0.0028       0.0030
step4b-dimm-32gb3200.stream:Triad:      15727.4113       0.0032       0.0031       0.0032

I conclude:
  • Dual channel works as expected according to Stream
  • The 32 GB module is just "better" somehow
  • Both 16 GB modules have some sort of "problem"
  • What the Lenovo support person said seems correct - partial dual channel on the 32 GB module
 
Here are the timings and the bandwidth according to memtest.

The bandwidth confirms that the 32 GB module does not run in dual channel. Why `make world` is faster on it remains unexplained.

(Attached photos of the memtest screens: PXL_20241120_174728892._small.jpg, PXL_20241120_173834444._small.jpg, PXL_20241120_165815381._small.jpg)
 
Here are the timings and the bandwidth according to memtest.

The bandwidth confirms that the 32 GB module does not run in dual channel. Why `make world` is faster on it remains unexplained.
Is /usr/obj on ZFS? If yes, then even though it's using a single channel, it avoids the slower SSD accesses because more RAM is available for the ARC.

Compare the arcstats of your various configurations using systat -z.
 
Is /usr/obj on ZFS? If yes, then even though it's using a single channel, it avoids the slower SSD accesses because more RAM is available for the ARC.

Compare the arcstats of your various configurations using systat -z.

Sorry, no ZFS involved. The 48 GB configuration had hw.physmem=32 GB for `make world`.
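For reference, capping the usable RAM this way is normally done with the hw.physmem loader tunable; a sketch of the corresponding /boot/loader.conf line, with the 32G value taken from the posts above:
Code:
# /boot/loader.conf
hw.physmem="32G"   # ignore physical memory above 32 GB for these runs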
 