Porting ROCm (AMD's CUDA equivalent) would make FreeBSD viable for ML compute servers

shallwhack · Apr 9, 2024

Note: this is probably not an easy port, but the strategic value is immense.
This is the link: https://github.com/ROCm/ROCm

Motivation, benefits for FreeBSD:

Nowadays, a big chunk of servers are doing machine learning compute, even more so since LLMs (ChatGPT and friends) came onto the scene. Those run on Nvidia's CUDA. CUDA is not available for FreeBSD, period. That blocks its usage for machine learning... or did. AMD, after ignoring that market for many years, finally woke up and released a library for machine learning (ROCm) that is no longer terrible, as it used to be, but actually good (since v6.0).

Not to mention that AMD cards are more performant at the same price compared to Nvidia, and that Nvidia had a monopoly.

This is why ROCm is important, and why FreeBSD supporting it would have a big impact. Nvidia on Linux (binary blob) is utterly terrible; devs and admins would jump to FreeBSD ASAP once ROCm works there.

richardtoohey2 · Apr 9, 2024

That’s great, where’s your code?

shkhln · Apr 9, 2024

shallwhack said:
CUDA is not available for FreeBSD, period.

Due to the lack of interest from FreeBSD side.

shallwhack said:
Nvidia on Linux (binary blob) is utterly terrible; devs and admins would jump to FreeBSD ASAP once ROCm works there.

Typical ideologically driven Linux user bullshit.

sko · Apr 9, 2024

shallwhack said:
Nvidia on Linux (binary blob) is utterly terrible,

Why?
At least on FreeBSD the Nvidia drivers always 'just worked'™

shallwhack · Apr 9, 2024

sko said:
Why?
At least on FreeBSD the Nvidia drivers always 'just worked'™

Because they need to be paired to kernel versions. A kernel upgrade usually needs that driver to be upgraded. And this (apparently; I left Nvidia a while ago) still fails often, leaving users with no X session (oh, the horror!). This is enough for the average Ubuntu desktop user to abandon the effort to get the system running again ("a blank screen", aka a fully working console).

For servers, CUDA versions are married to said Nvidia binary drivers; it's apparently so bad that distros get their unique value proposition from things like 'CUDA works with a single command to install it' (Pop!_OS)... implying it doesn't for most other distros. There's even a meme:

shallwhack · Apr 9, 2024

richardtoohey2 said:
That’s great, where’s your code?

This is way beyond my skills. I'm not a C developer, and in fact have never written production code in any language. Should have started with that.

OTOH, there seem to be very skilled people having a dog in this fight now:

https://twitter.com/swe_zach/status/1764740277120725303

AMD having open source drivers is a big advantage. Nvidia on Linux would trivially fail coming back from S3 suspend, even if you managed to get it to install and update correctly (for desktop/laptop use).

shallwhack · Apr 9, 2024

shkhln said:
Due to the lack of interest from FreeBSD side.

Why is that?

sko · Apr 9, 2024

shallwhack said:
Because they need to be paired to kernel versions

This is true for all graphics drivers/kernel modules in general and is usually dealt with during the package upgrade after an OS update. If you don't follow the standard procedure for OS upgrades, that's not the driver's fault.
Also: Why should we care about the "average Ubuntu user" here?

I don't care if drivers are open source or not as long as THEY JUST WORK. And that's mostly true for all users that just want to get work done.
I remember the dumpster fire that ATI/AMD drivers were when they open sourced them (aka laid off basically all their driver developers and threw out an unusable carcass of a driver for others to fix). I also remember the bugfest that nouveau was (and still is).
Oh, and ALL of them (or at least the kernel module part) need to be built for the specific kernel version!
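
As an aside on the version pairing mentioned here: on Linux, each kernel module carries a `vermagic` string whose first field is the kernel release it was built for, and the kernel normally refuses to load a module whose string does not match. A minimal Python sketch of that comparison follows; the helper name `built_for_running_kernel` and the sample strings are hypothetical, chosen only for illustration.

```python
import platform


def built_for_running_kernel(vermagic: str, running_release: str) -> bool:
    """Return True if a module's vermagic names the given kernel release.

    A vermagic string looks like "6.5.0-27-generic SMP preempt mod_unload";
    its first whitespace-separated field is the kernel release.
    """
    fields = vermagic.split()
    return bool(fields) and fields[0] == running_release


# Hypothetical vermagic value; on a real Linux system it could be read
# with `modinfo -F vermagic <module>`.
sample = "6.5.0-27-generic SMP preempt mod_unload modversions"
print(built_for_running_kernel(sample, platform.release()))
```

This only compares the release field; the real check covers the whole string, so a matching release is necessary but not sufficient.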

If your point is that we should have dynamic re-compilation of drivers/kernel modules at boot: I also remember the mess that 'DKMS' was on Debian, which failed to rebuild the driver more often than not. This often led to far worse scenarios than only xorg failing to load (boot loops due to kernel panics...). No idea if DKMS is still a thing, as I haven't used a systemd/Linux OS for way over 10 years and I have no intention of doing so...

Please stop the "closed source drivers are bad"-FUD and come up with actual facts and - most importantly - a proper proposal and working code.

astyle · Apr 9, 2024

shallwhack said:
This is way beyond my skills. I'm not a C developer, and in fact have never written production code in any language. Should have started with that.

OTOH, there seem to be very skilled people having a dog in this fight now:

https://twitter.com/swe_zach/status/1764740277120725303

AMD having open source drivers is a big advantage. Nvidia on Linux would trivially fail coming back from S3 suspend, even if you managed to get it to install and update correctly (for desktop/laptop use).

Geohot was the guy behind the jailbreak for the 2nd gen iPhones. I used his stuff; it worked on my 2nd gen iPod Touch back in the day.

I'm frankly interested in seeing ROCm ported to FreeBSD. But yeah, this is time consuming even if you know how to fight past technical hurdles and write test suites. Just getting ROCm to compile without errors is bad enough; you also have to understand the circuitry on the GPU so that your stuff works correctly.

menelkir · Apr 10, 2024

sko said:
I remember the dumpster fire that ATI/AMD drivers were when they open sourced them (aka laid off basically all their driver developers and threw out an unusable carcass of a driver for others to fix).

This was due to companies being used to dictating the rules; when they arrived at a community with some quality standards, they had to adapt instead of "do what we want". Fortunately, that's the reason why AMD has some degree of quality.

sko said:
I also remember the bugfest that nouveau was (and still is).

You know what the issue with nouveau is, right? Power management, which is close to impossible to debug due to Nvidia's firmware and zero available documentation, thanks to Nvidia. Nouveau is bad, yes, but given the effort, I would consider it a feat of strength instead.

sko said:
Please stop the "closed source drivers are bad"-FUD and come up with actual facts and - most importantly - a proper proposal and working code.

Until you need drivers Nvidia doesn't want you to use, like anything older than the 474.x series.
Check the Vulkan situation with Nvidia (for example, most GPUs capable of using the 474.x series are capable of Vulkan 1.3, and Wayland if you want it, but Nvidia knows what is good for you, right?). Meanwhile, on the AMD/Intel side, the limitation is the hardware itself.

gtewallace · Apr 10, 2024

shkhln said:
Due to the lack of interest from FreeBSD side.

Typical ideologically driven Linux user bullshit.

I'd be interested in understanding what causes you to think FreeBSD users are uninterested in CUDA support. My interactions with users in scientific computing and enterprises indicate a strong interest. Thanks!

shkhln · Apr 10, 2024

gtewallace said:
I'd be interested in understanding what causes you to think FreeBSD users are uninterested in CUDA support.

My impression is that the project failed to communicate any interest to Nvidia in 2008-2010, when CUDA was a relatively unproven technology. Then there is the fact that CUDA was broken under Linux emulation for years. To be fair, the Linux emulation layer itself wasn't in great shape until 2016 or so (until x86_64 support), but that was still 8 years ago. And it's not like the missing pieces were difficult to diagnose.

gtewallace said:
My interactions with users in scientific computing and enterprises indicate a strong interest.

CUDA (forums.freebsd.org)

BlackSteel · Apr 10, 2024

ROCm requires the amdkfd kernel driver to work. It's part of the amdgpu kernel module; however, the FreeBSD port has almost all amdkfd parts removed (here: https://github.com/freebsd/drm-kmod/commit/539069afcacc47f8e52482221afc01a0e930240e).
I tried to make it at least buildable a few years ago, but there were many LinuxKPI parts missing.
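
For context, on Linux the amdkfd driver exposes a `/dev/kfd` device node that the ROCm user-space runtime opens, so its presence is a quick first check. A minimal Python sketch, assuming that Linux path (a FreeBSD port might expose something different) and a hypothetical helper name `kfd_available`:

```python
import os
import stat


def kfd_available(path: str = "/dev/kfd") -> bool:
    """Return True if `path` exists and is a character device."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing node or no permission to stat it: treat as unavailable.
        return False
    return stat.S_ISCHR(mode)


print("amdkfd device node present:", kfd_available())
```

A present node only shows that the kernel side is loaded; it says nothing about whether the installed GPU is one ROCm supports.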

gtewallace · Apr 10, 2024

shkhln said:

My impression is that the project failed to communicate any interest to Nvidia in 2008-2010, when CUDA was a relatively unproven technology. Then there is the fact that CUDA was broken under Linux emulation for years. To be fair, the Linux emulation layer itself wasn't in great shape until 2016 or so (until x86_64 support), but that was still 8 years ago. And it's not like the missing pieces were difficult to diagnose.

CUDA (forums.freebsd.org)

Thanks for this. My heavy involvement in FreeBSD is pretty recent. This backstory is super helpful.

cracauer@ · Apr 11, 2024

shallwhack said:
Note: this is probably not an easy port, but the strategic value is immense.
This is the link: https://github.com/ROCm/ROCm

Motivation, benefits for FreeBSD:

Nowadays, a big chunk of servers are doing machine learning compute, even more so since LLMs (ChatGPT and friends) came onto the scene. Those run on Nvidia's CUDA. CUDA is not available for FreeBSD, period. That blocks its usage for machine learning... or did. AMD, after ignoring that market for many years, finally woke up and released a library for machine learning (ROCm) that is no longer terrible, as it used to be, but actually good (since v6.0).

Not to mention that AMD cards are more performant at the same price compared to Nvidia, and that Nvidia had a monopoly.

This is why ROCm is important, and why FreeBSD supporting it would have a big impact. Nvidia on Linux (binary blob) is utterly terrible; devs and admins would jump to FreeBSD ASAP once ROCm works there.

But most of that software is actually written in CUDA. Even if ROCm were a complete competitor to it (which it isn't), you still wouldn't have that software ported.

AMD also put in a couple of gotchas, such as ROCm only working on a very small collection of cards, whereas CUDA runs on pretty much any reasonable desktop card.