Porting ROCm (the CUDA equivalent in AMD's world) would make FreeBSD viable for ML compute servers

Note: this is probably not an easy port, but the strategic value is immense.
This is the link: https://github.com/ROCm/ROCm

Motivation, benefits for FreeBSD:

Nowadays, a big chunk of servers are doing machine learning compute, even more so since LLMs (ChatGPT and friends) came onto the scene. Those run on Nvidia's CUDA. CUDA is not available for FreeBSD, period. That blocks its usage for machine learning... or did. AMD, after ignoring that market for many years, finally woke up and released a software stack for machine learning (ROCm) that is no longer terrible, as it used to be, but actually good (since v. 6.0).

Not to mention that AMD cards offer more performance at the same price than Nvidia's, and that Nvidia had a monopoly.

This is why ROCm is important, and why FreeBSD supporting it would have a big impact. Nvidia on Linux (binary blob) is utterly terrible; devs and admins would jump to FreeBSD ASAP once ROCm works there.
 
Why?
At least on FreeBSD the Nvidia drivers always 'just worked'™
Because they need to be paired to kernel versions. A kernel upgrade usually requires that driver to be upgraded as well. And this (apparently; I left Nvidia a while ago) still fails often, leaving users with no X session (oh, the horror! :) ). This is enough for the average Ubuntu desktop user to abandon the effort to get the system running again ("a blank screen", aka a fully working console).

For servers, CUDA versions are married to said Nvidia binary drivers; it's apparently so bad that distros get their unique value proposition from things like 'CUDA works with a single command to install it' (Pop!_OS)... implying it doesn't for most other distros. There's even a meme:

[Attached image: 1712692322652.png]
 
That’s great, where’s your code?
This is way beyond my skills. I'm not a C developer, and in fact have never written production code in any language. Should have started with that ;)

OTOH, there seem to be very skilled people with a dog in this fight now:
https://twitter.com/swe_zach/status/1764740277120725303


AMD having open source drivers is a big advantage. Nvidia on Linux would trivially fail coming back from S3 suspend, even if you managed to get it to install and update correctly (for desktop/laptop use).
 
Because they need to be paired to kernel versions
This is true for all graphics drivers/kernel modules in general and is usually dealt with during the package upgrade after an OS update. If you don't follow the standard procedure for OS upgrades, that's not the driver's fault.
Also: Why should we care about the "average ubuntu user" here?

I don't care if drivers are open source or not as long as THEY JUST WORK. And that's mostly true for all users that just want to get work done.
I remember the dumpster fire that ATI/AMD drivers were when they open sourced them (aka laid off basically all their driver developers and threw out an unusable carcass of a driver for others to fix). I also remember the bugfest that nouveau was (and still is).
Oh, and ALL of them (or at least the kernel module part) need to be built for the specific kernel version!

If your point is that we should have dynamic re-compilation of drivers/kernel modules at boot: I also remember the mess that DKMS was on Debian, which failed to rebuild the driver more often than not. This often led to far worse scenarios than only Xorg failing to load (boot loops due to kernel panics...). No idea if DKMS is still a thing, as I haven't used a systemd/linux OS for way over 10 years and I have no intention of doing so...


Please stop the "closed source drivers are bad"-FUD and come up with actual facts and - most importantly - a proper proposal and working code.
 
OTOH, there seem to be very skilled people with a dog in this fight now:
https://twitter.com/swe_zach/status/1764740277120725303


Geohot was the guy behind the jailbreak for the 2nd gen iPhones. I used his stuff; it worked on my 2nd gen iPod Touch back in the day.

I'm frankly interested in seeing ROCm ported to FreeBSD. But yeah, this is time-consuming even if you know how to fight past technical hurdles and write test suites. Just getting ROCm to compile without errors is bad enough; you also have to understand the circuitry on the GPU so that your stuff works correctly.
 
I remember the dumpster fire that ATI/AMD drivers were when they open sourced them (aka laid off basically all their driver developers and threw out an unusable carcass of a driver for others to fix).
This happened because companies used to dictating the rules arrived in a community with its own quality standards; they had to adapt instead of "do what we want". Fortunately, that's the reason why the AMD drivers have some degree of quality.
I also remember the bugfest that nouveau was (and still is).
You know what the issue with nouveau is, right? Power management, which is close to impossible to debug due to the Nvidia firmware and zero documentation being available, thanks to Nvidia. Nouveau is bad, yes, but given the effort involved, I would consider it a feat of strength instead.
Please stop the "closed source drivers are bad"-FUD and come up with actual facts and - most importantly - a proper proposal and working code.
Until you need drivers Nvidia doesn't want you to use, like anything older than the 474.x series.
Check the Vulkan situation with Nvidia (for example, most GPUs capable of using the 474.x series are capable of Vulkan 1.3, and of Wayland if you want it, but Nvidia knows what is good for you, right?). Meanwhile, on the AMD/Intel side, the limitation is the hardware itself.
 
Due to the lack of interest from the FreeBSD side.


Typical ideologically driven Linux user bullshit.
I'd be interested in understanding what causes you to think FreeBSD users are uninterested in CUDA support. My interactions with users in scientific computing and enterprises indicate a strong interest. Thanks!
 
I'd be interested in understanding what causes you to think FreeBSD users are uninterested in CUDA support.
My impression is that the project failed to communicate any interest to Nvidia in 2008-2010, when CUDA was relatively unproven technology. Then there is the fact that CUDA was broken under Linux emulation for years. To be fair, the Linux emulation layer itself wasn't in great shape until 2016 or so (until x86_64 support), but that was still 8 years ago. And it's not like the missing pieces were difficult to diagnose.

My interactions with users in scientific computing and enterprises indicate a strong interest.
 
Thanks for this. My heavy involvement in FreeBSD is pretty recent. This backstory is super helpful.

 
Nowadays, a big chunk of servers are doing machine learning compute, even more so since LLMs (ChatGPT and friends) came onto the scene. Those run on Nvidia's CUDA. CUDA is not available for FreeBSD, period. That blocks its usage for machine learning... or did. AMD, after ignoring that market for many years, finally woke up and released a software stack for machine learning (ROCm) that is no longer terrible, as it used to be, but actually good (since v. 6.0).


But most of that software is actually written in CUDA. Even if ROCm were a complete competitor to it (which it isn't), you still don't have that software ported.

AMD also added a couple of gotchas, such as ROCm only working on a very small selection of cards, whereas CUDA runs on pretty much any reasonable desktop card.
 