CUDA-oxide: Nvidia's official Rust to CUDA compiler

(nvlabs.github.io)

273 points | by adamnemecek 3 hours ago

17 comments

  • debugnik 31 minutes ago
    > (em dash) no DSLs, no foreign language bindings, just Rust.

    Official CUDA port and they couldn't even bother with the introductory paragraph.

    Okay, I'll try to ignore it and read the docs. Hey a custom IR, this sounds interesti-

    > MLIR’s implementation, however, is C++ with a side of TableGen, a build system that requires you to compile all of LLVM, and debugging sessions that make you question your career choices.

    I can't take this industry seriously anymore.

    • nialv7 1 minute ago
      I think the whole codebase was more or less written by AI...
    • aiscoming 9 minutes ago
      if they didn't use AI for their webpage, people would say "why doesn't NVIDIA write its website and documentation with AI? don't they believe their own story about AI factories and employees managing thousands of agents doing the work for them?"

      this is exactly the on-brand dogfooding I would expect from an AI hyper

      • debugnik 7 minutes ago
        Literally no one would ever say that simply for editing the LLMisms away.
    • mathisfun123 20 minutes ago
      What exactly are you upset about? Someone observing that MLIR is extremely complex and dependent on LLVM...?
      • awestroke 15 minutes ago
        The quoted writing is AI slop, and OP is reacting to the fact that they did not write even the introductory text themselves (or at least bother to edit out clear AI/slop indicators)
  • arpadav 2 hours ago
    This is amazing... I've been working with custom CUDA kernels and https://crates.io/crates/cudarc for a long time, and this honestly looks like it could be a near drop-in replacement.

    I'm especially curious how build times would compare. Most Rust CUDA crates obviously rely on calling CMake or nvcc, which can make compilation painfully slow. Coincidentally, just last week I was profiling build times and found that tools like sccache can dramatically reduce rebuild times by caching artifacts, but you still end up paying for expensive custom nvcc invocations (e.g. Candle by Hugging Face calls a custom nvcc command in its kernel compilation): https://arpadvoros.com/posts/2026/05/05/speeding-up-rust-whi...

    • the__alchemist 1 hour ago
      Cudarc slaps!

      > Most Rust CUDA crates obv rely on calling CMake or nvcc, which can make compilation painfully slow.

      Anecdotally, I haven't hit this; see the `cuda_setup` crate I made to handle the build scripts. It's a simple `build.rs` which only recompiles if the kernel file changes, and the compile time is tiny compared to the Rust CPU-side code.
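
      The gist is just a rerun-if-changed guard around an nvcc call; a minimal sketch of the idea (paths and flags here are placeholders, not the actual cuda_setup source):

          // build.rs: only reruns when the kernel source changes.
          use std::process::Command;

          fn main() {
              // Cargo skips this script entirely unless the kernel file changed.
              println!("cargo:rerun-if-changed=src/kernels.cu");

              let out = format!("{}/kernels.ptx", std::env::var("OUT_DIR").unwrap());
              let status = Command::new("nvcc")
                  .args(["-ptx", "src/kernels.cu", "-o", out.as_str()])
                  .status()
                  .expect("failed to run nvcc; is the CUDA toolkit on PATH?");
              assert!(status.success(), "nvcc returned a non-zero exit code");
          }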

      • arpadav 1 hour ago
        I'll have to check this out, thanks!
    • jauntywundrkind 1 hour ago
      Do other people agree cuda-oxide looks like a near drop-in replacement for cudarc?

      That would be amazing, but imo probably not intentionally so.

      I am curious what distinguishes cuda-oxide, beyond it being totally under NV control.

      • the__alchemist 1 hour ago
        I am observing the same from the article... is it heavily inspired by Cudarc, i.e. is this intentional, or are we reading too much into this, given Cudarc is a light abstraction over the CUDA API?
      • arpadav 1 hour ago
        Perhaps not drop-in, but all my workflows with cudarc have always been "I make a CUDA kernel, I use cudarc for FFI to said kernels, I call it via Rust", which is pretty analogous here.

        Briefly looking at the repo, it looks like the main workflow uses rustc-codegen-cuda to convert Rust -> MIR -> pliron IR -> LLVM IR -> PTX, which gets embedded in the host binary; cuda-core then loads the embedded PTX onto the GPU at runtime.

        But if you aren't directly writing CUDA kernels and just want cudarc for calling existing kernels or other CUDA driver API access, then cudarc is the lighter-weight option? Or just use one of the sub-crates in this repo, like cuda-core, for those APIs.
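
        For reference, the cudarc side of that flow looks roughly like this (from memory, an older cudarc API, so treat exact method and kernel names as approximate):

            use cudarc::driver::{CudaDevice, LaunchAsync, LaunchConfig};
            use cudarc::nvrtc::Ptx;

            fn main() -> Result<(), Box<dyn std::error::Error>> {
                let dev = CudaDevice::new(0)?;
                // PTX produced ahead of time by nvcc (or embedded in the binary);
                // "mymod" and "scale" are made-up module/kernel names.
                dev.load_ptx(Ptx::from_file("kernels.ptx"), "mymod", &["scale"])?;
                let scale = dev.get_func("mymod", "scale").unwrap();

                let mut xs = dev.htod_copy(vec![1.0f32; 1024])?; // host -> device
                unsafe { scale.launch(LaunchConfig::for_num_elems(1024), (&mut xs, 1024u32)) }?;
                let out = dev.dtoh_sync_copy(&xs)?;              // device -> host
                println!("{}", out[0]);
                Ok(())
            }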

  • cyber_kinetist 2 hours ago
    I'm quite interested in how they dealt with Rust's memory model, which might not neatly map to CUDA's semantics. Curious what the differences are compared to CUDA C++, and whether Rust's type system can actually bring more safety to CUDA. (I do think writing GPU kernels is inherently unsafe; it's just too hard to create a safe language because of how the hardware works, and the fact that you're hyper-optimizing all the time.)
    • arpadav 1 hour ago
      The main 4 I see are:

      1. use-after-free: Drop semantics vs manual cudaFree

      2. kernel args are enforced by `cuda_launch!`, whereas C++'s void* args are just an array of pointers, validated only by count

      3. aliased mutable writes, e.g. C++ can have more than one thread writing out[i] with the same i and it will compile, but DisjointSlice<T> takes a ThreadIndex that has no public constructor (see: https://github.com/NVlabs/cuda-oxide/blob/2a03dfd9d5f3ecba52...) and is only reachable through the `index_1d`, `index_2d` and `index_2d_runtime` APIs (rough sketch of the idea at the end of this comment)

      4. I'm pretty sure you can cudaMemcpy a std::string, or literally any other non-POD type, and "corrupt" its state, making it unusable. Here it ONLY accepts DisjointSlice<T>, scalars, and closures (https://nvlabs.github.io/cuda-oxide/gpu-programming/memory-a...)

      But most of the nitty-gritty is in these sections:

      * https://nvlabs.github.io/cuda-oxide/gpu-safety/the-safety-mo...

      * https://nvlabs.github.io/cuda-oxide/gpu-programming/memory-a...

      edit: that being said, it's not like this catches everything; it just looks to give many more guardrails against the UB you'd get with raw .cu files
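
      edit 2: to illustrate point 3, the core idea is something like the sketch below. This is my paraphrase, NOT the actual cuda-oxide types; names and signatures are simplified:

          use core::marker::PhantomData;

          // A per-thread index user code cannot forge: no public constructor,
          // so the only way to get one is from the launch machinery.
          pub struct ThreadIndex { idx: usize }

          // A mutable slice where each thread may only touch "its own"
          // element, so two threads can never write the same location.
          pub struct DisjointSlice<'a, T> {
              ptr: *mut T,
              len: usize,
              _marker: PhantomData<&'a mut [T]>,
          }

          impl<'a, T> DisjointSlice<'a, T> {
              // Sound (in this sketch) because each ThreadIndex is unique to
              // one thread, so no two threads can alias the same element.
              pub fn index_1d(&mut self, i: ThreadIndex) -> &mut T {
                  assert!(i.idx < self.len, "thread index out of bounds");
                  unsafe { &mut *self.ptr.add(i.idx) }
              }
          }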

    • wrs 1 hour ago
      This is explained in some detail in the docs. There is a safe layer, a mostly safe layer, and an unsafe layer. Some clunkiness is needed for safe-yet-parallel work that they couldn’t easily fit into the Rust Send/Sync model.
    • the__alchemist 1 hour ago
      I think it depends on the objective. My pattern-matching brain says there will be interest in addressing this.

      From my perspective of someone who writes applications in Rust and sometimes wants to use GPU compute in these applications: I don't care. If we can leverage the memory model or ownership model in a low-friction way, that's fine. If it makes it a high friction experience, I would prefer not to do it that way.

      The baseline is IMO how Cudarc currently does it. I don't think there is much memory management involved; it's just imperative syntax wrapping FFI, and some lines in the build script to invoke nvcc if the kernels change.

  • raincole 1 hour ago
    I wonder what it means for Slang[0]. Presumably the point is that people want to do GPU programming with a more modern language. But now you can just use Rust...

    (Disclaimer: I like Slang a lot.)

    [0]: https://shader-slang.org/

    • pjmlp 21 minutes ago
      They serve a different public; Slang folks are more interested in graphics programming, not AI algorithms.

      Also, shading languages are more user-friendly given their features.

      Finally, NVIDIA already has Slang in production, and those folks aren't going to rewrite shader pipelines in Rust.

  • rogermeier 50 minutes ago
    TileLang https://github.com/tile-ai/tilelang and stuff like Tile Kernels https://github.com/deepseek-ai/TileKernels will make CUDA obsolete one day.
    • wrathofmonads 18 minutes ago
      Halide had the right idea, and some of us could sense that tile-based programming would eventually go mainstream, but it never had the hardware moment to make it a real necessity. The ideas that feel obvious in retrospect are usually right on the merits years before adoption. Think arrays and APL, but it took until numpy to provide the right ergonomics for everyday developers to stop thinking in loops.
    • jordand 28 minutes ago
      CUDA is nearly 20 years old and is not going anywhere for many years to come
    • mathisfun123 34 minutes ago
      this dude is a distinguished engineer at Siemens posting the dopiest, Reddit-level takes. lolol.
    • AnimalMuppet 38 minutes ago
      That's quite a claim for very little evidence.
    • arpadav 34 minutes ago
      is this even comparable? lol
  • tiffanyh 1 hour ago
    Re: Rust (and "safe" programming languages).

    Does anyone have more details on NVIDIA's use of SPARK/Ada?

    All I can find is what's listed below:

    https://www.adacore.com/case-studies/nvidia-adoption-of-spar...

  • the__alchemist 1 hour ago
    Does anyone know if this will let you share structs between host and device? That is the big thing missing so far from existing Rust/CUDA workflows. (Plus the serialization/bytes barrier between them.)
  • foo-bar-baz529 1 hour ago
    One thing I’ve been wary about with Rust for CUDA is the bit of overhead that Rust adds that is usually negligible but might matter here, like bounds checks on arrays. Could it cause additional registers to get used, lowering the concurrency of a kernel?
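
    For comparison, on the CPU side you can opt out per access once you've proven the index is in range; I don't know yet what cuda-oxide offers here, but this is the trade-off I mean:

        // Checked: the compiler emits a bounds test (and panic path) per access.
        fn sum(data: &[f32], idx: &[usize]) -> f32 {
            idx.iter().map(|&i| data[i]).sum()
        }

        // Unchecked: the caller must guarantee i < data.len(); on a GPU the
        // extra branch (and registers) of the checked form might matter more.
        fn sum_unchecked(data: &[f32], idx: &[usize]) -> f32 {
            idx.iter().map(|&i| unsafe { *data.get_unchecked(i) }).sum()
        }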
  • the__alchemist 1 hour ago
    Hell yea! I have been doing this with Cudarc (kernels) and FFI (cuFFT), using manual [de]serialization between byte arrays and Rust data structs. I hope this makes it lower friction!
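
    By manual [de]serialization I mean the usual #[repr(C)] + bytemuck dance, roughly like this (struct and field names made up for illustration):

        use bytemuck::{Pod, Zeroable};

        // #[repr(C)] so the layout matches what the kernel expects;
        // Pod requires no padding (u32 + f32 packs cleanly here).
        #[repr(C)]
        #[derive(Clone, Copy, Pod, Zeroable)]
        struct SimParams {
            n: u32,
            dt: f32,
        }

        // Reinterpret the struct as raw bytes; this is what gets memcpy'd
        // across the host/device boundary.
        fn to_device_bytes(p: &SimParams) -> &[u8] {
            bytemuck::bytes_of(p)
        }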
  • TheMagicHorsey 44 minutes ago
    Oh lord. If this is the trend, I probably can't avoid improving my Rust language knowledge in the long term. I hate reading Rust so much right now. I guess I just have to get over that hump.
  • economistbob 1 hour ago
    So, we have stainless, which means Linux code that never rusted. Now we need someone to make phosphorus so that we can turn rusty code into old iron. Then GPL fans can run Rust boxes, Stainless machines, or future proofed iron work horses.

    All software can come in three editions: Stainless drivers that were never rusty, Oxidized drivers that used Rust on existing code, and Iron editions, where someone converted the Rust back to C using the new phosphoric tool...

    Diversity can be our strength.

    Making Iron C/C++ code can be called acid-washing if it was rusted.

    • positron26 1 hour ago
      > we need someone

      > Then GPL fans can

      Checks out

  • zghst 56 minutes ago
    AWESOME!
  • rowanG077 2 hours ago
    Personally, I really don't want new GPU languages that don't have AD as a first-class citizen. I mean, Rust is an improvement over C++ CUDA, but still.
    • erk__ 2 hours ago
      There is actually work on adding autodiff to Rust, maybe not really a first-class citizen, but at least built in: https://doc.rust-lang.org/std/autodiff/index.html (it is still at a pre-RFC stage, so it won't land soon)
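
      The nightly surface looks roughly like this (pre-RFC syntax from the docs; it has been churning, so the details may be out of date):

          #![feature(autodiff)]
          use std::autodiff::autodiff_reverse;

          // Asks the compiler (via Enzyme) to generate d_square, a
          // reverse-mode derivative of square, alongside the original;
          // Duplicated/Active describe how the argument and return
          // value participate in differentiation.
          #[autodiff_reverse(d_square, Duplicated, Active)]
          fn square(x: &f64) -> f64 {
              x * x
          }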
      • magnio 1 hour ago
        Incredible, I had never heard of std::autodiff before. Isn't it rare for a programming language to provide AD in the standard library? Even Julia doesn't have it built in; I wouldn't have expected Rust, of all languages, to experiment with it in std.
      • rowanG077 1 hour ago
        That's awesome, I didn't know that!
    • TallGuyShort 2 hours ago
      Sorry, what is AD in this context?

      edit: oh, automatic differentiation?

    • the__alchemist 1 hour ago
      This isn't a new GPU language; it's a lib which might replace FFI and third party libs.
      • rowanG077 1 hour ago
        This is definitely not just a lib. This compiles rust to CUDA. If you call a full on compiler stack a lib, everything may as well be a lib.
        • the__alchemist 59 minutes ago
          Ok. I am calling it a lib because, to use it, you add it as a dependency in Cargo.toml and then import it in your Rust modules.
          • rowanG077 36 minutes ago
            That's after you have installed their entire build infra + dependencies. They ship their own cargo subcommand.
    • vimarsh6739 2 hours ago
      It's really hard to find alternatives to Julia for AD as a first-class citizen
      • hellohello2 2 hours ago
        I think the parent is mostly referring to solutions like Slang.D
    • mathisfun123 1 hour ago
      every GPU-related post has a comment that makes my eyes roll all the way back. this is the one for this post.
  • rvz 2 hours ago
    This is somewhat good for Rust if you want to use the language with CUDA. The problem is, it still doesn't really move the needle if you don't like running closed-source drivers and runtime binaries and care about open source.

    Continuing from this discussion [0], it only makes a bug a Rust or a CUDA problem rather than a Python, CUDA and PyTorch one.

    Yet at the end of the day, it still uses Nvidia's closed-source CUDA compiler 'nvcc', which they will never open source. At least Mojo promises to open source its own compiler, which compiles to different accelerators with multiple backend support.

    Unlike this... but hey, it uses Rust.

    [0] https://news.ycombinator.com/item?id=48067228

    • the__alchemist 1 hour ago
      IMO this has nothing to do with open source as an ideology; it's just a practical (and official?) lib for adding GPU interaction to your Rust programs.
    • charcircuit 15 minutes ago
      Considering how fast everything is changing with GPUs and how competitive it is, it doesn't make sense to have an open-source driver.
    • pjmlp 2 hours ago
      It remains to be seen whether Mojo isn't another Swift for TensorFlow; apparently 1.0 won't even support Windows properly.
      • semiinfinitely 2 hours ago
        who the fuck uses windows
        • pjmlp 16 minutes ago
          All the game devs that forced Valve to come up with Proton so the Steam Deck would have any content.
        • beanjuiceII 1 hour ago
          many people
        • bigyabai 2 hours ago
          The majority of computer owners on planet Earth
          • OtomotO 2 hours ago
            But also the majority of programmers?
            • pjmlp 26 minutes ago
              Yes, because Windows software doesn't spring into existence out of nowhere.
            • bigyabai 2 hours ago
              In AI-focused fields like business analytics and data science, yeah.
              • vlovich123 1 hour ago
                The claim is that people are running CUDA on Windows for business analytics and data science? This feels less like an accurate picture and more like a case where any mass data processing is already happening on Linux K8s clusters.
                • pjmlp 25 minutes ago
                  Yes, if they happen to run tooling like Excel, PowerBI, Tableau,....

                  Also, Linux support for CUDA on laptops, especially with dual-GPU setups, isn't particularly great.

                  Most workstation class laptops are Windows based.

        • fhn 1 hour ago
          your mom!
    • zamalek 1 hour ago
      My sentiment matches yours exactly. I'm sick and tired of CUDA, but it's really not going to change.

      It could maybe be forked with some dynamic smarts; HIP is basically 1:1 with CUDA: https://github.com/amd/amd-lab-notes/blob/release/hipify%2Fs...

      • pjmlp 18 minutes ago
        Does it support graphical GPU debugging for C++, Fortran and Python JIT GPU code?

        Otherwise it isn't 1:1 with CUDA, and I am not even counting everything else in the CUDA ecosystem.

    • bigyabai 2 hours ago
      > it still doesn't really move the needle if you really don't like running closed source drivers and runtime binaries

      Those people probably did not buy an Nvidia GPU for themselves. It should be common knowledge that the "Open" Nvidia drivers still run gigantic firmware blobs to dispatch complex workloads. And Nouveau is close to useless for GPGPU compute.

  • whatever1 2 hours ago
    Why do we bother with programming languages today? Why not have the LLMs just write assembly code and skip the human-readable part? We are not reviewing it anymore anyway.
    • strbean 2 hours ago
      A lot of really good reasons:

      1) Higher level code is easier for LLMs to review and iterate upon. The more the intent is clear from the code, the easier it is for humans and LLMs to work with.

      2) LLMs get stuck or fail to solve a problem sometimes. It is preferable to have artifacts that humans can grok without the massive extra effort of parsing out assembly code.

      3) Assembly code varies massively across targets. We want provable, deterministic transformation from the intent (specified in a higher level language) to the target assembly language. LLMs can't reliably output many artifacts for different platforms that behave the same.

      4) Hopefully, we are still reviewing the code output by LLMs to some extent.

      • _flux 1 hour ago
        In addition, LLMs also write bugs, and debugging assembly is more difficult, wasting more tokens and thus more money.

        A very big practical reason is also that assembly code would eat context like nothing else.

      • jcgrillo 1 hour ago
        I'd add to that

        1.5) Having a compiler in the loop that does things like enforcing type constraints (and, in the case of Rust in particular, therefore memory safety guarantees) is really useful both for humans and LLMs.

    • Almondsetat 1 hour ago
      Feel free to post a project of yours where you gave a bunch of prompts to an LLM and it produced a working application written in assembly without you having to check for anything
    • vjsrinivas 2 hours ago
      Is this a serious question or are you just trolling?
    • hellohello2 2 hours ago
      I get what you mean but I think if anything AI pairs extremely well with strongly typed languages that are at times cumbersome for humans, but decrease the latency at which AI can get feedback on its code. In my (very) limited experience Rust is an excellent target for AI codegen.
      • wrathofmonads 51 minutes ago
        Clojure is a strongly typed language. A Clojure REPL capturing immutable, inspectable state is a philosophically richer feedback substrate than it gets credit for. Spec can express constraints that static types cannot - things that would require dependent or refinement types in a static system (and the enormous complexity that comes with them) you can just write as a predicate. The tradeoff is that specs are only checked when you actually exercise the code path, whereas a type error is total and upfront. But that's exactly the point - an agent working surgically on a specific path is exercising it, so the totality of static checking matters less. If we're not vibe-coding, a dynamic, strongly-typed, immutability-friendly language like Clojure could be both token-efficient and capable of richer reasoning than a static type system allows.
    • bee_rider 2 hours ago
      This is a Rust-to-CUDA converter, so I guess it is for code where the programmer wants it to function properly (Rust) and have good performance (CUDA).

      It's just a matter of different workflows for different users and applications.

    • regenschutz 2 hours ago
      I mean, AI is not good at writing x86-64 assembly code. Last time I tried (with both Claude and ChatGPT), the AI failed to even create basic programs other than Hello World.
    • ModernMech 1 hour ago
      I'll bite:

      Programming languages are tools for thinking. It's not clear that assembly code has the right abstractions to encourage the kind of thinking that programming large systems requires. After all, human intelligence found assembly insufficient and went on to invent better languages for thinking, why should artificial intelligence, trained on human intelligence, be any different? Maybe AI in the future will have its own languages for thinking, but assembly is likely not that.

    • OtomotO 2 hours ago
      Because when this idiotic hype machinery finally dies an agonising, painful death, some of us will still want to work with computers