CUDA Demonstrates That Nvidia Is Primarily a Software Firm

Please excuse my use of a well-worn phrase, a bit of finance lingo that has lately woven itself into the tech discourse, but I find it necessary to discuss “moats.” Popularized by Warren Buffett to describe a company’s durable competitive edge, the term became prevalent in Silicon Valley after a leaked memo attributed to Google, titled “We Have No Moat, and Neither Does OpenAI,” warned that open-source AI might breach Big Tech’s defenses.
Years later, those defenses remain intact. Aside from a fleeting moment of concern when DeepSeek emerged, open-source AI models have not significantly surpassed proprietary ones. Even so, none of the leading labs—OpenAI, Anthropic, Google—possesses a meaningful moat.
The company that does possess a moat is Nvidia. CEO Jensen Huang refers to it as his most cherished “treasure.” Surprisingly, it is not a piece of hardware, as one might expect from a chip manufacturer. It is something called CUDA. What sounds like a chemical compound banned by the FDA may be the most durable competitive advantage in the AI sector.
CUDA stands for Compute Unified Device Architecture, but like laser or scuba, no one bothers to spell it out; we simply pronounce it “KOO-duh.” So what makes this treasure so vital? If I had to sum it up in one word: parallelization.
Consider a straightforward example. If we assign a machine the task of completing a 9×9 multiplication table, a computer with a single core would perform each of the 81 operations sequentially. In contrast, a GPU with nine cores can allocate tasks so that each core processes a different column—one core handles 1×1 to 1×9, another covers 2×1 to 2×9, and so forth—yielding a ninefold increase in speed. Modern GPUs can demonstrate even greater efficiency. For instance, if programmed to recognize commutativity—7×9 = 9×7—they can eliminate redundant calculations, reducing 81 operations to 45, nearly halving the workload. In a realm where a single training run can cost up to a hundred million dollars, every optimization counts.
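The column-per-core split and the commutativity trick can be sketched in a few lines of Python (a toy illustration of the arithmetic, not actual GPU code; the one-worker-per-column mapping merely stands in for how a GPU assigns work to its cores):

```python
# Toy model of the 9x9 multiplication table example:
# 81 multiplications done sequentially, 9 per core when split
# across 9 cores, and 45 after exploiting commutativity (7*9 == 9*7).

N = 9

# Sequential: a single core walks all 81 cells one by one.
sequential_ops = [(i, j) for i in range(1, N + 1) for j in range(1, N + 1)]

# Parallel: each of 9 "cores" takes one column (i fixed, j varies),
# so the slowest core does only 9 multiplications -- a ninefold speedup.
columns = {i: [(i, j) for j in range(1, N + 1)] for i in range(1, N + 1)}
ops_per_core = max(len(col) for col in columns.values())

# Commutativity: compute i*j only for i <= j and reuse it for j*i,
# leaving 9 squares plus 36 distinct pairs = 45 operations.
unique_ops = [(i, j) for i in range(1, N + 1) for j in range(i, N + 1)]

print(len(sequential_ops))  # 81
print(ops_per_core)         # 9
print(len(unique_ops))      # 45
```

The counts match the article’s arithmetic: 81 operations sequentially, 9 per core across 9 cores, and 45 once redundant products are deduplicated.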
Originally, Nvidia’s GPUs were designed for rendering graphics in video games. In the early 2000s, Ian Buck, a Stanford PhD student who initially engaged with GPUs through gaming, recognized that their architecture could be adapted for general high-performance computing. He created a programming language called Brook, was recruited by Nvidia, and, alongside John Nickolls, spearheaded the development of CUDA. If AI leads us into an era characterized by a permanent white-collar underclass and autonomous weaponry, it’s worth noting that it began because someone played Doom and decided a demon’s scrotum should jiggle at 60 frames per second.
CUDA isn’t a programming language on its own but a “platform.” I hesitate to use that vague word, but it fits: much as The New York Times is now a gaming company as well as a newspaper, CUDA has evolved over the years into a sprawling collection of software libraries for AI. Each function shaves nanoseconds off individual mathematical operations—cumulatively, they let GPUs, in industry parlance, go brrr.
A contemporary graphics card is more than a circuit board packed with chips, memory, and fans. It’s a sophisticated assembly of cache hierarchies and specialized units known as “tensor cores” and “streaming multiprocessors.” Think of it this way: what chip companies produce is a professional kitchen, and additional cores are extra grilling stations. But even a kitchen with 30 grilling stations cannot run efficiently without a capable head chef coordinating the work—which is what CUDA does for a GPU’s cores.
To extend the metaphor, hand-tuned CUDA libraries tailored to a specific matrix operation are like kitchen utensils built for a single task—a cherry pitter, a shrimp deveiner—luxuries for home cooks, but indispensable when there are 10,000 shrimp to prepare. This brings us back to DeepSeek. Its engineers dove beneath CUDA’s already intricate abstraction layer to work directly in PTX, a form of assembly language for Nvidia GPUs. Imagine the task is peeling garlic. An unoptimized GPU might be told: “Peel the skin off with your fingernails.” CUDA can direct: “Smash the clove with the flat of a knife.” PTX lets you dictate every sub-instruction: “Lift the blade 2.35 inches above the cutting board, align it parallel to the clove’s equator, and strike downward with a force of 36.2 newtons.”
