Repost Notes
This blog series is reposted from Modular’s official blog, written by the creator of LLVM, Chris Lattner.
- Source: Why do HW companies struggle to build AI software? (Democratizing AI Compute, Part 9)
- Publish date: April 22, 2025
The original blog was parsed with Jina Reader.
Since the launch of ChatGPT in late 2022, GenAI has reshaped the tech industry—but GPUs didn't suddenly appear overnight. Hardware companies have spent billions on AI chips for over a decade. Dozens of architectures. Countless engineering hours. And yet—still—NVIDIA dominates.
Why?
Because CUDA is more than an SDK. It's a fortress of developer experience designed to lock you in—and a business strategy engineered to keep competitors perpetually two years behind. It's not beloved. It's not elegant. But it works, and nothing else comes close.
We've spent this series tracing the rise and fall of hopeful alternatives—OpenCL and SYCL, TVM and XLA, Triton, MLIR, and others. The pattern is clear: bold technical ambitions, early excitement, and eventual fragmentation. Meanwhile, the CUDA moat grows deeper.
The trillion-dollar question that keeps hardware leaders awake at night is: Given the massive opportunity—and developers desperate for alternatives—why can't we break free?
The answer isn't incompetence. Hardware companies are filled with brilliant engineers and seasoned execs. The problem is structural: misaligned incentives, conflicting priorities, and an underestimation of just how much software investment is required to play in this arena. You don't just need a chip. You need a platform. And building a platform means making hard, unpopular, long-term bets—without the guarantee that anyone will care.
In this post, we'll reveal the invisible matrix of constraints that hardware companies operate within—a system that makes building competitive AI software nearly impossible by design.
My career in HW / SW co-design
I live and breathe innovative hardware. I read SemiAnalysis, EE Times, Ars Technica—anything I can get my hands on about the chips, stacks, and systems shaping the future. Over decades, I've fallen in love with the intricate dance of hardware/software co-design: when it works, it's magic. When it doesn't… well, that's what this whole series is about.
A few of my learnings:
- My first real job in tech was at Intel, helping optimize launch titles for the Pentium MMX—the first PC processor with SIMD instructions. There I learned the crucial lesson: without optimized software, a revolutionary silicon speedboat won't get up to speed. That early taste of hardware/software interplay stuck with me.
- At Apple, I built the compiler infrastructure enabling a transition to in-house silicon. Apple taught me that true hardware/software integration requires extraordinary organizational discipline—it succeeded because, instead of settling for a compromise, the teams shared a unified vision that no business unit could override.
- At Google, I scaled the TPU software stack alongside the hardware and AI research teams. With seemingly unlimited resources and tight HW/SW co-design, we used workload knowledge to deliver the power of specialized silicon—an incredible custom AI racing yacht.
- At SiFive, I switched perspectives entirelyâleading engineering at a hardware company taught me the hard truths about hardware business models and organizational values.
Across all these experiences, one thing became clear: software and hardware teams speak different languages, move at different speeds, and measure success in different ways. But there's something deeper at work—I came to see an invisible matrix of constraints that shapes how hardware companies approach software, and explains why their software teams struggle with AI software in particular.
Before we go further, let's step into the mindset of a hardware executive—where the matrix of constraints begins to reveal itself.
How AI hardware companies think
There's no shortage of brilliant minds in hardware companies. The problem isn't IQ—it's worldview.
The architectural ingredients for AI chips are well understood by now: systolic arrays, TensorCores, mixed-precision compute, exotic memory hierarchies. Building chips remains brutally hard, but it's no longer the bottleneck for scalable success. The real challenge is getting anyone to use your silicon—and that means software.
GenAI workloads evolve at breakneck speed. Hardware companies need to design for what developers will need two years from now, not just what's hot today. But they're stuck in a mental model that doesn't match reality—trying to race in open waters with a culture designed for land.

Fun Fact: LLVM's mascot is a wyvern, sort of like a dragon with no front claws.
In the CPU era, software was simpler: build a backend for LLVM and your chip inherited an ecosystem—Linux, browsers, and compiled applications all worked. AI has no such luxury. There's no central compiler or OS. You're building for a chaotic, fast-moving stack—PyTorch, vLLM, today's agent framework of the week—while your customers are using NVIDIA's tools. You're expected to make it all feel native, to just work, for AI engineers who neither understand your chip nor want to.
Despite this, the chip is still the product—and the P&L makes that crystal clear. Software, docs, tooling, community? Treated like overhead. This is the first constraint of the matrix: hardware companies are structurally incapable of seeing a software ecosystem as a standalone product. Execs optimize for capex, BOM cost, and tapeout timelines. Software gets some budget, but it's never enough—especially as AI software demands scale up. The result is a demo-driven culture: launch the chip, write a few kernels, run some benchmarks, and build a flashy keynote that proves your FLOPS are real.
The result is painfully familiar: a technically impressive chip with software no one wants to use. The software team promises improvement next cycle. But they said that last time too. This isn't about individual failure—it's about systemic misalignment of incentives and resources in an industry structured around silicon, not ecosystems.
Why is GenAI software so hard and expensive to build?
Building GenAI software isn't just hard—it's a treadmill pointed uphill, on a mountain that's constantly shifting beneath your feet. It's less an engineering challenge than a perfect storm of fragmentation, evolving research, and brutal expectations—each a component of the matrix.
The treadmill of fragmented AI research innovation
AI workloads aren't static—they're a constantly mutating zoo. One week it's Transformers; the next it's diffusion, MoEs, or LLM agents. Then comes a new quantization trick, a better optimizer, or some obscure operator that a research team insists must run at max performance right now.
It is well known that you must innovate in hardware to differentiate, but it is often forgotten that every hardware innovation multiplies your software burden against a moving target of use cases. Each innovation demands that software engineers deeply understand it—while also tracking fast-moving AI research and connecting the two together.
The result? You're not building a "stack"—you're building a cross product of models × quantization formats × batch sizes × inference/training × cloud/edge × framework-of-the-week.
It’s combinatorially explosive, which is why no one but NVIDIA can keep up. You end up with ecosystem maps that look like this:

Compatibility matrix highlighting the complexity of vLLM. Source: vLLM
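The cross product above can be made concrete with a back-of-the-envelope sketch. The axes and entries below are illustrative assumptions, not a real support matrix; the point is how quickly the configuration count grows when every axis must be validated against every other.

```python
from itertools import product

# Hypothetical axes of the support matrix an AI software team must cover.
# The specific entries are invented for illustration, not an actual census.
models = ["llama", "mixtral", "diffusion", "whisper"]
quant_formats = ["fp16", "bf16", "fp8", "int8", "int4"]
batch_sizes = [1, 8, 64]
modes = ["inference", "training"]
targets = ["cloud", "edge"]
frameworks = ["pytorch", "vllm", "jax"]

# Every combination is a configuration someone may expect to "just work".
configs = list(product(models, quant_formats, batch_sizes,
                       modes, targets, frameworks))
print(len(configs))  # 4 * 5 * 3 * 2 * 2 * 3 = 720 configurations
```

Adding a single new quantization format to this toy matrix adds 144 more configurations to validate, which is one way to see why each hardware or research innovation multiplies the software burden rather than adding to it.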
You're competing with an industry, not just CUDA
The real problem isn't just CUDA—it's that the entire AI ecosystem writes software for NVIDIA hardware. Every framework, paper, and library is tuned for their latest TensorCores. Every optimization is implemented there first. This is the compounding loop explored in Part 3: CUDA is a software gravity well that bends the industry's efforts toward NVIDIA's hardware.
For alternative hardware, compatibility isn't enough—you have to outcompete a global open-source army optimizing for NVIDIA's chips. First you have to "run" the workload, but then it has to be better than the HW+SW combo they're already using.
The software team is always outnumbered
No matter how many software engineers you have, it's never enough to get ahead of the juggernaut—no matter how brilliant and committed, they're simply outmatched. Their inboxes are full of customer escalations, internal feature requests, and desperate pleas for benchmarks. They're fighting fires instead of building the tools that would prevent future fires, and they're exhausted. Each major success only makes clear how much is left to be done.
They have many ideas—they want to invest in infrastructure, build long-term abstractions, define the company's software philosophy. But they can't, because they can't stop working on the current-gen chip long enough to prepare for the next one. Meanwhile, …
The business always "chases the whale"
When a massive account shows up with cash and specific requirements, the business says yes. Those customers have leverage, and chasing them always makes short-term sense.
But there's a high cost: Every whale you reel in pulls the team further away from building a scalable platform. There's no time to invest in a scalable torso-and-tail strategy that might unlock dozens of smaller customers later. Instead of becoming a product company, your software team is forced to operate like a consulting shop.
It starts innocently, but soon your engineers are implementing hacks, forks, and half-integrations that make one thing fast but break five others. Eventually, your software stack becomes a haunted forest of tech debt and tribal knowledge. It's impossible to debug, painful to extend, and barely documented—who had time to write docs? And what happens when the engineer who understood it just left?
Challenges getting ahead in the hardware regatta
These aren't isolated problems—they're the universal reality of building GenAI software. The race isn't a sprint—it's a regatta: chaotic, unpredictable, and shaped as much by weather as by engineering. Everyone's crossing the same sea, but in radically different boats.

Speedboats: Startups aim for benchmarks, not generality or usability
Startups are in survival mode. Their goal is to prove the silicon works, that it goes fast, and that someone—anyone—might buy it. That means picking a few benchmark workloads and making them fly, using whatever hacks or contortions it takes. Generality and usability don't matter—the only thing that matters is showing that the chip is real and competitive today. You're not building a software stack. You're building a pitch deck.
Custom Racing Yachts: Single-chip companies build vertical stacks
The Mag7 and advanced startups take a different tack. They build TPU racing yachts to win specific races with custom designs. They can be fast and beautiful—but only with their trained crew, their instruction manual, and often their own models. Because these chips leave GPU assumptions behind, they must build bespoke software stacks from scratch.
They own the entire stack because they have to. The result? More fragmentation for AI engineers. Betting on one of these chips means theoretical FLOPS at a discount—but sacrificing momentum from the NVIDIA ecosystem. The most promising strategy for these companies is locking in a few large customers: frontier labs or sovereign clouds hungry for FLOPS without the NVIDIA tax.
Ocean Liners: Giants struggle with legacy and scale
Then come the giants: Intel, AMD, Apple, Qualcomm—companies with decades of silicon experience and sprawling portfolios: CPUs, GPUs, NPUs, even FPGAs. They've shipped billions of units. But that scale brings a problem: divided software teams stretched across too many codebases, too many priorities. Their customers can't keep track of all the software and versions—where to start?
One tempting approach is to just embrace CUDA with a translator. It gets you "compatibility," but never great performance. Modern CUDA kernels are written for Hopper's TensorCores, TMA, and memory hierarchy. Translating them to your architecture won't make your hardware shine.
Sadly, the best-case outcome at this scale is oneAPI from Intel—open, portable, and community-governed, but lacking momentum or soul. It hasn't gained traction in GenAI for the same reasons OpenCL didn't: it was designed for a previous generation of GPU workloads, and AI moved too fast for it to keep up. Being open only helps if you also keep up.
NVIDIA: The carrier that commands the race
NVIDIA is the aircraft carrier in the lead: colossal, coordinated, and surrounded by supply ships, fighter jets, and satellite comms. While others struggle to build software for one chip, NVIDIA launches torpedoes at anyone who might get ahead. While others optimize for a benchmark, the world optimizes for NVIDIA. The weather changes to match their runway.
If you're in the regatta, you're sailing into their wake. The question isn't whether you're making progress—it's whether the gap is closing or getting wider.
Breaking out of the matrix
At this point in "Democratizing AI Compute", we've mapped the landscape. CUDA isn't dominant by accident—it's the result of relentless investment, platform control, and market feedback loops that others simply can't replicate. Billions have been poured into alternatives: vertically-integrated stacks from Mag7 companies, open platforms from industry giants, and innovative approaches from hungry startups. None have cracked it.
But we're no longer lost in the fog. We can see the matrix now: how these dynamics work, where the traps lie, why even the most brilliant software teams can't get ahead at hardware companies. The question is no longer why we're stuck—it's whether we can break free.

Child: "Do not try and bend the spoon. That's impossible. Instead… only try to realize the truth."
Neo: "What truth?"
Child: "**There is no spoon.** Then you'll see that it is not the spoon that bends, it is only yourself."
If we want to Democratize AI Compute, someone has to challenge the assumptions we've all been working within. The path forward isn't incremental improvement—it's changing the rules of the game entirely.
Let’s explore that together in part 10.