I still write assembly language. Not for nostalgia, but because some problems can't be solved any other way. Low-level code isn't dead. It's hiding in every performance-critical system you use.
Learn assembly basics even if you never write it. Understanding what the machine actually does makes you a better programmer. Compilers aren't magic.
January 2026: Updated with WebAssembly section and Monday Morning Checklist.
When I tell younger engineers that I write assembly, they look at me like I said I write on clay tablets. "Compilers optimize better than humans now." "Nobody needs to do that anymore." "That's what the 1980s were for."
I understand why they think this. Modern compilers are genuinely impressive. GCC and LLVM perform optimizations that would take humans days to figure out. For 99% of code, the compiler does a better job than any human could. The logic is sound for most applications.
But they're wrong about the edge cases. And if you're building anything where performance really matters—voice AI, cryptography, real-time systems—it's worth understanding why. I've watched teams burn weeks trying to squeeze performance out of high-level code when thirty minutes of assembly would have solved it.
Where Assembly Still Matters
Let me be specific about where I still drop into assembly:
SIMD Operations
Modern CPUs have vector instructions: SSE, AVX, and AVX-512 on x86, NEON on ARM. These instructions can process multiple data elements simultaneously. A single AVX-512 instruction can operate on 16 32-bit floats at once.
Compilers try to auto-vectorize your code. Sometimes they succeed. Often they don't. The conditions for auto-vectorization are fragile. Change your loop slightly and the compiler gives up.
For real-time audio processing, I write SIMD intrinsics or raw assembly. The difference isn't 10%. It's 4-8x faster. As research on high-performance computing confirms, pure C++ kernels consistently lag hand-tuned assembly by 20-90% in benchmarks of computational kernels. When you're processing audio in real time with strict latency requirements, that matters.
Here's a concrete example: audio sample scaling using AVX-512. The C version:
void scale_samples(float* samples, float gain, int count) {
    for (int i = 0; i < count; i++) {
        samples[i] *= gain;
    }
}
The compiler might auto-vectorize this. Or it might not. Here's the hand-written AVX-512 assembly that processes 16 samples per instruction:
; scale_samples_avx512 - processes 16 floats per iteration
; rdi = samples pointer, xmm0 = gain, rsi = count
; assumes the buffer is 64-byte aligned (required by vmovaps)
scale_samples_avx512:
        vbroadcastss zmm1, xmm0      ; broadcast gain to all 16 lanes
        mov     rcx, rsi             ; save original count
        shr     rsi, 4               ; count / 16 (main loop iterations)
        jz      .remainder           ; skip if fewer than 16 samples
.loop:
        vmulps  zmm2, zmm1, [rdi]    ; multiply 16 floats at once
        vmovaps [rdi], zmm2          ; store result
        add     rdi, 64              ; advance pointer (16 * 4 bytes)
        dec     rsi
        jnz     .loop
.remainder:
        and     ecx, 15              ; remaining samples (count % 16)
        jz      .done                ; none? we're done
.scalar:
        vmulss  xmm2, xmm0, [rdi]    ; multiply single float (gain still in xmm0)
        vmovss  [rdi], xmm2          ; store result
        add     rdi, 4               ; next float
        dec     ecx
        jnz     .scalar
.done:
        ret
The assembly version is explicit about what's happening. Broadcast the gain value, load 16 floats, multiply, store, repeat. No ambiguity, no hoping the compiler figures it out.
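You don't always have to go all the way to raw assembly to reach the vector unit. Here's a sketch of the same kernel written with AVX-512 intrinsics in C (my names, not the production routine; build with -mavx512f). The compiler still picks registers and scheduling, but the 16-wide multiply no longer depends on the auto-vectorizer's mood:

#include <immintrin.h>   /* AVX-512 intrinsics; build with -mavx512f */

void scale_samples_intrin(float *samples, float gain, int count) {
    __m512 vgain = _mm512_set1_ps(gain);              /* broadcast gain to 16 lanes */
    int i = 0;
    for (; i + 16 <= count; i += 16) {
        __m512 v = _mm512_loadu_ps(samples + i);      /* load 16 floats */
        _mm512_storeu_ps(samples + i, _mm512_mul_ps(v, vgain));
    }
    for (; i < count; i++)                            /* scalar tail (count % 16) */
        samples[i] *= gain;
}

Intrinsics are usually my first stop. I only drop to a full assembly routine when the compiler's register allocation or scheduling around the intrinsics still isn't good enough.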
Cryptography
Cryptographic code needs to be:
- Fast (you're encrypting a lot of data)
- Constant-time (no timing side channels)
- Correct (a single bit flip breaks everything)
Compilers don't understand constant-time requirements. They'll happily optimize your carefully written constant-time code into something with timing variations. An attacker can use those variations to extract your keys. According to Intel's security guidance, "the safest solution is to write the secret-dependent code in assembly language."
This isn't theoretical. The Clangover attack demonstrated that compilers routinely break constant-time guarantees. Researchers recovered complete ML-KEM cryptographic keys in under 10 minutes—not because the source code was flawed, but because the compiler optimized carefully written constant-time C into assembly with secret-dependent branches.
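To make that concrete, here's the standard shape of a constant-time comparison in C (a generic sketch, not code from any of the affected libraries). The loop always runs to the end, and no branch depends on secret bytes:

#include <stddef.h>
#include <stdint.h>

int ct_compare(const uint8_t *a, const uint8_t *b, size_t len) {
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= a[i] ^ b[i];      /* accumulate differences, never branch on them */
    return diff == 0;             /* 1 if equal, 0 otherwise */
}

The C is correct by inspection. The problem is that nothing in the language forbids the optimizer from rewriting it into an early-exit loop, which is exactly the kind of transformation that reintroduces a timing signal.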
Modern CPUs include dedicated cryptographic instructions specifically because software implementations are both slower and less secure. Intel's AES-NI instructions provide hardware-accelerated AES encryption that runs approximately 8x faster than software implementations while eliminating timing side channels entirely.
Here's a single AES-128 encryption round in assembly using AES-NI:
; aes_encrypt_block - encrypts one 128-bit block using AES-128
; xmm0 = plaintext block, xmm1-xmm11 = the 11 round keys (precomputed)
; Returns ciphertext in xmm0
aes_encrypt_block:
        pxor       xmm0, xmm1        ; initial round key XOR (whitening)
        ; Rounds 1-9: each AESENC does SubBytes, ShiftRows, MixColumns, AddRoundKey
        aesenc     xmm0, xmm2        ; round 1
        aesenc     xmm0, xmm3        ; round 2
        aesenc     xmm0, xmm4        ; round 3
        aesenc     xmm0, xmm5        ; round 4
        aesenc     xmm0, xmm6        ; round 5
        aesenc     xmm0, xmm7        ; round 6
        aesenc     xmm0, xmm8        ; round 7
        aesenc     xmm0, xmm9        ; round 8
        aesenc     xmm0, xmm10       ; round 9
        ; Final round: SubBytes, ShiftRows, AddRoundKey (no MixColumns)
        aesenclast xmm0, xmm11       ; round 10 - final
        ret
Each AESENC instruction performs an entire AES round (SubBytes, ShiftRows, MixColumns, and AddRoundKey), and modern cores pipeline them at a throughput of roughly one round per cycle. The software equivalent requires table lookups (vulnerable to cache-timing attacks), dozens of XOR operations, and careful bit manipulation. The hardware version is both faster and immune to cache-based side channels because there are no memory accesses that depend on secret data.
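You can reach these instructions from C as well, via intrinsics, without hand-writing the routine above. A sketch, assuming the 11 round keys have already been expanded into rk[0] through rk[10]:

#include <wmmintrin.h>   /* AES-NI intrinsics; build with -maes */

static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11]) {
    block = _mm_xor_si128(block, rk[0]);          /* whitening */
    for (int round = 1; round <= 9; round++)
        block = _mm_aesenc_si128(block, rk[round]);
    return _mm_aesenclast_si128(block, rk[10]);   /* final round, no MixColumns */
}

The intrinsics map one-to-one onto the instructions, so the timing properties carry over. What you still have to watch is how the surrounding key-handling code gets compiled.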
Critical crypto primitives are written in assembly specifically to prevent the compiler from "helping." AES-NI instructions, SHA extensions, constant-time comparison functions: these live in assembly for good reason. When security depends on timing being independent of secret data, you can't trust a compiler to preserve that property.
Interrupt Handlers
When hardware interrupts fire, you have microseconds to respond. The interrupt handler needs to save state, handle the event, and restore state as fast as physically possible.
Compilers emit stack setup, prologue/epilogue code, and calling-convention bookkeeping that you don't need in an interrupt handler. In assembly, you control exactly what happens - save only the registers you'll use, do the minimum work, get out.
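Compilers have grown some support for this. GCC and Clang offer an x86 interrupt attribute that emits the IRET-style return for you; here's a sketch, assuming that toolchain support and the usual -mgeneral-regs-only build flag:

/* Minimal timer interrupt handler written in C. The compiler still saves
   every register the function might touch; hand-written assembly saves
   only what it actually uses. */
struct interrupt_frame;                    /* CPU-pushed frame, opaque here */

volatile unsigned long tick_count;         /* hypothetical shared counter */

__attribute__((interrupt))
void timer_isr(struct interrupt_frame *frame) {
    (void)frame;
    tick_count++;                          /* do the minimum, then return */
}

Even with the attribute, the generated code is conservative. When every microsecond counts, the hand-written version still wins.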
Boot Code and Bare Metal
Before the operating system loads, there's no C runtime. No standard library. No memory allocator. Just raw hardware.
Boot loaders, BIOS code, embedded firmware - these often start in assembly because there's literally nothing else available. The C runtime doesn't exist yet; you have to build it.
The Myth That Compilers Always Win
The claim that "compilers optimize better than humans" is true on average and false at the extremes.
For typical code - business logic, web applications, CRUD operations - yes, the compiler will generate perfectly good machine code. Don't write assembly for your REST API. In fact, for most applications, the advice in treating dependencies as debt applies: use higher-level tools until you prove you need something lower.
But for hot paths where every cycle matters, humans still win. Here's why:
Compilers are conservative. The compiler doesn't know your data patterns. It doesn't know that this loop always runs exactly 1024 times, that this pointer is always aligned, that this branch is never taken. According to Intel's optimization guide, certain hardware features require explicit, architecture-specific directives to unlock their full potential. You know your constraints and can exploit them.
Compilers follow rules. Calling conventions, ABI requirements, language semantics - the compiler has to respect all of these. You can break the rules when you know it's safe.
Compilers can't see across boundaries. Profile-guided optimization helps, but compilers still struggle with whole-program optimization. You can see the whole picture and optimize accordingly.
Compilers don't know about hardware quirks. Cache line sizes, memory alignment, pipeline hazards, micro-op fusion - the compiler knows some of this, but you can know more for your specific target.
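Some of that knowledge can still be expressed from C. A sketch (illustrative only; the 64-byte cache line and the prefetch distance are assumptions about the target, not universal constants):

#include <stddef.h>

struct per_core_counter {
    unsigned long value;
    char pad[64 - sizeof(unsigned long)];    /* one counter per cache line:  */
} counters[8];                               /* avoids false sharing         */

float gather_sum(const float *table, const int *idx, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&table[idx[i + 8]]);   /* start the random load */
        sum += table[idx[i]];                         /* eight iterations early */
    }
    return sum;
}

But the compiler won't discover these for you. You have to know the hardware well enough to write them down.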
Real Benchmarks
Let me give you concrete numbers from a real-time audio pipeline I've worked on:
Audio resampling: Our SIMD assembly implementation runs 6.2x faster than the equivalent C code compiled with -O3 and auto-vectorization hints. The compiler couldn't figure out the optimal instruction sequence for our specific sample rate conversions.
FFT computation: Hand-tuned assembly with proper cache prefetching runs 3.8x faster than FFTW compiled for the same CPU. FFTW is excellent - but it's general purpose. We know exactly what sizes we need.
Noise reduction: Our assembly kernel processes audio with 2.1x lower latency than the C version. In real-time audio, latency is everything. That 2x difference means we can process audio the C version couldn't handle in time.
These aren't synthetic benchmarks. These are production code processing live audio from first responders.
Assembly as Margin
Here's something the architecture astronauts miss. Assembly isn't just about performance. It's about economics.
If your competitor needs 100 AWS instances to process a voice stream and you need 10 because of a hand-tuned kernel, you've just turned assembly into margin. That's not a 10% improvement. That's a 10x infrastructure cost advantage. At scale, that's the difference between profitability and burning runway.
I've seen this play out directly. A voice AI pipeline that processes 10,000 concurrent streams can cost $50,000/month in cloud compute—or $5,000/month if the hot path is properly optimized. Over a year, that's $540,000 in savings. Enough to fund an entire engineering team. The assembly code that produces those savings took two weeks to write.
When founders ask me about "competitive moats," I tell them this: understanding the machine is a moat. Most teams can't match performance they don't understand. If your core processing is 8x more efficient than competitors using off-the-shelf libraries, you can underprice them profitably or offer capabilities they physically cannot provide.
When to Drop Down
I'm not saying write everything in assembly. That would be insane. Here's my decision process:
Profile first. Never optimize without measuring. Find the actual hot spots. Most code doesn't need optimization at all.
Try high-level optimizations first. Better algorithms beat micro-optimization every time. Once n gets large enough, an O(n log n) algorithm in Python beats an O(n²) algorithm in assembly.
Try compiler hints next. Restrict pointers, alignment attributes, likely/unlikely hints, SIMD intrinsics. Often you can get 80% of the benefit without writing assembly.
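Here's what that middle step typically looks like in C (the function names are made up for illustration; the attributes and built-ins are standard GCC/Clang):

#include <stddef.h>

void mix_buffers(float *restrict dst,              /* restrict: promise no aliasing */
                 const float *restrict src,
                 float gain, size_t count) {
    float *d = __builtin_assume_aligned(dst, 64);        /* promise 64-byte       */
    const float *s = __builtin_assume_aligned(src, 64);  /* alignment so the      */
    for (size_t i = 0; i < count; i++)                   /* vectorizer goes aligned */
        d[i] += gain * s[i];
}

int check_status(int status) {
    if (__builtin_expect(status != 0, 0))          /* tell the compiler this branch */
        return -1;                                 /* is the cold path              */
    return 0;
}

If the disassembly still isn't what it should be after this, that's the signal to go lower.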
Drop to assembly when:
- You've profiled and this is definitely the bottleneck
- High-level optimizations are exhausted
- Compiler output isn't good enough (check the disassembly)
- You have specific hardware knowledge to exploit
- You need guarantees the compiler can't provide (constant-time, etc.)
When Compilers Actually Win
I'm not saying hand-written assembly is always better. Compilers genuinely outperform humans when:
- Code runs on multiple architectures. Cross-platform software can't be hand-optimized for every CPU. The compiler adapts; your assembly doesn't.
- The hot path changes often. Assembly is expensive to maintain. If your performance-critical code evolves frequently, compiler-generated code wins on total cost.
- Register allocation is complex. Modern CPUs have intricate register dependencies. Compilers track these systematically; humans make mistakes on complex control flow.
But for stable, performance-critical inner loops on known hardware - the situations where I actually write assembly - the human still has the edge.
Why AI Won't Write Your Assembly
The question I get now is predictable: "Can't Claude just write that AVX-512 loop for me?"
AI can generate syntactically correct assembly. It can even produce code that runs. But it cannot understand your system.
Effective assembly requires what race car drivers call "mechanical sympathy," an intuitive understanding of how the machine behaves under stress. Which cache lines are hot right now? What's the state of the branch predictor after the previous function? How does this code interact with the interrupt handler that fires every millisecond?
An LLM doesn't know your memory layout. It doesn't know that your data arrives in 4KB chunks aligned to page boundaries. It doesn't know that your target CPU has a quirk where back-to-back `vmovaps` instructions stall the pipeline. It generates "assembly" in a vacuum, disconnected from the system it will run in.
Worse, AI hallucinations in assembly aren't just wrong; they're dangerous. A hallucinated instruction that looks plausible might corrupt memory, violate security invariants, or introduce timing side channels. In high-level code, a bug crashes the program. In assembly, a bug can corrupt the stack, leak cryptographic keys, or brick hardware.
I've watched engineers paste AI-generated assembly into production code. It compiled. It ran. It was 3x slower than the C version because the AI didn't understand the memory access patterns. The "optimization" was a de-optimization that nobody caught because nobody could read the code.
AI is a tool for generating boilerplate, not for writing code that needs to be correct at the bit level. When security or performance is non-negotiable, the human touch isn't optional. It's the whole point.
The Joy of Knowing
There's another reason I still write assembly. It's satisfying to know exactly what the machine is doing.
High-level languages abstract away the machine. That abstraction is usually helpful; you don't want to think about registers when writing business logic. But the abstraction can also obscure, and every layer adds overhead - what I call the layer tax.
When I write assembly, there's no mystery. Every instruction does exactly one thing. The correspondence between code and execution is direct. If something is slow, I know exactly why.
This understanding transfers back to high-level code. When I write C or Rust, I can visualize what the compiler will generate. That's part of why C remains one of the few languages that gives you real control over what the machine does. I know which constructs are expensive and which are cheap. I understand what "fast" actually means at the hardware level.
I once spent three days hunting a bug that was invisible in C. A function that should have been pure was causing memory corruption, but only under load, only on Tuesdays (literally), and only after the system had been running for six hours. The C code looked perfect. Static analyzers found nothing. Code review found nothing.
The disassembly told the story in thirty seconds. The compiler had "optimized" a temporary variable into a register that was also used by an interrupt handler. Under heavy load, the interrupt fired mid-calculation, corrupted the register, and the function returned garbage. The "Tuesday" pattern was just when our traffic peaked. The fix was one line, marking the variable volatile. But finding that line required reading assembly.
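That particular function isn't worth reproducing, but the textbook shape of this family of bugs is (a generic sketch, not the code from the story above). Without volatile, the compiler is entitled to keep the value in a register and never look at memory again:

/* Classic compiler-vs-interrupt interaction. */
volatile int sample_ready;            /* written by the interrupt handler */

void wait_for_sample(void) {
    while (!sample_ready)             /* volatile forces a fresh load from */
        ;                             /* memory on every pass of the loop  */
}

The C looks trivially correct either way. Only the generated instructions show whether the value actually gets re-read.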
Younger engineers who've never seen the machine are often surprised by performance characteristics. "Why is this slow? It's just a loop." They don't see the cache misses, the branch mispredictions, the memory stalls. They're operating on an abstract model that doesn't match reality.
Learning Assembly Today
If you've never written assembly, should you learn?
If you're building performance-critical systems - yes, absolutely. You don't need to write production assembly. But you should be able to read a disassembly, understand what the compiler generated, and recognize when it's doing something dumb.
If you're building typical web applications - probably not a priority. But understanding the basics will make you a better programmer. You'll understand why certain operations are fast and others are slow. You'll appreciate what the compiler does for you.
Start with x86-64 or ARM64, depending on your platform. Read "Computer Systems: A Programmer's Perspective." Write some simple functions. Learn to use objdump or a disassembler. Look at what your compiler generates for code you write.
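A concrete way to start (assuming GCC and binutils; any compiler and disassembler will do): write a tiny function, compile it twice, and compare what comes out.

/* sum.c - compile with:  gcc -O0 -c sum.c && objdump -d sum.o
 *                 then:  gcc -O2 -c sum.c && objdump -d sum.o
 * (or use gcc -S to read the compiler's own assembly output)
 * and compare what the optimizer did to the loop. */
float sum(const float *x, int n) {
    float total = 0.0f;
    for (int i = 0; i < n; i++)
        total += x[i];
    return total;
}

Twenty minutes of that teaches you more about your compiler than a week of reading about it.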
You'll never look at software the same way again.
Assembly on the Web: WebAssembly
Even on the web, assembly thinking is back. WebAssembly (Wasm) lets us run hand-tuned Rust, C++, or even hand-written Wasm at near-native speed in the browser. Figma compiled its C++ rendering engine to Wasm and cut load times by 3x. Adobe brought Photoshop to the web using Wasm. Google Earth runs in a browser tab.
The "assembly mindset" isn't just for kernels and embedded systems anymore. If you're building compute-heavy web applications—image processing, video editing, CAD tools, games—Wasm is your path to performance that JavaScript physically cannot achieve. The principles are the same: understand the machine, control the hot path, eliminate abstraction where it hurts.
The Bottom Line
Assembly language is more than 70 years old. People have been predicting its death for decades. "Compilers are good enough now." "Nobody needs that anymore." "It's obsolete."
And yet, here I am today, writing assembly for production systems. Because the problems that require it haven't gone away. Real-time constraints. Hardware acceleration. Security requirements. Performance at the edge of what's possible.
The tools have changed. I use better editors, better debuggers, better profilers. But the fundamental skill - understanding what the machine actually does - is as valuable as ever.
Maybe more valuable. LLMs are the ultimate high-level language, and they bring the ultimate abstraction bloat. As we move toward AI-generated code at scale, the need for humans who can actually read the assembly output becomes a critical safety and performance check. Someone has to verify what the machine is really doing. Someone has to catch the hallucination before it ships. That someone needs to read assembly.
As abstraction layers pile up, fewer people understand what's underneath. That understanding is a competitive advantage.
"As abstraction layers pile up, fewer people understand what's underneath. That understanding is a competitive advantage."
Sources
- Stack Overflow: SIMD Intrinsics Performance — Documents 5-6x performance gains from explicit SIMD that auto-vectorization can't match
- Red Hat: Constant-Time Cryptography — Research showing compilers break constant-time guarantees in 19 production cryptographic libraries
- Intel: Timing Side Channel Mitigation — Intel's guidance on why assembly is required for secure cryptographic implementations
- Clangover Attack: Compilers Break Constant-Time Code — Research on how compilers introduce timing vulnerabilities in cryptographic code