How to debug kernel memory corruption on Apple Silicon

#security #debugging #lowlevel #macos

If you've ever stared at a kernel panic log at 2 AM, wondering how a perfectly innocent-looking IOConnectCallMethod call ended up trampling a vm_map structure, you know the unique pain of low-level memory bugs. I spent most of last month helping a friend chase down a use-after-free in a driver targeting Apple Silicon, and the experience reminded me how brutal kernel-space debugging still is — even with all the modern mitigations.

The recent public writeup on a macOS kernel memory corruption exploit (the one making the rounds on Hacker News) is a great teaching artifact. Whether or not you write kernel code, the techniques used to find, trigger, and reason about these bugs are increasingly relevant for anyone working close to the metal: driver authors, OS researchers, sandbox escape defenders, and yes, people building anything that touches IOKit or Mach.

Let's talk about how these bugs actually show up, why they're hard to debug, and what you can practically do about them.

The problem: a corrupted pointer in a place you don't control

Kernel memory corruption almost always boils down to one of three classics:

Use-after-free — an object gets released, but a stale reference keeps writing to that memory after something else has been allocated there.
Out-of-bounds write — usually an integer overflow in a size calculation, or a missing length check on user-controlled data crossing the syscall boundary.
Type confusion — two code paths disagree about what kind of object lives at a given address, often because of a polymorphic IOKit class hierarchy.

What makes them brutal is that the symptom is almost never near the cause. You corrupt a freelist pointer in kalloc.var96 and the kernel happily keeps running until, five seconds later, some unrelated allocation hands out a bogus pointer to the VFS layer, which then panics in a function you've never heard of.

I ran into a variant of this in a kext I was reviewing — the panic backtrace pointed at OSDictionary::getObject, which had nothing to do with the actual bug. The real cause was a refcount underflow several hundred milliseconds earlier in an entirely different subsystem.

Root cause: why modern ARM kernels still get this wrong

The honest answer is that kernels are largely written in C and C++, and those languages don't enforce memory safety. Apple Silicon has hardware features that help — Pointer Authentication Codes (PAC) make it dramatically harder to forge code pointers, and the kernel uses zone-based allocators with various hardening tricks. But none of that prevents the actual write from happening. It just makes the exploitation harder once the write occurs.

A typical bug pattern looks like this (simplified):

// Pseudocode of the classic pattern
kern_return_t handle_user_request(user_request_t *req) {
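    // BUG: count comes from userspace and is multiplied without an overflow
    // check. The product is truncated to 32 bits here; with a 64-bit size_t
    // and a 32-bit count it could not wrap.
    uint32_t total = req->count * (uint32_t)sizeof(entry_t);
    entry_t *buf = kalloc(total);

    // If count was huge, total wraps. buf is tiny.
    // But the loop below uses count, not total.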
    for (uint32_t i = 0; i < req->count; i++) {
        buf[i] = req->entries[i]; // OOB write into adjacent zone
    }
    ...
}

The fix is one line. Finding the bug in 200,000 lines of kernel code is the hard part.
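
To make the wrap concrete, here's a quick Python model of the arithmetic. It's a sketch, not kernel code: it assumes the product is truncated to 32 bits and that entry_t is 16 bytes, and both of those numbers are mine for illustration. The checked variant mirrors what a __builtin_mul_overflow-style guard would do.

```python
from typing import Optional

U32_MAX = 0xFFFF_FFFF
ENTRY_SIZE = 16  # hypothetical sizeof(entry_t), chosen for illustration


def unchecked_total(count: int) -> int:
    """What the buggy code computes: the product truncated to 32 bits."""
    return (count * ENTRY_SIZE) & U32_MAX


def checked_total(count: int) -> Optional[int]:
    """Overflow-checked multiply: None means 'reject the request'."""
    product = count * ENTRY_SIZE
    return None if product > U32_MAX else product


count = 0x1000_0001  # attacker-chosen element count
print(hex(unchecked_total(count)))  # allocation size after the wrap
print(checked_total(count))         # the checked version refuses
print(checked_total(4))             # a sane request passes through
```

With that count the allocation shrinks to a single entry while the copy loop still walks roughly 268 million of them, which is the whole bug in one line of arithmetic.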

Step-by-step: how to actually debug this

Here's the workflow I've settled on after enough failed attempts. None of this is novel, but having a consistent process saves you hours.

1. Capture the panic with full context

First, get the full panic report, ideally with backtraces from every core rather than just the one that panicked. On macOS, panic reports land in /Library/Logs/DiagnosticReports/. Check that the report records the KASLR slide, because you'll need it to resolve symbols later.

# Search the last hour of the unified log for the KASLR slide message
# (the panic backtrace itself comes from the report in DiagnosticReports)
log show --predicate 'eventMessage contains "Kernel slide"' \
    --last 1h --style syslog

You need both the slide value and the panic backtrace. Without the slide, your addresses are noise.
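
If you end up doing this more than once, a few lines of Python can pull the layout values out of the report text. This is a sketch under assumptions: the "Kernel slide:" and "Kernel text base:" labels are what I'd expect to see in a panic report, the sample text is fabricated, and extract_layout is a hypothetical helper name. Adjust the patterns to whatever your report actually contains.

```python
import re
from typing import Dict, Optional

# Field labels are assumptions about the panic-report wording.
SLIDE_RE = re.compile(r"Kernel slide:\s+(0x[0-9a-fA-F]+)")
BASE_RE = re.compile(r"Kernel text base:\s+(0x[0-9a-fA-F]+)")


def extract_layout(panic_text: str) -> Dict[str, Optional[int]]:
    """Return the slide and text base as ints, or None when a field is absent."""
    def grab(pattern: "re.Pattern[str]") -> Optional[int]:
        match = pattern.search(panic_text)
        return int(match.group(1), 16) if match else None

    return {"slide": grab(SLIDE_RE), "text_base": grab(BASE_RE)}


# Fabricated sample, only to show the shape of the input.
SAMPLE = """\
panic(cpu 2 caller 0xfffffe001a9c3f10): made-up panic string
Kernel slide:      0x0000000012344000
Kernel text base:  0xfffffe0019348000
"""

layout = extract_layout(SAMPLE)
for name, value in layout.items():
    print(name, hex(value) if value is not None else "missing")
```

Returning None for a missing field, rather than raising, makes it easy to spot reports that were truncated before the layout lines were written.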

2. Resolve addresses against the kernel image

Grab the kernelcache for the exact build that crashed, then use lldb or a disassembler to map the panic addresses back to functions. The kernelcache layout changed over the last few years, so make sure you're using a tool that understands the current format.

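# Subtract the slid kernel base from a faulting address to get an image offset
# (both values are placeholders; take the real ones from your panic log)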
python3 -c "print(hex(0xfffffe0012345678 - 0xfffffe0000000000))"
# Then look up the resulting offset in the kernelcache symbols
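
The same arithmetic is worth wrapping in two small helpers, because mixing up "subtract the slide" and "subtract the slid base" is an easy mistake to make at 2 AM. This is a minimal sketch: unslide and image_offset are names I made up, and the slide value below is a placeholder rather than anything from a real machine.

```python
def unslide(runtime_addr: int, slide: int) -> int:
    """Convert a runtime (slid) address to the address in the static image."""
    if slide < 0 or runtime_addr < slide:
        raise ValueError("slide is not plausible for this address")
    return runtime_addr - slide


def image_offset(runtime_addr: int, slid_base: int) -> int:
    """Offset of a runtime address from the slid base of the kernel image."""
    if runtime_addr < slid_base:
        raise ValueError("address lies below the kernel base")
    return runtime_addr - slid_base


# Same placeholder numbers as the one-liner above.
print(hex(image_offset(0xfffffe0012345678, 0xfffffe0000000000)))

# Placeholder slide, to show the other direction.
print(hex(unslide(0xfffffe0019a00120, 0x12344000)))
```

The range checks are deliberately crude; their only job is to catch an address from the wrong boot or a slide pasted from the wrong report.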

3. Reproduce under instrumentation

This is where most people give up, and where you actually have to push through. If you can write a userspace harness that triggers the panic, you can run it against a sanitizer-equipped kernel build (a KASAN variant, if you have one) or use AddressSanitizer on the userspace components that interact with the kext.

For kernel code itself, the most useful open-source tool is syzkaller, a coverage-guided fuzzer for system call interfaces. Its most mature target is Linux, so expect support for other kernels, XNU included, to need more setup. It's not magic: you need to write descriptions of the interfaces you want fuzzed. Once you have those, it'll find bugs you'd never spot by reading code.

# Tiny example: a userspace harness that hammers a suspect ioctl.
# '/dev/your_target' and the request code 0x80000001 are placeholders.
import ctypes, fcntl, os

fd = os.open('/dev/your_target', os.O_RDWR)
try:
    for i in range(10_000):
        # Vary the size aggressively; overflow candidates live here
        buf = ctypes.create_string_buffer(b'A' * (i * 7919 % 65536))
        try:
            fcntl.ioctl(fd, 0x80000001, buf)
        except OSError:
            pass  # We're looking for panics, not clean errors
finally:
    os.close(fd)

4. Bisect the trigger

Once you have a reliable repro, shrink it. Halve the inputs, halve the iterations, halve the syscall sequence. Keep going until you have the minimal sequence that still panics. This is the single most valuable debugging skill I've developed, and it works on every layer of the stack.
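
The shrinking loop is mechanical enough to automate. Here's a rough sketch of a greedy reducer: it tries dropping halves of the sequence, then quarters, and so on down to single steps. The still_fails callback is an assumption on my part; in practice it would replay the candidate sequence against a test machine and report whether it panicked. Here it's faked with a toy predicate so the example runs anywhere.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def minimize(steps: Sequence[T], still_fails: Callable[[List[T]], bool]) -> List[T]:
    """Greedy chunk removal: keep a smaller sequence whenever it still fails."""
    current = list(steps)
    chunk = max(len(current) // 2, 1)
    while True:
        i = 0
        removed_any = False
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if candidate and still_fails(candidate):
                current = candidate   # that chunk was irrelevant, drop it
                removed_any = True
            else:
                i += chunk            # that chunk is needed, keep it
        if chunk == 1 and not removed_any:
            return current            # no single step can be removed
        chunk = max(chunk // 2, 1)


# Toy stand-in for "does this sequence still panic?"
def toy_panics(seq: List[str]) -> bool:
    return "free" in seq and "use" in seq and seq.index("free") < seq.index("use")


trace = ["open", "map", "noise1", "free", "noise2", "use", "close"]
print(minimize(trace, toy_panics))
```

The result is only minimal in the sense that removing any one remaining step stops the failure. For flaky repros you'd want still_fails to retry a few times before answering.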

Prevention tips

A few things that have genuinely cut down the bug rate on projects I've worked on:

Use bounded arithmetic helpers everywhere. __builtin_mul_overflow is your friend. If you're multiplying a user-supplied count by a size, always check for overflow.
Write Rust for new components when you can. I haven't tested every edge case of Rust for Linux, but for greenfield kernel modules outside Apple's ecosystem, the memory-safety guarantees pay for themselves.
Treat every syscall boundary as hostile. Validate sizes, copy data once, never re-read user memory after validation (TOCTOU bugs love that).
Run continuous fuzzing. In my experience, even a modest syzkaller setup running overnight turns up bugs that careful code review missed.
Read panic logs from beta channels. Apple's beta releases leak useful information about where the kernel is currently fragile.
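
The "copy once" rule from the list above is easy to state and easy to get wrong, so here's a toy model of a double fetch. Everything in it is hypothetical: RacyRequest stands in for a user-mapped struct that another thread can rewrite between reads, and MAX_ENTRIES is an arbitrary limit.

```python
class RacyRequest:
    """Stand-in for user memory: each read of `count` returns the next
    scripted value, simulating a concurrent writer in userspace."""

    def __init__(self, scripted_counts):
        self._counts = iter(scripted_counts)

    @property
    def count(self) -> int:
        return next(self._counts)


MAX_ENTRIES = 64  # hypothetical limit


def double_fetch(req) -> int:
    if req.count > MAX_ENTRIES:   # first read: this value gets validated
        raise ValueError("count too large")
    return req.count              # second read: may not match what was checked


def single_fetch(req) -> int:
    count = req.count             # copy once into storage the caller can't touch
    if count > MAX_ENTRIES:
        raise ValueError("count too large")
    return count                  # only the validated copy is used afterwards


print(double_fetch(RacyRequest([8, 1_000_000])))  # validation was bypassed
print(single_fetch(RacyRequest([8, 1_000_000])))  # the checked value is the used value
```

The model is deterministic so the effect is visible every time; the real bug is a race, which is exactly why it survives testing.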

The uncomfortable truth is that as long as the kernel is written in C and C++, these bugs will keep appearing. The mitigations make exploitation harder, not impossible. The best you can do is shrink the window between bug introduction and discovery, and that's a tooling problem, not a heroics problem.

If you're new to this kind of work, start by reading the XNU source on opensource.apple.com and following the syzkaller getting-started guide. It's a long road, but the bugs you'll find along the way are genuinely fascinating.