[1/22/26] It's been a few weeks of work (and travelling), but I'm here to report my current progress. Frankly, the past few weeks involved a lot of reading and learning on these topics:
PyTorch 2 (TorchDynamo/TorchInductor): this was a useful starting point, as the PyTorch 2 paper discusses the implementation decisions behind TorchDynamo and its graph-capture mechanism. This helped me narrow my design down to PyTorch's torch.fx toolkit.
A major development is my Python frontend and Rust setup. This project is actually only my second time ever using Rust: for setup purposes, I familiarized myself with Cargo (Rust's package manager), and I'm getting the hang of Rust's ownership and lifetime model. I also played around with Rust↔Python bindings (via PyO3) and Rust bindings for MLIR (via Melior). What's next? I have to get my hands dirty with MLIR; I have been using the Toy tutorial to learn the implementation fundamentals of an MLIR-based compiler.

[Image: sopt API usage]

Above, I have a very simple 2-layer network defined using PyTorch. Similar to PyTorch, I expose a compile() decorator under my own sopt library. Taking a step back from the technical side, I also wanted
to share how fascinating I found this step; I essentially built my own importable Python library that funnels into the binary for my Rust-based compiler! Many libraries that I use
on a daily basis follow some form of what I've just implemented here, which was eye-opening to me.
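The pattern behind this is small but powerful. Below is a minimal, hypothetical sketch of how a Python decorator can funnel work into an external compiler binary; the `sopt-compiler` name, the JSON payload shape, and the eager fallback are illustrative assumptions, not sopt's actual interface.

```python
import functools
import json
import shutil
import subprocess

COMPILER_BIN = "sopt-compiler"  # hypothetical name/path of the Rust binary

def compile(fn):
    """Decorator that hands a description of fn to an external compiler."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        payload = json.dumps({"fn": fn.__name__})
        # Only shell out if the backend binary is actually on PATH.
        if shutil.which(COMPILER_BIN):
            subprocess.run([COMPILER_BIN], input=payload.encode(), check=True)
        return fn(*args, **kwargs)  # fall back to eager execution in this sketch
    return wrapper

@compile
def forward(x):
    return x * 2
```

A real implementation would serialize the captured graph rather than just the function name, but the library-into-binary plumbing looks roughly like this.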
Anyway, during execution, sopt.compile() converts the torch.fx graph into a list of JSON objects that the Rust backend can easily receive. I set up a class, PyNode, that encapsulates the important data fields from FX nodes.

[Image: sopt compiler receiver endpoint for sopt.compile()]

From this point, I am working on setting up my first dialect/IR on the MLIR skeleton, which I call the "soptfx" dialect. The idea is to lower the PyNodes to MLIR operations in order to build the data graph; since FX nodes come in three main flavors (placeholder, call_function, and output), I handle this lowering separately for each case. My goal here is to emit an accurate .mlir file in order to gauge the correctness of my current logic.
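To make the PyNode idea concrete, here is a self-contained sketch of flattening FX-style node fields (op, name, target, args) into JSON. The FakeNode class below stands in for torch.fx.Node so the example runs without PyTorch; the field names mirror the real ones, but the exact PyNode schema is my illustration, not sopt's actual format.

```python
import json
from dataclasses import dataclass

@dataclass
class FakeNode:
    """Stand-in for torch.fx.Node, carrying only the fields we serialize."""
    op: str        # "placeholder" | "call_function" | "output"
    name: str
    target: str
    args: tuple = ()

def to_pynode(node: FakeNode) -> dict:
    """Extract the fields the Rust receiver cares about into a plain dict."""
    return {
        "op": node.op,
        "name": node.name,
        "target": node.target,
        "args": [str(a) for a in node.args],
    }

# A tiny graph, x -> relu(x) -> output, as torch.fx would record it.
graph = [
    FakeNode("placeholder", "x", "x"),
    FakeNode("call_function", "relu", "torch.relu", ("x",)),
    FakeNode("output", "output", "output", ("relu",)),
]
payload = json.dumps([to_pynode(n) for n in graph])
```

On the Rust side, serde can deserialize this list straight into a Vec of structs with the same field names, which is what makes JSON a convenient bridge format here.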
[1/4/26] So, I've been working on this project for the past few weeks. Up until yesterday, I spent most of my time reading up on basic theory behind
ML/DL compilers. I plan to add much more to this blog to help me reinforce whatever I learn and to log my journey into this field (in case it inspires anyone).
I'll get started with some preliminaries: my related background up to this point involves GPU programming (CUDA), systems programming, ML systems, and a bit of compiler construction (LLVM).
All of these were picked up through courses at my school, UIUC.
Over the past few weeks, I read up on a few more technologies:
MLIR
TVM
Triton
TensorRT
nvFuser
Now, I somewhat understand where these technologies fit into a compiler stack.
MLIR serves as compiler infrastructure, the "skeleton" of the compiler we build;
TVM is a full-stack compiler that emphasizes loop-level/tiling optimizations and code generation for heterogeneous devices;
Triton is a language/IR/JIT compiler (used by PyTorch 2.0) that helps write extensible GPU kernels;
TensorRT is an inference engine that optimizes models for NVIDIA GPUs;
and nvFuser is a "Fusion Code Generator" that generates code optimized for NVIDIA GPUs.
I will revisit these definitions the more I explore :)
Now, I have finalized the design I want to use for this project. I call it "soptRT" (I still need to think of a better name).
I've broken this project down into 2 phases:
Phase I: To better learn low-level ML optimizations, I want to introduce my own compile() trigger in PyTorch that calls a Rust-based compiler built on MLIR. This compiler will run optimization passes (fusion, quantization, memory mapping, etc.) and funnel into an existing backend (Triton/TVM).
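For reference, PyTorch 2 already exposes a hook for exactly this kind of bridge: torch.compile accepts a custom backend, a function that receives the captured torch.fx.GraphModule plus example inputs and returns a callable. A minimal stub might look like the following, where the real sopt logic (serialize the graph, call the Rust/MLIR compiler) would replace the body; sopt_backend is my placeholder name.

```python
def sopt_backend(gm, example_inputs):
    """Custom torch.compile backend stub.

    A real sopt backend would serialize gm.graph to JSON and hand it to
    the Rust/MLIR compiler here; this stub just returns the captured
    graph's forward unchanged, i.e. a no-op compilation.
    """
    return gm.forward

# Usage (requires PyTorch 2.x):
#   model_opt = torch.compile(model, backend=sopt_backend)
```

The nice property of this API is that correctness comes for free at first: returning gm.forward unchanged gives a working (if unoptimized) pipeline to build passes on top of.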
Phase II: Next, I want to design my own kernel generator. After reading a little, I realized that code generation is a very interesting problem with several competing approaches: TVM follows an "ML for ML" paradigm, using learned models to search for good schedules, while engineers handcraft kernels for NVIDIA's cuDNN. Details for this part are TBD and will require me to read a bit more.
Great! Now that the plan is out of the way, I will update this page over time with challenges I run into. For now, I am working on creating the "bridge" from my own PyTorch compile() and my Rust backend with MLIR.
Stay tuned!
Cool links: I'd also like to highlight some cool links I found throughout this project. There are a lot of startups and innovations in the field of ML compilers that I want to explore further:
Tile IR: NVIDIA released this somewhat recently. This is a "low-level tile virtual machine" allowing a developer to work in terms of tiles.
Modular: really cool company led by Chris Lattner, whom I've been following for a while. Their language, Mojo, seems very well designed, and I am curious to see how its compiler works.