Your description is exactly right. We create a search space of all possible kern...

erichocean · 2025-08-21T07:11:57 1755760317

> that we mitigate with smarter search

aka "a heuristic"

jakestevens2 · 2025-08-21T15:19:24 1755789564

See my other comments about static profiling of kernels. There are ways of improving the search that keep runtime at the heart of it.

jafioti · 2025-08-21T16:30:25 1755793825

MCTS / RL isn't really a heuristic. But yes, heuristics can be used temporarily to keep the search space small, and removed over time as the search algorithm improves.

gregorygoc · 2025-08-21T12:05:34 1755777934

Exactly, I was going to ask about this bit…

UncleOxidant · 2025-08-20T17:55:57 1755712557

How long does this typically take? It sounds time consuming. Also, it seems like this could be similar to doing a GA?

jakestevens2 · 2025-08-20T18:01:40 1755712900

That depends on the model architecture and how it was written since that informs the size of the search space.

The typical range is 10 minutes to 10 hours. It won't be fast, but you only have to do it once, and then those optimizations are set for every forward pass.

sitkack · 2025-08-21T00:09:03 1755734943

Do you learn the capabilities of the underlying hardware relative to the kernel src? You should be able to start predicting perf using learned static profiling.

jakestevens2 · 2025-08-21T00:51:01 1755737461

Not today but we will implement memoization of kernels for each hardware backend, yes.
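
To make the memoization idea concrete, here is a minimal sketch of a per-backend timing cache. It is not the project's implementation: `KernelTimingCache`, `lookup`, and the `profile` callback are hypothetical names, and the sketch assumes a kernel can be identified by a hash of its source text plus a backend label.

```python
import hashlib
from typing import Callable, Dict, Tuple


class KernelTimingCache:
    """Hypothetical cache of measured kernel runtimes, keyed per backend."""

    def __init__(self) -> None:
        # (backend name, sha256 of kernel source) -> measured runtime
        self._timings: Dict[Tuple[str, str], float] = {}
        self.misses = 0  # how many times we actually had to profile

    def lookup(
        self,
        backend: str,
        kernel_src: str,
        profile: Callable[[str], float],
    ) -> float:
        """Return the cached runtime, profiling only on a cache miss."""
        digest = hashlib.sha256(kernel_src.encode("utf-8")).hexdigest()
        key = (backend, digest)
        if key not in self._timings:
            self.misses += 1
            # `profile` stands in for actually running the kernel (assumption).
            self._timings[key] = profile(kernel_src)
        return self._timings[key]
```

Because the backend is part of the key, the same kernel source is profiled once per backend rather than once overall, which matches the "for each hardware backend" wording above.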

jakestevens2 · 2025-08-20T18:04:17 1755713057

You can also set a time budget for how long you'd like the search to run for to avoid wasting time on diminishing returns.
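
As a rough illustration of a time-budgeted search, the sketch below tries candidates until a deadline passes and keeps the fastest one seen so far. It is an assumption-laden toy, not the tool's actual search: `budgeted_search`, `measure`, and `budget_s` are hypothetical names, and the candidates are tried in plain iteration order rather than with MCTS or any smarter strategy.

```python
import time
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def budgeted_search(
    candidates: Iterable[T],
    measure: Callable[[T], float],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[T], float]:
    """Return the lowest-cost candidate found before the budget runs out."""
    deadline = clock() + budget_s
    best: Optional[T] = None
    best_cost = float("inf")
    for cand in candidates:
        # Stop once the budget is spent; keep whatever we found so far.
        if clock() >= deadline:
            break
        cost = measure(cand)  # e.g. measured runtime of this kernel variant
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost
```

The deadline is checked only between candidates, so a single slow measurement can overrun the budget; a real implementation would likely need a per-measurement timeout as well.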

pilooch · 2025-08-20T21:53:53 1755726833

Is this a bit similar to what TensorRT does, but in a more open manner?