The Nx project brings Elixir to the domain of AI & ML. It makes a bold promise: to leverage the BEAM in ML workflows, enabling us to run ML models concurrently, distribute them over multiple nodes, and partition them across several GPUs.
Closer inspection reveals that Nx is walking a tightrope: its computations are powered by NIFs, whose scheduling is a tricky business, and by GPUs, which can only be driven efficiently by a small number of concurrent processes. This raises the question: how does Nx deliver the high throughput it boasts of?
Nx answers this question with “Servings” and “Batches” (Nx.Serving and Nx.Batch), mechanisms that group requests together to reduce and distribute the calls to the NIF and the GPU. But how exactly does that work? What trade-offs are we making by batching requests?
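As a rough sketch of what this looks like in code, the snippet below loosely follows the examples in the Nx.Serving documentation; the doubling computation, the MyServing name, and the batch size and timeout values are placeholders, and exact option placement may differ between Nx versions:

```elixir
# A serving wraps a compiled computation; here we JIT a trivial
# "double every element" function purely for illustration.
serving =
  Nx.Serving.new(fn opts -> Nx.Defn.jit(&Nx.multiply(&1, 2), opts) end)

# Run under a supervisor: the serving process accumulates incoming
# requests into a batch until batch_size is reached or batch_timeout
# (in ms) expires, then makes a single call into the NIF/GPU.
children = [
  {Nx.Serving,
   serving: serving,
   name: MyServing,
   batch_size: 16,
   batch_timeout: 100}
]
Supervisor.start_link(children, strategy: :one_for_one)

# Callers submit small batches (even of size 1); grouping them into
# larger GPU-sized batches happens transparently in the serving process.
Nx.Serving.batched_run(MyServing, Nx.Batch.stack([Nx.tensor([1, 2, 3])]))
```

The batch_timeout knob already hints at the central trade-off: a longer window yields larger, more efficient GPU calls, but adds latency for individual callers.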
In this talk we will examine Nx’s Servings: how they address the scheduling problem, the trade-offs and performance characteristics of this approach, and the other constraints that come into play when running an ML service.