Dhakal, Aditya; Ramakrishnan, K. K.; Kulkarni, Sameer G.; Sharma, Puneet; Cho, Junguk
2025-08-28
ISBN: 978-1-4503-9340-9
DOI: 10.1145/3528535.3565247
http://repository.iitgn.ac.in/handle/IITG2025/19470

Abstract: Autotuning DNN models prior to deployment is an essential but time-consuming task. Using expensive (and power-hungry) GPU and TPU accelerators efficiently is also key. Since DNNs do not always use a GPU fully, spatial multiplexing of multiple models can provide just the right amount of GPU resources for each DNN. We find that a DNN model tuned with the maximum GPU resources has higher inference latency if fewer GPU resources are available at inference time. We present methods to tune a DNN model so that the right amount of accelerator resources is provided during tuning. Thus, even when a wide range of GPU resources is available at inference time, the tuned model achieves low inference latency. Further, existing autotuning frameworks take a long time to tune a model due to inefficient utilization of the client- and server-side CPU and GPU. Our system, SLICE-TUNE, improves several autotuning frameworks to efficiently use system resources by rethinking the partitioning of tasks between the client and the server (where models are profiled on the server GPU) in a Kubernetes environment. We increase parallelism during tuning by sharding the model being tuned across multiple tuning application instances, enabling concurrent tuning of different operators of a model. We also scale server instances to achieve better GPU multiplexing. SLICE-TUNE reduces DNN autotuning time on a single GPU and in GPU clusters. SLICE-TUNE decreases DNN autotuning time by up to 75% and increases autotuning throughput by a factor of 5 across 3 different autotuning frameworks (TVM, Ansor, and Chameleon).

Language: en-US
Subject: Computer Science
Title: SLICE-TUNE: A System for High Performance DNN Autotuning
Type: Conference Paper; Proceedings Paper
https://dl.acm.org/doi/pdf/10.1145/3528535.3565247
Pages: 228-240
WOS: 001061556200018