SLICE-TUNE: A System for High Performance DNN Autotuning
Source
PROCEEDINGS OF THE TWENTY-THIRD ACM/IFIP INTERNATIONAL MIDDLEWARE CONFERENCE, MIDDLEWARE 2022
Author(s)
Dhakal, Aditya
Ramakrishnan, K. K.
Kulkarni, Sameer G.
Sharma, Puneet
Cho, Junguk
Abstract
Autotuning DNN models prior to their deployment is an essential but time-consuming task. Using expensive (and power-hungry) GPU and TPU accelerators efficiently is also key. Since DNNs do not always use a GPU fully, spatial multiplexing of multiple models can provide just the right amount of GPU resources for each DNN. We find that a DNN model tuned with the maximum GPU resources has higher inference latency if fewer GPU resources are available at inference time. We present methods to tune a DNN model with the right amount of accelerator resources during tuning, so that the tuned model achieves low inference latency even when a wide range of GPU resources is available at inference time. Further, existing autotuning frameworks take a long time to tune a model due to inefficient utilization of the client- and server-side CPUs and GPUs. Our system, SLICE-TUNE, improves several autotuning frameworks to efficiently use system resources by rethinking the partitioning of tasks between the client and the server (where models are profiled on the server GPU) in a Kubernetes environment. We increase parallelism during tuning by sharding the model being tuned across multiple tuning application instances, enabling concurrent tuning of different operators of a model. We also scale server instances to achieve better GPU multiplexing. SLICE-TUNE reduces DNN autotuning time both on a single GPU and in GPU clusters: it decreases autotuning time by up to 75% and increases autotuning throughput by a factor of 5 across three different autotuning frameworks (TVM, Ansor, and Chameleon).
Subjects
Computer Science
