Abstract:
The complexity of predominant machine learning inference deployments and batch processing demands careful resource utilization without degrading performance. On the end-user side, when cost is a concern and model inference is run on CPUs, setting the batching systematically is a challenge. Determining the batch size requires analyzing different settings across various resource configurations; the decision space becomes large, and domain expertise may be necessary to satisfy the end users' cost and performance requirements. Inference performance mainly depends on the computation time of the given neural network, the task processing density (p), and the compute power of the nodes (c). Comparing the estimated end-to-end inference time against measured times for AlexNet and ResNet50 under a given configuration and different batch sizes yields different coefficients for the batch sizes suggested by the profiler we propose, design, and implement. The inference time estimation formulas used by other researchers, together with our experimental results, motivated us to propose flops(NN)*p/c as a representation for placing the right batch size on the right node.
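Read as a per-node placement score (this reading is ours; the text states only the product flops(NN)*p/c), the representation can be written as

\[
  \mathrm{score}(\mathrm{NN}, \mathrm{node}) \;=\; \frac{\mathrm{FLOPs}(\mathrm{NN}) \cdot p}{c},
\]

so that, for the same network and processing density, a node with higher compute power c yields a lower score.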
The remaining challenge is then how to steer batching through this representation efficiently without degrading performance. As a further contribution, we design an optimizer that minimizes the end-to-end inference time with respect to the batch size.
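As an illustration only, a minimal sketch of such an optimizer could enumerate candidate batch sizes and keep the one with the smallest estimated end-to-end time; the function choose_batch_size, the candidate set, the fixed per-batch overhead, and the optional latency bound below are hypothetical placeholders built on the flops(NN)*p/c representation, not the paper's actual optimizer or profiler coefficients.

def choose_batch_size(flops_nn, p, c, n_requests,
                      candidates=(1, 2, 4, 8, 16, 32, 64),
                      per_batch_overhead=0.01,
                      max_batch_latency=None):
    """Pick the candidate batch size with the lowest estimated end-to-end time,
    optionally subject to a (hypothetical) per-batch latency bound."""
    best, best_time = None, float("inf")
    for b in candidates:
        # Per-batch compute estimated from the flops(NN) * p / c representation,
        # plus a fixed overhead standing in for dispatch/framework cost.
        batch_time = (flops_nn * b * p) / c + per_batch_overhead
        if max_batch_latency is not None and batch_time > max_batch_latency:
            continue
        n_batches = -(-n_requests // b)  # ceiling division
        total = n_batches * batch_time
        if total < best_time:
            best, best_time = b, total
    return best, best_time

# Example with made-up numbers for a ResNet50-sized model on one CPU node:
# choose_batch_size(flops_nn=4.1e9, p=1.0, c=2.0e11, n_requests=1000)

Under this simplistic model larger batches simply amortize the fixed overhead, so a latency bound or a richer profiled cost model is what makes the choice non-trivial.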
This research mainly targets end users of cloud platforms who obtain pay-per-use configurations for their inference workloads, and the proposed representation and optimization technique could be extended to such platforms. While other batching techniques in the literature mainly consider various scheduling mechanisms, our framework motivates future researchers to analyze the computation and the configurations in depth. Hence the proposed system provides a configuration-oriented batching mechanism and offers mitigation techniques to improve inference performance.