Please use this identifier to cite or link to this item: http://ir.lib.seu.ac.lk/handle/123456789/7575
Full metadata record
DC Field | Value | Language
dc.contributor.author | Dilushan, G. | -
dc.contributor.author | Riza, M. S. I. | -
dc.contributor.author | Jananie, J. | -
dc.date.accessioned | 2025-06-01T08:30:39Z | -
dc.date.available | 2025-06-01T08:30:39Z | -
dc.date.issued | 2024-11-06 | -
dc.identifier.citation | Conference Proceedings of the 13th Annual Science Research Session – 2024, "Empowering Innovations for Sustainable Development Through Scientific Research", November 6th 2024. Faculty of Applied Sciences, South Eastern University of Sri Lanka, Sammanthurai. pp. 32. | en_US
dc.identifier.isbn | 978-955-627-029-7 | -
dc.identifier.uri | http://ir.lib.seu.ac.lk/handle/123456789/7575 | -
dc.description.abstract | Deploying machine learning inference as batch workloads demands careful resource utilization if performance is not to degrade. On the end-user side, when cost considerations drive the choice of model inference on CPUs, setting up batching systematically becomes a challenge: determining the batch size requires analysing many settings across different resource configurations, the decision space grows large, and domain expertise may be needed to satisfy the users' cost and performance requirements. Inference performance depends mainly on the computation time of the given neural network, the task processing density (p), and the compute power of the nodes (c). An experiment comparing the estimated end-to-end inference time with measured times for AlexNet and ResNet50, over a given configuration and different batch sizes, yields different coefficients for the batch sizes suggested by the profiler we designed and implemented. The inference-time estimation formulas used by other researchers, together with our experimental results, motivated us to propose a representation for placing the right batch size on the right node: flops(NN) * p / c. The remaining challenge is how to manipulate batching through this representation efficiently without degrading performance. As a further contribution, we design an optimizer that minimizes end-to-end inference time with respect to the batch size. This research focuses on end users of cloud platforms who obtain pay-per-use configurations for their inference workloads, and the proposed representation and optimization technique could be extended to such platforms. While other batching techniques concentrate mainly on scheduling mechanisms, our framework encourages future researchers to analyse the computation and the configurations in depth. The proposed system therefore provides a configuration-oriented batching mechanism and offers mitigation techniques to improve inference performance. | en_US
dc.language.iso | en_US | en_US
dc.publisher | Faculty of Applied Sciences, South Eastern University of Sri Lanka, Sammanthurai. | en_US
dc.subject | Batch scheduling | en_US
dc.subject | CPU clusters | en_US
dc.subject | Deep learning inference | en_US
dc.subject | Quality of service (QoS) | en_US
dc.subject | Resource utilization | en_US
dc.title | QoS aware automated batch inference | en_US
dc.type | Article | en_US
Appears in Collections: 13th Annual Science Research Session

Files in This Item:
File | Description | Size | Format
QoS AWARE AUTOMATED BATCH INFERENCE.pdf |  | 9.06 kB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
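
The abstract above proposes placing the right batch size on the right node via the quantity flops(NN) * p / c, plus an optimizer that minimizes end-to-end inference time over the batch size under QoS requirements. The following minimal Python sketch only illustrates that idea under stated assumptions: the Node fields, the fixed per-batch overhead, the candidate batch sizes, and the QoS deadline are hypothetical and are not taken from the paper's profiler or optimizer.

# Illustrative sketch (not the authors' implementation): estimate end-to-end
# inference time from flops(NN) * p / c plus an assumed per-batch overhead,
# then pick the batch-size/node pair with the lowest per-request latency
# among configurations whose batch time respects a QoS deadline.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    compute_power: float   # c: effective FLOP/s of the CPU node (assumed unit)
    batch_overhead: float  # assumed fixed per-batch setup cost, in seconds

def estimate_batch_time(flops_nn: float, batch_size: int,
                        processing_density: float, node: Node) -> float:
    """End-to-end time for one batch: overhead + batch_size * flops(NN) * p / c."""
    return node.batch_overhead + batch_size * flops_nn * processing_density / node.compute_power

def choose_batch_and_node(flops_nn: float, processing_density: float,
                          nodes: list[Node], candidate_batch_sizes: list[int],
                          qos_deadline: float):
    """Return the (batch_size, node) pair minimizing per-request time
    while keeping each batch within the QoS deadline."""
    best = None  # (per_request_time, batch_size, node)
    for node in nodes:
        for b in candidate_batch_sizes:
            batch_time = estimate_batch_time(flops_nn, b, processing_density, node)
            if batch_time > qos_deadline:
                continue  # this configuration violates the QoS requirement
            per_request = batch_time / b
            if best is None or per_request < best[0]:
                best = (per_request, b, node)
    if best is None:
        raise ValueError("No configuration satisfies the QoS deadline")
    return best[1], best[2]

if __name__ == "__main__":
    # Hypothetical numbers: an AlexNet-scale model (~0.7 GFLOPs per image) on two CPU nodes.
    nodes = [Node("cpu-node-1", 2.0e11, 0.05), Node("cpu-node-2", 3.5e11, 0.08)]
    b, n = choose_batch_and_node(flops_nn=7.0e8, processing_density=1.2,
                                 nodes=nodes, candidate_batch_sizes=[1, 4, 8, 16, 32],
                                 qos_deadline=0.5)
    print(f"Suggested batch size {b} on {n.name}")

Under this assumed linear cost model, larger batches amortize the per-batch overhead, so it is the QoS deadline that bounds the chosen batch size; the paper's profiler instead derives per-batch-size coefficients from measured AlexNet and ResNet50 runs.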