New Research Enables Faster, More Efficient Machine Learning Models
Whenever you give a prompt to ChatGPT or query another machine learning (ML) model, the provider's need for efficiency affects how quickly you get a response.
Whether it’s ChatGPT or Meta’s AI assistant, ML services aim to maximize throughput, serving as many people as possible. At the same time, these companies want to minimize latency, answering each user’s query as quickly as possible.
Anyone running an ML-powered service must balance these two fundamental aspects of computer systems. The problem is that the two goals conflict with one another.
Two new papers co-authored by School of Computer Science Assistant Professor Anand Iyer explore how both goals can be achieved. Through his research, he discovered methods that can enable companies to save money while providing faster responses to users of interactive models.
“For low latency, a company would want to run requests as they come in, but if you want to serve many requests, you might want to save them and send them in one batch to the machine. Doing this, latency would go up, but throughput would increase,” said Iyer.
Currently, most ML services handle thousands of users by collecting requests and processing them in batches. While batching is efficient in many ways, it sacrifices the speed at which individual users are served, and it becomes less efficient as ML models grow larger to serve more users.
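To see the tradeoff concretely, consider a back-of-the-envelope sketch. Everything below is hypothetical: the timings, batch size, and arrival rate are invented for illustration and do not come from the papers, and the model ignores the possibility of filling one batch while another runs. It only shows why batching trades per-request latency for throughput.

```python
# A toy model of the tension Iyer describes. All numbers are hypothetical.
PER_REQUEST_MS = 10   # time to run the model on a single input
BATCH_MS = 20         # time to run the model on a whole batch
BATCH_SIZE = 8
ARRIVAL_GAP_MS = 2.0  # one new request arrives every 2 ms

def serve_immediately_throughput_rps(per_request_ms: float) -> float:
    """Max requests per second when each request runs alone."""
    return 1000.0 / per_request_ms

def batched_latency_ms(batch_size: int, gap_ms: float, batch_ms: float) -> float:
    """Worst-case latency for the first request in a batch: it waits for
    the batch to fill, then waits for the whole batch to run."""
    fill_ms = (batch_size - 1) * gap_ms
    return fill_ms + batch_ms

def batched_throughput_rps(batch_size: int, gap_ms: float, batch_ms: float) -> float:
    """Requests per second, assuming batches are filled and run back to back
    (no overlap between filling one batch and running another)."""
    fill_ms = (batch_size - 1) * gap_ms
    return batch_size / ((fill_ms + batch_ms) / 1000.0)

print(f"one-at-a-time: ~{PER_REQUEST_MS} ms latency, "
      f"{serve_immediately_throughput_rps(PER_REQUEST_MS):.0f} req/s max")
print(f"batched:       ~{batched_latency_ms(BATCH_SIZE, ARRIVAL_GAP_MS, BATCH_MS):.0f} ms latency, "
      f"{batched_throughput_rps(BATCH_SIZE, ARRIVAL_GAP_MS, BATCH_MS):.0f} req/s")
```

With these made-up numbers, batching more than doubles throughput (100 to roughly 235 requests per second) but roughly triples the latency of an early arrival (10 ms to 34 ms). That is the tension the two papers aim to resolve.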
To solve this, Iyer's new research proposes a way to serve requests faster while improving throughput.
Both papers focus on early exits, an idea that has existed in ML for some time but hasn’t been practical due to several challenges. An early exit lets a model return an answer from an intermediate layer when it is already confident in its prediction, rather than running the input through the entire network.
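Conceptually, an early-exit network attaches small "ramp" classifiers at intermediate layers and lets an input return from the first one that is confident enough. The sketch below is a generic illustration of that idea in PyTorch; the architecture, threshold, and names are invented for this example and are not the designs proposed in Apparate or E3.

```python
# A minimal early-exit network: a small classifier ("ramp") follows each
# hidden layer, and an input exits at the first confident ramp instead of
# running the remaining layers. Hypothetical illustration only.
import torch
import torch.nn as nn

class EarlyExitMLP(nn.Module):
    def __init__(self, dim: int = 128, num_classes: int = 10, threshold: float = 0.9):
        super().__init__()
        self.threshold = threshold  # hypothetical confidence cutoff
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(4)])
        # One lightweight exit classifier per hidden layer.
        self.exits = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(4)])

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, int]:
        for depth, (layer, exit_head) in enumerate(zip(self.layers, self.exits)):
            x = torch.relu(layer(x))
            probs = torch.softmax(exit_head(x), dim=-1)
            confidence, prediction = probs.max(dim=-1)
            # Exit early when the ramp is confident (sketch assumes batch size 1).
            if confidence.item() >= self.threshold:
                return prediction, depth
        return prediction, depth  # fell through: used the full network

model = EarlyExitMLP()
pred, exit_depth = model(torch.randn(1, 128))
print(f"predicted class {pred.item()} after layer {exit_depth}")
```

The hard part, and what has kept early exits from being practical, is deciding where to place the ramps and how to set thresholds so that accuracy, latency, and throughput don't suffer. Those are the kinds of challenges the two papers address.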
In Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving, the researchers solved two of these challenges. Apparate automatically reduces request latency while preserving accuracy and throughput.
In Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation, the researchers built a system named E3. Like Apparate, E3 makes early exits practical, but it does so while also maximizing throughput without sacrificing response speed.
Iyer said that using these two methods can increase the efficiency of ML models anywhere from 50% to 300%.
“People were reluctant to use early exiting due to its limitations before we wrote these papers, so our hope is that by having these solutions to solve the challenges, more people will use early exiting to get the latency and throughput benefits,” Iyer said.
Last month, both papers were presented at the 30th Symposium on Operating Systems Principles (SOSP).