New Research Enables Faster, More Efficient Machine Learning Models
Whenever you give a prompt to ChatGPT or query another machine learning (ML) model, the provider's need for efficiency affects how quickly you get a response.
Whether it’s ChatGPT or Meta’s AI assistant, ML services aim to maximize throughput, serving as many people as possible. At the same time, these companies want to minimize latency, answering each user’s query as quickly as possible.
Anyone running an ML-powered service must balance these two fundamental aspects of computer systems. The problem is that the two goals conflict with one another.
Two new papers co-authored by School of Computer Science Assistant Professor Anand Iyer explore how both goals can be achieved. Through his research, he discovered methods that can enable companies to save money while providing faster responses to users of interactive models.
“For low latency, a company would want to run requests as they come in, but if you want to serve many requests, you might want to save them and send them in one batch to the machine. Doing this, latency would go up, but throughput would increase,” said Iyer.
Currently, most ML services handle thousands of users by collecting requests and processing them in batches. While batching is efficient in many ways, it sacrifices the speed at which individual users are served, and it becomes less efficient as ML models grow larger to serve more users.
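To see the tradeoff concretely, consider a back-of-the-envelope sketch. Everything below is hypothetical: the timings, batch size, and arrival rate are invented for illustration and do not come from the papers, and the model ignores the possibility of filling one batch while another runs. It only shows why batching trades per-request latency for throughput.

```python
# A toy model of the tension Iyer describes. All numbers are hypothetical.
PER_REQUEST_MS = 10   # time to run the model on a single input
BATCH_MS = 20         # time to run the model on a whole batch
BATCH_SIZE = 8
ARRIVAL_GAP_MS = 2.0  # one new request arrives every 2 ms

def serve_immediately_throughput_rps(per_request_ms: float) -> float:
    """Max requests per second when each request runs alone."""
    return 1000.0 / per_request_ms

def batched_latency_ms(batch_size: int, gap_ms: float, batch_ms: float) -> float:
    """Worst-case latency for the first request in a batch: it waits for
    the batch to fill, then waits for the whole batch to run."""
    fill_ms = (batch_size - 1) * gap_ms
    return fill_ms + batch_ms

def batched_throughput_rps(batch_size: int, gap_ms: float, batch_ms: float) -> float:
    """Requests per second, assuming batches are filled and run back to back
    (no overlap between filling one batch and running another)."""
    fill_ms = (batch_size - 1) * gap_ms
    return batch_size / ((fill_ms + batch_ms) / 1000.0)

print(f"one-at-a-time: ~{PER_REQUEST_MS} ms latency, "
      f"{serve_immediately_throughput_rps(PER_REQUEST_MS):.0f} req/s max")
print(f"batched:       ~{batched_latency_ms(BATCH_SIZE, ARRIVAL_GAP_MS, BATCH_MS):.0f} ms latency, "
      f"{batched_throughput_rps(BATCH_SIZE, ARRIVAL_GAP_MS, BATCH_MS):.0f} req/s")
```

With these made-up numbers, batching more than doubles throughput (100 to roughly 235 requests per second) but roughly triples the latency of an early arrival (10 ms to 34 ms). That is the tension the two papers aim to resolve.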
To solve this, Iyer's new research proposes a way to serve requests faster while improving throughput.
Both papers focus on early exits, an idea that has existed in ML for some time but hasn’t been practical due to several challenges. An early exit lets a model return an answer from an intermediate layer when it is already confident in its prediction, rather than running the input through the entire network.
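Conceptually, an early-exit network attaches small "ramp" classifiers at intermediate layers and lets an input return from the first one that is confident enough. The sketch below is a generic illustration of that idea in PyTorch; the architecture, threshold, and names are invented for this example and are not the designs proposed in Apparate or E3.

```python
# A minimal early-exit network: a small classifier ("ramp") follows each
# hidden layer, and an input exits at the first confident ramp instead of
# running the remaining layers. Hypothetical illustration only.
import torch
import torch.nn as nn

class EarlyExitMLP(nn.Module):
    def __init__(self, dim: int = 128, num_classes: int = 10, threshold: float = 0.9):
        super().__init__()
        self.threshold = threshold  # hypothetical confidence cutoff
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(4)])
        # One lightweight exit classifier per hidden layer.
        self.exits = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(4)])

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, int]:
        for depth, (layer, exit_head) in enumerate(zip(self.layers, self.exits)):
            x = torch.relu(layer(x))
            probs = torch.softmax(exit_head(x), dim=-1)
            confidence, prediction = probs.max(dim=-1)
            # Exit early when the ramp is confident (sketch assumes batch size 1).
            if confidence.item() >= self.threshold:
                return prediction, depth
        return prediction, depth  # fell through: used the full network

model = EarlyExitMLP()
pred, exit_depth = model(torch.randn(1, 128))
print(f"predicted class {pred.item()} after layer {exit_depth}")
```

The hard part, and what has kept early exits from being practical, is deciding where to place the ramps and how to set thresholds so that accuracy, latency, and throughput don't suffer. Those are the kinds of challenges the two papers address.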
In Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving, the researchers solved two of these challenges. Apparate automatically reduces request latency while preserving accuracy and throughput.
In Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation, the researchers built a system named E3. Like Apparate, E3 makes early exits practical, but it does so while also maximizing throughput without sacrificing response speed.
Iyer said that using these two methods can increase the efficiency of ML models anywhere from 50% to 300%.
“People were reluctant to use early exiting due to its limitations before we wrote these papers, so our hope is that by having these solutions to solve the challenges, more people will use early exiting to get the latency and throughput benefits,” Iyer said.
Last month, both papers were presented at the 30th Symposium on Operating Systems Principles (SOSP).