[Image: a person kneading dough]

New Generative Tool Provides Images to Accompany Step-by-step Instructions

LEGO can show you how it’s done.

Proper instructions can be the difference between success and failure, whether for a parent putting together a crib or someone administering CPR.

While large language models (LLMs) can provide step-by-step instructions for assembling a crib, administering CPR, and other activities, Bolin Lai thinks they can go further.

Lai is a machine learning Ph.D. student who developed LEGO, a new framework that allows generative artificial intelligence (AI) models to create first-person synthetic images based on text prompts. These images give users visual step-by-step instructions for completing a task.

For example, someone may not know how to properly handwash laundry if they’ve always relied on a washing machine. 

Lai said they could consult an LLM, but it provides instructions only as text. Users may feel more confident they’re doing the task correctly if they have a corresponding image to reference.

“Those instructions from LLMs could be very generic, so you’re reading lots of words, and it may not apply to your current situation,” Lai said. “Though you can input an image to GPT for more customized guidance, reading pure textual response isn’t efficient. Our model can understand the image and provide instructions by generating an image action frame showing people how to do it exactly.”

If a person wanted to know how to scrub a pair of trousers properly with a brush, they would first take a first-person photo of their situation, then upload that photo and prompt LEGO for instructions on scrubbing the trousers with the brush.

Based on the text in the prompt and the provided photo, the model generates a new image of someone scrubbing the trousers with the brush in the same environment.
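LEGO’s own model and weights aren’t reproduced in this article, but the interaction it describes, a first-person photo plus a short action prompt in, an edited action frame out, maps onto publicly available instruction-conditioned image-editing pipelines. The sketch below uses Hugging Face’s InstructPix2Pix pipeline purely as a stand-in to illustrate that workflow; the model name, prompt, and parameters are illustrative assumptions, not LEGO’s actual interface.

```python
# Illustrative sketch of the "photo + action prompt -> action frame" workflow.
# InstructPix2Pix is used here only as a publicly available stand-in; this is
# not LEGO's own model or API.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

scene = Image.open("my_laundry_scene.jpg").convert("RGB")  # user's first-person photo
prompt = "scrub the trousers with the brush"               # the action to depict

frame = pipe(
    prompt,
    image=scene,
    num_inference_steps=30,
    image_guidance_scale=1.5,  # how closely the output keeps the original scene
    guidance_scale=7.5,        # how strongly the text prompt steers the result
).images[0]

frame.save("action_frame.png")  # the generated instructional action frame
```

In principle, repeating a call like this with each step of a longer instruction would build up the kind of visual step-by-step guide the article describes, since every generated frame stays grounded in the user’s own scene.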

The possibilities are innumerable, but Lai said his goal is to provide a way for people to learn new skills in everyday scenarios. Some of those skills could prove to be lifesaving.

“In some rural areas, there may not be any quick medical service available,” he said. “If an emergency happens, people can use this tool and find professional steps to assist the person who needs medical care.”

Lai started this project while interning at Meta GenAI and authored a paper titled “LEGO: Learning Egocentric Action Frame Generation via Visual Instruction Tuning.” The paper will be presented at the European Conference on Computer Vision (ECCV), Oct. 5-9 in Milan, Italy.

Gathering Data

Lai said his work stems from Meta’s release of EGO4D, a benchmark dataset of first-person videos of humans performing everyday activities. The dataset was created to facilitate research in augmented reality, virtual reality, and robotics.

Lai trained LEGO on still frames from EGO4D so that the images in its output would be accurate and believable.

“It’s so valuable, and they have corresponding annotations for the narration about what people are doing in the videos,” he said of EGO4D. “With so many egocentric videos, we can do much research on egocentric vision. We can have better data to train models and explore more action categories. We can learn the interaction of hands and objects and how the object’s state can change, such as moving from one place to another or changing its shape.”

Lai also curated images from a dataset called EPIC-KITCHENS, which contains first-person footage of kitchen activities, to bolster training.
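The article doesn’t spell out how the narrated egocentric videos become training examples, but the annotations it mentions lend themselves to a simple pairing step: take a frame just before a narrated action and a frame during it, and keep the narration as the conditioning text. The sketch below is a hedged illustration of that idea only; the file layout, field names, and time offset are placeholders, not the datasets’ real schema or the paper’s actual preprocessing.

```python
# Hedged sketch: pair narrated egocentric video frames into
# (context frame, narration text, action frame) training triples.
# The annotation format below is an assumed placeholder, not EGO4D's
# or EPIC-KITCHENS' real schema.
import json
import cv2

def frame_at(video_path: str, t_seconds: float):
    """Grab a single RGB frame from a video at the given timestamp."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, t_seconds * 1000.0)
    ok, frame_bgr = cap.read()
    cap.release()
    return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB) if ok else None

def build_pairs(annotation_file: str, video_path: str, action_gap: float = 1.0):
    """Yield (context_frame, narration_text, action_frame) triples."""
    with open(annotation_file) as f:
        narrations = json.load(f)  # assumed: [{"timestamp": float, "text": str}, ...]
    for item in narrations:
        before = frame_at(video_path, item["timestamp"] - action_gap)
        during = frame_at(video_path, item["timestamp"])
        if before is not None and during is not None:
            yield before, item["text"], during
```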

Using a pair of smart glasses that could capture first-person images wherever he went, Lai then collected images of real-world scenarios that might require instructional assistance. He fed those images into LEGO and received accurate and believable synthetic images of completed tasks.

He found that the model needs only a single input image to generate new images demonstrating a step-by-step process for completing a task.

“We show the model can achieve high-quality generation of a real-world image,” Lai said. “The task is challenging because the background in the user’s input image may be complex and chaotic. Other generative models are trained entirely on synthetic images with clean backgrounds and a few objects dominating the foreground. They oversimplify the problem and may not apply to the real world.”

From Images to Video

Lai envisions scaling his work to AI-generated video, with instructional clips as the output instead of still images. These videos would walk viewers through the process and could be accompanied by narration.

He said that possibility is a long way off. Current generative AI video tools such as OpenAI’s Sora can generate videos up to 60 seconds long, but Lai said he doesn’t have access to the resources needed to reach that length.

“We need more powerful computing resources to make it into a video, which was our initial goal, but we have found it difficult,” he said. “It’s currently unaffordable for us, so we simplified the problem into image generation.”