Video Annotation


Here's the big secret about Artificial Intelligence and Machine Learning: these seemingly magical technologies require lots and lots of data to work, and all that data is created by millions of people around the world.

For example, a self-driving car is controlled by a Machine Learning model that has been given lots of dashboard camera videos to learn from. Before the model training can happen, an annotator must first label the points of interest in those videos so that the ML model knows what to pay attention to when it "watches" them. These points of interest are called data labels, and they can range from "a person crossing the street" to "a car changing lanes" to anything else you might encounter when driving a car.
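
To make that concrete, here's a minimal sketch of what a single data label might look like as a data structure. The field names are illustrative only, not an actual labeling schema.

```typescript
// A hypothetical shape for one data label on one video frame.
// Field names are illustrative, not a real production schema.
interface BoundingBox {
  x: number;      // left edge, in pixels
  y: number;      // top edge, in pixels
  width: number;
  height: number;
}

interface DataLabel {
  frame: number;     // the frame of the video this label applies to
  category: string;  // e.g. "pedestrian crossing the street"
  box: BoundingBox;  // where the point of interest appears in the frame
}
```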

The video below shows an example of using our new data labeling tool to identify a vehicle as a point of interest. The image in the example is a modified photo originally taken by Steven Vance.

Labeling a three-minute video that runs at thirty frames per second translates to a person identifying and tracking information across 5400 individual images. This is both time-consuming and mentally taxing work, especially if the data labeling software is poorly designed. Over the past few months we have been researching, designing, and building a completely new video annotation tool with the goals of improving labeling accuracy and decreasing project costs for our customers.

Our research showed us that cognitive load was the most important problem to address, and I'll highlight two innovative ways we met the challenge in this case study.

🕒 The Timeline

Data labeling tasks are analogous to activities common to the tech and creative industries. These tasks are inherently visual in nature. When an annotator draws a bounding box around a person crossing the street in a dashcam video, they are using tools similar to those a designer uses when creating a new UI in Figma. To date, data labeling tools haven't provided an easy way to manage lots of visual information; even the prior version of our data labeling software displayed everything in a tabular format. I hypothesized that we could decrease cognitive load by presenting visual information in a more visual manner, making it easier to track over time. A timeline similar to those found in a video editor or animation software seemed like a viable solution.
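
To make the hypothesis concrete, here's a minimal sketch of how a timeline might model each labeled object as a row of keyframes and visibility ranges, the way a video editor models clips on a track. The names are hypothetical, not the data model from the shipped tool.

```typescript
// Sketch: one timeline row ("track") per labeled object.
interface Keyframe {
  frame: number;                        // frame where the annotator set a value
  box: { x: number; y: number; width: number; height: number };
}

interface Track {
  objectId: string;                     // the labeled object this track belongs to
  label: string;                        // e.g. "vehicle"
  visibleRanges: [number, number][];    // frame ranges where the object is in frame
  keyframes: Keyframe[];                // sparse, annotator-set positions over time
}

// The timeline renders one row per Track, so an annotator can scan every
// object's presence and keyframes at a glance instead of reading a table.
```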

But investing time and resources to build a complex control like a timeline is risky. I needed to validate which features of a traditional timeline carry over to video annotation and where the gaps are. Although design tools like Figma have sophisticated prototyping capabilities, they couldn't achieve the interaction fidelity our team required, so I built a timeline prototype using HTML, CSS, and JavaScript.

The video below shows the prototype being used to explore how we might incorporate machine learning to automatically track an object of interest over time. This feature eliminates the need for an annotator to make manual adjustments on every frame, but it's a complex interaction with a lot happening behind the scenes. The prototype is a huge time-saver because I can explore ideas with stakeholders and engineers directly in the medium in which those ideas will be implemented.
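
The ML tracking itself is beyond the scope of a sketch, but the underlying idea, filling in the frames between annotator-set keyframes, can be illustrated with simple linear interpolation. This is a stand-in for the model, not what the tool actually uses, and the function and field names are hypothetical.

```typescript
// Stand-in for automatic tracking: linearly interpolate a bounding box
// between two annotator-set keyframes, so the annotator doesn't have to
// adjust every single frame by hand.
interface Box { x: number; y: number; width: number; height: number }

function interpolateBox(a: Box, b: Box, t: number): Box {
  const lerp = (p: number, q: number) => p + (q - p) * t;
  return {
    x: lerp(a.x, b.x),
    y: lerp(a.y, b.y),
    width: lerp(a.width, b.width),
    height: lerp(a.height, b.height),
  };
}

// Box for frame 45, given keyframes set at frames 30 and 60:
const t = (45 - 30) / (60 - 30);                 // 0.5
const frame45 = interpolateBox(
  { x: 100, y: 200, width: 80, height: 60 },     // keyframe at frame 30
  { x: 160, y: 210, width: 90, height: 60 },     // keyframe at frame 60
  t,
);
```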

🎚 Surface Actions

Reducing mouse movement and repetitive actions is critical for long-running tasks like labeling points of interest in a video. Even something as seemingly innocuous as moving the mouse a hundred pixels to the left to click a button becomes infuriating when you have to do it thousands of times. The physical strain compounds when this movement is one of several behaviors that make up a compound interaction.

When an annotator tracks a state change for a point of interest, they must make a context switch from watching the video to manipulating a UI control. For example, they might be required to note when a shopper picks up and scans an item during self-checkout. When the annotator sees the shopper pick up an item in the video, they will stop and click a checkbox or interact with a dropdown in some other area of the UI to track the update. Rather than forcing an annotator to move focus to record the state change, we make all of those options directly available on the object itself via a context menu.
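
Here's a rough sketch, in the spirit of the HTML/CSS/JavaScript prototype, of what surfacing those options on the object itself could look like. The class name and state options are hypothetical, not the tool's actual markup.

```typescript
// Sketch: right-clicking a bounding box opens a context menu at the cursor
// with that object's state options, instead of a panel elsewhere in the UI.
const stateOptions = ['picked up', 'scanned', 'bagged'];   // hypothetical states

function attachContextMenu(boxEl: HTMLElement, onSelect: (state: string) => void) {
  boxEl.addEventListener('contextmenu', (event) => {
    event.preventDefault();                    // suppress the browser's own menu

    const menu = document.createElement('ul');
    menu.className = 'label-context-menu';
    menu.style.position = 'fixed';
    menu.style.left = `${event.clientX}px`;    // open where the annotator already is
    menu.style.top = `${event.clientY}px`;

    for (const state of stateOptions) {
      const item = document.createElement('li');
      item.textContent = state;
      item.addEventListener('click', () => {
        onSelect(state);                       // record the state change in place
        menu.remove();
      });
      menu.appendChild(item);
    }
    document.body.appendChild(menu);
  });
}
```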

Our research also showed us that hot-key accelerators were important for reducing certain repetitive actions. Almost all video data labeling tasks require annotators to track when an object enters or leaves the frame. By introducing a hot-key to toggle this visibility change, we eliminated thousands of mouse movements that annotators previously had to perform for each video they labeled.
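
A minimal sketch of such a hot-key, assuming a hypothetical "v" binding and toggle function (the actual binding and state handling in the tool aren't specified here):

```typescript
// Hypothetical toggle: mark the selected object as entering or leaving the
// frame at the current playhead position.
function toggleVisibility(objectId: string): void {
  console.log(`toggled visibility for ${objectId} at the current frame`);
}

let selectedObjectId: string | null = null;

document.addEventListener('keydown', (event) => {
  // Ignore the shortcut while the annotator is typing in a text field.
  if (event.target instanceof HTMLInputElement) return;

  if (event.key === 'v' && selectedObjectId) {
    toggleVisibility(selectedObjectId);        // no mouse travel required
  }
});
```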

πŸ“ Measuring Success

Our new tool got some things right. Head-to-head testing against other software revealed efficiency gains of up to 30%. We also discovered that brand-new annotators were able to jump in and label data at roughly the same pace as our experienced annotators. But usability studies and user feedback showed us that we sometimes missed the mark as well.

For instance, we were unable to implement a commenting system in the initial release, which hindered the communication loop between annotation teams, managers, and customers. In a follow-up release we'll add support for commenting on data labels, linked to specific points in time. Comments between users and teams will be synchronized to the communication tab in the software's dedicated Information Panel.
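
As a sketch of what that might look like, a time-anchored comment could be as simple as the structure below. The field names are assumptions; the plan only commits to comments linked to points in time and synced to the communication tab.

```typescript
// Hypothetical shape of a comment attached to a data label at a point in time.
interface LabelComment {
  labelId: string;   // the data label the comment is attached to
  frame: number;     // the point in time the comment refers to
  author: string;    // annotator, manager, or customer
  body: string;
  createdAt: Date;
}
```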