Opportunities and Challenges of Video Annotation for Computer Vision

Across industries, artificial intelligence (AI) is making it possible to generate game-changing insights, innovate products, and automate complex tasks. Computer vision is one application of AI that holds great potential to transform industries that generate massive amounts of visual data.

Computer vision use cases range from dog training to life-saving applications, with many others in between. Building them poses a two-fold challenge:

Choosing your annotation methods (video vs. image, bounding box vs. polygon, and so on) and the targets, objects, or behaviors you want your model to recognize, and
Accurately labeling the massive amount of data required to train the machine to recognize them visually, as a person would.

This process is even more complex when the visual data you are working with is video, or multi-frame data, because there is far more of it to label.

Video annotation is useful across a wide variety of use cases. Annotated video data is used to train autonomous vehicle systems to recognize street boundaries for lane detection. It is used in medical AI for disease identification and surgical assistance. It can be used to create checkout-free retail environments, where consumers are charged based on the items they take with them out of the store. In one particularly interesting use case, video annotation is being used to create a cost-effective system to help scientists learn about the impact of solar technology on birds.

Video Annotation: How it Works

Video annotation is considered a subset of image annotation, and it uses many of the same tools and techniques. However, the process is more complex. A video can contain 60 or more frames per second, which means it can take far longer to annotate videos than images, and it requires more advanced features in the data annotation tool.
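To put that frame rate in perspective, here is a rough back-of-the-envelope sketch. The function name and the 10-minute clip are illustrative assumptions, not figures from a real project.

```python
def total_frames(duration_seconds: float, frames_per_second: float) -> int:
    """Return the approximate number of frames in a video clip."""
    return round(duration_seconds * frames_per_second)


# Hypothetical example: a 10-minute clip recorded at two common frame rates.
ten_minutes = 10 * 60
print(total_frames(ten_minutes, 30))  # 18000
print(total_frames(ten_minutes, 60))  # 36000
```

Even a short clip turns into tens of thousands of individual images, which is why labeling every frame by hand rarely scales.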

There are two ways to annotate video:

Single-frame is the original method of video annotation. The annotator breaks the video into thousands of images and annotates them one by one. Sometimes, this can be done with the aid of a feature that copies annotations from one frame to the next. This process is time-consuming and inefficient. However, in select cases where objects move very little across the frames being considered, this may be a better option.
Streaming video is the more common method. Here, the annotator uses specialized features in the data annotation tool to analyze a stream of video frames, making annotations only periodically. This process is faster and allows the annotator to indicate objects as they move in and out of the frame, which can lead to better learning for machines. It is also more precise, and its use is growing as the data annotation tool market matures and providers expand the capabilities of their tooling platforms.

Annotating an object’s movement through a video in this way is called tracking. Some annotation tools include interpolation features, so an annotator can label one frame, then skip to a later frame and move the annotation to the position where the object appears at that point in time.

Interpolation uses machine learning to fill in, or interpolate, the object’s movement across the in-between frames that were not annotated.
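The idea of filling in frames between two hand-labeled keyframes can be sketched in a few lines. Note that this sketch uses plain linear interpolation rather than machine learning, so it is a simplified stand-in for what a real annotation tool does, and the Box class and interpolate_boxes function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Box:
    """A bounding box: top-left corner plus size, in pixels (assumed layout)."""
    x: float
    y: float
    width: float
    height: float


def interpolate_boxes(start_frame: int, start_box: Box,
                      end_frame: int, end_box: Box) -> Dict[int, Box]:
    """Linearly interpolate a box for every frame between two keyframes."""
    if end_frame <= start_frame:
        raise ValueError("end_frame must come after start_frame")
    span = end_frame - start_frame
    boxes = {}
    for frame in range(start_frame, end_frame + 1):
        t = (frame - start_frame) / span  # 0.0 at the first keyframe, 1.0 at the last
        boxes[frame] = Box(
            x=start_box.x + t * (end_box.x - start_box.x),
            y=start_box.y + t * (end_box.y - start_box.y),
            width=start_box.width + t * (end_box.width - start_box.width),
            height=start_box.height + t * (end_box.height - start_box.height),
        )
    return boxes


# The annotator labels frame 0 and frame 10; frames 1 through 9 are filled in.
boxes = interpolate_boxes(0, Box(0, 0, 10, 10), 10, Box(100, 50, 20, 10))
print(boxes[5])  # Box(x=50.0, y=25.0, width=15.0, height=10.0)
```

Linear interpolation works well when an object moves at a steady pace in a straight line. When movement is erratic, the annotator adds more keyframes, or the tool relies on a learned tracker instead.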

For example, if you want to create a computer vision model that can direct a scalpel during surgery, you would likely need to train your model on annotated scalpel movements from hundreds or even thousands of videos of surgical procedures. Carefully annotated, or labeled, these videos could be used to train the machine to recognize a scalpel and track its movement.

Workforce is a Critical Choice

Your workforce is an important choice in video annotation. Too often, workforce is the last consideration for teams building complex computer vision models, but it should be evaluated strategically at the outset of the project.

Given the massive amount of data required to train computer vision models, in-house teams of annotators are difficult to scale and impose a significant management burden. Crowdsourcing is a popular option for sourcing large annotation teams quickly, but quality can suffer because workers are anonymous and less accountable for accuracy.

Especially when you are building machine learning models that will operate in environments where accuracy is important, professionally managed teams of annotators are a good choice. Working with the same annotators over time means their knowledge of your domain, business rules, and edge cases keeps growing, which translates into higher quality data and better performing computer vision models.

It’s even better if that team operates like an extension of your own, with close communication, so you can make changes in your workflow as you train, validate, and test your models.

CloudFactory: Your Video Annotation Choice

At CloudFactory, we have provided professionally managed teams of data annotators for a decade. Our workforce annotates the visual data that trains and maintains machine learning and deep learning for 11 of the world’s autonomous vehicle companies.

To learn more about how CloudFactory can assist your team with video annotation for computer vision, contact us today.