Scaling Quality Training Data: Best Practices for Your Data Production Line

“Houston, we’ve had a problem.” Astronaut Jack Swigert made the words famous when he told NASA mission control that an explosion had rocked the Apollo 13 spacecraft carrying him and two other astronauts to the moon in April 1970. To get the astronauts home safely, the engineers at Johnson Space Center in Houston, Texas would have to do something they had never attempted before: use the descent engine on the lunar lander to send the crew home.

The algorithms for calculating the maneuver had been written only months earlier. The two young programmers who had written them sprang into action, checking every possible parameter to see if the maneuver would work. Thanks to their hard work - and that of hundreds of people in the loop on the data, the algorithms, the required calculations, and other critical factors - all three astronauts returned safely to Earth.

Machines + People in the Loop

In AI development, similar urgent challenges abound. Teams of computer vision engineers are training the algorithms that self-driving cars use to recognize pedestrians, trees, street signs, and other vehicles. Researchers are using data and natural language processing (NLP) to detect psychiatric patients who are at a higher risk for suicide. The success of these systems depends on massive pipelines of data and the skilled people in the loop who structure the data for AI use.

As we learned in the first article of this series, a growing number of teams are using in-house staff and contractors to do this mission-critical work. In the second article, we explored the hidden costs of using anonymous crowdsourcing to process data and structure it for AI use. In this third and final article, we’ll take a closer look at how you can design your training data operations to support quality, speed, and scale.

Your Data Production Line

In many ways, your training data operations are a lot like the assembly lines of yesterday’s industrial factories: data is your raw material, and you have to get it through multiple processing and review steps to structure it for machine learning (ML). Like the Apollo astronauts, you need skilled people on the ground - or in the loop on your data - who can help you make changes when you run into a problem or your process evolves.

If you want to develop a high-performing ML model, you need smart people, tools, and operations that can consistently deliver high accuracy. Here are four critical elements to consider when you design your data production line for quality, speed, and scale.

1. Apply technology.

Think of your data production line as your tech-and-human stack, combining people and machines in a workflow that directs tasks where they are best suited for high performance. That means assigning to people the tasks that require domain expertise, context, and adaptability - and giving machines the tasks that require repetition, measurement, and consistency.
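One way to picture this split is a simple routing rule. The sketch below is only an illustration, not a description of any particular platform: it assumes a model has already produced a label and a confidence score for each item, and it sends low-confidence items to a person. The `Task` fields, the `route` function, and the 0.90 threshold are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical cutoff: predictions below this confidence go to a person.
REVIEW_THRESHOLD = 0.90


@dataclass
class Task:
    item_id: str
    machine_label: str
    confidence: float  # assumed model confidence in machine_label, 0.0 to 1.0


def route(task: Task, threshold: float = REVIEW_THRESHOLD) -> str:
    """Return "machine" to accept the automated label, or "human" to queue it for review."""
    return "machine" if task.confidence >= threshold else "human"


# Illustrative items only; the IDs, labels, and scores are made up.
tasks = [
    Task("img-001", "stop_sign", 0.98),
    Task("img-002", "pedestrian", 0.62),
]

queues = {"machine": [], "human": []}
for task in tasks:
    queues[route(task)].append(task.item_id)

print(queues)
```

In practice the threshold would be tuned against measured quality, and some task types may always go to people regardless of confidence.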

Technology is important for communication with your workforce too. Direct contact with your team will give you visibility into the quality of work. It also will allow workers to share insights that will help you make adjustments as your business requirements evolve.

2. Use a trained, managed workforce.

Managed teams deliver higher skill levels, engagement, accountability, and accuracy. Unlike anonymous crowdsourced workers, managed teams can improve their quality and expertise over time as they grow more familiar with the source data and the nuances of interpreting it for your model.

Managed teams will get better at making decisions about your data, based on their experience with your domain, context, and edge cases.

It’s critical here to have a tight feedback loop with your workers via direct communication with a single point of contact on the ground. This person should be an expert in your data and business rules who can provide feedback, speed change requests, and train new team members.

3. Measure quality.

The quality of your data will determine the performance of your model. As we learned in the second article, there are three methods we use in the workforce industry to measure quality: consensus, gold standard, and sample review. We use one or more of these methods to check the quality of our teams’ work, and quality is a top driver for many of our clients.

Labelbox, a company that provides tools for labeling and managing training data, distinguishes accuracy from consistency. “Accuracy measures how close a label is to the ground truth, or a subset of the training data labeled by your expert. Consistency is the degree to which labeler annotations agree with one another,” said Brian Rieger, COO at Labelbox.
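To make these measures concrete, here is a minimal sketch of how consensus, gold-standard accuracy, and consistency could be computed. It assumes each item is labeled by several workers and that an expert has labeled a small gold-standard subset. The function names and sample data are hypothetical, and simple pairwise agreement stands in for consistency; it is not Labelbox's or CloudFactory's actual metric. Sample review is a manual spot check, so it is not shown.

```python
from collections import Counter
from itertools import combinations


def consensus_label(labels):
    """Majority vote across labelers; returns None if empty or tied for first place."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def accuracy(labels_by_item, gold):
    """Share of gold-standard items whose label matches the expert's label."""
    scored = [item for item in gold if item in labels_by_item]
    if not scored:
        return 0.0
    correct = sum(labels_by_item[item] == gold[item] for item in scored)
    return correct / len(scored)


def consistency(annotations):
    """Average pairwise agreement between labelers on the same item."""
    agree = 0
    total = 0
    for labels in annotations.values():
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    return agree / total if total else 0.0


# Made-up example: three labelers per image, expert labels for two images.
annotations = {
    "img-001": ["real", "real", "real"],
    "img-002": ["fake", "real", "fake"],
    "img-003": ["fake", "fake", "real"],
}
gold = {"img-001": "real", "img-003": "real"}

consensus = {item: consensus_label(labels) for item, labels in annotations.items()}
print(consensus)
print("accuracy vs. gold standard:", accuracy(consensus, gold))
print("consistency:", round(consistency(annotations), 3))
```

In this toy data the labelers agree with each other on `img-003` more often than they agree with the expert, which is why it helps to track accuracy and consistency separately.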

As you build your data production line, look for a workforce provider that is transparent with quality metrics. Also consider how important quality is for your tasks today and how that could evolve over time.

4. Design for agility.

The keys here are training and technology to scale your data work seamlessly. This is about more than getting workers to label, annotate, or categorize data faster. It’s about designing your production line for use-case progression as your AI model develops. As you move through the development process, you’ll want flexibility to add higher-level features that can advance your AI application.

For one CloudFactory client, our teams label images to train algorithms that identify counterfeit retail products. The combination of the labeling tool, our technology platform, and our managed-team approach made it possible to iterate on the process, resulting in better team morale, higher productivity, and 99.3% accuracy.

Workforce solutions that charge by the hour, rather than by the task, are designed to support iteration in the work. Paying by the task can incentivize workers to complete tasks quickly at the expense of quality. Look for options that become more cost-effective as you scale and add more work.

The Bottom Line

NASA's Apollo program has been described as the most ambitious technical endeavor in history, demanding precision to map the journey to and from Earth. AI is taking a similar revolutionary path, stoked by increasing processing speed, a data boom, and better-trained algorithms. The winners in the race to develop world-changing AI will be those who design their data production line to strategically combine people and technology in ways that support high quality and agility to evolve their processes.

This article is the third and final article in a series about scaling your training data operations for AI.
