6 Steps to Create Custom Data Sets for Computer Vision

Quality data is the lifeblood of great computer vision applications. But what happens when you don’t have enough data? In a previous article, we addressed your options for acquiring data sets for computer vision.

In this follow-up article, we take a closer look at best practices for one of those options: creating your own custom data sets. We addressed this topic in a recent webinar, Building Your Next Machine Learning Data Set.

There are six important steps to creating custom data sets:

1. Choose your collection method. You can build your own data set using internal resources or third-party services you hire. To collect the data, you can use automation, you can do it manually, or you can choose a combination of both.

Data scraping tools can automate parts of the collection process. Manual data set collection involves humans in the loop, who find and gather the data according to your requirements and business rules.

You may use your own devices, such as cameras or sensors (e.g., LiDAR). Or, you can hire a vendor that can provide devices, such as drones or satellites. In some cases, you may need to build the devices required to capture the data you need.

Autonomous vehicles are one of the most visible examples of custom data set creation. You may have seen NuTonomy or Drive.ai’s branded, eco-friendly cars traveling through your city as they gathered data to build self-driving cars. Each of these data-gathering cars is fitted with devices such as cameras, radar, and LiDAR sensors that capture visual data as the cars traverse city streets.

While you are choosing your collection method, it’s also a good time to consider your preferences in a data annotation tool. Your tool choice will have an effect on the success of your project and will determine some of your options for data storage, tooling, and workflow.

2. Collect data in tiers. At this stage, you work with smaller data sets to analyze the effectiveness of your predictive model and adjust it as necessary. Start by breaking down your target data set into smaller sets. For example, if you are aiming to work with 500,000 images, collect the data in tiers of 20,000 to 50,000 images, and increase the tier size gradually or aggressively depending on the results of your model after training.

You will annotate that data, run it through your model, see how it works, and adjust your approach as necessary. Then, you collect another tier of data and do it again. According to Maria Greicer, VP of partnerships at Keymakr, a company that provides data collection and annotation services, it usually takes three to four cycles of tiered data collection to learn what works best in terms of model performance and the time and cost it takes to generate the best results.
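As a rough illustration of the tiered approach, here is a minimal Python sketch that splits a target data set into tiers. The function name plan_tiers and the growth factor are assumptions made for this example, not part of any particular tool; a growth factor near 1 corresponds to increasing tier size gradually, and a larger one to increasing it aggressively.

```python
def plan_tiers(target_total, first_tier, growth=1.5):
    """Split a target data set size into collection tiers.

    Hypothetical helper: each tier is `growth` times the size of the
    previous one, and the last tier is trimmed so the sizes sum exactly
    to target_total.
    """
    if target_total <= 0 or first_tier <= 0 or growth < 1:
        raise ValueError("sizes must be positive and growth must be >= 1")
    tiers = []
    collected = 0
    size = float(first_tier)
    while collected < target_total:
        # Never collect more than what is still missing from the target.
        tier_size = min(int(size), target_total - collected)
        tiers.append(tier_size)
        collected += tier_size
        size *= growth
    return tiers


# Example: 500,000 images, starting at 20,000 and doubling each tier.
tiers = plan_tiers(500_000, 20_000, growth=2.0)
print(tiers)  # [20000, 40000, 80000, 160000, 200000]
```

In practice you would revisit the plan after each tier, since the results of training on one tier determine how large the next one should be.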

Collecting in tiers helps you catch unwanted biases in your data early. Those biases can be harder to spot when you collect and train on one large data set. Even worse, if you don’t collect in tiers, you may have to start the entire process over once you discover them.

3. Validate the data. Now that you have gathered the data, it’s time for a validation exercise.

The purpose of validation is to ensure you’ve met the data quality metrics (i.e., variance, quality, quantity, density) you initially sought to achieve. This is the perfect time to catch biases and collect data again before beginning annotation. You can skip this step, but doing so is not recommended. The time you spend on validation is insignificant compared to the time it would take to annotate the data again if you miss the mark the first time around.
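A validation pass like this can be partly automated. The sketch below is one possible check, assuming each collected item is described by a content hash and a capture condition such as "day" or "night". The function name validate_tier, the record layout, and both thresholds are hypothetical; you would replace them with the quality metrics you defined for your own project.

```python
from collections import Counter


def validate_tier(records, min_count, max_share):
    """Return a list of problems found in one tier of collected data.

    records:   list of (content_hash, condition) tuples (assumed layout)
    min_count: minimum number of unique items required (quantity)
    max_share: largest fraction any one condition may take (variance)
    """
    problems = []
    hashes = [content_hash for content_hash, _ in records]
    unique = set(hashes)

    # Duplicates inflate the data set without adding information.
    if len(unique) < len(hashes):
        problems.append(f"{len(hashes) - len(unique)} duplicate item(s)")

    # Quantity check on unique items only.
    if len(unique) < min_count:
        problems.append(f"only {len(unique)} unique items, need {min_count}")

    # Variance check: one dominant condition hints at a biased collection.
    counts = Counter(condition for _, condition in records)
    if counts:
        condition, n = counts.most_common(1)[0]
        share = n / len(records)
        if share > max_share:
            problems.append(f"'{condition}' makes up {share:.0%} of the tier")
    return problems


sample = [("a", "day"), ("b", "day"), ("c", "day"), ("a", "day"), ("d", "night")]
for problem in validate_tier(sample, min_count=5, max_share=0.7):
    print(problem)
```

An empty result means the tier passed these particular checks; it does not replace a human review of image quality.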

4. Annotate the data. Once validation confirms that you collected the appropriate amount and variety of data, you will begin working on the most time-consuming task of your project: data annotation. You will have done some annotation during the earlier stages of this process, as you collected and tested the data for use with your algorithm.

You can use a variety of image annotation techniques, but your biggest choice will be your image annotation workforce. This is an important decision that can have a significant effect on the success of your project, so it’s worth thoughtful consideration.

Typically, these are your workforce options:

Employees, who are on your payroll and whose job description may not include data annotation
Contractors, or temporary workers (e.g., freelance, gig workers), who work remotely or at your location
Managed outsourced teams that can transition to remote work and give you direct access to annotators (e.g., CloudFactory)
Business process outsourcing (BPO), a more traditional outsourcing option that does not provide access to the people who are doing the work
Crowdsourcing using third-party platforms that give you access to large numbers of anonymous workers

5. Validate your model. At this stage, you will validate the quality of your algorithm. This is a key step for determining if the data you labeled is a good fit for the algorithm you are creating. You’ll also learn if the inferences of your model are accurate for the outcome you want to achieve. You must have humans in the loop for this process.

This step can be quite iterative, as you are likely to make changes to your image annotation process and evolve your model as you learn what works best. Adjustments could include changes to your algorithm, changes in your data collection process, or changes in the features you are targeting in the data. For example, you may learn there are other objects of interest in your visual data that would be helpful to annotate to train and refresh your algorithm.

By using the tiered approach for data collection, you will significantly reduce the risk of having to scrap your model at this point in the process due to low-quality data. The tiered approach gives you the opportunity to adjust your course and move forward.
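One simple way to check whether your model’s inferences are accurate is to compare them with labels that humans have reviewed. The sketch below computes per-class precision and recall for an image classification task. The function name per_class_report and the example labels are assumptions made for illustration; object detection or segmentation projects would need different metrics.

```python
from collections import defaultdict


def per_class_report(human_labels, model_labels):
    """Compare model predictions with human-reviewed labels, class by class."""
    if len(human_labels) != len(model_labels):
        raise ValueError("both label lists must have the same length")

    true_pos = defaultdict(int)
    false_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for truth, predicted in zip(human_labels, model_labels):
        if truth == predicted:
            true_pos[truth] += 1
        else:
            false_pos[predicted] += 1  # model claimed a class that was wrong
            false_neg[truth] += 1      # model missed the class that was right

    report = {}
    for cls in sorted(set(human_labels) | set(model_labels)):
        predicted_total = true_pos[cls] + false_pos[cls]
        actual_total = true_pos[cls] + false_neg[cls]
        report[cls] = {
            "precision": true_pos[cls] / predicted_total if predicted_total else 0.0,
            "recall": true_pos[cls] / actual_total if actual_total else 0.0,
        }
    return report


human = ["car", "car", "pedestrian", "car", "pedestrian"]
model = ["car", "pedestrian", "pedestrian", "car", "car"]
for cls, scores in per_class_report(human, model).items():
    print(cls, scores)
```

A class with low recall is a candidate for more collection or annotation in the next tier.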

6. Repeat. Machine learning is not a one-and-done exercise, so you will repeat the collection, annotation, and validation steps again and again. Even after you deploy your model into production, you will continue these steps to ensure your models are performing to your satisfaction.

As conditions in the real world change, continuous collection, annotation, and validation allow you to retrain your machine learning model so it responds to the new conditions.

We’re seeing the need for this now, as human behaviors during the COVID-19 pandemic are affecting the performance of AI models trained on data that reflects pre-pandemic conditions. When ground truth changes, your model must be trained to interpret and understand these new conditions.
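To notice when conditions have shifted, some teams compare accuracy on recent, human-checked production samples with the accuracy measured at deployment. The following is a minimal sketch of that idea; the function name needs_new_tier, the tolerance, and the minimum sample size are hypothetical values you would tune for your own project.

```python
def needs_new_tier(baseline_accuracy, recent_outcomes, tolerance=0.05, min_samples=100):
    """Decide whether recent performance justifies another collection cycle.

    baseline_accuracy: accuracy measured when the model was validated
    recent_outcomes:   list of booleans, True where a human reviewer
                       agreed with the model's prediction
    """
    # Too few reviewed samples to draw a conclusion either way.
    if len(recent_outcomes) < min_samples:
        return False
    recent_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return recent_accuracy < baseline_accuracy - tolerance


# 80 correct out of 100 recent samples against a 92% baseline.
recent = [True] * 80 + [False] * 20
print(needs_new_tier(0.92, recent))
```

When the check returns True, that is a signal to go back to the collection, annotation, and validation steps with data that reflects the new conditions.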

Creating custom data sets is a cyclical process in which you adjust your approach as you train and validate your model. Doing it well requires a strategic combination of people, process, and technology. The more thoughtful you are about your workforce, your data collection process, and your annotation tools, the more successful your project is likely to be.
