The recent failure of IBM’s Project Debater in its contest against global debate champion Harish Natarajan offers the latest in a series of lessons learned about the deployment of natural language processing (NLP) technology in the real world. Project Debater is one of numerous attempts in a decades-old quest to use machines to automate the analysis of language to gather insights and make decisions.
The contest, which took place last month at IBM’s Think 2019 conference in San Francisco, was the first public debate between an AI system and a person on a complex topic: “We should subsidize preschool.” The winner’s success depended on the choices of audience members, who cast yes-or-no votes on the issue before the debate and after the closing arguments. Natarajan, who had never debated a machine before, was declared the winner after he persuaded 17% of the audience members to change their minds.
In the end, Natarajan’s win came down to the quality of his responses. As we learned in our first article, NLP often cannot pick up on subtlety. After all, software is literal and language is not. Project Debater made a strong opening argument. But when it came to analyzing its human opponent’s arguments, it fell short in interpreting and responding rapidly to the more subtle or emotional ones, falling back on more general responses when it didn’t understand.
Just as the quality of Natarajan’s responses won the debate, the quality of your data will be critical for the performance of your NLP model. When your workers make bad decisions in tagging or classifying your data, those decisions directly affect the performance of your model. And keep in mind, quality is more than accuracy; it’s also consistency across all of your datasets.
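To make the difference between accuracy and consistency concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular vendor's QA method: the labels are made-up sentiment tags, and the helper names accuracy and cohen_kappa are hypothetical. Accuracy compares one annotator against a gold-standard answer key, while Cohen's kappa measures how often two annotators agree with each other beyond what chance would produce.

```python
from collections import Counter


def accuracy(labels, gold):
    """Fraction of labels that match the gold-standard labels."""
    return sum(l == g for l, g in zip(labels, gold)) / len(gold)


def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each annotator labeled at random with their own label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical sentiment labels for the same six documents (illustrative data only).
gold = ["pos", "neg", "neg", "pos", "neu", "neg"]
annotator_1 = ["pos", "neg", "neg", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "pos", "pos", "neu", "neg"]

print("accuracy 1:", round(accuracy(annotator_1, gold), 2))
print("accuracy 2:", round(accuracy(annotator_2, gold), 2))
print("kappa 1 vs 2:", round(cohen_kappa(annotator_1, annotator_2), 2))
```

In this made-up sample, each annotator matches the answer key on five of six documents, yet the kappa between them is only about 0.45, because they make their mistakes on different items. Two workers can each look accurate and still label inconsistently with one another.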
3 Workforce Essentials: Quality Data for High-Performing NLP
For a decade, our managed teams have annotated and analyzed raw text and image data for NLP and other AI applications. We've found these three workforce essentials support quality data processing for high-performing NLP.
1. Context and domain knowledge
Let’s say IBM’s team wanted to improve Project Debater by training and testing its ML models more extensively to recognize and use language that appeals to an audience’s emotions. That would require a deeper focus on context and domain, two critical elements your NLP data workforce should be capable of providing.
Context relates to the setting or relevance of the content. For greater accuracy, your data workers should understand words that can be used in multiple ways, depending on the meaning of the text. For example, you would want them to tag the word “bass” accurately, knowing whether the text in a document relates to fish or music. They also must understand words that are substituted for others, such as “Kleenex” for “tissue.” You might even want them to apply knowledge that is not included in the text but is generally understood.
Domain expertise relates to the discourse model. The vocabulary, format, and style of text related to healthcare can vary significantly from that for the legal industry, for example. Domain is especially challenging when you’re creating an NLP model for a domain that doesn’t already have a large manually annotated corpus available for you to use. For accuracy, at least a subset of your data workers should know key details about the industry your NLP application serves, and how their work relates to the problem you are solving.
Domain and context capabilities are significantly limited with crowdsourced teams because they’re anonymous and don’t have access to the peer learning provided by other workers performing the same tasks or the benefit of aggregated lessons over time. You will get higher quality data with managed teams that become familiar with your data and increase their context and domain expertise with ongoing work on your project.
2. Smart tool choice
In our work with teams developing AI, we’ve found tool and workforce can seem a bit like a chicken-or-egg choice. Some teams want to choose their tool first; others prioritize choice of workforce. We’ve learned that it’s important to consider how your tool and data teams will work together, especially with NLP. As workers’ context grows, they will surface valuable opportunities to streamline your process and will suggest adjustments to both process and tool that can introduce competitive differentiators for the ML models or products you’re creating.
In general, you’ll have two choices: build or buy. If you build your own tool, you have more control. You can make changes to the software quickly and with agility, using your own developers. You don’t have to worry about fees when the software scope changes. You’ll also have more control over the security of your system and can apply the exact technical controls to meet your company’s unique security requirements.
If you buy your tool, look for a company like Hivemind, which combines computational techniques with the tagging, mapping, and classification work of CloudFactory teams, and then provides a data quality framework around that process to ensure the integrity of the final datasets.
There are myriad open source tools available for NLP. In December 2018, Facebook made its PyText NLP framework available to developers. The PyTorch-based product is responsible for models that power more than a billion daily predictions on Facebook’s platform. Facebook says its tool can help AI developers with tasks like document classification, sequence tagging, semantic parsing, and multitask modeling.
Be careful about tools that are integrated into a workforce vendor’s platform, for three reasons: 1) You’ll have less control to mitigate unintended bias the tool could introduce into your annotation tasks; 2) You may give up some ownership of your data, so be sure to ask your workforce vendor if they pool your data with that of their other clients to sell annotated datasets or inform their own ML algorithms; and 3) You unnecessarily link your annotation tool with your workforce, limiting agility and creating drag on your ability to get to market. Besides, no one wants a second-rate tool paired with a second-rate workforce.
3. Seamless blend of tech and people in the loop
In many ways, your training data operations are a lot like the assembly lines in the factories born from the second industrial revolution: data is your raw material, and you have to move it through multiple processing and review steps to structure it for ML. It’s your data production line. You need skilled people on the line who can help you make changes when you run into a problem or your process evolves.
At CloudFactory, we’ve learned that for NLP projects, it’s especially important to create a seamless blend of tech and people in the loop on your data production line. Think of it as your tech-and-human stack, combining people and machines in a workflow that directs tasks where they are best suited for high performance. That means assigning to people the tasks that require domain expertise, context, and adaptability - and giving machines the tasks that require repetition, measurement, and consistency. As you can see in this task-progression graphic, as you move toward more difficult tasks, you’ll need more people in the loop.
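One common way to direct tasks to where they are best suited is to route on model confidence: the machine keeps the items it is sure about, and people review the rest. The sketch below is a simplified illustration under stated assumptions, not CloudFactory's actual workflow. The Prediction class, the route function, the 0.8 threshold, and the sample texts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    text: str
    label: str
    confidence: float  # model's score in [0, 1]; assumed to be reasonably calibrated


def route(predictions, threshold=0.8):
    """Split predictions into auto-accepted items and items sent to human review."""
    auto, review = [], []
    for p in predictions:
        # High-confidence items stay with the machine; the rest go to people.
        (auto if p.confidence >= threshold else review).append(p)
    return auto, review


# Hypothetical model output for three support messages (illustrative data only).
batch = [
    Prediction("Invoice total: $120.00", "billing", 0.97),
    Prediction("Great, another delay. Just what I needed.", "positive", 0.55),
    Prediction("Please reset my password", "account", 0.91),
]

auto_accepted, needs_review = route(batch)
print("auto-accepted:", [p.label for p in auto_accepted])
print("human review:", [p.text for p in needs_review])
```

Here the sarcastic message gets a low-confidence "positive" label, so it lands in the human review queue, which is exactly the kind of subtle, context-dependent item people handle better. Raising the threshold sends more work to people; lowering it sends more to the machine.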
You’re not alone if your inclination is to keep quality assurance (QA) in-house because you want to maintain intimacy with your data and control over the process. This is another reason a managed workforce is preferable to an anonymous crowdsourced team. A managed team can do far more than structure raw data and send it back.
Consider the benefits of engaging managed teams in the QA process: 1) They can make the blend between annotator and tool faster and easier by managing the repetitive QA process for you as part of the task progression; and 2) They can incorporate improvements prompted by QA learnings into your workflow. Anonymous teams don’t bring the progressive benefits of a managed team, which acts as an extension of your own.
Annotation for NLP is not a fixed process; it will change over time, and you’ll want the agility to incorporate improvements as you go. Reliability and trust are key factors to identify in your workforce solution. Look for a team that provides a closed feedback loop, where you have a single point of communication with your annotation team. Strong communication between your development team and the data workers who are establishing ground truth for your models is critical here; it will ensure better model performance, accelerating deployments and, with them, your ability to bring solutions to market.
The Bottom Line
NLP was no match for Natarajan when it came to generating responses quickly, picking up on subtlety, or mimicking human emotions. And just as IBM’s product team is contemplating the next stage of Project Debater’s development, informed by all of the new information from its recent contest in San Francisco, you and your colleagues will learn from each model deployment.
Your annotation team should be a strategic partner. With any AI project, but with NLP especially, your workforce choice can mean the difference between crossing the finish line in your race to market or running out of steam before you ever finish.
Much like drafting can benefit riders in a bicycle race, a managed workforce creates a trusted and reliable feedback loop that can give your team a competitive edge in the fast-moving race to market. If you want to develop a high-performing NLP model, you need a strategic approach to your data production that includes processes, tools, and reliable people who are accountable for delivering high-quality datasets, and who function as an extension of your development team. For more information about CloudFactory’s managed teams for text annotation, labeling, tagging, classification, and sentiment analysis, contact us.