
Nanette George

Mar 7, 2019

How CloudFactory Workers Help Train NLP Models


“Hey Siri,” I said. “Call Nancy. Mobile.” Siri replied, “Calling Lindsay. Mobile.” It began to ring. I hadn’t worked with Lindsay in six years. I lunged for my phone and touched the red button just in time to avoid what would no doubt have been an awkward, out-of-the-blue conversation with her. Then, I contemplated the complexities of NLP development.

Natural language processing (NLP) is among the fastest-growing applications of AI technology. Gartner estimates that by 2020, 50% of analytical queries will be generated via search, NLP, or voice, or will be generated automatically. The use cases are vast, and the NLP market is expected to be worth $16 billion by 2021, according to a MarketsAndMarkets study.

NLP systems are also among the most difficult AI systems to develop because of the complexities and subtleties of language. To train its intelligent assistant Siri to complete language-related tasks such as answering questions or entering text on your smartphone, Apple used word embeddings to mathematically map words to numerical vectors. “This capability makes it fairly straightforward to find numerically similar vectors or vector clusters, then reverse the mapping to get relevant linguistic information,” according to Apple’s Machine Learning Journal.
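To make the idea of "numerically similar vectors" concrete, here is a minimal sketch in Python. It is not Apple's implementation: the three-dimensional vectors, the `EMBEDDINGS` table, and the `nearest_word` helper are all invented for illustration, and real embeddings are learned from data and have hundreds of dimensions. The sketch only shows the lookup step, where a word is mapped to a vector, the closest other vector is found by cosine similarity, and the mapping is reversed to get a word back.

```python
import math

# Hypothetical toy embeddings (assumption: values are made up, not learned).
EMBEDDINGS = {
    "nancy": [0.9, 0.1, 0.2],
    "lindsay": [0.7, 0.4, 0.1],
    "mobile": [0.1, 0.9, 0.3],
    "phone": [0.2, 0.8, 0.4],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(word, embeddings=EMBEDDINGS):
    """Return (other_word, similarity) for the closest vector to `word`."""
    query = embeddings[word]
    candidates = (
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word  # skip the query word itself
    )
    return max(candidates, key=lambda pair: pair[1])


if __name__ == "__main__":
    for query in ("mobile", "nancy"):
        match, score = nearest_word(query)
        print(f"{query} -> {match} ({score:.2f})")
```

In this toy table the nearest neighbor of "nancy" is "lindsay", which hints at how two names with similar vectors could be confused when the rest of the signal is noisy.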

The questions we’re likely to ask Siri determined how Apple’s ML engineers mapped and classified those words and how they designed the algorithm that powers the app.

That specificity and personalization are exactly what make NLP development so challenging. For example, if you are creating a chatbot for an airline, you will want to design it primarily around understanding words that relate to travel. You’d take a different approach if you were designing a chatbot for a bank or a gaming company. And, as more users interact with your product, you'll need to evolve and validate your NLP models if you want to accurately reflect your brand’s unique voice.

Technology with Potential - And Limitations

Modern-day NLP was inspired by the early (and unsuccessful) machine-assisted translation efforts of English codebreakers during World War II, and it still has a number of limitations, many of which have made the news in recent years. Within hours of its deployment in 2016, Microsoft’s AI chatbot Tay began generating offensive content, a product of its interactions with online troublemakers. Without domain knowledge of the topics it was fed, the chatbot mimicked what it was told and couldn’t distinguish insulting content from informed content.

On the other hand, NLP-powered AI assistant Duplex amazed the audience at Google I/O 2018 with its ability to “speak” with a real person when it called a hair salon to schedule an appointment. Here, we really began to see the potential for using NLP to create conversations and its burgeoning power to interact with people more naturally.

Even so, Duplex’s limitations are clear. Google has said it can carry out such conversations only after being deeply trained in the relevant domains (that is, trained by people and complex algorithms on the topics that might arise in those conversations), so it cannot yet carry out general conversations. This pattern is consistent with most virtual assistants that use NLP today, which tend to handle a handful of tasks well but are limited when it comes to anything else.

CloudFactory and NLP Data Processing

For a decade, CloudFactory has worked with companies that are creating products and innovating customer experiences. Our managed workers process and structure data for NLP models that solve problems across a variety of industries, including finance, insurance, security, business intelligence, and human resources.

Here’s how we help.

1. Text tagging is adding tags or annotations to unstructured data in text form. It’s a basic level of preparing data for NLP applications. You can automate some of it with a tool, but you may need a person involved in quality assurance. One CloudFactory client is building an AI platform to predict the behavior of scientific materials. Our workers transcribe text and special characters that optical character recognition (OCR) can’t decipher. Our workers also transcribe data for another client that is developing a predictive engine to prevent fraud and expense misuse.

2. Text mapping is standardizing data based on its meaning or context. Some of this can be automated with a tool, but because it requires understanding the meaning of the text, you’ll likely need a human in the loop. Chicago-based legal software company Heretik is combining efficient workflows with advanced machine learning to make the contract review process smarter. Our workers help by annotating contracts based on the Heretik team’s category instructions. For another client, our workers verify data used to train NLP models that identify prospective buyers across a particular industry.

3. Text classification is extracting data that would be used for semantic analysis. These tasks tend to be more difficult to automate with technology available today. UK-based data science company Hivemind is a software provider that helps companies distill messy or unstructured raw data into structured datasets for NLP and traditional data analysis techniques. Hivemind combines computational techniques with the tagging, mapping, and classification work of our contributors, and then provides a data quality framework around that process to ensure the integrity of final datasets.

As you move from tagging to classification, tasks become more difficult. CloudFactory’s teams can do it all, from tagging words to extracting data or meaning from messy or nonsensical text. Along the way, we work with our clients to iterate and refine inputs to improve the accuracy and performance of their models. Our managed-team approach allows us to be partners in the validation of high-performing ML models.
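As a rough illustration of the three task types, the sketch below shows what the automatable part of each might look like and where a person would step in. It is not CloudFactory's tooling or workflow. The expense-report scenario, the `AMOUNT` tag, the vendor alias table, the keyword lists, and the `needs_human_review` label are all hypothetical, and classification is approximated here as assigning a single category label, which is a simplifying assumption.

```python
import re

# Hypothetical rules and label sets, invented for illustration only.
AMOUNT_PATTERN = re.compile(r"\$\d+(?:\.\d{2})?")
VENDOR_ALIASES = {
    "intl business machines": "IBM",
    "i.b.m.": "IBM",
    "ibm": "IBM",
}
CATEGORY_KEYWORDS = {
    "travel": {"flight", "hotel", "taxi"},
    "meals": {"lunch", "dinner", "coffee"},
}
REVIEW = "needs_human_review"


def tag_amounts(text):
    """Tagging: return (start, end, tag) for each dollar amount found."""
    return [(m.start(), m.end(), "AMOUNT") for m in AMOUNT_PATTERN.finditer(text)]


def map_vendor(raw_name):
    """Mapping: standardize a raw vendor name; None means a person should decide."""
    return VENDOR_ALIASES.get(raw_name.strip().lower())


def classify_expense(text):
    """Classification: return one category, or defer when the rules are ambiguous."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    matches = [cat for cat, keywords in CATEGORY_KEYWORDS.items() if words & keywords]
    return matches[0] if len(matches) == 1 else REVIEW


if __name__ == "__main__":
    line = "Taxi $23.50 and lunch $12"
    print([line[start:end] for start, end, _ in tag_amounts(line)])
    print(map_vendor(" I.B.M. "), map_vendor("Acme Corp"))
    print(classify_expense(line), classify_expense("Hotel for conference"))
```

The pattern match rarely needs judgment, the alias lookup fails on any name it has not seen, and the keyword classifier gives up as soon as a line mentions two categories. That progression mirrors the point that human involvement grows as tasks move from tagging toward classification.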

The Future of NLP

Siri is getting better at learning my voice and responding accurately to my questions, and we can expect the accuracy of NLP-powered technology to keep improving in the coming years. As NLP is applied across more industries (legal, finance, security, and others), collective domain expertise about the linguistic nuances of each industry will grow, and the accuracy of NLP models will improve with it.

From what we are seeing at CloudFactory, it’s likely NLP and related technologies will advance to the point that one virtual assistant will be able to handle a wider variety of tasks in ways even more natural than a chatbot calling a hair salon to make an appointment for its human client.

Indeed, NLP is becoming more advanced and complex, as more businesses seek to analyze combinations of data to surface analytics and make them available to others. In the next article, we will discuss key requirements for your NLP data processing workforce, especially when quality is important.

