Although CloudFactory is flexible on tooling, we depend on a network of carefully vetted, high-quality tool partners to meet the requirements of every data labeling and data processing project our managed workforce supports.
In this post, meet Datasaur, one of our text and audio annotation partners.
Datasaur, led by CEO Ivan Lee, is an industry-leading data annotation platform that supports natural language processing (NLP), data labeling, text annotation, audio transcription, labeling automation, advanced workforce management solutions, and quality control workflows. Spotify, Qualtrics, and Stanford University are just a few of the organizations that have chosen Datasaur for data labeling.
What inspired Ivan to create Datasaur? What’s challenging and exciting about the NLP space? And what’s on the Datasaur product development horizon? CloudFactory Solution Consultant Ed Davis caught up with Ivan to ask those and other questions. In this post, read edited highlights from the conversation. To watch the full interview—and to see a demo of Datasaur’s audio annotation capabilities—check out the interview video on YouTube or watch it here, at the bottom of this page.
Conversation highlights: Getting to know Datasaur
Ivan, tell us a bit about your background and Datasaur’s history:
Ivan Lee: I've been a product manager for AI for the last decade at several different companies. Most recently, I was at Apple, where I helped spend hundreds of millions of dollars on data labeling. There, I discovered that there were many tools focused on computer vision, but when it came to text, Apple was still using spreadsheets to label a good deal of text data. No one seemed to be building a toolset, so I left the company and started Datasaur to build precisely those tools. For the last three years, we've been working on creating the best software possible for labeling text and audio data.
Y Combinator, a venture capital firm for early-stage startups, played an important role in Datasaur’s early years. Could you share a bit of your experience with Y Combinator?
Ivan Lee: That was a great experience. About a year into the company, we had built an early MVP and were looking to scale it. I had been a consumer product manager, so I knew how to build and design products. But I had never actually done sales before. Y Combinator helped me start selling to other businesses and figure out how to frame and position our product.
We're seeing a rise in technologies like chatbots, voice assistants, and language translators. Customer service applications seem to lead the adoption of these technologies. Where else are you seeing demand grow for NLP technologies?
Ivan Lee: We're seeing these technologies being used in the legal space to label contracts and extract information automatically. We also see them in healthcare conversations between doctors and patients, where we summarize symptoms, illnesses, and treatments from text patterns across millions of examples. It's a really powerful technology that’s going to permeate pretty much every aspect of our lives by the end of this decade.
Watch the full interview to see Datasaur’s annotation capabilities put to work on a customer service call. Explore the tool’s capabilities further on Datasaur’s interactive playground.
As NLP technology continues to evolve to better understand the nuances, context, and ambiguities of human language, what are some of the challenges Datasaur is facing to keep up?
Ivan Lee: We're seeing increasingly advanced product requirements. Let's say we find a restaurant review online. A lot of companies will want to just do basic sentiment analysis on it. Overall, was the review positive or negative? If somebody says, “The food was great, but the line was way too long,” is the overall sentiment positive or negative? It's a judgment call. We're starting to see NLP evolve to answer that question using something called aspect-based sentiment analysis. In the sentence, you want to establish a connection between food and great and between line and long. Doing so gives you a more nuanced understanding of what the customer is really saying. And that's much more pragmatic, right? Imagine trying to draw a relationship between those words in a spreadsheet. It’s going to be very difficult. What we've done is make it really easy to draw an arrow from one word to another, as you would in a PowerPoint deck.
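To make the idea concrete, here is a minimal Python sketch of how the relationships Ivan describes could be stored as data. It is an illustration only, not Datasaur's data model or export format: the `Span` and `Relation` classes, the label names, and the polarity values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labeled stretch of text (hypothetical structure for illustration)."""
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str


@dataclass(frozen=True)
class Relation:
    """The 'arrow' drawn from an aspect to the opinion word describing it."""
    aspect: Span
    opinion: Span
    polarity: str


def find_span(text: str, phrase: str, label: str) -> Span:
    # Locate the first occurrence of the phrase and wrap it in a Span.
    start = text.index(phrase)
    return Span(start, start + len(phrase), label)


def summarize(text: str, relations: list[Relation]) -> dict[str, str]:
    # Map each aspect's surface text to the polarity of its linked opinion.
    return {text[r.aspect.start:r.aspect.end]: r.polarity for r in relations}


review = "The food was great, but the line was way too long."

relations = [
    Relation(find_span(review, "food", "ASPECT"),
             find_span(review, "great", "OPINION"),
             "positive"),
    Relation(find_span(review, "line", "ASPECT"),
             find_span(review, "long", "OPINION"),
             "negative"),
]

aspect_sentiment = summarize(review, relations)
print(aspect_sentiment)
```

A single document-level label would have to pick one answer for this review. Keeping one relation per aspect preserves both the praise for the food and the complaint about the line.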
Given the potential for subjectivity in NLP, how can annotation workforces eliminate bias and implement quality control measures in their human-in-the-loop workflows?
Ivan Lee: This is really important. We have years of experience working with teams around the world and know how to set up projects that align with industry best practices. We have also learned a great deal from partners such as CloudFactory about how to set up a workflow that guarantees the best results.
We’ve built our software to enforce those best practices. There's a lot of subjectivity if a single person is labeling data. So a client might assign both you and me to label the exact same document. Then, all they have to do is resolve the areas where we disagreed. We make it really easy for the project managers and the data scientists who run their projects.
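The double-labeling workflow Ivan describes can be sketched in a few lines of Python. This is a simplified illustration, not Datasaur's implementation: the function name `find_disagreements`, the token-level labels, and the two sample annotations are hypothetical.

```python
def find_disagreements(labels_a: list[str], labels_b: list[str]) -> list[int]:
    """Return the token positions where two annotators chose different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same tokens")
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


tokens = ["The", "food", "was", "great", "but", "the", "line", "was", "long"]

# Two annotators label the same document independently (made-up labels).
annotator_a = ["O", "ASPECT", "O", "OPINION", "O", "O", "ASPECT", "O", "OPINION"]
annotator_b = ["O", "ASPECT", "O", "OPINION", "O", "O", "O", "O", "OPINION"]

conflicts = find_disagreements(annotator_a, annotator_b)

# Share of tokens on which the two annotators agree.
agreement_rate = 1 - len(conflicts) / len(tokens)

# Only the conflicting tokens need a reviewer's decision.
to_review = [(i, tokens[i], annotator_a[i], annotator_b[i]) for i in conflicts]

print(to_review)
print(round(agreement_rate, 2))
```

In this toy case the reviewer sees one token to resolve instead of nine to relabel, which is the time saving the workflow is after. Raw agreement is used here for simplicity; production quality control often relies on chance-corrected measures instead.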
There's a demand for making NLP tools more inclusive and available to a wider audience. What steps are you guys taking to make your application more accessible?
Ivan Lee: There have been so many advances in NLP over the last decade, but most of that research has been in English and Mandarin because that's where the researchers are based. So one of the really powerful things we're doing with Datasaur is supporting any language on the planet. We're seeing customers come in from around the globe, looking to build NLP capabilities for their language and country.
What challenges does globalization of NLP present?
Ivan Lee: There are some scripts that lack spaces. In NLP, there's something called tokenization, where you assume that the words are separated by spaces. But in some languages, those spaces don't naturally exist. And so we have to be able to break things down without using those spaces as guideposts. Arabic was another big challenge for us because the script goes from right to left and our database assumed that everything was indexed from left to right. It took us a month of work, but now we support right-to-left scripts, including Arabic, as well. We haven't found a language we don't support at this point. It was really important for us to make sure that our software was inclusive of all languages.
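The tokenization problem is easy to see in code. The Python sketch below compares splitting on spaces with a simple dictionary-based approach (greedy longest match). It is a toy illustration under stated assumptions, not how Datasaur tokenizes text: the function names and the tiny vocabulary are made up for the example, and real systems typically use trained segmenters.

```python
def whitespace_tokenize(text: str) -> list[str]:
    # Works only when the script separates words with spaces.
    return text.split()


def max_match_tokenize(text: str, vocabulary: set[str]) -> list[str]:
    """Greedy longest-match segmentation against a known, non-empty word list."""
    max_len = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# "I like natural language processing", written without spaces.
sentence = "我喜欢自然语言处理"

# Hypothetical toy vocabulary for this example.
vocabulary = {"我", "喜欢", "自然", "语言", "处理", "自然语言处理"}

by_spaces = whitespace_tokenize(sentence)
by_dictionary = max_match_tokenize(sentence, vocabulary)

print(by_spaces)
print(by_dictionary)
```

Splitting on spaces returns the whole sentence as one token, while the dictionary lookup recovers three words. An annotation tool has to settle on boundaries like these before anyone can attach a label to a word.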
Those are the highlights. Interested in more?
Watch the full interview with Ivan Lee to see a live audio annotation demo and to hear about Datasaur’s recent acquisition of Konvergen AI and how GPT-3 and OpenAI impact the NLP industry and Datasaur’s platform.
CloudFactory is grateful for its continued partnership with Datasaur. We look forward to exciting developments on the Datasaur product roadmap and within the text, audio, and natural language processing space.
Let’s talk if you want your business to innovate at scale by using high-quality, annotated text and audio data delivered by CloudFactory’s expertly trained and managed workforce.