AI training data operations are a lot like the assembly lines of yesterday’s factories. Data is your raw material, and you have to get it through multiple processing and review steps before it’s ready for machine learning. If you want to develop a high-performing ML model, you need smart people, tools, and operations. We hosted a webinar to discuss this topic with experts in workforce and tooling for machine learning. This is a transcript of that November 14, 2018 webinar. It includes minor edits for clarity.
Philip Tester [CloudFactory]: Hello out there. Welcome to the webinar with CloudFactory and Labelbox. Perhaps the first of several. We had so many great content ideas when we were putting this together, we couldn't fit it all in today's session. So we wanted to kick off the series by covering the basics of how to build out a reliable and scalable process for training data production and why that's important.
Quick personal intro. My name is Philip Tester. I'm the Director of Business Development at CloudFactory. My job is to forge partnerships within our ecosystem to help AI innovators solve data production problems. One natural complement to CloudFactory’s cloud labor offering is data labeling platforms like Labelbox. In talking with them, we found ourselves helping similar customers solve the same problems, but we each provide a necessary yet different component of the solution. That led to some common takeaways about the challenges and knowledge gaps in the market. We want to share our lessons learned with you all.
First, just a little bit about the companies represented today. CloudFactory is an innovator in cloud labor. Our global workforce offering powers AI solutions for development teams and data scientists all around the world. Over the past several years, we've supported more than 150 AI projects and programs, many of them applying computer vision and NLP type technology including 11 of the top autonomous vehicle companies. This has given us unique expertise in best practices on how to scale data annotation to develop performant models.
Labelbox is a class-leading platform for AI data annotation, including data management, labeling, and QA. It's all fully API-integrated and extensible. Labelbox was purpose-built to support humans in the loop to develop AI.
Now to introduce our speakers today. In addition to myself, my colleague Matthew McMullen is the Growth Strategist at CloudFactory where he connects AI development and operations teams with solutions to accelerate and scale their data production process. Brian Rieger is Founder and Chief Operating Officer (COO) at Labelbox where he focuses on solving the tooling and data management challenges facing AI teams today.
As for the agenda and topics today, we'll be taking a closer look at ways that you can accelerate and scale high quality training data for your AI applications. In terms of how we'll spend that time, first we'll have about 30 minutes of prepared content, then about 15 minutes of question-and-answer time at the end. You can submit your questions along the way during the webinar. Just click on Q&A at the bottom of your screen and submit your question. We'll keep track of the questions as they come in, and when we get to that portion I'll prompt our presenters to answer them.
Finally, why does this matter? Why are these topics important? Today, as everyone on the call knows, AI is an arms race. It's not just a buzzword intended to garner funding and news in the market. That's equally true for the massive wave of frontier-pushing startups that have significant VC funding and are flooding the space, and for enterprises looking to incorporate AI into their products and services as quickly as possible, or even to diversify with new ones. Leveraging AI well is table stakes for becoming and staying competitive.
So it is a race but, as they say, haste makes waste. The urge to speed up development in order to get to market rapidly is a strong one, but it often leads to skipped steps along the way and a lack of focus on training data as a foundational building block. That leads to low-performing models and, very likely, data rework, which is expensive and impedes true progress. There's a saying that your training data is your model, so it couldn't be more crucial to make informed decisions about your data process, get it right the first time, and truly find efficiency that way. With all that in mind, let's get started. Matthew, I will hand it over to you.
Matthew McMullen [CloudFactory]: Thanks Philip, and thanks for setting that up. That was great. I'm pleased to be here to talk about scaling quality training data. You'll see that I'll be using a couple of manufacturing models as a backdrop and highlighting maybe one or two customers of ours.
Manufacturing is a century-old problem. Henry Ford back in 1910 needed a way to mass produce the Model T. So he introduced the moving assembly line by combining tech, people, and process to bring the automobile to the masses. He had the raw materials and tools with specialized workers, which we'll talk about today, and multiple process and review steps to scale production.
Today, we have the opportunity to apply similar efficiencies to AI data production. The Toyota Production System (TPS), or lean manufacturing, offers a great example for us to talk about AI data production. It goes beyond the process of the assembly line by organizing manufacturing and logistics too, including interaction with partners and suppliers. The model homes in on short production cycles, a hyper focus on quality, and continuous improvement measures to optimize each job in the system.
They make the complex simple, which happens to be one of our core company principles. Toyota’s system even goes so far as to reduce difficult jobs on the assembly line by ranking each job into three categories: green, yellow, and red. The goal is to improve jobs to the green level, essentially eliminating difficult ones. So while automation can increase efficiencies in factories, robots cannot simply plug into any worker's role and instantly save a business money.
Here we see the parallel of that combination of tech, people, and process. On one side, you'll see lean manufacturing and, on the other, data production. We have the opportunity, like I mentioned, to apply the efficiencies you see on the left-hand side to AI data production. But keep in mind that it's easy to prototype a data process internally; the hard part is actually scaling it. An agritech customer of ours has some pretty incredible KPIs (key performance indicators). For example, they measure their success by the ability to roll out a new AI and hardware system for each harvest season. That's pretty incredible.
Thanks to the five measures on the right-hand side of this slide, they have continuously improved the quality. Each new algorithm iteration gets better at spotting objects (in images) and knows what not to reach for. As an example, is that a bundle of leaves or is there an apple sitting behind those leaves? And to add to that, during each iteration they are updating hardware, so they need to adapt to the new data that's captured. I'll get into why that's relevant, too.
Taking these KPIs and framing them with the backdrop of Toyota's lean manufacturing, you can see the new efficiencies and quality of the system and how they were able to achieve these improvements. So the idea is to eliminate waste by having these five elements:
- Break the work into steps and fix bottlenecks.
- Eliminate defects.
- Reduce costs.
- Introduce flexibility.
- Find trusted partners and suppliers.
Lean manufacturing is built on flexibility. You must be able to modify the system easily, and Toyota took this to a whole new level beyond Ford's production line. Finally, in lean manufacturing, you need a strong and reliable relationship between you, your partners, and your suppliers. You rely on each other to produce high quality on time.
Moving to the right-hand side, we have the chance to apply these same elements to a data production line, which is what we introduce today as the new AI factory model. In data production, we divide the work by specialization, identify risks to quality, and apply quality control. Using the agritech customer as an example again, they are able to leverage our collaborative feedback loop to better define, measure, and evolve performance goals, because CloudFactory shares first-hand insights into how our workers are interacting with their tools.
Particularly for AI development, eliminating waste means having high-quality data from the outset. Rework is costly, as pretty much all of our customers point out. Our customers get the highest quality and value after the workers move past their initial learning curve. It’s even better when the same workers are processing the data. Their proficiency with the process and their productivity improves over time.
So as I mentioned, with lean manufacturing, to increase profits we need to reduce costs. This element has been highlighted in recent publications comparing Tesla's manufacturing model with Toyota's. Cost reduction can be achieved even when humans are part of the loop. Crowdsourcing has been a popular choice to get the process off the ground, but it actually has hidden costs from the outset. When an anonymous crowd processes your data, you lose the productivity you could gain as workers’ proficiency with your data increases over time.
Our agritech customer pointed this out recently in a QBR (quarterly business review) when they asked how someone doing this work can produce quality work if they have no context for the purpose of the data. Where does that data fit into the pipeline, and where does it fit into the vision of the company? The new AI factory model is built on having agility to iterate and evolve the process.
Just like Toyota took it to the next level, the new AI factory model does as well. We have to accommodate hardware changes and software changes all the time. When data becomes obsolete - when the hardware that captured it is no longer in use - the system has to be agile enough to support, for example, the agritech customer’s goal of getting a new robot out for each harvest season. This also includes new robotic hardware for improved grabbing, maneuvering, and mobility features.
This is a perfect transition into the last element on the right-hand side. Just like in manufacturing, you need to rely on each other to produce high-quality data, something Brian will go into in more detail. This means having people on your production line who care about the quality of your data, because some tasks are more nuanced and subjective. Continuity, specialization, and context in your labeling workforce make it possible to handle the more nuanced tasks and get them right the first time. Even with a strategic workflow, on all technical projects you're going to run into regular surprises that will require you to prioritize and revisit that initial dataset to isolate new features in the data. An efficient data production line needs to be supported by a tight feedback loop with your suppliers. Brian is going to dive deeper into the data production line, but I want to quickly highlight the human-in-the-loop aspects.
Human input can come in two forms. One is helping create the original dataset - the ground truth, or training data. The other is helping correct the inaccurate predictions that come out of the system when it's pointed at real-world examples, such as detecting false positives.
To transform this data process into an efficient process you'll need the right tools. So I’ll invite Brian to take it from here and map out that data process, review the typical production architecture, and dive deeper into the AI factory model.
Brian Rieger [Labelbox]: Thank you Matthew. That was very insightful. Hello everyone, this is Brian and I'm from Labelbox. At Labelbox, as Matthew mentioned and Philip elaborated upon, we help companies big and small with their infrastructure and tooling so that they're effectively creating and managing training data. As Matthew eloquently explained just now, a lean human-in-the-loop system is critical to scaling AI projects into production. Another term for a human-in-the-loop system, which has been alluded to, is a data production line.
I think this is truly a more accurate way of describing it. Having seen this digital data manufacturing process and its benefits time and time again, we know people are using it in their companies to move R&D or experimental AI projects into production. Let's talk about how to bring a data production line to life inside an organization. Similar to a physical factory, a data production line is operated by a cross-functional team working together on that line. The key roles are: data engineers, developers, data scientists, AI product owners, operations folks, and domain experts, also known as labelers.
- Data engineers are responsible for connecting your company's internal data systems with machine learning systems. They're preparing and processing data before and after machine learning steps and laying down the data highways within the organization and to external services.
- Developers write code and help with the custom integrations that connect all the parts of a data production line together, building custom software where needed to glue things together and create interfaces for users. Oftentimes, companies can make full use of commercial solutions to minimize or eliminate the need for this custom development, but in some cases it’s necessary, and that's normal.
- Data scientists, as we know, are responsible for the fun stuff - the machine learning. They determine what data gets labeled, how to label it, and how to model that labeled data - that training data - in a machine learning environment to best suit the desired machine learning outcomes.
- AI product owners are typically responsible for these machine learning projects meeting the needs of the business and, if the business is pushing its machine learning into a market or service, meeting those business objectives for the company in the marketplace.
- Operations roles oversee data production at the tactical level, making sure things are running smoothly. CloudFactory has world-class operations folks, and when they're working with a company they're greatly involved in facilitating that process.
- Domain experts are labelers who are responsible for labeling the data. They're also responsible for reviewing the data and making sure their fellow labelers and reviewers are all doing work in line with the interpretation that's necessary for the training data. As you can see, to fully enable each of these key roles on a data science team, and for them to collaborate effectively, a significant amount of infrastructure and tooling is required inside the company - and in some cases outside of it, plus the piping in between. How do you get these tools? There are two options, classically: building or buying.
At Labelbox we work with data science teams every day on making this decision - whether to build or buy parts or all of the data production line infrastructure and tooling. The decision is dependent on many factors. Here are some common themes to consider.
When building, you're up against unknown and evolving scope. Oftentimes, when building software you're not sure of all the requirements - hence the agile method, and so forth. Overall, we typically see six months of development time to build software and infrastructure that can take a machine learning project into production.
If you buy, you'll typically get an enterprise-ready solution that has the features that meet your requirements as you scale. You’ll have something that’s stable and ready for you to scale.
The typical development cost that we see is somewhere in the neighborhood of $500,000 to create your own production AI pipeline. When all is said and done, there are commercial tools available that are configurable, usually without code and without a lot of developer time. Any tool you develop will require maintenance: you can either maintain and support your own tool, or you can keep up automatically with the latest tech of a commercial tool and leverage its third-party support.
As we go to the next page and look at the full data production line, I want to tell a little bit of a story about one of our clients. They are a sports analytics company, and they were building tooling internally, working with Dropbox, Python scripts, and Excel spreadsheets alongside a crowd-labeling company. This was really difficult for them because it's hard to keep track of what's getting labeled, who's labeling it, how to review it, and so on. They were struggling to meet deadlines, and the AI product owner was concerned they weren't going to be able to meet the business objectives.
So at Labelbox we were able to come in and help them clean that up and get infrastructure and tooling in place that enabled them to work with a managed service provider like CloudFactory in less than a week. That created a consistent labeling team that was there for them every day, labeling their data consistently and reviewing it for the interpretation that they needed, as well as bringing that internal infrastructure that enabled the high-volume throughput. Now, they're moving forward with their first production rollout this month. Very exciting, and really indicative of a lot of what we see across the AI landscape for different domains.
Here is a representation of the data production line. This is a framework, as I've alluded to, that we've seen many teams use to take their AI systems from prototype to production. Of course, production starts with raw material. In this case that’s raw data collection, data preparation, and data storage. This is the first pillar of machine learning: digitizing the world around you so you can apply human knowledge to this raw data, then model it so machines can make mission-critical decisions and predictions for your business.
The middle pillar is a lot of what we talked about today. Labeling, reviewing and updating, and the related QA and QC processes make sure you're accurately and consistently interpreting digitized reality - that is, labeling in a way that is in line with your data science team and that will produce training data powering models applicable to your domain, whether you’re building self-driving cars or performing sports analytics. Part of this middle pillar is taking training data and curating and manipulating it before it goes into a model training environment. Some of this data you'll want to use; some of it you'll want to save for later and rework. This all depends on the training environments and scopes, as well as the machine learning and data science teams.
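The curation step described here - using some data now and holding some back - often takes the form of a simple held-out split. A minimal sketch in Python (the function name and 80/20 fraction are illustrative, not a Labelbox API):

```python
import random

def split_dataset(labeled_items, train_fraction=0.8, seed=7):
    """Shuffle labeled items, then hold a portion back for later use."""
    rng = random.Random(seed)
    shuffled = list(labeled_items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    # First slice goes into training now; the rest is saved for validation
    # or rework, as described above.
    return shuffled[:cut], shuffled[cut:]

train, held_out = split_dataset([f"image_{i}" for i in range(100)])
```

The fixed seed keeps the split reproducible across pipeline runs, which matters when you revisit the same dataset in later iterations.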
The final pillar, the fun part as they say, is the modeling, and this is the heart of machine learning at the end of the day. This is all about getting a mathematical model to represent your training data at scale. And, of course, a model needs to be deployed to be used in a product, a service, or an internal business system.
So with any mission-critical system making decisions for your business, the system must be monitored and its decisions must be assessed. Bad decisions are often fed back into the training-data process, as you can see here with the closed loop, so that you're continuously improving and evolving the model over time and it keeps performing in its domain and environment, even as the environment shifts, your business objectives shift, and so forth. This is just a natural part of deploying production systems into the marketplace. This is what we call the new AI factory model. It's what Labelbox has seen work, it's what CloudFactory has seen work, and we're really excited to talk more about it.
At the heart of all of this is making sure you're creating the highest quality training data. In reality, your model is only as good as your training data, as Philip mentioned, and this is something we see every single day. Accurate labeling is critical to a performant model. When we talk about accuracy and quality, those are different things. Accuracy means the labeling itself is accurate, whether you're labeling the edges of an object or classifying something: are the lines drawn accurately? Are the pixels identified accurately and precisely?
Quality, on the other hand, speaks to the overall dataset. Is it accurate in its granularity, but also consistent holistically across the dataset? For example, if each image has accurate edge labeling for an annotated object, that doesn't necessarily mean that, as a whole, every object is labeled the same way - or that all the pumpkins are labeled in both the foreground and background rather than only the foreground.
This gets into the quality of the whole training dataset, whether it's five, 50, or 500 labelers working concurrently. This is something that has to be understood and managed, and tools are there for you on that, as are companies like CloudFactory. Quality gets back to modeling: if a model is given training data that is inconsistent in what's being labeled, it's going to struggle to learn. It's as simple as that. The lesson here is that if you're outsourcing your data labeling, or doing it internally, and you're not measuring the accuracy or quality of that labeling, you're basically gambling with your investment.
Compounding this, unfortunately, is that rework for poorly labeled data is extremely expensive - it can be as expensive as labeling it for the first time. This is because, before anything else, you must identify the bad labels, and then they have to be corrected, and this rework process can be as time-consuming as the original labeling. It also turns out that labelers get better at annotation tasks over time.
So how do you prevent this rework? One of the ways is using labelers who work on your data every day because they begin to understand the nuance. They begin to understand the context, and this is what Matt and Philip have been talking about. As these labelers get familiar with your data, they get better and better at labeling which leads to better training data, which leads to better modeling performance. It's really that simple.
The technology and service providers you work with, such as CloudFactory, are important for ensuring this quality, particularly when it comes to implementing repeatable, scalable processes. As you scale from R&D to production, the numbers get huge: from 10,000 to 100,000 to one million. All of those levels of scale present different challenges and different tooling, volume, and stability requirements. So it's important that your tools, services, and partners are there along that journey at each step. These are the things that can take your AI projects to production. And with that, I'm going to hand it back to Philip so we can begin answering your questions.
Philip Tester [CloudFactory]: Thanks to everyone who sent in thoughtful questions. The first question is: “How do you ensure high data quality?” I think it is a really good one and I'd love to hear both Matthew and Brian touch on this.
Matthew McMullen [CloudFactory]: I'll talk about how we onboard projects and how we ensure quality from the outset. Philip, if you want to talk about the way that we hire and vet workers, that would be perfect.
Philip Tester [CloudFactory]: Yeah, sure. At CloudFactory, we know that a known, curated, and managed workforce is a necessary component. We're an impact sourcing company, so we create meaningful work for people in developing nations. We put a lot of work into finding great people. We take everyone who works for CloudFactory or works for our customers through a process of applying online and an in-person interview, followed by a detailed skills assessment. That's just to get in the door. We train workers on customers’ specific business rules and guidelines, then run test data so we can watch them continuously improve and ramp up the learning curve. Our technology platform is designed to make that process of onboarding, vetting, and managing new workers as efficient as possible. People are a core part of delivering the quality experience that we do.
Matthew McMullen [CloudFactory]: I'll add to that answer. There's actually another question here that gets at the same issue: “Is there a best way of onboarding new labelers?” Well, there is. Our teams begin by incubating and iterating on a subset of the work to develop methods and practices that will scale accuracy, quality, and even speed. Right out of the gate, we break down a subset of the work and nail it. Then we continue to incorporate efficiencies over time as the work changes and new use cases are added. Your workforce is constantly adapting as the work evolves. Typically when we onboard a client, they might think the task is already baked when, in fact, once you throw a large team at it, we're going to surface some bugs. We keep onboarding new workers to accommodate any changes in the work.
Brian Rieger [Labelbox]: Awesome. Yes, this is Brian. I'll touch on a few things. As Philip and Matthew said, having labelers who are consistently working on your data is instrumental, as far as I can tell, in getting high-quality data. They'll be there with the context and the nuance understood as they build that familiarity. But beyond that, you need to implement processes and tools - which CloudFactory does - and also internally, because part of this is making sure your data is high quality inside your organization, however it is labeled.
There are three ways to measure quality that we find effective. One is consensus, which measures how different labelers label the same image - that is, how they interpret the same image. It measures the agreement among the domain experts applying particular knowledge to an image.
The second method is called a golden set, which is a set of test questions. Basically, you create ground truth - you’ve labeled the data the way you want - send it to the team, then measure how close they come to that ground truth. That's another way of getting into the process and measuring quality over time, and the consistency of interpretation across the labeling team as a whole.
And the final way, the surefire way, is review. Not everybody reviews every image, but you can review 5%, 30%, or 100%. Those are the three mechanics that we at Labelbox implement and recommend, and what we see from the highest grade of labeling service companies.
One of the nuances here that's not always obvious is that the labeling is not always the problem. Sometimes - particularly when you're using consensus techniques - the image itself is extremely ambiguous. That's a problem, because having that image in your training data, inconsistently labeled, hurts you. It's not that the labeling team isn't good; it's that the image is highly ambiguous. Those are things you want to catch, and you potentially wouldn't catch them with review alone. It turns out that human ambiguity is not a linear-gradient optimization you can understand with math - it's very human. So with a combination of these techniques, in addition to labelers who are familiar with your data day in and day out, those four things make you very likely to succeed and get a high-quality training dataset.
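The three mechanics - consensus, golden set, and review sampling - each reduce to a small computation. A minimal sketch in Python (the label names, 0.75 agreement threshold, and data shapes are illustrative, not how Labelbox implements them):

```python
from collections import Counter
import random

def consensus(labels_per_image):
    """Fraction of labelers agreeing on the majority label, per image."""
    scores = {}
    for image_id, labels in labels_per_image.items():
        top_count = Counter(labels).most_common(1)[0][1]
        scores[image_id] = top_count / len(labels)
    return scores

def golden_set_accuracy(labeler_answers, ground_truth):
    """Share of golden-set questions answered identically to ground truth."""
    correct = sum(1 for qid, ans in labeler_answers.items()
                  if ground_truth.get(qid) == ans)
    return correct / len(ground_truth)

def sample_for_review(image_ids, fraction=0.05, seed=42):
    """Randomly pick a fraction of images for manual review."""
    rng = random.Random(seed)
    k = max(1, round(len(image_ids) * fraction))
    return rng.sample(image_ids, k)

# Low-consensus images may be ambiguous rather than badly labeled.
votes = {"img1": ["apple", "apple", "leaf"],
         "img2": ["apple", "apple", "apple"]}
flagged = [i for i, s in consensus(votes).items() if s < 0.75]
```

Flagged images can then be routed either to relabeling or to the "ambiguous source data" pile Brian describes, since the two failure modes need different fixes.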
Matthew McMullen [CloudFactory]: I want to jump in and answer two questions together. One is: “Do we have specialized people for labeling, for example, specialized industries?” And the other question is: “For labelers, how do you suggest they get trained for different types of data?” Those have much the same answer. One idea is that we do not have domain expertise, but we do have domain knowledge. We work across many different AI products, from NLP to computer vision. The more intense NLP projects can be in linguistics-intensive industries, such as legal AI or medical AI.
While something simpler in NLP could be tagging words or phrases in an article, something more complex could be sentiment analysis across different article titles or product comments. As you can see, there is specialization in terms of who we put on a certain task. We're not going to put a lower-skilled worker on a complex task. We're able to accommodate the more difficult linguistics tasks for NLP while, at the same time, positioning our workers so that if they're better at image-based tasks or more text-intensive tasks, we can align the work with their skill set.
Philip Tester [CloudFactory]: Yes. Good. Looking through a couple of the other questions posted here. There is a really specific question here that I think is a really interesting one. It is, “We built an initial training set and we're about to push our AI model into production. However, we want a human in the middle to review before we expose the output to the user. Is this achievable or realistic?” And I think this question is really around how to review and validate what's been developed before actually deploying that into production. Matthew and Brian, I don't know if one of you wants to volunteer to go first, but we run into this quite a bit.
Matthew McMullen [CloudFactory]: Yes. This is actually one of the most important tasks of any computer vision project: acquiring image data for a model’s real-world use. Of course, validation is used to assess the model and make sure it achieves its intended purpose. This is another type of dataset. For example, can the algorithm detect faces in the wild versus a controlled dataset of submissions from a laptop camera? We have both use cases. We plug into that workflow by understanding the entire context. The team leader for your workforce needs to know the intended purpose of the AI system. You also want workers to know the business rules used in creating that original training dataset, so they have a clear understanding and you can assess the feasibility of the model. So, as I mentioned, we're involved in the training dataset creation, and sometimes even the concurrent validation of the model.
Philip Tester [CloudFactory]: Brian, is there anything you'd like to add on that?
Brian Rieger [Labelbox]: Yes, this is great. Having a human in the loop is one of the components of an AI data production line. Obviously the goal is for the model to do all the work, but that's not always realistic. When you use common tools out in the world - the everyday apps that draw on computer vision software - you'll notice that a lot of that is AI-powered, but there's also sometimes a human in the loop, because the AI doesn't always get the right answer. It doesn't always have high enough confidence to act autonomously. So getting a human in the loop, not only for your R&D and model development but also for when your model is in production, is common and a sign of mature AI production that is providing value in the marketplace.
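The confidence point can be made concrete with a simple routing rule: act on high-confidence predictions automatically, queue the rest for a human, and feed the corrections back into the training set to close the loop. A sketch under assumed names (the 0.9 threshold, labels, and queue structure are illustrative):

```python
def route(confidence, threshold=0.9):
    """Return 'auto' for confident predictions, 'human_review' otherwise."""
    return "auto" if confidence >= threshold else "human_review"

review_queue, new_training_data = [], []
predictions = {"img_a": ("pedestrian", 0.97),
               "img_b": ("pedestrian", 0.55)}

for item, (label, conf) in predictions.items():
    if route(conf) == "auto":
        pass  # expose the prediction to the user directly
    else:
        review_queue.append((item, label))  # a labeler confirms or corrects

# Corrected items close the loop: they become fresh training data.
for item, label in review_queue:
    corrected_label = "cyclist"  # illustrative human correction
    new_training_data.append((item, corrected_label))
```

In a real-time setting the queue would be a live task stream to a labeling team rather than an in-memory list, but the routing decision is the same.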
One of the challenges here, as you can probably hear in the question, is how this infrastructure actually gets put in place, because it seems kind of daunting - and it is daunting. That's part of why CloudFactory and Labelbox have teamed up to talk about this. Putting that infrastructure in place, as well as getting a team in place, is instrumental in getting to a human-in-the-loop system that can support AI production. Part of this is a work-and-task management system, whether that's with a third-party service or internal.
It's complicated, and Labelbox solves part of that problem. CloudFactory helps companies do that. It is a daunting problem, and it is challenging. It is one where we recommend the data production line as a framework, and a place to work toward because it is challenging.
Matthew McMullen [CloudFactory]: That was a great question, and the audience member who posed it added a clarifying statement at the end: “We actually want to do this in real-time or close to it.” CloudFactory takes on two different types of clients. One is where we provide the human intelligence to power software, for example by inputting the data that’s on a receipt. There, we’re actually doing the auto-magical work that impacts the final customer. So you're asking whether we can validate a model in real time, or close to it, before it impacts the final customer. The answer is that we would have to get down to specifics, but we have a delivery model that can handle both. That's a great question.
Philip Tester [CloudFactory]: Great. Let's grab another question here. These are very good questions, so thanks to everyone. Peter has asked a fairly technical question: “Do you use or recommend linear regression as a way to separate the human-in-the-loop errors from the training set?” Perhaps Brian at Labelbox might have the most experience to answer this one.
Brian Rieger [Labelbox]: Yes, this is a tough question. I think there are heuristic and classical modeling techniques that can help here, depending on what kind of data you're working with. For example, if you're working with visual data and you're drawing bounding boxes, there's a very common technique that measures the accuracy of a bounding box. Ultimately, I think this boils down to the specific application. We do some of this and help clients with some of this, but it depends on where you want to set the bar, how much you want to catch, and how much model support you can have. So if you have good models - maybe even models that aren't externally facing but are internally facing - you can get under the hood and understand whether the human-in-the-loop system is effective or not, with respect to what the model thinks.
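The "very common technique" Brian alludes to for bounding boxes is usually intersection over union (IoU), which scores how well a labeler's box agrees with a reference box. A minimal sketch in Python (the function name and box format are illustrative):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the overlapping region, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A score of 1.0 means perfect overlap; in practice teams pick a threshold (0.5 is a common convention) below which a label is flagged for review.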
I think those are all ways that this can get implemented. Certainly I recommend keeping track of all these different points. This is the point of lean manufacturing itself, and why we're bringing this analogy to bear. When you compartmentalize or systematize every part of the production line, then you can scrutinize it and measure it quantitatively, giving thought to its inputs and outputs so you can understand the flux there. You also can understand where the bottlenecks are and what the performance of every subcomponent is and optimize subcomponents individually for the greater system performance. I highly recommend that. Whether that's a linear regression technique or you're using an internal modeling technique or a manual review technique with a service, those all work. It just depends on how your training data is made up and what kind of data it is.
Matthew McMullen [CloudFactory]: I want to jump in and point out a CloudFactory product offering called WorkStream Pulse. A member of the audience asked: “How can you measure the performance of the taggers?” This is important. We’ve built a platform that our workers use to do the work within our clients' environments, whether that means a tool like Labelbox or an internal tool our customer has built. We actively measure workers for their engagement, productivity, and quality. We track key data points from that WorkStream browser, such as keystrokes or time spent completing a task. We give our clients real-time visibility into the work.
Philip Tester [CloudFactory]: Yes, Matthew, I'm going to jump in here. I think that was a really good practical answer, and there are a few more things to think about. We think about performance in two ways: there is “how much” and “how good.” “How much” is how much data is being processed by your team of labelers, whether it's an internal, outsourced, or crowdsourced team: what that team produces in a given time window - per minute, per hour, per day, per week, something like that. This helps you solve for the amount of throughput you need to keep your pipeline humming.
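The "how much" side reduces to back-of-envelope throughput arithmetic. A hypothetical helper (all names and the eight-hour default are assumptions, not a CloudFactory formula):

```python
import math

def labelers_needed(items_per_day_required, items_per_hour_per_labeler,
                    hours_per_day=8):
    """Rough head count needed to sustain a target labeling throughput."""
    per_labeler_daily = items_per_hour_per_labeler * hours_per_day
    # Round up: you can't staff a fraction of a labeler.
    return math.ceil(items_per_day_required / per_labeler_daily)
```

For example, 10,000 items per day at 50 items per labeler-hour works out to 25 labelers; estimates like this are only as good as your measured per-labeler rate.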
On the “how good” side, you think about the performance of individual workers. With CloudFactory, you get visibility to that with Pulse. But success here requires that closed feedback loop between you and your workforce. If your data is processed quickly but requires going back through for a second or third pass almost all the time, you'll be paying two to three times the cost. The truth is, it requires all parties involved to continually measure the performance. We rely on our customers to tell us how we're doing, and we use Pulse to give them visibility to performance.
Sometimes there's resistance to that idea. There's definitely an appetite in the market for a solution that kind of promises and delivers a certain level of data quality. Our experience is that it's just not that simple. You have to transfer knowledge to train a team of people to get to a certain level. If it's a core process that supports a product, for example, it is critical to have your finger on the pulse of how that's going. That might be through running sample sets for review. It might be through, as Brian mentioned earlier, a quality measure like what we call the gold standard at CloudFactory: gauging performance through test tasks for which you already have the correct answers. Brian, is there anything you want to add about how tooling can help make that process efficient?
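The gold standard approach Philip describes amounts to scoring each worker against seeded tasks with known answers. A minimal sketch, assuming labels are simple categorical values (the function and data shapes are illustrative, not CloudFactory's implementation):

```python
def gold_standard_accuracy(worker_answers, gold_answers):
    """Fraction of seeded gold-standard tasks a worker answered correctly.

    worker_answers / gold_answers: dicts mapping task_id -> label.
    Only gold tasks the worker actually completed are scored.
    """
    scored = [t for t in gold_answers if t in worker_answers]
    if not scored:
        return None  # no overlap: nothing to grade yet
    correct = sum(worker_answers[t] == gold_answers[t] for t in scored)
    return correct / len(scored)
```

In practice gold tasks are mixed invisibly into the normal queue, so the score reflects everyday work rather than test-taking behavior.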
Brian Rieger [Labelbox]: Yes, we have a Labelbox performance tab dedicated to this. It's an important part of any mature system. Matthew and Philip did a great job explaining the nuances there and how to measure performance. Indeed, it is about “how much” can be created and “how good” it is. And both of those things are measured. One interesting thing we found is that being able to measure performance, particularly how long it takes to label something, is a great way to identify difficult images that are ambiguous or otherwise difficult to label. That is important because it gives you some insight into where the model is going to have challenges and also insight as to what things might come to bear at a bigger scale if you add 10 times or 100 times more training data. Keeping an eye on outliers in the productivity of your labelers can also give you insights into your data that are not related to the labelers themselves, but are related to the challenge or ambiguity of the dataset. A lot of things can be gleaned from measuring performance and it's a critical part of a mature system.
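Brian's point about labeling-time outliers signaling ambiguous items can be sketched with a simple robust filter: flag tasks that took far longer than the median. (The function, the data shape, and the factor of 3 are illustrative assumptions; a real pipeline might use per-worker baselines.)

```python
import statistics

def flag_slow_tasks(times_by_task, factor=3.0):
    """Flag tasks whose labeling time far exceeds the median.

    Unusually slow tasks often point to ambiguous or hard-to-label items.
    times_by_task: dict mapping task_id -> seconds spent labeling.
    """
    median = statistics.median(times_by_task.values())
    return sorted(t for t, s in times_by_task.items() if s > factor * median)
```

Reviewing the flagged items by hand then tells you whether the slowness came from the data (ambiguity) or the labeler.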
Philip Tester [CloudFactory]: Alright, I'm going to take one more question: “Is it good to create training data via coding and how do we ensure the quality in that case?” I'm not sure if the question here is about synthetic data as a means to develop a prototype for a model. If that's the question, then my view would be it’s only going to be good for prototyping to get started but you’ll need people involved to build and validate models. Brian, do you have any experience with using coded training data or synthetic training data?
Brian Rieger [Labelbox]: Yes. It's a great way to get a benchmark. You know, there’s a lot of data science around benchmarking. In your R&D or early development phase, what you'll typically do is take an initial set of raw data and get it labeled, either internally by hand or with a service that comes in and helps with that initial set. Then, before you optimize anything, you'll run a benchmark model to understand where you are - to learn whether you’re 50% accurate or 99% accurate.
Synthetic data fits into that stage. It's like asking yourself, “Can we create synthetic data that gets us in the ballpark of where we think we could be if we heavily invested in this machine learning endeavor?” What we've seen is that it's great for that initial set of training data. Then, as you move into high volume - once you've decided, first, that you can get somewhere with your concept - you might still use synthetic data there, or use it to augment a transfer learning system.
Ultimately, when you're in a production environment that needs any kind of high level of performance, you need real-world data that's labeled. Some examples are clients we've worked with who have cameras in parking lots or pointed at cribs and things like that. Every camera they deploy to every client needs its own slight tuning, and so does the training data, because the lighting is different, the color is different, the crib is different, and parking lot asphalt is all different. Models are brittle, so you need to build up that real-world training data. Synthetic data will get you along your path and give you a sense of the potential outcome. But ultimately, what we see is that you need real data from the real world where the model will be performing and operating in order to get the performance you typically require in a production system.
Philip Tester [CloudFactory]: Thanks Brian. In the interest of time, I think it is better to wrap a little bit early than a little bit late. We are signing off here, but I want to give a big thanks to Brian and Matthew and the teams that have created the content for today. Great job. And an especially big thank you to everyone who attended and those who asked questions. We didn't get to all of them, so we will follow up by email on those. If you can't tell, we love talking shop. We hope this is the first of many webinars and hope that you take away some valuable concepts around the importance of focusing on the details of your AI data production and getting it right the first time with the right people and the right tools, and applying your custom process needs to tie that all together. Thanks to everyone. Bye.