Q&A with V7: COVID-19 AI Data Annotation for Clinical & Diagnostic Imaging

Looking back on the last two years as the global COVID-19 pandemic still rages, it is easy to become exasperated by the proliferation of information (and confusion) about the disease. Equally frustrating, if not more so, are the cost, resourcing, and related challenges of diagnosing and understanding the long-term effects of the disease at scale.

Alberto Rizzoli, CEO and Co-Founder of V7, and his team knew that medical AI, machine learning, and radiological imaging offered a solution. They aimed to apply V7’s machine learning expertise to thousands of lung images to help researchers study COVID-19 and, eventually, help clinicians identify and triage more serious cases earlier. The longer-term goal is to apply machine learning in a way that would allow researchers to understand other lung conditions going forward. V7’s team helped train CloudFactory’s managed workforce to use V7’s Darwin annotation and data labeling tool to combine AI-driven auto-labeling and precise human-led image annotation to optimize the training data for machine learning.

We selected key questions and summarized answers from our June 2020 interview with Alberto from V7 to give a better glimpse into questions that are on the minds of medical AI experts and diagnostic healthcare technology developers. Explore more about this project and learn more about our valued partnership with V7.

Key Questions:

Why is computer vision relevant for COVID-19 and what is the application?
What are the challenges associated with using computer vision in this important medical AI use case?
Why is training data and a human-in-the-loop approach so vital to machine learning and AI when it is identifying COVID-19 and other infectious or chronic diseases?
Why is CloudFactory’s workforce model so well-suited to medical AI data labeling and annotation?
Why do companies select CloudFactory for medical AI data labeling and annotation?
What are the challenges with applying machine learning to the study of lungs and are there any gaps in the medical AI market that impede better medical research and diagnostic capabilities?

Question: Why is computer vision relevant for COVID-19 and what is the application?

Summary Answer:

Chest X-rays are easy to do, yield numerous images, don’t require decontamination, and the AI results are instantaneous. COVID-19 AI identification has helped doctors all over the world triage patients and potentially save lives.

Full Answer:

Alberto Rizzoli (V7): Yeah, imaging in the lungs has been one of the keys to understanding what the patient outcome was going to be during this crisis. There are two ways of understanding if someone is going to need medical help. The first one is obviously the swab, which tells you whether you have the virus or not. But then with COVID-19, the actual problems that may occur within your lungs are a different story. Chest X-rays have been a star in this because they are easy to do, they don't require decontamination of something like a CT scanner machine, and the results are instantaneous. And, it can be done multiple times per patient. This has helped doctors all over the world understand which patients required the most help and potentially save lives. What we're doing today is trying to assist some of those decisions with AI, thanks to the data that has been collected since January on this topic and augmenting that decision support system.

Question: What are the challenges associated with using computer vision in this important medical AI use case?

Summary Answer:

We had to work quickly, during a pandemic, to create a usable AI training dataset of previously unusable data. Many images were rotated, cropped, or captured quickly and inconsistently using mobile X-ray scanners. CloudFactory and V7 annotated and validated 6,000 lung images to create a pixel-perfect dataset that will help train AI to adapt to variables in medical imaging quality and streamline triage.

Full Answer:

Alberto Rizzoli (V7): Yeah, so we collected 6,000 freely available chest X-rays, images that don't have to be protected by any contractual agreement for users. And the CloudFactory team segmented the lungs of each of these 6,000 images. So essentially, we have 12,000 segmentations of the lungs in these images. They are all posteroanterior, so they're all sort of from the back to the front. And we are using pixel-perfect segmentation masks. This means that it isn't a rough polygon that may include some of the ribs or may include some of the diaphragm; it's been done to a high degree of accuracy and reviewed in many cases by me, personally, so that we can make sure that the ribs are excluded, such that the diaphragm is somewhat considered and the back of the lungs is also seen.

This was done in a surprisingly short amount of time by our teams together, and I'm happy with the results so far. The 6,000 images will give us the ability to find the lungs in any future X-ray that we will receive, at least in the case of V7 working with the NHS here in England, where the images can be even more varied from what we've seen. In the dataset that we put together, we're seeing lungs where the image might have been cropped across them, or some images may have been rotated, due to the hectic nature of having to collect these X-rays, sometimes with mobile X-ray scanners. This project has enabled data that was previously not usable for this research to now be usable. We can detect the lungs and then normalize them so that they all look relatively similar to one another to a machine. And, they can be compared to one another without having to consider their rotated pose or their size and so on.
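To make the normalization step concrete, here is a minimal sketch of how a lung mask could be used to crop an X-ray to the lungs and rescale it to a fixed size. It is an illustration under stated assumptions, not V7's actual pipeline: the function names `lung_bounding_box` and `crop_and_resize` are hypothetical, images and masks are plain nested lists, resizing is nearest-neighbor, and rotation correction is left out.

```python
def lung_bounding_box(mask):
    """Return (top, left, bottom, right) around nonzero mask pixels.

    bottom and right are exclusive. Hypothetical helper for illustration.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask contains no lung pixels")
    width = len(mask[0])
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop_and_resize(image, mask, out_h, out_w):
    """Crop image to the lung bounding box, then resize to out_h x out_w.

    Uses nearest-neighbor sampling so every output pixel is copied
    from a real source pixel inside the bounding box.
    """
    top, left, bottom, right = lung_bounding_box(mask)
    box_h, box_w = bottom - top, right - left
    resized = []
    for i in range(out_h):
        src_row = image[top + (i * box_h) // out_h]
        resized.append(
            [src_row[left + (j * box_w) // out_w] for j in range(out_w)]
        )
    return resized


# Tiny synthetic example: a 4x6 "image" whose lungs occupy rows 1-2, cols 2-4.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
mask = [[1 if 1 <= r <= 2 and 2 <= c <= 4 else 0 for c in range(6)]
        for r in range(4)]
normalized = crop_and_resize(image, mask, 2, 3)
```

Because every image is cropped to its own lung region and scaled to the same output size, images that were framed or sized differently end up directly comparable, which is the effect described above. A real pipeline would typically use an array library and interpolation rather than nested lists.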

Question: Why is training data and a human-in-the-loop approach so vital to machine learning and AI when it is identifying COVID-19 and other infectious or chronic diseases?

Summary Answer:

As the term “training data” implies, we need to help the AI understand what it is meant to look for and not look for in order to develop the most effective machine learning models for better diagnoses in a healthcare setting. Human-in-the-loop labeling strategies are vital because annotators help pinpoint the areas that are important to radiologists and possess the context that machines need to learn.

Full Answer:

Alberto Rizzoli (V7): What we've done together was actually pinpoint to the AI the areas that a radiologist would look at. In this case, the actual lungs of the patient. It's not easy for AI to learn that it needs to see the content inside the lungs to assess the presence of COVID-19 or its severity. And this is what this research was meant to do. It was meant to isolate the lungs so that AIs would not learn other parts of the body that correlate with disease. Sometimes you can tell the age of a patient; sometimes you can even tell the source, because which hospital an image comes from heavily correlates with, for example, how the image was cropped or its size or even its opacity and the parameters of the machine used. So, one of the biggest improvements is actually steering or supervising it a little bit just so it can learn what it actually needs to look at and not bother with other features and biases.
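One simple way to picture "isolating the lungs" is to blank out every pixel outside the segmentation mask before the image reaches a model. The sketch below is an assumption made for illustration, not a description of V7's method: `apply_lung_mask` is a hypothetical name, and images and masks are plain nested lists of the same shape.

```python
def apply_lung_mask(image, mask, fill=0):
    """Keep pixels where mask is nonzero; replace all others with fill.

    Hypothetical helper: hides regions outside the lungs (borders,
    burned-in labels, other anatomy) that could act as shortcut features.
    """
    if len(image) != len(mask) or any(
        len(img_row) != len(mask_row)
        for img_row, mask_row in zip(image, mask)
    ):
        raise ValueError("image and mask must have the same shape")
    return [
        [pixel if keep else fill for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Synthetic 3x4 example: only the two center pixels of the middle row are "lung".
xray = [
    [9, 9, 9, 9],
    [9, 5, 6, 9],
    [9, 9, 9, 9],
]
lungs = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
masked = apply_lung_mask(xray, lungs)
```

After masking, the only intensity variation left in the image comes from inside the lungs, so cues such as cropping style or scanner-specific borders are no longer available for a model to latch onto.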

Question: Why is CloudFactory’s workforce model so well-suited to medical AI data labeling and annotation?

Summary Answer:

We focus on quality, and our workforce and hiring model is very selective. We match the work to our labeling workforce ecosystem who have gone through multi-level assessments because we understand that visual data annotation is fundamentally different from tasks like text transcription. We go through rigorous training and evaluation processes to give annotators test data, monitor data quality closely, and know which workers develop competency on business rules and labeling details. Over time as people repeat tasks, they get faster, and the labeling and annotation work becomes easier with higher quality and throughput.

Full Answer:

Philip Tester (CloudFactory): We've been really happy to participate and help with the annotation of those 6,000 images and obviously would welcome the chance to do additional work along the same lines to help refine and expand the test model's capabilities. As you said, working with a greater variety of input images that are non-standard in a variety of ways — whether it's the format, orientation, or resolution. Because there are different devices that actually capture the X-rays, scan, and upload them. And so, the data handling is part of it, too. I would say that this has been a really good fit for CloudFactory’s model, which really relies on, rather than using a crowd-based approach, a much more selective model where after reviewing task details and annotation requirements, we match that to people within the CloudFactory worker ecosystem that have gone through this multi-level assessment process to understand people's skills and strengths. Visual data labeling and annotation is a fundamentally different sort of task than text transcription or the interpretation of sentiment or tone in social media. So, we start by making sure we've got the right team assigned. Then we go through a really rigorous training and evaluation process to give them test data, evaluate the upticks in data quality, indicating greater and greater understanding and application of the business rules and the details of the annotation. We focus on the efficiency data and quality. So that happens first. And over time, as people get more repetitions, they get faster and faster, and it comes easier and easier.

Question: Why do companies select CloudFactory for medical AI data labeling and annotation?

Summary Answer:

CloudFactory focuses on communication and building relationships. We combine tech with the process in our workforce management platform, empowering clients with dedicated messaging channels and collaboration tools. As their managed team, we stay focused on their needs and ensure continuity and stability.

Full Answer:

Alberto Rizzoli (V7): It's like having an extension of your team. And I think especially for medical projects, it's really important to have frank, clear, and continuous discussions with the people that are doing this type of labeling. You need someone with some clinical experience on that. That level of continuous discussions with the labelers has been particularly useful. And, everyone at CloudFactory had a very keen attention to making this a successful project. Whenever there was some feedback to be given on the quality of one element of the annotations, for example, excluding the rib cage, which is something that needs to be learned a little bit, that feedback came very quickly. And the next day, we saw results that incorporated the feedback, indicating that the feature was completely excluded. And there was a point in which the system just ran almost on autopilot. The reviews just ran swimmingly, and the annotations went through at a faster and faster rate. And we're very pleased with that.

Question: What are the challenges with applying machine learning to the study of lungs and are there any gaps in the medical AI market that impede better medical research and diagnostic capabilities?

Summary Answer:

We lack reliable, well-labeled data to jumpstart research (for example, data that includes both the back and front of the lungs to study COVID-19). Open-source, freely usable data will give a head start to anyone working on data that requires the segmentation of lungs and X-rays.

Full Answer:

Alberto Rizzoli (V7 Labs): There is a lot of work that is being done in the study of chest X-rays and CT scans in academia and also in the healthcare industry. But there isn't a whole lot of reliable, well-labeled data out there that can get people started on this research. We hope that this dataset that we put together, which will be freely usable, will give a head start to anyone working on data that requires the segmentation of lungs and X-rays. Particularly, one of the challenges is this is one of the few datasets that include the back of the lungs and the front of the lungs in the masks of the lungs. Normally, the heart, which is sort of awkwardly in the middle of one of your lungs, is excluded by the segmentations. But for COVID-19, in particular, some of the areas that actually indicate the presence of the illness were just around this area. So this dataset includes it within the annotation. Hopefully, more data and more opportunities like this will help boost the research. I'm sure that this dataset, in particular, will already be a treat for researchers working on COVID-19 detection as it's going to be freely available.

Learn more about CloudFactory’s expertise in annotating complex healthcare data to fuel medical AI innovation. Read about Sartorius’ journey here.