AV Data Management: The Role of Data Annotation Companies

Data Annotation in AI

Traditionally, software engineers have written code as explicit instructions to the computer. With the advent of deep learning techniques based on neural networks, software can be expressed in a much more abstract form (i.e., the weights of a neural network). Software engineers do not need to program the machine with specific instructions. Rather, they feed it “data,” so that the machine itself writes the code.

Once neural network models are trained properly, they can perform much better than traditional hand-written software. However, this comes at a cost. Well-trained neural network models need not only a lot of data, but also high-quality data. Data annotation and validation are required to make sense of the immense amount of raw data generated by sensors and to train algorithms to properly understand and act on the myriad of driving scenarios.

Historically, AV/ADAS companies in need of annotated data would ask engineers to spend hundreds of hours on annotation, scale up resource-intensive in-house annotation teams, or resort to crowdsourcing options, which often compromise data quality and security.

Over the last decade, data annotation and validation companies have emerged to address these issues. The best data annotation and validation companies can help speed up the development of AV technology by delivering high-quality data labeled at scale.

Data Management Challenges

Two leading AV companies, Tesla and Waymo, collect a lot of data, but in very different ways. Tesla uses half a million cars on the road to collect anecdotal data in shadow mode and has accumulated 12B Autopilot miles. Waymo, on the other hand, has used its fleet of 600 cars to accumulate 10.4M self-driven miles of full-resolution data (recorded from all sensors), while supplementing the shorter real-world mileage with 10B miles of simulated scenarios.

For both companies, as well as the entire AV industry, data is a foundational component of training models. Supervised learning, in which algorithms learn from labeled data, typically needs more data than other approaches. As a problem’s complexity increases, the data volume needs to increase with it. A model with many dimensions but only a small volume of data can overfit, which makes training data a bottleneck.
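The overfitting risk mentioned above can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not data from any AV company: the points are random, only the first of 50 features carries signal, and a simple 1-nearest-neighbor classifier stands in for a real model. With just 20 training samples, the classifier memorizes its training set perfectly but does worse on fresh points.

```python
import random

def nearest_label(point, train_points, train_labels):
    """Return the label of the training point closest to `point` (1-NN)."""
    best_index = min(
        range(len(train_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, train_points[i])),
    )
    return train_labels[best_index]

def accuracy(points, labels, train_points, train_labels):
    """Fraction of `points` whose 1-NN prediction matches `labels`."""
    hits = sum(
        nearest_label(p, train_points, train_labels) == y
        for p, y in zip(points, labels)
    )
    return hits / len(points)

def make_data(rng, n_samples, n_dims):
    """Random features; the label depends only on the first feature (toy assumption)."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_samples)]
    labels = [int(p[0] > 0.5) for p in points]
    return points, labels

rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(rng, n_samples=20, n_dims=50)   # few samples, many dimensions
test_x, test_y = make_data(rng, n_samples=200, n_dims=50)    # fresh, unseen points

train_acc = accuracy(train_x, train_y, train_x, train_y)  # each point is its own nearest neighbor
test_acc = accuracy(test_x, test_y, train_x, train_y)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The gap between the two numbers is the point of the sketch: a perfect score on the data the model has already seen says little about how it handles new scenarios.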

Hand-labeling data can be time-consuming and expensive. According to Labelbox, data scientists often spend up to 80% of their time cleaning and prepping data. Many large organizations bring data labeling in house, assigning tasks to an internal data science team or group of experts. This approach is popular with businesses performing Machine Learning (ML) at scale, such as Google. It can generate predictable results with high-accuracy labels but can take a significant amount of time and resources.

Because of the resources required for data annotation, many AV companies have started outsourcing this work to dedicated data annotation service companies. These companies offer APIs that allow businesses to upload their data, which is then labeled by offshore resources. They provide a specialized annotation service capable of labeling datasets at a much lower cost than hiring an internal team. They often crowdsource labelers, paying individuals to generate data for pre-defined actions. This approach can be cost-effective but may initially have challenges with quality control. Over time, these systems can collect enough data of a particular type to become intelligent and streamline data labeling using auto-classification.

Now, most third-party data annotation companies have semi-automatic labeling tools to reduce manual work. Automated data labeling is key, and solutions leveraging ML will be best positioned because they won’t need to build a large workforce, train labelers, and deal with quality control issues to the same degree.
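One common way to make labeling semi-automatic is to let a model propose labels and send only the uncertain ones to people. The sketch below assumes that pattern; it is not any vendor's actual tool or API, and the `Prediction` class, the `route_predictions` function, the sample frames, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    frame_id: str
    label: str
    confidence: float  # model's score in [0, 1]

def route_predictions(predictions, threshold=0.9):
    """Split model pre-labels into auto-accepted ones and ones needing human review."""
    auto_accepted, needs_review = [], []
    for pred in predictions:
        if pred.confidence >= threshold:
            auto_accepted.append(pred)
        else:
            needs_review.append(pred)
    return auto_accepted, needs_review

# Hypothetical pre-labels from some upstream detection model.
predictions = [
    Prediction("frame_001", "car", 0.98),
    Prediction("frame_002", "pedestrian", 0.62),
    Prediction("frame_003", "cyclist", 0.91),
    Prediction("frame_004", "car", 0.45),
]

auto_accepted, needs_review = route_predictions(predictions, threshold=0.9)
print("auto-accepted:", [p.frame_id for p in auto_accepted])
print("human review:", [p.frame_id for p in needs_review])
```

Raising the threshold sends more frames to human labelers and lowers the risk of accepting a wrong label; lowering it does the opposite. Where to set it is a quality-versus-cost decision.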

Data Annotation Companies

Companies specializing in annotation have become critical to the autonomous driving ecosystem and value chain. Such companies can provide quality training datasets from the raw data that AV companies collect.  

Data annotation companies provide annotation tools and perform computer vision tasks on the driving data they receive from their customers. This work can be largely divided into four tasks: image and video annotation, LiDAR point cloud annotation, sensor fusion annotation, and semantic segmentation. Most data annotation companies provide APIs through which customers upload their raw data, and the service providers then perform the annotation tasks using their proprietary tools.

One company that stands out among others is Deepen AI, which offers point-level semantic and instance segmentation of LiDAR “sequences.” Manually segmenting every single point in a scene is a massive task, and it requires great attention to detail if the machine is to gain a solid understanding of the world. It is a demanding job for humans. The biggest challenge in this context is sequences of frames. Autonomous vehicles drive along miles of road, producing LiDAR sequences of data over time. Every point in each frame needs to be labeled, turning what is already a demanding job into a massive task for humans.

According to Deepen AI, it is still a challenge to semantically segment long sequences of ground truth LiDAR data in a way that is fast, accurate and efficient — not only in terms of human labor but also in terms of computational efficiency.

Many of these data annotation companies have raised significant venture money over the past two years – Scale AI ($122.6M), Labelbox ($13.9M), Playment ($2.5M), Figure Eight (acquired by Appen for $300M), Samasource ($1.5M). Furthermore, two well-funded companies were already acquired by major AV players – Understand.ai was acquired by dSpace and Mighty.ai was bought by Uber earlier this year.

Among these, Deepen AI, Figure Eight and Labelbox focus on licensing labeling software, while others have more emphasis on data annotation services.

Innovative Data Management Strategy

Tesla’s Data Engine, an iterative process by which Tesla improves its neural net predictions, is a good example of practicing efficient data ingestion and selection processes. Renovo’s data management and orchestration platform, called Insight, promises to “quickly index and tag unstructured data from development fleets, query the most important insights, and automatically deliver them to distributed engineering teams ten times faster than any other approach.”
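The general idea of indexing and tagging data so that it can be queried later can be sketched with a small inverted index. This is only an illustration of the concept and does not represent how Tesla’s or Renovo’s systems are built; the clip IDs, the tags, and the `build_tag_index` and `query` functions are all hypothetical.

```python
from collections import defaultdict

def build_tag_index(clips):
    """Map each tag to the set of clip IDs that carry it (an inverted index)."""
    index = defaultdict(set)
    for clip_id, tags in clips.items():
        for tag in tags:
            index[tag].add(clip_id)
    return index

def query(index, required_tags):
    """Return sorted clip IDs that carry every tag in `required_tags`."""
    tag_sets = [index.get(tag, set()) for tag in required_tags]
    if not tag_sets:
        return []
    return sorted(set.intersection(*tag_sets))

# Hypothetical tagged driving clips; only this metadata is indexed, not the raw sensor data.
clips = {
    "clip_a": {"night", "rain", "pedestrian"},
    "clip_b": {"day", "highway"},
    "clip_c": {"night", "pedestrian"},
    "clip_d": {"night", "highway"},
}

index = build_tag_index(clips)
matches = query(index, ["night", "pedestrian"])
print(matches)
```

Because queries run against the small tag index, engineers can locate the handful of relevant clips without first scanning or moving the full recordings.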

Similarly, Caliber Data Labs’ “Caliber Data Platform,” which runs partly on premises and partly in the cloud, lets AV engineers send only an index of the collected data for processing, without necessarily moving the whole dataset. It offers tailored data discovery features so that AV engineers can run queries and then review the results. For data labeling, Caliber integrates the APIs of multiple third-party data annotation providers, so AV engineers can access as many labeling services and annotators as possible.

Data is the new oil in the AV industry. However, the new oil will be a very inefficient energy source for AV systems unless it is filtered with innovative data management strategies and tools.
