Traditionally, software engineers have written code as
explicit instructions to the computer. With the advent of neural-network-based
deep learning techniques, software can be expressed in a much more abstract
form (i.e., the weights of a neural network). Software engineers do not need to
program the machine with specific instructions. Rather, they feed it
“data,” so that the machine itself writes the code.
Once neural network models are trained properly, their
performance can be much better than that of traditional software. However, this comes
at a cost. Well-trained neural network models need not only a lot of data, but
also high-quality data. Data annotation and validation are required in order to
make sense of the immense amount of raw data generated from sensors and to
train algorithms to properly understand and act on the myriad of driving
scenarios.
Historically, AV/ADAS companies in need of annotated data had three options:
ask engineers to spend hundreds of hours on annotation, scale up resource-intensive
in-house annotation teams, or resort to crowdsourcing options, which often
compromise data quality and security.
Over the last decade, data annotation and validation
companies have emerged to address these issues. The best data annotation and
validation companies can help speed up the development of AV technology by
delivering high-quality data labeled at scale.
Two leading AV companies, Tesla and Waymo, collect a lot of
data, but in very different ways. Tesla uses half a million cars on the
road to collect anecdotal data in shadow mode and has accumulated 12B Autopilot
miles. On the other hand, Waymo, using its fleet of 600 cars, has accumulated
10.4M self-driven miles of full-resolution data (recorded from all sensors),
while supplementing the shorter real-world mileage with 10B miles of simulated
driving.
For both companies, as well as the entire AV industry, data
is a foundational component of training models. In supervised learning, algorithms
learn from labeled data, and this approach needs more data than other types of
software models. As a problem’s complexity increases, the required data volume
increases as well. A model with many dimensions but little data can overfit,
which makes training data a bottleneck.
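As a minimal illustration of the overfitting risk described above, the sketch below fits a least-squares linear model to synthetic data with more features than samples, then repeats the fit with many more samples. The function name, feature count, sample sizes, and noise level are all assumptions chosen for illustration and are not drawn from any AV dataset.

```python
import numpy as np

def train_and_test_error(n_train, n_features=50, n_test=1000, noise=0.5, seed=0):
    """Fit ordinary least squares on synthetic data; return (train_mse, test_mse)."""
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=n_features)  # hidden weights that generate the targets

    def make(n):
        X = rng.normal(size=(n, n_features))
        y = X @ true_w + noise * rng.normal(size=n)
        return X, y

    X_tr, y_tr = make(n_train)
    X_te, y_te = make(n_test)
    # lstsq returns the minimum-norm solution when there are fewer rows than columns
    w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    train_mse = float(np.mean((X_tr @ w - y_tr) ** 2))
    test_mse = float(np.mean((X_te @ w - y_te) ** 2))
    return train_mse, test_mse

small = train_and_test_error(n_train=40)    # fewer samples than features
large = train_and_test_error(n_train=2000)  # many more samples than features
print("40 samples   -> train MSE %.4f, test MSE %.4f" % small)
print("2000 samples -> train MSE %.4f, test MSE %.4f" % large)
```

With fewer samples than features the model can reproduce its training targets almost exactly while doing much worse on held-out data; adding samples narrows that gap.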
Hand-labeling data can be time-consuming and expensive.
According to Labelbox, data scientists often spend up to 80% of their time
cleaning and prepping data. Many large organizations bring data labeling
in-house, assigning tasks to an internal data science team or group of experts.
This approach is popular among businesses performing machine learning (ML) at
scale, such as Google. It can generate predictable results with high-accuracy
labels but can take a significant amount of time and resources.
Because of the resources required for data annotation, many
AV companies have started outsourcing this work to dedicated data annotation
service companies. Data annotation service companies offer APIs that allow
businesses to upload their data, which is then labeled by offshore resources. They
offer a specialized annotation service capable of labeling datasets at a much
lower cost than hiring an internal team. They often crowdsource
labelers, paying individuals to generate data for pre-defined actions. This
approach can be cost-effective but may have challenges initially with quality
control. Over time, these systems can collect enough data of a particular type
that they become intelligent enough to streamline data labeling.
Today, most third-party data annotation companies have
semi-automatic labeling tools to reduce manual work. Automated data labeling is
key, and solutions leveraging ML will be best positioned because they won’t need
to build a large workforce, train labelers, and deal with quality control
issues to the same degree.
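One common form of semi-automatic labeling is to let a model propose labels and send only the low-confidence proposals to human annotators. The sketch below illustrates that routing step. The `Proposal` type, the `route_proposals` function, and the 0.9 threshold are hypothetical and do not describe any particular vendor's tool.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    item_id: str
    label: str
    confidence: float  # model score in [0, 1]

def route_proposals(proposals, threshold=0.9):
    """Split model proposals into auto-accepted and needs-human-review lists."""
    auto_accepted, needs_review = [], []
    for p in proposals:
        if p.confidence >= threshold:
            auto_accepted.append(p)
        else:
            needs_review.append(p)
    return auto_accepted, needs_review

# Illustrative proposals only; IDs, labels, and scores are made up.
proposals = [
    Proposal("frame_001", "car", 0.98),
    Proposal("frame_002", "pedestrian", 0.62),
    Proposal("frame_003", "cyclist", 0.91),
]
accepted, review = route_proposals(proposals)
manual_fraction = len(review) / len(proposals)
print(f"{len(accepted)} auto-accepted, {len(review)} sent to human review")
```

The threshold trades manual effort against label quality: raising it sends more items to reviewers, lowering it accepts more model output unchecked.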
Data Annotation Companies
Companies specializing in annotation have become critical to
the autonomous driving ecosystem and value chain. Such companies can provide
quality training datasets from the raw data that AV companies collect.
Data annotation companies provide data annotation tools and
perform computer vision tasks upon receiving their customers’ driving data. This work
can be broadly divided into four tasks: image and video annotation, LiDAR
point cloud annotation, sensor fusion annotation, and semantic segmentation.
Most data annotation companies provide APIs through which customers can upload their
raw data so that service providers can perform the annotation tasks.
One company that stands out among others is Deepen AI, which
offers point-level semantic and instance segmentation of LiDAR “sequences.” The
task of manually segmenting every single point in a scene is massive and
requires a lot of attention to detail for the machine to be able to gain a
solid understanding of the world. It is a demanding job for humans. The biggest
challenge in this context comes from sequences of frames. Autonomous
vehicles drive miles of roads, producing sequences of LiDAR data over
time. Every point in each frame needs to be labeled, turning what is
already a demanding job into a massive task for humans.
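To make the scale of point-level labeling concrete, the sketch below represents a LiDAR sequence as a list of frames, each holding one semantic class ID and one instance ID per point, and counts the labels that would have to be produced. The array layout, class IDs, random placeholder labels, and sensor settings are assumptions for illustration, not Deepen AI's format.

```python
import numpy as np

# Hypothetical class IDs for illustration
CLASSES = {0: "road", 1: "car", 2: "pedestrian"}

def make_frame(n_points, rng):
    """Create one synthetic frame: xyz points plus random placeholder labels."""
    return {
        "points": rng.uniform(-50.0, 50.0, size=(n_points, 3)),
        "semantic": rng.integers(0, len(CLASSES), size=n_points),  # class per point
        "instance": rng.integers(0, 20, size=n_points),            # object per point
    }

def labels_required(sequence):
    """Every point needs a semantic and an instance label in every frame."""
    return sum(2 * len(frame["semantic"]) for frame in sequence)

rng = np.random.default_rng(0)
sequence = [make_frame(1_000, rng) for _ in range(5)]  # tiny synthetic sequence
total = labels_required(sequence)  # 5 frames * 1,000 points * 2 labels

# Back-of-the-envelope scale under assumed sensor settings:
# 10 frames per second for 60 seconds at 100,000 points per frame.
one_minute_labels = 10 * 60 * 100_000 * 2
print(total, one_minute_labels)
```

Under these assumed settings a single minute of driving already implies on the order of a hundred million point labels, which is why labeling whole sequences by hand is so demanding.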
According to Deepen AI, it is still a challenge to
semantically segment long sequences of ground truth LiDAR data in a way that is
fast, accurate, and efficient, not only in terms of human labor but also in
terms of computation.
Many of these data annotation companies have raised
significant venture funding over the past two years: Scale AI ($122.6M),
Labelbox ($13.9M), Playment ($2.5M), Figure Eight (acquired by Appen for
$300M), Samasource ($1.5M). Furthermore, two well-funded companies have already
been acquired by major AV players: Understand.ai was acquired by dSPACE and Mighty.ai
was bought by Uber earlier this year.
Among these, Deepen AI, Figure Eight and Labelbox focus on
licensing labeling software, while others place more emphasis on data annotation
services.
Innovative Data Management Strategy
Tesla’s Data Engine, an iterative process by which Tesla
improves its neural net predictions, is a good example of
efficient data ingestion and selection. Renovo’s data management and
orchestration platform, called Insight, promises to “quickly index and tag unstructured
data from development fleets, query the most important insights, and
automatically deliver them to distributed engineering teams ten times faster
than any other approach.”
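The idea of indexing and tagging fleet data so that it can be queried without moving the raw recordings can be sketched as follows. This is a minimal illustration only: the `ClipIndexEntry` type, the `query` function, the tag names, and the `onprem://` URIs are hypothetical and do not reflect any vendor's actual platform.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ClipIndexEntry:
    clip_id: str
    storage_uri: str  # where the raw data stays; the recording itself is not copied
    tags: frozenset = field(default_factory=frozenset)

def query(index, required_tags):
    """Return the entries whose tags include every required tag."""
    required = frozenset(required_tags)
    return [entry for entry in index if required <= entry.tags]

# A lightweight index: only metadata is held here, not sensor recordings.
index = [
    ClipIndexEntry("clip_a", "onprem://vault/clip_a", frozenset({"night", "rain"})),
    ClipIndexEntry("clip_b", "onprem://vault/clip_b", frozenset({"day", "pedestrian"})),
    ClipIndexEntry("clip_c", "onprem://vault/clip_c", frozenset({"night", "pedestrian"})),
]
hits = query(index, {"night", "pedestrian"})
uris_to_fetch = [entry.storage_uri for entry in hits]
print(uris_to_fetch)
```

Only the clips that match a query need to be retrieved or sent for labeling, which keeps data movement small relative to the size of the full recordings.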
Similarly, Caliber Data Labs’ “Caliber Data Platform,” which runs partly on-premise and partly in the cloud, lets AV engineers send only an index of the collected data for processing, without necessarily moving the full dataset. It offers tailored data discovery features so that AV engineers can run queries and then review the results. In terms of data labeling, Caliber integrates the APIs of multiple third-party data annotation providers, so AV engineers can access as many labeling services and annotators as possible.
Data is the new oil in the AV industry. However, this new oil will be a very inefficient energy source for AV systems unless it is filtered with innovative data management strategies and tools.