In the rapidly advancing field of autonomous vehicle development, the ability to effectively mine and analyze vast amounts of data is crucial. Engineering teams driving the future of mobility face significant challenges as they sift through extensive fleet log data to identify key insights and improve system performance. Traditional data mining techniques often fall short, proving too slow, inflexible, or inefficient to handle the complexity and volume of autonomy-specific datasets. These challenges include the manual triage of data, the reliance on narrow task-specific models, and the inherent difficulties in adapting to evolving data needs without substantial reconfiguration and retraining.
Recognizing these obstacles, Applied Intuition is at the forefront of transforming how autonomy engineers approach these vast datasets. By leveraging advanced AI technologies, including foundation models and emergent machine learning paradigms, we offer more scalable and efficient solutions.
In this blog post, we explore the challenges associated with data mining and how traditional approaches often fall short. We introduce emergent machine learning paradigms, particularly foundation models, that provide a more efficient and scalable approach. Finally, we discuss how Applied Intuition’s Data Explorer leverages these technologies to enhance data retrieval and accelerate the development of autonomous vehicles.
Data-Driven Autonomy Stacks
Autonomy stacks are increasingly data-driven. Many hard-coded modules have been replaced with ML-based ones, most notably modules for perception, prediction, and planning. This trend only accelerates as the industry moves toward fully differentiable end-to-end autonomy stacks.
To test ML-based autonomy stacks in the real world, engineering teams typically drive fleets of autonomous vehicles to evaluate system performance and identify failures. When a failure is discovered, improving the system requires a fast, multi-step “data loop”:
- Mine fleet log data to find similar failure cases
- Assemble a dataset from the mined data to improve the autonomous system
- Build simulation test cases that cover the failures and track future regressions
For example, if the onboard perception system struggles to detect jaywalking pedestrians, engineers can mine for similar scenarios in the fleet log data. They can then create a targeted dataset to improve the perception system and a suite of tests to track future perception system regressions.
Finding Specific Events in a Sea of Data
Integral to a fast data loop is efficient data mining. Running autonomous vehicle fleets at scale means access to significant amounts of diverse driving data. However, the sheer volume of data can make mining for relevant events both time-consuming and complex.
The most basic approach to data mining is manual triage: engineers watch entire drive logs and surface qualifying events. Not only does this method fail to scale as fleets grow, but it also creates room for human error and inconsistency, as engineers may miss subtle events or interpret data differently.
Task-specific machine learning models and heuristics can help automate parts of this manual triaging process. For example, an offline object detection system trained to accurately detect pedestrians can help triage engineers find examples of jaywalkers more quickly.
Additionally, unsupervised learning techniques such as clustering and anomaly detection can identify patterns or unusual events within the data. However, these techniques often rely on hand-crafted features or classical dimensionality reduction, both of which introduce biases and struggle to retain the original semantic information in the log data.
Overall, narrow task-specific models, heuristics, and hand-crafted features can be difficult or expensive to adapt as data needs change.
The Rise of Generalist Models for Data Mining
Foundation models pretrained on internet-scale data have ushered in a new paradigm in machine learning. Unlike task-specific models, foundation models are designed to be versatile, capable of understanding and adapting to a wide range of tasks. The best-known examples are large language models, which excel at in-context learning. Applied to data mining, these generalist models can identify rare or complex patterns in large datasets without needing to be modified or retrained as data needs evolve.
Applied Intuition's Data Explorer leverages multimodal foundation models to find long-tail events in fleet log data from just a natural language description. For example, engineers can search for "cyclists at night," "pedestrians jaywalking," or "construction zones" and Data Explorer will surface relevant log segments that match this description. This enables faster and more flexible data mining than can be accomplished with heuristics or task-specific models.
Building an AI-powered Search Engine
Data Explorer uses multimodal foundation models to help engineers find relevant fleet log data faster. These models power Data Explorer's log data search engine, which is engineered to be:
- Relevant and flexible: Results should closely match the user query, and the search engine should support a large variety of user queries without requiring extensive reconfiguration or retraining
- Fast and scalable: Results should be surfaced with low latency, and the search engine should maintain performance as the dataset grows
- Cost-efficient: Infrastructure costs should be minimized
Neural representations for relevant and flexible data retrieval
Data Explorer's log data search engine leverages a foundation model trained via contrastive learning. Our multimodal model is trained on a large dataset of over 5 billion text-image pairs scraped from the internet, encompassing a wide range of subjects, styles, and contexts. This diverse dataset allows the model to learn rich and generalized representations of textual and visual data, making it robust to data distribution shifts. For example, our model can handle variations in lighting, perspective, or even unexpected visual elements within images, ensuring accurate associations between text queries and visual data across different scenarios.
Applied Intuition's foundation model learns by associating each text description with its corresponding image while distinguishing it from unrelated pairs. In the training process, it learns neural representations (embedding vectors) for both vision and language data. Similar images and text will have similar embedding vectors.
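The symmetric contrastive objective described above can be sketched in a few lines of numpy. This is a minimal illustration of the general CLIP-style training loss, not Applied Intuition's actual implementation; the function names and the temperature value are our own assumptions.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss over a batch of
    matched (image, text) embedding pairs. Row i of each array
    is assumed to be a matching pair."""
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix; entry (i, j) compares image i to text j
    logits = image_emb @ text_emb.T / temperature

    # The matching text for image i sits in column i
    n = logits.shape[0]
    labels = np.arange(n)

    def cross_entropy(lg, lb):
        # Row-wise softmax cross-entropy against the matching column
        shifted = lg - lg.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Training drives this loss down, which pulls matching image and text embeddings together while pushing unrelated pairs apart.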
After training, the foundation model is evaluated against internal automotive-specific benchmarks to verify its usefulness for automotive data retrieval. Our benchmarks measure the precision and recall for zero-shot classification of important fleet log data elements like pedestrians, vehicles, weather conditions, and road markings.
Creating a two-tower retrieval system
The vector distance property of our image-text embeddings can be leveraged by downstream tasks like data retrieval. In particular, Data Explorer uses such embeddings to power a two-tower data retrieval system:
- Fleet log camera data is embedded once upfront. This is the first tower in the retrieval system and is formally known as the "item tower." For example, a single frame of the vehicle's front-facing camera produces a single embedding vector.
- User queries, written in natural language, are embedded at query time. This is the second tower in the retrieval system and is formally known as the "query tower." For example, the query "construction zone at an intersection" produces a single embedding vector.
- Finally, Data Explorer performs nearest neighbor search to find the log data embeddings that match the user query embeddings most closely.
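The three steps above can be sketched as follows. This is a minimal, single-machine illustration of the two-tower pattern; the encoder interfaces are placeholders standing in for the multimodal foundation model, not Data Explorer's actual API.

```python
import numpy as np

# --- Item tower: embed fleet log camera frames once, upfront ---
def embed_items(frames, image_encoder):
    """Compute and L2-normalize one embedding vector per frame."""
    embs = np.stack([image_encoder(f) for f in frames])
    return embs / np.linalg.norm(embs, axis=1, keepdims=True)

# --- Query tower: embed the natural-language query at search time ---
def search(query, text_encoder, item_embs, k=5):
    """Return indices of the k frames nearest to the query embedding."""
    q = text_encoder(query)
    q = q / np.linalg.norm(q)
    sims = item_embs @ q          # cosine similarity against every frame
    return np.argsort(-sims)[:k]  # top-k nearest neighbors
```

Because the item tower runs once per frame while the query tower runs once per search, the expensive work is amortized across all future queries.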
Spark for fast and scalable nearest neighbor search
Data Explorer leverages Apache Spark to scale nearest neighbor search to thousands of hours of fleet log data. Nearest neighbor search is particularly suited to this application because it enables the system to quickly find embeddings that are most similar to the user query. Motivations for using Spark include:
- Strong scaling properties: Spark is a distributed system that can scale horizontally to meet demand as query and data volume increase
- Integration with structured vehicle log data: Data Explorer already exposes structured vehicle log data (perception outputs, ego pose, etc.) in Spark, making it possible to natively perform hybrid search using both natural language and structured data filters
- Out-of-the-box support: Spark natively supports approximate nearest neighbor search for fast vector search
- Mature internal infrastructure: Applied Intuition already leverages Spark across many of our product offerings
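Spark's built-in approximate nearest neighbor support is based on locality-sensitive hashing via random hyperplane projections. The pure-numpy sketch below illustrates that underlying idea on a single machine; it is not the distributed Spark implementation, and the function names are our own.

```python
import numpy as np

def lsh_buckets(embs, planes):
    """Hash embeddings into buckets by the sign pattern of random
    hyperplane projections (the idea behind random-projection LSH).
    Nearby vectors tend to land on the same side of each plane."""
    signs = (embs @ planes.T) > 0
    # Pack each row of sign bits into a single integer bucket id
    return signs.astype(np.int64) @ (1 << np.arange(planes.shape[0]))

def approx_nearest(query, embs, planes, k=5):
    """Search only within the query's bucket instead of scanning
    everything; fall back to a full scan if the bucket is empty."""
    buckets = lsh_buckets(embs, planes)
    q_bucket = lsh_buckets(query[None, :], planes)[0]
    candidates = np.where(buckets == q_bucket)[0]
    if len(candidates) == 0:
        candidates = np.arange(len(embs))
    sims = embs[candidates] @ query
    return candidates[np.argsort(-sims)[:k]]
```

The trade-off is classic approximate search: restricting comparisons to one bucket cuts latency dramatically at the cost of occasionally missing a true neighbor that hashed elsewhere.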
Optimized ML inference for infrastructure cost savings
The two-tower retrieval system requires ML-model inference in each of the two towers. However, the access pattern in each of these two towers is vastly different, requiring two separate approaches for ML inference:
The "item tower" generates embeddings for many images in large batches asynchronously. For example, a 20-minute log with four cameras sampled at 4Hz will require embedding around 20,000 images. In this case, it is important for the CLIP-based visual encoder, a variant of vision transformers (ViT), to have high throughput. Individual request latency is less of a concern, as these image batches are computed asynchronously during initial log data upload.
To accomplish this, we run the CLIP-based visual encoder on cloud GPUs. GPUs excel at processing large batches of data simultaneously, making them ideal for high-throughput tasks like embedding thousands of images efficiently. However, due to their high cost, we use a queue system that automatically scales the GPUs up and down to handle load. This includes scaling the GPUs to zero when there is no load to save cost.
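The batching side of this design can be sketched with the standard library's queue module. This is a simplified single-process illustration of the throughput-oriented pattern, not Applied Intuition's production queue system; the batch size and function names are assumptions.

```python
import queue

def drain_in_batches(work_q, embed_batch, batch_size=64):
    """Pull pending frames off the queue and embed them in large
    batches, which is what keeps GPU utilization high. When the
    queue is empty the worker exits -- the point at which an
    autoscaler could scale GPU workers down to zero."""
    results = []
    while True:
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(work_q.get_nowait())
            except queue.Empty:
                break
        if not batch:
            break  # no load left: worker is a candidate for scale-to-zero
        results.extend(embed_batch(batch))
    return results
```

In production, each batch call would run the visual encoder on a GPU, and a separate autoscaler would add or remove workers based on queue depth.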
The "query tower" generates embeddings for a single piece of text at query time. Low latency is crucial for this process because it directly impacts the responsiveness of the search engine, ensuring that users receive results quickly. Request throughput is less of a concern, as we expect there to be relatively low load.
To accomplish this, we run the CLIP-based text encoder (a GPT-style transformer) on always-on cloud CPUs. While CPU throughput is lower, CPUs are significantly cheaper, so the text encoder can remain operational around the clock without incurring GPU costs. This makes CPUs an ideal choice for the low-volume but latency-sensitive workload of the query tower.
Looking ahead, we aim to further enhance Data Explorer’s data intelligence capabilities by integrating more data modalities, incorporating temporal context, and building automations to fine-tune foundation models on customer-specific data.
Our foundation models currently only use the camera data in fleet logs. However, vehicle logs typically include a number of other useful signals for understanding the context of a scene, for example: LiDAR, radar, map information, and onboard stack outputs. Incorporating these additional signals will help our foundation models build a deeper understanding of what is happening in a scene.
Additionally, our current foundation models process individual image frames and do not consider what happens over time across multiple frames. By expanding our foundation models to video, we can improve scene understanding. For example: interpreting vehicle maneuvers through their motion, understanding pedestrian intentions through their actions, and tracking the state of intersections as traffic lights change.
While our foundation models are trained on a vast amount of data to reduce their sensitivity to distribution shift, models will always perform best when fine-tuned on customer-specific data that most closely matches their desired use case. We intend to make this process self-serve directly within Data Explorer to boost foundation model performance on specific tasks.
At Applied Intuition, we are committed to pushing the boundaries of AI-driven autonomy development. Data Explorer is revolutionizing the way autonomy engineers analyze fleet log data, making the process significantly faster and more efficient.
If you're looking to accelerate your autonomy stack development, learn more about how Data Explorer can streamline your data management and analysis processes, enabling more efficient development cycles and deeper insights into your data.
And if you're passionate about building cutting-edge AI tools for the future of autonomy, consider joining the Applied Intuition team.