Autonomous vehicle (AV) systems are complex and require a robust testing and validation framework to support the engineers who build them. Software development at large has well-defined processes and infrastructure for avoiding bugs and maintaining quality over time, especially across a large developer base. Automated regression catching through continuous integration (CI) is a critical component that ensures overall reliability by regularly testing the relevant features of the software. These methodologies are also starting to be adopted for AV development. In this post, the Applied Intuition team discusses the challenges of regression testing for autonomous driving systems and shares best practices for setting up tests effectively and debugging failures.
Challenges With Regression Testing for Autonomous Vehicles
A typical workflow for code development using simulation for an autonomous driving system is shown in Figure 1. Each stage of this workflow presents unique challenges:
Developing and pushing code:
- Interconnected code with many people committing changes: The interdependence of different modules reduces the efficacy of unit tests as the system grows in complexity.
Running simulations:
- Precisely defining pass/fail criteria: There are different layers of abstraction for what defines “working” software, and accurately reflecting these complex requirements can be challenging.
- Many variations of a base test need to be run and tracked to properly evaluate pass/fail: Because small variations in an environment can result in large differences in software behavior, the test variation space should ideally be large enough to account for this.
Analyzing results:
- Repeatability of test results: Results might vary between local desktops where engineers develop and the cloud environment where regression tests happen.
- Measuring and analyzing progress: Data must be aggregated and easily accessible to a wide audience.
Best Practices for Regression Testing
Regardless of how far along a team’s autonomous vehicle development is, no company is too early to implement simulation and automated CI testing; the earlier the tools are available, the faster algorithm development can proceed. Below are some of the best practices that Applied Intuition suggests for setting up regression testing and accelerating the deployment of autonomous vehicles.
Flexible Deployment Capability:
Software development organizations use large numbers of personal machines alongside centralized compute infrastructure for testing and development. CI and regression testing should take advantage of existing compute resources without additional costly installations and maintenance. If the framework is easy to distribute across an entire organization and can run on a variety of different machines, the activation energy for new users is significantly lower. Moreover, simulations should be able to leverage orchestration platforms such as Docker Swarm or Kubernetes to support rapid growth with reduced overhead.
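As a rough illustration of what this can look like, the sketch below submits a single simulation run as a Kubernetes Job using the official Python client. The container image, namespace, and scenario file are hypothetical placeholders, and the exact integration will vary by simulator and cluster setup.

```python
# Minimal sketch: submitting one simulation run as a Kubernetes Job.
# Image name, namespace, and scenario path are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
batch = client.BatchV1Api()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="sim-stop-sign-001"),
    spec=client.V1JobSpec(
        backoff_limit=0,  # surface simulation failures instead of silently retrying
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="simulator",
                        image="registry.example.com/av-simulator:latest",  # hypothetical image
                        args=["--scenario", "scenarios/stop_sign_4way.yaml"],
                    )
                ],
            )
        ),
    ),
)

batch.create_namespaced_job(namespace="simulation", body=job)
```

The same pattern scales out naturally: a CI job can fan out one Kubernetes Job per scenario and collect the results as they complete.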
Focused Testing:
Scenario regression tests should be as narrowly targeted and isolated as possible in order to get a clean signal from their results; in practice, this means a scenario should exercise only the code that is relevant to it. For example, regression tests for the planning module’s stop sign logic should comprehensively cover all situations involving n-way stop-sign intersections, both with and without other interacting actors (Figure 2). As a result, mocked inputs to these modules are often necessary. A common example is synthetic simulation, where perception inputs are created virtually and only the planning module is running. Simulating the inputs to the planning module, when done realistically, allows the behavior of interest to be tested in isolation.
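To make the idea of isolation concrete, here is a minimal sketch of a unit-style planning test with a mocked perception input. The `plan_speed` function, the `StopSign` and `TrackedObject` types, and the threshold values are all hypothetical stand-ins for a team’s own planner interface; the point is the structure of the test, not the placeholder logic.

```python
# Sketch: testing stop-sign behavior in isolation with synthetic (mocked) perception input.
# All types and the planner stub below are hypothetical placeholders, not a real planning module.
from dataclasses import dataclass

@dataclass
class StopSign:
    distance_m: float  # distance from ego to the stop line

@dataclass
class TrackedObject:
    kind: str
    distance_m: float
    speed_mps: float

def plan_speed(stop_sign: StopSign, objects: list, ego_speed_mps: float) -> float:
    """Trivial stand-in for the real planner, included only so the sketch runs."""
    if stop_sign.distance_m < 30.0:
        return min(ego_speed_mps, stop_sign.distance_m * 0.3)
    return ego_speed_mps

def test_planner_decelerates_for_four_way_stop_with_crossing_vehicle():
    scene = [TrackedObject(kind="vehicle", distance_m=15.0, speed_mps=4.0)]
    commanded = plan_speed(StopSign(distance_m=20.0), scene, ego_speed_mps=8.0)
    # Only the planner's decision is asserted; perception and control stay out of scope.
    assert commanded < 8.0, "planner should begin decelerating for the stop sign"
```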
Easy Scenario Creation:
Synthetic scenarios should be quick to create and easy to extend, i.e., it should be possible to create a scenario in 10 minutes and extend it through programmatic variations in another 10 minutes to cover dozens of unique cases. This allows developers to quickly address issues and verify that the discovered problems are solved. The use of previous drive data for simulation (re-simulation) is also valuable. Once the scenarios are created and running as part of the commit test suite, they effectively prevent any large regression in the stop sign logic, allowing developers to confidently submit code that will not break existing features.
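As a loose illustration of programmatic variation (the parameter names and ranges are hypothetical, not a specific scenario format), a few lines can expand one base stop-sign scenario into dozens of concrete cases:

```python
# Sketch: expanding one base scenario into programmatic variations.
# Parameter names and ranges are illustrative only.
from itertools import product

base_scenario = {"map": "4way_stop_intersection", "ego_route": "straight_through"}

ego_speeds_mps = [4.0, 8.0, 12.0]
crossing_gaps_s = [1.0, 2.0, 3.5, 5.0]
crossing_actors = ["vehicle", "cyclist", "pedestrian"]

variations = [
    {**base_scenario,
     "ego_speed_mps": speed,
     "crossing_gap_s": gap,
     "crossing_actor": actor}
    for speed, gap, actor in product(ego_speeds_mps, crossing_gaps_s, crossing_actors)
]

print(f"{len(variations)} unique cases from one base scenario")  # 36 cases
```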
Integration into Development Workflows:
Because development typically spans multiple environments (e.g., engineers writing code on local workstations but running simulations in the cloud), a simulator must be carefully integrated into the development workflow and must ensure that test results are repeatable whether the tests run in the cloud or on local machines. Adding simulation as a required step in existing CI/CD pipelines is critical; developers then get immediate feedback on their code changes. Additionally, having a simulator work closely with existing data processing pipelines gives stakeholders ready access to important results.
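For example, a thin gate script like the sketch below can run a commit test suite inside the pipeline and fail the build when anything regresses. The `sim-runner` command and the `results.json` format are hypothetical placeholders; a real pipeline would call whatever interface the simulator exposes.

```python
# Sketch of a CI gate step: run a simulation suite, fail the pipeline on regressions.
# The "sim-runner" CLI and results.json layout are hypothetical placeholders.
import json
import subprocess
import sys

def run_commit_suite(suite: str) -> int:
    subprocess.run(
        ["sim-runner", "run", "--suite", suite, "--output", "results.json"],
        check=True,  # a crashed simulation run should also fail the build
    )
    with open("results.json") as f:
        results = json.load(f)

    failures = [r["scenario"] for r in results if not r["passed"]]
    for name in failures:
        print(f"REGRESSION: {name}")
    return 1 if failures else 0  # nonzero exit code blocks the pipeline

if __name__ == "__main__":
    sys.exit(run_commit_suite("commit_tests"))
```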
Precise Description of Pass/Fail Criteria and Intuitive Dashboards:
It is also important to capture and view test results in a clean and effective way. This starts with defining the scenario test pass/fail criteria, which should be as specific as possible while still capturing the full scope and intent of the test. Ideally, the pass/fail criteria come directly from specific requirements of the self-driving system, and tools should support quick and easy translation of requirements into programmatic test cases. A scenario language should allow test criteria to be described precisely while remaining flexible enough to cover a wide range of requirements.
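As a simplified example of requirement-to-check translation (the logged signal names and thresholds here are hypothetical), a requirement such as “the vehicle shall come to a complete stop before the stop line, without harsh braking” can be expressed as explicit programmatic checks rather than a vague “scenario completed” flag:

```python
# Sketch: translating written requirements into precise pass/fail checks over a logged trajectory.
# Signal names (speed_mps, distance_to_stop_line_m, accel_mps2) and thresholds are illustrative.
def check_stop_sign_requirements(trajectory: list) -> dict:
    """Each trajectory sample is a dict of logged ego signals at one timestep."""
    stopped_before_line = any(
        s["speed_mps"] < 0.1 and s["distance_to_stop_line_m"] > 0.0
        for s in trajectory
    )
    no_harsh_braking = all(s["accel_mps2"] > -3.5 for s in trajectory)
    return {
        "stopped_before_line": stopped_before_line,
        "no_harsh_braking": no_harsh_braking,
    }
```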
Once tests provide clean and reliable output, results should be easily viewable, navigable, and immediately comprehensible in order to efficiently provide actionable information. Dashboards should be understandable not just to the engineers directly involved but also to stakeholders across the organization.
Different Cadence of Testing:
Two types of tests should be performed in order to catch regressions. Commit tests (Figure 1) are useful for catching bugs early and often, and should be run on every commit to the code repository. Scheduled assessments (Figure 3), on the other hand, are useful for tracking performance and regressions over time, and should be run on a regular cadence, such as daily.
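As a loose sketch (the trigger names and file layout are hypothetical), the two cadences might be captured as separate suite definitions, one blocking and one scheduled:

```python
# Sketch: two cadences of testing, defined as hypothetical suite configurations.
COMMIT_SUITE = {
    "trigger": "on_every_commit",                 # fast, focused scenarios that block merges
    "scenarios": "suites/commit/*.yaml",
    "blocking": True,
}
SCHEDULED_ASSESSMENT = {
    "trigger": "daily",                           # broader suite for tracking performance over time
    "scenarios": "suites/full_regression/*.yaml",
    "blocking": False,
}
```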
Debugging Failures Through Modular Testing
Once true failures are identified, there is still work to be done to identify the root cause of the observed behavior. While teams often talk about scaling tests, isolating and focusing on a particular piece of code is critical for debugging failures. For example, if an autonomous vehicle doesn’t stop in front of a pedestrian, it is important to turn off other parts of the stack and test in isolation until the responsible code is shown to recognize the pedestrian and act accordingly.
Narrowing the focus of a test with specific test criteria makes it much more likely that there is a one-to-one correspondence between test cases and modular AV capabilities. When that is the case, it is immediately obvious which part of the AV stack is responsible for handling the test case in question, and no additional work is needed. On the other hand, when a test features two or more interacting systems that need to be decoupled in order to determine a root cause, it is helpful to use a tool that can quickly change which components of the stack are exercised by the test and re-run it, as in the sketch below.
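One way to support this kind of decoupling, assuming a hypothetical per-run stack configuration and an injected `run_scenario` callable, is to re-run the failing scenario with progressively more upstream modules replaced by ground truth:

```python
# Sketch: narrowing a failure by re-running with parts of the stack mocked.
# Module names, config format, and the injected run_scenario callable are hypothetical.
from typing import Callable, Dict

StackConfig = Dict[str, str]

STACK_CONFIGS = [
    {"perception": "real",         "prediction": "real",         "planning": "real"},          # original failure
    {"perception": "ground_truth", "prediction": "real",         "planning": "real"},          # rule out perception
    {"perception": "ground_truth", "prediction": "ground_truth", "planning": "real"},          # isolate planning
]

def bisect_failure(scenario: str, run_scenario: Callable[[str, StackConfig], bool]) -> None:
    """Print pass/fail for each stack configuration to localize the responsible module."""
    for stack in STACK_CONFIGS:
        outcome = "PASS" if run_scenario(scenario, stack) else "FAIL"
        print(f"{stack} -> {outcome}")
```

If the scenario passes once perception is replaced by ground truth, the investigation can focus on the perception module; if it still fails with everything upstream mocked, the planner is the likely culprit.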
Progression Testing to Track Performance Relative to Target
As vehicle platforms mature, planning for future versions becomes a normal part of AV development. Teams pick out failures from real-world driving that have tested the limits of the system and group future desired capabilities together in a clear roadmap to the next-generation platform. New platform capabilities could be as high-level as “handle all stop lights in San Francisco” or “predict aggressive cut-ins on highways.” Progression testing captures these individual cases observed in simulation or in real-world testing and puts them to use as guides during development.
Progression testing usually takes the form of nightly or hourly batches of scenarios that are then run against the latest code. Progression test suites are designed to have many of the scenarios fail day after day, until the system-under-test is able to handle the scenarios. Using the example of “cut-ins on highways”, a collection of aggressive cut-ins could be turned into rich tests to automatically grade new changes to a planning or a control stack. These test suites exist outside of continuous integration; they do not block pull requests. As teams strive towards new capabilities, clearly seeing scenarios go from “red” to “green” on nightly runs is immensely satisfying (Figure 4).
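A lightweight way to make the red-to-green trend visible (the nightly results file layout here is hypothetical) is to roll each progression run up into a pass-rate history:

```python
# Sketch: summarizing nightly progression-suite results into a pass-rate trend.
# Assumes one JSON file per night shaped like {"date": "...", "results": {scenario_name: bool}}.
import json
from pathlib import Path

def print_pass_rate_history(results_dir: str) -> None:
    for path in sorted(Path(results_dir).glob("*.json")):
        run = json.loads(path.read_text())
        results = run["results"]
        passed = sum(1 for ok in results.values() if ok)
        rate = passed / len(results)
        print(f"{run['date']}: {passed}/{len(results)} scenarios green ({rate:.0%})")
```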
Pairing progression testing with regression testing is a powerful combination. Engineers now have a clear metric to use for risky changes to the core code. If a large change passes regression tests in CI and causes new scenarios to “go green” in progression suites, progress has been made.
Applied Intuition’s Tools for System Testing
The Applied Intuition team provides tooling to streamline regression and progression testing. Applied Intuition’s tools are designed to integrate with existing CI/CD solutions and augment their capabilities with different types of simulations. Internally, we apply the same paradigms described in this blog to prevent regressions in our simulator and track its performance over time, ensuring that software quality remains high and the simulator’s results can be trusted.