The long legacy of simulation-based control.
This is a live blog of Lecture 9 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here.
As I mentioned Monday, one of the big paradigms in modern robotics and control is the “sim2real” pipeline. People invest in complex computer simulators to test their robotic policies. The simulators have detailed dynamic and kinematic models of the robot and how it moves in contact with varied terrain and obstacles. The hope is that by burning through infinite GPU credits to troubleshoot every possibility in simulation, they can deploy code to their actual robot and need no troubleshooting once it’s unleashed in the real world.
While the young folks like to make this paradigm sound like a novel research program, all of optimal control rests on the sim2real pipeline. Think about the core problem of optimal control: the linear quadratic regulator. This problem looks for a control sequence that minimizes a quadratic cost subject to the world evolving according to a linear dynamical system. Control theorists banged their heads against this problem for decades, and we are now taught the beautiful dynamic programming derivations that reduce it to solving a compact equation (the Riccati equation). However, we can also solve it using gradient descent. The gradient computation amounts to simulating the system with the current control policy, computing the sensitivity of the cost trajectory to each control decision, and then adding this information up to compute the gradient.
The lovely thing about gradient descent is that it gives you a solution technique for general optimal control problems with nonquadratic costs or nonlinear dynamics. You simulate the system under the current policy, run a dynamical system backward in time to compute how sensitive the trajectory was to your control decisions, and then add up the contributions of each time point to get the full gradient. Arthur Bryson invented this method to compute gradients of general optimal control problems in 1962. Today, we call his algorithm backpropagation. This simulation-based gradient method provides incremental improvement of policies for any differentiable dynamical model and any differentiable cost function.
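To make this concrete, here is a minimal sketch of that forward/backward gradient computation on a toy finite-horizon LQR instance. All the matrices, the horizon, and the specific numbers below are invented for illustration; the point is just the structure: simulate forward, run the adjoint backward, and sum the per-timestep contributions. The result is checked against a finite difference.

```python
import numpy as np

# Toy finite-horizon LQR instance (all matrices invented for illustration).
rng = np.random.default_rng(0)
n, m, T = 3, 2, 20
A = 0.5 * np.eye(n) + 0.1 * rng.normal(size=(n, n))
B = rng.normal(size=(n, m))
Q, R = np.eye(n), 0.1 * np.eye(m)
x0 = rng.normal(size=n)

def cost_and_grad(U):
    """Simulate forward, then run the adjoint system backward in time,
    summing the contribution of each time step to the gradient."""
    X = [x0]
    for t in range(T):                      # forward pass: roll out the dynamics
        X.append(A @ X[-1] + B @ U[t])
    J = sum(x @ Q @ x for x in X) + sum(u @ R @ u for u in U)
    lam = 2 * Q @ X[T]                      # adjoint (costate) at the final time
    G = np.zeros_like(U)
    for t in reversed(range(T)):            # backward pass
        G[t] = 2 * R @ U[t] + B.T @ lam
        lam = 2 * Q @ X[t] + A.T @ lam
    return J, G

U = rng.normal(size=(T, m))
J, G = cost_and_grad(U)

# Sanity check one coordinate against a central finite difference.
eps, E = 1e-6, np.zeros((T, m))
E[5, 1] = eps
fd = (cost_and_grad(U + E)[0] - cost_and_grad(U - E)[0]) / (2 * eps)
print(abs(fd - G[5, 1]))    # agrees up to roundoff
```

Because the dynamics and cost are only queried through simulation, the same skeleton works unchanged if you swap in a nonlinear step function and its Jacobians.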
Now, if your simulation isn’t differentiable, maybe you’ll use a different sort of policy optimization method to solve your optimal control problem. However, reinforcement learning for robotics is still optimal control. RL for robotics minimizes a designed cost function subject to dynamics. The modern departure is that no one bothers to write down the equations of motion anymore. They just assume the simulator will compute them.
This belief pushes a lot of work onto the simulator. GPU cycles are sadly neither free nor abundant. It would be nice to minimize the simulation time and cost required to find a good control policy. It would be particularly nice because many people would like to have a simulator on board the actual robot to compute policies with methods like model predictive control. This raises the question of how accurate your simulation needs to be.
Unfortunately, no one knows. We all think that if you can act quickly enough with enough control authority, then a really simple model should work. But it’s impossible to quantify “enough” in that sentence. You have to try things out because dynamical processes are always surprising.
While it feels like increasing the fidelity of a simulator to the minute details of physical law always improves performance, this is not remotely the case. In class on Monday, Spencer Schutz presented a paper on autonomous driving showing that a simple, inaccurate kinematic model with a low sampling rate performed just as well as a more accurate dynamic model. Anyone who’s spent time with dynamic models knows that very high-dimensional complex systems often look simple when you have limited controllability and observability. This is the basis of thermodynamics, where infinitely many bodies colliding collectively produce fairly boring dissipative behavior. Many complex-looking circuits have the input-output behavior of resistors.
On the other side of the coin, safe execution demands identification of subtle aspects of input-output relationships. You can have two dynamical systems with nearly identical behavior perform completely differently once in a closed loop circuit. You can also have systems with completely different behavior look the same in closed loop. I worked through a few examples of this phenomenon in a blog post a couple of years ago. Your model needs to be perfect in exactly the right places. But it’s usually impossible to know those places in advance.
To make matters worse, you can’t really identify the parameters of a robot in open loop. An expensive robot is always going to be running with its low-level controllers on, both for its safety and yours. The actual parameters of closed-loop systems can’t be identified.1 So you’re stuck with guesses in your simulator, and you have to hope that your plausible parameters are good enough for your sim2real application.
The most popular solution to this identification problem is domain randomization. Since you can only find a range of parameters that describe reality, you build a control policy that works for randomly sampled parameters. By resampling the parameters on each simulation run, you build a policy that performs well on average across all possible parameters.
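Here is a minimal sketch of that idea on a scalar toy system. The dynamics, cost, horizon, and parameter range are all made up for illustration: the unknown pole a is sampled from an assumed prior, and we pick the feedback gain that minimizes the average cost over the samples.

```python
import numpy as np

# Domain-randomization sketch: x[t+1] = a*x[t] + u[t], feedback u = -k*x,
# with the pole a unknown. All numbers here are invented for illustration.
rng = np.random.default_rng(1)
a_samples = rng.uniform(0.6, 1.4, size=500)   # assumed "prior" over the pole
T, rho = 30, 0.1

def cost(k, a):
    x, J = 1.0, 0.0
    for _ in range(T):
        u = -k * x
        J += x**2 + rho * u**2                # quadratic state + control cost
        x = a * x + u
    return J

ks = np.linspace(0.0, 1.5, 151)
avg_cost = [np.mean([cost(k, a) for a in a_samples]) for k in ks]
k_star = ks[int(np.argmin(avg_cost))]
print(k_star)   # the gain that does best on average over the sampled dynamics
```

A crude grid search stands in for the policy optimizer here; in practice the averaging over sampled parameters happens inside whatever RL or gradient method you already run.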
Finding controllers that work for the average model isn’t new. Indeed, this is just a variant of optimal control called dual control, which has seen bursts of interest since the 1960s. Dual control is literally the problem of minimizing an expected control performance over a distribution of parameters. Like dual control, domain randomization needs a good prior model for how the environment “samples” parameters. But you can also just YOLO and hope that as long as you include all the edge cases, you’ll never crash. That’s the machine learning mindset, after all.
But what does it mean to sample the coefficient of friction of a surface? What’s the right distribution of coefficients of friction? This is again a tricky question.
One approach to modeling the distribution of parameters is to add an element of adversarial behavior to the system. We can adapt the simulations to find hard parameter settings and train more on those. We can have the simulator learn to trip up the robot. Rather than minimizing expected cost, we are working to minimize a worst-case cost, where the supremum is taken over a set of parameters or disturbances. The dual control people were really into this sort of minimax robustness in the 60s. But practice in aerospace applications ultimately pushed the community to robust control.
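The minimax variant can be sketched on the same toy scalar system as before (uncertainty set, cost, and numbers again invented for illustration): instead of averaging the cost over sampled parameters, grade each gain by its worst-case cost over the whole set.

```python
import numpy as np

# Minimax sketch on the scalar toy system x[t+1] = a*x[t] + u[t], u = -k*x.
# All numbers are invented for illustration.
T, rho = 30, 0.1
a_grid = np.linspace(0.6, 1.4, 81)   # assumed uncertainty set for the pole

def cost(k, a):
    x, J = 1.0, 0.0
    for _ in range(T):
        u = -k * x
        J += x**2 + rho * u**2
        x = a * x + u
    return J

ks = np.linspace(0.0, 1.5, 151)
worst = [max(cost(k, a) for a in a_grid) for k in ks]       # sup over the set
avg = [float(np.mean([cost(k, a) for a in a_grid])) for k in ks]
i_mm, i_av = int(np.argmin(worst)), int(np.argmin(avg))
k_minimax, k_avg = ks[i_mm], ks[i_av]
print(k_minimax, k_avg)   # the minimax gain hedges against the worst pole
```

By construction the minimax gain sacrifices some average performance for a better guarantee on the worst parameter in the set, which is exactly the conservatism people complain about below.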
But people hate robust control because it gives them conservative policies. Computer scientists love to hack and ship. Look how productive they’ve been! You only need to write a few tests and make sure your simulator passes those. No bugs detected, LGTM! What could go wrong, right?
Is that last paragraph about coding agents? It might be.
But regardless, robust control pointed out that unmodeled uncertainties are everywhere, and they are out there waiting to bite you if you’re not careful. For its entire history, advocates of robust control have been haranguing people about the limits of simulators. They note a couple of significant problems: first, training on a simulator often means fitting to quirks of the simulator that don’t appear in the real world. This is a major danger, even in linear systems. Second, many apparent parametric robustness properties of optimal controllers break down under scrutiny.
In class, I introduced the structured singular value to motivate this issue. The structured singular value showed that in a system with many inputs and outputs, if you only consider perturbations to one channel at a time, you can convince yourself that a system is stable when it is not remotely stable. Guaranteeing stable behavior requires understanding the dependencies between different errors. But how to test for this kind of stability in simulation is not clear.
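A tiny numeric illustration of the point, with a matrix contrived for the purpose: for the M below, the loop I - M*Delta with Delta = diag(d1, d2) is singular exactly when d1*d2 = 1. Perturbing either channel alone suggests an infinite stability margin, while a simultaneous perturbation of size one is already destabilizing.

```python
import numpy as np

# Contrived example: independent single-channel perturbations look harmless,
# but a simultaneous structured perturbation destabilizes the loop.
M = np.array([[0.0, 10.0],
              [0.1, 0.0]])

def margin(d1, d2):
    """det(I - M*Delta) for Delta = diag(d1, d2); zero means singularity."""
    return np.linalg.det(np.eye(2) - M @ np.diag([d1, d2]))

# Perturb each channel independently: the loop looks infinitely robust.
print([margin(d, 0.0) for d in (1.0, 5.0, 100.0)])   # each determinant is 1
print([margin(0.0, d) for d in (1.0, 5.0, 100.0)])   # each determinant is 1

# Perturb both channels together: singular already at d1 = d2 = 1.
print(margin(1.0, 1.0))                              # essentially zero
```

Testing each parameter one at a time would never reveal the joint failure mode, which is precisely what the structured singular value is designed to detect.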
We are thus left considering a strategy beyond sim2real: sim2real2sim2real. Or sim2real2sim2real2sim2real. You deploy the system and find out what didn’t work in reality. And then you go back to your simulator, add a few thousand lines of code to account for the mistake, and try again. The software state of mind is that we can always patch mistakes. You can have an all-hands, blameless post-mortem and say it won’t happen again. This drives the old control theorists mad, but it’s been working great so far, so why change course?
1. In case you haven’t encountered this before, suppose you are trying to model a closed-loop system x[t+1] = A x[t] + B u[t] with u[t] = K x[t]. Then for an arbitrary matrix E_B,

A + BK = (A − E_B K) + (B + E_B) K.

Hence any pair (A − E_B K, B + E_B) is consistent with your data, and you can only identify an affine subspace of possible dynamical systems.
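A quick numeric check of this identity, with randomly generated matrices for illustration: the aliased pair (A − E_B K, B + E_B) produces exactly the same closed-loop trajectories as (A, B).

```python
import numpy as np

# Two different (A, B) pairs generating identical closed-loop data under the
# same feedback gain K. All matrices are randomly generated for illustration.
rng = np.random.default_rng(2)
n, m = 3, 2
A = 0.3 * rng.normal(size=(n, n))
B = rng.normal(size=(n, m))
K = 0.2 * rng.normal(size=(m, n))
EB = rng.normal(size=(n, m))            # arbitrary perturbation matrix

A2, B2 = A - EB @ K, B + EB             # the aliased model from the identity

x = rng.normal(size=n)
for _ in range(10):
    u = K @ x                           # the low-level controller is always on
    x_next = A @ x + B @ u
    assert np.allclose(x_next, A2 @ x + B2 @ u)   # data can't tell them apart
    x = x_next
print("identical closed-loop trajectories")
```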
