A crash course on optimal control
This is a live blog of Lecture 4 of my graduate seminar “Feedback, Learning, and Adaptation.” A table of contents is here.
In machine learning and artificial intelligence classes, everyone treats control as if it’s a special kind of optimization problem. You write down a model for the system, you write down your goals for the system, and then you run PPO to find the optimal feedback controller.
This sounds pretty easy, to be honest. Why would we need all of these complex textbooks and Laplace transforms if we can just write down the cost function and gradient descent our way to optimal actions? I’m going to try to answer that question in the next lecture. But today we’ll review all of optimal control.
To make the leap from feedback design to policy optimization, you have to believe in state. In dynamical systems, the state is some vector that completely determines the fate of the system to be steered. If we know this vector and our input, we completely specify what happens in the next time period. More precisely, we assume that the next state of the system depends only on the current state, the current input, and some noise:
state_next = dynamics_model(state, input, noise)

This seems perfectly reasonable, up to the part where we assume that we can measure this state vector. Let’s just assume we can for now.
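To make this concrete, here’s a minimal sketch of what such a dynamics model might look like. The double-integrator system and its numbers are my own toy example, not anything from the lecture:

```python
import numpy as np

def dynamics_model(state, input, noise):
    """One step of a toy double integrator: state = (position, velocity).

    Hypothetical example; the matrices are made-up illustration numbers.
    """
    A = np.array([[1.0, 0.1],
                  [0.0, 1.0]])   # position drifts by 0.1 * velocity each step
    B = np.array([[0.0],
                  [0.1]])        # the input pushes on the velocity
    return A @ state + B @ input + noise

state = np.array([[0.0], [1.0]])   # at the origin, moving at speed 1
state_next = dynamics_model(state, np.array([[0.5]]), np.zeros((2, 1)))
```

Given the current state, the input, and the noise, the next state is completely determined, which is exactly the assumption stated above.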
Next, we need to make our state do something. We have some goal in mind for the system: “go over there and make me coffee, but don’t run over my dog.” We have to write this goal as a numerical function. Convention decomposes this cost function into a sum of costs incurred at each point in time. In my coffee example, the cost is high every time I don’t have a coffee in my hand, even higher if the dog is killed, and zero once I am sipping a well-brewed cup. You can imagine how to generalize this. Perhaps you penalize the amount of control effort the feedback is exerting. Or you could penalize the distance between the current state and a desired state. There are lots of costs you can write down.
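The conventional decomposition might look like this in code. This is a sketch, and the particular penalties (squared distance to a goal state, a small weight on control effort) are just illustrative choices:

```python
import numpy as np

def stage_cost(state, control, goal):
    """Cost incurred at a single time step: distance to the desired
    state plus a penalty on control effort. Weights are hypothetical."""
    return np.sum((state - goal) ** 2) + 0.1 * np.sum(control ** 2)

def total_cost(states, controls, goal):
    """Convention: the total cost is a sum of per-step stage costs."""
    return sum(stage_cost(x, u, goal) for x, u in zip(states, controls))

goal = np.array([1.0])
states = [np.array([0.0]), np.array([1.0])]
controls = [np.array([1.0]), np.array([0.0])]
cost = total_cost(states, controls, goal)
```

The cost is high while the state is far from the goal and drops to zero once you’re there, just like the coffee example.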
If you can pose your cost this way, there’s a generic solution for the optimal feedback policy: dynamic programming. Bellman’s masterful observation is that optimization problems of this form can be solved by running backward in time from the end of time to the present. The classic example everyone gives is shortest paths. If we have an optimal path from A to B, and Albuquerque is on the optimal path, then you have also found the optimal path from Albuquerque to B. You can work your way backwards from the last part of your path to the beginning, doing clever bookkeeping in what we call “cost-to-go” functions.
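The backward sweep and the cost-to-go bookkeeping can be sketched in a few lines for a tiny tabular problem. The states, actions, and costs below are made-up numbers, but the recursion is the Bellman idea described above:

```python
import numpy as np

# Tiny tabular example: 3 states, 2 actions, horizon 3.
# The transitions and stage costs are hypothetical illustration numbers.
n_states, n_actions, horizon = 3, 2, 3
next_state = [[1, 2], [0, 2], [2, 2]]        # deterministic dynamics
cost = [[1.0, 4.0], [0.0, 2.0], [0.5, 0.5]]  # stage cost c(s, a)

# Bellman recursion: sweep backward from the end of time to the present.
V = np.zeros(n_states)   # terminal cost-to-go is zero
policy = []
for t in reversed(range(horizon)):
    Q = np.array([[cost[s][a] + V[next_state[s][a]]
                   for a in range(n_actions)] for s in range(n_states)])
    policy.insert(0, Q.argmin(axis=1))   # best action at each state and time
    V = Q.min(axis=1)                    # updated cost-to-go function
```

Each pass updates the cost-to-go function V, exactly the clever bookkeeping that makes the shortest-path argument work: once you know the optimal cost from Albuquerque onward, choosing the best first step is a one-step minimization.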
Dynamic programming is beautiful. It admits an amazingly elegant solution for linear systems with quadratic costs, with the feedback rule a simple linear function of the current state. It has a reasonably simple implementation when you have a few discrete states and actions. For everything else, dynamic programming is effectively not implementable. It becomes a meta-algorithm we all aspire to, but we end up throwing messy heuristics at the problem, be they gridded policy iteration or deep reinforcement learning, to approximate the optimal control.
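For the linear-quadratic case, the elegant solution comes out of a backward Riccati recursion. Here is a sketch for a scalar system; the particular numbers are made up, and I’m writing the scalar version of the standard discrete-time recursion rather than anything system-specific:

```python
# Finite-horizon LQR for the scalar system x_next = a*x + b*u,
# minimizing the sum of q*x^2 + r*u^2. All numbers are illustrative.
a, b, q, r, horizon = 1.0, 1.0, 1.0, 1.0, 50

P = q          # quadratic coefficient of the terminal cost-to-go
gains = []
for _ in range(horizon):
    K = (b * P * a) / (r + b * P * b)   # optimal gain at this time step
    P = q + a * P * a - a * P * b * K   # Riccati recursion, backward in time
    gains.append(K)

# The optimal feedback is a simple linear function of the state: u = -K * x.
```

For a long enough horizon the gain settles to a constant, which is why people usually just run the recursion to convergence and use a single static feedback law.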
Perhaps the most important thing about optimal control is how it changes control design. Whereas PID control fixes the form of the controller and asks the designer to tune the parameters, optimal control fixes the types of optimization problems you can pose and asks the designer to think about how to model their problem so that one of the tractable models applies. The control engineer shoehorns their system into a corner where the optimization problem is solvable using dynamic programming, a Hamiltonian method, or some other numerical solver. In the language used by reinforcement learning people, optimal control design is “reward shaping.” You find an optimization problem that’s close enough to what you really want.
Still, optimal control feels both easier to understand and more widely applicable than the simple controllers, like PID, that we’ve been covering so far this semester. In optimal control, you specify what you want the controller to do and let computers do the hard part of finding the parameters for you. Indeed, we’ll see today that you can often pose PID control problems as optimal control problems. If you find it more intuitive to tune cost functions than control gains, this is a powerful restatement of the PID control problem.
But optimal control buries a lot of what happens when we close the loop. It doesn’t make it clear why feedback induces the behaviors we see in the world. Though there are people who work on “inverse optimal control,” that problem is too ill-posed to provide insights about natural systems. More often than not, no single optimization problem describes any interesting system.
It’s also very hard to incorporate uncertainty into optimal control. We showed last time that PID controllers worked for rather underspecified systems. By contrast, optimal control wants to work on very precisely specified systems. But what happens when the dynamics models are wrong? What if you have multiple design goals and not just a single cost to optimize? What if you can’t actually solve the optimization problem? What if you can’t measure the state? We’ll talk about these various fun scenarios next time. But the spoiler alert is that we get stuck. While optimal control is accessible and powerful, it is merely one tool in the toolbox for designing and understanding feedback systems.
