
46. Reinforcement learning example

[Image: time-lapse photo of a computer-controlled helicopter landing with its engine turned off]

Suppose you are using machine learning to teach a helicopter to fly complex maneuvers. Here is a time-lapse photo of a computer-controlled helicopter executing a landing with the engine turned off.

This is called an “autorotation” maneuver. It allows helicopters to land even if their engine unexpectedly fails. Human pilots practice this maneuver as part of their training. Your goal is to use a learning algorithm to fly the helicopter through a trajectory T that ends in a safe landing.

To apply reinforcement learning, you have to develop a “Reward function” R(.) that gives a score measuring how good each possible trajectory T is. For example, if T results in the helicopter crashing, then perhaps the reward is R(T) = -1,000, a huge negative reward. A trajectory T resulting in a safe landing might result in a positive R(T), with the exact value depending on how smooth the landing was. The reward function R(.) is typically chosen by hand to quantify how desirable different trajectories T are. It has to trade off how bumpy the landing was, whether the helicopter landed in exactly the desired spot, how rough the ride down was for passengers, and so on. It is not easy to design good reward functions.
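
To make this concrete, here is a minimal sketch of what such a hand-designed reward function might look like. Everything in it is hypothetical: the trajectory fields (`crashed`, `vertical_speed_at_touchdown`, `distance_to_target`, `accelerations`) and the penalty weights are placeholders, not values from the text.

```python
import numpy as np

def reward(trajectory):
    """Hypothetical hand-designed reward R(T) for a helicopter landing.

    `trajectory` is assumed to expose a few summary fields; none of
    these names come from the text, and the weights are arbitrary.
    """
    if trajectory.crashed:
        return -1000.0  # huge negative reward for a crash

    # Trade off a hard touchdown, missing the landing spot, and a
    # rough ride down for passengers. Picking these weights well is
    # exactly the hard part of designing R(.).
    bumpiness_penalty = 10.0 * abs(trajectory.vertical_speed_at_touchdown)
    accuracy_penalty = 5.0 * trajectory.distance_to_target
    ride_penalty = 1.0 * float(np.mean(np.abs(trajectory.accelerations)))

    return 100.0 - bumpiness_penalty - accuracy_penalty - ride_penalty
```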

Given a reward function R(T), the job of the reinforcement learning algorithm is to control the helicopter so that it achieves max_T R(T). However, reinforcement learning algorithms make many approximations and may not succeed in achieving this maximization.
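
As a toy illustration of what “approximate maximization” means, the sketch below simply samples one trajectory from each candidate policy and keeps whichever one R(.) scores highest. Real RL algorithms search far more cleverly; `sample_trajectory` is an assumed simulator interface, not something from the text.

```python
def best_of_n(reward_fn, sample_trajectory, policies):
    """Crude stand-in for an RL algorithm trying to achieve max_T R(T):
    evaluate R(.) on one sampled trajectory per candidate policy and
    return the policy whose trajectory scored highest."""
    best_policy, best_score = None, float("-inf")
    for policy in policies:
        score = reward_fn(sample_trajectory(policy))
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy
```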

Suppose you have picked some reward R(.) and have run your learning algorithm. However, its performance appears far worse than your human pilot's: the landings are bumpier and seem less safe than what a human pilot achieves. How can you tell if the fault is with the reinforcement learning algorithm, which is trying to carry out a trajectory that achieves max_T R(T), or with the reward function, which is trying to measure and specify the ideal tradeoff between ride bumpiness and accuracy of the landing spot?

To apply the Optimization Verification test, let T_human be the trajectory achieved by the human pilot, and let T_out be the trajectory achieved by the algorithm. According to our description above, T_human is a superior trajectory to T_out. Thus, the key test is the following: Does it hold true that R(T_human) > R(T_out)?

Case 1: If this inequality holds, then the reward function R(.) is correctly rating T_human as superior to T_out, but our reinforcement learning algorithm is finding the inferior T_out. This suggests that working on improving our reinforcement learning algorithm is worthwhile.

Case 2: The inequality does not hold: R(T_human) ≤ R(T_out). This means R(.) assigns a worse score to T_human even though it is the superior trajectory. You should work on improving R(.) to better capture the tradeoffs that correspond to a good landing.
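
Putting the test and the two cases together, a small diagnostic helper might look like the sketch below. The function and variable names are illustrative, not from the text.

```python
def optimization_verification(reward_fn, t_human, t_out):
    """Compare R(T_human) and R(T_out) and report which component
    of the system is more promising to work on."""
    r_human, r_out = reward_fn(t_human), reward_fn(t_out)

    if r_human > r_out:
        # Case 1: R(.) correctly rates the human trajectory as better,
        # but the RL algorithm found the inferior T_out.
        return "work on the reinforcement learning algorithm"
    # Case 2: R(.) gives the superior trajectory a worse (or equal)
    # score, so the reward function mis-specifies the tradeoffs.
    return "work on the reward function R(.)"
```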

Many machine learning applications have this “pattern” of optimizing an approximate scoring function Score_x(.) using an approximate search algorithm. Sometimes, there is no specified input x, so this reduces to just Score(.). In our example above, the scoring function was the reward function Score(T) = R(T), and the optimization algorithm was the reinforcement learning algorithm trying to execute a good trajectory T.

One difference between this and earlier examples is that, rather than comparing to an “optimal” output, you were instead comparing to human-level performance T_human. We assumed T_human is pretty good, even if not optimal. In general, so long as you have some y* (in this example, T_human) that is a superior output to the performance of your current learning algorithm, even if it is not the “optimal” output, then the Optimization Verification test can indicate whether it is more promising to improve the optimization algorithm or the scoring function.
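
The same check can be stated once in its general form: given an approximate scoring function Score_x(.), the algorithm's output, and any known-better output y* (here T_human), compare their scores. This is only a schematic restatement of the test, with illustrative names.

```python
def optimization_verification_test(score_fn, x, y_star, y_out):
    """Return which component to improve, given a known-superior
    output y_star (e.g. T_human) and the algorithm's output y_out.
    `x` is the input, if any; `score_fn(x, y)` plays the role of
    Score_x(y)."""
    if score_fn(x, y_star) > score_fn(x, y_out):
        return "improve the search/optimization algorithm"
    return "improve the scoring function Score_x(.)"
```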