
46. Reinforcement learning example

[Image: time-lapse photo of a computer-controlled helicopter landing with its engine turned off]

Suppose you are using machine learning to teach a helicopter to fly complex maneuvers. Here is a time-lapse photo of a computer-controlled helicopter executing a landing with the engine turned off.

This is called an “autorotation” maneuver. It allows helicopters to land even if their engine unexpectedly fails. Human pilots practice this maneuver as part of their training. Your goal is to use a learning algorithm to fly the helicopter through a trajectory T that ends in a safe landing.

To apply reinforcement learning, you have to develop a “Reward function” R(.) that gives a score measuring how good each possible trajectory T is. For example, if T results in the helicopter crashing, then perhaps the reward is R(T) = -1,000, a huge negative reward. A trajectory T resulting in a safe landing might result in a positive R(T), with the exact value depending on how smooth the landing was. The reward function R(.) is typically chosen by hand to quantify how desirable different trajectories T are. It has to trade off how bumpy the landing was, whether the helicopter landed in exactly the desired spot, how rough the ride down was for passengers, and so on. It is not easy to design good reward functions.
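
To make this concrete, here is a minimal sketch of what such a hand-designed reward function might look like. Everything in it is hypothetical: the trajectory fields (`crashed`, `vertical_speed_at_touchdown`, `distance_to_target`, `accelerations`) and the penalty weights are placeholders, not values from the text.

```python
import numpy as np

def reward(trajectory):
    """Hypothetical hand-designed reward R(T) for a helicopter landing.

    `trajectory` is assumed to expose a few summary fields; none of
    these names come from the text, and the weights are arbitrary.
    """
    if trajectory.crashed:
        return -1000.0  # huge negative reward for a crash

    # Trade off a hard touchdown, missing the landing spot, and a
    # rough ride down for passengers. Picking these weights well is
    # exactly the hard part of designing R(.).
    bumpiness_penalty = 10.0 * abs(trajectory.vertical_speed_at_touchdown)
    accuracy_penalty = 5.0 * trajectory.distance_to_target
    ride_penalty = 1.0 * float(np.mean(np.abs(trajectory.accelerations)))

    return 100.0 - bumpiness_penalty - accuracy_penalty - ride_penalty
```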

Given a reward function R(T), the job of the reinforcement learning algorithm is to control the helicopter so that it achieves max_T R(T). However, reinforcement learning algorithms make many approximations and may not succeed in achieving this maximization.
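
As a toy illustration of what “approximate maximization” means, the sketch below simply samples one trajectory from each candidate policy and keeps whichever one R(.) scores highest. Real RL algorithms search far more cleverly; `sample_trajectory` is an assumed simulator interface, not something from the text.

```python
def best_of_n(reward_fn, sample_trajectory, policies):
    """Crude stand-in for an RL algorithm trying to achieve max_T R(T):
    evaluate R(.) on one sampled trajectory per candidate policy and
    return the policy whose trajectory scored highest."""
    best_policy, best_score = None, float("-inf")
    for policy in policies:
        score = reward_fn(sample_trajectory(policy))
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy
```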

Suppose you have picked some reward R(.) and have run your learning algorithm. However, its performance appears far worse than your human pilot's: the landings are bumpier and seem less safe than what a human pilot achieves. How can you tell if the fault is with the reinforcement learning algorithm, which is trying to carry out a trajectory that achieves max_T R(T), or with the reward function, which is trying to measure and specify the ideal tradeoff between ride bumpiness and accuracy of the landing spot?

To apply the Optimization Verification test, let T_human be the trajectory achieved by the human pilot, and let T_out be the trajectory achieved by the algorithm. According to our description above, T_human is a superior trajectory to T_out. Thus, the key test is the following: Does it hold true that R(T_human) > R(T_out)?

Case 1: If this inequality holds, then the reward function R(.) is correctly rating T_human as superior to T_out, but our reinforcement learning algorithm is finding the inferior T_out. This suggests that working on improving our reinforcement learning algorithm is worthwhile.

Case 2: The inequality does not hold: R(T_human) ≤ R(T_out). This means R(.) assigns a worse score to T_human even though it is the superior trajectory. You should work on improving R(.) to better capture the tradeoffs that correspond to a good landing.
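
Putting the test and the two cases together, a small diagnostic helper might look like the sketch below. The function and variable names are illustrative, not from the text.

```python
def optimization_verification(reward_fn, t_human, t_out):
    """Compare R(T_human) and R(T_out) and report which component
    of the system is more promising to work on."""
    r_human, r_out = reward_fn(t_human), reward_fn(t_out)

    if r_human > r_out:
        # Case 1: R(.) correctly rates the human trajectory as better,
        # but the RL algorithm found the inferior T_out.
        return "work on the reinforcement learning algorithm"
    # Case 2: R(.) gives the superior trajectory a worse (or equal)
    # score, so the reward function mis-specifies the tradeoffs.
    return "work on the reward function R(.)"
```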

Many machine learning applications have this “pattern” of optimizing an approximate scoring function Score_x(.) using an approximate search algorithm. Sometimes, there is no specified input x, so this reduces to just Score(.). In our example above, the scoring function was the reward function Score(T) = R(T), and the optimization algorithm was the reinforcement learning algorithm trying to execute a good trajectory T.

One difference between this and earlier examples is that, rather than comparing to an “optimal” output, you were instead comparing to human-level performance T_human. We assumed T_human is pretty good, even if not optimal. In general, so long as you have some y* (in this example, T_human) that is a superior output to the performance of your current learning algorithm, even if it is not the “optimal” output, then the Optimization Verification test can indicate whether it is more promising to improve the optimization algorithm or the scoring function.
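
The same check can be stated once in its general form: given an approximate scoring function Score_x(.), the algorithm's output, and any known-better output y* (here T_human), compare their scores. This is only a schematic restatement of the test, with illustrative names.

```python
def optimization_verification_test(score_fn, x, y_star, y_out):
    """Return which component to improve, given a known-superior
    output y_star (e.g. T_human) and the algorithm's output y_out.
    `x` is the input, if any; `score_fn(x, y)` plays the role of
    Score_x(y)."""
    if score_fn(x, y_star) > score_fn(x, y_out):
        return "improve the search/optimization algorithm"
    return "improve the scoring function Score_x(.)"
```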