Original paper: http://ml.informatik.uni-freiburg.de/former/_media/publications/rieecml05.pdf
This is a quick implementation written for a course. The parameters given in the paper appear to be unstable in practice. With some tuning, this implementation reaches ~80 on CartPole, taken as the best of five training runs.
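
For reference, a best-of-five figure like the one above can be computed with a small loop over seeds. This is only a sketch of the reporting protocol, not the code in this repo: `best_of_runs` and `dummy_train` are hypothetical names, and `dummy_train` is a placeholder that returns a seeded random number where a real training-and-evaluation run would go.

```python
import random
from typing import Callable, Iterable, Tuple


def best_of_runs(train_fn: Callable[[int], float],
                 seeds: Iterable[int]) -> Tuple[int, float]:
    """Call train_fn once per seed and return (best_seed, best_score)."""
    results = [(seed, train_fn(seed)) for seed in seeds]
    # Pick the run with the highest score.
    return max(results, key=lambda pair: pair[1])


def dummy_train(seed: int) -> float:
    # Placeholder (assumption): stands in for a full training run on CartPole
    # and returns a reproducible pseudo-random "score" for the given seed.
    rng = random.Random(seed)
    return rng.uniform(0.0, 100.0)


if __name__ == "__main__":
    best_seed, best_score = best_of_runs(dummy_train, range(5))
    print(f"best of five: seed={best_seed}, score={best_score:.1f}")
```

Note that a best-of-N number is an optimistic summary; reporting the mean or median across the same runs alongside it gives a better sense of how stable training is.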