I recently finished reading the Sutton & Barto book on reinforcement learning. In the last part of that book, they discuss a general framework which is supposed to capture many of the different idea which have been presented. Specifically, it gives a view in which dynamic programming, Q-learning, SARSA, classical planning, monte carlo methods, and a few other things are all treated as value-propagation methods. The different propagation schemes differ, both in the equations of propagation and the scheduling of propagation, but the common structure is there.

This generally fits with the Dyna-style systems I've been thinking about, though the details are fuzzy.

Combined with the amount of thinking I've been doing about systems that use belief propagation as their major reasoning mechanism (which I should write more about...), this has me fairly deep into propagation-based reasoning systems.

We've also been talking about dynamic programming in my advanced algorithms course, as it happens. We went over a common example based on matrix multiplication (which I'd seen before, but not thought about recently). This example is particularly interesting, because the dynamic programming is being used for

*inference guidance*, not for inference: we use dynamic programming to find the fastest value-propagation ordering, to great advantage. This leads to the idea of

*using dynamic programming to determine the schedule for dynamic programming*. This connects back to my original theme of procedural logic: logic which determines its own inference procedures.

The issue with this is, we need some simpler inference guidance to get off the ground. If we wait for informed priority estimates before we do any inference, but rely on inference to get priority estimates, we'll go nowhere. (I referred to this as the reflection problem in the previous post in this sequence.) Furthermore, we have to balance between meta-inference and inference. It's tempting to give meta-inference a higher priority in general, because we want to do it first; but in a general setting, there will be too many possible meta-inferences, so we'd never get anything done if we gave them priority.

Since many of the propagations are just propagating real numbers (particularly in belief propagation and reinforcement learning for

*discrete*domains), a simpler idea is to guide inference simply by giving priority to propagations which are tracking larger changes. This is referred to as prioritized sweeping in Sutton & Barto, but surprisingly not dealt with very much. I also haven't seen much about the effect of such a scheme on the efficiency of belief propagation. It's probably out there, though.

It would be great to test the effects of different (combinations of) prioritization methods to see what actually worked. Hopefully I get to that point.

Today I had a discussion with Laurent Orseau about these ideas. He has been thinking about very similar things-- a system based on a priority queue and a hash table, like the Dyna implementation I've been slowly working on. (Actually, I'm not even worrying about a hash table at the moment, but building something that looks more like a Rete network's alpha memory; when I need to optimize it, I'll go for term indexing, thanks to some advice from Russel Wallace.) He has some ideas about inference control as well (perhaps more unified than mine), which fit in with the theme of using dynamic programming to determine the optimal inference. Perhaps we will make some progress.