The basic idea is this: if we are faced with a number of options which have unknown probabilistic payoffs, then we can either choose the option that has had the best average payoff so far (ie, exploit our current knowledge), or choose a different option in order to gain more evidence about how good that option is (ie, explore the available options).
In principle, we can compute the optimal choice at a given point in a straightforward way, but this requires looking at the entire future (up to some cutoff we choose) and is highly inefficient. It seems to be critical to actual implementation that we approximate the utility of exploration in a more local way, making exploration an essential element of our decision procedure rather than a derived property that comes from planning far enough ahead.
Unfortunately, the simplistic strategy of choosing to explore totally at random some fraction of the time, and exploit the rest of the time, seems all-too-common. This fails to take into account factors such as the current risk of exploration, weighting exploration toward more promising options, et cetera.
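For concreteness, here is a minimal sketch of that simplistic strategy (usually called epsilon-greedy); the function name and the idea of passing in a list of running average payoffs are just my own illustrative choices:

```python
import random

def epsilon_greedy_choice(estimates, epsilon=0.1):
    """With probability epsilon pick an option uniformly at random (explore);
    otherwise pick the option with the best average payoff so far (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))          # blind exploration
    return max(range(len(estimates)), key=lambda i: estimates[i])  # exploitation
```

Note that the random branch ignores everything we know: it is just as likely to re-try an option we are already confident is terrible as one we are genuinely unsure about.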
An interesting class of algorithms for improving upon this is optimistic greedy methods: act as if you're exploiting, but use optimistic estimates. It looks like there are several ways of fleshing this out; perhaps the simplest is to come up with an interval within which the utility of an option falls with probability X (say, 95% confidence). Our "actual estimate" for the utility might be in the middle of the range, but we can act as if we think the top value is the right one in order to encourage exploration: if an option has not been explored very well, it will have a broad range, so that the top of the range may be higher than the actual current-best-bet even if the middle is pretty far below.
This can lead to too little exploration: a string of bad luck associated with an option which is in actuality good can push even the high estimate of the payoff to below the actual payoff of some other option, so that it will never be explored again. This becomes less and less likely the more optimistic we are (ie, the wider our confidence interval), but it's still a disadvantage.
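Here is a sketch of the optimistic-interval idea, assuming (roughly) normal payoffs and some hypothetical bookkeeping arrays of running sums, sums of squares, and counts per option:

```python
import math

def optimistic_choice(sums, sq_sums, counts, z=1.96):
    """Pick the option whose *upper* confidence bound (mean + z * standard error)
    is highest; z = 1.96 corresponds roughly to a 95% normal interval."""
    best_i, best_ucb = 0, float("-inf")
    for i in range(len(counts)):
        if counts[i] == 0:
            return i  # an untried option has an unbounded interval: try it first
        mean = sums[i] / counts[i]
        var = max(sq_sums[i] / counts[i] - mean ** 2, 0.0)
        ucb = mean + z * math.sqrt(var / counts[i])  # act as if the top is true
        if ucb > best_ucb:
            best_i, best_ucb = i, ucb
    return best_i
```

The width of each interval shrinks as an option is tried more often, so poorly-explored options get an automatic bonus without any separate "explore" branch.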
One might ask, what intervals should we pick? Well, it seems like it depends how long we are going to be choosing between the particular set of options. If we are only choosing one more time, our intervals should be of width 0-- we should exploit rather than explore. Similarly, if we only have a few more times to choose, we should choose with relatively narrow intervals, doing relatively little exploration; if we have a great many left, we should have very broad intervals, exploring a great deal. (Perhaps I should try to work out the optimal intervals here.)
So, if we have a known finite number of moves to make, we should shrink our interval width down to 0 via some formula.
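I don't know what the optimal formula is, but as a placeholder, here is one hypothetical schedule: shrink the interval half-width linearly as the remaining number of choices goes to 0 (the function and parameter names are invented for illustration):

```python
def curiosity_width(z_start, moves_remaining, horizon):
    """Hypothetical schedule: shrink the interval half-width linearly from
    z_start (used when all `horizon` choices remain) down to 0 at the last choice."""
    return z_start * moves_remaining / horizon
```

On the final move this returns 0, so the rule collapses to pure exploitation, as argued above.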
The interval width can be thought of as the "curiosity" of the system at a particular point in time. If we're using a 95% confidence interval, then we are saying that for each option we don't try, we want to be at least 95% certain that it is worse than the option we do try: any less confident and we'd prefer to experiment.
What if we have an infinite number of moves, and want a strategy which will be guaranteed to converge to the optimal strategy at infinity? (Of course, what we really care about is how quickly we converge, but let's first worry about converging at all.) I said earlier that, with a fixed confidence, this would be impossible: we can always converge to the wrong thing. However, the higher the confidence we require, the less probable this is. So, we can gradually increase our confidence requirement over time to guarantee convergence. This can be thought of as increasing our epistemic standards over time: if we've only been doing something for a few minutes, we're ok with only being 75% certain that it's better than the alternatives; but if we've been following a strategy for millions of years, we want to be damn sure by that point that it's the strategy we should actually have been following.
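As far as I can tell, this is essentially what standard bandit algorithms in the UCB family do: the exploration bonus grows like sqrt(log t / n_i), which amounts to demanding ever-higher confidence as time goes on, and that is what keeps every option being sampled and guarantees convergence. A sketch of the UCB1-style rule, with the same hypothetical sums/counts bookkeeping as before:

```python
import math

def ucb1_choice(sums, counts, t):
    """UCB1-style rule: index = average payoff + sqrt(2 ln t / n_i).
    Because the bonus grows with log t, the implied confidence requirement
    rises over time, so no option's interval ever collapses permanently."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # try each option once before anything else
    return max(range(len(counts)),
               key=lambda i: sums[i] / counts[i]
                             + math.sqrt(2.0 * math.log(t) / counts[i]))
```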
There are several paradoxical things going on here:
- While for finite-time problems we want to decrease our curiosity over time, for infinite-time cases we want to increase it; the infinite-time case does not appear to be the limit of longer and longer finite-time cases.
- Higher epistemic standards (ie, requiring greater confidence) correspond to more optimism, not less, even though one might think of optimism as a sort of epistemic dishonesty (pretending the expected value is higher than it is). It's a consequence of differing definitions, not a real contradiction, but I think it's curious.
Edit-- Questions:
- What are the formulas for optimal curiosity levels in the finite and infinite versions?
- Can we make sure that a system using these approximate exploration strategies will still approximate the true optimal strategy as it is given more computation time with which to plan ahead?