Follow-up to this.
I took a look at this paper on curiosity, which includes a good review of more recent work than what I had read when I wrote the previous post. One nice insight is that it is useful to separate the value function based on actual reward from the value function based on the "exploration bonus". The two can then be added together to give the final value. One can still think of the exploration bonus in terms of optimism, but another way to think of it is that the system is really just trying to calculate the benefit of exploring a particular option (that is, the learning benefit) and adding that to the direct benefit of choosing the route.
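To make the split concrete, here is a minimal sketch of my own, not code from the paper. The names `combined_value` and `choose`, the option names, and the numbers are all made up for illustration. Each option carries two separate estimates, and they are only summed at decision time.

```python
def combined_value(reward_value, exploration_bonus):
    """Final value: direct benefit plus learning benefit (hypothetical helper)."""
    return reward_value + exploration_bonus


def choose(options):
    """Return the name of the option with the highest combined value.

    `options` maps a name to a (reward_value, exploration_bonus) pair.
    The two estimates are stored separately and only added here.
    """
    return max(options, key=lambda name: combined_value(*options[name]))


# Illustrative numbers only: the familiar option has the higher reward
# estimate, but the novel one has more to teach us.
options = {
    "familiar": (1.0, 0.05),
    "novel": (0.8, 0.4),
}
best = choose(options)
```

Keeping the two estimates apart means the bonus can shrink as an option becomes better known without disturbing the reward estimate itself.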
In this account, the confidence-interval method mentioned in the last post is seen as a way of estimating the learning benefit of a state as the distance between the most probable average utility and the top of the X%-confidence range for the average utility.
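A rough sketch of that bonus, under assumptions that are mine rather than the paper's: I take the sample mean as the most probable average utility, and I use a normal approximation for the two-sided confidence interval on the mean. The function name `confidence_bonus` is made up.

```python
import math
from statistics import NormalDist, stdev


def confidence_bonus(samples, confidence=0.95):
    """Distance from the sample mean to the top of the confidence interval.

    Assumes a normal approximation for the mean of the observed utilities,
    so the bonus is z * s / sqrt(n) for a two-sided interval.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate spread")
    # z-score for the upper end of a two-sided interval at this confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * stdev(samples) / math.sqrt(n)


# Illustrative utilities observed for one state.
few = [1.0, 2.0, 3.0, 4.0]
many = few * 4  # same spread of values, four times the observations

bonus_few = confidence_bonus(few)
bonus_many = confidence_bonus(many)
```

With this estimate the bonus falls off roughly as one over the square root of the number of observations, so well-explored states stop looking attractive for their learning value alone.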
A related estimate might be the expected information gain...
It's not yet clear to me how to make an estimate that approximates the true benefit in the limit.