## Wednesday, March 31, 2010

### Formalized Curiosity

I've been reading up on reinforcement learning lately. An interesting question, perhaps the first question, is how to balance exploration with exploitation.

The basic idea is this: if we are faced with a number of options which have unknown probabilistic payoffs, then we can either choose the option that has had the best average payoff so far (ie, exploit our current knowledge), or choose a different option in order to gain more evidence about how good that option is (ie, explore the available options).

In principle, we can compute the optimal choice at a given point in a straightforward way, but this requires looking at the entire future (up to some cutoff we choose) and is highly inefficient. It seems to be critical to actual implementation that we approximate the utility of exploration in a more local way, making exploration an essential element of our decision procedure rather than a derived property that comes from planning far enough ahead.

Unfortunately, the simplistic strategy of choosing to explore totally at random some fraction of the time, and exploit the rest of the time, seems all too common. This fails to take into account factors such as the current risk of exploration, weighing exploration to favor more promising options, et cetera.
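For reference, the simplistic strategy described above is usually called epsilon-greedy. A minimal sketch, assuming the running average payoff of each option is kept in a list (the names `epsilon_greedy`, `means`, and `epsilon` are illustrative, not taken from any library):

```python
import random

def epsilon_greedy(means, epsilon, rng=random):
    """Return the index of the option to try next.

    means:   current average payoff of each option (assumed already tracked)
    epsilon: fraction of the time we explore totally at random
    """
    if rng.random() < epsilon:
        # Explore: every option is equally likely, however bad it has looked so far.
        return rng.randrange(len(means))
    # Exploit: take the option with the best average payoff so far.
    return max(range(len(means)), key=lambda i: means[i])
```

The exploration branch ignores everything known about the options, which is the weakness pointed out above.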

An interesting class of algorithms for improving upon this is optimistic greedy methods: act as if you're exploiting, but use optimistic estimates. It looks like there are several ways of fleshing this out; perhaps the simplest is to come up with an interval within which the utility of an option falls with probability X (say, 95% confidence). Our "actual estimate" for the utility might be in the middle of the range, but we can act as if we think the top value is the right one in order to encourage exploration: if an option has not been explored very well, it will have a broad range, so that the top of the range may be higher than the actual current-best-bet even if the middle is pretty far below.
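One way this might look in code, as a rough sketch only: the interval here comes from a normal approximation (mean plus `z` standard errors), which is an assumption made for illustration rather than something the idea requires. The names `upper_bound` and `choose_optimistic` are hypothetical.

```python
import math

def upper_bound(total, total_sq, n, z=1.96):
    """Top of an approximate confidence interval for an option's mean payoff.

    total, total_sq: sum of payoffs and sum of squared payoffs seen so far
    n:               number of times the option has been tried
    z:               interval half-width in standard errors (1.96 is roughly 95%)
    """
    if n < 2:
        return math.inf  # too little data: treat the option as maximally promising
    mean = total / n
    # Unbiased sample variance, clamped at 0 to guard against rounding error.
    var = max(total_sq / n - mean * mean, 0.0) * n / (n - 1)
    return mean + z * math.sqrt(var / n)

def choose_optimistic(stats, z=1.96):
    """Act greedily, but on the optimistic estimates rather than the means.

    stats: list of (total, total_sq, n) tuples, one per option.
    """
    return max(range(len(stats)), key=lambda i: upper_bound(*stats[i], z=z))
```

With `stats = [(50.0, 30.0, 100), (1.0, 1.0, 3)]`, the second option has the lower average (about 0.33 against 0.5) but a much broader interval, so the optimistic rule picks it, while setting `z=0` falls back to pure exploitation and picks the first.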

This can lead to too little exploration: a string of bad luck associated with an option which is in actuality good can push even the high estimate of the payoff to below the actual payoff of some other option, so that it will never be explored again. This becomes less and less likely the more optimistic we are (ie, the wider our confidence interval), but it's still a disadvantage.

One might ask, what intervals should we pick? Well, it seems like it depends on how long we are going to be choosing between the particular set of options. If we are only choosing one more time, our intervals should be of width 0-- we should exploit rather than explore. Similarly, if we only have a few more times to choose, we should choose with relatively narrow intervals, doing relatively little exploration; if we have a great many left, we should have very broad intervals, exploring a great deal. (Perhaps I should try to work out the optimal intervals here.)

So, if we have a known finite number of moves to make, we should move our interval range down to 0 via some formula.
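The right formula is left open here. Purely as a placeholder to make the shape of the idea concrete, a linear schedule could look like the sketch below; the function name `curiosity`, the linear form, and the starting value `z_max` are all assumptions, and nothing suggests this particular schedule is optimal.

```python
def curiosity(moves_left, horizon, z_max=1.96):
    """Hypothetical linear schedule for the interval multiplier.

    moves_left: choices remaining, counting the current one
    horizon:    total number of choices in the problem
    Returns z_max on the first move and 0.0 on the last, so the final
    choice is pure exploitation.
    """
    if horizon <= 1:
        return 0.0
    return z_max * ((moves_left - 1) / (horizon - 1))
```

The returned value would be used as the interval half-width (in standard errors, say) when computing the optimistic estimates for the current move.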

The interval width can be thought of as the "curiosity" of the system at a particular point in time. If we're using a 95% confidence interval, then we are saying that for each option we don't try, we want to be at least 95% certain that it is worse than the option we do try: any less confident and we'd prefer to experiment.

What if we have an infinite number of moves, and want a strategy which will be guaranteed to converge to the optimal strategy at infinity? (Of course, what we really care about is how quickly we converge, but let's first worry about converging at all.) I said earlier that, with a fixed confidence, this would be impossible: we can always converge to the wrong thing. However, the higher the confidence we require, the less probable this is. So, we can gradually increase our confidence requirement over time to guarantee convergence. This can be thought of as increasing our epistemic standards over time: if we've only been doing something for a few minutes, we're ok with only being 75% certain that it's better than the alternatives; but if we've been following a strategy for millions of years, we want to be damn sure by that point that it's the strategy we should actually have been following.
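One well-known rule with this flavor is UCB1 from the bandit literature, which adds a bonus of sqrt(2 ln t / n) to each option's average, where t is the total number of choices made so far and n is the number of times that option was tried. Because ln t keeps growing, the implied confidence requirement rises over time. The sketch below swaps in that rule as an illustration; it assumes payoffs lie in [0, 1], and it is not necessarily the schedule the reasoning above would arrive at.

```python
import math

def ucb1_choice(sums, counts):
    """Pick the option maximizing mean + sqrt(2 ln t / n).

    sums:   total payoff collected from each option (payoffs assumed in [0, 1])
    counts: number of times each option has been tried
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i  # try every option once before comparing bounds
    t = sum(counts)
    return max(
        range(len(counts)),
        key=lambda i: sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]),
    )
```

An option that has been neglected for a long time sees its bonus creep upward as t grows, so it is eventually tried again no matter how unlucky its early results were.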

There are several paradoxical things going on here:

• While for finite-time problems we want to decrease our curiosity over time, for infinite-time cases we want to increase it; the infinite-time case does not appear to be the limit of longer and longer finite-time cases.
• Higher epistemic standards (ie, requiring greater confidence) correspond to more optimism, not less, even though one might think of optimism as a sort of epistemic dishonesty (pretending the expected value is higher than it is). It's a consequence of differing definitions, not a real contradiction, but I think it's curious.

Edit-- Questions:
• What are the formulas for optimal curiosity levels in the finite and infinite versions?
• Can we make sure that a system using these approximate exploration strategies will still approximate the true optimal strategy as it is given more computation time with which to plan ahead?

## Tuesday, March 30, 2010

### Self-Modifying Logic

This post is inspired by the GOLEM architecture that Ben Goertzel recently wrote a post about. In discussing the architecture on the Singularity list, we came up with the idea of allowing an artificial intelligence to alter its own utility function computation in a controlled way: it evaluates potential new utility calculations by comparing the outputs to the outputs of the original utility function, looking for computations that are essentially the same but faster.

Now there are some fun (and potentially important) questions about how to deal with uncertainty in the original utility function, but I won't go into these here. Instead, the same idea can be applied to the logic a system reasons with:

1. Start with a set of axioms and rules of inference and, if you like, a set of statements about the world (perhaps sensory data).
2. Look for new logics which can derive what the old can derive, but possibly more.
3. Judge these based on some criteria; in particular, shortness of derivations and simplicity of the new logics both seem sensible.

This potentially solves my problem of coming up with a system that iteratively learns higher and higher levels of the Tarski hierarchy. If the system keeps augmenting itself with a new truth predicate, then it will keep increasing the efficiency with which it can derive the truths from the initial system (by a result of Goedel for the type-theoretic hierarchy which, if I'm not mistaken, will hold similarly for the Tarski hierarchy; see his On the Length of Proofs). This does not show that the Tarski hierarchy is the best way of increasing the power of the system, but I am perfectly OK with that. What I'd like, however, would be some guarantee that (some canonical axiomatization of) each level of the Tarski hierarchy can at least eventually be interpreted (ie, as we keep adding further extensions, we interpret more of the Tarski hierarchy, without bound). I do not know how to show this, if it's true.

## Monday, March 22, 2010

### A Few More Thoughts on Guidance

So: if we want to make good inferences, is it enough to look for a programming language which describes good inferences succinctly, and then proceed to carry out actions which have succinct descriptions in that language?

Specifically, can this strategy be arranged such that it is (in some sense) equivalent with the strategy of searching for the inference algorithm that's got the lowest expected time (or, lowest average time on randomly generated test problems typical of what's been seen so far)?

If we've got a bunch of trial runs of different inference algorithms, then the absolute best language in the sense that I'm looking for would be one that could state just the algorithm with the best average time. Less-good languages would be judged by how good the inference algorithms they could state would be. This should make clear part of the divergence from the simple concept of compression: we're not trying to compress all the data, just the "best" data. We've got a range of goodness for our data samples; we want to have very short descriptions for the best cases, but very long ones for the worst cases.

Another difference is that we don't care in the same way about the simplicity of the result. It might be a good idea to favour simpler languages when we only know how good they are on some data, since shorter ones will probably generalise their behaviour more cleanly to more data. Yet, if we could calculate the actual average runtime on truly average data, we would no longer consider shortness to be a concern; not unless we ran a risk of exhausting the memory. With compression, this is not the case.

One might still base an effective algorithm on the loose metaphor: perhaps implement the Levin-like search for good inference strategies, and then some compression-like meta-search for strategy programming languages that tend to result in good strategies. However, the basis in compression doesn't appear to be totally rigorous.

### Compressive Inference Guidance

I just did a post on inference guidance, but I've got a lot more thoughts (which may or may not be worth anything).

The first one is a curiosity I've had for a while: a good way of predicting is to compress sensory information. Can compressing good inference methods lead to a similarly good method of inference?

The previous post indicated at least two ways in which compression can be relevant: first, the Levin-like search gives more priority to shorter inference guidance methods, so although it's not actually trying to compress anything, the result will tend to be compressive of the solution (and of its proof). One might expect that the more compressive a solution, the better it will generalize. However, note that what we're worried about is speed, not correctness-- an inference method cannot harm the correctness of a result, only the speed with which it is inferred. (It can delay the inference perpetually, though.) So we want the speed to generalize. A program that is more compressive of the same data tends to take longer to execute.

Still, I'd expect some connection between compressiveness and the generalization of speed. Wouldn't you?

The second definite connection is that it's useful to compress the problems the system has solved so far, in order to anticipate the coming problems and fine-tune to them. One way of thinking about this is that we look for a good language to describe the problems so far. Then, to generate new problems, we generate random descriptions in that language.

It would be nice to apply this to the search for strategies-- rather than looking for a good strategy, look for a good programming language, in which randomly-generated strategies tend to be good! Remember the best strategies, and use them to continue revising the language and improving it. The question is, does this scheme have normative value? Will it systematically approximate the optimal behavior?

### Guidance of Inference

When I first switched to this new blog, I thought that the point was going to be to take it more seriously, writing in a more organised fashion and being more careful to only say things I know to be true.

The truth is, though, that just makes me not post. A more informal blog would do much better. One advantage of the change is that the blog's title reflects a much broader range of possible topics, such as the previous post that was just some playing with trigonometry.

So, here are some disorganised thoughts about inference guidance for AI.

What I want is a normative theory-- something that says "This is how it should be done, in principle." Practical inference systems are of interest, but a guiding principle that tells us what sorts of practical implementations will tend to be good is very important.

This is somewhat related to my wondering about (normative) systems for handling mathematical uncertainty; the best way to guide inference may also involve, or be involved in, the best way to handle uncertainty about the potential result of an inference.

The main example of inference I'm thinking of is optimisation or constraint satisfaction.

Levin search is asymptotically optimal, but, like genetic programming, does not use the problem description in any way. Jurgen Schmidhuber's "Optimal Ordered Problem Solver" ("Oops," an unfortunate acronym) is optimal in a stronger sense, but still does not use problem descriptions. Hutter's asymptotically optimal algorithm for well-defined problems uses problem descriptions. So, several areas for improvement present themselves. Something that is optimal in Jurgen's sense ("bias-optimal") but which uses problem descriptions rather than searching blindly would be nice. Also, bias-optimality is a strong sort of optimality, but does not capture everything Oops is meant to do: Oops is meant to learn from its experience (hence "ordered problem" solver). Jurgen asserts that it is optimal in a certain sense, but I think there is room for improvement.

One way of using problem descriptions would be to just hand them to Oops as if it could understand them, and judge its output on that assumption. It would be forced to learn to use the problem descriptions. However, this would be quite inefficient.

A more interesting way would be to use Levin search at the inference-guidance level. Execute all possible inference-guidance programs in parallel, with the execution time they are given weighted by their simplicity. Solutions would no longer have to be checked, since results would automatically be correct; a proof of correctness would automatically come with the solution.
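The time-sharing part of this can be sketched on its own. In the usual presentation of Levin search, phase k gives each program of length l (with l no greater than k) a budget of 2^(k - l) steps, so shorter programs always receive exponentially more time. The sketch below only produces that schedule; representing inference-guidance programs as bit strings is an assumption, and the interpreter that would actually run them is not modeled.

```python
def levin_schedule(programs, max_phase):
    """Yield (program, steps) pairs in Levin-search order.

    programs:  candidate programs encoded as bit strings (an assumption here)
    max_phase: number of phases to generate
    In phase k, a program of length l <= k is allotted 2**(k - l) steps.
    """
    for k in range(1, max_phase + 1):
        for p in programs:
            if len(p) <= k:
                yield p, 2 ** (k - len(p))

# Example: two candidate programs over two phases.
print(list(levin_schedule(["0", "01"], 2)))
# prints [('0', 1), ('0', 2), ('01', 1)]
```

In the inference-guidance setting, each allotted step would be one inference made under that program's direction rather than one step of blind solution generation.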

Oops could be modified in the same way. (Oops can be thought of as just an efficient implementation of Levin search with added features for learning from previous success.)

Now, Oops does several things to learn from experience. (I'll use the language of inference guidance, making the assumption that Oops has been modified to work via inference guidance rather than direct solution generation.)

1. It attempts to apply the so-far-successful inference guidance program to the new problem, searching extensions of the program if it's not yet a complete program (ie, if the new situation causes the execution to reach the end of the code where before it didn't); half the attention is directed to this, while the other half searches for a fresh solution (to all problems, not just to the new one).
2. Inference guidance programs are also allowed to provide search orderings for continuations, so that it's possible that a partial inference guidance program represents a good search heuristic for inference guidance programs; this is particularly useful in the situation mentioned above, when a program turns out to be incomplete for a new example.
3. New programs are allowed to copy and modify old ones.

To me this appears haphazard and hacky, though it has its good points. What can we do for a more normatively forceful strategy of learning?

Oops takes execution time into account only through the fact that faster strategies will tend to be found more quickly than slower ones. It does not optimise for execution time directly because, if it did, it would quickly overfit: once it found a solution at all, the fastest-executing program would simply be the one that spit out that answer when given that input. The situation may improve with the amount of experience of the system (since after a large number of instances, a lookup table may no longer be the fastest way of computing the answers). Still, this seems wrong; we want the system to use extra cycles to look for inference strategies that are optimized to be quick on probable questions, not just on previously-experienced questions.

It seems reasonable, then, to search for a probability distribution which would generate the questions observed so far. Using the current-best estimate, the system should look for the quickest solution not just to the current known problems, but to the expected future problems as well.

This could be done in several ways. Perhaps potential future instances are generated at random from the distribution, and inference methods are tested against them. Perhaps a strategy more like Hutter's asymptotically optimal one is used: the system might try to prove which strategies will be better. In any case, the objective will be clear: optimize speed for expected problems.
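The first option, sampling future problems and testing strategies against them, might be sketched as follows. Everything here is hypothetical scaffolding: `sample_problem` stands in for the current-best estimate of the problem distribution, and `cost` stands in for whatever measure of run time is used.

```python
import random

def best_for_expected_problems(strategies, sample_problem, cost, n_samples=200, seed=0):
    """Return the name of the strategy with the lowest average cost on sampled problems.

    strategies:     dict mapping a name to a strategy object
    sample_problem: function taking a random.Random and returning one problem
                    drawn from the estimated distribution
    cost:           function (strategy, problem) -> time or step count
    """
    rng = random.Random(seed)
    # Draw one shared batch so that every strategy faces the same problems.
    problems = [sample_problem(rng) for _ in range(n_samples)]

    def average(strategy):
        return sum(cost(strategy, p) for p in problems) / len(problems)

    return min(strategies, key=lambda name: average(strategies[name]))
```

Because the problems are sampled rather than replayed, a lookup table of past answers gains no advantage unless the estimated distribution really does keep repeating the same questions.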