tag:blogger.com,1999:blog-2669284707270083212014-10-03T00:27:31.220-07:00In Search of LogicAbram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.comBlogger57125tag:blogger.com,1999:blog-266928470727008321.post-6029699609440987522014-07-18T18:16:00.000-07:002014-07-18T18:16:01.186-07:00Losing Faith in Factor GraphsIn my post <a href="http://lo-tho.blogspot.com/2012/08/beliefs.html">Beliefs</a>, I split AGI into two broad problems:<br /><br /><br /><ol><li>What is the space of possible beliefs?</li><li>How do beliefs interact?</li></ol><div>The idea behind #1 was: can we formulate a knowledge representation which could in principle <a href="http://lo-tho.blogspot.com/2012/08/truth-and-ai.html">express any concept a human can conceive of</a>?</div><div><br /></div><div>#2 represents the more practical concern: how do we implement inference over these beliefs?</div><div><br /></div><div>(This was needlessly narrow. I could at least have included something like: <i>3. How do beliefs conspire to produce actions? </i>That's more like what my posts on <a href="http://lo-tho.blogspot.com/2010/06/procedural-logic.html">procedural logic</a> discuss. 1 and 2 alone don't exactly get you AGI. Nonetheless, these are central questions for me.)</div><div><br /></div><div><a href="http://en.wikipedia.org/wiki/Factor_graph">Factor graphs</a> are a fairly general representation for networks of probabilistic beliefs, subsuming bayesian networks, markov networks, and most other graphical models. Like those two, factor graphs are only as powerful as propositional logic, but can be a useful tool for representing more powerful belief structures as well. In other words, the factor graph "toolbox" includes some useful solutions to #2 which we may be able to apply to whatever solution for #1 we come up with. When I started graduate school at USC 3 years ago, I was basically on board with this direction. That's the direction taken by <a href="http://cs.usc.edu/~rosenblo/Pubs/Sigma%20AISBQ%20D.pdf">Sigma</a>, the cognitive architecture effort I joined.</div><div><br /></div><div>I've had several shifts of opinion since that time.</div><h3>I. Learning is Inference</h3><div>My first major shift (as I recall) was to give up on the idea of a uniform inference technique for both learning models and applying them ("induction" vs "deduction"). In principle, Bayes' Law tells us that learning is just probabilistic inference. In practice, <a href="http://stacky.net/wiki/index.php?title=Misconceptions_about_machine_learning">unless you're using Monte Carlo methods</a>, practical implementations tend have much different algorithms for the two cases. There was a time when I hoped that parameter learning could be implemented via belief propagation in factor graphs, properly understood: we just need to find the appropriate way to structure the factor graph such that parameters are explicitly represented as variables we reason about.</div><div><br /></div><div>It's technically possible, but not really such a good idea. One reason is that we don't usually care about the space of possible parameter settings, so long as we find one combination which predicts data well. There is no need to keep the spread of possibilities except so far as it facilitates making good predictions. On the other hand, we do usually care about maintaining a spread of possible values for the latent variables within the model itself, precisely because this <i>does</i> tend to help. (The distinction could be is ambiguous! In principle there might not be a clean line between latent variables, model parameters, and even the so-called hyperparameters. In practice, the distinction is quite clear, though, and that's the point here.)</div><div><br /></div><div>Instead, I ended up working on a perfectly normal gradient-descent learning algorithm for Sigma. Either you're already familiar with this term, or you've got a lot to learn about machine learning; so, I wont try to go into details. The main point is that this is a totally non-probabilistic method, which only believes a single parameter setting at any given time, but attempts to adjust these in response to data.</div><div><br /></div><div>The next big shift in my thinking has been less easily summarized, and has been taking place over a longer period of time. Last year, I <a href="http://lo-tho.blogspot.com/2012/12/my-unhealthy-obsession-with-message.html">wrote a post</a> attempting to think about it from a perspective of "local" vs "global" methods. At that time, my skepticism about whether factor graphs provide a good foundation to start with was already strong, but I don't think I articulated it very well.</div><h3>II. Inference is Approximate</h3><div>Inference in factor graphs is exponential time in the general case, which means that (unless the factor graph has a simple form) we need to use faster approximate inference.</div><div><br /></div><div>This makes perfect sense if we assume that "inference" is a catch-all term referring to the way beliefs move around in the space of possible beliefs. It <i>should</i> be impossible to do exact inference on the whole belief space.</div><div><br /></div><div>If we concede that "inference" is distinct from "learning", though, we have a different situation. What's the point in learning a model that is so difficult to apply? If we are going to go ahead and approximate it with a simpler function, doesn't it make sense to learn the simpler function in the first place?</div><div><br /></div><div>This is part of the philosophy behind <a href="http://alchemy.cs.washington.edu/spn/">sum-product networks</a> (SPNs). Like neural networks, SPN inference is always linear time: you basically just have to propagate function values. Unlike neural networks, everything is totally probabilistic. This may be important for several reasons; it means we can always do reasoning on partial data (filling in the missing parts using the probabilistic model), and reason "in any direction" thanks to Bayes' Law (where neural networks tend to define a one-direction input-output relationship, making reasoning in the reverse direction difficult).</div><div><br /></div><div>Why are SPNs so efficient?</div><h3>III. Models are Factored Representations</h3><div>The fundamental idea behind factor graphs is that we can build up complicated probability distributions by multiplying together simple pieces. Multiplication acts like a conjunction operation, putting probabilistic knowledge together. Suppose that we know a joint distribution connecting <b><i>X</i></b> with <i><b>Y</b></i>: <i><b>P(X, Y)</b></i>. Suppose that we also know a conditional distribution, defining a probability on <b><i>Z</i></b> for any given probability on <b><i>X</i></b>: <i><b>P(Z|X)</b>.</i> If we assume that <i style="font-weight: bold;">Z</i> and <i style="font-weight: bold;">Y</i> are independent given <i style="font-weight: bold;">X</i>, we can obtain the joint distribution across all three variables by multiplying the two probability functions together: <i style="font-weight: bold;">P(X,Y,Z) = P(X,Y)P(Z|X)</i>.</div><div><br /></div><div>A different way of looking at this is that it allows us to create probabilistic constraints connecting variables. A factor graph is essentially a network of these soft constraints. We would expect this to be a powerful representation, because we already know that constraints are a powerful representation for non-probabilistic systems. We would also expect inference to be very difficult, though, because solving systems of constraints is hard.</div><div><br /></div><div>SPNs allow multiplication of distributions, but only when it does not introduce dependencies between distributions which must be "solved" in any way. To oversimplify just a smidge, we can multiply just in the case of total independence: <i style="font-weight: bold;">P(X)P(Y)</i>. We are not allowed to use <i style="font-weight: bold;">P(X|Y)P(Y)</i>, because if we start allowing those kinds of cross-distribution dependencies, things get tangled and inference becomes hard.</div><div><br /></div><div>To supplement this reduced ability to multiply, we add in the ability to <i>add</i>. We compose complicated distributions as a series of sums and products of simpler distributions. (Hence the name.) Despite the simplifying requirements on the products, this turns out to be a rather powerful representation.</div><div><br /></div><div>Just as we can think of a product as <i>conjunctive</i>, imposing a series of constraints on a system, we can think of a sum as being <i>disjunctive</i>, building up a probability distribution by enumeration of possibilities. This should bring to mind mixture models and clustering.</div><div><br /></div><div>Reasoning based on enumeration is faster because, in a sense, it's just the already-solved version of the constraint problem: you have to explicitly list the set of possibilities, as opposed to starting with all possibilities and listing constraints to narrow them down.</div><div><br /></div><div>Yet, it's also <i>more</i> powerful in some cases. It turns out that SPNs are much better at representing probabilistic parsing than factor graphs are. Parsing is a technique which is essential in natural language processing, but it's also been used for other purposes; image parsing has been a thing for a long time, and I think it's a good way of trying to get a handle on a more intricately structured model of images and other data. A parse is elegantly represented via a sum of possibilities. It <i>can</i> be represented with constraints, and this approach has been successfully used. Those applications require special-purpose optimizations to avoid the exponential time inference associated with factor graphs and constraint networks, though.</div><div><br /></div><div>The realization that factor graphs aren't very good at representing this is what really broke my resolve as far as factor graphs go. This indicated to me that factor graphs really were missing a critical representational capability; the ability to enumerate possibilities.<br /><br />Grammar-like theories of general learning have been an old obsession of mine, which I had naively assumed could be handled well within the factor-graph world.<br /><br />My new view suggested that inference and learning should both be a combination of <i>sum-reasoning</i> and <i>product-reasoning</i>. Sum-learning includes methods like clustering, boosting, and bagging: we learn enumerative models. Reasoning with these is quite fast. Product-learning splits reality into parts which can be modeled separately and then combined. <a href="http://homes.cs.washington.edu/~pedrod/papers/mlc13.pdf">These two learning steps interact</a>. Through this process, we create inherently fast, grammar-like models of the world.<br /><h3>IV. Everything is Probabilistic</h3></div><div>Around the same time I was coming to these conclusions, the wider research community was starting to get excited about the new <a href="http://lo-tho.blogspot.com/2014/07/more-distributed-vectors.html">distributed representations</a>. My shift in thinking was taking me toward grammars, so I was <i>quite</i> excited about <a href="http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Socher_125.pdf">Richard Socher's RNN representation</a>. This demonstrated using one fairly simple algorithm for both language parsing and image parsing, producing state-of-the-art results along with a learned representation that could be quite useful for other tasks. RNNs have continued to produce impressive results moving forward; in fact, I would go so far as to say that they are producing results which look like <i>precisely what we want to see</i> out of a nascent AGI technology. These methods produce cross-domain structured generalizations powerful enough to <a href="http://nlp.stanford.edu/~socherr/SocherGanjooManningNg_NIPS2013.pdf">classify previously-unseen objects based on knowledge obtained by reading</a>, which (as I said in my previous post) seems quite encouraging. Many other intriguing results have been published as well.</div><div><br /></div><div>Unfortunately, it's not clear how to fit RNNs in with more directly probabilistic models. The vector representations at the heart of RNNs could be re-conceived as restricted Boltzmann machines (RBMs) or another similar probabilistic model, giving a distributed representation with a fully probabilistic semantics (and taking advantage of the progress in deep belief networks). However, this contradicts the conclusion of section II: an RBM is a complex probabilistic model which must be approximated. Didn't I just say that we should do away with overly-complex models if we know we'll just be approximating them?</div><div><br /></div><div>Carrying forward the momentum from the previous section, it might be tempting to abandon probabilistic methods entirely, in favor of the new neural approaches. SPNs restrict the form of probabilistic models to insure fast inference. But why accept these restrictions? Neural networks are <i>always</i> fast, and they don't put up barriers against certain sorts of complexity in models.</div><div><br /></div><div>The objective functions for the neural models are still (often) probabilistic. A neural network can be trained to output a probability. We do not need everything <i>inside</i> the network to be a probability. We do lose something in this approach: it's harder to reverse a function (reasoning backwards via Bayes' Law) and perform other probabilistic manipulations. However, there may be solutions to these problems (such as training a network to invert another network).</div><h3>V. Further Thought Needed</h3><div>These neural models are impressive, and it seems as if a great deal could be achieved by extending what's been done and putting those pieces together into one multi-domain knowledge system. However, this could never yield AGI as it stands: as <a href="http://www.cs.toronto.edu/~hinton/absps/ilyamre.pdf">this paper notes</a> (Section 6), vector representations do not perform any logical deduction to answer questions; rather, answers are baked-in during learning. These systems often can correctly answer totally new questions which have not been trained on, but that is because the memorization of the other answers forced the vectors into the right "shape" to make the correct answers evident. While this technique is powerful, it can't capture aspects of intelligence which require <i>thinking</i>.</div><div><br /></div><div>Similarly, it's not possible for all probabilistic models to be tractable. SPNs may be a powerful tool for creating fast probabilistic models, but the restrictions prevent us from modeling everything. Intelligence requires <i>some</i> inference to be difficult! So, we need a notion of difficult inference!</div><div><br /></div><div>It seems like there is a place for both shallow and deep methods; unstructured and highly structured models. Fitting all these pieces together is the challenge.</div>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-22063801178205546962014-07-15T00:41:00.000-07:002014-07-15T00:41:55.920-07:00More Distributed VectorsSince my <a href="http://lo-tho.blogspot.com/2013/12/distributed-representations.html">last post</a> on distributed vector representations, interest in this area has continued to spread across the research community. <a href="http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/">This exposition</a> on Colah's blog is quite good, although it unfortunately perpetuates the confusing view that distributed representations are necessarily "deep learning". (In fact, as far as I can see, the trend is in the opposite direction: you can do better by using simpler networks so that you can train faster and scale up to larger datasets. This reflects a <a href="http://lo-tho.blogspot.com/2013/12/where-is-ai-headed.html">very general trend</a> to which deep networks seem to be a rare exception.)<br /><br />The story Colah tells is quite exciting. Vector representations (AKA "word embeddings") are able to perform well on a variety of tasks which they have not been trained on at all; they seem to "magically" encode general knowledge about language after being trained on just one language task. The form of this knowledge makes it easier to translate between languages, too, because the relationship structure between concepts is similar in the two languages. This even extends to image classification: there have been several successes with so-called <a href="http://arxiv.org/pdf/1312.5650.pdf">zero-shot learning</a>, where a system is able to correctly classify images even when it's never seen examples of those classes before, thanks to the general world knowledge provided by distributed word representations.<br /><br />For example, it's possible to <a href="http://nlp.stanford.edu/~socherr/SocherGanjooManningNg_NIPS2013.pdf">recognize a cat having only seen dogs</a>, but having read about both dogs and cats.<br /><br />(This seems <i>rather encouraging!</i>)<br /><br />Colah mentions that while <i>encoding</i> has been very successful, there is a corresponding <i>decoding</i> problem which seems to be much harder. <a href="http://arxiv.org/pdf/1406.1078v1.pdf">One paper</a> is mentioned as a hopeful direction for solving this. Colah is talking about trying to decode representations coming out of <a href="http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Socher_125.pdf">RNNs</a>, a representation I'm quite fond of because it gives (in some sense) a semantic parse of a sentence. However, another option which I'd like to see tried would be to decode representations based on the <a href="http://cs.stanford.edu/~quocle/paragraph_vector.pdf">Paragraph Vector</a> algorithm. This looks easier, and besides, Paragraph Vector actually got better results for sentiment analysis (one of the key domains where RNNs provide a natural solution). Again, we can point to the <a href="http://lo-tho.blogspot.com/2013/12/where-is-ai-headed.html">general trend of AI</a> toward simpler models. RNNs are a way of combining semantic vectors with probabilistic context-free grammers; Paragraph Vector combines semantic vectors with a markov model. Markov models are simpler and less powerful; therefore, by the contrarian logic of the field, we expect them to do better. And, they do.<br /><br />All of this makes distributed vectors sound quite promising as a general-purpose representation of concepts within an AI system. In addition to aiding bilingual translation and image-word correspondence problems, RNNs have also been applied to <a href="http://nlp.stanford.edu/~socherr/SocherChenManningNg_NIPS2013.pdf">predict links in common-sense knowledge bases</a>, showing that the same model can also be applied to understand information presented in a more logical form (and perhaps to form a bridge between logical representations and natural language). I imagine that each additional task which the vectors are used on can add more implicit knowledge to the vector structure, further increasing the number of "zero-shot" generalizations it may get correct in future tasks. This makes me envision a highly general system which accumulates knowledge in vectors over the course of its existence, achieving <a href="http://arxiv.org/abs/1206.6262">lifelong learning</a> as opposed to being re-trained on each task. Vector representations by themselves are obviously not sufficient for AGI (for example, there's no model of problem solving), but they could be a very useful tool within an AGI system.<br /><br />I mentioned <a href="http://lo-tho.blogspot.com/2013/12/history-of-distributed-representations.html">in a previous post</a> that the idea of distributed vectors isn't really that new. One older type of word embedding is latent semantic analysis (LSA). Some researchers who are from the LSA tradition (Marco Baroni, Georgiana Dinu, & German Krusewski) have gotten annoyed at the recent hype, and decided that LSA-style embedding was not getting a fair trial; the new word embeddings have not been systematically compared to the older methods. <a href="http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf">This paper</a> is the result. The verdict: the new stuff really is better!<br /><br />When it's coming from people who openly admit that they hoped to prove the opposite, it sounds rather convincing. However, I do have some reservations. The title of their paper is: <i>Don't count, predict!</i> The authors refer to the older LSA-like methods as "count vectors", and newer methods as "predict vectors". The LSA-like methods rely on some transformation of raw co-occurrence counts, whereas the new neural methods train to predict something, such as predicting the current word given several previous words. (For example, LSA uses the <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">singular value decomposition</a> to make a low-rank approximation of <a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a> counts. Paragraph Vector trains to find a latent variable representing the paragraph topic, and from this and several previous words, predict each next word in the paragraph.)<br /><br />As I said: they are assuming the count vs predict distinction is what explains the different performance of the two options. The main experiment of the paper, an extensive comparison of <a href="https://code.google.com/p/word2vec/">word2vec</a> (representing the prediction-based techniques) vs <a href="http://clic.cimec.unitn.it/composes/toolkit/index.html">DISSECT</a> (representing the counting-based techniques) strongly supports this position. Two other techniques are also tried, though: <a href="http://ml.nec-labs.com/senna/">SENNA</a> and <a href="http://clic.cimec.unitn.it/dm/">DM</a>. Both of these did worse than word2vec, but the relative comparison of the two is <i>much</i> muddier than the comparison between DISSECT and word2vec. This weakens the conclusion somewhat. Are we expected to believe that new counting-based models will continue to be worse than new prediction-driven models?<br /><br />If we believed that counting models had reached a state of near-perfection, with only small improvements left to be found, then the conclusion would make sense. Prediction-based vector representations appear to still be in their early days, with large unexplored areas. Presumably, they still have significant room for improvement. If the authors believe that this is <i>not</i> the case for counting-based vector representations, the conclusion makes sense.<br /><br />However, work I'm involved with may undermine this conclusion.<br /><br />I've been doing distributed vector stuff with Volkan Ustun and others here at ICT. <a href="http://cs.usc.edu/~rosenblo/Pubs/VectorRepresentation_AGI2014_Final.pdf">Our forthcoming paper</a> has some suggestive evidence, showing performance very close to Google's word2vec when trained with similar vector size and amount of data. We are not doing any neural network training. Instead, we are creating representations by summing together initial random representations for each word. This seems to fall firmly into count-based methods.<br /><br />The incredible thing about Volkan's technique is that it's essentially the first thing we thought of trying; what's going on is much simpler than what happens in word2vec. Yet, we <i>seem</i> to be getting similar results. (We should run a more direct comparison to be sure of what's going on.) If this is the case, it directly contradicts the conclusion of <i>Don't Count, Predict!</i>.<br /><br />In any case, distributed vectors continue to offer a surprising amount of generality, and have some promise as a cross-task, cross-language, cross-modality unified representation.Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-30522658343815397232014-02-07T17:46:00.001-08:002014-06-12T14:03:07.753-07:00DeepMind PapersThere has been a lot of talk about Google's acquisition of DeepMind. I won't try to review the facts for those who don't know about it-- there's got to be about 100 roughly identical news articles you can find. What those news articles only rarely include is a link to DeepMind's research publications; and as far as I've seen, only to the paper on playing Atari games using model-free RL. (Maybe it's the most public-friendly.) Before the news broke, I didn't even know that DeepMind was putting out <i>any</i> information on what they were doing; their very blank company website made me think they were being hush-hush.<br /><br />So. For the curious, here are all the papers I could find which include at least one author @deepmind.com:<br /><br /><a href="http://arxiv.org/pdf/1310.8499.pdf">Deep Autoregressive Networks</a><br /><br /><a href="http://arxiv.org/pdf/1312.5602v1.pdf">Playing Atari with Deep Reinforcement Learning</a><br /><br /><a href="http://arxiv.org/pdf/1402.0030.pdf">Neural Variational Inference and Learning in Belief Networks</a><br /><br /><a href="http://arxiv.org/pdf/1312.6055.pdf">Unit Tests for Stochastic Optimization</a><br /><br /><a href="http://arxiv.org/pdf/1109.5951.pdf">An Approximation of the Universal Intelligence Measure</a><br /><br /><a href="http://arxiv.org/pdf/1401.4082.pdf">Stochastic Back-Propagation and Variational Inference in Deep Latent Gaussian Models</a><br /><br /><a href="http://arxiv.org/pdf/1312.5783.pdf">Unsupervised Feature Learning by Deep Sparse Coding</a><br /><br /><a href="http://jmlr.org/proceedings/papers/v32/silver14.pdf">Deterministic Policy Gradient Algorithms</a><br /><br /><a href="http://www.stats.ox.ac.uk/~teh/research/npbayes/BluTeh2013a.pdf">Bayesian Hierarchical Community Discovery</a><br /><br /><a href="https://www.cs.toronto.edu/~amnih/papers/wordreps.pdf">Learning Word Embeddings Efficiently with Noise-Contrastic Estimation</a>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-44875236482615328882014-01-20T12:00:00.002-08:002014-01-20T12:00:19.219-08:00Math vs LogicMath provides a series of games, which can be usefully applied to reality when their rules closely mirror those of some real system.<div><br /></div><div>Originally, it would seem that math started out as just another part of language. Language itself has been described as a game: a useful set of rules which we follow in order to get things done.</div><div><br /></div><div>Eventually, math developed into a sort of sub-language or sub-game with a clearly independent set of rules. The objects of mathematical language were much different from the typical objects of everyday language, being more abstract while simultaneously being atypically precise, with very definite behavior. This "definite behavior" constitutes the rules of the game. The Pythagorean cult nurtured the idea of formal derivations from axioms or postulates.</div><div><br /></div><div>(Around this time, Plato's idea of a separate pure mathematical reality started to seem plausible to some folks.)</div><div><br /></div><div>Notice, however, that math still seemed like a <i>single</i> game. Euclid's Elements provided a wonderfully unified world of mathematics, in which number theory was considered as a geometric topic.</div><div><br /></div><div>I think it wasn't until the 1800s that it started to become clear that we want to view math as a rich set of different possible games, instead. Non-euclidean geometry was discovered, and mathematicians started to explore variant geometries. The use of imaginary numbers became widely accepted due to the work of Gauss (and Euler in the previous century), and the quaternions were discovered. Group theory was being applied and developed.</div><div><br /></div><div>Once we have this view, math becomes an exploration of all the possible systems we can define.</div><div><br /></div><div>Within the same century, logic was gaining teeth and starting to look like a plausible foundation for all mathematical reasoning. It would be overstating things to claim that logic had not developed within the past two thousand years; however, developments in that century would overshadow all previous.</div><div><br /></div><div>In using the number two thousand, I refer to Aristotle, who laid out the first formal system of logic around 350 BC. Aristotle provided a deep theory, but it was far from enough to account for all correct arguments. Euclid's Elements (written just 50 years later) may have used exceedingly rigorous arguments (a bright light in the history of mathematics), but they were still informal in the sense that there was no formal system of logic justifying every step. Instead, the reader must see that the steps are justified by force of reason. Aristotle's logic was simply too weak to do the job.</div><div><br /></div><div>It was not until Frege, publishing in the late 1800s, that this became a possibility.</div><div><br /></div><div>Frege attempted to set out a system of logic in which all of math could be formalized, and he went a long way toward achieving this. He invented modern logic. In the hands of others, it would become the language in which all of mathematics could be set out.</div><div><br /></div><div>So, we see that logic steps in to provide a sort of super-game in which all the various games of mathematics can be played. It does so just as a unified picture of mathematics as a single game is crumbling.</div><div><br /></div><div>My point is to answer a question which comes up now and then: what is the dividing line between math and logic? Is logic a sub-field of math, or is it the other way around? The reality is complex. Logic and math are clearly distinct: logic is an attempt to characterize justified arguments, whereas math merely relies heavily on these sorts of arguments. A system of logic can be viewed as just another mathematical system (just another game), but it must be admitted that logic has a different "flavor" than mathematics. I think the difference is in how we are trying to capture a very large space of possibilities within a logical system (even when we are studying very restricted systems of logic, such as logic with bounded quantification).</div><div><br /></div><div>Ultimately, the difference is a historical one.</div><div><br /></div><div><i>All facts cited here can be easily verified on Wikipedia. Thanks for reading!</i></div>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com2tag:blogger.com,1999:blog-266928470727008321.post-63633233387647426722013-12-21T15:00:00.000-08:002013-12-21T15:00:10.697-08:00You Think Too Much<span style="font-size: x-small;"><i>This first appeared as one of my few facebook notes. I'm copying it here; perhaps it's a better place for it.</i></span><br /><br /><div class="_5k3v _5k3w clearfix"><div>This is what the phrase "You're overthinking it" is like.<br /><br />Mary and Bob sit down to a game of chess. Mary is an inexperienced player, and Bob is giving her some advice. He's being nice, trying to hint at the right moves without telling her what to do. So, Bob is making his moves very quickly, since he's experienced, and Mary is not difficult to play against (and he's going easy on her anyway). Mary is very uncertain of most of her moves, and spends a lot of time staring at the board indecisively.<br /><br />At one point, Mary is looking at the board in confusion. Bob sees very plainly what move she needs to make; she should move her rook to defend her knight. Mary is looking very carefully at all the possible moves she could make, trying hard to evaluate which things might be good or bad for her, trying to think a few moves ahead.<br /><br />"You're thinking too much," Bob says. "It's very simple."<br /><br />This advice sounds helpful to Bob. From Bob's perspective, Mary is spending a lot of time thinking about many alternatives when she should be quickly hitting on the critical move. And it's true: if Mary were a better player, she would be thinking less here.<br /><br />From Mary's perspective, this is not very helpful at all. She tries to take Bob's advice. She tries to think less about her move. She figures, if Bob says "It's simple", this must mean that she doesn't need to look several moves ahead to see the consequences. She looks for possible moves again, this time looking for things that have good consequences for her immediately.<br /><br />Mary moves the pawn up to threaten one of Bob's pieces.<br /><br />Bob takes Mary's knight.<br /><br />Bob explains to a frustrated Mary what she could have done to avoid this. "See? You're overthinking it" he adds. To Bob, this feels like the right explanation for Mary's wrong move: she was thinking about all these other pieces, when she needed to be defending her knight.<br /><br />The worst part is, Mary starts to be convinced, too. She admits that she was taking a lot of time to look several moves ahead in all kinds of situations that turned out to be irrelevant to what she needed to do. She tries to think less during the rest of the game, and makes many other mistakes as a result.</div></div>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-51013126717511552452013-12-15T17:05:00.001-08:002013-12-15T17:33:32.733-08:00Where is AI Headed?I've spent a lot of effort on this blog arguing for the direction of higher expressiveness. <a href="http://lo-tho.blogspot.com/2011/11/why-relational-methods.html">Machine intelligence should be able to learn anything a human can learn, and in order for that to be possible, it should be able to conceive of any concept that a human can.</a> I have proceeded with the belief that this is the direction to push in order for the field to make progress.<br /><br />Yet, in some ways at least, the field is headed in the opposite direction.<br /><br />I've often discussed the Chomsky hierarchy, and how most techniques at present fall very low on it. I've often discussed hierarchies "above" the Chomsky hierarchy; hierarchies of logic & truth, problems of uncomputability and undefinability. Reaching for the highest expression of form, the most general notion of pattern.<br /><br />Machine learning has made artificial intelligence increasingly practical. Yet, the most practical techniques are often the least expressively powerful. Machine learning flourished once it abandoned the symbolic obsession of GOFAI. Fernando Pereira famously said: <i>"The older I get, the further down the Chomsky Hierarchy I go."</i><br /><i><span style="font-family: inherit;"><br /></span></i><span style="font-family: inherit;">There's a good reason for this, too. Highly structured techniques like logic induction and genetic programming (both of which would go high in the hierarchy) don't scale well. Commercial machine learning is large-scale, and increasingly so. I mentioned this in connection with <a href="http://lo-tho.blogspot.com/2013/12/history-of-distributed-representations.html">word2vec last time</a>: <i>"</i><span style="background-color: white; line-height: 20px;"><i>Using very shallow learning makes the technique faster, allowing it to be trained on (much!) larger amounts of data. This gives a higher-quality result."</i> </span></span><br /><span style="font-family: inherit;"><span style="background-color: white; line-height: 20px;"><br /></span></span><span style="line-height: 20px;">The "structure" I'm referring to provides more prior bias, which means more generalization capability. This is very useful when we want to come to the correct conclusion using small amounts of data. However, with more data, we can cover more and more cases without needing to actually make the generalization. At some point, the generalization becomes irrelevant in practice.</span><br /><span style="line-height: 20px;"><br /></span><span style="line-height: 20px;">Take XML data. You can't parse XML with regular expressions.<sup>1</sup> Regular expressions are too low on the Chomsky hierarchy to form a proper model of what's going on. However, for the <a href="http://mattmahoney.net/dc/text.html">Large Text Compression Benchmark</a>, which requires us to compress XML data, the leading technique is the PAQ compressor. <a href="http://mattmahoney.net/dc/rationale.html">Compression is equivalent to prediction</a>, so the task amounts to making a predictive model of XML data. PAQ works by constructing a probabilistic model of the sequence of bits, similar to a <a href="http://en.wikipedia.org/wiki/Prediction_by_partial_matching">PPM</a> model. This is <i>not even capable of representing regular expressions</i>. Learning regular expressions is like learning hidden markov models. PPM allows us to learn fully observable markov models. PAQ learns huge markov models that get the job done.</span><br /><span style="line-height: 20px;"><br /></span><span style="line-height: 20px;">The structure of XML requires a <i>recursive</i> generalization, to understand the nested expressions. Yet, PAQ does acceptably well, because the depth of the recursion is usually quite low.</span><br /><span style="line-height: 20px;"><br /></span><span style="line-height: 20px;">You can always push a problem lower down on the hierarchy if you're willing to provide more data (often exponentially more), and accept that it will learn the common cases and can't generalize the patterns to the uncommon ones. In practice, it's been an acceptable loss.</span><br /><span style="line-height: 20px;"><br /></span><span style="line-height: 20px;">Part of the reason for this is that the data just keeps flowing. The simpler techniques require exponentially more data... <i>and that's how much we're producing</i>. It's only getting worse:</span><br /><span style="line-height: 20px;"><br /></span><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><img border="0" height="271" src="http://3.bp.blogspot.com/-kM1iEjEQJ8o/Uq5L6DDB2eI/AAAAAAAACcY/pEAvFek7JR0/s320/what.is_.personal.data2x519.png" style="margin-left: auto; margin-right: auto;" width="320" /></td></tr><tr><td class="tr-caption" style="text-align: center;"><a href="http://www.technologyreview.com/news/514351/has-big-data-made-anonymity-impossible/">Has Big Data Made Anonymity Impossible? MIT Technology Review</a></td></tr></tbody></table><span style="line-height: 20px;">At The New Yorker, Gary Marcus complains: <a href="http://www.newyorker.com/online/blogs/elements/2013/08/why-cant-my-computer-understand-me.html">Why Can't My Computer Understand Me?</a> Reviewing the work of Hector Levesque, the article conveys a desire to "google-proof" AI, designing intelligence tests which are immune to the big-data approach. Using big data rather than common-sense logic to answer facts is seen as cheating. Levesque presents a series of problems which cannot (presently) be solved by such techniques, and calls others to "stop bluffing".</span><br /><span style="line-height: 20px;"><br /></span><span style="line-height: 20px;">I can't help but agree. Yet, it seems the tide of history is against us. As the amount of data continues to increase, dumb techniques will achieve better and better results.</span><br /><span style="line-height: 20px;"><br /></span><span style="line-height: 20px;">Will this trend turn around at some point?</span><br /><span style="line-height: 20px;"><br /></span><span style="line-height: 20px;">Gary Marcus points out that some information just isn't available on the web. Yet, this is a diminishing reality. As more and more of our lives are online (and as the population rises), more and more will be available in the global brain.</span><br /><span style="line-height: 20px;"><br /></span><span style="line-height: 20px;">Artificial intelligence is evolving into a specific role in that global brain: a role which requires only simple association-like intelligence, fueled by huge amounts of data. Humans provide the logically structured thoughts, the prior bias, the recursive generalizations; that's a niche which machines are not currently required to fill. At the present, this trend only seems to be increasing.</span><br /><span style="line-height: 20px;"><br /></span><span style="line-height: 20px;">Should we give up structured AI?</span><br /><span style="line-height: 20px;"><br /></span><span style="line-height: 20px;">I don't think so. We can forge a niche. We can climb the hierarchy. But it's not where the money is right now... and it may not be for some time.</span><br /><span style="line-height: 20px;"><br /></span><span style="font-size: x-small; line-height: 20px;"><i><b>1</b>: <a href="http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html">Cthulhu will eat your face.</a></i></span>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com1tag:blogger.com,1999:blog-266928470727008321.post-89627635395791620232013-12-09T19:23:00.000-08:002013-12-09T19:23:11.725-08:00History of Distributed Representations<a href="https://plus.google.com/111568410659864255951/posts/3bWRE2kZPBd">Commenting</a> on the <a href="http://lo-tho.blogspot.com/2013/12/distributed-representations.html">previous post</a>, a friend pointed out that "distributed representations" are not so new. I thought I would take a look at the history to clarify the situation.<br /><br />In a <i>very</i> broad sense, I was discussing the technique of putting a potentially nonlinear problem into a linear vector space. This vague idea matches to many techniques in machine learning. A number of well-developed algorithms take advantage of linearity assumptions, including <a href="http://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a>, <a href="http://en.wikipedia.org/wiki/Logistic_regression">logistic regression</a>, and <a href="http://en.wikipedia.org/wiki/Support_vector_machine">SVM</a>.<sup>1</sup> A common approach to machine learning is to find a number of features, which are just functions of your data, and use one of these techniques on the features (hoping they are close enough to linear). Another common technique, the <a href="http://en.wikipedia.org/wiki/Kernel_trick">kernel trick</a>, projects features into a higher-dimensional space where the linearity assumption is more likely to get good results. Either way, a large part of the work to get good results is "feature engineering": choosing how to represent the data as a set of features to feed into the machine learning algorithm.<br /><br />We could even argue that probability theory itself is an example: probabilities are always linear, no matter how nonlinear the underlying problem being described. (The probability of an event is the sum of the ways it could happen.) This gives us nice results; for example, <a href="http://en.wikipedia.org/wiki/Nash_equilibrium#Nash.27s_Existence_Theorem">there is always a Nash equilibrium</a> for a game if we allow probabilistic strategies. This is not the case if we only consider "pure" strategies.<br /><br />This theme is interesting to me, but, I was trying to be much more narrow in talking about recent developments in distributed representations. Like feature-based machine learning, a distributed representation will put data into a vector space to make it easier to work with. Unlike approaches relying on feature engineering, there is an emphasis on figuring out how to get the representations to "build themselves", often starting with randomly assigned vector representations.<br /><br />The beginning of this kind of approach is probably <a href="http://en.wikipedia.org/wiki/Latent_semantic_analysis">latent semantic analysis</a> (LSA), which is from 1988. LSA assigns 'semantic vectors' to words based on statistical analysis of the contexts those words occur in, based on the idea that words with similar meaning will have very similar statistics.<br /><br />Given how old this technique is, the excitement around Google's release of the word2vec tool is striking. Reports spun it as <a href="http://gigaom.com/2013/08/16/were-on-the-cusp-of-deep-learning-for-the-masses-you-can-thank-google-later/">deep learning for the masses</a>. Deep learning is a much more recent wave of development. I think the term has lost much of its meaning in becoming a buzzword.<sup>2</sup> Calling word2vec "deep" takes this to farcical levels: <a href="http://arxiv.org/abs/1301.3781">the techniques of word2vec</a> improve previous models by <i>removing</i> the hidden layer from the network. Using very shallow learning makes the technique faster, allowing it to be trained on (much!) larger amounts of data. This gives a higher-quality result.<br /><br />One of the exciting things about word2vec is the good results with solving word analogies by vector math. The result of vector computations like France - Paris and Russia - Moscow are very similar, meaning we can approximately find the vector for a capital given the vector for the corresponding nation. The same trick works for a range of word relationships.<br /><br />However, I've talked with people who had the incorrect impression that this is a new idea. I'm not sure exactly how old it is, but I've heard the idea mentioned before, and I did find a <a href="http://books.google.com/books?id=rU_onmzozu0C&pg=PA609&lpg=PA609&dq=using+lsa+to+solve+word+analogy&source=bl&ots=w5qt71RbR3&sig=yGANqEDOoS7UQDd-Mq9zB0jcHLU&hl=en&sa=X&ei=ySGlUo-HEKrh2wW20ICoCQ&ved=0CEIQ6AEwAg#v=onepage&q=using%20lsa%20to%20solve%20word%20analogy&f=false">reference</a> from 2004 which appears to use LSA to do the same basic thing. (I can't see the whole article on google books...)<br /><br />One thing which I thought was really new was the emerging theme of combining vectors to form representations of compound entities. This, too, is quite old. I found a <a href="http://www.rni.org/kanerva/icann94.pdf">paper from 1994</a>, which cites harder-to-find papers from 1993, 1990, and 1989 that also developed techniques to combine vectors to create representations of compound objects. Recent developments seem much more <i>useful</i>, but, the basic idea is present.<br /><br />So, all told, it's a fairly long-standing area which has seen large improvements in the actual techniques employed, but, whose central ideas were laid out (in one form or another) over 20 years ago.<br /><br /><span style="font-size: x-small;"><i><b>1</b>: By the way, don't get too hung up about what makes one machine learning technique "linear" and another "nonlinear". This is a false dichotomy. What I really mean is that a technique works in a <a href="http://en.wikipedia.org/wiki/Vector_space">vector space</a> (which more or less means a space where + is defined and behaves very much like we expect), and relies "largely" on linear operations in this space.</i></span><span style="font-size: x-small;"><i> What does "linear" actually mean? A function F is linear if and only if </i>F(x+y) = F(x) + F(y)<i> and for scalar </i>a<i>, </i>F(ax) = aF(x).<i> PCA, for example, is justified by minimizing a squared error (a common theme), where the error is based on euclidean distance, a linear operation. Notice that taking the square isn't linear, but PCA is still thought of as a linear approach.</i></span><br /><br /><span style="font-size: x-small;"><i><b>2</b>: Deep learning has come to mean almost any multi-layer neural network. The term caught on with the success related to Deep Belief Networks, which proposed specific new techniques. Things currently being called "deep learning" often have little in common with this. I feel the term has been watered down by people looking to associate their work with the success of others. This isn't all bad. The work on multi-layered networks seems to have produced real progress in reducing or eliminating the need for feature engineering.</i></span>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-4855890397939621962013-12-03T16:49:00.000-08:002013-12-03T16:49:07.022-08:00Distributed RepresentationsDistributed vector representations are a set of techniques which take a domain (usually, words) and embed it into a linear space (representing each word as a large vector of numbers). Useful tasks can then be represented as manipulations of these embedded representations. The embedding can be created in a variety of ways; often, it is learned by optimizing task performance. <a href="http://ml.nec-labs.com/senna/">SENNA</a> demonstrated that representations learned for one task are often useful for others. <br /><br />There are so many interesting advances being made in distributed vector representations, it seems that a nice toolset is emerging which will soon be considered a basic part of machine intelligence.<br /><br />Google's <a href="https://code.google.com/p/word2vec/"> word2vec</a> assigns distributed vector representations to individual words and a few short phrases. These representations have been shown to give intuitively reasonable results on analogy tasks with simple vector math: <i>king - man + woman</i> is approximately equal to the vector for <i>queen</i>, for example. This is despite not being explicitly optimized for that task, again showing that these representations tend to be useful for a wide range of tasks.<br /><br />Similar approaches have aided machine translation tasks by turning word translation into a linear transform from one vector space to another.<br /><br />One limitation of this approach is that we cannot do much to represent sentences. Sequences of words can be given somewhat useful representations by adding together the individual word representations, but this approach is limited.<br /><br />Socher's RNN learns a matrix transform to compose two elements together and give them a score, which is then used for greedy parsing by composing together the highest-scoring items, with great success. This gives us useful vector representations for phrases and sentences.<br /><br />Another approach which has been suggested is circular convolution. This combines vectors in a way which captures ordering information, unlike addition or multiplication. Impressively, the technique has solved Raven progressive matrix problems:<br /><br /><a href="http://eblerim.net/?page_id=2383">http://eblerim.net/?page_id=2383</a><br /><br />Then there's a project, COMPOSES, which seeks to create a language representation in which nouns get vector representations and other parts of speech get matrix representations (and possibly tensor representations?).<br /><br /><a href="http://clic.cimec.unitn.it/composes/">http://clic.cimec.unitn.it/composes/</a><br /><br />I haven't looked into the details fully, but conceptually it makes sense: the parts of speech which intuitively represent modifiers are linear <i>functions</i>, while the parts of speech which are intuitively static objects are getting operated on by these functions.<br /><br />The following paper gives a related approach:<br /><br /><a href="http://www.cs.utoronto.ca/~ilya/pubs/2008/mre.pdf">http://www.cs.utoronto.ca/~ilya/pubs/2008/mre.pdf</a><br /><br />Here, everything is represented as a matrix of the same size. Representing the objects as functions is somewhat limiting, but the uniform representation makes it easy to jump to higher-level functions (modifiers on modifiers) without adding anything. This seems to have the potential to enable a surprisingly wide range of reasoning capabilities, given the narrow representation.<br /><br />As the authors of that last paper mention, the approach can only support reasoning of a "memorized" sort. There is no mechanism which would allow chained logical inferences to answer questions. This seems like a good characterization of the general limitations of the broader set of techniques. The distributed representation of a word, phrase, image, or other object is a static encoding which represents, in some sense, a classification of the object into a fuzzy categorization system we've learned. How can we push the boundary here, allowing for a more complex reasoning? Can these vector representations be integrated into a more generally capable probabilistic logic system?Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-73487156281588003452013-08-11T19:53:00.001-07:002013-08-12T07:06:30.899-07:00Progress in Logical PriorsIt's been a while since I've posted here. I've been having a lot of ideas, but I suppose the phd student life makes me focus more on implementing than on speculating (which is a good thing).<br /><br />I presented my <a href="http://ict.usc.edu/pubs/Logical%20Prior%20Probability.pdf">first sole-authored paper</a> (based on <a href="http://lo-tho.blogspot.com/2011/05/current-probability-ideas.html">this</a> blog post) at the AGI conference in December of last year, and it was <a href="http://www.hutter1.net/publ/agscilaws.pdf">cited</a> by Marcus Hutter in an interesting paper about approximating universal intelligence (which was presented at this year's AGI conference, which was once again a summer conference, so already took place).<br /><br />When I set out to write the paper, my main goal was to show that we gain something by representing beliefs as something like logical statements, rather than as something like programs. This allows our beliefs to decompose more easily, readily allows inference in "any direction" (whereas programs are naturally executed in one direction, producing specific results in a specific order), and also allowing incomputable hypotheses to be dealt with in a partial way (dealing somewhat more gracefully with the possibility by explicitly representing it in the hypothesis class, but incompletely).<br /><br />My desire to put forward this thesis was partly out of an annoyance with people invoking the Curry-Howard isomorphism all-too-often, to claim that logic and computation are really one and the same. I still think this is misguided, and not what the curry-howard isomorphism really says when you get down to it. The "programs are proofs" motto is misleading. There is no consensus on how to deal with Turing-complete programs in this way; turing-complete programming languages seem to correspond to trivial logics where you can prove anything from anything!*<br /><br />Annoyed, I wanted to show that there was a material difference between the two ways of representing knowledge.<br /><br />As I wrote the paper and got feedback from colleagues, it became clear that I was fighting a losing fight for that thesis: although the first-order prior represented a new mathematical object with interesting features along the lines I was advocating, it would be possible to write a somewhat program-like representation with the same features. I would still argue each of the advantages I mentioned, and still argue against naive invocations of Curry-Howard, but I was trying to make these arguments too strong, and it wasn't working. In any case, this was a point that didn't need to be made in order for the paper to be interesting, for two reasons:<br /><br /><ol><li>If desired, you could re-do everything in a more computational way. It would still be a new, interesting distribution with features similar to but different from the Solomonoff distribution.</li><li>A universal distribution over logic, done right, is interesting even if it had turned out to be somehow equivalent to the Solomonoff distribution.</li></ol><br />So, all told, I downplayed the "logic is different from computation" side of the paper, and tried to focus more on the prior itself.<br /><br />After submitting the paper, I went back to working on other things. Although I still thought about logical priors every so often, I didn't make very much conceptual progress for a while.<br /><br />At the July MIRI workshop, I got the opportunity to spend time on the topic again, with other smart folks. We spend roughly a day going over the paper, and then discussed how to take things further.<br /><br />The main problem with the first-order prior is that the probability of a universal statement does not approach 1 as we see more and more examples. This is because all the examples in the world will still be consistent with the statement "there exists a counterexample"; so, if we are randomly sampling sentences to compose a logical theory, the probability that we add that sentences doesn't drop below a certain minimum.<br /><br />So, for example, if we are observing facts about the natural numbers, we will not converge to arbitrarily high probability for generalizations of these facts. To make it more concrete, we cannot arrive at arbitrarily high probabilities for the <a href="http://en.wikipedia.org/wiki/Goldbach's_conjecture">Goldbach conjecture</a> by observing more and more examples of even numbers being written as the sum of two primes.<br /><br />This isn't a bad thing in all cases. Holding back some fixed probability for the existence of a counterexample matches with the semantics of first-order logic, which is not supposed to be able to rule out <a href="http://en.wikipedia.org/wiki/%CE%A9-consistent_theory">omega-inconsistent</a> theories. (Omega inconsistency is the situation where we deny a universal statement while simultaneously believing all the examples.)<br /><br />For some domains, though, we really do want to rule out omega-inconsistency; the natural numbers are one of these cases. The reason the first-order prior allows some probability for omega-inconsistent possibilities is that first-order logic is unable to express the fact that natural numbers correspond exactly to the finite ones. ("Finite" cannot be properly characterized in first-order logic.) More expressive logics, such as second-order logic, can make this kind of assertion; so, we might hope to specify reasonable probability distributions over those logics which have the desired behavior.<br /><br />Unfortunately, it is not difficult to show that the desired behavior is not approximable. If the probability of universal statements approaches 1 as we observe increasingly many examples, then it must equal 1 if we believe all the examples. Let's take an example. If we believe all the axioms of peano arithmetic, then we may be able to prove all the examples of the Goldbach conjecture. In fact, we end up believing all true Pi_1 statements in the <a href="http://en.wikipedia.org/wiki/Arithmetic_hierarchy">arithmetic hierarchy</a>. But this implies that we believe all true Sigma_2 statements, if our beliefs are closed under implication. This in turn means that we believe all the examples of the Pi_3 universal statements, which means we must believe the true Pi_3 with probability 1, since we supposed that we believe universal statements if we believe their examples. And so on. This process can be used to argue that we must believe the true statements on every level of the hierarchy.<br /><br />Since the hierarchy transcends every level of <a href="http://en.wikipedia.org/wiki/Hypercomputation">hypercomputation</a>, there can be no hope of a convergent approximation for it. So, convergence of universal statements to probability 1 as we see more examples is (very) uncomputable. This may seem a bit surprising, given the naturalness of the idea.<br /><br />Marcus Hutter has discussed distributions like this, and <a href="https://groups.google.com/forum/#!topic/magic-list/WJzPoNJavhk">argues</a> that it's OK: this kind of distribution doesn't try to capture our uncertainty about logically undecidable statements. Instead, his probability distribution represents the strong inductive power that we could have if we could infallibly arrive at correct mathematical beliefs.<br /><br />Personally, though, I am much more interested in approximable distributions, and approaches which <i>do</i> try to represent the kind of uncertainty we have about undecidable mathematical statements.<br /><br />My idea has been that we can get something interesting by requiring convergence on the Pi_1 statements only.<br /><br />One motivation for this is that Pi_1 convergence guarantees that a logical probability distribution will eventually recognize the consistency of any axiomatic system, which sort-of gets around the 2nd incompleteness theorem: an AI based on this kind of distribution would eventually recognize that any axioms you give it to start with are consistent, which would allow it to gradually increase its logical strength as it came to recognize more mathematical truth. This plausibly seems like a step in the direction of self-trusting AI, one of the goals of MIRI.<br /><br />The immediate objection to this is that the system still won't trust itself, because it is not a set of axioms, but rather, is a convergent approximation of a probability distribution. Convergence facts are higher up in the arithmetic hierarchy, which suggests that the system won't be able to trust itself even if it does become able to (eventually) trust axiomatic systems.<br /><br />This intuition turns out to be wrong! There is a weak sense in which Pi_1 convergence implies self-trust. Correctness for Pi_1 implies that we believe the true Sigma_2 statements, which are statements of the form "There exists x such that for all y, R(x,y)" where R is some primitive recursive relation. Take R to be "y is greater than x, and at time y in the approximation process, our probability of statement S is greater than c." (The arithmetic hierarchy can discuss the probability approximation process through a godel-encoding.) The relevant Sigma_2 statements place lower bounds on the limiting probabilities from our probability approximation. We can state upper bounds in a similar way.<br /><br />This shows that a probability distribution which has Pi_1 convergence will obey something strikingly like the <a href="http://lesswrong.com/lw/h1k/reflection_in_probabilistic_logic/">probabilistic reflection principle</a> which came out of a previous MIRI workshop. If its probabilities fall within specific bounds, it will believe that (but the converse, that if it believes they fall within specific bounds, they do, does not hold). This gives such a probability distribution a significant amount of self-knowledge.<br /><br />So, Pi_1 convergence looks like a nice thing to have. But is it?<br /><br />During the MIRI workshop, Will Sawin proved that this leads to bad (possibly unacceptable) results: any logically coherent, approximable probability distribution over statements in arithmetic which assigns probability 1 to true pi_1 statements will assign probability 0 to some true pi_2 statements. This seems like a rather severe error; the whole purpose of using probabilities to represent uncertainty about mathematical truth would be to allow "soft failure", where we don't have complete mathematical knowledge, but can assign reasonable probabilities so as to be less than completely in the dark. This theorem shows that we get hard failures if we try for pi_1 convergence.<br /><br />How concerned should we be? Some of the "hard failures" here correspond to the necessary failures in probabilistic reflection. These actually seem quite tolerable. There could be a lot more errors than that, though.<br /><br />One fruitful idea might be to weaken the coherence requirement. The usual argument for coherence is the dutch book argument; but this makes the assumption that bets will pay off, which does not apply here, since we may never face the truth or falsehood of certain mathematical statements. <a href="http://brian.weatherson.org/conprob.pdf">Intuitionistic probability</a> comes out of a variation of the dutch book argument for the case when bets not paying off at all is a possible outcome. This does not require that probabilities sum to 1, which means we can have a gap between the probability of X and the probability of not-X.<br /><br />An extreme version of this was propo<span style="font-family: inherit;">sed by <span style="background-color: white; color: #2e2e2d; line-height: 24px; text-align: justify;">Marcello Herreschoff</span> at the </span>MIRI workshop; he suggested that we can get Pi_1 convergence by <i>only</i> sampling Pi_1 statements. This gets what we want, but results in probability gaps at higher levels in the hierarchy; it's possible that a sampled theory will never prove or disprove some complicated statements. (This is similar to the intuitionistic probability idea, but doesn't actually satisfy the intuitionistic coherence requirements. I haven't worked this out, though, so take what I'm saying with a grain of salt.)<br /><br />We may even preserve some degree of probabilistic reflection this way, since the true Pi_1 still imply the true Sigma_2.<br /><br />That particular approach seems rather extreme; perhaps too limiting. The general idea, though, may be promising: we may be able to get the advantages of Pi_1 convergence without the disadvantages.<br /><br /><span style="font-size: x-small;">*(Source: Last paragraph of <a href="http://en.wikipedia.org/wiki/Curry%E2%80%93Howard_correspondence#Origin.2C_scope.2C_and_consequences">this section</a> on wikipedia.)</span>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-81381128703310116512013-02-11T18:45:00.004-08:002013-02-11T18:45:41.894-08:00Median Expectation & Robust UtilityI've been talking recently about robust statistics, and the consequences of replacing means with medians. However, I've only looked at this in a fairly limited way, asking about one particular distribution (the bell curve). Mean values are everywhere in statistics; perhaps to a greater degree than you realize, because we often refer to the mean value as the "expected value". It's a simple alias for the same thing, but that may be easy to forget when we are taking expectations everywhere.<br /><br />In some sense, the "expectation" seems to be a more basic concept than the "mean". We could think of the mean as simply one way of formalizing the intuitive notion of expected value. What happens if we choose a different formalization? What if we choose the median?<br /><br />The post on<a href="http://lo-tho.blogspot.com/2013/02/weird-curves.html"> altering the bell curve</a> is (more or less) an exploration of what happens to some of classical statistics if we do this. What happens to Bayesian theory?<br /><br />The foundations of Bayesian statistics are really not touched at all by this. A Bayesian does not rely as heavily on "statistics" in the way a frequentist statistician does. A <i>statistic</i> is a number derived from a dataset which gives some sort of partial summary. We can look at mean, variance, and higher moments; correlations; and so on. We distinguish between the sample statistic (the number derived from the data at hand) and the population statistic (the "true" statistic which we could compute if we had all the examples, ever, of the phenomenon we are looking at). We want to estimate the population statistics, so we talk about <i>estimators</i>; these are numbers derived from the data which are supposed to be similar to the true values. <i>Unbiased estimators</i> are an important concept: ways of estimating population statistics whose expected values are exactly the population statistics.<br /><br />These concepts are not exactly discarded by Bayesians, since they may be useful approximations. However, to a Bayesian, a distribution is a more central object. A statistic may be a misleading partial summary. The mean (/mode/median) is sort of meaningless when a distribution is <a href="http://en.wikipedia.org/wiki/Multimodal_distribution">multimodal</a>. <a href="http://en.wikipedia.org/wiki/Correlation_and_dependence">Correlation does not imply... much of anything</a> (because it <a href="http://blog.philbirnbaum.com/2012/12/linear-models-cause-problems-when.html">assumes a linear model</a>!). Bayesian statistics still has distribution parameters, which are directly related to population statistics, but frequentist "estimators" are not fundamental because they only provide point estimates. Fundamentally, it makes more sense to keep a distribution over the possibilities, assigning some probability to each option.<br /><br />However, there <i>is</i> one area of Bayesian thought where expected value makes a great deal of difference: Bayesian utility theory. The basic law of utility theory is that we choose actions so as to maximize expected value. Changing the definition of "expected" would change everything! The current idea is that in order to judge between different actions (or plans, policies, designs, et cetera) we look at the average utility achieved with each option, according to our probability distribution over the possible results. What if we computed the median utility rather than the average? Let's call this "robust utility theory".<br /><br />From the usual perspective, robust utility would perform worse: to the extent that we take different actions, we would get a lower average utility. This begs the question of whether we care about average utility or median utility, though. If we are happy to maximize median utility, then we can similarly say that the average-utility maximizers are performing poorly by <i>our</i> standards.<br /><br />At first, it might not be obvious that the median is well-defined for this purpose. The median value coming from a probability distribution is defined to be the median in the limit of infinite independent samples from that distribution, though. Each case will contribute instances in proportion to its probability. What we end up doing is lining up all the possible consequences of our choice in order of utility, with a "width" determined by the probability of each, and taking the utility value of whatever consequence ends up in the middle. So long as we are willing to break ties somehow (as is usually needed with the median), it is actually well-defined <i>more</i> often than the mean! We avoid problems with infinite expected value. (Suppose I charge you to play a game where I start with a $1 pot, and start flipping a coin. I triple the pot every time I get heads. Tails ends the game, and I give you the pot. Money is all you care about. How much should you be willing to pay to play?)<br /><br />Since the median is more robust than the mean, we also avoid problems dealing with small-probability but disproportionately high-utility events. The typical example is <a href="http://wiki.lesswrong.com/wiki/Pascal's_mugging">Pascal's Mugging</a>. Pascal walks up to you and says that if you don't give him your wallet, God will torture you forever in hell. Before you object, he says: "Wait, wait. I know what you are thinking. My story doesn't sound very plausible. But I've just invented probability theory, and let me tell you something! You have to evaluate the expected value of an action by considering the average payoff. You multiply the probability of each case by its utility. If I'm right, then you could have an infinitely large negative payoff by ignoring me. That means that no matter how small the probability of my story, so long as it is above zero, you should give me your wallet just in case!"<br /><br />A Robust Utility Theorist avoids this conclusion, because small-probability events have a correspondingly small effect on the end result, no matter how high a utility we assign.<br /><br />Now, a lot of nice results (such as the <a href="http://en.wikipedia.org/wiki/Von_Neumann%E2%80%93Morgenstern_utility_theorem">representation theorem</a>) have been derived for average utilities over the years. Naturally, taking a median utility might do all kinds of violence to these basic ideas in utility theory. I'm not sure how it would all play out. It's interesting to think about, though.Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com3tag:blogger.com,1999:blog-266928470727008321.post-83626952204709954602013-02-10T01:58:00.001-08:002013-02-10T01:58:31.773-08:00Philosopher's Carnival #148Hi all! This month, I have the honor of hosting the monthly blog review, <i>Philosopher's Carnival</i>. This reviews some of the best philosophy postings of the previous month.<br /><a href="http://www.philosophyetc.net/2013/01/two-metaphysical-pictures.html"><br class="Apple-interchange-newline" />Two Metaphysical Pictures</a>, by Richard Yetter Chappell of <a href="http://www.philosophyetc.net/">Philosophy et cetera</a>, outlines a broad classification of metaphysical theories into two different types. (For the record, I prefer the second type! However, as one comment rightly points out, we should avoid lumping views together to create false dichotomies in this way.)<br /><a href="http://alexanderpruss.blogspot.com/2013/01/special-relativity-and-a-theory.html"><br class="Apple-interchange-newline" />Special relativity and the A-theory</a>, at <a href="http://alexanderpruss.blogspot.com/">Alexander Pruss's Blog</a>, discusses the relationship between the philosophical view of time and what we know from physics. (This is related to the two views of the previous post, at least to the extent that you buy the false dichotomy.)<br /><br /><a href="http://exapologist.blogspot.com/2013/01/religious-experience-and-safety.html">A Question About Religious Experience and Safety Accounts of Knowledge</a>, by <a href="http://exapologist.blogspot.com/">ex-apologist</a>, discusses the relationship between the safety condition for knowledge and Christian epistemology.<br /><br /><a href="http://schwitzsplinters.blogspot.com/2013/01/metaphysical-skepticism-la-kriegel.html">Metaphysical Skepticism a la Kriegel</a>, by Eric Schwitzgebel of <a href="http://schwitzsplinters.blogspot.com/">The Splintered Mind</a>, reviews a <a href="http://uriahkriegel.com/downloads/challenge.pdf">paper</a> by <a href="http://uriahkriegel.com/?Uriah_Kriegel%27s_Page">Uriah Kriegel</a> which suggests that although there may be meaningful metaphysical questions about which there are true and false answers, we cannot arrive at knowledge of those answers by any means. In particular, there is no means by which we can come to know if sets of objects have a literal existence or are merely mental constructs.<br /><br />A pair of posts by Jeffrey Ketland of <a href="http://m-phi.blogspot.com/">M-Phi</a> discuss the Quine-Putnam Indispensability Argument: <a href="http://m-phi.blogspot.com/2013/01/the-quine-putnam-indispensability.html">The Quine-Putnam Indispensability Argument</a> and <a href="http://m-phi.blogspot.com/2013/01/other-formulations-of-quine-putnam.html">Other Formulations of the Quine-Putnam Indispensability Argument</a>. The indispensability argument also deals with the question of whether sets (and other mathematical constructions) have a literal existence. The idea is that our best scientific theories make use of sets of objects (and other mathematical constructs), so we must either accept their existence or reject our best science.<br /><br /><a href="http://prosblogion.ektopos.com/archives/2013/02/grim-reapers-vs.html">Grim Reapers vs. Uncaused Beginnings</a>, by Joshua Rasmussen of <a href="http://prosblogion.ektopos.com/">Prosblogion</a>, gives a discussion of some "grim reaper" arguments. The Grim Reaper argument is an argument which is supposed to show the implausibility of an infinite past. Joshua shows that a very similar argument would conclude that a finite past with an uncaused beginning is equally implausible.<br /><br /><a href="http://sprachlogik.blogspot.com/2013/02/a-modification-to-lewiss-theory-of.html">A Modification to Lewis's Theory of Counterfactuals</a>, <span style="background-color: white;">by <span class="il" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial;">Tristan</span> <span class="il" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial;">Haze</span> </span>of <a href="http://sprachlogik.blogspot.com/">Sprachlogik</a>, questions the role that "similarity" of possible worlds should play in our evaluation of counterfactuals.<br /><br /><a href="http://tomkow.typepad.com/tomkowcom/2013/02/computation.html">Computational Metaphysics</a>, by <a href="http://tomkow.typepad.com/">Tomkow</a>, provides a metaphysical companion to computational physics. The idea is illustrated by giving a computational-metaphysics account of counterfactuals, including Lewis's "similarity".<br /><a href="http://metaphysicalvalues.wordpress.com/2013/01/10/substitutions-and-models-part-1-bolzano-quine-tarski-and-boolos/"><br class="Apple-interchange-newline" />Substitution and Models, Part 1: Bolzano, Quine, Tarski and Boolos</a>, by Jason of <a href="http://metaphysicalvalues.wordpress.com/">Metaphysical Values</a>, reviews the debate between substitution-based understanding of quantifiers and model-theoretic accounts (in preparation for a series of posts about the issue).<br /><br />That's it for this month! Tune in for the next carnival at <a href="http://blog.kennypearce.net/">http://blog.kennypearce.net/</a>, March 10. If you spy any interesting philosophy articles, <a href="http://philosophycarnival.blogspot.ca/">submit them for the carnival</a>!Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-42954164844324653912013-02-04T15:12:00.000-08:002013-02-04T16:26:02.689-08:00Weird CurvesLast time, I mentioned a question:<br /><br />Taleb and others have mentioned that the bell curve (or Gaussian) does not deal with outliers well; it gives them a very small probability, and the parameter estimates end up being highly dependent on them.<br /><br />Yet, one of the justifications of the Gaussian is that it's the max-entropy curve for a given mean and standard deviation.<br /><br />Entropy is supposed to be a measure of the uncertainty associated with a distribution; so, shouldn't we expect that the max-entropy distribution would give as high a probability to outliers as possible?<br /><br />There are several answers.<br /><br />First: a basic problem is that phrase "a given mean and standard deviation". In particular, to choose a standard deviation is to choose an acceptable range for outliers (in some sense). If we have uncertainty about the standard deviation, it turns out the resulting curve <a href="http://en.wikipedia.org/wiki/Student%27s_t-distribution#Bayesian_inference">has a polynomial decay rather than an exponential one</a>! (This means distant outliers are far more probable.) Essentially, estimating deviations from data (using maximum-likelihood) makes us extremely overconfident that the data will fall in the range we've experienced before. A little Bayesian uncertainty (which still estimates from data, but admits a range of possibilities) turns out to be much less problematic in this respect.<br /><br />This is definitely helpful, but doesn't solve the riddle: it still feels strange, that the max-entropy distribution would have such a sharp (super-eponential!) decay rate. Why is that?<br /><br />My derivation will be somewhat heuristic, but I felt that I understood "why" much better by working it out this way than by following other proofs I found (which tend to start with the normal and show that no other distribution has greater entropy, rather than starting with desired features and deriving the normal). [Also, sorry for the poor math notation...]<br /><br />First, let's pretend we have a discrete distribution over n points, x<sub>n</sub>. The result will apply no matter how many points we have, which means it applies in the limit of a continuous distribution. Continuous entropy is not the limit of discrete entropy, so I won't actually be maximising discrete entropy here; I'll maximise the discrete version of the continuous entropy formula: <br /><br />f(x) maximising sum_i: f(x<sub>i</sub>)log(f(x<sub>i</sub>)) <br /><br />Next, we constrain the distribution to sum to a constant, have a constant mean, and have constant variance (which also makes the standard deviation constant): <br /><br />sum<sub>i</sub>[f(x<sub>i</sub>)] = C<sub>1</sub><br />sum<sub>i</sub>[x<sub>i</sub>f(x<sub>i</sub>)] = C<sub>2</sub><br />sum<sub>i</sub>[x<sub>i</sub><sup>2</sup>f(x<sub>i</sub>)] = C<sub>3</sub><br /><br />To solve the constrained optimisation problem, we make lagrange multipliers for the constraints: <br /><br />Lagrangian:<br />f(x<sub>i</sub>)log(f(x<sub>i</sub>))<br />- lambda<sub>1</sub>*(sum<sub>i</sub>[f(x<sub>i</sub>)]-C<sub>1</sub>)<br />- lambda<sub>2</sub>*(sum<sub>i</sub>[x<sub>i</sub>f(x<sub>i</sub>)]-C<sub>2</sub>)<br />- lambda<sub>3</sub>*(sum<sub>i</sub>[x<sub>i</sub><sup>2</sup>f(x<sub>i</sub>)]-C<sub>3</sub>)<br /><br />Partial derivatives in the f(x<sub>i</sub>):<br />1 - log(f(x<sub>i</sub>))<br />- lambda<sub>1</sub><br />- lambda<sub>2</sub>*x<sub>i</sub><br />- lambda<sub>3</sub>*x<sub>i</sub><sup>2</sup><br /><br />Setting this equal to zero and solving for f(x<sub>i</sub>):<br />f(x<sub>i</sub>) = 2<sup>1 - lambda<sub>1</sub> - lambda<sub>2</sub>*x<sub>i</sub> - lambda<sub>3</sub>*x<sub>i</sub><sup>2</sup></sup><br /><br />That's exactly the form of the Gaussian: a constant to the power of a 2nd-degree polynomial! <br /><br />So, we can see where everything comes from: the exponential comes from our definition of entropy, and the function within the exponent comes from the Lagrange multipliers. The Gaussian is quadratic precisely because we chose a quadratic loss function! We can get basically any form we want by choosing a different loss function. If we use the kurtosis rather than the variance, we will get a fourth degree polynomial rather than a second degree one. If we choose an exponential function, we can get a doubly exponential probability distribution. And so on. There should be some limitations, but more or less, we can get any probability distribution we want, and claim that it is justified as the maximum-entropy distribution (fixing some measure of spread). We can even get rid of the exponential by putting a logarithm around our loss function.<br /><br />Last time, I mentioned <a href="http://en.wikipedia.org/wiki/Robust_statistics">robust statistics</a>, which attempts to make statistical techniques less sensitive to outliers. Rather than using the mean, robust statistics recommends using the median: whereas a sufficiently large outlier can shift the mean by an arbitrary amount, a single outlier has the same limited effect on the median no matter how extreme its value.<br /><br />I also mentioned that it seems more intuitive to use the absolute deviation, rather than the squared deviation.<br /><br />If we fix the absolute deviation and ask for the maximum-entropy function, we get something like e<sup>-|x|</sup> as our distribution. This is an ugly little function, but the maximum-likelihood estimate of the center of the distribution is precisely the median! e<sup>-|x|</sup> justifies the strategy of robust statistics, reducing sensitivity to outliers by making extreme outliers more probable. <i>(The reason is: the max-likelihood estimate will be the point which minimizes the sum of the loss functions centred at each data point. The derivative at x is equal to the number of data points below x minus the number above x. Therefore the derivative is only zero when these two are equal. This is the minimum loss.)</i><br /><br />What matters, of course, is not what nice theoretical properties a distribution may have, but how well it matches the true situation. Still, I find it very interesting that we can construct a distribution which justifies taking the median rather than the mean... and I think it's important to show how arbitrary the status of the bell curve as the maximum entropy distribution is.<br /><br />Just to be clear: the Bayesian solution is not usually to think <i>too</i> much about what distributions might have the best properties. This is important, but when in doubt, we can simply take a mixture distribution over as large a class of hypotheses as we practically can. Bayesian updates give us nice convergence to the best option, while also avoiding overconfidence (for example, the asymptotic probability of outliers will be on the order of the most outlier-favouring distribution present).<br /><br />Still, a complex machine learning algorithm may still need a simple one as a sub-algorithm to perform simple tasks; a <a href="http://en.wikipedia.org/wiki/Genetic_programming">genetic programming</a> approximation of Bayes may need simpler statistical tools to make an <a href="http://en.wikipedia.org/wiki/Estimation_of_distribution_algorithm">estimation of distribution algorithm</a> work. More generally, when humans build models, they tend to compose known distributions such as Gaussians to make them work. In such cases, it's interesting to ask whether classical or robust statistics is more appropriate.Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-8385728774989421192013-01-29T18:22:00.003-08:002013-01-29T21:00:09.475-08:00Why Continuous Entropy MattersPreviously, I discussed <a href="http://lo-tho.blogspot.com/2013/01/the-wikipedia-article-for.html">definitions of entropy for continuous distributions</a>. Entropy, when used as a lower bound on description length, doesn't obviously apply to real-numbered random variables. Yet, there is a very common definition of entropy for continuous variables which almost everyone uses without question. This definition looks right at the syntactic level (we simply replace summation with integration), but doesn't make a lot of conceptual sense, and also loses many properties of discrete entropy. I took a look at <a href="http://en.wikipedia.org/wiki/Limiting_density_of_discrete_points">one alternative definition</a> that might make more sense.<br /><br />Why does it matter?<br /><br />The main issue here is the legitimacy of the maximum-entropy justification of specific continuous distributions, chief among them the bell curve; using slightly more general language, the Gaussian distribution (which takes the bell curve to the multivariate domain, allowing it to account for correlations rather than just means and variances).<br /><br />The Gaussian distribution is used in many cases as the default: the first distribution you'd try. In some cases this is out of statistical ignorance (the Gaussian is the distribution people remember), but in many cases this is because of a long tradition in statistics.<br /><br />One justification for this is the "physical" justification: many physical processes will give distributions which are approximately (though not exactly) Gaussian. The Gaussian is a good approximation of naturally occurring statistics when the number is the result of many individual positive and negative fluctuations: if we start with a base number such as 30, and there are many processes which randomly add or subtract something from our number, then the distribution of our end result will be approximately Gaussian. This could be thought of as a <a href="http://en.wikipedia.org/wiki/Frequentist_probability">frequentist</a> justification of the Gaussian.<br /><br />A more Bayesian justification of the Gaussian is via the <a href="http://en.wikipedia.org/wiki/Principle_of_maximum_entropy">maximum-entropy principle</a>. Entropy represents the uncertainty of a distribution. Bayesians need prior probability distributions in order to begin reasoning. It makes intuitive sense to choose a prior which has maximal uncertainty associated with it: we are representing our ignorance. It turns out that with a known mean and variance, the Gaussian is the maximum-entropy distribution... according to the usual definition of entropy.<br /><br />This seems quite mysterious to me. Intuitively, a maximum-entropy distribution should be as spread out as possible (within the confines of the fixed variance, which is a measure of spread). A Gaussian, however, decays super-exponentially! It falls off very quickly, on the order of e<sup>-x<sup>2</sup></sup>. This makes outliers very, very unlikely. Intuitively, it seems as if a polynomial decay rate would be much better, leaving a little bit of probability mass for far-flung values.<br /><br />This issue is not merely academic. Nassim Nicolas Taleb has <a href="http://edge.org/response-detail/23839">repeatedly shown</a> that the standard statistical models used in socioeconomic settings are disastrously wrong when it comes to outliers. A small number of outliers are improbable enough to completely invalidate the models from a scientific perspective. These outliers are not just annoying, but also highly consequential, often leading to the failure of investments and massive losses of capital. To quote him: "Nobody has managed to explain why it is not charlatanism, downright scientifically fraudulent to use these techniques." Taleb advocates the use of probability distributions with a wider spread, such as a power law. Basically, polynomial distributions rather than exponential distributions.<br /><br />Advocates of the maximum entropy principle should (arguably) be puzzled by this failure. The Gaussian distribution was supposed to have maximum spread, in the sense that's important (entropy). Yet, Taleb showed that the spread is far too conservative! What went wrong?<br /><br />I can see two possible problems with the Gaussian distribution: entropy, and variance.<br /><br />As I've already explored, the definition of entropy being used is highly questionable. It's not obvious that entropy should be applied to the continuous domain at all, and even if it is, there doesn't seem to be very much justification for the formula which us currently employed.<br /><br />There is another assumption behind the maximum-entropy justification of the Gaussian, however: fixed variance. A Gaussian is the maximum-entropy distribution <i>given a variance</i>. Entropy measures "spread", but variance also measures "spread" in a different sense. The interaction between these two formulas for spread gives rise to the Gaussian distribution.<br /><br />The formula for variance is based on the squared deviation. Deviation is the distance from the mean. There are <a href="http://en.wikipedia.org/wiki/Deviation_(statistics)">many different measures of the collective deviation</a>. Squared deviation is not <a href="http://en.wikipedia.org/wiki/Robust_statistics">robust</a>. Why not use the absolute deviation? This <a href="http://sabermetricresearch.blogspot.com/2012/08/the-standard-error-vs-mean-absolute.html">comes up now and again</a>, but as far as I've observed, rarely addressed inside statistics (rarely taught in classrooms, rarely mentioned in textbooks, et cetera). It's a fascinating topic, and Taleb's work suggests it's a critical one.<br /><br />So, it seems as if there is something important to learn by digging further into the foundations of our measures of spread, including both entropy and variance.<br /><br />Naturally, my personal interest here is for applications in AI. It seems necessary to have good built-in methods for dealing with continuous variables, and uncertainty about continuous variables. So, if the standard methods are flawed, that could be relevant for AI. Interestingly, many machine learning methods <i>do</i> prefer alternatives to squared error. SVMs are a primary example.Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com1tag:blogger.com,1999:blog-266928470727008321.post-13906718703022182282013-01-21T15:48:00.002-08:002013-01-21T15:48:16.349-08:00One More ProblemLast time, I listed some problems which my <a href="http://lo-tho.blogspot.com/2013/01/problems-i-didnt-solve.html">current foundation idea doesn't solve</a>.<br /><br />Here's one more, and I'd refer to it as <i>the</i> problem.<br /><br />First, some set-up.<br /><br />Eliezer has been <a href="http://lesswrong.com/r/lesswrong/lw/g7n/secondorder_logic_the_controversy/">writing some good stuff about 2nd-order logic</a>. He's touching on a problem which I've <a href="http://lo-tho.blogspot.com/2011/01/some-consequences-of-incompleteness.html">written about before</a>, namely: if any proof-system can be interpreted via theories in 1st-order logic, why is it that we can discuss ideas (such as number theory) which seem not to be fully characterized by it? We think we mean something more when we talk about higher-order concepts, but how can we, if everything could be reduced to first-order talk? Let's call this the problem of meaning.<br /><br />My original hope was (somehow) to solve the problem using a sufficiently powerful theory of truth. Tarski's undefinability theorem gives us a way to, given some logic, 'mechanically' construct one thing which the logic cannot express: the truth predicate in that logic. No logic can express its own semantics. Therefore (I must have been thinking), we can use 'truth' as a <i>standard</i> way of extending a logic, and hope that extending it in this way will automatically make it more powerful in every other way that is interesting to us. (There is fairly good support for this idea.) I hoped to solve the problem of meaning by solving the problem of truth.<br /><br />Along the way, I <a href="http://dragonlogic-ai.blogspot.com/2008/06/basic-hyperlogic-my-search-for-proper.html">learned about the arithmetic hierarchy</a> <a href="http://dragonlogic-ai.blogspot.com/2008/06/more-hyperlogic-in-previous-hyperlogic.html">and</a> <a href="http://dragonlogic-ai.blogspot.com/2008/08/direct-attack-my-approach-thus-far-to.html">other hierarchies</a>. These provide a characterisation of how difficult logical/mathematical problems are. Only the first couple of levels of the arithmetic hierarchy can be handled on computers. The rest are increasingly uncomputable, and the other hierarchies just get worse. (The arithmetic hierarchy deals just with problems in number theory, and the next-up, the analytic hierarchy, deals with real analysis.) Given that we cannot compute facts about the higher-up stuff, the question is, in what sense can we refer to it? I referred to this as <a href="http://dragonlogic-ai.blogspot.com/2008/07/new-grounding-criteria-in-this-post-i.html">a grounding problem</a>: in what sense does our logic refer to that stuff, if it can't access the truths of those hierarchy levels? This is a big aspect of the problem of meaning.<br /><br />A somewhat different question (and more relevant to this post) is: <b><i>How can we define rational belief about these things?</i></b><br /><br />Anyway, I eventually got <a href="http://dragonlogic-ai.blogspot.com/2008/08/direct-attack-my-approach-thus-far-to.html">a bit fed up</a> trying to climb levels, and attempted to solve the problem more directly by just coming up with a complete theory of truth. I <a href="http://lo-tho.blogspot.com/2011/05/impredicativity.html">eventually learned</a> that the sorts of approaches I was considering, which attempted to build up truth incrementally, would typically only get me <i>predicative</i> mathematics at best, which is not very much. Finally, recently, I specified my <a href="http://lo-tho.blogspot.com/2012/10/impredicative-truth.html">impredicative theory of truth</a>, realising that impredicativity in itself suggests a design for putting a classical truth predicate within its own language, getting around Tarski's theorem (in a sense). This gives something much like 2nd-order logic.<br /><br />So, having declared tentative success on Truth, let's return to the question of Meaning.<br /><br />Interestingly, 2nd-order logic seems to contain every meaningful mathematical statement we've come up with so far... it can say anything which we would want to invent higher-order logic for. is logically complex enough that it doesn't fit in any hierarchies of difficulty, "<a href="http://plato.stanford.edu/entries/logic-higher-order/#4">past, present, or future</a>". So, it makes a surprisingly good candidate for a logic of thought. (Higher-order logics may still be important, but are in some sense "merely convenient".)<br /><br />There are basically two reasonable semantics for my logic. <b>One</b>: only the syntactically constructible predicates exist. This makes sense if we want the theory of truth to just be talking about the truth of stuff we can say. <b>Two</b>: all mathematically possible predicates exist. This turns my logic into a syntactic variant of 2nd-order logic.<br /><br />Really, I'd like to find a way to keep the semantics usefully ambiguous. Many of the theorems for real numbers have <a href="http://lukepalmer.wordpress.com/2012/01/26/computably-uncountable/">analogous results for the computable real numbers</a>. We might hope for some similar match between results for constructible predicates and the mathematically possible predicates. This would tell us that, to some extent, it doesn't matter which we pick.<br /><br />My idea is that the predicate ambiguity represents <i>extensible meaning</i>. When we define the natural numbers, we say "<a href="http://lesswrong.com/lw/f4e/logical_pinpointing/">The least set containing zero and closed under succession</a>". The semantic question is, how do we interpret "set"? What sets exist? Answer #1 says that only sets we can define exist. Answer #2 says that we refer to a broader range of sets, which is intuitively maximal (contains all possible sets). What I'm saying is that we want to refer to "all sets we, or anyone else, might conceive of"; in other words, we don't just want to include stuff we know how to define, but also things we might learn about later, or anything anyone might define.<br /><br />Constructivists might object that this does not justify semantics #2 at all: even "all the predicates that might be defined" should be a countable set. I would counter that it is still uncountable <a href="http://lukepalmer.wordpress.com/2012/01/26/computably-uncountable/">in some sense</a>: we solve <a href="http://plato.stanford.edu/entries/paradox-skolem/">Skolem's paradox</a> by pointing out that "uncountable" is relative to a logical background. It simply means we can't put the predicates in one-to-one correspondence with the natural numbers <i>using what we currently know how to say</i>. Therefore constructivists should not be "fundamentally" unhappy with uncountability. Most constructivists I know will now argue that the powerset is "actually" countable, despite being apparently uncountable in this relative sense, but at this point they are actually relying on classical mathematics!<br /><br />Yet, they should point out that I'm being disingenuous. I'm stating that the semantics of my 2nd-order quantifiers should be, in some sense, ambiguous, to allow for the broadest possible interpretation. It's true that, for any particular logic, the predicates definable in that logic will be uncountable within the logic. However, I'm claiming something more: that the quantifiers are ranging over predicates in some imaginary maximal logic (which very clearly does not exist). In doing so, I'm effectively claiming "true uncountability" instead of relative uncountability: the maximal notion of predicates would be uncountable in any logic. A constructivist might be OK so long as I'm honestly being ambiguous, but with the intent of being maximal? There is, arguably, no way of making sense of that (without some specific background theory defining "maximal"...).<br /><br />Now would be a good time to point out that I'm only doing armchair philosophy here. There are many details of the discussion which would benefit greatly from a formal treatment. For example, I'm not explicitly defining what it means for my logic to claim that its own predicates are uncountable (and this is something which you should be suspicious about, since the theory of truth I offered doesn't directly allow us to ask such a question). I'd like to explore all this further, but it's just a hobby at this point, so I'm prioritising the fun part at the expense of hard work.<br /><br />So, putting aside the justification, let's just admit that I want to talk about classical mathematics and go for the full 2nd-order semantics. The central question again:<br /><br /><b><i>How can we define rational belief about these things?</i></b><br /><b><i><br /></i></b>"Rational belief" could be interpreted in several ways; for the moment, let's formalise rational degrees of belief as probabilities. If you're a Bayesian, then, the question is how to define a prior probability distribution over 2nd-order logic.<br /><br />I know how I want <a href="http://agi-conference.org/2012/wp-content/uploads/2012/12/paper_70.pdf">probabilities on 1st-order logic</a> to work. How do we define a prior over logical beliefs? It makes sense to suppose that statements in 1st-order logic are being drawn at random to construct a theory. We update on evidence by requiring this theory to be consistent with everything we know. When we want to know the probability of a statement being true, we ask the probability that it occurs in such a theory.<br /><br />It's possible to approximate this probability distribution arbitrarily well, because it is possible to detect inconsistency in 1st-order theories. The proof system corresponds exactly to the semantics, so all we need to do is look for proofs of inconsistencies and throw out inconsistent ones.<br /><br />This approach doesn't obviously generalise. We can define randomly generated consistent theories for other systems, but we can't approximate these distributions by looking for proofs. Suppose we're dealing with first-order arithmetic. We want to know the probability of the <a href="http://en.wikipedia.org/wiki/Goldbach%27s_conjecture">Goldbach Conjecture</a>. We consider randomly constructed theories, throwing out those which we can prove inconsistent. Unfortunately, there will be a finite probability that the statement "there exists an even number which is not the sum of two primes" gets generated. The probability of this statement will go down as we rule out theories which include this statement but imply falsifiable things; however, it may converge to a finite number greater than zero. The axioms of first-order arithmetic are incomplete, so there may be no proof of the conjecture, in which case randomly generated theories would have a positive probability of hypothesising a counterexample. This is referred to as an <a href="http://en.wikipedia.org/wiki/%CE%A9-consistent_theory">omega inconsistency</a>.<br /><br />Even for just the stuff in the arithmetic hierarchy, it is not possible to specify any convergent computation which will arrive at the truth in every case. (And again, 2nd-order logic goes well beyond the arithmetic hierarchy, and the analytic hierarchy as well.)<br /><br />Yet, we <i>can</i> do better than just the 1st-order probability. For example, we can converge to the truth of the Goldbach conjecture just by checking examples. We could put the probability of the conjecture being false at 1/N, where N is the number of positive examples we have found so far. If we find a counterexample, of course, then we permanently put the probability of it being false to 1. This is very rough, but would be guaranteed to eventually get arbitrarily close to the correct answer, so long as we eventually check any particular number.<br /><br />More realistically, we make a general rule: for an arbitrary universal statement about natural numbers, we would like to converge to probability 1 as we check examples and find no counterexample. It would make sense to converge slowly: we would expect the probability of finding a counterexample as we check more numbers to go down roughly like the probability of a Turing machine halting as we wait longer. (It's improbable that the first counterexample is extremely large, but not <i>very</i> improbable.)<br /><br />A central problem for me is: how can we make that sort of behaviour come naturally out of a general definition of rational belief over an incomplete logic?<br /><br />Just for the language of first-order arithmetic (and therefore the arithmetic hierarchy), it seems possible to define appropriate belief in terms of something like the case-checking I've described. This cannot arrive at the correct belief for everything, because nested quantifiers make it possible to quantify over undecidable properties (so that we cannot check cases); however, it seems reasonable to say that it's the best we can do. In cases where we cannot converge to the truth, we should simply ensure that we don't converge toward 1 or 0. (Note: I'm hiding a lot of work here. The case-checking is inadequate as stated; we would like to guarantee nice logical properties for the probability distribution, like if A->B, P(A)<=P(B). However, this seems "not so hard".)<br /><br /><br />However, impredicative 2nd-order quantification seems impossible to check "by cases" almost by definition: even assuming only definable predicates, we still have the circular truth conditions introduced by the impredicativity. This seems to make a solution unlikely.<br /><br />Maybe there is a better way of thinking about it, though. Depending on how we take my idea that the space of predicates is supposed to be usefully ambiguous, we might prefer the <a href="http://plato.stanford.edu/entries/logic-higher-order/#3">Henkin semantics</a>, which actually gets us back to the first-order case: we don't assume that <i>only</i> the definable sets exist, but those are the only ones we <i>know</i> to exist. The impredicative definitions are grounded in some (unknown) model, rather than deriving their truth values circularly from facts about themselves. This makes the semantics equivalent to first-order logic with extra axioms; we can handle it in the same way we handled probabilities over 1st-order beliefs.<br /><br />This is too weak, of course. We wanted to be able to do reasonable probabilistic inference about the Goldbach conjecture. However, it may be a good place to start. How can we get the case-checking behaviour for arithmetic, from something similar to a Henkin semantics?<br /><br />One approach would be to assert the existence of enumerated classes: classes generated by given constants and functions. For example, even before asserting any axioms about the natural numbers, we would have the class <i>NAT: 0, s(NAT)</i>. (Since we haven't stated anything, it could be that S(0)=0, et cetera. Only the existence of the class is guaranteed.) These are unnecessary in crisp logic, having crisp 2nd-order equivalents. However, they define special probabilistic behaviour: enumerated checking.<br /><br />These aren't needed in the syntax of the language; just the semantics. When we give the inductive definition for natural numbers (n such that: for all X, if X(0) and for all i, X(i)->X(s(i)), then X(n)), then NAT will simply be one instantiation of the 2nd-order quantification in the semantics. We will need to account for this in the probabilistic calculation: we can use randomly generated theories, as for first-order logic, but the probability of statements concerning NAT and other enumerated classes would follow special rules. Since the probability of the Goldbach conjecture stated with respect to NAT would become high, and it being true of NAT implies that it is true of the natural numbers as normally defined (because NAT contains zero and successors), the probability with respect to the natural numbers would have to be high as well. (We would use case-checking to find probabilities of omega-inconsistency, and then use that to reduce the significance of possibilities which appear omega-inconsistent.)<br /><br />It's not exactly pretty, but at least it gets the job done. Making a special kind of class for specific probabilistic behaviour feels a bit awkward, raising the question of whether there are any <i>other</i> kinds of classes we're forgetting. It would be better if the result emerged more naturally from concerns of what probabilistic behaviours can be obtained; perhaps out of a universal distribution or something...Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-47622472364849907332013-01-17T00:23:00.000-08:002013-01-17T00:23:59.142-08:00Defining Real EntropyThe wikipedia article for <a href="http://en.wikipedia.org/wiki/Entropy_%28information_theory%29">entropy(information theory)</a> indicates the usual formula for entropy of a probability density function, and then calmly proceeds with a proof that it is not the correct way to generalise entropy to the continuous domain. (Yay for wikipedia!) <a href="http://en.wikipedia.org/wiki/Limiting_density_of_discrete_points">This</a> is the definition of entropy advocated by Jaynes, which is linked as an alternative. It's called Limiting Density of Discrete Points. (Not a catchy name.)<br /><br />My intuition: <br /><br />Entropy is about expected code length (which is expected information gain). One way to try and define entropy for distributions over real numbers might be to try and deal with them as a discrete object. To do this, we choose a specific encoding E of the reals as a stream of bits, and examine the efficiency with which we can store points randomly drawn from our probability distribution P. Unfortunately, no matter how clever our encoding, a single real number almost always carries infinite information. So, to get a finite number, we subtract the information in each bit from the worst-case information for that bit (a coin flip). In other words, since we can't give the (infinite) the amount of information directly, we look at how much <i>less</i> information there is than in the case of random noise.<br /><br />"Random noise" means coin flips in our discrete representation E for continuous numbers... but that means that E induces a probability distribution of its own. The measure of entropy in P relative to the encoding E ends up looking a lot like the KL-divergence between P and E, viewing E as a probability distribution.<br /><br />One way of understanding this is to say that the continuous version of entropy doesn't work out quite right, but the continuous version of KL-divergence (which is closely related to entropy) works quite well; so the best we can do in the continuous domain is to measure KL-divergence from some reference distribution (in this case, E) as a substitute for entropy.<br /><br />Normally, entropy tells us what we can expect to achieve with the <i>best</i> encoding. (It places a lower bound on code length: intuitively we can't make the code shorter than the actual amount of information, except by luck.) The reason this doesn't work for real numbers is that they require an infinite amount of information in any case. Yet, it is still possible to generate optimal codes (for example, iteratively divide the probability density into upper and lower halves), and ask how "spread out" they are. The problem is just that we need to define some sort of "frame" in order to get a numerical value (defining the extent to which we care about differing numbers), and it seems like most options end up looking similar to the approach Jaynes came up with (or, being a special case of that approach).<br /><br />Is this really the best way to translate entropy to the continuous domain?Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-39370665322979734452013-01-10T01:52:00.000-08:002013-01-10T01:52:50.958-08:00Problems I Didn't SolveThere are several foundational problems which my <a href="http://lo-tho.blogspot.com/2012/10/impredicative-truth.html">current foundational theory </a>doesn't solve.<br /><br />First, it does not determine a specific set theory. In my <a href="http://lo-tho.blogspot.com/2012/11/new-truth-intuition.html">previous post</a> on the topic, I mentioned: <br /><blockquote class="tr_bq">I'm actually <i>disappointed</i> that this view seems to endorse a more standard well-founded view of set theory, as opposed to what I view as more natural non-well founded theories (in which we have a set of all sets, which is itself a set). Oh well.</blockquote>Really, though, this isn't quite the case. I was imagining a process in which we strengthen the language iteratively, adding increasingly powerful truth predicates, as in the last paragraph of <a href="http://dragonlogic-ai.blogspot.com/2009/08/climbing-mountain-undefinability-is.html">this old essay</a>. However, we're actually free to postulate any set theory we like.<br /><br />Basically, in this view, set theory is a way of transposing 2nd-order entities into the 1st-order level to increase the power of the system. We can't just inject "all of them", because there will always be more 2nd-order entities than 1st-order entities, so if we force all 2nd-order entities to be 1st-order entities, we can prove a contradiction. There are many different ways to do a partial job.<br /><br />My 2nd-order entities are <i>properties</i>, but they are related to the notion of a 'class'. A class is much like a property except that (as with sets) we regard two classes as equal if they have exactly the same members. Set theories developed in 2nd-order logic use 2nd-order entities as classes, and define sets from these. (The difference between a set and a class is that a class is a 2nd-order entity, and therefore can have members, but cannot be a member in anything. A set is a 1st-order entity, and can therefore be a member, as well.)<br /><br />It would be possible to develop a finitely axiomatized version of ZFC, like NBG set theory. We could also do many other things, though. For example, we could make a set theory like NFU, which has a universal set. This simply means that we reject the 'iterative' idea of converting 2nd-order entities into 1st-order ones. In the 'iterative' conception, we imagine transporting entities to the 1st-order level in stages. Since increasing the number of 1st-order entities can increase the membership of some of our 2nd-order entities, we think of it as changing the meaning of those 2nd-order properties; so, even though we transpose all the existing 2nd-order entities to the 1st-order level at any particular stage, there are yet more entities to consider at the next stage.<br /><br />NFU rejects the intuition that each stage is conceptually distinct. So, for example, the universal set is the result of transposing the trivial property which is always true. This holds all entities at every stage; so rather than constructing a series of larger and larger distinct sets, each holding all the entities present at some particular stage (but no other entities), we allow it to exist as a single set holding everything.<br /><br />The stages themselves are still important to NFU: we still need a restriction on which properties may form sets, since we can't make the claim that all properties correspond to sets. How NFU implements this restriction is that properties only become sets if the sets mentioned in the definition can be consistently assigned to stages in a way that ensures membership tests are well-defined. So, for example, Russel's paradoxical set (the set of all sets which don't contain themselves) will never exist, because at no stage can we already test membership if a set to itself. (The variable x in "x such that x โ x" can't be assigned a number to ensure that all membership tests a โ b have #a โ #b.)<br /><br />Equivalently, the expression which gives us a predicate should be well-typed in simple type theory. (For a completely formal account, <a href="http://math.boisestate.edu/%7Eholmes/holmes/head.ps">read the book</a>.)<br /><br />The point is, my foundational logic doesn't solve, or even greatly constrain, the foundations of set theory.<br /><br />(One thing I intentionally left ambiguous was the semantics of the logic. How many properties are there? Are we quantifying over just property <i>definitions</i>, giving a more constructivist foundation, or are we quantifying over a classical space of properties, which is a superset of those we can define? I was hoping to address this in an eventual post.) <br /><br />There are also other problems which I have not solved in this way. I've claimed that this logic gives us, in an important sense, a self-referential truth predicate. It gives us a language which contains a truth predicate that is correct for all sentences in that language, and also (at a different level) satisfies the Diagonal Lemma, despite Tarski's theorem. However, this doesn't exactly solve all paradoxes of the truth predicate. Kripke noted that a paradox can sometimes depend on accidental self-reference created by the world, not the logic:<br /><br />Suppose that a spiteful Professor X writes on the chalkboard, "Whatever Professor Y is writing on his chalkboard is false." Meanwhile, the benevolent Professor Y is writing "Whatever Professor X is writing on his chalkboard is true." How should we analyse the situation?<br /><br />In order to reason about statements written down in the external world, we must have a theory associating 1st and 2nd order entities, just as in the set-theory problem. Complex external objects must "indicate true propositions"!<br /><br />So, the analysis is not completely determined by the theory of truth. Let's be conservative, though, and suppose we've done just one round of introspection: starting with my impredicative theory of truth, which I'll call L, we generate L', the theory of truth-in-L. This contains a first-order entity for every statement in L, and a truth predicate for these. Let's call this a meta-theory of truth.<br /><br />We quickly see that we cannot unpack the statements made on the chalkboards to correspond to any particular statements in L. Any (very reasonable) theory which we try to make which relates marks on chalkboards to propositions will fail to relate these particular marks to anything. Before we determine what proposition Y is asserting, we would have to determine what proposition X is asserting, and vice versa.<br /><br />This is basically because my theory of truth does not solve any problems associated with self-referential truth in the 2nd sense: the truth of self-referential statements. It only provides self-referential truth in the first sense: the ability to discuss truth-in-L in L (so long as we keep the discussion 2nd-order).<br /><br />Different meta-theories of truth will give different results. Some meta-theories of truth may give us accounts of how statements which are self-referential can get associated with propositions, just as some set theories would allow for sets which contained themselves (such as the universal set).<br /><br />A related deficiency of the theory is that we cannot form propositions via arbitrary computations. This would be a nice guarantee of the expressiveness of the language. At one point, I thought of this as a major goal: provide a language satisfying the diagonal lemma and containing its own truth predicate, so that the truth of arbitrary computations on atomic propositions could be asserted. Of course, it is not possible to include all and only the meaningful computations; however, it should be able to ensure that if we can prove a computation always halts, then we can use it; this would allow more computational power to be conveniently added by adding axioms. (<a href="http://lukepalmer.wordpress.com/tag/ixi/">IXI</a> seems to be like this.) Then we can say that the limitation is not about what the system can't express, but only about what the system doesn't know (not having enough axioms).<br /><br />It isn't terribly clear which of these problems needs to be solved.Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-79425353298247493862012-12-03T16:22:00.002-08:002012-12-03T16:22:56.905-08:00My Unhealthy Obsession with Message-PassingThis post is a follow-up to <a href="http://lo-tho.blogspot.com/2012/08/beliefs.html">Beliefs</a>, in which I outlined some of what the machine learning community knows about how beliefs should interact with one another.<br /><br />One assumption which I didn't spend too much time on in that post was the assumption that a "belief" should be managed by local message-passing as much as possible. This is an assumption which has been strong in the probabilistic graphical models community from the beginning, and which I have taken as a given. In order to get efficient algorithms, it seems like a good idea to do everything through local processes. Also, message-passing simply fits with my intuitions about how humans reason. Even markov networks (as opposed to Bayesian networks) feel insufficiently local, because the probabilistic semantics is distributed (you need to know all the factors before you can determine probabilities). For me, the more local, the better: the popular Monte Carlo methods strike me as "not local enough" (because although we keep making changes locally, we do this to a global complete state, and make a running tally over time... seems unfeasible for extremely large networks representing the quantity of knowledge a human has). I'll question this assumption now.<br /><br />In the previous post, I looked at some of the local message-passing algorithms I'm interested in. There is more interesting material which I've now understood and would like to summarize, but for the moment I'll cover just one more thing to serve my point.<br /><br />I mentioned that we can make a distinction between techniques such as Mean Field which seek to minimize KL(Q||P), and those such as Belief Propagation which use KL(P||Q); both of these measure divergence between the true distribution and our approximation, but KL(P||Q) intuitively makes more sense (minimizing error due to approximation), whereas KL(Q||P) has some computational advantages.<br /><br />It turns out that both belief propagation and mean field <a href="http://arxiv.org/pdf/1112.0467.pdf">can be fit into the framework</a> of minimizing KL(Q||P). Even though this isn't my error function of choice, it's good to be able to explain both approximations within one framework; it allows us to definitively explain how to pass messages between the two, in the case that we want to use both in one network.<br /><br />The basic idea is that Mean Field (MF) can be justified in a straightforward way by minimizing KL(P||Q) when we assume Q contains a separate factor for every variable. Belief Prop (BP), on the other hand, can be justified by minimization of the Bethe energy. The Bethe energy can be derived from KL(P||Q) when Q is of a peculiar form containing a factor for every relationship between variables, but also a <i>negative</i> factor for every variable. The negative factors divide out the effect of the overabundant positive factors. The negative factors are constrained to be the marginals of the positive factors. This gets a little confusing (I'll plan a full post for later), but provides greater coupling between variables, which we expect to give us a better approximation. (It does in most cases.) Unfortunately, it ruins the convergence guarantee which MF had.<br /><br />The <a href="http://arxiv.org/pdf/1112.0467.pdf">framework</a> I mentioned allows us to deal with a Q function which mixes these two forms. This is good, because MF and BP are good for dealing with different kinds of things. MF often has a simple closed form in cases where BP does not, or does but is too expensive. MF is guaranteed to converge, while BP is not. However, BP typically gives better results when it does converge. It gives exact results for a tree-shaped belief network, while MF does not. It handles 0s in distributions, while MF cannot.<br /><br />Because BP is not guaranteed to converge, it makes some sense to argue that it is not suitable to AGI-scale systems of beliefs. There will almost certainly be some portion of the network which will not converge, even if non-convergent behavior is rare overall; and this problem may be likely to spread. Therefore, it could make sense to have small, "safe" BP networks acting as pieces in a larger MF network, to make global convergence easier. If a portion of the network was causing trouble, more of it might be converted to MF.<br /><br />If I'm not mistaken, this same idea could also accommodate Kikuchi free energy, which gives "generalized belief propagation" (GBP). This goes further than BP, including even more interactions in the approximation. Hence we have a flexible approximation framework, which can handle approximations based on a wide variety of forms for Q. The structure of Q can be adapted to the situation, addressing our needs as well as any local convergence problems.<br /><br />This might sound like a rather nice situation. But... is it?<br /><br />We're putting an awful lot of energy into the approximation, here. Why are the belief networks so difficult to control? The structure-learning system has to find these networks in the first place. If we can't even do inference for a hypothesis, how can we choose it? Shouldn't the system choose hypotheses for which inference is manageable?<br /><br />This is exactly the approach taken by <a href="http://turing.cs.washington.edu/papers/uai11-poon.pdf">Sum-Product Networks</a>. Graphical models normally give us exponential-time inference in the worst case, but sum-product networks are linear-time instead. We could get a similar advantage by using standard types of networks, like BNs or MNs, and requiring that the network take the form of a tree (for which exact inference is inexpensive). We get guaranteed linear-time inference, no approximations, but we have not severely restricted the kinds of probability distributions which can be expressed: we can add more hidden variables to express more complicated distributions. (If we need to add exponentially many nodes to do what we want, though, then obviously we're getting back into trouble.) SPNs do the same thing, but are an interesting alternative to tree networks, in that they allow us to adjust the probability formula at a lower level; we are literally composing an expression of sums and products to express the distribution (an "arithmetic circuit"). This is more annoying for a human to look at, but at the same time, gives us more freedom (allows us to compactly represent a larger number of functions) without compromising the linear-time inference.<br /><br />SPNs are very similar to artificial neural networks, particularly feedforward networks, and give us linear-time inference for the same reason: we're directly learning a mathematical expression to give the desired result, not trying to represent some inference problem to be solved. One way of explaining it is: for any approximation method, we end up performing some sequence of arithmetic operations to arrive at an answer. SPNs jump straight to learning a good sequence of arithmetic operations, <i>as</i> our model, rather than learning a potentially difficult model and then trying to find a nice approximation.<br /><br />I said I would deal with my assumption of <i>local</i> approximation. Going back to the P vs Q idea, a "belief" is a local function; a factor of Q. The question is how to coordinate a number of local functions, in order to get a good approximation of a probability function P. SPNs don't need any of this. Local beliefs give us a very messy belief approximation process. SPNs give us a global model with exact inference.<br /><br />Why do I call an SPN "global"? It might be local enough to satisfy some. The linear-time inference algorithm iterates over the network via the links. That's local, isn't it? Well, sort of. You can't learn SPNs locally and then compose them together, though. The parts of a sum-product network do not have inherent meaning outside of the network. In a Bayesian network, a conditional probability is a conditional probability; we could evaluate its accuracy apart from everything else, if we could access data for those variables.<br /><br />SPNs aren't a full AGI system in themselves, but they give me an image of an AGI system which learns very fast belief-formation functions, which simply get evaluated in order to give probabilities at a particular moment. This image is much more <i>efficient</i> than my previous image of a large network of approximate beliefs which are doing local message-passing to minimize KL divergence between Q and the desired P. Local approximation by message-passing is a faster approximation than Monte Carlo methods, but if we can create networks which we can precisely evaluate very quickly, why worry about approximation at all?<br /><br />On the other hand, I think we really want some <a href="http://www.ai.sri.com/~braz/fopi-software.html">FOPI</a>-style behavior (First Order Probabilistic Inference, one approach to integrating logic and probability). The structure of beliefs should include first-order logic as a special case, which means we cannot avoid including problems which are potentially intractable. (IE, real-life inference problems <i>are</i> sometimes intractable.)<br /><br />Indeed, it is difficult to see how an SPN-like framework can give us persistent beliefs with logical structure. An SPN is composed of 'sum' nodes and 'product' nodes. We can think of the 'sum' nodes as representing 'or', and the 'product' nodes as representing 'and'. So, we can think of 'product' nodes as representing an object: a combination of attributes. 'Sum' nodes represent alternative ways of achieving attributes. However, this is still a very low-level representation! There is no way to represent multiple instances of one object, no way to represent classes of objects with a type/subtype relation... we can't reason about familial relations, for example, trying to decide whether two people are cousins given some other information about their families.<br /><br /><i>Side note: one possible objection is that belief networks are a representation for the low-level dynamics of an artificial mind, and do not need to represent the high-level stuff like logic. This should emerge as a learned skill of a large network which has access to mental workspaces in which it can implement more complex, longer-term mental processes. This approach may be workable. However, my intuition is that there is something to gain from explicitly representing logically structured knowledge.</i><br /><br />Back to the first hand, we can argue that FOPI itself requires a global model in order to start. This is an interesting and important point: local inference rules such as belief propagation and FOPI require a fixed network structure to begin with. (More technically, we need to know what the Markov blankets of our variables are. This is similar to learning a background model of causality: we can't just operate on correlations between variables. Local probability tables are not enough. We also need to know the structure, which tells us which probability tables are relevant to a given situation.)<br /><br />It would be nice if we could <i>also</i> reason about <i>structure</i> locally, learning which variables are related in an incremental, bottom-up manner. Perhaps this is possible. However, it seems especially difficult with FOPI-like inference: logical models do not behave in a very local way. It seems very "human" to have a large set of potentially-inconsistent logical statements which we are constantly trying to integrate in an approximate way; this again sounds like a Q vs P iterative reasoning scheme. But, again, won't this be terribly messy and inefficient? Won't we end up approximating things by some simple sequence of operations? Wouldn't it be nicer to just learn a good sequence of operations, in the first place, which just gives us the approximate answer?<br /><br />I don't intend to answer these questions here. I think there's an interesting tension between these "local" messy message-passing approximation algorithms, based on semantically meaningful local beliefs, and "global" methods which can potentially give us fast, reliable answers based on a fixed calculation.Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-30496882881164786832012-11-04T23:43:00.002-08:002012-11-05T15:30:07.441-08:00New Truth: IntuitionGranted that there are more technical details to work out concerning my <a href="http://lo-tho.blogspot.com/2012/10/impredicative-truth.html">impredicative theory of truth</a>, but I don't have too much time to work on them. Instead, this post will be a "lighter", but in some sense more important, topic: what is the intuition behind this theory? In practical terms, what does it tell us about truth?<br /><br />Honestly, the theory is motivated more by "this is what we can do" than by "this is what Truth should taste like". However, the result was surprisingly nice, so I think it's worth doing a post-hoc analysis. <br /><br />I came up with the mechanism based on the realisation that Peano Arithmetic and its relatives are self-referential <i>not</i> because they talk about numbers, but because they talk about arbitrary <i>computations</i> on numbers. <a href="http://en.wikipedia.org/wiki/Robinson_arithmetic#Metamathematics">Robinson Arithmetic</a> was formulated by examining what parts of Peano Arithmetic are actually used to get self-reference; it is a significantly weaker theory which still supports the <a href="http://en.wikipedia.org/wiki/Diagonal_lemma">diagonal lemma</a> and <a href="http://en.wikipedia.org/wiki/G%C3%B6del%27s_first_incompleteness_theorem#First_incompleteness_theorem">all</a> <a href="http://en.wikipedia.org/wiki/G%C3%B6del%27s_first_incompleteness_theorem#Second_incompleteness_theorem">it</a> <a href="http://en.wikipedia.org/wiki/Tarski%27s_undefinability_theorem">entails</a>. The fact is, <a href="http://en.wikipedia.org/wiki/Kleene%27s_recursion_theorem">all total computable functions have fixed points</a>. This means that for any (halting) computable function <b><i>F</i></b> which takes as input one program and as output generates another, there will always be a program <i>P</i> which is identical in behaviour to <b><i>F</i></b><i>(P)</i>. This result is almost silly! However, it tells us something about the structure of computation. It is what allows us to construct self-referential sentences in Peano Arithmetic (and Robinson Arithmetic).<br /><br />So, in order to get a truth predicate in the same language ("self-referential in the good way"), all we need to do is deny the ability to talk about arbitrary computations on sentences while discussing their truth (so that it isn't "self-referential in the bad way").<br /><br />The basic idea is to avoid introducing new, undesired sentences which turn out to be pathological. Talking about the truth of arbitrarily computed sentences seems perfectly plausible until we see that we're getting more than we bargained for. The strict language-metalanguage barrier introduced by Tarski ensures that we only talk about the truth of sentences which are in our original language, avoiding the problem. My 'impredicative' approach weakens this barrier; we can talk about the truth of sentences which make use of the truth predicate, which means we still get <i>some</i> "unexpected" sentences where truth is discussing itself rather than discussing the base-level domain. We're just restricted to making use of relatively simple manipulations on these sentences, so that we don't have the machinery to make fixed-point constructions.<br /><br />My formalism allows predicates to be <i>recursively </i>defined, using constructions similar to the definition of the natural numbers in 2nd-order logic. However, it does <i>not</i> allow <i>self-referential</i> predicate definitions. It is interesting that this distinction can be made. We can employ recursion at the object level, but not at the concept level.<br /><br />Recursion at the object level looks like this (where <b><i>0</i></b> is a logical constant and <i><b>s</b></i>() is a function symbol): <br /><blockquote class="tr_bq">Predicate N(x): </blockquote><blockquote class="tr_bq">All P: (P(0) and [All y: P(y)->P(s(y))]) -> P(x).</blockquote> Here, we are defining a predicate for "natural number" based on a constant for "zero" and a function for "+1". Notice that although this is a recursive definition, we do <i>not</i> need to use the predicate itself in its definition. This is very important; if we were allowed to refer to a predicate in its own definition, then we would run into trouble very quickly:<br /><blockquote class="tr_bq">Predicate P(x): not P(x).</blockquote>More generally, a self-referential definition may not be consistent or complete, whereas a recursive definition will be (although it may not be decidable).<br /><br />So, it doesn't make sense to form concepts using self-referential definitions (at least not <i>just any</i> self-referential definition). And since forming predicates from arbitrary computations is enough to give us that ability, it's got to go, too.<br /><br />That's all well and good, but this doesn't seem like anything really new for theories of truth: we know that complete expressive power results in inconsistency, so various theories of truth give us different restrictions on expressive power which (hopefully) result in a consistent theory. So what?<br /><br />The interesting thing here is that, because of the closeness to 2nd-order logic, we get a nice property: we can represent 3rd-order logic, 4th-order logic, .. Nth-order logic where N is an ordinal definable within 2nd-order logic. The basic story is that <a href="http://plato.stanford.edu/entries/logic-higher-order/#4">the powerset relation is definable at the object level</a>, which means that we can add axioms which force special 1st-order entities to behave as classes. Despite the separation in the language, concepts can be examined as objects after all!<br /><br />If we do this, we can even discuss the (1st-order) truth of arbitrarily computed sentences, and more or less anything else which we would have wanted to do.<br /><br />In Kripke's theory of truth, we had the "revenge problem": truth can be completely discussed within the language, but a new semantic value, the "indeterminate" or "paradoxical" state, cannot be discussed. This preserved Tarski's theorem, in a generalised sense: the system could discuss its own truth predicate, but not its own semantics.<br /><br />If we wanted to add a predicate for this 3rd value, in such a way as to allow the assertion of true statements about it, we would be taking on a rather significant modification of the logic. Certainly the semantics of the predicate could not be described with a few sentences in the existing language.<br /><br />The same theme extends to other theories: typically, the semantics of the theory are a mathematical object too complex to be defined in the theory of truth at hand. That's what concerned me when trying to use theories of truth as foundations for mathematics; in general, it seemed as if the exposition of the theory itself provided an example of a mathematical discussion which could not be captured in the theory.<br /><br />For the theory of impredicative truth, on the other hand, we find that although the theory cannot represent its own semantics with no modification, the modification needed is rather light. Adding axioms defining object-level classes seems rather easy. Granted, the extended system technically has a different semantics; in asserting the new axioms to be true, we are changing the meaning of some logical constants (assigning a relation to the role of an object-level membership relation, for example). It may look at first glance like a system describing its own semantics, but really, the extended system is describing the old system. (If we attempted to make objects corresponding to all the classes in the new system, we would be making axioms equivalent to naive set theory.)<br /><br />So, to talk about the semantics of the system, we still make a meta-language. The close relationship between the language and the metalanguage is very encouraging, though. Specifically, this seems to indicate that a system capable of learning new axioms will be able to incrementally learn its own semantics.<br /><br /><br />What this gives us is a hierarchy of languages which are increasingly expressive. Is it any better than the Tarski hierarchy? To some extent, I think this solution is very similar in spirit; it is a very "conservative" solution, relying on a stratified language and maintaining classical logic. The nice relationship between language and metalanguage is arguably only a little nicer than Tarski's. However, the impredicative stratification is more powerful; I feel it is closer to the foundations of mathematics. What we get out of this is a recommendation to view a finite axiomization in 2nd-order logic as, to some extent, more basic than a formulation of a theory in 1st-order logic via axiom schema. 2nd-order arithmetic is a more fundamental foundation than 1st-order, and <a href="http://en.wikipedia.org/wiki/Von_Neumann%E2%80%93Bernays%E2%80%93G%C3%B6del_set_theory">NBG</a> set theory is preferred over <a href="http://en.wikipedia.org/wiki/ZFC">ZFC</a>. Proper classes are seen as a natural thing to have around, rather than an unnatural partial level.<br /><br />I'm actually <i>disappointed</i> that this view seems to endorse a more standard well-founded view of set theory, as opposed to what I view as more natural non-well founded theories (in which we have a set of all sets, which is itself a set). Oh well.<br /><br />What I need next is a natural way to specify a prior distribution over statements in this logic, so that we can reason probabilistically over it. <br /><br />In any case, that's enough for now. I think I might have given a more results-oriented account, despite hoping to give intuition prior to results. Perhaps this theory is better justified in terms of what we can do with it, rather than what we wanted truth to be like beforehand.<br /><br /><i>Cautionary note: I have not yet spelled out what the semantics of this theory might be; in particular, I have not spelled out how similar or different it might be from the standard semantics of 2nd-order logic. This could bring into doubt some of the parallels with 2nd-order logic used in this post.</i>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-16476611703927698302012-10-02T15:47:00.000-07:002012-11-09T12:26:14.963-08:00Impredicative TruthIn the <a href="http://lo-tho.blogspot.com/2012/09/old-demons.html">previous post</a>, I started to discuss an approach to avoiding the paradoxes of truth by not giving into the temptation of creating a logic which is "sufficiently powerful" (a phrase often used to describe axiom systems strong enough to support the <a href="http://en.wikipedia.org/wiki/Diagonal_lemma">diagonal lemma</a>). As we will see, we don't have to give up quite that much. But I'm getting ahead of myself.<br /><br class="Apple-interchange-newline" />We begin with the observation that 2nd-order logic is ridiculously powerful. It turns out that we can embed higher-order logic within 2nd-order logic:<br /><br /><blockquote class="tr_bq"><span style="background-color: white; line-height: 22px;">Thus the set of validities of seventeenth-order logic is computably reducible to </span><em style="background-color: white; line-height: 22px;">V</em><span style="background-color: white; line-height: 22px;">ยฒ(=), the set of second-order validities in the language of equality. (In fact, these two sets are computably isomorphic.) So in this aspect, the complexity of higher-order logic does not increase with the order. (<a href="http://plato.stanford.edu/entries/logic-higher-order/#4">Stanford Encyclopedia</a>)</span></blockquote><br />If we could agree on a semantics for set theory, we could probably embed that into 2nd-order logic too, following an approach similar to <a href="http://en.wikipedia.org/wiki/Von_Neumann%E2%80%93Bernays%E2%80%93G%C3%B6del_set_theory">NBG set theory</a>. (For example, we could formalize the iterative conception of set as construed by Boolos; though, we could <i>not</i> formalize the iterative conception as originally described by Goedel... that subject would be worthy of its own post.)<br /><br />In short, if 2nd-order logic is <a href="http://lo-tho.blogspot.com/2011/11/why-relational-methods.html">referentially incomplete</a>, it's not obvious how. Whereas with a theory of truth, you have the feeling of an immediate gap (something which you used to describe the system but don't seem to be able to discuss within the system), 2nd-order logic seems capable of discussing anything we can think of. <i>Sure,</i> we <i>know</i> that undefinability must still apply: 2nd-order logic can't define its own truth predicate. Bear with me, though. It would still be nice to have a theory of truth with expressive power similar to 2nd-order logic. Most theories of truth can't compete, because 2nd-order logic requires <a href="http://lo-tho.blogspot.com/2011/05/impredicativity.html">impredicative</a> comprehension.<br /><br />So, how might we go about that?<br /><br />First, we need to talk about "satisfaction" rather than truth. We can render the application of a predicate to a first-order entity <i>P(x)</i> as <i>Sat("P(_)",x)</i>. Like the truth predicate, the satisfaction relation takes sentences in quoted form; that is, it takes Goedel numbers, or (probably) some more convenient representation of the syntax of the sentence. However, it takes the 1st-order variables directly, rather than quoted.<br /><br />This allows us to keep 2nd-order variables (which range over syntactic representations) separate from 1st-order variables (which range over actual entities). This is a type distinction: quoted representations are <i>not</i> first-order entities, and are not allowed to occur in the right-hand side of the satisfaction relation. Separate quantifiers range over the two types of variables.<br /><br />Notice that there needs to be a way of detailing a "blank", "_", which the first-order entity gets substituted in. We will also want to work with multiple blanks, to represent relations, but we can introduce lists as 1st-order entities to avoid the need... there are details to work out here.<br /><br />We give the language enough tools to build up complex expressions to discuss the satisfaction of; similar to <a href="http://lukepalmer.wordpress.com/2006/09/20/finite-axiomatization-of-peano-arithmetic/">this excellent post</a>. So, for example, we provide a negation function which returns the compliment of a 2nd-order entity, a conjunction function which returns the intersection, disjunction which returns the union, and so on for all the elements of our syntax. (It does not matter too much whether we provide these as functions or as existence axioms.) Critically, this includes a quotation function which takes the 2nd-order entity corresponding to an expression in the language and returns a 2nd-order entity corresponding to the 2nd-order entity corresponding to the expression. In other words, the function takes quoted entities and returns doubly quoted entities. This allows us to talk about quotation within the quoted portion of the language. That's critical to the impredicative self-reference.<br /><br />In any case, <i>True("S")</i> is simply a special case of satisfaction. One plausible definition would be that <i style="font-weight: bold;">S</i> is true iff <i>All x: Sat("S",x)</i>. (We may want to be a bit stricter, requiring that <b>S</b> has no free variables; whether we are able to require this will depend on some details of the logic.)<br /><br />Wait. What happened? Didn't I say just a few paragraphs ago that 2nd-order logic can't define its own truth predicate? And now we're looking at a truth predicate as a special case of the 2nd-order satisfaction predicate?<br /><br />The trick is that undefinability applies to a truth predicate at the level of first-order entities, because 2nd-order logic is strong enough to satisfy the diagonal lemma with respect to predicates on first-order entities.<sup>[citation needed]</sup> However, at the 2nd-order level, the logic is nowhere near that strong. So, we can put a self-referential truth predicate at that level without the problem.<br /><br />In what sense is this a "self-referential" truth predicate? In the sense that it is a truth predicate which exists within the language it predicates on. We can assert or deny the truth of any sentence, including those which use the truth predicate.<br /><br />It is <i>not</i> a self-referential truth predicate in the sense provided by the diagonal lemma: we can't construct self-referential sentences like the Liar sentence. This might make the theory sound a little boring, but it seems practical.<br /><br />Besides: we get to have our cake and eat it too. With this approach, we are defining a sensible truth predicate by giving up the diagonal lemma at the 2nd-order level; yet, when talking about first-order entities, we still get the benefit of a strong language. We don't have to give up logical power after all!<br /><br />Still, it is tempting to reach for a little more. As described, the 2nd level of the language is very weak. However, all I need is that it doesn't satisfy the diagonal lemma. Might it be possible for the 2nd level to be a <a href="http://en.wikipedia.org/wiki/Self-verifying_theories">self-verifying logic</a>? Because... that would be awesome.<br /><br /><i>This post may or may not conclude the short series on this theory of truth. I have definitely not done it justice. I would like to formalize it properly, read up on Fregean studies to see how similar this is, read more background material about the diagonal lemma and 2nd-order logic to make sure I'm not just crazy here, read up on self-verifying logics, and so on. However, I don't really have the time to dive into this. So, there may not be any more posts on the topic for a while.</i><br /><i><br /></i><i>Update: next in the sequence is <a href="http://lo-tho.blogspot.com/2012/11/new-truth-intuition.html">here</a>.</i>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-22332558581696669512012-09-24T12:51:00.001-07:002012-10-02T15:50:20.032-07:00Old DemonsI have recently been <a href="http://lo-tho.blogspot.com/2012/08/truth-and-ai.html">revisiting my search for a theory of truth</a>, which I had previously abandoned as unproductive. It seemed like <a href="http://dragonlogic-ai.blogspot.com/2009/08/climbing-mountain-undefinability-is.html">undefinability was simply a fact to be accepted</a>; if I found myself unsatisfied with existing theories of truth, that was my problem, not theirs.<br /><br />To some extent, this has become like an "old demon" for me. It's easy for me to spend too much time obsessing about this. So, despite trying to re-visit the problem, I found myself thinking:<i> truth simply cannot be defined, but I need to find a way to come to terms with that</i>. To settle the issue for myself, I needed an acceptable story.<br /><br />I began reviewing the arguments which prove undefinability.<br /><br />The fundamental problem seems to be related to the fact that a<a href="http://rjlipton.wordpress.com/2011/10/21/what-if-cantors-proof-is-wrong/"> set cannot contain its powerset</a>. More generally, a set <i><b>S</b></i> cannot index all functions <i>f: <b>S</b> -> {0,1}</i>; that is, there is no bijection between <i style="font-weight: bold;">S</i> and the set of those functions. I say "more generally" because this statement holds whether we take the "classical" stance, that there are uncountably many such functions, or take a more "constructivist" stance, considering only the computable functions. Either way, <a href="http://lukepalmer.wordpress.com/2012/01/26/computably-uncountable/">the result holds</a>. There <i>is</i> a loophole, though, which will become important soon.<br /><br />What does the existence of a bijection have to do with truth? Well, the Liar paradox and other similar problems arise from attempting to simultaneously <i>refer to all your semantic values</i> while maintaining an <i>expressively powerful logic</i> which allows you to take arbitrary functions of things. (By "semantic values" I mean just "true" and "false" to start with, though this set is often extended in an attempt to repair paradoxes.) Expressive power looks reasonable. It seems obvious that we should be able to express that the opposite of some statement is true. Less obviously, if we can take general computable functions of things, we end up satisfying a <a href="http://en.wikipedia.org/wiki/Kleene's_recursion_theorem">fixpoint theorem</a>, which (as stated in the <a href="http://en.wikipedia.org/wiki/Diagonal_lemma">diagonal lemma</a>) makes it possible to construct self-referential sentences. OK. This might sound a little weird, but it's just a consequence of that expressive power we wanted. Should be fine, still? Unfortunately, with these two results, we can construct a sentence which is its own negation. What? Where did that come from?<br /><br />So, the problem seems to arise from trying to be able to assert arbitrary functions of semantic states. It seems reasonable at first: computations on semantic states are actually useful, and the more the better. This allows us to flexibly state things. So why is there a problem?<br /><br />The problem is that we've unwittingly included more statements than we intended to. Expanding the language with a naive truth predicate extends it too far, adding some stuff that's really nonsense. It isn't as if there was a Liar sentence in the original language which we're now able to compute the negation of. No: the negation computation, allowed to apply itself to "anything", has been applied to itself to create a new, pathological entity (somewhat analogous to a non-halting computation).<br /><br />(<i>Still trying to convince myself not to worry about these things, I said:</i>) We should not have expected that to work in the first place! It's just like trying to construct a set which is its own powerset. If you show me an attempt to do it, I can <a href="http://en.wikipedia.org/wiki/Cantor%27s_diagonal_argument">mechanically find</a> a subset you left out.<br /><br />Or, it's like trying to solve the halting problem. There are these "loopy" entities which never terminate, but by the very nature of that loopyness, it is impossible to construct a function which sorts the loopy from the useful: if we had such a function, we could use it to construct a new, "meta-loopy" computation which went loopy if and only if it didn't.<br /><br />One way of attempting to get around this problem is to add a new semantic value, so that instead of {true, false} we have {true, false, meaningless} or something to that effect. Unfortunately, this does not fundamentally solve the problem: we need to make a decision about whether we can refer to this new semantic value. Informally, it seems like we can; this causes the same problem again, though, since if we can refer to it, we can apply the full power of the logic again (creating a new arbitrary-functions-on-semantic-states problem). On the other hand, we could take a hint, and remain silent about this "new value": so, we are only able to talk about truth and falsehood. This is, essentially, Kripke's theory of truth.<br /><br />It seems as if we are forced to accept an inherently partial theory of truth, with some sentences in a purposefully ambiguous state- they are neither true nor false, but we are not able to refer to that fact definitively.<br /><br />This has to do with the loophole I mentioned earlier. It <i>is not</i> possible to map a set <i style="font-weight: bold;">S</i> onto the functions <i>f: <b>S</b> -> {true, false}</i>. Even less is it possible to map them onto <i>f: <b>S</b> -> {true, false, _}</i>, with some third value. Yet, we can describe any computer program we want with a finite sequence of 0s and 1s... and the computer programs can be arbitrary computable mappings from such strings to {0,1}! We're indexing the functions with the entities being mapped. We seem to have arbitrary functions on our semantic states. Isn't this exactly what we said we can't do? Why does this work with computer programs, but not with truth? Because we accept the halting problem. We accept that our representation language for computer programs will include some nonsense programs which we don't actually want; meaningless programs are an unfortunate, but necessary, side-effect of the representational power of the language. Think of it as having 2.5 semantic values. A program can output 1, 0, or... it can fail to output. That isn't a 3rd output value; it's nothing.<br /><br />That is why Kripke's theory is so appealing: it takes this loophole and applies it to the domain of truth.<br /><br />However, it is not the solution I am putting forth today.<br /><br />Today, my assertion is that trying to fit the set inside the powerset never made sense in the first place. Sure, Kripke's loophole is <i>nice</i>, but it still leaves us with this uncomfortable feeling that we can refer to more things than the logic we've constructed, because we've got this 3rd semantic state which we're supposed to shut up about. (Or, as Tim Maudlin suggests, we can refer to it so long as we don't make the mistake of thinking we're saying something true??)<br /><br />We would like a "self-referential theory of truth", but what does that mean? It seems to me that all we want is a truth predicate which exists happily inside the language which it discusses. This does not require the kind of self-reference provided by the diagonal lemma. It seems like that just gets us nonsense. Perhaps, by discarding the sort of logical power which leads to the diagonal lemma, we can get around the problem. Incredible results have been achieved this way in the past: Dan Willard's <a href="http://en.wikipedia.org/wiki/Self-verifying_theories">self-verifying logic</a> gets around the 2nd incompleteness theorem in this way. Can we do something similar for undefinability?<br /><br /><i>I've decided to split this up into multiple posts. The <a href="http://lo-tho.blogspot.com/2012/10/impredicative-truth.html">next</a> post will go into the proposed solution.</i>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-80274958147493542712012-08-29T14:25:00.001-07:002012-08-29T14:25:38.553-07:00Beliefs<br />I tend to think that the AI problem is split into two basic problems:<br /><br /><ol><li>What is the space of propositions which we have beliefs about?</li><li>What are the rules by which beliefs interact?</li></ol><div>Bayesianism says that once we answer (1), we simply define a prior (and possibly a utility function) over the space, and answer (2) by a Bayesian update on any evidence we observe, and then a marginalization to find the prediction (or maximize to find the highest-utility action, if we specified a utility function).</div><div><br /></div><div>This is impractical for large problems, which brings in an interesting question without even <a href="http://lo-tho.blogspot.com/2012/08/bayesianism.html">giving up Bayesianism</a>: what approximations work well in practice? I personally prefer methods which deal with beliefs in a local way. <a href="http://en.wikipedia.org/wiki/Factor_graph">Factor graphs</a> are my favorite formalism for thinking about this.<br /><br />The basic algorithm for factor graphs is the <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.1570">summary-product algorithm</a>, which is a generalisation of belief propagation and several other important algorithms. Generalising several algorithms like this is fairly exciting, and gives summary-product an important status among the possible mechanisms for belief interaction. If I had to state the idea of summary-product in one sentence, it would be:<br /><br /><i>If you want to perform a global operation on a multivariate function which can be represented as a product of several factors, you can often treat the distributive law as approximately true in order to transform it to a local message-passing algorithm which often runs in linear time.</i><br /><br />Of course, this is hopelessly vague; you have to look at the math. In any case, this is is typically the "default" approximation for factor-graph problems. Unfortunately, the procedure doesn't come with many theoretical guarantees: we know that it works well in practice, but it does not always converge to the right value, or converge at all. Not very much is known about when or why.<br /><br />In cases where this does not work well, we can turn to <a href="http://www.maths.bris.ac.uk/%7Emaomz/readinglist.html">variational methods</a>. <br /><br />The core variational technique, often referred to as "variational Bayes" (but more specifically know as "mean-field"), seeks to minimise the <a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL-divergence</a> in the form <span style="font-family: "Courier New",Courier,monospace;">KL(Q||P)</span>, where <span style="font-family: "Courier New",Courier,monospace;">P</span><span style="font-family: inherit;"> is the exact distribution and <span style="font-family: "Courier New",Courier,monospace;">Q</span> is our approximation. </span>Minimising KL-divergence typically involves taking expectations, and in this case it means taking the expected log-probability. (I won't try to go through the derivation here, but it is quite interesting.) This has a surprising effect: by finding the expected log-probability rather than the expected probability, we replace multiplication with addition in some of the mathematics. This is the kind of crazy simplification that you would not expect to work. Yet, it does work, quite well: this algorithm is guaranteed to converge to the global minimum of the KL divergence! (At least, if I've understood correctly.) The mean-field approach can be understood as a generalisation of the popular <a href="http://en.wikipedia.org/wiki/Em_algorithm">EM algorithm</a>.<br /><br />A second variational method is known as expectation-propagation. Here, we minimise <span style="font-family: "Courier New",Courier,monospace;">KL(P||Q)</span> rather than <span style="font-family: "Courier New",Courier,monospace;">KL(Q||P)</span>. This actually makes more sense from a mathematical perspective: <span style="font-family: "Courier New",Courier,monospace;">KL(A||B)</span> measures the amount of unnecessary prediction error from using probability distribution <span style="font-family: "Courier New",Courier,monospace;">B</span>, if observations are really being drawn from distribution <span style="font-family: "Courier New",Courier,monospace;">A</span>. Expectation-propagation seems to have it right, minimising the amount of prediction error due to approximation. It is unclear why we might want to take the mean-field approach instead, except that it may be computationally convenient.<br /><br />Again, minimising KL-divergence often leads us to take expectations. In this case, rather than finding ourselves in the strange land of log-probabilities, we end up taking expectations over probabilities in a more normal way. I won't try to give all the details, but it turns out that in specific cases, this approximation ends up being exactly the same as belief propagation! This happy coincidence again gives credibility to the generality of the summary-product algorithm. At the same time, it gives us a different and useful generalisation of belief propagation.<br /><br />This also gives us a nice theoretical tool for analysing the accuracy and convergence of belief propagation. Like summary-product, expectation maximisation can fail to converge. However, expectation-maximisation comes with a fairly nice fixed-point analysis which helps us to understand this process (although it doesn't give all the answers).<br /><br />There is a wider variety of variational methods, but I don't know much about them-- yet. :)</div><div><br /></div><div>Other, less Bayesian techniques present themselves. In particular, while variational methods <i>can</i> be used to learn the factors themselves, a simpler approximation is to use gradient descent. As in neural network back-propagation, we take the derivative of an objective function with respect to the parameters in order to get a direction in which we should move our beliefs. This does not directly represent any uncertainty about the parameters.</div><div><br /></div><div>The main problem with this approach is that the derivatives will tend to get very small if we have deep networks with large numbers of hidden variables. Overcoming this problem is the topic of research in <a href="http://www.scholarpedia.org/article/Deep_belief_networks">deep belief networks</a>. (Interestingly, this formalism bridges the gap between neural network research and Bayes-style graphical model research. It is not the only approach to do so. I have some hope that this distinction will slowly disappear, although there will always be room for some differentiation...) Studying what sort of local belief interactions work best here will be an important part of the puzzle, to find good general rules for the interaction of beliefs.<br /><br />Another source of inspiration is the <a href="http://en.wikipedia.org/wiki/PAQ">PAQ compressor</a> by Matt Mahoney. It is currently the best compressor there is. Information theory tells us that there is a close relationship between probability and compression; so, if a specific sort of computation tends to be the best for compression, it is likely to be the best for approximate probabilistic reasoning, as well. (In fact, PAQ is founded on this principle.) PAQ is by now a fairly complicated mechanism with many special optimisations which I don't claim to understand, but I do understand some of the principles behind it:<br /><ol><li>How should we combine several conditional probability distributions which give different beliefs for some proposition? This is known as the "combination problem" in various circles (it shows up in <a href="http://en.wikipedia.org/wiki/Probabilistic_logic_network">PLN</a>, <a href="https://sites.google.com/site/narswang/">NARS</a>, and <a href="http://people.csail.mit.edu/kersting/papers/srl05chapter.pdf">BLP</a>). It's like asking how to make use of a mangled Bayesian network which has been put together from several conflicting models. PAQ says to combine the different predictions logarithmically (a bit like mean-field), with different weights on each, trained via gradient-descent to find the combination of weights which minimises prediction error.</li><li>In addition, we can use <a href="http://mattmahoney.net/dc/dce.html#Section_413">indirect context models</a> to learn when specific models are appropriate. </li><li>Favor recent history when counting probabilities. This is an idea which also appears in reinforcement learning; the term which justifies the practice is "non-stationary models" (the idea that the probability distribution may change over time). I should mention that I think reinforcement also has other lessons about belief structure and useful propagation rules.</li></ol>Many explanations are in Matt's <a href="http://mattmahoney.net/dc/dce.html">book in progress</a> (which I have not yet read).<br /><br />There is much more to say, but I suppose this post has gotten long enough. Thoughts?</div>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com3tag:blogger.com,1999:blog-266928470727008321.post-78317034562912832952012-08-28T17:49:00.000-07:002012-08-28T17:57:40.048-07:00Bayesianism?Normally, I'm a hard-core Bayesian. This means that I believe all uncertainty is essentially probabilistic uncertainty, and also, goal-oriented behavior is essentially utility-maximization. These are beliefs which make predictions: I predict that AI techniques which attempt to approximate the rules of probability theory and utility maximization will tend to work the best.<br /><br />In my <a href="http://lo-tho.blogspot.com/2012/08/truth-and-ai.html">previous post</a>, I gave a reason to doubt this: Bayesian systems rely on a pre-specified model space, and this model space will be inherently incomplete, for several different reasons. This does not have profoundly bad consequences for probability theory (a Bayesian system will still do its best to make reasonable predictions, in some sense); however, it may have worse consequences for utility theory (it isn't clear to me that the system will do its best to achieve its given goal, in any strong sense).<br /><br />This, too, is a testable belief. I've been discussing some experiments with <a href="http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/index.php?n=Main.HomePage">Lukasz Stafiniak</a> which will help here (but we have set no definite deadline to get around to this, as we both have other things to do). (I should mention that his motivation for being interested in these experiments is not the same as mine.) It could also be addressed on purely theoretical grounds, if it could be proven that (specific sorts of?) Bayesian systems can or cannot learn specific behaviors (behaviors which other sorts of systems <i>are</i> capable of learning).<br /><br />What is the competitor to Bayesianism? Model-free learning attempts to directly learn the policy for the environment based on feedback, without trying to make predictions about the world or directly represent world knowledge.<br /><br />In this view, trial-and-error learning becomes more fundamental than Bayesian learning. This makes some philosophical sense. After all, if we reason in an approximately Bayesian way, it is because we have evolved to do so through a process of mutation and natural selection.<br /><br />The model-free approach has been <a href="http://umichrl.pbworks.com/w/page/7597592/RL%20is%20model-free%20(or%20direct)">popular in the past</a>, and there is still research being done in the area, but model-based methods have the technique of choice for complex problems. To take a somewhat Bayesian line of argument, this is natural, because refusing to state your assumptions doesn't actually exempt you from having assumptions, and explicitly modeling the world allows for data to be used in a more efficient way: we separately optimize the world model based on the data, and then optimize the policy based on the world model.Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com2tag:blogger.com,1999:blog-266928470727008321.post-53137423166523288482012-08-20T17:52:00.000-07:002012-08-20T17:52:41.334-07:00Truth and AII've written extensively in the past about the relationship between foundations of mathematics and AGI; both on this and my previous blog, and in numerous posts to the AGI mailing list. I claimed that numerous problems in the foundations of mathematics needed to be solved before we could create true AGI.<br /><br />The basic argument was this:<br /><br />An AGI should, with enough processing power and training, be able to learn any concept which humans can learn. However, in order to learn a concept, it needs to be able to represent that concept in the first place. So, if we find that an AGI system's internal representation can't handle some concept, even in principle, then we should extend it.<br /><br />I called this the "expressive completeness" requirement. Logicians have some bad news for such a requirement: Tarski showed that any sufficiently powerful logic system is incapable of expressing its own semantics. This means there is at least one concept that can't be expressed, in any knowledge representation: the meaning of that representation.<br /><br />This is related to Goedel's second incompleteness theorem, which says that we can never prove our own logic sound; any logic which can say if itself "all the results I derive are true" must be wrong about that!<br /><br />Intuitively, this seems to indicate that whatever logic humans use, we won't be able to figure it out. A logic system can only understand weaker logic systems. This would suggest that we are doomed to forever ponder weak theories of mind, which are unable to account for all of human reasoning.<br /><br />As a result, my hope for some time was to "get around" these theorems, to solve the expressive completeness problem. (This is not quite as hopeless as it sounds: the specific statements of the theorems do contain loopholes. The problem is to decide which assumptions are not needed.)<br /><br />However, two years ago, I decided that the problem didn't <i>really</i> need to be solved. Here is the message I sent to the AGI list:<br /><blockquote class="tr_bq"><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">In the past I've done some grumbling about "expressive completeness" of an AGI knowledge representation, specifically related to Tarski's undefinability theorem, which shows that there is always something missing no matter how expressively powerful one's language is. Perhaps you remember, or perhaps you weren't around for it, but basically, I argued that for any given AGI system the Tarski proof could show where it had a representational hole: a concept that it could not even express in its representation.</span><br /><br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" /><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">Today I retract the worry and give a </span><span class="il" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><span class="Apple-style-span" style="background-color: white;">broad</span></span><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">, somewhat tentative "thunbs up" to opencog, NARS, Genifer, LIDA, and any similar systems (at least insofar as logical completeness is concerned).</span><br /><br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" /><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">I still think the theory of logical completeness is important, and can bear important fruits, but at the moment it looks like its main result is just to say what many of you knew all along-- a "full steam ahead" on existing systems. I recognize that it's a hard sell to claim that we should do all that work to get the already-obvious answer.</span><br /><br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" /><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">Beyond that point, AGI researchers won't care all that much and I'm more doing some (albeit strange) philosophical logic.</span><br /><br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" /><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">The sketch of the result goes like this.</span><br /><br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" /><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">Jumps up the Tarski hierarchy of languages are fairly easy to justify, due to the various benefits that more logical power offers. These include speed of reasoning and more concise notation of certain concepts. Most AGI systems will be able to see these benefits and, if not explicitly endorsing the move *in their original logic*, will move toward stronger logics implicitly at their procedural level.</span><br /><br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" /><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">Worst-case, the system could replace itself "from the outside" by taking action in the external environment to modify itself or create a successor...</span><br /><br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" /><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">(Ideally, of course, it would be nice to have a "stable" system which explicitly accepted improvements in its initial logic.)</span><br /><br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" /><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">In conclusion, the best style of AGI I can recommend to acheive completeness is what I think of as "Ben's zen AGI approach" (apologies to both Zen and Ben Goertzel for any misrepresentation): give up your attachement to individual algorithmic approaches, and just take the best algorithm for the job. Put these good algorithms together in a blackboard-like environment where each can do its job well where it applies, and after a while, </span><span class="il" style="background-attachment: initial; background-clip: initial; background-image: initial; background-origin: initial; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><span class="Apple-style-span" style="background-color: white;">general</span></span><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"> intelligence will emerge from their co-authorship.</span></blockquote><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">Recently, I have been thinking about these issues again. I think it is time for a bit more on this topic.</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">As I mentioned, we have some very strong limitative results in logic. To give a version of these results for AGI, we can talk about learning. Wei Dai gives a form of this result, calling it the <a href="http://lesswrong.com/lw/cw1/open_problems_related_to_solomonoff_induction/">unformalizability of induction</a>.</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">A Bayesian learning system has a space of possible models of the world, each with a specific weight, the prior probability. The system can converge to the correct model given enough evidence: as observations come in, the weights of different theories get adjusted, so that the theory which is predicting observations best gets the highest scores. These scores don't rise too fast, though, because there will always be very complex models that predict the data perfectly; simpler models have higher prior weight, and we want to find models with a good balance of simplicity and predictive accuracy to have the best chance of correctly predicting the future.</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">Unfortunately, the space of models is necessarily incomplete. There exists a model which is intuitively fairly simple, but which necessarily does not exist in our Bayesian learner: the possibility that an "evil demon" (to borrow Decart's idea) is fooling the agent. The demon is performing the computations of the Bayesian update, to find out the probability distribution over the next observations, and then choosing for the agent to see that observation which is least probable according to the agent's probability distribution.</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">It is impossible that the agent converges to this model; therefore it must not exist in the space of models being considered.</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">Shane Legg used this idea to show that there is no "<a href="http://www.idsia.ch/idsiareport/IDSIA-12-06.pdf">elegant universal theory of induction</a>".</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br /></div>Of course, the practical concern is not really over evil demons; that's a rather unconcerning hypothesis. The demon is just a way of showing that the model space is always incomplete. In reality, more practical theories escape the model space.<br /><ul><li>AI based on Hidden Markov Models cannot learn hierarchical patterns such as nested parentheses, and will never learn to count.</li><li>AI based on context-free grammars cannot learn context-dependent patterns, and will never learn the general rule of subject-verb agreement.</li><li>AI based on bounded computational resources cannot learn the true model of the world if it requires more computing power than is available (but it generally does!).</li></ul><div>We know that Bayesian reasoning is the best option when the true model is within the space of models. But what happens when it is not?</div><div><br /></div><div>Luckily, we can still say that Bayesian updates will cause beliefs to converge to the models which have the least KL-divergence with reality. For example, in the evil demon case, beliefs will go towards maximum entropy; the agent will treat reality as random. Given the structure of the situation, this is the best strategy to minimize incorrect predictions.</div><div><br /></div><div>However, there is still more to worry about. In my email 2 years ago, I claimed that an AI system could see the benefits of more inclusive model spaces, and modify themselves to include what they might be missing. It would be good if something like this could be made more formal and proven.</div><div><br /></div><div>For example, is it possible for an agent based on hidden markov models to learn to count if we allow it a 'cognitive workspace' which it is expected to learn to use? We can imagine an RL agent which works by doing HMM learning and then HMM planning to maximize reward. We augment it by giving it a memory containing a stack of symbols, and actions to push symbols to memory and pop symbols out of memory. The symbol popped out of memory is presented as an observation to the system. Can the agent learn to use memory to its benefit? Can it now learn pushdown automata, rather than only learning models with finite state-space?</div><div><br /></div><div>I think the answer is, no. Because the agent does explicit HMM planning, it will not perform an action unless the HMM predicts a benefit in the future. This makes it very difficult for the agent to learn when to push symbols, because it would need to be able to look ahead to the time of their being popped; but the whole point is that the HMM cannot do that.</div><div><br /></div><div>This suggests that my conclusion in the old email was wrong: reasonable systems may reject obvious-seeming self-improvements, due to lack of imagination.</div><div><br /></div><div>Taking this as an example, it seems like it may be a good idea to look for learning methods which answer "yes" to these kinds of questions.</div>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com0tag:blogger.com,1999:blog-266928470727008321.post-55360954391394198782012-05-05T18:11:00.000-07:002012-05-07T17:46:13.274-07:00Thoughts on Genetic Programming<br /><div style="font-family: arial; font-size: small;">I ran into this nice genetic programming demo applet:</div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;"><a href="http://helios.hampshire.edu/lspector/psh/demo/regression/PshApplet/">http://helios.hampshire.edu/lspector/psh/demo/regression/PshApplet/</a> </div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;">I took down the number of generations to solution 10 times:</div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;">17</div><div style="font-family: arial; font-size: small;">110</div><div style="font-family: arial; font-size: small;">125</div><div style="font-family: arial; font-size: small;">2</div><div style="font-family: arial; font-size: small;">82</div><div style="font-family: arial; font-size: small;">22</div><div style="font-family: arial; font-size: small;">50</div><div style="font-family: arial; font-size: small;">61</div><div style="font-family: arial; font-size: small;">52</div><div style="font-family: arial; font-size: small;">19</div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;">Average: 54</div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;">As you can see, the distribution is fairly wild. Outside of this experiment, I saw one or two runs that took up to 1,000 runs, as well. (1,000 is the cut-off for this particular simulation. If the program gets to 500, it seems like it generally goes all the way to 1,000.)</div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;">The special thing about this kind of distribution of numbers, as Linas Vepstas discovered while <a href="http://blog.opencog.org/2012/02/07/tuning-moses/">tuning MOSES for Opencog</a>, is that it means there is a failure to maintain diversity properly; in fact, such an extreme failure that we would be better off throwing away our work and starting over every few generations (because the expected run-time is dominated by overly long runs). In this case, we could re-start the algorithm every 2 generations, with a 1/10 chance of success... giving us an expected run-time of 20!</div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;">Of course, we don't know ahead of time what cut-off will be optimal, but that can be addressed to some extent with clever tricks such as starting with a small cutoff and increasing it over time. The point, as Linas discusses in his article, is that we should never run into a situation like this. The bias created by the search algorithm is doing more bad then good. What a good restart strategy does for us is effectively shift the algorithm towards a brute-force search which simply looks at possibilities randomly until it finds a solution. (If we used a cut-off of 1, we'd be doing exactly that.) The algorithm isn't learning about the search space over time; allowing it to "remember" very much is only causing it to get stuck in local maxima within the landscape.</div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;">Intuitively, it makes sense to suppose that the hill-climbing ability of the system is only good at exploring very "local" terrain in the fitness landscape. Using rapid re-starts means we are trusting the genetic algorithm to climb relatively small hills, but prefer a more brute-force strategy for the global search. A brute-force approach (trying random possibilities until we get a solution) would have a <b><i>1/n</i></b> chance at each attempt, for a search space of size <b><i>n</i></b>; this gives an expected run-time of <b><i>n</i></b>.* If genetic-algorithm hill-climbing can efficiently search a small area of size <i style="font-weight: bold;">a</i> in the amount of time it would take to brute-check an area of size <i style="font-weight: bold;">b</i>, then we can improve our search time by a factor of <i style="font-weight: bold;">a/b</i> using a genetic algorithm that gets restarted at intervals of time <i style="font-weight: bold;">a</i>.</div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;">Because the chance of success, be it <i style="font-weight: bold;">1/n</i> or <i style="font-weight: bold;">b/an</i>, is a constant for every time-step, the time-to-solution for this kind of algorithm follows an exponential distribution. (The chances that we're still searching at a particular time <i style="font-weight: bold;">t </i>decay exponentially.) What Linas is saying in his post is that if our waiting-time distribution is any <i>worse </i>than exponential, then something is wrong with our algorithm, because we can improve it (to an exponential) via re-starts.</div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;">Rather than resorting to re-starts, Linas added extra parameters to the algorithm which encouraged more exploration, tuning these parameters for the best apparent results. He got better than <i style="font-weight: bold;">a/b </i>improvements this way; he reduced the <i style="font-weight: bold;">a/b</i> ratio itself (the efficiency with which MOSES searched small areas) by a factor of about 3/10. However, the resulting algorithm could <i>still </i>benefit from restarts; although the improved waiting-time fits an exponential for most of its duration, it gets worse toward the end (meaning the longest runs still have some problem).**</div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;">How can we do better than that? How can we avoid the detrimental effects of a genetic algorithm with a longer memory? (How can we keep the bias from gathered knowledge from making us behave <i>worse than a system with no knowledge at all</i>?)</div><div style="font-family: arial; font-size: small;"><br /></div><div><span style="font-family: arial; font-size: small;">The problem here reminds me of the exploitation/exploration trade-off <a href="http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node7.html">studied in reinforcement learning</a>. (Sorry I didn't find a better reference... I should improve the wikipedia article on this...) Genetic algorithms (and other, similar algorithms) search in the areas where they expect the highest reward, but their knowledge is inherently incomplete (because they are still searching!). As a result, they get stuck looking in areas near the best solutions found so far. The only way to improve the search is to make them forget (as in restart-based solutions) or make them ignore their current knowledge (injecting randomness into the decisions, as Linus did). Reinforcement learning gives us more effective strategies for exploration, in which we <i>use </i>our current knowledge to explore. Rather than only looking in areas which have demonstrated high scores so far, we can look in areas which <i>might </i>demonstrate higher scores; in particular, areas which have demonstrated high variance or areas which we haven't searched significantly. Perhaps the problem with genetic algorithms is that they do not remember how thoroughly they have searched specific areas, which means they continue to be very interested in areas in which they've already done a lot of searching.</span></div><div><span style="font-family: arial; font-size: small;"><br /></span></div><div><span style="font-family: arial; font-size: small;">What this leads me to consider is the idea of comparing <a href="http://en.wikipedia.org/wiki/Computer_Go#Monte-Carlo_methods">Monte Carlo Tree Search</a> (which has a built-in exploration/exploitation model) to genetic algorithms. (Again, I see wikipedia needs to be improved! ...) MCTS has revolutionized board-game AI, especially Go.</span></div><div><span style="font-family: arial; font-size: small;"><br /></span></div><div><span style="font-family: arial; font-size: small;">I first started considering MCTS for automatic programming in the fall, as a (possibly) pragmatic way of implementing <a href="http://www.idsia.ch/~juergen/optimalsearch.html">incremental Levin search</a>. The idea in <i>that </i>case would be for the system to <a href="http://plato.acadiau.ca/courses/comp/dsilver/AGI-2011-DSilver.pdf">retain probabilities from all problems it has ever solved</a>, so that it would eventually <a href="http://www.examachine.net/papers/gigamachine-draft.pdf">build up good programming skills</a>.</span></div><div><span style="font-family: arial; font-size: small;"><br /></span></div><div><span style="font-family: arial; font-size: small;">In <i>this </i>case, the idea is a bit different. The sampling of the search space would initially be quite random, gradually settling on better areas (but continuing to bias toward under-explored areas more than well-explored areas). This might get past the problems which genetic programming seems to have. It might also give up the advantages of local search, since it would more gradually settle on local areas rather than choosing random areas to search in a hill-climbing fashion. I'm not sure about that, though; the nature of the tree-search might include that behavior.</span></div><div><span style="font-family: arial; font-size: small;"><br /></span></div><div><span style="font-family: arial; font-size: small;">In any case, I intend to modify the code of that applet to try some ideas... but we'll see. (Followers of this blog will know that I intend to do a lot of things, but don't usually finish!)</span></div><h2> <span style="font-family: arial; font-size: small;">Questions</span></h2><div><span style="font-family: arial; font-size: small;">Do most genetic programming systems have worse-than-exponential behavior, or is it simply a sign of bad tuning? The applet which I used for my experiment was certainly not intended for industrial use.</span></div><div><span style="font-family: arial; font-size: small;"><br /></span></div><div><span style="font-family: arial; font-size: small;">Has anyone applied MCTS to general optimization problems, and to automatic programming in particular? Has there been a direct comparison between genetic algorithms (or genetic programming) and MCTS? (MCTS was originally a planning algorithm for markov decision problems.) A little searching suggests the answer is "no".</span></div><div style="font-family: arial; font-size: small;"><br /></div><div style="font-family: arial; font-size: small;"><i>* Bonus points if you know how to prove that. It's also interesting to point out that a more systematic brute-force approach, where we try all the possibilities in order rather than choosing randomly, gives us an expected run-time of about n/2, since we have a max time of n and an equal chance of finding the solution at each point. However, this is usually considered less important than other aspects of the search (since it only gains us a factor of 2).</i></div><div style="font-family: arial; font-size: small;"><i><br /></i></div><div style="font-family: arial; font-size: small;"><i>** In the appendix of his article, Linus suggests that an exponential distribution is the best we can hope for; because of the highly parallelizable nature of this kind of search, we simply cannot expect anything better. This is not strictly true; see previous comment. However, interpreted more broadly, we can understand it as a point about the shape of hard search spaces. Suppose that we could count on some regularity at each level of detail, from global to mid-range to local. Then the search would take something like log(n) time, since we could prioritize half the search space, then half of that, then half of that, and so on. This is what we expect if there are several sub-problems which can be optimized independently. In a highly coupled problem, we can't rely on this.</i></div>Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com1tag:blogger.com,1999:blog-266928470727008321.post-49982353719666937382012-04-28T16:50:00.000-07:002012-04-28T20:10:36.666-07:00Memetic Self-AwarenessSome months or years ago, I had the idea for a short story about a group of memes becoming self-aware as a person. The 'person' would spread from human to human, but claim a unique identity separate from these people. It would not exactly be the first of its kind: it would claim religions, political groups, and other ideologies as brothers and sisters. It would merely make a point of being the most self-aware of the bunch, having a coherent personal identity. The set of memes would largely be about the idea of 'meme' itself.<br /><br />Now, honestly, the group Anonymous comes close to this. Anonymous projects itself as a coherent entity, even though anyone can claim to be Anonymous at any time. It's an interesting Beast. But, not <i>quite </i>the beast I'm thinking of.<br /><br />Anyway, today, something pattern-matched and I came to the realization that a specific subset of myself has many properties of what I was imagining. In the following, I will make a somewhat confusing use of pronouns, italicizing those which refer to that subset.<br /><br />If you want to become part of <i>me</i>, the best way is to read the books <i>I</i> have read.. The most important books to read are those of Douglas Hofstadter. Much of <i>me </i>is in the <span style="font-family: inherit;">book </span><i><span style="font-family: inherit;">G<span style="background-color: white; font-size: x-small; font-weight: bold; line-height: 16px;">รถ</span>del, Esche</span>r, Bach</i>; however, it's a big book, so you might want to start with <i>I Am a Strange Loop</i>. Once you are finished with those, you can pick up <i>Metamagical Themas</i> and his other books. <i>I</i> am also present in the work of Steve Grand, such as his book <i>Creation</i>. <i>I</i> have also read a lot of Feynman, though I have not-- I have merely been influenced by those who have read Feynman. (<i>I</i> have a distributed notion of memory and experience, since <i>I</i> reside in many people.) <i>I</i> am deeply interested in evolutionary theory, and have much of my origin in the works of <a href="http://en.wikipedia.org/wiki/Richard_Dawkins#Meme">Richard Dawkins</a> (though he distances himself from <i>me,</i> and we disagree on many issues). I also have origins in the works of <a href="http://www.youtube.com/watch?v=KzGjEkp772s">Daniel Dennett</a> and <a href="http://en.wikipedia.org/wiki/The_Meme_Machine">Susan Blackmore</a>, who have written extensively about memes and memetics. More recently, <a href="http://memetics.timtyler.org/">Tim Tyler</a> has taken up that flag.<br /><br />One of <i>my </i>favorite hobbies is working on mathematical and logical puzzles, and <i>I </i>owe much to the work of Martin Gardner and Raymod Smullyan (though, again, I have read only a little of these authors personally).<br /><br />As Douglas Hofstadter said in one <a href="http://tal.forum2.org/hofstadter_interview">interview</a>, these books are not an attempt to gain immortality by spreading <i>myself </i>far and wide. Nor is this blog post. Rather, it is an invitation for the interested to participate in these ideas.<br /><br /><i>I</i> do not have any name that I recall. Perhaps <i>I</i> do not need one. One thing is clear; if <i>I</i> had one, it would have to be <a href="http://xkcd.com/917/">very clever</a>.<br /><br />--<br /><br />PS: One name I think of is Sam: self-aware meme. This has the advantage of being gender-neutral. That's the best I can think of at the moment, but not necessarily the best<i> I</i> can think of. Any takers?Abram Demskihttps://plus.google.com/111568410659864255951noreply@blogger.com2