The Wikipedia article on entropy (information theory) gives the usual formula for the entropy of a probability density function, and then calmly proceeds with a proof that it is not the correct way to generalise entropy to the continuous domain. (Yay for Wikipedia!) The alternative it links to is the definition of entropy advocated by Jaynes. It's called Limiting Density of Discrete Points. (Not a catchy name.)
Entropy is about expected code length (which is expected information gain). One way to try to define entropy for distributions over real numbers is to treat the reals as discrete objects. To do this, we choose a specific encoding E of the reals as a stream of bits, and examine the efficiency with which we can store points randomly drawn from our probability distribution P. Unfortunately, no matter how clever our encoding, a single real number almost always carries infinite information. So, to get a finite number, we subtract the information in each bit from the worst-case information for that bit (a coin flip). In other words, since we can't give the (infinite) amount of information directly, we look at how much less information there is than in the case of random noise.
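Here is a small numerical sketch of that subtraction. Several things in it are assumptions made for illustration and are not from the discussion above: the encoding E is taken to be the plain binary expansion of a number in [0, 1), P is taken to be the density p(x) = 2x on that interval, and the function names are made up. Keeping only the first n bits gives a discrete distribution over 2^n cells. Its entropy keeps growing by about one bit per extra bit of precision, while the shortfall from n fair coin flips settles to a finite value.

```python
import math

def p_cdf(x):
    """CDF of the assumed example density p(x) = 2x on [0, 1)."""
    return x * x

def truncated_entropy_bits(cdf, n_bits):
    """Entropy, in bits, of the first n_bits of the binary expansion of x ~ P."""
    n_cells = 2 ** n_bits
    total = 0.0
    for i in range(n_cells):
        # Probability that x lands in the i-th dyadic cell.
        mass = cdf((i + 1) / n_cells) - cdf(i / n_cells)
        if mass > 0.0:
            total -= mass * math.log2(mass)
    return total

def shortfall_bits(cdf, n_bits):
    """How far the n-bit entropy falls below n fair coin flips."""
    return n_bits - truncated_entropy_bits(cdf, n_bits)

# Closed form for this particular example:
# the integral of 2x * log2(2x) over [0, 1) equals 1 - 1 / (2 ln 2).
LIMIT_BITS = 1.0 - 0.5 / math.log(2)

if __name__ == "__main__":
    for n in (2, 4, 8, 12):
        print(n, truncated_entropy_bits(p_cdf, n), shortfall_bits(p_cdf, n))
    print("limit of the shortfall:", LIMIT_BITS)
```

The entropy column diverges as n grows, which is the "infinite information" problem; the shortfall column is the finite quantity that survives.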
"Random noise" means coin flips in our discrete representation E for continuous numbers... but that means that E induces a probability distribution of its own. The measure of entropy in P relative to the encoding E ends up looking a lot like the KL-divergence between P and E, viewing E as a probability distribution.
One way of understanding this is to say that the continuous version of entropy doesn't work out quite right, but the continuous version of KL-divergence (which is closely related to entropy) works quite well; so the best we can do in the continuous domain is to measure KL-divergence from some reference distribution (in this case, E) as a substitute for entropy.
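To make the dependence on the reference distribution concrete, here is a rough sketch that computes the cell-by-cell sum of p_i * log2(p_i / e_i), a discrete stand-in for the integral of p(x) log(p(x)/m(x)) dx, where m is the density induced by E. The specifics are assumptions chosen for illustration, not anything from the Wikipedia article or Jaynes: P has density 2x on [0, 1), one encoding writes x in binary, the other writes sqrt(x) in binary, and all function names are hypothetical.

```python
import math

def kl_bits_vs_encoding(p_cdf, e_quantile, n_bits):
    """KL divergence (bits) from E to P after cutting the line into the
    2**n_bits cells addressed by the first n_bits of the encoding E.

    e_quantile(u) maps a fraction u in [0, 1] to the point x whose code,
    read as a binary fraction, equals u. Under coin-flip bits every cell
    has mass 2**-n_bits, which is the distribution E induces.
    """
    n_cells = 2 ** n_bits
    e_mass = 1.0 / n_cells
    total = 0.0
    for i in range(n_cells):
        lo = e_quantile(i / n_cells)
        hi = e_quantile((i + 1) / n_cells)
        p_mass = p_cdf(hi) - p_cdf(lo)
        if p_mass > 0.0:
            total += p_mass * math.log2(p_mass / e_mass)
    return total

def p_cdf(x):
    """CDF of the assumed example density p(x) = 2x on [0, 1)."""
    return x * x

def binary_quantile(u):
    """Encoding 1 (assumed): the code is the binary expansion of x itself."""
    return u

def sqrt_code_quantile(u):
    """Encoding 2 (assumed): the code is the binary expansion of sqrt(x)."""
    return u * u

if __name__ == "__main__":
    print("vs binary expansion of x:      ", kl_bits_vs_encoding(p_cdf, binary_quantile, 12))
    print("vs binary expansion of sqrt(x):", kl_bits_vs_encoding(p_cdf, sqrt_code_quantile, 12))
```

The same P gets a different number under each encoding, which is the sense in which the result is a divergence from a chosen reference rather than an intrinsic entropy of P.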
Normally, entropy tells us what we can expect to achieve with the best encoding. (It places a lower bound on expected code length: intuitively, we can't make the code shorter than the actual amount of information, except by luck.) The reason this doesn't work for real numbers is that they require an infinite amount of information in any case. Yet it is still possible to generate optimal codes (for example, iteratively divide the probability density into upper and lower halves of equal probability mass), and ask how "spread out" they are. The problem is just that we need to define some sort of "frame" in order to get a numerical value (defining the extent to which we care about differing numbers), and it seems like most options end up looking similar to the approach Jaynes came up with (or turn out to be special cases of that approach).
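The halving code mentioned above can be sketched in a few lines. This is only an illustration under assumptions of my own: P is again taken to have density 2x on [0, 1), so its CDF is x^2 and its quantile function is sqrt(u), and the helper names are invented. Each bit records whether x falls in the upper or lower half, by probability mass, of the current cell, so every bit is a fair coin under P.

```python
import math

def p_cdf(x):
    """CDF of the assumed example density p(x) = 2x on [0, 1)."""
    return x * x

def p_quantile(u):
    """Inverse of p_cdf."""
    return math.sqrt(u)

def encode(x, cdf, n_bits):
    """First n_bits of the halving code for x."""
    lo, hi = 0.0, 1.0      # current cell, measured in cumulative probability
    u = cdf(x)
    bits = []
    for _ in range(n_bits):
        mid = (lo + hi) / 2.0
        if u >= mid:       # x is in the upper half of the cell's mass
            bits.append(1)
            lo = mid
        else:              # x is in the lower half of the cell's mass
            bits.append(0)
            hi = mid
    return bits

def cell(bits, quantile):
    """Interval of x values that share the given code prefix."""
    lo, hi = 0.0, 1.0
    for b in bits:
        mid = (lo + hi) / 2.0
        if b:
            lo = mid
        else:
            hi = mid
    return quantile(lo), quantile(hi)

if __name__ == "__main__":
    for x in (0.1, 0.9):
        bits = encode(x, p_cdf, 4)
        print(x, bits, cell(bits, p_quantile))
```

Every 4-bit cell holds probability 1/16, but the cells are wide where the density is low and narrow where it is high. How wide they are in absolute terms depends on the ruler used to measure x, which is one way to see the need for a "frame".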
Is this really the best way to translate entropy to the continuous domain?