[om-list] Inverse Cumulative Probability Distribution Functions

Sun Jan 6 01:15:43 EST 2002

Aside: The context here is that we are generating random numbers that follow a given distribution function, and trying to unify the discrete case (e.g. heads/tails) with the continuous case (e.g. temperature).

The short answer is that, to do this efficiently, samples in the discrete case must be treated differently from samples in the continuous case: one must interpolate between samples in the continuous case, but not in the discrete case.  Unification requires carrying a "discrete_flag" along with the set of samples of any density or distribution function, and handling the math accordingly.
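
As a rough illustration of the "discrete_flag" idea, here is a minimal Python sketch.  The names inverse_cdf and draw, and the choice to store the distribution as two parallel lists (xs and cdf), are assumptions made for this example only, not anything settled in the thread.  With the flag set, the lookup returns a support point as-is; with it clear, it interpolates linearly between the two bracketing nodes.

```python
import random
from bisect import bisect_left


def inverse_cdf(u, xs, cdf, discrete_flag):
    """Map u in [0, 1] to a sample value.

    xs  : sorted sample points (the "nodes")
    cdf : cumulative probability at each node, non-decreasing, ending at 1.0
    """
    # Index of the first node whose cumulative probability is >= u.
    i = min(bisect_left(cdf, u), len(xs) - 1)
    if discrete_flag or i == 0:
        # Discrete case (or u at/below the first node): return the node
        # itself, never a value in between two support points.
        return xs[i]
    # Continuous case: interpolate linearly between the bracketing nodes.
    lo, hi = cdf[i - 1], cdf[i]
    if hi == lo:
        return xs[i]
    t = min(1.0, (u - lo) / (hi - lo))
    return xs[i - 1] + t * (xs[i] - xs[i - 1])


def draw(xs, cdf, discrete_flag, rng=random):
    """Draw one value by feeding a uniform random number to inverse_cdf."""
    return inverse_cdf(rng.random(), xs, cdf, discrete_flag)


# Fair coin (discrete): only 0 or 1 can ever come out.
coin = draw([0, 1], [0.5, 1.0], discrete_flag=True)

# Same lookup on a continuous table: values between nodes are allowed.
temp = draw([0.0, 10.0, 20.0], [0.0, 0.5, 1.0], discrete_flag=False)
```

The same table gives different answers depending on the flag: inverse_cdf(0.25, [0.0, 10.0, 20.0], [0.0, 0.5, 1.0], True) snaps to the node 10.0, while passing False interpolates to 5.0.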

- Mark

Tom and other Packers wrote:
> 
> (OM people: This email has a prerequisite of introductory mathematical
> statistics.  If you don't have the prereqs, read it anyway if you have
> interest in helping 4C with its Informatica; 4C-Informatica will probably be
> heavily based on probability distributions; it's good to get the exposure as
> soon as possible, in learning new things.)
> 
> Mark
> 
>     Remember our phone conversation about generating inverse distribution
> functions from estimated (sampled) p.d.f.s?  I'm concerned about cases of
> discrete p.d.f.s.
> 
>     You remember how the "nodes" in the domain of the inverse distribution
> function would not necessarily correspond to the nodes in the domain of the
> original p.d.f.?  If we have a few discrete, positively valued points in the
> p.d.f. domain, how will we regain those exact points through the inverse
> distribution function (i.d.f.) if this distribution function is generated by
> integrating between points other than the "support" points?
> 
>     That is, think of the process of generating a semi-random Y: we generate
> a random number in the domain of the i.d.f., between 0 and 1, and then
> look for the node with that value, or interpolate to find an
> approximation.  In the discrete case, there will rarely be a node with that
> exact value, so we'd be looking for the point where the distribution
> function range jumps from below the input value to above the input value and
> then interpolating.  This could be drastically wrong if we end up generating a
> lot of Y's which in reality never have positive probabilities.
> 
>     There's another reason we should make the i.d.f. correspond directly to
> the sampled p.d.f.: it would be more accurate, even in the continuous
> case -- or at least I think it would be, since we'd have less approximation
> error: we'd be interpolating once instead of twice.
> 
>     How easy might it be to generate an inverse distribution function that
> does correspond directly to the p.d.f. in its sample points?
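
One way the last question might be approached, for the continuous case, is to integrate the sampled p.d.f. cumulatively at its own sample points and then read the resulting table backwards.  The Python sketch below assumes trapezoid integration and linear interpolation; the function names cdf_at_pdf_nodes and idf are hypothetical, chosen for this example.  Because the cumulative values are computed at the p.d.f.'s own nodes, the i.d.f. has exactly one node per p.d.f. sample point, and a lookup costs a single interpolation.

```python
def cdf_at_pdf_nodes(xs, pdf):
    """Cumulative trapezoid integral of a sampled p.d.f., evaluated at xs itself."""
    cdf = [0.0]
    for i in range(1, len(xs)):
        area = 0.5 * (pdf[i - 1] + pdf[i]) * (xs[i] - xs[i - 1])
        cdf.append(cdf[-1] + area)
    total = cdf[-1]
    # Normalize so the last cumulative value is exactly 1.0.
    return [c / total for c in cdf]


def idf(u, xs, cdf):
    """Inverse distribution function: one linear interpolation in the (cdf, xs) table."""
    for i in range(1, len(xs)):
        if u <= cdf[i]:
            span = cdf[i] - cdf[i - 1]
            if span == 0.0:
                return xs[i]
            t = (u - cdf[i - 1]) / span
            # This form returns xs[i - 1] exactly at t == 0 and xs[i] exactly at t == 1.
            return (1.0 - t) * xs[i - 1] + t * xs[i]
    return xs[-1]


# Trapezoid-shaped p.d.f. sampled at four points (illustrative data only).
xs = [0.0, 1.0, 2.0, 3.0]
pdf = [0.0, 1.0, 1.0, 0.0]
cdf = cdf_at_pdf_nodes(xs, pdf)          # [0.0, 0.25, 0.75, 1.0]
nodes_back = [idf(c, xs, cdf) for c in cdf]  # recovers xs exactly
```

Linear interpolation between the nodes is still an approximation: the exact integral of a piecewise-linear p.d.f. is piecewise quadratic.  The sketch only shows that the node sets can be made to coincide.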