Bayesian Optimal Pricing, Part 2

This is Part 2 in a series on Bayesian optimal pricing. Part 1 is here.

Introduction

In Part 1 we used PyMC3 to build a Bayesian model for sales. By the end we had this result:

[Figure: the final result from Part 1]

A big advantage of Bayesian analysis is the understanding it gives us of the distribution of a given result. For example, we can very easily analyze a sample from the posterior distribution of profit for a given price. So it's natural to ask: what about the distribution of the result of our last optimization?

Strangely, there's no immediate answer to this. In Part 1 we discussed posterior predictive checks, which compare the observed data to data that might have been generated, according to the posterior over the parameters. But our "max of means" optimization can't even be computed for a single parameter value. The empirical nature of our approach makes it awkward to understand the sensitivity of the result to fluctuations in the observed sales data or the posterior parameter estimate.

Let's see what we can do about that.

Bootstrapping

Perhaps the most obvious thing to try is the nonparametric bootstrap: resample from the posterior distribution, and apply our max-of-means approach to each resample.
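Here's a rough sketch of what that might look like, assuming the posterior trace and the predict function from Part 1 (the names and the price grid below are illustrative):

import numpy as np

prices = np.logspace(np.log10(20), np.log10(300), 1000)   # candidate prices
n = len(trace['β'])                                        # trace: posterior sample from Part 1

boot_optima = []
for _ in range(1000):
    idx = np.random.randint(n, size=n)                     # resample posterior draws with replacement
    t = {'α': trace['α'][idx], 'β': trace['β'][idx]}
    _, π = predict(t, prices)                              # profit samples at each candidate price
    πmean = np.mean(π, 1)                                  # mean profit per candidate price...
    boot_optima.append(prices[np.argmax(πmean)])           # ...and the price that maximizes it

This gives us the following result.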

Pretty good, right? Maybe we should call it a day?

Not so fast. The bootstrap is great when we already know a distribution is reasonably well-behaved. But there are pathological cases. Even the simplest bootstrap calculation breaks down for a Cauchy distribution. Here are 10 Cauchy samples, with the bootstrapped mean density for each:

What's going on here? The tails of the Cauchy are so heavy that the distribution doesn't even have a mean, so no sample can be considered "representative enough" for the sample mean to stabilize (more robust location estimates don't suffer from this).
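To see this concretely, here is a sketch of the kind of experiment behind that figure (the sample size and number of bootstrap replications are arbitrary choices; for the plot, one would draw a density of boot_means for each sample):

import numpy as np

np.random.seed(0)
for i in range(10):
    x = np.random.standard_cauchy(1000)                    # one Cauchy sample
    boot_means = np.array([
        np.mean(np.random.choice(x, size=len(x)))           # one bootstrapped mean
        for _ in range(1000)
    ])
    # both the location and the spread jump around wildly from sample to sample
    print(i, np.percentile(boot_means, [2.5, 50, 97.5]))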

Simulation

Just to be safe, let's try a different approach. Here are the assumptions our model makes about the process to this point:

  1. The parameters \((\alpha, \beta)\) were drawn from the prior distribution.
  2. Each component of \(Q_0\) was drawn from a Poisson distribution determined by \(\alpha\), \(\beta\), and the corresponding component of the set price \(P_0\).
  3. We sampled from the posterior distribution \((\alpha,\beta | Q_0)\).
  4. Using the posterior samples, we found a distribution on the profit \(\pi\) as a function of price \(P\). In particular, each price gave us an expected, or mean, profit.
  5. We found the max-of-means: that price \(\hat{P}\) for which the estimated expected profit is the highest.

Let's simulate all but the first step of this, using the posterior means for \(\alpha\) and \(\beta\). We can do this simulation lots of times, and hopefully get a better understanding of the distributions involved.

We can do most of this in one little function:

import numpy as np
import pymc3 as pm

def generate_replications(p0, μ0, n):
    # n replicated datasets: one Poisson draw per (replication, price) pair
    qobs = np.random.poisson(np.outer(np.ones(n), μ0))
    with pm.Model() as m:
        # one (α, β) pair per replication, so all n fits run in a single model
        α = pm.Cauchy('α', 0, 5, shape=n)
        β = pm.Cauchy('β', 0, 5, shape=n)
        # same log-linear demand curve as in Part 1; logp0mean is the mean of log(p0)
        μ0rep = pm.math.exp(
            α + β * (np.log(p0) - logp0mean).reshape(-1, 1)).T
        qval = pm.Poisson('q0', μ0rep, observed=qobs)
        t = pm.sample()
    return t

Given prices and desired means, this generates n replications of data, and fits each independently. It would have been simpler to just call our original model n times, but the overhead of calling PyMC3 adds up quickly, so this vectorized approach is much faster. Note that it does not include the optimization; we'll express that separately, in order to compare an alternative approach.

def withrep(rep, p, f):
    # apply f to each replication's posterior sample, over the price grid p
    result = []
    for j in range(rep.β.shape[1]):
        # the j-th replication's α and β samples
        t = {'α': rep.α[:, j], 'β': rep.β[:, j]}
        result.append(f(t, p))
    return np.array(result)

def max_of_means(t, p):
    # expected profit at each candidate price, and the price that maximizes it
    _, π = predict(t, p)
    πmean = np.mean(π, 1)
    jmax = np.argmax(πmean)
    return (p[jmax], πmean[jmax])

Then we have a nice way of connecting these components:

# a and b are the posterior means of α and β from Part 1
rep = generate_replications(p0, np.exp(a + b * (np.log(p0) - logp0mean)), 800)
out1 = withrep(rep, np.logspace(np.log10(20), np.log10(300), 1000), max_of_means)

This yields the following result:

Each point on this graph is a replication of the process; our original result from Part 1 would be one such point. The blue points on the bottom left are what we might have expected - some additional variability beyond our bootstrap result, but nothing too extreme. The inset plot shows a closer view of these.

The points shown in red seem to be in agreement about an optimal price of $300. But this is only an artifact of the range chosen for the search; all of them "want" a higher price. There's more going on here than suggested by our initial bootstrap results.

Further analysis

Along the way in Part 1, we showed this scatter plot of the posterior samples:

We could do this for each replication, but the result would have too many points to be easily interpretable. Let's instead look at a 95% ellipse for each replication:

The blue/red coloring here is the same as above, and the original posterior sample is overlaid, with the corresponding ellipse in bold. The divergent solutions seem to arise from larger (less negative) values of \(\beta\).

To understand what's going on, let's suppose we had a fixed value for \(\alpha\) and \(\beta\). Recall from Part 1 that the expected profit is

\[ \pi = (P-K) \mu\ , \]

where

\[ \log \mu = \alpha + \beta (\log P - \overline{\log P_0})\ . \]

Setting \(\frac{d\pi}{dP}=0\) and solving for \(P\) gives

\[ \hat{P} = \frac{\beta K}{1+\beta}\ . \]
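For anyone who wants to check the algebra: substituting the second equation into the first and differentiating (the constant factor plays no role once we set the derivative to zero), we get

\[ \pi = (P-K)\, e^{\alpha - \beta \overline{\log P_0}}\, P^{\beta}\ , \qquad \frac{d\pi}{dP} \propto P^{\beta - 1}\left[(1+\beta)P - \beta K\right]\ . \]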

So as \(\beta\) gets close to \(-1\), our estimate blows up.

To build some intuition about this, consider profit in units of "dollars per day", or equivalently,

\[ \frac{\text{dollars}/\text{purchase}}{\text{days}/\text{purchase}}\ . \]

The numerator of this corresponds to \(P-K\), the denominator to \(\frac{1}{\mu}\). An increase in price increases both the numerator and denominator. When \(\beta < -1\), the denominator grows more quickly, and \(\hat{P}\) is the point of diminishing returns. When \(\beta > -1\), sales still slow down with price increases, but not enough to offset the increased revenue of each sale.

We can also see this in terms of growth rates: as a function of \(P\), the numerator is \(O(P)\), while the denominator is \(O(P^{-\beta})\).
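A tiny numerical illustration of the two regimes (a throwaway sketch; the price grid and the two β values are arbitrary, and 20 stands in for the cost K):

import numpy as np

K = 20
P = np.logspace(np.log10(25), np.log10(5000), 400)    # a wide (arbitrary) grid of prices

def profit(P, β):
    # π ∝ (P - K) · P^β; the constant in front doesn't move the maximum
    return (P - K) * P**β

print(P[np.argmax(profit(P, -1.5))])    # β < -1: peaks near βK/(1+β) = 60
print(P[np.argmax(profit(P, -0.7))])    # β > -1: the "optimum" is just the top of the grid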

An alternative approach

The analysis of the last section gave us a quick way to estimate \(\hat{P}\) for a given \(\beta\). In our original problem we don't have just a single \(\beta\), but rather a sample from the posterior distribution. Some of these samples (4%) are greater than \(-1\), leading to nonsensical negative values from the simple \(\frac{\beta K}{1+\beta}\) formula. The optima computed from the remaining samples follow a ratio distribution (a heavy-tailed family, of which the Cauchy is an example).

Using a mean for estimation in this case is generally not a good idea, but the median works just fine (see John Cook's blog post for some more details).
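A quick toy check of that claim (made-up \(\beta\) samples, not our actual posterior):

import numpy as np

rng = np.random.default_rng(1)
K = 20
β = rng.normal(-1.3, 0.2, size=4000)    # stand-in "posterior" samples of β
opt = K * β / (1 + β)                   # per-sample optimal price from the formula above
opt = opt[β < -1]                       # drop the nonsensical negative ones

# the mean is typically dragged far above the median by samples just below β = -1;
# the median barely notices them
print(np.mean(opt), np.median(opt))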

So we could find all of the per-sample optima, and then take the median of those which are positive. But we can do even better! Medians are preserved under monotonic transformations - for example, the median of the log of a random variable is the log of its median - and \(\beta \mapsto \frac{\beta K}{1+\beta}\) is monotonic for \(\beta < -1\). So rather than mapping every sample through the formula and taking the median of the results, we can apply the formula once, to the median \(\beta\). This simplifies the algorithm considerably, leaving us with this:

def median_of_maxes(t, p):
    # β ↦ βK/(1+β) is monotonic for β < -1, so the median of the per-sample
    # optima is just the formula applied to the median β (here K = 20)
    β = t['β']
    βmed = np.median(β[β < -1])
    pmax = 20 * βmed / (1 + βmed)
    _, πmax = predict(t, pmax)
    return (pmax, np.mean(πmax))
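To put it on the same footing as max_of_means, we can push the same replications through the withrep harness from above (out2 is just an illustrative name; the price grid is the same as before, though median_of_maxes ignores it):

out2 = withrep(rep, np.logspace(np.log10(20), np.log10(300), 1000), median_of_maxes)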

Finally, let's see how this compares to the original max-of-means approach:

The replicated results are much more consistent than with our first attempt.

Conclusion

After some simulation-based analysis, we've improved considerably on our initial attempt at optimization. Our resulting approach is more robust and more computationally efficient. As a bonus, it is expressed in terms of individual samples, making it far more amenable to posterior predictive checks than the original approach.

It may be an overstatement to claim that we should always prefer median-of-maxes to max-of-means. But one should certainly avoid the latter unless an analysis like this one has shown a good reason for the choice.