A few months ago, Colin Carroll posted *A Tour of Probabilistic Programming Language APIs*, where he compared the APIs of a variety of probabilistic programming languages (PPLs) using this model:

\[ \begin{aligned} \mathbf { w } & \sim \mathcal { N } \left( \mathbf { 0 } , I _ { 5 } \right) \\ \mathbf { y } \mid X , \mathbf { w } & \sim \mathcal { N } \left( X \mathbf { w } , 0.1 I _ { 100 } \right) \end{aligned} \]

Soss.jl has really been picking up steam lately with a big boost from Taine Zhao's generalized generated functions for staged compilation. And thanks to Colin's work, we have a handy Rosetta Stone to other PPLs. So let's implement this and explore a bit.

# Defining the Model

Here we go!

```julia
cc = @model X begin
    w ~ Normal(0,1) |> iid(5)
    Xw = X * w
    y ~ For(1:100) do j
        Normal(Xw[j], 0.1)
    end
end;
```

If you've done some probabilistic programming, this may look a little strange. In most PPLs, you'd see the observed data passed in as an argument. But in Soss, **the model has no reference to the data!**

There are a few problems with the more common approach:

- It conflates data with hyperparameters,
- It makes dependencies much less clear, and
- It forces an early commitment to which variables are observed, greatly reducing flexibility.

In Soss, a model is written more like a standard function. From the model definition above, replacing each `v ~ dist` with `v = rand(dist)` already gets us very close to standard Julia code that can be executed.
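As a rough sketch, here's what that hand-translated version might look like in plain Julia. To keep it dependency-free, I'm using Base's `randn` instead of Distributions.jl, so a `Normal(μ, σ)` draw is written as `μ + σ*randn()`; the name `cc_forward` is just for illustration:

```julia
# Hand-written analogue of the cc model: each `v ~ dist` becomes a draw.
function cc_forward(X)
    w  = randn(5)                        # w ~ Normal(0,1) |> iid(5)
    Xw = X * w                           # Xw = X * w
    y  = Xw .+ 0.1 .* randn(length(Xw))  # y[j] ~ Normal(Xw[j], 0.1)
    (w = w, Xw = Xw, y = y)
end
```

Like the model, it returns a named tuple with a slot for each variable.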

Before we get too far ahead of ourselves, let's step through the `cc` model in some detail.

First, just like a function, we have any arguments, in this case the design matrix `X`. We could have chosen to put this in the body as well, and not require any arguments.

Next we have the body of the model, which is made up of a set of `Statement`s. The `cc` model has three statements.

- For `w`, the "`~`" indicates it's a random variable, with its distribution following. `iid` is a statistical term meaning *independent and identically distributed*, so `w` will be a `Vector`, with each component drawn from a `Normal(0,1)` distribution.
- Next, `Xw` is declared in an `Assign` statement. There's nothing too unusual here; it's just like an `=` in most languages.
- Finally, we have another `Sample` statement for `y`, in the form of a `For` block. This is similar to `iid`, but allows the distribution to vary as a function of the index.

We could optionally have a `return` statement. Because we don't, the model will return a named tuple, with a slot for each `Sample` or `Assign` statement.
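To make the distinction between `iid` and `For` concrete, here's a plain-Julia analogue of the two kinds of sampling statements (these helpers are illustrative, not the Soss API):

```julia
# `iid`-style: the same sampler is used at every index.
sample_iid(sampler, n) = [sampler() for _ in 1:n]

# `For`-style: the sampler can depend on the index.
sample_for(sampler, idx) = [sampler(j) for j in idx]

# e.g., with Xw already computed:
#   w = sample_iid(() -> randn(), 5)                 # like Normal(0,1) |> iid(5)
#   y = sample_for(j -> Xw[j] + 0.1randn(), 1:100)   # like the For block
```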

# Applying the Model

Above I said models are function-like. So what do we get when we fill in the arguments?

```julia
julia> X0 = randn(100,5);

julia> cc(X=X0)
Joint Distribution
Bound arguments: [X]
Variables: [w, Xw, y]

@model X begin
    w ~ Normal(0, 1) |> iid(5)
    Xw = X * w
    y ~ For(1:100) do j
        Normal(Xw[j], 0.1)
    end
end
```

"Distribution"? Yep. `cc(X=X0)` is a distribution just like `Normal(0,1)` is. And just as we would expect with a distribution, we can sample from it:

```julia
julia> obs = rand(cc(X=X0))
(X = [-0.8272484544396099 -0.03646389219747102 … 0.9375627860961913 -0.804611104845261; -0.6202176685874693 0.7053793384222113 … 0.014361484078182381 -0.3737011074670588; … ; -2.8061148816772663 0.9815626536629464 … -0.8081310261204893 -0.5775062378122593; -0.006294543353902432 -0.26016341761264344 … -2.132236712504989 0.0128466239738442], Xw = [-2.221565537313942, -2.1520058746385304, -2.1293094342311103, 2.941188972024378, 0.9375034196035377, 4.714386942469174, 6.287025102610045, 1.7702590639999833, 7.6596956428829674, 0.35806372731922104 … 1.5771365203011163, 1.2979668403687703, 2.19230613396396, -1.1871741199543733, 4.175965213223754, 4.014878023176152, -0.5222101908555612, 5.00462751766363, -2.4025174019374136, 5.6428099448628455], w = [0.7298700914411173, -2.6773826574609143, -0.03040032089651461, -2.324302227817693, -0.6091581898700535], y = [-2.077911721314273, -2.161650284487506, -2.0906144035641847, 2.9433241681835, 0.952057361034113, 4.694415215318885, 6.392498728047622, 1.6900865159917597, 7.625515407370478, 0.2417023002430837 … 1.6578734475509114, 1.2888603774507226, 2.212452426932563, -1.156288447328179, 4.108958690520445, 4.014541785531511, -0.5235177431156618, 4.9852474365678425, -2.4759738971855128, 5.70482453218249])
```

# Conditioning on Observed Data

Now that we have our fake data, we need to sample from the posterior distribution. For continuous parameters, the general best practice is to start with *Hamiltonian Monte Carlo*. In Julia, there are two main implementations of this: Tamas Papp's DynamicHMC.jl, and AdvancedHMC.jl from the Turing.jl team (notably Kai Xu and Hong Ge). There are some trade-offs between these, but I've done more testing with the former, so let's go with that for now.

For HMC, we'll need to specify a joint distribution (remember, that's just the model applied to some arguments) and data to condition on. It's convenient to pipe the result into the `particles` helper function:

```julia
julia> post = dynamicHMC(cc(X=X0), (y=obs.y,)) |> particles;

julia> post.w
5-element Array{Particles{Float64,1000},1}:
  0.721 ± 0.011
 -2.68 ± 0.011
 -0.0383 ± 0.0098
 -2.32 ± 0.0087
 -0.611 ± 0.011
```

`Particles` are from Fredrik Bagge Carlson's MonteCarloMeasurements.jl library, and make the result much more concise. Conveniently, there's no data loss here; the summarization happens only in how the values are displayed. `Particles` support componentwise operations as if each were just a floating-point value, which makes it easy to work with the results of an analysis.
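Here's a rough sketch of the idea, not the real MonteCarloMeasurements.jl implementation: a `Particles`-style value is just a cloud of samples, and arithmetic maps componentwise over the samples, so nothing is thrown away:

```julia
# Toy stand-in for a Particles-like type (illustrative only).
struct P
    samples::Vector{Float64}
end

# Componentwise arithmetic: operate sample-by-sample.
Base.:+(a::P, b::P) = P(a.samples .+ b.samples)
Base.:*(a::P, k::Real) = P(a.samples .* k)

# Only the *display* summarizes; the samples themselves are all still there.
mean_of(p::P) = sum(p.samples) / length(p.samples)
```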

# The Posterior Predictive Distribution

Earlier I said that the body of a model is a *set* of statements. No, not a sequence, it's a set. Well, a *partially ordered set*.

Soss understands variable dependencies. In fact, the order of statements chosen by the user is immediately discarded. All computations, even "displaying" the model to the user, are done in terms of a *topological ordering* of the variables, relative to their dependencies.
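As a toy illustration of the idea (this is not Soss's internal representation), imagine the statements keyed by the variable they define, along with their dependencies; any traversal then works from a topological sort, so the order the user wrote them in never matters:

```julia
# Hypothetical dependency map for the cc model: each variable lists the
# variables it depends on.
deps = Dict(:w => Symbol[], :Xw => [:w], :y => [:Xw])

# Depth-first topological sort: visit a variable's dependencies before
# emitting the variable itself.
function topo_order(deps)
    out = Symbol[]
    visit(v) = v in out || (foreach(visit, deps[v]); push!(out, v))
    foreach(visit, keys(deps))
    out
end
```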

This representation is very simple from a programming languages perspective, and we may extend it in the future. But even as it is, the representation is convenient for model transformations. Since we now have samples from the posterior over `w`, we can build a new model conditioning on it:

```julia
julia> pred = predictive(cc, :w)
@model (X, w) begin
    Xw = X * w
    y ~ For(1:100) do j
        Normal(Xw[j], 0.1)
    end
end
```

[Alternatively, we could just use `dynamicHMC` again. But "forward" computation is much cheaper, and we'll have better effective sample size.]

We're finally ready to set up our posterior predictive check. This takes a bit of maneuvering, so it's one thing we'll try to smooth out in future releases. Here's the code:

```julia
ppc = mapslices(Matrix(post.w), dims=2) do w
    rand(pred(X=X0, w=w)).y
end |> Particles
```

`Matrix(post.w)` is a 1000×5 matrix. We feed each row into the predictive model, generate a random sample, and retrieve the `y` component, before finally aggregating the whole thing back into `Particles`.

If the model fits well, `obs.y` should be uniform along the quantiles of the posterior predictive distribution. We can check this by looking at

`q = mean.(obs.y .> ppc)`
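Written out with plain arrays, the check looks something like this (the `ppc_draws` layout here is hypothetical, with one row per observation and one column per predictive draw):

```julia
# q[i] is the fraction of predictive draws that observation i exceeds,
# i.e. the observation's quantile in its predictive distribution.
# If the model fits, the q[i] should look roughly Uniform(0,1).
function ppc_quantiles(y, ppc_draws)
    [count(y[i] .> ppc_draws[i, :]) / size(ppc_draws, 2) for i in eachindex(y)]
end
```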

# Epilogue

Things are moving really fast, and there are lots of other things I haven't mentioned yet, like:

- First-class models
- Lifting distributions to models
- Markov chain combinators
- Multiple imputation
- Normalizing Flows with Bijectors.jl
- Deep learning with Flux
- Integration with Turing and Gen