A few months ago, Colin Carroll posted A Tour of Probabilistic Programming Language APIs, where he compared the APIs of a variety of probabilistic programming languages (PPLs) using this model:
\[ \begin{aligned} \mathbf{w} &\sim \mathcal{N}\left(\mathbf{0},\, I_5\right) \\ \mathbf{y} \mid X, \mathbf{w} &\sim \mathcal{N}\left(X\mathbf{w},\, 0.1\, I_{100}\right) \end{aligned} \]
Soss.jl has really been picking up steam lately with a big boost from Taine Zhao's generalized generated functions for staged compilation. And thanks to Colin's work, we have a handy Rosetta Stone to other PPLs. So let's implement this and explore a bit.
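Throughout this post, I'll assume a setup along these lines (my own note; exactly which of these names Soss re-exports may vary by version):
using Soss            # @model, iid, For, dynamicHMC, predictive, particles (all used below)
using Distributions   # Normal (Soss may re-export this)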
Defining the Model
Here we go!
cc = @model X begin
    w ~ Normal(0,1) |> iid(5)
    Xw = X * w
    y ~ For(1:100) do j
        Normal(Xw[j],0.1)
    end
end;
If you've done some probabilistic programming, this may look a little strange. In most PPLs, you'd see the observed data passed in as an argument. But in Soss, the model has no reference to the data!
There are a few problems with the more common approach:
- It conflates data with hyperparameters,
- It makes dependencies much less clear, and
- It forces an early commitment to which variables are observed, greatly reducing flexibility.
In Soss, a model is written more like a standard function. From the model definition above, replacing each v ~ dist with v = rand(dist) already gets us very close to standard Julia code that can be executed.
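As a rough illustration, that rewrite would look something like this (hand-written, purely illustrative; cc_by_hand is a made-up name, not anything Soss generates):
# Hand-written sketch of cc with each `~` replaced by `rand`; illustrative only.
function cc_by_hand(X)
    w  = rand(Normal(0, 1), 5)                      # w ~ Normal(0,1) |> iid(5)
    Xw = X * w                                      # Xw = X * w
    y  = [rand(Normal(Xw[j], 0.1)) for j in 1:100]  # y ~ For(1:100) do j ... end
    (w=w, Xw=Xw, y=y)
end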
Before we get too far ahead of ourselves, let's step through the cc model in some detail.
First, just like a function, we have the arguments; in this case, the design matrix X. We could instead have chosen to put this in the body, and not require any arguments.
Next we have the body of the model, which is composed of a set of Statements. The cc model has three statements:
- For w, the "~" indicates it's a random variable, with its distribution following. iid is a statistical term meaning independent and identically distributed. So w will be a Vector, with each component drawn from a Normal(0,1) distribution.
- Next, Xw is declared in an Assign statement. There's nothing too unusual here; it's just like an = in most languages.
- Finally, we have another Sample, for y, in the form of a For block. This is similar to iid, but allows the distribution to vary as a function of the index.
We could optionally have a return statement. Because we don't, the model will return a named tuple, with a slot for each Sample or Assign statement.
Applying the Model
Above I said models are function-like. So what do we get when we fill in the arguments?
julia> X0 = randn(100,5);
julia> cc(X=X0)
Joint Distribution
Bound arguments: [X]
Variables: [w, Xw, y]
@model X begin
    w ~ Normal(0, 1) |> iid(5)
    Xw = X * w
    y ~ For(1:100) do j
        Normal(Xw[j], 0.1)
    end
end
"Distribution"? Yep. cc(X=X0)
is a distribution just like Normal(0,1)
is. And just as we would expect with a distribution, we can sample from it:
julia> obs = rand(cc(X=X0))
(X = [-0.8272484544396099 -0.03646389219747102 … 0.9375627860961913 -0.804611104845261; -0.6202176685874693 0.7053793384222113 … 0.014361484078182381 -0.3737011074670588; … ; -2.8061148816772663 0.9815626536629464 … -0.8081310261204893 -0.5775062378122593; -0.006294543353902432 -0.26016341761264344 … -2.132236712504989 0.0128466239738442], Xw = [-2.221565537313942, -2.1520058746385304, -2.1293094342311103, 2.941188972024378, 0.9375034196035377, 4.714386942469174, 6.287025102610045, 1.7702590639999833, 7.6596956428829674, 0.35806372731922104 … 1.5771365203011163, 1.2979668403687703, 2.19230613396396, -1.1871741199543733, 4.175965213223754, 4.014878023176152, -0.5222101908555612, 5.00462751766363, -2.4025174019374136, 5.6428099448628455], w = [0.7298700914411173, -2.6773826574609143, -0.03040032089651461, -2.324302227817693, -0.6091581898700535], y = [-2.077911721314273, -2.161650284487506, -2.0906144035641847, 2.9433241681835, 0.952057361034113, 4.694415215318885, 6.392498728047622, 1.6900865159917597, 7.625515407370478, 0.2417023002430837 … 1.6578734475509114, 1.2888603774507226, 2.212452426932563, -1.156288447328179, 4.108958690520445, 4.014541785531511, -0.5235177431156618, 4.9852474365678425, -2.4759738971855128, 5.70482453218249])
Conditioning on Observed Data
Now that we have our fake data, we need to sample from the posterior distribution. For continuous parameters, the general best practice is to start with Hamiltonian Monte Carlo. In Julia, there are two main implementations of this: Tamas Papp's DynamicHMC.jl, and AdvancedHMC.jl from the Turing.jl team (notably Kai Xu and Hong Ge). There are some trade-offs between these, but I've done more testing with the former, so let's go with that for now.
For HMC, we'll need to specify a joint distribution (remember, that's just the model applied to some arguments) and data to condition on. It's convenient to pipe this into the particles helper function:
julia> post = dynamicHMC(cc(X=X0), (y=obs.y,)) |> particles;
julia> post.w
5-element Array{Particles{Float64,1000},1}:
0.721 ± 0.011
-2.68 ± 0.011
-0.0383 ± 0.0098
-2.32 ± 0.0087
-0.611 ± 0.011
Particles are from Fredrik Bagge Carlson's MonteCarloMeasurements.jl library, and make the result much more concise. Conveniently, there's no data loss here; the summarization is only in how it's displayed. Particles support componentwise operations as if each were just a floating-point value. This has great benefits for easily working with the results of an analysis.
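For example, ordinary arithmetic just works and carries the uncertainty along (a small sketch using the objects above; the names on the left are mine):
# Componentwise arithmetic on Particles propagates the posterior uncertainty:
w_err  = post.w - obs.w   # posterior error relative to the weights that generated the data
fitted = X0 * post.w      # 100-element vector of Particles for the fitted X*w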
The Posterior Predictive Distribution
Earlier I said that the body of a model is a set of statements. No, not a sequence, it's a set. Well, a partially ordered set.
Soss understands variable dependencies. In fact, the order of statements chosen by the user is immediately discarded. All computations, even "displaying" the model to the user, are done in terms of a topological ordering of the variables, relative to their dependencies.
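To make that concrete: if I've read the design right, writing the same statements in a different order should give an equivalent model (a sketch based on the description above, not something from the original post):
# The statement order below is scrambled on purpose; since Soss works from the
# dependency graph, this should define a model equivalent to cc (an assumption
# here, not verified against every Soss version).
cc_scrambled = @model X begin
    y ~ For(1:100) do j
        Normal(Xw[j], 0.1)
    end
    Xw = X * w
    w ~ Normal(0, 1) |> iid(5)
end;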
This representation is very simple from a programming languages perspective, and we may extend it in the future. But even as it is, the representation is convenient for model transformations. Since we now have samples from the posterior over w, we can build a new model conditioning on it:
julia> pred = predictive(cc, :w)
@model (X, w) begin
    Xw = X * w
    y ~ For(1:100) do j
        Normal(Xw[j], 0.1)
    end
end
[Alternatively, we could just use dynamicHMC again. But "forward" computation is much cheaper, and we'll get a better effective sample size.]
We're finally ready to set up our posterior predictive check. This takes a bit of maneuvering, so it's one thing we'll try to smooth out in future releases. Here's the code:
ppc = mapslices(Matrix(post.w), dims=2) do w
    rand(pred(X=X0, w=w)).y
end |> Particles
Matrix(post.w) is a 1000×5 matrix. We feed each row into the predictive model, generate a random sample, and retrieve the y component, before finally aggregating the whole thing back into Particles.
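If the mapslices version looks opaque, here's the same idea as an explicit loop (a sketch; it draws fresh randomness, so the numbers will differ from ppc even though the distribution is the same):
# Explicit-loop version of the posterior predictive sampling above.
W  = Matrix(post.w)                                         # 1000 × 5 posterior draws
ys = [rand(pred(X=X0, w=W[i, :])).y for i in 1:size(W, 1)]  # one simulated y vector per draw
ppc_loop = Particles(Matrix(reduce(hcat, ys)'))             # back to a 100-element vector of Particles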
If the model fits well, obs.y should be uniform along the quantiles of the posterior predictive distribution. To check this, we can look at
q = mean.(obs.y .> ppc)
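A couple of quick numeric sanity checks on q (just a sketch; a histogram of q is the more standard visual check):
using Statistics   # mean, std (if not already loaded)

# For a well-calibrated model, q should look roughly Uniform(0,1):
mean(q), std(q)    # expect about 0.5 and 1/sqrt(12) ≈ 0.29
extrema(q)         # should span most of (0, 1) rather than piling up at the ends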
Epilogue
Things are moving really fast, and there's a lot I haven't mentioned yet, like:
- First-class models
- Lifting distributions to models
- Markov chain combinators
- Multiple imputation
- Normalizing Flows with Bijectors.jl
- Deep learning with Flux
- Integration with Turing and Gen