Saturday, September 24, 2016

Probabilistic Programming - Part 1

Probabilistic programming is an exciting, new-ish area that is eating a lot of my spare time.

"Probabilistic Programming and Bayesian Methods for Hackers" is an excellent book I am playing with. Not only is it free (see here), you can download it and read it as an IPython book (with the command ipython notebook) and tinker with the equations (use shift-enter to execute them).

As you might have gathered, it's Python-based but there are moves to integrate the library it uses extensively (PyMC) with Spark.

In the Java/Scala world, Figaro is making in-roads. In this first part post, I'm making notes on the Python implementation. In part two, I'm hoping to implement a Scala/Figaro version.

Bayes again

Bayesian probability requires a different way of thinking.

If frequentist and Bayesian inference were programming functions, with inputs being statistical problems, then the two would be different in what they return to the user. The frequentist inference function would return a number, representing an estimate (typically a summary statistic like the sample average etc.), whereas the Bayesian function would return probabilities
For example, in our debugging problem above, calling the frequentist function with the argument "My code passed all tests; is my code bug-free?" would return a YES. On the other hand, asking our Bayesian function "Often my code has bugs. My code passed all tests; is my code bug-free?" would return something very different: probabilities of YES and NO. The function might return: 
YES, with probability 0.8; NO, with probability 0.2 
This is very different from the answer the frequentist function returned. Notice that the Bayesian function accepted an additional argument:  "Often my code has bugs". This parameter is the prior. By including the prior parameter, we are telling the Bayesian function to include our belief about the situation.
The formula for Bayesian probability goes like this:

P(A|X) = P(X|A) P(A)

Which reads in English as: the probability of A given X is equal to the probability of X give A multiplied by the probability of A by itself and divided by the probability of X by itself.

P(A|X)is called our posterior and P(A) is called our prior.

Probability distributions refresher

Probability distributions can be discrete (integers) or continuous (real numbers). The distribution of a discrete variable is called a probability mass function, a continuous variable has a probability density function.

One (of many) discrete functions is the Poisson distribution that dates back centuries and was derived to estimate the probabilities of cavalry horses losing a shoe in (or some other independent event). It looks like:

P(Z = k) = λk e

and this has the convenient property that its expected value is λ (or, in maths speak, E[Z|λ]λ).

So, this looks as good a distribution as any for our prior. The trouble is, what is λ? Aye, there's the rub. λ can be anything (an integer or any real number), so to model it, we can choose an exponential distribution. It is continuous and it has some convenience. The probability density function looks like:

f(z|λ) = λ e-λz ,  z >= 0

and the convenient fact is that E[Z|λ] = λ-1.


The task is to deduce when (if at all) somebody's behavior changed given a list of the number of text messages they sent over a certain time period. To do this, the book uses PyMC, a Python library.

The workhorse of the code is the generation of simulated data:

def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
    out = np.zeros(n_count_data)
    out[:tau] = lambda_1  # lambda before tau is lambda1
    out[tau:] = lambda_2  # lambda after (and including) tau is lambda2
    return out

which simply creates an array where the elements are one of two values. All the values before tau are lambda_1 and the rest are lambda_2. Simples.

"Inside the deterministic decorator [@pm.deterministic], the Stochastic variables passed in behave like scalars or Numpy arrays (if multivariable), and not like Stochastic variables."

By putting some print statements in, you can see that this function gets hit thousands of times when we feed it into the Poisson function that is given to a Model function that is passed to a MCMC function (Monte Carlo Markov Chain). Phew! The details can be found in the freely downloadable book and I'll explain more when I show the Figaro implementation.

"MCMC returns samples from the posterior distribution, not the distribution itself... MCMC performs a task similar to repeatedly asking "How likely is this pebble I found to be from the mountain I am searching for?", and completes its task by returning thousands of accepted pebbles in hopes of reconstructing the original mountain. In MCMC and PyMC lingo, the returned sequence of "pebbles" are the samples, cumulatively called the traces."

And it's these traces that (when graphed) can help us to see when it is most likely that behavior changed. I'll go into more detail with the Scala implementation in part 2.

Saturday, September 3, 2016

Yet more maths and graphs - Lagrange Multiplier

Here's a little maths trick called the Lagrange Multiplier that helps in finding the maximum/minimum of a function f(x, y) given a constant that also uses x and y. We "form the function

F(x, y) = f(x, y) + λ Φ(x, y) 

and set the two partial derivatives of F equal to zero. Then solve these two equations and the equation Φ(x, y) = constant for the three unknowns x, y and λ".

A nice proof can be found in the essential undergraduate text, Mathematical Method in the Physical Sciences. Let's say we're trying to find the max/min of the function f(x, y). Obviously, at this point.

df = δf dx + δf dy = 0                               (1)
     δx      δy

and say we've been given a function that is a constant Φ(x, y). Once again, it should be obvious that:

dΦ = δΦ dx + δΦ dy = 0                               (2)
     δx      δy

since Φ(x, y) is a constant.

Now, multiply the second equation by λ and add it to the first.

(δf + λ δΦ )dx + (δf + λ δΦ)dy = 0                   (3)
 δx     δx        δy     δy

Let's take that second term. Since λ is arbitrary, let's choose it to be such that:

δf + λ δΦ = 0                                        (4)
δy     δy

(That is, we've decided

λ = - δf  / δΦ                                       (5)
              δy /  δy

But it's not necessary to worry about that as we're not particularly interested in the value of λ. We just use it to solve the equations).

Similarly, we do the same for the first term of (3):

δf + λ δΦ = 0                                        (6)
δx     δx

Now, with equations (4) and (5) and (6) we have enough information to find the values of x and y at the position of the min/max point.

Why am I writing this post? Because we're going to need the Lagrange Multiplier shortly in partitioning a graph.

Friday, September 2, 2016

Miscellaneous HBase tips

I'm new(ish) to HBase so here are a few notes I've jotted down over the last few months. Basically, Spark is great at what it does, but if you need to look something up while processing an element, you're best relying on another tool. HBase has been our choice.

HBase is not like a normal database

You can't do a simple select count(*)... You need to do something like:

$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter

See here for more information.

The HBase shell is not even SQL like. If you want to limit the number of rows returned, the syntax looks like:

hbase> scan 'test-table', {'LIMIT' => 5}

In "Hadoop: the Definitive Guide" there is a great "Whirlwind Tour of the Data Model". What follows is an even more condensed precis.
"Table cells - the intersection of rows and column coordinates - are versioned. By default, their version is a timestamp... A cell's contents is an uninterpreted array of bytes.
"Table rows are sorted by ... the table's primary key... All table accesses are via the primary key
"Row columns are grouped into column families. All column family members have a common prefix.
"Physically, all column family members are stored together on the filesystem [therefore] it is advised that all column family members have the same general access patterns and size characteristics.
"Tables are automatically partitioned horizontally by HBase into regions. Initially, a table comprises of a single region but as the size of the region grows, it splits...Regions are the units that get distributed over an HBase cluster.
"Row updates are atomic no matter how many row columns constitute the row-level transaction." 
Unsurprisingly for a solution based on Hadoop, HBase shares a similar architecture. The "HBase master node orchestrat[es] a cluster of one or more regionserver slaves." Unlike Hadoop, "HBase depends on ZooKeeper ... as the authority on cluster state".

"At a minimum, when bootstrapping a client connection to an HBase cluster, the client must be passed the location of the ZooKeeper ensemble. Thereafter, the client navigates the ZooKeeper hierarchy to learn cluster attributes such as server locations." For the purposes of creating a Spark artifact, this is simply a matter of adding the host machine's hbase-site.xml at the top level.


Find the load the cluster is under with:

echo "status 'detailed'" | hbase shell 2>/dev/null | grep requestsPerSecond | perl -pe s/,.*//g

With this, I've seen nodes in my cluster happily reach 60 000 requests per second, which is most pleasing.

However, the load over the nodes is not terribly evenly distributed. One way to tackle this problem is salting. I did actually preempt this problem by reversing the Strings that were my keys. Since each key has a prefix taken from a fairly small set, I was expecting them to form "lumps" of data. However, I then create the HBase table with something like:

hbaseAdmin.createTable(hTableDescriptor, toBytes('0'), toBytes('z'), 20)

(where 20 is my number of regions).

However, this assumes that the text (even when reversed) that I am using as keys is evenly distributed over the alphanumerics (it's not as it's English words rather than random text). So, I still have some lumpiness.

Tuesday, August 30, 2016

Refresher on Eigenvalues and Eigenvectors

A lot of graph analysis strongly relies on linear algebra so here's a quick refresher on stuff you'll have studied in your undergraduate days.

In the mathematics of graphs, you'll see the mention of eigenvalues and eigenvectors. The general description of them (as in that Wikipedia link) focuses on the rotation, translation and shearing of images. Although absolutely correct, it takes the emphasis away from their most interesting quality: we're solving sets of equations with them.

A set of equations that are formed of constants and variables that are never raised to a power greater than one are called linear equations.


Let's take a straight-forward set of simultaneous equations:

2x-3y+ 8z=4

We can represent the coefficients as a matrix:


Now, we can produce a row reduced matrix by employing any combination of:

  • interchanging two rows
  • multiplying a row by a non-zero constant
  • adding one row to another

to give us:


where the the entire matrix we'll call matrix A but the purple subset we'll call matrix M. The rank (that is, the number of non-zero rows after all row reduction) is greater in A than M indicating that the equations have no solution (since no values of 0x, 0y and 0z can equal the -20 in the last row). It is inconsistent.

The rule here is:
  1. rank M < rank A: no solution
  2. rank M = rank A = number of unknowns: one solution
  3. rank M = rank A = R < n: R unknowns can be expressed in terms of (n - R) unknowns.
An example:

2x-y- 5z=2

when row reduced becomes:


Since this is option #3, and R = 2 and n  = 3, it's not surprising that we can express the original equations in one unknown (z, say)

x = 3 + 2z
y = 4 - z

which with a bit of Python, Matlib and some help from here like this

import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt

mpl.rcParams['legend.fontsize'] = 10
fig = plt.figure()
ax = fig.gca(projection='3d')
theta = np.linspace(-4 * np.pi, 4 * np.pi, 100)
z = np.linspace(-10, 10, 10)
x = 3 + (2 * z)
y = 4 - z
ax.plot(x, y, z, label='x = 3 + 2z; y = 4 - z')

tells us that there are infinite solution - that is, anything on this line:


Like the use of images to demonstrate the properties of matrices, determinants are often described geometrically in the way they define volumes. "For an n x n matrix with columns, the value of det A is the signed volume of the parallelepiped defined by the vectors" [2] This is similar to calculating the cross product. In fact, the triple product that uses both the cross product and the dot product is how to find the volume in 3 dimensions.

But, this is only part of the story. Determinants are interesting because of what they tell us about the matrix.

Calculating determinants is tedious. Thankfully, since my undergraduate days, there have been many libraries written to do it for me (floating-point caveats aside). Now, we just run something like:

import numpy as np

identity = np.eye(3)
print "det(I3) = ", np.linalg.det(identity)

which tells me the (not too surprising) fact that the determinant of the identity matrix (I) over field ℜ3 is 1.0.

More interestingly:

det AB = det BA = (det A) x (det B)

Why is this interesting? Well, if A is a matrix (and matrices can be treated as functions) that has an inverse:

A A-1 = I
det (A A-1) = det(A) x det(A-1) = det I = 1

If the product of terms equals 1, no term can be 0. So, for an inverse matrix to exist, a prerequisite is that the matrix's determinant must be non-zero.

Homogeneous Equations

When the constants on the right hand side of all linear equations in the set are 0, we call them homogeneous equations.

Unlike the equations at the top of this post, "homogeneous equations are never inconsistent; they always have the solution 'all unknowns = 0' (often called the 'trivial solution'). If the number of independent equations (that is, the rank of the matrix) is the same as the number of unknowns, this is the only solution." [1] That is, there are too many contradictory equations (see here for a nice demo).

"If the rank of the matrix is less than the number of unknowns, there are infinitely many solutions." [1] That is, we can only express unknowns in terms of other unknowns as there are no constants (try it!). However, the infinite solutions describe a line, a plane, a hyper-plane (depending on the number of dimensions in which we're working) and we can take unit-vectors over these spaces.

"A system of n homogeneous equations in n unknowns has solutions other than the trivial solution if and only if the determinant of the coefficients is zero" [1]

Eigen- vectors and values

Which finally brings us to eigen- vectors and values. What they are is simple. Again in geometric terms, we can consider them as vectors whose direction does not change under a transformation from a matrix but whose magnitude might. Why they are interesting is more subtle. They come in handy in problems as disparate as Latent Semantic Analysis where we want to know which documents contain which 'concepts' to determining if a graph is acyclic.

Let's take their definition:

M v = λ v

(where M is a matrix, v an eigenvector and λ an eigenvalue - there may be more than one).

Not all matrices have eigen-vectors and -values (over the field on which the matrix acts - see here for more information). Again from a geometric point of view, if we imagine the matrix to represent a rotation, no vector stays pointing in the same direction (note: there may still be solutions but the eigenpairs will be made of complex numbers).

Not all eigenvectors are orthogonal. Generally, they are not although they are for symmetric matrices.

[1] Mathematical Method in the Physical Sciences - Boas.
[2] Coding the Matrix - Klein

Saturday, August 20, 2016

Louvain Modularity and Graph Modularity

Looking at ways to solve my problem of agglomerative clusters when trying to de-dupe data, I turned to community detection algorithms.

Louvain Modularity is a heuristic that tells you about communities in a network (implementations found here). The original paper ("Fast unfolding of communities in large networks") can be found here. The first equations comes without much of an introduction so I scuttled off to find where it came from. The natural choice was a brilliant book by the equation's creator Networks: An Introduction by Mark Newman*.

Newman points out the difficulty in coming up with an algorithm that finds communities. If you want to minimise the number of edges crossing a boundary between two groups, then a naive algorithm will cheerfully put no nodes in one group and all N nodes in the other. So, you might want to tell the algorithm to weight the results by multiplying them by the product of the numbers in each group. This is heading in the right direction as a 0-N split will be the least favourable. In fact, with a bit of simple calculus, you can find that the maximum is (N/2). This is nice but arbitrary.

So, one metric we could use is the difference between the actual number of edges connecting nodes of the same class with the expected number.

The actual number of between classes ci and cj is given by:
Σδ(ci,cj) =½ΣAij δ(ci,cj)

Where δ(ci,cj) is the Kronecker Delta (which is simply 1 if i == j else 0), A is our adjacency matrix and the ½ comes because we don't want to double-count all the edges in the adjacency matrix.

For the expected number, imagine node i that has degree ki. If the graph has m edges, there are obviously 2m ends. Therefore, for any outgoing edge from i, the chances that the other side is node j is kj/2m and for all outgoing edges from i, the chance is kikj/2m.

Add them all up, and the expected number is:

½Σ(ki kj / 2m) δ(ci,cj)

where, again, the ½ prevents us from double-counting.

Now, we take the difference and divide by m (since we interested in fractions not absolute numbers) and we get:

Q = (1/2m)Σ(Aij - ki kj / 2m) δ(ci,cj)

which is the equation in the paper.

* I really can't recommend this book enough. If you've got an undergraduate degree in something mathematically related, you'll find the equations no problem and they are presented in an easy-going manner.

Tuesday, July 19, 2016

Independence Day

In Daphne Koller's excellent "Probabilistic Graphical Models", she talks of independence between variables and leaves as an exercise to the reader (viewer) to prove to themselves that of the following Conditional Probability Distribution (CPD), one represents independent variables and the other represents dependent variables.

Being a bit slow, it took me some time to work it out. One must remember that independence is defined as:

For events α, β then P ⊢ α  β (read: "P satisfies that α and β are independent") if:

P(α, β)  = P(α) P(β)
P(α | β) = P(α) 
P(β | α) = P(β)

These 3 equations are equivalent.

Now, back to our two CPDs, we can see that the ratio of d0 to d1 is the same for either i0 or i1 (in SQL-land, imagine a "group by"; it's called marginalization in probability distributions). That is, having been told the value of i, the probable value of d is unchanged. Therefore, the equations above are satisfied and i and d are independent.

The same is not true of the CPD on the right hand side, therefore in this CPD, i and d are not independent.

In other words, if we call left hand side graph of dependencies G1 and the right hand side G2 then I(G1) = { D ⊥ I } and I(G2) = ∅. This introduces the notion of I-Maps defined below.


"Let P be a distribution over X. We define I(P) to be the set of independence assertions of the form (X ⊥ Y | Z) that hold in P . We can now rewrite the statement that “P satisfies the local independencies associated with G” simply as I(G) ⊆ I(P). In this case, we say that G is an I-map (independency map) for P."

We define P as "a distribution P satisfies the local independencies associated with a graph G if and only if P is representable as a set of CPDs associated with the graph G" [1]

[1] Probabilistic Graphical Models: Principles and Techniques, Koller.

Monday, July 18, 2016

The theory that dare not speak its name

From XKCD (
The author of the excellent XKCD cartoons misrepresents the arguments of Frequentists and Bayesians for the purpose of jocularity but found he'd "kicked a hornet's nest" by doing so.

Bayes Theorem has been controversial for hundreds of years. In fact, books have been written on it (see here for a recent and popular one). One reason is that it depends on subjectivity. But Allen B Downey, author of Think Bayes (freely downloadable from here), thinks this is a good thing. "Science is always based on modeling decisions, and modeling decisions are always subjective.  Bayesian methods make these decisions explicit, and that’s a feature, not a bug. But all hope for objectivity is not lost.  Even if you and I start with different priors, if we see enough data (and agree on how to interpret it) our beliefs will converge.  And if we don’t have enough data to converge, we’ll be able to quantify how much uncertainty remains." [1]

Bayesian Networks

Taking the examples from Probabilistic Graphical Models available at Coursera, imagine this scenario: a student wants a letter of recommendation which depends on their grades, intelligence, SAT scores and course difficulty like so:

The course uses SamIam but I'm going to use Figaro as it's written in Scala. Representing these probability dependencies then looks like this:

class ProbabilisticGraphicalModels {

  implicit val universe = Universe.createNew()

  val Easy          = "Easy"
  val Hard          = "Hard"
  val Dumb          = "Dumb"
  val Smart         = "Smart"
  val A             = "A"
  val B             = "B"
  val C             = "C"
  val GoodSat       = "GoodSat"
  val BadSat        = "BadSat"
  val Letter        = "Letter"
  val NoLetter      = "NoLetter"

  def chancesOfDifficultIs(d: Double): Chain[Boolean, String]
    = Chain(Flip(d), (b: Boolean) => if (b) Constant(Hard) else Constant(Easy))

  def chancesOfSmartIs(d: Double): Chain[Boolean, String]
    = Chain(Flip(d), (b: Boolean) => if (b) Constant(Smart) else Constant(Dumb))

  def gradeDistributionWhen(intelligence: Chain[Boolean, String] = defaultIntelligence,
                            difficulty:   Chain[Boolean, String] = defaultDifficulty): CPD2[String, String, String]
    = CPD(intelligence, difficulty,
    (Dumb, Easy)   -> Select(0.3   -> A,   0.4   -> B,   0.3   -> C),
    (Dumb, Hard)   -> Select(0.05  -> A,   0.25  -> B,   0.7   -> C),
    (Smart, Easy)  -> Select(0.9   -> A,   0.08  -> B,   0.02  -> C),
    (Smart, Hard)  -> Select(0.5   -> A,   0.3   -> B,   0.2   -> C)

  def satDist(intelligence: Chain[Boolean, String] = defaultIntelligence): CPD1[String, String]
    = CPD(intelligence,
    Dumb  -> Select(0.95  -> BadSat,  0.05  -> GoodSat),
    Smart -> Select(0.2   -> BadSat,  0.8   -> GoodSat)

  def letterDist(gradeDist: CPD2[String, String, String] = defaultGradeDist): CPD1[String, String]
    = CPD(gradeDist,
    A -> Select(0.1   -> NoLetter,  0.9   -> Letter),
    B -> Select(0.4   -> NoLetter,  0.6   -> Letter),
    C -> Select(0.99  -> NoLetter,  0.01  -> Letter)

And we can query it like so:

  def probabilityOf[T](target: Element[T], fn: (T) => Boolean): Double = {
    val ve = VariableElimination(target)
    ve.probability(target, fn)

Finally, let's add some sugar so I can use it in ScalaTests:

  val defaultDifficulty   = chancesOfDifficultIs(0.6)
  val easier              = chancesOfDifficultIs(0.5)
  val defaultIntelligence = chancesOfSmartIs(0.7)
  val defaultGradeDist    = gradeDistributionWhen(intelligence = defaultIntelligence, difficulty = defaultDifficulty)
  val defaultLetterDist   = letterDist(defaultGradeDist)
  val defaultSatDist      = satDist(defaultIntelligence)

  def being(x: String): (String) => Boolean = _ == x

  def whenGettingAnABecomesHarder(): Unit = defaultGradeDist.addConstraint(x => if (x == A) 0.1 else 0.9)

  def whenTheCourseIsMoreLikelyToBeHard(): Unit = defaultDifficulty.addConstraint(x => if (x == Hard) 0.99 else 0.01)

  def whenLetterBecomesLessLikely(): Unit = defaultLetterDist.addConstraint(x => if (x == Letter) 0.1 else 0.9)

  def whenTheSatIsKnownToBeGood(): Unit = defaultSatDist.observe(GoodSat)


Flow of Probabilistic Influence

If we wanted to know the chances of receiving a letter of recommendation, we'd just have to run probabilityOf(letterDist, being(Letter))  (which equals about  0.603656). The interesting thing is what happens if we observe some facts - does this change the outcome?

The answer is: it depends.

If we observe the SAT score for an individual, then their probability of receiving the letter changes. For example, executing satDist.observe(GoodSat) means that the chancesOf method now returns a probability of about 0.712 (and similarly a BadSat reduces the probability to about 0.457).

The general idea is given in this screen grab:

From Daphne Koller's Probabilistic Graphical Models

The X-> Y and X <- Y are easy. "In general, probabilistic influence is symmetrical. That is, if X can influence Y, Y can influence X." [2]

Probabilities can flow through nodes in this Bayesian network so X -> W -> Y and X <- W <- Y are not terribly surprising if we know nothing about the intermediate step (the column on the left hand side). Conversely, if we do know something about it, that's all we need and it doesn't matter what the first step does.

The V-Structure (X <- W -> Y) is more interesting. Here, we can use ScalaTest to demonstrate what's going on. The first case is when we have no observed evidence. Here, Koller tells us

"If I tell you that a student took a class and the class is difficult, does that tell you anything about the student's intelligence? And the answer is 'no'."
And sure enough, Figaro agrees:

  "probabilities" should {
    "flow X -> W <- Y ('V-structure')" in new ProbabilisticGraphicalModels {
      val smartBefore = probabilityOf(defaultIntelligence, being(Smart))
      val smartAfter  = probabilityOf(defaultIntelligence, being(Smart))
      smartBefore shouldEqual smartAfter

with the caveat that this is true if and only if
"... W and all of its descendants are not observed."
if it is observed, then the following is true:

    "flow X -> W <- Y ('V-structure')" in new ProbabilisticGraphicalModels {
      // Can difficulty influence intelligence via letter?
      val smartBefore = probabilityOf(defaultIntelligence, being(Smart))
      whenLetterBecomesLessLikely() // "this too activates the V-structure"
      val smartAfter  = probabilityOf(defaultIntelligence, being(Smart))
      smartBefore should be > smartAfter

As mentioned, if we do know something about the intermediate step, the first step can't influence the outcome. Take these examples:
"I know the student got an A in the class, now I'm telling you that the class is really hard.does that change the probability of the distribution of the letter? No because ... the letter only depends on the grade."
  "Given evidence about Z" should {
    "make no difference in (X -> W -> Y)" in new ProbabilisticGraphicalModels {
      val letterChanceBefore  = probabilityOf(defaultLetterDist, being(Letter))
      val letterChanceAfter  = probabilityOf(defaultLetterDist, being(Letter))
      letterChanceBefore shouldEqual letterChanceAfter
"If I tell you that the student is intelligent, then there is no way the SAT can influence the probability influence in grade."
    "make no difference in X <- W -> Y" in new ProbabilisticGraphicalModels {
      val chancesOfABefore = probabilityOf(defaultGradeDist, being(A))
      val chancesOfAAfter = probabilityOf(defaultGradeDist, being(A))
      chancesOfAAfter shouldEqual chancesOfABefore

More notes to come.

[1] Prof Allen B Downey's blog
[2] Daphne Koller, Coursera.