The Weisfeiler-Lehman algorithm and estimation on graphs

 

Imagine you have two graphs \(G\) and \(G'\) and you'd like to check how similar they are. If all vertices have unique attributes this is quite easy:

FOR ALL vertices \(v \in G \cup G'\) DO

  • Check that \(v \in G\) and that \(v \in G'\)
  • Check that the neighbors of \(v\) are the same in \(G\) and \(G'\)
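In code, and assuming each graph is given as a dict that maps a vertex's (unique) attribute to the set of its neighbors' attributes, the check might look as follows (a minimal sketch; the representation is an assumption, not anything from the original post):

```python
def same_graph(G, Gp):
    """Compare two graphs given as dicts: vertex attribute -> set of neighbor attributes."""
    if set(G) != set(Gp):                   # every vertex must occur in both graphs
        return False
    return all(G[v] == Gp[v] for v in G)    # and have identical neighborhoods in both
```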

This algorithm can be carried out in time linear in the size of the graph. Alas, many graphs do not have vertex attributes, let alone unique vertex attributes. In fact, graph isomorphism, i.e. the task of checking whether two graphs are identical, is a hard problem (it is still an open research question how hard it really is). In that case the above algorithm cannot be used, since we have no idea which vertices we should match up.

The Weisfeiler-Lehman algorithm is a mechanism for assigning fairly unique attributes efficiently. Note that it isn't guaranteed to work, as discussed in this paper by Douglas; after all, a guarantee would solve the graph isomorphism problem. The idea is to repeatedly assign fingerprints to vertices based on their neighborhoods. We assume that vertices have an attribute to begin with; if they don't, simply assign all of them the attribute 1. Each iteration proceeds as follows:

FOR ALL vertices \(v \in G\) DO

  • Compute a hash of \((a_v, a_{v_1}, \ldots, a_{v_n})\), where the \(a_{v_i}\) are the attributes of the neighbors of vertex \(v\).
  • Use the hash as the vertex attribute of \(v\) in the next iteration.

The algorithm terminates once the iteration has converged, i.e. once the partition of vertices induced by the hashes stops refining.
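A minimal sketch of this refinement in Python, assuming the graph is given as an adjacency-list dict; the hash and the convergence test are deliberately simple and not meant to match any particular implementation:

```python
def wl_labels(adj, labels=None, max_iter=10):
    """Weisfeiler-Lehman refinement on a graph given as {vertex: [neighbors]}.

    Returns the sequence of label assignments, one dict per iteration.
    """
    if labels is None:
        labels = {v: 1 for v in adj}        # no attributes: start everyone at 1
    history = [labels]
    for _ in range(max_iter):
        new_labels = {}
        for v, nbrs in adj.items():
            # fingerprint of v: its own label plus the multiset of neighbor labels
            signature = (labels[v], tuple(sorted(labels[u] for u in nbrs)))
            new_labels[v] = hash(signature)
        # stop once the number of distinct labels no longer grows
        if len(set(new_labels.values())) == len(set(labels.values())):
            break
        labels = new_labels
        history.append(labels)
    return history
```

On a triangle, for instance, `wl_labels({0: [1, 2], 1: [0, 2], 2: [0, 1]})` converges immediately, since all vertices remain indistinguishable.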

Note that it is not guaranteed to work for all graphs. In particular, it fails for graphs with a high degree of symmetry, e.g. chains, complete graphs, tori and stars. However, whenever it converges to a unique vertex attribute assignment it provides a certificate for graph isomorphism. Moreover, the sets of vertex attributes can be used to show that two graphs are not isomorphic (it suffices to verify that the sets differ at any stage).

Shervashidze et al. 2012 use this idea to define a similarity measure between graphs. Basically, the idea is that two graphs are similar if many of their vertex identifiers match, since this implies that the associated subgraphs match. Formally, they compute a kernel via

$$k(G,G') = \sum_{i=1}^d \lambda_i \sum_{v \in V} \sum_{v' \in V'} \delta(a(v,i), a(v',i))$$

Here \(a(v,i)\) denotes the vertex attribute of \(v\) after WL iteration \(i\), and \(\delta\) is 1 if its two arguments are equal and 0 otherwise. Moreover, the \(\lambda_i\) are nonnegative coefficients that weigh how much the similarity at level \(i\) matters. Rather than computing the above equality test by brute force, we can sort the vertex attribute sets. Note that vertices that have different attributes at any given iteration will never have the same attribute thereafter. This means that we can compare the two sets at all depths at a cost of at most \(O(d \cdot (|V| + |V'|))\).
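Assuming the `wl_labels` sketch above, counting matching labels per iteration replaces the double sum over \(V \times V'\); here is a rough version with uniform weights \(\lambda_i\) by default:

```python
from collections import Counter

def wl_kernel(adj_G, adj_Gp, depth=3, lam=None):
    """k(G, G') = sum_i lam[i] * (number of label matches after iteration i)."""
    if lam is None:
        lam = [1.0] * (depth + 1)
    hist_G = wl_labels(adj_G, max_iter=depth)
    hist_Gp = wl_labels(adj_Gp, max_iter=depth)
    k = 0.0
    for i, (lab_G, lab_Gp) in enumerate(zip(hist_G, hist_Gp)):
        c_G, c_Gp = Counter(lab_G.values()), Counter(lab_Gp.values())
        # sum over v, v' of delta(a(v,i), a(v',i)) = sum over labels of count_G * count_G'
        k += lam[i] * sum(c_G[a] * c_Gp[a] for a in c_G)
    return k
```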

A similar trick is possible if we want to perform estimation on the vertices of a single graph, since the set of attributes that a vertex accumulates over the iterations can serve as its features. Finally, we can make our life even easier if we don't compute kernels at all and instead use a linear classifier on the vertex attributes directly, as sketched below.
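For instance, one could one-hot encode the labels a vertex collects across iterations and feed them to any off-the-shelf linear model. The sketch below uses scikit-learn's DictVectorizer and LogisticRegression; `adj` and `vertex_targets` are hypothetical placeholders for your graph and per-vertex labels:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def vertex_features(adj, depth=3):
    """One sparse feature per (iteration, WL label) pair, for every vertex."""
    history = wl_labels(adj, max_iter=depth)     # from the sketch above
    return [{f"wl{i}_{labels[v]}": 1.0 for i, labels in enumerate(history)}
            for v in adj]

vec = DictVectorizer()
X = vec.fit_transform(vertex_features(adj, depth=3))
clf = LogisticRegression().fit(X, vertex_targets)   # vertex_targets: one label per vertex
```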

Source: Adventures in Data Land

 

Beware the bandwidth gap – speeding up optimization

Disks are slow and RAM is fast. Everyone knows that. But many optimization algorithms don't take advantage of this. More to the point, disks currently stream at about 100-200 MB/s, solid state drives stream at over 500 MB/s with 1000x lower latency than disks, and main memory reigns supreme at about 10-100 GB/s bandwidth (depending on how many memory banks you have). This means that it is about 100 times more expensive to retrieve instances from disk than to recycle them once they're already in memory. CPU caches are faster still, with 100-1000 GB/s of bandwidth. Everyone knows this. If not, read Jeff Dean's slides. Page 13 is pure gold.

Ok, so what does this mean for machine learning? If you can keep things in memory, you can do things way faster. This is the main idea behind Spark. It’s a wonderful alternative to Hadoop. In other words, if your data fits into memory, you’re safe and you can process data way faster. A lot of datasets that are considered big in academia fit this bill. But what about real big data? Essentially you have two options – have the systems designer do the hard work or change your algorithm. This post is about the latter. And yes, there’s a good case to be made about who should do the work: the machine learners or the folks designing the computational infrastructure (I think it’s both).

So here's the problem: many online algorithms load data from disk, stream it through memory as efficiently as possible, and discard it after seeing it once, only to pick it up later for another pass through the data. That is, these algorithms are disk bound rather than CPU bound. Several solvers try to address this by making the disk representation more efficient, e.g. Liblinear or VowpalWabbit, both of which use their own internal representation for efficiency. While this still makes for quite efficient code that can process up to 3TB of data per hour in any given pass, main memory is still much faster. This has led to the misconception that machine learning algorithms are inherently disk bound. But they aren't …

What if we could re-use the data that's already in memory? For instance, use a ring buffer that the disk writes into (slowly) and the CPU reads from (roughly 100 times faster). The problem is what to do with an observation that we've already processed. A naive strategy would be to pretend that it is a new instance, i.e. we could simply update on it more than once. But this is very messy, since we would need to keep track of how many times we've seen the instance before, and it creates nonstationarity in the training set.
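A minimal sketch of this reuse idea, just to fix terms: a bounded in-memory buffer that a slow disk reader fills while the optimizer performs several updates per freshly read example (capacity and reuse factor are illustrative):

```python
import random
from collections import deque

class RingBuffer:
    """Keep the most recently streamed examples in memory for reuse."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)   # oldest items fall out as new ones arrive

    def append(self, example):              # called by the (slow) disk reader
        self.buf.append(example)

    def sample(self):                       # called by the (fast) optimizer
        return random.choice(self.buf)

def train(stream, update, capacity=100_000, reuse=10):
    """Perform `reuse` in-memory updates for every example read from disk."""
    buf = RingBuffer(capacity)
    for example in stream:                  # disk bound: ~100-200 MB/s
        buf.append(example)
        for _ in range(reuse):              # memory bound: orders of magnitude faster
            update(buf.sample())
```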

A much cleaner strategy is to switch to dual variables, similar to the updates in the Dualon of Shalev-Shwartz and Singer. This is what Shin Matsushima did in our dual cached loops paper. Have a look at StreamSVM here. Essentially, it keeps data in memory in a ring buffer and updates the dual variables. This way we are guaranteed to make progress at each step, even if we revisit the same observation more than once. To see what happens, have a look at the graph below:

It's just as fast as LibLinear, provided that everything fits in memory. Algorithmically, what happens in the SVM case is that one updates the Lagrange multipliers \(\alpha_i\) while simultaneously keeping an estimate of the parameter vector \(w\) available.
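As a rough illustration (standard dual coordinate descent for a linear SVM with hinge loss, not the actual StreamSVM code), a single per-example step might look like this; revisiting an in-memory example simply re-optimizes its \(\alpha_i\) and can only improve the dual objective:

```python
import numpy as np

def dual_step(x, y, i, alpha, w, C=1.0):
    """One dual coordinate step for a linear SVM; keeps w = sum_j alpha[j] y_j x_j in sync."""
    grad = y * w.dot(x) - 1.0                           # gradient of the dual w.r.t. alpha[i]
    old = alpha[i]
    alpha[i] = min(max(old - grad / x.dot(x), 0.0), C)  # coordinate step, clipped to [0, C]
    w += (alpha[i] - old) * y * x                       # update the primal estimate in place
```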

That said, the strategy is more general: reuse data several times for optimization while it is in memory. If possible, perform successive updates on the variables of an optimization problem that is well defined regardless of the order in which (and how frequently) the data is seen.

Source: Adventures in Data Land

 

Distributing Data in a Parameterserver

 

One of the key features of a parameter server is that it, well, serves parameters. In particular, it serves more parameters than a single machine can typically hold and provides more bandwidth than a single machine offers.

[Figure: clients and server machines arranged as a bipartite graph]

A sensible strategy to increase both aspects is to arrange data in the form of a bipartite graph with clients on one side and the server machines on the other. This way bandwidth and storage increase linearly with the number of machines involved. This is well understood. For instance, distributed (key,value) stores such as memcached or Basho Riak use it. It dates back to the ideas put forward e.g. in the STOC 1997 paper by David Karger et al. on Consistent Hashing and Random Trees.

A key problem is that we obviously cannot store a mapping table from keys to machines. This would require a database of the same size as the set of keys, which would need to be maintained and updated on each client. One way around this is to use the argmin hash mapping. That is, given a machine pool \(M\), we assign a given (key,value) pair to the machine with the smallest hash, i.e.

$$m(k, M) = \mathrm{argmin}_{m \in M} h(m, k)$$

The advantage of this scheme is that it allows for really good load balancing and repair. First off, the load is almost uniformly distributed, short of a small number of heavy hitters. Secondly, if a machine is removed from or added to the machine pool, rebalancing affects all other machines uniformly. To see this, notice that for any given key the machines with the smallest and second-smallest hash values are chosen uniformly at random from the pool.
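A toy version of the assignment, using Python's built-in hash of a (machine, key) pair as a stand-in for \(h(m,k)\) (machine and key names are made up):

```python
def argmin_hash(key, machines):
    """Assign a key to the machine with the smallest hash h(m, k)."""
    return min(machines, key=lambda m: hash((m, key)))

machines = [f"server{i}" for i in range(8)]
# every key independently picks an (almost) uniformly random machine
owner = {k: argmin_hash(k, machines) for k in ("w_17", "w_42", "bias")}
```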

Unfortunately, this is a stupid way of distributing (key,value) pairs for machine learning. And it is what we did in our 2010 VLDB and 2012 WSDM papers. In our defense, we didn't know any better. And others copied the approach … after all, how can you improve on such nice rebalancing properties?

This raises the question of why it is a bad idea. It all comes down to synchronization. Basically, whenever a client attempts to synchronize its keys, it needs to traverse the list of keys it owns and communicate with the appropriate servers. In the above scheme, this means contacting a different, essentially random server for each key. This is amazingly costly. Probably the best comparison would be a P2P network where each byte is owned by a different machine. Downloads would take forever.

We 'fixed' this problem by cleverly reordering the accesses and then performing a few other steps of randomization. There's even a nice load-balancing lemma in the 2012 WSDM paper. However, a much better solution is to prevent the problem from happening in the first place and to borrow from key distribution algorithms such as Chord. There, servers are inserted into a ring via a hash function, and so are keys. This means that each server now owns a contiguous segment of keys. As a result, we can easily determine which keys go to which server simply by knowing where in the ring the server sits.

[Figure: keys (red stars) and servers hashed onto a ring; each server owns the segment to its left]

In the picture above, keys are represented by little red stars. They are assigned via \(h(k)\) to the segments 'owned' by servers \(s\), which are inserted into the ring in the same way, i.e. via \(h(s)\). Each server 'owns' the segment to its left. Also have a look at the Amazon Dynamo paper for a related description.

Obviously, such load balancing isn't quite as ideal as the argmin hash. For instance, if a machine fails, the next machine inherits the entire segment. However, by inserting each server \(\log n\) times we can ensure a good load balance, and also that when a machine is removed there are several other machines that pick up its work. Moreover, it is now also very easy to replicate things (more on this later). If you're curious about how to do this, have a look at Amar Phanishayee's excellent thesis. In a nutshell, the machines to the left hold the replicas. More details in the next post.
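To make the ring concrete, here is a compact sketch with \(\log n\) virtual nodes per server, using MD5 as the hash and bisection on the sorted ring positions (all names and constants are illustrative):

```python
import bisect
import hashlib
import math

def _h(s):
    """Hash a string onto the ring [0, 2^32)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (1 << 32)

class HashRing:
    def __init__(self, servers):
        copies = max(1, int(math.log2(len(servers))))   # log n virtual nodes per server
        self.ring = sorted((_h(f"{s}#{c}"), s) for s in servers for c in range(copies))
        self.points = [p for p, _ in self.ring]

    def owner(self, key):
        """Return the server owning the segment that h(key) falls into."""
        i = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing([f"server{i}" for i in range(8)])
print(ring.owner("w_17"))    # the server just clockwise of h("w_17")
```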

Source: Adventures in Data Land