
Why is network analysis not popular yet?


TL;DR: There’s no good software stack that makes it easy to do any network analysis task, because we lack a common interface.

We can safely say that data science applications based around linear algebra (machine learning, computational statistics, etc.) have exploded in popularity in the last 5 years.

Why didn’t the same thing happen to applications of graph theory? It’s extremely common to have problems that are best represented as a network. Even finding open source network datasets is easy: Stanford’s SNAP, for instance, hosts a good repository.

I don’t think it’s because of a lack of education. Computer science graduates are all forced to learn their graph theory fundamentals. For social scientists, we have great free courses online like Matt Jackson’s Coursera MOOC.

The software problem

We want graph tooling to run in an interpreted/scripting language, for the same reason Python is popular in the statistics/machine learning stack. You need to be able to explore your data easily, but you also want to build a production-ready application out of it in the same language. Data exploration code can’t be divorced from the code you deploy to production.

You also want the software to scale reasonably well. Most importantly, you need to be able to define your own algorithmic operations on the graph that scale similarly to the library’s “native” functions. Or, at least, you need the ability to compose graph operations from the library’s basic operations (the same way you can build complex operations in Python’s pandas out of its built-in functions).
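To make the pandas analogy concrete, here is a minimal sketch (the DataFrame and column names are made up for illustration) of composing a non-trivial operation, a per-group z-score, entirely from built-in vectorized primitives, so the whole pipeline runs in compiled code rather than a Python loop:

```python
import pandas as pd

# Hypothetical data: group labels and a numeric measurement.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Per-group z-score, composed entirely from built-in primitives:
# groupby + transform broadcast group statistics back to each row.
grouped = df.groupby("group")["value"]
df["zscore"] = (df["value"] - grouped.transform("mean")) / grouped.transform("std")

print(df)
```

This is exactly the kind of composability that graph libraries mostly lack.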

If we look at mature Python libraries, we have a few, and none of them solves our problems: NetworkX is pure Python, flexible, and easy to use, but too slow to scale; libraries like igraph and graph-tool wrap fast C/C++ cores, but each behind its own fixed API.

Now take an application: say you have a graph with a hundred million vertices that you want to get a rough feel for. Of course you can’t visualize it directly – that’s insane. But you could create a graph embedding (using a technique like Node2Vec) and visualize the embedding through a dimensionality reduction algorithm like t-SNE or UMAP. Or you could visualize a sample of the graph drawn by random walks.
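As a sketch of that last idea, here is a minimal random-walk sampler built on NetworkX (the graph, walk count, and walk length are toy placeholders; at a hundred million vertices you would need a scalable backend, which is exactly what’s missing):

```python
import random
import networkx as nx

def random_walk_sample(G, num_walks=100, walk_length=50, seed=0):
    """Return the subgraph touched by a set of random walks."""
    rng = random.Random(seed)
    visited = set()
    nodes = list(G.nodes())
    for _ in range(num_walks):
        node = rng.choice(nodes)  # start each walk at a random vertex
        for _ in range(walk_length):
            visited.add(node)
            neighbors = list(G.neighbors(node))
            if not neighbors:
                break  # dead end: give up and start the next walk
            node = rng.choice(neighbors)
    return G.subgraph(visited).copy()

# Toy usage: sample a small piece of a synthetic scale-free graph.
G = nx.barabasi_albert_graph(10_000, 3)
sample = random_walk_sample(G, num_walks=20, walk_length=40)
print(sample.number_of_nodes(), sample.number_of_edges())
```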

You can’t easily do any of those approaches once you’re locked into one of these libraries and it doesn’t support some part of the plan you laid out.

There are too many libraries, none of them connect to each other, and being “feature complete” is impossible when the domain is “operations related to networks”.

If you look at the linear algebra world (all machine learning, computational statistics, “data science”, etc.) everything is connected together by a common interface (a matrix or dataframe).

This works because the underlying interface is compact and high performance. You can pass around matrices backed by C or Fortran code in Python, and all the machine learning and statistical packages operate on this representation and happily pass results off to each other.
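A minimal illustration of that interoperability (scipy and scikit-learn are just examples; any pair of libraries in the ecosystem works the same way):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# One shared representation: a sparse matrix backed by compiled code.
X = sparse.random(1_000, 200, density=0.01, format="csr", random_state=0)

# scikit-learn consumes the scipy matrix directly...
embedding = TruncatedSVD(n_components=10).fit_transform(X)

# ...and returns a plain NumPy array any other library can consume.
print(type(embedding), embedding.shape)
```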

With networks, libraries can’t interoperate. Either the underlying representation is in pure Python, and it necessarily won’t scale, or it’s in another language, and since no one has standardized the data access patterns, everyone has their own C++ graph representation with its own API, and you can’t stray outside that well-defined box.
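For example, moving a graph between NetworkX and python-igraph (two real libraries, used here purely as an illustration) means dumping to a neutral edge list and rebuilding the whole structure on the other side, since there is no shared in-memory representation:

```python
import networkx as nx
import igraph as ig

# Build a graph in NetworkX...
G_nx = nx.karate_club_graph()

# ...then copy every edge out to move it into igraph. With no shared
# in-memory format, the graph has to be rebuilt from scratch.
edges = list(G_nx.edges())
G_ig = ig.Graph.TupleList(edges)

print(G_nx.number_of_edges(), G_ig.ecount())
```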

The only solution I see to this problem is starting over and writing everything in a single high-performance scripting language. That means someone taking on the effort of doing it in Julia (the only candidate I can see).

But that also implies a great outreach effort: people need to learn not only Julia but also graph theory more broadly for this to become part of the data science bread and butter. So I expect we’ll be sitting around with a pile of great graph data that goes effectively unexplored for a while.

Originally published by Matt Ranger