700x faster node2vec models: fastest random walks on a graph
Intended Audience: Data science and computer nerds
TL;DR: I implemented a method to generate network embeddings that reduces runtime from 32 hours to 3 minutes. It does this by using a compact graph layout that massively accelerates generating random walks.
If you don’t care about how I made it faster, the code is available here.
Graph Embeddings
Graph embeddings are a similar idea to word embeddings. We represent each node in a network as a vector of real numbers, where nodes that are close in the network should be close in vector space [footnote]usually by cosine similarity[/footnote]. This is useful for all the reasons word embeddings are useful – remember, word embeddings jumpstarted the massive progress in Natural Language Processing over the last 5 years[footnote]Say you want a compact representation of the graph for a machine learning model. You could make a “one-hot” matrix where each column represents a node and feed that to the model, but networks often have hundreds of thousands to billions of nodes, making this infeasible. Embeddings bypass that problem.[/footnote].
Currently, the hot way to embed graphs is with methods based on random walks, like node2vec. The random walks on the graph are the “sentences” on which we train a model exactly like a word embedding model.
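To make that pipeline concrete, here’s a minimal sketch of the idea, assuming a recent gensim (4.x) and some toy walks; this is illustrative, not the reference node2vec code:

    from gensim.models import Word2Vec

    # toy "corpus": each walk is a list of node IDs, stringified into tokens
    walks = [["0", "2", "1", "2"], ["1", "0", "2", "0"]]

    # train a skip-gram model on the walks, exactly like on sentences of words
    model = Word2Vec(walks, vector_size=32, window=5, min_count=0, sg=1)
    print(model.wv["2"])  # the embedding vector for node 2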
Needing this for a project, I used the reference code [footnote]Actually, there’s a better implementation here: https://github.com/eliorc/node2vec [/footnote]. One problem: for a reasonably small graph (145k nodes, 335k edges) it took 32 hours to generate random walks.
No wonder graph analysis isn’t popular yet: it’s impossible to get anything done with it.
Stop thinking about distributed clusters, write good code instead
Generating each step of a random walk is [latex]O(1)[/latex], which means it’s supposed to run “instantly” according to researchers. But we’re in 2019. These days, if you take extremely slow code and waste even more electricity by running it on several computers instead of one, the hackernews zeitgeist has decided that we should praise you with words like “scalable” and “distributed” instead of scorning you for being especially wasteful.
This reminds me of the blog post by Adam Drake where using 40-year-old unix tools turns out to be 235x faster than “big data” tools for text processing.
I get results similar to Adam’s (the code I linked runs in 3 minutes instead of 32 hours), with node2vec training speedups in the 350x to 5100x range (no joke). To understand how this is possible, we need to take a detour and re-learn how computers work.
How computers work
Computers are made of a hierarchy of memory caches.
Your CPU operates on registers, which can each hold a few bytes. The registers are fed by the CPU’s L1 and L2 caches, which hold a few megabytes of data. It’s very fast for the CPU to fetch data from the L1 cache. Your computer probably has a few gigabytes of RAM. Accessing RAM from the CPU is about 200x slower than accessing the cache (see below). Accessing your hard drive is even slower. Accessing other computers’ hard drives (hi Hadoop!) is even slower than that.
This is shown in this table:
Computer Latency Comparison
------------------------------------------------
CPU clock cycle (3GHz)                    0.3 ns
L1 cache read                             0.5 ns
Branch mispredict                           5 ns
L2 cache read                               7 ns
RAM read                                  100 ns
Compress 1K bytes with Zippy               3k ns
Read 4K randomly from SSD                150k ns
Read 1 MB sequentially from RAM          250k ns
Round trip within datacenter             500k ns
Read 1 MB sequentially from SSD            1m ns
Hard disk seek                            10m ns
Read 1 MB sequentially from disk          20m ns
Ping CA->Netherlands->CA                 150m ns
Looking at the table above, we see that your CPU can do a lot of work in the time it takes to access RAM! At least 300 clock cycles go by, which can be over a thousand instructions on a modern superscalar CPU. If your code spends its time going back and forth to RAM, you might as well be using a Pentium 4, because your CPU is spending all its time waiting instead of working.
(Watch “Efficiency with algorithms, performance with data structures” by Chandler Carruth for more on this topic).
What does this have to do with graphs?
Graphs are typically laid out as a network of memory pointers referencing each other. This means each node lives in a different place in your computer’s memory. This layout is based on the idea of a linked list, which is nice in computer science theory, but is the worst thing you can do to your computer’s CPU.
Almost every time you move from one node to the next in this layout, you need to jump around the computer’s memory.
The takeaway is that memory access is the main bottleneck, and you should shuffle your data around as little as possible. Also, have bits of data that are used together sit next to each other in memory [footnote]When a CPU pulls data from RAM, it pulls more data than what you requested, assuming you’ll use the bits right after what you requested. This increases the probability of a cache hit.[/footnote].
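You can feel this effect from Python directly. Here’s a quick experiment sketch (the exact ratio will vary by machine): summing an array in order versus gathering it in a random order that defeats the prefetcher:

    import time
    import numpy as np

    a = np.random.rand(10_000_000)
    idx = np.random.permutation(a.shape[0])  # a random visiting order

    t0 = time.perf_counter()
    a.sum()        # sequential pass: the prefetcher keeps the CPU fed
    t1 = time.perf_counter()
    a[idx].sum()   # random gather: most reads miss the cache
    t2 = time.perf_counter()

    print(f"sequential: {t1 - t0:.3f}s, random order: {t2 - t1:.3f}s")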
The simplest way you can do this is to start packing values used together into arrays instead of laying out your memory Object-Oriented style in classes. Object-oriented design is fine some of the time, but in this case, defining a Node class with recursive references to other Node instances leads to severe anti-patterns.
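To make that concrete, here’s a sketch of the two layouts side by side (the Node class and array names are mine, for illustration):

    import numpy as np

    # Pointer-chasing layout: every Node is its own heap allocation,
    # so following edges hops all over RAM.
    class Node:
        def __init__(self, name):
            self.name = name
            self.neighbors = []  # references to other Node objects

    # Array layout: the same graph (edges 0-1, 0-2) packed into flat arrays.
    # Node i's neighbors sit contiguously at neighbors[indptr[i]:indptr[i+1]].
    indptr    = np.array([0, 2, 3, 4])  # where each node's slice starts
    neighbors = np.array([1, 2, 0, 0])  # all edges, back to back in memory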
What should we do instead?
The solution is to represent the graph as a CSR sparse matrix. The CSR matrix holds an array of edge weights, an array of where the edges point to, and an “index pointer” array holding the index in the other arrays where each node’s values start. We define all graph operations on this format, which can sometimes be a little convoluted.
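As an illustration using scipy (not the package’s own internals), here is a tiny weighted graph in CSR form:

    import numpy as np
    from scipy.sparse import csr_matrix

    # adjacency matrix of a small weighted graph
    dense = np.array([[0., 1., 2.],
                      [1., 0., 0.],
                      [2., 0., 0.]])
    graph = csr_matrix(dense)

    print(graph.indptr)   # [0 2 3 4]: node i's edges live at [indptr[i], indptr[i+1])
    print(graph.indices)  # [1 2 0 0]: where each edge points
    print(graph.data)     # [1. 2. 1. 2.]: each edge's weight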
I’ll spare you the dirty details of operating on a CSR matrix in this post. If you’re interested, look at the source code of my package and you should get the idea.
The tradeoff here is that adding new nodes to the graph now requires rebuilding the whole CSR matrix. But since we care about analyzing the graph, not growing it, we can live with this drawback. The upside is the whole “700x faster” thing.
But what about scalability?
First, our graph is now lean on memory, since nodes don’t each need to carry tons of pointers to other nodes. In CSR format, with 64-bit values, you have two values per edge (one in the weight array, one in the edge array) and one per node.
The Facebook network for the USA has about 260m nodes and 16B edges. At 8 bytes per value, that works out to roughly 128GB for the edge array plus about 2GB for the index pointer array: a reasonable ~130GB of RAM to fit an unweighted version in memory (double the edge cost if you keep the weight array).
If you somehow had a graph bigger than Facebook’s, you could always put the arrays as separate files on an SSD, memory-map them, and have the index pointer array point to locations in the other arrays to read directly.
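In numpy terms, that could look something like this sketch (the file names are hypothetical; np.memmap lets the OS page data in from the SSD on demand):

    import numpy as np

    # open the CSR arrays as memory-mapped files instead of loading them into RAM
    indptr = np.memmap("indptr.bin", dtype=np.int64, mode="r")
    dst    = np.memmap("edges.bin", dtype=np.int64, mode="r")

    # reading one node's neighbors only touches the pages holding its slice
    node = 12345
    neighbors = dst[indptr[node]:indptr[node + 1]]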
Beyond that, if your data is somehow orders of magnitude larger than Facebook’s, or you really insist on fleecing the gullible, you can run this on your preferred hadoop/big data framework.
But comparing Numba (compiled) code against Python (interpreted) code is cheating!
It’s not. If we hadn’t changed the layout of the graph in computer memory, numba would not have helped, because the code was not optimizable. The compiler can only reason about the instructions it tells the CPU to execute; if your memory layout is terrible, great instructions will still result in terribly slow code.
Fastest random walks? Seriously?
I looked around, and the simple CSR-form random walk algorithm I wrote seems to be the fastest around. The code is extremely simple, and embarrassingly parallel. The core loop is in the function _csr_random_walk in my code.
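For flavor, here’s a simplified sketch of what such a kernel looks like: an unweighted, first-order walk, not the actual _csr_random_walk from the package (which also handles weights):

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def csr_walks(indptr, dst, walk_len):
        # One uniform random walk per node; node i's neighbors are
        # dst[indptr[i]:indptr[i+1]], so each step is two contiguous array reads.
        n_nodes = indptr.shape[0] - 1
        walks = np.empty((n_nodes, walk_len), dtype=np.int64)
        for i in prange(n_nodes):  # embarrassingly parallel over start nodes
            node = i
            walks[i, 0] = node
            for k in range(1, walk_len):
                start, end = indptr[node], indptr[node + 1]
                if start == end:   # dead end: stay put
                    walks[i, k] = node
                    continue
                # jump to a uniformly chosen neighbor: no pointer chasing
                node = dst[start + np.random.randint(end - start)]
                walks[i, k] = node
        return walks

    # toy triangle graph: edges 0-1, 0-2, 1-2 stored in both directions
    indptr = np.array([0, 2, 4, 6], dtype=np.int64)
    dst    = np.array([1, 2, 0, 2, 0, 1], dtype=np.int64)
    print(csr_walks(indptr, dst, walk_len=5))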
Conclusion
Programmers need to remember how their computers work, and keep the tradeoffs in their data structures in mind. Here, since we don’t care about adding nodes to our graph, we can speed up the code by more than two orders of magnitude.
Writing the package took less time than a single run with the original code!
Beyond that, I’ll keep thinking about graph analytics in the Python environment. I think using the CSR matrix format as the basic building block could bring the kind of advances to graph analytics that Pandas’ column-oriented dataframe brought to data analytics.