Programming trick: Cantor Pairing (perfect hashing of two integers)
Reading time: 2 min. TL;DR Cantor pairing is a perfect, reversible, hashing function from multiple positive integers to a single positive integer. It’s cool. Its one drawback is that it can output very big numbers.
Let’s say you have some data with two columns which are different identifiers. Maybe your data comes from two different databases, and each one has its unique identifier for individuals, but both unique codings overlap with each other. Or maybe you want to combine encodings from multiple columns into one.
How do you go about that?
One of the better ways is Cantor Pairing, which is the following magic formula:
[latex]Cantor(x, y) = 1/2 * (x + y) * (x + y + 1) + y[/latex]
This takes two positive integers, and returns a unique positive integer. It’s also reversible: given the output of [latex]Cantor(x, y)[/latex] you can retrieve the values of [latex]x[/latex] and [latex]y[/latex]. The only problem with this method is that the size of the output can be large: [latex]Cantor(10^8, 10^8)[/latex] will overflow a 64bit integer [footnote]The trick to solve this is to either factorize the input, or pass in x - min(x). This makes it harder to retrieve x and y, though.[/footnote]. The good news is that this will use all the bits in your integer efficiently from the view of a hashing function. You can also compose the function to map 3 or more numbers into one – for example [latex]Cantor(Cantor(x, y), z)[/latex] maps 3 integers to one.
How does this work?
In your first advanced math class, you probably came across the result that the infinity of real numbers is “bigger” than the infinity of normal numbers, which implied the set [latex]\mathbb{N}[/latex] of natural numbers has the same cardinality as the set [latex](\mathbb{N} \times \mathbb{N})[/latex] of possible combinations of natural numbers [footnote]They’re both countable infinities[/footnote].
While this is cool, it doesn’t seem useful for practical applications. Until you see the diagram of the argument used to prove that fact. [latex]Cantor(x, y)[/latex] is really just the function that represents the diagonal line snaking across the plane, which effectively uses that results to create our perfect hashing function!