Let’s imagine we have 100,000 points and we want to find the distances for all the pairs that are within 10km of each other. Would we need to calculate 100,000 * 100,000 = 10,000,000,000 distances (10 billion), how long would that take, and would the result even fit into RAM? I had wanted to experiment with NumPy for a long time and saw this as the perfect setting. NumPy’s arrays are more compact than Python lists (taking up much less RAM), its operations are generally much faster and more convenient, and they can be vectorised to run element-wise.
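
A quick back-of-the-envelope calculation answers the RAM question (a rough sketch, simply assuming the distances are stored as 8-byte floats):

n = 100000
pairs = n * n                    # 10,000,000,000 candidate pairs
bytes_needed = pairs * 8         # 8 bytes per float64 distance
print(bytes_needed / 1e9, "GB")  # ~80 GB - far beyond a typical machine's RAM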

The first step is to recognise that we don’t want to calculate 10 billion distances – we want to store our data in a format where we can first pick out all the pairs of points that fall within 10km of each other (which will still be millions) and then calculate the distance only for those. This means it would be silly to use plain lists and arrays – we want to take advantage of the spatial structure of our data.

Option 1: Geohashing – this is a hierarchical spatial data structure which subdivides space into grid-shaped buckets. It can be used for proximity search (the idea is that two nearby locations will share a common prefix), provided we apply it over an area with a Cartesian coordinate system (meaning we project our geographic latitude/longitude coordinates).
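
To illustrate the shared-prefix idea in isolation, here is a minimal sketch (this assumes the pygeohash package is installed; the two points are just illustrative values in central London):

import pygeohash as pgh

# Two points roughly a kilometre apart
a = pgh.encode(51.5007, -0.1246, precision=9)
b = pgh.encode(51.5014, -0.1419, precision=9)

# Nearby points tend to share a leading prefix; the longer the match, the closer they are
common = 0
while common < min(len(a), len(b)) and a[common] == b[common]:
    common += 1
print(a, b, "shared prefix length:", common)
# Caveat: two points either side of a cell boundary can be close yet share a short
# prefix, which is why geohash-based indexes also check the neighbouring cells.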

This example from geoindex is quite nice:

import random
from geoindex import GeoGridIndex, GeoPoint

geo_index = GeoGridIndex()
# Create 10,000 random points and add them to the index
for _ in range(10000):
    lat = random.random()*180 - 90
    lng = random.random()*360 - 180
    geo_index.add_point(GeoPoint(lat, lng))
# Create a point to search around
center_point = GeoPoint(37.7772448, -122.3955118)
# Get points within 10km
for distance, point in geo_index.get_nearest_points(center_point, 10, 'km'):
    print("We found {0} in {1} km".format(point, distance))

Option 2: K-D Tree – a data structure for organising points in a space with k dimensions. It is a binary search tree with additional constraints imposed. K-D trees are very useful for range searches (each level of the tree halves the domain along one dimension) and for nearest-neighbour searches (e.g. what is the closest point?); for two-dimensional data they look like this:

[Figure: a k-d tree decomposition of 2-D space]
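
To show the two query types in isolation, here is a minimal sketch on some made-up Cartesian points (the coordinates and point count are illustrative only):

import numpy as np
from scipy.spatial import cKDTree

# 1,000 random points on a 100km x 100km plane (units: metres)
rng = np.random.RandomState(0)
pts = rng.uniform(0, 100000, size=(1000, 2))
tree = cKDTree(pts)

# Nearest-neighbour search: which point is closest to the centre?
dist, idx = tree.query([50000, 50000], k=1)
print("Nearest point is index %d at %.0f metres" % (idx, dist))

# Range search: all points within 10km of the centre
neighbours = tree.query_ball_point([50000, 50000], r=10000)
print("%d points within 10km" % len(neighbours))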

In the example below I have gone with the K-D tree approach (using scipy.spatial.cKDTree). The steps I take are:

  • Import points into a numpy array (“genfromtxt”)
  • Project the points from WGS 84 to OSGB 36 using transformation code 5339, which gives an accuracy of 1.0 metres: http://epsg.io/27700-5339
  • Create a kd-tree using scipy.spatial.cKDTree (a C-implementation of spatial.KDTree)
  • Create a sparse distance matrix with a cut-off of 10,000 metres (using the default 'minkowski' p=2 distance, since we have 2D data on a Cartesian plane)
  • Extract the lower triangle (our distances are symmetric)
  • Export the distances to a CSV along with the raw data

With an input of 100k points we end up with around 30 million distance calculations (all the pairs within 10km) and this whole process takes only 152 seconds. Saving the CSV takes a further 338 seconds, for a total of 490 seconds (peak RAM use was around 13GB).

import numpy as np
import csv
import time
from pyproj import Proj, transform
from scipy.spatial import cKDTree
from scipy import sparse

#http://epsg.io/27700-5339 (accuracy: 1.0m)
uk = '+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 \
+x_0=400000 +y_0=-100000 +ellps=airy \
+towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs'

path_to_csv = '.../huge_points_test.csv'
out_csv = '.../out.csv'

# [A] Import points in geographic co-ordinates: (Lat|Lon)
points = np.genfromtxt(path_to_csv,
                       delimiter=',',
                       skip_header=1,
                       usecols=(0,1),
                       dtype=(float, float))

print("Total points: %d" % len(points))
print("Triangular matrix contains: %d" % (len(points)*((len(points))-1)*0.5))

# [B] Get projected co-ordinates
def proj_arr(points,proj_to):
    """
    Project geographic co-ordinates to get cartesian x,y
    Transform(origin|destination|lon|lat)
    """
    inproj = Proj(init='epsg:4326')
    outproj = Proj(proj_to)
    func = lambda x: transform(inproj,outproj,x[1],x[0])
    return np.array(list(map(func, points)))

tstart = time.time()
proj_pnts = proj_arr(points, uk)

# [C] Fill kd-tree
tree = cKDTree(proj_pnts)
cut_off_metres = 10000 #10KM (our projection is in metres!)
tree_dist = tree.sparse_distance_matrix(tree,
                                        cut_off_metres,
                                        p=2) 

# [D] Extract triangle
udist = sparse.tril(tree_dist, k=-1)    # keep the strict lower triangle: drops the zero diagonal and the duplicate (i,j)/(j,i) pairs
print("Extracted all distances in %.0f seconds" % (time.time()-tstart))
print("Distances after quad-tree cut-off: %d " % len(udist.data))

# [E] Export CSV
f = open(out_csv, 'w', newline='')
w = csv.writer(f, delimiter=",", )
w.writerow(['lat_a','lon_a',
            'lat_b','lon_b',
            'proj_metres'])
w.writerows(np.column_stack((
            points[udist.row],
            points[udist.col],
            udist.data)))
f.close()
print("Finished in %.0f seconds" % (time.time()-tstart))

#Total points: 100,000
#Triangular matrix contains: 4,999,950,000 (5 bill)
#Distances after k-d tree cut-off: 29,098,208 (30 mill)

#Extracted all distances in 152 seconds
#Saved CSV in 338 seconds

#Finished in 490 seconds = around 8 minutes
#Peak RAM use: 12.5GB

""" Generate labels """
path_to_labels = '.../lbl_huge_points_test.csv'
out_labels = '.../out_labels.csv'

# Import ID labels (postcodes)
id_labels = np.genfromtxt(path_to_labels,
                       delimiter=',',
                       skip_header=1,
                       dtype='S10')

# Export CSV
f = open(out_labels, 'w', newline='')
w = csv.writer(f, delimiter=",", )
w.writerow(['id_a','id_b'])
w.writerows(np.column_stack((
            id_labels[udist.row],
            id_labels[udist.col])))
f.close()

It is interesting to analyse just how accurate our distances are.

Vincenty’s formula is a commonly used method for generating accurate geodesic distances across the Earth (it models the Earth as an oblate spheroid, which is more accurate than treating it as a perfect sphere). The maths is an iterative, two-step process: it first evaluates lambda repeatedly until it converges to the desired degree of accuracy, and then computes the final distance. I used geopy.distance.vincenty below (with the default WGS-84 ellipsoid), along with geopy.distance.great_circle (which assumes the Earth is a sphere), to generate some more distances for the 30 million pairs.
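
Before mapping these over all 30 million pairs, this is what the two geopy calls look like on a single (made-up) pair of coordinates:

from geopy.distance import vincenty, great_circle

# Two illustrative points (lat, lon), roughly 5km apart
a = (51.5007, -0.1246)
b = (51.5321, -0.1744)

print(vincenty(a, b).m)       # geodesic distance on the WGS-84 ellipsoid
print(great_circle(a, b).m)   # spherical approximation - close, but not identical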

Using numpy I then wrote a few vectorised helper functions to establish that the vincenty calculation was on average only 1.53 metres apart (0.02%) from the minkowski distance on the projected co-ordinates, with a maximum divergence of just 14.90 metres (0.15%). However, relative to the great-circle distance, vincenty was on average 10.54 metres apart (0.17%), with a maximum of 32.31 metres (0.34%). Hence, taking the vincenty calculation as our goal-post, the projected coordinates were very close!

"""
Part A - Analyse difference in distances relative to vincenty (most accurate)
"""
from geopy.distance import vincenty

vout_csv = '.../out_vin.csv'

# Map vincenty function to unprojected co-ordinates
tstart = time.time()
func = lambda x: vincenty(x[0:2],x[2:4]).m
output = list(map(func,
                  np.column_stack((points[udist.row],
                                   points[udist.col]))))
print("Extracted all distances in %.0f seconds" % (time.time()-tstart))

# Export CSV
f = open(vout_csv, 'w', newline='')
w = csv.writer(f, delimiter=",", )
w.writerow(['vincenty_metres'])
w.writerows(np.column_stack(output).T)
f.close()
print("Finished in %.0f seconds" % (time.time()-tstart))

#Extracted all distances in 866 seconds
#Finished in 1061 seconds

""" Test great-circle"""
from geopy.distance import great_circle

gout_csv = '.../out_gcirc.csv'

# Map great_circle  function to unprojected co-ordinates
tstart = time.time()
func = lambda x: great_circle(x[0:2],x[2:4]).m
output = list(map(func,
                  np.column_stack((points[udist.row],
                                   points[udist.col]))))
print("Extracted all distances in %.0f seconds" % (time.time()-tstart))

# Export CSV
f = open(gout_csv, 'w', newline='')
w = csv.writer(f, delimiter=",", )
w.writerow(['gcircle_metres'])
w.writerows(np.column_stack(output).T)
f.close()
print("Finished in %.0f seconds" % (time.time()-tstart))

#Extracted all distances in 486 seconds
#Finished in 684 seconds

# Load up last column (projected-metres)
out = np.genfromtxt(out_csv,skip_header=1,dtype=float,delimiter=',',usecols=(-1))
# Load vincenty distance
vout = np.genfromtxt(vout_csv,skip_header=1,dtype=float,delimiter=',',usecols=(0))

def average_diff(x,y):
    return np.average( np.absolute( np.subtract(x,y) ) )

def maximum_diff(x,y):
    return np.amax( np.absolute( np.subtract(x,y) ) )

def scaled_average_diff(x,y):
    return np.multiply( np.average( np.divide( np.absolute( np.subtract(x,y) ), x ) ), 100)

def maximum_average_diff(x,y):
    return np.multiply( np.amax( np.divide( np.absolute( np.subtract(x,y) ), x ) ), 100)

#######################
# Projected coordinates
#######################
# Report average of absolute differences
print(average_diff(out,vout))  # 1.53 metres
# Report average of scaled absolute differences
print(scaled_average_diff(out,vout)) # 0.02%

# Maximum difference:
print(maximum_diff(out,vout)) # 14.90 metres
# Maximum scaled difference:
print(maximum_average_diff(out,vout)) # 0.15%

#######################
# Great Circle
#######################
# Load great-circle distance
gout = np.genfromtxt(gout_csv,skip_header=1,dtype=float,delimiter=',',usecols=(0))
# Report average of absolute differences
print(average_diff(gout,vout)) # 10.54 metres
# Report average of scaled absolute differences
print(scaled_average_diff(gout,vout)) # 0.17%

# Maximum difference:
print(maximum_diff(gout,vout)) # 32.31 metres
# Maximum scaled difference:
print(maximum_average_diff(gout,vout)) # 0.34%

The vincenty distances took 866 seconds to compute (compared to 486 seconds for the great-circle distances – vincenty is a lot more computationally intensive!). Below we will look at ways of using numpy to speed up our distance calculations.

The haversine formula uses the great-circle distance to get the distance between two points on a sphere. It is not as accurate as the vincenty formula, but this also means it is a relatively easy formula to vectorise.

# the core of the haversine formula (dlat, dlon, lat1, lat2 already in radians):
a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
d = 2 * 6371 * 1000 * math.asin(math.sqrt(a)) # metres

We can see that using a standard list and looping through it (or mapping a function over it) takes ages. Curiously, for scalars the trigonometric functions from math are faster than those from np (however they can’t be vectorised). Passing the whole numpy array to the function is much quicker – and quicker still is re-writing the function to perform all operations element-wise ("broadcasting_based"):

"""
Part B -Some speed tests on the haversine formula
"""
import math
import numpy as np

def math_haversine(origin,
                   destination):
    """
    Standard formula: Find distance between a pair of lat/lng coordinates
    """
    # convert decimal degrees to radians
    lat1, lon1, lat2, lon2 = map(math.radians, [origin[0],origin[1],destination[0],destination[1]])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    c = 2 * math.asin(math.sqrt(a))
    r = 6371*1000  # Radius of earth in m
    return c*r

def np_haversine(origin,
                 destination):
    # convert decimal degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, [origin[0],origin[1],destination[0],destination[1]])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371*1000  # Radius of earth in m
    return c*r

def broadcasting_based(df):
    """
    Vectorised
    lat_a,lon_a,lat_b,lon_b
    """
    data = np.deg2rad(df)
    diff_lat = data[:,2] - data[:,0]
    diff_lng = data[:,3] - data[:,1]
    d = np.sin(diff_lat/2)**2 + np.cos(data[:,0])*np.cos(data[:,2]) * np.sin(diff_lng/2)**2
    return 2 * 6371 * 1000 * np.arcsin(np.sqrt(d))

# Test first on 5 mill
out_csv = '.../out.csv'
out = np.genfromtxt(out_csv,skip_header=1,dtype=float,delimiter=',',usecols=(0,1,2,3),max_rows=5000000)

%time distances_loop_n = [np_haversine(x[0:2],x[2:4]) for x in out]
#Wall time: 42.7 s
%time distances_loop_m = [math_haversine(x[0:2],x[2:4]) for x in out]
#Wall time: 14.6 s
%time distances_loop_f = list(map(lambda x: math_haversine(x[0:2],x[2:4]),out))
#Wall time: 15 s
%time distances_vec = np_haversine(out[:,0:2].T,out[:,2:4].T)
#Wall time: 511 ms
%time distances_br = broadcasting_based(out)
#Wall time: 479 ms (100 x faster than the first one!)

print(distances_loop_n[0])
print(distances_loop_m[0])
print(distances_loop_f[0])
print(distances_vec[0])
print(distances_br[0])

out = np.genfromtxt(out_csv,skip_header=1,dtype=float,delimiter=',',usecols=(0,1,2,3))
vout = np.genfromtxt(vout_csv,skip_header=1,dtype=float,delimiter=',',usecols=(0))

%time hav_dist = broadcasting_based(out)
#Wall time: 2.85 s (for 29,098,208)
print(hav_dist[0])
#5678.96730132

"""
2.85 seconds is much faster than 152 seconds, but ...
That is ignoring how we got the array in the first place!
Triangular matrix contains: 4,999,950,000 (5 bill)
2.85 seconds * (4999950000/29098208) = 489 seconds
Then we would eliminate those > 10km ... assuming we can hold the whole thing in RAM
However, the k-d tree method gives us the distances in 152 seconds
So it is much faster! (and more importantly feasible as RAM is limited)
"""

def average_diff(x,y):
    return np.average( np.absolute( np.subtract(x,y) ) )

def maximum_diff(x,y):
    return np.amax( np.absolute( np.subtract(x,y) ) )

def scaled_average_diff(x,y):
    return np.multiply( np.average( np.divide( np.absolute( np.subtract(x,y) ), x ) ), 100)

def maximum_average_diff(x,y):
    return np.multiply( np.amax( np.divide( np.absolute( np.subtract(x,y) ), x ) ), 100)

#######################
# Haversine formula
#######################
# Report average of absolute differences
print(average_diff(hav_dist,vout)) # 12.28 metres
# Report average of scaled absolute differences
print(scaled_average_diff(hav_dist,vout)) # 0.20%

# Maximum difference:
print(maximum_diff(hav_dist,vout)) # 34.97 metres
# Maximum scaled difference:
print(maximum_average_diff(hav_dist,vout)) # 0.37%

So we see that the geopy.distance.vincenty calculation takes 866 seconds, geopy.distance.great_circle takes 486 seconds, and our vectorised numpy haversine formula takes 2.85 seconds (for 30 million distances). Hence, if accuracy is not so important (or we have hundreds of millions of distances to calculate) it may be better to use this rather than the vincenty formula. But what about the projected coordinates? It is not fair to compare these calculations on the 30 million pre-selected pairs to cKDTree.sparse_distance_matrix: although the latter computes 30 million distances (in 152 seconds), it has to consider all 10 billion potential pairings to find them!

So it may be interesting to use our vectorised function to create a distance matrix. With a slight re-write we can now input a list of co-ordinates (rather than a list of pairs) and get back a full distance matrix. However, this is wasteful since half the entries are duplicates, so we can modify the function to only return the upper triangle ("broadcasting_based_trima"). This function calculates close to 200 million distances (199,990,000) in 23.3 seconds.

Can we get even faster? If we project our co-ordinates then we can use the super-fast scipy.spatial.distance.cdist function (using the minkowski distance with p=2). It takes 0.45 seconds to project the points and 2.52 seconds to get the full distance matrix – less than 3 seconds for 200 million unique distances (and, as we saw above, the projection-based approach was more accurate than the haversine approximation)!

"""
Part C - Distance Matricies

Lets find a way to compare the numpy vectorised code; instead of
using the pairings generated by cKDTree's distance matrix let's create our
own, by slightly revising the code of the vectorised functions:
"""
import numpy as np

path_to_csv = '.../huge_points_test.csv'
points = np.genfromtxt(path_to_csv,
                       delimiter=',',
                       skip_header=1,
                       usecols=(0,1),
                       dtype=(float, float))

def broadcasting_based_ma(df):
    """
    Cross every co-ordinate with every co-ordinate.
    lat, lon
    """
    data = np.deg2rad(df)
    lat = data[:,0]
    lng = data[:,1]
    diff_lat = lat[:,None] - lat
    diff_lng = lng[:,None] - lng
    d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
    return 2 * 6371 * 1000 * np.arcsin(np.sqrt(d))

def broadcasting_based_trima(df):
    """
    Create an upper triangular distance matrix to
    save resources.
    lat, lon
    """
    data = np.deg2rad(df)
    lat = data[:,0]
    lng = data[:,1]
    idx1,idx2 = np.triu_indices(lat.size,1)
    diff_lat = lat[idx2] - lat[idx1]
    diff_lng = lng[idx2] - lng[idx1]
    d = np.sin(diff_lat/2)**2 + np.cos(lat[idx2])*np.cos(lat[idx1]) * np.sin(diff_lng/2)**2
    return 2 * 6371 * 1000 * np.arcsin(np.sqrt(d))    

# Try creating a distance matrix with 20,000 points

%time out = broadcasting_based_ma(points[:20000])
del out
# Calculations: 400,000,000
# Wall time: 24.6 s

%time out_v2 = broadcasting_based_trima(points[:20000])
del out_v2
# Calculations: 199,990,000
# Wall time: 23.3 s

"""
E.g with 4 rows the dense matrix - broadcasting_based_ma:
[[      0.          284016.09682169  278297.26191605  359212.49587497]
 [ 284016.09682169       0.          342440.79369869  445836.60288353]
 [ 278297.26191605  342440.79369869       0.          104156.69522161]
 [ 359212.49587497  445836.60288353  104156.69522161       0.        ]]

The upper triangle - broadcasting_based_trima:
[ 284016.09682169  278297.26191605  359212.49587497  342440.79369869
  445836.60288353  104156.69522161]
"""

"""
Let's try the cdist function (on the projected points)
"""
from pyproj import Proj, transform
from scipy.spatial.distance import cdist
from scipy import sparse

def proj_arr(points,proj_to):
    """
    Project geographic co-ordinates to get cartesian x,y
    Transform(origin|destination|lon|lat)
    """
    inproj = Proj(init='epsg:4326')
    outproj = Proj(proj_to)
    func = lambda x: transform(inproj,outproj,x[1],x[0])
    return np.array(list(map(func, points)))

uk = '+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 \
+x_0=400000 +y_0=-100000 +ellps=airy \
+towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs'

%time proj_pnts = proj_arr(points, uk)
# Wall time: 454 ms
%time out_v3 = cdist(proj_pnts[:20000],proj_pnts[:20000],p=2)
# Wall time: 2.52 s
max_dist = np.amax(sparse.tril(out_v3, k=-1).data)
print(max_dist)
# 1,201,694 (length of UK is 1,407,000)
del points
del out_v3

It may be tempting to produce this distance matrix using cKDTree.sparse_distance_matrix with a cut-off just above the maximum distance (so that nothing gets cut off) – exactly what we did at the start. This does not do very well: with only 5,000 points (not 20,000) it takes over 20 seconds. However, the function still serves its purpose (as we saw at the very beginning of the post), because we cannot get a distance matrix for 100,000 points any other way – it has to be sparse, not dense, otherwise we hit the RAM limit.

from scipy.spatial import cKDTree
"""
cKDTree.sparse_distance_matrix is obviously much slower than cdist ...
"""
%time tree = cKDTree(proj_pnts[:5000])
%time out_v4 = tree.sparse_distance_matrix(tree,1201694,p=2) 
del out_v4
# Notice we have 5,000 (not 20,000)
# Wall time: 1 ms
# Wall time: 22 s

"""
cdist, however, cannot handle 100,000 points and hits my RAM limit,
so each function serves its own purpose
"""
%time out_v5 = cdist(proj_pnts,proj_pnts,p=2)
print(len(sparse.tril(out_v5, k=-1).data))
del out_v5

This basically means: if we need a full distance matrix, then project the points and use cdist; but if we just need all the pairs within a certain distance (which is likely to be a much smaller sub-set), then instead of calculating the full distance matrix and dropping the pairs that are over the cut-off, use cKDTree.sparse_distance_matrix – it will be much quicker and, more importantly, feasible!
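
As a final recap, here is a rough sketch of that rule of thumb (re-using points, proj_arr and the uk projection string from the listings above):

from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

proj_pnts = proj_arr(points, uk)   # project once, so distances are in metres

# Need every pairwise distance (and it fits in RAM)? -> dense matrix via cdist
dense = cdist(proj_pnts[:20000], proj_pnts[:20000])

# Only need the pairs within a cut-off? -> sparse matrix via cKDTree
tree = cKDTree(proj_pnts)
within_10km = tree.sparse_distance_matrix(tree, 10000, p=2)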