So far I have found two offline routers that work well (in my opinion) with OpenStreetMap data and that I have enjoyed using:

  1. OSRM – a very fast C++ routing engine
  2. GraphHopper – a Java-based routing engine

Both of these methods involve a local server set-up which accepts HTTP connections. Because I/O delay is negligible, we find ourselves limited by CPU processing power, and thus I have found that multi-tasking (notably multi-processing rather than threading, to overcome CPython’s GIL) brings a greater speed benefit than asynchronous techniques. Further, GraphHopper allows a persistent connection, so we can use connection pooling for an even bigger speed boost.
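
Both scripts below boil down to the same pattern – build a list of request URLs and fan them out over a pool of worker processes. Here is a minimal sketch of that pattern (the server address and coordinates are placeholders):

import requests
from multiprocessing import Pool, cpu_count

def fetch(url):
    # Each worker issues its own blocking HTTP request; the GIL does not
    # matter because the workers are separate processes, not threads.
    return requests.get(url).json()

if __name__ == '__main__':
    # Made-up queries against a local OSRM server on port 5005 (see below)
    urls = ["http://localhost:5005/viaroute?loc=51.5,-0.1&loc=52.5,-1.9"] * 100
    with Pool(max(1, cpu_count() - 1)) as pool:
        results = pool.map(fetch, urls)
    print(len(results))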

1. Open Source Routing Machine (OSRM)

The first step is to download the compiled version of OSRM (I always have issues compiling …) from here.

The second step is to create some maps using “osrm-prepare.exe”, which I have found to be very quick! For example, the British Isles took around 15 minutes from download to final map.
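
For reference, on the OSRM builds of that generation the preparation is a two-step affair – extract the downloaded .osm.pbf into OSRM’s own format and then pre-process the network – so the commands look roughly like this (exact executable names and profile options vary between releases):

osrm-extract.exe british-isles-latest.osm.pbf
osrm-prepare.exe british-isles-latest.osrm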

The third step is to create a STATA script (an interface common to many people around me) which basically creates a CSV of O-D pairings in the TEMP directory and then calls the Python script, supplying the relevant arguments.
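
For reference, the intermediate CSV has no header row and five columns – query_id, origin latitude, origin longitude, destination latitude, destination longitude – for example (the coordinates here are made up for illustration):

1,51.5074,-0.1278,52.4862,-1.8904
2,53.4808,-2.2426,55.9533,-3.1883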

The fourth step is to create a Python script which robustly takes the input CSV along with the path of the map file, submits the requests, and dumps an output CSV which gets imported back into STATA. The API queried is the via-route API, like so:

http://{server}/viaroute?loc={lat,lon}&loc={lat,lon}<&loc={lat,lon} ...>
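
For a single O-D pair the request/response cycle looks roughly like the sketch below (made-up coordinates against a locally running server; the JSON fields shown are the ones the Python script further down consumes):

import requests

url = ("http://localhost:5005/viaroute"
       "?loc=51.5074,-0.1278&loc=52.4862,-1.8904&alt=false&geometry=false")
resp = requests.get(url).json()
if int(resp['status']) == 200:
    print(resp['route_summary']['total_time'])      # driving time in seconds
    print(resp['route_summary']['total_distance'])  # driving distance in metres
    print(resp['via_points'])                       # snapped origin/destination coordinates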

Here is an example of calling the STATA program:

use "...\osrm_test.dta", clear
run_osrm, orig_lat(A_Latitude) orig_long(A_Longitude) dest_lat(B_Latitude) dest_long(B_Longitude) mapfile("...\osrm_maps\britain.osrm")
correlate google_seconds osm_time_s
correlate google_metres osm_dist_m
twoway scatter google_seconds osm_time_s

The scatter has the shape we usually see (Google times are slower than the OSM-based times):

[Scatter plot: google_seconds vs. osm_time_s]

The correlation between the two driving distances is 0.95, and 0.92 between the two driving times.

Another Example (France – 5.7 million):

Running the script for 5,750,356 OD pairs in France took around 75 minutes, which works out at roughly 1,280 requests a second. The results also follow Google more closely than ArcGIS Network Analyst or even ArcGIS with HERE maps data – here are some comparisons on a random subset of 5,000 observations:

OSRM vs. Google (Error: 12.5%)


ArcGIS ESRI (HERE data) vs. Google (Error: 17.8%)


ArcGIS Network Analyst (OSM Data) vs. Google (Error: 24.0%)


A problem with using HTTP to generate the data is that the server does not support HTTP keep-alive – this means a new socket is created for each request (unlike the GraphHopper server, which does allow a persistent connection). We therefore have to be careful not to generate too many requests too quickly, otherwise we will run out of sockets (they remain in TIME_WAIT) and our script will hit an error.
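
One way to keep an eye on this while a job is running is to count how many sockets are sitting in TIME_WAIT on the routing port (a quick Windows-only sketch, since it shells out to netstat; port 5005 is the one used by the script below):

import subprocess

# Count connections in TIME_WAIT on the OSRM port by parsing `netstat -an`
out = subprocess.check_output("netstat -an", shell=True).decode("utf-8", errors="ignore")
print(sum(1 for line in out.splitlines() if "TIME_WAIT" in line and ":5005" in line))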

Having said that, I am able to achieve speeds of 1,800 requests per second using the code below.


After checking with the developers, it appears that using the node bindings (rather than HTTP) is the recommended way – and one I will be investigating shortly:

> npm install osrm async
> node
var async = require('async');
var OSRM = require('osrm');
var osrm = new OSRM("dataset.osrm");
var start_destinations = [[[lat, lon], [lat, lon]], ....];
async.map(start_destinations, osrm.route, function(err, results) {
    console.log(results.map(function(r) { return r.route_summary.total_time; }));
});

Python Script:

import os
import requests
import json
import time
import csv
import pandas as pd
import argparse
from multiprocessing import Process, Pool, cpu_count
from subprocess import Popen, PIPE

parser = argparse.ArgumentParser(description='Batch routing of OD pairs through a local OSRM server')
parser.add_argument('-d', '--Directory', help='Log directory', required=False)
parser.add_argument('-m', '--Map', help='Map file', required=False)
args = vars(parser.parse_args())

ghost = 'localhost'
gport = 5005

class OsrmEngine(object):
    """ Class which connects to an osrm-routed executable and spawns multiple servers"""
    def __init__(self,
                 map_loc,
                 router_loc):
        """
        Map needs to be pre-processed with osrm-prepare; router_loc should be the most up to date file from here:
        http://build.project-osrm.org/ - both can work over the network with no significant speed-fall as they
        are initially loaded into RAM
        """
        if not os.path.isfile(router_loc):
            raise Exception("Could not find osrm-routed executable at: %s" % router_loc)
        else:
            self.router_loc = router_loc
        if not os.path.isfile(map_loc):
            raise Exception("Could not find osrm network data at: %s" % map_loc)
        else:
            self.map_loc = map_loc
        # Remove any open instance
        if self.server_check():
            self.server_kill()

    def server_kill(self):
        """
        Kill any osrm-routed server that is currently running before spawning new - this means only one script
        can be run at a time
        """
        Popen('taskkill /f /im %s' % os.path.basename(self.router_loc), stdin=PIPE, stdout=PIPE, stderr=PIPE)
        time.sleep(2)

    def server_check(self):
        """
        Check if server is already running (osrm-routed answers a bare GET with status 400)
        """
        try:
            if requests.get("http://%s:%d" % (ghost, gport)).status_code == 400:
                return True
        except requests.ConnectionError:
            return False

    def server_start(self):
        """
        Robustness checks to make sure server can be initialised
        """
        try:
            p = Popen([self.router_loc, '-v'], stdin=PIPE, stdout=PIPE, stderr=PIPE)
            output = p.communicate()[0].decode("utf-8")
        except FileNotFoundError:
            output = ""
        if "info" not in str(output):
            raise Exception("OSM does not seem to work properly")
        # Check server not running
        if self.server_check():
            raise Exception("osrm-routed already running - force close all with task-manager")
        # Start server
        Popen("%s %s -i %s -p %d -t %d" % (self.router_loc, self.map_loc, ghost, gport, cpu_count()), stdout=PIPE)
        c = 0
        while c < 30:
            try:
                if requests.get("http://%s:%d" % (ghost, gport)).status_code == 400:
                    return "http://%s:%d" % (ghost, gport)
                else:
                    raise Exception("Map could not be loaded")
            except requests.ConnectionError:
                time.sleep(10)
                c += 1
        raise Exception("Map could not be loaded ... taking more than 5 minutes..")

    def spawn_server(self):
        """ Server can handle parallel requests so only one instance needed """
        p = Process(target=self.server_start)
        p.start()
        # Block until server loads
        p.join()


# Helper functions
def process_request(url_input):
    """
    Submits HTTP request to server and returns distance metrics; errors are coded as status=999
    """
    req_url, query_id = url_input
    try_c = 0
    #print(req_url)
    while try_c < 5:
        try:
            response = requests.get(req_url)
            json_geocode = response.json()
            status = int(json_geocode['status'])
            # Found route between points
            if status == 200:
                tot_time_s = json_geocode['route_summary']['total_time']
                tot_dist_m = json_geocode['route_summary']['total_distance']
                used_from, used_to = json_geocode['via_points']
                out = [query_id,status,tot_time_s,tot_dist_m,used_from[0],used_from[1],used_to[0],used_to[1]]
                return out
            # Cannot find route between points (code errors as 999)
            else:
                print("Done but no route: %d %s" % (query_id, req_url))
                return [query_id, 999, 0, 0, 0, 0, 0, 0]
        except Exception as err:
            print("%s - retrying..." % err)
            time.sleep(5)
            try_c += 1
    print("Failed: %d %s" % (query_id, req_url))
    return [query_id, 999, 0, 0, 0, 0, 0, 0]


def create_urls(routes, ghost, gport):
    """ Python list comprehension to create URLS """
    return [["http://{0}:{1}/viaroute?loc={2}%2C{3}&loc={4}%2C{5}&alt=false&geometry=false".format(
        ghost, gport, alat, alon, blat, blon),
                qid] for qid, alat, alon, blat, blon in routes]


def loadroute_csv(csv_loc, chunks):
    """ Use Pandas to iterate through CSV - very fast CSV parser """
    if not os.path.isfile(csv_loc):
        raise Exception("Could not find CSV with addresses at: %s" % csv_loc)
    else:
        return pd.read_csv(csv_loc, sep=',', header=None, iterator=True, chunksize=chunks)

if __name__ == '__main__':
    try:
        # Router_loc points to latest build http://build.project-osrm.org/
        router_loc = '.../osrm-routed.exe'
        # Directory containing routes to process (csv) and map supplied as arg.
        directory_loc = os.path.normpath(args['Directory'])
        map_loc = os.path.normpath(args['Map'])
        print("Initialising engine")
        osrm = OsrmEngine(map_loc, router_loc)
        print("Loading Map - this may take a while over the network")
        osrm.spawn_server()
        done_count = 0
        # Loop through 1 million rows at a time (save results)
        print("Loading CSV")
        with open(os.path.join(directory_loc, 'osrm_output.csv'), 'w') as outfile:
            wr = csv.writer(outfile, delimiter=',', lineterminator='\n')
            for x in loadroute_csv(csv_loc=os.path.join(directory_loc, 'osrm_input.csv'), chunks=1000*1000):
                # Pandas dataframe to python list
                routes = x.values.tolist()
                # Salt route-data with server location to create URLS
                url_routes = create_urls(routes, ghost, gport)
                del routes
                print("Created %d urls" % len(url_routes))
                print("Calculating in chunks of 1,000,000")

                # Server does not support keep-alive so new sockets are created each-time
                # If too many requests (too fast) will get TCP socket exhaustion
                stime = time.time()
                pool = Pool(int(cpu_count()-1))
                calc_routes = pool.map(process_request, url_routes)
                dur = time.time() - stime
                print("Calculated %d distances in %.2f seconds: %.0f per second" % (len(calc_routes),
                                                                                    dur,
                                                                                    len(calc_routes)/dur))
                del url_routes
                # Verify all threads closed safely
                pool.close()
                pool.join()
                print("Saving progress ...")
                wr.writerows(calc_routes)
                done_count += len(calc_routes)
                del calc_routes
                # Continually update progress in terms of millions
                print("Saved %d calculations" % done_count)
        print("Done.")
        osrm.server_kill()
    except Exception as err:
        print(err)
        osrm.server_kill()
        time.sleep(15)

STATA Script:

capture program drop run_osrm
program run_osrm
	syntax, orig_lat(string) orig_long(string) dest_lat(string) dest_long(string) mapfile(string)
	display in blue "...................................................................................."
	display in blue "This program connects to a local OSRM server and attaches the specified mapfile"
	display in blue "The input map is from OpenStreetMaps"
	display in blue "Results should be reproducable here: http://map.project-osrm.org/"
	display in blue "Average speed is around 900 request/second"
	display in blue "...................................................................................."
	display in blue "Starting: $S_TIME"
	display in blue "...................................................................................."
quietly {
	// Make sure variables do not exist
	foreach v in "_merge" "osm_status" "osm_time_s" "osm_dist_m" "osm_used_dest_lat" "osm_used_orig_lat" {
		cap confirm var `v'
		if !_rc {
			di as err "`v' already defined"
			exit 110
			}
	}

	if "`mapfile'"=="" {
		di as err "You must specify a mapfile"
		exit 197
		}
	else {
		cap confirm file "`mapfile'"
		if _rc {
			di as err "File `mapfile' does not exist"
			exit 601
			}
		}
	if substr(lower("`mapfile'"),-5,.)!=".osrm" {
		di in smcl "{p 0 0 4}{err:Please specify a prepared map with the .osrm ending}"
		exit
		}		
		
	// Make path of mapfile absolute
	local pwd = c(pwd)
	if substr("`pwd'",-1,.)=="/" | substr("`pwd'",-1,.)=="\" {
		local pwd = substr("`pwd'",1,length("`pwd'")-1 )
		}
	cap confirm file "`pwd'/`mapfile'"
	if !_rc {
		if c(os)=="Windows" local mapfile = "`pwd'/`mapfile'"
		else local mapfile = "`pwd'/`mapfile'"
		}
	if regexm("`mapfile'"," ") &c(os)=="Windows" {
		di in smcl "{p 0 0 4}{err:Path provided for your mapfile contains a space character, OSRM does not work with that. Please provide a path without spaces} {break}{err:This is a known bug....}"
		exit
		}		
	 
	local logd = "`c(tmpdir)'osrmtemp_" +  subinstr("$S_TIME",":","_",.)
	noisily display "Creating temporary file which will be deleted"
	! mkdir "`logd'"
	noisily display "Using MapFile: `mapfile'"
	noisily display "Using Temp: `logd'"
	
	capture drop query_id
	gen double query_id = _n
	
	preserve
		noisily display "executing""
		outsheet query_id  `orig_lat' `orig_long' `dest_lat'  `dest_long'  using "`logd'/osrm_input.csv", comma nonames replace	
	restore

	
	! pip install requests
	capture timer clear 1
	timer on 1
	! python.exe "...\osrm_time.py" -d "`logd'" -m "`mapfile'"
	timer off 1
	timer list 1
	noisily di as error round(_N/`r(t1)',0.01) " requests a second"
	
	preserve
		insheet query_id osm_status osm_time_s osm_dist_m osm_used_orig_lat osm_used_orig_lon osm_used_dest_lat osm_used_dest_lon using "`logd'/osrm_output.csv", clear comma case double
			tempfile results
			save `results'
	restore	
	
	merge 1:1 query_id using `results'
		assert _merge==3
		drop _merge
		drop query_id	
	noisily di as error "Complete""
	noisily di as error "Finished: $S_TIME"
	capture label drop osm_statusl
	label define osm_statusl 200 "Successful"  999 "Unsuccessful"
	label values osm_status osm_statusl
	! RD /S /Q  "`logd'" 
}
end

2. GraphHopper

The process is very similar – first I add the downloaded “.osm.pbf” file to the GraphHopper directory and then add a “run_x.bat” file containing:

java -jar graphhopper-web-0.6.0-with-dep.jar jetty.resourcebase=webapp config=config-example.properties osmreader.osm=british-isles-latest.osm.pbf
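
Once the map has loaded you can sanity-check the server with a single request (made-up coordinates); the script below reads paths[0].time (milliseconds) and paths[0].distance (metres) from the JSON it returns:

http://localhost:8989/route?point=48.8566,2.3522&point=45.7640,4.8357&vehicle=car&calc_points=false&instructions=false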

Example (France – 5.7 million):

Running the script for 5,750,356 OD pairs in France gives pretty much the same figures as the OSRM version. The only differences were:

  1. GraphHopper does not snap to the nearest location (unlike OSRM), so it returned a lot more ‘FAILS’ (although these were <1% of total requests)
  2. OSRM seemed to correlate a tiny bit better with Google

GraphHopper vs. Google (Error: 13%)


The idea is to launch the “.bat” script from Python. The good thing about the Jetty server is that it allows persistent HTTP connections, so we can establish a session and re-use our sockets.
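
A distilled sketch of that idea – one pooled connection that every request re-uses instead of opening a fresh socket each time (the full script below creates one such pool per worker process):

from urllib3 import HTTPConnectionPool

# Single persistent connection to the local GraphHopper/Jetty server
pool = HTTPConnectionPool(host='localhost', port=8989, maxsize=1)
# Made-up query; calc_points/instructions switched off as in the script below
resp = pool.request('GET', '/route?point=48.8566,2.3522&point=45.7640,4.8357'
                           '&vehicle=car&calc_points=false&instructions=false')
print(resp.status)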

With this set-up I get speeds of around 3,500 requests per second.


Python Script:

import os
import json
import time
import csv
import pandas as pd
import argparse
from urllib3 import HTTPConnectionPool
from multiprocessing import Process, Pool
from subprocess import Popen, PIPE

parser = argparse.ArgumentParser(description='Batch routing of OD pairs through a local GraphHopper server')
parser.add_argument('-d', '--Directory', help='Log directory', required=False)
parser.add_argument('-r', '--Router', help='Router file', required=False)
args = vars(parser.parse_args())

ghost = 'localhost'
gport = 8989
_conn_pool = HTTPConnectionPool(host=ghost, port=gport, maxsize=1)


class GraphhopperEngine(object):
    def __init__(self,
                 router_loc):
        if not os.path.isfile(router_loc):
            raise Exception("Could not find .bat executable at: %s" % router_loc)
        else:
            self.router_loc = router_loc
        # Remove any open instance
        if self.server_check():
            self.server_kill()

    def server_kill(self):
        # Close any running java server
        Popen('taskkill /f /im java.exe', stdin=PIPE, stdout=PIPE, stderr=PIPE)
        time.sleep(2)

    def server_check(self):
        try:
            if _conn_pool.request('GET', 'http://{0}:{1}'.format(ghost, gport)).status == 200:
                return True
        except Exception:
            return False

    def server_start(self):
        # Check server not running
        if self.server_check():
            raise Exception('Graphhopper Java(TM) Platform already running - force close with task-manager')
        # Start server
        Popen("%s" % self.router_loc, cwd=os.path.split(self.router_loc)[0], stdout=PIPE)
        # Check server running
        print('Starting server - may take a minute over the network')
        c = 0
        while c < 30:
            try:
                if _conn_pool.request('GET', 'http://{0}:{1}'.format(ghost, gport)).status == 200:
                    print('Server successfully started')
                    return
            except Exception:
                print('Map loading ...')
                time.sleep(10)
                c += 1
        raise Exception("Map could not be loaded ... taking more than 5 minutes..")

    def spawn_server(self):
        """ Server can handle parallel requests so only one instance needed """
        p = Process(target=self.server_start)
        p.start()
        # Block until server loads
        p.join()


# Helper functions
def create_urls(routes, ghost, gport):
    """ Python list comprehension to create URLS """
    return [["http://{0}:{1}/route?point={2}%2C{3}&point={4}%2C{5}".format(
            ghost, gport, alat, alon, blat, blon) +
             "&vehicle=car&calc_points=false&instructions=false", qid]
            for qid, alat, alon, blat, blon in routes]


def load_route_csv(csv_loc, chunks):
    """ Use Pandas to iterate through CSV - very fast CSV parser """
    if not os.path.isfile(csv_loc):
        raise Exception("Could not find CSV with addresses at: %s" % csv_loc)
    else:
        return pd.read_csv(csv_loc, sep=',', header=None, iterator=True, chunksize=chunks)


def find_java():
    try:
        home = os.environ['JAVA_HOME']
        if "jre" in home:
            if os.path.exists(home):
                print("Found Java: %s" % home)
                return
    except KeyError:
        pass
    print("Could not find Java - this script may not work")
    print("Download java from: https://java.com/en/download/manual.jsp")

def process_request(msg):
    ul, qid = msg
    try:
        response = _conn_pool.request('GET', ul)
        s = float(response.status)
        if s == 200:
            json_geocode = json.loads(response.data.decode('utf-8'))
            tot_time_s = json_geocode['paths'][0]['time']
            tot_dist_m = json_geocode['paths'][0]['distance']
            return [qid, s, tot_time_s, tot_dist_m]
        elif s == 400:
            #print("Done but no route for row: ", qid)
            return  [qid, 999, 0, 0]
        else:
            print("Done but unknown error for: ", s)
            return [qid, 999, 0, 0]
    except Exception as err:
        print(err)
        return [qid, 999, 0, 0]


def make_pool(host, port):
    global _conn_pool
    _conn_pool = HTTPConnectionPool(host=host, port=port, maxsize=1)
    
if __name__ == '__main__':
    try:
        find_java()  # Report on status of java
        router_loc = os.path.normpath(args['Router'])
        directory_loc = os.path.normpath(args['Directory'])
        print("Initialising engine")
        gh = GraphhopperEngine(router_loc)
        gh.spawn_server()
        done_count = 0
        # Loop through x rows at a time (save results)
        print("Loading CSV")
        with open(os.path.join(directory_loc, 'gh_output.csv'), 'w') as outfile:
            wr = csv.writer(outfile, delimiter=',', lineterminator='\n')
            for x in load_route_csv(csv_loc=os.path.join(directory_loc, 'gh_input.csv'), chunks=1000*1000):
                # Pandas data-frame to python list
                routes = x.values.tolist()
                # Salt route-data with server location to create URLS
                url_routes = create_urls(routes, ghost, gport)
                del routes
                print("Created %d urls" % len(url_routes))
                print("Calculating in chunks of 1,000,000")
                stime = time.time()

                # conn_pool variable inside each process is separate from all other processes
                pool = Pool(initializer=make_pool, initargs=(ghost, gport))
                calc_routes = pool.map(process_request, url_routes)
                pool.close()
                pool.join()

                # Time diagnostics
                dur = time.time() - stime
                print("Calculated %d distances in %.2f seconds:"
                      " %.0f per second" % (len(calc_routes), dur, len(calc_routes) / dur))
                del url_routes
                print("Saving progress ...")
                wr.writerows(calc_routes)
                done_count += len(calc_routes)
                del calc_routes
                # Continually update progress in terms of millions
                print("Saved %d calculations" % done_count)
        print("Done.")
        gh.server_kill()
    except Exception as err:
        print(err)
        gh.server_kill()
        time.sleep(15)

STATA Script:

*! 1.0 VERSION OF GH
capture program drop cra_gh
program cra_gh
	syntax, orig_lat(string) orig_long(string) dest_lat(string) dest_long(string) mapfile(string)
	display as error "...................................................................................."
	display as error "This program connects to a local GraphHopper server using the referenced .bat file"
	display as error "The input map is from OpenStreetMaps"
	display as error "Average speed is around 2000 request/second"
	display as error "...................................................................................."
	display as error "Starting: $S_TIME"
	display as error "...................................................................................."
quietly {
	// Make sure variables do not exist
	foreach v in "_merge" "gh_status" "gh_dist_millisec" "gh_time_s" "gh_dist_m" {
		cap confirm var `v'
		if !_rc {
			di as err "`v' already defined"
			exit 110
			}
	}

	if "`mapfile'"=="" {
		di as err "You must specify a mapfile"
		exit 197
		}
	else {
		cap confirm file "`mapfile'"
		if _rc {
			di as err "File `mapfile' does not exist"
			exit 601
			}
		}
	if substr(lower("`mapfile'"),-4,.)!=".bat" {
		di in smcl "{p 0 0 4}{err:Please specify a proper map file with a .bat ending}"
		exit
		}		
		
	// Make path of mapfile absolute
	local pwd = c(pwd)
	if substr("`pwd'",-1,.)=="/" | substr("`pwd'",-1,.)=="\" {
		local pwd = substr("`pwd'",1,length("`pwd'")-1 )
		}
	cap confirm file "`pwd'/`mapfile'"
	if !_rc {
		if c(os)=="Windows" local mapfile = "`pwd'/`mapfile'"
		else local mapfile = "`pwd'/`mapfile'"
		}
	 
	local logd = "`c(tmpdir)'osrmtemp_" +  subinstr("$S_TIME",":","_",.)
	noisily display "Creating temporary file which will be deleted"
	! mkdir "`logd'"
	noisily display "Using MapFile: `mapfile'"
	noisily display "Using Temp: `logd'"
	
	capture drop query_id
	gen double query_id = _n
	
	preserve
		noisily display "executing""
		outsheet query_id  `orig_lat' `orig_long' `dest_lat'  `dest_long'  using "`logd'/gh_input.csv", comma nonames replace	
	restore

	
	! pip install urllib3
	capture timer clear 1
	timer on 1
	! python.exe "...\gh_time.py" -d "`logd'" -r "`mapfile'"
	timer off 1
	timer list 1
	noisily di as error round(_N/`r(t1)',0.01) " requests a second"
	
	preserve
		insheet query_id gh_status gh_time_millisec gh_dist_m using "`logd'/gh_output.csv", clear comma case double
			tempfile results
			save `results'
	restore	
	
	merge 1:1 query_id using `results'
		assert _merge==3
		drop _merge
		drop query_id	
	// Convert from milli to seconds
	replace gh_time_millisec = gh_time_millisec / 1000
	rename gh_time_millisec gh_time_s
	noisily di as error "Complete""
	noisily di as error "Finished: $S_TIME"
	capture label drop gh_statusl
	label define gh_statusl 200 "Successful"  999 "Unsuccessful"
	label values gh_status gh_statusl
	! RD /S /Q  "`logd'" 
}
end