When it comes to drive-times, Google Maps is widely considered the best. The Google Distance Matrix API currently accepts 1,000 OD pairs every 10 seconds (Google Maps APIs Premium Plan), for a theoretical maximum of 360,000 an hour (though most plans have a 24-hour cap of 100,000 requests).

But what if we need 1 million drive-times, or 10 million, or 100 million?

Directions and distances are solved using generalisations of Dijkstra's algorithm. Various heuristics are layered on top (A* is one example) and can make for an interesting read. For example: road hierarchy is used (search motorways first, then primary roads, then secondary, and so on) to reduce solving time and mimic real-world driving; and routes are solved not from source to destination but bidirectionally, starting at each end and meeting in the middle.
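
As a minimal sketch of the core idea, here is plain Dijkstra on a toy graph (the network and weights are invented for illustration; production engines add A*, contraction hierarchies, and bidirectional search on top of this):

```python
import heapq

def dijkstra(graph, source):
    """Shortest drive-times (minutes) from source to every reachable node.

    graph: {node: [(neighbour, minutes), ...]} -- a toy road network.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for nbr, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

# Toy network: the direct A -> D edge (20 min) loses to A -> B -> D (9 min)
roads = {
    "A": [("B", 5), ("C", 2), ("D", 20)],
    "B": [("D", 4)],
    "C": [("B", 8)],
    "D": [],
}
```

An OD matrix is then just one such single-source run per origin, which is why cut-offs and hierarchies matter so much at scale.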

Using OpenStreetMap data it is possible to create a network dataset for ArcGIS's Network Analyst extension.

For example, the network dataset for France came to around 120 GB and took around a week from start to finish (creating the feature database and then creating the network dataset from that). Below is an example of a script that creates a layer from a CSV containing 36,253 points in France and then calculates an OD matrix, returning drive-time and drive-distance for all routes within a 30-minute cut-off. The number of drive-times returned was around 6 million, although the full 36,000-by-36,000 matrix would contain over 1.3 billion elements. The whole process took only 60 minutes, which means we averaged more than 1,600 OD pairs a second!
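
The throughput claim is easy to sanity-check:

```python
# Back-of-envelope check of the figures above
n_points = 36253
full_matrix = n_points * n_points    # over 1.3 billion possible OD pairs
returned = 6000000                   # pairs inside the 30-minute cut-off
rate = returned / float(60 * 60)    # OD pairs per second over the 60 minutes
```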

Step 1: Create Layer from CSV

import arcpy
from arcpy import env

def create_layers_from_csv(wpath, csv_file):
    """Create layers from CSV to use for analysis"""
    env.workspace = wpath
    try:
        in_Table = ".../ArcGIS/Distances/%s.csv" % csv_file
        x_coords = "lon"
        y_coords = "lat"
        out_Layer = "%s_layer" % csv_file
        saved_Layer = "%s" % csv_file
        # CSV contains Geographic WGS 1984 coordinates
        spRef = "Coordinate Systems/Geographic Coordinate Systems/World/WGS 1984.prj"
        arcpy.MakeXYEventLayer_management(in_Table, x_coords, y_coords, out_Layer, spRef)
        print arcpy.GetCount_management(out_Layer)
        arcpy.SaveToLayerFile_management(out_Layer, saved_Layer)
        print "Saved layer %s" % saved_Layer
    except Exception:
        print "Something went wrong"
        print arcpy.GetMessages()

Step 2: Solve OD Matrix

import time
import arcpy
from arcpy import env

def solve_od(wpath):
    """OD Cost Matrix"""
    try:
        env.workspace = wpath
        env.overwriteOutput = True
        # Set variables
        inNetworkDataset = "france_osm/france_osm_nd"
        inBoth = "sample_drive_times.lyr"
        outNALayerName = "ODCost_restrict_30min"
        outLayerFile = outNALayerName
        lines_csv = "ODCost_restrict_30min.csv"
        using_id = "ID"
        impedanceAttribute = "DriveTime"
        accumulate_attributes = ["Length"]  # Metres: drive-distance
        searchTolerance = "1000 Meters"
        stime = time.time()
        print "Creating OD layer"
        # 30-minute cut-off on DriveTime; accumulate Length to get drive-distance
        outNALayer = arcpy.na.MakeODCostMatrixLayer(inNetworkDataset, outNALayerName,
                                                    impedanceAttribute, 30, "",
                                                    accumulate_attributes)
        outNALayer = outNALayer.getOutput(0)
        subLayerNames = arcpy.na.GetNAClassNames(outNALayer)
        originsLayerName = subLayerNames["Origins"]
        destinationsLayerName = subLayerNames["Destinations"]
        linesLayerName = subLayerNames["ODLines"]
        # Map the "Name" field of each sub-layer to our own ID column
        fieldMappings = arcpy.na.NAClassFieldMappings(outNALayer, destinationsLayerName)
        fieldMappings["Name"].mappedFieldName = using_id
        print "Loading Destinations"
        arcpy.na.AddLocations(outNALayer, destinationsLayerName, inBoth,
                              fieldMappings, searchTolerance)
        fieldMappings = arcpy.na.NAClassFieldMappings(outNALayer, originsLayerName)
        fieldMappings["Name"].mappedFieldName = using_id
        print "Loading Origins"
        arcpy.na.AddLocations(outNALayer, originsLayerName, inBoth,
                              fieldMappings, searchTolerance)
        print "Solving..."
        arcpy.na.Solve(outNALayer)
        print "Finished in %.0f seconds" % (time.time() - stime)
        print "Solved! Saving ..."
        arcpy.management.SaveToLayerFile(outNALayer, outLayerFile, "RELATIVE")
        # Extract data to CSV
        fields = ["Name", "Total_DriveTime", "Total_Length"]
        for lyr in arcpy.mapping.ListLayers(outNALayer):
            if lyr.name == linesLayerName:
                print "Saving lines"
                with open(lines_csv, 'w') as f:
                    with arcpy.da.SearchCursor(lyr, fields) as cursor:
                        print "Successfully created lines searchCursor. Exporting to " + lines_csv
                        for row in cursor:
                            f.write(','.join([str(r) for r in row]) + '\n')
        print "Script completed successfully"
    except Exception as e:
        # If an error occurred, print line number and error message
        import traceback, sys
        tb = sys.exc_info()[2]
        print "An error occurred on line %i" % tb.tb_lineno
        print str(e)

Google Maps uses A LOT of different attributes to get the speed for a road (including, I believe, actual drive-times); with OSM, however, we are limited to attributes like these (from the "DriveGeneric.xml" installed by arcgis-osm-editor):

Select Case LCase([highway])
    Case "motorway"
        speed = 110
    Case "motorway_link", "trunk"
        speed = 90
    Case "trunk_link", "primary"
        speed = 70
    Case "primary_link", "secondary"
        speed = 60
    Case "secondary_link", "tertiary"
        speed = 55
    Case "unclassified"
        speed = 50
    Case "tertiary_link"
        speed = 45
    Case "residential"
        speed = 40
    Case "living_street"
        speed = 10
End Select
Select Case LCase([osm_surface])
    Case "compacted"
        speed = speed / 1.25
    Case "metal"
        speed = speed / 1.50
    Case "unpaved", "gravel", "fine_gravel", "pebblestone", "sand", "dirt", "grass"
        speed = speed / 2.00
End Select
Select Case LCase([osm_smoothness])
    Case "intermediate"
        speed = speed / 1.25
    Case "bad"
        speed = speed / 1.50
    Case "very_bad"
        speed = speed / 1.75
    Case "horrible"
        speed = speed / 2.00
    Case "very_horrible"
        speed = speed / 3.00
    Case "impassable"
        speed = speed / 5.00
End Select
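
For readers without ArcGIS, the same rules can be expressed as a small lookup table. This is a hypothetical re-implementation of the evaluator above, handy for sanity-checking speeds outside ArcGIS; the fallback speed for unmatched highway types is my own assumption, since the VBScript leaves it undefined:

```python
# Speeds in km/h by OSM highway tag, mirroring the Select Case blocks above
HIGHWAY_SPEED = {
    "motorway": 110,
    "motorway_link": 90, "trunk": 90,
    "trunk_link": 70, "primary": 70,
    "primary_link": 60, "secondary": 60,
    "secondary_link": 55, "tertiary": 55,
    "unclassified": 50,
    "tertiary_link": 45,
    "residential": 40,
    "living_street": 10,
}

# Divisors applied for poor surfaces and smoothness
SURFACE_PENALTY = {
    "compacted": 1.25,
    "metal": 1.50,
    "unpaved": 2.00, "gravel": 2.00, "fine_gravel": 2.00,
    "pebblestone": 2.00, "sand": 2.00, "dirt": 2.00, "grass": 2.00,
}
SMOOTHNESS_PENALTY = {
    "intermediate": 1.25, "bad": 1.50, "very_bad": 1.75,
    "horrible": 2.00, "very_horrible": 3.00, "impassable": 5.00,
}

def osm_speed(highway, surface="", smoothness="", default=50):
    """Speed in km/h for a segment; `default` covers unmatched highway tags."""
    speed = HIGHWAY_SPEED.get(highway.lower(), default) / 1.0
    speed /= SURFACE_PENALTY.get(surface.lower(), 1.0)
    speed /= SMOOTHNESS_PENALTY.get(smoothness.lower(), 1.0)
    return speed
```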

This means that we end up seeing something like the following (I calculated a subset of 50,000 routes using the Google Distance Matrix):

Drive-Times (minutes) of Open Street Maps using NA against Google Distance Matrix


Drive-Times (minutes) against straight-line distance between routes


The median difference here was 21%, so I plotted the start location in red if the difference was greater than 21% and green otherwise:


Drive-Distance (metres) of Open Street Maps using NA against Google Distance Matrix


Drive-Distance (metres) against straight-line distance between routes


We can see that:

  1. Google drive-times are on average a lot slower (however Google drive-distances are very similar)
  2. Given a straight-line distance, the range of Google drive-times is much bigger than OSM-based drive-times
  3. Google drive-times (and distances) are less correlated with the straight-line distance between points.

This makes sense if we remember that the routing algorithm for OSM is calculating a cost impedance (minutes) using distance travelled coupled with a speed based on road type, road surface, and road smoothness, whereas Google uses much more information (possibly unrelated to distance), e.g. average drive-speed.

If we regress google_minutes on the straight-line distance we get an R-squared of 0.57 with a t-stat of 257; if we regress OSM_minutes on the straight-line distance we get an R-squared of 0.72 with a t-stat of 360. Clearly, basic proximity matters much more when no other attributes are available.
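
Those statistics come from a simple OLS fit; a small helper reproduces the R-squared and slope t-stat calculation (shown here on made-up numbers, not the 50,000-route sample):

```python
import numpy as np

def simple_ols(x, y):
    """OLS of y on x (with intercept): returns slope, R-squared, slope t-stat."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X.dot(beta)
    ss_res = resid.dot(resid)
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    sigma2 = ss_res / (n - 2)                       # residual variance
    se_slope = np.sqrt(sigma2 / ((x - x.mean()) ** 2).sum())
    return beta[1], r2, beta[1] / se_slope
```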

I was thinking of two approaches:

1) Each time, a small subset of random routes could be calculated using Google, and the ArcGIS drive-times could then be standardised to reflect a similar mean and variance.

2) We could try to back out Google's average speed for each road type and then re-create the network dataset with those attributes.
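
Approach 1 amounts to a shift-and-scale of the OSM times towards the calibration sample; a sketch (the function name and inputs are my own, for illustration):

```python
import numpy as np

def standardise_to_google(osm_times, google_sample, osm_sample):
    """Rescale OSM drive-times so their distribution matches a small
    Google-calibrated sample: shift/scale to the sample's mean and
    standard deviation. Illustrative sketch, not production code."""
    osm_times = np.asarray(osm_times, float)
    g_mu, g_sd = np.mean(google_sample), np.std(google_sample)
    o_mu, o_sd = np.mean(osm_sample), np.std(osm_sample)
    return (osm_times - o_mu) / o_sd * g_sd + g_mu
```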

The latter option seems more useful and more interesting:

If we limit our analysis to routes where the drive-distance is very similar between Google and ArcGIS (a proxy for the same route having been chosen), we could extract from ArcGIS [road_type, road_smoothness, road_surface, distance_metres] for each chunk of a route, and combine that with google_time_seconds for the whole route. If we then assign a speed to each (road_type x, road_smoothness y, road_surface z) combination and define arcgis_time_seconds = sum(distance[i] / speed(x[i], y[i], z[i])), we can fit the model so that arcgis_time_seconds matches google_time_seconds and thus recover the ideal speed parameters. (Technically we can only extract the ID and distance for each road segment; however, it should be trivial to merge on the type, smoothness, and surface for that ID using an SQL command on the network dataset.)
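
A toy version of this fit: route time is linear in pace (1/speed), so ordinary least squares on per-category distances recovers the speeds. All names and data below are hypothetical, and a real fit would need noise handling and non-negativity constraints:

```python
import numpy as np

def fit_paces(distance_matrix, google_seconds):
    """distance_matrix[i, c] = metres of route i on road category c.
    Solves route_time ~= sum_c distance[i, c] * pace[c] by least squares,
    then converts pace (seconds per metre) back to speed in km/h."""
    paces, _, _, _ = np.linalg.lstsq(distance_matrix, google_seconds, rcond=None)
    return 3.6 / paces

# Two categories (say motorway and residential) with true speeds 110 and 40 km/h
true_pace = np.array([3.6 / 110, 3.6 / 40])   # seconds per metre
D = np.array([[20000.0, 2000.0],              # metres per category per route
              [5000.0, 8000.0],
              [15000.0, 500.0]])
times = D.dot(true_pace)                      # noiseless stand-in "Google" times
speeds = fit_paces(D, times)                  # recovers the per-category speeds
```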

Hopefully more to follow!

Edit: Running the same analysis for the UK