Consider the following proposition:

Nearly 60% of the major coffee-shop chains' stores are in a group of 3 or more, where each is within a 5-minute walk of another.

What kind of analysis would we need to verify this? What kind of optimisation would we need if we wanted to ‘improve’ this?

The method below walks through a general example which uses graph theory to identify these ‘problematic cliques’ and linear programming to ‘solve’ them. We go from this (red points are in a problematic clique):

[Image: unopt2_a – map before optimisation]

To this:

[Image: unopt2_b – map after optimisation]

Step 1: Creating the example data-set

I start with the data-set of coffee-shops (coffee_python_mined.csv) created using the data-mining code explored in this post. I use the pandas library to read in the data and clean it before extracting all mentions of ‘Starbucks’, ‘Cafe Nero’, and ‘Costa Coffee’.

"""
Create data-set
"""

import pandas as pd
import csv
import itertools
from math import radians, cos, sin, asin, sqrt


def haversine(origin,
              destination):
    """
    Find distance between a pair of lat/lng coordinates
    """
    # convert decimal degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [float(origin[0]), float(origin[1]),
                                           float(destination[0]), float(destination[1])])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 3959  # Radius of earth in miles
    return '%.2f' % (c * r)  # distance in miles, returned as a string (2 dp); call sites wrap with float()

# Read in-data
##############
# Contains a data-dump of all coffee-shops in London (with co-ordinates)
data_csv = 'coffee_python_mined.csv'
data_pd = pd.read_csv(data_csv, sep=',', header=0)

# Clean a bit
##############
# Duplicates by coordinates (to 5dp)
data_pd['Latitude'] = data_pd.Latitude.round(decimals=5)
data_pd['Longitude'] = data_pd.Longitude.round(decimals=5) 
data_pd = data_pd.drop_duplicates(['Latitude', 'Longitude'])

# Clean Name
data_pd['Name'] = data_pd.Name.str.upper()
data_pd['Name'] = data_pd.Name.str.strip()
print(data_pd['Name'].value_counts()[:10])

# Drop if Type is not 'cafe'
print(data_pd['Type'].value_counts()[:10])
data_pd = data_pd[data_pd['Type'].str.contains("cafe")]

# Isolate desired shop(s): STARBUCKS, CAFE NERO, COSTA COFFEE
########################################################
data_pd['Shop'] = ''
pd.options.mode.chained_assignment = None  # default='warn'

# STARBUCKS
match = data_pd.Name.str.contains('^STARBUC')
print(data_pd['Name'][match].value_counts()) # Check what hits we are getting
print("Total: ",sum(match))
data_pd.loc[match, 'Shop'] = "SB"
# CAFE NERO
match = data_pd.Name.str.contains('^CAF.* NER')
print(data_pd['Name'][match].value_counts()) # Check what hits we are getting
print("Total: ",sum(match))
data_pd.loc[match, 'Shop'] = "CN"
# COSTA COFFEE
match = data_pd.Name.str.contains('^COST.* COF')
print(data_pd['Name'][match].value_counts()) # Check what hits we are getting
print("Total: ",sum(match))
data_pd.loc[match, 'Shop'] = "CC"

# Remove non-filtered and add an index
data_pd = data_pd[data_pd['Shop'] != ""]
data_pd['ID'] = data_pd.index

print(data_pd['Shop'].value_counts())

# Extract to a list
###################
all_stores_lst = data_pd[['Shop','Latitude','Longitude','ID']].values.tolist()
print(len(all_stores_lst))

For the sake of example, I keep only those stores which are within 5 miles of St Paul's (shown in green; otherwise blue):

[Image: stores within 5 miles of St Paul's]

# Let's restrict ourself to a small circle around St Pauls (e.g. 5 mile radius)
###############################################################################
st_pauls = [51.513771, -0.098195]
cut_off = 5
# Also concatenate the ID with the name
store_list = [[str(x[3])+"_"+x[0],x[1],x[2]] for x in all_stores_lst if 
                  float(haversine( st_pauls , [x[1],x[2]] )) <= cut_off]

print("Cut list contains %d stores" % len(store_list)) 

I cross each store with every other store and filter out those pairings whose straight-line distance is greater than what can be covered at 5 mph in the desired number of minutes – since it is not possible to walk between them in that time-frame.

# Cross each store with every other store
#########################################
"""
We could cross using an (i=1to(N-1),j=i+1) loop rather than 
(i=1toN,j!=i), however since walk and drive-times are non-directional,
such that time(AtoB) != time(BtoA), we may want to take an average
over both directions (in the case of an undirected graph). If we don't take
an average then we will get a connection between the two IF:
AtoB is within tolerance ANDOR BtoA is within tolerance (greedy).
This should not be an issue with straight-line distances (only if
the approximation is inaccurate)
"""
cross_stores = [[store_list[i],store_list[j]] 
                for i in range(len(store_list)) 
                for j in range(len(store_list)) 
                if i!=j]

# Number of crosses: 
print(len(cross_stores))  

"""
Since we may want to calculate 10 min walk we can filter out if haversine <= ((5/60)*10) -
Hence using 5mph max walk-speed we get 0.84 (rounded up) miles which can be covered
in 10 minutes. We thus keep only pairings which are within this distance (to reduce the
requests used for the distances)
"""
# Filter out observations which are far apart
cut_off = (5/60)*10
cross_stores = [x for x in cross_stores if 
                (float(haversine( [x[0][1],x[0][2]] , [x[1][1],x[1][2]] )) <= cut_off)]

# Cut down crosses to:
print(len(cross_stores))

"""
Distances can be calculated using of the tools mentioned already
Only modification to request is walk-time:
'...&mode=walking'
"""
# Export distances to calculate to a CSV
#########################################
f = open('walk_to_calculate.csv', 'w', newline='')
w = csv.writer(f)
w.writerow(["Store_A", "Lat_A", "Lon_A", "Store_B", "Lat_B","Lon_B"])
for x in cross_stores:
    w.writerow([x[0][0],x[0][1],x[0][2],x[1][0],x[1][1],x[1][2]])
f.close()

The distances can be calculated using the code described in this post, with the modification below (the '&mode=walking' parameter, since we want walk rather than drive times) – or by using the Google Distance Matrix API, like so:

var signed_url = google_key_signer(encodeURI('http://maps.googleapis.com/maps/api/directions/json?origin='+input_from[i] + '&destination='+input_to[i] + '&mode=walking'+'&client='+CR_NAME),CR_KEY)
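
Alternatively, here is a minimal Python sketch of a Distance Matrix request – a hypothetical helper, assuming the requests library and a valid key stored in api_key; error handling and rate-limiting are omitted:

import requests

def walk_minutes(origin, destination, api_key):
    """Hypothetical helper: walking time (in minutes) between two lat/lng pairs."""
    url = 'https://maps.googleapis.com/maps/api/distancematrix/json'
    params = {'origins': '%f,%f' % (origin[0], origin[1]),
              'destinations': '%f,%f' % (destination[0], destination[1]),
              'mode': 'walking',
              'key': api_key}
    resp = requests.get(url, params=params).json()
    # The duration is returned in seconds; convert to minutes
    return resp['rows'][0]['elements'][0]['duration']['value'] / 60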

Step 2: Visualising

I find that visualising often helps a lot! Instead of using the networkx.draw_networkx function to create a general graph, I chose to write a small function that creates a simplified Google Map:

"""
Helper function to plot map
"""
def geo_network(all_stores, stores_in_group, connections, name='network_map.html', minutes=10):
    """
    Helper function:
    Plot the network on an easy-to-read google-map
    """
    map = []
    for x in all_stores:
        if x[0] in stores_in_group:
            # Stores in cliques
            map.append("<marker name = \"%s\" lat = \"%.5f\" lng = \"%.5f\" col = \"red\"/>" 
                       % (x[0],x[1],x[2]))
        else:
            # Stores not in cliques
            map.append("<marker name = \"%s\" lat = \"%.5f\" lng = \"%.5f\" col = \"green\"/>" 
                       % (x[0],x[1],x[2]))
    for x in connections:
        if float(x[-1]) <= minutes:
            # Connections
            map.append("<line latfrom = \"%.5f\" longfrom = \"%.5f\" latto = \"%.5f\" longto = \"%.5f\"/>"
                       % (float(x[1]),float(x[2]),float(x[4]),float(x[5]))) 
    # Map part 1
    htmltext = """<!DOCTYPE html >

<style type="text/css">
                        html, body {
                            height: 100%;
                            width: 100%;
                            padding: 0;
                            margin: 0;
                        }
                        .labels {
                            color: black;
                            background-color: white;
                            font-family: "Lucida Grande", "Arial", sans-serif;
                    font-size: 10px;
                            text-align: center;
                    width: 45px;
                            border: 0 solid black;
                            white-space: nowrap;
                        }
                    </style>

                    <html>
                    <head>
                    <meta name="viewport" content="initial-scale=1.0, user-scalable=no" />
                    <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>
                    <title>Network</title>
                    <xml id="myxml">"""
    # Map part 2
    for x in map:
        htmltext += x + "\n"
    
    # Map part 3
    htmltext += """
            </xml>
            <script type="text/javascript" src="https://maps.googleapis.com/maps/api/js?&libraries=geometry"></script>
            <script type="text/javascript" src="http://google-maps-utility-library-v3.googlecode.com/svn/trunk/markerwithlabel/src/markerwithlabel.js"></script>
            <script>
            var XML = document.getElementById("myxml");
            if(XML.documentElement == null)
            XML.documentElement = XML.firstChild;
            var MARKERS = XML.getElementsByTagName("marker");
            var LINES = XML.getElementsByTagName("line");
            function initialize() {
                 var mapOptions = {
                                center: new google.maps.LatLng(51.47010,-0.10617),
                                zoom: 12,
                                styles:
                                [
                                {"featureType":"administrative","stylers":[{ "saturation":-80},{"visibility":"off"}]},
                                {"featureType":"landscape.natural","elementType":"geometry","stylers":[{"color":"#d0e3b4"}]},
                                {"featureType":"landscape.natural.terrain","elementType":"geometry","stylers":[{"visibility":"off"}]},
                                {"featureType":"poi","elementType":"labels","stylers":[{"visibility":"off"}]},
                                {"featureType":"poi.business","elementType":"all","stylers":[{"visibility":"off"}]},
                                {"featureType":"poi.medical","elementType":"geometry","stylers":[{"color":"#fbd3da"}]},
                                {"featureType":"poi.park","elementType":"geometry","stylers":[{"color":"#bde6ab"}]},
                                {"featureType":"road","elementType":"geometry.stroke","stylers":[{"visibility":"off"}]},
                                {"featureType":"road","elementType":"labels","stylers":[{"visibility":"off"}]},
                                {"featureType":"road.highway","elementType":"geometry.fill","stylers":[{"color":"#ffe15f"}]},
                                {"featureType":"road.highway","elementType":"geometry.stroke","stylers":[{"color":"#efd151"}]},
                                {"featureType":"road.arterial","elementType":"geometry.fill","stylers":[{"color":"#ffffff"}]},
                                {"featureType":"road.local","elementType":"geometry.fill","stylers":[{"color":"black"}]},
                                {"featureType":"transit","stylers":[{"visibility":"off"}]},
                                {"featureType":"water","elementType":"geometry","stylers":[{"color":"#a2daf2"}]}
                                ]
                        };
                var map = new google.maps.Map(document.getElementById('map'),
                    mapOptions);
                var bounds = new google.maps.LatLngBounds();
                for (var i = 0; i < MARKERS.length; i++) {
                    var name = MARKERS[i].getAttribute("name");
                    var point_i = new google.maps.LatLng(
                        parseFloat(MARKERS[i].getAttribute("lat")),
                        parseFloat(MARKERS[i].getAttribute("lng")));
                    var col =  MARKERS[i].getAttribute("col")
                    if (col == "green") {
                        var ic = "https://storage.googleapis.com/support-kms-prod/SNP_2752129_en_v0"
                    } else {
                        var ic = "https://storage.googleapis.com/support-kms-prod/SNP_2752125_en_v0"
                    }
                    var marker = new MarkerWithLabel({
                        position: point_i,
                        map: map,
                        icon: ic,
                        //labelContent: name,
                        labelAnchor: new google.maps.Point(22, 0),
                        labelClass: "labels", // the CSS class for the label
                        labelStyle: {opacity: 0.75},
                    })
                    bounds.extend(point_i);
                };          
            for (var i = 0; i < LINES.length; i++) {
                var point_a = new google.maps.LatLng(
                        parseFloat(LINES[i].getAttribute("latfrom")),
                        parseFloat(LINES[i].getAttribute("longfrom")));
                var point_b = new google.maps.LatLng(
                        parseFloat(LINES[i].getAttribute("latto")),
                        parseFloat(LINES[i].getAttribute("longto")));
                var flightPlanCoordinates = [
                    point_a,
                    point_b
                ];
                var flightPath = new google.maps.Polyline({
                    path: flightPlanCoordinates,
                    geodesic: true,
                    strokeColor: 'black',
                    strokeOpacity: 1,
                    strokeWeight: 1
                });
                flightPath.setMap(map);
            };
            }
            google.maps.event.addDomListener(window, 'load', initialize);
            </script>
            </head>
            <body>
            <center>

<div id="map" style="width:95%; height:1200px;"></div>

            <center>
            </body>
            </html>       
    """
    with open(name, 'w') as f:
        f.write(htmltext)

Step 3: Network Analysis

The pairings (with their calculated walk-times) are loaded back into Python, and those within 5 minutes are added to a list which defines our connections (or edges). Technically, since we have crossed i with all j != i, we have times going from shop A to shop B and from shop B to shop A. I have simply averaged these: so we have a connection between two stores if half the time there and back is less than or equal to 5 minutes.

"""
Network Analysis
"""

import csv
import networkx as nx  
from pylab import *


def is_problematic(grp):
    """
    Various conditions can be applied to
    check if a group is problematic.
    In this case: simply if it contains three
    or more elements
    """
    if len(grp) > 2:
        return True
    return False


def extract_clusters(clusters):  
    """
    From a list of clusters/groups, extract unique stores
    and the cluster if it is problematic
    """
    lst_clqs = []  # Contains all the groups
    unq_stores_clq = [] # Contains stores in a clique
    for c in clusters:
        if is_problematic(c):
            lst_clqs.append(c)
            for x in c:
                if x not in unq_stores_clq:
                    unq_stores_clq.append(x)
    return [lst_clqs,unq_stores_clq]


minutes = 5  # Adjusted to get a good example screenshot

# Read in calculated distances
#[store_a],[lat_a],[lon_a],[store_b],[lat_b],[lon_b],[time_min]
in_data = []
with open('walk_calculated.csv', newline='') as f:
    reader = csv.reader(f)
    headers = next(reader)  # skip the header row
    for x in reader:
        in_data.append(x)

# Define an edge (or a connection) if a store is within 5 minutes of another store
data_edges = [[x[0],x[3]] for x in in_data if float(x[-1]) <= minutes]

# Add edges
G=nx.Graph()
G.add_edges_from(data_edges)

# Clique Analysis
find_cliq = nx.find_cliques(G)
lst_clqs,unq_stores_clq = extract_clusters(find_cliq)
                
print("%d total stores in analysis" % len(store_list))
print("%d stores in problematic cliques:" % len(unq_stores_clq))  

print("Problematic cliques:")
for x in lst_clqs:
    print(x)
    
print("Alternatively, problematic clusters:")
for x in nx.connected_components(G):
    print(x)
    
# Geographic Diagram (instead)
geo_network(store_list, unq_stores_clq, in_data, 'unop_network_map.html', minutes) 

print("%.0f percent of stores in a problematic clique" % (len(unq_stores_clq)/len(store_list)*100))

In this example we have 292 stores to consider, of which 166 are in a ‘problematic clique’ – that is 57% (and indeed “nearly 60%”).

Clustering is a general form of analysis which groups a set of objects such that objects in the same group (cluster) are more similar to each other than to those outside the group. For example, if A is within 5 minutes of B and B is within 5 minutes of C, we could say they form one cluster (A is indirectly connected to C through B). In this scenario we do not care whether the members are directly or indirectly connected. An alternative model may require only direct connections: in a group called a ‘clique’, all members must be directly connected to all other members – using the above example we would end up with two cliques (A, B) and (C, B). If we go further and add the criterion that we must have “3 or more stores in a group”, we can see that with clusters we would flag 3/3 stores whereas with cliques we would flag 0/3.
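
To make the distinction concrete, here is a small networkx sketch of the A, B, C example (a hypothetical three-node graph):

import networkx as nx

# A-B and B-C are within 5 minutes of each other; A-C is not
toy = nx.Graph()
toy.add_edges_from([('A', 'B'), ('B', 'C')])

# One cluster containing all three stores
print(list(nx.connected_components(toy)))  # [{'A', 'B', 'C'}]
# Two cliques of two stores each (ordering may vary)
print(list(nx.find_cliques(toy)))          # e.g. [['A', 'B'], ['B', 'C']]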

Consider this ‘group’ from our diagram:

[Image: a – an example group from the map]

Our analysis reports back the following problematic ‘clique’:

[‘5623_SB’, ‘6097_SB’, ‘6058_SB’]

And the following ‘cluster’:

[‘6073_CC’, ‘6097_SB’, ‘5623_SB’, ‘6058_SB’]

For this cluster to become a clique we would need ‘6073_CC’ to also be connected to ‘6097_SB’ and ‘5623_SB’.

Step 4: Linear Optimisation

The linear optimisation is solved using PuLP. In some ways solving this example is trivial: we could create a list of problematic nodes, sort it so that the node with the most connections appears at the top, and then eliminate nodes (going down the list) until no problematic cliques remain. However, by setting the problem up as an optimisation we can easily extend it to more advanced scenarios (perhaps we don't weight each store equally but some stores are more valuable than others; we may have an AND/OR constraint, which we can introduce into our optimisation using slack variables; etc.)
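
For comparison, here is a minimal sketch of that greedy elimination – a hypothetical approach that reuses G and is_problematic from Step 3, and is not guaranteed to find the optimal (smallest) set of drops:

import networkx as nx

# Hedged sketch of the greedy approach: repeatedly remove the most-connected
# node that still sits in a problematic clique, until none remain
greedy_G = G.copy()
greedy_dropped = []
while True:
    problematic = [c for c in nx.find_cliques(greedy_G) if is_problematic(c)]
    if not problematic:
        break
    candidates = {n for c in problematic for n in c}
    worst = max(candidates, key=lambda n: greedy_G.degree(n))
    greedy_G.remove_node(worst)
    greedy_dropped.append(worst)

print("Greedy elimination drops %d stores" % len(greedy_dropped))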

"""
Optimisation - Remove all Problematic Cliques
"""
import pulp
from pulp import *

# Maximisation (max stores kept s.t. fill constraints)
prob = LpProblem("Maximise points kept",LpMaximize)

# 0. Variables and bounds
x = []
for store in unq_stores_clq:
    # Add each store (in a clique) as a variable
    x.append(LpVariable(store,0,1,LpInteger)) 
    #e.g.
    #Binaries
    #2801_SB
    #2822_SB ...
    
# 1. Objective Function (1.x1 + 1.x2 + 1.x3 ... etc,)
# We want to maximise the number of stores we keep
# Subject to the constraint that we remove our cliques
prob += pulp.lpSum(x[i] for i in range(len(unq_stores_clq)))
#e.g.
#OBJ: 2801_SB + 2822_SB + 2823_SB + 3174_SB + 3178_SB + 3190_SB + ...

# 2. Add constraints (at most 2 stores from each problematic clique may be kept)
for one_clique in lst_clqs:   
    prob += pulp.lpSum(
        x[unq_stores_clq.index(one_clique[i])] 
        for i in range(len(one_clique))) <= 2
    #e.g.
    #_C1: 6372_SB + 6782_SB + 6786_SB <= 2
    #_C2: 5518_SB + 5533_SB + 5991_SB <= 2
    #...
    
# Save "LP" file for other solvers
prob.writeLP("cliques.lp")
# Solve
prob.solve()
# The status of the solution is printed to the screen
print("Status:", LpStatus[prob.status])
# Optimum variable values
drop_points = []
for v in prob.variables():
    print(v.name, "=", v.varValue)
    # Stores optimal to drop
    if v.varValue == 0:
         drop_points.append(v.name)
# Optimised objective function
print("Number of points kept = ", int(value(prob.objective)))   
# Dropping
print("Number of points dropped = ", len(unq_stores_clq) - int(value(prob.objective)))   

The great thing about PuLP is that it outputs an ‘.lp’ file which we can submit to most other solvers!

Notice that in this example we have full control (our ‘variables’) over all Coffee Shops – it may be the case that we have control over only the SB (Starbucks) shops, for example.
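
In that case, a minimal sketch of the extra constraints (hypothetical; they would be added before the prob.solve() call, and note the problem can become infeasible if some clique contains three or more stores we don't control) would be to fix every non-SB variable at 1:

# Hedged sketch: only Starbucks ('SB') stores may be dropped; every other
# store is forced to be kept by fixing its variable at 1
for i, store in enumerate(unq_stores_clq):
    if not store.endswith('_SB'):
        prob += x[i] == 1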

Here is a truncated glimpse of our ‘problem’:

\* Maximise stores kept *\
Maximize
OBJ: 1767_SB + 2020_CN + 2022_SB + 2023_CC + 2040_CN + 2360_CN + 2532_SB
+ 2534_SB + 2773_CN + 2782_SB + 2786_SB + 2800_CN + 2801_SB + 2822_SB
+ 2823_SB + 2894_SB + 2904_CC + 2921_CC + 3174_SB + 3190_SB + 3208_CN
+ 3570_SB + 3659_CN + 3684_CC + 3688_SB + 3691_SB + 3696_CC + 3715_CN
+ 3716_SB + 3723_SB + 3725_CN + 3732_CC + 3735_SB + 3747_SB + 3774_CC
+ …
Subject To
_C1: 4225_SB + 4230_CC + 4247_CN + 4248_CN + 4249_SB <= 2
_C10: 3696_CC + 3723_SB + 3732_CC <= 2
_C11: 3723_SB + 3725_CN + 3732_CC <= 2
_C12: 5287_SB + 5292_CC + 5296_CC <= 2
_C13: 5939_CC + 5946_CN + 5950_SB <= 2
_C14: …
Binaries
1767_SB
2020_CN
2022_SB
2023_CC
2040_CN

End

If we wanted to, we could submit this to the powerful CPLEX solver simply by uploading that file.
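
Alternatively, if a licensed CPLEX installation is available locally, PuLP can call it directly rather than going via the ‘.lp’ file – a hedged sketch, assuming the cplex executable is on the system path:

# Hedged sketch: re-solve the same problem with CPLEX instead of the default CBC solver
prob.solve(pulp.CPLEX_CMD())
print("Status:", LpStatus[prob.status])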

Step 5: Verification

The optimisation resulted in 66 points being dropped (and 100 kept). We can quickly re-create the network and use the helper graph function (from Step 2) to see what has changed:

"""
Test out our new number of cliques
"""
opt_edges = [x for x in data_edges if (x[0] not in drop_points and x[1] not in drop_points)]
opt_in_data = [x for x in in_data if (x[0] not in drop_points and x[3] not in drop_points)]
opt_store_list = [x for x in store_list if x[0] not in drop_points]
# Add edges
opt_G=nx.Graph()
opt_G.add_edges_from(opt_edges)
# Clique Analysis
opt_find_cliq = nx.find_cliques(opt_G)
opt_lst_clqs,opt_unq_stores_clq = extract_clusters(opt_find_cliq)
print("We have %d stores in a problematic clique" % len(opt_unq_stores_clq))
# Geographic Diagram (instead)
geo_network(opt_store_list, opt_unq_stores_clq, opt_in_data, 'opt_network_map.html', minutes)      

Here is the original example (unoptimised) from Step 3 (small crop):

[Image: unopt_a – unoptimised network (crop)]

Here is the optimised example:

[Image: unopt_b – optimised network (crop)]

We can run a few visual checks to make sure the optimisation works as we expected: the ‘clique’ in the bottom right could be solved either by dropping 2 red points (at the outer end) or by dropping 1 red point (the inner one) – we can see that the optimisation dropped only the one point.

In general, by framing our analysis like this we can easily jump to much more complicated forms of analysis: imagine that we don't strictly care about minimising dropped stores, but that each store has a revenue associated with it which we want to maximise in total; and that instead of removing all cliques, we are only required to remove enough of them so that fewer than 10% of our stores sit within a problematic clique.
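
As a flavour of the first extension, switching to a revenue-weighted objective would be a one-line change – a hedged sketch, where revenue is a hypothetical dictionary mapping store IDs to annual revenue (the partial “fewer than 10%” requirement would additionally need indicator variables per clique):

# Hedged sketch: in place of the Step 4 objective, maximise total revenue kept
# ('revenue' is a hypothetical {store_id: annual_revenue} dictionary)
prob += pulp.lpSum(revenue[unq_stores_clq[i]] * x[i]
                   for i in range(len(unq_stores_clq)))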