I recently had the task of calculating the (un-weighted) mean centre for a group of points (technically, checking what had been calculated using ArcGIS), e.g. we have 5 points in a group (and perhaps 10 groups) and so we want 10 centres. The script below works fine for this task; however, it raised an interesting question for me: why does the gCentroid {rgeos} command give a mean and not a centroid?

R-Script

require(rgdal)
require(rgeos)

setwd('.../Customer Data/')
prefix <- list("COMPA","COMPB","COMPC")

#Using: http://www.spatialreference.org/ref/esri/102010/ we get the Proj4js format
no_am_eq_co <- "+proj=eqdc +lat_0=0 +lon_0=0 +lat_1=20 +lat_2=60 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"
wgs84 <- "+proj=longlat +datum=WGS84"

for (i in 1:length(prefix)) {
  # Import customers for each of 3 firms and combine into one data-frame
  pre = prefix[[i]]
  pat = paste("(",pre,")(.*)(.csv)$",sep='')
  files = list.files(pattern=pat)
  comb_custs = do.call("rbind", lapply(files, function(x) read.csv(x, header=TRUE, stringsAsFactors = FALSE)))
  # Create dataframe for GIS in R
  geo_comb_custs <- comb_custs[, grepl("lat|lon|cust", names(comb_custs))]
  colnames(geo_comb_custs)[grepl('cust',colnames(geo_comb_custs))] <- 'Name'
  # FROM: Coordinates are geographic latitude/longitudes
  coordinates(geo_comb_custs) <- c("lon", "lat")
  proj4string(geo_comb_custs) <- CRS(wgs84)
  # TO: Project into North America Equidistant Conic
  df <- spTransform(geo_comb_custs, CRS(no_am_eq_co))
  # Get centroids
  ctrs <- lapply(unique(df$Name), 
                 function(x) gCentroid(SpatialPoints(df[df$Name==x,])))
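  # Note: for a SpatialPoints object, gCentroid simply returns the mean of the
  # x- and y-coordinates (the mean centre) rather than a polygon centroid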
  ctrsout <- setNames( ctrs , unique(df$Name ) )
  # Create data frame 
  df <- do.call(rbind, lapply(ctrsout, data.frame, stringsAsFactors=FALSE))
  coordinates(df) <- c("x", "y")
  proj4string(df) <- CRS(no_am_eq_co) 
  df <- as.data.frame(spTransform(df, CRS(wgs84)))
  names(df) <- c("longitude", "latitude")
  # Export centroids (calculated using R)
  write.csv(x=df, file = paste("r_centres_",pre,"_customers.csv",sep=''), row.names=TRUE)
}

This script cycles through all the CSVs whose names contain the prefix (e.g. “COMPAone.csv”, “COMPAtwo.csv”, etc.), combines them into one data frame, projects the geographic co-ordinates, and calculates the centres.

The gist of this is: gCentroid calculates a true centroid (a centre of gravity, assuming uniform density) when it is supplied with a polygon. When given a group of points, however, rather than fitting some arbitrary polygon (obviously there are many polygons that can be fitted through n points) or constructing a convex hull, it returns the mean centre: (average of the x-coordinates, average of the y-coordinates).
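As a minimal illustration (using numpy, with made-up projected coordinates), this is all that the point-set “centroid” amounts to:

import numpy as np

# Hypothetical projected (x, y) coordinates for one group of points
pts = np.array([[1200.0, 3400.0],
                [1500.0, 3100.0],
                [1300.0, 3550.0]])

# The "centroid" reported for a point set is just the column-wise mean:
# (average of x, average of y)
mean_centre = pts.mean(axis=0)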

I thought it would be interesting to create this work-flow from scratch. With some help from Wikipedia we can use numpy to vectorise the calculations:

For a polygon with vertices (x_0, y_0), ..., (x_{n-1}, y_{n-1}), taking (x_n, y_n) = (x_0, y_0):

A = \frac{1}{2} \sum_{i=0}^{n-1} (x_i y_{i+1} - x_{i+1} y_i)

C_x = \frac{1}{6A} \sum_{i=0}^{n-1} (x_i + x_{i+1})(x_i y_{i+1} - x_{i+1} y_i)

C_y = \frac{1}{6A} \sum_{i=0}^{n-1} (y_i + y_{i+1})(x_i y_{i+1} - x_{i+1} y_i)

The Python code below applies this to an example group of points. I have already calculated the gCentroid for these points using R (so that we can compare):

df <- as.data.frame(list(c(34.7573, 38.30088, 38.712147, 39.704905),
                         c(-86.678606,-76.520266,-77.158616, -84.126463)))
df$Name <- "points_A"
colnames(df) <- c("lat", "lon", "Name")
# ... follow as above ...
# Result:
# 37.94873, -81.18378

Python-script

import numpy as np
from pyproj import Proj, transform

# Using: http://www.spatialreference.org/ref/esri/102010/ we get the Proj4js format
na_eq_co = "+proj=eqdc +lat_0=0 +lon_0=0 +lat_1=20 +lat_2=60 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"
wgs84 = "+proj=longlat +datum=WGS84"

def proj_arr(points,proj_from,proj_to):
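    # Re-project an array of (x, y) points from proj_from to proj_to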
    inproj = Proj(proj_from)
    outproj = Proj(proj_to)
    func = lambda x: transform(inproj,outproj,x[0],x[1])
    return np.array(list(map(func, points)))

def get_polygon_centroid(polygon):
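    # Assumes the vertices are supplied in boundary order; a different ordering
    # describes a different polygon and therefore gives a different centroid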
    # https://en.wikipedia.org/wiki/Centroid#Centroid_of_polygon
    pol = np.array(polygon)
    # Close
    if np.any(pol[-1] != pol[0]):
        pol = np.append(pol,[pol[0]], axis=0)
    else:
        print('polygon closed')
    pol_area = get_polygon_area(pol)
    x = pol[:,0]
    y = pol[:,1]
    Cx = np.sum((x[:-1] + x[1:]) * ((x[:-1] * y[1:]) - (y[:-1] * x[1:]))) / (6. * pol_area)
    Cy = np.sum((y[:-1] + y[1:]) * ((x[:-1] * y[1:]) - (y[:-1] * x[1:]))) / (6. * pol_area)
    return np.array([Cx, Cy])

def get_polygon_area(polygon):
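    # Shoelace formula: returns the signed area (positive for counter-clockwise
    # vertices); the sign cancels in get_polygon_centroid, so vertex orientation
    # does not affect the centroid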
    pol = np.array(polygon)
    x = pol[:,0]
    y = pol[:,1]
    return np.sum( (x[:-1] * y[1:]) - (y[:-1] * x[1:]) ) / 2 

def get_polygon_mean(polygon):
    pol = np.array(polygon)
    x = pol[:,0]
    y = pol[:,1]
    return np.array([np.mean(x),np.mean(y)])

def run_test(points):
    points = points[:,::-1] #Flip-axis (so that longitude x-axis, latitude y-axis)
    points_proj = proj_arr(points,wgs84,na_eq_co)

    centroid_proj = get_polygon_centroid(points_proj)
    mean_proj = get_polygon_mean(points_proj)

    centroid = proj_arr([centroid_proj],na_eq_co,wgs84)
    mean = proj_arr([mean_proj],na_eq_co,wgs84)
    return (centroid[:,::-1][0], mean[:,::-1][0])

def wrong_method(points):
    points = points[:,::-1] #Flip-axis (so that longitude x-axis, latitude y-axis)
    centroid = get_polygon_centroid(points)
    mean = get_polygon_mean(points)
    return ([centroid[1],centroid[0]], [mean[1],mean[0]])


if __name__ == '__main__':
    
    ##############################################################################
    # Example 1: Why it is important to project the points before 
    ##############################################################################
    my_points = np.array([
            [34.7573,-86.678606],
            [38.30088,-76.520266],
            [38.712147,-77.158616],
            [39.704905,-84.126463]
        ])
    test = wrong_method(my_points)
    print("Centroid calculation (un-projected): {0}\nMean calculation {1}".format(test[0],test[1]))    
    #Centroid calculation (un-projected): [37.65541, -82.27876]
    #Mean calculation [37.86880, -81.12098]
    my_points = np.array([
            [34.7573,-86.678606],
            [38.30088,-76.520266],
            [38.712147,-77.158616],
            [39.704905,-84.126463]
        ])
    test = run_test(my_points)
    print("Centroid calculation: (projected): {0}\nMean calculation {1}".format(test[0],test[1]))
    #Centroid calculation: [37.72876, -82.35113]
    #Mean calculation [37.94873, -81.18378]
    
    ##############################################################################
    # Example 2: Ordering of the points matters if we are fitting a polygon ...
    ##############################################################################
    my_points = np.array([
            [38.30088,-76.520266],
            [38.712147,-77.158616],
            [34.7573,-86.678606],
            [39.704905,-84.126463]
        ])
    test = run_test(my_points)
    print("Centroid calculation: {0}\nMean calculation {1}".format(test[0],test[1]))
    #Centroid calculation: [37.71966, -82.93011]
    #Mean calculation [37.94873, -81.18378]
    
    my_points = np.array([
            [39.704905,-84.126463],
            [34.7573,-86.678606],
            [38.30088,-76.520266],
            [38.712147,-77.158616]         
        ])
    test = run_test(my_points)
    print("Centroid calculation: {0}\nMean calculation {1}".format(test[0],test[1]))
    #Centroid calculation: [37.72876, -82.35113]
    #Mean calculation [37.94873, -81.18378]

Some interesting points to note:

First, we match the R-script calculation when we consider the mean and not the centroid (given that we have supplied points rather than a polygon).

Second, it is important to project our geographic co-ordinates (for any such calculation). Here is an example which shows the points (vertices) in blue, the projected mean (37.94873, -81.18378) in green and the un-projected mean (37.86880, -81.12098) in red:

[Figure: the four example points (blue) with the projected mean (green) and the un-projected mean (red)]

The centroid and mean calculations assume points on a Cartesian plane, and thus equal spacing between intervals on the x- and y-axes, which is only achieved once the co-ordinates are projected.
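To put a figure on that discrepancy, one option (a sketch using pyproj's Geod class, not part of the workflow above) is to measure the geodesic distance between the two means:

from pyproj import Geod

geod = Geod(ellps="WGS84")

# (lat, lon) of the projected and un-projected means from Example 1
projected_mean = (37.94873, -81.18378)
unprojected_mean = (37.86880, -81.12098)

# Geod.inv takes lon/lat pairs and returns the forward azimuth, back azimuth
# and the distance in metres
_, _, dist_m = geod.inv(projected_mean[1], projected_mean[0],
                        unprojected_mean[1], unprojected_mean[0])
print("Separation between the two means: %.1f km" % (dist_m / 1000))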

Third, if we jumble the order of the points, the mean calculation is unaffected. However, since we are treating those points as the vertices of a polygon, the polygon changes and so, naturally, does its centroid.
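If a well-defined polygon is wanted from an unordered set of points, one common approach (a sketch, not something the scripts above do) is to sort the vertices by angle around their mean centre before applying the centroid formula, so that any permutation of the same points yields the same simple polygon:

import numpy as np

def order_vertices(points):
    # Sort (x, y) vertices counter-clockwise by angle about their mean centre
    pts = np.asarray(points, dtype=float)
    centre = pts.mean(axis=0)
    angles = np.arctan2(pts[:, 1] - centre[1], pts[:, 0] - centre[0])
    return pts[np.argsort(angles)]

# e.g. get_polygon_centroid(order_vertices(points_proj))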

Having said this, it would be nice to check that the centroids match the R-script. I now work directly with a group of projected co-ordinates (in case there are any differences in the projections).

Using the Python script I run:

centroid = get_polygon_centroid(np.array([[-6424797.94257892, 7164920.56353916],
                                          [-5582828.69570672, 6739129.64644454],
                                          [-5583459.32266293, 6808624.95123077],
                                          [-5855637.16642608, 7316808.01148585],
                                          [-5941009.53089084, 7067939.71641507],
                                          [-6424797.94257892, 7164920.56353916]]))
#polygon closed
#[-5875317.84402261  7010915.37286505]

This gives us (-5875317.84402261, 7010915.37286505). I then run the same calculation using the R-script; here we have to construct a polygon object (otherwise, as above, gCentroid just returns the mean centre) and then plot the polygon together with the two centroids:

x = readWKT(paste("POLYGON((-6424797.94257892  7164920.56353916,
                  -5582828.69570672  6739129.64644454,
                  -5583459.32266293  6808624.95123077,
                  -5855637.16642608  7316808.01148585,
                  -5941009.53089084  7067939.71641507,
                  -6424797.94257892  7164920.56353916))"))

python_cent = readWKT(paste("POINT(-5875317.84402261  7010915.37286505)"))
r_cent = gCentroid(x) 

plot(x)
plot(r_cent,add=T,col='red', pch = 0) # red square
plot(python_cent, add=T, col='green', pch = 1) # green circle

[Figure: the polygon with the R centroid (red square) and the Python centroid (green circle) coinciding]

Fortunately, everything matches. This means we can put together quite a good workflow using Python + numpy instead of R when there are a very large number of centroids/centres to calculate.
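As a final sketch of what such a workflow could look like, using pandas for the CSV handling and grouping (the column names lat, lon and cust are assumptions carried over from the R script):

import glob
import pandas as pd
from pyproj import Proj, transform

na_eq_co = "+proj=eqdc +lat_0=0 +lon_0=0 +lat_1=20 +lat_2=60 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"
wgs84 = "+proj=longlat +datum=WGS84"

def mean_centres(prefix):
    # Combine all CSVs for one firm (e.g. COMPAone.csv, COMPAtwo.csv, ...)
    files = glob.glob(prefix + "*.csv")
    df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    # Project lon/lat into North America Equidistant Conic (whole arrays at once)
    df["x"], df["y"] = transform(Proj(wgs84), Proj(na_eq_co),
                                 df["lon"].values, df["lat"].values)
    # Mean centre per customer group, then back to geographic coordinates
    centres = df.groupby("cust")[["x", "y"]].mean()
    lon, lat = transform(Proj(na_eq_co), Proj(wgs84),
                         centres["x"].values, centres["y"].values)
    return pd.DataFrame({"longitude": lon, "latitude": lat},
                        index=centres.index)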