PostGIS & R

Lately, I have been facing the same problem several times while trying to spatially aggregate millions of spatial points over a spatial distribution of thousands of (non-overlapping) polygons: how to efficiently identify the polygon in which every point is located? Indeed, as the number of elements increases, R becomes less and less efficient. In this post, I am going to show you how to efficiently perform operations between geometries by interfacing with PostGIS from R.

Let us consider two shapefiles to illustrate the problem. The file Tweets gathers information on about 9 million geolocated tweets posted in the UK. Only a sample of 1,000 tweets drawn at random is available for download. This shapefile contains 5 attributes:

  • Tweet_ID
  • User_ID
  • Unix Time
  • Longitude
  • Latitude

The second shapefile, OA, is composed of 65,993 Census Output Areas, which are the smallest geographical units for which 2011 Census data are released in England & Wales. This shapefile contains 2 attributes: the OAs’ ID and a binary column GL, equal to 1 if the OA is located in Greater London and 0 otherwise (this will be useful for a future post).

As a first step, we can import and visualize the data with R (package rgdal).

library(rgdal)
# read the two shapefiles (layer name = file name without the extension)
OA=readOGR("OA.shp","OA")
Tweets=readOGR("Tweets.shp","Tweets")

par(mar=c(0,0,0,0))
plot(OA, col="white", border="gray")
plot(Tweets, pch=".", col="steelblue2",add=TRUE)

Usually, I would use the function gIntersects from the rgeos package to identify the OA in which every tweet is located. But in this case it would represent about 6 x 10^11 operations, which R is not going to handle in a reasonable time (at least not on my desktop machine).
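
Just to give an idea of what this would look like, here is a sketch that is only worth running on the 1,000-tweet sample, since with byid=TRUE the result holds one logical entry per tweet/OA pair:

library(rgeos)
# one logical entry per tweet/OA pair: fine on the sample,
# hopeless on 9 million tweets x 65,993 OAs
int_mat=gIntersects(Tweets, OA, byid=TRUE)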

This is why I got interested in PostGIS. I had already used PostgreSQL several times, but I had never used its spatial extension. I am not going to show you how to install PostgreSQL & PostGIS on your computer; a lot of articles available on the web will do that better than me (have a look at this one for example).

Once you have pgAdmin III installed, create a new database that we are going to call gIntersects, and don’t forget to add the postgis extension.
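
If you prefer typing over clicking, the extension can also be added from the SQL window of pgAdmin III with a single statement:

CREATE EXTENSION postgis;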

Then, import Tweets and OA using the PostGIS Shapefile and DBF loader 2.2 available in the plugins menu. Note that the attribute SRID is set to 0 by default. This parameter is nothing more than the EPSG projection reference code, which pgAdmin III is not able to detect automatically. To obtain this code from the .prj file you can use this very useful website.
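
Since the shapefiles are already loaded in R, another way to find the right code is to print their CRS strings (using the proj4string accessor from the sp package, loaded along with rgdal) and match them against the website above:

# PROJ.4 strings read from the .prj files; match them against the EPSG
# registry to get the numeric SRID to type into the loader
proj4string(OA)
proj4string(Tweets)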

Finally, we can open a connection to PostGIS from R using the package RPostgreSQL.

library(RPostgreSQL)
# open a connection to the gIntersects database (replace user/password with yours)
drv=dbDriver("PostgreSQL")
con=dbConnect(drv, dbname='gIntersects', host='localhost', port='5432', user="Keyser", password="Soze")

Let’s first have a look at the tables that we have in our database to verify that tweets and oa are there.

dbGetQuery(con, "SELECT table_name FROM information_schema.tables WHERE table_schema='public' AND table_type='BASE TABLE';")

Now, we identify the OA associated with every tweet (if any) by extracting the pairs of tweets and OAs that ‘spatially intersect’ using the ST_Intersects function. The results are stored in a data frame int. It is very (very) efficient: it took less than 4 minutes on my desktop machine!

int=dbGetQuery(con, "SELECT tweets.tweet_id, oa.id FROM oa, tweets WHERE ST_Intersects(oa.geom, tweets.geom)")
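
A good part of this speed comes from the spatial indexes: the shapefile loader typically offers to create a GiST index on the geometry column at import time. If that option was not ticked, the indexes can be added by hand before running the query (a sketch, assuming these index names are not already taken):

# GiST indexes let ST_Intersects prune candidate pairs instead of testing
# every tweet against every OA
dbSendQuery(con, "CREATE INDEX tweets_geom_gist ON tweets USING GIST (geom);")
dbSendQuery(con, "CREATE INDEX oa_geom_gist ON oa USING GIST (geom);")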

Then, we can extract the Tweets attribute table and merge it with int. Note that int does not contain any duplicated tweets since OA is composed of non-overlapping polygons.

# pull the tweets attribute table, then append the matching OA ID to each tweet
dat=dbGetQuery(con, "SELECT tweet_id, user_id, time, longitude, latitude FROM tweets")
dat=cbind(dat[match(int[,1], dat[,1]),], int[,2])
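
Note that the two steps above could also be collapsed into a single query, letting PostGIS do the join and return the attributes together with the OA ID (an equivalent sketch, kept as two steps above only to make the logic explicit):

dat=dbGetQuery(con, "SELECT t.tweet_id, t.user_id, t.time, t.longitude, t.latitude, o.id AS oa_id
                     FROM tweets t JOIN oa o ON ST_Intersects(o.geom, t.geom)")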

Finally, we close the connection,

dbDisconnect(con)

and we pick a tweet at random to check that it is contained in the OA it has been associated with!

# draw a tweet at random, plot its OA and overlay the tweet location
i=sample(nrow(dat),1)
par(mar=c(0,0,0,0))
plot(OA[as.character(OA@data[,1])==as.character(dat[i,6]),], col="grey", border="grey")
points(dat[i,4:5], pch=16, cex=2, col="steelblue2")