PostGIS presentation

So, what is Volunteered Geographic Information?

As I said, it can be anything as long as it's geotagged

A photo posted by Levente (@jlevente) on Aug 9, 2015 at 1:57pm PDT

From street photos

To map edits

Or restaurant reviews

And tweets

And many other sources...

Including citizen science data as well

And it all fits nicely in the definition of big data

Things I do

Functions, functions, functions...

PostGIS comes with a set of functions.


SELECT ST_AsText(
	ST_ClosestPoint(
		ST_GeomFromText('POLYGON((175 150, 20 40, 50 60, 125 100, 175 150))'),
		ST_Buffer(ST_GeomFromText('POINT(110 170)'), 20)));
------------------------------------------
 POINT(140.752120669087 125.695053378061)

What if there's no function for the task I want to do?

Create your own!


CREATE FUNCTION add_two(integer) RETURNS integer
    AS 'select $1 + 2;'
    LANGUAGE SQL
    IMMUTABLE
    RETURNS NULL ON NULL INPUT;
					


select my_column, add_two(my_column) from my_table where my_column < 144

my_column | add_two
-------------------
        3 |      5
      143 |    145

Procedural languages other than SQL

PL/pgSQL
PL/Python
PL/Perl

PL/R

Just a few examples

Breaking lines at long segments


-- Function that breaks lines at long segments
-- Works with simple LineStrings, outputs either LineStrings or MultiLineStrings
CREATE OR REPLACE FUNCTION break_lines(line geometry, distance integer) RETURNS geometry AS
$BODY$
DECLARE
  segment geometry;
  line_1 geometry;
  started boolean;
  multi boolean;
  multi_started boolean;
  multi_line geometry;
  length float;
  i int;
  count int;
 
BEGIN
  count := ST_NPoints(line);
  IF count IS NULL THEN
    RETURN NULL;
  END IF;
  IF count < 2 THEN
    RETURN NULL;
  END IF;
  started := false;
  multi := false;
  multi_started := false;
  FOR i IN 1..count LOOP
    segment := ST_MakeLine(ST_PointN(line, i), ST_PointN(line, i + 1));
    length := ST_Length(segment, false);
    IF length IS NULL THEN
      CONTINUE;
    END IF;
    IF length < distance THEN
      IF NOT started THEN
	line_1 := ST_MakeLine(ST_PointN(line, i), ST_PointN(line, i + 1));
	started := true;
      ELSE
	line_1 := ST_MakeLine(line_1, ST_PointN(line, i + 1));
      END IF;
    ELSE
      RAISE NOTICE 'Long segment!';
      RAISE NOTICE 'Length: %', length;
      IF line_1 IS NULL THEN
	CONTINUE;
    END IF;
    IF NOT multi_started THEN
      multi_line := line_1;
      multi_started := true;
      started := false;
      multi := true;
 
    ELSE
      multi_line := ST_Union(multi_line, line_1);
      multi := true;
      started := false;
    END IF;
    END IF;
  END LOOP;
  IF NOT multi THEN
    RETURN line_1;
  ELSIF multi AND line_1 IS NOT NULL THEN
      RETURN ST_Union(multi_line, line_1);
  ELSE
    RETURN multi_line;
  END IF;
END
$BODY$
LANGUAGE 'plpgsql';


ST_NPoints(), ST_PointN(), ST_Length(), ST_Union(), ST_MakeLine()

Breaking lines at long segments


CREATE TABLE split_lines_1000m AS
  SELECT row_number() OVER (ORDER BY data.geom ASC) as id, some_other_important_column,
    break_lines(data.geom, 1000) as geom,
  FROM 
    (SELECT * FROM mytable) AS data

Implementing a formula

Radius of gyration

Radius of gyration is a measure to quantify the spread of locations around their center of mass

Looks complicated?

It's not.

All we need is to find the center of mass and measure distances

ST_Centroid(), ST_Distance()

Radius of Gyration


-- Function to compute radius of gyration
CREATE OR REPLACE FUNCTION comp_gyration(usr text, tab character varying)
 RETURNS SETOF double precision AS
$BODY$
DECLARE
 rec geometry;
 mass_center geometry;
 line geometry;
 count int;
 count_points int;
 count_geometries int;
 points_num int;
 geometry_num int;
 point geometry;
 tmp int;
 sum float;
 gyration float;
 
BEGIN
-- radius of gyration = sqrt(1/n * sum( distance(point - center of mass)^2)
-- n: count, sum: sum
-- Calculate center of mass
-- Use subquery. ST_Dump can't be nested with ST_Collect
 EXECUTE 'SELECT ST_Centroid(ST_Collect(foo.geom2)) FROM (SELECT (ST_DumpPoints(geom)).geom as geom2 FROM ' || tab || ' WHERE key = ' || quote_literal(usr) || ')as foo;' INTO mass_center;
 sum := 0;
 count := 0;
-- For each geometry that belongs to a user
 FOR rec in EXECUTE 'SELECT geom FROM ' || tab || ' where key =' || quote_literal(usr) || ';'
 LOOP
-- Handle multilines
 CASE ST_GeometryType(rec)
 WHEN 'ST_MultiLineString' THEN
-- Extract each linestrings
 geometry_num := ST_NumGeometries(rec);
 count_geometries := 1;
 LOOP
 count_points := 1;
 line := ST_GeometryN(rec, count_geometries);
 points_num := ST_NPoints(line);
 LOOP
-- calculate distance from each point to the center of mass. add to sum
 sum := sum + ST_Distance(ST_PointN(line, count_points), mass_center, false)^2;
 count_points := count_points+ 1;
-- get next point, exit at last
 count := count + 1;
 IF count_points > points_num THEN
 EXIT;
 END IF;
 END LOOP;
-- move to next linestring, exit at last
 count_geometries := count_geometries + 1;
 IF count_geometries > geometry_num THEN
 EXIT;
 END IF;
 END LOOP;
-- If geometry is not multi, loop thrugh all points
 WHEN 'ST_LineString' THEN
 
 points_num := ST_NPoints(rec);
 count_points:= 1;
 LOOP
 sum := sum + ST_Distance(ST_PointN(rec, count_points), mass_center, false)^2;
 count_points := count_points+ 1;
 count := count + 1;
 IF count_points = points_num THEN
 EXIT;
 END IF;
 END LOOP;
-- If geometry is polygon
 WHEN 'ST_Polygon' THEN
-- Extract boundary as line, then use method above
 rec := ST_Boundary(rec);
 IF rec IS NOT NULL THEN
 points_num := ST_NPoints(rec);
 count_points:= 1;
 LOOP
 -- exit before last point (to avoid double counting of first point)
 IF count_points = points_num THEN
 EXIT;
 END IF;
 sum := sum + ST_Distance(ST_PointN(rec, count_points), mass_center, false)^2;
 count_points := count_points + 1;
 count := count + 1;
 END LOOP;
 ELSE
 count := 0;
 END IF;
-- Finally if geometry is point, just calculate sum of square distances
 ELSE
 sum := sum + ST_Distance(rec, mass_center, false)^2;
 count := count + 1;
 
 END CASE;
 END LOOP;
-- Return radius of gyration for the user
 
 IF count = 0 THEN
 gyration := 0;
 ELSE
 gyration := sqrt(CAST(1 AS float)/CAST(count AS FLOAT) * sum);
 END IF;
-- RAISE NOTICE 'gyration: %', gyration;
 RETURN NEXT gyration;
END
$BODY$
 LANGUAGE plpgsql 
 
 SELECT comp_gyration('some_user_key', 'some_table_with_geometry');
-- Another approach:
ALTER TABLE some_table ADD COLUMN rad_gyr float;
UPDATE some_table SET rad_gyr = comp_gyr(key, 'data_table');

Working with a massive dataset

Cross-checking map edits

See video from before

Pretty straightforward spatio-temporal overlap between 2 data sources

There's always a BUT

Even the simplest queries would run forever on a database like this

So what can we do?

Let's optimize everything!

Lessons learned

Keep your database statistics updated


VACUUM ANALYZE;

Check on

AUTOVACUUM

Monitor your queries and see what's expensive


EXPLAIN ANALYZE
SELECT id, version, minor, visible, user_id, user_name, valid_from, valid_to, tags, z_order, ST_AsText(geom) FROM southeurope_line
WHERE
 (valid_to between '2015-12-20'::timestamp  and '2015-12-20'::timestamp + interval '25' hour or valid_from between '2015-12-20'::timestamp and '2015-12-20'::timestamp + interval '25' hour) 
 and geom && ST_MakeEnvelope(2.124825,41.352072,2.219582,41.407716)

QUERY PLAN
"Bitmap Heap Scan on southeurope_line  (cost=8.92..33.04 rows=6 width=662) (actual time=32.294..34.359 rows=134 loops=1)"
"  Recheck Cond: (((geom && '010300000001000000050000001DC9E53FA4FF00402DEBFEB110AD44401DC9E53FA4FF00402907B30930B444407A522635B4C101402907B30930B444407A522635B4C101402DEBFEB110AD44401DC9E53FA4FF00402DEBFEB110AD4440'::geometry) AND (valid_to >= '2015-12-20  (...)"
"  Heap Blocks: exact=38"
"  ->  BitmapOr  (cost=8.92..8.92 rows=6 width=0) (actual time=32.231..32.231 rows=0 loops=1)"
"        ->  Bitmap Index Scan on southeurope_line_geom_and_time_index  (cost=0.00..4.46 rows=3 width=0) (actual time=19.530..19.530 rows=75 loops=1)"
"              Index Cond: ((geom && '010300000001000000050000001DC9E53FA4FF00402DEBFEB110AD44401DC9E53FA4FF00402907B30930B444407A522635B4C101402907B30930B444407A522635B4C101402DEBFEB110AD44401DC9E53FA4FF00402DEBFEB110AD4440'::geometry) AND (valid_to >= '20 (...)"
"        ->  Bitmap Index Scan on southeurope_line_geom_and_time_index  (cost=0.00..4.46 rows=3 width=0) (actual time=12.699..12.699 rows=79 loops=1)"
"              Index Cond: ((geom && '010300000001000000050000001DC9E53FA4FF00402DEBFEB110AD44401DC9E53FA4FF00402907B30930B444407A522635B4C101402907B30930B444407A522635B4C101402DEBFEB110AD44401DC9E53FA4FF00402DEBFEB110AD4440'::geometry) AND (valid_from >= ' (...)"
"Planning time: 0.403 ms"
"Execution time: 34.488 ms"

Create smaller tables

Go deeper if that's not enough

Index your data!

Indexes are a common way to enhance database performance. An index allows the database server to find and retrieve specific rows much faster than it could do without an index.

What happens in the DB without an index?

Your database literally checks every single record to see if it should be returned

Not so cool when having a large table

What if I had an index?

Your database effectively cuts a portion of your data and performs checks on a much smaller set

And working on a smaller set it always faster

Indexes

Think of them as a Table of Contents

That your database engine always opens when you need something

So it doesn't have to read through the whole book but only the page where your records are located

Spatial Index

A spatial index index usually uses the bounding box of features as a proxy. A light read on the process can be found here

They are really important to have!

Spatial Index


CREATE INDEX [indexname] ON [tablename] USING GIST ([geometrycolumn]);

Really, don't miss creating this! Did I mention they're important?

Back to the map example

Query for change candidates in a city-sized area:

So, how long did it take?

To check 5 billion records

In 70+ different tables

If they overlap with tens of thousands irregular areas

From all around the world

At a given point of time?

A little over a day!

Which is reasonable.

Summary

You don't have to be a computer scientist to use PostgreSQL/PostGIS ...

... If you're willing to learn about it.

But that's what scientists do

We extend horizons and navigate to previously unknown areas

So don't be afraid! It's fun!

Use your resources

gis.stackexchange.com

PostgreSQL documentation

PostGIS manual

OsGeo mailing lists

Just ask!

Questions?

levente.juhasz@ufl.edu

@juhaszlevi

So, what is Volunteered Geographic Information?

Things I do

Functions, functions, functions...

What if there's no function for the task I want to do?

Create your own!

Just a few examples

Breaking lines at long segments

Breaking lines at long segments

Breaking lines at long segments

Implementing a formula

Radius of Gyration

Social media example

Setup

How did PostgreSQL/PostGIS help in this study?

Data collection

DB connected to frontend to display photos for students

Multiple geometries stored for each photo

Working with a massive dataset

Cross-checking map edits

There's always a BUT

Even the simplest queries would run forever on a database like this

So what can we do?

Let's optimize everything!

Lessons learned

Index your data!

What happens in the DB without an index?

What if I had an index?

Indexes

Spatial Index

Spatial Index

Back to the map example

So, how long did it take?

A little over a day!

Summary

So don't be afraid! It's fun!

Use your resources

Questions?