Zonal Summations of GBIF Species Occurrence Data On GPUs

Summary

Analyzing how species are distributed on the Earth has been one of the fundamental questions in the intersections of environmental sciences, geosciences and biological sciences. With world-wide data contributions, more than 375 million species occurrence records for nearly 1.5 million species have been deposited to the Global Biodiversity Information Facility (GBIF) data portal. The sheer amounts of point and polygon data and the computation-intensive point-in-polygon tests for zonal statistics for biodiversity studies have imposed significant technical challenges. In this study, we have significantly extended our previous work on parallel primitives based spatial joins on commodity Graphics Processing Units (GPUs) and have developed new efficient and scalable techniques to enable parallel zonal statistics on the GBIF data completely on GPUs with limited memory capacity.  Experiment results have shown that an impressive end-to-end response time under 100 seconds can be achieved for zonal statistics on the 375+ million species records over 15+ thousand global eco-regions with 4+ million vertices on a single Nvidia Quadro 6000 GPU device. The achieved high performance, which is several orders of magnitude faster than reference serial implementations using traditional open source geospatial techniques, not only demonstrates the potential of GPU computing for large scale geospatial processing, but also makes interactive query driven visual exploration of global biodiversity data possible.

Baseline Serial Algorithms and Implementations

Descriptions of baseline implementations using traditional open source software stack and running on single processors:
1) Serial Implementation 1 (SI1) builds an R-Tree on WWF eco-region polygons ( 14,458 polygons, 16,838 rings and 4,028,622 points ) by using libspatialindex. For each species occurrence, a a point is created to query against the R-Tree index structure using the ISpatialIndex:intersectsWithQuery provided by  libspatialindex . If the point intersect with any of the MBRs  stored in the leaf nodes of the R-Tree, the point  is tested with  the polyogns that are represented by the MBRs using the  Contains function in the OGRGeometry class provided by the GDAL/OGR library. If the point is within the polygon, the polygon FeatureID will be assigned to the point; otherwise -1 will be assigned instead.
2) Serial Implementation 2 (SI2) adds an optimization technique we have used for the GPU design and it uses the same input as the GPU implementation. Recall that in the GPU implementation, in addition to the .xy file that stores  the long/lat pairs of the point coordinates, the raster cell identifiers and the number of points in the raster cells are also part of the input. For each of the grid cell, implementation 2 first queries its bounding box against the R-Tree index structure and locate all polygons whose MBRs intersect with the grid ell bounding box. For each of the point in the grid cell,  the Contains function in the OGRGeometry class is applied for point-in-polygon test.

One might expect SI2 will always be faster than SI1. However, our experiments have shown differently. This is because, will SI2 reduces R-Tree query costs, it might increases the number of point-in-polygon tests which are more expensive than R-Tree queries, given that the R-Tree index structure is likely to be completely cached in memory (the R-Tree structure is only about 2.4 MB) . As a matter of fact,  there are points that, although their cell boundary intersect with some polygon MBRs, they may not intersect with these MBRs. As such, SI2 is likely to introduce false positives that can be avoided by query the points against the polygon R-Tree directly.

Code, Data and Steps to Repeat Baseline Experiments

The following three programs (C++ code) allows interested readers to repeat our experiments using a sample dataset.
1) The code for generating the R-Tree indexing structure (used for both BI1 and BI2), the two baseline implementations can be downloaded as gzipped tar ball by following this link.
2) The sample dataset with 746,302 species occurrences for 25 species with numbers of occurrences between 10,000 and 100,000 can be downloaded as a gzipped tar ball by following this link (G45_G01). FYI: not all files in the tar ball are used. Please contact me if you are curious what the files are used for. Please visit GBIF data portal for more detailed info. 
3) A copy of the WWF ecoregion shapefile can be downloaded by following this link (Warning: 51 MB). Please acknowledge WWF by following the details at WWF eco-region website.
4) Two additional point datasets, one is smaller ( 279,808 records ) and one is bigger ( 9,397,443) are now provided for additional evaluations. They can be accessed by following the these two links:  G45_G02 and G10M (warning: 8.7 MB).

To repeat our baseline experiments:
1) download and unzip the above three tar balls to your directory.
2) install libspatialindex and GDAL/OGR library and make the shared library files (.so file under linux) accessible by modifying LD_LIBRARY_PATH environment variable (or specify their paths in g++ command line using -L).
3) change the paths to the WWF eco-region shapefile in the three source C++ programs (changing to command line is in progress).
3) compile the three C++ programs by following the example command line for compilation at the beginning of each program.
4) run GenPolyRTree first to generate the polygon index (newrtp.* under the current directory).
5) run PIPGDALTest1 and/or PIPGDALTest2 by following the example command line for execution at the beginning of each program. Please note the differences between the two. 

Related Publications

Jianting Zhang and Simin You (2014).  Efficient and Scalable Parallel Zonal Statistics on Large-Scale Species Occurrence Data on GPUs. Technical Report [PDF]

Jianting Zhang and Simin You (2013) . Parallel Zonal Summations of Large-Scale Species Occurrence Data on Hybrid CPU-GPU Systems. Technical Report [PDF]