Detecting Irregular Clusters in Big Spatial Data

GIScience 2012, Columbus, Ohio, 18-21 September 2012

Jared Aldstadt, Michael Widener, and Neal Crago

“In the age of “Big Data,” GIScientists are faced with the challenging task of processing and interpreting increasingly large spatial datasets. To fully exploit these large and high-resolution data, researchers must adapt existing algorithms and utilize new computational technologies, such as high-performance computers (HPCs) (Armstrong 2000). This research concentrates on improving the performance of local irregular cluster detection algorithms, focusing on one flexible but computationally expensive algorithm: AMOEBA (Aldstadt and Getis 2006). The ability to locally detect clusters within a geographic region is a powerful tool that can be applied in many settings. For example, in a health context, it is important to explicitly locate clusters of disease incidents so that an appropriate medical intervention can occur. AMOEBA is one of a growing number of cluster detection techniques that do not restrict the search to regular geometric shapes (Duczmal and Assuncao 2004, Tango and Takahashi 2005, Assuncao et al. 2006, Yiannakoulias et al. 2007). These methods have the potential to delineate clusters more accurately, thereby improving the value of clustering as an exploratory data analysis technique.

“While the ability to detect irregularly shaped clusters is useful, the underlying process of AMOEBA is computationally expensive: it involves organically growing a cluster from every spatial data point (each known as a seed location) in a predefined study area. This research builds on previous work (Duque et al. 2011a, Widener et al. 2012) by developing a heuristic and decomposition strategies for parallel computing platforms to improve the runtime of the algorithm. The heuristic is designed to intelligently sample seed locations in sub-regions of a large dataset in order to eliminate redundant cluster discovery. Current AMOEBA algorithms spend valuable computation time rediscovering the same irregular cluster, because all seed locations that are part of a particular cluster will grow approximately or exactly the same hotspot (or coldspot). Because the heuristic reduces the number of computations necessary in some sub-regions but not in others, new decomposition strategies are needed to equitably distribute the computational load of growing clusters from those seed locations that are tested. In addition to providing researchers with a faster tool for detecting irregularly shaped clusters in large or detailed datasets, this research also provides general insights into how efficient local spatial cluster detection can be achieved on parallel computing platforms.”
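
The two ideas in the abstract, skipping redundant seeds and balancing uneven work across processors, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function `grow_cluster` is a hypothetical stand-in that flood-fills grid cells at or above a threshold rather than applying AMOEBA's actual growth procedure. `detect_clusters` assumes that a seed already covered by a discovered cluster would regrow that same cluster. `balance_load` uses a generic greedy longest-processing-time assignment, which is only one possible decomposition strategy. All names and parameters are assumptions made for illustration.

```python
import heapq
from collections import deque


def grow_cluster(grid, seed, threshold):
    """Toy stand-in for cluster growth (NOT the AMOEBA procedure):
    flood-fill 4-connected cells whose value is >= threshold."""
    rows, cols = len(grid), len(grid[0])
    if grid[seed[0]][seed[1]] < threshold:
        return frozenset([seed])  # a low-value seed forms a singleton
    cluster = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in cluster and grid[nr][nc] >= threshold:
                cluster.add((nr, nc))
                queue.append((nr, nc))
    return frozenset(cluster)


def detect_clusters(grid, threshold, skip_covered=True):
    """Grow a cluster from each seed; optionally skip seeds that already
    belong to a discovered cluster (assumed to regrow the same cluster)."""
    covered, clusters, growths = set(), set(), 0
    for r in range(len(grid)):
        for c in range(len(grid[0])):
            seed = (r, c)
            if skip_covered and seed in covered:
                continue  # redundant seed: its cluster is already known
            cluster = grow_cluster(grid, seed, threshold)
            growths += 1
            clusters.add(cluster)
            covered |= cluster
    return clusters, growths


def balance_load(costs, n_workers):
    """Greedy assignment: give the costliest remaining sub-region to the
    currently least-loaded worker. `costs` maps sub-region name -> cost."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    assignment = {w: [] for w in range(n_workers)}
    for region, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(region)
        heapq.heappush(heap, (load + cost, w))
    return assignment
```

On a small grid, calling `detect_clusters(grid, threshold, skip_covered=False)` and then again with `skip_covered=True` should return the same set of clusters while reporting fewer growth operations in the second case. The per-sub-region growth counts could then serve as the `costs` passed to `balance_load`.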