Why tree-based methods do not work well with geographic data

Posted on July 5, 2024

Contributed by Margot Geerts under supervision of Prof. dr. Jochen De Weerdt and Prof. dr. Seppe vanden Broucke

Key take-aways

Tree-based models are not designed to recognize spatial relationships in the data, which leads to underperformance
Tree-based models handle coordinates as input features ineffectively, which produces strange artefacts in mapped decision boundaries
Alternative predictive techniques should focus on tree-based models tailored to spatial data, such as GeoRF [3]

Introduction

When starting a data science project, one of the top predictive techniques in a data scientist’s toolkit is a tree-based model such as a Random Forest or gradient boosted trees. Tree-based models are known to be flexible and interpretable, and they generally produce state-of-the-art performance across a wide range of tasks. Especially on tabular data, they often generalize better than simple methods such as linear regression, thanks to their ability to capture complex relationships, and better than deep learning methods such as neural networks, because they are less prone to overfitting. Nevertheless, when dealing with geographic data, even tree-based methods fall short. Let’s dive into why tree-based models struggle with geographic data and explore some alternative approaches.

Understanding geographic data

Geographic data, or spatial data, refers to data in which instances are associated with a location on or near the surface of the Earth. Spatial attributes can come in the form of coordinates, distance, elevation, area, and so on. Further, geographic data can be structured as a grid, such as satellite images, or as vectors representing points or polygons. In this post, we consider spatial vector data, where coordinates are the most common type of spatial attribute. Common real-world applications of this type of geographic data are soil content mapping, income prediction, disease mapping, and house price prediction.

Limitations of tree-based methods

When dealing with geographic data, more specifically, handling spatial coordinates as input features, tree-based methods run into many problems. In what follows, two key issues are highlighted that tree-based models suffer from.

1. Failure to recognize spatial relationships. In geography, Tobler’s well-known first law reads as follows: “everything is related to everything else, but near things are more related than distant things”. This means that geographic data is correlated with itself across space, i.e., it exhibits spatial autocorrelation. This is an important intrinsic property that tree-based models are not designed to capture: they split the data based on feature thresholds without considering these spatial relationships. A recent article provides evidence that the residuals of these models exhibit significant amounts of spatial autocorrelation [1].

2. Inefficiency in handling coordinates. Another consequence of the inherently univariate recursive splitting procedure of tree-based models is the strange artefacts that they produce with respect to coordinates [2]. Because each split is based on a single feature, the decision boundaries take rectangular forms. This is difficult to reconcile with actual spatial patterns, which are often smoother and follow more complex shapes. In sum, tree-based models are unable to effectively capture spatial effects based on continuous coordinate systems.
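
To make the first issue concrete: spatial autocorrelation in model residuals is commonly quantified with global Moran's I, which is close to zero for spatially random values and approaches one when nearby values are similar. The sketch below is a minimal NumPy version under stated assumptions: the function name `morans_i`, the binary k-nearest-neighbour weights, and the synthetic "residuals" are illustrative choices, not the setup used in [1].

```python
import numpy as np

def morans_i(values, coords, k=5):
    """Global Moran's I with binary k-nearest-neighbour weights (illustrative choice)."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(values)
    # Pairwise Euclidean distances between all points.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    # w[i, j] = 1 if j is among the k nearest neighbours of i.
    neighbours = np.argsort(dist, axis=1)[:, :k]
    w = np.zeros((n, n))
    w[np.repeat(np.arange(n), k), neighbours.ravel()] = 1.0
    # Moran's I: (n / sum of weights) * (z' W z) / (z' z), with z the centred values.
    z = values - values.mean()
    return (n / w.sum()) * (z @ w @ z) / (z @ z)

# Synthetic stand-ins for model residuals at 300 random locations (assumption).
rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(300, 2))
structured = np.sin(3 * coords[:, 0]) + np.cos(3 * coords[:, 1])  # varies smoothly in space
random_noise = rng.normal(size=300)                               # no spatial structure

i_structured = morans_i(structured, coords)
i_random = morans_i(random_noise, coords)
print(f"structured: {i_structured:.2f}, random: {i_random:.2f}")
```

In practice, `values` would be the residuals of a fitted model on held-out locations. A clearly positive Moran's I there suggests the model has left spatial structure unexplained.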

Figure 1: Random Forest’s decision boundaries mapped in Belgium based on a house price prediction task with geographic data. Rectangular regions are distinguished by the model representing higher and lower house prices. In reality, spatial patterns in house prices are expected to follow smoother patterns and take on more complex structures.
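
The rectangular regions in Figure 1 follow directly from how a single split is chosen: the tree looks at one feature at a time and picks one threshold on it. The sketch below is a hand-written, simplified version of that search, not the implementation of any particular library. The function name `best_axis_split` and the synthetic targets are assumptions for illustration. It shows that a boundary aligned with a coordinate axis is recovered by one split, while a diagonal boundary is not.

```python
import numpy as np

def best_axis_split(X, y):
    """Exhaustive search for the single-feature threshold with the lowest summed squared error."""
    best = (None, None, np.inf)  # (feature index, threshold, SSE)
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        for i in range(1, len(xs)):
            if xs[i] == xs[i - 1]:
                continue  # no threshold fits between identical values
            left, right = ys[:i], ys[i:]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, (xs[i] + xs[i - 1]) / 2, sse)
    return best

rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(400, 2))  # columns: x, y (synthetic coordinates)

# Boundary aligned with the x axis: one split separates the two regions.
axis_target = (coords[:, 0] > 0.5).astype(float)
feat_a, thr_a, sse_a = best_axis_split(coords, axis_target)

# Diagonal boundary: no single-coordinate threshold can separate the regions.
diag_target = (coords[:, 0] + coords[:, 1] > 1).astype(float)
feat_d, thr_d, sse_d = best_axis_split(coords, diag_target)

print(feat_a, round(thr_a, 2), round(sse_a, 2))
print(feat_d, round(thr_d, 2), round(sse_d, 2))
```

A full tree repeats this search recursively in each resulting region, so a diagonal or curved boundary can only be approximated by a staircase of ever smaller rectangles.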

Alternative predictive techniques for geographic data

1. Spatial statistics. Statistical methods such as Gaussian Process regression and Geographically Weighted Regression are specifically designed for spatial data. Although these methods can generally capture spatial relationships well, they suffer from other limitations. First, they are often based on underlying linear models, which limits the relationships they can capture in the data. Second, they often impose stringent assumptions on the data that are not realistic for real-world data. Lastly, spatial statistical models often scale poorly, making it impractical to leverage big data.

2. Spatial tree-based models. Recent work has proposed tree-based methods tailored to geographic data. These proposals can be categorized into two groups: feature engineering and algorithmic adaptation. While advanced feature engineering can help improve the predictive performance of tree-based models, it does not change the way tree-based models inherently work, and it shifts most of the modeling effort towards experts in the spatial domain. In our work, we directly address tree-based methods’ inefficiency in handling coordinates by introducing novel split types specifically designed for geographic data [3]. In an extensive evaluation across a range of spatial tasks, we showed that our method outperforms not only standard Random Forests and gradient boosted trees, but also spatial statistical methods and deep learning techniques.
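
As a small illustration of the feature engineering route (and explicitly not of the GeoRF split types from [3]), the sketch below adds projections of the coordinates onto rotated axes, so that an ordinary single-feature split on a new column corresponds to an oblique line on the map. The helper `add_rotated_coords`, the chosen angles, and the synthetic diagonal target are assumptions for illustration. The example is deliberately favourable, since one of the angles matches the true boundary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def add_rotated_coords(coords, angles_deg=(30.0, 45.0, 60.0)):
    """Append projections of (x, y) onto rotated axes as extra feature columns."""
    coords = np.asarray(coords, dtype=float)
    extra = [
        coords[:, 0] * np.cos(np.deg2rad(a)) + coords[:, 1] * np.sin(np.deg2rad(a))
        for a in angles_deg
    ]
    return np.column_stack([coords, *extra])

def diagonal_target(coords):
    """Synthetic target with a diagonal boundary (stand-in for a spatial pattern)."""
    return (coords[:, 0] + coords[:, 1] > 1).astype(float)

rng = np.random.default_rng(0)
train = rng.uniform(0, 1, size=(1000, 2))
test = rng.uniform(0, 1, size=(1000, 2))

def fit_and_score(transform):
    # Shallow trees keep the contrast between the two feature sets visible.
    model = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)
    model.fit(transform(train), diagonal_target(train))
    return model.score(transform(test), diagonal_target(test))  # R^2 on held-out points

r2_raw = fit_and_score(lambda c: c)
r2_rotated = fit_and_score(add_rotated_coords)
print(f"raw coordinates: {r2_raw:.3f}, with rotated features: {r2_rotated:.3f}")
```

The gain here depends entirely on picking angles that match the pattern in the data, which is exactly the kind of domain knowledge that feature engineering pushes onto the modeler. Algorithmic adaptations aim to remove that dependency by changing the splits themselves.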

Conclusion

While tree-based models are one of the most popular predictive techniques because of their flexibility, interpretability, and state-of-the-art performance, they fall short when dealing with geographic data. When taking geographic coordinates as input features, tree-based models are unable to capture spatial relationships in the data. Moreover, they are inefficient in handling coordinates, evidenced by the presence of strange artefacts in mapped decision boundaries. These issues arise from the recursive splitting procedure of trees based on a single attribute. Unless the conditions are met for resorting to spatial statistics, solutions must be found in adapting the algorithmic procedure of (tree-based) models to the spatial nature of the data.

References

[1] Song, I., & Kim, D. (2023). Three Common Machine Learning Algorithms Neither Enhance Prediction Accuracy Nor Reduce Spatial Autocorrelation in Residuals: An Analysis of Twenty-five Socioeconomic Data Sets. Geographical Analysis, 55, 585–620. https://doi.org/10.1111/gean.12351

[2] Hengl, T., Nussbaum, M., Wright, M. N., Heuvelink, G. B. M., & Gräler, B. (2018). Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ, 2018(8). https://doi.org/10.7717/peerj.5518

[3] Geerts, M., vanden Broucke, S., & De Weerdt, J. (2024). GeoRF: a geospatial random forest. Data Mining and Knowledge Discovery. https://doi.org/10.1007/s10618-024-01046-7