By: Bart Baesens, Seppe vanden Broucke
This Q&A first appeared in Data Science Briefings, the DataMiningApps newsletter, as a “Free Tweet Consulting Experience”, where we answer a data science or analytics question of 140 characters maximum. Want to submit your own question? Just tweet us @DataMiningApps. Want to remain anonymous? Then send us a direct message and we’ll keep all your details private. Subscribe now for free if you want to be the first to receive our articles and stay up to date on data science news, or follow us @DataMiningApps.
You asked: How is location-data used in analytics today? Are there state-of-the-art algorithms or industry standards to predict how crowds behave and where/when they will gather?
Our answer:
To answer this question, we need to distinguish between the “location data” aspect and the “crowd behavior” angle. Regarding the first (i.e. the use of location data), a broader term for this field is “spatial data mining”, or spatial analytics. The techniques developed there focus on leveraging spatial data sets (often stored and managed by means of geographic information systems, or GIS) for predictive and descriptive purposes. Use cases range from the rather simple, such as extracting a feature vector from spatial components for use in a predictive model (the surface area, maximum height and complexity of buildings, for instance), to more involved methods that make use of spatial relations (e.g. using relations such as “within”, “touches” or “overlaps” as inputs to the predictive model), and on to more specialized techniques such as “trajectory mining”, which keeps track of locations through time (i.e. a sequence of locations), for instance as created by tracking GPS coordinates on a customer’s smartphone. Such sequences can be used to perform clustering, to extract “moving together” patterns, or even to derive a unique fingerprint that identifies individual instances.
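To make the simpler end of that range concrete, here is a minimal sketch of turning a building footprint into a feature vector and evaluating a “within” relation. It is an illustration under stated assumptions, not a GIS implementation: the function names, the use of the vertex count as a crude proxy for “complexity”, and the sample footprint are all hypothetical, and a real project would more likely rely on a dedicated geometry library.

```python
from math import hypot


def polygon_area(vertices):
    """Surface area of a simple polygon, using the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_perimeter(vertices):
    """Sum of the edge lengths of the polygon."""
    n = len(vertices)
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += hypot(x2 - x1, y2 - y1)
    return total


def point_within(point, vertices):
    """'Within' relation for a point, using the ray casting test."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def building_features(footprint, height, point_of_interest):
    """Feature vector for one building (feature choice is an assumption)."""
    return {
        "area": polygon_area(footprint),
        "perimeter": polygon_perimeter(footprint),
        "height": height,
        # Vertex count as a rough stand-in for footprint complexity.
        "vertex_count": len(footprint),
        "contains_poi": point_within(point_of_interest, footprint),
    }


# Made-up rectangular footprint (coordinates in metres) and height.
footprint = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
features = building_features(footprint, height=9.5, point_of_interest=(1.0, 1.0))
print(features)
```

The resulting dictionary can be fed to a predictive model like any other row of tabular features; the spatial relation simply becomes one more boolean column.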
Regarding the “crowd behavior” aspect, this is an area which is typically categorized under “agent modeling” and/or “crowd simulation”, rather than spatial analytics. The reason is that such techniques do need a spatial topology in which the agents will roam, but it is the modeling of the agents themselves (the members of the crowd) that forms the real challenge. Many techniques have been developed to do so, ranging from particle systems and simulations based on cellular automata to individual agent simulation using rule-based AI (e.g. by using finite state machines). In more recent years, due in part to the renewed focus on neural network based techniques, we also see growing interest in modeling agents by means of reinforcement learning, for instance by means of Q-learning. In a nutshell, this approach trains agents to choose the sequence of actions that maximizes their expected cumulative reward; the Q-value of a state-action pair is the agent’s estimate of that reward. The challenge here is to construct a meaningful reward function, though this is possible in many application areas, such as crowd simulation in emergency scenarios (the reward function embodies getting out of a burning building) or marketing analytics (where a reward function could provide a reward when a customer sees a store that matches their interest).
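As an illustration of the Q-learning idea in the emergency scenario, the sketch below trains a single agent to find the exit of a small grid using tabular Q-learning. Everything specific here is an assumption made for the example: the 4x4 floor plan, the reward of +10 for reaching the exit and -1 per step, and the learning parameters. A realistic crowd simulation would involve many interacting agents and a far richer state space.

```python
import random

SIZE = 4                       # hypothetical 4x4 floor plan
START, EXIT = (0, 0), (3, 3)   # assumed start cell and exit cell
ACTIONS = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up


def step(state, action):
    """Move one cell; walking into a wall leaves the agent in place."""
    row = min(max(state[0] + action[0], 0), SIZE - 1)
    col = min(max(state[1] + action[1], 0), SIZE - 1)
    nxt = (row, col)
    # Assumed reward function: escaping pays off, every other step costs.
    reward = 10.0 if nxt == EXIT else -1.0
    return nxt, reward, nxt == EXIT


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * len(ACTIONS)
         for r in range(SIZE) for c in range(SIZE)}
    for _ in range(episodes):
        state = START
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))  # explore
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning update: bootstrap from the best next action.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
            if done:
                break
    return q


def greedy_path(q, max_steps=50):
    """Follow the highest Q-value from START until EXIT (or give up)."""
    state, path = START, [START]
    for _ in range(max_steps):
        if state == EXIT:
            break
        a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        state, _, _ = step(state, ACTIONS[a])
        path.append(state)
    return path


q_table = train()
print(greedy_path(q_table))
```

Note how the reward function carries all of the domain knowledge: changing what earns a reward (reaching an exit, passing a matching store) changes the behavior the agent learns, while the update rule stays the same.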
This being said, it is also possible to use spatial data to directly predict crowd density without simulating or modeling each individual agent. Many marketing firms, for instance, will construct a heat map of crowd density in a certain area from measurements obtained by “people counting” (using infrared or thermal counters, Wi-Fi sensors, or computer vision techniques applied to camera feeds). Such data sets can also be combined with techniques such as spatial kriging (a method of spatial interpolation) to construct a full predictive model.
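To show what the interpolation step might look like, here is a minimal sketch of ordinary kriging applied to a handful of people counters. The sensor locations and counts are made up, and the exponential variogram with its sill and range values is assumed rather than fitted to data; in practice the variogram would be estimated from the measurements, typically with a dedicated geostatistics library.

```python
from math import exp, hypot


def variogram(h, sill=1.0, range_=8.0):
    """Exponential semivariogram; sill and range are assumed, not fitted."""
    return sill * (1.0 - exp(-h / range_))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def kriging_weights(sensors, target):
    """Ordinary kriging weights; the constraint row forces them to sum to 1."""
    n = len(sensors)
    a = []
    for xi, yi in sensors:
        row = [variogram(hypot(xi - xj, yi - yj)) for xj, yj in sensors]
        a.append(row + [1.0])
    a.append([1.0] * n + [0.0])
    b = [variogram(hypot(x - target[0], y - target[1])) for x, y in sensors]
    b.append(1.0)
    return solve(a, b)[:n]  # drop the Lagrange multiplier


def predict(sensors, counts, target):
    """Interpolated crowd count at the target location."""
    weights = kriging_weights(sensors, target)
    return sum(w * z for w, z in zip(weights, counts))


# Made-up counter positions (metres) and people counts.
sensors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
counts = [12.0, 30.0, 18.0, 45.0]

# Coarse heat map: one interpolated value per grid cell.
heat_map = [[predict(sensors, counts, (float(x), float(y)))
             for x in range(0, 11, 5)]
            for y in range(0, 11, 5)]
for row in heat_map:
    print([round(v, 1) for v in row])
```

With this variogram (no nugget term), the interpolation reproduces each measured count exactly at its sensor location and blends the neighboring counts in between, which is what turns a sparse set of counters into a continuous density surface.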