This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at briefings@dataminingapps.com and let’s get in touch!
Contributed by: Bart Baesens. This article was adapted from our Managing Model Risk book. Check out managingmodelriskbook.com to learn more.
The quality of an analytical model depends upon the quality of the data that feeds into it. However, even high-quality data is no absolute guarantee for a high-performing analytical model. Besides being of high quality, the data should also have predictive power for the analytical task at hand. If the data is not related to the target variable or label, then even highly sophisticated analytical models such as deep learning neural networks or extreme gradient boosting will not find any meaningful patterns. To put it bluntly, you cannot predict churn risk using your customers' eye color as the only predictor, even though the latter may have been perfectly captured and precisely measured. Hence, lack of predictive power is another source of data risk. There are, however, some workarounds you may consider to boost the predictive power of your analytical model.
A first option is to gather more data (meaning better features in most cases). During the data collection phase, we have seen far too often that companies restrict their sourcing efforts too much by selecting only a few data sources to keep the costs of data collection under control. The key message here, however, is: the more data collected, the better, since the analytical model itself will end up indicating which data matters and which does not, using built-in variable selection mechanisms. By being too restrictive during data collection, you prevent the model from unleashing its full predictive power on all available data. Hence, carefully reconsidering the data collection step by including more data sources and/or ways to combine them could help boost a model's predictive power. That said, it is crucial to keep the operational setting of the model in mind. Each data element that you collect might lead to an improvement in predictive power, but also comes with a trade-off, since this element (e.g. a feature) will need to continue to be provided in the same clean, timely form once the model is operationalized. As such, when constructing the final version of the model, it can be a good idea to assess which features can be safely dropped in order to lessen the operational burden, as sketched below.
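As a rough illustration of such an assessment, the minimal sketch below uses scikit-learn (one possible choice, not prescribed by this article) to rank features by importance and compare predictive performance with and without the least important ones. The data file, column names and cut-off are hypothetical placeholders.

```python
# Minimal sketch: assess which features can be safely dropped to lessen the
# operational burden. The data file and column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("customer_data.csv")            # hypothetical internal data set
X, y = df.drop(columns=["churn"]), df["churn"]   # "churn" is the assumed target

# Fit a model with a built-in variable selection mechanism (feature importances).
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values()

# Compare performance with all features versus the most important ones only.
full_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
top_k = importances.tail(10).index               # keep the 10 most important
reduced_auc = cross_val_score(model, X[top_k], y, cv=5, scoring="roc_auc").mean()

print(f"AUC with all features:    {full_auc:.3f}")
print(f"AUC with top 10 features: {reduced_auc:.3f}")
```

If the reduced feature set performs (almost) as well, the dropped elements no longer need to be sourced and cleaned in production.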
A next option is to do clever feature engineering. The aim of feature engineering is to transform data set variables into features that help the analytical model achieve better predictive performance, better interpretability, or both. Hence, when doing feature engineering it is important to take the representational bias of your analytical technique into account. As an example, a logistic regression assumes a linear decision boundary to separate the two classes. Hence, when defining smart features for logistic regression, your aim is to make sure that these new features make the data linearly separable, which will allow the logistic regression to come up with the best model possible. A very simple example of feature engineering is deriving the age from the date of birth variable. Another simple example is incorporating aggregated values based on e.g. similar instances in the data set (though be careful to only do so based on instances contained in the training set), as illustrated below. Feature engineering can be done manually, typically by the data scientist in collaboration with the business user, or fully automated using sophisticated techniques such as deep learning or tools such as Featuretools (see https://www.featuretools.com/) and Autofeat (https://arxiv.org/abs/1901.07329). The importance of feature engineering cannot be overstated. In fact, it is our firm conviction that the best way to improve the performance of an analytical model is by designing smart features, rather than focusing too much on the choice of the analytical technique!
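The sketch below makes both examples concrete in pandas; the file and column names are hypothetical. Note how the aggregated feature is computed on the training set only and then merged onto the test set, so no information leaks from the test instances.

```python
# Minimal feature engineering sketch; file and column names are hypothetical.
import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["date_of_birth"])
test = pd.read_csv("test.csv", parse_dates=["date_of_birth"])

# Simple derived feature: age (in years) from the date of birth.
reference_date = pd.Timestamp("2024-01-01")
for df in (train, test):
    df["age"] = (reference_date - df["date_of_birth"]).dt.days // 365

# Aggregated feature: average spend per region, computed on the TRAIN set only
# and then merged onto both train and test to avoid leakage.
region_avg = (
    train.groupby("region")["spend"].mean().rename("region_avg_spend").reset_index()
)
train = train.merge(region_avg, on="region", how="left")
test = test.merge(region_avg, on="region", how="left")
```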
Another option is to rely more on domain expert input and combine this with the available data to get the best of both worlds. However, embedding domain knowledge into an analytical model is not as straightforward as it sounds. One analytical technique worth considering here is the Bayesian network. A Bayesian network consists of two parts: a qualitative part specifies the conditional dependencies between the variables, represented as a graph, and a quantitative part specifies the conditional probabilities of the variables. Given its attractive and easy-to-understand visual representation, a Bayesian network is commonly referred to as a probabilistic white-box model. Bayesian networks excel at combining domain knowledge with patterns learned from data: the domain expert can help in drawing the network, whereas the conditional probabilities can then be learned from the available data. The Bayesian network can then be used to infer the value of a specific variable based upon all other observed variable values (even when some of these are missing).
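As a minimal sketch of this workflow, the example below uses the pgmpy library (one possible choice, not mentioned in this article); the network structure, variable names and data file are hypothetical assumptions.

```python
# Minimal sketch: an expert-drawn Bayesian network whose conditional
# probabilities are learned from data. Structure and variables are hypothetical.
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

# Qualitative part: the domain expert specifies the graph of dependencies.
model = BayesianNetwork([("Income", "Default"), ("Employment", "Default")])

# Quantitative part: the conditional probability tables are learned from data.
data = pd.read_csv("loans.csv")  # hypothetical data set with these variables
model.fit(data, estimator=MaximumLikelihoodEstimator)

# Inference: query one variable given (possibly partial) evidence on the others.
infer = VariableElimination(model)
print(infer.query(variables=["Default"], evidence={"Income": "low"}))
```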
Finally, one can also consider the option of leveraging new data sources, a process which is commonly called data augmentation. A first example of this is purchasing external data. Many data poolers or data providers are collecting, linking, aggregating and analyzing data sets; popular examples are Experian, Equifax, TransUnion, Dun & Bradstreet, GfK and Grandata. Various Internet companies such as Google and Twitter also provide APIs which can be used for now-casting, sentiment analysis, etc. Think, for instance, about now-casting unemployment based upon Google searches with key terms such as jobs and unemployment benefits. Social media data obtained from e.g. Twitter can be used for sentiment analysis; companies often use this to monitor brand reputation. Other firms provide more specialized, niche forms of data such as weather data, which can be useful in a variety of settings (such as forecasting), or high-resolution satellite data (in some cases including infrared data; see e.g. www.albedo.space), which is useful for geospatial analytics. Lots of other providers and data types exist, so it is important to make a careful assessment of what can be useful for the task at hand, as well as of the cost of purchasing or licensing the data source. An interesting opportunity here is that of open data, which is data that anyone can access, use and share, and which is typically not copyrighted, opening up interesting perspectives for analytical modeling. Examples are government data as provided by organizations such as Eurostat and the OECD. Another example is scientific data such as the Human Genome Project, which records human genomic sequence information and makes it publicly available to everyone. Open data can either be analyzed as such or used as a complement to other data for analytics.
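In practice, augmentation often boils down to joining the external or open data onto your internal records so it can serve as additional predictors. The sketch below shows this with pandas for the weather example; the file and column names are hypothetical placeholders.

```python
# Minimal data augmentation sketch: enrich an internal data set with an
# external (purchased or open) source. File and column names are hypothetical.
import pandas as pd

internal = pd.read_csv("sales.csv", parse_dates=["date"])              # internal data
weather = pd.read_csv("weather_open_data.csv", parse_dates=["date"])   # external source

# Join the external features onto the internal records by date and region,
# so they become additional predictors for the analytical model.
augmented = internal.merge(
    weather[["date", "region", "temperature", "precipitation"]],
    on=["date", "region"],
    how="left",
)
print(augmented.head())
```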
Web scraped data can also be an interesting source of data. As an example, consider a list of reviews scraped from a movie site to perform text analytics, create a recommendation engine or build a predictive model to spot fake reviews. If you want to know more about this, we are happy to refer you to our book published by Apress in 2018: Practical Web Scraping for Data Science. That said, it is also important to note that web scraping comes with its fair share of legal and other risks.
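A minimal scraping sketch is shown below using requests and BeautifulSoup; the URL and the CSS selector are hypothetical, and you should always check a site's terms of use and robots.txt before scraping.

```python
# Minimal web scraping sketch; the URL and CSS class are hypothetical.
# Always verify the site's terms of use and robots.txt first.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/movie/12345/reviews"   # hypothetical review page
response = requests.get(url, headers={"User-Agent": "research-bot/0.1"})
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
reviews = [tag.get_text(strip=True) for tag in soup.select("div.review-text")]

print(f"Scraped {len(reviews)} reviews")  # ready for text analytics downstream
```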