By: Bart Baesens, Seppe vanden Broucke
This Q&A first appeared in Data Science Briefings, the DataMiningApps newsletter, as a “Free Tweet Consulting Experience”, where we answer a data science or analytics question of 140 characters maximum. Want to submit your own question? Just tweet us @DataMiningApps. Want to remain anonymous? Then send us a direct message and we’ll keep all your details private. Subscribe now for free if you want to be the first to receive our articles and stay up to date on data science news, or follow us @DataMiningApps.
You asked: Can you give me a simple explanation and example of process discovery?
Our answer:
Good question. The popularity of open source analytical software such as R and Python has sparked a debate about the added value of commercial tools such as SAS and SPSS. In fact, commercial and open source software each have their merits, which should be thoroughly evaluated before any analytical software decision is made.
First of all, the key advantage of open source software is that it is available for free, which significantly lowers the barrier to entry. However, this also poses a danger, since anyone can contribute to it without guaranteed quality assurance or extensive prior testing. In heavily regulated environments such as credit risk (Basel Accord), insurance (Solvency Accord) and pharmaceuticals (FDA regulation), analytical models are subject to external supervisory review because of their strategic impact on society, which is now bigger than ever before. Hence, in these settings many firms prefer to rely on mature commercial solutions that have been thoroughly engineered, extensively tested, validated and completely documented. Many of these solutions also include automatic reporting facilities to generate compliant reports in each of the settings mentioned. Open source software typically comes without any warranty or vendor-backed quality control, which increases the risk of using it in a regulated environment.
Another key advantage of commercial solutions is that the software offered is no longer centered around dedicated analytical workbenches for e.g. data preprocessing or data mining, but around well-engineered, business-focused solutions which automate the end-to-end activities. As an example, consider credit risk modeling, which runs from framing the business problem through data preprocessing, analytical model development, backtesting and benchmarking, and stress testing to regulatory capital calculation. Automating this entire chain of activities with open source software would require various scripts, likely originating from heterogeneous sources, to be matched and connected together. The result can be a melting pot of software whose overall functionality becomes unstable and/or unclear.
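To make the chaining concrete, here is a minimal sketch in plain Python of how such a credit risk chain might be wired together by hand. Everything in it is an illustrative assumption rather than an established method: the `Loan` record, the function names, the toy data, the frequency-based PD estimate, the stress multiplier and the LGD of 0.45 are all hypothetical, and the final step computes a simple expected loss (PD × LGD × exposure) as a stand-in, not a Basel regulatory capital formula.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Loan:
    grade: str                  # hypothetical rating grade, e.g. "A" or "B"
    exposure: Optional[float]   # None marks a missing value
    defaulted: bool


def preprocess(loans: List[Loan]) -> List[Loan]:
    """Data preprocessing: drop records with a missing exposure."""
    return [loan for loan in loans if loan.exposure is not None]


def develop_model(loans: List[Loan]) -> Dict[str, float]:
    """Model development: estimate PD per grade as the observed default frequency."""
    counts: Dict[str, List[int]] = {}
    for loan in loans:
        total_bad = counts.setdefault(loan.grade, [0, 0])
        total_bad[0] += 1
        total_bad[1] += int(loan.defaulted)
    return {grade: bad / total for grade, (total, bad) in counts.items()}


def backtest(pd_by_grade: Dict[str, float], holdout: List[Loan],
             tolerance: float = 0.3) -> Dict[str, bool]:
    """Backtesting: True where the holdout default rate stays within tolerance of the PD."""
    observed = develop_model(holdout)
    return {grade: abs(observed[grade] - pd) <= tolerance
            for grade, pd in pd_by_grade.items() if grade in observed}


def stress(pd_by_grade: Dict[str, float], multiplier: float = 1.5) -> Dict[str, float]:
    """Stress testing: scale every PD by an assumed multiplier, capped at 1."""
    return {grade: min(1.0, pd * multiplier) for grade, pd in pd_by_grade.items()}


def expected_loss(loans: List[Loan], pd_by_grade: Dict[str, float],
                  lgd: float = 0.45) -> float:
    """Stand-in for the capital step: sum of PD x LGD x exposure."""
    return sum(pd_by_grade[loan.grade] * lgd * loan.exposure for loan in loans)


# Toy data, invented purely for illustration.
training = [
    Loan("A", 100.0, False), Loan("A", 100.0, False),
    Loan("A", 100.0, False), Loan("A", 100.0, True),
    Loan("A", None, False),   # removed by preprocess()
    Loan("B", 100.0, False), Loan("B", 100.0, True),
]
holdout = [
    Loan("A", 100.0, False), Loan("A", 100.0, False),
    Loan("B", 100.0, True), Loan("B", 100.0, True),
]

clean = preprocess(training)
pd_by_grade = develop_model(clean)
backtest_ok = backtest(pd_by_grade, holdout)
stressed_pd = stress(pd_by_grade)
baseline_el = expected_loss(clean, pd_by_grade)
stressed_el = expected_loss(clean, stressed_pd)

print("PD per grade:", pd_by_grade)
print("Backtest passed:", backtest_ok)
print("Expected loss (baseline, stressed):", baseline_el, stressed_el)
```

Even in this toy version, each stage depends on the exact output format of the previous one. In practice those stages would often come from different packages and authors, which is where the integration effort described above arises.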
Unlike open source projects, commercial software vendors also offer extensive help facilities such as FAQs, technical support hotlines, newsletters and professional training courses. Another key advantage of commercial software vendors is business continuity. More specifically, centralized R&D teams (as opposed to worldwide, loosely connected open source developers) closely follow new analytical and regulatory developments, which provides a better guarantee that new software upgrades will offer the facilities required. In an open source environment, you need to rely on the community to contribute voluntarily, which provides less of a guarantee.
A disadvantage of commercial software is that it usually comes as pre-packaged, black box routines which, although extensively tested and documented, cannot be inspected by the more sophisticated data scientist. This is in contrast to open source solutions, which provide full access to the source code of each of the scripts contributed. In addition, because of their community-driven nature, open source solutions tend to be at the forefront of the state of the art, and their communities are willing to experiment with bleeding-edge techniques and algorithms.
Given the above discussion, it is clear that commercial and open source software each have their strengths and weaknesses. Hence, it is likely that both will continue to co-exist, and interfaces should be provided so that the two can work together, as is the case for e.g. SAS and R/Python.