Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.
- The Cambridge Analytica Data Apocalypse Was Predicted in 2007
The promise of Cambridge Analytica was to use computational social science to influence behavior, and it claimed it could deliver. It apparently cheated to get the data, and the catastrophe that the authors of that 2009 paper warned of has come to pass.
- Data governance and the death of schema on read
Comcast’s system of storing schemas and metadata enables data scientists to find, understand, and join data of interest.
- Best Practices for ML Engineering
Some great insights in this overview by Google!
- Things I wish we had known before we started our first Machine Learning project
Anything new brings with it many unknowns that we only discover with time.
- Beware Default Random Forest Importances
The takeaway from this article is that the default feature importance strategies in the most popular RF implementations (scikit-learn in Python, and R’s random forests) do not give reliable feature importances when “… potential predictor variables vary in their scale of measurement or their number of categories” (Strobl et al.). Rather than figuring out whether your data set conforms to one that gets accurate results, simply use permutation importance.
- Understanding deep learning through neuron deletion
“We measured the performance impact of damaging the network by deleting individual neurons as well as groups of neurons. Our experiments led to two surprising findings: Although many previous studies have focused on understanding easily interpretable individual neurons (e.g. “cat neurons”, or neurons in the hidden layers of deep networks which are only active in response to images of cats), we found that these interpretable neurons are no more important than confusing neurons with difficult-to-interpret activity. Networks which correctly classify unseen images are more resilient to neuron deletion than networks which can only classify images they have seen before. In other words, networks which generalise well are much less reliant on single directions than those which memorise.”
- The fight against illegal deforestation with TensorFlow
Rainforest Connection is using technology to protect the rainforest. Founder and CEO Topher White shares how TensorFlow, Google’s open-source machine learning framework, aids in their efforts.
- Scaling Time Series Data Storage @Netflix
“In this 2-part blog post series, we will share how Netflix has evolved a time series data storage architecture through multiple increases in scale.”
- World Models
Can agents learn inside their own dreams?
- Financial forecasting with probabilistic programming and Pyro
“We just have to remember, that now all parameters, inputs and outputs in our model are distributions, and while training we need to fit parameters of these distributions to get better accuracy on real task.”
- Gradient Boosting in TensorFlow vs XGBoost
“With a few hours of tweaking, I couldn’t get TensorFlow’s Boosted Trees implementation to match XGBoost’s results, neither in training time nor accuracy.”
- LabNotebook
LabNotebook is a tool that allows you to flexibly monitor, record, save, and query all your machine learning experiments.
A simple experiment manager for deep learning experiments.
- Mittens: A fast implementation of GloVe, with optional retrofitting
This package contains fast TensorFlow and NumPy implementations of GloVe and Mittens.
- keras-rl is being updated again
Keras v2.1.5 and gym v0.10.3 are now supported!
- Understanding Spark Structured Streaming (video)
- Deep Learning for Recommender Systems (presentation)
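The “Beware Default Random Forest Importances” item above recommends permutation importance over the default impurity-based importances. A minimal sketch of the technique on toy data, assuming scikit-learn’s `RandomForestRegressor` (the two-column data set and the helper function are illustrative; newer scikit-learn versions also ship a built-in `sklearn.inspection.permutation_importance`):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)

# Toy data: y depends on column 0 only; column 1 is pure noise.
X = rng.normal(size=(500, 2))
y = X[:, 0] + 0.1 * rng.normal(size=500)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Importance of a column = drop in score when that column is shuffled."""
    rng = np.random.RandomState(seed)
    baseline = r2_score(y, model.predict(X))
    importances = []
    for col in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, col])  # break the column's link to y
            scores.append(r2_score(y, model.predict(X_perm)))
        importances.append(baseline - np.mean(scores))
    return np.array(importances)

imp = permutation_importance(rf, X, y)
print(imp)  # the informative column scores far higher than the noise column
```

Because the model and metric are treated as black boxes, the same loop works for any estimator, and it sidesteps the scale-of-measurement bias that Strobl et al. describe for impurity-based importances.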
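The financial-forecasting item above notes that in a probabilistic model every parameter becomes a distribution. Pyro fits such distributions with variational inference; the core idea can be sketched without Pyro via a conjugate Bayesian linear regression on toy data (all names and numbers here are illustrative, not from the article):

```python
import numpy as np

rng = np.random.RandomState(1)

# Toy data: y = 2x + noise, with known noise scale 0.3.
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + 0.3 * rng.normal(size=200)

# Model: prior w ~ N(0, 1), likelihood y ~ N(w * x, sigma^2).
# With a Gaussian prior and Gaussian likelihood, the posterior over the
# slope w is available in closed form.
sigma2 = 0.3 ** 2
prior_var = 1.0
post_var = 1.0 / (1.0 / prior_var + (x @ x) / sigma2)
post_mean = post_var * (x @ y) / sigma2

# The "parameter" is now a distribution, so predictions are made by
# sampling it rather than plugging in a single point estimate.
w_samples = rng.normal(post_mean, np.sqrt(post_var), size=1000)
pred_at_1 = w_samples * 1.0  # predictive distribution of y at x = 1

print(post_mean, post_var)
```

The closed-form update is what makes this toy case easy; for the non-conjugate models used in the article, Pyro approximates the same posterior with stochastic variational inference instead.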