Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.
- Google Duplex: An AI System for Accomplishing Real-World Tasks Over the Phone
This was all over the news this week, and rightly so: Google presents highly impressive results, but some people are wondering where this will lead.
- Data Violence and How Bad Engineering Choices Can Damage Society
“A wave of concern ripples through the audience: Data on criminal activity is notoriously unreliable and often subject to manipulation and misclassification. In one notorious case, a California state database of “gang” members was found to contain at least 42 babies. The Harvard researchers didn’t appear to have questioned the integrity of their source data and hadn’t thought through the unintended consequences of implementing a system like this.”
- How can we be sure AI will behave? Perhaps by watching it argue with itself.
Experts suggest that having AI systems try to outwit one another could help a person judge their intentions.
- How do we capture structure in relational data?
We have an incredible amount of information about how objects are related. But most of us have no idea how to use it.
- The False Allure of Hashing for Anonymization
So why doesn’t sha256 produce anonymous data? A small illustration of the problem follows this list.
“To equip our venue mapping algorithm with the same sense of intuition, we developed a deep learning based solution that is trained to encode geo-spatial relations and semantic similarities describing a location’s surroundings.” A minimal triplet-loss sketch follows this list.
- Launching Cutting Edge Deep Learning for Coders: 2018 edition
“Today we are launching the 2018 edition of Cutting Edge Deep Learning for Coders, part 2 of fast.ai’s free deep learning course.”
- Modern Data Pipelines with Apache Airflow
“This talk was presented to developers at Momentum Dev Con covering how to get started with Apache Airflow with examples of custom components like hooks, operators, executors, and plugins. We also covered example DAGs and the Astronomer CLI for Airflow.” A bare-bones example DAG follows this list.
- Interpretable Machine Learning with iml and mlr
Machine learning models repeatedly outperform interpretable, parametric models like the linear regression model. The gains in performance have a price: The models operate as black boxes which are not interpretable.
- flux: The Elegant Machine Learning Stack for Julia
Models that look like mathematics. Seamless derivatives, GPU training and deployment. A set of small, nimble tools that each do one thing and do it well.
- The Wisdom and/or Madness of Crowds
Another fun explorable explanation from Nicky Case.
- Efficient Graph Computation for Node2Vec
“In this paper, we propose Fast-Node2Vec, a family of efficient Node2Vec random walk algorithms on a Pregel-like graph computation framework.”
- An Introduction to Deep Learning for Tabular Data
There is a powerful technique that is winning Kaggle competitions and is widely used at Google (according to Jeff Dean), Pinterest, and Instacart, yet that many people don’t even realize is possible: the use of deep learning for tabular data, and in particular, the creation of embeddings for categorical variables. A minimal categorical-embedding sketch follows this list.
- Jupyter receives the ACM Software System Award
Project Jupyter has been awarded the 2017 ACM Software System Award, a significant honor for the project.
- Vector-based navigation using grid-like representations in artificial agents
“Navigation, however, remains a substantial challenge for artificial agents, with deep neural networks trained by reinforcement learning failing to rival the proficiency of mammalian spatial behaviour, which is underpinned by grid cells in the entorhinal cortex.”
- Mara: A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
This package contains a lightweight ETL framework with a focus on transparency and complexity reduction.
- Using SMOTEBoost and RUSBoost to deal with class imbalance
“SMOTEBoost then injects the SMOTE method at each boosting iteration. The advantage of this approach is that while standard boosting gives equal weights to all misclassified data, SMOTE gives more examples of the minority class at each boosting step. Similarly, RUSBoost achieves the same goal by performing random undersampling (RUS) at each boosting iteration instead of SMOTE.” A toy sketch of the idea follows this list.
- Billion-scale Network Embedding with Iterative Random Projection
“Network embedding has attracted considerable research attention recently. However, the existing methods are incapable of handling billion-scale networks, because they are computationally expensive and, at the same time, difficult to be accelerated by distributed computing schemes. To address these problems, we propose RandNE, a novel and simple billion-scale network embedding method.” A rough sketch of the iterative-random-projection idea also follows this list.
- Visualizing space-time networks
“There are some obvious parallels here with Time Geography and in particular with the representation of space-time prisms.”
- Aesthetically Pleasing Learning Rates
Move over one-cycle, aesthetical learning rates are here ;).
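
A quick illustration of the point behind “The False Allure of Hashing for Anonymization”: a hash is deterministic, so when the space of possible inputs is small (phone numbers, national IDs), anyone can rebuild the mapping by hashing every candidate. The values below are made up for the demo.

```python
import hashlib

# "Anonymize" a phone number by hashing it (the approach the article warns against).
phone = "555-0001"
pseudonym = hashlib.sha256(phone.encode()).hexdigest()

# Because SHA-256 is deterministic and the space of phone numbers is small,
# an attacker can hash every candidate value and simply look the pseudonym up.
candidates = [f"555-{i:04d}" for i in range(10_000)]  # tiny input space for the demo
rainbow = {hashlib.sha256(c.encode()).hexdigest(): c for c in candidates}

print(rainbow[pseudonym])  # -> 555-0001: the "anonymous" value is re-identified
```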
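
The Loc2Vec post doesn’t come with code, but the triplet loss it relies on is easy to sketch. The tensor sizes and margin below are illustrative placeholders, not the authors’ settings.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor towards the positive embedding and push it away
    from the negative embedding, up to a margin."""
    pos_dist = F.pairwise_distance(anchor, positive)
    neg_dist = F.pairwise_distance(anchor, negative)
    return torch.clamp(pos_dist - neg_dist + margin, min=0.0).mean()

# Toy batch of 8 embeddings with 16 dimensions each; in Loc2Vec the three inputs
# would be encodings of a location, a nearby location, and a far-away location.
anchor, positive, negative = (torch.randn(8, 16) for _ in range(3))
print(triplet_loss(anchor, positive, negative))
```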
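
For readers who haven’t seen Airflow before, a bare-bones DAG looks roughly like this. The DAG id, schedule, and Python callables are placeholders, and the import path is the Airflow 1.x one that was current when the talk was given.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path

def extract():
    # Placeholder for a real extraction step (e.g. pulling rows from an API).
    print("extracting...")

def load():
    # Placeholder for a real load step (e.g. writing to a warehouse).
    print("loading...")

dag = DAG(
    dag_id="example_pipeline",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
load_task = PythonOperator(task_id="load", python_callable=load, dag=dag)

extract_task >> load_task  # run extract before load
```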
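
The categorical-embedding trick from the tabular deep learning post, reduced to a toy PyTorch module. Column counts, embedding sizes, and the head architecture are invented for the example; the course itself uses the fastai library.

```python
import torch
import torch.nn as nn

class TabularNet(nn.Module):
    """Tiny model: one embedding per categorical column, concatenated with
    the continuous columns and fed through a small fully connected head."""
    def __init__(self, n_categories=(7, 12), emb_dim=4, n_continuous=3):
        super().__init__()
        self.embeddings = nn.ModuleList([nn.Embedding(n, emb_dim) for n in n_categories])
        self.head = nn.Sequential(
            nn.Linear(emb_dim * len(n_categories) + n_continuous, 16),
            nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, x_cat, x_cont):
        # Look up one learned embedding per categorical column, then concatenate.
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.head(torch.cat(embs + [x_cont], dim=1))

# Toy batch: 5 rows, two categorical columns (e.g. day-of-week, store id)
# and three continuous columns.
x_cat = torch.stack([torch.randint(0, 7, (5,)), torch.randint(0, 12, (5,))], dim=1)
x_cont = torch.randn(5, 3)
print(TabularNet()(x_cat, x_cont).shape)  # torch.Size([5, 1])
```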
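
A runnable taste of the boosting-plus-resampling idea: imbalanced-learn ships a RUSBoostClassifier (SMOTEBoost, as far as we know, has no equally standard Python implementation), so the sketch below uses that on a synthetic imbalanced dataset. Dataset sizes and parameters are arbitrary.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.ensemble import RUSBoostClassifier  # pip install imbalanced-learn

# Toy imbalanced problem: roughly 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))

# RUSBoost = AdaBoost in which the majority class is randomly undersampled
# before fitting each weak learner, exactly the idea described above.
clf = RUSBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))
```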
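
And a rough, unofficial sketch of the iterative random projection idea behind RandNE: start from a Gaussian random basis, repeatedly multiply by the adjacency matrix so higher-order neighbourhoods are mixed in, and combine the iterates with decaying weights. The dimensions and weights below are invented, and the paper’s actual algorithm differs in its details (and is designed to scale out), so treat this only as intuition.

```python
import numpy as np
import scipy.sparse as sp

def iterative_random_projection(adj, dim=32, order=3, weights=(1.0, 1.0, 0.1, 0.01), seed=0):
    """Project a (sparse) adjacency matrix onto a random Gaussian basis, then
    repeatedly multiply by the adjacency matrix and sum the weighted iterates."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    u = rng.normal(size=(n, dim)) / np.sqrt(dim)  # U_0: random projection basis
    emb = weights[0] * u
    for i in range(1, order + 1):
        u = adj @ u                                # U_i = A * U_{i-1}
        emb = emb + weights[i] * u
    return emb

# Toy graph: a random sparse adjacency matrix with ~1% density, made symmetric.
A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
A = A + A.T
print(iterative_random_projection(A).shape)        # (1000, 32)
```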