Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.
- Data Science and Statistics: Different Worlds? [YouTube]
In the last few years, data science has become an increasingly popular discipline. However, within the world of statistics, the ‘big data’ and ‘data scientist’ developments are sometimes labelled as hype, and ‘data science’ is seen as a rebranding of what should be statistics. One of the often-heard criticisms of big data analytics is that it lacks statistical rigour, which can lead to the wrong decisions. This talk discusses this topic in depth.
- Play Go Against a Deep Neural Network
Your opponent is a deep convolutional neural network trained to play Go.
- Introducing a new way to visually search on Pinterest
The engineers over at Pinterest show off a new way to search.
- Taking a neural net out for a walk
Kyle McDonald hooked a neural network program up to a webcam and had it try to analyze what it was seeing in real time as he walked around Amsterdam. See also: a neural network tries to identify objects in the Star Trek: TNG intro.
- If Google predicts your future, will it be a cliché?
An essay from five years ago, but still very relevant today. As the author of the repost states: “Hubris at the Next Economy conference around robotic writing reminded me of this essay from 5 years ago.”
- Character-based Neural Machine Translation
This paper introduces a neural machine translation model that views the input and output sentences as sequences of characters rather than words. The authors show that their model can achieve translation results that are on par with conventional word-based models.
- Reducing Overfitting in Deep Networks by Decorrelating Representations
One major challenge in training Deep Neural Networks is preventing overfitting. Many techniques such as data augmentation and novel regularizers such as Dropout have been proposed to prevent overfitting without requiring a massive amount of training data. This work proposes a new regularizer called DeCov which leads to significantly reduced overfitting and better generalization.
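To give a feel for what a decorrelating regularizer computes, here is a minimal sketch in plain Python. It assumes the penalty is half the sum of squared off-diagonal entries of the covariance matrix of a layer's hidden activations over a batch, which is our reading of the paper and not code from the authors. The function name `decov_penalty` and the toy batches are hypothetical.

```python
def decov_penalty(activations):
    """Decorrelation penalty for one batch of hidden activations.

    `activations` is a list of rows, one row per example and one
    column per hidden unit. Assumed form (our reading of the paper):
    0.5 * sum of squared off-diagonal covariance entries.
    """
    n = len(activations)
    d = len(activations[0])

    # Mean activation of each hidden unit over the batch.
    means = [sum(row[j] for row in activations) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in activations]

    penalty = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                continue  # variances on the diagonal are not penalized
            cov_ij = sum(row[i] * row[j] for row in centered) / n
            penalty += cov_ij ** 2
    return 0.5 * penalty


# Hypothetical toy batches: two hidden units, a few examples each.
correlated = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
uncorrelated = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]

print(decov_penalty(correlated))
print(decov_penalty(uncorrelated))
```

In training, a term like this would be scaled by a weight and added to the usual loss, so that units which move together are pushed apart. The perfectly correlated toy batch gets a positive penalty, while the uncorrelated one gets zero.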
- Fun with Simpson’s Paradox
Wikipedia describes Simpson’s paradox as “a trend that appears in different groups of data but disappears or reverses when these groups are combined.”
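The reversal is easy to reproduce with a few counts. The sketch below uses illustrative success counts for two treatments split into two subgroups, following the widely quoted kidney stone example. The names `data`, `subgroup_rates` and `pooled_rate` are our own.

```python
from fractions import Fraction

# (successes, trials) per treatment and subgroup. Illustrative counts,
# following the widely quoted kidney stone example.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def subgroup_rates(groups):
    """Success rate within each subgroup, as exact fractions."""
    return {name: Fraction(s, t) for name, (s, t) in groups.items()}


def pooled_rate(groups):
    """Success rate after combining all subgroups."""
    successes = sum(s for s, _ in groups.values())
    trials = sum(t for _, t in groups.values())
    return Fraction(successes, trials)


rates = {treatment: subgroup_rates(g) for treatment, g in data.items()}
pooled = {treatment: pooled_rate(g) for treatment, g in data.items()}

# Treatment A has the higher success rate in every subgroup...
a_wins_each_group = all(
    rates["A"][group] > rates["B"][group] for group in ("small", "large")
)
# ...yet treatment B has the higher rate once the subgroups are combined.
b_wins_pooled = pooled["B"] > pooled["A"]

for treatment in ("A", "B"):
    for group in ("small", "large"):
        print(treatment, group, f"{float(rates[treatment][group]):.3f}")
    print(treatment, "pooled", f"{float(pooled[treatment]):.3f}")
print(a_wins_each_group, b_wins_pooled)
```

The trend flips because the subgroups have very different sizes across treatments: most of A's cases fall in the harder "large" subgroup, while most of B's fall in the easier "small" one, so the pooled rates are weighted very differently.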
- “Neural Art” in TensorFlow
An implementation of “A neural algorithm of Artistic style” in TensorFlow.
- Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
“All images in this paper are generated by a neural network. They are NOT REAL. Full paper here: http://arxiv.org/abs/1511.06434.”
- The hardest parts of data science
Contrary to common belief, the hardest part of data science isn’t building an accurate model or obtaining good, clean data. It is much harder to define feasible problems and come up with reasonable ways of measuring solutions. This post discusses some examples of these issues and how they can be addressed.
- Building Analytics at Simple
“Early in 2014, Simple was a mid-stage startup with only a single analytics-focused employee. When we wanted to answer a question about customer behavior or business performance, we would have to query production databases. Everybody in the company wanted to make informed decisions, from engineering to product strategy to business development to customer relations, so it was clear that we needed to build a data warehouse and a team to support it.”
- NBA Player Movement Data in R
“Everyone’s excited about the newly released NBA player movement data that’s been released – at least I am. I stumbled across this post which shows how to visualize player movement data in Python, but I wanted to figure out how to do the same in R. We’ll use a combination of several R packages including RCurl, jsonlite, png and plotrix.”
- Simple end-to-end TensorFlow examples
Some simple examples to get started with TensorFlow.
- Tree-based Pipeline Optimization Tool (TPOT)
In the same vein as auto-sklearn, consider TPOT your Data Science Assistant. TPOT is a Python tool that automatically creates and optimizes Machine Learning pipelines using genetic programming. TPOT will automate the most tedious part of Machine Learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
- Is it Pokemon or Big Data?
A fun, light-hearted quiz to close off this issue’s list.