Web Picks (week of 9 January 2017)

Posted on January 16, 2017

Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.

The world’s best Go player says he still has “one last move” to defeat Google’s AlphaGo AI
Over the past few days, Google’s DeepMind machine-learning team secretly put its AlphaGo artificial intelligence system onto two Chinese online board-game platforms to test its skill in fast-paced games against several of the world’s best Go players.

Practical advice for analysis of large, complex data sets
Great technical, social, and process-related points of advice!

Blockchains & Platforms: shaping the future of Insurance and Liabilities
“Despite the outlook on risk management — and generally on insurance business — is changing a lot on a global scale, we are all extremely surprised about how the giants in the insurance industry are sitting in peaceful thinking that their industry is — essentially — protected from any disruption.”

Where Should Machines Go To Learn?
If we want to massively accelerate artificial intelligence and improve human lives, we need to democratize access to data.

‘Mathwashing,’ Facebook and the zeitgeist of data worship
“Don’t overlook the inherent subjectivity of building things with data just because you’re using math,” says former Kickstarter data scientist Fred Benenson.

A Kaggler’s Guide to Model Stacking in Practice
“Stacking (also called meta ensembling) is a model ensembling technique used to combine information from multiple predictive models to generate a new model. Here I provide a simple example and guide on how stacking is most often implemented in practice.”
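
To make the idea concrete, here is a minimal sketch of stacking in plain Python. It is not the code from the linked guide: the two base learners (a one-feature linear fit and a 1-nearest-neighbour regressor), the linear meta-model, the helper names, and the toy data are all assumptions chosen for illustration. The base models produce out-of-fold predictions, and the meta-model is then fitted on those predictions.

```python
from statistics import mean

def fit_linear(xs, ys):
    """Base model 1 (illustrative): least-squares line on a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def fit_nearest(xs, ys):
    """Base model 2 (illustrative): 1-nearest-neighbour regression."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def fit_meta(p1, p2, ys):
    """Meta-model: least-squares weights (w1, w2) for y ~ w1*p1 + w2*p2."""
    a = sum(u * u for u in p1)
    b = sum(u * v for u, v in zip(p1, p2))
    d = sum(v * v for v in p2)
    e = sum(u * y for u, y in zip(p1, ys))
    f = sum(v * y for v, y in zip(p2, ys))
    det = a * d - b * b  # assumes the two prediction columns are not collinear
    return (e * d - b * f) / det, (a * f - b * e) / det

def fit_stack(xs, ys, n_folds=3):
    """Fit the base models out-of-fold, then fit the meta-model on their predictions."""
    base_fitters = [fit_linear, fit_nearest]
    oof = [[0.0] * len(xs) for _ in base_fitters]
    for k in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != k]
        held = [i for i in range(len(xs)) if i % n_folds == k]
        for m, fitter in enumerate(base_fitters):
            model = fitter([xs[i] for i in train], [ys[i] for i in train])
            for i in held:
                # Each point is predicted by a model that never saw it.
                oof[m][i] = model(xs[i])
    w1, w2 = fit_meta(oof[0], oof[1], ys)
    # Refit the base models on all the data for use at prediction time.
    lin, nn = (fitter(xs, ys) for fitter in base_fitters)
    return (lambda x: w1 * lin(x) + w2 * nn(x)), (w1, w2)

# Toy data (made up): a roughly quadratic curve.
xs = [float(i) for i in range(9)]
ys = [0.0, 1.2, 3.9, 9.1, 15.8, 25.3, 35.9, 49.2, 63.8]
predict, weights = fit_stack(xs, ys)
print(weights, predict(4.5))
```

Fitting the meta-model on out-of-fold predictions rather than in-sample ones keeps it from simply rewarding whichever base model overfits the training data most.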

Jupyter + Pachyderm — Part 1, Exploring and Understanding Historical Analyses
“In other words, the multi-format, exploratory functionality of Jupyter could be that much more powerful if there were a system, with which Jupyter could be paired, that would enable Jupyter notebooks to interact with chronological records of works and/or be versioned themselves. Enter Pachyderm! Pachyderm, with its data versioning plus data pipelining functionality, can expand the possibilities and increase the significance of applications like Jupyter and nteract.” Cool!

The Instant Rise of Machine Intelligence?
“While I strongly believe in the fascinating opportunities around deep learning for image recognition, natural language processing and even end-to-end “intelligent” systems (e.g. chat bots), I wanted to get a better feeling of the recent technological progress.”

Symbolic Machine Learning
This post aims to provide a unifying approach to symbolic and non-symbolic techniques of artificial intelligence.

Beautiful thematic maps with ggplot2
“In this blog post, I am going to explain step by step how I (eventually) achieved this result – from a very basic, useless, ugly, default map to the publication-ready and (in my opinion) highly aesthetic choropleth.”

How to map your Google location history with R
If you want to see a few ways to quickly and easily visualize your location history with R, read on.

2017 Outlook: pandas, Arrow, Feather, Parquet, Spark, Ibis
2017 is shaping up to be an exciting year in Python data development.

Generating Videos with Scene Dynamics (paper)
“We capitalize on large amounts of unlabeled video in order to learn a model of scene dynamics for both video recognition tasks (e.g. action classification) and video generation tasks (e.g. future prediction).”

Learning from Simulated and Unsupervised Images through Adversarial Training (paper)
“With recent progress in graphics, it has become more tractable to train models on synthetic images, potentially avoiding the need for expensive annotations. However, learning from synthetic images may not achieve the desired performance due to a gap between synthetic and real image distributions. To reduce this gap, we propose Simulated+Unsupervised (S+U) learning, where the task is to learn a model to improve the realism of a simulator’s output using unlabeled real data, while preserving the annotation information from the simulator.”

Machine Learning is Fun Part 6: How to do Speech Recognition with Deep Learning
“Speech recognition has been around for decades, so why is it just now hitting the mainstream? The reason is that deep learning finally made speech recognition accurate enough to be useful outside of carefully controlled environments.”

TensorKart: self-driving MarioKart with TensorFlow
Project – use TensorFlow to train an agent that can play MarioKart 64.

21 Must-Know Data Science Interview Questions and Answers
KDnuggets Editors bring you the answers to ‘20 Questions to Detect Fake Data Scientists’, including what is regularization, Data Scientists we admire, model validation, and more.

What Neural Network Can Tell About Your Doodles?
“I spent 3 weeks analyzing them, observing, looking for patterns. And I found a few. These patterns could be used for a deeper analysis of a thought process behind drawing and the way people use their brain.”

Spatial analysis pipelines with simple features in R
In November, sf, the new simple features package for R, hit CRAN. The package is like rgdal, sp, and rgeos rolled into one, is much faster, and allows for data processing with dplyr verbs! Also, as sf objects are represented in a much simpler way than sp objects, it allows for spatial analysis in R within magrittr pipelines.

Deep Learning Reinvents the Hearing Aid
Finally, wearers of hearing aids can pick out a voice in a crowded room.

XGBoost: the algorithm that wins every competition (presentation)
Also see this guide to parameter tuning in XGBoost if you’re planning to take it for a spin!

“What are the steps / tools in setting up a modern, SaaS-based BI infrastructure?”
Business intelligence tech has changed dramatically over the past 3–4 years since the advent of cloud-based analytic databases like Amazon Redshift and Google BigQuery.

Handy Python Libraries for Formatting and Cleaning Data
Cleaning data may be time-consuming, but lots of tools have cropped up to make this crucial duty a little more bearable. The Python community offers a host of libraries for making data orderly and legible—from styling DataFrames to anonymizing datasets.

Design Better Data Tables
After being the bread and butter of the web for most of its early history, tables were cast aside by many designers for newer, trendier layouts. But while they might be making fewer appearances on the web these days, data tables still collect and organize much of the information we interact with on a day-to-day basis.

3-D Fractals Offer Clues to Complex Systems
By folding fractals into 3-D objects, a mathematical duo hopes to gain new insight into simple equations.

ggraph – Graph visualization for messy data
This is a library built on top of D3 with the goal of improving how we work with large and messy graphs. It extends the notion of nodes and links with groups of nodes. This is useful when multiple nodes are in fact the same thing or belong to the same group.

From JSON to Parquet using Spark SQL (and HDFS sequence files)
A tour through the world of Apache’s new data formats.

Own ChatBot Based on Recurrent Neural Network
For $6/6 hours and ~100 lines of code.

A Guide to Deep Learning
This guide is for those who know some math, know some programming language and now want to dive deep into deep learning.

Best Data Visualization Projects of 2016