Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.
- When is data science a house of cards?
“As data scientists, when we reach an answer, we often communicate that answer and move on. But what happens when there are multiple data scientists with varying answers?”
- Top Data Scientist Claudia Perlich’s Favorite Machine Learning Algorithm
“Hands down logistic regression (with many bells and whistles like stochastic gradient descent, feature hashing and penalties).”
- On the code of data science (presentation)
“Data science calls for rapid experimentation and building intuitions from the data. Yet, data science also underpins crucial decisions and operational logic. Writing production-ready and robust statistical analysis without cognitive overhead may seem a conundrum.”
- Lessons Learned Running Hadoop and Spark in Docker Containers (presentation)
This presentation makes a strong case for adopting containers for reproducible data science.
- When her best friend died, she rebuilt him using artificial intelligence
Interesting longread about the potential of AI to “store” human behavior, even after we pass away.
- Can we open the black box of AI?
Artificial intelligence is everywhere. But before scientists trust it, they first need to understand how machines learn.
- Deep Reinforcement Learning From Raw Pixels in Doom (paper)
“Using current reinforcement learning methods, it has recently become possible to learn to play unknown 3D games from raw pixels. In this work, we study the challenges that arise in such complex environments, and summarize current methods to approach these.”
- Making data analytics work for you—instead of the other way around
Does your data have a purpose? If not, you’re spinning your wheels. Here’s how to discover one and then translate it into action.
- Open Sourcing 223GB of Driving Data
A necessity in building an open source self-driving car is data. Lots and lots of data. Udacity now releases 223GB of image frames and log data from 70 minutes of driving in Mountain View on two separate days, with one day being sunny, and the other overcast.
- Why Deep Learning is suddenly changing your life
Decades-old discoveries are now electrifying the computing industry and will soon transform corporate America.
- How to Use t-SNE Effectively
Amazing visualization! “Although extremely useful for visualizing high-dimensional data, t-SNE plots can sometimes be mysterious or misleading. By exploring how it behaves in simple cases, we can learn to use it more effectively.”
- dfply: dplyr-style piping operations for pandas dataframes
This package makes it possible to do R’s dplyr-style data manipulation with pipes in Python on pandas DataFrames. Not as good as the original, but still a good resource to keep around.
- janitor: simple tools for data cleaning in R
janitor has simple functions for examining and cleaning dirty data. The example on the page is very convincing, starting from a dirty Excel file.
- Keras.js: run Keras models in your browser
Amazing demos to behold. Simply amazing to see a deep neural network crunching away… in JavaScript.
- Facebook has repeatedly trended fake news since firing its human editors
In the six weeks since Facebook revamped its Trending system — and a hoax about the Fox News Channel star subsequently trended — the site has repeatedly promoted “news” stories that are actually works of fiction.
- RStudio announces R Notebooks
“Today we’re excited to announce R Notebooks, which add a powerful notebook authoring engine to R Markdown. Notebook interfaces for data analysis have compelling advantages including the close association of code and output and the ability to intersperse narrative with computation. Notebooks are also an excellent tool for teaching and a convenient way to share analyses.”
- pandasql: Make python speak SQL
This post is about pandasql, a Python package Yhat wrote that emulates the R package sqldf. It’s a small but mighty library comprised of just 358 lines of code. The idea of pandasql is to make Python speak SQL. For those who come from a SQL-first background, pandasql is a nice way to take advantage of the strengths of both languages.
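To give a feel for what "making Python speak SQL" can look like, here is a minimal sketch of the general idea: copy a DataFrame into an in-memory SQLite database, run a query, and read the result back as a DataFrame. This is not pandasql's own API; it uses only pandas and the standard-library sqlite3 module, and the helper name `run_sql` and the `meals` table are made up for illustration.

```python
import sqlite3

import pandas as pd


def run_sql(query, frames):
    """Run a SQL query against a dict of {table_name: DataFrame}.

    Hypothetical helper for illustration; not part of pandasql.
    """
    con = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        # Copy each DataFrame into SQLite as a table of the same name.
        for name, frame in frames.items():
            frame.to_sql(name, con, index=False)
        # Execute the query and return the result as a new DataFrame.
        return pd.read_sql_query(query, con)
    finally:
        con.close()


# Made-up example data.
meals = pd.DataFrame(
    {
        "meal": ["breakfast", "lunch", "dinner", "lunch"],
        "price": [4.0, 9.0, 15.0, 11.0],
    }
)

result = run_sql(
    "SELECT meal, AVG(price) AS avg_price "
    "FROM meals GROUP BY meal ORDER BY meal",
    {"meals": meals},
)
print(result)
```

The round trip through SQLite costs a copy of the data, so a sketch like this suits small and medium DataFrames where SQL readability matters more than raw speed.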
- A Dramatic Tour through Python’s Data Visualization Landscape
Fun, entertaining read comparing Python’s visualization libraries.