Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.
- The “Joel Test” for Data Science
Some great advice here: Can new hires get set up in the environment to run analyses on their first day? Can data scientists utilize the latest tools/packages without help from IT? Can data scientists use on-demand and scalable compute resources without help from IT/dev ops? Can data scientists find and reproduce past experiments and results, using the original code, data, parameters, and software versions? Does collaboration happen through a system other than email?
Can predictive models be deployed to production without custom engineering or infrastructure work? Is there a single place to search for past research and reusable data sets, code, etc?
- Build pipelines, deployment, and immutable artifacts
What is the best way to build your code? How can you ensure repeatable deploys? What does build and deployment look like in a devops, continuous delivery kind of world?
- How a Japanese cucumber farmer is using deep learning and TensorFlow
This is just fantastic!
- This Mathematician Says Big Data Is Causing a ‘Silent Financial Crisis’
O’Neil sees plenty of parallels between the usage of Big Data today and the predatory lending practices of the subprime crisis. In both cases, the effects are hard to track, even for insiders.
- Building a recommendation engine with AWS Data Pipeline, Elastic MapReduce and Spark
From Google’s advertisements to Amazon’s product suggestions, recommendation engines are everywhere. As users of smart internet services, we’ve become accustomed to being shown things we like. This blog post is an overview of how we built a product recommendation engine.
- How algorithms rule our working lives
Employers are turning to mathematically modelled ways of sifting through job applications. Even when wrong, their verdicts seem beyond dispute – and they tend to punish the poor.
- Yuval Noah Harari on big data, Google and the end of free will
Forget about listening to ourselves. In the age of data, algorithms have the answer, writes the historian Yuval Noah Harari.
- Your Garbage Data Is A Gold Mine
From Big Data to Weird Data?
- Decoupled Neural Interfaces Using Synthetic Gradients
Neural networks are the workhorse of many of the algorithms developed at DeepMind. For example, AlphaGo uses convolutional neural networks to evaluate board positions in the game of Go, and DQN and deep reinforcement learning algorithms use neural networks to choose actions that achieve super-human performance on video games. This latest post from DeepMind introduces Decoupled Neural Interfaces using Synthetic Gradients, some of their latest research into advancing the capabilities and training procedures of neural networks.
- Learning from Imbalanced Classes
When you start looking at real, uncleaned data, one of the first things you notice is that it’s a lot noisier and more imbalanced. If you deal with such problems and want practical advice on how to address them, read on.
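As a taste of the simplest remedy for class imbalance, here is a random-oversampling sketch in plain Python. The function name and toy data are my own illustration, not from the article, which covers a wider menu of techniques (class weights, undersampling, synthetic sampling, and more):

```python
import random

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until all classes match
    the size of the largest class -- a crude but common baseline."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        resampled = rows + [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(resampled)
        y_out.extend([label] * target)
    return X_out, y_out

# Toy demo: a 95:5 class split becomes 95:95 after oversampling.
X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5
Xb, yb = oversample_minority(X, y)
```

Note that oversampling must happen inside the training fold only; duplicating rows before a train/test split leaks copies of the same example into both sets.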
- Practical XGBoost in Python
New, 100% free course.
- Using Apache Spark to Analyze Large Neuroimaging Datasets
This post describes how to analyze dominant components in high-dimensional neuroimaging data, and demonstrates how to perform Principal Components Analysis (PCA) on a dataset large enough that standard single-computer techniques will not work.
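To make the idea concrete, here is a single-machine sketch of the underlying computation: accumulate covariance statistics in one pass over data chunks (the part Spark would distribute across workers), then extract the leading principal component by power iteration. The names and toy data are illustrative, not from the post:

```python
def top_component_streaming(chunks, dim, iters=100):
    """One pass over chunked data to build the covariance matrix without
    holding all rows in memory, then power iteration for the top PC."""
    n = 0
    s = [0.0] * dim                      # running sum of each feature
    ss = [[0.0] * dim for _ in range(dim)]  # running sum of outer products
    for chunk in chunks:
        for row in chunk:
            n += 1
            for i in range(dim):
                s[i] += row[i]
                for j in range(dim):
                    ss[i][j] += row[i] * row[j]
    mean = [si / n for si in s]
    cov = [[ss[i][j] / n - mean[i] * mean[j] for j in range(dim)]
           for i in range(dim)]
    # Power iteration converges to the leading eigenvector of cov.
    v = [1.0] * dim
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(wi * wi for wi in w) ** 0.5
        v = [wi / norm for wi in w]
    return v

# Toy demo: all the variance lives along the first axis.
chunks = [[[float(i), 0.0] for i in range(10)],
          [[float(i), 0.0] for i in range(10, 20)]]
v = top_component_streaming(chunks, dim=2)
```

The accumulation step is embarrassingly parallel, which is exactly why frameworks like Spark handle it well at scale.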
- Facebook fires trending team, and algorithm without humans goes crazy
Module pushes out false story about Fox’s Megyn Kelly, offensive Ann Coulter headline and a story link about a man masturbating with a McDonald’s sandwich…
- An exclusive inside look at how artificial intelligence and machine learning work at Apple
“Because Apple has always been so tight-lipped about what goes on behind badged doors, the AI cognoscenti didn’t know what Apple was up to in machine learning. “It’s not part of the community,” says Jerry Kaplan, who teaches a course at Stanford on the history of artificial intelligence. “Apple is the NSA of AI.” But AI’s Brahmins figured that if Apple’s efforts were as significant as Google’s or Facebook’s, they would have heard that.”
- 3 Reasons Counting is the Hardest Thing in Data Science
“Counting is hard. You might be surprised to hear me say that, but it’s true. As a data scientist, I’ve done it all – everything from simple regression analysis all the way to coding Hadoop MapReduce jobs that process hundreds of billions of data points each month. And, with all that experience, I’ve found that counting often involves far more time and effort.”
- An introduction to Generative Adversarial Networks
This post describes the GAN formulation in a bit more detail, and provides a brief example (with code in TensorFlow) of using a GAN to solve a toy problem.
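For reference, the formulation in question is the two-player minimax game from Goodfellow et al.'s original GAN paper: a generator $G$ and a discriminator $D$ play

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

where $D(x)$ estimates the probability that $x$ came from the data rather than from the generator, and $G(z)$ maps noise $z$ to a sample.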
- O’Reilly Data Ebook Archive
An archive of all O’Reilly data ebooks is available for free download. Dive deep into the latest in data science and big data, compiled by O’Reilly editors, authors, and Strata speakers.
- A Concise History of Neural Networks
Nice complement to our feature article.
- glmnet for Python
“Thinking it would be easier to have a tool that was written in a single language, I started looking for the Scikit-Learn analog of glmnet, specifically the cv.glmnet function in this R package. Unfortunately, I could not find anything written in Python that emulated the functionality we needed from glmnet.”
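For readers curious what glmnet actually computes, here is a bare-bones coordinate-descent sketch for the lasso in plain Python. This is my own illustration, not code from the post; glmnet layers the elastic-net mixing parameter, warm starts along a regularization path, and the cross-validation wrapper (cv.glmnet) on top of this core idea:

```python
def soft_threshold(rho, lam):
    """The lasso's shrinkage operator: pull rho toward zero by lam."""
    if rho < -lam:
        return rho + lam
    if rho > lam:
        return rho - lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for
    (1/2) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, zj = 0.0, 0.0
            for i in range(n):
                # Partial residual: prediction excluding feature j.
                pred_minus_j = sum(X[i][k] * beta[k]
                                   for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_minus_j)
                zj += X[i][j] ** 2
            beta[j] = soft_threshold(rho, lam) / zj
    return beta

# Toy demo: y depends only on the first feature, so the lasso's
# soft-thresholding drives the second coefficient exactly to zero.
beta = lasso_cd([[1, 1], [2, 0], [3, 1]], [2, 4, 6], lam=0.5)
```

The exact zeros in the solution are the point: unlike ridge regression, the lasso performs feature selection, which is the behaviour the author was hunting for in a Python analog.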
- An alarming number of scientific papers contain Excel errors
A surprisingly high number of scientific papers in the field of genetics contain errors introduced by Microsoft Excel, according to an analysis recently published in the journal Genome Biology.
- The Zen of Modeling
Step 1: Your model should have some theoretical basis.
- Who Leads the Clothing Fashion: Style, Color, or Texture? A Computational Study (paper)
“A classification-based model is proposed to quantify the influence of different visual stimuli, in which each visual stimulus’s influence is quantified by its corresponding accuracy in fashion classification.”
- Collaborative Filtering with Recurrent Neural Networks (paper)
“We show that collaborative filtering can be viewed as a sequence prediction problem, and that given this interpretation, recurrent neural networks offer a very competitive approach. In particular we study how the long short-term memory (LSTM) can be applied to collaborative filtering, and how it compares to standard nearest neighbors and matrix factorization methods on movie recommendation.”
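The reframing is easy to illustrate: treat each user's ordered history as a token sequence and predict the next item. Below, a toy bigram counter stands in for the LSTM the paper studies (item and function names are invented for the example); the nearest-neighbor and matrix-factorization baselines mentioned in the abstract ignore this ordering entirely:

```python
from collections import defaultdict

def train_bigram(histories):
    """Count item -> next-item transitions across user histories.
    A deliberately simple sequence model; the paper uses an LSTM."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in histories:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def predict_next(counts, item):
    """Recommend the most frequent successor of `item`, if any."""
    successors = counts.get(item)
    return max(successors, key=successors.get) if successors else None

# Toy demo with invented movie identifiers.
histories = [["matrix", "matrix2"],
             ["matrix", "matrix2", "matrix3"],
             ["matrix", "inception"]]
counts = train_bigram(histories)
```

An LSTM generalizes this by conditioning on the whole history rather than just the previous item, and by sharing statistical strength across similar items through learned embeddings.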
- Full Resolution Image Compression with Recurrent Neural Networks (paper)
“This paper presents a set of full-resolution lossy image compression methods based on neural networks.”