Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.
- Economists are prone to fads, and the latest is machine learning
Big data have led to the latest craze in economic research.
- How the Circle Line rogue train was caught with data
Amazing story from data.gov.sg on how they uncovered the root cause of a spate of mysterious disruptions on Singapore’s MRT Circle Line. Definitely worth a read!
- What Artificial Intelligence Can and Can’t Do Right Now
As told by the famous Andrew Ng. “Many executives ask me what artificial intelligence can do. They want to know how it will disrupt their industry and how they can use it to reinvent their own companies. But lately the media has sometimes painted an unrealistic picture of the powers of AI.”
- The major advancements in Deep Learning in 2016
“Deep Learning has been the core topic in the Machine Learning community the last couple of years and 2016 was no exception. In this article, we will go through the advancements we think have contributed the most (or have the potential) to move the field forward.” Generative Adversarial Networks might be one of the most important ideas in machine learning for a while, at least according to Yann LeCun (one of the fathers of deep learning).
- Reproducible research: Stripe’s approach to data science
“When people talk about their data infrastructure, they tend to focus on the technologies: Hadoop, Impala, and the like. However, we’ve found that just as important as the technologies themselves are the principles that guide their use. We’d like to share our experience with one such principle that we’ve found particularly useful: reproducibility.”
- OpenAI announces Universe
A software platform for measuring and training an AI’s general intelligence across the world’s supply of games, websites and other applications. An incredible resource for researchers and hobbyists alike to get started with reinforcement learning!
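As a taste of the API, here is the canonical starter loop adapted from the Universe README (it assumes gym, universe, and Docker are installed; the environment name is one of the released Flash games):

```python
import gym
import universe  # importing this registers the Universe environments with gym

env = gym.make('flashgames.DuskDrive-v0')
env.configure(remotes=1)  # launch one remote environment in a Docker container
observation_n = env.reset()

while True:
    # Hold the up-arrow key in every remote environment.
    action_n = [[('KeyEvent', 'ArrowUp', True)] for _ in observation_n]
    observation_n, reward_n, done_n, info = env.step(action_n)
    env.render()
```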
- The State of Data Science 2016
The highlights: the top five skills listed by data scientists are Data Analysis, R, Python, Data Mining, and Machine Learning, and the number of data scientists has doubled over the last four years. Also see The State of Data Engineering, where the top skills are SQL, Java, Python, Hadoop, and Linux.
- Dimensionality Reduction and Intuition
Very interesting article highlighting t-SNE and Google’s recently open-sourced Embedding Projector, a web application for interactive visualization and analysis of high-dimensional data that is part of TensorFlow.
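For readers who want to experiment with t-SNE itself before firing up the Projector, here is a minimal sketch using scikit-learn’s bundled digits dataset (64-dimensional vectors squeezed down to 2 for plotting):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 1,797 8x8 digit images, flattened into 64-dimensional vectors.
X, y = load_digits(return_X_y=True)

# Non-linearly embed the 64 dimensions into 2.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
```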
- Open sourcing the Embedding Projector: a tool for visualizing high dimensional data
Actually, the Embedding Projector is interesting enough to also warrant its own mention: “with the widespread adoption of ML systems, it is increasingly important for research scientists to be able to explore how the data is being interpreted by the models. However, one of the main challenges in exploring this data is that it often has hundreds or even thousands of dimensions, requiring special tools to investigate the space. To enable a more intuitive exploration process, we are open-sourcing the Embedding Projector, a web application for interactive visualization and analysis of high-dimensional data recently shown as an A.I. Experiment, as part of TensorFlow. We are also releasing a standalone version at projector.tensorflow.org, where users can visualize their high-dimensional data without the need to install and run TensorFlow.”
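To try the standalone version on your own embeddings, you can (as far as we can tell from the site’s load dialog) upload two plain tab-separated files: one with an embedding per row and one with a matching label per row. A sketch with made-up data:

```python
import numpy as np

# Made-up data: 100 embeddings of dimension 50, one label apiece.
vectors = np.random.randn(100, 50)
labels = ["item_%d" % i for i in range(100)]

# vectors.tsv: one embedding per row, dimensions separated by tabs.
np.savetxt("vectors.tsv", vectors, delimiter="\t")

# metadata.tsv: one label per row, in the same order as the vectors.
with open("metadata.tsv", "w") as f:
    f.write("\n".join(labels))
```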
- Four Experiments in Handwriting with a Neural Network
“We’ll start with a fun one that tries to predict your strokes as you write.”
- Decoding The Thought Vector
Neural networks have an uncanny knack for turning meaning into numbers. These numbers, the activations of the network, carry useful information from one layer of the network to the next, and are believed to represent the data at different layers of abstraction. But the vectors themselves have thus far defied interpretation. This blog post puts forward a possible interpretation of these vectors.
- You can’t deep-learn your way out of everything
Feature engineering is just easier, says this author.
- Do machines actually beat doctors?
For anyone watching the medical AI space, it seems a day can’t go by without an article in which journalists report that machines outperform human doctors on some new piece of research. The reality, this post argues, is more complex.
- Why is machine learning ‘hard’?
According to this researcher, it’s due to “exponentially difficult debugging”.
- Laying the Foundation for a Data Team
Monzo wants to “build the best bank account in the world.” The startup discusses how they’re building their data team. Also see their post on building their backend.
- Spark Stream Processing + Kafka (presentation)
Interesting presentation on integrating Spark Streaming, a popular stream processor, with Kafka, the distributed message broker.
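For the flavor of it, a minimal sketch of the classic direct-stream integration (Spark 1.6/2.x era), assuming a local broker at localhost:9092, a topic named events, and the spark-streaming-kafka artifact on the classpath:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # needs the spark-streaming-kafka package

sc = SparkContext(appName="KafkaWordCount")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Read (key, value) pairs straight from the Kafka partitions, no receivers.
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "localhost:9092"}
)

counts = (stream.map(lambda kv: kv[1])            # keep the message value
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```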
- Data Wrangling at Slack
Kafka, Sqooper, Spark, Hive, and Presto all at work here.
- Time-Series Missing Data Imputation In Apache Spark
Using the Spark-TS package.
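As a rough illustration of the underlying problem (plain PySpark, not the Spark-TS API; data and column names are made up), here is a forward-fill that carries the last observed value forward within each series:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("ts-impute").getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 10.0), ("a", 2, None), ("a", 3, None), ("a", 4, 40.0)],
    ["series", "t", "value"],
)

# Within each series, ordered by time, take the last non-null value seen so far.
w = (Window.partitionBy("series").orderBy("t")
           .rowsBetween(Window.unboundedPreceding, 0))
filled = df.withColumn("value_ff", F.last("value", ignorenulls=True).over(w))
filled.show()
```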
- This AI Boom Will Also Bust
“In the last few years, new “deep machine learning” prediction methods are “hot.” But another result is the one described in my tweet above: fashion-induced overuse of more expensive new methods on smaller problems to which they are poorly matched. We should expect this second result to produce a net loss on average.”
- A Deep Dive into Geospatial Analysis (Jupyter notebook)
“Many of the datasets that data scientists handle have some kind of geospatial component to them, and that information is oftentimes useful-to-critical for understanding the problem at hand. As such, an understanding of spatial data and how to work with it is a valuable skill for any data scientist to have. Even better, Python provides a rich toolset for working in this domain, and recent advances have greatly simplified and consolidated these.”
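As a small taste of that toolset, a hedged sketch using geopandas; the file names and the region_name column are hypothetical, and any point and polygon layers will do:

```python
import geopandas as gpd

# Hypothetical inputs: a point layer and a polygon layer.
points = gpd.read_file("stores.geojson")
regions = gpd.read_file("regions.shp")

# Reproject to a common coordinate reference system before any spatial operation.
points = points.to_crs(regions.crs)

# Spatial join: tag each point with the region polygon that contains it.
joined = gpd.sjoin(points, regions, how="inner", op="within")
print(joined.groupby("region_name").size())
```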
- An Interactive Tutorial on Numerical Optimization
“Numerical Optimization is one of the central techniques in Machine Learning. For many problems it is hard to figure out the best solution directly, but it is relatively easy to set up a loss function that measures how good a solution is – and then minimize the parameters of that function to find the solution.”
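In other words: write down a loss, then minimize it with respect to the parameters. A bare-bones gradient-descent sketch on made-up data, fitting a line y = w*x + b with hand-derived gradients:

```python
import numpy as np

# Made-up data generated by w=2, b=1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Loss: mean squared error of the line's predictions.
w, b = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    y_hat = w * x + b
    grad_w = 2 * np.mean((y_hat - y) * x)  # d(loss)/dw
    grad_b = 2 * np.mean(y_hat - y)        # d(loss)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # converges to roughly 2.0 and 1.0
```

Swap in a different loss, or a framework’s automatic differentiation, and the same loop carries you surprisingly far.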