Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.
- Machine Learning: The Great Stagnation
Machine Learning Researchers can now engage in risk-free, high-income, high-prestige work. They are today’s Medieval Catholic priests.
- We Don’t Need Data Scientists, We Need Data Engineers
There are 70% more open roles at companies in data engineering as compared to data science.
- Does GPT-2 Know Your Phone Number?
Yet, OpenAI’s GPT-2 language model does know how to reach a certain Peter W— (name redacted for privacy).
- Nearest Neighbour style interpretations of Tree Ensembles
Tree Ensembles (or Decision Forests, if you prefer), like Random Forest and Gradient Boosting, use a weighted average approach that mimics the behaviour of Adaptive Nearest Neighbours.
- It takes a lot of energy for machines to learn – here’s why AI is so power-hungry
By some estimates, training an AI model generates as much carbon emissions as it takes to build and drive five cars over their lifetimes.
- DGL Empowers Service For Predictions On Connected Datasets With Graph Neural Networks
AWS just announced the availability of Neptune ML. Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. Neptune ML is a new capability that uses graph neural networks (GNNs), a machine learning (ML) technique purpose-built for graphs, for making easy, fast, and accurate predictions on graphs. Neptune ML uses the Deep Graph Library (DGL), an open-source library to which AWS contributes that makes it easy to develop and apply GNN models on graph data.
- Netflix’s Metaflow: Reproducible machine learning pipelines
From training to deployment with Metaflow and Cortex
- Human Learn – Machine Learning models should play by the rules, literally
This reminds us a bit of Snorkel and Compose. Are we going back to expert-driven modeling?
- GeFs – Generative Forests in Python
Generative Forests are a class of Probabilistic Circuits (PCs) that subsumes Random Forests. They maintain the discriminative structure learning and overall predictive performance of Random Forests, while extending them to a full generative model over p(X, y).
- Accounts with GAN Faces Attack Belgium over 5G Restrictions (pdf)
A cluster of inauthentic accounts on Twitter amplified, and sometimes created, articles that attacked the Belgian government’s recent plans to limit the access of “high-risk” suppliers to its 5G network.
- The Doctor Will Sniff You Now
Deep Nose will one day be the best diagnostician in medicine.
- Finding the Words to Say: Hidden State Visualizations for Language Models
By visualizing the hidden state between a model’s layers, we can get some clues as to the model’s “thought process”.
- How ML saves us $1.7M a year on document previews
“Recently, we translated the predictive power of machine learning (ML) into $1.7 million a year in infrastructure cost savings by optimizing how Dropbox generates and caches document previews.”
- TF Quant Finance: TensorFlow based Quant Finance Library
This library provides high-performance components leveraging the hardware acceleration support and automatic differentiation of TensorFlow. The library will provide TensorFlow support for foundational mathematical methods, mid-level methods, and specific pricing models.
- Automating my job by using GPT-3
… to generate database-ready SQL to answer business questions
- Why isn’t differential dataflow more popular?
“Compared to competition like spark and kafka streams, it can handle more complex computations and provides dramatically better throughput and latency while using much less memory. I’m interested because materialize is expending a huge amount of effort adding a SQL layer on top of differential dataflow.”
- DALL·E: Creating Images from Text
We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language.
- Why I’m lukewarm on graph neural networks
GNNs can provide wins over simpler embedding methods, but we’re at a point where other research directions matter more.
- Why can 2 times 3 sometimes equal 7 with Android’s Neural Network API?
It is well-known that floating point matrix multiplication can result in a variety of surprises.
- Brainwave: If Memristors Act Like Neurons, Put Them in Neural Networks
Newfound “edge AI” applications for a device that integrates memory and computing, has randomness built in, and sips battery power.
- Databricks Is an RDBMS
Databricks combines the best of data lakes and data warehouses.
- Using GPT-3 for plain language incident root cause from logs
“Our approach works really well at generating root cause reports: if there’s a root cause indicator in the logs, it will almost always make its way into a concise root cause report. Our approach has proven robust to different kinds of applications and logs; it requires no training or rules, pre-built or otherwise.”
- The neural network of the Stockfish chess engine
The real cleverness of Stockfish’s neural network is that it’s an efficiently-updatable neural network (NNUE).
- Neural Geometric Level of Detail
Real-time rendering with implicit 3D surfaces.
- Using JAX to accelerate our research
Recently, we’ve found that an increasing number of projects are well served by JAX, a machine learning framework developed by Google Research teams.
- Generating Text With Markov Chains
“I wanted to write a program that I could feed a bunch of novels and then produce similar text to the author’s writing.”
- A collection of simple PyTorch implementations of neural networks and related algorithms
These implementations are documented with explanations, and the website renders these as side-by-side formatted notes.
- Weave.jl – Scientific Reports Using Julia
Weave is a scientific report generator/literate programming tool for Julia. It resembles Pweave, knitr, R Markdown, and Sweave.
- Apache Arrow 3.0.0 released
This is the first release to officially include an implementation for the Julia language.
- spaCy v3.0.0 released
Transformer-based pipelines, new training system, project templates, custom models, improved component API, type hints & lots more.
- Redpanda
A Kafka® API compatible streaming platform for mission-critical workloads.
- Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
“We show, however, that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel).”
- Reinforcement learning is supervised learning on optimized data
In this blog post we discuss a mental model for RL, based on the idea that RL can be viewed as doing supervised learning on the “good data”.
- Using GANs to Create Fantastical Creatures
Today, we present Chimera Painter, a trained machine learning (ML) model that automatically creates a fully fleshed out rendering from a user-supplied creature outline.
- Interpretability in Machine Learning: An Overview
This essay provides a broad overview of the sub-field of machine learning interpretability. While not exhaustive, my goal is to review conceptual frameworks, existing research, and future directions.