Every two weeks, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.
- On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
“In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks?” You can also read more discussion on the paper here.
- How to build data literacy in your company
A recent Gartner survey of chief data officers found that poor data literacy is one of the top three barriers to building strong data and analytics teams, while a data literacy survey by Accenture of more than 9,000 employees in a variety of roles found that only 21% were confident in their data literacy skills.
- Google’s Model Search automatically optimizes and identifies AI models
Google today announced the release of Model Search, an open source platform designed to help researchers develop machine learning models efficiently and automatically.
- Google Open-Sources Trillion-Parameter AI Language Model Switch Transformer
Researchers at Google Brain have open-sourced the Switch Transformer, a natural-language processing (NLP) AI model. The model scales up to 1.6T parameters and improves training time up to 7x compared to the T5 NLP model, with comparable accuracy.
- Science fiction hasn’t prepared us to imagine machine learning.
It resembles the Library of Babel more than HAL.
- Cannes: How ML saves us $1.7M a year on document previews
Recently, we translated the predictive power of machine learning (ML) into $1.7 million a year in infrastructure cost savings by optimizing how Dropbox generates and caches document previews.
- Modern Data Science with R
The 2nd edition has been released.
- adversarial.io – Fighting mass image recognition
Adversarial.io currently creates adversarial images against a single image recognition model: Google’s Inception v3.
- Machine Learning in your database
Automatically build and deploy Machine Learning models from inside your databases using plain SQL.
- “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI
We define, identify, and present empirical evidence on Data Cascades—compounding events causing negative, downstream effects from data issues—triggered by conventional AI/ML practices that undervalue data quality. Data cascades are pervasive (92% prevalence), invisible, delayed, but often avoidable.
- AI may mistake chess discussions as racist talk
“The Queen’s Gambit,” the recent TV mini-series about a chess master, may have stirred increased interest in chess, but a word to the wise: social media talk about game-piece colors could lead to misunderstandings, at least for hate-speech detection software.
- desirable streets
Where do people prefer to walk?
- What Inception Net Doesn’t See
“Deep learning vision models like Inception Net achieve state-of-the-art performance on image recognition. However, I’m curious about when these models don’t work well. I tested Inception Net on a large number of natural images, and here is a collection of things that the model doesn’t predict well.”
- Uncovering Unknown Unknowns in Machine Learning
Unknown unknowns are examples where a model is confident about its answer, but is actually wrong.
- Yao
Extensible, Efficient Quantum Algorithm Design For Humans
- Brain2Pix: Fully convolutional naturalistic video reconstruction from brain activity
“The 2D image representation of the brain activity on the visual field is passed to a fully convolutional image-to-image network trained to recover the original stimuli using VGG feature loss with an adversarial regularizer.”
- How Spotify Optimized the Largest Dataflow Job Ever for Wrapped 2020
The majority of the data pipelines at Spotify are written in Scio, a Scala API for Apache Beam, and run on the Google Cloud Dataflow service.
- OpenCelliD is the world’s largest open database of cell towers
The data has full world coverage and is freely available for download.
- Don’t Mess with Backprop: Doubts about Biologically Plausible Deep Learning
Biologically Plausible Deep Learning (BPDL) is an active research field at the intersection of Neuroscience and Machine Learning, studying how we can train deep neural networks with a “learning rule” that could conceivably be implemented in the brain.
- High-Performance Large-Scale Image Recognition Without Normalization
Although recent work has succeeded in training deep ResNets without normalization layers, these models do not match the test accuracies of the best batch-normalized networks, and are often unstable for large learning rates or strong data augmentations. In this work, we develop an adaptive gradient clipping technique which overcomes these instabilities.
- Supercharging Apache Superset
How Airbnb customized Superset for business intelligence at scale.
- Self-Organising Textures
Neural Cellular Automata Model of Pattern Formation
- Swift for TensorFlow is now archived
Swift for TensorFlow was an experiment in the next-generation platform for machine learning, incorporating the latest research across machine learning, compilers, differentiable programming, systems design, and beyond.
- A Data Scientist’s Guide to Lazy Evaluation with Dask
Read on to learn how lazy evaluation works, how Dask uses it, and how it makes parallelization not only possible but easy!
- kneed
Knee-point detection in Python
- BudgetML: Deploy ML models on a budget
“BudgetML lets you deploy your model on a Google Cloud Platform preemptible instance (which is ~80% cheaper than a regular instance) with a secured HTTPS API endpoint. The tool sets it up in a way that the instance autostarts when it shuts down (at least once every 24 hours) with only a few minutes of downtime.”
- Math Inspector
A Visual Programming Environment for Scientific Computing.
- Ploomber
“Ploomber is the simplest way to build reliable data pipelines for Data Science and Machine Learning. Provide your source code in a standard form and Ploomber will automatically construct the pipeline for you. Tasks can be anything from Python functions, Jupyter notebooks, Python/R/shell scripts, and SQL scripts.”
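The adaptive gradient clipping rule named in the NFNets entry above is simple enough to sketch. This is a minimal single-vector illustration in plain Python, not the paper’s implementation (which applies the rule unit-wise to convolutional filters); the `clip` and `eps` values here are illustrative, not the paper’s defaults.

```python
import math

def agc_clip(grad, weight, clip=0.01, eps=1e-3):
    """Adaptive gradient clipping sketch: shrink the gradient when its
    norm is large relative to the norm of the parameter it updates."""
    g_norm = math.sqrt(sum(g * g for g in grad))
    # Floor the weight norm so zero-initialized weights still clip sanely.
    w_norm = max(math.sqrt(sum(w * w for w in weight)), eps)
    max_norm = clip * w_norm
    if g_norm > max_norm:
        scale = max_norm / g_norm  # rescale so the clipped norm equals max_norm
        return [g * scale for g in grad]
    return list(grad)
```

The appeal over plain norm clipping is that the threshold adapts per parameter: small weights tolerate only small gradient steps, which is what stabilizes training at large batch sizes and learning rates.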
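To make the Dask lazy-evaluation item above concrete, here is a toy stand-in for `dask.delayed` in pure Python: wrapped calls record work instead of performing it, and nothing executes until `.compute()` walks the recorded graph. Dask’s real delayed objects additionally share common subtasks and can execute the graph in parallel; this sketch only shows the deferral idea.

```python
class Delayed:
    """A tiny stand-in for a dask.delayed task: holds a function and its
    (possibly also delayed) arguments instead of calling it immediately."""
    def __init__(self, func, args):
        self.func, self.args = func, args

    def compute(self):
        # Recursively evaluate any Delayed arguments first, then run.
        args = [a.compute() if isinstance(a, Delayed) else a for a in self.args]
        return self.func(*args)

def delayed(func):
    """Decorator: calling the function now builds a graph node instead."""
    return lambda *args: Delayed(func, args)

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

# Nothing runs here: we only build a task graph.
total = add(inc(1), inc(2))
# Evaluation happens only on compute(): (1 + 1) + (2 + 1)
result = total.compute()  # → 5
```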
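And for the kneed entry: knee/elbow detection can be approximated with the classic farthest-point-from-chord heuristic shown below. This is a plain-Python sketch of the general idea, not kneed’s API; the package itself implements the more elaborate Kneedle algorithm.

```python
def knee_point(xs, ys):
    """Return the index of the curve's 'knee': the point farthest from
    the straight line joining the first and last points."""
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]

    def dist(x, y):
        # Numerator of the point-to-line distance; the constant
        # denominator is the same for all points, so we skip it.
        return abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)

    return max(range(len(xs)), key=lambda i: dist(xs[i], ys[i]))
```

For a 1/x-shaped convex decreasing curve sampled at x = 1..5, this picks the point at x = 2, where the curve bends away from the chord most sharply.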