Every so often, we find the most interesting data science links from around the web and collect them in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to get up to speed on interesting resources.
- Stable Diffusion
Forget about DALL-E 2: Stable Diffusion has been making the rounds during the past couple of weeks. Unlike autoencoder-based generation, or the creations that can be achieved by Neural Radiance Fields (NeRF) and Generative Adversarial Networks (GANs), diffusion-based systems are trained by adding noise to existing source photos and learning to reverse that corruption, which teaches the system how to make plausible and even photorealistic images solely from noise! According to a 2021 paper from OpenAI, diffusion models have a clear advantage over GAN image synthesis in terms of accuracy and realism. Though this contention supports their own product (the DALL-E line), recent public interest in such systems seems to bear it out. That is not to say people have stopped experimenting with DALL-E 2: see this work from Harvard, which studies whether it can understand relationships between objects. Missed the Stable Diffusion launch announcement? Catch up here. “The model itself builds upon the work of the team at CompVis and Runway in their widely used latent diffusion model combined with insights from the conditional diffusion models by our lead generative AI developer Katherine Crowson, Dall-E 2 by Open AI, Imagen by Google Brain and many others.” This post also gives a fantastic overview of what the big deal is: “It’s similar to models like Open AI’s DALL-E, but with one crucial difference: they released the whole thing.” Since then, people have been playing around with textual inversion (a process that learns a new prompt token for a concept from a small set of example images), art, animation, and making Discord bots, and perhaps video is coming soon as well (this article is a great read in general). People have also been working hard to make the model work on M1 Macs or older GPUs. We can expect to hear much more about this in the coming weeks…
- Real-time machine learning: challenges and solutions
This post outlines solutions for (1) online prediction and (2) continual learning, with step-by-step use cases, considerations, and technologies required for each level.
- Beyond Matrix Factorization: Using hybrid features for user-business recommendations
“In this blog, we discuss how we switched from a collaborative filtering approach to a hybrid approach – which can handle multiple features and be trained on different objectives.”
- Making Decisions with Classifiers
The optimal point on the ROC curve is determined by your preferences.
- Nevermind XOR – Deep Learning has an issue with Sin
“More precisely, even the best neural networks can not be trained to approximate periodic functions using stochastic gradient descent. (empirically, prove me wrong!)”
- Running Large-Scale Graph Analytics with Memgraph and NVIDIA cuGraph Algorithms
You can now run GPU-powered graph analytics from Memgraph in seconds.
- Why is Snowflake so expensive?
Snowflake’s incentives to push on performance optimizations are diametrically opposed to its revenue targets.
- Replacing Static Authentication Detections With Anomaly Based Detections
“…explore building Isolation Forest machine learning models to help correlate our data and build detections that account for the shift to a post COVID distributed workforce.”
- Einsum notation is all you need
Einstein summation (einsum) is implemented in numpy, as well as deep learning libraries such as TensorFlow and, thanks to Thomas Viehmann, recently also PyTorch.
- How to Build a GPT-3 for Science
There are no generative AI models trained on the vast body of scientific research publications.
- From Correlation to Causation in Machine Learning: Why and How our AI needs to understand causality
How can we encode causality into a model?
- Differentiable Programming from Scratch
Many fields apart from machine learning are also finding differentiable programming a useful tool for solving many kinds of optimization problems.
- Configuration Driven Machine Learning Pipelines
“As a core component of the algorithms platform, the Model Lifecycle team is responsible for enabling data science teams to scale, by streamlining the process of getting these models into production.”
- chinchilla’s wild implications
This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla.
- GPU-accelerated ML Inference at Pinterest
Unlocking 16% Homefeed Engagement by Serving 100x Bigger Recommender Models
- Frouros is a Python library for drift detection in Machine Learning problems.
Frouros provides a combination of classical and more recent algorithms for drift detection, both for the supervised and unsupervised parts, as well as some semi-supervised algorithms.
- Cracking nuts with a sledgehammer: when modern graph neural networks do worse than classical greedy algorithms
“In general, many claims of superiority of neural networks in solving combinatorial problems are at risk of being not solid enough, since we lack standard benchmarks based on really hard problems.”
- What Do We Maximize in Self-Supervised Learning?
“In this paper, we examine self-supervised learning methods, particularly VICReg, to provide an information-theoretical understanding of their construction.”
- Machine learning, concluded: Did the “no-code” tools beat manual analysis?
In the finale of our experiment, we look at how the low/no-code tools performed.
- LLM.int8() and Emergent Features
“If you quantize from 16-bit to 8-bit, you lose precision which might degrade model prediction quality.”
- NVIDIA blocked by US government from exporting A100 circuits to China and Russia
“Any future export to China (including Hong Kong) and Russia of the Company’s A100 and forthcoming H100 integrated circuits. DGX or any other systems which incorporate A100 or H100 integrated circuits and the A100X are also covered by the new license requirement.”
- Generalized Visual Language Models
“I focus on one approach for solving vision language tasks, which is to extend pre-trained generalized language models to be capable of consuming visual signals.”
- OpenAI changes pricing of GPT-3
“We made our API more affordable on September 1, thanks to progress in making our models run more efficiently.”
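
For readers curious about the precision loss mentioned in the LLM.int8() item above, here is a minimal sketch of 8-bit quantization in plain Python. It assumes a simple absmax scheme (scale every value so the largest magnitude maps to 127) and is not the LLM.int8() method itself; the function names and sample values are made up for illustration. The second half shows how a single large outlier stretches the scale and increases the rounding error on all the small values.

```python
def absmax_quantize(values):
    """Map floats to integers in [-127, 127] using absmax scaling (illustrative helper)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0 for _ in values], 1.0
    scale = 127.0 / max_abs
    return [round(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q / scale for q in quantized]


# Hypothetical weights, chosen only for the example.
weights = [0.12, -0.53, 0.011, 0.98, -0.0004]
q, scale = absmax_quantize(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

# Same weights plus one large outlier: the scale now has to cover 60.0,
# so the small values are squeezed into only a few integer levels.
with_outlier = weights + [60.0]
q_out, scale_out = absmax_quantize(with_outlier)
restored_out = dequantize(q_out, scale_out)
errors_out = [abs(w - r) for w, r in zip(with_outlier, restored_out)]

print("max error without outlier:", max(errors))
print("max error on the same weights with outlier:", max(errors_out[:5]))
```

Running it should show the worst-case error on the original five weights growing noticeably once the outlier is present, which gives some intuition for why naive 8-bit quantization can hurt prediction quality.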