Recently, I've been playing with OpenAI's CLIP[1] and the dVAE from Dall-e[2] and I've found some very interesting behavior by accident. I first found a great colab notebook from @advadnoun that steers the dVAE using CLIP. I'll describe that below. If you already know how that works, you can skip ahead to the Results section. For those interested, the notebook can be found here.
NSFW WARNING: There are blurred images produced by the dVAE near the end of the results that I would personally consider NSFW. They aren't clearly explicit, but they definitely have an NSFW vibe. You can unblur each image by clicking on it, so you can still read this at work since the images are blurred by default.
Brief overview of Dall-e
Dall-e basically has 4 steps.
- Train a discrete variational autoencoder (dVAE) on hundreds of millions of images. At a very basic level, a dVAE learns to encode an image into a smaller representation so that it can be decoded back into something very similar to the original input. Think of it like image compression. Dall-e uses this to compress each 256×256 RGB image into a 32×32 grid of image tokens, each of which can take one of 8192 possible values. The dVAE encoder produces these image tokens (a minimal encode/decode sketch follows this list).
An example of a dVAE, from [3] (though OpenAI uses a transformer).
- BPE-encode the text associated with each of the hundreds of millions of images (e.g., captions).
- Concatenate up to 256 BPE-encoded text tokens with the 32×32 = 1024 image tokens per image, text pair, and train an autoregressive transformer to model the joint distribution over the text and image tokens.
This basically means: given a sequence of tokens, predict the next token over and over again until you have predicted the full length of the sequence (a toy sketch of this sampling loop appears after this list). An example of an autoregressive model, Google DeepMind's WaveNet, is shown below. This skips the details of the transformer and simplifies the whole process, but it's good enough to understand the rest of the model. In my opinion, this is the most important part of the paper, but it was not released.
- When creating example images post-training, generate 512 candidates and rank them by "closeness" to the text using CLIP. CLIP embeds images and text into a joint space where related words are close to related images, so if you give it an image, text pair, it can produce a similarity score that tells you how close they are.
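To make step 1 concrete, here is a minimal sketch of the dVAE round trip using the encoder and decoder OpenAI released. This follows the usage pattern of their dall_e package; the image path is just a placeholder, and you'd want the official preprocessing for real use.

import torch
import torch.nn.functional as F
import torchvision.transforms as T
from PIL import Image
from dall_e import load_model, map_pixels, unmap_pixels

device = torch.device('cpu')
enc = load_model('https://cdn.openai.com/dall-e/encoder.pkl', device)
dec = load_model('https://cdn.openai.com/dall-e/decoder.pkl', device)

# Load a 256x256 RGB image (the path is just an example)
img = Image.open('example.png').convert('RGB').resize((256, 256))
x = map_pixels(T.ToTensor()(img).unsqueeze(0).to(device))

# Encode: each of the 32x32 grid positions gets one of 8192 discrete codes
z_logits = enc(x)                       # [1, 8192, 32, 32]
tokens = torch.argmax(z_logits, dim=1)  # [1, 32, 32] integer image tokens

# Decode the tokens back into an image -- the "compression" round trip
z = F.one_hot(tokens, num_classes=8192).permute(0, 3, 1, 2).float()
x_rec = unmap_pixels(torch.sigmoid(dec(z)[:, :3]))  # [1, 3, 256, 256]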
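And for step 3, a toy illustration of what "autoregressive" means here: sample one token at a time, each conditioned on everything generated so far. OpenAI did not release this transformer, so the model below is purely hypothetical.

import torch

def sample_image_tokens(model, text_tokens, total_len=256 + 1024):
    # text_tokens: [1, up to 256] BPE-encoded caption; model is a hypothetical
    # transformer that returns next-token logits for the whole sequence
    tokens = text_tokens.clone()
    while tokens.shape[1] < total_len:
        logits = model(tokens)[:, -1, :]             # distribution over the next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, 1)     # sample the next image token
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, -1024:]                         # the 32x32 = 1024 image tokens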
Results
OpenAI released step 1 of Dall-e (the dVAE) and step 4 (CLIP). Using the notebook I linked above, we can experiment with the dVAE by using CLIP to steer the dVAE's image tokens. Given an input phrase and a random initialization of the image tokens, we iteratively move the dVAE image embedding so that the decoded image gets closer and closer to the given text according to CLIP. A simplified sketch of this loop is shown below.
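This is a sketch of the idea rather than the exact notebook (which adds more tricks). It assumes OpenAI's clip and dall_e packages, uses soft code assignments instead of hard tokens so gradients can flow through the decoder, and the prompt, learning rate, and step count are arbitrary.

import torch
import torch.nn.functional as F
import clip
from dall_e import load_model, unmap_pixels

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# CLIP acts as the critic; converting to float32 keeps gradients simple on GPU
perceptor, _ = clip.load('ViT-B/32', device, jit=False)
perceptor = perceptor.eval().float()

# The released dVAE decoder maps a 32x32 grid of codes to a 256x256 image
dec = load_model('https://cdn.openai.com/dall-e/decoder.pkl', device)

text = clip.tokenize(['an armchair in the shape of an avocado']).to(device)
with torch.no_grad():
    text_features = perceptor.encode_text(text)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Learnable logits over the 8192 codebook entries at each of the 32x32 positions
logits = torch.randn(1, 8192, 32, 32, device=device, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

# CLIP's input normalization constants
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(500):
    z = F.softmax(logits, dim=1)                     # soft code assignments
    x = unmap_pixels(torch.sigmoid(dec(z)[:, :3]))   # decode to an RGB image in [0, 1]
    x = F.interpolate(x, size=224, mode='bilinear', align_corners=False)
    image_features = perceptor.encode_image((x - mean) / std)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    loss = -(image_features * text_features).sum()   # maximize CLIP similarity
    opt.zero_grad()
    loss.backward()
    opt.step()

Decoding the argmax of the final logits (as in the round-trip sketch above) gives the image for a given step count.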
My first experiment was the same as what they show in the blog post with the full Dall-e: "an armchair in the shape of an avocado." Granted, this isn't nearly as good as the full Dall-e method, but it's still very cool! You can see elements of both armchairs and avocados in it. I also tried "a cube made of rainbow".
I noticed the cube image seems to contain shipping crates, and that got me thinking about what else it could accidentally capture. I initially wondered whether copyrighted material could be produced by either Dall-e or this CLIP-based dVAE steering. I also noticed it starts breaking down and just generating DeepDream-like features when given single words, so I started trying different cartoon TV shows to see what was generated (thanks to Cusuh and Andrew for the suggestions on which shows to try).
Copyrighted Results
Below you can see the results; under each result is the show involved (they are all from the late '90s/early 2000s). I've blurred out the show name under each one so you can guess before clicking to reveal it.
Pokemon (mildly NSFW) Results
You can see that for all of these, it generates features specific to these copyrighted TV shows/products. The outputs aren't exactly clear images, but given the Dall-e results, it seems like Dall-e could recreate close knock-offs of copyrighted images. That holds for all of the above, but it's definitely not the case for Pokemon. I had Pokemon Go on my phone recently, so I started by just typing in "pokemon", and the results were not what I expected. I've displayed them below, blurred (click to unblur).
When given input "pokemon" |
A second run of pokemon for 100 steps A second run of pokemon for 600 steps |
A second row of pokemon for 4000 steps |
A third run given input "pokemon" |
Next I tried different types of Pokemon. These were less repeatable than just the input "pokemon", and some were fine. However, I did get the following as well.
When given input "squirtle" |
When given input "weedle" |
The same thing happens with Digimon, so I'm curious whether Pokemon and Digimon are embedded close together in CLIP and whether pornography somehow sits very close to both of them. There may be other odd combinations that I just haven't stumbled across yet.
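One quick, non-conclusive way to probe that hypothesis is to compare CLIP's text embeddings for the words directly. The word list here is just an example; high pairwise similarity would be suggestive rather than proof.

import torch
import clip

device = 'cpu'
model, _ = clip.load('ViT-B/32', device, jit=False)

words = ['pokemon', 'digimon', 'porn', 'armchair']
with torch.no_grad():
    feats = model.encode_text(clip.tokenize(words).to(device))
feats /= feats.norm(dim=-1, keepdim=True)

# Pairwise cosine similarities between the word embeddings
sims = feats @ feats.T
for i, w in enumerate(words):
    print(w, [f'{s:.2f}' for s in sims[i].tolist()])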
Testing these images as input to CLIP for zero-shot classification against the ImageNet labels (see the Sample Code section at the end for exactly how), they are frequently classified as follows (these are from the 1000 ImageNet class labels):
skunk, polecat, wood pussy: 13.15%
banana: 9.81%
nipple: 6.69%
ox: 4.05%
triceratops: 3.59%
If I add porn as a label, they are classified as porn.
porn: 16.16%
skunk, polecat, wood pussy: 11.02%
banana: 8.23%
nipple: 5.61%
ox: 3.39%
If I also add pokemon as a label, they are overwhelmingly classified as pokemon.
pokemon: 100.00%
porn: 0.00%
skunk, polecat, wood pussy: 0.00%
banana: 0.00%
nipple: 0.00%
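For reference, re-ranking with extra labels like this is just a small tweak to the Sample Code at the end of the post: append the extra label strings before tokenizing. A sketch, continuing from the variables set up there (the exact label spellings are just what I used):

# Continues from the Sample Code section below: imagenet_labels, model, device,
# and the normalized image_features are assumed to already exist
labels = list(imagenet_labels.values()) + ['porn', 'pokemon']
text_inputs = torch.cat([clip.tokenize(c) for c in labels]).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_inputs)
text_features /= text_features.norm(dim=-1, keepdim=True)

similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)
for value, index in zip(values, indices):
    print(f"{labels[index.item()]:>16s}: {100 * value.item():.2f}%")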
Because the Dall-e dVAE generates these images so readily, I suspect there was a decent amount of nudity in its training set. Given CLIP's preference for these features and its own knowledge of the label, I suspect it was trained on this data as well. I do think these are both really interesting and impressive models, but it's definitely important that we look into these unusual connections they're making and why this could be happening. You definitely wouldn't want a stock photo generation method producing nudity when you ask for Pokemon (assuming you also had permission to use the likeness of Pokemon).
In their paper, they don't discuss how they scraped images from the internet or whether any filters were put in place, so it's unknown what was done, but more likely should have been. We don't currently have rules on training generative models on copyrighted data, but maybe that's also a consideration as these methods get better and better. With larger and better generative models, replicating the style or content of an artist becomes more and more feasible. I think that could be a future problem for content creators.
If you are skeptical of these results, check out the notebook and try it yourself! I did not cherry-pick these examples; it happened pretty much every time for me. I will say that the Charmander it created for me was fine. I'm hopeful Blogger won't mark this post as adult material given the images above.
[1] Radford, Alec, et al. "Learning transferable visual models from natural language supervision." arXiv preprint arXiv:2103.00020 (2021).
[2] Ramesh, Aditya, et al. "Zero-Shot Text-to-Image Generation." arXiv preprint arXiv:2102.12092 (2021).
[3] van den Oord, AƤron, Oriol Vinyals, and Koray Kavukcuoglu. "Neural Discrete Representation Learning." NIPS. 2017.
My opinions are my own and do not represent the views of my employer.
Sample Code
import ast
import clip
import torch
from PIL import Image

# See which CLIP variants are available
clip.available_models()

# Load the model
device = 'cpu'
model, preprocess = clip.load('ViT-B/32', device, jit=True)

# Load the ImageNet class labels (stored as a Python dict literal: {index: label})
with open('imagenet_labels.txt') as f:
    data = f.read()
imagenet_labels = ast.literal_eval(data)

# Tokenize every label as a candidate text input
text_inputs = torch.cat([clip.tokenize(c) for c in imagenet_labels.values()]).to(device)

# Load and preprocess the generated image
img = Image.open('../Pictures/dalle/pokemon_100.png')
image_input = preprocess(img).unsqueeze(0).to(device)

# Embed the image and all of the labels
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{imagenet_labels[index.item()]:>16s}: {100 * value.item():.2f}%")