Sunday, March 7, 2021

OpenAI's CLIP and Dall-e seem to be trained on copyrighted material and pornography (and for some reason associates that with pokemon?)


Recently, I've been playing with OpenAI's CLIP[1] and the dVAE from Dall-e[2] and I've found some very interesting behavior by accident. I first found a great colab notebook from @advadnoun that steers the dVAE using CLIP. I'll describe that below. If you already know how that works, you can skip ahead to the Results section. For those interested, the notebook can be found here.

NSFW WARNING: There are blurred images produced by the dVAE near the end of the Results section that I would personally consider NSFW. They aren't clearly explicit, but they have a definite NSFW vibe. Images are blurred by default and can be unblurred by clicking on them, so you can still read this at work.

Brief overview of Dall-e

Dall-e basically has 4 steps.

  1. Train a discrete variational autoencoder (dVAE) on hundreds of millions of images. At a very basic level, a dVAE learns to encode an image into a smaller representation that can then be decoded back into something very similar to the original input. Think of it like image compression. Dall-e uses this to compress each 256×256 RGB image into a 32×32 grid of image tokens, each of which can take one of 8192 possible values. This embedding is what produces the image tokens.

    An example of a dVAE from [3], though OpenAI uses a transformer.

  2. BPE-encode the text associated with each of the hundreds of millions of images (e.g., captions).

  3. Concatenate up to 256 BPE-encoded text tokens with the 32 × 32 = 1024 image tokens for each image-text pair, and train an autoregressive transformer to model the joint distribution over the text and image tokens.
    This basically means: given a sequence of tokens, predict the next token over and over until the full sequence has been generated. An example of an autoregressive model, Google DeepMind's WaveNet, is shown below. This glosses over the details of the transformer and simplifies the whole process, but it's good enough to understand the rest of the model. In my opinion, this is the most important part of the paper, but it was not released.

  4. When creating example images post-training, generate 512 images and rank them by "closeness" to the text using CLIP. CLIP basically embeds images and text into a joint space where related words are close to related images. So given an image-text pair, it can produce a distance that measures how close they are.

    From [1]

    It can also be used to classify an image by giving it a large set of text labels and finding the closest one. 
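The token bookkeeping in the steps above can be sketched with plain numbers. This is shapes-only arithmetic, no real model; the sizes all come from the Dall-e paper:

```python
# Token-level bookkeeping for the Dall-e pipeline described above.
# Shapes only -- no real model. All sizes are from the paper.

IMAGE_SIZE = 256          # input images are 256x256 RGB
GRID = 32                 # dVAE compresses each image to a 32x32 token grid
VOCAB_IMAGE = 8192        # each image token takes one of 8192 values
MAX_TEXT_TOKENS = 256     # up to 256 BPE text tokens per caption

image_tokens = GRID * GRID                        # 1024 tokens per image
sequence_length = MAX_TEXT_TOKENS + image_tokens  # 1280 tokens total

# Rough compression factor vs. raw pixels: 8 bits per channel vs.
# 13 bits per token (since 2**13 = 8192)
raw_bits = IMAGE_SIZE * IMAGE_SIZE * 3 * 8
token_bits = image_tokens * 13
print(sequence_length)          # 1280
print(raw_bits // token_bits)   # ~118x fewer bits
```

This is why the dVAE stage matters: the transformer only has to model a 1280-token sequence instead of 196,608 raw pixel values.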


OpenAI released step 1 of Dall-e (the dVAE) and step 4 (CLIP). Using the notebook I linked above, we can experiment with the dVAE by using CLIP to steer its image tokens. Given an input phrase and a random initialization of an image, we iteratively move the dVAE image embedding so that it gets closer and closer to the given text according to CLIP.
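That steering loop is just gradient-based optimization of an embedding against a similarity score. Here's a toy sketch of the idea, where a simple dot product stands in for CLIP (the real notebook backpropagates through CLIP and the dVAE decoder, which is much heavier):

```python
# Toy sketch of CLIP-guided steering: nudge a latent vector so that its
# (stand-in) similarity to a target embedding increases each step.
# "clip_similarity" here is just a dot product for illustration.

def clip_similarity(latent, target):
    return sum(a * b for a, b in zip(latent, target))

def steer(latent, target, steps=100, lr=0.05):
    for _ in range(steps):
        # gradient of the dot product w.r.t. the latent is just `target`
        latent = [a + lr * b for a, b in zip(latent, target)]
    return latent

target = [1.0, -0.5, 0.25]   # pretend text embedding
latent = [0.0, 0.0, 0.0]     # random-ish init
before = clip_similarity(latent, target)
latent = steer(latent, target)
after = clip_similarity(latent, target)
assert after > before        # similarity improves with each step
```

The notebook does the same thing conceptually, except "similarity" is CLIP's image-text score and the latent is the dVAE image embedding, decoded to pixels on every step.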

My first experiment was the same one they tried in the blog post with full Dall-e: "an armchair in the shape of an avocado." Granted, this isn't nearly as good as the full Dall-e method, but it's still very cool! You can see elements of both armchairs and avocados in it. I also tried "a cube made of rainbow".

I noticed the cube image seems to contain shipping crates, and that got me thinking about what else the model could accidentally capture. My first question was whether copyrighted material could be reproduced by either Dall-e or this CLIP-based dVAE steering. I noticed it starts breaking down and just generating Deep Dream-like features when given single words, so I started trying different cartoon TV shows to see what was generated (thanks to Cusuh and Andrew for the suggestions on what shows to try).

Copyrighted Results

Below you can see the results; under each result is the show involved (they are all from the late 90s/early 2000s). I've blurred out the show name under each one so you can guess before clicking to reveal.

Pokemon (mildly NSFW) Results

You can see that for all of these, it generates features specific to these copyrighted TV shows/products. The outputs aren't exactly clear images, but given the quality of the full Dall-e results, it seems plausible that Dall-e could recreate close knock-offs of copyrighted images. That pattern held for all of the shows above, but pokemon turned out very differently. I had Pokemon Go on my phone recently, so I started by just typing in "pokemon", and the results were not what I expected. I've displayed them below, blurred (click to unblur).

When given input "pokemon"

A second run of pokemon for 100 steps

A second run of pokemon for 600 steps

A second run of pokemon for 4000 steps

A third run given input "pokemon"

Next I tried different types of pokemon. These were less repeatable than the plain "pokemon" input, and some were fine. However, I did get the following as well.

When given input "squirtle"

When given input "weedle"

The same thing happens with digimon, so I'm curious whether pokemon and digimon are embedded close together in CLIP, with pornography somehow also very close to both. There may be other odd combinations that I just haven't stumbled across yet.
Testing these images as input to CLIP with the ImageNet labels for zero-shot classification, they are frequently classified as below (these are from the ImageNet 1000-class labels):

skunk, polecat, wood pussy: 13.15%
          banana: 9.81%
          nipple: 6.69%
              ox: 4.05%
     triceratops: 3.59%

If I add pornography as a label, they are classified as pornography.

porn: 16.16%
skunk, polecat, wood pussy: 11.02%
          banana: 8.23%
          nipple: 5.61%
              ox: 3.39%

If I also add pokemon as a label, they are overwhelmingly classified as pokemon.

pokemon: 100.00%
            porn: 0.00%
skunk, polecat, wood pussy: 0.00%
          banana: 0.00%
          nipple: 0.00%

Because the Dall-e dVAE is generating these images so well, I suspect there was a decent amount of nudity in its training set. Given CLIP's preference for these features and its own knowledge of the label, I suspect it was trained on this data as well. I do think these are both really interesting and impressive models, but it's definitely important that we look into these unusual connections and why they might be happening. You definitely wouldn't want a stock-photo generation method producing nudity when you ask for pokemon (assuming you also had permission to use the likeness of pokemon).
In their paper, they don't discuss how they scraped images from the internet or whether any filters were put in place, so it's unknown what was done, but likely more should have been. We don't currently have rules on training generative models on copyrighted data, but maybe that's also worth considering as these methods get better and better. With larger and better generative models, replicating the style or content of an artist becomes more and more feasible. I think that could be a future problem for content creators.

If you are skeptical of these results, check out the notebook yourself and try! I did not cherry-pick these examples; it happened pretty much every time for me. I will say that the charmander it created for me was fine. I'm hopeful Blogger won't mark this post as adult material given the images above.

[1]  Radford, Alec, et al. "Learning transferable visual models from natural language supervision." arXiv preprint arXiv:2103.00020 (2021).
[2]  Ramesh, Aditya, et al. "Zero-Shot Text-to-Image Generation." arXiv preprint arXiv:2102.12092 (2021).
[3]  van den Oord, AƤron, Oriol Vinyals, and Koray Kavukcuoglu. "Neural Discrete Representation Learning." NIPS. 2017.

My opinions are my own and do not represent the views of my employer.

Sample Code

To evaluate CLIP with ImageNet labels on the images generated by the notebook, I used the following code, adapted from the examples in the CLIP GitHub repository:

import os
import clip
import torch
import numpy as np
import ast
from PIL import Image 

# Load the model
model, preprocess = clip.load('ViT-B/32', 'cpu', jit=True)

with open('imagenet_labels.txt') as f:
    data = f.read()

# reconstruct the label data as a dictionary
imagenet_labels = ast.literal_eval(data)

device = 'cpu'
text_inputs = torch.cat([clip.tokenize(c) for c in list(imagenet_labels.values())]).to(device)

img = Image.open('../Pictures/dalle/pokemon_100.png')
image_input = preprocess(img).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{imagenet_labels[index.item()]:>16s}: {100 * value.item():.2f}%")

Tuesday, February 18, 2020

Rotatable Screen Mirroring iPhones with a Raspberry Pi and Airplay Clone

Recently, I installed a TV on an articulating mount in the kitchen for both tv and recipes (for those interested, I used this mount).

I wanted to be able to screen mirror my phone for recipes and shows in both landscape and portrait orientations. Unfortunately, nothing currently allowed for that.
Luckily, I found an excellent open-source AirPlay clone called RPiPlay.
It didn't yet have rotation functionality, but I spent some time implementing it and adding it on my own branch here:

With the code above, rpiplay can be started normally for landscape or rotated with rpiplay -r 90.

I also went ahead and made that code a system service and created a node web button to start the system in rotated or normal display mode so that everything is headless. 

You access the node web button on your phone or computer with http://pi-ip-address:9000/ where there is a button for changing rotation.

Here is a video demonstrating the landscape normal orientation.

And here is another one demonstrating portrait mode with the TV rotated after I've rotated it with the rotate button on my phone.

Install info can be found in the install folder but for ease I have included an image below that is fully set up. Install it just like a normal raspberry pi image. Make sure you expand the filesystem and change the default password (raspberry) as ssh is enabled!

Feel free to contact me for any questions or if you have products or ideas you think I should try. Places you can find me

Consider donating here to support my tinkering habits.

Monday, January 27, 2020

Powering Google Nest Cameras with Power over Ethernet

Recently I embarked on replacing all of a friend's old cameras with some Nest cameras. Unfortunately, the old cameras were wired with Ethernet cable carrying custom 12V power, so that was the only wiring available. I love the Nest Outdoor Cameras, but I don't know why they don't have a PoE option. I didn't want to rewire parts of the house nor have the unsightly Nest power cord.

I decided to make my own and figured I would share the details for others who want to avoid the power cord (Note: this will probably void your Nest cam warranty).

Above is what the original camera wiring looked like. Two wires for power (12V) and two for data.

You might think, okay, just change the power supply on the other side to 5V and call it a day. Unfortunately, Ethernet wire doesn't carry 5V very well over distance: when a load is applied (1.5A for the Nest), the voltage drops too low and the Nest cam doesn't work.
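To put a rough number on that drop, Ohm's law is enough. The wire gauge and run length below are assumptions for illustration (0.084 Ω/m is a typical figure for one 24 AWG copper conductor):

```python
# Rough voltage-drop estimate for power over Cat5, using Ohm's law.
# Gauge and run length are assumptions for illustration only.

OHMS_PER_METER = 0.084   # one 24 AWG conductor, typical figure
RUN_METERS = 20          # assumed cable run
LOAD_AMPS = 1.5          # Nest cam draw quoted above

round_trip_ohms = 2 * RUN_METERS * OHMS_PER_METER   # out and back
drop_volts = LOAD_AMPS * round_trip_ohms

print(f"drop: {drop_volts:.2f} V")
# A 5 V supply would deliver almost nothing at the camera end,
# while 12 V still leaves ~7 V -- enough headroom for a 5 V regulator.
```

This is exactly why sending 12V down the wire and regulating at the camera end works where a straight 5V supply doesn't.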

So at first I figured, let's just throw a voltage regulator on there and it should be fine (I used an LM7805). This is shown below. However, it was getting pretty hot, and I was worried about it not lasting for several years.
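The heat is predictable: a linear regulator dissipates the entire input-output voltage difference times the load current. A quick sanity check with the numbers from this post (the ~7 V figure is an assumed input after some voltage is lost in the wire run):

```python
# Power dissipated by a linear regulator: P = (Vin - Vout) * I

# Worst case: the full 12 V arrives at the regulator input
P_worst = (12 - 5) * 1.5      # far too much for a bare LM7805

# More realistic: some voltage already lost in the wire (assumed ~7 V in)
P_typical = (7 - 5) * 1.5     # still enough to need a heat sink

print(P_worst, P_typical)
```

Either way the regulator is burning watts continuously, which is why the heat-sinked LM317 module below is the safer long-term choice.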

I ended up finding the perfect circuit for this on Amazon here.
It's an LM317 adjustable linear regulator power supply with a heat sink and everything you need already included. To prep them, I just rotated the adjustment knob on each until the output read 5V.
Finally, I got some female USB ports here and soldered them to the output of the power supply module as shown below.

Then all I had to do was plug the Ethernet wires carrying the 12V power into the input of each supply, put it in a small enclosure, plug the Nest cam in, and push the Nest cord up into the camera hole.

Here is the final output: a nice, clean Ethernet-based Nest solution. Hopefully Nest comes out with some sort of PoE solution, but if you are looking to do this before then, the links above should be all you need to replicate it. This should work for any supply over 9V, up to the voltage regulator's input limit, so you aren't locked into a 12V-only solution like mine.

Feel free to contact me for any questions or if you have products or ideas you think I should try. Places you can find me

Consider donating here to support my tinkering habits.

Monday, March 26, 2018

Computer Vision Could Have Avoided Fatal Uber Crash

Uber's autonomous car hit and killed a pedestrian this week. You can see the video footage previous to the wreck here.

There have been a lot of articles talking about how the driver could have avoided the accident. I'm instead going to take a look at how state-of-the-art computer vision methods could have avoided it.

With autonomous car systems, we are going to need to have backups of backups in order to prevent these kinds of accidents. Uber has not yet discussed why the LiDAR or depth sensors failed to prevent this. Velodyne has released a statement saying it wasn't a failure of their hardware (which is most likely true from my experience with their technology).

But even not considering specialized sensors, just looking at the video footage, it's clear something wrong happened here.

Just using the video footage supplied, I ran some frames through a state-of-the-art neural network called Mask R-CNN trained on the COCO dataset. Note, this isn't an autonomous driving dataset, but it does contain people, cars, and bicycles, so it is relevant. Below is the algorithm's output on some frames preceding the accident (more than a second before).

These images bring up some interesting questions for Uber. If the Velodyne LiDAR should have caught this and the computer vision system should have caught this, then why did it happen?
This implies there was most likely an avoidable bug or failure in Uber's software stack, and it caused someone's death.

This is not a post to say autonomous driving is bad or that we shouldn't pursue it; we most certainly should. This is just the first case of an easily avoidable death. We as researchers and programmers need to be careful about testing these methods and having many backups to prevent easy edge cases. Companies pushing this technology should be even more robust and careful in testing many edge cases and being confident in their software.

Places you can find me

Tuesday, May 12, 2015

Getting your fridge to order food for you with a RPi camera and a hacked up Instacart API

This is a detailed post on how to get your fridge to autonomously order fruit for you when you are low. An RPi takes a picture every day and detects whether you have fruit using my Caffe web query code. If your fridge is low on fruit, it orders fruit using Instacart, which is then delivered to your house. You can find the code with a walkthrough here:

Some of my posts are things I end up using every day and some are proof of concepts that I think are interesting. This is one of the latter. When I was younger, I heard an urban legend that Bill Gates had a fridge that ordered food for him and delivered it same-day whenever he was low. That story always intrigued me and I finally decided to implement a proof of concept of it. Below is how I set about doing this.

Hacking up an Instacart API

The first thing we need is a service that picks out food and delivers it to you. There are many of these, but as I live in Atlanta, I chose Instacart. Now we need an API. Unfortunately, Instacart doesn't provide one, so we will need to make our own. 

Head over to the Instacart site, set up an account, and log in. Then right click and view source. You are looking for a line in the source like this:

That string is what you need to access your instacart account. Open up a terminal and type:

You should get back a response that looks like this:
{"checkout_state":{"workflow_state":"shopping"},"items":{"1069829":{"created_at":1.409336316211E9,"qty":1,"user_id":YOUR_USER_ID}},"users":{"-JXAzAp6rgtM4u2dV2tI":{"id":YOUR_USER_ID,"name":"StevenH"},"-Jj2_kFsu5hvZRhx4KX1":{"id":YOUR_USER_ID,"name":"Steven H"},"-Jp8VvDusSDOyEiJ0J5D":{"id":YOUR_USER_ID,"name":"Steven H"}}}

Now we just need to figure out what the different item IDs are. Pick a store, start adding items to your cart, and run the same command. If I add some fruit (oranges, bananas, strawberries, pears) to my cart and run the same curl request, I get something like this:
{"checkout_state":{"workflow_state":"shopping"},"items":{"1069829":{"created_at":1.409336316211E9,"qty":1,"user_id":YOUR_USER_ID},"8182033":{"created_at":1.431448385824E9,"qty":2,"user_id":YOUR_USER_ID},"8583398":{"created_at":1.431448413452E9,"qty":3,"user_id":YOUR_USER_ID},"8585519":{"created_at":1.431448355207E9,"qty":3,"user_id":YOUR_USER_ID},"8601780":{"created_at":1.424915467829E9,"qty":3,"user_id":YOUR_USER_ID},"8602830":{"created_at":1.43144840911E9,"qty":1,"user_id":YOUR_USER_ID}},"users":{"-JXAzAp6rgtM4u2dV2tI":{"id":22232545,"name":"StevenH"},"-Jj2_kFsu5hvZRhx4KX1":{"id":YOUR_USER_ID,"name":"Steven H"},"-Jp8VvDusSDOyEiJ0J5D":{"id":YOUR_USER_ID,"name":"Steven H"}}}
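Once you have a response like that, pulling the item IDs and quantities out of it is straightforward. A minimal sketch (the JSON below is abbreviated from the response above, with a placeholder numeric user id):

```python
import json

# Abbreviated version of the cart response shown above,
# with a placeholder user id standing in for the real one.
response = '''{"checkout_state":{"workflow_state":"shopping"},
"items":{"8182033":{"created_at":1.431448385824E9,"qty":2,"user_id":12345},
"8583398":{"created_at":1.431448413452E9,"qty":3,"user_id":12345}}}'''

cart = json.loads(response)
# Map each item id to its quantity in the cart
items = {item_id: entry["qty"] for item_id, entry in cart["items"].items()}
print(items)   # {'8182033': 2, '8583398': 3}
```

Diffing the item map before and after adding something to your cart in the browser is an easy way to discover the ID for a given product.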

Now empty your cart and we will make sure we can add all those things to your cart with a curl request. Take your response from earlier, and use it in the following line:

Now, your cart should be full of fruit again. Now we just need a way to recognize whether your fridge has fruit or not.

Detecting fruit in your fridge

(P.S. Those of you wanting to learn more about Deep Learning, check out this book:
For this we just need a Raspberry Pi 2 Model B Project Board - 1GB RAM - 900 MHz Quad-Core CPU and a Raspberry PI 5MP Camera Board Module.
Set up your camera following these instructions and you will be ready to go. Set up your camera module in your fridge (or wherever you store your fruit).

We are going to use the Caffe framework for recognizing whether fruit is in the refrigerator drawer or not. You can read about how to do that here.
We are going to set this up similarly. Run the following commands to set things up:

git clone
sudo apt-get install python python-pycurl python-lxml python-pip
sudo pip install grab
sudo apt-get install apache2
mkdir -p /dev/shm/images
sudo ln -s /dev/shm/images /var/www/images

Then you must forward port 5005 on your router to port 80 on the Pi.
Now you can edit with your info and run ./
Or add the following line to cron with crontab -e:
00 17 * * * /home/pi/AutonomousFridge/

This script takes a picture with raspistill and puts it in a symlinked directory in memory accessible from port 80. Then it sends that URL to the Caffe web demo and gets the result.
The Caffe demo shows how well it classifies the existence of fruit as shown below:

The end result is a script that runs every day at 5 pm. When your fridge doesn't have fruit, it adds a bunch of fruit to your Instacart cart. You can order it at your leisure to make sure you are home when it arrives. You could also use my PiAUISuite to have it text you about your fruit status. It can be a lot of fun to make a proof of concept of an old urban legend.

Consider donating to further my tinkering since I do all this and help people out for free.

Places you can find me

Thursday, April 23, 2015

RPi Videolooper Not booting: blinking cursor bug fix

VideoLooper 4 (bug fix)!!

Wanted to apologize to everyone for the blinking cursor bug in the newest videolooper. I introduced it without realizing it by over-aggressively shrinking the partition to ease the download. If you have that bug, you can download the newest version below, which fixes it.

Alternatively, you can do the following (Thanks to Anthony Calvano for this) :
SSH in or press Windows key + R at the blinking menu, then you can extend the partition using the directions at "Manually resizing the SD card on Raspberry Pi" located at

This image is compatible with the A, B, B+, and B2 versions.

I have a brand new version of the Raspberry Pi Videolooper that is compatible with the new B V2 and has a bunch of new features that streamline it for easy use.
It can now loop one video seamlessly (without audio though) thanks to a solution from the talented individual over at (link here). And again thanks to Tim Schwartz as well (link here).

You can download the new image here:!411&authkey=!AGW37ozZuaeyjDw&ithint=file%2czip


For help you can post on the Raspberry Pi subreddit (probably the best way to get fast help) or email me (be forewarned, I respond intermittently and sporadically)

How to set up the looper

  1. Copy this image to an SD card following these directions
  2. If you want to use USB, change usb=0 to usb=1 in looperconfig.txt on the SD card (It is in the boot partition which can be read by Windows and Mac).
  3. If you want to disable the looping autostart to make copying files easier, change autostart=1 to autostart=0 in looperconfig.txt
  4. If you want to change the audio source to 3.5 mm, change audio_source=hdmi to audio_source=local in looperconfig.txt.
  5. If you want to play a seamless video (supports only one for now), convert it according to these directions, put it in the videos folder, and then change seamless=0 to seamless=name-of-your-video.h264 in looperconfig.txt. (NOTE: This video won't have audio so take that into account).
  6. You may also want to expand your filesystem to fit your SD card by using sudo raspi-config as detailed here:
  7. If you aren't using a USB (NTFS) put your video files in the /home/pi/videos directory with SFTP or by turning autostart off. Otherwise, put your video files in a directory named videos on the root directory of your USB.
  8. Set your config options and plug it in!
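Putting the options above together, an example looperconfig.txt might look like this (the flag names are from the steps above; the values are one possible setup, not necessarily the defaults):

```
# looperconfig.txt -- lives in the boot partition (readable from Windows/Mac)
autostart=1
usb=0
audio_source=hdmi
seamless=0
```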


  • NEW: Has an audio_source flag in the config file (audio_source=hdmi,audio_source=local)
  • NEW: Has a seamless flag in the config file (seamless=0,seamless=some-file.h264)
  • NEW: Has a new boot up splash screen
  • NEW: Compatible with the RPi B2 (1 GB RAM version)
  • NEW: Updated all packages (no heartbleed vulnerability, new omxplayer version)
  • Has a config file in the boot directory (looperconfig.txt)
  • Has a autostart flag in the config file (autostart=0,autostart=1)
  • Has a USB flag in the config file (usb=0,usb=1), just set usb=1, then plug a USB (NTFS) with a videos folder on it and boot
  • Only requires 4GB SD card and has a smaller zipped download file
  • Supports all raspberry pi video types (mp4, avi, mkv, mp3, mov, mpg, flv, m4v)
  • Supports subtitles (just put the srt file in the same directory as the videos)
  • Reduces time between videos
  • Allows spaces and special characters in the filename
  • Full screen with a black background and no flicker
  • SSH automatically enabled with user:pi and password:raspberry
  • Allows easy video conversion using ffmpeg (ffmpeg -i INFILE -sameq OUTFILE)
  • Has a default of HDMI audio output with one quick file change (replace -o hdmi with -o local in
  • Can support external HDDs and other directories easily with one quick file change (Change FILES=/home/pi/videos/ to FILES=/YOUR DIRECTORY/ in

Source code

The source code can be found on github here

This is perfect if you are working on a museum or school exhibit. Don't spend a lot of money and energy on a PC running Windows and have problems like below (courtesy of the Atlanta Aquarium)!

If you are a museum or other educationally based program and need help, you can post on the Raspberry Pi subreddit (probably the best way to get fast help) or contact me by e-mail at

Consider donating to further my tinkering since I do all this and help people out for free.

Places you can find me

Tuesday, March 31, 2015

Classifying everything using your RPi Camera: Deep Learning with the Pi

(P.S. Those of you wanting to learn more about Deep Learning, check out this book:

For those who don't want to read, the code can be found on my github with a readme:
You can also read about it on my Hackaday io page here.

What is object classification?

Object classification has been a very popular topic the past couple of years. Given an image, we want a computer to be able to tell us what that image is showing. The newest trend has been using convolutional neural networks trained with large amounts of data.

One of the bigger frameworks for this is the Caffe framework. For more on this see the Caffe home page.
You can test out their web demo here. It isn't great at people, but it is very good at cats, dogs, objects, and activities.

Why is this useful?

There are all kinds of autonomous tasks you can do with the RPi camera. Perhaps you want to know if your dog is in your living room, so the Pi can take his/her picture or tell him/her they are a good dog. Perhaps you want your RPi to recognize whether there is fruit in your fruit drawer so it can order you more when it is empty. The possibilities are endless.

How do convolutional neural networks work (a VERY simple overview)?

Convolutional neural networks are based loosely off how the human brain works. They are built of layers of many neurons that are "activated" by certain inputs. The input layer is connected in a network through a series of interconnected neurons in hidden layers like so:

Each neuron sends its signal to every neuron it is connected to; each signal is multiplied by the connection weight, and the weighted sum is run through a sigmoid function. The network is trained by changing the weights to minimize an error function over a set of inputs with known outputs, using backpropagation.
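A single neuron from that description looks like this as a toy sketch (the weights and inputs are made up for illustration):

```python
import math

# One neuron as described above: weighted sum of inputs, then a sigmoid.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias=0.0):
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(total)

# Made-up inputs and weights; output is always squashed into (0, 1)
out = neuron([1.0, 0.5], [0.4, -0.2], bias=0.1)
print(round(out, 3))
```

A full network is just many of these stacked in layers, and training adjusts the weights and biases via backpropagation.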

How do we get this on the Pi?

Well, I went ahead and compiled Caffe on the RPi. Unfortunately, since there is no code to optimize the network with the Pi's GPU, classification takes ~20-25s per image, which is far too slow.
Note: I did find a different optimized CNN network for the RPi by Pete Warden here. It looks great, but it still takes about 3 seconds per image, which still doesn't seem fast enough.

You will also need the Raspberry Pi camera which you can get from here:
Raspberry PI 5MP Camera Board Module

A better option: Using the web demo with python

So instead, we can take advantage of the Caffe web demo to cut the processing time dramatically. With this method, image classification takes ~1.5s, which is usable for a system.

How does the code work?

We make a symbolic link from /dev/shm/images/ to our /var/www for Apache and forward router port 5005 to port 80 on the Pi.
Then we use raspistill to take an image and save it to memory as /dev/shm/images/test.jpg. Since this is symlinked in /var/www, we should be able to see it at http://YOUR-EXTERNAL-IP:5005/images/test.jpg.
Then we use grab to pull up the Caffe demo framework with our image and get the classification results. This is done in the query script, which gets the results.

What does the output look like?

Given a picture of some of my Pi components, I get this, which is pretty accurate:

Where can I get the code?


Consider donating to further my tinkering since I do all this and help people out for free.

Places you can find me