It’s been a while since a particular piece of technology has felt as fraught, efficient, and consequential as today’s text-to-image AI art-generation tools like DALL-E 2 and Midjourney. There are two reasons for this. The first is that the tools continue to grow in popularity because they are fairly easy to use and they do something cool: conjure almost any image you can dream up. When a text prompt comes to life as you envisioned (or better than you envisioned), it feels a bit like magic, and when technologies feel like magic, adoption rates tick up rapidly. The second (and more important) reason is that the AI art tools are evolving quickly, often faster than the moral and ethical debates around the technology. I find this concerning.
In just the last month, the company Stability AI released Stable Diffusion: a free, open-source tool trained on three massive datasets, which include more than 2 billion images. Unlike DALL-E or Midjourney, Stable Diffusion has none of the content guardrails to stop people from creating potentially problematic imagery (incorporating trademarked images, or sexual or potentially violent and abusive content). In short order, a subset of Stable Diffusion users generated loads of deepfake-style images of nude celebrities, prompting Reddit to ban multiple NSFW Stable Diffusion communities. But because Stable Diffusion is open source, the tool has also been credited with an “explosion of innovation,” specifically in regard to people using the tool to generate images from other images. Stability AI is pushing to release AI tools for audio and video soon, as well.
I’ve written twice about these debates and even found myself briefly near the center of them. Then and now, my biggest concern is that the datasets these tools are trained on are full of images that have been haphazardly scraped from across the internet, mostly without the artists’ permission. Making matters worse, the companies behind these technologies aren’t forthcoming about what raw materials are powering their models.
Thankfully, last week the programmer/bloggers Andy Baio and Simon Willison pulled back the curtain, if only a bit. Because the data used to train Stable Diffusion is openly available, the pair extracted data for 12 million images from part of Stable Diffusion’s dataset and made a browsing tool that allows anyone to search through those images. While Baio and Willison’s dataset is just a small fraction of Stable Diffusion’s full set of 2.3 billion images, there’s a great deal we can learn from this partial glimpse. In a very helpful blog post, Baio notes that “nearly half of the images” in their dataset came from only 100 domains. The largest number of images came from Pinterest, over a million in all. Other prominent image sources include shopping sites, stock-image sites, and user-generated-content sites like WordPress, Flickr, and DeviantArt.
Baio and Willison’s dataset tool also lets you sort by artist. Here’s one of Baio’s findings:
Of the top 25 artists in the dataset, only three are still living: Phil Koch, Erin Hanson, and Steve Henderson. The most frequent artist in the dataset? The Painter of Light™ himself, Thomas Kinkade, with 9,268 images.
Once the tool went public, I watched artists on Twitter share their search findings. Many of them remarked that they had found a few examples of their own work, which had been collected and incorporated into Stable Diffusion’s dataset without their knowledge—possibly because a third party had shared them on a site like Pinterest. Others remarked that there was a wealth of copyrighted material in the dataset, enough to conclude that the AI art–ethics debate will almost certainly get tied up in the legal system at some point.
I’ve spent hours searching through Baio and Willison’s tool, and it’s an odd experience that feels a bit like poking around the backstage of the internet. There is, of course, a wealth of NSFW content, and photos of lots and lots of female celebrities. But what stands out the most is just how random the collection is. It’s not really organized; it’s just a massive collection of images with robotic text descriptions. That’s because Stable Diffusion is based on enormous datasets collected by a whole other organization: a nonprofit called LAION. And here’s where things get dicey.
As Baio notes in his post, LAION’s computing power was funded in large part by Stability AI. To complicate matters, LAION’s datasets are compiled from data gathered by Common Crawl, a nonprofit that scrapes billions of web pages every month to aid in academic web research. Put another way: Stability AI helped fund a nonprofit and gained access to a dataset that is largely compiled from a different academic nonprofit’s dataset, in order to build out a commercial tech product that takes billions of randomly gathered images (many of them the creative work of artists) and turns them into art that may be used to replace those artists’ traditionally commissioned artwork. (Gah!)
Looking through Baio and Willison’s dataset made me feel even more conflicted about this technology. I know that Baio shares a lot of my concerns as well, so I reached out to him to talk a bit about the project and what he learned. He said that, like me, he’s been fascinated by the way that the AI-art debate was immediately drafted into the internet’s long-running culture wars.
“You see the techno-utopians taking the most generous interpretation of what’s going on, and then you have other communities of artists taking the least-generous interpretation of these tools,” Baio said. “I attribute that to the fact that these tools are opaque. There’s a vacuum of information about how these tools are made, and people fill that vacuum with feelings. So we’re trying to let them see what’s in there.”
Baio told me that poking through the dataset gave him some clarity about where these images came from, including how the dataset is the product of tools initially built for academic use. “This scraping process makes things difficult,” he said. “If you’re an artist, you can stop Common Crawl from scraping your site. But so many of those images are coming from sites like Pinterest, where other people upload the content. It’s not clear how an artist could stop Common Crawl from scraping Pinterest.”
The more Baio has learned about the dataset, the thornier the potential legal, ethical, and moral questions grow. “People can’t agree on what is fair use in the best of times,” he said. “You have judges in different circuits arguing and interpreting it differently.”