Starting Out with Stable Diffusion
The goal here is to get dangerous enough to create your own Stable Diffusion results that match what you expect and don’t look like a crude fingerpainting made by an AI that cost $600,000+ USD to train.
This past week, I’ve put a lot of hours into working with Stable Diffusion on my home PC. I’m sharing my initial learnings and thoughts here, and I’ll ponder what’s next in a subsequent post. Honestly, if I were writing this as a book, it would be well over three chapters on just getting started.
I’m describing a local Stable Diffusion install on your PC, which should work if you have a reasonable graphics card. There are non-local alternatives:
- A plethora of Google Colab notebooks, each with a variety of features. You can hook some of these up to your Google Drive and they will drop images right there. Useful! You get Google cloud GPU time for free up to a point, then you’ll get cut off (that happened to me, and I switched to a local install).
- DreamStudio — you have a limited number of tokens to spend, but it’s a way to try things out once you have some good prompt-craft down (see below). As of now, it also has v1.5 of the model available, whereas the local install uses the v1.4 model.
Local Stable Diffusion
To get set up, I simply followed the instructions here: — VOLDY RETARD GUIDE — (rentry.org). And yeah, it’s mildly technical. But pretty much everything just works, and works really well, once you can run the webui.bat batch file. Lots of large downloads are involved, so be patient.
If you follow the guide above (or whatever the latest guide you find is), you will end up with a directory of source code files and an AI model trained on a wide variety of internet images. The first time you run webui.bat in that directory it will take a long time, but eventually a local web server will be running, which you can connect to in your browser (e.g., http://localhost:7860). Subsequent runs of webui.bat will get that local web server up much faster.
Okay, there it is. In your browser. Now fricking what? You can type something into that prompt, but if you’re like me, your first results will be exceptionally disappointing, like vibrant yet unimpressive finger paintings. “What’s the big deal? This sucks.” That was my first reaction. $600,000+ for this?
That’s because I couldn’t prompt well. The first step with Stable Diffusion, or any other AI art generator out there right now, is prompting well. And for me, that started by ripping off a bunch of nice prompts I found, playing around with them, and learning through play.
Get Prompting
The nicest place I’ve found for prompts is Lexica — you can either scroll until you find something you like and click on it, or type in a search and then click on a result you like.
(Oh, and, ah, at this point, well, you might have already discovered: Stable Diffusion’s model is stuffed full of the unfiltered internet. When you’re running it locally like this it’s not protecting you from anything that might be NSFW. Likewise for a site like Lexica. It’s not rife with fleshy-delights but they do pop up from time to time. FYI.)
So. You find something that you like, click on it, and you’re going to find the following important attributes:
- Prompt
- Seed
- Guidance scale
- Dimensions
If you copy the prompt, set the CFG scale to match the guidance scale, paste in the seed, and set the matching dimensions, you’re close to reproducing that same image by clicking “Generate” — close except for setting the sampling steps value and choosing the sampler. Try PLMS as the sampler and 50 as your sampling steps.
Here’s an example:
Prompt:
Drunken scarlett johansson as delirium from sandman, ( hallucinating colorful soap bubbles ), by jeremy mann, by sandra chevrier, by dave mckean and richard avedon and maciej kuciara, punk rock, tank girl, high detailed, one green eye and one blue eye, 8k
Sampling Steps: 50
Sampling Method: PLMS
CFG Scale: 7
Seed: 653151211
Size: 512 x 512
Hit Generate. And the first time, you might need to wait a while. If you run out of graphics card memory, first try reducing the dimensions. If your problems persist, read up on running with less GPU memory. I have 12GB of graphics card memory and cannot go beyond 512 x 512.
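By the way, if you’d rather drive this from a script than from the WebUI, the same settings map fairly directly onto the Hugging Face diffusers library. What follows is only a sketch of that mapping, not the WebUI’s own code: the model ID, the use of the PNDM scheduler as the nearest stand-in for the PLMS sampler, and float16 to save memory are my assumptions, and the output won’t match the WebUI pixel-for-pixel because the two generate their starting noise differently.

```python
# Sketch: the worked example's settings expressed with the diffusers library.
# Not the WebUI's code path; model ID and scheduler choice are assumptions.
import torch
from diffusers import StableDiffusionPipeline, PNDMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # v1.4 weights, like the local install
    torch_dtype=torch.float16,         # halves VRAM use on most cards
).to("cuda")
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)  # closest to PLMS

prompt = ("Drunken scarlett johansson as delirium from sandman, "
          "( hallucinating colorful soap bubbles ), by jeremy mann, by sandra chevrier, "
          "by dave mckean and richard avedon and maciej kuciara, punk rock, tank girl, "
          "high detailed, one green eye and one blue eye, 8k")

image = pipe(
    prompt,
    num_inference_steps=50,   # Sampling Steps
    guidance_scale=7,         # CFG Scale
    height=512, width=512,    # Size
    generator=torch.Generator("cuda").manual_seed(653151211),  # Seed
).images[0]
image.save("delirium.png")
```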
I like Scarlett Johansson and I like The Sandman (Delirium especially). Here’s what I got:
After you get your first image output, you can try different results by increasing the sampling steps value. The image might change radically as you move through the range, and it should generally get better — e.g., if you have hands that look like a bunch of bananas or arms popping out of a back, try moving the sampling steps around.
You can also reduce the steps. It will take less time and your results will be rougher and perhaps radically different.
You can move the CFG Scale value as well to see what lower and higher values change in your image.
Additionally, changing the dimensions of the image will change the result a good bit.
Finally, setting the seed to -1 puts it back to generating brand-new results. The seed determines the random starting noise that the AI steps through and decodes into an image, and -1 means pick a new random seed each time. Usually you’ll leave it random, unless you’re revising variations of a particular image you really like and want to keep working from that image’s seed.
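If you’re scripting, the same habit of holding the seed fixed while sweeping steps and CFG looks roughly like this. It’s a sketch that reuses the `pipe` and `prompt` objects from the earlier snippet; the particular step and CFG values are just ones I’d try.

```python
# Sketch: fixed seed, sweeping steps and CFG scale to see how each setting
# nudges the same composition. Assumes `pipe` and `prompt` from the earlier
# snippet; the output filenames are arbitrary.
import torch

SEED = 653151211
for steps in (20, 35, 50, 80):
    for cfg in (5, 7, 11):
        image = pipe(
            prompt,
            num_inference_steps=steps,
            guidance_scale=cfg,
            generator=torch.Generator("cuda").manual_seed(SEED),  # same starting noise every run
        ).images[0]
        image.save(f"sweep_steps{steps}_cfg{cfg}.png")
```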
Oh, and hands like bananas? Yeah, you’re going to see some horrific human body renderings with today’s capabilities. Faces are getting better, and there are other AIs to fix them up. Hands just usually look awful and twisted. Sometimes arms aren’t present. Sometimes three are where one should be. If you really love the composition but not the horror, try adjusting the steps while keeping the seed of the composition you like.
All your outputs are being saved in a subdirectory, so you can go back into them later.
Tip: if you select “Save”, your output is not only saved to a specific subdirectory (see Settings), but a .csv file in that same directory is also updated with the parameters used to make that particular image.
Those are the basics. The sampling methods produce subtly different results — “Euler a” and “DPM2 a” give results similar to each other, and the rest give results that differ from those two but are similar to one another.
The higher you set your sampling steps, the longer it will take. You can also try producing more than one image at a time, though I can’t do too many because my graphics card is middle of the road.
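On the scripting side, swapping samplers and producing a small batch looks roughly like the sketch below. The diffusers scheduler classes named here are what I believe to be the closest equivalents of “Euler a” and “DPM2 a”; it reuses `pipe` and `prompt` from earlier.

```python
# Sketch: swapping samplers and batching with diffusers. The scheduler classes
# are my best guesses at the WebUI's "Euler a" and "DPM2 a" options; `pipe`
# and `prompt` come from the earlier snippet.
from diffusers import EulerAncestralDiscreteScheduler, KDPM2AncestralDiscreteScheduler

pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)   # ~ "Euler a"
# pipe.scheduler = KDPM2AncestralDiscreteScheduler.from_config(pipe.scheduler.config) # ~ "DPM2 a"

images = pipe(
    prompt,
    num_inference_steps=30,
    num_images_per_prompt=4,  # a batch of four; drop or lower this if you run out of VRAM
).images
for i, im in enumerate(images):
    im.save(f"batch_{i}.png")
```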
So now you can paste in prompts and try creating your own results. Borrowed prompts will no doubt work far better than anything you’d write from scratch, because you don’t know all the magic incantations yet. Going through Lexica is a good place to start learning which parts of a prompt matter. The main thing you’ll notice is the list of artists present in a prompt; this directs the style of the art based on all of that artist’s work that was sampled in training. I’ve got more to say on this in a subsequent post.
Tip: if you ever want to recreate a picture in your txt2img-images output directory, all you have to do is drag the image into the “PNG Info” tab and it will decode the metadata about how it was created. Just get all that back into the txt2img tab and you’re set to recreate and adjust.
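If you’re curious what the “PNG Info” tab is actually reading, the WebUI tucks the generation settings into the PNG’s text metadata. Here’s a tiny sketch that peeks at it with Pillow; the file path is made up, and the exact key (commonly "parameters") may vary by WebUI version.

```python
# Sketch: reading the generation settings the WebUI embeds in its output PNGs.
# The path is hypothetical; "parameters" is the key my install appears to use.
from PIL import Image

img = Image.open("outputs/txt2img-images/00001-653151211.png")
print(img.info.get("parameters", img.info))  # prompt, steps, sampler, CFG, seed, size, ...
```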
So, this is the first set of steps I took to get some reasonably interesting images going. What fueled my experiments was making art and photos of celebrities, because that somehow doesn’t seem like something I should be able to do. But you can make an interesting portrait of any well-known person who’s in the model’s training data. In fact, if you scroll through Lexica you’ll no doubt see many renderings of famous people and celebrities.
Find your passion subject, and then start experimenting. No doubt there will be better and better references on crafting good prompts, but some things to keep in mind:
- This art generation model was trained on internet content. Knowing what it was trained on gives you a sense of what kinds of keywords you can use. For instance, training included Pinterest and Reddit. You can even refer to a subreddit in a prompt by putting “r/” in front of its name. So if you know of a subreddit with loads of images you want to base your creation on, push it in there.
- Knowing artists is important if you want to replicate an artist’s style. In practice, you’ll want to use a series of artists so that your results don’t look amateurish (see the prompt-assembly sketch after this list).
- The same is true for photographers if you’re trying to make a realistic image: list who the photo was shot by.
- There are more technical things you can add, but that will become obvious as you look at more prompts later.
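To make that structure concrete (subject first, then quality modifiers, then a run of “by <artist>” clauses), here’s a throwaway sketch that assembles a prompt from pieces. Every specific name in it is a placeholder, not a recommendation.

```python
# Sketch: assembling a prompt from subject + modifiers + a run of artists.
# All of the specific names here are placeholders.
subject = "portrait of a lighthouse keeper in a storm"
modifiers = ["highly detailed", "dramatic lighting", "8k"]
artists = ["rembrandt", "sandra chevrier", "dave mckean"]

prompt = ", ".join([subject, *modifiers, *(f"by {a}" for a in artists)])
print(prompt)
# portrait of a lighthouse keeper in a storm, highly detailed, dramatic lighting, 8k,
# by rembrandt, by sandra chevrier, by dave mckean
```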
Some reading and reference (it’s really good to go through this):
- Intro — How to get images that don’t suck: a Beginner/Intermediate Guide to Getting Cool Images from Stable Diffusion : StableDiffusion (reddit.com)
- Prompt craft Wiki: Stable Diffusion — InstallGentoo Wiki
- Artists — WIP list of artists for SD v1.4 (rentry.org) — SUPER USEFUL!
- Artists — kaikalii/stable-diffusion-artists: Curated list of artists for Stable Diffusion prompts (github.com)
- Artists — Compilation of 45 artists recognized by Stable Diffusion : StableDiffusion (reddit.com)
- Photo Generation — Quest for ultimate photorealism : StableDiffusion (reddit.com)
- Read the rest of where you started — — VOLDY RETARD GUIDE — (rentry.org)
- Keyword examples — MEGA — look under “good”
For the local Stable Diffusion WebUI itself (optional — sort of “when you are ready”):
Digging deeper into the model’s training data to find artists / styles (optional):
- Query the big huge training database to see if matching content is there for a keyword: https://rom1504.github.io/clip-retrieval/
- Find artists via Google Arts & Culture: https://artsandculture.google.com/
Well, that’s a big firehose of information. Again, have some compositions in mind that you’re passionate about and then you’ll tweak and revise and learn what works and what doesn’t.
Tip: the “CFG Scale” represents how much the AI is allowed to improvise (lower value) vs. how much it dang well better do exactly what you said (higher value).
Tip: if you end up with an image you love but it’s not quite right, copy the seed value, make sure all the other settings are the same, and then adjust the sampling steps and/or the CFG Scale. This could be an image you just generated or an older one you loaded into the “PNG Info” tab to see how it was made.
Tip: in addition to a prompt to say what you want, there’s a negative prompt to say what you don’t want. Use this to exclude anything that might be forcing itself into your render.
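For the script-minded, recent versions of the diffusers library expose the same idea as a negative_prompt argument. A minimal sketch, reusing `pipe` from earlier; the prompt and the excluded terms are just common examples, not a canonical list.

```python
# Sketch: a negative prompt pushes the sampler away from the listed terms.
# Assumes `pipe` from the earlier snippet; prompt and terms are placeholders.
image = pipe(
    "portrait photo of an astronaut on a beach, professional photograph",
    negative_prompt="blurry, extra fingers, deformed hands, watermark, text",
    num_inference_steps=50,
    guidance_scale=7,
).images[0]
image.save("astronaut.png")
```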
As you experiment with details, you might come across simple prompts that accomplish a lot. E.g., for photos, simply saying that it’s a professional photograph drops the need for sugar prompts like camera model and f-stop and all that. For instance, this prompt creates a pretty good photo (and it’s a bit of a mash-up again, to make it interesting):
Scarlett Johansson cosplay as Elsa of Arendelle, professional photograph
Bigger
Now then, 512 x 512 isn’t very big, and AI upscaling is the next step people have been using to get a screen-resolution version of what they want to keep. There’s an “Extras” tab in the WebUI for that, and for fixing the generated faces.
Tip: if you want to upscale an image you just generated, select “Send to Extras” and then switch to the Extras tab; it will be waiting there. Alternatively, you can load an image off of storage for resizing / face fixing.
The resize factor defaults to 2, so twice as big. There are AI upscalers you can try out that will keep the upsized image looking pleasant. Additionally, there are two face fixers (GFPGAN and CodeFormer) that try to smooth out any goofs in the generated face.
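If you want a feel for what the AI upscalers buy you, a plain non-AI 2x resize makes a handy baseline to compare against. The sketch below is just that baseline, done with Pillow’s Lanczos filter; it is not what the Extras tab does, and the file path is made up.

```python
# Sketch: a plain (non-AI) 2x upscale with Pillow, purely as a baseline for
# comparing against the WebUI's AI upscalers. Requires a reasonably recent Pillow.
from PIL import Image

img = Image.open("outputs/txt2img-images/delirium.png")  # hypothetical path
baseline = img.resize((img.width * 2, img.height * 2), Image.Resampling.LANCZOS)
baseline.save("delirium_2x_lanczos.png")
```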
Anytime you use one of these AI models for the first time, there will be a big delay while it downloads. Once loaded, it also consumes memory.
Experiment Like Crazy
The way I made the most progress this week was experimenting with prompts, trying new artist names, and seeing what changing the steps and CFG scale would do.
Some things end up amazing, especially images of people. Except for the occasional horror.
Other things are hard to create if they’re out at the edges of what the model has been trained on. Just because you can describe something doesn’t mean you’ll get anything quite like it.
The next piece I’m writing deals with the impact of Stable Diffusion and what’s next. Cheers.