You, the Multiverse of You, and Stable Diffusion
So in the last posting, I explored downloading new textual inversion embeddings that give you new styles (or new subjects) to expand what you can do in Stable Diffusion AI art generation. For instance, there is a style based on Jamie Hewlett, the artist behind the artwork for the band Gorillaz. When applied to my base Snow White, here are the before and after:
The results can be strong and impactful. Okay, it’s one thing to download new embedding styles and subjects created by someone else, but eventually any of us might say: I’m inspired, how do I make one? Say of a family member’s art style? Or, well, of myself! Blessedly there are not enough paparazzi in my life for Stable Diffusion’s model to know me, but can I get it to learn me? Yes. Here’s an image from Stable Diffusion of me holding a painting styled as if by my mother-in-law Sudi:
How did I get there? It sure wasn’t hard, but my desktop did have to cook on it for a while. And is it perfect? No, but for something that takes so little effort it amazes me how well it works.
I’m going to go through the steps for training a new style and then note how you can use similar steps to train a new subject, like rendering yourself in Stable Diffusion.
Training a Style
For all of this, I’m using the AUTOMATIC1111 Stable Diffusion WebUI running against a server on my desktop. I don’t have a powerhouse of a graphics card, but it’s good enough. I pretty much followed the given instructions for my first attempt at generating a painting style. The steps I followed (with the release that I have) include the following.
Preparation For Creating a Style
- Scan the pictures. I grabbed a series of Sudi’s paintings, put them on my multi-function-printer scanning bed, and then used Windows Fax and Scan to scan them as manually cropped PNGs (preview, set your crop, and then scan).
- You can just take pictures on your phone if you want; that should work just as well. On Android, for example, you can use something like PhotoScan by Google Photos (on Google Play). Crop later. Beware the skew.
- Copy them all into a directory by themselves; this directory will be processed shortly. (If you want to script some of that cleanup, see the sketch after this list.)
- Restart Stable Diffusion on your PC. I mention this because if you’ve been generating images, fixing faces, and/or upscaling photos, then your graphics card memory is full of AI models you’re going to need to flush out.
- Bring up Stable Diffusion and switch to the Textual Inversion tab.
- There are three steps you’re going to go through (crack those knuckles).
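Before diving into those steps: if your source folder ended up as a messy mix of scans and phone shots, a little script can tidy it into a clean training directory. This is just a sketch using Pillow, and the folder paths are placeholders for wherever you keep your images:

```python
# prep_training_images.py -- a small sketch (not part of the WebUI) that copies
# everything image-like from a raw folder into a clean training folder as RGB PNGs.
# The paths are placeholders; point them at your own directories.
from pathlib import Path
from PIL import Image

SOURCE = Path(r"C:\train\sudi-raw")    # wherever the scans/photos landed
DEST = Path(r"C:\train\sudi-clean")    # the directory you'll feed to Preprocess
DEST.mkdir(parents=True, exist_ok=True)

for path in SOURCE.iterdir():
    try:
        with Image.open(path) as img:
            # Normalize to RGB PNG so the training folder contains nothing but plain images.
            img.convert("RGB").save(DEST / (path.stem + ".png"))
    except OSError:
        print(f"Skipping non-image file: {path.name}")
```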
Step One — Create Your New Embedding
- Type in a name for your new embedding. I was making a painter’s style, so I typed in Sudi-Schmitt.
- Initialization text: I left this alone.
- Number of vectors: I set this to 6 so that it consumes six out of seventy-five tokens and has some strength in asserting itself in the prompt’s results.
- (Spoilers: I have regrets later about that six and I wonder if I should make it less. A future experiment for me.)
- Select Create.
Step Two — Preprocess Your Source Images
- Copy the full path of your source directory (the one that contains only the images to train on).
- Paste that into the source directory field.
- Create a new directory for the processed images that will be fed into training.
- Paste the full path of that into the destination directory field.
- For training on a style, I select:
- Flip (creates a mirror image for more training — great idea).
- Split into two (I didn’t make square images for the style — this breaks it apart into multiple square images for training).
- Add caption (a BLIP-derived caption becomes the preprocessed image’s filename, and that text is used later during sample image generation).
- Select Preprocess.
Intermediate step: let’s go check out the results of preprocessing! Look at those filenames. Some are eerily on target and others are a bit off. Sometimes it’s worth a giggle. Feel free to fix anything egregious in a filename, because generated training images will be based on the text in there.
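If you’re curious what Flip and Split into two are roughly doing behind the scenes, here’s an approximation with Pillow. This isn’t the WebUI’s actual preprocessing code, just a sketch of the idea: mirror each image, then carve non-square images into square pieces.

```python
# A rough illustration of the Flip and "Split into two" preprocessing options.
# Not the WebUI's code, just the general idea.
from PIL import Image, ImageOps

def flip_and_split(img, size=512):
    """Yield square crops of the image and of its mirrored copy."""
    for variant in (img, ImageOps.mirror(img)):   # original plus flipped version
        w, h = variant.size
        side = min(w, h)
        # One square crop from each end of the longer axis.
        for box in ((0, 0, side, side), (w - side, h - side, w, h)):
            yield variant.crop(box).resize((size, size))

# Example: write out the pieces for one scanned painting.
for i, piece in enumerate(flip_and_split(Image.open("painting-01.png"))):
    piece.save(f"painting-01-piece{i}.png")
```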
Step Three — Train!
- Select that embedding you created above (e.g., I selected Sudi-Schmitt).
- Leave Learning Rate as is (e.g., 0.005).
- For dataset, enter the full path of your preprocessed images.
- Log directory is where intermediate output is going to go. I leave this alone. This should be a directory in your WebUI install.
- Prompt template file: this is important. It’s probably already filled in with the full path to your style_filewords.txt. Training prompts are generated from this template plus the captions in the preprocessed filenames. You can check the file out and, if need be, create a new version tailored to the style you’re creating (see the example template after this list).
- I’ll talk about what to do if you’re creating a prompt subject instead of a style below.
- Max steps: I left this at 100,000. For the middle-of-the-road graphics card in my, well, older desktop, it took 16 hours of run time to get through those 100,000 steps. Great to start before dinner!
- The next two settings let you get snapshots along the way, say every 500 steps. That’s good.
- Select Train.
If everything is okay and there’s enough memory on your card, it should start the training process.
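For reference, the prompt template is just a plain text file with one prompt per line, where [name] is replaced with your embedding’s name and [filewords] is replaced with the caption pulled from the preprocessed filename. Here’s an illustrative custom version (not the shipped default) skewed toward paintings:

```text
a painting of [filewords], art by [name]
a detailed painting of [filewords], art by [name]
artwork of [filewords], in the style of [name]
[filewords], painted by [name]
```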
Watch That Water Get To Boiling
Of course we want to know what’s going on! Yeah, you can watch your command line working away and see status whizzing by in Stable Diffusion. But what you want to do for reals is open up the folder where the intermediate steps are saved. Go into that WebUI log directory above and dig down to where the sample images are saved. Under the log directory, the path should be of the form <Date> / <Embedding Name> / Images. Given the defaults above, a new image should show up every 500 steps of progress.
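If you’d rather not keep refreshing an Explorer window, a tiny polling script can announce each new sample. A minimal sketch; the path is an assumption based on my install, so adjust it to your own log directory:

```python
# watch_samples.py -- print a line whenever a new sample image shows up.
# The path below is a placeholder for your <log dir>\<Date>\<Embedding Name>\Images folder.
import time
from pathlib import Path

SAMPLES = Path(r"C:\stable-diffusion-webui\textual_inversion\2022-12-28\Sudi-Schmitt\images")

seen = set()
while True:
    for png in sorted(SAMPLES.glob("*.png")):
        if png.name not in seen:
            seen.add(png.name)
            print(f"New sample: {png.name}")
    time.sleep(30)   # samples only appear every few hundred steps, so poll lazily
```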
Use That New Style
When training is done (perhaps it’s the next day), the final style will already be in your embedding directory. So you should be able to switch to the txt2image tab and enter something like:
Award winning painting of a vase of flowers, masterpiece, by Sudi-Schmitt
In the WebUI, after the image generates, you can confirm your embedding was used by looking in the lower right, where it notes that your embedding was loaded as part of the image generation.
Are Intermediate Styles Better?
So all those images that were generated as checkpoints along the way? Each has a matching style in a sibling directory called Embeddings. Let’s say you’re not too happy with the final style, but one of the intermediate styles looks good, or at least more interesting. All you need to do is copy the matching embedding file into your core Stable Diffusion embeddings directory. For instance: if Sudi-Schmitt-7500.png looked really good to me, I’d just copy the matching Sudi-Schmitt-7500.pt file into the WebUI’s embeddings directory (where we were putting downloaded embeddings last time).
Then, you can just change your prompt above to be specific about the copied style:
Award winning painting of a vase of flowers, masterpiece, by Sudi-Schmitt-7500
If you’re really into compare / contrast, copy all the embeddings you like, lock in the seed, and then try regenerating an interesting composition with different variants to get a fix on the one you like best.
And, eh, there’s probably a way to script this to generate a comparison matrix. Hmm. I’ll have to do that sometime and write about it.
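In the meantime, here’s a rough sketch of what that script could look like. It assumes you launched the WebUI with the --api flag so the txt2img endpoint is available, that the intermediate .pt files are already copied into the embeddings directory, and that the embedding names below are stand-ins for whatever checkpoints you kept:

```python
# compare_embeddings.py -- generate the same composition, same seed, across
# several intermediate embeddings. A sketch, assuming the WebUI runs with --api
# and the listed embeddings are already in the embeddings directory.
import base64
import requests

WEBUI = "http://127.0.0.1:7860"
EMBEDDINGS = ["Sudi-Schmitt-5000", "Sudi-Schmitt-7500", "Sudi-Schmitt"]  # stand-in names
SEED = 1234567890   # lock the seed so only the style varies

for name in EMBEDDINGS:
    payload = {
        "prompt": f"Award winning painting of a vase of flowers, masterpiece, by {name}",
        "seed": SEED,
        "steps": 30,
        "width": 512,
        "height": 512,
        "cfg_scale": 7,
    }
    r = requests.post(f"{WEBUI}/sdapi/v1/txt2img", json=payload, timeout=600)
    r.raise_for_status()
    with open(f"compare-{name}.png", "wb") as f:
        f.write(base64.b64decode(r.json()["images"][0]))
    print(f"Wrote compare-{name}.png")
```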
But What About Me?
Okay M. Inner Narcissist, here we go.
Training a new embedding on a subject vs. a style isn’t too different. Here’s what I did differently when creating a training on big ole me:
- I copied, into a new directory, photos of a very fit me where I could crop to just my face or body.
- It was pretty easy to find pictures of myself because I still use Windows Live Photo Gallery and have used its Face Recognition feature to tag myself in photos over many years. I take more photos of family than I appear in, but there are enough of me for training.
- I went through each one of those copied photos and did a destructive square cropping of what I wanted to train on. It has to be square.
- For cropping, I just used the current Windows Photos app (bless its heart): select Edit for each photo, set the crop to a square ratio, then save back over itself (since it was a copy). If you’d rather script that, see the sketch after this list.
- Note that I did my best to find full body shots of myself to include, not just face shots.
- Note that I wish I had included more profile shots in this training, like more 360-degree views of me.
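As mentioned in the cropping step above, if clicking through the Photos app for every picture gets old, a short script can do the square crop for you. Another Pillow sketch; it center-crops, which can chop off the part you actually care about, so eyeball the results (the folder path and extension are placeholders):

```python
# square_crop.py -- destructively center-crop copies of photos down to squares.
# Just a sketch: a center crop can cut off heads or hands, so review the output.
from pathlib import Path
from PIL import Image

FOLDER = Path(r"C:\train\me-copies")   # placeholder: the folder of copied photos

for path in FOLDER.glob("*.jpg"):      # adjust the extension to match your photos
    with Image.open(path) as img:
        w, h = img.size
        side = min(w, h)
        left, top = (w - side) // 2, (h - side) // 2
        cropped = img.crop((left, top, left + side, top + side))
        cropped.load()                 # read the pixels before overwriting the file
    cropped.save(path)                 # save back over the copy, like the Photos app edit
```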
Going through the training steps above, things are pretty much the same as a style except:
- For preprocessing, I did not select “Split into two” — I cropped to a square on purpose. I don’t think that training on cropped versions of my face would be helpful. I could be wrong. It’s a gut feeling.
- For training, this time I made sure subject_filewords.txt was selected, since it’s generating training of a subject (me!) instead of a style. It will use the BLIP-derived text from the preprocessed filenames in running the training.
Let ‘Er Rip, Tater Chip!
Once again, you can start training and find the log subdirectory where the intermediate training images are being stored. Pretty early on, the images looked like me! Well, except for that spooky smile. That’s certainly not me.
Lessons learned for next time:
- I should have included more profile shots of me. Any picture generating a profile shot of me has my face squished like I had a very unfortunate permanent face-planting sometime in my life.
- I will create a new subject_filewords.txt that’s better suited for rendering pictures of a person (see the illustrative version after this list).
- I will edit the file names to ensure they make sense; one was generated saying I had a toothbrush sticking out of my mouth, and it made a bunch of trained nonsense.
- I think that six might be too high for the number of vectors, because the trained “me” embedding takes a lot of style wrangling not to just pop out a photo of me with everything else around me being styled.
Fortunately, it’s easy enough to regenerate with new data some other night.
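For that better subject template, something along these lines is what I have in mind. It’s illustrative only, using the same [name] and [filewords] placeholders as the style template:

```text
a photo of [name], [filewords]
a close-up photograph of [name], [filewords]
a full body photo of [name], [filewords]
a profile photo of [name], [filewords]
```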
I next spent some time generating images with me in them. The top of this posting has an example set. For some more, well, I really like Virgil Finlay, and recently an old college peer helped me track down that Starlog #14 article about Finlay. How about some Finlay art, backed by a couple more artists:
amazing lifelike award winning ink illustration of coffee shop EricRi , trending on art station, intricate detail, Virgil Lindsay and Andreas Vesalius and Alphonse Mucha, cinematic, black and white
EricRi is the name of my embedding, based off of my old Microsoft login alias.
Put It All Together
The last thing I did was put the two together: create an image of me dressed in a Starfleet uniform holding an art canvas, then take that image into img2img and paint over the art canvas so that a painting by Sudi-Schmitt can be generated there. Real easy. The one thing I learned, after the generated paintings kept picking up colors from the original image, was to push up Denoising Strength so that the painting could be generated independently of the original image (I didn’t want it to blend in, though sometimes you really do).
The final version of this is back up near the top of this post.
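If you want to script that img2img step as well, it’s reachable through the same API. A minimal sketch, again assuming the --api flag; the file names are placeholders, and this version skips the inpainting mask and just runs the whole image through with a high Denoising Strength:

```python
# img2img_repaint.py -- a sketch of the img2img step with a high denoising strength,
# assuming the WebUI is running with --api. File names are placeholders.
import base64
import requests

WEBUI = "http://127.0.0.1:7860"

with open("starfleet-me-with-canvas.png", "rb") as f:
    init_image = base64.b64encode(f.read()).decode()

payload = {
    "init_images": [init_image],
    "prompt": "a framed painting of a garden, masterpiece, by Sudi-Schmitt",
    "denoising_strength": 0.75,   # higher means less of the original image bleeds through
    "steps": 30,
    "cfg_scale": 7,
}
r = requests.post(f"{WEBUI}/sdapi/v1/img2img", json=payload, timeout=600)
r.raise_for_status()
with open("starfleet-me-repainted.png", "wb") as f:
    f.write(base64.b64decode(r.json()["images"][0]))
```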
TLDR; Dude, This Could Have Been a Video
Oy, what can I say, I like typing. I like writing. Yes, you’re right. This could have been a YouTube video with all sorts of screen recordings and my pleasant voice narrating the whole process. I’m sure you can find some good-looking-devil in need of views on YouTube covering the same topic, and it can’t hurt to have multiple teachings on a subject like this. The main thing is to find some images you want to train on — even your dog or cat — and give it a shot.
Have fun visiting all the multiverse of you.