AI Art of Me — Textual Inversion vs. Dreambooth in Stable Diffusion
This is almost a diary kind of post where I go through the high-level steps of using Dreambooth to incorporate my appearance into the trained model Stable Diffusion uses to generate AI art. I compare this with the process I’ve already done, textual inversion (which really needs a rebranding). Which is better? Which will I use?
Why would you want to do this yourself?
This stuff is fun. Yes, there is an extreme amount of waiting involved in generating these astounding capabilities. But I’ve had endless joy in creating images of myself in new, fantastic situations. Throwing in celebrities just makes it more fun. I think once you try it and get a big grin on your face you’ll totally understand.
In my last post, I wrote about extending Stable Diffusion to have new art styles and to be able to create images of myself by creating new textual inversion embeddings. Basically, a small file that has been trained to reproduce a specific subject or style. You can refer to this in your prompt and bingo, there you go-go. You can now create images that render you as the subject. Great for trying on new looks or historical (hysterical?) renderings.
In addition to textual inversion there is Dreambooth by Google. Various Google Colab notebooks are available to let you go through similar steps to create a modified version of the massive AI model that Stable Diffusion uses, adding, say, yourself into that model.
I really want Dreambooth to work better. The reason is that I believe Dreambooth trainings will be more adherent to the prompt. I can create a great prompt with fantastical elements, but once I modify it to make me the subject the fantastic is gone, replaced with pretty boring and flat results. I can try to adjust this in Automatic1111’s WebUI by decreasing my embedding’s impact, surrounding it with one or more layers of square brackets. Sometimes it works. Most of the time it doesn’t. So, if Dreambooth can slide me into the Stable Diffusion model and not wreak boring havoc with the prompt’s results, I’m all for it.
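For reference, Automatic1111’s WebUI treats brackets and parentheses as attention modifiers: each pair of square brackets divides a token’s attention by 1.1, each pair of parentheses multiplies it by 1.1, and you can also set an explicit weight. (EricRi is my embedding’s name; the exact weights below are just illustrative.)

```
photo of EricRi as a wizard        (normal attention)
photo of [EricRi] as a wizard      (attention ÷ 1.1)
photo of [[EricRi]] as a wizard    (attention ÷ 1.21)
photo of (EricRi:0.7) as a wizard  (explicit weight of 0.7)
```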
Let’s fire it up and, well, given that it works for me, which is better?
That “given” is a big given.
Some warnings upfront about this should you try:
- You’re going to be generating huge files. If you’re dumping this into your Google drive, you best have ample GBs for this. I’m on Google One so I had the space.
- It’s great that you’re not having to burn through your own computer’s time, but expect to burn time downloading that model. Be sure to rename the model so that you can put it next to the default Stable Diffusion model as a checkpoint that represents your modification. You can then switch between this newly generated checkpoint and the standard model.
- Getting ready to submit to a Dreambooth colab involves steps similar to textual inversion. You should have a variety of photos of yourself — various angles, clothes, backgrounds — and crop copies of those photos into a square aspect ratio. Send them through Automatic1111’s preprocessing stage to resize them down to 512x512. These are the photos you’ll submit into the Dreambooth process.
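If you have a lot of photos, the crop-to-square step can also be scripted. A minimal sketch of the math, with the (assumed) Pillow calls shown in comments — Automatic1111’s preprocessor handles the final resize if you’d rather not:

```python
def center_square_box(width, height):
    """Largest centered square as a (left, upper, right, lower) crop box."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)

# With Pillow installed, cropping and resizing a training photo looks like:
#   from PIL import Image
#   img = Image.open("me_01.jpg")
#   img.crop(center_square_box(*img.size)).resize((512, 512)).save("out_01.png")

print(center_square_box(800, 600))  # → (100, 0, 700, 600)
```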
Oh, and if you don’t know what a Google colab document is: it’s basically a structured notebook with code. You click on play buttons as you go down the page to run steps of the code. This is running on Google servers, some with GPUs for fast processing. Everyone gets some free colab time. When that free allocation runs out, you’ve got to pay for more.
My first attempt, after watching the video “Dreambooth tutorial for stable diffusion. Quick, free and easy!”, just didn’t seem to go well. I did two attempts with the following colab (with each attempt worrying about when any credits I have for running Google VMs will expire): DreamBooth_Stable_Diffusion.ipynb — Colaboratory (google.com).
The results? Here’s a side-by-side of the same prompt and seed just switching between EricRi (textual inversion) and DreamyEric (Dreambooth):
Eeeyeah. That’s one much more handsome Asian dude Dreambooth created, but not me me. An impression of me. Perhaps 1,000 steps were too few? Too few pictures? I believe the true-life structure of my eyelids pushed it into making some weighted decision here about the basis of me. Many gigabytes and GPU cycles were sacrificed for this result. Let’s see if we can learn from it so that it wasn’t wasted.
So far: textual inversion one, Dreambooth zero.
Hoping my freebie Google colab credits weren’t about to burn out, I tried another colab notebook I had stumbled across before and found mentioned in Use Google’s DreamBooth AI for Free and Turn Yourself into AI Art (decentralizedcreator.com) — this notebook is fast-DreamBooth.ipynb — Colaboratory (google.com). Not as turn-key as the previous one but still pretty much point and click and click and click and click.
I decided to up the number of steps this time from 1,000 to 5,000. This Colab notes that 200 images are better than what I have (sixty-some). It also provides the ability to train on your own class images (man) or to generate those class images for you. I uploaded my sixty-some pictures, let it generate its own class references, and put in 5,000 steps. Start training…
…ah, poop! Disconnected Google colab runtime error because I’m out of colab credits. Well, that’s only fair for a freeloader like me with two other trainings already done today. I extinguished my compute allocation 46% of the way through training.
We all have more than one Google ID for various reasons, so it looks like I’m going to need to use one for some back-up computing here. I was in the middle of finding more pictures of myself anyway, so I might as well finish that hunt and add those extras to the next attempt.
Okay, attempt #2 on fast-DreamBooth.ipynb. 94 files to upload so that’s more than last time. I’m still going to set # of steps higher than 1,000… let’s see if the luck holds at 4,000. It will take a while to bake. Start training…
…it’s taking forever just to generate the “man” class references. Okay, can Dreambooth be run at home on an 8GB nVidia card? Let’s research that for a second. Hmm, the only reference I find seems to rely on Ubuntu running on the Windows Subsystem for Linux (WSL). Let’s put that down as Plan D. I’ll spring for $10 worth of Google colab units first. Okay, back to watching the training reach the boiling point…
…while waiting I go back to my Stable Diffusion and try combining textual inversion and Dreambooth together. To do that, I say something like the following in my prompt: DreamyEric man [EricRi] . I have the textual inversion, EricRi, in brackets just because it’s too strong. Do I see a difference? I mean, there is a difference if the “DreamyEric man” isn’t there — the image looks different but it doesn’t look worse for it not being there. So: eh? Not an appreciable difference I can find for combining the two together…
…phew, wow, okay training done! I did not have to throw extra money at Google today. The final model was dropped into the root of my Gdrive. I don’t sync that one, so now I’m downloading it. In the meantime, the Colab has started an Automatic1111 server for WebUI for trying the model out via Colab. Nice. Trying…. Ooo! That looks like me. So does that! Cool. Try some more… about one in five look like me. And… maybe makes me question my rosy cheeks even more.
Okay, now we can restart our local Stable Diffusion with the downloaded model copied into the models directory and select it as a checkpoint we want to use for our image generation. Let’s compare the textual inversion against the Dreambooth using the same seed for each one of these, just switching the technique:
Ah, well. Textual inversion consistently gets my face correct more often than Dreambooth. So plus-one for textual inversion there. When Dreambooth does get my face, though, it really looks more like me in subtle ways I find hard to explain — just the way my face looks seems more real. So if Dreambooth could only get my face to actually be my face more often then it would be the clear winner.
I next tried to go back to one of my older PNGs from yesterday and recreate it in this new model… but it won’t recreate with the exact parameters loaded from PNG Info into txt2img. That’s not what I was expecting. The downloaded model is way smaller (2GB vs. 4GB), so it’s been pruned. I don’t want that. The model that made Asian Eric was 4GB, and I felt it was the full model with that Eric added in. So now I need to redo the colab all over again and see — perhaps by not opting into fp16 — if I can get the full, unpruned, more precise model back.
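The halved file size is consistent with half-precision weights alone: fp16 stores each weight in 2 bytes instead of fp32’s 4. A back-of-envelope check, assuming roughly a billion weights for a Stable Diffusion v1 checkpoint (an approximation on my part, not an official figure):

```python
# Rough checkpoint-size arithmetic. The parameter count is an
# assumption (~1 billion weights for SD v1, including text encoder
# and VAE), so treat the numbers as ballpark only.
params = 1_000_000_000

fp32_gb = params * 4 / 1e9   # 4 bytes per fp32 weight
fp16_gb = params * 2 / 1e9   # 2 bytes per fp16 weight

print(f"fp32: ~{fp32_gb:.0f} GB, fp16: ~{fp16_gb:.0f} GB")  # → fp32: ~4 GB, fp16: ~2 GB
```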
In the meantime, I got very distracted generating noir detective images with the Dreambooth version of my face. This gave me a whole new appreciation for Dreambooth’ing. It seems my face slides into the artwork much more smoothly, though in the WebUI I do have to emphasize it by putting it in parentheses. I don’t think a same-seed / same-prompt 1:1 comparison is quite fair at this point.
Some of these generated by Stable Diffusion + Dreambooth are absolutely brilliant… except for the deformed arms, hands, and my head getting cut off. But that’s everywhere.
And then… I had the incredibly time-consuming idea to not let Stable Diffusion generate a random femme fatale but rather make it look like a celebrity. The loud-speaker crackles, “Ms. Johansson, you’re needed on stage. Please bring your extra arms and fingers with you.”
Okay, let’s compare textual inversion generation to Dreambooth generation, same prompt but different seeds, trying to pull out the best from a run. How about best of 3 versus each other?
Okay, so what I notice off-hand is that textual inversion tends to generate a more extreme, less-subtle version of myself. Harsher. The background in Dreambooth tends to be richer and more detailed, making me feel that Dreambooth slides into the prompt’s implied aesthetics far more easily.
But I sort of love both of them! How can I possibly choose? I think I’d go with Dreambooth in general and then also try out textual inversion, especially if I’m not liking the face quality of Dreambooth. Dreambooth does tend to make me look thinner, fitter, and more handsome so, while that doesn’t match reality, it sure is compelling. Score one for Dreambooth, you little sycophant.
I next wanted to step away from the world of Greg Rutkowski and Alphonse Mucha (which means stepping away from 96% of generated art right now) and try the styles of Carne Griffiths, Agnes Cecile, Alberto Seveso, and Anna Bocek combined together. It’s sort of a lovely chaotic splatter style. What I discovered: Dreambooth is compliant, but I had to boost my self-reference by surrounding it with parentheses. Textual inversion is resistant. My likeness would avoid taking on the artist’s style, and the style would just decorate around the edges of my rendition.
There is a super clear winner here. Dreambooth by a knock-out. Even when I decreased the strength of my textual inversion by putting brackets around EricRi it was still too strong to take on the artistic style I wanted. This could be due to the strength of the artists in the model’s training. When you’re mixing in Artgerm and Greg and Alphonse they are just too strong to resist, even for a textual inversion.
My plan next is to generate another model without fp16 enabled. It will take longer, but I want to see if the increased accuracy gives me anything. That’s after I’ve given Stable Diffusion a break for a bit. It’s giving me questions and thoughts about the classic capitalist’s question: How do I make money off of this? Not that I want to, but I’m curious to project what people will try. It’s pretty awesome to see pictures of yourself in fantastic situations. Would BTS fans (not me, just being clear) love to have various poses of themselves with BTS members? How about reproductions of you in memorable movie moments? Or an AI spitting out a comic of you with well-known celebrities having high adventure? I’m sure a lot will be tried; I’m just not sure how much will stick… like whatever that app was six years back that everyone was using to artify their profile picture. Yeah. No one remembers the name.
Until next time.