Stable Diffusion Textual Inversion (github.com)
442 points by antman | 2022-08-29 16:08:00 | 169 comments




And in the blink of an eye, the career potential of every aspiring "Sr. Prompt Engineer" vanished into the whirlpool of automatable tasks.

On a more serious note, this opens up the door to exploring fixed points of txt2img->img2txt and img2txt -> txt2img. It may open the door to more model interpretability.
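To make that concrete, here is a toy sketch of iterating the loop until the caption stops changing. `txt2img` and `img2txt` are hypothetical callables standing in for whatever generation and captioning models you plug in; this is a sketch of the idea, not a claim about the linked project.

```python
# Toy sketch: search for a fixed point of the txt2img -> img2txt loop.
# `txt2img` and `img2txt` are hypothetical callables wrapping whatever
# generation and captioning models you have available.
def find_fixed_point(prompt, txt2img, img2txt, max_iters=10):
    image = None
    for _ in range(max_iters):
        image = txt2img(prompt)        # render the current prompt
        caption = img2txt(image)       # caption the rendered image
        if caption == prompt:          # the loop has stabilized
            break
        prompt = caption
    return prompt, image
```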


ELI5 - why has there been a cavalcade of Stable Diffusion spam on HN recently? What does it all mean?

You can now fire artists/designers and replace them with AI. Obviously, that's cheaper.

Someone will surely come by soon and tell us, “well actually… artists and graphic designers are irreplaceable.”

But for real, plenty of people are going to start rolling their own art and skipping the artist. Not Coca-Cola, but small to medium businesses doing a brochure or PowerPoint? Sure!


I think there's going to be plenty of work in stacking multiple AI prompts or manual retouching to fix rough spots. It automates a task, not a job. Some people won't use it at all and other people will use it only for reference - in the end doing everything by hand, as usual, because they have more control and because AI art has a specific smell to it and people will associate it with cheap.

But it's not just for art and design, it has uses in brainstorming, planning, and just to visualise your ideas and extend your imagination. It's a bicycle for the mind. People will eat it up, old copyrights and jobs be damned. It's a cyborg moment when we extend our minds with AI and it feels great. By the end of the decade we'll have mature models for all modalities. We'll extend our minds in many ways, and applications will be countless. There's going to be a lot of work created around it.


>AI art has a specific smell to it and people will associate it with cheap.

It might now, but I feel like that will be trained out of it a few more papers down the line


People rarely write assembly code nowadays because we mostly all use higher level abstractions that let us write more powerful code with fewer lines.

There are plenty of small shops now where somebody knows a little Photoshop and can eke out a design that they otherwise wouldn't be able to using pen and paper.

There are also professionals that use the Adobe suite to enhance the abilities they've cultivated for years.

AI art will simply be a tool that enhances artists but might take away some low hanging fruit jobs similar to how web frameworks pushed people out of the job of webmaster and into more specific roles.


If you automate part of a job, you need fewer people doing the job. We still have farmers, but we automated enough of the job to do the same work with a thousand times fewer farmers.

I think the first business to crack will be stock image sites like shutterstock.

Those were already used in a "let's find something that roughly fits what I want to communicate with this text" way.

Today I created a quick get well soon card using an image from the new Midjourney beta and I have to say the result was exactly as good as if I had used Shutterstock, but it took me much less time because the prompt created something that matched what I wanted on the third try.

Compared to sifting through pages and pages of vaguely relevant images, it's a clear win and a lot cheaper.


For sure it automates some work. For example, my sometime hobby of making silly photoshops looks like it will now be a whole lot easier... Visual memes can just be a sentence now. For more serious work I wonder... But it does give pause about what it means for other forms of work.

It'll be an interesting line to be sure.

Right now the tech still requires some nuance to be able to slap it all together into what I think most people would want.

While I expect the interface and the like to get a lot better, all good tutorials of this tech so far show many iterations over many different parts of an image to get something "cohesive". Blending those little mini iterations together is VASTLY easier than just making the whole thing, but it's not just plug and play for something professional.

Still there will be a huge dent in how long it takes to make certain styles of work and that will lower demand considerably, and there's a large market of artists who thrive on casual commissions which this might replace.


Weren’t the small to medium businesses already mostly using stock images anyway?

If anyone was to be worried I’d think it would be Getty Images.


There is plenty of reason for artists who are hired for one-of-a-kind work to be worried.

Getty Images will just get with the program and stock up on bazillions of AI generated images, indexed by the prompts used to generate them.

Someone looking for stock images doesn't want to deal with artists, photographers, or feeding prompt after prompt into some AI software while not quite getting the desired result.

If Getty makes it easier for someone to find some existing AI-generated image than to generate one, they still have something.

A lot of the AI images we see in online blogs and galleries have been curated; people tinkered for hours with the stuff and cherry-picked the best results. There could be some business model in that, at least for a while.


Getty would just be a cache with the prompts acting as the initial search index and buyers are just typing keywords in to buy cached images.

That is not exactly so, because the prompts are not reproducible input cases; you don't get the same image every time for a given prompt. An association between a prompt and, say, around ten images would be something resembling a cache.

The stable diffusion model just got officially released recently, and in the last week a lot of easy to install repositories have been forked off the main one, so it's very accessible for people to do this at home. Additionally, the model is very impressive and it's a lot of fun to use it.

How does it compare to DALL-E?

Worse at image cohesion and prompt matching, but competitive in terms of final image quality in the better cases.

Its image quality is often better, mainly because you can run it on your own machine and increase the quality/time controls. DALLE2 lacks settings and doesn't run enough diffusion steps to have fine detail.
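For what it's worth, here's a minimal sketch of that trade-off using the Hugging Face diffusers pipeline; the model id, prompt, and step counts are illustrative only, and the exact API may vary between library versions.

```python
import torch
from diffusers import StableDiffusionPipeline

# Sketch: running locally lets you trade time for quality by raising the
# number of denoising steps, something a hosted service usually fixes for you.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a detailed oil painting of a lighthouse at dusk"
draft = pipe(prompt, num_inference_steps=25).images[0]   # fast, rougher detail
fine = pipe(prompt, num_inference_steps=150).images[0]   # slower, finer detail
draft.save("draft.png")
fine.save("fine.png")
```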

Having had access to DALL-E for a while, I find both Midjourney and Stable Diffusion to be quite a bit more powerful.

For weird inputs I find DALL-E to be better, but both have failed at really specific stuff that I've tried to create as art, e.g. a factory with human lungs.

With stable diffusion, it really creates some nice stuff for what is already available, like a pizza with specific toppings [0]. I've been using it to add pictures to any of the recipes that are added to my wiki site without a picture. I originally tried this with DALL-E with similar prompts and the results are less appetizing.

[0] - https://www.reciped.io/recipes/mushroom-and-onion-pizza/



It's an impressive new technology, and there's nothing else out there like it in terms of the model being publicly available and able to be run on consumer GPUs.

First, it was recently released, so there's novelty. Second, the code and model weights were also released, so it is open and extensible, which this community loves. Third, these high quality image generation models are mind blowing to most, and it's not hard to imagine how transformative it will be to the arts and design space.

If it has any greater meaning, we might all be a little nervous that it'll come for our jobs next, or some piece of them. First it came for the logo designers, but I was not a logo designer, and so on.


The other issue is that DALL-E and Google's Imagen made a big deal about _not_ giving broad access, making it an approval-only beta (DALL-E) or simply you-can't-have-it (Imagen).

So people were hyped up.



If you have evidence, let alone clear evidence, please let us know at hn@ycombinator.com. Please don't post insinuations about it here—those perceptions are a dime a dozen and the overwhelming majority turn out not to be supported by data.

This is in the site guidelines:

"Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data."

https://news.ycombinator.com/newsguidelines.html

https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme...


We could always do img2txt via just CLIP embedding the image. The idea that you could hide/sell prompts is silly. (Having human interpretable img2txt is cool tho.)
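As a rough illustration of "img2txt via CLIP embedding", here's a sketch that embeds an image with OpenAI's CLIP and scores it against a few candidate captions; the file name and captions are placeholders.

```python
import torch
import clip  # from the openai/CLIP repository
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.png")).unsqueeze(0).to(device)
candidates = ["a photo of a dog", "an oil painting of a castle", "a bowl of fruit"]
text = clip.tokenize(candidates).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # one embedding for the image
    text_features = model.encode_text(text)      # one embedding per candidate
    sims = torch.nn.functional.cosine_similarity(image_features, text_features)

# The best-scoring candidate is a crude "caption" for the image.
print(candidates[sims.argmax().item()])
```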


Are you literally a bot? This is the second time I've seen you link to replicate.com (this time the link isn't relevant to the parent).

Hmm, no, are you dang? Are you the link police? Fuck off, look at my history if you want, I'm trying to be helpful. I'm a happy customer of replicate, yes, but I also ordered a GPU to run on local... You say it's not relevant but actually I'm giving exactly the functionality that parent is asking, giving back a prompt without having to pay for it, that's what text2image does.

Also, replicate links you to GitHub, but in case you cannot run it locally (don't have a GPU at home) you can just run some free queries on it. What's to hate about that?

I don't care what you think. You can also check my other post on this thread where I link to the official paper, which has 11 upvotes.

Go bot yourself

Oh, also, very funny to have a 2m old account giving me lectures about what I'm allowed to post or not, you can't even downvote...


The parent is about im2text not text2im (i.e. converting an image to human readable text). Easy thing to misread. Apologies for upsetting you. Have a good day :)

So sorry about my prior comment, you're indeed right, I shared the wrong link.

This is what I wanted to share originally: https://replicate.com/methexis-inc/img2prompt

Have a nice day to you too, and sorry again for misreading the whole situation!


As someone mentioned in another comment, selling prompts isn't a joke https://promptbase.com/

It has a lot more joke potential if prompts can simply be generated from images.

Something about this just turns me off, not sure why.

Don't worry. People who can actually paint will mercilessly attack the insecurities of anyone who is too serious about prompts. The artists hardly made money before and the AI will just take more off the table. All they (we) will have left is spite.

A craft you can't earn money on is called a hobby. I have a lot of hobbies I don't earn a dime on. Why would art be any different?

I feel it too, the idea of rent seeking in the least difficult part of the process. But people calling themselves prompt engineers (without irony) is at least amusing, like a sanitation expert or a maintenance engineer.


The given example is shockingly similar to what may actually pop up on the HN front page.

What a great new possibility - generate a prompt, do the little project or whatever, write the blogpost, post the link. I guess status/karma/cred is the W?

Is PromptBase basically useless now?

The input image is just another kind of prompt.

These models have hundreds of input parameters, not just prompts. There are many ways to configure the different models, apply various techniques, and link up the processing stages.
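As an illustration of that parameter surface, here's a hedged diffusers sketch touching a few of the knobs beyond the prompt (scheduler, seed, steps, guidance, size, negative prompt); the names and values are illustrative, and availability varies by library version.

```python
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)  # swap the sampler

generator = torch.Generator("cuda").manual_seed(1234)  # fixed seed makes the run repeatable
image = pipe(
    "a watercolor fox in a misty forest",
    negative_prompt="blurry, low quality",
    num_inference_steps=50,
    guidance_scale=7.5,     # how strongly the image is pushed toward the prompt
    height=512,
    width=512,
    generator=generator,
).images[0]
image.save("fox.png")
```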

Getting the best results in a short amount of time requires highly specialized knowledge. The job description isn't "prompt engineer" but something close to that.


How many years until we can generate a feature length film from a script?

I want to see the Batman film where the Joker gives Batman a coupon for new parents but it is expired. That should really be a real film in theatres.

I loled.

You 'might' enjoy this: Teen Titans fixing the timeline.

5

You could do storyboards from a shooting script* now, but generalizing to synthesizing character and camera movement as well as object physics is a ways off.

* A version of the script used mainly by director and cinematographer with details of each different angle to be used covering the scene.


It looks like this is trending towards making our dreams/thoughts reality, in that what we imagine can easily be turned into media - music, books, movies, etc. Pair this up with VR, 'the metaverse', and you literally get the ability to turn thoughts into personalized explorable realities. What happens after that?

* Do we get lost in it?

* Does today's 'professional' fiction become a lot less lucrative when we can create our own?

* Is there a way to leverage this technology to improve the human condition somehow?


I can create and explore realities using my imagination alone, though. I personally don't think having it become actual 2d or 3d art will have a lasting impact. It might be fun for a while, but it will get old.

It's kind of like your imagination on steroids, as the system creates worlds using your imagination as the seed and augments it with the summation of all the human creations used to train the network. Give Stable Diffusion a sentence, for example, and it will create something way beyond what you could have imagined and/or created on your own.

Stable diffusion is impressive, but still is a subset of what one _can_ imagine.

this probably already all happened before mate

I think it will encourage novel ideas in all forms of art. Genuinely new styles and expression will be scarce, because there aren't yet thousands of examples of them to train a model on.

We will also adjust to AI generated art like we have to other creative technologies, and the novelty will wear off. We will become good at identifying AI generated art and think of it as cheap.

Still, extremely exciting.


> Do we get lost in it?

That is one hypothesis for the Fermi paradox, the Kardashev scale, and the great filter. At some point all civilizations essentially create infinite dreams/thoughts/Matrix-style tech where we all retreat inward and have an infinite world to play with, essentially becoming gods in a virtual reality.


Nobody would be interested in it really, since everyone would have their own thing they want others to check out. It's like fan fiction collections online, or the 80% of DeviantArt that you really don't want to spend time with - only now everything looks Hollywood polished.

I suspect at least 10+ depending on your definition.

Tools like this will absolutely be used by professionals to cut out portions of the workload, but there's still a large gap between something like this and actually making a coherent, cohesive, consistent, paced, well framed and lit story from text alone.


There is also recent work by Google called DreamBooth, though similar to Imagen/Parti Google refuses to release any model or code.

https://dreambooth.github.io/


Yeah, they allude to supporting more than one token for the identifier, which would be nice

Textual inversion also supports more than one embedding per identifier; just change num_vectors_per_token in the yaml config. Example: https://www.reddit.com/r/StableDiffusion/comments/wzf1qk/sd_...
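For reference, here's a sketch of bumping that setting programmatically with OmegaConf (which the repo's training code uses); the config path and key layout are from memory of the textual_inversion repo, so verify them against your checkout.

```python
from omegaconf import OmegaConf

# Load the finetune config, raise the number of embedding vectors assigned to
# the placeholder token, and save a copy. Path and key names are assumptions.
cfg = OmegaConf.load("configs/stable-diffusion/v1-finetune.yaml")
cfg.model.params.personalization_config.params.num_vectors_per_token = 2
OmegaConf.save(cfg, "configs/stable-diffusion/v1-finetune-2vec.yaml")
```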

So is DreamBooth worth open-sourcing then, given textual inversion?

From the textual inversion guy's own comment on Twitter

>The objective is similar, but it's: (1) A different approach - they also fine tune the model itself, and they get much much better identity preservation!


DreamBooth retains higher fidelity since the model is finetuned, but to be honest I think textual inversion is actually more applicable, as you can just add some embeddings to inject new knowledge into the model rather than a whole new model just for a single concept (if you want to share it with others). Also, I have not seen DreamBooth being applied to replicate styles.

Sounds like there is a chance they might open source a version for Stable Diffusion. Let's see though.

From Twitter : >Awesome job! That really extends the applicability of powerful generative models nowadays. Could I ask if you have any timetable for releasing the code please? >We are working on plans for implementation on other open source models

https://twitter.com/jason_dingzc/status/1563578510958297089


Wow, this is pretty cool. Instead of turning a picture back into text, turn it into a unique concept expressed as variable S* that can be used in later prompts.

It's like how humans create new words for new ideas: use AI to process a visual scene and generate a unique 'word' for it that can be used in future prompts.

What would a 'dictionary' of these variables enable? AI with its own language, with orders of magnitude more words. Will a language be created that interfaces between all these image generation systems? Feels like just the beginning here.


Right on, you could see a marketplace of those mini pretrained weights to make styles much more available in a UI-like setting without needing to add the style manually... Very interesting.

This is a big deal! This adds a super power to communication, similar to how a photo is worth 1000 words. An inversion is worth 1000 diffusions!

I saw talk the other day about how these ML art models aren't really suited to something like illustrating a picture book, because they can synthesize a character once but can't reliably recreate that character in other situations.

Didn't take long for someone to resolve that issue!


It's not quite at that level yet. The paper introducing it recommends using only 5 images as the fine-tuning set so the results are not yet very accurate.

The model was released exactly one week ago. At this rate we'll be well past "there" in another week.

This doesn't feel too far off. With the img2img stuff you can give a picture of a character and the tool can spit out new images of that character or transpose them into new art styles.

It doesn't feel like this tech has hit any hard limits on things that are impossible or very far away yet. Every limitation seems to be getting broken at rapid pace.


Is there a colab or easy to use demo of this?

This tutorial on reddit is the closest thing so far: https://www.reddit.com/r/StableDiffusion/comments/wvzr7s/tut...

Note that generating "inappropriate" images on colab could result in your entire Google account getting banned. I wouldn't risk it.

It should be noted that the official repo now also supports Stable Diffusion: https://github.com/rinongal/textual_inversion.

Anyone else starting to feel uncomfortable with the rate of progress?


I'm not worried about unemployment, although that is a problem. I'm more worried about bad actors being able to flood the web (even more than it already is) with realistic-enough content that makes it utterly unusable and unreliable.

Imagine entire subreddits consisting of posts, comments, memes, and photos, and 100% of it is pro-[insert authoritarian regime] and it essentially only cost $1M to do it.


Honestly, I'm worried that shocking pornographic depictions of every woman who's ever posted her face online are coming. AI's first big splash in our society is going to be a traumatic sexual assault of all women.

How is drawing imaginary pictures sexual assault?

In the same way that speech is now considered violence by an unfortunately large amount of the population, I guess.

That's not assault, but it can definitely be used for harassment and other forms of social damage. I posted this excerpt from an article (on anonymous Telegram groups) a few days ago:

Filing charges is pointless, says Ezra. For two years, she's been harassed on Telegram. It started when she was sixteen: photoshopped nudes with her Snapchat account were circulated. They had taken selfies from her social media, and those of her family, and combined them with porn fragments. She doesn't know the perpetrator, but that person takes a lot of trouble to ruin her. "Nowadays, the boys have so many ways to make it look real." Source: De Groene Amsterdammer, 146/33, p. 21.


If anything, widespread use and understanding of this technology will help with situations like these. Teenagers in 10 years would absolutely not be impressed with a nude picture that has not been somehow verified as legitimate.

That's entirely unfounded optimism, or, less politely put, sticking your head in the sand. Hasn't the printing press shown how easy it is to slander? Has the internet taught you nothing about misinformation?

And do you propose to stop either to deter misinformation? Or maybe the pros outweigh the cons?

Many seem to forget that Photoshop exists. People have been taking others' faces and overlaying them onto all sorts of images for years. Nothing about this is new and it hasn't put society on fire.

I can do all of this right now with a handful of FREE Google accounts and Colab.

The only part that could be improved with money is getting dedicated mobile modems for unique IP addresses, to evade spam detection.


Yep this technology is super impressive, there’s a chance some tweak can turn it into something scary.

* Train a network on thousands of assembly instructions.

* Prompt: 'some bad weapon of this size and material'.

* Result: simple instructions for how to build it.


It’s actually incapable of doing any of that, for now. Deep learning can’t generalize and it doesn’t understand plans or schematics.

Lots of smart people have been trying to get it to have capabilities anywhere close to what you're describing for years now, with no luck.

We’re safe for now.


Neural networks generalize, otherwise they would not be as powerful as they are today (and I don't know how you can deny that). If your neural network does not generalize then the model is overfitted.

They don't generalize *well*. Deep learning as it is done currently is very limited in its ability to generalize to out-of-distribution tasks. This is a major area of discussion in research.

The type of generalization necessary to perform what the parent was talking about for instance (synthesizing schematics) is (currently) not possible.


It's impressive how all of this is quickly picking up steam thanks to the Stable Diffusion model being open source with pretrained weights available. It's like every week there's another breakthrough or two.

I think the main issue here is the computational cost, as - if I understand correctly - you basically have to do training for each concept you want to learn. Are pretrained embeddings available anywhere for common words?


100% agreed. That's one thing that really bothered me about "openai".

Did OpenAI (Dall-E) and Google (Imagen) know Stable Diffusion was coming?

I'm sure they were looking forward to many months of maintaining highly exclusive access and playing "too dangerous to release" games before SD completely upended the table.


Too dangerous to release, a codeword for selling access at 100% markup.

Bonus points: If dall-e rejects an output image because it thinks the image is inappropriate they won't show it. But they will still charge you for the prompt.

Is there even a public keyword list of words you’re not supposed to include?

I wonder if they keep that private to avoid people finding work arounds.

It’s perhaps not a simple list but an AI classifier, perhaps GPT-3 based.

Nuclear was apparently confirmed on the list, but I have recently used it to generate cool things around nuclear power. So I suppose it is like you say.

There is no public list. Someone on reddit compiled a very limited list here: https://old.reddit.com/r/dalle2/comments/wa3jt6/banned_words...

The problem is that they will automatically ban accounts that trigger the filter too much, so people would have to burn a whole lot of accounts to assemble an even remotely-complete list.


Holy shit that list is insane. A lot of those words being banned would limit you enormously if you're trying to create art.

But it never has been about art?

This list seems incomplete. I've had the filter triggered on words like 'thong' as in thong slippers as well.

How is this legal? If someone commissions a painting from you, you're not allowed to just say "I'm not giving you the painting, but I'm keeping your money." Why does doing it with a computer make it okay?

Anyone moderately plugged into the AI art scene knew for at least 2 months.

I'm guessing those teams didn't know, in the sense that they're AI researchers; in my own employment at Google, I've been regularly reminded that being a technologist and being someone obsessed with a technology and pursuing it socially are different things.

Even without knowing the precise individuals that'd do it, I knew in February that by August there would be an open source model challenging state of the art back then, if only because given 6 months _some_ open source team would try scaling to a bigger model.

Another thing to point out is these teams are descendants of open source; Katherine Crowson's open source breakthroughs led to substantial improvements in DALL-E. Everyone should be saying her name 1000x more often.* She also helped create Stable Diffusion specifically, in substantial ways.

* I think. Maybe I misunderstand the technology dramatically. But I think it's just poorly understood how much she's been involved.


> Another thing to point out is these teams are descendants of open source; Katherine Crowson's open source breakthroughs led to substantial improvements in DALL-E. Everyone should be saying her name 1000x more often.* She also helped create Stable Diffusion specifically, in substantial ways

Not only that, but OpenAI didn't seem to know their CLIP model could be used to generate images (via Advad's CLIP+VQGAN) at all, otherwise they wouldn't have released it. So they did unintentionally start the "AI art" movement even if they didn't release a trained DALLE.


Well, Google's paper showed you don't need CLIP anyway. T5 and other language models can be used instead.

CLIP isn't the true blocker to entry, the dataset and compute is.


StableDiffusion has an open dataset, was funded by one guy, and apparently took "much less than $600k" to train on AWS (https://twitter.com/EMostaque/status/1563965366061211660).

So it seems there actually aren't many barriers to entry at all. There's certainly a lot of legal questions, but if it's this easy to create your own model then it's hard to enforce anything…


I don't think there are actually major legal concerns. Copyright protects reproduction of a specific image. Looking at an image and producing something in a similar style is not copyright infringement, it's called being an artist. The law on this seems pretty clear.

The UK recently announced plans to make this completely explicit, to remove any remaining doubt: "For text and data mining, we plan to introduce a new copyright and database exception which allows TDM for any purpose. Rights holders will still have safeguards to protect their content, including a requirement for lawful access."

https://www.gov.uk/government/consultations/artificial-intel...


I was wondering about trademark issues with a model that can draw new pictures of Mickey Mouse/Homer Simpson/Hatsune Miku if prompted with their names.

I think the fact that these diffusion models are smaller and more compute-efficient than gigantic GPT models makes them easier to use and distribute.

BLOOM is out there, but not that many individuals have something like eight 3090s to host it, and inference is still incredibly slow nevertheless.


BLOOM also doesn’t have GPT-3’s RLHF tuning, so anyone who tries to ask it questions or give it instructions in the manner GPT-3 supports will be disappointed. You have to k-shot prompt it or fine-tune it yourself for it to be useful.
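A small sketch of what k-shot prompting a raw causal LM like BLOOM looks like: you show the task as completed examples rather than asking a question. The smaller bigscience/bloom-1b7 checkpoint is used here purely so the example is runnable on modest hardware.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-1b7"  # a small BLOOM variant, just for illustration
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Few-shot prompt: demonstrate the task, then leave the last answer blank.
prompt = (
    "English: Good morning\nFrench: Bonjour\n"
    "English: Thank you\nFrench: Merci\n"
    "English: See you tomorrow\nFrench:"
)
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```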

I don't know if they saw it coming or not, but frankly I'm glad it did.

This idea of "technology gatekeeping" sickens me. I'm tired to death of people saying some non-sensical horseshit like, "The technology is too dangerous to be turned over to the hoi polloi!!"

Give me a break... as if someone running StableDiffusion on their home system and creating naked centaur-women out of pictures of Kate Beckinsale and anime waifus out of Ariana Grande photos are going to cause the downfall of the modern era.

StableDiffusion didn't upend the table... StableDiffusion gave the plans to the printing press to every person out there that wants to learn how to make their own print shop... and more power to them all, I say. I've had more fun and learned more about AI models in the past week than I've had with AI in the past year, and I've been using img2img to feed my own art into SD to create whole new works that I've been able to touch up in Photoshop and upscale to print resolution.

This is truly the kind of computing revolution that I love to see, and that comes around all too infrequently. The good from this will far, far outweigh any negatives.


Technology gatekeeping is completely antithetical to the hacker spirit.

Hackers built all this technology. There's no way a handful of megacorps are going to take it all for themselves.


Right, like no megacorp ever prevented us from doing whatever we want for our smartphones. Oh wait. (No Pixels don't count, unless you can write your own TEE, your own sensor hub, and your own wake up word)

I'm not saying they can't try, and succeed for a while. But eventually, we will always break free.

Pixel exists (but apparently doesn't count because it's not perfect yet).

Librem exists.

PinePhone exists.

More will exist in the future.


The "Librem is to iPhone as Stable Diffusion is to DALL-E" analogy breaks down when you consider that Librem phone works about 10% as well as an iPhone, whereas SD works 110% as well as DALL-E.

Linux also used to be a "hobby" OS. Now it powers the internet. Things change.

It took Linux a couple of decades to get to that point. And it had immense business value in having such a massive infrastructure open for everyone.

For hardware our world is not there yet, and won't be for the foreseeable future.


"open source" hardware is never going to work the same way as open source software does. Hardware is fundamentally capital-intensive to produce. Software can be produced (compiled) using hardware that many people have readily available. This is a fundamental, intractable difference.

It's the difference between free knitting patterns and free cardigans.


> This idea of "technology gatekeeping" sickens me. I'm tired to death of people saying some non-sensical horseshit like, "The technology is too dangerous to be turned over to the hoi polloi!!"

I think that misrepresents OpenAI's attitude. As I see it, their claim is closer to "let's discuss whether the stable door should be closed before we find out the hard way what makes the horse bolt".

Given how much trouble we already get from the Gell-Mann amnesia effect, and how many people take spirits and horoscopes seriously, it seems entirely plausible to me that some highly realistic centaur picture could be used as a casus belli for a popular uprising that effectively ends a nation.

(Similar rumours abound even without this tech, c.f. Catherine the Great or Malleus Maleficarum etc.; I suspect arbitrary photorealistic pictures make that kind of drama much more likely to occur and to stick harder when it does, but this suspicion is not strongly held).

Edit:

I want to add that my concerns about this tech are less about the general public (most people are basically decent) and more about the few percent who hate or fear, who now have a much easier time promoting their views (the possibility having always existed is different from it being cheap). They're also about those who don't realise the images are generated to fit the text and instead think it's a search engine of existing images, which appears to be a common view judging by the type of complaint certain artists have about any given demonstration of the tech (though public figures complaining about Google search results without knowing they're personalised is also a thing, even for actual search).


> I think the main issue here is the computational cost, as - if I understand correctly - you basically have to do training for each concept you want to learn. Are pretrained embeddings available anywhere for common words?

I just read (skimmed) through the paper.

That's in fact the key idea here: that the training model is untouched. Using the existing, trained model, they use this "inversion" procedure to discover some word that acts as a stable reference for a concept expressed in some images exposed to the model, which the model will understand as a reference to that concept.

There is a pretrained model with those common words, which knows how to do things like, say, "hamburger in the style of Picasso".

Now, without such a model having been trained on the works of some artist, or other images, it's possible, using a few samples (merely five or so), to uncover a latent word in the model which refers to the concept that those samples have in common. That word is stable in the sense that you can compose it with other words in prompts, and it really seems to denote the concept in those sample images.

In the paper the researchers consistently refer to such a word as the meta-variable S*, and use it in prompts like "flying monkey in the style of S*".

What I couldn't spot in the paper is a concrete example of what the S* word actually looks like for given examples. I'm guessing that it's some sort of gibberish. According to the concrete usage instructions, the process produces an embeddings.pt file, which you then upload, allowing you to use the pseudo-word * (asterisk) to refer to the concept.
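For a sense of how such a learned vector gets wired in downstream, here is a hedged sketch using diffusers: add a placeholder token, resize the text encoder's embedding table, and copy the vector in. The layout of embeddings.pt varies by training script, so the torch.load line is an assumption you'd adapt.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

placeholder = "<S*>"               # the pseudo-word you will type in prompts
vec = torch.load("embeddings.pt")  # assumption: a single 768-dim vector; adapt to your file's layout

pipe.tokenizer.add_tokens(placeholder)
token_id = pipe.tokenizer.convert_tokens_to_ids(placeholder)
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
pipe.text_encoder.get_input_embeddings().weight.data[token_id] = vec

image = pipe(f"a flying monkey in the style of {placeholder}").images[0]
image.save("monkey.png")
```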

People have been intuitively experimenting with gibberish words in prompts, discovering some stable behavior that seems to correspond to words that the AI "came up" with by itself (like a child, some have noted). This research seems like methodical way of discovering those internal words.


Do latent variables "look like" anything at all? Like in a PCA, for example, a factor is some latent heuristic, but does it even have an actual value?

Looking at this some more, I may have a slightly less flawed high level understanding. There is never actually a concrete word. There is an "embedding" represented as an abstract vector, and that is forcibly associated with a pseudo-word like *. That * just recalls the vector; there is no intermediate gibberish that has a word representation: that vector is the gibberish.

I have seen images at OpenArt with prompts that consisted entirely of different types of whitespace. They were haunting images of humanlike shapes. My assumption was that the prompt hit some odd vocabulary that had been trained to some concepts. Is that impossible?

I think that’s a pretty apt comparison. A latent variable (or latent factor in PCA terms) is (basically) a direction in a n-dimensional space, where n is the length of the vector. The direction is correlated with some type of variance in the input data. Oftentimes this represents something that has some useful meaning (“dogness” vs “catness”, for example), but it could also just represent a correlation that has no interpretable meaning.

This is probably a dumb question but if we’re talking about language embeddings, are the latent vectors deterministically out of vocabulary? Is there any possibility of collision with an in-vocab n-gram’s vector?
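One way to probe that empirically is to compare a learned vector against the text encoder's vocabulary embeddings by cosine similarity; a true collision would show up as similarity 1.0, while in practice you'd expect merely-nearby neighbours. A sketch, assuming the SD v1 CLIP text encoder and a single learned vector whose file layout you adapt:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

vec = torch.load("embeddings.pt")  # assumption: a single 768-dim learned vector

vocab = text_encoder.get_input_embeddings().weight  # (vocab_size, 768)
sims = torch.nn.functional.cosine_similarity(vocab, vec.reshape(1, -1), dim=-1)

top = sims.topk(5)  # nearest in-vocabulary tokens to the learned embedding
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.convert_ids_to_tokens(idx)}: {score:.3f}")
```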

Speaking of computational cost, I wonder how much electricity all this is using.

> I think the main issue here is the computational cost, as - if I understand correctly - you basically have to do training for each concept you want to learn. Are pretrained embeddings available anywhere for common words?

The basic SD model should have all the common words covered; this model's goal is to find a new concept that doesn't exist visually or textually in the dataset, like for example your own face, or a character you designed yourself. Note that this might not be possible to do: the corpus of data or the size of the model might not hold enough information to represent certain concepts, or at least represent them in detail. For example, if you give it pictures of your dog, it might not look quite like your dog during generation, even though those details existed in the pictures you gave the model.

If you want personalization that is also highly detailed, you'll have to fine-tune the model itself with your own concepts; Google has detailed how they did their own fine-tuning and called it DreamBooth [1].

[1] https://dreambooth.github.io/


RIP promptbase.com - they had a good run.

They won't even need to renew the domain name when it expires on 2023-02-28.



Hey the colab link is dead :(

People seem to have missed the point of what this is.

This is not "work out what the prompt is for an image"

Instead it lets you give the model a "thing" (as an image) and then use that thing in your prompts.

So for example they give it a picture of a statue and name it "S", and then can say "Elmo sitting in the same pose as S" and it correctly generates it.


How far are we away from throwing a screenplay and some reference scenes at an AI and getting back a full movie? 10 years? Doesn't seem very far away now...

> https://github.com/rinongal/textual_inversion

Netflix UI of 2033: "I want an episode of Seinfeld with X, Y, Z..." Starting a brand new (never aired) AI-generated episode now.


If you can ask for anything, why would it be Seinfeld?

AI: remake seasons 5-8 of game of thrones, based on GRRM's final two books which he finally finished.

AI: remake seasons 5-8 of game of thrones, based on the critical consensus version of the final two books autocompleted by GRRM-style emulating generative network.

I wouldn't, tbh. It's just that the example someone else gave used Seinfeld, and I was trying to be relatable to an American audience; I've never seen it. LOL

Just make sure not to ask for a Moriarty that can beat Data.

What I want to know here is can I seed it with a human face? If so, this really becomes a breakthrough for fast meme generation / other more nefarious uses. That will really be the thing that opens the floodgates on this.

Yes, you can. For well-known people you can already upload a meme and ask SD to replace the face.

I replaced a Biden meme with Shrek just yesterday using Stable Diffusion image2image in Colab.
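A hedged sketch of that img2img workflow with diffusers: start from an existing picture and let the prompt steer the re-generation. File names, prompt, and strength are illustrative, and older diffusers versions name the image argument init_image.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

init = Image.open("meme.png").convert("RGB").resize((512, 512))
out = pipe(
    prompt="Shrek, green ogre, detailed face, photo",
    image=init,        # older versions of the pipeline call this init_image
    strength=0.6,      # how far the result may stray from the original
    guidance_scale=7.5,
).images[0]
out.save("meme_shrek.png")
```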


This is escalating quickly

I'm experimenting with generative art with all these new models, and what's coming out is wild and beautiful.

You can basically make a full AI experience now as a one-man show, world-building with capabilities previously available only to Marvel or Disney...

This might be a better link: https://textual-inversion.github.io/

Also this reddit tutorial might be useful to you https://www.reddit.com/r/StableDiffusion/comments/wvzr7s/tut...


Genuinely mind blowing stuff.

That link is better in my opinion.

Showing multiple variables and styles, Sx in the style of Sy; it's really amazing.


The fact that they chose to recreate Qinni's artstyle, who passed away just a couple of years ago and had a notable online following that still remembers her, makes me slightly uncomfortable; feels too soon for me, I guess.

Also, if models start to accept prompts like "in the style of Qinni", surely we're back to the copyright debate. They get away because everyone's art is mixed into a single model, but once distilling someone's artstyle is a feature…


Art style is not copyrightable as far as I understand. It could be trademarked though if specific enough.

I'm fearful of these algorithms, how can I ensure my economic status will not be affected? Any stock tickers to buy?

Amazon sells compute, and that's always needed for this sort of thing. They may develop new AI accelerator stacks (including racked hardware) in the future as well, so sure. I don't want to be one of those "the Internet is a fad" people, but I'm absolutely not telling you to do anything rash; go do your own research. Even if you bet as well as you can, the dot-com crash wiped out people like you, and that's assuming you get the general picture right.

How can you ensure your economic status will not be affected?

How were farriers able to ensure their economic status wasn't affected by cars?

How were switchboard operators able to ensure their economic status wasn't affected by automatic switching?

How were travel agents able to ensure their economic status wasn't affected by Expedia, Priceline, and the dozen other sites out there?

No one can know when or how their job is going to go extinct. You just need to pay attention to the changing winds of technology and adjust your ship's sail accordingly.



Invest in relevant technologies

e.g. GPU compute cycle companies (e.g. nvidia/amd/intel/arm/etc..) or invest in cloud compute companies (Amazon/Microsoft/Google/etc..)

Or learn relevant skills and try to have relevant marketable skills. (Always be learning)


Any courses you have completed recently?

> how can I ensure my economic status will not be affected?

You cannot.

It sucks, but it's inevitable. The same thing has been happening ever since the industrial revolution. Old skills become obsolete, and nobody cares what happens to people who've invested their entire lives into something that is no longer profitable.


It's interesting that, in a certain phase of the learning process, words in natural language already work this way. Before a language learner has enough knowledge to understand word etymology, they have to just form an association between a word and a certain set of sensory experiences that the word is meant to represent.

Seems like we have the same thing happening here. The input images are the sensory experiences. The dummy word "S*" is the linguistic symbol that is attached to them.



I wonder how this will impact the market opportunity for DALL-E, which has already lost quite a bit of wind thanks to its weird monetisation model (compared to the more affordable Midjourney, which does use Stable Diffusion).

I mean, what stops us from building a life-imitates-art system where we have speech2img, like in Westworld (the narrative creating scenes)?

I guess I hope someone reads this and picks it up. Maybe coupled with a VR set?


Is there an easily self-hostable all in one text to image thing?


Tangential: I've set up a Discord Bot that turns your text prompt into images using Stable Diffusion.

You can invite the bot to your server via https://discord.com/api/oauth2/authorize?client_id=101337304...

Talk to it using the /draw Slash Command.
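Not the author's code, but a minimal sketch of how a /draw slash command can be wired up with discord.py 2.x; generate_image is a hypothetical helper wrapping a Stable Diffusion call, and the token is a placeholder.

```python
import discord
from discord import app_commands

intents = discord.Intents.default()
client = discord.Client(intents=intents)
tree = app_commands.CommandTree(client)

@tree.command(name="draw", description="Generate an image from a text prompt")
async def draw(interaction: discord.Interaction, prompt: str):
    await interaction.response.defer()     # generation takes longer than the 3-second reply window
    path = generate_image(prompt)          # hypothetical helper running Stable Diffusion
    await interaction.followup.send(file=discord.File(path))

@client.event
async def on_ready():
    await tree.sync()                      # register the slash command

client.run("YOUR_BOT_TOKEN")               # placeholder token
```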

It's very much a quick weekend hack, so no guarantees whatsoever. Not sure how long I can afford the AWS g4dn instance, so get it while it's hot.

Oh and get your prompt ideas from https://lexica.art if you want good results.

PS: Anyone knows where to host reliable NVIDIA-equipped VMs at a reasonable price?

