I think image generation is an interesting case, because even if a human is always in the loop, and you need to try several times before you get a good image for your prompt of interest, that's likely still faster and cheaper than photoshopping exactly what you want (or certainly faster than hiring an illustrator). And the images produced are sometimes really quite good. A model which produces some amount of really messed up images can still be 'useful'.
_However_ the kinds of failures it makes highlight that these models still lack basic background knowledge. I'm willing to let the stuff about compositionality slide -- that's asking kind of a lot. But I do draw a very straight line from DeepDream in 2015 producing worm-dogs with too many legs and eyes, style-gan artifacts where the physical relationship between a person's face or clothing and surroundings was messed up, and the freakish broken bodies that stable diffusion sometimes creates. Knowing about the structure of _images_ only tells you a limited amount about the structure of things that occupy 3d space, apparently. It knows what a John Singer Sargent portrait looks like, but it's not totally sure that humans have the same number of arms when they're hugging as when they're not.
In the same way, large language models know what text looks like, but not facticity.
So I don't know that an AI winter is called for. But maybe we should lean away from the AI optimism that we can keep getting better models by training against the kinds of data that are easiest to scrape?
Potentially, but I don't think in practice, because most image generation is not trying to be photorealistic. Most models are trying to stylize their outputs, either by default (e.g. a Miyazaki-anime-like model), or in response to the prompt.
Using AI image generation for misinformation accounts for only a tiny minority of its use.
And if the output is stylized, it's going to be much more obvious if it has come from an AI model, because a human is going to have a much harder time reproducing a specific ML model's style on-demand (e.g. if their art teacher asks them to sketch a face in the same style, to prove their homework wasn't faked).
A very interesting take, but I am not sure I agree. First, the post elaborates the idea using image generation, but then applies it to all "generative AI" without any argument in the post supporting that leap.
As for images: if anything, faking them was possible before with good photo editing skills, or even with a good back story for a fuzzy photo (UFOs, the Loch Ness Monster, etc.). If anything, I am glad that it is even easier now with generative AI: I hope it forces more people to think through novel information and decide for themselves whether they accept it, or whether they want to research further. This is something we would do after reading a blog post, for example, so why should images be treated differently? We are too used to treating photos as proof or fact.
I agree with most of this, but I do disagree with the point about producing specific imagery. It's absolutely a skill one can develop. I spend a lot of time helping people learn to simplify their prompts and choose the right language for image generation AIs. For some reason people put a lot of unnecessary junk into them; I guess it's a form of superstition (this sentence fragment worked well the last few times).
As the article mentions, the hybrid approach (using this as one tool in a series of other tools) is the way forward.
There are concepts the AI simply will not grasp. For example, right now Midjourney struggles badly with "bulldozer", "centaur", "fantasy archer", etc. These will inevitably be fixed (and have been in the past) by new model versions with better training data.
The real struggle comes with either small details or semantic information. For example, it's hard to ask it to make a lifelike/photographic scene with everything, including the background, in focus, even with "focus stacking"-type keywords. "Selfie" is about the best word we came up with, but unfortunately that has significant side effects, lol. Perhaps there just aren't enough instances of people specifically describing that property in the training data, but honestly it's difficult to even find the English words for these concepts!
As for small details, it is indeed true that the current approach will probably never scale to handle something like "six blue cubes with a red triangle on each, arranged in a pyramid shape, with a yellow ball balanced on top". But as the author points out, such things will likely be handled with a minimum of Photoshop skill, using assets generated individually.
The quality of the image really depends on the quality of the prompt, and a LOT of cherry picking.
I find that Big Sleep is also generally a better model than the one linked here (Deep Daze).
I’ve generated several hundred images myself and found a few real treasures. Here are a few of my personal favourites:
“A painting of a murder in the style of Monet” [0]
“A photo of fellas in Paris” [1]
“A painting of Thanos wearing the Infinity Gauntlet in the style of Rembrandt” [2]
I definitely agree that in the general case the examples are underwhelming, but I believe there is a lot of potential here. Personally I’m super excited to unlock the potential of human-guided, AI-assisted creative tooling. Some Colab notebooks let you actively explore the latent space of a model to direct the results where you want them to go. As the generate-adjust feedback loop gets tighter, we’re gonna see some crazy things.
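For the unfamiliar, "exploring the latent space" usually means moving between the latent vectors behind two generations and decoding the points along the way. A minimal sketch of the idea in plain Python (the 4-dimensional vectors here are purely illustrative; real models use hundreds of dimensions, and many workflows use spherical rather than linear interpolation):

```python
# Linear interpolation between two latent vectors: decoding the points
# along the path yields images that morph from one result to the other.
def lerp(a, b, t):
    """Blend latent vectors a and b; t=0 gives a, t=1 gives b."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

# Two hypothetical latent codes (illustrative values only).
z_start = [0.2, -1.3, 0.7, 0.0]
z_goal  = [1.0,  0.5, -0.4, 0.9]

# Sample 5 points along the path; feeding each to the generator would
# produce a gradual transition between the two images.
steps = [lerp(z_start, z_goal, i / 4) for i in range(5)]
```

Tools that let you nudge the latent in a chosen direction and re-decode are doing essentially this, just interactively.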
I still find it funny we managed to get image generation working so much better than text.
If you care about veracity then image generation works about as well as text. Frequently you can find details of the image that are just bizarrely wrong, such as hands or food or other basic things. It's the same basic problem: there's no intelligence behind what it's doing, it just regurgitates mostly realistic-seeming pixels that are pretty good at fooling the casual viewer.
Really, it's like those moths with eyespots on them: good at fooling the brain's heuristics but obviously not real.
However, there's a major misconception about training of humans and AI.
Image-generating AIs are trained on massive amounts of images and text; image-generating humans, however, train on a much broader spectrum of experience.
Also, feeding a model tons of images created by humans (directly or by proxy) and then claiming the AI is generating something completely new is a bit naive, IMHO. Humans draw on a much broader and deeper pool of experience to create things without prompts.
An AI model blurts out something derived from a corpus of images and text created by humans, that's all.
The technology is impressive for sure, and it marks a new era in terms of possibilities, but it doesn't take my breath away, sorry.
I think AI-generated images are a worse contamination problem for training generative image models than AI text is for LLMs, since there are so many of them on the internet now (see Instagram art-related hashtags if you want to see nothing but AI art) compared to the quantity of images scraped before 2021 (for those models that did that). Text will always be more varied than seeing 10m versions of the same ideas that people generate for fun. AI text can also be partial (like AI-assisted writing), but the images will all be essentially 100% generated.
Can I ask: was there an underlying reason that people decided to pursue this image generation task, or is this literally just the result of throwing lots of tasks at different types of AI until you finally find one it seems to do well?
I don't mean to denigrate this, the results are clearly interesting, but I just don't understand what problem this solves, it just seems to raise the noise floor on reality.
> It's somewhat possible still to discern images generated using AI.
Currently, if the prompt includes "photograph" and involves humans or animals, there is a chance you can get a good image once out of every 50 generated. By "good" I mean "does not need to be retouched". If I use one of the other models (instead of Stable Diffusion's model), that goes to 1 in 10 images.
If I use img2img and all of my understanding of the generation process, I can probably generate 8 images and get 3 usable ones.
All this is to say: depending on the subject, I can generate stuff that is indistinguishable from traditional art, in whatever medium or representation you want. And I can't make art, edit photos, etc. My understanding of traditional art and editing is limited to crop/rotate and adjusting the lights and darks to get the correct contrast and color rendering, then save and publish.
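Those hit rates translate directly into expected work per usable image. A back-of-the-envelope calculation in plain Python, treating each generation as an independent success with fixed probability (a simplification; real hit rates vary with prompt and seed):

```python
# Expected number of generations needed to collect k usable images
# when each generation succeeds independently with probability p.
def expected_generations(p, k=1):
    return k / p

# Hit rates quoted above for different workflows:
print(expected_generations(1 / 50))      # base model, "photograph" + people -> 50.0
print(expected_generations(1 / 10))      # a different model -> 10.0
print(expected_generations(3 / 8, k=3))  # img2img + experience, 3 usable -> 8.0
```

So going from 1-in-50 to 3-in-8 is roughly an 18x reduction in wasted generations, which is why workflow skill matters so much here.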
On your primary topic, something I have noticed ramping up in the past 10 days or so is weird typos and even "new words" being created, where it looks like someone was typing on a phone and didn't bother to check the output. I'm not sure if it's some new AI/ML, or perhaps a bug in the "keyboard input" part of Android. It's like the opposite of your adding a dash in "defederated" because there's a red squiggly line under the word in the input box.
Currently, with the hallucination problems, AI is not good at the thing people want, which is trusted information. However, it is good at the things people don't want: it's easy to use for fake scams, deceitful info, etc. It's a somewhat decent tool for promotional blurbs and advertising, though that kind of falls into the deceitful category anyway :-P
Not sure what the image generators are good for beyond social media clout. They are oversaturating the supply of certain styles of images. Handy for grabbing an image for your blog post, but a bit lacking for professional work, as you can't create something to an exact specification.
As someone who's generated many thousands of images on Midjourney, I agree.
People think they can waltz in and immediately get great results from using AI's to generate images... and they can, if they're lucky or if they copy somebody else's prompt.
It's a lot harder to do so consistently, or if you want your images to look both good and original, and not like mere copies of what everyone else is doing.
The images already look better than what something like 99.9% of humans would be able to produce, and it produces them orders of magnitude faster than any human could ever hope to, even when equipped with Photoshop, 3D software, and Google image search.
The only real problem with them is that they are conceptually simple. It's always just a subject in the center, a setting, and some modifiers. It doesn't produce complex Renaissance paintings with dozens of characters, or a graphic novel telling a story. But that seems more an issue of the short text descriptions it gets than a fundamental limit of the AI.
As for the AI-typical image artifacts, I don't see that as an issue. Think of it like upscaling an image: if you upscale it too much, you'll see the pixels. In this case you see the AI struggling to produce the necessary level of detail. Scale the image down a bit and all the artifacts go away. It's really no different than taking a real painting done by a human and getting close enough to see the brush strokes. The illusion will fall apart just the same.
I wouldn't look for hidden reasons. Recent image generators are already too good with face generation (thanks to CelebA-like datasets and early researchers).
And now the emphasis is on the multimodality of the model within a domain. There, almost every picture demonstrates some aspect of it. Somewhere there is text on the picture (old AI used to output bullshit instead of letters), somewhere there are humorous references to old images (for example, a cosmonaut on a pig).
Even though it is possible to generate such images without training on real-life examples, do you deny if an AI were trained on those examples it would be better at this task? Combine that with the fact that it's hard (if not impossible) to tell after the fact whether an AI model has or has not been trained on any one specific image, and it seems like by allowing this you could be unintentionally creating a way to launder real-life CSAM through generative AI.
It's interesting that content generation AI (text, art, etc.) is really being optimized for our flawed human perception. Which means a lot of stuff is going to look good on the surface, but be deeply flawed underneath.
AI-generated imagery is usually riddled with inaccuracies and errors. Not to mention the ethical problems of using artwork without consent to train the AI.
So, it might be worse than a random stock photo in those aspects.
There are also second-order giveaways that someone is using AI generation. In the case of photos, the photographer would probably take numerous shots of the subject before submitting the best one, and if challenged they could produce the rest of them as evidence that they're the real deal. As far as I'm aware, using AI to generate a plausible series of photos with all of the details consistent between them is much more difficult than generating just a single plausible photo.
In the case of artwork, the author of even the most convincing, artifact-free AI generated piece will immediately crumble if asked to show WIPs, non-flattened project files or timelapses. I have seen some charlatans attempt to fake WIPs by using style transfer to turn their finished piece back into a "sketch" but the results aren't very convincing, the models aren't trained on the process of creating art conventionally so they're not good at faking it.
Still, it's a hard case to make that any AI-generated imagery lacks sufficient human input into the creation process. Both Midjourney and Stable Diffusion offer a number of parameters allowing the user to control how the generation will go (CFG, samplers, etc). So IMO there is necessarily some human input required to make the process work well.
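For context on one of those knobs: the CFG (classifier-free guidance) scale blends the model's prompt-conditioned prediction with its unconditioned prediction at each denoising step, so the user directly controls how strongly the prompt steers the output. A toy illustration of the arithmetic in plain Python (real models apply this to large noise tensors, not four numbers, and the values here are made up):

```python
# Classifier-free guidance: push the conditional prediction away from
# the unconditional one. scale=1 reproduces the conditional prediction;
# higher values follow the prompt more aggressively (and can distort).
def cfg(uncond, cond, scale):
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond_pred = [0.1, 0.4, -0.2, 0.0]   # hypothetical noise estimates
cond_pred   = [0.3, 0.2,  0.1, 0.5]

guided = cfg(uncond_pred, cond_pred, scale=7.5)  # a commonly used default scale
```

That per-step user control over the guidance (plus sampler and seed choices) is part of why the "no human input" framing is hard to sustain.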