Wednesday, 20 August 2025

The aesthetics of AI-generated videos

Over the past year or so, one application of AI technology has really taken off: the generation of videos.

The fundamental principle of such video generation is as follows:

1. An image is needed. This can be either a real photograph or an image generated by AI through a text prompt. The latter of course provides great freedom, but often needs refinement, because the initial prompt may not give the desired result. With this method, anything is possible: even unrealistic images can be created, for example women with non-realistic physical features. One can also aim for a specific style, be it anime, cartoon, or something else. It appears that many of the characters created with this method are cute young women, and they do look quite similar. But that is misleading and does not show the full potential; it shows only the bias of the creators, i.e. the ones who write the prompts. There are wonderful examples of mature women being generated who are no longer those young, attractive, child-like girls, but whose faces show the full experience of an exhausting life. It is these images that carry the most innovation in this newly evolving art domain. There are many AI models, trained on specific sceneries and styles, and they can be modified by LoRAs to focus on a very specific scenario and create realistic results.

Real photographs can also be used as input images. This is somewhat limiting, because reality is something one has little influence over. On the other hand, it allows very tight control over what one actually wants to see, without depending on a vague text prompt whose interpretation by the AI is never fully controllable. One can dress the subject in exactly the clothing one wants to be seen, and shape the scenery precisely. Because these images may contain real people, some of the AI video sites actually forbid using them, for the sake of the privacy and protection of those in the picture.

AI-generated imagery can range from fully photorealistic to cartoonish, while real imagery conveys the sense of an actual environment and creates an aura of full realism.

--- For animating the OneSixthual dolls with such AI videos, only real imagery is useful as input: pictures of these dolls are taken, and the full creative process already begins here, in setting up the scenery and taking the photos, portraits, etc.

2.  In the next step, this input image is turned into a video by image-to-video processing. For this, too, there is a large number of AI models that perform the task. I have not yet fully grasped which model is useful for which task or action, and these models can again be extended by specific LoRAs to create videos for a particular situation or action. What is to happen in the video is again described by a text prompt, detailing the actions of the people in the image and possibly other things that are supposed to happen.
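As a rough sketch, such an image-to-video request can be thought of as a bundle of parameters. The names here (`ImageToVideoRequest`, `model`, `loras`) are illustrative assumptions, not any particular site's or library's API:

```python
from dataclasses import dataclass, field

@dataclass
class ImageToVideoRequest:
    """Illustrative bundle of the inputs an image-to-video model typically takes.

    All field names are hypothetical; real services differ in naming and detail.
    """
    input_image: str            # path or URL of the input image
    prompt: str                 # text describing what should happen in the clip
    duration_s: float = 5.0     # typical clip lengths are 3 to 10 seconds
    model: str = "generic-i2v"  # hypothetical model identifier
    loras: list = field(default_factory=list)  # optional style/action LoRAs

req = ImageToVideoRequest(
    input_image="doll_portrait.png",
    prompt="The woman slowly turns her head and smiles.",
)
```

The point is only that the prompt describes the action while the image fixes the appearance; everything else is knobs.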

These videos are usually quite short: sometimes 3 seconds, sometimes 5, and currently often up to 10 seconds. One approach can extend an existing video snippet to up to 20 seconds.

So in order to create a longer video or a film, these short snippets need to be concatenated. This often becomes a problem when one snippet ends in a situation that is incompatible with the next snippet, which was created from a different input image. The video then becomes inconsistent, and one may have to cut and edit even these small snippets to achieve a more or less logical flow. The result is a video full of cuts, with very short scenes that sometimes repeat an action from a different viewpoint. This can make for an interesting but also somewhat hectic flow, and if the inconsistencies are too large, the course of action and the story are no longer credible.

There is a solution: some of these AI image-to-video models accept both a starting image and an end image. The algorithm then interpolates between the two, following the instructions given in a text prompt. Ideally, when the snippet really ends with the provided end image, one can create seamless scenes in which each end image is the start image of the next snippet. This does not always work correctly; I have seen AI models where the end image was not actually reached. But I have also had many good experiences where the snippets concatenated nicely, and the resulting video was a smooth, seamless visual narrative over several minutes. This emulates the film technique of recording a scene in one uninterrupted take, or of a theatre play. The creative possibilities are endless and very encouraging. A cut that introduces a genuinely new scene can then be used to structure and segment the whole film. This, I believe, is the way into the future of AI video, and I am pursuing it by creating long seamless scenes with this technique.
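The chaining idea can be sketched in a few lines. Here images and frames are just identifiers, and the equality check stands in for an actual image comparison; that simplification is an assumption for illustration:

```python
def chain_is_seamless(snippets):
    """Check that each snippet's end image is the start image of the next.

    Each snippet is a (start_image, end_image) pair.
    """
    return all(a[1] == b[0] for a, b in zip(snippets, snippets[1:]))

def concatenate(snippet_frames):
    """Join per-snippet frame lists into one continuous shot.

    The first frame of each later snippet is dropped, because it repeats
    the last frame of the previous one (end image == next start image).
    """
    frames = list(snippet_frames[0])
    for clip in snippet_frames[1:]:
        frames.extend(clip[1:])
    return frames

clips = [("A", "B"), ("B", "C"), ("C", "D")]
print(chain_is_seamless(clips))  # True: each end image starts the next clip
```

If any end image is not actually reached by the model, the seamless check fails and that joint is exactly where a visible cut would appear.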

There is even a further option: instead of only two images, one can supply a whole series of "elements", i.e. individual images through which a small video snippet should interpolate (I have seen this with up to 7 elements). This gives full control and extends the stop-motion technique into a fully seamless video creation tool.
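Structurally, such a series of elements is just the start/end technique applied repeatedly: k elements define k-1 interpolated segments, each segment's end being the next segment's start. A minimal sketch:

```python
def elements_to_segments(elements):
    """Turn a list of keyframe 'elements' into consecutive (start, end) pairs,
    one pair per interpolated sub-snippet: k elements give k - 1 segments."""
    return list(zip(elements, elements[1:]))

segments = elements_to_segments(["E1", "E2", "E3", "E4"])
print(segments)  # [('E1', 'E2'), ('E2', 'E3'), ('E3', 'E4')]
```

Seen this way, the 7-element case is simply a chain of 6 start/end interpolations that is seamless by construction.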


This is the current state of the art of AI video creation. I hope that seamless generation of snippets longer than 10 seconds will become possible, because such a short duration can only hold a small action, not a longer continuous shot with an ongoing activity.

The start-end and the elements techniques are already very helpful, and I am currently experimenting with them. For a smooth, consistent video and story flow I find these techniques absolutely necessary. But there is also an aesthetic reason for using only those short video snippets, which then create a flow that is frequently interrupted by cuts.


