
Overview of generative AI animation techniques [March 2024]



Condensed graph showing the overview of generative AI animation techniques and tools (December 2023)
Simplified graph of the structure seen in this post. You're welcome to share it! Just don't claim ownership.

In this post I attempt to hierarchically lay out and categorize the current array of generative AI techniques that can be used in animation, giving brief descriptions, examples, pros and cons, and links to find associated tools. It's the kind of resource I wish I had a year ago as an animator, when trying to navigate the chaotic network of possibilities and ever-growing progress. Video stylization use cases, while somewhat overlapping, are mostly left out here.

It is aimed at anybody curious, but mostly at other animators and creatives who might feel intimidated by the accelerating progress in the field. Hopefully it allows you to catch up and keep an eye on the scene at a deeper level than a TikTok feed.

Disclaimers: 

  • It's my best attempt at the time of writing, based on my possibly subjective analysis as an animator and some amount of personal opinion. I hope to keep refining it collectively though!
  • The list skips older tools, like those based on GAN models, as diffusion-based models have become more established and popular.
  • This guide is not a tutorial, but the communities of most tools are teeming with helpful content. To get started, use keywords from this guide to look online!

Glossary:

What actually is AI?

Refer to my ramblings in part 1 of the intro.

AI Model

Refers to neural network models, each trained on a specific kind of data and designed with a specialized behavior in mind. An "AI" as used broadly in the media usually refers to an application (tool) that employs one such model, or sometimes several working together. As a user you can rely on these applications, which usually (but not necessarily) conceal the actual model and expose only limited controls and parameters, or you can use the models directly if they are open source, which also lets you fine-tune them through further training or other customization.

AI Tool

Refers to any code, software, or application, whether online or running locally on your computer, that wraps around AI models or relies on them in some way. I won't fight you if you object to referring to AI as "tools", but the term only makes sense in this specific context, at least for now.

Diffusion

Refers to the family of generative diffusion-based models that dominate the field at the moment. They generate results by iteratively "revealing" an image from noise, step by step, in a process called "denoising".
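
For the technically curious, here is a toy sketch of that loop in PyTorch, assuming a trained noise-prediction model `eps_model` and a precomputed noise schedule `alphas_cumprod` (both placeholders, not any specific library's API); real pipelines add text conditioning, latent-space encoding, and fancier samplers on top of this basic mechanic.

```python
import torch

def denoise(eps_model, shape, alphas_cumprod):
    """Toy reverse-diffusion loop: start from pure noise and iteratively reveal a result."""
    x = torch.randn(shape)                      # pure noise
    T = len(alphas_cumprod)
    for t in reversed(range(T)):                # step-by-step denoising
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = eps_model(x, t)                   # the model predicts the noise present in x
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # current estimate of the clean result
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # move one step closer to t = 0
    return x
```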

[Input 2 Output]

A widespread expression indicating the type of input/output pair used by an AI application or model: the "input" conditions the "output" result. It is usually used very loosely. "Video 2 video", for example, can mean very different things under the hood on different occasions, but it's nevertheless useful for signalling the kind of workflow available to an end user.

Notebook

Refers to a Python-based collection of structured, annotated code that is easy to share. Most applications work by controlling AI models through Python and specialized libraries like PyTorch, which can be run in these notebooks. Notebooks are often shared as user-ready tools for people to run either locally or on remote hardware.

Seed

An initial input, often a random vector or value, used to initialize the generation that produces the result. The same seed will generally produce the same result if other variables don't change. Manipulating the seed across many generations can be done creatively to induce various desired or experimental effects.
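
As a minimal illustration, this is how fixing the seed works with the diffusers library and a Stable Diffusion checkpoint (the model ID and prompt are just examples): the first two images come out identical, the third differs.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a hand-painted stop-motion puppet in a mossy forest"
for i, seed in enumerate((42, 42, 43)):  # same seed twice, then a different one
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"frame_{i}_seed_{seed}.png")
```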

  • Generative image

    Techniques that rely on generative image AI models, which were trained on static images.

    • Generative image as material and assets

      The author of the short film "Planets and Robots" uses digital cutout to animate the generated AI images. It also plays with LLMs to generate the voice-over script.

      Using static images generated by any AI app as assets in a traditional workflow such as 2D cutout, digital manipulation, or collage, or even as a source for other AI tools that offer, for example, "image2video". Beyond the origin of the images and material, this technique depends on your usual skill set for cutting and manipulating images.

      PROS
      • Easy to transition into for existing animators.
      • Can help with backgrounds.
      CONS
      • Doesn't feel too "fresh".
      • Relies on strong synergy between the material and the animation.
      TOOLS
      FREE or PAID: any generative image model or app.
      Animating can be done using After Effects, Moho, Blender, etc.
    • Generative image frame-by-frame

      Animation likely done with Stable WarpFusion, involving I2I loops, and some underlying video input that is warping (displacing) the animation. Author - Sagans.

      This encompasses all techniques that use generative diffusion image models in a rather animation-native spirit, generating sequences of motion frame by frame, the way you would draw and shoot traditional animation. The key aspect is that these models have no concept of time or motion when generating each image; it is up to mechanics added on top, and to various applications or extensions, to produce some sort of animated imagery in the end, often referred to as having "temporal consistency".

      These techniques usually produce a characteristic flicker in the animations. While many users of these tools aim to clean that up as much as possible, animators will tell you that it's called "boiling" and has been a staple of animation art all along.

      Mostly applicable to open source models such as Stable Diffusion and the tools built on them, which expose their parameters and can run on local hardware. For comparison, something like Midjourney keeps its model concealed behind an interface streamlined for still pictures, so it can't be used for these techniques.

      It usually consists of these techniques mixed and layered together:

      • Standalone (Text 2 Images):

        There are several novel techniques to generate animations with only text prompts and parameters this way:

        • Parameter interpolation (morphing)
          Prompt editing with gradually changing weights creating a transition. Depth ControlNet was used to keep the overall hand shape consistent.

          Gradually interpolating parameters on each generated image frame to produce a change in the animation. Parameters can be anything to do with the model, such as the text prompt itself, or the underlying seed ("latent space walk").

        • Image 2 Image (I2I) feedback loops
          Using a starting image and a prompt for something different makes the image deteriorate into something else frame by frame.

          Using each generated image frame as the input for the following frame of the animation through "image 2 image". This makes it possible to produce similar-looking frames in sequence even while other parameters change and the seed doesn't stay fixed. It is usually controlled through "denoising" strength, or the "strength schedule" in Deforum. The starting frame can also be a pre-existing picture.

          It's the core building block of most animation implementations that use Stable Diffusion, and many of the techniques listed below rely on it (see the code sketch at the end of this list). It is very delicate to balance and depends a lot on the sampler (noise scheduler) used.

        • 2D or 3D transformation (on I2I loops)
          The endless zoom-in that everybody and your grandma has seen already. It works so well because you can rely on SD continuously dreaming up new details.

          Gradually transforming each generated frame before it is sent back as input in I2I loops. 2D transformations correspond to simple translation, rotation, and scale. 3D techniques imagine a virtual camera moving in 3D space, which is usually done by estimating 3D depth in each generated frame and then warping it according to the imagined camera motion.

        • Experimental, motion synthesis, hybrid, and other techniques
          Made with SD-CN Animation, which has a unique method of hallucinating motion across generated frames. A starting image was used for init, but nothing else.

          Motion synthesis is about trying to "imagine" motion flow between subsequent generated frames, then using that flow to warp them frame by frame and instill organic motion into I2I loops. This usually relies on AI models trained for motion estimation (optical flow) in videos, except that instead of looking at subsequent video frames, the model is pointed at subsequent generated frames (through I2I loops), or some hybrid of the two.

          Other techniques may include advanced use of inpainting together with warping, multiple processing steps, or even taking snapshots of the model's training process. Deforum, for example, is loaded with knobs and settings to tinker with.
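
        As referenced above, here is a minimal sketch of how these standalone pieces combine: an I2I feedback loop with one gradually interpolated parameter and a small 2D zoom applied before each frame is fed back, written with the diffusers library (the model ID, prompt, and values are only examples; Deforum and similar tools wrap the same mechanic in far more scheduling options).

        ```python
        import torch
        from diffusers import StableDiffusionImg2ImgPipeline
        from PIL import Image

        pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
        ).to("cuda")

        frame = Image.open("init_frame.png").convert("RGB").resize((512, 512))
        frames = [frame]

        for i in range(48):
            # Parameter interpolation: slowly shift the guidance scale (any parameter works).
            guidance = 7.0 + 3.0 * (i / 48)
            # 2D transformation: zoom in slightly before feeding the frame back.
            w, h = frame.size
            zoomed = frame.crop((8, 8, w - 8, h - 8)).resize((w, h))
            # I2I step: low denoising strength keeps consecutive frames similar.
            frame = pipe(
                prompt="a coral reef growing into a cathedral, concept art",
                image=zoomed,
                strength=0.45,
                guidance_scale=guidance,
                generator=torch.Generator("cuda").manual_seed(1000 + i),
            ).images[0]
            frames.append(frame)

        frames[0].save("i2i_loop.gif", save_all=True, append_images=frames[1:], duration=83)
        ```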

      • Transformative (Images 2 Images):

        Additionally, some sort of source input can be used to drive the generated frames and resulting animation:

        • Blending (stylizing) - mixing with a video source and/or conditioning (ControlNets)
          Deforum's hybrid mode with some ControlNet conditioning fed from a source video (seen on the left). Masking and background blur were done separately and are unrelated to this technique.

          This is a broad category of ways to mix and influence generated sequences with input videos (broken into individual frames), often used to stylize real-life videos. At the moment it is riding a trend wave of stylized dance videos and performances, often going for an anime look and sexualized physiques. You may use anything as input though, for example rough frames of your own animation, or any miscellaneous and abstract footage. There are wide possibilities for imitating "pixilation" and replacement-animation techniques.

          Input frames can either be blended directly with the generated images each frame before being fed back into the I2I loop, or, in more advanced cases, used for additional conditioning such as ControlNets.

        • Optical flow warping (on I2I loops with video input)
          Deforum's hybrid mode allows this technique with a variety of settings. Increased "cadence" was also used for a less flickery result, so the warping would show up better. Masking and background blur were done separately and are unrelated to this technique.

          "Optical flow" refers to motion estimated in a video, expressed as motion vectors for each pixel of each frame in screen space. When optical flow is estimated for the source video used in a transformative workflow, it can be used to warp the generated frames accordingly, making generated textures "stick" to objects as they, or the camera, move across the frame (a rough sketch of this mechanic appears after this list).

        • 3D derived

          The conditioning used in transformative workflows may also be tied directly to 3D data, skipping a layer of ambiguity and processing done on video frames. Examples include OpenPose or depth data supplied from a virtual 3D scene rather than estimated from a video (or from a video of a CG render). This allows the most modular and controllable, 3D-native approach, especially powerful when combined with methods that help temporal consistency, such as optical flow warping.

          This is probably the most promising overlap between established techniques and AI for VFX, as seen in this video.

          One of the most extensive tools for this technique is a project that simplifies and automates the generation of ControlNet-ready character images from Blender. In this example, the hand rig is used to generate OpenPose, depth, and normal-map images for ControlNet, with the final SD result seen on the right. (OpenPose was discarded in the end, as it proved unusable for hands alone.)
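
          As a minimal sketch of that last step, assuming a depth pass rendered straight from the 3D scene (e.g. out of Blender) and a rough init frame, this is roughly how a depth ControlNet gets wired into an I2I step with the diffusers library (model IDs, file names, and values here are illustrative, not the specific tool's code):

          ```python
          import torch
          from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel
          from diffusers.utils import load_image

          controlnet = ControlNetModel.from_pretrained(
              "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
          )
          pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
              "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
          ).to("cuda")

          init_frame = load_image("rough_frame.png")   # e.g. the previous generated frame or a CG render
          depth_pass = load_image("depth_pass.png")    # depth image rendered directly from the 3D scene

          result = pipe(
              prompt="clay hand, stop-motion style, studio lighting",
              image=init_frame,            # what the I2I step starts from
              control_image=depth_pass,    # 3D-derived conditioning keeps the shape locked
              strength=0.5,
              guidance_scale=7.0,
              generator=torch.Generator("cuda").manual_seed(0),
          ).images[0]
          result.save("stylized_frame.png")
          ```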

          This blog, and "Diffusion Pilot" in particular, also focus on this approach, so stay tuned! 👀
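
        And here is the rough mechanic behind the optical flow warping mentioned a few items above, sketched with OpenCV's Farneback flow (real pipelines such as Deforum's hybrid mode or WarpFusion use learned flow models like RAFT, but the principle is the same): estimate per-pixel motion in the source video, then drag the generated frame along it before the next I2I step.

        ```python
        import cv2
        import numpy as np

        def warp_generated_frame(generated_frame, source_prev, source_next):
            """Warp a generated frame along the motion estimated between two source-video frames."""
            prev_gray = cv2.cvtColor(source_prev, cv2.COLOR_BGR2GRAY)
            next_gray = cv2.cvtColor(source_next, cv2.COLOR_BGR2GRAY)
            # Dense optical flow: one motion vector per pixel of the source video.
            flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            h, w = flow.shape[:2]
            grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
            # Pull pixels from the generated frame along the flow so its textures
            # "stick" to the moving objects in the source footage.
            map_x = (grid_x - flow[..., 0]).astype(np.float32)
            map_y = (grid_y - flow[..., 1]).astype(np.float32)
            return cv2.remap(generated_frame, map_x, map_y, cv2.INTER_LINEAR)
        ```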

      With all of these techniques combined, there are seemingly endless parameters that can be animated and modulated (much like in modular audio production). They can either be "scheduled" with keyframes and graphs in something like Parseq, or linked to audio and music, allowing for all kinds of audio-reactive results. You can make Stable Diffusion dance for you just like that.
      PROS
      • Novel, evolving aesthetics, unique to the medium.
      • Conceptually reflects the tradition of animation.
      • The most customizable, hands-on, and open to directing.
      • Modular, layered approach.
      • Can be conditioned with video frames or complex data such as 3D render passes.
      CONS
      • Often flickery and somewhat chaotic.
      • Dense on a technical level and delicate to balance; advanced results have a steep learning curve.
      • Usually inconvenient without good local hardware (an NVIDIA GPU).
      TOOLS
      FREE PAID
        Tools to use in A1111 webui (If you have sufficient hardware)*:
      • Small scripts for parameter interpolation animations (travels): steps, prompts, seeds.
      • Deforum - the best powerhouse for all animated SD needs, incorporating most of the techniques listed above.
      • Parseq - popular visual parameter sequencer for Deforum.
      • "Deforum timeline helper" - another parameter visualization and scheduling tool. 
      • Deforumation - GUI for live control of Deforum parameters, allowing reactive adjustment and control.
      • TemporalKit - adopts some principles of EBsynth to use together with SD for consistent video stylization.
      • SD-CN Animation - somewhat experimental tool, allowing some hybrid stylization workflows and also interesting optical flow motion synthesis that results in turbulent motion.
      • TemporalNet - a ControlNet model meant to be used in other workflows like Deforum's, aiming to improve temporal consistency. 
      Python notebooks (to be run on Google Colab or Jupyter)*:
      • Stable WarpFusion - experimental code toolkit aimed at advanced video stylization and animation. Overlaps with Deforum a lot.
      Plugins and addons:
      • Diffusae for After Effects
      • A1111, ComfyUI, StreamDiffusion, and other API components for TouchDesigner by DotSimulate - available through his Patreon tiers with regular updates.

      There might be many random apps and tools out there, but even if they're paid, they are likely based on the open source Deforum code and act as simplified cloud versions of the same thing.
      * Optimally you have decent enough hardware, namely a GPU, to run these tools locally. Alternatively, you may be able to try them on remote machines, as in Google Colab, but most free plans and trials are very limiting. Anything designed as a notebook for Google Colab can still be run on local hardware, though.

      MORE EXAMPLES:

      Clever animation likely made with a fine-tuned model or strong reference conditioning. It makes heavy use of optical flow warping, with the source for that probably being videos of similar dancers.
      Deforum animation incorporating advanced optical warp techniques.
      "Creative AI art" by u/Vishwasm123 in r/woahdude.
      I unfortunately cannot dig up the original author and exact techniques used, but very likely it relies on Deforum. It shows the great potential for imaginative workflows using input material in transformative techniques.
      Animation done with the SD-CN Animation extension, which employs motion synthesis techniques that produce the turbulent motion.
      Deforum animation from one of the main current contributors to its code. This one showcases 3D camera movement technique especially well.
      Animation from a solo show of LEGIO_X, who has used neural frames for their work.
  • Generative video

    Techniques that rely on generative video AI models, which were trained on moving videos or otherwise enhanced with temporal comprehension at the neural network level.

    At the moment, a common trait of these models is that they're often limited to clips of very short duration (several seconds), bound by the available video memory on the GPU. In cases where this has been worked around, the clips usually lack meaningful change and action over longer periods of time, and are more akin to animated slideshows.

    *Since the initial publication of this article, a big elephant has entered the room, going by the name of Sora. I will discuss and integrate it into this guide once it's available to the general public.

    • Generative video models

      AI-generated video made from only Image and Text prompts using Runway's Gen-2 by Paul Trillo.

      This refers to using models that were built and trained from the ground up to work with video footage.

      Results today will likely look somewhat wobbly, AI-awkward, and uncanny, much the way most generated AI images did not so long ago. Video is slightly lagging behind and improving rapidly, but my personal take is that the progress we saw with static images won't translate proportionally to video generation, as it is an exponentially harder problem to crack. Generally, the better a generated video clip looks, the less interesting its action and motion is, because drastic movement is usually where these models fall apart into the uncanny.

      I suppose the boundary between animation and conventional film is messy here. As long as the results don't yet match reality, all of it is, in a way, a weird new genre of animation and video art. For now, I'd encourage you to forget about replicating real film and to use this as a new form of experimental media. Have fun!

      • Standalone (Text 2 video)

        One of the animation tests Kyle Wiggers did for his article using Runway's Gen2

        Using text prompts to generate entirely new video clips.

        In theory this is limitless, with the possibility of going for either a live-action look or anything surreal and stylized, as long as you can describe it, just like with static image generation. In practice, though, gathering diverse and large enough datasets to train video models is much harder, so niche aesthetics are difficult to reach with text conditioning alone.

        Runway's presentation of "Multi motion brush" feature on their video generator tools

        On its own, true creative control here is quite weak, but it becomes much more empowering when coupled with image or video conditioning, in what you might call "transformative" workflows. Additionally, new forms of motion control and conditioning are emerging, such as MotionCtrl or Runway's Multi Motion Brush.

      • Transformative:

        Using text prompts in combination with further conditioning from existing images or videos.

        • Image 2 Video
          The album artwork was used as a starting image for each of the generated clips. Author - Stable Reel.

          Many generative video tools let you condition the result on an image, either starting exactly from the image you specify or using it as a rough reference for semantic content, composition, and colors.

          People often generate the starting image itself with a static image model before supplying it to the video model (a minimal code sketch of this route follows below).

        • Video 2 Video
          With some luck and appropriate prompts, you can use an input video to "inspire" the model to reimagine the motion of the source video with a completely different look. Done with Zeroscope in the webui text2video extension, using vid2vid mode.

          Similar to the image 2 image process in generative image models, it is possible to embed input video information into a video model as it generates (denoises) the output, in addition to the text prompt. I lack the expertise to explain exactly what's happening, but this process appears to match the input video clip not only on a frame-by-frame level (as stylization with Stable Diffusion would), but also on a holistic, movement level. It is controlled with a denoising strength, just like image 2 image.
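
        As a minimal example of the image-conditioned route mentioned above, this is roughly how the open Stable Video Diffusion model is driven through the diffusers library (one of many image 2 video options; the file paths are placeholders):

        ```python
        import torch
        from diffusers import StableVideoDiffusionPipeline
        from diffusers.utils import load_image, export_to_video

        pipe = StableVideoDiffusionPipeline.from_pretrained(
            "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
        ).to("cuda")

        # The starting image could itself come from a generative image model.
        image = load_image("starting_frame.png").resize((1024, 576))

        frames = pipe(
            image,
            decode_chunk_size=8,                 # trade VRAM for decoding speed
            generator=torch.manual_seed(42),
        ).frames[0]
        export_to_video(frames, "image2video_clip.mp4", fps=7)
        ```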

      PROS
      • The most open-ended set of techniques, and one that will only improve with time.
      • No barrier to entry in terms of professional animation knowledge.
      • Compared to frame-by-frame techniques, much smoother and usually more coherent.
      • Potentially a more straightforward route to transformative workflows than frame-by-frame approaches.
      CONS
      • Often awkward and uncanny looking, more so than static images; most apparent in realistic footage involving people.
      • Computationally expensive; less accessible to run on your own hardware than image AI.
      • Limited to short durations and context (for now).
      TOOLS
      FREE PAID (with trials)
      Plugins and addons:
      • Pallaidium for Blender - a multi-functional toolkit crammed with generative functionality across image, video and even audio domains.

      Additionally, you may find some free demos on Hugging Face Spaces.
      * Optimally you have decent enough hardware, namely GPU, to run these tools locally. Alternatively, you may be able to try running these models through remote machines, like in Google Colab, but most free plans and trials are very limiting.

      MORE EXAMPLES:

      Short film made with the generative video model ModelScope.
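
      For reference, prompting that same family of open text 2 video models (ModelScope, and Zeroscope, which loads the same way) through the diffusers library looks roughly like this; the prompt and settings are just examples, and the exact output format can shift between library versions:

      ```python
      import torch
      from diffusers import DiffusionPipeline
      from diffusers.utils import export_to_video

      pipe = DiffusionPipeline.from_pretrained(
          "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
      ).to("cuda")

      result = pipe(
          "a paper-cutout astronaut riding a horse across a desert",
          num_inference_steps=25,
          num_frames=24,                 # short clips only; VRAM is the hard limit
      )
      export_to_video(result.frames[0], "text2video_clip.mp4", fps=8)
      ```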
    • Image models enhanced with motion comprehension

      Animation done using AnimateDiff in ComfyUI, by animating between several different prompt subjects.

      With the growing popularity of AnimateDiff, this is an emerging field: enhancing established image diffusion models with video or "motion" comprehension. The results are more similar to those of native video models (shown above) than to what you would get with frame-by-frame techniques. The appeal is that you can also utilize everything that has been built for these image models, such as Stable Diffusion, including any community-created checkpoint, LoRA, ControlNet, or other kind of conditioning.

      The motion itself in this technique is often quite primitive, only loosely interpolating objects and flow throughout the clip, often morphing things into other things. It does so with much more temporal consistency though (less flicker), and the technique is still in its infancy. The best results come from abstract, less concrete subjects and scenes.

      The community is actively experimenting with this tech (see MORE EXAMPLES). The techniques draw both from static image models (such as prompt travel) and from video-native models and their advancements. In some cases, people are trying to squeeze out smoother video or 3D render stylization than image-model frame-by-frame techniques can provide.
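
      A minimal sketch of what this looks like with the AnimateDiff support in the diffusers library, where an off-the-shelf SD v1.5 checkpoint gets a motion module bolted on (the model IDs and settings are examples; most compatible community checkpoints load the same way):

      ```python
      import torch
      from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
      from diffusers.utils import export_to_gif

      # The motion module that adds temporal layers on top of a regular SD v1.5 model.
      adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
      pipe = AnimateDiffPipeline.from_pretrained(
          "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
      ).to("cuda")
      pipe.scheduler = DDIMScheduler.from_config(
          pipe.scheduler.config, beta_schedule="linear", clip_sample=False
      )

      output = pipe(
          prompt="ink wash painting of drifting clouds, flowing, abstract",
          num_frames=16,
          num_inference_steps=25,
          guidance_scale=7.5,
          generator=torch.Generator("cuda").manual_seed(7),
      )
      export_to_gif(output.frames[0], "animatediff_clip.gif")
      ```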

      PROS
      • Benefits from all the development that has been done on existing image diffusion models.
      • Can be conditioned with video or complex data such as 3D render passes.
      • Very good with abstract, flowing motion.
      CONS
      • Does not work well for complex, coherent motion of characters or unusual objects, often morphing them instead.
      • Computationally expensive, just like video-native models; less accessible to run on your own hardware than image AI.
      • Limited by a somewhat short context window (for now), although people keep experimenting with workarounds.
      TOOLS
      FREE PAID
      Currently, implementations of AnimateDiff (for SD v1.5) are leading the charge here:
      for SD XL:
      Multi-functional implementations:
      • DomoAI - a web app that, to my eye, packages community-driven video stylization workflows into a paid service.

      MORE EXAMPLES:

      "Just some of my work" by u/StrubenFairleyBoast in r/StableDiffusion.
  • Animated faces with speech synthesis

    The author demonflyingfox had created a step-by-step tutorial before even releasing the viral Balenciaga videos.

    I know it, you know it: it's the technique behind a viral meme. Whenever you see a relatively still character (the camera may be moving) with an animated talking face, it likely comes from a particular methodology combining AI face animation and synthetic speech tools.

    It's a combination of several steps and components. The source images are often made with generative image AI, but you may also use any image with a face. The speech is generated from text, conditioned on a chosen character voice. A different tool (or a model within a packaged tool) then synthesizes facial animation with appropriate lip sync from the voice, usually generating motion only in the face and head area of the image. Using pre-trained avatars allows for movement on the body as well.
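
    As a rough sketch of gluing the last two steps together, assuming you have already exported a voice line from your text 2 speech tool of choice and cloned the open-source Wav2Lip repository with its checkpoint downloaded (the file names below are placeholders):

    ```python
    # Run from inside the cloned Wav2Lip repository.
    import subprocess

    face_image = "generated_face.png"   # e.g. a portrait from a generative image model
    speech_wav = "voice_line.wav"       # e.g. exported from ElevenLabs or another TTS tool

    subprocess.run([
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
        "--face", face_image,           # still image (or video) containing the face
        "--audio", speech_wav,          # the speech to lip-sync to
        "--outfile", "talking_head.mp4",
    ], check=True)
    ```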

    PROS
    • Easy memes.
    • ...eeeh, comedic effect?
    CONS
    • Generally looks kind of uncanny. I can't imagine a serious use for these yet.
    • Too reliant on closed-source facial animation tools in paid apps.
    • Results are stiff and not very dynamic, even when training an avatar on your own footage.
    TOOLS
    FREE PAID (with trials)
    • ElevenLabs - constrained usage but limits seem to refresh monthly. 
    •  "Wav2Lip" A1111 WebUI extension - tool for generating "lip-sync" animation. Seems to be limited to mouth area.
    Or search online for "text 2 speech" tools; there are too many to count, but most are likely inferior to ElevenLabs.

    For full face animation, as far as I know, only trial versions of paid apps allow limited free access.
    And more...
    Search for "D-ID alternatives".
  • Generative 3D character motion

    Trailer for Nikita's genius meta AI film, which exposes the AI motion-learning process and channels it into a ridiculously entertaining short.

    This refers to motion synthesis in the context of 3D characters. It can apply to 3D animated film, video games, or other interactive 3D applications. Just as with images and video, these emerging AI tools let you prompt character motion through text. Some can additionally build motion from a very limited number of key poses, or produce animation dynamically, on the fly, in interactive settings.

    Because this list focuses on generative tools, I am leaving out some AI applications that automate certain non-creative tasks, like AI-powered motion tracking, compositing, masking, etc., as seen in Move.ai or Wonder Dynamics.

    PROS
    • Fits inside the established 3D animation workflow, reducing tedious tasks; potentially a utility for skilled animators.
    • Handles physics and weight really well.
    • The future of dynamic character animation in video games?👀🎮
    CONS
    • Seems to be limited to humanoid, bipedal characters.
    • Not self-sufficient; it is only one component of a 3D animation workflow, and you need to know where to take it next.
    • Training is usually done on human motion-capture data, so these techniques currently only deal with realistic, physics-based motion, nothing stylized or cartoony.
    TOOLS
    FREE (or limited plans) PAID
    Paid plans of free tools that provide more features and expanded limits.

    MORE EXAMPLES:

  • LLM powered tools

    In theory, with LLMs (Large Language Models) showing great performance in coding tasks, especially when fine-tuned, you could tell one to program and write scripts inside animation-capable software. The animation would follow the usual workflow, but with AI assisting you throughout; in the extreme case, the AI does everything for you, delegating appropriate tasks across a back-end pipeline.

    In practice, you can kind of already try it! Blender, for example, is equipped with a very extensive Python API that lets you operate it through code, so a couple of ChatGPT-like assistant tools are available already. This is an unavoidable trend: wherever there is code, LLMs will likely find practical use cases.
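
    To make that concrete, here is the kind of tiny script such an assistant might generate and run for you through Blender's bpy API, in this case keyframing a simple bounce on the default cube (purely illustrative):

    ```python
    import bpy

    obj = bpy.data.objects["Cube"]          # the default cube in a fresh scene

    # Keyframe a simple bounce: ground, apex, ground.
    for frame, z in ((1, 0.0), (12, 3.0), (24, 0.0)):
        obj.location.z = z
        obj.keyframe_insert(data_path="location", index=2, frame=frame)

    bpy.context.scene.frame_end = 24        # keep the timeline to one bounce
    ```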

    PROS
    • The promise: the ultimate removal of technical barriers for creatives.
    • Useful as a copilot or assistant in creative software, eliminating tedious, repetitive tasks and digging through documentation for you.
    CONS
    • If AI creates everything for you, then what's the point of being creative in the first place?
    • For now, running capable LLMs generally requires powerful remote machines, so you pay per token or by subscription.
    TOOLS
    FREE PAID
    • Blender Chat Companion - (similar to Blender Copilot) a ChatGPT implementation inside Blender, specialized to handle appropriate tasks. Uses ChatGPT API tokens which are paid.
    • Genmo chat - promises a step towards "Creative General Intelligence", with a multi-step process controlled entirely through a chat interface.
    • Blender Copilot - (similar to Blender Chat Companion) a ChatGPT implementation inside Blender, specialized to handle appropriate tasks. Uses ChatGPT API tokens which are paid.
    There's also the upcoming ChatUSD - a chatbot for working with and managing USD, a standard initially created by Pixar to unify and simplify 3D data exchange and parallelization in animated film production. I can't tell you much more here, but NVIDIA seems to be embracing it as a standard for anything 3D, not just film.

Whew! That was A LOT, but I likely still missed something. Please comment below to suggest entries and tweaks to improve this and keep it up to date. Thank you for reading!