
overview of generative AI animation techniques [December-2024]



Condensed graph showing the overview of generative AI animation techniques and tools (December 2023)
Simplified graph of the structure seen in this post. You're welcome to share it! Just don't claim ownership.

In this post I attempt to hierarchically lay out and categorize the current array of techniques involving generative AI that can be used in animation, giving brief descriptions, examples, pros and cons, and links to find associated tools. It's the kind of resource I wish I had a year ago as an animator, when trying to navigate the chaotic network of possibilities and ever-growing progress. Video stylization use cases, while somewhat overlapping, are mostly left out here.

It is aimed at anybody curious, but mostly at other animators and creatives who might feel intimidated by the accelerating progress in the field. Hopefully this allows you to catch up and keep an eye on the scene on a deeper level than a TikTok feed.

Disclaimers: 

  • It's my best attempt at the time of writing, based on my possibly subjective analysis as an animator, and a tiny bit of personal opinion. I will update and refine this as much as I can though, trying to retain a somewhat neutral stance at least in this particular blog post.
  • The list skips older tools, like those based on GAN models, as diffusion- and transformer-based architectures have become more capable, established, and adopted.
  • This guide is not a tutorial, but the communities of most tools are teeming with helpful content. To get started, use keywords from this guide to look online!

Glossary:

What actually is AI?

Refer to my ramblings in part 1 of the intro.

AI Model

Refers to neural network models, each trained on a specific kind of data and designed with a specialized behavior in mind upon inference. An "AI", as used broadly in the media, usually refers to an application (tool) that employs one such model, or sometimes several working together. As a user you can rely on these applications, which usually (but not necessarily) conceal the actual model and expose only limited controls and parameters, or use the models directly if they are open source, which also lets you potentially fine-tune them through further training or other customization.

AI Tool

Refers to any code, software, or application, whether an online service or running locally on your computer, that is wrapped around AI models or otherwise relies on them. I won't fight you if you reject referring to AI as "tools", but it only makes sense in this specific context, at least for now.

Diffusion

Refers to the family of generative diffusion-based models that currently dominate the field. They generate results by iteratively "revealing" them from noise, step by step, in a process called "denoising".
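For the curious, here is roughly what that loop looks like in code. This is a minimal sketch using the open source diffusers library and the small example model from its own documentation; real animation tools wrap a lot more machinery around this.

```python
import torch
from diffusers import DDPMScheduler, UNet2DModel

# A small unconditional diffusion model and its matching noise scheduler.
model = UNet2DModel.from_pretrained("google/ddpm-cat-256")
scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
scheduler.set_timesteps(50)

# Start from pure noise and "reveal" the image step by step (denoising).
sample = torch.randn(1, 3, 256, 256)
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = model(sample, t).sample                    # predict the noise in the sample
    sample = scheduler.step(noise_pred, t, sample).prev_sample  # remove a little of it
```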

[Input 2 Output]

A widespread expression to indicate the type of input/output pair used in an AI application or a model. The "input" conditions the "output" result. It is usually used very loosely. "Video 2 video" for example can mean very different things under the hood on different occasions, but nevertheless it's useful to indicate the type of possible workflow to an end user.

Notebook

Refers to a Python-based collection of structured code, easily shared and annotated. Most applications work by controlling the AI models through Python and specialized libraries like PyTorch, which can be run in these notebooks. They are often shared as user-ready tools for people to run either locally or on remote hardware.

Seed

An initial input, often a random vector or value, used to initialize the generation that produces the result. The same seed will generally produce the same result if other variables don't change. Manipulating the seed across many generations can be done creatively to induce various desired or experimental effects.
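In tools built on libraries like diffusers, the seed usually boils down to a random generator object passed into the pipeline. A minimal sketch (the checkpoint name is just a placeholder for whatever SD model you use):

```python
import torch
from diffusers import StableDiffusionPipeline

# Any Stable Diffusion checkpoint works here; the ID below is a placeholder.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

gen = torch.Generator("cuda").manual_seed(42)              # <- the seed
image = pipe("a watercolor fox", generator=gen).images[0]
# The same seed + prompt + settings gives (practically) the same image again.
# Nudging or scheduling the seed between frames is one way to "walk" results over time.
```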

  • Generative image

    Techniques that rely on generative image AI models, which were trained on static images.

    • Generative image as material and assets

      The author of the short film "Planets and Robots" uses digital cutout to animate the AI-generated images. It also plays with LLMs to generate the voice-over script.

      Using static images generated by any AI app as assets in a traditional workflow such as 2D cutout, digital manipulation, or collage, or even as a source for other AI tools that, for example, offer "image2video". Besides the origin of the images and material, this technique depends on your usual skillset of cutting and manipulating images, but it can help with assembling large amounts of content quickly for drafts, animatics, or even final versions of animated work in some cases, if used tastefully.

      PROS
      • Easy to transition into for existing animators.
      • Can help with backgrounds and other content that is secondary to the main elements in focus.
      • Can serve as materials for early drafts, animatics, and prototyping.
      CONS
      • Doesn't feel too "fresh".
      • Relies on great synergy between material and animation.
      TOOLS
      FREE or PAID: any generative image model or app, etc.
      Animating can be done using After Effects, Toon Boom, Blender, etc.
    • Generative image frame-by-frame

      Animation likely done with Stable WarpFusion, involving I2I loops, and some underlying video input that is warping (displacing) the animation. Author - Sagans.

      This encompasses all techniques that use generative diffusion image models in a rather animation-native spirit, generating sequences of motion frame-by-frame, like you would draw and shoot traditional animation. The key aspect here is that these models have no concept of time or motion when generating each image; it is up to mechanics added on top, and various applications or extensions, to produce some sort of animated imagery in the end, often referred to as having "temporal consistency".

      These techniques usually possess a characteristic flicker in the animations. While many users of these tools aim to clean that up as much as possible, animators will tell you that it's called "boiling" and has been a staple of animation art all this time.

      Mostly applicable to open source models such as Stable Diffusion and tools built on them, which can be used with exposed parameters and possibly on local hardware. For comparison, something like MidJourney keeps its model concealed, with an interface streamlined for still pictures, so it can't be used for these techniques as easily.

      It usually consists of these techniques mixed and layered together:

      • Standalone (Text 2 Images):

        There are several novel techniques to generate animations with only text prompts and parameters this way:

        • Parameter interpolation (morphing)
          Prompt editing with gradually changing weights creating a transition. Depth ControlNet was used to keep the overall hand shape consistent.

          Gradually interpolating parameters on each generated image frame to produce a change in the animation. Parameters can be anything to do with the model, such as the text prompt itself, or the underlying seed ("latent space walk").
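          As a minimal sketch of the idea (assuming diffusers and a generic SD 1.5 checkpoint), here is a "latent space walk" that interpolates the starting noise between two seeds and renders one frame per step:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # placeholder checkpoint
).to("cuda")

# Two "seed" noise tensors to travel between (latent size for 512x512 output).
shape = (1, pipe.unet.config.in_channels, 64, 64)
a = torch.randn(shape, generator=torch.Generator("cuda").manual_seed(1), device="cuda", dtype=torch.float16)
b = torch.randn(shape, generator=torch.Generator("cuda").manual_seed(2), device="cuda", dtype=torch.float16)

for i in range(24):                              # 24 frames of transition
    t = i / 23
    latents = (1 - t) * a + t * b                # simple lerp; slerp usually looks smoother
    frame = pipe("a watercolor fox", latents=latents).images[0]
    frame.save(f"walk_{i:04d}.png")
```

          The same pattern applies to interpolating prompt weights, guidance scale, or any other exposed parameter.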

        • Image 2 Image (I2I) feedback loops
          Using a starting image, and a prompt of something different makes it deteriorate into something else frame-by-frame.

          Using each generated image frame as input for the following frame of the animation through "image 2 image". This makes it possible to produce similar-looking frames in sequence while other parameters change and the seed doesn't stay fixed. It is usually controlled through the "denoising" strength, or the "strength schedule" in Deforum. The starting frame can also be a pre-existing picture.

          It's a core building block of most animation implementations that use Stable Diffusion, and many of the other techniques listed below rely on it. It's very delicate to balance and depends a lot on the sampler (noise scheduler) used.
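          Stripped of everything else, the loop itself is tiny. A minimal sketch with diffusers (Deforum and friends add transforms, scheduling, and conditioning on top of this basic idea; the checkpoint name is a placeholder):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # placeholder checkpoint
).to("cuda")

# Each output frame becomes the input for the next one.
frame = Image.open("start_frame.png").convert("RGB").resize((512, 512))
for i in range(48):
    frame = pipe(
        prompt="an ink drawing of a forest, swirling mist",
        image=frame,
        strength=0.45,          # lower strength = consecutive frames stay more similar
        guidance_scale=7.0,
    ).images[0]
    frame.save(f"frame_{i:04d}.png")
```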

        • 2D or 3D transformation (on I2I loops)
          The endless zoom-in that everybody and your grandma has seen already. It works so well because you can rely on SD continuously dreaming up new details.

          Gradually transforming each generated frame before it is sent back as input in I2I loops. 2D transformations correspond to simple translation, rotation, and scale. 3D techniques imagine a virtual camera moving in 3D space, which is usually done by estimating 3D depth in each generated frame and then warping it according to the imagined camera motion.
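          Building on the I2I loop sketch above, a 2D transform is just an image operation applied to each frame before it is fed back. A minimal zoom example:

```python
from PIL import Image

def zoom(frame: Image.Image, factor: float = 1.02) -> Image.Image:
    """Crop slightly towards the center and scale back up: a per-frame zoom-in."""
    w, h = frame.size
    cw, ch = int(w / factor), int(h / factor)
    left, top = (w - cw) // 2, (h - ch) // 2
    return frame.crop((left, top, left + cw, top + ch)).resize((w, h), Image.LANCZOS)

# Inside the I2I loop from the previous sketch:
#   frame = zoom(frame, 1.02)                                   # transform first...
#   frame = pipe(prompt, image=frame, strength=0.45).images[0]  # ...then regenerate
```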

        • Experimental, motion synthesis, hybrid, and other techniques
          Made with SD-CN Animation, which has a unique method of hallucinating motion across generated frames. A starting image was used for the init, but nothing else.

          Motion synthesis is about trying to "imagine" motion flow between subsequent generated frames, and then using that to warp them frame-by-frame to instill organic motion into I2I loops. This usually relies on AI models trained for motion estimation (optical flow) in videos, but instead of looking at subsequent video frames, they are told to look at subsequent generated frames (through I2I loops), or some sort of hybrid approach is used.

          Other techniques may include advanced use of inpainting together with warping, multiple processing steps, or even taking snapshots of the model's training process. Deforum, for example, is loaded with knobs and settings to tinker with.

      • Transformative (Images 2 Images):

        Additionally, some sort of source input can be used to drive the generated frames and resulting animation:

        • Blending (stylizing) - mixing with video source or/and conditioning (ControlNets)
          Deforum's hybrid mode with some ControlNet conditioning, that is fed from a source video (seen on the left). Masking and background blur were done separately and are unrelated to this technique.

          This is a broad category of ways to mix and influence generated sequences with input videos (broken into individual frames), often used to stylize real-life videos. At the moment it is riding a trend wave of stylizing dance videos and performances, often going for the anime look and sexualized physiques. You may use anything as input though, for example rough frames of your own animation, or any miscellaneous and abstract footage. There are wide possibilities for imitating "pixilation" and replacement-animation techniques.

          Input frames can either be blended directly with generated images each frame, before inputting them back each I2I loop, or in more advanced cases are used in additional conditioning such as ControlNets.
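          A minimal sketch of the ControlNet flavor of this (assuming diffusers, a depth ControlNet, and per-frame depth maps prepared in advance; model IDs and file paths are illustrative):

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",        # placeholder SD 1.5 checkpoint
    controlnet=controlnet, torch_dtype=torch.float16,
).to("cuda")

for i in range(120):
    source = Image.open(f"video_frames/{i:04d}.png").convert("RGB")  # the frame being re-drawn
    depth = Image.open(f"depth_maps/{i:04d}.png").convert("RGB")     # precomputed depth map
    styled = pipe(
        prompt="hand painted gouache animation still",
        image=source,                        # blended/denoised like img2img
        control_image=depth,                 # conditioning that keeps the structure in place
        strength=0.6,
        generator=torch.Generator("cuda").manual_seed(7),  # fixed seed reduces flicker
    ).images[0]
    styled.save(f"styled/{i:04d}.png")
```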

        • Optical flow warping (on I2I loops with video input)
          Deforum's hybrid mode allows this technique with a variety of settings. Increased "cadence" was also used for a less flickery result, so the warping would show up better. Masking and background blur were done separately and are unrelated to this technique.

          "Optical flow" refers to motion estimated in a video, which is expressed through motion vectors on each frame, for each pixel in screen space. When optical flow is estimated for the source video used in a transformative workflow, it can be used to warp the generated frames according to it, making generated textures "stick" to objects as they or camera move across the frame.

        • 3D derived

          The conditioning done with transformative workflows may also be tied directly to 3D data, skipping a layer of ambiguity and processing done on video frames. Examples being openpose or depth data supplied from a virtual 3D scene, rather than estimated from a video (or video of a CG render). This allows the most modular and controllable approach that's 3D native, especially powerful if combined with methods that help with temporal consistency such as optical flow warping.

          This is probably the most promising overlap between established techniques and AI for VFX, as seen in this video.

          One of the most extensive tools for this technique is a project that simplifies and automates the generation of ControlNet-ready character images from Blender. In this example, the hand rig is used to generate openpose, depth, and normal-map images for ControlNet, with the final SD result seen on the right. (Openpose was discarded in the end, as it proved to be unusable for hands only.)

          This blog and "Diffusion Pilot" also focus on this approach, stay tuned! 👀
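          As a small illustration of the "3D derived" idea above: a rendered Z-depth pass (e.g. exported from a Blender scene) only needs to be normalized into the white-near / black-far image that depth ControlNets expect, instead of estimating depth from video. A minimal sketch with placeholder file names:

```python
import numpy as np
from PIL import Image

# Load a rendered Z-depth pass (exported from the 3D scene as a grayscale image).
depth = np.array(Image.open("renders/depth_0001.png")).astype(np.float32)
if depth.ndim == 3:
    depth = depth[..., 0]                          # one channel is enough

# Normalize and invert so that close objects are bright, far objects dark.
near, far = depth.min(), depth.max()
normalized = (depth - near) / max(far - near, 1e-6)
controlnet_depth = ((1.0 - normalized) * 255).astype(np.uint8)

# Feed this as the control image, as in the ControlNet sketch earlier.
Image.fromarray(controlnet_depth).convert("RGB").save("controlnet/depth_0001.png")
```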

      With all of these techniques combined, there are seemingly endless parameters that can be animated and modulated (quite like in modular audio production). They can either be "scheduled" with keyframes and graphed in something like Parseq, or linked to audio and music, allowing many audio-reactive results. You can make Stable Diffusion dance for you just like that.
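      To make "scheduling" concrete, here is a minimal sketch of a keyframe schedule being parsed and interpolated per frame. The string format loosely mimics Deforum-style schedules but is simplified for illustration:

```python
def parse_schedule(schedule: str) -> dict[int, float]:
    """Parse "0:(1.0), 60:(1.05), 120:(1.0)" into {0: 1.0, 60: 1.05, 120: 1.0}."""
    out = {}
    for part in schedule.split(","):
        frame, value = part.split(":")
        out[int(frame.strip())] = float(value.strip().strip("()"))
    return out

def value_at(keyframes: dict[int, float], frame: int) -> float:
    """Linearly interpolate the scheduled value for a given frame."""
    keys = sorted(keyframes)
    if frame <= keys[0]:
        return keyframes[keys[0]]
    for a, b in zip(keys, keys[1:]):
        if a <= frame <= b:
            t = (frame - a) / (b - a)
            return keyframes[a] * (1 - t) + keyframes[b] * t
    return keyframes[keys[-1]]

zoom_schedule = parse_schedule("0:(1.0), 60:(1.05), 120:(1.0)")
print(value_at(zoom_schedule, 30))   # 1.025 - the zoom factor used on frame 30
```

      An audio-reactive setup simply replaces the hand-written keyframes with values derived from an amplitude or frequency analysis of the soundtrack.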
      PROS
      • Novel, evolving aesthetics, unique to the medium.
      • Conceptually reflects the tradition of animation.
      • The most customizable, hands-on, and susceptible to directing.
      • Modular, layered approach.
      • Can be conditioned with video frames or complex data such as 3D render passes.
      CONS
      • Often flickery and somewhat chaotic.
      • Dense on a technical level, delicate to balance; advanced results have a steep learning curve.
      • Usually inconvenient to do without good local hardware (an NVIDIA GPU).
      TOOLS
      FREE and PAID (some with limited free plans)
        Tools to use in the A1111 webui (if you have sufficient hardware)*:
      • Small scripts for parameter interpolation animations (travels): steps, prompts, seeds.
      • Deforum - the best powerhouse for all animated SD needs, incorporating most of the techniques listed above.
      • Parseq - popular visual parameter sequencer for Deforum.
      • "Deforum timeline helper" - another parameter visualization and scheduling tool. 
      • Deforumation - GUI for live control of Deforum parameters, allowing reactive adjustment and control.
      • TemporalKit - adopts some principles of EBsynth to use together with SD for consistent video stylization.
      • SD-CN Animation - somewhat experimental tool, allowing some hybrid stylization workflows and also interesting optical flow motion synthesis that results in turbulent motion.
      • TemporalNet - a ControlNet model meant to be used in other workflows like Deforum's, aiming to improve temporal consistency. 
      Python notebooks (to be run on Google Colab or Jupyter)*:
      • Stable WarpFusion - experimental code toolkit aimed at advanced video stylization and animation. Overlaps with Deforum a lot.
      Plugins and addons:
      • Diffusae for After Effects
      • A1111, ComfyUI, StreamDiffusion, and other API components for TouchDesigner by DotSimulate - available through his Patreon tiers with regular updates.

      There might be many random apps and tools out there, but even if they're paid, they are likely based on the open source Deforum code and act as simplified cloud versions of the same thing.
      * Optimally you have decent enough hardware, namely a GPU, to run these tools locally. Alternatively, you may be able to try them through remote machines, like in Google Colab, but most free plans and trials are very limiting. Anything that was designed as a notebook for Google Colab can still be run on local hardware though.

      MORE EXAMPLES:

      Professionally orchestrated production mixing together traditional sets, actors, VFX techniques, and contemporary generative AI tools. The primary painterly aesthetic came from using Stable Diffusion frame-by-frame through image2image.
      Clever animation likely made with a fine-tuned model or strong reference conditioning. It makes heavy use of optical flow warping, with the source for that probably being videos of similar dancers.
      Deforum animation incorporating advanced optical warp techniques.
      Animation done with the SD-CN Animation extension, which employs motion synthesis techniques that provide the turbulent motion.
      Deforum animation from one of the main current contributors to its code. This one showcases 3D camera movement technique especially well.
      Animation from a solo show of LEGIO_X, who has used neural frames for their work.
      A demo by Dan Wood, using SD in a frame-by-frame feedback loop in real time, coupled with voice prompting, enabled by state-of-the-art optimizations in his custom pipeline. This is a "dream engine" that is literally being steered live.
  • Generative video

    Techniques that rely on generative video AI models, which were trained on moving video or otherwise enhanced with temporal comprehension at the neural network level, producing clips of smoothly moving imagery.

    This category of models started out very weak, with uncanny results and limitations, leading to it being ridiculed in memes. However, since the initial publication of this article, in only one year, video models have come to the forefront of generative AI research and progress, showing increasingly convincing results and creeping into mainstream commercial media, for better or worse. Thus this section has been, and will continue to be, revised to reflect the state of the art.

    • Generative video models

      Fan made music video, utilizing Runway's Gen-3.

      This refers to using models that were made and trained from the ground up to work with and generate video footage.

      While early results (2022-2023) were somewhat wobbly, awkward, and uncanny, the most recent video models create increasingly convincing and seamless video content, albeit still often characterized by somewhat generic and floaty aesthetics of motion. Drastic, unusual scenes with lots of complex moving pieces and characters are still difficult for these models, with object permanence suffering the most when many similar elements intersect and occlude each other, though this is often unnoticeable without closer inspection.

      I suppose the boundary between animation and conventional film is blurry here. For results that don't quite replicate real video, either by intent or by failure, it may fall into its own weird new genre of animation and video art. For animators and video artists, I'd encourage you to forget about replicating real film and use this as a new form of experimental media. Have fun!

      Moreover, because these video models are mostly trained on live action footage or smooth CGI, some aspects of the animation tradition remain quite out of reach - meticulous frame-by-frame timing changes, expressive frame holds and accentuated posing, hybrid animation techniques, etc. In other words, the newest video models will easily create a pretty moving image, and often in a coherent way, but the character of the motion itself is still very far from replicating the mastery of animation at the highest level.

      • Standalone (Text 2 video)

        A demo of Sora's capabilities, generated using the prompt "a green blob and an orange blob are in love and dancing together".

        Using text prompts to generate entirely new video clips.

        Limitless in theory, with possibilities to go for both live-action look or anything surreal and stylized, as long as you can describe it, just like with static image generation. In practice though, gathering diverse and big enough datasets to train video models is much harder, so niche aesthetics are difficult on models with only text conditioning.

        With only text, proper creative control and directing ability is quite weak, but it becomes much more empowering when coupled with image or video conditioning, which you may call a "transformative" workflow. Additionally, new forms of motion and camera control are emerging, such as MotionCtrl or Runway's Multi motion brush, which I'll categorize here as "directable" techniques.

      • Transformative:

        Using text prompts together with further conditioning from existing images or videos.

        • Image 2 Video
          The album artwork was used as a starting image for each of the generated clips. Author - Stable Reel.

          Many generative video tools enable you to condition the result on an image. Either starting exactly with the image you specify, or using it as a rough reference for semantic information, composition, and colors. 

          Often people generate the starting image itself with static image models before supplying it to the video model.
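          A minimal sketch of image-to-video with the open Stable Video Diffusion model via diffusers (hosted services expose the same idea through a UI; the input file name is a placeholder):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

# The still image that conditions (starts) the clip.
image = load_image("album_artwork.png").resize((1024, 576))

frames = pipe(
    image,
    decode_chunk_size=8,                                  # trades VRAM for speed
    generator=torch.Generator("cuda").manual_seed(42),
).frames[0]
export_to_video(frames, "clip.mp4", fps=7)
```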

        • Video 2 Video
          A demo from Runway's use-cases page, showing ability to transform video material into various results using text prompts.

          Similarly to the Image 2 Image process in generative image models, it is possible to embed input video information into a video model as it generates (denoises) the output, in addition to the text prompt. I lack the expertise to understand exactly what's happening, but it appears this process matches the input video clip not only on a frame-by-frame level (as stylization with Stable Diffusion would), but also on a holistic, movement level. Under the hood it is usually controlled with a denoising strength, just like image 2 image.

      • Standalone - directable

        Runway's presentation of the "Multi motion brush" feature in their video generation tools.

        With the rapid development of video models and the services powered by them, they are getting increasingly well equipped with interfaces and tools that allow more hands-on and precise direction of generated clips beyond just text prompts. That may include camera framing and movement, movement of specific elements or areas in the frame, or replacement/adjustment of specific elements through additional instructions.

        Likely, this is where the field will split into two extremes: services catering to filmmakers and creators who want to retain artistic agency, and simplified services focused on providing ready-made standalone content for corporate settings and mass production, "served on a plate" so to speak.

      PROS
      • The most open-ended set of techniques, and it will only improve with time.
      • No barrier to entry in terms of professional animation knowledge.
      • Compared to frame-by-frame techniques, way smoother and more coherent.
      • A more straightforward route to transformative workflows than frame-by-frame approaches.
      CONS
      • With cheaper, smaller models - often awkward and uncanny looking, which may be more apparent than with images.
      • Computationally expensive. Less accessible to run on your own hardware than image AI, especially due to high video memory (VRAM) requirements, which are hard to meet with most consumer GPUs.
      • Depending on the model or service, often limited by a short context length - the maximum duration of clips can be small, or extending such short clips struggles to maintain long-term consistency.
      TOOLS
      FREE and PAID (some with limited free plans)
      Plugins and addons:
      • Pallaidium for Blender - a multi-functional toolkit crammed with generative functionality across image, video and even audio domains.

      Additionally, you may find some free demos on Hugging Face Spaces.
      Online services:
      * Optimally you have decent enough hardware, namely a GPU, to run these tools locally. Alternatively, you may be able to try running these models through remote machines, like in Google Colab, but most free plans and trials are very limiting.

      MORE EXAMPLES:

      Short film made with the generative ModelScope video model.
    • Image models enhanced with motion comprehension

      Animation done using AnimateDiff in ComfyUI, by animating between several different prompt subjects.

      With the growing popularity of AnimateDiff, this is an emerging field of enhancing established image diffusion models with video or "motion" comprehension. The results are more similar to native video models (shown above) than to what you would get with frame-by-frame techniques. The catch is that you can also utilize everything that has been built for these image models, such as Stable Diffusion, including any community-created checkpoint, LoRA, ControlNet, or other kind of conditioning.

      The motion itself in this technique is often quite primitive, only loosely interpolating objects and flow throughout the clip, often morphing things into other things. It does that with way more temporal consistency though (less flicker), and it is still in its infancy. Best results are with abstract, less concrete subjects and scenes.

      An example of a workflow that has been optimized specifically for 3D render treatment in any style with precise control. However, it still struggles to maintain coherence on complex shapes moving and overlapping each other.

      The community is actively experimenting with this tech (see "MORE EXAMPLES"). The techniques draw both from static image models (such as prompt travel) and from video-native models and their advancements. In some cases, people are trying to squeeze smoother video or 3D render stylization results out of it than image model frame-by-frame techniques can provide.
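      A minimal AnimateDiff sketch via the diffusers implementation: an SD 1.5 checkpoint plus a "motion adapter" that gives the image model its sense of motion. Checkpoint names below are the commonly shared public ones; treat them as placeholders.

```python
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
from diffusers.utils import export_to_gif

# The motion adapter carries the temporal ("motion") knowledge...
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
# ...and is bolted onto an ordinary SD 1.5 image checkpoint (community ones work too).
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    motion_adapter=adapter, torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

output = pipe(
    prompt="ink wash clouds rolling over mountains, looping animation",
    num_frames=16,
    guidance_scale=7.5,
    generator=torch.Generator("cuda").manual_seed(3),
)
export_to_gif(output.frames[0], "animatediff.gif")
```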

      PROS
      • Benefits from all the development that has been done on existing image diffusion models.
      • Can be conditioned with video or complex data such as 3D render passes.
      • Very good with abstract, flowing motion.
      CONS
      • Does not work well for complex, coherent motion of characters or unusual objects, often leading to morphing instead.
      • Computationally expensive, just like video-native models. Less accessible to run on your own hardware than image AI.
      • Limited by a somewhat short context window (for now), although there are always workarounds that people experiment with.
      TOOLS
      FREE and PAID (some with limited free plans)
      Currently, implementations of AnimateDiff (for SD v1.5) are leading the charge here:
      for SD XL:
      Multi-functional implementations:
      • DomoAI - webapp that, to my eyes, encapsulates community-driven video stylization workflows into a paid service.
      • KREA AI - polished webapp experience for working with AI video, which to my eyes seems to be using AnimateDiff under the hood. (limited free access)

      MORE EXAMPLES:

      Reddit post "Just some of my work" by u/StrubenFairleyBoast in r/StableDiffusion.
  • Talking heads and characters (with speech synthesis)

    Combination of techniques and AI models that focus on "talking heads" - faces and characters with a synthetic or real voice and accurately matching lip movements.
    (By themselves, general generative video models usually cannot depict accurate lip movements and spoken performance)

    • Synthetic (generated performance)

      The author demonflyingfox had created a step-by-step tutorial even before releasing the viral Balenciaga videos.

      I know it, you know it. It's the technique behind a viral meme. Whenever you see a relatively still character (the camera could be moving too) with an animated talking face, it likely relates to a particular set of methodologies using AI face animation and synthetic speech tools.

      In the case of said memes, the source images are often made with generative image AI, but you may also use any image with a face. The speech gets generated from text, conditioned on a chosen character voice. Then a different tool (or a model within a packaged tool) synthesizes facial animations with appropriate lip sync from the voice, usually only generating motion in the face and head area of the image. Using pre-trained avatars allows for movement on the body as well.

      Most online services now streamline and package this process into cohesive tools, and what exactly is happening under the hood is anyone's guess, but it may be deduced from related research papers and open source alternatives.
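      As a heavily hedged sketch of the open source version of this pipeline: synthesize speech with whatever TTS you prefer, then drive the lip sync with a tool like Wav2Lip. The flags below follow Wav2Lip's public README at the time of writing; double-check them against the repo you actually use, and treat file names as placeholders.

```python
import subprocess

# Step 1: text-to-speech. Any TTS tool or service works; here we assume you
# already exported the generated narration as "voice.wav".
audio_path = "voice.wav"

# Step 2: lip-sync a (generated or real) face image to that audio with Wav2Lip,
# run from a local clone of the Wav2Lip repository.
subprocess.run([
    "python", "inference.py",
    "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
    "--face", "generated_portrait.png",    # an AI-generated portrait works fine
    "--audio", audio_path,
    "--outfile", "talking_head.mp4",
], check=True, cwd="Wav2Lip")
```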

    • Performance capture

      A related category of tools are those that, instead of synthesizing a performance, map a real captured one onto a character using AI models (not to be confused with traditional facial capture for CGI characters). Doing so may allow very accessible performances through virtual characters by generating ready video clips, skipping the whole conventional motion capture and character rigging pipeline.

    PROS
    • Easy memes.
    • Mass-producible talking avatars for games, installations, roleplay, cheap DIY shows, etc.
    CONS
    • Less advanced and older tools produce somewhat uncanny results.
    • For the most part, reliant on closed-source facial animation tools in paid apps.
    • Results are often stiff and not too expressive, even when training it with your own footage for an avatar.
    TOOLS
    FREE and PAID (some with limited free plans)
    •  "Wav2Lip" - a A1111w webui extension that generates "lip-sync" animation. Seems to be limited to mouth area.
    • SadTalker - another talking head generator based on audio. Available for use through A1111 webui and Discord.
    • ElevenLabs - constrained usage, but limits seem to refresh monthly. 
    • Live Portrait - open source model to generate talking heads from reference video, also available to run locally.
    • Animated 2D cutout characters using Adobe Express (in "Animate characters" mode).
  • Generative 3D character motion

    Trailer for Nikita's genius meta AI film, which exposes the AI motion-learning process and channels it into a ridiculously entertaining short.

    This refers to motion synthesis in the context of 3D characters. It can apply to 3D animated film, video games, or other 3D interactive applications. Just like with images and video, these emerging AI tools allow you to prompt character motion through text. Additionally, some also build it from a very limited number of key poses, or produce animations dynamically on the fly in interactive settings.

    Because this list focuses on generative tools, I am leaving out some AI applications that automate certain non-creative tasks, like AI-powered motion tracking, compositing, masking, etc., as seen in Move.ai or Wonder Dynamics.

    PROS
    • Fits inside the established 3D animation workflow, reducing tedious tasks, potentially working as a utility for skilled animators.
    • Handles physics and weight really well.
    • The future of dynamic character animation in video games? 👀🎮
    CONS
    • Usually limited to humanoid bipedal characters.
    • Not self-sufficient. Only one component of the 3D animation workflow. You need to know where to take it next.
    • Training is usually done on human motion capture data, which means these techniques so far only deal with realistic, physics-based motion, nothing stylized or cartoony.
    TOOLS
    FREE (or limited plans) and PAID
    Paid plans of free tools that provide more features and expanded limits.

    MORE EXAMPLES:

  • LLM powered

    In theory, with LLMs (Large Language Models) showing great performance in coding tasks, especially when fine-tuned, you could tell one to program and write scripts inside animation-capable software, or to describe motion through raw keyframe and curve data. This means the animation would follow the usual workflow, but the AI assists you throughout. In an extreme case, the AI does everything for you, executing or assigning appropriate tasks in a back-end pipeline.

    In practice, you can kind of already try it! Blender, for example, is equipped with a very extensive Python API that allows operating it through code, so there are already a couple of ChatGPT-like assistant tools available. This is an unavoidable trend: wherever there is code, LLMs will likely show some practical use cases.
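    To make that concrete, here is the kind of small Blender Python (bpy) snippet such an assistant might write for you - raw keyframes describing a simple bounce on the active object (a hypothetical example, to be run inside Blender's scripting workspace):

```python
import bpy

# Keyframe the active object's Z location: up at frame 12, back down at frame 24.
obj = bpy.context.active_object
for frame, z in [(1, 0.0), (12, 2.0), (24, 0.0)]:
    obj.location.z = z
    obj.keyframe_insert(data_path="location", index=2, frame=frame)  # index=2 -> Z channel

bpy.context.scene.frame_end = 24
```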

    PROS
    • The promise - deconstruction of any technical barrier for creatives.
    • Useful as a copilot or assistant in creative software, eliminating tedious, repetitive tasks and digging through documentation for you.
    CONS
    • If AI creates everything for you, then what's the point of being creative in the first place?
    • For now, running capable LLMs is mostly only possible on powerful remote machines, and thus paid per token or by subscription.
    TOOLS
    FREE and PAID
    • Blender Chat Companion - (similar to Blender Copilot) a ChatGPT implementation inside Blender, specialized to handle appropriate tasks. Uses ChatGPT API tokens which are paid.
    • Blender Copilot - (similar to Blender Chat Companion) a ChatGPT implementation inside Blender, specialized to handle appropriate tasks. Uses ChatGPT API tokens which are paid.
    There's also the upcoming ChatUSD - a chatbot to work with and manage USD (Universal Scene Description) data, a standard initially created by Pixar to unify and simplify 3D data exchange and parallelization in animated film production. I can't tell you much more here, but NVIDIA seems to be embracing it as a standard for anything 3D, not just film.

Whew! That was A LOT, but I likely still missed something. Please comment below to suggest entries and tweaks to improve this and keep it up-to-date. Thank you for reading!