In this post I attempt to hierarchically lay out and categorize the current array of techniques involving generative AI that can be used in animation, giving brief descriptions, examples, pros and cons, and links to find associated tools. It's the kind of resource I wish I had a year ago as an animator, when trying to navigate the chaotic network of possibilities and ever-growing progress. Video stylization use cases, while somewhat overlapping, are mostly left out here.
It is aimed at anybody curious, but mostly at other animators and creatives who might feel intimidated by the accelerating progress in the field. Hopefully this allows you to catch up and keep an eye on the scene on a deeper level than a TikTok feed.
Disclaimers:
- It's my best attempt at the time of writing, based on my possibly subjective analysis as an animator, and a tiny bit of personal opinion. I will update and refine this as much as I can though, trying to retain a somewhat neutral stance at least in this particular blog post.
- The list skips older tools, like those based on GAN models, as diffusion- and transformer-based architectures have become more capable, established, and adopted.
-
This guide is not a tutorial, but the communities of most tools are teeming with helpful content. To get started, use keywords from this guide to look online!
Glossary:
What actually is AI?
AI Model
Refers to neural network models, each trained on a specific kind of data and designed with a specialized intended behavior at inference time. An "AI" as used broadly in the media usually refers to an application (tool) that employs one such model, or sometimes several working together. As a user you can rely on these applications, which usually (but not necessarily) conceal the actual model and expose only limited controls and parameters, or use the models directly if they are open source, which also lets you fine-tune them through further training or other customization.
AI Tool
Refers to any code, software, and applications, whether online services or running locally on your computer, that are wrapped around AI models or somehow rely on them. I won't fight you if you reject referring to AI as "tools", but it only makes sense in this specific context, at least for now.
Diffusion
Diffusion refers to the archetype of generative diffusion-based models that dominates the field at the moment. These models generate results by iteratively "revealing" them from noise, step by step, in a process called "denoising".
[Input 2 Output]
A widespread expression to indicate the type of input/output pair used in an AI application or model. The "input" conditions the "output" result. It is usually used very loosely. "Video 2 video" for example can mean very different things under the hood on different occasions, but it is nevertheless useful for indicating the type of possible workflow to an end user.
Notebook
Refers to a Python-based collection of structured code, easily shared and annotated. Most applications work by controlling AI models through Python and specialized libraries like PyTorch, which can be run in these notebooks. They are often shared as user-ready tools for people to run either locally or on remote hardware.
Seed
An initial input, often a random value or vector, used to initialize the generation that produces the result. The same seed will generally produce the same result if other variables don't change. Manipulating the seed across many generations can be done creatively to induce various desired or experimental effects.
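To make this concrete, here's a minimal sketch of fixing a seed with the diffusers library; the model name and prompt are placeholders I'm assuming, not a recommendation:

```python
# A minimal illustration of seeding (assumes the diffusers library and a
# Stable Diffusion checkpoint; names are examples): reusing the same seed
# with identical settings reproduces the same image.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

gen = torch.Generator("cuda").manual_seed(42)        # fixed seed
image_a = pipe("a paper-cutout fox in a forest", generator=gen).images[0]

gen = torch.Generator("cuda").manual_seed(42)        # same seed again -> same image
image_b = pipe("a paper-cutout fox in a forest", generator=gen).images[0]
```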
-
Generative image
Techniques that rely on generative image AI models, which were trained on static images.
-
Generative image as material and assets
Using static images generated with any AI app as assets in a traditional workflow such as 2D cutout, digital manipulation, or collage, or even as source material for other AI tools that offer, for example, "image2video". Beyond the origin of the images and material, this technique depends on your usual skillset of cutting and manipulating images, but it can help with assembling large amounts of content quickly for drafts, animatics, or, if used tastefully, even final versions of animated work.
PROS:
- Easy to transition into for existing animators.
- Can help with backgrounds and other content that is secondary to the main elements in focus.
- Can serve as material for early drafts, animatics, and prototyping.

CONS:
- Doesn't feel too "fresh".
- Relies on great synergy between the material and the animation.
TOOLS

FREE (Any generative image model or app):
- Stable Diffusion (on local machine), or any online app like this
- Craiyon
- Krea AI
- Invokeai (using SD)
- Enfugue (using SD)
- SkyBox AI - generation of VR-ready 360 scenes.
- DALL-E 3 on Microsoft image creator
- Leonardo AI - refined app for working with generative AI. Offers some free daily credits.

Plugins and addons:
- Stable Projectorz - sophisticated 3D texturing using SD
- ComfyUI nodes in Blender
- Generative AI for Krita - streamlined, artist-friendly way to work with Stable Diffusion, powered by a ComfyUI backend.

Additionally, you may find some free demos on Hugging Face spaces.

PAID (Any generative image model or app):
- MidJourney
- DALL-E 3 on ChatGPT
- Adobe's FireFly

Animating can be done using After Effects, Toon Boom, Blender, etc.
-
Generative image frame-by-frame
This encompasses all techniques that use generative diffusion image models in a rather animation-native spirit, generating sequences of motion frame by frame, like you would draw and shoot traditional animation. The key aspect here is that these models have no concept of time or motion when generating each image; it is up to mechanics added on top by various applications or extensions to produce some sort of animated imagery in the end, often referred to as having "temporal consistency".
These techniques usually possess a characteristic flicker in the animations. While many users of these tools aim to clean that up as much as possible, animators will tell you that it's called "boiling" and has been a staple of animation art all this time.
They are mostly applicable to open source models such as Stable Diffusion and tools built on them, which can be used with exposed parameters and possibly on local hardware. For comparison, something like MidJourney has its model concealed and its interface streamlined for still pictures, so it couldn't be used for these techniques as easily.
It usually consists of these techniques mixed and layered together:
-
Standalone (Text 2 Images):
There are several novel techniques to generate animations with only text prompts and parameters this way:
-
Parameter interpolation (morphing)
Gradually interpolating parameters on each generated image frame to produce a change in the animation. Parameters can be anything to do with the model, such as the text prompt itself, or the underlying seed ("latent space walk").
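As a rough illustration, here's what a "latent space walk" between two seeds could look like with the diffusers library (the model, prompt, and frame count are assumptions for the sketch, not a prescription):

```python
# A minimal sketch of a "latent space walk" between two seeds, rendered as an
# image sequence. Assumes the diffusers library and a SD v1.5 checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

shape = (1, pipe.unet.config.in_channels, 64, 64)  # latent size for 512x512 output
lat_a = torch.randn(shape, generator=torch.Generator("cuda").manual_seed(1),
                    device="cuda", dtype=torch.float16)
lat_b = torch.randn(shape, generator=torch.Generator("cuda").manual_seed(2),
                    device="cuda", dtype=torch.float16)

def slerp(t, a, b):
    # spherical interpolation keeps latents on the Gaussian "shell",
    # which tends to morph more gracefully than a straight lerp
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.arccos((a_n * b_n).sum().clamp(-1, 1))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

for i in range(24):  # 24 frames of morphing between the two seeds
    t = i / 23
    image = pipe("a forest at dawn, watercolor", latents=slerp(t, lat_a, lat_b)).images[0]
    image.save(f"frame_{i:04d}.png")
```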
-
Image 2 Image (I2I) feedback loops
Using each generated image frame as the input for the following frame of the animation through "image 2 image". This makes it possible to produce similar-looking frames in sequence while other parameters change and the seed does not stay fixed. It is usually controlled through "denoising" strength, or "strength schedule" in Deforum. The starting frame can also be a pre-existing picture.
It's a core building block of most animation implementations that use Stable Diffusion, and many of the other techniques listed below rely on it. It is very delicate to balance and depends a lot on the sampler (noise scheduler) used.
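A bare-bones version of such a feedback loop, sketched with the diffusers library (model, prompt, start image path, and strength are placeholder assumptions; real tools like Deforum add far more machinery on top):

```python
# A minimal img2img feedback loop sketch: each output frame becomes the input
# for the next one. Assumes the diffusers library and Pillow.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# the starting frame can be a generated image or any pre-existing picture
frame = Image.open("start.png").convert("RGB").resize((512, 512))

for i in range(48):
    # low strength keeps consecutive frames similar; higher strength drifts faster
    frame = pipe("ink drawing of drifting clouds", image=frame,
                 strength=0.45, guidance_scale=7.0).images[0]
    frame.save(f"frame_{i:04d}.png")
```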
-
2D or 3D transformation (on I2I loops)
Gradually transforming each generated frame before it is sent back as input in I2I loops. 2D transformations correspond to simple translation, rotation, and scale. 3D techniques imagine a virtual camera moving in 3D space, which is usually done by estimating 3D depth in each generated frame and then warping it according to the imagined camera motion.
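For the 2D case, a toy helper like this could be applied to each frame before it goes back into the loop sketched above (zoom and rotation amounts are arbitrary examples):

```python
# A rough sketch of the "2D transformation" idea: each frame is slightly
# zoomed and rotated before being fed back into the img2img loop.
from PIL import Image

def zoom_rotate(img: Image.Image, zoom=1.02, angle=0.5) -> Image.Image:
    w, h = img.size
    img = img.rotate(angle, resample=Image.BICUBIC)                   # small rotation per frame
    img = img.resize((int(w * zoom), int(h * zoom)), Image.BICUBIC)   # slight zoom in
    left, top = (img.width - w) // 2, (img.height - h) // 2
    return img.crop((left, top, left + w, top + h))                   # crop back to original size

# inside the loop from the previous sketch (pipe and prompt assumed from there):
#   frame = pipe(prompt, image=zoom_rotate(frame), strength=0.45).images[0]
```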
-
Experimental, motion synthesis, hybrid, and other techniques
Motion synthesis is about trying to "imagine" motion flow between subsequent generated frames, and then using that flow to warp them frame by frame to instill organic motion into I2I loops. This usually relies on AI models trained for motion estimation (optical flow) in videos, but instead of looking at subsequent video frames, they are pointed at subsequent generated frames (through I2I loops), or at some sort of hybrid of the two.
Other techniques may include advanced use of inpainting together with warping, multiple processing steps, or even taking snapshots of the model's training process. Deforum, for example, is loaded with knobs and settings to tinker with.
-
-
Transformative (Images 2 Images):
Additionally, some sort of source input can be used to drive the generated frames and resulting animation:
-
Blending (stylizing) - mixing with video source and/or conditioning (ControlNets)
This is a broad category of ways to mix and influence generated sequences with input videos (broken into individual frames), often used to stylize real-life videos. It is currently riding a trend wave of stylized dance videos and performances, often going for an Anime look and sexualized physiques. You may use anything as input though, for example rough frames of your own animation, or any miscellaneous and abstract footage. There are wide possibilities for imitating "pixilation" and replacement-animation techniques.
Input frames can either be blended directly with the generated images each frame, before being fed back into the I2I loop, or, in more advanced cases, used for additional conditioning such as ControlNets.
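As a sketch of the simplest blending flavor (ControlNets are a separate, more involved mechanism), each extracted source frame could be mixed into the feedback loop like this; the blend weight and variable names are assumptions building on the earlier loop sketch:

```python
# A sketch of direct blending for video stylization: each source video frame
# is mixed with the previous generated frame before going back through img2img.
from PIL import Image

def blend(gen: Image.Image, src: Image.Image, src_weight=0.35) -> Image.Image:
    # higher src_weight pulls the animation back toward the source video,
    # lower values let the generated imagery take over (both must be RGB)
    return Image.blend(gen, src.resize(gen.size), src_weight)

# inside an img2img loop, with src_frames being the extracted video frames
# (pipe, prompt, and frame assumed from the earlier sketch):
#   mixed = blend(frame, src_frames[i])
#   frame = pipe(prompt, image=mixed, strength=0.5).images[0]
```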
-
Optical flow warping (on I2I loops with video input)
"Optical flow" refers to motion estimated in a video, which is expressed through motion vectors on each frame, for each pixel in screen space. When optical flow is estimated for the source video used in a transformative workflow, it can be used to warp the generated frames according to it, making generated textures "stick" to objects as they or camera move across the frame.
-
3D derived
The conditioning done with transformative workflows may also be tied directly to 3D data, skipping a layer of ambiguity and processing done on video frames. Examples include openpose or depth data supplied from a virtual 3D scene, rather than estimated from a video (or a video of a CG render). This is the most modular and controllable approach, natively 3D, and especially powerful if combined with methods that help with temporal consistency such as optical flow warping.
This is probably the most promising overlap between established techniques and AI for VFX, as seen in this video.
This blog and "Diffusion Pilot" also focus on this approach, stay tuned!
-
PROS:
- Novel, evolving aesthetics, unique to the medium.
- Conceptually reflects the tradition of animation.
- The most customizable, hands-on, and susceptible to directing.
- Modular, layered approach.
- Can be conditioned with video frames or complex data such as 3D render passes.

CONS:
- Often flickery and somewhat chaotic.
- Dense on a technical level, delicate to balance; advanced results have a steep learning curve.
- Usually inconvenient to do without good local hardware (NVIDIA GPU).
TOOLS

FREE:
- Small scripts for parameter interpolation animations (travels): steps, prompts, seeds.

Tools to use in A1111 webui (if you have sufficient hardware)*:
- Deforum - the best powerhouse for all animated SD needs, incorporating most of the techniques listed above.
- Parseq - popular visual parameter sequencer for Deforum.
- "Deforum timeline helper" - another parameter visualization and scheduling tool.
- Deforumation - GUI for live control of Deforum parameters, allowing reactive adjustment and control.
- TemporalKit - adopts some principles of EBsynth to use together with SD for consistent video stylization.
- SD-CN Animation - somewhat experimental tool, allowing some hybrid stylization workflows as well as interesting optical flow motion synthesis that results in turbulent motion.
- TemporalNet - a ControlNet model meant to be used in other workflows like Deforum's, aiming to improve temporal consistency.

Python notebooks (to be run on Google Colab or Jupyter)*:
- Stable WarpFusion - experimental code toolkit aimed at advanced video stylization and animation. Overlaps with Deforum a lot.

Plugins and addons:
- Dream Textures for Blender
- AI Render for Blender
- Character bones that look like Openpose for Blender - for use with ControlNets outside of Blender.
- Unreal Diffusion for Unreal Engine 5
- After-Diffusion for After Effects (highly WIP)
- A1111, ComfyUI API components, and StreamDiffusion implementation for TouchDesigner from Oleg Chomp - if you know what you're doing, can be set up for animation or anything you can imagine.

PAID (some with limited free plans):
- Deforum Studio - official online service version of Deforum.
- AI Animation Generator on gooey.ai - simplified way to run Deforum online, offers some free credits.
- Neural frames - generator service inspired by Deforum.

Plugins and addons:
- Diffusae for After Effects
- A1111, ComfyUI, StreamDiffusion, and other API components for TouchDesigner by DotSimulate - available through his Patreon tiers with regular updates.

There might be many random apps and tools out there, but even if they're paid, they are likely based on the open source Deforum code and act as simplified cloud versions of the same thing.

* Optimally you have decent enough hardware, namely a GPU, to run these tools locally. Alternatively, you may be able to try it through remote machines, like Google Colab, but most free plans and trials are very limiting. Anything that was designed as a notebook for Google Colab can still be run on local hardware, though.

MORE EXAMPLES:
-
-
-
Generative video
Techniques that rely on generative video AI models, which were trained on moving video footage, or otherwise enhanced with temporal comprehension at the neural network level, providing clips of smoothly moving images and video.
This category of models started out very weak, with uncanny results and limitations, and was widely ridiculed in memes. However, in the single year since the initial publication of this article, video models have come to the forefront of generative AI research and progress, showing increasingly convincing results and creeping into mainstream commercial media, for better or worse. This section has been, and will continue to be, revised to reflect the state of the art.
-
Generative video models
This refers to using models that were made and trained from the ground up to work with and generate video footage.
While early results (2022-2023) were somewhat wobbly, awkward, and uncanny, the most recent video models create increasingly convincing and seamless video content, albeit still often characterized by a somewhat generic and floaty aesthetic of motion. Drastic, unusual scenes with lots of complex moving pieces and characters are still difficult for these models, with object permanence suffering the most when many similar elements intersect and occlude each other, though this is often unnoticeable without closer inspection.
I suppose the boundary between animation and conventional film is blurry here. Results that don't quite replicate real video, either by intent or by failure, may fall into their own weird new genre of animation and video art. For animators and video artists, I'd encourage you to forget about replicating real film, and use this as a new form of experimental media. Have fun!
Moreover, because these video models are for the most part trained on live action footage or smooth CGI, some aspects of the animation tradition remain quite out of reach: meticulous frame-by-frame timing changes, expressive frame holds and accentuated posing, hybrid animation techniques, etc. In other words, the newest video models will easily create a moving pretty image, often in a coherent way, but the character of the motion itself is still very far from replicating the mastery of animation at the highest level.
-
Standalone (Text 2 video)
Using text prompts to generate entirely new video clips.
Limitless in theory, with possibilities to go for both a live-action look or anything surreal and stylized, as long as you can describe it, just like with static image generation. In practice though, gathering diverse and big enough datasets to train video models is much harder, so niche aesthetics are difficult to reach with text conditioning alone.
With only text, proper creative control and directing ability is quite weak, but it becomes much more empowering when coupled with image or video conditioning, which you might call a "transformative" workflow. Additionally, new forms of motion and camera control are emerging, such as MotionCtrl or Runway's Multi Motion Brush, which I'll categorize here as "directable" techniques.
-
Transformative:
Using text prompts in addition with further conditioning from existing images or videos.
-
Image 2 Video
Many generative video tools let you condition the result on an image, either starting exactly from the image you specify, or using it as a rough reference for semantic information, composition, and colors.
Often people generate the starting image with static image models as well, before supplying it to the video model.
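For example, a minimal image-to-video step with the open Stable Video Diffusion model via the diffusers library might look roughly like this (high VRAM required; the resolution and parameters follow the published defaults, assumed here, and the input path is hypothetical):

```python
# A sketch of image-to-video with Stable Video Diffusion through diffusers.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16").to("cuda")

start = load_image("start_frame.png").resize((1024, 576))  # the conditioning image
frames = pipe(start,
              decode_chunk_size=4,
              motion_bucket_id=127,   # rough "amount of motion" control
              ).frames[0]
export_to_video(frames, "clip.mp4", fps=7)
```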
-
Video 2 Video
Similarly to the Image 2 Image process in generative image models, it is possible to feed input video information into a video model as it generates (denoises) the output, in addition to the text prompt. I lack the expertise to understand exactly what's happening under the hood, but this process appears to match the input video clip not only on a frame-by-frame level (as stylization with Stable Diffusion would), but also on a holistic, movement level. It is usually controlled with a denoising strength, just like image 2 image.
-
-
Standalone - directable
With the rapid development of video models and the services powered by them, they are getting increasingly well equipped with interfaces and tools that allow more hands-on and precise direction of generated clips beyond just text prompts. That may include camera framing and movement, movement of specific elements or areas in the frame, or replacement/adjustment of specific elements through additional instructions.
Likely, this is where the field will split into two extremes: services catering to filmmakers and creators wanting to retain artistic agency, and simplified services focused on providing standalone, ready-made content for corporate settings and mass production, "served on a plate" so to say.
PROS:
- The most open-ended set of techniques, which will only improve with time.
- No barrier to entry in terms of professional animation knowledge.
- Compared to frame-by-frame techniques, way smoother and more coherent.
- A more straightforward path to transformative workflows than frame-by-frame approaches.

CONS:
- With cheaper, smaller models: often awkward and uncanny looking, which may be more apparent than with images.
- Computationally expensive. Less accessible to run on your own hardware than image AI, especially due to high video memory (VRAM) requirements, which are hard to meet with most consumer GPUs.
- Depending on the model or service, often somewhat limited by a short context length: the maximum duration of clips can be small, or extending such short clips struggles to maintain long-term consistency.
TOOLS

FREE:
- Stable Video (SVD) - open source video diffusion model from StabilityAI
- DynamicCrafter video models in ComfyUI
- MotionCtrl - enhancement allowing object motion and camera trajectory control in various video models.
- CameraCtrl - enhancement focusing on camera trajectory control in various video models.
- Emu video - a preview demo of Meta's generative video model.
- LTXV video model in ComfyUI
- Mochi 1 - open video model, also implemented in ComfyUI.
- CogVideoX - open video model, also implemented in ComfyUI.
- Pyramid Flow - open video model, also implemented in ComfyUI (requires very high VRAM).
- Text 2 Video extension for A1111 webui, to be used with one of these models (if you have sufficient hardware)*

Plugins and addons:
- Pallaidium for Blender - a multi-functional toolkit crammed with generative functionality across image, video, and even audio domains.

Additionally, you may find some free demos on Hugging Face spaces.

PAID (some with limited free plans), online services:
- Runway's products
- OpenAI's Sora
- Kaiber - makes use of several different generative AI models, including the newest video models.
- Pika Labs
- Luma Dream Machine
- Hailuo AI
- KLING AI
- PixVerse
- Flux AI video generator

* Optimally you have decent enough hardware, namely a GPU, to run these tools locally. Alternatively, you may be able to try running these models through remote machines, like Google Colab, but most free plans and trials are very limiting.

MORE EXAMPLES:
-
-
Image models enhanced with motion comprehension
With the growing popularity of AnimateDiff, this is an emerging field of enhancing established image diffusion models with video or "motion" comprehension. The results are more similar to native video models (shown above) than what you would get with frame-by-frame techniques. The catch is that you can also utilize everything that has been built for these image models, such as Stable Diffusion, including any community-created checkpoint, LoRA, ControlNet, or other kind of conditioning.
The motion itself in this technique is often quite primitive, only loosely interpolating objects and flow throughout the clip, often morphing things into other things. It does that with much more temporal consistency though (less flicker), and it is still in its infancy. Best results come from abstract, less concrete subjects and scenes.
The community is actively experimenting with this tech (see "MORE EXAMPLES"). The techniques draw both from static image models (such as prompt travel) and from video-native models and their advancements. In some cases, people are trying to squeeze smoother video or 3D render stylization results out of it compared to frame-by-frame image model techniques.
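To give a flavor of how this looks in code, here's a rough sketch of AnimateDiff through the diffusers library, where a motion adapter is bolted onto a regular SD v1.5 image checkpoint (the adapter and base model names are the public defaults, assumed here):

```python
# A sketch of AnimateDiff via diffusers: an image model gains temporal
# awareness through a separately trained motion adapter.
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter,
    torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DDIMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler")

out = pipe("ink wash painting of waves rolling onto a shore",
           num_frames=16, guidance_scale=7.5, num_inference_steps=25)
export_to_gif(out.frames[0], "waves.gif")
```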
PROS:
- Benefits from all development that has been done on existing image diffusion models.
- Can be conditioned with video or complex data such as 3D render passes.
- Very good with abstract, flowing motion.

CONS:
- Does not work well to produce complex, coherent motion of characters or unusual objects, often leading to morphing instead.
- Computationally expensive just like video native models. Less accessible to run on your own hardware than image AI.
- Limited by somewhat short context window (for now), although there are always workarounds that people experiment with.
TOOLS (FREE / PAID, some with limited free plans):

Currently, implementations of AnimateDiff (for SD v1.5) are leading the charge here:
- A1111 webui extension for AnimateDiff.
- AnimateDiff implementation in ComfyUI and a plethora of community-made workflows around it.
- SparseCtrl - a method to condition a video model with a sparse set of keyframe data, similarly to ControlNets, but in the context of video. (Supported by AnimateDiff v3)

for SD XL:

Multi-functional implementations:
- VisionCrafter - a GUI tool for the AnimateDiff implementation and other projects.
- Enfugue
MORE EXAMPLES:
-
-
Talking heads and characters (with speech synthesis)
Combination of techniques and AI models that focus on "talking heads": faces and characters with a synthetic or real voice and accurately matching lip movements.
(By themselves, general generative video models usually cannot depict accurate lip movements and spoken performance.)
-
Synthetic (generated performance)
I know it, you know it. It's the technique behind a viral meme. Whenever you see a relatively still character (the camera could be moving too) with an animated talking face, it likely relates to a particular set of methodologies using AI face animation and synthetic speech tools.
In the case of said memes, the source images are often made with generative image AI, but you may also use any image with a face. The speech gets generated from text, conditioned on a chosen character voice. Then a different tool (or a model within a packaged tool) synthesizes facial animation with appropriate lip sync from the voice, usually only generating motion in the face and head area of the image. Using pre-trained avatars allows for movement of the body as well.
Most online services now streamline and package this process into cohesive tools, and what exactly happens under the hood is anyone's guess, but it may be deduced from related research papers and open source alternatives.
-
Performance capture
A related category of tools are those that, instead of synthesizing a performance, map a real captured performance onto a character using AI models (not to be confused with traditional facial capture for CGI characters). This can allow very accessible performances through virtual characters by generating ready video clips, skipping the conventional motion capture and character rigging pipeline entirely.
PROS:
- Easy memes.
- Mass-producible talking avatars for games, installations, roleplay, cheap DIY shows, etc.

CONS:
- Less advanced and older tools produce somewhat uncanny results.
- For the most part, reliant on closed-source facial animation tools in paid apps.
- Results are often stiff and not too expressive, even when training it with your own footage for an avatar.
TOOLS

FREE:
- "Wav2Lip" - an A1111 webui extension that generates "lip-sync" animation. Seems to be limited to the mouth area.
- SadTalker - another talking head generator based on audio. Available for use through A1111 webui and Discord.
- ElevenLabs - constrained usage, but limits seem to refresh monthly.
- Live Portrait - open source model to generate talking heads from a reference video, also available to run locally.
- Animated 2D cutout characters using Adobe Express (in "Animate characters" mode).

PAID (some with limited free plans):
- Runway's Act-One - tool to synthesize animated talking heads from a real recorded performance.
- Hedra
- RenderNet - service focused on consistent generative characters.

Tools mostly catering to business and corporate use cases:
- D-ID
- Heygen
- Synesthesia
-
-
Generative 3D character motion
This refers to motion synthesis in the context of 3D characters. It can apply to 3D animated film, video games, or other interactive 3D applications. Just like with images and video, these emerging AI tools allow you to prompt character motion through text. Additionally, some also build it from a very limited set of key poses, or produce animations dynamically on-the-fly in interactive settings.
Because this list focuses on generative tools, I am leaving out some AI applications that automate certain non-creative tasks, like AI-powered motion tracking, compositing, masking, etc., as seen in Move.ai or Wonder Dynamics.
PROS:
- Fits inside the established 3D animation workflow, reducing tedious tasks, potentially working as a utility for skilled animators.
- Handles physics and weight really well.
- The future of dynamic character animation in video games?

CONS:
- Usually limited to humanoid bipedal characters.
- Not self-sufficient. Only one component of the 3D animation workflow. You need to know where to take it next.
- Training is usually done on human motion capture data, which means these techniques so far only deal with realistic, physics-based motion, nothing stylized or cartoony.
TOOLS

FREE (or limited plans):
- Omni Animation
- Cascadeur - animation assistant that creates smooth, physics-based animation and poses from minimal input. Highly controllable and looks like a major player in the future.
- ComfyUI MotionDiff - implementation of MDM, MotionDiffuse, and ReMoDiffuse in ComfyUI.

PAID:
- Paid plans of the free tools, which provide more features and expanded limits.

MORE EXAMPLES:
-
LLM powered
In theory, with LLMs (Large Language Models) showing great performance in coding tasks, especially when fine-tuned, you could tell one to program and write scripts inside animation-capable software, or to describe motion through raw keyframe and curve data. The animation would follow the usual workflow, but the AI would assist you throughout. In an extreme case, the AI does everything for you, executing or assigning appropriate tasks in a back-end pipeline.
In practice, you can kind of already try it! Blender, for example, is equipped with a very extensive Python API that lets you operate it through code, so there are already a couple of ChatGPT-like assistant tools available. This is an unavoidable trend: everywhere there is code, LLMs will likely show some practical use cases.
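As a toy illustration, the kind of snippet such an assistant might spit out for "make the selected object bounce" could look like this (frame numbers and values are arbitrary examples):

```python
# A toy example of a Blender Python snippet an LLM assistant might produce:
# keyframe the active object's Z location to create a simple up-and-down arc.
import bpy

obj = bpy.context.active_object  # assumes an object is selected
for frame, z in [(1, 0.0), (13, 2.0), (25, 0.0)]:
    obj.location.z = z
    obj.keyframe_insert(data_path="location", index=2, frame=frame)  # index=2 -> Z axis
```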
PROS:
- The promise: deconstruction of any technical barrier for creatives.
- Useful as a copilot or assistant in creative software, eliminating tedious, repetitive tasks and digging through documentation for you.

CONS:
- If AI creates everything for you, then what's the point of being creative in the first place?
- For now, running capable LLMs is mostly only feasible on powerful remote machines, and thus paid per token or by subscription.
TOOLS (FREE / PAID):
- Blender Chat Companion - (similar to Blender Copilot) a ChatGPT implementation inside Blender, specialized to handle appropriate tasks. Uses ChatGPT API tokens, which are paid.
- Blender Copilot - (similar to Blender Chat Companion) a ChatGPT implementation inside Blender, specialized to handle appropriate tasks. Uses ChatGPT API tokens, which are paid.

There's also the upcoming ChatUSD - a chatbot to work with and manage USD, a standard initially created by Pixar to unify and simplify 3D data exchange and parallelization in animated film production. I can't tell you much more here, but NVIDIA seems to be embracing it as a standard for anything 3D, not just film.
Whew! That was A LOT, but I likely still missed something. Please comment below to suggest entries and tweaks to improve this and keep it up-to-date. Thank you for reading!