ComfyUI Video Pipeline
Orchestrates video generation across three engines, selecting the best one based on requirements and available resources.
Engine Selection
VIDEO REQUEST
|
|-- Need film-level quality?
| |-- Yes + 24GB+ VRAM → Wan 2.2 MoE 14B
| |-- Yes + 8GB VRAM → Wan 2.2 1.3B
|
|-- Need long video (>10 seconds)?
| |-- Yes → FramePack (60 seconds on 6GB)
|
|-- Need fast iteration?
| |-- Yes → AnimateDiff Lightning (4-8 steps)
|
|-- Need camera/motion control?
| |-- Yes → AnimateDiff V3 + Motion LoRAs
|
|-- Need first+last frame control?
| |-- Yes → Wan 2.2 MoE (exclusive feature)
|
|-- Default → Wan 2.2 (best general quality)
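The tree above can be read as an ordered series of checks. The following is a minimal Python sketch of that reading, not part of any ComfyUI API: `VideoRequest` and `select_engine` are hypothetical names, and evaluating the branches strictly top to bottom is an assumption, since the tree does not say how to resolve a request that matches several branches.

```python
from dataclasses import dataclass


@dataclass
class VideoRequest:
    """Hypothetical request description; field names are illustrative."""
    film_quality: bool = False
    duration_s: float = 5.0
    fast_iteration: bool = False
    motion_control: bool = False
    first_last_frame: bool = False
    vram_gb: float = 24.0


def select_engine(req: VideoRequest) -> str:
    """Walk the decision tree top to bottom and return the first match."""
    if req.film_quality:
        if req.vram_gb >= 24:
            return "Wan 2.2 MoE 14B"
        if req.vram_gb >= 8:
            return "Wan 2.2 1.3B"
        # Assumption: with less than 8GB, fall through to the later checks.
    if req.duration_s > 10:
        return "FramePack"
    if req.fast_iteration:
        return "AnimateDiff Lightning"
    if req.motion_control:
        return "AnimateDiff V3 + Motion LoRAs"
    if req.first_last_frame:
        return "Wan 2.2 MoE"
    return "Wan 2.2"


if __name__ == "__main__":
    print(select_engine(VideoRequest(duration_s=45, vram_gb=6)))
```

Because the first matching branch wins, a request that needs both film-level quality and a 45-second duration resolves to Wan under this sketch. Reorder the checks if long duration should take priority.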
Pipeline 1: Wan 2.2 MoE (Highest Quality)
Image-to-Video
Prerequisites:
- wan2.1_i2v_720p_14b_bf16.safetensors in models/diffusion_models/
- umt5_xxl_fp8_e4m3fn_scaled.safetensors in models/clip/
- open_clip_vit_h_14.safetensors in models/clip_vision/
- wan_2.1_vae.safetensors in models/vae/
Settings:
| Parameter | Value | Notes |
| --- | --- | --- |
| Resolution | 1280x720 (landscape) or 720x1280 (portrait) | Native training resolution |
| Frames | 81 (~5 seconds at 16fps) | Multiples of 4 + 1 |
| Steps | 30-50 | Higher = better quality |
| CFG | 5-7 | |
| Sampler | uni_pc | Recommended for Wan |
| Scheduler | normal | |
Frame count guide:
| Duration | Frames (16fps) |
| --- | --- |
| 1 second | 17 |
| 3 seconds | 49 |
| 5 seconds | 81 |
| 10 seconds | 161 |
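These values follow from the "multiples of 4 + 1" rule: at 16fps, a clip of d whole seconds needs 16d + 1 frames. The helper below is a small sketch with a hypothetical function name. Rounding up to the next valid count for fractional durations is an assumption, not something the guide specifies.

```python
import math


def frames_for_duration(seconds: float, fps: int = 16) -> int:
    """Return the smallest frame count of the form 4n + 1 that covers
    at least `seconds` of video at `fps` frames per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    raw_frames = math.ceil(seconds * fps)
    n = math.ceil(raw_frames / 4)
    return 4 * n + 1


if __name__ == "__main__":
    for s in (1, 3, 5, 10):
        print(s, "s ->", frames_for_duration(s), "frames")
```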
VRAM optimization:
- FP8 quantization: halves VRAM with minimal quality loss
- SageAttention: faster attention computation
- Reduce frames if OOM
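As a rough illustration of why FP8 halves memory: BF16 stores each weight in 2 bytes and FP8 in 1 byte. The sketch below estimates weight memory only for a 14B-parameter model. It ignores activations, the text encoder, and the VAE, so actual VRAM use will be higher. The function name is illustrative.

```python
# Bytes used to store one parameter in each format.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimate memory for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in ("bf16", "fp8"):
        print(dtype, round(weight_memory_gib(14e9, dtype), 1), "GiB")
```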
Text-to-Video
Same as I2V, but uses wan2.1_t2v_14b_bf16.safetensors and an EmptySD3LatentImage node instead of image conditioning.
First+Last Frame Control (Wan 2.2 Exclusive)
Wan 2.2 MoE allows specifying both the first and last frame, enabling precise video planning:
- Generate two hero images with a consistent character
- Use first as start frame, second as end frame
- Wan interpolates the motion between them
Pipeline 2: FramePack (Long Videos, Low VRAM)
Key Innovation
VRAM usage is invariant to video length: FramePack generates 60-second videos at 30fps on just 6GB of VRAM.
How it works:
- Dynamic context compression: 1536 tokens for key frames, 192 for transitions
- Bidirectional memory with reverse generation prevents drift
- Frame-by-frame generation with context window
Settings
| Parameter | Value | Notes |
| --- | --- | --- |
| Resolution | 640x384 to 1280x720 | Depends on VRAM |
| Duration | Up to 60 seconds | VRAM-invariant |
| Quality | High (comparable to Wan) | Uses same base models |
When to Use
- Videos longer than 10 seconds
- Systems with limited VRAM (not needed on an RTX 5090)
- When VRAM is needed for parallel operations
- Batch video generation
Pipeline 3: AnimateDiff V3 (Fast, Controllable)
Strengths
- Motion LoRAs for camera control (pan, zoom, tilt, roll)
- Effect LoRAs (shatter, smoke, explosion, liquid)
- Sliding context window for videos of arbitrary length
- Very fast with Lightning model (4-8 steps)
Settings
| Parameter | Value (Standard) | Value (Lightning) |
| --- | --- | --- |
| Motion Module | v3_sd15_mm.ckpt | animatediff_lightning_4step.safetensors |
| Steps | 20-25 | 4-8 |
| CFG | 7-8 | 1.5-2.0 |
| Sampler | euler_ancestral | lcm |
| Resolution | 512x512 | 512x512 |
| Context Length | 16 | 16 |
| Context Overlap | 4 | 4 |
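To show how a context length of 16 and an overlap of 4 cover a longer clip, here is a simplified sketch of a sliding context window. It is an assumption about the general idea only: `context_windows` is a hypothetical helper, and the scheduling actually used by AnimateDiff nodes may differ.

```python
def context_windows(total_frames: int, context_length: int = 16,
                    overlap: int = 4) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, that cover
    `total_frames` with windows of `context_length` frames, each
    sharing at least `overlap` frames with the previous window."""
    if not 0 <= overlap < context_length:
        raise ValueError("overlap must be in [0, context_length)")
    if total_frames <= context_length:
        return [(0, total_frames)]
    stride = context_length - overlap
    windows = []
    start = 0
    while start + context_length < total_frames:
        windows.append((start, start + context_length))
        start += stride
    # Align the final window to the end of the clip so no frame is missed.
    windows.append((total_frames - context_length, total_frames))
    return windows


if __name__ == "__main__":
    print(context_windows(40))
```

With the default settings, each window advances by 12 frames, so a 40-frame clip is covered by three windows. Frames in the overlapping regions are processed by two windows, which is what lets neighbouring windows be blended smoothly.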
Camera Motion LoRAs
| LoRA | Motion |
| --- | --- |
| v2_lora_ZoomIn | Camera zooms in |
| v2_lora_ZoomOut | Camera zooms out |
| v2_lora_PanLeft | Camera pans left |
| v2_lora_PanRight | Camera pans right |
| v2_lora_TiltUp | Camera tilts up |
| v2_lora_TiltDown | Camera tilts down |
| v2_lora_RollingClockwise | Camera rolls clockwise |
Post-Processing Pipeline
After any video generation:
1. Frame Interpolation (RIFE)
Doubles or quadruples frame count for smoother motion:
Input (16fps) → RIFE 2x → Output (32fps)
Input (16fps) → RIFE 4x → Output (64fps)
Use rife47 or rife49 model.
2. Face Enhancement (if character video)
Apply FaceDetailer to each frame:
- denoise: 0.3-0.4 (lower than for still images, to preserve temporal consistency)
- guide_size: 384 (speed optimization for video)
- detection_model: face_yolov8m.pt
3. Deflicker (if needed)
Reduces temporal inconsistencies between frames.
4. Color Correction
Maintain consistent color grading across frames.
5. Video Combine
Final output via VHS Video Combine:
frame_rate: 16 (native) or 24/30 (after interpolation)
format: "video/h264-mp4"
crf: 19 (high quality) to 23 (smaller file)
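These three settings can be collected in a small helper. This is a sketch under assumptions: `video_combine_settings` is a hypothetical function, not part of VHS, and only the keys `frame_rate`, `format`, and `crf` come from the settings listed above. It also reports how playback speed changes when the encode rate differs from the interpolated rate, assuming no frames are dropped.

```python
def video_combine_settings(source_fps: float = 16, rife_multiplier: int = 1,
                           frame_rate: float | None = None,
                           crf: int = 19) -> tuple[dict, float]:
    """Return (settings, playback_speed).

    playback_speed is 1.0 when the encode rate equals the interpolated
    rate; values below 1.0 mean the clip plays in slow motion.
    """
    if rife_multiplier < 1:
        raise ValueError("rife_multiplier must be >= 1")
    if not 0 <= crf <= 51:
        raise ValueError("crf for 8-bit H.264 must be between 0 and 51")
    interpolated_fps = source_fps * rife_multiplier
    if frame_rate is None:
        # Default: encode at the rate the frames were produced for.
        frame_rate = interpolated_fps
    settings = {
        "frame_rate": frame_rate,
        "format": "video/h264-mp4",
        "crf": crf,
    }
    return settings, frame_rate / interpolated_fps


if __name__ == "__main__":
    settings, speed = video_combine_settings(16, rife_multiplier=2, frame_rate=24)
    print(settings, "playback speed:", speed)
```

For example, 16fps footage interpolated 2x (32fps) and written at 24fps plays at 0.75x speed. To keep the original timing, encode at the interpolated rate instead.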
Talking Head Pipeline
Complete pipeline for character dialogue:
1. Generate audio → comfyui-voice-pipeline
2. Generate base video → This skill (Wan I2V or AnimateDiff)
- Prompt: "{character}, talking naturally, slight head movement"
- Duration: match audio length
3. Apply lip-sync → Wav2Lip or LatentSync
4. Enhance faces → FaceDetailer + CodeFormer
5. Final output → video-assembly
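For the "match audio length" step, the frame count can be derived from the audio file's duration. The sketch below assumes the audio is an uncompressed WAV file readable by Python's standard `wave` module, and that the base video is generated with Wan at 16fps using a 4n + 1 frame count. `wav_duration_seconds` and `frames_for_audio` are hypothetical helper names.

```python
import math
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def frames_for_audio(path: str, fps: int = 16) -> int:
    """Return the smallest 4n + 1 frame count that covers the audio."""
    raw_frames = math.ceil(wav_duration_seconds(path) * fps)
    # Assumption: always use at least one group of 4 frames plus 1.
    n = max(1, math.ceil(raw_frames / 4))
    return 4 * n + 1
```

A 2-second clip yields 33 frames under these assumptions. For AnimateDiff the 4n + 1 rule does not necessarily apply, so only the duration helper would be relevant there.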
Quality Checklist
Before marking video as complete:
Reference
references/workflows.md - Workflow templates for Wan and AnimateDiff
references/models.md - Video model download links
references/research-log.md - Latest video generation advances
state/inventory.json - Available video models