Tencent has taken a bold step forward in AI video synthesis with the release of HunyuanCustom—a new multimodal video model that builds on the original Hunyuan Video foundation. With this upgrade, users can now generate highly realistic deepfake-style videos from a single image, complete with synced audio and dynamic lip movement.
At the heart of this breakthrough is HunyuanCustom’s ability to generate full-motion video using nothing more than a still image and a text prompt. This includes facial expressions, head movements, gestures, and even spoken words—rendered with surprising fluidity and coherence. Compared to both open-source rivals and proprietary models like Kling, Vidu, and Pika, Tencent’s offering brings fresh competition to the rapidly evolving space of synthetic video.
Single Image, Full Motion
HunyuanCustom introduces a significant shift from earlier methods that required multiple images or fine-tuned LoRA models to maintain visual consistency. The system generates expressive video from a single frontal image, though its limitations show once the subject turns too far off-angle, so it works best when the face stays mostly front-facing. It may not replace LoRA-based workflows for every use case, but it offers a faster, more accessible route to video generation.
In one clip, a man cooks while listening to music, with the entire scene animated from a single photo. In another, a child smiles throughout, echoing the expression in the input image. The results are impressive, though clearly constrained by how much the model can infer from one static frame.
When tested on tasks like virtual try-ons—where the AI places clothes on a subject—the model intelligently crops the frame to compensate for missing visual information. It’s a smart workaround, but still points to the benefits of using multiple images or camera angles for more complex videos.
Realistic Lip Sync with LatentSync
One of HunyuanCustom’s biggest advances is its audio alignment system. Using the LatentSync framework, the model can match voice input to accurate lip movements. Although the current demos are in Chinese, the synchronization quality is strong. The AI can animate realistic facial expressions and speech gestures based on both voice clips and text descriptions—without needing full-body motion capture or labor-intensive rigging.
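To make the timing aspect concrete, here is a minimal sketch of how audio is typically aligned with video frames before it can drive lip motion: the waveform is sliced into one window per frame so each generated frame can be conditioned on the sound occurring at that instant. This is an illustration of the general technique, not Tencent’s or LatentSync’s actual code, and the function and variable names are my own.

```python
import numpy as np

def audio_windows_per_frame(waveform: np.ndarray, sample_rate: int,
                            fps: float, num_frames: int) -> list[np.ndarray]:
    """Slice a mono waveform into one window per video frame.

    Illustrative only: real lip-sync systems such as LatentSync feed these
    windows through an audio encoder before conditioning the video model.
    """
    samples_per_frame = sample_rate / fps
    windows = []
    for i in range(num_frames):
        start = int(round(i * samples_per_frame))
        end = int(round((i + 1) * samples_per_frame))
        windows.append(waveform[start:end])
    return windows

# Example: 5 seconds of silence at 16 kHz, aligned to a 25 fps clip.
audio = np.zeros(16000 * 5, dtype=np.float32)
chunks = audio_windows_per_frame(audio, sample_rate=16000, fps=25, num_frames=125)
print(len(chunks), len(chunks[0]))  # 125 windows, 640 samples each
```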
This capability places HunyuanCustom ahead of many hobbyist-level tools that struggle with audio-driven facial animation. It’s also one of the features that gives Tencent’s model a strong edge in producing believable avatars and virtual influencers.
AI Video Editing Without a Full Rebuild
The platform also offers a lightweight approach to video-to-video editing, letting users mask a portion of an existing video and replace that section with a generated version based on a reference image. Unlike traditional tools that require reconstructing the full video, HunyuanCustom surgically modifies just the target area while maintaining the rest of the footage—much like Adobe Firefly.
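The masking mechanic is easiest to picture as per-frame compositing: generated pixels are blended in only where the mask is active, and everything else is carried over from the source footage. The sketch below is my own illustration of that final step, not code from the repository.

```python
import numpy as np

def composite_edit(original: np.ndarray, generated: np.ndarray,
                   mask: np.ndarray) -> np.ndarray:
    """Blend generated frames into the masked region of the original clip.

    original, generated: (frames, height, width, 3) float arrays in [0, 1]
    mask: (frames, height, width, 1) float array, 1.0 where the edit applies.
    This is only the compositing step; the hard part (generating content
    that matches the scene's motion and lighting) happens upstream in the
    diffusion model.
    """
    return mask * generated + (1.0 - mask) * original

# Toy example: replace the left half of a 4-frame clip.
frames = np.zeros((4, 64, 64, 3), dtype=np.float32)
new_content = np.ones_like(frames)
mask = np.zeros((4, 64, 64, 1), dtype=np.float32)
mask[:, :, :32, :] = 1.0
edited = composite_edit(frames, new_content, mask)
print(edited[0, 0, 0, 0], edited[0, 0, 63, 0])  # 1.0 (edited) vs 0.0 (kept)
```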
Tests show this feature works well for replacing or adding characters and objects. The system applies motion, lighting, and context-aware changes, resulting in seamless edits that feel grounded in the original scene. Compared to models like VACE or Kling, which often struggle with blending, Tencent’s model delivers more natural transitions and fewer artifacts.
How It Works Behind the Scenes
Built on the original December 2024 HunyuanVideo model, HunyuanCustom is a fine-tuned extension—not a complete rewrite. It retains the causal 3D-VAE architecture and now adds a robust multimodal data pipeline. The system uses a mix of open and synthetic datasets across categories like humans, vehicles, architecture, and anime. Video clips are filtered, segmented, and standardized to five seconds, with multiple rounds of annotation and aesthetic scoring.
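As a rough illustration of the preprocessing described above (a sketch, not Tencent’s pipeline code), standardizing footage into fixed five-second clips amounts to splitting each source video into uniform windows and discarding the leftover tail:

```python
def split_into_clips(total_frames: int, fps: float,
                     clip_seconds: float = 5.0) -> list[tuple[int, int]]:
    """Return (start_frame, end_frame) pairs for fixed-length clips.

    Sketch of the "standardized to five seconds" step; the real pipeline
    also applies filtering, annotation, and aesthetic scoring before a
    clip is kept.
    """
    clip_len = int(fps * clip_seconds)
    clips = []
    start = 0
    while start + clip_len <= total_frames:
        clips.append((start, start + clip_len))
        start += clip_len
    return clips

# A 23-second video at 24 fps yields four full 5-second clips.
print(split_into_clips(total_frames=23 * 24, fps=24))
# [(0, 120), (120, 240), (240, 360), (360, 480)]
```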
Tools like YOLO11X, InsightFace, and Grounded SAM 2 handle subject detection and segmentation, while the Qwen and LLaVA language models help connect image features with written prompts. The end result is a tightly aligned vision-language dataset that guides video generation with enhanced accuracy.
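A single annotation pass might look something like the sketch below, which uses the off-the-shelf Ultralytics YOLO API for detection and treats the caption as a placeholder for output from a vision-language model such as Qwen or LLaVA. The record structure is hypothetical; the actual pipeline, including Grounded SAM 2 masks and InsightFace identity data, is more elaborate.

```python
from ultralytics import YOLO  # pip install ultralytics

def annotate_frame(image_path: str, caption: str) -> dict:
    """Build one vision-language record: detected subjects plus a caption.

    Detection uses an off-the-shelf YOLO11 checkpoint; the caption argument
    stands in for a vision-language model's output. Segmentation and face
    embeddings would be added to the same record in a fuller pipeline.
    """
    detector = YOLO("yolo11x.pt")
    result = detector(image_path)[0]
    boxes = [
        {"label": result.names[int(cls)], "xyxy": box.tolist()}
        for cls, box in zip(result.boxes.cls, result.boxes.xyxy)
    ]
    return {"image": image_path, "subjects": boxes, "caption": caption}

record = annotate_frame("frame_0001.jpg", "a man cooking while listening to music")
print(record["caption"], len(record["subjects"]))
```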
To improve facial identity retention, the researchers added an “identity enhancement” module. This helps the model preserve facial features over time, even with limited visual reference. It’s crucial for creating character-consistent avatars—especially when only one photo is available.
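The module’s exact design isn’t spelled out here, but the general idea is to keep a representation of the reference face available at every point in the generated sequence. One simple way to picture that, shown in the sketch below, is to place the encoded reference image at the front of the latent timeline so temporal attention can consult it for every frame; treat this as an illustration rather than the module’s actual code.

```python
import torch

def attach_identity(video_latent: torch.Tensor,
                    identity_latent: torch.Tensor) -> torch.Tensor:
    """Prepend an identity latent along the temporal axis.

    video_latent:    (batch, frames, channels, h, w)
    identity_latent: (batch, channels, h, w), encoded from the reference photo
    Placing the reference at the front of the sequence lets temporal layers
    consult it while generating every frame. This mirrors the spirit of the
    identity enhancement module, not its exact design.
    """
    identity_frame = identity_latent.unsqueeze(1)       # (batch, 1, c, h, w)
    return torch.cat([identity_frame, video_latent], dim=1)

latents = torch.randn(1, 32, 16, 45, 80)   # 32 latent frames
identity = torch.randn(1, 16, 45, 80)      # encoded reference image
print(attach_identity(latents, identity).shape)  # torch.Size([1, 33, 16, 45, 80])
```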
Coordinated Speech and Motion
On the audio front, the model separates identity from voice data using a dedicated module called AudioNet. This ensures that the character in the video still looks like the original image, even while speaking or moving expressively. An extra timing module helps map voice to gestures, allowing subjects to move their heads or raise their hands in sync with speech patterns.
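One way to picture this separation is as two conditioning paths: image features carry identity, while audio features steer motion through their own attention pathway. The sketch below illustrates that idea with a generic cross-attention block; it is an assumption-laden stand-in, not AudioNet’s real architecture.

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Let frame tokens attend to audio tokens via cross-attention.

    Illustrative stand-in for an audio-injection module: queries come from
    the video, keys/values from the audio, so speech can steer motion
    without replacing the identity features carried by the image branch.
    """

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor,
                audio_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=frame_tokens,
                                key=audio_tokens,
                                value=audio_tokens)
        return frame_tokens + attended  # residual: audio nudges, not replaces

frames = torch.randn(1, 32 * 16, 256)   # 32 frames x 16 spatial tokens each
audio = torch.randn(1, 32, 256)         # one audio token per frame
print(AudioCrossAttention()(frames, audio).shape)  # torch.Size([1, 512, 256])
```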
For multi-character scenes, each subject is processed independently, with their image-text pair injected into different sections of the video timeline. This allows the AI to create realistic interactions between people, even when only limited reference materials are provided.
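A rough sketch of that idea (mine, not the published implementation) is to give each subject’s conditioning its own span of the latent timeline:

```python
def assign_timeline_slots(num_frames: int, subjects: list[str]) -> dict[str, range]:
    """Give each subject's image-text conditioning a contiguous span of frames.

    A simplified picture of per-subject injection: in practice the spans can
    overlap, and the model still has to resolve interactions between subjects.
    """
    span = num_frames // len(subjects)
    return {
        name: range(i * span, num_frames if i == len(subjects) - 1 else (i + 1) * span)
        for i, name in enumerate(subjects)
    }

slots = assign_timeline_slots(num_frames=125, subjects=["man", "woman"])
print({k: (v.start, v.stop) for k, v in slots.items()})
# {'man': (0, 62), 'woman': (62, 125)}
```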
Performance Metrics and Comparisons
In tests against both commercial and open-source systems, HunyuanCustom leads in identity consistency (measured with ArcFace) and subject similarity (evaluated with YOLO and DINOv2). It also performs competitively in temporal consistency and text-video alignment. Compared to Kling, which often introduces copy-paste artifacts, and VACE, which struggles with edge blending, Tencent’s model offers better visual stability and more coherent results.
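For readers who want to reproduce the identity-consistency idea on their own outputs, the metric is essentially the average cosine similarity between the reference face embedding and the embedding extracted from each generated frame. The sketch below assumes those embeddings (for example, from ArcFace via InsightFace) are already available; the paper’s exact evaluation protocol may differ.

```python
import numpy as np

def identity_consistency(reference: np.ndarray, frame_embeddings: np.ndarray) -> float:
    """Mean cosine similarity between a reference embedding and per-frame ones.

    reference:        (dim,) face embedding of the input photo
    frame_embeddings: (frames, dim) embeddings from the generated frames
    Higher is better; this mirrors the ArcFace-based consistency measure in
    spirit rather than reproducing the paper's evaluation exactly.
    """
    ref = reference / np.linalg.norm(reference)
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    return float(np.mean(frames @ ref))

rng = np.random.default_rng(0)
ref = rng.normal(size=512)
frames = ref + 0.1 * rng.normal(size=(30, 512))  # embeddings close to the reference
print(round(identity_consistency(ref, frames), 3))  # close to 1.0
```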
Even in complex tasks like ad-style product placement and audio-driven human animation, HunyuanCustom excels at preserving fine details, facial expressions, and motion accuracy.
Open Access, With a Catch
HunyuanCustom’s code and model weights are available on GitHub, with two resolution options (720×1280 and 512×896). However, it is currently Linux-only and requires at least 24GB of GPU memory, with 80GB recommended for best results. Windows support and quantized versions may arrive soon, as was the case with earlier releases.
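Before attempting a local install, it is worth confirming the hardware floor. A quick check like the one below (assuming a CUDA build of PyTorch) reports whether a card clears the 24GB minimum or the 80GB recommended spec:

```python
import torch

# Quick sanity check against the published requirements:
# at least 24 GB of VRAM, with around 80 GB recommended for best results.
if not torch.cuda.is_available():
    print("No CUDA GPU detected.")
else:
    name = torch.cuda.get_device_name(0)
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 80:
        print(f"{name}: {total_gb:.0f} GB - meets the recommended spec.")
    elif total_gb >= 24:
        print(f"{name}: {total_gb:.0f} GB - meets the stated 24 GB minimum.")
    else:
        print(f"{name}: {total_gb:.0f} GB - below the stated 24 GB minimum.")
```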
The GitHub repo offers all tools for local installation, though API access is gated behind a WeChat scan code. A ComfyUI extension is also in the works, which could make integration into creative pipelines even easier.
Final Thoughts
This release marks a major leap in AI-powered video synthesis. With single-image video generation, lip-sync accuracy, and editing capabilities, HunyuanCustom sets a new bar for what’s possible without heavy LoRA training or multi-shot datasets.
It’s far from perfect—issues with angle diversity and GPU demands remain—but Tencent is clearly pushing toward a future where generating custom video from a still photo becomes fast, flexible, and remarkably lifelike.