Kling Video O1 Launch: The World’s First Unified Multimodal Video Model
We are proud to introduce Kling Video O1, the world’s first unified multimodal video model, now available on Atlas Cloud.
Kling Video O1 Pricing Snapshot:
| Model | Price |
| --- | --- |
| Kling Video O1 Text-to-video | $0.0896/sec on Atlas Cloud |
| Kling Video O1 Reference-to-video | $0.0896/sec on Atlas Cloud (vs. $0.112/sec on WaveSpeed.ai & Fal.ai) |
| Kling Video O1 Image-to-video | $0.0896/sec on Atlas Cloud (vs. $0.112/sec on WaveSpeed.ai & Fal.ai) |
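Because pricing is per second of generated footage, estimating clip cost is simple arithmetic. A quick illustration in Python (the rate comes from the table above; the 10-second length matches the model's maximum clip duration):

```python
ATLAS_RATE_PER_SEC = 0.0896  # USD per second, Kling Video O1 on Atlas Cloud

def clip_cost(duration_sec: float, rate_per_sec: float = ATLAS_RATE_PER_SEC) -> float:
    """Estimated cost of generating a clip of the given duration."""
    return duration_sec * rate_per_sec

print(f"10-second clip: ${clip_cost(10):.4f}")  # 10 * 0.0896 = $0.8960
```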
Introducing Kling Video O1
For too long, AI video creation has been fragmented. Users had to switch between different tools for text-to-video, image-to-video, and video editing. Kling Video O1 changes everything. As the world’s first unified multimodal video model, it allows creators to input text, images, and video simultaneously within a single dialogue context. The model doesn’t just generate pixels; it understands the semantic relationship between visual elements and language, allowing for intuitive, conversational creation.
Core Features Highlights
Unified Input Processing
The model supports multiple input types within a single interface, without requiring the user to switch between different tools or modules.
- Flexible Multimodal Combination: Lets users combine text prompts, reference images, and video files in a single request (see the sketch after this list).
- Task Versatility: Handles a range of functions, including reference generation, inpainting, video transformation, and video extension.
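As a concrete illustration, the sketch below shows what a single combined request might look like. The endpoint URL and field names (`prompt`, `image_url`, `video_url`) are assumptions for illustration, not Atlas Cloud's documented schema; check the model pages linked at the end of this post for the actual API.

```python
import requests

# Hypothetical endpoint and request fields -- illustrative only.
API_URL = "https://example.com/v1/kling-video-o1"

payload = {
    # Text, a reference image, and a source video in one request:
    "prompt": "Blend the character from the image into the video's final scene",
    "image_url": "https://example.com/character.png",  # reference image
    "video_url": "https://example.com/base_clip.mp4",  # source video
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": "Bearer <YOUR_API_KEY>"},
)
resp.raise_for_status()
print(resp.json())  # assumed: a job id or the output video URL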
Semantic Video Modification
Kling Video O1 enables video editing through natural language understanding rather than manual masking. The model interprets semantic commands, such as "remove the person in the background" or "change the time of day to sunset." It automatically identifies the relevant pixels and objects to modify while maintaining the temporal consistency of the video, removing the need for manual rotoscoping or frame-by-frame adjustments.
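In API terms, an edit like this reduces to one prompt plus the source clip. The following sketch reuses the hypothetical endpoint style from above and adds a simple polling loop; the `id`, `status`, and `output_url` fields are likewise assumptions, not documented behavior.

```python
import time
import requests

EDIT_URL = "https://example.com/v1/kling-video-o1/edit"  # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <YOUR_API_KEY>"}

payload = {
    "video_url": "https://example.com/street_scene.mp4",
    # One natural-language instruction replaces masking and rotoscoping:
    "prompt": "Remove the person in the background and change the time of day to sunset",
}
job = requests.post(EDIT_URL, json=payload, headers=HEADERS).json()

# Poll until the edit finishes (response fields are assumptions).
while True:
    state = requests.get(f"{EDIT_URL}/{job['id']}", headers=HEADERS).json()
    if state["status"] in ("succeeded", "failed"):
        break
    time.sleep(5)

print(state.get("output_url"))
```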
Reference-Based Object Consistency
The model introduces a feature for maintaining character and object identity. By uploading a reference image and tagging it as an "Element," users can direct the model to use that specific object or face in the video. This ensures that character features, clothing, or product details remain consistent throughout the generated clip, regardless of camera movement or scene changes.
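A request using this feature might tag the reference image under an `elements` field, as in the hypothetical payload below (the field names are assumptions based on the feature description):

```python
# Hypothetical payload: the "elements" list tags reference images whose
# identity (face, clothing, product details) must stay consistent.
payload = {
    "prompt": "The tagged character walks through a neon-lit market at night",
    "elements": [
        {"name": "hero", "image_url": "https://example.com/hero_reference.png"},
    ],
}
```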
Frame-Based Narrative Control
Kling Video O1 offers precise control over video trajectory through start and end frames. Users upload an initial image and a final image, and the model generates the connecting footage. This is useful for creating specific transitions or ensuring a clip begins and ends exactly as a storyboard requires.
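In a request, this amounts to supplying two images alongside the prompt, for example (hypothetical field names, continuing the illustrative schema above):

```python
# Hypothetical payload: the model generates footage connecting the two frames.
payload = {
    "prompt": "Slow push-in that bridges the two storyboard frames",
    "start_frame_url": "https://example.com/storyboard_frame_A.png",  # opening image
    "end_frame_url": "https://example.com/storyboard_frame_B.png",    # closing image
    "duration": 5,  # seconds of connecting footage
}
```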
Multi-Instruction Execution
The system can process complex prompts containing multiple directives. Users can issue combined commands, such as adding a specific character to a scene while simultaneously altering the background environment. The model executes these parallel tasks while ensuring the visual elements blend logically.
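In practice this means packing several directives into one prompt rather than issuing sequential edits. A sketch, using the same illustrative field names as the earlier examples:

```python
# One request, two directives: insert a tagged character AND swap the
# background. The model resolves both jointly in a single pass.
payload = {
    "prompt": (
        "Add the tagged character sitting at the café table, "
        "and change the background from a daytime street to a rainy night"
    ),
    "elements": [
        {"name": "character", "image_url": "https://example.com/character.png"},
    ],
    "video_url": "https://example.com/cafe_scene.mp4",
}
```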
Practical Applications
- Visual Effects (VFX): Automating the removal of unwanted objects, passersby, or debris from live-action footage.
- Virtual Try-On: Changing a subject's clothing in a video based on a reference image while preserving the original body movements and pose.
- Style Transfer: Converting the visual style of a video, such as transforming realistic footage into an animated style or oil painting aesthetic.
- Product Visualization: Generating videos where a specific product (defined as an Element) is placed in various dynamic environments for e-commerce demonstrations.
Conclusion
Kling Video O1 simplifies the video production pipeline by merging generation and editing capabilities. By utilizing semantic understanding and unified multimodal inputs, it reduces the technical barrier for complex video tasks. Users can now execute director-level changes and generate high-definition content (up to 1080p, 10 seconds) through descriptive prompts rather than manual technical work.
👉 Sign up today and get a $1 Free Trial. Experience the future of AI video editing, minus the complexity.
Experience Kling Video O1 on Atlas Cloud today. 👉 Kling Video O1 Text-to-video 👉 Kling Video O1 Reference-to-video 👉 Kling Video O1 Image-to-video
FAQ
Q1: What is Kling Video O1?
Kling Video O1 is the industry's first unified multimodal video model. Unlike previous models that separate tasks, O1 handles text-to-video, image-to-video, and complex video editing (inpainting, transformation) within a single model architecture using natural language.
Q2: How is it different from Kling 2.5 or Sora 2?
vs. Kling 2.5
- Architecture: Kling O1 uses a unified architecture for both generation and editing. Kling 2.5 focuses solely on high-quality generation.
- Editing Power: Kling O1 modifies existing videos (e.g., changing objects, altering backgrounds) directly. Kling 2.5 only creates new content.
- Input Handling: Kling O1 processes text, images, and video inputs together for better character consistency.
vs. Sora 2
- Core Focus: Kling O1 prioritizes creative control (editing, consistency, transitions). Sora 2 prioritizes physical simulation and world dynamics.
- Duration: Sora 2 generates longer clips (up to 20s+). Kling O1 generates shorter, edit-ready clips (5–10s).
- Availability: Kling O1 is publicly available for use now. Sora 2 remains in restricted access.
Q3: Can Kling Video O1 edit existing videos?
Yes. You can upload an existing video and use text prompts to modify it. You can change styles, replace objects, or alter the environment while preserving the original motion and composition of your footage.


