Abstract
Generating articulated objects, such as laptops and microwaves,
is a crucial yet challenging task with extensive applications in Embodied AI and AR/VR.
Current image-to-3D methods primarily focus on surface geometry and texture,
neglecting part decomposition and articulation modeling.
Meanwhile, neural reconstruction approaches (e.g., NeRF or Gaussian Splatting)
rely on dense multi-view or interaction data, limiting their scalability.
In this paper, we introduce DreamArt,
a novel framework for generating high-fidelity,
interactable articulated assets from single-view images.
DreamArt employs a three-stage pipeline:
first, it reconstructs part-segmented, complete 3D object meshes by combining
image-to-3D generation, mask-prompted 3D segmentation, and part amodal completion.
Second, we fine-tune a video diffusion model to capture part-level articulation priors,
using movable part masks as prompts and amodal images to mitigate ambiguities caused by occlusion.
Finally, DreamArt optimizes the articulation motion,
represented by a dual quaternion, and conducts global texture refinement and repainting
to ensure coherent, high-quality textures across all parts. Experimental results demonstrate
that DreamArt effectively generates high-quality articulated objects,
with accurate part shapes, high appearance fidelity, and plausible articulation,
thereby providing a scalable solution for articulated asset generation.
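For concreteness, a minimal sketch of the dual-quaternion parameterization mentioned above, in notation assumed here rather than taken from the paper: a rigid articulation motion with rotation (unit quaternion $q_r$) and translation $t$ can be written as a unit dual quaternion.
% Illustrative dual-quaternion form of a rigid articulation motion.
% q_r : unit quaternion encoding the movable part's rotation
% t   : pure quaternion (0, t_x, t_y, t_z) encoding its translation
\[
  \hat{q} \;=\; q_r + \epsilon\, q_d,
  \qquad q_d \;=\; \tfrac{1}{2}\, t\, q_r,
  \qquad \epsilon^2 = 0 ,
\]
so that both the rotation and the translation of a part are optimized jointly through a single eight-dimensional quantity.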