Existing imitation learning methods decouple perception and action, overlooking the causal reciprocity between sensory representations and action execution that humans naturally exploit for adaptive behavior. To bridge this gap, we introduce the Action-Guided Diffusion Policy (DP-AG), a unified representation learning framework that explicitly models the dynamic interplay between perception and action through probabilistic latent dynamics. DP-AG encodes latent observations into a Gaussian posterior via variational inference and evolves them with an action-guided SDE, in which the vector–Jacobian product (VJP) of the diffusion policy's noise predictions acts as a structured stochastic force driving latent updates. To promote bidirectional learning between perception and action, we further introduce a cycle-consistent contrastive loss that organizes the gradient flow of the noise predictor into a coherent perception–action loop, enforcing mutually consistent transitions in both latent updates and action refinements. Theoretically, we derive a variational lower bound for the action-guided SDE and prove that the contrastive objective improves continuity in both latent and action trajectories. Empirically, DP-AG significantly outperforms state-of-the-art methods across simulation benchmarks and real-world UR5 manipulation tasks. DP-AG thus offers a promising step toward bridging biological adaptability and artificial policy learning.
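To make the core mechanism concrete, below is a minimal PyTorch sketch of a single action-guided latent update. All names and values here are illustrative assumptions, not the released implementation: `NoisePredictor` is a toy stand-in for the diffusion policy's noise network, and `latent_dim`, `action_dim`, the cotangent vector `v`, `dt`, and `sigma` are placeholder choices. It shows the idea of using the noise predictor's VJP as a stochastic force on the latent, discretized with one Euler–Maruyama step.

```python
# Hypothetical sketch of DP-AG's action-guided latent update (assumed names/shapes).
import torch

latent_dim, action_dim = 64, 7  # illustrative dimensions, not from the paper

class NoisePredictor(torch.nn.Module):
    """Toy stand-in for the diffusion policy's noise network eps_theta(z, a, t)."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(latent_dim + action_dim + 1, 128),
            torch.nn.SiLU(),
            torch.nn.Linear(128, action_dim),
        )

    def forward(self, z, a, t):
        return self.net(torch.cat([z, a, t], dim=-1))

eps_theta = NoisePredictor()

# 1) Variational encoder output: Gaussian posterior over the latent observation.
mu, log_var = torch.zeros(1, latent_dim), torch.zeros(1, latent_dim)
z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterized sample

# 2) Action-guided force: VJP of the noise prediction w.r.t. the latent.
a = torch.randn(1, action_dim)          # current noisy action sample
t = torch.full((1, 1), 0.5)             # diffusion time (placeholder)
z = z.requires_grad_(True)
eps = eps_theta(z, a, t)
v = torch.ones_like(eps)                # cotangent vector (assumed choice)
(vjp,) = torch.autograd.grad(eps, z, grad_outputs=v)   # v^T (d eps / d z)

# 3) One Euler-Maruyama step of the action-guided SDE:
#    z_{k+1} = z_k + VJP * dt + sigma * sqrt(dt) * noise   (drift simplified)
dt, sigma = 0.05, 0.1                   # placeholder step size and noise scale
with torch.no_grad():
    z_next = z + vjp * dt + sigma * (dt ** 0.5) * torch.randn_like(z)
```

The intuition behind this sketch: the VJP measures how sensitive the action-noise prediction is to the latent, so each step nudges the latent along the directions that most affect action refinement, coupling perception updates to the same gradient signal that drives the policy.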
Candy Push. The end-effector pushes small candies into a designated goal area. Object positions are randomized across trials.
Peg-in-Hole. The robot must insert a circular peg into a vertical hole using only RGB inputs from scene and wrist cameras, without explicit depth sensing, requiring the policy to infer 3D geometry from indirect cues.
Painting Circle. The robot traces a circular path using a paintbrush.
Painting Heart. The robot traces a heart-shaped path using a paintbrush.
Push-T. A circular end-effector pushes a T-shaped block to a target location.
Dynamic Push-T. A circular end-effector pushes a T-shaped block to a target location while a moving ball introduces dynamic disturbances.
Can. The robot picks up a can and places it in the correct bin, testing basic pick-and-place skills.
Lift. The robot lifts an object; this simple manipulation task benefits less from large datasets than more complex tasks.
Square. Also known as Square Nut Assembly, this task requires fitting a square nut onto a square peg.
Preview our NeurIPS 2025 camera-ready paper.
@inproceedings{wang2025act,
  title     = {Act to See, See to Act: Diffusion-Driven Perception--Action Interplay for Adaptive Policies},
  author    = {Wang, Jing and Peng, Weiting and Tang, Jing and Gong, Zeyu and Wang, Xihua and Tao, Bo and Cheng, Li},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}