Anchor Watermark: Robust Attribution for Diffusion-based Text-to-Audio Model

Xianjin Rong; Donghui Hu

doi:10.1609/aaai.v40i39.40561

Authors

Xianjin Rong, Hefei University of Technology
Donghui Hu, Hefei University of Technology

Abstract

With the increasing commercialization of latent diffusion-based text-to-audio generation, model attribution has become a critical challenge. Embedding watermarks in generated audio is an effective way to distinguish synthetic from natural audio. However, existing watermarking methods often suffer from limited robustness or require additional training, which restricts their scalability in practical applications. In this paper, we propose an anchor-based inversion optimization framework. The method embeds a watermark into the model's initial latent vector, designated as the anchor, and extracts the watermark through inversion. To mitigate error accumulation and enhance robustness during inversion, we leverage the temporal consistency and distributional similarity of diffusion models and formulate watermark extraction as a time-series optimization problem. Specifically, given a suspicious audio sample and a candidate model with a predefined anchor, we first perform unguided denoising diffusion on the anchor to generate an intermediate latent trajectory, which serves as the anchor sequence. We then optimize the inversion process to align the inverted trajectory with the anchor sequence, thereby reducing accumulated errors. During optimization, we adopt Soft Dynamic Time Warping (Soft-DTW) as the loss function. Its flexible temporal alignment capability ensures that correct attribution is achieved only when the anchor matches the target audio. Experimental results show that our method enables training-free attribution while preserving audio quality and achieving strong robustness.
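To make the role of the Soft Dynamic Time Warping loss concrete, the following is a minimal NumPy sketch of the standard Soft-DTW recursion applied to two latent trajectories. It is an illustration under stated assumptions, not the authors' implementation: the function names `soft_min` and `soft_dtw`, the squared Euclidean frame cost, and the `(steps, dim)` trajectory layout are all hypothetical choices, and a real optimization loop would need a differentiable version in an autodiff framework.

```python
import numpy as np


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = np.max(v)
    if not np.isfinite(m):
        # Every candidate is +inf, so the smoothed minimum is +inf as well.
        return np.inf
    return -gamma * (m + np.log(np.sum(np.exp(v - m))))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW discrepancy between two trajectories of shape (steps, dim).

    Assumption: squared Euclidean distance is used as the per-step cost.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)

    # Pairwise squared Euclidean cost between every pair of steps.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)

    # r[i, j] holds the soft alignment cost of x[:i] against y[:j].
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma
            )
    return r[n, m]


# Hypothetical toy trajectories standing in for an anchor sequence and
# two inverted trajectories (one matching, one from a different anchor).
anchor_seq = np.array([[0.0], [1.0], [2.0]])
matching_inversion = np.array([[0.0], [1.0], [1.0], [2.0]])
other_inversion = np.array([[5.0], [6.0], [7.0]])

loss_match = soft_dtw(anchor_seq, matching_inversion, gamma=0.01)
loss_other = soft_dtw(anchor_seq, other_inversion, gamma=0.01)
```

In this toy setup the matching trajectory repeats one step, yet the warping still aligns it with the anchor sequence at near-zero cost, while the unrelated trajectory yields a much larger loss. The smoothing parameter `gamma` controls how closely the smoothed minimum approximates the hard minimum of classical DTW.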
