Deep Neural Network Watermarking against Model Extraction Attack

ACM MM 2023

Jingxuan Tan1,2, Nan Zhong1,2, Zhenxing Qian1,2*, Xinpeng Zhang1,2*, Sheng Li1,2
1School of Computer Science, Fudan University
2Key Laboratory of Culture Tourism Intelligent Computing, Fudan University

Figure: the host model and its stolen copies output the same prediction on the trigger set.

Abstract

Deep neural network (DNN) watermarking is an emerging technique to protect the intellectual property of deep learning models. Many DNN watermarking algorithms have been proposed to achieve provenance verification by embedding identity information into the internals or prediction behaviors of the host model. However, most methods are vulnerable to model extraction attacks, where attackers query the model, collect its output labels, and use them to train a surrogate or replica. To address this issue, we present a novel DNN watermarking approach, named SSW, which progressively constructs an adaptive trigger set by optimizing over a pair of symmetric shadow models to enhance robustness to model extraction. Specifically, we train a positive shadow model, supervised by the predictions of the host model, to mimic the behavior of potential surrogate models. In addition, a negative shadow model is trained normally to imitate irrelevant, independently trained models. Using this pair of shadow models as a reference, we design a strategy to update the trigger samples so that they tend to persist in the host model and its stolen copies. Moreover, our method supports two specific embedding schemes: embedding the watermark via fine-tuning or from scratch. Extensive experimental results on popular datasets demonstrate that SSW outperforms state-of-the-art methods against various model extraction attacks under both trigger-set classification accuracy-based and hypothesis test-based verification. The results also show that our method is robust to common model modifications, including fine-tuning and model compression.

Method

Figure: the overall pipeline of SSW.

The SSW algorithm involves three stages: watermark embedding, trigger selection, and ownership demonstration. In the watermark embedding stage, the host model H is trained on the union of the legitimate training data \mathcal{D} and a trigger set \mathcal{T}. To make surrogate models derived from the host model output the same predictions on the trigger set, we optimize the trigger samples so that they actively adapt to surrogate models. To this end, we train a positive shadow model P on \mathcal{D}^\prime, the data labeled by the host model, to simulate practical surrogate models. We also train a negative shadow model N normally on \mathcal{D} to represent irrelevant, non-watermarked models. The trigger samples are then optimized so that H and P assign them the same pre-defined labels while N assigns them different labels (see the sketch below). Host model training, positive shadow model training, and trigger set optimization are conducted alternately to enhance the watermark's robustness against model extraction.
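The exact loss weights and schedule belong to the full method; the following is a minimal PyTorch sketch of one trigger-update round under the description above. The names (host, pos_shadow, neg_shadow), the Adam optimizer, the step count, and the unweighted loss combination are illustrative assumptions, not the authors' settings.

import torch
import torch.nn.functional as F

def optimize_triggers(triggers, target_labels, host, pos_shadow, neg_shadow,
                      steps=10, lr=1e-2):
    """One round of trigger optimization (illustrative hyperparameters).

    Pushes each trigger sample toward its pre-defined target label under
    the host model H and the positive shadow model P, and away from that
    label under the negative shadow model N.
    """
    for model in (host, pos_shadow, neg_shadow):
        model.eval()  # the models are fixed; only the triggers are updated
    triggers = triggers.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([triggers], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # Agreement terms: H and P should predict the pre-defined labels.
        loss_h = F.cross_entropy(host(triggers), target_labels)
        loss_p = F.cross_entropy(pos_shadow(triggers), target_labels)
        # Disagreement term: N should not predict the pre-defined labels.
        loss_n = -F.cross_entropy(neg_shadow(triggers), target_labels)
        (loss_h + loss_p + loss_n).backward()
        optimizer.step()
        triggers.data.clamp_(0.0, 1.0)  # keep the triggers valid images
    return triggers.detach()

In the full method this update alternates with retraining H on \mathcal{D} \cup \mathcal{T} and retraining P on \mathcal{D}^\prime.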

The trigger selection stage further selects the samples that are most eligible for ownership verification. The ownership demonstration stage can then be completed based on either the classification accuracy on the trigger set or a hypothesis test, depending on the actual scenario; a sketch of the latter follows below.
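For the hypothesis-test variant, a common formulation is a one-sided binomial test: under the null hypothesis that the suspect model is independent of the host, it should hit the pre-defined trigger labels only at chance level. The sketch below follows this formulation; the chance-level null and the threshold alpha are illustrative assumptions rather than the paper's exact test.

from scipy.stats import binomtest

def verify_ownership(num_matches, num_triggers, num_classes, alpha=0.05):
    """Claim ownership if the suspect model matches the pre-defined
    trigger labels significantly more often than chance (1/num_classes)."""
    result = binomtest(num_matches, num_triggers, p=1.0 / num_classes,
                       alternative='greater')
    return result.pvalue < alpha, result.pvalue

# Example: 38 of 50 trigger samples classified with the pre-defined
# labels on a 10-class task is far above the 0.1 chance level.
claimed, p_value = verify_ownership(38, 50, 10)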

If you like the project, please show your support by leaving a star 🌟 !

BibTeX

        
@inproceedings{tan2023deep,
  title={Deep Neural Network Watermarking against Model Extraction Attack},
  author={Tan, Jingxuan and Zhong, Nan and Qian, Zhenxing and Zhang, Xinpeng and Li, Sheng},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  year={2023}
}