PolyFormer:

Referring Image Segmentation as Sequential Polygon Generation


1 Johns Hopkins University, 2 AWS AI Labs
*Equal contribution. Work done during an internship at AWS AI Labs

In CVPR 2023

Figure 1. PolyFormer is a unified model for referring image segmentation (polygon vertex sequence) and referring expression comprehension (bounding box corner points). The predicted polygons are then converted into segmentation masks.

Abstract

In this work, instead of directly predicting pixel-level segmentation masks, we formulate referring image segmentation as sequential polygon generation, and the predicted polygons can later be converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder that predicts precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on referring video segmentation without fine-tuning, e.g., achieving a competitive 61.5% J&F on the Ref-DAVIS17 dataset.
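The final step, converting predicted polygons into a segmentation mask, is a simple rasterization. Below is a minimal sketch assuming OpenCV; the helper polygons_to_mask and its arguments are illustrative, not the released implementation.

import numpy as np
import cv2

def polygons_to_mask(polygons, height, width):
    """Rasterize predicted polygons into a binary segmentation mask.

    polygons: list of (N_i, 2) float arrays of (x, y) vertices, one entry per
              polygon (a fragmented object is represented by several polygons).
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    for poly in polygons:
        pts = np.round(np.asarray(poly)).astype(np.int32).reshape(-1, 1, 2)
        cv2.fillPoly(mask, [pts], color=1)  # fill the interior of each polygon
    return mask

# Example: two polygons predicted for one fragmented object
mask = polygons_to_mask(
    [np.array([[10.3, 12.7], [60.1, 15.2], [55.8, 70.4]]),
     np.array([[80.0, 80.0], [120.5, 82.3], [100.2, 130.9]])],
    height=160, width=160,
)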



Demo




Model

Figure 2. Overview of the PolyFormer architecture. The model takes an image and its corresponding language expression as input, and autoregressively outputs the floating-point 2D coordinates of the bounding box and polygon vertices.


The main contributions of PolyFormer are summarized as follows:

  • It provides a unified framework for referring image segmentation (RIS) and referring expression comprehension (REC) by formulating both as a sequence-to-sequence (seq2seq) prediction problem (a sketch of the target sequence layout follows this list).
  • It uses a regression-based decoder for accurate coordinate prediction, which outputs continuous 2D coordinates directly without quantization error. To the best of our knowledge, this is the first work to formulate geometric localization as a regression task in a seq2seq framework.
  • For the first time, we show that a polygon-based method surpasses mask-based ones across all three main referring image segmentation benchmarks, and that it also generalizes well to unseen scenarios, including video and synthetic data.
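In this formulation, both tasks are serialized into one coordinate sequence: the bounding box corner points come first, followed by the polygon vertices, with a separator token between polygons and an end token closing the sequence. The sketch below shows one plausible way to assemble such a target sequence; the token names (<BOS>, <SEP>, <EOS>) and the helper are illustrative assumptions, not the exact format used in the paper.

from typing import List, Tuple

Coord = Tuple[float, float]

def build_target_sequence(bbox: Tuple[Coord, Coord],
                          polygons: List[List[Coord]]) -> list:
    """Assemble a unified target sequence for REC + RIS."""
    seq: list = ["<BOS>"]
    seq += [bbox[0], bbox[1]]           # top-left and bottom-right corner points
    for i, poly in enumerate(polygons):
        if i > 0:
            seq.append("<SEP>")         # separator between polygons of one object
        seq += list(poly)               # polygon vertices in order
    seq.append("<EOS>")
    return seq

# e.g., one bounding box and an object made of two polygons
seq = build_target_sequence(
    bbox=((0.12, 0.08), (0.67, 0.91)),
    polygons=[[(0.15, 0.10), (0.60, 0.12), (0.55, 0.85)],
              [(0.20, 0.30), (0.40, 0.35), (0.30, 0.60)]],
)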

Figure 3. The architecture of the regression-based transformer decoder (a). The 2D coordinate embedding is obtained by bilinear interpolation from the nearby grid points, as illustrated in (b).

Previous visual seq2seq methods formulate coordinate localization as a classification problem and obtain coordinate embeddings by indexing a dictionary with a fixed number of discrete coordinate bins. In contrast, PolyFormer predicts continuous coordinate values directly, avoiding quantization error and enabling accurate geometric localization.
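To make the continuous coordinate embedding of Figure 3(b) concrete, here is a minimal PyTorch-style sketch that embeds a floating-point (x, y) by bilinearly interpolating the four neighboring entries of a learned 2D grid of coordinate embeddings. The module name, grid size, and dimensions are assumptions for illustration, not the released code.

import torch
import torch.nn as nn

class Coord2DEmbedding(nn.Module):
    """Embed a continuous (x, y) in [0, 1]^2 by bilinear interpolation
    of the four neighboring entries of a learned grid of embeddings."""

    def __init__(self, grid_size: int = 64, dim: int = 256):
        super().__init__()
        self.grid_size = grid_size
        # a learned embedding for every grid point, shape (G, G, dim)
        self.grid = nn.Parameter(torch.randn(grid_size, grid_size, dim) * 0.02)

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        # xy: (..., 2) with coordinates normalized to [0, 1]
        pos = xy * (self.grid_size - 1)
        lo = pos.floor().long().clamp(0, self.grid_size - 2)
        x0, y0 = lo[..., 0], lo[..., 1]
        x1, y1 = x0 + 1, y0 + 1
        fx = (pos[..., 0] - x0.float()).unsqueeze(-1)   # fractional offsets
        fy = (pos[..., 1] - y0.float()).unsqueeze(-1)
        # gather the four neighboring grid embeddings and blend them
        e00, e10 = self.grid[x0, y0], self.grid[x1, y0]
        e01, e11 = self.grid[x0, y1], self.grid[x1, y1]
        return ((1 - fx) * (1 - fy) * e00 + fx * (1 - fy) * e10 +
                (1 - fx) * fy * e01 + fx * fy * e11)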

Our regression-based decoder is distinguished by three critical designs:

  • It generates a precise 2D coordinate embedding for any floating-point coordinate by bilinear interpolation of its neighboring indexed embeddings.
  • It decouples token type and coordinate prediction: a coordinate head regresses the 2D coordinates of the referred object's bounding box corner points and polygon vertices, while a class head outputs the token types.
  • It introduces a separator token that allows it to accurately model fragmented objects consisting of multiple polygons (a sketch of a decoding step follows this list).
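Below is a hypothetical sketch of a single decoding step under these designs: a class head classifies the token type (coordinate, separator, or end of sequence), while a separate coordinate head regresses the floating-point (x, y). All module and token names are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

TOKEN_TYPES = ["<COORD>", "<SEP>", "<EOS>"]               # illustrative token vocabulary

class DecoderHeads(nn.Module):
    """Decoupled prediction heads: token type (classification) and
    2D coordinate (regression)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.class_head = nn.Linear(dim, len(TOKEN_TYPES))   # token-type logits
        self.coord_head = nn.Sequential(                      # continuous (x, y) in [0, 1]
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 2), nn.Sigmoid(),
        )

    def forward(self, hidden: torch.Tensor):
        token_type = self.class_head(hidden).argmax(-1)       # which kind of token comes next
        xy = self.coord_head(hidden)                           # used only when a coordinate is predicted
        return token_type, xy

# At inference, a <SEP> prediction closes the current polygon and starts a new one
# (handling fragmented objects), and <EOS> terminates generation.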

Results

Referring Image Segmentation

Table 1. Comparison with the state-of-the-art methods on three referring image segmentation benchmarks.

Referring Expression Comprehension

Table 2. Comparison with the state-of-the-art methods on three referring expression comprehension benchmarks.

Zero-shot Referring Video Object Segmentation

Table 3. Comparison with the state-of-the-art methods on Ref-DAVIS17. † Our model is trained on image datasets only, whereas ReferFormer is trained on both image and video datasets.


Visualizations

Cross-attention Map


Figure 4. The cross-attention maps of the decoder when generating the polygon. ★ marks the 2D vertex prediction at each inference step.

Prediction Visualization

Figure 5. The results of LAVT (top), SeqTR (middle), and PolyFormer (bottom) on RefCOCOg test set.


Figure 6. The results of LAVT (top), SeqTR (middle), and PolyFormer (bottom) on synthetic images generated by Stable Diffusion.

BibTeX


@InProceedings{Liu_2023_CVPR,
    author    = {Liu, Jiang and Ding, Hui and Cai, Zhaowei and Zhang, Yuting and Satzoda, Ravi Kumar and Mahadevan, Vijay and Manmatha, R.},
    title     = {PolyFormer: Referring Image Segmentation As Sequential Polygon Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {18653-18663}
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.