SCORE: Semantic Collage by Optimizing Rendered Elements

Zefan Shao, Jin Zhou, Hongliang Yang, Pengfei Xu*
CSSE, Shenzhen University
AAAI 2026

Given a set of image elements and a text prompt, SCORE optimizes the spatial arrangement of visual elements to generate a semantically aligned collage.

Abstract

Collage is a powerful medium for visual expression, traditionally demanding significant artistic expertise and manual effort. Existing methods often struggle with a trade-off between semantic expression and the visual fidelity of the constituent images. To address this, we introduce SCORE (Semantic Collage by Optimizing Rendered Elements), a novel text-driven framework that automates the creation of semantically rich and structurally sound collages. Our key innovation is to shift the optimization process entirely into the image space. By employing a differentiable renderer, we can backpropagate gradients from a powerful, pre-trained text-to-image model directly to the spatial parameters, including position, rotation, and scale, of each image element. We leverage Variational Score Distillation (VSD) to provide robust semantic guidance from a text prompt, ensuring the final layout aligns with the desired concept. Crucially, our "minimal editing" principle preserves the integrity of the original elements by forgoing any content-level modifications. The layout is refined by a joint loss function that combines the VSD-based semantic loss with structural regularizers that penalize overlap and enforce boundary constraints. The output of SCORE is a parametric, structured representation that allows further editing and downstream use.

Method Overview


The overall pipeline of SCORE. Given a set of image elements and a text prompt, a differentiable renderer composites the elements so that gradients can be backpropagated to the spatial parameters (position, rotation, scale) of each element. Variational Score Distillation provides semantic guidance from the text prompt, while structural regularizers penalize overlap and enforce boundary constraints.
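As a rough illustration of how the structural regularizers might combine with the semantic term, the sketch below computes an overlap penalty and a boundary penalty on axis-aligned bounding boxes and adds them to a semantic loss value. The box representation, the squared-excess boundary term, the function names, and the weights `w_overlap` and `w_boundary` are all assumptions made for illustration; the page does not specify the exact form of these terms, and the VSD loss itself is treated here as a precomputed number.

```python
from typing import List, Tuple

# Hypothetical representation: (x_min, y_min, x_max, y_max) per element.
Box = Tuple[float, float, float, float]


def overlap_penalty(boxes: List[Box]) -> float:
    """Sum of pairwise intersection areas; zero when no two boxes overlap."""
    total = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            ax0, ay0, ax1, ay1 = boxes[i]
            bx0, by0, bx1, by1 = boxes[j]
            w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
            h = max(0.0, min(ay1, by1) - max(ay0, by0))
            total += w * h
    return total


def boundary_penalty(boxes: List[Box], width: float, height: float) -> float:
    """Sum of squared distances by which boxes extend past the canvas edges."""
    total = 0.0
    for x0, y0, x1, y1 in boxes:
        for excess in (-x0, -y0, x1 - width, y1 - height):
            total += max(0.0, excess) ** 2
    return total


def joint_loss(semantic_loss: float, boxes: List[Box], width: float,
               height: float, w_overlap: float = 1.0,
               w_boundary: float = 1.0) -> float:
    """Weighted sum of a semantic term and the two structural terms.

    semantic_loss stands in for the VSD-based loss, which is not computed here.
    """
    return (semantic_loss
            + w_overlap * overlap_penalty(boxes)
            + w_boundary * boundary_penalty(boxes, width, height))


if __name__ == "__main__":
    boxes = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
    print(joint_loss(0.5, boxes, width=2.5, height=4.0))
```

In a gradient-based setting these terms would be written with a differentiable array library so that they can be minimized jointly with the semantic loss; the plain-Python version only shows the arithmetic.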

Key Contributions

Semantics-Driven Layout over Template Constraints

We introduce a paradigm in which the text prompt directly drives layout optimization. Using Variational Score Distillation (VSD) as the semantic loss removes the need for intermediate representations such as contours or reference images, which widens creative freedom and improves the precision of semantic alignment.

Image Element Fidelity via Minimal Editing

Our optimization process is strictly confined to adjusting the position, rotation, and scale of the image elements. No distortion or content-level modifications are made, preserving the integrity of each element.
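One way to see why restricting the optimization to position, rotation, and scale avoids distortion: these three parameters define a similarity transform, which preserves each element's aspect ratio and angles. The sketch below applies such a transform to the corners of a rectangular element. The function names, the centre-based parameterization, and the use of degrees are assumptions for illustration, not the paper's renderer.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def place_element(width: float, height: float, cx: float, cy: float,
                  angle_deg: float, scale: float) -> List[Point]:
    """Corners of a width x height element after uniform scaling, rotation
    about its centre, and translation of the centre to (cx, cy)."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    half_w, half_h = scale * width / 2.0, scale * height / 2.0
    local = [(-half_w, -half_h), (half_w, -half_h),
             (half_w, half_h), (-half_w, half_h)]
    return [(cx + c * x - s * y, cy + s * x + c * y) for x, y in local]


def side_lengths(corners: List[Point]) -> List[float]:
    """Lengths of the four sides, in corner order."""
    return [math.dist(corners[i], corners[(i + 1) % 4]) for i in range(4)]


if __name__ == "__main__":
    corners = place_element(4.0, 2.0, cx=10.0, cy=5.0, angle_deg=30.0, scale=1.5)
    sides = side_lengths(corners)
    # The ratio of adjacent sides stays equal to the original 4:2 aspect ratio.
    print(sides[0] / sides[1])
```

Because the scale is a single number applied to both axes, no choice of parameters can stretch or shear an element; only its placement and size change.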

Parametric Output for Downstream Use

The output of SCORE is a parametric, structured representation that allows further editing and downstream use, providing flexibility beyond static image generation.
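To make the idea of a parametric output concrete, the sketch below stores one record per element and round-trips the layout through JSON, so a layout could be edited by changing a single number and re-rendered. The schema (field names `source`, `x`, `y`, `rotation_deg`, `scale`) and the file names are hypothetical; the page does not describe the actual output format.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class ElementPlacement:
    """Hypothetical per-element record: which image, and where it is placed."""
    source: str          # identifier or path of the untouched element image
    x: float             # centre position on the canvas
    y: float
    rotation_deg: float
    scale: float


def to_json(placements: List[ElementPlacement]) -> str:
    """Serialize a layout to a JSON string."""
    return json.dumps([asdict(p) for p in placements], indent=2)


def from_json(text: str) -> List[ElementPlacement]:
    """Rebuild a layout from a JSON string produced by to_json."""
    return [ElementPlacement(**item) for item in json.loads(text)]


if __name__ == "__main__":
    layout = [
        ElementPlacement("leaf.png", 120.0, 80.0, 15.0, 0.8),
        ElementPlacement("button.png", 200.0, 150.0, -30.0, 1.2),
    ]
    restored = from_json(to_json(layout))
    restored[0].rotation_deg = 45.0   # a manual edit after optimization
    print(to_json(restored))
```

Keeping the element images separate from their placement parameters is what allows later edits without re-running the optimization.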

Qualitative Results


Comparison with Baselines

Comparison with existing collage generation methods across various text prompts and element sets.


Ablation Study

Ablation study demonstrating the contribution of each component in SCORE.


Diverse Results

Diverse collage generation results showcasing semantic alignment and visual fidelity.

BibTeX

@inproceedings{score2026,
  title={SCORE: Semantic Collage by Optimizing Rendered Elements},
  author={Shao, Zefan and Zhou, Jin and Yang, Hongliang and Xu, Pengfei},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}