Abstract
Collage is a powerful medium for visual expression, traditionally demanding significant artistic expertise and manual effort. Existing methods often struggle with a trade-off between semantic expression and the visual fidelity of the constituent images. To address this, we introduce SCORE (Semantic Collage by Optimizing Rendered Elements), a novel text-driven framework that automates the creation of semantically rich and structurally sound collages. Our key innovation is to shift the optimization process entirely into the image space. By employing a differentiable renderer, we can backpropagate gradients from a powerful, pre-trained text-to-image model directly to the spatial parameters, including position, rotation, and scale, of each image element. We leverage Variational Score Distillation (VSD) to provide robust semantic guidance from a text prompt, ensuring the final layout aligns with the desired concept. Crucially, our "minimal editing" principle preserves the integrity of the original elements by forgoing any content-level modifications. The layout is refined by a joint loss function that combines the VSD-based semantic loss with structural regularizers that penalize overlap and enforce boundary constraints. The output of SCORE is a parametric, structured representation that allows further editing and downstream use.
Method Overview
The overall pipeline of SCORE. Given a set of image elements and a text prompt, our method composites the elements with a differentiable renderer and optimizes the spatial parameters (position, rotation, scale) of each element. Variational Score Distillation provides semantic guidance from the text prompt, while structural regularizers penalize overlap and enforce boundary constraints.
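The structural regularizers mentioned above can be illustrated with a small sketch. This is not the authors' implementation: it approximates each element by an axis-aligned bounding box, and the names `Box`, `overlap_penalty`, `boundary_penalty`, `structural_loss` and the weights `w_overlap` and `w_boundary` are all hypothetical. In an actual pipeline these terms would be computed in a differentiable framework, presumably on the rendered elements themselves, and added to the VSD-based semantic loss.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box of one placed element (hypothetical helper)."""
    cx: float  # centre x, in canvas units
    cy: float  # centre y, in canvas units
    w: float   # width
    h: float   # height


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if they are disjoint)."""
    dx = min(a.cx + a.w / 2, b.cx + b.w / 2) - max(a.cx - a.w / 2, b.cx - b.w / 2)
    dy = min(a.cy + a.h / 2, b.cy + b.h / 2) - max(a.cy - a.h / 2, b.cy - b.h / 2)
    return max(dx, 0.0) * max(dy, 0.0)


def overlap_penalty(boxes) -> float:
    """Sum of pairwise intersection areas over all unordered pairs."""
    total = 0.0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            total += overlap_area(boxes[i], boxes[j])
    return total


def boundary_penalty(boxes, canvas_w: float, canvas_h: float) -> float:
    """Total distance by which box edges stick out of the canvas."""
    total = 0.0
    for b in boxes:
        total += max(-(b.cx - b.w / 2), 0.0)            # past the left edge
        total += max((b.cx + b.w / 2) - canvas_w, 0.0)  # past the right edge
        total += max(-(b.cy - b.h / 2), 0.0)            # past the top edge
        total += max((b.cy + b.h / 2) - canvas_h, 0.0)  # past the bottom edge
    return total


def structural_loss(boxes, canvas_w, canvas_h, w_overlap=1.0, w_boundary=1.0) -> float:
    """Weighted sum of both regularizers; the weights are assumed, not from the paper."""
    return (w_overlap * overlap_penalty(boxes)
            + w_boundary * boundary_penalty(boxes, canvas_w, canvas_h))
```

Both terms are zero for a layout whose elements are disjoint and fully inside the canvas, so they only influence the optimization when a layout becomes invalid.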
Key Contributions
Semantic-Driven Layout over Template Constraints
We introduce a paradigm in which the text prompt drives the layout optimization directly. With Variational Score Distillation (VSD) as the semantic loss, our method needs no intermediate representations such as contours or reference images, which widens creative freedom and tightens semantic alignment.
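For reference, the VSD gradient in its commonly cited form (from the work that introduced VSD for text-to-3D generation) is reproduced below, adapted to this setting. The notation is an assumption rather than the paper's own: $\theta$ stands for the spatial parameters of all elements, $g(\theta)$ for the image produced by the differentiable renderer, $y$ for the text prompt, $\epsilon_{\mathrm{pre}}$ for the frozen pre-trained noise predictor, and $\epsilon_{\phi}$ for the auxiliary noise predictor fitted to the current renderings. The exact formulation used by SCORE may differ.

```latex
\nabla_{\theta}\,\mathcal{L}_{\mathrm{VSD}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[\,
      \omega(t)\,
      \bigl(\epsilon_{\mathrm{pre}}(x_t, t, y) - \epsilon_{\phi}(x_t, t, y)\bigr)\,
      \frac{\partial g(\theta)}{\partial \theta}
    \right],
\qquad
x_t = \alpha_t\, g(\theta) + \sigma_t\, \epsilon
```

Because $g$ is differentiable, the factor $\partial g(\theta) / \partial \theta$ carries the image-space signal back to position, rotation, and scale without touching the pixels of the elements themselves.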
Image Element Fidelity via Minimal Editing
Our optimization process is strictly confined to adjusting the position, rotation, and scale of the image elements. No distortion or content-level modifications are made, preserving the integrity of each element.
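One way to see why these three parameters leave the content undistorted is that together they form a similarity transform: uniform scale, rotation, and translation, with no shear or non-uniform stretch. The sketch below is illustrative only; the function names and the choice of rotating about the element centre are assumptions, not details from the paper.

```python
import math


def similarity_matrix(tx: float, ty: float, theta: float, s: float):
    """2x3 matrix for: uniform scale s, rotation theta (radians), then translation."""
    c, sn = math.cos(theta), math.sin(theta)
    return [[s * c, -s * sn, tx],
            [s * sn, s * c, ty]]


def apply(m, x: float, y: float):
    """Apply a 2x3 affine matrix to the point (x, y)."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def transform_corners(w: float, h: float, tx: float, ty: float, theta: float, s: float):
    """Corners of a w-by-h element, centred at the origin, after placement."""
    m = similarity_matrix(tx, ty, theta, s)
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [apply(m, x, y) for x, y in corners]


def edge_lengths(corners):
    """Lengths of the four edges of a quadrilateral given in order."""
    n = len(corners)
    return [math.dist(corners[i], corners[(i + 1) % n]) for i in range(n)]
```

For any rotation and translation, the edges of the placed element are the original edges multiplied by `s`, so the aspect ratio and all angles of the element are unchanged.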
Parametric Output for Downstream Use
The output of SCORE is a parametric, structured representation that allows further editing and downstream use, providing flexibility beyond static image generation.
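To make the idea of a parametric output concrete, here is a hypothetical serialization of a layout. The page does not specify the actual format, so every field name (`source`, `x`, `y`, `rotation_deg`, `scale`, `prompt`, `canvas`) and both helper functions are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ElementPlacement:
    """Placement of one untouched source image (hypothetical schema)."""
    source: str          # path or identifier of the original element image
    x: float             # centre x on the canvas
    y: float             # centre y on the canvas
    rotation_deg: float  # rotation in degrees
    scale: float         # uniform scale factor


def layout_to_json(prompt: str, canvas_size, placements) -> str:
    """Serialize a layout so it can be re-rendered or edited later."""
    return json.dumps({
        "prompt": prompt,
        "canvas": list(canvas_size),
        "elements": [asdict(p) for p in placements],
    }, indent=2)


def layout_from_json(text: str):
    """Inverse of layout_to_json: returns (prompt, canvas_size, placements)."""
    data = json.loads(text)
    placements = [ElementPlacement(**e) for e in data["elements"]]
    return data["prompt"], tuple(data["canvas"]), placements
```

A representation like this keeps each element addressable after optimization: moving or rescaling one element means changing a few numbers and re-rendering, rather than editing a flattened image.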
Qualitative Results
Comparison with Baselines
Comparison with existing collage generation methods across various text prompts and element sets.
Ablation Study
Ablation study demonstrating the contribution of each component in SCORE.
Diverse Results
Diverse collage generation results showcasing semantic alignment and visual fidelity.
BibTeX
@inproceedings{score2026,
title={SCORE: Semantic Collage by Optimizing Rendered Elements},
author={Shao, Zefan and Zhou, Jin and Yang, Hongliang and Xu, Pengfei},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
year={2026}
}