Multi-Scale Hierarchical Diffusion Networks for Efficient Layout Generation: Improving Efficiency via Hierarchical Framework and Multi-Decoder Architectures
DOI: https://doi.org/10.47709/cnahpc.v8i2.7497

Keywords: Hierarchical Diffusion, Efficient Layout Generation, Multi-Scale Diffusion, Parameter-Efficient Architecture, Training and Sampling Efficiency, Diffusion Model Optimization

Abstract
Layout generation remains a challenging task in automated design systems, where existing diffusion models often require extensive computational resources and numerous sampling steps. This work presents a novel multi-scale hierarchical diffusion architecture that achieves state-of-the-art performance through explicit three-level processing with progressive dimensional reduction (128d → 64d → 32d). The proposed framework achieves a 92.5% loss reduction (0.496 to 0.037) over 50 training epochs with only 21,862 parameters, a 2.1× reduction compared to existing diffusion-based methods, while maintaining superior generation quality. Experimental validation demonstrates the efficiency benefits of the hierarchical design across multiple metrics, including FID score (12.3 vs. 18.7), precision (0.87 vs. 0.79), and training time (0.049 s vs. 0.127 s per epoch). Comprehensive ablation studies quantify the contribution of each hierarchical level and validate the architectural design choices.
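To make the three-level design concrete, the sketch below shows one way a denoiser with progressive dimensional reduction (128d → 64d → 32d) and one lightweight decoder per level could be organized in PyTorch. This is a minimal, hypothetical illustration, not the authors' published implementation: the layout dimensionality (`layout_dim`), the linear timestep embedding, the per-level decoder heads, and the averaging of per-scale noise predictions are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class HierarchicalDenoiser(nn.Module):
    """Illustrative three-level denoiser with 128d -> 64d -> 32d reduction.

    Hypothetical sketch of the multi-scale, multi-decoder idea described in the
    abstract; layer choices and dimensions other than (128, 64, 32) are assumptions.
    """

    def __init__(self, layout_dim: int = 8, dims=(128, 64, 32)):
        super().__init__()
        # Simple timestep embedding (assumption: a linear map, not sinusoidal).
        self.time_embed = nn.Sequential(nn.Linear(1, dims[0]), nn.SiLU())
        self.input_proj = nn.Linear(layout_dim, dims[0])

        # Encoder path: progressive dimensional reduction across the three levels.
        self.levels = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, d_out), nn.SiLU())
            for d_in, d_out in zip(dims[:-1], dims[1:])
        )
        # One lightweight decoder per level (multi-decoder design): each head
        # predicts the noise component from its own scale's representation.
        self.decoders = nn.ModuleList(nn.Linear(d, layout_dim) for d in dims)

    def forward(self, x_noisy: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x_noisy: (batch, layout_dim) noisy layout parameters; t: (batch,) timesteps.
        h = self.input_proj(x_noisy) + self.time_embed(t.float().unsqueeze(-1))
        outputs = [self.decoders[0](h)]
        for level, decoder in zip(self.levels, self.decoders[1:]):
            h = level(h)
            outputs.append(decoder(h))
        # Combine the per-scale noise predictions (assumption: simple averaging).
        return torch.stack(outputs).mean(dim=0)


if __name__ == "__main__":
    model = HierarchicalDenoiser()
    print(sum(p.numel() for p in model.parameters()))  # rough parameter count
    eps_hat = model(torch.randn(4, 8), torch.randint(0, 1000, (4,)))
    print(eps_hat.shape)  # torch.Size([4, 8])
```

With these assumed dimensions the model stays in the low tens of thousands of parameters, which is the regime the abstract reports, though the exact count depends on the layout representation and layer choices.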
License
Copyright (c) 2026 Kalyan Chakravarthy

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.