As shown in Table 1, Case 1 employs only the Hierarchical Local LoRA module (denoted as A), achieving strong performance in both FID (76.951) and LPIPS (0.604), alongside a relatively high CLIP-T score (0.320). These results indicate that this module contributes positively to both perceptual image quality and semantic alignment. In contrast, Case 2 incorporates only the dynamic rank and dynamic alpha mechanisms (denoted as B). Despite their adaptive capabilities, this configuration yields the weakest performance across all three metrics (CLIP-T: 0.297, LPIPS: 0.689, FID: 85.945), suggesting limited standalone effectiveness. Case 3 integrates the Conditional Adversarial Flow and multi-scale MMD components (denoted as C), which slightly improves CLIP-T (0.305) and FID (82.809) over Case 2, indicating that these modules facilitate better distribution alignment and structural fidelity. Case 4 builds upon Case 1 by adding the dynamic mechanisms (A + B), improving markedly over the dynamic mechanisms alone in both LPIPS (0.633 vs. 0.689) and FID (77.443 vs. 85.945), which demonstrates that the dynamic mechanisms become effective when combined with LoRA. Similarly, Case 5, which combines A and C, shows consistently strong results in LPIPS (0.651) and CLIP-T (0.310), outperforming the standalone C configuration. Finally, Case 6 integrates all three components (A + B + C) and achieves the best performance across all metrics: CLIP-T improves to 0.334, LPIPS drops to 0.438, and FID decreases substantially to 61.544. These findings validate the complementary nature and synergistic benefits of the proposed modules.
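To make the ablated components more concrete, the sketch below illustrates one plausible form of a LoRA adapter equipped with a learnable scaling factor (dynamic alpha) and soft per-rank gates (dynamic rank), i.e., component B attached to component A. The class name `DynamicLoRALinear`, the sigmoid gating, and the log-parameterized scale are illustrative assumptions on our part, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DynamicLoRALinear(nn.Module):
    """Hypothetical LoRA adapter with a learnable alpha and a soft rank gate.

    An illustrative sketch of the dynamic rank / dynamic alpha idea;
    the actual LFMDiff module may differ.
    """

    def __init__(self, base: nn.Linear, max_rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained projection
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(max_rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, max_rank))
        self.log_alpha = nn.Parameter(torch.zeros(()))        # dynamic alpha
        self.rank_gate = nn.Parameter(torch.zeros(max_rank))  # dynamic rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.rank_gate)  # in (0, 1); softly prunes ranks
        delta = ((x @ self.lora_A.t()) * gate) @ self.lora_B.t()
        return self.base(x) + torch.exp(self.log_alpha) * delta

# Toy usage: wrap a frozen projection and fine-tune only the adapter.
layer = DynamicLoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 64))  # shape: (2, 64)
```

Under this reading, the gates let the effective rank adapt per layer while the learned alpha rescales the update, which is consistent with the ablation finding that B helps mainly in combination with A.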
As illustrated in Fig. 3, the first column demonstrates that Case 1 exhibits strong capabilities in capturing architectural edges, contours, and structural hierarchies, especially in black-and-white line drawing styles (e.g., rows 3 and 10), where the reconstruction is relatively clear. This suggests that the local LoRA mechanism contributes to improved structural reconstruction, particularly in capturing local details and linear layering within the generated images. However, this configuration reveals notable shortcomings in natural color transitions, stylistic consistency, and overall structural coherence. Color images often appear overly synthetic and lack realism (e.g., rows 2 and 4). In the second column, improvements in color saturation and style consistency are observed, particularly in the rendering of landscapes and lighting effects (rows 5 and 6). Nonetheless, this setup shows a decline in structural accuracy, with certain architectural or scene details becoming blurred or distorted, suggesting a deficiency in maintaining stable spatial structures. The third column yields more natural color palettes and stylized effects, especially in terms of texture and brushstroke realism, achieving a closer resemblance to ground truth images (e.g., rows 4, 7, and 9). Yet, this configuration demonstrates weaker structural control, as some samples exhibit spatial misalignment or perspective distortion, indicating that this module alone is insufficient for capturing complete spatial structural features. When the Hierarchical Local LoRA is combined with the dynamic rank and alpha mechanisms, both structural clarity and stylistic coherence are improved. Images in the fourth column display enhanced architectural stability and color harmony (e.g., rows 3, 6, and 10), although some local regions, including landscape edges or fine textures, still lack sharpness. With the integration of Conditional Adversarial Flow, the fifth column shows further enhancement in image layering and texture detail (e.g., rows 4, 5, and 7), although its performance in color control and global consistency remains slightly inferior to that of the complete model. When all three modules (A, B, and C) are integrated, the generated results (sixth column) achieve the best overall performance in terms of structural fidelity, color accuracy, and stylistic consistency.
As shown in Table 2, the proposed LFMDiff (Ours) model demonstrates strong overall performance across all three evaluation metrics, with notable advantages in both CLIP-T and FID, and competitive results in LPIPS, which measures perceptual similarity. The semantic consistency score (CLIP-T) of our method reaches 0.334, significantly outperforming all baseline models, including WenXin 4.5 Turbo (0.247), CCLAP (0.234), and ControlNet (0.226). This indicates that the images generated by LFMDiff are more semantically aligned with the input text, demonstrating superior text-image correspondence. In terms of perceptual similarity, our method achieves an LPIPS score of 0.438. Although this does not rank among the top five models, LFMDiff still shows strong capability in preserving ink-wash details and capturing stylistic features characteristic of traditional Chinese landscape painting. It is worth noting that models such as SDXL (0.011) and ControlNet (0.026), despite achieving the lowest LPIPS scores (i.e., the highest perceptual similarity to the references), tend to generate overly smooth images that lack the diversity and expressive brushwork of traditional ink painting. In contrast, models such as Taiyi (0.831), GLIDE (0.969), and T2I-Adapter (0.981) perform poorly on this metric, suggesting weaker perceptual quality. Regarding overall image quality, LFMDiff achieves the lowest FID score of 61.544, indicating that the distribution of its generated images is closest to that of real Chinese landscape paintings in terms of structural fidelity and distributional realism. In summary, although LFMDiff does not achieve the top LPIPS score, it exhibits clear advantages on both CLIP-T and FID, reflecting a well-balanced trade-off among semantic consistency, stylistic fidelity, and image diversity.
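For reproducibility, all three metrics can be computed with off-the-shelf tools. The snippet below is a minimal sketch assuming the `lpips`, `torchmetrics`, and `transformers` packages; the specific CLIP checkpoint, the AlexNet LPIPS backbone, and the toy tensors are our assumptions, not the paper's exact evaluation pipeline.

```python
import torch
import torch.nn.functional as F
import lpips                                        # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance
from transformers import CLIPModel, CLIPProcessor

# CLIP-T: mean cosine similarity between CLIP text and image embeddings.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_t(pil_images, prompts):
    inputs = proc(text=prompts, images=pil_images,
                  return_tensors="pt", padding=True)
    out = clip(**inputs)
    img = F.normalize(out.image_embeds, dim=-1)
    txt = F.normalize(out.text_embeds, dim=-1)
    return (img * txt).sum(-1).mean().item()        # higher is better

# LPIPS: perceptual distance (lower is better); NCHW floats in [-1, 1].
lpips_fn = lpips.LPIPS(net="alex").eval()

@torch.no_grad()
def lpips_dist(fake, real):
    return lpips_fn(fake, real).mean().item()

# FID: distribution distance (lower is better); expects uint8 images.
fid = FrechetInceptionDistance(feature=2048)
real = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # toy
fake = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # toy
fid.update(real, real=True)
fid.update(fake, real=False)
print(fid.compute().item())  # use the full image sets in practice
```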
As shown in Fig. 4, images generated by SDXL and DALL-E 3 demonstrate certain advantages in resolution and overall visual appeal. However, their stylistic tendencies lean toward Western oil painting or digital illustration, lacking the abstract expression and “artistic conception” characteristic of traditional Chinese ink painting. Their brushstrokes appear overly realistic or digitally rendered, and the handling of ink tones is often unnatural. GLIDE yields generally poor results, with outputs frequently exhibiting loose composition and monotonous color schemes. Some samples resemble children’s drawings or are overly abstract, failing to capture the expressive style and spatial depth fundamental to Chinese painting. Taiyi, a Chinese-pretrained model, aligns more closely with Chinese aesthetics in terms of color and some visual elements. Nevertheless, its compositions tend to be overly simplistic, lacking structural complexity, and the textures are coarse. For instance, the representation of trees and rocks is overly generalized, with insufficient variation in brush and ink techniques. ControlNet exhibits strong structural control and can produce well-defined contours of mountains and architectural elements. However, the resulting images resemble contour maps or computational renderings, lacking the fluidity and expressive spontaneity of ink painting, and often display a mechanical or patterned appearance. CCLAP is capable of generating images with rich color palettes and complex visual elements. Yet, its outputs frequently diverge from the subtle and elegant aesthetic of traditional landscape painting, instead showing a strong decorative tendency or exaggerated coloration, more reminiscent of modern collage or illustration styles. In contrast, our method (Ours) achieves the most convincing results in terms of ink painting style, hierarchical composition, and spatial arrangement. The generated images exhibit hallmark features of Chinese painting techniques, including feibai (flying white), jimo (ink accumulation), xuanran (gradual wash), and liubai (intentional blank space), which are applied in the rendering of elements like rocks, forests, clouds, water, and architecture. The overall composition conveys a dynamic sense of rhythm and vitality, achieving a balance between form and spirit, and most closely resembles authentic artworks.
As illustrated in Fig. 5, although mainstream large-scale models such as Tongyi Wanxiang and RAPHAEL are capable of generating high-resolution images with rich details, their visual styles tend to resemble Western oil paintings or digital illustrations. These models typically emphasize realistic lighting and physical structure in the depiction of mountains, trees, and water bodies. However, their brushstrokes are highly representational, lacking the abstract expression and ink aesthetics characteristic of traditional Chinese landscape painting. For instance, in columns 1 and 4, the images generated by RAPHAEL exhibit a strong sense of volume and contrast, but the overall color saturation is relatively high, which does not align with the Chinese aesthetic preference for subtlety, elegance, and serenity. WenXin 4.5 Turbo, a model fine-tuned for Chinese-language inputs, demonstrates a better understanding of landscape-related semantics, with more traditional elements such as pine trees, pavilions, terraces, and mist appearing in the generated images. However, it still suffers from loose spatial composition and ambiguous layering. In columns 3 and 6, for example, the shapes of rocks appear rigid, and the handling of mist and distant views lacks the dynamic interplay between solidity and emptiness that defines classical Chinese landscape techniques. In comparison, autoregressive models such as LlamaGen-XL, OpenMAGVIT2, and DALL·E Mini exhibit distinct stylistic tendencies. LlamaGen-XL leans toward photorealism, often lacking the artistic abstraction of ink painting, and in some cases (e.g., columns 2 and 4) introduces modern elements inconsistent with the intended cultural context. OpenMAGVIT2 and DALL·E Mini can partially imitate the grayscale tonalities of ink wash, but their compositions remain simplistic, with homogeneous mountain and tree forms, insufficient layering, and a lack of brush-and-ink expressiveness. In contrast, our method (Ours) exhibits superior performance across multiple dimensions, particularly in generation quality and artistic style fidelity. The generated images follow traditional compositional principles, with a natural transition from foreground to middle ground to background, adhering to the “Three Distances” perspective approach commonly employed in Chinese landscape painting. In terms of detailed depiction, as seen in columns 5 and 8, our model produces well-articulated brushwork in trees and rocks, with effective use of blank space for water bodies, and natural applications of ink wash techniques such as feibai, jimo, and pomo (splashed ink). The overall imagery conveys a strong sense of vitality and rhythm, achieving a harmonious integration of form and meaning, and delivering a rich expression of artistic conception.
As shown in Table 3 and Fig. 6, the evaluated models exhibit distinct differences in perceptual quality from the perspective of general users. RAPHAEL demonstrates the best performance in Perceived Visual Quality, with a mean score of 4.07. The corresponding boxplot indicates a high median and concentrated interquartile range, suggesting strong and consistent visual detail rendering. SDXL achieves the highest score in Semantic Consistency (mean = 4.11), with a tightly grouped distribution, reflecting its strong alignment between text and image content. Although DALL-E 3 attains comparable mean scores across dimensions, its boxplots reveal more dispersed ratings with the presence of low outliers, indicating less stable performance. In contrast, our model (Ours) excels in Style Fidelity, achieving the highest mean score of 3.98, with a tight and elevated boxplot distribution, highlighting its superior capability to reproduce the stylistic features of traditional Chinese landscape painting. Furthermore, Ours achieves a Semantic Consistency score of 4.09, closely approaching the best-performing SDXL, demonstrating robust semantic understanding. Although its Visual Quality score is slightly lower than those of RAPHAEL and SDXL, the difference is marginal, and Ours maintains high stability overall. These observations are further supported by statistical significance testing. For the Visual Quality dimension, the Friedman test yields a statistic of 2.130 with a p-value of 0.5458, indicating no significant differences among models in this dimension; user ratings were generally consistent across models. In Semantic Consistency, the Friedman statistic is 10.714 (p = 0.0134), suggesting a significant difference among models. However, subsequent Dunn's post hoc tests reveal that none of the pairwise comparisons reach statistical significance (all p > 0.05), implying that the overall difference may result from the joint contribution of multiple models. In Style Fidelity, the Friedman statistic is 8.217 with a p-value of 0.0417, indicating significant differences. Dunn's post hoc analysis further shows that Ours significantly outperforms DALL-E 3 (p = 0.037969) in this dimension, while no significant differences are observed with other models. In summary, Ours exhibits a statistically significant advantage in Style Fidelity, underscoring its strong ability to faithfully reproduce traditional Chinese landscape painting styles. It also performs stably in Semantic Consistency and comparably in Visual Quality, demonstrating balanced and competitive overall performance in the task of traditional Chinese landscape image generation.
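The significance analysis used here (and for the expert study of Table 4) follows a standard non-parametric recipe: a Friedman omnibus test over paired ratings, followed by Dunn's pairwise comparisons. The sketch below shows how it can be reproduced with `scipy` and the `scikit-posthocs` package; the rating arrays are synthetic stand-ins, and the Bonferroni adjustment is our assumption about the correction method.

```python
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp                # pip install scikit-posthocs

rng = np.random.default_rng(0)
models = ["RAPHAEL", "SDXL", "DALL-E 3", "Ours"]
# Hypothetical 1-5 style-fidelity ratings: 30 raters (rows) x 4 models
# (columns); the real study data are not reproduced here.
ratings = rng.integers(1, 6, size=(30, len(models)))

# Friedman test: do the paired ratings differ across models?
stat, p = friedmanchisquare(*[ratings[:, j] for j in range(len(models))])
print(f"Friedman chi2 = {stat:.3f}, p = {p:.4f}")

# Dunn's post hoc test on long-format data, Bonferroni-adjusted.
long = pd.DataFrame({
    "score": ratings.ravel(),               # row-major: rater by rater
    "model": np.tile(models, ratings.shape[0]),
})
print(sp.posthoc_dunn(long, val_col="score", group_col="model",
                      p_adjust="bonferroni"))
```

Because the pairwise p-values are corrected for multiple testing, an omnibus Friedman test can be significant while no single pairwise contrast survives adjustment, which matches the Semantic Consistency result reported above.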
As shown in Table 4 and Fig. 7, the expert evaluation was conducted across three core dimensions of traditional Chinese landscape painting, namely brushwork, composition & perspective, and ink aesthetic, to assess each model's capability in artistic fidelity. Overall, Ours demonstrated strong performance across multiple dimensions, highlighting its excellent adaptability and effectiveness in replicating traditional artistic features. In the Composition & Perspective dimension, Ours achieved the highest mean score of 4.25, with the boxplot indicating a concentrated distribution in the upper quartile. Statistical analysis revealed that Ours significantly outperformed DALL-E 3 (p = 0.0288) and SDXL (p = 0.0037), and showed no significant difference from RAPHAEL (p = 1.0). This suggests that Ours aligns well with classical principles such as the “Three Distances” perspective and exhibits clear structural hierarchy and spatial layering. For Brushwork, RAPHAEL led with a mean score of 4.5, significantly higher than DALL-E 3 (p = 0.0098). Ours followed closely with a mean score of 4.0, and although the difference was not statistically significant compared to the other models, its boxplot showed low variance, indicating consistent rendering of traditional techniques such as texturing, strokes, and ink-wash gradients. In terms of Ink Aesthetic, RAPHAEL again achieved the highest mean score (4.125), followed by Ours (3.75). Although the Friedman test did not indicate statistical significance across models (p = 0.0954), Ours demonstrated stable performance, with a compact boxplot suggesting reliable delivery of artistic mood and atmospheric subtlety. Taken together, Ours stands out as the top-performing model in composition, while also delivering solid and consistent results in brush technique and aesthetic expression. With all three dimensions rated as “Good” or “Excellent”, Ours exhibits outstanding overall suitability for the task of traditional Chinese landscape painting generation, balancing structural composition and artistic authenticity with notable effectiveness.
Despite the significant progress achieved in Chinese landscape painting generation, several limitations and challenges remain. Although the CTLPD dataset encompasses a relatively wide range of styles and themes, it still falls short of capturing the full diversity of human artistic creation. Rare styles such as ruled-line painting (jiehua) and blue-and-green landscape are underrepresented, which may impair the model's generalization capability in these genres. Additionally, the current evaluation metrics (FID, LPIPS, and CLIP-T) are primarily designed for natural image perception and fail to adequately reflect aesthetic criteria intrinsic to traditional East Asian art, such as qiyun shengdong (vitality and spirit resonance) and xushi xiangsheng (the interplay of void and substance). Future research should explore the integration of evaluation indicators grounded in art cognition and aesthetic theory.
Chinese landscape painting is not merely a visual art form but also a vessel of profound philosophical thought, cultural symbolism, and historical value. The introduction of LFMDiff represents an innovative advancement that has begun to influence generative modeling, while also providing potential contributions to the digital preservation and dissemination of cultural heritage. In terms of cultural inheritance, LFMDiff can support the recreation of ancient painting styles and the restoration of damaged artworks, providing intelligent tools for institutions such as the Palace Museum and Dunhuang Academy. In art education, LFMDiff may serve as a pedagogical aid to help beginners grasp compositional logic and brush-ink techniques, thereby facilitating digital instruction and creative reinterpretation of traditional practices. It is important to note, however, that such educational impact is not solely determined by technical capability; factors such as learners’ prior art background, age, and level of digital literacy also play a role. Furthermore, the controllable generation mechanisms introduced by our model could offer a technical foundation for the development of interactive cultural products, suggesting possible pathways for expanding the expressive boundaries of traditional art in contemporary contexts.
Although user evaluations highlighted the model’s performance in style fidelity and expert assessments confirmed its reliability in structural and compositional aspects, future research could benefit from triangulation by integrating qualitative findings with industry reports, usage data, or broader survey results. Such an approach would help assess the generalizability of these observations beyond individual subjective perceptions, thereby enhancing the robustness and applicability of the conclusions.