This section describes the simulation and analysis pipeline used to evaluate the ArtFusionNet model on artistic style classification. The experiments are designed to assess accuracy, generalization, and computational efficiency. The section covers dataset collection, model implementation, the training procedure, and performance evaluation under several metrics.
The dataset spans a wide variety of artistic styles, with differing brush strokes, color palettes, viewing angles, and compositions, which supports a comprehensive evaluation. To help the model generalize across styles, the images are preprocessed by resizing, normalization, and data augmentation. These steps mitigate differences in resolution and lighting, as well as distributional bias introduced during imaging.
The implementation details cover the multi-scale feature extraction module, the Transformer encoder, and the adaptive fusion strategy. The training configuration, hyperparameter tuning, optimizer choice, and hardware specification are also reported to ensure reproducibility. The model is evaluated in comparative studies against baseline architectures, including CNN-only, Transformer-only, and other hybrid approaches.
To examine how individual components affect classification performance, an ablation study is conducted in which key architectural components are removed to measure the contribution of each part of the model. In addition, a sensitivity analysis studies how hyperparameters such as learning rate, batch size, and network depth affect the convergence behavior and performance of the model.
The results are visualized through confusion matrices, precision-recall curves, loss trajectories, and per-class performance distributions to reveal the strengths and limitations of the model predictions. Together, these evaluations are intended to demonstrate the effectiveness of the proposed ArtFusionNet model in separating artistic styles. The following subsections describe each stage of the evaluation.
Dataset and preprocessing
The dataset is highly heterogeneous: it comprises thousands of paintings drawn from many artistic styles, organized into collections so that each painting is individually identified. Because artworks are complex and variable, preprocessing is needed for the model to learn meaningful representations. The preprocessing pipeline consists of resizing, normalization, augmentation, and dataset partitioning, and is intended to improve the generalizability and robustness of the proposed ArtFusionNet framework. The individual operations are described below.
All images are resized to a fixed resolution to ensure uniform input across datasets and compatibility with the convolutional layers of the CNN module. Resizing uses bilinear interpolation, which preserves the spatial coherence and structural unity of the original artistic elements, as defined in Eq. (1). Fig. 18 illustrates this step by comparing an original image with its resized version. A standard resolution is important for deep learning models because it avoids introducing distortion that would degrade feature extraction.

Example of an original image and its resized counterpart.
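The resizing step can be illustrated with a minimal pure-Python sketch of bilinear interpolation on a single-channel image. The function name `resize_bilinear`, the nested-list image representation, and the corner-aligned coordinate mapping are assumptions made for illustration; the exact form of Eq. (1) and the target resolution used in the study are not reproduced here.

```python
def resize_bilinear(image, out_h, out_w):
    """Resize a 2-D grayscale image (list of rows) with bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    # Map output indices to input coordinates so that the corners coincide
    # (an assumed convention; other libraries align pixel centres instead).
    scale_y = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    scale_x = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for i in range(out_h):
        y = i * scale_y
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * scale_x
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = (1 - dx) * image[y0][x0] + dx * image[y0][x1]
            bottom = (1 - dx) * image[y1][x0] + dx * image[y1][x1]
            row.append((1 - dy) * top + dy * bottom)
        out.append(row)
    return out
```

For a colour image, the same function would be applied to each channel separately.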
To minimize the effects of illumination changes, contrast differences, and fluctuations in overall pixel intensity, Z-score normalization is applied uniformly to all images. The pixel intensities are thereby standardized to zero mean and unit variance, as defined in Eq. (2). The effect can be seen in Fig. 19, in which the normalized image has more consistent contrast, making it easier for the model to extract features. This step is important for removing lighting-induced bias and for stabilizing training.

Normalized images with more consistent contrast.
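As a minimal sketch of the normalization step, the snippet below applies the standard Z-score definition, x' = (x - mean) / std, to a flat list of pixel intensities. Computing the statistics per image (rather than over the whole dataset) and the small constant `eps` are assumptions for illustration; the section does not state which variant Eq. (2) uses.

```python
import math


def zscore_normalize(pixels, eps=1e-8):
    """Standardize pixel intensities to zero mean and unit variance."""
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = math.sqrt(variance)
    # eps guards against division by zero for constant images (assumed detail).
    return [(p - mean) / (std + eps) for p in pixels]
```

Note that this standardizes the first two moments only; it does not change the shape of the intensity distribution.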
To improve generalization and avoid overfitting, a comprehensive data augmentation strategy is employed: random rotation by up to 30 degrees in either direction, horizontal and vertical flips, contrast adjustment, and random cropping. These augmentations introduce controlled variation into the dataset, encouraging the model to be invariant to minor transformations while strengthening its ability to recognize distinctive stylistic patterns. Fig. 20 shows examples before and after augmentation, which illustrate how these transformations enrich the training dataset without compromising artistic qualities.

Before and after augmentation examples of images.
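The augmentation chain described above can be sketched in pure Python on a single-channel image stored as nested lists. Only the ±30° rotation limit comes from the text; the flip probability of 0.5, the contrast range of 0.8 to 1.2, nearest-neighbour sampling for the rotation, and all function names are assumptions for illustration.

```python
import math
import random


def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rotate(img, degrees, fill=0.0):
    """Rotate about the image centre with nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Inverse mapping: locate the source pixel for output pixel (i, j).
            y, x = i - cy, j - cx
            src_x = round(cos_t * x + sin_t * y + cx)
            src_y = round(-sin_t * x + cos_t * y + cy)
            if 0 <= src_y < h and 0 <= src_x < w:
                out[i][j] = img[src_y][src_x]
    return out


def adjust_contrast(img, factor):
    """Scale pixel deviations from the image mean by `factor`."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return [[mean + factor * (p - mean) for p in row] for row in img]


def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(img) - crop_h)
    left = rng.randint(0, len(img[0]) - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]


def augment(img, crop_h, crop_w, rng):
    """Apply the augmentation chain once with randomly drawn parameters."""
    img = rotate(img, rng.uniform(-30.0, 30.0))  # limit taken from the text
    if rng.random() < 0.5:                       # assumed flip probability
        img = hflip(img)
    if rng.random() < 0.5:
        img = vflip(img)
    img = adjust_contrast(img, rng.uniform(0.8, 1.2))  # assumed range
    return random_crop(img, crop_h, crop_w, rng)
```

In a full pipeline the cropped image would be resized back to the fixed input resolution before being passed to the network.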
For a fair and accurate evaluation, the dataset is systematically divided into a training set (70%), a validation set (15%), and a test set (15%). Stratified sampling is used so that every artistic style is represented in each subset in proportion to its frequency, which increases the reliability of the performance assessment. This is critical in artistic style classification, since some styles have few samples and could otherwise be poorly represented in the training or test set. With this preprocessing in place, the dataset is prepared for training the ArtFusionNet framework. In addition to the generated dataset, three benchmark datasets are used for a comprehensive comparison. The details of these datasets are summarized in Table 2.
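A stratified 70/15/15 partition of this kind can be sketched as follows. The function name, the fixed seed, and the rounding rule for small classes are assumptions for illustration, not details taken from the study.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n = len(indices)
        n_train = round(fractions[0] * n)
        n_val = round(fractions[1] * n)
        # Whatever remains after the train and validation shares goes to test.
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test
```

Splitting within each class, rather than over the pooled dataset, is what keeps rare styles present in all three subsets.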
Implementation details
The ArtFusionNet framework is implemented to exploit both convolutional operations and attention-based mechanisms for the classification of artistic styles. This section describes the network architecture, training setup, optimization methods, and computing environment, to support reproducibility and clarity.
The proposed model integrates a Transformer-based global attention mechanism with a multi-scale CNN feature extraction module and an adaptive feature fusion module. The CNN module captures low- to mid-level features, including fine texture patterns, delicate brush strokes, and localized spatial relationships in artistic compositions; the Transformer component attends to global structure and long-range dependencies, allowing a more complete representation of the overall style of the artwork. The adaptive fusion mechanism combines the two feature representations in an input-dependent manner to strengthen classification.
Training uses mini-batch gradient descent with a batch size of 64, chosen for computational efficiency and stable convergence. The initial learning rate is held for the first few epochs, after which a cosine annealing scheduler reduces it over the course of training, allowing exploration of the loss landscape and reducing the risk of premature convergence. The AdamW optimizer is used; its decoupled weight decay regularizes the weights without sacrificing fast convergence. Early stopping further controls overfitting by monitoring the validation loss and halting training when it ceases to improve. The model is trained with the categorical cross-entropy loss, which is suited to multi-class classification.
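The learning-rate schedule and the early-stopping rule can be sketched as below. The number of hold epochs, the minimum learning rate, the patience, and the names `lr_at_epoch` and `EarlyStopping` are hypothetical; the text specifies only that the initial rate is held for a few epochs, then annealed along a cosine curve, and that training stops when the validation loss stops improving.

```python
import math


def lr_at_epoch(epoch, total_epochs, lr_max, hold_epochs=0, lr_min=0.0):
    """Hold lr_max for `hold_epochs`, then decay to lr_min along a half cosine."""
    if epoch < hold_epochs:
        return lr_max
    span = max(1, total_epochs - hold_epochs - 1)
    progress = (epoch - hold_epochs) / span
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


class EarlyStopping:
    """Signal a stop after `patience` epochs without validation improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `lr_at_epoch` would be queried once per epoch before the optimizer step, and `EarlyStopping.step` once per epoch after validation.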
As additional regularization, the data augmentation strategy further improves the generalization capability of the model. Random rotations (up to ±30°), horizontal and vertical flips, contrast normalization, and random cropping introduce controlled variation in the training samples, making the model more robust to style variation. Label smoothing prevents the model from becoming overconfident in its predictions and supports uncertainty estimation. Dropout (0.3), applied after pre-layer normalization in the Transformer module, counteracts overfitting by reducing dependence on individual neurons, while the pre-layer normalization stabilizes the training dynamics.
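A minimal sketch of label smoothing combined with the cross-entropy loss is given below. The smoothing factor `epsilon=0.1` is an assumed value (the section does not state the one used), and the uniform redistribution over all classes is one common formulation rather than necessarily the one adopted in the study.

```python
import math


def smooth_labels(true_class, num_classes, epsilon=0.1):
    """One-hot target softened by spreading `epsilon` uniformly over all classes."""
    off_value = epsilon / num_classes
    target = [off_value] * num_classes
    target[true_class] += 1.0 - epsilon
    return target


def cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a (possibly soft) target."""
    peak = max(logits)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -sum(t * (z - log_sum) for t, z in zip(target, logits))
```

With smoothing, a very confident correct prediction still incurs a non-negligible loss, which is what discourages overconfident outputs.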
The entire pipeline is implemented and run in MATLAB 2024a with the Deep Learning Toolbox. Training is performed on an NVIDIA RTX 3090 with 12 GB of RAM, which enables efficient parallel computation and real-time or near-real-time processing of large image collections. GPU acceleration considerably shortens training time, particularly for the computationally intensive self-attention operations in the Transformer module.
This structured implementation allows the ArtFusionNet model to extract both local and global artistic attributes efficiently, leading to improved classification accuracy. The next section presents a comparative analysis of the model against other architectures as further evidence of its ability to discriminate artistic styles. Table 3 presents a summary of the key implementation details and hyperparameters.
Model comparison and ablation study
To evaluate the effectiveness of the proposed ArtFusionNet framework, a thorough comparison is conducted against baseline models that each lack one of the major architectural elements. This evaluation reveals the contributions of the multi-scale CNN feature extraction, the Transformer-based global context modeling, and the adaptive fusion module. Performance is analyzed quantitatively using the following metrics: classification accuracy, validation loss, training time, precision, recall, and F1-score. The complete results are given in Table 4.
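For reference, the classification metrics listed above can be derived from a confusion matrix as in the following sketch. The row/column convention (rows are true classes, columns are predictions) and the use of macro averaging are assumptions; the section does not state how the values in Table 4 were aggregated across classes.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, f1) per class.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp
        fn = sum(confusion[k]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def macro_average(results):
    """Unweighted mean of each metric over the classes."""
    n = len(results)
    return tuple(sum(r[m] for r in results) / n for m in range(3))
```

Macro averaging weights every style equally, so a poorly recognized rare style lowers the score as much as a poorly recognized common one.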
The proposed model, which integrates multi-scale CNN feature extraction with Transformer attention, attains a classification accuracy of 99.00% and a validation loss of 0.1227. It also achieves an F1-score of 0.9906, indicating a robust ability to distinguish between artistic styles while generalizing across datasets. With the Transformer module, however, the computational overhead brings the total training time to 19946.46 s. This cost is justified by the higher accuracy and more stable training, with a steadily decreasing validation loss and a smooth convergence profile.
To analyze the importance of the Transformer module, the No Transformer model was trained with CNN feature extraction but without the Transformer attention mechanism. Removing the Transformer caused a large degradation in performance: accuracy dropped to 81.73%, while validation loss increased to 1.0426. The drops in precision (to 0.7183) and recall (to 0.9254) suggest that some stylistic attributes depend on long-range context that was not captured well. These results underline the critical role played by the Transformer in refining representations beyond local feature extraction.
In contrast, the No ResNet model, which removed the CNN component and kept the Transformer module, suffered a severe drop in classification performance. Accuracy fell to 35.14%, with a validation loss of 2.3306, demonstrating the indispensable role of CNN-based feature extraction in recognizing low- and mid-level artistic features such as brush strokes, textures, and local structural patterns. The drastically reduced precision (0.1154), recall (0.2546), and F1-score (0.2722) further indicate that the Transformer alone is not sufficiently equipped to extract the fine-grained visual features essential for accurate classification.

Confusion matrix of the proposed ArtFusionNet model.
The No Fusion model, which removes the adaptive fusion mechanism and processes the CNN and Transformer features independently, achieves an accuracy of 98.07% with a validation loss of 0.1218. Its performance remains strong, but the F1-score drops to 0.9877, indicating that the adaptive fusion module is important for combining the local and global feature representations in a stable and calibrated way for the classification output.
The confusion matrix in Fig. 21 shows that the proposed model discriminates artistic styles effectively: misclassification rates are very low and the model predicts with high confidence across all classes. The precision, recall, and F1-score values in Fig. 22 show nearly balanced performance across the artistic styles.

Class-wise precision, recall, and F1-score for the proposed model.
The proposed model converges smoothly and stably, as shown by the training and validation loss curves in Fig. 23. The early stabilization of the validation loss indicates good generalization with no sign of overfitting.

Learning curve of the proposed model.
This balance of CNN-based local feature extraction and Transformer-based global attention modeling appears well suited to artistic style classification. The ablation studies indicate that the CNN captures basic features such as texture and structure, while the Transformer module refines these features by modeling global dependencies and compositional structure. The adaptive fusion of the two modules yields a considerably stronger classification framework.
Sensitivity analysis
A sensitivity analysis was carried out to measure the effect of the primary hyperparameters on the performance of the proposed ArtFusionNet framework. Learning rate and batch size are the main factors considered, since they affect model convergence, classification accuracy, and generalization ability. The goal of this analysis is to identify parameter configurations that improve robustness and reduce underfitting and overfitting.
The learning rate is a key quantity in deep learning optimization: it largely determines the speed and stability of convergence. To evaluate its influence systematically, experiments were run with three learning rates: \(10^{-5}\), \(10^{-4}\), and \(10^{-3}\). The corresponding accuracy and loss values are summarized in Tables 5 and 6.
According to Table 6, a very small learning rate (\(10^{-5}\)) produced a low accuracy of 40.41% with a standard deviation of 10.37%, which suggests slow convergence. A high learning rate (\(10^{-3}\)) was erratic, yielding an accuracy of 60.78% with a standard deviation of 19.30%. The best learning rate was \(10^{-4}\), which gave the highest mean accuracy of 72.41%, although with a fairly high standard deviation of 22.69%.
The loss values in Table 5 show the same trend. The lowest mean loss, 0.9850, was again obtained at \(10^{-4}\), indicating the best balance between convergence speed and generalization. The model at \(10^{-3}\) gave a higher loss (1.3322), which further supports the instability hypothesis. At \(10^{-5}\), the very high loss (2.2098) shows that the model could not minimize the loss because learning was too slow.
Batch size also contributes significantly to training stability and computational efficiency. To understand its effect, experiments were run with batch sizes of 8, 16, and 32; the results are given in Tables 7 and 8.
The trend in Table 7 indicates that larger batch sizes tend to increase accuracy while reducing variability. At a batch size of 8, the mean accuracy is 67.28% with a high standard deviation of 19.97%, reflecting noisy gradient estimates. With a batch size of 16, accuracy improves to 94.18% and the standard deviation falls to 11.98%. The best performance is obtained with a batch size of 32, which yields a mean accuracy of 94.63% with the lowest standard deviation, 7.95%.
The loss values in Table 8 support these findings. For a batch size of 8, the mean loss was high (1.1290) with considerable variability (0.7218). At a batch size of 16, the loss fell substantially (0.2650), with a variability of 0.4143; raising the batch size to 32 kept the loss at a similar level (0.3015) with still lower variability (0.2750).
Tuning these hyperparameters yields a compromise between stable convergence, generalization, and computational efficiency. In our experiments, a learning rate of \(10^{-4}\) and a batch size of 32 gave a good compromise between accuracy and loss, and practical considerations therefore favor these hyperparameter choices. The next section presents a statistical validation of these findings and discusses their implications for artistic style classification.
Statistical analysis and visualization
To establish the effectiveness and statistical significance of the methodology, a detailed statistical analysis was performed together with several visualizations of the key performance measures. This section presents hypothesis tests, comparative statistics, and graphical interpretations that support the strength of the model.
A paired t-test was used to determine whether the performance gain of the proposed model over its baseline counterparts was statistically significant. The test compared, across several runs, the classification accuracies of the proposed model and its NoTransformer, NoResNet, and NoFusion counterparts. As shown in Table 4, the proposed model achieved a classification accuracy of 99.00%, above that of NoTransformer (81.73%), NoResNet (35.14%), and NoFusion (98.07%). The differences were statistically significant (p < 0.05), suggesting that all three components (multi-scale CNN feature extraction, Transformer-based global context modeling, and the adaptive fusion module) contribute to the improved classification accuracy.
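The paired t-test can be sketched as follows. The per-run accuracies in the usage example are placeholder values invented purely to exercise the function; they are not results from the study, and the number of runs is likewise an assumption. The function returns the t statistic and the degrees of freedom, which are then compared against a tabulated critical value.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Placeholder accuracies over five hypothetical runs (NOT values from the paper).
full_model = [0.99, 0.98, 0.99, 0.97, 0.98]
ablated = [0.82, 0.80, 0.83, 0.81, 0.79]
t_value, dof = paired_t_statistic(full_model, ablated)

# Two-sided critical value of Student's t for alpha = 0.05 and 4 degrees of freedom.
T_CRITICAL_DF4 = 2.776
significant = abs(t_value) > T_CRITICAL_DF4
```

Pairing the runs (for example, by shared random seed or data split) removes run-to-run variation that affects both models equally, which is why the test operates on the per-run differences.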
In addition, an analysis of variance (ANOVA) was performed over multiple training sessions to assess the stability of model performance. Table 6 shows that the classification accuracy of the proposed model had low variability (Std Dev = 7.95% for batch size 32), whereas NoTransformer and NoResNet showed considerably higher variability. Table 5 reports the corresponding loss values, showing that the proposed model attained the lowest loss (0.9850) among the configurations, which indicates consistent and reliable performance across the experimental settings.
The sensitivity analysis results in Tables 7 and 8 shed further light on hyperparameter tuning. As Table 7 shows, increasing the batch size from 8 to 32 raised the accuracy from 67.28% to 94.63% and reduced the variability. Similarly, Table 8 shows the effect of batch size on the loss: batch sizes of 16 and 32 produced lower loss values than a batch size of 8. These results support the importance of hyperparameter selection for achieving the best performance.
The classification reliability of the proposed model is further illustrated by the confusion matrix in Fig. 21. The model shows high confidence across all artistic styles with few misclassifications, indicating that the attributes of different styles can be discriminated effectively even among closely related classes. Figure 22 shows the per-class precision, recall, and F1-score, which are high for all artistic categories. As seen in Fig. 22, the model avoids strong class bias and performs well across classes.
Figure 23 shows the stability and convergence of the model across epochs through the training and validation loss curves. The convergence is smooth, and the validation loss stabilizes early, indicating that the model generalizes to unseen data without overfitting. In addition, the model performance results are compared in Fig. 24, which shows that the proposed architecture achieves better classification metrics than the other evaluated configurations. The best performance in Fig. 24 demonstrates the capability of the proposed CNN-Transformer hybrid as a complete model that captures both local and global artistic features.
Taken together, these statistical analyses and visualizations support the robustness, reliability, and strong generalization capability of the proposed ArtFusionNet framework. The testing and validation outcomes show that the model captures both local and global aspects of artistic features and performs well in artistic style classification.
Cross-dataset generalization and comparison to state-of-the-art models
To further investigate the generalization capacity and competitiveness of the proposed ArtFusionNet framework, we evaluated it not only on the Fallah_artist_dataset but also on three popular benchmark datasets:
- WikiArt [42]: A huge database of over 80,000 paintings in 27 different styles.
- Painting-91 [43]: A typical collection of 4,266 paintings by 91 artists, often used to test problems of authorship and style attribution.
- BAM! (Behance Artistic Media Dataset) [44]: A huge dataset containing a broad variety of artistic media, including paintings, digital art, and designs, annotated with stylistic and thematic labels.
Besides dataset generalization, we performed comparative experiments with several state-of-the-art (SotA) architectures that reflect current trends in visual recognition and artistic style classification:
- Swin Transformer [45]: A hierarchical vision transformer that uses shifted windows, which are effective in modeling long-range dependencies.
- DeiT (Data-efficient Image Transformer) [10]: A distilled variant of the Vision Transformer, designed to perform well on smaller datasets.
- CLIP-based framework [46]: A multimodal model trained to align textual and visual representations through contrastive learning, which shows a high level of generalization on image understanding tasks.
Table 9 shows the comparison of the performance of the proposed framework with the chosen SotA models on the four datasets. The results are reported in the form of top-1 classification accuracy and macro-averaged F1-score.
The results support several observations. First, the proposed model outperforms the Swin Transformer, DeiT, and CLIP-based baselines on all datasets, which demonstrates its stability and its potential for application to other artistic domains. On WikiArt, our method achieves an accuracy of 92.3%, which is 3.1% and 4.8% above Swin Transformer and DeiT, respectively. The gains are even larger on Painting-91, where the proposed framework achieves 94.1% accuracy, indicating that the Adaptive Fusion Module learns discriminative features efficiently despite the limited data.
On the BAM! dataset, which includes a broad and noisy distribution of artistic content, the proposed model reaches an accuracy of 90.7%, exceeding the CLIP-based baseline by 3.5%. These findings support the idea that our hybrid architecture combines fine-grained local information with global contextual dependencies, resulting in better style discrimination than CNN-only and Transformer-only models.
Overall, the experiments confirm that the proposed framework is not only very effective on the custom Fallah_artist_dataset but also achieves state-of-the-art performance on widely used artistic benchmarks, positioning it as competitive in the field of artistic style classification.
Resource and computational complexity analysis
In addition to classification accuracy, the computational profile of a deep learning model is critical, since it determines scalability and deployability in real-world applications. Although Transformer-based models have shown impressive results in artistic style classification, their resource demands in terms of parameter counts, FLOPs, and inference time frequently restrict their use in practice. To address this, we conducted a computational complexity study of the proposed ArtFusionNet model and compared it with representative state-of-the-art models.
- Experimental Setup.
- Metrics measured:
  - Parameters (M): number of parameters, in millions.
  - FLOPs (G): floating-point operations needed to process one 224 × 224 image in a single forward pass.
  - Inference time (ms): mean time per image in inference with batch size = 1.
  - Throughput (images/sec): the number of images processed per second when the batch size = 32.
The computational complexity of the proposed framework and the state-of-the-art models is shown in Table 10.
The proposed Hybrid CNN Transformer offers a good trade-off between accuracy and computational cost. It contains 52.4 M trainable parameters, making it considerably more parameter-efficient than Swin Transformer (87.8 M), DeiT (86.7 M), and the CLIP-based framework (150.3 M). The hybrid architecture, with pyramid pooling and dilated convolutions, reduces computation to 9.8G FLOPs, which is 38% less than Swin and 56% less than CLIP, while achieving higher classification accuracy. In terms of speed, the model has an inference time of 18.5 ms per image, faster than the Transformer-only baselines and approaching CNN performance, which indicates high deployment potential. It also reaches a throughput of 174 images/s at a batch size of 32, allowing efficient large-scale inference and real-time uses such as art curation and retrieval. Overall, these findings indicate that the Adaptive Fusion Module not only improves classification accuracy but also substantially reduces computational requirements, making the framework efficient on contemporary hardware.
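The relative savings quoted above can be checked with simple arithmetic, as in the sketch below. Only figures stated in the text are used as inputs; the baseline FLOPs are back-calculated from the quoted percentages and are therefore implied estimates, not values read from Table 10. The helper names are hypothetical.

```python
def relative_reduction(ours, baseline):
    """Fraction by which `ours` is smaller than `baseline`."""
    return (baseline - ours) / baseline


def implied_baseline(ours, reduction):
    """Baseline value implied by a stated fractional reduction."""
    return ours / (1.0 - reduction)


# Parameter counts in millions, as quoted in the text.
OURS_PARAMS_M = 52.4
BASELINE_PARAMS_M = {"Swin Transformer": 87.8, "DeiT": 86.7, "CLIP-based": 150.3}

param_savings = {
    name: relative_reduction(OURS_PARAMS_M, value)
    for name, value in BASELINE_PARAMS_M.items()
}

# FLOPs (G) back-calculated from "38% less than Swin" and "56% less than CLIP".
implied_swin_gflops = implied_baseline(9.8, 0.38)
implied_clip_gflops = implied_baseline(9.8, 0.56)

# Images per second implied by 18.5 ms per image at batch size 1.
single_image_rate = 1000.0 / 18.5
```

Under these figures the model has roughly 40%, 40%, and 65% fewer parameters than Swin Transformer, DeiT, and the CLIP-based framework, respectively, and the implied baseline costs are about 15.8 G and 22.3 G FLOPs. The single-image latency corresponds to about 54 images/s, so the reported 174 images/s at a batch size of 32 reflects roughly a 3.2-fold gain from batching.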
Qualitative analysis and error analysis
Although the quantitative analysis showed the superiority of the proposed Hybrid CNN Transformer model, a qualitative analysis of its errors gives a better understanding of its weaknesses. To this end, representative misclassified samples were visualized to analyze the failure cases and identify recurring patterns (Fig. 24).

Representative samples misclassified by the proposed model. Each panel indicates ground truth → predicted class. The misclassifications are concentrated in stylistically close categories such as Impressionism vs. Abstract Expressionism, Cubism vs. Futurism, Baroque vs. Rococo, and Surrealism vs. Abstract.
As the examples in Fig. 24 show, misclassifications tend to happen in the following cases:
- Borderline or stylistically overlapping: Impressionist paintings were sometimes mixed up with Abstract Expressionist paintings, since both use loose brushstrokes and bright palettes.
- Geometric and structural similarities: Cubism was occasionally confused with Futurism because of their common use of angularity and fragmentation.
- Ornamental richness: Baroque and Rococo paintings were sometimes conflated, as both styles are richly detailed and complex in color, with only slight variations in subject matter often being decisive.
- Transitional or changing personal styles: In some datasets (e.g. Painting-91), the evolution of individual artists (e.g. the Cubist to Surrealist phases of Picasso) introduced categorical ambiguity.
Though the overall confusion matrix (Fig. 21) shows high per-class precision and recall, the highlighted misclassifications in Fig. 24 show that the errors are not random but concentrated in stylistically related movements.
The qualitative results in Fig. 24 point to three main limitations of the current model. The first is stylistic ambiguity: the mistakes tend to coincide with cases in which even human experts acknowledge blurred boundaries between styles. This confusion reflects the general difficulty of categorizing works that sit at the crossroads of artistic traditions. The second is a limitation of feature fusion. In borderline cases, the Adaptive Fusion Module can overweight global compositional cues and ignore fine-grained local information or, conversely, overweight local information at the cost of global stylistic integrity. The third is a semantic gap in the reasoning of the model. Subject-matter cues, which are frequently instrumental in determining style (for example, the difference between mythological and secular subject matter), cannot be captured reliably by purely visual modeling.
Future refinements
The error analysis suggests several promising directions for improvement. One is multi-modal learning: by combining textual metadata such as titles, artist notes, or historical context with visual features, the model could resolve semantic ambiguities that visual features alone cannot. In addition, fine-grained contrastive learning could help the system differentiate visually similar but conceptually different categories, and thus reduce confusion between adjacent styles. Lastly, attention visualization techniques, including Grad-CAM for CNN-derived features and attention heatmaps for the Transformer heads, can serve to diagnose and improve the fusion strategy by showing which areas of the input most influence the decisions of the model. Together, these refinements have the potential to increase both interpretability and classification accuracy.
Comparative analysis of fusion strategies
The AFM is the core component of the proposed hybrid CNN-Transformer architecture and combines local CNN representations with global Transformer representations. To validate this design choice, we compared the AFM against several alternative fusion strategies commonly reported in the literature:
- Gated Fusion (GF) [47]: Learns a gating coefficient per modality (CNN vs. Transformer) to regulate the influence of each branch on a per-feature-map basis.
- Cross-Attention Fusion (CAF): Uses cross-attention in which CNN features attend to Transformer embeddings (and vice versa) before combination, emphasizing cross-modal feature interactions.
- Channel-wise Modulation (CWM) [3]: Applies channel attention (e.g., SE blocks) to the CNN and Transformer channels and then adds the reweighted features.
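To make the difference between two of these baselines concrete, the following is a minimal NumPy sketch of gated fusion and SE-style channel-wise modulation as described above. It is an illustration under stated assumptions, not the implementation evaluated here: the function names, the weight shapes, the reduction ratio `r`, and the randomly drawn weights (stand-ins for learned parameters) are all hypothetical. The AFM itself is not reproduced, since this section does not specify its internals.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(f_cnn, f_trans, w_gate, b_gate):
    """Gated fusion sketch: one gate g in (0, 1) mixes the two branches.

    f_cnn, f_trans: feature maps of shape (C, H, W).
    w_gate (shape (2C,)) and b_gate are hypothetical gate parameters.
    """
    # Global-average-pool each branch into a per-channel descriptor.
    desc = np.concatenate([f_cnn.mean(axis=(1, 2)), f_trans.mean(axis=(1, 2))])
    g = sigmoid(desc @ w_gate + b_gate)  # scalar weight for the CNN branch
    return g * f_cnn + (1.0 - g) * f_trans


def se_weights(f, w_reduce, w_expand):
    """Squeeze-and-excitation style channel weights for one branch."""
    z = f.mean(axis=(1, 2))              # squeeze: (C,)
    h = np.maximum(0.0, w_reduce @ z)    # bottleneck + ReLU: (C // r,)
    return sigmoid(w_expand @ h)         # excitation: (C,)


def channel_wise_modulation(f_cnn, f_trans, se_cnn, se_trans):
    """Reweight the channels of each branch, then add the results."""
    s_c = se_weights(f_cnn, *se_cnn)
    s_t = se_weights(f_trans, *se_trans)
    return s_c[:, None, None] * f_cnn + s_t[:, None, None] * f_trans


# Toy inputs; random weights stand in for learned parameters.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
f_cnn = rng.normal(size=(C, H, W))
f_trans = rng.normal(size=(C, H, W))
w_gate = rng.normal(size=2 * C)
se_cnn = (rng.normal(size=(C // r, C)), rng.normal(size=(C, C // r)))
se_trans = (rng.normal(size=(C // r, C)), rng.normal(size=(C, C // r)))

fused_gf = gated_fusion(f_cnn, f_trans, w_gate, 0.0)
fused_cwm = channel_wise_modulation(f_cnn, f_trans, se_cnn, se_trans)
print(fused_gf.shape, fused_cwm.shape)
```

In this sketch the gate in `gated_fusion` is a single scalar per input, while `channel_wise_modulation` depends only on pooled channel statistics; neither weights features by spatial scale.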
Table 11 compares the performance of these fusion methods on the Fallah_artist_dataset.
The findings confirm that, although all the fusion techniques perform well, the AFM achieves higher accuracy (99.0%) and F1-score (0.991) than the other methods. Several observations outline its advantages. First, compared with GF [47], the AFM benefits from fine-grained, scale-dependent weighting. Although GF produces competitive results, it cannot flexibly balance features across scales, which limits its generalization. Second, compared with CAF, the AFM reaches similar or higher accuracy without CAF's additional parameters (+6.2 M) and FLOPs (+1.6 G), which also slow CAF's inference (23.6 ms). Lastly, although CWM is a lightweight alternative, it relies on channel statistics only, which oversimplifies the fusion process and does not fully exploit complementary global and local information. The AFM, by contrast, is efficient while providing more context-sensitive and detailed feature integration.
Design rationale
Taken together, these results indicate that the AFM offers the best trade-off between accuracy and efficiency among the strategies compared, which is why it is the most appropriate choice for this hybrid system. By combining adaptive weighting with multi-scale aggregation, the AFM generalizes well without the additional computational expense of more complex attention-based fusion strategies. This balance between performance and efficiency is central to the design of the AFM and supports its practicality in real-world applications where both precision and speed are essential.
Qualitative analysis of error and attention
To supplement the quantitative assessment, we performed a qualitative error and interpretability analysis of the proposed framework. Representative misclassified samples, selected from the benchmark test sets, are shown in Fig. 25. Each sub-figure presents (i) the ground-truth versus predicted style label, (ii) a self-attention-like overlay indicating the regions that most influence the classification, and (iii) an inset visualization using spectral residual saliency. The misclassified examples show common error sources: (a) stylistic confusion between Impressionism and Abstract Expressionism, in which diffuse brushwork misleads the model; (b) structural confusion between Cubism and Futurism, in which fragmented geometric elements shift the label; (c) ornamental resemblance between Baroque and Rococo paintings, resulting in fine-grained misattribution; and (d) surreal imagery with displaced abstract forms being assigned to generic abstract categories.

Qualitative error and interpretability analysis of proposed framework. Examples of misclassified works of art are presented in a 2 × 2 grid, with ground truth (left) and predicted (right) style labels in the title of each panel. The images contain the overlay of the self-attention-like map of the model (main view) and a spectral residual saliency map (inset, top-left). Misclassification cases are indicated by red titles and borders. The attention overlays show the areas that made the most contribution to the classification decision, and the saliency insets are another confirmation of visually salient areas. All these visualizations show the strengths of the model in terms of capturing the stylistic cues and its weaknesses in terms of borderline or stylistically overlapping categories.
The overlays show that the model focuses mostly on high-contrast or textured areas (e.g., geometric fragments, facial contours, or ornamental flourishes). In failure cases, however, the attention tends either to scatter across the canvas or to fixate on backgrounds of little stylistic interest. The saliency insets provide an independent check that visually distinctive areas coincide with regions of high activation, which supports the interpretability of the self-attention mechanism.
Overall, these findings demonstrate the strengths of the Transformer backbone in detecting stylistic cues, and also reveal its weaknesses in separating very similar or mixed styles. The qualitative analysis suggests the following actionable improvements: fine-grained local descriptors, cross-style contrastive training, and domain-adaptive attention regularization.
Additional evaluation measures to confirm robustness
To provide a more detailed analysis of the proposed ArtFusionNet model, we complemented accuracy, F1-score, and validation loss with additional metrics. AUC-ROC, Cohen's kappa, and per-class recall support a more nuanced interpretation of model robustness, especially for imbalanced or noisy data. This extension is particularly important in fine-grained style classification, where certain classes may be overrepresented and the performance on underrepresented or challenging classes needs to be evaluated separately.
- Area Under the Receiver Operating Characteristic Curve (AUC-ROC): AUC-ROC measures the discriminative power of the model. It summarizes performance across classification thresholds and indicates how well the model separates classes even when the data is imbalanced. The larger the AUC, the better the model differentiates the various artistic styles.
- Cohen's kappa: Cohen's kappa measures the agreement between the predicted and the actual labels, corrected for the agreement expected by chance. It is a useful metric under class imbalance, since it gives a more realistic evaluation of classification performance by accounting for chance agreement in the predictions.
- Per-class recall: Per-class recall indicates how well the model correctly classifies instances of each class, especially classes with fewer instances. This metric shows how the model handles less-represented or harder-to-classify styles, and thus whether it performs well on all classes rather than only the majority ones.
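The three measures above can be computed from predictions in a few lines. The sketch below uses only the Python standard library and a toy, deliberately imbalanced label set; the function names and the toy data are illustrative assumptions and do not come from the experiments reported here. AUC is computed for a single one-vs-rest split using its rank interpretation (the probability that a random positive is scored above a random negative).

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement (this is plain accuracy).
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected by chance from the two label marginals.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)


def per_class_recall(y_true, y_pred):
    """Fraction of each class's true instances that were predicted correctly."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in support}


def binary_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example: class "A" dominates, class "B" is underrepresented.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "A", "A", "B"]
scores_b = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]    # hypothetical scores for class "B"
labels_b = [int(t == "B") for t in y_true]   # one-vs-rest labels for "B"

kappa = cohen_kappa(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
auc_b = binary_auc(labels_b, scores_b)
print(round(kappa, 3), recall, auc_b)
```

On this toy data the accuracy is 5/6, yet kappa is only 4/7 and the recall of class "B" is 0.5, which illustrates why accuracy alone can overstate performance on imbalanced classes.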
These measures allow a more detailed assessment of model performance. Table 12 reports the additional metrics for the proposed ArtFusionNet model and the comparison models.
Adding AUC-ROC, Cohen's kappa, and per-class recall to the evaluation gives a better picture of model behavior than accuracy alone. The proposed ArtFusionNet model achieves an AUC-ROC of 0.98, indicating high discriminative power across thresholds, as seen in Table 4. This is higher than the other models, such as the Swin Transformer (0.95) and the CLIP-based framework (0.96). Likewise, the model achieves a Cohen's kappa of 0.95, which indicates strong agreement between the predicted and true labels and robustness to class imbalance. Per-class recall also shows the strengths of the model: ArtFusionNet has consistently high recall on all classes, even on more difficult styles such as Style 1, which has a recall of 0.92. The recall values of the comparison models are lower, which underlines the effectiveness of the proposed model in recognizing underrepresented and hard-to-classify artistic styles. Moreover, the model achieves the best overall F1-score (0.9906) and the lowest validation loss (0.1227), showing that it balances accuracy, precision, and robustness.
Taken together, these results show the value of using a set of assessment metrics to obtain a complete view of model performance. They support the conclusion that the ArtFusionNet model is highly effective in overall classification and also robust at the level of individual classes, so it can be used reliably even with imbalanced or noisy data. The proposed method outperforms the baseline and the other models on these metrics, which makes it a promising solution to fine-grained artistic style classification problems.
Feasibility of deployment in real-world scenarios
Beyond experimental performance, it is important to consider the practical use of our hybrid CNN-Transformer model in real-world settings such as museum curation, digital archiving, and art websites.
Museum and gallery curation
Our system can help curators classify and organize large digital collections automatically, so that artworks can be cataloged more quickly and style-based grouping can be done at an early stage. The Adaptive Fusion Module (AFM) increases robustness by combining local texture features with global compositional context, which is especially useful for detecting stylistic details in historical paintings.
Online archiving and retrieval
The model can be deployed in libraries and online repositories to support content-based retrieval, for example when users search by style or school of art. Self-supervised pretraining improves resistance to data imbalance, which makes the model suitable for heterogeneous archives with varying degrees of annotation.
Deployment challenges
Despite these opportunities, real-world deployment faces two main challenges. First, the full model is computationally expensive, which can restrict its real-time use on museum servers or in mobile apps. We addressed this through knowledge distillation, producing a smaller student model that achieves lower inference latency with similar accuracy. Second, digitization quality (e.g., lighting, resolution, scanning artifacts) can vary and thus affect model performance. We mitigate this to some degree with extensive data augmentation during training, but additional robustness testing on real museum archives would be valuable.
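As an illustration of how a smaller student model can be trained from a larger teacher, the sketch below implements the widely used soft-target distillation objective: a temperature-scaled KL divergence between teacher and student outputs combined with the ordinary cross-entropy on the true label. The exact loss, the temperature, and the weighting used for the student model in this work are not stated in this section, so `temperature=4.0`, `alpha=0.5`, and all function names are assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy.

    temperature and alpha are assumed values, not taken from the paper.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # Soft-target term: how far the student is from the teacher's distribution.
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_t, p_s) if pt > 0.0)
    # Hard-target term: standard cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce


# Toy logits for a three-style problem; the true label is class 0.
teacher = [2.0, 1.0, 0.0]
aligned_student = [2.0, 1.0, 0.0]
misaligned_student = [0.0, 1.0, 2.0]

loss_aligned = distillation_loss(aligned_student, teacher, label=0)
loss_misaligned = distillation_loss(misaligned_student, teacher, label=0)
print(loss_aligned < loss_misaligned)
```

The factor `temperature ** 2` keeps the gradient magnitude of the soft-target term comparable to that of the hard-target term when the temperature is changed.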
Future outlook
We see potential application in a hybrid human-AI workflow in which the system automatically classifies and retrieves styles and makes suggestions, while verification and interpretation are left to curators or art historians. This approach balances computational speed, stability, and curatorial expertise. In future work, edge-optimized model variants and domain adaptation strategies will be investigated to make the model robust across a variety of digitization pipelines.
