The proposed framework begins with the collection and preprocessing of a custom art dataset. During preprocessing, images are resized, normalized, and augmented to ensure consistency and robustness in model training. The preprocessed images are then fed into the supervised Modified CNN, which extracts high-level features and performs classification tasks. The model is trained using a labeled dataset with an 80:20 split for training and validation. Evaluation of the model’s performance is conducted using standard metrics, including accuracy, precision, recall, and F1-score. Figure 1 provides a comprehensive overview of the methodology, outlining each interconnected stage in the proposed framework.

Data collection and preprocessing
Data collection and preprocessing are crucial steps in automated art curation, ensuring machine learning (ML) models are trained on high-quality datasets that are both diverse and consistent. This section outlines the composition and preparation of the final dataset used in this study. Two complementary datasets were utilized to create a unified collection. The first was a supplementary dataset sourced from Kaggle’s publicly available art collections, which included diverse styles such as Baroque, Rococo, and Modernism. From this dataset, 180 artworks were selected, evenly distributed across three style categories. Images were resized to 256 × 256 pixels, and metadata inconsistencies were manually corrected to improve data quality. A 70:30 split was applied, resulting in 126 images for training and 54 for validation. This ensured a balanced representation of styles for initial model training and evaluation43.
To enhance the dataset further, a collection of 100 artworks was sourced from Sotheby’s auction catalogs, focusing on Modernist and Abstract styles. Images were processed to maintain high resolution and standard aspect ratios, while metadata such as auction prices and provenance was excluded to focus solely on visual features. This dataset was split 80:20, providing 80 images for training and 20 for validation. The final dataset used in this study combined these two collections, resulting in 280 artworks. This integrated dataset maintained a balance of style categories and was designed to support robust training and evaluation of the proposed Modified CNN model. A summary of the dataset composition and splits is presented in Table 1.
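As an illustration, such a stratified split could be reproduced as follows (a minimal sketch; the per-style folder layout, file format, and random seed are assumptions, not details from the study):

```python
# Illustrative sketch: stratified 70:30 split of the Kaggle subset.
# The per-style folder layout and the random seed are assumptions.
from pathlib import Path
from sklearn.model_selection import train_test_split

paths, labels = [], []
for style_dir in Path("kaggle_art").iterdir():   # e.g. Baroque/, Rococo/, Modernism/
    for img_path in style_dir.glob("*.jpg"):
        paths.append(img_path)
        labels.append(style_dir.name)

# Stratifying keeps the three styles evenly represented
# (126 training / 54 validation images for the 180-image subset).
train_paths, val_paths, train_y, val_y = train_test_split(
    paths, labels, test_size=0.30, stratify=labels, random_state=42
)
```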
Image cleaning and standardization
Image cleaning and standardization are essential preprocessing steps in preparing art datasets for ML models. Image cleaning involves removing inconsistencies such as noise, distortions, and artifacts that may interfere with model training.
Preprocessing
Noise in an image can be defined as unwanted variations in pixel intensity. Gaussian smoothing, a common method for noise reduction, applies a convolutional filter to smooth pixel intensity values:
$$G(x,y)=\frac{1}{2\pi \sigma ^{2}}\,e^{-\frac{x^{2}+y^{2}}{2\sigma ^{2}}}$$
(1)
where \(\:G(x,y)\) is the Gaussian kernel, \(\:x\) and \(\:y\) are the spatial coordinates, \(\:\sigma\:\) is the standard deviation of the Gaussian distribution.
The Gaussian kernel is convolved with the image \(\:I(x,y)\) to produce the smoothed image \(\:{I}_{s}(x,y)\):
$$\:{I}_{s}(x,y)=\sum\:_{i=-k}^{k}\sum\:_{j=-k}^{k}G(i,j)\cdot\:I(x+i,y+j)$$
(2)
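As a minimal sketch, Gaussian smoothing can be applied with OpenCV; the kernel size and \(\:\sigma\:\) below are illustrative choices rather than values used in the study:

```python
import cv2

img = cv2.imread("artwork.jpg")  # I(x, y)
# Convolve with a 5x5 Gaussian kernel, sigma = 1.0 (Eqs. 1-2);
# both hyperparameters are illustrative, not values from the study.
smoothed = cv2.GaussianBlur(img, (5, 5), sigmaX=1.0)
```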
Artifacts, such as compression errors or watermark remnants, can distort image analysis. Morphological operations, such as opening (\(\:\circ\:\)) and closing (•), are used for artifact removal:
$$\:{I}_{cleaned}=(I\circ\:B)\bullet\:B$$
(3)
where \(\:I\) is the image, \(\:B\) is the structuring element, and \(\:\circ\:\) and • represent opening and closing operations.
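A minimal sketch of this cleaning step with OpenCV, assuming a \(\:3\times\:3\) structuring element (the element's shape and size are illustrative):

```python
import cv2
import numpy as np

img = cv2.imread("artwork.jpg", cv2.IMREAD_GRAYSCALE)
B = np.ones((3, 3), np.uint8)  # structuring element B (size is illustrative)
opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, B)        # I ∘ B
cleaned = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, B)   # (I ∘ B) • B, Eq. (3)
```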
Image standardization
Standardization ensures that all images are uniform in size, resolution, color space, and pixel intensity distribution. To standardize dimensions, images are resized to a fixed size, such as \(\:256\times\:256\) pixels. This can be done using interpolation methods:
$$\:{I}_{resized}(x{\prime\:},y{\prime\:})=I\left(\frac{x{\prime\:}}{{s}_{x}},\frac{y{\prime\:}}{{s}_{y}}\right)$$
(4)
Where \(\:(x{\prime\:},y{\prime\:})\) are the new coordinates, \(\:{s}_{x}\) and \(\:{s}_{y}\) are the scaling factors for width and height, respectively.
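For example, resizing to the fixed \(\:256\times\:256\) target can be done with OpenCV; the interpolation method below is an assumption, as the study does not specify one:

```python
import cv2

img = cv2.imread("artwork.jpg")
# Resize to the fixed 256x256 target (Eq. 4); INTER_AREA is a reasonable
# interpolation choice when shrinking, though the study does not specify one.
resized = cv2.resize(img, (256, 256), interpolation=cv2.INTER_AREA)
```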
Depending on the model requirements, images are converted to a consistent color space, such as RGB or grayscale. The conversion uses the following linear transformation for grayscale:
$$\:{I}_{gray}=0.2989\cdot\:R+0.5870\cdot\:G+0.1140\cdot\:B$$
(5)
where \(\:R\), \(\:G\), and \(\:B\) are the red, green, and blue pixel intensities.
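A minimal sketch of Eq. (5) in NumPy (note that OpenCV loads images in BGR order, so the channels are reordered first):

```python
import cv2
import numpy as np

# OpenCV loads BGR, so convert to RGB before applying the R, G, B weights.
img = cv2.cvtColor(cv2.imread("artwork.jpg"), cv2.COLOR_BGR2RGB).astype(np.float64)
gray = img @ np.array([0.2989, 0.5870, 0.1140])  # Eq. (5), per-pixel weighted sum
```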
Normalization ensures consistent pixel value ranges, typically scaled to \(\:[0,1]\) or \(\:[-1,1]\). For scaling to \(\:[0,1]\):
$$\:{I}_{norm}(x,y)=\frac{I(x,y)-{I}_{min}}{{I}_{max}-{I}_{min}}$$
(6)
or mean-zero normalization (standardization):
$$\:{I}_{std}(x,y)=\frac{I(x,y)-\mu\:}{\sigma\:}$$
(7)
where \(\:\mu\:\) is the mean pixel intensity, \(\:\sigma\:\) is the standard deviation of pixel intensities.
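Both normalization schemes, Eqs. (6) and (7), reduce to a few lines of NumPy:

```python
import numpy as np

def min_max_normalize(img: np.ndarray) -> np.ndarray:
    """Eq. (6): scale pixel values to [0, 1]."""
    return (img - img.min()) / (img.max() - img.min())

def standardize(img: np.ndarray) -> np.ndarray:
    """Eq. (7): zero-mean, unit-variance normalization."""
    return (img - img.mean()) / img.std()
```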
To enrich the dataset and prevent overfitting, transformations such as rotation, flipping, and cropping are applied:
$$\:{I}_{aug}=T\left(I\right)$$
(8)
where \(\:T\) represents a transformation such as rotation, \(\:{I}_{rot}=I\cdot\:{R}_{\theta\:}\), with \(\:{R}_{\theta\:}\) the rotation matrix, or horizontal flipping, \(\:{I}_{flip}(x,y)=I(W-x,y)\), where \(\:W\) is the image width.
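As an illustrative sketch, rotation and horizontal flipping can be implemented with OpenCV (the 15° angle is an arbitrary example):

```python
import cv2

img = cv2.imread("artwork.jpg")
h, w = img.shape[:2]

# Rotation about the image center (Eq. 8 with T = R_theta); 15 degrees is arbitrary.
R_theta = cv2.getRotationMatrix2D((w / 2, h / 2), angle=15, scale=1.0)
rotated = cv2.warpAffine(img, R_theta, (w, h))

# Horizontal flip: I_flip(x, y) = I(W - x, y).
flipped = cv2.flip(img, 1)
```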
Feature extraction
Feature extraction is a critical step in image preprocessing, as it translates raw image data into a measurable, interpretable representation. Rather than relying on physical properties, art analysis draws on perceptual attributes such as color, texture, and style, which group paintings into coherent sets and make them amenable to analysis. These attributes are derived quantitatively to characterize the content and structure of an artwork.
Color features
Color is a fundamental attribute in art and is often used to distinguish between styles and movements. A color histogram represents the distribution of colors in an image by counting the number of pixels in each color bin.
For an image \(\:I(x,y)\) with \(\:N\) total pixels and \(\:K\) color bins:
$$H_{k}=\frac{1}{N}\sum _{x,y}\delta \left(B(I(x,y)),k\right)$$
(9)
where \(\:{H}_{k}\) is the normalized count of pixels in bin \(\:k\), \(\:B(I(x,y))\) is the bin index of the pixel at \(\:(x,y)\), and \(\:\delta\:\) is the Kronecker delta, equal to \(\:1\) if \(\:B(I(x,y))=k\) and \(\:0\) otherwise.
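A minimal sketch of Eq. (9) using OpenCV histograms; the bin count \(\:K\) is an illustrative choice:

```python
import cv2
import numpy as np

img = cv2.imread("artwork.jpg")
K = 32  # bins per channel (illustrative)

# Eq. (9): per-channel histograms normalized by the pixel count N.
hist = np.concatenate(
    [cv2.calcHist([img], [c], None, [K], [0, 256]).ravel() for c in range(3)]
)
hist /= img.shape[0] * img.shape[1]
```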
Color moments capture statistical properties (mean, standard deviation and skewness) of color distributions, which are particularly useful for characterizing complex color patterns.
$$\:{\mu\:}_{c}=\frac{1}{N}\sum\:_{i=1}^{N}{p}_{c}\left(i\right)$$
(10)
$$\:{\sigma\:}_{c}=\sqrt{\frac{1}{N}\sum\:_{i=1}^{N}{\left({p}_{c}\left(i\right)-{\mu\:}_{c}\right)}^{2}}$$
(11)
$$\gamma _{c}=\frac{1}{N}\sum _{i=1}^{N}\left(\frac{p_{c}(i)-\mu _{c}}{\sigma _{c}}\right)^{3}$$
(12)
where \(\:{p}_{c}\left(i\right)\) is the intensity of the color channel \(\:c\) at pixel \(\:i\).
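Eqs. (10)–(12) translate directly into NumPy:

```python
import numpy as np

def color_moments(img: np.ndarray) -> np.ndarray:
    """Mean, standard deviation, and skewness per channel (Eqs. 10-12)."""
    feats = []
    for c in range(img.shape[2]):
        p = img[:, :, c].astype(np.float64).ravel()
        mu = p.mean()                               # Eq. (10)
        sigma = p.std()                             # Eq. (11)
        gamma = np.mean(((p - mu) / sigma) ** 3)    # Eq. (12)
        feats.extend([mu, sigma, gamma])
    return np.array(feats)
```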
Texture features
Texture describes the spatial arrangement of pixel intensities, capturing patterns such as smoothness, roughness, or periodicity.
Gray-level co-occurrence matrix (GLCM)
GLCM quantifies texture by calculating how often pairs of pixels with specific intensities occur at a given spatial relationship.
$$P(i,j,d,\theta )=\frac{\#\{(({x}_{1},{y}_{1}),({x}_{2},{y}_{2}))\mid I({x}_{1},{y}_{1})=i,\,I({x}_{2},{y}_{2})=j\}}{N}$$
(13)
Where \(\:P(i,j,d,\theta\:)\): Probability of intensity \(\:i\) co-occurring with \(\:j\) at distance \(\:d\) and angle \(\:\theta\:\), \(\:N\): Total number of pixel pairs.
From the GLCM, statistical texture features (contrast, energy, and homogeneity) can be computed:
$$\:Contrast=\sum\:_{i,j}{(i-j)}^{2}P(i,j)$$
(14)
$$\:Energy=\sum\:_{i,j}{P(i,j)}^{2}$$
(15)
$$\:Homogeneity=\sum\:_{i,j}\frac{P(i,j)}{1+\left|i-j\right|}$$
(16)
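A minimal sketch using scikit-image, assuming \(\:d=1\) and \(\:\theta\:=0\) (illustrative choices); note that scikit-image's "energy" property is the square root of Eq. (15), so "ASM" is used instead:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # scikit-image >= 0.19

gray = (255 * np.random.rand(256, 256)).astype(np.uint8)  # stand-in grayscale image

# Eq. (13): normalized co-occurrence matrix at d = 1, theta = 0.
glcm = graycomatrix(gray, distances=[1], angles=[0], levels=256,
                    symmetric=True, normed=True)

contrast = graycoprops(glcm, "contrast")[0, 0]        # Eq. (14)
energy = graycoprops(glcm, "ASM")[0, 0]               # Eq. (15): sum of P(i,j)^2
homogeneity = graycoprops(glcm, "homogeneity")[0, 0]  # Eq. (16)
```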
Wavelet transform
Wavelet decomposition captures texture by analyzing an image at multiple resolutions. The wavelet transform of an image \(\:I(x,y)\) is given by:
$$W(u,v,s)=\iint I(x,y)\,\psi ^{*}\left(\frac{x-u}{s},\frac{y-v}{s}\right)\,dx\,dy$$
(17)
where \(\:W(u,v,s)\): Wavelet coefficient at position \(\:(u,v)\) and scale \(\:s\), \(\:\psi\:\): Wavelet function.
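As a sketch, a discrete analogue of Eq. (17) is available through PyWavelets; the Haar wavelet and the sub-band energy descriptors below are illustrative choices:

```python
import numpy as np
import pywt  # PyWavelets

gray = np.random.rand(256, 256)  # stand-in grayscale image

# Single-level 2-D discrete wavelet decomposition, a discrete analogue of Eq. (17);
# the Haar wavelet is an illustrative choice.
cA, (cH, cV, cD) = pywt.dwt2(gray, "haar")

# Simple texture descriptors: mean energy of each detail sub-band.
texture_feats = [float(np.mean(np.square(c))) for c in (cH, cV, cD)]
```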
Style features
Style features capture artworks’ high-level aesthetic and compositional attributes, often relying on deep learning techniques. Neural style transfer (NST) involves matching the style of one image to another using a loss function that minimizes the difference in style representations:
$$L_{style}=\sum _{l}w_{l}\left\| G_{l}^{target}-G_{l}^{generated}\right\| ^{2}$$
(18)
where \(\:{G}_{l}^{target}\) and \(\:{G}_{l}^{generated}\) are Gram matrices for the target and generated images, respectively.
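A minimal PyTorch sketch of Eq. (18); dividing the Gram matrix by the feature-map size is a common normalization convention, not a detail specified here:

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a (C, H, W) feature map; dividing by C*H*W is a
    common normalization convention, not specified in the text."""
    c, h, w = feat.shape
    f = feat.view(c, h * w)
    return (f @ f.t()) / (c * h * w)

def style_loss(target_feats, generated_feats, weights):
    """Eq. (18): weighted squared distance between Gram matrices per layer l."""
    return sum(
        w * torch.sum((gram_matrix(t) - gram_matrix(g)) ** 2)
        for w, t, g in zip(weights, target_feats, generated_feats)
    )
```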
Feature extraction in art analysis thus combines a range of quantitative procedures. Color features describe palettes and their distributions, texture features characterize the patterns and surfaces within an image, and style features capture the global perception and concept of the entire work. By coalescing these methods, ML models can encompass all relevant aspects of artworks for classification, clustering, and style transfer, while improving both accuracy and interpretability.
Supervised proposed modified CNN
Convolutional Neural Networks (CNNs) are foundational for machine learning in image-based tasks, making them highly suitable for art analysis. For art classification, CNNs are often extended with modifications to handle challenges like intricate textures, overlapping styles, and high-dimensional datasets. These modifications enable their application in supervised tasks such as classifying artworks by style, artist, or movement.
The proposed Modified CNN takes input images of fixed dimensions \(\:H\times\:W\times\:C\) (e.g., \(\:256\times\:256\times\:3\) for RGB images) and processes them through a series of layers designed for feature extraction and classification. The convolutional layers extract feature maps using kernels (filters) that slide over the image, calculated as:
$$F[i,j,k]=\sum _{m=1}^{M}\sum _{n=1}^{N}\sum _{c=1}^{C}W[m,n,c,k]\cdot I[i+m,j+n,c]+b_{k}$$
(19)
where \(\:F[i,j,k]\) is the feature map at position \(\:(i,j)\) for filter \(\:k\), \(\:W[m,n,c,k]\) are the filter weights, \(\:I[i+m,j+n,c]\) are the input pixel values, and \(\:{b}_{k}\) is the bias term for filter \(\:k\).
Batch normalization is employed to stabilize training by normalizing feature maps:
$$\:\widehat{F}=\frac{F-\mu\:}{\sigma\:}\times\:\gamma\:+\beta\:$$
(20)
where \(\:\mu\:\) and \(\:\sigma\:\) are the mean and standard deviation, and \(\:\gamma\:\), \(\:\beta\:\) are trainable parameters.
Dropout is used to regularize the network by randomly deactivating neurons during training:
$$\:{F}_{dropout}=F\cdot\:Mask\left(p\right)$$
(21)
where \(\:Mask\left(p\right)\) is a binary mask with a probability \(\:p\) of retaining a neuron.
The final layer maps the extracted features to class probabilities using the softmax activation function:
$$\:P(y=c\mid\:x)=\frac{exp\left({z}_{c}\right)}{\sum\:_{j=1}^{K}exp\left({z}_{j}\right)}$$
(22)
where \(\:{z}_{c}\) is the output score for class \(\:c\), and \(\:K\) is the total number of classes. Model training minimizes the cross-entropy loss function:
$$\:L=-\frac{1}{N}\sum\:_{i=1}^{N}\sum\:_{c=1}^{K}{y}_{i,c}log\:{\widehat{y}}_{i,c}$$
(23)
where \(\:{y}_{i,c}\) is the true label, and \(\:{\widehat{y}}_{i,c}\) is the predicted probability.
The Modified CNN incorporates these elements to extract high-level features and perform effective classification, leveraging its architecture for tasks specific to automated art curation.
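A minimal PyTorch sketch of the described building blocks; the layer counts, filter sizes, and dropout rate are illustrative assumptions, as the section does not specify them:

```python
import torch
import torch.nn as nn

class ModifiedCNN(nn.Module):
    """Sketch of the described building blocks: convolution (Eq. 19),
    batch normalization (Eq. 20), dropout (Eq. 21), and a softmax
    output (Eqs. 22-23, applied via CrossEntropyLoss)."""

    def __init__(self, num_classes: int, p_drop: float = 0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # Eq. (19)
            nn.BatchNorm2d(32),                           # Eq. (20)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p_drop),                           # Eq. (21)
            nn.Linear(64 * 64 * 64, num_classes),         # for 256x256x3 inputs
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))          # raw class scores z_c

model = ModifiedCNN(num_classes=3)
# CrossEntropyLoss applies the softmax of Eq. (22) and the loss of Eq. (23).
criterion = nn.CrossEntropyLoss()
```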
Performance metrics
For the supervised Modified CNN, performance metrics are critical for quantifying the effectiveness of the model in classification tasks. They provide insights into the model’s predictive capabilities and its ability to generalize to unseen data. This section outlines the key metrics used to evaluate the Modified CNN, including Accuracy, Precision, Recall, F1-score, and Cross-Entropy Loss. Accuracy measures the proportion of correctly classified samples out of the total samples, offering an overall assessment of the model’s performance. Precision evaluates the proportion of true positive predictions among all positive predictions, making it particularly useful in scenarios with imbalanced datasets. It highlights the reliability of positive predictions made by the model. Recall quantifies the model’s ability to identify all actual positive samples, ensuring that the model performs well in recognizing relevant instances. The F1-score combines Precision and Recall into a single metric, providing a balanced view of performance, especially when dealing with imbalanced datasets. It represents the harmonic mean of Precision and Recall. The Modified CNN uses cross-entropy loss as the objective function to optimize its classification performance. By penalizing incorrect predictions proportionally to their confidence, it ensures the model minimizes errors during training. These metrics provide a comprehensive evaluation of the Modified CNN’s performance, enabling the identification of strengths and weaknesses. They are instrumental in refining the model to achieve optimal results in the classification of art styles.
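As an illustration, these metrics can be computed with scikit-learn (the label arrays below are placeholders):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder arrays standing in for validation labels and model predictions.
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 0, 2]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
```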