
Applying deep learning for style transfer in digital art: enhancing creative expression through neural networks

This study’s methodology aims to develop a sound NST process encompassing the basic phases of data preprocessing, model specification, training, and evaluation. The dataset used here is MS-COCO, a large and versatile image collection widely employed in computer vision. For compatibility with the neural network, all images are preprocessed: they are resized, normalized, and standardized, producing style and content inputs of uniform quality and dimensions.

In the present model architecture, the layered structure of deep convolutional neural networks allows the model to derive hierarchical features, which is critical for separating the content and style aspects of an image. Techniques such as Gram matrices and adaptive instance normalization (AdaIN) are used within the model to capture and apply artistic styles efficiently. Figure 1 illustrates the model architecture, showing the interaction between the CNN layers and the style adaptation processes.

Fig. 1: Model architecture for NST.

Dataset

A sufficiently large and diverse data pool is essential for training the deeper network layers. This work uses MS-COCO (Microsoft Common Objects in Context), a popular dataset in computer vision.

Description of the MS-COCO dataset

MS-COCO is well suited to large-scale object detection, segmentation, and captioning tasks, as it contains a large number of images of common objects placed in varied, everyday settings. The dataset contains over 330,000 images spanning a wide range of common object categories, which ensures a balanced and realistic representation of content for the model.

All images in MS-COCO are annotated with segmentation masks, bounding boxes, and captions, which encourages the use of contextual information. Table 2 summarizes the key characteristics of the MS-COCO dataset:

Table 2 Summary of MS-COCO dataset characteristics.

This analysis shows that the complexity and variety of MS-COCO make it well suited for NST training, allowing the model to learn a rich range of structures and content. The dataset’s annotations also help verify that content remains consistent when a style is applied, so the model generalizes well across different cases and contexts.

Dataset preprocessing for style transfer

Pre-processing the MS-COCO dataset is essential for standardization, better model performance, and more reliable evaluation of style transfer. The preprocessing pipeline consists of resizing, normalization, and data augmentation, all of which prepare the images to be fed into the neural network. Each image in the dataset is reshaped to a standard size \(\:h\times\:w\) of 256 × 256 pixels. This uniform size keeps instances consistent across batches and keeps the computational complexity low. For an image \(\:I\) with original dimensions \(\:H\times\:W\), resizing can be expressed as:

$$\:{I}_{resized}=Resize(I,h,w)$$

(1)

To standardize pixel values, each image is normalized so that pixel intensities fall within a consistent range, typically − 1 to 1 or 0 to 1, depending on the model requirements. For an RGB image with pixel values in the range [0, 255], the normalization for each pixel \(\:p\) in channel \(\:c\) is calculated as:

$$\:p{\prime\:}=\frac{p-{\mu\:}_{c}}{{\sigma\:}_{c}}$$

(2)

where \(\:{\mu\:}_{c}\) and \(\:{\sigma\:}_{c}\)​ are the mean and standard deviation of the channel \(\:c\), calculated across the training set. This normalization enhances training stability and helps the model converge faster.
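As a concrete illustration, the sketch below shows how the resizing of Eq. (1) and the per-channel standardization of Eq. (2) might be implemented. The PyTorch/torchvision stack and the ImageNet channel statistics are assumptions for the example, not the authors’ stated implementation (the paper computes channel statistics over the training set).

```python
# Minimal preprocessing sketch (assumed PyTorch/torchvision stack).
from PIL import Image
from torchvision import transforms

# Assumed channel statistics (ImageNet values used here as a stand-in
# for the dataset-specific means and standard deviations).
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),             # I_resized = Resize(I, h, w), Eq. (1)
    transforms.ToTensor(),                     # scales pixels to [0, 1]
    transforms.Normalize(mean=MEAN, std=STD),  # p' = (p - mu_c) / sigma_c, Eq. (2)
])

def load_image(path):
    """Load an RGB image and return a 1 x 3 x 256 x 256 tensor."""
    img = Image.open(path).convert("RGB")
    return preprocess(img).unsqueeze(0)
```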

For training, images are grouped into batches of size \(\:n\), and each batch includes both content and style images. This batching facilitates efficient processing and allows the model to update weights based on content and style loss. Given a batch \(\:B=\{{I}_{1},{I}_{2},\dots\:,{I}_{n}\}\), the batch loss \(\:{L}_{B}\)​ is computed as:

$$\:{L}_{B}=\frac{1}{n}\sum\:_{i=1}^{n}\left(\alpha\:{L}_{content}\left({I}_{i}\right)+\beta\:{L}_{style}\left({I}_{i}\right)\right)$$

(3)

where \(\:\alpha\:\) and \(\:\beta\:\) are weighting factors for content and style losses, respectively.
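A minimal sketch of the batch objective in Eq. (3), assuming per-image content and style loss callables (hypothetical helpers, not defined in the paper):

```python
# Hedged sketch of Eq. (3): the weighted content/style loss averaged over a batch.
import torch

def batch_loss(images, content_loss, style_loss, alpha=1.0, beta=1e3):
    # content_loss and style_loss are assumed to return scalar tensors per image.
    losses = [alpha * content_loss(img) + beta * style_loss(img) for img in images]
    return torch.stack(losses).mean()
```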

Through these preprocessing steps, the MS-COCO dataset is standardized and optimized as input for NST, avoiding unnecessary computation and promoting generalized learning.

Model architecture

The NST model architecture is based on CNNs primarily because they achieve the level of abstraction required for style and content representation while preserving spatial relationships. CNNs are well suited to this task because they extract features from images hierarchically, learning the textures, colors, and structural patterns associated with artistic styles.

Deep convolutional neural networks

The deep CNN proposed for NST consists of convolutional layers that learn and combine content and style features. These features are separated at different network depths, which enables high-quality stylization by preserving content structures and patterns while applying the target style. The architecture extracts features from the content and style images through convolution and pooling layers with non-linear activation functions. A convolutional layer applies a filter, or kernel \(\:K\), to an input feature map to produce a feature map that accentuates specific image features. Given an image \(\:I\) with dimensions \(\:H\times\:W\), the convolution operation for a filter \(\:K\) of size \(\:m\times\:m\) at location \(\:(i,j)\) is defined as:

$$\:F(i,j)=\sum\limits_{p=0}^{m-1}\sum\limits_{q=0}^{m-1}K\left(p,q\right)\cdot\:I(i+p,j+q)$$

(4)

where \(\:F(i,j)\) is the output feature map and \(\:K(p,q)\) are the filter weights. This operation lets the network detect hierarchical features: early layers identify edges, while deeper layers identify shapes and textures. After each convolutional layer, a non-linear activation function, such as the Rectified Linear Unit (ReLU), is applied. ReLU adds non-linearity to the network, allowing the architecture to model the features important for style transfer. For any feature \(\:x\), the ReLU function is defined as:

$$\:f\left(x\right)=max(0,x)$$

(5)

This activation function enables the model to learn robust representations of both content-driven structural aspects and stylistic features. Because the feature maps can become computationally expensive, pooling layers are inserted after selected convolutional layers to reduce their size while retaining the important details. Max pooling, in particular, downsamples the feature map by selecting the maximum value within a window \(\:w\times\:w\), defined as:

$$\:P(i,j)=\max\limits_{p,q\in\:w}F(i+p,j+q)$$

(6)

where \(\:P(i,j)\) is the resulting pooled feature map. Pooling reduces spatial dimensions, making the model more efficient and resilient to minor changes in the input. The proposed architecture extracts content and style features from different layers to represent distinct image characteristics. Higher layers in the CNN capture structural information critical for content representation. For a content image \(\:C\), the content feature map at a specific layer \(\:l\) is denoted \(\:{F}_{l}^{C}\)​, and the content loss \(\:{L}_{content}\) between content and the generated image \(\:G\) is computed as:

$$\:{L}_{content}(C,G)=\frac{1}{2}\sum\:_{i,j}{\left({F}_{l}^{C}(i,j)-{F}_{l}^{G}(i,j)\right)}^{2}$$

(7)

where \(\:{F}_{l}^{G}\)​ is the feature map from the same layer for the generated image, ensuring structural similarity between \(\:C\) and \(\:G\).
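The sketch below illustrates Eqs. (4)–(7) with a small, assumed PyTorch feature extractor (convolution, ReLU, max pooling) and the content loss as a squared feature difference. The layer widths and kernel sizes are illustrative assumptions; the paper’s actual network layout may differ.

```python
# Illustrative sketch, not the paper's exact network.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """A small convolution -> ReLU -> max-pooling stack, Eqs. (4)-(6)."""
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),   # convolution, Eq. (4)
            nn.ReLU(inplace=True),                        # ReLU, Eq. (5)
            nn.MaxPool2d(kernel_size=2),                  # max pooling, Eq. (6)
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

def content_loss(feat_c, feat_g):
    """L_content = 1/2 * sum_ij (F_l^C - F_l^G)^2, Eq. (7)."""
    return 0.5 * torch.sum((feat_c - feat_g) ** 2)
```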

Correlations between feature maps at lower layers are used for style representation to capture texture and color patterns. This is achieved through the Gram matrix \(\:{G}_{l}^{S}\)​ of a style image \(\:S\), defined as:

$$\:{G}_{l}^{S}(i,j)=\sum\:_{k}{F}_{l}^{S}(i,k)\cdot\:{F}_{l}^{S}(j,k)$$

(8)

The Gram matrix calculates the correlations between different feature channels \(\:i\) and \(\:j\), capturing the essence of the style. The style loss \(\:{L}_{style}\)​ between the style and generated images is given by:

$$\:{L}_{style}(S,G)=\frac{1}{4{N}^{2}{M}^{2}}\sum\:_{i,j}{\left({G}_{l}^{S}(i,j)-{G}_{l}^{G}(i,j)\right)}^{2}$$

(9)

where \(\:N\) and \(\:M\) represent the number of feature maps and their dimensions, respectively. The total loss \(\:{L}_{total}\)​ combines content and style losses, allowing the generated image to balance content fidelity and style transfer. It is expressed as:

$$\:{L}_{total}=\alpha\:\cdot\:{L}_{content}+\beta\:\cdot\:{L}_{style}$$

(10)

where \(\:\alpha\:\) and \(\:\beta\:\) are weighting factors that adjust the influence of content and style. By tuning \(\:\alpha\:\) and \(\:\beta\:\), the model can generate outputs with varying style intensity and content retention degrees.
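Under the same assumptions as the previous sketch, the Gram matrix of Eq. (8), the style loss of Eq. (9), and the total loss of Eq. (10) could be written as follows; tensor shapes follow the usual PyTorch convention (batch, channels, height, width).

```python
import torch

def gram_matrix(features):
    """G_l(i, j) = sum_k F_l(i, k) * F_l(j, k), Eq. (8)."""
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)             # N feature maps of M elements each
    return torch.bmm(flat, flat.transpose(1, 2))  # channel-by-channel correlations

def style_loss(feat_s, feat_g):
    """Eq. (9): squared Gram-matrix difference, scaled by 1 / (4 N^2 M^2)."""
    b, c, h, w = feat_s.shape
    n, m = c, h * w
    g_s, g_g = gram_matrix(feat_s), gram_matrix(feat_g)
    return torch.sum((g_s - g_g) ** 2) / (4.0 * n ** 2 * m ** 2)

def total_loss(l_content, l_style, alpha=1.0, beta=1e3):
    """Eq. (10): weighted combination of content and style losses."""
    return alpha * l_content + beta * l_style
```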

The proposed deep CNN architecture, with stacked feature extraction and tailored loss functions, achieves high-quality style transfer that keeps the essential content attributes intact while reproducing elaborate style patterns. The convolutional, activation, and pooling layers together create the right balance of content and style in the final image.

Style representation techniques

In NST, style representation is central to capturing the feel of the particular artistic style being transferred. The proposed architecture integrates several techniques to represent and transfer these stylistic elements effectively. Two main methods are used to encode style, Gram matrices and AdaIN, each of which captures and applies diverse styles with different benefits.

Gram matrix for style representation: The Gram matrix is one of the earliest and most widely used techniques for style representation in NST. It captures correlations between different feature maps within a convolutional layer, providing a holistic view of the style by considering textures and spatial patterns.

For a given style image \(\:S\) and a convolutional layer \(\:l\) with feature maps \(\:{F}_{l}^{S}\), the Gram matrix \(\:{G}_{l}^{S}\)​ is computed by taking the inner product between each pair of feature maps \(\:i\) and \(\:j\). Mathematically, the Gram matrix \(\:{G}_{l}^{S}(i,j)\) is defined by Eq. (8), where \(\:{F}_{l}^{S}(i,k)\) represents the activation of the \(\:{i}^{th}\) feature map at position \(\:k\) in layer \(\:l\). This approach captures the degree of similarity between feature channels, encapsulating the essence of the style independent of the spatial arrangement of pixels.

The style loss \(\:{L}_{style}\)​ is then computed by comparing the Gram matrices of the style image \(\:S\) and the generated image \(\:G\) at each layer \(\:l\), by using Eq. (9), where \(\:{N}_{l}\)​ is the number of feature maps and \(\:{M}_{l}\)​ is the number of elements in each feature map for layer \(\:l\). This loss function penalizes deviations between the style patterns in \(\:S\) and \(\:G\), guiding the model to replicate the style.

Adaptive instance normalization (AdaIN): AdaIN is a more recent technique that enables a single model to transfer arbitrary styles without training a separate model per style. AdaIN replaces the feature statistics of the content image with those of the style image: the content features are standardized and then rescaled and shifted to match the mean and variance of the style features.

For a content feature map \(\:{F}^{C}\) and a style feature map \(\:{F}^{S}\), AdaIN adjusts the mean \(\:\mu\:\) and standard deviation \(\:\sigma\:\) of the content feature map to match those of the style feature map. The AdaIN transformation is defined as:

$$\:AdaIN({F}^{C},{F}^{S})=\sigma\:\left({F}^{S}\right)\cdot\:\frac{{F}^{C}-\mu\:\left({F}^{C}\right)}{\sigma\:\left({F}^{C}\right)}+\mu\:\left({F}^{S}\right)$$

(11)

where \(\:\mu\:\left({F}^{C}\right)\) and \(\:\sigma\:\left({F}^{C}\right)\) are the mean and standard deviation of the content features, and \(\:\mu\:\left({F}^{S}\right)\) and \(\:\sigma\:\left({F}^{S}\right)\) are those of the style features. This normalization process directly aligns the feature statistics of the content image with those of the style, allowing the style to be transferred in real time.
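A minimal sketch of the AdaIN transformation in Eq. (11), assuming PyTorch feature maps of shape (batch, channels, height, width); this is an illustration, not the paper’s implementation.

```python
import torch

def adain(feat_c, feat_s, eps=1e-5):
    """AdaIN(F_C, F_S) = sigma(F_S) * (F_C - mu(F_C)) / sigma(F_C) + mu(F_S), Eq. (11)."""
    mu_c = feat_c.mean(dim=(2, 3), keepdim=True)          # per-channel content mean
    std_c = feat_c.std(dim=(2, 3), keepdim=True) + eps    # per-channel content std
    mu_s = feat_s.mean(dim=(2, 3), keepdim=True)          # per-channel style mean
    std_s = feat_s.std(dim=(2, 3), keepdim=True)          # per-channel style std
    return std_s * (feat_c - mu_c) / std_c + mu_s
```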

AdaIN is especially useful in arbitrary style transfer because the model requires no additional training when switching from one style to another. The method is computationally cheaper than iterative optimization and still produces visually pleasing style transfers.

Hybrid techniques and advanced style representations: To maintain higher style fidelity, several models propose hybrid solutions that use Gram matrices in conjunction with AdaIN or introduce other methods, such as attention maps, to localize important style features in different image regions. Attention-based style transfer can focus on the parts of the content and style images that have the greatest effect on the style, bringing texture and structure into better alignment. This hybrid approach can produce richer and more diverse style transfers, particularly for complex and intricate styles.

The proposed model uses these style representation methods to attain high-quality style transfer that preserves texture, color, and artistic appeal. Gram matrices capture style patterns well across the multilayered architecture, while AdaIN, being accurate and fast, is advantageous when many styles are used or in online systems. Combined, these techniques constitute the core of the model’s style representation approach, ensuring sophisticated and versatile style transformation.

Training process

The training process in NST optimizes the model to balance content and style through carefully defined loss functions and selected hyperparameters.

Loss functions for style and content

The content loss measures the disparity between the content image \(\:C\) and the generated image \(\:G\) at a high-level feature map while retaining the content layout, as in Eq. (7). The style loss captures stylistic patterns via the Gram matrices of feature maps at different levels and applies the textures and colors of the style image \(\:S\), as in Eq. (9). The total loss combines content and style losses, weighted by \(\:\alpha\:\) and \(\:\beta\:\), to balance content preservation and style application, as in Eq. (10).

Hyperparameter selection

Key hyperparameters include \(\:\alpha\:\) and \(\:\beta\:\) to control content-style balance, a learning rate (typically 0.001–0.01) for stable training, and iterations to refine detail. Adjusting \(\:\alpha\:\) and \(\:\beta\:\) allows fine-tuning of the content vs. style emphasis, ensuring high-quality results that align with desired stylistic effects.
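As an illustration, the following hedged sketch shows an image-optimization loop using these hyperparameters (Adam, a learning rate within the stated 0.001–0.01 range, and the α/β weights), reusing the loss helpers sketched earlier. It optimizes the generated image directly, as in Gatys-style NST; a feed-forward variant would instead update network weights.

```python
import torch

def stylize(content_img, style_img, extractor, n_iters=500,
            lr=0.01, alpha=1.0, beta=1e3):
    # Optimize the pixels of the generated image, initialized from the content image.
    generated = content_img.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([generated], lr=lr)

    feat_c = extractor(content_img).detach()   # fixed content target
    feat_s = extractor(style_img).detach()     # fixed style target

    for _ in range(n_iters):
        optimizer.zero_grad()
        feat_g = extractor(generated)
        loss = total_loss(content_loss(feat_c, feat_g),
                          style_loss(feat_s, feat_g),
                          alpha=alpha, beta=beta)   # Eq. (10)
        loss.backward()
        optimizer.step()
    return generated.detach()
```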

Evaluation metrics

Judging the quality of NST involves two broad aspects: content preservation and style reproduction in the synthesized image. The following metrics provide a quantitative means of evaluating how effectively the model balances preserving the original content and mimicking the style: content loss, which measures how well the content of the input image is retained; style loss, which measures how faithfully the style is reproduced; SSIM, which measures structural similarity; and computational time.

Content loss

Content loss \(\:{L}_{content}\) measures the difference between the features of the generated image \(\:G\) and those of the original content image \(\:C\). Because the higher layers of the network carry the more abstract structural information, the content loss is evaluated on feature maps from these layers to guarantee that the primary structural features of the image are maintained. Content loss is given by Eq. (7), where \(\:{F}_{l}^{C}\) and \(\:{F}_{l}^{G}\)​ represent the feature maps at layer \(\:l\) for the content and generated images, respectively. A lower content loss indicates better structural retention in the stylized image.

Style loss

The style loss \(\:{L}_{style}\) measures how well the generated image captures the style of the style image \(\:S\). This is done by comparing Gram matrices, which encode how the feature maps in a given layer are correlated. The style loss is calculated using Eq. (9), where \(\:{G}_{l}^{S}\) and \(\:{G}_{l}^{G}\) are the Gram matrices of the style and generated images at layer \(\:l\), \(\:{N}_{l}\)​ is the number of feature maps, and \(\:{M}_{l}\) is the number of elements in each feature map. A lower style loss indicates better alignment with the stylistic patterns of the target style image.

Structural similarity index (SSIM)

SSIM is a perceptual quality index that compares image structure, luminance, and contrast. The generated image \(\:G\) is compared with the content image \(\:C\) using SSIM to determine how much of the content structure has been retained. SSIM is given by:

$$\:SSIM(C,G)=\frac{({2\mu\:}_{C}{\mu\:}_{G}+{c}_{1})(2{\sigma\:}_{CG}+{c}_{2})}{({\mu\:}_{C}^{2}+{\mu\:}_{G}^{2}+{c}_{1})({\sigma\:}_{C}^{2}+{\sigma\:}_{G}^{2}+{c}_{2})}$$

(12)

where \(\:{\mu\:}_{C}\) and \(\:{\mu\:}_{G}\) are the mean intensities, \(\:{\sigma\:}_{C}^{2}\) and \(\:{\sigma\:}_{G}^{2}\) are variances, and \(\:{\sigma\:}_{CG}\) is the covariance between \(\:C\) and \(\:G\). Constants \(\:{c}_{1}\)​ and \(\:{c}_{2}\)​ are used to stabilize the division. Higher SSIM values indicate better structural similarity between the original and generated images.
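The simplified sketch below evaluates Eq. (12) using global image statistics; production SSIM implementations (e.g., scikit-image’s) use local windows, so this is only an illustration of the formula. The constants assume images scaled to [0, 1].

```python
import numpy as np

def ssim_global(c, g, c1=1e-4, c2=9e-4):
    """Eq. (12) computed from global statistics of two [0, 1]-scaled arrays."""
    mu_c, mu_g = c.mean(), g.mean()
    var_c, var_g = c.var(), g.var()
    cov_cg = ((c - mu_c) * (g - mu_g)).mean()
    return ((2 * mu_c * mu_g + c1) * (2 * cov_cg + c2)) / \
           ((mu_c ** 2 + mu_g ** 2 + c1) * (var_c + var_g + c2))
```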

Computational efficiency

Computational efficiency is measured by the time to produce a stylized image and the model’s resource requirements (e.g., memory usage). Efficiency is often evaluated in terms of frames per second (FPS) for real-time applications, or total processing time \(\:{T}_{proc}\), which can be calculated as:

$$\:{T}_{proc}=\frac{Total\:Time\:Taken}{Number\:of\:Images\:Processed}$$

(13)

Improved efficiency is essential for making NST practical in real-time or resource-limited environments.
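A small timing sketch for Eq. (13), where `stylize_fn` is an assumed stylization callable; it reports average seconds per image and the corresponding FPS.

```python
import time

def measure_throughput(stylize_fn, images):
    start = time.perf_counter()
    for img in images:
        stylize_fn(img)
    t_proc = (time.perf_counter() - start) / len(images)  # Eq. (13)
    return t_proc, 1.0 / t_proc  # seconds per image, frames per second
```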
