Classification of Generative AI Models

Generative Adversarial Networks (GAN)

Generative adversarial networks, or GANs, were first introduced by Ian Goodfellow in 2014. The GAN is based on the minimax two-person zero-sum game, in which one player profits only when the other suffers an equal loss. The two players in a GAN are the generator and the discriminator. The generator's purpose is to trick the discriminator, while the discriminator's goal is to identify whether a sample comes from the true data distribution. The discriminator outputs the probability that the input sample is real: a probability close to one suggests that the sample is drawn from real-world data, while a probability close to zero suggests that the sample is fake. When the probability approaches one-half, the optimal solution is reached, because the discriminator can no longer distinguish fake samples from real ones.

Typically, the generator (G) and discriminator (D) are implemented as deep neural networks that act as latent function representations. In the GAN architecture, illustrated in Figure 6, G learns the data distribution from real samples and maps a latent space to a new space of generated samples using dense or convolutional layers, together with the corresponding probability distribution. The primary objective of the GAN is to ensure that this probability distribution closely resembles the distribution of the training samples. D receives input data, which can be either real data (x) from the training set or generated data produced by G, and uses dense or convolutional layers to output a probability (a scalar value) indicating whether the input is likely to come from the real data distribution.
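The division of labor between G and D can be sketched with toy numpy functions. The dimensions, weights, and batch size below are made-up placeholders for illustration, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny generator: maps a 2-D latent vector z to a 4-D "sample".
W_g = rng.normal(size=(2, 4))

def generator(z):
    return np.tanh(z @ W_g)  # generated sample, values in (-1, 1)

# Hypothetical tiny discriminator: maps a 4-D sample to a probability.
W_d = rng.normal(size=(4, 1))

def discriminator(x):
    return 1.0 / (1.0 + np.exp(-(x @ W_d)))  # sigmoid -> P(sample is real)

z = rng.normal(size=(8, 2))   # batch of latent noise
fake = generator(z)           # G maps noise to generated samples
p_real = discriminator(fake)  # D scores each generated sample
print(p_real.shape)           # (8, 1), each entry a probability
```

In training, D's weights would be updated to push these scores toward zero for generated samples, while G's weights would be updated to push them toward one.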


Figure 6. Typical structure of generative adversarial networks (GAN).

GAN training faces several challenges, including vanishing gradients, training instability, and poor sample diversity. These problems arise from the loss function used in GANs, which measures and minimizes the distance between the real data distribution (Pr) and the generated data distribution (Pg).

During training, the discriminator aims to minimize cross-entropy by differentiating between real and generated samples. The optimal discriminator (D) takes the form given below.


D(x)=Pr(x)/(Pr(x)+Pg(x))
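This formula can be checked numerically. As a small hypothetical example, take real data drawn from N(0, 1) and generated data from N(2, 1); the distributions and evaluation points below are chosen only for illustration:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Density of a normal distribution N(mu, sigma^2) at point x.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical 1-D setting: real data ~ N(0, 1), generated data ~ N(2, 1).
def d_star(x):
    p_r = gaussian_pdf(x, 0.0, 1.0)  # Pr(x)
    p_g = gaussian_pdf(x, 2.0, 1.0)  # Pg(x)
    return p_r / (p_r + p_g)         # optimal discriminator output

# Where Pr dominates, D*(x) is near 1; at the midpoint x = 1 the two
# densities are equal, so D*(x) = 1/2 and D cannot tell real from fake.
print(round(d_star(-1.0), 3), round(d_star(1.0), 3), round(d_star(3.0), 3))
```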

On the other hand, the generator (G) seeks to minimize a loss function of its own, which can be rewritten in terms of divergences between the generated and real distributions.

The loss function for the generator can be written as,

V(G)=KL(Pg||Pr)−2JSD(Pr||Pg)

where KL is the Kullback–Leibler divergence and JSD is the Jensen–Shannon divergence. Minimizing the JS divergence drives the generated samples to resemble real ones. However, if there is little or no overlap between Pr and Pg, the JS divergence saturates at a constant (log 2), so its gradient with respect to the generator vanishes.
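The saturation of the JS divergence under non-overlapping supports can be verified numerically. The sketch below uses small, made-up discrete distributions with disjoint support; the JSD stays pinned at log 2 no matter how Pg is reshaped:

```python
import numpy as np

def kl(p, q):
    # KL divergence for discrete distributions; 0 * log(0/q) is taken as 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Real distribution and two generated distributions over 4 outcomes,
# with supports disjoint from the real one.
p_r  = np.array([0.5, 0.5, 0.0, 0.0])
p_g1 = np.array([0.0, 0.0, 0.5, 0.5])
p_g2 = np.array([0.0, 0.0, 0.9, 0.1])

# With no overlap the JSD equals log 2 regardless of how Pg changes,
# so it provides no gradient signal to the generator.
print(jsd(p_r, p_g1), jsd(p_r, p_g2), np.log(2))
```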

Additionally, training GANs can be challenging because the feedback from the discriminator can be close to zero when it is trained optimally, slowing down convergence. Moreover, determining when the discriminator is properly trained is difficult since there is no indicator for it.

Another problem is the poor diversity of the generated samples. Rewriting the generator loss V(G) as above makes the cause visible: minimizing it means minimizing KL(Pg||Pr) while simultaneously maximizing the JSD, two contradictory objectives that destabilize the gradients. In addition, the asymmetry of the KL term penalizes the generator only lightly for dropping modes of the real distribution, which encourages safe but repetitive samples (mode collapse). Several new models have been introduced to address these limitations of the original GAN, including vanishing gradients, unstable training, and poor diversity, with the aim of enhancing stability and improving the quality of the generated outputs.

Conditional generative adversarial networks (CGANs) have emerged as a solution to enhance the control and convergence speed of GANs on complex or large-scale datasets. By incorporating conditional variables, such as category labels, textual descriptions, or specific generated targets, CGANs provide guidance to the data generation process. This allows for supervised learning, targeted generation, and the ability to generate images with specific categories or labels. Moreover, CGANs can utilize image features as conditions to generate corresponding word vectors, enabling effective cross-modal generation, as illustrated in Figure 7.
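Mechanically, the conditioning often amounts to concatenating an encoded label to the generator's noise vector (and, analogously, to the discriminator's input). A minimal sketch, assuming one-hot label encoding and made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, latent_dim = 3, 4  # hypothetical label and noise sizes

def one_hot(labels, n):
    out = np.zeros((len(labels), n))
    out[np.arange(len(labels)), labels] = 1.0
    return out

# In a CGAN, G receives the condition alongside its usual latent input.
z = rng.normal(size=(5, latent_dim))  # latent noise batch
y = np.array([0, 2, 1, 1, 0])         # hypothetical class labels
g_input = np.concatenate([z, one_hot(y, n_classes)], axis=1)

print(g_input.shape)  # (5, 7): 4 noise dims + 3 label dims
```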


Figure 7. Typical structure of conditional GAN.

Some of the GANs that incorporate this technique are the conditional generative adversarial network (CGAN) itself, CGAN with the Pix2Pix framework, the conditional tabular GAN (CTGAN), and text-conditioned GANs (TAC-GAN, TAGAN).

Wasserstein generative adversarial networks (WGANs) offer a novel approach to the challenges faced by traditional GANs. By introducing the Wasserstein distance as a metric, WGANs provide a more stable training process and better gradient flow. The discriminator in a WGAN, known as the "critic", assigns unbounded scores rather than probabilities; the gap between its expected scores on real and fake data estimates the Wasserstein distance between the two distributions, which replaces the Jensen–Shannon or Kullback–Leibler divergence used in other generative models. The critic is trained to maximize this estimate, while the generator is trained to minimize it, encouraging the generator to produce samples that closely resemble real data. Because the Wasserstein distance provides useful gradients even when the distributions do not overlap, WGANs mitigate mode collapse, where GANs fail to capture the full diversity of the data, and can learn the underlying distribution even for complex, high-dimensional datasets. This enables WGANs to produce more diverse and realistic outputs.
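The practical difference from the JS divergence can be seen in one dimension, where the Wasserstein-1 distance between equal-size empirical samples reduces to pairing the sorted samples. In the sketch below (with made-up sample data), the distance keeps growing as the fake distribution moves away from the real one, so it still supplies a training signal even when the supports no longer overlap:

```python
import numpy as np

def w1(a, b):
    # 1-D Wasserstein-1 distance between equal-size empirical samples:
    # the optimal transport plan in 1-D simply pairs the sorted samples.
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, size=1000)

# Unlike the JS divergence, which saturates once the supports stop
# overlapping, W1 grows with the gap, telling G which way to move.
for shift in (1.0, 5.0, 10.0):
    fake = rng.normal(loc=shift, size=1000)
    print(shift, round(w1(real, fake), 2))
```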

WGANs have found applications in various domains, such as image synthesis, text generation, and data augmentation. Their effectiveness in addressing mode collapse and providing a more reliable training process has made them a popular choice for researchers and practitioners working with generative models.

Deep convolutional generative adversarial networks (DCGANs) are a variant of GANs that leverage deep convolutional neural networks (CNNs) to enhance the quality of generated samples, particularly in the domain of image synthesis, and they have proven highly effective at generating realistic, high-resolution images. DCGANs utilize convolutional layers in both the generator and discriminator networks, allowing them to capture spatial dependencies and patterns in the data. DCGANs introduce several key design principles, including the use of convolutional and transposed convolutional layers, batch normalization, and ReLU activations in the generator with LeakyReLU in the discriminator. These principles contribute to the stability of the training process, mitigate issues like mode collapse, and allow for the generation of diverse and high-quality samples.
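The transposed-convolution arithmetic behind a DCGAN-style generator is easy to verify. Assuming the common kernel-4, stride-2, padding-1 configuration (a conventional implementation choice, introduced here only as an example), each layer doubles the spatial size:

```python
def tconv_out(size, kernel, stride, pad):
    # Output size of a transposed convolution (no output padding):
    # out = (in - 1) * stride - 2 * pad + kernel
    return (size - 1) * stride - 2 * pad + kernel

# Hypothetical DCGAN-style generator head: a 4x4 seed feature map is
# upsampled by stride-2 transposed convolutions with kernel 4, padding 1,
# doubling the spatial size at each layer.
size = 4
sizes = [size]
for _ in range(4):
    size = tconv_out(size, kernel=4, stride=2, pad=1)
    sizes.append(size)
print(sizes)  # [4, 8, 16, 32, 64]
```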

The benefits of DCGANs extend beyond image synthesis, with applications in areas such as image-to-image translation, style transfer, and data augmentation. The combination of deep convolutional architectures and adversarial training has propelled DCGANs as a go-to choice for generating visually appealing and realistic images in the field of deep learning.

Generative adversarial networks have revolutionized various domains of computer vision and machine learning, and they can be classified into categories based on their specific tasks and applications. Image-to-image translation GANs focus on translating images between domains, with subcategories such as CycleGAN, DiscoGAN, and DTN. Super-resolution GANs enhance the resolution of low-resolution images, including SRGAN and VSRResFeatGAN. Text-to-image GANs generate images from textual descriptions, exemplified by AttnGAN and StackGAN. Tabular data GANs generate synthetic tabular data, with examples like CTGAN and TGAN. Defense and security GANs address security-related applications, such as defense against adversarial attacks and steganography, including defense GANs and SSGAN. Style-based GANs capture and manipulate artistic styles, including StyleGAN and StyleCLIP. Other GAN types cover diverse applications, such as BigGAN for high-resolution image generation, ExGANs for variation generation, and SegAN for semantic segmentation. These categories demonstrate the versatility and advancement of GANs across domains, enabling tasks such as image translation, super-resolution, text-to-image synthesis, data generation, security applications, and style manipulation. GANs continue to drive innovation and push the boundaries of generative models in the field of artificial intelligence.