Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei, Ming Zhou
Abstract: Conventional Generative Adversarial Networks (GANs) for text generation tend to have issues of reward sparsity and mode collapse that affect the quality and diversity of generated samples. To address the issues, we propose a novel self-adversarial learning (SAL) paradigm for improving GANs’ performance in text generation. In contrast to standard GANs that use a binary classifier as its discriminator to predict whether a sample is real or generated, SAL employs a comparative discriminator which is a pairwise classifier for comparing the text quality between a pair of samples. During training, SAL rewards the generator when its currently generated sentence is found to be better than its previously generated samples. This self-improvement reward mechanism allows the model to receive credits more easily and avoid collapsing towards the limited number of real samples, which not only helps alleviate the reward sparsity issue but also reduces the risk of mode collapse. Experiments on text generation benchmark datasets show that our proposed approach substantially improves both the quality and the diversity, and yields more stable performance compared to the previous GANs for text generation.
Notes: Two issues of GANs in text generation: (i) reward sparsity, which arises because the discriminator tends to learn much faster than the generator and thus easily recognizes generated samples as fake; (ii) mode collapse, which arises from the intrinsic nature of GANs and leads adversarial models to learn only a limited number of patterns from the real samples.
This paper proposes self-adversarial learning (SAL), which employs a comparative discriminator (a pairwise classifier) to assess whether the currently generated sample is better than a previously generated one. In the earlier training stage, this self-improvement reward mechanism makes it easier for the generator to receive non-sparse rewards, because the quality of its generated samples is still far below that of the real data; in the later training stage, SAL prevents a sample from continually receiving a high reward, since further self-improvement on an already-popular mode becomes very difficult, which helps avoid collapsing toward the limited patterns of the real data.
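Below is a minimal sketch of the two components described above: a comparative (pairwise) discriminator and a self-improvement reward computed against the generator's own previous samples. The encoder architecture, the three-way class layout (better / worse / indistinguishable), and the reward formula P(better) - P(worse) are assumptions for illustration, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn


class MeanPoolEncoder(nn.Module):
    """Toy sentence encoder (assumption): embed tokens and mean-pool them.
    The paper would use a stronger text encoder; this keeps the sketch runnable."""

    def __init__(self, vocab_size: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> sentence vector (batch, hidden_dim)
        return self.embed(token_ids).mean(dim=1)


class ComparativeDiscriminator(nn.Module):
    """Pairwise classifier: given two sentences, predict whether the first is
    better than, worse than, or indistinguishable from the second."""

    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),  # classes: better / worse / indistinguishable
        )

    def forward(self, sent_a: torch.Tensor, sent_b: torch.Tensor) -> torch.Tensor:
        h_a = self.encoder(sent_a)
        h_b = self.encoder(sent_b)
        logits = self.classifier(torch.cat([h_a, h_b], dim=-1))
        return logits.softmax(dim=-1)  # P(a > b), P(a < b), P(a ~ b)


def self_improvement_reward(disc: ComparativeDiscriminator,
                            current_sample: torch.Tensor,
                            previous_sample: torch.Tensor) -> torch.Tensor:
    """Reward the generator when its current sample is judged better than a
    sample it generated earlier. One simple choice (an assumption):
    reward = P(better) - P(worse)."""
    probs = disc(current_sample, previous_sample)
    return probs[..., 0] - probs[..., 1]


if __name__ == "__main__":
    # Tiny usage example with random token ids standing in for generated text.
    vocab_size, hidden_dim, seq_len, batch = 1000, 64, 20, 4
    disc = ComparativeDiscriminator(MeanPoolEncoder(vocab_size, hidden_dim), hidden_dim)
    current = torch.randint(0, vocab_size, (batch, seq_len))
    previous = torch.randint(0, vocab_size, (batch, seq_len))
    print(self_improvement_reward(disc, current, previous))  # per-sample reward in [-1, 1]
```

Because the reference point is the generator's own earlier output rather than real data, early samples can still "win" a comparison and receive a usable reward signal, while a mode the generator already produces well yields little further reward, matching the two-stage intuition in the notes above.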