
NVAE: A Deep Hierarchical Variational Autoencoder (Review)

따옹 2023. 9. 10. 12:05

NVAE: A Deep Hierarchical Variational Autoencoder

https://arxiv.org/pdf/2007.03898v1.pdf

 

The problem we're going to solve (abstract)
Why we started this research (introduction)
What others have tried so far (related works)
But I'm going to attack it in a new way (method)
Let's test whether it can crush the benchmarks (experiment)
We crushed it, but a few things still bug us (discussion)
A three-line summary for busy modern people (conclusion)

 

The heart of a paper is its contribution, the effort it makes to solve the given problem. Keep your eyes on that!

 


Reading the Abstract

Normalizing flows, autoregressive models, variational autoencoders (VAEs), and deep energy-based models are among competing likelihood-based frameworks for deep generative learning.

Among them, VAEs have the advantage of fast and tractable sampling and easy-to-access encoding networks.

Among likelihood-based models, VAEs have the advantage of fast, tractable sampling and an easy-to-access encoding network, and the paper wants to build on exactly that.

 

However, they are currently outperformed by other models such as normalizing flows and autoregressive models.

These days, though, they're getting beaten by models with fancier statistical machinery (normalizing flows, autoregressive models).

 

While the majority of the research in VAEs is focused on the statistical challenges, we explore the orthogonal direction of carefully designing neural architectures for hierarchical VAEs.

But instead of those well-worn statistical approaches, the authors dig into carefully designing neural architectures for hierarchical VAEs.

 

We propose Nouveau VAE (NVAE),

And here it is: Nouveau VAE,

 

a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization.

NVAE is equipped with a residual parameterization of Normal distributions and its training is stabilized by spectral regularization.
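Quick detour: the two architecture ingredients named here are easy to picture in code. Here's a minimal PyTorch sketch of a residual cell pairing a depthwise separable convolution with batch normalization. It's only an illustration of the two ingredients, not NVAE's actual cell (the class name and layer ordering are my own).

```python
import torch
import torch.nn as nn

class DepthwiseSeparableCell(nn.Module):
    """Illustrative residual cell: batch norm + depthwise separable conv."""

    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(channels),
            # Depthwise conv: one k x k filter per channel (groups=channels).
            nn.Conv2d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=channels),
            nn.BatchNorm2d(channels),
            # Pointwise 1x1 conv mixes information across channels.
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # residual connection

x = torch.randn(2, 32, 16, 16)
print(DepthwiseSeparableCell(32)(x).shape)  # torch.Size([2, 32, 16, 16])
```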

 

 

We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models on the MNIST, CIFAR-10, and CelebA HQ datasets and it provides a strong baseline on FFHQ.

Look at the results they get without leaning on new statistical tricks.

 

For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ as shown in Fig. 1. To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as 256×256 pixels.

My baby's no joke, right?
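In case "bits per dimension" feels opaque: it's the negative log-likelihood converted from nats to bits and averaged over every pixel channel, so lower is better. A quick sanity check of the arithmetic; the nats figure below is back-derived from the reported 2.91, and only the CIFAR-10 dimensionality (32 × 32 × 3) is a real number.

```python
import math

def bits_per_dim(nll_nats: float, num_dims: int) -> float:
    """Convert a per-image negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))

# CIFAR-10 images have 32 * 32 * 3 = 3072 dimensions.
print(round(bits_per_dim(6196.0, 32 * 32 * 3), 2))  # 2.91
```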

 


Reading the Conclusion

In this paper, we proposed Nouveau VAE, a deep hierarchical VAE with a carefully designed architecture.

NVAE uses depthwise separable convolutions for the generative model and regular convolutions for the encoder model.

Did you get a good look at our kid from earlier?

 

We introduced residual parameterization of Normal distributions in the encoder and spectral regularization for stabilizing the training of very deep models. We also presented practical remedies for reducing the memory usage of deep VAEs, enabling us to speed up training by ∼ 2×. NVAE achieves state-of-the-art results on MNIST, CIFAR-10, and CelebA HQ-256, and it provides a strong baseline on FFHQ-256.

Saves memory and speeds training up by about 2x (mostly a copy-paste of the abstract).
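If "spectral regularization" sounds abstract: the usual reading is to penalize the largest singular value (the spectral norm) of each layer's weight so the network stays smooth. Below is a hedged sketch of the standard power-iteration estimate; the function name and the 0.01 coefficient are mine, and this is not claimed to be the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def top_singular_value(w: torch.Tensor, u: torch.Tensor):
    """One power-iteration step estimating the largest singular value of w.

    u is a running estimate of the left singular vector, carried across
    training steps so that a single iteration per step is enough.
    """
    w2d = w.reshape(w.shape[0], -1)      # flatten conv weight to 2D
    v = F.normalize(w2d.t() @ u, dim=0)  # right singular vector estimate
    u = F.normalize(w2d @ v, dim=0)      # left singular vector estimate
    return u @ w2d @ v, u

# Sketch: add lambda * (sum of top singular values over layers) to the loss.
w = torch.randn(64, 3 * 3 * 32)
s, u = top_singular_value(w, torch.randn(64))
loss_sr = 0.01 * s  # goes on top of the VAE loss
```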

 

To the best of our knowledge, NVAE is the first VAE that can produce large high-quality images and it is trained without changing the objective function of VAEs.

They produce large, high-quality images without changing the VAE objective function.

 

Our results show that we can achieve state-of-the-art generative performance by carefully designing neural network architectures for VAEs.

They built a new neural network architecture for VAEs, and it's state-of-the-art.

 

The future work includes scaling up the training for larger images, experimenting with more complex normalizing flows, automating the architecture design by neural architecture search, and studying the role of batch normalization in VAEs. We will release our source-code to facilitate research in these directions.

And this is what they'd like to do as future work.


 

Reading the Introduction

The majority of the research efforts on improving VAEs [1, 2] is dedicated to the statistical challenges, such as reducing the gap between approximate and true posterior distributions [3, 4, 5, 6, 7, 8, 9, 10], formulating tighter bounds [11, 12, 13, 14], reducing the gradient noise [15, 16], extending VAEs to discrete variables [17, 18, 19, 20, 21, 22, 23], or tackling posterior collapse [24, 25, 26, 27].

Previously, most of the effort went into statistical challenges, like shrinking the gap between the approximate and the true posterior distribution,

 

The role of neural network architectures for VAEs is somewhat overlooked, as most previous work borrows the architectures from classification tasks.

while the architecture itself got overlooked; that's exactly the part this paper goes after.

 

However, VAEs can benefit from designing special network architectures as they have fundamentally different requirements.

VAEs have fundamentally different requirements, so they stand to benefit from specially designed network architectures.

 

First, VAEs maximize the mutual information between the input and latent variables [29, 30], requiring the networks to retain the information content of the input data as much as possible.

1. VAEs maximize the mutual information between the input and the latent variables, so the network needs to retain as much of the input's information as possible.

 

This is in contrast with classification networks that discard information regarding the input [31].

 

Second, VAEs often respond differently to the over-parameterization in neural networks.

2. VAEs often respond differently when the neural network is over-parameterized.

 

Since the marginal log-likelihood only depends on the generative model, overparameterizing the decoder network may hurt the test log-likelihood, whereas powerful encoders can yield better models because of reducing the amortization gap [6]. Wu et al. [32] observe that the marginal log-likelihood, estimated by non-encoder-based methods, is not sensitive to the encoder overfitting (see also Fig. 9 in [19]). Moreover, the neural networks for VAEs should model long-range correlations in data [33, 34, 35], requiring the networks to have large receptive fields. Finally, due to the unbounded Kullback–Leibler (KL) divergence in the variational lower bound, training very deep hierarchical VAEs is often unstable. The current state-of-the-art VAEs [4, 36] omit batch normalization (BN) [37] to combat the sources of randomness that could potentially amplify their instability.

An overall rundown of VAE quirks: over-parameterized decoders can hurt, long-range correlations demand large receptive fields, and the unbounded KL makes very deep hierarchies unstable.
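For reference, the "variational lower bound" this paragraph keeps pointing at is the standard ELBO:

```latex
\log p_\theta(x) \;\geq\;
\mathbb{E}_{q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

The KL term has no upper bound, so a few badly behaved latent groups can blow up the loss; that's why very deep hierarchies are unstable without extra stabilization.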

 

In this paper, we aim to make VAEs great again by architecture design.

They're going to make VAEs great again, purely through architecture design.

We propose Nouveau VAE (NVAE), a deep hierarchical VAE with a carefully designed network architecture that produces high-quality images.

NVAE obtains the state-of-the-art results among non-autoregressive likelihood-based generative models, reducing the gap with autoregressive models.

 

The main building block of our network is depthwise convolutions [38, 39] that rapidly increase the receptive field of the network without dramatically increasing the number of parameters.
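A quick parameter count shows why: a regular convolution over C channels with a k × k kernel costs C·C·k·k weights, while the depthwise separable version costs C·k·k + C·C. The numbers below are illustrative, not from the paper.

```python
def regular_conv_params(c: int, k: int) -> int:
    """Weights in a regular conv with c input and c output channels."""
    return c * c * k * k

def depthwise_separable_params(c: int, k: int) -> int:
    """Depthwise (c filters of k*k) plus pointwise 1x1 (c*c) weights."""
    return c * k * k + c * c

c, k = 256, 5
print(regular_conv_params(c, k))         # 1638400
print(depthwise_separable_params(c, k))  # 71936
```

So for the same channel width, a much larger kernel (hence receptive field) costs a small fraction of the parameters.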

 

In contrast to the previous work, we find that BN is an important component of the success of deep VAEs. We also observe that instability of training remains a major roadblock when the number of hierarchical groups is increased, independent of the presence of BN. To combat this, we propose a residual parameterization of the approximate posterior parameters to improve minimizing the KL term, and we show that spectral regularization is key to stabilizing VAE training.
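My reading of the residual parameterization, as a hedged sketch: instead of predicting the approximate posterior's Normal parameters from scratch, the encoder predicts small corrections on top of the prior's parameters, so the KL term only has to pay for the corrections. The function and tensor names below are made up.

```python
import torch

def residual_posterior(mu_p, log_sig_p, d_mu, d_log_sig):
    """Posterior Normal parameters expressed relative to the prior's.

    mu_p, log_sig_p: the prior's mean and log-std for this latent group.
    d_mu, d_log_sig: residual corrections predicted by the encoder.
    Returns the posterior N(mu_p + d_mu, sig_p * exp(d_log_sig)) as (mean, log-std).
    """
    return mu_p + d_mu, log_sig_p + d_log_sig

# When the corrections are near zero the posterior sits on the prior, the
# KL term is near zero, and deep hierarchies stay stable.
mu_q, log_sig_q = residual_posterior(
    torch.zeros(8), torch.zeros(8), 0.1 * torch.ones(8), torch.zeros(8))
```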

 

In summary, we make the following contributions:

In short, here's what they contributed:

 

i)  We propose a novel deep hierarchical VAE, called NVAE, with depthwise convolutions in its generative model.
They built NVAE, a deep hierarchical VAE with a freshly designed architecture.
ii) We propose a new residual parameterization of the approximate posteriors.
A new residual parameterization of the approximate posteriors.
iii) We stabilize training deep VAEs with spectral regularization.
Training is stabilized with spectral regularization.
iv) We provide practical solutions to reduce the memory burden of VAEs.
Practical tricks to cut VAE memory usage (see the sketch after this list).
v) We show that deep hierarchical VAEs can obtain state-of-the-art results on several image datasets, and can produce high-quality samples even when trained with the original VAE objective.
State-of-the-art results on several datasets, plus high-quality samples, all with the original VAE objective.
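On item iv: the quoted text doesn't spell the remedies out, but mixed-precision training is the kind of trick involved. As a hedged illustration only (not guaranteed to match the paper's exact recipe), here's what that looks like with PyTorch's AMP utilities; `model`, `loader`, and `optimizer` are assumed to exist.

```python
import torch

scaler = torch.cuda.amp.GradScaler()   # rescales the loss so fp16 grads don't underflow

for x, _ in loader:                    # `model`, `loader`, `optimizer` assumed defined
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass in mixed fp16/fp32
        loss = model(x.cuda())         # e.g. the negative ELBO
    scaler.scale(loss).backward()      # backward through the scaled loss
    scaler.step(optimizer)             # unscale gradients, apply the update
    scaler.update()                    # adapt the loss scale
```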

To the best of our knowledge, NVAE is the first successful application of VAEs to images as large as 256×256 pixels.

 

 

Reading the Related Work: how this result differs from prior work

 

Related Work: Recently, VQ-VAE-2 [40] demonstrated high-quality generative performance for large images. However, VQ-VAE’s objective differs substantially from VAEs’ and does not correspond to a lower bound on data log-likelihood. In contrast, NVAE is trained directly with the VAE objective. Moreover, VQ-VAE-2 uses PixelCNN [41] in its prior for latent variables up to 128×128 dims that can be very slow to sample from, while NVAE uses an unconditional decoder in the data space.

 

Our work is related to VAEs with inverse autoregressive flows (IAF-VAEs) [4]. NVAE borrows the statistical models (i.e., hierarchical prior and approximate posterior, etc.) from IAF-VAEs. But, it differs from IAF-VAEs in terms of

i) neural networks implementing these models,

ii) the parameterization of approximate posteriors, and

iii) scaling up the training to large images. Nevertheless, we provide ablation experiments on these aspects, and we show that NVAE outperforms the original IAF-VAEs by a large gap. Recently, BIVA [36] showed state-of-the-art VAE results by extending bidirectional inference to latent variables. However, BIVA uses neural networks similar to IAF-VAE, and it is trained on images as large as 64×64 px. To keep matters simple, we use the hierarchical structure from IAF-VAEs, and we focus on carefully designing the neural networks. We expect improvements in NVAE’s performance if more complex hierarchical models from BIVA are used.