Less Tuning

Multi-head or Single-head? An Empirical Comparison for Transformer Training

Multi-head attention plays a crucial role in the recent success of Transformer models, leading to consistent performance improvements over conventional attention in various applications. The popular belief is that this effectiveness stems from …
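
For reference, below is a minimal multi-head self-attention sketch. It is illustrative only and not the paper's experimental setup; the class name, d_model, and n_heads are example choices. Setting n_heads=1 with the same d_model gives a single-head counterpart of matching size.

```python
# Minimal multi-head self-attention sketch (illustrative, not the paper's setup).
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # joint Q, K, V projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split the model dimension into n_heads independent heads
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = (q @ k.transpose(-2, -1) / self.d_head ** 0.5).softmax(dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.out(ctx)                   # n_heads=1 recovers single-head
```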

Overfitting or Underfitting? Understand Robustness Drop in Adversarial Training

Our goal is to understand why robustness drops after conducting adversarial training for too long. Although this phenomenon is commonly explained as overfitting, our analysis suggests that its primary cause is perturbation underfitting. We observe …
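
For context, here is a generic PGD-based adversarial training step, a common baseline recipe rather than the paper's specific analysis setup; epsilon, alpha, n_steps, and the function name are illustrative, and model is assumed to be a classifier returning logits.

```python
# Generic PGD adversarial-training step (a standard baseline, not the paper's setup).
import torch
import torch.nn.functional as F

def adversarial_training_step(model, x, y, optimizer,
                              epsilon=8 / 255, alpha=2 / 255, n_steps=10):
    # craft the perturbation with projected gradient descent
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon).requires_grad_(True)
    for _ in range(n_steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon)
        delta = delta.detach().requires_grad_(True)
    # update the model on the perturbed inputs
    optimizer.zero_grad()
    F.cross_entropy(model(x + delta.detach()), y).backward()
    optimizer.step()
```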

Very Deep Transformers for Neural Machine Translation

We explore the application of very deep Transformer models for Neural Machine Translation (NMT). Using a simple yet effective initialization technique that stabilizes training, we show that it is feasible to build standard Transformer-based models …

Towards Adaptive Residual Network Training: A Neural-ODE Perspective

In pursuit of resource-economical machine learning, attempts have been made to dynamically adjust computation workloads in different training stages, i.e., starting with a shallow network and gradually increasing the model depth (and computation …
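
As a toy sketch of the general "start shallow, grow deeper" strategy (not the paper's Neural-ODE-based growth criterion), the snippet below appends residual blocks during training and initializes each new block as an identity map; all names and sizes are illustrative.

```python
# Toy sketch of growing a residual network during training (general idea only).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, width):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        nn.init.zeros_(self.fc2.weight)  # new block starts as an identity map
        nn.init.zeros_(self.fc2.bias)

    def forward(self, x):
        return x + self.fc2(torch.relu(self.fc1(x)))

class GrowingResNet(nn.Module):
    def __init__(self, width=128, init_depth=2):
        super().__init__()
        self.width = width
        self.blocks = nn.ModuleList(ResBlock(width) for _ in range(init_depth))

    def grow(self, optimizer):
        # add one block and register its parameters with the running optimizer
        block = ResBlock(self.width).to(self.blocks[0].fc1.weight.device)
        self.blocks.append(block)
        optimizer.add_param_group({"params": list(block.parameters())})

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x
```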

Understanding the Difficulty of Training Transformers

Transformers have proved effective for many deep learning tasks. Training transformers, however, requires non-trivial effort, e.g., carefully designed learning rate schedulers and cutting-edge optimizers (the standard SGD fails to train …

On the Variance of the Adaptive Learning Rate and Beyond

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in …
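
For reference, this is the vanilla linear warmup heuristic applied to Adam, written with torch.optim.lr_scheduler.LambdaLR; it is not the paper's proposed variant, and warmup_steps, lr, and the stand-in model are example values.

```python
# Linear learning-rate warmup for Adam (the heuristic under study, generic form).
import torch

model = torch.nn.Linear(512, 2)                      # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 4000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)

for step in range(100):                              # training-loop skeleton
    loss = model(torch.randn(32, 512)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                 # lr ramps up linearly
```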