Selected

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Different from the conventional KV cache that retains key and value vectors …

Bridging Discrete and Backpropagation: Straight-Through and Beyond

Backpropagation, the cornerstone of deep learning, is limited to computing gradients for continuous variables. This limitation poses challenges for problems involving discrete latent variables. To address this issue, we propose a novel approach to …

Label Noise in Adversarial Training: A Novel Perspective to Study Robust Overfitting

We show that label noise exists in adversarial training. Such label noise is due to the mismatch between the true label distribution of adversarial examples and the label inherited from clean examples–the true label distribution is distorted by the …

Understanding the Difficulty of Training Transformers

Transformers have been proved effective for many deep learning tasks. Training transformers, however, requires non-trivial efforts regarding carefully designing learning rate schedulers and cutting-edge optimizers (the standard SGD fails to train …

On the Variance of the Adaptive Learning Rate and Beyond

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in …

Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling

Many efforts have been made to facilitate natural language processing tasks with pre-trained language models (PTLM), and brought significant improvements to various applications. To fully leverage the nearly unlimited corpora and capture linguistic …

Empower Sequence Labeling with Task-Aware Neural Language Model

Linguistic sequence labeling is a general approach encompassing a variety of problems, such as part-of-speech tagging and named entity recognition. Recent advances in neural networks (NNs) make it possible to build reliable models without handcrafted …