In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Unlike the conventional KV cache, which retains key and value vectors …
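The snippet breaks off before the compression policy is described, so the sketch below only illustrates the general idea of pruning a KV cache by attention mass; it is not the paper's method, and the tensor shapes, the `keep_ratio`, and the `prune_kv_cache` helper are all assumptions made for illustration.

```python
import torch

def prune_kv_cache(keys, values, attn_weights, keep_ratio=0.5):
    """Illustrative KV-cache pruning: keep only the cached positions that
    received the most attention mass. Assumed shapes (per head):
      keys, values:  (seq_len, head_dim)
      attn_weights:  (num_queries, seq_len), softmax-normalized."""
    seq_len = keys.size(0)
    keep = max(1, int(seq_len * keep_ratio))
    scores = attn_weights.sum(dim=0)                     # attention mass per cached position
    kept = torch.topk(scores, keep).indices.sort().values
    return keys[kept], values[kept], kept

# Toy usage: a 16-token cache pruned to its 8 most-attended positions.
seq_len, head_dim = 16, 64
keys, values = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
attn = torch.softmax(torch.randn(4, seq_len), dim=-1)
k, v, kept = prune_kv_cache(keys, values, attn, keep_ratio=0.5)
print(k.shape, kept.tolist())
```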
Backpropagation, the cornerstone of deep learning, is limited to computing gradients for continuous variables. This limitation poses challenges for problems involving discrete latent variables. To address this issue, we propose a novel approach to …
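The abstract is cut off before the proposed approach is named, so the sketch below shows only a standard baseline for the problem it raises: the score-function (REINFORCE) estimator, which obtains gradients for a discrete categorical latent by differentiating log-probabilities rather than samples. The toy reward and all names are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Logits of a categorical latent z with 4 categories; we want gradients
# w.r.t. these logits even though sampling z is non-differentiable.
logits = torch.zeros(4, requires_grad=True)

def reward(z):
    # Toy downstream objective: prefer category 3.
    return (z == 3).float()

# Score-function (REINFORCE) estimator:
#   grad E[f(z)] = E[ f(z) * grad log p(z) ]
dist = torch.distributions.Categorical(logits=logits)
z = dist.sample((256,))                        # discrete samples, no gradient path
loss = -(reward(z) * dist.log_prob(z)).mean()  # surrogate whose gradient is the estimator
loss.backward()
print(logits.grad)  # descending this gradient raises the probability of category 3
```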
We show that label noise exists in adversarial training. Such label noise is due to the mismatch between the true label distribution of adversarial examples and the label inherited from clean examples: the true label distribution is distorted by the …
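For background on the "inherited label" the snippet refers to, here is a minimal sketch of how adversarial training typically builds its training pairs: a perturbed input (FGSM here, for brevity) is paired with the clean example's hard label. The model, epsilon, and shapes are placeholders; this is context for the claim, not the paper's analysis.

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y_clean, epsilon=8 / 255):
    """Generate an FGSM adversarial example and pair it with the label
    inherited from the clean example, as standard adversarial training does.
    The abstract's point is that this inherited hard label can differ from
    the adversarial input's true label distribution."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y_clean)
    loss.backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()
    return x_adv, y_clean  # label copied verbatim from the clean example

# Toy usage with a placeholder linear "image" classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv, y_adv = fgsm_example(model, x, y)
print(x_adv.shape, y_adv.tolist())
```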
Transformers have proved effective for many deep learning tasks. Training transformers, however, requires non-trivial effort, such as carefully designing learning rate schedulers and using cutting-edge optimizers (the standard SGD fails to train …
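Since the snippet points to carefully designed learning rate schedulers, the sketch below shows the widely used inverse-square-root schedule with linear warmup from the original Transformer paper as background; it is not this paper's contribution, and the `d_model`, `warmup`, and placeholder parameters are illustrative defaults.

```python
import torch

def noam_lr(step, d_model=512, warmup=4000):
    """Inverse-square-root schedule with linear warmup (Vaswani et al., 2017):
    lr(step) = d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Wire the schedule into Adam via LambdaLR; base lr = 1.0 so the lambda
# fully determines the effective learning rate.
params = [torch.nn.Parameter(torch.zeros(1))]            # placeholder parameters
opt = torch.optim.Adam(params, lr=1.0, betas=(0.9, 0.98), eps=1e-9)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=noam_lr)

for step in range(1, 11):
    opt.step()        # forward/backward omitted in this sketch
    sched.step()
print(f"lr after 10 steps: {opt.param_groups[0]['lr']:.2e}")  # still ramping up
```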
The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in …
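The snippet truncates before its analysis begins; as a concrete reference for what the heuristic itself is, the sketch below applies plain linear warmup (with no decay afterwards) to Adam, the kind of adaptive optimizer the abstract names. The warmup length, base learning rate, and placeholder parameters are illustrative assumptions.

```python
import torch

def linear_warmup(step, warmup_steps=2000):
    """The warmup heuristic in its barest form: scale the base learning rate
    linearly from ~0 to 1 over the first warmup_steps updates, then hold it.
    warmup_steps = 2000 is an arbitrary illustrative choice."""
    return min(1.0, (step + 1) / warmup_steps)

params = [torch.nn.Parameter(torch.zeros(1))]       # placeholder parameters
opt = torch.optim.Adam(params, lr=3e-4)             # adaptive optimizer, as in the snippet
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=linear_warmup)

for step in range(3):
    opt.step()        # forward/backward omitted in this sketch
    sched.step()
    print(f"step {step}: lr = {opt.param_groups[0]['lr']:.2e}")  # ramps toward 3e-4
```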
Many efforts have been made to facilitate natural language processing tasks with pre-trained language models (PTLMs), bringing significant improvements to various applications. To fully leverage the nearly unlimited corpora and capture linguistic …
Linguistic sequence labeling is a general approach encompassing a variety of problems, such as part-of-speech tagging and named entity recognition. Recent advances in neural networks (NNs) make it possible to build reliable models without handcrafted …
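To make the labeling task concrete, the sketch below treats named entity recognition as per-token classification over BIO tags with a small BiLSTM. The tiny vocabulary, tag set, and model sizes are illustrative placeholders; the abstract truncates before the paper's own architecture is described, so this is not that model.

```python
import torch
import torch.nn as nn

# Toy sequence-labeling setup: each token gets one BIO tag.
tokens = ["John", "lives", "in", "New", "York"]
tags   = ["B-PER", "O",    "O",  "B-LOC", "I-LOC"]

word2id = {w: i for i, w in enumerate(sorted(set(tokens)))}
tag2id  = {t: i for i, t in enumerate(sorted(set(tags)))}

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM token classifier (illustrative sizes, no CRF layer)."""
    def __init__(self, vocab_size, num_tags, emb=32, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, x):                      # x: (batch, seq_len)
        h, _ = self.lstm(self.emb(x))
        return self.out(h)                     # (batch, seq_len, num_tags)

x = torch.tensor([[word2id[w] for w in tokens]])
y = torch.tensor([[tag2id[t] for t in tags]])
model = BiLSTMTagger(len(word2id), len(tag2id))
loss = nn.CrossEntropyLoss()(model(x).flatten(0, 1), y.flatten())
print(loss.item())   # per-token classification loss over the toy sentence
```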