One Sentence Summary for EMNLP 2020
- What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding
- Position encoding learns local position information in the encoder and absolute position in the decoder.
- A Matter of Framing: The Impact of Linguistic Formalism on Probing Results
- Choice of linguistic formalism is important for role-semantic probing results.
- Contrastive Distillation on Intermediate Representations for Language Model Compression
- Language model compression via a contrastive loss on intermediate representations; a memory bank stores the negative samples and is not updated during training.
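A minimal sketch of the idea as summarized above (not the paper's code; the function name, shapes, and temperature are illustrative): an InfoNCE-style loss that pulls each student intermediate representation toward its teacher counterpart and pushes it away from negatives held in a fixed memory bank.

```python
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_repr, teacher_repr, memory_bank, temperature=0.07):
    """InfoNCE-style loss: pull the student's intermediate representation toward
    the matching teacher representation (positive) and push it away from
    memory-bank entries (negatives). The bank itself is kept fixed."""
    s = F.normalize(student_repr, dim=-1)      # (batch, dim)
    t = F.normalize(teacher_repr, dim=-1)      # (batch, dim)
    bank = F.normalize(memory_bank, dim=-1)    # (num_neg, dim), not updated
    pos = (s * t).sum(dim=-1, keepdim=True)    # (batch, 1)
    neg = s @ bank.t()                         # (batch, num_neg)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(s.size(0), dtype=torch.long, device=s.device)
    return F.cross_entropy(logits, labels)
```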
- Efficient Meta Lifelong-Learning with Limited Memory
- Efficient experience rehearsal is achieved by learning to select examples.
- Self-Supervised Meta-Learning for Few-Shot Natural Language Classification Tasks
- Introduces optimization-based meta-learning into the language-model pretraining-finetuning paradigm.
- Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference
- Multi-head attention can be viewed as multiple samples from the same distribution.
- Token-level Adaptive Training for Neural Machine Translation
- Using a frequency-based heuristic to re-weight the loss for different target tokens.
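A rough sketch of frequency-based loss re-weighting, assuming a simple inverse-log-frequency heuristic (the paper studies several weighting schemes; this one is illustrative only).

```python
import math
import torch
import torch.nn.functional as F

def frequency_weighted_nll(log_probs, targets, token_freq, pad_id=0):
    """Per-token NLL re-weighted so that rarer target tokens contribute more.

    log_probs:  (batch, seq, vocab) log-softmax outputs of the NMT model
    targets:    (batch, seq)        reference token ids
    token_freq: (vocab,)            corpus counts of each token
    """
    nll = F.nll_loss(log_probs.transpose(1, 2), targets, reduction="none")  # (batch, seq)
    # Illustrative heuristic: weight inversely proportional to log frequency.
    weights = 1.0 / torch.log(token_freq[targets].float() + math.e)
    mask = (targets != pad_id).float()
    return (nll * weights * mask).sum() / mask.sum()
```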
- Shallow-to-Deep Training for Neural Machine Translation
- Both copying-strategy and lr-reset matter for progressive Transformer-based NMT training (copying is more important, and copying the whole stack on top, 123 -> 123123, beats interleaving, 123 -> 112233, and repeating the top layer, 123 -> 123333).
- Incorporating a Local Translation Mechanism into Non-autoregressive Translation
- In NAT, treat a short autoregressive sequence instead of a single token as the basic unit (for 1234, generate 12 and 34 simultaneously, while 2 is generated after 1).
- Long-Short Term Masking Transformer: A Simple but Effective Baseline for Document-level Neural Machine Translation
- To alleviate error accumulation, an attention mask is used to turn classical global attention (all tokens accessible) into local attention.
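A tiny sketch of the masking trick, assuming a symmetric local window (the paper's actual short-term/long-term masks differ per head and layer).

```python
import torch

def local_attention_mask(seq_len, window):
    """Boolean mask where position i may attend only to positions within
    `window` tokens of i (True = allowed)."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

# Applying it to raw attention scores before the softmax, e.g. for
# scores of shape (batch, heads, seq, seq):
# scores = scores.masked_fill(~local_attention_mask(seq_len, window=8), float("-inf"))
```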
- Masking as an Efficient Alternative to Finetuning for Pretrained Language Models
- Instead of finetuning LMs, learn a weight mask (e.g., [0.1, 0.3] -> [0.0, 0.3]), which is argued to be more memory-efficient (a 1-bit mask instead of a 32/16-bit additional copy).
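A minimal sketch of the mask-instead-of-finetune idea, assuming per-weight real-valued scores binarized with a straight-through estimator (class name and hyperparameters are illustrative, not the authors' implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Freeze the pretrained weight and learn only a binary mask over it.
    Real-valued scores are binarized with a straight-through estimator, so
    per task only a 1-bit mask has to be stored."""

    def __init__(self, pretrained_linear, init=0.01, threshold=0.0):
        super().__init__()
        self.weight = nn.Parameter(pretrained_linear.weight.detach(), requires_grad=False)
        self.bias = None if pretrained_linear.bias is None else nn.Parameter(
            pretrained_linear.bias.detach(), requires_grad=False)
        self.scores = nn.Parameter(torch.full_like(self.weight, init))
        self.threshold = threshold

    def forward(self, x):
        hard = (self.scores > self.threshold).float()
        mask = hard + self.scores - self.scores.detach()   # straight-through gradient
        return F.linear(x, self.weight * mask, self.bias)
```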
- If Beam Search is the Answer, What was the Question?
- Beam search has an inductive bias that can be linked to the promotion of uniform information density: the variance of surprisals is roughly linearly related to BLEU (people do not want surprises while reading).
- By adding regularization on information density, exact search performs similar to beam search.
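A sketch of one way such a UID regularizer can enter the decoding objective, assuming the penalty is the variance of per-token surprisals (the paper compares several regularizer variants; this is not their exact formulation).

```python
import torch

def uid_regularized_score(token_logprobs, lam=1.0):
    """Score a candidate translation by its log-probability minus a penalty on
    the variance of per-token surprisals (a simple UID-style regularizer).

    token_logprobs: 1-D tensor of log p(y_t | y_<t, x) for the candidate.
    """
    surprisals = -token_logprobs
    penalty = surprisals.var(unbiased=False)
    return token_logprobs.sum() - lam * penalty
```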
- When BERT Plays the Lottery, All Tickets Are Winning
- The pruned “good” subnetworks work well, while the “bad” ones do not.
- For structured pruning, even “bad” subnetworks can be finetuned separately to reach fairly strong performance.
- Different runs yield different “good” subnetworks, indicating the existence of factors other than non-trivial linguistic features.
- The success of BERT might be related more to optimization surfaces than to specific bits of linguistic knowledge.
- Dynamic Context Selection for Document-level Neural Machine Translation via Reinforcement Learning
- Using RL to select context for NMT at the document-level.
- Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation
- Pruning during training is much better than pruning after convergence (0.1 BLEU drop vs. 1.1).
- Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT
- Pre-training on high-resource -> fine-tune on low-resource by expanding dictionary -> XLM
- Towards Reasonably-Sized Character-Level Transformer NMT by Finetuning Subword Systems
- Char-level NMT < (first train subword-level NMT, then convert to character-level) < subword-level NMT
- Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation
- Jointly learn to select training data from comparable (rather than parallel) corpora and to train the NMT model.
- Can Automatic Post-Editing Improve NMT?
- With more training data, automatic post-editing can improve NMT a lot.
- Learning from Context or Names? An Empirical Study on Neural Relation Extraction
- Both context and names are important for performance, while existing RE datasets may leak cues via names.
- Some Languages Seem Easier to Parse Because Their Treebanks Leak
- Some languages seem easier to parse, since the trees in the test set are mostly isomorphic with some tree in the training set.
- Word Frequency Does Not Predict Grammatical Knowledge in Language Models
- Comparing “The cat walks” and “The cat walk”, the authors find, along four dimensions, that frequency is not related to how well grammatical knowledge is learned.
- Do sequence-to-sequence VAEs learn global features of sentences?
- VAEs are prone to memorizing the first words and the sentence length, producing local features of limited usefulness; changing the architecture can help alleviate such memorization.
- Adversarial Semantic Collisions
- Semantic collisions refer to texts that are semantically unrelated but judged as similar by NLP models, which can be effectively derived with gradient-based methods.
- Sparse Text Generation
- The entmax transformation is leveraged to train and sample from a natively sparse language model; α-entmax unifies softmax, argmax, and sparsemax.
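For illustration, a hand-rolled sparsemax (the α = 2 special case of α-entmax); the authors provide a library for the general transformation, so this is only meant to show how exact zeros arise in the output distribution.

```python
import torch

def sparsemax(z):
    """Sparsemax over the last dimension: the alpha = 2 case of alpha-entmax.
    Low-scoring tokens receive exactly zero probability, giving a natively
    sparse distribution to train on and sample from."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    cumsum = z_sorted.cumsum(dim=-1) - 1
    support_size = (k * z_sorted > cumsum).sum(dim=-1, keepdim=True)
    tau = cumsum.gather(-1, support_size - 1) / support_size.to(z.dtype)
    return torch.clamp(z - tau, min=0)

# sparsemax(torch.tensor([2.0, 1.0, 0.1])) -> tensor([1., 0., 0.])
```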
- Learning Variational Word Masks to Improve the Interpretability of Neural Text Classifiers
- Word masks are automatically learned to emphasize task-specific words, improving both interpretability and prediction performance.
- Identifying Elements Essential for BERT’s Multilinguality
- Shared position embeddings, shared special tokens, replacing masked tokens with random tokens, and a limited number of parameters are necessary elements for multilinguality.
- Word order is relevant: BERT is not multilingual with one language having an inverted word order.
- The comparability of training corpora contributes to multilinguality.
- A Streaming Approach For Efficient Batched Beam Search
- Periodically “refills” the batch before proceeding with a selected subset of candidates; special handling is needed for self-attention padding.
- On Losses for Modern Language Models
- Clarifies NSP’s effect on BERT pre-training and designs more auxiliary tasks, which allows the final method to outperform BERT-Base with fewer than a quarter of the training tokens.
- Entities as Experts: Sparse Memory Access with Entity Supervision
- Leveraging entity representations to better memorize sparse knowledge.
- Semantic Label Smoothing for Sequence to Sequence Problems
- Retrieve semantically similar sentences for label smoothing, at the cost of extra training computation.
- We Can Detect Your Bias: Predicting the Political Ideology of News Articles
- Directly using BERT results in a heavy dependency on the news source, which motivates adversarial training to neutralize such leaked clues.
- Imitation Attacks and Defenses for Black-box Machine Translation Systems
- After imitating a black-box machine translation system, attack the imitated system with universal trigger / suffix dropper / targeted flip attacks.
- Discussions are included on defending against model stealing via gradient poisoning.
- Ensemble Distillation for Structured Prediction: Calibrated, Accurate, Fast—Choose Three
- Ensembling helps calibration and improves model performance, while distillation makes inference faster.
- Inference Strategies for Machine Translation with Conditional Masking
- Using a masked language model as the decoder, and iteratively replacing masks with model outputs.
- Four strategies are compared: a fixed number of iterations; a fixed number of tokens per iteration; probs > T; combined probs > T (best).
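An illustrative mask-predict style loop for the "fixed #tokens/iteration" strategy, assuming a hypothetical `model(src, tgt)` interface that returns per-position logits (not the paper's code).

```python
import torch

@torch.no_grad()
def mask_predict(model, src, tgt_len, mask_id, iterations=10):
    """Start from an all-mask target; at each iteration predict all masked
    positions, then re-mask the least confident tokens and repeat."""
    tgt = torch.full((src.size(0), tgt_len), mask_id, device=src.device)
    probs = torch.zeros_like(tgt, dtype=torch.float)
    for it in range(iterations):
        logits = model(src, tgt)                         # (batch, tgt_len, vocab)
        new_probs, new_tokens = logits.softmax(-1).max(-1)
        masked = tgt == mask_id
        tgt = torch.where(masked, new_tokens, tgt)
        probs = torch.where(masked, new_probs, probs)
        # Linearly decaying number of tokens to re-mask per iteration.
        n_mask = int(tgt_len * (1 - (it + 1) / iterations))
        if n_mask == 0:
            break
        worst = probs.topk(n_mask, dim=-1, largest=False).indices
        tgt.scatter_(1, worst, mask_id)
    return tgt
```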
- An Empirical Study of Generation Order for Machine Translation
- Using a Transformer to generate in an insertion-based manner, i.e., (a, c) -> b; many generation orders are explored (a balanced binary tree order works best, while the others are not far behind).
- Reproducible and Efficient Benchmarks for Hyperparameter Optimization of Neural Machine Translation Systems
- Benchmark for hyper-parameter optimization on NMT.
- Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation
- Learn to generate maskings for adapting a language model to a new domain more quickly.
- Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting
- For small datasets, add regularization on the parameter weight shift (the difference from the original pre-trained weights).
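A simplified sketch of the recall idea: an L2 penalty pulling the fine-tuned weights back toward the pretrained ones (the actual method additionally anneals between the recall and fine-tuning objectives; names here are illustrative).

```python
def weight_shift_penalty(model, pretrained_state, gamma=1e-3):
    """L2 penalty on the shift from the pretrained weights; add this term to
    the task loss when fine-tuning on a small dataset."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in pretrained_state:
            ref = pretrained_state[name].to(param.device)
            penalty = penalty + ((param - ref) ** 2).sum()
    return gamma * penalty
```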
- Simple and Effective Few-Shot Named Entity Recognition with Structured Nearest Neighbor Learning
- After pre-training NER models on a source domain, trying to adapt the model to a new domain by conducting structured nearest neighbor search (similarities are calculated based on representations constructed by the pre-trained model).
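A bare-bones sketch of the nearest-neighbor step, ignoring the structured (transition/Viterbi) decoding; names and shapes are illustrative.

```python
import torch

def nn_tag_tokens(token_reprs, support_reprs, support_tags):
    """Assign each test token the tag of its nearest support token in the
    representation space of the pretrained NER encoder.

    token_reprs:   (n_test, dim) representations of test tokens
    support_reprs: (n_support, dim) representations of labeled support tokens
    support_tags:  list of n_support tag strings
    """
    dists = torch.cdist(token_reprs, support_reprs)   # (n_test, n_support)
    nearest = dists.argmin(dim=-1)
    return [support_tags[i] for i in nearest.tolist()]
```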
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks
- After linearizing the labeled sentence (e.g., converting it into B-PER Jose E-PER), a language model is trained on the linearized sequences and used to generate new training data.
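A toy linearization helper, assuming a simple tag-before-word scheme (the exact tag format in the note/paper may differ); the point is just that labels become ordinary tokens a language model can learn to generate.

```python
def linearize(tokens, tags):
    """Flatten a labeled sentence by inserting each non-O tag before its word,
    so a language model can be trained on the resulting token sequence."""
    out = []
    for token, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)
        out.append(token)
    return out

# linearize(["Jose", "lives", "in", "Lisbon"], ["B-PER", "O", "O", "B-LOC"])
# -> ["B-PER", "Jose", "lives", "in", "B-LOC", "Lisbon"]
```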
- Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models
- Pre-training LSTMs on music helps with later training on language, but pre-training on randomly generated text does not.
- Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models
- LMs do not perform well on numerical commonsense knowledge, even with distant supervision.
- Train No Evil: Selective Masking for Task-Guided Pre-Training
- Important tokens are found with heuristic strategies and used for task-guided pre-training (the heuristics leverage task-specific information).
- Fact or Fiction: Verifying Scientific Claims
- Propose a new task called scientific claim verification and construct corresponding datasets.
- Language Model Prior for Low-Resource Neural Machine Translation
- Using language model outputs to regularize NMT in a knowledge distillation manner.
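A sketch of how an LM prior can regularize NMT in a KD style, assuming a KL term between temperature-softened NMT and target-side LM distributions (loss weights and padding handling are illustrative, not the paper's exact objective).

```python
import torch.nn.functional as F

def lm_prior_loss(nmt_logits, lm_logits, targets, pad_id, tau=2.0, lam=0.5):
    """NLL on the reference plus a KL term pulling the NMT output distribution
    toward a target-side language model (knowledge-distillation style).

    nmt_logits, lm_logits: (batch, seq, vocab); targets: (batch, seq)
    """
    nll = F.cross_entropy(nmt_logits.transpose(1, 2), targets, ignore_index=pad_id)
    # For brevity, the KL term is averaged over all positions, including pads.
    kl = F.kl_div(F.log_softmax(nmt_logits / tau, dim=-1),
                  F.softmax(lm_logits / tau, dim=-1),
                  reduction="batchmean")
    return nll + lam * kl
```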