One Sentence Summary for EMNLP 2020
- What Do Position Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional Encoding
- Position encoding learns local position information in the encoder and absolute position in the decoder.
- A Matter of Framing: The Impact of Linguistic Formalism on Probing Results
- Choice of linguistic formalism is important for role-semantic probing results.
- Contrastive Distillation on Intermediate Representations for Language Model Compression
- Language model compression via a contrastive loss on intermediate representations; a memory bank stores the negative samples and is not updated during training.
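A minimal sketch of the idea as summarized above (not the paper's code; the function name, shapes, and temperature are illustrative): an InfoNCE-style loss that pulls each student intermediate representation toward its teacher counterpart and pushes it away from negatives held in a fixed memory bank.

```python
import torch
import torch.nn.functional as F

def contrastive_distill_loss(student_repr, teacher_repr, memory_bank, temperature=0.07):
    """InfoNCE-style loss: pull the student's intermediate representation toward
    the matching teacher representation (positive) and push it away from
    memory-bank entries (negatives). The bank itself is kept fixed."""
    s = F.normalize(student_repr, dim=-1)      # (batch, dim)
    t = F.normalize(teacher_repr, dim=-1)      # (batch, dim)
    bank = F.normalize(memory_bank, dim=-1)    # (num_neg, dim), not updated
    pos = (s * t).sum(dim=-1, keepdim=True)    # (batch, 1)
    neg = s @ bank.t()                         # (batch, num_neg)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(s.size(0), dtype=torch.long, device=s.device)
    return F.cross_entropy(logits, labels)
```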
- Efficient Meta Lifelong-Learning with Limited Memory
- Efficient experience rehearsal is achieved by learning to select examples.
- Self-Supervised Meta-Learning for Few-Shot Natural Language Classification Tasks
- Introduces optimization-based meta-learning into the language-model pretraining-finetuning paradigm.
- Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference
- Multi-head attention can be viewed as multiple samples from the same distribution.
- Token-level Adaptive Training for Neural Machine Translation
- Using a frequency-based heuristic to re-weight the loss for different target tokens.
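A rough sketch of frequency-based loss re-weighting, assuming a simple inverse-log-frequency heuristic (the paper studies several weighting schemes; this one is illustrative only).

```python
import math
import torch
import torch.nn.functional as F

def frequency_weighted_nll(log_probs, targets, token_freq, pad_id=0):
    """Per-token NLL re-weighted so that rarer target tokens contribute more.

    log_probs:  (batch, seq, vocab) log-softmax outputs of the NMT model
    targets:    (batch, seq)        reference token ids
    token_freq: (vocab,)            corpus counts of each token
    """
    nll = F.nll_loss(log_probs.transpose(1, 2), targets, reduction="none")  # (batch, seq)
    # Illustrative heuristic: weight inversely proportional to log frequency.
    weights = 1.0 / torch.log(token_freq[targets].float() + math.e)
    mask = (targets != pad_id).float()
    return (nll * weights * mask).sum() / mask.sum()
```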
- Shallow-to-Deep Training for Neural Machine Translation
- Both copying-strategy and lr-reset matter for progressive Transformer-based NMT training (copying is more important, and copying the whole stack on top, 123 -> 123123, beats interleaving, 123 -> 112233, and repeating the top layer, 123 -> 123333).
- Incorporating a Local Translation Mechanism into Non-autoregressive Translation
- In NAT, treat a short autoregressive sequence instead of a single token as the basic unit (for 1234, generate 12 and 34 simultaneously, while 2 is generated after 1).
- Long-Short Term Masking Transformer: A Simple but Effective Baseline for Document-level Neural Machine Translation
- To alleviate error accumulation, an attention mask is used to turn classical global attention (all tokens accessible) into local attention.
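A tiny sketch of the masking trick, assuming a symmetric local window (the paper's actual short-term/long-term masks differ per head and layer).

```python
import torch

def local_attention_mask(seq_len, window):
    """Boolean mask where position i may attend only to positions within
    `window` tokens of i (True = allowed)."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

# Applying it to raw attention scores before the softmax, e.g. for
# scores of shape (batch, heads, seq, seq):
# scores = scores.masked_fill(~local_attention_mask(seq_len, window=8), float("-inf"))
```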
- Masking as an Efficient Alternative to Finetuning for Pretrained Language Models
- Instead of finetuning LMs, learn a weight mask (e.g., [0.1, 0.3] -> [0.0, 0.3]), which is argued to be more memory-efficient (a 1-bit mask instead of a 32/16-bit additional copy).
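A minimal sketch of the mask-instead-of-finetune idea, assuming per-weight real-valued scores binarized with a straight-through estimator (class name and hyperparameters are illustrative, not the authors' implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Freeze the pretrained weight and learn only a binary mask over it.
    Real-valued scores are binarized with a straight-through estimator, so
    per task only a 1-bit mask has to be stored."""

    def __init__(self, pretrained_linear, init=0.01, threshold=0.0):
        super().__init__()
        self.weight = nn.Parameter(pretrained_linear.weight.detach(), requires_grad=False)
        self.bias = None if pretrained_linear.bias is None else nn.Parameter(
            pretrained_linear.bias.detach(), requires_grad=False)
        self.scores = nn.Parameter(torch.full_like(self.weight, init))
        self.threshold = threshold

    def forward(self, x):
        hard = (self.scores > self.threshold).float()
        mask = hard + self.scores - self.scores.detach()   # straight-through gradient
        return F.linear(x, self.weight * mask, self.bias)
```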
- If Beam Search is the Answer, What was the Question?
- Beam search has an inductive bias that can be linked to the promotion of uniform information density: the variance of surprisals is roughly linearly related to BLEU (people do not want surprises while reading).
- By adding regularization on information density, exact search performs similar to beam search.
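A sketch of one way such a UID regularizer can enter the decoding objective, assuming the penalty is the variance of per-token surprisals (the paper compares several regularizer variants; this is not their exact formulation).

```python
import torch

def uid_regularized_score(token_logprobs, lam=1.0):
    """Score a candidate translation by its log-probability minus a penalty on
    the variance of per-token surprisals (a simple UID-style regularizer).

    token_logprobs: 1-D tensor of log p(y_t | y_<t, x) for the candidate.
    """
    surprisals = -token_logprobs
    penalty = surprisals.var(unbiased=False)
    return token_logprobs.sum() - lam * penalty
```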
- When BERT Plays the Lottery, All Tickets Are Winning
- The pruned “good” subnetworks work well, while the “bad” ones do not.
- For structured pruning, even “bad” subnetworks can be finetuned separately to reach fairly strong performance.
- Different runs yield different “good” subnetworks, indicating the existence of factors other than non-trivial linguistic features.
- The success of BERT might be related more to optimization surfaces than to specific bits of linguistic knowledge.
- Dynamic Context Selection for Document-level Neural Machine Translation via Reinforcement Learning
- Using RL to select context for NMT at the document-level.
- Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation
- Pruning during training is much better than pruning after convergence (0.1 BLEU drop vs. 1.1).
- Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT
- Pre-training on high-resource -> fine-tune on low-resource by expanding dictionary -> XLM
- Towards Reasonably-Sized Character-Level Transformer NMT by Finetuning Subword Systems
- Char-level NMT < (first train subword-level NMT, then convert to character-level) < subword-level NMT
- Self-Induced Curriculum Learning in Self-Supervised Neural Machine Translation
- Jointly learn to select training data from comparable (rather than parallel) corpora and to train the NMT model.
- Can Automatic Post-Editing Improve NMT?
- With more training data, automatic post-editing can improve NMT a lot.
- Learning from Context or Names? An Empirical Study on Neural Relation Extraction
- Both context and names are important for performance, while existing RE datasets may leak cues via names.
- Some Languages Seem Easier to Parse Because Their Treebanks Leak
- Some languages seem easier to parse, since the trees in the test set are mostly isomorphic with some tree in the training set.
- Word Frequency Does Not Predict Grammatical Knowledge in Language Models
- Comparing “The cat walks” and “The cat walk”, the authors find, along four dimensions, that frequency is not related to how well grammatical knowledge is learned.
- Do sequence-to-sequence VAEs learn global features of sentences?
- VAEs are prone to memorizing the first words and the sentence length, producing local features of limited usefulness; changing the architecture can help alleviate such memorization.
- Adversarial Semantic Collisions
- Semantic collisions refer to texts that are semantically unrelated but judged as similar by NLP models, which can be effectively derived with gradient-based methods.
- Sparse Text Generation
- The entmax transformation is leveraged to train and sample from a natively sparse language model; α-entmax unifies softmax, argmax, and sparsemax.
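For illustration, a hand-rolled sparsemax (the α = 2 special case of α-entmax); the authors provide a library for the general transformation, so this is only meant to show how exact zeros arise in the output distribution.

```python
import torch

def sparsemax(z):
    """Sparsemax over the last dimension: the alpha = 2 case of alpha-entmax.
    Low-scoring tokens receive exactly zero probability, giving a natively
    sparse distribution to train on and sample from."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    cumsum = z_sorted.cumsum(dim=-1) - 1
    support_size = (k * z_sorted > cumsum).sum(dim=-1, keepdim=True)
    tau = cumsum.gather(-1, support_size - 1) / support_size.to(z.dtype)
    return torch.clamp(z - tau, min=0)

# sparsemax(torch.tensor([2.0, 1.0, 0.1])) -> tensor([1., 0., 0.])
```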
- Learning Variational Word Masks to Improve the Interpretability of Neural Text Classifiers
- Word masks are automatically learned to emphasize task-specific words, improving both interpretability and prediction performance.
- Identifying Elements Essential for BERT’s Multilinguality
- Shared position embeddings, shared special tokens, replacing masked tokens with random tokens, and a limited number of parameters are necessary elements for multilinguality.
- Word order is relevant: BERT is not multilingual with one language having an inverted word order.
- The comparability of training corpora contributes to multilinguality.
- A Streaming Approach For Efficient Batched Beam Search
- Periodically “refills” the batch before proceeding with a selected subset of candidates; special handling is needed for self-attention padding.
- On Losses for Modern Language Models
- Clarifies NSP’s effect on BERT pre-training and designs more auxiliary tasks, which allows the final method to outperform BERT-Base with fewer than a quarter of the training tokens.
- Entities as Experts: Sparse Memory Access with Entity Supervision
- Leveraging entity representations to better memorize sparse knowledge.
- Semantic Label Smoothing for Sequence to Sequence Problems
- Retrieve semantically similar sentences for label smoothing, at the cost of extra training computation.
- We Can Detect Your Bias: Predicting the Political Ideology of News Articles
- Directly using BERT results in a heavy dependency on the news source, which motivates adversarial training to neutralize such leaked clues.
- Imitation Attacks and Defenses for Black-box Machine Translation Systems
- After imitating a black-box machine translation system, attack the imitated system with universal trigger / suffix dropper / targeted flip attacks.
- Discussions are included on defending against model stealing via gradient poisoning.
- Ensemble Distillation for Structured Prediction: Calibrated, Accurate, Fast—Choose Three
- Ensembling helps calibration and improves model performance, while distillation makes inference faster.
- Inference Strategies for Machine Translation with Conditional Masking
- Using a masked language model as the decoder, and iteratively replacing masks with model outputs.
- Four strategies are compared: a fixed number of iterations; a fixed number of tokens per iteration; probs > T; combined probs > T (best).
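An illustrative mask-predict style loop for the "fixed #tokens/iteration" strategy, assuming a hypothetical `model(src, tgt)` interface that returns per-position logits (not the paper's code).

```python
import torch

@torch.no_grad()
def mask_predict(model, src, tgt_len, mask_id, iterations=10):
    """Start from an all-mask target; at each iteration predict all masked
    positions, then re-mask the least confident tokens and repeat."""
    tgt = torch.full((src.size(0), tgt_len), mask_id, device=src.device)
    probs = torch.zeros_like(tgt, dtype=torch.float)
    for it in range(iterations):
        logits = model(src, tgt)                         # (batch, tgt_len, vocab)
        new_probs, new_tokens = logits.softmax(-1).max(-1)
        masked = tgt == mask_id
        tgt = torch.where(masked, new_tokens, tgt)
        probs = torch.where(masked, new_probs, probs)
        # Linearly decaying number of tokens to re-mask per iteration.
        n_mask = int(tgt_len * (1 - (it + 1) / iterations))
        if n_mask == 0:
            break
        worst = probs.topk(n_mask, dim=-1, largest=False).indices
        tgt.scatter_(1, worst, mask_id)
    return tgt
```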
- An Empirical Study of Generation Order for Machine Translation
- Using a Transformer to generate in an insertion-based manner, i.e., (a, c) -> b; many generation orders are explored (a balanced binary tree order works best, while the others are not far behind).
- Reproducible and Efficient Benchmarks for Hyperparameter Optimization of Neural Machine Translation Systems
- Benchmark for hyper-parameter optimization on NMT.
- Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation
- Learn to generate maskings for adapting a language model to a new domain more quickly.
- Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting
- For small datasets, add regularization on the parameter weight shift (the difference from the original pre-trained weights).
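A simplified sketch of the recall idea: an L2 penalty pulling the fine-tuned weights back toward the pretrained ones (the actual method additionally anneals between the recall and fine-tuning objectives; names here are illustrative).

```python
def weight_shift_penalty(model, pretrained_state, gamma=1e-3):
    """L2 penalty on the shift from the pretrained weights; add this term to
    the task loss when fine-tuning on a small dataset."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in pretrained_state:
            ref = pretrained_state[name].to(param.device)
            penalty = penalty + ((param - ref) ** 2).sum()
    return gamma * penalty
```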
- Simple and Effective Few-Shot Named Entity Recognition with Structured Nearest Neighbor Learning
- After pre-training NER models on a source domain, trying to adapt the model to a new domain by conducting structured nearest neighbor search (similarities are calculated based on representations constructed by the pre-trained model).
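A bare-bones sketch of the nearest-neighbor step, ignoring the structured (transition/Viterbi) decoding; names and shapes are illustrative.

```python
import torch

def nn_tag_tokens(token_reprs, support_reprs, support_tags):
    """Assign each test token the tag of its nearest support token in the
    representation space of the pretrained NER encoder.

    token_reprs:   (n_test, dim) representations of test tokens
    support_reprs: (n_support, dim) representations of labeled support tokens
    support_tags:  list of n_support tag strings
    """
    dists = torch.cdist(token_reprs, support_reprs)   # (n_test, n_support)
    nearest = dists.argmin(dim=-1)
    return [support_tags[i] for i in nearest.tolist()]
```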
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks
- After linearizing the labeled sentence (e.g., converting it into B-PER Jose E-PER), a language model is trained on the linearized sequences and used to generate new training data.
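A toy linearization helper, assuming a simple tag-before-word scheme (the exact tag format in the note/paper may differ); the point is just that labels become ordinary tokens a language model can learn to generate.

```python
def linearize(tokens, tags):
    """Flatten a labeled sentence by inserting each non-O tag before its word,
    so a language model can be trained on the resulting token sequence."""
    out = []
    for token, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)
        out.append(token)
    return out

# linearize(["Jose", "lives", "in", "Lisbon"], ["B-PER", "O", "O", "B-LOC"])
# -> ["B-PER", "Jose", "lives", "in", "B-LOC", "Lisbon"]
```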
- Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models
- Pre-training LSTMs on music helps with later training on language, but pre-training on randomly generated text does not.
- Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models
- LMs do not perform well on numerical commonsense knowledge, even with distant supervision.
- Train No Evil: Selective Masking for Task-Guided Pre-Training
- Important tokens are found with heuristic strategies and used for task-guided pre-training (the heuristics leverage task-specific information).
- Fact or Fiction: Verifying Scientific Claims
- Propose a new task called scientific claim verification and construct corresponding datasets.
- Language Model Prior for Low-Resource Neural Machine Translation
- Using language model outputs to regularize NMT in a knowledge distillation manner.
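A sketch of how an LM prior can regularize NMT in a KD style, assuming a KL term between temperature-softened NMT and target-side LM distributions (loss weights and padding handling are illustrative, not the paper's exact objective).

```python
import torch.nn.functional as F

def lm_prior_loss(nmt_logits, lm_logits, targets, pad_id, tau=2.0, lam=0.5):
    """NLL on the reference plus a KL term pulling the NMT output distribution
    toward a target-side language model (knowledge-distillation style).

    nmt_logits, lm_logits: (batch, seq, vocab); targets: (batch, seq)
    """
    nll = F.cross_entropy(nmt_logits.transpose(1, 2), targets, ignore_index=pad_id)
    # For brevity, the KL term is averaged over all positions, including pads.
    kl = F.kl_div(F.log_softmax(nmt_logits / tau, dim=-1),
                  F.softmax(lm_logits / tau, dim=-1),
                  reduction="batchmean")
    return nll + lam * kl
```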