Decoding Strategies in Sequence Models

Nemish Kanwar
5 min read · Feb 11, 2021

This article walks through different decoding strategies while illustrating the resulting probability distributions through visualizations of hidden states with the Python package ecco, which adds explainability to language model generation.

About Language Generation

The encoder-decoder architecture uses recurrent units such as LSTMs and vanilla RNNs as building blocks. An auto-regressive model is based on the assumption that the probability of the next word depends on the past hidden states, that is, on the words generated so far.

The encoder encodes the source information and passes it to the decoder¹. At the decoder end, you pass a start token as input, and the resulting output is fed into the next cell. This process is repeated until an <EOS> (end-of-sentence) token is produced.

Encoder-Decoder Architecture

The output at each time step is a softmax probability distribution over the vocabulary, and the generated text is controlled by which decoding strategy is used.

Ecco

Ecco can be installed simply using:

!pip install ecco
import ecco
lm = ecco.from_pretrained('gpt2', activations=True)  # load GPT-2 and capture activations for the visualizations below

Greedy Search

Greedy search returns the maximum-probability word at each time step.

# returns the output generation using the greedy approach (do_sample=False)
greedy_output = lm.generate('Roses are red', max_length=50, do_sample=False)
# returns the probability distribution at a given time step (position 4, top 10 tokens, layer 11)
greedy_output.layer_predictions(position=4, topk=10, layer=11)
greedy output
Greedy output at t=5
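For intuition, here is a minimal sketch of what greedy decoding does under the hood, written directly against the transformers library rather than through ecco (the 20-step generation length is an arbitrary choice):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

input_ids = tokenizer.encode('Roses are red', return_tensors='pt')
with torch.no_grad():
    for _ in range(20):                                # generate 20 tokens greedily
        logits = model(input_ids).logits               # shape: (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()               # pick the single most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))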

The strategy has some drawbacks:

  1. The model doesn’t take into consideration high-probability words hidden behind a low-probability word
  2. The model output seems to repeat itself, which is a very common problem with greedy search²

Beam Search

At each time step, you traverse n paths, called beams, while generating words¹. The probability of the whole sequence, rather than that of individual tokens, is used to pick words.

  1. At the 1st step, pick the top n words based on the beam width.
  2. At the 2nd step, run n decoders in parallel, each with one of those n words as its prefix, and select the top n two-word sequences for the 3rd step.
  3. At the 3rd step, repeat the same process with the n two-word prefixes, and so on.

beam search

Let’s check how it goes…
Unfortunately, ecco doesn’t support beam search yet, so I went with plain transformers to view the output.
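A sketch of what that transformers call might look like; the beam width of 5 and returning every beam are illustrative assumptions rather than the notebook’s exact settings:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

input_ids = tokenizer.encode('Roses are red', return_tensors='pt')
# beam search: keep the 5 highest-scoring sequences at every step and return all of them
beam_outputs = model.generate(input_ids, max_length=50, num_beams=5,
                              num_return_sequences=5, early_stopping=True)
for i, beam in enumerate(beam_outputs):
    print(i, tokenizer.decode(beam, skip_special_tokens=True))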

The outputs shown are all of the beams. They make much more sense now, but it’s quite evident that they have almost no variation, and that is one downside of beam search. It doesn’t work well for open-ended generation, as the output doesn’t have much variance⁴, and humans don’t tend to choose the highest-probability words all the time. As is evident from the figure below, natural language often contains low-probability words.

humans have high variance

The strategy still works well for cases where the output is tightly scoped by the input and where repetition and genericness are not as problematic.

Sampling

To get variance in the output, the next word can also be sampled from the vocabulary according to the probability distribution. This can lead to incoherence, as many of the sampled words will not be coherent with the text you wish to generate.
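A minimal sketch of a pure-sampling run, mirroring the greedy call above with do_sample=True (the exact arguments used in the notebook may differ):

# sample the next token from the full softmax distribution instead of taking the argmax
sample_output = lm.generate('Roses are red', max_length=50, do_sample=True)

# ecco's ranking-matrix view: the rank of each chosen token in the softmax of every layer
sample_output.rankings()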

Sampling output; the output makes no sense at all
Ranking matrix at the 11th layer, showing the rank of the chosen tokens in the last softmax layer

Sampling with temperature

This modifies the softmax distribution using a temperature factor, exaggerating or diminishing word probabilities. Temperature is typically set between 0 and 1, with t=1 having no effect and t approaching 0 becoming equivalent to greedy search. Decreasing t moves the distribution from flatter to more skewed, which improves generation quality.
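As a toy illustration of the effect (the logit values below are made up), dividing the logits by t before the softmax sharpens the distribution; the final lm.generate call assumes ecco forwards the temperature argument to the underlying sampler:

import numpy as np

def softmax_with_temperature(logits, t):
    # divide the logits by t before the softmax; t < 1 sharpens the distribution
    scaled = np.array(logits) / t
    exps = np.exp(scaled - scaled.max())              # subtract the max for numerical stability
    return exps / exps.sum()

logits = [3.0, 2.5, 1.0, 0.5]                         # toy scores for four vocabulary words
print(softmax_with_temperature(logits, 1.0))          # t=1: the ordinary softmax
print(softmax_with_temperature(logits, 0.5))          # t=0.5: probability mass concentrates on the top word

# sampling with a lower temperature (the value 0.7 is an illustrative assumption)
temp_output = lm.generate('Roses are red', max_length=50, do_sample=True, temperature=0.7)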

modifying softmax distribution using temperature factor
sampling with temperature
Ranking matrix with temperature

As is evident, the model starts sampling higher-ranking tokens once a temperature is provided, whereas plain sampling also returned niche tokens in the output.

Sampling with top k

Top-k sampling filters the top k words from the output distribution before sampling, so the model keeps the ability to generate a variety of words while keeping gibberish under control.
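A sketch of a top-k run, with k=20 to match the clipping visible in the figure below (top_k is assumed to be forwarded to the underlying sampler):

# top-k: sample only from the 20 most likely tokens at each step
topk_output = lm.generate('Roses are red', max_length=50, do_sample=True, top_k=20)
topk_output.rankings()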

top k sampling
The distribution is varied, yet clipped at the top 20 tokens

The top-k sampling output does make more sense. But there is a drawback: the candidate pool is fixed at k tokens for every step, so marginal words can still be picked when the distribution is highly skewed, while valuable words are lost when the distribution is flatter.

Top-p (nucleus) sampling

Instead of picking a fixed number of words, candidates can also be filtered by cumulative probability. Top-p sampling picks the minimum number of words whose combined probability exceeds p of the probability mass.
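A sketch of a nucleus-sampling run, with top_p=0.5 to match the roughly 50% cut-off in the figure below (top_p is assumed to be forwarded to the underlying sampler):

# top-p: sample from the smallest set of tokens whose cumulative probability exceeds 0.5
topp_output = lm.generate('Roses are red', max_length=50, do_sample=True, top_p=0.5)
topp_output.rankings()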

top-p takes only words with a cumulative probability of slightly more than 50%
The ranking matrix limits variability to an unfixed count based on the cumulative probability

Conclusion

Top-p and top-k are generally considered good approaches for open-ended language generation, while beam search does better where genericness is not an issue. In general, models still suffer from n-gram repetition, which can be remedied using repetition_penalty in the transformers module.
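A hedged sketch of applying repetition_penalty with the transformers generate API; the penalty value and the other sampling settings are illustrative choices:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

input_ids = tokenizer.encode('Roses are red', return_tensors='pt')
# repetition_penalty > 1.0 down-weights tokens that have already been generated
output = model.generate(input_ids, max_length=50, do_sample=True, top_p=0.9,
                        repetition_penalty=1.2)
print(tokenizer.decode(output[0], skip_special_tokens=True))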

Link to my notebook: https://github.com/Nempickaxe/mini_projects/blob/master/Language_model_decoding_parameters.ipynb (PS: to view some ecco functionality, you need to open the notebook in google colab)

[1] An Intuitive Explanation of Beam Search, Renu Khandelwal, Towards Data Science. https://towardsdatascience.com/an-intuitive-explanation-of-beam-search-9b1d744e7a0f

[2] How to generate text: using different decoding methods for language generation with Transformers, Patrick von Platen, Hugging Face. https://huggingface.co/blog/how-to-generate

[3] Nucleus Sampling: The Curious Case of Neural Text Degeneration (Research Paper Walkthrough), TechViz - The Data Science Guy. https://www.youtube.com/watch?v=dCORspO2yVY&ab_channel=TechViz-TheDataScienceGuy

[4] The Curious Case of Neural Text Degeneration, Ari Holtzman et al. https://arxiv.org/pdf/1904.09751.pdf


Nemish Kanwar

Senior Data Scientist @Draup, specialising in Natural Language Processing