Neural Language Model
predict the next word; replaced HMM-based language models
stateful LSTM: the final hidden and cell states of one batch are carried over as the initial states of the next batch, so batches are dependent.
stateless LSTM: parameters are updated on batch one, then the hidden and cell states are re-initialized to zero before batch two, so batches are independent of each other.
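The stateful/stateless distinction can be sketched with a hand-rolled LSTM cell in NumPy (random untrained weights, assumed toy dimensions): stateless zeroes (h, c) before every batch, stateful carries them across batches.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One fused weight matrix for the input, forget, output, and cell gates.
        self.W = rng.normal(0, 0.1, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        H = self.hidden_dim
        i = sigmoid(z[0:H])        # input gate
        f = sigmoid(z[H:2*H])      # forget gate
        o = sigmoid(z[2*H:3*H])    # output gate
        g = np.tanh(z[3*H:4*H])    # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
        return h, c

def run_batches(cell, batches, stateful):
    h = c = np.zeros(cell.hidden_dim)
    outputs = []
    for batch in batches:          # each batch is a sequence of input vectors
        if not stateful:           # stateless: reset states between batches
            h = c = np.zeros(cell.hidden_dim)
        for x in batch:
            h, c = cell.step(x, h, c)
        outputs.append(h.copy())
    return outputs

cell = LSTMCell(input_dim=3, hidden_dim=4)
rng = np.random.default_rng(1)
batches = [rng.normal(size=(5, 3)) for _ in range(2)]

out_stateless = run_batches(cell, batches, stateful=False)
out_stateful = run_batches(cell, batches, stateful=True)
# Batch 1 matches either way; batch 2 differs because the stateful run
# started from batch 1's final (h, c) instead of zeros.
```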
Word2Vec: CBOW, skip-gram | GloVe (static embeddings: one fixed vector per word, so they cannot distinguish a word's different context-dependent senses)
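How CBOW and skip-gram frame the training task differently over the same text can be sketched as follows (toy corpus and window size are assumed): CBOW predicts the center word from its context; skip-gram predicts each context word from the center word.

```python
def cbow_pairs(tokens, window):
    # (list of context words) -> center word
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

def skipgram_pairs(tokens, window):
    # center word -> one context word per pair
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(cbow_pairs(tokens, window=1)[1])       # (['the', 'sat'], 'cat')
print(skipgram_pairs(tokens, window=1)[:2])  # [('the', 'cat'), ('cat', 'the')]
```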
ELMo (Embeddings from Language Models): deep contextualized word representations
Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding. It uses a bi-directional LSTM, trained on a language-modeling objective, to create those embeddings. ELMo gained its language understanding from being trained to predict the next word in a sequence of words (and, in the backward direction, the previous word) – a task called language modeling.
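A minimal NumPy sketch (made-up layer activations) of how ELMo combines the biLM's layer representations for one token: softmax-normalized task-specific weights s_j and a scalar gamma give ELMo_k = gamma * sum_j s_j * h_{k,j}.

```python
import numpy as np

def elmo_embedding(layer_reps, s_logits, gamma):
    # layer_reps: (num_layers, dim) activations for one token position
    s = np.exp(s_logits) / np.exp(s_logits).sum()  # softmax over layers
    return gamma * (s[:, None] * layer_reps).sum(axis=0)

rng = np.random.default_rng(0)
layers = rng.normal(size=(3, 4))  # e.g. token layer + 2 biLSTM layers
emb = elmo_embedding(layers, s_logits=np.zeros(3), gamma=1.0)
# With uniform weights and gamma = 1, this is just the mean of the layers.
```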
ULM-FiT introduced a language model and a process to effectively fine-tune that language model for various tasks.
NLP finally had a way to do transfer learning about as well as Computer Vision could.
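One piece of ULM-FiT's fine-tuning recipe is discriminative fine-tuning: each layer gets its own learning rate, with eta_{l-1} = eta_l / 2.6, so general lower layers change less than task-specific upper ones. A small sketch (the base rate and layer count here are assumptions):

```python
def discriminative_lrs(top_lr, num_layers, factor=2.6):
    # Learning rates from the bottom layer to the top layer:
    # each layer below the top is slower by `factor`.
    lrs = [top_lr / factor ** i for i in range(num_layers)]
    return list(reversed(lrs))

lrs = discriminative_lrs(top_lr=0.01, num_layers=4)
# The bottom layer trains ~17x slower than the top (2.6**3 ≈ 17.6).
```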
GPT (Generative Pre-Training)
An attention model differs from a classic sequence-to-sequence model in two main ways:
First, the encoder passes a lot more data to the decoder: instead of passing only the last hidden state of the encoding stage, it passes all of the hidden states to the decoder
Second, an attention decoder does an extra step before producing its output. In order to focus on the parts of the input that are relevant to this decoding time step, the decoder does the following:
- Look at the set of encoder hidden states it received – each encoder hidden state is most associated with a certain word in the input sentence
- Give each hidden state a score (let’s ignore how the scoring is done for now)
- Multiply each hidden state by its softmaxed score, amplifying hidden states with high scores and drowning out hidden states with low scores, then sum the weighted states into a context vector for this time step
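The scoring-and-weighting steps above can be sketched in NumPy (dot-product scoring is one assumed choice; other score functions exist):

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    # 1) score each encoder hidden state against the decoder state
    scores = encoder_states @ decoder_state   # (num_words,)
    # 2) softmax the scores (shifted by the max for numerical stability)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # 3) weight each encoder state and sum into a context vector
    context = weights @ encoder_states        # (dim,)
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))  # 5 input words, hidden size 8
dec = rng.normal(size=8)       # current decoder hidden state
context, weights = attention_context(dec, enc)
# weights sum to 1; context is a weighted average of the encoder states.
```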
GPT does transfer learning to downstream tasks: pre-train a language model on unlabeled text, then fine-tune it on each task
GPT uses only the decoder blocks of the Transformer (no encoder)
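What makes a decoder-only model like GPT autoregressive is the causal self-attention mask: position i may only attend to positions up to and including i. A small NumPy sketch:

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular boolean mask over an n-token sequence.
    return np.tril(np.ones((n, n), dtype=bool))

m = causal_mask(4)
# Row i is True up to column i and False afterwards; masked-out scores
# are set to -inf before the softmax so future tokens get zero weight.
```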
Neural Machine Translation