—HiddenContent
@dduyg
Masked Language Modeling (MLM) is a technique used in natural language processing (NLP) for training models, particularly pre-trained models like BERT (Bidirectional Encoder Representations from Transformers). Unlike traditional language models such as GPT, which predict the next word in a sequence from left to right, MLM randomly masks certain words in a sentence and trains the model to predict these hidden words. By learning to fill in the missing parts of a sentence from the surrounding context, the model becomes highly adept at understanding the relationships between words in a bidirectional (both left-to-right and right-to-left) way.
Imagine a simple sentence: "The cat sat on the mat." In MLM, certain words in this sentence are masked, so it might look like: "The [MASK] sat on the [MASK]." The task of the model is to correctly predict the hidden words based on the surrounding context. In this case, the model would use the visible words ("The," "sat on the") to guess that the hidden words are "cat" and "mat." By constantly being exposed to sentences like this, where random parts are concealed, the model learns to predict missing words from the context of the entire sentence.
This approach stands in contrast to older models that only look at words in one direction, usually left to right. By training models to predict words based on both the words before and after them, MLM allows for a richer understanding of language. For example, in the sentence, "The quick brown fox jumps over the lazy dog," if you were to mask "brown" and "jumps," the model would need to figure out that "brown" fits well before "fox" and "jumps" fits after "fox" by considering the entire context of the sentence, including the words that come after the masked tokens.
During training, the model compares its predicted words with the actual words that were masked. The goal is to minimize the difference between the predicted and the true masked words. Over time, with millions of such examples, the model becomes highly proficient at understanding the relationships between words in a sentence. For instance, if the model predicted "quick" instead of "brown" for the first masked token, it would adjust its internal weights to improve future predictions. This process is repeated across vast amounts of data, enabling the model to learn a deep understanding of language.
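In practice this comparison is a cross-entropy loss computed only at the masked positions. Below is a minimal sketch assuming PyTorch, with random numbers standing in for the model's output scores and placeholder token ids for "brown" and "jumps"; the label value -100 marks positions that should not contribute to the loss.

```python
import torch
import torch.nn.functional as F

# "The quick brown fox jumps over the lazy dog" -> 9 tokens,
# with "brown" (position 2) and "jumps" (position 4) masked.
vocab_size, seq_len = 30522, 9
logits = torch.randn(seq_len, vocab_size, requires_grad=True)  # stand-in for model outputs

# Labels carry the true token id only at masked positions; -100 marks
# positions that are excluded from the loss.
labels = torch.full((seq_len,), -100, dtype=torch.long)
labels[2] = 2829   # placeholder id for "brown"
labels[4] = 14523  # placeholder id for "jumps"

loss = F.cross_entropy(logits, labels, ignore_index=-100)
loss.backward()    # gradients flow only from the two masked positions
print(loss.item())
```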
To make the model more robust, MLM uses a few additional strategies during the masking process. Out of the 15% of tokens that are selected for masking:
- 80% of the time, the token is replaced with the [MASK] token (e.g., "quick [MASK] fox").
- 10% of the time, the token is replaced with a random word (e.g., "quick blue fox"), which adds noise and prevents the model from simply trusting whatever token appears at a predicted position.
- 10% of the time, the token is left unchanged, which keeps the model's representation of a predicted position anchored to real, observed words rather than only to [MASK].
This randomness ensures that the model doesn't become overly reliant on the [MASK] token, which doesn’t appear in real-world data, and encourages it to learn more flexible word representations.
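A minimal sketch of this corruption step, assuming BERT-style conventions (token id 103 for [MASK], a vocabulary of 30,522 subword ids, and -100 as the label for positions the loss should ignore):

```python
import random

MASK_ID = 103          # [MASK] id in BERT's vocabulary (assumed here)
VOCAB_SIZE = 30522     # bert-base-uncased vocabulary size

def apply_mlm_masking(token_ids, mask_prob=0.15):
    """Return (corrupted_ids, labels) following the 80/10/10 rule."""
    corrupted, labels = [], []
    for tok in token_ids:
        if random.random() < mask_prob:           # token selected for prediction
            labels.append(tok)                    # model must recover the original
            r = random.random()
            if r < 0.8:                           # 80%: replace with [MASK]
                corrupted.append(MASK_ID)
            elif r < 0.9:                         # 10%: replace with a random token
                corrupted.append(random.randrange(VOCAB_SIZE))
            else:                                 # 10%: keep the original token
                corrupted.append(tok)
        else:
            labels.append(-100)                   # ignored by the loss
            corrupted.append(tok)
    return corrupted, labels
```

Production code does the same thing vectorized over whole batches; in the Hugging Face ecosystem, for example, DataCollatorForLanguageModeling implements this 80/10/10 rule.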
The strength of MLM lies in its bidirectional nature. Traditional language models like GPT process words in a single direction, typically from left to right, which limits their ability to fully grasp the meaning of words in context. For instance, in "She went to the [MASK] to deposit money," the words before the blank alone leave the prediction wide open; it is the right-hand context, "to deposit money," that points to "bank" in its financial sense. By using both the left and right context, MLM models can resolve this kind of ambiguity (financial institution versus riverbank) and make more accurate predictions.
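One way to see this in action is with an off-the-shelf MLM. The snippet below assumes the Hugging Face transformers library is installed (it downloads bert-base-uncased on first use); the exact predictions may vary, but the right-hand context should narrow them sharply.

```python
from transformers import pipeline

# fill-mask runs a pretrained MLM head and scores candidate tokens for
# [MASK] using context on BOTH sides of the blank.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# With no right-hand context, the prediction is wide open...
for pred in unmasker("She went to the [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))

# ...but the words after the blank pin the meaning down.
for pred in unmasker("She went to the [MASK] to deposit money.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```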
In real-world applications, this bidirectional training allows models like BERT to excel at tasks such as question answering, named entity recognition, and sentiment analysis. After pretraining on vast datasets with MLM, these models can be fine-tuned on smaller, task-specific datasets and perform those tasks remarkably well.
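As a rough sketch of that fine-tuning step (again assuming transformers and PyTorch, with a made-up two-sentence "dataset" for sentiment classification), the MLM-pretrained encoder is loaded with a fresh classification head:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The encoder weights come from MLM pretraining; only the small
# classification head on top is initialized from scratch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)        # e.g. negative / positive sentiment

batch = tokenizer(["I loved this film.", "Utterly boring."],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)       # returns loss and logits
outputs.loss.backward()                       # gradients for one fine-tuning step
print(outputs.logits.shape)                   # (2 sentences, 2 classes)
```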
The combination of input masking, contextual prediction, and bidirectional learning is what makes MLM one of the most innovative and effective techniques in modern NLP, and the foundation for models like BERT that continue to drive advancements in natural language understanding.