[Exercises] Skip-gram Model (Word2Vec)

The following exercises focus on manually implementing the Skip-gram model for learning word embeddings. Skip-gram predicts context words given a center word. The exercises are divided into two parts: Part 1 uses the full softmax loss function, and Part 2 uses negative sampling for efficiency. Students must show detailed calculations with specific numerical values. To assist with the calculations, a tool such as Microsoft Excel is recommended for handling the matrix and vector operations.

Sample Data and Training Pair Generation

● Sample Data: Text: "The cat likes to eat fresh fish."
  o Tokenized into words: ["The", "cat", "likes", "to", "eat", "fresh", "fish"].
  o Vocabulary: {"The": 0, "cat": 1, "likes": 2, "to": 3, "eat": 4, "fresh": 5, "fish": 6} (V = 7 words, indexed from 0 to 6).
● Generating Training Pairs for Skip-gram:
  o Use a context window size of 2: for each center word w_t at position t (t = 1 to 7), create pairs (w_t, w_c) where w_c is a context word within positions t-2 to t+2 (excluding w_t itself and staying within the sentence boundaries). A code sketch that reproduces this pair generation follows the list below.
  o List of word pairs:
    ▪ "The" (t=1, index=0): Context (t=2,3): "cat" (1), "likes" (2) → pairs: (0,1), (0,2).
    ▪ "cat" (t=2, index=1): Context (t=1,3,4): "The" (0), "likes" (2), "to" (3) → pairs: (1,0), (1,2), (1,3).
    ▪ "likes" (t=3, index=2): Context (t=1,2,4,5): "The" (0), "cat" (1), "to" (3), "eat" (4) → pairs: (2,0), (2,1), (2,3), (2,4).
    ▪ "to" (t=4, index=3): Context (t=2,3,5,6): "cat" (1), "likes" (2), "eat" (4), "fresh" (5) → pairs: (3,1), (3,2), (3,4), (3,5).
    ▪ "eat" (t=5, index=4): Context (t=3,4,6,7): "likes" (2), "to" (3), "fresh" (5), "fish" (6) → pairs: (4,2), (4,3), (4,5), (4,6).
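To make the pair-generation procedure concrete, here is a minimal Python sketch (the choice of Python is an assumption; the exercises themselves are meant to be worked by hand) that tokenizes the sample sentence, builds the vocabulary, and enumerates the (center index, context index) pairs for a window size of 2. Note that the code uses 0-based positions internally, while the list above numbers positions t from 1.

# Minimal sketch of Skip-gram training-pair generation for the sample
# sentence. This is only a way to check the hand-derived pairs, not part
# of the required manual calculation.

sentence = "The cat likes to eat fresh fish."
tokens = sentence.rstrip(".").split()   # ["The", "cat", ..., "fish"]
# All 7 tokens are distinct here, so tokens map directly to vocab indices 0..6.
vocab = {word: i for i, word in enumerate(tokens)}
window = 2                              # context window size

pairs = []
for t, center in enumerate(tokens):     # t is 0-based (text above uses 1-based)
    # Context positions: t-window .. t+window, excluding t itself and
    # any position outside the sentence boundaries.
    for c in range(max(0, t - window), min(len(tokens), t + window + 1)):
        if c != t:
            pairs.append((vocab[center], vocab[tokens[c]]))

print(pairs)
# [(0, 1), (0, 2), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (2, 3), (2, 4), ...]

The printed output matches the pair list above, e.g. (0,1) and (0,2) for "The", and (4,2), (4,3), (4,5), (4,6) for "eat".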