[Slide] Cross-Attention - NO Residual & NO LayerNorm (EN)
Q = [ 0.370 0.500
      0.409 0.628
      0.222 0.403 ]
▪ Similarly, K = Q, V = Q.
o Step 2: Scores:
▪ scores = QKᵀ / √2:
scores = [ 0.274 0.329 0.201
           0.329 0.397 0.243
           0.201 0.243 0.150 ]
o Step 3: Weights (Softmax):
weights = [ 0.335 0.354 0.311
            0.332 0.357 0.311
            0.333 0.356 0.311 ]
o Step 4: Output:
Z_encoder = [ 0.338 0.515
              0.339 0.516
              0.338 0.516 ]
• No Residual Connection:
o The output is purely Z, which simplifies the computation.
• Purpose:
o Creates contextual representations for the Decoder.

Slide 4: Exercise 3 - Feed-Forward Network (FFN) and Masked Self-Attention in Decoder
• Objective: Compute the forward pass through the Encoder FFN and the Decoder Masked Self-Attention.
Part 1: Feed-Forward Network (FFN) in Encoder
• Formula:
o FFN(x) = ReLU(xW₁ + b₁)W₂ + b₂
• Parameters:
o W₁ = [ 0.5 0.6
        0.7 0.8 ], b₁ = [0.1, 0.2]
o W₂ = [ 0.9 1.0
        1.1 1.2 ], b₂ = [0.3, 0.4]
• Input: Token 1 from Slide 3: [0.338, 0.515]
• Computation:
o First layer: [0.630, 1.015]
o ReLU: [0.630, 1.015]
o Second layer: [1.982, 2.248]
• FFN Output (for all tokens):
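The encoder self-attention steps above (scores, softmax, weighted sum) can be checked with a short Python sketch. This uses plain lists and the slide's rounded Q; because the slide rounds to three decimals, the recomputed softmax weights differ from the printed ones in the third decimal place.

```python
import math

def softmax(row):
    # Numerically stable row-wise softmax
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

# Q from the slide; the slide sets K = Q and V = Q
Q = [[0.370, 0.500], [0.409, 0.628], [0.222, 0.403]]
K, V = Q, Q
d_k = 2  # key dimension, giving the 1/sqrt(2) scaling

# scores = Q K^T / sqrt(d_k)
scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k) for kr in K]
          for qr in Q]

# weights = softmax(scores), applied row by row
weights = [softmax(row) for row in scores]

# Z = weights @ V
Z = [[sum(w * v[j] for w, v in zip(wr, V)) for j in range(d_k)]
     for wr in weights]

print(scores[0])  # ~ [0.274, 0.329, 0.201], matching the slide
print(Z[0])       # ~ [0.338, 0.515], the Token 1 row of Z_encoder
```

The same `softmax` and matrix products carry over unchanged to the masked decoder case; only the score matrix is modified there.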
FFN_encoder = [ 1.982 2.248
                1.986 2.252
                1.986 2.252 ]
Part 2: Masked Self-Attention in Decoder
• Formula: Same as Self-Attention, with a mask to prevent attending to future tokens.
• Parameters: Same W_Q, W_K, W_V as Slide 3.
• Mask:
[ 0 −∞
  0  0 ]
• Input: X_decoder (from Slide 2):
X_decoder = [ 0.300 1.250
              1.341 0.990 ]
• Computation:
o Step 1: Compute Q, K, V:
▪ Q = X_decoder · W_Q:
Q = [ 0.405 0.560
      0.431 0.635 ]
▪ Similarly, K = Q, V = Q.
o Step 2: Scores:
▪ scores = QKᵀ / √2:
scores = [ 0.341 0.395
           0.395 0.421 ]
o Step 3: Apply Mask:
scores = [ 0.341  −∞
           0.395 0.421 ]
o Step 4: Weights (Softmax):
▪ Row 1: [e^0.341, 0], Weights: [1, 0]
▪ Row 2: [e^0.395, e^0.421], Sum ≈ 3.008, Weights: [0.494, 0.506]
weights = [ 1     0
            0.494 0.506 ]
o Step 5: Output:
Z_decoder = [ 0.405 0.560
              0.418 0.598 ]

Slide 5: Exercise 4 - Backward Pass and Gradient Computation
• Objective: Compute gradients through the Decoder's Softmax and FFN.
• Cross-Entropy Loss:
o Gradient through the Softmax: ŷ − y
• Example:
o Decoder FFN output (assumed, for token 1): [0.2, 0.6, 0.1]
o Softmax:
▪ e^0.2 ≈ 1.221, e^0.6 ≈ 1.822, e^0.1 ≈ 1.105
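The masked self-attention of Slide 4 can be sketched the same way as the encoder case, with a causal mask setting future positions to −∞ before the softmax (in Python, `math.exp` of −∞ is exactly 0.0, so masked positions get zero weight). Starting from the slide's rounded Q, the recomputed row-2 weights land within about 0.005 of the printed [0.494, 0.506].

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    # Stable softmax; exp(-inf) evaluates to 0.0, zeroing masked entries
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

# Q from Slide 4 (the slide sets K = Q, V = Q), d_k = 2
Q = [[0.405, 0.560], [0.431, 0.635]]
K, V = Q, Q
d_k = 2

scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k) for kr in K]
          for qr in Q]

# Causal mask: position i must not attend to positions j > i
for i in range(len(scores)):
    for j in range(len(scores[i])):
        if j > i:
            scores[i][j] = NEG_INF

weights = [softmax(row) for row in scores]
Z = [[sum(w * v[j] for w, v in zip(wr, V)) for j in range(d_k)]
     for wr in weights]

print(weights[0])  # [1.0, 0.0]: row 1 can only attend to itself
print(Z[1])        # ~ [0.418, 0.598], matching row 2 of Z_decoder
```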
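The gradient rule ŷ − y from Slide 5 can also be verified numerically on the logits [0.2, 0.6, 0.1]. The slide is truncated before naming the target token, so the one-hot target below (class 1) is an assumption for illustration; whatever the target, the gradient components sum to zero because the softmax outputs sum to one.

```python
import math

def softmax(logits):
    # Stable softmax over a logit vector
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [0.2, 0.6, 0.1]   # decoder FFN output for token 1 (from the slide)
y_hat = softmax(logits)    # ~ [0.294, 0.439, 0.266]

# Hypothetical one-hot target: assume class 1 is the correct token
y = [0.0, 1.0, 0.0]

# Gradient of cross-entropy loss w.r.t. the logits: y_hat - y
grad = [p - t for p, t in zip(y_hat, y)]

print(y_hat)
print(grad)  # components sum to ~0
```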
