[SGU] Assignment_2 - Deep Learning (2025-2026_HK1).docx


Tokenization

    from torchtext.data.utils import get_tokenizer

    en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
    vi_tokenizer = get_tokenizer('spacy', language='vi_core_news_sm')

Building the vocabulary
● Use torchtext.vocab.build_vocab_from_iterator.
● Add the four special tokens (unknown, padding, start-of-sequence, end-of-sequence).
● Limit: the 10,000 most frequent words in each language.

Padding & Packing
● Use torchtext.data.Field or pad_sequence to bring all sequences in a batch to the same length.
● Sort each batch by sequence length in descending order.

DataLoader
● Batch size: 32–128
● Use torchtext.data.BucketIterator or a custom collate_fn.

6.2. Building the model

Encoder

    (h_t, c_t) = LSTM(Embed(x_t), (h_{t-1}, c_{t-1}))

● Input: sequence of English tokens → embedding (size 256–512).
● Output: context vector = (h_n, c_n) → passed to the Decoder.

Decoder

    (h_t, c_t) = LSTM(Embed(y_{t-1}), (h_{t-1}, c_{t-1}))
    p(y_t) = softmax(Linear(h_t))

● Input: start-of-sequence token + context vector from the Encoder.
● Output: probability distribution over the next word.

Recommended parameters

    Parameter                Value
    Hidden size              512
    Embedding dim            256–512
    Number of LSTM layers    1–2
    Dropout                  0.3–0.5
    Teacher forcing ratio    0.5
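The Encoder and Decoder equations in section 6.2 could be sketched in PyTorch roughly as follows. This is a minimal sketch under stated assumptions, not the assignment's reference solution: the class names, the helper run_seq2seq, the sequence-first tensor layout, and the convention that the first row of tgt holds the start-of-sequence token are all illustrative choices. The defaults follow the recommended parameters (hidden size 512, embedding dim 256, 2 layers, dropout 0.3, teacher forcing ratio 0.5).

```python
import random

import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Embeds source tokens and returns the final LSTM state (h_n, c_n)."""

    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, n_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src: (src_len, batch) tensor of token indices
        embedded = self.dropout(self.embedding(src))
        _, (h_n, c_n) = self.lstm(embedded)
        return h_n, c_n  # the context vector handed to the decoder


class Decoder(nn.Module):
    """Runs one decoding step: previous token + state -> next-token logits."""

    def __init__(self, vocab_size, emb_dim=256, hid_dim=512, n_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.fc_out = nn.Linear(hid_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token, hidden, cell):
        # token: (batch,) -> add a length-1 time dimension for the LSTM
        embedded = self.dropout(self.embedding(token.unsqueeze(0)))
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        # Raw logits; softmax is left to the loss (e.g. nn.CrossEntropyLoss)
        logits = self.fc_out(output.squeeze(0))
        return logits, hidden, cell


def run_seq2seq(encoder, decoder, src, tgt, teacher_forcing_ratio=0.5):
    """Hypothetical helper: decodes step by step with teacher forcing.

    Assumes tgt[0] holds the start-of-sequence token for every sentence.
    Returns logits of shape (tgt_len - 1, batch, tgt_vocab_size).
    """
    hidden, cell = encoder(src)
    token = tgt[0]
    outputs = []
    for t in range(1, tgt.shape[0]):
        logits, hidden, cell = decoder(token, hidden, cell)
        outputs.append(logits)
        # With probability teacher_forcing_ratio feed the ground-truth token,
        # otherwise feed the model's own most likely prediction.
        use_teacher = random.random() < teacher_forcing_ratio
        token = tgt[t] if use_teacher else logits.argmax(dim=1)
    return torch.stack(outputs)
```

The decoder returns logits rather than probabilities because nn.CrossEntropyLoss applies the softmax internally; applying torch.softmax to the logits recovers p(y_t) as written in the Decoder equation.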
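For the Padding & Packing and DataLoader steps, a custom collate_fn built on pad_sequence might look like the sketch below. It assumes each dataset item is already a pair of index lists (source, target) and that the padding token has index 0; PAD_IDX and collate_batch are illustrative names, not part of the assignment.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # assumption: index of the padding token in both vocabularies


def collate_batch(batch):
    """Turns a list of (src_ids, tgt_ids) pairs into padded tensors.

    Returns (src_pad, src_lens, tgt_pad), with padded tensors shaped
    (max_len, batch_size) because pad_sequence defaults to batch_first=False.
    """
    # Sort by source length, longest first, so the lengths can later be
    # passed to pack_padded_sequence without enforce_sorted=False.
    batch = sorted(batch, key=lambda pair: len(pair[0]), reverse=True)
    src_seqs = [torch.tensor(src, dtype=torch.long) for src, _ in batch]
    tgt_seqs = [torch.tensor(tgt, dtype=torch.long) for _, tgt in batch]
    src_lens = torch.tensor([len(seq) for seq in src_seqs])
    # Pad every sequence up to the longest one in this batch.
    src_pad = pad_sequence(src_seqs, padding_value=PAD_IDX)
    tgt_pad = pad_sequence(tgt_seqs, padding_value=PAD_IDX)
    return src_pad, src_lens, tgt_pad


# Toy batch of three sentence pairs with different lengths
toy_batch = [([5, 6], [7, 8, 9]), ([1, 2, 3, 4], [2, 3]), ([9, 8, 7], [4])]
src_pad, src_lens, tgt_pad = collate_batch(toy_batch)
```

Such a function can be passed to torch.utils.data.DataLoader through its collate_fn argument, together with a batch size in the suggested 32–128 range.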
