[TDTU] Assignment (20%) - Deep Learning (2024-2025)
TDTU 503077 - DEEP LEARNING
--------------------------------------------------------------------------------------------------

ASSIGNMENT 2 (20%)
Sentiment Analysis of Short Text and Context Using RNN

● Assignment: Group (2-3 students)
● Submission deadline: 11:59 PM, 28 March 2025
● Submission method: E-Learning system

Objective
Build a model that predicts sentiment labels (Positive, Negative, Neutral) from a short text accompanied by supplementary context, using Recurrent Neural Networks (RNN) with Word Embeddings. The model is deliberately simple, excluding Attention and CNN layers, so that students new to Deep Learning can focus on applying an RNN to a sentiment classification task.

Scope
● Text Type: Short text under 50 words, related to work or study (e.g., "I was late for work today.").
● Context Type: Context under 20 words, describing a related situation (e.g., "The speaker just overslept.").
● Output Type: Sentiment labels: Positive, Negative, Neutral.
● Example: Text: "I was late for work today." | Context: "The speaker just overslept." | Output: "Negative".

Environment Requirements
● Python 3.8 or higher.
● Libraries: pip install torch pandas numpy nltk scikit-learn torchtext.
● Instructions:
  o Run pip install -r requirements.txt if using a requirements.txt file.
  o Verify the Torch installation: python -c 'import torch; print(torch.__version__)'.
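If your group uses a requirements.txt file, a minimal version matching the pip command above can list just the package names; the handout does not mandate specific versions, so none are pinned here:

torch
pandas
numpy
nltk
scikit-learn
torchtext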
Data Preparation
● Data Source: Collect from social media (e.g., X) or generate with an LLM using the prompt: "Generate 10 pairs of short text (<50 words) and context (<20 words) about emotions in work/study, with Positive/Negative/Neutral labels." (Repeat the prompt as needed to reach the minimum size.)
● Size: Minimum 500 samples, stored in sentiment_data.csv.
● Sample Format:
  text,context,label
  "I was late for work today.","The speaker just overslept.","Negative"
  "I just finished my project!","Worked all week on it.","Positive"
● Instructions (see the sketch after this list):
  o Load with pandas: data = pd.read_csv('sentiment_data.csv').
  o Split the data 80% train / 20% test using train_test_split.
  o Set max_len_text=50 and max_len_context=20 to match the scope.
  o Note: Drop empty rows with data.dropna(). If the labels are imbalanced (e.g., too many Negatives), consider resampling.
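A minimal sketch of the loading and splitting steps above; the random_state and stratify arguments are suggestions, not requirements of the handout:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset and drop empty rows, as noted above.
data = pd.read_csv('sentiment_data.csv').dropna()

# 80/20 split; stratifying on the label keeps the class proportions
# similar in both splits, which helps when the labels are imbalanced.
train_df, test_df = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data['label'])

print(len(train_df), len(test_df))  # e.g., 400 and 100 for 500 samples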
Model Construction

Architecture
1. Input: Text and context are each mapped through Word Embeddings (100D) → a matrix of shape [number of words × 100].
2. RNN: 128 hidden units, processing text and context separately.
3. Combination: Concatenate the final hidden states of text and context.
4. Output: Dense layer + softmax (3 classes).

Purpose of code (model.py)
● Purpose: Define an RNN model to predict sentiment from text and context.
  o embedding: converts words into numerical vectors (GloVe vectors if pretrained=True, trained from scratch if pretrained=False).
  o rnn: processes the text and context sequences to generate hidden states that summarize them.
  o fc: predicts sentiment labels from the combined hidden states.

Sample code for modification (model.py):

import torch
import torch.nn as nn
import torchtext.vocab as vocab

class RNNModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, pretrained=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        if pretrained:
            # Load 6B-token GloVe vectors; embedding_dim must be 50, 100, 200, or 300.
            glove = vocab.GloVe(name='6B', dim=embedding_dim)
            if glove.vectors.shape[0] < vocab_size:
                raise ValueError("vocab_size exceeds GloVe vocabulary size!")
            # Note: this copies the first vocab_size GloVe rows by position; for
            # proper alignment, look up each word of your own vocabulary instead.
            self.embedding.weight.data.copy_(glove.vectors[:vocab_size])
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        # Two final hidden states (text + context) are concatenated, hence * 2.
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, text, context):
        text_embed = self.embedding(text)          # [batch, seq_len, embedding_dim]
        context_embed = self.embedding(context)
        _, text_hidden = self.rnn(text_embed)      # final hidden state: [1, batch, hidden_dim]
        _, context_hidden = self.rnn(context_embed)
        combined = torch.cat((text_hidden.squeeze(0),
                              context_hidden.squeeze(0)), dim=1)
        # Returns raw logits; softmax is applied implicitly by nn.CrossEntropyLoss.
        return self.fc(combined)
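A quick sanity check of the model's shapes; this snippet is not part of the handout, and the batch size of 4 is arbitrary:

import torch
from model import RNNModel

# vocab_size matches data.py: <PAD> + <UNK> + up to 4998 frequent words = 5000.
model = RNNModel(vocab_size=5000, embedding_dim=100, hidden_dim=128, output_dim=3)

text = torch.randint(0, 5000, (4, 50))     # dummy batch of 4 padded texts (max_len_text=50)
context = torch.randint(0, 5000, (4, 20))  # matching contexts (max_len_context=20)

logits = model(text, context)
print(logits.shape)  # torch.Size([4, 3]): one logit per class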
Purpose of code (data.py)
● Purpose: Prepare input data for the RNN model.
  o Loads and processes the data from CSV, tokenizes it into words, and builds a vocabulary (vocab).
  o Converts each text/context into a sequence of indices and pads it to a uniform length.
  o Creates DataLoaders to supply batched data for training and evaluation.

Code (data.py)

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from nltk.tokenize import word_tokenize
import nltk
from sklearn.model_selection import train_test_split
from collections import Counter

nltk.download('punkt')

# Load the CSV, drop empty rows, and map labels to integer classes.
data = pd.read_csv('sentiment_data.csv').dropna()
texts = data['text'].tolist()
contexts = data['context'].tolist()
labels = data['label'].map({'Positive': 0, 'Negative': 1, 'Neutral': 2}).tolist()

# Lowercase and tokenize both columns.
tokenized_texts = [word_tokenize(t.lower()) for t in texts]
tokenized_contexts = [word_tokenize(c.lower()) for c in contexts]

# Build a 5000-entry vocabulary: <PAD>, <UNK>, plus the 4998 most frequent words.
all_words = [w for txt in (tokenized_texts + tokenized_contexts) for w in txt]
most_common = Counter(all_words).most_common(4998)
vocab = {'<PAD>': 0, '<UNK>': 1}
for i, (w, _) in enumerate(most_common, 2):
    vocab[w] = i
vocab_size = len(vocab)

def to_indices(tokens, max_len):
    # Unknown words map to <UNK> (1); sequences are truncated, then padded with <PAD> (0).
    idxs = [vocab.get(t, 1) for t in tokens][:max_len]
    return idxs + [0] * (max_len - len(idxs))

max_len_text, max_len_context = 50, 20
text_indices = [to_indices(t, max_len_text) for t in tokenized_texts]
context_indices = [to_indices(c, max_len_context) for c in tokenized_contexts]
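The excerpt above ends before the DataLoader step listed in the purpose. Assuming data.py continues from text_indices, context_indices, and labels as prepared above, one minimal way to finish it (class and variable names here are illustrative, not from the handout) is:

# Continuation of data.py: wrap the padded sequences in a Dataset
# and build the batched loaders for training and evaluation.
class SentimentDataset(Dataset):
    def __init__(self, text_idx, context_idx, label_list):
        self.text = torch.tensor(text_idx, dtype=torch.long)
        self.context = torch.tensor(context_idx, dtype=torch.long)
        self.labels = torch.tensor(label_list, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return self.text[i], self.context[i], self.labels[i]

# 80/20 split over sample positions, matching the instructions above.
train_idx, test_idx = train_test_split(
    list(range(len(labels))), test_size=0.2, random_state=42)

def make_loader(idxs, shuffle):
    ds = SentimentDataset([text_indices[i] for i in idxs],
                          [context_indices[i] for i in idxs],
                          [labels[i] for i in idxs])
    return DataLoader(ds, batch_size=32, shuffle=shuffle)

train_loader = make_loader(train_idx, shuffle=True)
test_loader = make_loader(test_idx, shuffle=False)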