[2]: import math
     import re
     import sys
     import torch
     import nltk
     nltk.download('punkt', quiet=True)  # this module is used to tokenize the text

[2]: True

Where we’re headed: Nearest neighbor text classification works by classifying a novel text with the same class as that of the training text that is closest according to some distance metric. These metrics are calculated based on representations of the texts. In this lab, we’ll introduce some different representations and you’ll use nearest neighbor classification to predict the speaker of sentences selected from a children’s book.

The objectives of this lab are to:

• Clarify terminology around words and texts,
• Manipulate different representations of words and texts,
• Apply these representations to calculate text similarity, and
• Classify documents by a simple nearest neighbor model.

In this and later labs, we will have you carry out several exercises in notebook cells. The cells you are to do are marked ‘#TODO’. They will typically have a ... where your code or answer should go. Where specified, you needn’t write code to calculate the answer, but instead, simply work out the answer yourself and enter it.

New bits of Python used for the first time in the solution set for this lab, and which you may therefore find useful:

• math.acos
• math.pi
• re.match
• set
• sorted
• str.join
• str.lower
• torch.dot
• torch.linalg.norm
• torch.maximum
• torch.minimum
• torch.stack
• torch.sum
• torch.Tensor.type
• torch.where
• torch.zeros
• torch.zeros_like
• nltk.tokenize.word_tokenize
• nltk.tokenize.WhitespaceTokenizer

1 Counting words

Here are five sentences from Dr. Seuss’s Green Eggs and Ham:

Would you like them here or there?
I would not like them here or there.
I would not like them anywhere.
I do not like green eggs and ham.
I do not like them, Sam-I-am.

We’ll make this text available in the variable text.

[3]: text = """
     Would you like them here or there?
     I would not like them here or there.
     I would not like them anywhere.
     I do not like green eggs and ham.
     I do not like them, Sam-I-am.
     """

A Python string like this is, of course, a sequence of characters. But we think of this text as a sequence of sentences, each composed of a sequence of words. How many words are there in this text? That is a fraught question, for several reasons, including

• The type-token distinction
• Tokenization issues
• Normalization

1.1 Types versus tokens

In determining the number of words in the text, are we talking about word types or word tokens? (For instance, there are five tokens of the word type ‘like’.) How many word tokens are there in total in this text? (Just count them manually.) Assign the number to the variable token_count in the next cell.

[4]: #TODO - define `token_count` to be the number of tokens in `text`
     token_count = 41 # SOLUTION

[ ]: grader.check("token_count")

How many word types are there? (Again, you can just count manually.)
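If you’d like to double-check your manual counts, one option is to let NLTK tokenize the string for you. This is just a sketch, not part of the exercise: it assumes word_tokenize as the tokenizer, which splits punctuation marks off as separate tokens, so its totals may differ from a count that ignores punctuation; and it lowercases tokens before counting types, which is only one of several possible normalization choices.

     from nltk.tokenize import word_tokenize

     tokens = word_tokenize(text)                    # word and punctuation tokens
     types = set(token.lower() for token in tokens)  # distinct (lowercased) tokens
     print(len(tokens), len(types))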