CS236299 Lab 2-5: Sequence labeling with recurrent neural networks

May 18, 2023

[ ]: # Initialize Otter
     import otter
     grader = otter.Notebook()

[1]: # Please do not change this cell because some hidden tests might depend on it.
     import os

     # Otter grader does not handle ! commands well, so we define and use our
     # own function to execute shell commands.
     def shell(commands, warn=True):
         """Executes the string `commands` as a sequence of shell commands.

         Prints the result to stdout and returns the exit status.
         Provides a printed warning on non-zero exit status unless `warn`
         flag is unset.
         """
         file = os.popen(commands)
         print(file.read().rstrip('\n'))
         exit_status = file.close()
         if warn and exit_status is not None:
             print(f"Completed with errors. Exit status: {exit_status}\n")
         return exit_status

     shell("""
     ls requirements.txt >/dev/null 2>&1
     if [ ! $? = 0 ]; then
      rm -rf .tmp
      git clone [email protected]:cs236299-2023-spring/lab2-5.git .tmp
      mv .tmp/tests ./
      mv .tmp/requirements.txt ./
      rm -rf .tmp
     fi
     pip install -q -r requirements.txt
     """)
     [notice] A new release of pip is available: 23.0 -> 23.1.2
     [notice] To update, run: python3.8 -m pip install --upgrade pip

In the last lab, you saw how to use hidden Markov models (HMMs) for sequence labeling. In this lab, you will use recurrent neural networks (RNNs) for the same purpose.

We consider the task of automatic punctuation restoration from unpunctuated text, which is useful for post-processing transcribed speech from speech recognition systems (since we don’t want users to have to utter all punctuation marks). We can formulate this as a sequence labeling task: for each word, we predict the punctuation that should follow it. If no punctuation follows the word, we use a special tag O for “other”.

The dataset we use is the Federalist papers, but this time we use text without punctuation as our input and predict the punctuation following each word. An example constructed from the dataset is shown below. It corresponds to the punctuated sentence

     the powers to make treaties and to send and receive ambassadors , speak their own propriety .

     Token         Label
     the           O
     powers        O
     to            O
     make          O
     treaties      O
     and           O
     to            O
     send          O
     and           O
     receive       O
     ambassadors   ,
     speak         O
     their         O
     own           O
     propriety     .

1 Preparation and setup

[2]: import copy
     import wget
     import torch
     import torch.nn as nn
     import csv
     import random

     from datasets import load_dataset
     from tokenizers import Tokenizer
     from tokenizers.pre_tokenizers import WhitespaceSplit
     from tokenizers import normalizers
     from tokenizers.models import WordLevel
     from tokenizers.trainers import WordLevelTrainer
     from transformers import PreTrainedTokenizerFast
     from collections import Counter
     from tqdm.auto import tqdm

     # Fix random seed for replicability
     SEED = 1234
     random.seed(SEED)
     torch.manual_seed(SEED)

[3]: ## GPU check
     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
     print(device)

     cpu

[4]: # Prepare to download needed data
     def download_if_needed(source, dest, filename):
         os.makedirs(dest, exist_ok=True)  # ensure destination directory exists
         # Download only if the file is not already present locally
         if not os.path.exists(os.path.join(dest, filename)):
             wget.download(source + filename, out=dest)

     source_path = "https://raw.githubusercontent.com/nlp-236299/data/master/Federalist/"
     data_path = "data/"

     # Download files
     for filename in ["federalist_tag.train.txt",
                      "federalist_tag.dev.txt",
                      "federalist_tag.test.txt"]:
         download_if_needed(source_path, data_path, filename)

Next, we process the dataset by extracting the sequences and their corresponding labels and saving them in CSV format.
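To make the labeling scheme concrete, here is a minimal sketch of how a punctuated sentence could be turned into (token, label) pairs and written out as CSV. It is only an illustration under stated assumptions, not the lab's actual preprocessing code: the format of the `federalist_tag.*.txt` files is not shown here, so the sketch starts from a whitespace-tokenized punctuated sentence like the example above. The names `PUNCTUATION`, `to_labeled_pairs`, `write_csv`, and the CSV column names `text` and `tag` are all hypothetical.

```python
import csv
import io

# Assumption: the set of marks treated as labels. The lab's real tag set
# is not shown in this section.
PUNCTUATION = {",", ".", ";", ":", "?", "!"}
NO_PUNCT = "O"  # tag meaning "no punctuation follows this word"


def to_labeled_pairs(punctuated_sentence):
    """Turn a whitespace-tokenized, punctuated sentence into (token, label) pairs."""
    pairs = []
    for item in punctuated_sentence.split():
        if item in PUNCTUATION:
            if pairs:
                # The mark becomes the label of the preceding word.
                # (A second consecutive mark overwrites the first; a mark
                # with no preceding word is dropped.)
                word, _ = pairs[-1]
                pairs[-1] = (word, item)
        else:
            pairs.append((item, NO_PUNCT))
    return pairs


def write_csv(sentences, out):
    """Write one row per sentence: space-joined tokens, space-joined labels."""
    writer = csv.writer(out)
    writer.writerow(["text", "tag"])  # hypothetical column names
    for sentence in sentences:
        pairs = to_labeled_pairs(sentence)
        if not pairs:
            continue  # skip empty lines
        tokens, labels = zip(*pairs)
        writer.writerow([" ".join(tokens), " ".join(labels)])


sentence = ("the powers to make treaties and to send and receive "
            "ambassadors , speak their own propriety .")
buffer = io.StringIO()  # stands in for an open CSV file
write_csv([sentence], buffer)
print(buffer.getvalue())
```

Keeping the tokens and the labels as two space-joined fields of equal length means each CSV row holds one full training sequence. The `csv` module quotes the label field automatically when it contains a comma, so reading the file back with `csv.reader` recovers the labels unchanged.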
