Discuss, Learn and be Happy: Discussion of Questions

Explain why sequence (structured) prediction is different from the greedy model of POS tagging we reviewed in previous lectures. What are we trying to optimize differently?
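
A minimal sketch of the contrast, with invented toy scores: greedy tagging picks the best tag per position in isolation, while structured prediction searches for the tag sequence whose total score, including tag-to-tag interactions, is highest.

from itertools import product

TAGS = ["DT", "NN", "VB"]                    # toy tag set
words = ["the", "dog", "barks"]

# Toy per-position scores and tag-pair transition scores;
# the numbers are invented purely for illustration.
score_local = [{"DT": 2.0, "NN": 0.1, "VB": 0.1},
               {"DT": 0.1, "NN": 1.0, "VB": 1.2},
               {"DT": 0.1, "NN": 0.8, "VB": 1.1}]
trans = {(p, c): (1.0 if (p, c) in {("DT", "NN"), ("NN", "VB")} else 0.0)
         for p in TAGS for c in TAGS}

# Greedy: argmax per position, ignoring interactions between tags.
greedy = [max(TAGS, key=lambda t: score_local[i][t]) for i in range(len(words))]

# Structured: argmax over whole tag sequences, local plus transition scores.
def seq_score(seq):
    s = sum(score_local[i][t] for i, t in enumerate(seq))
    return s + sum(trans[(seq[i - 1], seq[i])] for i in range(1, len(seq)))

best = max(product(TAGS, repeat=len(words)), key=seq_score)
print(greedy, list(best))    # greedy: DT VB VB, structured: DT NN VB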

What is the difference between a generative probabilistic model and a discriminative one? Explain which formal tool is used to specify a generative model.

In a Hidden Markov Model of a tagging problem, given a sequence of words x_1 ... x_n and the predicted labels y_1 ... y_{n+1}, we model the joint distribution p(x, y) as:

p(x_1 ... x_n, y_1 ... y_{n+1}) = Π_{i=1..n+1} q(y_i | y_{i-2}, y_{i-1}) × Π_{i=1..n} e(x_i | y_i)

assuming y_0 = y_{-1} = * and y_{n+1} = STOP.

Write down the independence assumptions which correspond to this equation.
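
A minimal sketch of how this factorization can be computed, assuming q and e are given as dictionaries keyed by (y_{i-2}, y_{i-1}, y_i) triples and (x_i, y_i) pairs respectively (the key formats are an assumption for illustration):

def hmm_joint(words, tags, q, e):
    """Joint probability p(x, y) under a trigram HMM.
    tags has the same length as words; '*' padding and STOP are added here."""
    padded = ["*", "*"] + list(tags) + ["STOP"]
    p = 1.0
    # transition terms: q(y_i | y_{i-2}, y_{i-1}) for i = 1 .. n+1
    for i in range(2, len(padded)):
        p *= q[(padded[i - 2], padded[i - 1], padded[i])]
    # emission terms: e(x_i | y_i) for i = 1 .. n
    for word, tag in zip(words, tags):
        p *= e[(word, tag)]
    return p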

In the HMM model, the Viterbi algorithm is used to solve the decoding problem efficiently. Explain what decoding is, why it is computationally expensive, and why dynamic programming is an appropriate strategy for this task.
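
To see the cost concretely, a brute-force decoding sketch (assuming a scoring function such as hmm_joint above) enumerates all |K|^n tag sequences:

from itertools import product

def brute_force_decode(words, tag_set, score):
    # score(words, tags) -> joint probability; |tag_set| ** len(words) candidates
    return max(product(tag_set, repeat=len(words)),
               key=lambda tags: score(words, tags))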

Consider the basic Viterbi algorithm:

Input: a sentence x_1 ... x_n, parameters q(s | u, v) and e(x | s).
Definitions: Define K to be the set of possible tags. Define K_{-1} = K_0 = {*}, and K_k = K for k = 1 ... n.
Initialization: Set π(0, *, *) = 1.
Algorithm:
For k = 1 ... n:
    For u ∈ K_{k-1}, v ∈ K_k:
        π(k, u, v) = max_{w ∈ K_{k-2}} ( π(k-1, w, u) × q(v | w, u) × e(x_k | v) )
Return max_{u ∈ K_{n-1}, v ∈ K_n} ( π(n, u, v) × q(STOP | u, v) )

What is the complexity of this algorithm, which tags a sequence of size n with the label set K?
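
A direct Python transcription of this pseudocode, as a sketch: backpointers are added to recover the best sequence, and q and e are assumed to be dictionaries keyed by tag trigrams and (word, tag) pairs.

def viterbi(words, tags, q, e):
    """Highest-probability tag sequence under a trigram HMM.
    q is keyed by (w, u, v) triples, e by (word, tag) pairs (assumed formats)."""
    n = len(words)
    K = lambda k: {"*"} if k <= 0 else tags      # K_{-1} = K_0 = {*}, K_k = tags
    pi = {(0, "*", "*"): 1.0}
    bp = {}
    for k in range(1, n + 1):
        for u in K(k - 1):
            for v in K(k):
                best_w, best_p = None, -1.0
                for w in K(k - 2):
                    p = pi[(k - 1, w, u)] * q[(w, u, v)] * e[(words[k - 1], v)]
                    if p > best_p:
                        best_w, best_p = w, p
                pi[(k, u, v)] = best_p
                bp[(k, u, v)] = best_w
    # add the STOP transition, then follow backpointers to recover the sequence
    u, v = max(((a, b) for a in K(n - 1) for b in K(n)),
               key=lambda ab: pi[(n, ab[0], ab[1])] * q[(ab[0], ab[1], "STOP")])
    seq = [None] * (n + 1)
    seq[n - 1], seq[n] = u, v
    for k in range(n, 2, -1):
        seq[k - 2] = bp[(k, seq[k - 1], seq[k])]
    return seq[1:]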

In the Maximum Entropy Markov Model (MEMM), we model the sequential tagging problem of finding a sequence of tags given a sequence of words with the following independence assumption:

Chain rule: p(t_1, t_2, ..., t_n | x_1, ..., x_n) = Π_{i=1..n} p(t_i | t_1, ..., t_{i-1}, x_1, ..., x_n)

Independence assumption: p(t_i | t_1, ..., t_{i-1}, x_1, ..., x_n) = p(t_i | t_{i-1}, x_1, ..., x_n)

In addition, we model the conditional distribution using a log-linear model with a feature extraction function Φ and parameters w.

1. What are the arguments of the feature extraction function Φ?
   Φ(_____________________________________________________________________)
2. Write the formula of the log-linear model (using the softmax function) with feature function Φ and parameters w:
   p(t_i | t_{i-1}, x_1, ..., x_n) = softmax(__________________________________________)
3. Is decoding necessary in the MEMM approach? Why?
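
A sketch of how such a local log-linear distribution can be computed in Python, assuming a hypothetical feature extractor phi(prev_tag, words, i, tag) that returns a numpy feature vector and a weight vector w:

import numpy as np

def memm_local_probs(prev_tag, words, i, tag_set, phi, w):
    """p(t_i | t_{i-1}, x_1..x_n) for every candidate tag, via a softmax
    over the scores w . phi(...); phi and w are assumptions, not given here."""
    scores = np.array([np.dot(w, phi(prev_tag, words, i, t)) for t in tag_set])
    scores -= scores.max()                  # numerical stability
    probs = np.exp(scores)
    return dict(zip(tag_set, probs / probs.sum()))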

In the CRF model, we model the sequence tagging problem with a global feature function Φ and parameters w as follows:

x = (x_1, ..., x_n), t = (t_1, t_2, ..., t_n)

p(t | x, w) = exp( w · Φ(x, t) ) / Σ_{t' ∈ S^n} exp( w · Φ(x, t') )

1. What are the two computational challenges that must be overcome to solve such a model?
2. What are the assumptions on the feature function Φ which are exploited to make the CRF computationally tractable?
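
For intuition, a sketch of the global score w · Φ(x, t) under the common assumption that Φ decomposes into a sum of local feature vectors over adjacent tag pairs; phi_local and w below are hypothetical:

import numpy as np

def global_score(words, tags, phi_local, w):
    """w . Phi(x, t) when Phi decomposes as a sum of local feature vectors
    phi_local(words, i, t_{i-1}, t_i); phi_local and w are assumed here."""
    padded = ["*"] + list(tags)
    return sum(np.dot(w, phi_local(words, i, padded[i - 1], padded[i]))
               for i in range(1, len(padded)))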

Give an example demonstrating that human speakers share a sense of syntactic structure as a hierarchical set of relations among the words within a sentence, one that contradicts a purely sequential account of sentences.

Show the dependency parse tree of the following English sentence (without labels on the edges): The little girl reads a book about NLP.
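
If it helps to check your drawing, a dependency tree can be encoded as a head index per word; the array below encodes one plausible unlabeled parse (with the PP attached to the noun "book"), to be verified rather than taken as the official answer:

# One plausible unlabeled dependency parse, encoded as a head index per word;
# index 0 is the artificial ROOT.  Verify the attachments yourself.
sentence = ["The", "little", "girl", "reads", "a", "book", "about", "NLP"]
heads =    [  3,      3,       4,      0,     6,    4,      6,       7  ]
for word, head in zip(sentence, heads):
    print(f"{word} <- {sentence[head - 1] if head else 'ROOT'}")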

Show the constituency structure of the same sentence - where the pre-terminals (part of speech categories) are Det, Adj, N, V, Prep and the non-terminal categories are NP (noun phrase), VP (verb phrase), PP (prepositional phrase) and S (sentence).
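
Constituency structure is often written as a bracketed tree; the nested tuples below sketch one plausible bracketing using the given categories (again to be checked, not taken as the official answer):

# One plausible bracketed constituency tree for the same sentence,
# written as nested tuples (label, children...).  Check it yourself.
tree = ("S",
        ("NP", ("Det", "The"), ("Adj", "little"), ("N", "girl")),
        ("VP", ("V", "reads"),
               ("NP", ("Det", "a"), ("N", "book"),
                      ("PP", ("Prep", "about"), ("N", "NLP")))))

def show(node, depth=0):
    if isinstance(node, str):
        print("  " * depth + node)
    else:
        print("  " * depth + node[0])
        for child in node[1:]:
            show(child, depth + 1)

show(tree)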
