Discuss, Learn and be Happy: Discussion of Questions

Explain why sequence (structured) prediction is different from the greedy model of POS tagging we reviewed in previous lectures. What are we trying to optimize differently?
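
A minimal sketch of the contrast, with invented toy scores: greedy tagging picks the best tag per position in isolation, while structured prediction searches for the tag sequence whose total score, including tag-to-tag interactions, is highest.

from itertools import product

TAGS = ["DT", "NN", "VB"]                    # toy tag set
words = ["the", "dog", "barks"]

# Toy per-position scores and tag-pair transition scores;
# the numbers are invented purely for illustration.
score_local = [{"DT": 2.0, "NN": 0.1, "VB": 0.1},
               {"DT": 0.1, "NN": 1.0, "VB": 1.2},
               {"DT": 0.1, "NN": 0.8, "VB": 1.1}]
trans = {(p, c): (1.0 if (p, c) in {("DT", "NN"), ("NN", "VB")} else 0.0)
         for p in TAGS for c in TAGS}

# Greedy: argmax per position, ignoring interactions between tags.
greedy = [max(TAGS, key=lambda t: score_local[i][t]) for i in range(len(words))]

# Structured: argmax over whole tag sequences, local plus transition scores.
def seq_score(seq):
    s = sum(score_local[i][t] for i, t in enumerate(seq))
    return s + sum(trans[(seq[i - 1], seq[i])] for i in range(1, len(seq)))

best = max(product(TAGS, repeat=len(words)), key=seq_score)
print(greedy, list(best))    # greedy: DT VB VB, structured: DT NN VB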

What is the difference between a generative probabilistic model and a discriminative one? Explain which formal tool is used to specify a generative model.

In a Hidden Markov Model of a tagging problem, given a sequence of words x_1 ... x_n and the predicted labels y_1 ... y_{n+1}, we model the joint distribution p(x, y) as:

p(x_1 ... x_n, y_1 ... y_{n+1}) = Π_{i=1..n+1} q(y_i | y_{i-2}, y_{i-1}) × Π_{i=1..n} e(x_i | y_i)

assuming y_0 = y_{-1} = * and y_{n+1} = STOP.

Write down the independence assumptions which correspond to this equation.
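
A minimal sketch of how this factorization can be computed, assuming q and e are given as dictionaries keyed by (y_{i-2}, y_{i-1}, y_i) triples and (x_i, y_i) pairs respectively (the key formats are an assumption for illustration):

def hmm_joint(words, tags, q, e):
    """Joint probability p(x, y) under a trigram HMM.
    tags has the same length as words; '*' padding and STOP are added here."""
    padded = ["*", "*"] + list(tags) + ["STOP"]
    p = 1.0
    # transition terms: q(y_i | y_{i-2}, y_{i-1}) for i = 1 .. n+1
    for i in range(2, len(padded)):
        p *= q[(padded[i - 2], padded[i - 1], padded[i])]
    # emission terms: e(x_i | y_i) for i = 1 .. n
    for word, tag in zip(words, tags):
        p *= e[(word, tag)]
    return p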

In the HMM model, the Viterbi algorithm is used to solve the decoding problem efficiently. Explain what decoding is, why it is computationally expensive, and why dynamic programming is an appropriate strategy for this task.
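
To see the cost concretely, a brute-force decoding sketch (assuming a scoring function such as hmm_joint above) enumerates all |K|^n tag sequences:

from itertools import product

def brute_force_decode(words, tag_set, score):
    # score(words, tags) -> joint probability; |tag_set| ** len(words) candidates
    return max(product(tag_set, repeat=len(words)),
               key=lambda tags: score(words, tags))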

Consider the basic Viterbi algorithm:

Input: a sentence x_1 ... x_n, parameters q(s | u, v) and e(x | s).
Definitions: Define K to be the set of possible tags. Define K_{-1} = K_0 = {*}, and K_k = K for k = 1 ... n.
Initialization: Set π(0, *, *) = 1.
Algorithm:
For k = 1 ... n:
    For u ∈ K_{k-1}, v ∈ K_k:
        π(k, u, v) = max_{w ∈ K_{k-2}} ( π(k-1, w, u) × q(v | w, u) × e(x_k | v) )
Return max_{u ∈ K_{n-1}, v ∈ K_n} ( π(n, u, v) × q(STOP | u, v) )

What is the complexity of this algorithm, which tags a sequence of size n with the label set K?
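
A direct Python transcription of this pseudocode, as a sketch: backpointers are added to recover the best sequence, and q and e are assumed to be dictionaries keyed by tag trigrams and (word, tag) pairs.

def viterbi(words, tags, q, e):
    """Highest-probability tag sequence under a trigram HMM.
    q is keyed by (w, u, v) triples, e by (word, tag) pairs (assumed formats)."""
    n = len(words)
    K = lambda k: {"*"} if k <= 0 else tags      # K_{-1} = K_0 = {*}, K_k = tags
    pi = {(0, "*", "*"): 1.0}
    bp = {}
    for k in range(1, n + 1):
        for u in K(k - 1):
            for v in K(k):
                best_w, best_p = None, -1.0
                for w in K(k - 2):
                    p = pi[(k - 1, w, u)] * q[(w, u, v)] * e[(words[k - 1], v)]
                    if p > best_p:
                        best_w, best_p = w, p
                pi[(k, u, v)] = best_p
                bp[(k, u, v)] = best_w
    # add the STOP transition, then follow backpointers to recover the sequence
    u, v = max(((a, b) for a in K(n - 1) for b in K(n)),
               key=lambda ab: pi[(n, ab[0], ab[1])] * q[(ab[0], ab[1], "STOP")])
    seq = [None] * (n + 1)
    seq[n - 1], seq[n] = u, v
    for k in range(n, 2, -1):
        seq[k - 2] = bp[(k, seq[k - 1], seq[k])]
    return seq[1:]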

In the Maximum Entropy Markov Model (MEMM), we model the sequential tagging problem of finding a sequence of tags given a sequence of words with the following independence assumption:

Chain rule: p(t_1, t_2, ..., t_n | x_1, ..., x_n) = Π_{i=1..n} p(t_i | t_1, ..., t_{i-1}, x_1, ..., x_n)

Independence assumption: p(t_i | t_1, ..., t_{i-1}, x_1, ..., x_n) = p(t_i | t_{i-1}, x_1, ..., x_n)

In addition, we model the conditional distribution using a log-linear model with a feature extraction function Φ and parameters w.

1. What are the arguments of the feature extraction function Φ?
   Φ(_____________________________________________________________________)
2. Write the formula of the log-linear model (using the softmax function) with feature function Φ and parameters w:
   p(t_i | t_{i-1}, x_1, ..., x_n) = softmax(__________________________________________)
3. Is decoding necessary in the MEMM approach? Why?
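
A sketch of how such a local log-linear distribution can be computed in Python, assuming a hypothetical feature extractor phi(prev_tag, words, i, tag) that returns a numpy feature vector and a weight vector w:

import numpy as np

def memm_local_probs(prev_tag, words, i, tag_set, phi, w):
    """p(t_i | t_{i-1}, x_1..x_n) for every candidate tag, via a softmax
    over the scores w . phi(...); phi and w are assumptions, not given here."""
    scores = np.array([np.dot(w, phi(prev_tag, words, i, t)) for t in tag_set])
    scores -= scores.max()                  # numerical stability
    probs = np.exp(scores)
    return dict(zip(tag_set, probs / probs.sum()))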

In the CRF model, we model the sequence tagging problem with a global feature function Φ and parameters w as follows:

x = (x_1, ..., x_n), t = (t_1, t_2, ..., t_n)

p(t | x, w) = exp( w · Φ(x, t) ) / Σ_{t' ∈ S^n} exp( w · Φ(x, t') )

1. What are the two computational challenges that must be overcome to solve such a model?
2. What are the assumptions on the feature function Φ which are exploited to make the CRF computationally tractable?
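
For intuition, a sketch of the global score w · Φ(x, t) under the common assumption that Φ decomposes into a sum of local feature vectors over adjacent tag pairs; phi_local and w below are hypothetical:

import numpy as np

def global_score(words, tags, phi_local, w):
    """w . Phi(x, t) when Phi decomposes as a sum of local feature vectors
    phi_local(words, i, t_{i-1}, t_i); phi_local and w are assumed here."""
    padded = ["*"] + list(tags)
    return sum(np.dot(w, phi_local(words, i, padded[i - 1], padded[i]))
               for i in range(1, len(padded)))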

Give an example demonstrating that human speakers share a sense of syntactic structure as a hierarchical set of relations among the words within a sentence, one that contradicts a purely sequential account of sentences.

Show the dependency parse tree of the following English sentence (without labels on the edges): The little girl reads a book about NLP.
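
If it helps to check your drawing, a dependency tree can be encoded as a head index per word; the array below encodes one plausible unlabeled parse (with the PP attached to the noun "book"), to be verified rather than taken as the official answer:

# One plausible unlabeled dependency parse, encoded as a head index per word;
# index 0 is the artificial ROOT.  Verify the attachments yourself.
sentence = ["The", "little", "girl", "reads", "a", "book", "about", "NLP"]
heads =    [  3,      3,       4,      0,     6,    4,      6,       7  ]
for word, head in zip(sentence, heads):
    print(f"{word} <- {sentence[head - 1] if head else 'ROOT'}")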

Show the constituency structure of the same sentence - where the pre-terminals (part of speech categories) are Det, Adj, N, V, Prep and the non-terminal categories are NP (noun phrase), VP (verb phrase), PP (prepositional phrase) and S (sentence).
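
Constituency structure is often written as a bracketed tree; the nested tuples below sketch one plausible bracketing using the given categories (again to be checked, not taken as the official answer):

# One plausible bracketed constituency tree for the same sentence,
# written as nested tuples (label, children...).  Check it yourself.
tree = ("S",
        ("NP", ("Det", "The"), ("Adj", "little"), ("N", "girl")),
        ("VP", ("V", "reads"),
               ("NP", ("Det", "a"), ("N", "book"),
                      ("PP", ("Prep", "about"), ("N", "NLP")))))

def show(node, depth=0):
    if isinstance(node, str):
        print("  " * depth + node)
    else:
        print("  " * depth + node[0])
        for child in node[1:]:
            show(child, depth + 1)

show(tree)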
