Discuss, Learn and be Happy: Discussion of Questions

Define the perplexity of a bigram language model given a dataset [w_1 ... w_N].

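A hedged sketch of the standard definition, assuming the usual chain-rule factorization of the corpus probability into bigram terms (with w_0 taken to be a start-of-sequence padding symbol):

```latex
PP(w_1 \dots w_N)
  = P(w_1 \dots w_N)^{-\frac{1}{N}}
  = \left( \prod_{i=1}^{N} P(w_i \mid w_{i-1}) \right)^{-\frac{1}{N}}
  = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-1}) \right)
```

Lower perplexity means the model assigns higher probability to the dataset.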

How many parameters are needed for a bigram language model over a vocabulary of size V?

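One way to count, sketched under the assumption that the model stores a full conditional probability table with no smoothing or back-off parameters:

```latex
\underbrace{V \times V}_{\text{entries of the } P(w_i \mid w_{i-1}) \text{ table}}
\quad \text{i.e.} \quad
V \times (V - 1) \ \text{free parameters, since each of the } V \text{ rows must sum to } 1
```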

In machine learning, we are given a dataset of the form {(x_i, y_i)}, i ∈ [1..N], and aim to learn a function f(x) which maps unseen input feature vectors to ŷ, the predicted value. Distinguish between the three types of learning problems by characterizing the mathematical type of the predicted value ŷ:
Classification:
Regression:
Ranking:

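A hedged summary of how the type of ŷ is usually characterized in each setting:

```latex
\begin{aligned}
\text{Classification:} &\quad \hat{y} \in \{1, \dots, C\} \ \text{(a discrete label from a finite set)} \\
\text{Regression:}     &\quad \hat{y} \in \mathbb{R} \ \text{(a continuous real value)} \\
\text{Ranking:}        &\quad \hat{y} \ \text{is an ordering (permutation) of candidate items, usually induced by real-valued scores}
\end{aligned}
```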

Given a training dataset D = {(x_i, y_i)}, i ∈ [1..N], we want to identify a function f_Θ() such that the predictions ŷ = f_Θ(x) over the training dataset are as accurate as possible. Given a loss function L(y, ŷ), write the criterion that the optimal value of Θ must satisfy. Find Θ such that:

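A hedged sketch of the usual empirical risk minimization criterion (the 1/N averaging factor is optional and does not change the minimizer):

```latex
\Theta^{*} = \arg\min_{\Theta} \; \frac{1}{N} \sum_{i=1}^{N} L\big(y_i,\, f_{\Theta}(x_i)\big)
```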

Write the expression for the cross-entropy loss, which is useful when the predicted output of the model we learn is interpreted as a discrete distribution p(y_c | x) for c ∈ [1..C] (a C-way classification model). f(x) = ŷ = (ŷ_1, ..., ŷ_C) is a distribution over the C possible classes. L(ŷ, y) =

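A hedged sketch of the standard form, assuming the gold label y is given as a one-hot vector over the C classes, with c* the index of the correct class:

```latex
L(\hat{y}, y) = -\sum_{c=1}^{C} y_c \log \hat{y}_c
\qquad \text{which, for one-hot } y, \text{ reduces to } \qquad
L(\hat{y}, y) = -\log \hat{y}_{c^{*}}
```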

The deep learning approach learns a trainable non-linear mapping function φ from x to a representation φ(x) in which the classification problem becomes linearly separable. The general form of the trainable mapping we consider is:
ŷ = W φ(x) + b
φ(x) = g(W'x + b')
where g is a non-linear function. Why do we need non-linear mappings such as g() in this formulation?

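One common way to answer, sketched here: if g were the identity (or any linear map), the composition of the two layers would collapse into a single affine map, so the model would remain a linear classifier and could not fit problems that are not linearly separable in the original feature space (XOR is the classic example):

```latex
\hat{y} = W\,(W'x + b') + b = (W W')\,x + (W b' + b) = W''x + b''
```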

Consider the task of predicting the sentiment of a text document as one of {positive, negative, neutral}. We want to use a neural network to learn a model for this task, given a training dataset of the form {(document_i, label_i)}, i ∈ [1..N]. Each document d_i contains N_i words (w_{i,1}, ..., w_{i,N_i}), where the words w_{i,j} belong to the vocabulary V = {w_1, ..., w_{|V|}}. Describe how the documents are encoded as vectors of size |V| for each of the following two methods:
Bag of Words:
Tf-Idf weights:

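A minimal sketch of both encodings in Python, assuming raw counts for Bag of Words and one common smoothed log-idf variant for Tf-Idf (the toy corpus, tokenization, and exact idf formula are illustrative assumptions, not the course's prescribed ones):

```python
import math
from collections import Counter

def bow_vector(doc_tokens, vocab):
    """Bag of Words: entry j is the number of times vocabulary word j occurs in the document."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocab]

def tfidf_vectors(tokenized_docs, vocab):
    """Tf-Idf: entry j is tf(w_j, d) * idf(w_j); words frequent in a document but rare in the corpus get high weight."""
    n_docs = len(tokenized_docs)
    # document frequency: in how many documents each vocabulary word appears
    df = {w: sum(1 for d in tokenized_docs if w in d) for w in vocab}
    # smoothed idf (one common variant among several)
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for d in tokenized_docs:
        counts = Counter(d)
        vectors.append([counts[w] * idf[w] for w in vocab])
    return vectors

# hypothetical toy corpus
docs = [["good", "movie", "good"], ["bad", "movie"], ["good", "plot", "bad", "acting"]]
vocab = sorted({w for d in docs for w in d})
print(bow_vector(docs[0], vocab))       # count vector of size |V|
print(tfidf_vectors(docs, vocab)[0])    # tf-idf weighted vector of size |V|
```

In both cases the document vector has one coordinate per vocabulary word; the two methods differ only in how that coordinate is weighted.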

Consider the task of predicting the POS (part-of-speech) tag of a word, using a model similar to the one we discussed in class for predicting the language of documents. The task consists of tagging each word in a sequence of words as Noun, Verb, Adjective, ... (these labels are known as "tags"). Assume the tagset is of dimension T (there are T distinct tags). What would be the dimension of the task-specific word embedding the model would learn? What would be the interpretation of an embedding vector for a given word w?

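A hedged sketch of the analogy with the language-identification model from class: the task-specific embedding matrix would have one T-dimensional vector per vocabulary word, and the entries of the vector for a word w could be read as scores (after a softmax, probabilities) of each of the T tags for that word:

```latex
E \in \mathbb{R}^{|V| \times T}, \qquad
E[w] \approx \big( \mathrm{score}(\mathrm{tag}_1 \mid w), \dots, \mathrm{score}(\mathrm{tag}_T \mid w) \big)
```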

List three key properties of word embedding representations that distinguish them from one-hot encodings.

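A small illustrative contrast (with made-up numbers), showing one property that usually comes up: distinct one-hot vectors are always mutually orthogonal, so they cannot express that two words are related, whereas dense, low-dimensional embeddings can:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# one-hot encodings over a toy vocabulary of size 5: sparse, |V|-dimensional, mutually orthogonal
one_hot_cat = np.array([1, 0, 0, 0, 0])
one_hot_dog = np.array([0, 1, 0, 0, 0])
one_hot_car = np.array([0, 0, 1, 0, 0])
print(cosine(one_hot_cat, one_hot_dog), cosine(one_hot_cat, one_hot_car))  # 0.0 0.0

# hypothetical dense embeddings: low-dimensional, real-valued, learned from data
emb_cat = np.array([0.8, 0.1, 0.3])
emb_dog = np.array([0.7, 0.2, 0.4])
emb_car = np.array([-0.5, 0.9, -0.2])
print(cosine(emb_cat, emb_dog), cosine(emb_cat, emb_car))  # related words score high, unrelated low
```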

Give three examples of systematic lexical semantic relations (that is, semantic relations which may hold between any pair of words):
