TODO merge in notes from 6.864: shingles, markov chain pos tagging, logistic, stemming, porter stemming (exclusion lists)
general NLP problems
- POS tagging
- sentence detection
- tokenization/segmentation: figuring out word/sentence boundaries
- simplest: over all possible segmentations, return the one with the max product of $P(\text{word})$ over its words (see the sketch after this list)
- chunking
- parsing
- named entity recognition
- coreference
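a minimal sketch of the max-product segmentation mentioned above (the unigram model and example string are made up; log space avoids underflow):

```python
import math
from functools import lru_cache

# toy unigram model; a real one would be estimated from a corpus
P = {"the": 0.05, "me": 0.02, "them": 0.01, "theme": 0.005}

def segment(text):
    """Over all segmentations of `text` into known words, return the one
    maximizing the product of P(word)."""
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(text):
            return (0.0, [])
        candidates = []
        for j in range(i + 1, len(text) + 1):
            word = text[i:j]
            if word in P:
                score, rest = best(j)
                candidates.append((score + math.log(P[word]), [word] + rest))
        return max(candidates) if candidates else (float("-inf"), [])
    return best(0)[1]

print(segment("themethem"))  # ['theme', 'them'] beats ['the', 'me', 'them']
```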
misc
- $n$-grams of chars: useful for language guessing
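for illustration, a tiny char $n$-gram extractor (the name is mine); comparing a doc's $n$-gram frequency profile against per-language profiles is one simple way to guess the language:

```python
def char_ngrams(text, n=3):
    """All overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("language"))  # ['lan', 'ang', 'ngu', 'gua', 'uag', 'age']
```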
term frequency inverse document frequency (TFIDF)
- term frequency is the num occurrences of term $t_i$ in doc $d_j$ over the total num words in $d_j$ (how prominent $t_i$ is in $d_j$): $$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
- inverse document frequency is the num docs over the num docs containing $t_i$ (more common words are less important): $$idf_i = \log \frac{|D|}{|\{ d : t_i \in d \}|}$$
- tf-idf: $tfidf_{i,j} = tf_{i,j} \cdot idf_i$
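a from-scratch sketch of the two formulas above on toy docs (real systems usually smooth the idf term so a term in every doc doesn't zero out entirely by accident of the corpus):

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-doc {term: tf-idf weight} dicts, per the tf and idf defs above."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # num docs containing t
    out = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        out.append({t: (n / total) * math.log(N / df[t])
                    for t, n in counts.items()})
    return out

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "mat"]]
print(tfidf(docs)[0])  # 'the' gets weight 0 since it appears in every doc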
soundex algorithm
- rudimentary "sounds-alike" phonetic algorithm that encodes similar-sounding words to the same representation, for matching
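a simplified sketch of classic American Soundex (it skips some edge cases, e.g. the special treatment of 'h'/'w' between same-coded consonants):

```python
def soundex(word):
    """First letter + up to three digit codes, padded/truncated to 4 chars."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # drop vowels etc., collapse repeated codes
            result += code
        prev = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # both 'R163'
```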
vector space model
- represent text docs as vectors of identifiers
- eg index terms: each dim corresponds to a term; some possible values are:
- bits indicating presence
- counts
- tf-idf
- relevancy rankings: treat query as doc, find most cosine-similar doc
- limitations
- long docs are poorly represented: their similarity values come out low (small scalar product in the numerator, large norm from high dimensionality in the denominator)
- order information not preserved
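a minimal sketch of the index-term representation with raw counts as values (presence bits or tf-idf weights slot in the same way; the names are mine):

```python
from collections import Counter

def to_vector(doc, vocab):
    """Represent a tokenized doc as a vector of term counts over `vocab`."""
    counts = Counter(doc)
    return [counts[t] for t in vocab]

docs = [["the", "cat", "sat"], ["the", "dog", "barked"]]
vocab = sorted({t for d in docs for t in d})  # one dimension per term
print([to_vector(d, vocab) for d in docs])   # [[0, 1, 0, 1, 1], [1, 0, 1, 0, 1]]
```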
cosine similarity
- $\cos \theta = \frac{A \cdot B}{|A| |B|}$: intuitively, the angle between the vector representations of two docs
- no need to convert to an actual angle (arccos); the cosine value itself preserves the ranking, so just leave it as is
- $-1$ means opposite, $0$ means orthogonal (nothing in common), $1$ means identical; in IR, components are always non-neg, so the range is $[0,1]$
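a direct implementation of the formula, plus the treat-the-query-as-a-doc ranking idea from the vector space model section (toy count vectors):

```python
import math

def cosine(a, b):
    """A·B / (|A| |B|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

docs = {"d1": [1, 1, 0, 2], "d2": [0, 1, 1, 0]}  # toy term-count vectors
query = [1, 1, 0, 0]
print(sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True))  # ['d1', 'd2']
```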
jaccard similarity/coefficient/index
- for binary features
- simple: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$
- extended jaccard coefficient aka tanimoto coefficient: $T(A,B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B}$, an extension of cosine similarity that reduces to jaccard for binary features
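both coefficients on toy inputs (the second equals the first when the vectors are binary indicators of the sets):

```python
def jaccard(a, b):
    """Jaccard index of two sets."""
    return len(a & b) / len(a | b)

def tanimoto(a, b):
    """Extended jaccard for real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

print(jaccard({"cat", "sat"}, {"cat", "mat"}))  # 1/3
print(tanimoto([1, 0, 1], [1, 1, 0]))           # also 1/3: indicator vectors
                                                # over {cat, mat, sat}
```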
NLP at Google (Katja Filippova)
- NLP problems at google: translation, speech recognition, language modeling, information extraction
- 7-gram language models on 2T+ tokens using simplified smoothing
- graph-based multi-sentence (sentence fusion) summarization
- each unique (token,pos) is a vertex; also start & end vertices
- merge graphs of multiple sentences together (merging vertices may require their neighbors to also be the same)
- edges are inverse frequencies of transitions (bigram freqs)
- find $k$ shortest paths from start to end (toy sketch at the end of this section)
- constraint: path must contain a verb
- use google news' clusters as dataset
- http://videolectures.net/russir2010_filippova_nlp/
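a toy sketch of the fusion idea, with two big simplifications: vertices are merged by lowercased token alone (the real method merges (token, pos) vertices and checks neighboring context), and only the single shortest path is returned instead of the $k$ shortest filtered by the verb constraint:

```python
import heapq
from collections import defaultdict

def fuse(sentences):
    """Merge tokenized sentences into a word graph and return the cheapest
    START->END token path; edge cost = inverse bigram frequency."""
    START, END = "<s>", "</s>"
    bigram = defaultdict(int)
    for sent in sentences:
        tokens = [START] + [t.lower() for t in sent] + [END]
        for a, b in zip(tokens, tokens[1:]):
            bigram[(a, b)] += 1                  # vertices merged by token
    edges = defaultdict(list)
    for (a, b), f in bigram.items():
        edges[a].append((b, 1.0 / f))            # frequent transitions are cheap
    heap, seen = [(0.0, START, [START])], set()  # Dijkstra; weights non-negative
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == END:
            return path[1:-1]
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in edges[node]:
            if nxt not in seen:
                heapq.heappush(heap, (cost + w, nxt, path + [nxt]))
    return []

print(fuse([["The", "cat", "sat", "on", "the", "mat"],
            ["The", "cat", "lay", "on", "the", "rug"]]))
# -> ['the', 'mat']: degenerate, which is why the real method needs constraints
```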