VSM - From Frequency to Meaning: Vector Space Models of Semantics

关于

向量空间模型(VSM)综述论文:From Frequency to Meaning: Vector Space Models of Semantics, 2010.

导言

words that occur in similar contexts tend to have similar meanings (Wittgenstein, 1953; Harris, 1954; Weaver, 1955; Firth, 1957; Deerwester, Dumais, Landauer, Furnas, & Harsh- man, 1990)

案例

IQ测试 or 性格测试: subject-item matrix!

向量分析方法:因子分析

向量空间模型里面向量的元素来自于 事件频率统计!!

语义向量空间模型

文档的相似性:The Term–Document Matrix,Salton et al. (1975)

In information retrieval, the bag of words hypothesis is that we can estimate the relevance of documents to a query by representing the documents and the query as bags of words. (Salton et al., 1975)

词的相似性:The Word–Context Matrix

distributional hypothesis in linguistics is that words that occur in similar contexts tend to have similar meanings (Harris, 1954)

关系的相似性:The Pair–Pattern Matrix

extended distributional hypothesis, that patterns that co-occur with similar pairs tend to have similar meanings. Lin and Pantel (2001)

The latent relation hypothesis is that pairs of words that co-occur in similar patterns tend to have similar semantic relations (Turney, 2008a)

关系的相似性不能消减为属性的相似性(word-context matrix)

Types and Tokens

token-document matrix,里面相同的词但是出现在不同地方的词作为不同的token;
type-duocument matrix,则把相同词合并了。

前者可以用在词消歧义上,一词多义!

问题,这种token-document matrix完全看不出有什么意义啊

五个假设

Linguistic Processing for VSM

对数据的预处理:tokenize,normalize(将词不同的形式归一化),annotate the raw text(将相同的形式标记为不同的含义:eg 动词,名词)

Grefenstette (1994):三步走:tokenization, surface syntactic analysis, and syntactic attribute extraction.

Tokenization

英语等西班牙语系可以通过天然的分割符空格进行分割!
而汉语等非西班牙语系则不同!

精确的Tokenizer还需要处理标点符号!连字符,multi-word terms(e.g., Barack Obama and ice hockey)。
停止词,高频却无意义的词,代词等。停止词表:SMART system (Salton, 1971)

Normalization

一般而言,归一化将导致精确度降低,召回率提高。
如果数据量少,一定要用归一化,提高召回率;
但如果数据量很大,精确度更重要,可以不归一化!

Annotation

降低召回率,提高精确度!

Mathematical Processing for Vector Space Models

Lowe (2001) 4步走:1、统计频率,2、频率变换(加权),3、平滑,4、计算相似性。

频率统计

关键技术:Hash Table;数据库;搜索引擎索引。

加权频率变换

假设word-context 矩阵 F,行向量$(f_i)$,列向量$(f_{:j})$。新矩阵 X 是PPMI矩阵,定义为

$$
p_{ij} = \frac{f_{ij}}{\sum_i \sum_j f_{ij}} \\
p_{i*} = \frac{\sum_j f_{ij}}{\sum_i \sum_j f_{ij}} \\
p_{* j} = \frac{\sum_i f_{ij}}{\sum_i \sum_j f_{ij}} \\
pmi_{ij} = \log\left( \frac{p_{ij}}{p_{i*} p_{* j}} \right) \\
x_{ij} = \begin{cases}
pmi_{ij} & if pmi_{ij} > 0 \\
0 & \text{otherwise}
\end{cases}
$$

PMI 的问题,对小概率事件有偏!特例:当i和j统计依赖,$(p_{ij} = p_{i*} = p_{* j})$,
那么PMI变为 $(\log(1/p_{i*}))$。

一种解决方案是(Pantel & Lin, 2002a),对$(f_{ij}, f_{i*}, f_{* j})$进行平滑处理

$$
\delta_{ij} = \frac{f_{ij}}{f_{ij} + 1} \frac{\min(f_{* j}, f_{i*})}{\min(f_{* j}, f_{i*}) + 1} \
newpmi_{ij} = \delta_{ij} pmi_{ij}
$$

另一种解决方案是对概率进行拉普拉斯平滑!即对每一个$(f_{ij} \rightarrow f_{ij} + k)$。

平滑矩阵

rank-k 矩阵近似,最小化富比尼范数$(||X - \hat{X}||_F)$!(Golub&VanLoan,1996)

比较向量

三类:Weeds et al. (2004)

  1. high-frequency sensitive measures (cosine, Jensen-Shannon, $(\alpha)$-skew, recall),
  2. low-frequency sensitive measures (precision), and
  3. similar-frequency sensitive methods (Jaccard, Jaccard+MI, Lin, harmonic mean).

有效的比较

3个开源VSM系统

应用

其他方法 to 语义分析

VSM的未来

问题