A Summary of Text Summarization Algorithms

About

A collection of models, algorithms, and evaluation metrics related to text summarization.

Keyphrase Extraction: Review

Paper: Automatic Keyphrase Extraction: A Survey of the State of the Art, Kazi Saidul Hasan and Vincent Ng

Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information

Paper: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information, Y. Matsuo, M. Ishizuka, 2003

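Basic idea: only a single (long) document is needed. First extract the frequent terms. If a term's co-occurrence with the frequent terms passes the chi-square test, that is, its co-occurrence distribution deviates significantly from the expected one, the term is considered a likely keyword.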

For comparison, TF-IDF picks words that occur frequently in the document but relatively rarely across the whole corpus.
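As a point of reference, a minimal TF-IDF sketch is shown here. The function name `tf_idf` is hypothetical. The plain `tf * log(N / df)` weighting is one common variant, chosen here as an assumption, and the corpus is assumed to contain the document itself so that every document frequency is at least 1.

```python
import math
from collections import Counter

def tf_idf(doc, corpus):
    """doc: list of tokens; corpus: list of token lists that includes doc."""
    tf = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        # Document frequency: number of documents that contain the word.
        df = sum(1 for other in corpus if word in other)
        # Relative term frequency times inverse document frequency.
        scores[word] = (count / len(doc)) * math.log(n_docs / df)
    return scores

corpus = [["a", "b", "a"], ["a", "c"], ["a", "d"]]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus, "a" appears in every document and therefore gets a score of zero, while "b" appears only in the first document and gets a positive score. Unlike the co-occurrence method, this requires a corpus and not just one document.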

$$
\chi^2(w) = \sum_{g \in G} \frac{(freq(w, g) - n_w p_g)^2}{n_w p_g}
$$

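Here $w$ is the term being tested, $g \in G$ is a frequent term, and $G$ is the set of frequent terms. $freq(w, g)$ is the number of co-occurrences of $w$ and $g$, $n_w$ is the total count of $w$ in the co-occurrence matrix, and $p_g$ is the normalized frequency of the frequent term $g$ among the frequent terms.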

$$
\chi'^2(w) = \chi^2(w) - \max_{g \in G} \frac{(freq(w, g) - n_w p_g)^2}{n_w p_g}
$$
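The two statistics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: co-occurrence is counted per sentence, $p_g$ is computed as the share of $g$ among all frequent-term occurrences (following the description in these notes), and a word is not counted as co-occurring with itself. The function name `chi_square_scores` and the parameter `num_frequent` are hypothetical.

```python
from collections import Counter

def chi_square_scores(sentences, num_frequent=2):
    """sentences: list of token lists. Returns {word: (chi2, chi2_prime)}."""
    term_freq = Counter(w for s in sentences for w in s)
    frequent = [w for w, _ in term_freq.most_common(num_frequent)]

    # p_g: normalized frequency of g among the frequent terms (assumption).
    total = sum(term_freq[g] for g in frequent)
    p = {g: term_freq[g] / total for g in frequent}

    # freq(w, g): number of sentences in which w and g both appear.
    cooc = Counter()
    for s in sentences:
        uniq = set(s)
        for w in uniq:
            for g in frequent:
                if g in uniq and g != w:
                    cooc[(w, g)] += 1

    scores = {}
    for w in term_freq:
        n_w = sum(cooc[(w, g)] for g in frequent)
        if n_w == 0:
            continue  # w never co-occurs with a frequent term
        terms = [(cooc[(w, g)] - n_w * p[g]) ** 2 / (n_w * p[g])
                 for g in frequent]
        # chi2 is the full sum; chi2_prime drops the single largest term.
        scores[w] = (sum(terms), sum(terms) - max(terms))
    return scores

sentences = [["a", "b", "x"], ["a", "b", "y"], ["a", "c", "x"], ["a", "x"]]
scores = chi_square_scores(sentences)
```

Subtracting the largest term, as in the second formula, keeps a word from scoring highly only because it co-occurs with one particular frequent term.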

TextRank

Paper: TextRank: Bringing Order into Texts, Rada Mihalcea and Paul Tarau, 2004

Weighted PageRank

$$
WS(V_i) = (1-d) + d \sum_{V_j \in IN(V_i)} \frac{w_{ji}}{\sum_{V_k \in OUT(V_j)} w_{jk}} WS(V_j)
$$

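WS is the PageRank score of a vertex, and d is the damping factor. The scores are initialized randomly and then updated iteratively until they converge.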
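The iteration can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `weighted_pagerank`, the edge-dictionary input format, and the stopping tolerance are assumptions, and every score starts at 1.0 instead of a random value so that the result is reproducible.

```python
def weighted_pagerank(weights, d=0.85, tol=1e-6, max_iter=100):
    """weights: dict mapping a directed edge (j, i) to its weight w_ji."""
    nodes = sorted({v for edge in weights for v in edge})

    # Total outgoing weight of each vertex: sum over k of w_jk.
    out_sum = {v: 0.0 for v in nodes}
    for (j, _), w in weights.items():
        out_sum[j] += w

    ws = {v: 1.0 for v in nodes}  # fixed start value (assumption)
    for _ in range(max_iter):
        new = {}
        for i in nodes:
            incoming = sum(w / out_sum[j] * ws[j]
                           for (j, k), w in weights.items() if k == i)
            new[i] = (1 - d) + d * incoming
        delta = max(abs(new[v] - ws[v]) for v in nodes)
        ws = new
        if delta < tol:  # scores have stopped changing
            break
    return ws

# Undirected chain a - b - c, written as two directed edges per link.
edges = {("a", "b"): 1.0, ("b", "a"): 1.0, ("b", "c"): 1.0, ("c", "b"): 1.0}
ranks = weighted_pagerank(edges)
```

In this small chain the middle vertex "b" ends up with the highest score, since both other vertices pass all of their weight to it.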

$$
similarity(S_i, S_j) = \frac{|\{w_k| w_k \in S_i , w_k \in S_j \}|}{\log{|S_i|} + \log{|S_j|}}
$$
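A small sketch of this sentence similarity, assuming each sentence is already tokenized into a list of words and that $|S_i|$ is the number of tokens in the sentence. The function name `sentence_similarity` and the zero return value for degenerate inputs are assumptions made for illustration.

```python
import math

def sentence_similarity(s_i, s_j):
    """s_i, s_j: lists of tokens. Word overlap normalized by log lengths."""
    if not s_i or not s_j:
        return 0.0  # empty sentence: log(0) is undefined (assumption)
    shared = set(s_i) & set(s_j)
    denom = math.log(len(s_i)) + math.log(len(s_j))
    if denom == 0:
        return 0.0  # both sentences have one token, so the denominator is 0
    return len(shared) / denom

sim = sentence_similarity(["the", "cat", "sat"], ["the", "cat", "ran"])
```

Dividing by the log of the sentence lengths limits the advantage that long sentences would otherwise get from having more words to overlap.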

KeyCluster

Paper: Clustering to Find Exemplar Terms for Keyphrase Extraction, Zhiyuan Liu, Wenyi Huang, Yabin Zheng and Maosong Sun

Topical PageRank

Paper: Automatic Keyphrase Extraction via Topic Decomposition

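Basic idea: run PageRank once per topic, with each run focusing on a single topic. After computing every term's rank under each topic, weight these ranks by the document's topic distribution to obtain the final rank.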
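The final weighting step can be sketched as below. The per-topic ranks are taken as given (the topic-specific PageRank runs are not shown), and the function name `combine_topic_ranks`, the dictionary layout, and the toy numbers are hypothetical.

```python
def combine_topic_ranks(topic_ranks, doc_topic_dist):
    """topic_ranks: {topic: {term: rank}}; doc_topic_dist: {topic: P(topic | doc)}."""
    final = {}
    for topic, ranks in topic_ranks.items():
        weight = doc_topic_dist.get(topic, 0.0)
        for term, rank in ranks.items():
            # Each topic adds its rank, scaled by how much of the document it covers.
            final[term] = final.get(term, 0.0) + weight * rank
    return final

topic_ranks = {
    "sports": {"goal": 0.8, "bank": 0.2},
    "finance": {"goal": 0.1, "bank": 0.9},
}
final = combine_topic_ranks(topic_ranks, {"sports": 0.25, "finance": 0.75})
```

Because the toy document is mostly about finance, "bank" ends up ranked above "goal", even though "goal" dominates under the sports topic.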