The RAKE Keyword Extraction Algorithm

by Robin Kuo 2021-04-20

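RAKE stands for Rapid Automatic Keyword Extraction, an algorithm for identifying the keywords in a sentence.
RAKE's main strength is speed: it needs no heavy computation, and one simple rule is enough to pull out the key points of a text.
The method favors compound words, that is, multi-word phrases found in the sentence.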

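RAKE works by splitting the text at stop words (the words that carry little meaning on their own are masked out) and using those splits as the boundaries between text segments. For example: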

RAKE is a text rank algorithm to find important text.

After the stop words are masked out, it becomes:

RAKE – text rank algorithm – important text.

Joining the words that remain adjacent after the split gives three compound words: RAKE, text rank algorithm, and important text.
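The splitting step can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the stop-word list `STOP_WORDS` and the helper name `candidate_phrases` are hypothetical, and the list contains only the words dropped in the example sentence, whereas a real stop list is far longer.

```python
import re

# Hypothetical, deliberately tiny stop-word list that matches the example.
STOP_WORDS = {"is", "a", "to", "find"}

def candidate_phrases(sentence, stop_words=STOP_WORDS):
    """Split a sentence into runs of consecutive non-stop words."""
    words = re.findall(r"[A-Za-z]+", sentence)
    phrases, current = [], []
    for word in words:
        if word.lower() in stop_words:
            # A stop word closes the phrase collected so far.
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        # Flush the last phrase at the end of the sentence.
        phrases.append(" ".join(current))
    return phrases

phrases = candidate_phrases("RAKE is a text rank algorithm to find important text.")
print(phrases)  # ['RAKE', 'text rank algorithm', 'important text']
```

The sketch splits only on stop words, as described above; treating punctuation as an additional boundary would be a natural extension.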
Next, build a co-occurrence table that records how often each word appears and which words it appears with:

Word co-occurrence table
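A co-occurrence table for the example sentence can be rebuilt with a short sketch. It assumes the three candidate phrases are already available as lists of words, and the function name `cooccurrence` is made up for illustration. Each cell counts how many times two words share a phrase, and the diagonal holds the word's own count.

```python
from collections import defaultdict

# Candidate phrases from the example sentence, one list of words per phrase.
phrases = [["RAKE"], ["text", "rank", "algorithm"], ["important", "text"]]

def cooccurrence(phrases):
    """Count, for each pair of words, how often they share a phrase."""
    table = defaultdict(lambda: defaultdict(int))
    for phrase in phrases:
        for w1 in phrase:
            for w2 in phrase:
                # w1 == w2 fills the diagonal, i.e. the word's own count.
                table[w1][w2] += 1
    return table

table = cooccurrence(phrases)
words = ["RAKE", "text", "rank", "algorithm", "important"]

# Print the table with one row and one column per word.
print(" " * 10 + "".join(f"{w:>10}" for w in words))
for w1 in words:
    print(f"{w1:<10}" + "".join(f"{table[w1][w2]:>10}" for w2 in words))
```

In this layout the diagonal cell of a word is its frequency, and the sum of its row is its degree.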

From the co-occurrence table above, compute the degree and the frequency of each word.

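frequency: the number of times a word occurs. The co-occurrence table shows that text occurs twice and every other word occurs once.
degree: how connected a word is to the other words. For example, the two words in important text co-occur with each other, so each has degree 2, and the three words in text rank algorithm co-occur with one another, so each has degree 3. The word text appears in both compound words, so its degrees add up to 5.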
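Both quantities can also be computed directly from the phrases, without materializing the table. The following is a minimal sketch under that assumption: every occurrence of a word adds 1 to its frequency and adds the length of the phrase it sits in (the word itself included) to its degree.

```python
from collections import Counter

# Candidate phrases from the example sentence, one list of words per phrase.
phrases = [["RAKE"], ["text", "rank", "algorithm"], ["important", "text"]]

frequency = Counter()
degree = Counter()
for phrase in phrases:
    for word in phrase:
        frequency[word] += 1
        # Each occurrence contributes the size of its phrase to the degree.
        degree[word] += len(phrase)

for word in frequency:
    print(word, "frequency =", frequency[word], "degree =", degree[word])
```

For the word text this gives frequency 2 and degree 3 + 2 = 5, matching the values described above.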

Working back from these values gives the scores of the three phrases:

By this calculation, text rank algorithm is the keyword of this sentence!
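The ranking can be reproduced with a short sketch. The scoring formula is an assumption here, since it is not spelled out in the text: each word is scored as degree divided by frequency, the ratio used in the RAKE paper, and each phrase is scored as the sum of its word scores.

```python
from collections import Counter

# Candidate phrases from the example sentence, one list of words per phrase.
phrases = [["RAKE"], ["text", "rank", "algorithm"], ["important", "text"]]

frequency, degree = Counter(), Counter()
for phrase in phrases:
    for word in phrase:
        frequency[word] += 1
        degree[word] += len(phrase)

# Assumption: word score = degree / frequency.
word_score = {w: degree[w] / frequency[w] for w in frequency}

# Assumption: phrase score = sum of the scores of its words.
phrase_score = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}

# Print the phrases from highest to lowest score.
for phrase, score in sorted(phrase_score.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phrase}: {score}")
```

Under these assumptions text rank algorithm scores 8.5, important text 4.5, and RAKE 1.0, so text rank algorithm comes out on top.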

Source paper: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.657.8134&rep=rep1&type=pdf