Clustering-based multi-document selective text summarization using the LexRank algorithm.
This repository contains the source code for the paper 설진석, 이상구. "lexrankr: LexRank 기반 한국어 다중 문서 요약." 한국정보과학회 학술발표논문집 (2016): 458-460.
- Designed mainly for Korean, but not limited to it.
- See the KoNLPy installation guide for how to install KoNLPy properly.
- Check out textrankr, which is a simpler summarizer using TextRank.
pip install lexrankr
Tokenizers are not included; you have to implement one yourself.
Example:
from typing import List

class MyTokenizer:
    def __call__(self, text: str) -> List[str]:
        tokens: List[str] = text.split()
        return tokens
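Whitespace splitting keeps punctuation and case, so "Document." and "document" count as different tokens. The following is a minimal sketch of a slightly stricter tokenizer with the same callable interface. The class name `RegexTokenizer` and its default stopword set are hypothetical, chosen only for illustration; they are not part of lexrankr.

```python
import re
from typing import FrozenSet, List


class RegexTokenizer:
    """Hypothetical tokenizer: lowercases text, extracts word tokens, drops stopwords."""

    def __init__(
        self,
        stopwords: FrozenSet[str] = frozenset({"a", "an", "the", "of", "is"}),
    ) -> None:
        # The stopword set is an illustrative assumption; replace it with your own.
        self.stopwords = stopwords
        # \w+ matches runs of letters, digits, and underscores (Unicode-aware in Python 3).
        self.pattern = re.compile(r"\w+")

    def __call__(self, text: str) -> List[str]:
        tokens: List[str] = self.pattern.findall(text.lower())
        return [token for token in tokens if token not in self.stopwords]
```

Any object that takes a `str` and returns a `List[str]`, as `MyTokenizer` does, should be usable in the same place.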
For Korean, one option is to use KoNLPy.
from typing import List
from konlpy.tag import Okt

class OktTokenizer:
    okt: Okt = Okt()

    def __call__(self, text: str) -> List[str]:
        tokens: List[str] = self.okt.pos(text, norm=True, stem=True, join=True)
        return tokens
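With `join=True`, `Okt.pos` returns each token as a single `"surface/Tag"` string, for example `"문서/Noun"`. If particles and punctuation add noise, the tokens could be filtered by tag before they are returned. The sketch below is only an illustration: the helper name `filter_by_pos` and the set of kept tags are assumptions, not part of lexrankr or KoNLPy.

```python
from typing import FrozenSet, List

# Tags to keep; this particular selection is an assumption for illustration.
CONTENT_TAGS: FrozenSet[str] = frozenset({"Noun", "Verb", "Adjective"})


def filter_by_pos(
    tokens: List[str],
    allowed_tags: FrozenSet[str] = CONTENT_TAGS,
) -> List[str]:
    """Keep only "surface/Tag" tokens whose tag is in allowed_tags."""
    kept: List[str] = []
    for token in tokens:
        # Split on the last slash so a surface form containing "/" stays intact.
        surface, sep, tag = token.rpartition("/")
        if sep and tag in allowed_tags:
            kept.append(token)
    return kept
```

Inside `OktTokenizer.__call__`, the result of `self.okt.pos(...)` could be passed through `filter_by_pos` before returning.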
from typing import List
from lexrankr import LexRank

# 1. init
mytokenizer: MyTokenizer = MyTokenizer()
lexrank: LexRank = LexRank(mytokenizer)

# 2. summarize (like, pre-computation)
# `your_text_here` is a placeholder for the text you want to summarize.
lexrank.summarize(your_text_here)

# 3. probe (like, query-time)
summaries: List[str] = lexrank.probe()
for summary in summaries:
    print(summary)
Use Docker.
docker build -t lexrankr -f Dockerfile .
docker run --rm -it lexrankr