#Information

Introduction: This is the project of 2013 Information Retrieval class.
Author: Veck Hsiao, Nick Cheng @ PLSM Lab, NCCU, Taipei, Taiwan
Last Update: 2014/01/16 22:33
Developing Platform:
- Operating System: Linux Mint 13
- JDK Version: 1.7.x
- Lucene Version: 4.6.0
- Tomcat Version: 7.0.x
- Chinese Segmenter: mmseg4j 1.8.2

#File Structure

.
├── CDrawChart.java
├── Chinese.java
├── chineses/
├── CTermFrequence.java
├── DrawChart.java
├── HW.java
├── index
├── IndexFiles.java
├── index.jsp
├── Readme.md
├── SearchFiles.java
├── SegChinese.java
├── TermFrequence.java
├── TFTemp
│   └── README
└── WEB-INF
    ├── classes (Please compile and move class files manually)
    │   ├── CDrawChart.class
    │   ├── Chinese.class
    │   ├── CPair.class
    │   ├── CTermFrequence.class
    │   ├── dirPair.class
    │   ├── DrawChart.class
    │   ├── HW.class
    │   ├── IndexFiles.class
    │   ├── Pair.class
    │   ├── SearchFiles.class
    │   └── slate/
    ├── lib/
    └── web.xml

English Part:
- HW.java - Main class of English query
- DrawChart.java - Drawing chart for query with Google Chart API
- IndexFiles.java - Indexing corpus with not yet segmented
- TermFrequence.java - Counting term frequency for specific query
Chinese Part:
- Chinese.java
- CDrawChart.java
- SegChinese.java - Chinese segmenter
- CTermFrequency.java
Others:
- index.jsp - Homepage.
- SearchFiles.java - Search file in index

#Before Using

Please copy or setup classpath of Java Compiler with JAR in WEB-INF/lib.
Please compile following files:
- Classess to be moved to WEB-INF/classes:
  1. HW.java
  2. Chinese.java
- Classese to be used as Tools: IndexFiles.java, SegChinese.java
After compilation of IndexFiles.java, please index WEB-INF/classese/slate/* with IndexFiles: java org.apache.lucene.demo.IndexFiles -index index -docs WEB-INF/classese/slate/* or java IndexFiles -index index -docs WEB-INF/classese/slate/*.
After compilation of SegChinese.java, please segment Chinsese corpus with it. Default corpus are part chapters of 紅樓夢 and 西遊記. When you finish segmenting, remember to put them in chinese folder(as dictionary).
Note that English implementation does not contain segmenter. It's only use indexer that lucenen offers. However, Chinese part uses a segmenter called mmseg4j and does not do indexing.
Some path should be modified since they were hard-coded in this implementation:
- SearchFiles.java: String index, filename
- HW.java, Chinese.java: logger, HTML Output anchor
- SegChinese.java: System.setProperty
Deploying with tomcat.

#How it Work ###English Query 0. Indexing corpus

Given a query
Searching file with index
Computing the number of term of the query in each document
Sending result to Google Chart API and display the responsed chart

###Chinese Query 0. Segmenting corpus

Given a query
Searching file with dictionary in chinese
Computing the number of term of the query in each document
Sending result to Google Chart API and display the responsed chart

#Issue

This project is considered to be rewritten in Python with PyLucene(Indexer and Querier) and Jieba(Segmenter)
Should make a good new user interface for both Chinese and English search.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Readme.md

Readme.md

Files

Readme.md

Latest commit

History

Readme.md

File metadata and controls