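Sphinx is an open-source search engine with full-text search support for English, so a standalone Sphinx install already gives you full-text indexing. Usually, though, what we need is Chinese indexing. How do we get that? Chinese developers have released Coreseek, a Chinese full-text search engine based on Sphinx that is suitable for enterprise use. In other words, the core of Coreseek is still Sphinx.
I chose version 4.1.
Download the tarball and extract it: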
wget http://files.opstool.com/man/coreseek-4.1-beta.tar.gz
tar -zxvf coreseek-4.1-beta.tar.gz
The directory structure is as follows:
[root@centos6 coreseek-4.1-beta]# ll
total 16
drwxrwxrwx. 15 root root 4096 Oct 18  2011 csft-4.1
drwxrwxrwx.  9 root root 4096 Oct 18  2011 mmseg-3.2.14
-rwxrwxrwx.  1 root root 2467 Jan 16  2011 README.txt
drwxrwxrwx.  6 root root 4096 Oct 18  2011 testpack
Install mmseg first. mmseg is the Chinese word segmentation component; Chinese search only works once it is in place.
[root@centos6 coreseek-4.1-beta]# cd mmseg-3.2.14/
[root@centos6 mmseg-3.2.14]# ./bootstrap
+ aclocal -I config
./bootstrap: line 23: aclocal: command not found
+ libtoolize --force --copy
./bootstrap: line 24: libtoolize: command not found
+ autoheader
./bootstrap: line 25: autoheader: command not found
+ automake --add-missing --copy
./bootstrap: line 26: automake: command not found
+ autoconf
./bootstrap: line 27: autoconf: command not found
This fails because the build tools are missing, so install the required packages:
yum -y install glibc-common libtool autoconf automake mysql-devel expat-devel
./bootstrap
./configure --prefix=/usr/local/mmseg3   # fails here with "configure: error: C++ compiler cannot create executables"
yum install gcc gcc-c++ gcc-g77   # install the compilers
./configure --prefix=/usr/local/mmseg3   # succeeds this time
make && make install
[root@centos6 mmseg-3.2.14]# /usr/local/mmseg3/bin/mmseg   # check that the install succeeded
Coreseek COS(tm) MM Segment 1.0
Copyright By Coreseek.com All Right Reserved.....
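Both failures above (the "command not found" lines from ./bootstrap and the C++ compiler error from ./configure) come from missing build tools. As a rough sketch, they can be caught up front before starting the build. The helper name missing_tools is my own, and the tool list is simply the set of commands that showed up in the errors above plus gcc, g++ and make.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper, not part of mmseg or Coreseek.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Tools that ./bootstrap and ./configure asked for in this walkthrough.
missing=$(missing_tools aclocal libtoolize autoheader automake autoconf gcc g++ make)
if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names print on one line.
    echo "missing build tools:" $missing
else
    echo "all build tools found"
fi
```

If anything is reported as missing, run the yum commands shown above and check again.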
Install Coreseek
cd ../csft-4.1/
sh buildconf.sh
./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql
make && make install
Test mmseg segmentation and Coreseek search
cd ../testpack
cat var/test/test.xml   # Chinese text should display correctly here
/usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/test.xml
/usr/local/coreseek/bin/indexer -c etc/csft.conf --all
/usr/local/coreseek/bin/search -c etc/csft.conf 网络搜索   # results below
words:
1. '网络': 1 documents, 1 hits
2. '搜索': 2 documents, 5 hits
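If the cat step shows garbage instead of Chinese, the cause is either the terminal locale or the file encoding. The configuration used later in this post sets charset_type = zh_cn.utf-8, so the input is expected to be UTF-8. Here is a minimal sketch for checking a file, assuming iconv is installed; the helper name is_valid_utf8 is my own.

```shell
#!/bin/sh
# Return 0 if the file decodes cleanly as UTF-8, non-zero otherwise.
# iconv exits with an error when it meets an invalid byte sequence.
# (is_valid_utf8 is a hypothetical helper, not part of Coreseek.)
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Report on the sample file if it is present (path is relative to testpack).
if [ -f var/test/test.xml ]; then
    if is_valid_utf8 var/test/test.xml; then
        echo "var/test/test.xml is valid UTF-8"
    else
        echo "var/test/test.xml is NOT valid UTF-8"
    fi
fi
```

If the file is valid but the terminal still shows garbage, `locale charmap` shows which encoding the current session is using.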
Configuration:
cd /usr/local/coreseek/etc
cp sphinx-min.conf.dist csft.conf
vim csft.conf

source article_src
{
    type = mysql
    sql_host = 192.168.189.128
    sql_user = remote
    sql_pass = 123456
    sql_db = coreseek_test
    sql_port = 3306 # optional, default is 3306
    sql_query = SELECT * from article
    sql_query_pre = SET NAMES utf8
    # sql_attr_uint = group_id
    # sql_attr_timestamp = date_added
    # sql_query_info = SELECT * FROM documents WHERE id=$id
}

index article_index
{
    source = article_src
    path = /usr/local/coreseek/var/data/article_index
    docinfo = extern
    charset_dictpath = /usr/local/mmseg3/etc/
    charset_type = zh_cn.utf-8
    min_word_len = 1
}

indexer
{
    mem_limit = 32M
}

searchd
{
    listen = 9312
    listen = 9306:mysql41
    log = /usr/local/coreseek/var/log/searchd.log
    query_log = /usr/local/coreseek/var/log/query.log
    read_timeout = 5
    max_children = 30
    pid_file = /usr/local/coreseek/var/log/searchd.pid
    max_matches = 1000
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old = 1
    workers = threads # for RT to work
}
Testing:
Insert a few rows into the article table, then build the index:
/usr/local/coreseek/bin/indexer --all --rotate
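The post does not show the article table itself, so here is a sketch of what loading a few test rows might look like. The column names title and content and the sample rows are assumptions of mine. The one hard requirement is that the first column returned by sql_query is a unique positive integer document ID, which is why id comes first here, given that the source uses SELECT *.

```shell
#!/bin/sh
# Print SQL that creates a hypothetical `article` table and adds two rows.
# Only the leading integer id is required by Sphinx; the other columns
# and the sample data are assumptions for illustration.
sample_sql() {
    cat <<'SQL'
SET NAMES utf8;
CREATE TABLE IF NOT EXISTS article (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title   VARCHAR(255) NOT NULL,
    content TEXT NOT NULL
) DEFAULT CHARSET=utf8;
INSERT INTO article (title, content) VALUES
    ('搜索引擎', '全文检索的基本原理'),
    ('通信技术', '移动通信网络的发展');
SQL
}

# To load it with the connection settings from csft.conf (prompts for the password):
#   sample_sql | mysql -h 192.168.189.128 -P 3306 -u remote -p coreseek_test
```

The function only prints the SQL, so it is safe to inspect the output first and pipe it into mysql afterwards.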
Test search:
[root@centos6 coreseek]# /usr/local/coreseek/bin/search 通信
Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)
using config file '/usr/local/coreseek/etc/csft.conf'...
index 'article_index': query '通信 ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=2, weight=1680
words:
1. '通信': 1 documents, 1 hits
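The search tool reads the index files directly. The searchd section of the config also opens port 9306 with the mysql41 protocol, which means a running searchd can be queried with an ordinary mysql client using SphinxQL. What follows is a sketch under those assumptions, not something taken from the steps above. The helper names are my own, and match_query does no escaping, so it is only suitable for simple keywords.

```shell
#!/bin/sh
# Install prefix used in this walkthrough.
CORESEEK=/usr/local/coreseek

# Start the search daemon with the config written above.
start_searchd() {
    "$CORESEEK/bin/searchd" -c "$CORESEEK/etc/csft.conf"
}

# Build a SphinxQL full-text query: match_query <index> <keyword>
# No escaping is done, so keep the keyword free of quotes.
match_query() {
    printf "SELECT * FROM %s WHERE MATCH('%s') LIMIT 10" "$1" "$2"
}

# Send one statement to the mysql41 listener. Using 127.0.0.1 rather than
# localhost makes the client connect over TCP instead of a local socket.
sphinxql() {
    mysql -h 127.0.0.1 -P 9306 --default-character-set=utf8 -e "$1"
}

# Typical use once the index has been built:
#   start_searchd
#   sphinxql "$(match_query article_index 通信)"
```

Keeping the query builder separate from the mysql call makes it easy to print the statement and check it before sending it.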