Sphinx全文索引安装教程
By admin
- 6 minutes read - 1163 words首先了解一下sphinx全文索引的相关知识
官方网站: http://www.sphinxsearch.com/
官方文档: http://www.sphinxsearch.com/docs/
中文支持: http://www.coreseek.cn/
中文使用手册下载: http://www.coreseek.cn/uploads/pdf/sphinx_doc_zhcn_0.9.pdf
基本上看看上面的官方教程和中文使用手册,你应该会安装和使用Sphix全文索引,当然,还有一些细节,需要不断的google和baidu,那为了节省大家的时间,就出一个完整的Sphinx安装教程和结合PHPWIND程序的使用教程(PHPWIND7.5版本支持)。
接下来开始Sphinx的技术之旅吧!
考虑到Sphinx全文索引使用的实际需要,主要介绍Sphinx全文索引中文方面的支持。 这里需要感谢李沫南同学对Sphinx全文索引中文支持的贡献! ** 一,Windows下安装Sphinx** 1,开始前的准备工作 来源: http://www.coreseek.cn/products/ft_down/ 下载csft3.1: http://www.coreseek.cn/uploads/csft/3.1/win32/csft3.1.bin.zip 下载标准词库: http://www.coreseek.cn/uploads/csft/3.1/data.zip 解压:csft3.1.bin.zip 如下目录,解压在C:\csft3.1目录下 解压:data.zip,解压在C:\csft3.1\data目录下 [分词包]
需要新建log文件夹
(1)复制 C:\csft3.1\conf\csft.conf.in 文件到 C:\csft3.1\bin\ 目录下,并重命名为csft.conf 注意csft.conf文件里的类似:path = @CONFDIR@/data/test1 把@CONFDIR@替换为C:\csft3.1\ 如上更改为:path = C:\csft3.1\ data\test1
(2)把测试数据 C:\csft3.1\conf\example.sql 导入数据库 [这个基本都会吧!]
(3)建立索引,在DOC界面下运行:indexer.exe –all 如下图,
如果执行上面的命令时,提示找不到python.dll找不到的话,请安装python2.5即可,windows下的安装文件下载地址:
建立索引过程需要仔细检查csft.conf数据库配置是否正确。如下: sql_host = localhost #数据库主机地址 sql_user = test #数据库用户名,拥有数据库所有权限 sql_pass = sql_db = test #数据库名 sql_port = 3306 #可用端口,一般不需要更改
其它配置使用默认,先体验下sphinx全文索引功能。
(4)测试搜索是否正常,运行:search.exe test 如下图
测试正常将返回
(5)开启搜索进程服务,运行:searchd.exe 如下图
这样就能提供sphinx全文索引的搜索服务了,以上就是一个简单的操作过程,如果需要支持中文索引,就需要配置相应的参数,具体请查看中文使用手册。为了便于大家了解相关配置,可查看PHPW_ind_程序支持Sphinx全文索引的配置文件,大家可边对照手册边了解[中文支持具体请看linux安装部分]。
附:PHPW_ind_程序支持Sphinx全文索引的配置。
Windows下安装Sphix使用csft非常简单,如果大家有兴趣可从sphinx[ www.sphinxsearch.com]官方下载安装,不过有点复杂,这里就不介绍了,高手们慢慢体验。
二,linux下安装Sphinx全文索引,以CentOS 5.3为例
只能说windows下安装sphinx只是为了体验,因为linux下安装sphinx才是正道。 为了详细体验Centos下安装Sphinx,重新安装Centos系统,完整体验Sphinx安装过程。 Coreseek 全文检索服务器版本已经集成sphinx和中文分词补丁,只需要下载MMSeg和Coreseek Fulltext Server(源代码),就能实现Sphinx服务支持。 下载地址: http://www.coreseek.cn/products/ft_down/
推荐源代码安装
1,开始前的准备工作 [如果已经安装就不需要,如果下面列表没有还有其它的请补上] 1)安装mysql 2)安装php 3)安装apache 4)安装python 5)安装libiconv 6)安装gcc-c++ 7)下载Coreseek Fulltext Server(源代码): http://www.coreseek.cn/uploads/csft/3.1/Source/csft-3.1.tar.gz 8)下载Coreseek Mmseg(源代码): http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz
执行如下命令 yum install python python-dev
2,安装步骤 (1)下载CSFT与MMseg #wget http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz #wget http://www.coreseek.cn/uploads/csft/3.1/Source/csft-3.1.tar.gz
(2)安装MMseg中文分词 # pwd /usr/local [知道当前的安装目录] # wget http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz # tar xzvf mmseg-3.1.tar.gz # mkdir /usr/local/mmseg # cd mmseg-3.1 # ./configure –prefix=/usr/local/mmseg # make # make install
运行如下,看看mmseg是否安装成功 # /usr/local/mmseg/bin/mmseg Coreseek COS(tm) MM Segment 1.0 Copyright By Coreseek.com All Right Reserved. Usage: /usr/local/mmseg/bin/mmseg -u Unigram Dictionary -r Combine with -u, used a plain text build Unigram Dictionary, default Off -b Synonyms Dictionary -h print this help and exit
(3)安装csft-3.1 # pwd /usr/local # wget http://www.coreseek.cn/uploads/csft/3.1/Source/csft-3.1.tar.gz # tar xzvf csft-3.1.tar.gz # mkdir /usr/local/csft # cd csft-3.1 # ./configure –prefix=/usr/local/csft –with-mmseg=/usr/local/mmseg/bin/mmseg –with-mmseg-includes=/usr/local/mmseg/include/mmseg/ –with-mmseg-libs=/usr/local/mmseg/lib/ # make # make install
这里make的时候可能出错,解决如下: 1,检查环境是否安装如下软件 # yum install mysql mysql-devel php-mysql qt4-mysql [mysql环境要首先安装] # yum install python python-dev
2,是否安装libiconv 下载地址: http://savannah.gnu.org/projects/libiconv/
3,如果还有错误,打开src/Makefile文件,进行修改 # vi src/Makefile 找到182行
LIBS = -lm -lz -lexpat -L/usr/local/lib -lpthread LIBS = -lm -lz -lexpat -liconv -L/usr/local/lib -lpthread
这样,如果一切顺利,就开始配置你的sphinx全文索引服务器吧[如果安装有什么问题,欢迎在PHPWind官方提问]!
3,按下来就是配置 #cp /usr/local/csft/etc/sphinx-min.conf.dist /usr/local/csft/etc/sphinx.conf 修改sphinx.conf文件中的数据库参数配置,方法同windows下一样 sql_host = localhost sql_user = root sql_pass = sql_db = test
4,把体验数据/usr/local/csft/etc/example.sql 导入到数据库 [这一步应该都会] 5,新建索引 # /usr/local/csft/bin/indexer –all
6,测试搜索 # /usr/local/csft/bin/search test 如果测试有返回,恭喜你的sphinx全文索引服务器配置成功
7,接下来就是支持中文的配置和实现
UTF8编码实例 [如果已经存在utf8的数据库就不需要新建,这里只是举例] 1)创建一个新的数据库,注意编码为utf8_general_ci,如phpwind 2)导入部分现有的GBK数据,如pw_threads 3)配置csft.conf如下 source数据源部分 sql_host = localhost sql_user = root sql_pass = sql_db = phpwind sql_query_pre = SET NAMES utf8 sql_query_pre = SET SESSION query_cache_type=OFF sql_query = SELECT tid,fid,authorid,subject FROM pw_threads sql_attr_uint = fid sql_attr_uint = authorid
索引部分 charset_type = zh_cn.utf-8 charset_dictpath = /usr/local/csft/ min_prefix_len = 0 min_infix_len = 0 min_word_len = 2
4)创建数据词典 #pwd /usr/local/mmseg-3.1/data [这是你解压mmseg的目录下的data] 运行如下命令 # mmseg -u unigram.txt # ll 总计 10152 -rwxr-xr-x 1 root root 715 06-06 18:40 build_unigram.py -rwxr-xr-x 1 root root 32674 06-06 18:40 char.stat.txt -rwxr-xr-x 1 root root 1051268 06-06 18:40 Lexicon_full_words.txt -rwxr-xr-x 1 root root 1826251 06-06 18:40 unigram.txt -rw-r–r– 1 root root 3729280 09-16 20:20 unigram.txt.uni
将会生成 unigram.txt.uni 文件 # mv unigram.txt.uni uni.lib # cp uni.lib /usr/local/csft/ [这就是上面我们在配置索引中用的charset_dictpath]
其它的默认不变,如上方法创建索引 # /usr/local/csft/bin/indexer –all
测试是否成功 # /usr/local/csft/bin/search 测试
以上就是utf8编码的全文索引实现过程
GBK编码实例
与utf8一样,区别在于数据库和数据表使用gbk编码 同时只需要修改如下配置部分[csft.conf]
source数据源部分 sql_query_pre = SET NAMES gbk
索引部分 charset_type = zh_cn.gbk
这里需要注意一下,如果要想测试支持gbk,可以写一个PHP文件,调用sphinx提供的api接口,注意要开启searchd进程
# /usr/local/csft/bin/searchd
编写如下代码 [注意要与sphinxapi.php目录存放在一个目录] sphinxapi.php目录在# /usr/local/csft-3.1/api/下 也可以直接使用api目录下的test.php直接测试 SetServer(‘127.0.0.1’,3312); $sc->SetConnectTimeout(1); $sc->SetWeights(array(100,1)); $sc->SetMatchMode(SPH_MATCH_ALL); $sc->SetArrayResult(TRUE); $res = $sc->query(“简单”); var_dump($res); ?>
也可以直接运行search工具[utf8版],如下
[root@localhost ~]# /usr/local/csft/bin/search 便宜 Coreseek Full Text Server 3.1 Copyright (c) 2006-2008 coreseek.com using config file ‘/usr/local/csft/etc/csft.conf’… index ‘test1’: query ‘便宜 ‘: returned 4 matches of 4 total in 0.015 sec
displaying matches:
- document=3, weight=1, fid=7, authorid=1
- document=97, weight=1, fid=35, authorid=1
- document=108, weight=1, fid=32, authorid=1
- document=146, weight=1, fid=7, authorid=1
words:
- ‘便宜’: 4 documents, 4 hits
如果返回false,请检查searchd进程是否开启,如果返回成功,恭喜,你已经成为sphinx的使用者,向下一个高层次进军吧!
三,后记 其实很想制作一个安装视频教程,但由于时间有限,在安装过程中肯定会存在一些细节上的问题,只要大家按照上面的步骤一步一步安装,相信能把sphinx拿下,如果有什么问题 大家可查看 http://www.sphinxsearch.com/ 和 http://www.coreseek.cn/ 网站获取更多帮助,同时也可以查看中文手册。
同时也可以在phpwind官方网站 www.phpwind.net 提问和分享你的安装过程,把一个细节都亮出来,帮助别人也帮助自己。BY 2009-9-17
其它链接 用 PHP 构建自定义搜索引擎 http://www.ibm.com/developerworks/cn/opensource/os-php-sphinxsearch/index.html
MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm http://technology.chtsai.org/mmseg/
附phpwind配置实例[gbk版] PHPWind搜索sphinx配置实例 [修改部分参数就可直接应用于phpwind程序]
部分解读: 如下全文索引使用的是主索引+增量索引的方式,具体大家结合手册了解相关知识
需要创建一张表 [编码自己定,如下是gbk] CREATE TABLE IF NOT EXISTS `search_counter` ( `counterid` int(11) NOT NULL DEFAULT ‘0’, `max_doc_id` int(11) NOT NULL DEFAULT ‘0’, `min_doc_id` int(10) NOT NULL DEFAULT ‘0’, PRIMARY KEY (`counterid`) ) ENGINE=MyISAM DEFAULT CHARSET=gbk;
csft.conf配置文件
source tmsgs { type = mysql sql_host = localhost sql_user = root sql_pass = xxxx sql_db = phpwind sql_port = 3307 # optional, default is 3306 sql_sock = /tmp/mysql3307.sock sql_query_pre = SET NAMES gbk sql_query_pre = SET SESSION query_cache_type=OFF sql_query_pre = REPLACE INTO search_counter SELECT 1,MAX(tid),MIN(tid) FROM pw_tmsgs sql_query_range = SELECT min_doc_id, max_doc_id FROM search_counter WHERE counterid = 1 sql_range_step = 1000 sql_query = SELECT th.tid,th.subject,th.authorid,th.postdate,th.lastpost,th.fid,th.digest,th.hits,th.replies,t.content FROM pw_threads th LEFT JOIN pw_tmsgs t USING(tid) WHERE th.tid > $start AND th.tid <= $end
sql_attr_uint = authorid sql_attr_uint = hits sql_attr_uint = replies sql_attr_uint = fid sql_attr_timestamp = postdate sql_attr_timestamp = lastpost sql_attr_uint = digest }
source addtmsgs { type = mysql sql_host = localhost sql_user = root sql_pass = xxxx sql_db = phpwind sql_port = 3307 # optional, default is 3306 sql_sock = /tmp/mysql3307.sock sql_query_pre = SET NAMES gbk sql_query_pre = SET SESSION query_cache_type=OFF sql_query_range = SELECT max_doc_id, max_doc_id+100000 FROM search_counter WHERE counterid = 1 sql_range_step = 100000 sql_query = SELECT th.tid,th.subject,th.authorid,th.postdate,th.lastpost,th.fid,th.digest,th.hits,th.replies,t.content FROM pw_threads th LEFT JOIN pw_tmsgs t USING(tid) WHERE th.tid > $start AND th.tid <= $end
sql_attr_uint = authorid sql_attr_uint = hits sql_attr_uint = replies sql_attr_uint = fid sql_attr_timestamp = postdate sql_attr_timestamp = lastpost sql_attr_uint = digest sql_query_post = REPLACE INTO search_counter SELECT 1,MAX(tid),MIN(tid) FROM pw_tmsgs #sql_attr_uint = tid }
source threads { type = mysql sql_host = localhost sql_user = root sql_pass = xxxxxxx sql_db = phpwind sql_port = 3307 # optional, default is 3306 sql_sock = /tmp/mysql3307.sock sql_query_pre = SET NAMES gbk sql_query_pre = SET SESSION query_cache_type=OFF sql_query_pre = REPLACE INTO search_counter SELECT 3,MAX(tid),MIN(tid) FROM pw_threads sql_query_range = SELECT min_doc_id, max_doc_id FROM search_counter WHERE counterid = 3 sql_range_step = 1000 sql_query = SELECT th.tid,th.subject,th.authorid,th.postdate,th.lastpost,th.fid,th.digest,th.hits,th.replies FROM pw_threads th WHERE th.tid > $start AND th.tid <= $end sql_attr_uint = authorid sql_attr_uint = hits sql_attr_uint = replies sql_attr_uint = fid sql_attr_timestamp = postdate sql_attr_timestamp = lastpost sql_attr_uint = digest }
source addthreads { type = mysql sql_host = localhost sql_user = root sql_pass = xxx sql_db = phpwind sql_port = 3307 # optional, default is 3306 sql_sock = /tmp/mysql3307.sock sql_query_pre = SET NAMES gbk sql_query_pre = SET SESSION query_cache_type=OFF sql_query_range = SELECT max_doc_id, max_doc_id+100000 FROM search_counter WHERE counterid = 3 sql_range_step = 100000 sql_query = SELECT th.tid,th.subject,th.authorid,th.postdate,th.lastpost,th.fid,th.digest,th.hits,th.replies FROM pw_threads th WHERE th.tid > $start AND th.tid <= $end
sql_attr_uint = authorid sql_attr_uint = hits sql_attr_uint = replies sql_attr_uint = fid sql_attr_timestamp = postdate sql_attr_timestamp = lastpost sql_attr_uint = digest sql_query_post = REPLACE INTO search_counter SELECT 3,MAX(tid),MIN(tid) FROM pw_threads #sql_attr_uint = tid }
index tmsgsindex { source = tmsgs path = /usr/local/csft/var/data/tmsgs docinfo = extern charset_type = zh_cn.gbk #min_prefix_len = 0 #min_infix_len = 2 #ngram_len = 2 charset_dictpath = /usr/local/csft/ min_prefix_len = 0 min_infix_len = 0 min_word_len = 2 }
index addtmsgsindex { source = addtmsgs path = /usr/local/csft/var/data/addtmsgs docinfo = extern charset_type = zh_cn.gbk #min_infix_len = 2 #ngram_len = 2 charset_dictpath = /usr/local/csft/ min_prefix_len = 0 min_infix_len = 0 min_word_len = 2 } index threadsindex { source = threads path = /usr/local/csft/var/data/threads docinfo = extern charset_type = zh_cn.gbk #min_prefix_len = 0 #min_infix_len = 2 #ngram_len = 2 charset_dictpath = /usr/local/csft/ min_prefix_len = 0 min_infix_len = 0 min_word_len = 2 }
index addthreadsindex { source = addthreads path = /usr/local/csft/var/data/addthreads docinfo = extern charset_type = zh_cn.gbk #min_infix_len = 2 #ngram_len = 2 charset_dictpath = /usr/local/csft/ min_prefix_len = 0 min_infix_len = 0 min_word_len = 2 } indexer { mem_limit = 128M }
searchd { port = 3312 log = /usr/local/csft/var/log/searchd.log query_log = /usr/local/csft/var/log/query.log read_timeout = 5 max_children = 30 pid_file = /usr/local/csft/var/log/searchd.pid max_matches = 1000 seamless_rotate = 1 preopen_indexes = 0 unlink_old = 1 }