Building a Search Engine with PHP + MongoDB + Coreseek/Sphinx
Published: 2016-04-30 13:04:07
In recent years the Linux + Nginx + PHP + MongoDB (LNPM) stack has grown steadily more popular, and it even shows signs of displacing the Linux + Nginx/Apache + PHP + MySQL combination. The reason is that MongoDB is powerful, flexible, easy to scale and, most importantly, easy to use. There is no table schema to design up front, documents of any shape can be inserted, and administration is simple. That has made it a first-choice database for startup teams and a rising star of the mobile internet.

MongoDB does share some traits with relational databases, though, and one of them is that its full-text index does not support Chinese. Full-text indexing has been enabled by default since MongoDB 2.6, but as usual Chinese is not among the supported languages, so a search feature needs another route.

Sphinx and Lucene are both good choices for building a search engine. In my opinion Lucene is better supported from Java and Sphinx from PHP, so I chose Sphinx. Sphinx itself does not handle Chinese well either: it splits words on whitespace, which suits English but is useless for Chinese word segmentation. Fortunately there are two Sphinx-based projects that add Chinese support, Coreseek and Sphinx-for-chinese.
Coreseek has complete documentation and, at the time of writing, supports the latest version of Sphinx, so I chose Coreseek.
Sphinx-for-chinese is seriously short of documentation.

Installation:
(1) Coreseek installation.
(2) Sphinx-for-chinese installation.

Creating the index:
Coreseek can connect to MySQL directly. Fill in the MySQL connection details in the Coreseek configuration file and Coreseek reads the MySQL data and builds the index by itself (provided, of course, that indexing has been configured or the indexing command has been run). Sphinx cannot connect to MongoDB directly, however, so the MongoDB data has to be exposed either as a Python data source or as an xmlpipe2 data source.
I do not know Python, so I wrote an XML pipe in PHP that streams the MongoDB data to Coreseek. The reference code is shown below (D() and C() are the ThinkPHP model and configuration helpers):

class SphinxXmlpipe
{
    private $xmlWriter;
    private $fields = array();
    private $attributes = array();

    public function setFields($fields)
    {
        $this->fields = $fields;
    }

    public function setAttributes($attributes)
    {
        $this->attributes = $attributes;
    }

    public function beginOutput()
    {
        // Create a new XML document in memory.
        $this->xmlWriter = new \XMLWriter();
        $this->xmlWriter->openMemory();
        $this->xmlWriter->setIndent(true);
        $this->xmlWriter->startDocument('1.0', 'UTF-8');
        $this->xmlWriter->startElement('sphinx:docset');
        $this->xmlWriter->startElement('sphinx:schema');

        // Add fields to the schema.
        foreach ($this->fields as $field) {
            $this->xmlWriter->startElement('sphinx:field');
            $this->xmlWriter->writeAttribute('name', $field);
            $this->xmlWriter->endElement();
        }

        /*
        // Add attributes to the schema.
        foreach ($this->attributes as $attributes) {
            $this->xmlWriter->startElement('sphinx:attr');
            foreach ($attributes as $key => $value) {
                $this->xmlWriter->writeAttribute($key, $value);
            }
            $this->xmlWriter->endElement();
        }
        */

        $this->xmlWriter->endElement(); // sphinx:schema
    }

    public function addDocument($doc)
    {
        $this->xmlWriter->startElement('sphinx:document');
        $this->xmlWriter->writeAttribute('id', $doc['book_id']);
        // Every key of the document becomes one child element.
        foreach ($doc as $key => $value) {
            $this->xmlWriter->startElement($key);
            $this->xmlWriter->text($value);
            $this->xmlWriter->endElement();
        }
        $this->xmlWriter->endElement(); // sphinx:document
    }

    public function endOutput()
    {
        $this->xmlWriter->endElement(); // sphinx:docset
        $this->xmlWriter->endDocument();
        echo $this->xmlWriter->outputMemory();
    }

    public function xmlpipe2()
    {
        $this->setFields(array('book_id', 'book_name'));
        $this->setAttributes(array(
            array('name' => 'book_id', 'type' => 'int', 'bits' => '16', 'default' => '1'),
        ));
        $this->beginOutput();

        $mBook = D('book');
        $count = $mBook->count();
        // Batch size; guard against a missing or zero setting.
        $limit = max(1, (int) C('XMLPIPE_BOOKS_COUNT_PER_TIME'));

        // Read the collection in batches, advancing the offset on every pass
        // so that each batch returns new records rather than the first page again.
        for ($offset = 0; $offset < $count; $offset += $limit) {
            // The field() arguments are kept from the original code; the intent
            // is to return only book_id and book_name, without _id.
            $books = $mBook->field('book_id,book_name', '_id=>0')->limit($offset, $limit)->select();
            foreach ($books as $book) {
                $this->addDocument($book);
            }
            unset($books);
        }

        $this->endOutput();
    }
}

The output XML has the format shown in Figure 1.
Figure 1

The corresponding Coreseek settings are shown below:

source src1
{
    type = xmlpipe2
    xmlpipe_command = cd /var/www/PHPParser && php index.php /Home/SphinxXmlpipe/xmlpipe2
    xmlpipe_field = book_id
    xmlpipe_field = book_name
    xmlpipe_attr_timestamp = book_id
    xmlpipe_attr_uint = book_id
    xmlpipe_fixup_utf8 = 1
}

Searching:
(1) PHP provides a Sphinx extension, which also works with Coreseek.
(2) The Sphinx package ships sphinxapi in its api directory.
I used the PHP extension. The search code is shown below:

public function getResultBySearchText($search_text)
{
    $sphinxClient = new \SphinxClient();
    $sphinxClient->setServer('localhost', 9312); // server = localhost, port = 9312
    $sphinxClient->setMatchMode(SPH_MATCH_ANY);
    $sphinxClient->setMaxQueryTime(5000);        // limit the search to 5 seconds

    $result = $sphinxClient->query($search_text);
    if ($result === false) {
        // query() returns false on failure, so there is no 'time' to read.
        return array('time' => 0, 'error' => $sphinxClient->getLastError());
    }

    $rel = array('time' => $result['time']);
    if (isset($result['matches'])) {
        $rel['matches'] = $result['matches'];
    }
    return $rel;
}

Because the data source is xmlpipe, the search returns document IDs only, and the actual records still have to be fetched from MongoDB by those IDs. I will not go into how to read the data from MongoDB here.
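The final step, turning the IDs returned by Sphinx into records, has one detail worth showing: a MongoDB `$in` query does not return documents in the order of the IDs passed to it, so the relevance ranking has to be restored afterwards. The following is only a minimal sketch under stated assumptions. It assumes `matches` is keyed by document ID (the default when array results are not enabled) and that the ID field is `book_id`, as in the code above. The function names `orderDocsBySphinxRank` and `fetchRankedDocs` are hypothetical, and the MongoDB lookup is passed in as a callable so the sketch does not depend on any particular driver.

```php
<?php
/**
 * Reorder documents fetched from the database so they follow the
 * ranking of the Sphinx result. IDs with no matching document are skipped.
 */
function orderDocsBySphinxRank(array $matches, array $docs, $idField = 'book_id')
{
    // Index the fetched documents by ID for direct lookup.
    $byId = array();
    foreach ($docs as $doc) {
        $byId[$doc[$idField]] = $doc;
    }

    // Walk the match IDs in ranking order and pick up each document.
    $ordered = array();
    foreach (array_keys($matches) as $id) {
        if (isset($byId[$id])) {
            $ordered[] = $byId[$id];
        }
    }
    return $ordered;
}

/**
 * Fetch the documents for a search result and return them in ranking order.
 * $fetchByIds is a hypothetical callable that takes a list of IDs and returns
 * the documents, e.g. a MongoDB find with the filter
 * array('book_id' => array('$in' => $ids)).
 */
function fetchRankedDocs(array $searchResult, callable $fetchByIds, $idField = 'book_id')
{
    if (empty($searchResult['matches'])) {
        return array();
    }
    $ids = array_keys($searchResult['matches']);
    return orderDocsBySphinxRank($searchResult['matches'], $fetchByIds($ids), $idField);
}

// Usage with a stand-in for the database, so the sketch runs without MongoDB.
$searchResult = array(
    'time'    => '0.001',
    'matches' => array(
        42 => array('weight' => 3, 'attrs' => array()),
        7  => array('weight' => 2, 'attrs' => array()),
        99 => array('weight' => 1, 'attrs' => array()), // no longer in the database
    ),
);

$fakeFetch = function (array $ids) {
    // Stored order differs from ranking order, as it may with $in.
    $stored = array(
        array('book_id' => 7,  'book_name' => 'Book seven'),
        array('book_id' => 42, 'book_name' => 'Book forty-two'),
    );
    return array_values(array_filter($stored, function ($doc) use ($ids) {
        return in_array($doc['book_id'], $ids);
    }));
};

$ranked = fetchRankedDocs($searchResult, $fakeFetch);
foreach ($ranked as $doc) {
    echo $doc['book_id'], ' ', $doc['book_name'], "\n";
}
```

Skipping IDs that have no document is a deliberate choice: the index is rebuilt only periodically, so it can briefly refer to records that have since been deleted from MongoDB.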