Configuration Guide
===================
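SF1 Configuration File Overview
-------------------------------
The SF1 configuration file covers the following areas:

1. System: global parameter settings, including default parameters for indexing, mining, and recommendation features, the number of threads, and language analyzer settings.
2. Collection: defines the structure of the documents in a collection and the specific indexing and mining parameters.
3. Deployment: distributed configuration.
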
System
--------
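The SF1 configuration file is validated against a standard schema definition file. The schema file sf1r-config.xsd must be placed in the same directory as the SF1 XML configuration file.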
::
.
.
.
.
Resource
~~~~~~~~~~~~
This parameter specifies the absolute path from which resources are loaded.
::
WorkingDir
~~~~~~~~~~~~~~~
This parameter specifies the path in which resources run (the working directory).
::
LogConnection
~~~~~~~~~~~~~~~~~
This parameter selects the log database.
::
.. note::
   To connect to sqlite3, use str="sqlite3://./log/COBRA". To connect to MySQL, use str="mysql://root:123456@127.0.0.1:3306/SF1R".
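The two connection strings in the note follow a URL-like layout. The sketch below is only an illustration of how such a string breaks down; the helper name ``parse_log_connection`` is hypothetical and is not part of SF1, and SF1's own parsing may differ.

```python
from urllib.parse import urlsplit


def parse_log_connection(conn: str) -> dict:
    """Split a LogConnection string into its parts (illustrative helper, not SF1 code)."""
    parts = urlsplit(conn)
    if parts.scheme == "sqlite3":
        # Assumption: everything after "sqlite3://" is a file path.
        return {"backend": "sqlite3", "path": conn[len("sqlite3://"):]}
    if parts.scheme == "mysql":
        return {
            "backend": "mysql",
            "user": parts.username,
            "password": parts.password,
            "host": parts.hostname,
            "port": parts.port,
            "database": parts.path.lstrip("/"),
        }
    raise ValueError(f"unsupported log backend: {parts.scheme!r}")


print(parse_log_connection("sqlite3://./log/COBRA"))
print(parse_log_connection("mysql://root:123456@127.0.0.1:3306/SF1R"))
```

Reading the string this way makes it easy to spot a missing port or database name before the server is started.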
LogServerConnection
~~~~~~~~~~~~~~~~~~~~~~~
This parameter specifies how to connect to the log server.
::
IndexBundle
~~~~~~~~~~~~~~~~~
Sets the default index-related parameters.
::
default-collection-dir
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|Element                  |Attribute                |Description                                                                     |
+=========================+=========================+================================================================================+
|CollectionDataDirectory  |                         |Index directory for the collection data.                                        |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|IndexStrategy            |memorypoolsize           |Size in bytes of the memory pool used for indexing. Performance degrades        |
|                         |                         |sharply below 5,000,000.                                                        |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |indexpolicy              |Two options: 1) default: documents become searchable only after indexing        |
|                         |                         |finishes, and indexing is fast; 2) realtime: documents are searchable in real   |
|                         |                         |time.                                                                           |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |indexlevel               |Index level. Two options: 1) doclevel; 2) wordlevel.                            |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |mergepolicy              |Merge method. Two options: 1) file: may be inefficient on some types of disk;   |
|                         |                         |2) memory: uses extra memory equal to the maximum list length.                  |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |cron                     |Execution schedule, in the format "minute hour day month weekday".              |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |autorebuild              |Whether to rebuild automatically.                                               |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|Sia                      |triggerqa                |Whether query requests enter question-answering mode. Default: "n".             |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |enable_parallel_searching|Whether parallel search is allowed. Default: "n".                               |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |enable_forceget_doc      |Whether deleted documents may still be retrieved. Default: "n".                 |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |doccachenum              |Number of raw documents cached by the searcher. Larger values use more          |
|                         |                         |memory. Default: 2000.                                                          |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |searchcachenum           |Number of search results cached by the searcher. Larger values use more         |
|                         |                         |memory. Default: 1000.                                                          |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |refreshsearchcache       |Whether to clear the search cache periodically.                                 |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |refreshcacheinterval     |Interval for clearing the search cache, in seconds. Default: 3600.              |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |filtercachenum           |Number of filter results cached by the searcher. Larger values use more         |
|                         |                         |memory. Default: 1000.                                                          |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |mastersearchcachenum     |Number of entries in the master search cache.                                   |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |topknum                  |Number of query results.                                                        |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |knntopknum               |Number of KNN query results (no longer used).                                   |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |knndist                  |Hamming distance for KNN queries (no longer used).                              |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |sortcacheupdateinterval  |Update interval for cached sort results, in seconds.                            |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |encoding                 |Encoding type: "UTF-8", "EUC-KR", or "GBK".                                     |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |wildcardtype             |Wildcard type: "unigram" or "trie".                                             |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |indexunigramproperty     |Whether to index unigrams of a property (no longer used).                       |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |unigramsearchmode        |Whether unigram search mode is enabled (no longer used).                        |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|                         |multilanggranularity     |Tokenization granularity. Default: "field" (no longer used).                    |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
|LanguageIdentifier       |dbpath                   |Path to the language identifier's resource files.                               |
+-------------------------+-------------------------+--------------------------------------------------------------------------------+
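The cron attribute in the table uses five whitespace-separated fields. As a minimal sketch of that layout, the helper below names each field; ``split_cron`` and the sample value ``"0 2 * * *"`` are hypothetical, and SF1's own scheduler may accept syntax that this sketch does not check.

```python
CRON_FIELDS = ("minute", "hour", "day", "month", "weekday")


def split_cron(expression: str) -> dict:
    """Map the five whitespace-separated cron fields to their names."""
    values = expression.split()
    if len(values) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(values)}")
    return dict(zip(CRON_FIELDS, values))


# Hypothetical schedule: minute 0 of hour 2, on any day, month, and weekday.
schedule = split_cron("0 2 * * *")
print(schedule)
```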
ProductBundle
~~~~~~~~~~~~~~~~~~~
Sets the default product-related parameters.
::
default-collection-dir
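- CronPara/value: start time of the price trend computation.
- CassandraStorage/enable: whether price trends need to be computed.
- CassandraStorage/keyspace: the table for which price trends are computed. Default: "SF1R".
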
MiningBundle
~~~~~~~~~~~~~~~~~~
Sets the default mining-related parameters.
::
default-collection-dir
+--------------------+--------------------+--------------------------------------------------------------------------------+
|Element             |Attribute           |Description                                                                     |
+====================+====================+================================================================================+
|TaxonomyPara        |topdocnum           |Number of top-ranked documents. A larger value gives TG more information for    |
|                    |                    |generating navigation data. Range [50,300]. Default: 100.                       |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |levels              |Number of levels in the taxonomy tree built at run time. Range [1,3]; a value   |
|                    |                    |of 1 gives a chain structure. Default: 3.                                       |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |perlevelnum         |Maximum number of labels per level of the taxonomy tree. Range [2,20].          |
|                    |                    |Default: 8.                                                                     |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |candlabelnum        |Number of candidate labels used to build the taxonomy tree. Range [200,400].    |
|                    |                    |Default: 250.                                                                   |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |enablenec           |Whether to use named-entity classification. Default: "n".                       |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |maxpeopnum          |Number of person names to rank; effective only with named-entity                |
|                    |                    |classification. Range [1,50]. Default: 20.                                      |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |maxlocnum           |Number of place names to rank; effective only with named-entity                 |
|                    |                    |classification. Range [1,50]. Default: 20.                                      |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |maxorgnum           |Number of organization names to rank; effective only with named-entity          |
|                    |                    |classification. Range [1,50]. Default: 20.                                      |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|AutofillPara        |cron                |Update schedule for autofill.                                                   |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|FuzzyIndexMergePara |cron                |Schedule for merging the fuzzy index.                                           |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|RecommendPara       |recommendnum        |Number of recommended items to display. Range [1,50]. Default: 10.              |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |cron                |Start time of MiningQueryLogHandler.                                            |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|SimilarityPara      |docnumlimit         |Limit on the number of documents (idf) in each term's posting list. A larger    |
|                    |                    |value gives higher similarity but takes longer to compute offline. Range        |
|                    |                    |[100,500]. Default: 100.                                                        |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |termnumlimit        |Limit on the number of terms (tf) per document used for pruning. A larger       |
|                    |                    |value gives higher similarity but takes longer to compute offline. Range        |
|                    |                    |[100,500000]. Default: 400000.                                                  |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |enableesa           |Whether to compute similarity with Explicit Semantic Analysis (ESA). Default:   |
|                    |                    |"n".                                                                            |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|ClassificationPara  |customizetraining   |Whether custom classifier training is allowed.                                  |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |trainingencoding    |Encoding used for training.                                                     |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|IsePara             |buildimageindex     |Whether to build image representations in the index. Default: "n" (no longer    |
|                    |                    |used).                                                                          |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |storeimagelocally   |Whether to back up images locally on the server. Default: "n" (no longer        |
|                    |                    |used).                                                                          |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |maximagenum         |Maximum number of images. Range [1,1000000]. Default: 1000000 (no longer        |
|                    |                    |used).                                                                          |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |relatedimagenum     |Number of related images. Range [1,100]. Default: 50 (no longer used).          |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|QueryCorrectionPara |enableEK            |Whether query correction supports English.                                      |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|                    |enableCN            |Whether query correction supports Chinese.                                      |
+--------------------+--------------------+--------------------------------------------------------------------------------+
|ProductRankingPara  |cron                |Start time of ProductScore.                                                     |
+--------------------+--------------------+--------------------------------------------------------------------------------+
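Several mining attributes in the table carry a documented range and default. The sketch below shows one way to check a value before writing it into the configuration; the table ``MINING_RANGES`` copies a few rows from above, while the function name ``check_value`` is hypothetical and SF1 itself may handle out-of-range values differently.

```python
# (element, attribute) -> (minimum, maximum, default), copied from the table above.
MINING_RANGES = {
    ("TaxonomyPara", "topdocnum"): (50, 300, 100),
    ("TaxonomyPara", "levels"): (1, 3, 3),
    ("TaxonomyPara", "perlevelnum"): (2, 20, 8),
    ("TaxonomyPara", "candlabelnum"): (200, 400, 250),
    ("RecommendPara", "recommendnum"): (1, 50, 10),
    ("SimilarityPara", "docnumlimit"): (100, 500, 100),
}


def check_value(element, attribute, value=None):
    """Return the value if it is in range, or the documented default when omitted."""
    low, high, default = MINING_RANGES[(element, attribute)]
    if value is None:
        return default
    if not low <= value <= high:
        raise ValueError(f"{element}/{attribute}={value} is outside [{low},{high}]")
    return value


print(check_value("TaxonomyPara", "levels"))
print(check_value("RecommendPara", "recommendnum", 25))
```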
RecommendBundle
~~~~~~~~~~~~~~~~~~~~~
Sets the default recommendation-related parameters.
::
default-recommend-dir
+-------------------------+------------+--------------------------------------------------------------------------------+
|Element                  |Attribute   |Description                                                                     |
+=========================+============+================================================================================+
|CollectionDataDirectory  |            |Index directory for the data.                                                   |
+-------------------------+------------+--------------------------------------------------------------------------------+
|CronPara                 |value       |Start time of the task.                                                         |
+-------------------------+------------+--------------------------------------------------------------------------------+
|CacheSize                |purchase    |Size of allocated memory.                                                       |
+-------------------------+------------+--------------------------------------------------------------------------------+
|                         |visit       |Size of available memory.                                                       |
+-------------------------+------------+--------------------------------------------------------------------------------+
|                         |index       |Size of index memory.                                                           |
+-------------------------+------------+--------------------------------------------------------------------------------+
|FreqItemSet              |enable      |Whether frequent item sets are supported.                                       |
+-------------------------+------------+--------------------------------------------------------------------------------+
|                         |minfreq     |Threshold for frequent item sets.                                               |
+-------------------------+------------+--------------------------------------------------------------------------------+
|CassandraStorage         |enable      |Whether to store recommendation data in Cassandra; otherwise it is stored       |
|                         |            |locally.                                                                        |
+-------------------------+------------+--------------------------------------------------------------------------------+
|                         |keyspace    |Table that receives the recommendation data. Default: "SF1R".                   |
+-------------------------+------------+--------------------------------------------------------------------------------+
Tokenizing
~~~~~~~~~~~~~~
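The tokenizer splits a text into individual strings and symbols. By default, every non-alphanumeric character (such as a space or a special character) is treated as a divide delimiter. For example, tokenizing the string "SF-1 Revolution!" returns "SF", "1", and "Revolution".
The tokenizer is configured as follows.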
::
+----------+------------------------------------------------------------------------------------------+
|Attribute |Description                                                                               |
+==========+==========================================================================================+
|id        |Name of the tokenizer.                                                                    |
+----------+------------------------------------------------------------------------------------------+
|method    |Operation performed by the tokenizer. Three options:                                      |
|          |1) allow: the given characters are no longer treated as delimiters.                       |
|          |2) divide: the given delimiters split the text, e.g. "A@B" => "A", "B".                   |
|          |3) unite: the given delimiters join the text, e.g. "A@B" => "AB".                         |
+----------+------------------------------------------------------------------------------------------+
|value     |Characters the method applies to, given as a string.                                      |
+----------+------------------------------------------------------------------------------------------+
|code      |Characters the method applies to, given as UCS2 codes.                                    |
+----------+------------------------------------------------------------------------------------------+
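The three tokenizer methods can be illustrated with a small model of the behavior described above. This is a sketch under stated assumptions, not SF1's tokenizer: the function ``tokenize`` and its parameters are hypothetical, "delimiter" is taken to mean any non-alphanumeric character, and the precedence (unite, then divide, then allow) is an assumption.

```python
def tokenize(text, allow="", divide="", unite=""):
    """Split text on delimiters, honoring allow/divide/unite character sets."""
    tokens, current = [], []
    for ch in text:
        if ch in unite:
            # unite: drop the character so that its neighbors are joined.
            continue
        if ch not in divide and (ch.isalnum() or ch in allow):
            # Regular characters and allowed characters stay inside the token.
            current.append(ch)
        elif current:
            # divide (the default for delimiters): close the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize("SF-1 Revolution!"))
print(tokenize("A@B", allow="@"))
print(tokenize("A@B", divide="@"))
print(tokenize("A@B", unite="@"))
```

With no options, the first call reproduces the "SF-1 Revolution!" example from the text; the remaining calls mirror the "A@B" examples in the table.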
LanguageAnalyzer
~~~~~~~~~~~~~~~~~~~~
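SF1-R provides several language analysis methods; some are character-based and others are language-based. The tokenizer output is the input of the analyzer. The analyzer processes these tokens and produces a sequence of terms, which are then indexed, searched, and mined.
The language analyzer is configured as follows.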
::
+------------------+--------------------+--------------------------------------------------------------------------------+
|Element           |Attribute           |Description                                                                     |
+==================+====================+================================================================================+
|LanguageAnalyzer  |dictionarypath      |Path to the analyzer dictionaries.                                              |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |updatedictinterval  |Dictionary update interval.                                                     |
+------------------+--------------------+--------------------------------------------------------------------------------+
|Method            |id                  |Name of the analyzer.                                                           |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |analysis            |Analyzer type, in two groups: 1) language-independent: token, ngram, char;      |
|                  |                    |2) language-dependent: english, korean, chinese, multilang.                     |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |advoption           |Applies when the analyzer is char or multilang. For multilang it gives the      |
|                  |                    |per-language configuration; see the details below.                              |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |casesensitive       |Whether analysis is case sensitive. Default: yes.                               |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |min                 |Effective only for ngram: the minimum value of N.                               |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |max                 |Effective only for ngram: the maximum value of N.                               |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |maxno               |Effective only for ngram: the maximum number of terms produced from one         |
|                  |                    |token.                                                                          |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |apart               |Effective only for ngram: whether CJK characters are handled separately         |
|                  |                    |from other characters.                                                          |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |idxflag             |Which terms are returned for indexing. Four options: 1) all: every term;        |
|                  |                    |2) prime: the tokenizer output; 3) second: the language analyzer output;        |
|                  |                    |4) none: no terms. Default: all.                                                |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |schflag             |Which terms are returned for searching. Same four options as idxflag.           |
|                  |                    |Default: second.                                                                |
+------------------+--------------------+--------------------------------------------------------------------------------+
|Method/settings   |mode                |Effective only for language-dependent analyzers: which kind of terms to         |
|                  |                    |output. Three options: 1) all: tokenizer and analyzer output; 2) noun:          |
|                  |                    |analyzer output; 3) label: typically used by mining features.                   |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |option              |Effective only for language-dependent analyzers; see the details below.         |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |specialchar         |Characters that are not treated as delimiters; equivalent to the allow          |
|                  |                    |method of the tokenizer.                                                        |
+------------------+--------------------+--------------------------------------------------------------------------------+
|                  |dictionarypath      |Path to this analyzer's dictionary; overrides the path set in                   |
|                  |                    |LanguageAnalyzer.                                                               |
+------------------+--------------------+--------------------------------------------------------------------------------+
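The min, max, and maxno attributes of the ngram analyzer can be pictured with a generic character n-gram generator. This is a sketch only: the function ``ngrams`` is hypothetical, the order in which grams are emitted is an assumption, and the apart attribute is not modeled.

```python
def ngrams(token, min_n, max_n, maxno=None):
    """Return character n-grams of token for every n in [min_n, max_n].

    maxno, if given, is assumed to be a positive cap on the number of grams.
    """
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
            if maxno is not None and len(grams) >= maxno:
                return grams
    return grams


print(ngrams("abcd", 2, 3))
print(ngrams("abcd", 2, 3, maxno=4))
```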
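The analyzer types are described in detail below:
**token**
This method performs no processing. It outputs the tokenizer result unchanged, which is the output of LAManager.
**ngram**
The NGram analyzer produces the analysis result.
**char**
The Char analyzer splits the text into individual characters. The part attribute specifies whether digits, letters, and similar symbols are split apart. Default: y.
**chinese**
Uses the Chinese Morpheme Analysis (CMA) analyzer to extract words. CMA includes an English stemmer, so it can also process mixed Chinese and English text.
The option settings are described below: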
+--------+---------+--------------------------------------------------------------------------------+
|Option  |Setting  |Description                                                                     |
+========+=========+================================================================================+
|C       |`+`      |Extract nouns from compound nouns.                                              |
+--------+---------+--------------------------------------------------------------------------------+
|        |`*`      |Extract nouns from compound nouns and add them to the dictionary.               |
+--------+---------+--------------------------------------------------------------------------------+
|R       |0/-      |Return all analysis results.                                                    |
+--------+---------+--------------------------------------------------------------------------------+
|        |`+`      |Use the two top-ranked analysis results.                                        |
+--------+---------+--------------------------------------------------------------------------------+
|        |1-9      |Number of top-ranked analysis results to extract.                               |
+--------+---------+--------------------------------------------------------------------------------+
|S       |`-`      |English words mixed into Chinese text are extracted unchanged.                  |
+--------+---------+--------------------------------------------------------------------------------+
|        |`+`      |Apply stemming to English words.                                                |
+--------+---------+--------------------------------------------------------------------------------+
|T       |1        |Statistical method: highest accuracy, slower.                                   |
+--------+---------+--------------------------------------------------------------------------------+
|        |2        |Maximum matching: lower accuracy, faster.                                       |
+--------+---------+--------------------------------------------------------------------------------+
|        |3        |Minimum matching: lower accuracy, faster, higher recall.                        |
+--------+---------+--------------------------------------------------------------------------------+
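**korean**
Uses the Korean Morphological Analyzer (KMA) to extract words. KMA includes an English stemmer, so it can also process mixed Korean and English text.
The option settings are described below: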
+--------+---------+--------------------------------------------------------------------------------+
|Option  |Setting  |Description                                                                     |
+========+=========+================================================================================+
|C       |`+`      |Same as chinese.                                                                |
+--------+---------+--------------------------------------------------------------------------------+
|        |`*`      |Same as chinese.                                                                |
+--------+---------+--------------------------------------------------------------------------------+
|R       |0/-      |Same as chinese.                                                                |
+--------+---------+--------------------------------------------------------------------------------+
|        |`+`      |Same as chinese.                                                                |
+--------+---------+--------------------------------------------------------------------------------+
|        |1-9      |Same as chinese.                                                                |
+--------+---------+--------------------------------------------------------------------------------+
|S       |`-`      |Same as chinese.                                                                |
+--------+---------+--------------------------------------------------------------------------------+
|        |`+`      |Same as chinese.                                                                |
+--------+---------+--------------------------------------------------------------------------------+
|N       |0        |Do not extract numbers.                                                         |
+--------+---------+--------------------------------------------------------------------------------+
|        |1-9      |Extract numbers that contain at least this many digits.                         |
+--------+---------+--------------------------------------------------------------------------------+
|B       |`-`      |Split numbers from units of measure in a token, e.g. "10km" => "10", "km".      |
+--------+---------+--------------------------------------------------------------------------------+
|        |`+`      |Do not split numbers from units of measure in a token.                          |
+--------+---------+--------------------------------------------------------------------------------+
|H       |`-`      |Convert Chinese characters to the equivalent Korean characters.                 |
+--------+---------+--------------------------------------------------------------------------------+
|        |`+`      |When a Chinese character appears together with its Korean equivalent,           |
|        |         |extract the Chinese character.                                                  |
+--------+---------+--------------------------------------------------------------------------------+
|V       |`-`      |Do not extract the stems of verbs and adjectives.                               |
+--------+---------+--------------------------------------------------------------------------------+
|        |`+`      |Stem verbs and adjectives that have more than one syllable.                     |
+--------+---------+--------------------------------------------------------------------------------+
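**english**
The English analyzer, like the analyzers for other languages (including Danish, Dutch, and Finnish), performs stemming. Each stem becomes a search term.
**multilang**
The multi-language analyzer is not a standalone analyzer. It combines several analyzers to process documents that mix languages. Its core option is advoption, which can be configured for cn (Chinese), en (English), jp (Japanese), kr (Korean), and default (all languages).

.. note::
   **"default"** can be assigned to only one analyzer, and the analyzer entry that specifies "default" must come before the entries for the individual languages. The entries for other languages are optional, and each language can use one processing mode. There are four processing modes:

   - **"none"**: no processing is applied to the language; the "default" language analyzer handles it.
   - **"char"**: the text is split into individual characters. For example, processing English text in "char" mode turns the string "ABC" into the three terms "A", "B", and "C".
   - **"string"**: the text is split on punctuation. For example, processing English text in this mode turns the string "ABC DE" into the two terms "ABC" and "DE".
   - **"ma"**: a language analyzer is named to process the text of that language.

   Settings for different languages are separated by a semicolon ";". For example, advoption = "default, inner_la_korall_mia; cn, char" means that Chinese is processed in "char" mode and all other languages are processed by inner_la_korall_mia.
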
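The advoption syntax described in the note (entries separated by ";", each entry a language and a mode or analyzer name separated by ",") can be read with a few lines of code. This is a sketch only: ``parse_advoption`` is a hypothetical helper, not an SF1 function, and the check that "default" comes first is one reading of the rule in the note.

```python
def parse_advoption(value):
    """Parse 'lang, handler; lang, handler' into an ordered dict of rules."""
    rules = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        language, _, handler = entry.partition(",")
        rules[language.strip()] = handler.strip()
    languages = list(rules)
    # Assumption: when a "default" entry is present, it must be the first one.
    if "default" in rules and languages[0] != "default":
        raise ValueError('the "default" entry must come first')
    return rules


print(parse_advoption("default, inner_la_korall_mia; cn, char"))
```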
Collection
----------
In SF1, a collection is a set of documents that share the same structure. A collection configuration file has six main parts:

1. SCD file paths
2. DocumentSchema
3. IndexBundle
4. ProductBundle
5. MiningBundle
6. RecommendBundle

SCD File Settings
~~~~~~~~~~~~~~~~~
The configuration file is placed in the config directory and is named "CollectionName.xml"; the user must supply a concrete CollectionName.
::
{
}
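Data: format of the document creation time. There are four options:

- **none_time_t**: a 14-digit value in the format YYYYMMDDHHmmSS.
- **time_t**: the value of UNIX time(); an invalid value is treated as no_data.
- **utc_sec**: the time at which the index is created is used as the creation time of the collection.
- **no_data**: the creation time is set to 1970:01:01 09:00:00.

basepath: path of the collection; it must be set manually by the user.

SCD: path of the SCD directory. Default: $basepath/scd. SCD files must be placed in the scd/index directory.

CollectionData: path of the index-related data. Default: $basepath/collection-data.

Query: path of the query-related data. Default: $basepath/query-data.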
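The default paths above are all derived from basepath. The sketch below shows that derivation; the function ``collection_paths`` and the sample base path ``/data/example`` are hypothetical and only illustrate the documented defaults.

```python
import posixpath


def collection_paths(basepath, scd=None, collection_data=None, query_data=None):
    """Fill in the documented defaults for any path that is not set explicitly."""
    scd_dir = scd or posixpath.join(basepath, "scd")
    return {
        "SCD": scd_dir,
        # SCD files themselves go into the index subdirectory of the SCD path.
        "SCD files": posixpath.join(scd_dir, "index"),
        "CollectionData": collection_data or posixpath.join(basepath, "collection-data"),
        "Query": query_data or posixpath.join(basepath, "query-data"),
    }


for name, path in collection_paths("/data/example").items():
    print(name, path)
```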
DocumentSchema
~~~~~~~~~~~~~~~
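Defines all property information of a document, including the name of each property and the type of its value. The value type is one of: string/float/int8/int16/int32/int64/datetime.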
::
{
}
IndexBundle
~~~~~~~~~~~
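SF1 inspects the IndexBundle configuration to decide which properties to index. In general, an inverted index is built for string fields such as Title and Content, and a BTree index is built for numeric fields such as Price.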
::
{
}
+----------------------------+------------+--------------------------------------------------------------------------------+
|Element                     |Attribute   |Description                                                                     |
+============================+============+================================================================================+
|Property                    |name        |Name of the property in the document.                                           |
+----------------------------+------------+--------------------------------------------------------------------------------+
|Property/Indexing           |filter      |Whether the property acts as a filter. true: BTree index, usable for sorting    |
|                            |            |and filter conditions; false: inverted index, usable for matching user          |
|                            |            |queries.                                                                        |
+----------------------------+------------+--------------------------------------------------------------------------------+
|                            |multivalue  |Whether the property can be used with multi-value filters.                      |
+----------------------------+------------+--------------------------------------------------------------------------------+
|                            |doclen      |Whether this property counts toward the document length.                        |
+----------------------------+------------+--------------------------------------------------------------------------------+
|                            |analyzer    |Analyzer used during indexing; only for properties of type string.              |
+----------------------------+------------+--------------------------------------------------------------------------------+
|                            |tokenizer   |Tokenizers used during indexing (several may be given); only for properties     |
|                            |            |of type string.                                                                 |
+----------------------------+------------+--------------------------------------------------------------------------------+
|                            |range       |Whether range values are supported; only for numeric properties.                |
+----------------------------+------------+--------------------------------------------------------------------------------+
|                            |rankweight  |Weight of the property in ranking.                                              |
+----------------------------+------------+--------------------------------------------------------------------------------+
|VirtualProperty             |name        |Name of the virtual property.                                                   |
+----------------------------+------------+--------------------------------------------------------------------------------+
|VirtualProperty/Subproperty |name        |Name of a sub-property of the virtual property.                                 |
+----------------------------+------------+--------------------------------------------------------------------------------+
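The table restricts some Indexing attributes by property type: analyzer and tokenizer apply only to string properties, and range applies only to numeric ones. The sketch below checks those two rules; ``check_indexing`` is a hypothetical helper, and treating float and the int types from DocumentSchema as "numeric" is an assumption.

```python
# Assumption: these DocumentSchema value types count as numeric.
NUMERIC_TYPES = {"float", "int8", "int16", "int32", "int64"}


def check_indexing(property_type, attributes):
    """Return a list of problems for the Indexing attributes of one property."""
    problems = []
    if property_type != "string":
        for name in ("analyzer", "tokenizer"):
            if name in attributes:
                problems.append(f"{name} is only valid for string properties")
    if "range" in attributes and property_type not in NUMERIC_TYPES:
        problems.append("range is only valid for numeric properties")
    return problems


print(check_indexing("string", {"filter": "false", "analyzer": "example_analyzer"}))
print(check_indexing("int32", {"filter": "true", "tokenizer": "example_tokenizer"}))
```

The attribute values ``example_analyzer`` and ``example_tokenizer`` are placeholders, not names defined by SF1.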
ProductBundle
~~~~~~~~~~~~~
Product-related configuration.
::
{
}
+--------------------------+----------+--------------------------------------------------------------------------------+
|Element                   |Attribute |Description                                                                     |
+==========================+==========+================================================================================+
|Schema                    |mode      |Mode; one of a/m/o. The meaning of each value is not documented here.           |
+--------------------------+----------+--------------------------------------------------------------------------------+
|                          |id        |Product name.                                                                   |
+--------------------------+----------+--------------------------------------------------------------------------------+
|Schema/PriceProperty      |name      |Name of the price property.                                                     |
+--------------------------+----------+--------------------------------------------------------------------------------+
|Schema/DataProperty       |name      |Name of the date property.                                                      |
+--------------------------+----------+--------------------------------------------------------------------------------+
|Schema/DOCIDProperty      |name      |Name of the DOCID property.                                                     |
+--------------------------+----------+--------------------------------------------------------------------------------+
|Schema/UuidProperty       |name      |Name of the uuid property.                                                      |
+--------------------------+----------+--------------------------------------------------------------------------------+
|Schema/ItemCountProperty  |name      |Name of the item count property.                                                |
+--------------------------+----------+--------------------------------------------------------------------------------+
|PriceTrend/GroupProperty  |name      |                                                                                |
+--------------------------+----------+--------------------------------------------------------------------------------+
|PriceTrend/TimeInterval   |days      |                                                                                |
+--------------------------+----------+--------------------------------------------------------------------------------+
MiningBundle
~~~~~~~~~~~~
Defines the properties on which the mining features operate.
::
{
}
+----------------------------------+------------+--------------------------------------------------------------------------------+
|Element                           |Attribute   |Description                                                                     |
+==================================+============+================================================================================+
|QueryRecommend/Querylog           |            |Use users' query logs in the recommendation module.                             |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|Group/Property                    |name        |Names of the properties that use the group-by feature; type must be string.     |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|Attr/Property                     |name        |Names of the properties that use the attrby feature; type must be string.       |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|Attr/Exclude                      |name        |                                                                                |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|ProductRanking/score              |type        |Type.                                                                           |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|                                  |property    |Property name.                                                                  |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|                                  |weight      |Weight.                                                                         |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|Summarization/DocidProperty       |name        |Docid property used by summarization.                                           |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|Summarization/ContentProperty     |name        |uuid property used by summarization.                                            |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|Summarization/TitleProperty       |name        |Title property used by summarization.                                           |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|Summarization/OpinionProperty     |name        |                                                                                |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|Summarization/OpinionWorkingPath  |name        |Working path for opinions in summarization.                                     |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|Summarization/OpinionSyncld       |name        |                                                                                |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|SuffixMatch/Property              |name        |Property used for suffix matching.                                              |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|SuffixMatch/TokenizeDictionary    |path        |Path to the analyzer dictionary used for suffix matching.                       |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|SuffixMatch/Incremental           |enable      |Whether suffix matching supports incremental growth.                            |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|SuffixMatch/FilterProperty        |name        |Name of the suffix-matching filter.                                             |
+----------------------------------+------------+--------------------------------------------------------------------------------+
|                                  |filtertype  |Type of the suffix-matching filter.                                             |
+----------------------------------+------------+--------------------------------------------------------------------------------+
RecommendBundle
~~~~~~~~~~~~~~~
Recommendation-related configuration.
::
{
-
}
+---------------+----------+--------------------------------------------------------------------------------+
|Element        |Attribute |Description                                                                     |
+===============+==========+================================================================================+
|User/Property  |name      |Names of user properties relevant to recommendation.                            |
+---------------+----------+--------------------------------------------------------------------------------+
|Item/Property  |name      |Names of item properties relevant to recommendation.                            |
+---------------+----------+--------------------------------------------------------------------------------+
|Trace/Event    |name      |Names of events relevant to recommendation.                                     |
+---------------+----------+--------------------------------------------------------------------------------+
Deployment
------------
Distributed configuration.
::
+-------------------------------------+--------------+--------------------------------------------------------------------+
|Element                              |Attribute     |Description                                                         |
+=====================================+==============+====================================================================+
|BrokerAgent                          |usecache      |Whether to use the cache                                            |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |threadnum     |Number of threads of the SF1 server                                 |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |enabletest    |Whether test support is enabled                                     |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |port          |Port number of the SF1 server                                       |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|DistributedCommon                    |clusterid     |Cluster ID                                                          |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |username      |User name                                                           |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |localhost     |IP address of the local host                                        |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |workerport    |Port number of the worker node                                      |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |masterport    |Port number of the master node                                      |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |datarecvport  |Port number for receiving data                                      |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |filesyncport  |Port number for file synchronization                                |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|DistributedTopology                  |enable        |Whether distributed SF1 is enabled                                  |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |nodenum       |Number of nodes in the cluster                                      |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|DistributedTopology/CurrentSf1rNode  |nodeid        |ID of the current node                                              |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |replicaid     |Replica ID of the current node                                      |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|DistributedTopology/CurrentSf1rNode/ |enable        |Whether the current node is a master                                |
|MasterServer                         |              |                                                                    |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |name          |Name of the current node                                            |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|DistributedTopology/CurrentSf1rNode/ |type          |Role of the current node                                            |
|MasterServer/DistributedService      |              |                                                                    |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|DistributedTopology/CurrentSf1rNode/ |name          |Collection name                                                     |
|MasterServer/DistributedService/     |              |                                                                    |
|Collection                           |              |                                                                    |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |distributive  |Whether the collection is distributed                               |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |shardids      |Shard IDs                                                           |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|DistributedTopology/CurrentSf1rNode/ |enable        |Whether the current node is a worker                                |
|WorkerServer                         |              |                                                                    |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|DistributedTopology/CurrentSf1rNode/ |type          |Role of the current node                                            |
|WorkerServer/DistributedService      |              |                                                                    |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|DistributedTopology/CurrentSf1rNode/ |name          |Collection name                                                     |
|WorkerServer/DistributedService/     |              |                                                                    |
|Collection                           |              |                                                                    |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|DistributedUtil/Zookeeper            |disable       |Whether to disable ZooKeeper; allowed only in non-distributed mode   |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |servers       |IP addresses of the ZooKeeper servers                               |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |sessiontimeout|Session timeout                                                     |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|DFS                                  |type          |DFS type                                                            |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |supportfuse   |Whether FUSE is supported; defaults to "y"                          |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |mountdir      |DFS mount directory                                                 |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |server        |IP address of the DFS server                                        |
+-------------------------------------+--------------+--------------------------------------------------------------------+
|                                     |port          |Port number of the DFS                                              |
+-------------------------------------+--------------+--------------------------------------------------------------------+
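The slash-separated element names in the table can be read as nested XML elements. The sketch below is only an illustration of that reading: the element and attribute names come from the table, while the ``<Deployment>`` root, the nesting, every value, and the ``read_attr`` helper are assumptions made for this example, not the actual SF1R configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: names come from the table, but the nesting under
# <Deployment> and all attribute values are illustrative assumptions.
SAMPLE = """
<Deployment>
  <BrokerAgent usecache="n" threadnum="4" enabletest="y" port="18181"/>
  <DistributedTopology enable="y" nodenum="2">
    <CurrentSf1rNode nodeid="1" replicaid="1">
      <MasterServer enable="y" name="master1"/>
      <WorkerServer enable="y"/>
    </CurrentSf1rNode>
  </DistributedTopology>
  <DistributedUtil>
    <Zookeeper disable="n" servers="127.0.0.1:2181" sessiontimeout="2000"/>
  </DistributedUtil>
</Deployment>
"""


def read_attr(root, path, attr):
    """Return the attribute at a slash-separated element path, or None."""
    node = root.find(path)
    return None if node is None else node.get(attr)


root = ET.fromstring(SAMPLE)
is_master = read_attr(root, "DistributedTopology/CurrentSf1rNode/MasterServer", "enable")
zk_servers = read_attr(root, "DistributedUtil/Zookeeper", "servers")
print(is_master, zk_servers)
```

A path that is absent from the fragment, such as ``DFS``, yields ``None`` rather than raising an error, which makes optional elements easy to probe.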
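Distributed search configuration
--------------------------------
Distributed SF1R uses ZooKeeper for task coordination, so the ZooKeeper address must be specified in the configuration file: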
::
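* Example
The sample configuration below has two SF1R nodes: one is the Master, and both nodes also act as Workers.
SF1R Node1 acts as both Master and Worker (shard 1). Several collections are deployed on this node; among them, "product" is the distributed one.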
::
....
SF1R Node2 acts as a Worker (shard 2).
::
In product.xml on both nodes, add `ShardSchema` inside `IndexBundle`.
::
....
....
....
* Coordination of distributed SF1R
The ZooKeeper namespace used by distributed SF1R:
::
    |  # Root of the ZooKeeper namespace
    |--- SF1R-[CLUSTERID]  # Root of the distributed SF1 namespace; [CLUSTERID] is set in the user configuration.
         |--- Topology
              |--- Replica1  # A replica of the service cluster
                   |--- Node1  # An SF1 node in this replica; it can be a Master, a Worker, or both.
                        |--- Search,Recommend  # This node supplies the distributed search and recommend services.
                   |--- Node2
                        |--- Search
                   |--- Node3
                        |--- Recommend
              |--- Replica2
                   |--- Node1
                   |--- Node2
         |--- Servers  # Each Master service node in the topology is a service server. XXX: maybe this node can be removed.
              |--- Server00000000  # A master node supplying the Search and Recommend services
                   |--- Search,Recommend
              |--- Server00000001
         |--- Synchro  # For synchronization tasks
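As a rough illustration of how znode paths in this namespace could be composed, the sketch below builds them from a cluster ID, a replica number, and a node number. The helper names are hypothetical, and the nesting (``Topology`` and ``Servers`` directly under the cluster root, eight-digit server suffixes) is inferred from the listing above rather than taken from the SF1R code.

```python
def cluster_root(cluster_id):
    """Root znode of one distributed SF1 cluster, e.g. /SF1R-demo."""
    return "/SF1R-" + cluster_id


def node_path(cluster_id, replica_id, node_id):
    """Znode of one SF1 node inside a replica of the topology (assumed layout)."""
    return "{}/Topology/Replica{}/Node{}".format(
        cluster_root(cluster_id), replica_id, node_id)


def server_path(cluster_id, sequence):
    """Znode of a master server; the 8-digit suffix mirrors Server00000000."""
    return "{}/Servers/Server{:08d}".format(cluster_root(cluster_id), sequence)


print(node_path("demo", 1, 2))
print(server_path("demo", 1))
```

Here ``"demo"`` stands in for the ``[CLUSTERID]`` value from the user configuration.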
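* Installing, deploying, and using ZooKeeper
* Introduction
ZooKeeper is a service for coordinating distributed systems, and it is itself distributed. For our distributed system, ZooKeeper is the service used for distributed management. It provides a simple, easy-to-use framework made up of two parts: the Service and the Client.
The ZooKeeper Service consists of one or more running Servers. The Servers are identical and can be deployed on different machines. Each Server maintains the same data structure (similar to a file directory tree) whose nodes are called znodes, and the Servers synchronize data among themselves automatically.
A ZooKeeper Client connects to the Service; each Client object connects to one specified (or automatically assigned) Server, and users create and maintain data on the Server through the client. Because the Servers hold replicas of the same data, different clients can share data (information) through the ZooKeeper Service.
* Installing and deploying the ZooKeeper service
See the `ZooKeeper Administrator's Guide <http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html>`_.
At least three ZooKeeper server nodes need to be deployed. The example below configures them on a single machine (in practice they should be deployed on separate machines).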
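As background for the three-server minimum: a ZooKeeper ensemble stays available only while a majority of its servers are running. The small sketch below (function names are illustrative) works out the majority size and the number of failures an ensemble can tolerate.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Servers that may fail while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Three is the smallest ensemble that survives the loss of one server; a two-server ensemble tolerates no failures at all.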
- **Server1**
Configuration file
zookeeper-3.3.3/conf/zoo.cfg
::
tickTime=2000
dataDir=./data
clientPort=2181
initLimit=30
syncLimit=10
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2810:3810
*myid* file (in the configured data directory)
zookeeper-3.3.3/data/myid
::
1
start server
::
zookeeper-3.3.3$ ./bin/zkServer.sh start
stop server
::
zookeeper-3.3.3$ ./bin/zkServer.sh stop
- **Server2**
Configuration file
zookeeper-2/conf/zoo.cfg
::
tickTime=2000
dataDir=./data
clientPort=2182
initLimit=30
syncLimit=10
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2810:3810
*myid* file (in the configured data directory)
zookeeper-2/data/myid
::
2
start server
::
zookeeper-2$ ./bin/zkServer.sh start
- **Server3**
Configuration file
zookeeper-3/conf/zoo.cfg
::
tickTime=2000
dataDir=./data
clientPort=2183
initLimit=30
syncLimit=10
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2810:3810
*myid* file (in the configured data directory)
zookeeper-3/data/myid
::
3
start server
::
zookeeper-3$ ./bin/zkServer.sh start
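The three configurations above differ only in ``clientPort`` and in the content of the ``myid`` file. A minimal sketch that renders them from one template makes this explicit; the function names are illustrative, and the values are copied from the listings above.

```python
# Quorum/election addresses shared by all three zoo.cfg files.
SERVERS = {1: "localhost:2888:3888", 2: "localhost:2889:3889", 3: "localhost:2810:3810"}
# Each server on the same machine needs its own client port.
CLIENT_PORTS = {1: 2181, 2: 2182, 3: 2183}


def render_zoo_cfg(server_id):
    """Return the zoo.cfg text for one server of the single-machine example."""
    lines = [
        "tickTime=2000",
        "dataDir=./data",
        "clientPort={}".format(CLIENT_PORTS[server_id]),
        "initLimit=30",
        "syncLimit=10",
    ]
    lines += ["server.{}={}".format(i, addr) for i, addr in sorted(SERVERS.items())]
    return "\n".join(lines) + "\n"


def render_myid(server_id):
    """Return the content of the myid file in the server's data directory."""
    return "{}\n".format(server_id)


for sid in sorted(SERVERS):
    print("--- server", sid, "myid =", render_myid(sid).strip())
    print(render_zoo_cfg(sid), end="")
```

When the servers run on separate machines, the ``localhost`` entries would be replaced by each machine's address and the client ports could all be the same.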
* ZooKeeper client
For SF1R, we wrapped the ZooKeeper C client in C++; the code can be found in `izenelib`_.
.. _izenelib: https://github.com/izenecloud/izenelib/tree/master/include/3rdparty/zookeeper
See the following header files:
::
#include <3rdparty/zookeeper/ZooKeeper.hpp>
#include <3rdparty/zookeeper/ZooKeeperWatcher.hpp>
#include <3rdparty/zookeeper/ZooKeeperEvent.hpp>