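Configuration Guide¶

SF1 Configuration Files¶

The SF1 configuration files cover three main areas:

1. System: global parameters, including the default settings for the index, mining, and recommendation features, thread counts, and language analyzer settings.

2. Collection: defines the structure of the documents in a collection and the collection-specific index and mining parameters.

3. Deployment: distributed deployment settings.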

System¶

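The SF1 configuration file is validated against a standard schema definition file. The schema file sf1r-config.xsd must be placed in the same directory as the SF1 XML configuration file.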

<SF1Config xsi:schemaLocation="http://www.izenesoft.com sf1r-config.xsd">
    <System>
    .
    .
    <BundlesDefault>
        <xxxBundle>
        </xxxBundle>
        .
        .
    </BundlesDefault>
    <Tokenizing>
    </Tokenizing>
    <LanguageAnalyzer>
    </LanguageAnalyzer>
    </System>
    <Deployment>
    </Deployment>
</SF1Config>
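For a validating parser to honor `xsi:schemaLocation`, the `xsi` prefix has to be bound to the XML Schema instance namespace on the root element. The fragment below is a minimal sketch of such a root element. It assumes that the default namespace of the configuration is `http://www.izenesoft.com`, the namespace named in the `schemaLocation` pair; this page does not state that explicitly.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the xmlns value is an assumption taken from the schemaLocation pair -->
<SF1Config xmlns="http://www.izenesoft.com"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.izenesoft.com sf1r-config.xsd">
  <System>
    <!-- global settings, BundlesDefault, Tokenizing, LanguageAnalyzer -->
  </System>
  <Deployment>
    <!-- distributed settings -->
  </Deployment>
</SF1Config>
```

Because the schema location is a bare file name, it is resolved relative to the configuration file, which is why both files need to sit in the same directory.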

Resource¶

Absolute path from which resources are loaded.

<Resource path="@SF1RENGINE_ROOT@/package/resource"/>

WorkingDir¶

Working directory used at runtime.

<WorkingDir path="@SF1RENGINE_ROOT@/bin"/>

LogConnection¶

Selects the database used for logging.

<LogConnection str="sqlite3://./log/COBRA"/>

Note

To connect to sqlite3, use str="sqlite3://./log/COBRA"; to connect to mysql, use str="mysql://root:123456@127.0.0.1:3306/SF1R".
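As a sketch of the MySQL variant described in the note, the element would look like the following. The user, password, host, port, and database name are the placeholder values from the note, not recommended settings.

```xml
<!-- Same element as the sqlite3 sample, pointed at MySQL; credentials are placeholders -->
<LogConnection str="mysql://root:123456@127.0.0.1:3306/SF1R"/>
```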

LogServerConnection¶

Connection settings for the log server.

<LogServerConnection host="10.10.99.121" rpcport="18811" rpc_thread_num="30" driverport="18812"/>

IndexBundle¶

Default parameters for indexing.

<Parameter>
  <CollectionDataDirectory>default-collection-dir</CollectionDataDirectory>
  <IndexStrategy memorypoolsize="128000000"
                 indexlevel="wordlevel"
                 indexpolicy="default"
                 mergepolicy="memory"
                 cron="0 4 1 1 *"
                 autorebuild="y"/>
  <Sia triggerqa="n"
       enable_parallel_searching="n"
       enable_forceget_doc="n"
       doccachenum="20000"
       searchcachenum="1000"
       refreshsearchcache="n"
       refreshcacheinterval="3600"
       filtercachenum="1000"
       mastersearchcachenum="1000"
       topknum="4000" knntopknum="30"
       knndist="32"
       sortcacheupdateinterval="1800"
       encoding="UTF-8"
       wildcardtype="unigram"
       indexunigramproperty="n"
       unigramsearchmode="n"
       multilanggranularity="field"/>
  <LanguageIdentifier dbpath="@ilplib_LANGUAGEIDENTIFIER_DB@"/>
</Parameter>

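Element	Attribute	Description
CollectionDataDirectory		Directory that holds the collection's index data
IndexStrategy	memorypoolsize	Size in bytes of the memory pool used for indexing. Performance degrades sharply below 5,000,000.
	indexpolicy	Two options: 1) default: documents become searchable only after indexing finishes, and index building is fast; 2) realtime: documents can be searched in real time.
	indexlevel	Index level, either 1) doclevel or 2) wordlevel.
	mergepolicy	Merge method, two options: 1) file, which may be inefficient on some types of disk; 2) memory, which costs extra memory equal to the maximum list length.
	cron	Schedule on which the task runs, in the format "minute hour day month weekday"
	autorebuild	Whether to rebuild automatically
Sia	triggerqa	Whether query requests enter question-answering mode, default "n"
	enable_parallel_searching	Whether parallel searching is allowed, default "n"
	enable_forceget_doc	Whether deleted documents may still be fetched, default "n"
	doccachenum	Number of raw documents cached by the searcher. Larger values use more memory. Default 2000
	searchcachenum	Number of search results cached by the searcher. Larger values use more memory. Default 1000
	refreshsearchcache	Whether the search cache is cleared periodically
	refreshcacheinterval	Interval at which the search cache is cleared, in seconds, default 3600
	filtercachenum	Number of filter results cached by the searcher. Larger values use more memory. Default 1000
	mastersearchcachenum	Number of search results cached on the master
	topknum	Number of query results
	knntopknum	Number of KNN query results (no longer used)
	knndist	Hamming distance for KNN queries (no longer used)
	sortcacheupdateinterval	Interval at which cached sort results are updated, in seconds
	encoding	Encoding, one of "UTF-8", "EUC-KR", "GBK"
	wildcardtype	Wildcard type, either "unigram" or "trie"
	indexunigramproperty	Whether to index the unigram terms of properties (no longer used)
	unigramsearchmode	Whether unigram search mode is on (no longer used)
	multilanggranularity	Granularity of word segmentation, default "field" (no longer used)
LanguageIdentifier	dbpath	Path to the resource files of the language identifier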
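To illustrate the `indexpolicy` choice in the table, the sketch below switches the sample `IndexStrategy` to real-time indexing. Only `indexpolicy` differs from the sample; whether the remaining values are appropriate for a real-time setup is an assumption.

```xml
<!-- Sketch: realtime makes documents searchable without waiting for indexing to finish -->
<IndexStrategy memorypoolsize="128000000"
               indexlevel="wordlevel"
               indexpolicy="realtime"
               mergepolicy="memory"
               cron="0 4 1 1 *"
               autorebuild="y"/>
```

The `cron` value follows the "minute hour day month weekday" layout, so `0 4 1 1 *` reads as 04:00 on the first day of January.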

ProductBundle¶

Default parameters for the product bundle.

<Parameter>
  <CollectionDataDirectory>default-collection-dir</CollectionDataDirectory>
  <CronPara value="0 1 * * *"/>
  <CassandraStorage enable="yes" keyspace="B5MO"/>
</Parameter>

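CronPara/value: the time at which the price trend computation starts.

CassandraStorage/enable: whether price trends are computed.

CassandraStorage/keyspace: the keyspace (table) for which price trends are computed; defaults to "SF1R".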

MiningBundle¶

Default parameters for mining.

<Parameter>
  <CollectionDataDirectory>default-collection-dir</CollectionDataDirectory>
  <TaxonomyPara topdocnum="100"
                levels="3"
                perlevelnum="8"
                candlabelnum="250"
                enablenec="n"
                maxpeopnum="20"
                maxlocnum="20"
                maxorgnum="20"/>
  <AutofillPara cron="30 2 * * *"/>
  <FuzzyIndexMergePara cron="30 3 * * *"/>
  <RecommendPara recommendnum="9" cron="0 2 * * *"/>
  <SimilarityPara docnumlimit="100" termnumlimit="400000" enableesa="n"/>
  <ClassificationPara customizetraining="n" trainingencoding="UTF-8"/>
  <IsePara buildimageindex="y"
           storeimagelocally="n"
           maximagenum="1000000"
           relatedimagenum="50"/>
  <QueryCorrectionPara enableEK="y" enableCN="y"/>
  <ProductRankingPara cron="0 23 * * *"/>
</Parameter>

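Element	Attribute	Description
TaxonomyPara	topdocnum	Number of top-ranked documents used. The larger the value, the more information TG can use to generate navigation information. Range [50,300], default 100
	levels	Number of levels in the taxonomy tree built at runtime. Range [1,3]; a value of 1 produces a chain structure. Default 3
	perlevelnum	Maximum number of labels on each level of the taxonomy tree. Range [2,20], default 8
	candlabelnum	Number of candidate labels used to generate the taxonomy tree. Range [200,400], default 250
	enablenec	Whether to use named entity classification, default "n"
	maxpeopnum	Number of person names to rank; effective only with named entity classification. Range [1,50], default 20
	maxlocnum	Number of location names to rank; effective only with named entity classification. Range [1,50], default 20
	maxorgnum	Number of organization names to rank; effective only with named entity classification. Range [1,50], default 20
AutofillPara	cron	Schedule for updating autofill
FuzzyIndexMergePara	cron	Schedule for merging the fuzzy index
RecommendPara	recommendnum	Number of recommended items to display. Range [1,50], default 10
	cron	Start time of MiningQueryLogHandler
SimilarityPara	docnumlimit	Limit on the number of documents (idf) in the posting list of each term. A larger value gives higher similarity but lengthens the offline computation. Range [100,500], default 100
	termnumlimit	Limit on the number of terms (tf) per document used for pruning. A larger value gives higher similarity but lengthens the offline computation. Range [100,500000], default 400000
	enableesa	Whether to compute similarity with Explicit Semantic Analysis (ESA), default "n"
ClassificationPara	customizetraining	Whether custom classifier training is allowed
	trainingencoding	Encoding to use
IsePara	buildimageindex	Whether to build image representations in the index, default "n" (no longer used)
	storeimagelocally	Whether to keep a local backup of images on the server, default "n" (no longer used)
	maximagenum	Maximum number of images. Range [1,1000000], default 1000000 (no longer used)
	relatedimagenum	Number of related images. Range [1,100], default 50 (no longer used)
QueryCorrectionPara	enableEK	Whether query correction supports English
	enableCN	Whether query correction supports Chinese
ProductRankingPara	cron	Start time of ProductScore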
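The table notes that `maxpeopnum`, `maxlocnum`, and `maxorgnum` take effect only when named entity classification is on. A minimal sketch of a `TaxonomyPara` with that feature enabled follows; the numbers are hypothetical values picked inside the documented ranges.

```xml
<!-- Hypothetical values, each inside the range listed in the table -->
<TaxonomyPara topdocnum="200"
              levels="2"
              perlevelnum="10"
              candlabelnum="300"
              enablenec="y"
              maxpeopnum="10"
              maxlocnum="10"
              maxorgnum="10"/>
```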

RecommendBundle¶

Default parameters for recommendation.

<Parameter>
  <CollectionDataDirectory>default-recommend-dir</CollectionDataDirectory>
  <CronPara value="0 0 * * *"/>
  <CacheSize purchase="1073741824" visit="536870912" index="104857600"/>
  <FreqItemSet enable="no" minfreq="10"/>
  <CassandraStorage enable="no" keyspace="recommend_001"/>
</Parameter>

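Element	Attribute	Description
CollectionDataDirectory		Directory that holds the index data
CronPara	value	Time at which the task starts
CacheSize	purchase	Size of allocated memory
	visit	Size of available memory
	index	Size of index memory
FreqItemSet	enable	Whether frequent item sets are supported
	minfreq	Threshold for frequent item sets
CassandraStorage	enable	Whether recommendation data is stored in Cassandra; otherwise it is stored locally
	keyspace	Keyspace (table) in which recommendation data is stored, default "SF1R"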
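For comparison with the sample above, which keeps recommendation data locally, this sketch turns Cassandra storage on. The keyspace name is the one from the sample; the value "yes" mirrors the spelling used in the ProductBundle sample and is assumed to be accepted here as well.

```xml
<!-- Sketch: store recommendation data in Cassandra instead of on local disk -->
<CassandraStorage enable="yes" keyspace="recommend_001"/>
```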

Tokenizing¶

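The tokenizer splits a text into a sequence of strings and token symbols. By default, every non-alphanumeric character (such as spaces and special characters) is treated as a divide delimiter. For example, tokenizing the string "SF-1 Revolution!" returns "SF", "1", and "Revolution".

The tokenizer is configured as follows.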

<Tokenizer id="tok_divide" method="divide" value="@#$" code=""/>
<Tokenizer id="tok_unite" method="unite" value="/" code=""/>

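Attribute	Description
id	Name of the tokenizer
method	How the tokenizer treats the given characters, three options: 1) allow: the characters are no longer delimiters. 2) divide: the characters split the text, i.e. "A@B" => "A", "B". 3) unite: the characters are dropped and the two sides are joined, i.e. "A@B" => "AB"
value	The characters that method applies to, given as a string
code	The characters that method applies to, given as UCS2 codes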
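The sample shows `divide` and `unite` but not `allow`. A minimal sketch of an `allow` tokenizer follows; the id `tok_allow` and the choice of `-` are hypothetical.

```xml
<!-- Hypothetical tokenizer: "-" stops being a delimiter -->
<Tokenizer id="tok_allow" method="allow" value="-" code=""/>
```

Under the rules above, "SF-1 Revolution!" would then presumably be tokenized as "SF-1" and "Revolution", since the hyphen no longer divides while the space and "!" still do.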

LanguageAnalyzer¶

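SF1-R offers several language analysis methods; some are character-based and others are language-based. The tokenizer output is the input to the analyzer, which analyzes the tokens and produces a sequence of terms. Those terms are what is finally indexed, searched, and mined. The language analyzer is configured as follows.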

<LanguageAnalyzer dictionarypath="@wisekma_KNOWLEDGE@" updatedictinterval="300">
  <Method id="la_token" analysis="token"/>
  <Method id="la_char" analysis="char"/>
  <Method id="la_unigram_all" analysis="char" advoption="all" casesensitive="no"/>
  <Method id="la_unigram" analysis="char" advoption="part" casesensitive="no"/>
  <Method id="la_ngram"
          analysis="ngram"
          min="2"
          max="3"
          maxno="2194967296"
          apart="n"
          idxflag="second"
          schflag="second"/>
  <Method id="la_bigram"
          analysis="ngram"
          min="2"
          max="2"
          maxno="2194967296"
          apart="n"
          idxflag="second"/>
  <Method id="la_eng" analysis="english" casesensitive="no">
    <settings mode="all" option="S+" dictionarypath=""/>
  </Method>
  <Method id="inner_la_korall_sia" analysis="korean" casesensitive="no">
    <settings mode="label" option="R+S+" specialchar="#" dictionarypath=""/>
  </Method>
  <Method id="inner_la_cnall_sia_2" analysis="chinese" casesensitive="no">
    <settings mode="label" option="R+S-V-T2" specialchar="#" dictionarypath="@izenecma_KNOWLEDGE@"/>
  </Method>
  <Method id="inner_la_cnall_sia" analysis="chinese" casesensitive="no">
    <settings mode="label" option="R+S-V-T3" specialchar="#" dictionarypath="@izenecma_KNOWLEDGE@"/>
  </Method>
  <Method id="inner_la_cnall_ia" analysis="chinese" casesensitive="no">
    <settings mode="label" option="R+S-V-T4" specialchar="#" dictionarypath="@izenecma_KNOWLEDGE@"/>
  </Method>
  <Method id="inner_la_cnall_sa" analysis="chinese" casesensitive="no">
    <settings mode="label" option="R+S-T5" specialchar="#" dictionarypath="@izenecma_KNOWLEDGE@"/>
  </Method>
  <Method id="la_sia_without_overlap"
          analysis="multilang"
          advoption="default,inner_la_korall_sia;en,inner_la_cnall_sa;cn,inner_la_cnall_sa"/>
  <Method id="la_sia"
          analysis="multilang"
          advoption="default,inner_la_korall_sia;en,inner_la_cnall_sa;cn,inner_la_cnall_sia"/>
  <Method id="la_sia_with_unigram"
          analysis="multilang"
          advoption="default,inner_la_korall_sia;en,inner_la_cnall_sa;cn,inner_la_cnall_ia"/>
</LanguageAnalyzer>

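Element	Attribute	Description
LanguageAnalyzer	dictionarypath	Path to the analyzer dictionary
	updatedictinterval	Dictionary update interval
Method	id	Name of the analyzer
	analysis	Analyzer type, in two groups: 1) language-independent: token, ngram, char. 2) language-dependent: english, korean, chinese, multilang
	advoption	Effective when the analyzer is char or multilang; sets the detailed configuration, explained below
	casesensitive	Whether analysis is case sensitive, default yes
	min	Effective for ngram; minimum value of N in the N-grams
	max	Effective for ngram; maximum value of N in the N-grams
	maxno	Effective for ngram; maximum number of terms produced from one token
	apart	Effective for ngram; whether CJK characters are treated separately from other characters
	idxflag	Type of terms returned for indexing, four options: 1) all: all terms. 2) prime: the tokenizer output. 3) second: the language analyzer output. 4) none: no terms. Default all
	schflag	Type of terms returned for searching; same four options as idxflag, default second
Method/settings	mode	Effective for language-dependent analyzers; selects which kind of terms the analyzer outputs, three options: 1) all: tokenizer output and analysis results. 2) noun: analysis results. 3) label: typically used for mining features
	option	Effective for language-dependent analyzers; explained below
	specialchar	Characters that are not treated as delimiters, equivalent to allow in the tokenizer method
	dictionarypath	Dictionary path for this analyzer; overrides the path set on LanguageAnalyzer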

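The analyzer types are described in detail below:

token

This method does nothing beyond passing the tokenizer output through as its own output, which is the output of LAManager.

ngram

The NGram analyzer produces the analysis results.

char

The Char analyzer extracts individual characters. The part option specifies whether digits, letters, and similar symbols are split apart; default y

chinese

Uses the Chinese Morpheme Analysis (CMA) analyzer to extract words. CMA includes an English stemmer, so it can also handle mixed Chinese and English text.

The option values are described below:

Option	Setting	Description
C	+	Extract nouns from compound nouns
	*	Extract nouns from compound nouns and add those nouns to the dictionary
R	0/-	Return all analysis results
	+	Use the two top-ranked analysis results
	1-9	Number of top-ranked analysis results to extract
S	-	English words mixed into Chinese text are extracted as-is
	+	Stem English words
T	1	Statistical method: highest accuracy, slower
	2	Maximum matching: lower accuracy, faster
	3	Minimum matching: lower accuracy, faster, higher recall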
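korean

Uses the Korean Morphological Analyzer (KMA) to extract words. KMA includes an English stemmer, so it can also handle mixed Korean and English text.

The option values are described below:

Option	Setting	Description
C	+	Same as chinese
	*	Same as chinese
R	0/-	Same as chinese
	+	Same as chinese
	1-9	Same as chinese
S	-	Same as chinese
	+	Same as chinese
N	0	Do not extract numbers
	1-9	Extract only numbers that contain at least this many digit characters
B	-	Split numbers from measurement units within a token, e.g. "10km" => "10", "km"
	+	Do not split numbers from measurement units within a token
H	-	Convert Chinese characters to the equivalent Korean characters
	+	If a Chinese character appears together with its corresponding Korean character, extract the Chinese character
V	-	Do not extract the stems of verbs and adjectives
	+	Stem verbs and adjectives that have more than one syllable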
english

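The English analyzer works like the analyzers for other languages (including Danish, Dutch, and Finnish): each word is stemmed, and each stem ends up as a search term.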

multilang

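The multilang analyzer is not a standalone analyzer. Instead, it combines several analyzers to process documents that mix languages. Its core setting is advoption, which can carry entries for cn (Chinese), en (English), jp (Japanese), kr (Korean), and default (all languages).

Note

"default" can be assigned to only one analyzer, and the entry that specifies "default" must come before the entries for individual languages. The entries for the other languages are optional, and each language can use one processing mode. There are four processing modes:

"none": no special processing for the language; the "default" analyzer handles it.

"char": split text in the language into individual characters. For example, processing English in "char" mode turns the string "ABC" into the three terms "A", "B", and "C".

"string": split text in the language on punctuation. For example, processing English in this mode turns the string "ABC DE" into the two terms "ABC" and "DE".

"ma": process text in the language with a specified language analyzer.

Entries for different languages are separated by semicolons (";"). For example, advoption="default,inner_la_korall_mia;cn,char" processes Chinese in "char" mode and every other language with inner_la_korall_mia.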
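As a sketch of how such an `advoption` string sits inside a `Method` element, the fragment below declares a multilang analyzer that splits Chinese into single characters and hands everything else to the Korean analyzer defined in the sample configuration. The id `la_multi_char_cn` is hypothetical, and `inner_la_korall_sia` is used because it is the Korean analyzer that the sample actually defines.

```xml
<!-- Hypothetical id; the "default" entry comes first, entries are separated by ";" -->
<Method id="la_multi_char_cn"
        analysis="multilang"
        advoption="default,inner_la_korall_sia;cn,char"/>
```

Any analyzer named in `advoption` needs its own `Method` entry earlier in the same `LanguageAnalyzer` block, as the sample does with its `inner_la_*` methods.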

Collection¶

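In SF1, a collection is a set of documents that share the same structure. A collection configuration file has six main parts:

1. SCD file paths

2. DocumentSchema

3. IndexBundle

4. ProductBundle

5. MiningBundle

6. RecommendBundle

SCD File Settings¶

Collection configuration files live in the config directory and are named "CollectionName.xml", where the user supplies a concrete CollectionName.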

{
  <Date format="none_time_t"/>
  <Path basepath="collection/tuanm"><!-- The default location can be overwritten -->
    <SCD path=""/><!-- default: basepath/scd -->
    <CollectionData path=""/><!-- default: basepath/collection-data -->
    <Query path=""/><!-- default: basepath/query-data -->
  </Path>
}

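Date: format of the document creation time, with four options:

none_time_t: a 14-digit value in the form YYYYMMDDHHmmSS.

time_t: the value of UNIX time(); if the value is invalid it is treated as no_data.

utc_sec: the time at which the index is built is used as the creation time of the collection.

no_data: the creation time is set to 1970-01-01 09:00:00.

basepath: path of the collection; must be set manually by the user.

SCD: path of the SCD directory, default $basepath/scd. SCD files must be placed under scd/index.

CollectionData: path of the index-related data, default $basepath/collection-data.

Query: path of the query-related data, default $basepath/query-data.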
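The sample leaves every `path` empty so that the defaults under `basepath` apply. A sketch with the defaults overridden might look as follows; all three directory names are hypothetical, and whether absolute paths are accepted is an assumption.

```xml
<Date format="none_time_t"/>
<Path basepath="collection/tuanm">
  <!-- Hypothetical locations replacing basepath/scd, basepath/collection-data, basepath/query-data -->
  <SCD path="/data/tuanm/scd"/>
  <CollectionData path="/data/tuanm/collection-data"/>
  <Query path="/data/tuanm/query-data"/>
</Path>
```

With `none_time_t`, a document created on 5 March 2013 at 14:30:00 would carry the value 20130305143000.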

DocumentSchema¶

Defines all properties of a document, including each property's name and value type. The available value types are string/float/int8/int16/int32/int64/datetime.

{
  <DocumentSchema>
    <Property name="DOCID" type="string"/>
    <Property name="uuid" type="string"/>
    <Property name="DATE" type="datetime"/>
    <Property name="ComUrl" type="string"/>
    <Property name="ProdDocid" type="string"/>
    <Property name="ProdName" type="string"/>
    <Property name="Source" type="string"/>
    <Property name="UserName" type="string"/>
    <Property name="UsefulVoteTotal" type="int32"/>
    <Property name="UsefulVote" type="int32"/>
    <Property name="Content" type="string"/>
    <Property name="Advantage" type="string"/>
    <Property name="Disadvantage" type="string"/>
    <Property name="Title" type="string"/>
    <Property name="City" type="string"/>
    <Property name="Score" type="int32"/>
  </DocumentSchema>
}

IndexBundle¶

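SF1 reads the IndexBundle settings to decide which properties need to be indexed. String fields such as Title and Content generally get an inverted index, and numeric fields such as Price get a BTree index.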

{
  <IndexBundle>
    <Schema>
      <Property name="Title">
        <Indexing filter="no" multivalue="no" doclen="yes" analyzer="la_sia" tokenizer="" rankweight="0.8"/>
      </Property>
      <Property name="Price">
        <Indexing filter="yes" multivalue="no" doclen="yes" tokenizer="" rankweight="0.1" range="yes"/>
      </Property>
      <Property name="TargetCategory">
        <Indexing filter="yes" multivalue="no" doclen="yes" analyzer="la_sia" tokenizer="" rankweight="0.6"/>
      </Property>
      <Property name="Category">
        <Indexing filter="yes" multivalue="no" doclen="yes" analyzer="la_sia" tokenizer="" rankweight="0.6"/>
      </Property>
      <Property name="Attribute">
        <Indexing filter="no" multivalue="no" doclen="yes" analyzer="la_sia" tokenizer="" rankweight="0.2"/>
      </Property>
      <Property name="CommentCount">
        <Indexing filter="yes" multivalue="no" doclen="no" tokenizer="" rankweight="0.1"/>
      </Property>
      <Property name="Score">
        <Indexing filter="yes" multivalue="no" doclen="no" tokenizer="" rankweight="0.1"/>
      </Property>
      <Property name="mobile">
        <Indexing filter="yes" multivalue="no" doclen="no" tokenizer="" rankweight="0.1"/>
      </Property>
      <VirtualProperty name="Combined">
        <SubProperty name="Title"/>
        <SubProperty name="Source"/>
        <SubProperty name="Category"/>
      </VirtualProperty>
    </Schema>
  </IndexBundle>
}

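Element	Attribute	Description
Property	name	Name of the property in the document
Property/Indexing	filter	Whether the property serves as a filter. A true value ("yes") builds a BTree index, so the property can be used for sorting or in filter conditions; a false value ("no") builds an inverted index, so the property can be searched by user queries
	multivalue	Whether the property can be used in multi-value filters
	doclen	Whether this property is counted in the document length
	analyzer	Analyzer used during indexing; can be set only on string properties
	tokenizer	Tokenizer used during indexing; several may be given; can be set only on string properties
	range	Whether range values are supported; can be set only on numeric properties
	rankweight	Weight of the property
VirtualProperty	name	Name of the virtual property
VirtualProperty/SubProperty	name	Name of a sub-property of the virtual property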
ProductBundle¶

Product-related settings.

{
  <ProductBundle>
    <Schema mode="o" id="b5m">
      <PriceProperty name="Price"/>
      <DateProperty name="DATE"/>
      <DOCIDProperty name="DOCID"/>
      <UuidProperty name="uuid"/>
      <ItemCountProperty name="itemcount"/>
      <PriceTrend>
        <GroupProperty name="TargetCategory"/>
        <GroupProperty name="Source"/>
        <TimeInterval days="2"/>
        <TimeInterval days="7"/>
        <TimeInterval days="183"/>
        <TimeInterval days="365"/>
      </PriceTrend>
    </Schema>
  </ProductBundle>
}

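Element	Attribute	Description
Schema	mode	Mode selection, one of a/m/o (the meaning of each value is not specified)
	id	Product name
Schema/PriceProperty	name	Name of the price property
Schema/DateProperty	name	Name of the date property
Schema/DOCIDProperty	name	Name of the DOCID property
Schema/UuidProperty	name	Name of the uuid property
Schema/ItemCountProperty	name	Name of the item count property
PriceTrend/GroupProperty	name
PriceTrend/TimeInterval	days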

MiningBundle¶

Defines which properties the mining features operate on.

{
<MiningBundle>
  <Schema>
    <QueryRecommend>
      <QueryLog/>
    </QueryRecommend>
    <Group>
      <Property name="TargetCategory"/>
      <Property name="Source"/>
      <Property name="SubSource"/>
    </Group>
    <Attr>
      <Property name="Attribute"/>
      <Exclude name="ISBN"/>
    </Attr>
    <ProductRanking>
      <Score type="diversity" property="Source"/>
      <Score type="merchant" property="Source"/>
      <Score type="category" property="TargetCategory" weight="1"/>
      <Score type="relevance" weight="0.01"/>
    </ProductRanking>
    <Summarization>
      <DocidProperty name="DOCID"/>
      <ContentProperty name="Content"/>
      <TitleProperty name="Title"/>
      <OpinionProperty name="Opinion"/>
      <OpinionWorkingPath path="./collection/b5mc/opinion_working/"/>
      <OpinionSyncId name="b5m"/>
    </Summarization>
    <SuffixMatch>
      <Property name="Title"/>
      <TokenizeDictionary path="fmindex_dic"/>
      <Incremental enable="no"/>
      <FilterProperty name="TargetCategory" filtertype="group"/>
      <FilterProperty name="Source" filtertype="group"/>
      <FilterProperty name="SubSource" filtertype="group"/>
      <FilterProperty name="Price" filtertype="numeric"/>
    </SuffixMatch>
  </Schema>
</MiningBundle>
}

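Element	Attribute	Description
QueryRecommend/QueryLog		Use users' query logs in the recommendation module
Group/Property	name	Properties that use the grouping feature; the type must be string
Attr/Property	name	Properties that use the attrby feature; the type must be string
Attr/Exclude	name
ProductRanking/Score	type	Score type
	property	Property name
	weight	Weight
Summarization/DocidProperty	name	Property used as the Docid for summarization
Summarization/ContentProperty	name	Property used as the content for summarization
Summarization/TitleProperty	name	Property used as the title for summarization
Summarization/OpinionProperty	name
Summarization/OpinionWorkingPath	path	Working path for opinion data in summarization
Summarization/OpinionSyncId	name
SuffixMatch/Property	name	Property used for suffix matching
SuffixMatch/TokenizeDictionary	path	Path of the analyzer dictionary used for suffix matching
SuffixMatch/Incremental	enable	Whether suffix matching supports incremental growth
SuffixMatch/FilterProperty	name	Name of a suffix-matching filter property
	filtertype	Type of the suffix-matching filter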

RecommendBundle¶

Recommendation-related settings.

{
  <RecommendBundle>
    <Schema>
      <User>
        <Property name="gender" />
        <Property name="age" />
        <Property name="area" />
      </User>
      <Item>
        <Property name="name" />
        <Property name="link" />
        <Property name="price" />
        <Property name="category" />
      </Item>
      <Track>
        <Event name="wish_list" />
        <Event name="own" />
        <Event name="like" />
        <Event name="favorite" />
      </Track>
    </Schema>
  </RecommendBundle>
}

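Element	Attribute	Description
User/Property	name	Name of a user property relevant to recommendation
Item/Property	name	Name of an item property relevant to recommendation
Track/Event	name	Name of an event relevant to recommendation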

Deployment¶

Distributed deployment settings.

<BrokerAgent usecache="n" threadnum="50" enabletest="y" port="18181"/>
<DistributedCommon clusterid="@LOCAL_HOST_USER_NAME@"
                   username="@LOCAL_HOST_USER_NAME@"
                   localhost="@LOCAL_HOST_IP@"
                   workerport="18151"
                   masterport="18131"
                   datarecvport="18121"
                   filesyncport="18141"/>
<DistributedTopology enable="n" nodenum="2">
  <CurrentSf1rNode nodeid="1" replicaid="1">
    <MasterServer enable="n" name="undefined">
      <DistributedService type="search">
        <Collection name="web" distributive="n"/>
        <Collection name="qa" distributive="n"/>
        <Collection name="b5mo" distributive="n"/>
        <Collection name="b5mc" distributive="n"/>
        <Collection name="b5mp" distributive="y" shardids="1,2"/>
      </DistributedService>
    </MasterServer>
    <WorkerServer enable="n">
      <DistributedService type="search">
        <Collection name="b5mp"/>
      </DistributedService>
    </WorkerServer>
  </CurrentSf1rNode>
</DistributedTopology>
<DistributedUtil>
  <ZooKeeper disable="n"
             servers="10.10.99.121:2181,10.10.99.122:2181,10.10.99.123:2181"
             sessiontimeout="5000"/>
  <DFS type="hdfs" supportfuse="y" mountdir="/mnt/hdfs" server="localhost" port="9000"/>
</DistributedUtil>

Element	Attribute	Description
BrokerAgent	usecache	Whether to use the cache
	threadnum	Number of threads of the sf1 server
	enabletest	Whether testing is supported
	port	Port of the sf1 server
DistributedCommon	clusterid	Cluster ID
	username	User name
	localhost	IP address of the local host
	workerport	Port of the worker node
	masterport	Port of the master node
	datarecvport	Port for receiving data
	filesyncport	Port for file synchronization
DistributedTopology	enable	Whether distributed sf1 is enabled
	nodenum	Number of nodes in the cluster
DistributedTopology/CurrentSf1rNode	nodeid	ID of the current node
	replicaid	Replica ID of the current node
DistributedTopology/CurrentSf1rNode/MasterServer	enable	Whether the current node is a master
	name	Name of the current node
DistributedTopology/CurrentSf1rNode/MasterServer/DistributedService	type	Role of the current node
DistributedTopology/CurrentSf1rNode/MasterServer/DistributedService/Collection	name	Collection name
	distributive	Whether the collection is distributed
	shardids	Shard IDs
DistributedTopology/CurrentSf1rNode/WorkerServer	enable	Whether the current node is a worker
DistributedTopology/CurrentSf1rNode/WorkerServer/DistributedService	type	Role of the current node
DistributedTopology/CurrentSf1rNode/WorkerServer/DistributedService/Collection	name	Collection name
DistributedUtil/ZooKeeper	disable	Whether the ZooKeeper connection is disabled; it can be disabled only in a non-distributed setup
	servers	Addresses of the ZooKeeper servers
	sessiontimeout	Session timeout
DFS	type	DFS type
	supportfuse	Whether FUSE is supported, default "y"
	mountdir	Directory where the DFS is mounted
	server	Address of the DFS server
	port	Port of the DFS
Distributed Search Configuration¶

Distributed SF1R uses ZooKeeper for task scheduling, so the ZooKeeper addresses must be specified in the configuration file:

<DistributedUtil>
  <ZooKeeper disable="n" servers="10.10.10.1:2181,10.10.10.2:2181,10.10.10.3:2181" sessiontimeout="2000" />
</DistributedUtil>

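Example

The sample configuration below has two SF1R nodes. One of them is the Master, and both nodes act as Workers.

SF1R Node1 acts as both Master and Worker (shard 1). Several collections are deployed on this node, of which "product" is distributed across the cluster.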

....
<DistributedTopology enable="y">
  <CurrentSf1rNode nodeid="1" replicaid="1">
    <!--master names could be www|stage|beta-->
    <MasterServer enable="y" name="undefined" />
    <WorkerServer enable="y" />
  </CurrentSf1rNode>
</DistributedTopology>

SF1R Node2 acts as a Worker (shard 2).

<DistributedTopology enable="y">
  <CurrentSf1rNode nodeid="2" replicaid="1">
    <MasterServer enable="n" name="undefined" />
    <WorkerServer enable="y" />
  </CurrentSf1rNode>
</DistributedTopology>

In product.xml on both nodes, add a `ShardSchema` inside `IndexBundle`.

....
 <IndexBundle>
   <ShardSchema>
     <ShardKey name="DOCID" />
     <DistributedService type="search" shardids="1,2" />
     <DistributedService type="recommend" shardids="1,2" />
   </ShardSchema>
 ....
 </IndexBundle>
 ....

Coordination of Distributed SF1R

ZooKeeper namespace of distributed SF1R

|                                  # Root of zookeeper namespace
|--- SF1R-[CLUSTERID]              # Root of distributed SF1 namespace, [CLUSTERID] is specified by user configuration.
    |--- Topology
         |--- Replica1           # A replica of service cluster
              |--- Node1         # A SF1 node in the replica of cluster, it can be a Master or Worker or both.
                   |--- Search,Recommend   # A node supplying the distributed search and recommend services.
              |--- Node2
                   |--- Search
              |--- Node3
                   |--- Recommend
         |--- Replica2
              |--- Node1
              |--- Node2
    |--- Servers           # Each Master service node in topology is a service server. xxx, maybe we can remove this node.
         |--- Server00000000       # A master node supply Search and Recommend service as master
              |--- Search,Recommend
         |--- Server00000001
    |--- Synchro                 # For synchronization task

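Installing, Deploying, and Using ZooKeeper
Introduction

ZooKeeper is a service for coordinating distributed systems, and it is itself distributed. For our distributed system, ZooKeeper is the service used for distributed management. It offers a simple, easy-to-use framework made up of two parts, the Service and the Client.

The ZooKeeper Service consists of one or more running Servers. The Servers are identical and can be deployed on different machines. Each Server maintains the same data structure, a tree similar to a file directory hierarchy whose nodes are called znodes, and the Servers synchronize data among themselves automatically.

A ZooKeeper Client connects to the Service. Each Client object connects to one Server, either specified or assigned automatically, and through the client users can create and maintain data on the Server. Because the Servers hold replicas of the same data, different clients can share data (information) through the ZooKeeper Service.

Installing and Deploying the ZooKeeper Service

See the [ZooKeeper Administrator's Guide](http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html).

At least three ZooKeeper server nodes need to be deployed. The following example configures ZooKeeper on a single machine (in practice the servers should be deployed on different machines).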

Server1

Configuration file

zookeeper-3.3.3/conf/zoo.cfg

tickTime=2000
dataDir=./data
clientPort=2181
initLimit=30
syncLimit=10
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2810:3810

myid file (in the configured data directory)

zookeeper-3.3.3/data/myid

1

start server

zookeeper-3.3.3$ ./bin/zkServer.sh start

stop server

zookeeper-3.3.3$ ./bin/zkServer.sh stop

Server2

Configuration file

zookeeper-2/conf/zoo.cfg

tickTime=2000
dataDir=./data
clientPort=2182
initLimit=30
syncLimit=10
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2810:3810

myid file (in the configured data directory)

zookeeper-2/data/myid

2

start server

zookeeper-2$ ./bin/zkServer.sh start

Server3

Configuration file

zookeeper-3/conf/zoo.cfg

tickTime=2000
dataDir=./data
clientPort=2183
initLimit=30
syncLimit=10
server.1=localhost:2888:3888
server.2=localhost:2889:3889
server.3=localhost:2810:3810

myid file (in the configured data directory)

zookeeper-3/data/myid

3

start server

zookeeper-3$ ./bin/zkServer.sh start

ZooKeeper Client

For SF1R, the ZooKeeper C client is wrapped in C++; the code can be found in izenelib.

See the following header files:

#include <3rdparty/zookeeper/ZooKeeper.hpp>
#include <3rdparty/zookeeper/ZooKeeperWatcher.hpp>
#include <3rdparty/zookeeper/ZooKeeperEvent.hpp>