Lucene: Matching Arbitrary Search Words and Returning Position Information
Original article: http://codepub.cn/2017/10/13/lucene-implements-any-word-search-and-returns-location-information/
Note: the following is a cached copy of the original article, made to speed up access. It has been reformatted, so there may be formatting issues or occasional omissions; please refer to the original.
Background
If the title were split into two separate problems, each would be easy to solve on its own, as described below. But once the two are combined as an AND condition and further constraints are added, the implementation becomes much harder. This article discusses how to implement it elegantly when the requirements pile up, for example:
- Keep punctuation. Otherwise, after punctuation is stripped, words on either side of a punctuation mark may match each other. For example, “你好,小甜甜” without punctuation is segmented as 『你|好|小|甜|甜』, in which 『好小』 could be matched; if it is segmented as 『你|好|,|小|甜|甜』 instead, 『好小』 cannot match.
- Any search word must hit as long as the text contains it.
- For example, for “我爱你中国”, different segmentation tools produce different segmentations:
- 『我|爱|你|中国』, 『我爱你|中国』, 『我|爱|你|中|国』, and so on. The requirement is that searching for “我爱”, “爱你”, “你中”, etc. must all hit.
- The positions of the matched words must be returned.
- Again with “我爱你中国”: searching “你中” must hit and return position information, e.g. start=2, meaning “你中” starts at position 2 in the original text.
- A slop must be settable.
- What is slop? Simply put, slop is the maximum distance allowed between the positions of two terms.
- Why set a slop? Smut texts, for instance, insert interference characters into sensitive words to evade filtering, e.g. 性%=$虐待, so searching 性虐待 directly does not hit.
- Setting the slop to 3 is then equivalent to searching 性[***]虐待, where [***] stands for a slop of 3 and matches any three characters.
- Do not store the Field value.
- If storing were allowed, one could fetch the original text via the Document, analyze it with a TokenStream, initialize the analysis with a QueryScorer, iterate over the tokens, and use each token's score to decide whether it is a hit; if so, output its position or start offset.
- But storing Field values costs disk space; when indexing huge amounts of text the index becomes very large and search performance suffers.
- The index could of course be split into several parts to reduce its size, which is also an option, but it treats the symptom rather than the cause.
- Do not store TermVectors.
- Likewise, if storing were allowed, one could obtain Terms from the TermVectors, iterate over the TermsEnum, and read the positions from a PostingsEnum; the drawback is that this only supports matching a single character (Term).
- But storing TermVectors also costs disk space, so to keep the index small they are not stored either.
- Support AND logic, e.g. searching “你中 & 我爱” means both words must hit.
- Support OR logic, e.g. searching “你中 | 我爱” means at least one of the two words must hit.
To guarantee that absolutely any search word can hit, the text must be split character by character, because no segmentation tool can guarantee that the words it produces line up with the search words. As mentioned above, matching words that contain interference characters requires setting a slop, which brings a certain false-positive rate: things that should not match may match once a slop is set. Besides setting a slop, another option is to keep only Chinese characters at indexing time and drop all punctuation and interference characters; this also produces some false positives, and the positions obtained are no longer the correct positions in the original text. Which of the two to choose depends on the business requirements; neither misses matches, but both produce false positives.
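As a rough illustration of the second option, the index-time cleanup can be a simple regular-expression replacement. The snippet below is only a minimal sketch under that assumption (the class and method names are made up for illustration), and, as noted above, positions computed on the cleaned text no longer correspond to the original text.
import java.util.regex.Pattern;
public class KeepChineseOnly {
// Matches every character that is not a CJK unified ideograph.
private static final Pattern NON_CJK = Pattern.compile("[^\\u4e00-\\u9fa5]");
// Strip punctuation and interference characters before indexing.
public static String clean(String raw) {
return NON_CJK.matcher(raw).replaceAll("");
}
public static void main(String[] args) {
System.out.println(clean("性%=$虐待"));// prints 性虐待
}
}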
Implementing the simple requirements
If you do not need to satisfy all of the requirements above but only any single one of them, the implementation is very simple. The following code is all based on Lucene 5.5.0; examples below.
Matching any search word only
@org.junit.Test
public void testAnyMatch() throws IOException {
RAMDirectory ramDirectory = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(ramDirectory, new IndexWriterConfig(new StandardAnalyzer()));
Document document = new Document();
document.add(new TextField("content", "我爱你中国", Field.Store.NO));
indexWriter.addDocument(document);
indexWriter.commit();
IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(indexWriter));
PhraseQuery phraseQuery = new PhraseQuery.Builder().add(new Term("content", "你")).add(new Term("content", "中")).setSlop(0).build();
TopDocs search = indexSearcher.search(phraseQuery, Integer.MAX_VALUE);
System.out.println(search.totalHits);
//Alternatively, search with a MultiPhraseQuery; note that add(Term[]) puts both terms at the same position, i.e. either of them may appear there
MultiPhraseQuery multiPhraseQuery = new MultiPhraseQuery();
Term first = new Term("content", "你");
Term second = new Term("content", "中");
multiPhraseQuery.add(new Term[]{first, second});
search = indexSearcher.search(multiPhraseQuery, Integer.MAX_VALUE);
System.out.println(search.totalHits);
}
Storing the Field value is allowed
@org.junit.Test
public void testStoreFieldMatch() throws IOException, InvalidTokenOffsetsException {
RAMDirectory ramDirectory = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(ramDirectory, new IndexWriterConfig(new StandardAnalyzer()));
Document document = new Document();
document.add(new TextField("content", "我爱你中国", Field.Store.YES));
indexWriter.addDocument(document);
indexWriter.commit();
IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(indexWriter));
PhraseQuery phraseQuery = new PhraseQuery.Builder().add(new Term("content", "你")).add(new Term("content", "中")).setSlop(0).build();
TopDocs search = indexSearcher.search(phraseQuery, Integer.MAX_VALUE);
ScoreDoc[] scoreDocs = search.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
String content = indexSearcher.doc(scoreDoc.doc).get("content");
TokenStream contentStream = new StandardAnalyzer().tokenStream("content", content);
CharTermAttribute charTermAttribute = contentStream.addAttribute(CharTermAttribute.class);
OffsetAttribute offsetAttribute = contentStream.addAttribute(OffsetAttribute.class);
QueryScorer queryScorer = new QueryScorer(phraseQuery);
queryScorer.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
TokenStream init = queryScorer.init(contentStream);
if (init != null) {
contentStream = init;
}
contentStream.reset();
queryScorer.startFragment(null);
int startOffset, endOffset;
for (boolean next = contentStream.incrementToken(); next && (offsetAttribute.startOffset() < Integer.MAX_VALUE); next = contentStream.incrementToken()) {
startOffset = offsetAttribute.startOffset();
endOffset = offsetAttribute.endOffset();
if (startOffset > content.length() || endOffset > content.length()) {
throw new InvalidTokenOffsetsException("Token " + charTermAttribute.toString() + " exceeds length of provided text sized " + content.length());
}
float res = queryScorer.getTokenScore();
if (res > Float.valueOf(0) && startOffset <= endOffset) {
System.out.println("hits: " + content.substring(startOffset, endOffset) + ", start: " + startOffset);
}
}
contentStream.close();
}
}
Storing TermVectors is allowed
@org.junit.Test
public void testTermVectorsMatch() throws IOException, InvalidTokenOffsetsException {
RAMDirectory ramDirectory = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(ramDirectory, new IndexWriterConfig(new StandardAnalyzer()));
Document document = new Document();
FieldType fieldType = new FieldType();
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
fieldType.setStoreTermVectorPositions(true);
fieldType.setStoreTermVectors(true);
document.add(new Field("content", "我爱你中国", fieldType));
indexWriter.addDocument(document);
indexWriter.commit();
IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(indexWriter));
Term searchTerm = new Term("content", "中");
PhraseQuery phraseQuery = new PhraseQuery.Builder().add(searchTerm).setSlop(0).build();
TopDocs search = indexSearcher.search(phraseQuery, Integer.MAX_VALUE);
ScoreDoc[] scoreDocs = search.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
Terms content = indexSearcher.getIndexReader().getTermVector(scoreDoc.doc, "content");
TermsEnum iterator = content.iterator();
BytesRef bytesRef;
while ((bytesRef = iterator.next()) != null) {
PostingsEnum postings = iterator.postings(null, PostingsEnum.ALL);
if (postings.nextDoc() != Spans.NO_MORE_DOCS) {
for (int i = 0; i < postings.freq(); i++) {
if (searchTerm.text().equals(bytesRef.utf8ToString())) {
System.out.println("hits: " + bytesRef.utf8ToString() + ", start: " + postings.nextPosition());
}
}
}
}
}
}
Implementing the complex requirements
Implementing any single simple requirement needs no more examples; this section explains how to implement the complex case, i.e. satisfying the whole requirement list above at once rather than just one item. The first problem to solve is keeping punctuation after tokenization. I did not find a built-in Lucene Analyzer that preserves punctuation, so I had to roll my own.
A punctuation-preserving analyzer
import lombok.extern.log4j.Log4j2;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import java.util.regex.Pattern;
@Log4j2
public class ReservePunctuationAnalyzer extends Analyzer {
public ReservePunctuationAnalyzer() {
}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
final Tokenizer source;
// An empty pattern with group -1 makes PatternTokenizer split the input between
// every character, so the text is tokenized character by character and
// punctuation marks are kept as tokens of their own.
source = new PatternTokenizer(Pattern.compile(""), -1);
TokenStream result = new LowerCaseFilter(source);
return new TokenStreamComponents(source, result);
}
}
The tokenizer can be tested as follows
@Test
public void test() throws IOException {
String input = "你好,小甜甜。";
TokenStream test = new ReservePunctuationAnalyzer().tokenStream("test", input);
CharTermAttribute charTermAttribute = test.addAttribute(CharTermAttribute.class);
OffsetAttribute offsetAttribute = test.addAttribute(OffsetAttribute.class);
test.reset();
while (test.incrementToken()) {
System.out.println("token:[" + charTermAttribute + "], offset:[" + offsetAttribute.startOffset() + "]");
}
test.close();
}
The tokenizer output is:
token:[你], offset:[0]
token:[好], offset:[1]
token:[,], offset:[2]
token:[小], offset:[3]
token:[甜], offset:[4]
token:[甜], offset:[5]
token:[。], offset:[6]
Matching any search word and returning position information
Next, let's solve the problem of matching any search word and returning its positions without storing the Field and without storing TermVectors, while still being able to set a slop. This requires SpanQuery and a series of its subclasses; first take a look at the inheritance diagram, which shows the classes we are about to use.
The doc comment of SpanQuery is a single sentence, “Base class for span-based queries”. The real work is done by its subclasses.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.LongField;
import org.apache.lucene.index.*;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.SpanWeight;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.RAMDirectory;
import java.io.IOException;
import java.util.List;
import static org.apache.lucene.search.spans.SpanNearQuery.newOrderedNearQuery;
/**
* <p>
* Created by wangxu on 2017/10/13 14:29.
* </p>
* <p>
* Description: TODO
* </p>
*
* @author Wang Xu
* @version V1.0.0
* @since V1.0.0 <br/>
* WebSite: http://codepub.cn <br>
* Licence: Apache v2 License
*/
public class SpanNearQueryDemo {
@org.junit.Test
public void test() throws IOException {
String input = "现有的中文分词算法可分为三大类:基于字符串匹配的类基分词方法、基于理解的分词方法和基于统计的分词方法。";
RAMDirectory ramDirectory = new RAMDirectory();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new ReservePunctuationAnalyzer());
try (IndexWriter indexWriter = new IndexWriter(ramDirectory, indexWriterConfig)) {
Document document = new Document();
FieldType fieldType = new FieldType();
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
Field field = new Field("title", input, fieldType);
LongField IDX = new LongField("IDX", 1, Field.Store.YES);
document.add(field);
document.add(IDX);
indexWriter.addDocument(document);
input = "计算机算法是很难很复杂滴";
document = new Document();
field = new Field("title", input, fieldType);
IDX = new LongField("IDX", 2, Field.Store.YES);
document.add(field);
document.add(IDX);
indexWriter.addDocument(document);
input = "计算机算法可以大幅度提升程序性能";
document = new Document();
field = new Field("title", input, fieldType);
IDX = new LongField("IDX", 3, Field.Store.YES);
document.add(field);
document.add(IDX);
indexWriter.addDocument(document);
indexWriter.commit();
IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(ramDirectory));
SpanTermQuery first = new SpanTermQuery(new Term("title", "类"));
SpanTermQuery second = new SpanTermQuery(new Term("title", "基"));
SpanNearQuery spanNearQuery = newOrderedNearQuery("title").addClause(first).addClause(second).build();
SpanWeight weight = spanNearQuery.createWeight(indexSearcher, true);
List<LeafReaderContext> leaves = indexSearcher.getIndexReader().getContext().leaves();
for (LeafReaderContext leaf : leaves) {
Spans spans = weight.getSpans(leaf, SpanWeight.Postings.POSITIONS);
while (spans.nextDoc() != Spans.NO_MORE_DOCS) {
Document doc = leaf.reader().document(spans.docID());
while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
System.out.println("doc id = " + spans.docID() + ", doc IDX= " + doc.get("IDX") + ", start position = " + spans.startPosition() + ", end " +
"position = " + spans.endPosition());
}
}
}
//================================================================
// Output:
// doc id = 0, doc IDX= 1, start position = 24, end position = 26
//================================================================
System.out.println();
//Change the slop to 1 (the default is 0)
spanNearQuery = newOrderedNearQuery("title").addClause(first).addClause(second).setSlop(1).build();
weight = spanNearQuery.createWeight(indexSearcher, true);
leaves = indexSearcher.getIndexReader().getContext().leaves();
for (LeafReaderContext leaf : leaves) {
Spans spans = weight.getSpans(leaf, SpanWeight.Postings.POSITIONS);
while (spans.nextDoc() != Spans.NO_MORE_DOCS) {
Document doc = leaf.reader().document(spans.docID());
while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
System.out.println("doc id = " + spans.docID() + ", doc IDX= " + doc.get("IDX") + ", start position = " + spans.startPosition() + ", end " +
"position = " + spans.endPosition());
}
}
}
//================================================================
// Output:
// doc id = 0, doc IDX= 1, start position = 14, end position = 17
// doc id = 0, doc IDX= 1, start position = 24, end position = 26
//================================================================
}
}
}
Implementing AND-logic queries
As the diagram above shows, Lucene does not provide a SpanAndQuery out of the box. An issue asking for SpanAndQuery support was raised in the Lucene community discussions, but it never received an official response. A commercial company, Search Technologies, has implemented this kind of search (see their SpanAndQuery API), but it is not open source (so sad); I could not find its source code, so if you know where it is, please let me know.
Since there is no official support, we have to build it ourselves. Logically it is not complicated, and there are two ways to do it.
The first way works at the word level. For example, to AND the two search words 『爱你』 and 『你中』, simply search each word separately and intersect the hit sets; that yields AND semantics. The code is straightforward, as sketched below.
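A minimal sketch of this intersection approach, assuming the two word-level SpanNearQuery objects are built over the same field as in the earlier examples (the class and method names here are made up for illustration):
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.spans.SpanNearQuery;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
public class AndByIntersectionSketch {
// Search each word's SpanNearQuery separately and keep only the doc IDs hit by both.
public static Set<Integer> andSearch(IndexSearcher searcher, SpanNearQuery wordA, SpanNearQuery wordB) throws IOException {
Set<Integer> hits = collectDocIds(searcher, wordA);
hits.retainAll(collectDocIds(searcher, wordB));// intersection = logical AND
return hits;
}
private static Set<Integer> collectDocIds(IndexSearcher searcher, SpanNearQuery query) throws IOException {
Set<Integer> docIds = new HashSet<>();
for (ScoreDoc scoreDoc : searcher.search(query, Integer.MAX_VALUE).scoreDocs) {
docIds.add(scoreDoc.doc);
}
return docIds;
}
}
Positions for each word can then be read per matching document from each word's SpanWeight, exactly as in the SpanNearQueryDemo above.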
The second way works at the Term level. For example, if 『爱你』 is split character by character it becomes two Terms, 『爱』 and 『你』; a SpanNearQuery can then be built over them with a very large slop. But however large the slop is, it has an upper bound, and if the distance between the two Terms exceeds it the query will not match, so this approach is leaky. It is only usable if you are sure the distance between your Terms never exceeds some concrete slop value.
Note that when using SpanNearQuery to obtain position information, you cannot both split by character and set a slop between the two search words, because wrapping two SpanNearQuerys in a BooleanQuery loses the position information. If you do not split by character, each search word is a single Term, i.e. 『爱你』 and 『你中』 are treated as two Terms, and a slop can then be set between them. If you split by character, they become 『爱|你』 and 『你|中』; implementing AND with a slop would amount to 『爱|slop|你|slop|你|slop|中』, which no longer matches AND semantics. What AND really means here is 『爱|slop 0|你|any slop|你|slop 0|中』. Think it through carefully.
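For the Term-level approach, one way to keep the characters of one word adjacent while allowing a gap between words is to nest SpanNearQuerys: an inner ordered query with slop 0 per word, and an outer ordered query with a large slop combining the words. This is only a sketch with the caveats above (the outer slop of 100 is an arbitrary assumption, the two words are deliberately chosen not to overlap, and the imports are the same as in SpanNearQueryDemo above):
// Inner queries: the characters of one word must stay adjacent (slop 0).
SpanNearQuery aiNi = SpanNearQuery.newOrderedNearQuery("title")
.addClause(new SpanTermQuery(new Term("title", "爱")))
.addClause(new SpanTermQuery(new Term("title", "你")))
.setSlop(0)
.build();
SpanNearQuery zhongGuo = SpanNearQuery.newOrderedNearQuery("title")
.addClause(new SpanTermQuery(new Term("title", "中")))
.addClause(new SpanTermQuery(new Term("title", "国")))
.setSlop(0)
.build();
// Outer query: both words must appear, in this order, at most 100 positions apart;
// any pair of words farther apart than the outer slop will be missed.
SpanNearQuery both = SpanNearQuery.newOrderedNearQuery("title")
.addClause(aiNi)
.addClause(zhongGuo)
.setSlop(100)
.build();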
Implementing OR-logic queries
OR logic is already supported out of the box by SpanOrQuery, so it can be used directly.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.LongField;
import org.apache.lucene.index.*;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.*;
import org.apache.lucene.store.RAMDirectory;
import java.io.IOException;
import java.util.List;
import static org.apache.lucene.search.spans.SpanNearQuery.newOrderedNearQuery;
/**
* <p>
* Created by wangxu on 2017/10/13 14:29.
* </p>
* <p>
* Description: TODO
* </p>
*
* @author Wang Xu
* @version V1.0.0
* @since V1.0.0 <br/>
* WebSite: http://codepub.cn <br>
* Licence: Apache v2 License
*/
public class SpanOrQueryDemo {
@org.junit.Test
public void test() throws IOException {
String input = "现有的中文分词算法可分为三大类:基于字符串匹配的类基分词方法、基于理解的分词方法和基于统计的分词方法。";
RAMDirectory ramDirectory = new RAMDirectory();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new ReservePunctuationAnalyzer());
try (IndexWriter indexWriter = new IndexWriter(ramDirectory, indexWriterConfig)) {
Document document = new Document();
FieldType fieldType = new FieldType();
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
Field field = new Field("title", input, fieldType);
LongField IDX = new LongField("IDX", 1, Field.Store.YES);
document.add(field);
document.add(IDX);
indexWriter.addDocument(document);
input = "计算机算法是很难很复杂滴";
document = new Document();
field = new Field("title", input, fieldType);
IDX = new LongField("IDX", 2, Field.Store.YES);
document.add(field);
document.add(IDX);
indexWriter.addDocument(document);
input = "计算机算法可以大幅度提升程序性能";
document = new Document();
field = new Field("title", input, fieldType);
IDX = new LongField("IDX", 3, Field.Store.YES);
document.add(field);
document.add(IDX);
indexWriter.addDocument(document);
indexWriter.commit();
IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(ramDirectory));
SpanTermQuery first = new SpanTermQuery(new Term("title", "类"));
SpanTermQuery second = new SpanTermQuery(new Term("title", "基"));
SpanNearQuery spanNearQueryFirst = newOrderedNearQuery("title").addClause(first).addClause(second).build();
first = new SpanTermQuery(new Term("title", "算"));
second = new SpanTermQuery(new Term("title", "法"));
SpanNearQuery spanNearQuerySecond = newOrderedNearQuery("title").addClause(first).addClause(second).build();
SpanOrQuery spanOrQuery = new SpanOrQuery(spanNearQueryFirst, spanNearQuerySecond);
SpanWeight weight = spanOrQuery.createWeight(indexSearcher, true);
List<LeafReaderContext> leaves = indexSearcher.getIndexReader().getContext().leaves();
for (LeafReaderContext leaf : leaves) {
Spans spans = weight.getSpans(leaf, SpanWeight.Postings.POSITIONS);
while (spans.nextDoc() != Spans.NO_MORE_DOCS) {
Document doc = leaf.reader().document(spans.docID());
while (spans.nextStartPosition() != Spans.NO_MORE_POSITIONS) {
System.out.println("doc id = " + spans.docID() + ", doc IDX= " + doc.get("IDX") + ", start position = " + spans.startPosition() + ", end " +
"position = " + spans.endPosition());
}
}
}
//================================================================
// Output:
// doc id = 0, doc IDX= 1, start position = 7, end position = 9
// doc id = 0, doc IDX= 1, start position = 24, end position = 26
// doc id = 1, doc IDX= 2, start position = 3, end position = 5
// doc id = 2, doc IDX= 3, start position = 3, end position = 5
//================================================================
}
}
}
Implementing a SpanAllNearQuery
This is a bonus feature; I have not run into such a requirement yet, but the query is quite interesting to implement, so here is a brief walkthrough. The problem comes from an issue someone filed asking for official support of a SpanAllNearQuery, which, again, the project ignored. Open-source projects can afford to be aloof and not listen to user requests, unlike commercial companies, which want users' money and therefore try hard to implement whatever users ask for.
So what is the requirement? It can be written simply as a WITHIN 5 WORDS OF (b AND c), or equivalently as (a WITHIN 5 WORDS OF b) AND (a WITHIN 5 WORDS OF c); that is, find all results where both b and c appear within the 5 tokens before or after a. Implementing this requires SpanNotQuery together with SpanOrQuery. SpanOrQuery was covered in the OR-logic section above; what does SpanNotQuery mean? For example:
SpanNotQuery(a, b, 5, 5) means b must not appear within the 5 tokens before or the 5 tokens after a.
SpanNotQuery(a, c, 5, 5) means c must not appear within the 5 tokens before or the 5 tokens after a.
Let's first work out the logic. To find the occurrences of a that have both b and c within the 5 tokens before or after, invert the problem: first take the occurrences of a that do not have b within 5 tokens before or after, 『SpanNotQuery(a, b, 5, 5)』, and the occurrences of a that do not have c within 5 tokens before or after, 『SpanNotQuery(a, c, 5, 5)』; combine the two with a SpanOrQuery; and finally use another SpanNotQuery to exclude that SpanOrQuery result set from a. What remains are the occurrences of a that have both b and c within the 5 tokens before or after.
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import java.io.IOException;
/**
* <p>
* Created by wangxu on 2017/06/16 16:02.
* </p>
* <p>
* Description: TODO
* </p>
*
* @author Wang Xu
* @version V1.0.0
* @since V1.0.0 <br/>
* WebSite: http://codepub.cn <br>
* Licence: Apache v2 License
*/
public class SpanAllNearQueryDemo {
@org.junit.Test
public void test() throws IOException {
RAMDirectory ramDirectory = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(ramDirectory, new IndexWriterConfig(new WhitespaceAnalyzer()));
Document document = new Document();
document.add(new TextField("key", "X b X X X X a X X X X c X", Field.Store.YES));//命中
document.add(new IntField("IDX", 1, Field.Store.YES));
indexWriter.addDocument(document);
document = new Document();
document.add(new TextField("key", "X X X X X b a c X X X X X", Field.Store.YES));//命中
document.add(new IntField("IDX", 2, Field.Store.YES));
indexWriter.addDocument(document);
document = new Document();
document.add(new TextField("key", "X b X X X X a a X X X X c", Field.Store.YES));//不命中,不能同时以两个a为中心,两个a必选其一
document.add(new IntField("IDX", 3, Field.Store.YES));
indexWriter.addDocument(document);
document = new Document();
document.add(new TextField("key", "X b X X X X X a X X X X c", Field.Store.YES));//不命中
document.add(new IntField("IDX", 4, Field.Store.YES));
indexWriter.addDocument(document);
document = new Document();
document.add(new TextField("key", "X b X X X X a X X X X X c", Field.Store.YES));//不命中
document.add(new IntField("IDX", 5, Field.Store.YES));
indexWriter.addDocument(document);
document = new Document();
document.add(new TextField("key", "b X X X X X a X X X X X c", Field.Store.YES));//不命中
document.add(new IntField("IDX", 6, Field.Store.YES));
indexWriter.addDocument(document);
document = new Document();
document.add(new TextField("key", "b X X X X X a X X X X X X", Field.Store.YES));//不命中
document.add(new IntField("IDX", 7, Field.Store.YES));
indexWriter.addDocument(document);
document = new Document();
document.add(new TextField("key", "X X X X X X a X X X X X X", Field.Store.YES));//不命中
document.add(new IntField("IDX", 8, Field.Store.YES));
indexWriter.addDocument(document);
indexWriter.commit();
IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(indexWriter));
SpanTermQuery a = new SpanTermQuery(new Term("key", "a"));
SpanTermQuery b = new SpanTermQuery(new Term("key", "b"));
SpanTermQuery c = new SpanTermQuery(new Term("key", "c"));
SpanOrQuery exclude = new SpanOrQuery(new SpanNotQuery(a, b, 5, 5), new SpanNotQuery(a, c, 5, 5));
//Exclude the occurrences of a that are missing b or missing c within the 5 tokens before or after; the remaining documents are those where both b and c appear within the 5 tokens before or after a
SpanNotQuery spanNotQuery = new SpanNotQuery(a, exclude);
TopDocs search = indexSearcher.search(spanNotQuery, Integer.MAX_VALUE);
ScoreDoc[] scoreDocs = search.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
System.out.println("hist IDX: " + indexSearcher.doc(scoreDoc.doc).get("IDX"));
}
indexSearcher.getIndexReader().close();
indexWriter.close();
}
}
Implementing wildcard queries with SpanNearQuery
import com.yuewen.nrzx.character.analyzer.ReservePunctuationAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.LongField;
import org.apache.lucene.index.*;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import java.io.IOException;
public class SpanNearQueryAndWildcardQueryDemo {
@org.junit.Test
public void test() throws IOException {
String input = "现有的中文分词算法可分为三大类:基于字符串匹配的类基分词方法、基于理解的分词方法和基于统计的分词方法。";
RAMDirectory ramDirectory = new RAMDirectory();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new ReservePunctuationAnalyzer());
try (IndexWriter indexWriter = new IndexWriter(ramDirectory, indexWriterConfig)) {
Document document = new Document();
FieldType fieldType = new FieldType();
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
Field field = new Field("title", input, fieldType);
LongField IDX = new LongField("IDX", 1, Field.Store.YES);
document.add(field);
document.add(IDX);
indexWriter.addDocument(document);
input = "计算机算法是很难很复杂滴";
document = new Document();
field = new Field("title", input, fieldType);
IDX = new LongField("IDX", 2, Field.Store.YES);
document.add(field);
document.add(IDX);
indexWriter.addDocument(document);
input = "计算机算法可以大幅度提升程序性能";
document = new Document();
field = new Field("title", input, fieldType);
IDX = new LongField("IDX", 3, Field.Store.YES);
document.add(field);
document.add(IDX);
indexWriter.addDocument(document);
indexWriter.commit();
IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(ramDirectory));
// Both ? and * can be used as wildcards in a SpanNearQuery clause, but note that in a plain WildcardQuery * matches zero or more characters,
// whereas inside a SpanQuery it only covers the case equivalent to slop=1; it cannot cover gaps larger than one position
SpanTermQuery first = new SpanTermQuery(new Term("title", "复"));
SpanQuery wildcard = new SpanMultiTermQueryWrapper<>(new WildcardQuery(new Term("title", "?")));
SpanTermQuery last = new SpanTermQuery(new Term("title", "滴"));
SpanNearQuery spanNearQuery = new SpanNearQuery.Builder("title", true).addClause(first).addClause(wildcard).addClause(last).build();
TopDocs search = indexSearcher.search(spanNearQuery, Integer.MAX_VALUE);
System.out.println("IDX: " + indexSearcher.doc(search.scoreDocs[0].doc).get("IDX"));
wildcard = new SpanMultiTermQueryWrapper<>(new WildcardQuery(new Term("title", "*")));
spanNearQuery = new SpanNearQuery.Builder("title", true).addClause(first).addClause(wildcard).addClause(last).build();
search = indexSearcher.search(spanNearQuery, Integer.MAX_VALUE);
System.out.println("IDX: " + indexSearcher.doc(search.scoreDocs[0].doc).get("IDX"));
}
}
}
Experiments
Server CPU and memory information
$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
24 Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
$ free -g
| | total | used | free | shared | buffers | cached |
|--|--|--|--|--|--|--|
| Mem | 62 | 56 | 6 | 0 | 0 | 1 |
| -/+ buffers/cache | 54 | 8 | | | | |
| Swap | 1 | 0 | 1 | | | |
Inside the company, only one tenth of the documents were indexed (20,023,911 Documents). Since neither Fields nor TermVectors are stored, the index is not too large. A quick test showed that storing TermVectors would grow the index from 112 GB to 162 GB, and storing Fields as well would push it past 200 GB. The experiment here only covers simple single searches; searches with AND logic or OR logic were not tested. The search-phase results are as follows.
Threads | Total searches | Hits | Total search time | Avg per search | Search + query construction time | Avg per search | Index size |
---|---|---|---|---|---|---|---|
1 | 18820 | 18139 | 36,001,698 ms | 1912.95 ms | 36,011,989 ms | 1913.50 ms | 112 GB |
5 | 18820 | 18139 | 30,775,283 ms | 1635.24 ms | 30,785,538 ms | 1635.79 ms | 112 GB |
10 | 18820 | 18139 | 21,637,515 ms | 1149.71 ms | 21,647,953 ms | 1150.26 ms | 112 GB |
50 | 18820 | 18139 | 21,572,506 ms | 1146.25 ms | 21,583,101 ms | 1146.82 ms | 112 GB |
The indexing phase was not benchmarked in detail; I only pulled a small amount of data, so the numbers below are for reference only. Indexing 2,080,550 Documents took 2,133 s counting both fetching the data from the database and updating the index, about 0.001 s per document on average. Counting only the index updates and excluding the database fetch, it took 464 s, about 0.0002 s per document on average. Indexing is clearly fast as well, thanks entirely to Lucene's excellent design.