Inverted Index: an interactive visualization of how it works

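📖 What is an inverted index?

The core data structure of a search engine: it flips the "document → terms" relationship into a "term → documents" mapping.

Imagine looking for a book in a library. A forward index is like opening every book in turn to see what it contains; an inverted index is like the card catalog at the back of the library: look up a keyword and it tells you directly which books contain it.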
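The flip from "document → terms" to "term → documents" can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical toy corpus that is already tokenized; `forward_index` and `invert` are names made up for this example.

```python
from collections import defaultdict

# Hypothetical toy corpus: doc_id -> list of terms (already tokenized).
forward_index = {
    1: ["search", "engine", "inverted", "index"],
    2: ["python", "language", "python"],
    3: ["database", "index", "btree", "index"],
}

def invert(forward):
    """Turn doc -> [terms] into term -> {doc_id: [positions]}."""
    inverted = defaultdict(dict)
    for doc_id, terms in forward.items():
        for pos, term in enumerate(terms):
            inverted[term].setdefault(doc_id, []).append(pos)
    return dict(inverted)

inverted_index = invert(forward_index)
print(inverted_index["index"])   # {1: [3], 3: [1, 3]}
print(inverted_index["python"])  # {2: [0, 2]}
```

The term frequency of a term in a document is simply the length of its position list.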

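📄 Forward index

Document → [term, term, term, ...]

Doc#1 「搜索引擎原理」 (Search Engine Principles) → 搜索引擎使用倒排索引加速文本检索 (search engines use an inverted index to speed up text retrieval)

Doc#2 「Python编程」 (Python Programming) → Python 是流行编程语言 (Python is a popular programming language)

Doc#3 「数据库索引」 (Database Indexes) → 数据库使用 B+树索引优化查询 (databases use B+ tree indexes to optimize queries)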

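🔄 Inverted index

Term → [{doc_id, tf, positions}, ...]

"搜索引擎" (search engine) → Doc#1(tf=1)

"倒排索引" (inverted index) → Doc#1(tf=1), Doc#4(tf=1)

"索引" (index) → Doc#1(tf=1), Doc#3(tf=2), Doc#4(tf=1), Doc#5(tf=1)

"Python" → Doc#2(tf=3)

(The postings also reference Doc#4 and Doc#5, which the forward index excerpt does not list.)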

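📚 Dictionary (Vocabulary)

The set of all distinct terms. It is usually stored in a hash table (O(1) average lookup) or a B+ tree (O(log N) lookup, with terms kept in sorted order). It plays the same role as the pinyin lookup pages of a Chinese dictionary.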
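To make the two lookup styles concrete, here is a small sketch. It assumes a hypothetical five-term vocabulary and uses a sorted Python list with `bisect` as a stand-in for a B+ tree; a real engine would use an on-disk structure instead.

```python
import bisect

# Hypothetical vocabulary; a real dictionary would map each term to its posting list.
hash_dict = {"index": 0, "inverted": 1, "python": 2, "query": 3, "search": 4}
sorted_terms = sorted(hash_dict)  # stand-in for a B+ tree's sorted leaf level

def lookup_sorted(term):
    """Exact lookup by binary search: O(log N) comparisons."""
    i = bisect.bisect_left(sorted_terms, term)
    return i < len(sorted_terms) and sorted_terms[i] == term

def prefix_range(prefix):
    """All terms starting with `prefix`; possible because the terms are sorted."""
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, prefix + "\uffff")
    return sorted_terms[lo:hi]

print("python" in hash_dict)      # True  (hash lookup, O(1) on average)
print(lookup_sorted("python"))    # True
print(lookup_sorted("ruby"))      # False
print(prefix_range("in"))         # ['index', 'inverted']
```

The hash table wins on exact lookups, while the sorted structure also supports prefix and range lookups.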

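📋 Posting list

The list of documents for each term. Each entry records the document ID, the term frequency (TF), and the positions at which the term occurs. Intersecting or unioning posting lists implements Boolean queries.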
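When posting lists are kept sorted by document ID, the intersection can be computed with a linear two-pointer merge. The sketch below assumes posting lists reduced to plain ascending lists of doc IDs; the function names are made up for this example.

```python
def intersect(a, b):
    """Intersect two posting lists given as ascending lists of doc IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1          # a[i] cannot match anything later in b
        else:
            j += 1
    return out

def union(a, b):
    """Union of two ascending doc-ID lists; the result is also ascending."""
    return sorted(set(a) | set(b))

print(intersect([1, 3, 4, 5], [3, 5, 9]))  # [3, 5]
print(union([1, 3], [3, 5]))               # [1, 3, 5]
```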

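⚡ Why is it fast?

A query only looks up the posting lists of its few terms and merges them; it never scans the whole collection. The cost drops from O(N) over all N documents to roughly O(M), where M is the total length of the posting lists that are read (for a single-term query, exactly the number of matching documents).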

🔨 Build process: from documents to an inverted index

Watch step by step how an inverted index is built automatically from raw text.

⚙️ Live build demo (interactive widget; it fills in a table of Term → Postings)
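The build steps can also be followed in code. The sketch below uses the classic sort-based approach (tokenize, sort, group) on a hypothetical two-document English corpus; it is one possible way to build the structure, not necessarily what the demo does internally.

```python
import re
from itertools import groupby

# Hypothetical two-document corpus.
docs = {1: "Search engines use an inverted index", 2: "A database index speeds up search"}

# Step 1: tokenize each document into (term, doc_id, position) triples.
triples = []
for doc_id, text in docs.items():
    for pos, match in enumerate(re.finditer(r"[a-z0-9]+", text.lower())):
        triples.append((match.group(), doc_id, pos))

# Step 2: sort by term, then doc_id, then position.
triples.sort()

# Step 3: group by term, then by doc_id, to form the posting lists.
index = {}
for term, term_group in groupby(triples, key=lambda t: t[0]):
    postings = []
    for doc_id, doc_group in groupby(term_group, key=lambda t: t[1]):
        positions = [pos for _, _, pos in doc_group]
        postings.append({"doc_id": doc_id, "tf": len(positions), "positions": positions})
    index[term] = postings

print(index["index"])
# [{'doc_id': 1, 'tf': 1, 'positions': [5]}, {'doc_id': 2, 'tf': 1, 'positions': [2]}]
```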

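🎯 Three search strategies

An inverted index supports several kinds of query, each suited to different scenarios.

🔗 Boolean Query

Results are obtained by applying set operations to the posting lists of several terms. This is simple and efficient, but the results are not ranked.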

📝 Query "索引 AND 查询" (index AND query) → 📋 fetch the posting list (PL) of "索引" and 📋 the PL of "查询" → ∩ intersect → ✅ result document set

boolean_search.py
def search_boolean(self, query: str):
    # Parse the query into terms; the tokenizer returns (token, position) pairs
    terms = [tok for tok, _ in self.tokenizer.tokenize(query)]
    
    result_set = None
    for term in terms:
        entry = self.index.get(term)
        doc_ids = {p.doc_id for p in entry.postings} if entry else set()
        
        if result_set is None:
            result_set = doc_ids       # first term: take its doc IDs as-is
        else:
            result_set &= doc_ids      # AND: intersection ∩
            # result_set |= doc_ids   # OR:  union ∪
            # result_set -= doc_ids   # NOT: difference −
    
    return sorted(result_set or [])    # an empty query matches nothing
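The commented-out OR and NOT lines above suggest how other operators would work. A minimal sketch of an evaluator that handles all three follows, assuming hypothetical posting sets and a query that is already split into alternating terms and operators, evaluated strictly left to right without precedence or parentheses.

```python
# Hypothetical posting sets: term -> set of doc IDs.
postings = {
    "index": {1, 3, 4, 5},
    "query": {3, 5},
    "python": {2},
}

def evaluate(tokens, postings):
    """Evaluate 'a AND b OR c NOT d' strictly left to right (no precedence)."""
    result = set(postings.get(tokens[0], set()))   # copy, so stored sets stay intact
    for op, term in zip(tokens[1::2], tokens[2::2]):
        docs = postings.get(term, set())
        if op == "AND":
            result &= docs
        elif op == "OR":
            result |= docs
        elif op == "NOT":
            result -= docs
        else:
            raise ValueError(f"unknown operator: {op}")
    return sorted(result)

print(evaluate("index AND query".split(), postings))            # [3, 5]
print(evaluate("index NOT query".split(), postings))            # [1, 4]
print(evaluate("index AND query OR python".split(), postings))  # [2, 3, 5]
```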

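💬 Phrase Query

The document must not only contain all the terms; the terms must also appear next to each other and in the same order. This relies on the position information (positions) stored in the posting list.

Core idea: check that positions are consecutive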

phrase = ["倒排", "索引"] → 检查 pos("索引") == pos("倒排") + 1

Doc#1: "倒排" at position = 3 → 3 + 1 = expected position of "索引" → is "索引" at position = 4? ✓ Match!

phrase_search.py
def search_phrase(self, phrase: str):
    tokens = self.tokenizer.tokenize(phrase)
    terms = [t[0] for t in tokens]
    if not terms or any(t not in self.index for t in terms):
        return []                     # a missing term can never match
    
    # Candidates for the first term: one partial match per occurrence
    candidates = {}
    for posting in self.index[terms[0]].postings:
        candidates[posting.doc_id] = [[pos] for pos in posting.positions]
    
    # Filter by position adjacency for each subsequent term
    for i in range(1, len(terms)):
        term_pos = {p.doc_id: p.positions for p in self.index[terms[i]].postings}
        new_candidates = {}
        
        for doc_id, prev_lists in candidates.items():
            if doc_id not in term_pos: continue
            for prev_list in prev_lists:
                last_pos = prev_list[-1]
                for cp in term_pos[doc_id]:
                    if cp == last_pos + 1:  # key step: strictly adjacent
                        new_candidates.setdefault(doc_id, []).append(prev_list+[cp])
        candidates = new_candidates
    
    return sorted(candidates.keys())
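The same adjacency check can be written more compactly for a single document by shifting each position list and intersecting. This is an alternative sketch, not the method used in the listing above; `phrase_positions` and the sample positions are made up for illustration.

```python
def phrase_positions(position_lists):
    """Start positions where the terms occur consecutively in one document.

    position_lists[i] holds the positions of the i-th phrase term.
    """
    # A phrase starting at s needs term i at position s + i,
    # so shift each list back by i and intersect.
    starts = set(position_lists[0])
    for offset, positions in enumerate(position_lists[1:], start=1):
        starts &= {p - offset for p in positions}
    return sorted(starts)

# Hypothetical positions inside a single document.
print(phrase_positions([[3, 10], [4, 7]]))         # [3]
print(phrase_positions([[0, 5], [1, 6], [2, 9]]))  # [0]
print(phrase_positions([[1], [5]]))                # []
```

A document matches the phrase whenever the returned list is non-empty.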

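🏆 Relevance ranking (TF-IDF → BM25)

A Boolean query only answers "does it match?", whereas real search has to score and rank results by relevance. The classic approach is TF-IDF; the industry standard is BM25.

TF: term frequency

The more often a term occurs in a document, the more relevant that document is to the term. The growth is not linear, though: the gap between 10 and 100 occurrences matters less than the gap between 1 and 10.

IDF: inverse document frequency

The fewer documents a term appears in, the more discriminating it is: "算法" (algorithm) is worth more than "的" (a grammatical particle, roughly "of"). IDF = log(N/df), where N is the total number of documents and df is the number of documents containing the term.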
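A quick calculation shows how strongly IDF separates rare terms from common ones. The sketch below uses the IDF formula from the text; the log-damped TF in `tf_idf` is one common variant and an assumption of this example, and the corpus size is hypothetical.

```python
import math

def idf(total_docs, doc_freq):
    """IDF = log(N / df), as in the text above."""
    return math.log(total_docs / doc_freq)

def tf_idf(tf, total_docs, doc_freq):
    """One common TF-IDF variant (an assumption here): log-damped TF times IDF."""
    damped_tf = 1 + math.log(tf) if tf > 0 else 0.0
    return damped_tf * idf(total_docs, doc_freq)

N = 1000  # hypothetical corpus size
print(round(idf(N, 10), 3))    # 4.605  (rare term: appears in 10 documents)
print(round(idf(N, 900), 3))   # 0.105  (very common term: appears in 900 documents)

# Damping: 100 occurrences score far less than 100 times a single occurrence.
print(tf_idf(100, N, 10) < 10 * tf_idf(1, N, 10))   # True
```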

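BM25 improvements

① Term-frequency saturation: the factor tf × (k1 + 1) / (tf + k1) levels off at k1 + 1 as tf grows, so repeating a term many times brings diminishing returns
② Length normalization: long documents do not score higher simply because they are long
③ Tunable parameters: k1 controls how quickly term frequency saturates, and b controls how strongly document length is penalized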

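🧮 The BM25 scoring formula in depth

The most important relevance-scoring algorithm in information retrieval, widely used by Elasticsearch, Lucene, SQLite FTS5, and others

The full BM25 scoring function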

score(D,Q) = Σ_{qi ∈ Q} IDF(qi) × f(qi,D) × (k₁ + 1) / [ f(qi,D) + k₁ × (1 - b + b × |D| / avgdl) ]

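D: the document being scored

Q: the user's query (made up of several query terms)

qi: the i-th term in the query

f(qi, D): the number of times term qi occurs in document D (term frequency)

|D|: the length of document D (in tokens)

avgdl: the average length of all documents in the corpus

k₁ ≈ 1.2~2.0: term-frequency saturation parameter. The larger it is, the more slowly the effect of TF levels off. A typical value is 1.5

b ≈ 0.75: length normalization parameter. The larger it is, the more heavily long documents are penalized. A typical value is 0.75

IDF(qi): log[(N - df(qi) + 0.5) / (df(qi) + 0.5) + 1], where N is the total number of documents and df is the number of documents containing the term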
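Plugging numbers into the formula makes each symbol easier to place. The sketch below is a direct transcription of the scoring function and IDF definition above; the function names and the sample statistics are hypothetical.

```python
import math

def bm25_idf(total_docs, doc_freq):
    """IDF(qi) = log((N - df + 0.5) / (df + 0.5) + 1)."""
    return math.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

def bm25_term_score(tf, doc_len, avgdl, total_docs, doc_freq, k1=1.5, b=0.75):
    """Contribution of one query term to score(D, Q)."""
    length_norm = 1 - b + b * doc_len / avgdl
    return bm25_idf(total_docs, doc_freq) * tf * (k1 + 1) / (tf + k1 * length_norm)

def bm25_score(term_stats, doc_len, avgdl, total_docs, k1=1.5, b=0.75):
    """Sum over query terms; term_stats is a list of (tf, df) pairs."""
    return sum(
        bm25_term_score(tf, doc_len, avgdl, total_docs, df, k1, b)
        for tf, df in term_stats
    )

# Hypothetical numbers: N = 5 documents, a document of average length,
# one query term with tf = 2 that appears in 2 of the 5 documents.
score = bm25_term_score(tf=2, doc_len=100, avgdl=100, total_docs=5, doc_freq=2)
print(round(score, 4))   # 1.2507
```

With this IDF variant the value stays positive even when a term appears in every document, because of the `+ 1` inside the logarithm.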

📈 Building intuition for each BM25 factor
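A few numbers give a feel for the factors. This sketch evaluates only the TF part of the formula, with k1 = 1.5 and b = 0.75 as typical values; the document lengths are hypothetical.

```python
def tf_component(tf, doc_len, avgdl, k1=1.5, b=0.75):
    """The TF part of BM25: tf*(k1+1) / (tf + k1*(1 - b + b*|D|/avgdl))."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))

# Saturation: for an average-length document the value approaches k1 + 1 = 2.5.
for tf in (1, 2, 5, 10, 100):
    print(tf, round(tf_component(tf, doc_len=100, avgdl=100), 3))
# 1 1.0
# 2 1.429
# 5 1.923
# 10 2.174
# 100 2.463

# Length normalization: same tf, but a document four times longer scores lower.
short = tf_component(3, doc_len=100, avgdl=100)
long_ = tf_component(3, doc_len=400, avgdl=100)
print(round(short, 3), round(long_, 3))   # 1.667 0.952

# b = 0 switches length normalization off entirely.
print(tf_component(3, 400, 100, b=0) == tf_component(3, 100, 100, b=0))  # True
```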

🐍 Complete Python implementation

A self-contained inverted index implementation with BM25 scoring, Boolean queries, and phrase queries

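Data Structures
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Posting:
    """One entry in a posting list"""
    doc_id: int                       # document ID
    frequency: int = 0                # term frequency TF (occurrences in this document)
    positions: List[int] = field(default_factory=list)     # positions (for phrase queries)


@dataclass
class TermEntry:
    """Dictionary entry for one term"""
    term: str                         # the term itself
    df: int = 0                       # document frequency (documents containing the term)
    postings: List[Posting] = field(default_factory=list)  # posting list
    _total_docs: int = 0              # corpus size N; the owning index must keep it current
    
    @property
    def idf(self) -> float:
        # IDF = log(N / df) + 1 (the +1 keeps the value from reaching zero)
        if self.df == 0 or self._total_docs == 0:
            return 0.0
        return math.log(self._total_docs / self.df) + 1


@dataclass
class Document:
    """A document to be indexed"""
    doc_id: int
    title: str
    content: str
    
    @property
    def full_text(self) -> str:
        return f"{self.title} {self.content}"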

InvertedIndex core class: tokenization & building
import re
from collections import defaultdict
from typing import Dict, List, Tuple


class Tokenizer:
    """Tokenizer: English by word, Chinese by character, with optional stop-word filtering"""
    STOP_WORDS = {'的', '是', '了', '在', 'the', 'a', 'an', 'is', 'are', ...}
    
    def __init__(self, use_stop_words: bool = True):
        self.use_stop_words = use_stop_words
    
    def tokenize(self, text: str) -> List[Tuple[str, int]]:
        """Return [(token, position), ...]"""
        text = text.lower()
        tokens = []
        pos = 0
        pattern = re.compile(r'[a-z0-9_]+|[\u4e00-\u9fff]')
        for match in pattern.finditer(text):
            tok = match.group()
            if not (self.use_stop_words and tok in self.STOP_WORDS):
                tokens.append((tok, pos))
            pos += 1                  # stop words still consume a position
        return tokens


class InvertedIndex:
    def __init__(self, tokenizer=None):
        self.tokenizer = tokenizer or Tokenizer()
        self.index: Dict[str, TermEntry] = {}     # dictionary: term → term entry
        self.documents: Dict[int, Document] = {}   # document store
        self.doc_count: int = 0
        self.avg_doc_length: float = 0.0       # average document length (used by BM25)
        self.k1 = 1.5                          # BM25 parameters
        self.b = 0.75
    
    def add_document(self, doc: Document):
        """Add a document to the index (core build logic)"""
        self.documents[doc.doc_id] = doc
        self.doc_count += 1
        tokens = self.tokenizer.tokenize(doc.full_text)
        
        # Collect the positions of each term in this document
        temp_index = defaultdict(list)
        for token, pos in tokens:
            temp_index[token].append(pos)
        
        # Update the global inverted index
        for token, positions in temp_index.items():
            if token not in self.index:
                self.index[token] = TermEntry(term=token, df=0, postings=[])
            entry = self.index[token]
            entry.df += 1
            entry.postings.append(Posting(
                doc_id=doc.doc_id,
                frequency=len(positions),
                positions=positions,
            ))
        
        # Update the running average document length
        total = self.avg_doc_length * (self.doc_count - 1) + len(tokens)
        self.avg_doc_length = total / self.doc_count
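To see what the tokenizer produces, here is a standalone sketch of the same rule (English by word, Chinese by character). It assumes a shortened stop-word list and a module-level `tokenize` function written just for this example.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "的", "是", "了", "在"}  # shortened list
PATTERN = re.compile(r"[a-z0-9_]+|[\u4e00-\u9fff]")

def tokenize(text):
    """English by word, Chinese by character; stop words are dropped but still counted."""
    tokens = []
    for pos, match in enumerate(PATTERN.finditer(text.lower())):
        tok = match.group()
        if tok not in STOP_WORDS:
            tokens.append((tok, pos))
    return tokens

print(tokenize("Python is a language"))
# [('python', 0), ('language', 3)]
print(tokenize("B+树的索引"))
# [('b', 0), ('树', 1), ('索', 3), ('引', 4)]
```

Because stop words still consume a position, the gaps in the position numbers are preserved, which matters for the strict adjacency test in phrase queries.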

BM25 relevance search: the most important part
def search_bm25(self, query: str, top_k: int = 10) -> List[Tuple[int, float]]:
    """
    BM25 scoring search
    score(D,Q) = Σ IDF(qi) × [f(qi,D)×(k1+1)] / [f(qi,D) + k1×(1-b+b×|D|/avgdl)]
    """
    query_terms = list(set([t[0] for t in self.tokenizer.tokenize(query)]))
    scores: Dict[int, float] = defaultdict(float)
    doc_lengths: Dict[int, int] = {}     # cache: tokenize each document at most once
    
    for term in query_terms:
        if term not in self.index:
            continue
        
        entry = self.index[term]
        entry._total_docs = self.doc_count   # idf reads the current corpus size N
        # log(N/df) + 1: a simpler IDF than the log[(N-df+0.5)/(df+0.5)+1] of the formula section
        idf = entry.idf
        
        for posting in entry.postings:
            doc_id = posting.doc_id
            tf = posting.frequency
            
            # Document length in tokens
            if doc_id not in doc_lengths:
                doc_lengths[doc_id] = len(self.tokenizer.tokenize(
                    self.documents[doc_id].full_text))
            doc_len = doc_lengths[doc_id]
            
            # ★ BM25 TF normalization (term-frequency saturation + length normalization)
            numerator = tf * (self.k1 + 1)
            denominator = tf + self.k1 * (
                1 - self.b + self.b * doc_len / max(self.avg_doc_length, 1)
            )
            tf_norm = numerator / denominator
            scores[doc_id] += idf * tf_norm
    
    # Sort by score, highest first
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
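When only the top few results are needed and many documents match, selecting them with a heap avoids sorting every score. The sketch below is an optional variation on the final `sorted(...)[:top_k]` step, assuming a hypothetical `scores` dictionary; the tie-breaking rule is a choice made for this example.

```python
import heapq

def top_k(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    # Break ties by the smaller doc_id so the output is deterministic.
    return heapq.nsmallest(k, scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical scores: doc_id -> BM25 score.
scores = {1: 2.15, 2: 0.40, 3: 3.82, 4: 1.87, 5: 2.15}
print(top_k(scores, 3))   # [(3, 3.82), (1, 2.15), (5, 2.15)]
```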

Complete usage example (Demo)
# === 1. Prepare the document collection ===
documents = [
    Document(1, "搜索引擎原理",
             "搜索引擎使用倒排索引来加速文本检索..."),
    Document(2, "Python编程指南",
             "Python是一种流行的编程语言..."),
    Document(3, "数据库索引技术",
             "数据库使用B+树索引优化查询性能..."),
    Document(4, "AI Agent记忆系统",
             "AI Agent的记忆系统需要高效检索机制..."),
]

# === 2. Build the index ===
indexer = InvertedIndex(Tokenizer(use_stop_words=True))
indexer.build_index(documents)       # not part of the listings above

# === 3. Boolean search ===
results = indexer.search_boolean("倒排 索引")
# → [1, 4]

# === 4. BM25 relevance search ===
results = indexer.search_bm25("索引 查询 性能", top_k=5)
# → e.g. [(3, 3.82), (1, 2.15), ...]  (illustrative scores)
for rank, (doc_id, score) in enumerate(results, 1):
    print(f"#{rank} [{score:.4f}] 《{indexer.documents[doc_id].title}》")

# === 5. Exact phrase query ===
results = indexer.search_phrase("倒排索引")
# → [1]  (only in Doc#1 do these characters appear next to each other)

# === 6. Inspect statistics ===
print(indexer.get_statistics())      # also not part of the listings above
# {'总文档数': 4, '词典大小': 67, '总posting数': 142, ...}
# (keys: total documents, vocabulary size, total postings)
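The demo calls `build_index` and `get_statistics`, which the listings do not define. One plausible shape for them is sketched below as free functions; both are assumptions for illustration, with English key names, and `TinyIndex` is a stand-in that only mimics the attribute names of `InvertedIndex` so the example runs on its own.

```python
from types import SimpleNamespace

def build_index(index, documents):
    """Hypothetical helper: feed every document to index.add_document()."""
    for doc in documents:
        index.add_document(doc)
    return index

def get_statistics(index):
    """Hypothetical helper: summarize an index exposing doc_count, index, avg_doc_length."""
    total_postings = sum(len(entry.postings) for entry in index.index.values())
    return {
        "total_documents": index.doc_count,
        "vocabulary_size": len(index.index),
        "total_postings": total_postings,
        "avg_doc_length": index.avg_doc_length,
    }

class TinyIndex:
    """Stand-in with the same attribute names as InvertedIndex, for demonstration only."""
    def __init__(self):
        self.index = {}          # term -> SimpleNamespace(postings=[doc_id, ...])
        self.doc_count = 0
        self.avg_doc_length = 0.0

    def add_document(self, doc):
        doc_id, text = doc
        tokens = text.lower().split()
        self.doc_count += 1
        self.avg_doc_length += (len(tokens) - self.avg_doc_length) / self.doc_count
        for tok in set(tokens):
            self.index.setdefault(tok, SimpleNamespace(postings=[])).postings.append(doc_id)

idx = build_index(TinyIndex(), [(1, "inverted index"), (2, "database index tuning")])
stats = get_statistics(idx)
print(stats)
# {'total_documents': 2, 'vocabulary_size': 4, 'total_postings': 5, 'avg_doc_length': 2.5}
```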