fix(search): quote underscored terms in FTS5 query sanitization

FTS5 default tokenizer splits 'sp_new1' into tokens 'sp' and 'new1'.
Without quoting, a search for 'sp_new' becomes an AND query
('sp AND new') that fails to match rows indexed as 'sp_new1'.

Fix: add underscore to the character class in Step 5 regex
([.-] -> [._-]) so underscored terms are wrapped in double quotes.

Also adds test_sanitize_fts5_quotes_underscored_terms.
This commit is contained in:
crayfish-ai
2026-04-28 09:23:02 +08:00
committed by Teknium
parent 0169c51820
commit abefd89059
2 changed files with 26 additions and 2 deletions

View File

@@ -1355,9 +1355,9 @@ class SessionDB:
# quotes. FTS5's tokenizer splits on dots and hyphens, turning
# ``chat-send`` into ``chat AND send`` and ``P2.2`` into ``p2 AND 2``.
# Quoting preserves phrase semantics. A single pass avoids the
# double-quoting bug that would occur if dotted and hyphenated
# double-quoting bug that would occur if dotted, hyphenated and underscored
# patterns were applied sequentially (e.g. ``my-app.config``).
sanitized = re.sub(r"\b(\w+(?:[.-]\w+)+)\b", r'"\1"', sanitized)
sanitized = re.sub(r"\b(\w+(?:[._-]\w+)+)\b", r'"\1"', sanitized)
# Step 6: Restore preserved quoted phrases
for i, quoted in enumerate(_quoted_parts):