LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   How to get an /exact/ word match in SQLite3's FTS5? (https://www.linuxquestions.org/questions/programming-9/how-to-get-an-exact-word-match-in-sqlite3s-fts5-4175735764/)

Turbocapitalist 04-07-2024 05:11 AM

How to get an /exact/ word match in SQLite3's FTS5?
 
I'm looking at the FTS5 full text searching module for SQLite3 and have a question about how to limit the search to an exact string.

If I have the following quick database:

Code:

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE metadata(recno integer, term varchar(25) not null, value varchar(256) not null);
INSERT INTO metadata VALUES(1,'dc.title','a foo/bar baz b');
INSERT INTO metadata VALUES(2,'dc.title','a bar foo/baz b');
INSERT INTO metadata VALUES(3,'dc.title','a barbar foo b');
INSERT INTO metadata VALUES(4,'dc.title','a foofo bar b');
PRAGMA writable_schema=ON;
INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','text','text',0,'CREATE VIRTUAL TABLE text USING FTS5( m, tokenize="trigram case_sensitive 1")');
CREATE TABLE IF NOT EXISTS 'text_data'(id INTEGER PRIMARY KEY, block BLOB);
INSERT INTO text_data VALUES(1,X'0431');
INSERT INTO text_data VALUES(10,X'000000000101010001010101');
INSERT INTO text_data VALUES(137438953473,X'000000ee043020626101020b0102030102030102090202666f01020301020701020a01020301032f626101020701020b01036120620202020102020301660102020302020202722001020901020501020801020b03016203020502027a2001020d01020d01036261720102080102040104040501020a03017a01020c01020c0103666f2004020703016f01020401020801020b01020401036f206203020d01020802022f6201020601020a0202666f04020602026f2003020c03012f010205010209030166040205010372206201020a03020c0301660202060102090202626103020601037a206201020e01020e0411100b0b0910060a1209080f0b0a070709060b0907');
CREATE TABLE IF NOT EXISTS 'text_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
INSERT INTO text_idx VALUES(1,X'',2);
CREATE TABLE IF NOT EXISTS 'text_content'(id INTEGER PRIMARY KEY, c0);
INSERT INTO text_content VALUES(1,'a foo/bar baz b');
INSERT INTO text_content VALUES(2,'a bar foo/baz b');
INSERT INTO text_content VALUES(3,'a barbar foo b');
INSERT INTO text_content VALUES(4,'a foofo bar b');
CREATE TABLE IF NOT EXISTS 'text_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
INSERT INTO text_docsize VALUES(1,X'0d');
INSERT INTO text_docsize VALUES(2,X'0d');
INSERT INTO text_docsize VALUES(3,X'0c');
INSERT INTO text_docsize VALUES(4,X'0b');
CREATE TABLE IF NOT EXISTS 'text_config'(k PRIMARY KEY, v) WITHOUT ROWID;
INSERT INTO text_config VALUES('version',4);
PRAGMA writable_schema=OFF;
COMMIT;

It create two tables. One has four rows, the other is the FTS representation of those four rows. If I I provide the following query,

Code:

sqlite> SELECT rowid,m FROM text WHERE m MATCH ('foof');
4|a foofo bar b

the database returns partial matches instead of an empty set. The string 'foof', as a whole word, does not occur anywhere. What do I need to set so I can search for exact word matches and not have it glob either end automatially?

NevemTeve 04-07-2024 05:26 AM

Untested:
Code:

SELECT rowid,m FROM text WHERE m REGEX '\bfoof\b';

Turbocapitalist 04-07-2024 06:02 AM

Quote:

Originally Posted by NevemTeve (Post 6494608)
Untested:
Code:

SELECT rowid,m FROM text WHERE m REGEX '\bfoof\b';

Thanks. SQLite is much more limited. Neither the \b nor the REGEX are allowed.

It's pretty useful but a subset of PostgreSQL or MariaDB.

Anyway, the puzzle remains.

https://www.sqlite.org/fts5.html

ntubski 04-07-2024 07:31 AM

Quote:

Originally Posted by Turbocapitalist (Post 6494606)
Code:

INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','text','text',0,'CREATE VIRTUAL TABLE text USING FTS5( m, tokenize="trigram case_sensitive 1")');

https://www.sqlite.org/fts5.html#the_trigram_tokenizer:
Quote:

When using the trigram tokenizer, a query or phrase token may match any sequence of characters within a row, not just a complete token.
So don't use the trigram tokenizer if you don't want partial matches.

Turbocapitalist 04-07-2024 08:04 AM

Quote:

Originally Posted by ntubski (Post 6494621)
So don't use the trigram tokenizer if you don't want partial matches.

Thanks. Cloudflare is blocking my new SQL example. I'll keep trying. However, I can say that without the trigram tokenizer, it now looks for the whole sequence, if all letters, but still works with a fragment rather than requiring a whole sequence if non-letters are used (or at least I think that is the case):

What I'm aiming to do is find a way to search for exact URLs which have been mixed into a lot of text. It is thus important to differentiate between the following:

http://localhost/one/two/
http://localhost/one/three/

such that if I search for a non-existing URL such as "http:/localhost/one/" it will return an empty set, but if I search for "http:/localhost/one/three/" then it will return one record.

Again, I'll keep fighting Cloudflare here to see if I can get a new example through.

Turbocapitalist 04-07-2024 08:11 AM

Here is the try without the trigram tokenizer. I have added a fifth record.

Code:

sqlite> SELECT rowid,m FROM text WHERE m MATCH '"foo/ba"';
sqlite> SELECT rowid,m FROM text WHERE m MATCH '"foo/bar"';
1|a foo/bar baz b
5|b foo/bar/bar baz b

The first query above returns an empty set, as expected. However, the second search should also return an empty set, but doesn't.

What I'm aiming to do is find a way to search for exact URLs which have been mixed into a lot of text. It is thus important to differentiate between the following:

http://localhost/one/two/
http://localhost/one/three/

such that if I search for a non-existing URL such as "http:/localhost/one/" it will return an empty set, but if I search for "http:/localhost/one/three/" then it will return one record.

Code:

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE metadata(recno integer, term varchar(25) not null, value varchar(256) not null);
INSERT INTO metadata VALUES(1,'dc.title','a foo/bar baz b');
INSERT INTO metadata VALUES(2,'dc.title','a bar foo/baz b');
INSERT INTO metadata VALUES(3,'dc.title','a barbar foo b');
INSERT INTO metadata VALUES(4,'dc.title','a foofo bar b');
PRAGMA writable_schema=ON;
INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','text','text',0,'CREATE VIRTUAL TABLE text USING FTS5( m )');
CREATE TABLE IF NOT EXISTS 'text_data'(id INTEGER PRIMARY KEY, block BLOB);
INSERT INTO text_data VALUES(1,X'0518');
INSERT INTO text_data VALUES(10,X'000000000102020002010101020101');
INSERT INTO text_data VALUES(137438953473,X'0000005502306101020201020201020201020201016201020601020601020501020502026172010204010203020204040362617203020303017a0102050102050103666f6f0102030102040102040402666f040203040f0f0d08090e');
INSERT INTO text_data VALUES(274877906945,X'0000002102306205040207020261720504040303017a0502060103666f6f05020304070806');
CREATE TABLE IF NOT EXISTS 'text_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
INSERT INTO text_idx VALUES(1,X'',2);
INSERT INTO text_idx VALUES(2,X'',2);
CREATE TABLE IF NOT EXISTS 'text_content'(id INTEGER PRIMARY KEY, c0);
INSERT INTO text_content VALUES(1,'a foo/bar baz b');
INSERT INTO text_content VALUES(2,'a bar foo/baz b');
INSERT INTO text_content VALUES(3,'a barbar foo b');
INSERT INTO text_content VALUES(4,'a foofo bar b');
INSERT INTO text_content VALUES(5,'b foo/bar/bar baz b');
CREATE TABLE IF NOT EXISTS 'text_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
INSERT INTO text_docsize VALUES(1,X'05');
INSERT INTO text_docsize VALUES(2,X'05');
INSERT INTO text_docsize VALUES(3,X'04');
INSERT INTO text_docsize VALUES(4,X'04');
INSERT INTO text_docsize VALUES(5,X'06');
CREATE TABLE IF NOT EXISTS 'text_config'(k PRIMARY KEY, v) WITHOUT ROWID;
INSERT INTO text_config VALUES('version',4);
PRAGMA writable_schema=OFF;
COMMIT;


scasey 04-07-2024 08:26 AM

Literal spaces doesn’t work?
Code:

‘“ foof “‘
‘“foo/bar “‘


Turbocapitalist 04-07-2024 08:32 AM

Nope. Spaces do not work:

Code:

sqlite> SELECT rowid,m FROM text WHERE m MATCH '"foo/bar "';
1|a foo/bar baz b
5|b foo/bar/bar baz b

Nor do double quotes, which would be relevant with the URLs.

Code:

sqlite> SELECT rowid,m FROM text WHERE m MATCH '"foo/bar\""';
Runtime error: unterminated string


ntubski 04-07-2024 09:19 AM

I think you want something along the lines of

Code:

CREATE VIRTUAL TABLE text USING fts5(a, b,
    tokenize = "ascii tokenchars '/:%'"
);

(I'm not sure how to escape the quotes so they will be passed correctly to the INSERT INTO sqlite_schema... construct you are using, so I took the example from the docs as the base instead.)

Turbocapitalist 04-07-2024 09:26 AM

Quote:

Originally Posted by ntubski (Post 6494649)
I think you want something along the lines of

Code:

CREATE VIRTUAL TABLE text USING fts5(a, b,
    tokenize = "ascii tokenchars '/:%'"
);

(I'm not sure how to escape the quotes so they will be passed correctly to the INSERT INTO sqlite_schema... construct you are using, so I took the example from the docs as the base instead.)

Yes! Thank you! (And thanks everyone else!) Here is the final formula:

Code:

CREATE VIRTUAL TABLE text USING FTS5( m, tokenize = "unicode61 tokenchars '/:%'" );
I had misread the tokenchars option multiple times.

NevemTeve 04-07-2024 02:18 PM

@OP google says you should install component `sqlite3-pcre`

Turbocapitalist 04-07-2024 11:50 PM

Quote:

Originally Posted by NevemTeve (Post 6494704)
@OP google says you should install component `sqlite3-pcre`

I'll take a look.

http://git.altlinux.org/people/at/pa...lite3-pcre.git

Perl-compatible regular expression support for the SQLite is not necessarily the kind of code which needs frequent updates, but on the other hand not having been updated in over 15 years raises questions.


All times are GMT -5. The time now is 06:18 AM.