LinuxQuestions.org
Share your knowledge at the LQ Wiki.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 04-07-2024, 05:11 AM   #1
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,312
Blog Entries: 3

Rep: Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722
How to get an /exact/ word match in SQLite3's FTS5?


I'm looking at the FTS5 full text searching module for SQLite3 and have a question about how to limit the search to an exact string.

If I have the following quick database:

Code:
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE metadata(recno integer, term varchar(25) not null, value varchar(256) not null);
INSERT INTO metadata VALUES(1,'dc.title','a foo/bar baz b');
INSERT INTO metadata VALUES(2,'dc.title','a bar foo/baz b');
INSERT INTO metadata VALUES(3,'dc.title','a barbar foo b');
INSERT INTO metadata VALUES(4,'dc.title','a foofo bar b');
PRAGMA writable_schema=ON;
INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','text','text',0,'CREATE VIRTUAL TABLE text USING FTS5( m, tokenize="trigram case_sensitive 1")');
CREATE TABLE IF NOT EXISTS 'text_data'(id INTEGER PRIMARY KEY, block BLOB);
INSERT INTO text_data VALUES(1,X'0431');
INSERT INTO text_data VALUES(10,X'000000000101010001010101');
INSERT INTO text_data VALUES(137438953473,X'000000ee043020626101020b0102030102030102090202666f01020301020701020a01020301032f626101020701020b01036120620202020102020301660102020302020202722001020901020501020801020b03016203020502027a2001020d01020d01036261720102080102040104040501020a03017a01020c01020c0103666f2004020703016f01020401020801020b01020401036f206203020d01020802022f6201020601020a0202666f04020602026f2003020c03012f010205010209030166040205010372206201020a03020c0301660202060102090202626103020601037a206201020e01020e0411100b0b0910060a1209080f0b0a070709060b0907');
CREATE TABLE IF NOT EXISTS 'text_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
INSERT INTO text_idx VALUES(1,X'',2);
CREATE TABLE IF NOT EXISTS 'text_content'(id INTEGER PRIMARY KEY, c0);
INSERT INTO text_content VALUES(1,'a foo/bar baz b');
INSERT INTO text_content VALUES(2,'a bar foo/baz b');
INSERT INTO text_content VALUES(3,'a barbar foo b');
INSERT INTO text_content VALUES(4,'a foofo bar b');
CREATE TABLE IF NOT EXISTS 'text_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
INSERT INTO text_docsize VALUES(1,X'0d');
INSERT INTO text_docsize VALUES(2,X'0d');
INSERT INTO text_docsize VALUES(3,X'0c');
INSERT INTO text_docsize VALUES(4,X'0b');
CREATE TABLE IF NOT EXISTS 'text_config'(k PRIMARY KEY, v) WITHOUT ROWID;
INSERT INTO text_config VALUES('version',4);
PRAGMA writable_schema=OFF;
COMMIT;
It create two tables. One has four rows, the other is the FTS representation of those four rows. If I I provide the following query,

Code:
sqlite> SELECT rowid,m FROM text WHERE m MATCH ('foof');
4|a foofo bar b
the database returns partial matches instead of an empty set. The string 'foof', as a whole word, does not occur anywhere. What do I need to set so I can search for exact word matches and not have it glob either end automatially?
 
Old 04-07-2024, 05:26 AM   #2
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,866
Blog Entries: 1

Rep: Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869
Untested:
Code:
SELECT rowid,m FROM text WHERE m REGEX '\bfoof\b';
 
Old 04-07-2024, 06:02 AM   #3
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,312

Original Poster
Blog Entries: 3

Rep: Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722
Quote:
Originally Posted by NevemTeve View Post
Untested:
Code:
SELECT rowid,m FROM text WHERE m REGEX '\bfoof\b';
Thanks. SQLite is much more limited. Neither the \b nor the REGEX are allowed.

It's pretty useful but a subset of PostgreSQL or MariaDB.

Anyway, the puzzle remains.

https://www.sqlite.org/fts5.html
 
Old 04-07-2024, 07:31 AM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,781

Rep: Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082
Quote:
Originally Posted by Turbocapitalist View Post
Code:
INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','text','text',0,'CREATE VIRTUAL TABLE text USING FTS5( m, tokenize="trigram case_sensitive 1")');
https://www.sqlite.org/fts5.html#the_trigram_tokenizer:
Quote:
When using the trigram tokenizer, a query or phrase token may match any sequence of characters within a row, not just a complete token.
So don't use the trigram tokenizer if you don't want partial matches.
 
Old 04-07-2024, 08:04 AM   #5
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,312

Original Poster
Blog Entries: 3

Rep: Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722
Quote:
Originally Posted by ntubski View Post
So don't use the trigram tokenizer if you don't want partial matches.
Thanks. Cloudflare is blocking my new SQL example. I'll keep trying. However, I can say that without the trigram tokenizer, it now looks for the whole sequence, if all letters, but still works with a fragment rather than requiring a whole sequence if non-letters are used (or at least I think that is the case):

What I'm aiming to do is find a way to search for exact URLs which have been mixed into a lot of text. It is thus important to differentiate between the following:

http://localhost/one/two/
http://localhost/one/three/

such that if I search for a non-existing URL such as "http:/localhost/one/" it will return an empty set, but if I search for "http:/localhost/one/three/" then it will return one record.

Again, I'll keep fighting Cloudflare here to see if I can get a new example through.
 
Old 04-07-2024, 08:11 AM   #6
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,312

Original Poster
Blog Entries: 3

Rep: Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722
Here is the try without the trigram tokenizer. I have added a fifth record.

Code:
sqlite> SELECT rowid,m FROM text WHERE m MATCH '"foo/ba"';
sqlite> SELECT rowid,m FROM text WHERE m MATCH '"foo/bar"';
1|a foo/bar baz b
5|b foo/bar/bar baz b
The first query above returns an empty set, as expected. However, the second search should also return an empty set, but doesn't.

What I'm aiming to do is find a way to search for exact URLs which have been mixed into a lot of text. It is thus important to differentiate between the following:

http://localhost/one/two/
http://localhost/one/three/

such that if I search for a non-existing URL such as "http:/localhost/one/" it will return an empty set, but if I search for "http:/localhost/one/three/" then it will return one record.

Code:
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE metadata(recno integer, term varchar(25) not null, value varchar(256) not null);
INSERT INTO metadata VALUES(1,'dc.title','a foo/bar baz b');
INSERT INTO metadata VALUES(2,'dc.title','a bar foo/baz b');
INSERT INTO metadata VALUES(3,'dc.title','a barbar foo b');
INSERT INTO metadata VALUES(4,'dc.title','a foofo bar b');
PRAGMA writable_schema=ON;
INSERT INTO sqlite_schema(type,name,tbl_name,rootpage,sql)VALUES('table','text','text',0,'CREATE VIRTUAL TABLE text USING FTS5( m )');
CREATE TABLE IF NOT EXISTS 'text_data'(id INTEGER PRIMARY KEY, block BLOB);
INSERT INTO text_data VALUES(1,X'0518');
INSERT INTO text_data VALUES(10,X'000000000102020002010101020101');
INSERT INTO text_data VALUES(137438953473,X'0000005502306101020201020201020201020201016201020601020601020501020502026172010204010203020204040362617203020303017a0102050102050103666f6f0102030102040102040402666f040203040f0f0d08090e');
INSERT INTO text_data VALUES(274877906945,X'0000002102306205040207020261720504040303017a0502060103666f6f05020304070806');
CREATE TABLE IF NOT EXISTS 'text_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
INSERT INTO text_idx VALUES(1,X'',2);
INSERT INTO text_idx VALUES(2,X'',2);
CREATE TABLE IF NOT EXISTS 'text_content'(id INTEGER PRIMARY KEY, c0);
INSERT INTO text_content VALUES(1,'a foo/bar baz b');
INSERT INTO text_content VALUES(2,'a bar foo/baz b');
INSERT INTO text_content VALUES(3,'a barbar foo b');
INSERT INTO text_content VALUES(4,'a foofo bar b');
INSERT INTO text_content VALUES(5,'b foo/bar/bar baz b');
CREATE TABLE IF NOT EXISTS 'text_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
INSERT INTO text_docsize VALUES(1,X'05');
INSERT INTO text_docsize VALUES(2,X'05');
INSERT INTO text_docsize VALUES(3,X'04');
INSERT INTO text_docsize VALUES(4,X'04');
INSERT INTO text_docsize VALUES(5,X'06');
CREATE TABLE IF NOT EXISTS 'text_config'(k PRIMARY KEY, v) WITHOUT ROWID;
INSERT INTO text_config VALUES('version',4);
PRAGMA writable_schema=OFF;
COMMIT;
 
Old 04-07-2024, 08:26 AM   #7
scasey
LQ Veteran
 
Registered: Feb 2013
Location: Tucson, AZ, USA
Distribution: CentOS 7.9.2009
Posts: 5,727

Rep: Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211Reputation: 2211
Literal spaces doesn’t work?
Code:
‘“ foof “‘
‘“foo/bar “‘
 
Old 04-07-2024, 08:32 AM   #8
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,312

Original Poster
Blog Entries: 3

Rep: Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722
Nope. Spaces do not work:

Code:
sqlite> SELECT rowid,m FROM text WHERE m MATCH '"foo/bar "';
1|a foo/bar baz b
5|b foo/bar/bar baz b
Nor do double quotes, which would be relevant with the URLs.

Code:
sqlite> SELECT rowid,m FROM text WHERE m MATCH '"foo/bar\""';
Runtime error: unterminated string
 
Old 04-07-2024, 09:19 AM   #9
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,781

Rep: Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082Reputation: 2082
I think you want something along the lines of

Code:
CREATE VIRTUAL TABLE text USING fts5(a, b,
    tokenize = "ascii tokenchars '/:%'"
);
(I'm not sure how to escape the quotes so they will be passed correctly to the INSERT INTO sqlite_schema... construct you are using, so I took the example from the docs as the base instead.)
 
Old 04-07-2024, 09:26 AM   #10
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,312

Original Poster
Blog Entries: 3

Rep: Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722
Quote:
Originally Posted by ntubski View Post
I think you want something along the lines of

Code:
CREATE VIRTUAL TABLE text USING fts5(a, b,
    tokenize = "ascii tokenchars '/:%'"
);
(I'm not sure how to escape the quotes so they will be passed correctly to the INSERT INTO sqlite_schema... construct you are using, so I took the example from the docs as the base instead.)
Yes! Thank you! (And thanks everyone else!) Here is the final formula:

Code:
CREATE VIRTUAL TABLE text USING FTS5( m, tokenize = "unicode61 tokenchars '/:%'" );
I had misread the tokenchars option multiple times.
 
Old 04-07-2024, 02:18 PM   #11
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,866
Blog Entries: 1

Rep: Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869
@OP google says you should install component `sqlite3-pcre`
 
1 members found this post helpful.
Old 04-07-2024, 11:50 PM   #12
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,312

Original Poster
Blog Entries: 3

Rep: Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722Reputation: 3722
Quote:
Originally Posted by NevemTeve View Post
@OP google says you should install component `sqlite3-pcre`
I'll take a look.

http://git.altlinux.org/people/at/pa...lite3-pcre.git

Perl-compatible regular expression support for the SQLite is not necessarily the kind of code which needs frequent updates, but on the other hand not having been updated in over 15 years raises questions.
 
  


Reply

Tags
fts5, sqlite3



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
regex for phrase like'word-word-word' Zero4 Linux - General 9 07-06-2019 06:36 AM
How to capture 1000 lines before a string match and 1000 line a string match including line of string match ? sysmicuser Linux - Newbie 12 11-14-2017 05:21 AM
grep exact word - to find file which start with some word publicLearner Linux - Newbie 8 07-23-2015 03:12 AM
[SOLVED] Match datetime by the minute (not an exact match by the second) [mysql] hattori.hanzo Programming 1 10-21-2010 05:43 PM
how to find an exact substring match? ldp Programming 7 02-22-2005 06:28 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 12:54 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration