How to implement full text search on complex nested JSONB in Postgresql -
i have pretty complex jsonb stored in 1 jsonb
column.
db table looks like:
create table sites ( id text not null, doc jsonb, primary key (id) )
data storing in doc
column complex nested jsonb
data:
{ "_id": "123", "type": "site", "identification": "custom id", "title": "site 1", "address": "uk, london, mr tom's street, 2", "buildings": [ { "uuid": "12312", "identification": "custom id", "name": "building 1", "deposits": [ { "uuid": "12312", "identification": "custom id", "audits": [ { "uuid": "12312", "sample_id": "sample id" } ] } ] } ] }
so structure of jsonb
looks like:
site -> array of buildings -> array of deposits -> array of audits
we need implement full text search values in each of type of entry:
site (identification, title, address) building (identification, name) deposit (identification) audit (sample_id)
sql query should run full text search in these field values only.
i guess need use gin
indexes , tsvector
, not have enough postgresql background.
so, question possible index , query such nested jsonb
structures?
let's add new column of tsvector
type:
alter table sites add column tsvector tsvector;
now let's create trigger collect lexems, organize them , put our tsvector. use 4 groups (a, b, c, d) -- special tsvector's feature allows distinguish lexems later, @ search time (see examples in manual https://www.postgresql.org/docs/current/static/textsearch-controls.html; unfortunately, feature support 4 groups becase developers reserved 2 bits that, lucky here, need 4 groups):
create or replace function t_sites_tsvector() returns trigger $$ declare dic regconfig; part_a text; part_b text; part_c text; part_d text; begin dic := 'simple'; -- change if need more advanced word processing (stemming, etc) part_a := coalesce(new.doc->>'identification', '') || ' ' || coalesce(new.doc->>'title', '') || ' ' || coalesce(new.doc->>'address', ''); select part_b string_agg(coalesce(a, ''), ' ') || ' ' || string_agg(coalesce(b, ''), ' ') ( select jsonb_array_elements((new.doc->'buildings'))->>'identification', jsonb_array_elements((new.doc->'buildings'))->>'name' ) _(a, b); select part_c string_agg(coalesce(c, ''), ' ') ( select jsonb_array_elements(b)->>'identification' ( select jsonb_array_elements((new.doc->'buildings'))->'deposits' ) _(b) ) __(c); select part_d string_agg(coalesce(d, ''), ' ') ( select jsonb_array_elements(c)->>'sample_id' ( select jsonb_array_elements(b)->'audits' ( select jsonb_array_elements((new.doc->'buildings'))->'deposits' ) _(b) ) __(c) ) ___(d); new.tsvector := setweight(to_tsvector(dic, part_a), 'a') || setweight(to_tsvector(dic, part_b), 'b') || setweight(to_tsvector(dic, part_c), 'c') || setweight(to_tsvector(dic, part_d), 'd') ; return new; end; $$ language plpgsql immutable; create trigger t_sites_tsvector before insert or update on sites each row execute procedure t_sites_tsvector();
^^ -- scroll it, snippet bigger looks (especially of have macos w/o scrollbars...)
now let's create gin index speedup search queries (makes sense if have many rows -- say, more hundreds or thousands):
create index i_sites_fulltext on sites using gin(tsvector);
and insert check:
insert sites select 1, '{ "_id": "123", "type": "site", "identification": "custom id", "title": "site 1", "address": "uk, london, mr tom''s street, 2", "buildings": [ { "uuid": "12312", "identification": "custom id", "name": "building 1", "deposits": [ { "uuid": "12312", "identification": "custom id", "audits": [ { "uuid": "12312", "sample_id": "sample id" } ] } ] } ] }'::jsonb;
check select * sites;
– must see tsvector
column filled data.
now let's query it:
select * sites tsvector @@ to_tsquery('simple', 'sample');
-- must return our record. in case, search 'sample'
word , don't care in group found.
let's change , try search in group ("site (identification, title, address)" described it):
select * sites tsvector @@ to_tsquery('simple', 'sample:a');
-- must return nothing because word 'sample'
sits in group d ("audit (sample_id)"). indeed:
select * sites tsvector @@ to_tsquery('simple', 'sample:d');
-- again return our record.
notice, need use to_tsquery(..)
, not plainto_tsquery(..)
able address 4 groups. need sanitize input (avoid using or remove special characters &
, |
because have special meaning in tsquery
values).
and news can combine different groups in single query, this:
select * sites tsvector @@ to_tsquery('simple', 'sample:d & london:a');
the other way go (e.g. if have work more 4 groups) having multiple tsvectors, each 1 sitting in separate column, build them using single query, create index (you can create single index on multiple tsvector
columns) , query addressing separate columns. it's similar explained above, perhaps less efficient.
hope helps.
Comments
Post a Comment