How to implement full text search on complex nested JSONB in Postgresql -


i have pretty complex jsonb stored in 1 jsonb column.

db table looks like:

 create table sites (    id text not null,    doc jsonb,    primary key (id)  ) 

data storing in doc column complex nested jsonb data:

   {       "_id": "123",       "type": "site",       "identification": "custom id",       "title": "site 1",       "address": "uk, london, mr tom's street, 2",       "buildings": [           {                "uuid": "12312",                "identification": "custom id",                "name": "building 1",                "deposits": [                    {                       "uuid": "12312",                       "identification": "custom id",                                    "audits": [                           {                              "uuid": "12312",                                        "sample_id": "sample id"                                           }                        ]                    }                ]           }         ]     } 

so structure of jsonb looks like:

site    -> array of buildings      -> array of deposits        -> array of audits 

we need implement full text search values in each of type of entry:

site (identification, title, address) building (identification, name) deposit (identification) audit (sample_id) 

sql query should run full text search in these field values only.

i guess need use gin indexes , tsvector, not have enough postgresql background.

so, question possible index , query such nested jsonb structures?

let's add new column of tsvector type:

alter table sites add column tsvector tsvector; 

now let's create trigger collect lexems, organize them , put our tsvector. use 4 groups (a, b, c, d) -- special tsvector's feature allows distinguish lexems later, @ search time (see examples in manual https://www.postgresql.org/docs/current/static/textsearch-controls.html; unfortunately, feature support 4 groups becase developers reserved 2 bits that, lucky here, need 4 groups):

create or replace function t_sites_tsvector() returns trigger $$ declare   dic regconfig;   part_a text;   part_b text;   part_c text;   part_d text; begin   dic := 'simple'; -- change if need more advanced word processing (stemming, etc)    part_a := coalesce(new.doc->>'identification', '') || ' ' || coalesce(new.doc->>'title', '') || ' ' || coalesce(new.doc->>'address', '');    select part_b string_agg(coalesce(a, ''), ' ') || ' ' || string_agg(coalesce(b, ''), ' ')   (     select        jsonb_array_elements((new.doc->'buildings'))->>'identification',       jsonb_array_elements((new.doc->'buildings'))->>'name'   ) _(a, b);    select part_c string_agg(coalesce(c, ''), ' ')   (     select jsonb_array_elements(b)->>'identification' (       select jsonb_array_elements((new.doc->'buildings'))->'deposits'     ) _(b)   ) __(c);    select part_d string_agg(coalesce(d, ''), ' ')   (     select jsonb_array_elements(c)->>'sample_id'     (       select jsonb_array_elements(b)->'audits' (         select jsonb_array_elements((new.doc->'buildings'))->'deposits'       ) _(b)     ) __(c)   ) ___(d);    new.tsvector := setweight(to_tsvector(dic, part_a), 'a')     || setweight(to_tsvector(dic, part_b), 'b')     || setweight(to_tsvector(dic, part_c), 'c')     || setweight(to_tsvector(dic, part_d), 'd')   ;   return new; end; $$ language plpgsql immutable;  create trigger t_sites_tsvector   before insert or update on sites each row execute procedure t_sites_tsvector(); 

^^ -- scroll it, snippet bigger looks (especially of have macos w/o scrollbars...)

now let's create gin index speedup search queries (makes sense if have many rows -- say, more hundreds or thousands):

create index i_sites_fulltext on sites using gin(tsvector); 

and insert check:

insert sites select 1, '{       "_id": "123",       "type": "site",       "identification": "custom id",       "title": "site 1",       "address": "uk, london, mr tom''s street, 2",       "buildings": [           {                "uuid": "12312",                "identification": "custom id",                "name": "building 1",                "deposits": [                    {                       "uuid": "12312",                       "identification": "custom id",                       "audits": [                           {                              "uuid": "12312",                               "sample_id": "sample id"                           }                        ]                    }                ]           }        ]     }'::jsonb; 

check select * sites; – must see tsvector column filled data.

now let's query it:

select * sites tsvector @@ to_tsquery('simple', 'sample'); 

-- must return our record. in case, search 'sample' word , don't care in group found.

let's change , try search in group ("site (identification, title, address)" described it):

select * sites tsvector @@ to_tsquery('simple', 'sample:a'); 

-- must return nothing because word 'sample' sits in group d ("audit (sample_id)"). indeed:

select * sites tsvector @@ to_tsquery('simple', 'sample:d'); 

-- again return our record.

notice, need use to_tsquery(..), not plainto_tsquery(..) able address 4 groups. need sanitize input (avoid using or remove special characters & , | because have special meaning in tsquery values).

and news can combine different groups in single query, this:

select * sites tsvector @@ to_tsquery('simple', 'sample:d & london:a'); 

the other way go (e.g. if have work more 4 groups) having multiple tsvectors, each 1 sitting in separate column, build them using single query, create index (you can create single index on multiple tsvector columns) , query addressing separate columns. it's similar explained above, perhaps less efficient.

hope helps.


Comments

Popular posts from this blog

android - InAppBilling registering BroadcastReceiver in AndroidManifest -

python Tkinter Capturing keyboard events save as one single string -

sql server - Why does Linq-to-SQL add unnecessary COUNT()? -