New in Oracle 12c and one of the big new features in XMLDB is the XQuery Full Text functionality and, as mentioned in the post about XQuery Update, is the official W3C standard to handle unstructured pieces of XML content.
The XQuery Full Text and XQuery Full Text Index extends the XQuery XMLDB language to efficiently and XML aware handle unstructured content in XML. So finally in Oracle 12c, we are able to handle “structured”, “semi-structured” and “unstructured” content, XML aware and based on standards, across the board with regards to XML content.
XQuery Full Text FTSelection/Match
The following “Full Text FTSelection Operators” are supported:
- FTAnd (ftand)
- FTMildNot (not in)
- FTOr (ftor)
- FTOrder (ordered)
- FTUnaryNot (ftnot)
- FTWindow (window)
- FTWords
The following “XQuery Full Text Match Options” are supported:
- FTStemOption (stemming, no stemming)
The default behavior specified in the XQuery and XPath Full Text 1.0 Recommendation is used for each unsupported match option, with the following exceptions:
- FTLanguage (unsupported) – The language used is the language defined by the default lexer, which means the language that was used when the database was installed.
- FTStopWordOption (unsupported) – The stoplist used is the stoplist defined for that language.
For instance, the following XQuery Full Text will search for “Wikipedia Content” in a table with Wikipedia stored XML content, within a window of 6 words, for matches of “oracle” and “12c” via the ftand operator (Full Text AND)
.
SELECT xt1.PAGE_TEXT as "Wikipedia Content"
FROM BINARYXML_RANGE_PART_XSD t1,
XMLTABLE(xmlnamespaces(default 'http://www.mediawiki.org/xml/export-0.8/' ),
'$P/page/revision/text'
PASSING t1.content as "P"
COLUMNS
PAGE_TEXT varchar2(4000) PATH '.'
) xt1
WHERE XMLExists('xquery version "1.0"; (: :)
declare default element namespace "http://www.mediawiki.org/xml/export-0.8/"; (: :)
$P/page/revision/text[. contains text
{$PHRASE1} ftand {$PHRASE2}
using stemming window 6 words]'
PASSING
t1.content as "P",
'oracle' as "PHRASE1",
'12c' as "PHRASE2"
)
/
During tests it proved very fast on English Wikipedia content (10++ Gb) and delivered the results within less than a second. But such a statement will only be picked up very efficiently if the new, introduced in 12c, corresponding Oracle XQuery Full-Text Index has been created.
As seen in the manual this can be done very easily via, for example:
Creating an XML Full-Text Index
Index preference BASIC_STORAGE specifies the tablespace and creation parameters for the database tables and indexes that constitute an Oracle Text index.
BEGIN CTX_DDL.create_section_group('mysecgroup', 'PATH_SECTION_GROUP'); CTX_DDL.set_sec_grp_attr('mysecgroup', 'XML_ENABLE', 'T'); CTX_DDL.create_preference('mypref', 'BASIC_STORAGE'); CTX_DDL.set_attribute('mypref', 'D_TABLE_CLAUSE', 'TABLESPACE my_ts LOB(DOC) STORE AS SECUREFILE (TABLESPACE my_ts COMPRESS MEDIUM CACHE)'); CTX_DDL.set_attribute('mypref', 'I_TABLE_CLAUSE', 'TABLESPACE my_ts LOB(TOKEN_INFO) STORE AS SECUREFILE (TABLESPACE my_ts NOCOMPRESS CACHE)'); END; / CREATE INDEX po_ctx_idx ON po_binxml(OBJECT_VALUE) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS('storage mypref section group mysecgroup');
So on my Wikipedia table this looked like:
DROP INDEX FT_RANGE_PART_IDX FORCE;
begin
CTX_DDL.DROP_SECTION_GROUP('FTRI');
end;
/
begin
CTX_DDL.DROP_PREFERENCE('STORAGE_PREFS');
end;
/
begin
CTX_DDL.CREATE_SECTION_GROUP('FTRI','PATH_SECTION_GROUP');
CTX_DDL.SET_SEC_GRP_ATTR('FTRI','XML_ENABLE','T');
CTX_DDL.create_preference('STORAGE_PREFS', 'BASIC_STORAGE');
CTX_DDL.set_attribute('STORAGE_PREFS',
'D_TABLE_CLAUSE',
'LOB(DOC) STORE AS SECUREFILE (COMPRESS MEDIUM CACHE)');
CTX_DDL.set_attribute('STORAGE_PREFS',
'I_TABLE_CLAUSE',
'LOB(TOKEN_INFO) STORE AS SECUREFILE (NOCOMPRESS CACHE)'
);
END;
/
CREATE INDEX FT_RANGE_PART_IDX
ON BINARYXML_RANGE_PART_XSD(CONTENT)
INDEXTYPE IS CTXSYS.CONTEXT
LOCAL
PARAMETERS('storage STORAGE_PREFS
section group FTRI')
PARALLEL;
More information
More information can be found in the Oracle 12c XMLDB Developers Guide, or via the following introduction / presentation:
…and/or its accompanied pdf document, describing the syntax used for one of the storage methods, called “Creating Structure in Unstructured Data, DML DDL used” (pdf 559 Kb)
.