I wondered, for a long time, if it were possible to shred automatically XML documents, via Binary XML XMLType storage, the same way as you are able with Object Relational XMLType storage since the Oracle 9.2.0.3.0 version (the first officially supported XMLDB database version).
Based on “standard” object relational storage you need the following requirements and/or take the following steps:
- Annotate an XML Schema with the XDB namespace and xdb:annotations to your liking
- Register this XML Schema in the XDB Repository
- During registration automatically generate the needed types and tables (based on your xdb:annotations)
- An XML document(s) that validates properly, is valid, against your registered XML Schema.
If done properly, drag & dropping your XML documents via for example WebDAV via the enable protocol server, will cause the XDB functionality to check and validate the given XML documents against the registered XML Schema and insert the content automatically in the generated table(s). This mechanism also works via the FTP protocol functionality off the XDB Protocol Server.
I never got it really to work, at least the mechanism was fuzzy to me. Sometimes it did work as expected, sometimes it didn’t. What I didn’t realize, is that I most of the time use XML documents with multiple namespace references. Until one of my students during an XMLDB training, hit a small bug (8473369), regarding using duplicate namespaces in the “wrong” order. XQuery statement don’t have the problem, XML/SQL does, so I reported it via an Service Request. This made me wonder… Would I have the same issue with the namespace references in the registered XML Schema and the XML document. I still have to check the W3C Official XML Namespace Recommendation regarding this issue though…
You might now (or not) that most of the time I am playing around, testing situations based on the MediaWiki XML dumpfile, mine is an “old one” (English) that is 21 Gb’s in size and that can be downloaded, or you can get the needed information, via the following URL’s (still thanks to an old post once read on Marcelo Ochoa blog):
- http://download.wikimedia.org/
- http://meta.wikimedia.org/wiki/Data_dumps#Format
- http://download.wikimedia.org/backup-index.html
- http://www.mediawiki.org/xml/export-0.3.xsd
The following is based on this set off XML data (English XML data set) and XML Schema.
Shredding MediaWiki XML Data
First we have to annotate the XML Schema (export-0.3.xsd) with xdb:annotations (see the XMLDB Developers Guide for more information about xdb:annotations) and add the needed XDB namespace reference (xmlns:xdb=”http://xmlns.oracle.com/xdb“) that defines these annotations.
Due to the fact that I want to use an Binary XML XMLType table for storage (an Oracle 11g feature), I have to use some small alterations regarding the default “old” Object Relational storage method. It also might be that you need to add some extra namespace reference in the XML Schema or apply a one off patch for bug 6768776, as described here on the Oracle OTN Forum: Binary schema registration and import.
My XML Schema then looks like:
<?xml version=”1.0″ encoding=”UTF-8″ ?>
<schema xmlns=”http://www.w3.org/2001/XMLSchema”
xmlns:mw=”http://www.mediawiki.org/xml/export-0.3/”
targetNamespace=”http://www.mediawiki.org/xml/export-0.3/”
xmlns:xdb=”http://xmlns.oracle.com/xdb”
elementFormDefault=”qualified”><annotation>
<documentation xml:lang=”en”>
MediaWiki’s page export format
</documentation>
</annotation><!– Need this to reference xml:lang –>
<import namespace=” http://www.w3.org/XML/1998/namespace”
schemaLocation=”http://www.w3.org/2001/xml.xsd”/><!– Our root element –>
<element name=”mediawiki” type=”mw:MediaWikiType”
xdb:defaultTable=”MEDIAWIKI”/><element name=”page” type=”mw:PageType”/>
<complexType name=”MediaWikiType”>
<sequence>
<element name=”siteinfo” type=”mw:SiteInfoType”
minOccurs=”0″ maxOccurs=”1″/>
<element name=”page” type=”mw:PageType”
minOccurs=”0″ maxOccurs=”unbounded” />
</sequence>
<attribute name=”version” type=”string” use=”required”/>
<attribute ref=”xml:lang” use=”required”/>
</complexType><complexType name=”SiteInfoType”>
<sequence>
<element name=”sitename” type=”string” minOccurs=”0″ />
<element name=”base” type=”anyURI” minOccurs=”0″ />
<element name=”generator” type=”string” minOccurs=”0″ />
<element name=”case” type=”string” minOccurs=”0″ />
<element name=”namespaces” type=”mw:NamespacesType” minOccurs=”0″/>
</sequence>
</complexType><complexType name=”NamespacesType”>
<sequence>
<element name=”namespace” type=”mw:NamespaceType”
minOccurs=”0″ maxOccurs=”unbounded” />
</sequence>
</complexType><complexType name=”NamespaceType”>
<simpleContent>
<extension base=”string”>
<attribute name=”key” type=”integer” />
</extension>
</simpleContent>
</complexType><complexType name=”PageType”>
<sequence>
<element name=”title” type=”string”/>
<!– optional page ID number –>
<element name=”id” type=”positiveInteger” minOccurs=”0″/>
<!– comma-separated list of string tokens, if present –>
<element name=”restrictions” type=”string” minOccurs=”0″/>
<!– Zero or more sets of revision or upload data –>
<choice minOccurs=”0″ maxOccurs=”unbounded”>
<element name=”revision” type=”mw:RevisionType” />
<element name=”upload” type=”mw:UploadType” />
</choice>
</sequence>
</complexType><complexType name=”RevisionType”>
<sequence>
<element name=”id” type=”positiveInteger” minOccurs=”0″/>
<element name=”timestamp” type=”dateTime” xdb:SQLType=”TIMESTAMP WITH TIME ZONE”/>
<element name=”contributor” type=”mw:ContributorType”/>
<element name=”minor” minOccurs=”0″ />
<element name=”comment” type=”string” minOccurs=”0″/>
<element name=”text” type=”mw:TextType”/>
</sequence>
</complexType><complexType name=”TextType” >
<simpleContent>
<extension base=”string”>
<attribute ref=”xml:space” use=”optional” default=”preserve” />
</extension>
</simpleContent>
</complexType><complexType name=”ContributorType” >
<sequence>
<element name=”username” type=”string” minOccurs=”0″/>
<element name=”id” type=”positiveInteger” minOccurs=”0″ />
<element name=”ip” type=”string” minOccurs=”0″/>
</sequence>
</complexType><complexType name=”UploadType” >
<sequence>
<!– Revision-style data… –>
<element name=”timestamp” type=”dateTime” xdb:SQLType=”TIMESTAMP WITH TIME ZONE”/>
<element name=”contributor” type=”mw:ContributorType”/>
<element name=”comment” type=”string” minOccurs=”0″/>
<element name=”filename” type=”string”/>
<!– URI at which this resource can be obtained –>
<element name=”src” type=”anyURI”/>
<element name=”size” type=”positiveInteger” />
<!– TODO: add other metadata fields –>
</sequence>
</complexType>
</schema>
Not much xdb:annotations were used by me, as said you need the xdb namespace reference and at least one xdb:defaultTable annotation for your root element. The xdb:defaultTable annotation will be created automatically during the DBMS_XMLSCHEMA registration process if this procedure is given the correct parameters (GENTABLES=true).
I uploaded (drag & drop) the XML Schema content via WebDAV too a created directory, /home/wiki/xsd/, beforehand. The name of the XML Schema is “: export-0.3.marcelo_bin.xsd
Now I am able to create the automatically, by “xdb:defaultTable”, the Binary XML XMLType table “MEDIAWIKI”. This will be achieved by executing the following SQL Statement:
call dbms_xmlschema.deleteschema('http://www.mediawiki.org/xml/export-0.3/',4);
begin
DBMS_XMLSCHEMA.registerSchema(
SCHEMAURL => 'http://www.mediawiki.org/xml/export-0.3/',
SCHEMADOC => xdbURIType('/home/wiki/xsd/export-0.3.marcelo_bin.xsd').getClob(),
LOCAL => FALSE, -- local
GENTYPES => FALSE, -- generate object types
GENBEAN => FALSE, -- no java beans
GENTABLES => TRUE, -- generate object tables
FORCE => FALSE,
OPTIONS => DBMS_XMLSCHEMA.REGISTER_BINARYXML,
OWNER => USER);
end;
/
The values described (see also the DBMS_XMLSCHEMA.registerSchema alternatives):
- The SCHEMAURL can be whatever you want (it doesn’t even have to exist)
- The SCHEMADOC points to the content were you XML Schema resides (in my case I fetch the content directly from the XDB Repository via XDBUriType().getClob() )
- The LOCAL parameter defines if the XML Schema can only be used by the “user” or is available for all database users. You will need the XDBADMIN role if you want to use it with value FALSE
- GENTYPES: Determines whether the schema compiler generates object types. By default,
TRUE.
If you use binary XML, you must be setgentypes
toFALSE
- GENBEANS: Determines whether the schema compiler generates Java beans. By default,
FALSE
. - GENTABLES: Determines whether the schema compiler generates default tables. By default,
TRUE
- FORCE: If this parameter is set to
TRUE
, the schema registration will not raise errors. - OPTIONS:
REGISTER_BINARYXML
– Register the schema for Binary XML usage (you don’t need this if used for Object Relational XMLType storage)
If your XML documents are now dropped into the XDB repositoy or, for example via FTP, and these documents are valid against this registered XML Schema and most important have the exact same and correct namespace references as the XML Schema, then the XML document will be automatically inserted into the default table “MEDIAWIKI”, which has been created by the registration process.
XML document content could be, for example:
<mediawiki xmlns=”http://www.mediawiki.org/xml/export-0.3/”xsi:schemaLocation=”http://www.mediawiki.org/xml/export-0.3/xmlns:xsi=”http://www.w3.org/2001/XMLSchema-instance” version=”0.3″ xml:lang=”en”>
<siteinfo>
<sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.13alpha</generator>
<case>first-letter</case>
<namespaces>
<namespace key=”-2″>Media</namespace>
<namespace key=”-1″>Special</namespace>
<namespace key=”0″ />
<namespace key=”1″>Talk</namespace>
<namespace key=”2″>User</namespace>
<namespace key=”3″>User talk</namespace>
<namespace key=”4″>Wikipedia</namespace>
<namespace key=”5″>Wikipedia talk</namespace>
<namespace key=”6″>Image</namespace>
<namespace key=”7″>Image talk</namespace>
<namespace key=”8″>MediaWiki</namespace>
<namespace key=”9″>MediaWiki talk</namespace>
<namespace key=”10″>Template</namespace>
<namespace key=”11″>Template talk</namespace>
<namespace key=”12″>Help</namespace>
<namespace key=”13″>Help talk</namespace>
<namespace key=”14″>Category</namespace>
<namespace key=”15″>Category talk</namespace>
<namespace key=”100″>Portal</namespace>
<namespace key=”101″>Portal talk</namespace>
</namespaces>
</siteinfo>
<page>
<title>AmericanSamoa</title>
<id>6</id>
<revision>
<id>133452270</id>
<timestamp>2007-05-25T17:12:06Z</timestamp>
<contributor>
<username>Gurch</username>
<id>241822</id>
</contributor>
<minor />
<comment>Revert edit(s) by [[Special:Contributions/Ngaiklin|Ngaiklin]] to last version by [[Special:Contributions/Docu|Docu]]</comment>
<text xml:space=”preserve”>#REDIRECT [[American Samoa]]{{R from CamelCase}}</text>
</revision>
</page>
</mediawiki>
Be aware that inserting the XML Documents via FTP or via PL/SQL have one big advantage: you will receive a meaningful error if the insert does not work. The process also works the other way around. If, for instance, 10 XML documents were created / inserted automatically in the default table and you would delete, for example 1 XML document, then after this also will show in the XDB Repository via WebDAV. Depending on the WebDAV locking mechanisms and or XDB Protocol refresh time, it could be that you need to refresh Windows Explorer (or another WebDAV enabled program) or the used web brower, like Firefox or Windows Internet Explorer, if you browse the XML content via HTTP.
In all, nothing else is needed.
Hi Marco:
Did you upload your 21Gb media wiki export file by using the FTP protocol handler.
Did you have some performance comparison between the binary format or the object relational format?
I mean for some typical queries such how many pages have some word in an specific time frame.
Did you create some index on the table?
For example a primary key on ‘/page/revision/id’.
Best regards, Marcelo.
For this example I only used a part of the “base” set.
Great idea, but will this work over the long run?
Yes it will. The basics for this functionality has been buildin since database version 9.2.0.3, although binary xml is supported in 11g and onwards.