When trying to create anonymous data, I encountered an unexpected side effect of the updateXML statement while updating XML documents in an Oracle 10gR2 (10.2.0.1.0) EE database.
I got a small test environment in which a table (HGO010_DETAM) resides that contains approximately 500Mb of XML documents. The real production environment is much bigger, think in 100th of Gigabytes per table and an average of 512 Kb (up to 6 Mb) per document. The web application that makes use of this data is document driven so these documents were stored in an XMLType column based on CLOB storage. The table also contains an extra ID column that was added for reference purposes.
The table was created as follows:
SQL> create table HGO010_DETAM
2 (HGO_ID NUMBER(12), GEGEVENS XMLTYPE)
3 ;
The full syntax would have been as shown via the dbms_metadata method:
SQL> select dbms_metadata.get_ddl('TABLE','HGO010_DETAM',USER)
from dual;
DBMS_METADATA.GET_DDL('TABLE','HGO010_DETAM','HGO')
----------------------------------------------------
CREATE TABLE "HGO"."HGO010_DETAM"
( "HGO_ID" NUMBER(12,0) NOT NULL ENABLE,
"GEGEVENS" "SYS"."XMLTYPE" NOT NULL ENABLE,
CONSTRAINT "FK_HGO010_HGO000" FOREIGN KEY ("HGO_ID")
REFERENCES "HGO"."HGO000_SOFI" ("HGO_ID") ON DELETE CASCADE ENABLE
) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING
STORAGE(INITIAL 2097152 NEXT 2097152 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "DAT_TS01"
XMLTYPE COLUMN "GEGEVENS" STORE AS CLOB (
TABLESPACE "DAT_TS01" ENABLE STORAGE IN ROW CHUNK 8192 PCTVERSION 10
NOCACHE LOGGING
STORAGE(INITIAL 2097152 NEXT 2097152 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT))
A typical document in this table looks like this:
SQL> desc hgo.hgo010_detam
Name Null? Type
--------- -------- -----------
HGO_ID NOT NULL NUMBER(12)
GEGEVENS NOT NULL SYS.XMLTYPE
SQL> select *
2 from hgo.hgo010_detam t
3 where t.hgo_id=18383;
HGO_ID
------
T.GEGEVENS.EXTRACT('/*')
------------------------
18383
012345678
426270670105
94628838
94628838
09045451
0
94628838
220111765
09045451
4
04
19950703
19950726
As can be seen, the document contains white spaces and has a “pretty print” layout. This is how the data was loaded in the XMLType column. The XML documents held there original layout because the XMLType column has a CLOB based storage (and therefore the data isn’t shredded as in XMLType columns based on Object Relational storage). XMLType CLOB storage stores unstructured data “as is”.
I needed to create anonymous data, for instance regarding the
SQL>UPDATE hgo.hgo010_detam t
SET t.gegevens=updateXML
(t.gegevens,'/wwb:WW-HISTORIE/R801FINR/R801-NUM-FIS/text()'
,'Marco','xmlns:wwb="http://www.gralike.com/ww/historie/detam/WWBase"')
where hgo_id=18383;
1 row updated.
The update statement worked fine but I was surprised when I re-examined the data and saw the following:
SQL> select *
2 from hgo.hgo010_detam t
3 where t.hgo_id=18383;
HGO_ID
------
GEGEVENS
------------------------
18383
< ?xml version="1.0" encoding="ISO-8859-1"?>Marco 426270670105
Pretty print layout and all the white spaces where gone!
I mentioned the effect on the Oracle TechNet XMLDB forum (http://forums.oracle.com/forums/forum.jspa?forumID=34) and Mark Drake guessed that the following was probably happening under the covers (http://forums.oracle.com/forums/thread.jspa?threadID=399742&tstart=0):
“We had to parse the XML into a DOM to perform the update, perform the update using the DOM API methods and then reserialize the DOM into text after the update was complete. We only preserve whitespace during parsing when the XML is schema based and the element is defined as mixed=”true” in the XML Schema, or in the case xml:space=”preserve” (Both of these cases require patches to work as expected). So the whitespace (pretty print) was lost when the XML was parsed prior to performing the update.”
“Since the XML is stored as CLOB we do not reparse it to pretty print when the row is selected..”
“You can force a pretty print by selecting”
“select HGO_ID, t.GEGEVENS.extract(‘/*’) from hgo010_detam t”
This sounds very plausible, testing the extract statement the following is shown in SQL*Plus:
SQL> select HGO_ID, t.GEGEVENS.extract('/*')
2 from hgo010_detam t
3 where t.hgo_id=18383;
HGO_ID
------
T.GEGEVENS.EXTRACT('/*')
------------------------
18383
Marco
426270670105
94628838
94628838
09045451
'etc..etc..etc
In my small environment the referenced table contains 7500 documents. The original total size of these documents on disks is 437.796 Kb (when shown in Windows explorer). When checking on the sizes of these (re-rendered) text documents in the mentioned table, the sum total size is 450136538 bytes. The smallest XML document in this table is 1584 bytes and the biggest XML document in this table is 1138638 bytes.
SQL> select min(dbms_lob.getlength(t.gegevens.getclobval())) "MIN"
2 from hgo.hgo010_detam t
3 ;
MIN
---
1584
1 row selected.
SQL> select max(dbms_lob.getlength(t.gegevens.getclobval())) "MAX"
2 from hgo.hgo010_detam t
3 ;
MAX
-------
1138638
1 row selected.
SQL> select sum(dbms_lob.getlength(t.gegevens.getclobval())) "SUM"
2 from hgo.hgo010_detam t
3 ;
SUM
---------
450136538
1 row selected.
After updating all XML documents via an updateXML statement and as a side effect of this, removing all white spaces:
SQL> update hgo.hgo010_detam t
set t.gegevens=updateXML
(t.gegevens
,'/wwb:WW-HISTORIE/R801FINR/R801-NUM-FIS/text()'
,'012345678'
,'xmlns:wwb="http://www.gralike.com/ww/historie/detam/WWBase"'
);
7500 rows updated.
After this update the same SUM statement as shown above gave the following result:
SQL> select sum(dbms_lob.getlength(t.gegevens.getclobval())) "SUM"
2 from hgo.hgo010_detam t;
SUM
---------
392869215
1 row selected.
This is 57267323 bytes less (~ 55 Mb).
SQL> select 450136538-392869215 "LESS" from dual;
LESS
--------
57267323
1 row selected.
In conclusion – the updateXML statements has (at least in our case) nice side effects; it removes the white spaces which causes less space consumption and will allow XML parsers to be more effective while parsing the XML documents. Keep in mind that, if you address DOM methods, this will be very memory intensive (your document will be rebuild in memory).
PS
(Update 04 june 2007)
Be aware that Mark’s statement regarding:
“You can force a pretty print by selecting”
“select HGO_ID, t.GEGEVENS.extract(‘/*’) from hgo010_detam t”will not be valid anymore in Oracle 11g. As stated by Mark on the XMLDB Forum, this behavior has been treated as a bug and is rectified from 11g upwards.