Mastering XML DB – Unexpected side effect of updateXML

When trying to create anonymous data, I encountered an unexpected side effect of the updateXML statement while updating XML documents in an Oracle 10gR2 (10.2.0.1.0) EE database.

I got a small test environment in which a table (HGO010_DETAM) resides that contains approximately 500Mb of XML documents. The real production environment is much bigger, think in 100th of Gigabytes per table and an average of 512 Kb (up to 6 Mb) per document. The web application that makes use of this data is document driven so these documents were stored in an XMLType column based on CLOB storage. The table also contains an extra ID column that was added for reference purposes.

The table was created as follows:

SQL> create table HGO010_DETAM
  2  (HGO_ID NUMBER(12), GEGEVENS XMLTYPE)
  3  ;

The full syntax would have been as shown via the dbms_metadata method:

SQL> select dbms_metadata.get_ddl('TABLE','HGO010_DETAM',USER) 
        from dual;

DBMS_METADATA.GET_DDL('TABLE','HGO010_DETAM','HGO')
---------------------------------------------------- 
CREATE TABLE "HGO"."HGO010_DETAM"
   (    "HGO_ID" NUMBER(12,0) NOT NULL ENABLE,
        "GEGEVENS" "SYS"."XMLTYPE"  NOT NULL ENABLE,
         CONSTRAINT "FK_HGO010_HGO000" FOREIGN KEY ("HGO_ID")
          REFERENCES "HGO"."HGO000_SOFI" ("HGO_ID") ON DELETE CASCADE ENABLE
   ) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING
  STORAGE(INITIAL 2097152 NEXT 2097152 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
  TABLESPACE "DAT_TS01"
 XMLTYPE COLUMN "GEGEVENS" STORE AS CLOB (
  TABLESPACE "DAT_TS01" ENABLE STORAGE IN ROW CHUNK 8192 PCTVERSION 10
  NOCACHE LOGGING
  STORAGE(INITIAL 2097152 NEXT 2097152 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT))

A typical document in this table looks like this:

SQL> desc hgo.hgo010_detam

 Name      Null?    Type
 --------- -------- -----------
 HGO_ID    NOT NULL NUMBER(12)
 GEGEVENS  NOT NULL SYS.XMLTYPE

SQL> select  *
  2  from    hgo.hgo010_detam t
  3  where   t.hgo_id=18383;

HGO_ID
------
T.GEGEVENS.EXTRACT('/*')
------------------------
     18383

  
    012345678
    426270670105
    
    
      94628838
      
        94628838
        
          09045451
        
      
      
        0
        94628838
        220111765
        09045451
        4
        04
        19950703
        19950726

As can be seen, the document contains white spaces and has a “pretty print” layout. This is how the data was loaded in the XMLType column. The XML documents held there original layout because the XMLType column has a CLOB based storage (and therefore the data isn’t shredded as in XMLType columns based on Object Relational storage). XMLType CLOB storage stores unstructured data “as is”.

I needed to create anonymous data, for instance regarding the element. So tried an update of the data via the statement:

SQL>UPDATE hgo.hgo010_detam t
    SET    t.gegevens=updateXML
    (t.gegevens,'/wwb:WW-HISTORIE/R801FINR/R801-NUM-FIS/text()'
    ,'Marco','xmlns:wwb="http://www.gralike.com/ww/historie/detam/WWBase"')
    where  hgo_id=18383;
 
1 row updated.

The update statement worked fine but I was surprised when I re-examined the data and saw the following:

SQL> select  *
  2  from    hgo.hgo010_detam t
  3  where   t.hgo_id=18383;

HGO_ID
------
GEGEVENS
------------------------
     18383
< ?xml version="1.0" encoding="ISO-8859-1"?>Marco426270670105

946288389462883809045451094628838220111765

Pretty print layout and all the white spaces where gone!

I mentioned the effect on the Oracle TechNet XMLDB forum (http://forums.oracle.com/forums/forum.jspa?forumID=34) and Mark Drake guessed that the following was probably happening under the covers (http://forums.oracle.com/forums/thread.jspa?threadID=399742&tstart=0):

“We had to parse the XML into a DOM to perform the update, perform the update using the DOM API methods and then reserialize the DOM into text after the update was complete. We only preserve whitespace during parsing when the XML is schema based and the element is defined as mixed=”true” in the XML Schema, or in the case xml:space=”preserve” (Both of these cases require patches to work as expected). So the whitespace (pretty print) was lost when the XML was parsed prior to performing the update.”

“Since the XML is stored as CLOB we do not reparse it to pretty print when the row is selected..”

“You can force a pretty print by selecting”

“select HGO_ID, t.GEGEVENS.extract(‘/*’) from hgo010_detam t”

This sounds very plausible, testing the extract statement the following is shown in SQL*Plus:

SQL> select HGO_ID, t.GEGEVENS.extract('/*') 
  2  from hgo010_detam t
  3  where t.hgo_id=18383;
 
HGO_ID
------
T.GEGEVENS.EXTRACT('/*')
------------------------
     18383

  
    Marco
    426270670105
    
    
      94628838
      
        94628838
        
          09045451
        
      
      
'etc..etc..etc

In my small environment the referenced table contains 7500 documents. The original total size of these documents on disks is 437.796 Kb (when shown in Windows explorer). When checking on the sizes of these (re-rendered) text documents in the mentioned table, the sum total size is 450136538 bytes. The smallest XML document in this table is 1584 bytes and the biggest XML document in this table is 1138638 bytes.

SQL> select min(dbms_lob.getlength(t.gegevens.getclobval())) "MIN"
  2  from hgo.hgo010_detam t
  3  ;

MIN
---
1584

1 row selected.

SQL> select max(dbms_lob.getlength(t.gegevens.getclobval())) "MAX"
  2  from hgo.hgo010_detam t
  3  ;

MAX
-------
1138638

1 row selected.

SQL> select sum(dbms_lob.getlength(t.gegevens.getclobval())) "SUM"
  2  from hgo.hgo010_detam t
  3  ;

SUM
---------
450136538

1 row selected.

After updating all XML documents via an updateXML statement and as a side effect of this, removing all white spaces:

SQL> update hgo.hgo010_detam t
     set t.gegevens=updateXML
     (t.gegevens
     ,'/wwb:WW-HISTORIE/R801FINR/R801-NUM-FIS/text()'
     ,'012345678'
     ,'xmlns:wwb="http://www.gralike.com/ww/historie/detam/WWBase"'
     );

7500 rows updated.

After this update the same SUM statement as shown above gave the following result:

SQL> select sum(dbms_lob.getlength(t.gegevens.getclobval())) "SUM"
  2    from hgo.hgo010_detam t;

SUM
---------
392869215

1 row selected.

This is 57267323 bytes less (~ 55 Mb).

SQL> select 450136538-392869215 "LESS" from dual;

LESS
--------
57267323

1 row selected.

In conclusion – the updateXML statements has (at least in our case) nice side effects; it removes the white spaces which causes less space consumption and will allow XML parsers to be more effective while parsing the XML documents. Keep in mind that, if you address DOM methods, this will be very memory intensive (your document will be rebuild in memory).

PS
(Update 04 june 2007)

Be aware that Mark’s statement regarding:

“You can force a pretty print by selecting”
“select HGO_ID, t.GEGEVENS.extract(‘/*’) from hgo010_detam t”

will not be valid anymore in Oracle 11g. As stated by Mark on the XMLDB Forum, this behavior has been treated as a bug and is rectified from 11g upwards.