Computer Science Institute - Publications

Tuesday, 1 April 2025

Publication Details
Authors:	Paolo Ferragina
	Fabrizio Luccio
	Giovanni Manzini
	S. Muthukrishnan
Scientific Area:	Text Compression and Indexing
Title:	Compressing and Searching XML Data Via Two Zips
Published on:	TR-INF-2005-12-06-UNIPMN
Publisher:	Computer Science Department, UPO
Year:	2005
Publication Type:	Technical Report
URL:	http://www.di.unipmn.it...R-INF-2005-12-06-UNIPMN.pdf
Abstract:	XML is fast becoming the standard format for storing, exchanging and publishing data over the web, and is increasingly embedded in applications. Two challenges in handling XML are its size (the XML representation of a document is significantly larger than its native state) and the complexity of its search (XML search involves path and content searches on labeled tree structures). We address the basic problems of compression, navigation and searching of XML documents. In particular, we adopt recently proposed theoretical algorithms for succinct tree representations to design and implement a compressed index for XML, called XbzipIndex, in which the XML document is maintained in a highly compressed format; both navigation and searching can be done by uncompressing only a tiny fraction of the data. This solution relies on compressing two arrays derived from the XML data. With detailed experiments we compare it with other compressed XML indexing and searching engines and show that XbzipIndex achieves a compression ratio up to 35% better than those tools, and that its time performance on the SimplePathSearch and ContextSearch operations is orders of magnitude faster: a few milliseconds over hundreds of MBs of XML files versus tens of seconds, on standard XML data sources. Our library of XML compression and searching routines is downloadable from: http://roquefort.di.unipi.it/~ferrax/xbzipLib.tgz