[an error occurred while processing this directive]
N. Augsten, M. Böhlen, and J. Gamper. ACM Transactions on Database Systems (TODS), 35(1):1-36, 2010
Abstract: When integrating data from autonomous sources, exact matches of data
items that represent the same real world object often fail due to a
lack of common keys. Yet in many cases structural information is
available and can be used to match such data. Typically the matching
has to be approximate since the representations in the sources differ.
We propose pq-grams to approximately match hierarchical data from
autonomous sources and define the pq-gram distance between ordered
labeled trees as an effective and efficient approximation of the
fanout weighted tree edit distance. We prove that the pq-gram distance
is a lower bound of the fanout weighted tree edit distance and give a
normalization of the pq-gram distance for which the triangle inequality
holds. Experiments on synthetic and real world data (residential
addresses and XML) confirm the scalability of our approach and show
the effectiveness of pq-grams.
Download Paper: [PDF]
./load.sh
# load experimental data to database
./run.sh
# run the experiment
./plot.sh
# draw the figures
Note: Some experiments do not have a load.sh or a plot.sh command.
The package contains a readme file with further information.
Download: Repeatability Package
[ZIP]
[README]
Here you can download our real world datasets Bolzano Address Trees: The residential address data of the real
world experiment in Section 9.3 is owned by
the Municipality of Bolzano
and was provided to the authors in the context of
the eBZ
Initialtive. By courtesy of
the Municipality of Bolzano
you may download the Bolzano Address Trees under the following
conditions:
The Bolzano Address Trees come in two text files
(L.trees, R.trees) encoded with braces. For
example, 30:{cesare abba strasse{1}{2}{3{{1}{3}}}{11}} is the
address tree with ID 30, its root node has the label "cesare
abba strasse" and the children of the root are
labeled 1, 2, 3, 11; 3 has a child with an empty
string label, which in turn has two children with labels 1
and 3. The IDs of L are aligned to R by
hand such that matching address trees have the same ID. All street
names are lowercased.
Download: Bolzano Address Trees [ZIP]
XML: In Section 9.4 we use three large XML databases,
for example, the DBLP bibliography. You can download the snapshot that
we used in our experiments and you find a link to the original dataset.
Experimental Data
Dataset | Snapshot | Original Dataset |
---|---|---|
DBLP | dblp.xml.zip [55M] | http://dblp.uni-trier.de/xml |
SwissProt | sprot.xml.zip [14M] | http://www.expasy.ch/sprot/ |
Treebank | treebank.xml.zip [31M] | http://www.cis.upenn.edu/~treebank |
A short description for each of the data sets can be found at the UW XML
Repository.
The source code of our implementation comes with the jar
files Source Code
tods10.jar
and
approxlib_v1.0.jar
that are included in the
the repeatability package above.
Extract the source code from the jar files with the following
commands:
tods10.jar
contains the executables that run
the experiments. tods10.jar
requires approxlib_v1.0.jar
.
approxlib_v1.0.jar
is our approximate matching
library that implements the pq-gram distance and the tree
edit distance. More info about this library can be found
at http://www.inf.unibz.it/~augsten/src.
unzip -x tods10.jar *.java -d tods10
unzip -x approxlib_v1.0.jar *.java -d approxlib_v1.0
[an error occurred while processing this directive]