-----------------------------------------------------------
README for augsten-tods10.zip published on:
http://www.inf.unibz.it/~augsten/publ/tods10/
-----------------------------------------------------------

=========
OVERVIEW:
=========

This ZIP contains the data and the code that allow you to repeat the
experiments of the following paper:

  The pq-Gram Distance between Ordered Labeled Trees
  N. Augsten, M. Böhlen, and J. Gamper.  
  ACM Transactions on Database Systems (TODS), 2009.

===========
QUICKSTART:
===========

1. Unzip augsten-tods10.zip

2. Set host, database, username, and password for your database in the
   configuration file "augsten-tods10/config.txt".

3. Change to the directory of the experiment that you would like to
   repeat (augsten-tods10/exp/*, for a list of experiments see below).

4. Execute the following commands in this order:
   ./clean.sh
   ./load.sh
   ./run.sh
   ./plot.sh

   Note: Some experiments do not have a load.sh or a plot.sh script.

5. The result of each experiment is stored in the respective "log"
   directory; the figures are stored in the "eps" directory.

List of Experiments:

- 9.1 Scalability
  augsten-tods10/exp/scalability
- 9.2 Sensitivity to Structure Change
  augsten-tods10/exp/structure
- 9.3 Real World: Street Matching
  augsten-tods10/exp/streetmatching
- 9.4 Real World: Matching XML Data
  augsten-tods10/exp/xmlmatching
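
The quickstart steps can be wrapped in a small helper. The following
is only a sketch and not part of the ZIP: the function name
run_experiment is made up for illustration, and it assumes that the
scripts are executable and are started from within their experiment
directory. Scripts that an experiment does not provide are skipped.

```shell
# Run one experiment directory: execute the four scripts in the order
# given in step 4 and skip any script that the experiment does not ship.
# (run_experiment is a hypothetical helper, not part of the ZIP.)
run_experiment() {
  local dir=$1 script
  for script in clean.sh load.sh run.sh plot.sh; do
    if [ -x "$dir/$script" ]; then
      echo "== $dir: $script"
      # Run inside a subshell so the caller's working directory is kept.
      (cd "$dir" && "./$script") || return 1
    else
      echo "== $dir: $script not present, skipped"
    fi
  done
}
```

For example, to repeat all experiments from the directory that
contains augsten-tods10/:

  for e in scalability structure streetmatching xmlmatching; do
    run_experiment "augsten-tods10/exp/$e" || break
  done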

====================
SYSTEM REQUIREMENTS:
====================

You need Linux to run the shell scripts (e.g., ./run.sh). 

The Java code is written for Sun Java 1.6 and the relational database
MySQL 5.0 (http://dev.mysql.com). We use gnuplot to draw the figures.
We access MySQL with the JDBC driver v3.0.11 (included) and use Xerces
2.9.1 to parse XML files (included).

If you are an Ubuntu or Debian user, execute the following shell
command to install all required software:

sudo apt-get install sun-java6-jdk mysql-server gnuplot  

============
SOURCE CODE:
============

The source code of our implementation comes with the jar files
tods10.jar and approxlib_v1.0.jar that are included in this ZIP file.

- tods10.jar contains the executables that run the
  experiments. tods10.jar requires approxlib_v1.0.jar.

- approxlib_v1.0.jar is our approximate matching library that
  implements the pq-gram distance and the tree edit distance. More
  info about this library can be found at
  http://www.inf.unibz.it/~augsten/src.

Extract the source code from the jar files with the following commands
(the file pattern is quoted so that the shell passes it to unzip
unchanged):

unzip -x tods10.jar '*.java' -d tods10
unzip -x approxlib_v1.0.jar '*.java' -d approxlib_v1.0

=========
CONTENTS:
=========

README.TXT

  This file.

config.txt

  Here you configure host, database, user, and password of your MySQL
  database. The syntax is

  host=<host of mysql server>
  db=<name of mysql database>
  user=<your mysql user name>
  pwd=<your mysql password>
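
Before loading data it can help to check that the settings in
config.txt work. The sketch below is not part of the ZIP: the helper
names config_get and check_db are made up for illustration. It assumes
that each setting is on its own line, written exactly as key=value
without spaces around "=", and that the mysql command line client is
installed.

```shell
# Print the value stored under key $2 in config file $1.
# Assumes one key=value pair per line, no spaces around "=".
# (config_get and check_db are hypothetical helpers.)
config_get() {
  sed -n "s/^$2=//p" "$1" | head -n 1
}

# Run a trivial query with the configured host, user, password, and
# database; a non-zero exit status means the connection failed.
check_db() {
  local cfg=$1
  mysql -h "$(config_get "$cfg" host)" \
        -u "$(config_get "$cfg" user)" \
        --password="$(config_get "$cfg" pwd)" \
        "$(config_get "$cfg" db)" -e 'SELECT 1'
}
```

Usage: check_db augsten-tods10/config.txt
Note that a password passed on the command line may be visible to
other users of the machine while the command runs.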

lib/
  
  All jar files required to run the experiments.

  For tods10.jar and approxlib_v1.0.jar see Section "Source Code".
  
exp/* 

  Directories for the individual experiments. They all have a similar
  structure:

  data/
    experimental data
  log/
    log files (experimental results)
  eps/
    eps figures
  clean.sh
    remove log files and figures
  load.sh
    load the experimental data from files to the database
  run.sh
    execute the experiment (results written to log files)
  plot.sh
    use gnuplot to draw eps figures from log files

  Note: config.txt and lib/ are symbolic links.

=========================
RESIDENTIAL ADDRESS DATA:
=========================
(section included from exp/streetmatching/data/README)

The residential address data (Bolzano Address Trees) of the real world
experiment in Section 9.3 (streetmatching) is owned by the Municipality
of Bolzano and was provided to the authors in the context of the eBZ
Initiative. By courtesy of the Municipality of Bolzano you may
download the Bolzano Address Trees under the following conditions:

   1. You use the data for research purposes only.
   2. You explicitly acknowledge the Municipality of Bolzano. 

The Bolzano Address Trees come in two text files (L.trees, R.trees)
encoded with braces. For example, 

30:{cesare abba strasse{1}{2}{3{{1}{3}}}{11}}

is the address tree with ID 30, its root node has the label "cesare
abba strasse" and the children of the root are labeled 1, 2, 3, 11; 3
has a child with an empty string label, which in turn has two children
with labels 1 and 3. The IDs of L.trees are aligned to R.trees by hand
such that matching address trees have the same ID. All street names
are lowercased.
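
To inspect the brace encoding, the following sketch prints each tree
as an indented outline, one node per line. It is not the parser used
by approxlib; the function name print_trees is made up for
illustration. It assumes one tree per line in the form
ID:{label{child}...} and that labels contain no "{" or "}" characters.

```shell
# Print every tree read from the given files (or from standard input)
# as an indented outline. Labels are shown in double quotes so that
# empty labels remain visible.
# (print_trees is a hypothetical helper, not part of the ZIP.)
print_trees() {
  awk '
  # Print label l, indented by two spaces per level below the root.
  function emit(d, l,    k, pad) {
    pad = ""
    for (k = 1; k < d; k++) pad = pad "  "
    print pad "\"" l "\""
  }
  /:/ {
    colon = index($0, ":")
    print "tree " substr($0, 1, colon - 1)
    enc = substr($0, colon + 1)
    depth = 0; pending = 0; label = ""
    n = length(enc)
    for (i = 1; i <= n; i++) {
      ch = substr(enc, i, 1)
      if (ch == "{") {
        # A new node starts: the label of its parent is complete.
        if (pending) emit(depth, label)
        depth++; pending = 1; label = ""
      } else if (ch == "}") {
        # A node ends: print its label if it had no children.
        if (pending) emit(depth, label)
        pending = 0; depth--
      } else {
        label = label ch
      }
    }
  }' "$@"
}
```

Usage: print_trees L.trees
For the example tree with ID 30 the outline starts with the root
"cesare abba strasse", followed by its children "1", "2", "3", and
"11" one level deeper; below "3" it shows the node with the empty
label "" and its two children "1" and "3".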

==================
DOWNLOAD XML DATA:
==================
(section included from exp/xmlmatching/data/README.TXT)

For the experiment "xmlmatching" in Section 9.4 we use large XML
files. This ZIP includes only templates for the large files. The
experiments can be executed with the templates, but the results will
differ from those reported in the paper. You can download the exact
versions of the files that we use in the paper from:

http://www.inf.unibz.it/~augsten/publ/tods10/

Place the XML files into the directory
augsten-tods10/exp/xmlmatching/data, replacing the symbolic links
dblp.xml, sprot.xml, and treebank.xml.
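
Replacing the links could look like the sketch below. It is not part
of the ZIP: the function name install_xml is made up for illustration,
and it assumes that the downloaded files have already been unpacked
into one directory under the names dblp.xml, sprot.xml, and
treebank.xml. The links are removed before copying, because copying
onto a symbolic link would overwrite the template that the link
points to.

```shell
# Replace the symbolic links in data_dir ($2) with the full XML files
# found in src_dir ($1).
# (install_xml is a hypothetical helper, not part of the ZIP.)
install_xml() {
  local src_dir=$1 data_dir=$2 f
  for f in dblp.xml sprot.xml treebank.xml; do
    [ -f "$src_dir/$f" ] || { echo "missing $src_dir/$f" >&2; return 1; }
    # Remove the link first so that the template it points to is kept.
    rm -f "$data_dir/$f"
    cp "$src_dir/$f" "$data_dir/$f"
  done
}
```

Usage: install_xml <directory with downloaded files> augsten-tods10/exp/xmlmatching/data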

