
                  OpenRelEx Semantic Relation Extractor
                  -------------------------------------
                      Version 1.0.0 June 2009

RelEx is a syntactic dependency extractor and semantic framing generator;
it will parse English language sentences and return the dependency 
relationships between different parts of the sentence, and also provide
semantic framing tags based on syntax and semantic categories.

There are multiple inter-related parts to RelEx. The core component
extracts the dependency relationships. Additional parts perform
functions such as anaphora resolution, provide semantic frame output,
provide output in various formats, including a format suitable for 
later batch post-processing, another format suitable for input to 
OpenCog, and a W3C OWL format. There is also a small assortment of
Perl scripts for cleaning up web pages, etc.

The main RelEx website is at 

    http://opencog.org/wiki/RelEx

It provides a broader overview, as well as detailed documentation.

The source download and project management site is at

   https://launchpad.net/relex

Build and install of the core package is discussed below.


Dependencies
-------------
To build and use RelEx, the following packages are required to be 
installed:

 - libgetopt-java (GNU getopt)
 - Link Parser
 - WordNet 3.0
 - JWNL Java WordNet library
 - OpenNLP tools (optional)
 - GATE 4.0 or later (optional)
 - W3C OWL (optional)


Prerequisite dependencies
-------------------------
The following packages are required pre-requisites for building RelEx.

- Link Grammar Parser
	Compile and install the Link Grammar Parser. This parser is
	described at http://www.link.cs.cmu.edu/link/, and sources
	are available for download at
	http://www.abisource.com/projects/link-grammar/#download

	Link-grammar version 4.5.4 or later is needed.

	The Link Grammar Parser is the underlying engine, providing
	the core sentence parsing ability.

	If the parser is installed in an unusual location,
	be sure to modify -Djava.library.path appropriately in 
	relation-extractor.sh.

- GNU getopt
	This is a standard command-line option parsing library. 
	For Ubuntu, install the "libgetopt-java" package.

- WordNet
	WordNet is used by RelEx to provide basic English morphology,
	such as singular versions of (plural) nouns, base forms (lemmas)
	of adjectives and adverbs, and infinitive forms of verbs.

	Download, unpack and install WordNet 3.0.  The install directory
	then needs to be specified in data/wordnet/file_properties.xml,
	with the name="dictionary_path" property in this file.

	Some typical install locations are:
	/opt/WordNet-3.0/data for RedHat and SuSE
	/usr/share/wordnet for Ubuntu and Debian
	C:\Program Files\WordNet\3.0\data for Windows

	The relex/Morphy/Morphy.java class provides a simple, easy-to-use
	wrapper around WordNet, providing the needed word morphology info.

- didion.jwnl
	The didion JWNL is the "Java WordNet Library", and provides the
	Java programming API to access the WordNet data files.
	Its home page is at http://sourceforge.net/projects/jwordnet
	and it can be downloaded from
	http://sourceforge.net/project/showfiles.php?group_id=33824

	Verify that the final installed location of jwnl.jar is correctly
	specified in the build.xml file. Note that GATE (below) also
	provides a jwnl.jar, but the GATE version of jwnl.jar is not
	compatible (welcome to Java DLL hell).
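
The typical WordNet install locations listed above can be probed with
a short script before editing data/wordnet/file_properties.xml. The
following is only a minimal sketch in Python; the function name
find_wordnet_dir is made up for illustration, and the script does not
touch the XML file itself. It merely reports which of the directories
named in this README exists on the local machine.

```python
import os

# Typical WordNet data directories, as named in this README.
CANDIDATE_DIRS = [
    "/opt/WordNet-3.0/data",                 # RedHat and SuSE
    "/usr/share/wordnet",                    # Ubuntu and Debian
    r"C:\Program Files\WordNet\3.0\data",    # Windows
]


def find_wordnet_dir(candidates=CANDIDATE_DIRS, exists=os.path.isdir):
    """Return the first candidate directory that exists, or None.

    The 'exists' argument is a hypothetical hook so that the check
    can be replaced, for example when testing.
    """
    for path in candidates:
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    found = find_wordnet_dir()
    if found is None:
        print("No WordNet data directory found in the usual places.")
    else:
        print("Set dictionary_path to:", found)
```

If WordNet lives somewhere else, add that directory to the candidate
list, or simply set dictionary_path by hand.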


Optional packages
-----------------
The following packages are optional. If they are found, then
additional parts of RelEx will be built, enabling additional
functionality.

- OpenNLP
	RelEx uses OpenNLP for sentence detection, giving RelEx the ability
	to find sentence boundaries in free text. Without OpenNLP, the input
	to RelEx must be organized so that there's only one sentence per line.

	The OpenNLP home page is at http://opennlp.sourceforge.net/
	Download and install OpenNLP tools, and verify that the 
	installed files are correctly identified in both build.xml
	and in relation-extractor.sh.

	OpenNLP also requires the installation of maxent from
	http://maxent.sourceforge.net/  

	You'll need maxent-2.5.2.jar and opennlp-tools-1.4.3.jar.
	
	The OpenNLP package is used solely in corpus/DocSplitter.java,
	which provides a simple, easy-to-use wrapper for splitting a
	document into sentences. Replace this file if an alternate
	sentence detector is desired.

- Trove
	Some users may require GNU Trove to enable OpenNLP, although
	this depends on the JDK installed.  GNU Trove is an implementation
	of the java.util class hierarchy, which may or may not be included
	in the installed JDK.  If needed, download trove from:

	http://trove4j.sourceforge.net/

	Since trove is optimized, using it may improve performance and/or
	decrease memory usage, as compared to the standard Sun JDK
	implementation of the java.util hierarchy.

- Apache commons logging
	The OpenNLP package requires that the Apache commons logging
	jar file be installed. In Debian/Ubuntu, this is supplied by
	the "libcommons-logging-java" package.

- xercesImpl.jar
	The OpenNLP package requires that the Xerces2 XML parser package
	be installed. In Debian/Ubuntu, this is supplied by the
	"libxerces2-java" package.

- GATE
	If GATE is found, then GATE-based entity detection will be enabled.
	Please note that it is not clear whether GATE provides significantly
	better entity detection than the Link Grammar parser itself.

	GATE, the "General Architecture for Text Engineering",
	http://www.gate.ac.uk/, is a large, complex framework. RelEx does
	not need the framework. However, GATE does provide a good entity
	detector, which RelEx can employ. An "entity" is the name of a
	person, corporation or institution, or a time, date or money
	expression. The term "entity detection" refers to the task of
	identifying such quantities in parsed text.

	Download GATE 4.0 from http://gate.ac.uk/download/index.html

	Install it at /opt/GATE-4.0 . If you change this location,
	please modify the system property -Dgate.home=/opt/GATE-4.0
	in relation-extractor.sh.  Modify build.xml to point at the
	correct location of gate-4.0.jar.

	GATE also requires the installation of the Xerces XML parser.
	Debian/Ubuntu users can install Xerces using apt, via
	"apt-get install libxerces2-java"

	Other users may need to get xerces from 
	http://xerces.apache.org/xerces2-j/

	Alternatives to GATE may be used by providing a replacement
	for the relex/corpus/GateEntityMaintainer.java file.


Building
--------
	After the above are installed, the RelEx Java code can be built.
	The build system uses "ant", and the ant build specifications
	are in "build.xml". Simply saying "ant" at the command line
	should be enough to build. Saying "ant run" will run a basic
	demo of the system.


Using RelEx
-----------
It is assumed that RelEx will be used in one of two different ways.
These are in a "batch processing" mode, and a "custom Java development"
mode.

In the "batch processing mode", RelEx is run once over a large text,
and its output is saved to a file.  This output can then be 
post-processed at a later time, to extract desired info. The goal here
is to avoid the heavy CPU overhead of re-parsing a large text over and
over.  Example post-processing scripts are included (described below).

In the "custom Java development" mode, it is assumed that a capable
Java programmer can write new code that interfaces with RelEx to meet
their needs.
A good place to start is to review the workings of the output code in
src/java/relex/output/*.java.

The standard RelEx demo output is NOT SUITABLE for post-processing. It
is meant to be a human-readable example of what the system generates;
it does not include all required output. For example, if the same word
appears in a sentence twice, the demo output will not distinguish between
these two words.


Running RelEx
-------------
Several example shell scripts (and MS Windows batch files) are included
to show sample usage. These files (*.sh on Unix, or *.bat on Windows)
define the required system properties, classpath and JVM options.

If there are any ClassNotFound exceptions, please verify the paths
and values in these files.  An important property is relex.algpath;
it defines the semantic algorithms used by RelEx.  The default file
is data/relex-semantic-algs.txt.
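
To make the role of these properties concrete, here is a rough sketch,
in Python, of how a launcher might assemble the java command line. It
is not a copy of the shipped scripts. The property names
(java.library.path, relex.algpath, gate.home) are the ones mentioned in
this README; the main class name relex.RelationExtractor is inferred
from src/java/relex/RelationExtractor.java; the classpath entries and
library directory in the usage part are hypothetical placeholders that
must be replaced with the real locations on your system.

```python
import os


def build_relex_command(classpath_entries, native_lib_dir,
                        algpath="data/relex-semantic-algs.txt",
                        gate_home=None, extra_args=()):
    """Assemble an argv list for launching the RelEx demo class.

    This only builds the list; it does not run anything.
    """
    cmd = [
        "java",
        # Where the JVM should look for the Link Grammar native library.
        "-Djava.library.path=" + native_lib_dir,
        # The semantic algorithms file used by RelEx.
        "-Drelex.algpath=" + algpath,
    ]
    if gate_home is not None:
        # Only needed when the optional GATE support is in use.
        cmd.append("-Dgate.home=" + gate_home)
    # os.pathsep is ':' on Unix and ';' on Windows.
    cmd += ["-classpath", os.pathsep.join(classpath_entries)]
    # Assumed main class, inferred from the source file name.
    cmd.append("relex.RelationExtractor")
    cmd.extend(extra_args)
    return cmd


if __name__ == "__main__":
    cmd = build_relex_command(
        classpath_entries=["bin", "/usr/share/java/gnu-getopt.jar"],  # hypothetical
        native_lib_dir="/usr/local/lib",                              # hypothetical
        extra_args=["-h"],
    )
    print(" ".join(cmd))
```

Printing the command before running it is a cheap way to spot a wrong
path when chasing ClassNotFound exceptions.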


relation-extractor.sh
---------------------
The primary usage example is the "relation-extractor.sh" file.
Running this will display:

	- The link parser output.
	- The detected persons, organizations and locations.
	- The dependency relations found. 
	- Anaphora resolutions.
	- Frame relations.
	- Parse ranking info.

Output is controlled by command-line flags that are set in the shell
script.  The "-h" flag will print a list of all of the available
command-line options.


batch-process.sh
----------------
The "batch-process.sh" script is an example batch processing script.
This script outputs the so-called "compact (cff) format" which captures
the full range of Link Grammar and RelEx output in a format that can be
easily post-processed by other systems (typically by using regexes).

The idea behind the batch processing is that it is costly to parse
large quantities of text: thus, it is convenient to parse the text
once, save the results, and then perform post-processing at leisure,
as needed.  Thus, the form of post-processing can be changed at will,
without requiring texts to be re-processed over and over again.
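
As a rough illustration of regex-based post-processing, the Python
sketch below pulls (relation, head, dependent) triples out of saved
text. Please note that the layout of the cff format is not described
in this README; the sketch assumes, purely for illustration, that the
relations have already been isolated as lines of the form
"name(first, second)". The names RELATION_RE and parse_relations are
made up here and are not part of RelEx.

```python
import re

# Assumed line shape: name(first, second). This is an illustrative
# assumption, not a description of the actual cff layout.
RELATION_RE = re.compile(
    r"^\s*(\w+)\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)\s*$"
)


def parse_relations(lines):
    """Return a list of (name, first, second) tuples.

    Lines that do not look like a relation are silently skipped.
    """
    relations = []
    for line in lines:
        match = RELATION_RE.match(line)
        if match:
            relations.append(
                (match.group(1), match.group(2), match.group(3))
            )
    return relations


if __name__ == "__main__":
    sample = [
        "_subj(throw, John)",
        "this line is not a relation",
        "_obj(throw, ball)",
    ]
    for triple in parse_relations(sample):
        print(triple)
```

Because the parse results are saved once, a script like this can be
changed and re-run as often as needed without re-parsing the text.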


src/perl/cff-to-opencog.pl 
--------------------------
This Perl script provides an example of post-processing: it converts
the "cff" batch output format into OpenCog hypergraphs, which can
then be processed by OpenCog.


opencog-server.sh
-----------------
This script starts a RelEx server that listens for plain-text input
(English sentences) on port 4444. It then parses the text, and returns
OpenCog output on the same socket.  This server is meant to serve the
OpenCog chatbot directly; it is not intended for general, manual use.
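
For debugging, a tiny client can still be handy. The Python sketch
below sends one sentence to the server and collects whatever comes
back. Only the port number (4444) is taken from this README. The
rest is assumption: that the server runs on localhost, that input is
UTF-8 text terminated by a newline, and that the reply is complete
once the server closes the connection or falls silent. The function
name query_relex_server is made up for this example.

```python
import socket


def query_relex_server(sentence, host="localhost", port=4444, timeout=10.0):
    """Send one sentence and return the server's reply as a string."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Assumption: one sentence per line, newline-terminated.
        sock.sendall(sentence.encode("utf-8") + b"\n")
        # Tell the server that no more input is coming.
        sock.shutdown(socket.SHUT_WR)
        while True:
            try:
                data = sock.recv(4096)
            except socket.timeout:
                # Server stayed silent; return what has arrived so far.
                break
            if not data:
                # Server closed the connection: the reply is complete.
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")


if __name__ == "__main__":
    print(query_relex_server("John threw the ball."))
```

If the real server expects a different framing, the sendall and
shutdown lines are the ones to adjust.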


doc-splitter.sh
---------------
The doc-splitter.sh file is a simple command-line utility to reformat
a free-form text into sentences, one per line.
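
To show the kind of reformatting involved, here is a deliberately
naive Python sketch that splits free text into one sentence per line.
It is not how doc-splitter.sh works: the real utility relies on
OpenNLP sentence detection, whereas this sketch uses a simple regular
expression, and will wrongly split after abbreviations such as "Dr."
when a capitalized word follows. The name split_sentences is made up
for this example.

```python
import re

# A boundary is whitespace that follows '.', '!' or '?' and precedes
# an upper-case letter or an opening quote. This is a crude heuristic.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'])")


def split_sentences(text):
    """Return a list of sentences found in free-form text."""
    # Collapse line breaks and runs of spaces into single spaces.
    flat = " ".join(text.split())
    if not flat:
        return []
    return _BOUNDARY.split(flat)


if __name__ == "__main__":
    text = "John threw the ball.  Mary\ncaught it! Did it work?"
    for sentence in split_sentences(text):
        print(sentence)
```

Output in this shape, one sentence per line, is what RelEx expects
when OpenNLP is not available.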


src/perl/wiki-scrub.pl
----------------------
Ad-hoc script to scrub Wikipedia XML dumps, outputting only valid
English-language sentences.  This script removes wiki markup, URLs,
tables, images, etc.  It currently seems to be pretty darned
bullet-proof, although it might handle multi-line refs incorrectly.


Using RelEx in custom code
--------------------------
The primary output of RelEx is the set of semantic relationships of a
sentence. To obtain the list of these relationships, make a copy of
src/java/relex/output/SimpleView.java, and customize it to provide 
the relationships that you wish, in the format that you wish.

The class src/java/relex/RelationExtractor.java should be considered 
to be a large example program illustrating all of the various features
of RelEx.  For custom applications, this class should be copied and
modified as desired to fit the application.


TODO
----
The Java install dependencies would be much easier to deal with if
there were a centralized repository from which one could easily obtain
the needed jar files. That is, something analogous to apt-get or CPAN.
The closest such thing for Java is "maven"; however, none of the jar
files required by RelEx have been checked into maven. Thus, a to-do:
get all of the jar files submitted to maven.


TODO - polywords, lexical units, collocations, idioms
-----------------------------------------------------
It would be nice to identify "By the way" as a polyword, and
"Break a leg" as an idiom.


Head word
---------
A core idea behind dependency grammar is that the order of a pair of 
words in a dependency relation is important: the first word is the
head word, the second is the dependent. This is maintained by RelEx,
but due to a bug, one relation was ordered inconsistently, and, due
to a design consideration, three more were in reverse order from the
conventional order used by other parsers.

The buggy, inconsistent relation was:
_nn

The intentionally reversed relations were:
_amod
_predadj
_advmod

These have now been fixed, as of version 0.99.1, with one exception:
some of the _nn usages in data/frame/mapping_rules.txt might be
reversed.

The following was audited, and appears to be correct:
_appo
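
Anyone holding saved output from a version older than 0.99.1 may want
to bring the three intentionally reversed relations into the
head-first order. The Python sketch below does this for relation
triples that have already been extracted; the names used are made up
for this example and are not part of RelEx. Note that _nn is left
alone on purpose: its old ordering was inconsistent, so a blanket
swap cannot repair it.

```python
# Relations that were stored as (dependent, head) before 0.99.1,
# according to the list given in this section.
REVERSED_BEFORE_0_99_1 = {"_amod", "_predadj", "_advmod"}


def normalize_relation(name, first, second):
    """Return (name, head, dependent) for a triple from old output.

    Triples for the three reversed relations are swapped; all other
    triples, including the inconsistent _nn, are returned unchanged.
    """
    if name in REVERSED_BEFORE_0_99_1:
        return (name, second, first)
    return (name, first, second)


if __name__ == "__main__":
    old_output = [
        ("_amod", "red", "ball"),
        ("_subj", "throw", "John"),
    ]
    for triple in old_output:
        print(normalize_relation(*triple))
```

Do not apply this to output from version 0.99.1 or later, since those
relations are already in head-first order there.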
