
                         Lexical Chunking
                         ----------------
                    Linas Vepstas, April 2008

Key ideas: 
 * Lexis is the basis of language.
 * Language consists of grammaticalized lexis, not lexicalized grammar.

See: Olga Moudraia, "Lexical Approach to Second Language Teaching"
http://www.cal.org/resources/digest/0102lexical.html

Alternate names: "gambits", "lexical phrases", "lexical units", 
"lexicalized stems", "speech formulae".


Definition of Lexical Chunks
----------------------------
Lexis comprises single words and also the word combinations that form
the basis of one's mental lexicon.  That is, language consists of meaningful
chunks that, when combined, produce continuous coherent text; only a
minority of spoken sentences are entirely novel creations.

Types of lexical chunks:
    * Words (e.g., book, pen)
    * Phrasal verbs (e.g. switch off, talk to ... about ...)
    * Polywords (e.g., by the way, upside down)
    * Collocations, or word partnerships (e.g., community service,
      absolutely convinced)
    * Idioms (e.g. break a leg, back in the day)
    * Institutionalized utterances (e.g., I'll get it; We'll see; 
      That'll do; If I were you . . .; Would you like a cup of coffee?)
    * Sentence frames and heads (e.g., That is not as . . . as you think;
      The fact/suggestion/problem/danger was . . .) and even text frames 
      (e.g., In this paper we explore . . .; Firstly . . .; Secondly . . .;
      Finally . . .)

(Taken from Lewis, M. (1997b). "Pedagogical implications of the lexical
approach." In J. Coady & T. Huckin (Eds.), "Second language vocabulary
acquisition: A rationale for pedagogy" (pp. 255-270). Cambridge: Cambridge
University Press.)

The code in this directory can be used to obtain phrasal verbs and
polywords.  There is currently no support for idioms or collocations.

Three distinct mechanisms are encoded in this directory.

LexicalChunker -- abstract base class providing the interface for
     several alternate chunking routines.

PhraseChunker -- print out phrases identified via the classical
     tree-bank style phrase tree.

RelationChunker -- print out phrases based on the binary RelEx
      relations -- e.g. object phrases, noun modifier phrases,
      prepositional phrases, etc.

PatternChunker -- print out lexical chunks based on matching to a
     collection of sentence patterns. Of the various mechanisms, this
     one yields the best, most interesting results.  The rest of this
     README file is concerned with these sentence patterns.

Also in this directory:

ChunkRanker -- assigns each chunk a weight, or probability, of being
     correct.


Pattern matching
----------------
a == accept, r == reject
p == accept only if pronoun
notcop == accept only if not a copula
postcop == accept only if preceded by copula

Verb phrases:
------------
Most of these are phrasal verbs of one sort or another.

Prepositional Verbs:
--------------------
Prepositional verbs are phrasal verbs that contain a preposition,
which is always followed by its nominal object.

 * We LOOK AFTER our children.

      (VP look (PP after (NP our children)))

   We need to drop the NP inside the PP to get "look after".
   This matches the pattern

      (VP a (PP a (NP r)))

   Some more complicated examples:
   He will TAKE OFFICE ON Monday.

      (VP will (VP take office (PP on (NP Monday))))

   Similarly:
   The cup was FILLED TO overflowing.
     (VP was (VP filled (PP to (NP overflowing))))
     (VP was (VP filled (PP to (S (VP overflowing)))))
   (The second parse is inappropriate).

 * Prepositional verbs that have their own object, which
   usually precedes the preposition:

   She HELPED him TO some more.
   I TOOK her TO the dance.
   Desired answer: "took ... to"

      (VP took (NP her) (PP to (NP the dance)))

   Skip both NP's, to get "took ... to"
      (VP a (NP r) (PP a (NP r)))

   Similarly, but more complex: 
   She HELPED the boy TO an extra portion of potatoes.

      (VP helped (NP the boy) (PP to (NP (NP an extra portion) (PP of (NP potatoes)))))
      (VP a (NP r) (PP a (NP (NP r) (PP r (NP r)))))
      (VP a (NP r) (PP a (NP *)))

   So we use a * here for the general case.

 * Prepositional verbs with two prepositions:
   I RAN AWAY FROM home.
   I RAN AWAY FROM home at age two.
   Desired answer: "ran away from"

      (VP ran (PP away) (PP from (NP home)))
      (VP ran (PP away) (PP from (NP home)) (PP at (NP age two)))

   Keep the VP, and the first PP, and the second PP, 
   dropping NP in the second PP.
      (VP a (PP a) (PP a (NP r)) *)

 * Prepositional verbs with two prepositions, where the first
   preposition also has an object:

   We TALKED TO the minister ABOUT the crisis.
      (VP talked (PP to (NP the minister)) (PP about (NP the crisis)))
   so:
      (VP a (PP a (NP r)) (PP a (NP r)))
 
   However, problems arise when the verb is a copula:
   I was IN LOVE WITH that bunny.
   Desired answer: "in love with"

      (VP was (PP in (NP love)) (PP with (NP that bunny)))

   Drop leading VP, keep first NP, drop second NP.

      (VP r (PP a (NP a)) (PP a (NP r)))

   But this conflicts with the pattern above, so use a more complex rule:

      (VP notcop (PP a (NP postcop)) (PP a (NP r)))

   (Drop the first verb because it's a copula!)
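
The flags used so far (a, r, *, notcop, postcop) can be illustrated with
a small matcher.  What follows is a minimal sketch in Python, not the
RelEx implementation: the function names parse, match and chunk, the
COPULAS word list, and the way "preceded by copula" is decided (the
nearest enclosing VP contains a copular form) are all assumptions made
for illustration.

```python
import re

# Assumed list of copular forms; the README does not define one.
COPULAS = {"be", "is", "am", "are", "was", "were", "been", "being"}
FLAGS = ("a", "r", "notcop", "postcop")


def parse(text):
    """Turn "(VP look (PP after (NP our children)))" into nested lists:
    ["VP", "look", ["PP", "after", ["NP", "our", "children"]]]."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                       # skip "("
        items = [tokens[pos]]          # phrase label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1                       # skip ")"
        return items

    return node()


def match(pat, tree, after_copula=False):
    """Return the accepted words, or None when the tree does not fit."""
    if pat[0] != tree[0]:              # phrase labels must agree
        return None
    rest = pat[1:]
    flag = None
    if rest and rest[0] in FLAGS:      # optional flag for this phrase's words
        flag, rest = rest[0], rest[1:]
    open_ended = bool(rest) and rest[-1] == "*"
    if open_ended:                     # trailing "*": ignore extra sub-phrases
        rest = rest[:-1]
    if tree[0] == "VP":                # assumption: copula context is per VP
        after_copula = any(isinstance(c, str) and c.lower() in COPULAS
                           for c in tree[1:])
    out, i = [], 0
    for child in tree[1:]:
        if isinstance(child, str):     # a word directly inside this phrase
            keep = (flag == "a"
                    or (flag == "notcop" and child.lower() not in COPULAS)
                    or (flag == "postcop" and after_copula))
            if keep:
                out.append(child)
        elif i < len(rest):            # a sub-phrase with a sub-pattern
            got = match(rest[i], child, after_copula)
            if got is None:
                return None
            out += got
            i += 1
        elif not open_ended:           # unexpected extra sub-phrase
            return None
    return out if i == len(rest) else None


def chunk(pattern_text, tree_text):
    """Match a pattern string against a tree string; join accepted words."""
    words = match(parse(pattern_text), parse(tree_text))
    return None if words is None else " ".join(words)
```

With these definitions, the pattern (VP a (PP a (NP r))) applied to the
"look after" tree gives "look after", and the notcop/postcop rule gives
"in love with" for the copula sentence while still keeping the verb in
"talked to about".  The sketch simply joins the accepted words; it does
not insert the "..." gap shown in the desired answers.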

Particle Verbs:
---------------
 * Phrasal verbs -- Particle verbs.
   Intransitive particle verbs do not have an object:

   He WENT AWAY.
      (VP went (PRT away)) 

   Keep the match tight:
      (S (NP r) (VP a (PRT a)) *)

 * Transitive particle verbs have an object, which may precede
   the particle:
   SWITCH the light OFF.
      (VP Switch (NP the light) (PRT off))
   so:
      (VP a (NP p) (PRT a))

   "p" flag -- if the NP is a pronoun, we want to keep it! See below:

   Let's GET IT OUT.
   Let's GET IT OUT in the open.
   Let's GET IT OUT in the open now.
   Let's PUT IT ON.

      (VP get (NP it) (PRT out))
      (VP get (NP it) (PRT out) (PP in (NP the open)))
      (VP get (NP it) (PRT out) (PP in (NP the open)) (PP now))

   The obvious pattern is (VP a (NP a) (PRT a) *) but this won't work
   with RelEx, because earlier stages mangle the sentence. The actual
   phrase becomes:
       (VP let's (VP get out (NP it) (PRT) (PP in (NP the open))))
   because the K link was used to merge "get" and "out". This, 
   ahem, wrecks the pattern matching. -- Fixed, had to hack 
   algs/TwoWordCombineToLeftAlg.java to not merge the orig_str.

   We want to keep the object NP only if it's a pronoun!

 * With prepositions, but without a direct object:
   The driver GOT OFF TO a flying start.

      (VP got (PRT off) (PP to (NP a flying start)))

      (VP a (PRT a) (PP a (NP r)))

 * With prepositions, and with a direct object:
   Onlookers PUT the accident DOWN TO bad luck.

      (VP put (NP the accident) (PRT down) (PP to (NP bad luck)))
      (VP a (NP p) (PRT a) (PP a (NP *)))

    XXX this clashes with "get it out in the open", which is perhaps
    too idiomatic to handle correctly!?

 * Infinitive forms:
   I want TO GO HOME now. 
     (S (VP to (VP go (PRT home))))
     (S (VP to (VP go (PRT home) (PP now))))
     (S (VP Please (VP go (PRT away) (PP now))) .)

XXX broken
   This should cover it:
      (VP a (VP a (PRT a) *))
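
The "p" flag used in the transitive particle patterns above keeps the
object NP only when it is a pronoun.  The following Python sketch applies
the single pattern (VP a (NP p) (PRT a) *) to a VP written as nested
lists.  It is an illustration under stated assumptions, not the RelEx
code: the name particle_chunk, the PRONOUNS word list, and the "..."
placeholder for a dropped object (borrowed from the "took ... to"
notation) are all hypothetical.

```python
# Assumed pronoun list; the README does not say how a pronoun is detected.
PRONOUNS = {"i", "me", "you", "he", "him", "she", "her",
            "it", "we", "us", "they", "them"}


def particle_chunk(vp):
    """Apply (VP a (NP p) (PRT a) *) to a VP given as nested lists.

    Returns the list of chunk words, or None if the VP has another shape.
    """
    if vp[0] != "VP":
        return None
    words = [c for c in vp[1:] if isinstance(c, str)]      # the verb itself
    phrases = [c for c in vp[1:] if isinstance(c, list)]   # NP, PRT, PP, ...
    if len(phrases) < 2 or phrases[0][0] != "NP" or phrases[1][0] != "PRT":
        return None
    np_words = phrases[0][1:]
    is_pronoun = (len(np_words) == 1
                  and isinstance(np_words[0], str)
                  and np_words[0].lower() in PRONOUNS)
    # "p": keep the object only if it is a pronoun, else leave a gap marker.
    middle = np_words if is_pronoun else ["..."]
    # Anything after the PRT (the trailing "*") is ignored.
    return words + middle + phrases[1][1:]


get_it_out = ["VP", "get", ["NP", "it"], ["PRT", "out"],
              ["PP", "in", ["NP", "the", "open"]]]
switch_off = ["VP", "Switch", ["NP", "the", "light"], ["PRT", "off"]]

print(particle_chunk(get_it_out))
print(particle_chunk(switch_off))
```

The pronoun object "it" is kept, giving the words get, it, out, while the
full noun phrase "the light" is replaced by the gap marker.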

Miscellany
-----------
 * You can't STOP me FROM seeing her.
      (VP can't (VP stop (NP me) (PP from (S (VP seeing (NP her))))))
   
      (VP r (VP a (NP r) (PP a (S *))))

 * He had A GOOD FEELING FOR that sort of thing.
   Desired answer: "a good feeling for"

      (VP had (NP a good feeling) (PP for (NP that sort of (NP thing))))

      (VP r (NP a) (PP a (NP r (NP r))))

 * General Custer GAVE the Sioux chief NO QUARTER.
   The desired answer is "give ... no quarter". 

      (VP gave (NP the Sioux chief) (NP no quarter)) 

   Skip first NP, keep second.

      (VP a (NP r) (NP a))

 * Reversed adjectival double prep phrase.
    It was done IN A MANNER APPROPRIATE FOR the occasion.

      (VP done (PP in (NP a manner (ADJP appropriate (PP for (NP the occasion)))))) 

      (VP r (PP a (NP a (ADJP a (PP a (NP r)))))) 

Sentence phrases
----------------
 * John Dillinger was a man who BROKE THE LAW.
   Desired answer: "broke the law"

      (S (NP John Dillinger) (VP was (NP (NP a man) (SBAR (WHNP who) (S (VP broke (NP the law)))))) .)

  Currently using a broad coverage:
    (S (VP a (NP a))) 

  XXX This pattern matches a lot. Could narrow to require WHNP.
  Perhaps this is too much of a collocation to detect automatically?

 * I'll DO it even IF IT KILLS me.
   Desired answer: "do ... if it kills"
      (SBAR if (S (NP it) (VP kills (NP me))))

   This is tricky, because the "do" will be embedded
   in a wide variety of phrase forms.

   XXX this is currently not detected.


Idioms that break rules:
------------------------
 * It was A BITTER PILL TO SWALLOW.

      (VP was (NP a bitter pill) (S (VP to (VP swallow))))
      (VP r (NP a) (S (VP a (VP r))))

   However, this also reports:
     She forced him to resign.
     (VP forced (NP him_1) (S (VP to (VP resign))))
     Phrase: (him to resign)      Character ranges: [11-24]
   -- so this needs modification; perhaps we should drop the final VP?
   XXX fix the above somehow ...  perhaps it's too idiomatic to work
   right.

 * I didn't HAVE THE HEART TO tell him.
      (VP didn't (VP have (NP the heart) (S (VP to (VP tell (NP him))))))

      (VP r (VP a (NP a) (S (VP a (VP r (NP r))))))


Noun phrases:
-------------
 * I ate A WEE BIT OF cheese.
   Desired answer: "a wee bit of"

      (NP (NP a wee bit) (PP of (NP cheese)))

   Drop NP after PP.

      (NP (NP a) (PP a (NP r)))

   Also: a helping of, a hail of, a finger in

   It is A COUPLE OF clicks down the road.
   Desired: "a couple of"

      (NP (NP a couple) (PP of (NP clicks)))

 * It is a day's march from home.
   Desired: "a day's march"

      (NP (NP (NP a day 's) march) (PP from (NP home)))

   Rule: three NP's are too deep.
   Otherwise, same as "a wee bit of cheese".

      (NP (NP (NP a) a) (PP r (NP r)))

---------------

It's a good bet he did it.
(S (S (NP It) (VP 's (NP (NP a good bet) (SBAR (S (NP he) (VP did)))))) it .)

It's a good bet that he did it. 

Fails to parse. 


Pattern match:
--------------

(VP r (NP a) (PP a (NP r (NP r)))) -- (VP had (NP a good feeling) (PP for (NP that sort of (NP thing)))) -- a good feeling for

(VP a (PP a) (PP a (NP r)) *) -- (VP ran (PP away) (PP from (NP home)) (PP at (NP age two))) -- ran away from

(VP a (NP a) (PRT a) *) -- (VP get (NP it) (PRT out)) -- Let's get it out in the open.

(S (VP a (NP a))) -- (VP was (NP (NP a man) (SBAR (WHNP who) (S (VP broke (NP the law)))))) -- broke the law.


Idioms:
That is a hard act to follow.
