org.apache.lucene.analysis
Class LetterTokenizer

java.lang.Object
  |
  +--org.apache.lucene.analysis.TokenStream
        |
        +--org.apache.lucene.analysis.Tokenizer
              |
              +--org.apache.lucene.analysis.CharTokenizer
                    |
                    +--org.apache.lucene.analysis.LetterTokenizer
Direct Known Subclasses:
LowerCaseTokenizer

public class LetterTokenizer
extends CharTokenizer

A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as determined by the java.lang.Character.isLetter(char) predicate. Note: this works reasonably well for most European languages, but poorly for some Asian languages, where words are not separated by spaces.
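
The splitting rule described above can be illustrated with a minimal standalone sketch. It does not use the Lucene classes; LetterSplitSketch and its tokenize method are hypothetical names introduced only for this illustration, and the sketch assumes nothing about the real implementation beyond the rule "tokens are maximal runs of characters for which Character.isLetter(char) is true".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of Lucene.
class LetterSplitSketch {

    // Splits text into maximal runs of adjacent letters,
    // using Character.isLetter(char) as the only criterion.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // Letter: extend the token being built.
                current.append(c);
            } else if (current.length() > 0) {
                // Non-letter: close the current token, if any.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        // Flush a token that runs to the end of the input.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Don, t, panic, it, s, nd, street]
        System.out.println(tokenize("Don't panic: it's 42nd-street!"));
    }
}
```

Apostrophes, digits, and hyphens all act as separators under this rule, which is why "Don't" yields two tokens and "42nd" yields only "nd". A run of CJK characters with no intervening non-letters comes out as a single token, which is one way to see the limitation noted above.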


Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Constructor Summary
LetterTokenizer(Reader in)
          Constructs a new LetterTokenizer that reads from the given Reader.
 
Method Summary
protected  boolean isTokenChar(char c)
          Collects only characters that satisfy Character.isLetter(char).
 
Methods inherited from class org.apache.lucene.analysis.CharTokenizer
next, normalize
 
Methods inherited from class org.apache.lucene.analysis.Tokenizer
close
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LetterTokenizer

public LetterTokenizer(Reader in)
Constructs a new LetterTokenizer that reads from the given Reader.
Method Detail

isTokenChar

protected boolean isTokenChar(char c)
Collects only characters that satisfy Character.isLetter(char).
Overrides:
isTokenChar in class CharTokenizer
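
To make the predicate concrete, the following sketch evaluates a stand-in for isTokenChar on a few sample characters. IsTokenCharSketch is a hypothetical class written for this page, not the Lucene source; it assumes only that the documented behavior is a direct delegation to Character.isLetter(char).

```java
// Hypothetical illustration class, not part of Lucene.
class IsTokenCharSketch {

    // Stand-in for the documented predicate: a character belongs
    // to a token exactly when Character.isLetter(char) accepts it.
    static boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }

    public static void main(String[] args) {
        // Latin letters, an accented letter (U+00E9), a CJK ideograph (U+4E2D),
        // then a digit, an underscore, a space, and a hyphen.
        char[] samples = {'a', 'Z', '\u00e9', '\u4e2d', '7', '_', ' ', '-'};
        for (char c : samples) {
            System.out.println(String.format("U+%04X", (int) c) + " -> " + isTokenChar(c));
        }
    }
}
```

The first four samples are accepted and the last four are rejected, so digits and underscores end a token just as whitespace and punctuation do.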


Copyright © 2000-2002 Apache Software Foundation. All Rights Reserved.