public class ChineseDocumentToSentenceProcessor
extends java.lang.Object
implements java.io.Serializable
Constructor and Description |
---|
ChineseDocumentToSentenceProcessor() |
ChineseDocumentToSentenceProcessor(java.lang.String normalizationTableFile) |
Modifier and Type | Method and Description |
---|---|
static java.util.List<java.lang.String> | fromHTML(java.lang.String inputString) Strip off HTML tags before processing. |
static java.util.List<java.lang.String> | fromPlainText(java.lang.String contentString) |
static java.util.List<java.lang.String> | fromPlainText(java.lang.String contentString, boolean segmented) |
static void | main(java.lang.String[] args) usage: java ChineseDocumentToSentenceProcessor [-segmentIBM] -file filename [-encoding encoding] |
java.lang.String | normalization(java.lang.String in) This should now become disused, and other people should call ChineseUtils directly! CDM June 2006. |
public ChineseDocumentToSentenceProcessor()
public ChineseDocumentToSentenceProcessor(java.lang.String normalizationTableFile)
normalizationTableFile - A file listing character pairs for normalization. Currently the normalization table must be in UTF-8. If this parameter is null, the default normalization of the zero-argument constructor is used.
public java.lang.String normalization(java.lang.String in)
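As an illustration of what normalization by character pairs can look like, here is a minimal sketch. It is not the library's implementation: the class `NormalizationSketch`, its `addPair` method, and the example pairs are hypothetical, and the on-disk format of the normalization table file is not assumed here (the pairs are added in memory instead).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, not part of the documented API.
final class NormalizationSketch {
    // Maps a variant character to the form it should be normalized to.
    private final Map<Character, Character> table = new HashMap<>();

    // Registers one (from, to) character pair.
    void addPair(char from, char to) {
        table.put(from, to);
    }

    // Replaces every character that has a table entry; all others pass through.
    // Works per UTF-16 code unit, so only BMP characters can be mapped.
    String normalize(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            char mapped = table.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        NormalizationSketch n = new NormalizationSketch();
        n.addPair('\uFF21', 'A'); // full-width A to ASCII A (example pair)
        n.addPair('\uFF0C', ','); // full-width comma to ASCII comma (example pair)
        System.out.println(n.normalize("\uFF21\uFF0CB"));
    }
}
```

A lookup table keeps each replacement a constant-time operation per character, which is why this sketch uses a map rather than chained string replacements.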
public static void main(java.lang.String[] args) throws java.lang.Exception
The -segmentIBM option is for IBM GALE-specific splitting of an XML element into sentences.
Throws: java.lang.Exception
public static java.util.List<java.lang.String> fromHTML(java.lang.String inputString) throws java.io.IOException
inputString - Chinese document text which contains HTML tags
Throws: java.io.IOException
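To make the idea of "strip tags, then split into sentences" concrete, the following is a rough sketch under stated assumptions. It does not reproduce the library's logic: the class `SentenceSplitSketch`, the regular expression used for tags, and the choice of sentence-final punctuation (。！？) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not part of the documented API.
final class SentenceSplitSketch {
    // Naive tag removal: drops anything between '<' and the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Splits after each assumed sentence-final mark, keeping the mark.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            current.append(c);
            // U+3002 ideographic full stop, U+FF01 and U+FF1F full-width ! and ?
            if (c == '\u3002' || c == '\uFF01' || c == '\uFF1F') {
                addIfNotBlank(sentences, current);
                current.setLength(0);
            }
        }
        addIfNotBlank(sentences, current); // trailing text without a final mark
        return sentences;
    }

    private static void addIfNotBlank(List<String> out, StringBuilder sb) {
        String s = sb.toString().trim();
        if (!s.isEmpty()) {
            out.add(s);
        }
    }

    public static void main(String[] args) {
        // "<p>你好。</p><b>好吗？</b>" written with Unicode escapes
        String html = "<p>\u4F60\u597D\u3002</p><b>\u597D\u5417\uFF1F</b>";
        for (String s : split(stripTags(html))) {
            System.out.println(s);
        }
    }
}
```

A regular expression is only adequate for simple markup; real HTML with comments, scripts, or entities would need a proper parser.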
public static java.util.List<java.lang.String> fromPlainText(java.lang.String contentString) throws java.io.IOException
contentString - Chinese document text
Throws: java.io.IOException
public static java.util.List<java.lang.String> fromPlainText(java.lang.String contentString, boolean segmented) throws java.io.IOException
Throws: java.io.IOException