This system is based on the original BADGER architecture but uses an extended version of the Smith-Waterman similarity metric, which is most often used for DNA sequence analysis. The substitution cost has been modified to use a multilingual knowledge base that supports English, Czech, Spanish, German and French. Word similarity combines t-measures for association with the Dice coefficient for relations. This version is much faster than the original BWT/SpatterMap-based system because the z-scores used for word-to-word substitution costs are precomputed.
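To make the idea of a word-level Smith-Waterman metric with a pluggable substitution cost concrete, here is a minimal sketch. It is not the BADGER implementation: the class name `WordSmithWaterman`, the linear gap penalty, the `exactMatch` placeholder cost, and the normalisation by the shorter sequence length are all assumptions made for illustration. In the system described above, the substitution cost would instead come from the precomputed knowledge-base scores.

```java
import java.util.function.ToDoubleBiFunction;

// Illustrative sketch only; names and parameters are hypothetical.
final class WordSmithWaterman {

    // Placeholder substitution cost: +1 for identical words, -1 otherwise.
    // A knowledge-base lookup would replace this in a real system.
    static double exactMatch(String x, String y) {
        return x.equals(y) ? 1.0 : -1.0;
    }

    // Best local alignment score between two token sequences,
    // using a linear gap penalty (an assumption).
    static double score(String[] a, String[] b,
                        ToDoubleBiFunction<String, String> sub, double gap) {
        double[][] h = new double[a.length + 1][b.length + 1];
        double best = 0.0;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                double diag = h[i - 1][j - 1] + sub.applyAsDouble(a[i - 1], b[j - 1]);
                double up = h[i - 1][j] - gap;
                double left = h[i][j - 1] - gap;
                // Local alignment: a cell never drops below zero.
                h[i][j] = Math.max(0.0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    // One common way to map the raw score into [0, 1]: divide by the best
    // score the shorter sequence could achieve (length * maxMatch).
    static double similarity(String[] a, String[] b,
                             ToDoubleBiFunction<String, String> sub,
                             double gap, double maxMatch) {
        int minLen = Math.min(a.length, b.length);
        if (minLen == 0) {
            return a.length == b.length ? 1.0 : 0.0;
        }
        return score(a, b, sub, gap) / (minLen * maxMatch);
    }
}
```

A call such as `WordSmithWaterman.similarity(hyp, ref, WordSmithWaterman::exactMatch, 0.5, 1.0)` compares a tokenised hypothesis with a tokenised reference. Passing the cost as a function keeps the alignment code unchanged when the exact-match placeholder is swapped for a lookup table.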
The system comes in two versions.
Lite
- This system uses the Smith-Waterman Similarity Metric
- This system is missing the ~2.5G Multilingual databases
Full
- From the Language Model
- Word Frequencies
- Word Association t-test matrix
- Word Relations dice matrix
- Spelling correction for out of vocabulary terms
- Extended Smith-Waterman similarity metric using the above knowledgebase
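The t-test and Dice matrices listed above hold pairwise word statistics. As a rough illustration of what one cell in each matrix could contain, the sketch below computes the standard Dice coefficient and the usual collocation t-score from raw counts. How BADGER gathers its counts, and how it combines the two measures into a substitution cost, is not specified here, so the class name `AssociationMeasures` and the count-based inputs are assumptions.

```java
// Illustrative sketch only; the class and its inputs are hypothetical.
final class AssociationMeasures {

    // Dice coefficient: 2 * c(x,y) / (c(x) + c(y)).
    static double dice(long countXY, long countX, long countY) {
        long denom = countX + countY;
        return denom == 0 ? 0.0 : (2.0 * countXY) / denom;
    }

    // Collocation t-score: (observed - expected) / sqrt(observed),
    // where expected = c(x) * c(y) / N under independence and the
    // sample variance is approximated by the observed mean.
    static double tScore(long countXY, long countX, long countY, long total) {
        if (countXY == 0 || total == 0) {
            return 0.0;
        }
        double expected = ((double) countX * countY) / total;
        return (countXY - expected) / Math.sqrt(countXY);
    }
}
```

For example, with a co-occurrence count of 16, word counts of 40 and 50, and 1000 observations, the expected co-occurrence is 2 and the t-score is (16 - 2) / 4 = 3.5. Computing such values once and storing them in a matrix is what makes lookup at scoring time cheap.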
Both versions support single or multiple references.
A command-line interface has been provided for convenience.
To reduce the workload for the tester, the system will also automatically correlate single files or whole directories, as long as the file suffix is .sgm or .xml.
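The suffix rule can be sketched as a small file collector. This is an assumption-laden illustration rather than BADGER's own code: the class name `InputCollector` is hypothetical, matching is case-insensitive by choice, and directories are scanned one level deep because the text does not say whether subdirectories are included.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch only; names and matching rules are hypothetical.
final class InputCollector {

    // True when the file name ends in .sgm or .xml (case-insensitive by assumption).
    static boolean hasSupportedSuffix(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".sgm") || name.endsWith(".xml");
    }

    // Accepts a single file or a directory and returns the files to process.
    static List<Path> collect(Path input) throws IOException {
        if (Files.isRegularFile(input)) {
            return hasSupportedSuffix(input)
                    ? Collections.singletonList(input)
                    : Collections.<Path>emptyList();
        }
        // Non-recursive listing; subdirectories are ignored (an assumption).
        try (Stream<Path> entries = Files.list(input)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(InputCollector::hasSupportedSuffix)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Returning a sorted list keeps the processing order stable between runs, which makes results easier to compare.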
This system is written entirely in Java and should run on most operating systems including Windows, Linux and Mac OS. BADGER has been tested on:
CentOS 5 (64 bit) (2G RAM)
Windows XP (32 bit) (2G RAM)
Windows Vista (64 bit) (2G RAM)
SUSE 11.x (64 bit) (8G RAM)
The system is multithreaded and can utilize up to 10 processors. Expected running times on the test set are 15 minutes for the Lite version and 0.5 hours for the Full version.
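A capped worker pool is one straightforward way to get this behaviour in Java. The sketch below is a guess at the general shape, not the system's actual threading code: `ParallelScorer`, `MAX_THREADS`, and the per-segment `scorer` function are hypothetical names, and scoring one segment per task is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToDoubleFunction;

// Illustrative sketch only; names are hypothetical.
final class ParallelScorer {
    static final int MAX_THREADS = 10; // cap taken from the text above

    // Use every available processor, but never more than the cap or fewer than one.
    static int poolSize(int availableProcessors) {
        return Math.max(1, Math.min(MAX_THREADS, availableProcessors));
    }

    // Scores each segment on the pool and returns results in input order.
    static <T> double[] scoreAll(List<T> segments, ToDoubleFunction<T> scorer) {
        int threads = poolSize(Runtime.getRuntime().availableProcessors());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (T segment : segments) {
                Callable<Double> task = () -> scorer.applyAsDouble(segment);
                futures.add(pool.submit(task));
            }
            double[] scores = new double[futures.size()];
            for (int i = 0; i < scores.length; i++) {
                scores[i] = futures.get(i).get(); // blocks until task i finishes
            }
            return scores;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Scoring was interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A scoring task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting the futures in submission order means the output array lines up with the input segments regardless of which thread finishes first.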
Output values for this metric range from 0 to 1, where 1 is a perfect score. In general, segment-level scores above 0.5 are considered acceptable MT output.
Pearson correlations are expected to be in the range of 0.6 to 0.7 on unseen data.
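For readers who want to check a correlation figure on their own data, the Pearson coefficient between metric scores and a second set of scores can be computed as in the sketch below. The class name `Pearson` is hypothetical, and which scores the metric is correlated against is left to the caller, since the text does not specify it.

```java
// Illustrative sketch only; the class name is hypothetical.
final class Pearson {

    // Pearson product-moment correlation of two equally long samples.
    // Returns NaN when either sample has zero variance.
    static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("Need two samples of equal length >= 2");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

Because the coefficient is invariant to linear rescaling, metric scores in the 0 to 1 range can be compared directly with judgments on any other linear scale.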
This system makes heavy use of open-source libraries, including:
SimMetrics
The full source code tree and tools comprising BADGER 2.0 are expected to be released to the open source community as soon as the baseline has been fully tested and reviewed.
The multilingual knowledge base is trainable for other languages as well as for specific domains and genres (including spelling conventions). All other aspects of the system are expected to be language and topic neutral.