CRL Newsletter
Vol. 15, No. 2
December 2003
Technical Report
New corpora, new tests, and new data for frequency-based corpus comparisons
Department of Cognitive Science, University of California, San Diego
This study presents new data on frequency-based corpus comparisons, in particular those made using the χ² test. Such comparisons rest on many assumptions. For example, it is usually assumed that a term must appear in both corpora to be included in the analysis. This assumption ignores lexemes that are highly specific to one corpus, and relaxing it produces different results. The differences are even more pronounced when the definition of “lexeme” is extended beyond individual words to bigrams, many of which are domain-specific. Results from various comparisons are presented, along with a proposal for a new standard, text categorization, against which to evaluate such results.
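
The effect of the "must appear in both corpora" assumption can be illustrated with a small sketch. The abstract does not specify the exact procedure used in the report, so the code below is only an assumed, minimal version: a Pearson χ² statistic over a 2 × K table of term counts, computed once over shared terms and once over all terms. The function names (chi_square, bigrams), the require_both flag, and the two toy sentences are hypothetical and chosen for illustration only.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs, so that bigrams can be counted as terms."""
    return list(zip(tokens, tokens[1:]))


def chi_square(counts_a, counts_b, require_both=True):
    """Pearson chi-square over a 2 x K table of term counts (illustrative sketch).

    counts_a, counts_b: Counter objects mapping term -> frequency.
    require_both=True keeps only terms seen in both corpora (the usual
    assumption); require_both=False also keeps corpus-specific terms.
    """
    if require_both:
        terms = set(counts_a) & set(counts_b)
    else:
        terms = set(counts_a) | set(counts_b)

    total_a = sum(counts_a[t] for t in terms)
    total_b = sum(counts_b[t] for t in terms)
    if total_a == 0 or total_b == 0:
        return 0.0  # nothing to compare; avoids dividing by zero

    grand = total_a + total_b
    stat = 0.0
    for t in terms:
        row_total = counts_a[t] + counts_b[t]
        for observed, col_total in ((counts_a[t], total_a), (counts_b[t], total_b)):
            expected = row_total * col_total / grand
            stat += (observed - expected) ** 2 / expected
    return stat


# Toy "corpora" (assumed data, not from the report).
tokens_a = "the cell divides and the cell grows".split()
tokens_b = "the market grows and the market falls".split()

words_a, words_b = Counter(tokens_a), Counter(tokens_b)
pairs_a, pairs_b = Counter(bigrams(tokens_a)), Counter(bigrams(tokens_b))

print(chi_square(words_a, words_b, require_both=True))
print(chi_square(words_a, words_b, require_both=False))
print(chi_square(pairs_a, pairs_b, require_both=False))
```

In this toy case the shared words have identical counts in both texts, so the shared-terms statistic sees no difference at all; the corpus-specific words and bigrams only contribute once the restriction is relaxed. Real corpora would also need attention to low expected counts, which this sketch does not handle.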