CRL Newsletter

Vol. 15, No. 2

December 2003

Technical Report

New corpora, new tests, and new data for frequency-based corpus comparisons

Robert A. Liebscher

Department of Cognitive Science, University of California, San Diego

This study presents new data on frequency-based corpus comparisons, in particular those made using the χ² test. In doing such comparisons, many assumptions must be made. For example, it is usually assumed that a term must appear in both corpora in order to be included in the analysis. This assumption ignores lexemes that are very specific to a particular corpus, and relaxing it produces different results. The differences are even more pronounced when the definition of “lexeme” is extended beyond individual words to bigrams, many of which are domain-specific. Results from various comparisons are presented, along with a suggestion for a new standard, text categorization, against which to compare the results.
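As an illustration of the kind of comparison described above, the following is a minimal sketch of a Pearson χ² statistic computed for a single term over a 2×2 table (term versus all other tokens, in each of two corpora). The abstract does not say exactly how the test was set up in the study, so the table layout, the function name `chi_square_2x2`, and the counts in the usage lines are all assumptions made for illustration. The sketch accepts a count of zero in one corpus, which is the case that the "must appear in both corpora" assumption would exclude.

```python
def chi_square_2x2(count_a, total_a, count_b, total_b):
    """Pearson chi-square for one term's frequency in corpus A vs. corpus B.

    count_a, count_b: occurrences of the term in each corpus (may be 0).
    total_a, total_b: total number of tokens in each corpus.
    Illustrative sketch only; not necessarily the procedure used in the report.
    """
    # Observed 2x2 table: rows are corpora, columns are (term, all other tokens).
    observed = [
        [count_a, total_a - count_a],
        [count_b, total_b - count_b],
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [count_a + count_b, grand_total - (count_a + count_b)]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the term were equally frequent in both corpora.
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


# Hypothetical counts: a term shared by both corpora at the same rate,
# and a corpus-specific term that never occurs in corpus B.
shared = chi_square_2x2(10, 100, 10, 100)
specific = chi_square_2x2(20, 100, 0, 100)
```

With these made-up counts the shared term yields a statistic of zero, while the corpus-specific term yields a large one. Dropping terms that are absent from one corpus would remove exactly the second kind of term from the analysis.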
