CRL has a number of natural language corpora that are available to members of the Center. Some of these corpora are accessible on-line; others exist only on CD-ROM, so special arrangements must be made to use them. Because most of the corpora are proprietary and have usage restrictions, access is available only to researchers affiliated with CRL.
NEW: A Linux port of the Penn Treebank search utility tgrep is now available from CRL.
| CHILDES | Child Language Data Exchange System. Child language productions contributed by a variety of researchers. http://childes.psy.cmu.edu/ |
| WordNet 1.6 | Lexical database (See http://www.cogsci.princeton.edu/~wn/) |
| Penn Treebank | Penn's Linguistic Data Consortium (LDC) collection, including Brown (Kucera-Francis), Wall Street Journal, and other sources; some text is parsed and can be searched with the tgrep program. (See http://www.ldc.upenn.edu/) Linux: /home/corpora/treebank Windows: \\slice.ucsd.edu\corpora\treebank Macintosh: smb://slice.ucsd.edu/corpora/treebank |
| North American News Text Corpus | Large (~350 million word) corpus of newswire text. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T21) CD-ROM |
| Wall Street Journal 1987/parsed | ~25 million word parsed text from WSJ (text from LDC; parsed version courtesy Eugene Charniak) Linux: /home/corpora/wsj87 Windows: \\slice.ucsd.edu\corpora\wsj87 Macintosh: smb://slice.ucsd.edu/corpora/wsj87 |
| Spanish Language News Corpus | Large (~172 million word) corpus of newswire text. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T9) CD-ROM |
| European Languages News Corpus | ~100 million words of French, 90 million words of German, and 15 million words of Portuguese; newswire text. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T11) CD-ROM |
| Hansard Parallel Text in English and French | Parallel English/French texts drawn from Canadian Parliament discussions. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T20) CD-ROM |
| CELEX | Lexical databases (word lemmas, phonology, morphology, frequency) for Dutch, German, and English. (See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L14) Linux: /home/corpora/celex Windows: \\slice.ucsd.edu\corpora\celex Macintosh: smb://slice.ucsd.edu/corpora/celex |
| British National Corpus | (100 million word searchable corpus; Windows software for more extensive searching is also available) http://sara.natcorp.ox.ac.uk/lookup.html Linux: /home/corpora/BNC Windows: \\slice.ucsd.edu\corpora\BNC Macintosh: smb://slice.ucsd.edu/corpora/BNC |
NOTE: Access to the above file locations is restricted. Please contact CRL to request access.
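For scripts that need to run on more than one platform, the per-platform locations in the table above can be looked up programmatically. The following is a minimal sketch, not a CRL-provided tool: the function name `corpus_location` and the tuple `CORPUS_SHARES` are hypothetical, and the path pattern is simply inferred from the table entries. It only builds the location string; it does not mount the share or check that access has been granted.

```python
import platform

# Directory names copied from the table above (assumption: these four
# corpora are the only ones with on-line locations listed).
CORPUS_SHARES = ("treebank", "wsj87", "celex", "BNC")


def corpus_location(name, system=None):
    """Return the location listed above for a corpus on the given OS.

    `system` takes the values returned by platform.system()
    ("Linux", "Windows", "Darwin"); it defaults to the current OS.
    """
    if name not in CORPUS_SHARES:
        raise ValueError(f"no on-line location listed for {name!r}")
    if system is None:
        system = platform.system()
    if system == "Linux":
        return "/home/corpora/" + name
    if system == "Windows":
        # UNC path, as given in the table.
        return "\\\\slice.ucsd.edu\\corpora\\" + name
    if system == "Darwin":  # Macintosh
        return "smb://slice.ucsd.edu/corpora/" + name
    raise ValueError(f"unsupported platform: {system!r}")


if __name__ == "__main__":
    print(corpus_location("celex"))
```

Corpora listed as CD-ROM have no entry here and raise a `ValueError`, since they require special arrangements rather than a file path.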
In addition, many electronic databases and other resources are useful for the psycholinguist, linguist, or computational linguist. A few of these are listed below:
| MRC Psycholinguistic Database Interface | (Kucera-Francis; number of letters/phonemes/syllables; ratings of word familiarity, concreteness, imageability, meaningfulness; age of acquisition; etc.) http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm |
| Edinburgh Associative Thesaurus | (on-line word association norms) http://monkey.cis.rl.ac.uk/Eat/htdoc/eat.html |
| Oxford Text Archive | (a collection of several thousand electronic texts and linguistic corpora) http://ota.ahds.ac.uk/ |
| Association for Computational Linguistics | http://www.aclweb.org/ |
| Institute for Scientific Information | The Web of Science Citation Databases http://isi1.isiknowledge.com/portal.cgi |
| PubMed | http://www.ncbi.nlm.nih.gov/PubMed/ |
| UCSD Libraries and Library Resources | http://libraries.ucsd.edu |