Natural Language Processing
Matilde Marcolli
CS101: Mathematical and Computational Linguistics
Winter 2015

CS101 Win2015: Linguistics Natural Language Processing

Reference
C.D. Manning, H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.

• Setting based on Probabilistic Linguistics
• Electronic Corpora
  - Linguistic Data Consortium
  - European Language Resources Association
  - International Computer Archive of Modern English
  - Oxford Text Archive
  - Child Language Data Exchange System
• Stemming: stripping off affixes and other word-formation elements to extract the stems of the words in a word list
• Markup: syntactic structure is marked
• Penn Treebank: Lisp-like bracketing to mark the tree structure of a sentence
• SGML (Standard Generalized Markup Language): HTML is a type of SGML encoding; the Text Encoding Initiative (TEI) encoding scheme is suitable for marking up parts of various texts; XML is a simplified form well suited to web applications

• Grammatical Tagging: automated tagging for categories (parts of speech: nouns, verbs, ...)
• Tag Sets: the Brown tag set (used for the American Brown Corpus) and the Lancaster tag sets (developed to tag the Lancaster–Oslo–Bergen corpus and the British National Corpus)
• Penn Treebank tag set: most widely used in computational settings (a simplified version of the Brown tag set)
• Rule: the least marked category is used as the default whenever a word cannot be placed in a more precise subcategory with additional markings
• Example: “Adjectives” is used if a word cannot be further placed into “comparatives, superlatives, numerals, ...”
• Available tag sets are very different (some coarser, some more refined)
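The stemming bullet can be made concrete with a deliberately naive suffix stripper. This is only a sketch: the suffix list, the minimum stem length, and the function names are assumptions made for illustration, and it handles suffixes only. It is not an established stemming algorithm.

```python
# Hypothetical suffix list for illustration only, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem:
            return lower[: -len(suffix)]
    # No suffix could be removed safely: the word is its own stem.
    return lower


def stem_word_list(words):
    """Map every word in a word list to its naive stem."""
    return {word: naive_stem(word) for word in words}
```

Calling `stem_word_list(["walking", "walked", "sing"])` maps the first two words to the same stem while leaving "sing" alone, because stripping "ing" would leave too short a stem.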
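The Lisp-like bracketing mentioned for the Penn Treebank can be read with a short recursive parser. The sketch below assumes well-formed input of the shape `(LABEL child child ...)`. The example sentence and the helper names are illustrative and are not taken from the treebank itself.

```python
def tokenize(text):
    """Split a bracketed string into parentheses and symbols."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse one constituent starting at tokens[pos].

    Returns ((label, children), next_position). Children are either
    nested (label, children) pairs or plain word strings.
    """
    if tokens[pos] != "(":
        raise ValueError("expected '(' at position %d" % pos)
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    return (label, children), pos + 1  # skip the closing ')'


def read_tree(text):
    """Parse a whole bracketed string into a nested tree."""
    tree, _ = parse(tokenize(text))
    return tree


def leaves(tree):
    """Collect the words at the leaves, left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child))
    return words
```

For instance, `leaves(read_tree("(S (NP (DT the) (NN dog)) (VP (VBD barked)))"))` recovers the words of the sentence in order, while the nested tuples keep the marked syntactic structure.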
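The default rule for the least marked category can be sketched as a fallback. The code tries the more specific subcategories first and returns the general tag only when none applies. The tag names and the string tests below are toy assumptions (they would mislabel a word like "clever"). They stand in for whatever criteria a real tag set uses.

```python
# Hypothetical subcategory tests for adjectives, most specific first.
ADJECTIVE_SUBCATEGORIES = [
    ("superlative", lambda w: w.endswith("est")),
    ("comparative", lambda w: w.endswith("er")),
    ("numeral", lambda w: w.isdigit()),
]

# The least marked category, used as the default.
DEFAULT_TAG = "adjective"


def tag_adjective(word):
    """Return the first matching marked subcategory, else the default tag."""
    for tag, matches in ADJECTIVE_SUBCATEGORIES:
        if matches(word):
            return tag
    return DEFAULT_TAG
```

A coarser tag set corresponds to a shorter subcategory list, so more words fall through to the default. A more refined tag set adds entries before the fallback.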