A comprehensive review of text analytics and natural language processing, with a focus on recent developments in computational linguistics and machine learning. Corpora include unstructured and semi-structured text from online sources, document collections, and databases. Methods are drawn from artificial intelligence and machine learning, and include parsing text into numeric vectors and converting higher-dimensional vectors into lower-dimensional vectors for subsequent analysis and modeling. Applications include speech recognition, semantic processing, text classification, relevant search, recommendation systems, sentiment analysis, and topic modeling. This is a project-based course with extensive programming assignments.
Learning Objectives:
- Identify the role of natural language processing (NLP) and text analytics in the data science arena, as well as the kinds of data corpora suitable for organizational needs, together with their preprocessing and metatagging requirements,
- Extract both entities and concepts; identify, characterize, and apply methods for entity and concept co-resolution; identify and select strategies for complex concept extraction; apply appropriate text vectorization methods (specifically Google’s Word2Vec and Doc2Vec),
- Identify, select, and apply both clustering and classification algorithms, together with other supervised and unsupervised machine learning methods, including latent semantic analysis (LSA) and generative methods such as latent Dirichlet allocation (LDA),
- Select, apply, and evaluate methods for sentiment analysis and link analysis,
- Develop (at a basic level) both ontologies and taxonomies that can provide interpretation during text analytics, and use these to assist in context-dependent NLP.
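
As a rough illustration of two steps named in the course description, parsing text into numeric vectors and mapping higher-dimensional vectors to lower-dimensional ones, the following is a minimal sketch using only the Python standard library. It is not taken from the course materials. The sample documents, the function names (`tokenize`, `count_vectors`, `hash_project`, `cosine`), and the target dimension of 4 are all assumptions chosen for illustration, and feature hashing stands in here for the LSA and Word2Vec methods the objectives list.

```python
import math
import re
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_vectors(docs):
    """Map each document to a term-count vector over a shared, sorted vocabulary."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]


def hash_project(vector, vocab, dim):
    """Fold a vocabulary-sized vector into `dim` buckets via feature hashing."""
    reduced = [0.0] * dim
    for term, value in zip(vocab, vector):
        # crc32 is deterministic across runs, unlike Python's built-in hash() on str.
        reduced[zlib.crc32(term.encode("utf-8")) % dim] += value
    return reduced


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical three-document corpus, invented for this sketch.
docs = [
    "The cat sat on the mat.",
    "A cat sat on a mat.",
    "Stock prices fell sharply today.",
]
vocab, vectors = count_vectors(docs)                          # high-dimensional counts
reduced = [hash_project(v, vocab, dim=4) for v in vectors]    # 4-dimensional vectors

print(len(vocab), "terms reduced to", len(reduced[0]), "dimensions")
print("similarity(doc 0, doc 1):", cosine(vectors[0], vectors[1]))
print("similarity(doc 0, doc 2):", cosine(vectors[0], vectors[2]))
```

Feature hashing is used only because it needs no external libraries. Hash collisions can merge unrelated terms, so similarities computed on the reduced vectors are approximate. In practice, LSA via a truncated singular value decomposition, or learned embeddings such as Word2Vec, would likely be the reduction step, typically through a library rather than hand-written code.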