Natural Language Identification Machine Learning Pipeline

Graduate Project for Harvard's Python for Data Science (CSCI E-29)

In this project, I pulled text data from the European Parliament Proceedings in 21 languages. Using scikit-learn, I transformed the raw text into a numerical feature matrix and trained a multinomial naive Bayes classifier that identifies the language of input text with greater than 99% accuracy.
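The two steps described above (turning text into counts, then scoring each language with multinomial naive Bayes) can be sketched in a few lines. This is only a minimal illustration, not the project's actual code: to stay dependency-free it does the counting and scoring by hand instead of calling scikit-learn, where `CountVectorizer` and `MultinomialNB` play the equivalent roles. The class name `TinyMultinomialNB`, the helper `char_ngrams`, the choice of character bigrams as features, and the training sentences are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class TinyMultinomialNB:
    """Hypothetical helper: multinomial naive Bayes with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter(labels)         # label -> number of documents
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        # n-grams never seen in training carry no information, so skip them.
        grams = [g for g in char_ngrams(text) if g in self.vocab]
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.docs[label] / total_docs)
            for g in grams:
                score += math.log((counts[g] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training sentences for illustration, not the Europarl data.
train_texts = [
    "the committee will vote on the report tomorrow",
    "we thank the president for this important debate",
    "der ausschuss wird morgen über den bericht abstimmen",
    "wir danken dem präsidenten für diese wichtige debatte",
    "la comisión votará mañana sobre el informe",
    "damos las gracias al presidente por este debate importante",
]
train_labels = ["english", "english", "german", "german", "spanish", "spanish"]

model = TinyMultinomialNB().fit(train_texts, train_labels)
print(model.predict("the president thanks the committee"))
print(model.predict("der präsident dankt dem ausschuss"))
print(model.predict("el presidente da las gracias a la comisión"))
```

With only six training sentences this says nothing about the accuracy reported above; it just shows how per-language n-gram counts become a classifier.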

Data Source

Video Demo

Folders and files

.ipynb_checkpoints
Gradute Project Slides
language_data/txt
Language_Identification_FES.ipynb
Language_Identification_FES.py
README.md
english.txt
german.txt
pre_identificador_idioma.py
spanish.txt
split_dataset.py

cris-her/LanguageIdentification
