The following are the details of the processing behind the project:
The NLTK data package includes a pre-trained Punkt tokenizer for English.
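For example, once the punkt data has been downloaded, the tokenizer can be called directly through NLTK (the sample text below is only illustrative):

    import nltk

    # Download the Punkt tokenizer data (only needed once).
    nltk.download('punkt')

    text = "Hello there. How can I help you today?"

    # Split the text into sentences, then the second sentence into word tokens.
    sentences = nltk.sent_tokenize(text)
    tokens = nltk.word_tokenize(sentences[1])

    print(sentences)  # ['Hello there.', 'How can I help you today?']
    print(tokens)     # ['How', 'can', 'I', 'help', 'you', 'today', '?']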
[+] TF-IDF Approach: TF: TERM FREQUENCY (how frequently a word appears in a document); IDF: INVERSE DOCUMENT FREQUENCY (how rare a word is across documents). A worked sketch follows the formulas below.
TF = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF = 1 + log(N/n), where N is the total number of documents and n is the number of documents in which term t appears.
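As a minimal sketch of these two formulas in plain Python (the toy documents and variable names are illustrative, not taken from the project data):

    import math
    from collections import Counter

    def tf(term, document):
        # TF = (times term appears in the document) / (total terms in the document)
        return Counter(document)[term] / len(document)

    def idf(term, documents):
        # IDF = 1 + log(N/n): N = total number of documents,
        # n = number of documents in which the term appears.
        n = sum(1 for doc in documents if term in doc)
        if n == 0:
            return 0.0
        return 1 + math.log(len(documents) / n)

    docs = [
        ["the", "chatbot", "answers", "questions"],
        ["the", "user", "asks", "questions"],
        ["the", "model", "predicts", "a", "response"],
    ]

    term = "questions"
    print(tf(term, docs[0]) * idf(term, docs))  # ~0.35: TF = 1/4, IDF = 1 + log(3/2)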
[+] Cosine Similarity: Cosine similarity is a measure of similarity between two non-zero vectors, and it lets us score how similar any two documents d1 and d2 are. Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||), where d1 and d2 are two non-zero vectors.
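A direct translation of this formula into Python might look as follows (a minimal sketch; the two TF-IDF vectors are made up for illustration, and in practice a library routine such as scikit-learn's cosine_similarity could be used instead):

    import math

    def cosine_similarity(d1, d2):
        # Dot product of the two vectors divided by the product of their norms.
        dot = sum(a * b for a, b in zip(d1, d2))
        norm1 = math.sqrt(sum(a * a for a in d1))
        norm2 = math.sqrt(sum(b * b for b in d2))
        return dot / (norm1 * norm2)

    # Two toy TF-IDF vectors over the same four-word vocabulary.
    d1 = [0.0, 0.35, 0.12, 0.0]
    d2 = [0.0, 0.35, 0.00, 0.40]
    print(cosine_similarity(d1, d2))  # values close to 1 indicate very similar documents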
FILES
[+] Intents.json – The data file containing the predefined patterns and responses.
[+] train_chatbot.py – The Python script in which we build the model and train our chatbot.
[+] Words.pkl – A pickle file storing the Python object that holds our vocabulary list.
[+] Classes.pkl – The classes pickle file contains the list of categories.
[+] Chatbot_model.h5 – The trained model file, which stores the model architecture together with the learned weights of the neurons (a loading sketch follows this list).
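For orientation, these files are typically loaded along the following lines (a sketch only, assuming lowercase on-disk file names and an intents.json layout with "tag", "patterns" and "responses" keys; the exact structure comes from the project code):

    import json
    import pickle
    from tensorflow.keras.models import load_model

    # Assumed layout: intents.json holds a list of intents, each with
    # "tag", "patterns" and "responses" keys (illustrative, not confirmed).
    with open("intents.json") as f:
        intents = json.load(f)

    # Vocabulary and category lists saved as pickles during training.
    words = pickle.load(open("words.pkl", "rb"))
    classes = pickle.load(open("classes.pkl", "rb"))

    # Trained Keras model: architecture plus the learned weights.
    model = load_model("chatbot_model.h5")

    print(len(words), "vocabulary words,", len(classes), "intent classes")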
STEPS
[+] Import and load the data file
[+] Preprocess data
[+] Create training and testing data
[+] Build the model
[+] Predict the response
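These steps map onto a short Keras training script roughly like the sketch below (simplified, with random toy data and assumed input sizes; it is not the project's actual train_chatbot.py):

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout
    from tensorflow.keras.optimizers import SGD

    # Toy bag-of-words training data: each row is a 0/1 vector over a
    # 50-word vocabulary, each label is a one-hot vector over 4 intent classes.
    train_x = np.random.randint(0, 2, size=(100, 50)).astype("float32")
    train_y = np.eye(4, dtype="float32")[np.random.randint(0, 4, size=100)]

    # Build the model: two hidden layers with dropout, softmax over the classes.
    model = Sequential([
        Dense(128, input_shape=(train_x.shape[1],), activation="relu"),
        Dropout(0.5),
        Dense(64, activation="relu"),
        Dropout(0.5),
        Dense(train_y.shape[1], activation="softmax"),
    ])

    model.compile(loss="categorical_crossentropy",
                  optimizer=SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
                  metrics=["accuracy"])

    # Train and save the model, then predict an intent for one input vector.
    model.fit(train_x, train_y, epochs=50, batch_size=5, verbose=0)
    model.save("chatbot_model.h5")

    probs = model.predict(train_x[:1], verbose=0)
    print("Predicted intent index:", int(np.argmax(probs)))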
Please refer to the uploaded project report for more information.