Getting started with Natural Language Processing
In this article, I will walk through a few steps to get started with analyzing text content using NLTK, the Natural Language Toolkit library for Python.
First, install and import the relevant libraries.
Installing the libraries
pip install nltk
Note that collections is part of Python's standard library, so it does not need to be installed separately.
Importing the libraries
import nltk
nltk.download('punkt')  # tokenizer models required by word_tokenize
from nltk.tokenize import word_tokenize
from collections import Counter
Tokenize the article
Assuming the article text is already loaded into a variable named content:
tokens = word_tokenize(content)
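To get a feel for what tokenization produces without downloading NLTK data, here is a simplified regex-based stand-in (it is not NLTK's algorithm, which handles many more cases such as contractions and abbreviations, and the sample sentence is invented for illustration):

```python
import re

def simple_tokenize(text):
    # Split into word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "The cat sat on the mat."  # invented sample text
print(simple_tokenize(sample))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
```

Notice that punctuation comes back as separate tokens, which is why it shows up in the counts later on.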
Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]
Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)
Print the 5 most common tokens
print(bow_simple.most_common(5))
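Putting the steps together, here is a self-contained sketch of the pipeline; it uses a hand-written token list (as word_tokenize might return for a short invented sentence) so it runs without NLTK data:

```python
from collections import Counter

# Hypothetical tokenizer output for a short invented text.
tokens = ["The", "cat", "sat", "on", "the", "mat", ".",
          "The", "mat", "was", "red", "."]

lower_tokens = [t.lower() for t in tokens]   # normalize case
bow_simple = Counter(lower_tokens)           # bag-of-words counts
print(bow_simple.most_common(3))
# [('the', 3), ('mat', 2), ('.', 2)]
```

Even in this tiny example, the stop word "the" and the punctuation token "." dominate the counts, which previews the problem discussed below.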
The output will look like the following (most_common returns tokens in descending order of frequency):
[('the', 150), ('.', 89), (',', 40)]
The problem with this output is that the most frequent tokens are punctuation marks and common stop words (e.g. ',', '.', 'the') that carry no real meaning. We therefore need to filter out such tokens and count only the meaningful words in the content.
I will tackle this problem in the next article.