Getting started with Natural Language Processing

In this article, I am going to walk through a few steps to get started with analyzing text content using NLTK, a Natural Language Processing (NLP) library for Python.

First, install and import the relevant libraries.

Installing the libraries

pip install nltk

The collections module is part of Python's standard library, so it does not need to be installed separately; only nltk requires a pip install.

Importing the libraries

from nltk.tokenize import word_tokenize
from collections import Counter
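
Note that word_tokenize relies on NLTK's pre-trained Punkt tokenizer models, which are distributed separately from the package itself. If they are not already present on your machine, a one-off download call fetches them (depending on your NLTK version, the resource may be named punkt or punkt_tab):

import nltk

nltk.download('punkt')  # tokenizer models required by word_tokenize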

Tokenize the article

Assuming the article text is already loaded into a variable called content (if not, the sketch below shows one way to read it from a file).
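
Here is a minimal sketch for loading the text from a local file, assuming it lives in article.txt (a hypothetical filename; replace it with your own source):

# Hypothetical example: read the article text from a local file
with open('article.txt', encoding='utf-8') as f:
    content = f.read()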

tokens = word_tokenize(content)
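
To get a feel for what word_tokenize produces, here is a quick illustration on a short sentence of my own; note that punctuation marks and contraction parts become tokens in their own right:

word_tokenize("NLP is fun, isn't it?")
# ['NLP', 'is', 'fun', ',', 'is', "n't", 'it', '?']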

Convert the tokens into lowercase: lower_tokens

lower_tokens = [t.lower() for t in tokens]

Create a Counter with the lowercase tokens: bow_simple

bow_simple = Counter(lower_tokens)

Print the 5 most common tokens

print(bow_simple.most_common(5))

The output will be in the following form:

[('the', 150), ('.', 89), (',', 40), ...]

Now the problem with this output is that the most frequent tokens include punctuation marks and very common words (e.g. ',', '.', 'the') that we do not want to consider. Therefore, we need to exclude such tokens from the count and examine only the relevant ones in the content.

I will tackle this problem in the next article.
