{"id":11624,"date":"2019-03-15T01:52:32","date_gmt":"2019-03-15T01:52:32","guid":{"rendered":"https:\/\/www.appservgrid.com\/paw92\/?p=11624"},"modified":"2019-03-15T02:10:02","modified_gmt":"2019-03-15T02:10:02","slug":"nltk-tutorial-in-python-linux-hint","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw92\/index.php\/2019\/03\/15\/nltk-tutorial-in-python-linux-hint\/","title":{"rendered":"NLTK Tutorial in Python \u2013 Linux Hint"},"content":{"rendered":"<p>The era of data is already here. The rate at which data is generated today is higher than ever, and it is always growing. Most of the time, people who deal with data every day work mostly with unstructured textual data. Some of this data has associated elements like images, videos, audio, etc. Some of the sources of this data are websites, daily blogs, news websites and many more. Analysing all of this data at a faster rate is necessary, and many times crucial too.<\/p>\n<p>For example, a business might run a text analysis engine which processes tweets mentioning the company name and location, and analyses the emotion related to each tweet. Correct actions can be taken faster if that business gets to know about growing negative tweets in a particular location in time to save itself from a blunder. Another common example is\u00a0<strong>YouTube<\/strong>. YouTube admins and moderators get to know about the effect of a video depending on the type of comments made on the video or the video chat messages. This helps them find inappropriate content on the website much faster because they have eradicated the manual work and employed automated smart text analysis bots.<\/p>\n<p>In this lesson, we will study some of the concepts related to text analysis with the help of the NLTK library in Python. 
Some of these concepts will involve:<\/p>\n<ul>\n<li>Tokenization \u2013 how to break a piece of text into words and sentences<\/li>\n<li>Avoiding stop words based on the English language<\/li>\n<li>Performing stemming and lemmatization on a piece of text<\/li>\n<li>Identifying the tokens to be analysed<\/li>\n<\/ul>\n<p>NLP will be the main area of focus in this lesson, as it is applicable to enormous real-life scenarios where it can solve big and crucial problems. If this sounds complex, well, it can be, but the concepts are equally easy to understand if you try the examples side by side. Let\u2019s jump into installing NLTK on your machine to get started.<\/p>\n<h3><strong>Installing NLTK<\/strong><\/h3>\n<p>Just a note before starting: you can use a\u00a0<a href=\"https:\/\/linuxhint.com\/virtual_environments_python3\/\">virtual environment<\/a>\u00a0for this lesson, which can be made with the following command:<\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\">python -m virtualenv nltk<br \/>\nsource nltk\/bin\/activate<\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">Once the virtual environment is active, you can install the NLTK library within the virtual env so that the examples we create next can be executed:<\/span><\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\">pip install nltk<\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">We will make use of\u00a0<a href=\"https:\/\/www.anaconda.com\/distribution\/\">Anaconda<\/a>\u00a0and Jupyter in this lesson. If you want to install them on your machine, look at the lesson which describes \u201c<a href=\"https:\/\/linuxhint.com\/install_anaconda_python_ubuntu_1804\/\">How to Install Anaconda Python on Ubuntu 18.04 LTS<\/a>\u201d and share your feedback if you face any issues. 
To install NLTK with Anaconda, use the following command in the terminal from Anaconda:<\/span><\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\">conda install -c anaconda nltk<\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">We see something like this when we execute the above command:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-37169 aligncenter\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/1-3.png\" sizes=\"auto, (max-width: 681px) 100vw, 681px\" srcset=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/1-3.png 681w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/1-3-300x245.png 300w\" alt=\"\" width=\"681\" height=\"556\" \/><\/p>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">Once all of the needed packages are installed, we can get started with the NLTK library using the following import statement:<\/span><\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\"><span class=\"kw1\">import<\/span>\u00a0nltk<\/div>\n<\/div>\n<p>Let\u2019s get started with basic NLTK examples now that we have the prerequisite packages installed.<\/p>\n<h3><strong>Tokenization<\/strong><\/h3>\n<p>We will start with Tokenization, which is the first step in performing text analysis. A token can be any smaller part of a piece of text which can be analysed. There are two types of Tokenization which can be performed with NLTK:<\/p>\n<ul>\n<li>Sentence Tokenization<\/li>\n<li>Word Tokenization<\/li>\n<\/ul>\n<p>You can guess what happens in each type of Tokenization, so let\u2019s dive into the code examples.<\/p>\n<h3><strong>Sentence Tokenization<\/strong><\/h3>\n<p>As the name reflects, a Sentence Tokenizer breaks a piece of text into sentences. 
Let\u2019s try a simple code snippet for the same, where we make use of a text we picked from the\u00a0<a href=\"https:\/\/linuxhint.com\/apache-kafka-partitioning\/\">Apache Kafka<\/a>\u00a0tutorial. We will perform the necessary imports:<\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\"><span class=\"kw1\">import<\/span>\u00a0nltk<br \/>\n<span class=\"kw1\">from<\/span>\u00a0nltk.<span class=\"kw3\">tokenize<\/span>\u00a0<span class=\"kw1\">import<\/span>\u00a0sent_tokenize<\/div>\n<\/div>\n<p>Please note that you might face an error due to a missing dependency for nltk called\u00a0<strong>punkt<\/strong>. Add the following line right after the imports in the program to avoid any warnings:<\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\">nltk.<span class=\"me1\">download<\/span><span class=\"br0\">(<\/span><span class=\"st0\">&#8216;punkt&#8217;<\/span><span class=\"br0\">)<\/span><\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">For me, it gave the following output:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-37170 aligncenter\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/2-3.png\" sizes=\"auto, (max-width: 594px) 100vw, 594px\" srcset=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/2-3.png 594w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/2-3-300x75.png 300w\" alt=\"\" width=\"594\" height=\"149\" \/><\/p>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">Next, we make use of the sentence tokenizer we imported:<\/span><\/p>\n<div class=\"codecolorer-container text default\">\n<div class=\"text codecolorer\">text = &#8220;&#8221;&#8221;A Topic in Kafka is something where a message is sent. The consumer<br \/>\napplications which are interested in that topic pull the message inside that<br \/>\ntopic and can do anything with that data. 
Up to a specific time, any number of<br \/>\nconsumer applications can pull this message any number of times.&#8221;&#8221;&#8221;<\/p>\n<p>sentences = sent_tokenize(text)<br \/>\nprint(sentences)<\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">We see something like this when we execute the above script:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-37171 aligncenter\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/3-3.png\" sizes=\"auto, (max-width: 643px) 100vw, 643px\" srcset=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/3-3.png 643w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/3-3-300x101.png 300w\" alt=\"\" width=\"643\" height=\"216\" \/><\/p>\n<p>As expected, the text was correctly organised into sentences.<\/p>\n<h3><strong>Word Tokenization<\/strong><\/h3>\n<p>As the name reflects, a Word Tokenizer breaks a piece of text into words. Let\u2019s try a simple code snippet for the same with the same text as in the previous example:<\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\"><span class=\"kw1\">from<\/span>\u00a0nltk.<span class=\"kw3\">tokenize<\/span>\u00a0<span class=\"kw1\">import<\/span>\u00a0word_tokenize<\/p>\n<p>words\u00a0<span class=\"sy0\">=<\/span>\u00a0word_tokenize<span class=\"br0\">(<\/span>text<span class=\"br0\">)<\/span><br \/>\n<span class=\"kw1\">print<\/span><span class=\"br0\">(<\/span>words<span class=\"br0\">)<\/span><\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">We see something like this when we execute the above script:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-37172 aligncenter\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/4-3.png\" sizes=\"auto, (max-width: 641px) 100vw, 641px\" srcset=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/4-3.png 641w, 
https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/4-3-300x102.png 300w\" alt=\"\" width=\"641\" height=\"217\" \/><\/p>\n<p>As expected, the text was correctly organised into words.<\/p>\n<h3><strong>Frequency Distribution<\/strong><\/h3>\n<p>Now that we have broken up the text, we can also calculate the frequency of each word in the text we used. It is very simple to do with NLTK; here is the code snippet we use:<\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\"><span class=\"kw1\">from<\/span>\u00a0nltk.<span class=\"me1\">probability<\/span>\u00a0<span class=\"kw1\">import<\/span>\u00a0FreqDist<\/p>\n<p>distribution\u00a0<span class=\"sy0\">=<\/span>\u00a0FreqDist<span class=\"br0\">(<\/span>words<span class=\"br0\">)<\/span><br \/>\n<span class=\"kw1\">print<\/span><span class=\"br0\">(<\/span>distribution<span class=\"br0\">)<\/span><\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">We see something like this when we execute the above script:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-37173 aligncenter\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/5-3.png\" sizes=\"auto, (max-width: 387px) 100vw, 387px\" srcset=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/5-3.png 387w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/5-3-300x93.png 300w\" alt=\"\" width=\"387\" height=\"120\" \/><\/p>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">Next, we can find the most common words in the text with a simple function which accepts the number of words to show:<\/span><\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\"><span class=\"co1\"># Most common words<\/span><br \/>\ndistribution.<span class=\"me1\">most_common<\/span><span class=\"br0\">(<\/span><span class=\"nu0\">2<\/span><span class=\"br0\">)<\/span><\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">We see something like this when we 
execute the above script:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-37174 aligncenter\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/6-3.png\" alt=\"\" width=\"255\" height=\"85\" \/><\/p>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">Finally, we can make a frequency distribution plot (with the FreqDist object\u2019s\u00a0<strong>plot()<\/strong>\u00a0method) to show the words and their counts in the given text and clearly understand the distribution of words:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-37175\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/7-3.png\" sizes=\"auto, (max-width: 396px) 100vw, 396px\" srcset=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/7-3.png 396w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/7-3-300x239.png 300w\" alt=\"\" width=\"396\" height=\"316\" \/><\/p>\n<h3><strong>Stopwords<\/strong><\/h3>\n<p>Just like when we talk to another person over a call, there tends to be some noise on the call, which is unwanted information. In the same manner, text from the real world also contains noise, which is termed\u00a0<strong>Stopwords<\/strong>. Stopwords can vary from language to language, but they can be easily identified. 
Some of the Stopwords in the English language are: is, are, a, the, an, etc.<\/p>\n<p>We can look at the words which are considered Stopwords by NLTK for the English language with the following code snippet:<\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\"><span class=\"kw1\">from<\/span>\u00a0nltk.<span class=\"me1\">corpus<\/span>\u00a0<span class=\"kw1\">import<\/span>\u00a0stopwords<br \/>\nnltk.<span class=\"me1\">download<\/span><span class=\"br0\">(<\/span><span class=\"st0\">&#8216;stopwords&#8217;<\/span><span class=\"br0\">)<\/span><\/p>\n<p>language\u00a0<span class=\"sy0\">=<\/span>\u00a0<span class=\"st0\">&#8220;english&#8221;<\/span><br \/>\nstop_words\u00a0<span class=\"sy0\">=<\/span>\u00a0<span class=\"kw2\">set<\/span><span class=\"br0\">(<\/span>stopwords.<span class=\"me1\">words<\/span><span class=\"br0\">(<\/span>language<span class=\"br0\">)<\/span><span class=\"br0\">)<\/span><br \/>\n<span class=\"kw1\">print<\/span><span class=\"br0\">(<\/span>stop_words<span class=\"br0\">)<\/span><\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">Since the set of stop words can be big, it is stored as a separate dataset which can be downloaded with NLTK as shown above. 
We see something like this when we execute the above script:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-37176\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/8-3.png\" sizes=\"auto, (max-width: 812px) 100vw, 812px\" srcset=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/8-3.png 812w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/8-3-300x180.png 300w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/8-3-768x462.png 768w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/8-3-810x487.png 810w\" alt=\"\" width=\"812\" height=\"488\" \/><\/p>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">These stop words should be removed from the text if you want to perform a precise analysis on the piece of text provided. Let\u2019s remove the stop words from our textual tokens:<\/span><\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\">filtered_words\u00a0<span class=\"sy0\">=<\/span>\u00a0<span class=\"br0\">[<\/span><span class=\"br0\">]<\/span><\/p>\n<p><span class=\"kw1\">for<\/span>\u00a0word\u00a0<span class=\"kw1\">in<\/span>\u00a0words:<br \/>\n<span class=\"kw1\">if<\/span>\u00a0word\u00a0<span class=\"kw1\">not<\/span>\u00a0<span class=\"kw1\">in<\/span>\u00a0stop_words:<br \/>\nfiltered_words.<span class=\"me1\">append<\/span><span class=\"br0\">(<\/span>word<span class=\"br0\">)<\/span><\/p>\n<p>filtered_words<\/p><\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">We see something like this when we execute the above script:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-37177\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/9-2.png\" sizes=\"auto, (max-width: 321px) 100vw, 321px\" srcset=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/9-2.png 321w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/9-2-142x300.png 142w\" alt=\"\" 
width=\"321\" height=\"676\" \/><\/p>\n<h3><strong>Word Stemming<\/strong><\/h3>\n<p>A stem of a word is the base of that word. For example:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-37178\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/11-2.png\" sizes=\"auto, (max-width: 381px) 100vw, 381px\" srcset=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/11-2.png 381w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/11-2-300x269.png 300w\" alt=\"\" width=\"381\" height=\"341\" \/><\/p>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">We will perform stemming upon the filtered words from which we removed stop words in the last section. Let\u2019s write a simple code snippet where we use NLTK\u2019s stemmer to perform the operation:<\/span><\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\"><span class=\"kw1\">from<\/span>\u00a0nltk.<span class=\"me1\">stem<\/span>\u00a0<span class=\"kw1\">import<\/span>\u00a0PorterStemmer<br \/>\nps\u00a0<span class=\"sy0\">=<\/span>\u00a0PorterStemmer<span class=\"br0\">(<\/span><span class=\"br0\">)<\/span><\/p>\n<p>stemmed_words\u00a0<span class=\"sy0\">=<\/span>\u00a0<span class=\"br0\">[<\/span><span class=\"br0\">]<\/span><br \/>\n<span class=\"kw1\">for<\/span>\u00a0word\u00a0<span class=\"kw1\">in<\/span>\u00a0filtered_words:<br \/>\nstemmed_words.<span class=\"me1\">append<\/span><span class=\"br0\">(<\/span>ps.<span class=\"me1\">stem<\/span><span class=\"br0\">(<\/span>word<span class=\"br0\">)<\/span><span class=\"br0\">)<\/span><\/p>\n<p><span class=\"kw1\">print<\/span><span class=\"br0\">(<\/span><span class=\"st0\">&#8220;Stemmed Sentence:&#8221;<\/span><span class=\"sy0\">,<\/span>\u00a0stemmed_words<span class=\"br0\">)<\/span><\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">We see something like this when we execute the above script:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" 
class=\"aligncenter size-full wp-image-37179\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/12-2.png\" sizes=\"auto, (max-width: 801px) 100vw, 801px\" srcset=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/12-2.png 801w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/12-2-300x87.png 300w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/12-2-768x223.png 768w\" alt=\"\" width=\"801\" height=\"233\" \/><\/p>\n<h3><strong>POS Tagging<\/strong><\/h3>\n<p>The next step in textual analysis, after stemming, is to identify and group each word in terms of its value, i.e. whether each word is a noun, a verb or something else. This is termed Part of Speech (POS) tagging. Let\u2019s perform POS tagging now:<\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\">tokens<span class=\"sy0\">=<\/span>nltk.<span class=\"me1\">word_tokenize<\/span><span class=\"br0\">(<\/span>sentences<span class=\"br0\">[<\/span><span class=\"nu0\">0<\/span><span class=\"br0\">]<\/span><span class=\"br0\">)<\/span><br \/>\n<span class=\"kw1\">print<\/span><span class=\"br0\">(<\/span>tokens<span class=\"br0\">)<\/span><\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">We see something like this when we execute the above script:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-37180\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/14-1.png\" sizes=\"auto, (max-width: 818px) 100vw, 818px\" srcset=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/14-1.png 818w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/14-1-300x32.png 300w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/14-1-768x82.png 768w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/14-1-810x86.png 810w\" alt=\"\" width=\"818\" height=\"87\" \/><\/p>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">Now, we can perform the tagging, for which we will have 
to\u00a0download\u00a0another dataset to identify the correct tags:<\/span><\/p>\n<div class=\"codecolorer-container python default\">\n<div class=\"python codecolorer\">nltk.<span class=\"me1\">download<\/span><span class=\"br0\">(<\/span><span class=\"st0\">&#8216;averaged_perceptron_tagger&#8217;<\/span><span class=\"br0\">)<\/span><br \/>\nnltk.<span class=\"me1\">pos_tag<\/span><span class=\"br0\">(<\/span>tokens<span class=\"br0\">)<\/span><\/div>\n<\/div>\n<p class=\"Normal1\"><span lang=\"UZ-CYR\">Here is the output of the tagging:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-37181\" src=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/15-1.png\" sizes=\"auto, (max-width: 547px) 100vw, 547px\" srcset=\"https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/15-1.png 547w, https:\/\/linuxhint.com\/wp-content\/uploads\/2019\/03\/15-1-300x185.png 300w\" alt=\"\" width=\"547\" height=\"337\" \/><\/p>\n<p>Now that we have identified the tagged words, this is the dataset on which we can perform sentiment analysis to identify the emotions behind a sentence.<\/p>\n<h4><strong>Conclusion<\/strong><\/h4>\n<p>In this lesson, we looked at an excellent natural language package, NLTK, which allows us to work with unstructured textual data, identify stop words and perform deeper analysis by preparing a sharp dataset for text analysis with libraries like sklearn.<\/p>\n<p>Find all of the source code used in this lesson on\u00a0<a href=\"https:\/\/github.com\/sbmaggarwal\/NLTK-Example\">Github<\/a>.<\/p>\n<p><a href=\"https:\/\/linuxhint.com\/nltk_python_tutorial\/\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The era of data is already here. The rate at which data is generated today is higher than ever, and it is always growing. 
Most of the time, people who deal with data every day work mostly with unstructured textual data. Some of this data has associated elements like images, videos, audio, etc. Some &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.appservgrid.com\/paw92\/index.php\/2019\/03\/15\/nltk-tutorial-in-python-linux-hint\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;NLTK Tutorial in Python \u2013 Linux Hint&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-11624","post","type-post","status-publish","format-standard","hentry","category-linux"],"_links":{"self":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/11624","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/comments?post=11624"}],"version-history":[{"count":2,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/11624\/revisions"}],"predecessor-version":[{"id":11628,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/11624\/revisions\/11628"}],"wp:attachment":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/media?parent=11624"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/categories?post=11624"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/
wp-json\/wp\/v2\/tags?post=11624"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}