July 2017

Thursday, July 27, 2017

NLP Processing Using NLTK and Stanford CoreNLP


Can computers understand humans?


NLP (Natural Language Processing) is an important branch of machine learning, and therefore of artificial intelligence. NLP is the ability of a program to understand human language.

Let's look at a few practical examples used every day to understand it better:





  • Spam: most mailboxes use an anti-spam filter based on Bayesian filtering, a statistical technique built on Bayes' theorem. These filters "understand" the text and look for correlations of words that indicate spam.


  • Google Translate: you have probably all used this system; its technology relies on many algorithms, including NLP. Here the challenge is not to translate word by word, but to preserve the meaning of a sentence in another language.


  • Voice assistants: Apple's Siri and Google Assistant use NLP to turn transcribed speech into analyzed text in order to give you an answer adapted to your request.




Today we will do some NLP processing using NLTK and Stanford CoreNLP.
For that we need a Jupyter notebook and Python 3.

You can find the full notebook HERE


POS-Tag

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset.
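
Here is a minimal sketch of POS tagging with NLTK's built-in tagger (assuming NLTK is installed and its tokenizer and tagger models have been downloaded):

import nltk

# Download the tokenizer and tagger models (only needed once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "Can computers understand humans?"
tokens = nltk.word_tokenize(sentence)  # split the sentence into tokens
print(nltk.pos_tag(tokens))            # tag each token with its part of speech
# e.g. [('Can', 'MD'), ('computers', 'NNS'), ('understand', 'VB'), ('humans', 'NNS'), ('?', '.')]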
 

Parsing

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb.
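
As a sketch, here is how you could get a parse tree through NLTK's CoreNLPParser, assuming a Stanford CoreNLP server is running on the default port 9000 (see the server command in the Coref-Resolution section below):

from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')

# raw_parse returns an iterator of parse trees for the sentence
tree = next(parser.raw_parse('The quick brown fox jumps over the lazy dog.'))
tree.pretty_print()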

NER

Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names.
Once we have the results, we will remove the tokens that are not named entities.
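
A minimal sketch with NLTK's StanfordNERTagger; the classifier and jar paths below are assumptions, so point them at your own Stanford NER download:

from nltk import word_tokenize
from nltk.tag import StanfordNERTagger

# Adjust these paths to your Stanford NER installation (they are assumptions)
st = StanfordNERTagger(
    'classifiers/english.all.3class.distsim.crf.ser.gz',
    'stanford-ner.jar')

text = 'Barack Obama was born in Hawaii.'
tagged = st.tag(word_tokenize(text))

# Remove the "other things": keep only tokens tagged as named entities
entities = [(word, tag) for word, tag in tagged if tag != 'O']
print(entities)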

Coref-Resolution

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an important step for a lot of higher level NLP tasks that involve natural language understanding such as document summarization, question answering, and information extraction.


I will use the CoreNLP server for coreference resolution.

You can run the server using this command:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer



You can use the server for parsing, part-of-speech tagging, or any other Stanford CoreNLP tool.
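
For example, here is a sketch that sends text to the server over HTTP with the requests library and prints the coreference chains (the annotator list follows the CoreNLP documentation for deterministic coreference):

import json
import requests

text = 'Barack Obama was born in Hawaii. He is a politician.'

# The server accepts raw text via POST, with the annotators passed
# as a JSON "properties" URL parameter
props = {'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,dcoref',
         'outputFormat': 'json'}
resp = requests.post('http://localhost:9000/',
                     params={'properties': json.dumps(props)},
                     data=text.encode('utf-8'))
output = resp.json()

# Each coref chain lists the mentions that refer to the same entity
for chain in output['corefs'].values():
    print([mention['text'] for mention in chain])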

Arabic Named Entity Recognition


Information Extraction 

An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities and relations between those entities.

Arabic Named Entity Recognition

It's easy to do named entity recognition using Stanford CoreNLP, spaCy, NLTK, etc., but getting good results for the Arabic language is not that easy.

Today we will use an Arabic corpus called ANERCorp (you can download it using this link). It is a manually annotated corpus created to be used in Arabic NER tasks. It consists of two parts: training and testing. It has been annotated by one person in order to guarantee the coherence of the annotation.
Here is the full notebook

Let's load our files:
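
A sketch of the load with pandas, assuming the corpus file has one token per line with the word and its tag separated by a space (the filename is an assumption, adjust it to your download):

import pandas as pd

# quoting=3 (csv.QUOTE_NONE) avoids treating quote characters specially
df = pd.read_csv('ANERCorp', sep=' ', header=None, quoting=3)
df.head()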

If we look at our data frame, we notice that the last column is not necessary, so we will drop it and rename the remaining columns.
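
Something like this, assuming the extra column is the trailing one:

# Drop the unnecessary trailing column and rename the rest
df = df.drop(df.columns[-1], axis=1)
df.columns = ['word', 'tag']
df.head()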

This is how our data frame looks now.
First of all, let's understand the meaning of the entity tags. In general we have 4 entity types, but here we will find 8 tags.

How?

B-PERS: Beginning of Person Name
I-PERS: Inside of Person Name
B-LOC: Beginning of Location Name
I-LOC: Inside of Location Name
B-ORG: Beginning of Organization Name
I-ORG: Inside of Organization Name
B-MISC: Beginning of Miscellaneous Word
I-MISC: Inside of Miscellaneous Word

Those are our 8 tags: each of the 4 entity types has a Beginning variant and an Inside variant. The "O" tag marks the words that are not named entities; O stands for Outside.

Now we will split the data into train and test sets (80% and 20%), and put the words and the labels into arrays.
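
A sketch with scikit-learn's train_test_split (column names follow the renaming step above):

from sklearn.model_selection import train_test_split

words = df['word'].astype(str).values   # the tokens
labels = df['tag'].values               # their entity tags

# 80% for training, 20% for testing
x_train, x_test, y_train, y_test = train_test_split(
    words, labels, test_size=0.2, random_state=42)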

Then we will convert the data to a matrix of features using CountVectorizer and TF-IDF.
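
Something along these lines:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(x_train)   # raw token counts

tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)  # TF-IDF weights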


I will use a linear SVM to train the model (you can try any other classifier).
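
A sketch, continuing from the features above:

from sklearn.svm import LinearSVC

clf = LinearSVC()
clf.fit(x_train_tfidf, y_train)

# Evaluate on the held-out 20%, using the same vectorizer and transformer
x_test_tfidf = tfidf_transformer.transform(count_vect.transform(x_test))
print(clf.score(x_test_tfidf, y_test))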

Now we can make a test: ألمانيا means Germany, and our program shows that it is a location.
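
The check itself could look like this:

word = 'ألمانيا'  # Germany
features = tfidf_transformer.transform(count_vect.transform([word]))
print(clf.predict(features))   # a location tag such as B-LOC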





Text Classification


Today we're going to learn a great machine learning technique called document classification. We will use different classifiers such as Naive Bayes, SVM, and KNN.

Introduction to document classification

Document classification is an example of Machine Learning (ML) in the form of Natural Language Processing (NLP). By classifying text, we are aiming to assign one or more classes or categories to a document, making it easier to manage and sort. This is especially useful for publishers, news sites, blogs or anyone who deals with a lot of content. 

If you search the net for document classification, you will find lots of applications of text classification in the commercial world; email spam filtering is perhaps the most ubiquitous.

Today we will work with the AG's News dataset (link to download). Put another way: given a piece of text, determine whether it belongs to the World, Sports, Business, or Science and Tech category.

We will use:
  • Python 3
  • Jupyter notebook

Here you can find the full notebook

Load Data

When opening the CSV file, we can see 3 columns: [category, title, text]. We will be interested in 2 of them: category and text.
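
A sketch of the load, assuming the CSV has no header row (the filename is an assumption, adjust it to your download):

import pandas as pd

df = pd.read_csv('ag_news.csv', header=None,
                 names=['category', 'title', 'text'])
df.head()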


And now we can see our data frame.



Then we will replace NaN values, if they exist, with a space.

I am going to delete the title column; if you don't want to delete it, you can join title and text into a single field (for example a new column called full_text) and drop the others to save memory. In my case, I just deleted the titles.

We will also map our category column from numbers to class names, as in the sketch below.
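
A sketch of these three steps (the numeric-to-name mapping follows the usual AG's News convention):

# Replace NaN values, if any, with a space
df = df.fillna(' ')

# Option 1: join title and text into one field called full_text
# df['full_text'] = df['title'] + ' ' + df['text']
# Option 2 (what I did): simply drop the title column
df = df.drop('title', axis=1)

# Map the numeric categories to readable class names
class_names = {1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tech'}
df['category'] = df['category'].map(class_names)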
Now everything is ready and we can begin. First of all, we will clean our data and stem it. The goal of stemming is to "normalize" words to their common base form, which is useful for many text-processing applications, especially document classification in our case.

So we have this function, which keeps only alphanumeric characters and replaces all whitespace with a single space.
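
A sketch of such a function, with an English Snowball stemmer added on top:

import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def clean_text(text):
    # Keep only alphanumeric characters
    text = re.sub(r'[^A-Za-z0-9]+', ' ', str(text))
    # Collapse all whitespace runs into a single space, lowercase the result
    text = re.sub(r'\s+', ' ', text).strip().lower()
    # Stem every word to its base form
    return ' '.join(stemmer.stem(word) for word in text.split())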


Now we will convert the data frame series to lists. Then we will convert our text to a matrix of features using scikit-learn's CountVectorizer and TfidfTransformer.
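
Something like this; I also split off a test set here so the classifiers below can be evaluated (the split itself is an assumption, the notebook may do it differently):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

texts = df['text'].apply(clean_text).tolist()
labels = df['category'].tolist()

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

count_vect = CountVectorizer()
tfidf = TfidfTransformer()
x_train_tfidf = tfidf.fit_transform(count_vect.fit_transform(x_train))
x_test_tfidf = tfidf.transform(count_vect.transform(x_test))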

Now we will build our models! 
Let's begin with Naive Bayes (I will try Bernoulli Naive Bayes and Multinomial Naive Bayes).

But first, let's write a confusion matrix function to use with all the classifiers.
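
A minimal version based on scikit-learn's confusion_matrix and matplotlib:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, classes, title='Confusion matrix'):
    # Rows are true labels, columns are predicted labels
    cm = confusion_matrix(y_true, y_pred, labels=classes)
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title(title)
    plt.colorbar()
    ticks = np.arange(len(classes))
    plt.xticks(ticks, classes, rotation=45)
    plt.yticks(ticks, classes)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()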

Naive Bayes

Bernoulli Naive Bayes
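
A sketch, continuing from the TF-IDF features above:

from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()
bnb.fit(x_train_tfidf, y_train)
bnb_pred = bnb.predict(x_test_tfidf)
print('Bernoulli NB accuracy:', bnb.score(x_test_tfidf, y_test))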


Let's plot its confusion matrix.
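
Using the helper defined earlier, for instance:

classes = ['World', 'Sports', 'Business', 'Sci/Tech']
plot_confusion_matrix(y_test, bnb_pred, classes, title='Bernoulli Naive Bayes')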

Multinomial Naive Bayes
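
Same pattern with the multinomial variant:

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(x_train_tfidf, y_train)
mnb_pred = mnb.predict(x_test_tfidf)
print('Multinomial NB accuracy:', mnb.score(x_test_tfidf, y_test))
plot_confusion_matrix(y_test, mnb_pred, classes, title='Multinomial Naive Bayes')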




SVM

Linear SVM
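
A sketch with scikit-learn's LinearSVC:

from sklearn.svm import LinearSVC

svm = LinearSVC()
svm.fit(x_train_tfidf, y_train)
svm_pred = svm.predict(x_test_tfidf)
print('Linear SVM accuracy:', svm.score(x_test_tfidf, y_test))
plot_confusion_matrix(y_test, svm_pred, classes, title='Linear SVM')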

Non-Linear SVM
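
A sketch with an RBF kernel (the kernel choice is an assumption; note that kernel SVMs get slow on a corpus this size, so you may want to train on a subset):

from sklearn.svm import SVC

rbf_svm = SVC(kernel='rbf')
rbf_svm.fit(x_train_tfidf, y_train)
print('RBF SVM accuracy:', rbf_svm.score(x_test_tfidf, y_test))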

KNN
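
A sketch with scikit-learn's KNeighborsClassifier (k=5 is an assumption; try a few values):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train_tfidf, y_train)
knn_pred = knn.predict(x_test_tfidf)
print('KNN accuracy:', knn.score(x_test_tfidf, y_test))
plot_confusion_matrix(y_test, knn_pred, classes, title='KNN')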

Never hesitate to ask questions if you have any :)