July 2017

Thursday, July 27, 2017

NLP Processing Using NLTK and Stanford CoreNLP


Can computers understand humans?


NLP (Natural Language Processing) is an important branch of machine learning, and therefore of artificial intelligence. NLP is the ability of a program to understand human language.

Let's look at a few practical examples used every day to understand it better:





  • Spam: most mailboxes use an anti-spam filter based on Bayesian filtering, a statistical technique built on Bayes' theorem. These filters "understand" the text and look for correlations of words that indicate spam.


  • Google Translate: you have probably all used this system; its technology relies on many algorithms, including NLP. Here the challenge is not to translate word by word, but to preserve the meaning of a sentence in another language.


  • Voice assistants: Apple's Siri and Google Assistant use NLP to turn transcribed speech into analyzed text in order to give you an answer adapted to your request.




Today we will do some NLP processing using NLTK and Stanford CoreNLP.
For that we need a Jupyter notebook and Python 3.

You can find the full notebook HERE


POS-Tag

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset.
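
Here is a minimal sketch of POS tagging with NLTK's built-in tagger (assuming NLTK is installed and its tokenizer and tagger models have been downloaded):

import nltk

# Download the tokenizer and tagger models (only needed once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "Can computers understand humans?"
tokens = nltk.word_tokenize(sentence)  # split the sentence into tokens
print(nltk.pos_tag(tokens))            # tag each token with its part of speech
# e.g. [('Can', 'MD'), ('computers', 'NNS'), ('understand', 'VB'), ('humans', 'NNS'), ('?', '.')]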
 

Parsing

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb.
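
As a sketch, here is how you could get a parse tree through NLTK's CoreNLPParser, assuming a Stanford CoreNLP server is running on the default port 9000 (see the server command in the Coref-Resolution section below):

from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')

# raw_parse returns an iterator of parse trees for the sentence
tree = next(parser.raw_parse('The quick brown fox jumps over the lazy dog.'))
tree.pretty_print()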

NER

Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names.
Once we have the results, we will remove the tokens that are not named entities.
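
A minimal sketch with NLTK's StanfordNERTagger; the classifier and jar paths below are assumptions, so point them at your own Stanford NER download:

from nltk import word_tokenize
from nltk.tag import StanfordNERTagger

# Adjust these paths to your Stanford NER installation (they are assumptions)
st = StanfordNERTagger(
    'classifiers/english.all.3class.distsim.crf.ser.gz',
    'stanford-ner.jar')

text = 'Barack Obama was born in Hawaii.'
tagged = st.tag(word_tokenize(text))

# Remove the "other things": keep only tokens tagged as named entities
entities = [(word, tag) for word, tag in tagged if tag != 'O']
print(entities)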

Coref-Resolution

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an important step for a lot of higher level NLP tasks that involve natural language understanding such as document summarization, question answering, and information extraction.


I will use the CoreNLP server for coreference resolution.

You can run the server using this command:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer



You can use the server for parsing, part-of-speech tagging, or any other Stanford CoreNLP tool.
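
For example, here is a sketch that sends text to the server over HTTP with the requests library and prints the coreference chains (the annotator list follows the CoreNLP documentation for deterministic coreference):

import json
import requests

text = 'Barack Obama was born in Hawaii. He is a politician.'

# The server accepts raw text via POST, with the annotators passed
# as a JSON "properties" URL parameter
props = {'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,dcoref',
         'outputFormat': 'json'}
resp = requests.post('http://localhost:9000/',
                     params={'properties': json.dumps(props)},
                     data=text.encode('utf-8'))
output = resp.json()

# Each coref chain lists the mentions that refer to the same entity
for chain in output['corefs'].values():
    print([mention['text'] for mention in chain])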

Arabic Named Entity Recognition


Information Extraction 

An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities and relations between those entities.

Arabic Named Entity Recognition

It's easy to do named entity recognition using Stanford CoreNLP, spaCy, NLTK, etc., but getting good results for the Arabic language is not that easy.

Today we will use an Arabic corpus called ANERCorp (you can download it using this link). It is a manually annotated corpus created to be used in Arabic NER tasks. It consists of two parts: training and testing. It has been annotated by one person in order to guarantee the coherence of the annotation.
Here is the full notebook

Let's load our files:
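
A sketch of the load with pandas, assuming the corpus file has one token per line with the word and its tag separated by a space (the filename is an assumption, adjust it to your download):

import pandas as pd

# quoting=3 (csv.QUOTE_NONE) avoids treating quote characters specially
df = pd.read_csv('ANERCorp', sep=' ', header=None, quoting=3)
df.head()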

If we look at our data frame, we notice that the last column is not necessary, so we will drop it and rename the remaining columns.
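
Something like this, assuming the extra column is the trailing one:

# Drop the unnecessary trailing column and rename the rest
df = df.drop(df.columns[-1], axis=1)
df.columns = ['word', 'tag']
df.head()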

This is how our data frame looks now.
First of all, let's understand the meaning of the entity tags. In general we have 4 entity types, but here we will find 8 tags.

How?

B-PERS: Beginning of Person Name
I-PERS: Inside of Person Name
B-LOC: Beginning of Location Name
I-LOC: Inside of Location Name
B-ORG: Beginning of Organization Name
I-ORG: Inside of Organization Name
B-MISC: Beginning of Miscellaneous Word
I-MISC: Inside of Miscellaneous Word

Those are our 8 tags: each of the 4 entity types has a Beginning variant and an Inside variant. The "O" tag marks the words that are not named entities; O stands for Outside.

Now we will split the data into train and test sets (80% and 20%), and put the words and the labels into arrays.
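
A sketch with scikit-learn's train_test_split (column names follow the renaming step above):

from sklearn.model_selection import train_test_split

words = df['word'].astype(str).values   # the tokens
labels = df['tag'].values               # their entity tags

# 80% for training, 20% for testing
x_train, x_test, y_train, y_test = train_test_split(
    words, labels, test_size=0.2, random_state=42)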

Then we will convert the data to a matrix of features using CountVectorizer and TF-IDF.
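
Something along these lines:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(x_train)   # raw token counts

tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)  # TF-IDF weights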


I will use a linear SVM to train the model (you can try any other classifier).
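
A sketch, continuing from the features above:

from sklearn.svm import LinearSVC

clf = LinearSVC()
clf.fit(x_train_tfidf, y_train)

# Evaluate on the held-out 20%, using the same vectorizer and transformer
x_test_tfidf = tfidf_transformer.transform(count_vect.transform(x_test))
print(clf.score(x_test_tfidf, y_test))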

Now we can make a test: ألمانيا means Germany, and our program shows that it is a location.
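
The check itself could look like this:

word = 'ألمانيا'  # Germany
features = tfidf_transformer.transform(count_vect.transform([word]))
print(clf.predict(features))   # a location tag such as B-LOC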





Text Classification


Today we're going to learn a great machine learning technique called document classification. We will use different classifiers such as Naive Bayes, SVM, and KNN.

Introduction to document classification

Document classification is an example of Machine Learning (ML) in the form of Natural Language Processing (NLP). By classifying text, we are aiming to assign one or more classes or categories to a document, making it easier to manage and sort. This is especially useful for publishers, news sites, blogs or anyone who deals with a lot of content. 

If you search the net for document classification, you will find lots of applications of text classification in the commercial world; email spam filtering is perhaps the most ubiquitous.

Today we will work with the AG's News dataset (link to download). Put another way: given a piece of text, determine whether it belongs to the World, Sports, Business, or Science and Tech category.

We will use:
  • Python 3
  • Jupyter notebook

Here you can find the full notebook

Load Data

When opening the CSV file, we can see 3 columns: [category, title, text]. We will be interested in 2 of them: category and text.
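
A sketch of the load, assuming the CSV has no header row (the filename is an assumption, adjust it to your download):

import pandas as pd

df = pd.read_csv('ag_news.csv', header=None,
                 names=['category', 'title', 'text'])
df.head()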


And now we can see our data frame.



Then we will replace NaN values, if they exist, with a space.

I am going to delete the title column; if you don't want to delete it, you can join title and text into a single field (for example a new column called full_text) and drop the others to save memory. In my case, I just deleted the titles.

We will also map our category column from numbers to class names, as in the sketch below.
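
A sketch of these three steps (the numeric-to-name mapping follows the usual AG's News convention):

# Replace NaN values, if any, with a space
df = df.fillna(' ')

# Option 1: join title and text into one field called full_text
# df['full_text'] = df['title'] + ' ' + df['text']
# Option 2 (what I did): simply drop the title column
df = df.drop('title', axis=1)

# Map the numeric categories to readable class names
class_names = {1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tech'}
df['category'] = df['category'].map(class_names)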
Now everything is ready and we can begin. First of all, we will clean our data and stem it. The goal of stemming is to "normalize" words to their common base form, which is useful for many text-processing applications, especially document classification in our case.

So we have this function, which keeps only alphanumeric characters and replaces all whitespace with a single space.
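
A sketch of such a function, with an English Snowball stemmer added on top:

import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def clean_text(text):
    # Keep only alphanumeric characters
    text = re.sub(r'[^A-Za-z0-9]+', ' ', str(text))
    # Collapse all whitespace runs into a single space, lowercase the result
    text = re.sub(r'\s+', ' ', text).strip().lower()
    # Stem every word to its base form
    return ' '.join(stemmer.stem(word) for word in text.split())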


Now we will convert the data frame series to lists. Then we will convert our text to a matrix of features using scikit-learn's CountVectorizer and TfidfTransformer.
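
Something like this; I also split off a test set here so the classifiers below can be evaluated (the split itself is an assumption, the notebook may do it differently):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

texts = df['text'].apply(clean_text).tolist()
labels = df['category'].tolist()

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

count_vect = CountVectorizer()
tfidf = TfidfTransformer()
x_train_tfidf = tfidf.fit_transform(count_vect.fit_transform(x_train))
x_test_tfidf = tfidf.transform(count_vect.transform(x_test))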

Now we will build our models! 
Let's begin with Naive Bayes (I will try Bernoulli Naive Bayes and Multinomial Naive Bayes).

But first, let's write a confusion matrix function to use with all the classifiers.
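
A minimal version based on scikit-learn's confusion_matrix and matplotlib:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, classes, title='Confusion matrix'):
    # Rows are true labels, columns are predicted labels
    cm = confusion_matrix(y_true, y_pred, labels=classes)
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title(title)
    plt.colorbar()
    ticks = np.arange(len(classes))
    plt.xticks(ticks, classes, rotation=45)
    plt.yticks(ticks, classes)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()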

Naive Bayes

Bernoulli Naive Bayes
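
A sketch, continuing from the TF-IDF features above:

from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()
bnb.fit(x_train_tfidf, y_train)
bnb_pred = bnb.predict(x_test_tfidf)
print('Bernoulli NB accuracy:', bnb.score(x_test_tfidf, y_test))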


Let's plot its confusion matrix.
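
Using the helper defined earlier, for instance:

classes = ['World', 'Sports', 'Business', 'Sci/Tech']
plot_confusion_matrix(y_test, bnb_pred, classes, title='Bernoulli Naive Bayes')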

Multinomial Naive Bayes
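
Same pattern with the multinomial variant:

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(x_train_tfidf, y_train)
mnb_pred = mnb.predict(x_test_tfidf)
print('Multinomial NB accuracy:', mnb.score(x_test_tfidf, y_test))
plot_confusion_matrix(y_test, mnb_pred, classes, title='Multinomial Naive Bayes')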




SVM

Linear SVM
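
A sketch with scikit-learn's LinearSVC:

from sklearn.svm import LinearSVC

svm = LinearSVC()
svm.fit(x_train_tfidf, y_train)
svm_pred = svm.predict(x_test_tfidf)
print('Linear SVM accuracy:', svm.score(x_test_tfidf, y_test))
plot_confusion_matrix(y_test, svm_pred, classes, title='Linear SVM')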

Non-Linear SVM
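
A sketch with an RBF kernel (the kernel choice is an assumption; note that kernel SVMs get slow on a corpus this size, so you may want to train on a subset):

from sklearn.svm import SVC

rbf_svm = SVC(kernel='rbf')
rbf_svm.fit(x_train_tfidf, y_train)
print('RBF SVM accuracy:', rbf_svm.score(x_test_tfidf, y_test))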

KNN
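
A sketch with scikit-learn's KNeighborsClassifier (k=5 is an assumption; try a few values):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train_tfidf, y_train)
knn_pred = knn.predict(x_test_tfidf)
print('KNN accuracy:', knn.score(x_test_tfidf, y_test))
plot_confusion_matrix(y_test, knn_pred, classes, title='KNN')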

Never hesitate to ask questions if you have any :)