Arabic named entity recognition

Thursday, July 27, 2017

Information Extraction 

An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities and relations between those entities.

Arabic named entity recognition

Running named entity recognition with Stanford CoreNLP, spaCy, NLTK, etc. is easy, but getting good results for the Arabic language is not.

Today we will use an Arabic corpus called ANERCorp (you can download it using this link). It is a manually annotated Arabic corpus created for Arabic NER tasks. It consists of two parts: training and testing. It was annotated by a single person in order to keep the annotation consistent.
Here is the full notebook.

Let's load our files
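The original code for this step is not available here, so the following is only a minimal sketch of how such a file could be loaded with pandas. It assumes one token per line, with the word and its tag separated by a single space; the helper name `load_corpus` and the in-memory sample are hypothetical stand-ins, not part of ANERCorp.

```python
import csv
import io

import pandas as pd


def load_corpus(source):
    """Read a 'word tag' file (one token per line) into a DataFrame."""
    return pd.read_csv(
        source,
        sep=" ",                 # assumed separator between word and tag
        header=None,             # assumed: the file has no header row
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )


# Stand-in for the real file so the sketch runs without the download.
sample = io.StringIO("فرانكفورت B-LOC\nأعلن O\nاتحاد B-ORG\nصناعة I-ORG\n")
df = load_corpus(sample)
print(df.shape)
```

With the real corpus, the path of the downloaded file would be passed to `load_corpus` instead of the in-memory sample.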













If we look at our data frame, we can see that the last column is not needed, so we will drop it and rename the remaining columns.
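A small sketch of that cleanup step, run on a toy frame rather than the real corpus. The empty third column and the new column names `word` and `tag` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loaded corpus (assumed: an empty last column).
df = pd.DataFrame(
    {0: ["ألمانيا", "أعلن"], 1: ["B-LOC", "O"], 2: [np.nan, np.nan]}
)

df = df.drop(columns=df.columns[-1])  # drop the last column
df.columns = ["word", "tag"]          # hypothetical column names
print(df)
```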







This is what our data frame looks like now.

First of all, let's understand the meaning of the entity tags. In general we have 4 entity types, but in the corpus we will find 8 tags.

How?

B-PERS: Beginning of Person Name
I-PERS: Inside of Person Name
B-LOC: Beginning of Location Name
I-LOC: Inside of Location Name
B-ORG: Beginning of Organization Name
I-ORG: Inside of Organization Name
B-MISC: Beginning of Miscellaneous Word
I-MISC: Inside of Miscellaneous Word

Those are our 8 tags: each of the 4 entity types has a beginning tag and an inside tag. The "O" tag marks the words that are not part of any named entity (in this tagging scheme, O stands for "outside").
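To make the B-/I-/O convention concrete, here is a minimal sketch that groups tagged tokens back into whole entities. The function name `group_entities` and the example sentence are hypothetical; the tag names follow the list above.

```python
def group_entities(tokens, tags):
    """Merge B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_words:
            entities.append((" ".join(current_words), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_words.append(token)
        elif prefix in ("B", "I"):
            # A new entity starts (a stray I- tag is treated like B-).
            flush()
            current_words, current_type = [token], entity_type
        else:
            # "O": the token is outside any named entity.
            flush()
            current_words, current_type = [], None
    flush()
    return entities


tokens = ["أنجيلا", "ميركل", "زارت", "ألمانيا"]
tags = ["B-PERS", "I-PERS", "O", "B-LOC"]
print(group_entities(tokens, tags))
```

The first two tokens form a single PERS entity because the second one carries an I-PERS tag; the last token is a one-word LOC entity.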

Now we will split the data into training and test sets (80% and 20%), and we will put the words and the labels into arrays.
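One way to do an 80/20 split is scikit-learn's `train_test_split`. The sketch below uses ten toy words in place of the corpus; the variable names and the fixed `random_state` are assumptions, not taken from the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the "word" and "tag" columns of the data frame.
words = np.array(["ألمانيا", "فرنسا", "برلين", "باريس", "أعلن",
                  "قال", "اتحاد", "محمد", "في", "من"])
tags = np.array(["B-LOC", "B-LOC", "B-LOC", "B-LOC", "O",
                 "O", "B-ORG", "B-PERS", "O", "O"])

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    words, tags, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```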

















Then we will convert the data into a matrix of features using CountVectorizer and TF-IDF.
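A minimal sketch of this step with scikit-learn's `CountVectorizer` and `TfidfTransformer`, using default settings and a few toy words (both are assumptions; the original notebook may have used other parameters).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy training words; in the post each row of the corpus is a single word.
train_words = ["ألمانيا", "فرنسا", "أعلن", "اتحاد", "ألمانيا"]

# Count how often each vocabulary term occurs in each row.
# Note: the default token pattern ignores single-character tokens.
count_vect = CountVectorizer()
counts = count_vect.fit_transform(train_words)

# Re-weight the raw counts with TF-IDF.
tfidf = TfidfTransformer()
features = tfidf.fit_transform(counts)

print(features.shape)  # (number of rows, vocabulary size)
```

The result is a sparse matrix with one row per input word and one column per distinct term seen during fitting.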


I will use a linear SVM to train the model (you can try any other classifier).
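The training step could look roughly like the sketch below, which fits scikit-learn's `LinearSVC` on TF-IDF features. The toy data, the variable names, and the fixed `random_state` are assumptions; the real model would be fit on the training split of the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

# Toy training data standing in for the training split.
train_words = ["ألمانيا", "فرنسا", "أعلن", "قال", "محمد", "اتحاد"]
train_tags = ["B-LOC", "B-LOC", "O", "O", "B-PERS", "B-ORG"]

# Build the feature matrix from the words.
count_vect = CountVectorizer()
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(count_vect.fit_transform(train_words))

# Fit a linear SVM that maps each feature vector to a tag.
clf = LinearSVC(random_state=0)
clf.fit(X_train, train_tags)

# Accuracy on the toy training data itself (not a real evaluation).
train_accuracy = clf.score(X_train, train_tags)
print(train_accuracy)
```

Any other scikit-learn classifier with `fit` and `predict` methods could be swapped in for `LinearSVC` here.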









And now we can run a test.
ألمانيا means Germany,
and our program shows that Germany is a location.
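A test like this one can be sketched as follows. To stay runnable on its own, the example fits a toy pipeline first; the training words, the use of `make_pipeline`, and the variable names are assumptions rather than the original notebook's code.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data standing in for the real corpus.
train_words = ["ألمانيا", "فرنسا", "أعلن", "قال", "محمد", "اتحاد"]
train_tags = ["B-LOC", "B-LOC", "O", "O", "B-PERS", "B-ORG"]

# Chain vectorizer, TF-IDF weighting and classifier into one model.
model = make_pipeline(
    CountVectorizer(), TfidfTransformer(), LinearSVC(random_state=0)
)
model.fit(train_words, train_tags)

# Ask the model for the tag of a single word.
predicted = model.predict(["ألمانيا"])[0]
print(predicted)
```

Note that a word never seen during training maps to an all-zero feature vector in this setup, so its predicted tag is driven by the classifier's intercepts alone.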




