This repository has been archived by the owner on Jan 25, 2023. It is now read-only.

Arabic Dialect Identifier (for RedCrow.co)

This is an automatic classification approach to identifying Arabic dialects. At this point, it can distinguish between two dialects: Egyptian Arabic and Iraqi Arabic. More dialects will be added soon.

Prerequisites

Usage guide

Import the arabicdialect class:

from arabicdialect import arabicdialect

Then instantiate the classifier with one of three values: egyptian, iraqi, or multi. For example:

ar = arabicdialect('multi')

Two methods are available: classify_one and classify_many. The former takes a single string as its argument; the latter takes a list of documents.
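The relationship between the two methods can be sketched as follows. This is only an illustration: CannedClassifier is a hypothetical stand-in that looks up pre-set answers, not the arabicdialect model, and the assumption that classify_many returns one prediction per document in input order is not stated in this README.

```python
class CannedClassifier:
    """Hypothetical stand-in with the README's two method names.

    It is NOT the arabicdialect model: it only looks up pre-set answers,
    so the example runs without the package installed.
    """

    def __init__(self, answers):
        self.answers = answers  # dict: document -> canned prediction

    def classify_one(self, document):
        # One string in, one prediction out.
        return self.answers[document]

    def classify_many(self, documents):
        # Assumption: one prediction per document, in input order.
        return [self.classify_one(doc) for doc in documents]


docs = [u'doc a', u'doc b']
clf = CannedClassifier({u'doc a': True, u'doc b': False})

single = clf.classify_one(docs[0])   # a single prediction
batch = clf.classify_many(docs)      # a list of predictions

print(single)
print(batch)
```

Under that assumption, classify_many(docs)[i] and classify_one(docs[i]) should agree for every document.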

Egyptian

The dialect-specific classifier for Egyptian returns boolean values, i.e. the document is classified as being either Egyptian (True) or Non-Egyptian (False).

from arabicdialect import arabicdialect
ar_eg = arabicdialect('egyptian')
non_egyptian_tweet = u'ما رتبت فوضاك غير الي نفت روحها في واد صمتك للأخير . غير ذي زرع ... ولكنه نبت  صار موطن للشعر حول الغدير  #وجد'
egyptian_tweet = u'-حبيبي إيه رأيك في عينيا في الشمس؟ =و الله أنا مش شايف في الشمس غير شنبك.'
print(ar_eg.classify_one(egyptian_tweet))
print(ar_eg.classify_one(non_egyptian_tweet))
print(ar_eg.classify_many([non_egyptian_tweet,egyptian_tweet]))
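Because the result is boolean, it can be used to filter a collection of documents. The helper below is a minimal sketch, not part of the package: keep_matching and CannedClassifier are hypothetical names, and the code assumes classify_many returns one boolean per document in input order.

```python
def keep_matching(classifier, documents):
    """Return only the documents the classifier labels True."""
    # Assumption: classify_many returns one boolean per document, in order.
    predictions = classifier.classify_many(documents)
    return [doc for doc, hit in zip(documents, predictions) if hit]


class CannedClassifier:
    """Hypothetical stand-in (NOT the real model): returns pre-set answers."""

    def __init__(self, answers):
        self.answers = answers  # dict: document -> canned boolean

    def classify_many(self, documents):
        return [self.answers[doc] for doc in documents]


stub = CannedClassifier({u'tweet 1': True, u'tweet 2': False, u'tweet 3': True})
kept = keep_matching(stub, [u'tweet 1', u'tweet 2', u'tweet 3'])
print(kept)
```

With the real package installed, passing ar_eg in place of stub should work the same way, provided the assumption about the return value holds.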

Iraqi

The dialect-specific classifier for Iraqi returns boolean values, i.e. the document is classified as being either Iraqi (True) or Non-Iraqi (False).

from arabicdialect import arabicdialect
ar_ir = arabicdialect('iraqi')
non_iraqi_tweet = u'سب کا وقت مقرر ہے لیکن دعا ہے ہمارے اعمال ایسے ہوں کہ جب ہم دنیا سے جائیں تو لوگ دعا دیں نہ کہ خوش ہوں کہ برائی کم ہوئی'
iraqi_tweet = u'خوش امتحان اليوم كله غش 😂😂'
print(ar_ir.classify_one(iraqi_tweet))
print(ar_ir.classify_one(non_iraqi_tweet))
print(ar_ir.classify_many([non_iraqi_tweet,iraqi_tweet]))
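The boolean results can also be summarised, for instance as the share of documents flagged as Iraqi in a batch. This is a sketch under assumptions: share_flagged and CannedClassifier are hypothetical names, the stub only returns pre-set answers, and classify_many is assumed to return one boolean per document.

```python
def share_flagged(classifier, documents):
    """Fraction of documents labelled True; 0.0 for an empty list."""
    if not documents:
        return 0.0
    # Assumption: classify_many returns one boolean per document.
    predictions = classifier.classify_many(documents)
    flagged = sum(1 for hit in predictions if hit)
    return flagged / float(len(documents))


class CannedClassifier:
    """Hypothetical stand-in (NOT the real model): returns pre-set answers."""

    def __init__(self, answers):
        self.answers = answers  # dict: document -> canned boolean

    def classify_many(self, documents):
        return [self.answers[doc] for doc in documents]


stub = CannedClassifier({u'a': True, u'b': False, u'c': True, u'd': False})
share = share_flagged(stub, [u'a', u'b', u'c', u'd'])
print(share)
```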

Multi

This classifier predicts whether a document is in the Iraqi or the Egyptian dialect. It therefore returns a class label, either iraqi or egyptian, rather than a boolean.

from arabicdialect import arabicdialect
ar_multi = arabicdialect('multi')
egyptian_tweet = u'-حبيبي إيه رأيك في عينيا في الشمس؟ =و الله أنا مش شايف في الشمس غير شنبك.'
iraqi_tweet = u'خوش امتحان اليوم كله غش 😂😂'
print(ar_multi.classify_one(iraqi_tweet))
print(ar_multi.classify_one(egyptian_tweet))
print(ar_multi.classify_many([egyptian_tweet,iraqi_tweet]))
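Since the multi classifier returns a label per document, a batch can be grouped by predicted dialect. The following is a minimal sketch, not part of the package: group_by_label and CannedClassifier are hypothetical names, and the code assumes the labels come back as the strings 'iraqi' and 'egyptian', one per document in input order.

```python
from collections import defaultdict


def group_by_label(classifier, documents):
    """Map each predicted label to the list of documents that received it."""
    groups = defaultdict(list)
    # Assumption: classify_many returns one label per document, in order.
    for doc, label in zip(documents, classifier.classify_many(documents)):
        groups[label].append(doc)
    return dict(groups)


class CannedClassifier:
    """Hypothetical stand-in (NOT the real model): returns pre-set labels."""

    def __init__(self, answers):
        self.answers = answers  # dict: document -> canned label

    def classify_many(self, documents):
        return [self.answers[doc] for doc in documents]


stub = CannedClassifier({
    u'tweet 1': 'egyptian',
    u'tweet 2': 'iraqi',
    u'tweet 3': 'egyptian',
})
groups = group_by_label(stub, [u'tweet 1', u'tweet 2', u'tweet 3'])
print(groups)
```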