initial release 4.8

mesolitica · Jun 1, 2022 · 0cc9e73
1 parent 68e0b23
commit 0cc9e73
Showing 18 changed files with 5,649 additions and 17 deletions.
diff --git a/docs/index.rst b/docs/index.rst
@@ -63,6 +63,8 @@ Contents:
    :caption: Convert Module
 
    load-phoneme
+   load-rumi-jawi
+   load-jawi-rumi
 
 .. toctree::
    :maxdepth: 2

diff --git a/docs/load-jawi-rumi.ipynb b/docs/load-jawi-rumi.ipynb
@@ -0,0 +1,303 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Jawi-to-Rumi"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<div class=\"alert alert-info\">\n",
+    "\n",
+    "This tutorial is available as an IPython notebook at [Malaya/example/jawi-rumi](https://github.com/huseinzol05/Malaya/tree/master/example/jawi-rumi).\n",
+    "    \n",
+    "</div>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<div class=\"alert alert-info\">\n",
+    "\n",
+    "This module is trained on both standard and local (including social media) language structures, so it is safe to use for both.\n",
+    "    \n",
+    "</div>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Explanation\n",
+    "\n",
+    "The training data originates from https://www.ejawi.net/converterV2.php?go=rumi, which converts Rumi to Jawi using a heuristic method. Malaya takes that heuristic output and inverts the dataset, so that a deep learning model learns the Jawi-to-Rumi mapping.\n",
+    "\n",
+    "`چوميل` -> `comel`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 5.95 s, sys: 1.15 s, total: 7.1 s\n",
+      "Wall time: 9.05 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "import malaya"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Use deep learning model\n",
+    "\n",
+    "Load LSTM + Bahdanau Attention Jawi to Rumi model.\n",
+    "\n",
+    "If you are using TensorFlow 2, make sure TensorFlow Addons is already installed:\n",
+    "\n",
+    "```bash\n",
+    "pip install tensorflow-addons -U\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "```python\n",
+    "def deep_model(quantized: bool = False, **kwargs):\n",
+    "    \"\"\"\n",
+    "    Load LSTM + Bahdanau Attention Jawi to Rumi model.\n",
+    "    Original size 11MB, quantized size 2.92MB.\n",
+    "    CER on test set: 0.09239719040982326\n",
+    "    WER on test set: 0.33811816744187656\n",
+    "\n",
+    "    Parameters\n",
+    "    ----------\n",
+    "    quantized : bool, optional (default=False)\n",
+    "        if True, will load 8-bit quantized model.\n",
+    "        A quantized model is not necessarily faster; this depends entirely on the machine.\n",
+    "\n",
+    "    Returns\n",
+    "    -------\n",
+    "    result: malaya.model.tf.Seq2SeqLSTM class\n",
+    "    \"\"\"\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "530a47ea5c514ae9aa68c8a4e1e29d9c",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=11034253.0, style=ProgressStyle(descrip…"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "model = malaya.jawi_rumi.deep_model()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Load Quantized model\n",
+    "\n",
+    "To load the 8-bit quantized model, pass `quantized = True`; the default is `False`.\n",
+    "\n",
+    "Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Load quantized model will cause accuracy drop.\n"
+     ]
+    },
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "6d1d22a65abd48a28f9a1eb62f2d0c4d",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": [
+       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=2926859.0, style=ProgressStyle(descript…"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "quantized_model = malaya.jawi_rumi.deep_model(quantized = True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Predict\n",
+    "\n",
+    "```python\n",
+    "def predict(self, strings: List[str], beam_search: bool = False):\n",
+    "    \"\"\"\n",
+    "    Convert to target string.\n",
+    "\n",
+    "    Parameters\n",
+    "    ----------\n",
+    "    strings : List[str]\n",
+    "    beam_search : bool, optional (default=False)\n",
+    "        If True, use beam search decoder, else use greedy decoder.\n",
+    "\n",
+    "    Returns\n",
+    "    -------\n",
+    "    result: List[str]\n",
+    "    \"\"\"\n",
+    "```\n",
+    "\n",
+    "To speed up inference, set `beam_search = False` so that the greedy decoder is used."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['saya suka makan im',\n",
+       " 'eak ack kotok',\n",
+       " 'aisuk berthday saya, jegan lupa bawak hadiah']"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "model.predict(['ساي سوك ماكن ايم', 'اياق اچق كوتوق', 'ايسوق بيرثداي ساي، جڬن لوڤا باوق هديه'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['saya suka makan im',\n",
+       " 'eak ack kotok',\n",
+       " 'aisuk berthday saya, jegan lopa bawak hadiah']"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "quantized_model.predict(['ساي سوك ماكن ايم', 'اياق اچق كوتوق', 'ايسوق بيرثداي ساي، جڬن لوڤا باوق هديه'])"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.7"
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
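
The `deep_model` docstring in this notebook reports CER and WER on the test set but does not say how they were computed. The sketch below assumes the usual definition: Levenshtein edit distance divided by the reference length, measured over characters for CER and over whitespace-separated words for WER. The helper names `edit_distance`, `cer` and `wer` are hypothetical and not part of Malaya. As a small illustration it compares the third prediction of the full model with that of the quantized model, both copied from the notebook outputs; treating the full-precision output as the reference is an assumption made only for this example, since neither string is ground truth.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to the reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


# Outputs copied from the notebook cells above (full vs. quantized model).
full = 'aisuk berthday saya, jegan lupa bawak hadiah'
quantized = 'aisuk berthday saya, jegan lopa bawak hadiah'

print('CER:', cer(full, quantized))
print('WER:', wer(full, quantized))
```

The two strings differ in a single character (`lupa` vs `lopa`), which is consistent with the notebook's note that the quantized model may lose a little accuracy. One changed character still counts as one whole wrong word, which shows why a word-level rate is typically much higher than a character-level one.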