diff --git a/examples/nlp/ipynb/mlm_training_tpus.ipynb b/examples/nlp/ipynb/mlm_training_tpus.ipynb
deleted file mode 100644
index b8bffa9553..0000000000
--- a/examples/nlp/ipynb/mlm_training_tpus.ipynb
+++ /dev/null
@@ -1,829 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "# Training a language model from scratch with \ud83e\udd17 Transformers and TPUs\n",
- "\n",
- "**Authors:** [Matthew Carrigan](https://twitter.com/carrigmat), [Sayak Paul](https://twitter.com/RisingSayak)
\n",
- "**Date created:** 2023/05/21
\n",
- "**Last modified:** 2023/05/21
\n",
- "**Description:** Train a masked language model on TPUs using \ud83e\udd17 Transformers."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "## Introduction\n",
- "\n",
- "In this example, we cover how to train a masked language model using TensorFlow,\n",
- "[\ud83e\udd17 Transformers](https://huggingface.co/transformers/index),\n",
- "and TPUs.\n",
- "\n",
- "TPU training is a useful skill to have: TPU pods are high-performance and extremely\n",
- "scalable, making it easy to train models at any scale from a few tens of millions of\n",
- "parameters up to truly enormous sizes: Google's PaLM model\n",
- "(over 500 billion parameters!) was trained entirely on TPU pods.\n",
- "\n",
- "We've previously written a\n",
- "[**tutorial**](https://huggingface.co/docs/transformers/main/perf_train_tpu_tf)\n",
- "and a\n",
- "[**Colab example**](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb)\n",
- "showing small-scale TPU training with TensorFlow and introducing the core concepts you\n",
- "need to understand to get your model working on TPU. However, our Colab example doesn't\n",
- "contain all the steps needed to train a language model from scratch such as\n",
- "training the tokenizer. So, we wanted to provide a consolidated example of\n",
- "walking you through every critical step involved there.\n",
- "\n",
- "As in our Colab example, we're taking advantage of TensorFlow's very clean TPU support\n",
- "via XLA and `TPUStrategy`. We'll also be benefiting from the fact that the majority of\n",
- "the TensorFlow models in \ud83e\udd17 Transformers are fully\n",
- "[XLA-compatible](https://huggingface.co/blog/tf-xla-generate).\n",
- "So surprisingly, little work is needed to get them to run on TPU.\n",
- "\n",
- "This example is designed to be **scalable** and much closer to a realistic training run\n",
- "-- although we only use a BERT-sized model by default, the code could be expanded to a\n",
- "much larger model and a much more powerful TPU pod slice by changing a few configuration\n",
- "options.\n",
- "\n",
- "The following diagram gives you a pictorial overview of the steps involved in training a\n",
- "language model with \ud83e\udd17 Transformers using TensorFlow and TPUs:\n",
- "\n",
- "![https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/tf_tpu/tf_tpu_steps.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/tf_tpu/tf_tpu_steps.png)\n",
- "\n",
- "*(Contents of this example overlap with\n",
- "[this blog post](https://huggingface.co/blog/tf_tpu)).*"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "## Data\n",
- "\n",
- "We use the\n",
- "[WikiText dataset (v1)](https://huggingface.co/datasets/wikitext).\n",
- "You can head over to the\n",
- "[dataset page on the Hugging Face Hub](https://huggingface.co/datasets/wikitext)\n",
- "to explore the dataset.\n",
- "\n",
- "![data_preview_wikitext](https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/data_preview_wikitext.png)\n",
- "\n",
- "Since the dataset is already available on the Hub in a compatible format, we can easily\n",
- "load and interact with it using\n",
- "[\ud83e\udd17 datasets](https://hf.co/docs/datasets).\n",
- "However, training a language model from scratch also requires a separate\n",
- "tokenizer training step. We skip that part in this example for brevity, but,\n",
- "here's a gist of what we can do to train a tokenizer from scratch:\n",
- "\n",
- "- Load the `train` split of the WikiText using \ud83e\udd17 datasets.\n",
- "- Leverage\n",
- "[\ud83e\udd17 tokenizers](https://huggingface.co/docs/tokenizers/index)\n",
- "to train a\n",
- "[**Unigram model**](https://huggingface.co/course/chapter6/7?fw=pt).\n",
- "- Upload the trained tokenizer on the Hub.\n",
- "\n",
- "You can find the tokenizer training\n",
- "code\n",
- "[**here**](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling-tpu#training-a-tokenizer)\n",
- "and the tokenizer\n",
- "[**here**](https://huggingface.co/tf-tpu/unigram-tokenizer-wikitext).\n",
- "This script also allows you to run it with\n",
- "[**any compatible dataset**](https://huggingface.co/datasets?task_ids=task_ids:language-modeling)\n",
- "from the Hub."
- ]
- },
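- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "For reference, a minimal sketch of that tokenizer-training step could look like the\n",
- "following. It assumes the `datasets` and `tokenizers` libraries; the dataset\n",
- "configuration, vocabulary size, and special tokens below are illustrative placeholders,\n",
- "and the linked script remains the authoritative version:\n",
- "\n",
- "```python\n",
- "from datasets import load_dataset\n",
- "from tokenizers import Tokenizer, models, pre_tokenizers, trainers\n",
- "\n",
- "wikitext = load_dataset(\"wikitext\", \"wikitext-103-raw-v1\", split=\"train\")\n",
- "\n",
- "# Build a Unigram tokenizer and train it from an iterator over the raw text.\n",
- "tokenizer = Tokenizer(models.Unigram())\n",
- "tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()\n",
- "trainer = trainers.UnigramTrainer(\n",
- "    vocab_size=25_000,\n",
- "    special_tokens=[\"[CLS]\", \"[SEP]\", \"[UNK]\", \"[PAD]\", \"[MASK]\"],\n",
- "    unk_token=\"[UNK]\",\n",
- ")\n",
- "\n",
- "\n",
- "def batch_iterator(batch_size=1_000):\n",
- "    for i in range(0, len(wikitext), batch_size):\n",
- "        yield wikitext[i : i + batch_size][\"text\"]\n",
- "\n",
- "\n",
- "tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(wikitext))\n",
- "tokenizer.save(\"unigram-tokenizer-wikitext.json\")\n",
- "# Wrapping the result in `transformers.PreTrainedTokenizerFast` makes it usable with\n",
- "# `AutoTokenizer` and lets you call `push_to_hub()` on it.\n",
- "```"
- ]
- },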
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "## Tokenizing the data and creating TFRecords\n",
- "\n",
- "Once the tokenizer is trained, we can use it on all the dataset splits\n",
- "(`train`, `validation`, and `test` in this case) and create TFRecord shards out of them.\n",
- "Having the data splits spread across multiple TFRecord shards helps with massively\n",
- "parallel processing as opposed to having each split in single TFRecord files.\n",
- "\n",
- "We tokenize the samples individually. We then take a batch of samples, concatenate them\n",
- "together, and split them into several chunks of a fixed size (128 in our case). We follow\n",
- "this strategy rather than tokenizing a batch of samples with a fixed length to avoid\n",
- "aggressively discarding text content (because of truncation).\n",
- "\n",
- "We then take these tokenized samples in batches and serialize those batches as multiple\n",
- "TFRecord shards, where the total dataset length and individual shard size determine the\n",
- "number of shards. Finally, these shards are pushed to a\n",
- "[Google Cloud Storage (GCS) bucket](https://cloud.google.com/storage/docs/json_api/v1/buckets).\n",
- "\n",
- "If you're using a TPU node for training, then the data needs to be streamed from a GCS\n",
- "bucket since the node host memory is very small. But for TPU VMs, we can use datasets\n",
- "locally or even attach persistent storage to those VMs. Since TPU nodes (which is what we\n",
- "have in a Colab) are still quite heavily used, we based our example on using a GCS bucket\n",
- "for data storage.\n",
- "\n",
- "You can see all of this in code in\n",
- "[this script](https://github.com/huggingface/transformers/blob/main/examples/tensorflow/language-modeling-tpu/prepare_tfrecord_shards.py).\n",
- "For convenience, we have also hosted the resultant TFRecord shards in\n",
- "[this repository](https://huggingface.co/datasets/tf-tpu/wikitext-v1-tfrecords)\n",
- "on the Hub.\n",
- "\n",
- "Once the data is tokenized and serialized into TFRecord shards, we can proceed toward\n",
- "training."
- ]
- },
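- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "To make the serialization step concrete, here is a condensed sketch of how tokenized\n",
- "samples can be regrouped into fixed-size chunks and written out as a single TFRecord\n",
- "shard. The toy token ids and the shard name are illustrative; the name follows the\n",
- "`<split>-<index>-<num_samples>.tfrecord` pattern that the sample-counting utility below\n",
- "relies on, and the script linked above remains the full implementation:\n",
- "\n",
- "```python\n",
- "import tensorflow as tf\n",
- "\n",
- "\n",
- "def group_into_chunks(tokenized_samples, chunk_size):\n",
- "    # Concatenate the token ids of a batch of samples and re-split them into\n",
- "    # chunks of exactly `chunk_size` tokens, dropping the incomplete tail.\n",
- "    concatenated = sum(tokenized_samples, [])\n",
- "    total_length = (len(concatenated) // chunk_size) * chunk_size\n",
- "    return [\n",
- "        concatenated[i : i + chunk_size] for i in range(0, total_length, chunk_size)\n",
- "    ]\n",
- "\n",
- "\n",
- "def serialize_example(input_ids):\n",
- "    # Each chunk becomes one `tf.train.Example` carrying the two features we parse later.\n",
- "    features = {\n",
- "        \"input_ids\": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),\n",
- "        \"attention_mask\": tf.train.Feature(\n",
- "            int64_list=tf.train.Int64List(value=[1] * len(input_ids))\n",
- "        ),\n",
- "    }\n",
- "    example = tf.train.Example(features=tf.train.Features(feature=features))\n",
- "    return example.SerializeToString()\n",
- "\n",
- "\n",
- "chunks = group_into_chunks([[1, 2, 3, 4], [5, 6, 7, 8]], chunk_size=4)\n",
- "with tf.io.TFRecordWriter(f\"train-0-{len(chunks)}.tfrecord\") as writer:\n",
- "    for chunk in chunks:\n",
- "        writer.write(serialize_example(chunk))\n",
- "```"
- ]
- },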
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "## Training\n",
- "\n",
- "### Setup and imports\n",
- "\n",
- "Let's start by installing \ud83e\udd17 Transformers."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "!pip install transformers -q"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "Then, let's import the modules we need."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "import os\n",
- "import re\n",
- "\n",
- "import tensorflow as tf\n",
- "\n",
- "import transformers"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "### Initialize TPUs"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "Then let's connect to our TPU and determine the distribution strategy:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "tpu = tf.distribute.cluster_resolver.TPUClusterResolver()\n",
- "\n",
- "tf.config.experimental_connect_to_cluster(tpu)\n",
- "tf.tpu.experimental.initialize_tpu_system(tpu)\n",
- "\n",
- "strategy = tf.distribute.TPUStrategy(tpu)\n",
- "\n",
- "print(f\"Available number of replicas: {strategy.num_replicas_in_sync}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "We then load the tokenizer. For more details on the tokenizer, check out\n",
- "[its repository](https://huggingface.co/tf-tpu/unigram-tokenizer-wikitext).\n",
- "For the model, we use RoBERTa (the base variant), introduced in\n",
- "[this paper](https://arxiv.org/abs/1907.11692)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "### Initialize the tokenizer"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "tokenizer = \"tf-tpu/unigram-tokenizer-wikitext\"\n",
- "pretrained_model_config = \"roberta-base\"\n",
- "\n",
- "tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer)\n",
- "config = transformers.AutoConfig.from_pretrained(pretrained_model_config)\n",
- "config.vocab_size = tokenizer.vocab_size"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "### Prepare the datasets"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "We now load the TFRecord shards of the WikiText dataset (which the Hugging Face team\n",
- "prepared beforehand for this example):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "train_dataset_path = \"gs://tf-tpu-training-resources/train\"\n",
- "eval_dataset_path = \"gs://tf-tpu-training-resources/validation\"\n",
- "\n",
- "training_records = tf.io.gfile.glob(os.path.join(train_dataset_path, \"*.tfrecord\"))\n",
- "eval_records = tf.io.gfile.glob(os.path.join(eval_dataset_path, \"*.tfrecord\"))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "Now, we will write a utility to count the number of training samples we have. We need to\n",
- "know this value in order properly initialize our optimizer later:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "\n",
- "def count_samples(file_list):\n",
- " num_samples = 0\n",
- " for file in file_list:\n",
- " filename = file.split(\"/\")[-1]\n",
- " sample_count = re.search(r\"-\\d+-(\\d+)\\.tfrecord\", filename).group(1)\n",
- " sample_count = int(sample_count)\n",
- " num_samples += sample_count\n",
- "\n",
- " return num_samples\n",
- "\n",
- "\n",
- "num_train_samples = count_samples(training_records)\n",
- "print(f\"Number of total training samples: {num_train_samples}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "Let's now prepare our datasets for training and evaluation. We start by writing our\n",
- "utilities. First, we need to be able to decode the TFRecords:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "max_sequence_length = 512\n",
- "\n",
- "\n",
- "def decode_fn(example):\n",
- " features = {\n",
- " \"input_ids\": tf.io.FixedLenFeature(\n",
- " dtype=tf.int64, shape=(max_sequence_length,)\n",
- " ),\n",
- " \"attention_mask\": tf.io.FixedLenFeature(\n",
- " dtype=tf.int64, shape=(max_sequence_length,)\n",
- " ),\n",
- " }\n",
- " return tf.io.parse_single_example(example, features)\n",
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "Here, `max_sequence_length` needs to be the same as the one used during preparing the\n",
- "TFRecord shards.Refer to\n",
- "[this script](https://github.com/huggingface/transformers/blob/main/examples/tensorflow/language-modeling-tpu/prepare_tfrecord_shards.py)\n",
- "for more details.\n",
- "\n",
- "Next up, we have our masking utility that is responsible for masking parts of the inputs\n",
- "and preparing labels for the masked language model to learn from. We leverage the\n",
- "[`DataCollatorForLanguageModeling`](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/data_collator#transformers.DataCollatorForLanguageModeling)\n",
- "for this purpose."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "# We use a standard masking probability of 0.15. `mlm_probability` denotes\n",
- "# probability with which we mask the input tokens in a sequence.\n",
- "mlm_probability = 0.15\n",
- "data_collator = transformers.DataCollatorForLanguageModeling(\n",
- " tokenizer=tokenizer, mlm_probability=mlm_probability, mlm=True, return_tensors=\"tf\"\n",
- ")\n",
- "\n",
- "\n",
- "def mask_with_collator(batch):\n",
- " special_tokens_mask = (\n",
- " ~tf.cast(batch[\"attention_mask\"], tf.bool)\n",
- " | (batch[\"input_ids\"] == tokenizer.cls_token_id)\n",
- " | (batch[\"input_ids\"] == tokenizer.sep_token_id)\n",
- " )\n",
- " batch[\"input_ids\"], batch[\"labels\"] = data_collator.tf_mask_tokens(\n",
- " batch[\"input_ids\"],\n",
- " vocab_size=len(tokenizer),\n",
- " mask_token_id=tokenizer.mask_token_id,\n",
- " special_tokens_mask=special_tokens_mask,\n",
- " )\n",
- " return batch\n",
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "And now is the time to write the final data preparation utility to put it all together in\n",
- "a `tf.data.Dataset` object:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "auto = tf.data.AUTOTUNE\n",
- "shuffle_buffer_size = 2**18\n",
- "\n",
- "\n",
- "def prepare_dataset(\n",
- " records, decode_fn, mask_fn, batch_size, shuffle, shuffle_buffer_size=None\n",
- "):\n",
- " num_samples = count_samples(records)\n",
- " dataset = tf.data.Dataset.from_tensor_slices(records)\n",
- " if shuffle:\n",
- " dataset = dataset.shuffle(len(dataset))\n",
- " dataset = tf.data.TFRecordDataset(dataset, num_parallel_reads=auto)\n",
- " # TF can't infer the total sample count because it doesn't read\n",
- " # all the records yet, so we assert it here.\n",
- " dataset = dataset.apply(tf.data.experimental.assert_cardinality(num_samples))\n",
- " dataset = dataset.map(decode_fn, num_parallel_calls=auto)\n",
- " if shuffle:\n",
- " assert shuffle_buffer_size is not None\n",
- " dataset = dataset.shuffle(shuffle_buffer_size)\n",
- " dataset = dataset.batch(batch_size, drop_remainder=True)\n",
- " dataset = dataset.map(mask_fn, num_parallel_calls=auto)\n",
- " dataset = dataset.prefetch(auto)\n",
- " return dataset\n",
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "Let's prepare our datasets with these utilities:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "per_replica_batch_size = 16 # Change as needed.\n",
- "batch_size = per_replica_batch_size * strategy.num_replicas_in_sync\n",
- "shuffle_buffer_size = 2**18 # Default corresponds to a 1GB buffer for seq_len 512\n",
- "\n",
- "train_dataset = prepare_dataset(\n",
- " training_records,\n",
- " decode_fn=decode_fn,\n",
- " mask_fn=mask_with_collator,\n",
- " batch_size=batch_size,\n",
- " shuffle=True,\n",
- " shuffle_buffer_size=shuffle_buffer_size,\n",
- ")\n",
- "\n",
- "eval_dataset = prepare_dataset(\n",
- " eval_records,\n",
- " decode_fn=decode_fn,\n",
- " mask_fn=mask_with_collator,\n",
- " batch_size=batch_size,\n",
- " shuffle=False,\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "Let's now investigate how a single batch of dataset looks like."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "single_batch = next(iter(train_dataset))\n",
- "print(single_batch.keys())"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "* `input_ids` denotes the tokenized versions of the input samples containing the mask\n",
- "tokens as well.\n",
- "* `attention_mask` denotes the mask to be used when performing attention operations.\n",
- "* `labels` denotes the actual values of masked tokens the model is supposed to learn from."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "for k in single_batch:\n",
- " if k == \"input_ids\":\n",
- " input_ids = single_batch[k]\n",
- " print(f\"Input shape: {input_ids.shape}\")\n",
- " if k == \"labels\":\n",
- " labels = single_batch[k]\n",
- " print(f\"Label shape: {labels.shape}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "Now, we can leverage our `tokenizer` to investigate the values of the tokens. Let's start\n",
- "with `input_ids`:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "idx = 0\n",
- "print(\"Taking the first sample:\\n\")\n",
- "print(tokenizer.decode(input_ids[idx].numpy()))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "As expected, the decoded tokens contain the special tokens including the mask tokens as\n",
- "well. Let's now investigate the mask tokens:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "# Taking the first 30 tokens of the first sequence.\n",
- "print(labels[0].numpy()[:30])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "Here, `-100` means that the corresponding tokens in the `input_ids` are NOT masked and\n",
- "non `-100` values denote the actual values of the masked tokens."
- ]
- },
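- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "As a quick check, we can gather just the masked positions of the first sequence and\n",
- "decode the original tokens the model will be asked to predict, for example:\n",
- "\n",
- "```python\n",
- "masked_positions = tf.where(labels[0] != -100)[:, 0]\n",
- "masked_token_ids = tf.gather(labels[0], masked_positions)\n",
- "print(tokenizer.decode(masked_token_ids.numpy()))\n",
- "```"
- ]
- },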
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "## Initialize the mode and and the optimizer"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "With the datasets prepared, we now initialize and compile our model and optimizer within\n",
- "the `strategy.scope()`:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "# For this example, we keep this value to 10. But for a realistic run, start with 500.\n",
- "num_epochs = 10\n",
- "steps_per_epoch = num_train_samples // (\n",
- " per_replica_batch_size * strategy.num_replicas_in_sync\n",
- ")\n",
- "total_train_steps = steps_per_epoch * num_epochs\n",
- "learning_rate = 0.0001\n",
- "weight_decay_rate = 1e-3\n",
- "\n",
- "with strategy.scope():\n",
- " model = transformers.TFAutoModelForMaskedLM.from_config(config)\n",
- " model(\n",
- " model.dummy_inputs\n",
- " ) # Pass some dummy inputs through the model to ensure all the weights are built\n",
- " optimizer, schedule = transformers.create_optimizer(\n",
- " num_train_steps=total_train_steps,\n",
- " num_warmup_steps=total_train_steps // 20,\n",
- " init_lr=learning_rate,\n",
- " weight_decay_rate=weight_decay_rate,\n",
- " )\n",
- " model.compile(optimizer=optimizer, metrics=[\"accuracy\"])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "A couple of things to note here:\n",
- "* The\n",
- "[`create_optimizer()`](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.create_optimizer)\n",
- "function creates an Adam optimizer with a learning rate schedule using a warmup phase\n",
- "followed by a linear decay. Since we're using weight decay here, under the hood,\n",
- "`create_optimizer()` instantiates\n",
- "[the right variant of Adam](https://github.com/huggingface/transformers/blob/118e9810687dd713b6be07af79e80eeb1d916908/src/transformers/optimization_tf.py#L172)\n",
- "to enable weight decay.\n",
- "* While compiling the model, we're NOT using any `loss` argument. This is because\n",
- "the TensorFlow models internally compute the loss when expected labels are provided.\n",
- "Based on the model type and the labels being used, `transformers` will automatically\n",
- "infer the loss to use."
- ]
- },
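- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "If you want to see the warmup-then-linear-decay behavior for yourself, the `schedule`\n",
- "returned by `create_optimizer()` can be probed at a few steps, for example:\n",
- "\n",
- "```python\n",
- "for step in [0, total_train_steps // 20, total_train_steps // 2, total_train_steps]:\n",
- "    print(f\"Learning rate at step {step}: {float(schedule(step)):.2e}\")\n",
- "```"
- ]
- },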
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "### Start training!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "Next, we set up a handy callback to push the intermediate training checkpoints to the\n",
- "Hugging Face Hub. To be able to operationalize this callback, we need to log in to our\n",
- "Hugging Face account (if you don't have one, you create one\n",
- "[here](https://huggingface.co/join) for free). Execute the code below for logging in:\n",
- "\n",
- "```python\n",
- "from huggingface_hub import notebook_login\n",
- "\n",
- "notebook_login()\n",
- "```"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "Let's now define the\n",
- "[`PushToHubCallback`](https://huggingface.co/docs/transformers/main_classes/keras_callbacks#transformers.PushToHubCallback):"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "hub_model_id = output_dir = \"masked-lm-tpu\"\n",
- "\n",
- "callbacks = []\n",
- "callbacks.append(\n",
- " transformers.PushToHubCallback(\n",
- " output_dir=output_dir, hub_model_id=hub_model_id, tokenizer=tokenizer\n",
- " )\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "And now, we're ready to chug the TPUs:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "# In the interest of the runtime of this example,\n",
- "# we limit the number of batches to just 2.\n",
- "model.fit(\n",
- " train_dataset.take(2),\n",
- " validation_data=eval_dataset.take(2),\n",
- " epochs=num_epochs,\n",
- " callbacks=callbacks,\n",
- ")\n",
- "\n",
- "# After training we also serialize the final model.\n",
- "model.save_pretrained(output_dir)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "Once your training is complete, you can easily perform inference like so:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab_type": "code"
- },
- "outputs": [],
- "source": [
- "from transformers import pipeline\n",
- "\n",
- "# Replace your `model_id` here.\n",
- "# Here, we're using a model that the Hugging Face team trained for longer.\n",
- "model_id = \"tf-tpu/roberta-base-epochs-500-no-wd\"\n",
- "unmasker = pipeline(\"fill-mask\", model=model_id, framework=\"tf\")\n",
- "print(unmasker(\"Goal of my life is to [MASK].\"))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text"
- },
- "source": [
- "And that's it!\n",
- "\n",
- "If you enjoyed this example, we encourage you to check out the full codebase\n",
- "[here](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling-tpu)\n",
- "and the accompanying blog post\n",
- "[here](https://huggingface.co/blog/tf_tpu)."
- ]
- }
- ],
- "metadata": {
- "accelerator": "TPU",
- "colab": {
- "collapsed_sections": [],
- "name": "mlm_training_tpus",
- "private_outputs": false,
- "provenance": [],
- "toc_visible": true
- },
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.0"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
-}
\ No newline at end of file
diff --git a/examples/nlp/md/mlm_training_tpus.md b/examples/nlp/md/mlm_training_tpus.md
deleted file mode 100644
index cc90f0379e..0000000000
--- a/examples/nlp/md/mlm_training_tpus.md
+++ /dev/null
@@ -1,614 +0,0 @@
-# Training a language model from scratch with 🤗 Transformers and TPUs
-
-**Authors:** [Matthew Carrigan](https://twitter.com/carrigmat), [Sayak Paul](https://twitter.com/RisingSayak)
-**Date created:** 2023/05/21
-**Last modified:** 2023/05/21
-**Description:** Train a masked language model on TPUs using 🤗 Transformers.
-
-
- [**View in Colab**](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/mlm_training_tpus.ipynb) • [**GitHub source**](https://github.com/keras-team/keras-io/blob/master/examples/nlp/mlm_training_tpus.py)
-
-
-
----
-## Introduction
-
-In this example, we cover how to train a masked language model using TensorFlow,
-[🤗 Transformers](https://huggingface.co/transformers/index),
-and TPUs.
-
-TPU training is a useful skill to have: TPU pods are high-performance and extremely
-scalable, making it easy to train models at any scale from a few tens of millions of
-parameters up to truly enormous sizes: Google's PaLM model
-(over 500 billion parameters!) was trained entirely on TPU pods.
-
-We've previously written a
-[**tutorial**](https://huggingface.co/docs/transformers/main/perf_train_tpu_tf)
-and a
-[**Colab example**](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb)
-showing small-scale TPU training with TensorFlow and introducing the core concepts you
-need to understand to get your model working on TPU. However, our Colab example doesn't
-contain all the steps needed to train a language model from scratch, such as
-training the tokenizer. So, we wanted to provide a consolidated example that
-walks you through every critical step involved.
-
-As in our Colab example, we're taking advantage of TensorFlow's very clean TPU support
-via XLA and `TPUStrategy`. We'll also be benefiting from the fact that the majority of
-the TensorFlow models in 🤗 Transformers are fully
-[XLA-compatible](https://huggingface.co/blog/tf-xla-generate).
-So, surprisingly little work is needed to get them to run on TPU.
-
-This example is designed to be **scalable** and much closer to a realistic training run
--- although we only use a BERT-sized model by default, the code could be expanded to a
-much larger model and a much more powerful TPU pod slice by changing a few configuration
-options.
-
-The following diagram gives you a pictorial overview of the steps involved in training a
-language model with 🤗 Transformers using TensorFlow and TPUs:
-
-![https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/tf_tpu/tf_tpu_steps.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/tf_tpu/tf_tpu_steps.png)
-
-*(Contents of this example overlap with
-[this blog post](https://huggingface.co/blog/tf_tpu)).*
-
----
-## Data
-
-We use the
-[WikiText dataset (v1)](https://huggingface.co/datasets/wikitext).
-You can head over to the
-[dataset page on the Hugging Face Hub](https://huggingface.co/datasets/wikitext)
-to explore the dataset.
-
-![data_preview_wikitext](https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/data_preview_wikitext.png)
-
-Since the dataset is already available on the Hub in a compatible format, we can easily
-load and interact with it using
-[🤗 datasets](https://hf.co/docs/datasets).
-However, training a language model from scratch also requires a separate
-tokenizer training step. We skip that part in this example for brevity, but
-here's a gist of what we can do to train a tokenizer from scratch:
-
-- Load the `train` split of the WikiText using 🤗 datasets.
-- Leverage
-[🤗 tokenizers](https://huggingface.co/docs/tokenizers/index)
-to train a
-[**Unigram model**](https://huggingface.co/course/chapter6/7?fw=pt).
-- Upload the trained tokenizer on the Hub.
-
-You can find the tokenizer training
-code
-[**here**](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling-tpu#training-a-tokenizer)
-and the tokenizer
-[**here**](https://huggingface.co/tf-tpu/unigram-tokenizer-wikitext).
-You can also run this script with
-[**any compatible dataset**](https://huggingface.co/datasets?task_ids=task_ids:language-modeling)
-from the Hub.
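-
-For reference, a minimal sketch of that tokenizer-training step could look like the
-following. It assumes the `datasets` and `tokenizers` libraries; the dataset
-configuration, vocabulary size, and special tokens below are illustrative placeholders,
-and the linked script remains the authoritative version:
-
-```python
-from datasets import load_dataset
-from tokenizers import Tokenizer, models, pre_tokenizers, trainers
-
-wikitext = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
-
-# Build a Unigram tokenizer and train it from an iterator over the raw text.
-tokenizer = Tokenizer(models.Unigram())
-tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
-trainer = trainers.UnigramTrainer(
-    vocab_size=25_000,
-    special_tokens=["[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"],
-    unk_token="[UNK]",
-)
-
-
-def batch_iterator(batch_size=1_000):
-    for i in range(0, len(wikitext), batch_size):
-        yield wikitext[i : i + batch_size]["text"]
-
-
-tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(wikitext))
-tokenizer.save("unigram-tokenizer-wikitext.json")
-# Wrapping the result in `transformers.PreTrainedTokenizerFast` makes it usable with
-# `AutoTokenizer` and lets you call `push_to_hub()` on it.
-```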
-
----
-## Tokenizing the data and creating TFRecords
-
-Once the tokenizer is trained, we can use it on all the dataset splits
-(`train`, `validation`, and `test` in this case) and create TFRecord shards out of them.
-Having the data splits spread across multiple TFRecord shards helps with massively
-parallel processing, as opposed to keeping each split in a single TFRecord file.
-
-We tokenize the samples individually. We then take a batch of samples, concatenate them
-together, and split them into several chunks of a fixed size (128 in our case). We follow
-this strategy rather than tokenizing a batch of samples with a fixed length to avoid
-aggressively discarding text content (because of truncation).
-
-We then take these tokenized samples in batches and serialize those batches as multiple
-TFRecord shards, where the total dataset length and individual shard size determine the
-number of shards. Finally, these shards are pushed to a
-[Google Cloud Storage (GCS) bucket](https://cloud.google.com/storage/docs/json_api/v1/buckets).
-
-If you're using a TPU node for training, then the data needs to be streamed from a GCS
-bucket since the node host memory is very small. But for TPU VMs, we can use datasets
-locally or even attach persistent storage to those VMs. Since TPU nodes (which is what we
-have in a Colab) are still quite heavily used, we based our example on using a GCS bucket
-for data storage.
-
-You can see all of this in code in
-[this script](https://github.com/huggingface/transformers/blob/main/examples/tensorflow/language-modeling-tpu/prepare_tfrecord_shards.py).
-For convenience, we have also hosted the resultant TFRecord shards in
-[this repository](https://huggingface.co/datasets/tf-tpu/wikitext-v1-tfrecords)
-on the Hub.
-
-Once the data is tokenized and serialized into TFRecord shards, we can proceed toward
-training.
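-
-To make the serialization step concrete, here is a condensed sketch of how tokenized
-samples can be regrouped into fixed-size chunks and written out as a single TFRecord
-shard. The toy token ids and the shard name are illustrative; the name follows the
-`<split>-<index>-<num_samples>.tfrecord` pattern that the sample-counting utility later
-in this example relies on, and the script linked above remains the full implementation:
-
-```python
-import tensorflow as tf
-
-
-def group_into_chunks(tokenized_samples, chunk_size):
-    # Concatenate the token ids of a batch of samples and re-split them into
-    # chunks of exactly `chunk_size` tokens, dropping the incomplete tail.
-    concatenated = sum(tokenized_samples, [])
-    total_length = (len(concatenated) // chunk_size) * chunk_size
-    return [
-        concatenated[i : i + chunk_size] for i in range(0, total_length, chunk_size)
-    ]
-
-
-def serialize_example(input_ids):
-    # Each chunk becomes one `tf.train.Example` carrying the two features we parse later.
-    features = {
-        "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
-        "attention_mask": tf.train.Feature(
-            int64_list=tf.train.Int64List(value=[1] * len(input_ids))
-        ),
-    }
-    example = tf.train.Example(features=tf.train.Features(feature=features))
-    return example.SerializeToString()
-
-
-chunks = group_into_chunks([[1, 2, 3, 4], [5, 6, 7, 8]], chunk_size=4)
-with tf.io.TFRecordWriter(f"train-0-{len(chunks)}.tfrecord") as writer:
-    for chunk in chunks:
-        writer.write(serialize_example(chunk))
-```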
-
----
-## Training
-
-### Setup and imports
-
-Let's start by installing 🤗 Transformers.
-
-
-```python
-!pip install transformers -q
-```
-
-Then, let's import the modules we need.
-
-
-```python
-import os
-import re
-
-import tensorflow as tf
-
-import transformers
-```
-
-### Initialize TPUs
-
-Then let's connect to our TPU and determine the distribution strategy:
-
-
-```python
-tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
-
-tf.config.experimental_connect_to_cluster(tpu)
-tf.tpu.experimental.initialize_tpu_system(tpu)
-
-strategy = tf.distribute.TPUStrategy(tpu)
-
-print(f"Available number of replicas: {strategy.num_replicas_in_sync}")
-```
-
-