diff --git a/README.md b/README.md
index adb106a..451d814 100644
--- a/README.md
+++ b/README.md
@@ -2,17 +2,17 @@
# A Family of Southeast Asian Language Models
-***Updated: 21 August 2024***
+***Updated: 1 November 2024***
SEA-LION is a family of open-source language models developed by AI Singapore that better understands Southeast Asia's diverse contexts, languages, and cultures (SEA). We hope it makes LLMs more accessible and better represents the region's breadth of cultures and languages.
-Our first versions of SEA-LION, released in December 2023, were trained from scratch using [SEA-LION-PILE](https://huggingface.co/datasets/aisingapore/sea-lion-pile) (about 1 trillion tokens). Our new version of SEA-LION is based on continued pre-training good open source models. Version 2-2.x is based on Llama 3. We believe that this approach, i.e., continued pre-training, might be more sustainable over the longer run.
+Version 3 is based on Google's Gemma 2. It is a 9B-parameter model, continued pre-trained with 200 billion tokens from 11+2 Southeast Asian languages (English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, and Lao), plus Javanese and Sundanese.
## Transparent and Open Source
We have benefited greatly from the open-source community and believe that efforts to better represent our region will similarly be well served by open-source efforts. SEA-LION will therefore be open and transparent in the following areas:
-1. *Pre-Training* data
+1. *Pre-Training* data: [SEA-LION-PILE](https://huggingface.co/datasets/aisingapore/sea-lion-pile)
2. Model *training* code
3. Model *weights*
4. *Fine-Tuning* data
@@ -20,33 +20,31 @@ We have benefited greatly from the open-source community and believe that effort
# LATEST MODELS
-## Key Features of SEA-LION v2.1
+## Key Features of SEA-LION v3
-- Continued Pre-Trained and Fine-Tuned Llama 3 (with more models to follow)
-- Instruction tuned in English, Bahasa Indonesia, Thai, Vietnamese, and Tamil
-- Trained with up to 50B tokens from SEA languages
-- Outperforms base Llama 3 and other models in both general and SEA capabilities
-- Our contributions are open source (under MIT license); data and model licenses are listed on their respective Hugging Face data or model cards
+- Continued Pre-Training from Gemma 2 base with 200B tokens from 11+2 Southeast Asian languages (English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, and Lao), plus Javanese and Sundanese
+- Further fine-tuning to improve general and SEA capabilities, and optimize for instruction following and multi-turn conversations
+- Outperforms similarly sized open-source models, and even some larger models, in both general and SEA capabilities
+- Our contributions are open source (under MIT license); model licenses are derived from Gemma's license and are listed on their respective Hugging Face model cards
-See our [HuggingFace](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct) page for more detailed model and license information.
+See our [HuggingFace](https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-instruct) page for more detailed model and license information.
-## How To Download SEA-LION v2 and v2.1
+## How To Download SEA-LION v3
SEA-LION models are available for download on HuggingFace at:
-### SEA-LION v2 and v2.1
**Base Models**
-* [Llama3-8B-CPT-SEA-LION-V2-Base](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-base)
+* [Gemma2-9B-CPT-SEA-LION-V3-Base](https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-base)
**Instruction-Tuned Models**
-* [Llama3-8B-CPT-SEA-LION-V2.1-Instruct](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct)
+* [Gemma2-9B-CPT-SEA-LION-V3-Instruct](https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-instruct)
**Quantized Models**
-* [Llama3-8B-CPT-SEA-LION-V2.1-Instruct-GGUF](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct-gguf)
+* To be released soon
## Getting Started
-To use SEA-LION v2.x:
+To use SEA-LION v3:
```python
# Please use transformers==4.43.2
@@ -54,7 +52,7 @@ To use SEA-LION v2.x:
import transformers
import torch
-model_id = "aisingapore/llama3-8b-cpt-sealionv2-instruct"
+model_id = "aisingapore/gemma2-9b-cpt-sea-lionv3-instruct"
pipeline = transformers.pipeline(
"text-generation",
@@ -76,7 +74,7 @@ print(outputs[0]["generated_text"][-1])
## Performance and Benchmarks
-SEA-LION achieves better or competitive performances on tasks in regional languages while retaining the general performance of Llama 3.
+SEA-LION achieves better or competitive performance on tasks in regional languages while retaining the general performance of Gemma 2.
Our [leaderboard is here](https://leaderboard.sea-lion.ai).
@@ -161,4 +159,28 @@ If you have questions, comments, or issues, please open a GitHub issue or contac
**Model Details**
Please see model cards on Hugging Face.
-Additional information and guides about SEA-LION v1 can be found [here](sea-lion-v1/SEALIONV1_README.md)
+Additional information and guides about SEA-LION v1 can be found [here](sea-lion-v1/README.md)
+
+## SEA-LION v2
+
+- Continued Pre-Trained and Fine-Tuned Llama 3
+- Instruction tuned in English, Bahasa Indonesia, Thai, Vietnamese, and Tamil
+- Trained with up to 50B tokens from SEA languages
+- Outperforms base Llama 3 and other models in both general and SEA capabilities
+- Our contributions are open source (under MIT license); model licenses are listed on their respective Hugging Face model cards
+
+**Base Models**
+* [Llama3-8B-CPT-SEA-LION-V2-Base](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-base)
+
+**Instruction-Tuned Models**
+* [Llama3-8B-CPT-SEA-LION-V2.1-Instruct](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct)
+* [Llama3-8B-CPT-SEA-LION-V2-Instruct](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-instruct)
+
+**Quantized Models**
+* [Llama3-8B-CPT-SEA-LION-V2.1-Instruct-GGUF](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct-gguf)
+* [Llama3-8B-CPT-SEA-LION-V2-Instruct-GGUF](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-instruct-gguf)
+
+**Model Details**
+Please see model cards on Hugging Face.
+
+Additional information and guides about SEA-LION v2.x can be found [here](sea-lion-v2/README.md)
\ No newline at end of file
diff --git a/sea-lion-v2/CODE_OF_CONDUCT.md b/sea-lion-v2/CODE_OF_CONDUCT.md
new file mode 100644
index 0000000..403ce16
--- /dev/null
+++ b/sea-lion-v2/CODE_OF_CONDUCT.md
@@ -0,0 +1,49 @@
+# SEA-LION Code of Conduct
+
+## Our Pledge
+
+We, the community of contributors and users of SEA-LION, pledge to create a welcoming and inclusive environment for everyone. We are committed to fostering a respectful and harassment-free space where diverse ideas and perspectives can thrive.
+
+## Expected Behavior
+
+To contribute to a positive and inclusive atmosphere, we expect all participants, including contributors, users, and maintainers, to:
+
+1. Be respectful and considerate: Treat others with kindness, respect, and empathy. Recognize and embrace diversity in backgrounds, experiences, and opinions.
+
+2. Be inclusive: Welcome and support people of all backgrounds, identities, and abilities. Avoid any form of discrimination or exclusionary behavior.
+
+3. Listen actively: Pay attention to others' ideas, experiences, and feedback. Be open to constructive criticism and different points of view.
+
+4. Show empathy: Understand that people may have different cultural norms, communication styles, and perspectives. Be patient and considerate when engaging with others.
+
+5. Resolve conflicts constructively: Disagreements and conflicts are natural, but we encourage participants to address them in a respectful and solution-oriented manner. Avoid personal attacks and name-calling.
+
+6. Use clear and inclusive language: Use language that is respectful, inclusive, and considerate of all participants. Avoid offensive, derogatory, or discriminatory language.
+
+## Unacceptable Behavior
+
+The following behaviors are considered unacceptable and will not be tolerated within the SEA-LION community:
+
+1. Harassment: Any form of harassment, including but not limited to offensive comments, slurs, intimidation, or unwelcome advances, is strictly prohibited.
+
+2. Discrimination: Discriminatory actions or comments based on race, ethnicity, nationality, gender, gender identity, sexual orientation, disability, religion, age, or any other characteristic will not be tolerated.
+
+3. Hate speech: Hate speech, promoting violence, or advocating harm towards individuals or groups based on their identity is not allowed.
+
+4. Personal attacks: Engaging in personal attacks, insults, or trolling of others within the community is unacceptable.
+
+5. Disruptive behavior: Deliberate disruption of discussions, events, or community activities is discouraged.
+
+## Reporting Violations
+
+If you witness or experience behavior that violates this code of conduct, please report it promptly to the project maintainers by contacting [sealion@aisingapore.org](mailto:sealion@aisingapore.org).
+
+All reports will be treated with confidentiality, and the project maintainers will take appropriate action as necessary to address violations. We are committed to providing a safe and welcoming environment for all participants.
+
+## Enforcement
+
+Enforcement of this code of conduct will be carried out in a fair and just manner. Depending on the severity and frequency of violations, consequences may include warnings, temporary or permanent bans from the community, or other appropriate actions.
+
+## Attribution
+
+This code of conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
diff --git a/sea-lion-v2/CONTRIBUTING.md b/sea-lion-v2/CONTRIBUTING.md
new file mode 100644
index 0000000..649fe12
--- /dev/null
+++ b/sea-lion-v2/CONTRIBUTING.md
@@ -0,0 +1,91 @@
+# SEA-LION Contributing Guide
+
+Thank you for considering contributing to SEA-LION! We welcome contributions from the community to help improve and enhance our language model for Southeast Asia. Whether you're a developer, researcher, or just an enthusiast, there are various ways you can get involved and make a meaningful impact.
+
+Before you begin, please take a moment to review this guide, which outlines the contribution process, code of conduct, and how to get help if needed.
+
+## Table of Contents
+
+- [Getting Started](#getting-started)
+- [Contributing to SEA-LION](#contributing-to-sea-lion)
+ - [Reporting Bugs](#reporting-bugs)
+ - [Suggesting Enhancements](#suggesting-enhancements)
+ - [Code Contribution](#code-contribution)
+- [Development Setup](#development-setup)
+- [Code of Conduct](#code-of-conduct)
+- [Get Help](#get-help)
+
+## Getting Started
+
+Before you start contributing, please ensure you have the following:
+
+- A GitHub account: If you don't have one, you can [create an account here](https://github.com/join).
+- Familiarity with Git: You'll need to know the basics of Git for version control.
+
+## Contributing to SEA-LION
+
+There are several ways you can contribute to SEA-LION:
+
+### Reporting Bugs
+
+If you encounter any issues, bugs, or unexpected behavior while using SEA-LION, please help us by [reporting them](https://github.com/aisingapore/sealion/issues). To report a bug:
+
+1. Check if the issue has already been reported by searching the [GitHub Issues](https://github.com/aisingapore/sealion/issues) page.
+2. If not, create a new issue with a descriptive title and detailed description of the problem you encountered.
+3. Include relevant information such as your operating system, Python version, and any error messages.
+
+### Suggesting Enhancements
+
+We appreciate your suggestions for improving SEA-LION, including suggestions for better documentation, new evaluation metrics, or new features. If you have an idea for an enhancement or new feature, follow these steps:
+
+1. Check if your suggestion has already been proposed in the [GitHub Issues](https://github.com/aisingapore/sealion/issues) section.
+2. If not, create a new issue with a clear and concise title and a detailed description of your suggestion.
+3. Include any relevant context or examples to illustrate the enhancement's value.
+
+### Code Contribution
+
+If you're interested in contributing code to SEA-LION, you can do so by following these steps:
+
+1. Fork the [SEA-LION repository](https://github.com/aisingapore/sealion) to your GitHub account.
+2. Clone your forked repository to your local machine:
+
+ ```shell
+ git clone https://github.com/your-username/sealion.git
+ ```
+
+3. Create a new branch for your contribution:
+
+ ```shell
+ git checkout -b feature/your-feature-name
+ ```
+
+4. Make your changes and commit them with clear and concise commit messages.
+
+5. Push your changes to your forked repository:
+
+ ```shell
+ git push origin feature/your-feature-name
+ ```
+
+6. Create a pull request (PR) from your branch to the main SEA-LION repository.
+
+7. Ensure your PR includes a detailed description of the changes, why they are necessary, and any relevant testing or documentation updates.
+
+8. Participate in the review process, addressing any feedback or requested changes.
+
+9. Once your PR is approved, it will be merged into the main repository.
+
+## Development Setup
+
+If you want to contribute code, you'll need to set up a development environment for SEA-LION. Refer to the [Development Setup](https://github.com/aisingapore/sealion/blob/main/README.md) guide in the repository for detailed instructions on getting started.
+
+## Code of Conduct
+
+Please review and adhere to our [Code of Conduct](CODE_OF_CONDUCT.md). We expect all contributors and community members to treat each other with respect and kindness.
+
+## Get Help
+
+If you have questions, need assistance, or want to discuss contributions further, please feel free to [contact us](mailto:sealion@aisingapore.org) or open an issue for discussion.
+
+We appreciate your interest in contributing to SEA-LION, and we look forward to collaborating with you!
+
diff --git a/sea-lion-v2/LICENSE b/sea-lion-v2/LICENSE
new file mode 100644
index 0000000..d9205e5
--- /dev/null
+++ b/sea-lion-v2/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 AI Singapore
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/sea-lion-v2/README.md b/sea-lion-v2/README.md
new file mode 100644
index 0000000..adb106a
--- /dev/null
+++ b/sea-lion-v2/README.md
@@ -0,0 +1,164 @@
+# SEA-LION (Southeast Asian Languages In One Network)
+
+# A Family of Southeast Asian Language Models
+
+***Updated: 21 August 2024***
+
+SEA-LION is a family of open-source language models developed by AI Singapore that better understands the diverse contexts, languages, and cultures of Southeast Asia (SEA). We hope it makes LLMs more accessible and better represents the region's breadth of cultures and languages.
+
+Our first versions of SEA-LION, released in December 2023, were trained from scratch using [SEA-LION-PILE](https://huggingface.co/datasets/aisingapore/sea-lion-pile) (about 1 trillion tokens). Our new version of SEA-LION is based on continued pre-training of good open-source models. Versions 2 to 2.x are based on Llama 3. We believe that this approach, i.e., continued pre-training, might be more sustainable over the longer run.
+
+## Transparent and Open Source
+
+We have benefited greatly from the open-source community and believe that efforts to better represent our region will similarly be well served by open-source efforts. SEA-LION will therefore be open and transparent in the following areas:
+
+1. *Pre-Training* data
+2. Model *training* code
+3. Model *weights*
+4. *Fine-Tuning* data
+5. Evaluation *benchmarks*
+
+# LATEST MODELS
+
+## Key Features of SEA-LION v2.1
+
+- Continued Pre-Trained and Fine-Tuned Llama 3 (with more models to follow)
+- Instruction tuned in English, Bahasa Indonesia, Thai, Vietnamese, and Tamil
+- Trained with up to 50B tokens from SEA languages
+- Outperforms base Llama 3 and other models in both general and SEA capabilities
+- Our contributions are open source (under MIT license); data and model licenses are listed on their respective Hugging Face data or model cards
+
+See our [HuggingFace](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct) page for more detailed model and license information.
+
+## How To Download SEA-LION v2 and v2.1
+
+SEA-LION models are available for download on HuggingFace at:
+
+### SEA-LION v2 and v2.1
+**Base Models**
+* [Llama3-8B-CPT-SEA-LION-V2-Base](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-base)
+
+**Instruction-Tuned Models**
+* [Llama3-8B-CPT-SEA-LION-V2.1-Instruct](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct)
+
+**Quantized Models**
+* [Llama3-8B-CPT-SEA-LION-V2.1-Instruct-GGUF](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct-gguf)
+
+## Getting Started
+
+To use SEA-LION v2.x:
+
+```python
+# Please use transformers==4.43.2
+
+import transformers
+import torch
+
+model_id = "aisingapore/llama3-8b-cpt-sea-lionv2-instruct"
+
+pipeline = transformers.pipeline(
+ "text-generation",
+ model=model_id,
+ model_kwargs={"torch_dtype": torch.bfloat16},
+ device_map="auto",
+)
+messages = [
+ {"role": "user", "content": "Apa sentimen dari kalimat berikut ini?\nKalimat: Buku ini sangat membosankan.\nJawaban: "},
+]
+
+outputs = pipeline(
+ messages,
+ max_new_tokens=256,
+)
+print(outputs[0]["generated_text"][-1])
+
+```
+
+## Performance and Benchmarks
+
+SEA-LION achieves better or competitive performance on tasks in regional languages while retaining the general performance of Llama 3.
+
+Our [leaderboard is here](https://leaderboard.sea-lion.ai).
+
+We use a holistic approach to evaluation, including not just traditional Natural Language Processing (NLP) benchmarking tasks (such as sentiment analysis and question answering) but also meticulously handcrafted linguistic and cultural diagnostic tests tailored to Southeast Asia.
+
+The benchmark was introduced in [BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models](https://arxiv.org/abs/2309.06085v2), and its code is available on [GitHub](https://github.com/aisingapore/bhasa).
+
+## Deployment Framework
+
+### Text Generation Inference (TGI)
+
+Please refer to [serving the SEA-LION model with TGI](https://github.com/aisingapore/sealion-tgi).
+
+### vLLM
+
+Please refer to [serving the SEA-LION model with vLLM](https://github.com/aisingapore/sealion-vllm).
+
+### Ollama
+
+To run SEA-LION locally with Ollama via the command line:
+1. [Download and install Ollama](https://ollama.com)
+2. Run and chat with SEA-LION with the following command
+ ```shell
+ ollama run aisingapore/llama3-8b-cpt-sea-lionv2-instruct
+ ```
+
+or [explore SEA-LION with Chainlit and Ollama here](https://github.com/aisingapore/sealion-chainlit-ollama)
+
+## Contributing
+
+We welcome contributions to SEA-LION! Check out the [contributing guide](CONTRIBUTING.md) to get started.
+
+Some ways to contribute:
+
+- Report bugs and issues
+- Enhance the documentation
+- Add more model evaluation tasks and metrics
+- Train versions of the model in more SEA languages
+
+## To Cite SEA-LION
+
+If you use SEA-LION in your work, please cite it as:
+
+```bibtex
+@misc{sea_lion_2024,
+ title={SEA-LION (Southeast Asian Languages In One Network): A Family of Large Language Models for Southeast Asia},
+ author={AI Singapore},
+ year={2024},
+ howpublished={\url{https://github.com/aisingapore/sealion}}
+}
+```
+
+## Acknowledgements
+
+AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore, or the National University of Singapore.
+
+## Contact
+
+If you have questions, comments, or issues, please open a GitHub issue or contact us via this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6).
+
+
+# OTHER MODELS
+
+## SEA-LION v1
+
+- 3 to 7 billion parameters
+- Instruction tuned in English and Bahasa Indonesia
+- Trained with 980B tokens of text data from 11 languages spoken across SEA
+- Specialized vocabulary and tokenization for optimal performance in SEA languages
+- Excels on tasks in regional languages
+- Open source under the MIT License for community contribution and adoption
+
+
+**Base Models**
+* [SEA-LION-3B](https://huggingface.co/aisingapore/sea-lion-3b)
+* [SEA-LION-7B](https://huggingface.co/aisingapore/sea-lion-7b)
+
+**Instruction-Tuned Models**
+* [SEA-LION-7B-Instruct-Research](https://huggingface.co/aisingapore/sea-lion-7b-instruct-research)
+* [SEA-LION-7B-Instruct](https://huggingface.co/aisingapore/sea-lion-7b-instruct)
+
+**Model Details**
+Please see model cards on Hugging Face.
+
+Additional information and guides about SEA-LION v1 can be found [here](sea-lion-v1/SEALIONV1_README.md)
diff --git a/sea-lion-v2/images/blue_sealion-min.png b/sea-lion-v2/images/blue_sealion-min.png
new file mode 100644
index 0000000..54f5a5e
Binary files /dev/null and b/sea-lion-v2/images/blue_sealion-min.png differ
diff --git a/sea-lion-v2/images/purple_sealion-64x64.png b/sea-lion-v2/images/purple_sealion-64x64.png
new file mode 100644
index 0000000..60313cc
Binary files /dev/null and b/sea-lion-v2/images/purple_sealion-64x64.png differ
diff --git a/sea-lion-v2/images/purple_sealion-min.png b/sea-lion-v2/images/purple_sealion-min.png
new file mode 100644
index 0000000..c0fca48
Binary files /dev/null and b/sea-lion-v2/images/purple_sealion-min.png differ