Update for version 3
maynard242 committed Oct 31, 2024
1 parent 9d7df68 commit 1cc9d77
Showing 8 changed files with 366 additions and 19 deletions.
60 changes: 41 additions & 19 deletions README.md
@@ -2,59 +2,57 @@

# <img align="center" src="images/purple_sealion-64x64.png"> A Family of Southeast Asian Language Models

***Updated: 21 August 2024***
***Updated: 1 November 2024***

SEA-LION is a family of open-source language models developed by AI Singapore that better understands Southeast Asia's (SEA) diverse contexts, languages, and cultures. We hope it makes LLMs more accessible and better represents the region's breadth of cultures and languages.

Our first versions of SEA-LION, released in December 2023, were trained from scratch using [SEA-LION-PILE](https://huggingface.co/datasets/aisingapore/sea-lion-pile) (about 1 trillion tokens). Our newer versions of SEA-LION are based on continued pre-training of strong open-source models. Versions 2 and 2.x are based on Llama 3. We believe that this approach, i.e., continued pre-training, may be more sustainable over the longer run.
Version 3 is based on Google's Gemma 2. It is a 9B-parameter model, continued pre-trained on 200 billion tokens from 11+2 languages: English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, and Lao, plus Javanese and Sundanese.

## Transparent and Open Source

We have benefited greatly from the open-source community and believe that efforts to better represent our region will similarly be well served by open-source efforts. SEA-LION will therefore be open and transparent in the following areas:

1. *Pre-Training* data
1. *Pre-Training* data ([SEA-LION-PILE](https://huggingface.co/datasets/aisingapore/sea-lion-pile))
2. Model *training* code
3. Model *weights*
4. *Fine-Tuning* data
5. Evaluation *benchmarks*

# LATEST MODELS

## Key Features of SEA-LION v2.1
## Key Features of SEA-LION v3

- Continued Pre-Trained and Fine-Tuned Llama 3 (with more models to follow)
- Instruction tuned in English, Bahasa Indonesia, Thai, Vietnamese, and Tamil
- Trained with up to 50B tokens from SEA languages
- Outperforms base Llama 3 and other models in both general and SEA capabilities
- Our contributions are open source (under MIT license); data and model licenses are listed on their respective Hugging Face data or model cards
- Continued pre-training from the Gemma 2 base model with 200B tokens from 11+2 languages (English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, and Lao, plus Javanese and Sundanese)
- Further fine-tuning to improve general and SEA capabilities, and to optimize for instruction following and multi-turn conversations
- Outperforms similarly sized open-source models, and even some larger models, in both general and SEA capabilities
- Our contributions are open source (under MIT license); model licenses are derived from the Gemma license and are listed on their respective Hugging Face model cards

See our [HuggingFace](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct) page for more detailed model and license information.
See our [HuggingFace](https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-instruct) page for more detailed model and license information.

## How To Download SEA-LION v2 and v2.1
## How To Download SEA-LION v3

SEA-LION models are available for download on HuggingFace at:

### SEA-LION v2 and v2.1
**Base Models**
* [Llama3-8B-CPT-SEA-LION-V2-Base](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-base)
* [Gemma2-9B-CPT-SEA-LION-V3-Base](https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-base)

**Instruction-Tuned Models**
* [Llama3-8B-CPT-SEA-LION-V2.1-Instruct](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct)
* [Gemma2-9B-CPT-SEA-LION-V3-Instruct](https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-instruct)

**Quantized Models**
* [Llama3-8B-CPT-SEA-LION-V2.1-Instruct-GGUF](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct-gguf)
* To be released soon
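
The v3 repositories listed above can also be fetched programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that you have accepted any license terms the model cards require. The `SEA_LION_V3_REPOS` mapping, the `local_dir_for` and `download` helpers, and the `models/` folder layout are illustrative names, not part of SEA-LION.

```python
from pathlib import Path

# Repository IDs copied from the Hugging Face links in this section.
SEA_LION_V3_REPOS = {
    "base": "aisingapore/gemma2-9b-cpt-sea-lionv3-base",
    "instruct": "aisingapore/gemma2-9b-cpt-sea-lionv3-instruct",
}


def local_dir_for(repo_id: str, root: str = "models") -> Path:
    """Map a repo ID to a local folder (hypothetical layout: <root>/<repo-name>)."""
    return Path(root) / repo_id.split("/")[-1]


def download(variant: str = "instruct", root: str = "models") -> Path:
    """Download one model variant and return the directory it was saved to."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id = SEA_LION_V3_REPOS[variant]
    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download("instruct")` would place the instruction-tuned weights under `models/gemma2-9b-cpt-sea-lionv3-instruct`. Expect a download of several gigabytes for a 9B-parameter model.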

## Getting Started

To use SEA-LION v2.x:
To use SEA-LION v3:

```python
# Please use transformers==4.43.2

import transformers
import torch

model_id = "aisingapore/llama3-8b-cpt-sea-lionv2-instruct"
model_id = "aisingapore/gemma2-9b-cpt-sea-lionv3-instruct"

pipeline = transformers.pipeline(
    "text-generation",
    # ... (remaining lines collapsed in the diff view)
```
@@ -76,7 +74,7 @@ print(outputs[0]["generated_text"][-1])

## Performance and Benchmarks

SEA-LION achieves better or competitive performance on tasks in regional languages while retaining the general performance of Llama 3.
SEA-LION achieves better or competitive performance on tasks in regional languages while retaining the general performance of Gemma 2.

Our [leaderboard is here](https://leaderboard.sea-lion.ai).

Expand Down Expand Up @@ -161,4 +159,28 @@ If you have questions, comments, or issues, please open a GitHub issue or contac
**Model Details**
Please see model cards on Hugging Face.

Additional information and guides about SEA-LION v1 can be found [here](sea-lion-v1/SEALIONV1_README.md)
Additional information and guides about SEA-LION v1 can be found [here](sea-lion-v1/README.md)

## SEA-LION v2

- Continued Pre-Trained and Fine-Tuned Llama 3
- Instruction tuned in English, Bahasa Indonesia, Thai, Vietnamese, and Tamil
- Trained with up to 50B tokens from SEA languages
- Outperforms base Llama 3 and other models in both general and SEA capabilities
- Our contributions are open source (under MIT license); model licenses are listed on their respective Hugging Face model cards

**Base Models**
* [Llama3-8B-CPT-SEA-LION-V2-Base](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-base)

**Instruction-Tuned Models**
* [Llama3-8B-CPT-SEA-LION-V2.1-Instruct](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct)
* [Llama3-8B-CPT-SEA-LION-V2-Instruct](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-instruct)

**Quantized Models**
* [Llama3-8B-CPT-SEA-LION-V2.1-Instruct-GGUF](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2.1-instruct-gguf)
* [Llama3-8B-CPT-SEA-LION-V2-Instruct-GGUF](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-instruct-gguf)

**Model Details**
Please see model cards on Hugging Face.

Additional information and guides about SEA-LION v2.x can be found [here](sea-lion-v2/README.md)
49 changes: 49 additions & 0 deletions sea-lion-v2/CODE_OF_CONDUCT.md
@@ -0,0 +1,49 @@
# SEA-LION Code of Conduct

## Our Pledge

We, the community of contributors and users of SEA-LION, pledge to create a welcoming and inclusive environment for everyone. We are committed to fostering a respectful and harassment-free space where diverse ideas and perspectives can thrive.

## Expected Behavior

To contribute to a positive and inclusive atmosphere, we expect all participants, including contributors, users, and maintainers, to:

1. Be respectful and considerate: Treat others with kindness, respect, and empathy. Recognize and embrace diversity in backgrounds, experiences, and opinions.

2. Be inclusive: Welcome and support people of all backgrounds, identities, and abilities. Avoid any form of discrimination or exclusionary behavior.

3. Listen actively: Pay attention to others' ideas, experiences, and feedback. Be open to constructive criticism and different points of view.

4. Show empathy: Understand that people may have different cultural norms, communication styles, and perspectives. Be patient and considerate when engaging with others.

5. Resolve conflicts constructively: Disagreements and conflicts are natural, but we encourage participants to address them in a respectful and solution-oriented manner. Avoid personal attacks and name-calling.

6. Use clear and inclusive language: Use language that is respectful, inclusive, and considerate of all participants. Avoid offensive, derogatory, or discriminatory language.

## Unacceptable Behavior

The following behaviors are considered unacceptable and will not be tolerated within the SEA-LION community:

1. Harassment: Any form of harassment, including but not limited to offensive comments, slurs, intimidation, or unwelcome advances, is strictly prohibited.

2. Discrimination: Discriminatory actions or comments based on race, ethnicity, nationality, gender, gender identity, sexual orientation, disability, religion, age, or any other characteristic will not be tolerated.

3. Hate speech: Hate speech, promoting violence, or advocating harm towards individuals or groups based on their identity is not allowed.

4. Personal attacks: Engaging in personal attacks, insults, or trolling of others within the community is unacceptable.

5. Disruptive behavior: Deliberate disruption of discussions, events, or community activities is discouraged.

## Reporting Violations

If you witness or experience behavior that violates this code of conduct, please report it promptly to the project maintainers by contacting [sealion@aisingapore.org](mailto:sealion@aisingapore.org).

All reports will be treated with confidentiality, and the project maintainers will take appropriate action as necessary to address violations. We are committed to providing a safe and welcoming environment for all participants.

## Enforcement

Enforcement of this code of conduct will be carried out in a fair and just manner. Depending on the severity and frequency of violations, consequences may include warnings, temporary or permanent bans from the community, or other appropriate actions.

## Attribution

This code of conduct is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
91 changes: 91 additions & 0 deletions sea-lion-v2/CONTRIBUTING.md
@@ -0,0 +1,91 @@
# SEA-LION Contributing Guide

Thank you for considering contributing to SEA-LION! We welcome contributions from the community to help improve and enhance our language model for Southeast Asia. Whether you're a developer, researcher, or just an enthusiast, there are various ways you can get involved and make a meaningful impact.

Before you begin, please take a moment to review this guide, which outlines the contribution process, code of conduct, and how to get help if needed.

## Table of Contents

- [Getting Started](#getting-started)
- [Contributing to SEA-LION](#contributing-to-sea-lion)
- [Reporting Bugs](#reporting-bugs)
- [Suggesting Enhancements](#suggesting-enhancements)
- [Code Contribution](#code-contribution)
- [Development Setup](#development-setup)
- [Code of Conduct](#code-of-conduct)
- [Get Help](#get-help)

## Getting Started

Before you start contributing, please ensure you have the following:

- A GitHub account: If you don't have one, you can [create an account here](https://github.com/join).
- Familiarity with Git: You'll need to know the basics of Git for version control.

## Contributing to SEA-LION

There are several ways you can contribute to SEA-LION:

### Reporting Bugs

If you encounter any issues, bugs, or unexpected behavior while using SEA-LION, please help us by [reporting them](https://github.com/aisingapore/sealion/issues). To report a bug:

1. Check if the issue has already been reported by searching the [GitHub Issues](https://github.com/aisingapore/sealion/issues) page.
2. If not, create a new issue with a descriptive title and detailed description of the problem you encountered.
3. Include relevant information such as your operating system, Python version, and any error messages.
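
To gather the environment details mentioned in step 3, a short script along these lines may help. It is a sketch using only the Python standard library. The `package_version` and `environment_report` helpers and the default package list (`transformers`, `torch`) are assumptions for illustration, so adjust them to whatever your setup actually uses.

```python
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages=("transformers", "torch")) -> str:
    """Build a plain-text summary suitable for pasting into a GitHub issue."""
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
    ]
    lines += [f"{name}: {package_version(name)}" for name in packages]
    return "\n".join(lines)


print(environment_report())
```

Paste the printed lines into the issue together with the full error message or traceback.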

### Suggesting Enhancements

We appreciate your suggestions for improving SEA-LION, including suggestions for better documentation, new evaluation metrics, or new features. If you have an idea for an enhancement or new feature, follow these steps:

1. Check if your suggestion has already been proposed in the [GitHub Issues](https://github.com/aisingapore/sealion/issues) section.
2. If not, create a new issue with a clear and concise title and a detailed description of your suggestion.
3. Include any relevant context or examples to illustrate the enhancement's value.

### Code Contribution

If you're interested in contributing code to SEA-LION, you can do so by following these steps:

1. Fork the [SEA-LION repository](https://github.com/aisingapore/sealion) to your GitHub account.
2. Clone your forked repository to your local machine:

```shell
git clone https://github.com/your-username/sealion.git
```

3. Create a new branch for your contribution:

```shell
git checkout -b feature/your-feature-name
```

4. Make your changes and commit them with clear and concise commit messages.

5. Push your changes to your forked repository:

```shell
git push origin feature/your-feature-name
```

6. Create a pull request (PR) from your branch to the main SEA-LION repository.

7. Ensure your PR includes a detailed description of the changes, why they are necessary, and any relevant testing or documentation updates.

8. Participate in the review process, addressing any feedback or requested changes.

9. Once your PR is approved, it will be merged into the main repository.

## Development Setup

If you want to contribute code, you'll need to set up a development environment for SEA-LION. Refer to the [Development Setup](https://github.com/aisingapore/sealion/blob/main/README.md) guide in the repository for detailed instructions on getting started.

## Code of Conduct

Please review and adhere to our [Code of Conduct](CODE_OF_CONDUCT.md). We expect all contributors and community members to treat each other with respect and kindness.

## Get Help

If you have questions, need assistance, or want to discuss contributions further, please feel free to [contact us](mailto:sealion@aisingapore.org) or open an issue for discussion.

We appreciate your interest in contributing to SEA-LION, and we look forward to collaborating with you!

21 changes: 21 additions & 0 deletions sea-lion-v2/LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 AI Singapore

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.