GoSBD is a library for segmenting text into sentences, designed to make it easier to build Retrieval Augmented Generation (RAG) systems in Go. It is inspired by pySBD and pragmatic_segmenter, and works out-of-the-box with a rule-based approach.
Try out GoSBD in our online playground.
- Sentence Splitting: Efficiently breaks down a block of text into individual sentences.
- Lightweight and Easy Integration: Designed to be lightweight and easy to integrate into existing Go projects.
- High Accuracy: Offers high accuracy in sentence segmentation. For more details, see pySBD.
- Fast Sentence Splitting: GoSBD aims to provide high-performance sentence splitting by leveraging Go's efficiency.
- Non-Destructive Splitting: Segments text into sentences without altering the original content.
- Language-Specific Configuration: Adaptable to handle punctuation rules specific to different languages.
- Text Cleaning: Equipped with features to manage and clean noisy text, including:
- Handling irregular newline characters and spacing
- Processing Tables of Contents
- Recognizing and managing URLs and HTML tags
- Dealing with sentences that are delimited without any space
Note: Text Cleaning feature is to be implemented. Contributions are greatly welcomed.
To install gosbd, you can use go get
:
go get github.com/gosbd/gosbd
Here's a basic example of how to use gosbd:
package main
import (
"fmt"
"github.com/gosbd/gosbd"
)
// This example segments a text string into individual sentences.
func main() {
segmenter := gosbd.NewSegmenter("en")
text := "This is a sentence. And this is another one."
sentences := segmenter.Segment(text)
for _, sentence := range sentences {
fmt.Println(sentence)
}
}
- Add Online Playground.
- Add chuking feature with overlapping option.
- Setup Codecov for monitoring test coverage.
- Implement text cleaner.
- Add support for more languages.
- Add benchmark test.
- Setup GitHub Action for testing.
The following table outlines our current language support. We're actively seeking contributions to expand this list. If you're interested in contributing, consider helping us add support for a language, whether it's listed below or not. Your expertise in a language not listed here could be a valuable addition to our project.
Language | ISO Code | Supported |
---|---|---|
Amharic | am | Planned |
Arabic | ar | Planned |
Armenian | hy | Planned |
Bulgarian | bg | Planned |
Burmese | my | Planned |
Chinese | zh | Yes |
Danish | da | Planned |
Deutsch | de | Planned |
Dutch | nl | Planned |
English | en | Yes |
French | fr | Planned |
Greek | el | Planned |
Hindi | hi | Planned |
Italian | it | Planned |
Japanese | ja | Yes |
Kazakh | kk | Planned |
Marathi | mr | Planned |
Persian | fa | Planned |
Polish | pl | Planned |
Russian | ru | Yes |
Slovak | sk | Planned |
Spanish | es | Planned |
Urdu | ur | Planned |
We welcome contributions that help us add support for these languages. Please feel free to submit a Pull Request with your contributions.
Sentence splitting is a crucial step in the preprocessing pipeline of Natural Language Processing (NLP) tasks, especially for building Retrieval Augmented Generation (RAG) systems. RAG systems rely on accurately segmented sentences to retrieve relevant information and generate coherent responses.
While libraries like pragmatic_segmenter and pySBD are known for their high accuracy and efficiency in sentence splitting, there are no equivalent libraries available in Go. This poses a challenge for developers building RAG systems in Go, as they need to rely on external libraries or implement their own sentence splitting logic.
GoSBD aims to bridge this gap by providing a reliable and efficient sentence splitting solution in Go. By offering a native Go library for sentence splitting, GoSBD simplifies the process of building RAG systems and other NLP applications entirely within the Go ecosystem. This not only streamlines the development workflow but also enables faster execution times by leveraging Go's performance characteristics.
This library builds upon the excellent foundations laid by pySBD and pragmatic_segmenter.
Contributions are greatly appreciated and crucial for this project! Here are a few ways you can contribute:
- Add new tests and rules: Improve the accuracy of sentence segmentation by adding new tests and rules.
- Add support for a new language: Help expand the reach of this library by adding support for new languages.
- Port features: Help improve this library by porting features that are supported in pySBD and pragmatic_segmenter.
Please feel free to submit a Pull Request with your contributions.
This project is licensed under the MIT License.