Suxotin's Vowel Identification Algorithm

Overview

This repository contains an implementation of Suxotin's algorithm, designed to identify vowels in texts by analyzing their formal properties. Using statistical analysis, the algorithm examines letter combinations to classify letters based on their adjacency patterns. This algorithm is especially useful for linguists and decipherment enthusiasts.

Although the underlying principles of the algorithm are not extensively explained in the literature, its efficacy is evident. For example, in the sherlock_holmes.txt file, the algorithm accurately identified all relevant vowels with minimal errors:

Classified vowels: ['a', 'e', 'i', 'o', 'u', 'â', 'æ', 'è', 'é']
Classified consonants: ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z', 'à', 'œ']

Testing on the NLTK Gutenberg corpus yielded similar accuracy, confirming its robustness:

Classified Vowels: ['a', 'e', 'i', 'o', 'u', 'æ', 'è']
Classified Consonants: ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q', 'r', 's', 't', 'v', 'w', 'x', 'y', 'z', 'é', 'î']

The algorithm occasionally misclassified low-frequency vowels like 'à', 'œ', 'é' and 'î' as consonants.

Features

Language Agnostic: The algorithm can process texts in multiple languages and accurately identify vowel patterns regardless of the language.
Fast and Efficient: By leveraging NumPy for matrix operations, the algorithm achieves high computational speed and efficiency.

Why the Algorithm Works

Suxotin's algorithm is effective because it leverages two key linguistic and statistical principles observed in natural languages:

Alternation of Vowels and Consonants:
- Vowels and consonants often alternate in texts. This means that vowels are frequently adjacent to consonants and vice versa. Clusters of vowels or consonants do occur but are less common.
- This alternation results in high adjacency counts between vowels and consonants, which is captured in the frequency matrix.
Prevalence of Vowels:
- Vowels typically occur more frequently in texts than individual consonants because they are essential for the phonetic structure of words.
- Therefore, in any sufficiently long text, vowels will have higher overall counts and higher adjacency counts due to their frequent alternation with consonants.

Algorithm Steps

Frequency Matrix Creation:
- Create a matrix where each cell (i, j) represents the number of times character i is adjacent to character j.
- High values in this matrix indicate frequent adjacency.
Row Sums:
- The sum of each row in the frequency matrix provides a measure of how frequently each character appears next to others.
- Vowels, due to their frequent alternation with consonants, will generally have higher row sums compared to consonants.
Iterative Classification:
- Classify the character with the highest adjacency sum as a vowel.
- Adjust the sums of the remaining characters by subtracting twice the adjacency counts involving the newly classified vowel, reducing the likelihood of adjacent characters being misclassified.
- Repeat until no positive sums are left.

Mathematical Foundation

Markov Chains: The algorithm can be viewed through the lens of Markov chains, where state transitions (i.e., moving from one character to another) are influenced by the classification of previous characters.
Maximum Likelihood Estimation (MLE): The algorithm maximizes the likelihood that the classified vowels and consonants fit the observed adjacency patterns in the text.

These principles, combined with the iterative refinement process, make Suxotin's algorithm effective in separating vowels from consonants based on the statistical behavior of characters in natural language texts.

Repository Contents

suxotin.py: The primary Python script implementing Suxotin's algorithm with NumPy for efficient matrix operations.
sherlock_holmes.txt: A sample text file used to test the algorithm's performance.
suxotin: A directory containing documentation of two versions of the Suxotin algorithm and a text version of the algorithm for further testing purposes.
- Decipherment_Models_1973.pdf: Details the original version of the Suxotin algorithm.
- guySuxotin.pdf: Details an alternate version of the Suxotin algorithm.
- vowel-algorithm.txt: A text version of the Suxotin algorithm for additional testing purposes.
README.md: Provides an overview and guidance on using the repository's resources.

Installation

Clone this repository to your local machine:

Using git:

git clone https://github.com/jhnwnstd/suxotin

Using GitHub CLI:
```
gh repo clone jhnwnstd/suxotin
```

Install the required dependencies:
- You need Python 3.11 or higher. Check your Python version by running python --version in your terminal.
- Install NumPy and NLTK, which are essential for running the algorithm:
```
pip install numpy nltk
```
Download the necessary NLTK data:
- After installing NLTK, download the Gutenberg corpus, which is used if you choose the NLTK data source:
```
import nltk
nltk.download('gutenberg')
```

Usage

To run the script suxotin.py in this repository:

Ensure that Python and the required packages (NumPy and NLTK) are installed on your system.
Navigate to the directory containing the scripts.
- Open a terminal or command prompt.
- Change the directory to where suxotin.py is located, e.g., cd path/to/suxotin.
Execute the script:
```
python suxotin.py
```
- Follow the on-screen prompts to choose between processing a local text file or using the NLTK Gutenberg corpus.
- You will also be asked whether to preprocess the text. Respond with 'yes' or 'no' based on your preference. It is recommended 'yes' to preprocess the text to remove punctuation and convert all characters to lowercase.

Contributing

I appreciate contributions to enhance this project and extend the capabilities of the algorithm. To contribute, please fork the repository, make changes, and submit a pull request.

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
suxotin		suxotin
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
sherlock_holmes.txt		sherlock_holmes.txt
suxotin.py		suxotin.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Suxotin's Vowel Identification Algorithm

Overview

Features

Why the Algorithm Works

Algorithm Steps

Mathematical Foundation

Repository Contents

Installation

Usage

Contributing

About

Releases

Packages

Languages

License

jhnwnstd/suxotin

Folders and files

Latest commit

History

Repository files navigation

Suxotin's Vowel Identification Algorithm

Overview

Features

Why the Algorithm Works

Algorithm Steps

Mathematical Foundation

Repository Contents

Installation

Usage

Contributing

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages