Marma TTS and text data resources

Text data

This dataset contains sentences in Marma (ISO 639-3 code: rmz), each in its original and normalized form. It is designed to support language technology development for Marma, a Tibeto-Burman language spoken primarily by the Marma people in Bangladesh and Myanmar. The dataset was created as part of a project funded by the Australian Government Department of Foreign Affairs and Trade (DFAT) through CLEAR Global.

Language: Marma (rmz)
Script: Burmese script
Language Family: Tibeto-Burman

Dataset Structure

Data Instances – Each instance consists of:

source: The source of the text
sentence: The original sentence text (in the test set, this field holds the test sentence ID instead)
normalized: The normalized version of the sentence

Data Splits – The dataset is split into training and test sets:

Train: 5575 examples
Test: 100 examples
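
As an illustration of the structure above, the following Python sketch models one instance and checks that a row carries the three documented fields. The class name `MarmaSentence`, the helpers `parse_row` and `held_out_fraction`, and the dict-shaped row are assumptions made for this example; the dataset's file format and loading API are not specified here.

```python
from dataclasses import dataclass

# Field names and split sizes as documented above.
FIELDS = ("source", "sentence", "normalized")
SPLIT_SIZES = {"train": 5575, "test": 100}


@dataclass(frozen=True)
class MarmaSentence:
    """One dataset instance (hypothetical container, not an official API)."""
    source: str      # where the text comes from
    sentence: str    # original sentence; in the test set, a test sentence ID
    normalized: str  # normalized version of the sentence


def parse_row(row: dict) -> MarmaSentence:
    """Build a MarmaSentence from a dict-like row, rejecting incomplete rows."""
    missing = [name for name in FIELDS if name not in row]
    if missing:
        raise KeyError(f"row is missing fields: {missing}")
    # Extra keys in the row are ignored; values are stripped of outer whitespace.
    return MarmaSentence(**{name: str(row[name]).strip() for name in FIELDS})


def held_out_fraction() -> float:
    """Share of all examples that belong to the test split."""
    return SPLIT_SIZES["test"] / sum(SPLIT_SIZES.values())
```

With the documented counts, the test split is roughly 1.8% of the 5675 examples, so it is better suited to a quick sanity check than to fine-grained evaluation.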

Sources – The texts in this collection come from various sources:

Sentences written by our linguist (Aong)
Marma Tainyintha by Nuthowain Barang
Useful phrases in Marma (Marma-2.docx)
Maths problems in Marma
Marma textbooks from the Bangladesh National Curriculum and Textbook Board (NCTB)
PCJSS
Marma Poems
Refungjang
အကျွန်ရို့ မာရမာ လူမျိုး ပါ

Permissions and Usage Rights

This dataset has been compiled with permission from the original content creators and community representatives.

CLEAR Global has obtained the necessary permissions to share this data under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Permissions Summary:

The dataset can be used for non-commercial research and educational purposes
Derivative works must acknowledge the original sources
Derivative works must be shared under the same license terms
Commercial use requires explicit permission from the original rights holders

MarmaSpeakTTS: Text-to-Speech Model for Marma Language

This model provides text-to-speech synthesis for the Marma language (ISO code: rmz), a Tibeto-Burman language spoken by the Marma people in Bangladesh and Myanmar.

Model Details:

Base model: Massively Multilingual Speech (MMS)
Type: Text-to-Speech
Language: Marma (rmz)
Training Data: The model was trained on Marma language audio recordings collected by CLEAR Global.
Training script: https://github.com/translatorswb/finetune-hf-vits-marma
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
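
For orientation, here is a minimal usage sketch in Python. It assumes the checkpoint loads with the `VitsModel` and `AutoTokenizer` classes from the Hugging Face `transformers` library, as MMS TTS checkpoints generally do; this card does not confirm that. The model ID is taken from the citation URL in this card, and the helper names `synthesize` and `encode_wav` are hypothetical.

```python
import io
import struct
import wave

# Model ID taken from the citation URL in this card.
MODEL_ID = "CLEAR-Global/marmaspeak-tts-v1"


def synthesize(text, model_id=MODEL_ID):
    """Return (samples, sampling_rate) for one normalized Marma sentence.

    Assumption: the checkpoint is compatible with transformers' VitsModel.
    Requires the `torch` and `transformers` packages and network access.
    """
    # Imported here so encode_wav() below works without these packages.
    import torch
    from transformers import AutoTokenizer, VitsModel

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (batch, samples)
    return waveform[0].tolist(), model.config.sampling_rate


def encode_wav(samples, sampling_rate):
    """Encode float samples in [-1, 1] as 16-bit mono PCM WAV bytes."""
    # Clip to [-1, 1] before scaling so out-of-range values cannot overflow.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()
```

In this sketch, the text passed to `synthesize` would be the `normalized` field of a dataset instance, and the bytes returned by `encode_wav(samples, rate)` can be written directly to a `.wav` file. The sampling rate is read from the model configuration rather than hard-coded.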

Limitations and Biases

This is an early version of the model and may have limitations in pronunciation and naturalness.
The model works best with properly normalized Marma text.
Performance may vary based on the complexity and length of the input text.

Training – The model was fine-tuned from a Massively Multilingual Speech (MMS) VITS model using the training recipe at https://github.com/translatorswb/finetune-hf-vits-marma.

Ethical Considerations – This model has been developed with permission and input from Marma language speakers. The voice synthesis should be used responsibly and respectfully.

Citation –

@misc{marma-tts,
  author = {CLEAR Global},
  title = {MarmaSpeakTTS: A Text-to-Speech Model for Marma Language},
  year = {2025},
  howpublished = {https://huggingface.co/CLEAR-Global/marmaspeak-tts-v1}
}

