Text data

This dataset contains sentences in the Marma language (ISO 639-3 code: rmz) in both original and normalized forms. It is designed to support language technology development for Marma, a Tibeto-Burman language spoken primarily by the Marma people in Bangladesh and Myanmar. The dataset was created as part of a project funded by the Australian Government Department of Foreign Affairs and Trade (DFAT) through CLEAR Global.

  • Language: Marma (rmz)
  • Script: Burmese script
  • Language Family: Tibeto-Burman

Dataset Structure

Data Instances – Each instance consists of:

  • source: The source of the text
  • sentence: The original sentence text (in the test set, this field holds the test sentence ID instead)
  • normalized: The normalized version of the sentence

Data Splits – The dataset is split into training and test sets:

  • Train: 5575 examples
  • Test: 100 examples
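The instance fields and split sizes listed above can be captured in a small validation helper. This is a minimal sketch, not the project's own loading code: the names `MarmaRecord`, `parse_record`, and `check_split_size` are hypothetical, and it assumes each row arrives as a plain dictionary keyed by the three field names.

```python
from dataclasses import dataclass

# Split sizes as stated in this card.
EXPECTED_SPLIT_SIZES = {"train": 5575, "test": 100}

# Field names as stated in this card.
REQUIRED_FIELDS = ("source", "sentence", "normalized")


@dataclass(frozen=True)
class MarmaRecord:
    """One dataset instance (hypothetical container type)."""
    source: str      # where the text came from
    sentence: str    # original sentence (a test sentence ID in the test set)
    normalized: str  # normalized form of the sentence


def parse_record(row: dict) -> MarmaRecord:
    """Build a MarmaRecord, failing loudly if a field is missing."""
    missing = [name for name in REQUIRED_FIELDS if name not in row]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return MarmaRecord(*(row[name] for name in REQUIRED_FIELDS))


def check_split_size(split: str, rows: list) -> bool:
    """Return True if the split has the number of examples the card states."""
    return len(rows) == EXPECTED_SPLIT_SIZES[split]
```

A check like this is cheap to run after download and catches truncated or mislabeled files before training starts.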

Sources – The texts in this collection come from various sources.

Permissions and Usage Rights

This dataset has been compiled with permission from the original content creators and community representatives.

CLEAR Global has obtained the necessary permissions to share this data under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Permissions Summary:

  • The dataset can be used for non-commercial research and educational purposes
  • Derivative works must acknowledge the original sources
  • Derivative works must be shared under the same license terms
  • Commercial use requires explicit permission from the original rights holders

MarmaSpeakTTS: Text-to-Speech Model for Marma Language

This model provides text-to-speech synthesis for the Marma language (ISO code: rmz), a Tibeto-Burman language spoken by the Marma people in Bangladesh and Myanmar.

Model Details:

Limitations and Biases

  • This is an early version of the model and may have limitations in pronunciation and naturalness.
  • The model works best with properly normalized Marma text.
  • Performance may vary based on the complexity and length of the input text.
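Because output quality may depend on input length, one option is to split long text into shorter chunks and synthesize each separately. The sketch below is an illustration only. It assumes sentences end with the Burmese-script section sign (U+104B), which may not hold for all Marma text, and the helper names and the 200-character default are hypothetical choices, not values from the model authors.

```python
import re

# Assumption: sentences are terminated by MYANMAR SIGN SECTION (U+104B).
SECTION_SIGN = "\u104B"

# A run of non-terminator characters, optionally followed by the terminator.
_SENTENCE_RE = re.compile(rf"[^{SECTION_SIGN}]+{SECTION_SIGN}?")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping each terminator with its sentence."""
    return [part.strip() for part in _SENTENCE_RE.findall(text) if part.strip()]


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the model on its own and the resulting audio segments concatenated.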

Training – The model was fine-tuned from a Massively Multilingual Speech (MMS) VITS model using this training recipe.
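Since the model derives from an MMS VITS checkpoint, inference may follow the usual pattern for VITS models in the Hugging Face `transformers` library. The following is a hedged sketch rather than documented usage. It assumes the repository named in the citation loads with the standard `VitsModel` and `AutoTokenizer` classes, which this card does not confirm, and the `synthesize` helper is hypothetical.

```python
# Repository id taken from the citation in this card.
MODEL_ID = "CLEAR-Global/marmaspeak-tts-v1"


def synthesize(normalized_text: str, model_id: str = MODEL_ID):
    """Return (waveform, sampling_rate) for one normalized Marma sentence."""
    # Imported lazily so this module can be loaded without the heavy dependencies.
    import torch
    from transformers import AutoTokenizer, VitsModel

    # Assumption: the checkpoint is compatible with the standard VITS classes.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(normalized_text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (batch, num_samples)

    # Drop the batch dimension and report the rate stored in the model config.
    return waveform[0].cpu().numpy(), model.config.sampling_rate
```

The returned array can be written to a WAV file at the returned sampling rate. Passing normalized text is advisable, since the model is reported to work best with it.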

Ethical Considerations – This model has been developed with permission and input from Marma language speakers. The voice synthesis should be used responsibly and respectfully.

Citation –

@misc{marma-tts,
  author = {CLEAR Global},
  title = {MarmaSpeakTTS: A Text-to-Speech Model for Marma Language},
  year = {2025},
  howpublished = {https://huggingface.co/CLEAR-Global/marmaspeak-tts-v1}
}