Generating Non-English Synthetic Medical Data Sets

Using synthetic datasets to train medicine-focused machine learning models has been shown to enhance their performance; however, most research focuses on English texts. In this paper, we explore generating non-English synthetic medical texts. We propose a methodology for generating synthetic medical data and showcase it by generating Greeklish medical texts relating to hypertension. We evaluate our approach with seven different language models and assess the quality of the datasets by training a classifier to distinguish between original and synthetic examples. We find that Llama-3 performs best for our task.