HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training

Authors

  • Seungho Choi, Wisenut
  • Sihyun Park, Wisenut
  • Minsang Kim, Wisenut
  • Chansol Park, Wisenut
  • Bongsu Kim, Wisenut

DOI:

https://doi.org/10.1609/aaai.v40i36.40294

Abstract

Large language models (LLMs) often underperform in low-resource languages such as Korean, partly because of language-specific challenges such as homophonous Sino-Korean words that are indistinguishable when written in Hangul. To address this semantic ambiguity, we propose HanjaBridge, a novel meaning-injection technique integrated into a continual pre-training (CPT) framework. Instead of deterministically mapping a word to a single Hanja (Chinese character) form, HanjaBridge presents the model with all possible Hanja candidates for a given homograph, encouraging the model to learn contextual disambiguation. This process is paired with token-level knowledge distillation to prevent catastrophic forgetting. Experimental results show that HanjaBridge significantly improves Korean language understanding, achieving a 21% relative improvement on the KoBALT benchmark. Notably, by reinforcing semantic alignment between Korean and Chinese through shared Hanja, we observe strong positive cross-lingual transfer. Furthermore, these gains persist even when Hanja augmentation is omitted at inference time, so the method remains practical and adds no run-time cost.
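The two mechanisms named in the abstract, candidate presentation and token-level distillation, can be illustrated with a minimal sketch. It is not the authors' implementation. The toy lexicon `HANJA_CANDIDATES`, the inline `word(cand1/cand2)` format, the pre-segmented token input, and the choice of a per-position KL divergence averaged over the sequence are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math

# Hypothetical toy lexicon: Hangul surface form -> candidate Hanja spellings.
# A real system would draw these from a full Sino-Korean dictionary.
HANJA_CANDIDATES = {
    "의사": ["醫師", "意思", "義士"],  # doctor / intention / righteous martyr
    "감사": ["感謝", "監査"],          # gratitude / audit
}


def augment(tokens, lexicon=HANJA_CANDIDATES):
    """Attach every Hanja candidate to each ambiguous token.

    Assumes `tokens` are already morphologically segmented, so particles
    are separate from the noun. The inline format is an assumption.
    """
    out = []
    for tok in tokens:
        cands = lexicon.get(tok)
        out.append(f"{tok}({'/'.join(cands)})" if cands else tok)
    return out


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_level_kd_loss(teacher_logits, student_logits):
    """Mean over token positions of KL(teacher || student).

    Each argument is a list of per-position logit lists of equal shape.
    This loss form is an assumed, common choice for distillation.
    """
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)
        q = softmax(s_row)
        losses.append(
            sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
        )
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(" ".join(augment(["의사", "가", "감사", "를", "전했다"])))
    teacher = [[2.0, 0.5, 0.1], [0.3, 1.7, 0.2]]
    student = [[1.5, 0.7, 0.2], [0.1, 1.9, 0.4]]
    print(token_level_kd_loss(teacher, student))
```

In this sketch the model sees every candidate rather than one pre-selected reading, so choosing the right sense is left to context during training. The distillation term penalizes the student for drifting from the original model's per-token distribution, which is one plausible way to limit catastrophic forgetting.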

Published

2026-03-14

How to Cite

Choi, S., Park, S., Kim, M., Park, C., & Kim, B. (2026). HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36), 30413–30421. https://doi.org/10.1609/aaai.v40i36.40294

Issue

Vol. 40 No. 36 (2026)

Section

AAAI Technical Track on Natural Language Processing I