DEV Community

SEN LLC

A Japanese Kana Converter With Hepburn, Kunrei, and Nihon-shiki Romanization

Hiragana → Katakana is one Unicode offset (+0x60). Kana → Romaji is a lookup table, but which table? Japan has three major romanization systems: Hepburn (what most foreigners see), Kunrei-shiki (taught in Japanese schools), and Nihon-shiki (historical, strictest). "shi" vs "si", "tsu" vs "tu", "chi" vs "ti": they're all correct depending on which system you mean.

Japanese text conversion sounds trivial but opens a surprising set of questions about romanization standards, half-width katakana, and the one-to-many mapping problem of converting back from romaji.

🔗 Live demo: https://sen.ltd/portfolio/kana-converter/
📦 GitHub: https://github.com/sen-ltd/kana-converter

Features:

  • Hiragana ↔ Katakana
  • Hiragana / Katakana ↔ Romaji (3 systems)
  • Half-width ↔ Full-width katakana
  • Live conversion
  • Swap direction button
  • Japanese / English UI
  • Zero dependencies, 73 tests

Hiragana to Katakana: one offset

Hiragana range: U+3041-U+3096. Katakana range: U+30A1-U+30F6. The difference: exactly 0x60.

export function hiraganaToKatakana(text) {
  return [...text].map(c => {
    const code = c.charCodeAt(0);
    if (code >= 0x3041 && code <= 0x3096) {
      return String.fromCharCode(code + 0x60);
    }
    return c;
  }).join('');
}

The reverse conversion is the same with - 0x60. あ (0x3042) + 0x60 = ア (0x30A2). The two kana blocks are laid out in the same order in Unicode, which is what makes this conversion trivial.
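The reverse direction can be sketched the same way. The function name katakanaToHiragana is an assumption for illustration, not necessarily what the repository uses:

```javascript
// Sketch of the reverse conversion; the function name is hypothetical.
function katakanaToHiragana(text) {
  return [...text].map(c => {
    const code = c.codePointAt(0);
    // U+30A1 (ァ) through U+30F6 (ヶ) mirror U+3041 (ぁ) through U+3096 (ゖ)
    if (code >= 0x30A1 && code <= 0x30F6) {
      return String.fromCodePoint(code - 0x60);
    }
    // Everything else (the long-vowel mark ー, kanji, Latin) passes through
    return c;
  }).join('');
}

console.log(katakanaToHiragana('カタカナ')); // かたかな
console.log(katakanaToHiragana('ラーメン')); // らーめん
```

Note that the long-vowel mark ー (U+30FC) sits outside the shifted range, so it is left as-is.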

Three romanization systems

Kana-to-romaji needs a lookup table. The three major systems disagree on several characters:

Kana    Hepburn    Kunrei    Nihon
し      shi        si        si
ち      chi        ti        ti
つ      tsu        tu        tu
ふ      fu         hu        hu
じ      ji         zi        zi
ぢ      ji         zi        di
づ      zu         zu        du
しゃ    sha        sya       sya

Hepburn is what Japanese train stations use and what most English speakers see. Kunrei-shiki is what Japanese elementary schools teach: more phonetically consistent but less intuitive for English speakers. Nihon-shiki is the strictest, distinguishing homophones like じ and ぢ (both pronounced "ji") by the kana column they come from.

For the converter, each system gets its own lookup table:

const HEPBURN = { 'し': 'shi', 'ち': 'chi', 'つ': 'tsu', ... };
const KUNREI  = { 'し': 'si',  'ち': 'ti',  'つ': 'tu',  ... };
const NIHON   = { 'し': 'si',  'ち': 'ti',  'つ': 'tu',  'ぢ': 'di', 'づ': 'du', ... };
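As a rough sketch of how per-system tables might drive a converter, the snippet below uses a deliberately tiny sample table and expresses Hepburn and Nihon-shiki as overrides on top of Kunrei-shiki. The names KUNREI_SAMPLE, TABLES, and kanaToRomaji are assumptions for illustration, and a real table needs every kana:

```javascript
// Partial sample table for illustration only (hypothetical name).
const KUNREI_SAMPLE = {
  'し': 'si', 'ち': 'ti', 'つ': 'tu', 'ふ': 'hu',
  'じ': 'zi', 'ぢ': 'zi', 'づ': 'zu', 'しゃ': 'sya',
  'か': 'ka', 'み': 'mi',
};

// Hepburn and Nihon-shiki written as overrides of the Kunrei sample.
const TABLES = {
  kunrei: KUNREI_SAMPLE,
  hepburn: {
    ...KUNREI_SAMPLE,
    'し': 'shi', 'ち': 'chi', 'つ': 'tsu', 'ふ': 'fu',
    'じ': 'ji', 'ぢ': 'ji', 'しゃ': 'sha',
  },
  nihon: { ...KUNREI_SAMPLE, 'ぢ': 'di', 'づ': 'du' },
};

function kanaToRomaji(text, system = 'hepburn') {
  const table = TABLES[system];
  const chars = [...text];
  let result = '';
  for (let i = 0; i < chars.length; i++) {
    // Try a two-kana digraph such as しゃ before a single kana.
    const pair = chars[i] + (chars[i + 1] ?? '');
    if (i + 1 < chars.length && table[pair] !== undefined) {
      result += table[pair];
      i++; // skip the small kana that was just consumed
    } else {
      result += table[chars[i]] ?? chars[i]; // unknown characters pass through
    }
  }
  return result;
}

console.log(kanaToRomaji('つづみ', 'hepburn')); // tsuzumi
console.log(kanaToRomaji('つづみ', 'kunrei'));  // tuzumi
console.log(kanaToRomaji('つづみ', 'nihon'));   // tudumi
```

The word つづみ shows all three systems disagreeing on a single input, which is why the system has to be an explicit parameter.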

ん before vowels

A subtle Hepburn rule: ん before a vowel or y is written as "n'" with an apostrophe to prevent ambiguity. 禁煙 (きんえん) is "kin'en", not "kinen" (which would read as きねん), and 本屋 (ほんや) is "hon'ya", not "honya" (which would read as ほにゃ).

// Detect ん followed by あいうえお or やゆよ and insert an apostrophe
result = result.replace(/ん([あいうえおやゆよ])/g, "n'$1");

The regex runs on the hiragana source before conversion, so it sees the actual ん character and can check the character that follows it.
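A minimal end-to-end sketch of that two-pass idea follows. SAMPLE and toHepburnWithApostrophe are hypothetical names, and the table holds just enough kana for the examples:

```javascript
// Hypothetical mini-table: just enough kana for the examples below.
const SAMPLE = {
  'あ': 'a', 'い': 'i', 'え': 'e', 'き': 'ki', 'な': 'na',
  'ね': 'ne', 'ほ': 'ho', 'や': 'ya', 'ん': 'n',
};

function toHepburnWithApostrophe(hiragana) {
  // Pass 1, on the kana: ん directly before a vowel or y-row kana becomes "n'".
  const marked = hiragana.replace(/ん(?=[あいうえおやゆよ])/g, "n'");
  // Pass 2, table lookup: kana are converted, the Latin "n'" passes through.
  return [...marked].map(c => SAMPLE[c] ?? c).join('');
}

console.log(toHepburnWithApostrophe('きんえん')); // kin'en
console.log(toHepburnWithApostrophe('きねん'));   // kinen
console.log(toHepburnWithApostrophe('ほんや'));   // hon'ya
console.log(toHepburnWithApostrophe('あんない')); // annai
```

Running the check on the kana rather than on the romaji output avoids having to guess whether an "n" in the output came from ん or from the start of な, に, ぬ, ね, の.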

Half-width katakana

Half-width katakana (ｱｲｳｴｵ) lives in a separate Unicode range: U+FF66-U+FF9F. It dates back to 8-bit character sets such as JIS X 0201 and is still used in some legacy systems (ATMs, older printers).

The quirk: dakuten and handakuten are separate characters in half-width. ガ is one character in full-width (U+30AC) but two in half-width: ｶ (U+FF76) + ﾞ (U+FF9E).

const FULL_TO_HALF = {
  'ア': 'ｱ', 'ガ': 'ｶﾞ', 'ザ': 'ｻﾞ', 'パ': 'ﾊﾟ', 'ヴ': 'ｳﾞ', ...
};

So converting ガガ (2 characters) produces ｶﾞｶﾞ (4 characters). The string length doubles, so the conversion is inherently not length-preserving.
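The length change, and the reverse direction where a base character and its mark must be recombined, can be sketched as below. The table is a small sample and the names FULL_TO_HALF_SAMPLE, HALF_TO_FULL_SAMPLE, fullToHalf, and halfToFull are assumptions for illustration. Half-width characters are written as escapes so the code points are unambiguous:

```javascript
// Partial sample table (hypothetical name); a full converter needs every katakana.
const FULL_TO_HALF_SAMPLE = {
  'ア': '\uFF71',       // ｱ
  'カ': '\uFF76',       // ｶ
  'ガ': '\uFF76\uFF9E', // ｶ + half-width dakuten
  'パ': '\uFF8A\uFF9F', // ﾊ + half-width handakuten
};

function fullToHalf(text) {
  return [...text].map(c => FULL_TO_HALF_SAMPLE[c] ?? c).join('');
}

// Inverted table, two-character sequences first, so that ｶ + dakuten
// is recombined into ガ instead of being read as a bare カ.
const HALF_TO_FULL_SAMPLE = Object.entries(FULL_TO_HALF_SAMPLE)
  .map(([full, half]) => [half, full])
  .sort((a, b) => b[0].length - a[0].length);

function halfToFull(text) {
  let result = '';
  let i = 0;
  while (i < text.length) {
    const hit = HALF_TO_FULL_SAMPLE.find(([half]) => text.startsWith(half, i));
    if (hit) {
      result += hit[1];
      i += hit[0].length;
    } else {
      result += text[i]; // not in the table: copy through unchanged
      i++;
    }
  }
  return result;
}

const half = fullToHalf('ガガ');
console.log(half.length);      // 4
console.log(halfToFull(half)); // ガガ
```

Any code that truncates or pads by string length has to run after this conversion, not before it.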

Romaji to Hiragana: greedy matching

Going back from romaji requires greedy longest-match:

const TABLE = [
  ['shi', 'し'], ['chi', 'ち'], ['tsu', 'つ'],
  ['sha', 'しゃ'], ['shu', 'しゅ'], ['sho', 'しょ'],
  ['ka', 'か'], ['ki', 'き'], ...
];

export function romajiToHiragana(text) {
  let result = '';
  let i = 0;
  while (i < text.length) {
    let matched = false;
    // Try longer keys first
    for (const [rom, kana] of TABLE) {
      if (text.slice(i, i + rom.length) === rom) {
        result += kana;
        i += rom.length;
        matched = true;
        break;
      }
    }
    if (!matched) { result += text[i]; i++; }
  }
  return result;
}

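TABLE is sorted so that longer keys come first. Because the loop takes the first key that matches at the current position, this ordering is what makes the match greedy: shi is consumed as a whole before any shorter key that shares its prefix, and sha matches as a single digraph instead of s + ha.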
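One way to get that ordering without maintaining it by hand is to sort the entries by key length. This is a sketch with a hypothetical sample map (ROMAJI_SAMPLE, SORTED_TABLE, and romajiToHiraganaSketch are illustrative names), not the repository's actual table:

```javascript
// Hypothetical unsorted sample map; note that n is listed before na.
const ROMAJI_SAMPLE = {
  a: 'あ', i: 'い', n: 'ん', na: 'な', ha: 'は', ka: 'か', ki: 'き',
  shi: 'し', chi: 'ち', tsu: 'つ', sha: 'しゃ',
};

// Longest keys first, so the first hit at a position is the longest match.
const SORTED_TABLE = Object.entries(ROMAJI_SAMPLE)
  .sort(([a], [b]) => b.length - a.length);

function romajiToHiraganaSketch(text) {
  let result = '';
  let i = 0;
  while (i < text.length) {
    const hit = SORTED_TABLE.find(([rom]) => text.startsWith(rom, i));
    if (hit) {
      result += hit[1];
      i += hit[0].length;
    } else {
      result += text[i]; // no key matches here: copy the character through
      i++;
    }
  }
  return result;
}

console.log(romajiToHiraganaSketch('kana'));  // かな
console.log(romajiToHiraganaSketch('kisha')); // きしゃ
console.log(romajiToHiraganaSketch('shin'));  // しん
```

Without the sort, n would be tried before na in this sample, and "kana" would come out as かんあ.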

Series

This is entry #84 in my 100+ public portfolio series.
