A Japanese Kana Converter With Hepburn, Kunrei, and Nihon-shiki Romanization

#javascript #japanese #converter #unicode

Hiragana → Katakana is one Unicode offset (+0x60). Kana → Romaji is a lookup table, but which table? Japan has three major romanization systems: Hepburn (what most foreigners see), Kunrei-shiki (taught in Japanese schools), and Nihon-shiki (historical, strictest). "shi" vs "si", "tsu" vs "tu", "chi" vs "ti": each is correct depending on which system you mean.

Japanese text conversion sounds trivial but opens a surprising set of questions about romanization standards, half-width katakana, and the one-to-many mapping problem of converting back from romaji.

🔗 Live demo: https://sen.ltd/portfolio/kana-converter/
📦 GitHub: https://github.com/sen-ltd/kana-converter

Features:

Hiragana ↔ Katakana
Hiragana / Katakana ↔ Romaji (3 systems)
Half-width ↔ Full-width katakana
Live conversion
Swap direction button
Japanese / English UI
Zero dependencies, 73 tests

Hiragana to Katakana: one offset

Hiragana range: U+3041-U+3096. Katakana range: U+30A1-U+30F6. The difference: exactly 0x60.

export function hiraganaToKatakana(text) {
  return [...text].map(c => {
    const code = c.charCodeAt(0);
    if (code >= 0x3041 && code <= 0x3096) {
      return String.fromCharCode(code + 0x60);
    }
    return c;
  }).join('');
}

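The reverse conversion subtracts 0x60 from code points in the katakana range. For example, あ (0x3042) + 0x60 = ア (0x30A2). The hiragana and katakana blocks are laid out in the same order in Unicode, which is what makes this conversion a single addition or subtraction.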
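As a minimal sketch of the reverse direction (the function name katakanaToHiragana is an assumption, not necessarily what the repository uses), the same range check works with the offset subtracted:

```javascript
// Sketch only: mirrors hiraganaToKatakana with the offset subtracted.
function katakanaToHiragana(text) {
  return [...text].map(c => {
    const code = c.codePointAt(0);
    // U+30A1..U+30F6 is the katakana range that mirrors hiragana U+3041..U+3096.
    if (code >= 0x30A1 && code <= 0x30F6) {
      return String.fromCodePoint(code - 0x60);
    }
    // Everything else passes through unchanged, including the long-vowel
    // mark ー (U+30FC), which sits outside the mirrored range.
    return c;
  }).join('');
}

console.log(katakanaToHiragana('カタカナ')); // かたかな
```

Characters outside the mirrored range are left alone here; whether a real converter should treat ー or the rarer katakana above U+30F6 differently is a design choice this sketch does not settle.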

Three romanization systems

Kana-to-romaji needs a lookup table. The three major systems disagree on several characters:

Kana	Hepburn	Kunrei	Nihon
し	shi	si	si
ち	chi	ti	ti
つ	tsu	tu	tu
ふ	fu	hu	hu
じ	ji	zi	zi
ぢ	ji	zi	di
づ	zu	zu	du
しゃ	sha	sya	sya

Hepburn is what Japanese train stations use and what most English speakers see. Kunrei-shiki is what Japanese elementary schools teach — more phonetically consistent but less intuitive for English speakers. Nihon-shiki is the strictest, distinguishing homophones like じ and ぢ (both pronounced "ji") by the kana column they come from.

For the converter, each system gets its own lookup table:

const HEPBURN = { 'し': 'shi', 'ち': 'chi', 'つ': 'tsu', ... };
const KUNREI  = { 'し': 'si',  'ち': 'ti',  'つ': 'tu',  ... };
const NIHON   = { 'し': 'si',  'ち': 'ti',  'つ': 'tu',  'ぢ': 'di', 'づ': 'du', ... };
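To make the table-per-system idea concrete, here is a rough sketch of how a lookup might select a table by name. The names SHARED, TABLES, and kanaToRomaji are hypothetical, and the tables hold only a handful of kana chosen to show where the systems diverge:

```javascript
// Tiny illustrative tables, not the full mapping.
const SHARED = { 'か': 'ka', 'く': 'ku', 'ん': 'n' };
const TABLES = {
  hepburn: { ...SHARED, 'し': 'shi', 'つ': 'tsu', 'づ': 'zu', 'しゃ': 'sha' },
  kunrei:  { ...SHARED, 'し': 'si',  'つ': 'tu',  'づ': 'zu', 'しゃ': 'sya' },
  nihon:   { ...SHARED, 'し': 'si',  'つ': 'tu',  'づ': 'du', 'しゃ': 'sya' },
};

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

function kanaToRomaji(text, system = 'hepburn') {
  if (!has(TABLES, system)) {
    throw new Error(`Unknown romanization system: ${system}`);
  }
  const table = TABLES[system];
  let out = '';
  let i = 0;
  while (i < text.length) {
    // Try a two-kana digraph (しゃ) before falling back to a single kana.
    const pair = text.slice(i, i + 2);
    if (pair.length === 2 && has(table, pair)) {
      out += table[pair];
      i += 2;
    } else {
      const c = text[i];
      out += has(table, c) ? table[c] : c; // unknown characters pass through
      i += 1;
    }
  }
  return out;
}

console.log(kanaToRomaji('つづく', 'hepburn'));  // tsuzuku
console.log(kanaToRomaji('つづく', 'nihon'));    // tuduku
console.log(kanaToRomaji('しゃしん', 'kunrei')); // syasin
```

Checking the two-kana key first keeps しゃ from being split into し plus a stray small ゃ.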

ん before vowels

A subtle Hepburn rule: ん before a vowel or y is written as "n'" with an apostrophe to prevent ambiguity. 禁煙 (きんえん) is "kin'en", not "kinen", which would read as 記念 (きねん), and 本屋 (ほんや) is "hon'ya", not "honya". No apostrophe is needed before other consonants: 案内 (あんない) is simply "annai".

// Detect ん followed by あいうえお or やゆよ and replace it with n + apostrophe
result = result.replace(/ん(?=[あいうえおやゆよ])/g, "n'");

The regex runs on the hiragana source before conversion. That way it sees the actual ん character and can check the following character. The inserted "n'" must then pass through the table lookup unchanged.
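A small end-to-end sketch of the apostrophe rule, under the assumption that the table lookup leaves unknown characters (such as the Latin n and the apostrophe) untouched. MINI_HEPBURN, markSyllabicN, and toHepburn are hypothetical names, and the table covers only the kana needed for these words:

```javascript
// Illustrative subset of a Hepburn table.
const MINI_HEPBURN = {
  'き': 'ki', 'ね': 'ne', 'え': 'e', 'ん': 'n', 'ほ': 'ho', 'や': 'ya',
};

// Replace ん with "n'" only when a vowel or y-row kana follows it.
function markSyllabicN(hiragana) {
  return hiragana.replace(/ん(?=[あいうえおやゆよ])/g, "n'");
}

function toHepburn(hiragana) {
  const marked = markSyllabicN(hiragana);
  // Characters missing from the table (the inserted n and ') pass through.
  return [...marked].map(c => MINI_HEPBURN[c] ?? c).join('');
}

console.log(toHepburn('きんえん')); // kin'en
console.log(toHepburn('きねん'));   // kinen
console.log(toHepburn('ほんや'));   // hon'ya
```

Without the apostrophe, きんえん and きねん would both come out as "kinen", which is exactly the ambiguity the rule exists to prevent.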

Half-width katakana

Half-width katakana (ｱｲｳｴｵ) live in a different Unicode block, Halfwidth and Fullwidth Forms, at U+FF66-U+FF9F. They come from 8-bit character sets such as JIS X 0201, which dates back to 1969, and are still used in some legacy systems (ATMs, older printers).

The quirk: dakuten and handakuten are separate characters in half-width. ガ is one char in full-width (U+30AC) but two chars in half-width: ｶ (U+FF76) + ﾞ (U+FF9E).

const FULL_TO_HALF = {
  'ア': 'ｱ', 'ガ': 'ｶﾞ', 'ザ': 'ｻﾞ', 'パ': 'ﾊﾟ', 'ヴ': 'ｳﾞ', ...
};

So converting ガガ (2 characters) produces ｶﾞｶﾞ (4 characters). The string length doubles, so any code that assumes the output is as long as the input will break.
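The sketch below shows both directions on a deliberately small table. The function names fullToHalf and halfToFull and the derived HALF_TO_FULL map are assumptions for illustration. Going back to full-width has to look at two characters at a time so that a base kana and its voicing mark recombine:

```javascript
// Small illustrative subset; a real table covers every katakana.
const FULL_TO_HALF = {
  'ア': 'ｱ', 'カ': 'ｶ', 'ガ': 'ｶﾞ', 'ハ': 'ﾊ', 'パ': 'ﾊﾟ',
};

// Invert the table so 'ｶﾞ' (two code units) maps back to 'ガ'.
const HALF_TO_FULL = Object.fromEntries(
  Object.entries(FULL_TO_HALF).map(([full, half]) => [half, full])
);

function fullToHalf(text) {
  return [...text].map(c => FULL_TO_HALF[c] ?? c).join('');
}

function halfToFull(text) {
  let out = '';
  let i = 0;
  while (i < text.length) {
    // Prefer base + (han)dakuten pairs over the bare base character.
    const pair = text.slice(i, i + 2);
    if (pair.length === 2 && HALF_TO_FULL[pair] !== undefined) {
      out += HALF_TO_FULL[pair];
      i += 2;
    } else {
      out += HALF_TO_FULL[text[i]] ?? text[i];
      i += 1;
    }
  }
  return out;
}

const half = fullToHalf('ガガ');
console.log(half.length);      // 4
console.log(halfToFull(half)); // ガガ
```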

Romaji to Hiragana: greedy matching

Going back from romaji requires greedy longest-match:

const TABLE = [
  ['shi', 'し'], ['chi', 'ち'], ['tsu', 'つ'],
  ['sha', 'しゃ'], ['shu', 'しゅ'], ['sho', 'しょ'],
  ['ka', 'か'], ['ki', 'き'], ...
];

export function romajiToHiragana(text) {
  let result = '';
  let i = 0;
  while (i < text.length) {
    let matched = false;
    // TABLE is ordered longest key first, so the first hit is the longest match
    for (const [rom, kana] of TABLE) {
      if (text.slice(i, i + rom.length) === rom) {
        result += kana;
        i += rom.length;
        matched = true;
        break;
      }
    }
    if (!matched) { result += text[i]; i++; }
  }
  return result;
}

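The TABLE is sorted so longer keys come first. This ensures a three-letter key such as shi or sha is tried before any shorter key that could match at the same position, so sha is consumed as a single digraph instead of being split into smaller pieces.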
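To illustrate why the ordering matters, the sketch below sorts a small table by descending key length and compares the result with an unsorted one. The entries (including a bare n for ん), sortLongestFirst, and convert are hypothetical and only stand in for the real table and matcher:

```javascript
// Return a copy of the entries with the longest romaji keys first.
function sortLongestFirst(entries) {
  return [...entries].sort(([a], [b]) => b.length - a.length);
}

// Same first-hit matching loop as above, with the table passed in.
function convert(text, table) {
  let result = '';
  let i = 0;
  while (i < text.length) {
    const hit = table.find(([rom]) => text.startsWith(rom, i));
    if (hit) {
      result += hit[1];
      i += hit[0].length;
    } else {
      result += text[i]; // no key matches: copy the character as-is
      i += 1;
    }
  }
  return result;
}

const entries = [
  ['n', 'ん'], ['na', 'な'], ['ka', 'か'],
  ['shi', 'し'], ['sha', 'しゃ'], ['ha', 'は'],
];

console.log(convert('kana', entries));                   // かんa
console.log(convert('kana', sortLongestFirst(entries))); // かな
```

With the unsorted table, n wins over na and leaves a stray a behind. Sorting once up front keeps the matching loop itself simple.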