SEN LLC

Posted on May 11

Searching Emojis With Casual Japanese Keywords — Why Unicode CLDR's ja Annotations Aren't Enough

#javascript #japanese #frontend #webdev

If you type "わらう" (laugh) into Slack's emoji search, you get nothing. The same is true for "ぴえん" (a sad-cute slang for crying that took over Japanese Twitter around 2020), or "ばんざい" (a celebratory "hooray"). Unicode's CLDR (Common Locale Data Repository) ships official Japanese annotations for every emoji — but they're written in the register of an accessibility caption, not a search query. I curated 107 emojis by hand with the casual Japanese tags people actually type, and wrapped them in a ~200-line browser search.

🌐 Demo: https://sen.ltd/portfolio/emoji-search-jp/
📦 GitHub: https://github.com/sen-ltd/emoji-search-jp

Why CLDR's Japanese annotations don't make a search index

Unicode ships annotations for every emoji in every CLDR locale. Each entry has a "name" plus a handful of "keywords". For Japanese, the file looks like this:

Emoji	Name	Keywords
😂	"うれしなき"	`["うれしなき", "かお", "かおえもじ"]`
🥺	"もの欲しそうな顔"	`["かお", "かおえもじ", "けんめい", "もの欲しそう"]`
🙏	"合掌した手"	`["お辞儀", "かたを下げる", "かんしゃ"]`
🎉	"クラッカー"	`["クラッカー", "ハッピー", "パーティー", "紙ふぶき"]`
🐶	"犬の顔"	`["いぬ", "おもしろい", "かお", "どうぶつ"]`

Producing these keywords for accessibility is the right job for CLDR to do. They describe the visual content of the emoji in a formal register, written mostly in kana, that a screen reader can announce.

What they don't cover is what users actually type:

😂: missing わらう / lol / 草
🥺: missing ぴえん (the slang that defines this emoji in 2020s Japanese)
🙏: missing お願い / ありがとう / ごめん — the contexts every user uses this emoji for
🎉: missing おめでとう
🐶: missing わんこ / ワン

There's an upstream proposal to expand CLDR annotations toward search use cases, but the file as it ships today is a captioning dictionary, not a search dictionary. The two have different shapes.
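
The gap can be checked mechanically. The sketch below is illustrative only: it copies the five keyword lists from the table above, pairs each emoji with one casual query from the list of misses, and counts how many queries find nothing. The `cldrHit` helper and the choice of one query per emoji are assumptions made for the demonstration, not part of the repo.

```javascript
// CLDR-style keywords copied from the table above (illustrative subset).
const cldrKeywords = {
  "😂": ["うれしなき", "かお", "かおえもじ"],
  "🥺": ["かお", "かおえもじ", "けんめい", "もの欲しそう"],
  "🙏": ["お辞儀", "かたを下げる", "かんしゃ"],
  "🎉": ["クラッカー", "ハッピー", "パーティー", "紙ふぶき"],
  "🐶": ["いぬ", "おもしろい", "かお", "どうぶつ"],
};

// One casual query per emoji, paired with the emoji the user expects.
const casualQueries = [
  ["わらう", "😂"],
  ["ぴえん", "🥺"],
  ["ありがとう", "🙏"],
  ["おめでとう", "🎉"],
  ["わんこ", "🐶"],
];

// Hypothetical helper: true if any keyword for `char` contains the query.
function cldrHit(char, query) {
  return cldrKeywords[char].some((kw) => kw.includes(query));
}

const misses = casualQueries.filter(([query, char]) => !cldrHit(char, query));
console.log(`${misses.length} of ${casualQueries.length} casual queries miss`);
// prints "5 of 5 casual queries miss"

// A caption-style query still works, which is what the data was built for.
console.log(cldrHit("😂", "かお")); // true
```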

What this repo ships instead

107 emojis with 5-9 hand-curated tags each, about 750 tag entries in total. Each entry is a flat record ({char, name_ja, name_en, tags, category}) whose name and tags line up with CLDR's name and keywords, so the two can be merged if you want both registers.

{
  "char": "😂",
  "name_en": "face with tears of joy",
  "name_ja": "嬉し泣きの顔",
  "tags": ["わらう", "大爆笑", "笑い泣き", "嬉し泣き", "lol"],
  "category": "face"
},
{
  "char": "🥺",
  "name_en": "pleading face",
  "name_ja": "うるうる目の顔",
  "tags": ["ぴえん", "かわいい", "うるうる", "おねがい", "切ない"],
  "category": "face"
},
{
  "char": "🙏",
  "name_en": "folded hands",
  "name_ja": "合掌",
  "tags": ["お願い", "おねがい", "ありがとう", "祈る", "感謝", "ごめん"],
  "category": "gesture"
}
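
If you do want both registers, a merge could look roughly like the sketch below. It assumes the CLDR keywords have already been extracted into a plain array per emoji; `mergeTags` is a hypothetical helper, not a function from the repo. Curated tags come first so they keep priority in any order-sensitive step, and duplicates are detected on the NFKC-normalized, lowercased form.

```javascript
// Hypothetical helper: union of curated tags and CLDR keywords.
// Curated tags come first; duplicates are detected after normalization.
function mergeTags(curatedTags, cldrKeywords) {
  const seen = new Set();
  const merged = [];
  for (const tag of [...curatedTags, ...cldrKeywords]) {
    const key = String(tag).normalize("NFKC").toLowerCase().trim();
    if (key && !seen.has(key)) {
      seen.add(key);
      merged.push(tag);
    }
  }
  return merged;
}

// Curated entry from the article, CLDR keywords from the earlier table.
const curated = {
  char: "🙏",
  name_en: "folded hands",
  name_ja: "合掌",
  tags: ["お願い", "おねがい", "ありがとう", "祈る", "感謝", "ごめん"],
  category: "gesture",
};
const cldr = ["お辞儀", "かたを下げる", "かんしゃ"];

const mergedEntry = { ...curated, tags: mergeTags(curated.tags, cldr) };
console.log(mergedEntry.tags.length); // 9
```

Note that 感謝 and かんしゃ both survive the merge: they are the same word, but different strings, which is exactly the kana/kanji duplication a search lexicon wants.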

Tag-selection rules I followed while curating:

Mix kana and kanji. "ねこ" and "猫" both belong on 🐱 because the user may or may not have committed an IME conversion.
Mix register. Both the slang ("ぴえん") and the descriptive ("切ない") for 🥺.
Borrow from CLDR. Take the official annotations as a baseline and add the spoken-Japanese coverage on top.
A handful of English tags. lol / ok / love / cool — common enough in mixed-language Japanese chat that they earn their slot.
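
The kana-and-kanji rule is the easiest one to check automatically. The sketch below is a hypothetical lint, not something the repo ships: it uses Unicode script property escapes to warn when an entry has no kana-only tag or no tag containing kanji. Some emojis legitimately have no kanji spelling, so the result is best treated as a prompt for review rather than a hard failure.

```javascript
// Hypothetical lint (not part of the repo): warns when an entry's tags
// lack a kana-only spelling or a spelling that contains kanji.
// The long-vowel mark "ー" has Script=Common, so it is listed explicitly.
const KANA_ONLY = /^[\p{Script=Hiragana}\p{Script=Katakana}ー]+$/u;
const HAS_KANJI = /\p{Script=Han}/u;

function lintTags(tags) {
  const problems = [];
  if (!tags.some((t) => KANA_ONLY.test(t))) problems.push("no kana-only tag");
  if (!tags.some((t) => HAS_KANJI.test(t))) problems.push("no tag containing kanji");
  return problems;
}

// Both spellings present: no warnings.
console.log(lintTags(["ねこ", "猫"]));

// Only converted spellings: an unconverted IME query would miss.
console.log(lintTags(["猫", "子猫"]));

// The 🥺 tags from the article pass: ぴえん is kana-only, 切ない has kanji.
console.log(lintTags(["ぴえん", "かわいい", "うるうる", "おねがい", "切ない"]));
```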

What I deliberately didn't do: shoot for "all 3700 emoji from the Unicode emoji-list". The lexicon is the product. A smaller, well-tagged set is more useful than a complete, badly-tagged one.

Weighted scoring, five tiers

Search is a linear scan with a five-tier scoring function. For each query token, against each emoji:

const SCORE = {
  TAG_EXACT:           10,  // token === a tag
  TAG_PREFIX:          7,   // token is a prefix of a tag
  TAG_SUBSTRING:       4,   // token is a substring of a tag (non-prefix)
  NAME_JA_SUBSTRING:   3,   // token in name_ja
  NAME_EN_SUBSTRING:   1,   // last-resort English fallback
};

export function scoreToken(emoji, token) {
  if (!token) return 0;
  let best = 0;
  for (const tag of emoji.tags) {
    const t = normalize(tag);
    if (t === token) return SCORE.TAG_EXACT;            // can't beat exact
    if (t.startsWith(token))    best = Math.max(best, SCORE.TAG_PREFIX);
    else if (t.includes(token)) best = Math.max(best, SCORE.TAG_SUBSTRING);
  }
  if (best > 0) return best;
  if (normalize(emoji.name_ja).includes(token)) return SCORE.NAME_JA_SUBSTRING;
  if (normalize(emoji.name_en).includes(token)) return SCORE.NAME_EN_SUBSTRING;
  return 0;
}

The non-obvious choice is the weight gap between TAG_SUBSTRING (4) and NAME_JA_SUBSTRING (3): a substring hit on a hand-curated tag still beats a coincidental substring hit in the descriptive name. A query of 顔 therefore ranks emojis with a tag containing 顔 above emojis whose name merely happens to contain 顔. Without the gap, every face emoji would tie with every cat emoji, because both have 顔 in their name_ja.
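
To see each tier fire, the sketch below repeats the scoring function from above (minus the `export`, so it runs as a plain script) and probes the 😂 entry with one query per tier. The probe queries are chosen for illustration; the entry is the one shown earlier in the article.

```javascript
const SCORE = {
  TAG_EXACT: 10,
  TAG_PREFIX: 7,
  TAG_SUBSTRING: 4,
  NAME_JA_SUBSTRING: 3,
  NAME_EN_SUBSTRING: 1,
};

function normalize(s) {
  return String(s).normalize("NFKC").toLowerCase().trim();
}

// Same logic as the article's scoreToken; `token` is assumed pre-normalized.
function scoreToken(emoji, token) {
  if (!token) return 0;
  let best = 0;
  for (const tag of emoji.tags) {
    const t = normalize(tag);
    if (t === token) return SCORE.TAG_EXACT;
    if (t.startsWith(token)) best = Math.max(best, SCORE.TAG_PREFIX);
    else if (t.includes(token)) best = Math.max(best, SCORE.TAG_SUBSTRING);
  }
  if (best > 0) return best;
  if (normalize(emoji.name_ja).includes(token)) return SCORE.NAME_JA_SUBSTRING;
  if (normalize(emoji.name_en).includes(token)) return SCORE.NAME_EN_SUBSTRING;
  return 0;
}

const face = {
  char: "😂",
  name_en: "face with tears of joy",
  name_ja: "嬉し泣きの顔",
  tags: ["わらう", "大爆笑", "笑い泣き", "嬉し泣き", "lol"],
  category: "face",
};

console.log(scoreToken(face, "わらう")); // 10: exact tag
console.log(scoreToken(face, "わら"));   // 7: prefix of the tag わらう
console.log(scoreToken(face, "泣き"));   // 4: inside 笑い泣き, not at the start
console.log(scoreToken(face, "顔"));     // 3: no tag hit, found in name_ja
console.log(scoreToken(face, "joy"));    // 1: only in name_en
console.log(scoreToken(face, "車"));     // 0: no match anywhere
```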

Multi-token AND with sum-of-scores ranking

Whitespace splits the query into tokens. By default, every token must contribute at least one point; otherwise the emoji is dropped:

export function scoreEmoji(emoji, tokens) {
  if (tokens.length === 0) return 0;
  let sum = 0;
  for (const tok of tokens) {
    const s = scoreToken(emoji, tok);
    if (s === 0) return -1;   // any miss → drop
    sum += s;
  }
  return sum;
}

So "わらう 顔" keeps 😂 (10 from the tag わらう + 3 from name_ja containing 顔 = 13) and drops 🐱 (3 from 顔, but no match for わらう).

The unit tests pin this:

test("scoreEmoji returns -1 if any token fails to match", () => {
  const s = scoreEmoji(SAMPLE_FACE_WITH_TEARS, ["わらう", "車"]);
  assert.equal(s, -1);  // dropped
});

test("search drops emojis that don't satisfy ALL tokens", () => {
  const results = search(SAMPLE, "わらう 顔");
  const chars = results.map((r) => r.emoji.char);
  assert.ok(chars.includes("😂"));
  assert.ok(!chars.includes("🐱"));
});
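
The whitespace split itself is not shown above, so here is a minimal sketch of how it could work. `tokenize` is a hypothetical name, and the repo may do this differently. The point it illustrates is that the space between tokens matters: without it, the whole query is a single token and has to match as one string.

```javascript
function normalize(s) {
  return String(s).normalize("NFKC").toLowerCase().trim();
}

// Hypothetical tokenizer: normalize first, then split on runs of whitespace.
// filter(Boolean) drops the empty string produced by an all-whitespace query.
function tokenize(query) {
  return normalize(query).split(/\s+/).filter(Boolean);
}

console.log(tokenize("わらう 顔"));      // two tokens: わらう and 顔
console.log(tokenize("わらう\u3000顔")); // ideographic (full-width) space: same two tokens
console.log(tokenize("わらう顔"));       // one token: わらう顔
console.log(tokenize("   "));            // no tokens
```

NFKC maps the ideographic space U+3000 to an ordinary space, so a query typed with the IME still active splits the same way as one typed in ASCII mode.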

NFKC normalization upfront

Users will type half-width and full-width letters, with stray IME spaces, and case-mixed. One normalize call handles all of it:

export function normalize(s) {
  return String(s).normalize("NFKC").toLowerCase().trim();
}

ＬＯＬ (full-width) becomes lol; わらう passes through unchanged; Pien becomes pien. Everything downstream operates on the normalized form. The boundary is one line, and it is tested:

test("search is case- and width-insensitive (NFKC)", () => {
  const results = search(SAMPLE, "ＬＯＬ");
  assert.equal(results[0].emoji.char, "😂");
});
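
A few concrete inputs show what the one-liner does and does not fold. This sketch only exercises the standard `String.prototype.normalize` behaviour; the sample strings are illustrative.

```javascript
function normalize(s) {
  return String(s).normalize("NFKC").toLowerCase().trim();
}

// Width folding: full-width Latin letters become ASCII, then lowercase.
console.log(normalize("ＬＯＬ"));      // "lol"

// Case folding plus trimming of stray leading/trailing spaces.
console.log(normalize("  Pien "));     // "pien"

// Half-width katakana becomes full-width, and the separate
// semi-voiced mark is composed into a single character.
console.log(normalize("ﾋﾟｴﾝ"));        // "ピエン"

// Hiragana is already in normal form and passes through unchanged.
console.log(normalize("わらう"));      // "わらう"

// NFKC does NOT fold katakana into hiragana: these stay distinct,
// so a tag needs both spellings if both should match.
console.log(normalize("ピエン") === normalize("ぴえん")); // false
```

The last case is the reason the lexicon carries multiple spellings of the same word instead of relying on normalization to unify them.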

Stable sort matters more than the algorithm

Array.prototype.sort has been stable since ECMAScript 2019, so equal-keyed elements keep their original order. Equal scores are common in this dataset (顔 lands at NAME_JA_SUBSTRING (3) for every face emoji), so I made the tie-breaker explicit rather than rely on that guarantee implicitly:

matches.sort((a, b) => {
  if (b.score !== a.score) return b.score - a.score;
  return a.idx - b.idx;   // tie-break by input order
});

The effect: when many emojis tie on score, the curated order (the order I wrote them into data.json) survives. I put the most-used emoji first in every category, so ties resolve to "the more-used one".

test("search is stable: equal-scoring matches keep input order", () => {
  // "顔" → 3 emojis all at NAME_JA_SUBSTRING (3).
  const results = search(SAMPLE, "顔");
  const chars = results.map((r) => r.emoji.char);
  assert.deepEqual(chars, ["😂", "🥺", "🐱"]);  // input order preserved
});
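
Putting the pieces together, a `search` function consistent with the snippets and tests above might look like the sketch below. The scoring functions repeat the article's code; the wiring inside `search` (the whitespace split, the `{ emoji, score, idx }` result shape, the `score > 0` filter) is inferred from the tests and is an assumption, as is the 🐱 sample entry.

```javascript
const SCORE = { TAG_EXACT: 10, TAG_PREFIX: 7, TAG_SUBSTRING: 4, NAME_JA_SUBSTRING: 3, NAME_EN_SUBSTRING: 1 };

function normalize(s) {
  return String(s).normalize("NFKC").toLowerCase().trim();
}

function scoreToken(emoji, token) {
  if (!token) return 0;
  let best = 0;
  for (const tag of emoji.tags) {
    const t = normalize(tag);
    if (t === token) return SCORE.TAG_EXACT;
    if (t.startsWith(token)) best = Math.max(best, SCORE.TAG_PREFIX);
    else if (t.includes(token)) best = Math.max(best, SCORE.TAG_SUBSTRING);
  }
  if (best > 0) return best;
  if (normalize(emoji.name_ja).includes(token)) return SCORE.NAME_JA_SUBSTRING;
  if (normalize(emoji.name_en).includes(token)) return SCORE.NAME_EN_SUBSTRING;
  return 0;
}

function scoreEmoji(emoji, tokens) {
  if (tokens.length === 0) return 0;
  let sum = 0;
  for (const tok of tokens) {
    const s = scoreToken(emoji, tok);
    if (s === 0) return -1; // any miss drops the emoji
    sum += s;
  }
  return sum;
}

// Assumed wiring: tokenize, score every entry, keep positive scores,
// sort by score descending with input order as the explicit tie-break.
function search(emojis, query) {
  const tokens = normalize(query).split(/\s+/).filter(Boolean);
  const matches = [];
  emojis.forEach((emoji, idx) => {
    const score = scoreEmoji(emoji, tokens);
    if (score > 0) matches.push({ emoji, score, idx });
  });
  matches.sort((a, b) => (b.score !== a.score ? b.score - a.score : a.idx - b.idx));
  return matches;
}

const SAMPLE = [
  { char: "😂", name_en: "face with tears of joy", name_ja: "嬉し泣きの顔", tags: ["わらう", "大爆笑", "笑い泣き", "嬉し泣き", "lol"] },
  { char: "🥺", name_en: "pleading face", name_ja: "うるうる目の顔", tags: ["ぴえん", "かわいい", "うるうる", "おねがい", "切ない"] },
  // Hypothetical entry, made up for this example.
  { char: "🐱", name_en: "cat face", name_ja: "猫の顔", tags: ["ねこ", "猫", "にゃー"] },
];

console.log(search(SAMPLE, "顔").map((r) => r.emoji.char).join(""));        // 😂🥺🐱
console.log(search(SAMPLE, "わらう 顔").map((r) => `${r.emoji.char}:${r.score}`)); // only 😂, with score 13
```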

What I didn't build

Trie / inverted index — 107 entries × 7 tags is 0.1 ms of linear scan on a phone. The index only pays off past ~10,000 entries.
Fuzzy / Levenshtein matching — typo tolerance complicates the score function. Prefix + substring already cover ~80% of real misses, and the cost of adding fuzzy is a noticeable jump in false positives.
Skin-tone / gender variants — exploding the entry count by 5-10× hurts search quality. Native OS emoji pickers cover this better.
Speech-readout accessibility annotations — that's exactly what CLDR is for. This repo is the other annotation register.
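
The linear-scan cost is easy to measure on your own device. The sketch below is a rough micro-benchmark over synthetic data with the same shape (107 entries × 7 tags); the tag strings are made up, real numbers depend on the device and engine, and it makes no claim to reproduce the figure quoted above.

```javascript
// Synthetic lexicon: 107 entries x 7 tags, shaped like the article's data.
const entries = Array.from({ length: 107 }, (_, i) => ({
  tags: Array.from({ length: 7 }, (_, j) => `tag${i}_${j}`),
}));

// One linear scan: count entries with at least one tag containing the token.
function scanOnce(token) {
  let hits = 0;
  for (const entry of entries) {
    for (const tag of entry.tags) {
      if (tag.includes(token)) {
        hits++;
        break;
      }
    }
  }
  return hits;
}

const RUNS = 1000;
scanOnce("tag5"); // warm-up so the first timed run is not an outlier

const start = performance.now();
let totalHits = 0;
for (let i = 0; i < RUNS; i++) totalHits += scanOnce("tag5");
const msPerScan = (performance.now() - start) / RUNS;

console.log(`${msPerScan.toFixed(4)} ms per scan, ${totalHits / RUNS} hits each`);
```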

Try it

The demo at https://sen.ltd/portfolio/emoji-search-jp/ has the full lexicon. Try わらう, ぴえん, ばんざい, 猫, ハート, 寿司, お願い, or ありがとう. Pressing / focuses the search box, and clicking an emoji copies it.
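
For readers wiring up similar behaviour, the / shortcut and click-to-copy could be sketched as below. This is not the demo's actual code: the `#search` id, the `data-emoji` attribute, and the `isFocusShortcut` helper are all assumptions. The DOM wiring is guarded so the snippet also loads outside a browser.

```javascript
// Hypothetical predicate: should this keydown move focus to the search box?
// Ignores the key while the user is already typing in a field, and ignores
// modified combinations such as Ctrl+/.
function isFocusShortcut(event) {
  const tagName = event.target && event.target.tagName;
  const typing = tagName === "INPUT" || tagName === "TEXTAREA";
  return event.key === "/" && !typing && !event.ctrlKey && !event.metaKey && !event.altKey;
}

if (typeof document !== "undefined") {
  const box = document.querySelector("#search"); // assumed element id

  document.addEventListener("keydown", (event) => {
    if (box && isFocusShortcut(event)) {
      event.preventDefault(); // keep the "/" itself out of the input
      box.focus();
    }
  });

  document.addEventListener("click", (event) => {
    const el = event.target.closest("[data-emoji]"); // assumed attribute
    if (el && navigator.clipboard) {
      // writeText returns a promise; it can reject without permission.
      navigator.clipboard.writeText(el.dataset.emoji).catch(() => {});
    }
  });
}
```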

Source: https://github.com/sen-ltd/emoji-search-jp — MIT, ~200 lines of JS plus 107 entries of curated data, 18 unit tests, no build step, no runtime dependencies.

🛠 Built by SEN LLC as part of an ongoing series of small, focused developer tools. Browse the full portfolio for more.
