DEV Community

SEN LLC

Searching Emojis With Casual Japanese Keywords — Why Unicode CLDR's ja Annotations Aren't Enough

If you type "わらう" (laugh) into Slack's emoji search, you get nothing. The same is true for "ぴえん" (sad-cute slang for crying that took over Japanese Twitter around 2020) and "ばんざい" (a celebratory "hooray"). Unicode's CLDR (Common Locale Data Repository) ships official Japanese annotations for every emoji, but they're written in the register of an accessibility caption, not a search query. I curated 107 emojis by hand with the casual Japanese tags people actually type, and wrapped them in a ~200-line browser search.

[Screenshot: emoji-search-jp UI, dark theme]

🌐 Demo: https://sen.ltd/portfolio/emoji-search-jp/
📦 GitHub: https://github.com/sen-ltd/emoji-search-jp

Why CLDR's Japanese annotations don't make a search index

Unicode ships annotations for every emoji in every CLDR locale. Each entry has a "name" plus a handful of "keywords". For Japanese, the file looks like this:

| Emoji | Name | Keywords |
| --- | --- | --- |
| 😂 | "うれしなき" | ["うれしなき", "かお", "かおえもじ"] |
| 🥺 | "もの欲しそうな顔" | ["かお", "かおえもじ", "けんめい", "もの欲しそう"] |
| 🙏 | "合掌した手" | ["お辞儀", "かたを下げる", "かんしゃ"] |
| 🎉 | "クラッカー" | ["クラッカー", "ハッピー", "パーティー", "紙ふぶき"] |
| 🐶 | "犬の顔" | ["いぬ", "おもしろい", "かお", "どうぶつ"] |

These keywords were written for accessibility, and that is the right job for CLDR to do. They describe the visual content of the emoji in a formal, mostly-hiragana register that a screen reader can announce.

What they don't cover is what users actually type:

  • 😂: missing わらう / lol
  • 🥺: missing ぴえん (the slang that defines this emoji in 2020s Japanese)
  • 🙏: missing お願い / ありがとう / ごめん — the contexts every user uses this emoji for
  • 🎉: missing おめでとう
  • 🐶: missing わんこ / ワン

There's an upstream proposal to expand CLDR annotations toward search use cases, but the file as it ships today is a captioning dictionary, not a search dictionary. The two have different shapes.

What this repo ships instead

107 emojis with 5-9 hand-curated tags each, totaling about 750 tag entries. Each entry is {char, name_ja, name_en, tags, category}, which mirrors CLDR's name-plus-keywords shape, so the two can be merged if you want both registers.

{
  "char": "😂",
  "name_en": "face with tears of joy",
  "name_ja": "嬉し泣きの顔",
  "tags": ["わらう", "大爆笑", "笑い泣き", "嬉し泣き", "lol"],
  "category": "face"
},
{
  "char": "🥺",
  "name_en": "pleading face",
  "name_ja": "うるうる目の顔",
  "tags": ["ぴえん", "かわいい", "うるうる", "おねがい", "切ない"],
  "category": "face"
},
{
  "char": "🙏",
  "name_en": "folded hands",
  "name_ja": "合掌",
  "tags": ["お願い", "おねがい", "ありがとう", "祈る", "感謝", "ごめん"],
  "category": "gesture"
}
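Because both registers reduce to a name plus a list of keywords, merging them can be as simple as a per-emoji union of two lists. The following is a minimal sketch, not code from the repo: the `cldrKeywords` object (CLDR keywords already loaded and keyed by emoji character) and the `mergeTags` helper are hypothetical names introduced for illustration.

```javascript
// Hypothetical input: CLDR ja keywords keyed by emoji character.
// The values are copied from the CLDR table earlier in this post.
const cldrKeywords = {
  "🙏": ["お辞儀", "かたを下げる", "かんしゃ"],
};

// One curated entry in the schema shown above.
const curated = [
  {
    char: "🙏",
    name_en: "folded hands",
    name_ja: "合掌",
    tags: ["お願い", "おねがい", "ありがとう", "祈る", "感謝", "ごめん"],
    category: "gesture",
  },
];

// Hypothetical helper: returns new entries whose tags are the curated
// tags first, followed by any CLDR keywords not already present.
// The input entries are not mutated.
function mergeTags(entries, cldr) {
  return entries.map((entry) => {
    const extra = cldr[entry.char] ?? [];
    const tags = [...new Set([...entry.tags, ...extra])];
    return { ...entry, tags };
  });
}

const merged = mergeTags(curated, cldrKeywords);
console.log(merged[0].tags);
```

Keeping the curated tags at the front of the merged list is a design choice of this sketch; the scoring function shown later does not depend on tag order.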

Tag-selection rules I followed while curating:

  1. Mix kana and kanji. "ねこ" and "猫" both belong on 🐱 because the user may or may not have committed an IME conversion.
  2. Mix register. Both the slang ("ぴえん") and the descriptive ("切ない") for 🥺.
  3. Borrow from CLDR. Take the official annotations as a baseline and add the spoken-Japanese coverage on top.
  4. A handful of English tags. lol / ok / love / cool — common enough in mixed-language Japanese chat that they earn their slot.

What I deliberately didn't do: shoot for "all 3700 emoji from the Unicode emoji-list". The lexicon is the product. A smaller, well-tagged set is more useful than a complete, badly-tagged one.

Weighted scoring, five tiers

Search is a linear scan with a five-tier scoring function. For each query token, against each emoji:

const SCORE = {
  TAG_EXACT:           10,  // token === a tag
  TAG_PREFIX:          7,   // token is a prefix of a tag
  TAG_SUBSTRING:       4,   // token is a substring of a tag (non-prefix)
  NAME_JA_SUBSTRING:   3,   // token in name_ja
  NAME_EN_SUBSTRING:   1,   // last-resort English fallback
};

export function scoreToken(emoji, token) {
  if (!token) return 0;
  let best = 0;
  for (const tag of emoji.tags) {
    const t = normalize(tag);
    if (t === token) return SCORE.TAG_EXACT;            // can't beat exact
    if (t.startsWith(token))    best = Math.max(best, SCORE.TAG_PREFIX);
    else if (t.includes(token)) best = Math.max(best, SCORE.TAG_SUBSTRING);
  }
  if (best > 0) return best;
  if (normalize(emoji.name_ja).includes(token)) return SCORE.NAME_JA_SUBSTRING;
  if (normalize(emoji.name_en).includes(token)) return SCORE.NAME_EN_SUBSTRING;
  return 0;
}

The non-obvious choice is the weight gap between TAG_SUBSTRING (4) and NAME_JA_SUBSTRING (3): a substring hit on a hand-curated tag still beats a coincidental substring hit in the descriptive name. So a query of 顔 ranks "tags containing 顔" above "emojis whose name happens to contain 顔". Without the gap, every face emoji would tie with every cat emoji because both have 顔 in their name_ja.
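To see the five tiers side by side, the sketch below scores six tokens against a single entry. `SCORE` and `scoreToken` follow the code above, with `normalize` included so the block runs on its own. The `cat` entry and its tags are hypothetical, chosen only so that each tier fires exactly once.

```javascript
const SCORE = {
  TAG_EXACT: 10,
  TAG_PREFIX: 7,
  TAG_SUBSTRING: 4,
  NAME_JA_SUBSTRING: 3,
  NAME_EN_SUBSTRING: 1,
};

const normalize = (s) => String(s).normalize("NFKC").toLowerCase().trim();

// Same logic as scoreToken above; expects an already-normalized token.
function scoreToken(emoji, token) {
  if (!token) return 0;
  let best = 0;
  for (const tag of emoji.tags) {
    const t = normalize(tag);
    if (t === token) return SCORE.TAG_EXACT;
    if (t.startsWith(token)) best = Math.max(best, SCORE.TAG_PREFIX);
    else if (t.includes(token)) best = Math.max(best, SCORE.TAG_SUBSTRING);
  }
  if (best > 0) return best;
  if (normalize(emoji.name_ja).includes(token)) return SCORE.NAME_JA_SUBSTRING;
  if (normalize(emoji.name_en).includes(token)) return SCORE.NAME_EN_SUBSTRING;
  return 0;
}

// Hypothetical entry, for illustration only.
const cat = {
  char: "🐱",
  name_en: "cat face",
  name_ja: "猫の顔",
  tags: ["猫", "こねこ", "にゃんこ"],
};

// exact tag, tag prefix, tag substring, name_ja, name_en, no match
const tokens = ["猫", "にゃん", "ねこ", "顔", "cat", "いぬ"];
const scores = tokens.map((t) => scoreToken(cat, normalize(t)));
console.log(scores); // [10, 7, 4, 3, 1, 0]
```

Note that ねこ scores 4 here only because this made-up entry lacks a plain ねこ tag; it matches inside こねこ without being a prefix.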

Multi-token AND with sum-of-scores ranking

Whitespace splits the query into tokens. By default every token must contribute at least one point; otherwise the emoji is dropped:

export function scoreEmoji(emoji, tokens) {
  if (tokens.length === 0) return 0;
  let sum = 0;
  for (const tok of tokens) {
    const s = scoreToken(emoji, tok);
    if (s === 0) return -1;   // any miss → drop
    sum += s;
  }
  return sum;
}

So "わらう 顔" keeps 😂 (10 from tag わらう + 3 from name_ja containing 顔 = 13) and drops 🐱 (3 from 顔 but no signal for わらう).

The unit tests pin this:

test("scoreEmoji returns -1 if any token fails to match", () => {
  const s = scoreEmoji(SAMPLE_FACE_WITH_TEARS, ["わらう", ""]);
  assert.equal(s, -1);  // dropped
});

test("search drops emojis that don't satisfy ALL tokens", () => {
  const results = search(SAMPLE, "わらう 顔");
  const chars = results.map((r) => r.emoji.char);
  assert.ok(chars.includes("😂"));
  assert.ok(!chars.includes("🐱"));
});
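The splitting step itself is not shown in this post. One way to write it, as a sketch: normalize first, so that a full-width IME space (U+3000) becomes an ordinary space, then split on whitespace and drop empty pieces. The `tokenize` name is an assumption and the repo may structure this differently; `normalize` is the one-line function that scoreToken calls.

```javascript
// The post's normalize: NFKC, lowercase, trim.
function normalize(s) {
  return String(s).normalize("NFKC").toLowerCase().trim();
}

// Hypothetical helper: turn a raw query into normalized, non-empty tokens.
function tokenize(query) {
  return normalize(query)
    .split(/\s+/)
    .filter((tok) => tok.length > 0);
}

console.log(tokenize("わらう\u3000顔")); // ["わらう", "顔"]
console.log(tokenize("  LOL  "));        // ["lol"]
console.log(tokenize(""));               // []
```

Filtering out empty pieces matters because scoreEmoji treats a zero-scoring token as a miss, so an empty token would drop every emoji.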

NFKC normalization upfront

Users type a mix of half-width and full-width letters, stray IME spaces, and mixed case. One normalize call handles all of it:

export function normalize(s) {
  return String(s).normalize("NFKC").toLowerCase().trim();
}

Full-width ＬＯＬ becomes lol; わらう wrapped in stray IME spaces becomes わらう; Pien becomes pien. Everything downstream operates on the normalized form. The boundary is one line, and it is tested:

test("search is case- and width-insensitive (NFKC)", () => {
  const results = search(SAMPLE, "ＬＯＬ");
  assert.equal(results[0].emoji.char, "😂");
});
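A few concrete inputs make the effect of each step visible. The sketch below uses the same `normalize` and only standard `String.prototype` methods; the sample strings are illustrative and not taken from the repo's test suite.

```javascript
function normalize(s) {
  return String(s).normalize("NFKC").toLowerCase().trim();
}

const samples = [
  "ＬＯＬ",                      // full-width Latin letters
  "\u3000わらう\u3000",          // ideographic (IME) spaces around the word
  "Pien",                        // mixed case
  "\uFF8B\uFF9F\uFF74\uFF9D",    // half-width katakana ﾋﾟｴﾝ
];

// Print each input next to its normalized form.
for (const s of samples) {
  console.log(JSON.stringify(s), "->", JSON.stringify(normalize(s)));
}
```

NFKC folds width variants but not scripts: the half-width katakana input comes out as full-width ピエン, which is still katakana and does not equal the hiragana tag ぴえん. Matching across kana would need either both spellings in the tag list or a separate kana-folding step, which this sketch does not attempt.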

Stable sort matters more than the algorithm

Array.prototype.sort has been required to be stable since ES2019, so elements that compare equal keep their original order. Equal scores are common in this dataset (a query of 顔 lands at NAME_JA_SUBSTRING (3) for every face emoji), so I made the tie-breaker explicit rather than relying on that guarantee:

matches.sort((a, b) => {
  if (b.score !== a.score) return b.score - a.score;
  return a.idx - b.idx;   // tie-break by input order
});

The effect: when many emojis tie on score, the curated order (the order I wrote them into data.json) survives. I put the most-used emoji first in every category, so ties resolve to "the more-used one".

test("search is stable: equal-scoring matches keep input order", () => {
  // "顔" → 3 emojis all at NAME_JA_SUBSTRING (3).
  const results = search(SAMPLE, "顔");
  const chars = results.map((r) => r.emoji.char);
  assert.deepEqual(chars, ["😂", "🥺", "🐱"]);  // input order preserved
});
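Putting the pieces together, `search` might look roughly like the sketch below. It is a reconstruction, not the repo's code. The `(entries, query)` signature and the `emoji` field on each result are inferred from the tests above, returning an empty list for an empty query is an assumption, and the 🐱 entry's tags are hypothetical.

```javascript
const SCORE = {
  TAG_EXACT: 10, TAG_PREFIX: 7, TAG_SUBSTRING: 4,
  NAME_JA_SUBSTRING: 3, NAME_EN_SUBSTRING: 1,
};

const normalize = (s) => String(s).normalize("NFKC").toLowerCase().trim();

// scoreToken and scoreEmoji: same logic as in the post.
function scoreToken(emoji, token) {
  if (!token) return 0;
  let best = 0;
  for (const tag of emoji.tags) {
    const t = normalize(tag);
    if (t === token) return SCORE.TAG_EXACT;
    if (t.startsWith(token)) best = Math.max(best, SCORE.TAG_PREFIX);
    else if (t.includes(token)) best = Math.max(best, SCORE.TAG_SUBSTRING);
  }
  if (best > 0) return best;
  if (normalize(emoji.name_ja).includes(token)) return SCORE.NAME_JA_SUBSTRING;
  if (normalize(emoji.name_en).includes(token)) return SCORE.NAME_EN_SUBSTRING;
  return 0;
}

function scoreEmoji(emoji, tokens) {
  if (tokens.length === 0) return 0;
  let sum = 0;
  for (const tok of tokens) {
    const s = scoreToken(emoji, tok);
    if (s === 0) return -1; // any miss drops the emoji
    sum += s;
  }
  return sum;
}

// Assumed shape of search(): normalize, tokenize, score, filter, sort.
function search(entries, query) {
  const tokens = normalize(query).split(/\s+/).filter(Boolean);
  if (tokens.length === 0) return []; // assumption: empty query, no results
  const matches = [];
  entries.forEach((emoji, idx) => {
    const score = scoreEmoji(emoji, tokens);
    if (score > 0) matches.push({ emoji, score, idx });
  });
  matches.sort((a, b) => {
    if (b.score !== a.score) return b.score - a.score;
    return a.idx - b.idx; // tie-break by input order
  });
  return matches;
}

// Three-entry sample in curated order (the 🐱 tags are made up).
const SAMPLE = [
  { char: "😂", name_en: "face with tears of joy", name_ja: "嬉し泣きの顔",
    tags: ["わらう", "大爆笑", "笑い泣き", "嬉し泣き", "lol"] },
  { char: "🥺", name_en: "pleading face", name_ja: "うるうる目の顔",
    tags: ["ぴえん", "かわいい", "うるうる", "おねがい", "切ない"] },
  { char: "🐱", name_en: "cat face", name_ja: "猫の顔",
    tags: ["ねこ", "猫", "にゃんこ"] },
];

console.log(search(SAMPLE, "顔").map((r) => r.emoji.char));        // ["😂", "🥺", "🐱"]
console.log(search(SAMPLE, "わらう 顔").map((r) => r.emoji.char)); // ["😂"]
```

Carrying `idx` on each match is what lets the comparator fall back to curated order without depending on the engine's sort stability.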

What I didn't build

  • Trie / inverted index: 107 entries × ~7 tags is roughly 750 string comparisons, about 0.1 ms of linear scan on a phone. An index only pays off past ~10,000 entries.
  • Fuzzy / Levenshtein matching: typo tolerance complicates the score function. Prefix + substring already cover ~80% of real misses, and the cost of adding fuzzy matching is a noticeable jump in false positives.
  • Skin-tone / gender variants: exploding the entry count by 5-10× hurts search quality. Native OS emoji pickers cover this better.
  • Speech-readout accessibility annotations: that's exactly what CLDR is for. This repo is the other annotation register.

Try it

The demo at https://sen.ltd/portfolio/emoji-search-jp/ has the full lexicon. Try わらう, ぴえん, ばんざい, ハート, 寿司, お願い, ありがとう. Press / to focus the search box. Click an emoji to copy it.

Source: https://github.com/sen-ltd/emoji-search-jp — MIT, ~200 lines of JS plus 107 entries of curated data, 18 unit tests, no build step, no runtime dependencies.


🛠 Built by SEN LLC as part of an ongoing series of small, focused developer tools. Browse the full portfolio for more.
