If you type "わらう" (laugh) into Slack's emoji search, you get nothing. The same is true for "ぴえん" (a sad-cute slang for crying that took over Japanese Twitter around 2020), or "ばんざい" (a celebratory "hooray"). Unicode's CLDR (Common Locale Data Repository) ships official Japanese annotations for every emoji — but they're written in the register of an accessibility caption, not a search query. I curated 107 emojis by hand with the casual Japanese tags people actually type, and wrapped them in a ~200-line browser search.
🌐 Demo: https://sen.ltd/portfolio/emoji-search-jp/
📦 GitHub: https://github.com/sen-ltd/emoji-search-jp
Why CLDR's Japanese annotations don't make a search index
Unicode ships annotations for every emoji in every CLDR locale. Each entry has a "name" plus a handful of "keywords". For Japanese, the file looks like this:
| Emoji | Name | Keywords |
|---|---|---|
| 😂 | "うれしなき" | ["うれしなき", "かお", "かおえもじ"] |
| 🥺 | "もの欲しそうな顔" | ["かお", "かおえもじ", "けんめい", "もの欲しそう"] |
| 🙏 | "合掌した手" | ["お辞儀", "かたを下げる", "かんしゃ"] |
| 🎉 | "クラッカー" | ["クラッカー", "ハッピー", "パーティー", "紙ふぶき"] |
| 🐶 | "犬の顔" | ["いぬ", "おもしろい", "かお", "どうぶつ"] |
The accessibility job that produced these keywords is the right job for CLDR to do. They cover the visual content of the emoji in a formal-register, all-hiragana style that a screen reader can announce.
What they don't cover is what users actually type:
- 😂: missing わらう/lol/草
- 🥺: missing ぴえん (the slang that defines this emoji in 2020s Japanese)
- 🙏: missing お願い/ありがとう/ごめん (the contexts this emoji actually gets used in)
- 🎉: missing おめでとう
- 🐶: missing わんこ/ワン
There's an upstream proposal to expand CLDR annotations toward search use cases, but the file as it ships today is a captioning dictionary, not a search dictionary. The two have different shapes.
What this repo ships instead
107 emojis with 5-9 hand-curated tags each, totaling about 750 tag entries. Same schema as CLDR ({char, name_ja, name_en, tags, category}) so the two can be merged if you want both registers.
{
"char": "😂",
"name_en": "face with tears of joy",
"name_ja": "嬉し泣きの顔",
"tags": ["わらう", "大爆笑", "笑い泣き", "嬉し泣き", "lol"],
"category": "face"
},
{
"char": "🥺",
"name_en": "pleading face",
"name_ja": "うるうる目の顔",
"tags": ["ぴえん", "かわいい", "うるうる", "おねがい", "切ない"],
"category": "face"
},
{
"char": "🙏",
"name_en": "folded hands",
"name_ja": "合掌",
"tags": ["お願い", "おねがい", "ありがとう", "祈る", "感謝", "ごめん"],
"category": "gesture"
}
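Since the schema matches, merging the two registers is mechanical. A minimal sketch, assuming the CLDR annotations have already been converted to the same `{char, tags}` shape (`mergeLexicons` and both argument names are hypothetical, not from the repo):

```javascript
// Layer curated tags on top of CLDR annotations, keyed by emoji character.
// Curated tags come first so exact-match scoring still prefers them;
// duplicates are dropped while preserving order.
function mergeLexicons(curatedEntries, cldrEntries) {
  const cldrByChar = new Map(cldrEntries.map((e) => [e.char, e]));
  return curatedEntries.map((cur) => {
    const cldr = cldrByChar.get(cur.char);
    const tags = [...new Set([...cur.tags, ...(cldr?.tags ?? [])])];
    return { ...cur, tags };
  });
}
```

With the 😂 rows above, the merged entry carries both わらう and うれしなき, so both registers are searchable.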
Tag-selection rules I followed while curating:
- Mix kana and kanji. "ねこ" and "猫" both belong on 🐱 because the user may or may not have committed an IME conversion.
- Mix register. Both the slang ("ぴえん") and the descriptive ("切ない") for 🥺.
- Borrow from CLDR. Take the official annotations as a baseline and add the spoken-Japanese coverage on top.
- A handful of English tags. lol/ok/love/cool are common enough in mixed-language Japanese chat that they earn their slot.
What I deliberately didn't do: shoot for "all 3700 emoji from the Unicode emoji-list". The lexicon is the product. A smaller, well-tagged set is more useful than a complete, badly-tagged one.
Weighted scoring, five tiers
Search is a linear scan with a five-tier scoring function. For each query token, against each emoji:
const SCORE = {
TAG_EXACT: 10, // token === a tag
TAG_PREFIX: 7, // token is a prefix of a tag
TAG_SUBSTRING: 4, // token is a substring of a tag (non-prefix)
NAME_JA_SUBSTRING: 3, // token in name_ja
NAME_EN_SUBSTRING: 1, // last-resort English fallback
};
export function scoreToken(emoji, token) {
if (!token) return 0;
let best = 0;
for (const tag of emoji.tags) {
const t = normalize(tag);
if (t === token) return SCORE.TAG_EXACT; // can't beat exact
if (t.startsWith(token)) best = Math.max(best, SCORE.TAG_PREFIX);
else if (t.includes(token)) best = Math.max(best, SCORE.TAG_SUBSTRING);
}
if (best > 0) return best;
if (normalize(emoji.name_ja).includes(token)) return SCORE.NAME_JA_SUBSTRING;
if (normalize(emoji.name_en).includes(token)) return SCORE.NAME_EN_SUBSTRING;
return 0;
}
The non-obvious choice is the weight gap between TAG_SUBSTRING (4) and NAME_JA_SUBSTRING (3) — a substring hit on a hand-curated tag still beats a coincidental substring hit in the descriptive name. So 顔 ranks "tags containing 顔" above "emojis whose name happens to contain 顔". Without the gap, every face emoji would tie with every cat emoji because both have 顔 in their name_ja.
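To see the gap in action, here's a standalone check (normalize and scoreToken are repeated from above so the snippet runs on its own; the 顔-bearing tag on the face entry is illustrative, not the shipped data):

```javascript
const SCORE = { TAG_EXACT: 10, TAG_PREFIX: 7, TAG_SUBSTRING: 4, NAME_JA_SUBSTRING: 3, NAME_EN_SUBSTRING: 1 };
const normalize = (s) => String(s).normalize("NFKC").toLowerCase().trim();

function scoreToken(emoji, token) {
  let best = 0;
  for (const tag of emoji.tags) {
    const t = normalize(tag);
    if (t === token) return SCORE.TAG_EXACT;
    if (t.startsWith(token)) best = Math.max(best, SCORE.TAG_PREFIX);
    else if (t.includes(token)) best = Math.max(best, SCORE.TAG_SUBSTRING);
  }
  if (best > 0) return best;
  if (normalize(emoji.name_ja).includes(token)) return SCORE.NAME_JA_SUBSTRING;
  return normalize(emoji.name_en).includes(token) ? SCORE.NAME_EN_SUBSTRING : 0;
}

// 🐱 has 顔 only in its name; the face entry carries it in a curated tag.
const CAT = { name_ja: "猫の顔", name_en: "cat face", tags: ["ねこ", "猫"] };
const FACE = { name_ja: "嬉し泣きの顔", name_en: "face with tears of joy", tags: ["笑い顔", "わらう"] };

scoreToken(CAT, "顔");  // 3 — NAME_JA_SUBSTRING only
scoreToken(FACE, "顔"); // 4 — TAG_SUBSTRING wins
```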
Multi-token AND with sum-of-scores ranking
Whitespace splits the query into tokens. The default behaviour is AND: every token must contribute at least one point, otherwise the emoji is dropped:
export function scoreEmoji(emoji, tokens) {
if (tokens.length === 0) return 0;
let sum = 0;
for (const tok of tokens) {
const s = scoreToken(emoji, tok);
if (s === 0) return -1; // any miss → drop
sum += s;
}
return sum;
}
So "わらう 顔" keeps 😂 (10 from tag わらう + 3 from name_ja containing 顔 = 13) and drops 🐱 (3 from 顔 but no signal for わらう).
The unit tests pin this:
test("scoreEmoji returns -1 if any token fails to match", () => {
const s = scoreEmoji(SAMPLE_FACE_WITH_TEARS, ["わらう", "車"]);
assert.equal(s, -1); // dropped
});
test("search drops emojis that don't satisfy ALL tokens", () => {
const results = search(SAMPLE, "わらう 顔");
const chars = results.map((r) => r.emoji.char);
assert.ok(chars.includes("😂"));
assert.ok(!chars.includes("🐱"));
});
NFKC normalization upfront
Users type half-width and full-width characters, stray IME spaces, and mixed case. One normalize call handles all of it:
export function normalize(s) {
return String(s).normalize("NFKC").toLowerCase().trim();
}
ＬＯＬ (full-width) becomes lol; ﾜﾗｳ (half-width katakana) becomes ワラウ; Pien becomes pien. Everything downstream operates on the normalized form. The boundary is one line and gets tested:
test("search is case- and width-insensitive (NFKC)", () => {
const results = search(SAMPLE, "LOL");
assert.equal(results[0].emoji.char, "😂");
});
Stable sort matters more than the algorithm
Array.prototype.sort has been guaranteed stable since ES2019, so equal-keyed elements keep their original order. But equal scores are common in this dataset — 顔 lands at NAME_JA_SUBSTRING (3) for every face emoji — so I made the tie-breaker explicit:
matches.sort((a, b) => {
if (b.score !== a.score) return b.score - a.score;
return a.idx - b.idx; // tie-break by input order
});
The effect: when many emojis tie on score, the curated order (the order I wrote them into data.json) survives. I put the most-used emoji first in every category, so ties resolve to "the more-used one".
test("search is stable: equal-scoring matches keep input order", () => {
// "顔" → 3 emojis all at NAME_JA_SUBSTRING (3).
const results = search(SAMPLE, "顔");
const chars = results.map((r) => r.emoji.char);
assert.deepEqual(chars, ["😂", "🥺", "🐱"]); // input order preserved
});
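Putting it together, the top-level search() that the tests exercise looks roughly like this. It's a condensed, standalone sketch of the same pipeline (normalize, per-token scoring, AND filter, stable sort with explicit tie-break), not necessarily the repo's exact code:

```javascript
const SCORE = { TAG_EXACT: 10, TAG_PREFIX: 7, TAG_SUBSTRING: 4, NAME_JA_SUBSTRING: 3, NAME_EN_SUBSTRING: 1 };
const normalize = (s) => String(s).normalize("NFKC").toLowerCase().trim();

function scoreToken(emoji, token) {
  let best = 0;
  for (const tag of emoji.tags) {
    const t = normalize(tag);
    if (t === token) return SCORE.TAG_EXACT;
    if (t.startsWith(token)) best = Math.max(best, SCORE.TAG_PREFIX);
    else if (t.includes(token)) best = Math.max(best, SCORE.TAG_SUBSTRING);
  }
  if (best > 0) return best;
  if (normalize(emoji.name_ja).includes(token)) return SCORE.NAME_JA_SUBSTRING;
  return normalize(emoji.name_en).includes(token) ? SCORE.NAME_EN_SUBSTRING : 0;
}

function search(data, query) {
  const tokens = normalize(query).split(/\s+/).filter(Boolean);
  if (tokens.length === 0) return [];
  const matches = [];
  data.forEach((emoji, idx) => {
    let score = 0;
    for (const tok of tokens) {
      const s = scoreToken(emoji, tok);
      if (s === 0) { score = -1; break; } // AND semantics: any miss drops
      score += s;
    }
    if (score > 0) matches.push({ emoji, score, idx });
  });
  matches.sort((a, b) => (b.score !== a.score ? b.score - a.score : a.idx - b.idx));
  return matches;
}
```

Against the real data, search(data, "わらう 顔") returns only 😂 with a score of 13 (10 + 3), matching the worked example above.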
What I didn't build
- Trie / inverted index — 107 entries × 7 tags is 0.1 ms of linear scan on a phone. The index only pays off past ~10,000 entries.
- Fuzzy / Levenshtein matching — typo tolerance complicates the score function. Prefix + substring already cover ~80% of real misses, and the cost of adding fuzzy is a noticeable jump in false positives.
- Skin-tone / gender variants — exploding the entry count by 5-10× hurts search quality. Native OS emoji pickers cover this better.
- Speech-readout accessibility annotations — that's exactly what CLDR is for. This repo is the other annotation register.
Try it
The demo at https://sen.ltd/portfolio/emoji-search-jp/ has the full lexicon. Try わらう, ぴえん, ばんざい, 猫, ハート, 寿司, お願い, ありがとう. The / key focuses the search box. Click an emoji to copy it.
Source: https://github.com/sen-ltd/emoji-search-jp — MIT, ~200 lines of JS plus 107 entries of curated data, 18 unit tests, no build step, no runtime dependencies.
🛠 Built by SEN LLC as part of an ongoing series of small, focused developer tools. Browse the full portfolio for more.
