Your Friendly iOS Coder

Dabby [ダビー]

Normalizing Text for Search With String Transformations {あいうえお <=> アイウエオ <=> AIUEO}

Recently I was working on a personal dictionary for Japanese. Japanese is written with three different scripts that are used in different contexts. I needed a robust search that could find a word represented in any of them. In this post, we will see how the String API gives us enough functionality to do this.

Japanese Writing Systems

This is not a blog about Japanese, but to understand the problem it helps to know a little about how the language is written. As previously mentioned, it uses three different scripts:

  • Hiragana: Ordinary/simple phonetic lettering system used for Japanese words not covered by Kanji, e.g. いま{Now}
  • Katakana: Derived phonetic lettering system used mainly for transcribing loan words in Japanese, e.g. アイスクリーム{Ice cream}
  • Kanji: Chinese characters used in Japanese, e.g. 今{Now}

Notice how the examples for Hiragana and Kanji have the same meaning. This is because the word “Now” is usually written in Kanji (今), but can be written phonetically in Hiragana as いま, which is read as i-ma. The Latin-alphabet representation of the way a Japanese word is pronounced is known as romaji.

The goal was therefore to ensure that any of the following search queries (“ima”, “いま”, “今” or “now”) brings up the Japanese dictionary term for “now”.

Normalization

A common pre-processing step for text that is going to be searched is to normalize it. This means converting the text into a uniform representation that other operations (e.g. search) can work on. For example, this can mean removing diacritics in languages like German or, in our case, converting the Japanese phonetic scripts (Hiragana and Katakana) into one single system that search can be performed on. I had to create a reverse-index of all the ways a Japanese word can be represented, and then normalize this index into romaji. This is where the power of String comes in.
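
As a small illustration of the diacritics case, here is a sketch that uses the stripDiacritics transform with the applyingTransform function covered in the rest of this post. The helper name searchKey and the sample words are made up for this example and are not part of the dictionary code.

```swift
import Foundation

// Hypothetical helper: lowercase the text and strip diacritics so that
// "Müller" and "muller" end up with the same search key.
func searchKey(_ text: String) -> String {
    let lowered = text.lowercased()
    // applyingTransform returns nil if the transform cannot be applied;
    // fall back to the lowercased text in that case.
    return lowered.applyingTransform(.stripDiacritics, reverse: false) ?? lowered
}

print(searchKey("Müller")) // muller
print(searchKey("Über"))   // uber
```

Both the indexed text and the query would go through the same helper, so a user typing "muller" still finds "Müller".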

The String API provides us with

  • applyingTransform(_ transform: StringTransform, reverse: Bool) -> String?

Some examples of StringTransform are toLatin, latinToArabic, latinToHiragana, etc. (see the official documentation for more). Given a StringTransform, the applyingTransform function converts the string from its current representation into the one described by the StringTransform parameter, or performs the inverse conversion when reverse is true. It returns nil if the transform cannot be applied. This made my work a lot easier, as I could just create an extension on String for my purpose:

import Foundation

extension String {
    // Converts hiragana or katakana in the string to its Latin representation (romaji).
    // Kanji and other characters are left as they are.
    var toRomaji: String? {
        return self.applyingTransform(.hiraganaToKatakana, reverse: false)?
            .applyingTransform(.latinToKatakana, reverse: true)
    }
}

The property above converts a Japanese string written in Katakana or Hiragana into its Latin phonetic equivalent (romaji). Words written in Kanji are left untransformed. This is actually okay for our purpose, because Kanji to Hiragana conversions can sometimes be ambiguous. The property first converts all hiragana to katakana, and then converts from katakana to romaji by applying latinToKatakana in reverse. By doing it this way, we prevent String from making assumptions about the language of the string, which in most cases causes Japanese Kanji characters to be treated as Chinese.
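
To see the two steps separately, here is a small sketch that prints the intermediate katakana form before the final romaji. The sample strings are my own choice for illustration. The last line uses toLatin only for contrast, and I have deliberately not written its output, since it depends on how the system reads the Kanji.

```swift
import Foundation

let hiragana = "いま"

// Step 1: hiragana -> katakana.
let katakana = hiragana.applyingTransform(.hiraganaToKatakana, reverse: false)
print(katakana ?? "nil") // イマ

// Step 2: katakana -> Latin, i.e. latinToKatakana applied in reverse.
let romaji = katakana?.applyingTransform(.latinToKatakana, reverse: true)
print(romaji ?? "nil") // ima

// Kanji goes through both steps without being changed.
let kanji = "今"
let kanjiResult = kanji
    .applyingTransform(.hiraganaToKatakana, reverse: false)?
    .applyingTransform(.latinToKatakana, reverse: true)
print(kanjiResult ?? "nil") // 今

// For contrast: toLatin also romanizes the Kanji, typically with a Chinese reading.
print(kanji.applyingTransform(.toLatin, reverse: false) ?? "nil")
```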

Searching Japanese words

Let’s now use this to solve the search problem we talked about above. Japanese dictionaries like JMDict usually give unambiguous Kanji to Hiragana representations of words. So we create a Japanese word struct:

struct JapaneseWord {
    let kanji: String? // not all words have a kanji representation
    let kana: String // this is either katakana or hiragana
    let meaning: String
}

The above struct is a slight oversimplification, as a particular Japanese word could have multiple kana representations and meanings, but we will stick to one value for each within the scope of this blog.
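
For reference, a fuller entry type could be sketched as below, with several readings and meanings per word. The JapaneseEntry name, its fields and the sample data are assumptions for illustration. The rest of this post keeps using the simpler JapaneseWord.

```swift
// Hypothetical richer entry: a word may have several kana readings and meanings.
struct JapaneseEntry {
    let kanji: String?       // not all words have a kanji representation
    let kana: [String]       // one or more hiragana/katakana readings
    let meanings: [String]   // one or more English glosses

    // Every string a user might type to find this entry, before normalization.
    var searchTerms: [String] {
        var terms = kana + meanings
        if let kanji = kanji {
            terms.append(kanji)
        }
        return terms
    }
}

let tomorrow = JapaneseEntry(
    kanji: "明日",
    kana: ["あした", "あす"],
    meanings: ["tomorrow"]
)
print(tomorrow.searchTerms) // ["あした", "あす", "tomorrow", "明日"]
```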

We can then create a reverse index that maps from the possible search keys of a given word back to the word. This would usually be persisted in a database, but we will use an in-memory map.

let words = [
    JapaneseWord(kanji: "今", kana: "いま", meaning: "now"),
    JapaneseWord(kanji: "今日", kana: "きょう", meaning: "today"),
    JapaneseWord(kanji: "来る", kana: "くる", meaning: "to come"),
    JapaneseWord(kanji: "行く", kana: "いく", meaning: "to go"),
    JapaneseWord(kanji: "帰る", kana: "かえる", meaning: "to return")
]

var indexMap = [String: [Int]]()
for (idx, word) in words.enumerated() {
    if let kanji = word.kanji, let romanizedKanji = kanji.toRomaji {
        indexMap[romanizedKanji, default: []].append(idx)
    }

    if let romanizedKana = word.kana.toRomaji {
        indexMap[romanizedKana, default: []].append(idx)
    }

    indexMap[word.meaning.lowercased(), default: []].append(idx)
}

In our code above, we are assuming a dictionary containing only five words. We create the reverse-index map by normalizing the kanji property and using the result as a key that maps back to the index of the corresponding JapaneseWord. We do the same for the kana, and finally, we use the lowercased meaning to create another reference to the JapaneseWord. Referencing the index (idx) of the JapaneseWord in the indexMap instead of the struct itself ensures that the map is not bloated by the size of the JapaneseWord struct, which saves us some memory.
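
One reason the map stores an array of indices per key is that several words can normalize to the same key. The sketch below shows this with pre-romanized keys so that it runs without any transforms. The Entry type, the sample homophones and the hard-coded romaji are assumptions made for the example.

```swift
// Hypothetical, simplified entries with the romaji already computed.
struct Entry {
    let kanji: String
    let romaji: String
    let meaning: String
}

let entries = [
    Entry(kanji: "紙", romaji: "kami", meaning: "paper"),
    Entry(kanji: "神", romaji: "kami", meaning: "god"),
    Entry(kanji: "今", romaji: "ima", meaning: "now")
]

// Key -> indices into `entries`. The `default: []` subscript creates the
// array on first use, so later words with the same key are appended to it.
var index = [String: [Int]]()
for (idx, entry) in entries.enumerated() {
    index[entry.romaji, default: []].append(idx)
    index[entry.meaning, default: []].append(idx)
}

print(index["kami"] ?? [])  // [0, 1]
print(index["paper"] ?? []) // [0]
print(index["sora"] ?? [])  // []
```

A query for "kami" therefore returns both words, and a key that was never indexed simply yields an empty result.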

Let’s now write a search function that searches the dictionary. We define the search logic below:

func search(query: String) -> [JapaneseWord] {
    guard let normalizedQuery = query.lowercased().toRomaji else { return [] }

    let wordIndices = indexMap[normalizedQuery] ?? []
    return wordIndices.map { words[$0] }
}


print(search(query: "now")) // => [JapaneseWord(kanji: "今", kana: "いま", meaning: "now")]
print(search(query: "帰る")) // => [JapaneseWord(kanji: "帰る", kana: "かえる", meaning: "to return")]
print(search(query: "きょう")) // => [JapaneseWord(kanji: "今日", kana: "きょう", meaning: "today")]
print(search(query: "kyou")) // => [JapaneseWord(kanji: "今日", kana: "きょう", meaning: "today")]

First, we normalize the search query so that it is in the same form as the keys in our index. Next we search for the normalized query in our reverse-index which gives us the indices (in the words array) of the JapaneseWords that match our search query. Finally, we return the JapaneseWords corresponding to the indices. 🎉🚀

Conclusion

The String API has a lot of powerful functionality. In this post, we took advantage of one part of it to build search for a simple electronic Japanese dictionary. What other String APIs are you using and how? I’ll be very interested to know more.

Thanks for taking the time 🙏🏿

Find me on twitter or contact me if you have any questions or suggestions.