Stripping Diacritics in Qt

As someone dealing with languages different than English, I quite often need to deal with diacritics. You know, characters such as č, ć, đ, š, ž, and similar. Even in English texts I can sometime see them, e.g. voilà. And yes, quite often you can omit them and still have understandable text. But often you simply cannot because it changes meaning of the word.

One place where this often bites me is search. It’s really practical for search to be accent insensitive since that allows me to use English keyboard even though I am searching for content in another language. Search that would ignore diacritics would be awesome.

And over the time I implemented something like that in all my apps. As I am building a new QText, it came time to implement it in C++ with a Qt flavor. And, unlike C#, C++ was not really built with a lot of internationalization in mind.

Solution comes from non-other than now late Michael S. Kaplan. While his blog was deleted by Microsoft (great loss!), there are archives of his work still around - courtesy of people who loved his work. His solution was in C# (that’s how I actually rembered it - I already needed that once) and it was beautifully simple. Decompose unicode string, remove non-spacing mark characters, and finally combine what’s left back to a unicode string.

In Qt’s C++, that would be something like this:

QString stripDiacritics(QString text) {
    QString formD = text.normalized(QString::NormalizationForm_D);

    QString filtered;
    for (int i = 0; i < formD.length(); i++) {
        if (formD.at(i).category() != QChar::Mark_NonSpacing) {
            filtered.append(formD.at(i));
        }
    }

    return filtered.normalized(QString::NormalizationForm_C);
}