NameSphere

Solving the Problem of Traditional Approaches to Name Matching

NameSphere combines several tools and techniques to match variant spellings and transliterations of names originating in languages that use Roman and non-Roman scripts. A list of variations of a name can be generated in a few steps that also eliminate low-quality spelling variants, retaining only the best ones. Conversely, when finding variants of a particular spelling, NameSphere first finds the best base form from which to derive variations.

NameSphere replaces the widely used rewrite-rule approach, which has several disadvantages, particularly for non-European names. The main problem areas are Romanization, segmentation, and rule ordering. No matter how much the rewrite-rule approach is tweaked, these disadvantages cannot be completely overcome.

Romanization

The term "rewrite rule" comes from early syntactic theory, where a string of constituents would be replaced ("rewritten") by another string; e.g., S → NP VP rewrites the symbol S as the two symbols NP and VP. As applied to names, special rules convert variants of transliterated names into a single canonical form. So, for example, Arabic Mohamed, Muhammad, Mahomet, Imhammad, and Mehmed can all be regularized to Muhamad. The regularized forms are then compared in matching.

For indexing, the regularized names are compressed using one scheme or another: for instance, doubled letters and vowels are removed. This ensures that sufficiently similar regularizations retrieve each other. For instance, whether Mahumd is meant to be Muhamad or is a typo for Mahmud, both will be returned.

This approach is based on two assumptions:

a. each Romanization can be mapped to exactly one source name, and
b. very similar source names will have the same compression.

Neither assumption is correct.
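
The compress-and-index step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names compress and build_index are hypothetical, and the scheme (collapse doubled letters, then drop vowels) is just the one example the text mentions, not any particular product's algorithm.

```python
from collections import defaultdict

VOWELS = set("aeiou")

def compress(name: str) -> str:
    """Collapse doubled letters, then drop vowels (one hypothetical scheme)."""
    collapsed = []
    for ch in name.lower():
        if not ch.isalpha():
            continue  # ignore apostrophes, spaces, hyphens
        if collapsed and collapsed[-1] == ch:
            continue  # "mm" -> "m"
        collapsed.append(ch)
    return "".join(ch for ch in collapsed if ch not in VOWELS)

def build_index(names):
    """Group regularized names under their compressed key."""
    index = defaultdict(set)
    for name in names:
        index[compress(name)].add(name)
    return index

index = build_index(["Muhamad", "Mahmud"])
hits = sorted(index[compress("Mahumd")])
print(hits)  # ['Mahmud', 'Muhamad']
```

Both regularized names compress to the same key, so a query for Mahumd retrieves both, as the text describes. The same collision is what makes assumption (b) fragile: the key cannot tell apart names that merely look alike once vowels are gone.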
Differences in Romanization and pronunciation often make it impossible to determine which source name was intended, and names that have some variants in common may also have very different variants. For example, the Arabic letter qaf (ق) can be pronounced in various ways: in Iraq, it's a hard G (as in "goof"); in Saudi Arabia, it's a J (as in "judge"); in most cities outside the Gulf, it's a glottal stop (as in "uh-oh"). This means that the name Qaasim ("apportioner") could also be spelled Ga(a)sim, Ja(a)sim, or 'A(a)sim. But Jaasim and 'Asim could also represent separate names (meaning "tremendous one" and "protector", respectively). Forcing these names to match by regularizing them to the same form obscures the differences between them. This Q-G-J-' alternation is not an isolated case; such instances are common. Consider the following overlapping Romanization:
So Nadhir could represent at least three different names, each of which might have other variants, none of which should match each other. Under a single canonical form, either names that should not match are grouped together, or names that should match are missed.
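
The dilemma can be made concrete with the Qaasim example from earlier. The sketch below is purely illustrative: the upper-case labels (QASIM, JASIM, ASIM) are hypothetical identifiers for the three source names, and the two tables are hand-built to show the two ways a one-key-per-spelling scheme can be configured. Neither table comes from a real rule set.

```python
# What each Romanized spelling may actually stand for (one-to-many).
CANDIDATES = {
    "Qaasim": {"QASIM"},
    "Gasim": {"QASIM"},
    "Jasim": {"QASIM", "JASIM"},  # qaf pronounced J, or the separate name Jaasim
    "'Asim": {"QASIM", "ASIM"},   # qaf as glottal stop, or the separate name 'Asim
}

# Option 1: regularize every spelling to the same canonical key.
MERGED = {"Qaasim": "QASIM", "Gasim": "QASIM", "Jasim": "QASIM", "'Asim": "QASIM"}

# Option 2: keep the ambiguous spellings as their own canonical keys.
SPLIT = {"Qaasim": "QASIM", "Gasim": "QASIM", "Jasim": "JASIM", "'Asim": "ASIM"}

def matches(canonical, a, b):
    """Rewrite-rule style matching: compare the single canonical keys."""
    return canonical[a] == canonical[b]

over_match = matches(MERGED, "Jasim", "'Asim")   # groups two possibly distinct names
missed_match = matches(SPLIT, "Qaasim", "Jasim")  # misses a genuine Qaasim variant

print(over_match, missed_match)  # True False
```

Whichever table is chosen, one of the two errors appears, because each spelling is forced to a single key while CANDIDATES shows that some spellings legitimately belong to more than one source name.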
Segmentation

In many cultures, names commonly have affixes attached: prefixes (e.g. Arabic Hajyousef), or suffixes (e.g. Russian Petrovna). Names can also appear joined together, an occurrence quite frequent with Chinese names (e.g. Xiaoyan vs Xiao Yan, Xiaomei vs Xiao Mei, Meixiao vs Mei Xiao, Linxiao vs Lin Xiao). By introducing whitespace into names, traditional rules can divide joined names, or remove prefixes and suffixes. Unfortunately, this process is prone to error.

First, it may remove affixes that aren't really affixes. The North African Arabic prefix Ow, a variant of Ould (which comes from Arabic Walad, "son of"), also has variants Aw, Wa, Wi, Oua and Oui. A rule to remove this prefix would also apply to names like Wakim, Wisam, Wizight and Oussama, where it should not. Similarly, the suffix Aldin ("the faith") often appears without the L, in names like Saifeddin ("sword of..."), Shamseddine ("sun of..."), Salahuddin ("righteousness of..."). Removing this suffix will mangle names like Ladin (changing it to L Al Din) and Ayden (becoming Ay Al Din). Similar problems will crop up with suffixed variants of Allah.

Second, a rewrite mechanism is unable, in general, to split apart joined names, as there is no general mechanism to recognize two familiar elements and break them up. Because a rule engine operates in one pass, regularization and segmentation must take place at the same time. This means that each name requires a separate rule to divide it from other names. For example, if there is a rule to separate variants of Chinese Xiao from the beginnings and endings of names, there is no way to check whether the remaining segment (e.g. Yan, Mei, Lin) is, in fact, also a name. To do that requires rules to cover every possible pair of joined names -- in either possible order!

Rule Ordering

A rewrite-rule engine is limited by the fact that only one rule can fire at any given position in a name. Once the rule has fired, the pointer advances.
It is not possible to back up, or to stay in the same place. For instance, a potential rule to convert Francophone CH to SH allows Charif to become Sharif. Another rule might break up initial consonant clusters by inserting a vowel, converting Shrif into Sharif. However, when both rules apply at the same position, the first rule is trumped by the second, so that Chrif becomes Charif, which does not match Sharif (not to mention the problem of Germanophone spellings, where CH should become KH, not SH).

Corrective attempts to avoid manglings (like Ladin to L Al Din) could change rules to apply only after three letters. However, this would cause problems with Noureddine, a Francophone spelling of Nur Al Din ("light of the faith"). Although Nour is four letters long, a rule to convert Francophone OU to U would reduce it to three. After OU is rewritten as U, the pointer advances to R, and all the rule engine can see is "Reddine", which is too short to trigger the Al Din rule. This requires either general rules that extract Al Din after four letters, five letters, etc. (which in turn may interfere with other rules in unexpected ways), or a special rule for Noureddine.

Of course, workarounds can be found for these particular cases, but there is no general solution to this problem. Although a rewrite mechanism makes it possible to write very powerful rules, there is no way to ensure that these rules will not interfere with each other, and the more of these workaround rules one writes, the harder it becomes for those who inherit the system to understand, maintain, and improve the rules.

The NameSphere Approach

NameSphere uses three main ways to improve retrieving and comparing transliterated names:
Although some of these could be implemented alongside rewrite rules, others would involve more or less drastic changes to a rewrite process. Implementing all three of these solutions allows NameSphere to completely replace a traditional rewrite-rule system.

NameSphere employs a suite of tools that improve both the generation of good name variants and the finding of variants of a particular spelling. These tools operate behind the scenes, providing the user with a seamless process.

Generating a list of quality variations of a name can be performed in a few steps, unseen by the user. NameSphere's tailored multi-compressional intersective-comparative approach to names produces lists of the good variant spellings, while simultaneously eliminating the unwanted junk variations.

Finding variants of a particular spelling involves a pre-generative tailored base-form search. By mapping the source variant to a base form before generating variants, NameSphere avoids the skewed possibilities that result from taking a non-standard variant as the prototype.
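
The idea of resolving a spelling to a base form before generating variants can be sketched as a two-step lookup. This is not NameSphere's implementation, which the text does not describe in detail. The tables BASE_FORM and VARIANTS, the function find_variants, and the choice of Muhammad as the base form are all assumptions made for illustration, using only the spellings listed earlier.

```python
# Hypothetical tables; a real system would hold far more data.
BASE_FORM = {
    "Mohamed": "Muhammad",
    "Mahomet": "Muhammad",
    "Imhammad": "Muhammad",
    "Mehmed": "Muhammad",
}

VARIANTS = {
    "Muhammad": ["Muhammad", "Mohamed", "Mahomet", "Imhammad", "Mehmed"],
}

def find_variants(spelling: str) -> list[str]:
    """Map the spelling to its base form first, then list that form's variants."""
    base = BASE_FORM.get(spelling, spelling)  # unknown spellings act as their own base
    return VARIANTS.get(base, [spelling])

print(find_variants("Mehmed") == find_variants("Muhammad"))  # True
```

Because every query is routed through the base form, a non-standard input such as Mehmed yields the same variant list as the standard spelling, rather than a list skewed toward spellings that resemble Mehmed.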



