Citing Cyrillic transliterations
Recently we had a longish conversation on Twitter about citing metadata that's been transliterated from Cyrillic.
Background
There are a number of ways to transliterate Cyrillic text into Latin script.
It seems that many catalogs serve metadata that's been transliterated using the ALA-LC standard (i.e., it includes character combinations like i︠e︡, ĭ, T︠S︡, etc.).
Issue
1. Combining marks currently make it difficult to find items in the Zotero database.
2. When citing, ligatures/combining marks are not used (though apparently they may be used in some styles?).
Possible solutions
1. This is a general issue and I think it will be addressed when we implement text normalization throughout the database and then strip all special Unicode marks when comparing strings.
2A. We _could_ replace ALA-LC transliterations with the "standard" form (I assume this refers to the BGN/PCGN system) on import from websites. It seems to be a pretty straightforward 1:1 mapping, though it only works one way. The issue here is that these ligatures are not specific to Cyrillic transliterations and could be used in other scripts, so we would have to make sure that we're only doing it for Cyrillic transliterations. Unfortunately, from what I saw, many catalogs do not include any indication of the language/script, so I would rather leave this in user control. Additionally, it seems that the ligatures are actually informative and Avram has suggested that Zotero should _not_ remove them on import.
2B. The other option I see is that these are cleaned up when citing in citeproc-js. This would allow the user to use the language field to specify what kind of transliteration this is (more on that below*) and we would not have to worry about messing up metadata. Additionally, if some styles do want to use the ALA-LC system, there could be a way to specify this in the style. Finally, the original metadata would remain undisturbed.
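To illustrate what solution 1 might amount to, here is a minimal sketch of mark-insensitive string comparison. The function names `searchKey` and `matches` are made up for this example, and this is not how Zotero's search is actually implemented; it only shows the idea of decomposing the text and dropping all combining marks before comparing.

```javascript
// Hypothetical helper: build a comparison key by decomposing the text (NFD)
// and removing every combining mark (Unicode category M). This covers both
// the breve in "ĭ" and the ligature half marks U+FE20/U+FE21.
function searchKey(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "").toLowerCase();
}

// Hypothetical helper: substring match on the stripped keys, so a query
// typed without marks still finds an item stored with them.
function matches(storedValue, query) {
  return searchKey(storedValue).includes(searchKey(query));
}
```

With something like this, a query typed as "dostoevskii" would match a stored "Dostoevskiĭ", while the stored metadata itself stays untouched.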
* There is a "t" extension to the BCP 47 language tag system that allows specifying the source language for transliteration and the system that was used to transliterate. This could allow the character substitution to be fine-tuned based on the style requirements and the language/script of the metadata.
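As a rough illustration of how such a tag could be read, here is a small sketch that splits a language tag into the target language, the source language, and the transform fields. The function name `parseTransformTag` is hypothetical, the parsing is deliberately simplified (it is not a full BCP 47 parser), and the example tag with the `m0` mechanism value `alaloc` reflects my understanding of the registered values rather than anything Zotero currently stores.

```javascript
// Hypothetical, simplified reader for the "t" extension of a language tag.
// Everything before "-t-" is treated as the target language; after it comes
// an optional source language, then key/value fields such as "m0-alaloc".
function parseTransformTag(tag) {
  const subtags = tag.toLowerCase().split("-");
  const tIndex = subtags.indexOf("t");
  if (tIndex === -1) {
    return { target: subtags.join("-"), source: null, fields: {} };
  }
  const target = subtags.slice(0, tIndex).join("-");
  const sourceParts = [];
  const fields = {};
  let currentKey = null;
  for (const subtag of subtags.slice(tIndex + 1)) {
    if (subtag.length === 1) break; // another extension singleton starts here
    if (/^[a-z][0-9]$/.test(subtag)) {
      // Field keys are one letter followed by one digit, e.g. "m0".
      currentKey = subtag;
      fields[currentKey] = [];
    } else if (currentKey) {
      fields[currentKey].push(subtag);
    } else {
      sourceParts.push(subtag);
    }
  }
  return { target, source: sourceParts.join("-") || null, fields };
}
```

For a tag such as `ru-Latn-t-ru-Cyrl-m0-alaloc`, this would report a Latin-script Russian target, a Cyrillic Russian source, and the mechanism field, which is the information a style would need to decide which substitutions to apply.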
Off topic: in the long long long run, I can see Zotero taking advantage of the ICU project to transliterate metadata on-the-fly.
1) No ligatures.
2) i instead of ĭ for й.
3) Capitalized letters which are two roman letters but one in Russian (Ц -> Ts) are rendered with standard English capitalization (i.e., Ts instead of TS with the ligature).
4) Russian old orthography letter ѣ (yat') is transliterated "e" instead of "i︠e︡."
I believe this might be an older version of the LC system, but at any rate it is the standard form used in publications such as the Russian Review. It's also routinely called the LC system! I don't think people are actually aware of what the strict form entails.
Speaking only for myself, it's irrelevant to me that the ligatures are displayed in Zotero itself (as long as the search is made to work properly). What really bugs me and makes Zotero very hard to use in final products is the presence of these forms in the citations. So a citation-level fix would be fine for me as a historian.
We should be able to solve this with a plugin, if I provide a hook in citeproc-js for an unconditional transform function, applied to CSL items before the abbreviation mechanism gets ahold of them. All we would need is a set of JSON mappings for character clusters to be transformed, and a small plugin to attach a function that makes use of them to the processor. Something like:
{
  "ru": {
    "[ligature chars]": "e",
    "[ligature chars]": "i"
  }
}
Maybe.
citeproc.sys.stripLigatures = function (Item) {
  // Do stuff to Item
};
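To make the idea a bit more concrete, here is a sketch of what such a mapping-driven transform could look like, following the four rules listed earlier in the thread. Everything here is an assumption for illustration: the `LIGATURE_MAPPINGS` table and its clusters, the `simplifyString` helper, and the way name arrays are handled. The hook itself does not exist yet, so the function is written standalone and returns a modified copy of the item.

```javascript
// Hypothetical mapping table keyed by language; the clusters are examples only.
const LIGATURE_MAPPINGS = {
  ru: {
    "i\uFE20e\uFE21": "e", // strict ALA-LC form of yat' (ѣ)
    "I\uFE20E\uFE21": "E",
    "\u012D": "i",         // ĭ (й)
    "\u012C": "I"          // Ĭ (Й)
  }
};

const HALF_MARKS = /[\uFE20\uFE21]/g;
// Two capitals joined by the double ligature, e.g. T + U+FE20 + S + U+FE21.
const CAPITAL_PAIR = /([A-Z])\uFE20([A-Z])\uFE21/g;

function simplifyString(text, mapping) {
  let out = text.normalize("NFC"); // so "i" + combining breve becomes "ĭ"
  for (const [cluster, replacement] of Object.entries(mapping)) {
    out = out.split(cluster).join(replacement);
  }
  // Rule 3: "TS" under a ligature becomes "Ts".
  out = out.replace(CAPITAL_PAIR, (m, first, second) => first + second.toLowerCase());
  // Rule 1: drop whatever ligature marks remain.
  return out.replace(HALF_MARKS, "");
}

// Returns a cleaned copy of a CSL item; items in other languages pass through.
function stripLigatures(item, mappings = LIGATURE_MAPPINGS) {
  const lang = (item.language || "").toLowerCase().split("-")[0];
  const mapping = mappings[lang];
  if (!mapping) return item;
  const result = { ...item };
  for (const [key, value] of Object.entries(item)) {
    if (key === "id" || key === "language") continue;
    if (typeof value === "string") {
      result[key] = simplifyString(value, mapping);
    } else if (Array.isArray(value)) {
      // Assumed to be CSL name arrays: clean the string parts of each name.
      result[key] = value.map((name) => {
        if (name === null || typeof name !== "object") return name;
        const cleaned = { ...name };
        for (const [part, text] of Object.entries(name)) {
          if (typeof text === "string") cleaned[part] = simplifyString(text, mapping);
        }
        return cleaned;
      });
    }
  }
  return result;
}
```

Because it works on a copy and only runs when the item's language field says "ru", the stored metadata keeps its ligatures and items in other languages are left alone. One known rough edge: in an all-caps word, the capital-pair rule would still lowercase the second letter.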