Unicode is not a codepage, it's a compression format for strings. If you try to work with UTF-8 strings "from scratch" (for example in C, without libraries like libpango and so on), you will cry a lot. A very simple thing has become an overcomplicated one. Please bring back support for the normal Russian codepage.
Unicode is a codepage. UTF-8 is a "compression" format for Unicode. If you only need the basic 7-bit ASCII codes plus Unicode Cyrillic in UTF-8, it is not very complex. The main problem is that each Cyrillic letter takes 2 consecutive bytes. That is not a problem if the text is a stream. If you need to load it into an array, you can make simple conversion tables (e.g. utf_to_koi and koi_to_utf). But if you want to process UTF-8 in memory, IMO it is better to convert UTF-8 to Unicode (16 or 32 bit) and work with "wchar_t" characters.
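To make the conversion-table idea concrete, here is a minimal sketch in C, assuming the input contains only ASCII and 2-byte UTF-8 sequences and that anything without a KOI8-R equivalent may be replaced by '?'. The name utf_to_koi is taken from the post; koi_lower and cp_to_koi are hypothetical helpers invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* KOI8-R bytes 0xC0..0xDF hold the lowercase Russian letters in this
 * order; each entry is the matching Unicode code point. Uppercase letters
 * sit 0x20 higher in KOI8-R and 0x20 lower in Unicode. */
static const uint16_t koi_lower[32] = {
    0x044E, 0x0430, 0x0431, 0x0446, 0x0434, 0x0435, 0x0444, 0x0433,
    0x0445, 0x0438, 0x0439, 0x043A, 0x043B, 0x043C, 0x043D, 0x043E,
    0x043F, 0x044F, 0x0440, 0x0441, 0x0442, 0x0443, 0x0436, 0x0432,
    0x044C, 0x044B, 0x0437, 0x0448, 0x044D, 0x0449, 0x0447, 0x044A
};

/* Hypothetical helper: map one code point to a KOI8-R byte,
 * or '?' when this sketch has no mapping for it. */
static unsigned char cp_to_koi(uint32_t cp)
{
    if (cp < 0x80) return (unsigned char)cp;   /* ASCII passes through */
    if (cp == 0x0451) return 0xA3;             /* lowercase yo */
    if (cp == 0x0401) return 0xB3;             /* uppercase yo */
    for (int i = 0; i < 32; i++) {
        if (cp == koi_lower[i]) return (unsigned char)(0xC0 + i);
        if (cp + 0x20 == koi_lower[i]) return (unsigned char)(0xE0 + i);
    }
    return '?';
}

/* Convert a NUL-terminated UTF-8 string to KOI8-R. Only 1-byte and
 * 2-byte sequences are decoded; everything else becomes '?'.
 * Returns the number of bytes written, not counting the terminator. */
size_t utf_to_koi(const char *in, unsigned char *out, size_t outsz)
{
    assert(in != NULL && out != NULL && outsz > 0);
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;
    while (*p != '\0' && n + 1 < outsz) {
        uint32_t cp;
        if (*p < 0x80) {
            cp = *p++;
        } else if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            cp = ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
            p += 2;
        } else {
            cp = 0xFFFD;                       /* unsupported or malformed */
            p++;
            while ((*p & 0xC0) == 0x80) p++;   /* skip continuation bytes */
        }
        out[n++] = cp_to_koi(cp);
    }
    out[n] = '\0';
    return n;
}
```

The reverse direction (koi_to_utf) could reuse the same table: look up the code point for a byte in 0xC0..0xFF and emit it as two UTF-8 bytes.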
It's a big problem, because you have to "normalize" your strings: the same string can be encoded in several different ways. If you ignore that, you end up in a situation where two identical strings do not compare equal. You also can't use normal functions like strlen and so on.
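The strlen point is easy to show: strlen counts bytes, not characters. A minimal sketch in C, assuming valid UTF-8 input (utf8_codepoints is a hypothetical name, not a standard function):

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string by counting every
 * byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8; no validation is done here. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the two bytes 0xD0 0xB9 (one Cyrillic letter), strlen reports 2 while this function reports 1. Note that it still counts code points, not what a user sees as characters: a base letter followed by a combining mark counts as 2.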
Hmm, I respectfully disagree. Unicode has made string handling much easier and more logical, once programming languages caught up (and we're not there yet). Today I can have a string that contains scientific data (like "temperature 42.2°F recent change -1.8°F/hour χ²ₙ=0.05" which shows the trend and how certain the trend is), accented western European characters and umlauts, all varieties of Cyrillic, and CJKV languages from Asia, and display it with a single print statement. I can cut and paste that string from a terminal window into a web browser (I just did). In good programming languages, those strings are perfectly well supported. The fact that C's string handling (which is archaic) doesn't support it yet is a strong argument for switching to better programming languages, or updating C.
I'm not saying that the capability of loading one particular 256-byte code page into the console font buffer should be removed, and it is unfortunate that it has been. But a rewrite of the console support in FreeBSD was needed long ago (the transition from sc to vt), and supporting Unicode was much more important than maintaining support for individual code pages.
What prevents you from using Unicode on the console? That you have custom C programs which expect 8-bit data in a particular encoding?
That's a task for a markup language, not for a codepage.
UTF-8 is for Unicode transmission, not for processing in memory. If you load UTF-8 into memory, you have to convert it to Unicode (wchar_t) or to a specific code page like KOI8-R.
Microsoft made the transition to Unicode about 30 years ago. I wonder why Unix/Linux still prefer 8-bit strings and tricks like UTF-8.
There is no solution. In any case, the same character can be encoded in more than one way. You have to use some large libraries to process UTF-8 correctly.
No need for large libraries. Encoding and decoding between UTF-8 and UTF-16 (16-bit Unicode) is something like 10-20 lines in C. What do you mean when you say that the same string can be encoded in several variants?
For example, the Russian character "й" may be encoded as U+0439 or as U+0438 U+0306.
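Both points can be seen in one sketch: a UTF-8 decoder really is about that short, and the two spellings of "й" decode to different code point sequences. This is a minimal version in C that decodes to 32-bit code points instead of UTF-16, assuming NUL-terminated input; it does not reject overlong forms or surrogates, and the names utf8_decode and utf8_to_utf32 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point starting at s. Stores it in *cp and returns the
 * number of bytes consumed, or 0 on a malformed or truncated sequence.
 * Sketch only: overlong encodings and surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    size_t len;
    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid lead */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* also stops at the NUL terminator */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Decode a whole NUL-terminated string into out (capacity outsz).
 * Returns the number of code points, or (size_t)-1 on malformed input
 * or when out is too small. */
size_t utf8_to_utf32(const char *in, uint32_t *out, size_t outsz)
{
    assert(in != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;
    while (*p != '\0') {
        uint32_t cp;
        size_t len = utf8_decode(p, &cp);
        if (len == 0 || n == outsz)
            return (size_t)-1;
        out[n++] = cp;
        p += len;
    }
    return n;
}
```

Decoding the bytes D0 B9 gives the single code point U+0439, while D0 B8 CC 86 gives U+0438 followed by U+0306. Both render as "й", but a plain comparison of the bytes or of the code points treats them as different strings.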
[…] I can't boot my system using classical BIOS (only UEFI), so I can't use syscons console driver. […]
You can see all Cyrillic letters here. I don't know why you would combine "и" with other signs to create "й". Who would use special combinations to enter "й" when there is a separate key on the keyboard just for this letter?
That doesn't matter. If you write software, you MUST consider ANY user's behavior.
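For the one case discussed here, handling "any" input could look like the sketch below: fold the combining form into the precomposed letter before comparing. It is written in C, works on already decoded code points, and covers only и/И followed by U+0306. That is an assumption made to keep the example short; real normalization needs the full Unicode composition data, usually from a library. The name compose_short_i is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compose U+0438/U+0418 followed by U+0306 (combining breve) into
 * U+0439/U+0419, in place. Returns the new length in code points.
 * Covers only this single pair; it is not a general NFC implementation. */
size_t compose_short_i(uint32_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t w = 0;                       /* write index */
    for (size_t r = 0; r < n; r++) {
        if (s[r] == 0x0306 && w > 0 &&
            (s[w - 1] == 0x0438 || s[w - 1] == 0x0418)) {
            s[w - 1] += 1;              /* U+0438 -> U+0439, U+0418 -> U+0419 */
        } else {
            s[w++] = s[r];
        }
    }
    return w;
}
```

After this pass, a string typed with the dedicated "й" key and one built from "и" plus a combining breve hold the same code points and compare equal.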
I recommend familiarizing yourself with ISO 10646 at least a little bit. Then you would know that a) these are called precomposed sequences, and b) precomposed sequences are provided for compatibility with preexisting standards. The “Unicode way” is to use combining diacritics.
If you’re so focused on having KOI8‑R, you can select sc(4) as your terminal driver. It is just a default (UEFI → vt(4), IBM PC BIOS → sc(4)); you are not forced to use vt(4). sc(4) is still part of a GENERIC build.
cat >> /boot/loader.conf << 'EOT'
kern.vty=sc
EOT
But I can't boot with classical BIOS.
Note that the syscons driver is not compatible with systems booted via UEFI(8). Forcing use of syscons on such systems will result in no usable console.
I agree. But I got the impression that you want a quick and easy solution. You don't want libraries, etc. Your demand for KOI8-R sounds like a partial solution: what will happen if the user wants to type letters from another encoding (e.g. Central European languages)? What about "you MUST consider ANY user's behavior"?
If the system does not support other codepages, that doesn't matter, because other characters are not acceptable anyway. If the system works with other codepages, it is not a problem: just use the national codepage, the way GNU gettext supports i18n with any codepage, including KOI8-R.