How to set up the KOI8-R codepage instead of UTF-8 in the vt console driver

zx_gamer · Nov 23, 2024

Hey, guys. I can't boot my system using classical BIOS (only UEFI), so I can't use syscons console driver.
I have no choice but to use the vt driver, but I don't understand how to set up the KOI8-R codepage in vt instead of UTF-8.
Could you help me set up KOI8-R in vt?

Kai Burghardt · Nov 23, 2024

Welcome to the FreeBSD forums!

This is not yet supported.

Just out of curiosity why do you need to set the video terminal to a Russian code page? I mean, you can display Cyrillic letters and enter them just as well in Unicode.

zx_gamer · Nov 23, 2024

Unicode is not a codepage, it's a compression format for strings. If you try to work with UTF-8 strings "from scratch" (for example in C without libraries like libpango and so on), you will cry a lot)) A very simple thing has become an overcomplicated one. Please bring back support for the normal Russian codepage.

ralphbsz · Nov 24, 2024

Hmm, I respectfully disagree. Unicode has made string handling much easier and more logical, once programming languages caught up (and we're not there yet). Today I can have a string that contains scientific data (like "temperature 42.2°F recent change -1.8°F/hour χ²ₙ=0.05" which shows the trend and how certain the trend is), accented western European characters and umlauts, all varieties of Cyrillic, and CJKV languages from Asia, and display it with a single print statement. I can cut and paste that string from a terminal window into a web browser (I just did). In good programming languages, those strings are perfectly well supported. The fact that C's string handling (which is archaic) doesn't support it yet is a strong argument for switching to better programming languages, or updating C.

I'm not saying that the capability of loading one particular 256-byte code page into the console font buffer should be removed, and it is unfortunate that it has been. But a rewrite of the console support in FreeBSD was needed long ago (the transition from sc to vt), and supporting Unicode was much more important than maintaining support for individual code pages.

What prevents you from using Unicode on the console? That you have custom C programs which expect 8-bit data in a particular encoding?

6502 · Nov 24, 2024

zx_gamer said:
Unicode is not a codepage, it's a compression format for strings. If you try to work with UTF-8 strings "from scratch" (for example in C without libraries like libpango and so on), you will cry a lot)) A very simple thing has become an overcomplicated one. Please bring back support for the normal Russian codepage.

Unicode is a codepage. UTF-8 is a "compression" format for Unicode. If you only need the basic 7-bit ASCII codes plus Unicode Cyrillic in UTF-8, it is not very complex. The main problem is that each Cyrillic letter uses 2 consecutive bytes. That is not a problem if the text is a stream. If you need to load it into an array, you can make simple conversion tables (e.g. utf_to_koi and koi_to_utf). But if you want to process UTF-8 in memory, IMO it is better to convert UTF-8 to Unicode (16 or 32 bit) and work with "wchar_t" characters.
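
A minimal sketch of the conversion-table idea, written in Python rather than C to keep it short. The table names utf_to_koi and koi_to_utf come from the post; the table layout, the helper functions, and the use of Python's bundled koi8_r codec to fill the tables are assumptions made for illustration.

```python
# koi_to_utf[b] is the Unicode code point for KOI8-R byte b (0..255).
# The table is filled from Python's built-in koi8_r codec (an assumption;
# in C one would typically hard-code the 128 high entries).
koi_to_utf = [ord(bytes([b]).decode("koi8_r")) for b in range(256)]

# utf_to_koi is the reverse mapping: Unicode code point -> KOI8-R byte.
utf_to_koi = {cp: b for b, cp in enumerate(koi_to_utf)}


def utf8_to_koi8r(data: bytes) -> bytes:
    """Convert UTF-8 bytes to KOI8-R. Raises KeyError for characters
    that have no KOI8-R equivalent."""
    return bytes(utf_to_koi[ord(ch)] for ch in data.decode("utf-8"))


def koi8r_to_utf8(data: bytes) -> bytes:
    """Convert KOI8-R bytes to UTF-8 using the lookup table."""
    return "".join(chr(koi_to_utf[b]) for b in data).encode("utf-8")


text = "Привет, FreeBSD"
utf8 = text.encode("utf-8")
koi = utf8_to_koi8r(utf8)

# Each Cyrillic letter takes two bytes in UTF-8 and one in KOI8-R.
print(len(utf8), len(koi))
print(koi8r_to_utf8(koi).decode("utf-8"))
```

Characters outside the KOI8-R repertoire make utf8_to_koi8r fail, so a real program would have to decide how to substitute or reject them.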

zx_gamer · Nov 24, 2024

6502 said:
Unicode is a codepage. UTF-8 is a "compression" format for Unicode. If you only need the basic 7-bit ASCII codes plus Unicode Cyrillic in UTF-8, it is not very complex. The main problem is that each Cyrillic letter uses 2 consecutive bytes. That is not a problem if the text is a stream. If you need to load it into an array, you can make simple conversion tables (e.g. utf_to_koi and koi_to_utf). But if you want to process UTF-8 in memory, IMO it is better to convert UTF-8 to Unicode (16 or 32 bit) and work with "wchar_t" characters.

It's a big problem, because you have to "normalize" your strings, because you have some variants how to encode same strings. If you ignore that, you end up in a situation where two identical strings do not compare equal. Also, you can't use normal functions like strlen and so on.
Why make it so difficult when you can make it so simple?
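
To illustrate the strlen point: a byte-counting function returns the number of bytes, which matches the number of letters in KOI8-R but not in UTF-8. The sketch below is in Python for brevity; utf8_strlen is a hypothetical helper, not a standard function, and it assumes its input is already valid UTF-8.

```python
def utf8_strlen(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation
    bytes, which always have the bit pattern 10xxxxxx."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


word = "Привет"
as_koi8r = word.encode("koi8_r")
as_utf8 = word.encode("utf-8")

print(len(as_koi8r))         # 6: one byte per letter
print(len(as_utf8))          # 12: what a byte-counting strlen would report
print(utf8_strlen(as_utf8))  # 6: code points, not bytes
```

The same loop is a few lines in C, but it counts code points only. It does not account for combining sequences.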

zx_gamer · Nov 24, 2024

ralphbsz said:
Hmm, I respectfully disagree. Unicode has made string handling much easier and more logical, once programming languages caught up (and we're not there yet). Today I can have a string that contains scientific data (like "temperature 42.2°F recent change -1.8°F/hour χ²ₙ=0.05" which shows the trend and how certain the trend is), accented western European characters and umlauts, all varieties of Cyrillic, and CJKV languages from Asia, and display it with a single print statement. I can cut and paste that string from a terminal window into a web browser (I just did). In good programming languages, those strings are perfectly well supported. The fact that C's string handling (which is archaic) doesn't support it yet is a strong argument for switching to better programming languages, or updating C.

I'm not saying that the capability of loading one particular 256-byte code page into the console font buffer should be removed, and it is unfortunate that it has been. But a rewrite of the console support in FreeBSD was needed long ago (the transition from sc to vt), and supporting Unicode was much more important than maintaining support for individual code pages.

What prevents you from using Unicode on the console? That you have custom C programs which expect 8-bit data in a particular encoding?

That's a task for a markup language, not for a codepage.

And it’s not C that’s a bad programming language, it’s the encoding that’s terrible. When processing for ANY kind of data is written in C, why is UTF-8 "more important" than processing raw PNG, for example? What can you say about other programming languages? Pascal, Ada, Object Pascal, C++? Are they also unsuitable for processing character arrays?

6502 · Nov 24, 2024

UTF-8 is for Unicode transmission, not for processing in memory. If you load UTF-8 into memory, you have to convert it to Unicode (wchar_t) or to a specific 8-bit code page like KOI8-R.

PS: Microsoft made the transition to Unicode about 30 years ago. I wonder why Unix/Linux still prefer 8-bit strings and tricks like UTF-8.

zx_gamer · Nov 24, 2024

6502 said:
UTF-8 is for Unicode transmission, not for processing in memory. If you load UTF-8 into memory, you have to convert it to Unicode (wchar_t) or to a specific 8-bit code page like KOI8-R.

Microsoft made the transition to Unicode about 30 years ago. I wonder why Unix/Linux still prefer 8-bit strings and tricks like UTF-8.

There is no solution. Either way, the same character can be encoded in several ways. You have to use large libraries to process UTF-8 correctly.

6502 · Nov 24, 2024

No need for large libraries. Encoding and decoding between UTF-8 and UTF-16 (16-bit Unicode) is something like 10-20 lines in C. What do you mean by that "because you have some variants how to encode same strings"?
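
For a sense of scale, here is a hand-written decoder and UTF-16 encoder, sketched in Python instead of C. The function names are hypothetical. The decoder assumes mostly well-formed input: it rejects bad lead and continuation bytes but does not check for overlong forms or surrogate code points, which a production decoder should do.

```python
def utf8_decode(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of Unicode code points."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx: ASCII
            cp, n = b, 0
        elif b >> 5 == 0b110:        # 110xxxxx: one continuation byte
            cp, n = b & 0x1F, 1
        elif b >> 4 == 0b1110:       # 1110xxxx: two continuation bytes
            cp, n = b & 0x0F, 2
        elif b >> 3 == 0b11110:      # 11110xxx: three continuation bytes
            cp, n = b & 0x07, 3
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + n >= len(data):
            raise ValueError("truncated sequence")
        for k in range(1, n + 1):
            c = data[i + k]
            if (c & 0xC0) != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + k}")
            cp = (cp << 6) | (c & 0x3F)
        out.append(cp)
        i += n + 1
    return out


def utf16_encode(code_points) -> list:
    """Encode code points as 16-bit units, using surrogate pairs above U+FFFF."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            units.append(cp)
        else:
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units


sample = "йA€😀"
cps = utf8_decode(sample.encode("utf-8"))
print([hex(cp) for cp in cps])
print([hex(u) for u in utf16_encode(cps)])
```

This converts between encodings only. It says nothing about whether two different code point sequences represent the same text.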

zx_gamer · Nov 24, 2024

6502 said:
No need for large libraries. Encoding and decoding between UTF-8 and UTF-16 (16-bit Unicode) is something like 10-20 lines in C. What do you mean by that "because you have some variants how to encode same strings"?

For example Russian character "й" may be encoded as U+0439 or as U+0438 U+0306
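
The two spellings of "й" can be compared directly. The sketch below uses Python's standard unicodedata module; same_text is a hypothetical helper that normalizes both sides to NFC before comparing.

```python
import unicodedata

precomposed = "\u0439"        # й as a single code point
decomposed = "\u0438\u0306"   # и followed by COMBINING BREVE

# A plain comparison treats them as different strings.
print(precomposed == decomposed)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(precomposed, decomposed))

# NFD goes the other way and splits the precomposed letter.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

The decomposed form cannot be written in KOI8-R at all, since U+0306 has no KOI8-R byte, so this ambiguity only arises once the text is Unicode.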

Kai Burghardt · Nov 24, 2024

zx_gamer said:
For example Russian character "й" may be encoded as U+0439 or as U+0438 U+0306

I recommend familiarizing yourself with ISO 10646 at least a little bit. Then you would know that a) these are called precomposed sequences, and b) precomposed sequences are provided for compatibility with preexisting standards. The “Unicode way” is to use combining diacritics.

zx_gamer said:
[…] I can't boot my system using classical BIOS (only UEFI), so I can't use syscons console driver. […]

If you’re so focused on having KOI8‑R, you can select sc(4) as your terminal driver. It is just a default (UEFI → vt(4), IBM PC BIOS → sc(4)), you are not forced to use vt(4). sc(4) is still part of a GENERIC build.

Bash:

cat >> /boot/loader.conf << 'EOT'
kern.vty=sc
EOT

6502 · Nov 24, 2024

zx_gamer said:
For example Russian character "й" may be encoded as U+0439 or as U+0438 U+0306

You can see all Cyrillic letters here. I don't know why you would combine "и" with other signs to create "й". Who will use special combinations to enter "й" when there is a separate key on the keyboard just for this letter?

https://www.unicode.org/charts/PDF/U0400.pdf

zx_gamer · Nov 25, 2024

6502 said:
You can see all Cyrillic letters here. I don't know why you would combine "и" with other signs to create "й". Who will use special combinations to enter "й" when there is a separate key on the keyboard just for this letter?

https://www.unicode.org/charts/PDF/U0400.pdf

That doesn't matter. If you write software, you MUST consider ANY user's behavior.

zx_gamer · Nov 25, 2024

Kai Burghardt said:
I recommend familiarizing yourself with ISO 10646 at least a little bit. Then you would know that a) these are called precomposed sequences, and b) precomposed sequences are provided for compatibility with preexisting standards. The “Unicode way” is to use combining diacritics.
If you’re so focused on having KOI8‑R, you can select sc(4) as your terminal driver. It is just a default (UEFI → vt(4), IBM PC BIOS → sc(4)), you are not forced to use vt(4). sc(4) is still part of a GENERIC build.

Bash:

cat >> /boot/loader.conf << 'EOT'
kern.vty=sc
EOT

But I can't boot with classical BIOS.

syscons(4)

man.freebsd.org

Note that the syscons driver is not compatible with systems booted via
UEFI(8). Forcing use of syscons on such systems will result in no
usable console.

6502 · Nov 25, 2024

zx_gamer said:
That doesn't matter. If you write software, you MUST consider ANY user's behavior.

I agree. But I got the impression that you want a quick, easy solution. You don't want libraries, etc. Your demand for KOI8-R sounds like a partial solution: what will happen if the user wants to type letters from another encoding (e.g. central European languages)? What about "you MUST consider ANY user's behavior"?

zx_gamer · Nov 25, 2024

6502 said:
I agree. But I got the impression that you want a quick, easy solution. You don't want libraries, etc. Your demand for KOI8-R sounds like a partial solution: what will happen if the user wants to type letters from another encoding (e.g. central European languages)? What about "you MUST consider ANY user's behavior"?

If the system does not support other codepages, that doesn't matter, because other characters are not acceptable anyway. If the system works with other codepages, it is not a problem: just use the national codepage, the way GNU gettext supports i18n with any codepage, including KOI8-R.