Backspace doesn't remove UTF-8 multibyte characters on the console

On FreeBSD 12.1, backspace in the terminal deletes only one byte of a multibyte character.
For example:

Code:
# cat > /tmp/test
tы<backspace>t
# cat /tmp/test
t�t
# hexdump -C /tmp/test
74 d1 74 0a

Here 'ы' is a Cyrillic letter (U+044B, encoded in UTF-8 as the two bytes D1 8B). As you can see, backspace deleted only the second byte of the multibyte character.
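
The bytes in the transcript above can be checked with a short Python sketch. It only illustrates what the hexdump shows and is not anything the tty driver itself runs; the variable names are made up for the example.

```python
# Illustration only: reproduce the bytes from the transcript above.
typed = "tы".encode("utf-8")          # what was typed before backspace
assert typed == b"t\xd1\x8b"          # 'ы' (U+044B) is the two bytes D1 8B

# A byte-wise erase drops only the last byte (8B), leaving a dangling D1.
after_backspace = typed[:-1]

# Then 't' and Enter are typed, giving the file contents seen in hexdump.
file_contents = after_backspace + b"t\n"
print(file_contents.hex())            # 74d1740a
```

The lone D1 is an incomplete UTF-8 sequence, which is why cat shows a replacement character between the two t's.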
 
Thank you for the answer!
Code:
# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

This behavior occurs on the vt(4) console and also under xterm(1).

I know that on Linux this is solved by the iutf8 terminal flag, but I don't know of a solution for FreeBSD.
 
Interesting. Yes, I see the problem both when ssh'ing in from Windows using PuTTY and when using xterm locally. I don't have Linux installed anywhere, but I can't reproduce it in iTerm2 on my MBP. Something to look into.
 
The IUTF8 terminal input flag is arguably a hack, but it does solve the problem, at least for UTF-8. Unfortunately, that still leaves other multibyte encodings such as GB18030, EUC-KR, and Shift JIS, which suffer from the same trouble (try U+6F22 U+8A9E 漢語, which is made up of two 3-byte UTF-8 sequences: E6 BC A2 and E8 AA 9E).
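
To make the limitation concrete, here is a minimal Python sketch of the kind of rule an IUTF8-style erase can apply: skip back over continuation bytes (bit pattern 10xxxxxx), then drop the lead byte. It is an assumption about the general idea, not the actual kernel code of any system, and `erase_last_utf8_char` is a hypothetical name.

```python
def erase_last_utf8_char(buf: bytes) -> bytes:
    """Remove one UTF-8 character from the end of buf (hypothetical helper)."""
    end = len(buf)
    # Skip backwards over continuation bytes, which look like 10xxxxxx.
    while end > 0 and (buf[end - 1] & 0xC0) == 0x80:
        end -= 1
    # Drop the lead byte (or the single ASCII byte) as well.
    if end > 0:
        end -= 1
    return buf[:end]


utf8 = "漢語".encode("utf-8")            # E6 BC A2 E8 AA 9E
print(erase_last_utf8_char(utf8) == "漢".encode("utf-8"))      # True

# The same rule applied to Shift JIS input removes the wrong number of bytes,
# because Shift JIS trail bytes are not restricted to the 10xxxxxx pattern.
sjis = "漢語".encode("shift_jis")
print(erase_last_utf8_char(sjis) == "漢".encode("shift_jis"))  # False
```

The rule works only because UTF-8 continuation bytes are recognizable on their own. Other multibyte encodings need the erase logic to know the encoding, which a single flag like IUTF8 does not provide.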

It's a chicken-and-egg problem really: the terminal is the one buffering the data before handing it off to read(2) (called by functions like getwchar(3) and fgets(3)), yet the terminal can't delete the entire character sequence because it can't possibly know what you're doing with it. Things like the IUTF8 hack are perhaps a step in the right direction, but multibyte encodings are simply not an easy thing to deal with for terminals, especially when you consider the many control sequences, termios(4) support, and such to contend with.

It's not that it can't be done; it's just a daunting amount of work that very few people, if any, have put in. xterm currently uses luit(1) to convert characters between UTF-8 and the native encoding, but it still isn't a perfect solution.
 