git.kernel.dk Git - linux-2.6-block.git/commit

author	Roman Žilka <roman.zilka@gmail.com>
	Tue, 9 Jan 2024 10:43:46 +0000 (11:43 +0100)
committer	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
	Sun, 28 Jan 2024 03:01:27 +0000 (19:01 -0800)
commit	c01e71b49c37e3b9c9652d28c42a88197a9d7f02
tree	89474329d97ad7d7d43990db3b0985ea435f8769
parent	e9e873eadced9389b819685a762c9892500f12d0

tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode

vc_translate_unicode() and vc_sanitize_unicode() parse input to the
UTF-8-enabled console, marking invalid byte sequences and producing Unicode
codepoints. The current algorithm follows an outdated version of Unicode:
it may accept invalid byte sequences, pass on non-existent codepoints and
reject valid sequences.
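
As an illustration of the byte-level rules, here is a minimal sketch of how
RFC 3629 maps a lead byte to a sequence length. It is not the code from this
patch: utf8_seq_len() is a hypothetical helper name, and the ranges are taken
from the RFC's statement that octets C0, C1 and F5..FF never appear in valid
UTF-8. Older UTF-8 definitions also allowed 5- and 6-byte forms, which this
table rejects.

```c
#include <stdint.h>

/*
 * Sequence length implied by a UTF-8 lead byte under RFC 3629.
 * Returns 0 for bytes that can never start a sequence.
 * Hypothetical helper for illustration, not a kernel function.
 */
static int utf8_seq_len(uint8_t lead)
{
	if (lead <= 0x7F)
		return 1;	/* ASCII */
	if (lead >= 0xC2 && lead <= 0xDF)
		return 2;	/* 0xC0 and 0xC1 could only start overlong forms */
	if (lead >= 0xE0 && lead <= 0xEF)
		return 3;
	if (lead >= 0xF0 && lead <= 0xF4)
		return 4;	/* 0xF5 and above would exceed U+10FFFF */
	return 0;		/* continuation byte or invalid lead byte */
}
```

A lead byte alone does not prove validity: continuation bytes, overlong
3- and 4-byte forms and the decoded value still have to be checked.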

The patch restores the functions' compliance with modern Unicode (v15.1 [1]
and many earlier versions) as well as RFC 3629 [2]:
1. Codepoint space is limited to 0x10FFFF.
2. "Noncharacters", such as U+FFFE and U+FFFF, are no longer invalid in
   Unicode and will be accepted. The alternative was to complete the set
   of noncharacters (formerly just those two, now more) and keep the
   rejection step. Unicode suggests this ([1] chap. 23.7) but does not
   require it. Since most codepoints are !iswprint() anyway, singling out
   just the noncharacters seemed arbitrary, futile and unnecessary.
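
The two points above can be sketched at the codepoint level as follows. This
is an illustrative sketch, not the patched kernel functions: all names here
(UNICODE_MAX, is_noncharacter(), is_surrogate(), codepoint_acceptable()) are
hypothetical. The surrogate check follows RFC 3629 and is not spelled out in
the list above. is_noncharacter() implements the full modern set only to show
what the rejected alternative would have had to test.

```c
#include <stdbool.h>
#include <stdint.h>

#define UNICODE_MAX 0x10FFFFu	/* upper limit of the codepoint space */

/* Full modern noncharacter set: U+FDD0..U+FDEF plus the last two
 * codepoints of each of the 17 planes (U+nFFFE and U+nFFFF). */
static bool is_noncharacter(uint32_t cp)
{
	if (cp >= 0xFDD0 && cp <= 0xFDEF)
		return true;
	return cp <= UNICODE_MAX && (cp & 0xFFFE) == 0xFFFE;
}

/* Surrogates are reserved for UTF-16; RFC 3629 forbids them in UTF-8. */
static bool is_surrogate(uint32_t cp)
{
	return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Policy described above: enforce the range limit, accept
 * noncharacters. is_noncharacter() is deliberately not consulted. */
static bool codepoint_acceptable(uint32_t cp)
{
	if (cp > UNICODE_MAX)
		return false;
	if (is_surrogate(cp))
		return false;
	return true;
}
```

Under this policy U+FFFE and U+FFFF pass, while 0x110000 does not.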

This is not a security patch. I'm not aware of any present security
implications of the old code.

[1] https://www.unicode.org/versions/Unicode15.1.0
[2] https://datatracker.ietf.org/doc/html/rfc3629

Signed-off-by: Roman Žilka <roman.zilka@gmail.com>
Link: https://lore.kernel.org/r/598ab459-6ba9-4a17-b4a1-08f26a356fc0@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>