RE: [Syslog-sec] Syslog protocol - UTF-8 encoding
Robert:
> Potential confusions:
>
> 1) Saying UTF-8 is insufficient. To really cover all the
> bases (especially from a security and string parsing
> perspective) you need to say:
>
> "Unicode characters encoded in UTF-8 using the minimal
> encoding." UTF-8 permits a variety of encodings for the same
> character, but only one is the minimal encoding.
Are you suggesting we make minimal encoding a MUST or a SHOULD? Everywhere?
I am fine with a SHOULD everywhere, and perhaps a MUST for certain parts of the HEADER, such as the space separator. However, before we require minimal encoding in PARAM-VALUE and MSG, I think we should explore why UTF-8 allows different encodings; there may be a good reason for it. We need a good reason to redefine the use of the standard for parts of the message that a library may receive from third-party applications. My concern is that some perfectly legitimate UTF-8 code in the field may not produce minimal encodings, in which case requiring it would make adoption of the syslog protocol more difficult.
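To make the MUST-versus-SHOULD question more concrete, a receiver-side check for non-minimal sequences could be sketched roughly as follows. This is only an illustration in Python: the function name find_overlong is hypothetical and does not come from any draft, and the sketch deliberately skips over other kinds of malformed input rather than reporting them.

```python
def find_overlong(data: bytes) -> list:
    """Return byte offsets where a non-minimal (overlong) UTF-8 sequence starts.

    Hypothetical helper for illustration; other malformed input is skipped.
    """
    offsets = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        # Pick sequence length, payload bits of the lead byte, and the
        # smallest code point that legitimately needs that many bytes.
        if b < 0x80:
            length, cp, minimum = 1, b, 0x00
        elif 0xC0 <= b <= 0xDF:
            length, cp, minimum = 2, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            length, cp, minimum = 3, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF7:
            length, cp, minimum = 4, b & 0x07, 0x10000
        else:
            i += 1  # stray continuation byte or invalid lead byte
            continue
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any((t & 0xC0) != 0x80 for t in tail):
            i += 1  # truncated or broken sequence; not an overlong form
            continue
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        if cp < minimum:
            offsets.append(i)  # value would have fit in fewer bytes
        i += length
    return offsets


# 0xC0 0xA0 is a two-byte padding of U+0020 (space).
print(find_overlong(b"a\xc0\xa0b"))
print(find_overlong("caf\u00e9 \u20ac".encode("utf-8")))
```

Whether a receiver should then reject, recode, or merely flag such a message is exactly the policy question above; the sketch only shows that detection itself is inexpensive.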
> For more
> info you can also reference the most recent ISO
> 10646-1 and 10646-2 (with extensions). With minimal
> encodings you eliminate some potential buffer overflows and
> you simplify the use of regular expression matching. It is
> easy enough for an incoming message filter to detect and
> recode UTF-8 into minimal encoding, but you need to say this
> in the specification to inform people that they need the
> filter on the incoming side and that the emitters of messages
> should use the minimal form.
>
> 2) There are multiple blank space characters defined in
> Unicode. These are typographically different. There is only
> one that corresponds to the ASCII blank character and its
> minimal encoding using UTF-8 is intentionally identical to
> the encoding of the ASCII blank character. The confusion may
> be resolved by identifying this Unicode code point by number
> rather than just saying "blank".
I could not find the word "blank" anywhere in the latest draft. The encoding defines the space explicitly as:
SP = %d32
Do you think we need to specify more?
Does UTF-8 allow more than one encoding for the basic ASCII character subset, or only for characters with larger Unicode code points?
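As a rough illustration of that question: going by the bit layout alone, a multi-byte pattern can be padded to carry any small code point, SP included, but RFC 3629 treats such overlong forms as invalid and strict decoders reject them. The Python sketch below assumes Python's built-in "utf-8" codec as the strict decoder; the helper name strict_decode_ok and the particular list of space-like characters are chosen here for illustration only.

```python
# Bit arithmetic a lenient decoder would apply to the two-byte sequence C0 A0.
lead, cont = 0xC0, 0xA0
lenient_value = ((lead & 0x1F) << 6) | (cont & 0x3F)  # same value as SP


def strict_decode_ok(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Overlong two-byte and three-byte paddings of U+0020.
overlong_two_ok = strict_decode_ok(b"\xc0\xa0")
overlong_three_ok = strict_decode_ok(b"\xe0\x80\xa0")

# A few of the typographically different Unicode spaces (illustrative list).
SPACE_LIKE = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "EM SPACE": "\u2003",
    "IDEOGRAPHIC SPACE": "\u3000",
}
encoded = {name: ch.encode("utf-8") for name, ch in SPACE_LIKE.items()}

print(hex(lenient_value), overlong_two_ok, overlong_three_ok)
for name, raw in encoded.items():
    print(name, raw.hex())
```

Only U+0020 encodes to the single octet %d32; the other space characters are multi-byte sequences that never contain that octet, so a parser that splits on %d32 will not be confused by them as long as overlong forms are rejected.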
> 3) Not mentioned originally, but also a potential problem,
> are the other homotype and semi-homotype characters. For
> example, there are multiple backslash characters. In fact
> there are three of them in common use, one the ASCII
> character (whose minimal UTF-8 encoding matches the ASCII
> character) and two that are used in Japanese. These are
> pseudo-homotype characters in that a close examination will
> reveal that in a high precision font they are all different
> in size and slope. But in many situations they look the same.
>
> More importantly from the perspective of regular use, the
> ASCII backslash character was replaced in the Japanese 7-bit
> Latin characterset by the Yen symbol. So the Japanese will
> have significant problems regarding use of backslash. Even
> if you specify the use of the proper Unicode character set,
> encoded using minimal size UTF-8, all the backslashes will
> be presented to Japanese users as Yen symbols on most
> systems. These systems make the assumption that what they
> are seeing is the older modified 7-bit ASCII that is standard
> in Japan. This is almost always the correct assumption.
>
> There is no simple solution to the backslash problem. The
> backslash should not be given any special meaning in any
> protocol. The various default workarounds for conflicts
> between the older and newer systems introduce a lot of
> confusion around this character. If it has special meaning
> to computers there will always be confusion and problems. If
> you leave it an ordinary non-special character the humans who
> read the message usually have enough context to decide
> whether the character is intended to mean yen or backslash
> and will know from their application context how to interpret
> the text.
>
> If you have messages that must be composed by people and must
> contain backslashes you have an even worse problem. They
> have a backslash character on the keyboard, but it will
> generate the Japanese backslashes, not the ASCII backslash.
> This effectively guarantees problems with entering backslash
> in Japan because people will forget that they need to do
> something special and will just use the keyboard.
Would this issue be addressed if, instead of referring to "\" when we talk about escaping it in PARAM-VALUE and using it as the escape character, we referred specifically to ASCII character %d92?
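A small Python sketch of what pinning the escape character to %d92 would mean at the octet level. The helper name escape_positions and the sample string are hypothetical; the sketch relies only on the fact that every octet of a multi-byte UTF-8 sequence is 0x80 or above, so octet 92 can appear only as U+005C in well-formed input.

```python
# Characters that can look like a backslash, or be displayed in its place.
LOOKALIKES = {
    "REVERSE SOLIDUS (ASCII %d92)": "\u005c",
    "FULLWIDTH REVERSE SOLIDUS": "\uff3c",
    "YEN SIGN": "\u00a5",
}
utf8_forms = {name: ch.encode("utf-8") for name, ch in LOOKALIKES.items()}


def escape_positions(raw: bytes) -> list:
    """Offsets of octet 92 in a UTF-8 byte string (hypothetical helper)."""
    return [i for i, b in enumerate(raw) if b == 92]


# One ASCII backslash, one fullwidth reverse solidus, one yen sign.
sample = "a\\b\uff3cc\u00a5d".encode("utf-8")

for name, raw in utf8_forms.items():
    print(name, raw.hex())
print(escape_positions(sample))
```

This removes the ambiguity for parsers only. It does not change how octet 92 is rendered on systems that display it as a Yen symbol, nor what a Japanese keyboard produces when a person types the message.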
Thanks,
Anton.