RE: [Syslog-sec] Syslog protocol - UTF-8 encoding
Robert,
thanks for all the good points. Just to clarify one thing that really
puzzles me: so you are saying that ASCII is actually NOT a subset of
UTF-8, because in Japanese the \ character has been replaced by the Yen
sign, so there are two different interpretations of this UTF-8 code?
Rainer
> -----Original Message-----
> From: Robert Horn [mailto:robert.horn@agfa.com]
> Sent: Friday, June 03, 2005 6:04 PM
> To: aokmians@cisco.com
> Cc: Alexander Clemm (alex); ietfdbh@comcast.net; Rainer
> Gerhards; Steve Chang (schang99); syslog-sec@employees.org
> Subject: RE: [Syslog-sec] Syslog protocol - UTF-8 encoding
>
>
> A MUST in the header with SHOULD elsewhere would be sufficient, but I
> think that there is little risk in making it a MUST everywhere. ISO
> made it into a MUST with the extensions to 10646-2. The problem is an
> oversight in the UTF-8 specification. It specifies how to take an
> m-bit character and break it down into 8-bit chunks. It was assumed
> that people would always minimize the number of 8-bit chunks used, and
> this is the general practice. So if I have a character with a 10-bit
> code point, it will get encoded as a 6-bit and a 4-bit chunk. Then
> malicious programmers discovered that they could get programs to
> malfunction by using more chunks, e.g. encoding a 10-bit code point as
> two 4-bit chunks and a 2-bit chunk. Sometimes this caused buffer
> overflows and sometimes it let them evade 8-bit oriented regular
> expression parsers. These are legitimate UTF-8 encodings because the
> UTF-8 specification failed to require minimal size encodings be used.
> I am not aware of any reasonable UTF-8 encoder that does not generate
> minimal size encodings.
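
The non-minimal ("overlong") encoding issue described above can be illustrated with a short Python sketch. It uses the actual UTF-8 bit layout, in which a two-byte sequence carries 5 + 6 payload bits, so the chunk sizes differ from the illustrative ones in the paragraph above. The function names are hypothetical: naive_decode_2byte mimics a lenient decoder that skips the minimal-length check, and is_strict_utf8 relies on Python's built-in decoder, which rejects overlong forms as RFC 3629 requires.

```python
def naive_decode_2byte(data: bytes) -> str:
    """Hypothetical lenient decoder: unpacks a 2-byte UTF-8 sequence
    without checking that the shortest possible form was used."""
    b0, b1 = data
    if (b0 & 0xE0) != 0xC0 or (b1 & 0xC0) != 0x80:
        raise ValueError("not a 2-byte UTF-8 sequence")
    # 5 payload bits from the lead byte, 6 from the continuation byte.
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


def is_strict_utf8(data: bytes) -> bool:
    """True if Python's strict decoder accepts the bytes; it rejects
    overlong (non-minimal) sequences."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


minimal = "/".encode("utf-8")   # one byte, 0x2F
overlong = b"\xc0\xaf"          # the same code point padded into two bytes

print(naive_decode_2byte(overlong))   # the lenient decoder recovers "/"
print(b"/" in overlong)               # a byte-level search does not see it
print(is_strict_utf8(minimal), is_strict_utf8(overlong))
```

A receiver that validates with a strict decoder before applying byte-oriented pattern matching would not be exposed to the padded form; whether to reject or recode such input is a design choice for the specification.
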
>
> I didn't have the text with me while traveling, hence the uncertainty
> over the "space". Specifying the ASCII code value is sufficient. We
> probably should note to readers that the code value used for backslash
> in ASCII is used for the yen symbol in Japan, and that they should be
> prepared for user interface confusion. It is inevitable that there
> will be people who use the Japanese backslash character (a valid UTF-8
> character) instead of the correct ASCII code value because they are
> matching what they see on the screen with what they see on the
> keyboard. We should alert them to the problem. (Or we could pick
> another character, but most of the good characters have already been
> used for other purposes.)
>
> R Horn
>
> From: "Anton Okmianski (aokmians)" <aokmians@cisco.com>
> Date: 06/02/2005 03:53 PM
> To: Robert Horn/WIL/AGFA/US/BAYER@AGFA, <rgerhards@hq.adiscon.com>
> cc: "Alexander Clemm (alex)" <alex@cisco.com>, <ietfdbh@comcast.net>,
>     "Steve Chang (schang99)" <schang99@cisco.com>,
>     <syslog-sec@employees.org>
> Subject: RE: [Syslog-sec] Syslog protocol - UTF-8 encoding
>
> Robert:
>
> > Potential confusions:
> >
> > 1) Saying UTF-8 is insufficient. To really cover all the
> > bases (especially from a security and string parsing
> > perspective) you need to
> > say:
> >
> > "Unicode characters encoded in UTF-8 using the minimal
> > encoding." UTF-8 permits a variety of encodings for the same
> > character, but only one is the minimal encoding.
>
> Are you suggesting we make minimum encoding a MUST or a SHOULD?
> Everywhere?
>
> I am fine with a SHOULD everywhere and maybe making it a MUST for
> certain parts of the HEADER, like the space separator. However, I
> think before we require minimal encoding in PARAM-VALUE and MSG, we
> should explore the reasons why UTF-8 allows for different encodings.
> There may be a good reason for it. We need to have a good reason to
> re-define the use of the standard for parts of the message which may
> be received by a library from third-party applications. My concern is
> that some perfectly legitimate UTF-8 code in the field may not do
> minimum encoding. Then, we are making syslog protocol adoption more
> difficult by requiring it.
>
> > For more
> > info you can also reference the most recent ISO
> > 10646-1 and 10646-2 (with extensions). With minimal
> > encodings you eliminate some potential buffer overflows and
> > you simplify the use of regular expression matching. It is
> > easy enough for an incoming message filter to detect and
> > recode UTF-8 into minimal encoding, but you need to say this
> > in the specification to inform people that they need the
> > filter on the incoming side and that the emitters of messages
> > should use the minimal form.
> >
> > 2) There are multiple blank space characters defined in
> > Unicode. These are typographically different. There is only
> > one that corresponds to the ASCII blank character and its
> > minimal encoding using UTF-8 is intentionally identical to
> > the encoding of the ASCII blank character. The confusion may
> > be resolved by identifying this Unicode code point by number
> > rather than just saying "blank".
>
> I could not find the word "blank" anywhere in the latest draft. The
> encoding defines the space explicitly as:
>
> SP = %d32
>
> Do you think we need to specify more?
>
> Does UTF-8 allow more than one encoding for the basic ASCII character
> subset or only for characters with larger Unicode code points?
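
As a rough illustration of the two points above (several Unicode space characters, and the encoding of the ASCII range), the following Python sketch prints the UTF-8 bytes of a few space characters and checks that a standard encoder maps every ASCII code point to the identical single byte. The list of spaces is a small sample chosen for illustration, and the helper names are hypothetical. Under strict UTF-8 decoding, that single byte is the only accepted form of an ASCII character; a longer form would be an overlong sequence.

```python
import unicodedata

# A sample of Unicode space characters; only U+0020 is the ASCII SP (%d32).
SPACES = ["\u0020", "\u00a0", "\u2003", "\u3000"]


def describe(ch: str) -> str:
    """Return the code point, Unicode name, and UTF-8 bytes of one character."""
    encoded = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}: {encoded}"


def ascii_has_single_byte_encoding() -> bool:
    """Check that each ASCII code point encodes to the same single byte."""
    return all(chr(cp).encode("utf-8") == bytes([cp]) for cp in range(128))


for ch in SPACES:
    print(describe(ch))

print(ascii_has_single_byte_encoding())
```
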
>
> > 3) Not mentioned originally, but also a potential problem,
> > are the other homotype and semi-homotype characters. For
> > example, there are multiple backslash characters. In fact
> > there are three of them in common use, one the ASCII
> > character (whose minimal UTF-8 encoding matches the ASCII
> > character) and two that are used in Japanese. These are
> > pseudo-homotype characters in that a close examination will
> > reveal that in a high precision font they are all different
> > in size and slope. But in many situations they look the same.
> >
> > More importantly from the perspective of regular use, the
> > ASCII backslash character was replaced in the Japanese 7-bit
> > Latin character set by the Yen symbol. So the Japanese will
> > have significant problems regarding use of backslash. Even
> > if you specify the use of the proper Unicode character set,
> > encoded using minimal size UTF-8, all the backslashes will
> > be presented to Japanese users as Yen symbols on most
> > systems. These systems make the assumption that what they
> > are seeing is the older modified 7-bit
> > ASCII that is standard in Japan. This is almost always the correct
> > assumption.
> >
> > There is no simple solution to the backslash problem. The
> > backslash should not be given any special meaning in any
> > protocol. The various default workarounds for conflicts
> > between the older and newer systems introduce a lot of
> > confusion around this character. If it has special meaning
> > to computers there will always be confusion and problems. If
> > you leave it an ordinary non-special character the humans who
> > read the message usually have enough context to decide
> > whether the character is intended to mean yen or backslash
> > and will know from their application context how to interpret
> > the text.
> >
> > If you have messages that must be composed by people and must
> > contain backslashes you have an even worse problem. They
> > have a backslash character on the keyboard, but it will
> > generate the Japanese backslashes, not the ASCII backslash.
> > This effectively guarantees problems with entering backslash
> > in Japan because people will forget that they need to do
> > something special and will just use the keyboard.
>
> Will this issue be addressed if, instead of referring to "\" when we
> talk about escaping it in PARAM-VALUE and using it as an escape
> sequence, we were to specifically refer to ASCII character %d92?
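
To make the %d92 suggestion concrete, here is a small Python sketch comparing the UTF-8 bytes of the ASCII backslash with two characters that can be mistaken for it. The two lookalikes are chosen here for illustration only; the message above does not name specific code points. The names ESCAPE_BYTE and contains_escape_byte are hypothetical. Because every byte of a multi-byte UTF-8 sequence is 0x80 or higher, a byte-level test for 92 matches only the ASCII backslash.

```python
ESCAPE_BYTE = 92  # %d92, the ASCII backslash (U+005C)

# Illustrative sample: the ASCII backslash and two possible lookalikes.
LOOKALIKES = {
    "REVERSE SOLIDUS (ASCII backslash)": "\u005c",
    "YEN SIGN": "\u00a5",
    "FULLWIDTH REVERSE SOLIDUS": "\uff3c",
}


def utf8_bytes(ch: str) -> list[int]:
    """Return the UTF-8 encoding of a character as a list of byte values."""
    return list(ch.encode("utf-8"))


def contains_escape_byte(value: str) -> bool:
    """True only if the UTF-8 form of value contains the byte %d92."""
    return ESCAPE_BYTE in value.encode("utf-8")


for name, ch in LOOKALIKES.items():
    print(name, utf8_bytes(ch))

print(contains_escape_byte("a\\b"))             # ASCII backslash present
print(contains_escape_byte("a\u00a5b\uff3c"))   # only lookalikes present
```

Identifying the escape character by code value removes the ambiguity on the wire, although it does not change what a user's keyboard or display produces.
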
>
> Thanks,
> Anton.
>
>
>
>
_______________________________________________
Syslog-sec mailing list
Syslog-sec@www.employees.org
http://www.employees.org/mailman/listinfo/syslog-sec