
RE: [Syslog-sec] Syslog protocol - UTF-8 encoding



Robert,

thanks for all the good points. Just to clarify one thing that really
puzzles me: so you are saying that ASCII is actually NOT a subset of UTF-8,
because in Japanese the \ character has been replaced by the Yen sign,
so there are two different interpretations of this UTF-8 code?
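
The byte-level facts behind this question can be checked directly. The
sketch below assumes Python 3 and its built-in codecs. The choice of
U+00A5 (YEN SIGN) and U+FF3C (FULLWIDTH REVERSE SOLIDUS) as the look-alike
characters is an assumption made for illustration, since the thread does
not name specific code points.

```python
# Every 7-bit ASCII byte decodes to the same code point under ASCII and UTF-8.
ascii_bytes = bytes(range(128))
same_under_both = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# The ASCII backslash is code point 92 (0x5C) and stays a single byte in UTF-8.
backslash_utf8 = "\\".encode("utf-8")

# Assumed look-alikes: distinct code points, hence distinct UTF-8 byte sequences.
yen_utf8 = "\u00a5".encode("utf-8")        # YEN SIGN
fullwidth_utf8 = "\uff3c".encode("utf-8")  # FULLWIDTH REVERSE SOLIDUS

print(same_under_both, backslash_utf8.hex(), yen_utf8.hex(), fullwidth_utf8.hex())
```

At the encoding level ASCII remains a subset of UTF-8. The yen/backslash
confusion comes from how the byte value 0x5C is displayed or typed on
systems that assume the Japanese 7-bit character set, not from UTF-8.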

Rainer

> -----Original Message-----
> From: Robert Horn [mailto:robert.horn@agfa.com] 
> Sent: Friday, June 03, 2005 6:04 PM
> To: aokmians@cisco.com
> Cc: Alexander Clemm (alex); ietfdbh@comcast.net; Rainer Gerhards;
>     Steve Chang (schang99); syslog-sec@employees.org
> Subject: RE: [Syslog-sec] Syslog protocol - UTF-8 encoding
> 
> 
> A MUST in the header with SHOULD elsewhere would be sufficient, but I
> think that there is little risk in making it a MUST everywhere.  ISO made
> it into a MUST with the extensions to 10646-2.  The problem is an
> oversight in the UTF-8 specification.  It specifies how to take an m-bit
> character and break it down into 8-bit chunks.  It was assumed that
> people would always minimize the number of 8-bit chunks used, and this is
> the general practice.  So if I have a character with a 10-bit code point,
> it will get encoded as a 6-bit and a 4-bit chunk.  Then malicious
> programmers discovered that they could get programs to malfunction by
> using more chunks, e.g. encoding a 10-bit code point as two 4-bit chunks
> and a 2-bit chunk.  Sometimes this caused buffer overflows and sometimes
> it let them evade 8-bit oriented regular expression parsers.  These are
> legitimate UTF-8 encodings because the UTF-8 specification failed to
> require that minimal size encodings be used.  I am not aware of any
> reasonable UTF-8 encoder that does not generate minimal size encodings.
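
A minimal sketch of the non-minimal encoding problem described above,
assuming Python 3. The function names `naive_decode_2byte` and
`is_strict_utf8` are hypothetical, and the example uses "/" (U+002F)
padded into a two-byte form rather than the 10-bit case from the text.
The naive decoder stands in for any parser that applies the bit layout
without checking for minimal length; Python's built-in "utf-8" codec is
used as an example of a decoder that does check.

```python
def naive_decode_2byte(seq):
    """Apply the 2-byte UTF-8 bit layout with NO minimal-length check."""
    return chr(((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F))

def is_strict_utf8(data):
    """True if Python's built-in decoder accepts the bytes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

minimal_slash = b"/"                    # the one-byte, minimal form
overlong_slash = bytes([0xC0, 0xAF])    # same code point spread over two bytes

# The naive decoder recovers "/" from bytes that contain no 0x2F at all,
# which is how a byte-oriented filter looking for "/" can be bypassed.
print(naive_decode_2byte(overlong_slash), 0x2F in overlong_slash)
print(is_strict_utf8(minimal_slash), is_strict_utf8(overlong_slash))
```

An incoming-message filter could use a check like `is_strict_utf8` to
reject such input before any pattern matching is applied.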
> 
> I didn't have the text with me while traveling, hence the uncertainty
> over the "space".  Specifying the ASCII code value is sufficient.  We
> probably should note to readers that the code value used for backslash in
> ASCII is used for the yen symbol in Japan, and that they should be
> prepared for user interface confusion.  It is inevitable that there will
> be people who use the Japanese backslash character (a valid UTF-8
> character) instead of the correct ASCII code value because they are
> matching what they see on the screen with what they see on the keyboard.
> We should alert them to the problem.  (Or we could pick another
> character, but most of the good characters have already been used for
> other purposes.)
> 
> R Horn
> 
> 
> From:    "Anton Okmianski (aokmians)" <aokmians@cisco.com>
> Date:    06/02/2005 03:53 PM
> To:      Robert Horn/WIL/AGFA/US/BAYER@AGFA, <rgerhards@hq.adiscon.com>
> cc:      "Alexander Clemm (alex)" <alex@cisco.com>, <ietfdbh@comcast.net>,
>          "Steve Chang (schang99)" <schang99@cisco.com>,
>          <syslog-sec@employees.org>
> Subject: RE: [Syslog-sec] Syslog protocol - UTF-8 encoding
> 
> Robert:
> 
> > Potential confusions:
> >
> >   1) Saying UTF-8 is insufficient.  To really cover all the
> > bases (especially from a security  and string parsing
> > perspective) you need to
> > say:
> >
> > "Unicode characters encoded in UTF-8 using the minimal
> > encoding."  UTF-8 permits a variety of encodings for the same
> > character, but only one is the minimal encoding.
> 
> Are you suggesting we make minimum encoding a MUST or a SHOULD?
> Everywhere?
> 
> I am fine with a SHOULD everywhere and maybe making it a MUST for certain
> parts of the HEADER, like the space separator.  However, I think before we
> require minimal encoding in PARAM-VALUE and MSG, we should explore the
> reasons why UTF-8 allows for different encodings.  There may be a good
> reason for it.  We need to have a good reason to re-define the use of the
> standard for parts of the message which may be received by a library from
> third-party applications.  My concern is that some perfectly legitimate
> UTF-8 code in the field may not do minimum encoding.  Then we are making
> syslog protocol adoption more difficult by requiring it.
> 
> > For more
> > info you can also reference the most recent ISO
> > 10646-1 and 10646-2 (with extensions).  With minimal
> > encodings you eliminate some potential buffer overflows and
> > you simplify the use of regular expression matching.  It is
> > easy enough for an incoming message filter to detect and
> > recode UTF-8 into minimal encoding, but you need to say this
> > in the specification to inform people that they need the
> > filter on the incoming side and that the emitters of messages
> > should use the minimal form.
> >
> >   2) There are multiple blank space characters defined in
> > Unicode.  These are typographically different.  There is only
> > one that corresponds to the ASCII blank character and  its
> > minimal encoding using UTF-8 is intentionally identical to
> > the encoding of the ASCII blank character.  The confusion may
> > be resolved by identifying this Unicode code point by number
> > rather than just saying "blank".
> 
> I could not find the word "blank" anywhere in the latest draft. The
> encoding defines the space explicitly as:
> 
> SP = %d32
> 
> Do you think we need to specify more?
> 
> Does UTF-8 allow more than one encoding for the basic ASCII character
> subset, or only for characters with larger Unicode code points?
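
To make the "multiple space characters" point concrete, here is a small
sketch assuming Python 3. The particular set of space-like code points
and the sample header string are illustrative assumptions, not taken from
the draft. It shows that only U+0020 encodes to the single byte 32 that
`SP = %d32` names, so a byte-level split on SP ignores the others.

```python
import unicodedata

SP = bytes([32])  # SP = %d32, as defined in the draft's grammar

# A few space-like code points (an illustrative, non-exhaustive selection).
candidates = ["\u0020", "\u00a0", "\u2003", "\u3000"]
space_encodings = {unicodedata.name(ch): ch.encode("utf-8") for ch in candidates}

for name, encoded in space_encodings.items():
    print(f"{name:20} {encoded.hex()}")

# Hypothetical header fragment containing a NO-BREAK SPACE between
# "host" and "name", and real SP characters elsewhere.
line = "host\u00a0name app msg".encode("utf-8")
fields = line.split(SP)
print(fields)
```

Because the grammar names the code value rather than the glyph, a parser
that splits on byte 32 treats the no-break space as ordinary field content.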
> 
> >   3) Not mentioned originally, but also a potential problem,
> > are the other homotype and semi-homotype characters.  For
> > example, there are multiple backslash characters.  In fact
> > there are three of them in common use, one the ASCII
> > character (whose minimal UTF-8 encoding matches the ASCII
> > character) and two that are used in Japanese.  These are
> > pseudo-homotype characters in that a close examination will
> > reveal that in a high precision font they are all different
> > in size and slope.  But in many situations they look the same.
> >
> > More importantly from the perspective of regular use, the
> > ASCII backslash character was replaced in the Japanese 7-bit
> > Latin character set by the Yen symbol.  So the Japanese will
> > have significant problems regarding use of backslash.  Even
> > if you specify the use of the proper Unicode character set,
> > encoded using minimal size UTF-8,  all the backslashes will
> > be presented to Japanese users as Yen symbols on most
> > systems.  These systems make the assumption that what they
> > are seeing is the older modified 7-bit ASCII that is standard in
> > Japan.  This is almost always the correct assumption.
> >
> > There is no simple solution to the backslash problem.  The
> > backslash should not be given any special meaning in any
> > protocol.  The various default workarounds for conflicts
> > between the older and newer systems introduce a lot of
> > confusion around this character.  If it has special meaning
> > to computers there will always be confusion and problems.  If
> > you leave it an ordinary non-special character the humans who
> > read the message usually have enough context to decide
> > whether the character is intended to mean yen or backslash
> > and will know from their application context how to interpret
> > the text.
> >
> > If you have messages that must be composed by people and must
> > contain backslashes you have an even worse problem.  They
> > have a backslash character on the keyboard, but it will
> > generate the Japanese backslashes, not the ASCII backslash.
> > This effectively guarantees problems with entering backslash
> > in Japan because people will forget that they need to do
> > something special and will just use the keyboard.
> 
> Will this issue be addressed if, instead of referring to "\" when we talk
> about escaping it in PARAM-VALUE and using it as an escape sequence, we
> were to specifically refer to ASCII character %d92?
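
One way to picture the %d92 proposal is an escaper that works on code
values rather than glyphs. This is only a sketch, assuming Python 3. The
function name `escape_param_value` is hypothetical, and treating %d34
(the double quote) as a second character that needs escaping is an
assumption for illustration; the thread itself only discusses the
backslash.

```python
BACKSLASH = 92  # %d92, the ASCII backslash used as the escape character
DQUOTE = 34     # %d34, assumed here to be another character needing escape

def escape_param_value(text, special=(BACKSLASH, DQUOTE)):
    """Prefix each special byte with %d92 in the UTF-8 encoding of text.

    Scanning bytes is safe because, in UTF-8, byte values below 0x80 never
    occur inside a multi-byte sequence.
    """
    out = bytearray()
    for byte in text.encode("utf-8"):
        if byte in special:
            out.append(BACKSLASH)
        out.append(byte)
    return bytes(out)

print(escape_param_value("C:\\temp"))   # contains a real %d92
print(escape_param_value("\u00a5100"))  # YEN SIGN: not %d92, left alone
```

Only the code value 92 is doubled; look-alike characters such as the yen
sign or the fullwidth reverse solidus pass through as ordinary data.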
> 
> Thanks,
> Anton.
> 
> 
> 
> 
_______________________________________________
Syslog-sec mailing list
Syslog-sec@www.employees.org
http://www.employees.org/mailman/listinfo/syslog-sec