
RE: [Syslog-sec] Syslog protocol - UTF-8 encoding



Perhaps the question is subtler.  I'm traveling at the moment and have
limited access to the drafts so I can't refer to specific paragraphs.

Potential confusions:

  1) Saying UTF-8 is insufficient.  To really cover all the bases
(especially from a security and string parsing perspective) you need to
say:

"Unicode characters encoded in UTF-8 using the minimal encoding."  The
UTF-8 bit layout permits a variety of encodings for the same character, but
only one is the minimal (shortest) encoding.  For more info you can also
reference the most recent ISO 10646-1 and 10646-2 (with extensions).  With
minimal encodings you eliminate some potential buffer overflows and you
simplify the use of regular expression matching, because each character
then has exactly one byte sequence to match.  It is easy enough for an
incoming message filter to detect non-minimal UTF-8 and recode it into the
minimal encoding, but you need to say this in the specification to inform
people that they need the filter on the incoming side and that the emitters
of messages should use the minimal form.

  2) There are multiple blank space characters defined in Unicode.  These
are typographically different.  There is only one that corresponds to the
ASCII blank character (U+0020, SPACE), and its minimal encoding using UTF-8
is intentionally identical to the encoding of the ASCII blank character.
The confusion may be resolved by identifying this Unicode code point by
number rather than just saying "blank".
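
To make the distinction concrete, the following Python sketch lists a few
space characters with their code points and minimal UTF-8 bytes.  The
selection of non-ASCII spaces and the helper names describe and
is_ascii_blank are illustrative assumptions, not part of any draft.

```python
import unicodedata

ASCII_BLANK = "\u0020"  # the only space that matches the ASCII blank

# A few of the other Unicode spaces, chosen only as examples.
OTHER_SPACES = ["\u00a0", "\u2003", "\u3000"]


def describe(ch: str) -> str:
    """Return the code point, Unicode name, and minimal UTF-8 bytes of ch."""
    utf8 = ch.encode("utf-8").hex(" ").upper()
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {utf8}"


def is_ascii_blank(ch: str) -> bool:
    """Compare by code point number rather than by appearance."""
    return ord(ch) == 0x20


if __name__ == "__main__":
    for ch in [ASCII_BLANK] + OTHER_SPACES:
        print(describe(ch))
```

Only U+0020 encodes to the single byte 20; the others need two or three
bytes, so a parser that splits on the byte 0x20 will not treat them as
separators.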

  3) Not mentioned originally, but also a potential problem, are the other
homotype and semi-homotype (look-alike) characters.  For example, there are
multiple backslash characters.  In fact there are three of them in common
use: one is the ASCII character (whose minimal UTF-8 encoding matches the
ASCII character) and two are used in Japanese.  These are pseudo-homotype
characters in that a close examination will reveal that in a high precision
font they all differ in size and slope.  But in many situations they look
the same.

More importantly from the perspective of regular use, the ASCII backslash
character was replaced in the Japanese 7-bit Latin character set (JIS X
0201) by the Yen symbol.  So Japanese users will have significant problems
with any use of backslash.  Even if you specify the use of the proper
Unicode character set, encoded using minimal size UTF-8, all the
backslashes will be presented to Japanese users as Yen symbols on most
systems.  These systems make the assumption that what they are seeing is
the older modified 7-bit ASCII that is standard in Japan.  This is almost
always the correct assumption.
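
The following Python sketch shows how the ASCII backslash differs at the
byte level from characters that can be confused with it.  The text above
does not name the Japanese backslash characters, so the use of U+FF3C
(FULLWIDTH REVERSE SOLIDUS) here is an assumption; U+00A5 (YEN SIGN) is
included because of the display issue just described.  The helper name
find_backslash_lookalikes is hypothetical.

```python
import unicodedata

ASCII_BACKSLASH = "\u005c"

# Assumed set of characters that may be confused with the ASCII backslash.
LOOKALIKES = {"\uff3c", "\u00a5"}


def find_backslash_lookalikes(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each look-alike that is not U+005C."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in LOOKALIKES
    ]


if __name__ == "__main__":
    for ch in [ASCII_BACKSLASH, *sorted(LOOKALIKES)]:
        utf8 = ch.encode("utf-8").hex(" ").upper()
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {utf8}")
```

A receiver that compares bytes will treat these as three unrelated
characters, even though a person looking at the rendered text may not be
able to tell them apart.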

There is no simple solution to the backslash problem.  The backslash should
not be given any special meaning in any protocol.  The various default
workarounds for conflicts between the older and newer systems introduce a
lot of confusion around this character.  If it has special meaning to
computers there will always be confusion and problems.  If you leave it as
an ordinary non-special character, the humans who read the message usually
have enough context to decide whether the character is intended to mean yen
or backslash and will know from their application context how to interpret
the text.

If you have messages that must be composed by people and must contain
backslashes, you have an even worse problem.  Japanese users have a
backslash character on the keyboard, but it will generate one of the
Japanese backslashes, not the ASCII backslash.  This effectively guarantees
problems with entering backslash in Japan because people will forget that
they need to do something special and will just use the keyboard.

R Horn

_______________________________________________
Syslog-sec mailing list
Syslog-sec@www.employees.org
http://www.employees.org/mailman/listinfo/syslog-sec