Control Characters in MSG?
Status
The question was raised whether control characters should be allowed inside the MSG part.
Syslog traditionally allows only printable characters inside the MSG part. However,
it has been observed that some syslogds actually send control
characters (namely CR and LF) inside the MSG part. How should this be handled?
Note that issue 6 is related to this issue and
can only be resolved once this issue is resolved.
Status: SOLVED
Full UTF-8 encoding will be allowed in the MSG part of a syslog message. This
means that all control characters, including 0x00 (NUL) and 0x0a (LF), are allowed inside
the message. It is the syslog program's task to take care of proper internal
handling; escaping will probably be needed. Rainer Gerhards intends to add a reference
implementation for C to the liblogging open
source project at a later stage.
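To make the escaping idea concrete, here is a minimal sketch in Python of what a receiver might do before writing a message to a line-oriented log file. It is not the planned liblogging reference implementation. The function name `escape_control_chars`, the `#` plus three octal digits notation, and the choice to replace invalid UTF-8 with U+FFFD are all assumptions made for illustration.

```python
def escape_control_chars(msg: bytes) -> str:
    """Decode a UTF-8 MSG part and make ASCII control characters visible.

    Hypothetical helper: the "#ooo" octal notation is an illustrative
    choice, not something the decision above prescribes.
    """
    # Assumption: invalid UTF-8 is replaced with U+FFFD instead of rejected.
    text = msg.decode("utf-8", errors="replace")
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x20 or code == 0x7F:
            # C0 control character or DEL: emit '#' plus three octal digits.
            out.append(f"#{code:03o}")
        else:
            # Printable ASCII and all non-ASCII characters pass through.
            out.append(ch)
    return "".join(out)


print(escape_control_chars(b"line one\nline two\x00"))
# prints: line one#012line two#000
```

Note that this sketch is not reversible, because a literal `#` in the message is left untouched. It also handles only the ASCII control range, not control or formatting characters elsewhere in Unicode.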
Discussion Milestones
Other References
- RFC 2279 describes UTF-8
  encoding. It says the US-ASCII set is preserved inside it (UTF-8, page 2):
  - Character values from 0000 0000 to 0000 007F (US-ASCII repertoire) correspond
    to octets 00 to 7F (7 bit US-ASCII values). A direct consequence is that a
    plain ASCII string is also a valid UTF-8 string.
  - US-ASCII values do not appear otherwise in a UTF-8 encoded character stream.
    This provides compatibility with file systems or other software (e.g. the
    printf() function in C libraries) that parse based on US-ASCII values but are
    transparent to other values.
However, RFC 2279 does not say whether any non-printable characters are present in
the other character ranges.
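The two properties quoted from RFC 2279 can be checked with a short Python sketch. The helper name `ascii_octets_only_for_ascii` and the sample string are assumptions chosen for illustration. The check confirms that octets below 0x80 occur in the encoded stream only where the original character was itself US-ASCII.

```python
def ascii_octets_only_for_ascii(text: str) -> bool:
    """Return True if octets below 0x80 appear in the UTF-8 encoding of
    text only as the encoding of US-ASCII characters (hypothetical helper)."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            # A US-ASCII character must encode to exactly its own octet.
            if encoded != bytes([ord(ch)]):
                return False
        elif any(octet < 0x80 for octet in encoded):
            # A non-ASCII character must never produce an ASCII octet.
            return False
    return True


# Sample mixing ASCII, Latin-1, a formatting character, CJK and an emoji.
sample = "syslog \u00fc \u200c \u4e2d \U0001f600\n"
print(ascii_octets_only_for_ascii(sample))  # prints: True

# A plain ASCII string has the same octets in ASCII and in UTF-8.
print("plain ASCII".encode("ascii") == "plain ASCII".encode("utf-8"))  # prints: True
```

One practical consequence is that a parser scanning the octet stream for 0x0a (LF) or 0x0d (CR) will only ever find real LF or CR characters, never a fragment of a multi-octet sequence.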
A quick Google search brings up a
discussion thread
showing that Unicode does contain control characters.
Microsoft lists at least some of the Unicode control characters.
The Unicode Consortium offers online
versions of the character tables. The
range 20 table tells us that there are control characters (search for
formatting characters - it is on page 3 of the PDF - an example is 0x200C [ZERO
WIDTH NON-JOINER]). There are probably more pages with control characters; I
just did not dig deeper into the tables. They are available at
http://www.unicode.org/charts/
if someone would like to do more research.
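Rather than reading the charts by hand, the same research can be sketched with Python's standard `unicodedata` module, which exposes the general category of every code point. The helper name `chars_in_category` is an assumption. "Cc" is the Unicode general category for control characters and "Cf" the one for formatting characters. Exact counts for "Cf" depend on the Unicode version bundled with the Python interpreter, so none is shown here.

```python
import sys
import unicodedata


def chars_in_category(category: str) -> list:
    """Collect every code point whose Unicode general category matches
    (hypothetical helper; scans the whole code space, so it takes a moment)."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    ]


control = chars_in_category("Cc")  # control characters
fmt = chars_in_category("Cf")      # formatting characters

# The control characters are U+0000..U+001F, U+007F and U+0080..U+009F.
print(len(control))  # prints: 65

# The example from the chart is a formatting character, not a control character.
print(unicodedata.name("\u200c"), unicodedata.category("\u200c"))
# prints: ZERO WIDTH NON-JOINER Cf
```

This suggests that any escaping scheme has to decide separately how to treat the C1 controls (U+0080 to U+009F) and the formatting characters, since neither falls in the ASCII range.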
Must we support Unicode? Yes, RFC
2277/BCP 18 says we MUST do so: "Protocols MUST be able to use the UTF-8
charset, which consists of the ISO 10646 coded character set combined with the
UTF-8 character encoding scheme, as defined in [10646] Annex R (published in
Amendment 2), for all text." In theory, a BCP (see
RFC 1818) is not a required standard, but as I understand it,
everybody should try very hard to follow it.
This page last updated: Tue Sep 25 13:32:29 2007.
For content issues, contact rgerhards-at-adiscon.com - for legal issues, please contact Adiscon who is the legal owner and publisher of this web site.