mime2utf8 - cleanly chop
Mark Martinec
Mark.Martinec+amavis at ijs.si
Mon Mar 28 18:29:57 CEST 2011
Clay,
> Can you please explain your "cleanly chop" portion of the mime2utf8 macro.
>
> I was under the impression that it would not chop an encoded string in
> such a manner that it would look like:
>
> "\xbc \xd0\x"
>
> I didn't think the ending "\x", which was originally "\x9c", would be
> considered "valid".
>
> Maybe I am just misunderstanding how you are using the regex, but it
> looks pretty straightforward.
>
> Or am I not following the remaining portions of the macro properly?
You are referring to:
  if ($octets =~ /^(.{0,$max_len})(?=[\x00-\x7F\xC0-\xFF]|\z)/s) {
    $octets = $1;  # cleanly chop a UTF-8 byte sequence, RFC 3629
  }
A UTF-8-encoded character consists of 1 to 4 octets. The above
regexp makes sure that a truncation point does not occur in the
middle of a single-character encoding, which would produce
a syntactically invalid UTF-8 string (choking SQL, etc).
See sections 3 and 4 of RFC 3629. The trailing 1..3 octets of
a multi-octet UTF-8 sequence all have the topmost bits 10,
i.e. 10xxxxxx (UTF8-tail = %x80-BF), which is why these are
excluded from the set of valid character-starter octets in the
above [\x00-\x7F\xC0-\xFF].
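As a quick sanity check of that claim, here is a minimal sketch which
walks all 256 octet values and confirms that the character class is
exactly the complement of the 10xxxxxx tail octets. The helper name
is_utf8_tail is made up for this illustration and is not part of amavisd.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the octet has the form 10xxxxxx (UTF8-tail)
sub is_utf8_tail {
  my ($octet) = @_;
  return (ord($octet) & 0xC0) == 0x80;
}

# Count octets where "in the starter class" and "is a tail octet" agree;
# the two properties should be mutually exclusive and cover all values.
my $mismatches = 0;
for my $n (0 .. 255) {
  my $octet    = chr($n);
  my $in_class = $octet =~ /^[\x00-\x7F\xC0-\xFF]\z/ ? 1 : 0;
  my $is_tail  = is_utf8_tail($octet) ? 1 : 0;
  $mismatches++ if $in_class == $is_tail;
}
print "mismatches: $mismatches\n";
```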
So the regexp attempts to take all octets up to $max_len, but
backtracks until the truncation point is followed by an octet which
can start a new character (or by the end of the string), i.e. the
cut is not in the middle of a UTF-8 sequence.
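To illustrate the backtracking, here is a small stand-alone sketch
that wraps the same regexp in a subroutine and applies it to a 5-octet
string made of two 2-octet characters separated by a space. The name
chop_utf8_octets and the sample string are assumptions made for this
example only; the input is assumed to be a string of octets, not
a string of decoded characters.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regexp discussed above: returns at most
# $max_len octets, never cutting in the middle of a multi-octet character.
sub chop_utf8_octets {
  my ($octets, $max_len) = @_;
  if ($octets =~ /^(.{0,$max_len})(?=[\x00-\x7F\xC0-\xFF]|\z)/s) {
    $octets = $1;
  }
  return $octets;
}

# octets: D0 BC 20 D0 9C  (U+043C, space, U+041C)
my $str = "\xD0\xBC \xD0\x9C";

for my $max_len (3, 4, 5) {
  my $chopped = chop_utf8_octets($str, $max_len);
  printf "%d -> %s\n", $max_len, unpack('H*', $chopped);
}
# 3 -> d0bc20
# 4 -> d0bc20       (backtracked by one octet: 9C cannot start a character)
# 5 -> d0bc20d09c   (the whole string fits)
```

With a limit of 4 the greedy match first takes D0 BC 20 D0, sees that
the next octet 9C is a tail octet, and gives back one octet, so the
result is one octet shorter than the limit rather than ending in
a dangling D0.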
Mark