mime2utf8 - cleanly chop

Mon Mar 28 18:29:57 CEST 2011

Clay,

> Can you please explain your "cleanly chop" portion of the mime2utf8 macro.
> 
> I was under the impression that it would not chop an encoded string in
> such a manner that it would look like:
> 
> "\xbc \xd0\x"
> 
> I didn't think the ending "\x", which was originally "\x9c", would be
> considered "valid".
> 
> Maybe I am just misunderstanding how you are using the regex, but it
> looks pretty straightforward.
> 
> Or am I not following the remaining portions of the macro properly?

You are referring to:

  if ($octets =~ /^(.{0,$max_len})(?=[\x00-\x7F\xC0-\xFF]|\z)/s) {
    $octets = $1;  # cleanly chop a UTF-8 byte sequence, RFC 3629
  }

A UTF-8-encoded character consists of 1 to 4 octets. The above
regexp makes sure that a truncation point does not occur in the
middle of a single-character encoding, which would produce
a syntactically invalid UTF-8 string (choking SQL, etc).

See sections 3 and 4 of RFC 3629. The trailing 1..3 octets of
a multi-octet UTF-8 sequence all have the topmost bits 10,
i.e. 10xxxxxx (UTF8-tail = %x80-BF), which is why these are
excluded from the set of valid character-starter octets in the
above [\x00-\x7F\xC0-\xFF].
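
To illustrate the distinction between tail octets and octets that can
start a character, here is a minimal sketch. The helper name
is_utf8_tail is made up for this example and is not part of the macro;
the sample string uses the two-octet encoding of U+041C ("\xD0\x9C").

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the macro): true if the octet is a
# UTF8-tail octet, i.e. has the bit pattern 10xxxxxx (%x80-BF).
sub is_utf8_tail {
  my ($octet) = @_;
  return (ord($octet) & 0xC0) == 0x80;
}

# "A" followed by the two octets that encode U+041C
my @octets = split //, "A\xD0\x9C";
for my $o (@octets) {
  printf "0x%02X %s\n", ord($o),
    is_utf8_tail($o) ? "tail octet" : "can start a character";
}
```

Only the last octet (0x9C) is reported as a tail octet, so a cut
directly in front of it would split the character.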

So the regexp attempts to take all octets up to $max_len, but
backtracks until the octet following the truncation point can start
a new character (or the end of the string is reached), i.e. the cut
is not in the middle of a UTF-8 sequence.
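
The backtracking can be observed with a small test. This is only
a sketch: the wrapper name chop_utf8 and the sample string are
assumptions made for the example, only the regexp itself is taken from
the macro.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regexp from the macro: returns at most
# $max_len octets without cutting through a multi-octet character.
sub chop_utf8 {
  my ($octets, $max_len) = @_;
  if ($octets =~ /^(.{0,$max_len})(?=[\x00-\x7F\xC0-\xFF]|\z)/s) {
    $octets = $1;  # cleanly chop a UTF-8 byte sequence, RFC 3629
  }
  return $octets;
}

# "ab", then U+041C encoded as the two octets D0 9C, then "cd"
my $octets = "ab\xD0\x9Ccd";
for my $max_len (2 .. 4) {
  my $chopped = chop_utf8($octets, $max_len);
  printf "max_len=%d -> %d octets kept\n", $max_len, length($chopped);
}
```

With a limit of 3 the greedy match would end between \xD0 and \x9C,
so the regexp gives back one octet and keeps only "ab"; with a limit
of 2 or 4 the cut already falls on a character boundary and nothing
extra is dropped.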

  Mark