mime2utf8 - cleanly chop

Mon Mar 28 22:57:17 CEST 2011

On 03/28/2011 11:29 AM, Mark Martinec wrote:

> A UTF-8-encoded character consists of 1 to 4 octets. The above
> regexp makes sure that a truncation point does not occur in the
> middle of a single-character encoding, which would produce
> a syntactically invalid UTF-8 string (choking SQL, etc).
>
> See sections 3 and 4 of RFC 3629. The 1..3 trailing octets of a
> multi-octet UTF-8 sequence all have the topmost bits 10,
> i.e. 10xxxxxx (UTF8-tail = %x80-BF), which is why these are
> excluded from the set of valid character-starter octets in the
> above [\x00-\x7F\xC0-\xFF].
>

Thanks Mark, I was not processing this portion of it properly as I read 
through the RFC. Thanks for putting more emphasis on it.
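
For anyone following along, here is a minimal sketch of the same idea in
Python. It is not the regexp from the patch under discussion; the function
name chop_utf8 and its parameters are hypothetical, and it assumes the input
is already well-formed UTF-8 and that max_octets is non-negative. It backs
the cut point up past any continuation octets (10xxxxxx) so that the
truncation never lands inside a multi-octet character.

```python
def chop_utf8(data: bytes, max_octets: int) -> bytes:
    """Truncate UTF-8 bytes to at most max_octets without splitting a character.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8
    and `max_octets` is >= 0.
    """
    if len(data) <= max_octets:
        return data  # nothing to chop

    cut = max_octets
    # data[cut] is the first octet that would be dropped. If it is a
    # continuation octet (top bits 10, i.e. 0x80-0xBF), the cut would fall
    # inside a character, so move back until it is a starter octet.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


if __name__ == "__main__":
    sample = "a\u20acb".encode("utf-8")  # 'a', EURO SIGN (3 octets), 'b'
    print(chop_utf8(sample, 3))
```

With the sample above, a limit of 2 or 3 octets falls inside the
three-octet euro sign, so only the leading "a" is kept; a limit of 4 keeps
the whole euro sign.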