Fixpoint

2022-11-25

A quick survey of filthy character sets from my mail archives

Filed under: Data, Gales Linux, Historia, JWRD, Networks, Politikos, Software — Jacob Welsh @ 23:33

This will expand on a bit of data mining I did to help in evaluating the iconv implementation from musl libc, in light of the challenges faced by an email server in the field. The trouble is that email was something of a ground zero of the text format wars, a weakly governed commons where the definition of "text" became the whole collection of random schemes hatched by anyone who got enough consumer-sheep bleating for support until the bad behavior became mandatory. "Was", I say, because the bright side is that it's been relatively settled for some time now; the sheep got the softforks they were after and now the most that seems to happen is new emoji getting added to unicode, a small manifestation of the larger cultural slide into the new illiteracy.

                       _
ASCII ribbon campaign ( )
 against HTML e-mail   X
                      / \

Never forget.

As I've run my own mail servers for a while, I had a decent-sized data set to work with, admittedly with the initial filter of me; or to be more precise, whatever was dumped in my inbox by marketers, ivy-covered professors and whatever other robots managed to find me, plus a smaller body of actual human communication. It breaks down into two eras: the first from an archive covering Sep 2008 to Nov 2018,(i) and the second from an active server covering Nov 2018 to present. The idea is to extract and tally the character sets/encodings,(ii) looking at message bodies and headers (To, From, Subject etc.) separately, as the Great Internationalization was inflicted upon them in different ways and possibly at different times. The extraction method is approximate, using regular expressions rather than attempting to parse the formal message structure, such as it is.

Results

First, the "old" data set. Sample size in number of messages, as per find ~/Maildir -path '*/cur/*' | wc -l, is 68042.

Extraction:

$ find ~/Maildir -path '*/cur/*' -print0 | xargs -0 grep -ih '^content-type:' | grep -io 'charset=[^ >;/]*' > bodies1
$ find ~/Maildir -path '*/cur/*' -print0 | xargs -0 grep -ho '=?.*?.*?.*?=' > headers1

Filtering first for a Content-Type header cuts out a lot of charset declarations embedded within HTML, which I certainly hope won't be getting parsed at the IMAP server layer. It does however miss some cases where the header line was wrapped. The number of body charset declarations thus found:

$ wc -l bodies1
84838 bodies1

This can be greater than the total message count because it's given per part in multipart (MIME) messages.
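The wrapped Content-Type lines that the extraction misses could be caught by unfolding continuation lines (those starting with a space or tab) before grepping. The following is only a sketch: unfold is a made-up helper name, and since it joins indented lines everywhere, message bodies included, it remains as approximate as the rest of the method.

```shell
# Hypothetical helper: join each line that starts with a space or tab onto
# the line before it, which is how wrapped header fields are continued.
unfold() {
    awk '
        /^[ \t]/ { sub(/^[ \t]+/, " "); buf = buf $0; next }
        { if (NR > 1) print buf; buf = $0 }
        END { if (NR > 0) print buf }
    '
}

# Possible use, mirroring the extraction above (not run against my archive):
#   find ~/Maildir -path '*/cur/*' -print0 | xargs -0 cat | unfold |
#       grep -i '^content-type:' | grep -io 'charset=[^ >;/]*'
```

Concatenating the files through cat loses the per-file boundaries, which hardly matters for a tally but is one more reason to treat the numbers as estimates.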

Trimming, case folding and tallying:

$ cat bodies1 | tr -d '"' | tr A-Z a-z | sed 's/charset=//' | sort | uniq -c | sort -n
      1
      1 cp-850
      1 iso-8859-6
      1 latin1
      1 uft-8(iii)
      1 x-unknown(iv)
      2 euc-kr
      4 gbk
      4 iso-8859-2
      7 ansi_x3.4-1968
      7 unicode-1-1-utf-7
      7 windows-1250
     15 iso-2022-jp
     17 koi8-r
     18 big5
     22 iso-8859-7
     25 3dutf-8(v)
     26 windows-1256
     34 3d&quot
     39 gb2312
     39 windows-1251
     46 iso-8859-15
     47 ascii(vi)
     55 cp1252
   2048 windows-1252
  19802 us-ascii
  28586 iso-8859-1
  33982 utf-8(vii)

Number of charset-laden headers found (they snuck this in on top of the standards which clearly stated ASCII, using =? ?= bracketing because nobody would ever use those characters!!):

$ wc -l headers1
14961 headers1

Trimming, case folding and tallying:

$ cut -d? -f2 headers1 | tr A-Z a-z | sort | uniq -c | sort -n
      1 cp1252
      1 gbk
      3 windows-1256
      4 iso-8859-7
     13 windows-1251
     15 iso-2022-jp
     56 koi8-r
    128 iso-8859-1
    134 windows-1252
    613 us-ascii(viii)
   1077 gb2312
  12916 utf-8
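One caveat on the header numbers: the pattern used to build headers1 is greedy, so a line carrying several encoded words produces a single match, and cut then reports only the first charset on that line. A stricter variant, assuming the =?charset?encoding?text?= layout of these words and using a made-up function name, might count each word on its own:

```shell
# Hypothetical helper: print the lowercased charset of every encoded word,
# one per line. In grep's basic regex syntax "?" is a literal character, and
# [^?]* keeps each field from running past the next question mark.
encoded_word_charsets() {
    grep -ho '=?[^?]*?[^?]*?[^?]*?=' "$@" | cut -d? -f2 | tr A-Z a-z
}

# Possible use (reads standard input when no files are given):
#   find ~/Maildir -path '*/cur/*' -print0 | xargs -0 cat |
#       encoded_word_charsets | sort | uniq -c | sort -n
```

This would raise the totals relative to the tally above, since it counts encoded words rather than lines containing them.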

Moving on to the "new" set, extracted likewise into bodies2 and headers2: the sample is 40745 messages, 40676 body charset declarations and 30279 charset-laden headers. It would appear that the prevalence of deviant headers has greatly increased between the two time periods, though other explanations are possible, such as a higher ratio of retained spam to human correspondence in the new set.
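To put numbers on that comparison, the counts can be divided by the message totals. This is plain arithmetic on the figures already given; per_message is just a throwaway name for the division.

```shell
# Hypothetical helper: print count/total to two decimal places.
per_message() {
    awk -v n="$1" -v total="$2" 'BEGIN { printf "%.2f\n", n / total }'
}

per_message 14961 68042   # old set: encoded header lines per message
per_message 30279 40745   # new set: encoded header lines per message
per_message 84838 68042   # old set: body charset declarations per message
per_message 40676 40745   # new set: body charset declarations per message
```

These are matches per message rather than fractions of messages, since one message can contribute several matching lines.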

Bodies:

$ cat bodies2 | tr -d '"' | tr A-Z a-z | sed 's/charset=//' | sort | uniq -c | sort -n
      1 3dus-ascii(ix)
      1 3dut=46-8
      1 iso-8859-10
      1 unicode-1-1-utf-7
      1 utf-16le
      1 utf-8(x)
      1 utf8
      1 windows-1250
      2 euc-jp
      2 iso-8859-3
      2 iso-8859-7
      3 euc-kr
      3 iso-8859-5
      4 big5
      4 iso-8859-14
      9 ibm852
     17 ascii
     20 cp-850
     22 iso-2022-jp
     41 gbk
    154 gb2312
    547 windows-1251
    582 iso-8859-2
    631 iso-8859-15
   1189 iso-8859-1
   1349 windows-1252
   7080 us-ascii
  29007 utf-8
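The lone utf-8 entry split off from the main pile owes its existence to a trailing carriage return. A variant of the tally that strips those along with the quotes could look like the sketch below; tally is a made-up name and I have not rerun the numbers with it.

```shell
# Hypothetical helper: same pipeline as above, but also deleting carriage
# returns so that "utf-8\r" and "utf-8" land in the same bucket.
tally() {
    tr -d '"\r' | tr A-Z a-z | sed 's/^charset=//' | sort | uniq -c | sort -n
}

# Possible use:
#   tally < bodies2
```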

Headers:

$ cut -d? -f2 headers2 | tr A-Z a-z | sort | uniq -c | sort -n
      1 ¶¡œc¼Ñ(xi)
      2 iso-8859-2
      3 shift_jis
      4 iso-8859-7
      7 gb18030
     12 gbk
     15 iso-8859-5
     16 iso-8859-15
     64 iso-2022-jp
    204 windows-1252
    280 gb2312
    366 windows-1251
   1217 iso-8859-1
  11294 us-ascii
  16794 utf-8

Charset support in musl iconv

For the other side of the comparison we'll need to see what out of this mess is supported by musl. As they describe it:

The iconv implementation musl is very small and oriented towards being unobtrusive to static link. Its character set/encoding coverage is very strong for its size, but not comprehensive like glibc’s.

and

Many legacy double-byte and multi-byte East Asian encodings are supported only as the source charset, not the destination charset. JIS-based ones are supported as the destination as of version 1.1.19.

I expect it's only the decoding side that matters here, i.e. converting from whatever source charset to unicode.

However, I couldn't find the exact list of supported charsets anywhere, so I've extracted it from the source, which is the charmaps variable defined in src/locale/iconv.c and src/locale/codepages.h, as of the version in the current Gales tree. Here they all are, with aliases listed on the same line. I take it they normalize by first removing the optional hyphens.

utf8 char
wchart
ucs2be
ucs2le
utf16be
utf16le
ucs4be utf32be
ucs4le utf32le
ascii usascii iso646 iso646us
utf16
ucs4 utf32
ucs2
eucjp
shiftjis sjis
iso2022jp
gb18030
gbk
gb2312
big5 bigfive cp950 big5hkscs
euckr ksc5601 ksx1001 cp949
iso88591 latin1
iso88592
iso88593
iso88594
iso88595
iso88596
iso88597
iso88598
iso88599
iso885910
iso885911 tis620
iso885913
iso885914
iso885915 latin9
iso885916
cp1250 windows1250
cp1251 windows1251
cp1252 windows1252
cp1253 windows1253
cp1254 windows1254
cp1255 windows1255
cp1256 windows1256
cp1257 windows1257
cp1258 windows1258
koi8r
koi8u
cp437
cp850
cp866
ibm1047 cp1047
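Checking a tallied name against this list can be mechanized. The sketch below assumes the normalization amounts to lowercasing and dropping everything that isn't a letter or digit, which goes a bit beyond my guess about hyphens (it also eats underscores and dots); the names are transcribed from the list above rather than read out of musl, and the function names are made up.

```shell
# Names from the list above, aliases included (transcribed by hand).
MUSL_CHARSETS='utf8 char wchart ucs2be ucs2le utf16be utf16le ucs4be utf32be
ucs4le utf32le ascii usascii iso646 iso646us utf16 ucs4 utf32 ucs2 eucjp
shiftjis sjis iso2022jp gb18030 gbk gb2312 big5 bigfive cp950 big5hkscs
euckr ksc5601 ksx1001 cp949 iso88591 latin1 iso88592 iso88593 iso88594
iso88595 iso88596 iso88597 iso88598 iso88599 iso885910 iso885911 tis620
iso885913 iso885914 iso885915 latin9 iso885916 cp1250 windows1250 cp1251
windows1251 cp1252 windows1252 cp1253 windows1253 cp1254 windows1254 cp1255
windows1255 cp1256 windows1256 cp1257 windows1257 cp1258 windows1258 koi8r
koi8u cp437 cp850 cp866 ibm1047 cp1047'

# Assumed normalization: lowercase, then keep only letters and digits.
normalize() {
    printf '%s' "$1" | tr A-Z a-z | tr -cd 'a-z0-9'
}

# Succeeds if the normalized name appears in the transcribed list.
in_musl_list() {
    name=$(normalize "$1")
    [ -n "$name" ] || return 1
    for known in $MUSL_CHARSETS; do
        if [ "$name" = "$known" ]; then
            return 0
        fi
    done
    return 1
}

# Possible use:
#   in_musl_list 'ISO-8859-15' && echo listed
#   in_musl_list 'ibm852' || echo missing
```

Under that assumption cp-850 and shift_jis count as listed, while ibm852 and ansi_x3.4-1968 do not.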

The utf7 business that prompted this in the first place shows up only in the very odd looking unicode-1-1-utf-7, which isn't in glibc iconv either. It shows up in "Delivery Status Notification (Failure)" messages (bounces), most from the Postfix MTA. The actual text of the relevant parts looks all ASCII to me.

The remaining plausible encodings not found in the list:

ibm852 - a Central European DOS codepage that didn't make it into the ISO-8859 list. Found only in a specific strain of spam (I am a hacker who has access to your account, send me Bitcoins!)

ANSI_X3.4-1968 - yet another name for ASCII, found in mail from a cron daemon and one corporate sender. This *is* recognized by glibc iconv so may be worth adding to musl as an alias.
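Whether a particular libc's iconv accepts a given name can also be probed directly, assuming the iconv command-line tool is installed and linked against the libc in question. A minimal sketch, with iconv_knows as a made-up name:

```shell
# Hypothetical helper: succeeds if the local iconv accepts the name as a
# source charset, by asking it to convert empty input to UTF-8.
iconv_knows() {
    printf '' | iconv -f "$1" -t UTF-8 >/dev/null 2>&1
}

# Possible use:
#   for cs in ibm852 ANSI_X3.4-1968 unicode-1-1-utf-7; do
#       if iconv_knows "$cs"; then echo "$cs: yes"; else echo "$cs: no"; fi
#   done
```

The answers will of course differ between a glibc and a musl system, which is rather the point.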

  i. I have older stuff too, but a bit too buried in different places and formats to bother digging up for this.
  ii. Supposedly unicode is an abstract character set while utf8 and friends are the concrete byte-encodings thereof, but they all seem to end up labeled "charset".
  iii. Lolz.
  iv. How very helpful of them.
  v. From "=3D", the quoted-printable encoding of "=" (ASCII 0x3D); perhaps this snuck in from some HTML.
  vi. Using ASCII to tell me it's ASCII, gee thanks. Probably it's because something had to be declared in order to include a further transfer-encoding field.
  vii. Ain't it nice that unicode is dominating and maybe eventually someday we can be rid of all those legacy encodings? Perhaps only if you don't look inside that box, because the froth never went away, just got hidden under a new name with somewhat different complexities to deal with.
  viii. Again, they had to put something in order to base64 or quoted-printable encode it, perhaps to escape some character that would otherwise terminate the header. Which itself is pretty suspicious.
  ix. This and the following are artifacts of the sloppy parsing; they're from a message that was quoted in full raw form in another message so the client figured it better escape stuff.
  x. This got separated from the main utf-8 pile due to a Ctrl-M (carriage return) character at the end.
  xi. Definitely spam, perhaps h4xx0rz, with .doc payload; the binary garbage comes from a pseudo-header in a nominally plain text part of the message body.


Powered by MP-WP. Copyright Jacob Welsh.