Bypassing misidentified MS Office files

Tue Oct 1 01:52:29 CEST 2013

Hi,

Some time ago I reported that 'file' misidentifies MS Office docx files
as Zip archives. I've since upgraded to the latest 'file' on my fc17
system (5.15, compiled from source), and they are still misidentified.
I'm unsure how to modify the magic file to identify them properly, so
I'd like to bypass scanning of files that contain the
"[trash]/0001.dat" entries that are causing the problem.

Does someone have a working 'file' magic file that they could send me
to evaluate?

I've tried the following, and it doesn't appear to work:

$banned_filename_re = new_RE(
  ...
  [ qr'^\[trash\]/[0-9a-f]{4}\.dat$'       => 0 ],  # allow any in Unix-type archives
  ...
);

I hope my description of the problem is clear. Here is what I see:

$ file report.docx
report.docx: Zip archive data, at least v2.0 to extract

$ unzip -v report.docx
Archive:  report.docx
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
  412276  Defl:N    22707  95% 01-01-1980 00:00 88586817  word/document.xml
     456  Stored      456   0% 01-01-1980 00:00 ffffffff  [trash]/0001.dat
  ...

If I remove the [trash]/ files and re-zip the archive, it's properly detected:

$ file report1.docx
report1.docx: Microsoft OOXML
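
For reference, the remove-and-re-zip step can be sketched in Python with
the standard zipfile module. This is only a rough illustration: the
function name strip_trash_members and the output path are hypothetical,
and the sketch has not been checked against 'file' 5.15. It skips the
[trash]/ members without reading them and copies everything else in the
original order.

```python
import re
import zipfile

# Same pattern as in the mail: [trash]/ followed by four hex digits and .dat
TRASH_RE = re.compile(r"^\[trash\]/[0-9a-f]{4}\.dat$")


def strip_trash_members(src_path, dst_path):
    """Copy the archive at src_path to dst_path without [trash]/NNNN.dat members.

    Hypothetical helper for illustration. Returns the names that were dropped.
    """
    dropped = []
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():
            if TRASH_RE.match(info.filename):
                # Skipped before reading, so the member's CRC is never checked.
                dropped.append(info.filename)
                continue
            # Passing the ZipInfo keeps the name, timestamp and compression method.
            dst.writestr(info, src.read(info.filename))
    return dropped


if __name__ == "__main__":
    import sys

    # Usage: python strip_trash.py report.docx report1.docx
    print(strip_trash_members(sys.argv[1], sys.argv[2]))
```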

My thinking is to skip any [trash]/NNNN.dat files, where NNNN matches
[0-9a-f]{4}, but it doesn't appear to work.
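
As a sanity check, the pattern itself can be tried against a few member
names outside of the mail filter. The sketch below uses Python's re
module rather than Perl, so it only approximates the qr'' expression
above, and the helper name is_trash_member is hypothetical. It assumes
the name being tested is the full member path as listed by unzip. If
the checker only sees the bare name "0001.dat" (an assumption, not
something confirmed here), the anchored pattern would not match.

```python
import re

# Python translation of the Perl pattern from the banned_filename_re entry.
TRASH_RE = re.compile(r"^\[trash\]/[0-9a-f]{4}\.dat$")


def is_trash_member(name):
    """Return True if name looks like [trash]/NNNN.dat with four hex digits."""
    return TRASH_RE.match(name) is not None


if __name__ == "__main__":
    # Sample names: two full paths, one bare name, one ordinary docx member.
    for name in ("[trash]/0001.dat", "[trash]/00ff.dat",
                 "0001.dat", "word/document.xml"):
        print(name, is_trash_member(name))
```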

Thanks so much,
Alex