pdf - Stop Microsoft Word 2010 from smoothing screenshots?

When I insert JPEG screenshots into Microsoft Word, it smoothes them instead of preserving the original pixels from the bitmap. When I then print to PDF (using Acrobat Distiller), depending on my downsample settings, I either get blurry screenshots or hugely bloated file sizes.

What I want:

I would like Word and Acrobat to leave the bitmaps alone so that they make it through the process with their pixels intact. This is what the original image looks like when you zoom in:

What I want

What I get:

This is what the Word document looks like when you insert the same image and zoom in. When this is printed to PDF, all those extra pixels result in a much larger file.

What I get

Sample files:

Test.png (56K) A sample screenshot image file

Test.docx (69K) A Word file containing nothing but this image

Test.PDF (9.4MB) A PDF file printed from the Word file using Distiller, with all downsampling turned off

Test2.PDF (98K) A PDF file generated using Word 2010's "Save as PDF" tool (note the very low quality of the compressed image)

Edit: This is with Word 2010 - I've updated the tags to reflect that.

Edit: I've confirmed that OpenOffice doesn't have this problem. I've opened Test.docx (referenced above) and exported it as a PDF from OO (choosing "lossless compression" under Images in the options), and the image comes through unharmed.

Unfortunately, OpenOffice mangles the formatting on more complex Word documents that I've created; so I can't just create the documents in Word and use OO to render the PDFs; I'd have to switch to OO altogether, which is a bigger step than I'm prepared to take right now.

Answer

Word maybe just renders upscaled image and sends it that way as printer input (I presume that Distiller works as a printer). If so, then it's good for normal printers, but inefficient for fake printers producing PDF files.

For instance pdfLaTeX properly embeds image in output file. Check my PDF uploaded to min.us gallery: Embedding image in LaTeX document

Important thing is what PDF producing stack you are using. If trying other PDF printer, like great and free PDFCreator, does not fix the problem, then you should try using dedicated PDF export, i.e. not working as a printer. AFAIK recent Word versions have PDF export built-in, so if it is properly implemented, then you will get small file, thanks to embedding images used in the document.

HUGE EDIT

Gallery has been renamed to Embedding PNG image in LaTeX vs Word

I've looked more thoroughly at my mytest.pdf generated by pdfLaTeX and your test2.pdf generated by Word.

mytest.pdf test2.pdf

Let's start with uncompressing. If you look into uncompressed file, you'll easily spot beginning of the image stream (<<...>>stream line with Width and Height parameters, same as in test.png, i.e. 176x295), which ends with endstream tag. Peek time.

(WARNING at this point pdftk is assumed to be in version 1.41)

test2.pdf

$ pdftk test2.pdf output test2uc.pdf uncompress
$ sed '\,^<]*/Height 295[^>]*>>stream$,!d' test2uc.pdf
<>stream
$ sed '1,\,^<]*/Height 295[^>]*>>stream$,d;/^endstream$/,$d' test2uc.pdf > test2stream
$ xxd test2stream | head -10
0000000: ffd8 ffe0 0010 4a46 4946 0001 0101 0048  ......JFIF.....H
0000010: 0048 0000 ffe1 005c 4578 6966 0000 4d4d  .H.....\Exif..MM
0000020: 002a 0000 0008 0004 0302 0002 0000 0016  .*..............
0000030: 0000 003e 5110 0001 0000 0001 0100 0000  ...>Q...........
0000040: 5111 0004 0000 0001 0000 0b13 5112 0004  Q...........Q...
0000050: 0000 0001 0000 0b13 0000 0000 5068 6f74  ............Phot
0000060: 6f73 686f 7020 4943 4320 7072 6f66 696c  oshop ICC profil
0000070: 6500 ffe2 0c58 4943 435f 5052 4f46 494c  e....XICC_PROFIL
0000080: 4500 0101 0000 0c48 4c69 6e6f 0210 0000  E......HLino....
0000090: 6d6e 7472 5247 4220 5859 5a20 07ce 0002  mntrRGB XYZ ....
$ file test2stream 
test2stream: JPEG image data, JFIF standard 1.01

So Word is giving JPEG instead of PNG on its internal output for further PDF processing. Just WOW! Same thing may happen when sending output to printer.

test2stream.jpg

mytest.pdf

$ pdftk mytest.pdf output mytestuc.pdf uncompress
$ sed '\,^<]*/Height 295[^>]*>>stream$,!d' mytestuc.pdf
<>stream
$ sed '1,\,^<]*/Height 295[^>]*>>stream$,d;/^endstream$/,$d' mytestuc.pdf > myteststream
$ xxd myteststream | head -10
0000000: ebeb ebea eaea ecec eceb ebeb ebeb ebeb  ................
0000010: ebeb ebeb ebec ecec ebeb ebeb ebeb ebeb  ................
0000020: ebeb ebeb ebeb ebeb ebeb ebeb ebeb ebeb  ................
0000030: ebeb ebea eaea eaea eaec ecec eaea eaec  ................
0000040: ecec ebeb ebec ecec ebeb ebeb ebeb ebeb  ................
0000050: ebeb ebeb ebeb ebeb ebeb ebeb ebeb ebeb  ................
0000060: ebeb ebeb ebeb ebeb ebeb ebeb ebeb ebeb  ................
0000070: ebeb ebeb ebeb ebeb ebeb ebeb ebeb ebeb  ................
0000080: ebea eaea ecec eceb ebeb ebeb ebea eaea  ................
0000090: ebeb ebeb ebeb ebeb ebeb ebeb ebeb ebeb  ................
$ file myteststream 
myteststream: DOS executable (COM)

It's not COM file, but it's not PNG either.

$ du -b test.png test2stream myteststream 
57727   test.png
20004   test2stream
155761  myteststream

You see it now? Image stream (of PNG) from PDF produced by pdfLaTeX is possibly simple raw format (176*295*3=155760, 1 comes from superfluous newline). Let's check it:

$ convert -depth 8 -size 176x295 rgb:myteststream myteststream.png

And we have our original image back! No, wait. It looks that pdftk 1.41 uncompression is buggy and image was almost the same with a few flaws. I upgraded to pdftk 1.44, but this version does not decompress image stream at all. Moreover pdftk does not output stream dictionary in one line, so above extraction using sed no longer works, but there is no point in fixing it now.

So what we can do about Word? Not much methinks. At least you can transplant embedded image from one PDF to another. I repeated uncompression of both PDFs using recent pdftk, opened them in vim, replaced in test2uc.pdf <<...>>stream...endstream with counterpart from mytestuc.pdf, saved as test2fixuc.pdf and compressed to test2fix.pdf.

test2fix.pdf

test.pdf

It would be a sin not checking your big PDF after all. Ok, I've prepared another oneliner to play with pdftk 1.44 uncompressed PDFs to list image streams and their beginning lines in files. So I'll start with uncompressing test.pdf.

(WARNING at this point pdftk is assumed to be in version 1.44)

$ pdftk test.pdf output testuc.pdf uncompress
$ awk '{if(i)h=h$0} /^[0-9]+ [0-9]+ obj $/{i=1;h=""}/^stream$/{i=0;if(h!~/\/Image/)next;print h,":"NR+1}' testuc.pdf 
<>stream :619
<>stream :12106
<>stream :12910
<>stream :18547
<>stream :19312
<>stream :19326

Something is really insane here! 6 raw images (apparently this time pdftk did not have any problems in uncompressing them) taking together 43444452 bytes! Let's recheck test2uc.pdf and mytestuc.pdf.

$ awk '{if(i)h=h$0} /^[0-9]+ [0-9]+ obj $/{i=1;h=""}/^stream$/{i=0;if(h!~/\/Image/)next;print h,":"NR+1}' test2uc.pdf 
<>stream :113
przemoc@debian:~/latex/test/img/mod$ awk '{if(i)h=h$0} /^[0-9]+ [0-9]+ obj $/{i=1;h=""}/^stream$/{i=0;if(h!~/\/Image/)next;print h,":"NR+1}' mytestuc.pdf 
<>/Width 176/BitsPerComponent 8/Height 295/Filter /FlateDecode/Subtype /Image/Length 54954/ColorSpace /DeviceRGB/Type /XObject>>stream :22

In both cases only one image stream. Why the heck there could be more of them?!

$ sed '1,618d;/^endstream $/q' testuc.pdf | convert -depth 8 -size 707x4924 rgb:- testuc-stream1.png
$ sed '1,12105d;/^endstream $/q' testuc.pdf | convert -depth 8 -size 953x3940 rgb:- testuc-stream2.png
$ sed '1,12909d;/^endstream $/q' testuc.pdf | convert -depth 8 -size 953x984 rgb:- testuc-stream3.png
$ sed '1,18546d;/^endstream $/q' testuc.pdf | convert -depth 8 -size 953x3940 rgb:- testuc-stream4.png
$ sed '1,19311d;/^endstream $/q' testuc.pdf | convert -depth 8 -size 953x984 rgb:- testuc-stream5.png
$ sed '1,19325d;/^endstream $/q' testuc.pdf | convert -depth 8 -size 328x4924 rgb:- testuc-stream6.png

Image was cut to many pieces... It looks like some kind of utterly stupid protection, maybe introduced by Distiller (and maybe it can be turned off)? I doubt same thing would be spitted by PDFCreator, unless it's Word who performs this unbelievable insanity...

testuc-stream1.png and others (use right arrow to navigate)

Conclusion

Important things are:

you can clearly see, that huge image that was cut into pieces is actually upscaled JPEG, so my hypothesis was correct,

because in PDFCreator you get also huge file in the output, it's the Word who provides awfully big image to the fake PDF printer, and my earlier supposition was also correct.

Phew. This investigation took some time. Word is piece of junk.

Workarounds?

In the meantime some suggestions were given. Let me comment them.

Using writer with decent PDF support like LibreOffice (forget about OpenOffice, it's obsoleted now) is good solution, unless some incompabilities make you unable to work with it.

Using bigger image in same box on the page is also not that bad idea, because even after JPEG-izing, artifacts will be less visible.

My another grosz though is using JPEG from the beginning. That way Word shouldn't recompress it (you never know...) and you can provide highest possible quality of JPEG. There is also lossless JPEG compression. Developers from Redmond presumably thought it's not needed, so I won't be surprised if Word doesn't handle such JPEGs. Well, TBH it's not widely supported (even in open source world), just like arithmetic coding (or it's rather even worse situation in case of arithmetic coding).

convert test.png -quality 100 -resize $((100*300/72))% test-300dpi-mitchell.jpg
convert test.png -quality 100 -filter box -resize $((100*300/72))% test-300dpi-box.jpg
convert test.png -quality 100 test.jpg

(In Windows use 416 instead of this $(()) arithmetic expansion available in POSIX shells)

I think that default Mitchell is good one for upscaling, but if you really want such pixelatic image, then go with Box as @ceving suggested. Of course first 2 files are useful only if you must (for some reason) use fake PDF printers.

I've uploaded all three files.

test-300dpi-mitchell.jpg (426 KB) test-300dpi-box.jpg (581 KB) test.jpg (74 KB)

If my hypothesis is right and Word won't recompress JPEG image, then just use the last one not upscaled and go with built-in PDF output, because it has less shortcommings (at least it avoids needless upscale).

Notes

Search This Blog