Skip to main content

pdf - Stop Microsoft Word 2010 from smoothing screenshots?


When I insert JPEG screenshots into Microsoft Word, it smoothes them instead of preserving the original pixels from the bitmap. When I then print to PDF (using Acrobat Distiller), depending on my downsample settings, I either get blurry screenshots or hugely bloated file sizes.


What I want:


I would like Word and Acrobat to leave the bitmaps alone so that they make it through the process with their pixels intact. This is what the original image looks like when you zoom in:


What I want


What I get:


This is what the Word document looks like when you insert the same image and zoom in. When this is printed to PDF, all those extra pixels result in a much larger file.


What I get


Sample files:



  • Test.png (56K) A sample screenshot image file

  • Test.docx (69K) A Word file containing nothing but this image

  • Test.PDF (9.4MB) A PDF file printed from the Word file using Distiller, with all downsampling turned off

  • Test2.PDF (98K) A PDF file generated using Word 2010's "Save as PDF" tool (note the very low quality of the compressed image)




Edit: This is with Word 2010 - I've updated the tags to reflect that.




Edit: I've confirmed that OpenOffice doesn't have this problem. I've opened Test.docx (referenced above) and exported it as a PDF from OO (choosing "lossless compression" under Images in the options), and the image comes through unharmed.



Unfortunately, OpenOffice mangles the formatting on more complex Word documents that I've created; so I can't just create the documents in Word and use OO to render the PDFs; I'd have to switch to OO altogether, which is a bigger step than I'm prepared to take right now.



Answer



Word maybe just renders upscaled image and sends it that way as printer input (I presume that Distiller works as a printer). If so, then it's good for normal printers, but inefficient for fake printers producing PDF files.


For instance pdfLaTeX properly embeds image in output file. Check my PDF uploaded to min.us gallery: Embedding image in LaTeX document


Important thing is what PDF producing stack you are using. If trying other PDF printer, like great and free PDFCreator, does not fix the problem, then you should try using dedicated PDF export, i.e. not working as a printer. AFAIK recent Word versions have PDF export built-in, so if it is properly implemented, then you will get small file, thanks to embedding images used in the document.


HUGE EDIT


Gallery has been renamed to Embedding PNG image in LaTeX vs Word


I've looked more thoroughly at my mytest.pdf generated by pdfLaTeX and your test2.pdf generated by Word.


mytest.pdf test2.pdf


Let's start with uncompressing. If you look into uncompressed file, you'll easily spot beginning of the image stream (<<...>>stream line with Width and Height parameters, same as in test.png, i.e. 176x295), which ends with endstream tag. Peek time.


(WARNING at this point pdftk is assumed to be in version 1.41)


test2.pdf


$ pdftk test2.pdf output test2uc.pdf uncompress
$ sed '\,^<]*/Height 295[^>]*>>stream$,!d' test2uc.pdf
<>stream
$ sed '1,\,^<]*/Height 295[^>]*>>stream$,d;/^endstream$/,$d' test2uc.pdf > test2stream
$ xxd test2stream | head -10
0000000: ffd8 ffe0 0010 4a46 4946 0001 0101 0048 ......JFIF.....H
0000010: 0048 0000 ffe1 005c 4578 6966 0000 4d4d .H.....\Exif..MM
0000020: 002a 0000 0008 0004 0302 0002 0000 0016 .*..............
0000030: 0000 003e 5110 0001 0000 0001 0100 0000 ...>Q...........
0000040: 5111 0004 0000 0001 0000 0b13 5112 0004 Q...........Q...
0000050: 0000 0001 0000 0b13 0000 0000 5068 6f74 ............Phot
0000060: 6f73 686f 7020 4943 4320 7072 6f66 696c oshop ICC profil
0000070: 6500 ffe2 0c58 4943 435f 5052 4f46 494c e....XICC_PROFIL
0000080: 4500 0101 0000 0c48 4c69 6e6f 0210 0000 E......HLino....
0000090: 6d6e 7472 5247 4220 5859 5a20 07ce 0002 mntrRGB XYZ ....
$ file test2stream
test2stream: JPEG image data, JFIF standard 1.01

So Word is giving JPEG instead of PNG on its internal output for further PDF processing. Just WOW! Same thing may happen when sending output to printer.


test2stream.jpg


mytest.pdf


$ pdftk mytest.pdf output mytestuc.pdf uncompress
$ sed '\,^<]*/Height 295[^>]*>>stream$,!d' mytestuc.pdf
<>stream
$ sed '1,\,^<]*/Height 295[^>]*>>stream$,d;/^endstream$/,$d' mytestuc.pdf > myteststream
$ xxd myteststream | head -10
0000000: ebeb ebea eaea ecec eceb ebeb ebeb ebeb ................
0000010: ebeb ebeb ebec ecec ebeb ebeb ebeb ebeb ................
0000020: ebeb ebeb ebeb ebeb ebeb ebeb ebeb ebeb ................
0000030: ebeb ebea eaea eaea eaec ecec eaea eaec ................
0000040: ecec ebeb ebec ecec ebeb ebeb ebeb ebeb ................
0000050: ebeb ebeb ebeb ebeb ebeb ebeb ebeb ebeb ................
0000060: ebeb ebeb ebeb ebeb ebeb ebeb ebeb ebeb ................
0000070: ebeb ebeb ebeb ebeb ebeb ebeb ebeb ebeb ................
0000080: ebea eaea ecec eceb ebeb ebeb ebea eaea ................
0000090: ebeb ebeb ebeb ebeb ebeb ebeb ebeb ebeb ................
$ file myteststream
myteststream: DOS executable (COM)

It's not COM file, but it's not PNG either.


$ du -b test.png test2stream myteststream 
57727 test.png
20004 test2stream
155761 myteststream

You see it now? Image stream (of PNG) from PDF produced by pdfLaTeX is possibly simple raw format (176*295*3=155760, 1 comes from superfluous newline). Let's check it:


$ convert -depth 8 -size 176x295 rgb:myteststream myteststream.png

And we have our original image back! No, wait. It looks that pdftk 1.41 uncompression is buggy and image was almost the same with a few flaws. I upgraded to pdftk 1.44, but this version does not decompress image stream at all. Moreover pdftk does not output stream dictionary in one line, so above extraction using sed no longer works, but there is no point in fixing it now.


So what we can do about Word? Not much methinks. At least you can transplant embedded image from one PDF to another. I repeated uncompression of both PDFs using recent pdftk, opened them in vim, replaced in test2uc.pdf <<...>>stream...endstream with counterpart from mytestuc.pdf, saved as test2fixuc.pdf and compressed to test2fix.pdf.


test2fix.pdf


test.pdf


It would be a sin not checking your big PDF after all. Ok, I've prepared another oneliner to play with pdftk 1.44 uncompressed PDFs to list image streams and their beginning lines in files. So I'll start with uncompressing test.pdf.


(WARNING at this point pdftk is assumed to be in version 1.44)


$ pdftk test.pdf output testuc.pdf uncompress
$ awk '{if(i)h=h$0} /^[0-9]+ [0-9]+ obj $/{i=1;h=""}/^stream$/{i=0;if(h!~/\/Image/)next;print h,":"NR+1}' testuc.pdf
<>stream :619
<>stream :12106
<>stream :12910
<>stream :18547
<>stream :19312
<>stream :19326

Something is really insane here! 6 raw images (apparently this time pdftk did not have any problems in uncompressing them) taking together 43444452 bytes! Let's recheck test2uc.pdf and mytestuc.pdf.


$ awk '{if(i)h=h$0} /^[0-9]+ [0-9]+ obj $/{i=1;h=""}/^stream$/{i=0;if(h!~/\/Image/)next;print h,":"NR+1}' test2uc.pdf 
<>stream :113
przemoc@debian:~/latex/test/img/mod$ awk '{if(i)h=h$0} /^[0-9]+ [0-9]+ obj $/{i=1;h=""}/^stream$/{i=0;if(h!~/\/Image/)next;print h,":"NR+1}' mytestuc.pdf
<>/Width 176/BitsPerComponent 8/Height 295/Filter /FlateDecode/Subtype /Image/Length 54954/ColorSpace /DeviceRGB/Type /XObject>>stream :22

In both cases only one image stream. Why the heck there could be more of them?!


$ sed '1,618d;/^endstream $/q' testuc.pdf | convert -depth 8 -size 707x4924 rgb:- testuc-stream1.png
$ sed '1,12105d;/^endstream $/q' testuc.pdf | convert -depth 8 -size 953x3940 rgb:- testuc-stream2.png
$ sed '1,12909d;/^endstream $/q' testuc.pdf | convert -depth 8 -size 953x984 rgb:- testuc-stream3.png
$ sed '1,18546d;/^endstream $/q' testuc.pdf | convert -depth 8 -size 953x3940 rgb:- testuc-stream4.png
$ sed '1,19311d;/^endstream $/q' testuc.pdf | convert -depth 8 -size 953x984 rgb:- testuc-stream5.png
$ sed '1,19325d;/^endstream $/q' testuc.pdf | convert -depth 8 -size 328x4924 rgb:- testuc-stream6.png

Image was cut to many pieces... It looks like some kind of utterly stupid protection, maybe introduced by Distiller (and maybe it can be turned off)? I doubt same thing would be spitted by PDFCreator, unless it's Word who performs this unbelievable insanity...


testuc-stream1.png and others (use right arrow to navigate)


Conclusion


Important things are:



  • you can clearly see, that huge image that was cut into pieces is actually upscaled JPEG, so my hypothesis was correct,

  • because in PDFCreator you get also huge file in the output, it's the Word who provides awfully big image to the fake PDF printer, and my earlier supposition was also correct.


Phew. This investigation took some time. Word is piece of junk.


Workarounds?


In the meantime some suggestions were given. Let me comment them.


Using writer with decent PDF support like LibreOffice (forget about OpenOffice, it's obsoleted now) is good solution, unless some incompabilities make you unable to work with it.


Using bigger image in same box on the page is also not that bad idea, because even after JPEG-izing, artifacts will be less visible.


My another grosz though is using JPEG from the beginning. That way Word shouldn't recompress it (you never know...) and you can provide highest possible quality of JPEG. There is also lossless JPEG compression. Developers from Redmond presumably thought it's not needed, so I won't be surprised if Word doesn't handle such JPEGs. Well, TBH it's not widely supported (even in open source world), just like arithmetic coding (or it's rather even worse situation in case of arithmetic coding).


convert test.png -quality 100 -resize $((100*300/72))% test-300dpi-mitchell.jpg
convert test.png -quality 100 -filter box -resize $((100*300/72))% test-300dpi-box.jpg
convert test.png -quality 100 test.jpg

(In Windows use 416 instead of this $(()) arithmetic expansion available in POSIX shells)


I think that default Mitchell is good one for upscaling, but if you really want such pixelatic image, then go with Box as @ceving suggested. Of course first 2 files are useful only if you must (for some reason) use fake PDF printers.


I've uploaded all three files.


test-300dpi-mitchell.jpg (426 KB) test-300dpi-box.jpg (581 KB) test.jpg (74 KB)


If my hypothesis is right and Word won't recompress JPEG image, then just use the last one not upscaled and go with built-in PDF output, because it has less shortcommings (at least it avoids needless upscale).


Comments

Popular Posts

Use Google instead of Bing with Windows 10 search

I want to use Google Chrome and Google search instead of Bing when I search in Windows 10. Google Chrome is launched when I click on web, but it's Bing search. (My default search engine on Google and Edge is http://www.google.com ) I haven't found how to configure that. Someone can help me ? Answer There is no way to change the default in Cortana itself but you can redirect it in Chrome. You said that it opens the results in the Chrome browser but it used Bing search right? There's a Chrome extension now that will redirect Bing to Google, DuckDuckGo, or Yahoo , whichever you prefer. More information on that in the second link.

linux - Using an index to make grep faster?

I find myself grepping the same codebase over and over. While it works great, each command takes about 10 seconds, so I am thinking about ways to make it faster. So can grep use some sort of index? I understand an index probably won't help for complicated regexps, but I use mostly very simple patters. Does an indexer exist for this case? EDIT: I know about ctags and the like, but I would like to do full-text search. Answer what about cscope , does this match your shoes? Allows searching code for: all references to a symbol global definitions functions called by a function functions calling a function text string regular expression pattern a file files including a file

How do I transmit a single hexadecimal value serial data in PuTTY using an Alt code?

I am trying to sent a specific hexadecimal value across a serial COM port using PuTTY. Specifically, I want to send the hex codes 9C, B6, FC, and 8B. I have looked up the Alt codes for these and they are 156, 182, 252, and 139 respectively. However, whenever I input the Alt codes, a preceding hex value of C2 is sent before 9C, B6, and 8B so the values that are sent are C2 9C, C2 B6, and C2 8B. The value for FC is changed to C3 FC. Why are these values being placed before the hex value and why is FC being changed altogether? To me, it seems like there is a problem internally converting the Alt code to hex. Is there a way to directly input hex values without using Alt codes in PuTTY? Answer What you're seeing is just ordinary text character set conversion. As far as PuTTY is concerned, you are typing (and reading) text , not raw binary data, therefore it has to convert the text to bytes in whatever configured character set before sending it over the wire. In other words, when y

networking - Windows 10, can ping other PC but cannot access shared folders! What gives?

I have a computer running Windows 7 that shares a Git repo on drive D. Let's call this PC " win7 ". This repo is the origin of a project that we push to and pull from. The network is a wireless network. One PC on this network is running on Windows 10. Let's call this PC " win10 ". Win10 can ping every other PC on the network including win7 . Win7 can ping win10 . Win7 can access all shared files on win10 . Neither of the PCs have passwords. Problem : Win10 cannot access any shared files on win7 , not from Explorer, nor from Git Bash or any other Git management system (E-Git on Eclipse or Visual Studio). So, win10 cannot pull/push. Every other PC on the network can access win7 shared files and push/pull to/from the shared Git origin. What's wrong with Windows 10? I have tried these: Control Panel\All Control Panel Items\Network and Sharing Center\Advanced sharing settings\ File sharing is on, Discovery is on, Password protected sharing is off Adapte