danielsauder

IT security is a matter of trust.

Extract text and media content from docx

Here is a small bash script for extracting the text and media content from a docx file. Might be useful if you do not want to open the file with word. Works with cygwin, should work with linux.
Github: https://github.com/govolution/stuff/blob/master/xtractdocx.sh

xtractdocx

Usage is pretty straight forward:

$ bash xtractdocx.sh test.docx
* extracting media files to ./test.docx1484702626/media
Archive: test.docx
extracting: ./test.docx1484702626/word/media/image1.png
* ls ./test.docx1484702626/media
image1.png
* extracting xml content
Archive: test.docx
inflating: ./test.docx1484702626/word/document.xml
* write xml to txt content to ./test.docx1484702626/document.txt
* cleanup
* done
$ cat ./test.docx1484702626/document.txt

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean rhoncus massa non elitrum dignissim. Nam libero purus, ultrices eu purus et, tempor tincidunt augue. [ … CUT … ]

Published by

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: