Here is a small Bash script for extracting the text and media content from a .docx file. It might be useful if you do not want to open the file with Word. It works with Cygwin and should work with Linux.
GitHub: https://github.com/govolution/stuff/blob/master/xtractdocx.sh
Usage is pretty straightforward:
$ bash xtractdocx.sh test.docx
* extracting media files to ./test.docx1484702626/media
Archive: test.docx
extracting: ./test.docx1484702626/word/media/image1.png
* ls ./test.docx1484702626/media
image1.png
* extracting xml content
Archive: test.docx
inflating: ./test.docx1484702626/word/document.xml
* write xml to txt content to ./test.docx1484702626/document.txt
* cleanup
* done
$ cat ./test.docx1484702626/document.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean rhoncus massa non elitrum dignissim. Nam libero purus, ultrices eu purus et, tempor tincidunt augue. [ … CUT … ]
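The script itself is not reproduced in this post, so here is only a rough sketch of the same steps, not the code from the repository. It relies on the fact that a .docx file is a zip archive with the body text in word/document.xml and images under word/media/. The function names `xml_to_text` and `extract_docx` are made up for illustration, and the output directory naming (file name plus Unix timestamp) is guessed from the transcript above.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the extraction steps; NOT the original xtractdocx.sh.

# Read WordprocessingML on stdin, write plain text on stdout.
xml_to_text() {
    # 1. turn each paragraph end (</w:p>) into a newline
    # 2. drop every remaining XML tag
    # 3. decode the five predefined XML entities (&amp; last)
    sed -e $'s#</w:p>#\\\n#g' \
        -e 's#<[^>]*>##g' \
        -e 's/&lt;/</g' -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g'
}

# Extract media files and text from the .docx given as $1.
# Prints the name of the output directory.
extract_docx() {
    local docx=$1
    # Assumed naming scheme: <file name><epoch seconds>
    local out="./$(basename "$docx")$(date +%s)"
    mkdir -p "$out/media"
    # -j drops the word/media/ prefix; unzip returns non-zero when the
    # document has no images, so that case is tolerated here.
    unzip -q -j -o "$docx" 'word/media/*' -d "$out/media" 2>/dev/null || true
    # -p streams document.xml to stdout without writing it to disk.
    unzip -p "$docx" word/document.xml | xml_to_text > "$out/document.txt"
    echo "$out"
}
```

After sourcing this, `extract_docx test.docx` should leave the images in the media subdirectory and the text in document.txt. Stripping tags with sed is crude: it ignores tabs, line breaks inside paragraphs, footnotes, headers and so on, but it is usually enough for a quick look at a file.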