Find out which words you use often in a text


You have a document, Document.odt, and you wonder whether you use some words too often. Find out with:

# Strip the XML tags, keep runs of 3+ letters, count them, drop common and one-off words
unzip -p Document.odt content.xml | sed 's/<[^>]*>/ /g' |
sed 's/[^a-zA-Z]/ /g' | grep -Eo "[^ ]{3,}" |
sort | uniq -c |
grep -vwFf ~/words.txt | grep -v "^ *1 " | sort -n

Here words.txt is a list of the most common words in your language, one word per line; we don’t want to see those in the result. Get such a list at http://wortschatz.uni-leipzig.de/html/wliste.html or from sites like http://de.wikipedia.org/wiki/Liste_der_h%C3%A4ufigsten_W%C3%B6rter_der_deutschen_Sprache

You get something like

      2 Beitrag
      2 Effizienz
      2 Hauptteil
      2 Technik
      3 Autor
      4 Collectors
      5 Garbage
      7 Daten

which is really cool.
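
If you want to reuse the counting part on any plain text, not only on ODT files, here is a minimal sketch. The function name frequent_words is made up for this example. It assumes the stop-word file has one word per line, and it treats only a-z and A-Z as letters, so umlauts split words just like in the pipeline above.

```shell
# Sketch: list words that occur more than once in the text on stdin.
# frequent_words is a hypothetical helper name, not an existing tool.
# Usage: frequent_words STOPWORD_FILE < some_text
frequent_words() {
    stopwords=$1
    tr -c 'a-zA-Z' '\n' |       # every non-letter becomes a newline: one word per line
    grep -E '.{3,}' |           # keep words of three or more letters
    sort | uniq -c |            # count identical words
    grep -vwFf "$stopwords" |   # drop lines whose word is in the stop list
    awk '$1 > 1' |              # keep words that were seen more than once
    sort -n                     # rarest first, most frequent last
}
```

For the ODT case you would feed it the tag-stripped XML: unzip -p Document.odt content.xml | sed 's/<[^>]*>/ /g' | frequent_words ~/words.txt. Using awk for the count filter avoids having to match the padding that uniq -c puts in front of the number.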
