Python CLI tool to generate customized word clouds from documents, especially large documents such as dissertations and master's or bachelor's theses (or export your WhatsApp chat and use it on your conversations with different people!). Depends on wordcloud and nltk.
Don't want to work with the command line? Use the Jupyter notebook instead (see instructions and examples below).
Example text: saved as example.txt. This is a text file containing the book "The Count of Monte Cristo".
python generate_cloud.py
By default, the text is read from a file called "doc.txt", so be sure to copy your thesis to your working directory and rename it to "doc.txt".
Alternatively, use a command line argument to change the name of the input file.
python generate_cloud.py -whatsapp
This pre-processes the WhatsApp chat export file to exclude dates and other text that WhatsApp adds when generating the export (e.g. the "Media omitted" text that is inserted in place of sent media).
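The pre-processing described above can be sketched roughly as follows. This is not the tool's actual code: the helper name `clean_whatsapp_export`, the header pattern, and the placeholder list are assumptions for illustration, and real export files differ by platform and locale.

```python
import re

# Assumed line shape of an export: "12/31/20, 9:15 PM - Alice: message".
# Real exports vary by platform and locale, so this pattern is illustrative only.
HEADER = re.compile(
    r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?:\s?[AP]M)? - (?:[^:]+: )?"
)
# Placeholder text assumed to stand in for media that was not exported.
PLACEHOLDERS = ("<Media omitted>",)


def clean_whatsapp_export(text):
    """Return only the message bodies of an exported chat (hypothetical helper)."""
    cleaned = []
    for line in text.splitlines():
        body = HEADER.sub("", line)  # drop date, time and sender
        for placeholder in PLACEHOLDERS:
            body = body.replace(placeholder, "")
        body = body.strip()
        if body:  # skip lines that are empty after cleaning
            cleaned.append(body)
    return "\n".join(cleaned)


sample = (
    "12/31/20, 9:15 PM - Alice: Happy new year!\n"
    "12/31/20, 9:16 PM - Bob: <Media omitted>\n"
    "12/31/20, 9:17 PM - Bob: See you tomorrow"
)
print(clean_whatsapp_export(sample))
```

Lines that do not start with a date header (continuations of multi-line messages) are kept unchanged in this sketch.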
A number of different parameters can be customized:
Parameter | Command Line Argument | Type |
---|---|---|
Name of input file | -file_path | string |
Text color | -hue | integer (or None for random word colors) |
Stopwords | -sw | list (NOTE: these stopwords are added to the generic stopwords; they do not replace them) |
Background color | -bg | string |
Image width (pixels) | -w | integer |
Image height (pixels) | -height | integer |
Maximum number of words to display | -maxwords | integer |
Ratio of words to display horizontally | -h_ratio | float (from 0 to 1) |
Saturation | -s | integer (from 0-100) |
Lightness | -l | integer (from 0-100) |
File name to store output | -o | string (NOTE: should end with '.png') |
Words to replace in text | -x1 | string (NOTE: can be multiple strings) (NOTE: always needs to be used together with -x2) |
Substitutes for words passed in -x1 | -x2 | string (NOTE: can be multiple strings) (NOTE: always needs to be used together with -x1) |
WhatsApp export-file usage | -whatsapp | none (simply add "-whatsapp"); use when a WhatsApp chat export file is used as text |
Matrix effect | -matrix | none (simply add "-matrix"); the program then automatically sets all parameters for a matrix-like word cloud (see below for example) |
Example:
python generate_cloud.py -file_path my_thesis_final_version.txt -bg black -h_ratio 0.6 -o wordcloud_thesis.png
- This example will take a text file named 'my_thesis_final_version.txt' and save the wordcloud to 'wordcloud_thesis.png'. The word cloud will have a black background and only 60% of the words will be displayed horizontally (and 40% vertically).
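To make the example command more concrete, here is a minimal sketch of how such flags could be parsed and handed to the wordcloud library. The flag names come from the table above and `background_color` and `prefer_horizontal` are real `WordCloud` keyword arguments, but the mapping between them, the helper names, and all defaults other than "doc.txt" are assumptions, not the script's actual code.

```python
import argparse


def build_parser():
    """Declare only the flags used in the example (defaults are placeholders)."""
    parser = argparse.ArgumentParser(description="Sketch of the word cloud CLI")
    parser.add_argument("-file_path", default="doc.txt")
    parser.add_argument("-bg", default="white")  # assumed default
    parser.add_argument("-h_ratio", type=float, default=1.0)  # assumed default
    parser.add_argument("-o", default="wordcloud.png")  # assumed default
    return parser


def wordcloud_kwargs(args):
    """Map parsed flags to WordCloud keyword arguments (assumed mapping)."""
    return {
        "background_color": args.bg,
        "prefer_horizontal": args.h_ratio,
    }


command = (
    "-file_path my_thesis_final_version.txt -bg black "
    "-h_ratio 0.6 -o wordcloud_thesis.png"
)
args = build_parser().parse_args(command.split())
print(args.file_path, args.o)
print(wordcloud_kwargs(args))
```

With wordcloud installed, the resulting dictionary could be passed on as `WordCloud(**wordcloud_kwargs(args))`.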
If you don't want to use the command line, you can use the Jupyter Notebook instead:
- Install Jupyter Notebook
- Download the GitHub repository
- Open the notebook
- Replace example.txt with the name of your text file / thesis (in the notebook), or save your file in the same folder as the Jupyter notebook and rename it example.txt
- Go to Cell
- Click Run all
- Check your working directory: the word cloud image should now be saved there under a name similar to wc_Size1500_1000_hslColorH322 (unless you changed the parameter for the output)
A few examples of different custom settings and the results:
- Regular usage:
python generate_cloud.py
Let's change 'count' to 'Simon Hastings' ( ...looking at you, Bridgerton... ) and use a black background.
- Custom usage:
python generate_cloud.py -x1 count -x2 Simon_Hastings -f example.txt -o bridgerton2.png -bg black
I only replaced one word (count -> simon hastings), but multiple words can be replaced at the same time.
E.g.: -x1 count Monte_Cristo -x2 simon_hastings London
changes "count" to "simon hastings" and "Monte Cristo" to "London".
Note that words that belong together, such as "Monte Cristo", should be connected with an underscore.
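The pairing of -x1 and -x2 entries can be pictured with a short sketch. The function name `replace_words` is hypothetical, and the whole-word, case-insensitive matching is an assumption; the script itself may substitute differently.

```python
import re


def replace_words(text, targets, substitutes):
    """Replace each target with the substitute at the same position.

    Underscores in either list stand for spaces, as in the CLI examples.
    Whole-word, case-insensitive matching is an assumption of this sketch.
    """
    if len(targets) != len(substitutes):
        raise ValueError("-x1 and -x2 need the same number of entries")
    for target, substitute in zip(targets, substitutes):
        phrase = target.replace("_", " ")
        replacement = substitute.replace("_", " ")
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        # A function is used so the replacement is always inserted literally.
        text = pattern.sub(lambda match, r=replacement: r, text)
    return text


text = "The count left Monte Cristo at dawn."
print(replace_words(text, ["count", "Monte_Cristo"], ["simon_hastings", "London"]))
```

Because the two lists are zipped, the first -x1 entry is replaced by the first -x2 entry, the second by the second, and so on, which is why both arguments always need to be passed together.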
- Matrix usage:
python generate_cloud.py -matrix
This automatically creates a word cloud with a matrix-like style. This specific word cloud was generated from a WhatsApp chat export file with the "-whatsapp" option, and I used -x1/-x2 to censor names and addresses. With -matrix you can still specify "-whatsapp" and the input (-f) and output (-o) files.
Custom usage:
* Left (saturation and lightness adjusted): python generate_cloud.py -s 25 -l 90
* Right (allow for random word colors): python generate_cloud.py -hue None
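One way hue, saturation and lightness could be turned into word colors is an HSL color function of the kind wordcloud accepts through its `color_func` parameter. The sketch below is an assumption about the approach, not the script's code: `make_color_func` and its default values are hypothetical, and it takes a Python `None` where the command line passes the text "None".

```python
import random


def make_color_func(hue=None, saturation=50, lightness=50):
    """Build a color function returning HSL strings (hypothetical helper).

    The saturation and lightness defaults are placeholders, not the tool's.
    """
    def color_func(word=None, font_size=None, position=None,
                   orientation=None, random_state=None, **kwargs):
        # With no fixed hue, every word gets its own random hue.
        h = random.randint(0, 359) if hue is None else hue
        return "hsl({}, {}%, {}%)".format(h, saturation, lightness)
    return color_func


pale = make_color_func(hue=322, saturation=25, lightness=90)
print(pale("thesis"))

rainbow = make_color_func(hue=None)
print(rainbow("thesis"))
```

With wordcloud installed, such a function could be supplied as `WordCloud(color_func=pale)`; the hue value 322 is only borrowed from the sample output file name.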