wordlist-generator
This project uses NLTK to generate word lists and applies several filtering and formatting steps to them.
The main code is in Words.ipynb (a Jupyter Python notebook).
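As a rough illustration of the kind of step involved, here is a minimal sketch and not the notebook's actual code. It assumes the NLTK "words" corpus as the source; the function names, the length limits, and the choice of filters (lowercase, ASCII letters only, de-duplicated) are hypothetical.

```python
import re
from typing import Iterable, List


def load_nltk_words() -> List[str]:
    """Load the NLTK 'words' corpus (assumed source; needs nltk installed)."""
    import nltk  # imported lazily so filter_words works without nltk
    from nltk.corpus import words

    nltk.download("words", quiet=True)  # fetches the corpus on first use
    return words.words()


def filter_words(candidates: Iterable[str], min_len: int = 3, max_len: int = 8) -> List[str]:
    """Hypothetical filter: lowercase, keep a-z words within a length range,
    drop duplicates, and return the result sorted."""
    letters_only = re.compile(r"[a-z]+")
    kept = set()
    for word in candidates:
        w = word.strip().lower()
        if min_len <= len(w) <= max_len and letters_only.fullmatch(w):
            kept.add(w)
    return sorted(kept)
```

For example, filter_words(["Apple", "x", "co-op", "banana", "apple"]) returns ["apple", "banana"]; passing load_nltk_words() instead would filter the whole corpus.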
A few files were downloaded from various websites to compile bad words.
The company-names file was generated from a dump of all registered companies in the SEC database, unzipped and processed as follows:
for x in $(ls -1); do jq .entityName $x; done | grep -v null | grep -v \"\" | sed 's/\"//g' > ../company-names.txt
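For anyone who prefers to stay in Python, the same extraction could be sketched roughly as below. This is an assumption-laden equivalent, not what was actually run: the function names are hypothetical, and the behaviour differs slightly from the shell pipeline. It skips only missing, null, or empty entityName values (the grep above also drops any name that happens to contain the text "null"), and it leaves quote characters inside names untouched.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_entity_names(json_dir: Path) -> Iterator[str]:
    """Yield the non-empty 'entityName' of every JSON file in json_dir."""
    for path in sorted(json_dir.iterdir()):
        if not path.is_file():
            continue
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not valid JSON
        name = record.get("entityName") if isinstance(record, dict) else None
        if isinstance(name, str) and name:
            yield name


def write_company_names(json_dir: Path, out_file: Path) -> int:
    """Write one company name per line to out_file; return how many were written."""
    names = list(iter_entity_names(json_dir))
    out_file.write_text("".join(n + "\n" for n in names), encoding="utf-8")
    return len(names)
```

Run from the directory holding the unzipped dump, write_company_names(Path("."), Path("../company-names.txt")) would produce the same kind of file as the command above.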
The code takes a few hours to run on a modern laptop. It could be parallelized, but it is fast enough that this has not been necessary.
Note that the bad-word list could be improved significantly. Removing words by prefix generally works well, but some entries should not be treated as prefixes (maybe add a space to those entries in the text file?).
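One way the space idea might work is sketched below. This is not implemented in the notebook; the convention that a trailing space marks a whole-word-only entry, the function names, and the sample entries are all assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple


def parse_bad_entries(lines: Iterable[str]) -> Tuple[Tuple[str, ...], Set[str]]:
    """Split bad-word entries into prefix entries and whole-word entries.

    Assumed convention: an entry ending in a space matches whole words only;
    any other entry removes every word that starts with it.
    """
    prefixes: List[str] = []
    exact: Set[str] = set()
    for raw in lines:
        entry = raw.rstrip("\r\n").lower()
        if not entry.strip():
            continue  # ignore blank lines
        if entry.endswith(" "):
            exact.add(entry.strip())
        else:
            prefixes.append(entry.strip())
    return tuple(prefixes), exact


def remove_bad_words(words: Iterable[str], prefixes: Tuple[str, ...], exact: Set[str]) -> List[str]:
    """Drop words that equal a whole-word entry or start with a prefix entry."""
    return [
        w for w in words
        if w.lower() not in exact and not w.lower().startswith(prefixes)
    ]
```

With the entries "damn" and "hell " (note the trailing space), "damned" and "hell" are removed while "hello" and "shell" are kept.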