You get out what you put in
Since all my machine learning models were giving me similar but modest accuracy, I looked at playing with the data that is fed into them. This made an enormous difference to the accuracy achieved.
As a reminder of my results: the top five models were giving an accuracy of about 81 +/- 5%. They all used TF-IDF to create a DTM (document-term matrix) where "the entire document is thus a feature vector, and each feature vector corresponds to a point in a vector space" (StackOverflow).
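To make the document-term matrix concrete, here is a minimal sketch in plain Python. It is not the Spark pipeline I actually ran: the function name tf_idf_matrix, the whitespace tokenizer and the particular smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_idf_matrix(docs):
    """Return (vocab, matrix); matrix[i][j] is the TF-IDF weight of
    term vocab[j] in document i. Illustrative only."""
    # Naive tokenizer (an assumption): lower-case and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # One common smoothed IDF variant (assumed here): log((N + 1) / (df + 1)).
    idf = {term: math.log((n_docs + 1) / (df[term] + 1)) for term in vocab}
    matrix = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts for this document
        matrix.append([tf[term] * idf[term] for term in vocab])
    return vocab, matrix

vocab, matrix = tf_idf_matrix(["the cat sat", "the dog barked"])
```

Each row of matrix is one document's feature vector. A term that appears in every document (here "the") gets a weight of zero, which is the information-retrieval flavour of TF-IDF showing through.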
There was very little improvement using bagging (although, admittedly, I am told that as a rule of thumb one should use as many different models in bagging as one has classes and this I did not do).
The great leap forward came in representing the training data instead as a term-class matrix. In each case, I used a bag-of-words approach on the corpus ("bag-of-words refers to what kind of information you can extract from a document namely, unigram words" - ibid).
Firstly, I tried using Spark's Word2Vec implementation. Word2Vec is a word-embedding technique, that is, "a means of building a low-dimensional vector representation from corpus of text, which preserves the contextual similarity of words" (Quora).
[If you read much about Word2Vec, you will see the expression distributed vector used frequently. "In distributed representations of words and paragraphs, the information about the word is distributed all along the vector. So, every position in the vector may be non-zero for a given word. Contrast this with the one-hot encoding of words, where the representation of a word is all 0s except for a 1 in one position for that word" (Quora).]
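The contrast between one-hot and distributed vectors can be sketched in a few lines of Python. The dense vectors below are made-up numbers purely for illustration (they are not the output of any Word2Vec model), and the helper names one_hot and cosine are hypothetical.

```python
import math

vocab = ["cat", "dog", "car"]

def one_hot(word, vocab):
    """All zeros except a single 1 at the word's position."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Made-up dense ("distributed") vectors: every position may be non-zero.
dense = {
    "cat": [0.8, 0.1, 0.3],
    "dog": [0.7, 0.2, 0.4],
    "car": [-0.5, 0.9, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

one_hot_similarity = cosine(one_hot("cat", vocab), one_hot("dog", vocab))
dense_cat_dog = cosine(dense["cat"], dense["dog"])
dense_cat_car = cosine(dense["cat"], dense["car"])
```

Any two distinct one-hot vectors are orthogonal, so they carry no notion of similarity between words. With dense vectors, similar words can sit close together (here "cat" and "dog" by construction).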
Vered Shwartz says: "It was common to apply dimensionality reduction to the word co-occurrence matrix (count based vectors) to get low-dimensional word vectors. They perform similarly or slightly worse than word embeddings (it depends who you ask...)". She then cites two different papers that found radically different results.
I tried using Word2Vec embeddings on a Spark neural net for the "20 Newsgroups" data but only got the monkey score (5.02%, no better than guessing at random among the 20 classes) with lots of "StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to ..." WARNings originating in optimize.LBFGS code.
Interestingly, Random Forest gave me 82.6% on exactly the same data. Nice, but not a massive improvement, so I concluded that using Word2Vec did not significantly affect the results.
Using a matrix where the rows are the classes rather than the documents and the columns are still word vectors (representing the probability of this word being in a given class) resulted in much better predictions. Spark's neural nets, random forest and naive Bayes all achieved an accuracy of about 93 +/- 0.5%.
Note that the words were counted manually without using Spark's IDF class. Because there are no collisions with this technique, about 200 more words were captured (about 2% extra) but this did not seem to make any difference. The difference appears to be entirely due to adding the word vectors (with only 20 elements) together for each sentence in the vector space model.
Also note that the columns (i.e., the terms) were all normalized (rather than scaled or standardized; see here for the differences). That is, they represented the (frequentist's) probability of that term being in any given class. If these vectors were not normalized (i.e., they remained the count of that term being in any given class) then the accuracy of the neural net dropped back down to about 81%.
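The term-class representation described above can be sketched roughly as follows. This is plain Python rather than my Spark code, and it is only one reading of the description: the function names term_class_vectors and document_vector, the whitespace tokenizer and the toy two-class data are all assumptions for illustration (the real data had 20 classes, hence 20-element vectors).

```python
from collections import Counter, defaultdict

def term_class_vectors(labelled_docs, classes):
    """Map each term to a vector with one element per class.
    Each vector is normalized to sum to 1, so element i is the observed
    fraction of that term's occurrences that fall in classes[i]."""
    counts = defaultdict(Counter)  # term -> (class -> raw count)
    for label, text in labelled_docs:
        for term in text.lower().split():
            counts[term][label] += 1
    vectors = {}
    for term, per_class in counts.items():
        total = sum(per_class.values())
        # Normalize the raw counts; skipping this step leaves plain counts.
        vectors[term] = [per_class[c] / total for c in classes]
    return vectors

def document_vector(text, vectors, n_classes):
    """Feature vector for a document: the element-wise sum of the
    vectors of its terms. Unseen terms contribute nothing."""
    doc = [0.0] * n_classes
    for term in text.lower().split():
        for i, p in enumerate(vectors.get(term, [0.0] * n_classes)):
            doc[i] += p
    return doc

classes = ["sport", "tech"]
training = [("sport", "goal match goal"), ("tech", "cpu match")]
vectors = term_class_vectors(training, classes)
features = document_vector("goal match unknown", vectors, len(classes))
```

The resulting feature vector has only as many elements as there are classes, and each element already points towards one of the outputs the classifier is being asked to predict.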
Finally, using this representation of the feature vectors led to the KMeans algorithm giving more sensible results (an average error per cluster of 86.4%).
It was explained to me (by a machine learning PhD) that the improvement in accuracy that comes with a term-class matrix is a result of using an input model that better approximates what you want as an output. "TF-IDF combines the importance-per-document and the importance-per-corpus introducing other information not necessarily supporting your classification problem directly," he said. "TF-IDF was developed for Information Retrieval."