During the poster session at the Applied Machine Learning Days 2019 in Lausanne, some very interesting questions came up about predicting the outcome of Swiss Federal votes with machine learning models. Answers to these questions can be found here.
Question: So, what is it that makes a vote go one way or the other?
Answer: First of all, it depends on the model. In a logistic regression model (LRM), the words can be extracted and sorted by their weight once the model has been trained. The graph at the right displays the 40 most heavily weighted words for a vote to be accepted or rejected. However, there are over 50,000 words determining the outcome of the prediction.
Words like ‚Bundesbeschluss, Vorlage, Vergütungen, Massnahmen‘ (‚federal decree, bill, remuneration, measures‘) have a positive weight, whereas words like ‚Initiative, Beschwerde, Ausländer, Wirtschaft, Franken‘ (‚popular initiative, complaint, foreigners, economy, Swiss francs‘) weigh in negatively. However, these words alone do not prevent the prediction from being overturned, as more than 50,000 words contribute to the model.
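For illustration, here is a minimal sketch of how such a ranking can be extracted, assuming a scikit-learn pipeline with a unigram bag of words; the texts and labels are placeholders, not the actual training data or pipeline of this study:

```python
# Minimal sketch: extract the most heavily weighted words from a trained
# logistic regression classifier. Texts/labels below are placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["Bundesbeschluss über die Vorlage", "Initiative gegen die Wirtschaft"]
labels = [1, 0]  # 1 = accepted, 0 = rejected

vectorizer = CountVectorizer()            # unigram bag of words
X = vectorizer.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Pair each word with its learned weight and sort by weight.
words = np.array(vectorizer.get_feature_names_out())
order = np.argsort(clf.coef_[0])

print("most negative (rejection):", words[order[:40]])
print("most positive (acceptance):", words[order[-40:]])
```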
For the machine learning models based on neural networks it is more difficult to illustrate how the network ‚decides‘ which way the coin is going to fall. The neural networks used are based on convolutional or long short-term memory (LSTM) layers. These layers work like filters and extract features/words/sequences in a way that is harder to illustrate, especially in the case of text. The neural networks are based on word embeddings, and information about the inner workings of the network can be derived from these embeddings. This will be explained in the next section.
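As a rough illustration of the kind of architecture meant here, a minimal sketch of an embedding layer feeding an LSTM; the vocabulary size and embedding length follow the numbers given later in this text, while the layer width is an assumption for illustration:

```python
# Minimal sketch: an embedding layer feeding an LSTM, ending in a single
# accept/reject probability. Hyperparameters are illustrative assumptions.
from tensorflow.keras import Sequential, layers

vocab_size, embedding_dim = 10_000, 200

model = Sequential([
    layers.Embedding(vocab_size, embedding_dim),   # word index -> vector
    layers.LSTM(64),                               # sequence -> fixed vector
    layers.Dense(1, activation="sigmoid"),         # probability of 'accepted'
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```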
The 40 most positively and negatively weighted words in the logistic regression model (LRM).
Question: What do the clusters based on Word Embeddings (WE) look like?
Answer: The input layer of all the neural networks described in the poster is based on WE. WE allow a word to be represented not just by a single feature weight but by a high-dimensional vector in space. Either pre-trained word vectors can be used or, as in this study, the WEs can be trained in the given context. Words with a similar meaning, or words that turn out to be ‚close‘ in the given context, cluster together in the vector space. These vectors can be extracted after training and displayed graphically after reducing their dimensionality to two or three dimensions (e.g. by Principal Component Analysis (PCA) or by t-distributed Stochastic Neighbor Embedding (t-SNE)).
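A minimal sketch of this dimensionality reduction step, assuming the trained embedding weights are available as a matrix (the random matrix here is only a placeholder for the real weights):

```python
# Minimal sketch: project trained embedding vectors down to 2D for plotting.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

embedding_matrix = np.random.rand(10_000, 200)  # placeholder for real weights

coords_pca = PCA(n_components=2).fit_transform(embedding_matrix)
# t-SNE is slower but often separates clusters more clearly.
coords_tsne = TSNE(n_components=2, metric="cosine").fit_transform(embedding_matrix)
```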
For the LSTM model (and based on an updated version (as of June 2019) of the model used in the poster), the 10,000 most frequent words, each with an embedding vector of length 200, were extracted and uploaded to a publicly available platform for graphical exploration, which can be found here: TensorFlow Projector. Details on how to use this platform and interpret the visuals can be found here.
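The projector accepts a tab-separated file of vectors plus a metadata file with one word per line. A minimal sketch of such an export, with placeholder data standing in for the trained weights and the tokenizer's vocabulary:

```python
# Minimal sketch: export embeddings for projector.tensorflow.org, which
# loads a TSV of vectors and an optional TSV of word labels.
import numpy as np

embedding_matrix = np.random.rand(10_000, 200)           # placeholder weights
index_to_word = {i: f"word_{i}" for i in range(10_000)}  # placeholder vocab

with open("vectors.tsv", "w", encoding="utf-8") as vec_f, \
     open("metadata.tsv", "w", encoding="utf-8") as meta_f:
    for i in range(embedding_matrix.shape[0]):
        vec_f.write("\t".join(f"{x:.6f}" for x in embedding_matrix[i]) + "\n")
        meta_f.write(index_to_word[i] + "\n")
```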
The visual shows the word cloud based on PCA in the TensorFlow Projector. Each dot represents one word and its position in the vector space reduced to 3 dimensions (distance measure: cosine similarity).
The projector allows you to look up individual words (upper right corner) and display their neighbors in space. Several bookmarks have been prepared and saved (see the lower right corner of the projector).
The content-related proximity of many of these words is intuitive. For instance, when looking up ‚Europe‘ (Europa), a hot political topic in Switzerland, words such as ‚chicane‘ (Schikane), ‚absurd‘ (widersinnig), ‚monetary policy‘ (Geldpolitik) or ‚people’s rights‘ (Volksrechte) show up.
It is remarkable that the model has learned these similarities/proximities from the data ‚without further instruction‘, simply because doing so ‚helps the classifier‘.
More interesting word similarities can be found in the bookmarks in the projector or by searching for any word in the word cloud.
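For readers who prefer code over clicking, a minimal sketch of the neighbor lookup the projector performs, using cosine similarity; the embedding matrix and the lookup tables are assumed to come from the trained model and its tokenizer:

```python
# Minimal sketch: find the words closest to a query word by cosine
# similarity in the embedding space (what the projector does on lookup).
import numpy as np

def nearest_neighbors(query, embedding_matrix, word_to_index, index_to_word, k=10):
    v = embedding_matrix[word_to_index[query]]
    # Cosine similarity between the query vector and every word vector.
    norms = np.linalg.norm(embedding_matrix, axis=1) * np.linalg.norm(v)
    sims = embedding_matrix @ v / norms
    best = np.argsort(-sims)[1:k + 1]  # skip the query word itself
    return [(index_to_word[i], float(sims[i])) for i in best]
```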
Question: The pamphlets contain the recommendation from the government and the parliament. Doesn’t the model just follow this recommendation and that’s why it’s so accurate?
Answer: It is true that the recommendation by the government predicted the outcome of the vote correctly with an accuracy of 86% over the 15 votes between May 2017 and February 2019 (see the graph at the right). However, there are several reasons why none of the models relies on these recommendations:
- The words ‚recommendation‘ and ‚recommendations‘ (‚Empfehlung‘/‚Empfehlungen‘) used in this context in the pamphlet do not belong to the most frequent words of the corpus used in any of the neural networks (word frequency ranks 6414 and 15488). Therefore, these words do not play a significant role in the predictions.
- In the LRM these words have a very low weight (.0017 and .00038) and therefore little impact. They rank at positions 4673 and 10898 in the sorted list of the most heavily weighted words.
- In the LRM all words are extracted individually (unigrams in a bag of words) and no word sequences are considered in the model. Therefore, a word sequence such as ‚recommendation: yes/no‘ is never formed; only the count and weight of each individual word is taken into account (see the sketch below).
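A minimal sketch of this unigram point, using scikit-learn's CountVectorizer as a stand-in for the actual feature extraction; the texts are invented examples:

```python
# Minimal sketch: with ngram_range=(1, 1) only single words are counted,
# so a phrase like 'Empfehlung: Ja' never enters the model as a unit.
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Empfehlung: Ja zur Vorlage", "Empfehlung: Nein zur Initiative"]
vectorizer = CountVectorizer(ngram_range=(1, 1))  # unigrams only
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
# ['empfehlung' 'initiative' 'ja' 'nein' 'vorlage' 'zur'] -- no sequences
```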
The accuracy of predicting the outcome of the vote from the governmental recommendation is 86% over the last 15 votes. All models are more accurate than the recommendation and, most importantly, these recommendations are not part of the models.
Question: Isn’t there a risk that these models could be abused for voter manipulation?
Answer: This is not very likely. Knowing which words to use more than others when arguing about a bill might seem advantageous, at least at first glance. However, some words are inevitable, such as ‚popular initiative‘ if that is what you are proposing, and it does have a negative predictive value. But most of all, you cannot foresee what the arguments of the opponents are going to be, and these arguments are equally important for the models. Finally, in my opinion it is the debate that defines the content of the voter information, not the content of the voter information that defines the debate.
But the question is up for debate, and any feedback on this subject, or any other, is highly welcome.
Question: What is the margin of error of the model?
Answer: The margin of error of the logistic regression classifier model is approximately +/- 7%.
This is based on the total number of classification errors so far: 6 within the 54 records of the validation set and 1 error in the 19 records of the test set (predictions since 2017), i.e. 7 errors in n=73 records, an error rate of about .096. The following formula applies:
margin of error +/- = 1.96 (const) * sqrt( (.096 (error) * (1 – .096 (error))) / 73 (n) ) ≈ .068
(the const=1.96 defines the confidence interval at 95%)
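The same computation in a few lines of Python, for verification:

```python
# Quick check of the margin-of-error formula above.
from math import sqrt

errors, n = 7, 73            # 6 validation errors + 1 test error
p = errors / n               # observed error rate, ~0.096
margin = 1.96 * sqrt(p * (1 - p) / n)
print(f"error rate {p:.3f}, margin of error +/- {margin:.3f}")  # ~ +/- 0.068
```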