\chapter{Method}

StackExchange introduced a \emph{new contributor} indicator to all communities on the $21^{st}$ of August 2018 at 9 pm UTC\footnote{\label{post2018come}\url{https://meta.stackexchange.com/questions/314287/come-take-a-look-at-our-new-contributor-indicator}}. This step is one of many StackExchange took to make the platform and its members more welcoming towards new users. The indicator is shown to potential answerers in the answer text box of a question from a new contributor, as shown in figure \ref{newcontributor}. The indicator is added to a question if the question is the first contribution of the user, or if the first contribution (question or answer) of the user was made less than 7 days earlier\footnote{\label{sonic2018what}\url{https://meta.stackexchange.com/questions/314472/what-are-the-exact-criteria-for-the-new-contributor-indicator-to-be-shown}}. The indicator is then shown for 7 days from the creation date of the question. Note that a user may have been registered for a long time before posting their first question; that question still counts as a question from a new contributor. Likewise, if a user deletes all their existing contributions from the site and then creates a new question, this question will again have the \emph{new contributor} indicator attached. The sole deciding factor for the indicator is the date and time of the first non-deleted contribution and the 7-day window afterward.
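The eligibility rule can be expressed compactly in code. The following is a minimal sketch of the 7-day window check, assuming timestamps have already been parsed; the function and parameter names are illustrative and not part of the StackExchange data model:

\begin{verbatim}
from datetime import datetime, timedelta

INDICATOR_WINDOW = timedelta(days=7)

def shows_new_contributor_indicator(question_created: datetime,
                                    first_contribution: datetime) -> bool:
    # The indicator is attached if the question falls within 7 days
    # of the user's first non-deleted contribution (question or answer).
    return question_created - first_contribution < INDICATOR_WINDOW
\end{verbatim}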
\begin{figure}
\centering\includegraphics[scale=0.47]{figures/new_contributor}
\caption{The answer box a potential answerer sees when viewing a question from a new contributor. \copyright{Tim Post, 2018, \url{https://meta.stackexchange.com/users/50049/tim-post}}\footref{post2018come}}
\label{newcontributor}
\end{figure}
% about the change
% https://meta.stackexchange.com/questions/314287/come-take-a-look-at-our-new-contributor-indicator \cite{post2018come}
% https://meta.stackexchange.com/questions/314472/what-are-the-exact-criteria-for-the-new-contributor-indicator-to-be-shown \cite{sonic2018what} ; change date = 2018-08-21T21:04:49.177
% new user indicator visible for 1 week ...

%TODO state plots of sec 5 here and why these were chosen
% -> also limitations, other factors

This thesis investigates the following criteria to determine whether the change affected a community positively or negatively, or whether the community is largely unaffected:

\begin{itemize}
\item \textbf{Sentiment of answers to a question}. This captures the quality of communication between different individuals. Better values indicate better communication. Through the display of the \emph{new contributor} indicator, answerers should react less negatively towards new users when they behave outside the community standards.
\item \textbf{Vote score of questions}. This captures the feedback the community gives to a question. Voters will likely vote more positively (not voting instead of down-voting, or up-voting instead of not voting) due to the \emph{new contributor} indicator. The vote score should therefore increase after the change.
\item \textbf{Number of first and follow-up questions}. This captures the willingness of users to participate in the community. A higher number of first questions indicates a higher number of new participating users. More follow-up questions indicate that users are more willing to stay within the community and continue their active participation.
\end{itemize}

If these criteria improve after the change is introduced, the community is affected positively. If they worsen, the community is affected negatively. If the criteria stay largely the same, the community is unaffected. It is important to note that a question may receive answers and votes after the \emph{new contributor} indicator is no longer shown; these are therefore not considered part of the data set to analyze.

%only when new contributor indicator is shown

To measure the effect of the change on sentiment, this thesis utilizes the Vader \cite{hutto2014vader} sentiment analysis tool. This decision is based on its performance in analyzing and categorizing microblog-like texts, its processing speed, and its simplicity of use. Vader uses a lexicon of words and rules related to grammar and syntax. This lexicon was manually created by \citeauthor{hutto2014vader} and is therefore considered a \emph{gold standard lexicon}. Each word has a sentiment value attached to it. Negative words, for instance, \emph{evil}, have negative values; good words, for instance, \emph{brave}, have positive values. The range of these values is continuous, so words can have different intensities; for instance, \emph{bad} has a less negative value than \emph{evil}. This ability to distinguish intensities makes Vader a valence-based approach.

However, simply looking at the individual words in a text is not enough; therefore, Vader also uses rules to determine how words are used in conjunction with other words. Some words can boost other words: for example, ``They did well.'' is less intense than ``They did extremely well.''. This works for both positive and negative sentences. Moreover, words can have different meanings depending on the context, for instance, ``Fire provides warmth.'' and ``The boss is about to fire an employee.''. This feature is called \emph{Word Sense Disambiguation}.

Furthermore, Vader also detects language features commonly found in social media texts which may not be present in other forms of text, for instance, books or newspapers. Social media texts may contain acronyms and initialisms (for instance, \emph{afaik} for ``as far as I know''), slang words, emojis, capitalized words (often used to emphasize meaning), punctuation (for instance, \emph{!!!} and \emph{?!?!}), etc. These features can convey a lot of meaning and drastically change the sentiment of a text.

After all these features are considered, Vader assigns a sentiment value between $-1$ and $1$ on a continuous scale. This range is divided into 3 classes: negative ($-1$ to $-0.05$), neutral ($-0.05$ to $0.05$), and positive ($0.05$ to $1$). The outer edges of the range are rarely reached, as the text would have to be extremely negative or positive.
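The mapping from compound score to class can be written as a small helper. This is a sketch of the thresholds given above, not code from the Vader library itself:

\begin{verbatim}
def classify(compound: float) -> str:
    # Map a Vader compound score in [-1, 1] to one of the 3 classes.
    if compound <= -0.05:
        return "negative"
    if compound >= 0.05:
        return "positive"
    return "neutral"
\end{verbatim}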
%speed

Because the underlying computation is simple, Vader is very fast at computing a sentiment value for a given text. Speed is one of the requirements \citeauthor{hutto2014vader} originally posed: Vader should be fast enough to do online (real-time) analysis of social media text.

%simplicity

Vader is also easy to use. It does not require any pre-training on a dataset, as it already comes with a human-curated lexicon and rules related to grammar and syntax. Therefore, the sentiment analysis only requires an input text to evaluate. This thesis uses a publicly available implementation of Vader.\footnote{\url{https://github.com/cjhutto/vaderSentiment}}
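In practice, obtaining a sentiment value with this implementation is a single method call, as the following minimal usage sketch shows:

\begin{verbatim}
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("They did extremely well!")
# scores is a dict with 'neg', 'neu', 'pos', and 'compound' entries;
# 'compound' is the overall sentiment value in [-1, 1].
print(scores["compound"])
\end{verbatim}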
The design of Vader thus allows for fast and verifiable analysis.

% lexicon approach
% valence-based (sentiment intensity, (-1,1) continuous)
% detect grammatical features
% detects many language features present in the social media domain (acronyms, initialisms, slang, punctuation, caps words, ...)
% wsd
% designed to do online processing

% sentiment calculation via vaderlib, write whole paragraph and explain, also add ref to paper \cite{hutto2014vader}

\section{Data gathering and preprocessing}
StackExchange provides anonymized data dumps of all their communities for researchers to investigate at no cost on archive.org\footnote{\label{archivestackexchange}\url{https://archive.org/download/stackexchange}}. These data dumps contain users, posts (questions and answers), badges, comments, tags, votes, and a post history containing all versions of posts. Each entry contains the necessary information, for instance, id, creation date, title, and body, as well as how the data is linked together (which user posted a question/answer/comment). However, not all data entries are valid and usable in the analysis, for instance, questions or answers whose user is unknown; this affects only a very small number of entries. Therefore, before the actual analysis, the data has to be cleaned. Moreover, the answer texts are in HTML format, containing tags that could skew the sentiment values; these need to be stripped away beforehand. Additionally, answers may contain code sections, which would also skew the results and are therefore omitted.
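The thesis does not depend on a particular HTML library; as one possible approach, the stripping of tags and code sections can be sketched with BeautifulSoup (the choice of library is an assumption for illustration):

\begin{verbatim}
from bs4 import BeautifulSoup

def strip_html(body: str) -> str:
    # Remove code sections entirely, then drop all remaining tags,
    # keeping only the plain text for sentiment analysis.
    soup = BeautifulSoup(body, "html.parser")
    for tag in soup.find_all(["pre", "code"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
\end{verbatim}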
% data sets as xml files from archive.org \cite{archivestackexchange}

%cleaning data
% broken entries, missing user id
% answers in html -> strip html and remove code sections, no contribution to sentiment

After preprocessing the raw data, the relevant data is filtered and derived values are computed. Questions and answers are mixed together in the data, so they have to be separated and answers have to be linked to their questions. Also, questions in these datasets do not have the \emph{new contributor} indicator attached to them, and neither do users. Therefore, the first contribution date and time of each user has to be calculated from the creation dates of the questions and answers the user has posted. Then, questions are filtered per user by whether they were created within the 7-day window after the first contribution of the user. These are the questions for which the \emph{new contributor} indicator was displayed (if the question was posted after the change) or would have been displayed (if it was posted before the change). From these questions, all answers which arrived within the 7-day window are considered for the analysis. Answers which arrived later are excluded, as the answerer most likely did not see the disclaimer shown in figure \ref{newcontributor}. Included answers are then analyzed with Vader and the resulting sentiments are stored. Furthermore, votes on questions of new contributors are counted if they arrived within the 7-day window; an upvote counts as $+1$ and a downvote as $-1$. Moreover, the questions new contributors ask are counted and divided into two classes: the first question of a user and follow-up questions of a new contributor.
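A condensed sketch of this filtering step, assuming posts have already been parsed into records of user id and creation date (the field and function names are hypothetical, not taken from the thesis code):

\begin{verbatim}
from datetime import timedelta

WINDOW = timedelta(days=7)

def first_contributions(posts):
    # Earliest creation date per user over all questions and answers.
    first = {}
    for user_id, created in posts:
        if user_id not in first or created < first[user_id]:
            first[user_id] = created
    return first

def in_window(created, user_id, first):
    # True if a post falls within the user's 7-day new-contributor window.
    return created - first[user_id] < WINDOW

def question_score(votes):
    # Vote score within the window: +1 per upvote, -1 per downvote.
    return sum(1 if is_upvote else -1 for is_upvote in votes)
\end{verbatim}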
% calc sentiment for answers
% questions do not have a tag if from a new contributor -> calc first contribution
% filter questions for new contributor
% calc sentiment of all answers posted within 7 days of question creation
% collect sentiments

% differences in avg sentiment
% look at plots and write something that fits

\section{Analysis}
An interrupted time series (ITS) analysis captures trends before and after a change in a system and therefore fits the question this thesis investigates very well. ITS can be applied to a large variety of data, as long as the data contains the same kind of data points before and after the change and the change date and time are known. \citeauthor{bernal2017interrupted} published a tutorial on how ITS works \cite{bernal2017interrupted}. ITS performs well on medical data: for instance, when a new treatment is introduced, ITS can visualize whether the treatment improves a condition. ITS requires no control group, which is useful because control groups are often not feasible; it only works with the data before and after the change and a point in time where the change was introduced.

ITS relies on linear regression and tries to fit a three-segment linear function to the data. The authors also describe models with more segments, but these quickly raise the complexity of the analysis, and for this thesis a three-segment linear regression is sufficient. The three segments are lines that fit the data before and after the change, as well as one line connecting the other two at the change date. Figure \ref{itsexample} shows an example of an ITS. Each effect is captured by one term of the following formula: $Y_t = \beta_0 + \beta_1 T + \beta_2 X_t + \beta_3 T X_t$, where $T$ represents time as a number, for instance, the number of months since the start of data recording, $X_t$ is 0 before and 1 after the change takes effect, $\beta_0$ represents the baseline value at $T = 0$, $\beta_1$ represents the slope before the change, $\beta_2$ represents the level change when the change is introduced, and $\beta_3$ represents the change in slope after the change (so the slope after the change is $\beta_1 + \beta_3$).

Contrary to the basic method explained in \cite{bernal2017interrupted}, where the ITS is performed on values aggregated per month, this thesis performs the ITS on single data points, as the premise that the aggregated values all have (within a certain margin) the same weight is not fulfilled for sentiment and vote score values. Performing the ITS on aggregated values would skew the linear regression towards months with fewer underlying data points. Fitting single data points prevents this, as months with more data points automatically carry more weight. To filter out seasonal effects, the average value of all data points sharing the same calendar month across all years is subtracted from each data point (i.e., the average value of all Januaries is subtracted from each data point in a January). This thesis uses the least-squares method for the regression.
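Fitted on single data points, the segmented regression reduces to an ordinary least-squares problem with the design matrix columns $(1, T, X_t, T X_t)$. A minimal sketch using numpy; the variable names are illustrative, not taken from the thesis code:

\begin{verbatim}
import numpy as np

def fit_its(t, y, t_change):
    # Least-squares fit of y = b0 + b1*t + b2*x + b3*t*x,
    # where x = 1 after the change and 0 before.
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    x = (t >= t_change).astype(float)
    A = np.column_stack([np.ones_like(t), t, x, t * x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta  # [b0, b1, b2, b3]
\end{verbatim}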
Although the ITS analysis takes data density variability and seasonality into account, there is always the possibility that an unknown factor or event is contained in the data. It is therefore always recommended to visually inspect the data. This thesis contains one example where the data density increases so drastically within a particular time span that this form of analysis loses accuracy.
%limitations
% large sudden changes (maybe include example from analysis)
% autocorrelation?

\subsection{A synthetic example}
%3 segment example like it will be used later
% with lower sentiment first and higher sentiment after the change

For demonstration purposes, this section shows how to create a synthetic example for an ITS analysis. The example has 3 segments, equal to the number of segments that will be used in the analysis in the following sections. In this example, the sentiment is lower before the change occurs and higher after the change has occurred. The example also includes data density variability, i.e., the number of data points differs from month to month. The example, shown visually in figure \ref{itsexample}, is generated by the following algorithm:
\begin{itemize}
\item Select the time frame: for instance, 15 months before and after the change.
\item Select base values: before the change, choose a base value of $0.10$; after the change, choose a base value of $0.15$.
\item Add noise: add a random value in $[0, 0.05)$ to the base value of each month.
\item Choose the sample size (data density): draw a random sample size in $[200, 400)$ for each month and duplicate the value from the previous step that many times.
\item Compute the ITS, taking data density variability into account.
\end{itemize}

This algorithm generates an ITS where the line before the change lies on a lower level than the line after the change. However, the algorithm does not control the slopes of the segments before and after the change; the slopes of the lines in figure \ref{itsexample} are random. The algorithm could be extended to also control the slopes of the lines, but for the demonstration purposes of this thesis this is sufficient.
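The data generation itself fits in a few lines. The following is a minimal sketch of the algorithm above using numpy; the concrete constants follow the list and are otherwise arbitrary:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng()
times, values = [], []
for month in range(-15, 15):                 # 15 months before/after the change
    base = 0.10 if month < 0 else 0.15       # lower base value before the change
    monthly = base + rng.uniform(0.0, 0.05)  # add noise per month
    n = int(rng.integers(200, 400))          # random sample size (data density)
    times.extend([month] * n)                # duplicate the value n times
    values.extend([monthly] * n)
# times/values can now be fed into the single-data-point ITS fit.
\end{verbatim}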
\begin{figure}
\centering\includegraphics[scale=0.7]{figures/itsexample}
\caption{An example that visualizes how ITS works. The change of the system occurs at month 0. The blue line shows the average sentiment of fictional answers grouped by month. The numbers attached to the blue line show the number of sentiment values for a given month. The yellow line represents the ITS analysis as a three-segment line. This example shows the expected behavior of the data sets in the following sections.}
\label{itsexample}
\end{figure}
%interrupted time series
% ref tutorial paper \cite{bernal2017interrupted}
% often used in medical fields to see if changes have an effect
% linear regression
% used same tensors as described in paper, show formula and how it works, 3 tensors describe tensors and what they capture
% explain why i chose this model, captures the change, more complex model would capture more but also get more complicated, these 3 tensors are enough to see the impact
% fitting every value not aggregated values, aggregated values would have different weights, weights are too far spread, contrary to paper where person years are more or less constant
% single value fitting is better, no weight issues, as weights are taken care of via more values
% if one month has more values than another then that month affects its more as more values are present