\chapter{Method}

StackExchange introduced a \emph{new contributor} indicator to all of its communities on 21 August 2018 at 9 pm UTC \cite{post2018come}. This step is one of many StackExchange has taken to make the platform and its members more welcoming towards new users. The indicator is shown to potential answerers in the answer text box of a question flagged as coming from a new contributor, as shown in figure \ref{newcontributor}. The indicator is added to a question if the question is the first contribution of a user or if the first contribution (question or answer) of the user was made less than 7 days ago \cite{sonic2018what}. The indicator is then shown for 7 days from the creation date of the question. Note that a user may have been registered for a long time before posting their first question; such a question still counts as coming from a new contributor. Likewise, if a user deletes all their contributions from the site and then creates a new question, this question will have the \emph{new contributor} indicator attached. The sole deciding factor for the indicator is the date and time of the first non-deleted contribution and the 7-day window afterward.

\begin{figure}
\centering\includegraphics[scale=0.47]{figures/new_contributor}
\caption{The answer box a potential answerer sees when viewing a question from a new contributor. \copyright{Tim Post, 2018, \url{https://meta.stackexchange.com/users/50049/tim-post}} in \cite{post2018come}}
\label{newcontributor}
\end{figure}

% about the change
% https://meta.stackexchange.com/questions/314287/come-take-a-look-at-our-new-contributor-indicator \cite{post2018come}
% https://meta.stackexchange.com/questions/314472/what-are-the-exact-criteria-for-the-new-contributor-indicator-to-be-shown \cite{sonic2018what} ; change date = 2018-08-21T21:04:49.177

To measure the effectiveness of the change we chose Vader, a sentiment analysis tool designed for social media texts \cite{hutto2014vader}. Vader combines a lexicon of words with attached sentiment values with a set of grammatical and syntactical rules, for instance for negation, punctuation, capitalization, and degree modifiers such as ``very'', and assigns a sentiment value between $-1$ and $1$ to a given piece of text. This range is divided into three classes: negative ($-1$ to $-0.05$), neutral ($-0.05$ to $0.05$), and positive ($0.05$ to $1$). The outer edges of the range are rarely reached, as a text would have to be extremely negative or positive, which is very unlikely in practice.
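As a concrete illustration of how these scores are obtained, the short sketch below uses the \texttt{vaderSentiment} Python package that accompanies \cite{hutto2014vader}; the helper function \texttt{classify} and the example text are made up for illustration, while the class boundaries are the ones listed above.

\begin{verbatim}
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def classify(text):
    # 'compound' is Vader's overall sentiment value in [-1, 1].
    compound = analyzer.polarity_scores(text)["compound"]
    if compound <= -0.05:
        label = "negative"
    elif compound < 0.05:
        label = "neutral"
    else:
        label = "positive"
    return compound, label

print(classify("Welcome to the site, great first question!"))
\end{verbatim}

Besides the compound value used here, \texttt{polarity\_scores} also reports the proportions of negative, neutral, and positive content in the text; only the compound value corresponds to the sentiment value described above.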
\section{Data gathering and preprocessing}

StackExchange provides anonymized data dumps of all its communities on archive.org, which researchers can use at no cost \cite{archivestackexchange}. These data dumps contain users, posts (questions and answers), badges, comments, tags, votes, and a post history with all versions of each post. Each entry carries the necessary information, for instance, id, creation date, title, and body, as well as the links between the data (which user posted which question, answer, or comment). However, not all entries are valid and usable in the analysis, for instance, questions or answers whose posting user is unknown; this affects only a very small number of entries. The data therefore has to be cleaned before the actual analysis. Moreover, the answer texts are stored as HTML, and the markup tags could skew the sentiment values, so they are stripped away beforehand. Additionally, answers may contain code sections, which would also skew the results and are therefore removed.

After preprocessing the raw data, the relevant data is filtered and derived values are computed. Questions and answers are mixed together in the dumps and have to be separated, and answers have to be linked to their questions. Furthermore, neither questions nor users carry the \emph{new contributor} indicator in these datasets, so the date and time of each user's first contribution has to be reconstructed from the creation dates of the questions and answers the user has posted. Questions are then filtered per user by whether they were created within the 7-day window after the user's first contribution. These are the questions for which the \emph{new contributor} indicator either has been displayed (for questions posted after the change) or would have been displayed (for questions posted before the change). From these questions, all answers that arrived within 7 days of the question's creation are considered for the analysis. Answers that arrived later are excluded, as the answerer most likely did not see the disclaimer shown in figure \ref{newcontributor}. The included answers are then analyzed with Vader and the resulting sentiment values are stored, as sketched below.
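The sketch below outlines these preprocessing and filtering steps. It assumes the posts table of a dump has already been parsed into a pandas \texttt{DataFrame} whose columns follow the attribute names of the dump (\texttt{Id}, \texttt{PostTypeId}, \texttt{ParentId}, \texttt{OwnerUserId}, \texttt{CreationDate}, \texttt{Body}); the function \texttt{answer\_sentiments} and the exact column handling are illustrative rather than a verbatim copy of the analysis code.

\begin{verbatim}
import pandas as pd
from bs4 import BeautifulSoup
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

WINDOW = pd.Timedelta(days=7)
analyzer = SentimentIntensityAnalyzer()

def clean_body(html):
    # Remove code sections, then strip the remaining HTML tags.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["pre", "code"]):
        tag.decompose()
    return soup.get_text(" ", strip=True)

def answer_sentiments(posts):
    # Drop broken entries (unknown user) and parse the timestamps.
    posts = posts.dropna(subset=["OwnerUserId"]).copy()
    posts["CreationDate"] = pd.to_datetime(posts["CreationDate"])

    questions = posts[posts["PostTypeId"] == 1]
    answers = posts[posts["PostTypeId"] == 2]

    # First contribution (question or answer) per user.
    first_post = posts.groupby("OwnerUserId")["CreationDate"].min()

    # Questions created within 7 days of their author's first contribution.
    new_contrib = questions[
        questions["CreationDate"]
        <= questions["OwnerUserId"].map(first_post) + WINDOW
    ]

    # Answers to those questions that arrived within 7 days of the question.
    question_date = new_contrib.set_index("Id")["CreationDate"]
    answers = answers[answers["ParentId"].isin(question_date.index)].copy()
    answers["QuestionDate"] = answers["ParentId"].map(question_date)
    answers = answers[
        answers["CreationDate"] <= answers["QuestionDate"] + WINDOW
    ].copy()

    # Sentiment of the cleaned answer bodies.
    answers["Sentiment"] = answers["Body"].map(
        lambda body: analyzer.polarity_scores(clean_body(body))["compound"]
    )
    return answers[["Id", "ParentId", "CreationDate", "Sentiment"]]
\end{verbatim}

Note that the condition on the question's creation date automatically covers the case where the question itself is the user's first contribution, since the first contribution date is then the question's own creation date.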
\section{Analysis}

An interrupted time series (ITS) analysis captures the trends before and after a change in a system and therefore fits the question this thesis investigates very well. ITS can be applied to a large variety of data, as long as the data contains the same kind of data points before and after the change and the date and time of the change are known. \citeauthor{bernal2017interrupted} published a tutorial paper on how ITS works \cite{bernal2017interrupted}. ITS is frequently used on medical data, for instance, to visualize whether a newly introduced treatment improves a condition. No control group is required, which is valuable because control groups are often not feasible; ITS only needs the data before and after the change and the date the change was introduced.

ITS relies on linear regression and fits a three-segment linear function to the data: one line for the data before the change, one line for the data after the change, and one segment connecting the two at the change date. \citeauthor{bernal2017interrupted} also describe models with more segments, but these quickly raise the complexity of the analysis, and for this thesis a three-segment linear regression is sufficient. Figure \ref{itsexample} shows an example of an ITS. The segments are captured by the terms of the regression model $Y_t = \beta_0 + \beta_1 T + \beta_2 X_t + \beta_3 T X_t$, where $T$ represents time as a number, for instance, the number of months since the start of the data recording, $X_t$ is 0 or 1 depending on whether the change is in effect at time $T$, $\beta_0$ is the baseline value at $T = 0$, $\beta_1$ is the slope before the change, $\beta_2$ is the change in level associated with the introduction of the change, and $\beta_3$ is the change in slope, so that the slope after the change is $\beta_1 + \beta_3$.

Contrary to the method in \cite{bernal2017interrupted}, where the ITS is performed on values aggregated per month, this thesis performs the ITS on the individual data points. Aggregation assumes that all aggregated values carry roughly the same weight (in \cite{bernal2017interrupted}, for instance, the person-years per month are roughly constant), which is not the case here: the number of answers varies considerably from month to month. Performing the ITS on aggregated values would therefore give months with few answers the same influence as months with many answers and skew the linear regression towards data points that represent less data. Fitting the individual data points avoids this problem, as each month then influences the regression in proportion to the number of answers it contains.
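A minimal sketch of this regression, fitted on the individual sentiment values with \texttt{statsmodels}, is shown below; the input is assumed to be the answer table produced in the previous section, and the 30-day month length as well as the column names are simplifications for illustration.

\begin{verbatim}
import pandas as pd
import statsmodels.formula.api as smf

# Date and time the indicator was introduced (UTC).
CHANGE_DATE = pd.Timestamp("2018-08-21 21:04:49")

def fit_its(answers):
    # answers: one row per analyzed answer with columns
    # 'CreationDate' (timestamp) and 'Sentiment' (Vader compound value).
    df = pd.DataFrame({
        # T: time as a number, here months since the first data point.
        "T": (answers["CreationDate"] - answers["CreationDate"].min())
        / pd.Timedelta(days=30),
        # X: 0 before the change, 1 afterwards.
        "X": (answers["CreationDate"] >= CHANGE_DATE).astype(int),
        "sentiment": answers["Sentiment"],
    })
    # Y_t = b0 + b1*T + b2*X_t + b3*T*X_t
    return smf.ols("sentiment ~ T + X + T:X", data=df).fit()

# result = fit_its(answers)
# print(result.params)  # b0, b1 (pre-change slope), b2 (level change), b3 (slope change)
\end{verbatim}

Because every answer enters the regression as its own data point, months with many answers automatically contribute more to the fit, as argued above.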