
\chapter{Methodology}

StackExchange introduced a \emph{new contributor} indicator to all communities on August 21, 2018, at 9~pm UTC \cite{post2018come}. This step is one of many intended to make the platform and its members more welcoming towards new users. The indicator is shown to potential answerers in the answer text box of a question flagged as coming from a new contributor, as shown in Figure~\ref{newcontributor}. The indicator is added to a question if the question is the user's first contribution or if the user's first contribution (question or answer) was made less than 7 days before the question \cite{sonic2018what}. It is then shown for 7 days from the creation date of the question. Note that a user may have been registered for a long time before posting their first question, and that question still counts as a question from a new contributor. Likewise, if a user deletes all their contributions from the site and then posts a new question, that question will carry the \emph{new contributor} indicator. The sole deciding factor for the indicator is the date and time of the first contribution and the 7-day window that follows.
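
The rule above can be written down as a small predicate. The following is a minimal sketch, not StackExchange's actual implementation: the function names \texttt{is\_new\_contributor\_question} and \texttt{indicator\_visible} are hypothetical, and treating both 7-day limits as strict (exclusive) bounds is an assumption, since the cited description does not specify the boundary case.

```python
from datetime import datetime, timedelta
from typing import Optional

# Length of both windows described in the text (7 days).
WINDOW = timedelta(days=7)


def is_new_contributor_question(
    question_created: datetime,
    first_contribution: Optional[datetime],
) -> bool:
    """Return True if the question counts as coming from a new contributor.

    `first_contribution` is the creation time of the user's earliest
    non-deleted question or answer, or None if there is none besides
    the question itself.
    """
    # Case 1: the question itself is the user's first contribution.
    if first_contribution is None or first_contribution >= question_created:
        return True
    # Case 2: the first contribution was made less than 7 days earlier.
    # (Strict inequality is an assumption about the boundary case.)
    return question_created - first_contribution < WINDOW


def indicator_visible(
    question_created: datetime,
    first_contribution: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True if the indicator would be displayed at time `now`."""
    flagged = is_new_contributor_question(question_created, first_contribution)
    # The indicator is shown for 7 days from the question's creation date.
    return flagged and question_created <= now < question_created + WINDOW
```

Note that the registration date of the user does not appear anywhere in the sketch, which mirrors the observation that only the first contribution and the 7-day window matter.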

\begin{figure}
% \includegraphics[scale=0.47]{figures/new_contributor} %TODO
 \caption{The answer box a potential answerer sees when viewing a question from a new contributor.}
 \label{newcontributor}
\end{figure}

% about the change
% https://meta.stackexchange.com/questions/314287/come-take-a-look-at-our-new-contributor-indicator \cite{post2018come}
% https://meta.stackexchange.com/questions/314472/what-are-the-exact-criteria-for-the-new-contributor-indicator-to-be-shown \cite{sonic2018what} ; change date = 2018-08-21T21:04:49.177
% new user indicator visible for 1 week ...

To measure the effect of the change, we chose VADER, a sentiment analysis tool designed for social media interactions \cite{hutto2014vader}. VADER uses a lexicon of words with attached sentiment values, together with rules related to grammar and syntax, to assign a sentiment value between $-1$ and $1$ to a given piece of text. The sentiment range is divided into three classes: negative ($-1$ to $-0.05$), neutral ($-0.05$ to $0.05$), and positive ($0.05$ to $1$). The outer edges of the value range are rarely reached, as the text would have to be extremely negative or extremely positive, which is very unlikely.
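
The mapping from a sentiment value to one of the three classes can be sketched as follows. The function name \texttt{classify\_sentiment} is hypothetical. Assigning the boundary values $\pm 0.05$ to the positive and negative classes follows the convention documented for VADER's compound score and is an assumption about how the classes are applied here.

```python
def classify_sentiment(compound: float) -> str:
    """Map a sentiment value in [-1, 1] to negative, neutral, or positive.

    The value is assumed to be VADER's normalized compound score, which the
    `vaderSentiment` Python package exposes as
    SentimentIntensityAnalyzer().polarity_scores(text)["compound"].
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError("sentiment value must lie between -1 and 1")
    if compound >= 0.05:   # boundary handling is an assumption, see lead-in
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```

For instance, a value of $0.42$ falls into the positive class, while a value of $0.01$ is treated as neutral.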

% sentiment calculation via vaderlib, write whole paragraph and explain, also add ref to paper \cite{hutto2014vader}

StackExchange provides anonymized data dumps of all its communities on archive.org, free of charge, for researchers to investigate \cite{archivestackexchange}. These data dumps contain users, posts (questions and answers), badges, comments, tags, votes, and a post history containing all versions of posts. Each entry contains the necessary information, for instance, id, creation date, title, and body, as well as how the data is linked together (which user posted a question, answer, or comment). However, not all data entries are valid, and invalid entries cannot be used in the analysis, for instance, questions or answers whose author is unknown; this affects only a very small number of entries. The data therefore has to be cleaned before the actual analysis. Moreover, the answer texts are in HTML format and contain tags that would skew the sentiment values, so the tags need to be stripped beforehand. Additionally, answers may contain code sections, which would also skew the results and are therefore omitted.
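
The text cleaning step can be illustrated with a minimal sketch based on Python's standard \texttt{html.parser} module. It is not necessarily the procedure used for the analysis: the names \texttt{clean\_answer\_body} and \texttt{AnswerTextExtractor} are hypothetical, and treating both \texttt{<pre>} and inline \texttt{<code>} elements as code sections is an assumption.

```python
from html.parser import HTMLParser

# Elements whose content is treated as code and dropped (an assumption).
CODE_TAGS = {"pre", "code"}
# Block-level elements that should separate words in the plain text.
BLOCK_TAGS = {"p", "br", "li", "ul", "ol", "div", "pre", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class AnswerTextExtractor(HTMLParser):
    """Collect the visible text of an answer, skipping code sections."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decodes entities like &amp;
        self._code_depth = 0   # > 0 while inside <pre> or <code>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._chunks.append(" ")
        if tag in CODE_TAGS:
            self._code_depth += 1

    def handle_endtag(self, tag):
        if tag in CODE_TAGS and self._code_depth > 0:
            self._code_depth -= 1
        if tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        # Keep text only when it is not part of a code section.
        if self._code_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join("".join(self._chunks).split())


def clean_answer_body(html_body: str) -> str:
    """Strip HTML tags and code sections from an answer body."""
    parser = AnswerTextExtractor()
    parser.feed(html_body)
    parser.close()
    return parser.text()
```

Applied to an answer body consisting of a sentence, a code block, and a closing remark, the sketch returns only the sentence and the remark as plain text, which can then be passed to the sentiment analysis.
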
% data sets as xml files from archive.org \cite{archivestackexchange}

%cleaning data
% broken entries, missing user id
% answers in html -> strip html and remove code sections, no contribution to sentiment


% calc sentiment for answers


% differences in avg sentiment
% look at plots and write something that fits


%interrupted time series
% ref tutorial paper \cite{bernal2017interrupted}
% often used in medical fields to see if changes have an effect
% linear regression
% used same tensors as describe in paper, show formula and how it works, 3 tensors describe tensors and what they capture
% explain why i chose this model, captures the change, more complex model would capture more but also get more complicated, these 3 tensors are enough to see the impact
% fitting every value not aggregated values, aggregated values would have different weights, weights are too far spread, contrary to paper where person years are more or less constant
% single value fitting is better, no weight issues, as weights are taken care of via more values
% if one month has more values than another then that month affects its more as more values are present
%