wea_ondara
2020-04-17 11:48:39 +02:00
parent b5508c7597
commit 1862d6b39a
3 changed files with 65 additions and 50 deletions

@@ -1,6 +1,6 @@
\chapter{Methodology}
StackExchange introduced a \emph{new contributor} indicator to all communities on $21^{st}$ of August in 2018 at 9pm UTC \cite{post2018come}. This step is one of many to make the platform and its members more welcoming towards new users. This indicator is shown the potential answerers in the answer text box of a question flagged as from a new contributor as shown in figure \ref{newcontributor}. The indicator is added to a question if the question is the first contribution of a user or if the first contribution (question or answer) of the user was less than 7 days ago \cite{sonic2018what}. The indicator is then shown for 7 days from the creation date of the question. Note that the user can be registered for a long time and then post their first question and it is counted as a question from a new contributor. Also, if a user decides to delete all their contributions from the site and then creates a new question this question will have the \emph{new contributor} indicator attached. The sole deciding factor for the indicator is the date and time of the first contribution and the 7 days window afterwards.
StackExchange introduced a \emph{new contributor} indicator to all communities on the $21^{st}$ of August 2018 at 9 pm UTC \cite{post2018come}. This step is one of many to make the platform and its members more welcoming towards new users. The indicator is shown to potential answerers in the answer text box of a question flagged as coming from a new contributor, as shown in figure \ref{newcontributor}. The indicator is added to a question if the question is the first contribution of a user or if the first contribution (question or answer) of the user was less than 7 days ago \cite{sonic2018what}. The indicator is then shown for 7 days from the creation date of the question. Note that a user can be registered for a long time before posting their first question, and that question still counts as a question from a new contributor. Also, if a user deletes all their contributions from the site and then creates a new question, this question will have the \emph{new contributor} indicator attached. The sole deciding factor for the indicator is the date and time of the first contribution and the 7-day window afterward.
\begin{figure}
\includegraphics[scale=0.47]{figures/new_contributor} %TODO wrong image
@@ -13,19 +13,19 @@ StackExchange introduced a \emph{new contributor} indicator to all communities o
% https://meta.stackexchange.com/questions/314472/what-are-the-exact-criteria-for-the-new-contributor-indicator-to-be-shown \cite{sonic2018what} ; change date = 2018-08-21T21:04:49.177
% new user indicator visible for 1 week ...
To measure the effect of the change we chose Vader, a sentiment analysis tool designed for social media interactions \cite{hutto2014vader}. Vader uses a lexicon of words with attached sentiment values and rules related to grammar and syntax to determine a sentiment value between -1 and 1 to a given piece of text. The sentiment range is devided into 3 classes: negative (-1 to -0.05), neutral (-0.05 to 0.05), and postive (0.05 to 1). The outer edges of the value space are rarely reached as the text would have to be negative or postive to the extremes which is very unlikely.
To measure the effectiveness of the change, we chose Vader, a sentiment analysis tool designed for social media interactions \cite{hutto2014vader}. Vader uses a lexicon of words with attached sentiment values and rules related to grammar and syntax to assign a sentiment value between $-1$ and $1$ to a given piece of text. The sentiment range is divided into 3 classes: negative ($-1$ to $-0.05$), neutral ($-0.05$ to $0.05$), and positive ($0.05$ to $1$). The outer edges of the value space are rarely reached, as the text would have to be extremely negative or extremely positive, which is very unlikely.
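The three classes can be illustrated with a minimal sketch. The function name \texttt{classify\_sentiment} is hypothetical, and assigning the boundary values $\pm 0.05$ to the positive and negative classes is an assumption, since the ranges above do not say which class the boundaries belong to.

```python
def classify_sentiment(compound: float) -> str:
    """Map a sentiment value in [-1, 1] to one of the three classes."""
    if not -1.0 <= compound <= 1.0:
        raise ValueError("sentiment value must lie in [-1, 1]")
    # Assumption: the boundary values themselves count as positive/negative.
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```

A value of $0.3$ would fall into the positive class and a value of $0.01$ into the neutral class.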
% sentiment calculation via vaderlib, write whole paragraph and explain, also add ref to paper \cite{hutto2014vader}
\section{Data gathering and preprocessing}
StackExchange provides anonymized data dumps of all their communities for researches to investigate at no cost on archive.org \cite{archivestackexchange}. These data dumps contain users, posts (questions and answers), badges, comments, tags, votes, and a post history containing all versions of posts. Each entry contains the neccessary information, for instance, id, creation date, title, body, and how the data is linked together (which user posted a question/answer/comment). However, not all data entries are valid and therefore cannot used in the analysis, for instance, questions or answers of which the user is unknown but this only affects a very small amount entries. So before the actuals analysis the data has to be cleaned. Moreover, the answer texts are in HTML format, containing tags that could skew the sentiment values, and they need to be stripped away beforehand. Additionally, answers may contain code sections which also would skew the results and are therefore omitted.
StackExchange provides anonymized data dumps of all their communities for researchers to investigate at no cost on archive.org \cite{archivestackexchange}. These data dumps contain users, posts (questions and answers), badges, comments, tags, votes, and a post history containing all versions of posts. Each entry contains the necessary information, for instance, id, creation date, title, body, and how the data is linked together (which user posted a question/answer/comment). However, not all data entries are valid, and invalid entries cannot be used in the analysis. An example is questions or answers whose user is unknown, although this only affects a very small number of entries. So before the actual analysis, the data has to be cleaned. Moreover, the answer texts are in HTML format and contain tags that could skew the sentiment values, so the tags need to be stripped away beforehand. Additionally, answers may contain code sections, which would also skew the results and are therefore omitted.
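The cleaning of answer texts could look roughly like the following sketch, which uses only the Python standard library. The names \texttt{AnswerTextExtractor} and \texttt{strip\_html\_and\_code} are hypothetical, and treating both \texttt{pre} and \texttt{code} elements as code sections is an assumption, not necessarily the rule used in this thesis.

```python
from html.parser import HTMLParser


class AnswerTextExtractor(HTMLParser):
    """Collect visible text while skipping code sections."""

    # Assumption: both block and inline code elements are dropped.
    SKIPPED_TAGS = {"pre", "code"}
    # Block-level tags get a space so adjacent paragraphs do not fuse.
    BLOCK_TAGS = {"p", "div", "li", "pre", "blockquote", "br"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._parts.append(" ")
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside a skipped element is ignored.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join("".join(self._parts).split())


def strip_html_and_code(body: str) -> str:
    """Return the plain text of an answer body without tags or code."""
    parser = AnswerTextExtractor()
    parser.feed(body)
    parser.close()
    return parser.text()
```

Only the remaining plain text would then be passed on to the sentiment analysis.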
% data sets as xml files from archive.org \cite{archivestackexchange}
%cleaning data
% broken entries, missing user id
% answers in html -> strip html and remove code sections, no contribution to sentiment
After preprocessing the raw data, relevant data is filtered and computed. Questions and answers in the data are mixed together and have to be seperated and answer have to be linked to their questions. Also, questions do not have the \emph{new contributor} indicator attached to them and neither do users. So, the first contribution date and time of a user has to calculated via the creation dates of the questions and answers the user has posted. Then, questions are filtered by user and by whether they are created within the 7 day window after the first contribution of the user. These questions were created during the period where the \emph{new contributor} indicator would have been displayed, in case the questions had been posted before the change, or has been displayed after the change. From these questions, all answers which arrived within the 7 day window are considered for the analysis. Answers which arrived at a later point are excluded as the answerer most likely has not seen the disclaimer shown in figure \ref{newcontributor}. Included answers are then analysed with vader and the resulting sentiments are stored.
After preprocessing the raw data, relevant data is filtered and computed. Questions and answers are mixed together in the data and have to be separated, and answers have to be linked to their questions. Also, questions do not have the \emph{new contributor} indicator attached to them and neither do users. So, the date and time of the first contribution of each user has to be calculated from the creation dates of the questions and answers the user has posted. Then, questions are filtered per user and by whether they were created within the 7-day window after the first contribution of the user. These are the questions for which the \emph{new contributor} indicator was displayed if they were posted after the change, or would have been displayed if they were posted before the change. From these questions, all answers which arrived within the 7-day window are considered for the analysis. Answers which arrived at a later point are excluded, as the answerer most likely has not seen the disclaimer shown in figure \ref{newcontributor}. Included answers are then analyzed with Vader and the resulting sentiments are stored.
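The filtering steps can be sketched as follows. This is an illustration under assumptions, not the code used for this thesis: posts are represented as dictionaries, and the field names \texttt{id}, \texttt{type}, \texttt{user\_id}, \texttt{parent\_id}, and \texttt{created} are hypothetical. The answer window is assumed to start at the creation date of the question, matching the period in which the indicator is shown.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)


def first_contributions(posts):
    """Return the earliest creation date per user over questions and answers."""
    first = {}
    for post in posts:
        user, created = post["user_id"], post["created"]
        if user not in first or created < first[user]:
            first[user] = created
    return first


def answers_to_analyze(posts):
    """Select answers to new-contributor questions that arrived in time."""
    first = first_contributions(posts)
    # Questions created less than 7 days after the asker's first contribution.
    questions = {
        p["id"]: p
        for p in posts
        if p["type"] == "question"
        and p["created"] - first[p["user_id"]] < WINDOW
    }
    # Answers that arrived less than 7 days after their question was created.
    return [
        p
        for p in posts
        if p["type"] == "answer"
        and p["parent_id"] in questions
        and p["created"] - questions[p["parent_id"]]["created"] < WINDOW
    ]
```

The answers returned by \texttt{answers\_to\_analyze} correspond to the answers whose sentiment is computed in the last step.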
% calc sentiment for answers
% questions do not have a tag if from a new contributor -> calc first contributor
@@ -41,8 +41,8 @@ After preprocessing the raw data, relevant data is filtered and computed. Questi
\section{Analysis}
An interrupted time series (ITS) analysis captures trends before and after a change in a system and fits very well with the question this thesis investigates. ITS can be applied to a large variety of data if the data contains the same kind of data points before and after the change and when the change date and time are known. \citeauthor{bernal2017interrupted} published a paper on how ITS works \cite{bernal2017interrupted}. ITS works well on medical data, for instance, when a new treatment is introduced ITS can visualize if the treatment improves a condition. For ITS no control group is required and often control groups are not feasable. ITS only works with the before and after data and a date where a change was introduced.
ITS relies on linear regression and tries to fit a three-segment linear function to the data. The authors also described cases where more than three segments are used but these models quickly raise the complexity of the analysis and for this thesis a three-segment linear regression is sufficient. The three segment are lines to fit the data before and after the change as well as one line to connect the other two lines at the change date. Figure \ref{itsexample} shows an example of an ITS. Each segment is captured by a tensors of the following forumla $Y_t = \beta_0 + \beta_1T + \beta_2X_t + \beta_3TX_t$, where $T$ represents time as an number, for instance, number of months since the start of data recording, $X_t$ represents 0 or 1 depending on whether the change is in effect, $\beta_0$ represents the value at $T = 0$, $\beta_1$ represents the slope before the change, $\beta_2$ represents the value when the change is introducted, and $\beta_3$ represents the slope after the change. Contrary to method in \cite{bernal2017interrupted} where the ITS is performed on aggregated values per month, this thesis performs the ITS on single data points, as the permise that the aggregated values all have the same weight within a certain margin is not fulfilled. Perfoming the ITS with aggregated values would skew the linear regression more towards data points with less weight. Single data point fitting prevents this, as weight is taken into account with more datapoints.
An interrupted time series (ITS) analysis captures trends before and after a change in a system and fits very well with the question this thesis investigates. ITS can be applied to a large variety of data, provided the data contains the same kind of data points before and after the change and the date and time of the change are known. \citeauthor{bernal2017interrupted} published a paper on how ITS works \cite{bernal2017interrupted}. ITS works well on medical data: for instance, when a new treatment is introduced, ITS can visualize whether the treatment improves a condition. ITS does not require a control group, which is often not feasible anyway. It only needs the data before and after the change and the date at which the change was introduced.
ITS relies on linear regression and tries to fit a three-segment linear function to the data. The authors also describe cases where more than three segments are used, but these models quickly raise the complexity of the analysis, and for this thesis a three-segment linear regression is sufficient. The three segments are two lines fitting the data before and after the change and one line connecting the other two at the change date. Figure \ref{itsexample} shows an example of an ITS. The segments are captured by the formula $Y_t = \beta_0 + \beta_1T + \beta_2X_t + \beta_3TX_t$, where $T$ represents time as a number, for instance, the number of months since the start of data recording, $X_t$ is 0 or 1 depending on whether the change is in effect, $\beta_0$ represents the value at $T = 0$, $\beta_1$ represents the slope before the change, $\beta_2$ represents the change in level when the change is introduced, and $\beta_3$ represents the change in slope after the change. In contrast to the method in \cite{bernal2017interrupted}, where the ITS is performed on aggregated values per month, this thesis performs the ITS on single data points, as the premise that all aggregated values have the same weight within a certain margin is not fulfilled. Performing the ITS with aggregated values would skew the linear regression towards aggregated values that are backed by fewer data points. Fitting single data points prevents this, as periods with more data points automatically carry more weight.
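To make the role of the coefficients explicit, the formula can be written out separately for the periods before and after the change. The following is a sketch read directly off the formula above; the symbol $T_0$ for the time of the change is introduced here only for illustration.
\[
Y_t = \beta_0 + \beta_1 T \qquad \mbox{for } X_t = 0,
\]
\[
Y_t = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) T \qquad \mbox{for } X_t = 1.
\]
The slope after the change is therefore $\beta_1 + \beta_3$, and the difference between the two lines at the change date is $\beta_2 + \beta_3 T_0$. This difference equals $\beta_2$ alone only if time in the interaction term is measured relative to the change date, so that $T_0 = 0$.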
%interrupted time series
% ref tutorial paper \cite{bernal2017interrupted}