wip

2021-03-22 20:30:32 +01:00
parent 316fed8283
commit 52d7ddb7fc
9 changed files with 270 additions and 36 deletions
--- a/text/2_relwork.tex
+++ b/text/2_relwork.tex
@@ -376,9 +376,13 @@ This shortcoming was addressed by \citeauthor{hutto2014vader} who introducted a
 % original ITS paper, how was this done earlier (before)
 \subsection{Trend analysis}

-When introducing a change to a system (experiment), one often wants to know whether the intervention achieves its intended purpose. This leads to 3 possible outcomes: a) the intervention shows effect and the system changes in the desired way, b) the intervention shows effect and the system changes in an undesired way, or c) the system did not react at all to the change. There are multiple ways to determine which of these outcomes occur. To analyze the behavior of the system data from before and after the intervention as well as the nature of the intervation has be aquired. The are multiple ways to run such an experiment and one has to choose which type of experiment fits best. There are 2 categories of approaches: actively creating an experiment where one design the experiment before it is executed (for example randomized control trials in medical fields), or using existing data of an experiment which was not designed beforehand or where setting up a designed experiment is not possible (quasi-experiment).
+When introducing a change to a system (experiment), one often wants to know whether the intervention achieves its intended purpose. This leads to three possible outcomes: a) the intervention shows an effect and the system changes in the desired way, b) the intervention shows an effect and the system changes in an undesired way, or c) the system does not react to the change at all. To determine which of these outcomes occurred, data from before and after the intervention as well as the nature of the intervention have to be acquired. There are multiple ways to run such an experiment, and one has to choose the type of experiment that fits best. The approaches fall into two categories: actively creating an experiment that is designed before it is executed (for example, randomized controlled trials in medical fields), or using existing data from an experiment that was not designed beforehand or where setting up a designed experiment is not possible (quasi-experiment).

-As this thesis investigates a change which has already been implemented by another party, this thesis covers quasi-experiments. A tool that is often used for this purpose is an \emph{Interrupted Time Series} (ITS) analysis. The ITS analysis is a form of segmented regression analysis, where data from before, after and during the intervention is regressed with seperate line segements\cite{mcdowall2019interrupted, bernal2017interrupted}. ITS requires data at (regular) intervals from before and after the intervention (time series). The interrupt signifies the intervention and the time of when it occured must be known. The intervention can be at a single point in time of it can be streched out over a certain time span. This property must also be known to take it into account when designing the regression. Also, as the data is aquired from an quasi-experiment, it may be baised, for example seasonality, ....%TODO
+As this thesis investigates a change which has already been implemented by another party, it covers quasi-experiments. A tool that is often used for this purpose is an \emph{Interrupted Time Series} (ITS) analysis. The ITS analysis is a form of segmented regression analysis, where data from before, after and during the intervention is regressed with separate line segments \cite{mcdowall2019interrupted}. ITS requires data at (regular) intervals from before and after the intervention (time series). The interruption signifies the intervention, and the time when it occurred must be known. The intervention can happen at a single point in time or it can be stretched out over a certain time span. This property must also be known so that it can be taken into account when designing the regression. Also, as the data is acquired from a quasi-experiment, it may be biased \cite{bernal2017interrupted}, for example by seasonality, time-varying confounders (for example a change in how the data is measured), or variance in the number of single observations grouped together in an interval measurement. These biases need to be addressed if present. Seasonality can be accounted for by subtracting from each value the average of all values that fall in the same calendar month across all years (i.e. subtract the average value of all Januaries in the data set from the values in Januaries).
+%\begin{lstlisting}
+% deseasonalized = datasample - average(dataSamplesInMonth(month(datasample)))
+%\end{lstlisting}
+This removes the differences between the months of a year, thereby filtering out the effect of seasonality. The variance in data density per interval (the number of data samples in an interval) can be addressed by using each single data point in the regression instead of a per-interval average.
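
The monthly-average subtraction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the function name `deseasonalize` and the `(date, value)` input layout are hypothetical, and the sample values are invented for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def deseasonalize(samples):
    """Subtract from each value the mean of all values in the same calendar month.

    `samples` is a list of (date, value) pairs; this layout is an assumption
    made for the sketch.
    """
    # Group values by calendar month, ignoring the year.
    by_month = defaultdict(list)
    for day, value in samples:
        by_month[day.month].append(value)

    # Average of, e.g., all Januaries in the data set.
    month_mean = {month: mean(values) for month, values in by_month.items()}

    # Remove the monthly average from every single data point.
    return [(day, value - month_mean[day.month]) for day, value in samples]


# Invented example data: two Januaries and two Julys.
samples = [
    (date(2017, 1, 10), 1.0),
    (date(2018, 1, 12), 3.0),
    (date(2017, 7, 3), 5.0),
    (date(2018, 7, 9), 7.0),
]
adjusted = deseasonalize(samples)
```

After the subtraction, the January and July values are centered on zero, so the difference in level between the two months no longer appears in the series.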



--- a/text/3_method.tex
+++ b/text/3_method.tex
@@ -13,6 +13,10 @@ StackExchange introduced a \emph{new contributor} indicator to all communities o
 % https://meta.stackexchange.com/questions/314472/what-are-the-exact-criteria-for-the-new-contributor-indicator-to-be-shown \cite{sonic2018what} ; change date = 2018-08-21T21:04:49.177
 % new user indicator visible for 1 week ...

+%TODO state plots of sec 5 here and why these were chosen
+% -> i.e. limitations, other factors
+
+
 %TODO more vader explanation
 To measure the effectiveness of the change, this thesis utilizes Vader, a sentiment analysis tool with exceptional performance in analysing and categorizing microblog-like texts as well as good generalization to other domains \cite{hutto2014vader}. The choice is based on the speed and simplicity of Vader. Vader uses a lexicon of words with attached sentiment values and rules related to grammar and syntax to assign a sentiment value between -1 and 1 to a given piece of text. The sentiment range is divided into 3 classes: negative (-1 to -0.05), neutral (-0.05 to 0.05), and positive (0.05 to 1). The outer edges of the value range are rarely reached, as the text would have to be extremely negative or positive, which is very unlikely. This design allows fast and verifiable analysis.
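
The three sentiment classes can be illustrated with a small Python helper. This is only a sketch: the function name `classify` is hypothetical, the score is assumed to be the value between -1 and 1 produced by Vader, and treating the boundary values -0.05 and 0.05 as negative and positive respectively is an assumption, since the text does not say which class the boundaries belong to.

```python
def classify(score):
    """Map a sentiment value in [-1, 1] to one of the three classes.

    Boundary handling (>= 0.05 is positive, <= -0.05 is negative) is an
    assumption of this sketch.
    """
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"


# Invented example scores.
labels = [classify(s) for s in (0.6, -0.3, 0.0)]
```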

@@ -26,7 +30,7 @@ StackExchange provides anonymized data dumps of all their communities for resear
 % broken entries, missing user id
 % answers in html -> strip html and remove code sections, no contribution to sentiment

-After preprocessing the raw data, relevant data is filtered and computed. Questions and answers in the data are mixed together and have to be separated and answers have to be linked to their questions. Also, questions in these datasets do not have the \emph{new contributor} indicator attached to them and neither do users. So, the first contribution date and time of users have to be calculated via the creation dates of the questions and answers the user has posted. Then, questions are filtered per user and by whether they are created within the 7-day window after the first contribution of the user. These questions were created during the period where the \emph{new contributor} indicator would have been displayed, in case the questions had been posted before the change, or has been displayed after the change. From these questions, all answers which arrived within the 7-day window are considered for the analysis. Answers which arrived at a later point are excluded as the answerer most likely has not seen the disclaimer shown in figure \ref{newcontributor}. Included answers are then analyzed with Vader and the resulting sentiments are stored. Furhtermore, votes to questions of new contributors are counted if they arrived within the 7-day window and count 1 if it is an upvote and -1 if it is a downvote. Moreover, number of questions new contributors ask are counted and divided into two classes: 1st-question of a user and follow-up questions of a new contributor.
+After preprocessing the raw data, relevant data is filtered and computed. Questions and answers are mixed together in the data and have to be separated, and answers have to be linked to their questions. Also, neither questions nor users in these datasets have the \emph{new contributor} indicator attached to them. So, the date and time of each user's first contribution have to be calculated from the creation dates of the questions and answers the user has posted. Then, questions are filtered per user by whether they were created within the 7-day window after the user's first contribution. These questions were created during the period in which the \emph{new contributor} indicator would have been displayed, in case the questions were posted before the change, or was displayed, in case they were posted after the change. From these questions, all answers which arrived within the 7-day window are considered for the analysis. Answers which arrived at a later point are excluded as the answerer most likely has not seen the disclaimer shown in figure \ref{newcontributor}. Included answers are then analyzed with Vader and the resulting sentiments are stored. Furthermore, votes on questions of new contributors are counted if they arrived within the 7-day window, with an upvote counting as 1 and a downvote as -1. Moreover, the questions new contributors ask are counted and divided into two classes: the first question of a user and follow-up questions of a new contributor.
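
The filtering steps above can be sketched in Python. This is a hedged illustration, not the code used for the thesis: the dictionary keys (`id`, `user`, `created`, `question_id`), the function names, and the example posts are all hypothetical, and the 7-day window is assumed to start at the asker's first contribution for both questions and answers.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)  # period during which the indicator is shown


def first_contributions(posts):
    """Earliest creation time per user over all questions and answers."""
    first = {}
    for post in posts:
        user, created = post["user"], post["created"]
        if user not in first or created < first[user]:
            first[user] = created
    return first


def in_window(created, first_seen):
    """True if `created` lies within 7 days after the first contribution."""
    return first_seen <= created <= first_seen + WINDOW


def select_answers(questions, answers, first):
    """Answers to new-contributor questions that arrived inside the window."""
    new_questions = {
        q["id"]: q for q in questions if in_window(q["created"], first[q["user"]])
    }
    selected = []
    for answer in answers:
        question = new_questions.get(answer["question_id"])
        if question and in_window(answer["created"], first[question["user"]]):
            selected.append(answer)
    return selected


# Invented example posts.
questions = [
    {"id": 1, "user": "alice", "created": datetime(2018, 9, 1)},
    {"id": 2, "user": "alice", "created": datetime(2018, 9, 20)},
]
answers = [
    {"id": 10, "question_id": 1, "user": "bob", "created": datetime(2018, 9, 3)},
    {"id": 11, "question_id": 1, "user": "carol", "created": datetime(2018, 9, 15)},
    {"id": 12, "question_id": 2, "user": "bob", "created": datetime(2018, 9, 21)},
]
first = first_contributions(questions + answers)
selected_ids = [a["id"] for a in select_answers(questions, answers, first)]
```

In this example only answer 10 is kept: answer 11 arrives after the window has closed, and answer 12 belongs to a question that was asked after the asker's first week.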

 % calc sentiment for answers
 % questions do not have a tag if from a new contributor -> calc first contributor
@@ -43,7 +47,7 @@ After preprocessing the raw data, relevant data is filtered and computed. Questi

 \section{Analysis}
 An interrupted time series (ITS) analysis captures trends before and after a change in a system and fits very well with the question this thesis investigates. ITS can be applied to a large variety of data if the data contains the same kind of data points before and after the change and if the date and time of the change are known. \citeauthor{bernal2017interrupted} published a paper on how ITS works \cite{bernal2017interrupted}. ITS performs well on medical data: for instance, when a new treatment is introduced, ITS can visualize whether the treatment improves a condition. ITS requires no control group, which is useful because control groups are often not feasible. It only needs the data from before and after the change and the point in time at which the change was introduced.
-ITS relies on linear regression and tries to fit a three-segment linear function to the data. The authors also described cases where more than three segments are used but these models quickly raise the complexity of the analysis and for this thesis a three-segment linear regression is sufficient. The three segments are lines to fit the data before and after the change as well as one line to connect the other two lines at the change date. Figure \ref{itsexample} shows an example of an ITS. Each segment is captured by a tensor of the following formula $Y_t = \beta_0 + \beta_1T + \beta_2X_t + \beta_3TX_t$, where $T$ represents time as a number, for instance, number of months since the start of data recording, $X_t$ represents 0 or 1 depending on whether the change is in effect, $\beta_0$ represents the value at $T = 0$, $\beta_1$ represents the slope before the change, $\beta_2$ represents the value when the change is introduced, and $\beta_3$ represents the slope after the change. Contrary to the method in \cite{bernal2017interrupted} where the ITS is performed on aggregated values per month, this thesis performs the ITS on single data points, as the premise that the aggregated values all have the same weight within a certain margin is not fulfilled for sentiment and vote score values. Performing the ITS with aggregated values would skew the linear regression more towards data points with less weight. Single data point fitting prevents this, as weight is taken into account with more data points.
+ITS relies on linear regression and tries to fit a three-segment linear function to the data. The authors also describe cases where more than three segments are used, but these models quickly raise the complexity of the analysis, and for this thesis a three-segment linear regression is sufficient. The three segments are lines that fit the data before and after the change as well as one line that connects the other two lines at the change date. Figure \ref{itsexample} shows an example of an ITS. The segments are described by the formula $Y_t = \beta_0 + \beta_1T + \beta_2X_t + \beta_3TX_t$, where $T$ represents time as a number, for instance, the number of months since the start of data recording, $X_t$ is 0 or 1 depending on whether the change is in effect, $\beta_0$ represents the value at $T = 0$, $\beta_1$ represents the slope before the change, $\beta_2$ represents the change in level when the change is introduced, and $\beta_3$ represents the change in slope after the change, so that the slope after the change is $\beta_1 + \beta_3$. Contrary to the basic method explained in \cite{bernal2017interrupted}, where the ITS is performed on aggregated values per month, this thesis performs the ITS on single data points, as the premise that the aggregated values all have the same weight within a certain margin is not fulfilled for sentiment and vote score values. Performing the ITS with aggregated values would skew the linear regression towards aggregated values that are based on fewer observations. Fitting on single data points prevents this, as intervals with more observations automatically carry more weight. To filter out seasonal effects, the average value of all data points with the same month of all years is subtracted from the data points (i.e. subtract the average value of all Januaries from each data point in a January).
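
The regression formula above can be fitted by ordinary least squares. The following Python sketch is an illustration under assumptions, not the implementation used for the thesis: the function names `solve` and `fit_its` are hypothetical, the data is synthetic and noise-free, and the normal equations are solved with a small hand-written Gaussian elimination to keep the example free of external libraries.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_its(times, values, change_time):
    """Least-squares fit of Y = b0 + b1*T + b2*X + b3*T*X.

    X is 1 for observations at or after `change_time`, else 0. Needs data
    on both sides of the change; otherwise the system is singular.
    """
    rows = []
    for t in times:
        x = 1.0 if t >= change_time else 0.0
        rows.append([1.0, float(t), x, t * x])
    k = 4
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic, noise-free series: slope 0.5 before the change at T = 6,
# then a level change of 2.0 and a slope change of -0.25.
times = list(range(12))
change_time = 6
values = [
    1.0 + 0.5 * t + (2.0 - 0.25 * t if t >= change_time else 0.0) for t in times
]
beta = fit_its(times, values, change_time)
post_slope = beta[1] + beta[3]  # slope after the change
```

Because every single observation contributes one row to the design matrix, time spans with more data points automatically influence the fit more strongly, which matches the motivation for fitting on single data points instead of monthly aggregates.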


 \begin{figure}
--- a/text/main.tex
+++ b/text/main.tex
@@ -203,6 +203,7 @@
 \usepackage{float}
 \usepackage{subcaption}
 \let\mfs\multiplefootnoteseparator
+\usepackage{listings}

 \addbibresource{\mybiblatexfile}