wip

2022-07-22 20:57:53 +02:00
parent 11716b364c
commit 0ca7531406
2 changed files with 19 additions and 9 deletions
--- a/text/3_method.tex
+++ b/text/3_method.tex
@@ -73,7 +73,10 @@ After preprocessing the raw data, relevant data is filtered and computed. Questi
 \section{Analysis}
 An interrupted time series (ITS) analysis captures trends before and after a change in a system and fits the question this thesis investigates very well. ITS can be applied to a large variety of data, provided that the data contains the same kind of data points before and after the change and that the date and time of the change are known. \citeauthor{bernal2017interrupted} published a paper on how ITS works \cite{bernal2017interrupted}. ITS performs well on medical data: for instance, when a new treatment is introduced, ITS can show whether the treatment improves a condition. ITS requires no control group, which is useful because control groups are often not feasible. It only needs the data before and after the change and the point in time at which the change was introduced.
-ITS relies on linear regression and tries to fit a three-segment linear function to the data. The authors also describe cases where more than three segments are used, but these models quickly raise the complexity of the analysis, and for this thesis a three-segment linear regression is sufficient. The three segments are lines to fit the data before and after the change as well as one line to connect the other two lines at the change date. Figure \ref{itsexample} shows an example of an ITS. The segments are captured by the terms of the following formula: $Y_t = \beta_0 + \beta_1T + \beta_2X_t + \beta_3TX_t$, where $T$ represents time as a number, for instance, the number of months since the start of data recording, $X_t$ is 0 or 1 depending on whether the change is in effect, $\beta_0$ represents the value at $T = 0$, $\beta_1$ represents the slope before the change, $\beta_2$ represents the change in level when the change is introduced, and $\beta_3$ represents the change in slope after the change, so that the slope after the change is $\beta_1 + \beta_3$. Contrary to the basic method explained in \cite{bernal2017interrupted}, where the ITS is performed on values aggregated per month, this thesis performs the ITS on single data points, as the premise that the aggregated values all have the same weight within a certain margin is not fulfilled for sentiment and vote score values. Performing the ITS with aggregated values would give every month the same influence and thereby skew the linear regression towards months with few data points. Fitting single data points prevents this, as months with more data points contribute proportionally more to the fit. To filter out seasonal effects, the average value of all data points with the same month of all years is subtracted from the data points (i.e., the average value of all Januaries is subtracted from each data point in a January). This thesis uses the least-squares method for regression.
+
 ITS relies on linear regression and tries to fit a three-segment linear function to the data. The authors also describe cases where more than three segments are used, but these models quickly raise the complexity of the analysis, and for this thesis a three-segment linear regression is sufficient. The three segments are lines to fit the data before and after the change as well as one line to connect the other two lines at the change date. Figure \ref{itsexample} shows an example of an ITS. The segments are captured by the terms of the following formula: $Y_t = \beta_0 + \beta_1T + \beta_2X_t + \beta_3TX_t$, where $T$ represents time as a number, for instance, the number of months since the start of data recording, $X_t$ is 0 or 1 depending on whether the change is in effect, $\beta_0$ represents the value at $T = 0$, $\beta_1$ represents the slope before the change, $\beta_2$ represents the change in level when the change is introduced, and $\beta_3$ represents the change in slope after the change, so that the slope after the change is $\beta_1 + \beta_3$.
 Contrary to the basic method explained in \cite{bernal2017interrupted}, where the ITS is performed on values aggregated per month, this thesis performs the ITS on single data points, as the premise that the aggregated values all have the same weight within a certain margin is not fulfilled for sentiment and vote score values. Performing the ITS with aggregated values would give every month the same influence and thereby skew the linear regression towards months with few data points. Fitting single data points prevents this, as months with more data points contribute proportionally more to the fit. To filter out seasonal effects, the average value of all data points with the same month of all years is subtracted from the data points (i.e., the average value of all Januaries is subtracted from each data point in a January). This thesis uses the least-squares method for regression.
 Although the ITS analysis takes data density variability and seasonality into account, there is always a possibility that an unknown factor or event is contained in the data. It is always recommended to do a visual inspection of the data. This thesis contains one example where the data density increases so drastically in a particular time span that this form of analysis loses accuracy. 
 %limitations
@@ -81,12 +84,24 @@ Although the ITS analysis takes data density variability and seasonality into ac
 % autocorrelation?
 % 
 \subsection{A synthetic example}
 %TODO
 The diagram in figure \ref{itsexample} is generated by the following algorithm:
 \begin{itemize}
 \item Select base values: choose a base value of 0.10 before the change and a base value of 0.15 after the change
 \item Add noise: for each month, add a random value in $[0, 0.05)$ to the base value
 \item Choose sample size: for each month, choose a random sample size in $[200, 400)$ and replicate the month's value from the previous step that many times
 \item Compute the ITS while taking data density variability into account
 \end{itemize}
 This algorithm generates an ITS where the line before the change is on a lower level than the line after the change. However, the algorithm does not control the slopes of the lines before and after the change, so the slopes of the lines in figure \ref{itsexample} are random. The algorithm could be extended to control the slopes as well, but for demonstration purposes this is sufficient.
 \begin{figure}
 \centering\includegraphics[scale=0.7]{figures/itsexample}
 \caption{An example that visualizes how ITS works. The change of the system occurs at month 0. The blue line shows the average sentiment of fictional answers grouped by month. The numbers attached to the blue line show the number of sentiment values for a given month. The yellow line represents the ITS analysis as a three-segment line. This example shows the expected behavior of the data sets in the following sections.}
 \label{itsexample}
-\end{figure}
+\end{figure}\label{itsexample}
 %interrupted time series
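The four steps of the synthetic example and the regression formula $Y_t = \beta_0 + \beta_1T + \beta_2X_t + \beta_3TX_t$ can be sketched roughly as follows. This is a minimal illustration and not the code used for the thesis: the function names, the 12-month spans before and after the change, the fixed seed, and measuring $T$ in months relative to the change (change at month 0, as in the figure) are all assumptions made here. The last lines check that fitting single data points gives the same coefficients as fitting the monthly averages weighted by their sample sizes.

```python
import numpy as np


def synthetic_points(rng, months_before=12, months_after=12):
    """Build single data points following the four listed steps.

    The 12-month spans are assumptions; the text does not state them.
    """
    T, X, Y = [], [], []
    for month in range(-months_before, months_after):
        after = month >= 0                       # the change occurs at month 0
        base = 0.15 if after else 0.10           # step 1: base values
        value = base + rng.uniform(0.0, 0.05)    # step 2: noise in [0, 0.05)
        n = int(rng.integers(200, 400))          # step 3: sample size in [200, 400)
        T += [month] * n                         # replicate the month's value n times
        X += [int(after)] * n
        Y += [value] * n
    return (np.array(T, dtype=float),
            np.array(X, dtype=float),
            np.array(Y, dtype=float))


def fit_its(T, X, Y, weights=None):
    """Least-squares estimate of (b0, b1, b2, b3) in Y = b0 + b1*T + b2*X + b3*T*X."""
    A = np.column_stack([np.ones_like(T), T, X, T * X])
    if weights is not None:
        # Weighted least squares: scale each row by the square root of its weight.
        w = np.sqrt(weights)
        A, Y = A * w[:, None], Y * w
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return beta


rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
T, X, Y = synthetic_points(rng)

# Step 4: fit on single data points, so denser months weigh more.
beta_points = fit_its(T, X, Y)

# For comparison: aggregate per month and refit with sample sizes as weights.
months, inverse, counts = np.unique(T, return_inverse=True, return_counts=True)
means = np.bincount(inverse, weights=Y) / counts
beta_weighted = fit_its(months, (months >= 0).astype(float), means, weights=counts)
```

With $T$ centred on the change date, the third coefficient is the jump in level at the change and the sum of the second and fourth coefficients is the slope after the change. An unweighted fit on `means` alone would in general give different coefficients, because every month would then count equally regardless of its sample size.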
--- a/text/5_results.tex
+++ b/text/5_results.tex
@@ -1,18 +1,13 @@
 \chapter{Results}
 %TODO some text here
 This section shows the results of the experiments described in section 3 on the data sets described in section 4. On the following pages, there are three diagrams for each community.
 In diagrams (a), the blue line shows the average sentiment (\emph{average sentiment} in the diagram legend) of the answers to questions from new contributors. The numbers attached to the blue line indicate the number of answers to questions from new users that formed the average sentiment. The orange line (\emph{sm single ITS} in the diagram legend) represents the ITS over the whole period of the available data. As stated in section 3.2, data density variability is a factor to take into account; therefore, the orange line represents the weighted ITS. The green, red, purple, and brown lines also represent ITS; however, the time periods considered before and after the change are limited to 6, 9, 12, and 15 months, respectively.
 Similarly, in diagrams (b), the blue line represents the average vote score of the questions from new users. The numbers attached to the blue line indicate the number of questions that formed the average vote score. The ITS (orange, green, red, purple, and brown lines) are computed the same way as in diagrams (a).
-In diagrams (c), the blue line represents the number of first questions from new users, whereas the orange line denotes the number of follow-up questions from new users. The green and red lines represent the ITS of the blue and orange line respectively. In these diagrams no weighting is performed as each data point has equivalent weight.
+In diagrams (c), the blue line represents the number of first questions from new users, whereas the orange line denotes the number of follow-up questions from new users. The green and red lines
-
+represent the ITS of the blue and orange line respectively. In these diagrams no weighting is performed as each data point has equivalent weight.
 % pvalues ... 
 % maybe average data points per month
 \pagebreak
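The additional ITS lines that only consider 6, 9, 12, and 15 months before and after the change could be computed along the following lines. This is a hedged sketch rather than the thesis implementation: the function name `fit_its_window`, the half-open window $[-w, w)$, the toy data, and measuring $T$ in months relative to the change are assumptions introduced for illustration.

```python
import numpy as np


def fit_its_window(T, X, Y, window=None):
    """Fit Y = b0 + b1*T + b2*X + b3*T*X, optionally on a symmetric window.

    T is assumed to be in months relative to the change (change at T = 0);
    the half-open window [-window, window) is also an assumption.
    """
    if window is not None:
        keep = (T >= -window) & (T < window)   # drop points outside the window
        T, X, Y = T[keep], X[keep], Y[keep]
    A = np.column_stack([np.ones_like(T), T, X, T * X])
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return beta


# Noise-free toy data: 24 months on each side of the change.
T = np.arange(-24, 24, dtype=float)
X = (T >= 0).astype(float)
Y = 0.10 + 0.001 * T + 0.05 * X - 0.002 * T * X

# One fit over all data (window=None) and one per limited window.
fits = {w: fit_its_window(T, X, Y, window=w) for w in (None, 6, 9, 12, 15)}
```

On noise-free data every window recovers the same coefficients; on real data the windowed fits react more strongly to values close to the change date, which is the reason for showing them next to the fit over the whole period.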