From cf63c6708058092353164aef8e5a059c242e750b Mon Sep 17 00:00:00 2001 From: wea_ondara Date: Sat, 27 Mar 2021 19:14:20 +0100 Subject: [PATCH] wip --- text/3_method.tex | 33 +++++++++++++++++++++++++-------- todo2 | 2 +- 2 files changed, 26 insertions(+), 9 deletions(-) diff --git a/text/3_method.tex b/text/3_method.tex index 4b2bf51..211fe7e 100644 --- a/text/3_method.tex +++ b/text/3_method.tex @@ -1,6 +1,6 @@ \chapter{Method} -StackExchange introduced a \emph{new contributor} indicator to all communities on $21^{st}$ of August in 2018 at 9 pm UTC\footnote{\label{post2018come}\url{https://meta.stackexchange.com/questions/314287/come-take-a-look-at-our-new-contributor-indicator}}. This step is one of many StackExchange took to make the platform and its members more welcoming towards new users. This indicator is shown to potential answerers in the answer text box of a question flagged as from a new contributor as shown in figure \ref{newcontributor}. The indicator is added to a question if the question is the first contribution of the user or if the first contribution (question or answer) of the user was less than 7 days ago\footnote{\label{sonic2018what}\url{https://meta.stackexchange.com/questions/314472/what-are-the-exact-criteria-for-the-new-contributor-indicator-to-be-shown}}. The indicator is then shown for 7 days from the creation date of the question. Note that the user can be registered for a long time and then post their first question and it is counted as a question from a new contributor. Also, if a user decides to delete all their existing contributions from the site and then creates a new question this question will have the \emph{new contributor} indicator attached. The sole deciding factor for the indicator is the date and time of the first non-deleted contribution and the 7-day window afterward. 
+StackExchange introduced a \emph{new contributor} indicator to all communities on the $21^{st}$ of August 2018 at 9 pm UTC\footnote{\label{post2018come}\url{https://meta.stackexchange.com/questions/314287/come-take-a-look-at-our-new-contributor-indicator}}. This step is one of many StackExchange took to make the platform and its members more welcoming towards new users. The indicator is shown to potential answerers in the answer text box of a question from a new contributor, as shown in figure \ref{newcontributor}. The indicator is added to a question if the question is the first contribution of the user or if the first contribution (question or answer) of the user was less than 7 days ago\footnote{\label{sonic2018what}\url{https://meta.stackexchange.com/questions/314472/what-are-the-exact-criteria-for-the-new-contributor-indicator-to-be-shown}}. The indicator is then shown for 7 days from the creation date of the question. Note that a user can be registered for a long time before posting their first question, and this question is still counted as a question from a new contributor. Also, if a user decides to delete all their existing contributions from the site and then creates a new question, this question will have the \emph{new contributor} indicator attached. The sole deciding factor for the indicator is the date and time of the first non-deleted contribution and the 7-day window afterward. \begin{figure} \centering\includegraphics[scale=0.47]{figures/new_contributor} @@ -17,21 +17,38 @@ StackExchange introduced a \emph{new contributor} indicator to all communities o % -> so limitations, other factors This thesis investigates the following criteria to determine whether the change affected a community positively or negatively, or whether the community is largely unaffected: \begin{itemize} - \item \textbf{Sentiment of answers to a question}. This symbolizes the quality of communication between different individuals. Better values indicate better communication. 
Through the display of the \emph{new contributor} indicator, the answerer should react less negatively towards the new user if they behave outside the community standards. - \item \textbf{Vote score of questions}. This is similar to the sentiment criterion. Voters will likely vote more postively (not voting instead of down voting, or upvoting instead of of not voting) due to the \emph{new contributor}. Thereby the vote score should increase after the change. + \item \textbf{Sentiment of answers to a question}. This symbolizes the quality of communication between different individuals. Better values indicate better communication. Through the display of the \emph{new contributor} indicator, the answerer should react less negatively towards the new user when they behave outside the community standards. + \item \textbf{Vote score of questions}. This is similar to the sentiment criterion. Voters will likely vote more positively (not voting instead of downvoting, or upvoting instead of not voting) due to the \emph{new contributor} indicator. Thereby the vote score should increase after the change. \item \textbf{The number of first and follow-up questions}. This symbolizes the willingness of users to participate in the community. A higher number of first questions indicates a higher number of new participating users. A higher number of follow-up questions indicates that users are more willing to stay within the community. \end{itemize} -If these criteria improve after the change is introducted, the community is affected positively. If they worsen, the community is affected negatively. If the criteria stay largely the same, then the community is unaffected. A question may receive answers and votes after the \emph{new contributor} indicator is no longer shown and therefore they are not considered as part of the data set to analyze. +If these criteria improve after the change is introduced, the community is affected positively. If they worsen, the community is affected negatively. 
If the criteria stay largely the same, then the community is unaffected. Here it is important to note that a question may receive answers and votes after the \emph{new contributor} indicator is no longer shown and therefore these are not considered as part of the data set to analyze. %only when new contributor indicator is shown -%TODO more vader explanation -To measure the effect on sentiment of the change this thesis utilizes Vader, a sentiment analysis tool with exceptional performance in analysing and categorizing microblog-like texts as well as good generalization in other domains \cite{hutto2014vader}. The choice is based on the speed and simplicity of Vader. Vader uses a lexicon of words with attached sentiment values and rules related to grammar and syntax to determine a sentiment value between -1 and 1 to a given piece of text. The sentiment range is divided into 3 classes: negative (-1 to -0.05), neutral (-0.05 to 0.05), and positive (0.05 to 1). The outer edges of the value space are rarely reached as the text would have to be extremely negative or positive which is very unlikely. This design allows fast and verifiable analysis. +To measure the effect of the change on sentiment, this thesis utilizes the Vader~\cite{hutto2014vader} sentiment analysis tool. This decision is based on its performance in analyzing and categorizing microblog-like texts, its speed of processing, and its simplicity of use. Vader uses a lexicon of words and rules related to grammar and syntax. This lexicon was manually created by \citeauthor{hutto2014vader} and is therefore considered a \emph{gold standard lexicon}. Each word has a sentiment value attached to it. Negative words, for instance \emph{evil}, have negative values; positive words, for instance \emph{brave}, have positive values. The range of these values is continuous, so words can have different intensities, for instance, \emph{bad} has a higher (less negative) value than \emph{evil}. 
This distinction of intensity makes Vader a valence-based approach. + +However, simply looking at the words in a text is not enough and therefore Vader also uses rules to determine how words are used in conjunction with other words. Some words can boost other words. For example, ``They did well'' is less intense than ``They did extremely well''. This works for both positive and negative sentences. Moreover, words can have different meanings depending on the context, for instance, ``Fire provides warmth.'' and ``The boss is about to fire an employee.'' This feature is called \emph{Word Sense Disambiguation}. + +Furthermore, Vader also detects language features commonly found in social media text which may not be present in other forms of text, for instance, books or newspapers. Social media texts may contain acronyms, initialisms (for instance \emph{afaik} (as far as I know)), slang words, emojis, caps words (often used to emphasize meaning), punctuation (for instance, \emph{!!!} and \emph{?!?!}), etc. These features can convey a lot of meaning and drastically change the sentiment of a text. +After all these features are considered, Vader outputs a sentiment value between -1 and 1 on a continuous range. The sentiment range is divided into 3 classes: negative (-1 to -0.05), neutral (-0.05 to 0.05), and positive (0.05 to 1). The outer edges of this range are rarely reached as the text would have to be extremely negative or positive, which is very unlikely. + +%speed +Due to this mathematical simplicity, Vader is fast when computing a sentiment value for a given text. This feature is one of the requirements \citeauthor{hutto2014vader} originally posed. They proposed that Vader shall be fast enough to do online (real time) analysis of social media text. +%simplicity +Vader is also easy to use. It does not require any pre-training on a dataset as it already has a human-curated lexicon and rules related to grammar and syntax. 
Therefore, the sentiment analysis only requires an input to evaluate. This thesis uses a publicly available implementation of Vader.\footnote{\url{https://github.com/cjhutto/vaderSentiment}} +The design of Vader allows fast and verifiable analysis. +% lexicon approach +%valence based (sentiment intensity, (-1,1) continuous) +%detect grammatical features +% detects many language features present in the social media domain (acronym, initialism, slang, punctuation, caps words... +%wsd +%designed to do online processing + % sentiment calculation via vaderlib, write whole paragraph and explain, also add ref to paper \cite{hutto2014vader} \section{Data gathering and preprocessing} -StackExchange provides anonymized data dumps of all their communities for researchers to investigate at no cost on archive.org\footnote{\label{archivestackexchange}\url{https://archive.org/download/stackexchange}}. These data dumps contain users, posts (questions and answers), badges, comments, tags, votes, and a post history containing all versions of posts. Each entry contains the necessary information, for instance, id, creation date, title, body, and how the data is linked together (which user posted a question/answer/comment). However, not all data entries are valid and therefore cannot be used in the analysis, for instance, questions or answers of which the user is unknown, but this only affects a very small amount entries. So before the actual analysis, the data has to be cleaned. Moreover, the answer texts are in HTML format, containing tags that could skew the sentiment values, and they need to be stripped away beforehand. Additionally, answers may contain code sections which also would skew the results and are therefore omitted. +StackExchange provides anonymized data dumps of all their communities for researchers to investigate at no cost on archive.org\footnote{\label{archivestackexchange}\url{https://archive.org/download/stackexchange}}. 
These data dumps contain users, posts (questions and answers), badges, comments, tags, votes, and a post history containing all versions of posts. Each entry contains the necessary information, for instance, id, creation date, title, body, and how the data is linked together (which user posted a question/answer/comment). However, some data entries are invalid and therefore cannot be used in the analysis, for instance, questions or answers of which the user is unknown; this affects only a very small number of entries. So before the actual analysis, the data has to be cleaned. Moreover, the answer texts are in HTML format, containing tags that could skew the sentiment values, and they need to be stripped away beforehand. Additionally, answers may contain code sections which also would skew the results and are therefore omitted. % data sets as xml files from archive.org \cite{archivestackexchange} %cleaning data @@ -55,7 +72,7 @@ After preprocessing the raw data, relevant data is filtered and computed. Questi \section{Analysis} An interrupted time series (ITS) analysis captures trends before and after a change in a system and fits very well with the question this thesis investigates. ITS can be applied to a large variety of data if the data contains the same kind of data points before and after the change and when the change date and time are known. \citeauthor{bernal2017interrupted} published a paper on how ITS works \cite{bernal2017interrupted}. ITS performs well on medical data, for instance, when a new treatment is introduced ITS can visualize if the treatment improves a condition. For ITS no control group is required and often control groups are not feasible. ITS works only with the before and after data and a point in time where a change was introduced. -ITS relies on linear regression and tries to fit a three-segment linear function to the data. 
The authors also described cases where more than three segments are used but these models quickly raise the complexity of the analysis and for this thesis a three-segment linear regression is sufficient. The three segments are lines to fit the data before and after the change as well as one line to connect the other two lines at the change date. Figure \ref{itsexample} shows an example of an ITS. Each segment is captured by a tensor of the following formula $Y_t = \beta_0 + \beta_1T + \beta_2X_t + \beta_3TX_t$, where $T$ represents time as a number, for instance, number of months since the start of data recording, $X_t$ represents 0 or 1 depending on whether the change is in effect, $\beta_0$ represents the value at $T = 0$, $\beta_1$ represents the slope before the change, $\beta_2$ represents the value when the change is introduced, and $\beta_3$ represents the slope after the change. Contrary to the basic method explained in \cite{bernal2017interrupted} where the ITS is performed on aggregated values per month, this thesis performs the ITS on single data points, as the premise that the aggregated values all have the same weight within a certain margin is not fulfilled for sentiment and vote score values. Performing the ITS with aggregated values would skew the linear regression more towards data points with less weight. Single data point fitting prevents this, as weight is taken into account with more data points. To filter out seasonal effects, the average value of all data points with the same month of all years is subtracted from the data points (i.e. subtract the average value of all Januaries from each data point in a January). +ITS relies on linear regression and tries to fit a three-segment linear function to the data. The authors also described cases where more than three segments are used but these models quickly raise the complexity of the analysis and for this thesis a three-segment linear regression is sufficient. 
The three segments are lines to fit the data before and after the change as well as one line to connect the other two lines at the change date. Figure \ref{itsexample} shows an example of an ITS. Each segment is captured by a term of the following formula $Y_t = \beta_0 + \beta_1T + \beta_2X_t + \beta_3TX_t$, where $T$ represents time as a number, for instance, number of months since the start of data recording, $X_t$ represents 0 or 1 depending on whether the change is in effect, $\beta_0$ represents the value at $T = 0$, $\beta_1$ represents the slope before the change, $\beta_2$ represents the change in level when the change is introduced, and $\beta_3$ represents the change in slope after the change, so that the slope after the change is $\beta_1 + \beta_3$. Contrary to the basic method explained in \cite{bernal2017interrupted} where the ITS is performed on aggregated values per month, this thesis performs the ITS on single data points, as the premise that the aggregated values all have the same weight within a certain margin is not fulfilled for sentiment and vote score values. Performing the ITS with aggregated values would skew the linear regression towards aggregated values that are based on fewer data points. Fitting on single data points prevents this, as periods with more data points automatically carry more weight. To filter out seasonal effects, the average value of all data points with the same month of all years is subtracted from the data points (i.e. subtract the average value of all Januaries from each data point in a January). This thesis uses the least squares method for regression. \begin{figure} diff --git a/todo2 b/todo2 index d4db6d5..64d3b59 100644 --- a/todo2 +++ b/todo2 @@ -12,7 +12,7 @@ 3. - DONE arguments for why exactly these variables (sentiment, votes, #questions) - DONEXT limitations, other factors -- DONEXT describe vader in detail +- DONE describe vader in detail 5. - DONE group by categories
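
As an illustration of the segmented regression described in the Analysis section of \texttt{text/3\_method.tex}, the following is a minimal Python sketch and not the implementation used in the thesis. It fits $Y_t = \beta_0 + \beta_1T + \beta_2X_t + \beta_3TX_t$ to single data points with ordinary least squares and subtracts per-month averages to remove seasonal effects. The function names \texttt{solve\_linear\_system}, \texttt{fit\_its}, and \texttt{remove\_seasonality}, as well as the synthetic data, are assumptions made for this example only.

```python
def solve_linear_system(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs stay untouched.
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Move the row with the largest pivot candidate into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    beta = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * beta[k] for k in range(row + 1, n))
        beta[row] = (m[row][n] - tail) / m[row][row]
    return beta


def fit_its(times, in_effect, values):
    """Least-squares fit of Y_t = b0 + b1*T + b2*X_t + b3*T*X_t.

    Every observation is one row, so periods with more data points
    automatically carry more weight in the fit.
    """
    rows = [[1.0, t, x, t * x] for t, x in zip(times, in_effect)]
    # Normal equations: (D^T D) beta = D^T y, with D the design matrix.
    dtd = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    dty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(4)]
    return solve_linear_system(dtd, dty)


def remove_seasonality(months, values):
    """Subtract from each value the mean of all values in the same calendar month."""
    sums, counts = {}, {}
    for month, value in zip(months, values):
        sums[month] = sums.get(month, 0.0) + value
        counts[month] = counts.get(month, 0) + 1
    return [value - sums[month] / counts[month] for month, value in zip(months, values)]


# Synthetic, noise-free example (assumption): the change takes effect at T = 12.
times = [float(t) for t in range(24)]
in_effect = [1.0 if t >= 12 else 0.0 for t in times]
values = [1.0 + 0.5 * t + 2.0 * x - 0.25 * t * x for t, x in zip(times, in_effect)]
beta = fit_its(times, in_effect, values)
```

In this sketch \texttt{beta} holds $(\beta_0, \beta_1, \beta_2, \beta_3)$ in that order; under the formula above, the slope after the change is the sum of the second and fourth entries. A real analysis would typically rely on an established regression library rather than hand-written normal equations, which are used here only to keep the example free of dependencies.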