This commit is contained in:
wea_ondara
2023-02-25 17:27:09 +01:00
parent 6fdcf6760f
commit 0697a44c37
7 changed files with 19 additions and 19 deletions


@@ -8,7 +8,7 @@ StackExchange\footnote{\url{https://stackexchange.com}} is a community question
Originally, StackExchange started with StackOverflow\footnote{\url{https://stackoverflow.com}} in 2008\footnote{\label{atwood2008stack}\url{https://stackoverflow.blog/2008/08/01/stack-overflow-private-beta-begins/}}. Since then, StackExchange has grown into a platform hosting sites for 174 different topics\footnote{\label{stackexchangetour}\url{https://stackexchange.com/tour}}, for instance, programming (StackOverflow), maths (MathOverflow\footnote{\url{https://mathoverflow.net}} and Math StackExchange\footnote{\url{https://math.stackexchange.com}}), and typesetting (TeX/LaTeX\footnote{\url{https://tex.stackexchange.com}}). Questions on StackExchange are written in natural English and consist of a title, a body containing a detailed description of the problem or information needed, and tags to categorize the question. After a question is posted, the community can submit answers to it. The author of the question can then accept an appropriate answer that satisfies their question. The accepted answer is marked as such with a green checkmark and shown on top of all the other answers. Figure \ref{soexamplepost} shows an example of a StackOverflow question. Questions and answers can be up-/downvoted by every user registered on the site. Votes typically reflect the quality and importance of the respective question or answer. Answers with a high voting score rise to the top of the answer list, as answers are sorted by vote score in descending order by default. Voting also influences a user's reputation \cite{movshovitz2013analysis}\footref{stackexchangetour}. When a post (question or answer) is voted upon, the reputation of the poster changes accordingly. Furthermore, downvoting an answer also decreases the reputation of the user who cast the vote\footnote{\url{https://stackoverflow.com/help/privileges/vote-down}}.
Reputation on StackExchange indicates how trustworthy a user is. To gain a high reputation value a user has to invest a lot of time and effort to reach a high reputation value by asking good questions and posting good answers to questions. Reputation also unlocks privileges which may differ slightly from one community to another\footnote{\url{https://mathoverflow.com/help/privileges/}}\mfs\footnote{\url{https://stackoverflow.com/help/privileges/}}.
Reputation on StackExchange indicates how trustworthy a user is. To gain a high reputation value, a user has to invest a lot of time and effort by asking good questions and posting good answers to questions. Reputation also unlocks privileges which may differ slightly from one community to another\footnote{\url{https://mathoverflow.com/help/privileges/}}\mfs\footnote{\url{https://stackoverflow.com/help/privileges/}}.
With privileges, users can, for instance, create new tags when the need for one arises, cast votes on closing questions that are off-topic or duplicates of other questions, cast votes on reopening questions that had been closed for no or a wrong reason, or even gain access to moderation tools.
StackExchange also employs a badge system to steer the community\footnote{\label{stackoverflowbadges}\url{https://stackoverflow.com/help/badges/}}. Some badges can be obtained by performing one-time actions, for instance, reading the tour page, which contains necessary details for newly registered users; others require performing certain actions multiple times, for instance, editing and answering the same question within 12 hours.
Furthermore, users can comment on every question and answer. Comments can be used to further clarify an answer or to hold a short discussion on a question or answer.
@@ -355,7 +355,7 @@ Linguistic Inquiry and Word Count (LIWC) \cite{pennebaker2001linguistic,pennebak
% - very old (1966), continuously refined, still in use (vader)
% - misses lexical feature detection (acronyms, ...) and sentiment intensity (vader)
General Inquirer (GI)\cite{stone1966general} is one of the oldest sentiment tools still in use. It was originally designed in 1966 and has been continuously refined and now consists of about 11000 words where 1900 positively rated words and 2300 negatively rated words. Like LIWC, GI uses a polarity-based lexicon and therefore is not able to capture sentiment intensity\cite{hutto2014vader}. Also, GI does not recognize lexical features, such as acronyms, initialisms, etc.
General Inquirer (GI)\cite{stone1966general} is one of the oldest sentiment tools still in use. Originally designed in 1966, it has been continuously refined and now consists of about 11000 words, of which about 1900 are rated positive and 2300 negative. Like LIWC, GI uses a polarity-based lexicon and is therefore not able to capture sentiment intensity\cite{hutto2014vader}. Also, GI does not recognize lexical features, such as acronyms and initialisms.
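The limitation of a purely polarity-based lexicon can be illustrated with a minimal sketch (the toy lexicon below is hypothetical, not GI's actual word list): every matched word contributes the same fixed weight, so the intensity difference between, e.g., "good" and "excellent" is lost.

```python
# Toy polarity-based lexicon scorer (hypothetical word list, for
# illustration only). Every word is rated +1 or -1, so the score
# cannot express intensity: "good" and "excellent" count the same.
LEXICON = {
    "good": +1, "great": +1, "excellent": +1,
    "bad": -1, "terrible": -1, "awful": -1,
}

def polarity_score(text: str) -> int:
    """Sum the binary polarity of every lexicon word in the text."""
    return sum(LEXICON.get(word.strip(".,!?"), 0)
               for word in text.lower().split())

print(polarity_score("the answer was good"))       # 1
print(polarity_score("the answer was excellent"))  # 1 (same score)
print(polarity_score("terrible, simply awful"))    # -2
```

A valence-aware lexicon such as VADER's instead assigns graded intensity values to each entry, which is exactly what a binary polarity table cannot represent.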
%Hu-Liu04 \cite{hu2004mining,liu2005opinion}, 2004
@@ -425,7 +425,7 @@ Word-Sense Disambiguation (WSD)\cite{akkaya2009subjectivity} is not a sentiment
%updateing (extend/modify) hard (e.g. new domain) (vader)
\textbf{Machine Learning Approches}\\
Because handcrafting sentiment analysis requires a lot of effort, researchers turned to approaches that offload the labor-intensive part to machine learning (ML). However, this results in a new challenge, namely: gathering a \emph good data set to feed the machine learning algorithms for training. Firstly, \emph good data set needs to represent as many features as possible, otherwise, the algorithm will not recognize it. Secondly, the data set has to be unbiased and representative of all the data of which the data set is a part of. The data set has to represent each feature in an appropriate amount, otherwise, the algorithms may discriminate a feature in favor of other more represented features. These requirements are hard to fulfill and often they are not\cite{hutto2014vader}. After a data set is acquired, a model has to be learned by the ML algorithm, which is, depending on the complexity of the algorithm, a very computationally-intensive and memory-intensive process. After training is completed, the algorithm can predict sentiment values for new pieces of text, that it has never seen before. However, due to the nature of this approach, the results cannot be comprehended by humans easily if at all. ML approaches also suffer from a generalization problem and therefore cannot be transferred to other domains without accepting a bad performance, or updating the training data set to fit the new domain. Updating (extending or modifying) the model also requires complete retraining from scratch. These drawbacks make ML algorithms useful only in narrow situations where changes are not required and the training data is static and unbiased.
Because handcrafting sentiment analysis requires a lot of effort, researchers turned to approaches that offload the labor-intensive part to machine learning (ML). However, this results in a new challenge, namely gathering a \emph{good} data set to feed the machine learning algorithms for training. Firstly, a \emph{good} data set needs to represent as many features as possible, otherwise the algorithm will not recognize the missing ones. Secondly, the data set has to be unbiased and representative of all the data it is drawn from. The data set has to represent each feature in an appropriate amount, otherwise the algorithms may discriminate against a feature in favor of better-represented features. These requirements are hard to fulfill and often they are not met\cite{hutto2014vader}. After a data set is acquired, a model has to be learned by the ML algorithm, which is, depending on the complexity of the algorithm, a very computationally- and memory-intensive process. After training is completed, the algorithm can predict sentiment values for new pieces of text that it has never seen before. However, due to the nature of this approach, the results cannot easily be comprehended by humans, if at all. ML approaches also suffer from a generalization problem and therefore cannot be transferred to other domains without accepting poor performance or updating the training data set to fit the new domain. Updating (extending or modifying) the model also requires complete retraining from scratch. These drawbacks make ML algorithms useful only in narrow situations where changes are not required and the training data is static and unbiased.
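A minimal sketch of this approach is a Naive Bayes sentiment classifier; the hand-made six-sentence training set below is hypothetical and deliberately far too small and biased for real use, illustrating the data-quality problem described above.

```python
import math
from collections import Counter

# Hypothetical toy training set: six labeled sentences. A real data set
# would need orders of magnitude more, balanced and unbiased examples.
train = [
    ("this answer is great and very helpful", "pos"),
    ("excellent explanation works perfectly", "pos"),
    ("clear and useful example", "pos"),
    ("this is wrong and does not work", "neg"),
    ("terrible answer completely useless", "neg"),
    ("bad example very confusing", "neg"),
]

# "Training": count word frequencies per class for likelihood estimates.
word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text: str) -> str:
    """Return the class with the highest log-probability,
    using Laplace (add-one) smoothing for unseen words."""
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))
        for word in text.split():
            score += math.log((word_counts[label][word] + 1)
                              / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("great and useful"))   # pos
print(predict("wrong and useless"))  # neg
```

Note that extending the training set changes every word count, so the model must be rebuilt from scratch, which mirrors the retraining drawback discussed above.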
% naive bayes
% - simple (vader)
@@ -445,7 +445,7 @@ Support Vector Machines (SVM) use a different approach. SVMs put data points in
%generall blyabla, transition to vader
In general, ML approaches do not provide an improvement over hand-crafted lexicon approaches as they only shift the time-intensive process to training data set collections. Furthermore, lexicon-based approaches seem to have progressed further in terms of coverage and feature weighting. However, many tools are not specifically tailored to social media text analysis and leak in coverage of feature detection.
In general, ML approaches do not provide an improvement over hand-crafted lexicon approaches, as they only shift the time-intensive process to the collection of training data sets. Furthermore, lexicon-based approaches seem to have progressed further in terms of coverage and feature weighting. However, many tools are not specifically tailored to social media text analysis and lack coverage in feature detection.
%vader (Valence Aware Dictionary for sEntiment Reasoning)(grob) \cite{hutto2014vader}
% - 2014