
Abstract:

One challenging issue in applying Latent Dirichlet Allocation (LDA) is selecting the optimal number of topics, which depends on both the corpus itself and the user's modeling goals. This paper presents a topic selection method that models the minimum perplexity against the number of topics for any given dataset. The research set up scikit-learn and GraphLab Create in Jupyter Notebook on a Google Cloud Compute Engine custom machine and then developed Python code to manipulate selected existing datasets. Results indicate that the graph of perplexity against the number of topics (K) shows strong quadratic behaviour around a turning point and opens upwards. The graph therefore has a minimum perplexity point that optimizes K. The paper presents a model of the optimum K in an identified interval and guides the calculation of this value of K within three iterations using a quadratic approach and differential calculus. This improves the inferential speed for the number of topics and the hyperparameter alpha, thereby enhancing the application of LDA to big data.


1. Introduction

 

Progress in information technology, from large mainframes to PCs to mobile devices to the cloud, has brought an information overflow with transformative societal implications that affect all aspects of human life. A considerable, and possibly the most significant, portion of this information is in the form of text data, such as books, news articles, microblogs and instant messages. These vast quantities of text data can only be accessed and utilized using information technology devices, but the automated processing of text is only possible with technology specialized for human language. Text analytics, in a broad sense, refers to the technologies that allow the utilization of large quantities of text data, among them topic models.

 

Topic models are a class of Bayesian algorithms that reduce collections of text documents to a lower-dimensional representation. A lot of research in the area of topic modeling has been carried out, pioneered by Hofmann (1999). Blei, Ng, and Jordan (2003) developed Latent Dirichlet Allocation (LDA), a generative topic modeling technique referred to by Roberts et al. (2015) as a prominent example in the field of text data analysis.

 

Latent Dirichlet Allocation (LDA) was originally introduced by Blei et al. (2003) and has been widely researched and used in many applications such as text mining, information retrieval and computer vision. In order to apply LDA, we need to specify the Dirichlet hyperparameters alpha and beta and the number of topics K. The performance of many machine learning methods depends critically on hyperparameter settings (Hutter et al., 2014), which means that the predictive performance of LDA can be affected by the choice of hyperparameters. Often users must try different values of the hyperparameters and select the best model. An alternative to setting the hyperparameters in advance is to infer them from the data. As with the other parameters of the model, the hyperparameters can be treated as random variables and have their posterior inferred (Wallach, 2008). Rather than full posterior inference, however, a more common practice is to perform MAP estimation with optimization algorithms, which has been empirically shown to be a competitive approach to sampling the hyperparameter values (Wallach et al., 2009a).

 

Selecting the optimal number of topics is one of the challenging issues in the application of LDA (Wang et al., 2014; Zhao et al., 2015). Several approaches exist, but ultimately the appropriate number of topics must depend on both the corpus itself and the user's modeling goals. Collapsed Gibbs Sampling (CGS) offers a popular technique to infer the hyperparameters from data and guarantees convergence to the actual values. However, CGS is iterative and takes a long time to converge. Researchers have opted to set the number of iterations beforehand, since it is also difficult to assess the point of convergence.

 

This study presents a topic selection method which obtains the optimum value of K within three iterations of CGS, thereby improving the inferential speed for the number of topics. The study models the minimum perplexity against the number of topics for any given dataset using the CGS topic model implementation in GraphLab Create.

 

1.1     Approaches to Setting Number of Topics

The first is to set K manually via “trial and error”. If there is a human in the loop, it may be simplest to try multiple values of K within some reasonable range (e.g., K ∈ {5, 10, 25, 50, 100, 200}). The user can then quickly scan the learned topics associated with each value of K and select the value which seems most appropriate to the corpus.
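
As a minimal sketch of this approach (an illustration only: it uses scikit-learn's variational LDA and a tiny placeholder corpus, whereas the study itself uses the CGS model in GraphLab Create; get_feature_names_out assumes scikit-learn 1.0 or later):

    # Trial-and-error scan over candidate values of K: fit a model for
    # each K and print the top words of every topic for inspection.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["the cat sat on the mat", "dogs and cats are pets",
            "stock markets rose today", "investors bought shares"]

    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(docs)
    vocab = np.array(vec.get_feature_names_out())

    for k in [2, 3, 5]:  # in practice e.g. {5, 10, 25, 50, 100, 200}
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
        print("K =", k)
        for topic in lda.components_:
            print("  ", " ".join(vocab[topic.argsort()[-4:][::-1]]))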

 

The second is to use domain knowledge. If the topics are meant to
correspond to known entities in the world (e.g., in a vision task each topic
may be a type of object), then we can simply set K equal to the true number of
entities we are trying to model.

 

The third approach optimizes performance on a secondary task. If the learned topics are to be used as input to another task, then it follows that K should be chosen according to performance on the ultimate task of interest. For example, if the topics are going to be used for document classification, K could be chosen to optimize classifier performance on a held-aside validation set.
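
A hedged sketch of this idea (assuming scikit-learn; the function name and the placeholder inputs X, a document-term matrix, and y, class labels, are hypothetical):

    # Choose K by accuracy of a downstream classifier trained on the
    # per-document topic proportions, scored on a held-aside split.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def best_k_for_classification(X, y, candidates=(10, 25, 50)):
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
        scores = {}
        for k in candidates:
            lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_tr)
            clf = LogisticRegression(max_iter=1000).fit(lda.transform(X_tr), y_tr)
            scores[k] = clf.score(lda.transform(X_val), y_val)  # validation accuracy
        return max(scores, key=scores.get), scores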

 

Finally, the fourth approach optimizes the likelihood of held-aside documents. Given a means of evaluating the probability P(w|K) of a validation set of held-aside documents (Wallach et al., 2009), we can learn topics for different values of K and choose the value which maximizes the likelihood of the held-aside validation set (Rosen et al., 2004). Plotting these values, we can typically see the familiar pattern of P(w|K) increasing with larger K up to a point, beyond which the model overfits the data (Mitchell, 1997) and P(w|K) on the held-aside documents begins to fall. This study uses mathematical theory to model precisely this behavior, with the aim of reducing the time taken to estimate K.

 

1.2     Model Evaluation: Likelihood and Perplexity

There are two ways of evaluating LDA. One is by measuring performance on some secondary task, such as document classification or information retrieval; the second is by estimating the probability of unseen held-out documents given some training documents (Wallach et al., 2009). This second approach is the most common way to evaluate a probabilistic model and is achieved by measuring the log-likelihood of a held-out test set.

 

When applying the second approach we split the dataset into two parts: one for training, the other for testing. This leads to the following function that we need to evaluate in order to compute the log-likelihood:

                        \ell(\mathbf{w}) = \log p(\mathbf{w} \mid \Phi, \alpha) = \sum_{d=1}^{D} \log p(\mathbf{w}_d \mid \Phi, \alpha)

where w is the collection of D previously unseen documents, with w_d denoting document d, Φ is the topic matrix describing the model and α is the Dirichlet hyperparameter for the topic distribution of documents.

The computed likelihood of unseen documents can be used to compare models, with a higher likelihood implying a better model. A further definition leads to the perplexity measure, which according to the literature is the most typical measure for evaluating LDA models (Bao & Datta, 2014; Blei et al., 2003).

 

Perplexity measures modeling power as the inverse of the geometric mean per-token likelihood of unobserved documents. Traditionally we use perplexity, which is defined in terms of the likelihood as follows:

                        \mathrm{perplexity}(\mathbf{w}) = \exp\left( -\frac{\ell(\mathbf{w})}{N} \right)

where N is the total number of word tokens in the held-out set. Perplexity is a decreasing function of the log-likelihood ℓ(w) of the unseen documents, and lower perplexity corresponds to higher likelihood. This means that better models have lower perplexity, which suggests less uncertainty about unobserved documents. However, the likelihood p(w_d | Φ, α) of even a single document is complex and intractable to compute exactly. This makes the evaluation of ℓ(w), and therefore of the perplexity, complex and intractable as well.

The advantage of
perplexity as a measurement is that it is normalized to the size of the data
and can thus be compared across different datasets.
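
As a minimal numeric illustration of the definition above (the log-likelihood and token count are made-up values; in practice ℓ(w) comes from evaluating the trained model on the held-out set):

    # perplexity = exp(-l(w) / N), with N the number of held-out tokens.
    import math

    def perplexity(log_likelihood, num_tokens):
        return math.exp(-log_likelihood / num_tokens)

    print(perplexity(log_likelihood=-450000.0, num_tokens=60000))  # ~1808.0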

 

2. Materials

In this section we describe the experimental design for the study: the datasets, the evaluation procedures and the relevant experimental setup.

 

2.1     Dataset and Software Used

This experiment demonstrates the use of Latent Dirichlet Allocation on an existing dataset, the 20 Newsgroups dataset, a collection of approximately 20,000 newsgroup text documents partitioned evenly across 20 different groups. The data can be found on GitHub and was originally collected by Lang (1995) for the learning-to-filter-netnews paper. The dataset has become popular for experiments in text applications of machine learning techniques, such as text classification and text clustering, as evidenced by much published work, for example Albishre & Li (2015), Mekala et al. (2017), Dasgupta et al. (2007), Harish & Revanasiddappa (2017) and Feng et al. (2017).
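
For reference, scikit-learn ships a built-in fetcher for this corpus, which offers one convenient way to load it (a sketch; the study obtained its copy from GitHub):

    # Download and load the 20 Newsgroups corpus with its standard
    # train/test split, stripping headers, footers and quoted replies.
    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
    test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
    print(len(train.data), "training and", len(test.data), "test documents")
    print(train.target_names[:5])  # five of the 20 group labels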

 

For the topic modeling task there are many software packages that can be used to train a model, such as Mallet, Matlab and the R packages. In this study, we set up Python 3.6 from Continuum Analytics' Anaconda and configured the scikit-learn, gensim and GraphLab Create libraries to work in Jupyter Notebook as the development environment, hosted on Google Cloud Compute Engine virtual machine instances.

 

2.2     Hardware Environment

The
experiments in this study were conducted under the google cloud compute
engine’s custom machine with a specification of 8 CPUs, 8 GB memory and 64 GB
boot and local disk. This was accessed with a personal laptop with a CPU
specification of Intel(R) Core(TM) i5-4210U CPU @ 1.70 GHz 2.40 GHz and a
memory of 4.00 GB. The capacity of personal laptop hard drive is 500 GB running
Microsoft Windows 8.1 single Language 64-bit Operating System, x64-based
processor.

 


 

3. Methods

This study followed an experimental model to analyse the 20 Newsgroups dataset and to make a critical evaluation of the behaviour of different LDA hyperparameter settings. The researchers used this dataset as secondary data for purposes of experimentation, along with tools developed by other researchers: scikit-learn, a Python machine learning library (Pedregosa et al., 2011), and GraphLab Create, a framework for parallel machine learning (Low et al., 2014).

 

The research first set up the GraphLab environment in Jupyter Notebook on the Google Cloud Compute Engine custom machine and then developed Python code to run in this environment. The code was used to manipulate the data in the desired way and was iterated many times on the given datasets. Results were displayed on a two-dimensional graph using Python code. This allowed the researchers to observe the behaviour of the parameters of interest under different settings, thereby realising the set modelling goals. This procedure was repeated for different datasets with the aim of generalisation.
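
A sketch of this sweep-and-plot procedure (an approximation: it uses scikit-learn's variational LDA and its perplexity() method in place of the GraphLab Create CGS evaluation used in the study; plot_perplexity_curve and the X_train/X_test placeholders are hypothetical):

    # Fit one model per candidate K, score held-out perplexity, and
    # display perplexity against K on a two-dimensional graph.
    import matplotlib.pyplot as plt
    from sklearn.decomposition import LatentDirichletAllocation

    def plot_perplexity_curve(X_train, X_test, ks):
        perps = []
        for k in ks:
            lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
            perps.append(lda.perplexity(X_test))
        plt.plot(ks, perps, marker='o')
        plt.xlabel('Number of topics K')
        plt.ylabel('Held-out perplexity')
        plt.show()
        return perps

    # Usage: plot_perplexity_curve(X_train, X_test, [5, 10, 25, 50, 100, 200]),
    # where X_train and X_test are document-term matrices.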

 

4. Results

In this experiment, we set alpha at 0.1, beta at 0.01 and K at 10,000, a sufficiently large value of K, in order to determine the ‘infinite’ trend of the graph. We observed that the value of perplexity increases monotonically beyond a point of minimum perplexity as K increases. This is illustrated on the graph shown below.

 

4.1     Perplexity against K for 50
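
This turning-point behaviour is what the quadratic model in the abstract exploits: three (K, perplexity) evaluations around the turning point determine a parabola aK^2 + bK + c, and setting its derivative to zero locates the minimum at K* = -b/(2a). A worked sketch with made-up sample points (not the study's measured values):

    # Fit a parabola through three (K, perplexity) points and return
    # its vertex, the estimated perplexity-minimizing K.
    import numpy as np

    def quadratic_vertex(ks, perps):
        a, b, c = np.polyfit(ks, perps, deg=2)  # coefficients of a*K^2 + b*K + c
        return -b / (2 * a)                     # dP/dK = 2aK + b = 0

    print(quadratic_vertex([20, 60, 100], [2400.0, 1900.0, 2100.0]))  # ~68.6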
