Abstract:

One challenging issue in the application of Latent Dirichlet Allocation (LDA) is selecting the optimal number of topics, which must depend on both the corpus itself and the user's modeling goals. This paper presents a topic selection method which models the minimum perplexity against the number of topics for any given dataset. The research set up scikit-learn and GraphLab Create on Jupyter Notebook in a Google Cloud Compute Engine custom machine and then developed Python code to manipulate selected existing datasets. Results indicate that the graph of perplexity against the number of topics (K) has a strong quadratic behaviour around a turning point and opens upwards. This means that the graph has a minimum perplexity point that optimizes K. The paper presents a model of the optimum K in an identified interval and guides the calculation of this value of K within three iterations using a quadratic approach and differential calculus. This improves the inferential speed for the number of topics and the hyperparameter alpha, thereby enhancing the application of LDA to big data.

1. Introduction:

Progress in information technology, from large mainframes to PCs to mobile devices to the cloud, has brought an information overflow with transformative societal implications that affect all aspects of human life. A considerable, and possibly the most significant, portion of this information is in the form of text data, such as books, news articles, microblogs and instant messages. These vast quantities of text data can only be accessed and utilized using information technology devices, but the automated processing of text is only possible using technology specialized for human language. Text analytics in a broad sense refers to technology that allows the utilization of large quantities of text data, and topic models are among these technologies.

Topic models are a class of Bayesian algorithms that can be used to analyse text documents by reducing them to a lower-dimensional representation. A lot of research in the area of topic modeling has been carried out, pioneered by Hofmann (1999). Blei, Ng, and Jordan (2003) developed Latent Dirichlet Allocation (LDA), a generative topic modeling technique referred to by Roberts et al. (2015) as a prominent example in the field of text data analysis.

Latent Dirichlet Allocation (LDA) was originally introduced by Blei et al. (2003) and has been widely researched and used in many applications such as text mining, information retrieval, and computer vision. In order to apply LDA, we need to specify the Dirichlet hyperparameters alpha and beta and the number of topics K. The performance of many machine learning methods depends critically on hyperparameter settings (Hutter et al., 2014). This means that the predictive performance of LDA can be affected by the choice of hyperparameters. Often users must try different values of the hyperparameters and select the best model. An alternative to setting the hyperparameters in advance is to infer them from the data. As with the other parameters of the model, the hyperparameters can be treated as random variables and have their posterior inferred (Wallach, 2008). Rather than full posterior inference, however, a more common practice is to perform MAP estimation with optimization algorithms, which has been empirically shown to be a competitive approach to sampling the hyperparameter values (Wallach et al., 2009a).

Selecting the optimal number of topics is one of the challenging issues in the application of LDA (Wang et al., 2014; Zhao et al., 2015). Several approaches exist, but ultimately the appropriate number of topics must depend on both the corpus itself and the user's modeling goals. Collapsed Gibbs Sampling (CGS) offers a popular technique to infer the hyperparameters from the data and guarantees convergence to the actual values. However, CGS is iterative and takes a lot of time to converge. Researchers have opted to set the number of iterations beforehand, since it is also difficult to assess the point of convergence.

This study presents a topic selection method which obtains the optimum value of K within three iterations of CGS, thereby improving the inferential speed for the number of topics. The study models the minimum perplexity against the number of topics for any given dataset using the CGS topic model implementation in GraphLab Create.
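To preview the idea: if perplexity behaves quadratically in K around its minimum, three (K, perplexity) evaluations suffice to fit perplexity(K) ≈ aK^2 + bK + c and read the optimum off the vertex, where differential calculus gives d(perplexity)/dK = 2aK + b = 0. The following is a minimal sketch of that calculation; the three measurement pairs are hypothetical placeholders, not results from this study.

import numpy as np

# Three hypothetical (K, perplexity) evaluations, e.g. from three CGS runs.
ks = np.array([25.0, 50.0, 100.0])
perplexities = np.array([1980.0, 1890.0, 1920.0])

# Fit the exact parabola perplexity(K) = a*K^2 + b*K + c through the points.
a, b, c = np.polyfit(ks, perplexities, deg=2)

# The minimum sits at the vertex, where d(perplexity)/dK = 2*a*K + b = 0.
k_star = -b / (2 * a)
print("estimated optimal K:", round(k_star))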

1.1 Approaches to Setting the Number of Topics

The first approach is to set K manually via "trial and error". If there is a human in the loop, it may be simplest to try multiple values of K within some reasonable range (e.g., K ∈ {5, 10, 25, 50, 100, 200}). The user can then quickly scan the learned topics associated with each value of K and select the value which seems most appropriate to the corpus, as in the sketch below.
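As a hedged illustration of this approach (not the exact code used in the study), the following scans a small grid of K values with scikit-learn and prints the top words of each learned topic for inspection; the toy corpus is an illustrative stand-in:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Illustrative stand-in corpus; in practice this would be e.g. 20 Newsgroups.
docs = [
    "space shuttle launch orbit nasa moon",
    "hockey team season game players league",
    "graphics image file format rendering software",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()

for k in (2, 3, 5):  # scan a small grid of K and eyeball the learned topics
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    print("K =", k)
    for t, weights in enumerate(lda.components_):
        top_words = [vocab[i] for i in weights.argsort()[::-1][:5]]
        print("  topic", t, ":", " ".join(top_words))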

The second approach is to use domain knowledge. If the topics are meant to correspond to known entities in the world (e.g., in a vision task each topic may be a type of object), then we can simply set K equal to the true number of entities we are trying to model.

The third approach optimizes performance on a secondary task. If the learned topics are to be used as input to another task, then it follows that K should be chosen according to performance on the ultimate task of interest. For example, if the topics are going to be used for document classification, K could be chosen to optimize classifier performance on a held-aside validation set, as sketched below.
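A minimal sketch of this approach, assuming scikit-learn and a three-category subset of 20 Newsgroups (the grid of K values, the classifier and the categories are illustrative choices, not those of this study):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative three-class classification task built from 20 Newsgroups.
data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.sport.hockey", "comp.graphics"])
X = CountVectorizer(stop_words="english", max_features=2000).fit_transform(data.data)
X_tr, X_va, y_tr, y_va = train_test_split(X, data.target, test_size=0.2, random_state=0)

best_k, best_acc = None, 0.0
for k in (5, 10, 25, 50):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_tr)
    clf = LogisticRegression(max_iter=1000).fit(lda.transform(X_tr), y_tr)
    acc = clf.score(lda.transform(X_va), y_va)  # validation accuracy for this K
    if acc > best_acc:
        best_k, best_acc = k, acc
print("chosen K:", best_k, "validation accuracy:", best_acc)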

Finally, the fourth approach optimizes the likelihood of held-aside documents. Given a means of evaluating the probability P(w|K) of a validation set of held-aside documents (Wallach et al., 2009), we can learn topics for different values of K and choose the value which maximizes the likelihood of the held-aside validation set (Rosen et al., 2004). Plotting these values, we typically see the familiar pattern of P(w|K) increasing with larger K up to a point, beyond which the model overfits (Mitchell, 1997) the data and P(w|K) on the held-aside documents begins to fall. This study uses mathematical theory to model precisely this behaviour, with the aim of reducing the time taken to estimate K.
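A hedged sketch of this fourth approach using scikit-learn, whose LatentDirichletAllocation exposes a perplexity method (lower perplexity corresponds to higher held-out likelihood); the grid of K values here is an illustrative assumption:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Build a document-term matrix and hold aside a validation split.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data
X = CountVectorizer(stop_words="english", max_features=5000).fit_transform(docs)
X_train, X_valid = train_test_split(X, test_size=0.2, random_state=0)

# Fit LDA for each candidate K and evaluate perplexity on the held-aside set.
perplexity_by_k = {}
for k in (5, 10, 25, 50, 100, 200):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    perplexity_by_k[k] = lda.perplexity(X_valid)

best_k = min(perplexity_by_k, key=perplexity_by_k.get)  # minimum perplexity
print("chosen K:", best_k)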

1.2 Model Evaluation: Likelihood and Perplexity

There are two ways of evaluating LDA. One is by measuring performance on some secondary task, such as document classification or information retrieval, and the second is by estimating the probability of unseen held-out documents given some training documents (Wallach et al., 2009). This second approach is the most common way to evaluate a probabilistic model and is achieved by measuring the log-likelihood of a held-out test set.

When applying the second approach we split the dataset into two parts: one for training, the other for testing. This leads to the following function that we need to evaluate in order to compute the log-likelihood:

$\mathcal{L}(w) = \log p(w \mid \Phi, \alpha) = \sum_{d} \log p(w_d \mid \Phi, \alpha)$

where $w_d$ is a previously unseen document in the held-out collection $w$, $\Phi$ is the topic matrix describing the model and $\alpha$ is the Dirichlet hyperparameter for the topic distribution of documents.

The computed likelihood of unseen documents can be used to compare models, with a higher likelihood implying a better model. A further definition leads to the perplexity measure, which according to research is the most typical measure for evaluating LDA models (Bao & Datta, 2014; Blei et al., 2003).

Perplexity measures the modeling power by calculating the exponentiated negative per-token log-likelihood of unobserved documents. Traditionally, perplexity is defined in terms of the likelihood as follows:

$\mathrm{perplexity}(w) = \exp\left(-\frac{\mathcal{L}(w)}{\sum_{d} N_d}\right)$

where $N_d$ is the number of tokens in held-out document $d$.

Perplexity is a decreasing function of the log-likelihood $\mathcal{L}(w)$ of the unseen documents $w_d$, and lower perplexity corresponds to higher likelihood. This means that better models have lower perplexity, which suggests less uncertainty about unobserved documents. However, the likelihood $p(w_d \mid \Phi, \alpha)$ of one document is complex and intractable to compute exactly. This makes the evaluation of $\mathcal{L}(w)$, and therefore of the perplexity, complex and intractable as well.

The advantage of perplexity as a measurement is that it is normalized to the size of the data and can thus be compared across different datasets.
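As a minimal sketch of this normalization, assuming a fitted model lda and a held-out document-term matrix X_test as in the earlier sketch, the perplexity can be recovered from an (approximate) log-likelihood and the total token count:

import numpy as np

# lda.score returns an approximate log-likelihood log p(w | Phi, alpha);
# dividing by the token count normalizes it to the size of the test data.
log_likelihood = lda.score(X_test)
n_tokens = X_test.sum()  # total word occurrences in the held-out documents
perplexity = np.exp(-log_likelihood / n_tokens)
print(perplexity)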

2. Materials

In this section we describe the experimental design for the study: first the datasets, then the evaluation procedures and the relevant experimental setup.

2.1 Dataset and Software Used

This experiment demonstrates the use of Latent Dirichlet Allocation on an existing dataset, the 20 Newsgroups dataset, which is a collection of approximately 20,000 newsgroup text documents partitioned evenly across 20 different groups. The data can be found on GitHub and was originally collected by Lang (1995) for the learning-to-filter-netnews paper. The dataset has become popular for experiments in text applications of machine learning techniques, such as text classification and text clustering, as evidenced in a lot of published work, for example Albishre & Li (2015), Mekala et al. (2017), Dasgupta et al. (2007), Harish & Revanasiddappa (2017) and Feng et al. (2017).
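The same corpus is also bundled with scikit-learn, so a minimal, hedged way to reproduce the data loading step is:

from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
print(len(newsgroups.data))     # roughly 19,000 documents
print(newsgroups.target_names)  # the 20 group labels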

For the topic modeling task there are many software packages that can be used to train a model, such as Mallet, Matlab and various R packages. In this study, we set up Python 3.6 from Continuum Analytics' Anaconda distribution and configured the scikit-learn, gensim and GraphLab Create libraries to work in Jupyter Notebook as the development environment, set up on Google Cloud Compute Engine virtual machine instances.

2.2 Hardware Environment

The experiments in this study were conducted on a Google Cloud Compute Engine custom machine with a specification of 8 CPUs, 8 GB memory and 64 GB boot and local disk. This was accessed from a personal laptop with an Intel(R) Core(TM) i5-4210U CPU @ 1.70 GHz 2.40 GHz, 4.00 GB of memory and a 500 GB hard drive, running the Microsoft Windows 8.1 Single Language 64-bit operating system on an x64-based processor.


3. Methods:

This study followed an experimental model to analyse the 20 Newsgroups dataset and to make a critical evaluation of the behaviour of different LDA hyperparameter settings. The researchers used this dataset as secondary data for purposes of experimentation, together with tools developed by other researchers, such as scikit-learn, a Python machine learning library (Pedregosa et al., 2011), and GraphLab Create, a framework for parallel machine learning (Low et al., 2014).

The research first set up the GraphLab environment on Jupyter Notebook in the Google Cloud Compute Engine custom machine and then developed Python code to run on this environment. The code was used to manipulate the data in the desired way and was iterated many times on the given datasets. Results were displayed on a two-dimensional graph using Python code. This allowed the researchers to observe the behaviour of the parameters of interest under different settings, thereby conceptualising the set modelling goals. This procedure was repeated for different datasets with the aim of generalisation.
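The study's exact plotting code is not reproduced here; the following is a minimal sketch of the display step with matplotlib, using hypothetical placeholder values for the perplexities:

import matplotlib.pyplot as plt

# Hypothetical placeholder values standing in for measured perplexities.
ks = [10, 25, 50, 100, 200, 400]
perplexities = [2150, 1980, 1890, 1920, 2050, 2300]

plt.plot(ks, perplexities, marker="o")
plt.xlabel("Number of topics K")
plt.ylabel("Perplexity on held-out documents")
plt.title("Perplexity against number of topics")
plt.show()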

4. Results:

In this experiment, we set alpha at 0.1, beta at 0.01 and K at 10,000, a sufficiently large value of K, in order to determine the 'infinite' trend of the graph. We observed that the value of perplexity increases monotonically beyond a point of minimum perplexity as K increases. This is illustrated on the graph shown below.

4.1 Perplexity against K for 50