Abstract: One challenging issue in the application of Latent Dirichlet Allocation (LDA) is selecting the optimal number of topics, which must depend on both the corpus itself and the user's modeling goals. This paper presents a topic selection method which models the minimum perplexity against the number of topics for any given dataset. The research set up scikit-learn and GraphLab on Jupyter Notebook in a Google Cloud Compute Engine custom machine and then developed Python code to manipulate selected existing datasets. Results indicate that the graph of perplexity against the number of topics (K) has a strong quadratic behaviour around a turning point and opens upwards. This means that the graph has a minimum perplexity point that optimizes K.

The paper presents a model of the optimum K in an identified interval and guides the calculation of this value of K within three iterations using a quadratic approach and differential calculus. This improves the inferential speed for the number of topics and the hyperparameter alpha, thereby enhancing LDA application in big data.

1. Introduction: Progress in information technology, from large mainframes to PCs to mobile devices to the cloud, has brought an information overflow with transformative societal implications that affect all aspects of human life. A considerable, and possibly the most significant, portion of this information is in the form of text data, such as books, news articles, microblogs and instant messages. These vast quantities of text data can only be accessed and utilized using information technology devices, but the automated processing of text is only possible using technology specialized for human language. Text analytics in a broad sense refers to technology that allows the utilization of large quantities of text data, among them topic models.

Topic models are a class of Bayesian algorithms that can be used to analyse text documents by reducing them to a lower dimension. A lot of research in the area of topic modeling has been carried out, pioneered by Hofmann (1999). Blei, Ng, & Jordan (2003) developed Latent Dirichlet Allocation (LDA), a generative topic modeling technique referred to by Roberts et al. (2015) as a prominent example in the field of text data analysis. LDA has been widely researched and used in many applications such as text mining, information retrieval, and computer vision. In order to apply LDA, we need to specify the alpha and beta Dirichlet hyperparameters and the number of topics K.

The performance of many machine learning methods depends critically on hyperparameter settings (Hutter et al., 2014). This means that the predictive performance of LDA can be affected by the choice of hyperparameters. Often users must try different values of hyperparameters and select the best model. An alternative to setting the hyperparameters in advance is to infer them from the data.

As with the other parameters of the model, the hyperparameters can be treated as random variables and have their posterior inferred (Wallach, 2008). Rather than full posterior inference, however, a more common practice is to perform MAP estimation with optimization algorithms, which has been empirically shown to be a competitive approach to sampling the hyperparameter values (Wallach et al., 2009a). Selecting the optimal number of topics is one of the challenging issues in the application of LDA (Wang et al., 2014; Zhao et al., 2015).

Several approaches exist, but ultimately the appropriate number of topics must depend on both the corpus itself and the user's modeling goals. Collapsed Gibbs Sampling (CGS) offers a popular technique to infer the hyperparameters from data and guarantees convergence to the actual values. However, CGS is iterative and takes a lot of time to converge. Researchers have opted to set the number of iterations beforehand, since it is also difficult to assess the point of convergence. This study presents a topic selection method which obtains the optimum value of K within three iterations of CGS, thereby improving the inferential speed for the number of topics. The study models the minimum perplexity against the number of topics for any given dataset using the CGS topic model implementation in GraphLab Create.

1.1 Approaches to Setting the Number of Topics
The first approach is to set K manually via trial and error. If there is a human in the loop, it may be simplest to try multiple values of K within some reasonable range (e.g., K ∈ {5, 10, 25, 50, 100, 200}). The user can then quickly scan the learned topics associated with each value of K and select the value which seems most appropriate to the corpus.

The second is to use domain knowledge. If the topics are meant to correspond to known entities in the world (e.g., in a vision task each topic may be a type of object), then we can simply set K equal to the true number of entities we are trying to model.

The third approach optimizes performance on a secondary task. If the learned topics are to be used as input to another task, then it follows that K should be chosen according to performance on the ultimate task of interest. For example, if the topics are going to be used for document classification, K could be chosen to optimize classifier performance on a held-aside validation set.

Finally, in the fourth approach, the likelihood of held-aside documents is optimized.

Given a means of evaluating the probability P(w|K) of a validation set of held-aside documents (Wallach et al., 2009), we can learn topics for different values of K and choose the value which maximizes the likelihood of the held-aside validation set (Rosen et al., 2004). Plotting these values, we can typically see the familiar pattern of P(w|K) increasing with larger K up to a point, beyond which the model overfits (Mitchell, 1997) the data and P(w|K) on the held-aside documents begins to fall. This study uses mathematical theory to model precisely this behavior, with the aim of reducing the time taken to estimate K.

1.2 Model Evaluation: Likelihood and Perplexity
There are two ways of evaluating LDA. One is by measuring performance on some secondary task, such as document classification or information retrieval, and the second is by estimating the probability of unseen held-out documents given some training documents (Wallach, 2009). This second approach is the most common way to evaluate a probabilistic model and is achieved by measuring the log-likelihood of a held-out test set. When applying the second approach we split the dataset into two parts: one for training, the other for testing.

This leads to the following function that we need to evaluate in order to compute the log-likelihood:

ℓ(w) = log p(w|Φ, α) = Σ_d log p(w_d|Φ, α)

where w_d is a collection of previously unseen documents, Φ is the topic matrix describing the model and α is the Dirichlet hyperparameter for the topic distribution of documents. The computed likelihood of unseen documents can be used to compare models, with a higher likelihood implying a better model. A further definition leads to the perplexity measure, which according to research is the most typical measure for evaluating LDA models (Bao & Datta, 2014; Blei et al., 2003). Perplexity measures the modeling power by calculating the inverse log-likelihood of unobserved documents. Traditionally we use perplexity, which is defined in terms of likelihood as follows:

perplexity(test set w) = exp( -ℓ(w) / N )

where N is the total number of tokens in the test set. Perplexity is a decreasing function of the log-likelihood ℓ(w) of the unseen documents w_d, and lower perplexity corresponds to higher likelihood. This means that better models have lower perplexity, which suggests less uncertainty about unobserved documents. However, the likelihood p(w_d|Φ, α) of one document is complex and intractable to compute exactly. This makes the evaluation of ℓ(w), and therefore the perplexity, complex and intractable as well.
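As a concrete illustration of this definition, perplexity is simply the exponentiated negative per-token log-likelihood. The sketch below uses plain NumPy with made-up token probabilities (the values are illustrative, not from the study):

```python
import numpy as np

def perplexity(log_likelihood, num_tokens):
    """perplexity = exp(-log-likelihood / number of tokens)."""
    return np.exp(-log_likelihood / num_tokens)

# Toy held-out set: 500 tokens, each assigned probability 0.01 by the model.
token_log_probs = np.full(500, np.log(0.01))
ll = token_log_probs.sum()        # log-likelihood of the held-out documents
print(perplexity(ll, 500))        # ~100: the inverse of the per-token probability
```

Lower perplexity means the model is, on average, less "surprised" by each unseen token.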

The advantage of perplexity as a measurement is that it is normalized to the size of the data and can thus be compared across different datasets.

2. Materials
In this section we describe the experimental design for the study. We first describe the datasets, the evaluation procedures and the relevant experimental setup.

2.1 Dataset and Software Used
This experiment demonstrates the use of Latent Dirichlet Allocation on an existing dataset called the 20 Newsgroups dataset, a collection of approximately 20,000 newsgroup text documents partitioned evenly across 20 different groups. The data can be found on GitHub and was originally collected by Lang (1995) for the learning-to-filter-netnews paper. The dataset has become popular for experiments in text applications of machine learning techniques, such as text classification and text clustering, as evidenced in published work, for example Albishre & Li (2015), Mekala et al. (2017), Dasgupta et al. (2007), Harish & Revanasiddappa (2017) and Feng et al. (2017).

For the topic modeling task there are many software packages that can be used to train a model, such as Mallet, Matlab, R packages, etc. In this study, we set up Python 3.6 from Continuum Analytics' Anaconda and configured the scikit-learn, gensim and GraphLab Create libraries to work in Jupyter Notebook as the development environment, set up on Google Cloud Compute Engine virtual machine instances.

2.2 Hardware Environment
The experiments in this study were conducted on a Google Cloud Compute Engine custom machine with a specification of 8 CPUs, 8 GB memory and 64 GB boot and local disk. This was accessed with a personal laptop with an Intel(R) Core(TM) i5-4210U CPU @ 1.70 GHz 2.40 GHz and 4.00 GB of memory. The personal laptop has a 500 GB hard drive running the Microsoft Windows 8.1 Single Language 64-bit Operating System on an x64-based processor.

3. Methods: This study followed an experimental model to analyse the 20 Newsgroups datasets and to make a critical evaluation of the behaviour of different LDA hyperparameter settings. The researcher used these datasets as secondary data for purposes of experimentation, together with tools developed by other researchers such as scikit-learn, a Python machine learning library (Pedregosa et al., 2011), and GraphLab Create, a framework for parallel machine learning (Low et al., 2014). The research first set up the GraphLab environment on Jupyter Notebook in the Google Cloud Compute Engine custom machine and then developed Python code to run in this environment. The code was used to manipulate the data in the desired way and was iterated many times on given datasets. Results were displayed on a two-dimensional graph using Python code. This allowed the researchers to observe the behaviour of the parameters of interest under different settings, thereby conceptualising the set modelling goals. This procedure was repeated for different datasets with the aim of generalisation.
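The kind of scan described above can be sketched with scikit-learn, one of the libraries the study configured (the study itself used GraphLab Create for CGS; scikit-learn's variational LDA is used here only as a stand-in, and the synthetic count matrix replaces the real corpus):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Synthetic stand-in for a document-term count matrix (the study used the
# 20 Newsgroups corpus; any bag-of-words matrix works the same way).
counts = rng.poisson(1.0, size=(200, 100))
train, test = counts[:150], counts[150:]

# Scan candidate values of K and keep the one with lowest held-out perplexity.
best_K, best_pp = None, np.inf
for K in [5, 10, 25, 50]:
    lda = LatentDirichletAllocation(n_components=K, doc_topic_prior=0.1,
                                    topic_word_prior=0.01, max_iter=10,
                                    random_state=0)
    lda.fit(train)
    pp = lda.perplexity(test)   # held-out perplexity; lower is better
    if pp < best_pp:
        best_K, best_pp = K, pp

print("chosen K:", best_K)
```

Each candidate K requires a full model fit, which is exactly the cost the paper's three-point estimator is designed to avoid.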

4. Results: In this experiment, we set alpha at 0.1, beta at 0.01 and K at 10,000, a sufficiently large value of K in order to determine the 'infinite' trend of the graph. We observed that the value of perplexity increases monotonically beyond a point of minimum perplexity as K increases. This is illustrated on the graph shown below.

4.1 Perplexity against K for 50 < K < 10,000
From the general trend, we observed that there is a minimum perplexity between a K value of 50 and, for purposes of symmetry, an estimate of 690. We therefore changed the limit parameter values to (50, 690, 5), where 5 is the step. The step of five was chosen in order to increase precision in modelling K for this range.

4.2 Perplexity against K for 50 < K < 690
Following is a graph of perplexity against K for 50 < K < 690 at steps of 5.

5. Discussion: After experimenting on many datasets and setting iterations at 25, it was observed that a graph of perplexity against K has a strong quadratic behaviour around a turning point which opens upwards, as illustrated on graph 4.2 above. This means the graph has a minimum perplexity point that optimizes K. Our interest was therefore to model the behaviour of K and to guide the calculation of the optimal value with fewer iterations, thereby making a contribution to the body of knowledge.

This would enhance the speed of hyperparameter estimation and hence the model's application in big text analytics.

In order to model this point of minimum perplexity, the following quadratic model was set up:

p = aK² + bK + C

where p is the perplexity, K is the number of topics, and a, b and C are constants of the general quadratic function. We differentiate, equate the derivative to zero and solve the resulting equation to find the optimal value of K, as follows:

dp/dK = 2aK + b
2aK + b = 0 at the optimum, from differential calculus
K = -b/(2a) ................ equation 5.1

Our task in this study was therefore to model the evaluation of a and b for any dataset of interest and use those values to calculate the optimum value of K as shown in equation 5.1 above. To accomplish this task, we solved the following set of three general quadratic equations, equation 5.2, equation 5.3 and equation 5.4. The reason why we set up three equations is that we have three unknowns: a, b and C. We therefore require only three data points (K1, p1), (K2, p2) and (K3, p3) to estimate the optimum K for any dataset, as opposed to the previous approach where many iterations on the data had to be performed.

p1 = aK1² + bK1 + C ................ equation 5.2
p2 = aK2² + bK2 + C ................ equation 5.3
p3 = aK3² + bK3 + C ................ equation 5.4

On solving this block of simultaneous equations by substitution, we get the generic values of a, b and C as follows:

a = [ (p3 - p1)/(K3 - K1) - (p2 - p1)/(K2 - K1) ] / (K3 - K2)
b = (p2 - p1)/(K2 - K1) - a(K1 + K2)
C = p1 - aK1² - bK1

We conceptualise the following algorithm for estimating the topic model hyperparameters:

Algorithmic Steps: Pseudocode for our proposed estimator.
    Initialise α, β and K
    FOR iteration i = 1, 2, 3 DO
        READ (Ki, pi)
    END FOR
    a ← [ (p3 - p1)/(K3 - K1) - (p2 - p1)/(K2 - K1) ] / (K3 - K2)
    b ← (p2 - p1)/(K2 - K1) - a(K1 + K2)
    C ← p1 - aK1² - bK1
    K ← -b/(2a)
end function

6. Conclusion: In the Collapsed Gibbs Sampling algorithm, the values of the Dirichlet priors, α and β, are assumed to be known (Nguyen et al., 2012).

Many researchers in the area of topic modelling use heuristic values for the hyperparameters. In particular, a common value is α = 50/K, where K is the total number of topics. The experiments on beta shown above indicate that a value of β = 0.01 yields better likelihood for all test cases and is therefore a reasonable estimate.

In conclusion, the study established that the iterative procedure of finding the optimal number of topics can be modified to reduce the number of iterations as follows: first initialise α and β to 0.1 and 0.01 respectively, and then iterate K three times from a value of 25 with a step of 25. We then use the quadratic formula and differential calculus to obtain the optimal value of K for the particular dataset. To obtain the new alpha, we use the relationship α = 50/K. From experimentation, it has also been shown that a β value of 0.01 is optimal.
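The concluding procedure can be sketched end to end as follows. This is a minimal illustration, not the study's code: `run_cgs(K, alpha, beta)` is a hypothetical placeholder for one 25-iteration CGS run that returns held-out perplexity, and the toy lambda in the usage example stands in for a real corpus.

```python
def estimate_hyperparameters(run_cgs):
    """Three CGS probes at K = 25, 50, 75, a quadratic fit through the
    resulting (K, perplexity) points, then the heuristic alpha = 50/K update."""
    alpha, beta = 0.1, 0.01                  # initial values from the study
    Ks = [25, 50, 75]                        # three probes, step of 25
    ps = [run_cgs(K, alpha, beta) for K in Ks]

    (K1, p1), (K2, p2), (K3, p3) = zip(Ks, ps)
    # Coefficients of p = a*K**2 + b*K + C through the three points.
    a = ((p3 - p1) / (K3 - K1) - (p2 - p1) / (K2 - K1)) / (K3 - K2)
    b = (p2 - p1) / (K2 - K1) - a * (K1 + K2)
    K_opt = -b / (2 * a)                     # vertex of the fitted parabola

    return K_opt, 50 / K_opt, beta           # optimal K, new alpha, beta

# Toy stand-in: pretend perplexity follows (K - 100)**2 + 500, minimised at K = 100.
K_opt, alpha, beta = estimate_hyperparameters(lambda K, a, b: (K - 100) ** 2 + 500)
print(K_opt, alpha)  # 100.0 0.5
```

The whole estimate costs three model fits, regardless of how wide the plausible range of K is.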