Proceedings of the X Colloquium of Young Researchers in Databases and Information Systems
UDC 004.9
PLSA EFFICIENCY IMPROVEMENT BASED ON INITIALIZATION AND APPROXIMATION
V. Avanesov, I. Kozlov
Institute for System Programming of the Russian Academy of Sciences, [email protected]
Probabilistic Latent Semantic Analysis (PLSA) is an effective technique for information retrieval, but it has a serious drawback: it consumes a huge amount of computational resources, which makes it hard to train the model on a large collection of documents. The aim of this paper is to improve the time efficiency of the training algorithm. Two different approaches are explored: one is based on efficiently finding an appropriate initial approximation; the idea of the other is that the topics of a collection may be extracted from a relatively small fraction of the data.
Keywords: PLSA, topic modeling, initial approximation
1. Introduction
Topic modeling is an application of machine learning to text analysis. It is useful for a variety of text analysis tasks, for example document categorization [1], spam detection [2], phishing detection [3], and many other applications.
One of the widespread algorithms is Probabilistic Latent Semantic Analysis (PLSA) introduced by Thomas Hofmann in [4].
1.1. Generative model
PLSA is based on the "bag of words" generative model: every document is assumed to be a multinomial distribution over topics, and every topic is a multinomial distribution over words. The generative process may be defined as follows:
— For every position in document d, independently draw a topic t from the document's distribution over topics.
— Draw a word w from the distribution over words of topic t.
The aim of topic modeling is to recover the topics and the distribution of each document over topics.
1.2. Topic modeling as optimization problem
According to the generative model, one can estimate the probability of observing a collection D as:
$$p(D) = \prod_{d \in D} \prod_{w \in d} \sum_{t} p(t \mid d)\, p(w \mid t). \qquad (1)$$
Let us denote $\varphi_{wt} = p(w \mid t)$ and $\theta_{td} = p(t \mid d)$. One may obtain $\varphi_{wt}$ and $\theta_{td}$ as a solution of the optimization problem:
$$L = \sum_{d \in D} \sum_{w \in d} \log \sum_{t} \varphi_{wt}\theta_{td} \to \max \qquad (2)$$
subject to the constraints
$$\forall t \;\; \sum_{w} \varphi_{wt} = 1, \qquad \forall d \;\; \sum_{t} \theta_{td} = 1 \qquad (3)$$
and
$$\forall t, w \;\; \varphi_{wt} \ge 0, \qquad \forall d, t \;\; \theta_{td} \ge 0. \qquad (4)$$
1.3. Topic modeling as matrix decomposition
1.3.1. Kullback-Leibler divergence
Kullback-Leibler divergence is a non-negative measure of the difference between two probability distributions:
$$KL(p \,\|\, q) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i}. \qquad (5)$$
Let us consider an empirical distribution $p_i$ and some parametric distribution $q_i = q_i(\alpha)$ which is used to explain $p_i$. It is easy to see that in this case minimization of the KL-divergence is equivalent to the maximum-likelihood estimation of $\alpha$:
$$KL(p_i \,\|\, q_i(\alpha)) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i(\alpha)} \to \min_{\alpha}, \qquad \sum_{i=1}^{n} p_i \ln q_i(\alpha) \to \max_{\alpha}. \qquad (6)$$
Thus, one can easily see that (2) is equivalent to a weighted Kullback-Leibler divergence minimization:
$$\sum_{d \in D} n_d \, KL\!\left(\frac{n_{wd}}{n_d} \,\Big\|\, \sum_{t} \varphi_{wt}\theta_{td}\right) \to \min \qquad (7)$$
where $n_{wd}$ is the number of occurrences of word $w$ in document $d$ and $n_d$ is the number of words in document $d$.
1.3.2. Matrix decomposition
Now let us denote the empirical distribution of words in a document as $\hat{p}(w \mid d) = \frac{n_{wd}}{n_d}$. With this notation, one can consider the problem (2) as a matrix decomposition:
$$F \approx_{KL} \Phi\Theta \qquad (8)$$
where the matrix $F = (\hat{p}(w \mid d))_{W \times D}$ is the empirical distribution of words over documents, the matrix $\Phi = (\varphi_{wt})_{W \times T}$ is the distribution of words over topics, and the matrix $\Theta = (\theta_{td})_{T \times D}$ is the distribution of topics over documents. Thus, our optimization problem may be rewritten in Kullback-Leibler notation as:
$$KL(F \,\|\, \Phi\Theta) \to \min. \qquad (9)$$
Thus, PLSA may be viewed as a stochastic matrix decomposition.
1.4. Expectation-Maximization algorithm
Unfortunately, (2) has no analytical solution, so we use the Expectation-Maximization (EM) algorithm. It consists of two steps:
1. Estimation of the number $n_{dwt}$ of occurrences of word $w$ produced by topic $t$ in document $d$ (E-step).
2. Optimization of the distributions of documents over topics and of topics over words, relying on the $n_{dwt}$ values obtained during the E-step (M-step).
One can estimate $n_{dwt}$ as follows:
$$n_{dwt} = \frac{n_{wd}\, p(w \mid t)\, p(t \mid d)}{\sum_{t'} p(w \mid t')\, p(t' \mid d)} \qquad (10)$$
where $n_{wd}$ is the number of occurrences of word $w$ in document $d$. The probability $p(w \mid t)$ may then be estimated as:
$$\hat{p}(w \mid t) = \frac{n_{wt}}{n_t} = \frac{\sum_{d \in D} n_{dwt}}{\sum_{w}\sum_{d \in D} n_{dwt}} \qquad (11)$$
where $n_{wt}$ is the number of occurrences of word $w$ produced by topic $t$:
$$n_{wt} = \sum_{d \in D} n_{dwt} \qquad (12)$$
and $n_t$ is the number of words produced by topic $t$:
$$n_t = \sum_{w} n_{wt}. \qquad (13)$$
Estimation for p(t | d) may be found analogously.
As one can see, the asymptotic running time of this algorithm is $O(D \cdot V \cdot T \cdot I)$, where $D$ is the number of documents, $V$ is the average number of distinct words in a document, $T$ is the number of topics, and $I$ is the number of iterations until convergence. Inference of PLSA on a large dataset therefore requires a lot of time, so methods for decreasing the computation time are important.
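To make the E- and M-steps concrete, here is a minimal NumPy sketch of one EM iteration (our illustration, not the original implementation; it assumes a dense $W \times D$ count matrix `n_wd`, whereas a practical implementation would use sparse structures):

```python
import numpy as np

def plsa_em_iteration(n_wd, phi, theta, eps=1e-12):
    """One EM iteration of PLSA.

    n_wd  -- W x D matrix of counts, n_wd[w, d] = occurrences of word w in doc d
    phi   -- W x T matrix, phi[w, t] = p(w | t)
    theta -- T x D matrix, theta[t, d] = p(t | d)
    """
    # E-step folded into the M-step: Z[w, d] = sum_t phi[w, t] * theta[t, d]
    Z = phi @ theta
    R = n_wd / np.maximum(Z, eps)              # n_wd / Z, shared by both updates

    # M-step, eqs. (10)-(13): n_wt = sum_d n_dwt, then normalize over words
    n_wt = phi * (R @ theta.T)                 # W x T
    phi_new = n_wt / n_wt.sum(axis=0, keepdims=True)

    # The analogous update for theta (normalize over topics)
    n_td = theta * (phi.T @ R)                 # T x D
    theta_new = n_td / n_td.sum(axis=0, keepdims=True)
    return phi_new, theta_new
```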
The number of topics and the number of documents are defined by the application. The size of the vocabulary (the number of distinct words) can be decreased by text normalization (removal of stop words, lowercasing, etc.). The number of iterations until convergence depends on the initial approximation of the PLSA parameters $\Phi$ and $\Theta$, so a good initial approximation can reduce the number of iterations. The current study presents an efficient approach to finding a beneficial initial approximation. The other method of computation time reduction is based on the idea that the matrix $\Phi$ may be obtained on a small representative part of a document collection.
2. Related Work
The original algorithm was described in 1999 in [4]. Since then, numerous papers have been devoted to PLSA, but only a few of them address time efficiency. In [5] the authors improve time efficiency by parallelizing the algorithm with OpenMP and report a 6-fold speed-up on an 8-CPU machine. The work [6] improves on that result by using MPI. Both of these studies, however, attack the problem of time efficiency purely by programming methods.
In [7] Farahat uses LSA to find an initialization for PLSA. LSA is based on the SVD* matrix decomposition in the L2 norm and lacks a probabilistic interpretation. PLSA performs a stochastic matrix decomposition based on the Kullback-Leibler divergence and has a simple probabilistic interpretation, but it inherits the problem of every non-convex optimization algorithm: it may converge to a local minimum instead of the global one. The combination of LSA and PLSA leverages the best features of both models: using the LSA training result as an initial approximation helps to avoid convergence to a poor local minimum. But the problem of time efficiency is not explored there. In [7] it is shown that the L2 norm is appropriate for finding an initialization for the PLSA inference algorithm; we will use this result in our work.
The idea of obtaining a distribution over topics for a document not included in the collection that PLSA had initially been trained on was expressed in [8]. The author suggests performing this through the EM scheme while holding the matrix $\Phi$ fixed. However, he proposes this method only for query processing, not for speeding up PLSA training.
3. Proposed approach
In this work we present two different approaches to computation time reduction. One of them is based on finding an initial approximation and reducing the number of iterations to convergence. The other is based on obtaining $\Phi$ on a representative sample, fixing $\Phi$, and then obtaining $\Theta$ on the whole collection.
*SVD - Singular value decomposition, a factorization of a matrix into the product of a unitary matrix, a diagonal matrix, and another unitary matrix
3.1. Finding initial approximation
In this work we use neither LSA nor clustering methods. Instead, we take a subset of our collection (for example, 10%), apply PLSA to this sample, and calculate an initial approximation using the obtained matrix $\Phi_{part}$. The computation time of PLSA is proportional to the number of documents in the collection, so training PLSA on a 10% part of the collection is at least ten times faster (per iteration) than training on the whole collection.
3.1.1. Taking a sample
In order to obtain a representative sample, we take a random one. The exact size of the sample is not important, so we use a rather simple scheme: we include each document independently with probability 10%. This yields a representative part of the collection whose size is approximately 10% of the whole collection.
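A minimal sketch of this sampling scheme (the inclusion probability and the seed are illustrative):

```python
import random

def take_sample(collection, p=0.1, seed=42):
    """Include each document independently with probability p."""
    rng = random.Random(seed)
    return [doc for doc in collection if rng.random() < p]
```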
3.1.2. Initial approximation of $\Phi$ (words-topics)
For training PLSA on the sample, we use a random initialization. The computation time is linear in the number of documents, so this training stage is relatively fast. The obtained matrix $\Phi_{part}$ can be used as an initial approximation of the matrix $\Phi$ for the whole collection, but some words from the vocabulary may not occur in the sample, and every topic in $\Phi_{part}$ assigns zero weight to these words. If we used $\Phi_{part}$ as is, these probabilities would stay zero at every EM step (10) (Section 1.4). This would have a disastrous effect on the likelihood (or perplexity) of our model:
$$\text{likelihood} = \prod_{d \in D} \prod_{w \in d} \sum_{t \in T} p(w \mid t)\, \theta_{td} = 0 \qquad (14)$$
because some word w* would have zero probability for every topic. Thus, some kind of smoothing is necessary. In this work we use a trivial one:
1. Add some constant to every position in every topic: $\forall t, w \;\; p(w \mid t) \mathrel{+}= const$. In this work we use $const = \frac{1}{vocabularySize}$.
2. Normalize: $\forall t \;\; \sum_{w} p(w \mid t) = 1$.
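A minimal sketch of this smoothing, assuming $\Phi_{part}$ is stored as a dense $W \times T$ matrix whose columns are topics:

```python
import numpy as np

def smooth_phi(phi_part):
    """Smooth the topic-word matrix trained on the sample.

    phi_part -- W x T matrix, phi_part[w, t] = p(w | t); rows of words that
                did not occur in the sample are all zeros.
    """
    vocabulary_size = phi_part.shape[0]
    phi = phi_part + 1.0 / vocabulary_size       # step 1: add the constant
    return phi / phi.sum(axis=0, keepdims=True)  # step 2: renormalize each topic
```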
3.1.3. Initial approximation of $\Theta$ (documents-topics)
During the previous step we found an initial approximation for the matrix $\Phi$ (words by topics). Now we have to find an initial approximation for the matrix $\Theta$ (documents by topics) given $\Phi$. Each of its columns $\theta_d$ can be found as a solution of the following optimization problem:
$$\theta_d = \arg\max_{\theta} p(d \mid \theta) = \arg\max_{\theta} \prod_{w \in d} \sum_{t \in T} p(w \mid t)\, \theta_{td} \qquad (15)$$
subject to
$$\sum_{t \in T} \theta_{td} = 1. \qquad (16)$$
But our aim is to decrease the training time, and solving this problem exactly is not fast enough. We propose to find $\theta_d$ in the L2 norm instead. The L2 solution would not be a solution in our space with the Kullback-Leibler divergence, but we are not looking for the exact solution, only for an appropriate initialization.
Now let us assume that all words are replaced by their serial numbers in the vocabulary, and let us consider topics and documents as vectors in $\mathbb{R}^V$, where $V$ stands for the vocabulary size. The $i$-th coordinate of a document-vector is the number of times the word with number $i$ occurs in the document: $d(i) = \#(\text{occurrences of word } i \text{ in document } d)$. For a topic-vector, the $i$-th coordinate is the probability of generating word $i$ from this topic $t$: $t(i) = p(i \mid t)$.
Let us consider the vector space spanned by the topic-vectors; the topics form a basis of this space. One can find an initial approximation for the distribution of a document $d$ over topics as the orthogonal projection onto this space:
$$d = d_{\parallel} + d_{\perp} \qquad (17)$$
where
$$\forall t \in T \;\; (d_{\perp}, t) = 0. \qquad (18)$$
For computational efficiency we orthogonalize and normalize the basis $\{t\}$, obtaining an orthonormal basis $\{t'\}$ with $\forall i \ne j \;\; (t'_i, t'_j) = 0$ and $\forall i \;\; (t'_i, t'_i) = 1$. This allows us to find the projection faster and more simply:
$$d_{\parallel} = \sum_i a_i t'_i \qquad (19)$$
where $t'_i$ is the $i$-th vector of the orthonormal basis and
$$(d, t'_i) = (d_{\parallel}, t'_i) = \Big(t'_i, \sum_j a_j t'_j\Big) = a_i \qquad (20)$$
where the scalar product is defined as follows:
$$(x, y) = \sum_{i=1}^{V} x_i y_i. \qquad (21)$$
The expansion of the document-vector in the orthonormal basis is now obtained, so one can return to the topic basis and normalize as follows:
$$\sum_{t \in T} d_t = 1 \qquad (22)$$
where $d_t$ is the weight of topic $t$ in document $d$.
Due to the nature of the L2 norm, some topic weights may be too small or even negative, so smoothing is necessary. We do it analogously to the previous subsection (see the sketch after this list):
1. Replace negative weights by zero.
2. Add some constant to every weight. In this paper we use $const = \frac{1}{numberOfTopics}$.
3. Normalize: $\forall d \;\; \sum_{t \in T} d_t = 1$.
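A minimal NumPy sketch of the whole procedure; the paper does not fix a particular orthogonalization method, so a thin QR factorization is assumed here (its Q factor is exactly an orthonormalized topic basis, and solving with the R factor returns the coefficients to the original topic basis):

```python
import numpy as np

def init_theta_l2(phi, n_wd):
    """L2 initial approximation of theta, one column per document.

    phi  -- V x T topic matrix (columns are topic-vectors in R^V,
            assumed linearly independent)
    n_wd -- V x D matrix of raw word counts (columns are document-vectors)
    """
    T = phi.shape[1]
    Q, R = np.linalg.qr(phi)          # orthonormal basis of the topic subspace
    a = Q.T @ n_wd                    # projection coefficients, eq. (20)
    c = np.linalg.solve(R, a)         # back to the topic basis; may be negative
    c = np.maximum(c, 0.0)            # step 1: replace negative weights by zero
    c += 1.0 / T                      # step 2: add the smoothing constant
    return c / c.sum(axis=0, keepdims=True)  # step 3: normalize each document
```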
3.2. Fixed $\Phi$ approach
This approach is based on the fact that the matrix $\Phi$ may be found by training PLSA on a representative subset $D' \subset D$. The columns of the matrix $\Theta$ may then be obtained through the following constrained optimization problem, solved independently for each document $d \in D$:
$$L(\theta_d) = \sum_{w \in W} n_{dw} \ln \sum_{t \in T} \varphi_{wt}\theta_{td} \to \max_{\theta_d} \qquad (23)$$
$$\sum_{t \in T} \theta_{td} = 1 \qquad (24)$$
$$\theta_{td} \ge 0 \qquad (25)$$
This approach consists of the following steps:
1. Take a random subset of documents $D' \subset D$ of sufficient size.
2. Obtain $\Phi$ by training PLSA on $D'$ using the EM algorithm described in [4].
3. Obtain $\theta_d$ by solving the optimization problem (23)-(25) for each document $d \in D$.
The third step needs some explanation. We solve this problem with an EM algorithm.
The E-step estimates the probabilities $p(t \mid d, w)$ as:
$$p(t \mid d, w) = \frac{\varphi_{wt}\theta_{td}}{\sum_{\tau \in T} \varphi_{w\tau}\theta_{\tau d}} \qquad (26)$$
M-step. In order to solve the problem (23)-(25), we temporarily omit the non-negativity constraint (25); we will see that the solution is non-negative anyway.
The Lagrange function for the problem (23), (24) takes the form:
$$L(\theta_d) = \sum_{w \in W} n_{dw} \ln \sum_{t \in T} \varphi_{wt}\theta_{td} - \lambda\Big(\sum_{t \in T} \theta_{td} - 1\Big) \qquad (27)$$
We take the derivative:
$$\frac{\partial L}{\partial \theta_{td}} = \sum_{w \in W} n_{dw} \frac{\varphi_{wt}}{\sum_{\tau \in T} \varphi_{w\tau}\theta_{\tau d}} - \lambda = 0 \qquad (28)$$
Then we move $\lambda$ to the right-hand side, multiply both sides by $\theta_{td}$, and, according to (26), obtain:
$$\sum_{w \in W} n_{dw}\, p(t \mid d, w) = \lambda\, \theta_{td} \qquad (29)$$
Now we sum both sides over every $t \in T$:
$$\sum_{w \in W} \sum_{t \in T} n_{dw}\, p(t \mid d, w) = \lambda \qquad (30)$$
From equation (29) we obtain $\theta_{td}$, substituting $\lambda$ from (30):
$$\theta_{td} = \frac{\sum_{w \in W} n_{dw}\, p(t \mid d, w)}{\sum_{\omega \in W} \sum_{\tau \in T} n_{d\omega}\, p(\tau \mid d, \omega)}$$
The denominator is independent of $t$; thus:
$$\theta_{td} \propto \sum_{w \in W} n_{dw}\, p(t \mid d, w) \qquad (31)$$
Primal feasibility is easily verified by summing $\theta_{td}$ over $t$. Dual feasibility follows from (30) and the non-negativity of probabilities, so this point satisfies the Karush-Kuhn-Tucker conditions. Also, one can easily see that $\theta_{td} \ge 0$; thus, we have found a solution of the problem (23)-(25).
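A minimal sketch of this fixed-$\Phi$ inference for a single document, iterating the updates (26) and (31); the vectors are dense and the iteration count is illustrative:

```python
import numpy as np

def infer_theta(phi, n_d, n_iter=30, eps=1e-12):
    """EM inference of a document's topic distribution with Phi held fixed.

    phi -- V x T fixed topic matrix, phi[w, t] = p(w | t)
    n_d -- length-V vector of word counts of the document
    """
    T = phi.shape[1]
    theta = np.full(T, 1.0 / T)                 # uniform starting point
    for _ in range(n_iter):
        # E-step, eq. (26): p(t | d, w) proportional to phi[w, t] * theta[t]
        p_tdw = phi * theta                     # V x T
        p_tdw /= np.maximum(p_tdw.sum(axis=1, keepdims=True), eps)
        # M-step, eq. (31): theta_t proportional to sum_w n_dw * p(t | d, w)
        theta = n_d @ p_tdw
        theta /= theta.sum()
    return theta
```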
4. Experiments
We conduct two kinds of experiments: we evaluate perplexity [9] for our approaches and for classical PLSA, and we compare classification performance on the topic distributions obtained by PLSA and by our approaches.
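For reference, we use the standard definition of perplexity, $\exp(-L/n)$, where $L$ is the log-likelihood (2) and $n$ is the total number of words; a minimal sketch over dense matrices:

```python
import numpy as np

def perplexity(n_wd, phi, theta, eps=1e-12):
    """Perplexity exp(-L / n) of a PLSA model on a count matrix n_wd."""
    Z = np.maximum(phi @ theta, eps)            # model p(w | d), W x D
    log_likelihood = (n_wd * np.log(Z)).sum()
    return np.exp(-log_likelihood / n_wd.sum())
```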
4.1. Datasets
Both experiments were conducted on three datasets: tweets, news articles, and abstracts of scientific papers.
• Twitter dataset.
The Twitter dataset contains tweets posted by 15,000 Twitter users and written in English. We merge all tweets posted by a single user into a single document; every document contains approximately 1,000 tweets. Documents with fewer than 50 words are omitted.
• The 20 Newsgroups dataset**.
The 20 Newsgroups dataset is often used for testing topic modeling on text categorization. It is a collection of approximately 20,000 short newsgroup documents partitioned (nearly) evenly across 20 different newsgroups.
• Arxiv***.
The third dataset consists of abstracts of scientific articles: approximately 900,000 abstracts from 6 areas: Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics. The distribution of articles over areas is not uniform: Physics contains 600 thousand abstracts, Mathematics 270 thousand, and Quantitative Biology only 5 thousand. Abstracts with fewer than 20 words are omitted. For the experiments with fixed $\Phi$ we omit the small areas and take only Physics, Mathematics, and Computer Science.
Some text normalization is performed: stoplisting, lowercasing, and rare word removal (words that occur fewer than 5 times in the whole collection).
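A minimal sketch of this normalization, assuming the documents are already tokenized and `stop_words` is a set:

```python
from collections import Counter

def normalize_texts(docs, stop_words, min_count=5):
    """Lowercase, drop stop words, drop words with < min_count total occurrences."""
    docs = [[w.lower() for w in doc if w.lower() not in stop_words] for doc in docs]
    counts = Counter(w for doc in docs for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in docs]
```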
4.2. Initial approximation
Four types of initial approximation are compared:
— Random initial approximation for the matrix $\Phi$ (words by topics) and the matrix $\Theta$ (documents by topics). Denoted "random".
— An initial approximation for the matrix $\Phi$ calculated on the sample; a random initial approximation for the matrix $\Theta$. Denoted "phi".
— An initial approximation for the matrix $\Theta$ calculated on the sample; a random initial approximation for the matrix $\Phi$. Denoted "theta".
— Initial approximations for both $\Theta$ and $\Phi$ calculated on the sample. Denoted "full".
4.2.1. Perplexity depending on initial approximation
We evaluate the dependence of perplexity on the type of initial approximation. In these experiments the number of iterations is fixed at 100, and the number of topics is 25. Figures 1, 2 and 3 show perplexity depending on the number of iterations for different datasets and different types of initial approximation; the labels are explained at the beginning of Section 4.2.
** http://qwone.com/~jason/20Newsgroups/
*** http://arxiv.org/
Fig. 1. Perplexity for the Twitter dataset
Fig. 2. Perplexity for news articles
Fig. 3. Perplexity for scientific articles
As one can see, all types of initial approximation decrease the perplexity of the model, and a model with an initial approximation found by our approach converges faster than a model with a random initial approximation. The same behavior is observed for every dataset. The perplexity values obtained after 100 iterations for different datasets and different types of initialization are presented in Table 1.
Table 1
Perplexity after 100 iterations
Initialization | News articles | Tweets | Arxiv
Random | 3006.86 | 9287.68 | 1575.58
$\Theta$ only | 2913.73 | 9111.91 | 1551.84
$\Phi$ only | 2993.90 | 9143.94 | 1559.13
$\Theta$ and $\Phi$ | 2973.90 | 8816.67 | 1561.07
4.2.2. Computation time depending on initial approximation
In these experiments we evaluate the training time depending on the initial approximation. We perform iterations until the stop criterion is satisfied: the change of perplexity is less than 1 five times in a row. Choosing a threshold is not the aim of this work; similar results were observed for a wide range of thresholds, and the difference in dispersion is less than the standard deviation. The results for different datasets and different types of initialization are presented in Tables 2, 3 and 4 (they include the total training time: training on the sample, orthogonalization, finding the initial approximation for $\Theta$, and training on the whole collection).
Table 2
Computation time depending on initial approximation for the Twitter dataset
Initialization | Perplexity | Time, sec | Iterations
Random | 9258.17 | 7164.9 | 114
$\Theta$ only | 9143.5 | 4131.9 | 57
$\Phi$ only | 9213.2 | 4814.9 | 67
$\Theta$ and $\Phi$ | 8856.5 | 3722.6 | 48
Table 3
Computation time depending on initial approximation for news articles
Initialization | Perplexity | Time, sec | Iterations
Random | 2996.1 | 161.5 | 110
$\Theta$ only | 2946.9 | 102.3 | 64
$\Phi$ only | 3020.6 | 97.5 | 61
$\Theta$ and $\Phi$ | 2999.4 | 106.6 | 67
Table 4
Computation time depending on initial approximation for Arxiv
Initialization | Perplexity | Time, sec | Iterations
Random | 1592.8 | 4017.8 | 59
$\Theta$ only | 1560.9 | 2413.7 | 30
$\Phi$ only | 1587.7 | 2012.9 | 25
$\Theta$ and $\Phi$ | 1583.7 | 1583.7 | 21
It can be seen that our approach decreases the calculation time by 1.5-2 times on every dataset. The perplexity of the models with an initial approximation obtained by our approach is less than or equal to the perplexity of the model with a random initial approximation.
4.2.3. Document categorization depending on initial approximation
In these experiments we show that our initial approximation does not decrease the quality of PLSA compared to the random initial approximation. We produce a series of experiments with different random seeds and different types of initial approximation. Then we apply a random forest classifier and use ten-fold cross-validation to evaluate classification accuracy. The accuracy obtained in a single experiment may be considered a random variable. We perform 25 runs for each type of initialization for news articles and 6 runs for each initialization for the Twitter dataset. These random variables are drawn independently; thus, one can use the Mann-Whitney U test. We compare two hypotheses:
— H0: our initial approximation decreases or does not change the classification accuracy compared to the random initial approximation.
— H1: our initial approximation increases the classification accuracy compared to the random initial approximation.
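The test itself is one SciPy call; the accuracy values below are illustrative placeholders, not the paper's measurements:

```python
from scipy.stats import mannwhitneyu

# Accuracies over repeated runs (illustrative numbers only)
acc_random = [0.471, 0.474, 0.476, 0.478, 0.473]
acc_ours   = [0.480, 0.483, 0.481, 0.485, 0.482]

# One-sided test: H1 says our initialization yields higher accuracy
stat, p_value = mannwhitneyu(acc_ours, acc_random, alternative='greater')
print(p_value)   # reject H0 at the 0.05 level if p_value < 0.05
```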
The results of these experiments are presented in Tables 5 and 6.
Table 5
News articles
Type of initialization | Mean accuracy | p-value
Random | 0.475 | —
$\Theta$ only | 0.481 | 0.004
$\Phi$ only | 0.483 | 1.36 × 10^-9
$\Theta$ and $\Phi$ | 0.486 | 1.10 × 10^-6
Table 6
Twitter dataset
Type of initialization | Mean accuracy | p-value
Random | 0.728 | —
$\Theta$ only | 0.738 | 9.71 × 10^-9
$\Phi$ only | 0.737 | 1.14 × 10^-7
$\Theta$ and $\Phi$ | 0.741 | 9.71 × 10^-9
As one can see, the improvement in document classification accuracy is statistically significant at the $\alpha = 0.05$ level. However, the improvement is relatively small, so accuracy improvement is not the purpose of the suggested method.
4.3. Fixed $\Phi$ approach
4.3.1. Perplexity
Inspecting perplexity is a common way to compare different topic models [9-11]. We compare our model, with varying size of the training subset, against PLSA on different datasets. The results are shown in Figures 4, 5 and 6. As one can see, the perplexity values for PLSA and for our approximation are nearly equal, especially for such large collections as Arxiv and the Twitter dataset.
Fig. 4. Perplexity depending on training subset size for the Arxiv dataset
Fig. 5. Perplexity depending on training subset size for the Twitter dataset
Fig. 6. Perplexity depending on training subset size for the news articles dataset
4.3.2. Text categorization
Another way to compare topic models is an application task, for example document categorization. In these experiments we classify news articles by category and Twitter users by gender, treating topic distributions as features. We obtain topic distributions by training PLSA on the whole collection and by our approximation with topics trained on a varying fraction of the whole collection. Then we estimate the quality of classification**** by cross-validation with 10 folds. In both experiments we use 20 topics. The results can be found in Figures 7 and 8.
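A minimal sketch of this evaluation with scikit-learn (the forest size and random seed are our assumptions; `X` holds one 20-dimensional topic distribution per document and `y` the labels):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def classification_accuracy(X, y):
    """Mean accuracy of a random forest over 10-fold cross-validation."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=10, scoring='accuracy').mean()
```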
Fig. 7. Classification of Twitter users by gender: accuracy depending on training sample size ("approximation" vs. "all")
Fig. 8. Classification of news articles by category
The majority-to-minority class ratio for the Twitter dataset is 1.17.
As one can see, our approximation exhibits comparable results if the training dataset is not too small, and it works noticeably better for large collections. Another important observation is that in all the experiments the curve reaches a plateau; this confirms that the matrix $\Phi$ may be found by training PLSA on a representative subset.
4.3.3. Time efficiency
One of the aims of our work is time efficiency improvement. We compare the training time for PLSA and for our approximation with $|D'| = 0.25 \cdot |D|$. Table 7 presents the computation time for PLSA and for our approximation (the time to train PLSA on the subset is included).
Table 7
Time to process the collection
Dataset | PLSA, sec | Approx., sec | Speed-up, times
Arxiv | 4333 | 1972 | 2.2
Twitter | 6384 | 2702 | 2.4
News articles | 164 | 64 | 2.6
**** For classification we use the random forest classifier from the scikit-learn package: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Another important characteristic is the average time to process one document with our approximation and with PLSA. The obtained values and the speed-up relative to the average per-document time of the original PLSA are given in Table 8 (the time to train PLSA on the subset is not included).
Table 8
Average time to process one document
Dataset | PLSA, sec | Approx., sec | Speed-up, times
Arxiv | 14.3 | 3.0 | 4.8
Twitter | 414.2 | 70.7 | 5.9
News articles | 9.0 | 0.7 | 12.9
4.3.4. Experiment on a large collection
As one can see, the fixed $\Phi$ approach performs better on larger collections (like Arxiv). In order to verify this assumption, we conducted a series of experiments on collections of different sizes, using scientific papers from arXiv (whole articles, not only abstracts). The experiments were conducted on collections of 5 thousand, 10 thousand, 20 thousand, and finally 100 thousand papers, and we investigated the dependence between the perplexity obtained with our approximation and the size of the training sample. As one can see in Figure 9, the approximation works better on larger collections: the fraction of a collection sufficient to obtain $\Phi$ is smaller for larger collections, so the performance improvement is more significant for larger collections.
Fig. 9. Perplexity depending on collection size
5. Conclusion
We developed two methods for computation time reduction, one based on finding an appropriate initial approximation and the other based on fixing the matrix $\Phi$, and tested these methods on three different datasets. The method based on finding an initial approximation demonstrates the same behavior on every dataset used: the calculation time and the number of iterations to convergence decrease, while the quality of the topic model does not. We confirm that the transition from the Kullback-Leibler divergence to the L2 norm is appropriate for finding an initial approximation for PLSA.
The method based on fixing $\Phi$ demonstrates a more significant speed-up, but at the cost of a drop in precision. However, the drop is not significant, especially on large datasets.
References
1. Rubin T. N., Chambers A., Smyth P., Steyvers M. Statistical topic models for multi-label document classification. Machine Learning, 2012, vol. 88, no. 1-2, pp. 157-208.
2. Dong C., Zhou B. Effectively detecting content spam on the web using topical diversity measures. Proc. of the IEEE/WIC/ACM Int. Conf. on Web Intelligence and Intelligent Agent Technology. Macau, 2012, vol. 1, pp. 266-273.
3. Ramanathan V., Wechsler H. phishGILLNET — phishing detection methodology using probabilistic latent semantic analysis, AdaBoost, and co-training. EURASIP Journal on Information Security, 2012, vol. 1, pp. 1-22.
4. Hofmann T. Probabilistic latent semantic indexing. Proc. of the 22nd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR '99. New York, 1999, pp. 50-57.
5. Hong C., Chen Y., Zheng W., Shan J., Chen Y., Zhang Y. Parallelization and characterization of probabilistic latent semantic analysis. Proc. of the 37th Int. Conf. on Parallel Processing, ICPP '08. Portland, 2008, pp. 628-635.
6. Wan R., Ngoc V. A., Mamitsuka H. Efficient probabilistic latent semantic analysis through parallelization. Lecture Notes in Computer Science "Information Retrieval Technology", vol. 5839. Springer Berlin Heidelberg, 2009. pp. 432-443.
7. Farahat A., Chen F. Improving probabilistic latent semantic analysis using principal component analysis. Proc. of the 11th Conf. of the European Chapter of the Association for Computational Linguistics, EACL 2006. Trento, 2006, pp. 105-112.
8. Hofmann T. Probabilistic latent semantic indexing. Proc. of the 22nd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR '99. New York, 1999, pp. 50-57.
9. Potapenko A., Vorontsov K. Robust PLSA performs better than LDA. Proc. of the 35th European Conf. on Information Retrieval, ECIR 2013. Moscow, 2013, pp. 784-787.
10. Blei D. M., Ng A. Y., Jordan M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, vol. 3, pp. 993-1022.
11. Wallach H. M., Murray I., Salakhutdinov R., Mimno D. Evaluation methods for topic models. Proc. of the 26th Annual Int. ACM Conf. on Machine Learning, ICML '09. New York, 2009, pp. 1105-1112.