TY - GEN
T1 - P2LSA and P2LSA+
T2 - 12th International Conference on Intelligent Data Engineering and Automated Learning, IDEAL 2011
AU - Jin, Yan
AU - Gao, Yang
AU - Shi, Yinghuan
AU - Shang, Lin
AU - Wang, Ruili
AU - Yang, Yubin
PY - 2011
Y1 - 2011
N2 - Two novel parallel Probabilistic Latent Semantic Analysis (PLSA) algorithms based on the MapReduce model, P2LSA and P2LSA+, are proposed. When dealing with a large-scale data set, P2LSA and P2LSA+ can improve the computing speed on the Hadoop platform. The Expectation-Maximization (EM) algorithm is often used in the traditional PLSA method to estimate two hidden parameter vectors, and the parallel PLSA implements the EM algorithm in parallel. The EM algorithm includes two steps: the E-step and the M-step. In P2LSA, the Map function performs the E-step and the Reduce function performs the M-step. However, all the intermediate results computed in the E-step need to be sent to the M-step, and transferring a large amount of data between the E-step and the M-step increases the burden on the network and the overall running time. Unlike P2LSA, the Map function in P2LSA+ performs the E-step and the M-step simultaneously, so the amount of data transferred between the E-step and the M-step is reduced and the performance is improved. Experiments are conducted to evaluate the performance of P2LSA and P2LSA+. The data set includes 20000 users and 10927 goods. The speedup curves show that the overall running time decreases as the number of computing nodes increases. Also, the overall running time demonstrates that P2LSA+ is about 3 times faster than P2LSA.
KW - MapReduce
KW - Paralleled PLSA
KW - PLSA
UR - http://www.scopus.com/inward/record.url?scp=80053012167&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-23878-9_46
DO - 10.1007/978-3-642-23878-9_46
M3 - Conference contribution
AN - SCOPUS:80053012167
SN - 9783642238772
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 385
EP - 393
BT - Intelligent Data Engineering and Automated Learning, IDEAL 2011 - 12th International Conference, Proceedings
Y2 - 7 September 2011 through 9 September 2011
ER -