TY - GEN
T1 - P2LSA and P2LSA+
T2 - 12th International Conference on Intelligent Data Engineering and Automated Learning, IDEAL 2011
AU - Jin, Yan
AU - Gao, Yang
AU - Shi, Yinghuan
AU - Shang, Lin
AU - Wang, Ruili
AU - Yang, Yubin
PY - 2011
Y1 - 2011
N2 - Two novel parallel Probabilistic Latent Semantic Analysis (PLSA) algorithms based on the MapReduce model, P2LSA and P2LSA+, are proposed. When dealing with a large-scale data set, P2LSA and P2LSA+ can improve the computing speed on the Hadoop platform. The Expectation-Maximization (EM) algorithm is often used in the traditional PLSA method to estimate two hidden parameter vectors, and the parallel PLSA implements the EM algorithm in parallel. The EM algorithm includes two steps: the E-step and the M-step. In P2LSA, the Map function performs the E-step and the Reduce function performs the M-step. However, all the intermediate results computed in the E-step need to be sent to the M-step, and transferring a large amount of data between the E-step and the M-step increases the burden on the network and the overall running time. Unlike P2LSA, the Map function in P2LSA+ performs the E-step and the M-step simultaneously, so the amount of data transferred between the E-step and the M-step is reduced and the performance is improved. Experiments are conducted to evaluate the performance of P2LSA and P2LSA+. The data set includes 20000 users and 10927 goods. The speedup curves show that the overall running time decreases as the number of computing nodes increases. Also, the overall running time demonstrates that P2LSA+ is about 3 times faster than P2LSA.
KW - MapReduce
KW - Paralleled PLSA
KW - PLSA
UR - http://www.scopus.com/inward/record.url?scp=80053012167&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-23878-9_46
DO - 10.1007/978-3-642-23878-9_46
M3 - Conference contribution
AN - SCOPUS:80053012167
SN - 9783642238772
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 385
EP - 393
BT - Intelligent Data Engineering and Automated Learning, IDEAL 2011 - 12th International Conference, Proceedings
Y2 - 7 September 2011 through 9 September 2011
ER -