TY - GEN
T1 - Exploring the influence of sampling on pattern support distribution
AU - Xu, Luofeng
AU - Marsland, Stephen
AU - Wang, Ruili
PY - 2008
Y1 - 2008
N2 - Identifying the pattern support distribution (PSD) in datasets is useful for many data mining tasks, such as market basket analysis. The support of a pattern is the frequency of its occurrence in a dataset. Calculating the distribution of these supports over an entire dataset is computationally expensive; this cost can be reduced by sampling from the dataset and computing the PSD on a relatively small sample. However, sampling may miscount patterns and cause significant changes in the identified distribution. Because the PSD follows a power-law relationship, this paper investigates how sampling influences the characteristics of that relationship. We consider the effect of sampling under two assumptions: uniform distribution of pattern supports, and independent and identically distributed (i.i.d.) distributions. We evaluate this influence experimentally on data from four real-world transaction datasets.
AB - Identifying the pattern support distribution (PSD) in datasets is useful for many data mining tasks, such as market basket analysis. The support of a pattern is the frequency of its occurrence in a dataset. Calculating the distribution of these supports over an entire dataset is computationally expensive; this cost can be reduced by sampling from the dataset and computing the PSD on a relatively small sample. However, sampling may miscount patterns and cause significant changes in the identified distribution. Because the PSD follows a power-law relationship, this paper investigates how sampling influences the characteristics of that relationship. We consider the effect of sampling under two assumptions: uniform distribution of pattern supports, and independent and identically distributed (i.i.d.) distributions. We evaluate this influence experimentally on data from four real-world transaction datasets.
UR - http://www.scopus.com/inward/record.url?scp=52049122627&partnerID=8YFLogxK
U2 - 10.1109/CIT.2008.Workshops.91
DO - 10.1109/CIT.2008.Workshops.91
M3 - Conference contribution
AN - SCOPUS:52049122627
SN - 9780769533391
T3 - Proceedings - 8th IEEE International Conference on Computer and Information Technology Workshops, CIT Workshops 2008
SP - 66
EP - 71
BT - Proceedings - 8th IEEE International Conference on Computer and Information Technology Workshops, CIT Workshops 2008
T2 - 8th IEEE International Conference on Computer and Information Technology Workshops, CIT Workshops 2008
Y2 - 8 July 2008 through 11 July 2008
ER -