Automatically generating summaries of user-generated videos is useful but challenging. Such videos are typically unedited and often consist of a single long shot, which makes traditional temporal segmentation methods such as shot boundary detection ineffective at producing meaningful segments for summarization. To address this issue, we propose a novel temporal segmentation framework based on clustering a joint visual-semantic affinity graph of the video frames. Using a pre-trained deep convolutional neural network (CNN), we extract deep visual features of the frames to construct the visual affinity graph. We then construct the semantic affinity graph of the frames from word embeddings of the frames' semantic tags, which are generated by an automatic image tagging algorithm. A dense neighbor method then clusters the joint visual-semantic affinity graph to divide the video into subshot-level segments, from which a summary of the video is generated. Experimental results show that our approach outperforms state-of-the-art methods. Furthermore, the segments our method produces are similar to those created manually.
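The joint affinity construction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes cosine similarity for both affinity graphs, an elementwise-product fusion of the two graphs, and a simple threshold on consecutive-frame affinity as a stand-in for the dense neighbor clustering. The random feature matrices are placeholders for real CNN features and tag word embeddings.

```python
import numpy as np

def cosine_affinity(X):
    # Row-normalize the feature vectors, then take pairwise cosine
    # similarity; clip to [0, 1] so the result is a valid affinity.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.clip(Xn @ Xn.T, 0.0, 1.0)

def segment(A, tau=0.5):
    # Toy segmentation: start a new subshot whenever the affinity
    # between consecutive frames drops below tau. This is a simplified
    # stand-in for clustering the affinity graph with a dense neighbor
    # method, as the framework actually does.
    boundaries = [0]
    for i in range(1, A.shape[0]):
        if A[i - 1, i] < tau:
            boundaries.append(i)
    return boundaries

# Placeholder per-frame features: 6 frames, hypothetical 128-d CNN
# features (visual) and 50-d averaged tag embeddings (semantic).
rng = np.random.default_rng(0)
visual = rng.normal(size=(6, 128))
semantic = rng.normal(size=(6, 50))

A_visual = cosine_affinity(visual)
A_semantic = cosine_affinity(semantic)

# Fusion choice (an assumption): elementwise product, so frames must be
# close both visually and semantically to remain strongly connected.
A_joint = A_visual * A_semantic

subshot_starts = segment(A_joint)
```

With the product fusion, the joint affinity can never exceed either individual affinity, which penalizes frame pairs that agree in only one modality.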