Finding and Analyzing Social Networks in unstructured web log data using probabilistic topic modeling

Jähnichen, Patrick
Abstract in English: 
Web logs and other platforms for organizing a social life online have become enormously successful over the last few years. In contrast to applications designed directly for building and visualizing social networks, web logs consist mostly of unstructured text data accompanied by some metadata, such as the author of the text, its publication date, the URL under which it is available, and the web log platform it originates from. Chapter 1 gives some basics on web logs and a description of such data. Chapter 2 discusses and applies a way to extract networks between authors using this metadata. It also covers the required theoretical background on graph theory and shows that the networks exhibit the Small World Phenomenon. Chapters 3 and 4 discuss the main question posed in this thesis: whether these networks can be inferred not from the available metadata but purely by natural language analysis of the text content, which would allow inference without any metadata at hand. For this, different techniques are used, namely a simplistic frequentist model based on the "bag-of-words" assumption and so-called topic models that make use of Bayesian probability theory. The topic models used are Latent Dirichlet Allocation and an extension of it, the Author-Topic model. All these techniques and their foundations are thoroughly described and applied to the available data. After this, the possibility of predicting the distance between two authors of web log texts in a social network is examined by comparing term frequency vectors (bag-of-words) or the probability distributions produced by the topic models under different metrics. After comparing these techniques, the last chapter introduces a new model, also building on Latent Dirichlet Allocation, together with possible ways to improve the prediction of social networks based on content analysis.
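The comparison step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the metrics used in the thesis, so cosine distance (for term frequency vectors) and Jensen-Shannon divergence (for topic distributions) are assumptions chosen here only for illustration, and all function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Bag-of-words representation: count of each whitespace-separated token."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity of two term frequency vectors (assumed metric)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two topic distributions (assumed metric)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Two hypothetical authors, represented once by their text and once by
# made-up topic distributions such as a topic model might produce.
author_a = term_frequencies("topic models for web log analysis")
author_b = term_frequencies("graph theory for web log networks")
bow_distance = cosine_distance(author_a, author_b)
topic_distance = jensen_shannon_divergence([0.7, 0.2, 0.1], [0.1, 0.3, 0.6])
```

Either value could then be compared against the path length between the two authors in the metadata-derived network to see how well content similarity tracks network distance.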
thesis_paper_patrick_jaehnichen.pdf (5.13 MB)