PhD Thesis Christian Pölitz

Automatic methods to extract latent meanings in large text corpora

This thesis concentrates on Data Mining in Corpus Linguistics. We show the use of modern Data Mining by developing efficient and effective methods for research and teaching in Corpus Linguistics in the fields of lexicography and semantics. Modern language resources, as provided by the Common Language Resources and Technology Infrastructure (http://clarin.eu), offer a large number of heterogeneous information resources of written language. Besides large text corpora, additional information about the sources or publication dates of the documents in the corpora is available. Further, information about words from dictionaries or WordNets offers prior knowledge about the word distributions. Starting with pre-studies in lexicography and semantics with large text corpora, we investigate the use of latent variable methods to extract hidden concepts in large text collections. We show that these hidden concepts correspond to meanings of words and subjects in text collections. This motivates an investigation of latent variable methods for large corpora to support linguistic research.

In an extensive survey, latent variable models are described. Mathematical and geometrical foundations are explained to motivate the latent variable methods. We distinguish two starting points for latent variable models, depending on how documents are represented internally. The first representation is based on geometric objects in a vector space, and latent variables are represented by vectors; latent factor models extract these variables by factorizing matrices that summarize the document objects. The second representation is based on random sequences, and the latent variables are random variables on which the sequences conditionally depend; latent topic models extract these variables by finding such conditionally dependent variables. We explain state-of-the-art methods for factor and topic models (a small illustrative sketch of both views follows this overview).

To show the quality, and hence the usefulness, of latent variable methods for corpus linguistics, different evaluation methods are discussed. Qualitative evaluation methods are described to effectively present the results of the latent variable methods to users. State-of-the-art quantitative evaluation methods are summarized to illustrate how to measure the quality of latent variable methods automatically. Additionally, we propose new methods to efficiently estimate the quality of latent variable methods for corpora with time information about the documents. Besides standard evaluation methods based on likelihoods and coherences of the extracted hidden concepts, we develop methods that estimate the coherence of the concepts with respect to temporal aspects and likelihoods that include time.

Based on the survey of latent variable methods, we interpret these methods as an optimization problem that finds latent variables which optimally describe the document corpus. To efficiently integrate additional information about a corpus from modern language resources, we propose extending the optimization for the latent variables with a regularization that includes this additional information. For the different latent variable models, regularizations are proposed that either align latent factors with, or jointly model latent topics with, information about the documents in the corpus. From pre-studies and collaborations with researchers from corpus linguistics, we compiled use cases to investigate the regularized latent variable methods for linguistic research and teaching.
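The following is a minimal sketch, assuming scikit-learn and a hypothetical toy corpus, that contrasts the two latent variable views discussed above: latent factors obtained by factorizing a document-term matrix, and latent topics obtained from a probabilistic topic model. It is only an illustration, not the implementation developed in the thesis.

```python
# A minimal sketch (not the thesis implementation) contrasting the two
# latent variable views: geometric factors vs. probabilistic topics.
# The toy corpus and the number of concepts are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

corpus = [
    "the bank raised interest rates",
    "the river bank was flooded after the rain",
    "the central bank lowered interest rates again",
]
n_concepts = 2

# Geometric view: documents as vectors; latent factors via matrix factorization (LSA).
doc_term = TfidfVectorizer().fit_transform(corpus)
factors = TruncatedSVD(n_components=n_concepts).fit_transform(doc_term)

# Probabilistic view: documents as word sequences; latent topics via LDA.
counts = CountVectorizer().fit_transform(corpus)
topics = LatentDirichletAllocation(n_components=n_concepts, random_state=0).fit_transform(counts)

print(factors)  # coordinates of each document in the latent factor space
print(topics)   # per-document topic proportions
```

Both calls return one row per document, so the extracted hidden concepts of the two views can be inspected and compared side by side.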
Two major applications are investigated. In diachronic linguistics, we show efficient regularized latent topic models that jointly model latent variables with the time stamps of the documents. In variety linguistics, we integrate information about the sources of the documents to model similarities and dissimilarities between corpora. Finally, we describe a software package, developed as a plugin for the Data Mining toolkit RapidMiner, that implements the methods from the thesis. The interfaces to the language resources and text corpora, the text processing methods, the latent variable methods and the evaluation methods are specified. We give detailed information on how the software is applied to the use cases. The integration of the developed methods into modern language resources such as WebLicht or the Dictionary of the German Language is explained to show the acceptance of our methods in corpus linguistic research and teaching.
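To make the diachronic use case more concrete, here is a minimal post-hoc sketch, again assuming scikit-learn and a hypothetical toy corpus with publication years. It only fits a plain topic model and aggregates topic proportions by year; it is not the joint, time-regularized model developed in the thesis.

```python
# A minimal post-hoc sketch of the diachronic idea (not the joint time-aware
# topic model from the thesis): fit a plain topic model, then aggregate the
# per-document topic proportions by publication year. The toy documents,
# years and number of topics are hypothetical placeholders.
from collections import defaultdict

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the telephone exchange connected the call",
    "operators connected every telephone call by hand",
    "the mobile phone call dropped in the tunnel",
    "smartphone users stream video over the mobile network",
]
years = [1950, 1950, 2010, 2010]

counts = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# Average topic proportions per year: a crude view of how latent concepts
# gain or lose prominence over time.
by_year = defaultdict(list)
for year, dist in zip(years, theta):
    by_year[year].append(dist)
for year in sorted(by_year):
    print(year, np.mean(by_year[year], axis=0).round(2))
```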