Representing and Mining Heterogeneous and Complex Textual Data


Context of the thesis

This thesis aims to combine processes for representing and mining texts in order to better use, understand and reason about texts in networks (social networks, the web…). Such a combination of processes should be seen as an iterative and interactive exploration of textual data in which different facets of texts can be visualized and analyzed.

A huge volume of text is currently available on the web, but it cannot easily be used by humans or by software agents. Humans want to communicate and to retrieve the right information at the right time and in the right place, while software agents should be able to understand a query and provide the answer in an optimal way. To be iterative and interactive, the processes should successively enrich the texts and take external resources into account. At the linguistic level, resources are lexical information, terminology or structural information: syntactic trees, dependency graphs… At the ontological level, knowledge makes it possible for the machine to reason and solve problems.

The main challenge of this thesis is to combine representation processes and mining processes, each relying on the other, while paying special attention to the volume of data (size of the data, memory space) and to efficiency (CPU time). Textual data can be very large, heterogeneous and complex, but they can also be delivered as a stream. The form and content (syntax and semantics) of texts should be represented in an appropriate format so that they can be classified in an ordered set of classes and thus be visualized, compared and interpreted. Moreover, texts vary greatly in size (tweets, abstracts or full texts), they may refer to locations or periods, and they may be linked together by inter-textual relations, referring to other texts (scientific articles, laws…). Dealing with these different aspects of texts is a major challenge of this thesis.

The thesis has both a theoretical and a practical dimension. It involves syntactic or semantic analysis (such as that provided by GATE, http://gate.ac.uk) as well as techniques for representing and discovering knowledge.

 

Representing and mining various textual documents

Representing texts and mining them are interrelated processes. Texts should be related to existing ontologies in order to embed them into a domain and move from words to concepts.

The thesis will be based on symbolic approaches to knowledge representation and data mining. The text will first be analyzed by natural language processing tools and then annotated with ontological resources in order to give meaning to the content [6,7]. An ontology is a conceptualisation of a domain with concepts defined by attributes and relations [27]. It is written in a knowledge representation language, and some modules enable reasoning [18]. OWL is the representation language that will be used in the thesis. It is based on description logics [5]. Its reasoning modules include instance classification, concept classification and satisfiability checking, and may detect inconsistencies in the set of descriptions.

Some data mining approaches such as frequent itemset mining will be used to describe the texts, and the resulting patterns will be used to build a knowledge base on the domain of the texts that could be used for further problem solving. Data mining will also rely on Formal Concept Analysis (FCA) [17] and some of its extensions, such as Relational Concept Analysis [26] and pattern structures [20,21]. The latter approaches produce a concept lattice that is very close to the structure of an ontology, with a partial order between concepts.
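As an illustration of how FCA and frequent itemset mining relate, the following minimal sketch enumerates the formal concepts of a tiny document/term context and keeps the concept intents that are frequent (these are the frequent closed itemsets). The documents, terms and function names are hypothetical and chosen only for this example. The brute-force enumeration is exponential in the number of attributes and is not the algorithm used by platforms such as Coron or GALICIA.

```python
from itertools import combinations

# Hypothetical toy context: documents (objects) described by annotated terms (attributes).
CONTEXT = {
    "doc1": {"gene", "protein"},
    "doc2": {"gene", "protein", "disease"},
    "doc3": {"gene", "disease"},
    "doc4": {"protein"},
}


def extent(attrs, context):
    """Return the objects that have every attribute in attrs."""
    return frozenset(o for o, a in context.items() if attrs <= a)


def intent(objs, context):
    """Return the attributes shared by every object in objs.

    For an empty set of objects, all attributes of the context are returned.
    """
    shared = set(frozenset().union(*context.values()))
    for o in objs:
        shared &= context[o]
    return frozenset(shared)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every attribute subset.

    Naive approach: suitable only for very small contexts.
    """
    all_attrs = sorted(frozenset().union(*context.values()))
    concepts = set()
    for k in range(len(all_attrs) + 1):
        for subset in combinations(all_attrs, k):
            ext = extent(frozenset(subset), context)
            concepts.add((ext, intent(ext, context)))
    return concepts


def frequent_closed_itemsets(context, min_support):
    """Map each concept intent to its support, keeping those above the threshold."""
    return {
        itemset: len(ext)
        for ext, itemset in formal_concepts(context)
        if len(ext) >= min_support
    }


def is_subconcept(c1, c2):
    """Partial order of the lattice: c1 <= c2 iff extent(c1) is included in extent(c2)."""
    return c1[0] <= c2[0]


if __name__ == "__main__":
    for ext, itemset in sorted(formal_concepts(CONTEXT), key=lambda c: -len(c[0])):
        print(sorted(ext), sorted(itemset))
```

Ordering the concepts with `is_subconcept` yields the lattice structure mentioned above: a concept with a larger extent (more documents) has a smaller intent (fewer shared terms), which is what makes the lattice resemble a hierarchy of ontology concepts.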

Textual data streams are similar to temporal data, for which we developed a multidimensional data mining technique with a similarity measure that compares and classifies sequences [10,14,15,16]. This method for sequences will be adapted to textual streams during the thesis.
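To give an idea of what comparing sequences can look like, here is a minimal sketch of a generic similarity between two sequences of items, based on the length of their longest common subsequence. It is only a stand-in chosen for illustration: it is not the measure defined in the cited work [10,14,15,16], and the function names and the normalisation are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sequence_similarity(a, b):
    """Similarity in [0, 1]: LCS length divided by the longer sequence length."""
    if not a and not b:
        return 1.0  # two empty sequences are considered identical
    return lcs_length(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # Hypothetical streams of annotated terms.
    s1 = ["gene", "protein", "disease"]
    s2 = ["gene", "disease"]
    print(sequence_similarity(s1, s2))
```

Such a pairwise score could then feed a standard classification or clustering step. Adapting it to multidimensional or streaming data is precisely the kind of question the thesis would address.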

 

Scientific environment

The thesis is part of a collaboration between two laboratories in Nancy, one in Computer Science and the other in Linguistics. The Orpailleur team, at the LORIA Computer Science lab, provides a very rich environment for data and text mining, with several platforms that could be used and enriched during the thesis: Coron (http://coron.loria.fr/), GALICIA (http://www.iro.umontreal.ca/~galicia/). The Lexique team, at the ATILF Linguistics lab, will provide morpho-syntactic annotation tools, term annotation and disambiguation, and transdisciplinary lexicons.

Amedeo Napoli from Orpailleur and Bertrand Gaiffe from ATILF will both be associated with the work.

Expected skills

This thesis is part of a larger multidisciplinary project on Natural Language and Knowledge Engineering. The candidate should have very good skills in Computer Science, be autonomous in coding, and have an interest in formal and mathematical approaches. Ideally, the candidate should also have good skills in Computational Linguistics. If not, they will acquire these skills during the thesis.

 

Supervisor

Yannick Toussaint, CR1 INRIA & LORIA (HdR), Orpailleur Team, IAEM doctoral school
Co-supervisor: Evelyne Jacquey, CR1 CNRS ATILF (UMR UL CNRS), Lexique Team, Stanislas doctoral school

 

How to apply

To prepare a PhD thesis within the Lorraine Université d’Excellence Program, interested candidates should consult the PhD topics offered under each of the social and economic challenges.
These PhD thesis topics are proposed by faculty members or researchers accredited to supervise research.

Candidate application period: according to graduate school schedule

Each candidate may submit an application on up to three separate research topics.

Application analysis period by each graduate school
The graduate school reviews the applicants for a doctoral contract in the relevant disciplines. It checks each supervisor's current supervision load and the situation of the doctors they have previously trained. Each candidate will meet the laboratory director, a supervisor or a representative from the graduate school. The purpose of this interview is to identify the candidate’s motivations and suitability for the PhD project proposed by the supervisor. A recommendation summarizing the strengths and/or weaknesses of the application will then be made to the graduate school.

PhD grants will include a monthly income for the PhD student (roughly 1700 € for research only; a supplement can be provided for teaching duties) and a research environment within the research unit.

Please be aware that, in order to offer a variety of subjects, more positions are posted here than there is funding available. The LUE executive committee will make the final decision on which positions are funded (up to 12), based on the recommendations of the doctoral schools.