
Georgios Petasis


petasis@iit.demokritos.gr

Books

2012
Elias Iosif, Georgios Petasis, Vangelis Karkaletsis (2012)  1   Edited by: Armando Stellato, Maria Teresa Pazienza. Hershey, PA, USA: IGI Global  
Abstract: The authors present an ontology-based information extraction process, which operates in a bootstrapping framework. The novelty of this approach lies in the continuous semantics extraction from textual content in order to evolve the underlying ontology, while the evolved ontology enhances in turn the information extraction mechanism. This process was implemented in the context of the R&D project BOEMIE. The BOEMIE system was evaluated on the athletics domain.
Notes:
Georgios Petasis, Sergios Petridis, Georgios Paliouras, Vangelis Karkaletsis, Stavros J Perantonis, Constantine D Spyropoulos  Edited by: Hans-Jürgen Zimmermann, Georgios Tselentis, Maarten van Someren, Georgios Dounias. ISBN 978-0-7923-7645-3. Keywords: named entity recognition.  
Abstract: This paper compares two alternative approaches to the problem of acquiring named-entity recognition and classification systems from training corpora, in two different languages. The process of named-entity recognition and classification is an important subtask in most language engineering applications, in particular information extraction, where different types of named entity are associated with specific roles in events. The manual construction of rules for the recognition of named entities is a tedious and time-consuming task. For this reason, effective methods to acquire such systems automatically from data are very desirable. In this paper we compare two popular learning methods on this task: a decision-tree induction method and a multi-layered feed-forward neural network. Particular emphasis is paid on the selection of the appropriate data representation for each method and the extraction of training examples from unstructured textual data. We compare the performance of the two methods on large corpora of English and Greek texts and present the results. In addition to the good performance of both methods, one very interesting result is the fact that a simple representation of the data, which ignores the order of the words within a named entity, leads to improved results over a more complex approach that preserves word order.
Notes:

Journal articles

2012
Georgios Petasis (2012)  The SYNC3 Collaborative Annotation Tool   May  
Abstract: The huge amount of the available information on the Web creates the need for effective information extraction systems that are able to produce metadata that satisfy users' information needs. The development of such systems, in the majority of cases, depends on the availability of an appropriately annotated corpus in order to learn or evaluate extraction models. The production of such corpora can be significantly facilitated by annotation tools, which provide user-friendly facilities and enable annotators to annotate documents according to a predefined annotation schema. However, the construction of annotation tools that operate in a distributed environment is a challenging task: the majority of these tools are implemented as Web applications, having to cope with the capabilities offered by browsers. This paper describes the SYNC3 collaborative annotation tool, which implements an alternative architecture: it remains a desktop application, fully exploiting the advantages of desktop applications, but provides collaborative annotation through the use of a centralised server for storing both the documents and their metadata, and instant messaging protocols for communicating events among all annotators. The annotation tool is implemented as a component of the Ellogon language engineering platform, exploiting its extensive annotation engine, its cross-platform abilities and its linguistic processing components, if such a need arises. Finally, the SYNC3 annotation tool is distributed with an open source license, as part of the Ellogon platform.
Notes:
2011
Georgios Petasis (2011)  Unsupervised Domain Adaptation based on Text Relatedness   September 12–14  
Abstract: In this paper an unsupervised approach to domain adaptation is presented, which exploits external knowledge sources in order to port a classification model into a new thematic domain. Our approach extracts a new feature set from documents of the target domain, and tries to align the new features to the original ones, by exploiting text relatedness from external knowledge sources, such as WordNet. The approach has been evaluated on the task of document classification, involving the classification of newsgroup postings into 20 news groups.
Notes:
Mara Tsoumari, Georgios Petasis (2011)  Coreference Annotator - A new annotation tool for aligned bilingual corpora   43-52 September 15  
Abstract: This paper presents the main features of an annotation tool, the Coreference Annotator, which manages bilingual corpora consisting of aligned texts that can be grouped in collections and subcollections according to their topics and discourse. The tool allows the manual annotation of certain linguistic items in the source text and their translation equivalent in the target text, by entering useful information about these items based on their context.
Notes:
Nikos Sarris, Gerasimos Potamianos, Jean-Michel Renders, Claire Grover, Eric Karstens, Leonidas Kallipolitis, Vasilis Tountopoulos, Georgios Petasis, Anastasia Krithara, Matthias Gallé, Guillaume Jacquet, Beatrice Alex, Richard Tobin, Liliana Bounegru (2011)  A System for Synergistically Structuring News Content from Traditional Media and the Blogosphere   October 26–28  
Abstract: News and social media are emerging as a dominant source of information for numerous applications. However, their vast unstructured content presents challenges to efficient extraction of such information. In this paper, we present the SYNC3 system that aims to intelligently structure content from both traditional news media and the blogosphere. To achieve this goal, SYNC3 incorporates innovative algorithms that first model news media content statistically, based on fine clustering of articles into so-called "news events". Such models are then adapted and applied to the blogosphere domain, allowing its content to map to the traditional news domain. Furthermore, appropriate algorithms are employed to extract news event labels and relations between events, in order to efficiently present news content to the system end users.
Notes:
2010
Georgios Petasis, Dimitrios Petasis (2010)  BlogBuster : A Tool for Extracting Corpora from the Blogosphere   May 17–23  
Abstract: This paper presents BlogBuster, a tool for extracting a corpus from the blogosphere. The topic of cleaning arbitrary web pages with the goal of extracting a corpus from web data, suitable for linguistic and language technology research and development, has attracted significant research interest recently. Several general purpose approaches for removing boilerplate have been presented in the literature; however the blogosphere poses additional requirements, such as a finer control over the extracted textual segments in order to accurately identify important elements, i.e. individual blog posts, titles, posting dates or comments. BlogBuster tries to provide such additional details along with boilerplate removal, following a rule-based approach. A small set of rules was manually constructed by observing a limited set of blogs from the Blogger and Wordpress hosting platforms. These rules operate on the DOM tree of an HTML page, as constructed by a popular browser, Mozilla Firefox. Evaluation results suggest that BlogBuster is very accurate when extracting corpora from blogs hosted on the Blogger and Wordpress platforms, while exhibiting a reasonable precision when applied to blogs not hosted on these two popular blogging platforms.
Notes:
Georgios Petasis (2010)  TkGecko : Another Attempt for an HTML Renderer for Tk   October 11–15  
Abstract: The support for displaying HTML and especially complex Web sites has always been problematic in Tk. Several efforts have been made in order to alleviate this problem, and this paper presents another (and still incomplete) one. This paper presents TkGecko, a Tcl/Tk extension written in C++, which allows Gecko (the HTML processing and rendering engine developed by the Mozilla Foundation) to be embedded as a widget in Tk. The current status of the TkGecko extension is alpha quality, while the code is publicly available under the BSD license.
Notes:
Georgios Petasis (2010)  TkRibbon : Windows Ribbons for Tk   October 11–15  
Abstract: This paper is about TkRibbon, a Tcl/Tk extension that aims to introduce support for the Windows Ribbon Framework in the Tk toolkit. The Windows Ribbon is a graphical interface where a set of toolbars are placed on tabs in a notebook widget, aiming to substitute traditional menus and toolbars. This paper briefly describes the Windows Ribbon framework and the TkRibbon Tk extension, and presents some examples of how TkRibbon can be used by Tk applications.
Notes:
Georgios Petasis (2010)  TkDND : a cross-platform drag'n'drop pac   October 11–15  
Abstract: This paper is about TkDND, a Tcl/Tk extension that aims to add cross-application drag and drop support to Tk, for popular operating systems, such as Microsoft Windows, Apple OS X and GNU/Linux. Being in its second rewrite, TkDND 2.x has a stable implementation for Windows and OS X, while support for Linux and the XDND protocol is still under development.
Notes:
Georgios Petasis (2010)  Ellogon and the challenge of threads   October 11–15  
Abstract: This paper is about the Ellogon language engineering platform, and the challenges faced in modernising it, in order to better exploit contemporary hardware. Ellogon is an open-source infrastructure, specialised in natural language processing. Following a data model that closely resembles TIPSTER, Ellogon can be used either as an autonomous application, offering a graphical user interface, or it can be embedded in a C/C++ application as a library. Ellogon has been implemented in C/C++ and Tcl/Tk: in fact Ellogon is a vanilla Tcl interpreter, with the Ellogon core loaded as a Tcl extension, and a set of Tcl/Tk scripts that implement the GUI. The core component of Ellogon, being a Tcl extension, heavily relies on Tcl objects to implement its data model, a decision made more than a decade ago, which poses difficulties in making Ellogon a multi-threaded application.
Notes:
Georgios Petasis (2010)  TileQt and TileGtk : current status   October 11–15  
Abstract: This paper is about two Tile and Ttk themes, TileQt and TileGTK. Despite being two distinct and very different extensions, the motivation for their development was common: making Tk applications look as native as possible under the Linux operating system.
Notes:
2009
Georgios Petasis, Vangelis Karkaletsis, Anastasia Krithara, Georgios Paliouras, Constantine D Spyropoulos (2009)  Semi-automated ontology learning : the BOEMIE approach   June 1  
Abstract: In this paper we describe a semi-automated approach for ontology learning. Exploiting an ontology-based multimodal information extraction system, the ontology learning subsystem accumulates documents that are insufficiently analysed and through clustering proposes new concepts, relations and interpretation rules to be added to the ontology.
Notes:
2008
Georgios Petasis, Vangelis Karkaletsis, Georgios Paliouras, Constantine D Spyropoulos (2008)  Learning context-free grammars to extract relations from text   178: 303-307  
Abstract: In this paper we propose a novel relation extraction method, based on grammatical inference. Following a semi-supervised learning approach, the text that connects named entities in an annotated corpus is used to infer a context free grammar. The grammar learning algorithm is able to infer grammars from positive examples only, controlling overgeneralisation through minimum description length. Evaluation results show that the proposed approach performs comparably to the state of the art, while exhibiting a bias towards precision, which is a sign of conservative generalisation.
Notes:
Pavlina Fragkou, Georgios Petasis, Aris Theodorakos, Vangelis Karkaletsis, Constantine D Spyropoulos (2008)  BOEMIE Ontology-Based Text Annotation Tool   May 26 – June 1  
Abstract: The huge amount of the available information on the Web creates the need for effective information extraction systems that are able to produce metadata that satisfy users' information needs. The development of such systems, in the majority of cases, depends on the availability of an appropriately annotated corpus in order to learn extraction models. The production of such corpora can be significantly facilitated by annotation tools that are able to annotate, according to a defined ontology, not only named entities but most importantly relations between them. This paper describes the BOEMIE ontology-based annotation tool which is able to locate blocks of text that correspond to specific types of named entities, fill tables corresponding to ontology concepts with those named entities and link the filled tables based on relations defined in the domain ontology. Additionally, it can perform annotation of blocks of text that refer to the same topic. The tool has a user-friendly interface, supports automatic pre-annotation, annotation comparison as well as customization to other annotation schemata. The annotation tool has been used in a large scale annotation task involving 3000 web pages regarding athletics. It has also been used in another annotation task involving 503 web pages with medical information, in different languages.
Notes:
Georgios Petasis, Pavlina Fragkou, Aris Theodorakos, Vangelis Karkaletsis, Constantine D Spyropoulos (2008)  Segmenting HTML pages using visual and semantic information   4th Web as Corpus Workshop (WAC-4) 18-24 June 1  
Abstract: The information explosion of the Web aggravates the problem of effective information retrieval. Even though linguistic approaches found in the literature perform linguistic annotation by creating metadata in the form of tokens, lemmas or part of speech tags, this process is insufficient. This is due to the fact that these linguistic metadata do not exploit the actual content of the page, leading to the need for semantic annotation based on a predefined semantic model. This paper proposes a new learning approach for performing automatic semantic annotation. This is the result of a two step procedure: the first step partitions a web page into blocks based on its visual layout, while the second performs subsequent partitioning based on the examination of appearance of specific types of entities denoting the semantic category as well as the application of a number of simple heuristics. Preliminary experiments performed on a manually annotated corpus regarding athletics proved to be very promising.
Notes: Proceedings: The 4th Web as Corpus: Can we do better than Google? http://www.lrec-conf.org/proceedings/lrec2008/workshops/W19_Proceedings.pdf
2005
Dimitris Spiliotopoulos, Georgios Petasis, Georgios Kouroupetroglou (2005)  Prosodically Enriched Text Annotation for High Quality Speech Synthesis   313-316 October 17–19  
Abstract: Linguistically enriched text generated from natural language modules contributes significantly to the quality of speech synthesis. For all cases where such modules are not available, such enriched input needs to be produced from plain text in order to maintain quality. This work reports on a framework of several combined language resources and procedures (word/sentence identification, syntactic analysis, prosodic feature annotation) for text annotation/processing from plain text. Using that, the implementation of an automatic XML formatted output generation module produces the prosodically enriched markup.
Notes:
2004
Georgios Petasis, Georgios Paliouras, Vangelis Karkaletsis, Constantine Halatsis, Constantine D Spyropoulos (2004)  E-GRIDS : Computationally Efficient Grammatical Inference from Positive Examples   GRAMMARS 7: 69-110  
Abstract: In this paper we present a new computationally efficient algorithm for inducing context-free grammars that is able to learn from positive sample sentences. This new algorithm uses simplicity as a criterion for directing inference, and the search process of the new algorithm has been optimised by utilising the results of a theoretical analysis regarding the behaviour and complexity of the search operators. Evaluation results are presented on artificially generated data, while the scalability of the algorithm is tested on a large textual corpus. These results show that the new algorithm performs well and can infer grammars from large data sets in a reasonable amount of time.
Notes: Technical Report referenced in the paper: http://www.ellogon.org/petasis/bibliography/GRAMMARS/GRAMMARS2004-SpecialIssue-Petasis-TechnicalReport.pdf
Stavros J Perantonis, Basilios Gatos, Vassilios Maragos, Vangelis Karkaletsis, Georgios Petasis (2004)  Text Area Identification in Web Images   3025: 82-92 May  
Abstract: With the explosive growth of the World Wide Web, millions of documents are published and accessed on-line. Statistics show that a significant part of Web text information is encoded in Web images. Since Web images have special characteristics that sometimes distinguish them from other types of images, commercial OCR products often fail to recognize them. This paper proposes a novel Web image processing algorithm that aims to locate text areas and prepare them for OCR procedure with better results. Our methodology for text area identification has been fully integrated with an OCR engine and with an Information Extraction system. We present quantitative results for the performance of the OCR engine as well as qualitative results concerning its effects on the Information Extraction system. Experimental results obtained from a large corpus of Web images demonstrate the efficiency of our methodology.
Notes:
Georgios Petasis, Vangelis Karkaletsis, Claire Grover, Ben Hachey, Maria Teresa Pazienza, Michele Vindigni, José Coch (2004)  Adaptive, Multilingual Named Entity Recognition in Web Pages   1073-1074 August 22–27  
Abstract: Most of the information on the Web today is in the form of HTML documents, which are designed for presentation purposes and not for machine understanding and reasoning. Existing web extraction systems require a lot of human involvement for maintenance due to changes to targeted web sites and for adaptation to new web sites or even to new domains. This paper presents the adaptive, multilingual named entity recognition and classification (NERC) technologies developed for processing web pages in the context of the R&D project CROSSMARC. The evaluation results demonstrate the viability of our approach.
Notes: Extended version: http://www.ellogon.org/petasis/bibliography/ECAI2004/ECAI2004_NERC.pdf
Georgios Petasis, Georgios Paliouras, Constantine D Spyropoulos, Constantine Halatsis (2004)  Eg-GRIDS : Context-Free Grammatical Inference from Positive Examples Using Genetic Search   3264: 223-234 October 11–13  
Abstract: In this paper we present eg-GRIDS, an algorithm for inducing context-free grammars that is able to learn from positive sample sentences. The presented algorithm, similar to its GRIDS predecessors, uses simplicity as a criterion for directing inference, and a set of operators for exploring the search space. In addition to the basic beam search strategy of GRIDS, eg-GRIDS incorporates an evolutionary grammar selection process, aiming to explore a larger part of the search space. Evaluation results are presented on artificially generated data, comparing the performance of beam search and genetic search. These results show that genetic search performs better than beam search while being significantly more efficient computationally.
Notes:
2003
Georgios Petasis, Vangelis Karkaletsis, Georgios Paliouras, Constantine D Spyropoulos (2003)  Using the Ellogon Natural Language Engineering Infrastructure   November 21  
Abstract: Ellogon is a multi-lingual, cross-operating system, general-purpose natural language engineering infrastructure. Ellogon has been used extensively in various NLP applications. It is currently provided for free for research use to research and academic organisations. In this paper, we outline its architecture and data model, present Ellogon features as used by different types of users and discuss its functionalities against other infrastructures for language engineering.
Notes: http://labs-repos.iit.demokritos.gr/skel/bci03_workshop/
Georgios Petasis, Vangelis Karkaletsis, Constantine D Spyropoulos (2003)  Cross-lingual Information Extraction from Web pages : the use of a general-purpose Text Engineering Platform   381-388 September 10–12  
Abstract: In this paper we present how the use of a general-purpose text engineering platform has facilitated the development of a cross-lingual information extraction system and its adaptation to new domains and languages. Our approach for crosslingual information extraction from the Web covers all the way from the identification of Web sites of interest, to the location of the domain specific Web pages, to the extraction of specific information from the Web pages and its presentation to the end-user. This approach has been implemented in the context of the IST project CROSSMARC. The text engineering platform "Ellogon" offers functionalities that facilitated the development of core CROSSMARC components as well as their porting into new domains and languages.
Notes: http://lml.bas.bg/ranlp2003/
2002
Georgios Petasis, Vangelis Karkaletsis, Georgios Paliouras, Ion Androutsopoulos, Constantine D Spyropoulos (2002)  Ellogon : A New Text Engineering Platform   72-78 May 29–31  
Abstract: This paper presents Ellogon, a multi-lingual, cross-platform, general-purpose text engineering environment. Ellogon was designed in order to aid both researchers in natural language processing, as well as companies that produce language engineering systems for the end-user. Ellogon provides a powerful TIPSTER-based infrastructure for managing, storing and exchanging textual data, embedding and managing text processing components as well as visualising textual data and their associated linguistic information. Among its key features are full Unicode support, an extensive multi-lingual graphical user interface, its modular architecture and the reduced hardware requirements.
Notes:
Dimitra Farmakiotou, Vangelis Karkaletsis, Georgios Samaritakis, Georgios Petasis, Constantine D Spyropoulos (2002)  Named Entity Recognition from Greek Web Pages   91-102 April 11–12  
Abstract: We describe the functionalities of the Hellenic Named Entity Recognition and Classification (HNERC) system developed in the context of the CROSSMARC project. CROSSMARC is developing technology for e-retail product comparison. The CROSSMARC system locates relevant retailers’ web pages and processes them in order to extract information about their products (e.g. technical features, prices). CROSSMARC’s technology is demonstrated and evaluated for two different product types and four languages (English, Greek, Italian, French). This paper presents the HNERC system that is responsible for the identification and classification of specific types of proper names (e.g. laptop manufacturers, models), numerical expressions (e.g. length, weight), and temporal expressions (e.g. time, date) in Hellenic vendor sites. The paper presents the HNERC processing stages using examples from the laptops domain.
Notes: http://lpis.csd.auth.gr/setn02/
Claire Grover, Scott Mcdonald, Donnla Nic Gearailt, Vangelis Karkaletsis, Dimitra Farmakiotou, Georgios Samaritakis, Georgios Petasis, Maria Teresa Pazienza, Michele Vindigni, Frantz Vichot (2002)  Multilingual XML-Based Named Entity Recognition for E-Retail Domains   May 29–31  
Abstract: We describe the multilingual Named Entity Recognition and Classification (NERC) subpart of an e-retail product comparison system which is currently under development as part of the EU-funded project CROSSMARC. The system must be rapidly extensible, both to new languages and new domains. To achieve this aim we use XML as our common exchange format and the monolingual NERC components use a combination of rule-based and machine-learning techniques. It has been challenging to process web pages which contain heavily structured data where text is intermingled with HTML and other code. Our preliminary evaluation results demonstrate the viability of our approach.
Notes:
Dimitra Farmakiotou, Vangelis Karkaletsis, Ioannis Koutsias, Georgios Petasis, Constantine D Spyropoulos (2002)  PatEdit : An Information Extraction Pattern Editor for Fast System Customization   1097-1102 May 29–31  
Abstract: This paper addresses the problem of Information Extraction (IE) system customization to new domains and extraction needs with the use of PatEdit, an IE Pattern Editor. PatEdit is a human-assisted knowledge engineering tool, that facilitates the production of IE patterns. First, we present the problem of IE system customisation and the use of human assisted knowledge engineering tools. Then, we describe PatEdit with respect to the IE pattern language used and discuss its characteristics that facilitate rapid pattern writing. Finally, the exploitation of PatEdit in two information extraction projects is presented along with our plans for future work.
Notes:
2001
Georgios Petasis, Frantz Vichot, Francis Wolinski, Georgios Paliouras, Vangelis Karkaletsis, Constantine D Spyropoulos (2001)  Using Machine Learning to Maintain Rule-based Named - Entity Recognition and Classification Systems   426-433 July 9–11  
Abstract: This paper presents a method that assists in maintaining a rule-based named-entity recognition and classification system. The underlying idea is to use a separate system, constructed with the use of machine learning, to monitor the performance of the rule-based system. The training data for the second system is generated with the use of the rule-based system, thus avoiding the need for manual tagging. The disagreement of the two systems acts as a signal for updating the rule-based system. The generality of the approach is illustrated by applying it to large corpora in two different languages: Greek and French. The results are very encouraging, showing that this alternative use of machine learning can assist significantly in the maintenance of rule-based systems.
Notes:
Georgios Petasis, Vangelis Karkaletsis, Dimitra Farmakiotou, Ion Androutsopoulos, Constantine D Spyropoulos (2001)  A Greek Morphological Lexicon and its Exploitation by a Greek Controlled Language Checker   80-89 November 8–10  
Abstract: This paper presents a large-scale Greek morphological lexicon, developed by the Software & Knowledge Engineering Laboratory (SKEL) of NCSR "Demokritos". The paper describes the lexicon architecture and the procedure to develop and update it. The morphological lexicon was used to develop a lemmatiser and a morphological analyser that were included in a controlled language checker for Greek. The paper discusses the current coverage of the lexicon, as well as remaining issues and how we plan to address them. Our goal is to produce a wide-coverage morphological lexicon of Greek that can be easily exploited in several natural language processing applications.
Notes:
2000
Georgios Petasis, Sergios Petridis, Georgios Paliouras, Vangelis Karkaletsis, Stavros J Perantonis, Constantine D Spyropoulos (2000)  Symbolic and Neural Learning for Named-Entity Recognition   58-66 June 19–23  
Abstract: Named-entity recognition involves the identification and classification of named entities in text. This is an important subtask in most language engineering applications, in particular information extraction, where different types of named entity are associated with specific roles in events. The manual construction of rules for the recognition of named entities is a tedious and time-consuming task. For this reason, we present in this paper two approaches to learning named-entity recognition rules from text. The first approach is a decision-tree induction method and the second a multi-layered feed-forward neural network. Particular emphasis is paid on the selection of the appropriate feature set for each method and the extraction of training examples from unstructured textual data. We compare the performance of the two methods on a large corpus of English text and present the results.
Notes:
Georgios Petasis, Alessandro Cucchiarelli, Paola Velardi, Georgios Paliouras, Vangelis Karkaletsis, Constantine D Spyropoulos (2000)  Automatic adaptation of proper noun dictionaries through cooperation of machine learning and probabilistic methods   128-135 July 24–28  
Abstract: The recognition of Proper Nouns (PNs) is considered an important task in the area of Information Retrieval and Extraction. However the high performance of most existing PN classifiers heavily depends upon the availability of large dictionaries of domain-specific Proper Nouns, and a certain amount of manual work for rule writing or manual tagging. Though it is not a heavy requirement to rely on some existing PN dictionary (often these resources are available on the web), its coverage of a domain corpus may be rather low, in absence of manual updating. In this paper we propose a technique for the automatic updating of a PN Dictionary through the cooperation of an inductive and a probabilistic classifier. In our experiments we show that, whenever an existing PN Dictionary allows the identification of 50% of the proper nouns within a corpus, our technique allows, without additional manual effort, the successful recognition of about 90% of the remaining 50%.
Notes:
Georgios Paliouras, Vangelis Karkaletsis, Georgios Petasis, Constantine D Spyropoulos (2000)  Learning Decision Trees for Named-Entity Recognition and Classification   August 20–25  
Abstract: We propose the use of decision tree induction as a solution to the problem of customising a named-entity recognition and classification (NERC) system to a specific domain. A NERC system assigns semantic tags to phrases that correspond to named entities, e.g. persons, locations and organisations. Typically, such a system makes use of two language resources: a recognition grammar and a lexicon of known names, classified by the corresponding named-entity types. NERC systems have been shown to achieve good results when the domain of application is very specific. However, the construction of the grammar and the lexicon for a new domain is a hard and time-consuming process. We propose the use of decision trees as NERC "grammars" and the construction of these trees using machine learning. In order to validate our approach, we tested C4.5 on the identification of person and organisation names involved in management succession events, using data from the sixth Message Understanding Conference. The results of the evaluation are very encouraging showing that the induced tree can outperform a grammar that was constructed manually.
Notes:
1999
Vangelis Karkaletsis, Georgios Paliouras, Georgios Petasis, Natasa Manousopoulou, Constantine D Spyropoulos (1999)  Named-Entity Recognition from Greek and English Texts   Journal of Intelligent and Robotic Systems 26: 2. 123-135 October  
Abstract: Named-entity recognition (NER) involves the identification and classification of named entities in text. This is an important subtask in most language engineering applications, in particular information extraction, where different types of named entity are associated with specific roles in events. In this paper, we present a prototype NER system for Greek texts that we developed based on a NER system for English. Both systems are evaluated on corpora of the same domain and of similar size. The time-consuming process for the construction and update of domain-specific resources in both systems led us to examine a machine learning method for the automatic construction of such resources for a particular application in a specific language.
Notes:
Georgios Petasis, Georgios Paliouras, Vangelis Karkaletsis, Constantine D Spyropoulos (1999)  Resolving Part-of-Speech Ambiguity in the Greek Language Using Learning Techniques   July 5–16  
Abstract: This article investigates the use of Transformation-Based Error-Driven learning for resolving part-of-speech ambiguity in the Greek language. The aim is not only to study the performance, but also to examine its dependence on different thematic domains. Results are presented here for two different test cases: a corpus on "management succession events" and a general-theme corpus. The two experiments show that the performance of this method does not depend on the thematic domain of the corpus, and its accuracy for the Greek language is around 95%.
Notes:

Book chapters

2011
Georgios Petasis, Vangelis Karkaletsis, Georgios Paliouras, Anastasia Krithara, Elias Zavitsanos (2011) In: Knowledge-Driven Multimedia Information Extraction and Ontology Evolution : State of the Art Edited by:Georgios Paliouras, Constantine D Spyropoulos, George Tsatsaronis. 134-166 Springer Berlin / Heidelberg  
Abstract: Ontology learning is the process of acquiring (constructing or integrating) an ontology (semi-) automatically. Being a knowledge acquisition task, it is a complex activity, which becomes even more complex in the context of the BOEMIE project, due to the management of multimedia resources and the multi-modal semantic interpretation that they require. The purpose of this chapter is to present a survey of the most relevant methods, techniques and tools used for the task of ontology learning. Adopting a practical perspective, an overview of the main activities involved in ontology learning is presented. This breakdown of the learning process is used as a basis for the comparative analysis of existing tools and approaches. The comparison is done along dimensions that emphasize the particular interests of the BOEMIE project. In this context, ontology learning in BOEMIE is treated and compared to the state of the art, explaining how BOEMIE addresses problems observed in existing systems and contributes to issues that are not frequently considered by existing approaches.
Notes: 10.1007/978-3-642-20795-2_6
Vangelis Karkaletsis, Pavlina Fragkou, Georgios Petasis, Elias Iosif (2011) In: Knowledge-Driven Multimedia Information Extraction and Ontology Evolution Edited by:Georgios Paliouras, Constantine D Spyropoulos, George Tsatsaronis. 89-109 Springer Berlin / Heidelberg  
Abstract: Information extraction systems employ ontologies as a means to describe formally the domain knowledge exploited by these systems for their operation. The aim of this survey is to study the contribution of ontologies to information extraction systems. We believe that this will help towards specifying a concrete methodology for ontology based information extraction exploiting all levels of ontological knowledge, from domain entities for named entity recognition, to the use of conceptual hierarchies for pattern generalization, to the use of properties and non-taxonomic relations for pattern acquisition, and finally to the use of the domain model itself for integrating extracted entities and instances of relations, as well as for discovering implicit information and detecting inconsistencies.
Notes: 10.1007/978-3-642-20795-2_4
2008
Dimitris Spiliotopoulos, Georgios Petasis, Georgios Kouroupetroglou (2008) In: Text, Speech and Dialogue (Proceedings of the 11th International Conference on Text, Speech and Dialogue (TSD 2008)) Edited by:Petr Sojka, Aleš Horák, Ivan Kopecek, Karel Pala. 517-524 Brno, Czech Republic: Springer Berlin / Heidelberg  
Abstract: Concept-to-Speech systems include Natural Language Generators that produce linguistically enriched text descriptions which can lead to significantly improved quality of speech synthesis. There are cases, however, where either the generator modules produce pieces of non-analyzed, non-annotated plain text, or such modules are not available at all. Moreover, the language analysis is restricted by the usually limited domain coverage of the generator due to its embedded grammar. This work reports on a language-independent framework basis, linguistic resources and language analysis procedures (word/sentence identification, part-of-speech, prosodic feature annotation) for text annotation/processing for plain or enriched text corpora. It aims to produce an automated XML-annotated enriched prosodic markup for English and Greek texts, for improved synthetic speech. The markup includes information for both training the synthesizer and for actual input for synthesising. Depending on the domain and target, different methods may be used for automatic classification of entities (words, phrases, sentences) to one or more preset categories such as "emphatic event", "new/old information", "second argument to verb", "proper noun phrase", etc. The prosodic features are classified according to the analysis of the speech-specific characteristics for their role in prosody modelling and passed through to the synthesizer via an extended SOLE-ML description. Evaluation results show that using selectable hybrid methods for part-of-speech tagging high accuracy is achieved. Annotation of a large generated text corpus containing 50% enriched text and 50% canned plain text produces a fully annotated uniform SOLE-ML output containing all prosodic features found in the initial enriched source. Furthermore, additional automatically-derived prosodic feature annotation and speech synthesis related values are assigned, such as word-placement in sentences and phrases, previous and next word entity relations, emphatic phrases containing proper nouns, and more.
Notes:
2003
Georgios Petasis, Vangelis Karkaletsis, Dimitra Farmakiotou, Ion Androutsopoulos, Constantine D Spyropoulos (2003) In: Advances in Informatics - Post-proceedings of the 8th Panhellenic Conference in Informatics Edited by:Yannis Manolopoulos, Skevos Evripidou, Antonis Kakas. 401-419 Springer Berlin / Heidelberg  
Abstract: This paper presents a large-scale Greek morphological lexicon, developed at the Software & Knowledge Engineering Laboratory (SKEL) of NCSR "Demokritos". The paper describes the lexicon architecture and the procedure to develop and update it. The morphological lexicon was used to develop a lemmatiser and a morphological analyser that were exploited in various natural language processing applications for Greek. The paper presents these applications (controlled language checker, information extraction, information filtering) and discusses further research issues and how we plan to address them.
Notes: http://www.springerlink.com/content/hcdjrlvj5nlybf5c/
2000
Georgios Petasis, Georgios Paliouras, Vangelis Karkaletsis, Constantine D Spyropoulos, Ion Androutsopoulos (2000) In: ADVANCES IN INFORMATICS : Proceedings of the 7th Hellenic Conference on Informatics (HCI ’99) Edited by:Dimitrios I Fotiadis, Stavros D Nikolopoulos. 273-281 World Scientific  
Abstract: This article investigates the use of Transformation-Based Error-Driven learning for resolving part-of-speech ambiguity in the Greek language. The aim is not only to study the performance, but also to examine its dependence on different thematic domains. Results are presented here for two different test cases: a corpus on "management succession events" and a general-theme corpus. The two experiments show that the performance of this method does not depend on the thematic domain of the corpus, and its accuracy for the Greek language is around 95%.
Notes: http://www.worldscibooks.com/compsci/4320.html
1999
Vangelis Karkaletsis, Constantine D Spyropoulos, Georgios Petasis (1999)  12   In: Advances in Intelligent Systems : Concepts, Tools and Applications Edited by:Spyros G Tzafestas. 131-142 Springer Berlin / Heidelberg  
Abstract:
Notes: Presented at the 3rd European Robotics Intelligent Systems & Control Conference (EURISCON’98), June 22–25 1998, Athens, Greece.

PhD theses

2011
Georgios Petasis (2011)  Machine Learning in Natural Language Processing   Department of Informatics and Telecommunications, University of Athens Hissar, Bulgaria:  
Abstract: This thesis examines the use of machine learning techniques in various tasks of natural language processing, mainly for the task of information extraction from texts. The objectives are the improvement of adaptability of information extraction systems to new thematic domains (or even languages), and the improvement of their performance using as few resources (either linguistic or human) as possible. This thesis has examined two main axes: a) the research and assessment of existing algorithms of machine learning mainly in the stages of linguistic pre-processing (such as part of speech tagging) and named-entity recognition, and b) the creation of a new machine learning algorithm and its assessment on synthetic data, as well as on real-world data from the task of relation extraction between named entities. This new algorithm belongs to the category of inductive grammar learning, and can infer context-free grammars from positive examples only.
Notes: