Current Topics for Bachelor’s, Master’s or Diploma Theses and for Internships

Text Mining for Patents: Topics for a Bachelor’s or Master’s Thesis

Patents represent a special type of text: they are usually longer and syntactically more complex than, for example, scientific articles. FIZ Karlsruhe, one of the world’s leading providers of patent information, has been engaged in text mining for many years in order to offer its users additional functionality that lets them retrieve the required information faster and with higher precision and recall. In this context we have identified the topics listed below from the areas of text mining and natural language processing (NLP). Each of them may be suitable for a bachelor’s thesis as well as for a master’s thesis.

Syntactic Normalisation of Automatically Extracted Key Phrases

The automatic extraction of key phrases often yields similar phrases that differ in their morphological and syntactic structure. For information search or for generating content overviews, these syntactic variants must be normalised and mapped to one canonical form.

Simple examples are:

  • Information retrieval, retrieval of information => information retrieval
  • method for combating spam => spam combating method
  • circular or rectangular patterns => circular pattern, rectangular pattern

The objective is to identify the different types of phrase variants and to implement and evaluate a rule-based method for phrase normalisation. Depending on the extent to which the variants are recognised and on the quality of the normalisation method, this topic could be suitable for a bachelor’s or for a master’s thesis.
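
As an illustration only, a rule-based normaliser for the three example variants above might be sketched as follows. The rule set, the function names (`normalise`, `singularise`) and the naive plural stripping are assumptions made for this sketch, not a method prescribed by the topic; a real solution would probably operate on PoS-tagged or parsed phrases rather than on raw strings.

```python
import re


def singularise(noun: str) -> str:
    """Naive singularisation (assumption): strip a final 's' unless it is 'ss'."""
    if noun.endswith("s") and not noun.endswith("ss"):
        return noun[:-1]
    return noun


# Hypothetical surface patterns, one per variant type from the examples.
COORD_RULE = re.compile(r"^(\w+) (?:or|and) (\w+) (\w+)$")   # A or B Ns
FOR_ING_RULE = re.compile(r"^(\w+) for (\w+ing) (\w+)$")     # N for V-ing O
OF_RULE = re.compile(r"^(\w+) of (\w+)$")                    # X of Y


def normalise(phrase: str) -> list[str]:
    """Map a key phrase to one or more canonical forms."""
    # Lowercase and collapse whitespace before applying the rules.
    p = " ".join(phrase.lower().split())

    m = COORD_RULE.match(p)
    if m:
        # Distribute the shared head noun over both modifiers.
        head = singularise(m.group(3))
        return [f"{m.group(1)} {head}", f"{m.group(2)} {head}"]

    m = FOR_ING_RULE.match(p)
    if m:
        # "method for combating spam" -> "spam combating method"
        return [f"{m.group(3)} {m.group(2)} {m.group(1)}"]

    m = OF_RULE.match(p)
    if m:
        # "retrieval of information" -> "information retrieval"
        return [f"{m.group(2)} {m.group(1)}"]

    # No rule applies: the cleaned phrase is its own canonical form.
    return [p]


for example in ("Information retrieval", "retrieval of information",
                "method for combating spam", "circular or rectangular patterns"):
    print(example, "=>", normalise(example))
```

The function returns a list because a coordinated phrase expands into several canonical phrases. Evaluating such rules would require a set of manually normalised phrases to compare against.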

Recognition of Enumerations in Patent Texts

Patent texts are often characterised by the extensive use of long enumerations of, e.g., substances, chemical entities, numeric entities or methods.

Examples:

  • fuel system components such as sensors, actuators, pumps, level controls, throttles and valves …
  • locomotive systems like cars, including vans, SUVs and roadsters; bikes, including motor bikes, bicycles and pedelecs, or trains like underground, motor coaches or freightliners …

To start with, the most common types of enumerations are to be identified. For these types, an automatic recognition method is to be implemented and evaluated. The topic can be expanded to a master’s thesis by also considering complex and less frequent types of enumerations and by developing means to decide automatically whether an enumeration represents a certain taxonomic relation such as synonymy, quasi-synonymy or hyponymy.
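
For the simplest type of enumeration, a flat list introduced by a cue phrase, a recognition method could start from a sketch like the one below. The cue list, the clause delimiters and the function name `find_enumerations` are assumptions made for illustration. The sketch deliberately ignores nested enumerations such as the second example above, and it will also fire on non-enumerating uses of "like".

```python
import re

# Hypothetical cue phrases that introduce an enumeration.
CUE = re.compile(r"\b(such as|including|like)\b")
# Characters treated as clause boundaries (assumption).
CLAUSE_END = re.compile(r"[.;:…]")
# Item separators: ", and", " and ", " or ", or a plain comma.
ITEM_SEP = re.compile(r",\s*(?:and|or)\s+|\s+(?:and|or)\s+|,\s*")


def find_enumerations(text: str) -> list[tuple[str, list[str]]]:
    """Return (head phrase, items) pairs for flat cue-based enumerations."""
    results = []
    for m in CUE.finditer(text):
        # Head: left context back to the previous clause boundary or comma.
        left = CLAUSE_END.split(text[:m.start()])[-1]
        head = left.split(",")[-1].strip()
        # Items: right context up to the next clause boundary.
        right = CLAUSE_END.split(text[m.end():])[0]
        items = [item.strip() for item in ITEM_SEP.split(right) if item.strip()]
        # Require at least two items to call it an enumeration.
        if head and len(items) >= 2:
            results.append((head, items))
    return results


sample = ("fuel system components such as sensors, actuators, pumps, "
          "level controls, throttles and valves ...")
for head, items in find_enumerations(sample):
    print(head, "->", items)
```

Pairing each item with the head phrase gives candidate hyponym relations (for instance, "valves" as a kind of "fuel system components"), which is one possible starting point for the taxonomic extension described above.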

Named Entity Recognition in Patents

Patents contain named entities from a wide range of knowledge domains. Exactly which types occur has so far been explored only partially. The aim of this topic is to analyse which types of named entities occur in patents and to devise methods to recognise them. Patents from the domains of chemistry, biology and pharmaceuticals are excluded here, since the recognition of named entities in these domains requires extensive specialist knowledge.

This topic might be dealt with in a bachelor’s or a master’s thesis, depending on the scope of the investigation.

Internship (Praktikum)

FIZ Karlsruhe offers internships (Praktika) with a duration of one month or more. Possible topics might include:

  • Exploration of Elasticsearch’s Significant Terms Aggregation by means of a basic gold standard. Elasticsearch is a search and analytics engine based on Lucene. Its new version includes the experimental Significant Terms Aggregation, designed to extract keywords relating to a query.
  • Evaluation of several PoS taggers on patent texts and, where appropriate, suggestions for improvement. PoS taggers are a standard tool in natural language processing, but they were primarily designed for general texts or scientific articles. Their performance on patent texts might be wanting.
  • Analogous investigations can be carried out with chunkers (shallow parsers).
  • Structural segmentation of patent texts, i.e., the identification of paragraphs, headers, figures, tables, etc.
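
For the first internship topic, the following sketch shows how a Significant Terms request might be built and how the returned keywords could be scored against a small gold standard. The field name `abstract`, the aggregation name `keywords` and all function names are hypothetical. No Elasticsearch client is called here; the request body would be sent to the search endpoint by whatever client is in use.

```python
def significant_terms_request(query_text: str, text_field: str = "abstract",
                              size: int = 10) -> dict:
    """Build a search body with a significant_terms aggregation.

    `text_field` is a hypothetical field name; adapt it to the index mapping.
    """
    return {
        "size": 0,  # return no hits, only the aggregation result
        "query": {"match": {text_field: query_text}},
        "aggs": {
            "keywords": {
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


def keys_from_response(response: dict) -> list[str]:
    """Extract the bucket keys of the 'keywords' aggregation from a response."""
    return [b["key"] for b in response["aggregations"]["keywords"]["buckets"]]


def precision_recall(extracted, gold) -> tuple[float, float]:
    """Set-based precision and recall of extracted keywords against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Usage with a mocked response and an invented gold standard.
mock_response = {"aggregations": {"keywords": {"buckets": [
    {"key": "zeolite"}, {"key": "heat"}, {"key": "storage"}, {"key": "the"},
]}}}
gold_keywords = ["zeolite", "heat", "adsorption"]
print(precision_recall(keys_from_response(mock_response), gold_keywords))
```

The same precision and recall functions could serve for the other evaluation topics in the list, provided a gold standard is available for them.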

Please contact Dr. Michael Schwantner for more information.
Phone: +49 (7247) 808-260