Inside semantic Web search engines: between semantic annotation and Natural Language Processing
Dentro i motori di ricerca semantici: tra annotazione semantica ed elaborazione della lingua naturale
ISKO Italia Meeting - Turin, 3 April 2009
Presentation by Mela Bosch
Terminology on Web Search Engines
Text Search Engine: based on lexical analysis. The main aim of
lexical analysis is to divide the text into paragraphs, sentences and
words, and also into entities such as e-mail addresses or URLs. All these
elements are known as tokens, and the search engine parses them
with statistical parameters to produce a range of links in response to a query.
Latent semantic indexing (LSI): based on Latent semantic analysis
(LSA). LSI is a Natural Language Processing (NLP) technique which
uses an indexed database of documents to find similar terms. It can find
a synonym and then return the best-matching websites for the query. LSI
does not require exactly matching words to rank results.
Semantic Web search engines: take the sense of a word as a factor in their
ranking lists, or offer the user a choice as to the sense of a word or phrase.
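The lexical analysis step described for text search engines can be sketched as follows. This is only a minimal illustration, assuming simplified regular expressions for URLs, e-mail addresses and words; the names `TOKEN_PATTERN`, `tokenize`, `split_sentences` and `analyse` are hypothetical and not taken from any particular search engine.

```python
import re

# Assumed, simplified patterns: real engines use far more robust rules.
TOKEN_PATTERN = re.compile(
    r"(?P<url>https?://[^\s<>]+[^\s<>.,;:!?)])"
    r"|(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<word>\w+)"
)


def tokenize(sentence):
    """Return (kind, token) pairs for URLs, e-mail addresses and words."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(sentence)]


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def analyse(text):
    """Divide text into paragraphs, then sentences, then tokens."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [[tokenize(s) for s in split_sentences(p)] for p in paragraphs]
```

Calling `analyse` on a text returns one list per paragraph, each holding one token list per sentence. The statistical parsing and ranking that follow tokenization are outside the scope of this sketch.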
Semantic Web search engines
or Search engines of 3rd generation
Three types:
User oriented Semantic Web search engine: It returns web page links.
It can use internally both Semantic Web technologies and LSI. Ex.:
True Knowledge, Hakia and PowerSet.
Semantic Web Services oriented engine: It returns links to ontologies,
OWL files, RDF instances. It is inadequate for end users. Ex.: SOWL,
WSE, Watson, Falcons, Sindice and Swoogle. The idea is to provide
ways for businesses to inter-operate across domains or services.
Social-semantic Web oriented engine: The socio-semantic web (s2w)
uses classification and ontologies in very practical situations. S2w
search engines’ aim is to complement the formal Semantic Web
vision adding a pragmatic collaborative tagging (folksonomy)
approach. The main interest is to enable users to share knowledge.
Semantic Web search engines. What
are all these differences for?
“Semantic Web means many
things to different people:
•It is about artificial
intelligence, computer
programs solving complex
optimization problems
•It is about web services, in
terms of end user value
•It is the web of data, where
information is represented in
RDF or microformats and
The components of Semantic Web search engines
• Natural Language Processing (NLP)
Free-text annotation:
The annotations can be comments,
notes, explanations, references,
examples, advice, corrections or any
other type of external remark that
can be attached to or embedded in a
Web document or a selected part of
the document.
Semantic annotation in general
Semantic annotation is the association of a data entity with an
element from a classification scheme, ontology or other knowledge repository
Examples of semantic annotation:
• the assignment of MeSH descriptors to citations in MEDLINE
• the assignment of Gene Ontology terms to gene products in UniProt
Semantic Web Annotation
Is the technique for uploading machine understandable data on the Web by
creating metadata through semantic tagging
semantic annotation is a formal
annotation, where the predicate is an
ontological term, and the object
• The term “annotation” can denote both the process of annotating and
the result of that process.
It is crucial to the fulfillment of the Semantic Web to give
useful meaning to data or to unstructured text
Semantic Web Annotation
The Semantic Web Annotation process includes three components:
• an ontology which describes the domain of interest
• a data instance recognition process that discovers all
instances of interest in target web documents based on
the defined ontology
• an annotation generation process that creates a semantic
meaning disclosure file for each annotated document.
Through the semantic meaning disclosure file, any
ontology-aware machine agent can understand the
target document.
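The three parts of the annotation process listed above can be sketched in a few lines. This is a toy illustration under stated assumptions: the "ontology" is a plain dictionary of class names and instance labels (a real system would load an OWL or RDF ontology), recognition is exact string matching, and the "semantic meaning disclosure file" is rendered as JSON. All names (`ONTOLOGY`, `recognise_instances`, `generate_annotation_file`) and the example URI are hypothetical.

```python
import json
import re

# 1. A toy "ontology": domain classes with known instance labels (assumed).
ONTOLOGY = {
    "SearchEngine": ["Hakia", "PowerSet", "Swoogle"],
    "City": ["Turin", "Helsinki"],
}


def recognise_instances(text, ontology):
    """2. Data instance recognition: find ontology instances in the text."""
    found = []
    for cls, labels in ontology.items():
        for label in labels:
            pattern = r"\b" + re.escape(label) + r"\b"
            for m in re.finditer(pattern, text):
                found.append({"start": m.start(), "end": m.end(),
                              "text": label, "class": cls})
    return sorted(found, key=lambda a: a["start"])


def generate_annotation_file(doc_uri, text, ontology):
    """3. Annotation generation: build the disclosure file as a JSON string."""
    record = {"document": doc_uri,
              "annotations": recognise_instances(text, ontology)}
    return json.dumps(record, indent=2)
```

An ontology-aware agent would read the generated record instead of the raw text; here each annotation simply links a character span to a class of the toy ontology.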
Annotation can be performed manually, automatically or semi-automatically.
The process of annotating requires semantic annotation tools:
Types of semantic annotation tools
Inline annotation means that the original document
is augmented with metadata information.
Embedded metadata
It focuses on annotating
information on pages
using RDF
so that it is machine
Also called:
Semantic Authoring
Bottom-up approach
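A small sketch may help to picture inline annotation. It assumes a page fragment marked up with RDFa-style attributes (`vocab`, `typeof`, `property`) and reads the embedded metadata back with Python's standard `html.parser`. The class `PropertyExtractor` and the function `extract_inline_annotations` are hypothetical helpers, and this is not a conformant RDFa processor: it only collects the text of elements that carry a `property` attribute.

```python
from html.parser import HTMLParser

# The markup itself carries the metadata. Vocabulary and values are
# illustrative only.
PAGE = """
<div vocab="http://schema.org/" typeof="Event">
  <span property="name">Incontro ISKO Italia</span> in
  <span property="location">Torino</span>
</div>
"""


class PropertyExtractor(HTMLParser):
    """Collect (property, text) pairs from elements with a property attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # property name of the element being read

    def handle_starttag(self, tag, attrs):
        self._current = dict(attrs).get("property")

    def handle_data(self, data):
        if self._current and data.strip():
            self.pairs.append((self._current, data.strip()))

    def handle_endtag(self, tag):
        self._current = None


def extract_inline_annotations(html):
    """Return the metadata embedded in the given HTML fragment."""
    parser = PropertyExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

The point of the sketch is that the document and its metadata travel together: whoever receives the page also receives the annotations.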
Types of semantic annotation tools:
Standoff annotation means that the metadata is
stored separately from the original document.
Attached metadata
The annotations are then stored in a
database that is made available to
users via websites and sometimes via
web services
It is generally preferable from the point of view of inter-operability
Also called: top-down approach. Its focus is leveraging
information from existing web pages, to derive meaning
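Standoff annotation can be pictured with a similar sketch. It assumes that annotations are kept in a database, here an in-memory SQLite table, and point at the untouched document through its URI and character offsets. The table layout, the function names and the example concept URIs are all hypothetical.

```python
import sqlite3


def create_store():
    """Open an in-memory database that holds annotations apart from the pages."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation ("
        "doc_uri TEXT, start_pos INTEGER, end_pos INTEGER, concept TEXT)"
    )
    return conn


def annotate(conn, doc_uri, text, phrase, concept):
    """Record that the first occurrence of `phrase` denotes `concept`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in document")
    conn.execute(
        "INSERT INTO annotation VALUES (?, ?, ?, ?)",
        (doc_uri, start, start + len(phrase), concept),
    )


def annotations_for(conn, doc_uri):
    """Return (start, end, concept) rows for one document, in text order."""
    rows = conn.execute(
        "SELECT start_pos, end_pos, concept FROM annotation "
        "WHERE doc_uri = ? ORDER BY start_pos",
        (doc_uri,),
    )
    return rows.fetchall()
```

Because the original page is never modified, several parties can annotate the same document independently, which is one reason this approach tends to be preferred for inter-operability.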
There are several choices for annotation
The components of Semantic Web search engines
•Natural Language Processing (NLP)
Initially, NLP
• was conceived as a support for linguistic studies
• aimed at using computers to interpret and
manipulate words as part of a language
It is a powerful method for the investigation and
evaluation of human language itself, i.e. the
enhanced study of large corpora of texts
• Artificial Intelligence defines NLP as the act of using computers
to process written and spoken languages for some practical
purpose, such as translating languages or carrying on conversations
with machines.
After the Web explosion, NLP has been used for the
development of natural language understanding systems that
convert samples of human language into more formal
representations that are easier for computer programs to manipulate.
• Thanks to NLP techniques, algorithms such as
chunking, clustering, parsing, spellchecking,
tagging, and word sense disambiguation are
used to handle text intelligently and to
extract information from the Web or from
text data banks in order to answer questions
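As an illustration of one of the algorithms named above, word sense disambiguation, here is a minimal sketch of a simplified Lesk approach: the sense whose gloss shares the most words with the surrounding sentence is chosen. The sense inventory `SENSES`, the stopword list and the function names are hypothetical; a real system would rely on a full lexical resource.

```python
import re

# Hypothetical mini sense inventory with one gloss per sense.
SENSES = {
    "bank": {
        "bank_finance": "institution that accepts money deposits and gives loans",
        "bank_river": "sloping land beside a river or lake",
    }
}

STOPWORDS = {"a", "an", "and", "or", "that", "the", "of", "to", "in", "on", "by"}


def content_words(text):
    """Lower-case the text and keep the words that are not stopwords."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS}


def disambiguate(word, sentence, senses=SENSES):
    """Pick the sense whose gloss overlaps most with the sentence context."""
    context = content_words(sentence) - {word}
    overlaps = {
        sense: len(context & content_words(gloss))
        for sense, gloss in senses[word].items()
    }
    # On a tie, the first sense listed in the inventory is returned.
    return max(overlaps, key=overlaps.get)
```

For example, `disambiguate("bank", "He sat on the bank of the river")` selects the river sense, because "river" appears both in the sentence and in that gloss.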
However, both methodologies are now being
•semantic web search engines need many
pages to be annotated (which requires an
enormous effort),
•so that NLP becomes an important help in
automatic or semi-automatic annotation.
•At the same time the precision of text
analysis may be optimized by means of
techniques of assignment provided by users
and professionals.
In conclusion, the trend is the development of collective
knowledge systems that improve as more people participate, as
they are based on human contributions. All of this will possibly
be integrated by NLP algorithms.
Iskold, Alex. (2006) Semantic Web Patterns: A Guide to Semantic Technologies.
Atanas, K. et al. (2005) Semantic Annotation, Indexing, and Retrieval. Ontotext Lab.
Vehvilainen, A. et al. (2006) Semi-Automatic Semantic Annotation and Authoring Tool for a Library Help
Desk Service. Helsinki University.
Maynard, Diana. (2005) Benchmarking ontology-based annotation tools for the Semantic Web. Department
of Computer Science, University of Sheffield, UK.
Good, Benjamin M.; Kawas, Edward; Wilkinson, Mark. (2007) Bridging the gap between social tagging
and semantic annotation: E.D. the Entity Describer.