Language: The key to connecting
people with IT
Junichi Tsujii
University of Tokyo, Japan
and
UMIST, UK
Natural Language Processing
• Why : the reasons why NLP is important
• What: the projects which I am involved in
– What I am doing
• How: the current state of the art
– NLP techniques
• Future directions
USA
TIDES
• Translingual: not only English but ..
• Information Detection
– Information retrieval, Text retrieval, Passage retrieval
– Full text search for key words ?
• Information Extraction
– Fact extraction: (ex) which company started a new
joint venture with China where, when …
• Summarization
– Multi-text summarization, etc.
Information Extraction
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Name of the Venture: Yaxing Benz
Products: buses and bus chassis
Location: Yangzhou, China
Companies involved:
(1) Name: X?  Country: Germany
(2) Name: Y?  Country: China
Information Extraction
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Name of the Venture:
Products:
Location:
Companies involved:
Information Extraction
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Name of the Venture: Yaxing Benz
Products:
Location:
Companies involved:
Information Extraction
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Templates (Information Formats) and Filling:
Name of the Venture: Yaxing Benz
Products: buses and bus chassis
Location: Yangzhou, China
Companies involved:
(1) Name: X?  Country: Germany
(2) Name: Y?  Country: China
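A minimal sketch of template filling on the passage above, assuming a single hand-written regular expression. The class name `VentureTemplate`, the field names and the pattern are hypothetical illustrations, not the method of any system described in this talk.

```python
import re
from dataclasses import dataclass
from typing import Optional

TEXT = (
    "Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the "
    "second floor of his Nanjing home early on Sunday. "
    "The deputy general manager of Yaxing Benz, a Sino-German "
    "joint venture that makes buses and bus chassis in nearby Yangzhou, "
    "was hacked to death with 45 cm watermelon knives."
)

@dataclass
class VentureTemplate:
    """Hypothetical joint-venture template with three of the slots shown above."""
    name: Optional[str] = None
    products: Optional[str] = None
    location: Optional[str] = None

# One pattern for "of NAME, a X joint venture that makes PRODUCTS in [nearby] PLACE".
PATTERN = re.compile(
    r"of (?P<name>(?:[A-Z]\w+ ?)+), a [\w-]+\s+joint venture "
    r"that makes (?P<products>.+?) in (?:nearby )?(?P<location>[A-Z]\w+)"
)

def fill_template(text: str) -> VentureTemplate:
    """Fill the template slots from the first matching sentence, if any."""
    template = VentureTemplate()
    match = PATTERN.search(text)
    if match:
        template.name = match.group("name").strip()
        template.products = match.group("products")
        template.location = match.group("location")
    return template

filled = fill_template(TEXT)
print(filled)
```

The slots for the companies involved stay empty here, just as X? and Y? do in the slide: the passage does not name them, so a pattern over this text alone cannot fill them.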
Information Extraction
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Different Templates for Crimes
Crime-Type: Murder
Type: Stabbing
Inst: 45 cm watermelon knife
The killed:
Name: Jurgen Pfrang
Age: 51
Profession: Deputy general manager
Location: Nanjing, China
Date: 3/April/2000
Information Extraction
A German vehicle-firm executive was stabbed to death ….
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Crime-Type: Murder
Type: Stabbing
The killed:
Name: Jurgen Pfrang
Age: 51
Profession: Deputy general manager
Location: Nanjing, China
Reasons why NLP is important
• Language as Media
• Economic Effects
• Advances of Technologies
Economic Effects (EU)
• Globalization: Single and Free Market
– Free Flow of People,Goods and Money
– No border control, No tariff and Single currency
– Free Flow of Information: Language Technology
• Economic Power and Language Power
– Sell your ideas and dominate the market !!
– Duality of language in WWW: Domestic and
International market
– Localization
EU-US Cooperation
Transportation challenges for the 21st century
Climate forecasting and prediction applications
Human environmental health sciences
Information technology
• Next generation internet
• Electronic commerce
• Trans-lingual information management
Translingual
Information Management
• Spoken trans-lingual communication
• International digital library
• Tactical information management
Reasons why NLP is important
• Language as Media
• Economic Effects
• Advances of Technologies
Language as Information Media
– Language as media for representing knowledge,
information and our recognition of the world
• Images:instances at certain spatial-temporal coordinates
Simple messages about single facts
• Language: generalized statements, structured thoughts,
and logical developments
– Language as media for communication
• Universality: interpretation dependent on contexts
• Communication among intelligent agents with different
backgrounds
Roles of Language as
Information Media
• Integration of diverse kinds of information
into a coherent, structured whole
• Knowledge sharing among different
institutions
• Universal media for communication with
diverse IT software and intelligent
appliances
Reasons why NLP is important
• Language as Media
• Economic Effects
• Advances of Technologies
Advances of Background
Technologies
• Speed and Memory Capacity
• Computer of the 60s - Digital Watch of the 80s(P.Brown:
IBM/1991)
• Speech Technology
• Text-to-Speech and Speech Recognition (Gates and
Paxman/1/April/2000)
• L & H: Language Technology Valley in Belgium
• Input and output of Asian languages
• Ubiquitous Computing and Wearable
Computing ++ WWW and Internet
• Language in Contexts
Language in Context
The same word, "Open", in three contexts:
• Entrance of a shop
• Button on a microwave
• Flashing light on the door of a CD player
Natural Language Processing
• Why : the reasons why NLP is important
• What: the projects which I am involved in
– What I am doing
• How: the current state of the art
– NLP techniques
• Future directions
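Machine Translation: Map of Related Technologies
[Figure: map of technologies related to machine translation.
Inputs: handwritten characters, human voice, keyboards.
Axes: different signals, different languages, large volume.
Technologies: character recognition, speech recognition, speech synthesis, machine translation, multilingual information retrieval, information retrieval, information extraction and summarization, text density compression, classification, filtering, personalization, text mining.
Language Resource and Technology: multilingual dictionaries and corpora, analysis methods, generation methods.]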
Japan
• ATR: Speech Translation (14years)
– 3rd Period (5years)
• Multi-lingual(E,G,C,K,J), Broader topics
• CRL (MPT) and National Language Lab
– Five year project
• Speech Monologue and Summarization
• Corpus Collection
• TAO (MPT): Automatic Caption in TV
• Joint Project between MITI and ME
– Knowledge Acquisition for MT
• JSPS (ME); Tokyo-U, Kyoto-U and TIT
JSPS Project of NLP
• Type-based Grammar Formalisms and
Parsing Techniques Generic Techniques
• Programming Language based on Typed Feature
Structures (LiLFeS)
• Efficient Parsing Techniques and Partial Parsing
• Information Retrieval (Query Expansion)
• Application Domains, Languages, App. Types
• GENIA: IE and IR in genome informatics
Research Plan of Tsujii Group
Practical NLP applications
Knowledge Acquisition Module
SLUNG (Japanese Grammar)
XHPSG (English Grammar)
Parallel HPSG Parser
Sequential
HPSG Parser
Parallel Programming
Programming Environment
Language
Unification by Abstract Machine (Carpenter and Qu, 1995)
[Figure: step-by-step execution of the abstract machine code of a TFS (instructions PUSH, ADDNEW, UNIFYVAR and POP) against TFS data on memory. Memory cells 1 to 6 are tagged STR, VAR or PTR and hold the types nelist, list, bot and foo with the features FIRST and REST.]
Performance comparison with other feature structure processing systems
Intel Pentium II / 400MHz
[Figure: bar charts of relative speed (user time, ALE 3.1 (Emulator) = 1; higher is faster) on the HPSG and Simple benchmarks for four systems: new LiLFeS (asm), old LiLFeS, ProFIT on SICStus (Emulator), and ALE 3.1 on SICStus (Emulator).]
Filtering with CFG (1/5)
• Two-phase parsing
– Approximate the HPSG with a CFG while keeping important constraints.
– The resulting CFG may over-generate, but it can be used for filtering.
– Rewriting with a CFG is far less expensive than applying rule schemata, principles and so on.
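[Figure: two-phase parsing pipeline. The HPSG is compiled into a CFG. Input sentences are parsed by the built-in CFG parser; the CFG parses plus feature structures then go through LiLFeS unification. Output: complete parse trees.]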
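The two-phase idea can be sketched on a toy grammar. This is an illustration under stated assumptions, not the LiLFeS implementation: the lexicon, the single rule S -> N V, the flat feature structures and all function names are hypothetical. Phase 1 is a CKY recognizer over atomic categories; phase 2 runs unification only on edges that phase 1 licensed.

```python
from itertools import product

# Hypothetical toy lexicon: word -> (atomic CFG category, flat feature structure).
LEXICON = {
    "dog": ("N", {"num": "sg"}),
    "dogs": ("N", {"num": "pl"}),
    "barks": ("V", {"num": "sg"}),
    "bark": ("V", {"num": "pl"}),
}
# CFG approximation of the grammar: categories only, agreement is dropped,
# so this CFG over-generates ("dog bark" is accepted).
CFG_RULES = {("N", "V"): "S"}

def unify(a, b):
    """Unify two flat feature structures; return None on a value clash."""
    result = dict(a)
    for key, value in b.items():
        if key in result and result[key] != value:
            return None
        result[key] = value
    return result

def cfg_filter(words):
    """Phase 1: CKY recognition; maps each span to the categories it can have."""
    n = len(words)
    chart = {(i, i + 1): {LEXICON[w][0]} for i, w in enumerate(words)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = set()
            for k in range(i + 1, j):
                for left, right in product(chart[(i, k)], chart[(k, j)]):
                    if (left, right) in CFG_RULES:
                        cell.add(CFG_RULES[(left, right)])
            chart[(i, j)] = cell
    return chart

def parse(words):
    """Phase 2: unification-based CKY restricted to CFG-licensed edges."""
    n = len(words)
    licensed = cfg_filter(words)
    if "S" not in licensed[(0, n)]:
        return []  # rejected by the cheap filter; no unification is attempted
    chart = {(i, i + 1): [LEXICON[w]] for i, w in enumerate(words)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            edges = []
            for k in range(i + 1, j):
                for (lc, lf), (rc, rf) in product(chart[(i, k)], chart[(k, j)]):
                    parent = CFG_RULES.get((lc, rc))
                    if parent in licensed[(i, j)]:
                        fs = unify(lf, rf)
                        if fs is not None:
                            edges.append((parent, fs))
            chart[(i, j)] = edges
    return [fs for cat, fs in chart[(0, n)] if cat == "S"]

print(parse(["dogs", "bark"]))
print(parse(["dog", "bark"]))
print(parse(["bark", "dogs"]))
```

The second sentence passes the CFG filter and is rejected only by unification, which is the over-generation case named above; the third is rejected by the filter alone.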
Evaluation of HPSG Parsers
Parsing time (seconds) per sentence

Grammar  Corpus (avg. sentence   Naïve    TNT      Reference: LKB Parser
         length in words)        Parser   Parser   (Stanford DFKI)
LinGO    csli (5.8)              0.68     0.12     0.23
LinGO    aged (8.4)              1.72     0.31     0.61
LinGO    blend (11)              14.71    1.90     3.10
XHPSG    ATIS (7.42)             14.27    0.30     -
SLUNG    EDR (20.5)              0.88     0.38     -

Sun UltraSparc, 336 MHz, 6 GB main memory
Research Plan of Tsujii Group
Practical NLP applications
Knowledge Acquisition Module
SLUNG (Japanese Grammar)
XHPSG (English Grammar)
Parallel HPSG Parser
Sequential
HPSG Parser
Parallel Programming
Programming Environment
Language
Experiment
Parsing time per sentence (msec)

                          1 CPU    50 CPUs
PSTFS (CKY parser)        1,069    123
LiLFeS (CKY parser)       961
LiLFeS (2-Phase parser)   240

Speed-up without morphological analysis time: reaches 11.8 times.
[Figure: speed-up of PSTFS versus number of processors (0 to 60).]
Grammar: Japanese HPSG (SLUNG) [Mitsuishi98]
Corpus: EDR Corpus 600 sentences
Computer: Ultra Enterprise 10000 (64 CPUs, 8 Gbyte shared memory)
Research Plan of Tsujii Group
Practical NLP applications
Knowledge Acquisition Module
SLUNG (Japanese Grammar)
XHPSG (English Grammar)
Parallel HPSG Parser
Sequential
HPSG Parser
Parallel Programming
Programming Environment
Language
Overview of GENIA Project
① A researcher with a question
② query
③ GENIA
Information Extraction
• Pre-processing
• Named entity
• Template element
• Scenario template
Information Retrieval
Learning, Terminology, Databases, Corpora, Ontology, WWW Links, Thesaurus
④ information extracted
⑤ answer to the question
CSNDB
(National Institute of Health Sciences)
• A data- and knowledge- base for signaling
pathways of human cells.
– It compiles the information on biological molecules,
sequences, structures, functions, and biological
reactions which transfer the cellular signals.
– Signaling pathways are compiled as binary
relationships of bio-molecules and represented by
graphs drawn automatically.
– CSNDB is built on ACEDB and the inference engine CLIPS, and is linked to TRANSFAC.
– Final goal is to make a computerized model for various
biological phenomena.
Example. 1
• A Standard Reaction
Signal_Reaction: “EGF receptor → Grb2”
From_molecule “EGF receptor”
To_molecule “Grb2”
Tissue “liver”
Effect “activation”
Interaction “SH2+phosphorylated Tyr”
Reference [Yamauchi_1997]
Excerpted @[Takai98]
Example. 3
• A Polymerization Reaction
Signal_Reaction: “Ah receptor + HSP90”
Component “Ah receptor” “HSP90”
Effect “activation dissociation”
Interaction “PAS domain of Ah receptor”
Activity “inactivation of Ah receptor”
Reference [Powell-Coffman_1998]
Excerpted @[Takai98]
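CSNDB is described above as storing signaling pathways as binary relationships between bio-molecules, drawn as graphs. A minimal sketch of that representation follows. The class and function names are hypothetical and this is not the CSNDB schema; the second reaction, involving "molecule X", is an invented placeholder that only serves to show a two-step pathway.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class SignalReaction:
    """One binary relationship between two molecules (hypothetical record type)."""
    from_molecule: str
    to_molecule: str
    effect: str
    reference: str

REACTIONS = [
    # Taken from Example 1 above.
    SignalReaction("EGF receptor", "Grb2", "activation", "Yamauchi_1997"),
    # Invented placeholder edge, not a fact from the database.
    SignalReaction("Grb2", "molecule X", "activation", "(hypothetical)"),
]

def build_graph(reactions):
    """Adjacency list: molecule -> list of (target molecule, effect)."""
    graph = defaultdict(list)
    for reaction in reactions:
        graph[reaction.from_molecule].append((reaction.to_molecule, reaction.effect))
    return graph

def downstream(graph, start):
    """All molecules reachable from start by following reactions."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for target, _effect in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

graph = build_graph(REACTIONS)
print(sorted(downstream(graph, "EGF receptor")))
```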
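Information Extraction from Academic Papers in Medicine and Biology
Purpose
Using natural language processing technology, build tools that help researchers in molecular biology when they write papers, as well as tools that extract information from large numbers of papers, and make them available on the WWW.
Effects
• Acquire information and extend databases semi-automatically
• Search for papers efficiently
Overview
[Figure: the GENIA system diagram, steps ① to ⑤, with Information Extraction (pre-processing, named entity, template element, scenario template), Information Retrieval, Learning, Terminology, Databases, Corpora, WWW Links and Thesaurus.]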
Objectives
• Information extraction from texts
– Named entity recognition
– Event recognition
– Context pattern recognition
• Linking texts with fact databases
– Representation of textual meanings
• Ontology building
– Knowledge of the domain
– Representation language: Lattice-based types
Information Extraction
A German vehicle-firm executive was stabbed to death ….
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Information Extraction
A German vehicle-firm executive was stabbed to death ….
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Person’s name Jurgen Pfrang
Company name Yaxing Benz
Information Extraction
A German vehicle-firm executive was stabbed to death ….
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the
second floor of his Nanjing home early on Sunday.
The deputy general manager of Yaxing Benz, a Sino-German
joint venture that makes buses and bus chassis in nearby Yangzhou,
was hacked to death with 45 cm watermelon knives.
……….
Person’s name Jurgen Pfrang
Company name Yaxing Benz
Position Name: deputy general manager
Products: bus, bus chassis
Location Name: Nanjing, Yangzhou
Named Entity
Recognition
K.Fukuda(1998, IMS, U-Tokyo)
T.Sekimizu(1998, GENIA, U-Tokyo)
C.Nobata(1999, GENIA, U-Tokyo)
Example of Annotated Text
UI - 98438350
TI - Alterations in protein-DNA interactions in the <DNA id=1
unit=dr>gamma-globin gene promoter</DNA> in response to
butyrate therapy.
AB - The mechanisms by which pharmacologic agents stimulate <DNA
id=2 unit=ml>gamma-globin gene</DNA> expression in <PROTEIN
id=1 unit=fg>beta-globin</PROTEIN> disorders has not been fully
established at the molecular level. In studies described here, nucleated
<SOURCE id=1 subtype=cell-type>erythroblasts</SOURCE> were
isolated from patients with <PROTEIN id=1 unit=ml>beta-globin</PROTEIN>
disorders before and with butyrate therapy, and
globin biosynthesis, mRNA, and protein-DNA interactions were
examined. Expression of <RNA id=1 unit=ml>gamma-globin
mRNA</RNA> increased twofold to sixfold above baseline with
butyrate therapy in 7 of 8 patients studied.
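A small sketch of how the marked-up entities in text like the above could be read back out, assuming the tags are never nested and attribute values are unquoted, as in this example. The function name and regular expressions are hypothetical and are not part of the GENIA tools.

```python
import re

# An opening tag with attributes, the entity text, and the matching closing tag.
TAG = re.compile(r"<(?P<cls>[A-Z]+)(?P<attrs>[^>]*)>(?P<text>.*?)</(?P=cls)>", re.DOTALL)
# Unquoted attribute pairs such as id=1 or subtype=cell-type.
ATTR = re.compile(r"(\w+)=([\w-]+)")

def extract_entities(sgml):
    """Return (class, text, attributes) for every tagged entity."""
    entities = []
    for match in TAG.finditer(sgml):
        attributes = dict(ATTR.findall(match.group("attrs")))
        # Entity text may be wrapped across lines; normalise the whitespace.
        text = " ".join(match.group("text").split())
        entities.append((match.group("cls"), text, attributes))
    return entities

SAMPLE = (
    "TI - Alterations in protein-DNA interactions in the <DNA id=1\n"
    "unit=dr>gamma-globin gene promoter</DNA> in response to\n"
    "butyrate therapy. nucleated\n"
    "<SOURCE id=1 subtype=cell-type>erythroblasts</SOURCE> were isolated"
)

for entity in extract_entities(SAMPLE):
    print(entity)
```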
Features Used in HMM
Code  Feature               Example
dig   DigitNumber           15
sin   SingleCapital         M
grk   GreekLetter           alpha
cad   CapsAndDigits         I2
cap   AtLeastTwoCaps        RalGDS
lad   LettersAndDigits      il2
fst   FirstWord             (first word in sentence)
ini   InitCap               Interleukin
lcp   LowerCaps             kappaB
low   LowerCase             kinases
hyp   Hyphen                -
opp   OpenParentheses       (
clp   CloseParentheses      )
fsp   FullStop              .
cma   Comma                 ,
pct   Percent               %
osq   OpenSquareBrackets    [
csq   CloseSquareBrackets   ]
cln   Colon                 :
scn   SemiColon             ;
det   Determiner            the
con   Conjunction           and
oth   Other                 *,+,#,@
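The feature codes above can be computed from surface form alone. The sketch below is one possible assignment function. The order in which the tests are applied, the small word lists and the name `feature_code` are assumptions made for illustration; the slides do not say how ties between features are resolved.

```python
import re

# Small illustrative word lists (assumed; the real lists are not given).
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
DETERMINERS = {"the", "a", "an"}
CONJUNCTIONS = {"and", "or", "but"}
PUNCTUATION = {
    "-": "hyp", "(": "opp", ")": "clp", ".": "fsp", ",": "cma",
    "%": "pct", "[": "osq", "]": "csq", ":": "cln", ";": "scn",
}

def feature_code(token, first_in_sentence=False):
    """Map a non-empty token to one feature code; test order is an assumption."""
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    if token in PUNCTUATION:
        return PUNCTUATION[token]
    if token.isdigit():
        return "dig"
    if len(token) == 1 and token.isupper():
        return "sin"
    if token.lower() in GREEK:
        return "grk"
    if re.fullmatch(r"[A-Z0-9]+", token) and has_digit and has_alpha:
        return "cad"
    if sum(c.isupper() for c in token) >= 2:
        return "cap"
    if token.isalnum() and has_digit and has_alpha:
        return "lad"
    if first_in_sentence:
        return "fst"
    if token[0].isupper() and token[1:].islower():
        return "ini"
    if token[0].islower() and any(c.isupper() for c in token):
        return "lcp"
    if token.lower() in DETERMINERS:
        return "det"
    if token.lower() in CONJUNCTIONS:
        return "con"
    if token.islower():
        return "low"
    return "oth"

for word in ["15", "M", "alpha", "I2", "RalGDS", "il2", "Interleukin", "kappaB", "kinases"]:
    print(word, feature_code(word))
```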
Outline of Experiments of NE
• Interpolating HMM
• Domain of biochemistry: human + blood cell + transcription factor
• Corpus: 100 MEDLINE abstracts (80 for training, 20 for testing, with 5-fold cross-validation), SGML tagged by a domain expert
• Marked-up entities:
2124 Protein entities
365 DNA entities
30 RNA entities
856 Source entities (all sub-classes)
3375 Total entities
Results: basic

Class        NE task   Term identification   Term classification
All          0.719     0.674                 0.764
PROTEIN      0.753     0.692                 0.815
DNA          0.464     0.416                 0.511
RNA          0.08      0.08                  0.08
SOURCE       0.664     0.607                 0.721
SOURCE.cl    0.388     0.353                 0.422
SOURCE.ct    0.691     0.651                 0.73
SOURCE.mo    0         0                     0
SOURCE.mu    0.467     0.467                 0.467
SOURCE.vi    0.742     0.721                 0.763
SOURCE.sl    0.57      0.563                 0.576
SOURCE.ti    0.084     0.084                 0.084

Table 1: F-score values for 5-fold cross-validation
F-score = (2 x Precision x Recall) / (Precision + Recall)
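The F-score formula given with Table 1 can be computed as follows. This is a minimal sketch: the exact-match scoring of (start, end, class) spans and the function names are assumptions, since the slides do not state how partial matches were counted.

```python
def f_score(precision, recall):
    """F-score = (2 x Precision x Recall) / (Precision + Recall); 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def evaluate(gold, predicted):
    """Exact-match precision, recall and F-score over sets of (start, end, class)."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)

# Hypothetical spans, for illustration only.
gold = {(0, 2, "PROTEIN"), (5, 6, "DNA"), (9, 10, "SOURCE")}
predicted = {(0, 2, "PROTEIN"), (5, 6, "RNA")}
print(evaluate(gold, predicted))
```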
Features Used in HMM
Compare Prof. Tsou’s approach to the segmentation of Chinese texts, which uses information about characters.
Asian languages such as Chinese and Japanese may have some advantages here.
Objectives
• Information extraction from texts
– Named entity recognition
– Event recognition
– Context pattern recognition
• Linking texts with fact databases
– Representation of textual meanings
• Ontology building
– Knowledge of the domain
– Representation language: Lattice-based types
Verbs Related to Biological Events
Frequent Verbs in 100 MEDLINE Abstracts

Verb         Count   Verb           Count   Verb        Count   Verb          Count
be           255     involve        16      determine   9       explain       6
induce       56      identify       16      construct   9       exert         6
bind         50      act            15      associate   9       enhance       6
show         49      stimulate      14      reduce      8       display       6
suggest      42      provide        14      prevent     8       characterize  6
activate     42      express        13      locate      8       participate   5
factor       36      affect         13      line        8       localize      5
demonstrate  35      type           12      differ      8       investigate   5
inhibit      26      report         12      trigger     7       imply         5
have         25      form           12      synergize   7       establish     5
reveal       21      contribute     12      examine     7       conclude      5
require      21      study          11      block       7       compare       5
regulate     21      observe        11      become      7       use           4
indicate     21      lead           11      analyze     7       transform     4
find         21      function       11      target      6       transfect     4
result       20      assay          11      signal      6       test          4
play         19      appear         11      remain      6       suppress      4
interact     18      occur          10      produce     6       support       4
mediate      17      increase       10      present     6       substitute    4
contain      17      phosphorylate  9       possess     6       share         4
Verbs Related to Biological Events
Syntactic Patterns of Some Frequent Verbs
• induce
– noun BE INDUCED BY noun
activation of these PROTEIN was induced by PROTEIN
– noun INDUCE noun
PROTEIN induced the tyrosine phosphorylation of PROTEIN
• bind
– noun BIND TO noun
the drugs bind to two different PROTEIN
– noun BIND noun
motifs previously found to bind the cellular factor PROTEIN
• show
– noun BE SHOWN to-infinitive
PROTEIN has been shown to trigger cellular PROTEIN activity
– noun SHOW that-clause
the data show that PROTEIN stimulation is also not sufficient
– noun SHOW noun
SOURCE showed a dose-dependent inhibition of PROTEIN activity
Semantic classes: substance, source, experiment, fact
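Patterns such as those for "induce" can be matched against text in which named entities have already been replaced by their class labels. The sketch below is a hedged illustration using plain regular expressions; the pattern definitions and the `match_induce` function are hypothetical, and the talk itself proposes partial parsing rather than regular expressions for this step.

```python
import re

ENTITY = r"(?:PROTEIN|DNA|RNA|SOURCE)"

# noun BE INDUCED BY noun
PASSIVE = re.compile(
    rf"(?P<theme>.+?) (?:was|were|is|are) induced by (?P<agent>{ENTITY})\b"
)
# noun INDUCE noun
ACTIVE = re.compile(rf"\b(?P<agent>{ENTITY}) induce[sd]? (?P<theme>.+)")

def match_induce(sentence):
    """Return an event record for an 'induce' pattern, or None if none matches."""
    match = PASSIVE.search(sentence)
    voice = "passive"
    if match is None:
        match = ACTIVE.search(sentence)
        voice = "active"
    if match is None:
        return None
    return {
        "verb": "induce",
        "voice": voice,
        "agent": match.group("agent"),
        "theme": match.group("theme"),
    }

print(match_induce("activation of these PROTEIN was induced by PROTEIN"))
print(match_induce("PROTEIN induced the tyrosine phosphorylation of PROTEIN"))
```

Both example sentences come from the slide above. The two patterns normalise active and passive realisations to the same agent and theme roles, which is what an event template needs.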
Generic Techniques for Application
•Parsing: The same software for different languages
Japanese, German, English, Chinese
The same core grammar for different domains
Non-Core: Learning from Data
•Semantics: Named entities, Event templates
HMM, Decision Tree
Corpus Collection and Annotation
Objectives
• Information extraction from texts
– Named entity recognition
– Event recognition
– Context pattern recognition
• Linking texts with fact databases
– Representation of textual meanings
• Ontology building
– Knowledge of the domain
– Representation language: Lattice-based types
Underlying Ontology (1)
Substances are classified by their chemical structure,
not by their biological role
-substance
  -compound
    -organic
      -protein
      -DNA
      -RNA
    -inorganic
  -atom
Ontology description on Web
http://www.is.s.u-tokyo.ac.jp/~yucca/docs/genia/hierarchy.html
Cross Validation
• Average rate of agreement (F-score)
75.8 %
• Typical patterns of inconsistency found
Revised Tag Set and Underlying Ontology
(Links in the diagram are labelled is-a or part-of.)
+-name
  +-source
  | +-natural
  | | +-organism
  | | | +-multi-cell organism
  | | | +-mono-cell organism
  | | | +-virus
  | | +-tissue
  | | +-cell type
  | | +-sub-location of cells
  | +-artificial
  |   +-cell line
  +-substance
    +-compound
      +-inorganic
      +-organic
        +-amino
        | +-protein
        | | +-protein family/group
        | | +-protein complex
        | | +-individual molecule
        | | +-UnitOfProteinComplex
        | | +-SubstructureOfProtein
        | | +-Domain/RegionOfProtein
        | +-peptide
        | +-amino acid monomer
        +-DNA
        | +-DNA family or group
        | +-individual DNA molecule
        | +-domain or region of DNA
        +-RNA
          +-RNA family or group
          +-individual RNA molecule
          +-domain or region of RNA
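A hierarchy like this one supports simple subsumption queries. The sketch below encodes part of the tree as a child-to-parent map and checks whether one class falls under another. It is an illustration only: the map covers a subset of the nodes, treats every link as is-a, and the function names are hypothetical.

```python
# Child -> parent for a subset of the tag set (every link treated as is-a here).
PARENT = {
    "source": "name",
    "substance": "name",
    "natural": "source",
    "artificial": "source",
    "organism": "natural",
    "tissue": "natural",
    "cell type": "natural",
    "cell line": "artificial",
    "compound": "substance",
    "inorganic": "compound",
    "organic": "compound",
    "amino": "organic",
    "protein": "amino",
    "peptide": "amino",
    "protein complex": "protein",
}

def ancestors(node):
    """The node itself followed by its chain of parents up to the root."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def is_a(node, ancestor):
    """True if ancestor subsumes node (a node subsumes itself)."""
    return ancestor in ancestors(node)

def least_common_ancestor(a, b):
    """Most specific class that subsumes both a and b, or None."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):
        if candidate in seen:
            return candidate
    return None

print(is_a("protein complex", "substance"))
print(least_common_ancestor("protein", "peptide"))
```

In a tree, the least common ancestor plays the role that the join of two types plays in a lattice-based type system, where a node may have more than one parent.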
Name Ontology
[Figure: two taxonomies linked to terms. The SUBSTANCE taxonomy (nodes SUBSTANCE1 to SUBSTANCE4, each with attribute1, attribute2, ...) covers terms such as AMINO ACID, DNA, ORGANIC COMPOUND and PROTEIN. The ROLE taxonomy (nodes ROLE1 to ROLE4, each with attribute3, attribute4, ...) covers terms such as AGENT, ENZYME, PHOSPHATASE and TRANSCRIPTION FACTOR.]
Event Ontology
[Figure: taxonomy of reaction types (nodes REACTION1 to REACTION5, each with attribute1, attribute2, ...).]
• substance ACTIVATE substance
• substance ACTIVATE protein
• protein ACTIVATE pathway
• PHOSPHORYLATE
• INHIBIT
• REGULATE
Linguistic Expressions and Ontology
(example: to activate and receptors)
Def-Type{G-Protein-Receptor(0):
Is-a:{Membrane-Receptor}
Def-event-sequence:{Signal-Pass-by-G-Protein
receptor:(0)
ligand:{Protein lr:{N ligand}}(1) second-messenger:{Protein lr:<{messenger}>}(5)
lr:<{V verb:activate subject:(0)} object:(1)}>
combine:{Combine co-agent:<(0),(1)>
product:{Complex-protein subunit:<(0),(1)> }(2) location:{Exterior-of Cell}
lr:<{V verb:activate subcat……}>}
connect:{Connect agent:{G-protein}(3) object:{}(2)
product:{Complex subunit:<{(2) subunit},(3)>}(4)
location:{Interior-of-Cell}}
activation:{Activate agent:{}(4) object:{Enzyme event:{Enzymatic-event
product:{}(5)}}}
}
* lr: linguistic realization
Objectives
• Information extraction from texts
– Named entity recognition
– Event recognition
– Context pattern recognition
• Linking texts with fact databases
– Representation of textual meanings
• Ontology building
– Knowledge of the domain
– Representation language: Lattice-based types
Future Directions
• Language is the key in IT society
– Language as representation media
• Knowledge sharing, integration of text and fact data bases
– Language as communication media
• Speech, wearable computation & ubiquitous computation
• Language in multi-media environments
• Globalization and localization
– Multi-lingual, Cross-Lingual, Trans-lingual NLP
• Language: Investigation tools of other disciplines
– Text mining (ex: different distributions of proper
nouns)
Future Research Topics
• Combination of Structural Techniques with
Statistical ones
– (Partial Parsing + HMM) for event recognition
• Large scale ontology building and text
annotation / Corpus Collection
– Software tools
• Integrated system