Syntactic Pattern Recognition
•Statistical PR:
Find a feature vector x
•
Train a system using a set of labeled patterns
•
Classify unknown patterns
•Ignores relational information contained in the structure
•Most structural methods use hierarchical decomposition
•Note similarity between a sentence structure and pattern description
Picture A
f
A
c
b
B
a
g
C
Rectangle C
Triangle B
e
d
edge edge edge
a
b
c
edge edge
d
e
edge
edge
f
g
Language
•Alphabet is a finite set of symbols, V={x1,x2, …,xn}
•Sentence over B is a finite string of ordered symbols (left to right) from V
•Example: V = {a,b,c}, valid sentences are “abb”, “abba”, “aaa”, null
•Length of a sentence s, |s| is the number of symbols
•s1os2 is the concatenation of the two sentences
•VoVoV…oV = Vn is the set of all sentences with n symbols over V
•V+=VUV2UV3…. is the set of all non-empty sentences over V
•V* is the closure of V
•Language is an arbitrary subset L of V*
•Example: V={0,1}, then L1 = {001, 110, 111, 0, null} is a finite language
• L2 = {s|s = 1n021m, n>=1, 1<=m<=10} is an infinite language
Languages
•L1oL2 = {s|s = s1s2, s1 belongs to L1 and s2 belongs to L2} is concatenation
•L1it = {s|s = s1s2…sn, n>=0, si belongs to L1} is the iterate of L1
•L1oL2 and L1it are both languages
•Example: V = {a,b}
L1 = {aa,ab,bb}
L2 = {a,b}
•L1oL2 = {a3,aba,b2a,a2b,ab2,b3}
•L1it is infinite; for n={0,1,2}
 , a
2
2
4
3
2
2
2
3
2
2
2
4
, ab , b , a , a b , a b , aba , abab , ab , b a , b ab , b }
•s is called a sub-string of t if t =usv for some strings u,v belonging to V*
•Every string is a substring of itself as u and/or v can be null
Grammars
•Grammar G = {VT, VN, P, S} has 4 entities
•
VT is a set of terminal symbols, called primitives or constants
•
VN is a set of non-terminal symbols, called variables
•
VT and VN belong to V;
•
VT  V N   , VT  V N  V
P is the set of production rules A->B where A has at least one variable and
B is a mix of variables and constants
•
S is the starting symbol or the root; S belongs to VN
•L(G) is a formal language ( a set of strings) generated by the grammar G
•
Each string is composed of only primitives
•
Each string can be derived from S using the production rules P
•
Example: VT = {a,b}, VN = {S}; P = {S->aSb, S->ab} => L(G) : anbn, n>=1
•Grammar is used to :
•(I) generate the strings (sentences) accepted by L(G), (ii) check if a
sentence belongs to a grammar, (iii) analyze the structure of a sentences
Grammar Types

  V ;   V ; A , B  V N ; a , b  VT ;
*
UnRestricted Grammar (UR)
1  2
Context Sensitive Grammar (CS)
 A 1   2 
Context Free Grammar (CF)
A1  2
Finite State Grammar (FS)
A 1  a , orA
Example: VT = {a,b,c}; VN = {S, A, B}
UR
CS
CF
1
 B 2a
FS
S  abA
S  aBbc
S  aAc
S  aA
A  c
S  ab
S  bB
S  bB
S  cB
aB  aAc
S  AB
A 
B  aA
Ac  abc
A  aB
A  a
B 
Bb  Acb
A  c
B  b
B  b
B  c
Finite State Grammars, and Graphical Representations
•Nodes are nonterminals in VN and an additional terminal node T not in V
•Productions of type Ai->aAj represented by edge a directed from Ai to Aj
•Productions of type Ai->a represented by edge a directed from Ai to T
S  aA
a
S  bB
A  cB
B  a
A  c
A
S
a
B
a
a
a
T
For a FS grammar G, an arbitrary string x=x1x2..xn, xi in VT is in L(G) iff
there exists at least one path (x1,x2,..,xn) from S to T
Syntactic Pattern Recognition
C2-class problem C1 and C2 are composed of features from a set VT
Let G be a grammar such that L(G) consists only of sentences (patterns) from C1
Example:
VT = {a,b}
VN = {S,A}
P:{S->aSb S->b}
L(G): {b; anbn+1, n>=1}
Classification Rule
x belongs to C1 iff x belongs to L(G)
x belongs to C2 iff otherwise
Classification algorithm has to correctly answer whether or not a given string is
grammatically correct.
Pattern Grammars
2-class problem: rectangles and other quadilaterals
Select primitives:
Set of rectangles:
a:
0o
edge
b:
90o
edge
c:
180o
edge
d:
270o
edge
L  { a 0 b 0 c 0 d 0 ; n , m  1, 2 ,3 ...}
n
m
n
m
If a0, b0, c0, d0 represent unit length lines
L  {a b c ;1  n  3}
n
Consider, a:
n
n
0o horizontal unit length
b:
120o unit length
c:
240o unit length
L(G) represents the class of equilateral triangles
What is the grammar?
Make it up from domain knowledge
There is no unique solution
FS Grammar solution
VT = {a,b,c}
VN = {S, A, B, C, D, E, F, G, H, I, J, K}
S  aA
C  bI
H  bK
S  aA
S  aC
D  bF
I  c
A  aD
A  aB
F  bJ
J  cI
A  aD
E  bG
K  cJ
B  aE
G  bH
2
2
a b c
2
D  bF
F  bJ
J  cI
I  c
CS Grammar solution
VT = {a,b,c}
P 
VN = {S, A, B, C, D, E, F}
S  aAF
A  aBF
D  bc
A  b
B  aEF
C  b
A  aDF
E  bD
F  c
Syntax Analysis
•Let x be the unknown pattern. Recognition task is
finding L(Gi) such that x belongs to L(Gi)
•i.e. Given a string x and a grammar G, construct a
triangle with the top vertex S and the bottom side x
inside which will be the derivation parse tree
•Top-down and Bottom-up parsing methods can be
used
L i , G i 1  i  m
C i 1  i  m
S
x
Stochastic Languages
Probabilities are associated with production rules- stochastic grammar
Stochastic language is one obtained by such a grammar
Probability of obtaining x is
p ( x )  p ( P1 ) p ( P 2 | P1 )  p ( P n | P1 , P 2  , P n 1 )
Tree representations
A string s1 is directly derived
from string s2 in G ( s 1  s 2 ) if
there exists a rule    in G
such that s1 is the result of
replacing  by  . In
general, s is derived from the
initial symbol of G, S, if there
exists a sequence of strings
from which we can derive s
G
from S, i.e., S 
s
Parsing is the reverse of
generation
Descargar

Slide 1