Regular Expressions and Automata Lecture #2-2 September 10 2009 1 Finite State Automata • Regular Expressions (REs) can be viewed as a way to describe machines called Finite State Automata (FSA, also known as automata, finite automata). • FSAs and their close variants are a theoretical foundation of much of the field of NLP. 2 Finite State Automata • FSAs recognize the regular languages represented by regular expressions – SheepTalk: /baa+!/ a b q0 a q1 a q2 ! q3 q4 • Directed graph with labeled nodes and arc transitions •Five states: q0 the start state, q4 the final state, 5 transitions 3 Formally • FSA is a 5-tuple consisting of – – – – – Q: set of states {q0,q1,q2,q3,q4} : a finite alphabet of symbols {a,b,!} q0: a start state F: a set of accept/final states in Q {q4} (q,i): a transition function mapping Q x to Q a b q0 a q1 a q2 ! q3 q4 4 State Transition Table for SheepTalk Input State b a ! 0 1 Ø Ø 1 Ø 2 Ø 2 Ø 3 Ø 3 Ø 3 4 4 Ø Ø Ø 5 Recognition • Recognition (or acceptance) is the process of determining whether or not a given input should be accepted by a given machine. • Or… it’s the process of determining if as string is in the language we’re defining with the machine • In terms of REs, it’s the process of determining whether or not a given input matches a particular regular expression. • Traditionally, recognition is viewed as processing an input written on a tape consisting of cells containing elements from the alphabet. 6 • FSA recognizes (accepts) strings of a regular language – – – – baa! baaa! baaa! … • Tape metaphor: a rejected input q0 a b a ! b 7 Recognition • • • • • Simply a process of starting in the start state Examining the current input Consulting the table Going to a new state and updating the tape pointer. Until you run out of tape. 8 D-Recognize 9 q0 q1 q2 b q3 a q3 a q3 a a q4 ! Input State b a ! 0 1 Ø Ø 1 Ø 2 Ø 2 Ø 3 Ø 3 Ø 3 4 4 Ø Ø Ø 10 Key Points • Deterministic means that at each point in processing there is always one unique thing to do (no choices). • D-recognize is a simple table-driven interpreter • The algorithm is universal for all unambiguous languages. – To change the machine, you change the table. 11 Key Points • Crudely therefore… matching strings with regular expressions (ala Perl) is a matter of – translating the expression into a machine (table) and – passing the table to an interpreter 12 Recognition as Search • You can view this algorithm as a degenerate kind of state-space search. • States are pairings of tape positions and state numbers. • Operators are compiled into the table • Goal state is a pairing with the end of tape position and a final accept state • Its degenerate because? 13 Formal Languages • Formal Languages are sets of strings composed of symbols from a finite set of symbols. • Finite-state automate define formal languages (without having to enumerate all the strings in the language) • Given a machine m (such as a particular FSA) L(m) means the formal language characterized by m. – L(Sheeptalk FSA) = {baa!, baaa!, baaaa!, …} (an infinite set) 14 Generative Formalisms • The term Generative is based on the view that you can run the machine as a generator to get strings from the language. • FSAs can be viewed from two perspectives: – Acceptors that can tell you if a string is in the language – Generators to produce all and only the strings in the language 15 Three Views • Three equivalent formal ways to look at what we’re up to (not including tables – and we’ll find more…) Regular Expressions Finite State Automata Regular Languages 16 Determinism • Let’s take another look at what is going on with drecognize. • In particular, let’s look at what it means to be deterministic here and see if we can relax that notion. • How would our recognition algorithm change? • What would it mean for the accepted language? 17 Determinism and Non-Determinism • Deterministic: There is at most one transition that can be taken given a current state and input symbol. • Non-deterministic: There is a choice of several transitions that can be taken given a current state and input symbol. (The machine doesn’t specify how to make the choice.) 18 Non-Deterministic FSAs for SheepTalk b q0 a q1 b q0 a q2 a q1 a ! q3 a q2 q4 ! q3 q4 19 FSAs as Grammars for Natural Language dr the q0 rev q1 q2 hon mr pat q3 l. q4 robinson q5 q6 ms mrs Can you use a regexpr to capture this too? 20 Equivalence • Non-deterministic machines can be converted to deterministic ones with a fairly simple construction (essentially building “set states” that are reached by following all possible states in parallel) • That means that they have the same power; nondeterministic machines are not more powerful than deterministic ones • It also means that one way to do recognition with a nondeterministic machine is to turn it into a deterministic one. • Problems: translating gives us a not very intuitive machine, and this machine has LOTS of states 21 Non-Deterministic Recognition • In a ND FSA there exists at least one path directed through the machine by a string that is in the language defined by the machine that leads to an accept condition.. • But not all paths directed through the machine by an accept string lead to an accept state. It is OK for some paths to lead to a reject condition. • In a ND FSA no path directed through the machine by a string outside the language leads to an accept condition. 22 Non-Deterministic Recognition • So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept. • However, being driven to a reject condition by an input does not imply it should be rejected. • Failure occurs only when none of the possible paths lead to an accept state. • This means that the problem of non-deterministic recognition can be thought of as a standard search problem. 23 The Problem of Choice • Choice in non-deterministic models comes up again and again in NLP. Several Standard Solutions • Backup (search, this chapter) – Save input/state of machine at choice points – If wrong choice, use this saved state to back up and try another choice • Lookahead – Look ahead in the input to help make a choice • Parallelism – Look at all choices in parallel 24 Backup • After a wrong choice leads to a dead-end (either no input left in a non-accept state, or no legal transitions), return to a previous choice point to pursue another unexplored choice. • Thus, at each choice point, the search process needs to remember the (unexplored) choices. • Standard State Space Search. • State = (FSA node or machine state, tape-position) 25 Example b q0 a q1 a a q2 q2 ! q3 \ q4 26 ND-Recognize Code 27 Example Agenda: 28 Example 29 Example Agenda: 30 Example 31 Example Agenda: 32 Example 33 Example Agenda: 34 Example Agenda: 35 Example Agenda: 36 Example 37 Example Agenda: 38 Example Agenda: 39 Example Agenda: 40 Example Agenda: 41 Key Points • States in the search space are pairings of tape positions and states in the machine. • By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input. 42 Infinite Search • If you’re not careful such searches can go into an infinite loop. • How? 43 Why Bother? • Non-determinism doesn’t get us more formal power and it causes headaches so why bother? – More natural solutions – Machines based on construction are too big 44 Compositional Machines • Formal languages are just sets of strings • Therefore, we can talk about various set operations (intersection, union, concatenation) • This turns out to be a useful exercise 45 Union • Accept a string in either of two languages 46 Concatenation • Accept a string consisting of a string from language L1 followed by a string from language L2. 47 Negation • Construct a machine M2 to accept all strings not accepted by machine M1 and reject all the strings accepted by M1 – Invert all the accept and not accept states in M1 • Does that work for non-deterministic machines? 48 Intersection • Accept a string that is in both of two specified languages • An indirect construction… – A^B = ~(~A or ~B) 49 Why Bother? • ‘FSAs can be useful tools for recognizing – and generating – subsets of natural language – But they cannot represent all NL phenomena (Center Embedding: The mouse the cat ... chased died.) 50 Summing Up • Regular expressions and FSAs can represent subsets of natural language as well as regular languages – Both representations may be impossible for humans to understand for any real subset of a language – But they are very easy to use for smaller subsets • Next time: Read Ch 3 • For fun: – Think of ways you might characterize features of your email using only regular expressions 51

Descargar
# cisc882 Regular Expressions