Knowledge Discovery in
Databases
&
Information Retrieval
University of Texas at Austin
School of Information
Knowledge Management Systems
Presented April 29, 2003
By Anne Marie Donovan

Knowledge Discovery in Databases


“The nontrivial process of identifying valid,
novel, potentially useful, and ultimately
understandable patterns in data” (Fayyad,
Piatetsky-Shapiro, and Smyth, 1996, p. 30)
Also known as knowledge extraction,
information harvesting, data archeology,
and information extraction (p. 28)

Information Retrieval
“The methods and processes for searching
relevant information out of information
systems that contain extremely large
numbers of documents” (Rocha, 2001, 1.1)
 “The ultimate goal of IR is to produce or
recommend relevant information to users”
(1.2)
 “Traditional IR does not identify users and
classifies subjects only with unchanging
keywords and categories” (1.2)


Institutions that use KDD/IR systems
Require knowledge-based decisions
 Have a large quantity of accessible, relevant,
historical and current data
 Have a high payoff for correct decisions
 Financial: banking & investment
 Medical: healthcare & insurance
 Sales: marketing & customer relations
(Piatetsky-Shapiro, 1998, Slides 28-31)


Database Management Systems

File Systems

Relational Database Management Systems
(RDBMS)

Object-Oriented Database Management
Systems (OODBMS)

Object-Relational Database Management
Systems (ORDBMS)
(Devarakonda, 2001, ORDBMS)

Relational Database Management
Systems (RDBMS)
Relational databases are composed of many
relations in the form of two-dimensional
tables of rows and columns
 RDBMS advantages include the SQL
standard (enables migration between
database systems), rapid data access and
large storage capacity
 RDBMS disadvantages include an inability
to handle complex data types and
relationships
(Devarakonda, 2001, RDBMS)
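
A minimal sketch of the relational model described above, using Python's built-in sqlite3 module. The customer and purchase tables, their columns, and the sample rows are hypothetical and serve only to show two relations being joined with standard SQL.

```python
import sqlite3

# Hypothetical schema and rows: two relations (tables) linked by a key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE purchase (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount REAL NOT NULL
    );
    INSERT INTO customer (id, name) VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO purchase (customer_id, amount)
        VALUES (1, 20.0), (1, 5.5), (2, 12.0);
""")


def total_spent_by_customer(connection):
    """Join the two relations and sum purchase amounts per customer."""
    return connection.execute("""
        SELECT c.name, SUM(p.amount)
        FROM customer AS c
        JOIN purchase AS p ON p.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.name
    """).fetchall()


totals = total_spent_by_customer(conn)
```

Because the query is plain SQL, the same statement should run with little change on other relational systems, which is the migration advantage the slide mentions.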


Object-Oriented Database Management
Systems (OODBMS)
OODBMS use abstract data types (ADTs) in
which the internal data structure is hidden
 OODBMS data is managed through two sets
of relations, one describing the interrelations
of data items and another describing the
abstract relationships
 OODBMS handle complex data
relationships, but suffer from poor
performance and problems of scalability
(Devarakonda, 2001, OODBMS)


Object-Relational Database Management
Systems (ORDBMS)
ORDBMS store all database information in
tables, but some entries have richer data
structures, which are also called abstract data
types (ADTs)
 ORDBMS exhibit features of both the
relational and object models such as
scalability and support for rich data types
 Their main advantage is massive scalability
(Devarakonda, 2001, ORDBMS)


The KDD Process
Collecting and pre-processing data
 The problem of continually increasing
volumes of data
 The problem of increasingly complex
forms of data
 Identifying and extracting useful knowledge
from large data repositories
 What knowledge is in the data set?
 What can be observed about the data set?
 Presenting the knowledge in usable forms
(Fayyad et al., 1996)


The KDD Process (continued)
Data management problems in data
collection, storage, and retrieval
 Translation, change detection, integration,
duplication, summarization, aggregation,
timeliness/datedness (Widom, 1995)
 The impracticality of manual analysis
 Billions of records and hundreds of fields
 Increasing desire for on-the-fly analysis
and more flexible presentation (Fayyad et
al., p. 28)


The KDD Process (continued)
A need to automate the knowledge discovery
and extraction processes
 Data selection and pre-processing
 Data transformation and mining
 Interpretation and evaluation (p. 28)
 Automation requires attention to:
 Data collection, storage, and retrieval
 Statistical foundations of search and
retrieval processes (p. 29)


Stages in the KDD process
Learning the application domain
 Creating a target data set
 Data cleaning and preprocessing
 Data reduction and projection
 Choosing the function of data mining
 Choosing the data mining algorithm
 Data mining
 Interpretation
 Using discovered knowledge (pp. 30-31)
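
The stages above can be sketched as a chain of small functions. This is an illustration only, not the procedure from Fayyad et al.: the record fields, the region filter, and the "highest total" pattern are all hypothetical choices made to keep the example short.

```python
from collections import Counter

# Hypothetical raw records; field names and values are illustrative.
RAW = [
    {"region": "north", "product": "tea", "amount": "12.5"},
    {"region": "north", "product": "tea", "amount": ""},  # missing value
    {"region": "south", "product": "coffee", "amount": "8.0"},
    {"region": "north", "product": "coffee", "amount": "7.0"},
    {"region": "east", "product": "tea", "amount": "3.0"},
]


def create_target_set(records, regions):
    """Creating a target data set: keep only the regions of interest."""
    return [r for r in records if r["region"] in regions]


def clean(records):
    """Data cleaning: drop rows with a missing amount, parse the rest."""
    return [
        {**r, "amount": float(r["amount"])}
        for r in records
        if r["amount"].strip()
    ]


def project(records, fields):
    """Data reduction and projection: keep only the fields to be mined."""
    return [{f: r[f] for f in fields} for r in records]


def mine(records):
    """Data mining: total amount per product (a simple summary pattern)."""
    totals = Counter()
    for r in records:
        totals[r["product"]] += r["amount"]
    return totals


def interpret(totals):
    """Interpretation: turn the pattern into a readable statement."""
    product, amount = max(totals.items(), key=lambda kv: kv[1])
    return f"{product} has the highest total ({amount:.1f})"


target = create_target_set(RAW, {"north", "south"})
cleaned = clean(target)
reduced = project(cleaned, ["product", "amount"])
pattern = mine(reduced)
summary = interpret(pattern)
```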


Data mining
The application of specific algorithms to a
data set for the purpose of extracting data
patterns (p. 28)
 “Fitting models to or determining
patterns from observed data” (p. 31)


Data warehousing

Collecting and “cleaning” transactional
data to make it available for online
analysis and decision support (p. 30)

Data mining tasks
Classification: predicting an item class
 Forecasting: predicting a parameter value
 Clustering: finding groups of items
 Description: describing a group
 Deviation detection: finding changes
 Link analysis: finding relationships and
associations
 Visualization: presenting data visually to
facilitate human discovery (Piatetsky-Shapiro,
1998, Slide 17)
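
As one example of the tasks listed above, clustering can be sketched with a small one-dimensional k-means routine. The slides do not name an algorithm; k-means, the sample values, and the starting centers are assumptions chosen for illustration.

```python
def assign(values, centers):
    """Put each value in the group of its nearest center."""
    groups = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        groups[nearest].append(v)
    return groups


def kmeans_1d(values, initial_centers, max_iterations=20):
    """Alternate assignment and center updates until the centers stop moving."""
    centers = list(initial_centers)
    for _ in range(max_iterations):
        groups = assign(values, centers)
        # An empty group keeps its previous center.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers, assign(values, centers)


# Hypothetical measurements with two visible groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, groups = kmeans_1d(values, initial_centers=[0.0, 5.0])
```

The result depends on the starting centers, which is one reason the choice of algorithm and its settings is treated as a separate stage in the KDD process.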


Components of data mining systems
Model functions: classification, regression,
clustering, etc. (pp. 31-32)
 Model representation: decision trees and
rules, linear models, non-linear models,
example-based methods, etc. (p. 32)
 Preference criterion: quantitative criterion
embedded in the search algorithm; implicit
criterion embedded in the KDD process
 Search algorithms: parameter search (given
a model) or model search over model space
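
A rough sketch of how these components fit together, under simplifying assumptions: two candidate model representations (a constant and a straight line), a closed-form parameter search for each, and a preference criterion equal to squared error plus a penalty per parameter. The penalty value and the sample data are hypothetical.

```python
def fit_constant(xs, ys):
    """Parameter search for a constant model: the mean minimizes squared error."""
    mean_y = sum(ys) / len(ys)
    return (mean_y,), lambda x: mean_y


def fit_linear(xs, ys):
    """Parameter search for a linear model: ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return (slope, intercept), lambda x: slope * x + intercept


def preference(predict, n_params, xs, ys, penalty=1.0):
    """Preference criterion: squared error plus a cost for each parameter."""
    sse = sum((predict(x) - y) ** 2 for x, y in zip(xs, ys))
    return sse + penalty * n_params


def model_search(xs, ys, candidates):
    """Model search: fit every candidate and keep the lowest score."""
    best_name, best_score = None, None
    for name, fit in candidates.items():
        params, predict = fit(xs, ys)
        score = preference(predict, len(params), xs, ys)
        if best_score is None or score < best_score:
            best_name, best_score = name, score
    return best_name, best_score


CANDIDATES = {"constant": fit_constant, "linear": fit_linear}
trend_choice = model_search([0, 1, 2, 3], [1, 3, 5, 7], CANDIDATES)
flat_choice = model_search([0, 1, 2, 3], [4, 4, 4, 4], CANDIDATES)
```

With the trending data the linear model wins despite its extra parameter; with the flat data the simpler constant model is preferred.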


There is NO universal search algorithm
Each type of search suits specific types of
search problems
 The searcher must be careful to properly
formulate the question
 The searcher must understand the search
goal (p. 31)


Every search can be improved by an
increase in data or query context

Creating context for KDD and IR
Extending IR throughout the social network
of an organization, e.g., Answer Garden
(Ackerman, 1994 & Ackerman and
McDonald, 1996)
 Providing social context for data exchange,
e.g., PeopleGarden (Xiong and Donath, 1999)
 Relational database reverse engineering,
“extracts a conceptual model from an
existing relational database by analyzing
data instances as well as metadata” (Lee and
Hwang, 2002, Conclusion)


KD & IR problems for Web resources
Collecting and pre-processing data
 Even more continually changing data
 Complex data; streaming & multi-media
 The problem of identifying and extracting
useful knowledge from Web resources
 No consistent data models; no context
 A lack of descriptive information
 Presenting the knowledge in usable forms
 More and more wireless devices and time-sensitive, multi-media applications


Current methods for Web KD & IR
Collecting and pre-processing data
 Web crawlers and link-based ranking
 Human indexing and categorization
 Identifying and extracting useful knowledge
from Web resources
 Keyword search on natural language text
 Topical directories or topical Web sites
 Presenting the knowledge in usable forms
 Content presented in native format
(plugins) or in HTML
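
Keyword search on natural language text is commonly built on an inverted index. The sketch below is a minimal version under stated assumptions: the three documents are hypothetical, tokenization is a plain lowercase split on letters and digits, and a query matches only documents that contain every query term.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain all query terms."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical page contents.
DOCS = {
    "a": "Data mining finds patterns in data",
    "b": "Information retrieval finds relevant documents",
    "c": "Mining web data for patterns",
}
index = build_index(DOCS)
hits = search(index, "data patterns")
```

Matching is purely lexical here, so a query for "semantic" finds nothing even if a page discusses meaning; that limitation is what the semantic markup approaches aim to address.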


Automating KD & IR for the Web
Semantic markup to enable machine
understanding/processing (RDF/S &
DAML/OIL) & inference analysis
 Intelligent search engines and agents to
exploit semantic statements
 Ontologies to provide context (a data
model) for agents (Shah et al., 2002)
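
A toy illustration of inference over semantic statements, written in plain Python rather than with an RDF library. Statements are (subject, predicate, object) triples; the rule applied is the RDFS subclass rule (an instance of a class is also an instance of its superclasses). The resource and class names are hypothetical.

```python
def infer_types(triples):
    """Apply the subclass rule repeatedly until no new statements appear."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for subject, predicate, cls in list(facts):
            if predicate != "rdf:type":
                continue
            for sub, pred, sup in list(facts):
                if pred == "rdfs:subClassOf" and sub == cls:
                    inferred = (subject, "rdf:type", sup)
                    if inferred not in facts:
                        facts.add(inferred)
                        changed = True
    return facts


def instances_of(facts, cls):
    """List every subject stated or inferred to be of the given class."""
    return sorted(s for s, p, o in facts if p == "rdf:type" and o == cls)


# Hypothetical ontology (the data model) and one annotated resource.
TRIPLES = [
    ("ex:Paper42", "rdf:type", "ex:ConferencePaper"),
    ("ex:ConferencePaper", "rdfs:subClassOf", "ex:Publication"),
    ("ex:Publication", "rdfs:subClassOf", "ex:Document"),
]
facts = infer_types(TRIPLES)
documents = instances_of(facts, "ex:Document")
```

An agent that only matched keywords would not find ex:Paper42 when asked for documents; the ontology supplies the context that makes the answer derivable.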


Automating KD & IR for the Web
(continued)
Automated data collection, automated
context collection (data pre-processing)
 Value-added services (query routing)
 Integrated query systems/knowledge
delivery systems (accessibility)
 Social accounting metrics to provide
context for humans (Smith, 2002, p. 52)


Enhanced presentation for the Web

Reformatting for presentation
 Differentiated service
 Variable visualization
• Adaptive graphics, “a unifying
framework that allows visual
representations of information to be
customized and mixed together into
new ones” (Boier-Martin, 2003, pp. 6-9)
• Previewing & interactive content
• Selective presentation & customized
views

KDD and IR for pervasive computing

Achieving “ubiquitous data access”
(Cherniack, Franklin, & Zdonik, 2001, slide 7)
 Data management problems
• Dissemination (context dependent
pull/push)
• Synchronization (multiple
collectors/devices)
• Recharging (renewing) multiple data
streams
• Profile-driven data management

KDD and IR for pervasive computing
(continued)

Achieving “ubiquitous data access”
(Cherniack, Franklin, & Zdonik, 2001, slide 7)
 Location aware, mobile devices
 Service discovery for mobile services
 Distributed sensors/collectors (slides 8-27)

Next generation KDD & IR will…
Focus on solving business problems, not data
analysis problems
 Embed knowledge discovery engines
 Integrate access to enterprise and external
data on the back-end
 Integrate knowledge discovery process with
knowledge delivery tools (Piatetsky-Shapiro,
1998, Slide 7)


Next generation KDD & IR will…
Manage information retrieval contextually
 Allow contextual query/continuous query
 Synchronize multiple data flows from
disparate sensors/input devices
 Enable KD in virtual networks of peer-to-peer databases (data “clusters” or “cubes”)
 Interpolate or extrapolate for missing data
(Cherniack et al., 2001, slides 115-138)
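
Interpolating for missing data can be sketched as linear interpolation over a stream of sensor readings. This is one simple technique, not the method from the cited slides; the readings are hypothetical, gaps are marked with None, and gaps at either end are left unfilled because extrapolation would need further assumptions.

```python
def interpolate_missing(readings):
    """Fill None gaps that lie between two known values, linearly."""
    filled = list(readings)
    known = [i for i, value in enumerate(filled) if value is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / span
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled


# Hypothetical stream: one reading per time step, None where a sensor dropped out.
stream = [None, 10.0, None, None, None, 18.0, None]
repaired = interpolate_missing(stream)
```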


Next generation KDD & IR will…
Recognize individual users
 Characterize information resources
 Provide a way to exchange knowledge
between users and information resources
(push and pull of information)
 Adapt to the user community and enable the
reuse and recombination of information as
well as its exchange
(Rocha, 2001, 1.2)


KDD research problems
Massive data sets & high dimensionality
 User interaction & prior knowledge
 Determining statistical significance
 Missing data
 Understandability of patterns
 Management of changing data & knowledge
 Data integration
 Non-standard, multimedia, & object-oriented data (Fayyad, Piatetsky-Shapiro, &
Smyth, 1996, pp. 33-34)


“Top Ten” IR research issues
Integrated solutions
 Distributed IR
 Efficient, flexible indexing and retrieval
 “Magic” (automatic query expansion)
 Interfaces and browsing
 Routing and filtering
 Effective retrieval
 Multimedia retrieval
 Information extraction
 Relevance feedback (Croft, 1995)


Total Information Awareness - DARPA
on the bleeding edge…
New database technologies
 Database architectures
 Database population
 New search algorithms and data models
 Genisys
 Goal is to produce technology enabling
ultra-large, all-source information
repositories


http://www.darpa.mil/iao/Genisys.htm

Social Issues
Communicating context
 Creating trust/social value
 Inciting cooperation/collaboration
 Privacy tradeoffs: convenience/service or
security/privacy?

References
Ackerman, M. S. (1998, July). Augmenting the organizational memory: A field
study of Answer Garden. ACM Transactions on Information Systems, 16(3),
203-204. Retrieved March 28, 2003 from
http://doi.acm.org/10.1145/290159.290160
Ackerman, M. S., & Malone, T. W. (1990, April). Answer Garden: A tool for
growing organizational memory. ACM SIGOIS Bulletin, 11(2-3), 31-39.
Retrieved March 28, 2003 from http://doi.acm.org/10.1145/91474.91485
Ackerman, M. S., & McDonald, D. W. (1996). Answer Garden 2: Merging
organizational memory with collaborative help. Proceedings of the ACM
Conference on Computer-Supported Cooperative Work 1996 (CSCW96
Boston, MA). Retrieved March 28, 2003 from
http://doi.acm.org/10.1145/240080.240203
Boier-Martin, I. M. (2003, January/February). Adaptive graphics. In T. Rhyne
(Ed.) Visualization Viewpoints, IEEE Computer Graphics and Applications,
23(1), 6-10. Retrieved April 5, 2003 from
http://www.research.ibm.com/people/i/imartin/papers/visviewpoints.pdf
Chakrabarti, S., Srivastava, S., Subramanyam, M., & Tiwari, M. (2000). Using
Memex to archive and mine community Web browsing experience. A paper
presented at the 9th International World Wide Web Conference, Amsterdam,
May 15-19, 2000. Retrieved April 12, 2003 from
http://www9.org/w9cdrom/98/98.html
Croft, W. B. (1995, November). What do people want from information retrieval?:
The top 10 research issues for companies that use and sell IR systems. D-Lib
Magazine. Retrieved April 5, 2003 from
http://sunsite.anu.edu.au/mirrors/dlib/dlib/november95/11croft.html
DARPA Information Awareness Office. (2003a). Genisys. Retrieved from the
DARPA Information Awareness Office Web site at:
http://www.darpa.mil/iao/Genisys.htm
DARPA Information Awareness Office. (2003b). Total Information Awareness
System. Retrieved from the DARPA Information Awareness Office Web site at:
http://www.darpa.mil/iao/TIASystems.htm
Devarakonda, R. (2001, March). Object-Relational database systems - The road
ahead. ACM Crossroads Student Magazine. Retrieved April 12, 2003 from
www.acm.org/crossroads/xrds7-3/ordbms.html
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996, November). The KDD
process for extracting useful knowledge from volumes of data.
Communications of the ACM, 39(11), 27-34. Retrieved March 03, 2003 from
http://wwwhome.cs.utwente.nl/~mpoel/colleges/dwdm/ACM_artikelen/fayyad2.pdf
Lee, D., & Hwang, Y. (2002, March 1). Extracting semantic metadata and its
visualization. ACM Crossroads Student Magazine. Retrieved March 27, 2003
from www.acm.org/crossroads/xrds7-3/smeva.html
Piatetsky-Shapiro, G. (1998, December 4). Data mining and knowledge discovery
tools: The next generation. Retrieved February 27, 2003 from kdnuggets.com
at http://www.kdnuggets.com/gpspubs/dama-nextgen-98/index.htm
Rauber, A., Aschenbrenner, A., Witvoet, O., Bruckner, R. M., & Kaiser, M. (2002,
December). Uncovering information hidden in Web archives: A glimpse at
Web analysis building on data warehouses. D-Lib Magazine, 8(12). Retrieved
March 28, 2003 from
http://www.dlib.org/dlib/december02/rauber/12rauber.html
Rocha, L. M. (2001). TalkMine: A soft computing approach to adaptive
knowledge recommendation [Electronic version]. In V. Loia & S. Sessa (Eds.),
Studies in fuzziness and soft computing: Vol. 75. Soft computing agents: New
trends for designing autonomous systems. (pp. 89-116). New York: Springer.
Retrieved March 28, 2003 from http://www.c3.lanl.gov/~rocha/softagents.html
Shah, U., Finin, T., Joshi, A., Cost, R. S., & Mayfield, J. (2002, November).
Information retrieval on the Semantic Web. Paper presented at The ACM
Conference on Information and Knowledge Management, November 2002.
Retrieved March 28, 2003 from
http://www.csee.umbc.edu/~finin/papers/cikm02/cikm02.pdf
Smith, M. (2002). Tools for navigating large social cyberspaces. Communications
of the ACM, 45(4), 51-55. Retrieved March 28, 2003 from
http://delivery.acm.org/10.1145/510000/505272/p51smith.html?key1=505272&key2=5541680501&coll=GUIDE&dl=GUIDE&CFID=9914049&CFTOKEN=12943474
Whitted, T. (1999, July/August). Draw on the Wall. IEEE Computer Graphics and
Applications, 19(4), 6-9. Retrieved April 8, 2003 from ieeexplore.ieee.org at:
http://ieeexplore.ieee.org/iel5/38/16795/00773957.pdf?isNumber=16795&arnumber=773957&prod=JNL&arSt=6&ared=9&arAuthor=Whitted%2C+T.
Widom, J. (1995, November). Research problems in data warehousing.
Proceedings of the 4th International Conference on Information and
Knowledge Management (CIKM). Retrieved March 28, 2003 from
http://www.ischool.utexas.edu/~i385tkms/readings/Widom-1995ResearchProblems.pdf
Xiong, R., & Donath, J. (1999). PeopleGarden: Creating data portraits for users.
CHI Letters, 1(1), 37-44. Retrieved April 8, 2003 from
http://smg.media.mit.edu/papers/Xiong/pgarden_uist99.pdf