Artificial Intelligence and the Internet
Edward Brent
University of Missouri – Columbia and Idea Works, Inc.
Theodore Carnahan
Idea Works, Inc.
Overview



Objective – Consider how AI can be (and in
many cases is being) used to enhance and
transform social research on the Internet
Framework – intersection of AI and research
issues
View Internet as a source of data whose
size and rate of growth make it important to
automate much of the analysis of data
Overview (continued)



We discuss a leading AI-based approach, the
semantic web, and an alternative paradigmatic
approach, and the strengths and weaknesses of
each
We explore how other AI strategies can be used
including intelligent agents, multi-agent systems,
expert systems, semantic networks, natural
language understanding, genetic algorithms, neural
networks, machine learning, and data mining
We conclude by considering implications for future
research
Key Features of the Internet





Decentralized
Few or no standards for much of the
substantive content
Incredibly diverse information
Massive and growing rapidly
Unstructured data
The Good News About the Internet



A massive flow of data
Digitized
A researcher’s dream
The Bad News



A massive flow of data
Digitized
A researcher’s nightmare
Data Flows






The Internet provides many examples of data flows.
A data flow is an ongoing flux of new information, often from
multiple sources, and typically large in volume.
Data flows are the result of ongoing social processes in which
information is gathered and/or disseminated by humans for
assessment or consumption by others.
Not all data flows are digital, but all flows on the Internet are.
Data flows are increasingly available over the Internet.
Examples of data flows include





News articles
Email
Personnel records
Research proposals
Birth and death records
Published research articles
Medical records
Articles submitted for publication
Arrest records
Data Flows vs Data Sets

Data flows are fundamentally different from the data sets with
which most social scientists have traditionally worked.
A data set is a collection of data, often
collected for a specific purpose and over a
specific period of time, then frozen in place.
A data flow is an ongoing flux of new
information, with no clear end in sight.
Data sets typically must be created in
research projects funded for that purpose in
which relevant data are collected, formatted,
cleaned, stored, and analyzed.
Data flows are the result of ongoing social
processes in which information is gathered
and/or disseminated by humans for
assessment or consumption by others.
Data sets are sometimes analyzed only once
in the context of the initial study, but are
often made available in data archives to
other researchers for further analysis.
Data flows often merit continuing analysis, not
only of delimited data sets from specific time
periods, but as part of ongoing monitoring and
control efforts.
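The contrast between a frozen data set and an ongoing data flow can be sketched in code. This is a minimal illustration only: the function names and the sample articles are hypothetical, and word counting stands in for whatever analysis a real project would run.

```python
from collections import Counter
from typing import Iterable, Iterator


def analyze_data_set(records: list[str]) -> Counter:
    """Analyze a frozen data set in one pass and return word counts."""
    counts = Counter()
    for text in records:
        counts.update(text.lower().split())
    return counts


def monitor_data_flow(stream: Iterable[str]) -> Iterator[Counter]:
    """Consume an open-ended flow, yielding updated counts after each item."""
    counts = Counter()
    for text in stream:
        counts.update(text.lower().split())
        yield counts.copy()  # snapshot of the running analysis


# Hypothetical items standing in for news articles arriving over time.
articles = ["Election results announced", "Election turnout rises", "Markets react"]

snapshot = analyze_data_set(articles)
running = list(monitor_data_flow(iter(articles)))
```

The data set analysis returns a single result, while the data flow version produces a new result each time an item arrives, which is what ongoing monitoring requires.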
The Need for Automating Analysis



Together, the tremendous volume and rate
of growth of the Internet, and the prevalence
of ongoing data flows make automating
analysis both more important and more
cost-effective.
Greater cost savings result from automated
analysis with very large data sets
Ongoing data flows require continuing
analysis and that also makes automation
cost-effective
The Semantic Web




The semantic web is an effort to build into the World Wide
Web tags or markers for data along with representations of the
semantic meaning of those tags (Berners-Lee and Lassila,
2001; Shadbolt, Hall and Berners-Lee, 2006).
The semantic web will make it possible for computer programs
to recognize information of a specific type in any of many
different locations on the web and to “understand” the
semantic meaning of that information well enough to reason
about it.
This will produce interoperability – the ability of different
applications and databases to exchange information and to be
able to use that information effectively across applications.
Such a web can provide an infrastructure to facilitate and
enhance many things including social science research.
Implementing the Semantic Web
Contemporary Research: Possible Implementation of the Semantic Web

Coding scheme: XML Schema – a standardized set of XML tags used to mark up web pages. For example, research proposals might include tags such as <design> <sampling plan> <hypothesis> <findings>

Coded data: Web pages marked up with XML (extensible markup language) – a general-purpose markup language designed to be readable by humans while at the same time providing metadata tags for various kinds of substantive content that can be easily recognized by computers

Knowledge representation: Resource Description Framework – a general model for expressing knowledge as subject-predicate-object statements about resources. A sampling plan in a research proposal might include these statements:
    Systematic sampling - is a - sampling procedure
    Sampling procedure - is part of - a sampling plan

Theory: Ontology – a knowledgebase of objects, classes of objects, attributes describing those objects, and relationships among objects. An ontology is essentially a formal representation of a theory

Analysis: Intelligent agents – software programs capable of navigating to relevant web pages and using information accessible through the semantic web to perform useful functions
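As a rough sketch of how a program might read such markup, the Python below parses a small XML fragment and pulls out the tagged fields. The proposal text is hypothetical, and the tag `sampling_plan` is an assumed spelling, since XML element names cannot contain spaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a research proposal. The tag names follow the
# examples above, except that "sampling_plan" uses an underscore because
# XML element names cannot contain spaces.
PROPOSAL_XML = """
<proposal>
  <design>Longitudinal survey</design>
  <sampling_plan>Systematic sampling of registered voters</sampling_plan>
  <hypothesis>Turnout rises with local news exposure</hypothesis>
</proposal>
""".strip()


def extract_fields(xml_text: str, tags: list[str]) -> dict[str, str]:
    """Return the text of each requested tag found in the document."""
    root = ET.fromstring(xml_text)
    fields = {}
    for tag in tags:
        element = root.find(tag)
        if element is not None and element.text:
            fields[tag] = element.text.strip()
    return fields


fields = extract_fields(PROPOSAL_XML, ["design", "sampling_plan", "findings"])
```

Because the page carries no <findings> element, that field is simply absent from the result; a real agent would need a policy for pages that omit expected tags.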
AI Strategies and the Semantic Web

Several components of the semantic web make
use of artificial intelligence (AI) strategies
Semantic Web Component: Artificial intelligence and related computational strategies

Knowledge representation: Object-Attribute-Value (O-A-V) triplets commonly used in semantic networks

Theory: Semantic network

Analysis: Intelligent agents, expert systems, multi-agent models; distributed computing, parallel processing, grid computing
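A semantic network built from triplets can be sketched with plain tuples. The first two triples come from the sampling example in this presentation; the third is a hypothetical addition so that the traversal has more than one step to follow. The function names are illustrative and do not correspond to any RDF library.

```python
# Triples taken from the sampling example, plus one hypothetical
# addition ("sampling plan" is part of "research proposal") for illustration.
TRIPLES = [
    ("systematic sampling", "is a", "sampling procedure"),
    ("sampling procedure", "is part of", "sampling plan"),
    ("sampling plan", "is part of", "research proposal"),
]


def objects_of(subject: str, predicate: str) -> list[str]:
    """Return every object directly linked to subject by predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def reachable(start: str) -> set[str]:
    """Follow every outgoing link transitively from start."""
    seen: set[str] = set()
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for s, _, o in TRIPLES:
            if s == node and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


related = reachable("systematic sampling")
```

Following links this way is a very simple form of the reasoning an intelligent agent could perform: from one statement about systematic sampling it can infer which larger structures that concept belongs to.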
Strengths of the Semantic Web

Fast and efficient to develop
    Most coding done by web developers one time and used by everyone
Fast and efficient to use
    Intelligent agents can do most of the work with little human intervention
    Structure provided makes it easier for computers to process
    Can take advantage of distributed processing and grid computing
Interoperability
    Many different applications can access and use information from throughout the web
Weaknesses of the Semantic Web (Pragmatic Concerns)

Seeks to impose standardization on a highly
decentralized process of web development





Requires cooperation of many if not all developers
Imposes the double burden of expressing knowledge for
humans and for computers
How will tens of millions of legacy web sites be retrofitted?
What alternative procedures will be needed for
noncompliant web sites?
Major forms of data on the web are provided by
untrained users unlikely to be able to mark up content for the
semantic web

E.g., blogs, input to online surveys, emails
Weaknesses of the Semantic Web (Fundamental Concerns)

Assumes there is a single ontology that can be used for all
web pages and all users (at least in some domain).




For example, a standard way to mark up products and prices in commercial web sites could make
it possible for intelligent agents to search the Internet for the best price for a particular make and
model of car.
This assumption may be inherently flawed for social research
for two reasons.
1) Multiple paradigms - What ontology could code web pages from
multiple competing paradigms or world views (Kuhn, 1969)?

If reality is socially constructed, and “beauty is in the eye of the
beholder” how can a single ontology represent such diverse
views?
2) Competing interests – What if developers of web pages have
political or economic interests at odds with some of the viewers of
those web pages?
Paradigmatic Approach



We describe an alternative approach to the
semantic web, one that we believe may be more
suitable for many social science research
applications.
Recognizes there may be multiple incompatible
views of data
Data structure must be imposed on data
dynamically by the researcher as part of the
research process

(in contrast to the semantic web which seeks to build an
infrastructure of web pages with data structure pre-coded
by web developers)
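The idea of imposing structure dynamically can be sketched as coding the same text under more than one codebook. The two paradigms, their codes, and their keywords below are invented for illustration and are not drawn from any real coding scheme; simple keyword matching stands in for the NLP a real system would need.

```python
import re

# Hypothetical codebooks for two paradigms; the codes and keywords are
# invented for illustration and are not drawn from any real coding scheme.
CODEBOOKS = {
    "conflict": {"power struggle": ["strike", "protest", "layoffs"]},
    "functionalist": {"social disruption": ["strike", "protest"],
                      "adaptation": ["retraining", "negotiation"]},
}


def code_text(text: str, codebook: dict[str, list[str]]) -> set[str]:
    """Return the codes whose keywords appear as whole words in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {code for code, keywords in codebook.items()
            if words & set(keywords)}


report = "Workers began a strike after layoffs; negotiation resumes Monday."
codes_by_paradigm = {name: code_text(report, book)
                     for name, book in CODEBOOKS.items()}
```

The same report receives different codes under each paradigm, and a new paradigm can be added by supplying another codebook without touching the web page itself.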
Paradigmatic Approach (continued)




Relies heavily on natural language processing
(NLP) strategies to code data.
NLP capabilities are not already developed for
many of these research areas and must be
developed.
Those NLP procedures are often developed and
refined using machine learning strategies.
We will compare the paradigmatic approach to
traditional research strategies and the Semantic
Web for important research tasks.
Example Areas Illustrating the Paradigmatic Approach



Event analysis in international relations
Essay grading
Tracking news reports on social issues or for
clients




E.g., Campaigns, Corporations, Press agents
Each of these areas illustrates significant data flows.
These areas and programs within them illustrate
elements of the paradigmatic approach.
Most do not yet employ all the strategies.
Essay Grading




These programs allow students to submit essays by
computer; the program then examines each essay and
computes a score for the student.
Some of the programs also provide feedback to the
student to help them improve.
These programs are becoming more common for
standardized assessment tests and classroom
applications.
Examples of programs






SAGrader™
E-rater®
C-rater®
Intelligent Essay Assessor®
Criterion®
These programs illustrate large ongoing data flows and
generally reflect the paradigmatic approach.
Digitizing Data
Task: Digitizing

Traditional Research: Data from the Internet digitized by web page developers. Other data must be digitized by the researcher or analyzed manually. This can be a huge hurdle.

Semantic Web: Data digitized by web page developers

Paradigmatic Approach: Data digitized by web page developers
The first step in any computer analysis must be converting relevant data to
digital form, where it is expressed as a stream of digits that can be
transmitted and manipulated by computers.
The semantic web and paradigmatic approaches both rely on web page developers to digitize
information. This gives them a distinct advantage over traditional research,
where digitizing data can be a major hurdle.
Essay Grading: Digitizing Data

Digitizing

Papers replaced with digital submissions


SAGrader, for example, has students submit
their papers over the Internet using standard
web browsers.
Digitizing often still a major hurdle limiting
use


Access issues
Security concerns
Data Conversions
Task: Converting
    Traditional Research: No further data conversions required once digitized by web page author
    Semantic Web: No further data conversions required once digitized by web page author
    Paradigmatic Approach: Further conversion sometimes required by researcher (e.g., OCR, speech recognition, handwriting recognition)

Converted Data
    Traditional Research: Digitized data suitable for web delivery for human interpretation
    Semantic Web: Digitized data suitable for web delivery for human interpretation
    Paradigmatic Approach: Digitized data suitable for web delivery and machine interpretation
Essay Grading: Converting Data

Data conversion

Where essays are submitted on paper,
optical character recognition (OCR) or
handwriting recognition programs must
be used to convert to digitized text.

Standardized testing programs often face this
issue
Encoding Data
Task: Encoding Data
    Traditional Research: Encoding done by researcher (often with use of qualitative or quantitative programs)
    Semantic Web: Each web page developer must encode small or moderate amount of data
    Paradigmatic Approach: Researchers must encode massive amounts of data. Encoding automated using NLP strategies (including statistical, linguistic, rule-based expert systems, and combined strategies) and machine learning (unsupervised learning, supervised learning, neural networks, genetic algorithms, data mining)

Coded Data
    Traditional Research: Coded data based on coding rubric
    Semantic Web: XML markup based on standard ontology. An XML schema indicates the basic structure expected for a web page
    Paradigmatic Approach: XML markup based on ontology for that paradigm. An XML schema indicates the basic structure expected for a web page
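Supervised learning for automated encoding can be sketched with a small naive Bayes text classifier written in plain Python. This illustrates the general technique only, not the method of any program named in this presentation; the training sentences and the codes "conflict" and "cooperation" are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples: list[tuple[str, str]]):
    """Count words per code and examples per code from hand-coded texts."""
    word_counts: dict[str, Counter] = defaultdict(Counter)
    code_counts: Counter = Counter()
    for text, code in examples:
        code_counts[code] += 1
        word_counts[code].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, code_counts, vocab


def classify(text: str, model) -> str:
    """Return the most probable code under naive Bayes with add-one smoothing."""
    word_counts, code_counts, vocab = model
    total_examples = sum(code_counts.values())
    best_code, best_score = None, -math.inf
    for code in code_counts:
        score = math.log(code_counts[code] / total_examples)
        denom = sum(word_counts[code].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[code][word] + 1) / denom)
        if score > best_score:
            best_code, best_score = code, score
    return best_code


# Hypothetical hand-coded training sentences.
training = [
    ("troops crossed the border", "conflict"),
    ("army shelled the border town", "conflict"),
    ("leaders signed a trade agreement", "cooperation"),
    ("ministers signed the peace treaty", "cooperation"),
]
model = train(training)
label = classify("troops shelled the town", model)
```

Once trained on a researcher's hand-coded examples, the same classifier can be applied to each new item in a data flow, which is what makes encoding massive amounts of data manageable.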
Essay Grading: Coding


Essay grading programs employ a wide array of strategies for
recognizing important features in essays.
Intelligent Essay Assessor (IEA) employs a purely statistical
approach, latent semantic analysis (LSA).
    This approach treats essays like a “bag of words” using a matrix of
    word frequencies by essays and factor analysis to find an underlying
    semantic space. It then locates each essay in that space and
    assesses how closely it matches essays with known scores.
E-rater uses a combination of statistical and linguistic approaches.
    It uses syntactic, discourse structure, and content features to predict
    scores for essays after the program has been trained to match human
    coders.
SAGrader uses a strategy that blends linguistic, statistical, and AI
approaches.
    It uses fuzzy logic to detect key concepts in student papers and a
    semantic network to represent the semantic information that should
    be present in good essays.
All of these programs require learning before they can be used to
grade essays in a specific domain.
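The "bag of words" idea can be sketched as follows. This is a deliberately simplified stand-in and not LSA itself: it skips the step that derives an underlying semantic space and compares raw word-frequency vectors directly, assigning a new essay the score of its most similar scored essay. The essays and scores are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Represent a text as word frequencies, ignoring word order."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def predict_score(essay: str, scored: list[tuple[str, float]]) -> float:
    """Return the score of the most similar previously scored essay."""
    target = bag_of_words(essay)
    best = max(scored, key=lambda pair: cosine(target, bag_of_words(pair[0])))
    return best[1]


# Hypothetical training essays with human-assigned scores.
scored_essays = [
    ("norms are shared rules that guide behavior in a group", 5.0),
    ("i like sociology because it is fun", 1.0),
]
predicted = predict_score("norms are rules that guide group behavior", scored_essays)
```

The scored essays play the role of the training material that such programs need before they can grade in a specific domain.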
Knowledge
Task: Theory
    Traditional Research and Semantic Web: A single shared worldview or objective reality
    Paradigmatic Approach: Multiple paradigms

Knowledge
    Traditional Research: Coding scheme implemented with a codebook (often imperfect)
    Semantic Web: Ontology (knowledgebase developed by web page developers and shared as standard) (implemented with RDF and ontological languages)
    Paradigmatic Approach: Multiple ontologies, one for each paradigm (developed by researchers and shared within paradigm) (implemented with RDF and ontological languages)
Essay Grading: Knowledge


Most essay grading programs have very little in the way of a
representation of theory or knowledge.
This is probably because they are often designed specifically
for grading essays and are not meant to be used for other
purposes requiring theory, such as social science research.


For example, C-rater, a program that emphasizes semantic
content in essays, has no representation of semantic content
other than as desirable features for the essay.
The exception is SAGrader.

SAGrader employs technologies developed in a qualitative
analysis program, Qualrus. Hence, SAGrader uses a semantic
network to explicitly represent and reason about the knowledge
or theory.
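The general idea of checking an essay against an explicit representation of expected knowledge can be sketched as below. This is not SAGrader's actual method: the expected triples are hypothetical, and approximate string matching from Python's `difflib` stands in for fuzzy-logic concept detection.

```python
from difflib import SequenceMatcher

# Hypothetical expected knowledge for one essay question, as triples.
EXPECTED = [
    ("norm", "is a", "rule"),
    ("sanction", "enforces", "norm"),
]


def mentions(concept: str, essay: str, threshold: float = 0.8) -> bool:
    """Approximate match: does any essay word closely resemble the concept?"""
    words = essay.lower().replace(".", " ").replace(",", " ").split()
    return any(SequenceMatcher(None, concept, w).ratio() >= threshold
               for w in words)


def covered_relations(essay: str) -> list[tuple[str, str, str]]:
    """Return expected triples whose subject and object both appear."""
    return [(s, p, o) for s, p, o in EXPECTED
            if mentions(s, essay) and mentions(o, essay)]


essay = "Norms are rules that guide behavior."
found = covered_relations(essay)
```

Triples that are expected but not found could drive feedback to the student about what the essay is missing.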
Analysis
Task: Analysis

Traditional Research: Analysis (by hand, perhaps with help of qualitative or quantitative programs)

Semantic Web: Intelligent agents

Paradigmatic Approach: Intelligent agents
The semantic web and paradigmatic approaches can take similar approaches
to analysis.
Essay Grading: Analysis




All programs produce scores, though the precision
and complexity of the scores vary.
Some produce explanations.
Most of these essay grading programs simply
perform a one-time analysis (grading) of papers.
However some of them, such as SAGrader, provide
for ongoing monitoring of student performance as
students revise and resubmit their papers.
Since essays presented to the programs are
already converted into standard formats and are
submitted to a central site for processing, there is
no need for the search and retrieval capabilities of
intelligent agents
Advantages of Paradigmatic Approach




Suitable for multiple-paradigm fields
Suitable for contested issues
Does not require as much
infrastructure development on the web
Can be used for new views requiring
different codes with little lag time
Disadvantages of Paradigmatic Approach






Relies heavily on NLP technologies that are still evolving
May not be feasible in some or all circumstances
Requires extensive machine learning
Often requires additional data conversion for automated
analysis
Requires individual web pages to be coded once for each
paradigm rather than a single time, hence increasing costs.
(However, by automating this, costs are made manageable)
Current NLP capabilities are limited to problems of restricted
scope. Today's NLP programs are better characterized as
special-purpose than as general-purpose.
Discussion and Conclusions





Both semantic web and paradigmatic approaches
have advantages and disadvantages
Codes on semantic web could facilitate coding by
paradigmatic-approach programs
Where there is much consensus the single coding
for the semantic web could be sufficient
While the infrastructure for the semantic web is still
in development the paradigmatic approach could
facilitate analysis of legacy data
The paradigmatic approach could be used to build
out the infrastructure for the semantic web