Corpus Statistics Example:
Collecting Data from Potato Chip Bags
Defining Corpus
• Corpus to compare language between types of snack chips *
– To choose a representative sample, visited a large local Wegman’s
– Randomly chose a large selection of brands, then chose the single most
“normal” flavor of each brand
• Only two brands from Frito-Lay
* Thanks to a student of Dan Jurafsky’s for this idea.
Defining Comparison Types
• Define the two corpora to compare as gourmet/upscale or
regular snack chips
– Method: gourmet/upscale chips are those sold in the Nature’s Marketplace
section; regular chips are those sold in the regular snack aisle
Collecting Text
• Wrote down all text on the front and on the back of each
package
– Omitted weights from front
– Omitted nutrition label from back
• Typed words by hand into a document
• Limitations:
– Extremely small amount of text to analyze, so statistical results could
be dubious
– Imbalance between the two comparison types
• Gourmet packages have much more text than regular
packages
Processing Text
• Chose to lower-case text
• Removed stop words and punctuation from word frequency
lists by hand
• Removed stop words (Smart.English.stop) and punctuation
in bigram analysis
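The processing steps above can be sketched in a few lines of Python. This is only an illustration, not the procedure used for the slides (where the word lists were cleaned by hand): the tiny `STOP_WORDS` set is a hypothetical stand-in for the Smart.English.stop list, and the sample sentence is invented rather than taken from any package.

```python
import string
from collections import Counter

# Hypothetical stand-in for the Smart.English.stop list; the real list is much longer.
STOP_WORDS = {"a", "the", "of", "and", "with", "in", "are", "our", "no", "to"}

def preprocess(text):
    """Lower-case the text, strip ASCII punctuation, and drop stop words."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)  # removes leading/trailing punctuation only
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens

def word_frequencies(text):
    """Count how often each remaining word occurs."""
    return Counter(preprocess(text))

# Invented sample text, not copied from the corpus.
sample = "Kettle Chips are made with all natural ingredients. No trans fat, natural taste!"
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Note that `string.punctuation` covers ASCII characters only, so curly quotes typed from a package would need to be added to the stripped set.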
Word Frequencies
Gourmet Top Words (word, count)
chips 26
natural 21
potato 14
fat 11
more 10
ingredients 9
baked 7
great 7
make 7
organic 7
sticks 7
booty 6
delicious 6
kettle 6
made 6
pirate 6
snack 6
trans 6
Regular Top Words (word, count)
chips 19
potato 10
brand 8
fat 8
oil 7
fresh 6
made 6
santitas 6
canola 5
date 5
guaranteed 5
printed 5
tortilla 5
com 4
corn 4
flavor 4
natural 4
original 4
snacks 4
Gourmet Top Bigrams (bigram, score)
exotic vegetable 20.2514176861
real food 19.6664551854
eat healthier 19.2514176861
french fry 19.2514176861
lightly sea 19.2514176861
american rivers 18.9294895912
boulder canyon 18.9294895912
lesser evil 18.9294895912
terra exotic 18.9294895912
fake colors 18.6664551854
gluten free 18.6664551854
green energy 18.6664551854
lightly salted 18.6664551854
sea salted 18.6664551854
small company 18.6664551854
taste naturally 18.6664551854
trans fats 18.6664551854
wholesome potatoes 18.6664551854
krinkle sticks 18.4440627641
canyon products 18.3445270905
big flavor 18.2514176861
company family 18.2514176861
crispy chip 18.2514176861
food ingredients 18.0814926847
kettle cooked 18.0814926847
perfect crunch 18.0814926847
visit www 18.0814926847
Regular Top Bigrams (bigram, score)
grams trans 17.3298138534
cape cod 16.7448513526
good fun 16.7448513526
authentic mexican 16.3298138534
classic chip 16.3298138534
printed date 16.0078857585
cooked cape 15.7448513526
great multigrain 15.7448513526
multigrain snacks 15.7448513526
canola oil 15.5224589313
guaranteed fresh 15.4229232577
authentic flavor 15.3298138534
brand tortilla 15.3298138534
delicious natural 15.3298138534
reduced fat 15.3298138534
trans fat 15.3298138534
multigrain taste 15.1598888519
santitas brand 15.0667794475
regular potato 15.0078857585
enjoy santitas 14.7448513526
santitas authentic 14.7448513526
natural oil 14.5224589313
cod potato 14.4229232577
fat chip 14.3298138534
natural flavor 14.3298138534
original corn 14.3298138534
original flavor 14.3298138534
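The slides do not say which measure produced the bigram scores above. As a rough sketch of how such a ranking can be built, the following assumes pointwise mutual information (PMI), one common association measure; it is not claimed to reproduce the numbers listed. The function name `bigram_pmi`, the `min_count` parameter, and the demo tokens are all hypothetical.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=1):
    """Rank adjacent word pairs by PMI = log2(P(x, y) / (P(x) * P(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # optionally skip rare pairs, which PMI tends to overrate
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented tokens (already lower-cased, stop words removed), not from the corpus.
demo_tokens = ["kettle", "cooked", "potato", "chips", "kettle", "cooked", "chips"]
ranked = bigram_pmi(demo_tokens)
for pair, score in ranked:
    print(" ".join(pair), round(score, 2))
```

PMI rewards pairs whose words rarely occur apart, so in a corpus this small a pair seen only once can score very high; raising `min_count` is one simple guard.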
Question
• What kinds of customers are the companies appealing to with
gourmet versus regular snack chips?
• Sketch of how to approach analysis:
– Appeal to health issues for gourmet chips
• Words: natural, baked, organic
– “fresh” is on the regular list
• AR: real food, eat healthier, gluten free, taste naturally,
wholesome potatoes, … high on gourmet list
– Appeal to environmental issues for gourmet chips
• AR: green energy
– Chatty, family approach to gourmet chips
• Small company, company family
– Regular chips issue words: good fun
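One way to make the comparison in this sketch more systematic is to compare each word's share of the two top-word lists. The code below is a minimal illustration under loud assumptions: the slides give no corpus totals, so the sum of the listed counts stands in for the corpus size, and a word missing from a list is treated as count 0 even though its true count is only known to be below the list cutoff. The function `log_ratio` and the `smoothing` constant are hypothetical.

```python
import math
from collections import Counter

# Counts copied from the Word Frequencies slide.
gourmet = Counter({
    "chips": 26, "natural": 21, "potato": 14, "fat": 11, "more": 10,
    "ingredients": 9, "baked": 7, "great": 7, "make": 7, "organic": 7,
    "sticks": 7, "booty": 6, "delicious": 6, "kettle": 6, "made": 6,
    "pirate": 6, "snack": 6, "trans": 6,
})
regular = Counter({
    "chips": 19, "potato": 10, "brand": 8, "fat": 8, "oil": 7, "fresh": 6,
    "made": 6, "santitas": 6, "canola": 5, "date": 5, "guaranteed": 5,
    "printed": 5, "tortilla": 5, "com": 4, "corn": 4, "flavor": 4,
    "natural": 4, "original": 4, "snacks": 4,
})

def log_ratio(word, a, b, smoothing=0.5):
    """log2 of the word's smoothed share in list a over its share in list b.

    Assumption: list totals stand in for corpus sizes, and unlisted words
    count as 0. The small constant only avoids division by zero.
    """
    share_a = (a[word] + smoothing) / (sum(a.values()) + smoothing)
    share_b = (b[word] + smoothing) / (sum(b.values()) + smoothing)
    return math.log2(share_a / share_b)

# Positive values lean gourmet, negative values lean regular.
for word in ("natural", "organic", "fresh", "guaranteed"):
    print(f"{word:12s} {log_ratio(word, gourmet, regular):+.2f}")
```

Normalizing by list totals partly offsets the imbalance in text volume between the two package types, but with counts this small the ratios are suggestive at best.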
Other Observations
• Regular chips just don’t use a lot of text to appeal to
customers
• Text more standardized on regular chips
– “Guaranteed fresh by printed date” appears on almost all packages
• Measures limited by small amounts of text
Brainstorm Ideas for Corpora?
• Gutenberg
• NLTK book (Presidents’ Inaugural Addresses)
• Positive and negative opinions from annotated text or from
debate blogs
• BNC, ANC
• Collect documents on two topics
• Compare text on the same topic from different eras
• Compare song lyrics from different authors or genres
• ??