Digital texts
Tokenization
Day 13, 11-Feb-15
SPAN 4350
Cultura computacional en español (Computational Culture in Spanish)
Harry Howard
Tulane University
Course organization
http://www.tulane.edu/~howard/Span4350/
http://www.tulane.edu/~howard/CompCultES/
1. Cultural computing
2. Python
3. Strings
4. Unicode
5. Regular expressions
6. Files
7. Lists
CultCompES, Prof. Howard, Tulane University
11-feb-2015
Review
A list is a sequence of objects enclosed in square brackets.
7.2.3. Which methods work on a list but not on a string?
>>> L1 = ['Miguel', 'Cervantes']
>>> L1.append('de Saavedra')
>>> del L1[2]
>>> L1.insert(1, 'de Saavedra')
>>> L1.remove('de Saavedra')
>>> L1[0] = 'Miguelito'
>>> L1.append('de Saavedra')
>>> L1.pop(2)
>>> L1.reverse()
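The contrast behind the question can be sketched as follows: a list can be changed in place, while a string cannot. This is only an illustration written for Python 3; the helper `tries_to_mutate` and the variable names `S1`, `popped`, and `list_only` are hypothetical and not part of the course materials.

```python
# Lists are mutable: these calls change L1 in place.
L1 = ['Miguel', 'Cervantes']
L1.append('de Saavedra')     # add an item at the end
popped = L1.pop(2)           # remove and return the item at index 2
L1[0] = 'Miguelito'          # item assignment works on a list
L1.reverse()                 # reverse the order in place

# Strings are immutable: no append(), and item assignment raises TypeError.
S1 = 'Miguel'
has_append = hasattr(S1, 'append')


def tries_to_mutate(text):
    """Hypothetical helper: report whether item assignment succeeds."""
    try:
        text[0] = 'm'
        return True
    except TypeError:
        return False


string_mutated = tries_to_mutate(S1)

# Public list methods that strings lack (the exact set depends on the
# Python version, so no expected output is shown here).
list_only = [m for m in dir(list)
             if not m.startswith('_') and not hasattr(str, m)]
print(list_only)
```

Methods such as `count()` and `index()` do not show up in `list_only`, because strings have them too.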
Setting the global working directory
Create a folder named "pyScripts" in your Documents folder.
In Spyder > Preferences > Global working directory:
 "At start-up, the global working directory is: … the following directory" (browse to "pyScripts" and select it)
 "Files are opened from: … the global working directory"
 "Files are created in: … the global working directory"
7.1.1. How to navigate folders with os
>>> import os
>>> os.getcwd()
'/Users/harryhow/Documents/pyScripts'
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts/')
>>> os.getcwd()
'/Users/{your_user_name}/Documents/pyScripts'
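If you would rather not type your user name by hand, the same navigation can be sketched with `os.path.expanduser()`. This is a minimal sketch, assuming that "pyScripts" lives under `~/Documents`; the helper `go_to` is hypothetical and only changes directory when the folder actually exists.

```python
import os

# Build the path from the home directory instead of typing the user name.
# The location under ~/Documents is an assumption about your machine.
target = os.path.join(os.path.expanduser('~'), 'Documents', 'pyScripts')


def go_to(folder):
    """Hypothetical helper: change into folder if it exists, then
    return the resulting working directory."""
    if os.path.isdir(folder):
        os.chdir(folder)
    return os.getcwd()


current = go_to(target)
print(current)
```

If the folder is missing, `go_to()` leaves the working directory unchanged, so the printed path shows where Python is actually looking.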
7.1.2. Project Gutenberg
http://www.gutenberg.org/ebooks/28554
7.1.3. How to download a file with urllib
and convert it to a string with read()
# Python 2 import; in Python 3, use: from urllib.request import urlopen
>>> from urllib import urlopen
>>> url = 'http://www.gutenberg.org/cache/epub/28554/pg28554.txt'
>>> download = urlopen(url)
>>> downloadString = download.read()
>>> type(downloadString)
>>> len(downloadString) # 35739?
>>> downloadString[:50]
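For readers on Python 3, the same download-and-read steps might look like the sketch below. The helper `fetch_text` is hypothetical, and the UTF-8 encoding is an assumption about the file. So that the example runs offline, it reads a small temporary file through a `file://` URL as a stand-in for the Gutenberg address.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen   # Python 3 location of urlopen


def fetch_text(url, encoding='utf-8'):
    """Hypothetical helper: download url and return its contents as text."""
    with urlopen(url) as response:
        raw = response.read()         # bytes, as returned by read()
    return raw.decode(encoding)       # bytes -> str


# Stand-in for the Gutenberg URL: a local file addressed as file://...
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('El ingenioso hidalgo don Quijote de la Mancha')
    tmp_path = tmp.name
sample_url = Path(tmp_path).as_uri()

downloadString = fetch_text(sample_url)
os.remove(tmp_path)                   # clean up the stand-in file

print(type(downloadString))
print(len(downloadString))
print(downloadString[:50])
```

Replacing `sample_url` with the `url` from the slide should behave the same way when a network connection is available.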
7.1.4. How to save a file to your drive with
open(), write(), and close()
# it is assumed that Python is looking at your pyScripts folder
>>> tempFile = open('Cervantes.txt','w')
# read() returned a byte string, so it is written as-is, without encode()
>>> tempFile.write(downloadString)
>>> tempFile.close()
# import os if you haven't already done so
>>> os.listdir('.')
Requests
6.2.2. How to download a file with requests
>>> import requests
>>> lur = 'http://www.gutenberg.org/cache/epub/15115/pg15115.txt'
>>> descarga = requests.get(lur).text
>>> type(descarga)
<type 'unicode'>
>>> len(descarga)
363600
>>> descarga[:150]
u'The Project Gutenberg EBook of Novelas y teatro, by Cervantes\r\n\r\nThis eBook is for the use of anyone anywhere at no co
How to handle a file on your hard drive with open(), write(), and close()
# write the downloaded text to disk as UTF-8 bytes
>>> fitemp = open('NovelasTeatro.txt','w')
>>> fitemp.write(descarga.encode('utf8'))
>>> fitemp.close()
----------------------------------------------
# read it back in three steps
>>> fitemp = open('NovelasTeatro.txt','r')
>>> texto = fitemp.read()
>>> fitemp.close()
----------------------------------------------
# or read it back in a single line
>>> texto = open('NovelasTeatro.txt', 'r').read()
----------------------------------------------
>>> type(texto)
>>> len(texto)
370503
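The length read back from disk (370503) is larger than the length of the downloaded text (363600). One likely reason is that the file holds UTF-8 bytes, and each accented character takes more than one byte. The sketch below shows the effect on a short stand-in sentence, using Python 3 and `with` blocks that close the file automatically; the sample text and the name `crudo` are assumptions made for illustration.

```python
import os
import tempfile

# Stand-in for the downloaded text; it contains one accented character.
descarga = 'vivía un hidalgo en un lugar de la Mancha'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'NovelasTeatro.txt')

    # Write the text as UTF-8; the file is closed when the block ends.
    with open(path, 'w', encoding='utf-8') as fitemp:
        fitemp.write(descarga)

    # Read it back as text (decoded characters).
    with open(path, 'r', encoding='utf-8') as fitemp:
        texto = fitemp.read()

    # Read it back as raw bytes (what is actually stored on disk).
    with open(path, 'rb') as fitemp:
        crudo = fitemp.read()

n_chars = len(texto)   # number of characters
n_bytes = len(crudo)   # number of bytes; 'í' takes two bytes in UTF-8
print(n_chars, n_bytes)
```

The byte count exceeds the character count by one here, because the sample has exactly one two-byte character.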
Reading
Read "6.3.3. How to split a document into sub-texts" on your own. You will need it to answer question 3 of the quiz.
Tokenization
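7.3.1. Tokenization with regular expressions
>>> C = open('NovelasTeatro.txt', 'r').read()
# decode the bytes read from disk into a Unicode string (Python 2)
>>> U = C.decode('utf8')
>>> from re import findall, UNICODE
# every run of word characters between word boundaries counts as a token
>>> palabras = findall(r'\b\w+\b', U, UNICODE)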
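To see what the pattern returns, here is a small sketch that applies it to a short sample sentence instead of the whole file. It is written for Python 3, where string patterns match Unicode word characters by default, so the `UNICODE` flag is not needed. The function name `tokenize` and the sample sentence are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Hypothetical helper: return the runs of word characters in text,
    in the order they occur."""
    return re.findall(r'\b\w+\b', text)


sample = 'En un lugar de la Mancha, vivía un hidalgo.'
tokens = tokenize(sample)
print(tokens)
# ['En', 'un', 'lugar', 'de', 'la', 'Mancha', 'vivía', 'un', 'hidalgo']
```

Punctuation is dropped and accented letters stay inside their words. Note that `\w` also matches digits and the underscore, so numbers come out as tokens too.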
7.4. How to handle a file with NLTK
>>> from nltk.corpus import PlaintextCorpusReader
>>> texlector = PlaintextCorpusReader('', 'NovelasTeatro.txt', encoding='utf-8')
>>> palabras = texlector.words()
Next time
§8. Controlling computation