Advanced Course on Bioinformatic and Comparative Genome Analysis
UFSC – Florianopolis - June 30 - July 12, 2008
UFSC - CTC
Unix/perl
Prof. Mario Dantas
[email protected]
http://www.inf.ufsc.br
Motivation
The large amount of computing resources available in many organizations can be gathered to solve problems from several research areas.
Biology is an example of an area whose experiments can be improved through the use of these distributed resources.
Motivation
• Workflow:
– represents an execution flow in which data are passed between tasks according to previously defined rules.
• Ontology:
– an ontology can be expressed as a formal, explicit specification of a shared conceptualization.
GRID 2007 - September 20, 2007
Motivation
[Diagram: several Virtual Organizations, with resources R1..RN, connected through a Grid Resource Broker and Grid Information Services. Each organization develops its own ontology; there is no single shared ontology.]
Motivation
The Architecture Proposal
[Diagram: A - Reference Ontology, Query Ontology and Rules; B - Resource Ontologies and semantic equivalence relations. Modules: Integration Portal (Query Interface, Integration Service), Information Provider (Publishing Service), and Matchmaker (Matchmaker Service, Jena Generic Rule Reasoner, JGRR).]
1 - A VO requests the Reference Ontology (RO) and, after identifying the equivalence relations, reports them to the system.
1.1 - look up the RO
1.2 - store the relations
2 - provide resource information of a VO
2.1 - publish the information
3 - submit query
3.1 - return the search result
Motivation
Approach: Semantic Integration
[Diagram: in a Grid environment with virtual organizations VO_1 ... VO_n, each resource-ontology developer (1 ... n) establishes mappings (mappings 1 ... mappings n) based on a shared Reference Ontology.]
Motivation
Reference Ontology (RO)
[Diagram: an is-a hierarchy rooted at GridEntity. Branches include GridResource (ComputingResource, NetworkResource, StorageResource, UnitaryComputerSystem, Cluster, DataSet), GridMiddleware (Globus, Legion, Unicore), GridService (InfoService, DataManagement, ResourceManagement, JobManagement), Policy (Access, Account), and ComputingResourceElements (OperatingSystem: Unix, Linux, Windows, MacOS; FileSystem; Processor; ProcessorArchitecture: INTEL, AMD, SPARC, POWER, each with 32- and 64-bit variants).]
Motivation
Graphical interface for editing queries
Motivation
Case Study 1 (semantic matching)
Motivation
Case Study 2 (checking queries)
Motivation
SuMMIT
Motivation
SuMMIT – Mobile GUI
Monitoring Interface
Motivation
SuMMIT – Agent
Motivation
SuMMIT – Workflow Manager (4/7)
Motivation
SuMMIT Operation
Automation
and
Coordination
Motivation
SuMMIT – Resource Selector
Ontology-based Matchmaker
[Diagram: layered architecture. Matchmaking Rules; Domain Background Knowledge; Domain Ontologies (Resources, Policies, Requests); Rule-based Inference Engine (Jena/ARQ/Pellet).]
Objective
• This course will introduce the basics of UNIX/perl programming; by the end, participants will be able to write simple scripts.
References
• There are several books and sites that can help you develop and improve your knowledge of Unix and perl. Some examples:
Books
Available Free Online
http://proquest.safaribooksonline.com
Books
Interesting Sites
• http://directory.fsf.org/project/emacs/
• http://stein.cshl.org/genome_informatics/
• http://www.cs.usfca.edu/~parrt/course/601/lectures/unix.util.html
Interesting Sites
• http://people.genome.duke.edu/~jes12//courses/perl_duke_2005/
• http://www.pasteur.fr/~tekaia/BCGA_useful_links.html
• http://google.com/linux
Course Outline
 Introduction
 Operating system overview
 UNIX utilities
 Scripting languages
 Programming tools
In the Beginning
• UNICS: 1969 – PDP-7 minicomputer
• PDP-7 goes away, rewritten on PDP-11 to
“help patent lawyers”
• V1: 1971
• V3: 1973 (pipes, C language)
• V6: 1976 (rewritten in C, base for BSD)
• V7: 1979 (Licensed, portable)
PDP-11
Big Reason for V6 Success
Commercial Success
• AIX
• SunOS, Solaris
• Ultrix, Digital Unix
• HP-UX
• Irix
• UnixWare -> Novell -> SCO -> Caldera -> SCO
• Xenix -> SCO
• Standardization (Posix, X/Open)
…But Then The Feuding Began
• Unix International vs. Open Software Foundation (to compete with desktop PCs)
• Battle of the Window Managers
– Openlook
– Motif
• Threat of Windows NT resolves battle with CDE
Send in the Clones
• Linux
– Written in 1991 by Linus Torvalds
– Most popular UNIX variant
– Free with GNU license
• BSD Lite
– FreeBSD (1993, focus on PCs)
– NetBSD (1993, focus on portability)
– OpenBSD (1996, focus on security)
– Free with BSD license
– Development less centralized
Today: Unix is Big
Popular Success!
Linux at Google & Elsewhere
Darwin
• Apple abandoned old Mac OS for UNIX
– Purchased NeXT in December 1996
– Unveiled in 2000
– Based on 4.4BSD-Lite
– Aqua UI written over Darwin
– Open Source
Why did UNIX succeed?
• Technical strengths!
• Research, not commercial
• PDP-11 was popular with an unusable OS
• AT&T's legal concerns
– Not allowed to enter the computer business but needed to write software to help with switches
– Licensed cheaply or free
The Open Source Movement
• Has fueled much growth in UNIX
– Keeps up with pace of change
– More users, developers
• More platforms, better
performance, better code
• Many vendors switching to Linux
SCO vs. Linux
• Jan 2002: SCO releases Ancient Unix : BSD style
licensing of V5/V6/V7/32V/System III
• March 2003: SCO sues IBM for $3 billion. Alleges
contributions to Linux come from proprietary licensed
code
– AIX is based on System V r4, now owned by SCO
• Aug 2003: Evidence released
– Code traced to Ancient UNIX
– Isn’t in 90% of all running Linux distributions
– Already dropped from Linux in July
• Aug 2005: Linux kernel code may have been in SCO
– Does Linux borrow from ancient UNIX or System V R4?
UNIX vs. Linux
The UNIX Philosophy
• Small is beautiful
– Easy to understand
– Easy to maintain
– More efficient
– Better for reuse
• Make each program do one thing well
– More complex functionality by combining
programs
– Make every program a filter
The UNIX Philosophy
• Portability over efficiency
..continued
– Most efficient implementation is rarely portable
– Portability better for rapidly changing hardware
• Use flat ASCII files
– Common, simple file format (yesterday’s XML)
– Example of portability over efficiency
• Reusable code
– Good programmers write good code;
great programmers borrow good code
The UNIX Philosophy
• Scripting increases leverage and
portability
..continued
print $(who | awk '{print $1}' | sort | uniq) | sed 's/ /,/g'
List the logins of a system’s users on a single line.
• Build prototypes quickly (high-level interpreted languages)
Lines of code in the reused tools: who 755, awk 3,412, sort 2,614, uniq 302, sed 2,093 (9,176 lines in total)
The UNIX Philosophy
..continued
• Avoid captive interfaces
– The user of a program isn’t always human
– Look nice, but code is big and ugly
– Problems with scale
• Silence is golden
– Only report if something is wrong
• Think hierarchically
UNIX Highlights / Contributions
• Portability (variety of hardware; C implementation)
• Hierarchical file system; the file abstraction
• Multitasking and multiuser capability for
minicomputer
• Inter-process communication
– Pipes: output of one program fed into the input of another
• Software tools
• Development tools
• Scripting languages
• TCP/IP
The Operating System
• The government of your computer
• Kernel: Performs critical system functions
and interacts with the hardware
• Systems utilities: Programs and libraries
that provide various functions through
systems calls to the kernel
Kernel Basics
• The kernel is …
– a program loaded into memory during the
boot process, and always stays in physical
memory.
– responsible for managing CPU and memory
for processes, managing file systems, and
interacting with devices.
UNIX Structural Layout
[Diagram: layered structure. User Space: shell scripts, utilities, C programs, compilers. Kernel: system calls, signal handler, scheduler, swapper, device drivers. Devices: terminal, disk, printer, RAM.]
Kernel Subsystems
• Process management
– Schedule processes to run on CPU
– Inter-process communication (IPC)
• Memory management
– Virtual memory
– Paging and swapping
• I/O system
– File system
– Device drivers
– Buffer cache
System Calls
• Interface to the kernel
• Over 1,000 system calls available on Linux
• 3 main categories
– File/device manipulation
• e.g. mkdir(), unlink()
– Process control
• e.g. fork(), execve(), nice()
– Information manipulation
• e.g. getuid(), time()
Logging In
• Need an account and password first
– Enter at login: prompt
– Password not echoed
– After successful login, you will see a shell prompt
• Entering commands
– At the shell prompt, type in commands
• Typical format: command options arguments
• Examples: who, date, ls, cat myfile, ls –l
– Case sensitive
• exit to log out
Remote Login
• Use Secure Shell (SSH)
• Windows
– e.g. PuTTY
• UNIX-like OS
– ssh [email protected]
UNIX on Windows
Two recommended UNIX emulation environments:
• UWIN (AT&T)
– http://www.research.att.com/sw/tools/uwin
• Cygwin (GPL)
– http://www.cygwin.com
Linux Distributions
• Slackware – the original
• Debian – collaboration of volunteers
• Red Hat / Fedora – commercial success
• Ubuntu – currently most popular; based on Debian, focus on desktop
• Gentoo – portability
• Knoppix – live distribution
Course Outline
 Introduction
 Operating system overview
 UNIX utilities
 Scripting languages
 Programming tools
Unix System Structure
[Diagram: layers from user down to hardware. Users run C programs and scripts; the shell and utilities (ls, ksh, gcc, find) sit above the kernel; the kernel exposes system calls such as open(), fork(), exec() and drives the hardware.]
Kernel Subsystems
• File system
– Deals with all input and output
• Includes files and terminals
• Integration of storage devices
• Process management
– Deals with programs and program interaction
• How processes share CPU, memory and signals
• Scheduling
• Interprocess Communication
• Memory management
• UNIX variants have different implementations of different subsystems.
What is a shell?
• The user interface to the operating system
• Functionality:
– Execute other programs
– Manage files
– Manage processes
• A program like any other
• Executed when you log on
Most Commonly Used Shells
– /bin/sh    The Bourne Shell / POSIX shell
– /bin/csh   C shell
– /bin/tcsh  Enhanced C Shell
– /bin/ksh   Korn shell
– /bin/bash  Free ksh clone

Basic form of shell:
while (read command) {
    parse command
    execute command
}
Shell Interactive Use
When you log in, you interactively use the shell:
– Command history
– Command line editing
– File expansion (tab completion)
– Command expansion
– Key bindings
– Spelling correction
– Job control
Shell Scripting
• A set of shell commands that
constitute an executable program
• A shell script is a regular text file that
contains shell or UNIX commands
• Very useful for automating repetitive tasks and administrative tools, and for storing commands for later execution
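A minimal sketch of such a script (the file name greet.sh and its contents are made up for illustration):

```shell
# Create a tiny shell script, make it executable, and run it.
cat > greet.sh <<'EOF'
#!/bin/sh
# Print a greeting for each name given as an argument
for name in "$@"; do
    echo "Hello, $name"
done
EOF
chmod +x greet.sh
./greet.sh Alice Bob    # prints "Hello, Alice" then "Hello, Bob"
```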
Simple Commands
• simple command: a sequence of non-blank arguments separated by blanks or tabs.
• 1st argument (numbered zero) usually specifies
the name of the command to be executed.
• Any remaining arguments:
– Are passed as arguments to that command.
– Arguments may be filenames, pathnames, directories
or special options (up to command)
– Special characters are interpreted by shell
A simple example
$ ls -l /bin
-rwxr-xr-x  1 root  sys  43234 Sep 26 2001 date
$
(prompt: $; command: ls; arguments: -l /bin)
• Execute a basic command
• Parsing the line into command and arguments is called splitting
Types of Arguments
$ tar -c -v -f archive.tar main.c main.h
• Options/Flags
– Convention: -X or --longname
• Parameters
– May be files, may be strings
– Depends on command
Getting Help on UNIX
• man: display entries from UNIX online
documentation
• whatis, apropos
• Manual entries organization:
1. Commands
2. System calls
3. Subroutines
4. Special files
5. File format and conventions
6. Games
http://en.wikipedia.org/wiki/Unix_manual
Example Man Page
ls(1)                     USER COMMANDS                     ls(1)
NAME
ls - list files and/or directories
SYNOPSIS
ls [ options ] [ file ... ]
DESCRIPTION
For each directory argument ls lists the contents; for each file argument the name and requested information are listed. The
current directory is listed if no file arguments appear. The listing is sorted by file name by default, except that file arguments
are listed before directories.
OPTIONS
-a, --all
List entries starting with .; turns off --almost-all.
-F, --classify
Append a character for typing each entry.
-l, --long|verbose
Use a long listing format.
-r, --reverse
Reverse order while sorting.
-R, --recursive
List subdirectories recursively.
SEE ALSO
chmod(1), find(1), getconf(1), tw(1)
Fundamentals of Security
• UNIX systems have one or more users,
identified with a number and name.
• A set of users can form a group. A user
can be a member of multiple groups.
• A special user (id 0, name root) has
complete control.
• Each user has a primary (default)
group.
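A quick way to see these ids on a live system is the id utility (output values vary per account):

```shell
id -u     # numeric user id of the current user
id -un    # the corresponding user name
id -G     # numeric ids of every group the user belongs to
```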
How are Users & Groups used?
• Used to determine if file or process
operations can be performed:
– Can a given file be read? written to?
– Can this program be run?
– Can I use this piece of hardware?
– Can I stop a particular process that’s running?
A simple example
$ ls -l /bin
-rwxr-xr-x  1 root  sys  43234 Sep 26 2001 date
$
(r = read, w = write, x = execute)
The UNIX File Hierarchy
Hierarchies are Ubiquitous
Definition: Filename
A sequence of characters other than slash.
Case sensitive.
[Diagram: a directory tree rooted at / with subdirectories tmp, usr, etc, bin; usr contains dmr and wm4; wm4 contains foo and .profile; bin contains who and date.]
Definition: Directory
Holds a set of files or other directories.
Case sensitive.
[Diagram: the same tree; /, tmp, usr, etc, bin, dmr, and wm4 are directories.]
Definition: Pathname
A sequence of directory names followed by a simple
filename, each separated from the previous one by a /
[Diagram: the same tree, highlighting the pathname /usr/wm4/.profile]
/usr/wm4/.profile
Definition: Working Directory
A directory that file names refer to by default.
One per process.
[Diagram: the same tree; here the working directory is /usr/wm4.]
Definition: Relative Pathname
A pathname relative to the working directory (as
opposed to absolute pathname)
[Diagram: the same tree, with working directory /usr/wm4.]
.. refers to the parent directory
. refers to the current directory
Examples (all naming the same file from /usr/wm4): .profile, ./.profile, ../wm4/.profile
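A small sketch of . and .. in action, using a throwaway tree (directory names a and b are arbitrary):

```shell
top=$(mktemp -d)     # scratch directory; the actual path varies by system
mkdir -p "$top/a/b"
cd "$top/a/b"
pwd                  # absolute pathname of the working directory
cd ..                # .. moves to the parent directory: $top/a
pwd
cd ./b               # . is the current directory, so this returns to $top/a/b
pwd
```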
Files and Directories
• Files are just a sequence of bytes
– No file types (data vs. executable)
– No sections
– Example of UNIX philosophy
• Directories are a list of files and status of the
files:
– Creation date
– Attributes
– etc.
Tilde Expansion
• Each user has a home directory
• Most shells (ksh, csh) support the ~ operator:
– ~ expands to my home directory
• ~/myfile -> /home/kornj/myfile
– ~user expands to user's home directory
• ~unixtool/file2 -> /home/unixtool/file2
• Useful because home directory locations
vary by machine
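The expansion is easy to observe (notes.txt is a made-up name; the file need not exist):

```shell
echo ~              # prints the current user's home directory
echo ~/notes.txt    # expands to $HOME/notes.txt before echo ever runs
```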
Mounting File Systems
• When UNIX is started, the directory hierarchy
corresponds to the file system located on a
single disk called the root device.
• Mounting allows root to splice the root
directory of a file system into the existing
directory hierarchy.
• File systems created on other devices can be
attached to the original directory hierarchy
using the mount mechanism.
• The commands mount and umount manage attaching and detaching file systems.
Mounting File Systems
[Diagram: the root device's hierarchy (/ with a and b) and an external device's file system; mounting splices the external file system in at a mount point (e.g. /a/b), and the kernel records the association in a mount table.]
Printing File Contents
• The cat command can be used to copy the
contents of a file to the terminal. When invoked
with a list of file names, it concatenates them.
• Some options:
-n   number output lines (starting from 1)
-v   display control-characters in visible form (e.g. ^C)
• Interactive commands more and less show a
page at a time
Common Utilities for Managing Files and Directories
• pwd              print working directory
• ed, vi, emacs…   create/edit files
• ls               list contents of directory
• rm               remove file
• mv               rename file
• cp               copy a file
• touch            create an empty file or update timestamps
• mkdir and rmdir  create and remove directories
• wc               count the words in a file
• file             determine file contents
• du               directory usage
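A short tour of these utilities in a scratch directory (all file names below are made up):

```shell
d=$(mktemp -d)           # scratch directory; path is system-dependent
cd "$d"
touch notes.txt          # create an empty file
cp notes.txt copy.txt    # copy it
mv copy.txt final.txt    # rename the copy
mkdir sub                # create a subdirectory
ls                       # lists final.txt, notes.txt, sub
echo "one two three" > notes.txt
wc -w notes.txt          # reports 3 words
rm final.txt             # remove a file
rmdir sub                # remove the (empty) directory
```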
File Permissions
• UNIX provides a way to protect files based on
users and groups.
• Three types of permissions:
• read, process may read contents of file
• write, process may write contents of file
• execute, process may execute file
• Three sets of permissions:
• permissions for owner
• permissions for group (1 group per file)
• permissions for other
Directory permissions
• Same types and sets of permissions as for
files:
– read: process may read the directory contents (i.e., list files)
– write: process may add/remove files in the
directory
– execute: process may open files in directory
or subdirectories
Utilities for Manipulating File Attributes
• chmod   change file permissions
• chown   change file owner
• chgrp   change file group
• umask   user file-creation mode mask
• Only the owner or super-user can change file attributes
• Upon creation, the default permissions given to a file are modified by the process umask value
Chmod command
• Symbolic access modes {u,g,o} / {r,w,x}
• example: chmod +r file
• Octal access modes

octal  read  write  execute
0      no    no     no
1      no    no     yes
2      no    yes    no
3      no    yes    yes
4      yes   no     no
5      yes   no     yes
6      yes   yes    no
7      yes   yes    yes
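Both forms on a throwaway file (the permissions chosen here are just examples):

```shell
f=$(mktemp)          # temporary file to experiment on
chmod 754 "$f"       # octal: owner rwx (7), group r-x (5), other r-- (4)
ls -l "$f"           # mode field shows -rwxr-xr--
chmod go-rx "$f"     # symbolic: remove read/execute for group and other
chmod 600 "$f"       # octal again: owner read/write only
```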
File System Internals
The Open File Table
• I/O operations are done on files by first
opening them, reading/writing/etc., then
closing them.
• The kernel maintains a global table
containing information about each open
file.
Inode  Mode        Count  Position
1023   read        1      0
1331   read/write  2      50
…
The File Descriptor Table
• Each process contains a
table of files it has opened.
• Inherits open files from
parent
• Each open file is
associated with a number
or handle, called file
descriptor, (fd).
• Each entry of this table
points to an entry in the
open file table.
• Always starts at 0
Why not directly use
the open file table?
• Convenient for kernel
– Indirection makes security easier
• Numbering scheme can be local to
process (0 .. 128)
• Extra information stored:
– Should the open file be inherited by children?
(close-on-exec flag)
Standard in/out/err
• The first three entries in the file descriptor table are special by convention:
• Entry 0 is for input
• Entry 1 is for output
• Entry 2 is for error messages
[Diagram: a process such as cat with descriptors 0, 1, and 2 attached.]
• What about reading/writing to the
screen?
Devices
• Besides files, input and output can go
from/to various hardware devices
• UNIX innovation: Treat these just like
files!
– /dev/tty, /dev/lpr, /dev/modem
• By default, standard in/out/err opened
with /dev/tty
Redirection
• Before a command is executed, the input and
output can be changed from the default
(terminal) to a file
– Shell modifies file descriptors in child process
– The child program knows nothing about this
Redirection of input/output
• Redirection of output: >
– example: $ ls > my_files
• Redirection of input: <
– example: $ mail kornj < input.data
• Append output: >>
– example: $ date >> logfile
• Bourne Shell derivatives: fd>
– example: $ ls 2> error_log
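A sketch of keeping the two streams apart (all file names here are made up; missing.txt deliberately does not exist):

```shell
echo "hello" > data.txt
# cat exits nonzero because missing.txt does not exist; || true ignores that
cat data.txt missing.txt > out.log 2> err.log || true
cat out.log      # hello                    (what went to standard output)
cat err.log      # cat's error message      (what went to standard error)
cat data.txt missing.txt > all.log 2>&1 || true   # 2>&1 merges stderr into stdout
```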
Using Devices
• Redirection works with devices (just like
files)
• Special files in /dev directory
– Example: /dev/tty
– Example: /dev/lp
– Example: /dev/null
• cat big_file > /dev/lp
• cat big_file > /dev/null
Links
• Directories are a list of files and directories.
– Each directory entry links to a file on the disk
[Diagram: a directory mydir with entries hello, file2, and subdir; two of the entries link to the same file data ("Hello World!").]
– Two different directory entries can link to the same
file
• In same directory or across different directories
– Moving a file does not actually move any data
around.
• Creates link in new location
• Deletes link in old location
• ln command
Symbolic links
• Symbolic links are different than regular links (often
called hard links). Created with ln -s
• Can be thought of as a directory entry that points to
the name of another file.
• Does not change link count for file
– When original deleted, symbolic link remains
• They exist because:
– Hard links don’t work across file systems
– Hard links only work for regular files, not directories
[Diagram: a hard link is a directory entry (dirent) pointing directly at the contents of a file; a symbolic link is a dirent pointing at a symlink object, which holds the name of another file.]
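The difference is easy to demonstrate (file names original, hard, and soft are made up):

```shell
d=$(mktemp -d); cd "$d"
echo "Hello" > original
ln original hard      # hard link: a second directory entry for the same file
ln -s original soft   # symbolic link: stores the name "original"
ls -l hard            # link count for the file is now 2
rm original           # removes one name; the data itself remains
cat hard              # still prints Hello
cat soft 2>/dev/null || echo "dangling"   # the name soft points at is gone
```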
Example
[Diagram: the directory tree from before. With a hard link, /usr/dmr/foo and /usr/wm4/.profile are two directory entries for the same file. With a symbolic link, /usr/dmr/foo instead stores the name /usr/wm4/.profile.]
Can a file have no links?
[Diagram: after the last directory entry is removed, a file can live on while a process (e.g. cat) still holds it open.]
Tree Walking
• How do we find a set of files in the hierarchy?
• One possibility:
– ls -l -R /
• What about:
– All files below a given directory in the hierarchy?
– All files since Jan 1, 2001?
– All files larger than 10K?
find utility
• find pathlist expression
• find recursively descends through pathlist
and applies expression to every file.
• expression can be:
– -name pattern
• true if file name matches pattern. Pattern may
include shell patterns such as *, must be in quotes
to suppress shell interpretation.
– Eg: find / -name '*.c'
find utility (continued)
• -perm [+-]mode
– Find files with the given access mode; mode must be in octal. Eg: find . -perm 755
• -type ch
– Find files of type ch (c=character, b=block, f=plain file, etc.). Eg: find /home -type f
• -user userid/username
– Find by owner userid or username
• -group groupid/groupname
– Find by group groupid or groupname
• -size size
– File size is at least size
• many more…
find: logical operations
• ! expression   returns the logical negation of expression
• op1 -a op2     matches both patterns op1 and op2
• op1 -o op2     matches either op1 or op2
• ( )            groups expressions together
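The operators combined on a small made-up tree (file names a.c, b.c, c.h, notes.txt are arbitrary):

```shell
d=$(mktemp -d); cd "$d"
touch a.c b.c c.h notes.txt
# -o: match either pattern; quote the patterns so the shell doesn't expand them
find . -name '*.c' -o -name '*.h'
# !: negation -- plain files that are not .c files
find . -type f ! -name '*.c'
```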
find: actions
• -print   prints out the name of the current file (default)
• -exec cmd
– Executes cmd, where cmd must be terminated by an
escaped semicolon (\; or ';').
– If you specify {} as a command line argument, it is
replaced by the name of the current file just found.
– exec executes cmd once per file.
– Example:
• find -name "*.o" -exec rm "{}" ";"
find Examples
• Find all files beneath home directory beginning with f
– find ~ -name 'f*' -print
• Find all files beneath home directory modified in the last day
– find ~ -mtime -1 -print
• Find all files beneath home directory larger than 10K
– find ~ -size +10k -print
• Count words in files under home directory
– find ~ -exec wc -w {} \; -print
• Remove core files
– find / -name core -exec rm {} \;
diff: comparing two files
• diff: compares two files and outputs a
description of their differences
– Usage: diff [options] file1 file2
– -i: ignore case
test1 contains: apples, oranges, walnuts
test2 contains: apples, oranges, grapes

$ diff test1 test2
3c3
< walnuts
---
> grapes
Other file comparison utilities
• cmp
– Tests two files for equality
– If equal, nothing returned. If different, location of
first differing byte returned
– Faster than diff for checking equality
• comm
– Reads two files and outputs three columns:
• Lines in first file only
• Lines in second file only
• Lines in both files
– Must be sorted
– Options: fields to suppress ( [-123] )
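A sketch of comm on two small sorted files (f1 and f2 and their contents are made up):

```shell
# comm requires both inputs to be sorted
printf 'apple\nbanana\ncherry\n' > f1
printf 'banana\ncherry\ndate\n'  > f2
comm f1 f2        # three columns: only in f1, only in f2, in both
comm -12 f1 f2    # suppress columns 1 and 2: lines common to both files
```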
Course Outline
 Introduction
 Operating system overview
 UNIX utilities
 Scripting languages
 Programming tools
Kernel Data Structures
• Process table: contains an entry for every process in the system, with information about each process.
• Open-file table: contains at least one entry for every open file in the system.
[Diagram: user space holds each process's code and data; kernel space holds the process table (per-process info) and the open file table.]
Unix Processes
Process: An entity of execution
• Definitions
– program: collection of bytes stored in a file that can
be run
– image: computer execution environment of program
– process: execution of an image
• Unix can execute many processes
simultaneously.
Process Creation
• Interesting trait of UNIX
• fork system call clones the current process
• exec system call replaces the current process image with a new program
Process Setup
• All of the per process information is copied
with the fork operation
– Working directory
– Open files
• Copy-on-write makes this efficient
• Before exec, these values can be modified
fork and exec
• Example: the shell

while (1) {
    display_prompt();
    read_input(cmd, params);
    pid = fork();               /* create child */
    if (pid != 0)
        waitpid(-1, &stat, 0);  /* parent waits */
    else
        execve(cmd, params, 0); /* child execs */
}
Unix process genealogy
[Diagram: process generation. Process 1 (init) forks per-terminal init processes; each execs getty; getty execs login; login execs /bin/sh.]
Background Jobs
• By default, executing a command in the
shell will wait for it to exit before printing
out the next prompt
• Trailing a command with & allows the shell
and command to run simultaneously
$ /bin/sleep 10 &
[1] 3424
$
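A minimal sketch of waiting on a background job from a script ($! is the shell's record of the last background pid):

```shell
sleep 2 &        # run in the background; the shell does not wait for it
pid=$!           # $! holds the pid of the most recent background command
echo "started background job $pid"
wait "$pid"      # now block until that job finishes
echo "done"
```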
Program Arguments
• When a process is started, it is sent a list
of strings
– argv, argc
• The process can use this list however it
wants to
Ending a process
• When a process ends, there is a return
code associated with the process
• This is a non-negative integer
– 0 means success
– >0 represents various kinds of failure; the meaning is up to the process
Process Information Maintained
• Working directory
• File descriptor table
• Process id
– number used to identify process
• Process group id
– number used to identify set of processes
• Parent process id
– process id of the process that created the process
Process Information Maintained
• Umask
– Default file permissions for new file
We haven’t talked about these yet:
• Effective user and group id
– The user and group this process is running with
permissions as
• Real user and group id
– The user and group that invoked the process
• Environment variables
Setuid and Setgid Mechanisms
• The kernel can set the effective user and
group ids of a process to something
different than the real user and group
– Files executed with the setuid or setgid flag set cause these values to change
• Make it possible to do privileged tasks:
– Change your password
• Open up a can of worms for security if
buggy
Environment of a Process
• A set of name-value pairs associated with
a process
• Keys and values are strings
• Passed to children processes
• Cannot be passed back up
• Common examples:
– PATH: Where to search for programs
– TERM: Terminal type
The PATH environment variable
• Colon-separated list of directories.
• Non-absolute pathnames of
executables are only executed if found
in the list.
– Searched left to right
• Example:
$ myprogram
sh: myprogram not found
$ PATH=/bin:/usr/bin:/home/kornj/bin
$ myprogram
hello!
Having . In Your Path
$ ls
foo
$ foo
sh: foo: not found
$ ./foo
Hello, foo.
• What not to do:
$ PATH=.:/bin
$ ls
foo
$ cd /usr/badguy
$ ls
Congratulations, your files have been removed
and you have just sent email to Prof. Korn
challenging him to a fight.
Shell Variables
• Shells have several mechanisms for creating
variables. A variable is a name representing a
string value. Example: PATH
– Shell variables can save time and reduce typing
errors
• Allow you to store and manipulate information
– Eg: ls $DIR > $FILE
• Two types: local and environmental
– local are set by the user or by the shell itself
– environmental come from the operating system
and are passed to children
Variables (con’t)
• Syntax varies by shell
– varname=value          # sh, ksh
– set varname = value    # csh
• To access the value: $varname
• Turn a local variable into an environment variable:
– export varname         # sh, ksh
– setenv varname value   # csh
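A sketch in sh/ksh syntax showing that only exported variables reach children (GREETING is a made-up name):

```shell
GREETING="hello world"                    # local shell variable
echo "$GREETING"
sh -c 'echo "child sees: $GREETING"'      # prints nothing after the colon
export GREETING                           # promote to environment variable
sh -c 'echo "child sees: $GREETING"'      # the child now inherits the value
```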
Environmental Variables
NAME    MEANING
$HOME   Absolute pathname of your home directory
$PATH   A list of directories to search for commands
$MAIL   Absolute pathname of your mailbox
$USER   Your user id
$SHELL  Absolute pathname of login shell
$TERM   Type of your terminal
$PS1    Prompt
Inter-process Communication
Ways in which processes communicate:
• Passing arguments, environment
• Read/write regular files
• Exit values
• Signals
• Pipes
Signals
• Signal: A message a process can send to a
process or process group, if it has appropriate
permissions.
• Message type represented by a symbolic
name
• For each signal, the receiving process can:
– Explicitly ignore the signal
– Specify an action to be taken upon receipt (signal handler)
– Otherwise, the default action takes place (usually the process is killed)
• Common signals:
An Example of Signals
• When a child exits, it sends a SIGCHLD signal to its parent.
• If a parent wants to wait for a child to
exit, it tells the system it wants to catch
the SIGCHLD signal
• When a parent does not issue a wait, it
ignores the SIGCHLD signal
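From the shell, the trap builtin is the interface to signal handling; here a subshell catches SIGTERM it sends to itself instead of dying (a sketch, not the SIGCHLD mechanism itself):

```shell
# trap installs a handler; kill -TERM $$ signals the shell's own process,
# so the handler runs and execution continues.
out=$(sh -c 'trap "echo caught SIGTERM" TERM; kill -TERM $$; echo still running')
echo "$out"
```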
Process Subsystem utilities
• ps
• kill
• wait
• nohup
the
signal
• sleep
• nice
monitors status of processes
send a signal to a pid
parent process wait for one of its
children to terminate
makes a command immune to
hangup and terminate
sleep in seconds
run processes at low priority
Pipes
One of the cornerstones of
UNIX
Pipes
• General idea: The input of one program
is the output of the other, and vice versa
[Diagram: processes A and B connected by a pipe in each direction.]
• Both programs run at the same time
Pipes (2)
• Often, only one end of the pipe is used
[Diagram: A's standard out feeding B's standard in.]
• Could this be done with files?
File Approach
• Run first program (process 1), save output into file
• Run second program (process 2), using file as input
• Unnecessary use of the disk
– Slower
– Can take up a lot of space
• Makes no use of multi-tasking
More about pipes
• What if a process tries to read data but
nothing is available?
– UNIX puts the reader to sleep until data available
• What if a process can’t keep up reading from
the process that’s writing?
– UNIX keeps a buffer of unread data
• This is referred to as the pipe size.
– If the pipe fills up, UNIX puts the writer to sleep
until the reader frees up space (by doing a read)
• Multiple readers and writers possible with
pipes.
More about Pipes
• Pipes are often chained together
– Called filters
[Diagram: A | B | C, each stage's standard out feeding the next stage's standard in.]
Interprocess Communication For Unrelated Processes
• FIFO (named pipes)
– A special file that when opened represents a pipe
• System V IPC
– message queues
– semaphores
– shared memory
• Sockets (client/server model)
Pipelines
• Output of one program becomes input to
another
– Uses concept of UNIX pipes
• Example: $ who | wc -l
– counts the number of users logged in
• Pipelines can be long
What’s the difference?
Both of these commands send input to command from a
file instead of the terminal:
$ cat file | command
vs.
$ command < file
An Extra Process
$ cat file | command
[Diagram: two processes, cat and command, joined by a pipe.]
$ command < file
[Diagram: a single process, command, reading directly from the file.]
Introduction to Filters
• A class of Unix tools called filters.
– Utilities that read from standard input,
transform the file, and write to standard out
• Using filters can be thought of as data
oriented programming.
– Each step of the computation transforms data
stream.
Examples of Filters
• Sort
– Input: lines from a file
– Output: lines from the file sorted
• Grep
– Input: lines from a file
– Output: lines that match the argument
• Awk
– Programmable filter
cat: The simplest filter
• The cat command copies its input to output
unchanged (identity filter). When supplied a list
of file names, it concatenates them onto stdout.
• Some options:
– -n   number output lines (starting from 1)
– -v   display control-characters in visible form (e.g. ^C)

cat file*
ls | cat -n
head
• Display the first few lines of a specified file
• Syntax: head [-n] [filename...]
– -n - number of lines to display, default is 10
– filename... - list of filenames to display
• When more than one filename is specified, the start of each file's listing displays ==> filename <==
tail
• Displays the last part of a file
• Syntax: tail +|-number [lbc] [f] [filename]
or: tail +|-number [l] [rf] [filename]
– +number - begins copying at distance number from
beginning of file, if number isn’t given, defaults to 10
– -number - begins from end of file
– l,b,c - number is in units of lines/block/characters
– r - print in reverse order (lines only)
– f - if input is not a pipe, do not terminate after end of
file has been copied but loop. This is useful to
monitor a file being written by another process
head and tail examples
head /etc/passwd
head *.c
tail +20 /etc/passwd
ls -lt | tail -3
head -100 /etc/passwd | tail -5
tail -f /usr/local/httpd/access_log
tee
• Copy standard input to standard output
and one or more files
– Captures intermediate results from a filter
in the pipeline
tee (cont'd)
• Syntax: tee [ -ai ] file-list
– -a - append to output file rather than
overwrite, default is to overwrite (replace) the
output file
– -i - ignore interrupts
– file-list - one or more file names for capturing
output
• Examples
ls | head -10 | tee first_10 | tail -5
who | tee user_list | wc
Unix Text Files: Delimited Data
Tab Separated
John	99
Anne	75
Andrew	50
Tim	95
Arun	33
Sowmya	76
Pipe-separated
COMP1011|2252424|Abbot, Andrew John |3727|1|M
COMP2011|2211222|Abdurjh, Saeed |3640|2|M
COMP1011|2250631|Accent, Aac-Ek-Murhg |3640|1|M
COMP1021|2250127|Addison, Blair |3971|1|F
COMP4012|2190705|Allen, David Peter |3645|4|M
COMP4910|2190705|Allen, David Pater |3645|4|M
Colon-separated
root:ZHolHAHZw8As2:0:0:root:/root:/bin/ksh
jas:nJz3ru5a/44Ko:100:100:John Shepherd:/home/jas:/bin/ksh
cs1021:iZ3sO90O5eZY6:101:101:COMP1021:/home/cs1021:/bin/bash
cs2041:rX9KwSSPqkLyA:102:102:COMP2041:/home/cs2041:/bin/csh
cs3311:mLRiCIvmtI9O2:103:103:COMP3311:/home/cs3311:/bin/sh
cut: select columns
• The cut command prints selected parts of input
lines.
– can select columns (assumes tab-separated input)
– can select a range of character positions
• Some options:
– -f listOfCols: print only the specified columns (tab-separated) on output
– -c listOfPos: print only chars in the specified positions
– -d c: use character c as the column separator
• Lists are specified as ranges (e.g. 1-5) or
comma-separated (e.g. 2,4,5).
cut examples
cut -f 1 < data
cut -f 1-3 < data
cut -f 1,4 < data
cut -f 4- < data
cut -d'|' -f 1-3 < data
cut -c 1-4 < data
Unfortunately, there's no way to refer to "last
column" without counting the columns.
paste: join columns
• The paste command displays several text files
"in parallel" on output.
• If the inputs are files a, b, c
– the first line of output is composed
of the first lines of a, b, c
– the second line of output is composed
of the second lines of a, b, c
• Example: if file a holds the lines 1, 3, 5 and file b
holds 2, 4, 6, then paste a b outputs 1 2, 3 4, 5 6
(one pair per line)
• Lines from each file are separated by a tab
character.
• If files are different lengths, output has all lines
from the longest file, with empty strings for missing
lines.
paste example
cut -f 1 < data > data1
cut -f 2 < data > data2
cut -f 3 < data > data3
paste data1 data3 data2 > newdata
sort: Sort lines of a file
• The sort command copies input to output but
ensures that the output is arranged in
ascending order of lines.
– By default, sorting is based on ASCII comparisons
of the whole line.
• Other features of sort:
– understands text data that occurs in columns.
(can also sort on a column other than the first)
– can distinguish numbers and sort appropriately
– can sort files "in place" as well as behaving like a
filter
– capable of sorting very large files
sort: Options
• Syntax: sort [-dftnr] [-o filename] [filename(s)]
-d  Dictionary order: only letters, digits, and whitespace
    are significant in determining sort order
-f  Ignore case (fold into lower case)
-t  Specify delimiter
-n  Numeric order: sort by arithmetic value instead of
    first digit
-r  Sort in reverse order
-o filename  Write output to filename; filename can be
    the same as one of the input files
• Lots more options…
sort: Specifying fields
• Delimiter : -td
• Old way:
– +f[.c][options] [-f[.c][options]
• +2.1 -3 +0 -2 +3n
– Exclusive
– Start from 0 (unlike cut, which starts at 1)
• New way:
– -k f[.c][options][,f[.c][options]]
• -k2.1 -k0,1 -k3n
– Inclusive
– Start from 1
sort Examples
sort +2nr < data
sort -k2nr data
sort -t: +4 /etc/passwd
sort -o mydata mydata
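A small worked example of the new-style key syntax (the name/score data is made up for illustration):

```shell
# Sort name/score pairs on the second field, numerically, descending.
printf 'John 99\nAnne 75\nTim 95\n' | sort -k2,2nr
# John 99 comes out first, Anne 75 last
```

The ,2 in -k2,2 stops the key at field 2; without it, the key would run from field 2 to the end of the line.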
uniq: list UNIQue items
• Remove or report adjacent duplicate lines
• Syntax: uniq [-cdu] [input-file] [output-file]
-c  Supersede the -u and -d options and generate an
    output report with each line preceded by an
    occurrence count
-d  Write only the duplicated lines
-u  Write only those lines which are not duplicated
– The default output is the union (combination) of
-d and -u
wc: Counting results
• The word count utility, wc, counts the
number of lines, characters or words
• Options:
-l  Count lines
-w  Count words
-c  Count characters
• Default: count lines, words and characters
wc and uniq Examples
who | sort | uniq -d
wc my_essay
who | wc
sort file | uniq | wc -l
sort file | uniq -d | wc -l
sort file | uniq -u | wc -l
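The three uniq option forms can be compared on one small input (remember that duplicates must be adjacent, hence the sort-first idiom above; the /tmp path is arbitrary):

```shell
# Sample input with two runs of duplicates.
printf 'a\na\nb\nc\nc\nc\n' > /tmp/uniq_demo
uniq -c /tmp/uniq_demo   # counts each run: 2 a, 1 b, 3 c
uniq -d /tmp/uniq_demo   # only the duplicated lines: a, c
uniq -u /tmp/uniq_demo   # only the non-duplicated lines: b
```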
tr: TRanslate Characters
• Copies standard input to standard output with
substitution or deletion of selected characters
• Syntax: tr [ -cds ] [ string1 ] [ string2 ]
• -d delete all input characters contained in string1
• -c complements the characters in string1 with respect
to the entire ASCII character set
• -s squeeze all strings of repeated output characters
in the last operand to single characters
tr (continued)
• tr reads from standard input.
– Any character that does not match a character in
string1 is passed to standard output unchanged
– Any character that does match a character in
string1 is translated into the corresponding
character in string2 and then passed to standard
output
• Examples
– tr s z      replaces all instances of s with z
– tr so zx    replaces all instances of s with z and o with x
– tr a-z A-Z  replaces all lower case characters with upper case
tr uses
• Change delimiter
tr '|' ':'
• Rewrite numbers
tr ,. .,
• Import DOS files
tr -d '\r' < dos_file
• Find printable ASCII in a binary file
tr -cd '\na-zA-Z0-9 ' < binary_file
xargs
• Unix limits the size of the arguments and
environment that can be passed down to a child process
• What happens when we have a list of 10,000
files to send to a command?
• xargs solves this problem
– Reads arguments as standard input
– Sends them to commands that take file lists
– May invoke program several times depending on size
of arguments
• e.g. given a1 … a300 on standard input, xargs cmd runs:
cmd a1 a2 …
cmd a100 a101 …
cmd a200 a201 …
find utility and xargs
• find . -type f -print | xargs wc -l
– -type f for files
– -print to print them out
– xargs invokes wc 1 or more times
• wc -l a b c d e f g
wc -l h i j k l m n o
…
• Compare to: find . -type f -exec wc -l {} \;
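The batching behaviour is easy to see with echo as the command; the -n option (covered again next lecture) forces small batches:

```shell
# xargs reads whitespace-separated tokens and invokes echo once per 3 tokens.
printf 'a b c d e f g\n' | xargs -n 3 echo
# three invocations: "a b c", "d e f", "g"
```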
Next Time
• Regular Expressions
– Allow you to search for text in files
– grep command
• We will soon learn how to write scripts that
use these utilities in interesting ways.
Previously
• Basic UNIX Commands
– Files: rm, cp, mv, ls, ln
– Processes: ps, kill
• Unix Filters
– cat, head, tail, tee, wc
– cut, paste
– find
– sort, uniq
– comm, diff, cmp
– tr
Subtleties of commands
• Executing commands with find
• Specification of columns in cut
• Specification of columns in sort
• Methods of input
– Standard in
– File name arguments
– Special "-" filename
• Options for uniq
Today
• Regular Expressions
– Allow you to search for text in files
– grep command
• Stream manipulation:
– sed
• But first, a command we didn’t cover last time…
xargs
• Unix limits the size of the arguments and
environment that can be passed down to a child process
• What happens when we have a list of 10,000
files to send to a command?
• xargs handles this problem
– Reads arguments as standard input
– Sends them to commands that take file lists
– May invoke program several times depending on size
of arguments
• e.g. given a1 … a300 on standard input, xargs cmd runs:
cmd a1 a2 …
cmd a100 a101 …
cmd a200 a201 …
find utility and xargs
• find . -type f -print | xargs wc -l
-type f for files
-print to print them out
xargs invokes wc 1 or more times
• wc -l a b c d e f g
wc -l h i j k l m n o
…
• Compare to: find . -type f -exec wc -l {} \;
• The -n option can be used to limit number of
args
Regular Expressions
What Is a Regular Expression?
• A regular expression (regex) describes a set
of possible input strings.
• Regular expressions descend from a
fundamental concept in Computer Science
called finite automata theory
• Regular expressions are endemic to Unix
– vi, ed, sed, and emacs
– awk, tcl, perl and Python
– grep, egrep, fgrep
– compilers
Regular Expressions
• The simplest regular expressions are a
string of literal characters to match.
• The string matches the regular expression
if it contains the substring.
regular expression: cks
UNIX Tools rocks.    → match
UNIX Tools sucks.    → match
UNIX Tools is okay.  → no match
Regular Expressions
• A regular expression can match a string in
more than one place.
regular expression: apple
Scrapple from the apple.   → match 1 (in Scrapple), match 2 (apple)
Regular Expressions
• The . regular expression can be used to
match any character.
regular expression: o.
For me to poop on.   → match 1, match 2
Character Classes
• Character classes [] can be used to
match any specific set of characters.
regular expression: b[eor]at
beat a brat on a boat   → match 1 (beat), match 2 (brat), match 3 (boat)
Negated Character Classes
• Character classes can be negated with the
[^] syntax.
regular expression: b[^eo]at
beat a brat on a boat   → match (brat)
More About Character Classes
– [aeiou] will match any of the characters a, e, i,
o, or u
– [kK]orn will match korn or Korn
• Ranges can also be specified in character
classes
– [1-9] is the same as [123456789]
– [abcde] is equivalent to [a-e]
– You can also combine multiple ranges
• [abcde123456789] is equivalent to [a-e1-9]
– Note that the - character has a special meaning in
a character class, but only when it appears between
two characters as a range; elsewhere it is literal
Named Character Classes
• Commonly used character classes can be
referred to by name (alpha, lower, upper,
alnum, digit, punct, cntrl)
• Syntax: [:name:]
– [a-zA-Z]     is [[:alpha:]]
– [a-zA-Z0-9]  is [[:alnum:]]
– [45a-z]      is [45[:lower:]]
• Important for portability across languages
Anchors
• Anchors are used to match at the beginning or
end of a line (or both).
• ^ means beginning of the line
• $ means end of the line
regular expression: ^b[eor]at
beat a brat on a boat   → match (beat)
regular expression: b[eor]at$
beat a brat on a boat   → match (boat)
• ^word$ matches a line containing only word
• ^$ matches an empty line
Repetition
• The * is used to define zero or more
occurrences of the single regular
expression preceding it.
regular expression: ya*y
I got mail, yaaaaaaaaaay!   → match
regular expression: oa*o
For me to poop on.   → match (oo in poop, with zero a's)
• .* matches any string, including the empty string
Match length
• A match will be the longest string that
satisfies the regular expression.
regular expression: a.*e
Scrapple from the apple.   → the match runs from the first a
(in Scrapple) to the last e (in apple), not any of the
shorter candidates
Repetition Ranges
• Ranges can also be specified
– { } notation can specify a range of
repetitions for the immediately preceding
regex
– {n} means exactly n occurrences
– {n,} means at least n occurrences
– {n,m} means at least n occurrences but
no more than m occurrences
• Example:
– .{0,} same as .*
– a{2,} same as aaa*
Subexpressions
• If you want to group part of an expression so
that * or { } applies to more than just the
previous character, use ( ) notation
• Subexpressions are treated like a single
character
– a* matches 0 or more occurrences of a
– abc* matches ab, abc, abcc, abccc, …
– (abc)* matches abc, abcabc, abcabcabc, …
– (abc){2,3} matches abcabc or abcabcabc
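These grouping rules can be checked directly with egrep (shown here via grep -E, which is the same thing):

```shell
# Anchored group with a repetition range: two or three copies of "abc".
printf 'abc\nabcabc\nabcabcabc\nabcabcabcabc\n' |
  grep -E '^(abc){2,3}$'
# matches only abcabc and abcabcabc
```

One copy is too few and four copies are too many, so the first and last input lines are filtered out.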
grep
• grep comes from the ed (Unix text editor)
search command “global regular expression
print” or g/re/p
• This was such a useful command that it was
written as a standalone utility
• There are two other variants, egrep and fgrep
that comprise the grep family
• grep is the answer for those moments when you
know you want the file that contains a specific
phrase but you can't remember its name
Family Differences
• grep - uses regular expressions for pattern
matching
• fgrep - file grep, does not use regular
expressions, only matches fixed strings but can
get search strings from a file
• egrep - extended grep, uses a more powerful
set of regular expressions but does not support
backreferencing, generally the fastest member
of the grep family
• agrep – approximate grep; not standard
Syntax
• Regular expression concepts we have
seen so far are common to grep and
egrep.
• grep and egrep have slightly different
syntax
– grep: BREs
– egrep: EREs (enhanced features we will
discuss)
• Major syntax differences:
Protecting Regex
Metacharacters
• Since many of the special characters used
in regexs also have special meaning to the
shell, it’s a good idea to get in the habit of
single quoting your regexs
– This will protect any special characters from
being operated on by the shell
– If you habitually do it, you won’t have to worry
about when it is necessary
Escaping Special Characters
• Even though we are single quoting our regexs
so the shell won’t interpret the special
characters, some characters are special to grep
(eg * and .)
• To get literal characters, we escape the
character with a \ (backslash)
• Suppose we want to search for the character
sequence a*b*
– Unless we do something special, this will match zero
or more ‘a’s followed by zero or more ‘b’s, not what
we want
– a\*b\* will fix this: now the asterisks are treated as literal characters
Egrep: Alternation
• Regex also provides an alternation character |
for matching one or another subexpression
– (T|Fl)an will match ‘Tan’ or ‘Flan’
– ^(From|Subject): will match the From and
Subject lines of a typical email message
• It matches a beginning of line followed by either the
characters ‘From’ or ‘Subject’ followed by a ‘:’
• Subexpressions are used to limit the scope of
the alternation
– At(ten|nine)tion then matches “Attention” or
“Atninetion”, not “Atten” or “ninetion” as would happen
without the parentheses: Atten|ninetion
Egrep: Repetition Shorthands
• The * (star) has already been seen to specify
zero or more occurrences of the immediately
preceding character
• + (plus) means “one or more”
 abc+d will match ‘abcd’, ‘abccd’, or ‘abccccccd’
but will not match ‘abd’
 Equivalent to {1,}
Egrep: Repetition Shorthands
cont
• The ‘?’ (question mark) specifies an optional
character, the single character that immediately
precedes it
 July? will match ‘Jul’ or ‘July’
 Equivalent to {0,1}
 Also equivalent to (Jul|July)
• The *, ?, and + are known as quantifiers because
they specify the quantity of a match
• Quantifiers can also be used with subexpressions
– (a*c)+ will match ‘c’, ‘ac’, ‘aac’ or ‘aacaacac’ but will not
match ‘a’ or a blank line
Grep: Backreferences
• Sometimes it is handy to be able to refer to a
match that was made earlier in a regex
• This is done using backreferences
– \n is the backreference specifier, where n is a
number
• Looks for nth subexpression
• For example, to find if the first word of a line
is the same as the last:
– ^\([[:alpha:]]\{1,\}\) .* \1$
– The \([[:alpha:]]\{1,\}\) matches 1 or more
letters
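The pattern above can be tried directly with grep (BRE syntax, so the parentheses and braces are backslashed; the sample lines are invented):

```shell
# Print only lines whose first word equals their last word.
printf 'hello there hello\nfoo bar baz\n' |
  grep '^\([[:alpha:]]\{1,\}\) .* \1$'
# prints: hello there hello
```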
Practical Regex Examples
• Variable names in C
– [a-zA-Z_][a-zA-Z_0-9]*
• Dollar amount with optional cents
– \$[0-9]+(\.[0-9][0-9])?
• Time of day
– (1[012]|[1-9]):[0-5][0-9] (am|pm)
• HTML headers <h1> <H1> <h2> …
– <[hH][1-4]>
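The dollar-amount pattern, for instance, can be exercised with egrep syntax (anchors added here so each whole line must match; the sample amounts are invented):

```shell
# $5 and $5.25 are valid; $5.2 (only one cent digit) and "five" are not.
printf '$5\n$5.25\n$5.2\nfive\n' |
  grep -E '^\$[0-9]+(\.[0-9][0-9])?$'
```

Two of the four lines survive: the cents group, when present, must contain exactly two digits.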
grep Family
• Syntax
grep [-hilnv] [-e expression] [filename]
egrep [-hilnv] [-e expression] [-f filename] [expression]
[filename]
fgrep [-hilnxv] [-e string] [-f filename] [string] [filename]
– -h Do not display filenames
– -i Ignore case
– -l List only filenames containing matching lines
– -n Precede each matching line with its line
number
– -v Negate matches
– -x Match whole line only (fgrep only)
– -e expression Specify expression as option
– -f filename   Take the regular expression (egrep) or
a list of strings (fgrep) from filename
grep Examples
• grep 'men' GrepMe
• grep 'fo*' GrepMe
• egrep 'fo+' GrepMe
• egrep -n '[Tt]he' GrepMe
• fgrep 'The' GrepMe
• egrep 'NC+[0-9]*A?' GrepMe
• fgrep -f expfile GrepMe
• Find all lines with signed numbers
$ egrep '[-+][0-9]+\.?[0-9]*' *.c
bsearch.c: return -1;
compile.c: strchr("+1-2*3", t->op)[1] - '0', dst,
convert.c: Print integers in a given base 2-16 (default 10)
convert.c: sscanf(argv[i+1], "%d", &base);
strcmp.c: return -1;
strcmp.c: return +1;
• egrep has its limits: For example, it cannot match all lines that
contain a number divisible by 7.
Fun with the Dictionary
• /usr/dict/words contains about 25,000 words
– egrep hh /usr/dict/words
• beachhead
• highhanded
• withheld
• withhold
• egrep as a simple spelling checker: Specify plausible
alternatives you know
egrep "n(ie|ei)ther" /usr/dict/words
neither
• How many words have 3 a’s one letter apart?
– egrep a.a.a /usr/dict/words | wc -l
• 54
– egrep u.u.u /usr/dict/words
• cumulus
Other Notes
• Use /dev/null as an extra file name
– Will print the name of the file that matched
• grep test bigfile
– This is a test.
• grep test /dev/null bigfile
– bigfile:This is a test.
• Return code of grep is useful
–
grep fred filename > /dev/null && rm filename
Quick Reference

fgrep, grep, egrep:
x           Ordinary characters match themselves
            (NEWLINEs and metacharacters excluded)
xyz         Ordinary strings match themselves

grep, egrep:
\m          Matches literal character m
^           Start of line
$           End of line
.           Any single character
[xy^$z]     Any of x, y, ^, $, or z
[^xy^$z]    Any one character other than x, y, ^, $, or z
[a-z]       Any single character in given range
r*          Zero or more occurrences of regex r
r1r2        Matches r1 followed by r2

grep:
\(r\)       Tagged regular expression, matches r
\n          Set to what matched the nth tagged expression
            (n = 1-9)
\{n,m\}     Repetition

egrep:
r+          One or more occurrences of r
r?          Zero or one occurrence of r
r1|r2       Either r1 or r2
(r1|r2)r3   Either r1r3 or r2r3
(r1|r2)*    Zero or more occurrences of r1|r2
            (e.g., r1, r1r1, r2r1, r1r1r2r1, …)
{n,m}       Repetition
Sed: Stream-Oriented, Non-Interactive Text Editor
• Look for patterns one line at a time, like grep
• Change lines of the file
• Non-interactive text editor
– Editing commands come in as script
– There is an interactive editor ed which accepts the
same commands
• A Unix filter
– Superset of previously mentioned tools
Sed Architecture
(input → pattern space, one input line at a time → scriptfile
commands applied → output)
• Commands in a sed script are applied
in order to each line.
• If a command changes the input,
subsequent command will be applied
to the modified line in the pattern
space, not the original input line.
• The input file is unchanged (sed is a
filter).
• Results are sent to standard output
Scripts
• A script is nothing more than a file of
commands
• Each command consists of up to two
addresses and an action, where the address
can be a regular expression or line number.
(A script is a sequence of commands, each of the form: address action)
Sed Flow of Control
• sed then reads the next line in the input file
and restarts from the beginning of the script
file
• All commands in the script file are compared
to, and potentially act on, all lines in the input
(Each input line passes through cmd 1 … cmd n of the script file; a
command is executed only if the line matches its address; the
resulting line is printed to output, but only without -n.)
sed Syntax
• Syntax: sed [-n] [-e] [‘command’] [file…]
sed [-n] [-f scriptfile] [file…]
– -n - only print lines specified with the print
command (or the ‘p’ flag of the substitute (‘s’)
command)
– -f scriptfile - next argument is a filename
containing editing commands
– -e command - the next argument is an editing
command rather than a filename, useful if multiple
commands are specified
– If the first line of a scriptfile is “#n”, sed acts as
though -n had been specified
sed Commands
• sed commands have the general form
– [address[, address]][!]command [arguments]
• sed copies each input line into a pattern
space
– If the address of the command matches the line in
the pattern space, the command is applied to that
line
– If the command has no address, it is applied to
each line as it enters pattern space
– If a command changes the line in pattern space,
subsequent commands operate on the modified
line
• When all commands have been read, the line is written to standard output (unless -n was specified) and the next input line is read
Addressing
• An address can be either a line number or
a pattern, enclosed in slashes ( /pattern/
)
• A pattern is described using regular
expressions (BREs, as in grep)
• If no pattern is specified, the command will
be applied to all lines of the input file
• To refer to the last line: $
Addressing (continued)
• Most commands will accept two addresses
– If only one address is given, the command operates
only on that line
– If two comma separated addresses are given, then
the command operates on a range of lines between
the first and second address, inclusively
• The ! operator can be used to negate an
address, i.e., address!command causes
command to be applied to all lines that do not
match address
Commands
• command is a single letter
• Example: Deletion: d
• [address1][,address2]d
– Delete the addressed line(s) from the pattern
space; line(s) not passed to standard output.
– A new line of input is read and editing
resumes with the first command of the script.
Address and Command
Examples
• d                  deletes all lines
• 6d                 deletes line 6
• /^$/d              deletes all blank lines
• 1,10d              deletes lines 1 through 10
• 1,/^$/d            deletes from line 1 through the first blank line
• /^$/,$d            deletes from the first blank line through
                     the last line of the file
• /^$/,10d           deletes from the first blank line through line 10
• /^ya*y/,/[0-9]$/d  deletes from the first line that begins
                     with yay, yaay, yaaay, etc. through
                     the first line that ends with a digit
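The blank-line address is the one worth memorizing; a quick run (sample input invented):

```shell
# Delete all blank lines from the input stream.
printf 'a\n\nb\n\n\nc\n' | sed '/^$/d'
# leaves just the three lines a, b, c
```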
Multiple Commands
• Braces {} can be used to apply multiple commands to
an address
[/pattern/[,/pattern/]]{
command1
command2
command3
}
• Strange syntax:
– The opening brace must be the last character on a line
– The closing brace must be on a line by itself
– Make sure there are no spaces following the braces
Sed Commands
• Although sed contains many editing commands,
we are only going to cover the following subset:
• s - substitute
• a - append
• i - insert
• c - change
• d - delete
• p - print
• y - transform
• q - quit
Print
• The Print command (p) can be used to force
the pattern space to be output, useful if the -n
option has been specified
• Syntax: [address1[,address2]]p
• Note: if the -n option has not been specified, p
will cause the line to be output twice!
• Examples:
1,5p will display lines 1 through 5
/^$/,$p will display the lines from the first
blank line through the last line of the
file
Substitute
• Syntax:
[address(es)]s/pattern/replacement/[flags]
– pattern - search pattern
– replacement - replacement string for pattern
– flags - optionally any of the following
• n   a number from 1 to 512 indicating which
      occurrence of pattern should be replaced
• g   global: replace all occurrences of pattern
      in pattern space
• p   print contents of pattern space
Substitute Examples
• s/Puff Daddy/P. Diddy/
– Substitute P. Diddy for the first occurrence of Puff
Daddy in pattern space
• s/Tom/Dick/2
– Substitutes Dick for the second occurrence of Tom in
the pattern space
• s/wood/plastic/p
– Substitutes plastic for the first occurrence of wood and
outputs (prints) pattern space
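The occurrence-number and global flags behave differently, as a quick run shows (sample lines invented):

```shell
# Numeric flag: replace only the second occurrence.
printf 'Tom met Tom and Tom\n' | sed 's/Tom/Dick/2'
# → Tom met Dick and Tom

# g flag: replace every occurrence in the pattern space.
printf 'wood cow wood\n' | sed 's/wood/plastic/g'
# → plastic cow plastic
```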
Replacement Patterns
• Substitute can use several special
characters in the replacement string
– & - replaced by the entire string matched in
the regular expression for pattern
– \n - replaced by the nth substring (or
subexpression) previously specified using “\(“
and “\)”
– \ - used to escape the ampersand (&) and
the backslash (\)
Replacement Pattern Examples
"the UNIX operating system …"
s/.NI./wonderful &/
"the wonderful UNIX operating system …"
cat test1
first:second
one:two
sed 's/\(.*\):\(.*\)/\2:\1/' test1
second:first
two:one
sed 's/\([[:alpha:]]\)\([^ \n]*\)/\2\1ay/g'
– Pig Latin ("unix is fun" -> "nixuay siay unfay")
Append, Insert, and Change
• Syntax for these commands is a little strange
because they must be specified on multiple
lines
• append
[address]a\
text
• insert
[address]i\
text
• change
[address(es)]c\
text
• append/insert for single lines only, not range
Append and Insert
• Append places text after the current line in pattern space
• Insert places text before the current line in pattern space
– Each of these commands requires a \ following it.
text must begin on the next line.
– If text begins with whitespace, sed will discard it
unless you start the line with a \
• Example:
/<Insert Text Here>/i\
Line 1 of inserted text\
\
Line 2 of inserted text
would leave the following in the pattern space
Line 1 of inserted text
Line 2 of inserted text
<Insert Text Here>
Change
• Unlike Insert and Append, Change can be
applied to either a single line address or a range
of addresses
• When applied to a range, the entire range is
replaced by text specified with change, not each
line
– Exception: If the Change command is executed with
other commands enclosed in { } that act on a range
of lines, each line will be replaced with text
• No subsequent editing allowed
Change Examples
• Remove mail headers, i.e., the address specifies a
range of lines beginning with a line that begins
with "From " until the first blank line:

/^From /,/^$/c\
<Mail Headers Removed>

/^From /,/^$/{
s/^From //p
c\
<Mail Header Removed>
}

– The first example replaces all lines in the range
with a single occurrence of <Mail Headers Removed>.
– The second example replaces each line in the range
with <Mail Header Removed>
Using !
• If an address is followed by an exclamation point
(!), the associated command is applied to all
lines that don’t match the address or address
range
• Examples:
1,5!d would delete all lines except 1 through 5
/black/!s/cow/horse/ would substitute
“horse” for “cow” on all lines except those that
contained “black”
“The brown cow” -> “The brown horse”
“The black cow” -> “The black cow”
Transform
• The Transform command (y) operates like tr,
it does a one-to-one or character-to-character
replacement
• Transform accepts zero, one or two
addresses
• [address[,address]]y/abc/xyz/
– every a within the specified address(es) is
transformed to an x. The same is true for b to y
and c to z
– y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNO
PQRSTUVWXYZ/ changes all lower case characters
on the addressed line to upper case
– If you only want to transform specific characters, list only those characters and their replacements
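A minimal run of the y command (sample input invented):

```shell
# Map a->x, b->y, c->z, character for character.
printf 'cab\n' | sed 'y/abc/xyz/'
# prints: zxy
```

Unlike s, y has no regular expressions and no flags: the two character lists must be the same length, and each character is replaced by its positional counterpart.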
Quit
• Quit causes sed to stop reading new input
lines and stop sending them to standard
output
• It takes at most a single line address
– Once a line matching the address is reached, the
script will be terminated
– This can be used to save time when you only want
to process some portion of the beginning of a file
• Example: to print the first 100 lines of a file
(like head) use:
– sed '100q' filename
– sed will, by default, send the first 100 lines of
filename to standard output and then quit
Pattern and Hold spaces
• Pattern space: Workspace or temporary
buffer where a single line of input is held
while the editing commands are applied
• Hold space: Secondary temporary buffer
for temporary storage only
• The commands h, H, g, G, and x move or copy text
between the pattern space and the hold space
Sed Advantages
• Regular expressions
• Fast
• Concise
Sed Drawbacks
• Hard to remember text from one line to
another
• Not possible to go backward in the file
• No way to do forward references like
/..../+1
• No facilities to manipulate numbers
• Cumbersome syntax
Course Outline
 Introduction
 Operating system overview
 UNIX utilities
 Scripting languages
 Programming tools
What is a shell?
• The user interface to the operating system
• Functionality:
– Execute other programs
– Manage files
– Manage processes
• Full programming language
• A program like any other
– This is why there are so many shells
Shell History
• There are many choices for shells
• Shell features evolved as UNIX grew
Most Commonly Used Shells
/bin/csh    C shell
/bin/tcsh   Enhanced C shell
/bin/sh     The Bourne shell / POSIX shell
/bin/ksh    Korn shell
/bin/bash   Korn shell clone, from GNU
Ways to use the shell
• Interactively
– When you log in, you interactively use the
shell
• Scripting
– A set of shell commands that constitute an
executable program
Review: UNIX Programs
• Means of input:
– Program arguments
[control information]
– Environment variables
[state information]
– Standard input [data]
• Means of output:
– Return status code [control information]
– Standard out [data]
– Standard error [error messages]
Shell Scripts
• A shell script is a regular text file that
contains shell or UNIX commands
– Before running it, it must have execute
permission:
• chmod +x filename
• A script can be invoked as:
– sh name [ arg … ]
– sh < name [ args … ]
– name [ arg …]
Shell Scripts
• When a script is run, the kernel determines
which shell it is written for by examining the first
line of the script
– If the 1st line starts with #!pathname-of-shell,
the kernel invokes pathname and passes the
script as an argument to be interpreted
– If #! is not specified, the current shell
assumes the script is in its own language
• leads to problems
Simple Example
#!/bin/sh
echo Hello World
Scripting vs. C Programming
• Advantages of shell scripts
– Easy to work with other programs
– Easy to work with files
– Easy to work with strings
– Great for prototyping. No compilation
• Disadvantages of shell scripts
– Slower
– Not well suited for algorithms & data
structures
The C Shell
• C-like syntax (uses { }'s)
• Inadequate for scripting
– Poor control over file descriptors
– Difficult quoting: "I say \"hello\"" doesn't work
– Can only trap SIGINT
– Can't mix flow control and commands
• Survives mostly because of interactive
features.
– Job control
– Command history
– Command line editing, with arrow keys (tcsh)
http://www.faqs.org/faqs/unix-faq/shell/csh-whynot
The Bourne Shell
• Slight differences on various systems
• Evolved into standardized POSIX shell
• Scripts will also run with ksh, bash
• Influenced by ALGOL
Simple Commands
• simple command: a sequence of non-blank
arguments separated by blanks or tabs.
• 1st argument (numbered zero) usually specifies
the name of the command to be executed.
• Any remaining arguments:
– Are passed as arguments to that command.
– Arguments may be filenames, pathnames, directories
or special options
ls -l /
(argv[0]: /bin/ls, argv[1]: -l, argv[2]: /)
Background Commands
• Any command ending with "&" is run in the
background.
firefox &
• wait will block until the command finishes
Complex Commands
• The shell's power is in its ability to hook
commands together
• We've seen one example of this so far with
pipelines:
cut -d: -f2 /etc/passwd | sort | uniq
• We will see others
Redirection of input/output
• Redirection of output: >
– example:$ ls -l > my_files
• Redirection of input: <
– example: $ cat <input.data
• Append output: >>
– example: $ date >> logfile
• Arbitrary file descriptor redirection: fd>
– example: $ ls -l 2> error_log
Multiple Redirection
• cmd 2>file
– send standard error to file
– standard output remains the same
• cmd > file 2>&1
– send both standard error and standard output
to file
• cmd > file1 2>file2
– send standard output to file1
– send standard error to file2
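The third form can be sketched directly (the /tmp paths are arbitrary):

```shell
# Write one line to stdout and one to stderr, then split them into two files.
{ echo normal; echo oops >&2; } > /tmp/out.log 2> /tmp/err.log
cat /tmp/out.log   # normal
cat /tmp/err.log   # oops
```

Order matters in the second form above: 2>&1 must come after the > redirection, because it copies wherever descriptor 1 currently points.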
Here Documents
• Shell provides alternative ways of supplying
standard input to commands (an anonymous
file)
• Shell allows in-line input redirection using <<
called here documents
• Syntax:
command [arg(s)] << arbitrary-delimiter
command input
:
:
arbitrary-delimiter
• arbitrary-delimiter should be a string that
does not appear in text
Here Document Example
#!/bin/sh
mail [email protected] <<EOT
Sorry, I really blew it this
year. Thanks for not firing me.
Yours,
Joe
EOT
Shell Variables
• To set:
name=value
• Read: $var
• Variables can be local or environment.
Environment variables are part of UNIX
and can be accessed by child processes.
• Turn local variable into environment:
export variable
Variable Example
#!/bin/sh
MESSAGE="Hello World"
echo $MESSAGE
Environmental Variables
NAME     MEANING
$HOME    Absolute pathname of your home directory
$PATH    A list of directories to search for commands
$MAIL    Absolute pathname to mailbox
$USER    Your login name
$SHELL   Absolute pathname of login shell
$TERM    Type of your terminal
$PS1     Prompt
Here Documents Expand Vars
#!/bin/sh
mail [email protected] <<EOT
Sorry, I really blew it this
year. Thanks for not firing me.
Yours,
$USER
EOT
Parameters
• A parameter is one of the following:
– A variable
– A positional parameter, starting from 1
– A special parameter
• To get the value of a parameter: ${param}
– Can be part of a word (abc${foo}def)
– Works within double quotes
• The {} can be omitted for simple variables,
special parameters, and single digit positional
parameters.
Positional Parameters
• The arguments to a shell script
– $1, $2, $3 …
• The arguments to a shell function
• Arguments to the set built-in command
– set this is a test
• $1=this, $2=is, $3=a, $4=test
• Manipulated with shift
– shift 2
• $1=a, $2=test
• Parameter 0 is the name of the shell or the shell
script.
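The set and shift built-ins above can be sketched as follows (the argument words are arbitrary):

```shell
set alpha beta gamma delta   # $1=alpha $2=beta $3=gamma $4=delta
first=$1
shift 2                      # now $1=gamma $2=delta
rest=$*
echo "$first / $rest"        # alpha / gamma delta
```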
Example with Parameters
#!/bin/sh
# Parameter 1: word
# Parameter 2: file
grep "$1" "$2" | wc -l
$ countlines ing /usr/dict/words
3277
Special Parameters
• $#    Number of positional parameters
• $-    Options currently in effect
• $?    Exit value of last executed command
• $$    Process number of current process
• $!    Process number of background process
• $*    All arguments on command line
• "$@"  All arguments on command line,
individually quoted "$1" "$2" ...
Command Substitution
• Used to turn the output of a command into a
string
• Used to create arguments or variables
• Command is placed with grave accents ` ` to
capture the output of command
$ date
Wed Sep 25 14:40:56 EDT 2001
$ NOW=`date`
$ grep `generate_regexp` myfile.c
$ sed "s/oldtext/`ls | head -1`/g"
$ PATH=`myscript`:$PATH
File name expansion
• Used to generate a set of arguments from files
• Wildcards (patterns)
* matches any string of characters
? matches any single character
[list] matches any character in list
[lower-upper] matches any character in range lower-upper inclusive
[!list] matches any character not in list
• This is the same syntax that find uses
File Expansion
• If multiple matches, all are returned and
treated as separate arguments:
$ /bin/ls
file1 file2
$ cat file1
a
$ cat file2
b
$ cat file*
a
b
• Handled by the shell (programs don't see the wildcards):
– argv[0]: /bin/cat
– argv[1]: file1
– argv[2]: file2
NOT
– argv[1]: file*
Compound Commands
• Multiple commands
– Separated by semicolon or newline
• Command groupings
– pipelines
• Subshell
( command1; command2 ) > file
• Boolean operators
• Control structures
Boolean Operators
• Exit value of a program (exit system call) is a
number
– 0 means success
– anything else is a failure code
• cmd1 && cmd2
– executes cmd2 if cmd1 is successful
• cmd1 || cmd2
– executes cmd2 if cmd1 is not successful
$ ls bad_file > /dev/null && date
$ ls bad_file > /dev/null || date
Wed Sep 26 07:43:23 2006
Control Structures
if expression
then
command1
else
command2
fi
What is an expression?
• Any UNIX command. Evaluates to true if the
exit code is 0, false if the exit code > 0
• Special command /bin/test exists that does
most common expressions
– String compare
– Numeric comparison
– Check file properties
• [ is often a built-in version of /bin/test,
used as syntactic sugar
• Good example UNIX tools working together
Examples
if test "$USER" = "kornj"
then
echo "I know you"
else
echo "I dont know you"
fi
if [ -f /tmp/stuff ] && [ `wc -l < /tmp/stuff` -gt 10 ]
then
echo "The file has more than 10 lines in it"
else
echo "The file is nonexistent or small"
fi
test Summary
• String-based tests
-z string            Length of string is 0
-n string            Length of string is not 0
string1 = string2    Strings are identical
string1 != string2   Strings differ
string               String is not null
• Numeric tests
int1 -eq int2        First int equal to second
int1 -ne int2        First int not equal to second
-gt, -ge, -lt, -le   greater, greater/equal, less, less/equal
• File tests
-r file              File exists and is readable
-w file              File exists and is writable
-f file              File is a regular file
-d file              File is a directory
-s file              File exists and is not empty
• Logic
!                    Negate result of expression
-a, -o               and operator, or operator
( expr )             groups an expression
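A small sketch combining string, numeric, and file tests (the file name is arbitrary):

```shell
f=./probe.txt
echo "some data" > "$f"
# file exists, is non-empty, and contains exactly one line
if [ -f "$f" ] && [ -s "$f" ] && [ `wc -l < "$f"` -eq 1 ]; then
  result=ok
else
  result=bad
fi
echo $result   # ok
```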
Arithmetic
• No arithmetic built in to /bin/sh
• Use external command /bin/expr
• expr expression
– Evaluates expression and sends the result
to standard output.
– Yields a numeric or string result
expr 4 "*" 12
expr "(" 4 + 3 ")" "*" 2
– Particularly useful with command substitution
X=`expr $X + 2`
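A sketch summing 1..5 with expr and command substitution:

```shell
i=1 sum=0
while [ "$i" -le 5 ]
do
  sum=`expr $sum + $i`   # accumulate the running total
  i=`expr $i + 1`        # increment the loop counter
done
echo $sum   # 15
```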
Control Structures Summary
• if … then … fi
• while … do … done
• until … do … done
• for … do … done
• case … in … esac
for loops
• Different than C:
for var in list
do
command
done
• Typically used with positional parameters or a
list of files:
sum=0
for var in "$@"
do
sum=`expr $sum + $var`
done
echo The sum is $sum
for file in *.c ; do echo "We have $file"
done
Case statement
• Like a C switch statement for strings:
case $var in
opt1) command1
command2
;;
opt2) command
;;
*)
command
;;
esac
• * is a catch all condition
Case Example
#!/bin/sh
for INPUT in "$@"
do
case $INPUT in
hello)
echo "Hello there."
;;
bye)
echo "See ya later."
;;
*)
echo "I'm sorry?"
;;
esac
done
echo "Take care."
Case Options
• opt can be a shell pattern, or a list of shell
patterns delimited by |
• Example:
case $name in
*[0-9]*)
echo "That doesn't seem like a name."
;;
J*|K*)
echo "Your name starts with J or K, cool."
;;
*)
echo "You're not special."
;;
esac
Types of Commands
All behave the same way
• Programs
– Most are part of the OS, in /bin
• Built-in commands
• Functions
• Aliases
Built-in Commands
• Built-in commands are internal to the shell and
do not create a separate process. Commands
are built-in because:
– They are intrinsic to the language (exit)
– They produce side effects on the current process (cd)
– They perform faster
• No fork/exec
• Special built-ins
– : . break continue eval exec export exit
readonly return set shift trap unset
Important Built-in Commands
exec   : replaces shell with program
cd     : change working directory
shift  : rearrange positional parameters
set    : set positional parameters
wait   : wait for background proc. to exit
umask  : change default file permissions
exit   : quit the shell
eval   : parse and execute string
time   : run command and print times
export : put variable into environment
trap   : set signal handlers
Important Built-in Commands
continue : continue in loop
break    : break in loop
return   : return from function
:        : true
.        : read file of commands into
current shell; like #include
Functions
Functions are similar to scripts and other
commands except:
• They can produce side effects in the caller's
script.
• Variables are shared between caller and
callee.
• The positional parameters are saved and
restored when invoking a function.
Syntax:
name ()
{
commands
}
Aliases
• Like macros (#define in C)
• Shorter to define than functions, but more
limited
• Not recommended for scripts
• Example:
alias rm='rm -i'
Command Search Rules
• Special built-ins
• Functions
– command bypasses search for functions
• Built-ins not associated with PATH
• PATH search
• Built-ins associated with PATH
Parsing and Quoting
How the Shell Parses
• Part 1: Read the command:
– Read one or more lines as needed
– Separate into tokens using space/tabs
– Form commands based on token types
• Part 2: Evaluate a command:
– Expand word tokens (command substitution, parameter
expansion)
– Split words into fields
– File expansion
– Set up redirections, environment
– Run command with arguments
Useful Program for Testing
/home/unixtool/bin/showargs
#include <stdio.h>
int main(int argc, char *argv[])
{
int i;
for (i=0; i < argc; i++) {
printf("Arg %d: %s\n", i, argv[i]);
}
return(0);
}
Shell Comments
• Comments begin with an unquoted #
• Comments end at the end of the line
• Comments can begin whenever a token begins
• Examples
# This is a comment
# and so is this
grep foo bar # this is a comment
grep foo bar# this is not a comment
Special Characters
• The shell processes the following characters specially
unless quoted:
| & ( ) < > ; " ' $ ` space tab newline
• The following are special whenever patterns are
processed:
* ? [ ]
• The following are special at the beginning of a word:
# ~
• The following is special when processing assignments:
=
Token Types
• The shell uses spaces and tabs to split the
line or lines into the following types of
tokens:
– Control operators (||)
– Redirection operators (<)
– Reserved words (if)
– Assignment tokens
– Word tokens
Operator Tokens
• Operator tokens are recognized everywhere
unless quoted. Spaces are optional before and
after operator tokens.
• I/O Redirection Operators:
> >> >| >& < << <<- <&
– Each I/O operator can be immediately preceded by a
single digit
• Control Operators:
| & ; ( ) || && ;;
Shell Quoting
• Quoting causes characters to lose special
meaning.
• \ Unless quoted, \ causes next character to
be quoted. In front of new-line causes lines to
be joined.
• '…'
Literal quotes. Cannot contain '
• "…"
Removes special meaning of all
characters except $, ", \ and `. The \ is only
special before one of these characters and newline.
Quoting Examples
$ cat file*
a
b
$ cat "file*"
cat: file* not found
$ cat file1 > /dev/null
$ cat file1 ">" /dev/null
a
cat: >: cannot open
FILES="file1 file2"
$ cat "$FILES"
cat: file1 file2 not found
Simple Commands
• A simple command consists of three types of
tokens:
– Assignments (must come first)
– Command word tokens
– Redirections: redirection-op + word-op
• The first token must not be a reserved word
• Command terminated by new-line or ;
• Example:
– foo=bar z=`date`
echo $HOME
x=foobar > q$$ $xyz z=3
Word Splitting
• After parameter expansion, command
substitution, and arithmetic expansion,
the characters that are generated as a
result of these expansions that are not
inside double quotes are checked for split
characters
• Default split character is space or tab
• Split characters are defined by the value of
the IFS variable (IFS="" disables)
Word Splitting Examples
FILES="file1 file2"
cat $FILES
a
b
IFS=
cat $FILES
cat: file1 file2: cannot open
IFS=x v=exit
echo exit $v "$v"
exit e it exit
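The effect of IFS on word splitting can be sketched as follows (the data string is arbitrary):

```shell
data="a:b:c"
IFS=:             # split on colons instead of space/tab
set -- $data      # unquoted expansion is split into fields
IFS=' '           # restore a sane IFS (real scripts save the old value)
count=$#
second=$2
echo "$count fields, second is $second"   # 3 fields, second is b
```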
Pathname Expansion
• After word splitting, each field that
contains pattern characters is replaced by
the pathnames that match
• Quoting prevents expansion
• set -o noglob disables
– Not in original Bourne shell, but in POSIX
Parsing Example
DATE=`date` echo $foo > \
/dev/null
joins the two lines into:
DATE=`date` echo $foo > /dev/null
Tokens: assignment (DATE=`date`), word (echo),
parameter ($foo), redirection (> /dev/null)
With foo="hello there", parameter expansion, splitting
by IFS, and PATH expansion yield the command:
/bin/echo hello there > /dev/null
Script Examples
• Rename files to lower case
• Strip CR from files
• Emit HTML for directory contents
Rename files
#!/bin/sh
for file in *
do
lfile=`echo $file | tr A-Z a-z`
if [ "$file" != "$lfile" ]
then
mv "$file" "$lfile"
fi
done
Remove DOS Carriage Returns
#!/bin/sh
TMPFILE=/tmp/file$$
if [ "$1" = "" ]
then
tr -d '\r'
exit 0
fi
trap 'rm -f $TMPFILE' 1 2 3 6 15
for file in "$@"
do
if tr -d '\r' < $file > $TMPFILE
then
mv $TMPFILE $file
fi
done
Generate HTML
$ dir2html.sh > dir.html
The Script
#!/bin/sh
[ "$1" != "" ] && cd "$1"
cat <<HUP
<html>
<h1> Directory listing for $PWD </h1>
<table border=1>
<tr>
HUP
num=0
for file in *
do
genhtml $file
# this function is on next page
done
cat <<HUP
</tr>
</table>
</html>
HUP
Function genhtml
genhtml()
{
file=$1
echo "<td><tt>"
if [ -f $file ]
then
echo "<font color=blue>$file</font>"
elif [ -d $file ]
then
echo "<font color=red>$file</font>"
else
echo "$file"
fi
echo "</tt></td>"
num=`expr $num + 1`
if [ $num -gt 4 ]
then
echo "</tr><tr>"
num=0
fi
}
Korn Shell / bash Features
Command Substitution
• Better syntax with $(command)
– Allows nesting
– x=$(cat $(generate_file_list))
• Backward compatible with ` … ` notation
Expressions
• Expressions are built-in with the [[ ]] operator
if [[ $var = "" ]] …
• Gets around parsing quirks of /bin/test, allows checking
strings against patterns
• Operations:
– string == pattern
– string != pattern
– string1 < string2
– file1 -nt file2
– file1 -ot file2
– file1 -ef file2
– &&, ||
Patterns
• Can be used to do string matching:
if [[ $foo = *a* ]]
if [[ $foo = [abc]* ]]
• Similar to regular expressions, but
different syntax
Additional Parameter Expansion
• ${#param} – Length of param
• ${param#pattern} – Left strip min pattern
• ${param##pattern} – Left strip max pattern
• ${param%pattern} – Right strip min pattern
• ${param%%pattern} – Right strip max pattern
• ${param-value} – Default value if param not set
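A sketch of the strip operators applied to a path (the path is illustrative):

```shell
path=/home/user/report.tar.gz
base=${path##*/}   # strip longest prefix matching */  -> report.tar.gz
dir=${path%/*}     # strip shortest suffix matching /* -> /home/user
stem=${path%.*}    # strip shortest suffix matching .* -> /home/user/report.tar
echo "$base | $dir | $stem"
```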
Variables
• Variables can be arrays
– foo[3]=test
– echo ${foo[3]}
• Indexed by number
• ${#arr[@]} is the length of the array
• Multiple array elements can be set at once:
– set -A foo a b c d
– echo ${foo[1]}
– Set command can also be used for positional
params: set a b c d; print $2
Printing
• Built-in print command to replace echo
• Much faster
• Allows options:
-u#
print to specific file descriptor
Functions
• Alternative function syntax:
function name {
commands
}
• Allows for local variables
• $0 is set to the name of the function
Additional Features
• Built-in arithmetic: Using $((expression ))
– e.g., print $(( 1 + 1 * 8 / x ))
• Tilde file expansion
~       $HOME
~user   home directory of user
~+      $PWD
~-      $OLDPWD
Course Outline
 Introduction
 Operating system overview
 UNIX utilities
 Scripting languages
 Programming tools
The eval built-in
• eval arg …
– Causes all the tokenizing and expansions to
be performed again
trap command
• trap specifies command that should be evaled
when the shell receives a signal of a particular
value.
• trap [ [command] {signal}+]
– If command is omitted, signals are ignored
• Especially useful for cleaning up temporary files
trap 'echo "please, dont interrupt!"' SIGINT
trap 'rm /tmp/tmpfile' EXIT
Reading Lines
• read is used to read a line from a file
and to store the result into shell
variables
– read -r prevents special processing
– Uses IFS to split into words
– If no variable specified, uses REPLY
read
read -r NAME
read FIRSTNAME LASTNAME
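A sketch using read -r in a loop over a here document, counting non-empty lines:

```shell
count=0
while IFS= read -r line
do
  # count only lines with content
  [ -n "$line" ] && count=`expr $count + 1`
done <<EOF
one
two

three
EOF
echo $count   # 3
```

Because the here document is a redirection on the loop (not a pipe), the loop runs in the current shell and $count survives it.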
KornShell 93
Variable Attributes
• By default variables hold strings of unlimited length
• Attributes can be set with typeset:
– readonly (-r) – cannot be changed
– export (-x) – value will be exported to env
– upper (-u) – letters will be converted to upper case
– lower (-l) – letters will be converted to lower case
– ljust (-L width) – left justify to given width
– rjust (-R width) – right justify to given width
– zfill (-Z width) – justify, fill with leading zeros
– integer (-i [base]) – value stored as integer
– float (-E [prec]) – value stored as C double
– nameref (-n) – a name reference
Name References
• A name reference is a type of variable that
references another variable.
• nameref is an alias for typeset -n
– Example:
user1="jeff"
user2="adam"
typeset -n name="user1"
print $name
jeff
New Parameter Expansion
• ${param/pattern/str} – Replace first pattern
with str
• ${param//pattern/str} – Replace all
patterns with str
• ${param:offset:len} – Substring with offset
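These expansions can be sketched as follows (assuming a ksh93 or bash shell; older /bin/sh implementations reject them):

```shell
msg="hello world"
one=${msg/o/0}      # replace first match:  hell0 world
all=${msg//o/0}     # replace all matches:  hell0 w0rld
mid=${msg:6:5}      # substring from offset 6, length 5: world
echo "$one | $all | $mid"
```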
Patterns Extended
• Additional pattern types, so that shell
patterns are equally expressive as
regular expressions
• Used for:
– file expansion
– [[ ]]
– case statements
– parameter expansion
(Slide table comparing Patterns with Regular Expressions omitted.)
ANSI C Quoting
• $'…' Uses C escape sequences
$'\t'
$'Hello\nthere'
• printf added that supports C like printing:
printf "You have %d apples" $x
• Extensions:
– %b – ANSI escape sequences
– %q – Quote argument for reinput
– \E – Escape character (033)
– %P – convert ERE to shell pattern
– %H – convert using HTML conventions
– %T – date conversions using date formats
Associative Arrays
• Arrays can be indexed by string
• Declared with typeset -A
• Set: name["foo"]="bar"
• Reference: ${name["foo"]}
• Subscripts: ${!name[@]}
Networking, HTTP, CGI
Network Application
• Client application and server application
communicate via a network protocol
• A protocol is a set of rules on how the client and
server communicate
(Diagram: web client and web server communicating
via HTTP, crossing the user/kernel boundary)
TCP/IP Suite
(Diagram: client and server at the application layer,
TCP/UDP at the transport layer, IP at the internet
layer, drivers/hardware at the network access layer
(ethernet))
Data Encapsulation
(Diagram: application-layer Data gains header H1 at
the transport layer, then H2 at the internet layer,
then H3 at the network access layer: H3 H2 H1 Data)
Network Access/Internet Layers
• Network Access Layer
– Deliver data to devices on the same physical network
– Ethernet
• Internet Layer
–
–
–
–
Internet Protocol (IP)
Determines routing of datagram
IPv4 uses 32-bit addresses (e.g. 128.122.20.15)
Datagram fragmentation and reassembly
Transport Layer
• Transport Layer
– Host-host layer
– Provides error-free, point-to-point connection between
hosts
• User Datagram Protocol (UDP)
– Unreliable, connectionless
• Transmission Control Protocol (TCP)
– Reliable, connection-oriented
– Acknowledgements, sequencing, retransmission
Ports
• Both TCP and UDP use 16-bit port numbers
• A server application listens on a specific port for
connections
• Ports used by popular applications are well-defined
– SSH (22), SMTP (25), HTTP (80)
– 1-1023 are reserved (well-known)
• Clients use ephemeral ports (OS dependent)
Name Service
• Every node on the network normally has a
hostname in addition to an IP address
• Domain Name System (DNS) maps IP
addresses to names
– e.g. 128.122.81.155 is access1.cims.nyu.edu
• DNS lookup utilities: nslookup, dig
• Local name-to-address mappings are stored in
/etc/hosts
Sockets
• Sockets provide access to TCP/IP on UNIX
systems
• Sockets are communications endpoints
• Invented in Berkeley UNIX
• Allows a network connection to be opened as a
file (returns a file descriptor)
(Diagram: sockets as the endpoints of a connection
between machine 1 and machine 2)
Major Network Services
• Telnet (Port 23)
– Provides virtual terminal for remote user
– The telnet program can also be used to connect to
other ports
• FTP (Port 20/21)
– Used to transfer files from one machine to another
– Uses port 20 for data, 21 for control
• SSH (Port 22)
– For logging in and executing commands on
remote machines
– Data is encrypted
Major Network Services cont.
• SMTP (Port 25)
– Host-to-host mail transport
– Used by mail transfer agents (MTAs)
• IMAP (Port 143)
– Allow clients to access and manipulate emails
on the server
• HTTP (Port 80)
– Protocol for WWW
Ksh93: /dev/tcp
• Files in the form /dev/tcp/hostname/port
result in a socket connection to the given
service:
exec 3<>/dev/tcp/smtp.cs.nyu.edu/25 #SMTP
print -u3 "EHLO cs.nyu.edu"
print -u3 "QUIT"
while IFS= read -u3
do
print –r "$REPLY"
done
HTTP
• Hypertext Transfer Protocol
– Use port 80
• Language used by web browsers (IE,
Netscape, Firefox) to communicate with
web servers (Apache, IIS)
HTTP request:
Get me this document
HTTP response:
Here is your document
Resources
• Web servers host web resources, including
HTML files, PDF files, GIF files, MPEG movies,
etc.
• Each web object has an associated MIME type
– HTML document has type text/html
– JPEG image has type image/jpeg
• Web resource is accessed using a Uniform
Resource Locator (URL)
– http://www.cs.nyu.edu:80/courses/fall06/G22.2245-001/index.html
(protocol: http, host: www.cs.nyu.edu, port: 80,
resource: /courses/fall06/G22.2245-001/index.html)
HTTP Transactions
• HTTP request to web server
GET /v40images/nyu.gif HTTP/1.1
Host: www.nyu.edu
• HTTP response to web client
HTTP/1.1 200 OK
Content-type: image/gif
Content-length: 3210
Sample HTTP Session
request:
GET / HTTP/1.1
HOST: www.cs.nyu.edu
response:
HTTP/1.1 200 OK
Date: Wed, 19 Oct 2005 06:59:49 GMT
Server: Apache/2.0.49 (Unix) mod_perl/1.99_14 Perl/v5.8.4
mod_ssl/2.0.49 OpenSSL/0.9.7e mod_auth_kerb/4.13 PHP/5.0.0RC3
Last-Modified: Thu, 12 Sep 2002 17:09:03 GMT
Content-Length: 163
Content-Type: text/html; charset=ISO-8859-1
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<title></title>
<meta HTTP-EQUIV="Refresh" CONTENT="0; URL=csweb/index.html">
<body>
</body>
</html>
Status Codes
• Status code in the HTTP response indicates if a
request is successful
• Some typical status codes:
200   OK
302   Found; resource at a different URI
401   Authorization required
403   Forbidden
404   Not Found
Gateways
• Interface between resource and a web
server
(Diagram: HTTP client → Web Server → Gateway → resource)
CGI
• Common Gateway Interface is a standard interface for
running helper applications to generate dynamic
contents
– Specify the encoding of data passed to programs
• Allow HTML documents to be created on the fly
• Transparent to clients
– Client sends regular HTTP request
– Web server receives HTTP request, runs CGI program, and
sends contents back in HTTP responses
• CGI programs can be written in any language
CGI Diagram
(Diagram: the client's HTTP request reaches the Web
Server, which spawns a CGI script process; the
document the script produces is returned to the
client in the HTTP response)
HTML
• Document format used on the web
<html>
<head>
<title>Some Document</title>
</head>
<body>
<h2>Some Topics</h2>
This is an HTML document
<p>
This is another paragraph
</body>
</html>
HTML
• HTML is a file format that describes a web
page.
• These files can be made by hand, or
generated by a program
• A good way to generate an HTML file is by
writing a shell script
Forms
• HTML forms are used to collect user input
• Data sent via HTTP request
• Server launches CGI script to process data
<form method=POST
action="http://www.cs.nyu.edu/~unixtool/cgi-bin/search.cgi">
Enter your query: <input type=text name=Search>
<input type=submit>
</form>
Input Types
• Text Field
<input type=text name=zipcode>
• Radio Buttons
<input type=radio name=size value="S"> Small
<input type=radio name=size value="M"> Medium
<input type=radio name=size value="L"> Large
• Checkboxes
<input type=checkbox name=extras value="lettuce"> Lettuce
<input type=checkbox name=extras value="tomato"> Tomato
• Text Area
<textarea name=address cols=50 rows=4>
…
</textarea>
Submit Button
• Submits the form for processing by the
CGI script specified in the form tag
<input type=submit value="Submit Order">
HTTP Methods
• Determine how form data are sent to web
server
• Two methods:
– GET
• Form variables stored in URL
– POST
• Form variables sent as content of HTTP request
Encoding Form Values
• Browser sends form variable as name-value
pairs
– name1=value1&name2=value2&name3=value3
• Names are defined in form elements
– <input type=text name=ssn maxlength=9>
• Special characters are replaced with %## (2-digit
hex number); spaces are replaced with +
– e.g. "10/20 Wed" is encoded as "10%2F20+Wed"
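A rough decoding sketch for the example value above — it handles only '+' and the %2F escape shown, not general %XX decoding:

```shell
raw='10%2F20+Wed'
# '+' becomes a space, %2F becomes '/'
decoded=$(printf '%s' "$raw" | sed -e 's/+/ /g' -e 's|%2F|/|g')
echo "$decoded"   # 10/20 Wed
```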
GET/POST examples
GET:
GET /cgi-bin/myscript.pl?name=Bill%20Gates&
company=Microsoft HTTP/1.1
HOST: www.cs.nyu.edu
POST:
POST /cgi-bin/myscript.pl HTTP/1.1
HOST: www.cs.nyu.edu
…other headers…
name=Bill%20Gates&company=Microsoft
GET or POST?
• GET method is useful for
– Retrieving information, e.g. from a database
– Embedding data in URL without form element
• POST method should be used for forms with
– Many fields or long fields
– Sensitive information
– Data for updating database
• GET requests may be cached by client
browsers or proxies, but not POST requests
Parsing Form Input
• Method stored in REQUEST_METHOD
• GET: Data encoded into QUERY_STRING
• POST: Data in standard input (from body
of request)
• Most scripts parse input into an
associative array
– You can parse it yourself
– Or use available libraries (better)
CGI Environment Variables
• DOCUMENT_ROOT
• HTTP_HOST
• HTTP_REFERER
• HTTP_USER_AGENT
• HTTP_COOKIE
• REMOTE_ADDR
• REMOTE_HOST
• REMOTE_USER
• REQUEST_METHOD
• SERVER_NAME
• SERVER_PORT
CGI Script: Example
Part 1: HTML Form
<html>
<center>
<H1>Anonymous Comment Submission</H1>
</center>
Please enter your comment below which will
be sent anonymously to <tt>[email protected]</tt>.
If you want to be extra cautious, access this
page through <a
href="http://www.anonymizer.com">Anonymizer</a>.
<p>
<form action=cgi-bin/comment.cgi method=post>
<textarea name=comment rows=20 cols=80>
</textarea>
<input type=submit value="Submit Comment">
</form>
</html>
Part 2: CGI Script (ksh)
#!/home/unixtool/bin/ksh
. cgi-lib.ksh   # Read special functions to help parse
ReadParse
PrintHeader
print -r -- "${Cgi.comment}" | /bin/mailx -s "COMMENT" kornj
print "<H2>You submitted the comment</H2>"
print "<pre>"
print -r -- "${Cgi.comment}"
print "</pre>"
Debugging
• Debugging can be tricky, since error
messages don't always print well as HTML
• One method: run interactively
$ QUERY_STRING='birthday=10/15/03'
$ ./birthday.cgi
Content-type: text/html
<html>
Your birthday is <tt>10/15/03</tt>.
</html>
How to get your script run
• This can vary by web server type
http://www.cims.nyu.edu/systems/resources/webhosting/index.html
• Typically, you give your script a name that
ends with .cgi
• Give the script execute permission
• Specify the location of that script in the
URL
CGI Security Risks
• Sometimes CGI scripts run as owner of the
scripts
• Never trust user input - sanity-check
everything
• If a shell command contains user input, run
without shell escapes
• Always encode sensitive information, e.g.
passwords
– Also use HTTPS
• Clean up - don’t leave sensitive data around
CGI Benefits
• Simple
• Language independent
• UNIX tools are good for this because
– Work well with text
– Integrate programs well
– Easy to prototype
– No compilation (CGI scripts)
Example: Find words in
Dictionary
<form action=dict.cgi>
Regular expression: <input type=text
name=re value=".*">
<input type=submit>
</form>
Example: Find words in
Dictionary
#!/home/unixtool/bin/ksh
PATH=$PATH:.
. cgi-lib.ksh
ReadParse
PrintHeader
print "<H1> Words matching <tt>${Cgi.re}</tt> in the dictionary
</H1>\n";
print "<OL>"
grep "${Cgi.re}" /usr/dict/words | while read word
do
print "<LI> $word"
done
print "</OL>"
What is Perl?
• Practical Extraction and Report Language
• Scripting language created by Larry Wall in
the mid-80s
• Functionality and speed somewhere between
low-level languages (like C) and high-level
ones (like shell)
• Influence from awk, sed, and C Shell
• Easy to write (after you learn it), but
sometimes hard to read
• Widely used in CGI scripting
A Simple Perl Script
hello:
#!/usr/bin/perl -w    # -w turns on warnings
print "Hello, world!\n";
$ chmod a+x hello
$ ./hello
Hello, world!
$ perl -e 'print "Hello, world!\n";'
Hello, world!
Another Perl Script
$;=$_;$/='0#](.+,a()$=(\}$+_c2$sdl[h*du,(1ri)b$2](n}
/1)1tfz),}0(o{=4s)1rs(2u;2(u",bw-2b $
hc7s"tlio,tx[{ls9r11$e(1(9]q($,$2)=)_5{4*s{[9$,lh$2,_.(i
a]7[11f=*2308t$$)]4,;d/{}83f,)s,65o@*ui),rt$bn;5(=_stf*0
l[t(o$.o$rsrt.c!(i([$a]$n$2ql/d(l])t2,$.+{i)$_.$zm+n[6t(
e1+26[$;)+]61_l*,*)],([email protected])/z1_0+=)(2,,4c*2)\5,h$4;$
91r_,pa,)$[4r)$=_$6i}tc}!,n}[h$]$t
0rd)_$';open(eval$/);$_=<0>;for($x=2;$x<666;$a.=++$x){s}
{{.|.}};push@@,$&;$x==5?$z=$a:++$}}for(++$/..substr($a,1
885)){$p+=7;$;.=$@[$p%substr($a,$!,3)+11]}eval$;
Data Types
• Basic types: scalars, lists, hashes
• Supports OO programming and user-defined types
What Type?
• Type of variable determined by special
leading character
$foo    scalar
@foo    list
%foo    hash
&foo    function
• Data types have separate name spaces
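Because each sigil has its own namespace, one name can simultaneously refer to a scalar, an array, and a hash. A small sketch:

```perl
#!/usr/bin/perl -w
use strict;

# $foo, @foo, and %foo are three independent variables.
my $foo = "scalar";
my @foo = ("a", "list");
my %foo = (kind => "hash");

print "$foo\n";        # prints "scalar"  (the scalar $foo)
print "$foo[1]\n";     # prints "list"    (element 1 of @foo)
print "$foo{kind}\n";  # prints "hash"    (value from %foo)
```

Note that element access uses the $ sigil ($foo[1], $foo{kind}) because a single element is a scalar.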
Scalars
• Can be numbers
$num = 100;
$num = 223.45;
$num = -1.3e38;
• Can be strings
$str = 'unix tools';
$str = 'Who\'s there?';
$str = "good evening\n";
$str = "one\ttwo";
• Backslash escapes and variable names are interpreted
inside double quotes
Special Scalar Variables
$0      Name of script
$_      Default variable
$$      Current PID
$?      Status of last pipe or system call
$!      System error message
$/      Input record separator
$.      Input record number
undef   Acts like 0 or empty string
Operators
• Numeric: + - * / % **
• String concatenation: .
$state = "New" . "York";   # "NewYork"
• String repetition: x
print "bla" x 3;           # blablabla
• Binary assignments:
$val = 2; $val *= 3;       # $val is 6
$state .= "City";          # "NewYorkCity"
Comparison Operators
Comparison                  Numeric   String
Equal                       ==        eq
Not Equal                   !=        ne
Greater than                >         gt
Less than                   <         lt
Less than or equal to       <=        le
Greater than or equal to    >=        ge
Boolean “Values”
if ($ostype eq "unix") { … }
if ($val) { … }
• No boolean data type
• undef is false
• 0 is false; Non-zero numbers are true
• ‘’ and ‘0’ are false; other strings are true
• The unary not (!) negates the boolean
value
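These rules hide one classic trap worth seeing in action: the string "0" is false, but "0.0" (a non-empty string other than "0") is true.

```perl
#!/usr/bin/perl -w
use strict;

# Walk through the truth rules listed above.
for my $v (0, 1, "", "0", "0.0", "false") {
    print "'$v' is ", ($v ? "true" : "false"), "\n";
}
# '0' and '' are false; '0.0' and 'false' are true
```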
undef and defined
$f = 1;
while ($n < 10) {
# $n is undef at 1st iteration
$f *= ++$n;
}
• Use defined to check if a value is undef
if (defined($val)) { … }
Lists and Arrays
• List: ordered collection of scalars
• Array: Variable containing a list
• Each element is a scalar variable
• Indices are integers starting at 0
Array/List Assignment
@teams = ("Knicks", "Nets", "Lakers");
print $teams[0];        # print Knicks
$teams[3] = "Celtics";  # add new elt
@foo = ();              # empty list
@nums = (1..100);       # list of 1-100
@arr = ($x, $y*6);
($a, $b) = ("apple", "orange");
($a, $b) = ($b, $a);    # swap $a $b
@arr1 = @arr2;
More About Arrays and Lists
• Quoted words - qw
@planets = qw/ earth mars jupiter /;
@planets = qw{ earth mars jupiter };
• Last element’s index: $#planets
– Not the same as number of elements in array!
• Last element: $planets[-1]
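The distinction between $#planets and the element count can be seen directly:

```perl
#!/usr/bin/perl -w
use strict;

my @planets = qw/ earth mars jupiter /;

print $#planets, "\n";        # 2 -- index of the last element
print scalar @planets, "\n";  # 3 -- number of elements
print $planets[-1], "\n";     # jupiter, same as $planets[$#planets]
```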
Scalar and List Context
@colors = qw< red green blue >;
• Array interpolated as string:
print "My favorite colors are @colors\n";
• Prints My favorite colors are red green blue
• Array in scalar context returns the number of
elements in the list
$num = @colors + 5;
# $num gets 8
• Scalar expression in list context
@num = 88;
# a one-element list (88)
pop and push
• push and pop: arrays used as stacks
• push adds element to end of array
@colors = qw# red green blue #;
push(@colors, "yellow");
# same as
@colors = (@colors, "yellow");
push @colors, @more_colors;
• pop removes last element of array and returns it
$lastcolor = pop(@colors);
shift and unshift
• shift and unshift: similar to push and pop on
the “left” side of an array
• unshift adds elements to the beginning
@colors = qw# red green blue #;
unshift @colors, "orange";
• First element is now "orange"
• shift removes element from beginning
$c = shift(@colors);  # $c gets "orange"
sort and reverse
• reverse returns list with elements in reverse
order
@list1 = qw# NY NJ CT #;
@list2 = reverse(@list1); # (CT,NJ,NY)
• sort returns list with elements in ASCII order
@day = qw/ tues wed thurs /;
@sorted = sort(@day); #(thurs,tues,wed)
@nums = sort 1..10; # 1 10 2 3 … 8 9
• reverse and sort do not modify their
arguments
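The ASCII ordering shown above is a common gotcha with numbers; passing sort a comparison block with the numeric operator <=> fixes it:

```perl
#!/usr/bin/perl -w
use strict;

my @nums = (10, 2, 33, 4);

my @ascii   = sort @nums;               # string order:  (10, 2, 33, 4)
my @numeric = sort { $a <=> $b } @nums; # numeric order: (2, 4, 10, 33)

print "@ascii\n";     # 10 2 33 4
print "@numeric\n";   # 2 4 10 33
```

Inside the block, $a and $b are the two elements being compared; cmp is the string counterpart of <=>.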
Iterate over a List
• foreach loops through a list of values
@teams = qw# Knicks Nets Lakers #;
foreach $team (@teams) {
print "$team win\n";
}
• Value of control variable restored at end of
loop
• Synonym for the for keyword
• $_ is the default
foreach (@teams) {
  $_ .= " win\n";
  print;   # print $_
}
Hashes
• Associative arrays - indexed by strings (keys)
$cap{"Hawaii"} = "Honolulu";
%cap = ( "New York", "Albany", "New Jersey",
         "Trenton", "Delaware", "Dover" );
• Can use => (big arrow or comma arrow) in place of , (comma)
%cap = ( "New York"   => "Albany",
         "New Jersey" => "Trenton",
         "Delaware"   => "Dover" );
Hash Element Access
• $hash{$key}
print $cap{"New York"};
print $cap{"New " . "York"};
• Unwinding the hash
@cap_arr = %cap;
– Gets unordered list of key-value pairs
• Assigning one hash to another
%cap2 = %cap;
%cap_of = reverse %cap;
print $cap_of{"Trenton"};   # New Jersey
Hash Functions
• keys returns a list of keys
@state = keys %cap;
• values returns a list of values
@city = values %cap;
• Use each to iterate over all (key, value) pairs
while ( ($state, $city) = each %cap ) {
  print "Capital of $state is $city\n";
}
Hash Element Interpolation
• Unlike a list, entire hash cannot be interpolated
print "%cap\n";
– Prints %cap followed by a newline
• Individual elements can
foreach $state (sort keys %cap) {
  print "Capital of $state is $cap{$state}\n";
}
More Hash Functions
• exists checks if a hash element has ever been initialized
print "Exists\n" if exists $cap{"Utah"};
– Can be used for array elements
– A hash or array element can only be defined if it exists
• delete removes a key from the hash
delete $cap{"New York"};
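The difference between exists and defined shows up when a key holds undef; a short sketch (the undef-valued key is contrived for the demo):

```perl
#!/usr/bin/perl -w
use strict;

my %cap = ( "Hawaii" => "Honolulu", "Guam" => undef );

# "Guam" exists as a key but its value is undef:
print exists  $cap{"Guam"} ? "exists\n"  : "no key\n";   # exists
print defined $cap{"Guam"} ? "defined\n" : "undef\n";    # undef

delete $cap{"Guam"};
print exists $cap{"Guam"} ? "exists\n" : "deleted\n";    # deleted
```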
Merging Hashes
• Method 1: Treat them as lists
%h3 = (%h1, %h2);
• Method 2 (save memory): Build a new hash
by looping over all elements
%h3 = ();
while (($k, $v) = each(%h1)) {
  $h3{$k} = $v;
}
while (($k, $v) = each(%h2)) {
  $h3{$k} = $v;
}
Subroutines
• sub myfunc { … }
$name = "Jane";
…
sub print_hello {
  print "Hello $name\n";   # global $name
}
&print_hello;      # print "Hello Jane"
print_hello;       # print "Hello Jane"
print_hello();     # print "Hello Jane"
Arguments
• Parameters are assigned to the special array
@_
• Individual parameter can be accessed as
$_[0], $_[1], …
sub sum {
  my $x;            # private variable $x
  foreach (@_) {    # iterate over params
    $x += $_;
  }
  return $x;
}
$n = &sum(3, 10, 22);   # $n gets 35
More on Parameter Passing
• Any number of scalars, lists, and hashes can
be passed to a subroutine
• Lists and hashes are “flattened”
func($x, @y, %z);
– Inside func:
• $_[0] is $x
• $_[1] is $y[0]
• $_[2] is $y[1], etc.
• Scalars in @_ are implicit aliases (not copies)
of the ones passed — changing values of
$_[0], etc. changes the original variables
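The aliasing described above means a subroutine can modify its caller's variables through @_; a minimal demonstration (double_first is a made-up name):

```perl
#!/usr/bin/perl -w
use strict;

# $_[0] is an alias for the first argument, not a copy of it.
sub double_first {
    $_[0] *= 2;
}

my $x = 21;
double_first($x);
print "$x\n";   # prints 42 -- the caller's $x was changed
```

Copying @_ into lexicals (my ($a) = @_;) is the usual way to avoid accidental modification.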
Return Values
• The return value of a subroutine is the last
expression evaluated, or the value returned by
the return operator
sub myfunc {
  my $x = 1;
  $x + 2;         # returns 3
}
sub myfunc {
  my $x = 1;
  return $x + 2;
}
• Can also return a list: return @somelist;
• If return is used without an expression (failure), undef or () is returned depending on context
Lexical Variables
• Variables can be scoped to the enclosing block
with the my operator
sub myfunc {
  my $x;
  my($a, $b) = @_;   # copy params
  …
}
• Can be used in any block, such as if block or
while block
– Without enclosing block, the scope is the source file
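Block scoping can be seen with two variables of the same name: the inner my creates a new variable that vanishes at the closing brace.

```perl
#!/usr/bin/perl -w
use strict;

my $n = 1;             # file-scoped lexical
{
    my $n = 2;         # a new $n, visible only inside this block
    print "inner: $n\n";   # inner: 2
}
print "outer: $n\n";       # outer: 1 -- the inner $n is gone
</n>
```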
use strict
• The use strict pragma enforces some
good programming rules
– All new variables need to be declared with my
#!/usr/bin/perl -w
use strict;
$n = 1;   # <-- perl will complain
Another Subroutine Example
@nums = (1, 2, 3);
$num = 4;
@res = dec_by_one(@nums, $num);   # @res = (0, 1, 2, 3)
                                  # (@nums, $num) = (1, 2, 3, 4)
minus_one(@nums, $num);           # (@nums, $num) = (0, 1, 2, 3)
sub dec_by_one {
  my @ret = @_;                   # make a copy
  for my $n (@ret) { $n-- }
  return @ret;
}
sub minus_one {
  for (@_) { $_-- }
}
Reading from STDIN
• STDIN is the builtin filehandle to the std input
• Use the line input operator around a file handle
to read from it
$line = <STDIN>;   # read next line
chomp($line);
• chomp removes the trailing string that corresponds to the value of $/ (usually the newline character)
Reading from STDIN example
while (<STDIN>) {
  chomp;
  print "Line $. ==> $_\n";
}
Line 1 ==> [Contents of line 1]
Line 2 ==> [Contents of line 2]
…
<>
• Diamond operator < > helps Perl programs
behave like standard Unix utilities (cut, sed,
…)
• Lines are read from list of files given as
command line arguments (@ARGV), otherwise
from stdin
while (<>) {
  chomp;
  print "Line $. from $ARGV is $_\n";
}
• ./myprog file1 file2 – Read from file1, then file2 (standard input is not read when files are given)
• $ARGV is the current filename
Filehandles
• Use open to open a file for reading/writing
open LOG, "syslog";     # read
open LOG, "<syslog";    # read
open LOG, ">syslog";    # write
open LOG, ">>syslog";   # append
• When you’re done with a filehandle, close
it
close LOG;
Errors
• When a fatal error is encountered, use die to
print out error message and exit program
die "Something bad happened\n" if ….;
• Always check return value of open
open LOG, ">>syslog"
  or die "Cannot open log: $!";
• For non-fatal errors, use warn instead
warn "Temperature is below 0!" if $temp < 0;
Reading from a File
open MSG, "/var/log/messages"
  or die "Cannot open messages: $!\n";
while (<MSG>) {
chomp;
# do something with $_
}
close MSG;
Reading Whole File
• In scalar context, <FH> reads the next line
$line = <LOG>;
• In list context, <FH> reads all remaining lines
@lines = <LOG>;
• Undefine $/ to read the rest of file as a string
undef $/;
$all_lines = <LOG>;
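A safer variant of this slurp idiom localizes $/ so the change is undone automatically at the end of the block; a sketch (slurp is a made-up helper name):

```perl
#!/usr/bin/perl -w
use strict;

# Read a whole file into one string.  "local $/" undefines the input
# record separator only until the enclosing block ends, then restores it.
sub slurp {
    my ($file) = @_;
    local $/;                        # now <$fh> reads to end of file
    open my $fh, "<", $file or die "Cannot open $file: $!";
    my $all = <$fh>;
    close $fh;
    return $all;
}
```

The three-argument open with a lexical filehandle also avoids clobbering a global bareword handle like LOG.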
Writing to a File
open LOG, ">/tmp/log"
  or die "Cannot create log: $!";
print LOG "Some log messages…\n";    # no comma after filehandle
printf LOG "%d entries processed.\n", $num;
close LOG;
File Tests examples
die "The file $filename is not readable" if ! -r $filename;
warn "The file $filename is not owned by you" unless -o $filename;
print "This file is old" if -M $filename > 365;
File Tests list
-r   File or directory is readable
-w   File or directory is writable
-x   File or directory is executable
-o   File or directory is owned by this user
-e   File or directory exists
-z   File exists and has zero size
-s   File or directory exists and has nonzero size (value in bytes)
File Tests list
-f   Entry is a plain file
-d   Entry is a directory
-l   Entry is a symbolic link
-M   Modification age (in days)
-A   Access age (in days)
• $_ is the default operand
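Because file tests default to $_, a directory scan stays compact; a sketch that lists the plain files in the current directory:

```perl
#!/usr/bin/perl -w
use strict;

# File tests apply to $_ when no operand is given.
foreach (glob "*") {
    next unless -f;                    # skip anything but plain files
    print "$_: ", -s $_, " bytes\n";   # -s returns the size in bytes
}
```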
Manipulating Files and Dirs
• unlink removes files
unlink "file1", "file2"
  or warn "failed to remove file: $!";
• rename renames a file
rename "file1", "file2";
• link creates a new (hard) link
link "file1", "file2"
  or warn "can't create link: $!";
• symlink creates a soft link
symlink "file1", "file2" or warn " … ";
Manipulating Files and Dirs cont.
• mkdir creates directory
mkdir "mydir", 0755
  or warn "Cannot create mydir: $!";
• rmdir removes empty directories
rmdir "dir1", "dir2", "dir3";
• chmod modifies permissions on file or directory
chmod 0600, "file1", "file2";
if - elsif - else
• if … elsif … else …
if ( $x > 0 ) {
  print "x is positive\n";
}
elsif ( $x < 0 ) {
  print "x is negative\n";
}
else {
  print "x is zero\n";
}
unless
• Like the opposite of if
unless ($x < 0) {
  print "$x is non-negative\n";
}
unlink $file unless -A $file < 100;
while and until
while ($x < 100) {
$y += $x++;
}
• until is like the opposite of while
until ($x >= 100) {
$y += $x++;
}
for
• for (init; test; incr) { … }
# sum of squares of 1 to 5
for ($i = 1; $i <= 5; $i++) {
$sum += $i*$i;
}
next
• next skips the remaining of the current
iteration (like continue in C)
# only print non-blank lines
while (<>) {
  if ($_ eq "\n") { next; }
  else { print; }
}
last
• last exits loop immediately (like break in C)
# print up to first blank line
while (<>) {
  if ($_ eq "\n") { last; }
  else { print; }
}
Logical AND/OR
• Logical AND : &&
if (($x > 0) && ($x < 10)) { … }
• Logical OR : ||
if (($x < 0) || ($x > 0)) { … }
• Both are short-circuit — second
expression evaluated only if necessary
Ternary Operator
• Same as the ternary operator (?:) in C
• expr1 ? expr2 : expr3
• Like if-then-else: If expr1 is true, expr2 is
used; otherwise expr3 is used
$weather = ($temp > 50) ? "warm" : "cold";
Regular Expressions
• Use EREs (egrep style)
• Plus the following character classes
– \w   "word" characters: [A-Za-z0-9_]
– \d   digits: [0-9]
– \s   whitespace: [\f\t\n\r ]
– \b   word boundaries
– \W, \D, \S, \B are complements of the corresponding classes above
• Can use \t to denote a tab
Backreferences
• Support backreferences
• Subexpressions are referred to using \1, \2,
etc. in the RE and $1, $2, etc. outside RE
if (/^this (red|blue|green) (bat|ball) is \1/)
{
($color, $object) = ($1, $2);
}
Matching
• Pattern match operator: /RE/ is shortcut of m/RE/
– Returns true if there is a match
– Match against $_
– Can also use m(RE), m<RE>, m!RE!, etc.
if (/^\/usr\/local\//) { … }
if (m%/usr/local/%) { … }
• Case-insensitive match
if (/new york/i) { … };
Matching cont.
• To match an RE against something other than
$_, use the binding operator =~
if ($s =~ /\bblah/i) {
  print "Found blah!";
}
• !~ negates the match
while (<STDIN> !~ /^#/) { … }
• Variables are interpolated inside REs
if (/^$word/) { … }
Substitutions
• Sed-like search and replace with s///
s/red/blue/;
$x =~ s/\w+$/$`/;
– m/// does not modify variable; s/// does
• Global replacement with /g
s/(.)\1/$1/g;
• Transliteration operator: tr/// or y///
tr/A-Z/a-z/;
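Because tr/// returns the number of characters it matched, it doubles as a fast character counter; a small sketch on a DNA string (a made-up sequence, in keeping with the course):

```perl
#!/usr/bin/perl -w
use strict;

my $dna = "ACGTTGCAACGT";

# Count G and C without changing $dna (empty replacement list):
my $gc = ($dna =~ tr/GC//);
print "GC count: $gc\n";    # GC count: 6

# Transliterate T to U (DNA -> RNA) on a copy:
(my $rna = $dna) =~ tr/T/U/;
print "RNA: $rna\n";        # RNA: ACGUUGCAACGU
```

The (my $copy = $orig) =~ tr/…/…/ idiom modifies the copy, leaving the original intact.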
RE Functions
• split string using RE (whitespace by default)
@fields = split /:/, "::ab:cde:f";
# gets ("", "", "ab", "cde", "f")
• join strings into one
$str = join "-", @fields;
# gets "--ab-cde-f"
• grep something from a list
– Similar to UNIX grep, but not limited to using RE
@selected = grep(!/^#/, @code);
@matched = grep { $_>100 && $_<150 } @nums;
– Modifying elements in returned list actually
modifies the elements in the original list
Running Another program
• Use the system function to run an external
program
• With one argument, the shell is used to run the
command
– Convenient when redirection is needed
$status = system("cmd1 args > file");
• To avoid the shell, pass system a list
$status = system($prog, @args);
die "$prog exited abnormally: $?" unless $status == 0;
Capturing Output
• If output from another program needs to
be collected, use the backticks
my $files = `ls *.c`;   # all output collected into a single string
my @files = `ls *.c`;   # each element is an output line
• The shell is invoked to run the command
Environment Variables
• Environment variables are stored in the
special hash %ENV
$ENV{'PATH'} = "/usr/local/bin:$ENV{'PATH'}";
Example: Word Frequency
#!/usr/bin/perl -w
# Read a list of words (one per line) and
# print the frequency of each word
use strict;
my(@words, %count, $word);
chomp(@words = <STDIN>); # read and chomp all lines
for $word (@words) {
$count{$word}++;
}
for $word (keys %count) {
  print "$word was seen $count{$word} times.\n";
}
Good Ways to Learn Perl
• a2p
– Translates an awk program to Perl
• s2p
– Translates a sed script to Perl
• perldoc
– Online Perl documentation
$ perldoc perldoc       # perldoc man page
$ perldoc perlintro     # Perl introduction
$ perldoc -f sort       # Perl sort function man page
$ perldoc CGI           # CGI module man page
Modules
• Perl modules are libraries of reusable
code with specific functionalities
• Standard modules are distributed with Perl; others can be obtained from CPAN
• Include modules in your program with use,
e.g. use CGI incorporates the CGI module
• Each module has its own namespace
CGI Programming
Forms
• HTML forms are used to collect user input
• Data sent via HTTP request
• Server launches CGI script to process data
<form method=POST
action="http://www.cs.nyu.edu/~unixtool/cgi-bin/search.cgi">
Enter your query: <input type=text name=Search>
<input type=submit>
</form>
Input Types
• Text Field
<input type=text name=zipcode>
• Radio Buttons
<input type=radio name=size value="S"> Small
<input type=radio name=size value="M"> Medium
<input type=radio name=size value="L"> Large
• Checkboxes
<input type=checkbox name=extras value="lettuce"> Lettuce
<input type=checkbox name=extras value="tomato"> Tomato
• Text Area
<textarea name=address cols=50 rows=4>
…
</textarea>
Submit Button
• Submits the form for processing by the
CGI script specified in the form tag
<input type=submit value="Submit Order">
HTTP Methods
• Determine how form data are sent to web
server
• Two methods:
– GET
• Form variables stored in URL
– POST
• Form variables sent as content of HTTP request
Encoding Form Values
• Browser sends form variable as name-value
pairs
– name1=value1&name2=value2&name3=value3
• Names are defined in form elements
– <input type=text name=ssn maxlength=9>
• Special characters are replaced with %## (2-digit hex number), spaces replaced with +
– e.g. "11/8 Wed" is encoded as "11%2F8+Wed"
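The decoding direction can be done by hand in two substitutions (the CGI module normally handles this for you):

```perl
#!/usr/bin/perl -w
use strict;

my $value = "11%2F8+Wed";            # the encoded example above
$value =~ tr/+/ /;                   # '+' back to a space
$value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %2F back to '/'
print "$value\n";                    # 11/8 Wed
```

The /e modifier evaluates the replacement as Perl code, so chr(hex($1)) turns each two-digit hex escape back into its character.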
HTTP GET/POST examples
GET:
GET /cgi-bin/myscript.pl?name=Bill%20Gates&
company=Microsoft HTTP/1.1
HOST: www.cs.nyu.edu
POST:
POST /cgi-bin/myscript.pl HTTP/1.1
HOST: www.cs.nyu.edu
…other headers…
name=Bill%20Gates&company=Microsoft
GET or POST?
• GET method is useful for
– Retrieving information, e.g. from a database
– Embedding data in URL without form element
• POST method should be used for forms with
– Many fields or long fields
– Sensitive information
– Data for updating database
• GET requests may be cached by client browsers or proxies, but not POST requests
Parsing Form Input
• Method stored in HTTP_METHOD
• GET: Data encoded into QUERY_STRING
• POST: Data in standard input (from body
of request)
• Most scripts parse input into an
associative array
– You can parse it yourself
– Or use available libraries (better)
CGI Script: Example
Part 1: HTML Form
<html>
<center>
<H1>Anonymous Comment Submission</H1>
</center>
Please enter your comment below which will
be sent anonymously to <tt>[email protected]</tt>.
If you want to be extra cautious, access this
page through <a
href="http://www.anonymizer.com">Anonymizer</a>.
<p>
<form action=cgi-bin/comment.cgi method=post>
<textarea name=comment rows=20 cols=80>
</textarea>
<input type=submit value="Submit Comment">
</form>
</html>
Part 2: CGI Script (ksh)
#!/home/unixtool/bin/ksh
. cgi-lib.ksh      # Read special functions to help parse
ReadParse
PrintHeader
print -r -- "${Cgi.comment}" | /bin/mailx -s "COMMENT" kornj
print "<H2>You submitted the comment</H2>"
print "<pre>"
print -r -- "${Cgi.comment}"
print "</pre>"
Perl CGI Module
• Interface for parsing and interpreting query
strings passed to CGI scripts
• Methods for generating HTML
• Methods to handle errors in CGI scripts
• Two interfaces: procedural and OO
– Ask for the procedural interface:
use CGI qw(:standard);
A Perl CGI Script
#!/usr/bin/perl -w
use strict;
use CGI qw(:standard);
my $bday = param("birthday");
# Print headers (text/html is the default)
print header(-type => 'text/html');
# Print <html>, <head>, <title>, <body> tags etc.
print start_html("Birthday");
# Your HTML body
print "Your birthday is $bday.\n";
# Print </body></html>
print end_html();
Debugging Perl CGI Scripts
• Debugging CGI script is tricky - error
messages don’t always come up on your
browser
• Check if the script compiles
$ perl -wc cgiScript
• Run script with test data
$ perl -w cgiScript prod="MacBook" price="1800"
Content-Type: text/html
<html>
…
</html>