6. OTHER ISSUES

6.1. I have a certain problem that stumps me. Where can I get help?

Post your question on the "sed-users" mailing list (section 2.3.2), where many sed users will be able to see your question. You will have to subscribe to have posting privileges.

Your other alternative is one of these newsgroups:

  • alt.comp.editors.batch
  • comp.editors
  • comp.unix.questions
  • comp.unix.shell

6.2. How does sed compare with awk, perl, and other utilities?

Awk is a much richer language with many features of a programming language, including variable names, math functions, arrays, system calls, etc. Its command structure is similar to that of sed:

      address { command(s) }

which means that for each line or range of lines that matches the address, execute the command(s). In both sed and awk, an address can be a line number or a RE somewhere on the line, or both.
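As a rough illustration of the shared "address { command }" structure, here is a minimal sketch that selects the same lines with sed and with awk. The sample input and the variable names are made up for this example, and it assumes a POSIX sed and awk are on the PATH.

```shell
# Hypothetical sample input: four lines.
input=$(printf 'apple\nbanana\ncherry\ndate\n')

# Address by line-number range: print lines 2 through 3.
sed_range=$(printf '%s\n' "$input" | sed -n '2,3p')
awk_range=$(printf '%s\n' "$input" | awk 'NR==2, NR==3 { print }')

# Address by regular expression: print lines containing "an".
sed_re=$(printf '%s\n' "$input" | sed -n '/an/p')
awk_re=$(printf '%s\n' "$input" | awk '/an/ { print }')

printf '%s\n' "$sed_range" "$sed_re"
```

In both tools the part before the command is the address and the part after it is what gets executed for matching lines.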

In program size, awk is 3-10 times larger than sed. Awk has most of the functions of sed, but not all. Notably, sed supports backreferences (\1, \2, ...) to previous expressions, and awk does not have any comparable syntax. (One exception: GNU awk v3.0 introduced gensub(), which supports backreferences only on substitutions.)
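For readers who have not used backreferences, the following sketch swaps two words with \1 and \2. The input string is invented for the example. The gawk line in the comment is shown only for comparison and is not run here, since it requires GNU awk.

```shell
# Swap two words using backreferences \1 and \2 (basic regular expressions).
swapped=$(echo 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')
echo "$swapped"

# Rough GNU awk counterpart, for comparison only (needs gawk):
#   echo 'hello world' | gawk '{ print gensub(/(hello) (world)/, "\\2 \\1", 1) }'
```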

Perl is a general-purpose programming language, with many features beyond text processing and interprocess communication, taking it well past awk or other scripting languages. Perl supports every feature sed does and has its own set of extended regular expressions, which give it extensive power in pattern matching and processing. (Note: the standard perl distribution comes with 's2p', a sed-to-perl conversion script. See section 3.6 for more info.) Like sed and awk, perl scripts do not need to be compiled into binary code. Like sed, perl can also run many useful "one-liners" from the command line, though with greater flexibility; see question 4.41 ("How do I make substitutions in every file in a directory, or in a complete directory tree?").

On the other hand, the current version of perl is from 8 to 35 times larger than sed in its executables alone (perl's library modules and allied files not included!). Further, for most simple tasks such as substitution, sed executes more quickly than either perl or awk. All these utilities serve to process input text, transforming it to meet our needs . . . or our arbitrary whims.

6.3. When should I use sed?

When you need a small, fast program to modify words, lines, or blocks of lines in a text file.

6.4. When should I NOT use sed?

You should not use sed when you have "dedicated" tools which can do the job faster or with an easier syntax. Do not use sed when you only want to:

  • print individual lines, based on patterns within the line itself. Instead, use "grep".
  • print blocks of lines, with 1 or more lines of context above or below a specific regular expression. Instead, use the GNU version of grep as follows:
            grep -A{number} -B{number} "regex"
    
  • remove individual lines, based on patterns within the line itself. Instead, use "grep -v".
  • print line numbers. Instead, use "nl" or "cat -n".
  • reformat lines or paragraphs. Instead, use "fold", "fmt" or "par".
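To make the first and third items concrete, here is a small sketch comparing sed with grep on the same input. The sample data and variable names are assumptions made for this example only.

```shell
# Hypothetical sample input: three lines.
input=$(printf 'one\ntwo\nthree\n')

# Print lines matching a pattern: sed works, but grep is the dedicated tool.
sed_print=$(printf '%s\n' "$input" | sed -n '/t/p')
grep_print=$(printf '%s\n' "$input" | grep 't')

# Remove lines matching a pattern: sed '/re/d' versus grep -v.
sed_delete=$(printf '%s\n' "$input" | sed '/t/d')
grep_delete=$(printf '%s\n' "$input" | grep -v 't')

printf '%s\n' "$grep_print" "$grep_delete"
```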

The tr utility is also more suited than sed to some simple tasks. For example, to:

  • delete individual characters. Instead of "s/[a-d]//g", use
            tr -d "a-d"
    
  • squeeze sequential characters. Instead of "s/ee*/e/g", use
            tr -s "{character-set}"
    
  • change individual characters. Instead of "y/abcdef/ABCDEF/", use
            tr "[a-f]" "[A-F]"
    

Note, however, that tr does not support giving input files on the command line, so the syntax is:

     tr {options-and-patterns} < input-file

or, to process multiple files:

     cat input-file1 input-file2 | tr {options-and-patterns}
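The three tr substitutes above can be checked side by side with their sed counterparts. This is only a sketch: the temporary file and its contents are invented for the example, and the ranges are written without brackets, which is what a POSIX tr expects (older System V versions of tr may differ).

```shell
# Hypothetical input file with two lines.
tmp=$(mktemp)
printf 'banana bread\nbeeee keeper\n' > "$tmp"

# Delete individual characters a through d.
sed_del=$(sed 's/[a-d]//g' "$tmp")
tr_del=$(tr -d 'a-d' < "$tmp")

# Squeeze runs of the letter "e" down to a single "e".
sed_squeeze=$(sed 's/ee*/e/g' "$tmp")
tr_squeeze=$(tr -s 'e' < "$tmp")

# Change individual characters a-f to upper case.
sed_upper=$(sed 'y/abcdef/ABCDEF/' "$tmp")
tr_upper=$(tr 'a-f' 'A-F' < "$tmp")

rm -f "$tmp"
printf '%s\n' "$tr_del" "$tr_squeeze" "$tr_upper"
```

Note that sed takes the file name as an argument, while tr has to read it from standard input.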

If you have multiple files, using tr instead of sed is often more of an exercise than a useful thing. Although sed can perfectly emulate certain functions of cat, grep, nl, rev, sort, tac, tail, tr, uniq, and other utilities, producing identical output, the native utilities are usually optimized to do the job more quickly than sed.
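As a sketch of what "emulating" other utilities looks like, here are three well-known sed one-liners that behave like head, tac, and wc -l. The three-line input is made up for the example. In practice the native utilities are usually the better choice, as noted above.

```shell
# Hypothetical sample input: three lines.
input=$(printf 'a\nb\nc\n')

# Like "head -n 2": quit after printing line 2.
first_two=$(printf '%s\n' "$input" | sed 2q)

# Like "tac": build the file in reverse in the hold space, print at the end.
reversed=$(printf '%s\n' "$input" | sed '1!G;h;$!d')

# Like "wc -l": print the line number of the last line.
count=$(printf '%s\n' "$input" | sed -n '$=')

printf '%s\n' "$first_two" "$reversed" "$count"
```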

6.5. When should I ignore sed and use awk or Perl instead?

If you can write the same script in awk or Perl and do it in less time, then use Perl or awk. There's no reason to spend an hour writing and debugging a sed script if you can do it in Perl in 10 minutes (assuming that you know Perl already) and if the processing time or memory use is not a factor. Don't hunt pheasants with a .22 if you have a shotgun at your side . . . unless you simply enjoy the challenge!

Specifically, use awk or perl if you need to:

  • count fields or words on a line. (awk)
  • count lines in a block or objects in a file.
  • check lengths of strings or do math operations.
  • handle very long lines or need very large buffers. (or gsed)
  • handle binary data (control characters). (perl: binmode)
  • loop through an array or list.
  • test for file existence, filesize, or fileage.
  • treat each paragraph as a line. (well, not always)
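A few of these tasks are one-liners in awk, which is a fair hint that sed is the wrong tool for them. The sketch below assumes a POSIX awk and uses invented sample data.

```shell
# Count fields (words) on each line.
fields=$(printf 'a b c\nd e\n' | awk '{ print NF }')

# Do math: sum the first field of every line.
total=$(printf '3\n4\n5\n' | awk '{ sum += $1 } END { print sum }')

# Check string lengths: print the length of each line.
lengths=$(printf 'ab\nabcd\n' | awk '{ print length($0) }')

printf '%s\n' "$fields" "$total" "$lengths"
```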

6.6. Known limitations among sed versions

The limits below apply to the versions as distributed, although the source code for most free versions of sed allows for modification and recompilation. As used below, "no limit" means there is no "fixed" limit. Limits are actually determined by one's hardware, memory, operating system, and which C library is used to compile sed.

6.6.1. Maximum line length

      GNU sed:        no limit
      ssed:           no limit
      sedmod v1.0:    4096 bytes
      HHsed v1.5:     4000 bytes
      sed v1.6:       [pending]

6.6.2. Maximum size for all buffers (pattern space + hold space)

      GNU sed:        no limit
      ssed:           no limit
      sedmod v1.0:    4096 bytes
      HHsed v1.5:     4000 bytes
      sed v1.6:       [pending]

6.6.3. Maximum number of files that can be read with 'r' command

      GNU sed v3+:    no limit
      ssed:           no limit
      GNU sed v2.05:  total no. of r and w commands may not exceed 32
      sedmod v1.0:    total no. of r and w commands may not exceed 20
      sed v1.6:       [pending]

6.6.4. Maximum number of files that can be written with 'w' command

      GNU sed v3+:    no limit (but typical Unix is 253)
      ssed:           no limit (but typical Unix is 253)
      GNU sed v2.05:  total no. of r and w commands may not exceed 32
      sedmod v1.0:    10
      HHsed v1.5:     10
      sed v1.6:       [pending]

6.6.5. Limits on length of label names

      GNU sed:        no limit
      ssed:           no limit
      HHsed v1.5:     no limit
      sed v1.6:       [pending]
      BSD sed:        8 characters

Note that GNU sed and ssed both consider a semicolon to terminate a label name.

6.6.6. Limits on length of write-file names

      GNU sed:        no limit
      ssed:           no limit
      HHsed v1.5:     no limit
      sed v1.6:       [pending]
      BSD sed:        40 characters

6.6.7. Limits on branch/jump commands

      GNU sed:        no limit
      ssed:           no limit
      HHsed v1.5:     50
      sed v1.6:       [pending]

As a practical consequence, this means that HHsed will not read more than 50 lines into the pattern space via an N command, even if the pattern space is only a few hundred bytes in size. HHsed exits with an error message, "infinite branch loop at line {nn}".

6.7. Known incompatibilities between sed versions

6.7.1. Issuing commands from the command line

Most versions of sed permit multiple commands to be issued on the command line, separated by a semicolon (;). Thus,

       sed 'G;G' file

should triple-space a file. However, for non-GNU sed, some commands require separate expressions on the command line. These include:

  • all labels (':a', ':more', etc.)
  • all branching instructions ('b', 't')
  • commands to read and write files ('r' and 'w')
  • any closing brace, '}'

If these commands are used, they must be the LAST commands of an expression. Subsequent commands must use another expression (another -e switch plus arguments). E.g.,

     sed  -e :a -e 's/^.\{1,77\}$/ &/;ta' -e 's/\( *\)\1/\1/' files

GNU sed, ssed, sed15 and sed16 all permit these commands to be followed by a semicolon, so the previous script can be written:

     sed  ':a;s/^.\{1,77\}$/ &/;ta;s/\( *\)\1/\1/' files
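The two forms can be compared directly. This sketch assumes the sed on the PATH is GNU sed (or another version that accepts a semicolon after labels and branches). The input word and the variable names are made up for the example.

```shell
# Portable form: label and branch each end their own -e expression.
with_e=$(echo 'abc' | sed -e :a -e 's/^.\{1,77\}$/ &/;ta' -e 's/\( *\)\1/\1/')

# Semicolon form: accepted by GNU sed, ssed, sed15 and sed16.
with_semi=$(echo 'abc' | sed ':a;s/^.\{1,77\}$/ &/;ta;s/\( *\)\1/\1/')

# Both should print the word centered on a 78-column line.
printf '[%s]\n[%s]\n' "$with_e" "$with_semi"
```

If the script has to run on an unknown version of sed, the -e form is the safer choice.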

Versions differ in implementing the 'a' (append), 'c' (change), and 'i' (insert) commands:

      sed "/foo/i New text here"              # HHsed/sedmod/gsed-30280
      gsed -e "/foo/i\\" -e "New text here"   # GNU sed
      sed1 -e "/foo/i" -e "New text here"     # one version of sed
      sed2 "/foo/i\ New text here"            # another version
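When portability matters, one form that POSIX-style versions of sed generally accept is the backslash followed by a real newline inside the script. The sketch below shows it next to the one-line form, which is assumed here to be run under GNU sed 4.x. The input lines are invented for the example.

```shell
# Portable form: "i\" followed by an actual newline, then the text.
portable=$(printf 'foo\nbar\n' | sed '/foo/i\
New text here')

# One-line form: an extension, assumed here to be run by GNU sed.
one_line=$(printf 'foo\nbar\n' | sed '/foo/i New text here')

printf '%s\n' "$portable"
```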

6.7.2. Using comments (prefixed by the '#' sign)

Most versions of sed permit comments to appear in sed scripts only on the first line of the script. Comments on line 2 or thereafter are not recognized and will generate an error like "unrecognized command" or "command [bad-line-here] has trailing garbage".

GNU sed, HHsed, sedmod, and HP-UX sed permit comments to appear on any line of the script, except after labels and branching commands (b,t), provided that a semicolon (;) occurs after the command itself. This syntax makes sed similar to awk and perl, which use a similar commenting structure in their scripts. Thus,

      # GNU style sed script
      $!N;                            # except for last line, get next line
      s/^\([0-9]\{5\}\).*\n\1.*//;    # if first 5 digits of each line
                                      # match, delete BOTH lines.
      t skip
      P;                              # print 1st line only if no match
      :skip
      D;                              # delete 1st line of pattern space and loop
      #---end of script---

is a valid script for GNU-based versions of sed, but is not recognized by most other versions of sed.
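To try the commented script, it can be saved to a file and run with the -f switch. This is a sketch that assumes GNU sed. The temporary file and the four input lines are made up for the example.

```shell
# Write the commented script to a temporary file (hypothetical location).
script=$(mktemp)
cat > "$script" <<'EOF'
# GNU style sed script
$!N;                            # except for last line, get next line
s/^\([0-9]\{5\}\).*\n\1.*//;    # if first 5 digits of each line
                                # match, delete BOTH lines.
t skip
P;                              # print 1st line only if no match
:skip
D;                              # delete 1st line of pattern space and loop
EOF

# Lines 2 and 3 share their first five digits, so both are removed.
result=$(printf '11111 a\n22222 b\n22222 c\n33333 d\n' | sed -f "$script")
rm -f "$script"
printf '%s\n' "$result"
```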

Finally, if the first two characters in a disk file script are "#n", the output is suppressed, exactly as if -n were entered on the command line. This is true for the following versions of sed:

  • ssed v3.57 and above
  • gsed
  • HHsed v1.5
  • sed v1.6

This syntax is not recognized by these versions of sed:

  • ssed v3.45 to v3.50 (other versions untested)
  • sedmod v1.0
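A quick way to check whether a given sed honors "#n" is to run a two-line script file without -n and see whether the unmatched lines are still printed. The sketch below assumes GNU sed. The script file and input are invented for the example.

```shell
# Script file whose first line is exactly "#n" (hypothetical temp file).
script=$(mktemp)
printf '#n\n/foo/p\n' > "$script"

# No -n on the command line; "#n" in the script suppresses default output.
quiet=$(printf 'foo\nbar\n' | sed -f "$script")
rm -f "$script"
printf '%s\n' "$quiet"
```

If both "foo" and "bar" come out (with "foo" twice), the version in use does not recognize "#n".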

6.7.3. Special syntax in REs

A. HHsed v1.5 (by Howard Helman)

The following expressions can be used for /RE/ addresses or in the LHS of a substitution:

      +    - 1 or more occurrences of previous RE: same as \{1,\}
      \<   - boundary between nonword and word character
      \>   - boundary between word and nonword character

The following expressions can be used for /RE/ addresses or on either side of a substitution:

      \a   - bell         (ASCII 07, 0x07)
      \b   - backspace    (ASCII 08, 0x08)
      \e   - escape       (ASCII 27, 0x1B)
      \f   - formfeed     (ASCII 12, 0x0C)
      \n   - newline      (printed as 2 bytes, 0D 0A or ^M^J, in DOS)
      \r   - return       (ASCII 13, 0x0D)
      \t   - tab          (ASCII 09, 0x09)
      \v   - vertical tab (ASCII 11, 0x0B)
      \xHH - the ASCII character corresponding to 2 hex digits HH.

B. sed v1.6 (by Walter Briscoe)

sed v1.6 accepts every expression supported by HHsed v1.5 (above), plus the following elements, which can also be used in the RHS of a substitution (in addition to those listed above):

      \\~  - insert replacement pattern defined in last s/// command
             (must be used alone in the RHS)
      \l   - change next element to lower case
      \L   - change remaining elements to lower case
      \u   - change next element to upper case
      \U   - change remaining elements to upper case
      \e   - end case conversion of next element
      \E   - end case conversion of remaining elements
      $0   - insert pattern space BEFORE the substitution
      $1-$9 - match Nth word on the pattern space

C. sedmod v1.0 (by Hern Chen)

The following expressions can be used for /RE/ addresses or in the LHS of a substitution:

      +    - 1 or more occurrences of previous RE: same as \{1,\}
      \a   - any alphanumeric: same as [a-zA-Z0-9]
      \A   - 1 or more alphas: same as \a+
      \d   - any digit: same as [0-9]
      \D   - 1 or more digits: same as \d+
      \h   - any hex digit: same as [0-9a-fA-F]
      \H   - 1 or more hexdigits: same as \h+
      \l   - any letter: same as [A-Za-z]
      \L   - 1 or more letters: same as \l+
      \n   - newline      (read as 2 bytes, 0D 0A or ^M^J, in DOS)
      \s   - any whitespace character: space, tab, or vertical tab
      \S   - 1 or more whitespace chars: same as \s+
      \t   - tab          (ASCII 09, 0x09)
      \<   - boundary between nonword and word character
      \>   - boundary between word and nonword character

The following expressions can be used in the RHS of a substitution. "Elements" refer to \1 .. \9, &, $0, or $1 .. $9:

      &    - insert regexp defined on LHS
      \e   - end case conversion of next element
      \E   - end case conversion of remaining elements
      \l   - change next element to lower case
      \L   - change remaining elements to lower case
      \n   - newline      (printed as 2 bytes, 0D 0A or ^M^J, in DOS)
      \t   - tab          (ASCII 09, 0x09)
      \u   - change next element to upper case
      \U   - change remaining elements to upper case
      $0   - insert the original pattern space
      $1-$9 - match Nth word on the pattern space

D. UnixDos sed

The following expressions can be used in text, LHS, and RHS:

      \n   - newline      (printed as 2 bytes, 0D 0A or ^M^J, in DOS)

E. GNU sed v1.03 (by Frank Whaley)

When used with the -x (extended) switch on the command line, or when '#x' occurs as the first line of a script, Whaley's gsed103 supports the following expressions in both the LHS and RHS of a substitution:

      \|      matches the expression on either side
      ?       0 or 1 occurrences of previous RE: same as \{0,1\}
      +       1 or more occurrences of previous RE: same as \{1,\}
      \a      "alert" beep     (BEL, Ctrl-G, 0x07)
      \b      backspace        (BS, Ctrl-H, 0x08)
      \f      formfeed         (FF, Ctrl-L, 0x0C)
      \n      newline          (LF, Ctrl-J, 0x0A)
      \r      carriage-return  (CR, Ctrl-M, 0x0D)
      \t      horizontal tab   (HT, Ctrl-I, 0x09)
      \v      vertical tab     (VT, Ctrl-K, 0x0B)
      \bBBB   binary char, where BBB are 1-8 binary digits, [0-1]
      \dDDD   decimal char, where DDD are 1-3 decimal digits, [0-9]
      \oOOO   octal char, where OOO are 1-3 octal digits, [0-7]
      \xHH    hex char, where HH are 1-2 hex digits, [0-9A-F]

In normal mode, with or without the -x switch, the following escape sequences are also supported in regex addressing or in the LHS of a substitution:

      \`      matches beginning of pattern space: same as /^/
      \'      matches end of pattern space: same as /$/
      \B      boundary between 2 word or 2 nonword characters
      \w      any nonword character [*BUG!* should be a word char]
      \W      any nonword character: same as /[^A-Za-z0-9]/
      \<      boundary between nonword and word char
      \>      boundary between word and nonword char

F. GNU sed v2.05 and higher versions

The following expressions can be used for /RE/ addresses or in the LHS of a substitution:

      \`  - matches the beginning of the pattern space (same as "^")
      \'  - matches the end of the pattern space (same as "$")
      \?  - 0 or 1 occurrence of previous character: same as \{0,1\}
      \+  - 1 or more occurrences of previous character: same as \{1,\}
      \|  - matches the string on either side, e.g., foo\|bar
      \b  - boundary between word and nonword chars (reversible)
      \B  - boundary between 2 word or between 2 nonword chars
      \n  - embedded newline (usable after N, G, or similar commands)
      \w  - any word character: [A-Za-z0-9_]
      \W  - any nonword char: [^A-Za-z0-9_]
      \<  - boundary between nonword and word character
      \>  - boundary between word and nonword character

On \b, \B, \<, and \>, see section 6.7.4 ("Word boundaries"), below.
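A few of these extensions in action, as a sketch that assumes GNU sed. The input strings are invented for the example, and none of these commands should be expected to work on other versions of sed.

```shell
# \| : alternation
alt=$(echo 'foo bar baz' | sed 's/foo\|baz/X/g')

# \+ : one or more of the previous character
plus=$(echo 'aaa b' | sed 's/a\+/A/')

# \? : zero or one of the previous character
opt=$(echo 'color colour' | sed 's/colou\?r/C/g')

# \w : any word character, including the underscore but not the hyphen
word=$(echo 'a_b c-d' | sed 's/\w\+/W/g')

printf '%s\n' "$alt" "$plus" "$opt" "$word"
```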

Undocumented -r switch:

Beginning with version 3.02, GNU sed has a -r switch (undocumented until version 4.0), which activates Extended Regular Expressions in the following manner:

       ?      -  0 or 1 occurrence of previous character
       +      -  1 or more occurrences of previous character
       |      -  matches the string on either side, e.g., foo|bar
       (...)  -  enable grouping without backslash
       {...}  -  enable interval expression without backslash

When the -r switch (mnemonic: "regular expression") is used, prefix these symbols with a backslash to disable the special meaning.
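The effect of -r is easiest to see by writing one substitution both ways. This sketch assumes GNU sed 4.x. The input strings are made up for the example.

```shell
# Without -r: grouping, alternation and intervals need backslashes.
bre=$(echo 'foo bar baz' | sed 's/\(foo\|baz\)\{1,\}/X/g')

# With -r: the same expression without backslashes.
ere=$(echo 'foo bar baz' | sed -r 's/(foo|baz){1,}/X/g')

# With -r, a backslash turns the special meaning OFF: \+ is a literal plus.
literal=$(echo 'a+b' | sed -r 's/a\+b/sum/')

printf '%s\n' "$bre" "$ere" "$literal"
```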

Escape sequences:

Beginning with version 3.02.80, the following escape sequences can now be used on both sides of a "s///" substitution:

      \a      "alert" beep     (BEL, Ctrl-G, 0x07)
      \f      formfeed         (FF, Ctrl-L, 0x0C)
      \n      newline          (LF, Ctrl-J, 0x0A)
      \r      carriage-return  (CR, Ctrl-M, 0x0D)
      \t      horizontal tab   (HT, Ctrl-I, 0x09)
      \v      vertical tab     (VT, Ctrl-K, 0x0B)
      \oNNN   a character with the octal value NNN
      \dNNN   a character with the decimal value NNN
      \xHH    a character with the hexadecimal value HH
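Here is a short sketch of these escape sequences on each side of a substitution, assuming a recent GNU sed. The sample strings are invented for the example.

```shell
# \t on the LHS: replace a tab with a comma.
tabs=$(printf 'a\tb\n' | sed 's/\t/,/')

# \n on the RHS: split one line into two.
split=$(echo 'a,b' | sed 's/,/\n/')

# \xHH on the LHS: 0x2D is the hyphen.
hex=$(echo 'a-b' | sed 's/\x2D/+/')

printf '%s\n' "$tabs" "$split" "$hex"
```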

Note that GNU sed also supports "character classes", a POSIX extension to regexes, described in section 3.7, above.

G. sed 4.0 and higher versions

The following expressions can be used in the RHS of a substitution.

      \E   - end case conversion
      \l   - change next character to lower case
      \L   - change remaining text to lower case
      \n   - newline      (printed as 2 bytes, 0D 0A or ^M^J, in DOS)
      \t   - tab          (ASCII 09, 0x09)
      \u   - change next character to upper case
      \U   - change remaining text to upper case

In addition, GNU sed 4.0 can modify the way ^ and $ are interpreted, so that ^ can also match an empty string after a newline character, and $ can also match an empty string before a newline character (to do this, add an "M" after the regular expression terminator, like /^>/M -- see section 3.1.1). Even if you use this feature, \` and \' still match the beginning and the end of the pattern space, respectively.
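The case-conversion escapes and the "M" modifier can be tried as follows. This is a sketch that assumes GNU sed 4.x, where \E ends a conversion started by \U or \L. The input strings are made up for the example.

```shell
# \U...\E upper-cases the first word; \u upper-cases one character.
caps=$(echo 'hello world' | sed 's/\(hello\) \(world\)/\U\1\E \u\2/')

# \L lower-cases the rest of the replacement.
lower=$(echo 'HELLO' | sed 's/.*/\L&/')

# With M, "^" also matches just after the embedded newline.
multi=$(printf 'a\nb\n' | sed 'N;s/^b/X/M')

# Even with M, \` matches only at the start of the pattern space.
anchored=$(printf 'a\nb\n' | sed 'N;s/\`b/X/M')

printf '%s\n' "$caps" "$lower" "$multi" "$anchored"
```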

H. ssed

Everything that was said for GNU sed applies to ssed as well. In addition, in Perl-mode (-R switch), these become active or inactive:

      .     - no longer matches new-line characters
      \A    - matches beginning of pattern space
      \Z    - matches end of pattern space or last newline in the PS
      \z    - matches end of pattern space
      \d    - matches any digit: same as [0-9]
      \D    - matches any non-digit: same as [^0-9]
      \`    - no longer matches beginning of pattern space
      \'    - no longer matches end of pattern space
      \<    - no longer matches boundary between nonword & word char
      \>    - no longer matches boundary between word & nonword char
      \oNNN - no longer matches char with octal value NNN
      \dNNN - no longer matches char with decimal value NNN
      \NNN  - matches char with octal value NNN

Perl mode supports lookahead (?=match) and lookbehind (?<=match) pattern matching. The matched text is NOT captured in "&" for s/// replacements!

      foo(?=bar)   - match "foo" only if "bar" follows it
      foo(?!bar)   - match "foo" only if "bar" does NOT follow it
      (?<=foo)bar  - match "bar" only if "foo" precedes it
      (?<!foo)bar  - match "bar" only if "foo" does NOT precede it

      (?<!in|on|at)foo
                  - match "foo" only if NOT preceded by "in", "on" or "at"
      (?<=\d{3})(?<!999)foo
                  - match "foo" only if preceded by 3 digits other than "999"

In Perl mode, there are two new switches in /addressing/ or s/// commands. Switches may be lowercase in s/// commands, but must be uppercase in /addressing/:

       /S  - lets "." match a newline also
       /X  - extra whitespace is ignored. See below, for sample usage.

Here are some examples of Perl-style regular expressions. Use the -R switch.

     (?i)abc    - case-insensitive match of abc, ABC, aBc, ABc, etc.
     ab(?i)c    - same as above; the (?i) applies throughout the pattern
     (ab(?i)c)  - matches abc or abC; the outer parens make the difference!
     (?m)       - multi-line pattern space: same as "s/FIND/REPL/M"
     (?s)       - set "." to match newline also: same as "s/FIND/REPL/S"
     (?x)       - ignore whitespace and #comments; see section (9) below.

     (?:abc)foo    - match "abcfoo", but do not capture 'abc' in \1
     (?:ab|cd)ef   - match "abef" or "cdef"; neither 'ab' nor 'cd' is captured
     (?#remark)xy  - match "xy"; remarks after "#" are ignored.

And here are some sample uses of /X switch to add comments to complex expressions. To embed literal spaces, precede with \ or put inside [brackets].

     # ssed script to change "(123) 456-7890" into "[ac123] 456-7890"
     #
     s/ # BACKSLASH IS NEEDED AT END OF EACH LINE!   \
     \(                   # literal left paren, (    \
     (\d{3})              # 3 digits                 \
     \)                   # literal right paren, )   \
     [ \t]*               # zero or more spaces or tabs  \
     (\d{3}-\d{4})        # 3 digits, hyphen, 4 digits   \
     /[ac\1] \2/gx;       # replace g(lobally), with e(x)tended spacing

6.7.4. Word boundaries

GNU sed, ssed, sed16, sed15 and sedmod use certain symbols to define the boundary between a "word character" and a nonword character. A word character fits the regex "[A-Za-z0-9_]". Note: a word character includes the underscore "_" but not the hyphen, probably because the underscore is permissible as a label in sed and in other scripting languages. (In gsed103, a word character did NOT include the underscore; it included alphanumerics only.)

These symbols include '\<' and '\>' (gsed, ssed, sed15, sed16, sedmod) and '\b' and '\B' (gsed only). Note that the boundary symbols do not represent a character, but a position on the line. Word boundaries are used with literal characters or character sets to let you match (and delete or alter) whole words without affecting the spaces or punctuation marks outside of those words. They can only be used in a "/pattern/" address or in the LHS of a 's/LHS/RHS/' command. The following table shows how these symbols may be used in HHsed and GNU sed. Sedmod matches the syntax of HHsed.

      Match position      Possible word boundaries   HHsed   GNU sed
      ---------------------------------------------------------------
      start of word    [nonword char]^[word char]      \<    \< or \b
      end of word         [word char]^[nonword char]   \>    \> or \b
      middle of word      [word char]^[word char]     none      \B
      outside of word  [nonword char]^[nonword char]  none      \B
      ---------------------------------------------------------------
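The GNU sed column of the table can be tried out with a short sketch like the following. It assumes GNU sed, and the input string is invented for the example.

```shell
# \< and \> : replace "cat" only where it stands as a whole word.
whole=$(echo 'cat concat cats' | sed 's/\<cat\>/dog/g')

# \b works at either edge of a word, so this is equivalent.
same=$(echo 'cat concat cats' | sed 's/\bcat\b/dog/g')

# \B : match "cat" only where it does NOT start at a word boundary.
inner=$(echo 'cat concat cats' | sed 's/\Bcat/X/g')

printf '%s\n' "$whole" "$same" "$inner"
```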

In ssed, the symbols '\<' and '\>' lose their special meaning when the -R switch is used to invoke Perl-style expressions. However, the identical meaning of '\<' and '\>' can be obtained through these nonmatching, zero-width assertions:

       (?<!\w)(?=\w)  and   (?<=\w)(?!\w)

6.7.5. Commands which operate differently

A. GNU sed version 3.02 and 3.02.80

The N command no longer discards the contents of the pattern space upon reaching the end of file. This is not a bug, it's a feature. However, it breaks certain scripts which relied on the older behavior of N.

'N' adds the Next line to the pattern space, enabling multiple lines to be stored and acted upon. Traditionally, if the N command was issued again after the last line of the file had been read, the contents of the pattern space were silently deleted and the script aborted. For this reason, sed users generally wrote:

       $!N;   # to add the Next line to every line but the last one.

However, certain sed scripts relied on this behavior, such as the script to delete trailing blank lines at the end of a file (see script #12 in section 3.2, "Common one-line sed scripts", above). Also, classic textbooks such as Dale Dougherty and Arnold Robbins' sed & awk documented the older behavior.

The GNU sed maintainer felt that despite the portability problems this would cause, changing the N command to print (rather than delete) the pattern space was more consistent with one's intuitions about how a command to "append the Next line" ought to behave. Another fact favoring the change was that "{N;command;}" will delete the last line if the file has an odd number of lines, but print the last line if the file has an even number of lines.

To convert scripts which used the former behavior of N (deleting the pattern space upon reaching the EOF) to scripts compatible with all versions of sed, change a lone "N;" to "$d;N;".
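The two guarded forms give the same result on any version of sed, which makes them easy to compare. In this sketch the three-line input is made up, and the script simply joins each pair of lines with a space.

```shell
# "$!N" keeps the last line: it is printed unjoined.
guarded=$(printf '1\n2\n3\n' | sed '$!N;s/\n/ /')

# "$d;N" reproduces the older behavior: the unpaired last line is dropped.
old_style=$(printf '1\n2\n3\n' | sed '$d;N;s/\n/ /')

printf '%s\n' "$guarded" "---" "$old_style"
```

A bare "N;" in place of either form would give the first result under GNU sed 3.02 and later, and the second result under versions with the traditional behavior.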

