Match any single character from among characters listed between brackets.

`[!characters]`: Match any single character other than the characters listed between the brackets.

`[a-z]`: Match any single character from the range of characters listed between the brackets.

`[!a-z]`: Match any single character not in the range listed between the brackets.

`{frag1,frag2,...}`: Brace expansion: create strings frag1, frag2, etc.
127 |
128 | List all files ending with a digit:
129 |
130 | ```bash
131 | $ ls *[0-9]
132 | ```
133 |
134 | Make a copy of `filename` as `filename.old`:
135 |
136 | ```bash
137 | $ cp filename{,.old}
138 | ```
139 |
140 | Remove all files beginning with *a* or *z*:
141 |
142 | ```bash
143 | $ rm [az]*
144 | ```
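
To try the bracket patterns without risking real files, here is a minimal sketch that works in a throwaway directory. The file names (`apple`, `zebra`, `mango`, `run1`, `run2`) and variable names are made up for illustration, and the commands are written without the `$` prompt so they can be saved and run as a script:

```bash
# Create a scratch directory with a few made-up file names.
tmpdir=$(mktemp -d)
cd "$tmpdir"
touch apple zebra mango run1 run2

# Capture what each pattern expands to; echo only prints the matches.
starts_az=$(echo [az]*)     # names beginning with a or z
not_az=$(echo [!az]*)       # names beginning with any other character
ends_digit=$(echo *[0-9])   # names ending with a digit

echo '[az]*  ->' "$starts_az"
echo '[!az]* ->' "$not_az"
echo '*[0-9] ->' "$ends_digit"

# Return to the previous directory and clean up.
cd "$OLDPWD"
rm -rf "$tmpdir"
```

Using `echo` rather than `rm` here means a mistaken pattern only prints unexpected names instead of deleting files.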
145 |
146 | List all the R code files with a variety of suffixes:
147 |
148 | ```bash
149 | $ ls *.{r,R}
150 | ```
151 |
152 | The `echo` command can be used to verify that a wildcard expansion will
153 | do what you think it will:
154 |
155 | ```bash
156 | $ echo cp filename{,.old}
157 | cp filename filename.old
158 | ```
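
Brace expansion differs from the wildcards above in that it generates strings whether or not matching files exist, and several brace groups can be combined. A small sketch, assuming bash or zsh (the names `file1` and `file2` are placeholders), again using `echo` so that nothing is done with the expanded names:

```bash
# Each brace group is expanded in turn; no files need to exist.
expanded=$(echo file{1,2}.{r,R})
echo "$expanded"

# The empty first fragment in {,.old} yields the bare name itself.
copy_args=$(echo filename{,.old})
echo "$copy_args"
```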
159 |
160 | If you want to suppress the special meaning of a wildcard in a shell
161 | command, precede it with a backslash (`\`). (Note that this is a general
162 | rule of thumb in many similar situations when a character has a special
163 | meaning but you just want to treat it as a character.) For example, to
164 | list files whose name starts with the `*` character:
165 |
166 | ```bash
167 | $ touch \*test # create a file called *test
168 | $ ls \**
169 | *test
170 | ```
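
Quoting is another way to suppress a wildcard's special meaning. The sketch below, run in a scratch directory with made-up file names, suggests that a backslash and single quotes give the same result in this case:

```bash
# Scratch directory containing one file whose name starts with a literal *.
tmpdir=$(mktemp -d)
cd "$tmpdir"
touch '*test' atest

escaped=$(echo \**)   # literal * followed by the wildcard *
quoted=$(echo '*'*)   # same pattern, with the literal * in single quotes

echo "$escaped"
echo "$quoted"

# Return to the previous directory and clean up.
cd "$OLDPWD"
rm -rf "$tmpdir"
```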
171 |
172 | To read more about standard globbing patterns, see the man page:
173 |
174 | ```
175 | $ man 7 glob
176 | ```
177 |
178 | ## 4 File permissions
179 |
180 | UNIX allows you to control who has access to a given file (or directory) and how the user can interact with the file (or directory). We can see what permissions are set using the `-l` flag to `ls`.
181 |
182 | ```bash
183 | $ cd ~/stat243-fall-2020
184 | $ ls -l
185 | ```
186 |
187 | ```
188 | total 152
189 | drwxrwxr-x 2 scflocal scflocal 4096 Dec 28 13:15 data
190 | drwxrwxr-x 2 scflocal scflocal 4096 Dec 28 13:15 howtos
191 | drwxrwxr-x 2 scflocal scflocal 4096 Dec 28 13:15 project
192 | drwxrwxr-x 2 scflocal scflocal 4096 Dec 28 13:15 ps
193 | -rw-rw-r-- 1 scflocal scflocal 11825 Dec 28 13:15 README.md
194 | drwxrwxr-x 13 scflocal scflocal 4096 Dec 28 13:15 sections
195 | -rw-rw-r-- 1 scflocal scflocal 37923 Dec 28 13:15 syllabus.lyx
196 | -rw-rw-r-- 1 scflocal scflocal 77105 Dec 28 13:15 syllabus.pdf
197 | drwxrwxr-x 2 scflocal scflocal 4096 Dec 28 13:37 units
198 | ```
199 |
200 | When using the `-l` flag to `ls`, you'll see extensive information about each file (or directory), of which the most important are:
201 |
202 | - (column 1) file permissions (more later)
203 | - (column 3) the owner of the file ('scflocal' here)
204 | - (column 4) the group of users that the file belongs to (also 'scflocal' here)
205 | - (column 5) the size of the file in bytes
206 | - (columns 6-8) the last time the file was modified
207 | - (column 9) name of the file
208 |
209 | Here's a graphical summary of the information for a file named
210 | "file.txt", whose owner is "root" and group is "users". (The graphic also indicates that the commands `chmod`, `chown`, and `chgrp` can be used to change aspects of the file permissions and ownership.)
211 |
212 | 
213 |
214 |
215 | Let's look in detail at the information in the first column returned by `ls -l`.
216 |
217 | ```bash
218 | $ ls -l
219 | total 156
220 | drwxrwxr-x 2 scflocal scflocal 4096 Dec 28 13:15 data
221 | drwxrwxr-x 2 scflocal scflocal 4096 Dec 28 13:15 howtos
222 | drwxrwxr-x 2 scflocal scflocal 4096 Dec 28 13:15 project
223 | drwxrwxr-x 2 scflocal scflocal 4096 Dec 28 13:15 ps
224 | -rw-rw-r-- 1 scflocal scflocal 11825 Dec 28 13:15 README.md
225 | drwxrwxr-x 13 scflocal scflocal 4096 Dec 28 13:15 sections
226 | -rw-rw-r-- 1 scflocal scflocal 37923 Dec 28 13:15 syllabus.lyx
227 | -rw-rw-r-- 1 scflocal scflocal 77105 Dec 28 13:15 syllabus.pdf
228 | drwxrwxr-x 2 scflocal scflocal 4096 Dec 28 13:37 units
229 | ```
230 |
231 | The first column actually contains 10 individual single-character columns. Items marked with a `d` as the first character are directories. Here `data` is a directory while `syllabus.pdf` is not.
232 |
233 | Following that first character are three triplets of file permission information. Each triplet contains read ('r'), write ('w') and execute ('x') information. The first `rwx` triplet (the second through fourth characters) indicates if the owner of the file can read, write, and execute a file (or directory). The second `rwx` triplet (the fifth through seventh characters) indicates if anyone in the group that the file belongs to can read, write and execute a file (or directory). The third triplet (the eighth through tenth characters) pertains to any other user. Dashes mean that a given user does not have that kind of access to the given file.
234 |
235 | For example, for the *syllabus.pdf* file, the owner of the file can read it and can modify the file by writing to it (the first triplet is `'rw-'`), as can users in the group the file belongs to. But for other users, they can only read it (the third triplet is `'r--'`).
236 |
237 | We can change the permissions by indicating the type of user and the kind of access we want to add or remove. The type of user is one of:
238 |
239 | - 'u' for the user who owns the file,
240 | - 'g' for users in the group that the file belongs to, and
241 | - 'o' for any other users.
242 |
243 | Thus we specify one of 'u', 'g', or 'o', followed by a '+' to add permission or a '-' to remove permission and finally by the kind of permission: 'r' for read access, 'w' for write access, and 'x' for execution access.
244 |
245 | As a simple example, let's prevent anyone from reading the `tmp.txt`
246 | file (which we'll create first). We then try to print the contents of the file to the screen with the command `cat`, but we are denied.
247 |
248 | First recall the current permissions:
249 |
250 | ```bash
251 | $ echo "first line" > tmp.txt # create a test text file that contains "first line"
252 | $ ls -l tmp.txt
253 | -rw-rw-r-- 1 scflocal scflocal 11 Dec 28 13:39 tmp.txt
254 | ```
255 | Now we remove the read permissions:
256 |
257 | ```bash
258 | $ chmod u-r tmp.txt # prevent owner from reading
259 | $ chmod g-r tmp.txt # prevent users in the file's group from reading
260 | $ chmod o-r tmp.txt # prevent others from reading
261 | $ ls -l tmp.txt
262 | --w--w---- 1 scflocal scflocal 11 Dec 28 13:39 tmp.txt
263 | ```
264 | ```bash
265 | $ cat tmp.txt
266 | cat: tmp.txt: Permission denied
267 | ```
268 |
269 | That can actually be accomplished all at once, like this:
270 |
271 | ```bash
272 | $ chmod ugo-r tmp.txt # prevent all three
273 | $ ls -l tmp.txt
274 | --w--w---- 1 scflocal scflocal 11 Dec 28 13:39 tmp.txt
275 | ```
276 |
277 |
278 | Or if we wanted to remove read and write permission, we can do this:
279 |
280 | ```bash
281 | $ chmod ugo-rw tmp.txt # remove read and write access for all three
282 | ```
283 |
284 | Now if we try to add a line to the file, using the `>>`
285 | [redirection operator](using-commands#32-overview-of-redirection), we are denied:
286 |
287 | ```bash
288 | $ echo "added line" >> tmp.txt
289 | -bash: tmp.txt: Permission denied
290 | ```
291 |
292 | Now let's restore read and write permission to the owner:
293 |
294 | ```bash
295 | $ chmod u+rw tmp.txt
296 | $ echo "added line" >> tmp.txt
297 | $ cat tmp.txt
298 | first line
299 | added line
300 | ```
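
The pieces above can be combined in one short script. This is only a sketch: it uses a temporary file from `mktemp` rather than `tmp.txt`, starts from no permissions at all, and reads the permission string back from the first ten characters of the `ls -l` output (the variable names are illustrative):

```bash
tmpfile=$(mktemp)

chmod ugo-rwx "$tmpfile"   # start with no permissions for anyone
chmod u+rw "$tmpfile"      # owner may read and write
chmod go+r "$tmpfile"      # group and others may read

# The first 10 characters of the long listing are the type and three triplets.
perms=$(ls -l "$tmpfile" | cut -c1-10)
echo "$perms"

rm -f "$tmpfile"
```

Checking the permission string, rather than trying to read the file, avoids a surprise when the script is run by the root user, who is generally not blocked by these permissions.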
301 |
302 | There are many more details that are important when making files accessible to other users, including:
303 |
304 | - [how to make files in a particular directory available to other users on the system](https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/data/transferring-data/making-files-accessible/#making-files-accessible-to-all-other-savio-users),
305 | - [how to set up a directory for use by a UNIX group](https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/data/transferring-data/making-files-accessible/#making-files-accessible-to-your-group-members), using the setgid bit on the directory (sometimes loosely called the "sticky bit") so that files created in the directory in the future belong to the group and group members readily have access to them by default, and
306 | - [how to use *access control lists*](https://www.redhat.com/en/blog/linux-access-control-lists) to have more control over access.
307 |
308 | ## 5 Use simple text files when possible
309 |
310 | UNIX commands are designed as powerful tools to manipulate text files. This means that it's helpful to store information in text files when possible (of course there are very good reasons to store large datasets in binary files as well, in particular speed of access to portions of the data and efficient storage formats).
311 |
312 | Furthermore, the basic UNIX commands that operate on files operate on a line by line basis (e.g., `grep`, `sed`, `cut`, etc.). So using formats where each line contains a distinct set of information (such as CSVs) is advantageous even compared to other text formats where related information is stored on multiple lines (such as XML and JSON).
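
As a rough illustration of why one-record-per-line formats are convenient, the sketch below builds a tiny CSV (the contents and column names are invented for this example) and pulls out one column for the matching rows with `grep` and `cut`:

```bash
# Write a small made-up CSV to a temporary file.
csv=$(mktemp)
printf 'name,dept,year\nana,stat,2020\nbo,math,2021\ncy,stat,2021\n' > "$csv"

# Keep only lines whose dept field is "stat", then extract the first field.
stat_names=$(grep ',stat,' "$csv" | cut -d',' -f1)
echo "$stat_names"

rm -f "$csv"
```

Doing the same with XML or JSON would generally need a parser, because a single record can span several lines.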
313 |
314 | ## 6 Document formats and conversion
315 |
316 | There are many plain text file formats (e.g., Markdown,
317 | reStructuredText, LaTeX). Pandoc is a widely used document converter. To
318 | convert a file written in markdown (`report.md`) to a PDF
319 | (`report.pdf`), you would do something like:
320 |
321 | $ pandoc -o report.pdf report.md
322 |
323 | For a quick introduction to LaTeX, please see our [Introduction to
324 | LaTeX tutorial and screencast](https://github.com/berkeley-scf/tutorial-latex-intro).
325 |
--------------------------------------------------------------------------------
/file.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berkeley-scf/tutorial-using-bash/b818e852ced6e0c5b620abef4352c56103c0a859/file.png
--------------------------------------------------------------------------------
/index.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: Using the bash (and zsh) shell
3 | date: 2025-04-09
4 | format:
5 | html:
6 | theme: cosmo
7 | css: assets/styles.css
8 | toc: true
9 | code-copy: true
10 | code-block-bg: true
11 | code-block-border-left: "#31BAE9"
12 | engine: knitr
13 | ipynb-shell-interactivity: all
14 | code-overflow: wrap
15 | ---
16 |
17 | ## 1 This tutorial
18 |
19 | ::: {.callout-warning title="Prerequisite"}
20 | Before reading this, if you're not already comfortable with basic commands for working with files (e.g. `cd`, `ls`, `cp` and the structure of the filesystem on a UNIX-like machine), you will want to be familiar with the introductory material in our [Basics of UNIX tutorial](https://computing.stat.berkeley.edu/tutorial-unix-basics).
21 | :::
22 |
23 |
24 | ### 1.1 Other resources
25 |
26 | Software Carpentry has a very nice introductory lesson on the [basics of the shell](https://swcarpentry.github.io/shell-novice/). It also has an accompanying [YouTube video](https://www.youtube.com/watch?v=8c1BL5b47kg) that covers some, but not all, of the topics of this tutorial.
27 |
28 | The book Research Software Engineering with Python has [several good in-depth chapters on the shell](https://third-bit.com/py-rse/bash-basics.html).
29 |
30 | ## 2 The interactive shell
31 |
32 | The shell is the UNIX program that provides an interactive computer programming environment. You use the shell when in a terminal window to interact with a UNIX-style operating system (e.g., Linux or MacOS). The shell sits between you and the operating system and provides useful commands and functionality. Basically, the shell is a program that serves to run other commands for you and show you the results.
33 |
34 | The shell is a read-evaluate-print loop (REPL) environment. R and
35 | Python also provide REPL environments. A REPL reads a single
36 | *expression* or input, parses and *evaluates* it, *prints* the results,
37 | and then *loops* (i.e., returns control to you to continue your work).
38 |
39 | ::: {.callout-note title="The shell prompt"}
40 |
41 | I will use a `$` prompt for bash. By convention, a regular
42 | user's prompt in bash is `$`, while the root (or administrative)
43 | user's prompt is `#`. However, it is common practice to never log on
44 | as the root user, even if you have root access. If you need to run a command with root privileges,
45 | you should use the `sudo` command.
46 |
47 | `$ echo "The current user is $USER."`
48 |
49 | `The current user is paciorek.`
50 | :::
51 |
52 | When you are working in a terminal window (i.e., a window with the
53 | command line interface), you're interacting with a shell.
54 | There are actually different shells that you can use, of which `bash` is very common and is the default on many systems. In recent versions of MacOS, `zsh` is the default shell. There are others as well (e.g., *sh*, *csh*, *tcsh*, *fish*). I've generated this document based on using the bash shell on a computer running the Ubuntu Linux version 20.04 operating system, and this tutorial assumes you are using *bash* or *zsh*. That said, the basic ideas and the use of various commands are applicable to any UNIX shell, and you should be able to replicate most of the steps in this tutorial in other UNIX command line environments, with various substitutions of shell syntax specific to the shell you are using.
55 |
56 | The shell is an amazingly powerful programming environment. From it you
57 | can interactively monitor and control almost any aspect of the OS and
58 | more importantly you can automate it. As you will see, *bash* has a
59 | very extensive set of capabilities intended to make both interactive as
60 | well as automated control simple, effective, and customizable.
61 |
62 | ::: {.callout-warning}
63 |
64 | It can be difficult to distinguish what is shell-specific and what is
65 | just part of UNIX. Some of the material in this tutorial is not bash-specific but is
66 | general to UNIX.
67 |
68 | Reference: Newham and Rosenblatt, Learning the bash Shell, 2nd ed.
69 | :::
70 |
71 | ::: {.callout-warning title="The shell on a Mac"}
72 |
73 | Unfortunately, the behavior of shell commands on a Mac can be
74 | [somewhat different](https://ponderthebits.com/2017/01/know-your-tools-linux-gnu-vs-mac-bsd-command-line-utilities-grep-strings-sed-and-find)
75 | than on Linux (e.g., on a Mac, one can't do `tail -n +5`) because MacOS is based on
76 | BSD, which is not a Linux distribution. The behavior of the commands is distinct from the shell you are using.
77 | :::
78 |
79 |
80 | ## 3 Accessing the shell
81 |
82 | This tutorial assumes you already have access to a basic bash shell on a computer
83 | with network access (e.g., the Terminal on a Mac, the Ubuntu subsystem on Windows, or a terminal window on a Linux machine), as discussed in our [Basics of UNIX tutorial](https://berkeley-scf.github.io/tutorial-unix-basics#1.3-accessing-a-unix-command-line-interface).
84 |
85 | Here's how you can see your default shell and change it if you like.
86 |
87 | 1. What is my default shell?
88 |
89 | ```bash
90 | $ echo $SHELL
91 | /bin/bash
92 | ```
93 |
94 | 2. To change to bash on a one-time basis:
95 |
96 | ```bash
97 | $ bash
98 | ```
99 |
100 | 3. To make it your default:
101 |
102 | ```bash
103 | $ chsh -s /bin/bash
104 | ```
105 |
106 | In the last example, `/bin/bash` should be whatever the path to the bash
107 | shell is, which you can figure out using:
108 |
109 | ```bash
110 | $ type bash
111 | bash is /usr/bin/bash
112 | ```
113 |
114 |
115 |
116 | ## 4 Variables
117 |
118 | ### 4.1 Using variables
119 |
120 | Just like programming languages, you can use variables in the shell.
121 | Variables are names that have values assigned to them.
122 |
123 | To access the value currently assigned to a variable, you can prepend the name with
124 | the dollar sign (\$). To print the value you can use the `echo` command.
125 |
126 | For example, I can find the username of the current user in the `USER` variable:
127 |
128 | ```bash
129 | $ echo $USER
130 | paciorek
131 | ```
132 |
133 | To declare a variable, just assign a value to the name, without using `$`. For
134 | example, if you want to make a new variable with the name `counter` with
135 | the value `1`:
136 |
137 | ```bash
138 | $ counter=1
139 | ```
140 |
141 | Since bash uses spaces to parse the expression you give it as input, it
142 | is important to note the lack of spaces around the equal sign. Try
143 | typing the command with and without spaces and note what happens.
144 |
145 | You can also enclose the variable name in curly brackets, which comes in
146 | handy when you're embedding a variable within a line of code, to make
147 | sure the shell knows where the variable name ends:
148 |
149 | ```bash
150 | $ base=/home/jarrod/
151 | $ echo ${base}src
152 | $ echo $basesrc
153 | ```
154 |
155 | Make sure you understand the difference in behavior in the last two
156 | lines.
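
If you want to check your understanding, the following sketch stores both results in variables (`with_braces` and `without_braces` are names chosen for this example) and prints them inside quotes so that an empty value is visible:

```bash
base=/home/jarrod/

with_braces=${base}src    # value of base followed by the literal text "src"
without_braces=$basesrc   # value of a variable named basesrc, which is not set

echo "with braces:    '$with_braces'"
echo "without braces: '$without_braces'"
```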
157 |
158 | ### 4.2 Environment variables
159 |
160 | There are also special shell variables called environment variables that
161 | help to control the shell's behavior. These are generally named in all
162 | caps. Type `printenv` to see them. You can create your own environment
163 | variable as follows:
164 |
165 | ```bash
166 | $ export base=/home/jarrod/
167 | ```
168 |
169 | The `export` command ensures that other shells created by the current
170 | shell (for example, to run a program) will inherit the variable. Without
171 | the export command, any shell variables that are set will only be
172 | modified within the current shell. More generally, if you want a
173 | variable to always be accessible, you should include the definition of
174 | the variable with an `export` command in your `.bashrc` file.
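
A small sketch of the difference, using two made-up variable names (`local_only` and `shared`) and a child shell started with `sh -c` (starting `bash -c` instead should behave the same way):

```bash
local_only="not exported"   # ordinary shell variable
export shared="exported"    # environment variable

# Single quotes stop the current shell from expanding the variables,
# so the child shell has to look them up itself.
child_local=$(sh -c 'echo "${local_only:-<unset>}"')
child_shared=$(sh -c 'echo "${shared:-<unset>}"')

echo "child sees local_only as: $child_local"
echo "child sees shared as:     $child_shared"
```

The `${name:-<unset>}` form simply substitutes the text `<unset>` when the variable has no value, which makes the missing variable easy to spot.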
175 |
176 | You can control the appearance of the bash prompt using the `PS1`
177 | variable:
178 |
179 | ```bash
180 | $ echo $PS1
181 | ```
182 |
183 | To modify it so that it puts the username, hostname, and current working
184 | directory in the prompt:
185 |
186 | ```bash
187 | $ export PS1='[\u@\h \W]\$ '
188 | [user1@local1 ~]$
189 | ```
190 |
191 | ## 5 Introduction to commands
192 |
193 | ### 5.1 Elements of a command
194 |
195 | While each command has its own syntax, there are some rules usually
196 | followed. Generally, a command line consists of 4 things: a command,
197 | command options, arguments, and line acceptance. Consider the following
198 | example:
199 |
200 | ```bash
201 | $ ls -l file.txt
202 | ```
203 | In the above example, `ls` is the command, `-l` is a command option
204 | specifying to use the long format, `file.txt` is the argument, and the
205 | line acceptance is indicated by hitting the `Enter` key at the end of
206 | the line.
207 |
208 | After you type a command at the bash prompt and indicate line acceptance
209 | with the `Enter` key, bash parses the command and then attempts to
210 | execute the command. To determine what to do, bash first checks whether
211 | the command is a shell function (we will discuss functions below). If
212 | not, it checks to see whether it is a builtin. Finally, if the command
213 | is not a shell function nor a builtin, bash uses the `PATH` variable.
214 | The `PATH` variable is a list of directories:
215 |
216 | ```bash
217 | $ echo $PATH
218 | /home/jarrod/usr/bin:/usr/local/bin:/bin:/usr/bin:
219 | ```
220 |
221 | For example, consider the following command:
222 |
223 | ```bash
224 | $ grep pdf file.txt
225 | ```
226 |
227 | We will discuss `grep` later. For now, let's ignore what `grep` actually
228 | does and focus on what bash would do when you press enter after typing
229 | the above command. First bash checks whether `grep` is a shell function or
230 | a builtin. Once it determines that `grep` is neither a shell function
231 | nor a builtin, it will look for an executable file named `grep` first in
232 | `/home/jarrod/usr/bin`, then in `/usr/local/bin`, and so on until it
233 | finds a match or runs out of places to look. You can use `type` to find
234 | out where bash would find it:
235 |
236 | ```bash
237 | $ type grep
238 | grep is hashed (/usr/bin/grep)
239 | ```
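
The `PATH` search described above can be imitated by hand. The loop below is only a sketch of the idea, not bash's actual implementation: it assumes bash-style word splitting (zsh does not split an unquoted `$PATH` by default), and it ignores functions, builtins, and bash's hash table of remembered locations.

```bash
cmd=grep
found=""

# Split PATH on colons and test each directory in order.
old_ifs=$IFS
IFS=:
for dir in $PATH; do
    if [ -f "$dir/$cmd" ] && [ -x "$dir/$cmd" ]; then
        found="$dir/$cmd"
        break    # stop at the first match
    fi
done
IFS=$old_ifs

echo "first match for $cmd: $found"
```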
240 |
241 | Also note that the shell substitutes in the values of variables and
242 | does other manipulations before calling the command. For example in the following
243 | example,
244 |
245 | ```bash
246 | $ myfile=file.txt
247 | $ grep pdf $myfile
248 | ```
249 |
250 | the value of `$myfile` is substituted in before `grep` is called, so the command
251 | that is executed is `grep pdf file.txt`.
252 |
253 | ### 5.2 Getting help with commands
254 |
255 | Most bash commands have electronic manual pages, which are accessible
256 | directly from the commandline. You will be more efficient and
257 | effective if you become accustomed to using these `man` pages. To view
258 | the `man` page for the command `ls`, for instance, you would type:
259 |
260 | ```bash
261 | $ man ls
262 | ```
263 |
264 | Alternatively, for many commands you can use the `--help` flag:
265 |
266 | ```bash
267 | $ ls --help
268 | ```
269 |
270 | ::: {.callout-tip title="Exercise"}
271 |
272 | Consider the following examples using the `ls` command:
273 |
274 | ```bash
275 | $ ls --all -l
276 | $ ls -a -l
277 | $ ls -al
278 | ```
279 |
280 | Use `man ls` to see what the command options do. Is there any difference
281 | in what the three versions of the command invocation above return as the
282 | result? What happens if you add a filename to the end of the command?
283 | :::
284 |
285 | ## 6 Operating efficiently at the command line
286 |
287 | ### 6.1 Tab completion
288 |
289 | When working in the shell, it is often unnecessary to type out an entire
290 | command or file name, because of a feature known as tab completion. When
291 | you are entering a command or filename in the shell, you can, at any
292 | time, hit the tab key, and the shell will try to figure out how to
293 | complete the name of the command or filename you are typing. If there is
294 | only one such command found in the search path and you're using tab completion with
295 | the first token of a line, then the shell will display its value and the
296 | cursor will be one space past the completed name. If there are multiple
297 | commands that match the partial name, the shell will display as much as
298 | it can. In this case, hitting tab twice will display a list of choices,
299 | and redisplay the partial command line for further editing. Similar
300 | behavior with regard to filenames occurs when tab completion is used on
301 | anything other than the first token of a command.
302 |
303 | ::: {.callout-tip title="Tab completion in Python and R"}
304 |
305 | R and Python also provide tab completions for objects (including functions) and
306 | (in some cases) filenames.
307 | :::
308 |
309 | ### 6.2 Keyboard shortcuts
310 |
311 |
312 | Note that you can use emacs-like control sequences (`Ctrl-a`, `Ctrl-e`,
313 | `Ctrl-k`) to navigate and delete characters.
314 |
315 | **Table. Keyboard Shortcuts**

| Key Strokes | Descriptions |
|-------------|--------------|
| `Ctrl-a` | Beginning of line |
| `Ctrl-e` | End of line |
| `Ctrl-k` | Delete line from cursor forward |
| `Ctrl-w` | Delete word before cursor |
| `Ctrl-y` | Paste whatever was deleted previously with `Ctrl-k` or `Ctrl-w` |
375 |
376 | ### 6.3 Command History and Editing
377 |
378 | By using the up and down arrows, you can scroll through commands that
379 | you have entered previously. So if you want to rerun the same command,
380 | or fix a typo in a command you entered, just scroll up to it and hit
381 | enter to run it or edit the line and then hit enter.
382 |
383 | To list the history of the commands you entered, use the `history`
384 | command:
385 |
386 | ```bash
387 | $ history
388 | ```
389 | ```
390 | 1 echo $PS1
391 | 2 PS1=$
392 | 3 bash
393 | 4 export PS1=$
394 | 5 bash
395 | 6 echo $PATH
396 | 7 which echo
397 | 8 ls --all -l
398 | 9 ls -a -l
399 | 10 ls -al
400 | 11 ls -al manual.xml
401 | ```
402 |
403 | The behavior of the `history` command is controlled by shell
404 | variables:
405 |
406 | ```bash
407 | $ echo $HISTFILE
408 | $ echo $HISTSIZE
409 | ```
410 | You can also rerun previous commands as follows:
411 |
412 | ```bash
413 | $ !-n
414 | $ !gi
415 | ```
416 |
417 | The first example runs the nth previous command and the second one runs
418 | the last command that started with 'gi'.
419 |
420 | **Table. Command History Expansion**

| Designator | Description |
|------------|-------------|
| `!!` | Last command |
| `!n` | Command numbered n in the history |
| `!-n` | Command n previous |
| `!string` | Last command starting with string |
| `!?string` | Last command containing string |
| `^string1^string2` | Execute the previous command with string2 substituted for string1 |
456 |
457 | If you're not sure what command you're going to recall, you can append
458 | `:p` at the end of the text you type to do the recall, and the result
459 | will be printed, but not executed. For example:
460 |
461 | ```bash
462 | $ !gi:p
463 | ```
464 |
465 | You can then use the up arrow key to bring back that statement for
466 | editing or execution.
467 |
468 | You can also search for commands by doing `Ctrl-r` and typing a string
469 | of characters to search for in the command history. You can hit return to
470 | submit, `Ctrl-c` to get out, or `ESC` to put the result on the regular
471 | command line for editing.
472 |
473 | ## 7 Accessing remote machines
474 |
475 | You likely already have `ssh` installed. SSH provides an
476 | encrypted mechanism to connect to a remote Unix-based (i.e., Linux or Mac) terminal. You can [learn more
477 | about using ssh on various operating systems](https://statistics.berkeley.edu/computing/ssh).
478 |
479 | To ssh to another machine, you need to know its (host)name. For example,
480 | to ssh to `arwen.berkeley.edu`, one of the SCF machines, you would:
481 |
482 | ```bash
483 | $ ssh arwen.berkeley.edu
484 | Password:
485 | ```
486 |
487 | At this point you have to type your password. Alternatively, you can [set
488 | up ssh so that you can use it without typing your password](https://statistics.berkeley.edu/computing/ssh-keys).
489 |
490 | If you have a different username on the remote machine than on the machine you are on, you will need to
491 | specify it as well. For example, to specify the username `jarrod`, you
492 | would:
493 |
494 | ```bash
495 | $ ssh jarrod@arwen.berkeley.edu
496 | ```
497 |
498 | If you want to view graphical applications on your local computer that
499 | are running on the remote computer you need to use the `-X` option:
500 |
501 | ```bash
502 | $ ssh -X jarrod@arwen.berkeley.edu
503 | ```
504 |
505 | Alternatively, if you want to copy a file (`file1.txt`) from your local
506 | computer to `arwen.berkeley.edu`, you can use the `scp` command,
507 | which securely copies files between machines:
508 |
509 | ```bash
510 | $ scp file1.txt jarrod@arwen.berkeley.edu:.
511 | ```
512 | The above command will copy `file1.txt` from my current working
513 | directory on my local machine to `jarrod`'s home directory on
514 | `arwen.berkeley.edu`. The `.` following the `:` indicates that I want
515 | to copy the file to jarrod's home directory on the remote machine,
516 | keeping the file name as it is. I could
517 | also replace `.` with any relative path from jarrod's home directory on the
518 | remote machine or I could use an absolute path.
519 |
520 | To copy a file (`file2.txt`) from `arwen.berkeley.edu` to my local
521 | machine:
522 |
523 | ```bash
524 | $ scp jarrod@arwen.berkeley.edu:file2.txt .
525 | ```
526 |
527 | I can even copy a file (`file3.txt`) owned by one user (`jarrod`) on one
528 | remote machine `arwen.berkeley.edu` to the account of another user
529 | (`jmillman`) on another remote machine `scf-ug02.berkeley.edu`:
530 |
531 | ```bash
532 | $ scp jarrod@arwen.berkeley.edu:file3.txt jmillman@scf-ug02.berkeley.edu:.
533 | ```
534 | If instead of copying a single file, I wanted to copy an entire
535 | directory (`src`) from one machine to another, I would use the `-r`
536 | option:
537 |
538 | ```bash
539 | $ scp -r src jmillman@arwen.berkeley.edu:.
540 | ```
541 |
542 | Regardless of whether you are working on a local computer or a remote
543 | one, it is occasionally useful to operate as a different user. For
544 | instance, you may need root (or administrative) access to change file
545 | permissions or install software. (Note this will only be possible
546 | on machines that you own or have special privileges on. The Ubuntu
547 | Subsystem on Windows is one way to have a virtual Linux machine
548 | for which you have root access.)
549 |
550 | For example on an Ubuntu Linux machine (including the Ubuntu Subsystem on Windows),
551 | here's how you can act as the 'root' user to update or add software
552 | on machines where you have administrative access:
553 |
554 | To upgrade all the software on the machine:
555 |
556 | ```bash
557 | $ sudo apt-get upgrade
558 | ```
559 |
560 | To install the text editor vim on the machine:
561 |
562 | ```bash
563 | $ sudo apt-get install vim
564 | ```
565 |
--------------------------------------------------------------------------------
/license.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: License
3 | ---
4 |
5 | This work (by Christopher Paciorek and Jarrod Millman) is licensed under a Creative Commons Attribution 4.0 International License.
6 |
7 |
--------------------------------------------------------------------------------
/ls_format.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/berkeley-scf/tutorial-using-bash/b818e852ced6e0c5b620abef4352c56103c0a859/ls_format.png
--------------------------------------------------------------------------------
/managing-processes.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: Managing processes
3 | date: 2025-04-09
4 | format:
5 | html:
6 | theme: cosmo
7 | css: assets/styles.css
8 | toc: true
9 | code-copy: true
10 | code-block-bg: true
11 | code-block-border-left: "#31BAE9"
12 | engine: knitr
13 | ipynb-shell-interactivity: all
14 | code-overflow: wrap
15 | ---
16 |
17 | A process is a program that is being executed. Processes have the
18 | following attributes:
19 |
20 | - A lifetime.
21 | - A process ID (PID).
22 | - A user ID (UID).
23 | - A group ID (GID).
24 | - A parent process with its own ID (PPID).
25 | - An environment.
26 | - A current working directory.
27 |
28 | Anytime you do something on the computer, one or more processes will
29 | start up to carry out the activity.
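
Several of these attributes can be inspected for the current shell process itself. The sketch below relies only on standard shell variables and the `id`, `pwd`, and `env` commands; the variable names on the left are chosen for this example:

```bash
pid=$$                     # PID of the current shell
ppid=$PPID                 # PID of the process that started this shell
uid=$(id -u)               # numeric user ID
gid=$(id -g)               # numeric group ID
cwd=$(pwd)                 # current working directory
env_lines=$(env | wc -l)   # number of lines in the environment listing

echo "PID:  $pid"
echo "PPID: $ppid"
echo "UID:  $uid"
echo "GID:  $gid"
echo "cwd:  $cwd"
echo "environment lines: $env_lines"
```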
30 |
31 | ## 1 Monitoring
32 |
33 | ### 1.1 Monitoring processes
34 |
35 | #### Using `ps`
36 |
37 | Examining subprocesses of your shell with `ps`:
38 |
39 | ```bash
40 | $ ps
41 | PID TTY TIME CMD
42 | 19370 pts/3 00:00:00 bash
43 | 22846 pts/3 00:00:00 ps
44 | ```
45 |
46 | Examining in more detail subprocesses of your shell with `ps`:
47 |
48 | ```bash
49 | $ ps -f
50 | UID PID PPID C STIME TTY TIME CMD
51 | jarrod 19370 19368 0 10:51 pts/3 00:00:00 bash
52 | jarrod 22850 19370 0 14:57 pts/3 00:00:00 ps -f
53 | ```
54 |
55 | Examining in more detail all processes on your computer:
56 |
57 | ```bash
58 | $ ps -ef
59 | UID PID PPID C STIME TTY TIME CMD
60 | root 1 0 0 Aug21 ? 00:00:05 /usr/lib/systemd
61 | root 2 0 0 Aug21 ? 00:00:00 [kthreadd]
62 | root 3 2 0 Aug21 ? 00:00:07 [ksoftirqd/0]
63 | root 5 2 0 Aug21 ? 00:00:00 [kworker/0:0H]
64 |
65 | root 16210 1 0 07:19 ? 00:00:00 login -- jarrod
66 | jarrod 16219 16210 0 07:19 tty1 00:00:00 -bash
67 | jarrod 16361 16219 0 07:19 tty1 00:00:00 /bin/sh /bin/startx
68 |
69 | ```
70 |
71 | You can use the `-u` option to see percent CPU and percent memory used
72 | by each process. You can use the `-o` option to provide your own
73 | user-defined format; for example:
74 |
75 |
76 | ```bash
77 | $ ps -o pid,ni,pcpu,pmem,user,comm
78 | PID NI %CPU %MEM USER COMMAND
79 | 18124 0 0.0 0.0 jarrod bash
80 | 22963 0 0.0 0.0 jarrod ps
81 | ```
82 |
83 | To see the hierarchical process structure (i.e., which processes started which other processes), you can use the `pstree`
84 | command.
85 |
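The `-o` format is also convenient in scripts. As a minimal sketch (the variable names are illustrative), the snippet below asks `ps` for single fields of the current shell. It assumes a POSIX-style `ps`, where a trailing `=` after a field name suppresses that column's header, and it uses `$$`, which expands to the PID of the shell itself.

```bash
# PID of the shell running these commands
shell_pid=$$

# Ask ps for one field at a time; "pid=" prints the value without a header
reported_pid=$(ps -o pid= -p "$shell_pid" | tr -d ' ')
parent_pid=$(ps -o ppid= -p "$shell_pid" | tr -d ' ')
shell_name=$(ps -o comm= -p "$shell_pid")

echo "shell $shell_name has PID $reported_pid and parent $parent_pid"
```

The parent PID reported here is the PPID attribute listed at the start of this page.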
86 | #### Using `top`
87 |
88 | Examining processes with `top`:
89 |
90 | ```bash
91 | $ top
92 | top - 13:49:07 up 1:49, 3 users, load average: 0.10, 0.15, 0.18
93 | Tasks: 160 total, 1 running, 158 sleeping, 1 stopped, 0 zombie
94 | %Cpu(s): 2.5 us, 0.5 sy, 0.0 ni, 96.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
95 | KiB Mem : 7893644 total, 5951552 free, 1085584 used, 856508 buff/cache
96 | KiB Swap: 7897084 total, 7897084 free, 0 used. 6561548 avail Mem
97 |
98 | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
99 | 1607 jarrod 20 0 2333568 974888 212944 S 12.5 12.4 11:10.67 firefox
100 | 3366 jarrod 20 0 159828 4312 3624 R 6.2 0.1 0:00.01 top
101 | 1 root 20 0 193892 8484 5636 S 0.0 0.1 0:01.78 systemd
102 |
103 | ```
104 |
105 | The first few lines show general information about the machine, while
106 | the remaining lines show information for each process.
107 |
108 | - The `RES` column indicates the amount of physical memory that a process is
109 | using (in KiB by default, unless a unit suffix is shown).
110 | - The `%MEM` column shows that memory use relative to the physical memory available on the
111 | computer.
112 | - The `%CPU` column shows the proportion of a CPU core that
113 | the process is using (which can exceed 100% if a process is
114 | threaded).
115 | - The `TIME+` column shows the cumulative CPU time the process has
116 | used (not the wall-clock time since it started).
117 |
118 | To quit `top`, type `q`.
119 |
120 | You can also renice (change the priority of) or kill jobs (see below
121 | for further details on killing jobs) from within `top`: just type `r` or `k`,
122 | respectively, and proceed from there.
123 |
124 | ### 1.2 Monitoring memory use
125 |
126 | One of the main things to watch out for is a job that is using close to
127 | 100% of memory and much less than 100% of CPU. What is generally
128 | happening is that your program has run out of memory and is using
129 | virtual memory on disk, spending most of its time writing to/from disk,
130 | sometimes called *paging* or *swapping*. If this happens, it can be a
131 | very long time, if ever, before your job finishes.
132 |
133 | Note that the per-process memory use reported by `top` and `ps` may "double count"
134 | memory that is being used simultaneously by multiple processes. To see the total
135 | amount of memory actually available on a machine:
136 |
137 | ```bash
138 | $ free -h
139 | total used free shared buff/cache available
140 | Mem: 251G 998M 221G 2.6G 29G 247G
141 | Swap: 7.6G 210M 7.4G
142 | ```
143 |
144 | You'll generally be interested in the `Mem` row and in the `total`, `used` and `available` columns.
145 | The `free` column can be confusing and [does not actually indicate how much memory is still available
146 | to be used](https://berkeley-scf.github.io/tutorial-databases/db-management#52-memory),
147 | so you'll want to focus on the `available` column.
148 |
149 |
150 | ## 2 Job Control
151 |
152 | ### 2.1 Foreground and background jobs
153 |
154 | When you run a command in a shell by simply typing its name, you are
155 | said to be running in the foreground. When a job is running in the
156 | foreground, you can't type additional commands into that shell session,
157 | but there are two signals that can be sent to the running job through
158 | the keyboard. To interrupt a program running in the foreground, use
159 | `Ctrl-c`; to quit a program, use `Ctrl-\`. While modern windowed systems
160 | have lessened the inconvenience of tying up a shell with foreground
161 | processes, there are some situations where running in the foreground is
162 | not adequate.
163 |
164 | The primary need for an alternative to foreground processing arises when
165 | you wish to have jobs continue to run after you log off the computer. In
166 | cases like this you can run a program in the background by simply
167 | terminating the command with an ampersand (`&`). However, before putting
168 | a job in the background, you should consider how you will access its
169 | results, since *stdout* is not preserved when you log off from the
170 | computer. Thus, redirection (including redirection of *stderr*) is
171 | essential when running jobs in the background. As a simple example,
172 | suppose that you wish to run a Python script, and you don't want it to
173 | terminate when you log off.
174 |
175 | ```bash
176 | $ python code.py > code.pyout 2>&1 &
177 | ```
178 |
179 | What does the inscrutable `2>&1` do? Recall from [earlier](using-commands.html#overview-of-redirection) that it says to send *stderr* to the same
180 | place as *stdout*, which in this case has been redirected to `code.pyout`.
181 |
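To see the effect of `2>&1` without waiting on a real analysis, here is a minimal sketch. The function `emit_messages` is a hypothetical stand-in for the Python script, and the log file is a temporary file created with `mktemp`.

```bash
# Hypothetical stand-in for a long-running program: writes one line to
# stdout and one line to stderr
emit_messages() {
    echo "result on stdout"
    echo "warning on stderr" >&2
}

log_file=$(mktemp)

# Run in the background, sending stdout to the log and stderr to the same place
emit_messages > "$log_file" 2>&1 &

# Wait for the background job to finish, then inspect the log
wait $!
cat "$log_file"
```

The order of the redirections matters: `2>&1` has to come after `> file`, because it copies wherever *stdout* points at that moment. Written the other way around, *stderr* would still go to the terminal.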
182 | If you forget to put a job in the background when you first execute it,
183 | you can do it while it's running in the foreground in two steps. First,
184 | suspend the job using the `Ctrl-z` signal. After receiving the signal,
185 | the program will pause execution, but it will still have access to all
186 | files and other resources. Next, issue the `bg` command, which will
187 | start the stopped job running in the background.
188 |
189 | ### 2.2 Listing and killing jobs
190 |
191 | Since only foreground jobs will accept signals through the keyboard, if
192 | you want to terminate a background job you must first determine the
193 | unique process id (PID) for the process you wish to terminate through
194 | either `ps` or `top`. Here we'll illustrate use of `ps` again.
195 |
196 | To see all processes owned by a specific user (e.g., `jarrod`), I can
197 | use the `-U jarrod` option:
198 |
199 | ```bash
200 | $ ps -U jarrod
201 | ```
202 |
203 | If I want to get more information (e.g., `%CPU` and `%MEM`), I can
204 | add the `-u` option:
205 |
206 | ```bash
207 | $ ps -U jarrod -u
208 | USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
209 | jarrod 16116 12.0 6.0 118804 5080 tty1 Ss 16:25 133:01 python
210 | ```
211 |
212 | In this example, the `ps` output tells us that this python job has a PID
213 | of `16116`, that it has used 133 minutes of CPU time, is using 12% of
214 | CPU and 6% of memory, and that it started at 16:25. You could then issue
215 | the command:
216 |
217 | ```bash
218 | $ kill 16116
219 | ```
220 |
221 | or, if that doesn't work:
222 |
223 | ```bash
224 | $ kill -9 16116
225 | ```
226 |
227 | to terminate the job. Another useful command in this regard is
228 | `killall`, which accepts a program name instead of a process id, and
229 | will kill all instances of the named program (in this case, R):
230 |
231 | ```bash
232 | $ killall R
233 | ```
234 |
235 | Of course, it will only kill the jobs that belong to you, so it will not
236 | affect the jobs of other users. Note that the `ps` and `kill` commands
237 | only apply to the particular computer on which they are executed, not to
238 | the entire computer network. Thus, if you start a job on one machine,
239 | you must log back into that same machine in order to manage your job.
240 |
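When you start the background job yourself from the same shell, you don't need to search the `ps` output, because the shell stores the PID of the most recent background job in `$!`. The sketch below uses `sleep 300` as a placeholder for a real job; the variable names are illustrative.

```bash
# Start a placeholder long-running job in the background
sleep 300 &
job_pid=$!    # PID of the most recently started background job

# kill -0 sends no signal; it only tests whether the process exists
if kill -0 "$job_pid" 2>/dev/null; then
    state_before="running"
else
    state_before="gone"
fi

# Ask the job to terminate (SIGTERM), then collect its exit status
kill "$job_pid"
exit_status=0
wait "$job_pid" 2>/dev/null || exit_status=$?

echo "before: $state_before, exit status after kill: $exit_status"
```

In bash, a process ended by a signal reports an exit status of 128 plus the signal number, so a nonzero status here indicates that the job was terminated rather than finishing on its own.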
241 | Finally, let's see how to build up a command to kill firefox using some of the
242 | tools we've seen. First let's pipe the output of `ps -e` to `grep` to
243 | select the line corresponding to `firefox`:
244 |
245 | ```bash
246 | $ ps -e | grep firefox
247 | 16517 ? 00:10:03 firefox
248 | ```
249 |
250 | We can now use `awk` to select the first column, which contains the
251 | process ID corresponding to `firefox`:
252 |
253 | ```bash
254 | $ ps -e | grep firefox | awk '{ print $1 }'
255 | 16517
256 | ```
257 |
258 | Finally, we can pipe this to the `kill` command using `xargs` or
259 | command substitution:
260 |
261 | ```bash
262 | $ ps -e | grep firefox | awk '{ print $1 }' | xargs kill
263 | $ kill $(ps -e | grep firefox | awk '{ print $1 }')
264 | ```
265 |
266 | As mentioned before, we can't pipe the PID directly to `kill` because
267 | `kill` takes the PID(s) as argument(s) rather than reading them from stdin.
268 |
269 |
270 | ## 3 Screen
271 |
272 | Screen allows you to create *virtual terminals*, which are not connected
273 | to your actual terminal or shell. This allows you to run multiple
274 | programs from the command line and leave them all in the foreground in
275 | their own virtual terminal. Screen provides facilities for managing
276 | several virtual terminals including:
277 |
278 | - listing them,
279 | - switching between them, and
280 | - disconnecting from one machine and then reconnecting to an existing virtual terminal from another.
281 |
282 | While we will only discuss its basic operation, we will cover enough to
283 | be of regular use.
284 |
285 | `tmux` is an alternative to `screen`.
286 |
287 | Calling screen:
288 |
289 | ```bash
290 | $ screen
291 | ```
292 |
293 | will open a single window and you will see a new bash prompt. You just
294 | work at this prompt as you normally would. The difference is that you
295 | can disconnect from this window by typing `Ctrl-a d` and you will see
296 | something like this:
297 |
298 | ```bash
299 | $ screen
300 | [detached from 23974.pts-2.t430u]
301 | ```
302 |
303 | ::: {.callout-tip title="Screen commands"}
304 |
305 | All the screen key commands begin with the control key combination
306 | `Ctrl-a` followed by another key. For instance, when you are in a
307 | screen session and type `Ctrl-a ?`, screen will display a help screen
308 | with a list of its keybindings.
309 |
310 | :::
311 |
312 | You can now list your screen sessions:
313 |
314 | ```bash
315 | $ screen -ls
316 | There is a screen on:
317 | 23974.pts-2.t430u (Detached)
318 | ```
319 |
320 | To reconnect:
321 |
322 | ```bash
323 | $ screen -r
324 | ```
325 |
326 | You can start multiple screen sessions. This is what it might look like
327 | if you have 3 screen sessions:
328 |
329 | ```bash
330 | $ screen -ls
331 | There are screens on:
332 | 24274.pts-2.t430u (Attached)
333 | 24216.pts-2.t430u (Detached)
334 | 24158.pts-2.t430u (Detached)
335 | ```
336 |
337 | with the first session active on a machine.
338 |
339 | To specify that you want to reattach to session `24158.pts-2.t430u`,
340 | type:
341 |
342 | ```bash
343 | $ screen -r 24158.pts-2.t430u
344 | ```
345 |
346 | If you have several screen sessions, you will want to name your screen
347 | session something more informative than `24158.pts-2.t430u`. To name a
348 | screen session `gene-analysis` you can use the `-S` option when calling
349 | screen:
350 |
351 | ```bash
352 | $ screen -S gene-analysis
353 | ```
354 |
355 | While there are many more features and keybindings available for screen,
356 | you've already seen enough screen to be useful. For example, imagine you
357 | ssh to a remote machine from your laptop to run an analysis. The first
358 | thing you do at the bash prompt on the remote machine is:
359 |
360 | ```bash
361 | $ screen -S dbox-study
362 | ```
363 |
364 | Then you start your analysis script `dbox-analysis.py` running:
365 |
366 | ```bash
367 | $ dbox-analysis.py
368 | Starting statistical analysis ...
369 | Processing subject 1 ...
370 | Processing subject 2 ...
371 | ```
372 |
373 | If your study has 50 subjects and processing each subject takes 20
374 | minutes, you will not want to sit there watching your monitor. So you
375 | use `Ctrl-a d` to detach the session and you will then see:
376 |
377 | ```bash
378 | $ screen -S dbox-study
379 | [detached from 2799.dbox-study]
380 | $
381 | ```
382 |
383 | Now you can log off your laptop and go home. Sometime after dinner, you
384 | decide to check on your job. So you ssh from your home computer to the
385 | remote machine again and type the following at the bash prompt:
386 |
387 | ```bash
388 | $ screen -r dbox-study
389 | ```
390 |
391 | You should then be able to see the progress of your analysis script,
392 | as if you had kept a terminal open the whole time.
393 |
--------------------------------------------------------------------------------
/regex.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: Regular expressions
3 | date: 2025-04-09
4 | format:
5 | html:
6 | theme: cosmo
7 | css: assets/styles.css
8 | toc: true
9 | code-copy: true
10 | code-block-bg: true
11 | code-block-border-left: "#31BAE9"
12 | engine: knitr
13 | ipynb-shell-interactivity: all
14 | code-overflow: wrap
15 | ---
16 |
17 | Regular expressions (regex) are a domain-specific language for finding
18 | patterns and are one of the key functionalities in scripting languages
19 | such as Python, as well as the UNIX utilities `sed`, `awk`, and
20 | `grep`. We'll just cover the basic use of regular
21 | expressions in bash, but once you know that, it would be easy to use
22 | them elsewhere (Python, R, etc.). At the level we'll consider them, the
23 | syntax is quite similar.
24 |
25 | ::: {.callout-important title="Different types of regular expressions"}
26 |
27 | POSIX.2 regular expressions come in two flavors: extended regular
28 | expressions and basic (or obsolete) regular expressions. The extended
29 | syntax has metacharacters `()` and `{}`, while the basic syntax
30 | requires the metacharacters to be designated `\(\)` and `\{\}`.
31 | In addition to the POSIX standard, Perl regular expressions are also
32 | widely used. While we won't go into detail, we will see some examples
33 | of each syntax. In the examples that follow we'll generally use the
34 | extended syntax by using the `-E` flag to `grep`.
35 | :::
36 |
37 | ## 1 Overview and core syntax
38 |
39 | The basic idea of regular expressions is that they allow us to find
40 | matches of strings or patterns in strings, as well as do substitution.
41 | Regular expressions are good for tasks such as:
42 |
43 | - extracting pieces of text - for example finding all the phone
44 | numbers in a document;
45 | - creating variables from information found in text;
46 | - cleaning and transforming text into a uniform format;
47 | - mining text by treating documents as data; and
48 | - scraping the web for data.
49 |
50 | Regular expressions are constructed from three things:
51 |
52 | 1. *Literal characters* are matched only by the characters themselves,
53 | 2. *Character classes* are matched by any single member in the class,
54 | and
55 | 3. *Modifiers* operate on either of the above or combinations of them.
56 |
57 | Note that the syntax is very concise, so it's helpful to break down
58 | individual regular expressions into the component parts to understand
59 | them. Since regex are their own language, it's a good idea to build up a
60 | regex in pieces as a way of avoiding errors just as we would with any
61 | computer code. You'll also want to test your regex on examples, for which
62 | this [online testing tool](https://regex101.com) is helpful.
63 |
64 | It is also helpful to search for common regex online
65 | before trying to craft your own. For instance, if you wanted to use a
66 | regex that matches valid email addresses, you would need to match
67 | anything that complies with the [RFC
68 | 822](http://www.ietf.org/rfc/rfc0822.txt?number=822) grammar. If you
69 | look over that document, you will quickly realize that implementing a
70 | correct regular expression to validate email addresses is extremely
71 | complex. So if you are writing a website that validates email addresses,
72 | it is best to look for a bug-vetted implementation rather than creating
73 | your own.
74 |
75 | The special characters (meta-characters) used for defining regular
76 | expressions are:
77 |
78 | * . ^ $ + ? ( ) [ ] { } | \
79 |
80 | To use these characters literally as characters, we have to 'escape'
81 | them. In bash, you escape these characters by placing a single backslash
82 | before the character you want to escape. In R, we have to use two
83 | backslashes instead of a single backslash because R uses a single
84 | backslash to symbolize certain control characters, such as `\n` for
85 | newline.
86 |
87 | To learn more about regular expressions, you can type:
88 |
89 | ```bash
90 | $ man 7 regex
91 | ```
92 |
93 | ## 2 Character sets and character classes
94 |
95 | We can use character sets to match any of the characters in a set.
96 |
97 | | Operators | Description |
98 | |-----------|-------------|
99 | | `[abc]` | Match any single character from the listed characters |
100 | | `[a-z]` | Match any single character from the range of characters |
101 | | `[^abc]` | Match any single character not among the listed characters |
102 | | `[^a-z]` | Match any single character not among the listed range of characters |
103 | | `.` | Match any single character except a newline |
104 | | `\` | Turn off (escape) the special meaning of a metacharacter |
131 |
132 | If we want to search for any one of a set of characters, we use a
133 | character set, such as `[13579]` or `[abcd]` or `[0-9]` (where the dash
134 | indicates a sequence) or `[0-9a-z]`. To indicate any character not in a
135 | set, we place a `^` just inside the first bracket: `[^abcd]`.
136 |
137 | Here's an example of using regex with `grep` to find all lines in `test.txt` that
138 | contain at least one numeric digit.
139 |
140 | ```bash
141 | $ grep -E '[0-9]' test.txt
142 | ```
143 |
144 | or with the `-o` flag to find and return only the actual digits
145 |
146 | ```bash
147 | $ grep -E -o '[0-9]' test.txt
148 | ```
149 |
150 | There are a bunch of named character classes so that we don't have to write
151 | out common sets of characters. The syntax is `[:CLASS:]` where `CLASS`
152 | is one of the following values:
153 |
154 | "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
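155 | "lower", "print", "punct", "space", "upper", "word" or "xdigit". (Of these, "ascii" and "word" are Perl extensions rather than POSIX classes, so `grep -E` may not accept them.)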
156 |
157 | So to find any line that contains a punctuation symbol:
158 |
159 | ```bash
160 | $ grep -E '[[:punct:]]' test.txt
161 | ```
162 |
163 | Note that to make a character set with a character class you need two square
164 | brackets, e.g., with the digit class: `[[:digit:]]`. Or we can make a combined
165 | character set such as `[[:alnum:]_]` (to find any alphabetic or
166 | numeric characters or an underscore).
167 | Or here, any line with a digit, a period, or a comma.
168 |
169 | ```bash
170 | $ grep -E '[[:digit:].,]' test.txt
171 | ```
172 |
173 | Interestingly, we don't need to escape the period inside the
174 | character set, even though it is a meta-character elsewhere in a regex (the comma is always an ordinary character).
175 |
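To try these character sets without a `test.txt` of your own, here is a small sketch that builds a three-line sample file (its contents are made up for illustration) and runs the same kinds of searches. The patterns are quoted so that the shell passes them to `grep` unchanged instead of treating the brackets as a filename glob.

```bash
# Build a small sample file (contents are illustrative)
sample_file=$(mktemp)
printf '%s\n' 'no digits here' 'room 42' 'price: 3.50' > "$sample_file"

# Lines containing at least one digit
digit_lines=$(grep -E '[0-9]' "$sample_file")

# Only the matched digits, one per output line
digits_only=$(grep -E -o '[[:digit:]]' "$sample_file")

# Number of lines containing a punctuation character
punct_line_count=$(grep -E -c '[[:punct:]]' "$sample_file")

echo "$digit_lines"
echo "lines with punctuation: $punct_line_count"
```

Only the last sample line contains punctuation (a colon and a period), so the `-c` count reflects lines rather than individual characters.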
176 |
177 | ## 3 Location-specific matches
178 |
179 | We can use position anchors to make location-specific matches.
180 |
181 | | Operators | Description |
182 | |-----------|-------------|
183 | | `^` | Match the beginning of a line. |
184 | | `$` | Match the end of a line. |
199 |
200 | To find a pattern at the beginning of the string, we use `^` (note this
201 | was also used for negation, but in that case occurs only inside square
202 | brackets) and to find it at the end we use `$`.
203 |
204 | Here we'll search for lines that start with a digit and for lines that
205 | end with a digit.
206 |
207 | ```bash
208 | $ grep -E '^[0-9]' test.txt
209 | $ grep -E '[0-9]$' test.txt
210 | ```
211 |
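As a quick sketch of how the two anchors behave, the snippet below writes four made-up lines to a temporary file and searches them. Using both anchors in one pattern restricts the match to the whole line.

```bash
# Sample lines (illustrative)
anchor_file=$(mktemp)
printf '%s\n' '7 days' 'day 7' 'seven days' '2024' > "$anchor_file"

# Lines that begin with a digit
starts_with_digit=$(grep -E '^[0-9]' "$anchor_file")

# Lines that end with a digit
ends_with_digit=$(grep -E '[0-9]$' "$anchor_file")

# Lines made up entirely of digits: both anchors together
only_digits=$(grep -E '^[0-9]+$' "$anchor_file")

echo "only digits: $only_digits"
```

Here `7 days` and `2024` start with a digit, `day 7` and `2024` end with one, and only `2024` satisfies the fully anchored pattern.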
212 | ## 4 Repetitions, Grouping, and References
213 |
214 | Now suppose I wanted to be able to detect phone numbers, email
215 | addresses, etc. I often need to be able to deal with repetitions of
216 | characters or character sets.
217 |
218 | **Modifiers**
219 |
220 | | Operators | Description |
221 | |-----------|-------------|
222 | | `*` | Match zero or more instances of the preceding character or regex. |
223 | | `?` | Match zero or one instance of the preceding character or regex. |
224 | | `+` | Match one or more instances of the preceding character or regex. |
225 | | `{n,m}` | Match a range of occurrences (at least n, no more than m) of the preceding character or regex. |
226 | | `\|` | Match the character or expression to the left or right of the vertical bar. |
252 |
253 | Here are some examples of repetitions:
254 |
255 | - `[[:digit:]]*` : any number of digits (zero or more)
256 | - `[[:digit:]]+` : at least one digit
257 | - `[[:digit:]]?` : zero or one digits
258 | - `[[:digit:]]{1,3}` : at least one and no more than three digits
259 | - `[[:digit:]]{2,}` : two or more digits
260 |
261 | Another example is that `\[.*\]` is the pattern of closed square brackets
262 | with any number of characters (`.*`) inside:
263 |
264 | ```bash
265 | $ grep -E "\[.*\]" test.txt
266 | ```
267 |
268 | Note that the quotation marks ensure that the backslashes are passed to
269 | `grep` and not simply interpreted by the shell, while the `\` is needed
270 | so that `[` and `]` are treated as simple characters since they are
271 | meta-characters in the regex syntax.
272 |
273 | As shown above, we can use `|` to mean "or". For example, to match one or
274 | more occurrences of "http" or "ftp":
275 |
276 |
277 | ```bash
278 | $ grep -E -o "(http|ftp)" test.txt
279 | ```
280 |
281 | Parentheses are also used with a pipe (`|`) when working with
282 | multi-character sequences, such as `(http|ftp)`. Also, here we need
283 | quotes or the shell tries to interpret the `(` and `|` as shell syntax
284 | and not as part of the regular expression.
285 |
286 | Next let's see the use of repetition to look for more complicated multi-character patterns. For
287 | example, if you wanted to match phone numbers whether they start with
288 | `1-` or not you could use the following:
289 |
290 | (1-)?[[:digit:]]{3}-[[:digit:]]{3}-[[:digit:]]{4}
291 |
292 | The first part of the pattern `(1-)?` matches 0 or 1 occurrences of
293 | `1-`. Then the pattern `[[:digit:]]{3}` matches any 3 digits. Similarly,
294 | the pattern `[[:digit:]]{4}` matches any 4 digits. So the whole pattern
295 | matches any three digits followed by `-`, then another three digits, and then followed by four
296 | digits when it is preceded by 0 or 1 occurrences of `1-`.
297 |
298 | Now let's consider a file named `file2.txt` with the following content:
299 |
300 | ```
301 | Here is my number: 919-543-3300.
302 | hi John, good to meet you
303 | They bought 731 bananas
304 | Please call 1.919.554.3800
305 | I think he said it was 337.4355
306 | ```
307 |
308 | Let's use a regular expression pattern to print all lines
309 | containing phone numbers:
310 |
311 | ```bash
312 | $ grep '(1-)?[[:digit:]]{3}-[[:digit:]]{4}' file2.txt
313 | ```
314 |
315 | You will notice that this doesn't match any lines. The reason is that
316 | the group syntax `(1-)` and the `{}` notation are part of the
317 | extended syntax, while `grep` uses the basic syntax by default. To have `grep` use the extended syntax, you can either
318 | use the `-E` option (as we've been doing above):
319 |
320 | ```bash
321 | $ grep -E '(1-)?[[:digit:]]{3}-[[:digit:]]{4}' file2.txt
322 | Here is my number: 919-543-3300.
323 | ```
324 |
325 | or use the `egrep` command (deprecated in recent versions of GNU `grep` in favor of `grep -E`):
326 |
327 | ```bash
328 | $ egrep '(1-)?[[:digit:]]{3}-[[:digit:]]{4}' file2.txt
329 | Here is my number: 919-543-3300.
330 | ```
331 |
332 | If we want to match regardless of whether the phone number is separated
333 | by a minus `-` or a period `.`, we could use the pattern `[-.]`:
334 |
335 | ```bash
336 | $ egrep '(1[-.])?[[:digit:]]{3}[-.][[:digit:]]{4}' file2.txt
337 | Here is my number: 919-543-3300.
338 | Please call 1.919.554.3800
339 | I think he said it was 337.4355
340 | ```
341 |
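As a sketch of the difference between the two POSIX flavors, the same pattern can be written in the basic syntax by escaping the grouping and repetition metacharacters. The snippet recreates `file2.txt` in a temporary location (an assumption made so that the example is self-contained) and uses `\{0,1\}` as the basic-syntax counterpart of `?`.

```bash
phone_file=$(mktemp)
cat > "$phone_file" <<'EOF'
Here is my number: 919-543-3300.
hi John, good to meet you
They bought 731 bananas
Please call 1.919.554.3800
I think he said it was 337.4355
EOF

# Extended syntax: ( ) and { } are metacharacters as written
extended_match=$(grep -E '(1-)?[[:digit:]]{3}-[[:digit:]]{4}' "$phone_file")

# Basic syntax: the same metacharacters must be escaped
basic_match=$(grep '\(1-\)\{0,1\}[[:digit:]]\{3\}-[[:digit:]]\{4\}' "$phone_file")

# Accept either separator and print only the matched text
numbers_only=$(grep -E -o '(1[-.])?[[:digit:]]{3}[-.][[:digit:]]{4}' "$phone_file")

echo "$numbers_only"
```

Both flavors select the same line. With `-o`, the last pattern prints only the seven-digit portion of each number, since it has no component for a three-digit area code.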
342 | ::: {.callout-tip title="Exercise"}
343 | Explain what the following regular expression matches:
344 |
345 | ```bash
346 | $ grep '^[^T]*is.*$' file1.txt
347 | ```
348 | :::
349 |
350 | ## 5 Greedy matching
351 |
352 | Regular expression pattern matching is *greedy*---by default, the
353 | longest matching string is chosen.
354 |
355 | Suppose we have the following file:
356 |
357 | ```bash
358 | $ cat file1.txt
359 | Do an internship <b> in place </b> of <b> one </b> course.
360 | ```
361 |
362 | If we want to match the html tags (e.g., `<b>` and `</b>`), we might be
363 | tempted to use the pattern `<.*>`. Using the `-o` option to grep, we can
364 | have grep print out just the part of the text that the pattern matches:
365 |
366 | ```bash
367 | $ grep -o "<.*>" file1.txt
368 | <b> in place </b> of <b> one </b>
369 | ```
370 |
371 | To get a non-greedy match, you can use the modifier `?` after the
372 | quantifier. However, this requires that we use the Perl syntax. In order
373 | for grep to use the Perl syntax, we need to use the `-P` option:
374 |
375 | ```bash
376 | $ grep -P -o "<.*?>" file1.txt
377 | <b>
378 | </b>
379 | <b>
380 | </b>
381 | ```
382 |
383 | However, one can often avoid greedy matching by being more clever.
384 |
385 | ::: {.callout-tip title="Challenge"}
386 | How could we change our regexp to avoid the greedy
387 | matching without using the `?` modifier? Hint: Is there some character
388 | set that we don't want to be inside the angle brackets?
389 | :::
390 |
391 | ::: {.callout-tip title="globs vs. regex"}
392 |
393 | Be sure you understand the difference between [filename globbing](file-management#3-filename-matching-globbing)
394 | and regular expressions. Filename globbing only works for filenames, while
395 | regular expressions are used to match patterns in text more
396 | generally. While they both use the same set of symbols, they mean
397 | different things (e.g., `*` matches 0 or more characters when
398 | globbing but matches 0 or more repetitions of the character that
399 | precedes it when used in a regular expression).
400 | :::
401 |
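To make the contrast concrete, here is a minimal sketch that selects the same files once with a glob and once with a regex. The directory and filenames are made up for illustration and live in a temporary directory.

```bash
demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/b.txt" "$demo_dir/notes.md"

# Glob: the shell expands *.txt to matching filenames before the command runs
glob_matches=$(cd "$demo_dir" && echo *.txt)

# Regex: grep filters lines of text; \. is a literal period, $ anchors the end
regex_matches=$(ls "$demo_dir" | grep -E '\.txt$')

# As a regex, a* means "zero or more a's", which every line satisfies
star_regex_count=$(ls "$demo_dir" | grep -E -c 'a*')

echo "glob: $glob_matches"
echo "lines matching the regex a*: $star_regex_count"
```

The glob `*.txt` and the regex `\.txt$` pick out the same two files, while the regex `a*` matches all three names because zero repetitions of `a` count as a match.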
--------------------------------------------------------------------------------
/shell-programming.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: Shell programming
3 | date: 2025-04-09
4 | format:
5 | html:
6 | theme: cosmo
7 | css: assets/styles.css
8 | toc: true
9 | code-copy: true
10 | code-block-bg: true
11 | code-block-border-left: "#31BAE9"
12 | engine: knitr
13 | ipynb-shell-interactivity: all
14 | code-overflow: wrap
15 | ---
16 |
17 | ## 1 Shell scripts
18 |
19 | Shell scripts are files containing shell commands (commonly with the
20 | extension `.sh`). To run a shell script called `file.sh`, you would type:
22 |
23 | ```bash
24 | $ source ./file.sh
25 | ```
26 |
27 | or:
28 |
29 | ```bash
30 | $ . ./file.sh
31 | ```
32 |
33 | Note that if you just type `file.sh`, the operating system will
34 | generally have two problems: first, it will have trouble finding the script (if `file.sh` is not in a
35 | directory included in the `PATH` environment variable), and second, it will not recognize the script as
36 | executable (if the `x` permission is not set for `file.sh`).
37 |
38 | To be sure that the operating system knows what shell to use
39 | to interpret the script, the first line of the script should be
40 | `#!/bin/bash` (in the case that you're using the bash shell).
41 |
42 | The best thing to do is to set `file.sh` to be executable (i.e., to have the 'x' flag set), and then you can execute it simply with:
43 |
44 | ```bash
45 | $ ./file.sh
46 | ```
47 |
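The steps for making a script executable can be sketched as follows. The script name `hello.sh` and its contents are hypothetical, and the example assumes that `/bin/bash` exists and that the temporary directory allows execution.

```bash
script_dir=$(mktemp -d)

# Write a two-line script; the first line tells the OS which shell to use
cat > "$script_dir/hello.sh" <<'EOF'
#!/bin/bash
echo "hello from a script"
EOF

# Give the owner execute permission, then run the script by its path
chmod u+x "$script_dir/hello.sh"
script_output=$("$script_dir/hello.sh")

echo "$script_output"
```

One difference worth keeping in mind: `source` runs the commands in your current shell, so any variables the script sets remain afterwards, while running the script by its path starts a separate shell process.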
48 | ## 2 Functions
49 |
50 | You can define your own utilities by creating a shell function. This
51 | allows you to automate things that are more complicated than you can do
52 | with an
53 | [alias](using-commands#8-aliases-command-shortcuts-and-bashrc).
54 | One nice thing about shell functions is that the shell
55 | automatically takes care of function arguments for you. It places the
56 | arguments given by the user into local variables in the function called
57 | (in order): `$1 $2 $3` etc. It also fills `$#` with the number of
58 | arguments given by the user. Here's an example of using arguments in a
59 | function that saves me some typing when I want to copy a file to the SCF
60 | filesystem:
61 |
62 | ```bash
63 | function putscf() {
64 |     # quote the arguments so that names containing spaces are handled correctly
65 |     scp "$1" jarrod@arwen.berkeley.edu:"$2"
66 | }
66 | ```
67 |
68 | To use this function, I just do the following to copy `unit1.pdf` from
69 | the current directory on whatever non-SCF machine I'm on to the
70 | directory `~/teaching/243` on SCF:
71 |
72 | ```bash
73 | $ putscf unit1.pdf teaching/243/.
74 | ```
75 |
76 | Often you'd want to put such functions in your `.bashrc` file.
77 |
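To see how the positional variables are filled in, here is a minimal sketch with a hypothetical function `describe_args` that just reports what it received.

```bash
# Hypothetical helper that reports how its arguments were received
describe_args() {
    echo "count=$# first=$1 second=$2"
}

# Two plain arguments
describe_args unit1.pdf teaching/243

# Quoting keeps a name containing a space together as a single argument
describe_args "my notes.pdf" teaching/243
```

In both calls `$#` is 2. Without the quotes in the second call, the shell would split `my notes.pdf` into two arguments and `$#` would be 3.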
78 | ## 3 If/then/else
79 |
80 | We can use if-then-else type syntax to control the flow of a shell
81 | script. For an example, here is a shell function `niceR()` that can be
82 | used for nicing R jobs:
83 |
84 | ```bash
85 | # niceR shortcut for nicing R jobs
86 | # usage: niceR inputRfile outputRfile
87 | # Author: Brian Caffo
88 | # Date: 10/01/03
89 |
90 | function niceR(){
91 |     # submits nice'd R jobs
92 |     if [ $# -ne 2 ]; then
93 |         echo "usage: niceR inputRfile outputRfile"
94 |     elif [ -e "$2" ]; then
95 |         echo "$2 exists, I won't overwrite"
96 |     elif [ ! -e "$1" ]; then
97 |         echo "inputRfile $1 does not exist"
98 |     else
99 |         echo "running R on $1"
100 |         nice -n 19 R --no-save < "$1" &> "$2"
101 |     fi
102 | }
103 | ```
104 |
105 | If the `then` is on a separate line from the `if`, you won't need the semicolon.
106 |
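The same `if`/`elif`/`else` structure, with the file tests `-d` (is a directory) and `-e` (exists), can be tried without R installed. The function `classify_path` below is a hypothetical example, not part of the tutorial's code.

```bash
# Hypothetical helper: report what kind of thing a path refers to
classify_path() {
    if [ $# -ne 1 ]; then
        echo "usage: classify_path PATH"
        return 1
    elif [ -d "$1" ]; then
        echo "directory"
    elif [ -e "$1" ]; then
        echo "file"
    else
        echo "missing"
    fi
}

tmp_file=$(mktemp)
classify_path /
classify_path "$tmp_file"
classify_path "$tmp_file.does-not-exist"
```

The branches are tested in order, so the directory test has to come before the more general existence test.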
107 | ## 4 For loops
108 |
109 | For loops in shell scripting are primarily designed for iterating
110 | through a set of files or directories. Here's an example:
111 |
112 | ```bash
113 | $ for FILE in *.txt; do
114 | >     mv "$FILE" "${FILE/.txt/.R}"
115 | >     # this syntax replaces .txt with .R in $FILE
116 | > done
117 | ```
118 |
119 | Note that the `>` prompt above occurs when the shell is expecting
120 | further input.
121 |
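Here is a self-contained sketch of the renaming loop that can be run safely in a temporary directory (the filenames are made up). It uses `${f%.txt}`, a related expansion that removes `.txt` only from the end of the value, whereas `${f/.txt/.R}` replaces the first occurrence wherever it appears in the name.

```bash
rename_dir=$(mktemp -d)
touch "$rename_dir/a.txt" "$rename_dir/b.txt"

for f in "$rename_dir"/*.txt; do
    # If nothing matched, the glob is left as literal text; skip it
    [ -e "$f" ] || continue
    # Strip the trailing .txt and append .R
    mv "$f" "${f%.txt}.R"
done

ls "$rename_dir"
```

Looping over the glob directly, rather than over the output of `ls`, keeps filenames containing spaces intact.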
122 | Another use of `for` loops is automating file downloads:
123 |
124 | ```bash
125 | # example of bash for loop and wget for downloading a collection of files on the web
126 | # usage: ./forloopDownload.sh
127 | # Author: Chris Paciorek
128 | # Date: July 28, 2011
129 |
130 | url='ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/grid/years'
131 | types="tmin tmax"
132 | for ((yr=1950; yr<=2017; yr++))
133 | do
134 | for type in ${types}
135 | do
136 | wget ${url}/${yr}.${type}
137 | done
138 | done
139 | ```
140 |
141 | If the `do` is on a separate line from the `for`, you don't need the
142 | semicolon seen in the previous example.
143 |
144 |
145 | *for* loops are very useful for starting a series of jobs:
146 |
147 | ```bash
148 | # example of bash for loop for starting jobs
149 | # usage: ./forloopJobs.sh
150 | # Author: Chris Paciorek
151 | # Date: July 28, 2011
152 |
153 | n=100
154 | for(( it=1; it<=100; it++ ))
155 | do
156 | echo "n=$n; it=$it; source('base.R')" > tmp-$n-$it.R # create customized R file
157 | R CMD BATCH --no-save tmp-$n-$it.R sim-n$n-it$it.Rout
158 | done
159 | # note that base.R should NOT set either 'n' or 'it'
160 | ```
161 |
162 | That's just an illustration. In reality, in the case above you'd be better off passing arguments into the R code using `commandArgs` or by setting environment variables that are read in the R code.
163 |
164 | Note that by default the separator when you're looping through the elements of a variable is whitespace (a space, as above, but also a tab or newline), but you can change it by setting the `IFS` variable, for example:
165 |
166 | ```bash
167 | $ IFS=:
168 | $ types=tmin:tmax:pmin:pmax
169 | $ for type in $types
170 | > do
171 | > echo $type
172 | > done
173 | tmin
174 | tmax
175 | pmin
176 | pmax
177 | ```
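
One thing to watch for is that `IFS` stays changed for the rest of the session unless you put it back. A minimal sketch of saving and restoring it (the name `old_ifs` is just my choice):

```bash
types="tmin:tmax:pmin:pmax"

old_ifs=$IFS        # remember the current separator
IFS=:
count=0
for type in $types; do      # unquoted on purpose, so it is split on ':'
    count=$((count + 1))
    last=$type
done
IFS=$old_ifs        # restore, so later commands split on whitespace again

echo "$count elements, last one is $last"
```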
178 |
179 |
180 | ## 5 How much shell scripting should I learn?
181 |
182 | We've covered most of what you are likely to need to know about the
183 | shell. I tend to only use bash scripts for simple tasks that require
184 | only a few lines of bash commands and limited control flow (i.e.,
185 | conditional statements, loops). For more complicated OS tasks, it is
186 | often preferable to use Python. (You can also do a fair amount of what
187 | you need from within R using the `system()` function.) This will enable
188 | you to avoid dealing with a lot of shell programming syntax. But you'll
189 | still need to know how to use standard UNIX commands/utilities, wildcards, and pipes to be
190 | effective.
191 |
--------------------------------------------------------------------------------
/testfile.txt:
--------------------------------------------------------------------------------
1 | This is the first line.
2 | Followed by this line.
3 | And then ...
4 |
--------------------------------------------------------------------------------
/using-commands.qmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: Using UNIX commands
3 | date: 2025-04-09
4 | format:
5 | html:
6 | theme: cosmo
7 | css: assets/styles.css
8 | toc: true
9 | code-copy: true
10 | code-block-bg: true
11 | code-block-border-left: "#31BAE9"
12 | engine: knitr
13 | ipynb-shell-interactivity: all
14 | code-overflow: wrap
15 | ---
16 |
17 | ## 1 Basic utilities / commands
18 |
19 | [Earlier](index#5-introduction-to-commands) we introduced the basics of entering commands in the shell.
20 |
21 |
22 | Since files are such an essential aspect of Unix and working from the
23 | shell is the primary way to work with Unix, there are a large number of
24 | useful commands and tools to view and manipulate files.
25 |
26 | - cat -- concatenate files and print to standard output
27 | - cp -- copy files and directories
28 | - cut -- remove sections from each line of files
29 | - diff -- find differences between two files
30 | - grep -- print lines matching a pattern
31 | - head -- output the first part of files
32 | - find -- search for files in a directory hierarchy
33 | - less -- opposite of more (and better than more)
34 | - more -- file perusal filter for crt viewing
35 | - mv -- move (rename) files
36 | - nl -- number lines of files
37 | - paste -- merge lines of files
38 | - rm -- remove files or directories
39 | - rmdir -- remove empty directories
40 | - sort -- sort lines of text files
41 | - split -- split a file into pieces
42 | - tac -- concatenate and print files in reverse
43 | - tail -- output the last part of files
44 | - touch -- change file timestamps
45 | - tr -- translate or delete characters
46 | - uniq -- remove duplicate lines from a sorted file
47 | - wc -- print the number of bytes, words, and lines in files
48 | - wget and curl -- non-interactive internet downloading
49 |
50 | Recall that a command consists of the command, optionally one or more flags, and optionally one or more arguments. When there is an argument, it is often the name of a file that the command should operate on.
51 |
52 | Thus the general syntax for a Unix program/command/utility is:
53 |
54 | ```
55 | $ command -options argument1 argument2 ...
56 | ```
57 |
58 | For example:
59 |
60 | ```bash
61 | $ grep -i graphics file.txt
62 | ```
63 |
64 | looks for the literal string `graphics` (argument 1) in `file.txt`
65 | (argument 2) with the option `-i`, which says to ignore the case of the
66 | letters. A simpler invocation is:
67 |
68 | ```bash
69 | $ less file.txt
70 | ```
71 |
72 | which simply pages through a text file (you can navigate up and down
73 | with the space bar and the up/down arrows) so you
74 | can get a feel for what's in it. To exit `less` type `q`.
75 |
76 |
77 | Unix programs often take flags (options) that are identified with a minus
78 | followed by a letter and then (possibly) followed by the specific option (adding a space
79 | before the specific option is fine). Options may also involve two
80 | dashes, e.g., `R --no-save`. A standard two dash option for many
81 | commands is `--help`. For example, try:
82 |
83 | ```bash
84 | $ tail --help
85 | ```
86 |
87 | Here are a couple of examples of flags when using the `tail` command
88 | (`-n 10` and `-f`):
89 |
90 | ```bash
91 | $ wget https://raw.githubusercontent.com/berkeley-scf/tutorial-using-bash/master/cpds.csv
92 | $ tail -n 10 cpds.csv # last 10 lines of cpds.csv
93 | $ tail -f cpds.csv # shows end of file, continually refreshing
94 | ```
95 |
96 | The first line downloads the data from GitHub. The two main tools
97 | for downloading network-accessible data from the command line are `wget`
98 | and `curl`. I tend to use `wget` as my command-line downloading tool as
99 | it is more convenient, but on a Mac, only `curl` is generally available.
100 |
101 | A few more tidbits about `grep` (we will see more examples of `grep` in
102 | the [section on regular expressions](regex), but it is so useful that it is worth
103 | seeing many times):
104 |
105 | ```bash
106 | $ grep ^2001 cpds.csv # returns lines that start with '2001'
107 | $ grep 0$ cpds.csv # returns lines that end with '0'
108 | $ grep 19.0 cpds.csv # returns lines with '19' separated from '0' by a single character
109 | $ grep 19.*0 cpds.csv # now separated by any number of characters
110 | $ grep -o 19.0 cpds.csv # returns only the content matching the pattern, not entire lines
111 | ```
112 |
113 | Note that the first argument to grep is the pattern you are looking for.
114 | The syntax is different from that [used for wildcards](file-management#3-filename-globbing) in file names.
115 | Also, you can use regular expressions in the pattern, but we defer
116 | details until [later](regex).
117 |
118 | It is sometimes helpful to put the pattern inside double quotes, e.g.,
119 | if you want spaces in your pattern:
120 |
121 | ```bash
122 | $ grep "George .* Bush" cpds.csv
123 | ```
124 |
125 | More generally in Unix, enclosing a string in quotes is often useful to
126 | indicate that it is a single argument/value.
127 |
128 | If you want to explicitly look for one of the special characters used in
129 | creating patterns (such as double quote (`"`), period (`.`), etc.), you
130 | can "escape" them by preceding them with a backslash. For example, to look
131 | for `"Canada"`, including the quotes:
132 |
133 | ```bash
134 | $ grep "\"Canada\"" cpds.csv # look for "Canada" (including quotes)
135 | $ grep "19\.0" cpds.csv # look for 19.0
136 | ```
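
To see what escaping the period buys you, here is a small self-contained sketch using a made-up four-line file rather than `cpds.csv`. It counts matching lines with `-c` and also shows `-F`, which tells `grep` to treat the whole pattern as a literal string.

```bash
tmp=$(mktemp)
printf '%s\n' '19.0' '1900' '19x0' '2001' > "$tmp"

unescaped=$(grep -c '19.0' "$tmp")    # . matches any single character
escaped=$(grep -c '19\.0' "$tmp")     # \. matches only a literal period
fixed=$(grep -cF '19.0' "$tmp")       # -F: no character in the pattern is special

rm -f "$tmp"
echo "$unescaped $escaped $fixed"
```

The unescaped pattern matches three of the lines, while the escaped and fixed-string versions match only the line that actually contains `19.0`.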
137 |
138 | If you have a big data file and need to subset it by line (e.g., with
139 | `grep`) or by field (e.g., with `cut`), then you can do it really fast
140 | from the Unix command line, rather than reading it with R, SAS, Python,
141 | etc.
142 |
143 | Much of the power of these utilities comes in piping between them (see
144 | the next section) and [using wildcards](file-management#3-filename-globbing) to
145 | operate on groups of files. The utilities can also be used in shell
146 | scripts to do more complicated things.
147 |
148 | We'll see further examples of how to use these utilities later.
149 |
150 | ::: {.callout-tip title="Exercise"}
151 |
152 | You've already seen some of the above commands. Use the `--help`
153 | syntax to view the abbreviated man pages for some commands you're not
154 | familiar with and consider how you
155 | might use these commands.
156 | :::
157 |
158 | ## 3 Streams, pipes, and redirects
159 |
160 | ### 3.1 Streams (stdin/stdout/stderr)
161 |
162 | Unix programs that involve input and/or output often operate by reading
163 | input from a *stream* known as standard input (*stdin*), and writing their
164 | results to a stream known as standard output (*stdout*). In addition, a
165 | third stream known as standard error (*stderr*) receives error messages
166 | and other information that's not part of the program's results. In the
167 | usual interactive session, standard output and standard error default to
168 | your screen, and standard input defaults to your keyboard.
169 |
170 | ### 3.2 Overview of redirection
171 |
172 | You can change the place from which programs read and write through
173 | redirection. The shell provides this service, not the individual
174 | programs, so redirection will work for all programs. The following table
175 | shows some examples of redirection.
176 |
177 | **Table. Common Redirection Operators**
178 |
179 | | Redirection Syntax | Function |
180 | |---|---|
181 | | $ cmd > file | Send stdout to file |
182 | | $ cmd 1> file | Same as above |
183 | | $ cmd 2> file | Send stderr to file |
184 | | $ cmd > file 2>&1 | Send both stdout and stderr to file |
185 | | $ cmd < file | Receive stdin from file |
186 | | $ cmd >> file | Append stdout to file |
187 | | $ cmd 1>> file | Same as above |
188 | | $ cmd 2>> file | Append stderr to file |
189 | | $ cmd >> file 2>&1 | Append both stdout and stderr to file |
190 | | $ cmd1 \| cmd2 | Pipe stdout from cmd1 to cmd2 |
191 | | $ cmd1 2>&1 \| cmd2 | Pipe stdout and stderr from cmd1 to cmd2 |
192 | | $ cmd1 \| tee file1 \| cmd2 | Pipe stdout from cmd1 to cmd2 while simultaneously writing it to file1 using `tee` |
193 |
242 | Note that `cmd` may include options and arguments as seen in the
243 | previous section.
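
A few of the redirections from the table can be tried out with the sketch below. The `emit` function is a hypothetical helper I've defined just so there is something that writes to both streams; the file names are also made up.

```bash
# Hypothetical helper that writes one line to stdout and one to stderr
emit() {
    echo "result"
    echo "warning" >&2
}

tmp=$(mktemp -d)
emit > "$tmp/out.txt" 2> "$tmp/err.txt"    # each stream to its own file
emit > "$tmp/both.txt" 2>&1                # both streams to one file

out=$(cat "$tmp/out.txt")
err=$(cat "$tmp/err.txt")
both=$(grep -c '' "$tmp/both.txt")         # number of lines in both.txt

# Order matters: here stderr is pointed at the original stdout (which the
# command substitution captures) before stdout is sent to /dev/null
swapped=$(emit 2>&1 > /dev/null)

rm -rf "$tmp"
echo "out=$out err=$err both=$both swapped=$swapped"
```

The last case is why `2>&1` is written after `> file` in the table: redirections are processed left to right.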
244 |
245 |
246 |
247 | ### 3.3 Standard redirection (pipes)
248 |
249 | Operations where output from one command is used as input to another
250 | command (via the `|` operator) are known as pipes; they are made
251 | especially useful by the convention that many UNIX commands will accept
252 | their input through the standard input stream when no file name is
253 | provided to them.
254 |
255 | A simple pipe to `wc` to count the number of words in a string:
256 |
257 | ```bash
258 | $ echo "hey there" | wc -w
259 | 2
260 | ```
261 |
262 | Translating lowercase to UPPERCASE with `tr`:
263 |
264 | ```bash
265 | $ echo 'user1' | tr 'a-z' 'A-Z'
266 | USER1
267 | ```
268 |
269 | Here's an example of finding out how many unique entries there are in
270 | the 2nd column of a data file whose fields are separated by commas:
271 |
272 | ```bash
273 | $ cut -d',' -f2 cpds.csv | sort | uniq | wc
274 | $ cut -d',' -f2 cpds.csv | sort | uniq > countries.txt
275 | ```
276 |
277 | Here are the pieces of what is going on in the commands above:
278 |
279 | - We use the `cut` utility to extract the second field (`-f2`) or
280 | column of the file `cpds.csv` where the fields (or columns) are split or
281 | delimited by a comma (`-d','`).
282 | - The standard output of the `cut` command is then piped (via `|`) to the standard input of the `sort` command.
283 | - Then the output of `sort` is sent to the input of `uniq` to remove
284 | duplicate entries in the sorted list provided by `sort`. (Rather than
285 | using `sort | uniq`, you could also use `sort -u`.)
286 | - Finally, the first of the two commands prints a summary of counts (lines, words, and bytes) using `wc`, while the
287 | second saves the sorted information with duplicates removed in the file
288 | `countries.txt`.
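
The same building blocks can be rearranged to answer slightly different questions. The sketch below uses a few made-up rows (not the real `cpds.csv`) to count the distinct values in the second column and to find the most frequent one.

```bash
# Made-up data in the same spirit as cpds.csv: a header, then year,country
data='year,country
2001,Canada
2001,France
2002,Canada
2003,Canada
2003,Japan'

# Drop the header, keep column 2, and count the distinct values
# (tr strips the padding that some versions of wc add)
n_unique=$(printf '%s\n' "$data" | tail -n +2 | cut -d',' -f2 |
           sort -u | wc -l | tr -d ' ')

# uniq -c prefixes each distinct value with its count; sort -rn then puts
# the largest count first
top=$(printf '%s\n' "$data" | tail -n +2 | cut -d',' -f2 |
      sort | uniq -c | sort -rn | head -n 1 | awk '{ print $2 }')

echo "$n_unique distinct countries; most frequent: $top"
```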
289 |
290 | As another example, here we check for anomalies in a set of files:
291 | to see if there are any "S" values in certain fields (based on fixed
292 | width using `-b`) of a
293 | set of files (`USC*.dly`), one can do this:
294 |
295 | ```bash
296 | $ cut -b29,37,45,53,61,69,77,85,93,101,109,117,125,133,141,149,\
297 | 157,165,173,181,189,197,205,213,221,229,237,245,253,\
298 | 261,269 USC*.dly | grep S | less
299 | ```
300 |
301 | (Note I did that on 22,000 files (5 GB or so) in about 5
302 | minutes on my desktop; it would have taken much more time to read the
303 | data into a program like R or Python.)
304 |
305 | ### 3.4 The `tee` command
306 |
307 | The `tee` command lets you create two streams from one. For example,
308 | consider the case where you want the results of this command:
309 |
310 | ```bash
311 | $ cut -d',' -f2 cpds.csv | sort | uniq
312 | ```
313 |
314 | to both be output to the terminal screen you are working in as well as
315 | being saved to a file. You could issue the command twice:
316 |
317 | ```bash
318 | $ cut -d',' -f2 cpds.csv | sort | uniq
319 | $ cut -d',' -f2 cpds.csv | sort | uniq > countries.txt
320 | ```
321 |
322 | Instead of repeating the command and wasting computing time, you could
323 | use the `tee` command:
324 |
325 | ```bash
326 | $ cut -d',' -f2 cpds.csv | sort | uniq | tee countries.txt
327 | ```
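
Since `tee` passes its input along as well as saving it, it can also sit in the middle of a pipeline. A small sketch with made-up country names, where one run both writes the file and counts the lines:

```bash
tmp=$(mktemp -d)

# tee writes its input to the file and also passes it down the pipe,
# so a single run of the pipeline feeds both the file and wc
n=$(printf '%s\n' France Canada France Japan | sort -u |
    tee "$tmp/countries.txt" | wc -l | tr -d ' ')
saved=$(cat "$tmp/countries.txt")

rm -rf "$tmp"
echo "$n lines passed along; the file contains:"
echo "$saved"
```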
328 |
329 | ## 4 Command substitution and the `xargs` command
330 |
331 | ### 4.1 Command substitution
332 |
333 | A closely related, but subtly different, capability to piping is
334 | command substitution. You may sometimes need to substitute the results of a command for use
335 | by another command. For example, if you wanted to use the directory
336 | listing returned by `ls` as the argument to another command, you would
337 | type `$(ls)` in the location you want the result of `ls` to appear.
338 |
339 | When the shell encounters a command
340 | surrounded by `$()`, it runs the command and replaces the
341 | expression with the output from the command. This allows something
342 | similar to a pipe, but it is appropriate when a command reads its arguments
343 | directly from the command line instead of through standard input.
344 |
345 | For
346 | example, suppose we are interested in searching for the text `pdf` in
347 | the last 4 R code files (those with suffix `.r` or `.R`) that were
348 | modified in the current directory. We can find the names of the four most
349 | recently modified files ending in `.R` or `.r` using:
350 |
351 | ```bash
352 | $ ls -t *.{R,r} | head -4
353 | ```
354 |
355 | and we can search for the required pattern using `grep`. Putting these
356 | together with command substitution, we can solve the problem using:
357 |
358 | ```bash
359 | $ grep pdf $(ls -t *.{R,r} | head -4)
360 | ```
361 |
362 | Suppose that the four R code file names produced by the `ls` command above were:
363 | `test.R`, `run.R`, `analysis.R`, and `process.R`. Then the result of the command substitution above is to run the following command:
364 |
365 | ```bash
366 | $ grep pdf test.R run.R analysis.R process.R
367 | ```
368 |
369 |
370 | ::: {.callout-tip title="Command substitution alternate syntax"}
371 |
372 | An older notation for command substitution is to use backticks (e.g.,
373 | `` `ls` `` rather than `$(ls)`). It is generally preferable to use the new
374 | notation, since there are many annoyances with the backtick notation.
375 | For example, backslashes (`\`) inside of backticks behave in a
376 | non-intuitive way, nested quoting is more cumbersome inside backticks,
377 | nested substitution is more difficult inside of backticks, and it is
378 | easy to visually mistake backticks for a single quote.
379 | :::
380 |
381 | Note that piping the output of the `ls` command into `grep` would not
382 | achieve the desired goal, since `grep` reads its filenames as arguments from the
383 | command line, not standard input.
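
To make the difference concrete, here is a sketch in a scratch directory with two made-up R files, only one of which contains the text `pdf`. When the file names are piped in, `grep` searches the names themselves; with command substitution it searches the contents.

```bash
tmp=$(mktemp -d)
start_dir=$PWD
cd "$tmp" || exit 1
echo 'pdf("plot.pdf")' > a.R      # contains the text pdf
echo 'x <- 1' > b.R               # does not

# Piped: grep reads the file names on stdin, and neither name contains pdf
names_matched=$(ls *.R | grep -c pdf)

# Substituted: the names become arguments, so grep searches the file contents
# (-l prints just the names of the files that match)
files_matched=$(grep -l pdf $(ls *.R))

cd "$start_dir" && rm -rf "$tmp"
echo "$names_matched $files_matched"
```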
384 |
385 | ### 4.2 The `xargs` command
386 |
387 | While it doesn't work to directly use pipes to redirect output from one program
388 | as arguments to another program, you can redirect output as the arguments to another program using
389 | the `xargs` utility. Here's an example:
390 |
391 | ```bash
392 | $ ls -t *.{R,r} | head -4 | xargs grep pdf
393 | ```
394 |
395 | where the result is equivalent to the use of command substitution we saw in the previous section.
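
Here is a small sketch of what `xargs` is doing, again with two made-up R files in a scratch directory. By default it hands as many items as it can to one invocation of the command; with `-n 1` it runs the command once per item.

```bash
tmp=$(mktemp -d)
start_dir=$PWD
cd "$tmp" || exit 1
echo 'pdf("plot.pdf")' > a.R
echo 'x <- 1' > b.R

# xargs turns the lines arriving on stdin into arguments: grep -l pdf a.R b.R
matched=$(ls *.R | xargs grep -l pdf)

# -n 1 runs the command once per item rather than once for the whole batch
per_item=$(printf '%s\n' a.R b.R | xargs -n 1 echo checking)

cd "$start_dir" && rm -rf "$tmp"
echo "$matched"
echo "$per_item"
```

Note that plain `xargs` splits its input on whitespace, so this form assumes file names without spaces.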
396 |
397 | ::: {.callout-tip title="Exercise"}
398 |
399 | Try the following commands:
400 |
401 | ```bash
402 | $ ls -l tr
403 | $ type -p tr
404 | $ ls -l type -p tr
405 | $ ls -l $(type -p tr)
406 | ```
407 |
408 | Make sure you understand why each command behaves as it does.
409 | :::
410 |
411 | ## 5 Brace expansion
412 |
413 | We saw brace expansion when discussing file wildcards. For example, we can
414 | rename a file with a long name easily like this:
415 |
416 | ```bash
417 | $ mv my_long_filename.{txt,csv}
418 | $ ls my_long_filename*
419 | my_long_filename.csv
420 | $ mv my_long_filename.csv{,-old}
421 | $ ls my_long_filename*
422 | my_long_filename.csv-old
423 | ```
424 |
425 | This works because the shell expands the braces before passing the result on to the command. So with the `mv` calls above, the shell expands the braces to produce
426 |
427 | ```bash
428 | mv my_long_filename.txt my_long_filename.csv
429 | mv my_long_filename.csv my_long_filename.csv-old
430 | ```
431 |
432 | Brace expansion is quite useful and more flexible than I've indicated.
433 | Above we saw how to use brace expansion using a comma-separated
434 | list of items inside the curly braces (e.g., `{txt,csv}`), but they can
435 | also be used with a sequence specification. A sequence is indicated with
436 | a start and end item separated by two periods (`..`). Try typing the
437 | following examples at the command line and try to figure out how they
438 | work:
439 |
440 | ```bash
441 | $ echo {1..15}
442 | $ echo c{c..e}
443 | $ echo {d..a}
444 | $ echo {1..5..2}
445 | $ echo {z..a..-2}
446 | ```
447 |
448 | This can be used for filename wildcards but also anywhere else it would be useful. For example to kill a bunch of sequentially-numbered processes:
449 |
450 | ```bash
451 | $ kill 1397{62..81}
452 | ```
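
Two more details, as a hedged sketch: braces can be combined, and brace expansion happens before variables are substituted, so a variable can't be used as an endpoint of a sequence. I run the brace examples through `bash -c` so they behave the same even if the script is run by a plainer shell such as `dash`, which does not do brace expansion. The `sim-` names are made up.

```bash
# Two sets of braces give every combination, left-most varying slowest
combos=$(bash -c 'echo sim-{1..3}.{out,err}')
echo "$combos"

# Braces are expanded before variables, so a variable endpoint is left as is
not_expanded=$(bash -c 'n=3; echo {1..$n}')
echo "$not_expanded"

# seq is one common workaround when the endpoint is in a variable
n=3
with_seq=$(seq 1 "$n" | tr '\n' ' ')
echo "$with_seq"
```
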
453 | ## 6 Quoting
454 |
455 | A note about using single vs. double quotes in shell code. In
456 | general, variables inside double quotes will be evaluated, but variables
457 | inside single quotes will not be:
458 |
459 | ```bash
460 | $ echo "My home directory is $HOME"
461 | My home directory is /home/jarrod
462 | $ echo 'My home directory is $HOME'
463 | My home directory is $HOME
464 | ```
465 |
466 | **Table. Quotes**
467 |
468 | | Types of Quoting | Description |
469 | |---|---|
470 | | `' '` | hard quote - no substitution allowed |
471 | | `" "` | soft quote - allow substitution |
472 |
487 | This can be useful, for example, when you have a directory with a space
488 | in its name (of course, it is better to avoid spaces in file and
489 | directory names). For example, suppose you have a directory named "with space" within the `/home/jarrod` home directory.
490 | Since bash uses spaces to parse the elements of the
491 | command line, you might try escaping any spaces with a backslash:
492 |
493 | ```bash
494 | $ ls $HOME/with\ space
495 | file1.txt
496 | ```
497 |
498 | However that can be a pain and may not work in all circumstances. A cleaner
499 | approach is to use soft (or double) quotes:
500 |
501 | ```bash
502 | $ ls "$HOME/with space"
503 | file1.txt
504 | ```
505 |
506 | If you use hard quotes, you will get this error:
507 |
508 | ```bash
509 | $ ls '$HOME/with space'
510 | ls: cannot access $HOME/with space: No such file or directory
511 | ```
512 |
513 | What if you have double quotes in your file or directory name, such as a directory `"with"quote` (again, it
514 | is better to avoid using double quotes in file and directory names)? In
515 | this case, you will need to escape the quote:
516 |
517 | ```bash
518 | $ ls "$HOME/\"with\"quote"
519 | ```
520 |
521 | So we'll generally use double quotes. We can always work with a literal
522 | double quote by escaping it as seen above.
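
The sketch below pulls these cases together in a scratch directory (the directory and variable names are made up). It uses `set --` and `$#` simply as a way to count how many separate words the shell produced.

```bash
name="with space"
tmp=$(mktemp -d)
mkdir "$tmp/$name"
touch "$tmp/$name/file1.txt"

listing=$(ls "$tmp/$name")       # quoted: one argument, so ls finds the directory

set -- $name                     # unquoted: the value is split into two words
unquoted_words=$#
set -- "$name"                   # double-quoted: the value stays one word
quoted_words=$#

single='$name'                   # single quotes: no substitution at all

rm -rf "$tmp"
echo "$listing $unquoted_words $quoted_words $single"
```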
523 |
524 | ::: {.callout-warning title="Curly quotes"}
525 | Avoid using curly quotes (e.g., “ or ‘) when coding (in the shell or otherwise),
526 | except as part of an actual string.
527 | :::
528 |
529 | ## 7 Powerful tools for text manipulation: `grep`, `sed`, and `awk`
530 |
531 | Before the text editor, there was the line editor. Rather than
532 | presenting you with the entire text as a text editor does, a line editor
533 | only displays lines of text when it is requested to. The original Unix
534 | line editor is called `ed`. You will likely never use `ed` directly, but
535 | you will very likely use commands that are its descendants. For example,
536 | the commands `grep`, `sed`, `awk`, and `vim` are all based directly on
537 | `ed` (e.g., `grep` is an `ed` command that is now available as a
538 | standalone command, while `sed` is a streaming version of `ed`) or
539 | inherit much of its syntax (e.g., `awk` and `vim` both heavily borrow
540 | from the `ed` syntax). Since `ed` was written when computing resources
541 | were very constrained compared to today, this means that the syntax of
542 | these commands can be terse. However, it also means that learning the
543 | syntax for one of these tools will be rewarded when you need to learn
544 | the syntax of another of these tools.
545 |
546 | An important benefit of these tools, particularly when working with large files, is that by operating line by line they
547 | don't incur the memory use that would be involved in reading an entire file into memory in a program like Python or R and then operating on the file's contents in memory.
548 |
549 | You may not need to learn much `sed` or `awk`, but it is good to know about
550 | them since you can search the internet for awk or sed one-liners. If you
551 | have some file munging task, it can be helpful to do a quick search
552 | before writing code to perform the task yourself.
553 |
554 | ### 7.1 `grep`
555 |
556 | The simplest of these tools is `grep`. As I mentioned, `ed` only
557 | displays lines of text when requested. One common task was to print all
558 | the lines in a file matching a specific regular expression. The command
559 | in `ed` that does this is `g/re/p`, which stands for globally match
560 | all lines containing the regular expression `re` and print them out.
561 |
562 | One often uses `grep` with [regular expressions](regex), covered
563 | later, so we'll just show some basic usage here.
564 |
565 | To start you will need to create a file called `testfile.txt` with the
566 | following content:
567 |
568 | ```
569 | This is the first line.
570 | Followed by this line.
571 | And then ...
572 | ```
573 |
574 | To print all the lines containing `is`:
575 |
576 | ```bash
577 | $ grep is testfile.txt
578 | This is the first line.
579 | Followed by this line.
580 | ```
581 |
582 | To print all the lines **not** containing `is`:
583 |
584 | ```bash
585 | $ grep -v is testfile.txt
586 | And then ...
587 | ```
588 |
589 | To print only the matches, one can use the `-o` flag, though this
590 | would generally only be interesting when used with a regular
591 | expression pattern since in this case, we know "is" is what will be
592 | returned:
593 |
594 | ```bash
595 | $ grep -o is testfile.txt
596 | is
597 | is
598 | is
599 | ```
600 |
601 | One could also use `--color` so that the matches are highlighted in color.
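
A few other flags I find handy are sketched below: `-c` counts matching lines, `-n` prefixes each match with its line number, and `-i` ignores case. The block recreates the contents of `testfile.txt` in a temporary file so it can be run on its own.

```bash
tmp=$(mktemp)
printf '%s\n' 'This is the first line.' 'Followed by this line.' 'And then ...' > "$tmp"

n_lines=$(grep -c is "$tmp")                          # lines containing is
n_matches=$(grep -o is "$tmp" | wc -l | tr -d ' ')    # individual matches
numbered=$(grep -n then "$tmp")                       # line number, then the line
n_this=$(grep -ci this "$tmp")                        # matches both This and this

rm -f "$tmp"
echo "$n_lines $n_matches $numbered $n_this"
```

Note that `-c` counts lines rather than matches, which is why it reports two here while `-o` finds three occurrences.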
602 |
603 |
604 | ### 7.2 `sed`
605 |
606 | Here are some useful things you can do with `sed`. Note that as with
607 | other UNIX tools, `sed` will not generally directly alter a file
608 | (unless you use the `-i` flag); instead it will print the modified
609 | version of the file to stdout.
610 |
611 | Printing lines of text with `sed`:
612 |
613 | ```bash
614 | $ sed -n '1,9p' file.txt # prints out lines 1-9 from file.txt
615 | $ sed -n '/^#/p' file.txt # prints out lines starting with # from file.txt
616 | ```
617 |
618 | The first command prints out lines 1-9, while the second
619 | one prints out lines starting with `#`.
620 |
621 | Deleting lines of text with `sed`:
622 |
623 | ```bash
624 | $ sed -e '1,9d' file.txt
625 | $ sed -e '/^;/d' -e '/^$/d' file.txt
626 | ```
627 |
628 | The first line deletes lines 1-9 of `file.txt`, printing the remaining
629 | lines to stdout. What do you think the
630 | second line does?
631 |
632 | Note that the `-e` flag is only necessary if you want to have more than one expression, so it's not actually needed in the first line.
633 |
634 | Text substitution with `sed`:
635 |
636 | ```bash
637 | $ sed 's/old_pattern/new_pattern/' file.txt > new_file.txt
638 | $ sed 's/old_pattern/new_pattern/g' file.txt > new_file.txt
639 | $ sed -i 's/old_pattern/new_pattern/g' file.txt
640 | ```
641 |
642 | The first line replaces only the first instance in a line, while the second
643 | line replaces all instances in a line (i.e., globally). The use of the `-i`
644 | flag in the third line replaces
645 | the pattern **in place** in the file, thereby altering `file.txt`. Use
646 | the `-i` flag carefully as there is no way to easily restore the original version of the file.
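
One way to make in-place editing less risky is to give `-i` a backup suffix, which, to my knowledge, works when attached directly to the flag in both GNU `sed` and the `sed` that ships with macOS. Here's a minimal sketch on a made-up two-line file:

```bash
tmp=$(mktemp -d)
printf '%s\n' 'old old' 'old' > "$tmp/file.txt"

first_only=$(sed 's/old/new/' "$tmp/file.txt" | head -n 1)   # first match per line
all=$(sed 's/old/new/g' "$tmp/file.txt" | head -n 1)         # every match per line

# Edit in place, but keep the original as file.txt.bak
sed -i.bak 's/old/new/g' "$tmp/file.txt"
edited=$(head -n 1 "$tmp/file.txt")
backup=$(head -n 1 "$tmp/file.txt.bak")

rm -rf "$tmp"
echo "$first_only | $all | $edited | $backup"
```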
647 |
648 | ### 7.3 `awk`
649 |
650 | Awk is a general purpose programming language typically used in data
651 | extraction tasks and particularly well-suited to one-liners (although it
652 | is possible to write long programs in it, it is rare). For our purposes,
653 | we will just look at a few common one-liners to get a sense of how it
654 | works. Basically, awk will go through a file line by line and perform
655 | some action for each line.
656 |
657 | For example, to select a given column from some text (here getting
658 | the PIDs of some processes, which are in the second (`$2`) column of
659 | the output of `ps -f`):
660 |
661 | ```bash
662 | $ ps -f | awk '{ print $2 }'
663 | ```
664 |
665 | To double space a file, you would read each line, print it,
666 | and then print a blank line:
667 |
668 | ```bash
669 | $ awk '{ print } { print "" }' file.txt
670 | ```
671 |
672 | Print every line of a file that is longer than 80 characters:
673 |
674 | ```bash
675 | $ awk 'length($0) > 80' file.txt
676 | ```
677 |
678 | Print the home directory of every user defined in the file
679 | `/etc/passwd`:
680 |
681 | ```bash
682 | $ awk -F: '{ print $6 }' /etc/passwd
683 | ```
684 |
685 | To see what that does, let's look at the first line of `/etc/passwd`:
686 |
687 | ```bash
688 | $ head -n 1 /etc/passwd
689 | root:x:0:0:root:/root:/bin/bash
690 | ```
691 |
692 | As you can see the entries are separated by colons (`:`) and the sixth
693 | field contains the root user's home directory (`/root`). The option
694 | `-F:` specifies that the colon `:` is the field delimiter (instead of
695 | the default whitespace delimiter) and `print $6`
696 | prints the 6th field of each line.
697 |
698 | Summing columns:
699 |
700 | ```bash
701 | $ awk '{print $1 + $2}' file.txt
702 | ```
703 |
704 | This prints, for each line of `file.txt`, the sum of columns 1 and 2.
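
The one-liner above adds the two columns within each line. If what you want instead is the total of a column over the whole file, you can accumulate it and print once in an `END` block. A small sketch with made-up numbers (the variable name `total` is arbitrary):

```bash
data='1 2
3 4
5 6'

# One sum per line: column 1 plus column 2
row_sums=$(printf '%s\n' "$data" | awk '{ print $1 + $2 }')

# One total for the whole input: accumulate column 1, print once at the end
col_total=$(printf '%s\n' "$data" | awk '{ total += $1 } END { print total }')

echo "$row_sums"
echo "$col_total"
```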
705 |
706 |
707 | ## 8 Aliases (command shortcuts) and .bashrc
708 |
709 | Aliases allow you to use an abbreviation for a command, to create new
710 | functionality or to ensure that certain options are always used when you
711 | call an existing command. For example, I'm lazy and would rather type
712 | `q` instead of `exit` to terminate a shell window. You could create the
713 | alias as follows:
714 |
715 | ```bash
716 | $ alias q=exit
717 | ```
718 |
719 | As another example, suppose you find the `-F` option of `ls` (which
720 | displays `/` after directories, `*` after executable files and `@` after
721 | links) to be very useful. The command:
722 |
723 | ```bash
724 | $ alias ls="ls -F"
725 | ```
726 |
727 | will ensure that the `-F` option will be used whenever you use `ls`. If
728 | you need to use the unaliased version of something for which you've
729 | created an alias, precede the name with a backslash (`\`). For example,
730 | to use the normal version of `ls` after you've created the alias
731 | described above:
732 |
733 | ```bash
734 | $ \ls
735 | ```
736 |
737 | The real power of aliases is only achieved when they are automatically
738 | set up whenever you log in to the computer or open a new shell window.
739 | To achieve that goal with aliases (or any other bash shell commands),
740 | simply insert the commands in the file `.bashrc` in your home directory.
741 | For example, here is an excerpt from my `.bashrc`:
742 |
743 | ```bash
744 | # .bashrc
745 |
746 | # Source global definitions
747 | if [ -f /etc/bashrc ]; then
748 | . /etc/bashrc
749 | fi
750 |
751 | # User specific aliases and functions
752 | pushdp () {
753 | pushd "$(python -c "import os.path as _, ${1}; \
754 | print(_.dirname(_.realpath(${1}.__file__[:-1])))"
755 | )"
756 | }
757 |
758 | export EDITOR=vim
759 | source /usr/share/git-core/contrib/completion/git-prompt.sh
760 | export PS1='[\u@\h \W$(__git_ps1 " (%s)")]\$ '
761 |
762 | # history settings
763 | export HISTCONTROL=ignoredups # no duplicate entries
764 | shopt -s histappend # append, don't overwrite
765 |
766 | # R settings
767 | export R_LIBS=$HOME/usr/lib64/R/library
768 | alias R="/usr/bin/R --quiet --no-save"
769 |
770 | # Set path
771 | mybin=$HOME/usr/bin
772 | export PATH=$mybin:$HOME/.local/bin:$HOME/usr/local/bin:$PATH:
773 | export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/usr/local/lib
774 |
775 | # Additional aliases
776 | alias grep='grep --color=auto'
777 | alias hgrep='history | grep'
778 | alias l.='ls -d .* --color=auto'
779 | alias ll='ls -l --color=auto'
780 | alias ls='ls --color=auto'
781 | alias more=less
782 | alias vi=vim
783 | ```
784 |
785 |
786 | ::: {.callout-tip title="Exercise"}
787 |
788 | Look over the content of the example `.bashrc` and make sure you
789 | understand what each line does. For instance, use `man grep` to see what
790 | the option `--color=auto` does. Use `man ls` to figure out what the
791 | `-d` and `-l` options passed to it do.
792 |
793 | :::
794 |
--------------------------------------------------------------------------------