├── README.md
├── lesson-materials
│   ├── 2015-09-25_LibCarp-lesson-one-deck.odp
│   ├── 2015-09-25_LibCarp-lesson-one-deck.pdf
│   ├── LibCarp-lesson-one-handout-answers.docx
│   ├── LibCarp-lesson-one-handout-answers.md
│   ├── LibCarp-lesson-one-handout.docx
│   ├── LibCarp-lesson-one-handout.md
│   ├── LibCarp-lesson-one.docx
│   ├── LibCarp-lesson-one.md
│   └── attic
│       ├── 2015-08-13_LibCarp-lesson-one-handout.md
│       └── 2015-08-13_LibCarp-lesson-one.md
└── photos
    ├── 2015-11-09_what-do-you-want-more-or-less-of-next-week.jpg
    └── 2015-11-09_what-words-do-you-think-you-should-know-more-about.jpg
/README.md:
--------------------------------------------------------------------------------
1 | **NOTE: materials used for initial workshop. Now superseded by [Data intro for librarians](https://github.com/data-lessons/library-data-intro)**
2 |
3 | ## Library Carpentry. Week One: Basics
4 |
5 | This repository contains lesson materials for Library Carpentry week one, held 9 November 2015 at City University London. Slides used on the evening are here and for ease of use also on [Slideshare](http://www.slideshare.net/drjwbaker/library-carpentry-week-one-basics).
6 |
7 | Library Carpentry is generously funded by the [Software Sustainability Institute](http://software.ac.uk/). The Software Sustainability Institute cultivates world-class research with software. The Institute is based at the universities of Edinburgh, Manchester, Southampton and Oxford.
8 |
9 | This lesson is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. ***Exceptions: embeds to and from external sources, and direct quotations from speakers***
10 |
--------------------------------------------------------------------------------
/lesson-materials/2015-09-25_LibCarp-lesson-one-deck.odp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LibraryCarpentry/week-one-library-carpentry-DEPRECATED/c4b7a8377a273959219d189c5d9591d75f61d11f/lesson-materials/2015-09-25_LibCarp-lesson-one-deck.odp
--------------------------------------------------------------------------------
/lesson-materials/2015-09-25_LibCarp-lesson-one-deck.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LibraryCarpentry/week-one-library-carpentry-DEPRECATED/c4b7a8377a273959219d189c5d9591d75f61d11f/lesson-materials/2015-09-25_LibCarp-lesson-one-deck.pdf
--------------------------------------------------------------------------------
/lesson-materials/LibCarp-lesson-one-handout-answers.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LibraryCarpentry/week-one-library-carpentry-DEPRECATED/c4b7a8377a273959219d189c5d9591d75f61d11f/lesson-materials/LibCarp-lesson-one-handout-answers.docx
--------------------------------------------------------------------------------
/lesson-materials/LibCarp-lesson-one-handout-answers.md:
--------------------------------------------------------------------------------
1 | # Library Carpentry Week One: Some Basics
2 |
3 | _______
4 | ### Tips and clarifications
5 |
6 | `^` defines the start of the string. So what you put after it will only match the first characters of a line or contents of a cell.
7 |
8 | `$` defines the end of the string. So what you put before it will only match the last characters of a line or contents of a cell.
9 |
10 | `\b` adds a word boundary. So putting this either side of a word stops the regular expression matching longer variants of words.
11 |
12 | So the following regular expressions will find:
13 |
14 | - `foobar` will match `foobar` and find `666foobar`, `foobar777`, `8thfoobar8th` et cetera
15 | - `\bfoobar` will match `foobar` and find `foobar777`
16 | - `foobar\b` will match `foobar` and find `666foobar`
17 | - `\bfoobar\b` will find `foobar`
18 |
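The four `foobar` patterns above can be checked with a few lines of code. This is a minimal sketch using Python's built-in `re` module; Python is our assumption here, as the handout does not name a language. The sample strings are the ones used in the list, and `matching` is a hypothetical helper written for this illustration.

```python
import re

# Sample strings taken from the list above.
samples = ["foobar", "666foobar", "foobar777", "8thfoobar8th"]
patterns = [r"foobar", r"\bfoobar", r"foobar\b", r"\bfoobar\b"]


def matching(pattern, strings):
    """Return the strings in which the pattern is found anywhere."""
    return [s for s in strings if re.search(pattern, s)]


# Map each pattern to the sample strings it finds.
results = {p: matching(p, samples) for p in patterns}
for pattern, found in results.items():
    print(pattern, "->", found)
```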
19 | `\` is used to escape the proceeding (yes I mean *proceeding* this time...) character when that character is a special character. So, for example, a regular expression that found `.com` would be `\.com` because `.` is a special character that matches any character.
20 |
21 | _____
22 | ### Regex Answers
23 |
24 | What does `Fr[ea]nc[eh]` match?
25 |
26 | - this matches `France`, `French`, `Frence`, and `Franch`. It would also find words where there were characters either side of these, so `Francer`, `dakkldakFrench`, or `Franch911`.
27 |
28 | What does `Fr[ea]nc[eh]$` match?
29 |
30 | - this matches `France`, `French`, `Frence`, and `Franch` at the end of a string. It would find words where there were characters before these so `dakkldakFrench`.
31 |
32 | What would match strings that begin with `French` and `France` only?
33 |
34 | - `^France|^French` This would also find words where there were characters after `French` such as `Frenchness`.
35 |
36 | How do you match the whole words `colour` and `color` (case insensitive)?
37 |
38 | - There are two ways of thinking about this. In real life, you *should* only come across the case insensitive variations `colour`, `color`, `Colour`, `Color`, `COLOUR`, and `COLOR`. So one option would be `\b[Cc]olou?r\b|\bCOLOU?R\b`. However, you can also use `/colou?r/i` to find all case insensitive matches.
39 |
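The `/.../i` notation is how languages such as JavaScript and Perl write the case-insensitive flag. As a hedged sketch in Python (our assumption, not something the handout specifies), the equivalent is the `re.IGNORECASE` flag. The test sentence is made up for illustration, and `\b` is added to the flagged pattern so that only whole words are found.

```python
import re

# Made-up test sentence: three whole-word hits and two longer words to skip.
text = "Colour, color and COLOUR match; colourful and discolor do not."

# Option one: spell out the case variations you expect to meet.
explicit = re.findall(r"\b[Cc]olou?r\b|\bCOLOU?R\b", text)

# Option two: one pattern plus the case-insensitive flag.
flagged = re.findall(r"\bcolou?r\b", text, flags=re.IGNORECASE)

print(explicit)
print(flagged)
```

The flagged version also accepts mixed-case oddities such as `cOLour`, which the explicit version does not.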
40 | How would you find `headrest` and `head rest` but not `head  rest` (that is, with two spaces between `head` and `rest`)?
41 |
42 | - `head\s?rest` Note this will also match zero or one tabs or newline characters, but it should work in most real world cases :)
43 |
44 | How would you find a 4 letter word that ends a string and is preceded by at least one zero?
45 |
46 | - `0+[a-z]{4}$`
47 |
48 | How do you match any 4 digit string anywhere?
49 |
50 | - `\d{4}`. Note this will match 4 digit strings only but will find them within longer strings of numbers.
51 |
52 | How would you match the date format `dd-MM-yyyy`?
53 |
54 | - `\b\d{2}-\d{2}-\d{4}\b` In most real world situations, you are likely to want word bounding here (but it may depend on your data).
55 |
56 | How would you match the date format `dd-MM-yyyy` or `dd-MM-yy` at the end of a string only?
57 |
58 | - `\d{2}-\d{2}-\d{2,4}$`
59 |
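One thing worth knowing about `{2,4}` is that it means "two, three, or four", so the answer above also accepts a three digit year. The sketch below, in Python (our choice of language), compares it with a stricter alternative that we are suggesting, not one from the original answer sheet. The sample dates are made up.

```python
import re

# The answer from the sheet: the year may be 2, 3, or 4 digits long.
loose = re.compile(r"\d{2}-\d{2}-\d{2,4}$")

# A stricter variant (our suggestion): the year is exactly 4 or exactly 2 digits.
strict = re.compile(r"\d{2}-\d{2}-(?:\d{4}|\d{2})$")

dates = ["09-11-2015", "09-11-15", "09-11-201"]
loose_hits = [d for d in dates if loose.search(d)]
strict_hits = [d for d in dates if strict.search(d)]

print(loose_hits)
print(strict_hits)
```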
60 | How would you match publication formats such as `British Library : London, 2015` and `Manchester University Press : Manchester, 1999`?
61 |
62 | - `.* : .*, \d{4}` You will find that this matches any text you put before `British` or `Manchester`. In this case, this regular expression does a good job on the first look up and may need to be refined on a second depending on your real world application.
63 |
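If you want to pull the publisher, place, and year apart, not just find the line, wrapping each part of the pattern in parentheses captures it. This is a minimal sketch in Python (our assumption); the parentheses are our addition to the answer above, and the two records are the examples from the question.

```python
import re

# Same pattern as the answer, with each part wrapped in a capture group.
pattern = re.compile(r"(.*) : (.*), (\d{4})")

records = [
    "British Library : London, 2015",
    "Manchester University Press : Manchester, 1999",
]

# Each result is a (publisher, place, year) tuple of strings.
parsed = [pattern.search(record).groups() for record in records]
for publisher, place, year in parsed:
    print(publisher, "|", place, "|", year)
```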
--------------------------------------------------------------------------------
/lesson-materials/LibCarp-lesson-one-handout.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LibraryCarpentry/week-one-library-carpentry-DEPRECATED/c4b7a8377a273959219d189c5d9591d75f61d11f/lesson-materials/LibCarp-lesson-one-handout.docx
--------------------------------------------------------------------------------
/lesson-materials/LibCarp-lesson-one-handout.md:
--------------------------------------------------------------------------------
1 | # Library Carpentry Week One: Some Basics
2 |
3 | _____
4 | ### Schedule
5 |
6 | - Jargon Busting (1800-1845)
7 | - Foundations (1845-1930)
8 | - Regular Expressions (1930-2015)
9 |
10 | _____
11 | ### Regular Expressions
12 |
13 | - `[]` define a list or range of characters to be found
14 | - `.` matches any character at all
15 | - `\d` matches any single digit
16 | - `\w` matches any word character (equivalent to `[A-Za-z0-9_]`)
17 | - `\s` matches any space, tab, or newline
18 | - `\b` adds a word boundary
19 | - `^` defines the start of the string
20 | - `$` defines the end of the string
21 | - `*` matches when the preceding character appears any number of times including zero
22 | - `+` matches when the preceding character appears any number of times excluding zero
23 | - `?` matches when the preceding character appears one or zero times
24 | - `{VALUE}` matches the preceding character the number of times defined by VALUE; ranges can be specified with the syntax `{VALUE,VALUE}`
25 | - `|` simply means or
26 |
27 | Check your regex with:
28 | - regex101 https://regex101.com/
29 | - regexper http://regexper.com/
30 | - myregexp http://myregexp.com/
31 |
32 | Test yourself with:
33 | - Regex Crossword https://regexcrossword.com/
34 |
35 | #### Exercise
36 |
37 | What does `Fr[ea]nc[eh]` match?
38 |
39 | What does `Fr[ea]nc[eh]$` match?
40 |
41 | What would match strings that begin with `French` and `France` only?
42 |
43 | How do you match the words `colour` and `color` (case insensitive)?
44 |
45 | How would you find `headrest` and `head rest` but not `head  rest` (that is, with two spaces between `head` and `rest`)?
46 |
47 | How would you find a 4 letter word that ends a string and is preceded by at least one zero?
48 |
49 | How do you match any 4 digit string?
50 |
51 | How would you match the date format `dd-MM-yyyy`?
52 |
53 | How would you match the date format `dd-MM-yyyy` or `dd-MM-yy` at the end of a string only?
54 |
55 | How would you match publication formats such as `British Library : London, 2015` and `Manchester University Press : Manchester, 1999`?
56 |
57 | _____
58 | ### Next Week
59 |
60 | #### Installation (to be completed before the session)
61 |
62 | Windows users, see the section entitled 'Installing Git Bash' in the Programming Historian lesson [*Introduction to the Bash Command Line*](http://programminghistorian.org/lessons/intro-to-bash). OS X and Linux users, simply make sure you know how to find your 'Terminal'.
63 |
64 | #### Where to go for help
65 |
66 | Raise an issue on the Library Carpentry Week Two GitHub page https://github.com/LibraryCarpentry/week-two-library-carpentry/issues (note: you'll need to sign-up to GitHub to do this. As you'll need a GitHub account for week three, there is value in doing this now)
67 |
68 | _____
69 | ### References
70 |
71 | James Baker, "Preserving Your Research Data," *Programming Historian* (30 April 2014), [http://programminghistorian.org/lessons/preserving-your-research-data.html](http://programminghistorian.org/lessons/preserving-your-research-data.html). The sub-sections 'Plain text formats are your friend' and 'Naming files sensible things is good for you and for your computers' are reworked from this lesson.
72 |
73 | Owen Stephens, "Working with Data using OpenRefine", *Overdue Ideas* (19 November 2014), [http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/](http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/). The section on 'Regular Expressions' is reworked from this lesson developed by Owen Stephens on behalf of the British Library.
74 |
75 | Andromeda Yelton, "Coding for Librarians: Learning by Example", *Library Technology Reports* 51:3 (April 2015), doi: [10.5860/ltr.51n3](http://dx.doi.org/10.5860/ltr.51n3)
76 |
77 | Fiona Tweedie, "Why Code?", *The Research Bazaar* (October 2014), [http://melbourne.resbaz.edu.au/post/95320810834/why-code](http://melbourne.resbaz.edu.au/post/95320810834/why-code)
78 |
--------------------------------------------------------------------------------
/lesson-materials/LibCarp-lesson-one.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LibraryCarpentry/week-one-library-carpentry-DEPRECATED/c4b7a8377a273959219d189c5d9591d75f61d11f/lesson-materials/LibCarp-lesson-one.docx
--------------------------------------------------------------------------------
/lesson-materials/LibCarp-lesson-one.md:
--------------------------------------------------------------------------------
1 | # Library Carpentry Week One: Some Basics
2 |
3 | 1730-2030, Monday 9 November, City University London
4 |
5 | _____
6 | ## Lesson Plan
7 |
8 | _____
9 | ### Data collection
10 |
11 | Conducted as attendees enter the room
12 |
13 | Thinking about the programme as a whole, please rate your skill level. Would you say that you know:
14 |
15 | a) Nothing
16 | b) A Little
17 | c) Lots
18 | d) Lots and Lots
19 |
20 | _____
21 | ### Overview
22 |
23 | #### Introduction
24 |
25 | Welcome to Library Carpentry! This series of introductory workshops on software skills for librarians is an exploratory programme funded by the Software Sustainability Institute and supported by [Software Carpentry](http://software-carpentry.org/) and City University London. My thanks also go to the British Library and the University of Sussex: the former where I worked when I was setting this up and the latter where I work currently as part of the Sussex Humanities Lab, a centrally funded, multi-disciplinary programme that brings 20 or so folks at Sussex together, many of whom are new hires in new posts, and aims to make the humanities fit for the digital age, and - pleasingly - includes a core member embedded in the Library. I want to emphasise the exploratory element of the programme. With my former BL colleague Nora McGregor, I attended a Software Carpentry workshop 2 years ago that gave me the confidence to really delve deeply into acquiring the software skills that could enhance my work, practice, and research. But Software Carpentry is aimed at scientists and it takes a little sideways thinking to parse the lessons for non-scientists. So having spoken to Greg Wilson from Software Carpentry and having observed that library folks are a consistent and not insignificant fringe of folks attending Software Carpentry events, I started thinking how Software Carpentry lessons, along with materials produced at the Programming Historian and the BL's internal Digital Scholarship Training Programme, could be made into a baseline for a public programme of software skills lessons aimed at library folks that library folks could reuse in their local contexts. So rather than Library Carpentry becoming a programme I coordinate each and every running of, the hope is that it becomes a set of tools for the community to manage, support, enrich, and reuse as the community sees fit. Periodically during the sessions we will be collecting anonymous feedback from you. All this will go into building this future vision for a community driven and supported set of tools that the library community can use to kick start their exploration of software skills.
26 |
27 | The rationale for Library Carpentry is twofold. First, as Andromeda Yelton argues in her excellent recent [ALA Library Technology Report](http://journals.ala.org/ltr/issue/view/506) 'Coding for Librarians: learning by example', code is a means for librarians to take control of practice and to empower themselves and their organisation to meet user needs in flexible ways. Second, librarians play a crucial role in cultivating world class research. And in most research areas today world class research relies on the use of software. Librarians with software skills are then well placed to continue that cultivation of world class research.
28 |
29 | In order to kick start your exploration of software, this four week programme will look at the following: **SLIDE**
30 |
31 | - week 1: Some Basics
32 | - week 2: Controlling Data (with the Shell)
33 | - week 3: Versioning Data (with Git)
34 | - week 4: Cleaning Data (with Open Refine)
35 |
36 | #### Where to go for help
37 |
38 | **SLIDE** You are a big class. Thankfully each week there will be multiple ways of getting help. First, use your skill level stickers to identify people on your table who can help: you will all be following along from the same worksheets, so someone around you may have got past the point you are stuck at. Second, there are plenty of helpers in the room including me who are there to help if those around you can't. You should all have access to coloured sticky notes: a red sticky note on your laptop indicates to one of us that you need help (it might also alert the attention of someone around you!). So, please use them. Third, before each of the three following classes you will be required to install software: all issues doing this should be reported to the Github issues page for the week, posting something there will alert someone to give you help in advance **bring up Github pages**. Fourth, and finally, as much of the programme is self-directed we encourage you to finish up or repeat tasks after class time: if you run into issues, again report them to the relevant Github issues page.
39 |
40 | #### Final points of admin
41 |
42 | **SLIDE** We will be in the same place each week, convening at the same time. I will be here each week, though those here to help may change.
43 |
44 | Most of the sessions will involve following along from a worksheet. Much of the time you will be encouraged to go along at your own pace. For some of you this will feel like a lot of material, for others it might not feel like enough. Remember that the session is introductory. If you finish early our end time is not a hard stop, so you may of course leave. Alternatively, you might want to use the time to search online for more information or advanced skills guides. You may even wish to deepen your own skills by staying around and helping someone else out: there is nothing better for really getting to know something than teaching someone else! If you don't finish, don't worry, there are no prerequisites between classes and if you have time, you can always carry on at home or on the train.
45 |
46 | Tea, coffee, and snacks will be available at every session. Given the timing of the session you are welcome to bring some dinner with you (though you'll need to bring your own cutlery!)
47 |
48 | Finally computers are stupid, can frustrate, and as you all have different machines it can be tricky to resolve problems. Please be patient, particularly if your issue is local. Stepping outside and taking a gulp of fresh air always helps.
49 |
50 | Wifi: speak to Ernesto
51 |
52 | #### This week
53 |
54 | **SLIDE**
55 |
56 | - Jargon Busting (1800-1845)
57 | - Foundations (1845-1930)
58 | - Regular Expressions (1930-2015)
59 |
60 | _____
61 | ### Jargon Busting (Group Task)
62 |
63 | #### Requirements
64 |
65 | - boards/pads
66 | - pens
67 | - sticky notes*
68 |
69 | #### Purpose
70 |
71 | - icebreaker
72 | - finding confidence level
73 | - expectation management
74 |
75 | #### Task
76 |
77 | **SLIDE**
78 |
79 | This group task is an opportunity for you to get help understanding terms, phrases, or ideas around code or software development that you've come across and perhaps feel you should know better.
80 |
81 | - Start by getting into groups of 5 or 6.
82 | - Next make a big list of all the problem terms, phrases, and ideas that come up. Retain duplicates. Then (taking common words as a starting point) work together to try and explain what a term, phrase, or idea means (note: use both each other and the internet as a resource!). Make a note of those your group resolves and those you are still struggling with. **15 minutes**
83 | - {Trainer}: alongside this make a note of common problem terms, phrases, and ideas that both do and do not map to what Library Carpentry will cover (paying special attention to common problems that will not be covered)
84 | - Report back on an issue resolved by your group **1 minute each**
85 | - {Trainer}: Report on mapping exercise: what is and isn't going to be covered.
86 |
87 | _____
88 | ### Foundations
89 |
90 | **SLIDE** Before we crack on with using the computational tools at our disposal, I want to spend some time on some foundation level stuff - a combination of best practice and generic skills that frame what we'll be doing for the next month.
91 |
92 | #### The Computer is Stupid
93 |
94 | **SLIDE** This does not mean that the computer isn't useful. Given a repetitive task, an enumerative task, or a task that relies on memory it can produce results faster, more accurately, and less grudgingly than you or I. Rather when I say that you should keep in mind that the computer is stupid, I mean to say that the computer only does what you tell it to. If it throws up an error it is often not your fault, rather in most cases the computer has failed to interpret what you mean because it can only work with what it knows (ergo, it is bad at interpreting). This is not to say that the people who told the computer what to tell you when it doesn't know what to do couldn't have done a better job with error messages, for they could. So keep in mind as we go along that if you find an error message frustrating it isn't the computer's fault that it is giving you an archaic and incomprehensible error message, it is a person's.
95 |
96 | #### Why take an automated or computational approach
97 |
98 | **SLIDE** Otherwise known as the 'why not do it manually?' question. To start with, I'm not anti-manual. I do plenty of things manually that a machine could do in an automated way because either a) I don't know how to automate the task or b) I'm unlikely to repeat the task and estimate that automating it would take longer. However once you know you'll need to repeat a task you have a compelling reason to consider automating it. This is one of the main areas in which programmatic ways of doing outside of IT service environments are changing library practice. Andromeda Yelton, a US based librarian closely involved in the Code4Lib movement, recently put together an excellent American Library Association Library Technology Report called "Coding for Librarians: Learning by Example." The report is pitched at a real world relevance level, and in it Andromeda describes scenarios library professionals told her about where learning a little programming, usually learning ad-hoc, had made a difference to their work, to the work of their colleagues, and to the work of their library.
99 |
100 | Main lessons:
101 |
102 | - **Borrow, Borrow, and Borrow again**. This is a mainstay of programming and a practice common to all skill levels, from professional programmers to people like us hacking around in libraries;
103 | - **The correct language to learn is the one that works in your local context**. There truly isn't a best language, just languages with different strengths and weaknesses, all of which incorporate the same fundamental principles;
104 | - **Consider the role of programming in professional development**. That is both yours and of those you manage;
105 | - **Knowing (even a little) code helps you evaluate projects that use code**. Programming can seem alien. Knowing some code makes you better at judging the quality of software development or planning activities that include software development;
106 | - **Automate to make the time to do something else!** Taking the time to gather together even the most simple programming skills can save time to do more interesting stuff! (even if often that more interesting stuff is learning more programming skills...)
107 |
108 | **SLIDE: Cost/benefit slide**
109 |
110 | #### Keyboard shortcuts are your friend
111 |
112 | **SLIDE** Though we will get more computational over the next 3 weeks, we can start our adventure into programming - as many of you will have already - with very simple things like keyboard shortcuts. We all have our favourites. They save labour but also exploit this stupid machine in the best possible way. Alongside the very basic ones (ctrl+s for save; ctrl+c for copy; ctrl+x for cut; ctrl+v for paste) my favourite (on Windows or Linux machines) is alt+tab, a keyboard shortcut that switches between programmes {Trainer: ask other helpers what their favourites are}. You can do all the lessons in Library Carpentry without keyboard shortcuts, but note that they'll likely come up a lot.
113 |
114 | #### Plain text formats are your friend
115 |
116 | **SLIDE** Why? Because computers can process them!
117 |
118 | If you want computers to be able to process your stuff I'd encourage you to get in the habit where possible of using platform agnostic formats such as .txt for notes and .csv or .tsv for tabulated data (the latter pair are just spreadsheet formats, separated by commas and tabs respectively). These plain text formats are preferable to the proprietary formats used as defaults by Microsoft Office or iWork because they can be opened by many software packages and have a strong chance of remaining viewable and editable in the future. Most standard office suites include the option to save files in .txt, .csv and .tsv formats, meaning you can continue to work with familiar software and still take appropriate action to make your work accessible. Compared to .doc or .xls these formats have the additional benefit of containing only machine-readable elements. Whilst using bold, italics, and colouring to signify headings or to make a visual connection between data elements is common practice, these display-orientated annotations are not (easily) machine-readable and hence can neither be queried and searched nor are appropriate for large quantities of information (the rule of thumb is if you can't find it by ctrl+f it isn't machine readable). Preferable are simple notation schemes such as using a double-asterisk or three hashes to represent a data feature: in my own notes, for example, three question marks indicate something I need to follow up on, chosen because `???` can easily be found with a CTRL+F search.
119 |
120 | `???` was also chosen by me because it doesn't clash with existing schemes. Though it is likely that notation schemes will emerge from existing individual practice, existing schemes are available to represent headers, breaks, et al. One such scheme is Markdown, a lightweight markup language. Markdown files are saved as .md, are machine readable, human readable, and used in many contexts - GitHub, for example, renders text via Markdown. An excellent [Markdown cheat sheet is available on GitHub](https://github.com/adam-p/markdown-here) for those who wish to follow – or adapt – this existing scheme. Notepad++ http://notepad-plus-plus.org/ is recommended for Windows users as a tool to write Markdown text in, though it is by no means essential for working with .md files. Mac or Unix users may find Komodo Edit, Text Wrangler, Kate, or Atom helpful.
121 |
122 | #### Naming files sensible things is good for you and for your computers
123 |
124 | **SLIDE** Working with data is made easier by structuring your stuff in a consistent and predictable manner.
125 |
126 | Why?
127 |
128 | Without structured information, our lives would be much poorer. As library and archive people we know this. But I'll linger on this a little longer because for working with data it is especially important.
129 |
130 | Examining URLs is a good way of thinking about why structuring research data in a consistent and predictable manner might be useful in your work. Good URLs represent with clarity the content of the page they identify, either by containing semantic elements or by using a single data element found across a set or majority of pages.
131 |
132 | **look at webpages** A typical example of the former are the URLs used by news websites or blogging services. WordPress URLs follow the format:
133 |
134 | - `ROOT/YYYY/MM/DD/words-of-title-separated-by-hyphens`
135 | -
136 |
137 | A similar style is used by news agencies such as *The Guardian* newspaper:
138 |
139 | - `ROOT/SUB_ROOT/YYYY/MMM/DD/words-describing-content-separated-by-hyphens`
140 | -
141 |
142 | In archival catalogues, URLs structured by a single data element are often used. The British Cartoon Archive structures its online archive using the format:
143 |
144 | - `ROOT/record/REF`
145 | -
146 |
147 | And the Old Bailey Online uses the format:
148 |
149 | - `ROOT/browse.jsp?ref=REF`
150 | -
151 |
152 | What we learn from these examples is that a combination of semantic description and data elements make for consistent and predictable data structures that are readable both by humans and machines. Transferring this to your stuff makes it easier to browse, to search, and to query using both the standard tools provided by operating systems and by the more advanced tools Library Carpentry will cover.
153 |
154 | In practice, the structure of a good archive might look something like this:
155 |
156 | - A base or root directory, perhaps called 'work'.
157 | - A series of sub-directories such as 'events', 'data', 'projects' et cetera
158 | - Within these directories are series of directories for each event, dataset or project. Introducing a naming convention here that includes a date element keeps the information organised without the need for subdirectories by, say, year or month.
159 |
160 | All this should help you remember something you were working on when you come back to it later (call it real world preservation).
161 |
162 | The crucial bit for our purposes, however, is the file naming convention you choose. The name of a file is important to ensuring it and its contents are easy to identify. 'Data.xlsx' doesn't fulfil this purpose. A title that describes the data does. And adding a dating convention to the file name, associating derived data with base data through file names, and using directory structures to aid comprehension strengthen those connections.
163 |
164 | To recap, the key points about structuring your data are:
165 |
166 | - Data structures should be consistent and predictable
167 | - Consider using semantic elements or data identifiers to structure data directories
168 | - Fit and adapt your data structure to your work
169 | - Apply naming conventions to directories and file names to identify
170 | them, to create associations between data elements, and to assist
171 | with the long term readability and comprehension of your data
172 | structures
173 |
174 | _____
175 | ### Regular Expressions
176 |
177 | **SLIDE** One of the reasons why I have stressed the value of consistent and predictable directory and filenaming conventions is that working in this way enables you to use the computer to select files based on the characteristics of their file name. So, for example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from '2014', then you can. Or if you have 'journal' somewhere in a filename when you have data about journals, you can use the computer to select just those files then do something with them. Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data *within* files.
178 |
179 | A powerful means of doing this selecting based on file characteristics is to use regular expressions, often abbreviated to regex. A regular expression is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. It will let you:
180 |
181 | - Match on types of character (e.g. 'upper case letters', 'digits', 'spaces', etc.)
182 | - Match patterns that repeat any number of times
183 | - Capture the parts of the original string that match your pattern
184 |
185 | As most computational software has regular expression functionality built in and as many computational tasks in libraries are built around complex matching, it is a good place for Library Carpentry to start in earnest.
186 |
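To make the file selection idea concrete, here is a minimal sketch in Python (our choice of language; the lesson itself is done with pen and paper). The file names are hypothetical, invented to follow the date-first naming convention described earlier.

```python
import re

# Hypothetical file names that follow a YYYY-MM-DD_description convention.
filenames = [
    "2014-03-12_journal-subscriptions.csv",
    "2014-11-02_event-feedback.txt",
    "2015-01-20_journal-usage.csv",
    "notes.txt",
]

# Files whose name starts with the year 2014.
from_2014 = [name for name in filenames if re.search(r"^2014", name)]

# Files with 'journal' anywhere in the name.
journals = [name for name in filenames if re.search(r"journal", name)]

print(from_2014)
print(journals)
```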
187 | **SLIDE** A very simple use of a regular expression would be to locate the same word spelled two different ways. For example the regular expression `organi[sz]e` matches both "organise" and "organize".
188 |
191 | But it would also find `reorganise`. So there is a bunch of special syntax that helps us be more precise.
192 |
193 | The first we've seen: square brackets can be used to define a list or range of characters to be found. So: **SLIDE**
194 |
195 | - `[ABC]` matches A or B or C
196 | - `[A-Z]` matches any upper case letter
197 | - `[A-Za-z0-9]` matches any upper or lower case letter or any digit (note: this is case-sensitive)
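
The square bracket patterns above can be tried out with a short Python sketch. The sample strings are invented; `re.findall` simply returns every piece of the text that matches.

```python
import re

text = "British Library, London 2015"  # invented sample text

upper = re.findall(r"[A-Z]", text)           # upper case letters only
digits = re.findall(r"[0-9]", text)          # digits only
abc = re.findall(r"[ABC]", "ABCD abcd")      # only the upper case A, B, and C
alnum = re.findall(r"[A-Za-z0-9]", "a-1 B")  # letters and digits, no hyphen or space

print(upper, digits, abc, alnum)
```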
198 |
199 | Then there are: **SLIDE**
200 |
201 | - `.` matches any character
202 | - `\d` matches any single digit
203 | - `\w` matches any word character (equivalent to `[A-Za-z0-9_]`)
204 | - `\s` matches any space, tab, or newline
205 | - `\b` marks a word boundary. So putting this either side of a word stops the regular expression matching longer variants of that word.
206 | - `^` defines the start of the string
207 | - `$` defines the end of the string
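
A hedged Python sketch of the special characters listed above, using invented sample text. The one piece of syntax not in the list is `\.`, which is an escaped, literal full stop.

```python
import re

text = "We organise events; they reorganise shelves. Organizers organize."

# Without word boundaries the pattern also matches inside "reorganise".
loose = re.findall(r"organi[sz]e", text)

# With \b on both sides only the standalone words match.
strict = re.findall(r"\borgani[sz]e\b", text)

# \d finds digits; \w finds letters, digits, and the underscore.
digits = re.findall(r"\d", "Room 12B")
word_chars = re.findall(r"\w", "a_b c!")

# ^ and $ tie the pattern to the start and end of the string.
starts_with_year = re.search(r"^\d\d\d\d", "2014-report.txt") is not None
ends_with_txt = re.search(r"\.txt$", "2014-report.txt") is not None

print(loose, strict, digits, word_chars, starts_with_year, ends_with_txt)
```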
208 |
209 | **SLIDE** {Question}: So, what is `^[Oo]rgani.e$` going to match?
210 |
211 | Other useful special characters are **SLIDE**:
212 |
213 | - `*` matches when the preceding character appears any number of times including zero
214 | - `+` matches when the preceding character appears any number of times excluding zero
215 | - `?` matches when the preceding character appears one or zero times
216 | - `{VALUE}` matches the preceding character the number of times defined by VALUE; ranges can be specified with the syntax `{VALUE,VALUE}`
217 | - `|` means or.
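
The repetition characters above can be checked with a small Python sketch. The helper name `whole_match` and the test strings are made up for this illustration; `re.fullmatch` only succeeds when the entire string fits the pattern.

```python
import re


def whole_match(pattern, text):
    """Return True if the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None


cases = [
    (r"ab*c", "ac"),        # * allows zero b's
    (r"ab+c", "ac"),        # + needs at least one b
    (r"ab+c", "abbbc"),     # + allows many b's
    (r"colou?r", "color"),  # ? makes the u optional
    (r"\d{4}", "2015"),     # exactly four digits
    (r"\d{4}", "201"),      # three digits is too few
    (r"\d{2,4}", "201"),    # between two and four digits
    (r"cat|dog", "dog"),    # either alternative
]

results = [whole_match(pattern, text) for pattern, text in cases]

for (pattern, text), result in zip(cases, results):
    print(f"{pattern!r} against {text!r}: {result}")
```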
218 |
219 | {Questions}: So, what are these going to match?
220 |
221 | - **SLIDE** `^[Oo]rgani.e\w*`
222 | - **SLIDE** `[Oo]rgani.e\w+$`
223 | - **SLIDE** `^[Oo]rgani.e\w?\b`
224 | - **SLIDE** `\b[Oo]rgani.e\w{2}\b`
225 | - **SLIDE** `\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b`
226 |
227 | **SLIDE** This logic is super useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. Or for looking at cells in spreadsheets for certain values. Or for extracting some data from a column of a spreadsheet to make a new column. I could go on. The point is, it is super useful in many contexts. To embed this knowledge we won't - however - be using computers. Instead we'll use pen and paper. I want you to work in teams of 5 or 6 to work through the exercises in the handout. I have an answer sheet over here if you want to check where you've gone wrong. When you finish, I'd like you to split your team into two groups and write each other some tests. These should include a) strings you want the other team to write regex for and b) regular expressions you want the other team to work out what they would match. Then test each other on the answers. If you want to check your logic, use [regex101](https://regex101.com/).
228 |
229 | #### Exercise
230 |
231 | What does `Fr[ea]nc[eh]` match?
232 |
233 | - this matches `France`, `French`, `Frence`, and `Franch`. It also matches any string that has characters either side of these words so `Francer`, `dakkldakFrench`, or `Franch911`.
234 |
235 | What does `Fr[ea]nc[eh]$` match?
236 |
237 | - this matches `France`, `French`, `Frence`, and `Franch`. It also matches any string that has characters before these words, so `dakkldakFrench` (but not `Francer` or `Franch911`).
238 |
239 | What would match strings that begin with `French` and `France` only?
240 |
241 | - `^France|^French`
242 |
243 | How do you match the words `colour` and `color` (case insensitive)?
244 |
245 | - There are two ways of thinking about this. In real life, you *should* only come across the case insensitive variations `colour`, `color`, `Colour`, `Color`, `COLOUR`, and `COLOR`. So one option would be `^[Cc]olou?r$|^COLOU?R$`. However, you can also use `/colou?r/i` to find all case insensitive matches (the trailing `i` is the case-insensitive flag in many regex tools).
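
The `/.../i` notation is not universal. As a sketch of the same idea in Python (chosen here purely for illustration), the flag is passed as `re.IGNORECASE` instead. The word list is invented.

```python
import re

# re.IGNORECASE plays the role of the trailing "i" flag.
pattern = re.compile(r"colou?r", re.IGNORECASE)

words = ["colour", "Color", "COLOUR", "cOlOr", "colr", "colouur"]

# fullmatch requires the whole word to fit, so longer variants are excluded.
matched = [word for word in words if pattern.fullmatch(word)]

print(matched)
```

Note that the flag also accepts mixed-case oddities such as `cOlOr`, which the hand-written alternation does not.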
246 |
247 | How would you find `headrest` and `head rest` but not `head  rest` (that is, with two spaces between `head` and `rest`)?
248 |
249 | - `/head\s?rest/`. Note this will also match zero or one tabs or newline characters, but it should work in most real world cases :)
250 |
251 | How would you find a 4 letter word that ends a string and is preceded by at least one zero?
252 |
253 | - `/0+[a-z]{4}$/`
254 |
255 | How do you match any 4 digit string anywhere?
256 |
257 | - `\d{4}`
258 |
259 | How would you match the date format `dd-MM-yyyy`?
260 |
261 | - `/\d{2}-\d{2}-\d{4}/`
262 |
263 | How would you match the date format `dd-MM-yyyy` or `dd-MM-yy` at the end of a string only?
264 |
265 | - `/\d{2}-\d{2}-\d{2,4}$/`
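
One caveat worth testing: `{2,4}` means two, three, or four digits, so the pattern above also accepts a three-digit year. The sketch below, in Python with invented strings, compares it with a stricter variant. The variant uses parentheses to group the two alternatives, which is an addition not covered in the list of special characters.

```python
import re

loose = re.compile(r"\d{2}-\d{2}-\d{2,4}$")
# Hypothetical stricter variant: the year must be exactly four or two digits.
strict = re.compile(r"\d{2}-\d{2}-(\d{4}|\d{2})$")

samples = ["Due 09-11-2015", "Due 09-11-15", "Due 09-11-201"]

loose_results = [loose.search(s) is not None for s in samples]
strict_results = [strict.search(s) is not None for s in samples]

print(loose_results)
print(strict_results)
```

For a pen and paper exercise the looser pattern is fine; the stricter one matters when three-digit typos are possible in real data.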
266 |
267 | How would you match publication formats such as `British Library : London, 2015` and `Manchester University Press : Manchester, 1999`?
268 |
269 | - `/.* : .*, \d{4}/`
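
Earlier it was said that regular expressions let you capture the parts of a string that match. As a rough sketch of that idea in Python, parentheses (an addition to the answer above) mark the parts to keep. The surrounding `/` delimiters are dropped because Python does not use them.

```python
import re

# Parentheses capture the publisher, the place, and the year.
pattern = re.compile(r"(.*) : (.*), (\d{4})")

records = [
    "British Library : London, 2015",
    "Manchester University Press : Manchester, 1999",
]

captured = [pattern.search(record).groups() for record in records]

for publisher, place, year in captured:
    print(publisher, "|", place, "|", year)
```

This is the kind of step that turns one spreadsheet column into three.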
270 |
271 | _____
272 | ## Next week
273 |
274 | **SLIDE**
275 |
276 | Shell.
277 |
278 | Instructions on what to install for each operating system are on the GitHub page.
279 |
280 | Anyone who wants help now, we are happy to help.
281 |
282 | _____
283 | ### References
284 |
285 | James Baker, "Preserving Your Research Data," *Programming Historian* (30 April 2014), [http://programminghistorian.org/lessons/preserving-your-research-data.html](http://programminghistorian.org/lessons/preserving-your-research-data.html). The sub-sections 'Plain text formats are your friend' and 'Naming files sensible things is good for you and for your computers' are reworked from this lesson.
286 |
287 | Owen Stephens, "Working with Data using OpenRefine", *Overdue Ideas* (19 November 2014), [http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/](http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/). The section on 'Regular Expressions' is reworked from this lesson developed by Owen Stephens on behalf of the British Library.
288 |
289 | Andromeda Yelton, "Coding for Librarians: Learning by Example", *Library Technology Reports* 51:3 (April 2015), doi: [10.5860/ltr.51n3](http://dx.doi.org/10.5860/ltr.51n3)
290 |
291 | Fiona Tweedie, "Why Code?", *The Research Bazaar* (October 2014), [http://melbourne.resbaz.edu.au/post/95320810834/why-code](http://melbourne.resbaz.edu.au/post/95320810834/why-code)
292 |
293 |
--------------------------------------------------------------------------------
/lesson-materials/attic/2015-08-13_LibCarp-lesson-one-handout.md:
--------------------------------------------------------------------------------
1 | # Library Carpentry Week One: Some Basics
2 |
3 | _____
4 | ### Regular Expressions
5 |
6 | - `[]` define a list or range of characters to be found
7 | - `.` matches any character at all.
8 | - `\d` matches any single digit
9 | - `\w` matches any word character (equivalent to `[A-Za-z0-9_]`)
10 | - `\s` matches any space, tab, or newline
11 | - `^` defines the start of the string
12 | - `$` defines the end of the string
13 | - `*` matches when the preceding character appears any number of times including zero
14 | - `+` matches when the preceding character appears any number of times excluding zero
15 | - `?` matches when the preceding character appears one or zero times
16 | - `{VALUE}` matches the preceding character the number of times defined by VALUE; ranges can be specified with the syntax `{VALUE,VALUE}`
17 | - `|` means or.
18 |
19 | Check your regex with regex101: https://regex101.com/
20 |
21 | #### Exercise
22 |
23 | What does `Fr[ea]nc[eh]` match?
24 |
25 | What does `Fr[ea]nc[eh]$` match?
26 |
27 | What would match strings that begin with `French` and `France` only? {`^France|^French`}
28 |
29 | How do you match the words 'colour' and 'color' (case insensitive)? {`colou?r`}
30 |
31 | How would you find 'headrest' and 'head rest' but not 'head  rest' (that is, with two spaces between 'head' and 'rest')? {`head\s?rest`}
32 |
33 | How would you find a 4 letter word that ends a string and is preceded by at least one zero? {`0+[a-z]{4}$`}
34 |
35 | How do you match any 4 digit string? {`\d{4}`}
36 |
37 | How would you match the date format 'dd-MM-yyyy'? {`\d{2}-\d{2}-\d{4}`}
38 |
39 | How would you match the date format 'dd-MM-yyyy' or 'dd-MM-yy' at the end of a string only? {`\d{2}-\d{2}-\d{2,4}$`}
40 |
41 | How would you match a thirteen digit ISBN? {`^\d{13}$`}
42 |
43 | How would you match publication formats such as 'British Library : London, 2015' and 'Manchester University Press : Manchester, 1999'? {`.* : .*, \d{4}`}
44 |
45 | _____
46 | ### References
47 |
48 | James Baker, "Preserving Your Research Data," *Programming Historian* (30 April 2014), [http://programminghistorian.org/lessons/preserving-your-research-data.html](http://programminghistorian.org/lessons/preserving-your-research-data.html). The sub-sections 'Plain text formats are your friend' and 'Naming files sensible things is good for you and for your computers' are reworked from this lesson.
49 |
50 | Owen Stephens, "Working with Data using OpenRefine", *Overdue Ideas* (19 November 2014), [http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/](http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/). The section on 'Regular Expressions' is reworked from this lesson developed by Owen Stephens on behalf of the British Library.
51 |
52 | Andromeda Yelton, "Coding for Librarians: Learning by Example", *Library Technology Reports* 51:3 (April 2015), doi: [10.5860/ltr.51n3](http://dx.doi.org/10.5860/ltr.51n3)
53 |
54 | Fiona Tweedie, "Why Code?", *The Research Bazaar* (October 2014), [http://melbourne.resbaz.edu.au/post/95320810834/why-code](http://melbourne.resbaz.edu.au/post/95320810834/why-code)
55 |
--------------------------------------------------------------------------------
/lesson-materials/attic/2015-08-13_LibCarp-lesson-one.md:
--------------------------------------------------------------------------------
1 | # Library Carpentry Week One: Some Basics
2 |
3 | 1730-2030, Monday 9 November, City University London
4 |
5 | _____
6 | ## Lesson Plan
7 |
8 | _____
9 | ### Data collection
10 |
11 | Conducted as attendees enter the room
12 |
13 | Thinking about the programme as a whole, please rate your skill level. Would you say that you know:
14 |
15 | a) Nothing
16 | b) A Little
17 | c) Lots
18 | d) Lots and Lots
19 |
20 | _____
21 | ### Overview
22 |
23 | #### Introduction
24 |
25 | Welcome to Library Carpentry! This series of introductory workshops on software skills for librarians is an exploratory programme funded by the Software Sustainability Institute and supported by [Software Carpentry](http://software-carpentry.org/), City University London, the British Library, and the University of Sussex (the latter pair where I worked when I was setting up and delivering the programme respectively). I want to emphasise the exploratory element of the programme. With my former BL colleague Nora McGregor, I attended a Software Carpentry workshop 2 years ago that gave me the confidence to really delve deeply into acquiring the software skills that could enhance my work, practice, and research. But Software Carpentry is aimed at scientists and it takes a little sideways thinking to parse the lessons for non-scientists. So having spoken to Greg Wilson from Software Carpentry and having observed that library folks are a consistent and not insignificant fringe of folks attending Software Carpentry events, I started thinking how Software Carpentry lessons, along with materials produced at the Programming Historian and the BL's internal Digital Scholarship Training Programme, could be made into a baseline for a public programme of software skills lessons aimed at library folks that library folks could reuse in their local contexts. So rather than Library Carpentry becoming a programme whose every running I coordinate, the hope is that it becomes a set of tools for the community to manage, support, enrich, and reuse as the community sees fit. Periodically during the sessions we will be collecting anonymous feedback from you. All this will go into building this future vision for a community driven and supported set of tools the library community can use to kick start their exploration of software skills.
26 |
27 | The rationale for Library Carpentry is twofold. First, as Andromeda Yelton argues in her excellent recent [ALA Library Technology Report](http://journals.ala.org/ltr/issue/view/506) code is a means for librarians to take control of practice and to empower themselves and their organisation to meet user needs in flexible ways. Second, librarians play a crucial role in cultivating world class research. And in most research areas today world class research relies on the use of software. Librarians with software skills are then well placed to continue that cultivation of world class research.
28 |
29 | In order to kick start your exploration of software, this four week programme will look at the following: **SLIDE**
30 |
31 | - week 1: Some Basics
32 | - week 2: Controlling Data (with the Shell)
33 | - week 3: Versioning Data (with Git)
34 | - week 4: Cleaning Data (with Open Refine)
35 |
36 | #### Where to go for help
37 |
38 | **SLIDE** You are a big class. Thankfully each week there will be multiple ways of getting help. First, use your skill level stickers to identify people on your table who can help: you will all be following along from the same worksheets, so someone around you may have got past the point you are stuck at. Second, there are plenty of helpers in the room including me who are there to help if those around you can't. You should all have access to coloured sticky notes: a red sticky note on your laptop indicates to one of us that you need help (it might also attract the attention of someone around you!). So, please use them. Third, before each of the three following classes you will be required to install software: all issues doing this should be reported to the GitHub issues page for the week where someone can help in advance. Fourth, and finally, as much of the programme is self-directed we encourage you to finish up or repeat tasks after class time: if you run into issues, again report them to the relevant GitHub issues page.
39 |
40 | #### Final points of admin
41 |
42 | **SLIDE** We will be in the same place each week, convening at the same time. I will be here each week, though those here to help may change.
43 |
44 | Most of the sessions will involve following along from a worksheet. Much of the time you will be encouraged to go along at your own pace. For some of you this will feel like a lot of material, for others it might not feel like enough. Remember that the session is introductory. If you finish early our end time is not a hard stop, so you may of course leave. Alternatively, you might want to use the time to search online for more information or advanced skills guides. You may even wish to deepen your own skills by staying around and helping someone else out: there is nothing better for really getting to know something than teaching someone else! If you don't finish, don't worry, there are no prerequisites between classes and if you have time, you can always carry on at home or on the train.
45 |
46 | Tea, coffee, and snacks will be available at every session. Given the timing of the session you are welcome to bring some dinner with you (though you'll need to bring your own cutlery!)
47 |
48 | Finally computers are stupid, can frustrate, and as you all have different machines it can be tricky to resolve problems. Please be patient, particularly if your issue is local. Stepping outside and taking a gulp of fresh air always helps.
49 |
50 | Wifi: **ASK ERNESTO**
51 |
52 | #### This week
53 |
54 | **SLIDE**
55 |
56 | - Jargon Busting (1800-1845)
57 | - Foundations (1845-1930)
58 | - Regular Expressions (1930-2015)
59 |
60 | _____
61 | ### Jargon Busting (Group Task)
62 |
63 | #### Requirements
64 |
65 | - boards/pads
66 | - pens
67 | - sticky notes*
68 |
69 | #### Purpose
70 |
71 | - icebreaker
72 | - finding confidence level
73 | - expectation management
74 |
75 | #### Task
76 |
77 | **SLIDE**
78 |
79 | This group task is an opportunity for you to get help understanding terms, phrases, or ideas around code or software development that you've come across and perhaps feel you should know better.
80 |
81 | - Start by getting into groups of 5 or 6.
82 | - Next make a big list of all the problem terms, phrases, and ideas that come up. Retain duplicates. Then (taking common words as a starting point) work together to try and explain what a term, phrase, or idea means (note: use both each other and the internet as a resource!). Make a note of those your group resolves and those you are still struggling with. **15 minutes**
83 | - {Trainer}: alongside this make a note of common problem terms, phrases, and ideas that both do and do not map to what Library Carpentry will cover (paying special attention to common problems that will not be covered)
84 | - Report back on an issue resolved by your group **1 minute each**
85 | - {Trainer}: Report on mapping exercise: what is and isn't going to be covered.
86 |
87 | _____
88 | ### Foundations
89 |
90 | **SLIDE** Before we crack on with using the computational tools at our disposal, I want to spend some time on some foundation level stuff - a combination of best practice and generic skills that frame what we'll be doing for the next month.
91 |
92 | #### The Computer is Stupid
93 |
94 | **SLIDE** This does not mean that the computer isn't useful. Given a repetitive task, an enumerative task, or a task that relies on memory it can produce results faster, more accurately, and less grudgingly than you or I. Rather when I say that you should keep in mind that the computer is stupid, I mean to say that the computer only does what you tell it to. If it throws up an error it is often not your fault, rather in most cases the computer has failed to interpret what you mean because it can only work with what it knows (ergo, it is bad at interpreting). This is not to say that the people who told the computer what to tell you when it doesn't know what to do couldn't have done a better job with error messages, for they could. So keep in mind as we go along that if you find an error message frustrating it isn't the computer's fault that it is giving you an archaic and incomprehensible error message, it is a person's.
95 |
96 | #### Why take an automated or computational approach
97 |
98 | **SLIDE** Otherwise known as the 'why not do it manually?' question. To start with, I'm not anti-manual. I do plenty of things manually that a machine could do in an automated way because either a) I don't know how to automate the task or b) I'm unlikely to repeat the task and estimate that automating it would take longer. However once you know you'll need to repeat a task you have a compelling reason to consider automating it. This is one of the main areas in which programmatic ways of working outside of IT service environments are changing library practice. Andromeda Yelton, a US based librarian closely involved in the Code4Lib movement, recently put together an excellent American Library Association Library Technology Report called "Coding for Librarians: Learning by Example." The report is pitched at a real world relevance level, and in it Andromeda describes scenarios library professionals told her about where learning a little programming, usually learning ad-hoc, had made a difference to their work, the work of their colleagues, and the work of their library.
99 |
100 | Main lessons:
101 |
102 | - **Borrow, Borrow, and Borrow again**. This is a mainstay of programming and a practice common to all skill levels, from professional programmers to people like us hacking around in libraries;
103 | - **The correct language to learn is the one that works in your local context**. There truly isn't a best language, just languages with different strengths and weaknesses, all of which incorporate the same fundamental principles;
104 | - **Consider the role of programming in professional development**. That is both yours and of those you manage;
105 | - **Knowing (even a little) code helps you evaluate projects that use code**. Programming can seem alien. Knowing some code makes you better at judging the quality of software development or planning activities that include software development;
106 | - **Automate to make the time to do something else!** Taking the time to gather together even the most simple programming skills can save time to do more interesting stuff! (even if often that more interesting stuff is learning more programming skills...)
107 |
108 | **Cost/benefit slide**
109 |
110 | #### Keyboard shortcuts are your friend
111 |
112 | **SLIDE** Though we will get more computational over the next 3 weeks, we can start our adventure into programming - as many of you will have already - with very simple things like keyboard shortcuts. We all have our favourites. They save labour but also exploit this stupid machine in the best possible way. Alongside the very basic ones (ctrl+s for save; ctrl+c for copy; ctrl+x for cut; ctrl+v for paste) my favourite (on Windows or Linux machines) is alt+tab, a keyboard shortcut that switches between programmes {Trainer: ask other helpers what their favourites are}. You can do all the lessons in Library Carpentry without keyboard shortcuts, but note that they'll likely come up a lot.
113 |
114 | #### Plain text formats are your friend
115 |
116 | **SLIDE** Why? Because computers can process them!
117 |
118 | If you want computers to be able to process your stuff I'd encourage you to get in the habit where possible of using platform agnostic formats such as .txt for notes and .csv or .tsv for tabulated data (the latter pair are just spreadsheet formats, separated by commas and tabs respectively). These plain text formats are preferable to the proprietary formats used as defaults by Microsoft Office or iWork because they can be opened by many software packages and have a strong chance of remaining viewable and editable in the future. Most standard office suites include the option to save files in .txt, .csv and .tsv formats, meaning you can continue to work with familiar software and still take appropriate action to make your work accessible. Compared to .doc or .xls these formats have the additional benefit of containing only machine-readable elements. Whilst using bold, italics, and colouring to signify headings or to make a visual connection between data elements is common practice, these display-orientated annotations are not machine-readable and hence can neither be queried and searched nor are appropriate for large quantities of information (the rule of thumb is if you can't find it by ctrl+f it isn't machine readable). Preferable are simple notation schemes such as using a double-asterisk or three hashes to represent a data feature: in my own notes, for example, three question marks indicate something I need to follow up on, chosen because `???` can easily be found with a CTRL+F search.
119 |
120 | `???` was also chosen by me because it doesn't clash with existing schemes. Though it is likely that notation schemes will emerge from existing individual practice, existing schemes are available to represent headers, breaks, et al. One such scheme is Markdown, a lightweight markup language. Markdown files are saved as .md, are machine readable, human readable, and used in many contexts - GitHub, for example, renders text via Markdown. An excellent Markdown cheat sheet is available on GitHub (https://github.com/adam-p/markdown-here) for those who wish to follow – or adapt – this existing scheme. Notepad++ (http://notepad-plus-plus.org/) is recommended for Windows users as a tool to write Markdown text in, though it is by no means essential for working with .md files. Mac or Unix users may find Komodo Edit or Text Wrangler helpful.
121 |
122 | #### Naming files sensible things is good for you and for your computers
123 |
124 | **SLIDE** Working with data is made easier by structuring your stuff in a consistent and predictable manner.
125 |
126 | Why?
127 |
128 | Without that structured information, our lives would be much poorer. As library and archive people we know this. But I'll linger on this a little longer because for working with data it is especially important.
129 |
130 | Examining URLs is a good way of thinking about why structuring research data in a consistent and predictable manner might be useful in your work. Good URLs represent with clarity the content of the page they identify, either by containing semantic elements or by using a single data element found across a set or majority of pages.
131 |
132 | A typical example of the former are the URLs used by news websites or blogging services. WordPress URLs follow the format:
133 |
134 | - `ROOT/YYYY/MM/DD/words-of-title-separated-by-hyphens`
135 | -
136 |
137 | A similar style is used by news agencies such as The Guardian newspaper:
138 |
139 | - `ROOT/SUB_ROOT/YYYY/MMM/DD/words-describing-content-separated-by-hyphens`
140 | -
141 |
142 | In archival catalogues, URLs structured by a single data element are often used. The British Cartoon Archive structures its online archive using the format:
143 |
144 | - `ROOT/record/REF`
145 | -
146 |
147 | And the Old Bailey Online uses the format:
148 |
149 | - `ROOT/browse.jsp?ref=REF`
150 | -
151 |
152 | What we learn from these examples is that a combination of semantic description and data elements make for consistent and predictable data structures that are readable both by humans and machines. Transferring this to your stuff makes it easier to browse, to search, and to query using both the standard tools provided by operating systems and by the more advanced tools Library Carpentry will cover.
153 |
154 | In practice, the structure of a good archive might look something like this:
155 |
156 | - A base or root directory, perhaps called 'work'.
157 | - A series of sub-directories such as 'events', 'data', 'projects' et cetera.
158 | - Within these directories are series of directories for each event, dataset or project. Introducing a naming convention here that includes a date element keeps the information organised without the need for subdirectories by, say, year or month.
159 |
160 | All this should help you remember something you were working on when you come back to it later (call it real world preservation).
161 |
162 | The crucial bit for our purposes, however, is the file naming convention you choose. The name of a file is important to ensuring it and its contents are easy to identify. 'Data.xlsx' doesn't fulfil this purpose. A title that describes the data does. And adding a dating convention to the file name, associating derived data with base data through file names, and using directory structures to aid comprehension strengthen those connections.
163 |
164 | To recap, the key points about structuring research data are:
165 |
166 | - Data structures should be consistent and predictable.
167 | - Consider using semantic elements or data identifiers to structure data directories.
168 | - Fit and adapt your data structure to your work.
169 | - Apply naming conventions to directories and file names to identify
170 | them, to create associations between data elements, and to assist
171 | with the long term readability and comprehension of your data
172 | structure.
173 |
174 | _____
175 | ### Regular Expressions
176 |
177 | **SLIDE** One of the reasons why I have stressed the value of consistent and predictable directory and file naming conventions is that working in this way enables you to use the computer to select files based on the characteristics of their file names. So, for example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from '2014', then you can. Or if you have 'journal' somewhere in a filename when you have data about journals, you can use the computer to select just those files then do something with them. Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data *within* files.
178 |
179 | A powerful means of selecting on these characteristics is to use regular expressions, often abbreviated to regex. A regular expression is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. It will let you:
180 |
181 | - Match on types of character (e.g. 'upper case letters', 'digits', 'spaces', etc.)
182 | - Match patterns that repeat any number of times
183 | - Capture the parts of the original string that match your pattern
184 |
185 | As most computational software has regular expression functionality built in and as many computational tasks in libraries are built around complex matching, it is a good place for Library Carpentry to start in earnest.
186 |
187 | **SLIDE** A very simple use of a regular expression would be to locate the same word spelled two different ways. For example the regular expression `organi[sz]e` matches both "organise" and "organize".
188 |
191 | But it would also match the `organise` inside `reorganise`. So there is a set of special syntax that helps us be more precise.
192 |
193 | The first we've seen: square brackets can be used to define a list or range of characters to be found. So: **SLIDE**
194 |
195 | - `[ABC]` matches A or B or C.
196 | - `[A-Z]` matches any upper case letter
197 | - `[A-Za-z0-9]` matches any upper or lower case letter or any digit (note: this is case-sensitive)
198 |
199 | Then there are: **SLIDE**
200 |
201 | - `.` matches any character at all.
202 | - `\d` matches any single digit
203 | - `\w` matches any word character (equivalent to `[A-Za-z0-9_]`)
204 | - `\s` matches any space, tab, or newline
205 | - `^` defines the start of the string
206 | - `$` defines the end of the string
207 |
208 | **SLIDE** {Question}: So, what is `^[Oo]rgani.e$` going to match?
209 |
210 | Other useful special characters are **SLIDE**:
211 |
212 | - `*` matches when the preceding character appears any number of times including zero
213 | - `+` matches when the preceding character appears any number of times excluding zero
214 | - `?` matches when the preceding character appears one or zero times
215 | - `{VALUE}` matches the preceding character the number of times defined by VALUE; ranges can be specified with the syntax `{VALUE,VALUE}`
216 | - `|` means or.
217 |
218 | {Questions}: So, what are these going to match?
219 |
220 | - **SLIDE** `^[Oo]rgani.e\w*$`
221 | - **SLIDE** `^[Oo]rgani.e\w+$`
222 | - **SLIDE** `^[Oo]rgani.e\w?$`
223 | - **SLIDE** `^[Oo]rgani.e\w{2}$`
224 | - **SLIDE** `^[Oo]rgani.e$|^[Oo]rgani.e\w{2}$`
225 |
226 | **SLIDE** This logic is super useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. Or for looking at cells in spreadsheets for certain values. Or for extracting some data from a column of a spreadsheet to make a new column. I could go on. The point is, it is super useful in many contexts. To embed this knowledge we won't - however - be using computers. Instead we'll use pen and paper. I want you to work in teams of 5 or 6 to work through the exercises in the handout. I have an answer sheet over here if you want to check where you've gone wrong. When you finish, I'd like you to split your team into two groups and write each other some tests. These should include a) strings you want the other team to write regex for and b) regular expressions you want the other team to work out what they would match. Then test each other on the answers. If you want to check your logic, use [regex101](https://regex101.com/).
227 |
228 | #### Handout
229 |
230 | What does `Fr[ea]nc[eh]` match?
231 |
232 | What does `Fr[ea]nc[eh]$` match?
233 |
234 | What would match strings that begin with `French` and `France` only? {`^France|^French`}
235 |
236 | How do you match the words 'colour' and 'color' (case insensitive)? {`colou?r`}
237 |
238 | How would you find 'headrest' and 'head rest' but not 'head&nbsp;&nbsp;rest' (two spaces between the words)? {`head\s?rest`}
239 |
240 | How would you find a 4 letter word that ends a string and is preceded by at least one zero? {`0+[a-z]{4}$`}
241 |
242 | How do you match any 4 digit string? {`\d{4}`}
243 |
244 | How would you match the date format 'dd-MM-yyyy'? {`\d{2}-\d{2}-\d{4}`}
245 |
246 | How would you match the date format 'dd-MM-yyyy' or 'dd-MM-yy' at the end of a string only? {`\d{2}-\d{2}-\d{2,4}$`; note that `{2,4}` also accepts a three-digit year}
247 |
248 | How would you match a thirteen-digit ISBN? {`^\d{13}$`}
249 |
250 | How would you match publication formats such as 'British Library : London, 2015' and 'Manchester University Press : Manchester, 1999'? {`.* : .*, \d{4}`}
251 |
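Some of the handout answers above can be sanity-checked with a short script. The sketch below uses Python's `re` module; every test string is an assumption made up for illustration (the 13-digit number is arbitrary, not a real ISBN), and `re.IGNORECASE` supplies the case insensitivity the 'colour' question asks for.

```python
import re

# (pattern, flags, strings that should match, strings that should not).
# Patterns are taken from the handout; the test strings are invented.
checks = [
    (r"colou?r", re.IGNORECASE, ["colour", "Color"], ["colr"]),
    (r"head\s?rest", 0, ["headrest", "head rest"], ["head  rest"]),
    (r"\d{2}-\d{2}-\d{4}", 0, ["09-11-2015"], ["9-11-2015"]),
    (r"^\d{13}$", 0, ["9781234567897"], ["978123456789"]),
    (r".* : .*, \d{4}", 0,
     ["British Library : London, 2015"], ["British Library, London"]),
]


def failures(cases):
    """Return (pattern, string) pairs where the outcome was unexpected."""
    bad = []
    for pattern, flags, should_match, should_not in cases:
        for s in should_match:
            if not re.search(pattern, s, flags):
                bad.append((pattern, s))
        for s in should_not:
            if re.search(pattern, s, flags):
                bad.append((pattern, s))
    return bad


problems = failures(checks)
print("all checks passed" if not problems else problems)
```

The same structure works for the tests the two groups write for each other: add a tuple per pattern and list the strings it should and should not match.
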
252 | _____
253 | ### References
254 |
255 | James Baker, "Preserving Your Research Data," *Programming Historian* (30 April 2014), [http://programminghistorian.org/lessons/preserving-your-research-data.html](http://programminghistorian.org/lessons/preserving-your-research-data.html). The sub-sections 'Plain text formats are your friend' and 'Naming files sensible things is good for you and for your computers' are reworked from this lesson.
256 |
257 | Owen Stephens, "Working with Data using OpenRefine", *Overdue Ideas* (19 November 2014), [http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/](http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/). The section on 'Regular Expressions' is reworked from this lesson developed by Owen Stephens on behalf of the British Library.
258 |
259 | Andromeda Yelton, "Coding for Librarians: Learning by Example", *Library Technology Reports* 51:3 (April 2015), doi: [10.5860/ltr.51n3](http://dx.doi.org/10.5860/ltr.51n3)
260 |
261 | Fiona Tweedie, "Why Code?", *The Research Bazaar* (October 2014), [http://melbourne.resbaz.edu.au/post/95320810834/why-code](http://melbourne.resbaz.edu.au/post/95320810834/why-code)
262 |
263 | _____
264 | ## Next week
265 |
266 | {Include this on the handout}
267 |
268 | Shell.
269 |
270 | Instructions on what to install for each operating system are on the GitHub page.
271 |
272 | Anyone who wants help now, we are happy to help.
273 |
--------------------------------------------------------------------------------
/photos/2015-11-09_what-do-you-want-more-or-less-of-next-week.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LibraryCarpentry/week-one-library-carpentry-DEPRECATED/c4b7a8377a273959219d189c5d9591d75f61d11f/photos/2015-11-09_what-do-you-want-more-or-less-of-next-week.jpg
--------------------------------------------------------------------------------
/photos/2015-11-09_what-words-do-you-think-you-should-know-more-about.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LibraryCarpentry/week-one-library-carpentry-DEPRECATED/c4b7a8377a273959219d189c5d9591d75f61d11f/photos/2015-11-09_what-words-do-you-think-you-should-know-more-about.jpg
--------------------------------------------------------------------------------