Workshop Design: Regular Expressions with Linux

Context for the Workshop

Regular expressions ("RegEx") are a general pattern notation for efficient and effective text processing, first developed in a computational sense in the late 1960s. Today regular expressions are heavily used for tasks such as the extraction of data from unstructured text (Bui, Zeng-Treitler, 2014), genomics (Yandell, Majoros, 2002), and GPS data (McDonald, 1999). Despite their widespread and effective use, there is no educational or training programme for learning regular expressions at the University; learners are expected to 'pick it up' over the course of their research career. Whilst regular expressions are available on other operating systems, they are most advanced in Linux and other UNIX-like systems. Further, their use over very large datasets necessitates high-performance computing facilities, which invariably use Linux. Hence, the workshop is "Regular Expressions with Linux". Finally, a number of researchers at Melbourne Bioinformatics have specifically requested such a workshop.

Learning Objective

The learners should develop a theoretical understanding of regular expressions, their use, and meta-characters. They should gain sufficient practice to have confidence in the basic search functions of `grep`, stream-editing with `sed`, and report generation with `awk`. They should understand the core differences between POSIX-standard Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE), along with the additional functionality in Perl regular expressions and Perl-compatible Regular Expressions (PCRE). The learners will gain familiarity with incorporating regular expressions in shell scripts, and with their use in a programmatic manner with Perl. The learners will know where and how to access additional examples and references on a local repository. The examples are generic in form; whilst this version is tailored towards domain-specific knowledge (i.e., bioinformatics), the course can be modified to suit other disciplines by changing the examples.

Draft of Workshop Design

This is designed as a four-hour workshop (roughly one hour per part) for a small workshop/tutorial group (c. 20 maximum), combining real-time lectures, questions-and-answers, and example code, with a significant data repository for reference.

The course prerequisite is an introductory-level knowledge of the Linux command-line environment, including the use of SSH to access remote systems, and a high-level understanding of the file system hierarchy and file types.

Part 0: Introduction
* Housekeeping. Outline goals for the day. Location for Repository of content and examples.

Part 1: Basics of Regular Expressions
* Definition of Regular Expressions. History and Usage. Major tools.
* Meta-characters. Meta-meanings, escape characters, matched sets.
* Searching with `grep`. Formal structure. Common options. Examples.
* Substitution with `sed`. Formal structure. Common options. Examples.
* Reporting with `awk`. Formal structure. Common options. Examples.
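
As an illustration of the kind of example Part 1 could use, the following is a minimal sketch of the three tools applied to the same input. The FASTA-style sample data and the variable names are assumptions made for this sketch, not part of the workshop repository.

```shell
# Hypothetical sample data: a tiny FASTA-style file created for this sketch.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>seq1 Homo sapiens
ATGCGTACGTTAG
>seq2 Mus musculus
ATGAAACCCGGGTTTTAA
>seq3 Homo sapiens
GGGCCCATATAT
EOF

# grep: count header lines (lines beginning with ">").
header_count=$(grep -c '^>' "$sample")

# sed: on header lines only, replace the first space with an underscore.
renamed=$(sed '/^>/s/ /_/' "$sample" | grep '^>')

# awk: report the length of each sequence line next to its identifier.
report=$(awk '/^>/ {name = substr($1, 2); next} {print name, length($0)}' "$sample")

echo "$header_count"
echo "$renamed"
echo "$report"
rm -f "$sample"
```

Each command follows the same broad shape of tool, options, pattern or program, and file, which is the "formal structure" the session would walk through before moving to options.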

Part 2: Advanced Tools
* POSIX standards for Basic and Extended Regular Expressions (BRE, ERE). New meta-characters, escape rules.
* Invoking tools with ERE. Comparison and examples of BRE and ERE.
* Perl Regular Expressions. Additional functionality to POSIX. Use in other languages. Examples.
* Perl Compatible Regular Expressions (PCRE). Relationship with Perl. Advanced functionality. Examples.
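
The BRE and ERE comparison in Part 2 might be sketched as follows. The input strings are hypothetical, and only `grep` and `grep -E` are used; Perl-style features such as lookahead are left out here because support for them in `grep` varies between systems.

```shell
# Hypothetical input: three short strings, one per line.
input=$(printf '%s\n' 'ACGT' 'AACCGGTT' 'A{2}')

# BRE: interval braces must be escaped to act as a quantifier.
bre_match=$(printf '%s\n' "$input" | grep 'A\{2\}')

# ERE: the same quantifier is written without backslashes.
ere_match=$(printf '%s\n' "$input" | grep -E 'A{2}')

# BRE: unescaped braces are ordinary characters, so this matches the literal text "A{2}".
bre_literal=$(printf '%s\n' "$input" | grep 'A{2}')

# ERE: grouping and alternation are available unescaped; count lines starting with AC or AA.
ere_alt=$(printf '%s\n' "$input" | grep -E -c '^(AC|AA)')

echo "$bre_match"
echo "$ere_match"
echo "$bre_literal"
echo "$ere_alt"
```

The same pattern text therefore selects different lines depending on which flavour the tool is asked to use, which is the escape-rule difference the session is meant to highlight.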

Part 3: Shell Scripting with RegEx
* Core components of shell scripting: variables, loops, conditionals, functions.
* Incorporating `grep`, `sed`, and `awk` into shell-scripts. Examples.
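
A Part 3 example could combine these components in one short script. This is a sketch under assumptions: the function name `count_matches`, the file names, and the sequence data are all hypothetical, and only `grep` is used to keep it brief.

```shell
# Hypothetical helper: count the lines in a file that match an ERE pattern.
count_matches() {
    pattern=$1
    file=$2
    # grep exits non-zero when nothing matches; "|| true" keeps the script going.
    grep -E -c -- "$pattern" "$file" || true
}

# Hypothetical working directory with two small sequence files.
workdir=$(mktemp -d)
printf '%s\n' '>a' 'ATGAAATAG' '>b' 'CCCGGG' > "$workdir/one.fa"
printf '%s\n' '>c' 'TTTT' > "$workdir/two.fa"

summary=""
for f in "$workdir"/*.fa; do
    # Count sequence lines that begin with the ATG start codon.
    n=$(count_matches '^ATG' "$f")
    if [ "$n" -gt 0 ]; then
        status="has_start_codon"
    else
        status="no_start_codon"
    fi
    summary="${summary}$(basename "$f") ${n} ${status}
"
done

printf '%s' "$summary"
rm -rf "$workdir"
```

Wrapping the `grep` call in a function keeps the regular expression in one place, so learners can change the pattern without touching the loop or the conditional.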

Part 4: Perl Programs with RegEx
* Arrays, scalars, loops, conditionals, subroutines. Bioinformatics examples.
* Object-orientation, data structures, databases. Bioinformatics examples.

Learning and Teaching Method

The small workshop environment provides an opportunity for learners to ask questions of the educator directly (zone of proximal development). Course content is organised as structured knowledge with scaffolding (Schunk et al., 2008); no new concepts or skills are introduced without prior exposure to the requisite material. Examples provide opportunities for learners to examine and test code and try their own modifications (intrinsic motivation appeal), along with collaborative learning through pair-programming (Williams, 2001). The teaching method is partly instructional, expertise-based guidance.

Learning Tasks and Assessments

Learning tasks are based on following provided examples and elaborating on those examples. Assessment can be carried out at the micro-level by evaluating the shell history file on a per-user basis, together with job submission results. This would show whether the core work and elaborations have been done and what problems were encountered, indicating areas for further emphasis if needed. A component of the assessment should include paired- and self-assessment: How helpful was their paired learner? How much learning does the learner think they achieved? Weighted values could be used to minimise abuse of scores.
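
One possible way to evaluate a shell history file is sketched below. The history contents are invented for illustration, and the approach is an assumption rather than a prescribed method; a real evaluation would read each learner's own history file.

```shell
# Hypothetical history file standing in for a learner's shell history.
history_file=$(mktemp)
cat > "$history_file" <<'EOF'
ls -l
grep -c '^>' sample.fa
sed 's/T/U/g' sample.fa
grep -E 'A{2}' sample.fa
awk '{print length($0)}' sample.fa
EOF

# Tally how often each workshop tool appears as the first word of a command.
tool_counts=$(awk '$1 ~ /^(grep|sed|awk)$/ {count[$1]++}
    END {for (t in count) print t, count[t]}' "$history_file" | sort)

echo "$tool_counts"
rm -f "$history_file"
```

This only counts the first command on each line, so tools used later in a pipeline would be missed; it is meant as a rough indicator of whether the core work was attempted.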


The course states its prerequisites and builds content with scaffolding (theoretical principles, basic commands and examples, advanced tools, scripted commands, programming). Content correlates with delivery and makes use of learner-centred teaching, with instructional content and feedback. A "hands-on" approach is encouraged and tied to assessment. The collaborative approach encourages "pair-programming" techniques.


The course should be expanded beyond a single-day workshop. Opportunities to test more complex and algorithmic examples are lacking (Volet, 1991), and the course also lacks summative assessment. The cognitive load is significant for a one-day workshop, although the effect is reduced by the small-class approach with feedback and the availability of reference material.


Bui, D. D., Zeng-Treitler, Q. (2014). Learning regular expressions for clinical text classification. Journal of the American Medical Informatics Association: JAMIA, 21(5), 850–857. doi:10.1136/amiajnl-2013-002411

Dougherty, D., Robbins, A. (1997). sed and awk (second edition). O'Reilly.

Friedl, J.E.F. (2006). Mastering Regular Expressions (third edition). O'Reilly.

McDonald, T. (1999). Time study of harvesting equipment using GPS-derived positional data. Forestry Engineering for Tomorrow, GIS Technical Papers, Edinburgh University, Edinburgh, Scotland.

Tisdall, J. (2001). Beginning Perl for Bioinformatics. O'Reilly.

Schunk, D.H., Pintrich, P.R., & Meece, J.L. (2008). Attribution theory. In D.H. Schunk, P.R. Pintrich, & J.L. Meece, Motivation in education: Theory, research and applications (pp. 79–110). Upper Saddle River, NJ: Pearson.

Volet, S. E. (1991). Modelling and coaching of relevant metacognitive strategies for enhancing university students’ learning. Learning and Instruction, 1, 319–336.

Williams, L. (2001). Integrating pair programming into a software development process. 14th Conference on Software Engineering Education and Training. Charlotte. pp. 27–36.

Yandell, M. D., Majoros, W. H. (2002). Genomics and natural language processing. Nature Reviews Genetics, 3(8), 601.