MIT Missing Semester Solutions: Week 4

  1. Take this short interactive regex tutorial.
    Website: https://regexone.com/
  2. Find the number of words (in /usr/share/dict/words) that contain at least three occurrences of the letter a and don’t have a 's ending. What are the three most common last two letters of those words? sed’s y command, or the tr program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?
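    Number of matching words (tr lowercases the input so that "A" counts too):
    cat /usr/share/dict/words | tr "[:upper:]" "[:lower:]" |
      awk '/a.*a.*a/' | grep -v "'s$" | wc -l
    Three most common last two letters:
    cat /usr/share/dict/words | tr "[:upper:]" "[:lower:]" |
      awk '/a.*a.*a/' | grep -v "'s$" |
      sed -E "s/.*([a-z][a-z])$/\1/" | sort | uniq -c | sort -n | tail -n3
    Number of distinct two-letter combinations:
    cat /usr/share/dict/words | tr "[:upper:]" "[:lower:]" |
      awk '/a.*a.*a/' | grep -v "'s$" |
      sed -E "s/.*([a-z][a-z])$/\1/" | sort | uniq -c | wc -l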
  3. To do in-place substitution it is quite tempting to do something like sed s/REGEX/SUBSTITUTION/ input.txt > input.txt. However this is a bad idea, why? Is this particular to sed? Use man sed to find out how to accomplish this.
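    The shell opens input.txt for writing, and truncates it, before sed even starts. sed then reads an empty file, so all data is lost unless it was backed up. This is not particular to sed: it happens with any command whose output is redirected to its own input file.
    With GNU sed, use the -i option instead (-i.bak additionally keeps a backup copy as {{filename}}.bak):
    sed -i 's/REGEX/SUBSTITUTION/' {{filename}}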
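    The truncation is easy to reproduce on a throwaway file. A minimal sketch, assuming a POSIX shell; safe_sub is a hypothetical helper (not part of sed) that goes through a temporary file, which is one portable alternative to an in-place option:

```shell
# Apply a sed script to a file without redirecting onto the input:
# write to a temporary file first, then copy it back only if sed succeeded.
# $1: sed script (e.g. 's/foo/bar/'), $2: file to edit.
safe_sub() {
  tmp=$(mktemp) || return 1
  if sed "$1" "$2" > "$tmp"; then
    cat "$tmp" > "$2"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}

demo=$(mktemp)
printf 'hello world\n' > "$demo"

# Broken: the shell truncates "$demo" before sed reads it, so it ends up empty.
sed 's/hello/goodbye/' "$demo" > "$demo"
wc -c < "$demo"

# Working: restore the content, then edit through the temporary file.
printf 'hello world\n' > "$demo"
safe_sub 's/hello/goodbye/' "$demo"
cat "$demo"
```

    Copying the temporary file back with cat (rather than mv) keeps the original file's permissions.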

Glossary:

  1. Data Wrangling
    Data wrangling, sometimes referred to as data munging, is the process of
    transforming and mapping data from one “raw” data form into another
    format with the intent of making it more appropriate and valuable for a
    variety of downstream purposes such as analytics.
  2. Regular Expressions
    A powerful construct that lets you match text against patterns. Regular
    expressions are usually (though not always) surrounded by /. Most ASCII
    characters just carry their normal meaning, but some characters have
    “special” matching behaviour.
    • . means “any single character” except newline
    • * zero or more of the preceding match
    • + one or more of the preceding match
    • [abc] any one character of a, b, or c
    • (RX1|RX2) either something that matches RX1 or RX2
    • ^ the start of the line
    • $ the end of the line
    E.g.: perl -pe 's/.*?Disconnected from //' (the non-greedy *? is Perl syntax; sed does not support it)
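    The special characters listed above can be tried out with grep -E and sed -E. A small sketch, assuming a POSIX shell; the sample lines are invented for illustration:

```shell
# Sample input, invented for illustration.
sample() {
  printf '%s\n' \
    'Disconnected from invalid user admin 10.0.0.1 port 22' \
    'Accepted publickey for alice' \
    'cat' 'cut' 'caat' 'ct'
}

sample | grep -E '^c.t$'      # . is exactly one character: cat, cut
sample | grep -E '^ca*t$'     # * is zero or more a's:      cat, caat, ct
sample | grep -E '^ca+t$'     # + is one or more a's:       cat, caat
sample | grep -E '^c[au]t$'   # [au] is either a or u:      cat, cut

# Alternation, anchored at the start of the line.
sample | grep -E '^(Accepted|Disconnected) '

# Greedy .* removes everything up to and including "Disconnected from ".
sample | sed -E 's/^.*Disconnected from //'
```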

Commands Used:

  1. grep: Find patterns in files using regular expressions.
    Search for a pattern within a file:
    grep "{{search_pattern}}" {{path/to/file}}
  2. sed: Stream editor for filtering and transforming text.
    Replace the first occurrence of a string on each line of a file,
    overwriting the file (i.e., in-place):
    sed -i 's/{{find}}/{{replace}}/' {{filename}}
    Replace the first occurrence of a regular expression in each line of
    a file, and print the result:
    sed 's/{{regex}}/{{replace}}/' {{filename}}
  3. sort: Sort lines of text files.
  4. uniq: Output the unique lines from the given input or file. Since it does
    not detect repeated lines unless they are adjacent, we need to sort them
    first.
    Display number of occurrences of each line along with that line:
    sort {{file}} | uniq -c
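    Why the sort is needed is easiest to see on a tiny input. A sketch with invented sample data, assuming a POSIX shell:

```shell
# Sample input, invented for illustration.
fruits() { printf '%s\n' apple pear apple fig pear apple; }

# Without sorting, no identical lines are adjacent, so nothing is merged:
# every line is reported with a count of 1.
fruits | uniq -c

# Sorting first groups the duplicates; a numeric sort on the counts
# then ranks them, and tail keeps the two most frequent.
fruits | sort | uniq -c | sort -n | tail -n 2
```

    Note the sort -n in the second pipeline: a plain sort compares the counts as text rather than as numbers.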
  5. tail: Display the last part of a file.
    Show last ‘num’ lines in file:
    tail -n {{num}} {{file}}
  6. wc: Count lines, words, or bytes.
    Count lines in file:
    wc -l {{file}}
  7. awk: A versatile programming language for working on files.
  8. R: A language for data analysis and graphics.
  9. gnuplot: An interactive plotting program
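
    As a small illustration of awk from the list above: it splits each line into fields ($1, $2, ...) and runs a block for every line that matches a condition. A sketch on invented count/name pairs (the shape of uniq -c output), assuming a POSIX shell:

```shell
# Sample "count name" pairs, invented for illustration.
counts() { printf '%s\n' '1 alice' '3 bob' '1 carole' '2 dave'; }

# Print the second field of every line whose first field equals 1.
counts | awk '$1 == 1 { print $2 }'

# Sum the first field over all lines and print the total at the end.
counts | awk '{ total += $1 } END { print total }'
```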