I've been reading Mastering Regular Expressions by Jeffrey E.F. Friedl, and since nobody in my life (aside from my wife) cares, I thought I’d share something I'm pretty proud of: my first set of regular expressions, which I wrote myself to manipulate the text I'm working with.
What I’m so happy about is that I wrote these expressions. I understand exactly what they do and the purpose of each character in each expression.
I've used regex in the past, mostly stuff cobbled together from Stack Overflow, but I never really understood how the expressions worked or what they meant, just that they did what I needed them to do at the time.
I'm only about 10% of the way through the book, but already I understand so much more than I ever did about regex (I also recognize I have a lot to learn).
I wrote the expressions to be used with egrep and sed to generate and clean up a list of filenames pulled out of tarballs (movies I've ripped from my DVD collection and tarballed to archive them).
The first expression I wrote was this one used with tar and egrep to list the files in the tarball and get just the name of the video file:
tar -tzvf file.tar.gz | egrep -o '\/[^/]*\.m(kv|p4)' > movielist
Which gives me a list of movies of which this is an example:
/The.Hunger.Games.(2012).[tmdbid-70160].mp4
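To try the expression without a real tarball, here is a minimal sketch that feeds it one line written in the style of `tar -tzvf` output. The permissions, owner, size, date and `movies/` directory are made up for illustration, and `grep -E` is used in place of `egrep` (the two accept the same pattern syntax).

```shell
# Hypothetical listing line; everything before the file name is invented.
listing='-rw-r--r-- user/user 1234567 2023-01-01 12:00 movies/The.Hunger.Games.(2012).[tmdbid-70160].mp4'

# -o prints only the matched text: a slash, a run of non-slash
# characters, then a literal ".mkv" or ".mp4".
name=$(printf '%s\n' "$listing" | grep -Eo '/[^/]*\.m(kv|p4)')

printf '%s\n' "$name"
```

The slash in `user/user` is not a problem here: nothing between it and the next slash ends in `.mkv` or `.mp4`, so the match starts at the slash in front of the file name.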
Then I used sed with a group of expressions to:
Remove the leading forward slash
Remove everything from .[ to the end
Replace the periods between words (and any other character that isn't a letter, digit, parenthesis, & or -) with spaces
And the last expression checks for runs of one or more spaces and replaces each with a single space.
This is the full sed command:
sed -Eie 's/^\///; s/\.\[[a-z]+-[0-9]+\]\.m(p4|kv)//; s/[^a-zA-Z0-9\(\)&-]/ /g; s/ +/ /g' movielist
Which leaves me with a pretty list of movies that looks like this:
The Hunger Games (2012)
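The same four substitutions can be followed one at a time in the sketch below. It assumes the sample name from above, reads from standard input, and leaves out the in-place `-i` option so nothing on disk is changed.

```shell
name='/The.Hunger.Games.(2012).[tmdbid-70160].mp4'

# 1st -e: strip the leading slash
# 2nd -e: drop ".[tag-digits].mp4" or ".[tag-digits].mkv"
# 3rd -e: turn every character outside the allowed set into a space
# 4th -e: squeeze runs of spaces down to a single space
clean=$(printf '%s\n' "$name" | sed -E \
  -e 's/^\///' \
  -e 's/\.\[[a-z]+-[0-9]+\]\.m(p4|kv)//' \
  -e 's/[^a-zA-Z0-9\(\)&-]/ /g' \
  -e 's/ +/ /g')

printf '%s\n' "$clean"
```

One thing worth checking on your own system: with GNU sed, bundling the options as `-Eie` appears to make `e` the backup suffix for `-i`, so a copy named `movieliste` may be left next to the edited file.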
I'm sure this could be done more elegantly, and I'm happy for any feedback on how to do that! For now, I'm just excited that I'm beginning to understand regex and how to use it!
Edit: fixed title so it didn’t say “regex expressions”
Wait. Are there flavors of regex? Every time I have to use regex it hurts my brain, and I never need to do it enough to actually sit down and learn it properly like OP is doing. Just knowing there are different ways of doing the same things in an already mind-baffling language blows me away even more.
Yeah. The only one you really need to care about (especially under Linux) is PCRE, the good ol' Perl Compatible Regular Expressions. For the most part, every other flavor is a derivative of that. Microsoft had a weird version for a while, but that may be completely dead now, thankfully.
Learning the syntax of regex is fairly easy. Hell, I still have to use this cheat sheet more often now that my perl skills are no longer needed or even relevant.
Regex isn't that hard. The challenge is identifying and understanding patterns in the data that you are filtering. Here is a brain hack: as an example, if you have pages and pages of logs that you need to filter, open up one of the log files, stare at the screen and hold the page down key for several dozen pages. Patterns can be easily seen in the blur of text that is quickly scrolling across the screen. (Our brains love to find patterns in noise, btw.) The patterns that you see will give you focus points for developing regular expressions to match, i.e., you start breaking strings into chunks, and seeing the ebb and flow of data streaming across a screen helps. Anomalies in the data "stream" are easy to spot as well.
From a security and efficiency standpoint, you should also understand where the most processing takes place so you don't kill whatever platform you are working on.
Sorry for the rambling, but I am getting older and feel the need to pass on a ton of tips and tricks whenever I can for these "archaic" languages.
Yes. Most things use PCRE, or Perl Compatible Regular Expressions, but there are other flavors. Usually they lack features or have slightly different syntax.
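As a small illustration of "slightly different syntax", here is a sketch comparing the two POSIX flavors that grep itself offers: basic (plain `grep`) and extended (`grep -E`). The sample string is made up, and `-o` is assumed to be available, as it is in the GNU and BSD greps.

```shell
sample='aaa (2012)'

# Basic regex: the interval braces must be escaped to act as an operator.
bre=$(printf '%s\n' "$sample" | grep -o 'a\{2\}')

# Extended regex: the same operator is written without backslashes.
ere=$(printf '%s\n' "$sample" | grep -Eo 'a{2}')

# Parentheses flip the other way: literal as-is in basic syntax,
# escaped to be literal in extended syntax.
bre_paren=$(printf '%s\n' "$sample" | grep -o '(2012)')
ere_paren=$(printf '%s\n' "$sample" | grep -Eo '\(2012\)')

printf '%s\n' "$bre" "$ere" "$bre_paren" "$ere_paren"
```

Both of the first two commands match `aa`, and both of the last two match `(2012)`; only the escaping differs.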
Regex101 is amazing. It tends to balk at backtracking, which we rely on a lot for work, but it's such a good visual.
ChatGPT can also save a lot of time writing regex, but it tends to write very unreadable regex because it thinks it's being clever when it really isn't.
Regex is an art form, and writing readable regex is another step above that.
Piggybacking onto this to mention my go-to online RegEx editor: RegExr. It lets you test the regex as you type, explains the particular symbols used, as well as has a sidebar where you can see different pattern types categorically. I've been using it for almost 2 years now, and haven't had any reason to use much else (after I discovered this).
It is a great book, although a bit outdated. In particular, nowadays egrep is not recommended to use. grep -E is a more portable synonym.
Some notes on your script:
You don't need to escape slashes in a grep regex. In the sed s/// command, it's better to use another delimiter character, like s###, so you can leave slashes unescaped there too.
You usually don't need to pipe grep into sed: sed -n with a regex address and an explicit print command gives the same result as grep.
You could omit the leading slash in your egrep regex, so you won't need to remove it later.
So I would do the same with
tar -tzvf file.tar.gz | sed -En '/\.(mp4|mkv)$/{s#^.*/##; s#\.\[.*##; s#[^a-zA-Z0-9()&-]# #g; s/ +/ /g; p}'
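A quick way to see what this single sed call does is to run it on two listing lines, only one of which is a video. This is a sketch: both lines are invented in the style of `tar -tzvf` output, and a `;` has been added before the closing brace, which some non-GNU seds require.

```shell
# Hypothetical listing: one video file and one file that should be skipped.
listing='-rw-r--r-- user/user 1234567 2023-01-01 12:00 movies/The.Hunger.Games.(2012).[tmdbid-70160].mp4
-rw-r--r-- user/user 120 2023-01-01 12:00 movies/notes.txt'

# -n suppresses automatic printing; the address selects lines ending in
# .mp4 or .mkv, and the block cleans each selected line before "p" prints it.
titles=$(printf '%s\n' "$listing" | sed -En '/\.(mp4|mkv)$/{s#^.*/##; s#\.\[.*##; s#[^a-zA-Z0-9()&-]# #g; s/ +/ /g; p;}')

printf '%s\n' "$titles"
```

Using `#` as the delimiter lets `s#^.*/##` strip everything up to the last slash without escaping it, and the address does the filtering that grep did in the original pipeline.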
nowadays egrep is not recommended to use. grep -E is a more portable synonym
Not directed at you personally, but this is the kind of pointless pedantry from upstream developers that grinds my gears.
Like, I've used egrep for 25 years. I don't know of a still relevant Unix variant in existence that doesn't have the egrep command. But suddenly now, when any other Unix variant but Linux is all but extinct, and all your shell scripts are probably full of bashisms and Linuxisms anyway, now there is somehow a portability problem, and they deem it necessary to print out a warning whenever I dare to run egrep instead of grep -E? C'mon now ... If anything, they have just made it less portable by spitting out spurious warnings where there weren't any before.
GNU grep, the most widespread implementation, has not included egrep, fgrep and rgrep for years. Distributions (not all, but many) provide shell scripts that simply run grep with the corresponding option for backward compatibility. You can learn this from the official documentation.
Also, my scripts are not full of bashisms, gnuisms, linuxisms and other -isms, I try to keep them portable unless it is really necessary to use some unportable command or syntax.
Just to chip in because I haven't seen it mentioned yet, but I find LLMs like ChatGPT or Microsoft Copilot are really good at making regexes and also at explaining regexes. So if you're learning them or just want to get the darned thing to work so you can go to bed, those are a good resource.
And it still kinda breaks my brain when I look at an expression. When I just look at it, it looks like utter gibberish, but when I say to myself, “okay, what’s this doing?” and go through it character by character, it turns into something I can comprehend.
I can also recommend the book the TS mentioned, it is very good and after reading it you will understand regular expressions. It's fine to use a cheat sheet if you want, cause if you don't do it regularly the knowledge can sag, but the understanding is what matters. Also depending on the context, different implementations can have slightly different syntax or modifiers to be aware of.
I lent out the book to my brother once and he somehow lost it, so I never got it back. Don't lend out books, guys.
And remember not everything can be solved using a regular expression: https://xkcd.com/1171/
Give a man a regular expression and he’ll match a string… teach him to make his own regular expressions and you’ve got a man with problems.
-- yakugo in http://regex.info/blog/2006-09-15/247#comment-3022 (and yes, it is http:// never https:// for this domain)
That's really cool!
I know some regex and I tried to learn vim regex, only to find out it's a rabbit hole so deep I'm afraid to look into it. The feeling when you press enter and your carefully crafted regex does exactly what it's supposed to do is awesome though. Good luck!
Vim is on my list of things to learn. I didn’t even know vim had its own regex, but I suppose that makes sense. I’ve messed with vim a bit, but have stuck to nano so far.
I was wondering a few years ago how far you could get with implementing some simple markup syntax with just regex. Turns out, surprisingly far, but once stuff starts going wrong you're in a less than ideal situation.