Searching using regular expressions in OpenOffice.org
Introduction
A friend of mine recently asked me a question I'd seen posted a few times on Usenet and also in the OpenOffice mailing list: “How can I search for tab characters in my text? I can search for everything else, but how do I specify a tab?” This document not only answers that question, it also teaches you how to use one of OO's most powerful features: regular expressions. These are to be found not only in OpenOffice, but also in programming languages and even at the bash shell prompt in Linux. Even if you only learn how to search for tabs and paragraph endings this document will have proved useful; if you go on to become a regular expression power user, even better! But first let's deal with the important question: “How can I search for a tab character?”
Non-printing characters
To find non-printing characters – for example, line endings and tabs – in OpenOffice, you need to turn on 'regular expression' searching; this allows you to search for 'wildcards'. These wildcards are similar to the wildcards you might use in specifying a range of filenames (for example, 'C:\My Documents\text*.*' or '/home/me/text*') but are far more powerful.
You turn on regular expression search by selecting the 'Regular expressions' option at the bottom-left of the 'Find & Replace' dialog. The OO Help tells us that “You can only search for regular expressions within paragraphs”. This doesn't mean that you can only search one paragraph at a time; it means that any text you're searching for must lie wholly within a single paragraph and cannot span a paragraph boundary.
The first non-printing characters most people want to search for are tabs and paragraph endings. To search for a single tab, open the Find dialog and select the 'Regular expressions' option (I won't mention this again – I'll just assume you've done it) and enter '\t' (without the quotes) in the 'Search for' box. You might want to open an OO Writer document and try out some of these examples while you're reading this article. To find more than one tab character, just enter '\t' more than once: '\t\t\t' searches for 3 tabs in a row. And you can, of course, search for terms such as this one: 'Date\tTime\tName', which searches for any occurrence of the word 'Date', followed by a tab, followed by the word 'Time', followed by another tab, followed by the word 'Name'.
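If you'd like to try the tab patterns outside Writer, here is a minimal sketch in Python. It assumes Python's 're' module as a stand-in for OO's search engine (the two are not the same engine, though '\t' means a tab in both), and the sample line is made up for illustration.

```python
import re

# A made-up line of text with two tab-separated column headings.
line = "Date\tTime\tName"

# r"\t" in the pattern matches one tab character, as '\t' does in the Search box.
tab_positions = [m.start() for m in re.finditer(r"\t", line)]

# Three tabs in a row would be r"\t\t\t"; this sample line has no such run.
three_tabs = re.search(r"\t\t\t", line)

# Literal words and tabs can be mixed freely in one pattern.
header = re.search(r"Date\tTime\tName", line)

print(tab_positions)
print(three_tabs is None, header is not None)
```

The positions are counted from zero, so the first tab sits right after the four letters of 'Date'.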
The next most popular non-printing character you might want to search for is the hidden one at the end of every paragraph. To do this you search for the 'end-of-line' character which is represented in regular expressions as a '$'. So entering '$' will find the next end-of-line character at the end of the paragraph currently containing the insertion point (cursor). If you want to find a paragraph that ends with the text 'writer.' then you'd enter 'writer.$' in the Search box. (Strictly speaking, the '.' here matches any single character, so this would also find 'writers' at the end of a paragraph; to match only a full stop, escape it as 'writer\.$', as described later under 'Searching for special characters'.) And, of course, you can find the next paragraph that ends with this text simply by clicking 'Find' (or pressing the Enter key) again.
You can use regular expressions containing the '$' end-of-line character as much as you want as long as you remember that the entire search term you use will only match within a single paragraph: you can't expect terms like 'writer.$It is' to work since this effectively means, “Search for the next occurrence of the word 'writer', followed by a '.' at the end of a paragraph, where the next paragraph starts with the text 'It is'.” As I said right at the start, the Help file tells us we can't do that, and you'll just get the message “Search key not found” if you try it.
Another non-printing character you might want to search for is the one that the OpenOffice Help describes as a 'row break', also sometimes known as a 'soft line break': it's the one you generate when you press Shift+Enter. If I press Shift+Enter now
it puts the rest of my text on the next line without starting a new paragraph. To search for a row break, enter '\n' (for 'newline') in the Search box. And, of course, searching for 'now\nit' would find the row break I generated a few lines above this one. Of course, you must remember to select the 'Regular expressions' option or it will find the literal 'now\nit' that appears above after the words 'searching for'. As with the tab character '\t' you can search for more than one by simply repeating it in the Search box: '\n\n'. And '\n\tHere\n' will search for what you expect it to search for: a row break followed by a tab, followed by the word 'Here', followed by another row break.
So you now know how to search for tabs and two kinds of line endings: paragraph ends and row breaks (soft line ends). How can you search for the start of a line? Simple: use the '^' character. There is a catch, though. It won't work if you enter it in the Search box on its own. You need to follow it with something, and the simplest thing to follow it with is a '.' which means 'match any single character'. So '^.' means search for the single character at the beginning of the next paragraph.
All of the regular expression characters we've seen so far ('\t', '$', '\n', '^', '.') can be used in a search term to match a single character. The '.' character is the first 'wild card' character we've met in this article; you'll see some more in the second part. But to finish off this section, can you guess how to search for an empty paragraph? Maybe you've just changed all your paragraph styles so that they have extra space at the end and you want to delete all the now-unnecessary empty paragraphs in between them. You can do this very quickly by opening the Find dialog, entering '^$' (the start of a paragraph immediately followed by the end of a paragraph) in the Search box, selecting the 'Regular expression' option, leaving the 'Replace with' box empty, and clicking the 'Replace all' option.
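The paragraph anchors can be sketched in Python too. This is only an illustration under a few assumptions: each string in the list stands in for one Writer paragraph, Python's 're' module stands in for OO's engine, and the sample text is invented. Dropping the empty strings mimics what 'Replace all' does with '^$' and an empty 'Replace with' box.

```python
import re

# Each list item stands in for one Writer paragraph (an assumption of this sketch).
paragraphs = ["I use writer.", "", "It is fast.", "", "The end"]

# '$' anchors the match to the end of the paragraph; the '.' is escaped here
# so that it matches only a literal full stop.
ends_with_writer = [p for p in paragraphs if re.search(r"writer\.$", p)]

# '^.' matches the single character at the start of a non-empty paragraph.
first_chars = [m.group() for p in paragraphs if (m := re.search(r"^.", p))]

# '^$' (start immediately followed by end) matches only empty paragraphs.
non_empty = [p for p in paragraphs if not re.search(r"^$", p)]

print(ends_with_writer)
print(first_chars)
print(non_empty)
```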
Searching for patterns
This is where regular expressions really come into their own, so it's time for a formal definition of the term: “a regular expression is a sequence of combinations of characters where each combination specifies one character from a particular class of characters to be matched against some specified text”. So what does this mean in English? Rather than explain using another load of technical jargon, I'll give you some examples. And rather than say 'regular expression' each time, I'll use the common abbreviation 'RE' instead.
As the simplest example, consider the RE 'the'. Used in a search, this means exactly what it looks like: “search for the letter 't' followed by the letter 'h' followed by the letter 'e'”. In human terms: “match the word 'the' and only the word 'the'”. From what I said in the first part of this article, you should be able to guess what the RE '^the' means. That's right: match the word 'the' only where it appears at the start of a paragraph. Note that this RE as it stands will also match the words 'The', 'tHe', 'thE', 'ThE', or 'THE'. This is where OpenOffice REs differ from those used elsewhere: you need to select the 'Match case' option to ignore 'ThE' but find 'the'. So having found 'the' at the start of a paragraph with '^the', you'll remember that you can find it at the end of a paragraph with 'the$'. And '^the$' will match a paragraph with nothing but the word 'the' in it.
Wildcards
As I said above, within an RE a '.' will match any single character. So '.he' will match both 'the' and 'she'. It will also match any letter followed by 'he' if those letters appear in the middle of a longer word such as 'lathes' or 'lashes'. Repeating the '.' in the RE requires each '.' to match one character. I say 'requires' because if you search for '.he' and there's nothing after the cursor to search in except the word 'he' then the match will fail. But if there's even so much as a single space before it, then '.he' will match.
You can match any run of zero or more of the same character with '*'. So 'ab*c' will match 'ac', 'abc', 'abbc', 'abbbc', and so on. Suppose you want to search for an 'a' followed by any number of any character followed by 'c'. Well, remember that '.' means “match any character”, so 'a.*c' means “match an 'a', followed by any character 0 or more times, followed by a 'c'”. And, of course, the 'a' and 'c' can be anywhere in the paragraph; for example, 'a.*c' will match the 'ad a c' part of “I've just had a cup of tea”. You might have expected it to match just the 'a c' in 'a cup', so why did it match 'ad a c'? Partly because a search always starts matching at the earliest place it can (here, the 'a' in 'had'), and partly because REs, by nature, are greedy: they match as much as they can. If the sentence had ended with 'a cup of cocoa', the match would have run all the way to the last 'c' in 'cocoa'. While REs in some programming languages (e.g. Perl, Python, Ruby) can be made to be non-greedy, there doesn't seem to be any way of making OpenOffice's REs non-greedy (at least, as far as I can see in OO's Help). So if your REs seem to be matching more than you intended, at least you know why.
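To see greediness in action, here is a small Python sketch. It is only an illustration: Python's 're' module stands in for OO's engine, the sentence is a made-up variant with more than one 'c' in it, and the non-greedy '*?' form is a Python feature shown purely for contrast.

```python
import re

# A made-up sentence containing several 'c' characters.
sentence = "I've just had a cup of cocoa"

# Greedy: starts at the first 'a' (in 'had') and runs to the LAST 'c'.
greedy = re.search(r"a.*c", sentence).group()

# Non-greedy (Python's '*?'): same starting 'a', but stops at the FIRST 'c'.
lazy = re.search(r"a.*?c", sentence).group()

print(repr(greedy))
print(repr(lazy))
```

Note that even the non-greedy version starts at the 'a' in 'had': the starting point is always the earliest one possible, and greediness only decides where the match ends.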
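The wildcard characters '.' and '*' can be used anywhere within the RE. You could, for example, search for 'cat.' or 'c.*t' or '.og.*', and you'd match 'cats', 'catamount' and 'dogstar' if they appeared in your text (bearing in mind that '.*' will also swallow as much of the rest of the paragraph as it can). Note that 'c*t' on its own would not do the job: it means “zero or more 'c' characters followed by a 't'”, so in 'catamount' it would match only the 't'.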
Here's another RE wildcard that's useful: Just as '*' means “match zero or more of the previous character”, '+' means “match one or more of the previous character”. So while 'as*' will match 'a' ('*' means zero or more, remember?), 'as', 'ass' and 'asss', 'as+' will fail to match 'a' but will match 'as', 'ass', 'asss', etc.
And just as you've got RE wildcards that match 'zero or more' and 'one or more' of any character, '?' will match 'zero or one' of the previous character. So 'as?' will match 'a' or 'as' and nothing else. You can think of it as meaning, “match a single 'a' followed by an optional 's'”.
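The three repetition wildcards are easy to compare side by side. Here's a minimal sketch in Python, assuming its 're' module as a stand-in for OO's engine; 're.fullmatch' is used so that each test string must be matched in its entirety, which is stricter than a search in Writer would be.

```python
import re

candidates = ["a", "as", "ass", "asss"]

# '*' : zero or more of the preceding character
star = [s for s in candidates if re.fullmatch(r"as*", s)]

# '+' : one or more of the preceding character
plus = [s for s in candidates if re.fullmatch(r"as+", s)]

# '?' : zero or one of the preceding character
optional = [s for s in candidates if re.fullmatch(r"as?", s)]

# 'ab*c' from earlier: the 'b' may appear any number of times, or not at all
abc = [s for s in ["ac", "abc", "abbbc", "adc"] if re.fullmatch(r"ab*c", s)]

print(star, plus, optional, abc, sep="\n")
```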
Character sets
Remember I told you earlier that 'the' will also match 'The', 'tHe', 'thE', 'ThE', and 'THE' unless you select the 'Match case' option? Well, suppose you're doing a lot of searching and replacing, keeping the Find/Replace dialog open (it's non-modal, remember, and stays open until you specifically close it), and you want to search for any of these terms but don't want to keep selecting and deselecting the 'Match case' option. OO allows you to do this using 'character sets'. A character set is defined by a number of characters inside square brackets. So you could search for any of the 'the' variants above with the RE '[Tt][Hh][Ee]'. Note that each character set defines one search character, no matter how many characters there are inside the square brackets. '[Tt]' means “match either a 'T' or a 't' but nothing else”. You could search for any vowel with '[aeiou]' and it would match just one vowel wherever it appeared. So you could search for any word that has more than one consecutive vowel with '[aeiou][aeiou]+' which means “match any vowel followed by at least one other vowel”.
Just as '[Tt]' means “match either a 'T' or a 't'”, '[bcd]og' means “match the letters 'og' wherever they have a 'b', a 'c', or a 'd' in front of them”. So '[bcd]og' will match 'bog', 'cog', and 'dog'. If you want to match any of a set of more than three characters, there's a shorthand way of doing it: instead of, for example, '[bcdefg]og' you can do '[b-g]og'. And you can combine sets, too: '[b-dn-w]oggle' would match (amongst other words) 'boggle', 'coggle', 'doggle', 'noggle', 'roggle', and 'woggle'.
Suppose you wanted to find all of the three-letter words that ended with 'og' but didn't start with 'b', 'c', or 'd'. The '^' character, inside a set (surrounded by square brackets), means “negate the following set”. In other words, '[^b-d]og' will match words like 'log' and 'fog', but not 'bog', 'cog', or 'dog'. So '^' has two meanings: one inside a character set and another outside ('start of paragraph', remember?).
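Character sets behave the same way in most RE engines, so they're easy to experiment with. The following is a sketch in Python, assuming its 're' module as a stand-in for OO's engine; the word lists are made up, and 're.fullmatch' is used so that each word must match as a whole.

```python
import re

words = ["bog", "cog", "dog", "fog", "log"]

# '[bcd]og' : one character from the set, then 'og'
bcd = [w for w in words if re.fullmatch(r"[bcd]og", w)]

# '[^b-d]og' : '^' inside the brackets negates the set
not_bcd = [w for w in words if re.fullmatch(r"[^b-d]og", w)]

# '[Tt][Hh][Ee]' : each set stands for exactly one character
the_variants = [w for w in ["the", "The", "tHe", "THE", "thy"]
                if re.fullmatch(r"[Tt][Hh][Ee]", w)]

# '[aeiou][aeiou]+' : a vowel followed by at least one more vowel
vowel_runs = re.findall(r"[aeiou][aeiou]+", "beautiful queue")

# '[b-dn-w]oggle' : two ranges combined in one set
oggles = [w for w in ["boggle", "noggle", "woggle", "goggle"]
          if re.fullmatch(r"[b-dn-w]oggle", w)]

print(bcd, not_bcd, the_variants, vowel_runs, oggles, sep="\n")
```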
Odds and ends
A character set gives us a way of searching for alternate characters within our text. But suppose we want to search for alternate words? If we wanted, say, to search for 'words' or 'text', then we could use the RE 'words|text'. The '|' means 'or'. When you get to the expert stage you'll be mixing and matching search terms in your REs and you'll be using such examples as '[wt]hen|(where|how)ever'. Note the use of parentheses to group elements together. With an example like this it can be hard to work out exactly what it would match. Of course, that's because I put it together; if you had, you'd have done it piece by piece and you'd know what you intended. This example means “match if you find 'when' or 'then', OR if you find either 'wherever' or 'however'”. Note that this example wouldn't match the whole of 'whenever' (only the 'when' at its start), but this example would: '(when|where|how)ever'. And this could also be written, at the cost of legibility, as '(wh(en|ere)|how)ever'.
While '^' and '$' allow you to match text at the beginning and end of a paragraph, it would also be useful if you could search for text that only occurs at the beginning or end of a word, ignoring any space and/or punctuation. Of course, you can: '\<' matches the beginning of a word and '\>' matches the end. So if you wanted to search for words that began with 'the' you could use '\<the'. Note that this will find 'the', 'there', 'them', and so on, but wouldn't match the 'the' in 'tithe' (whether it matches inside a hyphenated word such as 'out-there' depends on whether the hyphen is treated as a word boundary). So instead of using the Find dialog's 'Whole words only' option, you can search for '\<[st]he\>' and know that it will only match the words 'the' or 'she' and nothing else.
Special sets
Suppose you wanted to search for all occurrences of any upper-case word throughout your text. You could do it the hard way with this: '\<[ABCDEFGHIJKLMNOPQRSTUVWXYZ]+\>' but there's an easier way: '\<[:upper:]+\>'. The '[:upper:]' part means “the set of all upper-case letters”. There's a corresponding '[:lower:]', of course. There's also an '[:alpha:]' which matches any letter of the alphabet, but not digits, spaces, punctuation such as '^' and '$', or non-printable characters such as control characters. So 'TK[:alpha:]+34' would match 'TKa34' and 'TKfzeu34', but not 'TK4ak734' (the '4' and the '7' cause a mismatch). Likewise, there's a corresponding '[:digit:]' which matches only the digits 0-9. So '\<[:digit:]+\>' would match any number consisting of any amount of digits (but not consisting of zero digits). To go with '[:alpha:]' and '[:digit:]' there's also a '[:alnum:]' which matches any alphabetical character a-z or A-Z, or any digit 0-9. So '[:alnum:][:digit:]' would match 'a9' or '99' but not '9a'.
Three more character sets round out this section: '[:space:]' will match any so-called 'whitespace' character. In OpenOffice, the only whitespace characters you're likely to find are spaces and tabs. In programming languages that use REs, line feeds, carriage returns, and form feeds also qualify as whitespace. The '[:print:]' RE will match any printable character. So if your text has any hidden field codes or control codes embedded in it, this won't find them but it will find everything else. Lastly, '[:cntrl:]' is the one to use to find those embedded ASCII control codes. If you want to search for hidden field codes you'll have to ask someone else (the OO mailing list?) as the Help says nothing about this as far as I can see.
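Python's 're' module doesn't support the '[:upper:]' style of named set, so the sketch below spells the sets out as explicit ranges: 'A-Z' for '[:upper:]', 'A-Za-z' for '[:alpha:]', '0-9' for '[:digit:]'. These substitutions, the '\b' word boundaries, and the sample strings are all assumptions made for illustration, and they cover plain ASCII only.

```python
import re

# Explicit ASCII ranges standing in for OO's named sets (assumptions of this sketch)
UPPER = "A-Z"
ALPHA = "A-Za-z"
DIGIT = "0-9"
ALNUM = ALPHA + DIGIT

# Whole upper-case words: the stand-in for '\<[:upper:]+\>'
upper_words = re.findall(rf"\b[{UPPER}]+\b", "NASA sent a probe to IO")

# 'TK', one or more letters, then '34': the stand-in for 'TK[:alpha:]+34'
codes = [c for c in ["TKa34", "TKfzeu34", "TK4ak734"]
         if re.fullmatch(rf"TK[{ALPHA}]+34", c)]

# Whole numbers: the stand-in for '\<[:digit:]+\>'
numbers = re.findall(rf"\b[{DIGIT}]+\b", "Call 555 or 0123, not A1")

# A letter or digit followed by a digit: the stand-in for '[:alnum:][:digit:]'
pairs = [p for p in ["a9", "99", "9a"]
         if re.fullmatch(rf"[{ALNUM}][{DIGIT}]", p)]

print(upper_words, codes, numbers, pairs, sep="\n")
```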
Caveat
With anything as complex as regular expressions there are always things that will trip up the unwary, and you almost always come across them when that job you're doing just has to be submitted within the next ten minutes or so. The two problems that confront most people new to REs are a) greediness: the search term is matching more than you intended (Can you redefine it to narrow it down? Can you do it a piece at a time?); and b) that '*' doesn't mean “0 or more of any character”, it means “0 or more of the preceding character”; at some point you'll probably find yourself wondering why 'Word*' doesn't match 'WordStar' or 'Wordiness'. Oh, it matches the 'Word' part of both of those words, but not the whole thing. Maybe the RE you're looking for is 'Word.*' - “match 'Word' followed by 0 or more of any character”, or better still, 'Word.+' - “match 'Word' followed by 1 or more of any character”. But then, when you try that, you run into the greediness problem: it matches all of the remaining text in the paragraph. Perhaps you meant '\<Word.+\>'? No, that doesn't work either: it still matches all of the remaining text. The answer is to be more specific. If you want the search to match either 'WordStar' or 'Wordiness' then specify 'Word(Star|iness)'. The best rule is: remember how greedy REs are and don't use '.+' or '.*' unless you follow them with something that will specifically match the minimum amount of text. And that's not always easy.
One way to narrow down searches of this kind is to think about what it is you want to search for, then switch to what you don't want to search for. For example, suppose you want to search for some text between two braces: '{this text}'. You might be tempted to use '{.*}' but the first reason it wouldn't work is that OO requires the '{' and the '}' to be 'escaped' with a '\', like this: '\{' and '\}'. Also, if there are any more '}' characters in the paragraph then '\{.*\}' will match from the next '{' after the cursor right up to the last '}' in the paragraph, because it will use greedy matching. Thinking about it a few seconds more, it's obvious we want to search for a left brace, followed by no right brace characters, followed by a right brace. So we can use '\{[^\}]+\}' instead. To deconstruct this, we're asking it to match a left brace '\{' followed by one or more characters that aren't a right brace '[^\}]+' followed by a right brace '\}'.
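The brace example is worth trying for yourself. Here's a minimal sketch in Python, assuming its 're' module as a stand-in for OO's engine and using a made-up paragraph that contains two braced phrases.

```python
import re

# A made-up paragraph with two braced phrases.
paragraph = "Use {this text} and {that text} here"

# Greedy: runs from the first '{' right up to the LAST '}' in the paragraph.
greedy = re.search(r"\{.*\}", paragraph).group()

# Negated set: '{', then one or more characters that are not '}', then '}'.
narrow = re.findall(r"\{[^\}]+\}", paragraph)

print(greedy)
print(narrow)
```

Saying what must not appear between the braces keeps each match from running on into the next pair.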
Searching for special characters
This is all very well, but suppose you want to search for one of the special RE characters such as '^', '$', '[', or ']'. How could you do that without that character working its usual RE magic? The answer is to 'escape' it, as we saw in the previous section, with a '\'. So if you want to find '[sic]' you would need to enter '\[sic\]'.
Replacing what you've found
In most places you find REs, you have a way to replace part of what you've found using special notation. For example, in the Perl programming language, you can search for 'Name: ([^\.]+)\.' Breaking this down, we want to match the letters 'Name:' followed by a space, then anything that's not a '.' (with the RE '[^\.]+'), followed by a literal '.' (escaped as '\.'). This would match 'Name: Linus.' Suppose we wanted to change 'Linus' to 'Linus Torvalds'. In Perl, we surround the part of the RE that represents the text we're searching for with parentheses and it gets assigned to a variable called '$1'. We can then replace 'Linus' with 'Linus Torvalds' using the $1 in a Perl expression:
s/Name: ([^\.]+)\./Name: $1 Torvalds./
This is getting a little too technical now but suffice it to say that, at the moment, OpenOffice doesn't allow us this level of sophistication. However, we can replace the whole of the text we found, which is probably what we want on most occasions. And you do it with the '&' character. For example, say you want to find all words that begin with 'L' and end with 's' and add 'Torvalds' after them. So 'Linus' becomes 'Linus Torvalds' and 'Les' becomes 'Les Torvalds'. Here's how you'd do it: in the 'Search for' box you'd enter something like '\<L[^s]+s\>' and in the Replace box you'd enter '& Torvalds'.
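Both kinds of replacement can be sketched in Python. This is an illustration only: Python's 're' module stands in for Perl and OO here, '\1' plays the part of Perl's '$1', '\g<0>' (the whole match) plays the part of OO's '&', '\b' replaces '\<' and '\>', and the sample strings are made up.

```python
import re

# Replace part of the match: the parenthesised group is available as \1.
with_group = re.sub(r"Name: ([^.]+)\.", r"Name: \1 Torvalds.", "Name: Linus.")

# Replace the whole match: \g<0> is Python's equivalent of '&' in the Replace box.
whole_match = re.sub(r"\bL[^s]+s\b", r"\g<0> Torvalds", "Linus and Les")

print(with_group)
print(whole_match)
```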
Finally...
I tried to pitch this article at the right level: not too technical (at least, no more technical than computers get) and not too patronising. I welcome feedback on this article (especially if you've found any error) sent to garryknight@gmx.net but please don't ask me too many specific questions: that's what the OO mailing list is for. I hope this article has proved useful to you, and I hope all of you find what you're searching for... In OpenOffice, at least...
Addendum
Since I wrote this article, some years ago now, a few people have asked me how they can search for text across paragraph boundaries. At the time, OpenOffice.org didn't have this capability, and I believe that it still doesn't. However, a repository for OpenOffice.org extensions has since been set up and it contains an extension named "Alternative dialog Find & Replace for Writer", written by Tomas Bilek, which promises that "Searched or replaced text can contain one or more paragraphs". So if that's what you're looking for, or if you want other useful extensions such as a Bookmarks Menu, look in the OpenOffice.org extensions repository.
Site design © Garry Knight 1998-2007