There is a variety of software (including the Unix command-line
utility uniq and Windows PowerShell
cmdlet Get-Unique) that can help you remove duplicate lines in a file or
string. The following sections contain three regex-based approaches
that can be especially helpful when trying to accomplish this task in
a nonscriptable text editor with regular expression search-and-replace
support.
When programming, options two and three should be avoided since they are inefficient compared to other available approaches, such as using a hash object to keep track of unique lines. However, the first option (which requires that you sort the lines in advance, unless you only want to remove adjacent duplicates) may be an acceptable approach since it’s quick and easy.
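As a rough illustration of the hash-based alternative mentioned above, the following Python sketch uses a set to remember which lines have already been seen. The function name unique_lines is made up for this example and does not come from the recipe.

```python
def unique_lines(lines):
    """Return the lines with later duplicates dropped, keeping first occurrences."""
    seen = set()      # hash-based record of lines already emitted
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


text = "value1\nvalue2\nvalue2\nvalue3\nvalue3\nvalue1\nvalue2"
print("\n".join(unique_lines(text.splitlines())))
# prints value1, value2, value3 on separate lines
```

Each line is looked up once in the set, so the original order is preserved and the work grows roughly linearly with the number of lines.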
Option 1: Sort lines and remove adjacent duplicates
If you’re able to sort the lines in the file or string you’re working with so that any duplicate lines appear next to each other, you should do so, unless the order of the lines must be preserved. Sorting first lets you use a simpler and more efficient search-and-replace operation to remove the duplicates than would otherwise be possible.
After sorting the lines, use the following regex and replacement string to get rid of the duplicates:
^(.*)(?:(?:\r?\n|\r)\1)+$
Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Replace with:
$1
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\1
Replacement text flavors: Python, Ruby
This regular expression uses a capturing group and a backreference (among other ingredients) to match two or more sequential, duplicate lines. A backreference is used in the replacement string to put back the first line.
This regex removes all but the first of duplicate lines that appear next to each other. It does not remove duplicates that are separated by other lines. Let’s step through the process.
First, the caret (^) at the front of the regular expression
matches the start of a line. Normally it would only match at the
beginning of the subject string, so you need to make sure that the
option to let ^ and $ match at line breaks is enabled. Next, the .* within the capturing parentheses matches
the entire contents of a line (even if it’s blank), and the value is
stored as backreference 1. For this to work correctly, the “dot
matches line breaks” option must not be set; otherwise, the
dot-asterisk combination would match until the end of the
string.
Within an outer, noncapturing group, we’ve used (?:\r?\n|\r) to match a line
separator used in Windows (\r\n), Unix/Linux/OS X (\n), or legacy Mac OS
(\r) text files. The
backreference \1
then tries to match the line we just finished matching. If the same
line isn’t found at that position, the match attempt fails and the
regex engine moves on. If it matches, we repeat the group (composed
of a line break sequence and backreference 1) using the + quantifier to try to match
additional duplicate lines.
Finally, we use the dollar sign at the end of the regex to assert position at the end of the line. This ensures that we only match identical lines, and not lines that merely start with the same characters as a previous line.
Because we’re doing a search-and-replace, each entire match (including the original line and line separators) is removed from the string. We replace this with backreference 1 to put the original line back in.
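To make the walkthrough concrete, here is a minimal Python sketch of option 1. It assumes Python's re module, where re.MULTILINE provides the "^ and $ match at line breaks" option and "dot matches line breaks" stays off by default. The names ADJACENT_DUPES, remove_adjacent_duplicates, and sort_and_dedupe are invented for this example.

```python
import re

# The option 1 regex; re.MULTILINE lets ^ and $ match at line breaks.
ADJACENT_DUPES = re.compile(r"^(.*)(?:(?:\r?\n|\r)\1)+$", re.MULTILINE)


def remove_adjacent_duplicates(text):
    """Collapse each run of identical adjacent lines into a single line."""
    return ADJACENT_DUPES.sub(r"\1", text)


def sort_and_dedupe(text):
    """Sort the lines first so that all duplicates become adjacent."""
    sorted_text = "\n".join(sorted(text.splitlines()))
    return remove_adjacent_duplicates(sorted_text)


print(remove_adjacent_duplicates("apple\napple\nbanana\ncherry\ncherry\ncherry"))
# prints apple, banana, cherry on separate lines

# The trailing $ prevents "ab" from being treated as a duplicate of "abc".
print(remove_adjacent_duplicates("ab\nabc") == "ab\nabc")  # prints True
```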
Option 2: Keep the last occurrence of each duplicate line in an unsorted file
If you are using a text editor that does not have the built-in ability to sort lines, or if it is important to preserve the original line order, the following solution lets you remove duplicates even when they are separated by other lines:
^([^\r\n]*)(?:\r?\n|\r)(?=.*^\1$)
Regex options: Dot matches line breaks, ^ and $ match at line breaks
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
Here’s the same thing as a JavaScript-compatible regex, without the requirement for the “dot matches line breaks” option:
^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)
Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Replace with:
(The empty string, i.e., nothing.)
Replacement text flavors: N/A
There are several changes here compared to the option 1 regex
earlier in this recipe, which only finds duplicate lines when they
appear next to each other. First, in the non-JavaScript version of the option 2
regex, the dot within the capturing group has been replaced with
[^\r\n] (any
character except a line break), and the “dot matches line breaks”
option has been enabled. That’s because a dot is used later in the
regex to match any character, including line breaks. Second, a
lookahead has been added to scan for duplicate lines at any position
further along in the string. Since the lookahead does not consume
any characters, the text matched by the regex is always a single
line (along with its following line break) that is known to appear
again later in the string. Replacing all matches with the empty
string removes the duplicate lines, leaving behind only the last
occurrence of each.
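The following Python sketch applies both forms of the option 2 regex to a small sample string and checks that they agree. It assumes Python's re module, with re.DOTALL standing in for "dot matches line breaks" and re.MULTILINE for "^ and $ match at line breaks". The names KEEP_LAST, KEEP_LAST_JS_STYLE, and keep_last_occurrences are hypothetical.

```python
import re

# First form: dot matches line breaks, so the capture uses [^\r\n]*.
KEEP_LAST = re.compile(
    r"^([^\r\n]*)(?:\r?\n|\r)(?=.*^\1$)", re.DOTALL | re.MULTILINE
)

# JavaScript-compatible form: [\s\S] replaces the line-break-matching dot.
KEEP_LAST_JS_STYLE = re.compile(
    r"^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)", re.MULTILINE
)


def keep_last_occurrences(text, pattern=KEEP_LAST):
    """Delete every line (and its line break) that appears again later."""
    return pattern.sub("", text)


text = "value1\nvalue2\nvalue2\nvalue3\nvalue3\nvalue1\nvalue2"
print(keep_last_occurrences(text))
# prints value3, value1, value2 on separate lines
print(keep_last_occurrences(text) == keep_last_occurrences(text, KEEP_LAST_JS_STYLE))
# prints True
```

Note that the surviving lines appear in the order of their last occurrences, which is why value3 now comes first.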
Option 3: Keep the first occurrence of each duplicate line in an unsorted file
If you want to preserve the first occurrence of each duplicate line, you’ll need to use a somewhat different approach. First, here is the regular expression and replacement string we will use:
^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+
Regex options: Dot matches line breaks, ^ and $ match at line breaks
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
Once again, we need to make a couple of changes to make this compatible with JavaScript-flavor regexes, since JavaScript doesn’t have a “dot matches line breaks” option.
^(.*)$([\s\S]*?)(?:(?:\r?\n|\r)\1$)+
Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Replace with:
$1$2
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
\1\2
Replacement text flavors: Python, Ruby
Unlike the option 1 and 2 regexes, this version cannot remove all duplicate lines with one search-and-replace operation. You’ll need to continually apply “replace all” until the regex no longer matches your string, meaning that there are no more duplicates to remove.
Because lookbehind is not as widely supported as lookahead (and where it is supported, you still may not be able to look as far backward as you need to), the option 3 regex is significantly different from the option 2 regex. Instead of matching lines that are known to be repeated earlier in the string (which would be comparable to option 2’s tactic), this regex matches a line, the first duplicate of that line that occurs later in the string, and all the lines in between. The original line is stored as backreference 1, and the lines in between (if any) as backreference 2. By replacing each match with both backreference 1 and 2, you put back the parts you want to keep, leaving out the trailing, duplicate line and its preceding line break.
This alternative approach presents a couple of issues. First, because each match of a set of duplicate lines may include other lines in between, it’s possible that there are duplicates of a different value within your matched text, and those will be skipped over during a “replace all” operation. Second, if a line is repeated more than twice, the regex will first match duplicates one and two, but after that, it will take another set of duplicates to get the regex to match again as it advances through the string. Thus, a single “replace all” action will at best remove only every other duplicate of any specific line. To solve both of these problems and make sure that all duplicates are removed, you’ll need to continually apply the search-and-replace operation to your entire subject string until the regex no longer matches within it. Consider how this regex will work when applied to the following subject string:
value1
value2
value2
value3
value3
value1
value2
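Removing all duplicate lines from this string will take three passes. The following table, “Replacement passes,” shows the subject string at the start of each pass, along with the number of replacements that pass makes.
Replacement passes
| Pass one | Pass two | Pass three | Final string |
|---|---|---|---|
| value1 | value1 | value1 | value1 |
| value2 | value2 | value2 | value2 |
| value2 | value2 | value3 | value3 |
| value3 | value3 | value2 | |
| value3 | value3 | | |
| value1 | value2 | | |
| value2 | | | |
| One match/replacement | Two matches/replacements | One match/replacement | No duplicates remain |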
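The repeated "replace all" procedure for option 3 can be sketched in Python as a loop that stops once a pass makes no replacements. This assumes Python's re module (re.DOTALL and re.MULTILINE for the two regex options, and re.subn to count replacements). The names KEEP_FIRST and keep_first_occurrences are invented for illustration.

```python
import re

# The option 3 regex: a line, any lines in between, then its next duplicate(s).
KEEP_FIRST = re.compile(
    r"^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+", re.DOTALL | re.MULTILINE
)


def keep_first_occurrences(text):
    """Apply "replace all" repeatedly until the regex no longer matches.

    Returns the cleaned text and the number of passes that made a replacement.
    """
    passes = 0
    while True:
        # subn returns the new string and how many replacements were made
        text, count = KEEP_FIRST.subn(r"\1\2", text)
        if count == 0:
            return text, passes
        passes += 1


subject = "value1\nvalue2\nvalue2\nvalue3\nvalue3\nvalue1\nvalue2"
result, passes = keep_first_occurrences(subject)
print(passes)   # prints 3
print(result)   # prints value1, value2, value3 on separate lines
```

The loop makes one final pass that finds nothing to replace, which is how it detects that no duplicates remain.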
Learn more about this topic from Regular Expressions Cookbook.


