
How to remove duplicate lines with regular expressions

StevenLevithan
Posted Oct 30 2009 08:35 AM

There is a variety of software (including the Unix command-line utility uniq and Windows PowerShell cmdlet Get-Unique) that can help you remove duplicate lines in a file or string. The following sections contain three regex-based approaches that can be especially helpful when trying to accomplish this task in a nonscriptable text editor with regular expression search-and-replace support.

When programming, options two and three should be avoided since they are inefficient compared to other available approaches, such as using a hash object to keep track of unique lines. However, the first option (which requires that you sort the lines in advance, unless you only want to remove adjacent duplicates) may be an acceptable approach since it’s quick and easy.
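For comparison, the hash-based approach mentioned above might look like the following Python sketch. The function name unique_lines is illustrative and not part of the original recipe. The sketch assumes it is acceptable to normalize every line break to \n in the output.

```python
def unique_lines(text):
    """Remove duplicate lines, keeping the first occurrence of each."""
    seen = set()   # hash-based record of lines already emitted
    kept = []
    # Note: str.splitlines() splits on \n, \r\n, \r and a few other
    # Unicode line boundaries, and discards the separators.
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    # Assumption: output lines are rejoined with \n.
    return "\n".join(kept)


print(unique_lines("a\nb\na\nc\nb"))
```

Each line is looked up once in the set, so the work grows roughly linearly with the number of lines, which is why this tends to beat options two and three in program code.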

Option 1: Sort lines and remove adjacent duplicates

If you’re able to sort the lines in the file or string you’re working with so that any duplicate lines appear next to each other, you should do so, unless the order of the lines must be preserved. Sorting lets you use a simpler and more efficient search-and-replace operation to remove the duplicates than would otherwise be possible.

After sorting the lines, use the following regex and replacement string to get rid of the duplicates:

^(.*)(?:(?:\r?\n|\r)\1)+$

Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Replace with:

$1

Replacement text flavors: .NET, Java, JavaScript, Perl, PHP

\1

Replacement text flavors: Python, Ruby

This regular expression uses a capturing group and a backreference (among other ingredients) to match two or more sequential, duplicate lines. A backreference is used in the replacement string to put back the first line.

This regex removes all but the first of duplicate lines that appear next to each other. It does not remove duplicates that are separated by other lines. Let’s step through the process.

First, the caret (^) at the front of the regular expression matches the start of a line. Normally it would only match at the beginning of the subject string, so you need to make sure that the option to let ^ and $ match at line breaks is enabled. Next, the .* within the capturing parentheses matches the entire contents of a line (even if it’s blank), and the value is stored as backreference 1. For this to work correctly, the “dot matches line breaks” option must not be set; otherwise, the dot-asterisk combination would match until the end of the string.

Within an outer, noncapturing group, we’ve used (?:\r?\n|\r) to match a line separator used in Windows (\r\n), Unix/Linux/OS X (\n), or legacy Mac OS (\r) text files. The backreference \1 then tries to match the line we just finished matching. If the same line isn’t found at that position, the match attempt fails and the regex engine moves on. If it matches, we repeat the group (composed of a line break sequence and backreference 1) using the + quantifier to try to match additional duplicate lines.

Finally, we use the dollar sign at the end of the regex to assert position at the end of the line. This ensures that we only match identical lines, and not lines that merely start with the same characters as a previous line.
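The effect of the trailing dollar sign can be checked with a small Python sketch that compares the option 1 regex against a variant without it. The variable names are illustrative, and the sample text assumes \n line breaks.

```python
import re

# Option 1 regex as given, and a variant with the trailing $ removed.
with_anchor = re.compile(r"^(.*)(?:(?:\r?\n|\r)\1)+$", re.MULTILINE)
without_anchor = re.compile(r"^(.*)(?:(?:\r?\n|\r)\1)+", re.MULTILINE)

# The second line merely starts with the first line's text.
text = "value\nvalue2"

anchored_result = with_anchor.sub(r"\1", text)
unanchored_result = without_anchor.sub(r"\1", text)

print(repr(anchored_result))    # 'value\nvalue2' (left alone)
print(repr(unanchored_result))  # 'value2' (the line "value" is lost)
```

Without the anchor, the backreference matches the first five characters of the second line, so the replacement merges two different lines.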

Because we’re doing a search-and-replace, each entire match (including the original line and line separators) is removed from the string. We replace this with backreference 1 to put the original line back in.
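As a rough illustration, option 1 could be applied in Python as follows, using the \1 replacement syntax listed for Python above. The function names are hypothetical. The sketch assumes \n line breaks, because in Python's multiline mode ^ and $ only recognize \n as a line boundary.

```python
import re

# Option 1 regex; re.MULTILINE lets ^ and $ match at line breaks.
# re.DOTALL is deliberately NOT set.
ADJACENT_DUPLICATES = re.compile(r"^(.*)(?:(?:\r?\n|\r)\1)+$", re.MULTILINE)


def remove_adjacent_duplicates(text):
    """Collapse each run of identical adjacent lines to a single line."""
    return ADJACENT_DUPLICATES.sub(r"\1", text)


def sort_and_dedupe(text):
    """Sort the lines first so that all duplicates become adjacent."""
    sorted_text = "\n".join(sorted(text.splitlines()))
    return remove_adjacent_duplicates(sorted_text)


print(remove_adjacent_duplicates("a\na\nb\nb\nb\na"))  # a, b, a
print(sort_and_dedupe("b\na\nb\na"))                   # a, b
```

The first call leaves the final "a" in place because it is separated from the earlier "a" lines by other lines. The second call removes it because sorting brings the duplicates together.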

Option 2: Keep the last occurrence of each duplicate line in an unsorted file

If you are using a text editor that does not have the built-in ability to sort lines, or if it is important to preserve the original line order, the following solution lets you remove duplicates even when they are separated by other lines:

^([^\r\n]*)(?:\r?\n|\r)(?=.*^\1$)

Regex options: Dot matches line breaks, ^ and $ match at line breaks
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

Here’s the same thing as a JavaScript-compatible regex, without the requirement for the “dot matches line breaks” option:

^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)

Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Replace with:

(The empty string, i.e., nothing.)

Replacement text flavors: N/A

There are several changes here compared to the option 1 regex earlier in this recipe, which only finds duplicate lines when they appear next to each other. First, in the non-JavaScript version of the option 2 regex, the dot within the capturing group has been replaced with [^\r\n] (any character except a line break), and the “dot matches line breaks” option has been enabled. That’s because a dot is used later in the regex to match any character, including line breaks. Second, a lookahead has been added to scan for duplicate lines at any position further along in the string. Since the lookahead does not consume any characters, the text matched by the regex is always a single line (along with its following line break) that is known to appear again later in the string. Replacing all matches with the empty string removes the duplicate lines, leaving behind only the last occurrence of each.
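A minimal Python sketch of option 2, using the first (dot matches line breaks) form of the regex, might look like this. The function name is hypothetical. The sketch assumes \n line breaks, because in Python's multiline mode $ only matches before \n, so the ^\1$ test inside the lookahead would not recognize lines that end in \r\n.

```python
import re

# Option 2 regex: a line and its line break are matched only if the
# lookahead finds the same complete line somewhere later in the string.
KEEP_LAST = re.compile(
    r"^([^\r\n]*)(?:\r?\n|\r)(?=.*^\1$)",
    re.DOTALL | re.MULTILINE,
)


def keep_last_occurrences(text):
    """Delete every line that appears again later in the text."""
    return KEEP_LAST.sub("", text)


print(keep_last_occurrences("a\nb\na\nc\nb"))  # a, c, b
```

The first "a" and the first "b" are removed because each appears again further down. The relative order of the surviving lines is unchanged.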

Option 3: Keep the first occurrence of each duplicate line in an unsorted file

If you want to preserve the first occurrence of each duplicate line, you’ll need to use a somewhat different approach. First, here is the regular expression and replacement string we will use:

^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+

Regex options: Dot matches line breaks, ^ and $ match at line breaks
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

Once again, we need to make a couple of changes to make this compatible with JavaScript-flavor regexes, since JavaScript doesn’t have a “dot matches line breaks” option.

^(.*)$([\s\S]*?)(?:(?:\r?\n|\r)\1$)+

Regex options: ^ and $ match at line breaks (“dot matches line breaks” must not be set)
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Replace with:

$1$2

Replacement text flavors: .NET, Java, JavaScript, Perl, PHP

\1\2

Replacement text flavors: Python, Ruby

Unlike the option 1 and 2 regexes, this version cannot remove all duplicate lines with one search-and-replace operation. You’ll need to continually apply “replace all” until the regex no longer matches your string, meaning that there are no more duplicates to remove.

Because lookbehind is not as widely supported as lookahead (and where it is supported, you still may not be able to look as far backward as you need to), the option 3 regex is significantly different from the option 2 regex. Instead of matching lines that are known to be repeated earlier in the string (which would be comparable to option 2’s tactic), this regex matches a line, the first duplicate of that line that occurs later in the string, and all the lines in between. The original line is stored as backreference 1, and the lines in between (if any) as backreference 2. By replacing each match with both backreference 1 and 2, you put back the parts you want to keep, leaving out the trailing, duplicate line and its preceding line break.

This alternative approach presents a couple of issues. First, because each match of a set of duplicate lines may include other lines in between, it’s possible that there are duplicates of a different value within your matched text, and those will be skipped over during a “replace all” operation. Second, if a line is repeated more than twice, the regex will first match duplicates one and two, but after that, it will take another set of duplicates to get the regex to match again as it advances through the string. Thus, a single “replace all” action will at best remove only every other duplicate of any specific line. To solve both of these problems and make sure that all duplicates are removed, you’ll need to continually apply the search-and-replace operation to your entire subject string until the regex no longer matches within it. Consider how this regex will work when applied to the following subject string:

value1
value2
value2
value3
value3
value1
value2

Removing all duplicate lines from this string will take three passes. The table below, “Replacement passes,” shows the subject string at the start of each pass and the final result, along with the number of matches each pass makes.

Replacement passes

Pass one                Pass two                   Pass three              Final string
value1                  value1                     value1                  value1
value2                  value2                     value2                  value2
value2                  value2                     value3                  value3
value3                  value3                     value2
value3                  value3
value1                  value2
value2

One match/replacement   Two matches/replacements   One match/replacement   No duplicates remain
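The repeated “replace all” passes can be sketched in Python as a loop that stops once a pass makes no replacements. The function name and the pass counter are illustrative additions, and the sketch assumes \n line breaks.

```python
import re

# Option 3 regex: a line (group 1), any text in between (group 2),
# then one or more adjacent repeats of that line.
KEEP_FIRST = re.compile(
    r"^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1$)+",
    re.DOTALL | re.MULTILINE,
)


def keep_first_occurrences(text):
    """Apply 'replace all' repeatedly until the regex stops matching.

    Returns the cleaned text and the number of passes that changed it.
    """
    passes = 0
    while True:
        # subn() returns the new string and the replacement count.
        text, replacements = KEEP_FIRST.subn(r"\1\2", text)
        if replacements == 0:
            return text, passes
        passes += 1


subject = "value1\nvalue2\nvalue2\nvalue3\nvalue3\nvalue1\nvalue2"
result, passes = keep_first_occurrences(subject)
print(passes)  # 3
print(result)  # value1, value2, value3
```

For the sample subject string, the loop makes three changing passes followed by one pass that finds nothing, in line with the table above.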

Regular Expressions Cookbook

Learn more about this topic from Regular Expressions Cookbook.

