
How to use Open Source OCR to create a simple and accurate Captcha Bypasser on Unix/Linux.

Posted by arthuralves, Feb 20 2013 01:34 AM

An OCR (Optical Character Recognition) program converts image files into human-readable characters. Our goal is to use an OCR as the back end for our simple Captcha Bypasser. GOCR, GNU Optical Character Recognition, is an Open Source and Free solution for us. Note that GOCR can give unexpected results when working with non-Latin alphabets.

sudo apt-get install gocr   # Debian/Ubuntu
yum install gocr            # Fedora/RHEL/CentOS (run as root)




IMAGE FILES

To generate a satisfactory output, we need to use image processing to clean up the images (e.g. intensify colors, remove undesired lines, dots, etc.) and facilitate character recognition. We will use simple captchas, so the only image processing required is changing the colorspace: if the image is colored, the script will give GOCR a grayscale copy (which is preferable for a better result).


Our images will have normal characters from the Latin alphabet that are uppercase, normal and/or bold (not italic), in a traditional font (e.g. Arial, Times New Roman, Verdana etc.).


Posted Image
1.0 - Captcha already in grayscale
Posted Image
1.1 - Captcha that needs to be converted into grayscale


We will use ImageMagick to process the images. You can use it to handle complex images, trying to generate the best possible input, but our Captcha Bypasser will just create a grayscale copy.

sudo apt-get install imagemagick   # Debian/Ubuntu
yum install imagemagick            # Fedora/RHEL/CentOS (run as root)


To create a grayscale copy of the image we will create a function:


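# ---- Applying grayscale
	grayscale ()
	{
		# Keep the original path, then point $img at the grayscale copy
		source="$img"
		id=$(date +%N)
		img="$temp_dir"/img_$id.jpg
		# Convert to grayscale and reduce noise
		convert "$source" -type Grayscale -despeckle -enhance "$img"
		convert "$img" +level-colors black, "$img"
	}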





THE PROCESS

We will create a temporary folder to hold the grayscale copy of the image while the script runs:



# ---- Creating temporary directory
	making_env ()
	{
		dd=`date +%N`
		temp_dir=decaptcha_temp_$dd
		mkdir "$temp_dir"
	}
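
As a variation on making_env, here is a minimal bash sketch that assumes the mktemp utility is available (it ships with GNU coreutils). Unlike the function above, it creates the folder under $TMPDIR (or /tmp) instead of the current directory, and the cleanup helper and the trap are additions that are not part of the original script:

```shell
#!/usr/bin/env bash
# ---- Creating a temporary directory with mktemp (sketch)
making_env ()
{
	# mktemp -d creates a uniquely named directory and prints its path
	temp_dir=$(mktemp -d "${TMPDIR:-/tmp}/decaptcha_temp_XXXXXX") || return 1
}

# Hypothetical helper: remove the directory when the script ends
cleanup ()
{
	if [ -n "$temp_dir" ]
	then
		rm -rf -- "$temp_dir"
	fi
}

making_env
trap cleanup EXIT
echo "Working in $temp_dir"
```

The trap makes sure the grayscale copies do not pile up between runs, even if the script stops early.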


You can use GOCR as below:
gocr [OPTION] [-i] pnm-file


I advise you to read the manual page to understand and improve your script. However, the options we will use here are:

gocr -l 70 -C "[A-Z]" -i "$img"

-l level

set grey level to level (0<160<=255, default: 0 for autodetect), darker
pixels belong to characters, brighter pixels are interpreted as background
of the input image.

-C string

only recognise characters from string, this is a filter function in cases
where the interest is only to a part of the character alphabet, you can
use 0-9 or a-z to specify ranges, use -- to detect the minus sign.

-i file

read input from file (or stdin if file is a single dash).


If the image has text with different grayscale levels, it can be hard to discern every character. We will therefore use 3 grayscale levels (you can use more, even all of them): standard, 70 and 85.

gocr -C "[A-Z]" -i "$img" #standard level: 0
gocr -l 70 -C "[A-Z]" -i "$img"
gocr -l 85 -C "[A-Z]" -i "$img"
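
If you want to try more levels, the calls above could be wrapped in a loop. This is only a sketch in bash: recognize_levels is a hypothetical helper name, and it assumes gocr is installed and accepts the same -l, -C and -i options shown above:

```shell
#!/usr/bin/env bash
# Run GOCR once per grey level and print one "level: result" line each.
# Usage: recognize_levels IMAGE LEVEL...
recognize_levels ()
{
	local image=$1 level
	shift
	for level in "$@"
	do
		# -l sets the grey level, -C restricts the character set,
		# -i names the input file
		printf '%s: %s\n' "$level" "$(gocr -l "$level" -C "[A-Z]" -i "$image")"
	done
}

# Example call (needs gocr and a real image, so it is left commented out):
# recognize_levels captchas/1captcha.jpg 0 70 85
```

Printing the level next to each result makes it easier to see which levels work best for a given kind of captcha.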

I settled on 70 and 85 after testing many levels and checking the results, but we should keep the option to pass these levels as arguments if needed (you can see this in the complete code).

GOCR displays an underscore "_" for unrecognized characters by default. We will store the results in variables and compare them: if the first character in the first variable is a "_", it will be replaced by the first character in the second variable, and so on.

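# ---- "Decaptching"

	dcap ()
	{
	# recog1: automatic grey level; recog2: the level stored in $number
	recog1=$(gocr -C "[A-Z]" -i "$img")
	recog2=$(gocr -l "$number" -C "[A-Z]" -i "$img")

	# Reset the arrays so characters from a previous call do not linger
	array1=(); array2=()

	# Split each result into one character per array element
	for (( i=0; i<${#recog2}; i++ ))
	do
		array2[$i]=${recog2:$i:1}
	done

	#-----

	for (( i=0; i<${#recog1}; i++ ))
	do
		array1[$i]=${recog1:$i:1}
	done

	# Take each character from recog2, falling back to recog1 on "_"
	for (( i=0; i<${#recog1}; i++ ))
	do
		if [ "${array2[$i]}" = "_" ]
		then
			cdecp="$cdecp${array1[$i]}"
		else
			cdecp="$cdecp${array2[$i]}"
		fi
	done
	}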
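
The replacement rule described in the text can also be tried on its own, without GOCR. The sketch below is a standalone bash function (merge_results is a hypothetical name, not part of the script) that keeps each character of the first string unless it is "_", in which case it takes the character at the same position in the second string:

```shell
#!/usr/bin/env bash
# Merge two OCR results position by position.
# A character from the first string is kept unless it is "_";
# then the character at the same position in the second is used.
merge_results ()
{
	local first=$1 second=$2 merged="" i c
	for (( i=0; i<${#first}; i++ ))
	do
		c=${first:$i:1}
		# Only fall back when the second string is long enough
		if [ "$c" = "_" ] && [ "$i" -lt "${#second}" ]
		then
			c=${second:$i:1}
		fi
		merged="$merged$c"
	done
	printf '%s\n' "$merged"
}

merge_results "AB_D" "A_CD"   # prints ABCD
```

If both strings have "_" at the same position, the underscore stays in the output, so an unresolved character is still visible.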



We will call our functions grayscale and/or dcap based on command-line arguments:

decaptcha INPUT [OPTIONS]

These options are:

-c colored
The -c option is required if the image is colored.


-l level
Changes the standard grayscale levels.
Must be followed by at least 1 and at most 2 levels.



img="$1"
	check=`echo $* | wc -w`

	for ((i=1; i<=$check; i++))
	do
        		case $* in
			*-c*)
				shift;shift;
				grayscale ${img};
				shift;
			;;
			*-l*)
				shift;
				case $1 in
					*[0-9]*)
						number="$1"
					;;
					*)
						number=70;
					;;
				esac
				shift;

				dcap ${img} ${number}
				f1=$cdecp; cdecp=""

				case $1 in
					*[0-9]*)
						number="$1"
					;;
					*)
						number=85;
					;;
				esac

				dcap ${img} ${number}
				f2=$cdecp
			;;
			*)
				number=70
				dcap ${img} ${number}
				f1=$cdecp;cdecp=""
				number=85
				dcap ${img} ${number}
				f2=$cdecp
			;;
		esac
	done
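
The loop above depends on the exact order of the shift calls, so here is a more explicit alternative as a sketch. It only parses the options described earlier (INPUT, -c, and -l followed by one or two levels) and stores them in variables. parse_args, colored, level1 and level2 are hypothetical names that do not appear in the original script, and calling grayscale and dcap afterwards is left out:

```shell
#!/usr/bin/env bash
# Parse: decaptcha INPUT [-c] [-l LEVEL [LEVEL]]
# Sets the globals img, colored, level1 and level2.
parse_args ()
{
	[ $# -ge 1 ] || { echo "usage: decaptcha INPUT [OPTIONS]" >&2; return 1; }
	img=$1; colored=no; level1=70; level2=85
	shift
	while [ $# -gt 0 ]
	do
		case $1 in
			-c)
				colored=yes
				shift
			;;
			-l)
				shift
				# The first level is mandatory after -l
				case ${1:-} in
					''|*[!0-9]*) echo "-l needs a numeric level" >&2; return 1 ;;
				esac
				level1=$1
				shift
				# The second level is optional
				case ${1:-} in
					''|*[!0-9]*) ;;
					*) level2=$1; shift ;;
				esac
			;;
			*)
				echo "unknown option: $1" >&2
				return 1
			;;
		esac
	done
}

parse_args captchas/2captcha.jpg -c -l 60 90
echo "$img colored=$colored levels=$level1,$level2"
# prints: captchas/2captcha.jpg colored=yes levels=60,90
```

Handling one argument per pass of the while loop means the options can come in any order after the input file.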



After that we will compare the results again to find the correct one:


for ((i=0; i<${#f1}; i++))
	do
		if [ "${f1:$i:1}" = "${f2:$i:1}" ]
		then
			string="$string""${f1:$i:1}"
		elif [ "${f1:$i:1}" != "${f2:$i:1}" ]
		then
			case ${f1:$i:1} in
				_)
					string="$string""${f2:$i:1}"
				;;
				*)
					string="$string""${f1:$i:1}"
				;;
			esac
		fi
	done

		echo "CAPTCHA: $string"

	exit 0
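
Since the merged string can still contain "_" when no level recognized a character, it may be worth checking for that before trusting the result. A minimal bash sketch follows. is_complete is a hypothetical helper that is not in the original script, and the two sample strings are made up for illustration:

```shell
#!/usr/bin/env bash
# Return success only if the string is non-empty and has no "_" left.
is_complete ()
{
	case $1 in
		''|*_*) return 1 ;;
		*) return 0 ;;
	esac
}

# Made-up sample results, one complete and one with a gap
for result in "XKCD" "XK_D"
do
	if is_complete "$result"
	then
		echo "CAPTCHA: $result"
	else
		echo "CAPTCHA (incomplete): $result"
	fi
done
```

An incomplete result could then be retried with other grey levels.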


Below is the complete script I created under the GPL License:
Some examples:

decaptcha captchas/1captcha.jpg


Posted Image


decaptcha captchas/2captcha.jpg -c

Posted Image


Attached File(s)





1 Reply

Posted by mattyclown, Feb 16 2014 07:10 PM

First, thank you for sharing this OCR image recognition method with us. As you said, an OCR (Optical Character Recognition) converts image files into human-readable characters, so I wonder how to make sure no errors occur in the image conversion process. I read an article saying that encoding grammar correction algorithms can be a feasible approach. Will you try to implement this feature?