I’ve got it let’s invade Iraq!




The Stupid 365 Project, Day 14: GOTCHA, reCaptcha!

October 14th, 2010

Many of you, like me, have wondered about those bizarro wordoids you have to enter before WordPress will accept a comment. So today, we’re going to let someone else write most of the blog in order to explain reCaptcha. I personally think it’s fascinating.



Everett Kaser has been kind enough to look at all the Simeon Grist books as they’re formatted as e-books, and he’s pointed out literally thousands of mistakes that Kimberly Hitchens, my long-suffering e-book producer/factotum, has then fixed. A graduate in Arts from Oregon State University, Everett spent 20 years working at Hewlett Packard, where he rose from an entry-level production job to being a software engineer in the R&D lab. After writing many trivial games for fun through the 1980s, he started releasing shareware games for MS-DOS and later MS-Windows (and now Mac OS X). This hobby grew sufficiently until, in 1997, he quit his day job and has been in game-programming heaven ever since. He reads, he plays disc golf, and sometimes (he says) he even sleeps.

He’s also one of the few people in my world who understands reCaptcha. So he said I could ask him some questions.



When I last went to post a response, I was commanded to key in ores; prepolty. I mean, what the hell?

Ores; is perfectly reasonable, even in English. Think of an article about mining containing a compound sentence: “They were digging for rare earths and metal ores; little was found.” Modern newspapers don’t use the semicolon nearly as much as papers did 100 years ago.

prepolty IS a strange one. But strange reCaptcha words can be one of several things.

– A word from a non-English language.

– The name of a person or place.

– The beginning of a hyphenated word (from the end of a line)

– The end of a hyphenated word (from the beginning of a line)

– A technical term.

– Slang. (Think how some of today’s slang will look to people in 100 years.)

– An acronym.

– An actual mistake in the original document.

– The middle of a word. (Nothing says the reCaptcha system has to use entire words. A word may get broken down into several pieces because it’s best that reCaptcha terms be kept to a reasonable length.)

– Etc.

There are always two groups of characters. If you make a mistake, are you less likely to be bounced if it’s in one group or the other?

This I don’t know for sure, but ONE of them has to be spelled right, and the other one doesn’t matter. But you don’t know which is which, and if I were the one writing the reCaptcha code, I would randomize which one you have to spell right. Otherwise it would eventually be known which was the important one, and people would start screwing with the system, which would defeat the second purpose of reCaptcha. The first purpose is to make sure it’s a human who is making the comment (or whatever it is that reCaptcha is evaluating), not some spamming computer program. It is still VERY difficult to program a computer to figure out what these mutated graphical words say, while a human can do it quite easily. The original system was called a “Captcha” and had only one word or series of characters you had to enter to prove you were human. “reCaptcha” was invented to do that AND something else, and that brings us to the next question.



Where do the character groups come from?

The second purpose of the reCaptcha system is to rapidly, cheaply, and easily digitize and proof-read scanned-in documents. So the words in reCaptcha come from documents that have been scanned in.



They’re digitizing what? And how does this relate to the zillion other digitizing projects that are underway?

Documents, like the entire run of The New York Times from the 1800s up to the present and the hundreds of thousands of books that Google has scanned in from libraries. Scanners and OCR (Optical Character Recognition) software can do a pretty good job of converting printed text to ASCII text computer files, but they’re not perfect and many little errors creep into the text. So every time you correctly enter a Captcha you’ve helped this process by proofreading ONE word, or part of a word, from one of those documents. It sounds like a slow process until you realize that there are tens of millions of those reCaptchas filled out every day on the Web. That’s a lot of documents every day, day after day after day.

The way it works is that one of the words in the Captcha is already known to the server that provided the graphic, and the other is being proofread. If you enter the known word correctly, then the system assumes you entered the other one correctly, too. But errors could still creep in, because sometimes you’d luck out and get the known word right while making a mistake on the unknown word. So the system requires the unknown word to be typed in the same way by multiple people before it’s accepted as the correctly proofread version of that word.
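For readers who think in code, here is a minimal Python sketch of the scheme just described. It is only an illustration and not reCaptcha's actual code: the function names, the case-insensitive comparison, and the threshold of three matching answers are all assumptions made for the example.

```python
import random
from collections import Counter

# Assumption for illustration: accept a reading once three people agree on it.
REQUIRED_AGREEMENT = 3


def make_challenge(known_word, unknown_id):
    """Return the two word slots in random order, so the user cannot tell
    which one is actually being checked."""
    slots = [("known", known_word), ("unknown", unknown_id)]
    random.shuffle(slots)
    return slots


def check_answer(known_word, typed_known, typed_unknown, votes):
    """Pass the user only if the known word matches. If it does, count
    their reading of the unknown word as one vote."""
    if typed_known.strip().lower() != known_word.lower():
        return False
    votes[typed_unknown.strip().lower()] += 1
    return True


def accepted_reading(votes):
    """Return the agreed transcription once enough people typed the same
    thing, otherwise None."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= REQUIRED_AGREEMENT else None


# Four visitors see the same pair; one of them fumbles the known word.
votes = Counter()
answers = [("ores;", "prepolty"), ("ores", "prepolty"),
           ("ores;", "prepolty"), ("ORES;", "Prepolty")]
for typed_known, typed_unknown in answers:
    check_answer("ores;", typed_known, typed_unknown, votes)
print(accepted_reading(votes))  # prepolty
```

The visitor who mistyped the known word is bounced and contributes no vote, so the unknown word is accepted only after three other people have typed it the same way.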

There are many digitization projects underway, but most of them have no good proofreading associated with them. For example, Google has scanned in hundreds of thousands of books and converted them to ASCII text files, but they also contained many OCR errors. Hiring people to proofread all those volumes manually would be expensive and time-consuming.



How is reCaptcha capitalized? Or, to put it in language I actually understand, where do they get their money?

Initially, they were probably being paid by The New York Times for digitizing and proofreading the papers. But in September 2009 Google bought the reCaptcha company, and Google is now applying the technology to improve the quality of the digitized books — making them more useful and more easily searched — as well as continuing outside projects, such as The New York Times. Some folks may think that Google is becoming the new Evil Empire, but having available high-quality scanned versions of all those old books is really a good thing and is a huge step forward into our information future.



We’ve all at one time or another written a brilliant, eloquent, perfectly articulated and very long comment, and then seen it flushed when we screwed up the reCaptcha. Any ideas on how to avoid that or recapture our lost brilliance?

The best thing to do is select all the text you entered and copy it to the clipboard before you enter and submit your reCaptcha words. That way if something goes wrong and your submission fails, you can easily paste your brilliant thoughts back into the edit field and try, try again. If you don’t know how to select and choose “copy,” you should learn.

If you failed to have the forethought to select and copy your text, you may be able to recover it by selecting the BACK button on your browser. Sometimes this will work and sometimes it won’t, depending on your browser and the implementation of the specific web page you’re viewing.

Thank you, Everett. Just really fascinating information.




This entry was posted on Thursday, October 14th, 2010 at 8:03 am and is filed under All Blogs.

14 Responses to “The Stupid 365 Project, Day 14: GOTCHA, reCaptcha!”


  1. EverettK Says:
    October 14th, 2010 at 8:21 am

Sheesh! Who is this guy? I wish I knew HALF as much as he does! Err… umm… no, wait a minute… where the hell’s that BACKSPACE key???

  1. Sylvia Says:
    October 14th, 2010 at 10:27 am

Captcha: shilang thought

I think generally it’s pretty clear which one is the “needs confirmation” word. But that of course requires a human to look at them which is the whole point, right?

Neat stuff.

BUT Tim, did you write your 300 words today? I don’t recall you negotiating days off!



  1. Timothy Hallinan Says:
    October 14th, 2010 at 3:50 pm

Sylvia — I KNEW someone was going to do this to me. Actually, I probably didn’t — but I thought up the idea, wrote the questions, pestered Everett to answer them and then, because of a bug in WordPress that wouldn’t let me block and copy the interview, I had to key the whole thing in. So my conscience is clear. Yeah, I have a question, too, about the word fragments like “shilang” that aren’t recognizable as part of any word I know, but it’s a fascinating explanation for the fact that reCaptcha is there in the first place.

Everett, you ARE one of the smartest people I know (virtually, anyway — and my in-person friends aren’t all that bright, so . . .) And I loved the post and thank you for all the effort.



  1. Gary Says:
    October 14th, 2010 at 4:25 pm

Thank goodness, Everett, you’ve solved the puzzle!

I had read that reCAPTCHA was being used to help proofread OCR, but what I hadn’t understood was this: if the system needs us to tell it what these scanned words are, then how will it know when we get it right? And you’ve answered that. Thank you so much.

(Of course, we all know where these strange words REALLY come from. My recent reCAPTCHA “trobits curtains” was obviously just a distorted broadcast of “Oh, it’s curtains!” as another fictional character winked out of existence on Mars.)

But if people won’t recognize the truth when they see it, what can we do?



  1. Suzanna Says:
    October 14th, 2010 at 4:41 pm

Hi, Tim and Everett

Always wondered how I could post something with reCaptcha when I could only recognize fragments of some of the letters and just guessed. Didn’t realize the system allows you to get by with only one correct word or um part of the text.

This is not really related except that it has to do with computers. If anyone is interested in computer programming or just really well written/directed/acted movies take a look at “The Social Network,” a movie by Aaron Sorkin and David Fincher about the founding of Facebook. Really well done.


  1. Sharai Says:
    October 14th, 2010 at 5:29 pm

Well I just spent the last hour getting caught up! I had no idea this Stupid Project, or stupid project was going to be so intense. Guess I’ll have to check in more often, it’s too good to skim! Thank you for day 10, that was the real Tim Hallinan at work, I felt a sisterhood with Sylvia because apparently your writing makes us both leak.

Everett, thanks for the brains and the humor, I love it when you two team up!



  1. Timothy Hallinan Says:
    October 14th, 2010 at 8:59 pm

Gary — Actually, trobits are small, nitrogen-breathing life forms from the planet Xenu, launched into earth’s atmosphere by L. Ron Hubbard some 40 years ago. Since earth is low on nitrogen in the atmosphere, Hubbard mutated them to breathe hot air manufactured by politicians. Naturally, they’ve thrived. The Church of Scientology was furious to see “trobits” in a Captcha and tried to buy the company, but, as Everett explained, Google outbid them.

Suzanna, you know you can change the Captcha you get if it’s really illegible. That’s what that little circle of arrows is for. And SOCIAL NETWORK is just about the only movie I want to see right now.

Sharai — these are not meant to be read in one sitting – they’re far too rich and full of, um, meaning. Glad you liked Day Ten. I laughed really loudly when I was writing about the sandwich. And I agree — Everett nailed it.

reCaptcha: are crattate. Everett — crattate?



  1. EverettK Says:
    October 14th, 2010 at 9:17 pm

crattate obviously falls into the acronym category: Crap Recaptcha Always Tries To Assign To Everyone.

See: EVERYTHING can be explained. (Don’t get me started on the JFK assassination…)



  1. Gary Says:
    October 15th, 2010 at 4:24 am

But that’s just it: JFK was never assassinated. His body double was saluted and buried, and the real JFK lives and breathes to this day on the grassy knoll.

  1. EverettK Says:
    October 15th, 2010 at 5:32 am

re: CRATTATE

Actually, I think I’ve found the real reason why we’re getting all of these bizarre ‘words’ in reCaptchas. If you do a Google search on CRATTATE, it brings up a list of possibilities, and the second one (for me) is a pointer to a book written in Latin and printed in VERY old Germanic script:

Tractatus de questionibus in quo materie maleficorum pertractantur
By Ippolito Marsigli

When you follow the link, it brings up a page of that book and highlights a word that, to me, looks like: ciuitate. By the time that got run through their reCaptcha distortion code, it came out looking to Tim like crattate.

The reCaptcha process is GREAT for digitizing and proofreading documents that are written in the native language of the user, as you then ‘see’ the right word. But when you don’t KNOW the original language, you’re purely guessing at each individual letter, and then errors will creep in. And very likely, several people who speak the same language (none of whom speak the original language) will proofread the word in the same way, thus ‘verifying’ the proofread of it incorrectly.


  1. EverettK Says:
    October 15th, 2010 at 5:38 am

One further note about the reCaptcha process, and then I’m done (hopefully).

After a word has been “reCaptcha’d” by multiple people, if there is a disagreement in their transcriptions, then that word is highlighted in the original document, and a human working for the reCaptcha project looks at the original document to arrive at the ‘official’ transcription of that word. In this way, the human proofreader only has to examine maybe 1% of the words in a book rather than proofreading the whole thing.
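That triage step can be sketched in a few lines of Python. This is a hedged illustration only, not the project's real pipeline: the function name `triage`, the word IDs, and the rule that any disagreement at all sends a word to a human are assumptions made for the example.

```python
from collections import Counter


def triage(transcriptions_by_word):
    """Split words into those everyone typed the same way and those that
    need a human to check the original scan."""
    accepted, needs_review = {}, []
    for word_id, answers in transcriptions_by_word.items():
        counts = Counter(a.strip().lower() for a in answers)
        if len(counts) == 1:
            # Every transcription agrees, so accept it automatically.
            accepted[word_id] = next(iter(counts))
        else:
            # Disagreement (or no answers yet): flag for a human proofreader.
            needs_review.append(word_id)
    return accepted, needs_review


accepted, needs_review = triage({
    "w1": ["ores;", "ores;", "ores;"],
    "w2": ["crattate", "ciuitate", "crattate"],
})
print(accepted)      # {'w1': 'ores;'}
print(needs_review)  # ['w2']
```

Only the flagged words ever reach a person, which is why the human proofreader's workload shrinks to a small fraction of the book.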



  1. Timothy Hallinan Says:
    October 15th, 2010 at 8:08 am

Aha. In fact, multiple ahas. This is one of those “aha” moments. I have to say, with no offense to Everett, that only a first-class obsessive would have come up with that explanation for “crattate,” which is actually an obscure sexual perversion involving white lace gloves and a stuffed panda. (Not obscure to those who practice it, obviously; to them, it’s all-involving.)

Gary, when you see JFK next, please give him my regards. Is Bobby okay, too?

And GREAT acronym, Everett.

reCaptcha (I love it) noodisma Herald



  1. Jaden Says:
    October 18th, 2010 at 3:11 pm

I must be missing something, because I’m still confused. How does it help proofread anything to have me COPY funny-looking letters I see on the screen? I will copy them as I see them, even if they’re incorrect–and have.

  1. Timothy Hallinan Says:
    October 18th, 2010 at 7:26 pm

Jaden, would you like Everett’s e-mail address? There are still lots of things I don’t understand, either. I mean, in addition to the human condition. As I get older I understand less and less. I’ve read a couple of articles on reCaptcha, and some of it remains impenetrable.

