Analytics Roundup: Open knowledge resources

Comprehensive Knowledge Archive Network
CKAN is a registry of open knowledge packages and projects... the place to search for open knowledge resources as well as register your own—be that a set of Shakespeare's works, a global population density database, the voting records of MPs, and more.

0 comments | Add a comment

Your name

Email (optional, will not be shared)

Type the word "juice" (required to confuse the spammers)

Your comment






ANDs, ORs, and IFs: Comparing big lists in Excel

One problem we face when manipulating large amounts of data in Excel is checking to see if two lists of the same length contain the same items. For instance, we might be given a list of products that a company has for sale this month, running to thousands of items, then the following month, we get another list of products for sale and we need to see if there has been any change between those two lists. This isn’t too hard to deal with when you only have a hundred or so items, but it gets a little thorny when your list runs to tens of thousands.

What we do is line the two lists up, side by side, in sorted order.

Two lists

Use the simple “A1=B1” formula to compare pairs of items in the lists.

Formula for comparing elements in the lists

If the pairs are the same, this will be true, otherwise they’ll be false.

Results of comparing elements in the lists

Copy this formula down for all your rows. Then use the AND function and give it the entire range of comparison formulas.

Checking comparisons

This will only be true if every single one of the values in your list are exactly matches. If even one comparison is false, this big AND statement will evaluate to false.

This is a quick and dirty approach. For tougher problems, we use a slightly more complicated formula in the comparison where we evaluate it to 1 if the value is true, 0 if the value is false. This gives us more flexibility to combine comparisons, but that’s a topic for another post.

19 comments | Show all comments only the last 5 are shown


May 5, 2007
Robbin Steif said:

To the person who wrote, this is too easy a post for Juice: you probably have a strong command of the topic, but not everyone does. And like someone else wrote: it's nice to be reminded of tricks again.

To the person who complained that then you have an error somewhere and you can find it: all you need to do is sort all the rows by the column with the proof in it, so that you can see where the problem is. You can sort Ascending, since F comes before T (how's that for easy advice?)

And finally, you can use a similar trick to dedupe email addresses, or other lists where the info will be identical. I am sure you Juicers are all over this one. But since the data are on two lines, you have to do it like this =a1=b1 And copy all the way down. Then you copy the results and repaste them right where they were, but PASTE SPECIAL - Values. Then you sort on the IF column


May 7, 2007
mike harding said:

well in the world of statistical genetics, excel doesn't have enough room in one tab. it would be useful to know how to do this in R or SAS.


May 9, 2007
AW said:

Mike,
In SAS, you'd want to use a sort-sort-merge or some inner joins.
R is on my schedule to learn in the next 2 months; so I got nothing for you there...


June 25, 2007
Brenda said:

Looks like it works for text string comparison as well!


October 11, 2007
Will said:

I think the use of MATCH is a much better option. Using the data as you have displayed it, in cell D3, you would have =IF(ISNA(VLOOKUP(C4,B$3:B$xxxxx,2,FALSE)),1,0), and then in cell D4 on downwards you would have =IF(ISNA(VLOOKUP(C4,B$3:B$xxxxx,2,FALSE)),D3+1,D3) where xxxxxx is the final row in your set. In cell D2, you have =MAX(D3:Dxxxxx). This will then tell you how many new datapoints you have.

Your name

Email (optional, will not be shared)

Type the word "juice" (required to confuse the spammers)

Your comment


Add a comment





Analytics Roundup: Late April edition

Population Estimates Data Sets
US census data

Are you generic? / Wilson Miner Live
Wilson Miners post re: Django's generic views

0 comments | Add a comment

Your name

Email (optional, will not be shared)

Type the word "juice" (required to confuse the spammers)

Your comment






Analytics Roundup

Many Eyes: IBM's collaborative visualization
Fernando Vargas' vision comes to light.

0 comments | Add a comment

Your name

Email (optional, will not be shared)

Type the word "juice" (required to confuse the spammers)

Your comment