I had the chance to listen to one of the icons of computer science recently. John Hopcroft, IBM Professor of Engineering and Applied Mathematics in Computer Science at Cornell, gave a talk on future directions in theoretical computer science to a standing-room-only crowd at Worcester Polytechnic Institute.
While this is not my field, I was interested to hear how someone who's been in the field -- and received just about every award -- might be thinking about the issues most likely to impact the development of the theory of computer science, especially when addressing a throng of young people who will have to execute against those challenges.
Here are my notes, mainly unedited. I hope they make sense.
The early focus was on making computers useful; now they are useful and as we move forward, we need to develop theory that supports new directions.
While we google quite liberally now, in the future there will be new kinds of queries that produce reasonable answers to specific questions, like "which car should I buy?" The "system," knowing all your prior searches (hello, privacy), will be able to cluster the results of those searches in "temporal cluster histograms" (which he then showed but which I can't remember and which I'm not now going to google...)
He gave the example of googling Michael Jordan. If you're a sports nut, you'll of course get the basketball legend, but if you're a computer scientist, you're probably looking for Michael I. Jordan, the machine-learning researcher. The future computer will of course know this -- and know that I wouldn't be looking for either one, to be honest.
Which of course led him to privacy issues, which he parsed nicely in regard to electronic medical records. Your doctor needs your whole record; your insurance company needs only the information pertaining to your last visit; a researcher needs statistical information with no independently identifiable info.
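To make that concrete, here's a toy sketch of those three views of one record. This is my own illustration, not anything he showed, and the field names are invented:

```python
# Toy sketch (mine, not Hopcroft's): one medical record, three role-specific views.
# All field names are hypothetical, for illustration only.
medical_record = {
    "patient_id": "p-1234",
    "name": "Jane Doe",
    "visits": [
        {"date": "2011-10-03", "diagnosis": "bronchitis", "cost": 180},
        {"date": "2011-11-21", "diagnosis": "follow-up", "cost": 95},
    ],
}

def view_for(role, record):
    """Return only the slice of the record that a given role needs."""
    if role == "doctor":
        return record                                    # the whole history
    if role == "insurer":
        return {"patient_id": record["patient_id"],
                "last_visit": record["visits"][-1]}      # most recent visit only
    if role == "researcher":
        return {"visit_count": len(record["visits"]),    # aggregate statistics,
                "total_cost": sum(v["cost"] for v in record["visits"])}  # no identity
    raise ValueError("unknown role")
```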
(My next note is one of those that I recorded solely because it sounded like poetry -- "zero knowledge proofs" -- but I have no clue as to what it means.)
His next point was a very good one: in the future (which, I believe, is actually right this minute), data sets will be so large that we'll only be able to analyze them by sampling them. As one of the computer scientists at the lecture explained to me, think of Facebook's data on 800 million users. There are no machines large enough to bring that magnitude of data into memory -- thus sampling. Kind of like being in the ice cream store and unable to eat all those gallons as there's not enough room in our tummies -- thus we sample. Well, not exactly, but you get the point.
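For what it's worth, here's a toy sketch of the sampling idea (again mine, not his): estimate a statistic by peeking at a small random fraction of a stream that's far too big to ever hold in memory.

```python
import random

def estimate_mean(stream, sample_rate=0.001):
    """Estimate the average of a huge stream of numbers by sampling it."""
    total, count = 0.0, 0
    for value in stream:                  # never materialize the whole dataset
        if random.random() < sample_rate: # keep roughly 0.1% of the records
            total += value
            count += 1
    return total / count if count else None

# e.g. estimate_mean(friend_counts_streamed_from_disk)  -- hypothetical source
```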
But before I leave this, a few more thoughts, because as we collectively dive more deeply into understanding our vast (and vastly growing) social networks (about which more below), we need to better understand what it takes to fully analyze gigantic amounts of data. Which brings me to the classic large-data problem that many may remember: SETI (specifically, the SETI@home project), where many of us gave/have given over our machines to the "search for extraterrestrial intelligence," lending our computers to computationally solve that bigger problem of finding such intelligent life. In that example, the SETI scientists send us a piece of the data along with the algorithm, our computer runs "the package," and it returns a solution that the SETI folks can then combine with others' solutions.
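The SETI pattern, in a toy sketch (mine, not theirs -- the real signal processing is obviously far more involved): slice the data into independent chunks, run the same algorithm on each chunk wherever it happens to land, and simply combine the per-chunk answers at the end.

```python
def analyze_chunk(chunk):
    # Stand-in for the real algorithm shipped along with the data.
    return max(chunk)

def seti_style(data, n_chunks=4):
    """Partition the data, analyze each piece independently, combine the results."""
    size = len(data) // n_chunks or 1
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    partial_results = [analyze_chunk(c) for c in chunks]  # done on many volunteer machines
    return max(partial_results)                           # combined back at headquarters

print(seti_style([3, 1, 4, 1, 5, 9, 2, 6]))  # 9
```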
Graph algorithms, the math by which we analyze network maps (who's connected to you, how many people/degrees are between me and you, what your favorite Facebook group looks like), are different.
Whereas SETI-like data can be partitioned for individual machines to compute pieces of the whole solution, graph algorithms, which are what produce those network maps, cannot be split up the same way. Why? Because a graph can only be analyzed when the algorithm can see all the links between the nodes, and those links may not be included in the parcel of data any one machine receives. Thus, we can't parcel out pieces of Facebook's 800 million users' data for analysis on your machine and mine. Notice, for instance, the recent study of how sentiments vary on Twitter -- "Drawing on messages posted by more than two million people in 84 countries..." -- which is to say a mere fraction of the tweets.
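Here's a toy illustration of why (my sketch, not from the lecture): a degrees-of-separation search has to follow edges wherever they lead, so if the friend lists were split across machines, the next hop might live in a chunk your machine never sees.

```python
from collections import deque

def degrees_of_separation(adjacency, start, target):
    """Breadth-first search for the shortest path length between two people."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return seen[node]
        for neighbor in adjacency.get(node, []):  # needs *all* of this node's edges --
            if neighbor not in seen:              # slicing the edge list breaks the search
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return None  # not connected

# Hypothetical mini-network, nothing like 800 million users.
friends = {"me": ["ann"], "ann": ["bob"], "bob": ["you"], "you": []}
print(degrees_of_separation(friends, "me", "you"))  # 3
```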
Big problem which I'll come back to in later posts but in the meantime, let's return to Dr. Hopcroft's lecture where he...
Then used the word "permute," which I'd never heard before -- thus the note. No clue once again. Is this a mathematician-type really meaning permutate? Dunno.
His next example re: privacy had to do with GPS navigation only tracking major roads, when minor roads are often the better route. Local drivers always take them, but they don't show up when you ask your GPS for the best route. Getting that info requires drivers to share their back-road routes -- but alas, do you want to be identified everywhere you're driving?
This led to his talking about how to look up weather patterns in your local neighborhood, which he does so that he can figure out whether he can walk home in the rain. Walking is his preferred mode of travel -- and he's figured out how to look at current weather maps and adjust for small time differences so that he knows when to leave his office.
And then he got more fully into some stuff that brushes up against the areas we're all interested in: social networks.
An early measure of the quality of communities was thought to be conductance, a connectivity measure from graph theory that indicates how fast you can go from one part of the graph, er, network, to another. But, it turns out, Dr. Hopcroft says, conductance is not the right model for communities.
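For the curious, conductance for a candidate community S is basically "edges leaving S, divided by the smaller side's total degree" -- low means the group is well separated from the rest. A toy sketch (the definition is the standard graph-theory one, but the example graph is mine, not his):

```python
def conductance(adjacency, S):
    """Fraction of the smaller side's edge endpoints that cross out of S."""
    S = set(S)
    cut = sum(1 for u in S for v in adjacency[u] if v not in S)   # edges leaving S
    vol_S = sum(len(adjacency[u]) for u in S)                     # total degree inside S
    vol_rest = sum(len(adjacency[u]) for u in adjacency if u not in S)
    denom = min(vol_S, vol_rest)
    return cut / denom if denom else 0.0

# Two triangles joined by a single edge: {a, b, c} looks like a "good"
# community by this measure, so its conductance is low (1/7).
g = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
     "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"]}
print(conductance(g, {"a", "b", "c"}))  # ≈ 0.14
```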
Instead, we need to look at overlapping communities, massively overlapping communities with common cores that have very fuzzy boundaries. In fact, we need different algorithms in order to understand different communities.
We need, he said, a "theory of sparse vectors." (I need a graph theorist for this one.)
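My own guess at the picture, so take it with salt: each person belongs to only a handful of the millions of possible communities, so their membership "vector" is almost entirely zeros and you store just the nonzero bits.

```python
# Hypothetical sparse membership vectors: only the nonzero entries are kept.
memberships = {
    "alice": {"knitting_club": 0.9, "wpi_alumni": 0.7},
    "bob":   {"wpi_alumni": 0.8, "jazz_fans": 0.5},
}

def shared_communities(a, b):
    """Overlap between two people's sparse membership vectors."""
    return set(memberships[a]) & set(memberships[b])

print(shared_communities("alice", "bob"))  # {'wpi_alumni'}
```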
He then went on to talk about the need for the new generation to incorporate the "law of large numbers" in their thinking. "When I was a grad student," he said, "we analyzed graphs with only 7-8 'edges,'" an edge being the connection/link between two nodes. Now, as he had said earlier, the number of edges is so huge that the graphs don't yet fit into any computer memory known to humans.
"There's something fundamental about giant components," he said, something I took to mean that we still don't understand.
My last note about Dr. Hopcroft's lecture is my favorite line: "We have to upgrade our intuition," i.e. it just plain doesn't work in the new world. What's going on now in understanding patterns in higher dimensions is counter-intuitive. Our intuition, he said, was formed in the world of 2 and 3 dimensions, but now we live in a world in which datasets are growing exponentially. Touché.