Search Technologies Archives

March 7, 2007

Microsoft Pulls Out All the Stops to Catch Google

Microsoft has made headlines a lot lately: the release of Vista, a new Book Search interface, a scathing attack on Google earlier this week before an AAP gathering in NY, and now this NY Times article, Searching for Michael Jordan? Microsoft Wants a Better Way - New York Times, featuring Microsoft's efforts to improve its stature in the world of online search.

I saw an interesting connection among these headlines. I've been reading about how Vista packs the biggest DRM-accommodating punch of any operating system ever (Google search on something like Vista drm to get a feel for the commentary). The speech before the AAP painted Google as cavalier about copyright and MS as totally supportive of the publishing industry. And the stats in the NY Times piece show that Microsoft is near the bottom of the heap in both the use and the recent growth of its search features.

These pieces look like parts of a very aggressive campaign by Microsoft to claw its way out of its own 20th Century niche, taking no prisoners, no holds barred. It has all the hallmarks of a political campaign, including the mudslinging. Tim O'Reilly chides MS and insists it is bigger than this, that we expect more out of a player of this size. Larry Lessig believes MS is just plain wrong about Google's cavalier attitude. I certainly agree with Lessig's sentiments, but as for whether we can expect more from MS, I don't know. Perhaps MS is just as much a victim of its business model as Hollywood and the publishers are of theirs (O'Reilly's comments again, in another entry)?

I wonder whether, in working together as they must have to implement such powerful DRM controls in Vista, they may actually impair their own progress towards more efficient business models in the future. In fact, I often think about DRM in the context of the old adage, "give them enough rope and they'll hang themselves." Many of the commentaries about Vista's DRM suggest that it's like a suicide note... Only time will tell.

April 5, 2007

Jean-Noel Jeanneney leaves France's Bibliotheque Nationale

I read with interest today that the President of the Bibliotheque Nationale, Jean-Noel Jeanneney, has apparently been forced to resign: Jean-Noel Jeanneney quitte la presidence de la BnF - Tour de Toile du BBF. You might wonder why this seems important to me, unless you know what I'm studying at the iSchool...

But, more generally, it's of interest because Jeanneney is an impassioned critic of all things Google. In fact, in his slim volume, Google and the Myth of Universal Knowledge, he says at one point, something to the effect, "Whatever Google does, we should do the opposite."

His principal criticism was that selectivity and organization should be at the heart of the process of digitization, and of course, Google's goal is to digitize everything and let the users sort it out through search, tags, bookmarks, etc. He also criticizes our reliance on the market to do what he thinks should be done with public money in Europe. At the core of Google's undertaking, and implicitly rejected in France's efforts that so far involve only public domain works, is reliance on fair use to justify digitizing books still in copyright. Being an employee of a Google Library partner, I'm not neutral on the matter, but I must say that the book is very well written and raises good points. Nevertheless, one commenter on the blog where I saw this note about Jeanneney's departure seemed to suggest that there might be a connection between his departure and the fact that Google had so far digitized 10 million books to the Bibliotheque Nationale's 100 thousand, while Jeanneney had essentially castigated Google for performing well. While neither of the figures is likely accurate, they get the general gist of the point across.

As always, there's no doubt a lot more to the story than initial reactions suggest, but I wonder whether Jeanneney's departure signals an opening for a new attitude towards mass digitization projects in France. Not coincidentally, I am headed there in 5 weeks to interview several librarians about their views of the future of the library in France. I have both Bibliotheques Nationales on my agenda, as well as 2 University libraries and a municipal library (Lyon). It's an exciting time to be thinking about the future of libraries, and May is a fine month to visit Paris.

August 7, 2007

Google Book Search Tips -- UMich

University of Michigan is making a 5-page description of "Google Book Search Tips" available on their website. Pretty amazing on several levels.

The first thing that struck me was the subject of the book search UMich uses to illustrate book search: texas longhorns. Of course the results are about Texas also, but it's rich and famous Texans and the fabulous fortunes they've won and lost. Hmmm.

Aside from that chuckle, at least for me, the document is really helpful as it shows in detail what features the book search provides, how to use it to best advantage, and if you're at UMich, how to double-check your results against Michigan's catalog, Mirlyn. I want to say right now that I think this is a really good thing. I've heard so many people say things that indicate that there's a lot of misunderstanding about what Google Book Search does and how it works. So clearly, this is needed and kudos to UMich for doing it, but...

The tips then go on to discuss searching for journal titles. Here it gets really complicated and this is where I suspect a lot of eyes will glaze over. Mine did. This explanation encapsulates a classic library search problem. If searching a library resource through a library database or catalog is so complicated that it takes 5 pages to explain how to do it effectively, well, you know you're going to lose a whole lot of your audience. I know, I know, you're thinking, "but there's so much good information you can find through these complex, complicated, difficult search interfaces that take 5 pages to explain." UT is rumoured to have 28 pages of explanation for how to use our new search interface. Sorry my alma mater, and beloved employer, but I've never confirmed this rumour. It's probably one of those tall Texas tales, but I can't even go look at something that might take that long to explain. I know it's not our fault. It's the product we have, one of the best we can get. But, well, we know this is a problem, don't we?

So, how could it be that Google, known for ease of use, practically identified with quick, simple, uncluttered search with amazing success (success being defined as "good enough for government work" -- ironic isn't it?), could have a search feature, now that it's gotten into books, that takes 5 pages to explain?

I always sigh when I think about how cool Google-think is, the culture of creativity and get it done, all that, and how I wish for us a little more of that culture. Now they are becoming more like us (even if only in a small way)? Is this really a good move? Are books and journals just too complicated to search simply?

August 28, 2007

Just discovered an interesting technology blog

I was directed to the site from Open Access News, Peter Suber's running commentary on all things to do with, well, open access. This particular post focused on copyright, but in the context of increasing populations that include more and more people who are willing to write without making it their living: More Authors, Less Copyright. So, I visited. It was a blog called The Technology Liberation Front with several authors of whom I had heard, but many more that were new to me. Tom Bell was the author of this particular post. Anyway, the focus of the blog is federal Internet policy, whatever that touches -- certainly copyright, but also a host of other issues that we don't talk about that much here. I recommend you pay it a visit if you're interested in the broader issues of Net policy.

September 17, 2007

This just in... Libraries and library organizations ask Copyright Office to free the registration database

Peter Brantley and Carl Malamud have just asked the Copyright Office to make its retrospective database of registrations of copyright freely available to the public: Carl Malamud Tackles the Copyright Office. The claim is that the information is public domain (the Copyright Office apparently claims copyright on it) and that it is a valuable dataset that, if publicly available for research activities, could yield improvements to the search process itself as well as other information about the registration process.

It is rather remarkable that the massive numbers of registrations and renewals are only searchable back to 1978. Stanford made headlines when it provided access to the "determinator," its database of earlier records that are proving indispensable to determining which of the works registered during the period 1923 - 1963 are in the public domain because their owners did not renew their copyrights as was required during that time.

University of Texas is joining this effort to determine the copyright status of works that have been digitized by Google, not just for the purpose of making those works that are found to be in the public domain more accessible, but also to further the research efforts of others along these same lines. We plan to document in detail the process we go through to make our determinations, the resources we find indispensable to our work, and when we are unable to make a determination, all the evidence that we were able to bring to bear on the question of copyright status so that others might be able to pick up where we left off. This is the kind of work that requires a "knowledge community" to further it. I know that the Copyright Office is a part of that knowledge community. Contributing its records to the research community is a special step that only it can take, a unique contribution I hope it will make.

September 18, 2007

NY Times move represents a publisher backing off its copyrights

The story about the NY Times closing down shop on its 2 year experiment in selling access to content has been reported all over the blogosphere, from many different angles. It is a rich story, really, and does in fact have much to say about what's happening in the business world of the Internet. The if:book report reflects the change from a publisher's having confidence in the power of its own brand to draw in paying subscribers to its having confidence in the power of Internet search and advertising to draw in far more dollars in the long run: if:book: all the news that's fit to search.

I agree that the power of search makes dollars and sense. But I also note that this particular strategy places copyright's "exclusion" right, its reliance on exclusive rights to motivate creation, a little further down in the hierarchy of what one needs to succeed in the online world. Or, put another way, if you play the copyright card front and center, you ignore a lot of other cards that are ultimately of more economic value.

We are finally beginning to get the idea that control over copies isn't the only way to exercise one's copyrights. Sharing actually works, economically. It also makes a lot of sense that advertising would be the liberator. It's under our noses all the time with television. But surely it's not the only alternative to controlling access and counting copies. The world of all advertising, all the time has its own downsides. Nonetheless, it's encouraging to see major publishers leaving access control behind.

November 25, 2007

Mass Digitization blogging project completed

After 6 weeks of drafting, posting, tracking blog statistics, and weekly writing in a journal about the experience, I have just completed my blogging experiment at Mass digitization ~ Changing copyright law and policy, by posting the Conclusion today. Here's the first paragraph:

The story of mass digitization’s effect on copyright law and policy is the story of confronting and eventually calming fears. Sometimes the only way to calm fears is just to stand up, stride towards the light switch, and show that there’s nothing to be afraid of. Turn on the light. Look under the bed. Open the closet door. See? There’s nothing there. Didn’t Franklin Roosevelt say something about this?


Since I announced the start of the experiment here on Collectanea, I thought I would announce its conclusion as well. If you haven't visited yet, or if you visited early in the drafting process, you might like to visit again to read the entire draft (7 fairly short sections). Be sure to check out the Project Resources page. It has links to all the online materials referred to in the draft, and other materials that support or illustrate the argument.

It has been a very interesting experience to draft in blog-style. My next step will be to polish the draft and give it journal-style. I will be able to compare the two drafts and perhaps say something useful about how the styles differ. I also have scads of data about daily page views, time on the pages, and how many pages were viewed per visit. It's amazing what Google Analytics can tell you about your blog. If it weren't for Google Analytics in fact (and other blog statistics programs), the story we would relate about our experiences blogging would be far removed from the truth because without stats, we only know readers are there if they comment. Hardly *anyone* comments though. The comment rate on Mass Digitization was roughly .2% -- that's point two percent, not two percent. So, for 1000 pages viewed, the blog received 2 comments. This rate is consistent with rates I've read in broad studies of blogs. Of course, there are exceptions, but most of us are not really visibly building a community of commenters.
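For anyone who wants to check the arithmetic on their own blog, here is a minimal sketch in Python. The function name comment_rate is made up for illustration; the only figures used are the ones quoted above (2 comments for 1000 pages viewed).

```python
def comment_rate(comments: int, page_views: int) -> float:
    """Return comments as a percentage of page views."""
    if page_views <= 0:
        raise ValueError("page_views must be positive")
    return 100.0 * comments / page_views


# Figures quoted in the post: 2 comments for 1000 pages viewed.
rate = comment_rate(comments=2, page_views=1000)
print(f"{rate:.1f}% of page views led to a comment")
```

Dividing first and multiplying by 100 afterwards is what keeps "point two percent" from being misread as "two percent".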

But we are reaching people. Those 1000 pages viewed represent about 500+ people who stopped by, even if only for a few minutes. So, the blog entries did get viewed in whole or in part by many folks who might not read the article in its polished journal-style form. It is an interesting hypothesis, how blogs affect scholarship. I will be posting my paper on that subject at the Crash Course when I complete the paper in about 2 weeks. And Mass Digitization will be published on CIP's Website in the spring.

If you are one of those 500+ people, THANK YOU! It is very nice to know you are there --

November 29, 2007

Just Because You're Paranoid...

As I mentioned in an earlier post ("Shooting Fish In a Barrel"), my university was one of the 25 named bad guys receiving letters about online piracy. My earlier blog was about the College Opportunity and Affordability Act winding its way through the legislative system. Among other things, this act would require universities to explore technology-based deterrents to prevent illegal activity.

One such nifty "technology-based deterrent" is the "University Toolkit" being offered (FOR FREE, can you imagine?) by the Motion Picture Association of America (MPAA). Not being a "techie", I can get left behind fairly easily on these things and I can also get spooked about privacy invasions with almost no effort at all. (My first reaction to things like On-Star was not 'Great - someone somewhere in the ether can unlock my car for me' but rather 'Good grief, someone knows where I am all the time'.)

Anyway, when I read the Washington Post's Security Fix blog entry by Brian Krebs titled "MPAA University 'Toolkit' Raises Privacy Concerns", which explains the Toolkit in detail, I was horrified. (This is a blog, so I figure I can say things like "horrified" and "yikes".)

Apparently, installing the Toolkit on your university's network is like letting the fox into the henhouse. Once installed, the software phones home to the MPAA telling them that it is 'in' and checking for a new version (and who knows what's in that). According to Security Fix, "installing and using the MPAA tool in its default configuration could expose to the entire Internet all of the traffic flowing across the school's network" automatically configuring "all of the data and graphs gathered about activity on the local network to be displayed on a Web page complete with ntop generated graphics showing not only bandwidth usage generated by each user on the network, but also the Internet address of every Web site each user has visited."

Does this bother you? Bothers me. Bothers Steve Worona (director of policy and networking programs at EDUCAUSE) who opined that "no university network administrator in their right mind would install this toolkit on their networks."

In response to these criticisms, the MPAA, via Craig Winter, deputy director for Internet enforcement (does that sound like web cop to you?), said the toolkit was in the 'beta' phase. Again, no technology expert here, but why would you release and promote something not finished?

Rather than continue to repeat this informative blog entry by Krebs, I would encourage you to read it yourself, as well as some of the follow-up comments.

When I consider how some of these associations are treating their customer base, the saying about killing the goose that lays the golden egg comes to mind. I laughed out loud the other day when I heard someone say on the radio (completely different context; can't remember who or I'd credit) "We don't want to kill the golden goose; we just want to strangle it until it gives us all its eggs."

March 13, 2008

Semantic web and copyright

Yahoo! announced today that it will be supporting Semantic Web and microformats to improve search results for structured data (as reported in ReadWrite Web: And Nerds Became Kings: Yahoo! to Announce Semantic Web Support - ReadWriteWeb). The Semantic Web has been a dream of Tim Berners-Lee for a long, long time, and up until now, pretty much way behind schedule because it just seemed, well, too hard. Things are changing.

They always do.

You know how RSS allows you to get feeds from your favorite blogs and other newsy Websites? That functionality is one example of how we are able today to break the offerings on a Webpage up into small parts and send them zipping around the Web. The text is separated from the formatting on our page; the way the text is displayed isn't carried around with it. That enables a snippet of our text, maybe the first paragraph for example, to be displayed by someone, anyone who subscribes to our feed.
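To make that a little more concrete, here is a rough sketch in Python of what a subscriber's software does with a feed. The feed text, the example.org addresses, and the function name read_items are all invented for illustration; a real feed would be fetched from the blog rather than typed into the program, and real feed readers handle many more cases than this.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed with one entry. Note that it carries text only:
# nothing here says what font, color, or layout the blog itself uses.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.org/</link>
    <description>A hypothetical feed for illustration</description>
    <item>
      <title>A made-up post</title>
      <link>http://example.org/a-made-up-post</link>
      <description>First paragraph of the post goes here.</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml: str) -> list:
    """Pull the title, link, and summary text out of each feed item."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description", default=""),
        })
    return items


for entry in read_items(SAMPLE_FEED):
    print(entry["title"], "-", entry["summary"])
```

The subscriber ends up with a title, a link, and a snippet, and is free to display them however it likes.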

Semantic Web potentially micro-bites the content even further -- into little bits that are identified as to precise type: this part is a last name; this part is a first name; this part is a phone number; this part is a set of key words; this part is an abstract, etc. People might tag text down to this level to enable its extraction and manipulation, its readability by computers (see Michael Jensen's article, The New Metrics of Scholarly Authority, about the importance to Authority 3.0 of being computable); its reorganization for other purposes. It gets treated like data rather than information or knowledge (don't let's debate what those things are just now).
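Here is a small sketch, again in Python, of what "this part is a last name; this part is a phone number" can look like to a computer. It assumes the text has been tagged with class names in the style of the hCard microformat (given-name, family-name, tel); the sample person, the phone number, and the names TypedFieldParser and extract_fields are all invented for illustration, and this is not how Yahoo! or anyone else necessarily does it.

```python
from html.parser import HTMLParser

# Hypothetical snippet of a Web page, tagged hCard-style so that each
# little bit of text is labeled with its type.
SAMPLE_HTML = (
    '<div class="vcard">'
    '<span class="given-name">Ada</span> '
    '<span class="family-name">Example</span>, '
    '<span class="tel">555-0100</span>'
    '</div>'
)

# The labels this sketch knows how to pick out.
FIELDS = {"given-name", "family-name", "tel"}


class TypedFieldParser(HTMLParser):
    """Collect the text inside elements whose class names a known field."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field label of the element we are inside

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in FIELDS:
                self._current = name

    def handle_data(self, data):
        # Only keep text that sits directly inside a labeled element.
        if self._current is not None:
            self.fields[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


def extract_fields(html: str) -> dict:
    """Return a mapping from field label to the tagged text."""
    parser = TypedFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


print(extract_fields(SAMPLE_HTML))
```

Once the bits are labeled this way, a program can lift out just the phone number, or just the last names, and reuse them somewhere else entirely, which is exactly what raises the copyright questions.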

What might this mean for copyright policy and practice? Wow, it just sends the mind reeling. I can't begin to imagine the implications, but one thing seems clear: a Semantic Web has the potential to further dramatically reconfigure the relationship between copyright owners and those who wish to access and use their copyrighted works. Implicit in the markup for computer recognition, extraction and manipulation is a license to actually do those things. Atomized text and images, sounds, audio-visuals. Wow. Might a whole new round of fear and loathing be right around the corner? Or will this just add to the steady pressure on copyright owners to open up their works to use and reuse -- if they want attention at all?

About Search Technologies

This page contains an archive of all entries posted to ©ollectanea in the Search Technologies category. They are listed from oldest to newest.


Creative Commons License
This weblog is licensed under a Creative Commons License.