PHP and the parsing of HTML for Microformats
While Ruby on Rails and Python are way ahead in parsing HTML for microformats, PHP seems to be trailing behind. We have found a couple of classes that can do it, but none without quirks.
The first option was hKit, a PHP 5.x parser. The problem with this parser is that it only works on PHP 5.x, since it requires the SimpleXML PHP extension. Not all PHP-driven sites are running on PHP 5.x, so it is not an ideal solution for us.
The other option we found was the Microformats Parser over at PHPClasses, developed by Ve Bailovity. This is an award-winning class that depends on the xArray class, also found there and also developed by Ve. This option is much better since it works on PHP 4.3 or later and requires the DOM XML extension. The issue is that it does not work “out of the box”. It worked beautifully when tested with the HTML file that accompanies the classes, but as soon as I tried it with some “real-world” microformat implementations (tried on http://corkd.com and http://www.thinkvitamin.com/), it failed. Sad, since it is a beautiful script!
So what do we need? We need a class that works on ALL versions of PHP and that can process ANY form of HTML or XHTML. We all know that not everyone cares much for their mark-up, and therefore you will find all kinds of funny and weird issues, so it must be extremely forgiving when parsing the HTML document. It must also have some proper security and anti-spam checks built in, since this could end up being a spammer’s dream if not properly protected! We’re on it.
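To make the “extremely forgiving” requirement a bit more concrete, here is a minimal sketch. It is written in Python purely for illustration and is not kupaParser code. It assumes that tolerating sloppy mark-up mostly means surviving unclosed tags and stray closing tags; the class name `MicroformatFinder` and the list of root classes are made up for this example.

```python
from html.parser import HTMLParser

# Root class names to look for (an illustrative subset, not a complete list).
ROOT_CLASSES = {"vcard", "hreview", "hresume"}
# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MicroformatFinder(HTMLParser):
    """Collect the text inside each microformat root element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.found = []     # (root_class, text) tuples
        self._root = None   # root class currently being read, if any
        self._stack = []    # tags opened since the root element started
        self._text = []     # text nodes seen inside the current root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._root is not None:
            self._stack.append(tag)
            return
        classes = set((dict(attrs).get("class") or "").lower().split())
        hit = classes & ROOT_CLASSES
        if hit:
            self._root = sorted(hit)[0]
            self._stack = [tag]
            self._text = []

    def handle_endtag(self, tag):
        if self._root is None or tag not in self._stack:
            return  # stray closing tag: ignore it
        # Pop any children that were never closed, then the tag itself.
        while self._stack.pop() != tag:
            pass
        if not self._stack:
            self._finish()

    def handle_data(self, data):
        if self._root is not None:
            self._text.append(data)

    def _finish(self):
        # Text nodes are joined with spaces and whitespace is collapsed.
        text = " ".join(" ".join(self._text).split())
        self.found.append((self._root, text))
        self._root = None

    def close(self):
        super().close()
        if self._root is not None:  # the root element was never closed
            self._finish()
```

Feeding it something like `<div class="hreview"><p>Great wine<br><span>5</div>` still yields one `hreview` entry, even though the `p` and `span` are never closed. It does nothing about property extraction, nested microformats or encodings, which is where the real work lies.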
We have started to develop our own parser class for kupa, aptly named kupaParser, and will release it under the GPL for anyone who would like to take advantage of microformats on PHP. We will keep you updated on the progress as we go along.
13 Comments to "PHP and the parsing of HTML for Microformats"
Kick-ass Tools, Open Source, PHP, Semantic Web, Software Development Stii
Maybe it’s time to use something other than PHP? Python has some great “weird-ass HTML to usable XML” converters, and its XML manipulation libraries are great too.
I battled for hours to find trivial Amazon S3 (with full error reporting and so forth) implementations that work in both PHP4 and PHP5. I didn’t find one. Finding boto for Python took about 5 seconds. I ended up writing my own, bad, Amazon S3 code, because the project absolutely required PHP (it was for an existing PHP project).
Also, most libraries for PHP are under the PHP licence - why don’t you use that?
Neil
Neil, good points. I’d love to use Python, but as I’ve stated, most sites are still PHP driven. The idea is that others like Cherrypicka (which is PHP driven) can use this parser, and therefore it would be a lot simpler if it were a simple PHP class they could use in their existing code.
As far as licensing goes, I’m not all that picky.
Also, generally your spidering and microformat parsing is done in an external, often long-lived, process. PHP isn’t as suited for this environment (it can have fatal, uncatchable, errors) as more mature (in this regard) languages like Python (and Java, and Ruby, and …).
One method I’ve seen that’s worked well is to decouple these two, written in separate languages, and use the language that’s best suited for the task, taking the environment and available resources into account. This often is the case when using something like Lucene - you can use pyLucene these days, but I’ve often seen Python and PHP cores using Java and Lucene as one component via a simple HTTP protocol.
Neil
If you really need to have a nice PHP method, then htmlSQL (by Jonas John, http://www.jonasjohn.de/lab/htmlsql.htm), a PHP class to query HTML just like you would a database table, can be used to write a kick-ass parser… Just a thought.
Although you are right, I think that the implementation would be tricky for guys hosting PHP apps on a shared host. The idea for the parser is to let okes who want their visitors to submit, for example, reviews to their sites do it à la Pingerati. Simple for now. Eventually we’d have a spider on kupa and would have to run the parsing process in its own thread. It would, however, not really be necessary in the Cherrypicka example, where one or two reviews are on blogs and would be submitted through a parser to get them onto the Cherrypicka site. (Only an example.)
I must say, you’ve just given me a brilliant idea! What could work even better is having a parser as a web service. Then you can use whatever language(s) you want and anyone can use the API to extract MFs of their choice from any URI. Could work better! Hmmm… lemme ponder on this one.
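A rough idea of what “a parser as a web service” could look like, sketched in Python as a plain WSGI app. This is only an illustration under stated assumptions: the query parameters `uri` and `format`, the JSON shape, and the names `RootCounter`, `fetch_html` and `make_app` are all invented here, and the “parsing” is reduced to counting root elements so that the example stays short.

```python
import json
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen


class RootCounter(HTMLParser):
    """Count elements carrying a given microformat root class."""

    def __init__(self, root_class):
        super().__init__()
        self.root_class = root_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").lower().split()
        if self.root_class in classes:
            self.count += 1


def fetch_html(uri):
    """Download a page; undecodable bytes are replaced, not fatal."""
    with urlopen(uri, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def make_app(fetch=fetch_html):
    """Build a WSGI app; `fetch` is injectable so it can be stubbed."""

    def app(environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        uri = query.get("uri", [""])[0]
        fmt = query.get("format", ["hreview"])[0].lower()
        # Only fetch http(s) URIs, never local files or other schemes.
        if urlparse(uri).scheme not in ("http", "https"):
            status = "400 Bad Request"
            body = {"error": "uri must be http(s)"}
        else:
            counter = RootCounter(fmt)
            counter.feed(fetch(uri))
            counter.close()
            status = "200 OK"
            body = {"uri": uri, "format": fmt, "count": counter.count}
        payload = json.dumps(body).encode("utf-8")
        start_response(status, [("Content-Type", "application/json"),
                                ("Content-Length", str(len(payload)))])
        return [payload]

    return app
```

The app returned by `make_app()` can be served with the standard library’s `wsgiref.simple_server.make_server`, and any client, whether PHP 4, PHP 5 or something else, would then only need to make an HTTP request and decode the JSON.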
Cool, I’m just a little confused by your statement “It must also have some proper security and anti-spam checks built in, since this could end up being a SPAMMERS dream if not properly protected!”. What do a markup parser and spam have to do with each other?
Also, is this parser going to be a full quirky HTML parser or only going to search for embedded microformats? If the former, will it support the DOM?
Well, what would prevent those viagra SPAMMERS from making a page with 10000000000 viagra hReviews on it?! The parser must handle checks like that, not the implementer.
Well, for now only going to search for embedded microformats.
I’d really like to encourage you to help improve an existing solution to a usable point rather than start over from scratch. There are already a few different solutions for parsing microformats in PHP and, as far as I know, most are open source.
Writing a parser is a lot of work. I didn’t realise how much work when I started hKit, otherwise I would never have embarked on such a project. (As you can see from the hKit source code, the complexity very quickly overtook the simple design, to a point where it could use some refactoring. But it’s solid code.) Writing a brand new parser when there are already a number of good solutions available seems pretty rash, frankly. If you’re interested in supporting the community, do so by getting behind solutions already in progress. Creating yet another solution to the same problem just introduces more confusion, another set of quirky behaviours and possibly more bugs into the parsing space.
I also think it’s pretty dumb to discount hKit for requiring a version of PHP that’s already nearly 3 years old, but YM will clearly V. For what it’s worth, I wouldn’t base a new solution on SimpleXML again as it’s pretty inelegant, but for PHP I’d use PHP5 every time.
> Well, what would prevent those viagra SPAMMERS to make a pages with 10000000000 viagra hReviews on them?! The parser must handle checks like that, not the implementer.
That’s not the job of a parser. A parser uses logic (and knowledge of a grammar) to extract information and turn it into a data structure. It’s not for a parser to determine what content is valuable.
I’d imagine from the example given that a medical research student studying erectile dysfunction would find your parser next to useless due to a preprogrammed assumption based on a very limited domain.
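One way to picture the split argued for above: the parser hands over everything it finds, and a separate step, configured by whoever runs the site, decides what to accept. The sketch below is in Python and is purely hypothetical; the `Review` shape, the per-page limit and the `blocked_terms` option are assumptions made for illustration, not part of hKit or any other parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """A parsed review, as some parser might hand it over (hypothetical shape)."""
    source_uri: str
    summary: str


def accept_reviews(reviews, max_per_page=20, blocked_terms=()):
    """Apply site-specific policy to reviews that were already parsed.

    Nothing here touches the mark-up: the parser has done its job, and each
    site chooses its own limits (or none at all).
    """
    # Count how many reviews each source page contributed.
    per_page = {}
    for review in reviews:
        per_page[review.source_uri] = per_page.get(review.source_uri, 0) + 1

    accepted = []
    for review in reviews:
        # A single page with an implausible number of reviews is dropped whole.
        if per_page[review.source_uri] > max_per_page:
            continue
        # Term blocking is off by default, so no topic is excluded by design.
        summary = review.summary.lower()
        if any(term.lower() in summary for term in blocked_terms):
            continue
        accepted.append(review)
    return accepted
```

With the defaults, a page carrying 50 reviews is rejected while a blog with a single review passes, and a medical research site could simply leave `blocked_terms` empty and raise `max_per_page`.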
Drew, although you are right, it is a bit of a pain to simply move a site to a host that offers proper PHP 5 hosting. Although it seems dumb, it is not all that simple to persuade established sites to move to a PHP5 supported host just so they can implement hKit. hKit is great! Nothing against that. PHP 5 is a lot greater than PHP 4. The issue is not the language. The issue is the clients. The guys we’d love to adopt hKit. Sure, maybe soon their hosting company will upgrade PHP, but until that happens, they are cut off from using great new technologies.
I guess that opens the age-old debate of do you FORCE clients to upgrade or do you try to adapt to their technologies?
I’d be interested in hearing the use case you have in mind. What sort of scenario are you considering where a developer wants to parse microformats but has no control over their server and software versions?
I have a client that has a recruitment agency. They host their web site on a shared host with PHP 4.3. They would love for their applicants to be able to submit their Resumes in the hResume format without having to redo their resumes. From experience I know that the particular client would HATE to move their site simply to support PHP 5. Yet, they would love to take advantage of the hResume microformats. They could not care less for hosting their own software on their own servers, since technology is not their core business, although it forms a big part of their core. Yes, it would be silly NOT to run on the latest PHP 5 version. Yes, they’d be able to do sooooo much more. It is unfortunately NOT in our control. We have to make do with what is available!
Does this make any sense? In other words, businesses that host their sites on shared hosts that do not support PHP 5 would be much happier taking advantage of technologies like microformats where they are than moving their domain to servers that support PHP 5. It may sound absurd, but believe me, you’ll find it a lot.
In any event, I think one thing you would agree on is that MFs are extremely useful little things. They would definitely be beneficial beyond the scope of dedicated MF services!
Oh, and as far as the parser checking for SPAM goes, it’s got nothing to do with the parser itself. It is part of the “package” that other developers might want to implement on their own or their clients’ sites.