December 3, 2009

Parsing (X)HTML in C - A libxml2 tutorial


Parsing (X)HTML in C is often seen as a difficult task.  It's true that C isn't the easiest language in which to develop a parser.  Fortunately, libxml2's HTMLParser module comes to the rescue.  So, as promised, here's a small tutorial explaining how to use libxml2's HTMLParser to parse (X)HTML.

First, you need to create a parser context.  There are several functions for doing that, depending on how you want to feed data to the parser.  I'll use htmlCreatePushParserCtxt(), since it works with memory buffers.

htmlParserCtxtPtr parser = htmlCreatePushParserCtxt(NULL, NULL, NULL, 0, NULL, 0);

Then, you can set many options on that parser context.

htmlCtxtUseOptions(parser, HTML_PARSE_NOBLANKS | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING | HTML_PARSE_NONET);

We are now ready to parse an (X)HTML document.

// char * data : buffer containing part of the web page
// int len : number of bytes in data
// Last argument is 0 if the web page isn't complete, and 1 for the final call.
htmlParseChunk(parser, data, len, 0);

Once you've pushed all your data, call that function one more time with a NULL buffer, a length of 0, and 1 as the last argument.  This ensures that the parser has processed everything.

Finally, how do you get at the data you parsed?  That's easier than it seems.  You simply walk the tree the parser built.

void walkTree(xmlNode * a_node)
{
  xmlNode *cur_node = NULL;
  xmlAttr *cur_attr = NULL;

  for (cur_node = a_node; cur_node; cur_node = cur_node->next) {
    // do something with that node information, like... printing the tag's name and attributes
    // (text nodes show up here too; check cur_node->type == XML_ELEMENT_NODE to skip them)
    printf("Got tag : %s\n", cur_node->name);
    for (cur_attr = cur_node->properties; cur_attr; cur_attr = cur_attr->next) {
      printf("  -> with attribute : %s\n", cur_attr->name);
    }

    // recurse into this node's children
    walkTree(cur_node->children);
  }
}

walkTree(xmlDocGetRootElement(parser->myDoc));

And that's it!  Isn't that simple enough?  From there, you can do all kinds of things, like finding all referenced images (by looking at "img" tags) and fetching them, or anything else you can think of.

Also, you should know that you can walk the XML tree anytime, even if you haven't parsed the whole (X)HTML document yet.

If you have to parse (X)HTML in C, you should use libxml2's HTMLParser.  It will save you a lot of time.

November 27, 2009

Virtualization, cloud computing, unified computing, etc.

I don't know that much about virtualization and cloud computing.  I mean, I know what they are, but I don't know much about all the solutions that exist for them on Linux.  My experience is pretty much limited to using VMware (a couple of years ago) and now VirtualBox.  As for cloud computing, I've never touched it.

But since I like to learn new stuff, and since both are hot topics right now (see "Cloud computing is the most dynamic segment of the IT industry" and "Why the cloud needs virtualization"), I decided to look into them further.

I suppose that I'm not alone in that situation.  So, if you're looking for great information on the subject, I suggest you take a look at the opensourc3.org magazine.  It's a free monthly publication, with five issues already published.  Every issue goes into great detail on some particular aspects of cloud computing and virtualization.

Do you have any other sources of information about virtualization and cloud computing that you'd like to share?

November 23, 2009

Book Review: The Web Startup Success Guide (and links list!)

The Web Startup Success Guide by Bob Walsh (of 47 hats) is a really interesting book.  And I want to start by saying that even if you don't want to start a business, go read that book.  There's a lot of information in there for every software developer.  Its coverage of topics like customer support, social media, personal performance improvement tools, Getting Things Done, and many more is invaluable.

For those really looking at starting a business, the chapters about getting funded, marketing your product or service, and when and whom to hire should help you in your quest for entrepreneurship.

In his book, Bob offers a lot of information himself.  But the strength of the book comes from all the interviews.  For every topic discussed, there are a bunch of interviews with people who bring more to the table.  This really helps in getting a broader view of the subject.

Also, as those who have already read it will have noticed, Bob provides a lot of links to web pages in his book.  I started to take note of them, but I quickly stopped and decided instead to go back through the book afterwards.  Since I'm a generous person, I've included the whole list at the end of this post.

But really, go read that book.  It offers a lot more than just links!



"The Web Startup Success Guide" Links List

Blogs, podcasts, forums, newsletters, etc.
Reports, articles, etc.
Communities (online and offline)
Events
Startup help, incubators, VCs, angels, etc.
Marketing
Miscellaneous Tools
Search Engine Optimization, web analytics, etc.
Glossary, resources, etc.
Payment processor
Testing
Version control
Design
Dashboard, personal home page, etc.
Search Engines
Task management, project management, etc.
Notebooks

November 18, 2009

To have success, know why it was a success

Throughout the history of humankind, the thought of one solution that would fit all, that would be perfect for everybody, has often surfaced.  The many empires over time, and the religions, all embraced the idea that what worked for them was the best solution and should be used by everyone.

The same thing has happened and is still happening in the computer science field.  Every time a new software development methodology, coding/testing/debugging practice, project management technique, programming language, software/hardware architecture, etc., makes the news, you get people swearing that this novelty is the way to go.  If you're not doing it exactly as evangelised, you're doing it wrong.

Well, at least some people don't fall for that.  Like Scott Ambler, who says that every IT project is different in some way, like snowflakes, so any agile methodology used should take the particularities of the project into account.  Like Johanna Rothman, who asks if three backlogs could be better than one.

To think that way, you need to understand why a particular procedure, solution, or method was invented, and why it worked for someone.  If you don't know why it worked somewhere, how do you know that it will work for you?

"Don't follow any advice, no matter how good, until you feel as deeply in your spirit as you think in your mind that the counsel is wise" -- Joan Rivers, American comedian

If you cannot explain why your methodology works for you, and in which cases it would not work, you're doing it wrong.