Darren Chuang: Automatically Extracting Images and Article Snippets from Web Pages with C#

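Social networking sites such as Facebook and Google+ let users share links. After a user pastes a URL, the system fetches the page behind it and extracts an image and a snippet of the article. A typical page is full of images and text, so how does the system know which image and which text represent the article? Let's look at how this feature works.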

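Before writing such a program, we first have to decide how to parse the HTML. The leanest and fastest approach is regular expressions, which can quickly pick out the HTML tags we want, but regular expression syntax has a steep learning curve and is hard to pick up for people who don't use it often. Since this isn't meant for commercial use, I chose an open source component, HtmlAgilityPack. Its advantage is that it lets you access HTML objects with XPath in much the same way as XmlDocument; it is convenient and performs reasonably well.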
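For comparison, here is a rough idea of what the regular expression route looks like. This is only a minimal sketch that pulls the src of every <img> tag out of an HTML string; the class and method names (ImgRegexSketch, GetImageSources) are made up for illustration, and the pattern is deliberately simple. It would, for example, also pick up a data-src attribute that appears before src, which is exactly the kind of edge case that makes the regex approach harder than it first looks.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original program.
public static class ImgRegexSketch
{
    // <img ... src="..."> or <img ... src='...'>, case-insensitive.
    private static readonly Regex ImgSrc = new Regex(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"'](?<src>[^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the src values of all <img> tags, in document order.
    public static List<string> GetImageSources(string html)
    {
        var result = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            result.Add(m.Groups["src"].Value);
        }
        return result;
    }
}
```

Calling ImgRegexSketch.GetImageSources(html) gives back the raw src strings, which may still be relative URLs.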

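Use WebClient to fetch the page content, keeping the page's encoding in mind. Most pages today use UTF-8, but there are still plenty of exceptions, so we need to check the Content-Type and charset in order to decode the HTML with the correct encoding. (I'm being lazy here by using WebClient, which may have to download the page twice; the proper way would be to use HttpWebRequest.) Once we have the HTML, we hand it to HtmlAgilityPack to parse: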

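private static HtmlDocument GetHtmlDoc(Uri uri)
{
    string html;
    using (WebClient client = new WebClient())
    {
        // First attempt: assume the page is UTF-8.
        client.Encoding = Encoding.UTF8;
        html = client.DownloadString(uri);
        string contentType = client.ResponseHeaders.Get("Content-Type");
        // AutoEncoding (not shown in this post) picks the encoding from the
        // Content-Type header and the charset declared in the HTML.
        Encoding e = AutoEncoding(html, contentType);
        if (!client.Encoding.Equals(e))
        {
            // The UTF-8 guess was wrong: download again with the detected encoding.
            client.Encoding = e;
            html = client.DownloadString(uri);
        }
    }
    // Strip line breaks and tabs before parsing.
    html = System.Text.RegularExpressions.Regex.Replace(html, "(\\n|\\r|\\t)", "");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    return doc;
}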
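The AutoEncoding helper isn't listed in this post, so the following is only a sketch of what such a helper might look like, not the original implementation. It assumes the charset can be found either in the Content-Type header or in a charset declaration near the top of the HTML, and it falls back to UTF-8 otherwise. The names EncodingSketch, DetectEncoding, and Decode are hypothetical. Working on the raw bytes also avoids the second download, because the body is fetched once and decoded only after the encoding is known.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the AutoEncoding helper; names are illustrative.
public static class EncodingSketch
{
    // Matches charset=utf-8, charset="big5", charset='gb2312', and so on.
    private static readonly Regex Charset = new Regex(
        "charset\\s*=\\s*[\"']?(?<name>[A-Za-z0-9_\\-]+)", RegexOptions.IgnoreCase);

    // Looks for a charset in the Content-Type header first, then in the HTML.
    // Falls back to UTF-8 when nothing usable is found.
    public static Encoding DetectEncoding(string contentType, string htmlPrefix)
    {
        foreach (string source in new[] { contentType, htmlPrefix })
        {
            if (string.IsNullOrEmpty(source)) continue;
            Match m = Charset.Match(source);
            if (!m.Success) continue;
            try
            {
                return Encoding.GetEncoding(m.Groups["name"].Value);
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name: try the next source.
            }
        }
        return Encoding.UTF8;
    }

    // Decodes a downloaded body once, using the detected encoding.
    public static string Decode(byte[] body, string contentType)
    {
        // Charset declarations are plain ASCII, so an ASCII view of the
        // first 1024 bytes is enough to search for one.
        int n = Math.Min(body.Length, 1024);
        string prefix = Encoding.ASCII.GetString(body, 0, n);
        return DetectEncoding(contentType, prefix).GetString(body);
    }
}
```

With WebClient, the bytes would come from client.DownloadData(uri) instead of DownloadString, followed by EncodingSketch.Decode(body, client.ResponseHeaders.Get("Content-Type")). On newer .NET runtimes, legacy code pages such as Big5 may not be available by default, in which case this sketch silently falls back to UTF-8.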

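Now for the main part. There is actually nothing magical about this feature. In the Open Graph Protocol, Facebook defined a number of meta tags whose main purpose is to let Facebook recognize the content and type of any web page. Other social networking sites such as Google+ make use of this information too. Here is the official OGP description: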

The Open Graph protocol enables any web page to become a rich object in a social graph. For instance, this is used on Facebook to allow any web page to have the same functionality as any other object on Facebook.

A few of the commonly used meta tags give us exactly the information we want:

og:title: the page title
og:description: a summary of the page content
og:image: the image URL
og:site_name: the site name

<head>
    <title>Page title</title>
    <meta property="og:title" content="Page title"/>
    <meta property="og:description" content="Content summary"/>
    <meta property="og:image" content="Image URL"/>
    <meta property="og:site_name" content="Site name"/>
    ...

If the page doesn't use the meta tags defined by the Open Graph Protocol, it may still use the following tags:

<title>: the page title
<meta name="description" content="Summary of the page content" />
<meta name="thumbnail" content="Image URL" />
<link rel="image_src" href="Image URL" />

With these tags available, a program can pull out the information. Here is the code for this part:

private static void ParseHead(HtmlDocument doc, ref PageInfo info)
{
    string value = "";
    string image = "";
    // SelectNodes returns null when nothing matches, so guard before iterating.
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//head/meta");
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            // Generic tags: <meta name="..." content="...">
            switch (node.GetAttributeValue("name", ""))
            {
                case "thumbnail":
                    value = node.GetAttributeValue("content", "");
                    if (!string.IsNullOrEmpty(value)) image = value;
                    break;
                case "title":
                    value = node.GetAttributeValue("content", "");
                    if (!string.IsNullOrEmpty(value)) info.Title = value;
                    break;
                case "description":
                    value = node.GetAttributeValue("content", info.Content);
                    if (!string.IsNullOrEmpty(value)) info.Content = value;
                    break;
            }
            // Open Graph tags: <meta property="og:..." content="...">
            switch (node.GetAttributeValue("property", ""))
            {
                case "og:image":
                    image = node.GetAttributeValue("content", "");
                    break;
                case "og:title":
                    info.Title = node.GetAttributeValue("content", "");
                    break;
                case "og:description":
                    info.Content = node.GetAttributeValue("content", "");
                    break;
                case "og:site_name":
                    info.Site = node.GetAttributeValue("content", "");
                    break;
            }
        }
    }
    // <link rel="image_src" href="..."> is another way to name the image.
    nodes = doc.DocumentNode.SelectNodes("//link");
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            switch (node.GetAttributeValue("rel", ""))
            {
                case "image_src":
                    image = node.GetAttributeValue("href", "");
                    break;
            }
        }
    }
    if (!string.IsNullOrEmpty(image)) info.Images.Add(image);
}
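The PageInfo type used by ParseHead isn't shown in this post. Judging only from how it is used (Title, Content, and Site are assigned strings, and Images supports Add), it might look something like the sketch below. The exact shape of the original class is an assumption.

```csharp
using System.Collections.Generic;

// Assumed shape of PageInfo, inferred from how the code above uses it.
public class PageInfo
{
    public string Title { get; set; } = "";    // page title
    public string Content { get; set; } = "";  // article snippet
    public string Site { get; set; } = "";     // site name
    public List<string> Images { get; } = new List<string>(); // candidate image URLs
}
```

Starting every string as empty rather than null keeps the string.IsNullOrEmpty checks in the later fallback code straightforward.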

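What if the page provides none of these meta tags? We can still make one more attempt. For text, search the <p> tags: if a paragraph's text is longer than 65 characters (65 is the length that worked best in my experiments), use it; otherwise keep looking. For images, search the <img> tags and collect every .png, .jpg, or .gif file, then let the user pick a suitable one in the UI. This could be improved further by checking each image's width and height first and discarding images that are too large or too small.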

private static PageInfo GetPageInfo(PageInfo info, Uri uri, HtmlDocument doc)
{
    ParseHead(doc, ref info);

    // Fall back to the host name when no site name was found.
    if (string.IsNullOrEmpty(info.Site))
    {
        info.Site = uri.Host;
    }
    // Fall back to <title> when no title tag was found in the meta data.
    if (string.IsNullOrEmpty(info.Title))
    {
        HtmlNode nodeTitle = doc.DocumentNode.SelectSingleNode("//head/title");
        if (nodeTitle != null) info.Title = nodeTitle.InnerText;
    }
    // Fall back to the first <p> longer than 65 characters, capped at 200.
    if (string.IsNullOrEmpty(info.Content))
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                if (!string.IsNullOrWhiteSpace(node.InnerText) && node.InnerText.Length > 65)
                {
                    if (node.InnerText.Length > 200) info.Content = node.InnerText.Substring(0, 200) + "...";
                    else info.Content = node.InnerText;
                    break;
                }
            }
        }
    }
    // Collect every png/jpg/gif <img> so the user can choose among them.
    //if (info.Images.Count == 0)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                string src = node.GetAttributeValue("src", "").ToLower();
                if (src.EndsWith("png") || src.EndsWith("jpg") || src.EndsWith("gif"))
                    info.Images.Add(GetAbsoluteUrl(uri, node.GetAttributeValue("src", "")));
            }
        }
    }
    return info;
}
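Two pieces around this function are left open in the post: GetAbsoluteUrl isn't listed, and the width/height check is only mentioned as a possible improvement. The sketch below shows one way both might be done. It is not the original code; the 50 and 2000 pixel limits are arbitrary placeholders, and the size check only looks at the width and height attributes of the tag, so images without those attributes are kept rather than rejected.

```csharp
using System;

// Hypothetical helpers; names and thresholds are illustrative assumptions.
public static class ImageSketch
{
    // Resolves a possibly relative src against the page URL.
    // Returns src unchanged if it cannot be combined with the page URL.
    public static string GetAbsoluteUrl(Uri pageUri, string src)
    {
        Uri result;
        return Uri.TryCreate(pageUri, src, out result) ? result.AbsoluteUri : src;
    }

    // Returns false only when both attributes are numeric and at least one
    // dimension falls outside [min, max]. Missing or non-numeric values
    // cannot be judged here, so the image is kept.
    public static bool HasReasonableSize(string widthAttr, string heightAttr,
                                         int min = 50, int max = 2000)
    {
        int w, h;
        if (!int.TryParse(widthAttr, out w) || !int.TryParse(heightAttr, out h))
            return true;
        return w >= min && w <= max && h >= min && h <= max;
    }
}
```

Inside the <img> loop, the attributes could be read with node.GetAttributeValue("width", "") and node.GetAttributeValue("height", "") and passed to HasReasonableSize before adding the image. Checking the real pixel size would require downloading each image, which this sketch does not attempt.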

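If none of the above yields the right content, a social networking site will generally give up. If you insist on getting the right content, the only option left is case-by-case handling: writing code tailored to the HTML structure of each such site. Here is the handling I wrote for Engadget: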

private static PageInfo GetPageInfo_ChineseEngadget(PageInfo info, Uri uri, HtmlDocument doc)
{
    info.Site = "Engadget中文版";

    HtmlNode nodeTitle = doc.DocumentNode.SelectSingleNode("//head/title");
    if (nodeTitle != null) info.Title = nodeTitle.InnerText;

    // The article lives in <div class="postbody">; give up if the layout differs.
    HtmlNode nodeBody = doc.DocumentNode.SelectSingleNode("//div[@class='postbody']");
    if (nodeBody == null) return info;

    HtmlNode nodediv = nodeBody.SelectSingleNode(".//div[2]");
    if (nodediv != null) info.Content = nodediv.InnerText;

    HtmlNode nodeImage = nodeBody.SelectSingleNode(".//img");
    if (nodeImage != null) info.Images.Add(GetAbsoluteUrl(uri, nodeImage.GetAttributeValue("src", "")));

    return info;
}
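The post doesn't show how the program decides when to call a site-specific function like this one. One possible arrangement, sketched below purely as an assumption, is a lookup table keyed by host name that falls back to the generic parser. The names SiteDispatchSketch and Pick are hypothetical, and the parsers are modeled as plain Func<Uri, TResult> delegates so the sketch stays independent of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical dispatcher; not part of the original program.
public static class SiteDispatchSketch
{
    // Returns the parser registered for the URL's host, or the generic
    // parser when the host has no special handling.
    public static Func<Uri, TResult> Pick<TResult>(
        Uri uri,
        IDictionary<string, Func<Uri, TResult>> siteParsers,
        Func<Uri, TResult> genericParser)
    {
        Func<Uri, TResult> parser;
        return siteParsers.TryGetValue(uri.Host, out parser) ? parser : genericParser;
    }
}
```

In the real program, the delegates would wrap calls such as GetPageInfo_ChineseEngadget and GetPageInfo, and the dictionary would hold one entry per site that needs special treatment.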

That's about it; all that remains is presenting the result in the front-end UI. If you enjoyed this post, please +1 or Like it. Thank you.

Darren Chuang

Tuesday, November 8, 2011
