Plain Text Memo (プレーンテキスト備忘録)

Extracting Data from the Web

Last updated: 2011-08-03 13:35


Purpose

Fetch the HTML of a specified URL (a 価格.com product page) and extract
the product name, lowest price, and image URL with regular expressions.

GUI

Project download

WebGet.zip

Tips

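To match a " (double quote) in a C# regular expression written as a verbatim string literal (@"..."),

write "" instead of escaping it with \. This is a rule of the verbatim literal itself: inside @"...", "" stands for a single ". In a regular string literal, \" works as usual.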
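
The doubled quote is a feature of the string literal, not of the regex engine, so both spellings hand the same pattern to Regex. The following is a minimal sketch to show this; the HTML fragment and the names QuoteDemo, Verbatim, and Escaped are made up for illustration and are not part of the original project.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteDemo
{
    // Verbatim literal: a doubled "" stands for one double-quote character.
    public const string Verbatim = @"<p class=""price"">";

    // Regular literal: the same text written with backslash escapes.
    public const string Escaped = "<p class=\"price\">";

    public static void Main()
    {
        // Both literals produce the identical string, so either works as a pattern.
        Console.WriteLine(Verbatim == Escaped);

        // Hypothetical HTML fragment, for illustration only.
        string html = "<p class=\"price\">1,980</p>";
        Console.WriteLine(Regex.IsMatch(html, Verbatim));
    }
}
```

The verbatim form is usually easier to read for HTML patterns, because regex escapes such as \d or \s do not need a second backslash.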

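Code  Note: the code that Visual C# generates automatically is not included, so this probably will not run if copied and pasted as-is; treat it as reference only.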

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Windows.Forms;
using System.Net;

// Form1 is an assumed class name (the Visual C# default). The controls used below
// (UrlTextBox, label2, label4, pictureBox1) are declared in the omitted designer code.
public partial class Form1 : Form
{
    private void button1_Click(object sender, EventArgs e)
    {
        /* Fetch the web page, decoding it as Shift_JIS */
        string html;
        using (WebClient wc = new WebClient())
        using (Stream st = wc.OpenRead(UrlTextBox.Text))
        using (StreamReader sr = new StreamReader(st, Encoding.GetEncoding("Shift_JIS")))
        {
            html = sr.ReadToEnd();
        }

        /* Extract the product name from the fetched HTML */
        MatchCollection title = Regex.Matches(
            html, @"(?<=<h1><a href=.*><span.*>).*(?=</span> 価格比較</a>)");
        foreach (Match m in title)
        {
            label2.Text = m.Value;
        }

        /* Extract the lowest price */
        MatchCollection kakaku = Regex.Matches(
            html, @"(?<=<p class=""fontPrice wordwrapPrice"">¥).*(?=</p>)");
        foreach (Match m in kakaku)
        {
            // Many Japanese Windows fonts draw a backslash as the yen sign.
            label4.Text = "\\" + m.Value;
            break; // only the top-ranked (first) match is wanted, so stop after one pass
        }

        /* Extract the product image URL */
        MatchCollection imgurl = Regex.Matches(
            html, @"(?<=<a href=""http://kakaku.com/item/.*images/"" target=.*img src="").*?(?="".*class=""photo"".*)");
        foreach (Match m in imgurl)
        {
            pictureBox1.ImageLocation = m.Value;
        }
    }
}
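
The patterns in the code rely on lookbehind (?<=...) and lookahead (?=...) so that m.Value holds only the text between the two markers. The following is a minimal sketch of that technique on a made-up HTML fragment; the markup, the class name LookaroundDemo, and the helper ExtractPrice are hypothetical and do not reflect the actual 価格.com page. It uses a lazy .*? so that the match stops at the first closing tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper: returns the text between <p class="price"> and the
    // nearest following </p>, or null when nothing matches.
    public static string ExtractPrice(string html)
    {
        // (?<=...) must precede the match and (?=...) must follow it;
        // neither is included in m.Value. The lazy .*? stops at the first </p>.
        Match m = Regex.Match(html, @"(?<=<p class=""price"">).*?(?=</p>)");
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Made-up sample with two price paragraphs on a single line.
        string html = "<p class=\"price\">1,980</p><p class=\"price\">2,480</p>";
        Console.WriteLine(ExtractPrice(html)); // prints 1,980
    }
}
```

With a greedy .* the same input would match through to the last </p> on the line, so the greedy price pattern above presumably depends on each price sitting on its own line in the page source.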
