Open Refine

Example
Add column by fetching URLs
value=url
"http://query.yahooapis.com/v1/public/yql?q=SELECT%20*%20FROM%20html%20WHERE%20url%3D'"+escape(value, "url")+"'&format=json&diagnostics=true"
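
For comparison, the same URL construction can be sketched in Python. The helper name build_yql_url is hypothetical, and quote_plus is only assumed to approximate what GREL's escape(value, "url") produces; the two may differ on edge cases.

```python
from urllib.parse import quote_plus

YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(page_url: str) -> str:
    """Build the YQL request URL for one page URL (hypothetical helper)."""
    # Percent-encode the page URL so it can sit inside the q= parameter.
    escaped = quote_plus(page_url)
    return (
        YQL_ENDPOINT
        + "?q=SELECT%20*%20FROM%20html%20WHERE%20url%3D'"
        + escaped
        + "'&format=json&diagnostics=true"
    )


if __name__ == "__main__":
    print(build_yql_url("http://example.com/?x=1"))
```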
Example
forRange(0,20,1,k,forRange(0, 40, 2, v, parseJson(value).query.results.td[v].span.content)[k]+":::"+forRange(1, 40, 2, v, parseJson(value).query.results.td[v].p)[k])

A very tedious merge
Roughly, it takes the array [Date1, Pageview1, Date2, Pageview2,....]
and turns it into the array ["Date1:::Pageview1","Date2:::Pageview2",....] so that the records can be split later.
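
The pairing step on its own might look like this in Python. This is a minimal sketch that assumes the dates and pageviews have already been extracted into one flat list; pair_up is a hypothetical name, and the ":::" separator is taken from the GREL expression above.

```python
def pair_up(flat, sep=":::"):
    """Join each (even, odd) pair of a flat list into one 'a:::b' string."""
    # Step through the even indices; a trailing unpaired item is dropped.
    return [f"{flat[i]}{sep}{flat[i + 1]}" for i in range(0, len(flat) - 1, 2)]


if __name__ == "__main__":
    print(pair_up(["Date1", "Pageview1", "Date2", "Pageview2"]))
```

Each resulting string can then be split on ":::" to recover the two fields of a record.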

YQL

Example
SELECT * FROM html WHERE url='http://www.ly.gov.tw/05_orglaw/search/lawView.action?no=18397' and xpath='//div[contains(@class,"page_content")]' AND xpath='//div[contains(@class,"page_content_date")]'

Translation: fetch all the HTML from http://www.ly.gov.tw/05_orglaw/search/lawView.action?no=18397 and keep only the div elements whose class attribute contains "page_content_date".
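
Note that contains(@class, ...) is a substring test, so "page_content" also matches "page_content_date". The sketch below imitates that behaviour in Python on a made-up, well-formed snippet; the sample markup and the function name are assumptions for illustration, not the real page.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup, not the content of the real page.
SAMPLE_DIVS = """<body>
<div class="page_content">body text</div>
<div class="page_content_date">2013-01-01</div>
<div class="footer">footer</div>
</body>"""


def divs_with_class_containing(markup, needle):
    """Return the text of every div whose class attribute contains needle."""
    root = ET.fromstring(markup)
    return [d.text for d in root.iter("div") if needle in d.get("class", "")]


if __name__ == "__main__":
    print(divs_with_class_containing(SAMPLE_DIVS, "page_content_date"))
    print(divs_with_class_containing(SAMPLE_DIVS, "page_content"))
```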

xpath

Example
SELECT * FROM html WHERE url="http://xxxxx" AND xpath="//td[@class='xxx'] | //td[@class='ooo']"

Translation: fetch the contents of every <td>...</td> in the HTML whose class is 'xxx' or 'ooo'.
http://developer.yahoo.com/yql/guide/yql-select-xpath.html
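
What the "|" union selects can be illustrated in Python. The standard library's ElementTree does not support the XPath union operator, so this sketch swaps in a plain filter over all td elements, which keeps document order; the sample markup and the name select_cells are assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample; real HTML may need a more forgiving parser.
SAMPLE_TABLE = """<table><tr>
<td class="xxx">A</td><td class="other">B</td><td class="ooo">C</td>
</tr></table>"""


def select_cells(markup, wanted=("xxx", "ooo")):
    """Return the text of every td whose class is exactly one of wanted."""
    root = ET.fromstring(markup)
    return [td.text for td in root.iter("td") if td.get("class") in wanted]


if __name__ == "__main__":
    print(select_cells(SAMPLE_TABLE))
```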

Charset

Example
select * from html where url="http://gcis.nat.gov.tw/Fidbweb/factInfoAction.do?method=detail&estbid=09210007600289&agencyCode=376470000A" and charset='Big5'

When the page's encoding is not handled correctly, set the charset manually.
http://developer.yahoo.com/yql/guide/yql-l18n.html
http://www.dreamdu.com/xhtml/charset/
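
Why the charset matters can be seen with a small Python sketch: the same Big5 bytes are readable only when decoded as Big5. The sample text is made up, and decode_page is a hypothetical helper.

```python
def decode_page(raw: bytes, charset: str = "big5") -> str:
    """Decode raw page bytes with an explicitly chosen charset."""
    # Undecodable bytes become U+FFFD instead of raising an error.
    return raw.decode(charset, errors="replace")


SAMPLE_TEXT = "工廠"                       # made-up sample text
SAMPLE_BYTES = SAMPLE_TEXT.encode("big5")  # bytes as a Big5 page would send them

if __name__ == "__main__":
    print(decode_page(SAMPLE_BYTES, "big5"))   # readable text
    print(decode_page(SAMPLE_BYTES, "utf-8"))  # garbled text
```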
