Edit-history source for Text Mining - Intro to Python

Text Mining - source as of 2025/02/10 (Mon) 20:39:35

** Table of Contents

#contents

** Downloading the NLTK packages

Use the download function of the nltk module.

#highlight(){{
>>> import nltk
>>> nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
}}

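A downloader window opens. On the "Collections" tab, click "all" in the "Identifier" column to select it, then click the "Download" button at the lower left to start the download. It should finish after a few minutes.

After you close the window, the console should show the following.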

#highlight(){{
True
}}
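When no GUI is available (for example over SSH), the same download can be requested by package identifier instead of through the window. The following is a minimal sketch under that assumption: the helper name download_nltk_data is hypothetical, while "book" is the collection identifier that provides the data used by nltk.book.

```python
def download_nltk_data(package_id="book", quiet=True):
    """Download one NLTK data package without opening the downloader window.

    download_nltk_data is a hypothetical helper name used for illustration.
    "book" is the collection used by nltk.book; "all" fetches everything.
    """
    import nltk  # imported lazily, so defining the helper needs no network access
    # nltk.download returns True when the package is installed or already up to date.
    return nltk.download(package_id, quiet=quiet)

# Usage (requires nltk and a network connection):
#   download_nltk_data("book")
```

Downloading only "book" is much smaller than "all" and should be enough for the examples on this page, assuming only the nltk.book texts and the inaugural corpus are needed.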

Try it out.

#highlight(){{
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
}}

text1, text2, and so on are the sample data. Each one can be indexed and sliced like a list of tokens.

#highlight(){{
>>> len(text1)
260819
>>> text1[0]
'['
>>> text1[1]
'Moby'
>>> text1[0:7]
['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851']
}}
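Because a text behaves like a sequence of tokens, ordinary Python tools apply to it. The sketch below uses a plain list as a stand-in for text1[0:7] so that it runs without NLTK; the function name lexical_diversity is chosen here for illustration.

```python
# Stand-in for text1[0:7]; with NLTK loaded, text1 itself could be passed instead.
tokens = ['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851']

def lexical_diversity(tokens):
    """Return the ratio of distinct tokens to total tokens."""
    return len(set(tokens)) / len(tokens)

# Keep only purely alphabetic tokens, dropping punctuation and numbers.
words = [t for t in tokens if t.isalpha()]

print(words)                      # ['Moby', 'Dick', 'by', 'Herman', 'Melville']
print(lexical_diversity(tokens))  # 1.0, since all seven tokens are distinct
```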

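** Getting the texts of the US presidential inaugural addresses

Use the nltk module. The corpus covers the addresses from the first president, Washington, through President Biden, who took office in 2021.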

#highlight(){{
>>> import nltk
>>> from nltk.corpus import inaugural
>>> ss = nltk.corpus.inaugural.fileids()
>>> len(ss)
59
>>> ss[0:3]
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt']
>>> ss[-1:-5:-1]
['2021-Biden.txt', '2017-Trump.txt', '2013-Obama.txt', '2009-Obama.txt']
}}
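The file IDs follow a "year-name.txt" pattern, so the year and the president's name can be recovered with plain string operations. This is a minimal sketch assuming that pattern holds for every ID; the list is a stand-in for the tail of inaugural.fileids(), and parse_fileid is a hypothetical helper name.

```python
# Stand-in for the last four entries of inaugural.fileids().
fileids = ['2009-Obama.txt', '2013-Obama.txt', '2017-Trump.txt', '2021-Biden.txt']

def parse_fileid(fileid):
    """Split an ID such as '2009-Obama.txt' into (2009, 'Obama')."""
    stem = fileid.rsplit('.', 1)[0]   # drop the '.txt' extension
    year, name = stem.split('-', 1)   # split at the first hyphen only
    return int(year), name

# A slice with step -1 walks backwards from the last element,
# which is how ss[-1:-5:-1] lists the four most recent addresses.
latest_first = fileids[-1:-5:-1]

print([parse_fileid(f) for f in latest_first])
# [(2021, 'Biden'), (2017, 'Trump'), (2013, 'Obama'), (2009, 'Obama')]
```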

As a test, display part of the address given by President Obama, who took office in 2009.

#highlight(){{
>>> ss[-4]
'2009-Obama.txt'
>>> s = inaugural.raw(ss[-4])
>>> s[0:70]
'My fellow citizens:\n\nI stand here today humbled by the task before us,'
}}
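Since raw() returns an ordinary string, a first text-mining step such as counting word frequencies needs only the standard library. The sketch below runs on the 70-character excerpt shown above rather than the full address. The regular expression is a deliberately simple tokenizer chosen for illustration, not NLTK's own.

```python
import re
from collections import Counter

# The excerpt from inaugural.raw('2009-Obama.txt')[0:70] shown above.
s = 'My fellow citizens:\n\nI stand here today humbled by the task before us,'

# Lowercase the text and extract runs of letters (and apostrophes) as words.
words = re.findall(r"[a-z']+", s.lower())
counts = Counter(words)

print(len(words))   # 13
print(words[:3])    # ['my', 'fellow', 'citizens']
```

With the full address in s, counts.most_common(10) would list the ten most frequent words.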

The address is also published in an article on a newspaper's website, and the text there can be confirmed to match.

http://www.asahi.com/special/081113/TKY200901200391.html
