keiskS @technote

Unix commands for NLP

An informal, loosely organized collection.

入門UNIXシェルプログラミング―シェルの基礎から学ぶUNIXの世界

Getting a specified range of lines

http://p.tl/fwu9

Unix-for-Poets

http://www.stanford.edu/class/cs124/kwc-unix-for-poets.pdf

Natural Language Processing Using Linux

http://www.generation5.org/content/2004/nlpUnix.asp

Command line tools for NLP and Machine Learning

http://www.drmaciver.com/2009/04/command-line-tools-for-nlp-and-machine-learning/

Advanced Bash-Scripting Guide:16.4. Text Processing Commands

http://tldp.org/LDP/abs/html/textproc.html

Linux commands: list of text-processing commands

http://www.webhtm.net/unix/cmd/cmd_list_text.htm

Boosting productivity by 5% with cut, sort, and uniq

http://blog.bonar.jp/entry/20070618/1182183655

Manipulating text data freely with Linux commands

http://d.hatena.ne.jp/mi_kattun/20100916/1284631280

Viewing a specific line

sed -n 'Np' FILENAME   # N = line number (counted from 1)

In Python, a snippet I wrote a while ago can be used for this.

A specified range of lines

sed -n 'START,ENDp' FILENAME   # START, END = first and last line numbers (inclusive)
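The two sed forms above can be tried on a small generated file. This is only a sketch: the file name sample.txt and the temporary directory are made up for illustration. The last command adds q so that sed stops reading once the wanted line has been printed, which may matter on a large corpus.

```shell
# Build a 10-line sample file (hypothetical name) in a temporary directory.
tmpdir=$(mktemp -d)
seq 1 10 | sed 's/^/line /' > "$tmpdir/sample.txt"

# Print only line 3.
single=$(sed -n '3p' "$tmpdir/sample.txt")

# Print lines 4 through 6 (inclusive).
range=$(sed -n '4,6p' "$tmpdir/sample.txt")

# Same as the first command, but quit right after printing line 3
# so the rest of the file is never read.
single_fast=$(sed -n '3{p;q;}' "$tmpdir/sample.txt")

printf '%s\n' "$single" "$range"
```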

Splitting files

http://www.k-tanaka.net/unix/split.html

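split -b 100k file1    # split file1 into 100 KB pieces
split -l 1000 file2    # split file2 into pieces of 1000 lines each
for f in file*; do split -b 1m "$f" "test_$f."; done   # split takes one input file at a time, so loop over the files whose names begin with "file"; each is cut into 1 MB pieces whose names begin with "test"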
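A quick way to check what split produced is to count the pieces and concatenate them back. The sketch below assumes a made-up 2500-line file corpus.txt and the output prefix chunk_; both names are only for illustration.

```shell
# Create a 2500-line sample file (hypothetical name) in a temporary directory.
tmpdir=$(mktemp -d)
seq 1 2500 > "$tmpdir/corpus.txt"

# Split into 1000-line pieces: chunk_aa, chunk_ab, chunk_ac.
split -l 1000 "$tmpdir/corpus.txt" "$tmpdir/chunk_"

# Number of pieces and the size of the last, shorter one.
n_chunks=$(ls "$tmpdir"/chunk_* | wc -l | tr -d ' ')
last_lines=$(wc -l < "$tmpdir/chunk_ac" | tr -d ' ')

# Concatenating the pieces in name order should give back the original file.
roundtrip=no
cat "$tmpdir"/chunk_* | cmp -s - "$tmpdir/corpus.txt" && roundtrip=ok

echo "$n_chunks pieces, last has $last_lines lines, roundtrip: $roundtrip"
```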

xargs

http://www.nxmnpg.com/ja/1/xargs
To parallelize explicitly, add the -P option (e.g. -P16 runs up to 16 processes at once).

e.g. find . -name "*.xml" | xargs -I% -P16 cp % %.txt
The example above is a trivial job, but this is handy for cases such as parsing files produced by split in parallel.
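The split-then-parse idea could look roughly like the following. This is a sketch under assumptions: corpus.txt and the part_ prefix are made-up names, and wc -l merely stands in for a real parser. Each job writes its result next to its input chunk, and the results are merged at the end.

```shell
# Create a 4000-line sample file and split it into four 1000-line chunks.
tmpdir=$(mktemp -d)
seq 1 4000 > "$tmpdir/corpus.txt"
split -l 1000 "$tmpdir/corpus.txt" "$tmpdir/part_"

# Run up to 4 jobs at once. Each job reads one chunk and writes <chunk>.count.
# 'part_??' matches only the chunks, not the .count files being created.
# wc -l is a placeholder for the real per-chunk processing.
find "$tmpdir" -name 'part_??' -print0 |
  xargs -0 -P4 -I% sh -c 'wc -l < "$1" > "$1.count"' _ %

# Merge the per-chunk results.
n_results=$(ls "$tmpdir"/part_*.count | wc -l | tr -d ' ')
total=$(cat "$tmpdir"/part_*.count | awk '{s += $1} END {print s}')
echo "$n_results result files, $total lines in total"
```

Passing the file name to sh -c as "$1" rather than pasting % into the command string keeps names with spaces or quotes from being interpreted by the shell.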

comm

http://www.k-tanaka.net/unix/comm.html
Looks useful for evaluation.

comm [option] file1 file2   # both files must already be sorted

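Option 	Effect
-1 	Suppress the left column (lines only in file1)
-2 	Suppress the middle column (lines only in file2)
-3 	Suppress the right column (lines in both files)
-12 	Output only the right column (lines in both files)
-13 	Output only the middle column (lines only in file2)
-23 	Output only the left column (lines only in file1)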

Command examples
comm file1 file2 	Compare file1 and file2 and show the differences in three columns
comm -12 file1 file2 	Show the lines that are present in both files
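For evaluation, the three column selections map naturally onto misses, false alarms, and hits. The sketch below uses two tiny made-up files, gold.txt and pred.txt, and sorts them first because comm expects sorted input.

```shell
# Two small sample files (hypothetical names), sorted as comm requires.
tmpdir=$(mktemp -d)
printf '%s\n' apple banana cherry | sort > "$tmpdir/gold.txt"
printf '%s\n' banana cherry durian | sort > "$tmpdir/pred.txt"

# Lines only in gold.txt (items the prediction missed).
only_gold=$(comm -23 "$tmpdir/gold.txt" "$tmpdir/pred.txt")

# Lines only in pred.txt (items predicted but not in gold).
only_pred=$(comm -13 "$tmpdir/gold.txt" "$tmpdir/pred.txt")

# Lines in both files (correct predictions).
both=$(comm -12 "$tmpdir/gold.txt" "$tmpdir/pred.txt")

printf 'only gold: %s\nonly pred: %s\nboth:\n%s\n' "$only_gold" "$only_pred" "$both"
```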

Also looks useful for error analysis.
comm -23 A B | sort | uniq -c | sort -nr
Ranks by frequency the errors that appear only in A (the difference A - B).
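A worked instance of the ranking pipeline, assuming each line of A and B is one error label (the tag-confusion labels and file names below are made up). When a label occurs three times in A and once in B, comm pairs off one occurrence and leaves two in the "only in A" column, so the counts reflect the surplus in A.

```shell
# One error label per line (made-up data); comm needs both files sorted.
tmpdir=$(mktemp -d)
printf '%s\n' 'NN->VB' 'NN->VB' 'NN->VB' 'JJ->NN' 'DT->IN' | sort > "$tmpdir/A.txt"
printf '%s\n' 'NN->VB' 'JJ->NN' | sort > "$tmpdir/B.txt"

# Errors left over in A after removing those shared with B, ranked by count.
# The final awk only normalizes the leading spaces that uniq -c prints.
ranking=$(comm -23 "$tmpdir/A.txt" "$tmpdir/B.txt" |
  sort | uniq -c | sort -nr | awk '{print $1, $2}')

printf '%s\n' "$ranking"
```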

service-related commands

(sudo) service httpd start     # turn the web server on
(sudo) service iptables stop   # turn the firewall off
(sudo) chkconfig httpd on      # keep httpd enabled after a restart as well

Making full use of the find command

The \( \), -not, -o, and -wholename options are especially useful.
http://qiita.com/catfist/items/5504511fa5a028fc7c41
http://d.hatena.ne.jp/mrgoofy33/20100823/1282576209
As a one-liner: for example, finding all the mrg files in the 02-21 directories of wsj?
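One possible answer, as a sketch: it assumes a layout like wsj/NN/*.mrg with two-digit section directories, and builds a tiny fake tree to try it on (all file names here are made up). The range 02-21 is written as three glob patterns joined with -o inside \( \), and -not is shown in a second command.

```shell
# Build a small fake tree: wsj/<section>/ with one .mrg and one .txt file each.
tmpdir=$(mktemp -d)
for sec in 00 01 02 10 21 22; do
  mkdir -p "$tmpdir/wsj/$sec"
  touch "$tmpdir/wsj/$sec/wsj_${sec}01.mrg" "$tmpdir/wsj/$sec/wsj_${sec}01.txt"
done

# All .mrg files in sections 02-21: 02-09, 10-19, or 20-21.
found=$(find "$tmpdir/wsj" -type f -name '*.mrg' \
  \( -wholename '*/wsj/0[2-9]/*' -o -wholename '*/wsj/1[0-9]/*' \
     -o -wholename '*/wsj/2[01]/*' \) | sort)
count=$(printf '%s\n' "$found" | wc -l | tr -d ' ')

# -not inverts a test: every regular file that is not a .mrg file.
non_mrg=$(find "$tmpdir/wsj" -type f -not -name '*.mrg' | wc -l | tr -d ' ')

echo "$count mrg files in sections 02-21, $non_mrg non-mrg files overall"
```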


Last updated: 2016-08-17 20:16
