[Parallel translation] Training Tesseract 4.00

original (as of 2019/05/14)	Google Translate (as of 2019/05/28)
# How to use the tools provided to train Tesseract 4.00	#Tesseract 4.00をトレーニングするためのツールの使い方
Have questions about the training process? If you had some problems during	トレーニングプロセスについて質問がありますか？途中で問題が発生した場合
the training process and you need help, use	トレーニングプロセスとあなたは助けが必要です
tesseract-ocr	tesseract-ocr
mailing-list to ask your question(s). PLEASE DO NOT report your problems and	あなたの質問をするメーリングリスト。問題を報告しないでください。
ask questions about training as	トレーニングについて質問する
issues!	問題！
* Introduction	* はじめに
* Before You Start	* 始める前に
* Additional Libraries Required	* 追加の図書館が必要
* Building the Training Tools	* トレーニングツールの構築
* Hardware-Software Requirements	* ハードウェア - ソフトウェア要件
* Training Text Requirements	* トレーニングテキストの要件
* Overview of Training Process	* トレーニングプロセスの概要
* Understanding the Various Files Used During Training	* トレーニング中に使用される各種ファイルの理解
* LSTMTraining Command Line	* LSTMTrainingコマンドライン
* Unicharset Compression-recoding	* Unicharset圧縮 - 再コーディング
* Randomized Training Data and sequential_training	* 無作為化訓練データと順次訓練
* Model output	* モデル出力
* Net Mode and Optimization	* ネットモードと最適化
* Perfect Sample Delay	* 完全サンプル遅延
* Debug Interval and Visual Debugging	* デバッグ間隔とビジュアルデバッグ
* TessTutorial	* TessTutorial
* One-time Setup for TessTutorial	* TessTutorialのワンタイムセットアップ
* Creating Training Data	* トレーニングデータの作成
* Making Box Files	* ボックスファイルの作成
* Using tesstrain.sh	* using tesstrain.sh
* Tutorial guide to lstmtraining	* lstmtrainingのチュートリアルガイド
* Creating Starter Traineddata	* スタータートレーニングデータの作成
* Training From Scratch	* 最初からのトレーニング
* Fine Tuning for Impact	* インパクトの微調整
* Fine Tuning for ± a few characters	* 数文字の微調整
* Training Just a Few Layers	* 数層だけのトレーニング
* Error Messages From Training	* トレーニングからのエラーメッセージ
* Combining the Output Files	* 出力ファイルの結合
* The Hallucination Effect	* 幻覚効果
# Introduction	# 前書き
Tesseract 4.00 includes a new neural network-based recognition engine that	Tesseract 4.00には、ニューラルネットワークベースの新しい認識エンジンが含まれています。
delivers significantly higher accuracy (on document images) than the previous	以前よりもはるかに高い精度(ドキュメント画像)を実現
versions, in return for a significant increase in required compute power. On	必要な計算能力の大幅な増加と引き換えにバージョン。に
complex languages however, it may actually be faster than base Tesseract.	複雑な言語ですが、実際には基本的なTesseractよりも速い可能性があります。
Neural networks require significantly more training data and train a lot	ニューラルネットワークはかなり多くのトレーニングデータとトレーニングを必要とします
slower than base Tesseract. For Latin-based languages, the existing model data	ベースよりも遅いTesseract。ラテン系言語の場合、既存のモデルデータ
provided has been trained on about [400000 textlines spanning about 4500	提供された約4500に及ぶ約[400000のテキスト行で訓練されています
fonts](https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951).	フォント](https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951)。
For other scripts, not so many fonts are available, but they have still been	他のスクリプトでは、それほど多くのフォントが利用可能ではありませんが、それらはまだ使用されています。
trained on a similar number of textlines. Instead of taking a few minutes to a	同数のテキスト行について訓練を受けた。に数分かかる代わりに
couple of hours to train, Tesseract 4.00 takes a few days to a couple of	Tesseract 4.00のトレーニングに数時間かかるには、数日から数日かかります。
weeks. Even with all this new training data, you might find it inadequate for	*この新しいトレーニングデータをすべて使用しても、次の学習には不十分な場合があります。
your particular problem, and therefore you are here wanting to retrain it.	あなたの特定の問題、それゆえあなたはここでそれを再訓練したいのです。
There are multiple options for training:	トレーニングには複数の選択肢があります。
* Fine tune. Starting with an existing trained language, train on your	* 微調整。既存の訓練された言語から始めて、あなたの
specific additional data. This may work for problems that are close to the	特定の追加データこれはに近い問題のために働くかもしれません
existing training data, but different in some subtle way, like a	既存のトレーニングデータですが、微妙に異なる点があります。
particularly unusual font. May work with even a small amount of training	特に変わったフォントです。少量のトレーニングでも動作可能
data.	データ。
* Cut off the top layer (or some arbitrary number of layers) from the network	*ネットワークから最上位層(または任意の数の層)を切り取る
and retrain a new top layer using the new data. If fine tuning doesn't work,	新しいデータを使用して新しい最上層を再トレーニングします。微調整がうまくいかない場合は、
this is most likely the next best option. Cutting off the top layer could	これはおそらく次善の策です。最上層を切断すると
still work for training a completely new language or script, if you start	あなたが始めるなら、まだ全く新しい言語やスクリプトを訓練するために働く
with the most similar looking script.	最もよく似たスクリプトを使って。
* Retrain from scratch. This is a daunting task, unless you have a very	*傷を付けないでください。あなたが非常に持っていない限り、これは大変な作業です。
representative and sufficiently large training set for your problem. If not,	あなたの問題のための代表的で十分に大きなトレーニングセット。そうでなければ、
you are likely to end up with an over-fitted network that does really well	あなたは本当にうまく機能しているオーバーフィットネットワークになってしまう可能性があります
on the training data, but not on the actual data.	実際のデータではなく、トレーニングデータに関するものです。
While the above options may sound different, the training steps are actually	上記のオプションは異なるように聞こえるかもしれませんが、トレーニングのステップは実際には
almost identical, apart from the command line, so it is relatively easy to try	コマンドラインを除けばほとんど同じですので、試すのは比較的簡単です。
them all, given the time or hardware to run them in parallel.	それらを並行して実行する時間またはハードウェアを考えれば、それはすべての方法です。
For 4.00 at least, the old recognition engine is still present, and can also be	少なくとも4.00の場合、古い認識エンジンはまだ存在しています。
trained, but is deprecated, and, unless good reasons materialize to keep it, may	訓練されていますが、推奨されていません。
be deleted in a future release.	将来のリリースで削除される予定です。
# Before You Start	#始める前に
You don't need any background in neural networks to train Tesseract 4.00, but it	Tesseract 4.00をトレーニングするためにニューラルネットワークの背景は必要ありませんが、
may help in understanding the difference between the training options. Please	トレーニングの選択肢の違いを理解するのに役立つかもしれません。お願いします
read the Implementation introduction before delving	掘り下げる前に実装の紹介を読んでください
too deeply into the training process, and the same note as for training	トレーニングプロセスの深さが深すぎ、トレーニングと同じ
Tesseract 3.04 applies:	Tesseract 3.04が適用されます。
Important note: Before you invest time and effort on training Tesseract, it	重要な注意事項:Tesseractのトレーニングに時間と労力を費やす前に、
is highly recommended to read the ImproveQuality page.	ImproveQualityページを読むことを強くお勧めします。
# Additional Libraries Required	#追加のライブラリが必要
Beginning with 3.03, additional libraries are required to build the training	3.03以降、トレーニングを構築するために追加のライブラリが必要になりました
tools.	ツール
```bash
sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev
```
# Building the Training Tools	#トレーニングツールの構築
Beginning with 3.03, if you're compiling Tesseract from source you need to make	3.03から始めて、Tesseractをソースからコンパイルしているならば、あなたは作る必要があります
and install the training tools with separate make commands. Once the above	そして別々のmakeコマンドでトレーニングツールをインストールします。上記の
additional libraries have been installed, run the following from the Tesseract	追加のライブラリがインストールされているので、Tesseractから以下を実行してください。
source directory:	ソースディレクトリ:
```bash
./configure
```
By default Tesseract configuration will proceed if dependencies required only	デフォルトでは、依存関係が必要な場合にのみTesseract構成が続行されます
for training are missing, but for training, you will have to ensure all those	訓練のために不足しているが、訓練のために、あなたはそれらすべてを確実にしなければならないでしょう
optional dependencies are installed and that Tesseract's build environment	オプションの依存関係がインストールされ、そのTesseractのビルド環境
can locate them. Look for these lines in the output of `./configure`:	それらを見つけることができます。 `。/ configure`の出力でこれらの行を探してください:
```
checking for pkg-config... [some valid path]
checking for lept >= 1.74... yes
checking for libarchive... yes
checking for icu-uc >= 52.1... yes
checking for icu-i18n >= 52.1... yes
checking for pango >= 1.22.0... yes
checking for cairo... yes
[...]
Training tools can be built and installed with:
```
(The version numbers may change over time, of course. What we are looking for is	(もちろん、バージョン番号は時とともに変わるかもしれません。探しているのは、
"yes", all of the optional dependencies are available.)	"はい"、すべてのオプションの依存関係が利用可能です。)
If configure does not say the training tools can be built, you still need to add	もしconfigureがトレーニングツールが構築できると言っていない場合でも、追加する必要があります。
libraries or ensure that `pkg-config` can find them.	またはpkg-configがそれらを見つけることができることを確認してください。
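One way to spot a missing dependency is to scan the `./configure` output for `checking for ...` lines that did not end in `yes`. The sketch below is illustrative only: the function name `missing_training_deps` and the idea of saving the output to a file such as `configure.log` are assumptions, not part of Tesseract's build system, and the dependency list simply mirrors the sample output above.

```bash
#!/usr/bin/env bash
# Sketch: list optional training dependencies that ./configure did not find.
# Reads configure output on stdin and prints one dependency name per line.
# Hypothetical helper; the names below are copied from the sample output.
missing_training_deps() {
  local deps="lept libarchive icu-uc icu-i18n pango cairo"
  local input dep
  input=$(cat)
  for dep in $deps; do
    # A dependency counts as found only if its "checking for" line ends in "yes".
    if ! grep -Eq "^checking for ${dep}( .*)?\.\.\. yes\$" <<<"$input"; then
      echo "$dep"
    fi
  done
}

# Possible usage (not run here):
#   ./configure 2>&1 | tee configure.log
#   missing_training_deps < configure.log
```

An empty result means every listed check reported `yes`; any name printed needs to be installed or made visible to `pkg-config`.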
After configuring, you can attempt to build the training tools:	設定した後、トレーニングツールを構築することを試みることができます。
```bash
make
make training
sudo make training-install
```
It is also useful, but not required, to build ScrollView.jar:	ScrollView.jarをビルドすることも便利ですが必須ではありません。
```bash
make ScrollView.jar
export SCROLLVIEW_PATH=$PWD/java
```
## On macOS Mojave with Homebrew	##自作のmacOSモハーベについて
Homebrew has an unusual way of setting up `pkgconfig` so you must opt-in to certain files.	自作は `pkgconfig`を設定する珍しい方法を持っているので、あなたは特定のファイルにオプトインしなければなりません。
In general run `brew info package` and ensure that you append the mentioned PKG_CONFIG_PATH	一般に `brew info package`を実行して、あなたが言及されたPKG_CONFIG_PATHを追加することを確実にしてください
to this environment variable.	この環境変数に。
```bash
brew install cairo pango icu4c autoconf libffi libarchive
export PKG_CONFIG_PATH=\
$(brew --prefix)/lib/pkgconfig:\
$(brew --prefix)/opt/libarchive/lib/pkgconfig:\
$(brew --prefix)/opt/icu4c/lib/pkgconfig:\
$(brew --prefix)/opt/libffi/lib/pkgconfig
./configure
```
# Hardware-Software Requirements	#ハードウェア - ソフトウェア要件
At time of writing, training only works on Linux. (macOS almost works; it requires	これを書いている時点では、トレーニングはLinuxでしか機能しません。 (macOSはほとんど動作します。
minor hacks to the shell scripts to account for the older version of `bash` it	古いバージョンの `bash`を考慮するためのシェルスクリプトへのマイナーハック
provides and differences in `mktemp`.) Windows is unknown, but would need msys or Cygwin.	Windowsは知られていませんが、msysかCygwinが必要でしょう。
As for running Tesseract 4.0.0, it is useful, but not essential to have a multi-core (4 is good)	Tesseract 4.0.0を実行することに関しては、それは有用ですが、マルチコアを持つことは必須ではありません(4が良いです)
machine, with OpenMP and Intel Intrinsics support for SSE/AVX extensions.	OpenMPおよびIntel IntrinsicsがSSE / AVX拡張をサポートしているマシン。
Basically it will still run on anything with enough memory, but the higher-end	基本的にはまだ十分なメモリがあるものなら何でも実行されますが、ハイエンド
your processor is, the faster it will go. No GPU is needed. (No support.) Memory	あなたのプロセッサは、速くなるでしょう。 GPUは必要ありません。 (サポートなし)
use can be controlled via the --max_image_MB command-line option, but you are	使用は--max_image_MBコマンドラインオプションで制御できますが、
likely to need at least 1GB of memory over and above what is taken by your OS.	お使いのOSが使う容量以上に少なくとも1GBのメモリが必要です。
# Training Text Requirements	#トレーニングテキストの要件
For Latin-based languages, the existing model data provided has been trained on	ラテン系言語の場合、提供されている既存のモデルデータは
about [400000 textlines spanning about 4500	約4500に及ぶ400000のテキスト行
fonts](https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951).	フォント](https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951)。
For other scripts, not so many fonts are available, but they have still been	他のスクリプトでは、それほど多くのフォントが利用可能ではありませんが、それらはまだ使用されています。
trained on a similar number of textlines.	同数のテキスト行について訓練を受けた。
Note that it is beneficial to have more training text and make more pages	より多くのトレーニングテキストを用意し、より多くのページを作成することは有益であることに注意してください。
though, as neural nets don't generalize as well and need to train on something	ただし、ニューラルネットは一般化されていないので、何かを訓練する必要があります。
similar to what they will be running on. If the target domain is severely	彼らが走っているものに似ています。ターゲットドメインが深刻な場合
limited, then all the dire warnings about needing a lot of training data may not	多くの訓練データが必要であることについてのすべての悲惨な警告が制限されるわけではありません
apply, but the network specification may need to be changed.	適用されますが、ネットワーク仕様を変更する必要があるかもしれません。
# Overview of Training Process	#トレーニングプロセスの概要
The overall training process is similar to training 3.04.	全体的なトレーニングプロセスはtraining 3.04に似ています。
Conceptually the same:	概念的には同じです。
1. [Prepare training	1. [トレーニングを準備する
text.](https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951)	]。(https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951)
1. Render text to image + box file. (Or create hand-made box files for existing	1.テキストを画像+ボックスファイルにレンダリングします。 (または既存の手作りボックスファイルを作成する
image data.)	画像データ)
1. Make unicharset file. (Can be partially specified, i.e. created manually.)	1. unicharsetファイルを作ります。 (部分的に指定、つまり手動で作成できます)
1. [Make a starter traineddata from the unicharset and optional dictionary	1. [unicharsetとオプションの辞書からスタータートレーニングデータを作成します。
data.](#creating-starter-traineddata)	](#creating-starter-traineddata)
1. Run tesseract to process image + box file to make training data set.	1. tesseractを実行して、画像+ボックスファイルを処理してトレーニングデータセットを作成します。
1. Run training on training data set.	1.トレーニングデータセットでトレーニングを実行します。
1. Combine data files.	1.データファイルを結合します。
The key differences are:	主な違いは以下のとおりです。
* **The boxes only need to be at the textline level.** It is thus *far easier*	ボックスは textlineレベルである必要がありますしたがってはるかに簡単です*
to make training data from existing image data.	既存の画像データからトレーニングデータを作成する。
* The .tr files are replaced by .lstmf data files.	* .trファイルは.lstmfデータファイルに置き換えられます。
* **Fonts can and should be mixed freely** instead of being separate.	フォントは別々ではなく自由に混在させることができます。
* The clustering steps (mftraining, cntraining, shapeclustering) are replaced	*クラスタリングのステップ(mftraining、cntraining、shapeclustering)は置き換えられました
with a single slow lstmtraining step.	1回のゆっくりとしたlstmtrainingステップで。
The training cannot be quite as automated as the training for 3.04 for several	トレーニングは、いくつかの3.04のトレーニングほど自動化できません。
reasons:	理由:
* The slow training step isn't good to run from the middle of a script as it	*遅いトレーニングステップは、スクリプトの途中から実行するのは良くありません。
can be restarted if stopped, and it is hard to tell automatically when it is	停止した場合は再起動できますが、いつ停止したかを自動的に判断するのは困難です。
finished.	終了しました。
* There are multiple options for how to train the network (see above).	*ネットワークのトレーニング方法には複数の選択肢があります(上記参照)。
* The language models and unicharset are allowed to be different from those	*言語モデルとユニキャストは、それらと異なるものにすることができます。
used by base Tesseract, but don't have to be.	ベースのTesseractで使用されますが、使用する必要はありません。
* It isn't necessary to have a base Tesseract of the same language as the	*ベース言語と同じ言語のTesseractを持つ必要はありません。
neural net Tesseract.	ニューラルネットTesseract。
# Understanding the Various Files Used During Training	#トレーニング中に使用されるさまざまなファイルを理解する
As with base Tesseract, the completed LSTM model and everything else it needs is	基本的なTesseractと同様に、完成したLSTMモデルとそれが必要とする他のすべては、
collected in the `traineddata` file. Unlike base Tesseract, a `starter traineddata`	`traineddata`ファイルに集められています。基本のTesseractとは異なり、`starter traineddata`
file is given during training, and has to be set up in advance. It	ファイルはトレーニング中に提供され、事前に設定する必要があります。それ
can contain:	含めることができます:
* Config file providing control parameters.	*制御パラメータを提供する設定ファイル。
* **Unicharset** defining the character set.	* Unicharset は文字セットを定義します。
* **Unicharcompress**, aka the recoder, which maps the unicharset further to	* Unicharcompress、別名Recoder。ユニキャストをさらにマップします。
the codes actually used by the neural network recognizer.	ニューラルネットワーク認識装置によって実際に使用されるコード。
* Punctuation pattern dawg, with patterns of punctuation allowed around words.	句読点のパターンは単語の周りに許されています。
* Word dawg. The system word-list language model.	*言葉は夜明け。システムの単語リスト言語モデル
* Number dawg, with patterns of numbers that are allowed.	*許可されている数字のパターンを持つ数字dawg。
Bold elements must be provided. Others are optional, but if any of the dawgs	太字の要素**を指定する必要があります。その他はオプションですが、いずれかの夜明けの場合
are provided, the punctuation dawg must also be provided. A new tool:	が提供されている場合は、句読点も提供する必要があります。新しいツール
`combine_lang_model` is provided to make a `starter traineddata` from a	`combine_lang_model`は、`starter traineddata`を作成するために提供されています。
`unicharset` and optional wordlists.	`unicharset`とオプションの単語リスト。
During training, the trainer writes checkpoint files, which is a standard	トレーニング中に、トレーナーは標準であるチェックポイントファイルを書き込みます。
behavior for neural network trainers. This allows training to be stopped and	ニューラルネットワークトレーナーの行動これにより、トレーニングを中止することができます。
continued again later if desired. Any checkpoint can be converted to a full	必要に応じて後でもう一度続けた。どのチェックポイントもフルに変換できます
`traineddata` for recognition by using the `--stop_training` command-line flag.	`--stop_training`コマンドラインフラグを使って認識するための`traineddata`。
The trainer also periodically writes checkpoint files at new bests achieved	トレーナーはまた、達成された新しいベストで定期的にチェックポイントファイルを書き込みます。
during training.	トレーニング中
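As a sketch, converting a checkpoint could look like the helper below. Only `--stop_training`, `--continue_from`, `--traineddata` and `--model_output` come from the flag table in this document; the function name, the `DRY_RUN` switch and all paths are hypothetical. With `DRY_RUN=1` (the default here) the command is only printed, so it can be inspected before anything runs.

```bash
#!/usr/bin/env bash
# Sketch: turn a training checkpoint into a full traineddata file.
# Usage: checkpoint_to_traineddata CHECKPOINT STARTER_TRAINEDDATA OUTPUT
# Hypothetical helper; only the lstmtraining flags are taken from the text.
checkpoint_to_traineddata() {
  local checkpoint=$1 starter=$2 output=$3
  local cmd=(lstmtraining --stop_training
             --continue_from "$checkpoint"
             --traineddata "$starter"
             --model_output "$output")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Print the command instead of running it.
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Hypothetical paths, for illustration only.
checkpoint_to_traineddata out/base_checkpoint starter/eng.traineddata out/eng.traineddata
```

Setting `DRY_RUN=0` runs the command for real, which requires the training tools to be installed.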
It is possible to modify the network and retrain just part of it, or fine tune	ネットワークを変更してその一部だけを再トレーニングすることも、微調整することもできます。
for specific training data (even with a modified unicharset!) by telling the	特定のトレーニングデータ(変更されたユニキャストを使用した場合でも)については、
trainer to `--continue_from` either an existing checkpoint file, or from a naked	既存のチェックポイントファイル、または裸のファイルのどちらかから `--continue_from`へのトレーナー
LSTM model file that has been extracted from an existing `traineddata` file	既存の `traineddata`ファイルから抽出されたLSTMモデルファイル
using `combine_tessdata` provided it has not been converted to integer.	`combine_tessdata`を使って整数に変換されていないことを示します。
If the unicharset is changed in the `--traineddata` flag, compared to the one	ユニキャストが `--traineddata`フラグで変更された場合、
that was used in the model provided via `--continue_from`, then the	それは `--continue_from`で提供されるモデルで使われていました
`--old_traineddata` flag must be provided with the corresponding `traineddata`	`--old_traineddata`フラグは対応する`traineddata`と共に提供されなければなりません
file that holds the `unicharset` and `recoder`. This enables the trainer to	`unicharset`と`recoder`を保持するファイル
compute the mapping between the character sets.	文字セット間のマッピングを計算します。
The training data is provided via `.lstmf` files, which are serialized	訓練データはシリアライズされた `.lstmf`ファイルを通して提供されます
`DocumentData`. They contain an image and the corresponding UTF-8 text	`DocumentData`画像とそれに対応するUTF8テキストを含みます
transcription, and can be generated from tif/box file pairs using Tesseract in a	Tesseractを使用してTIF /ボックスファイルのペアから生成できます。
similar manner to the way `.tr` files were created for the old engine.	古いエンジン用に作成された `.tr`ファイルと同じような方法です。
# LSTMTraining Command Line	#LSTMトレーニングのコマンドライン
The lstmtraining program is a multi-purpose tool for training neural networks.	lstmtrainingプログラムは、ニューラルネットワークを訓練するための多目的ツールです。
The following table describes its command-line options:	次の表は、コマンドラインオプションについて説明しています。
Flag	Type	Default	Explanation			旗	タイプ	デフォルト	説明
:---------------------	:------:	:---------:	:-----------------			:---------------------	:------:	:---------:	:-----------------
`traineddata`	`string`	none	Path to the starter traineddata file that contains the unicharset, recoder and optional language model.			トレーニングデータ`string`	なしunicharset、recoder、およびオプションの言語モデルを含むスターター訓練データファイルへのパス。
`net_spec`	`string`	none	Specifies the topology of the network.			`net_spec`	`string`	なしネットワークのトポロジを指定します。
`model_output`	`string`	none	Base path of output model files/checkpoints.			`model_output`	`string`	なし出力モデルファイル/チェックポイントのベースパス。
`max_image_MB`	`int`	`6000`	Maximum amount of memory to use for caching images.			`max_image_MB`	`int`	`6000`	画像のキャッシュに使用する最大メモリ量。
`learning_rate`	`double`	`10e-4`	Initial learning rate for SGD algorithm.			`learning_rate`	`double`	`10e-4`	SGDアルゴリズムの初期学習率
`sequential_training`	`bool`	`false`	Set to true for sequential training. Default is to process all training data in round-robin fashion.			`sequential_training`	「ブール」 `false`	シーケンシャルトレーニングの場合はtrueに設定します。デフォルトでは、すべてのトレーニングデータをラウンドロビン方式で処理します。
`net_mode`	`int`	`192`	Flags from `NetworkFlags` in `network.h`. Possible values: `128` for Adam optimization instead of momentum; `64` to allow different layers to have their own learning rates, discovered automatically.			`net_mode`	`int`	`192`	`network.h`の`NetworkFlags`からのフラグ。可能な値:Adamの最適化では、運動量ではなく `128`。自動的に発見される、異なる層が独自の学習率を持つことを可能にするための `64`。
`perfect_sample_delay`	`int`	`0`	When the network gets good, only backprop a perfect sample after this many imperfect samples have been seen since the last perfect sample was allowed through.			`perfect_sample_delay`	`int`	`0`	ネットワークが良くなったとき、最後の完璧なサンプルが通過するのを許されて以来この多くの不完全なサンプルが見られた後に完璧なサンプルをバックプロップするだけです。
`debug_interval`	`int`	`0`	If non-zero, show visual debugging every this many iterations.			`debug_interval`	`int`	`0`	ゼロ以外の場合、何度も繰り返すたびにビジュアルデバッグを表示します。
`weight_range`	`double`	`0.1`	Range of random values to initialize weights.			`weight_range`	`double`	`0.1`	重みを初期化するためのランダムな値の範囲。
`momentum`	`double`	`0.5`	Momentum for alpha smoothing gradients.			勢い	`double`	`0.5`	アルファスムージンググラデーションの運動量
`adam_beta`	`double`	`0.999`	Smoothing factor squared gradients in ADAM algorithm.			`adam_beta`	`double`	`0.999`	ADAMアルゴリズムにおける平滑化係数二乗勾配
`max_iterations`	`int`	`0`	Stop training after this many iterations.			`max_iterations`	`int`	`0`	この繰り返しの後にトレーニングをやめてください。
`target_error_rate`	`double`	`0.01`	Stop training if the mean percent error rate gets below this value.			`target_error_rate`	`double`	`0.01`	平均エラー率がこの値を下回った場合は、トレーニングを中止してください。
`continue_from`	`string`	none	Path to previous checkpoint from which to continue training or fine tune.			`continue_from`	`string`	なしトレーニングまたは微調整を続ける前のチェックポイントへのパス。
`stop_training`	`bool`	`false`	Convert the training checkpoint in `--continue_from` to a recognition model.			`stop_training`	「ブール」 `false`	`--continue_from`の訓練チェックポイントを認識モデルに変換します。
`convert_to_int`	`bool`	`false`	With `stop_training`, convert to 8-bit integer for greater speed, with slightly less accuracy.			`convert_to_int`	「ブール」 `false`	`stop_training`を使うと、少し速度を遅くして速度を上げるために8ビット整数に変換します。
`append_index`	`int`	`-1`	Cut the head off the network at the given index and append `--net_spec` network in place of the cut off part.			`append_index`	`int`	`-1`	与えられたインデックスでネットワークから頭を切り取って、切り取られた部分の代わりに `--net_spec`ネットワークを追加します。
`train_listfile`	`string`	none	Filename of a file listing training data files.			`train_listfile`	`string`	なしトレーニングデータファイルをリストしたファイルのファイル名。
`eval_listfile`	`string`	none	Filename of a file listing evaluation data files to be used in evaluating the model independently of the training data.			`eval_listfile`	`string`	なしトレーニングデータとは別にモデルの評価に使用される評価データファイルをリストしたファイルのファイル名。
Most of the flags work with defaults, and several are only required for	ほとんどのフラグはデフォルトで動作します。いくつかのフラグはデフォルトでのみ必要です。
particular operations listed below, but first some detailed comments on the more	以下にリストされている特定の操作ですが、最初にいくつかの詳細なコメント
complex flags:	複雑なフラグ:
## Unicharset Compression-recoding	## Unicharset圧縮 - 再コーディング
LSTMs are great at learning sequences, but slow down a lot when the number of	LSTMはシーケンスの学習には優れていますが、数が多くなると遅くなります。
states is too large. There are empirical results that suggest it is better to	州が大きすぎます。それがより良いことを示唆する経験的な結果があります
ask an LSTM to learn a long sequence than a short sequence of many classes, so	LSTMに、多くのクラスの短いシーケンスよりも長いシーケンスを学ぶように依頼します。
for the complex scripts, (Han, Hangul, and the Indic scripts) it is better to	複雑なスクリプト(Han、Hangul、およびIndicスクリプト)の場合は、
recode each symbol as a short sequence of codes from a small number of classes	各シンボルを少数のクラスからの短いコードシーケンスとして再コード化する
than have a large set of classes. The `combine_lang_model` command has this	たくさんのクラスがあります。 `combine_lang_model`コマンドはこれを持っています
feature on by default. It encodes each Han character as a variable-length	機能はデフォルトでオンになっています。各漢字を可変長としてエンコードします。
sequence of 1-5 codes, Hangul using the Jamo encoding as a sequence of 3 codes,	1~5コードのシーケンス、ハングルは3コードのシーケンスとしてJamoエンコードを使用
and other scripts as a sequence of their unicode components. For the scripts	その他のスクリプトは、それらのUnicodeコンポーネントのシーケンスとして。スクリプト用
that use a virama character to generate conjunct consonants, (All the Indic	これは_virama_文字を使って接続子音を生成します。
scripts plus Myanmar and Khmer) the function `NormalizeCleanAndSegmentUTF8`	スクリプトプラスミャンマーとクメール語)関数 `NormalizeCleanAndSegmentUTF8`
pairs the virama with an appropriate neighbor to generate a more glyph-oriented	よりグリフ指向のものを生成するために、ビラマを適切な隣人とペアにする
encoding in the unicharset. To make full use of this improvement, the	unicharsetでのエンコーディングこの改善を最大限に活用するために、
`--pass_through_recoder` flag should be set for `combine_lang_model` for these	これらのためには `--pass_through_recoder`フラグを`combine_lang_model`に設定する必要があります
scripts.	スクリプト
## Randomized Training Data and sequential_training	##無作為化訓練データと逐次訓練
For Stochastic Gradient Descent to work properly, the training data is supposed	Stochastic Gradient Descentが正しく機能するためには、トレーニングデータが必要です。
to be randomly shuffled across all the sample files, so the trainer can read its	すべてのサンプルファイルにわたってランダムにシャッフルされ、トレーナーはそのファイルを読むことができます。
way through each file in turn and go back to the first one when it reaches the	各ファイルを順番に調べていき、最初のファイルに到達したら最初のファイルに戻ります。
end. This is entirely contrary to the way base Tesseract was trained!	終わり。これは、Tesseractベースのトレーニング方法とはまったく反対です。
If using the rendering code, (via `tesstrain.sh`) then it will shuffle the	レンダリングコードを使っている場合は( `tesstrain.sh`経由で)、
sample text lines within each file, but you will get a set of files, each	各ファイル内にサンプルテキスト行がありますが、それぞれ一連のファイルがあります。
containing training samples from a single font. To add a more even mix, the	単一のフォントからのトレーニングサンプルを含みます。より均等なミックスを追加するには、
default is to process one sample from each file in turn aka 'round robin' style.	デフォルトでは、各ファイルから1つのサンプルを順番に「ラウンドロビン」スタイルで処理します。
If you have generated training data some other way, or it is all from the same	他の方法でトレーニングデータを生成した場合、またはすべて同じものからの場合
style (a handwritten manuscript book for instance) then you can use the	スタイル(例えば手書きの原稿本)なら、
`--sequential_training` flag for `lstmtraining`. This is more memory efficient	lstmtraining用の--sequential_trainingフラグ。これはよりメモリ効率的です。
since it will load data from only two files at a time, and process them in	2つのファイルから同時にデータをロードし、それらを
sequence. (The second file is read-ahead so it is ready when needed.)	シーケンス(2番目のファイルは先読みなので、必要なときに準備ができています。)
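The difference between the two orderings can be pictured with a toy model. The sketch below is not the trainer's code: each argument stands for one `.lstmf` file written as comma-separated sample names, all files are assumed to hold the same number of samples, and both function names are made up for illustration.

```bash
#!/usr/bin/env bash
# Toy model of the order in which samples are visited.
# Each argument is one "file": a comma-separated list of sample names.
# Simplification: every file is assumed to have the same number of samples.

# Default behaviour: one sample from each file in turn.
round_robin_order() {
  local files=("$@") first samples n i f
  IFS=',' read -r -a first <<<"${files[0]}"
  n=${#first[@]}
  for (( i = 0; i < n; i++ )); do
    for f in "${files[@]}"; do
      IFS=',' read -r -a samples <<<"$f"
      printf '%s ' "${samples[i]}"
    done
  done
  echo
}

# --sequential_training: finish one file before moving to the next.
sequential_order() {
  local f
  for f in "$@"; do
    printf '%s ' ${f//,/ }
  done
  echo
}

round_robin_order a1,a2 b1,b2 c1,c2   # prints: a1 b1 c1 a2 b2 c2
sequential_order  a1,a2 b1,b2 c1,c2   # prints: a1 a2 b1 b2 c1 c2
```

With one font per file, the round-robin order interleaves fonts, whereas the sequential order only needs the current file (plus the read-ahead) in memory.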
## Model output	##モデル出力
The trainer saves checkpoints periodically using `--model_output` as a basename.	トレーナーは `--model_output`をベース名として定期的にチェックポイントを保存します。
It is therefore possible to stop training at any point, and restart it, using	したがって、いつでもトレーニングを停止して再開することができます。
the same command line, and it will continue. To force a restart, use a different	同じコマンドライン、そしてそれは続くでしょう。強制的に再起動するには、別の
`--model_output` or delete all the files.	`--model_output`もしくは全てのファイルを削除します。
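Before deleting anything to force a restart, it may help to list what a previous run left behind. The sketch below assumes that every file written by the trainer has a name starting with the `--model_output` value, which is consistent with the sample log path `./SANLAYER/LAYER15.7_61423.checkpoint` but is not guaranteed by this document; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: list the files a previous run wrote for a given --model_output
# basename, so they can be reviewed before being deleted by hand.
# Assumption: every such file name starts with the basename itself.
list_model_output_files() {
  local base=$1 f
  for f in "$base"*; do
    # Skip the unexpanded pattern when nothing matches.
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}

# Possible usage (hypothetical path, not run here):
#   list_model_output_files ./SANLAYER/LAYER
```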
## Net Mode and Optimization	##ネットモードと最適化
The `128` flag turns on Adam optimization, which seems to work a lot better than	`128`フラグはAdam最適化をオンにします。
plain momentum.	普通の勢い。
The `64` flag enables automatic layer-specific learning rate. When progress	「６４」フラグは自動的なレイヤ固有の学習率を可能にする。進歩したとき
stalls, the trainer investigates which layer(s) should have their learning rate	失速した場合、トレーナーはどの層が学習率を持つべきかを調査します。
reduced independently, and may lower one or more learning rates to continue	独立して減らされ、続けるために1つ以上の学習率を下げるかもしれません
learning.	学びます。
The default value of `net_mode` of `192` enables both Adam and layer-specific	`192`の`net_mode`のデフォルト値はAdamとレイヤ固有の両方を有効にします
learning rates.	学習率
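Because `net_mode` is a set of bit flags, a value can be decoded with shell arithmetic. This is a minimal sketch covering only the two flags described above (`128` and `64`); any other bits from `NetworkFlags` are ignored, and the function name is made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: decode a --net_mode value into the two flags described above.
# 128 = Adam optimization, 64 = layer-specific learning rates.
describe_net_mode() {
  local mode=$1 optimizer="momentum" layer_lr="off"
  if (( mode & 128 )); then optimizer="adam"; fi
  if (( mode & 64 )); then layer_lr="on"; fi
  echo "optimizer=${optimizer} layer_specific_lr=${layer_lr}"
}

describe_net_mode 192                # prints: optimizer=adam layer_specific_lr=on
describe_net_mode 128                # prints: optimizer=adam layer_specific_lr=off
describe_net_mode $(( 128 | 64 ))    # same as 192
```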
## Perfect Sample Delay	##完璧なサンプルディレイ
Training on "easy" samples isn't necessarily a good idea, as it is a waste of	「簡単な」サンプルについてのトレーニングは、それが無駄になるので必ずしも良い考えではありません。
time, but the network shouldn't be allowed to forget how to handle them, so it	時間はありますが、ネットワークはそれらを処理する方法を忘れることを許可されるべきではないので、それはそれ
is possible to discard some easy samples if they are coming up too often. The	それらがあまりにも頻繁に現れているならば、いくつかの簡単なサンプルを捨てることは可能です。の
`--perfect_sample_delay` argument discards perfect samples if there haven't been	`--perfect_sample_delay`引数がなければ完璧なサンプルを破棄します
that many imperfect ones seen since the last perfect sample. The current default	最後の完璧なサンプル以降に見られた、多くの不完全なもの。現在のデフォルト
value of zero uses all samples. In practice the value doesn't seem to have a	値ゼロはすべてのサンプルを使用します。実際には、この値には
huge effect, and if training is allowed to run long enough, zero produces the	トレーニングが十分に長く実行されることが許可されている場合、ゼロは
best results.	最高の結果
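The rule can be illustrated with a small simulation. This sketch models only the behaviour as described in the paragraph above, not the trainer's actual code; in particular, letting the very first perfect sample through is an assumption, and the function name and the `P`/`I` notation are invented for the example.

```bash
#!/usr/bin/env bash
# Sketch: simulate which samples are used for a given --perfect_sample_delay.
# Arguments: the delay, then a sequence of P (perfect) / I (imperfect) samples.
# Output: one line per sample, "train" or "skip".
simulate_perfect_sample_delay() {
  local delay=$1; shift
  # Assumption: start as if enough imperfect samples had already been seen.
  local imperfect_since_perfect=$delay
  local s
  for s in "$@"; do
    if [ "$s" = "P" ]; then
      if (( delay == 0 || imperfect_since_perfect >= delay )); then
        echo "train"
        imperfect_since_perfect=0
      else
        echo "skip"
      fi
    else
      # Imperfect samples are always used and count towards the delay.
      echo "train"
      imperfect_since_perfect=$(( imperfect_since_perfect + 1 ))
    fi
  done
}

# With a delay of 2, the second and fourth samples (both perfect) are skipped.
simulate_perfect_sample_delay 2 P P I P I P
```

With a delay of `0`, every sample prints `train`, matching the default described above.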
## Debug Interval and Visual Debugging	##デバッグ間隔とビジュアルデバッグ
With zero (default) `--debug_interval`, the trainer outputs a progress report	ゼロ(デフォルト) `--debug_interval`を指定すると、トレーナーは進捗レポートを出力します
every 100 iterations, similar to the following example.	次の例のように、100回の繰り返しごと。
```
At iteration 61239/65000/65015, Mean rms=1.62%, delta=8.587%, char train=16.786%, word train=36.633%, skip ratio=0.1%, wrote checkpoint.
At iteration 61332/65100/65115, Mean rms=1.601%, delta=8.347%, char train=16.497%, word train=36.24%, skip ratio=0.1%, wrote checkpoint.
2 Percent improvement time=44606, best error was 17.77 @ 16817
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 61423/65200/65215, Mean rms=1.559%, delta=7.841%, char train=15.7%, word train=35.68%, skip ratio=0.1%, New best char error = 15.7At iteration 45481, stage 0, Eval Char error rate=6.9447893, Word error rate=27.039255 wrote best model:./SANLAYER/LAYER15.7_61423.checkpoint wrote checkpoint.
```
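To follow progress over a long run, the character error can be pulled out of these report lines. The sketch below assumes the line format shown in the sample output and reports the first of the three iteration counters; the function name is hypothetical and the pattern may need adjusting for other versions.

```bash
#!/usr/bin/env bash
# Sketch: extract "iteration char_train_percent" pairs from trainer output.
# Reads log text on stdin; lines that do not match the pattern are ignored.
extract_char_error() {
  sed -n -E 's/^At iteration ([0-9]+)\/[0-9]+\/[0-9]+,.* char train=([0-9.]+)%.*/\1 \2/p'
}

# Example using one line from the sample output.
extract_char_error <<'EOF'
At iteration 61239/65000/65015, Mean rms=1.62%, delta=8.587%, char train=16.786%, word train=36.633%, skip ratio=0.1%, wrote checkpoint.
2 Percent improvement time=44606, best error was 17.77 @ 16817
EOF
# prints: 61239 16.786
```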
With `--debug_interval -1`, the trainer outputs verbose debug text for every	`--debug_interval -1`を指定すると、トレーナーはすべてのメッセージに対して詳細なデバッグテキストを出力します。
training iteration. The text debug information includes the truth text, the recognized text, the	反復のトレーニングテキストデバッグ情報には、真実のテキスト、認識されたテキスト、
iteration number, the training sample id (lstmf file and line) and the mean value of	反復数、学習サンプルID(lstmfファイルと行)、およびの平均値
several error metrics. `GROUND TRUTH` for the line is displayed in all cases.	いくつかのエラーメトリックその行の `GROUND TRUTH`はすべての場合に表示されます。
`ALIGNED TRUTH` and `BEST OCR TEXT` are displayed only when different from	「ALIGNED TRUTH」および「BEST OCR TEXT」は、次の場合と異なる場合にのみ表示されます。
the `GROUND TRUTH`.	「地面の真実」
```
Iteration 455038: GROUND TRUTH : उप॑ त्वाग्ने दि॒वेदि॑वे॒ दोषा॑वस्तर्धि॒या व॒यम् ।
File /tmp/san-2019-03-28.jsY/san.Mangal.exp0.lstmf line 451 (Perfect):
Mean rms=1.267%, delta=4.155%, train=11.308%(32.421%), skip ratio=0%
Iteration 455039: GROUND TRUTH : मे अपराध और बैठे दुकानों नाम सकते अधिवक्ता, दोबारा साधन विषैले लगाने पर प्रयोगकर्ताओं भागे
File /tmp/san-2019-04-04.H4m/san.FreeSerif.exp0.lstmf line 28 (Perfect):
Mean rms=1.267%, delta=4.153%, train=11.3%(32.396%), skip ratio=0%
```
` \| `
Iteration 1526: GROUND TRUTH : ЁТГ╗ ЁТА╕ ЁТЖ│ЁТЖ│ ЁТЕШЁТКПЁТААЁТЛ╛	反復1526:地面の真実:ЁТГ╗ЁТА╕ЁТЖ│ЁТЖ│ЁТЕШЁТКПЁТААЁТЛ╛
Iteration 1526: ALIGNED TRUTH : ЁТГ╗ ЁТА╕ ЁТЖ│ЁТЖ│ ЁТЕШЁТКПЁТКПЁТААЁТЛ╛	反復1526:整列真実:ЁТГ╗ЁТА╕ЁТЖ│ЁТЖ│ЁТЕШЁТКПЁТКПЁТААЁТЛ╛
Iteration 1526: BEST OCR TEXT : ЁТААЁТЛ╛	反復1526:BEST OCRテキスト:EXTТААЁТЛ╛
File /tmp/eng-2019-04-06.Ieb/eng.CuneiformComposite.exp0.lstmf line 19587 :	ファイル/tmp/eng-2019-04-06.Ieb/eng.CuneiformComposite.exp0.lstmf 19587行目:
Mean rms=0.941%, delta=12.319%, train=56.134%(99.965%), skip ratio=0.6%	平均実効値= 0.941％、デルタ= 12.319％、列車= 56.134％(99.965％)、スキップ率= 0.6％
Iteration 1527: GROUND TRUTH : 𒀭𒌋𒐊	反復1527:地面の真実:𒀭𒌋𒐊
Iteration 1527: BEST OCR TEXT : 𒀭𒌋	反復1527年:ベストOCRテキスト:𒀭𒌋
File /tmp/eng-2019-04-06.Ieb/eng.CuneiformOB.exp0.lstmf line 7771 :	ファイル/tmp/eng-2019-04-06.Ieb/eng.CuneiformOB.exp0.lstmf 7771行:
Mean rms=0.941%, delta=12.329%, train=56.116%(99.965%), skip ratio=0.6%	平均実効値= 0.941％、デルタ= 12.329％、トレイン= 56.116％(99.965％)、スキップ率= 0.6％
` \| `
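The log lines above have a regular shape, so the figure labeled `train=` can be pulled out of a saved log with standard tools. The following is a minimal sketch, assuming the `Mean rms=..., delta=..., train=...%(...%)` layout shown in the samples; the function name `extract_train_error` is hypothetical and not part of the training tools.

```shell
#!/bin/sh
# Hypothetical helper: reads trainer log text on stdin and prints the number
# that follows "train=" (without the percent sign), one value per matching line.
# Lines that do not contain "train=" produce no output.
extract_train_error() {
    sed -n 's/.*train=\([0-9.]*\)%.*/\1/p'
}

# Example with one of the sample lines shown above.
printf '%s\n' 'Mean rms=1.267%, delta=4.155%, train=11.308%(32.421%), skip ratio=0%' \
    | extract_train_error
```

Piping a whole log file through the function gives one value per reported iteration, which is enough to see whether the figure is falling over time.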
With `--debug_interval > 0`, the trainer displays several windows of debug	`--debug_interval> 0`の場合、トレーナーはいくつかのデバッグウィンドウを表示します。
information on the layers of the network. In the special case of	ネットワークのレイヤに関する情報。の特別な場合
`--debug_interval 1` it waits for a click in the `LSTMForward` window before	`--debug_interval 1`の前に`LSTMForward`ウィンドウでクリックを待ちます
continuing to the next iteration, but for all others it just continues and draws	次の繰り返しに進みますが、他のすべての場合は継続して描画します。
information at the frequency requested.	要求された頻度での情報。
**NOTE that to use --debug_interval > 0 you must build	** --debug_interval> 0を使用するにはビルドする必要があることに注意してください
ScrollView.jar as well as the other training tools.** See	ScrollView.jarおよび他のトレーニングツール**
Building the Training Tools	トレーニングツールの作成
The visual debug information includes:	視覚的なデバッグ情報は次のとおりです。
A forward and backward window for each network layer. Most are just random	各ネットワーク層の前方および後方ウィンドウ。ほとんどはランダムです
noise, but the `Output/Output-back` and `ConvNL` windows are worth viewing.	しかし、 `Output / Output-back`と`ConvNL`ウィンドウは見る価値があります。
`Output` shows the output of the final Softmax, which starts out as a yellow	`Output`は黄色で始まる最後のSoftmaxの出力を表示します
line for the null character, and gradually develops yellow marks at each point	null文字の行で、各ポイントに徐々に黄色のマークが表示されます。
where it thinks there is a character. (The x-axis is the image x-coordinate, and	文字があると思うところ。 (x軸は画像のx座標です。
the y-axis is character class.) The `Output-back` window shows the difference	y軸は文字クラスです。 `output-back`ウィンドウは違いを表示します
between the actual output and the target using the same layout, but with yellow	同じレイアウトを使用しているが黄色の実際の出力とターゲットの間
for "give me more of this" and blue for "give me less of this". As the network	「これ以上の情報を提供する」と「これ以上情報を提供しない」の青はネットワークとして
learns, the `ConvNL` window develops the typical edge detector results that you	習得したように、 `ConvNL`ウィンドウは典型的なエッジ検出結果を表示します。
expect from the bottom layer.	最下層から期待しています。
`LSTMForward` shows the output of the whole network on the training image.	「ＬＳＴＭフォワード」は、トレーニング画像上のネットワーク全体の出力を示す。
`LSTMTraining` shows the training target on the training image. In both, green	「ＬＳＴＭトレーニング」はトレーニング画像上にトレーニング目標を示す。どちらも、緑
lines are drawn to show the peak output for each character, and the character	各文字のピーク出力を示す線が描かれています。
itself is drawn to the right of the line.	それ自体は線の右側に描かれます。
The other two windows worth looking at are `CTC Outputs` and `CTC Targets`.	他に注目に値するウィンドウは `CTC Outputs`と`CTC Targets`です。
These show the current output of the network and the targets as a line graph of	これらはネットワークの現在の出力とターゲットをの折れ線グラフとして表示します。
strength of output against image x-coordinate. Instead of a heatmap, like the	画像のX座標に対する出力の強度。ヒートマップの代わりに、
`Output` window, a different colored line is drawn for each character class and	`Output`ウィンドウでは、文字クラスごとに異なる色の線が描かれます。
the y-axis is strength of output.	y軸は出力の強さです。
# TessTutorial	#TessTutorial
The process of creating the training data is	トレーニングデータの作成の処理は、
documented below, followed by a [Tutorial guide to	以下に文書化され、その後に[チュートリアルガイド]
lstmtraining](#tutorial-guide-to-lstmtraining) which gives an introduction to	紹介文を与えるlstmtraining](#tutorial-guide-to-lstmtraining)
the main training process, with command-lines that have been tested for real. On	実際にテストされたコマンドラインを使ったメインのトレーニングプロセス。に
Linux at least, you should be able to just copy-paste the command lines into	少なくともLinuxでは、コマンドラインをコピー&ペーストするだけでいいのです。
your terminal.	あなたの端末
To make the `tesstrain.sh` script work, it will be necessary to	tesstrain.shスクリプトを機能させるためには、
either set `PATH` to include your local `training` and `api` directories, or use	ローカルのtrainingとapiを含むようにPATHを設定するか、または
`make install`.	`make install`。
## One-time Setup for TessTutorial	## TessTutorialのワンタイムセットアップ
In order to successfully run the TessTutorial, you need to have a working	TessTutorialを正常に実行するためには、作業用のツールが必要です。
installation of tesseract and training tools and have the training scripts and	tesseractとトレーニングツールをインストールし、トレーニングスクリプトと
required traineddata files in certain directories.	特定のディレクトリに訓練済みデータファイルが必要です。
These instructions only cover the case of rendering from fonts,	これらの指示はフォントからレンダリングする場合だけをカバーします、
so the needed fonts must be installed first.	そのため、必要なフォントを最初にインストールする必要があります。
Note that your fonts location may vary.	フォントの場所は異なる場合があります。
` \| `
sudo apt update	sudo aptアップデート
sudo apt install ttf-mscorefonts-installer	sudo apt install ttf-mscorefonts-installer
sudo apt install fonts-dejavu	sudo apt install fonts-dejavu
fc-cache -vf	fc-cache -vf
` \| `
Follow the instructions below	以下の指示に従ってください
to do the first time setup for TessTutorial.	TessTutorialの初回セットアップを行うため。
` \| `
mkdir ~/tesstutorial	mkdir~/ tesstutorial
cd ~/tesstutorial	cd~/ tesstutorial
mkdir langdata	mkdir langdata
cd langdata	cd langdata
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt	https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txtを取得します。
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/common.punc	https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/common.punc
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/font_properties	https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/font_properties
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Latin.unicharset	https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Latin.unicharset
wget https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Latin.xheights	https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/Latin.xheights
mkdir eng	mkdir eng
cd eng	cd eng
wget https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.training_text	https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.training_text wget
wget https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.punc	https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.punc
wget https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.numbers	https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.numbers
wget https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.wordlist	https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.wordlist
cd ~/tesstutorial	cd~/ tesstutorial
git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git	git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git
cd tesseract/tessdata	cd tesseract / tessdata
mkdir best	mkdirベスト
cd best	最高のCD
wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata	https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
wget https://github.com/tesseract-ocr/tessdata_best/raw/master/heb.traineddata	https://github.com/tesseract-ocr/tessdata_best/raw/master/heb.traineddata
wget https://github.com/tesseract-ocr/tessdata_best/raw/master/chi_sim.traineddata	https://github.com/tesseract-ocr/tessdata_best/raw/master/chi_sim.traineddata
` \| `
## Creating Training Data	##トレーニングデータを作成する
As with base Tesseract, there is a choice between rendering synthetic training	基本的なTesseractと同様に、合成トレーニングをレンダリングするかどうかの選択があります
data from fonts, or labeling some pre-existing images (like ancient manuscripts	フォントからのデータ、または既存の画像へのラベル付け(古代の原稿のような)
for example).	例えば)。

In either case, the required format is still the tiff/box file	どちらの場合も、必要な形式はまだtiff / boxファイルです。
pair, except that the boxes only need to cover a textline instead of individual	ただし、ボックスは個々のテキスト行ではなくテキスト行を覆うだけで構いません。
characters.	キャラクター
### Making Box Files	###ボックスファイルを作る
Multiple formats of box files are accepted by Tesseract 4 for LSTM training,	LSTMトレーニングのために、Tesseract 4では複数の形式のボックスファイルを使用できます。
though they are different from the one used by Tesseract 3	それらはTesseract 3によって使用されているものとは異なりますが
(details).	(詳細)。
Each line in the box file matches a 'character' (glyph) in the tiff image.	ボックスファイルの各行は、TIFF画像の「文字」(グリフ)と一致します。

The coordinates can be the bounding box	はバウンディングボックスの座標になります。
of a single glyph or of a whole textline (see examples).	1つのグリフまたはテキスト全体(例を参照)。
To mark an end-of-textline, a special line must be inserted after a series of lines.	テキストの終わりをマークするには、一連の行の後に特別な行を挿入する必要があります。

Note that in all cases, even for right-to-left languages, such as Arabic, the	アラビア語のような右から左への言語でさえ、すべての場合において、
text transcription for the line should be ordered left-to-right. In other words, the network	行のテキスト転記、*は左から右に並べる必要があります。つまり、ネットワーク
is going to learn from left-to-right regardless of the language, and the	言語に関係なく、左から右へと学習します。
right-to-left/bidi handling happens at a higher level inside Tesseract.	右から左への/ bidi処理はTesseract内のより高いレベルで行われます。
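A quick sanity check on a box file can catch malformed lines before training. The sketch below assumes the usual Tesseract box layout of a symbol followed by five integers (left, bottom, right, top, page), and assumes that the special end-of-textline line starts with a tab in place of the symbol; both layouts are assumptions relative to this text, and `check_box_lines` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: reads box-file lines on stdin and checks that the last
# five whitespace-separated fields of every line are non-negative integers
# (assumed order: left bottom right top page). The symbol field is not checked,
# so a line that starts with a tab instead of a symbol also passes.
# Exit status is 0 if every line passes, 1 otherwise.
check_box_lines() {
    awk '
        BEGIN { bad = 0 }
        NF < 5 { printf "line %d: too few fields\n", NR; bad = 1; next }
        {
            for (i = NF - 4; i <= NF; i++) {
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d: non-numeric field\n", NR
                    bad = 1
                    next
                }
            }
        }
        END { exit bad }
    '
}
```

For example, `check_box_lines < page.box` (with `page.box` standing in for one of your own files) prints the offending line numbers and returns a non-zero status if anything looks wrong.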
### Using tesstrain.sh	### tesstrain.shを使う
The setup for running tesstrain.sh is the	tesstrain.shを実行するためのセットアップは
same as for base Tesseract. Use the `--linedata_only` option for LSTM training.	ベースTesseractと同じです。 LSTMの訓練には `--linedata_only`オプションを使用してください。
Note that it is beneficial to have more training	より多くのトレーニングを受けることは有益であることに注意してください
text and make more pages though, as neural nets don't generalize as well and	ニューラルネットも同様に一般化していないので
need to train on something similar to what they will be running on. If the	彼らが走っているものに類似した何かで訓練する必要があります。あれば
target domain is severely limited, then all the dire warnings about needing a	ターゲットドメインは厳しく制限されています。
lot of training data may not apply, but the network specification may need to be	多くのトレーニングデータは適用されないかもしれませんが、ネットワーク仕様は
changed.	かわった。
Training data is created using tesstrain.sh	tesstrain.shを使用してトレーニングデータを作成します。
as follows:	次のように:
` \| `
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \	src / training / tesstrain.sh --fonts_dir / usr / share / fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \	--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain	--tessdata_dir ./tessdata --output_dir~/ tesstutorial / engtrain
` \| `
The above command makes LSTM training data equivalent to the data used to train	上記のコマンドは、LSTMトレーニングデータをトレーニングに使用されたデータと同等にします。
base Tesseract for English. For making a general-purpose LSTM-based OCR engine,	英語のベースTesseract。汎用LSTMベースのOCRエンジンを作るために、
it is woefully inadequate, but makes a good tutorial demo.	それはひどく不適切ですが、良いチュートリアルデモになります。
Now try this to make eval data for the 'Impact' font:	これを試して、「Impact」フォントの評価データを作成してください。
` \| `
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \	src / training / tesstrain.sh --fonts_dir / usr / share / fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \	--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata \	--tessdata_dir ./tessdata \
--fontlist "Impact Condensed" --output_dir ~/tesstutorial/engeval	--fontlist "Impact Condensed" --output_dir~/ tesstutorial / engeval
` \| `
We will use that data later to demonstrate tuning.	後でそのデータを使ってチューニングを説明します。
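Before moving on, it can be worth confirming that the list files written by the commands above point at files that actually exist. This is a minimal sketch, assuming that `eng.training_files.txt` holds one `.lstmf` path per line; `count_missing_lstmf` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: $1 is a list file with one .lstmf path per line
# (assumed layout). Prints how many of the listed paths do not exist.
count_missing_lstmf() {
    missing=0
    while IFS= read -r path; do
        # Skip blank lines.
        [ -n "$path" ] || continue
        [ -f "$path" ] || missing=$((missing + 1))
    done < "$1"
    echo "$missing"
}
```

Running `count_missing_lstmf ~/tesstutorial/engtrain/eng.training_files.txt` should print `0` if the training data was generated successfully.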
## Tutorial guide to lstmtraining	## lstmtrainingのチュートリアルガイド
### Creating Starter Traineddata	###スタータートレーニングデータを作成する
NOTE: This is a new step!	注:これは新しいステップです。
Instead of a `unicharset` and `script_dir`, `lstmtraining` now takes a	`unicharset`とscript_dirの代わりに、 `` lstmtrainingは現在
`traineddata` file on its command-line, to obtain all the information it needs	コマンドラインで `traineddata`ファイル、必要なすべての情報を取得する
on the language to be learned. The `traineddata` must contain at least an	学ぶ言語について`traineddata`は少なくともを含まなければなりません
`lstm-unicharset` and `lstm-recoder` component, and may also contain the three	`lstm-unicharset`と`lstm-recoder`コンポーネント、そしてこれらの3つを含むこともできます
dawg files: `lstm-punc-dawg`, `lstm-word-dawg`, and `lstm-number-dawg`. A `config` file is	dawgファイル: lstm-punc-dawg lstm-word-dawg lstm-number-dawg`` configファイルは
also optional. The other components, if present, will be ignored and unused.	またオプションです。他のコンポーネントが存在する場合、それらは無視され未使用になります。
There is no tool to create the `lstm-recoder` directly. Instead there is a new	`lstm-recoder`を直接作成するためのツールはありません。代わりに新しいものがあります
tool, `combine_lang_model` which takes as input an `input_unicharset` and	入力として `input_unicharset`を受け取るツール、`combine_lang_model`
`script_dir` (`script_dir` points to the `langdata` directory) and `lang` (`lang` is the	`script_dir`(`script_dir`は `langdata`ディレクトリを指します)そして`lang`( `lang`は
language being used) and optional word list files. It creates the `lstm-recoder`	使用されている言語)およびオプションの単語リストファイル。 `lstm-recoder`を作成します
from the `input_unicharset` and creates all the dawgs, if wordlists are provided,	ワードリストが提供されている場合、 `input_unicharset`からすべてのdawgsを作成します。
putting everything together into a `traineddata` file.	すべてを `traineddata`ファイルにまとめます。
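The inputs described above map onto a `combine_lang_model` command line roughly as follows. This is a dry-run sketch that only prints the command instead of running it. The `--output_dir` and `--words` flags and the `<lang>/<lang>.wordlist` location are assumptions not stated in this text, so check them against `combine_lang_model --help` on your build. `build_starter_cmd` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) a combine_lang_model command.
# $1: input unicharset file   $2: langdata directory (script_dir)
# $3: language code           $4: output directory (assumed --output_dir flag)
# A word list is added only if <langdata>/<lang>/<lang>.wordlist exists
# (assumed location and assumed --words flag).
build_starter_cmd() {
    printf 'combine_lang_model --input_unicharset %s --script_dir %s --lang %s --output_dir %s' \
        "$1" "$2" "$3" "$4"
    wordlist="$2/$3/$3.wordlist"
    if [ -f "$wordlist" ]; then
        printf ' --words %s' "$wordlist"
    fi
    printf '\n'
}
```

Printing the command first makes it easy to review the paths before creating a starter `traineddata` file.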
### Training From Scratch	###最初からのトレーニング
The following example shows the command line for training from scratch. Try it	次の例は、最初からトレーニングするためのコマンドラインを示しています。それを試してみてください
with the default training data created with the command-lines above.	上記のコマンドラインで作成されたデフォルトのトレーニングデータを使用します。
` \| `
mkdir -p ~/tesstutorial/engoutput	mkdir -p~/ tesstutorial / engoutput
training/lstmtraining --debug_interval 100 \	training / lstmtraining --debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \	--traineddata~/ tesstutorial / engtrain / eng / eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \	--net_spec '[1,36,0,1 Ct 3、3、16 Mp 3、3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \	--model_output~/ tesstutorial / engoutput / base --learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \	--train_listfile~/ tesstutorial / engtrain / eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \	--eval_listfile~/ tesstutorial / engeval / eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log	--max_iterations 5000&>~/ tesstutorial / engoutput / basetrain.log
` \| `
In a separate window monitor the log file:	別のウィンドウでログファイルを監視します。
` \| `
tail -f ~/tesstutorial/engoutput/basetrain.log	tail -f~/ tesstutorial / engoutput / basetrain.log
` \| `
(If you tried this tutorial before, you might notice that the numbers have	(このチュートリアルを以前に試したことがある場合は、数字に
changed. This is a result of a slightly smaller network, and the addition of the	かわった。これは、ネットワークがわずかに小さくなったことと、
ADAM optimizer, which enables a higher learning rate.)	より高い学習率を可能にするADAMオプティマイザ。
You should observe that by 600 iterations, the spaces (white) are starting to	600回の繰り返しで、スペース(白)が
show on the `CTC Outputs` window and by 1300 iterations green lines appear on	`CTC Outputs`ウィンドウに表示され、1300回の繰り返しで緑の線が表示されます。
the `LSTMForward` window where there are spaces in the image.	画像にスペースがあるところの `LSTMForward`ウィンドウ。
By 1300 iterations, there are noticeable non-space bumps in the `CTC Outputs`.	1300回の繰り返しで、 `CTCの出力`にはっきりしたスペース以外の隆起があります
Note that the `CTC Targets`, which all started at the same height, are now varied	まったく同じ高さで始まっていた `CTC Targets 'は今やさまざまです
in height because of the definite output for spaces and some and the tentative	スペースとsomeと暫定の明確な出力のための高さ
outputs for other characters. At the same time, the characters and positioning	他の文字用に出力します。同時に、文字と位置
of the green lines in the `LSTMTraining` window are not as accurate as they were	LSTMTraining`ウィンドウの緑色の線の精度は正確ではありません
initially, because the partial output from the network confuses the CTC	ネットワークからの部分的な出力がCTCを混乱させるため、最初
algorithm. (CTC assumes statistical independence between the different	アルゴリズム。 (CTCは異なる
x-coordinates, but they are clearly not independent.)	x座標ですが、明らかに独立していません。
By 2000 iterations, it should be clear on the `Output` window that some faint	2000回の繰り返しまでに、 `出力`ウィンドウではいくぶんかすかなことがはっきりしているはずです
yellow marks are appearing to indicate that there is some growing output for	黄色のマークが表示されています。
non-null and non-space, and characters are starting to appear in the	ヌルでもスペースでもなく、文字が
`LSTMForward` window.	`LSTMForward`ウィンドウ
The character error rate falls below 50% just after 3700 iterations, and by 5000	文字エラー率は3700回の反復の直後、そして5000までに50％を下回ります
iterations it is down to about 13%, where training terminates. (This takes about 20 minutes on a current	それが終了するところで、約13％まで。 (電流で約20分で
high-end machine with AVX.)	AVX搭載のハイエンドマシン。)
Note that this engine is trained on the same amount of training data as used by	このエンジンは、によって使用されるのと同じ量のトレーニングデータでトレーニングされていることに注意してください。
the legacy Tesseract engine, but its accuracy on other fonts is probably very	従来のTesseractエンジンですが、他のフォントに対する精度はおそらく非常に高いです。
poor. Run an independent test on the 'Impact' font:	悪い'Impact'フォントに対して独立したテストを実行します。
` \| `
training/lstmeval --model ~/tesstutorial/engoutput/base_checkpoint \	training / lstmeval --model~/ tesstutorial / engoutput / base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \	--traineddata~/ tesstutorial / engtrain / eng / eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt	--eval_listfile~/ tesstutorial / engeval / eng.training_files.txt
` \| `
85% character error rate? Not so good!	85％の文字エラー率？あまり良くない！
Now base Tesseract doesn't do very well on 'Impact', but it is included in the	現在、基本的なTesseractは 'Impact'に対してあまりうまくいきませんが、それはに含まれています。
4500 or so fonts used to train the new LSTM version, so run that model on the same data	新しいLSTMバージョンをトレーニングするために使用されていた4500程度のフォント
for a comparison:	比較のために:
` \| `
training/lstmeval --model tessdata/best/eng.traineddata \	トレーニング/ lstmeval - モデルのテストデータ/ best / eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt	--eval_listfile~/ tesstutorial / engeval / eng.training_files.txt
` \| `
2.45% character error rate? Much better!	2.45％の文字エラー率？ずっといい！
For reference in the next section, also run a test of the full model on the	次のセクションで参照するために、フルモデルのテストも実行します。
training set that we have been using:	私たちが使ってきたトレーニングセット:
` \| `
training/lstmeval --model tessdata/best/eng.traineddata \	トレーニング/ lstmeval - モデルのテストデータ/ best / eng.traineddata \
--eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt	--eval_listfile~/ tesstutorial / engtrain / eng.training_files.txt
` \| `
Char error rate=0.25047642, Word error rate=0.63389585	文字エラー率= 0.25047642、ワードエラー率= 0.63389585
(If you ran this before, and notice that the error rates are a lot higher than	(以前にこれを実行したことがあり、エラー率がはるかに高いことに気付くと
the previous alpha version, this is due to a change in the use of shaped quotes.	前のアルファ版は、これは形のついた引用符の使用の変更によるものです。
It didn't count errors in quote shape before, but now it does.)	以前は見積りの形のエラーを数えていませんでしたが、今はしています。)
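When comparing several evaluation runs, it helps to reduce each result to its two numbers. The sketch below assumes that the summary contains the text `Char error rate=..., Word error rate=...` as in the sample above; `eval_error_rates` is a hypothetical helper, not part of the Tesseract tools.

```shell
#!/bin/sh
# Hypothetical helper: reads evaluation output on stdin and prints
# "<char error> <word error>" for each line that contains the summary text.
eval_error_rates() {
    sed -n 's/.*Char error rate=\([0-9.]*\), Word error rate=\([0-9.]*\).*/\1 \2/p'
}

# Example with the summary line shown above.
printf '%s\n' 'Char error rate=0.25047642, Word error rate=0.63389585' | eval_error_rates
```

Saving these pairs for each model makes the later comparison between the base model and the fine-tuned models easier to follow.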
You can train for another 5000 iterations, and get the error rate on the	さらに5000回反復するようにトレーニングして、
training set a lot lower, but it doesn't help the `Impact` font much:	トレーニングはかなり低く設定されていますが、それは `Impact`フォントをあまり役に立ちません:
` \| `
mkdir -p ~/tesstutorial/engoutput	mkdir -p~/ tesstutorial / engoutput
training/lstmtraining \	トレーニング/トレーニング
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \	--traineddata~/ tesstutorial / engtrain / eng / eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \	--net_spec '[1,36,0,1 Ct 3、3、16 Mp 3、3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base --learning_rate 20e-4 \	--model_output~/ tesstutorial / engoutput / base --learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \	--train_listfile~/ tesstutorial / engtrain / eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \	--eval_listfile~/ tesstutorial / engeval / eng.training_files.txt \
--max_iterations 10000 &>>~/tesstutorial/engoutput/basetrain.log	--max_iterations 10000&>>~/ tesstutorial / engoutput / basetrain.log
` \| `
The character error rate on `Impact` is now above 100%, even as the error rate on the	「Impact」の文字エラー率が100％を超えました。
training set has fallen to 2.68% character / 10.01% word:	トレーニングセットは2.68％文字/ 10.01％単語に落ちました:
` \| `
training/lstmeval --model ~/tesstutorial/engoutput/base_checkpoint \	training / lstmeval --model~/ tesstutorial / engoutput / base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \	--traineddata~/ tesstutorial / engtrain / eng / eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt	--eval_listfile~/ tesstutorial / engeval / eng.training_files.txt
` \| `
This shows that the model has completely over-fitted to the supplied training	これは、モデルが提供されたトレーニングに完全に適合しすぎていることを示しています
set! It is an excellent illustration of what happens when the training set	セット！これは、トレーニングセットが終了したときに何が起こるかの優れた実例です。
doesn't cover the desired variation in the target data.	ターゲットデータの望ましい変動をカバーしていません。
In summary, training from scratch needs either a very constrained problem or a lot	要約すると、ゼロからのトレーニングには、非常に制限された問題と多くのことが必要です。
of training data; otherwise you need to shrink the network by reducing some of the	またはトレーニングデータの量を減らすか、ネットワークの数を減らすことでネットワークを縮小する必要があります。
sizes of the layers in the `--net_spec` above. Alternatively, you could try fine	上の `--net_spec`の中のレイヤーのサイズ。あるいは、あなたは罰金を試すことができます
tuning...	チューニング...
### Fine Tuning for Impact	###インパクトのための微調整
Fine tuning is the process of training an existing model on new data without	微調整は、新しいデータを使用せずに既存のモデルをトレーニングするプロセスです。
changing any part of the network, although you can now add	ネットワークのどの部分でも変更できますが、追加することができます。
characters to the character set. (See [Fine Tuning for ± a few	文字セットに文字。 ([微調整用の微調整]を参照してください。
characters](#fine-tuning-for--a-few-characters)).	]#(#fine-tuning-for - 数文字))。
` \| `
training/lstmtraining --model_output /path/to/output [--max_image_MB 6000] \	training / lstmtraining --model_output / path / to / output [--max_image_MB 6000] \
--continue_from /path/to/existing/model \	--continue_from / path / to / existing / model \
--traineddata /path/to/original/traineddata \	--traineddata / path / to / original / traineddata \
[--perfect_sample_delay 0] [--debug_interval 0] \	[--perfect_sample_delay 0] [--debug_interval 0] \
[--max_iterations 0] [--target_error_rate 0.01] \	[--max_iterations 0] [--target_error_rate 0.01] \
--train_listfile /path/to/list/of/filenames.txt	--train_listfile /path/to/list/of/filenames.txt
` \| `
Note that the `--continue_from` arg can point to a training checkpoint	Note `--continue_from`引数はトレーニングチェックポイントを指すことができることに注意
or a recognition model, even though the file formats are different.	ファイル形式が異なっていても、または認識モデル。
Training checkpoints are the files that begin with `--model_output` and end	トレーニングチェックポイントは `--model_output`で始まり終わりのファイルです。
in `checkpoint`. A recognition model can be extracted from an existing	チェックポイント認識モデルは既存のものから抽出することができます
traineddata file, using `combine_tessdata`. Note that it is also necessary to	`combine_tessdata.`を使用した学習済みデータファイル
supply the original traineddata file as well, as that contains the unicharset	オリジナルの訓練データファイルも同様に供給してください。
and recoder. Let's start by fine tuning the model we built earlier, and see if	そしてレコーダー。先ほど作成したモデルを微調整することから始めましょう。
we can make it work for 'Impact':	私たちはそれを 'Impact'のために機能させることができます。
` \| `
mkdir -p ~/tesstutorial/impact_from_small	mkdir -p~/ tesstutorial / impact_from_small
training/lstmtraining --model_output ~/tesstutorial/impact_from_small/impact \	トレーニング/ lstmtraining --model_output~/ tesstutorial / impact_from_small / impact \
--continue_from ~/tesstutorial/engoutput/base_checkpoint \	--continue_from~/ tesstutorial / engoutput / base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \	--traineddata~/ tesstutorial / engtrain / eng / eng.traineddata \
--train_listfile ~/tesstutorial/engeval/eng.training_files.txt \	--train_listfile~/ tesstutorial / engeval / eng.training_files.txt \
--max_iterations 1200	--max_iterations 1200
` \| `
This has character/word error at 22.36%/50.0% after 100 iterations and gets down	これは100回の繰り返しの後に22.36％/ 50.0％の文字/単語エラーを持ち、そして落ちます
to 0.3%/1.2% at 1200. Now a stand-alone test:	1200で0.3％/ 1.2％になりました。今度はスタンドアロンテストです。
` \| `
training/lstmeval --model ~/tesstutorial/impact_from_small/impact_checkpoint \	training / lstmeval --model~/ tesstutorial / impact_from_small / impact_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \	--traineddata~/ tesstutorial / engtrain / eng / eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt	--eval_listfile~/ tesstutorial / engeval / eng.training_files.txt
` \| `
That shows a better result of 0.0086%/0.057% because the trainer is averaging	それはトレーナーが平均しているので0.0086％/ 0.057％のより良い結果を示しています
over 1000 iterations, and it has been improving. This isn't a representative	1000回以上の繰り返し、そしてそれは改善されています。これは代表ではありません
result for the `Impact` font though, as we are testing on the training data!	トレーニングデータでテストしているので、 `Impact`フォントの結果はそうです！
That was a bit of a toy example. The idea of fine tuning is really to apply it	それはちょっとしたおもちゃの例でした。微調整のアイデアは本当にそれを適用することです
to one of the fully-trained existing models:	十分に訓練された既存のモデルの1つに:
` \| `
mkdir -p ~/tesstutorial/impact_from_full	mkdir -p~/ tesstutorial / impact_from_full
training/combine_tessdata -e tessdata/best/eng.traineddata \	トレーニング/ combine_tessdata -e tessdata / best / eng.traineddata \
~/tesstutorial/impact_from_full/eng.lstm	~/ tesstutorial / impact_from_full / eng.lstm
training/lstmtraining --model_output ~/tesstutorial/impact_from_full/impact \	トレーニング/ lstmtraining --model_output~/ tesstutorial / impact_from_full / impact \
--continue_from ~/tesstutorial/impact_from_full/eng.lstm \	--continue_from~/ tesstutorial / impact_from_full / eng.lstm \
--traineddata tessdata/best/eng.traineddata \	--traineddata tessdata / best / eng.traineddata \
--train_listfile ~/tesstutorial/engeval/eng.training_files.txt \	--train_listfile~/ tesstutorial / engeval / eng.training_files.txt \
--max_iterations 400	--max_iterations 400
` \| `
After 100 iterations, it has 1.35%/4.56% char/word error and gets down to	100回繰り返した後の文字数は1.35％/ 4.56％char / wordです。
0.533%/1.633% at 400. Again, the stand-alone test gives a better result:	400で0.533％/ 1.633％。ここでも、スタンドアロンテストのほうが良い結果が得られます。
` \| `
training/lstmeval --model ~/tesstutorial/impact_from_full/impact_checkpoint \	training / lstmeval --model~/ tesstutorial / impact_from_full / impact_checkpoint \
--traineddata tessdata/best/eng.traineddata \	--traineddata tessdata / best / eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt	--eval_listfile~/ tesstutorial / engeval / eng.training_files.txt
` \| `
Char error 0.017%, word 0.120%. What is more interesting, though, is the effect on	Charエラー0.017％、単語0.120％しかし、もっと面白いのは、上の効果です
the other fonts, so run a test on the base training set that we have been using:	他のフォントなので、これまで使用してきた基本トレーニングセットでテストを実行します。
` \| `
training/lstmeval --model ~/tesstutorial/impact_from_full/impact_checkpoint \	training / lstmeval --model~/ tesstutorial / impact_from_full / impact_checkpoint \
--traineddata tessdata/best/eng.traineddata \	--traineddata tessdata / best / eng.traineddata \
--eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt	--eval_listfile~/ tesstutorial / engtrain / eng.training_files.txt
` \| `
Char error rate=0.25548592, Word error rate=0.82523491	文字エラー率= 0.25548592、ワードエラー率= 0.82523491
It is only slightly worse, despite having reached close to zero error on the	それはわずかに悪いだけです、それにもかかわらずゼロに近いエラーに達したにもかかわらず
training set, and achieved it in only 400 iterations. **Note that further	そして、たった400回の繰り返しでそれを達成しました。 **さらに注意してください
training beyond 400 iterations makes the error on the base set higher.**	400回を超えるトレーニングでは、ベースセットの誤差が大きくなります。**
In summary, the pre-trained model can be fine-tuned or adapted to a small data	要約すると、事前に訓練されたモデルは微調整することも、小さなデータに適応させることもできます。
set, without doing a lot of harm to its general accuracy. It is still very	設定、その一般的な精度に多くの害を及ぼすことなくそれはまだ非常に
important however, to avoid over-fitting.	しかし、過度の適合を避けるために重要です。
### Fine Tuning for ± a few characters	###数文字の微調整
New feature: it is possible to add a few new characters to the character set	新機能文字セットにいくつかの新しい文字を追加することが可能です
and train for them by fine tuning, without a large amount of training data.	大量のトレーニングデータを使用せずに、微調整によってトレーニングを受けます。
The training requires a new unicharset/recoder, optional language models, and	トレーニングには、新しいunicharset / recoder、オプションの言語モデル、および
the old traineddata file containing the old unicharset/recoder.	古いunicharset / recoderを含む古いtraineddataファイル
` \| `
training/lstmtraining --model_output /path/to/output [--max_image_MB 6000] \	training / lstmtraining --model_output / path / to / output [--max_image_MB 6000] \
--continue_from /path/to/existing/model \	--continue_from / path / to / existing / model \
--traineddata /path/to/traineddata/with/new/unicharset \	--traineddata / path / to / traineddata / with / new / unicharset \
--old_traineddata /path/to/existing/traineddata \	--old_traineddata / path / to / existing / traineddata \
[--perfect_sample_delay 0] [--debug_interval 0] \	[--perfect_sample_delay 0] [--debug_interval 0] \
[--max_iterations 0] [--target_error_rate 0.01] \	[--max_iterations 0] [--target_error_rate 0.01] \
--train_listfile /path/to/list/of/filenames.txt	--train_listfile /path/to/list/of/filenames.txt
` \| `
Let's try adding the plus-minus sign (±) to the existing English model. Modify	既存の英語モデルにプラスマイナス記号(±)を追加してみましょう。修正する
`langdata/eng/eng.training_text` to include some samples of ±. I inserted 14 of	`langdata / eng / eng.training_text`はいくつかのsamplesのサンプルを含みます。の14を挿入しました
them, as shown below:	以下に示すように、
` \| `
grep ± ../langdata/eng/eng.training_text	grep±../langdata/eng/eng.training_text
alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL	曲がった抵抗を買うことによってLEAVESのアルコキシ - 1.84％あなたの(Vol。SPANIEL
TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership	信頼できるイベントTHOUSANDS TRADITIONS。反米国寝室のリーダーシップ
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)	デザインの自己と株式会社。ボールが変わりました。マンハッタンハーヴェイの±1.31POPSETオセプトC(11)
VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri	VOLVO腹部、65℃、AEROMEXICO SUMMONER =(1961)洗濯ミズーリについて
PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert	PATENTSCOPE®#ホームホーム2番目のHAIビジネス最もCOLETTI、3月14日Flujo Gilbert
Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS ﬁrm	Dresdner昨日の拡張システムYour FOUR±90°Gogol PARTIALLY BOARDSﬁrm
Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY	EメールACTUAL QUEENSLANDカールの手に負えない「8.4破壊」のお客様DataVac®日
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL	Kollman、for‘planked’key max)の表示«LINK»プライバシーBY BY±2.96％Ask！よく
Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv	Lambert own Company View mg (±7)センサー研究2月偶然[It Yahoo！テレビ
United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many	©2003-2008 Used OF	Unitedによって#DEFINE Rebelによって行われました。 ©2003-2008 Used OF
Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)	回避Moosejaw pm *±18注釈:PROBE Jailbroken RAISEファウンテン商品を書く(6)
Oberﬂachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €	オーバーカルチャーの源泉。CULTURED CUTTING Home 06-13-2008、アメリカ合衆国44.01189673355
netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED	私たちのネッティングブックマークもっと)STRENGTH IDENTICAL±2？アクティビティのプロパティメンテナンス
` \| `
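To confirm how many samples of the new character ended up in the training text, count the occurrences instead of the matching lines, since a single line may contain the character more than once. This is a minimal sketch; `count_symbol` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: prints how many times the literal string $1 occurs
# in file $2. grep -o emits each match on its own line, so wc -l counts
# occurrences and not matching lines.
count_symbol() {
    grep -oF -- "$1" "$2" | wc -l | tr -d ' '
}
```

For the text above, `count_symbol '±' ../langdata/eng/eng.training_text` would be expected to report the 14 inserted samples, provided the file contained no plus-minus signs beforehand.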
Now generate new training and eval data:	新しいトレーニングと評価データを生成します。
` \| `
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \	src / training / tesstrain.sh --fonts_dir / usr / share / fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \	--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstutorial/trainplusminus	--tessdata_dir ./tessdata --output_dir~/ tesstutorial / trainplusminus
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \	src / training / tesstrain.sh --fonts_dir / usr / share / fonts --lang eng --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \	--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata \	--tessdata_dir ./tessdata \
--fontlist "Impact Condensed" --output_dir ~/tesstutorial/evalplusminus	--fontlist "Impact Condensed" --output_dir~/ tesstutorial / evalplusminus
` \| `
Run fine tuning on the new training data. This requires more iterations, as it	新しいトレーニングデータを微調整します。これにはもっと反復が必要です。
only has a few samples of the new target character to go on:	先に進む新しいターゲットキャラクタのサンプルがいくつかあります。
` \| `
training/combine_tessdata -e tessdata/best/eng.traineddata \	トレーニング/ combine_tessdata -e tessdata / best / eng.traineddata \
~/tesstutorial/trainplusminus/eng.lstm	~/ tesstutorial / trainplusminus / eng.lstm
training/lstmtraining --model_output ~/tesstutorial/trainplusminus/plusminus \	トレーニング/ lstmtraining --model_output~/ tesstutorial / trainplusminus / plusminus \
--continue_from ~/tesstutorial/trainplusminus/eng.lstm \	--continue_from~/ tesstutorial / trainplusminus / eng.lstm \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \	--traineddata~/ tesstutorial / trainplusminus / eng / eng.traineddata \
--old_traineddata tessdata/best/eng.traineddata \	--old_traineddata tessdata / best / eng.traineddata \
--train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \	--train_listfile~/ tesstutorial / trainplusminus / eng.training_files.txt \
--max_iterations 3600	--max_iterations 3600
` \| `
After 100 iterations, it has 1.26%/3.98% char/word error and gets down to	100回繰り返した後の文字数は1.26％/ 3.98％です。
0.041%/0.185% at 3600. Again, the stand-alone test gives a better result:	3600で0.041％/ 0.185％。ここでも、スタンドアロンテストの方が良い結果が得られます。
` \| `
training/lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \	トレーニング/ lstmeval --model~/ tesstutorial / trainplusminus / plusminus_checkpoint \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \	--traineddata~/ tesstutorial / trainplusminus / eng / eng.traineddata \
--eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt	--eval_listfile~/ tesstutorial / trainplusminus / eng.training_files.txt
` \| `
Char error 0.0326%, word 0.128%. What is more interesting though, is whether the	文字エラー0.0326％、ワード0.128％。もっとおもしろいのは、
new character can be recognized in the 'Impact' font, so run a test on the	新しい文字は 'Impact'フォントで認識できます。そこでテストを実行してください。
impact eval set:	影響評価セット:
` \| `
training/lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \	トレーニング/ lstmeval --model~/ tesstutorial / trainplusminus / plusminus_checkpoint \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \	--traineddata~/ tesstutorial / trainplusminus / eng / eng.traineddata \
--eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt	--eval_listfile~/ tesstutorial / evalplusminus / eng.training_files.txt
` \| `
Char error rate=2.3767074, Word error rate=8.3829474	文字エラー率= 2.3767074、ワードエラー率= 8.3829474
This compares very well against the original test of the original model on the	これは、元のモデルの元のテストと非常によく比較できます。
impact data set. Furthermore, if you check the errors:	データセットに影響を与えます。さらに、エラーをチェックすると:
` \| `
training/lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \	トレーニング/ lstmeval --model~/ tesstutorial / trainplusminus / plusminus_checkpoint \
--traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \	--traineddata~/ tesstutorial / trainplusminus / eng / eng.traineddata \
--eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt 2>&1 |	--eval_listfile~/ tesstutorial / evalplusminus / eng.training_files.txt 2>&1 |
grep ±	grep±
` \| `
...you should see that it gets all the ± signs correct! (Every truth line that	...これですべての±サインが正しくなることがわかります。 (すべての真実の線は
contains a ± also contains a ± on the corresponding OCR line, and there are no	対応するOCR行にaが含まれている
truth lines that don't have a matching OCR line in the grep output.)	grep出力に一致するOCR行がない真理行)
This is excellent news! It means that one or more new characters can be added	これは素晴らしいニュースです。 1つ以上の新しい文字を追加できることを意味します
without impacting existing accuracy, and the ability to recognize the new	既存の正確性に影響を与えずに、および新しい要素を認識する能力
character will, to some extent at least, generalize to other fonts!	文字は、少なくともある程度、他のフォントに一般化するでしょう！
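
The manual grep check described above can also be automated. The following is a minimal sketch, not part of the Tesseract tools: `count_missed` is a hypothetical helper, and it assumes that the lstmeval log prints each ground-truth line with a `Truth:` prefix, immediately followed by the recognized line with an `OCR` prefix. Adjust the prefixes if your log looks different.

```shell
# Hypothetical helper: counts truth lines that contain the character given
# in $1 whose following OCR line does not contain it.
# Assumption: the log on stdin has "Truth:" lines, each followed by an "OCR" line.
count_missed() {
  awk -v ch="$1" '
    /^Truth:/ { want = (index($0, ch) > 0); next }
    /^OCR/    { if (want && index($0, ch) == 0) missed++; want = 0 }
    END       { print missed + 0 }
  '
}

# Example use, with the lstmeval arguments from the tutorial:
#   training/lstmeval ... 2>&1 | count_missed '±'
```

A result of 0 corresponds to the situation described above, where every truth line with the character has a matching OCR line.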
NOTE: When fine tuning, it is important to experiment with the number of	注:微調整するとき、それの数で実験することは重要です
iterations, since excessive training on a small data set will cause	小さいデータセットに対する過度のトレーニングは
over-fitting. ADAM is great for finding the feature combinations necessary to	取り付け過ぎ。 ADAMは、必要な機能の組み合わせを見つけるのに最適です。
get that rare class correct, but it does seem to overfit more than simpler	そのまれなクラスを正しくしてください。
optimizers.	オプティマイザ
### Training Just a Few Layers	###ほんの少しの層をトレーニングする
Fine tuning is OK if you only want to add a new font style or need a couple of	新しいフォントスタイルを追加したい場合、またはいくつかのフォントスタイルを追加する必要がある場合は、微調整は問題ありません。
new characters, but what if you want to train for Klingon? You are unlikely to	新しいキャラクターが、あなたがクリンゴンのために訓練したい場合はどうなりますか？あなたはそうは思わない
have much training data and it is unlike anything else, so what do you do? You	たくさんのトレーニングデータがあり、それは他のものとは違っています、それであなたはどうしますか？君は
can try removing some of the top layers of an existing network model, replacing	既存のネットワークモデルの最上位層の一部を削除してみることができます。
some of them with new randomized layers, and training with your data. The	そのうちのいくつかは新しいランダム化されたレイヤーで、そしてあなたのデータで訓練します。の
command-line is mostly the same as [Training from	コマンドラインは[トレーニングとほぼ同じです。
scratch](#training-from-scratch), but in addition you have to provide a model to	スクラッチ](#training-from-scratch)が、さらにモデルを提供する必要があります
`--continue_from` and `--append_index`.	`--continue_from`と`--append_index`。
The `--append_index` argument tells it to remove all layers above the layer	`--append_index`引数は、レイヤの上のすべてのレイヤを削除するようにそれに伝えます。
with the given index (starting from zero, in the outermost series) and then	与えられたインデックスを使って、(最も外側の系列で、ゼロから始めて)
append the given `--net_spec` argument to what remains. Although this indexing	与えられた `--net_spec`引数を残りに追加します。この索引付け
system isn't a perfect way of referring to network layers, it is a consequence	システムはネットワーク層を参照するための完璧な方法ではありません、それは結果です
of the greatly simplified network specification language. The builder will	非常に単純化されたネットワーク仕様言語ビルダーは
output a string corresponding to the network it has generated, making it	生成したネットワークに対応する文字列を出力します。
reasonably easy to check that the index referred to the intended layer.	インデックスが目的のレイヤを参照していることを確認するのはかなり簡単です。
A new feature of 4.00 alpha is that combine_tessdata can list the content of a	4.00アルファの新機能は、combine_tessdataがaの内容をリストできることです。
traineddata file and its version string. In most cases, the version string	訓練データファイルとそのバージョン文字列。ほとんどの場合、バージョン文字列
includes the net_spec that was used to train:	訓練に使用されたnet_specを含みます。
` \| `
training/combine_tessdata -d tessdata/best/heb.traineddata	トレーニング/ combine_tessdata -d tessdata / best / heb.traineddata
Version string:4.00.00alpha:heb:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1]	バージョン文字列:4.00.00alpha:heb:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1]
17:lstm:size=3022651, offset=192	17:lstm:サイズ= 3022651、オフセット= 192
18:lstm-punc-dawg:size=3022651, offset=3022843	18:lstm-punc-dawg:サイズ= 3022651、オフセット= 3022843
19:lstm-word-dawg:size=673826, offset=3024221	19:lstm-word-dawg:サイズ= 673826、オフセット= 3024221
20:lstm-number-dawg:size=625, offset=3698047	20:lstm-number-dawg:サイズ= 625、オフセット= 3698047
21:lstm-unicharset:size=1673826, offset=3703368	21:lstm-unicharset:size = 1673826、offset = 3703368
22:lstm-recoder:size=4023, offset=3703368	22:lstmレコーダ:サイズ= 4023、オフセット= 3703368
23:version:size=80, offset=3703993	23:バージョン:サイズ= 80、オフセット= 3703993
` \| `
and for chi_sim:	そしてchi_simの場合:
` \| `
training/combine_tessdata -d tessdata/best/chi_sim.traineddata	トレーニング/ combine_tessdata -d tessdata / best / chi_sim.traineddata
Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]	バージョン文字列:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=1966, offset=192	0:設定:サイズ= 1966、オフセット= 192
17:lstm:size=12152851, offset=2158	17:lstm:サイズ= 12152851、オフセット= 2158
18:lstm-punc-dawg:size=282, offset=12155009	18:lstm-punc-dawg:サイズ= 282、オフセット= 12155009
19:lstm-word-dawg:size=590634, offset=12155291	19:lstm-word-dawg:サイズ= 590634、オフセット= 12155291
20:lstm-number-dawg:size=82, offset=12745925	20:lstm-number-dawg:サイズ= 82、オフセット= 12745925
21:lstm-unicharset:size=258834, offset=12746007	21:lstm-unicharset:size = 258834、offset = 12746007
22:lstm-recoder:size=72494, offset=13004841	22:lstmレコーダ:サイズ= 72494、オフセット= 13004841
23:version:size=84, offset=13077335	23:バージョン:サイズ= 84、オフセット= 13077335
` \| `
Note that the number of layers is the same; only the sizes differ. Therefore	レイヤーの数は同じですが、サイズが異なるだけです。だから
in these models, the following values of `--append_index` will keep the	これらのモデルでは、以下の `--append_index`の値は
associated last layer, and append above:	最後のレイヤに関連付けられ、上に追加:
Index	Layer	インデックス	層
:-------:	:-----------	:-------:	:-----------
`0`	Input	`0`	入力
`1`	`Ct3,3,16`	`1`	`Ct 3、3、16`
`2`	`Mp3,3`	`2`	「Mp3,3」
`3`	`Lfys48/64`	`3`	`Lfys48 / 64`
`4`	`Lfx96`	`4`	`Lfx96`
`5`	`Lrx96`	`5`	`Lrx96`
`6`	`Lfx192/512`	`6`	`Lfx192 / 512`
The weights in the remaining part of the existing model are unchanged initially,	既存のモデルの残りの部分の重みは、最初は変わりません。
but are allowed to be modified by the new training data.	しかし、新しいトレーニングデータによって修正されることを許可されています。
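
To double-check which layer an `--append_index` value refers to, the index table above can be reproduced from a version string. This is only a rough sketch: `list_layers` is a hypothetical helper, not a Tesseract tool, and it assumes a flat net_spec in which every layer token starts with an upper-case letter and there are no nested `(...)` or `[...]` groups, as in the two specs shown above. It relies on `grep -o`, which is widely available but not required by POSIX.

```shell
# Hypothetical helper: prints "index<TAB>layer" for a flat net_spec string.
# Assumption: each layer token starts with an upper-case letter and the spec
# contains no nested groups; the leading input spec is reported as index 0.
list_layers() {
  printf '0\tInput\n'
  printf '%s\n' "$1" |
    LC_ALL=C grep -oE '[A-Z][a-z0-9,]*' |
    awk '{ printf "%d\t%s\n", NR, $0 }'
}

# Example use:
#   list_layers '[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]'
```

Unlike the table, the listing also prints the final `O1c1` output layer as the last entry.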
As an example, let's try converting the existing chi_sim model to eng. We will	例として、既存のchi_simモデルをengに変換してみましょう。私達はします
cut off the last LSTM layer (which was bigger for chi_sim than the one used to	最後のLSTMレイヤ(これは、以前のものよりchi_simの方が大きかった)を切り捨てます。
train the eng model) and the softmax, replacing with a smaller LSTM layer and a	engモデル)とsoftmaxをトレーニングし、小さいLSTMレイヤーと
new softmax:	新しいソフトマックス:
` \| `
mkdir -p ~/tesstutorial/eng_from_chi	mkdir -p~/ tesstutorial / eng_from_chi
training/combine_tessdata -e tessdata/best/chi_sim.traineddata \	トレーニング/ combine_tessdata -e tessdata / best / chi_sim.traineddata \
~/tesstutorial/eng_from_chi/eng.lstm	~/ tesstutorial / eng_from_chi / eng.lstm
training/lstmtraining --debug_interval 100 \	training / lstmtraining --debug_interval 100 \
--continue_from ~/tesstutorial/eng_from_chi/eng.lstm \	--continue_from~/ tesstutorial / eng_from_chi / eng.lstm \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \	--traineddata~/ tesstutorial / engtrain / eng / eng.traineddata \
--append_index 5 --net_spec '[Lfx256 O1c111]' \	--append_index 5 --net_spec '[Lfx256 O1c111]' \
--model_output ~/tesstutorial/eng_from_chi/base \	--model_output~/ tesstutorial / eng_from_chi / base \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \	--train_listfile~/ tesstutorial / engtrain / eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \	--eval_listfile~/ tesstutorial / engeval / eng.training_files.txt \
--max_iterations 3000 &>~/tesstutorial/eng_from_chi/basetrain.log	--max_iterations 3000&>~/ tesstutorial / eng_from_chi / basetrain.log
` \| `
Since the lower layers are already trained, this learns somewhat faster than	下位層はすでに訓練されているので、これは学習よりもいくらか速く学習します。
training from scratch. At 600 iterations, it suddenly starts producing output	ゼロからのトレーニング600回の繰り返しで、突然出力を生成し始めます
and by 800, it is already getting most characters correct. By the time it stops	そして800年までには、すでにほとんどの文字が正しくなっています。止まる頃には
at 3000 iterations, it should be at 6.00% character/22.42% word.	3000回の繰り返しでは、6.00％の文字/ 22.42％の単語になります。
Try the usual tests on the full training set:	フルトレーニングセットで通常のテストを試してください。
` \| `
training/lstmeval --model ~/tesstutorial/eng_from_chi/base_checkpoint \	training / lstmeval --model~/ tesstutorial / eng_from_chi / base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \	--traineddata~/ tesstutorial / engtrain / eng / eng.traineddata \
--eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt	--eval_listfile~/ tesstutorial / engtrain / eng.training_files.txt
` \| `
and an independent test on the `Impact` font:	そして `Impact`フォントの独立したテスト:
` \| `
training/lstmeval --model ~/tesstutorial/eng_from_chi/base_checkpoint \	training / lstmeval --model~/ tesstutorial / eng_from_chi / base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \	--traineddata~/ tesstutorial / engtrain / eng / eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt	--eval_listfile~/ tesstutorial / engeval / eng.training_files.txt
` \| `
On the full training set, we get 5.557%/20.43% and on `Impact` 36.67%/83.23%,	フルトレーニングでは5.557％/ 20.43％、「インパクト」では36.67％/ 83.23％となります。
which is much better than the from-scratch training, but is still badly	ゼロからのトレーニングよりはるかに優れていますが、それでもまだひどいです
over-fitted.	オーバーフィット。
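
The comparison above can be scripted when several models are being evaluated. The sketch below assumes that the lstmeval output contains a line of the form `Char error rate=2.3767074, Word error rate=8.3829474`, as shown earlier in this tutorial. Both function names are hypothetical helpers rather than Tesseract tools.

```shell
# Prints the last "Char error rate" value found on stdin.
# Assumption: the log contains a line like
#   Char error rate=2.3767074, Word error rate=8.3829474
char_error_rate() {
  sed -n 's/.*Char error rate=\([0-9.]*\).*/\1/p' | tail -n 1
}

# Prints eval error minus training error, in percentage points, to three
# decimals. $1 is the training-set rate, $2 the independent eval rate.
error_gap() {
  awk -v t="$1" -v e="$2" 'BEGIN { printf "%.3f\n", e - t }'
}

# Example use with the character error rates quoted above:
#   error_gap 5.557 36.67
```

A large positive gap between the independent test and the training set is the over-fitting symptom discussed above.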
In summary, it is possible to cut off the top layers of an existing network and	要約すると、既存のネットワークの最上位層を切り捨てることが可能です。
train, as if from scratch, but a fairly large amount of training data is still	最初から練習しますが、かなり大量のトレーニングデータが残っています。
required to avoid over-fitting.	過度の取り付けを避けるために必要です。
## Error Messages From Training	##トレーニングからのエラーメッセージ
There are various error messages that can occur when running the training, some	トレーニングの実行中に発生する可能性があるさまざまなエラーメッセージがあります。
of which can be important, and others not so much:	そのうちの1つが重要になる可能性があり、他の人はそれほど重要ではありません。
`Encoding of string failed!` results when the text string for a training image	トレーニング画像のテキスト文字列が「文字列のエンコードに失敗しました！」という結果になる
cannot be encoded using the given unicharset. Possible causes are:	与えられたunicharsetを使ってエンコードすることはできません。考えられる原因は次のとおりです。
1. There is an un-represented character in the text, say a British Pound sign	1.テキストに表現されていない文字がある、イギリスポンド記号を言う
that is not in your unicharset.	それはあなたの一価にはありません。
1. A stray unprintable character (like tab or a control character) in the text.	1.テキスト内の印刷不能な文字(タブや制御文字など)。
1. There is an un-represented Indic grapheme/aksara in the text.	1.本文には、表現されていないインド語の書記素／アクサラがあります。
In any case it will result in that training image being ignored by the trainer.	いずれにせよ、それはその訓練画像が訓練者によって無視されることになるだろう。
If the error is infrequent, it is harmless, but it may indicate that your	エラーがまれであれば、それは無害ですが、それはあなたの
unicharset is inadequate for representing the language that you are training.	unicharsetはあなたが訓練している言語を表現するのには不十分です。
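
The second cause, stray unprintable characters, is easy to look for before training. The following is a minimal sketch under the assumption that the training text is a plain UTF-8 file; `find_control_chars` is a hypothetical helper name.

```shell
# Hypothetical helper: prints "line-number:line" for every line of the text
# file $1 that contains a control character such as a tab or carriage return.
# LC_ALL=C makes grep work on bytes; in that locale the control characters
# are bytes 0-31 and 127, so multi-byte UTF-8 text is not flagged.
find_control_chars() {
  LC_ALL=C grep -n '[[:cntrl:]]' "$1"
}

# Example use:
#   find_control_chars ../langdata/eng/eng.training_text
```

It does not detect the other two causes, which require comparing the text against the unicharset.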
`Unichar xxx is too long to encode!!` (Most likely Indic only). There is an	`Unichar xxxはエンコードするには長すぎます!!`(ほとんどの場合はインド語のみ)。あります
upper limit to the length of Unicode characters that can be used in the recoder,	レコーダーで使用できるUnicode文字の長さの上限
which simplifies the unicharset for the LSTM engine. It will just continue and	これはLSTMエンジンのためのunicharsetを単純化します。それは続きます
leave that Aksara out of the recognizable set, but if there are a lot, then you	Aksaraを認識可能なセットから除外しますが、たくさんある場合は、
are in trouble.	困っています。
`Bad box coordinates in boxfile string!` The LSTM trainer only needs bounding	`boxfile文字列内の不正なボックス座標！` LSTMトレーナーはバウンディングのみを必要とします
box information for a complete textline, instead of at a character level, but if	文字レベルではなく、完全なテキスト行のボックス情報
you put spaces in the box string, like this:	このように、ボックスの文字列にスペースを入れます。
` \| `
<text for line including spaces> <left> <bottom> <right> <top> <page>
` \| `
the parser will be confused and give you the error message.	パーサは混乱してあなたにエラーメッセージを与えるでしょう。
`Deserialize header failed` occurs when a training input is not in LSTM format	トレーニング入力がLSTM形式ではない場合、「ヘッダーのデシリアライズに失敗しました」が発生します。
or the file is not readable. Check your filelist file to see if it contains	またはファイルが読めません。それが含まれているかどうかを確認するためにあなたのファイルリストファイルをチェックしてください
valid filenames.	有効なファイル名
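
A quick way to check the filelist is to test every entry for readability. This is a small sketch with a hypothetical helper name, `check_listfile`; it assumes one path per line, and it only verifies that each file can be read, not that it is actually in LSTM format.

```shell
# Hypothetical helper: reports every path in the list file $1 that does not
# exist or is not readable. Returns non-zero if any entry fails.
check_listfile() {
  rc=0
  while IFS= read -r path || [ -n "$path" ]; do
    [ -z "$path" ] && continue          # skip blank lines
    if [ ! -r "$path" ]; then
      printf 'unreadable: %s\n' "$path"
      rc=1
    fi
  done < "$1"
  return "$rc"
}

# Example use:
#   check_listfile ~/tesstutorial/engtrain/eng.training_files.txt
```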
`No block overlapping textline:` occurs when layout analysis fails to correctly	レイアウト解析が正しく失敗した場合、 `ブロックオーバーラップテキストラインがありません:`が発生します
segment the image that was given as training data. The textline is dropped. This	トレーニングデータとして与えられた画像を分割します。テキスト行はドロップされます。ではない
is not much of a problem if there aren't many, but if there are a lot, there is probably	多くない場合は多くの問題がありますが、多くある場合はおそらく
something wrong with the training text or rendering process.	トレーニングテキストまたはレンダリングプロセスに問題があります。
`<Undecodable>` can occur in either the ALIGNED_TRUTH or OCR TEXT output early	ALIGNED_TRUTHまたはOCR TEXTのどちらかの出力では、初期に`<Undecodable>`が発生する可能性があります。
in training. It is a consequence of unicharset compression and CTC training.	研修中。これは、ユニキャスト圧縮とCTCトレーニングの結果です。
(See Unicharset Compression and train_mode above). This should be harmless and	(上記のUnicharset圧縮とtrain_modeを参照してください)。これは無害なはずです
can be safely ignored. Its frequency should fall as training progresses.	無視して構いません。その頻度はトレーニングが進むにつれて低下するはずです。
# Combining the Output Files	#出力ファイルを結合する
The lstmtraining program outputs two kinds of checkpoint files:	lstmtrainingプログラムは、2種類のチェックポイントファイルを出力します。
* `_checkpoint` is the latest model file.	* `_checkpoint`は最新のモデルファイルです。
* `_.checkpoint` is periodically written as	* `_ .checkpoint`は定期的に次のように書かれています
the model with the best training error. It is a training dump just like the	最良の訓練誤差をもつモデルそれはちょうどのようなトレーニングダンプです
checkpoint, but is smaller because it doesn't have a backup model to be used	チェックポイントですが、使用するバックアップモデルがないため、小さくなります
if the training runs into divergence.	訓練が分岐する場合
Either of these files can be converted to a standard traineddata file as	これらのファイルはどちらも標準の訓練済みデータファイルに変換できます。
follows:	次のとおりです。
` \| `
training/lstmtraining --stop_training \	トレーニング/ lstmtraining --stop_training \
--continue_from ~/tesstutorial/eng_from_chi/base_checkpoint \	--continue_from~/ tesstutorial / eng_from_chi / base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \	--traineddata~/ tesstutorial / engtrain / eng / eng.traineddata \
--model_output ~/tesstutorial/eng_from_chi/eng.traineddata	--model_output~/ tesstutorial / eng_from_chi / eng.traineddata
` \| `
This will extract the recognition model from the training dump, and insert it	これはトレーニングダンプから認識モデルを抽出し、それを挿入します
into the --traineddata argument, along with the unicharset, recoder, and any	unicharset、recoder、およびanyと共に--traineddata引数に
dawgs that were provided during training.	トレーニング中に提供された夜明け。
NOTE: Tesseract 4.00 will now run happily with a traineddata file that	NOTE Tesseract 4.00は訓練されたデータファイルでうまく動くでしょう。
contains just `lang.lstm`, `lang.lstm-unicharset` and `lang.lstm-recoder`. The	`lang.lstm`、`lang.lstm-unicharset`、 `lang.lstm-recoder`だけが含まれています。の
`lstm-*-dawgs` are optional, and *none of the other components are required or	`lstm - * - dawgs`はオプションで、*他のコンポーネントはどれも必須ではありません。
used with OEM_LSTM_ONLY as the OCR engine mode.* No bigrams, unichar ambigs or	OCRエンジンモードとしてOEM_LSTM_ONLYと共に使用されます。*バイグラム、unicharのあいまいさ、または
any of the other components are needed or even have any effect if present. The	他のコンポーネントのいずれかが必要であるか、存在している場合はなんらかの効果があります。の
only other component that does anything is the `lang.config`, which can affect	何かをする他のコンポーネントだけが `lang.config`です。
layout analysis, and sub-languages.	レイアウト分析、およびサブ言語。
If added to an existing Tesseract traineddata file, the `lstm-unicharset`	既存のTesseractの学習データファイルに追加された場合、 `lstm-unicharset`
doesn't have to match the Tesseract `unicharset`, but the same unicharset must	Tesseractの `unicharset`と一致させる必要はありませんが、同じunicharsetは必須です。
be used to train the LSTM and build the `lstm-*-dawgs` files.	LSTMを訓練して `lstm - * - dawgs`ファイルを構築するために使われます。
# The Hallucination Effect	#幻覚効果
If you notice that your model is misbehaving, for example by:	あなたのモデルが間違っていることに気づいたら、例えば:
* Adding a `Capital` letter instead of a `Small` letter at the beginning of certain words.	*特定の単語の先頭に「小」の代わりに「大文字」を追加する。
* Adding `Space` where it should not do that.	*そうすべきでないところに `Space`を追加する。
* etc...	*など
Then read the hallucination topic.	それから幻覚のトピックを読んでください。

Last updated: 2019-08-23 20:25
