Document Cleaner is a collection of tools for preparing badly formatted documents for translation. Use it to clean tags, fix formatting issues introduced by PDF conversion or OCR, and make sure that all text is fully visible.
We all have to translate badly formatted documents. These can be produced by OCR and PDF conversion software such as ABBYY Finereader, ABBYY PDF Transformer, Nuance Omnipage, etc., or by inexperienced users of word processing software. Incorrect formatting causes many problems in translation, including:
- Excessive tags in your CAT tool. These tags may be caused by hard-to-see changes in text formatting, excessive bookmarks, unnecessary text backgrounds, etc.
- Inability to see all text after translation. This usually occurs in tables, frames or textboxes.
- Difficulties in further editing, translation or formatting. OCR and PDF conversion tools use a lot of special formatting tricks to ensure that the Word document resembles the original PDF document or scanned image. Some of this special formatting makes it incredibly difficult to edit the document: for instance, the document layout may “fall apart” as soon as the text expands or shrinks during translation.
How Document Cleaner can help you in your work
Document Cleaner provides several tools to help you process badly formatted Word documents:
Tag Cleaner – this tool re-formats the document in order to minimize tags displayed when you import the document into your CAT tool. Tags are a result of complex formatting applied by OCR and PDF conversion tools in order to reproduce the appearance of the source PDF / scanned document. 99% of the time this complex formatting is not necessary, as it was never used in the original document.
Tag Cleaner performs the following (optional) operations to minimize tags and make the document more user-friendly:
- fixes invisible formatting problems,
- resets uneven character spacing,
- removes text and paragraph shading,
- turns ‘black’ font color into ‘automatic’ color,
- removes text highlighting,
- removes manual hyphenation,
- removes character styles, leaving direct formatting only,
- removes some types of unnecessary bookmarks,
- normalizes (makes standard) font colors inside each paragraph,
- normalizes (makes standard) font sizes inside each paragraph,
- normalizes (makes standard) fonts inside each paragraph, and
- finds specific symbols from symbol fonts and converts them to corresponding readable characters from standard fonts.
The Autoformat tool combines a number of operations that help you re-format Word documents generated by OCR and PDF conversion software or identify problems inside them. For example, the Autoformat tool can remove frames which can cause some translated text to become invisible, highlight tab characters inserted by OCR tools instead of spaces, apply a variable height setting to table rows to keep text visible, set consistent cell margins, and perform many other useful operations. In a matter of seconds, you can produce a document which is much easier and quicker to format.
The Other Tools page combines assorted tools which can be useful for formatting Word documents generated by OCR and PDF conversion tools:
- Quick Actions – This is a collection of assorted macros which can be applied to selected text, selected objects, or the whole document.
- Table Aligner – When you convert a PDF / scanned document containing multi-page tables, these tables are recognized as several smaller tables, one per page. When you join these tables, however, they will often have misaligned vertical borders. This tool helps to format such tables properly.
- Row Height Resizer – When OCR tools process tables, they apply special formatting to each row so that the row height is equal to or larger than the height of the row in the original PDF/scan. Some programs apply a fixed row height which prevents the row from expanding as more text is added to it, which is common during translation. This tool allows you to remove this formatting so that the row expands or shrinks in height depending on the amount of text in the row.
- Line Removal – Some OCR tools insert vertical or horizontal floating lines instead of borders. Removing these lines is very tiresome, as they are difficult to find. This tool helps you track them down and remove them much more easily.
- Resave – This tool saves the document to RTF format and back to the original format, which can sometimes help remove tags in some documents. However, it is recommended to use the Tag Cleaner tool instead, because conversion to RTF format can result in loss of formatting, and the Resave tool helps in a limited number of situations only.
Where to find the tool
To run Document Cleaner, click the Document Cleaner button on the TransTools ribbon in Word.
This opens the Document Cleaner dialogue.
Using Document Cleaner
Tag Cleaner tool
In an attempt to match original document formats, many OCR / PDF conversion tools may apply the following formatting to the document:
- use wide or narrow text spacing to match the size of letters and inter-character spacing in the original document
- use text and paragraph shading or highlighting to show text highlighted with a marker, or to mark unreliably recognized characters
- insert manual hyphens if the original document is hyphenated
- insert bookmarks, some of which are not used within the document
- use character styles in addition to regular (direct) formatting
- use different formatting for spaces or tabs
- use ‘black’ font color for certain words instead of ‘automatic’ color. Black and Automatic font colors look the same on most systems. Automatic is the default font color in Microsoft Office, and Black is rarely used. However, when some parts of the text are colored in black and others in automatic color, CAT programs will insert tags to indicate changes in formatting
- apply different text colors to various text in the same paragraph (esp. during conversion of image-based PDFs or scans)
- use jumping font sizes, e.g. text formatted as ‘10 pt’ next to ‘10.5 pt’ (this is especially true for documents acquired from image-based PDFs)
- use different fonts in the same paragraph or sentence, while source documents use only one font (this is especially true for documents acquired from image-based PDFs)
- apply thick underline formatting to text formatted in bold font
The above formatting is the primary source of tags in CAT software and makes it harder to format the document before and after translation.
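To illustrate why such formatting produces tags, here is a minimal Python sketch. It assumes a simplified model in which a paragraph is a list of text runs and a CAT tool inserts a tag at every boundary where the formatting changes. The `Run` class and the `count_format_changes` function are hypothetical names used only for this illustration; they are not part of Document Cleaner, Word, or any CAT tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Run:
    """A stretch of text with uniform formatting (simplified, hypothetical model)."""
    text: str
    font: str = "Arial"
    size: float = 10.0
    color: str = "auto"  # "auto" stands for Word's Automatic font color


def count_format_changes(runs):
    """Count boundaries between adjacent runs whose formatting differs."""
    return sum(
        1
        for a, b in zip(runs, runs[1:])
        if (a.font, a.size, a.color) != (b.font, b.size, b.color)
    )


# A paragraph as an OCR tool might produce it: one word in explicit black,
# one range half a point larger than the rest.
paragraph = [
    Run("The quick "),
    Run("brown", color="black"),
    Run(" fox jumps", size=10.5),
    Run(" over the lazy dog."),
]
changes_before = count_format_changes(paragraph)

# Turning 'black' into 'automatic' removes one of the formatting boundaries.
cleaned = [replace(r, color="auto") if r.color == "black" else r for r in paragraph]
changes_after = count_format_changes(cleaned)

print(changes_before, changes_after)  # 3 2
```

In this model, each cleaning option removes one class of formatting differences, so fewer boundaries (and therefore fewer tags) remain.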
The Tag Cleaner tool has the following options:
- Set default text spacing – reset character spacing to default values.
Character spacing properties include Scale, Spacing, Position and Kerning properties available on the Advanced page of the Font dialogue in Microsoft Word. These properties are very rarely used during document preparation, but OCR and PDF conversion tools use them very frequently to provide a Word document which looks close to the original PDF or image.
- Remove text shading – remove shading from text and paragraphs.
Text shading (not to be confused with highlight colors) is a background color behind individual words or paragraphs. Shading may be used by OCR tools to show unreliably recognized characters (e.g., in ABBYY Finereader), text highlighted with a marker, etc.
- Remove manual hyphens – remove hyphens which may cause extra tags or reduce TM match percentage.
Manual hyphens are visible only when non-printing symbols are displayed. OCR and PDF conversion tools insert them if the source PDF or image is hyphenated. However, these symbols will never be in the correct places in the Word document because of the text layout changes. In CAT tools, manual hyphens will reduce match percentage and may cause tags.
- Change font color from ‘Black’ to ‘Automatic’ – change ‘black’ font color to ‘automatic’ font color where appropriate.
Sometimes part of the text is formatted with the Automatic font color and part with the Black font color. Text formatted with these two colors looks the same on most computers. However, CAT tools treat the two colors differently and insert a tag for every change from one color to the other. This option changes the Black font color to the Automatic font color across the document.
The option is not selected by default.
- Remove excessive bookmarks – remove bookmarks which are not required before translation.
This option is relevant for CAT tool users. CAT tools show bookmarks as pairs of tags, so it is recommended to remove unnecessary bookmarks before importing the Word document into your CAT tool. This option removes several categories of bookmarks which may be unnecessary:
- OLE LINK bookmarks (these are completely unnecessary, so this option is checked by default)
- Bookmarks not referenced from inside the document (unchecked by default). These are bookmarks which are not referred to from fields or hyperlinks.
- Table of Contents bookmarks (unchecked by default). These bookmarks are used for jumping from a table of contents to the relevant document section. However, they are reinserted automatically when you regenerate a table of contents after you create the translated document, so they can be removed prior to translation to reduce tags.
- Remove highlighting – remove text highlight color.
Text highlighting may be used by OCR and PDF conversion tools to represent text highlighted with a marker. Use this option if you do not need to preserve text highlighting in the translated document.
This option is not selected by default.
- Remove character styles, leave direct formatting only.
This option is highly recommended for all documents obtained through OCR or PDF conversion. Some OCR / PDF conversion tools, e.g. ABBYY Finereader, make frequent use of character styles on top of paragraph styles. If these character styles are left in the document, they will be imported into your CAT tool as tag pairs. At the same time, character styles are of value only if the document was formatted by a human. This option strips character styles but keeps their formatting as direct formatting, so the appearance of the text does not change.
This option is not selected by default.
- Accept tracked changes and switch off change tracking
This option accepts all tracked changes and switches off change tracking mode to avoid excessive tags caused by tracked changes in some CAT tools.
This option is not selected by default.
- Normalize font colors in each paragraph.
Quite often, the OCR or PDF conversion tool decides that the text in a specific paragraph is formatted in different colors. This often occurs in scanned PDFs which naturally have several shades of black or some other color on a page. Different text colors in the same paragraph cause tags in your CAT tool. If you select this option, Tag Cleaner will apply the color of the first word in the paragraph to the text in the same paragraph that has a different font color if the two colors are close enough.
This option is not selected by default.
- Normalize font size in each paragraph.
When you convert a PDF, esp. an image-based PDF, the OCR or PDF conversion tool may decide that there are several different font sizes in a given paragraph, while in fact there is only one. It can format one range of text as ‘9 pt’, the second – as ‘9.5 pt’, and the third – as ‘9 pt’ again. Because human-authored documents rarely have more than one font size in a given paragraph, you can use this option to level font sizes across the paragraph or sentence using the first font size of the paragraph/sentence. To use this option, select the appropriate setting from the picklist:
- Fix paragraph font size differences of 1 / 2 / 3 pt or smaller – select one of these settings if you want to apply the same font size only if the size of some text is different from the first font size of the paragraph by 1, 2 or 3 points or less
- Apply first font size to all ranges within the paragraph – select this setting in order to apply the first font size to the rest of the paragraph
- Normalize font in each paragraph.
When you convert an image-based PDF, the OCR or PDF conversion tool may decide that there are several different fonts in a given paragraph, while in fact there is only one. For example, the OCR tool can recognize one range of text as ‘Cambria font’ and another one as ‘Calibri font’. When you import the document into the CAT tool, this will result in several tags. Because most documents use the same font in a given paragraph (unless a symbol font is used), you can safely format the rest of the paragraph using the same font as used in the beginning of the paragraph. This option does just that.
This option is not selected by default.
- Replace special symbols with regular characters where appropriate.
Some Word documents contain text written with the help of so-called “symbol fonts”, such as Symbol, Wingdings, Webdings, etc. These symbols are similar to images and so they are displayed as tags in CAT tools, making them difficult to understand and use in the translated text.
However, some of these symbols (e.g., ©, ™, etc.) can also be found in standard fonts and so they can be represented as normal readable characters. This option finds the majority of symbols from standard symbol fonts (including Symbol, Wingdings and Wingdings 3) and replaces them with readable (Unicode) characters from standard fonts, reducing the number of tags and improving readability in your CAT tool.
This option is not selected by default.
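The two picklist settings of the "Normalize font size in each paragraph" option described above can be sketched in Python as follows. This is only an illustration of the rule as described in this document, assuming a paragraph is represented by the list of font sizes of its consecutive text ranges; `normalize_font_sizes` and its `max_diff` parameter are hypothetical names, not the tool's actual implementation.

```python
def normalize_font_sizes(sizes, max_diff=None):
    """Return a copy of `sizes` leveled to the first font size of the paragraph.

    sizes    -- font sizes (in points) of consecutive text ranges
    max_diff -- only fix ranges that differ from the first size by this many
                points or less; None applies the first size to every range
    """
    if not sizes:
        return []
    first = sizes[0]
    return [
        first if max_diff is None or abs(size - first) <= max_diff else size
        for size in sizes
    ]


# "Fix paragraph font size differences of 1 pt or smaller":
print(normalize_font_sizes([9, 9.5, 9, 14], max_diff=1))  # [9, 9, 9, 14]

# "Apply first font size to all ranges within the paragraph":
print(normalize_font_sizes([9, 9.5, 9, 14]))  # [9, 9, 9, 9]
```

With a threshold, the 14 pt range is left alone because it is likely to be intentional, whereas the 9.5 pt range is treated as a recognition artifact.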
Uncheck the “Check for unsupported formatting” checkbox if you want to speed up processing of long documents which do not have the advanced text effect formatting introduced in Word 2010 and later. OCR software never uses such formatting, so you can safely uncheck this checkbox. However, if the document was produced by a human, I would recommend that you leave this checkbox activated, otherwise this formatting may be removed during the operation.
To remove a specific type of formatting from the entire active document, check the appropriate options and then click the Clean tags button.
The following Tag Cleaner settings are recommended for processing of Word documents generated with OCR and PDF conversion tools:
1) Set default text spacing, 2) Remove text shading, 3) Remove manual hyphens, 4) Change font color from ‘Black’ to ‘Automatic’, 5) Remove excessive bookmarks + Not referenced from inside document, 6) Remove character styles.
Other settings can be activated if you still see excessive tags in your CAT tool and you are confident that no loss of original formatting will occur. If you use such additional options, it is recommended to check the cleaned document against the original document (e.g., by using the Compare Documents feature in Microsoft Word to compare the original and cleaned versions of the document). If you do not enable the three normalization options (‘Normalize font color’, ‘Normalize font size’ and ‘Normalize font’), Tag Cleaner will check the document for inconsistent font colors, font sizes or fonts in the same paragraph and notify you whether you should enable these options or use Autoformat to find and manually format such paragraphs.
If you process a document produced by a human, I would advise against using the options which are unchecked by default. In fact, if the document was authored by a human from its creation and not joined together from various sources, the Tag Cleaner command should not be used at all as it may remove some intended formatting. Always err on the side of caution in such cases.
Autoformat tool
To reproduce the layout and formatting of a PDF or scanned document, OCR and PDF conversion tools apply many kinds of special formatting. This formatting makes it incredibly difficult to format, edit and translate a Word document. The Autoformat tool performs a number of optional operations to modify document layout, change its formatting, or identify potential issues. With this tool, you will be able to make the document more “editable” in a matter of seconds.
The Autoformat tool has the following options:
- Text formatting
- Highlight paragraphs formatted using several different fonts, font sizes or font colors (unchecked by default) – when you use this option, the Autoformat tool will apply text highlighting to paragraphs which contain more than one font, font size or font color. For example, a paragraph may have Arial formatting while some words are formatted using Calibri font. Such paragraphs may need to be formatted, as different fonts or font sizes in the same paragraph will look unprofessional and will cause excessive tags to appear in your CAT tool.
- Use single paragraph line spacing (unchecked by default) – Most OCR programs will apply different line spacing to different paragraphs of the converted documents. For example, Multiple or Exactly line spacing can often be seen in documents converted from PDF files. Use this option to apply Single line spacing to every paragraph – this helps with further formatting of the document and you can later apply a consistent line spacing manually.
- Highlight tables of contents (unchecked by default) – Sometimes, OCR and PDF conversion programs incorrectly recognize text as a table of contents. Such text will look like normal text in Microsoft Word, but you will not be able to see it from your CAT program or translate it safely in Microsoft Word. Even if you translate the text by overtyping it in Word, your translation will disappear when the table of contents is updated. This option will highlight all tables of contents with a predefined color so that you can find text erroneously recognized as a table of contents and convert it to simple text (see the appropriate command under Other Tools / Quick Actions tab).
- Remove empty paragraphs (unchecked by default) – Most well-formatted native Word documents do not have empty paragraphs. Use this option to remove all empty paragraphs from the document.
- Remove text shading (unchecked by default) – OCR / PDF conversion tools often apply shading to text to identify text highlighted with a marker, to show unreliably recognized symbols, etc. Use this option to clear all text and paragraph shading (except table shading).
- Set default text spacing (unchecked by default) – Text spacing (including spacing, scaling, position and kerning) is special formatting which is usually applied by OCR / PDF conversion tools in order to match the original inter-character distances in text. Regular documents rarely use this type of formatting. This option applies default font spacing to text in the entire document.
- Text editing
- Clearly show where all tab characters are (unchecked by default) – If a source PDF document contains justified text, you can often see tab characters between words in the converted documents. This will give you a lot of headaches during translation, because tab characters will appear as tags in your CAT program or can even cause incorrect segmentation. This option makes it easy to see where all tab characters are in your document so that you can replace them with spaces where appropriate (see the appropriate command under Other Tools / Quick Actions tab).
- Remove soft (manual) hyphens (unchecked by default) – Most OCR / PDF conversion tools reproduce hyphenation using soft (manual) hyphens. Hyphens make it more difficult to translate text and reduce TM matches. This option removes all soft hyphens.
- Left-align all tables (unchecked by default) – usually, OCR and PDF conversion programs center-align all tables, making it more difficult to resize columns. This option left-aligns all tables to make it easier to format them.
- Set Text Wrapping to None for all tables (unchecked by default) – Some OCR programs may apply custom text wrapping for tables (text wrapping can be changed manually for each table under Table tab of Table Properties dialogue). This can cause text to flow around the table, so the table may appear in the middle of a text paragraph. This option changes the Text Wrapping of each table to None so that the table appears as it normally should.
- Use “AutoFit to Window” option for all tables (unchecked by default) – this option applies “AutoFit to Window” formatting to every table so that they occupy the entire width of the document.
- Use consistent left/right cell margins for all tables (unchecked by default) – many OCR and PDF conversion programs use very small cell margins inside tables. This option applies consistent left/right cell margins to every table in the document.
- Apply variable row height to fixed-height rows – When OCR programs convert tables to Microsoft Word format, they often set a fixed height or minimum height for certain rows. This prevents rows from changing their size as text expands or shrinks during translation. This option changes the formatting of appropriate table rows to fix this. If you check this option, select one of the following settings:
- Use minimum height instead of fixed height – when this setting is selected, all fixed-height rows will be formatted as minimum-height rows. This means that, if the text expands during translation, these rows will expand to fit the new text, but they will not shrink if the translation is shorter than the original text.
- Set variable height for all fixed-height rows – when this setting is selected, all fixed-height rows will be formatted as variable-height rows. In other words, these rows will shrink or expand to accommodate the text.
- Set variable height for all rows – when this setting is selected, all rows (including fixed-height and minimum-height rows) will become variable-height rows. In comparison with the previous option, this one not only applies to fixed-height rows, but also to minimum-height rows.
- Apply transparent background to all tables (unchecked by default) – This option removes background color and/or pattern in every table so that every table has transparent background. This may be useful if the OCR / PDF conversion program uses inconsistent or incorrect colors or if you don't want to have background colors in tables.
- Use consistent cell alignment for all tables (unchecked by default) – This option applies consistent horizontal and vertical alignment to all cells in every table. You can choose one of standard settings: Top Left, Top Center, Top Right, Middle Left, Middle Center, Middle Right, Bottom Left, Bottom Center, Bottom Right; Top, Middle and Bottom (horizontal text alignment is not changed); Left, Center, Right and Justified (vertical text alignment is not changed).
- Apply consistent borders to all tables (unchecked by default) – This option applies consistent borders to all tables. You can choose between No Borders and All Borders, and specify border width.
- Document Layout
- Use consistent page margins for all sections in the document (unchecked by default) – OCR programs usually apply inappropriate margins to different pages of converted documents. For example, you will often see that the page margins are too small and that each paragraph is indented from the margins to make the page look normal. You will also see that different pages have different margins. Use this option to apply consistent margins to each page of the document. You can select between three settings: Normal, Narrow and Moderate, which correspond to the options available in the Margins list under Page Layout tab of Microsoft Word.
- Minimize the number of section breaks in the document (unchecked by default) – In an attempt to recreate the exact layout of the source PDF documents, OCR programs may use a lot of section breaks with unique headers and footers. This option attempts to remove a section break between two pages if their page size, orientation and headers / footers are the same.
- Format all text in one text column (unchecked by default) – OCR programs often use several text columns to format text instead of formatting it as a table. This causes a number of problems during formatting and translation. Use this option to apply a single text column to all text in the document so that you can reformat the text appropriately.
- Use a single paper size for all pages (unchecked by default) – Usually, OCR programs try to recreate the page size of the original PDF in the converted Word document. This may result in several page sizes applied to different sections of the document. Use this option to apply a consistent page size to every section of your document. You can select between four settings: A4, A3, Letter and Legal, which are the most common document page sizes.
- Convert page breaks and/or column breaks to regular paragraph breaks (unchecked by default) – OCR programs often insert page breaks and column breaks in converted documents. If you wish to remove such breaks from a document, use this option and select the appropriate settings to convert page breaks, column breaks, or both.
- Do not put text / tables / images inside frames – OCR programs often place images and tables in a layout element called “frame” (not to be confused with textboxes). Frames are used to position things accurately, but they are unnecessary and can hide text out of view if it expands during translation. Use this option to remove frames, leaving their contents intact. Leave this option checked if you want to use the “Left-align all tables” and/or “Use “AutoFit to Window” option for all tables” options above.
- Format invisible (1-point size) paragraphs using default font size (unchecked by default) – OCR programs often use invisible (1-point) paragraphs to separate content in converted documents. This will make it more difficult to format the document because you will not see page / section / column breaks which often hide at the end of such paragraphs. Use this option to apply default font size (usually 10–12 pt) to such invisible paragraphs.
- Make all images inline (unchecked by default) – OCR programs format most images as floating images. This makes it more difficult to format the document after conversion. This option applies inline text wrapping to all the images, making it easier to align or copy/paste images inside the document.
- Clearly identify all textboxes (unchecked by default) – OCR and PDF conversion programs often use textboxes to position text, images and tables. Most of the time, such textboxes are problematic for translation because they do not expand to fit the translated text and are hard to position correctly. This option changes the fill color or text highlight color of all textboxes so you can easily see where textboxes are and decide what to do with them. You can choose how to identify textboxes: with a Textbox Fill Color (background color of the textbox) or Text Highlight Color (highlight color of the text inside the textbox). If you want to remove a textbox, you can use the appropriate command under Other Tools / Quick Actions tab.
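As an illustration of the "Minimize the number of section breaks in the document" option above, the following Python sketch merges adjacent sections when their page size, orientation and headers / footers are the same, which is the condition stated in the option's description. The `Section` class and `merge_sections` function are hypothetical and model a section with plain strings; how the tool itself compares headers and footers is not documented here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Section:
    """Simplified, hypothetical model of a document section."""
    page_size: str
    orientation: str
    header: str
    footer: str
    pages: int = 1


def same_layout(a, b):
    """True if the break between two sections can be removed under the stated rule."""
    return (a.page_size, a.orientation, a.header, a.footer) == (
        b.page_size, b.orientation, b.header, b.footer
    )


def merge_sections(sections):
    """Merge each section into the previous one when their layouts match."""
    merged = []
    for section in sections:
        if merged and same_layout(merged[-1], section):
            previous = merged.pop()
            merged.append(replace(previous, pages=previous.pages + section.pages))
        else:
            merged.append(section)
    return merged


sections = [
    Section("A4", "portrait", "Report", "Page"),
    Section("A4", "portrait", "Report", "Page"),
    Section("A4", "landscape", "Report", "Page"),
]
result = merge_sections(sections)
print(len(result), result[0].pages)  # 2 2
```

The landscape section keeps its own section break because its orientation differs from the preceding pages.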
If your document was produced by an OCR or PDF conversion tool, it is always a good idea to run the Autoformat tool with the following options:
- Apply variable row height to fixed-height rows with Use minimum height instead of fixed height setting. This action is needed to make sure that you will be able to see the translation inside tables if it is longer than the original text.
- Do not put text / tables / images inside frames – this action is required because some OCR tools like ABBYY Finereader use frames for positioning of tables, images and some text. If the source text is inside a frame and the translation is longer than the source text, you may not be able to see the translation fully.
The above two options are completely safe for any document, even those produced by a human. To use other options, make sure that you understand exactly what each option does. Be prepared to undo the actions if they cause unexpected results.
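The three settings of the "Apply variable row height to fixed-height rows" option can be summarized in a short Python sketch. The names `HeightRule`, `Setting` and `new_height_rule` are hypothetical and only model the behavior described in this document; they are not the tool's code.

```python
from enum import Enum


class HeightRule(Enum):
    """Row height rules as described in this document."""
    VARIABLE = "variable"  # row grows and shrinks with its text
    MINIMUM = "minimum"    # row can grow but will not shrink below a set height
    FIXED = "fixed"        # row never changes height; extra text is hidden


class Setting(Enum):
    MINIMUM_INSTEAD_OF_FIXED = 1
    VARIABLE_FOR_FIXED_ROWS = 2
    VARIABLE_FOR_ALL_ROWS = 3


def new_height_rule(rule, setting):
    """Return the height rule a row ends up with under the chosen setting."""
    if setting is Setting.MINIMUM_INSTEAD_OF_FIXED:
        return HeightRule.MINIMUM if rule is HeightRule.FIXED else rule
    if setting is Setting.VARIABLE_FOR_FIXED_ROWS:
        return HeightRule.VARIABLE if rule is HeightRule.FIXED else rule
    # VARIABLE_FOR_ALL_ROWS: fixed-height and minimum-height rows both become variable.
    return HeightRule.VARIABLE


for setting in Setting:
    print(setting.name, [new_height_rule(rule, setting).value for rule in HeightRule])
```

Under all three settings no row stays fixed-height, which is why a longer translation remains visible; the settings differ only in whether rows may also shrink.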
Once you have activated the appropriate options, click the Run selected commands button. The selected actions will be applied to the entire document. Make sure to check the document thoroughly and make any additional formatting changes needed to prepare the document.
If you want to perform only one or two actions on the document, you can click the Run one command... button. This opens a new dialogue where you can select a single action along with its settings and then click Perform selected action to execute the desired command.
Other Tools page
The Other Tools page incorporates additional formatting tools that you may need to use on a case-by-case basis to prepare a well-formatted document.
Quick Actions tool
Quick Actions is a collection of commands that allows you to perform various actions on selected text, selected objects, or the entire document. Read below for a description of each command.
- Editing – Correct formatting of numbered list markers (document). This command corrects the formatting of markers in numbered lists, using the same bold, size and font face properties as the list text itself.
- Editing – Replace all highlighted tabs with spaces (document). This command searches the entire document for tab characters identified by a highlight color and replaces them with simple spaces. It can be used in conjunction with the relevant option of the Autoformat tool.
- Editing – Convert table(s) of contents to text (selection). This command converts selected tables of contents to regular text. This is useful if the OCR or PDF conversion program has converted regular text to a table of contents by mistake. To use, click anywhere within the table of contents and run this command. You can use this command in conjunction with the relevant option of the Autoformat tool.
- Editing – Convert numbered list items to text (selection). This command converts selected numbered list items to regular text so that the marker becomes editable. This is useful if the OCR or PDF conversion program has converted regular text to a numbered list by mistake, or if it used incorrect list numbering. To use, click anywhere within the relevant list items (or select appropriate paragraphs) and run this command.
- Editing – Convert hyperlinks to text (selection). This command converts hyperlinks in selected text to regular text. Hyperlinks cause extra tags in CAT tools. While the ability to click a hyperlink is often important, there are many cases when you only need to preserve the hyperlink text, not the URL address behind the text. First, this may be needed if the converted PDF was not recognized properly and the website URL of many hyperlinks needs to be edited. Second, some documents contain a lot of readable URL addresses which can simply be copied from the document, so there is no need to keep them as hyperlinks. Third, some documents copied from legal reference systems contain superfluous hyperlinks cross-referencing other areas in the document, and there is often no need to preserve these cross-references after translation. To use this command, select the text containing hyperlinks and run it.
- Bookmarks – Remove all bookmarks (except hidden bookmarks). This command removes all bookmarks from the document, except for hidden bookmarks (however, TOC bookmarks are removed). You can use this command if you want to get rid of tags caused by bookmarks and the Tag Cleaner tool does not remove them.
- Bookmarks – Remove all bookmarks (including hidden bookmarks). This command removes absolutely all bookmarks from the document. You can use this command if you want to get rid of tags caused by bookmarks and the Tag Cleaner tool does not remove them.
- Textboxes – Autofit all textboxes (vertically). This command autofits all textboxes in the current document (vertically) so that each textbox fully displays its text.
- Textboxes – Move text out of selected textboxes. OCR and PDF conversion programs often use textboxes instead of multiple text columns or tables, so if you have good knowledge of Microsoft Word, it is better to get rid of textboxes and format the text using tables or columns. This command moves the text from each selected textbox and places it inside the paragraph to which this textbox is anchored (note that the text may end up in a different place, although it will always be on the same page as the textboxes). The textboxes are then removed, and the moved text becomes highlighted. You can use this command in conjunction with the relevant option of the Autoformat tool.
- Textboxes – Move text out of all textboxes (entire document). This command is very similar to the command above, but instead of processing selected textboxes, it processes all textboxes found in the current document (unless a textbox is inside a group or canvas). It is especially useful if you use a PDF conversion service which places text in textboxes. It is strongly recommended to back up your document before running this command because it may produce undesirable results in certain cases.
- Textboxes – Convert all textboxes to frames. This command converts all floating textboxes to Word frames. Frames are somewhat easier to work with than textboxes, but they are still problematic for translation and further formatting because they cannot expand to fit the translation or break across several pages. Avoid using this command if your document contains a lot of textboxes and you plan to remove the frames after that, because there is no guarantee that the text from the frames will end up in the correct location on the page.
- Paragraph formatting – Reset indentation and spacing (selection). This command sets paragraph indentation (including left, right, first line and hanging indentation) to 0, applies single line spacing, and sets paragraph spacing before and after the selected paragraphs to 0.
- Paragraph formatting – Reset indentation (selection). This command sets paragraph indentation (including left, right, first line and hanging indentation) to 0 for all selected paragraphs.
- Paragraph formatting – Reset right indentation (selection). This command sets right indentation to 0 for all selected paragraphs. Most OCR tools seem to use right indentation inappropriately – they use right indentation to wrap text to another line in places where a line or paragraph break needs to be inserted. Resetting right indentation will allow you to see where line or paragraph breaks need to be inserted to avoid formatting and segmentation problems.
- Paragraph formatting – Reset spacing (selection). This command applies single line spacing and sets paragraph spacing before and after the selected paragraphs to 0.
- Paragraph formatting – Reset tab places to defaults (selection). This command resets tab stops to their default positions in the selected paragraphs. You may need to use this command if some of the text in the document becomes invisible because of large tab position settings.
- Paragraph formatting – Reset paragraph shading (selection). This command removes paragraph shading (the paragraph's background color or pattern) from selected paragraphs. Note that, in addition to paragraph shading, Microsoft Word also has cell background color which also appears as text background color, although it can appear only inside tables and looks different.
- Font formatting – Reset all font formatting (selection). This command removes all text formatting attributes except font name, font size and color from selected text.
- Font formatting – Reset text shading (selection). This command removes shading (background color or pattern) applied to selected text.
- Font formatting – Reset text spacing (selection). This command resets text spacing properties (scale, spacing, position and kerning) applied to selected text.
- Font formatting – Reset all paragraph and font formatting (selection). This command resets all font and paragraph formatting in the selected text, including paragraph indentation and spacing, tab places, and font attributes.
After selecting a command from the list, click the Run selected command button to execute it.
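The difference between the two bookmark commands in the list above can be sketched in a few lines of Python. This is only an illustration, not Document Cleaner's actual code: it assumes that hidden bookmarks are the ones whose names begin with an underscore (Word's naming convention) and that TOC bookmarks are the hidden ones whose names begin with "_Toc". The function name `bookmarks_to_remove` and the sample bookmark names are hypothetical.

```python
def bookmarks_to_remove(names, include_hidden=False):
    """Return the bookmark names that a removal pass would delete.

    Assumptions (not confirmed by the tool's documentation):
    - a bookmark is "hidden" if its name starts with "_";
    - TOC bookmarks are hidden bookmarks named "_Toc...".
    """
    if include_hidden:
        # "Remove all bookmarks (including hidden bookmarks)"
        return list(names)
    # "Remove all bookmarks (except hidden bookmarks)":
    # visible bookmarks go, and so do TOC bookmarks.
    return [n for n in names if not n.startswith("_") or n.startswith("_Toc")]


names = ["Chapter1", "_Toc4512", "_Ref7788", "_GoBack", "Figure2"]
print(bookmarks_to_remove(names))
print(bookmarks_to_remove(names, include_hidden=True))
```

With the sample names, the first call leaves `_Ref7788` and `_GoBack` in place, while the second call selects every bookmark.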
Table Aligner tool
When you perform OCR on a document containing a long table that occupies several pages, OCR tools break the table into smaller tables, one per page. After the Word document is produced, you need to join these tables to create a single table. However, depending on the original quality of the PDF / scanned document, you will often find that the columns in the joined table are not fully aligned.
Table Aligner helps you align tables after you join individual tables together.
In order to align the table, do the following:
- Select rows. You can select the entire table or only individual rows; however, the beginning of the selection must always contain "benchmark" rows with the correct column widths to be applied to the rows below. For example, a benchmark row of type 1 may be a row with 4 columns merged (like the first row of the table above), a benchmark row of type 2 may be a row with 4 columns, type 3 a row with 3 columns, one of which is merged, etc.
- Choose additional options:
- Select the first option if the table has different types of rows with different numbers of columns, and you want to align columns in all the rows (in the sample table above, the first row has only one column, and the 2 rows below have 4 columns each). If you use this option and select rows instead of the whole table, make sure that the first rows you select include "benchmark" rows with correct column widths.
- Select "Skip such rows" if you only want to have correct column widths in rows which contain the same number of columns as the very first selected row;
- Select "Highlight text of such rows using Yellow colour" if you want to have correct column widths in rows which contain the same number of columns as the very first selected row, and have all other rows with a different number of columns highlighted in yellow.
- Click the Align Columns button. The operation may take some time depending on the complexity and size of the table.
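The benchmark-row idea described in the steps above can be sketched as follows. This is a simplified model under stated assumptions, not the tool's implementation: each row is represented as a plain list of cell widths, a row "type" is reduced to its number of cells (merged-cell patterns are ignored), and the function name `align_rows` and the mode names `"all"`, `"skip"` and `"highlight"` are made up for this example.

```python
def align_rows(rows, mode="all"):
    """Copy column widths from benchmark rows to the rows below them.

    rows: list of rows, each a list of cell widths (hypothetical model).
    mode: "all"       - align every row type; the first row seen with a
                        given cell count is the benchmark for that type;
          "skip"      - align only rows with the same cell count as the
                        first selected row, leave the others untouched;
          "highlight" - like "skip", but also report the untouched rows.
    Returns (aligned_rows, highlighted_row_indexes).
    """
    if mode not in {"all", "skip", "highlight"}:
        raise ValueError(f"unknown mode: {mode}")
    if not rows:
        return [], []

    benchmarks = {}              # cell count -> benchmark widths
    first_count = len(rows[0])   # row type of the very first selected row
    aligned, highlighted = [], []

    for i, row in enumerate(rows):
        count = len(row)
        if mode == "all":
            bench = benchmarks.setdefault(count, list(row))
            aligned.append(list(bench))
        elif count == first_count:
            aligned.append(list(rows[0]))
        else:
            aligned.append(list(row))
            if mode == "highlight":
                highlighted.append(i)
    return aligned, highlighted


table = [[400], [100, 100, 100, 100], [98, 103, 99, 100], [395]]
print(align_rows(table, mode="all"))
print(align_rows(table, mode="highlight"))
```

In the sketch, the order of the selection matters for the same reason it does in the tool: whichever row of a given type comes first supplies the widths for all later rows of that type.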
Row Height Resizer tool
When OCR tools process tables, they apply special formatting to rows so that the row height is equal to or greater than the height of the rows in the original PDF / scan. If the row height is fixed and the text expands after translation, the text may not fit inside the row and will not be fully visible. If the row has a minimum height and the translated text is shorter, the row will not shrink in height, so there may be a lot of extra space around the text.
This tool allows you to apply default (variable) row height so that the rows expand or shrink in height depending on the amount of text in them.
- Apply variable row height setting to ALL tables in the active document – choose this option to apply variable row height attributes to all table rows in the active document;
- Decide which rows need to be formatted as variable height rows – choose this option in order to decide which rows need to be changed. When you select this option, you will see a list of all rows that have minimum and/or fixed height (depending on the setting of the next option). You can then use Up and Down arrow keys or the mouse within the list to see every row in the document, and then select the rows which need to be changed.
- Ignore rows with minimum height setting – Select this option if you only want to act on rows with ‘fixed’ height setting, and ignore rows with ‘minimum’ height setting. This option is selected by default.
Click the Apply variable height setting button. The operation may take some time depending on the number of tables in the document.
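The selection logic behind these options can be illustrated with a small Python model. It is a sketch only: the `Row` class, the rule names `"auto"`, `"at_least"` and `"exact"` (loosely mirroring Word's automatic, minimum and fixed row height rules) and both function names are assumptions made for the example, not part of Document Cleaner.

```python
from dataclasses import dataclass


@dataclass
class Row:
    table: int            # table number in the document
    index: int            # row number inside the table
    height_rule: str      # "auto", "at_least" (minimum) or "exact" (fixed)
    height: float = 0.0   # height in points; ignored when the rule is "auto"


def rows_to_change(rows, ignore_minimum=True):
    """List rows that are candidates for the variable-height setting."""
    wanted = {"exact"} if ignore_minimum else {"exact", "at_least"}
    return [r for r in rows if r.height_rule in wanted]


def apply_variable_height(rows):
    """Switch the given rows to variable (automatic) height."""
    for r in rows:
        r.height_rule = "auto"
        r.height = 0.0


rows = [
    Row(1, 1, "exact", 20.0),
    Row(1, 2, "at_least", 15.0),
    Row(2, 1, "auto"),
]
candidates = rows_to_change(rows)   # default: minimum-height rows are ignored
apply_variable_height(candidates)
print([(r.table, r.index, r.height_rule) for r in rows])
```

With the default setting, only the fixed-height row is changed; passing `ignore_minimum=False` would also pick up the row with a minimum height.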
Line Removal tool
Some OCR tools insert vertical or horizontal floating lines instead of borders. Removing these lines is very tiresome, as they are difficult to find. This tool helps you track them down and remove them much more easily.
When you select the Line Removal tab, you will see a list of all vertical or horizontal floating lines in the active document. To see a line in context, select it in the list. To remove a line individually, click Remove Selected Item. To remove all vertical / horizontal lines that were found, click Remove All.
Resave tool
Some documents, when imported into CAT tools, have thousands of tags for no apparent reason. In most cases, the problem manifests as tags surrounding spaces between words. This command saves the document to RTF format and then back to the original format, which often eliminates all such 'rogue' tags. Note: the issues fixed by the Resave tool are now handled by the Tag Cleaner tool, so you should use Tag Cleaner instead.
Before clicking Resave active document, make sure of the following:
- You do not need to use the undo command (the document will be closed during the resaving operation).
- The document is in DOC or DOCX format (the command will not proceed otherwise).
- Macros are not retained when the document is converted to RTF format. Therefore, the document you want to resave must not contain any macros.
If your document is in DOC format (DOCX files do not contain macros), be aware that macros may disappear after the document is resaved using this command. If you use Word 2007 or later, the command will not proceed if the document has macros. If you use Word 2003, watch for a macro notification when you open the document. The notification appears only if your macro security level is not set to Low (this is configured under Tools -> Macro -> Security...). If the notification appears, you cannot resave the document, because all macros would be lost.
When you click Resave active document, the document will be saved to RTF format, closed, reopened, and then saved to the original format.
In case of any problems, you can repeat the same operation manually: 1) save the original document and preferably back it up to another file; 2) save the document to RTF format by choosing the RTF format in the Save As Type list of the Save As dialogue; 3) close the document; 4) open the new RTF file; 5) save the document to the original file format, i.e. DOC or DOCX.
Depending on the type of your document, you may need to use different configuration settings for the Document Cleaner tools. For example, to process documents produced by OCR tools, you need to use the Tag Cleaner tool with most options turned on; if the document was produced by a human, however, you may need to activate only a few Tag Cleaner options.
The profile area is located at the bottom of the Document Cleaner dialogue.
To save the current configuration settings to a new profile, select or deselect the options you need on all tabs of the Document Cleaner dialogue, and choose [New profile...] from the list at the bottom of the dialogue. Enter a name for the new profile and click OK.
To save current configuration settings to an existing profile, select or deselect various options on all tabs of the Document Cleaner dialogue, select the profile from the list at the bottom of the dialogue, and click Save.
To load a specific profile, select it from the list at the bottom of the dialogue and click Load.
The Default profile is loaded automatically when the Document Cleaner dialogue is opened.
To remove a specific profile, select it from the list at the bottom of the dialogue and click Remove. The Default profile cannot be removed.
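The profile rules above (save to a new or existing profile, load by name, Default loaded on opening, Default cannot be removed) can be summarised in a short Python sketch. It is purely illustrative: the `ProfileStore` class, its method names and the option names in the usage example are hypothetical and say nothing about how Document Cleaner actually stores its settings.

```python
class ProfileStore:
    """Minimal in-memory model of named configuration profiles."""

    DEFAULT = "Default"

    def __init__(self):
        # The Default profile always exists.
        self._profiles = {self.DEFAULT: {}}

    def save(self, name, settings):
        """Create a new profile or overwrite an existing one."""
        self._profiles[name] = dict(settings)

    def load(self, name=None):
        """Return a copy of the named profile (Default if no name is given)."""
        return dict(self._profiles[name or self.DEFAULT])

    def remove(self, name):
        """Delete a profile; the Default profile is protected."""
        if name == self.DEFAULT:
            raise ValueError("The Default profile cannot be removed")
        del self._profiles[name]

    def names(self):
        return sorted(self._profiles)


store = ProfileStore()
# Hypothetical option names, used only for illustration.
store.save("OCR documents", {"reset_spacing": True, "remove_bookmarks": True})
store.save("Human-made documents", {"reset_spacing": False})
print(store.names())
print(store.load("OCR documents"))
```

Returning copies from `load` keeps the stored profile unchanged until `save` is called again, which matches the explicit Save step described above.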
Tools for document formatting and preparation before/after translation
- Tag Cleaner (PowerPoint) – minimize tags in PowerPoint presentations created with OCR and PDF conversion tools before translation in a CAT tool
- Find & Replace Excessive Spaces – find and replace excessive spaces to improve TM leverage and improve formatting
- Multiple Find & Replace – search Word documents for multiple words and phrases, replace or format found text or review each occurrence in context before making a change