If you work with a CAT tool, you have almost certainly run into incorrect paragraph (¶) and line (↵) breaks in the middle of sentences. This is a common issue with Word documents, especially inside tables. Such breaks also appear in text copied from PDF documents, in OCR results, and in text copied from some webpages. Unless these breaks are removed from the document before you import it into your CAT tool, you will have serious problems during translation and post-formatting:
- Most CAT tools do not allow joining segments across paragraph boundaries, so you will need to put parts of the translation in different segments, often changing the order of words and saving unusable translation units in your translation memory. For example, if “Unit number” is split by a break after “Unit”, the Russian translation “Номер блока” reverses the word order, so the segment containing “Unit” ends up paired with “Номер” (“number”). The resulting translation units will be completely unusable and even dangerous for further projects.
- Even if your CAT tool can join segments across paragraph boundaries (memoQ and DejaVu X 3, among others, can do this), you will need to make a note of all such segments and fix the exported document, because the breaks will almost always end up in the wrong places, disrupting proper text flow.
Below are several examples of incorrect breaks in Word documents:
Example 1: Incorrect paragraph breaks in a table header
Example 2: Text in a PDF document (left) and how it appears when copied to Microsoft Word (right)
How can Unbreaker help you
Unbreaker is a special tool for Microsoft Word that will help you find incorrect breaks and remove them, preventing the above issues. The tool is equally effective with properly formatted documents, helping you find occasional incorrect breaks, and with documents produced through OCR or copying from PDF documents, where such breaks are much more common.
How to use it
When you run Unbreaker from the TransTools tab, you will see the following dialogue:
By default, the dialogue is configured to search for paragraph breaks and line breaks in the body of the document. You can search within selected text by choosing the Selection option in the Scope pane.
From this point, you can choose one of two options:
- Manual correction: This mode is best suited for regular Word documents produced by a person: it helps to find occasional incorrect breaks in an otherwise normal document. It can also be used in documents produced by OCR and PDF conversion tools, since these can also have occasional incorrect breaks.
- Automatic correction: This mode is designed for automatic break removal in text copied from PDFs, regular text files, bad OCR results or certain webpages. It should be used when the majority of breaks are incorrect. As with any automatic process, the results need to be checked thoroughly.
Manual correction mode
The manual correction mode should be used in documents where the majority of paragraph and line breaks are correct. These include documents created by a person and documents produced by OCR and PDF conversion tools (e.g., ABBYY FineReader or Nuance OmniPage).
To search for incorrect breaks manually, click Search on the Manual Correction tab. The tool will start searching for incorrect breaks. When the operation completes, you will be presented with a notification dialogue:
When you click OK, you will see a list of incorrect breaks:
Below the Search button you will see a list of paragraph / line breaks which are considered incorrect. The list includes the following columns:
- Uncertainty: If an item is uncertain, you will see “?” in the first column. Usually, uncertain breaks are more likely to be correct than items which are not marked with a question mark.
- Context for the break: in the second column, you will see the text surrounding the break. The break itself is marked with a special symbol.
- Type of break: in the third column, you will see “Paragr.” for a paragraph break or “Line” for a line break.
The selected item is marked with “<–” in the right-hand column.
To remove breaks, you need to mark incorrect breaks with a checkbox. The quickest way to do this is to select an item with the mouse, then use the Up/Down keys to move through the list and the Space key to check or uncheck an item. You can also check or uncheck items with the mouse, although this is less convenient.
As you navigate the list, the appropriate text before and after the highlighted item will be selected in the document, so you will be able to see more context and fix document punctuation or other mistakes immediately. In addition, the preview box under the list will display a preview of the text before and after the highlighted break and, if applicable, an explanation of the uncertainty for breaks marked with “?”:
Initially, all “certain” items (the ones without the “?”) will be checked, and “uncertain” items will be unchecked. You need to verify each item using the context column, the preview box and the text selected in the document, making sure that only incorrect breaks are checked. Usually, it is a good idea to use the button that unchecks all items before going through the list: this way, you will have fewer items to check or uncheck.
Remember: “certain” breaks are not always incorrect. For example, Unbreaker will often find an incorrect break if the first paragraph has no final punctuation while the second paragraph starts with a capital letter. However, sometimes this is because the author forgot to use final punctuation, or the two items are part of a list where there is no final punctuation at the end of each item.
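To see why a “certain” break still needs a human check, here is a minimal Python sketch of a rule based only on final punctuation. It is not Unbreaker's actual algorithm, which is not documented here; the function name `looks_incorrect` and the sample sentences are assumptions made purely for illustration.

```python
# Hypothetical illustration, not Unbreaker's real logic.
FINAL_PUNCTUATION = ".;!?:…"

def looks_incorrect(before: str, after: str) -> bool:
    """Flag a break as suspicious when the text before it lacks final punctuation."""
    before = before.rstrip()
    after = after.lstrip()
    if not before or not after:
        return False  # breaks next to empty paragraphs are left alone
    return before[-1] not in FINAL_PUNCTUATION

# A genuine mid-sentence break is flagged...
print(looks_incorrect("The unit must be", "installed by a technician."))
# ...but so are two list items, where the break is actually correct.
print(looks_incorrect("Power supply", "Control unit"))
# A paragraph that ends with a full stop is not flagged.
print(looks_incorrect("Installation is complete.", "Next, switch on the unit."))
```

The second call shows the case described above: a rule of this kind cannot tell a list without final punctuation from a broken sentence, so the final decision stays with you.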
When you have finished checking the incorrect breaks, press the Correct Document button. All checked breaks will be removed, joining paragraphs or lines together. The letters before and after the removed breaks will be highlighted using one of two colors: Yellow (default) for “certain” breaks and Red (default) for “uncertain” breaks. The highlight colors can be configured using the controls above the Correct Document button.
As a finishing touch, you can use the Find Highlight command on the TransTools ribbon tab to go through the document and check whether every break was removed correctly.
To remove highlighting, you can use the Remove Highlight command on the TransTools ribbon tab or simply select everything (Ctrl+A) and choose “No color” with the Highlight Color tool.
Automatic correction mode
The automatic correction mode is designed for documents which mostly contain incorrect breaks. Good examples are text copied from PDF documents, regular text (TXT) files, text copied from some webpages, results of simple OCR programs, etc.
To remove incorrect breaks automatically, switch to the Automatic Correction tab:
Before you press the Correct Document button, you need to choose the following:
- Whether to correct “uncertain” breaks or not. By default, the tool will keep uncertain breaks and highlight the last symbol preceding the break and the first symbol following the break, so you can remove the break manually. If the Correct Uncertain Breaks option is checked, uncertain breaks will be removed and the place where each break was will be highlighted.
- Highlight colors for marking “certain” and “uncertain” breaks.
When all the options are configured, press the Correct Document button. When the operation is complete, you will need to look through the document and check the results. To make this easier, use the Find Highlight command on the TransTools ribbon tab to find text with color highlighting. Finally, use the Remove Highlight command to remove all color highlighting.
The behaviour of the Manual and Automatic operations is controlled by a set of options. These are configured on the Extra Search Options tab:
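The automatic pass can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the tool's implementation: the toy `classify` function, the `auto_correct` name and the `correct_uncertain` parameter are all hypothetical, and for the demo a break before a capitalized word is treated as “uncertain”. Instead of highlighting, the sketch returns a review list holding the symbol before and the symbol after each flagged break.

```python
# Hypothetical sketch of an automatic pass; not Unbreaker's real code.
FINAL_PUNCTUATION = ".;!?:…"

def classify(before: str, after: str) -> str:
    """Toy classifier returning 'keep', 'certain' or 'uncertain'."""
    before, after = before.rstrip(), after.lstrip()
    if not before or not after or before[-1] in FINAL_PUNCTUATION:
        return "keep"
    # Assumption for the demo: a capital letter after the break makes it uncertain.
    return "uncertain" if after[0].isupper() else "certain"

def auto_correct(paragraphs, correct_uncertain=False):
    """Join paragraphs across incorrect breaks.

    Returns (new_paragraphs, review), where review lists every flagged break as
    (kind, removed, symbol_before, symbol_after) so it can be checked afterwards.
    """
    if not paragraphs:
        return [], []
    result = [paragraphs[0]]
    review = []
    for nxt in paragraphs[1:]:
        kind = classify(result[-1], nxt)
        if kind == "keep":
            result.append(nxt)
            continue
        removed = kind == "certain" or correct_uncertain
        review.append((kind, removed, result[-1].rstrip()[-1], nxt.lstrip()[0]))
        if removed:
            result[-1] = result[-1].rstrip() + " " + nxt.lstrip()
        else:
            result.append(nxt)  # uncertain break kept for manual review
    return result, review

text = ["The unit must be", "installed by a technician.",
        "Safety notes", "Read the manual first."]
fixed, review = auto_correct(text)
print(fixed)
print(review)
```

With the default setting, the broken sentence is joined while the break after “Safety notes” is kept and only reported; calling `auto_correct(text, correct_uncertain=True)` joins it as well, which is why the results of an automatic run always need a review.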
The following parameters can be configured:
- Treat a break before a capitalized word as an uncertain break: If checked, Unbreaker will mark breaks before capitalized words with a question mark, which is useful in many languages where capitalized words normally occur at the beginning of sentences. It is unchecked by default.
- Document may contain incorrectly formatted list markers (as text): check this option if Unbreaker erroneously finds or removes breaks between numbered list items whose markers were typed as plain text (see the example in the screenshot above). It is unchecked by default.
- Do not remove a break after a line in all uppercase letters: documents often contain headings in all uppercase letters, without final punctuation and with the same formatting as the following paragraph. This option keeps the break after such headings. It is checked by default.
- Do not remove a break after one of these symbols: besides final punctuation like . ; ! ? : and …, some symbols may often be used at the end of paragraphs. If a break is found after such symbols, it will be considered “uncertain”. By default, this option includes the closing parenthesis – ).
- Do not remove a break before one of these symbols: besides initial punctuation like ¿ and ¡, some symbols may often be used at the beginning of paragraphs. If a break is found before such symbols, it will be considered “uncertain”. By default, this option includes the opening parenthesis: (.
If you change any of these parameters and want them to be used in the future, check the Save These Options As Default checkbox.
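How these options interact can be sketched in Python as follows. This is only a hedged model of the descriptions above: the `SearchOptions` class, its field names, the order of the checks and the list-marker pattern are all assumptions, and the real tool may combine the options differently. The defaults mirror the ones stated in the list.

```python
# Hypothetical model of the Extra Search Options; not Unbreaker's real code.
import re
from dataclasses import dataclass
from typing import Optional

FINAL_PUNCTUATION = ".;!?:…"
INITIAL_PUNCTUATION = "¿¡"
# Assumed shape of a list marker typed as text, e.g. "1. " or "a) ".
LIST_MARKER = re.compile(r"(\d+|[A-Za-z])[.)]\s")

@dataclass
class SearchOptions:
    capitalized_is_uncertain: bool = False   # unchecked by default
    text_list_markers: bool = False          # unchecked by default
    keep_after_uppercase_line: bool = True   # checked by default
    uncertain_after: str = ")"               # default: closing parenthesis
    uncertain_before: str = "("              # default: opening parenthesis

def classify_break(before: str, after: str,
                   opts: Optional[SearchOptions] = None) -> str:
    """Return 'keep', 'uncertain' or 'certain' for the break between two texts."""
    opts = opts or SearchOptions()
    before, after = before.rstrip(), after.lstrip()
    if not before or not after:
        return "keep"
    if before[-1] in FINAL_PUNCTUATION or after[0] in INITIAL_PUNCTUATION:
        return "keep"
    if opts.keep_after_uppercase_line and before.isupper():
        return "keep"  # probably a heading in all uppercase letters
    if opts.text_list_markers and LIST_MARKER.match(after):
        return "keep"  # next paragraph starts with a typed list marker
    if before[-1] in opts.uncertain_after or after[0] in opts.uncertain_before:
        return "uncertain"
    if opts.capitalized_is_uncertain and after[0].isupper():
        return "uncertain"
    return "certain"

print(classify_break("SAFETY NOTES", "Read the manual first."))
print(classify_break("see Figure 2)", "and tighten the bolts."))
print(classify_break("The unit must be", "Installed by a technician.",
                     SearchOptions(capitalized_is_uncertain=True)))
print(classify_break("Power supply", "2. Control unit",
                     SearchOptions(text_list_markers=True)))
```

In this model, each option either protects a break outright (“keep”) or downgrades it to “uncertain”, which matches the way the options are worded: none of them makes the tool remove more breaks, only fewer.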
As an added benefit, Unbreaker will allow you to do the following:
- Find paragraphs that have no final punctuation mark. Simply type the final punctuation mark inside the document window when the problematic paragraphs are selected, and uncheck the item in the list.