Paste from web to Microsoft Word without strange line breaks

by thegreg82 / December 23, 2006 9:48 AM PST

Word 2003.

When pasting text from the web into a normal Word document, I would like the text to integrate into the document without any extra formatting. I have tried everything that is recommended: Paste Special, Clear Formatting, Paste Options, and pasting into Notepad and then re-pasting into Word. While these tips remove most of the extraneous formatting, strange line breaks still fill the text. (The lines don't reach the margins and instead just cut off early, as if I had pressed Enter while typing.) Is there any way to either paste without these line breaks or quickly remove them without having to manually delete every single one?
Thanks in advance.

Can you give a good example?
by Kees Bakker / December 23, 2006 7:04 PM PST

Go here, select this, paste to a new Word document, and see the results. I'd like to see some details.


Example
by thegreg82 / December 23, 2006 10:32 PM PST

Copy and paste this into Word, and you should be able to see the problem. Thanks again.

During the Cold War, the United States maintained nuclear forces that were
sized and structured to deter any attack by the Soviet Union and its Warsaw Pact
allies, and if deterrence failed, to defeat the Soviet Union. In the years since the 1989
collapse of the Berlin wall and 1991 demise of the Soviet Union, officials in the U.S.
government and analysts outside government have conducted numerous reviews and
studies of U.S. nuclear weapons policy and force structure. Although these studies
have varied in scope, intent, and outcome, most have sought to describe a new role
for U.S. nuclear weapons and to identify the appropriate size and structure of the U.S.
nuclear arsenal in the post-Cold War era. In offering their recommendations, these
analyses addressed not only the end of the hostile U.S.-Soviet global rivalry, but also
the emergence of new threats and regional challenges to U.S. security.
The U.S. Department of Defense conducted several far-reaching reviews,
including the 1993 Bottom-up Review, the 1994 Nuclear Posture Review, and the
1997 Quadrennial Defense Review, that contributed to the Clinton Administration's
response to changes in the international security environment. These formal reviews,
when combined with less prominent internal studies, resulted in numerous changes
to the structure of U.S. nuclear forces and policy guiding their potential use.
However, many critics of the Clinton Administration argued that, at the end of the
1990s, the U.S. nuclear posture looked much as it had at the beginning of the decade.
The number of deployed nuclear weapons had declined as the United States
implemented the first Strategic Arms Reduction Treaty (START I) and completed the
withdrawal of most of its non-strategic nuclear weapons. But, even though the
Soviet Union no longer existed and the threat of global nuclear war had sharply
diminished, the United States continued to focus its nuclear planning and size and
structure its nuclear forces to deter the potential threat of a Russian attack.

Re: line breaks
by Kees Bakker / December 24, 2006 12:06 AM PST
In reply to: Example


As I suspected, this is more complex than it seems.

If I copy and paste the text about the Cold War from your post into Word, it shows up with a new-line character (the same as when you press Shift+Enter in Word) at the end of each line. That's entirely correct. If you look at the HTML source of the message, there are <br> line breaks in it, as shown in
"During the Cold War, the United States maintained nuclear forces that were<BR>sized and structured to deter any attack by the Soviet Union and its Warsaw Pact<BR>allies, and if deterrence failed, to defeat the Soviet Union. In"
A <br> is the 'new line' command in HTML. The only reason it's there is that the designer of the web page (or the program he used) put it there intentionally to force a new line. So Word obeys the intention of the maker of the web page. Nothing wrong with that.
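
To make the point about <br> concrete, here is a minimal Python sketch (not anything Word itself does) that extracts text from an HTML fragment and turns each <br> into a newline. The class and function names are made up for this illustration; only the standard-library html.parser module is assumed.

```python
from html.parser import HTMLParser


class BreakAwareText(HTMLParser):
    """Collect text from an HTML fragment, turning each <br> into a newline."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BR> and <br> both arrive as "br".
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text of an HTML fragment with <br> tags kept as newlines."""
    parser = BreakAwareText()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)


# Fragment taken from the HTML source quoted above.
fragment = (
    "maintained nuclear forces that were<BR>"
    "sized and structured to deter any attack"
)
text = html_to_text(fragment)
print(repr(text))
```

Text without any <br> comes back on a single line, however the browser happened to wrap it on screen, which is the difference described in this post.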

Your link is to a PDF file. If I open that (with either Acrobat or Foxit Reader, and either locally or from the web) and use the text selection tool to copy part of the text to Word, the end of each line shows up as a paragraph mark in Word. That has absolutely nothing to do with the web; it's just that the Adobe or Foxit programmers thought this was the right thing to do. Go and complain to them.

The last case: normal HTML. I copied a small piece of text from a web page. In the browser it looks like
"This was the year of the web generation, a year
that saw the rise of a new digital democracy.
Meet 15 of the web generation's biggest movers
and shakers"

but that's just because my browser's (IE 6) rendering engine puts it on the screen that way to fit it into the available space (the column width). If you look at the HTML source, you see there are no <br> tags inside, so the designer chose to have IE determine the exact layout (as usual). And if I paste this into Word, it shows up just as you would expect, as one paragraph without line breaks. Well, in fact it shows up as a bulleted list, because the designer of the web page enclosed it in an <li> tag.

I can see nothing wrong with the way Word handles copies from an HTML source.
As I said, you might have doubts about the way the text tool of some PDF readers handles a new line in the document, but that's quite another subject. It might be inherent to the way a PDF document is structured internally, but I couldn't tell you that.


In the links I noted, line break removal can be done.
by R. Proffitt Forum moderator / December 24, 2006 12:17 AM PST
In reply to: Example

It's a well-discussed subject, so I took the liberty of heading straight to the solutions noted at the end of prior discussions.


Those links were quite clear and relevant.
by Kees Bakker / December 24, 2006 4:05 AM PST

But I found it interesting to research the exact cause. Knowing you can't do anything about it makes it more acceptable to execute an extra step.


Thanks
by thegreg82 / December 24, 2006 4:37 AM PST

Thanks for everyone's help. I found the following, from one of the links, to be particularly useful.

If they are true paragraph breaks, you can search for ^p and replace with
nothing. If they're line breaks, you can search for ^l (that's a lowercase
L) and replace with nothing. Occasionally, line breaks will be displayed as
paragraph breaks, in which case you'll need to use ^013. See, which
contains the following:

Sometimes when you paste in from other applications, non-printing characters
paste in that display as paragraph marks but don't behave like "proper"
paragraph breaks should; they behave like manual line breaks. The character
code for a paragraph mark is 13 (as can be shown by selecting one and
running a macro containing the line: MsgBox Asc(Selection.Text)).

Replacing ^013 with ^p fixes the problem.
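
The character-code check in the quoted tip can also be sketched outside Word. The Python below is only an illustration under a few assumptions: it works on a plain string rather than a Word selection, the function name is hypothetical, and it relies on Word's usual codes of 13 for a paragraph mark and 11 for a manual line break (the one ^l matches).

```python
# Character codes for the breaks discussed above. Word uses 13 for a
# paragraph mark and 11 (vertical tab) for a manual line break; 10 is the
# line feed found in many plain-text files.
BREAK_CODES = {13: "paragraph mark", 11: "manual line break", 10: "line feed"}


def count_breaks(text):
    """Return {character code: count} for each break character in text."""
    counts = {}
    for ch in text:
        code = ord(ch)  # same idea as Asc(...) in the quoted macro line
        if code in BREAK_CODES:
            counts[code] = counts.get(code, 0) + 1
    return counts


# Hypothetical sample: one manual line break and two paragraph marks.
sample = "first line\x0bsecond line\rnext paragraph\r"
for code, count in sorted(count_breaks(sample).items()):
    print(code, BREAK_CODES[code], count)
```

Knowing which code is actually present tells you which of the searches above (^p, ^l, or ^013) is the one to run.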

by entrecon / December 27, 2006 3:31 AM PST
In reply to: Thanks

I actually wrote a macro to clean up this type of text. The problem with a simple find and replace is that in this type of text two line breaks quite often indicate a new paragraph. My macro does a series of find-and-replace passes. First it looks for 3 spaces and replaces them with 2, looping to make sure it cleans up occurrences of 4 or more spaces. It then looks for 2 line breaks and replaces them with XXXX (my identifier of choice). It then looks for single line breaks and replaces each with a space. The final step is looking for XXXX and replacing it with the paragraph mark.

I may have added a couple of extra steps in my final macro (it has been a while since I put it together), but the steps here give you a general idea of how it runs. It was initially written for text files that were sent to me from a mainframe and therefore had hard returns in them.
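
The steps described in this post can be sketched roughly as follows. This is not the poster's macro: it is a Python approximation that works on plain text, the function name is hypothetical, and the initial normalization of "\r\n" and "\r" to "\n" is an added assumption so that every break looks the same before the passes run.

```python
PLACEHOLDER = "XXXX"  # the poster's marker; text that already contains it would be altered


def unwrap_hard_breaks(text, paragraph_mark="\n"):
    """Join hard-wrapped lines while keeping double breaks as paragraph breaks."""
    # Assumption (not in the post): treat "\r\n" and "\r" the same as "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Step 1: replace 3 spaces with 2, looping until no run of 3+ remains.
    while "   " in text:
        text = text.replace("   ", "  ")

    # Step 2: mark double line breaks (real paragraph ends) with the placeholder.
    text = text.replace("\n\n", PLACEHOLDER)

    # Step 3: every remaining single line break becomes a space.
    text = text.replace("\n", " ")

    # Step 4: turn the placeholders back into paragraph marks.
    return text.replace(PLACEHOLDER, paragraph_mark)


wrapped = (
    "During the Cold War, the United States\n"
    "maintained nuclear forces.\n"
    "\n"
    "The U.S. Department of Defense\n"
    "conducted several reviews."
)
print(unwrap_hard_breaks(wrapped))
```

The order of the passes matters: marking the double breaks before replacing the single ones is what keeps real paragraph boundaries from being merged into one long paragraph.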

