<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Yet another web log &#187; PDF</title>
	<atom:link href="http://blog.philippheckel.com/tag/pdf/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.philippheckel.com</link>
	<description>Life, Linux and other things</description>
	<lastBuildDate>Thu, 17 Mar 2011 10:04:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Extract text from PDF files</title>
		<link>http://blog.philippheckel.com/2009/08/09/extract-text-from-pdf-files/</link>
		<comments>http://blog.philippheckel.com/2009/08/09/extract-text-from-pdf-files/#comments</comments>
		<pubDate>Sun, 09 Aug 2009 17:17:05 +0000</pubDate>
		<dc:creator>Philipp C. Heckel</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Office]]></category>
		<category><![CDATA[PDF]]></category>

		<guid isPermaLink="false">http://blog.philippheckel.com/2009/08/09/extract-text-from-pdf-files/</guid>
		<description><![CDATA[Adobe&#8217;s Portable Document Format (PDF) has become very popular in recent years and is the number one format for easy document exchange. It comes with great features such as embeddable images and multimedia, but it also has some rather unpleasant properties. The so-called Security Features represent a simple Digital Rights Management (DRM) system and allow [...]]]></description>
			<content:encoded><![CDATA[<p>Adobe&#8217;s Portable Document Format (PDF) has become very popular in recent years and is the number one format for easy document exchange. It comes with great features such as embeddable images and multimedia, but it also has some rather unpleasant properties. The so-called <em>Security Features</em> represent a simple Digital Rights Management (DRM) system and allow PDF authors to restrict how a file is used. Using the DRM system, authors can allow or deny actions such as printing a file, commenting or copying content.</p>
<p>Even though this is a good idea in some situations, most of the time it&#8217;s just annoying: collecting ideas for seminar papers or a thesis, for instance, is almost impossible without being able to copy &amp; paste certain paragraphs from the PDF.</p>
<p><span id="more-24"></span></p>
<p>Fortunately, Linux can solve this problem with a simple tool called <strong>pdftotext</strong>. This command line tool extracts all text from the PDF file and saves it to a given text file.</p>
<h3 id="toc-installation">Installation</h3>
<p>The tool is part of the package <strong>poppler-utils</strong> and can be installed via your favorite package manager, e.g. apt-get:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">$ <span style="color: #c20cb9; font-weight: bold;">sudo</span> <span style="color: #c20cb9; font-weight: bold;">apt-get</span> <span style="color: #c20cb9; font-weight: bold;">install</span> poppler-utils</pre></div></div>

<h3 id="toc-extract-text-from-pdf-files">Extract text from PDF files</h3>
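<p>This is also pretty simple, and the man page gives the syntax: <em>pdftotext [options] &lt;PDF&gt; [&lt;text-file&gt;]</em>. If the text file is omitted, the output is written next to the PDF, with the extension replaced by .txt.</p>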

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">$ pdftotext PDF-file-with-copy-and-paste-restriction.pdf</pre></div></div>
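
<p>If only a few pages are needed, the same tool can print a page range straight to the terminal. The following is a minimal sketch, not taken from the man page verbatim: <em>extract_pages</em> is a hypothetical helper name, while <em>-f</em> (first page), <em>-l</em> (last page) and <em>-layout</em> (keep the physical layout) are pdftotext options, and a <em>-</em> in place of the text file sends the output to standard output.</p>

```shell
# Sketch: print the text of a page range to standard output.
# extract_pages is a hypothetical helper name, not part of poppler-utils.
# Usage: extract_pages FILE.pdf FIRST_PAGE LAST_PAGE
extract_pages() {
    pdf="$1"
    first="$2"
    last="$3"
    # -f/-l select the page range, -layout keeps the physical layout,
    # and "-" as the text-file argument writes to stdout.
    pdftotext -f "$first" -l "$last" -layout "$pdf" -
}
```

<p>Called as <em>extract_pages thesis-source.pdf 2 5</em> (the file name is only an example), this should print pages 2 to 5 without creating a text file.</p>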

<p>If you&#8217;d like to do this for every PDF file in a folder (searching recursively), simply run:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">$ <span style="color: #c20cb9; font-weight: bold;">find</span> <span style="color: #660033;">-name</span> <span style="color: #ff0000;">'*.pdf'</span> <span style="color: #660033;">-exec</span> pdftotext <span style="color: #ff0000;">&quot;{}&quot;</span> \;</pre></div></div>

<p>After executing the command, there will be a .txt file for each PDF file in the folder, containing the plain text of the corresponding PDF file.</p>
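
<p>When the folder is converted repeatedly, it can be useful to skip PDFs whose text file is already up to date. The following is one possible sketch of that idea, under a few assumptions: <em>convert_pdfs</em> is a hypothetical helper name, the shell supports the <em>-nt</em> (newer than) test as bash and dash do, and no file name contains a newline.</p>

```shell
# Sketch: convert every PDF below a directory, skipping files whose
# .txt output already exists and is newer than the PDF.
# convert_pdfs is a hypothetical helper name, not part of poppler-utils.
# Assumes file names do not contain newlines.
convert_pdfs() {
    dir="${1:-.}"
    find "$dir" -type f -name '*.pdf' | while IFS= read -r pdf; do
        # Same naming rule pdftotext uses by default: foo.pdf -> foo.txt
        txt="${pdf%.pdf}.txt"
        # Skip the PDF if its text file is newer than the PDF itself
        if [ "$txt" -nt "$pdf" ]; then
            continue
        fi
        pdftotext "$pdf" "$txt"
    done
}
```

<p>Running <em>convert_pdfs .</em> in the folder would then only convert PDFs that are new or have changed since the last run.</p>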
]]></content:encoded>
			<wfw:commentRss>http://blog.philippheckel.com/2009/08/09/extract-text-from-pdf-files/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

