<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:series="http://unfoldingneurons.com/"
	>

<channel>
	<title>sean @ celsoft</title>
	<atom:link href="http://www.celsoft.com/feed" rel="self" type="application/rss+xml" />
	<link>http://www.celsoft.com</link>
	<description>java, ruby, linux, whatever ...</description>
	<lastBuildDate>Sat, 21 May 2011 15:56:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1</generator>
		<item>
		<title>Character Sets and Character Encoding: A Unicode/UTF-8 Primer</title>
		<link>http://www.celsoft.com/character-sets-and-character-encoding-a-unicodeutf-8-primer</link>
		<comments>http://www.celsoft.com/character-sets-and-character-encoding-a-unicodeutf-8-primer#comments</comments>
		<pubDate>Sat, 21 May 2011 15:54:29 +0000</pubDate>
		<dc:creator>sean</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Ruby]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.celsoft.com/?p=176</guid>
		<description><![CDATA[<p>Recently, I&#8217;ve been forced to fill a number of gaps in my knowledge of international character sets and encodings. The most important thing I learned is that understanding and working with international languages is surprisingly simple.</p> <p></p> <p>If you learn nothing else from this article, let it be this: use UTF-8.  It&#8217;s genius.  I&#8217;ll [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I&#8217;ve been forced to fill a number of gaps in my knowledge of international character sets and encodings.  The most important thing I learned is that understanding and working with international languages is surprisingly simple.</p>
<p><span id="more-176"></span></p>
<p><em>If you learn nothing else from this article, let it be this: use UTF-8.  It&#8217;s genius.  I&#8217;ll explain why later.</em></p>
<h4>Helpful Hint #1: Don&#8217;t try to learn this stuff by observing the technology you&#8217;re working with</h4>
<p>Python, Java and even Ruby have pretty good support these days for Unicode and UTF-8 encoding.  However, there are some concepts you need to understand fully but which often get munged up by the implementation of these concepts.  If you&#8217;re trying to learn all about international character sets and encodings by taking a hands-on approach with your programming language of choice: <strong>way to go, tiger! </strong>Now stop doing that.</p>
<p>One of the more confusing aspects of character sets and encodings is that the distinction between the two is often blurred.  If your brain is like mine, it will start organizing information under &#8220;character sets,&#8221; and when the line blurs in just the right way, it will incorrectly start storing things under &#8220;character encodings.&#8221;</p>
<h4>Character Sets</h4>
<p>Let&#8217;s get character sets out of the way.</p>
<p>Character sets are logical tables which map characters to numbers.  ASCII is a character set.  So is Unicode.  Windows-1252 is a popular character set that you&#8217;ll find is used by many, well, Windows applications.</p>
<p><em>Which brings us to our first problem:</em></p>
<p>These character tables often assign different numbers to the same symbol.  The Euro symbol, for example, does not have the same numeric value across every character set.</p>
<p>As you can imagine, with so many character sets available, and with lots of characters mapped to different numerical values, it can be pretty messy business converting text represented in one character set to another.</p>
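<p>To make the Euro example concrete, here is a small Python 3 sketch.  The helper name <em>number_in</em> is made up for this illustration, and ISO-8859-15 is just one other table picked for comparison; the codec names are the ones Python&#8217;s standard library uses.</p>

```python
# The same symbol looked up in three different tables.
EURO = "\u20ac"  # the Euro sign


def number_in(charset, symbol):
    """Return the number a single-byte character set assigns to symbol.

    Hypothetical helper for this illustration only.
    """
    encoded = symbol.encode(charset)
    assert len(encoded) == 1, "expected a single-byte character set"
    return encoded[0]


euro_numbers = {
    "cp1252": number_in("cp1252", EURO),          # Windows-1252
    "iso8859_15": number_in("iso8859_15", EURO),  # ISO-8859-15 (Latin-9)
    "unicode": ord(EURO),                         # Unicode code point
}

for name, value in euro_numbers.items():
    print(f"{name:12} 0x{value:04X}")
# cp1252       0x0080
# iso8859_15   0x00A4
# unicode      0x20AC
```

<p>Three tables, three different numbers for one symbol.  Text converted between them has to be translated character by character.</p>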
<p>Which is where Unicode comes in.</p>
<p>Unicode unifies all of those characters into one very large table of characters.  One of the immediate and clear advantages of Unicode is that there is only one number for any given character.  A Euro symbol displayed in the middle of a sentence written in Chinese has the same numeric value as a Euro symbol shown in the middle of a sentence in French.</p>
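<p>A quick way to see this for yourself, sketched in Python 3.  The two sample sentences and the helper name <em>code_points</em> are invented for the illustration.</p>

```python
EURO = "\u20ac"  # the Euro sign

# Two made-up sample sentences, one French and one Chinese.
french = "Le caf\u00e9 co\u00fbte 3 \u20ac."
chinese = "\u5496\u5561\u8981 3 \u20ac\u3002"


def code_points(text):
    """Return the Unicode number of every character in text."""
    return [ord(ch) for ch in text]


euro_in_french = code_points(french)[french.index(EURO)]
euro_in_chinese = code_points(chinese)[chinese.index(EURO)]

print(hex(euro_in_french), hex(euro_in_chinese))  # 0x20ac 0x20ac
```

<p>The surrounding language makes no difference: the Euro symbol is number 0x20AC in both sentences.</p>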
<h4>Character Encodings</h4>
<p>Here&#8217;s where things start to get murky.</p>
<p>Before we get into this, let me explain that the issue of character encodings breaks up into two concerns that are interwoven in such a way as to make this area pretty darn confusing for a lot of people.  Hopefully I can shed some light on this for you now and get the topic out of the way.</p>
<h4>Encoding Concern #1:</h4>
<p>An encoding is any representation of a character set,<em> including that which you might use directly in your code.</em></p>
<p>To a C programmer, a &#8220;string&#8221; is an encoding of the ASCII character set; it&#8217;s an array of chars, each of which holds the numeric value of a character in the ASCII character set.  ASCII defines only 128 values (0 through 127), so every one of them fits neatly within a C char, which is conventionally 8 bits wide.  This sort of encoding is commonly used for storing data in files or transmitting across network sockets.  It&#8217;s also very easily manipulated within the C language.</p>
<p>C programmers who have had to work with Unicode likely had to switch from char-based strings to strings made of &#8220;wide-characters,&#8221; which, because of the size of the Unicode table of characters, must be large enough to hold any value in the table.  While a lot more trouble to handle than simple char arrays, they&#8217;re still pretty straightforward to work with; though, typically, you need to use a specialized set of function calls to manipulate them.</p>
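<p>You can peek at what such a char array holds without writing any C.  This is only a sketch in Python 3, where a <em>bytes</em> object stands in for the C array; the trailing zero byte is added by hand to mimic the C string terminator.</p>

```python
text = "Hello"

# One byte per character, each holding that character's ASCII number.
ascii_bytes = text.encode("ascii")
values = list(ascii_bytes)
print(values)  # [72, 101, 108, 108, 111]

# A C string would also carry a terminating zero byte at the end.
c_string = ascii_bytes + b"\x00"
print(len(c_string))  # 6
```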
<p>What these &#8220;internal&#8221; encodings have in common is they were intended for direct manipulation by the programming language.  They&#8217;re easy to traverse and easy to manipulate.  Where they diverge is: char-based strings can also be sent as-is out to files and other programs, while wide-character strings typically cannot.  At least, not without being re-encoded into a form intended for communicating with other applications.</p>
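<p>Here is a rough Python 3 sketch of both halves of that statement.  It assumes UTF-32 as a stand-in for a 4-byte C wide character, and <em>wide_char_at</em> is a hypothetical helper, not a library function.</p>

```python
# Assumption: UTF-32 stands in for a 4-byte C wide character.
text = "a\u20ac\U0001F600"  # 'a', the Euro sign, and an emoji

wide_le = text.encode("utf-32-le")  # 4 bytes per character, little-endian
wide_be = text.encode("utf-32-be")  # same characters, big-endian


def wide_char_at(buffer, index, byteorder):
    """Read the character number stored in fixed-width slot `index`."""
    start = index * 4
    return int.from_bytes(buffer[start:start + 4], byteorder)


# Fixed width makes traversal trivial: slot N always starts at byte 4 * N.
print(hex(wide_char_at(wide_le, 1, "little")))  # 0x20ac

# The raw bytes depend on byte order, which is one reason they make
# a poor format for files and sockets.
print(wide_le == wide_be)  # False
```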
<p>And so begins the blurring &#8230;</p>
<h4>Encoding Concern #2:</h4>
<p>An encoding is any representation of a character set, <em>including that which you might use to transmit text data across network sockets or read from and write to text files.</em></p>
<p>Again, ASCII does this very simply.  Unicode does not do this at all (though I imagine there are brave souls out there transmitting wide characters as raw bytes across homegrown network protocols).  Unicode must be transformed into and out of some other format that plays well with whatever environment the text lives in; very typically the environment is the Internet: that networked world built primarily with the ASCII character set in mind.</p>
<p>There are a number of encodings to choose from, but these days UTF-8 is the standard.  It works in 8-bit units and encodes each character in one to four bytes, which is enough to cover the entire Unicode table of characters.  It works so well, in fact, that they actually named it the <em>8-bit Unicode Transformation Format.</em></p>
<p>UTF-8 is a variable-length (sometimes a character takes one byte, sometimes it takes 4) encoding scheme, so it&#8217;s a lot more difficult to traverse and manipulate directly.  Because of that, it&#8217;s typically only used as an intermediate format between applications, and not usually manipulated directly.</p>
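<p>The variable width is easy to observe.  A minimal Python 3 sketch, with four sample characters chosen arbitrarily for the illustration:</p>

```python
samples = {
    "A": "A",               # U+0041, plain ASCII
    "e-acute": "\u00e9",    # U+00E9
    "euro": "\u20ac",       # U+20AC
    "emoji": "\U0001F600",  # U+1F600
}

# Number of bytes each character occupies once encoded as UTF-8.
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(byte_lengths)  # {'A': 1, 'e-acute': 2, 'euro': 3, 'emoji': 4}

# Because widths vary, character offsets and byte offsets drift apart.
text = "3 \u20ac each"
encoded = text.encode("utf-8")
print(len(text), len(encoded))  # 8 10
```

<p>This is why you cannot jump to the Nth character of a UTF-8 byte array by simple arithmetic, the way you can with fixed-width characters.</p>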
<h4>The Blur</h4>
<p>In addition to the legacy of ASCII encodings being simple char arrays which are both internal and external representations (and which are burned into the hearts and minds of Western developers everywhere), we have a new bog of uncertainty to navigate: encoding and decoding implementations.</p>
<p>While Unicode is strictly a logical table (and clearly not an encoding), some languages do have their own internal Unicode representation, which is the equivalent of the wide-character string mentioned above.  In the idioms of these languages, there exists such a mythical entity as a &#8220;Unicode string&#8221; or a &#8220;Unicode-encoded string.&#8221;</p>
<p>It&#8217;s the equivalent of calling a C char-based array an &#8220;ASCII string.&#8221;  There really is no such thing, but it does tell you a little bit about the purpose of a given array of bytes.  In this case, a &#8220;Unicode string&#8221; is really just an array of bytes which is intended to be manipulated as an array of Unicode numeric values.</p>
<p>Except, of course, you can&#8217;t use these strings outside of your application.  It&#8217;s an internal representation only.</p>
<p><em>Here&#8217;s where the murkiness becomes a swirling fog of insanity for the uninitiated:</em></p>
<p>Some languages will let you &#8220;encode&#8221; a string as ASCII or Unicode.</p>
<p>In case you missed that, let me re-state this: There are languages you can code in right now that have decided that character sets and character encodings are probably the same thing and you can actually make API calls to have your strings encoded into (<em>aaaarrrrhhhh!!!</em>) &#8220;ascii&#8221; or &#8220;unicode&#8221;.</p>
<p><em>I&#8217;m looking at you, Python, Ruby and Java.</em></p>
<p>Walk towards the light now, and breathe.  Remember:</p>
<ul>
<li>ASCII and Unicode are character sets.</li>
<li>Single-char/wide-character arrays and &#8220;Unicode strings&#8221; are raw encodings used internal to applications.</li>
<li>UTF-8, UCS-2, and so on are standard encodings intended for sharing text between applications.</li>
</ul>
<p>Truth be told, it&#8217;s relatively standard to say a string is &#8220;encoded in ASCII&#8221; or &#8220;encoded in Unicode.&#8221;  What that means, though, is that the string is probably a non-standard/semi-standard array of bytes which represent ASCII or Unicode characters, respectively, and which are intended for manipulation internally.</p>
<p>It&#8217;s confusing, and I think library designers should separate internal from external representations, but it&#8217;s not entirely wrong to mix them together.</p>
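<p>One way to keep the two apart in your own code is to decode bytes the moment they arrive and encode them only on the way out.  A sketch in Python 3, where <em>shout</em> is a made-up function and UTF-8 is assumed to be the external encoding:</p>

```python
def shout(raw_bytes, encoding="utf-8"):
    """Upper-case text that arrives and leaves as encoded bytes.

    Hypothetical example function; the encoding default is an assumption.
    """
    text = raw_bytes.decode(encoding)  # external bytes to internal string
    text = text.upper()                # manipulate the internal form only
    return text.encode(encoding)       # internal string back to external bytes


incoming = b"caf\xc3\xa9"  # the word "cafe" with an acute accent, as UTF-8
outgoing = shout(incoming)
print(outgoing)  # b'CAF\xc3\x89'
```

<p>Everything between the decode and the encode works on characters, and nothing outside the function ever sees the internal representation.</p>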
<h4>The genius of UTF-8</h4>
<p>Going back to the topic of UTF-8 for a moment, some key reasons why I think UTF-8 (along with Unicode) is a genius technology:</p>
<ul>
<li>At its core is Unicode.</li>
<li>It&#8217;s made to work in a world built for ASCII.</li>
<li>It&#8217;s well-supported by most modern programming languages and applications.</li>
<li>It&#8217;s an open standard.</li>
<li>It&#8217;s easy to work with.</li>
</ul>
<p>In short, it resolves all encoding issues and works everywhere.</p>
<p><em>&lt;&lt;Well, almost everywhere, but you probably don&#8217;t have to worry about the exceptions.&gt;&gt;</em></p>
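<p>The point about working in a world built for ASCII is worth checking for yourself.  A small Python 3 sketch, with sample strings invented for the illustration:</p>

```python
# Pure ASCII text produces exactly the same bytes under both encodings.
ascii_text = "GET /index.html HTTP/1.1"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)  # True

# Every byte belonging to a multi-byte character falls outside the
# ASCII range, so it can never be mistaken for an ASCII character.
mixed = "price: 3 \u20ac"
encoded = mixed.encode("utf-8")
non_ascii_bytes = [b for b in encoded if b not in range(128)]
print([hex(b) for b in non_ascii_bytes])  # ['0xe2', '0x82', '0xac']
```

<p>An old ASCII-only tool can pass the UTF-8 bytes along untouched, and the ASCII parts of the text still mean what they always meant.</p>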
<h4>Now you are the master</h4>
<p>Now you know the difference between a character set and an encoding.  Now you understand that there exist both internal and external character encodings, and you know how they can sometimes appear to be one and the same thing.</p>
<p>Now you understand why Unicode and UTF-8 are so important.</p>
<p>Now when you&#8217;re given the choice of encoding a string as either &#8220;UTF-8&#8221; or &#8220;Unicode,&#8221; you can laugh a small, knowing laugh and feel empathy for those who are still suffering with the misconception that the terms &#8220;character set&#8221; and &#8220;character encoding&#8221; are interchangeable.</p>
<p>Look kindly upon them and have mercy.  Send them this article.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.celsoft.com/character-sets-and-character-encoding-a-unicodeutf-8-primer/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Resty: REST/JSON in Java Made So Much Easier!</title>
		<link>http://www.celsoft.com/resty-restjson-in-java-made-so-much-easier</link>
		<comments>http://www.celsoft.com/resty-restjson-in-java-made-so-much-easier#comments</comments>
		<pubDate>Sat, 26 Mar 2011 13:13:36 +0000</pubDate>
		<dc:creator>sean</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[JSON]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.celsoft.com/?p=164</guid>
		<description><![CDATA[<p>I love it when something so useful is made so simple to use.  Take a look at Resty, an all-in-one HTTP client for Java with native support for parsing JSON.</p> <p>Here&#8217;s an example from the Resty website:</p> Resty r = new Resty(); Object name = r.json(&#34;http://ws.geonames.org/&#34; + &#34;postalCodeLookupJSON?postalcode=66780&#38;country=DE&#34;) .get(&#34;postalcodes[0].placeName&#34;); ]]></description>
			<content:encoded><![CDATA[<p>I love it when something so useful is made so simple to use.  Take a look at <a href="http://beders.github.com/Resty/Resty/Overview.html">Resty</a>, an all-in-one HTTP client for Java with native support for parsing JSON.</p>
<p>Here&#8217;s an example from the Resty website:</p>
<pre class="brush: java; title: ; wrap-lines: true;">
import us.monoid.web.Resty;

Resty r = new Resty();
// Fetch the JSON document, then pull a single value out of it by path
Object name = r.json(&quot;http://ws.geonames.org/&quot; +
  &quot;postalCodeLookupJSON?postalcode=66780&amp;country=DE&quot;)
  .get(&quot;postalcodes[0].placeName&quot;);
</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.celsoft.com/resty-restjson-in-java-made-so-much-easier/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data Science Toolkit</title>
		<link>http://www.celsoft.com/data-science-toolkit</link>
		<comments>http://www.celsoft.com/data-science-toolkit#comments</comments>
		<pubDate>Thu, 24 Mar 2011 15:31:59 +0000</pubDate>
		<dc:creator>sean</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.celsoft.com/?p=134</guid>
		<description><![CDATA[<p>Yesterday, I came across this little gem of a public API: Data Science Toolkit</p> <p>From their web site:</p> <p>A collection of the best open data sets and open-source tools for data science, wrapped in an easy-to-use REST/JSON API with command line, Python and Javascript interfaces. Available as a self-contained VM or EC2 AMI that you can deploy yourself.</p> <p>Some [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday, I came across this little gem of a public API: <a href="http://www.datasciencetoolkit.org/">Data Science Toolkit</a></p>
<p>From their web site:</p>
<blockquote><p><em>A collection of the best open data sets and open-source tools for data science, wrapped in an easy-to-use REST/JSON API with <a href="http://www.datasciencetoolkit.org/developerdocs#commandline">command line</a>, <a href="http://www.datasciencetoolkit.org/developerdocs#python">Python</a> and <a href="http://www.datasciencetoolkit.org/developerdocs#javascript">Javascript</a> interfaces. Available as a <a href="http://www.datasciencetoolkit.org/developerdocs#vmware">self-contained VM</a> or <a href="http://www.datasciencetoolkit.org/developerdocs#amazon">EC2 AMI</a> that you can deploy yourself.</em></p></blockquote>
<p><strong><span id="more-134"></span>Some examples of what their API provides:</strong></p>
<blockquote>
<h3>Street Address to Coordinates</h3>
<p>API: <a href="http://www.datasciencetoolkit.org/developerdocs#street2coordinates">/street2coordinates</a><br />
Street Address to Location calculates the latitude/longitude coordinates for a postal address.<br />
Currently restricted to the US.</p>
<h3>IP Address to Coordinates</h3>
<p>API: <a href="http://www.datasciencetoolkit.org/developerdocs#ip2coordinates">/ip2coordinates</a><br />
IP Address to Location calculates country, state, city and latitude/longitude coordinates for IP addresses.</p></blockquote>
<p>See their <a href="http://www.datasciencetoolkit.org/developerdocs">Developer Documentation</a> for more details on using their services, or how to <a href="http://www.datasciencetoolkit.org/developerdocs#setup">run your own Data Science Toolkit service</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.celsoft.com/data-science-toolkit/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to Set Up a Simple, Very Secure CVS Repository</title>
		<link>http://www.celsoft.com/how-to-setup-a-simple-very-secure-cvs-repository</link>
		<comments>http://www.celsoft.com/how-to-setup-a-simple-very-secure-cvs-repository#comments</comments>
		<pubDate>Sat, 12 Mar 2011 15:15:28 +0000</pubDate>
		<dc:creator>sean</dc:creator>
				<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://www.celsoft.com/?p=31</guid>
		<description><![CDATA[<p>The most secure CVS repository setup also happens to be the simplest.</p> <p>The following commands, when executed at the Linux command line, will create and initialize a CVS repository. It is accessible by any user with an account on the same host in the &#8216;cvs&#8217; group, either while logged in locally or remotely over SSH.</p> [...]]]></description>
			<content:encoded><![CDATA[<p>The most secure CVS repository setup also happens to be the simplest.</p>
<p>The following commands, when executed at the Linux command line, will create and initialize a CVS repository.  It is accessible by any user with an account on the same host in the &#8216;cvs&#8217; group, either while logged in locally or remotely over SSH.</p>
<pre class="brush: bash; title: ;">
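# Become root; run the remaining commands in the root shell this opens
sudo su -
# Group whose members may use the repository
groupadd cvs
mkdir /cvs
export CVSROOT=/cvs
# Create the CVSROOT administrative files
cvs init
chgrp -R cvs /cvs
# Let group members add modules; setgid makes new directories inherit the 'cvs' group
chmod g+ws /cvs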
</pre>
<p><i>Caveat: This is a great setup for a small or technically savvy group of CVS users, but if you need anonymous access, or if your users are not comfortable with SSH, you may want to provide pserver access.</i></p>
]]></content:encoded>
			<wfw:commentRss>http://www.celsoft.com/how-to-setup-a-simple-very-secure-cvs-repository/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

