Thursday, October 29, 2009

When HTML encoding can bite you

I’ve been using TweetDeck to follow Twitter. It’s a great app, but it has some quirks. One is its ginormous memory usage. Another is how it renders the text of a tweet: I’ve seen a few tweets go by where the text showed HTML escape sequences instead of the characters they stand for.

This tweet is an example: http://twitter.com/KristiGustafson/status/5231052312

It should display as:

City shelves happy hour: Many of you will be glad we aren’t this city — they’ve shelved their.. http://bit.ly/10QmUr

In TweetDeck, it displays as:

City shelves happy hour: Many of you will be glad we aren&#8217;t this city &#8212; they&#8217;ve shelved their.. http://bit.ly/10QmUr

A screenshot of that tweet:

[Screenshot: the tweet as displayed in TweetDeck]

If you view the link on Twitter, you’ll see the text the “right” way. I couldn’t figure out what was wrong. I posted a message on the TweetDeck support site and they couldn’t replicate the problem. Then I looked at the bottom line of the tweet. It had been posted to Twitter from TwitterFeed. I had never heard of TwitterFeed, so I signed up for an account.

TwitterFeed is a free service that can scan your blog’s RSS feed and look for new blog postings.  It can then post the first 100 or so characters from the post to your Twitter and/or Facebook accounts.  And that’s where the problem occurs. 

HTML and XML use escape sequences to represent special characters. What you see rendered in the browser is not literally the same text that is in the source for that page. Characters like “<” and “>” have special meaning in XML and HTML, so to display them on the page they need to be encoded as &lt; and &gt; respectively. Other characters, such as a curly apostrophe, can be encoded as numeric references like &#8217;. It all happens behind the scenes, and you are usually never aware of it.
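
To make the encoding step concrete, here is a minimal sketch in Python using the standard library's html module. The sample strings are my own and are only there for illustration.

```python
import html

# Plain text containing characters that are special in HTML/XML.
raw = 'if a < b and b > c: print("ok")'

# Encoding replaces the special characters with escape sequences.
encoded = html.escape(raw)

# Decoding turns the escape sequences back into the original characters.
decoded = html.unescape(encoded)

# Numeric character references are decoded the same way.
apostrophe = html.unescape("aren&#8217;t")

print(encoded)
print(decoded)
print(apostrophe)
```

A browser does the decoding half of this for you every time it renders a page, which is why you normally never see the escape sequences.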

When TwitterFeed gets the latest post from your blog, it pulls it from the RSS feed, where the text is encoded with the right escape sequences. It then calls the Twitter and/or Facebook API to post that text. In other words, it is sending HTML/XML-encoded text to functions that expect plain text. When Twitter displays that new blog posting as a tweet, it includes the encoded text, and your browser sees that encoding and decodes it back again. Facebook, on the other hand, displays the text still encoded.

TweetDeck isn’t a browser; it’s a desktop or mobile application. It renders tweets as plain text and assumes that the API call it is using to get tweets from Twitter sends back plain text. So the question is: where is it broken? I’ve only seen this problem with entries posted by TwitterFeed, so that would be the first place I would look. I think they will need to do an HTML decode on the text they are scraping from the RSS feed and send it as plain text to the Twitter and Facebook APIs.
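
Here is a rough sketch in Python of the kind of decode step I have in mind. It is not TwitterFeed's actual code. The sample RSS item and the function name plain_text_title are assumptions of mine, and I am guessing that the feed carries the entities inside an XML-escaped title, which is common but not something I have confirmed for this feed.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical RSS item: the HTML entities are themselves XML-escaped
# (&amp;#8217;), so the XML parser alone only removes one layer.
RSS_ITEM = (
    "<item><title>Many of you will be glad we aren&amp;#8217;t this city "
    "&amp;#8212; they&amp;#8217;ve shelved their..</title></item>"
)

def plain_text_title(item_xml: str) -> str:
    """Return the item's title as plain text, ready for a plain-text API."""
    # Parsing the XML undoes the XML escaping, leaving text like aren&#8217;t.
    title = ET.fromstring(item_xml).findtext("title", default="")
    # A second, HTML-level decode turns the remaining entities into characters.
    return html.unescape(title)

half_decoded = ET.fromstring(RSS_ITEM).findtext("title", default="")
tweet_text = plain_text_title(RSS_ITEM)

print(half_decoded)
print(tweet_text)
```

The point is only that the decode happens before the text is handed to an API that expects plain text, so no client downstream has to guess whether it is looking at encoded text.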

btw:  Kristi Gustafson is worth following, even if her text is getting mangled by TwitterFeed.

3 comments:

  1. Hi there - Mario from twitterfeed here. Just a couple of clarifications:

    - the issue with TweetDeck and some other twitter clients is actually slightly more complex: they see a "#" character in the text, and try to automatically turn it into a link (to support automatic links for hashtags). This means they insert code like <a href=...> inside the HTML entity, which obviously breaks it. You can actually see this in your screenshot - the "8217" has been turned into a link because it appears after the hash character. I'm in frequent contact with the TweetDeck guys and they are aware of the issue, and are working on a fix as far as I know.

    - re. facebook, we are still working on trying to find a way to post HTML entities and other non-English characters to their Stream.publish API method. Other developers seem to have the same problem, so we're still trying to figure out if this is a Facebook API bug, or if there is anything we can do to fix at our end.

    Hope this helps, cheers,
    Mario.

  2. Thanks for that information Mario. That would make sense with TweetDeck. If you assume that everything that has a "#" character is a hashtag, then you are going to have problems.

    This should be easy to fix for this case though. They should only identify hashtags when "#" is at the beginning of the text or is preceded by a " ". I don't know if that fixes everything, but it's a move in the right direction.

  3. I'm using the twitter api to post new blog posts to twitter. The posts are normally in Norwegian, with the Norwegian letters æ ø å. They are also encoded, with a # in the code. TweetDeck still has this issue.

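
The hashtag behavior discussed in the comments can be sketched in a few lines of Python. This is only an illustration, not how TweetDeck works internally: the function names, the example.com link target, and the sample text are all made up. The guarded version follows the suggestion of only treating "#" as a hashtag when it starts the text or follows whitespace, and it may well not cover every case.

```python
import re

def _link(match):
    # Hypothetical link target; a real client would use its own search URL.
    tag = match.group(1)
    return f'<a href="https://example.com/search?q=%23{tag}">#{tag}</a>'

def link_hashtags_naive(text):
    # Treats every "#" followed by word characters as a hashtag.
    return re.sub(r"#(\w+)", _link, text)

def link_hashtags_guarded(text):
    # Only matches "#" at the start of the text or right after whitespace.
    return re.sub(r"(?<!\S)#(\w+)", _link, text)

sample = "we aren&#8217;t this city #happyhour"
naive = link_hashtags_naive(sample)
guarded = link_hashtags_guarded(sample)

print(naive)    # the entity is split by an inserted link
print(guarded)  # the entity is left intact, the real hashtag is linked
```

With the naive pattern, the "#8217" inside the entity becomes a link and the entity can no longer be decoded. With the guarded pattern, only "#happyhour" is linked.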
