Counting Email Addresses By Domain in MySQL

Every so often I find some statistical need that although Perl program would be easy to write for it, its probably something the database should just handle. So I have a column in an email management table that has just the email addresses in the format user@domain.tld.

I want to know which domains make up the majority of the users. (The numbers have been changed to protect the innocent):

mysql> SELECT SUBSTRING_INDEX(email, '@', -1) as Domain, count(*) as Total
      FROM email_list
GROUP BY Domain
ORDER BY Total DESC
       LIMIT 15;
+----------------+---------+
| Domain         | Total   |
+----------------+---------+
| yahoo.com      | 1304000 |
| hotmail.com    |  908400 |
| aol.com        |  800000 |
| msn.com        |  168000 |
| gmail.com      |  161000 |
| comcast.net    |  143000 |
| sbcglobal.net  |  110000 |
| bellsouth.net  |   62000 |
| cox.net        |   58000 |
| verizon.net    |   56000 |
| earthlink.net  |   52000 |
| charter.net    |   46000 |
| juno.com       |   30000 |
| optonline.net  |   22000 |
| netzero.com    |   17000 |
+----------------+---------+
Posted in MySQL. Tags: , . 10 Comments »

Character Encoding

I recently ran into some character encoding issues and I wanted to share the fun. The default character encoding for MySQL on Gentoo is latin-1 or iso-8859-1. This wasn’t a problem until we recently started putting content straight from the DB through Java and onto the web. Java connects to the DB with a character encoding (typically UTF-8). Since UTF-8 is roughly a superset of iso-8859-1, it generally wasn’t a problem. Except when UTF-8 and UTF-16 characters were put into an iso-8859-1 database without translation.

What was essentially happening was that the data was being stored as iso-8859-1. The Java code was connecting to the DB in UTF-8 and pulling it into Java (which is usually UTF-16, but in this case was being handled as UTF-8). It was then being sent to the browser as URL encoded UTF-8 when in reality, it hadn’t even properly been put into UTF-8 character encoding. This then gave the web browser some funny yen symbols and question marks. This was not quite what we were aiming for.

The moral of this story is that it is necessary to realize the character encoding of the start point and end point of your data. It is crucial that the code points match up otherwise they could potentially make for an interesting screen given to the reader. All this could have been avoided with a simple: ALTER TABLE myTable MODIFY myColumn VARCHAR(255) CHARACTER SET utf8;.

MySQL Encoded URI Search and Replace

So I can definitely not take credit for this trick. The credit goes to Augusto Bott of the Pythian Group for this. I have a table in my DB that has columns of encoded URIs. The list of URI encoding character translations is available here.

To accomplish this, the MySQL REPLACE function was used.

mysql> SET @test:='%20%27%7C%26%5E%2B%2D%25';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT @test as original,
    ->        REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    ->        @test,'%20',' '),
    ->        '%27','\''),
    ->        '%7C','|'),  -- REPLACE() is case sensitive
    ->        '%7c','|'),  -- so we have
    ->        '%26','&'),
    ->        '%5E','^'),
    ->        '%5e','^'),  -- to repeat
    ->        '%2D','-'),
    ->        '%2d','-'),  -- ourselves
    ->        '%2B','+'),
    ->        '%2b','+'),  -- sometimes
    ->        '%25','%') as replaced;
+--------------------------+----------+
| original                 | replaced |
+--------------------------+----------+
| %20%27%7C%26%5E%2B%2D%25 |  '|&^+-% |
+--------------------------+----------+
1 row in set (0.01 sec)

To make our lives easier, let’s create a function from that.

mysql> CREATE FUNCTION url_replace
    -> (url VARCHAR(1024))
    -> RETURNS VARCHAR(1025)
    -> BEGIN
    -> RETURN REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    ->        @test,'%20',' '),
    ->        '%27','\''),
    ->        '%7C','|'),
    ->        '%7c','|'),
    ->        '%26','&'),
    ->        '%5E','^'),
    ->        '%5e','^'),
    ->        '%2D','-'),
    ->        '%2d','-'),
    ->        '%2B','+'),
    ->        '%2b','+'),
    ->        '%25','%');
    -> END$$
Query OK, 0 rows affected (0.02 sec)

And to wrap it up, let’s use it in an actual query. This query finds all instances of a % and replaces it with its unencoded character equivalent.

mysql> UPDATE website_page SET uri = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE( uri,'%20',' '), '%27','\''), '%7C','|'), '%7c','|'), '%26','&'), '%5E','^'), '%5e','^'), '%2D','-'), '%2d','-'), '%2B','+'), '%2b','+'), '%25','%'), '%2C',',') WHERE uri LIKE '%\%%';
Posted in MySQL. Tags: . No Comments »