There really isn’t a solid Python module for multiprocessing and MySQL. Now this may be because MySQL on a single server is disk bound and therefore limited in speed or just because no one has written it. So here is a quick and dirty example using the Pool module in multiprocessing in Python 2.6 and MySQLdb.
I also tried using PySQLPool. This was designed for threading and not forking as I am doing with Pool method. Although I am sure it is possible to use PySQLPool with forking by passing the connection (pool) object down to the child process or possibly doing something with IPC, I decided to keep it simple (although slightly more expensive) and instantiate MySQLdb connections upon fork.
import multiprocessing
def main():
tablename = 'address_demographics'
inserts = []
id = 0
max_id = 1000000
while id < max_id:
sql = """
INSERT IGNORE INTO `%s` (person_id)
SELECT person_id
FROM `people`
WHERE id BETWEEN %d AND %d
""" % (tablename,id,(id + 100000))
inserts.append(sql)
id += 100001
pool = multiprocessing.Pool(multiprocessing.cpu_count())
pool.map(sqlWorkerInsert, inserts, 1)
pool.close()
pool.join()
def sqlWorkerInsert(sql):
conn = MySQLdb.connect(host = "127.0.0.1",
user = "USER",
passwd = "PASS",
db = "DB")
cursor = conn.cursor()
# Need to do this since AUTOCOMMIT = 0 by default (wtf?)
cursor.execute("SET AUTOCOMMIT=1")
cursor.execute(sql)
cursor.close()
conn.close()
It may not be the cleanest method to open a new connection to MySQL for every worker subprocess. Since MySQL opens a new thread for every connection, I don’t think this is that big of a deal. However I am always open to new ways of doing things.
Related posts:
