研究 Oracle 到 PostgreSQL 的数据迁移 – 以 pgloader 为例

一点实习成果，过程中学了很多东西，就放出来吧，欢迎交流指正。

pgloader 介绍

一个开源的工具，用来把数据从其他地方导入到 PostgreSQL
写入 PostgreSQL 的时候用的 COPY 命令，效率高
v1 是 Tcl 写的，v2 是 Python 写的，v3 是 Common Lisp 写的，以下测试用的是 v3
导入来源目前支持 CSV 文件、STDIN、SQLite、MySQL、MSSQL
网址：https://github.com/dimitri/pgloader、http://pgloader.io/howto/pgloader.1.html
有问题可以直接在 issues 里搜索关键词或者提问，有很多有用的资源： https://github.com/dimitri/pgloader/issues
Common Lisp 介绍：Lisp 的一种分支，一种语言规范，偏函数式编程，使用S-表达式表示代码和数据结构，函数和宏调用以列表的形式写出

与原生 copy 进行性能对比测试

每次均为表清空之后再导入数据
本地只有 PostgreSQL 环境，且 pgloader 暂时不支持从 Oracle 直读数据（但支持 MSSQL、MySQL 直读），所以采用 CSV 文件作为数据源，相当于不读、只写（458368 条记录，31MB），记录内容形如："2016-1-1 0:00:00","1234567890","10.0.0.1","index.aspx","OK"
Python 测试和 pgloader 测试均在 Ubuntu 虚机上进行，连接到虚机宿主机上的 PG 数据库（包含网络开销）

分别进行三次测试，耗时结果：

Python: 3678ms, 4253ms, 6418ms
pgloader: 9690ms, 6740ms, 10793ms

平均速度：

Python: 9.6万条/s, 6.5MB/s
Pgloader: 5.0万条/s, 3.4MB/s

官方对 pgloader 的性能测试结果（来自 http://pgloader.io/index.html 首页的视频前几分钟位置，具体用例未说明，应该是有读有写的）：v3 版本折算记录写入速度为 5.5万/s、4.4万/s、5.1万/s、4.9万/s。

Python 程序如下：

import psycopg2, logging, sys
conn = psycopg2.connect(host="192.168.56.1", dbname="postgres", user="postgres“)
cur = conn.cursor()
fin = open('2016-01-01-log.csv', 'r') 
cur.copy_expert('COPY "public"."logs" FROM STDIN WITH CSV encoding \'UTF8\'', fin)
conn.commit() # Should do this
fin.close()
conn.close()

pgloader 程序配置如下：

LOAD CSV
    FROM ‘2016-01-01-log.csv’ WITH ENCODING UTF-8
 INTO postgresql://postgres@192.168.56.1/postgres?public.logs
 WITH fields terminated by ‘,’;

关于速度

影响测试结果的因素：

网络栈（测试是跨 IP 进行的，有考虑网络，但是几乎无时延）
PG 数据库表的情况（写入空表速度比写入有内容的表速度慢）
服务器速度（测试是 PC、虚拟机进行的，可能会比较慢）

Python 侧提高速度可能的方法：

尽量避免数据转换过程？对于不同的数据样式，psycopg2 有 cursor.copy_expert() 函数来实现自定义 COPY 的 SQL 语句
送给 copy 函数的文件流，实测非 binary 模式速度比 binary 快
测试时试过 Python 内带引号和不带引号 CSV，分别通过 copy_from/copy_expert 触发，速度相差无几

可以参考 https://github.com/dimitri/pgloader/issues/251，非常详尽的讨论，官方在开头就表示 "I'm interested into making pgloader as fast as possible of course"。

如何开始使用 pgloader

最好在 Linux 上运行，Windows 虽然官方说能运行但是还挺难弄的。

https://github.com/dimitri/pgloader/releases 找最新版（pgloader-bundle-3.3.2.tgz）
- 不要偷懒用 apt-get install pgloader，是 v2 版，如果运行时莫名其妙提示 Error: Configuration file pgloader.conf does not exists 就是 v2 版
系统上要装好 lisp 编译器进行编译，官方默认用的是 SBCL
- https://github.com/dimitri/pgloader#build-from-sources 这里的说明是直接用 apt-get install sbcl unzip libsqlite3-dev make curl gawk freetds-dev libzip-dev 安装需要的依赖，但是一般 apt 里的 sbcl 版本比较老，可能编译会出问题（You need ASDF >= 3.1 to load this system correctly），所以需要参考以下方法，从官方渠道安装，会比较靠谱
- http://sbcl.org/platform-table.html 我下的是 Linux AMD64
- 然后按照 http://sbcl.org/getting.html 安装，最后确认一下 sbcl 命令全局可用
然后编译：make pgloader，程序就会出现在 bin/pgloader 位置，然后写 load 配置文件，就可以直接运行它：bin/pgloader test.load

官方其实有提到另一种编译器 CCL：

Also if you're building yourself and having large dataset and some other issues with SBCL, consider building against Clozure CL (or CCL); it provides a much better Garbage Collector and thus the user experience with pgloader usually is much better, sometimes even in terms of raw performances. https://github.com/dimitri/pgloader/issues/497#issuecomment-269474917

但是我用 pgloader-bundle-3.3.2.tgz 的 makefile 几乎没法编译，要改 makefile 里很多参数，官方 issues 中确认说暂时要用 master 版本的来编译，日后的 release 版本会修复。我也就没继续尝试了。

另外 CCL 安装比较麻烦，默认不是全局的：

In the scripts/ directory of the ccl directory, there are two files named ccl and ccl64. Copy these files into /usr/local/bin or some other directory that is on your path, and then edit them so that the value of CCL_DEFAULT_DIRECTORY is your ccl directory. You can then start up the lisp with "ccl" or "ccl64".

关于 Oracle 数据迁移

写了这么多，看似跟 Oracle 一点关系都没有。以上的测试过程，的确只包含写的过程，不包含从数据库抓数据的过程。

pgloader 目前不支持从 Oracle 读数据，但是已支持 MSSQL、MySQL 等数据来源。具体在 GitHub 上有个 issue，很多信息（Oracle support），我顺便总结了一下：

pgloader 官方表示：这个功能在 TODO list 上，但有人赞助他才愿意写；
有人提出的现阶段能用的方案：use ora2pg to dump out the oracle databases per table, then using a shell script to convert the .sql files with inline COPY data into .csv files that pgloader can load into postgres. Some of our tables are 7+ billion rows and would not load as a regular .sql file。虽然比较曲线不过依然简化了工作；
如果自己在 pgloader 上集成的话：需要集成一个 oracle driver（比如https://github.com/archimag/cl-oracle），然后要写请求表格结构、索引和外键等的 SQL，然后写 OR 到 PG 的数据格式映射（比如 number -> numeric）；
实际上用支持 MSSQL/MySQL 的代码去改就行，主要问题是 Common Lisp 编程语言的开发学习门槛比较高，跟编译出来用是两回事。

总结

本次测试中写的 Python 的实现，相当于是最“原生”的实现，因为过程中没有任何数据转换处理，因此速度非常快。

而 pgloader 因为功能比较多，包括其会处理不同数据类型的转换，速度会有一定影响；虽然没有具体对比到，但我认为这一类的数据转换过程，Common Lisp 会比 Python 的速度快，具体还需深入对比。

应此尽量利用 COPY 的 SQL 语句进行原生迁移是最好的；当不得不进行数据类型转换时，效率非常重要。

研究 Oracle 到 PostgreSQL 的数据迁移 - 以 pgloader 为例

pgloader 介绍

与原生 copy 进行性能对比测试

关于速度

如何开始使用 pgloader

关于 Oracle 数据迁移

总结

发表回复取消回复

pgloader 介绍

与原生 copy 进行性能对比测试

关于速度

如何开始使用 pgloader

关于 Oracle 数据迁移

总结

发表回复 取消回复

发表回复取消回复