PostgreSQL数据库中的表和数据（Tables & Data）

发表时间：2014-02-25 16:38 | 分类：PostgreSQL | 浏览：5,114 次

照顾好你的数据,数据库也会照顾你。保持数据库的整洁，查询起来也会更快，应用也会少些错误。半夜被叫醒解决数据问题并不酷。接下来，就和章郎虫博主一起来了解postgresql的表和数据吧。

一、选择一个好的数据库对象名（Choosing good names for database objects）

让其他人可以快速了解数据库的最简单方法就是给数据库各对象取一个有意义的名字。具体注意事项可以参考《PostgreSQL-9-Admin-Cookbook》的96页。

在postgresql中，标准的索引表的格式是：{tablename}_{columnname(s)}_{suffix} ，即{表名}_{列名}_{后缀}。后缀有pkey、key、excl、idx和seq几种，分别对应主键约束、唯一约束、排他性约束、其它类型的索引和序列。

postgresql中的表可以同时包含多个触发器。触发器名中可以包含一些动作，比如update、delete等。触发器一个有用的命名规范格式为：{tablename}_{actionname}_{after|before}__trig 。

二、处理包含引用名的对象（Handling objects with quoted names）

博主第一次看到这个标题，实在不明白是什么意思（英语不好），不过看了以下例子大家应该就会明白。

首先创建包含引用的对象，CREATE TABLE "MyCust" AS SELECT * FROM cust;

然后用下面几个语句查询，可以发现都出现了相同的错误。

postgres=# SELECT count(*) FROM mycust;

ERROR: relation "mycust" does not exist

LINE 1: SELECT * FROM mycust;

postgres=# SELECT count(*) FROM MyCust;

ERROR: relation "mycust" does not exist

LINE 1: SELECT * FROM mycust;

而这个是对的。

postgres=# SELECT count(*) FROM "MyCust";

count

-------

(1 row)

从上面这个例子可以发现，如果创建对象的时候名字中用了引号，那么查询时也一定要包含引号。而且postgresql中对象名对大小写不敏感，也就是说“SELECT * FROM mycust;”、“SELECT * FROM MYCUST;”和“SELECT * FROM MyCust;”是一样的。

三、执行相同的名称,相同的列定义（Enforcing same name, same column definition）

两个比较复杂的sql。

Columns ：

We can identify columns that are defined in different ways in different tables using a query

against the catalog.

SELECT table_schema,table_name,column_name,data_type ||coalesce(' ' || text(character_maximum_length), '') ||coalesce(' ' || text(numeric_precision), '') ||coalesce(',' || text(numeric_scale), '') as data_type FROM information_schema.columns

WHERE column_name IN(

SELECT column_name FROM

(SELECT column_name,data_type,character_maximum_length,numeric_precision,numeric_scale FROM information_schema.columns

WHERE table_schema = 'public'

GROUP BY column_name,data_type,character_maximum_length,numeric_precision,numeric_scale

) derived

GROUP BY column_name

HAVING count(*) > 1

) AND table_schema NOT IN ('information_schema', 'pg_catalog')

ORDER BY column_name ;

Tables：

The following query looks for all tables of the same name (and

hence in different schemas) that have different definitions.

SELECT table_schema,table_name,column_name,data_type FROM information_schema.columns

WHERE table_name IN

(SELECT table_name FROM

(SELECT table_schema,table_name,string_agg(' '||column_name||' '||data_type) FROM information_schema.columns

GROUP BY table_schema,table_name

) def

GROUP BY table_name

HAVING count(*) > 1

) ORDER BY table_name,table_schema,column_name;

四、识别和去除重复定义（Identifying and removing duplicates）

关系型数据库中可以标识唯一的数据项，但是可能不知道什么原因，数据中会出现重复。

比如这个例子，在customerid就有重复的数据。

postgres=# SELECT * FROM cust;

customerid | firstname | lastname | age

------------+-----------+----------+-----

1 | Philip | Marlowe | 38

2 | Richard | Hannay | 42

3 | Holly | Martins | 25

4 | Harry | Palmer | 36

4 | Mark | Hall | 47

(5 rows)

这里可以用下面这个语句找出重复的数据。

SELECT * FROM cust WHERE customerid IN (SELECT customerid FROM cust GROUP BY customerid HAVING count(*) > 1);

找出重复数据后可以对这些数据进行更新或者删除。

五、防止出现重复行（Preventing duplicate rows）

从四中我们可以知道，数据库可能会出现重复的数据项。如果我们不想让某列出现重复，那么我们可以在定义数据库表的时候进行唯一限制。具体有下面几种方法。

1.创建主键

ALTER TABLE newcust ADD PRIMARY KEY(customerid);

运行后创建新索引newcust_pkey 。

2.创建唯一约束

ALTER TABLE newcust ADD UNIQUE(customerid);

运行后创建新索引newcust_customerid_key 。

3.创建唯一索引

CREATE UNIQUE INDEX ON newcust (customerid);

运行后创建新索引newcust_customerid_idx 。

六、在一组数据中找出唯一键（Finding a unique key for a set of data）

没有工具，我们找唯一键可能也会很快，比如看列的名字、外键就可以了。这里我们使用postgresql提供的optimizer statistics。

postgresql=# analyze article ;

ANALYZE

postgresql=# select attname,n_distinct from pg_stats where schemaname = 'public' AND tablename = 'article' ;

attname | n_distinct

--------------+------------

rply_cnt | 564

read_cnt | 930

url_hash | -1

hash_plain | -1

title_hash | -1

guid | -1

neg_pos | 1

match_code | -0.937369

tm_spider | -0.389967

aid | -1

style | 3

oaid | 1102

fid | 6

bid | 67

cid | 2

tid | 3

url | -1

tm_post | -0.915479

tm_last_rply | 0

author | 49

title | -0.95474

content | 0

ab_content | -0.924905

tm_update | -0.685363

stage | 1

rply_cut | 473

read_cut | 814

src | 1

rfid | 5

labels | 172

kwds | 0

like_cnt | 186

(32 rows)

如果n_distinct等于-1，那么说明在检查的这些数据中，这个列是唯一的。如果有多个-1，那么我们可能需要判断下。（We would then need to use our judgment to decide whether one or both of those columns are

unique by chance, or as part of the design of the database that created them.）

七、生成测试数据（Generating test data）

生成顺序数

zhangnq=# select * from generate_series(1,5) ;

generate_series

-----------------

(5 rows)

生成时间

zhangnq=# SELECT date(generate_series(now(), now() + '1 week', '1 day'));

date

------------

2014-02-25

2014-02-26

2014-02-27

2014-02-28

2014-03-01

2014-03-02

2014-03-03

2014-03-04

(8 rows)

随机整数

zhangnq=# select (random()*(2*10^9))::integer ;

int4

-----------

958536259

(1 row)

随机长整型数字

zhangnq=# select (random()*(9*10^18))::bigint ;

int8

---------------------

6527764440514147328

(1 row)

随机小数数字

zhangnq=# select (random()*100.)::numeric(4,2);

numeric

---------

39.97

(1 row)

随机重复的字符串，最长长度40 。

zhangnq=# select repeat('1',(random()*40)::integer) ;

repeat

-----------

111111111

(1 row)

随机长度字符串

zhangnq=# select substr('abcdefghijklmnopqrstuvwxyz',1, (random()*26)::integer) ;

substr

------------

abcdefghij

(1 row)

从一个字符串列表中随机取一个字符串

zhangnq=# select (ARRAY['one','two','three'])[1+random()*3] ;

array

-------

two

(1 row)

用随机数据生成表

zhangnq=# SELECT generate_series(1,10) as key ,(random()*100.)::numeric(4,2) ,repeat('1',(random()*25)::integer);

八、随机抽样数据（Randomly sampling data）

生成随机数据

pg_dump –-exclude-table=MyBigTable > db.dmp

pg_dump –-table=MyBigTable –schema-only > mybigtable.schema

psql -c '\copy (SELECT * FROM MyBigTable

WHERE random() < 0.01) to mybigtable.dat'

导入随机数据

psql -f db.dmp

psql -f mybigtable.schema

psql -c '\copy mybigtable from mybigtable.dat'

总的来说,如果可以的话,我的建议是避免取样，或者至少减少一些大表取样。

九、从表格中加载数据（Loading data from a spreadsheet）

现在大多数小数据都通过表格储存，所以从表格中加载数据是很多数据库管理员必须要面对的问题。

在导入到数据库之前，表格需要满足一下几个条件：

1.所有列都只有一列

2.所有行都只有一行

3.数据只在一个工作表

4.第一行的列是一些描述或者标题（可选条件）

当然现在很多表格中包含了公式、汇总、宏、图像等等，所以在操作之前，需要先把这些转化成符合上面条件的最简单表的格式。然后把表格另存为csv格式，上传到服务器。

开始导入数据

postgres=# \COPY sample FROM sample.csv CSV HEADER

postgres=# SELECT * FROM sample;

或者psql -c '\COPY sample FROM sample.csv CSV HEADER'

十、从平面文件加载数据（Loading data from flat files）

这个主要讲pgloader的用法，之后博主会再介绍。

PostgreSQL数据库中的表和数据（Tables & Data）

一、选择一个好的数据库对象名（Choosing good names for database objects）

二、处理包含引用名的对象（Handling objects with quoted names）

三、执行相同的名称,相同的列定义（Enforcing same name, same column definition）

四、识别和去除重复定义（Identifying and removing duplicates）

五、防止出现重复行（Preventing duplicate rows）

六、在一组数据中找出唯一键（Finding a unique key for a set of data）

七、生成测试数据（Generating test data）

八、随机抽样数据（Randomly sampling data）

九、从表格中加载数据（Loading data from a spreadsheet）

十、从平面文件加载数据（Loading data from flat files）

相关内容

日历

最新发表

2024年四月
一	二	三	四	五	六	日
« 十
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

PostgreSQL数据库中的表和数据（Tables & Data）

一、选择一个好的数据库对象名（Choosing good names for database objects）

二、处理包含引用名的对象（Handling objects with quoted names）

三、执行相同的名称,相同的列定义（Enforcing same name, same column definition）

四、识别和去除重复定义（Identifying and removing duplicates）

五、防止出现重复行（Preventing duplicate rows）

六、在一组数据中找出唯一键（Finding a unique key for a set of data）

七、生成测试数据（Generating test data）

八、随机抽样数据（Randomly sampling data）

九、从表格中加载数据（Loading data from a spreadsheet）

十、从平面文件加载数据（Loading data from flat files）

相关内容

日历

标签

最新发表