Presto Syntax

👤拔丝英语网 🕔2024/01/09 10:00 📁Living Abroad

01

—

Differences between Hive, Spark, and Presto

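1. Hive:

Baidu Baike defines Hive as a data warehouse tool built on Hadoop, used for extracting, transforming, and loading data. It is a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Hive maps structured data files to database tables and provides SQL querying by translating SQL statements into MapReduce jobs.

In practice, Hive can loosely be treated as a database: it manages tables over data stored in Hadoop and has its own execution engine, which is why it also has its own SQL dialect, Hive SQL.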

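2. Spark:

Spark began as a compute engine but has since grown into an ecosystem of its own, much like the Hadoop ecosystem. It is best seen as a complement to Hadoop: like MapReduce it executes jobs in a distributed fashion, but it keeps intermediate results in memory instead of writing them back to HDFS between steps, which lowers latency and speeds up queries.

Spark can therefore be understood as a compute engine.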

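3. Presto:

Put simply, Presto is also a query engine. It does not store data itself, but it can connect to multiple data sources such as Hive, MySQL, and TiDB, and it supports queries that span those sources.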

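4. Hive SQL, Spark SQL, and Presto SQL

Presto is a low-latency, high-concurrency in-memory compute engine. It starts executing a task as soon as the task is accepted and does not go through disk, so it is much faster than Hive.

Because Presto computes in memory, queries over very large tables often run out of memory. In that case Spark SQL can be used instead. Spark also computes in memory, replacing the MapReduce execution model, so it avoids the inefficiency of MapReduce while also coping with the large-table queries that overflow Presto's memory.

Given these trade-offs, many companies use Hive as the data source and query it with Presto and Spark.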

02

—

Presto function reference

See the link:

https://blog.csdn.net/sinat_17697111/article/details/89101124

03

—

Comparison of some important functions in Hive and Presto

3.1 Handling JSON data

In Hive:

    select get_json_object(xx['custom'], '$.position') from table

In Presto:

    select json_extract_scalar(xx['custom'], '$.position') from table

Note that Presto's json_extract_scalar returns a string. A second function, json_extract, returns the value as JSON instead, so you need to know what type of value you are extracting when you choose between them.
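The difference between the two Presto functions can be sketched in Python. This is only a rough analogue that handles a single top-level key; the helper names and the sample document are made up for illustration and are not part of either engine.

```python
import json


def json_extract(doc: str, key: str):
    """Rough analogue of json_extract(doc, '$.key'): the value stays JSON text."""
    value = json.loads(doc).get(key)
    return None if value is None else json.dumps(value)


def json_extract_scalar(doc: str, key: str):
    """Rough analogue of json_extract_scalar(doc, '$.key').

    Scalars come back as plain strings; objects and arrays give None.
    """
    value = json.loads(doc).get(key)
    if value is None or isinstance(value, (dict, list)):
        return None
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


# Hypothetical sample payload.
custom = '{"position": "top", "rank": 3, "tags": ["a", "b"]}'

print(json_extract(custom, "position"))         # "top" (still JSON, quotes kept)
print(json_extract_scalar(custom, "position"))  # top
print(json_extract_scalar(custom, "rank"))      # 3 (as a string)
print(json_extract_scalar(custom, "tags"))      # None (an array is not a scalar)
```

The practical consequence is that a numeric field read through the scalar function still has to be cast before it is used in arithmetic.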

3.2 Rows to columns

1. Hive: collect_set builds a de-duplicated array, and concat_ws joins the array into a comma-separated string.

    select user_id
        , concat_ws(',', collect_set(order_id)) as order_ids
    from tmp.tmp_row_to_col
    where 1 = 1
    group by user_id ;

2. Presto: array_agg builds an array, array_distinct removes duplicates, and array_join joins the array into a comma-separated string.

    select user_id
        , array_join(array_distinct(array_agg(order_id)), ',') as order_ids
    from tmp.tmp_row_to_col
    where 1 = 1
    group by user_id ;
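To make the aggregation concrete, here is a small Python sketch of the same idea: group by user, drop duplicate order ids, and join with commas. The function name and the sample rows are hypothetical. Python keeps insertion order here, whereas a real query engine gives no ordering guarantee unless one is requested.

```python
from collections import defaultdict


def rows_to_csv(rows):
    """Collapse (user_id, order_id) rows into one comma-separated string per user."""
    grouped = defaultdict(list)
    for user_id, order_id in rows:
        grouped[user_id].append(order_id)  # plays the role of array_agg
    return {
        # dict.fromkeys de-duplicates while keeping first-seen order
        user_id: ",".join(dict.fromkeys(order_ids))
        for user_id, order_ids in grouped.items()
    }


rows = [("u1", "o1"), ("u1", "o2"), ("u1", "o1"), ("u2", "o3")]
print(rows_to_csv(rows))  # {'u1': 'o1,o2', 'u2': 'o3'}
```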

3.3 Columns to rows

This is the reverse of 3.2: a comma-separated string is split back into one row per element.

1. Hive: split turns order_ids into an array, and lateral view explode expands the array into rows.

    select a.user_id
        , b.order_id
    from tmp.tmp_col_to_row a
    lateral view explode(split(order_ids, ',')) b as order_id ;

    select student, score
    from tests
    lateral view explode(split(scores, ',')) t as score;

2. Presto: split turns order_ids into an array, and cross join unnest expands the array into rows. Note where the table alias goes in each of the two syntaxes.

    select a.user_id
        , b.order_id
    from tmp.tmp_col_to_row a
    cross join unnest(split(order_ids, ',')) as b(order_id) ;

    select student, score
    from tests
    cross join unnest(split(scores, ',')) as t (score);
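The expansion itself is easy to picture in Python. The sketch below mirrors what explode and unnest do for this case; the function name and sample rows are made up for illustration.

```python
def explode_orders(rows):
    """Yield one (user_id, order_id) row per element of the comma-separated string."""
    for user_id, order_ids in rows:
        for order_id in order_ids.split(","):  # split, then one row per element
            yield (user_id, order_id)


rows = [("u1", "o1,o2"), ("u2", "o3")]
print(list(explode_orders(rows)))  # [('u1', 'o1'), ('u1', 'o2'), ('u2', 'o3')]
```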

3.4 Date function comparison

Problem 1: Converting time formats

Example: convert the date 20200110 to 2020-01-10

    -- hive
    select to_date(from_unixtime(unix_timestamp('20200110', 'yyyyMMdd')));
    -- Result: 2020-01-10

    -- presto
    select format_datetime(date_parse('20200110', '%Y%m%d'), 'yyyy-MM-dd');
    -- Result: 2020-01-10

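Problem 2: Shifting a date

Example: the original value is 20200110, which must first be converted to the standard date form before adding or subtracting days

    -- hive
    select date_add('2020-01-12', 10);  -- shift forward 10 days
    -- Result: 2020-01-22
    select date_add(to_date(from_unixtime(unix_timestamp('20200110', 'yyyyMMdd'))), 10);
    -- Result: 2020-01-20

    -- presto
    -- The third argument must be a date; without the cast the query fails.
    select date_add('day', 10, cast('2020-01-12' as date));
    -- Result: 2020-01-22
    select date_add('day', 10, cast(format_datetime(date_parse('20200110', '%Y%m%d'), 'yyyy-MM-dd') as date));
    -- Result: 2020-01-20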
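As a cross-check on the expected results, the same parse-then-shift logic can be written in a few lines of Python. The function name is hypothetical; only the standard library is used.

```python
from datetime import datetime, timedelta


def shift_compact_date(yyyymmdd: str, days: int) -> str:
    """Parse a yyyyMMdd string, shift it by a number of days, return yyyy-MM-dd."""
    parsed = datetime.strptime(yyyymmdd, "%Y%m%d").date()
    return (parsed + timedelta(days=days)).isoformat()


print(shift_compact_date("20200110", 0))   # 2020-01-10
print(shift_compact_date("20200110", 10))  # 2020-01-20
```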

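Problem 3: Timestamp to date

The results shown here correspond to a session time zone of UTC+8.

    -- hive
    select from_unixtime(1578585600);
    -- Result: 2020-01-10 00:00:00
    -- with a format
    select from_unixtime(1578585600, 'yyyyMMdd');
    -- Result: 20200110

    -- presto
    select from_unixtime(1578585600);
    -- Result: 2020-01-10 00:00:00
    -- with a format
    select format_datetime(from_unixtime(1578585600), 'yyyy-MM-dd');

10-digit Unix timestamp. Data: 1595487673

    -- hive
    select from_unixtime(1595487673, 'yyyy-MM-dd HH:mm:ss');
    -- Result: 2020-07-23 15:01:13

    -- presto
    select format_datetime(from_unixtime(1595487673), 'yyyy-MM-dd HH:mm:ss');

13-digit Unix timestamp (if you do not need the milliseconds, drop the concat and the '.' after ss). Data: 1595487673343

    -- hive
    select from_unixtime(1595487673343);
    -- Result: 52528-12-28 20:22:23 (wrong: the value is read as seconds)
    select from_unixtime(cast(1595487673343/1000 as int));
    -- Result: 2020-07-23 15:01:13
    select concat(from_unixtime(cast(1595487673343/1000 as int), 'yyyy-MM-dd HH:mm:ss.'), cast(1595487673343%1000 as string));
    -- Result: 2020-07-23 15:01:13.343

    -- presto
    select concat(format_datetime(from_unixtime(1595487673343/1000), 'yyyy-MM-dd HH:mm:ss.'), cast(1595487673343%1000 as varchar));

Note that casting the remainder directly drops leading zeros when the millisecond part is below 100; left-pad it with lpad(..., 3, '0') if that matters.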
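The 13-digit case is the one that most often goes wrong, so here is a minimal Python sketch of the intended arithmetic: split off the milliseconds, format the seconds, and append the milliseconds zero-padded to three digits. The function name is hypothetical, and the fixed UTC+8 zone is an assumption chosen to match the results shown for the SQL examples.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+8 zone, matching the sample results.
UTC8 = timezone(timedelta(hours=8))


def format_millis(ts_ms: int, tz=UTC8) -> str:
    """Format a 13-digit (millisecond) Unix timestamp as yyyy-MM-dd HH:mm:ss.SSS."""
    seconds, millis = divmod(ts_ms, 1000)
    base = datetime.fromtimestamp(seconds, tz).strftime("%Y-%m-%d %H:%M:%S")
    return f"{base}.{millis:03d}"  # zero-pad so 43 ms prints as .043


print(format_millis(1595487673343))  # 2020-07-23 15:01:13.343
print(format_millis(1595487673043))  # 2020-07-23 15:01:13.043
```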

Problem 4: Date to timestamp

Converting to a 10-digit Unix timestamp

    -- hive
    select unix_timestamp('20200110', 'yyyyMMdd');  -- 10-digit timestamp
    -- Result: 1578585600
    select unix_timestamp(cast('2020-07-23 15:01:13' as timestamp));
    -- Result: 1595487673

    -- presto
    select to_unixtime(cast('2020-01-10' as date));
    select to_unixtime(cast(format_datetime(date_parse('20200110', '%Y%m%d'), 'yyyy-MM-dd') as date));
    select to_unixtime(cast('2020-07-23 15:01:13' as timestamp));

Converting to a 13-digit Unix timestamp. Data: 2020-07-23 15:01:13.343

    -- hive
    select unix_timestamp(cast(substr('2020-07-23 15:01:13.343', 1, 19) as timestamp)) * 1000 + cast(substr('2020-07-23 15:01:13.343', 21) as bigint);
    -- Result: 1595487673343

    -- presto (to_unixtime returns a double; cast to bigint if an integer is needed)
    select to_unixtime(cast('2020-07-23 15:01:13.343' as timestamp)) * 1000;
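The reverse conversion can be sketched the same way in Python: take the whole seconds, multiply by 1000, and add the millisecond part. The function name is hypothetical, and the fixed UTC+8 zone is again an assumption that matches the sample data.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+8 zone, matching the sample results.
UTC8 = timezone(timedelta(hours=8))


def to_millis(text: str, tz=UTC8) -> int:
    """Convert 'yyyy-MM-dd HH:mm:ss.SSS' into a 13-digit Unix timestamp."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f").replace(tzinfo=tz)
    whole_seconds = int(parsed.replace(microsecond=0).timestamp())
    # Integer arithmetic avoids floating-point rounding on the millisecond part.
    return whole_seconds * 1000 + parsed.microsecond // 1000


print(to_millis("2020-07-23 15:01:13.343"))  # 1595487673343
```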

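Problem 5: Computing the difference between two dates

Data: 2017-09-15 - 2017-09-01

    -- hive
    select datediff('2017-09-15', '2017-09-01');  -- first date minus second date
    -- Result: 14

    -- presto
    select date_diff('day', cast('2017-09-01' as date), cast('2017-09-15' as date));  -- second date minus first date
    -- Result: 14

Notes on the Presto version:

1) The unit must be given: 'day' returns the gap in days, 'hour' the gap in hours.
2) The two values must be typed (date or timestamp), so string literals need a cast.
3) The result is the later argument minus the earlier one, which is the opposite of Hive's argument order.

Data: 2020-07-24 11:42:58 - 2020-07-23 15:01:13

    -- hive
    select datediff('2020-07-24 11:42:58', '2020-07-23 15:01:13');
    -- Result: 1

    -- presto
    select date_diff('day', cast('2020-07-23 15:01:13' as timestamp), cast('2020-07-24 11:42:58' as timestamp));
    -- Result: 0

For this data the two timestamps are less than 24 hours apart, so Presto returns 0 while Hive returns 1. Watch out for this pitfall.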
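The mismatch is easier to see when the two behaviors are written out side by side. The Python sketch below reproduces the results reported above under one interpretation: the Hive-style function compares calendar dates only, and the Presto-style function counts whole 24-hour periods. Both function names are hypothetical, and the model is inferred from the sample results rather than taken from either engine's source.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def hive_style_datediff(end: str, start: str) -> int:
    """Compare calendar dates only, ignoring the time of day."""
    end_date = datetime.strptime(end, FMT).date()
    start_date = datetime.strptime(start, FMT).date()
    return (end_date - start_date).days


def presto_style_date_diff_days(start: str, end: str) -> int:
    """Count whole 24-hour periods elapsed between the two timestamps."""
    elapsed = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(elapsed / timedelta(days=1))  # truncates toward zero


print(hive_style_datediff("2020-07-24 11:42:58", "2020-07-23 15:01:13"))          # 1
print(presto_style_date_diff_days("2020-07-23 15:01:13", "2020-07-24 11:42:58"))  # 0
```

If Hive-compatible results are needed in Presto, casting both values to date before calling date_diff makes the comparison calendar-based.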

Problem 6: Current time

    -- hive
    select current_date;
    select unix_timestamp();  -- current Unix timestamp
    select from_unixtime(unix_timestamp());

    -- presto
    select now();  -- current date and time, down to the second
    select current_date;  -- today's date
    select current_date - interval '1' day;  -- yesterday's date

    -- hive: other date queries
    select current_timestamp;  -- current system time (including milliseconds)
    select dayofmonth(current_date);  -- day of the current month
    select last_day(current_date);  -- last day of the month
    select date_sub(current_date, dayofmonth(current_date) - 1);  -- first day of the current month
    select add_months(date_sub(current_date, dayofmonth(current_date) - 1), 1);  -- first day of next month
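The month-boundary expressions are simple date arithmetic, which the following Python sketch spells out. The function name is hypothetical, and a fixed date is passed in instead of the current date so that the output is reproducible.

```python
import calendar
from datetime import date, timedelta


def month_bounds(today: date):
    """Return (first day of month, last day of month, first day of next month, yesterday)."""
    # Same idea as date_sub(current_date, dayofmonth(current_date) - 1)
    first_day = today - timedelta(days=today.day - 1)
    # Same idea as last_day(current_date)
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    last_day = today.replace(day=days_in_month)
    # Same result as add_months(first_day, 1)
    next_month_first = last_day + timedelta(days=1)
    # Same idea as current_date - interval '1' day
    yesterday = today - timedelta(days=1)
    return first_day, last_day, next_month_first, yesterday


for value in month_bounds(date(2020, 1, 10)):
    print(value)
# 2020-01-01
# 2020-01-31
# 2020-02-01
# 2020-01-09
```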
