
[Bug] [Flink] After K8s Session runs the job, the console is always in the state of executing the job. #3372

Closed
2 of 3 tasks
13535048320 opened this issue Apr 10, 2024 · 24 comments · Fixed by #3410
Labels: Bug (Something isn't working), FAQ (Frequently Asked Questions)

@13535048320

13535048320 commented Apr 10, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

[screenshot]
After running a job on the K8s Session cluster, the console stays in the "executing job" state. The job is actually submitted successfully and can be seen running normally in the Flink UI, but the Operations Center shows no corresponding instance, or shows it as Unknown. This happens frequently once many jobs have been started, and there are no error logs.
[screenshot]
[screenshot]

Versions:
Dinky 1.0.1
Mariadb 10.6.11
Paimon 0.8
Flink 1.17.2
Kafka 3.4

Flink SQL:

set 'execution.checkpointing.interval' = '30s';
set 'execution.checkpointing.max-concurrent-checkpoints' = '1';
set 'execution.checkpointing.tolerable-failed-checkpoints' = '1';
set 'execution.checkpointing.min-pause' = '30s';
set 'table.exec.state.ttl' = '45min';
set 'execution.checkpointing.timeout' = '1h';
set 'table.exec.sink.upsert-materialize' = 'NONE';

CREATE CATALOG paimon WITH (
 'type' = 'paimon',
 'warehouse' = 's3://uat-warehouse/paimon',
 's3.endpoint' = 'http://minio:9000',
 's3.access-key' = '',
 's3.secret-key' = '',
 'fs.s3a.connection.maximum'='2000',
 'fs.s3a.threads.max'='4000',
 'fs.s3a.buffer.dir'='/tmp/'
);

use catalog default_catalog;
create database if not exists source_kafka ;

set 'parallelism.default' = '12' ;
create table default_catalog.source_kafka.source_s4h_kafka_qm_zqminspectdefect
(
 `CHARG` STRING,`POSNR` STRING,`FEGRP` STRING,`FECOD` STRING,`ANZFEHLER` DOUBLE,`FEHLBEWC` STRING,`PRUEFLINR` STRING,`LASTRECTXNTYPE` STRING,`RECCREDATE` STRING,`LASTRECTXNDATE` STRING,`LASTRECTXNTIME` STRING,`LASTRECTXNUSERID` STRING
) with (
'connector' = 'kafka'
, 'topic' = 'ZQMINSPECTDEFECT.A20.SAP'
, 'properties.group.id' = 'source_s4h_qm_zqminspectdefect_sink'
, 'scan.startup.mode' = 'group-offsets'
, 'properties.bootstrap.servers' = 'kafka.confluent:9071'
, 'properties.auto.offset.reset' = 'earliest'
, 'properties.security.protocol' = 'SASL_SSL'
, 'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username="" password="";'
, 'properties.sasl.mechanism' = 'PLAIN'
, 'format' = 'json'
, 'json.fail-on-missing-field' = 'false'
, 'json.ignore-parse-errors' = 'true'
);

insert into paimon.ods.ods_s4h_qm_zqminspectdefect/*+ OPTIONS('sink.use-managed-memory-allocator'='true', 'sink.managed.writer-buffer-memory'='256M') */
select 
 `CHARG` ,`POSNR` ,`FEGRP` ,`FECOD` ,`ANZFEHLER` ,`FEHLBEWC` ,`PRUEFLINR` ,`LASTRECTXNTYPE` ,`RECCREDATE` ,`LASTRECTXNDATE` ,`LASTRECTXNTIME` ,`LASTRECTXNUSERID` ,
 cast( FROM_UNIXTIME(UNIX_TIMESTAMP()) as TIMESTAMP(3) ) as data_update_time 
from default_catalog.source_kafka.source_s4h_kafka_qm_zqminspectdefect ;

What you expected to happen

The job starts normally and its status in the Operations Center is normal.

How to reproduce

  1. Create a K8s Session cluster configuration pointing at jobmanager.flink:8081.
  2. Create a Flink SQL task and select K8s Session mode.
  3. Run multiple Flink SQL jobs.

Anything else

No response

Version

1.0.0

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@13535048320 13535048320 added Bug Something isn't working Waiting for reply Waiting for reply labels Apr 10, 2024

Hello @13535048320, this issue is about K8S, so I assign it to @gaoyan1998 and @zackyoungh. If you have any questions, you can comment and reply.


@github-actions github-actions bot changed the title [Bug] [Flink] K8s Session 运行作业后控制台一直处于执行作业的状态 [Bug] [Flink] After K8s Session runs the job, the console is always in the state of executing the job. Apr 10, 2024
@gaoyan1998
Contributor

Could you share the logs?

@Zzm0809 Zzm0809 added More Information Required More information required and removed Waiting for reply Waiting for reply labels Apr 10, 2024
@13535048320
Author

@gaoyan1998
Contributor

Restarting Dinky fixes it for now; this will be fixed in 1.0.2.

@Zzm0809 Zzm0809 added FAQ Frequently Asked Questions Bug Something isn't working and removed Bug Something isn't working More Information Required More information required labels Apr 11, 2024
@Zzm0809 Zzm0809 added this to the 1.0.2 milestone Apr 11, 2024
@Zzm0809 Zzm0809 moved this to ToDo in Dinky Roadmap Apr 11, 2024
@13535048320
Author

@gaoyan1998 When is 1.0.2 expected to be released? Thanks.

@gaoyan1998
Contributor

> @gaoyan1998 When is 1.0.2 expected to be released? Thanks.

Does the problem still occur after restarting?

@13535048320
Author

Yes, the problem still recurs after a restart.

@gaoyan1998
Contributor

> Yes, the problem still recurs after a restart.

You're on 1.0.0, right? I tested on 1.0.1 and didn't hit this.

@13535048320
Author

It's 1.0.1. Could it be caused by running too many streaming jobs?
[screenshot]

@gaoyan1998
Contributor

> It's 1.0.1. Could it be caused by running too many streaming jobs? [screenshot]

Do only these few tasks have the problem, or all of them? I've run into this before; a temporary workaround is to create a new task and stop using the affected one.

@13535048320
Author

It's not fixed to particular tasks. When few jobs were running, only a handful were affected occasionally; now that 100+ streaming jobs have been started, it's the opposite: most job launches hit this problem, and only the occasional one finishes starting within a minute.

@13535048320
Author

13535048320 commented Apr 11, 2024

When I run it through the API, I never get a response either: https://dinky.demo.com/api/task/submitTask?id=xxx

@gaoyan1998
Contributor

> When I run it through the API, I never get a response either: https://dinky.demo.com/api/task/submitTask?id=xxx

Is the first submission after a restart successful?

@13535048320
Author

After a restart, the first submission succeeds.

@gaoyan1998
Contributor

gaoyan1998 commented Apr 11, 2024

I see the error below in front of every failed submission. Have you checked your login state? If the browser has been open for a long time without a refresh, reload the page: Dinky's login can expire, and without a refresh it won't redirect to the login page.

[dinky] 2024-04-10 14:13:26 HKT ERROR org.dinky.aop.LogAspect 139 handleCommonLogic - pre doAfterThrowing Exception: cn.dev33.satoken.exception.NotLoginException: 未能读取到有效 token
	at cn.dev33.satoken.exception.NotLoginException.newInstance(NotLoginException.java:134) ~[sa-token-core-1.37.0.jar:?]
	at cn.dev33.satoken.stp.StpLogic.getLoginId(StpLogic.java:941) ~[sa-token-core-1.37.0.jar:?]
	at cn.dev33.satoken.stp.StpLogic.getLoginIdAsInt(StpLogic.java:1052) ~[sa-token-core-1.37.0.jar:?]
	at cn.dev33.satoken.stp.StpUtil.getLoginIdAsInt(StpUtil.java:378) ~[sa-token-core-1.37.0.jar:?]
	at org.dinky.aop.LogAspect.handleCommonLogic(LogAspect.java:99) ~[dinky-admin-1.0.1.jar:?]
	at org.dinky.aop.LogAspect.doAfterThrowing(LogAspect.java:87) ~[dinky-admin-1.0.1.jar:?]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_342]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_342]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_342]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_342]
	at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:634) ~[spring-aop-5.3.27.jar:5.3.27]
	at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:617) ~[spring-aop-5.3.27.jar:5.3.27]
	at org.springframework.aop.aspectj.AspectJAfterThrowingAdvice.invoke(AspectJAfterThrowingAdvice.java:68) ~[spring-aop-5.3.27.jar:5.3.27]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175) ~[spring-aop-5.3.27.jar:5.3.27]
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) ~[spring-aop-5.3.27.jar:5.3.27]
	at org.springframework.aop.framework.adapter.AfterReturningAdviceInterceptor.invoke(AfterReturningAdviceInterceptor.java:57) ~[spring-aop-5.3.27.jar:5.3.27]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175) ~[spring-aop-5.3.27.jar:5.3.27]
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) ~[spring-aop-5.3.27.jar:5.3.27]
	at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97) ~[spring-aop-5.3.27.jar:5.3.27]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.3.27.jar:5.3.27]
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) ~[spring-aop-5.3.27.jar:5.3.27]
	at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708) ~[spring-aop-5.3.27.jar:5.3.27]
	at org.dinky.controller.TaskController$$EnhancerBySpringCGLIB$$89ffc7b5.submitTask(<generated>) ~[dinky-admin-1.0.1.jar:?]

@13535048320
Author

That error is probably because I started the job via the API (https://dinky.demo.com/api/task/submitTask?id=xxx) without a token. The job still started, though, so that's likely a separate issue. It shouldn't be the cause here, because the problem also occurs when I click Run on the page.
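For what it's worth, the NotLoginException can be avoided by attaching an auth token to the REST call. A minimal sketch, assuming the header is named `token` (Dinky uses sa-token, whose token name is configurable per deployment — check yours) and using a purely illustrative task id:

```python
# Sketch: build an authenticated submitTask request for Dinky's REST API.
# Assumptions: the auth header is named "token" (sa-token's header name is
# configurable per deployment) and task id 99 is purely illustrative.
def build_submit_request(base_url: str, task_id: int, token: str):
    """Return (url, headers) for a call to /api/task/submitTask."""
    url = f"{base_url.rstrip('/')}/api/task/submitTask?id={task_id}"
    headers = {"token": token}  # assumed sa-token header name
    return url, headers

url, headers = build_submit_request("https://dinky.demo.com", 99, "<your-token>")
print(url)  # https://dinky.demo.com/api/task/submitTask?id=99
```

The actual request would then be made with e.g. `requests.post(url, headers=headers)`; without a token attached, the sa-token filter raises the NotLoginException seen in the log above.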

@gaoyan1998
Contributor

gaoyan1998 commented Apr 11, 2024

Please post the console log from a successful first submission. Note: the console on the web page, not Dinky's backend log.

@13535048320
Author

Start Process:FlinkSubmit/99
Start Process Step:SUBMIT_PRECHECK
2024-04-11 14:02:21.668 INFO org.dinky.service.impl.TaskServiceImpl(177): Start check and config task, task:s4h-ods-pp-zpppigment-realtime
Process Step SUBMIT_PRECHECK exit with status:FINISHED
Start Process Step:SUBMIT_EXECUTE
Start Process Step:SUBMIT_BUILD_CONFIG
2024-04-11 14:02:21.672 INFO org.dinky.service.impl.TaskServiceImpl(286): Start initialize FlinkSQLEnv:
2024-04-11 14:02:21.677 INFO org.dinky.service.impl.TaskServiceImpl(306): Initializing data permissions...
2024-04-11 14:02:21.678 INFO org.dinky.service.impl.TaskServiceImpl(308): Finish initialize FlinkSQLEnv.
2024-04-11 14:02:21.682 INFO org.dinky.service.impl.TaskServiceImpl(236): Init remote cluster
Process Step SUBMIT_BUILD_CONFIG exit with status:FINISHED
2024-04-11 14:02:21.710 INFO org.dinky.service.task.FlinkSqlTask(67): Initializing Flink job config...
2024-04-11 14:02:21.765 INFO org.dinky.job.builder.JobUDFBuilder(115): A total of 0 UDF have been Init.
2024-04-11 14:02:21.766 INFO org.dinky.job.builder.JobUDFBuilder(116): Initializing Flink UDF...Finish
2024-04-11 14:02:21.766 INFO org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 14:02:21.786 INFO org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 14:02:21.787 INFO org.apache.flink.table.catalog.CatalogManager(281): Set the current default catalog as [my_catalog] and the current default database as [default_database].
2024-04-11 14:02:21.787 INFO org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 14:02:21.792 WARN org.apache.paimon.utils.HadoopUtils(125): Could not find Hadoop configuration via any of the supported methods
2024-04-11 14:02:21.855 WARN org.apache.paimon.utils.HadoopUtils(125): Could not find Hadoop configuration via any of the supported methods
2024-04-11 14:02:22.387 INFO org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 14:02:22.388 INFO org.apache.flink.table.catalog.CatalogManager(281): Set the current default catalog as [default_catalog] and the current default database as [default_database].
2024-04-11 14:02:22.389 INFO org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 14:02:22.390 INFO org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 14:02:22.693 WARN org.apache.flink.connector.kafka.source.KafkaSourceBuilder(481): Property auto.offset.reset is provided but will be overridden from earliest to earliest
2024-04-11 14:02:22.737 WARN org.apache.flink.connector.kafka.source.KafkaSourceBuilder(481): Property auto.offset.reset is provided but will be overridden from earliest to earliest
2024-04-11 14:02:22.770 INFO org.apache.flink.api.java.typeutils.TypeExtractor(2033): class org.apache.paimon.flink.FlinkRowWrapper does not contain a getter for field row
2024-04-11 14:02:22.770 INFO org.apache.flink.api.java.typeutils.TypeExtractor(2036): class org.apache.paimon.flink.FlinkRowWrapper does not contain a setter for field row
2024-04-11 14:02:22.770 INFO org.apache.flink.api.java.typeutils.TypeExtractor(2079): Class class org.apache.paimon.flink.FlinkRowWrapper cannot be used as a POJO type because not all fields are valid POJO fields, and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance and schema evolution.
2024-04-11 14:02:22.946 INFO org.apache.flink.client.program.rest.RestClusterClient(405): Submitting job 's4h-ods-pp-zpppigment-realtime' (a8fe31b2366fcdf75b80b669657a2340).
2024-04-11 14:02:23.186 INFO org.apache.flink.client.program.rest.RestClusterClient(424): Successfully submitted job 's4h-ods-pp-zpppigment-realtime' (a8fe31b2366fcdf75b80b669657a2340) to 'http://ods-jobmanager.flink:8081'.
2024-04-11 14:02:29.744 INFO org.dinky.service.impl.TaskServiceImpl(192): execute job finished,status is SUCCESS
Process Step SUBMIT_EXECUTE exit with status:FINISHED
2024-04-11 14:02:29.744 INFO org.dinky.service.impl.TaskServiceImpl(323): Job Submit success

@gaoyan1998
Contributor

gaoyan1998 commented Apr 11, 2024

Here is my log, from version 1.0.1:

Start Process:FlinkSubmit/1
Start Process Step:SUBMIT_PRECHECK
2024-04-11 16:43:02.897 INFO  org.dinky.service.impl.TaskServiceImpl(177): Start check and config task, task:k8s-demo
Process Step SUBMIT_PRECHECK exit with status:FINISHED
Start Process Step:SUBMIT_EXECUTE
Start Process Step:SUBMIT_BUILD_CONFIG
2024-04-11 16:43:02.901 INFO  org.dinky.service.impl.TaskServiceImpl(286): Start initialize FlinkSQLEnv:
2024-04-11 16:43:02.960 INFO  org.dinky.service.impl.TaskServiceImpl(306): Initializing data permissions...
2024-04-11 16:43:03.012 INFO  org.dinky.service.impl.TaskServiceImpl(308): Finish initialize FlinkSQLEnv.
2024-04-11 16:43:03.013 INFO  org.dinky.service.impl.TaskServiceImpl(236): Init remote cluster
Process Step SUBMIT_BUILD_CONFIG exit with status:FINISHED
2024-04-11 16:43:03.049 INFO  org.dinky.service.task.FlinkSqlTask(67): Initializing Flink job config...
2024-04-11 16:43:03.103 INFO  org.dinky.job.builder.JobUDFBuilder(115): A total of 0 UDF have been Init.
2024-04-11 16:43:03.103 INFO  org.dinky.job.builder.JobUDFBuilder(116): Initializing Flink UDF...Finish
2024-04-11 16:43:03.103 INFO  org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 16:43:03.105 INFO  org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 16:43:03.107 INFO  org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 16:43:03.109 INFO  org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 16:43:03.111 INFO  org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 16:43:03.125 INFO  org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 16:43:03.127 INFO  org.dinky.utils.KerberosUtil(58): Simple authentication mode
2024-04-11 16:43:03.150 WARN  org.apache.flink.configuration.Configuration(860): Config uses deprecated configuration key 'pipeline.operator-chaining' instead of proper key 'pipeline.operator-chaining.enabled'
2024-04-11 16:43:03.204 WARN  org.apache.flink.configuration.Configuration(860): Config uses deprecated configuration key 'pipeline.operator-chaining' instead of proper key 'pipeline.operator-chaining.enabled'
2024-04-11 16:43:03.223 WARN  org.apache.flink.configuration.Configuration(860): Config uses deprecated configuration key 'pipeline.operator-chaining' instead of proper key 'pipeline.operator-chaining.enabled'
2024-04-11 16:43:03.278 INFO  org.apache.flink.client.program.rest.RestClusterClient(410): Submitting job 'k8s-demo' (668979ffd5839d11eaf3516368f9ccac).
2024-04-11 16:43:03.317 INFO  org.apache.flink.client.program.rest.RestClusterClient(429): Successfully submitted job 'k8s-demo' (668979ffd5839d11eaf3516368f9ccac) to 'http://172.19.99.140:32679'.
2024-04-11 16:43:05.315 INFO  org.dinky.service.impl.TaskServiceImpl(192): execute job finished,status is SUCCESS
Process Step SUBMIT_EXECUTE exit with status:FINISHED
2024-04-11 16:43:05.315 INFO  org.dinky.service.impl.TaskServiceImpl(323): Job Submit success
Process FlinkSubmit/1 exit with status:FINISHED

I can confirm the cause is that the final log lines below are missing: the submit process doesn't exit normally. Is there anything else in your setup, such as the deployment environment or nginx?

Process Step SUBMIT_EXECUTE exit with status:FINISHED
2024-04-11 16:43:05.315 INFO  org.dinky.service.impl.TaskServiceImpl(323): Job Submit success
Process FlinkSubmit/1 exit with status:FINISHED

@13535048320
Author

Dinky is also deployed on k8s. The image was built from dinky-release-1.17-1.0.1.tar.gz downloaded from GitHub, using the docker/Dockerfile from GitHub as well. Traffic goes through a k8s nginx ingress, with nginx proxy_read_timeout set to 300 seconds.
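If the submit request runs longer than the proxy timeout when many jobs are queued, the ingress could cut the connection before Dinky writes the final `Process FlinkSubmit/... exit` log line. A hedged sketch of raising the relevant timeouts (the annotation names are the standard ingress-nginx ones; the 3600 s value is illustrative, not a setting confirmed in this thread):

```yaml
# Illustrative ingress-nginx annotations; the values and resource name
# are assumptions, not settings confirmed in this thread.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dinky
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
```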

@13535048320
Author


The failing runs are indeed missing that part of the log.

@gaoyan1998
Contributor

How about reinstalling Dinky? I've tried for a long time here and can't reproduce it.

@gaoyan1998
Contributor

Try deleting the /tmp folder under the Dinky directory.
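A minimal sketch of that cleanup, assuming Dinky is unpacked under /opt/dinky (the path is an assumption — point DINKY_HOME at your actual install directory, and restart Dinky afterwards):

```shell
#!/bin/sh
# Remove Dinky's local tmp cache. DINKY_HOME is an assumed install path;
# override it to match your actual Dinky directory before running.
DINKY_HOME="${DINKY_HOME:-/opt/dinky}"
rm -rf "${DINKY_HOME}/tmp"
echo "removed ${DINKY_HOME}/tmp"
```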

@gaoyan1998
Contributor

link #3410

Zzm0809 pushed a commit that referenced this issue Apr 18, 2024
@gaoyan1998 gaoyan1998 moved this from ToDo to Done in Dinky Roadmap May 6, 2024
Zzm0809 pushed a commit to Zzm0809/dinky that referenced this issue May 6, 2024

4 participants