-
Notifications
You must be signed in to change notification settings - Fork 15
/
CHANGES
1072 lines (912 loc) · 41.7 KB
/
CHANGES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
Change logs have been moved to doc/notes*.rst
4.1 Unreleased
=================
Notable User Facing Changes
---------------------------
- Added support for Coarse-Grained buffer SVM on CUDA devices.
Notable Fixes
-------------
4.0 June 2023
=================
Notable User Facing Changes
---------------------------
- Support for Clang/LLVM 16.0 (except TCE drivers)
- Dropped support for Clang/LLVM versions 6.0 to 9.0 (inclusive)
- CMake version at least 3.12 is required
- Added support for program-scope variables in the CPU drivers
- Added support for generic address spaces in the CPU drivers
- Added basic support for cl_khr_subgroups for CPUs: A single
subgroup that always executes the whole X-dimension's WIs.
- Added initial (incomplete) support for
cl_intel_required_subgroup_size for CPUs
- AlmaIF's OpenASIP backend now supports a standalone mode.
It generates a standalone C program from a kernel launch, which
can then be compiled and executed with ttasim or RTL simulation.
- Added a user env POCL_BITCODE_FINALIZER that can be used to
call a custom script that manipulates the final bitcode before
passing it to the code generation.
- Pulled in alternative work-group function mode for non-SPMD from Open SYCL:
Continuation-based synchronization is slightly more general than POCL's
current kernel compiler, but allows for fewer hand-rolled optimizations.
CBS is expected to work for kernels that POCL's current kernel compiler
does not support. Currently, CBS can be manually enabled by setting
the environment variable `POCL_WORK_GROUP_METHOD=cbs`.
- New device driver, using the Level Zero API. It supports many features,
however there are still a few missing. For the full
documentation, see doc/sphinx/source/level0.rst.
- Support for cl_intel_unified_shared_memory (only with
the Level Zero driver & the CPU drivers).
- CPU devices now flush printf() output after each call instead of waiting
for the end of the kernel command. This makes it (again) useful for debugging
kernel segfaults.
- Added a new POCL_CPU_LOCAL_MEM_SIZE environment for overriding the
local memory size for the CPU devices.
- CUDA: Added Direct Peer-to-Peer buffer migrations for much better performance
scaling in multi-GPU scenarios
- CPU drivers gained partial support for cl_khr_fp16 when compiled with LLVM 16+.
The implementation relies on Clang, and may result in emulation if
the CPU does not support the required instruction set.
- Linux/x86-64 only: SIGFPE handler has been changed to skip instructions
causing division-by-zero, only if it occured in one of the CPU driver
threads; so division-by-zero errors are no longer hidden in user threads.
- POCL_CUDA_VERIFY_MODULE env variable has been replaced by POCL_LLVM_VERIFY
- CUDA: compilation now defaults to `-ffp-contract=fast`, previously it was `-ffp-contract=on`.
- `-cl-fast-relaxed-math` now defaults to `-ffp-contract=fast`, previously it was `-ffp-contract=on`.
- CPU drivers: renamed 'basic' to 'cpu-minimal' and 'pthread' driver to 'cpu',
to reflect the hardware they're driving instead of implementation details.
- POCL_MAX_PTHREAD_COUNT renamed to POCL_CPU_MAX_CU_COUNT; the old env. variable is deprecated
but still works
- it's now possible to disable libhwloc support even when it's present, using -DENABLE_HWLOC=0
CMake option
- Fixed issues with SYCL via DPC++ on PoCL on ARM.
Notable Fixes
-------------
- Fix handling of clBuildProgram() -I directories given with quotes or
with leading/trailing white space.
- Fix long kernel names trying to create too long dir/filenames. This
affected programs with long kernel names generated by template
specialization.
- CPU driver: Made kernel debug info more robust especially
when building the device program from SPIR-V.
Known issues
------------
- With LLVM 16, when running CTS with the offline compilation mode (= via SPIR-V), Clang + SPIR-V translator
produces invalid SPIR-V for several tests. PoCL bugreport:
https://github.com/pocl/pocl/issues/1232
Related Khronos issues:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/issues/2008
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/issues/2024
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/issues/2025
3.1 December 2022
=================
Notable User Facing Changes
---------------------------
- Support for Clang/LLVM 15.0
- Work-in-progress Vulkan-driver:
* implement clEnqRead/Copy/WriteRect + clEnqFillBuffer
* implement global offsets
* add support for external memory
* implement WG sizes larger than vulkan limits with global offsets
* fix module constants
* and other small fixes and improvements
- SPIR-V support: much improved support for both CPU driver and CUDA
* vectorized versions of kernel library functions now supported
* much improved support on ARM64 & x86-64 with recent LLVMs
* fixed some atomic builtin functions
* SPIR-V new subtests from OpenCL-CTS now pass (with sufficiently recent
llvm-spirv translator)
* most of the CTS passes with offline-compile=spirv option
* specialization constants are supported via clSetProgramSpecializationConstant
- improved Valgrind/Helgrind support
- All device names used to control the platform setup via
POCL_DEVICES are now lower case (as documented in the user manual).
- Switch from __sync_*() to __atomic_*() (compiler builtins).
- Link with -latomic (compiler library) in case that is needed for using
64-bit atomic operations, e.g. on mipsel.
- Basic support for cl_khr_command_buffer. Command parameters are validated
at recording time and the command data is stored. When the command buffer
is enqueued to a device, the recorded commands are submitted to the device
as if their respective immediate clEnqueue* functions had been called just
then, but without the overhead from validating and computing the arguments.
- Command buffer support is only reported for CPU devices for now. All drivers
technically have the same capability but only CPU has been tested so far.
- Built-in kernel support for CPU, CUDA and Accel-devices
* The .cl files in the driver directories are compiled to built-in kernels.
* Experimental CUDNN support for Cuda-driver through built-in kernels.
- Major update to the custom device driver (Accel -> Now renamed to AlmaIF)
* Support more device types implementing AlmaIF
+ Memory-mapped accelerator (for SoC FPGAs)
+ Experimental XRT backend (for PCIe FPGAs)
+ Instruction-set simulator (TTASim)
+ Emulation device on host CPU
- TTASim-driver will be deprecated as soon as all of its functionality
can be achieved with the almaif-driver. Most of it has already been
ported to almaif/TTASimDevice.cc.
* This move allows controlling TTASIM simulator and FPGA device
(and in future, RTL simulator) with the same runtime driver,
which reduces code duplication.
Notable Bug Fixes
-----------------
- Fixed loss of precision with CUDA event timers due to limited resolution
of the epoch event timer
3.0 June 2022
=============
Notable User Facing Changes
---------------------------
- OpenCL 3.0 support:
* OpenCL 3.0 is enabled at platform level, all API calls
up to 3.0 are implemented
* The CPU device reports OpenCL 3.0 if LLVM is recent enough
(LLVM 14+), otherwise it reports OpenCL 1.2
* The Conformance Test Suite was updated to a version that supports
OpenCL 3.0; the CTS should also work with devices that report 1.2
* For the list of supported features & more details,
see doc/sphinx/source/conformance.rst
- Support for Clang/LLVM 14.0
- Text tracing improved - a few new command types traced,
event dependencies are now also recorded
- There are now two scripts, which let the user parse & convert the
outputs of LTTNG and Text tracer into a JSON format, suitable
for loading into Chromium's trace visualizer (chrome://tracing)
- Support for generating specialized work-group functions and include them
in the PoCL kernel program binaries.
- SPIR-V input: printf fixed
Notable Bug Fixes
-----------------
- Fixed poclbin for non-SPMD targets. The generic size WG function generation
was completely broken and the breakage was hidden by the WG specialization
recompilation, if online compiler was available.
- Fixed a race condidion in clBuildProgram
- Fixed uninitialized variables and other issues in tests
- Fixed the post-event-finish-cleanup process in the drivers,
this should improve reliability
- Few more fixes for Android
- Fixed loss of precision in CUDA event times due to type conversion
Other
-----
- lib/kernel/libclc has received a number of new math library function
implementations. These are useful for some drivers which can't
use SLEEF and want to avoid libm calls or simple expression
implementations.
- TCE driver significantly improved, it now uses argbuffer launcher,
handles multithreaded user programs properly, uses the new libclc
math library functions, and has fixed some filesystem race conditions.
Should be able to run the TCE testsuite in parallel without random errors.
- A new proxy driver that lets PoCL proxy to another OpenCL implementation.
if a user creates a cl_context with multiple devices including one via
the proxy driver, PoCL will automatically manage buffer memory within
the context.
- A new experimental Vulkan driver, using clspv and Vulkan API to run
OpenCL code on Vulkan-enabled devices. To learn about the limitations,
please read the documentation at doc/sphinx/source/vulkan.rst.
- Handling whitespace chars in build options (lib/CL/pocl_build.c)
- Migrated from the deprecated cl2.hpp to opencl.hpp
1.8 October 2021
================
Notable User Facing Changes
---------------------------
- support for LLVM 13
- CMake: Inter-Procedural Optimization is enabled on code of runtime library
(libpocl.so is compiled with -flto on systems that support it).
- LTTng tracing improved - more command types are traced, and also
some synchronous API calls (like clCreateBuffer) are traced.
- poclcc, tests and examples can be disabled with CMake options
- Valgrind support improved by making Valgrind aware of pocl's
reference counting of cl_* objects
- kernels which are called by kernels are now force-inlined
- Support for NetBSD.
- Support for Unix systems without libdl.
- PoCL can now (optionally) respond to SIGUSR2 by printing
some live debug information.
- improved SPIR support for CUDA devices
Notable Bug Fixes
-----------------
- Fixed a potential crash on Unix systems without sysfs mounted.
- Fixed compilation errors when building on macOS.
- Fixed POCL_FAST_INIT macro; POCL_INIT_LOCK must be invoked with only one argument.
- Fix bin/poclcc to not depend on OpenCL 2.0 symbols
- Fixed miscompilation in kernel loops with multiple conditionals with barriers in them.
Other
-----
- Add cmake options PARALLEL_COMPILE_JOBS, PARALLEL_LINK_JOBS to
use ninja's seperate compile and link job pools.
- Improve memory architecture, buffer migration and allocation.
Buffers are now allocated on a device when first used
(previously each buffer was allocated on every device in context).
- the single global LLVMContext was replaced with
multiple LLVMContexts, one per OpenCL cl_context.
OpenCL code can now be compiled in parallel
when using separate cl_contexts. This feature
is disabled by default since it significantly slowed
down PyOpenCL. This should be resolved by separating
LLVM compilation in their own threads in the future.
- a new OpenCL extension was added to PoCL: cl_pocl_content_size.
The extension allows the user to give optimization hint to PoCL,
which will be used internally by PoCL to optimize buffer transfers
between multiple devices.
1.7 May 2021
============
Notable User Facing Changes
---------------------------
- Support for LLVM 12.
- support for cross-compiling PoCL
- Added support for the cl_nv_device_attribute_query extension on CUDA devices.
- improved support for SPIR-V binaries when using CPU device:
- improved local variables support
- OpenCL 2.0 atomics are now supported
- work_group_barrier, to_local/to_global are implemented
- Implemented OpenCL 3.0 features
- clGetDeviceInfo queries
- CL_DEVICE_ATOMIC_MEMORY_CAPABILITIES (Minimal implementation)
- CL_DEVICE_ATOMIC_FENCE_CAPABILITIES (Minimal implementation)
Notable Bug Fixes
-----------------
- Added support for arm CUDA platforms without cuMemHostRegister
support.
1.6 December 2020
=================
Notable User Facing Changes
---------------------------
- Support for LLVM 11.
- CUDA kernels using constant __local blocks are now ABI incompatible with
previous release. Users need to delete their pocl cache.
- SINGLE_LLVM_LIB CMake option removed. Instead reintroduce STATIC_LLVM and
PoCL now relies on llvm-config to provide correct shared/static libraries
for linkage.
- improved debugging of OpenCL code with CPU driver.
See doc/sphinx/source/debug.rst
Optimizations
-------------
- Improved the PTX code generation for __local blocks. Previously constant
__local blocks and __local arguments were using one dynamic shared CUDA
memory block with offsets computed at runtime. Now if there is no __local
arguments, separate static shared CUDA memory is used. If there are
__local arguments, the constant __local blocks are indexed with compile
time constants. This improves the performance due to better SASS code
generation because it avoids what appears to be a pointer aliasing issue.
Running SHOC benchmark GEMM with size class 4 on a NVIDIA Titan X gives
the following performance improvements.
sgemm_n: 23.2%
sgemm_t: 18.5%
sgemm_n_pcie: 23.3%
sgemm_t_pcie: 19.5%
dgemm_n: 51.4%
dgemm_t: 2.8%
dgemm_n_pcie: 51.6%
dgemm_t_pcie: 6.9%
- Improved handling of command queue barriers: Previously an internal
event was added from all previous commands to it, even with in-order
queues, causing slowdown with applications that have a lot of commands.
Now additional events are omitted in in-order queues. Measured with the
PolyBench OpenCL Gramschmidt kernel, execution time went down from 44
seconds to around 0.5 seconds.
Notable Bug Fixes
-----------------
- Fix LLVM loop vectorizing remarks printing (POCL_VECTORIZER_REMARKS=1).
- Fix an issue in which the loop vectorizer produced code with invalid
memory reads (issue #757).
- Fix compilation error when CMake option SINGLE_LLVM_LIB is set to OFF.
- Fix wrongly output dlerror (Undefined symbol) after dlopen,
caused by a previous libdl call in an ICD loader (issue #877).
- [CPU] safety margin of pocl's CPU driver local memory allocation
has been reduced to a much more reasonable value
- [CPU] buffer size for OpenCL printf is now configurable with
PRINTF_BUFFER_SIZE CMake variable
- [CPU] local memory size reported is now the size of last level of non-shared
data cache (usually L1 or L2 depending on CPU), if hwloc can determine it.
Security
--------
- Added a build parameter HARDENING_ENABLE that applies hardening flags
present in some modern compilers to produce a more secure libpocl.so
with the trade-off in performance.
1.5 April 2020
==============
Notable User Facing Changes
---------------------------
- Support for LLVM 10.
- POCL_TRACE_EVENT, POCL_TRACE_EVENT_OPT and POCL_TRACE_EVENT_FILTER
environment variables were renamed to
POCL_TRACING, POCL_TRACING_OPT and POCL_TRACING_FILTER, respectively.
- Refactored the implementation of convert_T() OpenCL functions to better meet
autovectorization criteria under LLVM, thus utilizing device's SIMD ISA
capabilities where available; e.g. on an ARM64 Cortex-A72 convert_int8(short8)
is 5.5x faster now when measured in a tight loop.
- A lot of fixes.
Usability
---------
- A simple per-kernel execution time statistics atexit() for quick and easy
low-impact per-device profiling purposes (relies on event time stamps purely).
It can be enabled by setting POCL_TRACING env to 'cq'.
1.4 September 2019
==================
Highlights
----------
- pocl-accel: An example driver and support infrastructure for OpenCL 1.2
CL_DEVICE_TYPE_CUSTOM hardware accelerators which implement a memory
mapped control interface.
- Improved SPIR and SPIR-V support. clCreateProgramWithIL() implemented,
Kernel library (for CPU target) support for SPIR-mangling improved
Kernel Compiler
---------------
- Specialize work-group functions for global offset (0,0,0).
- A pocl installation with clang, hwloc statically linked in
is now relocatable.
- Clang/LLVM versions older than 6.0 are no longer supported.
- Create specialized work-group functions for small (defined by a device
driver specific limit) grid dimensions.
- Add Range Metadata to various ID queries etc. to improve vectorizing
index computation to smaller lane widths and other optimizations.
- Passes only the launched kernel to work-group generation and code gen, thus
speeding up the compilation process.
Misc.
-----
- hsa-native: Downgraded the advertised version to 1.2
which is closer to the truth (fixes OCLTest of Glow).
- hsa-native: Add support for byval (struct) argument passing.
- hsa-native: Allow offsets in block copy.
Notable Internal Changes
------------------------
- Allow devices to utilize the ROCm-Device-Libs ocml
builtins for their builtin libraries if seen fit.
https://github.com/RadeonOpenCompute/ROCm-Device-Libs/tree/master/ocml
was mirrored in lib/kernel and made it easy to cherry pick
implementations to targets' kernel libary.
- libltdl is replaced with libdl on UNIX platforms.
Notable Bug Fixes
-----------------
- Fix a race condition in device initialization, which
caused issues in applications that cause reinitialization
of pocl device drivers (appeared in Glow's OCLTest).
Device Driver Specific
----------------------
- hsa-native: Downgraded the advertised version to 1.2
which is closer to the truth (fixes OCLTest of Glow).
- hsa-native: Add support for byval (struct) argument passing.
- hsa-native: Allow offsets in block copy.
1.3 April 2019
==============
Highlights
----------
- Support for Clang/LLVM 8.0.
- Support ICD on OSX.
Misc.
-----
- Ability to have size_t (basically derived from the largest supported
object) smaller than CL_ADDRESS_BITS. This is an unofficial optional
extension as the OpenCL standard mandates it to be the same.
- POCL_EXTRA_BUILD_FLAGS can be used to force add extra build flags such as
'-g' to all clBuildProgram() calls.
- Allow building pocl without CPU backend drivers. When set to off,
CPU will not appear in the list of OpenCL devices reported by pocl.
Controllable via ENABLE_HOST_CPU_DEVICES=off cmake option.
- Build logs are now produced also for illegal options passed to the kernel
build e.g. via the options parameter of clBuildProgram().
- hsa-native: Device side printf-support and alternative < 1.2 non-standard
C99 printf exposing support.
- pocl's binary format has been slightly updated (changes are listed in
the top of pocl_binary.c file) to version 7, but pocl can still read
also the previous version 6 format.
- Allow local-size-specializing also SPMD-targeted kernels to enable compile
time optimization of code depending on the local dimensions.
- Support older GLIBC versions.
- HSA: Initial experimental support for native-ISA compilation on top of HSA
runtime. Tested and works currently only on phsa-runtime. Can be enabled
with ENABLE_HSAIL=off cmake option.
- Add option to disable installing of OpenCL headers.
Notable Bug Fixes
-----------------
- Fixed kernel debug symbol generation.
- HSA: fix kernel caching.
- Fix issue #661: clCreateImage doesn't fail with unsupported image type.
- Fix issue #668: handle non-kernel functions with barriers properly.
- Fix issue #671: Unable to build pocl with CUDA support with LLVM 7 and host GCC 8.2.
- Fix image format/size handling with multiple devices in context.
- Fix padding issue with context arrays that manifested as unaligned
access errors after autovectorization.
Notable Internal Changes
------------------------
- Add group ids as hidden kernel arguments instead of digging them up
from the context struct.
- Ability to generate the final binary via separate assembly text + assembler
call. Useful for supporting LLVM targets without direct binary emission
support.
- Use Clang's Driver API for launching the final linkage step. This way we
utilize the toolchain registry with correct linkage steps required for
the target at hand.
- Add 'device_aux_functions' to the driver layer attributes. This can be used
to retain device-specific functions required by the target across the
pruning of unused globals.
- The "default kernels" hack which was used to store kernel metadata, has
been removed. Kernel metadata are now stored only once, in cl_program
struct; every new cl_kernel structs holds only a pointer.
- Major 'pthread' CPU driver cleanup.
- Major Workgroup.cc cleanup.
1.2 September 2018
==================
- LLVM 7.0 is now supported.
- Version 2.0 of hwloc library is supported.
- device-side printf; more consistent printf output.
1.1 March 2018
==============
Highlights
----------
- LLVM 6.0 is now supported.
- Reintroduced experimental SPIR LLVM bitcode support to pocl.
Requires LLVM 5 or newer. New experimental feature: SPIR-V support;
requires a working llvm-spirv converter. Currently only loading
of SPIR-V binaries by pocl is supported, not output.
See docs/features.rst for more details.
- Refactored pocl cache now does away with LLVM file locks and relies
entirely on system calls for proper synchronization. Additionally,
cache file writes are now fdatasync()ed.
- Improved kernel compilation time (with cold cache). Improvement
depends on sources - it's bigger for large programs with many kernels.
Luxmark now compiles in seconds instead of dozens of seconds;
internal pocl tests run in 30-50% less time.
- LLVM Scalarizer pass is now only called for SPMD devices. Performance
change varies across tests, but positive seems to outweigh negative.
- Implemented uninitialization callback for device drivers. This is
triggered when the last cl_context is released. Currently only the
CPU driver implements the callback.
- Removed libpoclu from installed files; this library contains helpers
for pocl's internal tests, and from installed files was only used by
poclcc, which has been updated to not rely on it.
- POCL_MAX_WORK_GROUP_SIZE is now respected by all devices. This variable
limits the reported maximum WG sizes & dimensions; tuning max WG size
may improve performance due to cache locality improvement.
- CL_PLATFORM_VERSION now contains much more information about how
pocl was built.
- For users still building with Vecmathlib, performance should be back
to levels of pocl 0.14 (there was a huge drop caused by a change
in -O0 optimization level of LLVM 5.0).
- Improved support for ARM and ARM64 architectures. All internal tests
now pass (on Cortex-A53 and Cortex-A15), although it's still far
from full conformance.
1.0 December 2017
=================
Highlights
----------
- Improved automatic local work-group sizing on kernel enqueue, taking
into account standard constraints, SIMD width for vectorization as
well as the number of compute units available on the device.
- Support for NVIDIA GPUs via a new CUDA backend (currently experimental).
- Removed support for BBVectorizer.
- LLVM 5.0 is now supported.
- A few build options have been added for distribution builds,
see README.packaging.
- Somewhat improved scalability in the CPU driver. CPUs with many cores
and programs using a lot of WIs with small kernels can run somewhat faster.
- The OpenCL 1.2 conformance tests now pass with selected CPUs. There are some
caveats though - see the documentation.
- When conformance is enabled, some kernel library functions might be
slower than in previous releases.
- Pocl now reports OpenCL 1.2 instead of 2.0, except HSA enabled builds.
- Updated format of pocl binaries, which is NOT backwards compatible.
You'll need to clean any kernel caches.
- Fixed several memory leaks.
- Unresolved symbols (missing/misspelled functions etc) in a kernel will
result in error in clBuildProgram() instead of pocl silently ignoring
them and then aborting at dlopen().
- New env variable POCL_MEMORY_LIMIT=<num> limits the Global memory size
reported by pocl to <num> gigabytes.
- New env variable POCL_AFFINITY (defaults to 0): if enabled, sets
the affinity of each CPU driver pthread to a single core.
- Improved AVX512 support (with LLVM 5.0). Note that even with LLVM 5.0
there are still a few bugs (see pocl issue #555); AVX512 + LLVM 4.0 are
a lot more broken, and probably not worth trying.
- POCL_DEBUG env var has been revamped. You can now limit debuginfo to
these categories (or their combination): all,error,warning,general
memory,llvm,events,cache,locking,refcounts,timing,hsa,tce,cuda
The old setting POCL_DEBUG=1 now equals error+warning+general.
0.14 April 2017
===============
Highlights
----------
- Support for LLVM/Clang versions 3.9 and 4.0. Version 3.9 was the first
release to include all frontend features for OpenCL 2.0.
- Ability to build pocl in a mode where online compilation is not
supported to run in hosts without LLVM and binaries compiled offline
e.g. using poclcc.
- pocl's binary format now can contain all the necessary bits to
execute the programs on a host without online compiler support.
- Initial support for out-of-order execution execution of command queues.
- It's now possible to cross-compile pocl when building an offline
compiler build.
- New driver api extension to support out-of-order and asynchronous
devices/drivers.
- Pthread and HSA drivers are now fully asynchronous.
- CMake now the only supported build system, autotools removed.
- LTTng tracing support
OpenCL Runtime/Platform API support
-----------------------------------
- implemented clEnqueueBarrierWithWaitList
- implemented clEnqueueMigrateMemObjects
Other
-----
- Support for reqd_work_group_size attribute in the binary format and poclcc:
Generates a static sized work-group function to help optimizations
such as autovectorization.
- HSA: added support for phsa (https://github.com/HSAFoundation/phsa)
- A lot of bug and memory leak fixes. Some notable ones:
- Issue #1, passing aggregates as kernel value parameters, can be
now fixed with an LLVM patch.
- Now it's possible to build pocl without using the fake address
space ids, which were a source of many annoying issues.
0.13 April 2016
===============
Highlights
-----------
- Support for LLVM/Clang 3.8
- initial (partial) OpenCL 2.0 support
(only Shared Virtual Memory and Atomics are supported ATM)
- CMake build system almost on parity with autotools
(TCE, all external testsuites)
- CMake build is now able to build multiple kernel libraries
for different CPUs and let pocl select a suitable one at runtime
Bugfixes
---------
- clEnqueueCopyImage() now works properly
- improved file locking (much less disk access to kernel cache)
- Address spaces of structs are handled properly
Other
------
- removed custom buffer alloc from pthread device
- removed IBM Cell support
- removed support for older LLVM versions (before 3.7)
- significantly higher performance with a lot of small kernel enqueues
(due to improved file locking)
- vecmathlib now supports AVX2
- a few more HSA kernel library implementations: l/tgamma, erf(c), hypot
- implemented OpenCL 2.0 API calls: clEnqueueSVM*, clSVMalloc/free,
clEnqueueFillBuffer, clSetKernelExecInfo, clSetKernelArgSVMPointer,
clCreateCommandQueueWithProperties - no device side queues yet
- OpenCL 2.0 atomics (C11 atomics subset) for x86-64 and HSA
- new testsuites: AMD SDK 3.0, Intel SVM
- New CMake-only testsuites: ASL, clBLAS, clFFT, arrayfire
- more debugging info (timing, mem stats)
- ansi colors with POCL_DEBUG=1 if the output is a terminal
0.12 October 2015
===============
Highlights
----------
- Support for HSA-compliant devices (kernel agents). The GPU of AMD Kaveri
now works through pocl with a bunch of test cases in the AMD SDK 2.9 example
suite.
- New and improved kernel cache system that enables caching
kernels with #includes.
- Support for LLVM/Clang 3.7.
- Little endian MIPS32 now passes almost all pocl testsuite tests.
OpenCL Runtime/Platform API support
-----------------------------------
- Transferred buffer read/write/copy offset calculation to device driver side.
- these driver api functions have changed; got offset as a new argument.
- Maximum allocation is not limited to 1/4th of total memory size.
- Maximum image dimensions grow to fit maximum allocation.
- clGetDeviceInfo() reports better information about CPU vendor and cache.
- experimental clCreateSubDevices() for pthread CPU device.
OpenCL C Builtin Function Implementations
-----------------------------------------
- Implemented get_image_dim().
Bugfixes
--------
- Avoid infinite loops when users recycle an event waiting list.
- Correctly report the base address alignment.
- Lots of others.
Misc
----
- Tests now using new cl2.hpp, removing dependency on OpenGL headers
0.11 March 2015
===============
Highlights
----------
- Support for LLVM/Clang 3.6
- Kernel compiler cache.
- Android support.
Kernel compiler
---------------
- Do not add implicit barriers to kernels without WG barriers
to avoid WI context data overheads.
- Setting the POCL_VECTORIZER_REMARKS env to 1 prints out LLVM vectorizer
remarks during kernel compilation.
- Implicit work-group vectorizer improvements.
- POCL_VECTORIZER_REMARKS: When set to 1, prints out remarks produced by
the loop vectorizer of LLVM during kernel compilation.
OpenCL Runtime/Platform API support
-----------------------------------
- Minimal initial implementation for clCreateSubDevices()
Bugfixes
--------
- Fix falsely detecting operations with side-effects (especially atomic
operations) as uniform. This caused deadlock/race situations due to
illegal implicit barrier injection.
- Fix several reference counting issues.
- Memory leak fixes.
- ARM/openSUSE build fixes.
- Plenty of CMake fixes.
New test/example cases
----------------------
- Several Halide examples using its OpenCL backend added.
- CloverLeaf
Misc.
-----
- The old BBVectorizer forked WIVectorizer removed due to bit rot and
the general hackiness of it.
- Experimental Windows/Visual Studio support (in progress).
- Initial support for MIPS architecture (with known issues).
- Runtime debug printouts that can be enabled via POCL_DEBUG=1.
- Streamlined the buffer allocation and fixed several issues with it.
0.10 September 2014
===================
This lists only the most interesting changes. Please
refer to the version control log for a full listing.
Highlights
----------
- Support for LLVM/Clang 3.5
- Support for building using CMake (experimental with known issues).
Bugfixes
--------
- TCE: kernel building was broken when running pocl
from install location
- thread-safety (as required since OpenCL 1.1) improved
Kernel compiler
---------------
- Final code generation now done via LLVM API calls instead of
calling the llc binary.
- Sensible linking of functions from the monolithic kernel built-in
library. Major compilation speedup for smaller kernels.
OpenCL C Builtin Function Implementations
-----------------------------------------
- Improved support for halfN functions.
- ilogb and ldexp available with vecmathlib
OpenCL Runtime/Platform API support
-----------------------------------
- Implement clCreateKernelsInProgram()
- OpenCL-C shuffle() and shuffle2() implementation added
- Device probing modified to allow for device driver to detect device during
runtime. POCL_DEVICES still supported.
- Checks in clSetKernelArgs() for argument validity
- Checks in clEnqueueNDRange() for arguments to be all set
- Implement clGetKernelArgInfo()
- clEnqueueCopyImage()
Misc
----
- ViennaCL testsuite updated to 1.5.1
0.9 January 2014
================
This lists only the most interesting changes. Please
refer to the version control log for a full listing.
Highlights
----------
- Major improvements to the kernel compiler's vectorization
performance. Twofold speedups in some benchmarks
- Support for most of the piglit CL tests
OpenCL Runtime/Platform API support
-----------------------------------
- clCreateImage2D() and clCreateImage3D() implementation moved to
clCreateImage()
- Image creation now uses clCreateBuffer()
- clBuildProgram: Propagate the supported -cl* compiler options to Clang's
OpenCL frontend.
- clFinish: works with commands with event wait lists.
- Preliminary support for OpenCL 2.0 blocks
- Added support for clEnqueueNativeKernel()
Builtin Function Implementations (OpenCL 1.2 Section 6.12)
----------------------------------------------------------
- Refactored read/write_image()-functions to support refactored device image
object. (Only functions used by SimpleImage test)
- Introduced new macro based implementation for read/write_image()-functions
- Added sampler implementation for CLK_ADDRESS_CLAMP and
CLK_ADDRESS_CLAMP_TO_EDGE (Only integer coords supported)
- Most of the printf() format strings now works. Missing features:
- long on 32-bit architectures
Performance Improvements
------------------------
- Kernel compiler now tries to avoid replicating uniform variables,
this leads to less context data to be saved per work-item and cleaner
kernel bitcode for later optimizations
- Use a precompiled header for OpenCL C builtin declarations to speed up
the kernel compilation
- Kernel compiler vectorization optimizations:
- Inject implicit barriers both to loop starts and ends to
horizontally vectorize the inner loop.
- Reduce "peeling" by minimizing the conditional barrier region
by injecting implicit barrier close to the branch points for
conditional barrier cases.
- Breaking of vector datatypes for more efficient loop
vectorization.
- Support LLVM 3.4 parallel loop metadata.
Misc
----
- Explicitly specify the target architecture/CPU for the
kernel complier.
- Kernel compiler frontend defaults to implementation using LLVM API
directly instead of the scripts.
- __OPENCL_VERSION__ defined to 120
- poclu: helpers for converting between the C float and OpenCL cl_half
types
- clEnqueueNativeKernel implemented
- Static and cmake-builds of LLVM can now be used.
Bugfixes
--------
- Correct isequal, isnan, and similar routines
0.8 August 2013
================
This lists only the most interesting changes. Please
refer to the version control log for a full listing.
Overall
-------
- Added support for LLVM/Clang 3.3.
- Dropped support for LLVM/Clang v3.1.
- Removed the depedency on llvm-ld (which was copied to
pocl-llvm-ld to pocl tree). Now uses llvm-link instead.
- Project renamed to Portable Computing Language (pocl).
- Luxmark v2.0 now works.
- x86_64 can now use efficient math built-in function
implementations from the vecmathlib project to avoid libm
calls and to exploit the SIMD instructions more efficiently
in case of vector datatypes in the kernel.
- Parallelize kernel inner loops "horizontally", if possible.
This converts possibly sequential inner kernel loops to parallel
loops by effectively performing "loop interchange" of the
work-item loop and the kernel's inner loop.
- Added VexCL tests to the test suite. All but one of them
work with pocl.
Major bugfixes
--------------
- Fixed passing NULL as a buffer argument to clSetKernelArg
(this time with a regression test added).
- Constant BitCast expressions broken to variables to avoid
crashing when copying a kernel with casts on automatic
local pointers.
- Fixes for i386/i686. Tested on Pentium4/Ubuntu 10.04 LTS.
- Lots of API error checking added (found by the Piglit testing suite).
- Fixed bug in select producing incorrect results when the third
conditional argument is an unsigned scalar or vector.
- Replaced deprecated SSE 4.1 assembly mneunomics in x86-64 min/max
kernel functions that have since been removed in more recent
versions of gas and llvm-as.
- SPIR/LLVM IR 'byval' attributes are now handled correctly on
kernel function arguments, allowing for structs and oversized
vectors to be passed in with value semantics.
- Fixed to work with the latest Khronos OpenCL headers for 1.2.
Some issues fixed with the new cl.hpp.
- The ICD dispatch table was too small which might have caused
"interesting" behavior when calling the later functions in
the table and not using ocl-icd as the dispatcher.
- Several kernel compiler bugs fixed.
- A multithreaded host application could free the same object
multiple times due to a race issue.
Platform Layer implementations (OpenCL 1.2 Chapter 4)
-----------------------------------------------------
- Return correctly formatted CL_DEVICE_VERSION and
CL_DEVICE_OPENCL_C_VERSION.
- clGetDeviceInfo: Use the 'cpufreq' sys interface of Linux for
querying the CPU clock frequency, if available.
The OpenCL Runtime (OpenCL 1.2 Chapter 5)
-----------------------------------------
- clGetEventInfo: Querying the command type, command queue,
and the reference count of the event.
Builtin Function Implementations (OpenCL 1.2 Section 6.12)
----------------------------------------------------------
- convert_type* builtins now generated with a Python script by
Victor Oliveira.
- length() fingerprint was assuming two arguments instead of one.
- The kernel bitcode library is now optimized when built in pocl. Speeds
up kernel optimization for cases which use the kernel functions
a lot.
- Fix mul_hi() implementation
ICD
---
- Fixed pocl tests to work when executed through the Khronos
supplied icd loader (needs a patch applied to the loader be able to
override the .icd search path).
Misc.
-----
- Fix to the helper script search logic:
Search from the BUILDDIR only if env POCL_BUILDING is defined.
Otherwise search from PKGDATADIR first, then from the PATH.
- Fixed memory leaks in clCreateContext* and clCreateKernel
- Ensured that stored arguments are adequately aligned in
clSetKernelArg and clEnqueueNDRangeKernel.
0.7 January 2013
=================
This lists only the most interesting changes. Please
refer to the version control log for a full listing.
Overall
-------
- Support for LLVM 3.2.
- Multi-WI work group functions can be now generated
using loops which are only partially unrolled. Reduces
code size explosion with large WGs in comparison to
the full replication method.
- PowerPC 64 support (tested on Cell/Debian Sid/PS3).
- PowerPC 32 support (tested on Cell/Debian Sid/PS3).
- ARM v7 support (on Linux)
- Beginning of Cell SPU support (very experimental!).
- Most of the AMD APP SDK OpenCL examples now work and have been
added to the pocl test suite.
- Most of the Parboil benchmark cases added to the test
suite.
Kernel Compiler Passes
----------------------
- Several miscompilations and compiler crashes fixed.
- Multiple bugs fixed from the work group vectorizer.
- Updated metadata format pocl uses to pass information
to vectorization and TCE backend to simplify debuging.
- Kernel pointer arguments are not always marked 'noalias' (restricted).
Doing this previously was a specs misunderstanding.
- ConstantGEPs to static variables generated from automated
locals caused problems. Now converting them to normal GEPs
using a pass from the SAFECode project.
OpenCL Platform Layer implementations (OpenCL 1.2 Chapter 4)
-------------------------------------------------------
- clGetDeviceInfo now uses the hwloc lib for device property
queries. Many new queries implemented.
- clGetKernelInfo (initial implementation)
- clGetMemObjectInfo (initial implementation)
- clGetCommandQueueInfo (initial implementation)
- clReleaseDevice
- clRetainDevice
- Proper freeing of devices in clReleaseContext
The OpenCL Runtime Implementations (OpenCL 1.2 Chapter 5)
---------------------------------------------------------
- clBuildProgram: support for passing options to the compiler.
- clEnqueueMarker
OpenCL C Builtin Function Implementations (OpenCL 1.2 Section 6.12)
-------------------------------------------------------------------
- Atomic Functions (6.12.11)