Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CPU and Memory Optimizations #56

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

yuzawa-san
Copy link

@yuzawa-san yuzawa-san commented Jul 23, 2024

I found the CPU and Memory in the decode hot path is very high, so I did some light refactoring to alleviate this.

  • Added a BitString and BitStringBuilder for efficient bitstring operations
  • Make substring operation zero-copy (BitString slices the underlying data)
  • BitString also removes the need to validate 0/1 using Pattern.
  • Optimize base64 decoding by assembling reverse dictionary lookup table from char to BitString (previously it was a HashMap and from Character (boxed) to Integer, which needed to be converted to a bit string)
  • Reduce number of substring operations. e.g. FixedIntegerEncoder.decode(bitString, fromIndex, length)
  • Made FixedBitfieldEncoder return BitString directly which does fulfill List<Boolean>. This is a lot smaller that the ArrayList<Boolean>
  • Use more StringBuilder
  • Presize things that we know the size or approximate sizes of
  • Use more constants. NOTE: I used Strings constants in the + usages since that is optimized (in JDK8) to a StringBuilder. if it used the char constant it appears to have to convert those chars to Strings each time. however in later JDK's this is not needed.
  • NOTE: the encode flow could technically use the BitString too, but i held off on that for now.
  • added an IntegerCache which is largest enough to contain all of the vendor ID's in the global vendor list. This cut down on a lot allocations.
  • Added JMH microbenchmarking
  • made constants final
  • do more presizing in segment initializeData
  • make GppModel more DRY and use switch statements instead of if/elses
  • contains containsKey + get calls to just get call with nullcheck. this avoids double Map reads.
  • make initializeSegments more memory efficient with Arrays.asList or Collections.singletonList
  • use CharSequence for zero copy string splits

microbenchmark results:

before
Benchmark                                  Mode  Cnt        Score      Error   Units
MyBenchmark.decodeGpp                     thrpt   25     3425.619 ±  103.495   ops/s
MyBenchmark.decodeGpp:gc.alloc.rate       thrpt   25     6099.516 ±  186.780  MB/sec
MyBenchmark.decodeGpp:gc.alloc.rate.norm  thrpt   25  1867059.215 ± 4110.174    B/op
MyBenchmark.decodeGpp:gc.count            thrpt   25     2632.000             counts
MyBenchmark.decodeGpp:gc.time             thrpt   25     2700.000                 ms

after
Benchmark                                  Mode  Cnt      Score     Error   Units
MyBenchmark.decodeGpp                     thrpt   25  19205.372 ± 485.226   ops/s
MyBenchmark.decodeGpp:gc.alloc.rate       thrpt   25   1037.076 ±  26.204  MB/sec
MyBenchmark.decodeGpp:gc.alloc.rate.norm  thrpt   25  56624.003 ±   0.001    B/op
MyBenchmark.decodeGpp:gc.count            thrpt   25    866.000            counts
MyBenchmark.decodeGpp:gc.time             thrpt   25    842.000                ms

seems to be around 6x faster than the last released version and uses only about 97% less memory.

ad-hoc benchmark code against https://github.com/InteractiveAdvertisingBureau/iabtcf-java (partially ported into JMH)

public class TcfBench {
	
	private static final String in = "CQCDewAQCDewAPoABABGA9EMAP-AAB4AAIAAKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-ClbVfxv_215b_l-9-n7ZHmNX_f4e-7EMQYXybPuBcy71vycF99hMzRN-ramCpkSALt2wSEDbRyY1E1QoWqIFa8w7GpNnKE7SifgZIhzEXtnWAheb5sLY_kCmeb6__d3edkf7f3a_t3c8s-VZ792vf_m9ViYnSufrR_7_20Tm_vyPvf-fv-L_Pzf6xNv3k9bf7Xr7e9_fvLb__f___f___-______9__gAAAAA.QKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-.IKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-";


	public static void main(String[] args) {
		while(1==1) {
		 TCString old = TCString.decode(in);
		old.getPubPurposesConsent();
		old.getPurposesConsent();
		old.getVendorConsent();
		old.getPurposesLITransparency();
		old.getVendorLegitimateInterest();
		old.getSpecialFeatureOptIns();
		old.getCmpId();
		old.getPublisherRestrictions();
		TcfEuV2 nu = new TcfEuV2(in);
		nu.getPublisherConsents();
		nu.getPurposeConsents();
		nu.getVendorConsents();
		nu.getPurposeLegitimateInterests();
		nu.getVendorLegitimateInterests();
		nu.getSpecialFeatureOptins();
		nu.getCmpId();
		nu.getPublisherRestrictions();
		nu.setFieldValue(TcfEuV2Field.CMP_ID, 14);
		nu.encode();
		
		}
	}

}

benchmarked using async-profiler asprof -d 20 -e cpu,alloc -f ~/Desktop/dump16.jfr TcfBench

Flame Charts:

memory before:
image

memory after (note how TCString parse was small teal sliver in the before graph, but the iab-gpp portion has shrank so much that the TCString parse is now a larger percentage of the icicle chart):
image

cpu before:
image

cpu after:
image

Future Ideas (not in PR). I'll likely open issues to discuss:

  • List<Integer> is still a little bulky in the charts above. I was thinking of maybe making a specialty class backed by int[] like https://github.com/InteractiveAdvertisingBureau/iabtcf-java/blob/master/iabtcf-decoder/src/main/java/com/iabtcf/utils/IntIterable.java but that would likely break API stability
  • Should the fields map have keys be enum's instead of strings? The EnumMap would be a lot more lightweight.
  • is there too much stuff which is public which should be not public?
  • should List be replaced with Set since I feel set contains is something worth optimizing (i.e. is this vendor(s) present in the set)
  • are the defensive copies in some getValue implementation necessary, could we achieve the same from returning read-only versions (Collections.unmodifiableList).

fixes #25

supersedes #45

@ChristopherWirt
Copy link

This looks great 👍

@yuzawa-san yuzawa-san force-pushed the cpu-memory-optimizations branch from 5e4b971 to 02f53c4 Compare July 25, 2024 16:34
@yuzawa-san
Copy link
Author

@iabmayank @chuff can you please take a look at this?

This was referenced Oct 15, 2024
@yuzawa-san yuzawa-san force-pushed the cpu-memory-optimizations branch from 8bac901 to c30373e Compare November 6, 2024 00:27
@yuzawa-san
Copy link
Author

@iabmayank @chuff i have rebased off of the most recent master

@lamrowena
Copy link
Collaborator

@yuzawa-san thank you for submitting this, we will be reviewing these in the working group

@chuff
Copy link
Contributor

chuff commented Dec 9, 2024

@yuzawa-san
This does look pretty great. Could you remove the benchmark code from the PR?

@yuzawa-san
Copy link
Author

@chuff removed

i would like to try adding it again in the future in a separate pr since i feel it is quite important that contributors have the ability to easily generate benchmarks. just curious, why did you favor removing it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

High CPU Consumption
5 participants