ClickHouse/contrib/FastMemcpy/README.md

Internal implementation of the `memcpy` function.

It has the following advantages over the `libc`-supplied implementation:
- it is linked statically, so the function is called directly, not through the `PLT` (the procedure linkage table of a shared library);
- it is linked statically, so the function can use position-dependent code;
- your binaries will not depend on `glibc`'s `memcpy`, which forces a dependency on a specific symbol version such as `memcpy@@GLIBC_2.14` and consequently on a specific version of the `glibc` library;
- you can include `memcpy.h` directly, which gives the function a chance to be inlined; this is beneficial for small memory regions whose size is not known at compile time;
- this version of `memcpy` is supposed to be faster (in our benchmarks, the difference is within a few percent).
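
The inlining point can be illustrated with a minimal sketch. The helper below is hypothetical: the name `copy_small` and the 16-byte threshold are assumptions made for illustration and are not taken from `memcpy.h`. It only shows the general shape of a header-defined `static inline` routine that handles short copies itself and hands larger ones to the general routine.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical header-style helper; NOT the actual contents of memcpy.h.
 * The 16-byte threshold is an arbitrary value chosen for illustration. */
static inline void *copy_small(void *dst, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    if (size <= 16)
    {
        /* Short copies are handled right here; because the function is
         * defined in the header, the compiler can expand this at the
         * call site. */
        for (size_t i = 0; i < size; ++i)
            d[i] = s[i];
        return dst;
    }

    /* Larger copies are forwarded to the general routine. */
    return memcpy(dst, src, size);
}
```

Whether the compiler actually inlines such a function depends on the optimization level and on the call site, so any benefit should be checked with a benchmark.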

Currently it uses the implementation from **Linwei** (skywind3000@163.com).
Look at https://www.zhihu.com/question/35172305 for discussion.

Drawbacks:
- it only uses SSE2 and does not use wider (AVX, AVX-512) vector registers when available;
- it has no CPU dispatching and does not take the actual cache size into account.
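
To make the first drawback concrete, here is a simplified sketch of what an SSE2-only copy loop looks like. It is not Linwei's implementation: the function name `copy_sse2_sketch` is hypothetical, and the loop ignores alignment, prefetching and non-temporal stores. It moves 16 bytes per iteration through a 128-bit register, whereas an AVX variant would move 32 bytes and an AVX-512 variant 64. On targets where the compiler does not define `__SSE2__`, the sketch falls back to plain `memcpy`.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical name; a simplified sketch, not the real FastMemcpy code. */
static void *copy_sse2_sketch(void *dst, const void *src, size_t size)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

#if defined(__SSE2__)
    /* Move 16 bytes per iteration through one 128-bit XMM register,
     * using unaligned loads and stores. */
    while (size >= 16)
    {
        __m128i chunk = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, chunk);
        s += 16;
        d += 16;
        size -= 16;
    }
#endif

    /* Copy the remaining tail (or everything, if SSE2 is unavailable). */
    memcpy(d, s, size);
    return dst;
}
```

Because the vector width is fixed at compile time, the same code path runs on every CPU; choosing a wider variant at run time would require the CPU dispatching that this implementation lacks.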

Also worth looking at:
- simple implementation from Facebook: https://github.com/facebook/folly/blob/master/folly/memcpy.S
- implementation from Agner Fog: http://www.agner.org/optimize/
- glibc source code.