Yuriy Nazarov
Love machine learning
When I searched for how to estimate GPU performance, I found this answer on Stack Overflow, which contains the following code:
```python
import os
import sys
import tensorflow as tf
import time

n = 8192
dtype = tf.float32
with tf.device("/gpu"):
    matrix1 = tf.Variable(tf.ones((n, n), dtype=dtype))
    matrix2 = tf.Variable(tf.ones((n, n), dtype=dtype))
    product = tf.matmul(matrix1, matrix2)

# avoid optimizing away redundant nodes
config = tf.ConfigProto(graph_options=tf.GraphOptions(
    optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)))
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

iters = 10

# pre-warming
sess.run(product.op)

start = time.time()
for i in range(iters):
    sess.run(product.op)
end = time.time()

ops = n**3 + (n-1)*n**2  # n^2*(n-1) additions, n^3 multiplications
elapsed = (end - start)
rate = iters*ops/elapsed/10**9
print('\n %d x %d matmul took: %.2f sec, %.2f G ops/sec' % (n, n, elapsed/iters, rate,))
```
After Nvidia released a bunch of new-generation GPUs, I wanted to compare their performance.
To measure fp16 performance, dtype was changed to tf.float16.
To benchmark matrix multiplication in TensorFlow 2, compatibility mode was used. It can be enabled by replacing
```python
import tensorflow as tf
```
with
```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
```
The final results for TensorFlow 2.4.0 are in the table:
| GPU | fp32 performance | fp16 performance |
|-----------|---------------------|----------------------|
| RTX 2080 | 10877.23 G ops/sec | 42471.64 G ops/sec |
| V100 | 14743.50 G ops/sec | 89348.57 G ops/sec |
| RTX 3090 | 35958.73 G ops/sec | 69669.73 G ops/sec |
| A100 | 79158.13 G ops/sec | 232681.81 G ops/sec |
| RTX 4090 | 80802.89 G ops/sec | 162852.21 G ops/sec |
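Dividing the two columns gives a rough picture of how much the half-precision path buys on each architecture. A small sketch, using only the numbers from the table above:

```python
# fp16-to-fp32 throughput ratios computed from the benchmark table
results = {
    'RTX 2080': (10877.23, 42471.64),
    'V100':     (14743.50, 89348.57),
    'RTX 3090': (35958.73, 69669.73),
    'A100':     (79158.13, 232681.81),
    'RTX 4090': (80802.89, 162852.21),
}
for gpu, (fp32, fp16) in results.items():
    print('%-8s fp16/fp32 = %.1fx' % (gpu, fp16 / fp32))
```

The ratios range from about 2x on the consumer Ampere/Ada cards (RTX 3090, RTX 4090) to about 6x on the V100.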
I actively use CUDA-capable cards from NVIDIA for machine learning applications.
When the GPU is under heavy load from ML tasks, it's almost impossible to do anything in the graphical interface because the screen refreshes too slowly.
So I switched to the integrated video controller. That didn't solve the low refresh rate right away, because the X server was still using the NVIDIA card.
After some trial and error I found a quick and dirty solution: disable the graphics card driver during boot to prevent the X server from using the card.
To accomplish this you need to blacklist the driver by adding the following line:
```
install nvidia /bin/false
```
to /etc/modprobe.d/blacklist.conf
And don't forget to update the initramfs:
```shell
sudo update-initramfs -u
```
After the system boots you can load the driver manually with the following command:
```shell
sudo insmod /lib/modules/`uname -r`/updates/dkms/nvidia.ko
```
There is a bunch of stuff in my closet, so building some hardware device requires only time and the appropriate mood. And it's summer and the weather is rather hot, so a thermometer is not a surprising thing to build.
The DS1820 is a great thermometer and I started with it, but I have a spare HDPM01 barometer module with a built-in thermometer, so let's use that.
It's well known that I'm a Lego fan, so I'm very happy to add such a beautiful plane to my Lego collection.
During the implementation of DistTest I faced the necessity of building a lot of different Linux kernel versions. As a first solution I chose downloading archives from kernel.org for each version used. But I soon realized that about 1000 versions of sources, at 0.5-1 GB each, would consume a lot of disk space. It's also impossible to build a kernel with exact commit precision using this approach.
A set of base versions with corresponding patches can save disk space, but applying patches generates a lot of random I/O, so it's slow on an HDD and consumes the finite rewrite resource of an SSD. The temporary nature of the sources leads to the conclusion "use tmpfs". But aufs offers a much less RAM-consuming method: store only the diffs in RAM.
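As a hypothetical sketch of that layout (all paths below are made up), a read-only branch holds the pristine sources and a tmpfs-backed writable branch stores only the build-time diffs:

```shell
# Hypothetical sketch: pristine kernel sources stay read-only on disk,
# while all files written during the build land in a RAM-backed branch
mkdir -p /tmp/kernel-diff /mnt/kernel-build
sudo mount -t tmpfs tmpfs /tmp/kernel-diff
sudo mount -t aufs -o br=/tmp/kernel-diff=rw:/srv/src/linux-4.9=ro none /mnt/kernel-build
# build in /mnt/kernel-build; only the diffs consume RAM
```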
I fell in love with the md5 hash algorithm because it can expose some very interesting characteristics of a system I want to benchmark. Almost all computations performed during the calculation of an md5 hash lie on the critical path. That means it's almost impossible to parallelize md5 computation. And I'm not talking about execution in multiple threads, but about instruction-level parallelism (superscalar and vector execution). This property takes modern CPU tricks (like out-of-order execution and specialized instruction sets) out of the equation and makes md5 a perfect single-thread benchmark.
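The same single-thread bottleneck is easy to observe from Python; here is a minimal sketch that measures MD5 throughput with the standard hashlib module:

```python
import hashlib
import time

# Measure single-thread MD5 throughput over a buffer of zeroes
buf = bytes(1024 * 1024)   # 1 MiB of zeroes
mib = 256                  # total amount of data to hash, in MiB
h = hashlib.md5()

start = time.time()
for _ in range(mib):
    h.update(buf)
elapsed = time.time() - start

print('%.0f MB/s' % (mib / elapsed))
```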
Let’s see some numbers:
Calculate md5(10 GiB of zeroes) on an i5-760 (turbo frequency: 3.33 GHz, launch date Q3'10) with Ubuntu 14.04:
```shell
$ (dd if=/dev/zero bs=1M count=10k | md5sum >/dev/null) 2>&1 | tail -n1
10737418240 bytes (11 GB) copied, 22.8959 s, 469 MB/s
```
And then do the same on an i7-6700 (turbo frequency: 4.0 GHz, launch date Q3'15) with Ubuntu 15.10:
```shell
$ (dd if=/dev/zero bs=1M count=10k | md5sum >/dev/null) 2>&1 | tail -n1
10737418240 bytes (11 GB) copied, 17.325 s, 620 MB/s
```
So we have roughly 140 and 155 MB/s per GHz respectively, which is about a 10% per-clock performance boost after 5 years of CPU evolution. It looks rather frustrating.
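As a sanity check, the per-GHz arithmetic can be redone from the dd outputs above (rounding the i5 figure to 140 gives a slightly larger boost of 10.7%; unrounded it is closer to 10%):

```python
# Recompute the per-GHz throughput from the dd numbers above
rate_i5, ghz_i5 = 469.0, 3.33   # i5-760
rate_i7, ghz_i7 = 620.0, 4.00   # i7-6700

per_ghz_i5 = rate_i5 / ghz_i5   # ~140.8 MB/s per GHz
per_ghz_i7 = rate_i7 / ghz_i7   # 155.0 MB/s per GHz

print('%.0f and %.0f MB/s per GHz' % (per_ghz_i5, per_ghz_i7))
print('boost: %.1f%%' % ((per_ghz_i7 / per_ghz_i5 - 1) * 100))
```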
p.s. Yep, I know that CPUs are now much smarter than 5 years ago and have rich sets of specialized instructions (like AES-NI, which is responsible for a +2200% GHASH calculation speedup). But any software developer should be ready for the fact that execution of unparallelizable algorithms will not get even a bit faster in the near future.
Dyn.com stopped providing free dyndns hosting some time ago. It was sad news.
However, the most necessary functionality can be implemented in 70 lines of code.
To dynamically update DNS records we must find a way to determine our external IP address and to update the DNS record.
Many ISPs use NAT for security and cost-saving reasons, and NAT is also used in home routers. So, to get your external IP address, you need to ask a service located outside your network.
For example, the internet.yandex.ru web page queries the resource http://ipv4.internet.yandex.ru/api/v0/ip to determine the IP address. So, let's use it.
```perl
use constant IP_API_URI => 'http://ipv4.internet.yandex.ru/api/v0/ip';

sub get_ip($) {
    my ($ua) = @_;
    my $response = get($ua, IP_API_URI);
    return substr $response, 1, length($response) - 2; # clean out quotes
}
```
Yandex provides an API for DNS record management for domains parked at Yandex. The API reference should be located here, but the link is currently broken :-(. UPD: API reference.
In short, you should send a POST request with the token in a header and the data in the request body to the https://pddimp.yandex.ru/api2/admin/dns/edit URL.
It should look like the following snippet:
```perl
use constant DNS_API_URI_PREFIX => 'https://pddimp.yandex.ru/api2/admin/dns/';

sub post($$$%) {
    my ($ua, $url, $data, %headers) = @_;
    my $req = HTTP::Request->new(POST => $url);
    $req->header(%headers) if keys %headers;
    $req->content_type('application/x-www-form-urlencoded');
    $req->content($data);
    my $res = $ua->request($req);
    return $res->content if $res->is_success;
    return undef;
}

sub set_domain_ip($$$$$$) {
    my ($ua, $token, $domain, $subdomain, $record_id, $ip) = @_;
    return decode_json post($ua, DNS_API_URI_PREFIX."edit",
        "domain=$domain&record_id=$record_id&subdomain=$subdomain&ttl=14400&content=$ip",
        PddToken => $token);
}
```
Use the cron, Luke :-) I mean that adding the following line to your crontab would be the simplest way to schedule updates of our DNS records:
```shell
* * * * * ( token=123 /home/user/yandex_dns.pl --domain=domain.com --subdomain=subdomain )
```
That’s all. Complete script on github.
Not so long ago I faced a problem: on the same Linux distribution, some machines used Intel Turbo Boost but some others didn't.
So... To investigate this problem I read enough articles about power management, and I want to summarize the key aspects below.
The Holy Grail of power management is ACPI (Advanced Configuration and Power Interface). It describes sleep (Sx), processor (Cx) and performance (Px) states.
Performance states came to replace the legacy throttling (Tx) states.
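On Linux, the current P-state machinery can be inspected through sysfs. A small sketch (these files exist only when a cpufreq driver is loaded; the intel_pstate driver additionally exposes a global turbo switch):

```python
from pathlib import Path

# Sketch: read the cpufreq state that the kernel exposes through sysfs
def cpufreq_state(cpu=0):
    base = Path('/sys/devices/system/cpu/cpu%d/cpufreq' % cpu)
    state = {}
    for name in ('scaling_driver', 'scaling_cur_freq', 'scaling_max_freq'):
        p = base / name
        if p.exists():
            state[name] = p.read_text().strip()
    # intel_pstate's global turbo switch: 0 means turbo is enabled
    no_turbo = Path('/sys/devices/system/cpu/intel_pstate/no_turbo')
    if no_turbo.exists():
        state['turbo_enabled'] = (no_turbo.read_text().strip() == '0')
    return state

print(cpufreq_state())
```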
So... About the Turbo Boost issue: it was just a BIOS bug (or feature? Who knows?) that didn't move the CPU to the P0 state on some boards but did on others.
Let's imagine that you have a bunch of files (with the default mode "rw-r--r--") and you have configured automatic, or performed manual, hardlink-based backups of them.
Well…
Then you moved one of these files to a "secured" folder that has strict permissions ("drwx------", for example).
Before adding confidential information to this file, it would be a good idea to change its permissions to stricter ones. But it is not obvious how important that is, because no one but the owner can access a file located at "secured/file" when the "secured" folder has "drwx------" permissions.
Well... Let's preserve the old permissions if changing them is not necessary.
But what about the hardlink to the file saved in the usual folder? Oh yes, the file located at "usual/file" can still be opened by everyone.
Conclusion:
– You must remember all hardlinks of your files when you think about security.
– Creating hardlinks by inode and opening files by inode are denied for security reasons.
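The underlying reason is that a hardlink is just another name for the same inode, so both names share one set of permission bits. A minimal demonstration in a temporary directory:

```python
import os
import stat
import tempfile

# A hardlink shares the inode, and therefore the permission bits,
# with the original file: locking down one directory is not enough.
root = tempfile.mkdtemp()
usual = os.path.join(root, 'usual')
secured = os.path.join(root, 'secured')
os.mkdir(usual)
os.mkdir(secured, 0o700)

path = os.path.join(usual, 'file')
with open(path, 'w') as f:
    f.write('data')
os.chmod(path, 0o644)

os.link(path, os.path.join(secured, 'file'))    # hardlink: same inode
os.chmod(os.path.join(secured, 'file'), 0o600)  # tightening one name...
mode = stat.S_IMODE(os.stat(path).st_mode)      # ...changes the other too
print(oct(mode))  # prints 0o600
```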
I've just tried to follow IntelliJ IDEA's manual and developed some useless plugin. And now my plugin can display a "Hello world" program in Perl.
You can also view this madness on github.
UPD1. Support for some new lexical constructions has been added.