How to acquire and process historical data from LDI?
@opeeters asked the other day
I have a question concerning the functionality of what we have at the moment. Is it already possible to download the data of a Luftdaten sensor via the Grafana GUI?
A colleague of mine would like to download the data (since somewhere at the beginning of February) of 8 LD sensors we have hanging next to official stations, for further analysis in e.g. R. I can easily make a dashboard using the LD ids, but is there a download functionality?
Activity
- Author Developer
We just outlined a basic starting point on that topic at https://github.com/panodata/luftdatenpumpe/issues/9. Feel free to add further requests or questions here, e.g. about how you would like to see that feature evolve.
- Andreas Motl changed the description
- Author Developer
We just found importing historical data from LDI stopped working. We are tracking this issue at https://github.com/panodata/luftdatenpumpe/issues/10.
Edit: This issue has been mitigated with `luftdatenpumpe-0.18.0`. Historical data import from LDI should work fine again.
Edited by Andreas Motl
- Maintainer
@amotl: thanks for fixing this in 0.18.0; I can confirm the CSV import is working fine now.
FYI (also for @opeeters): I mirrored the 2019-09-19 CSV archive of Luftdaten and imported it like this:
```bash
wget --mirror --continue --no-host-directories \
     --directory-prefix=/var/spool/archive.luftdaten.info \
     --accept-regex='2019-09-19' \
     http://archive.luftdaten.info/

luftdatenpumpe readings --network=ldi \
     --source=file:///var/spool/archive.luftdaten.info \
     --country=BE \
     --target=influxdb://luftdatenpumpe@localhost/luftdaten_info \
     --progress
```
It would be even nicer if we were able to wget only the relevant (i.e., in our case, BE) sensor CSV files from the archive, but I understand this is out of the scope of LDP. It would not be too difficult to generate a list of the relevant sensor ids (i.e. for BE) and only wget those; a rough sketch of that idea follows below.
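For illustration, a minimal sketch of that idea in Python, assuming the archive serves plain HTML directory listings per day and files named like `2019-09-19_sds011_sensor_1234.csv`; the sensor ids below are placeholders, not our actual BE ids:

```python
# Sketch: download only the CSV files of selected sensors for one day,
# instead of mirroring the whole daily archive.
import re
import urllib.request

DAY = "2019-09-19"
SENSOR_IDS = {"1234", "5678"}  # placeholder ids of the relevant BE sensors
BASE = f"http://archive.luftdaten.info/{DAY}/"

# The archive serves a plain HTML directory listing; scrape the file names.
listing = urllib.request.urlopen(BASE).read().decode("utf-8", "replace")
for name in re.findall(r'href="([^"]+\.csv)"', listing):
    # File names end in ..._sensor_<id>.csv
    match = re.search(r"sensor_(\d+)\.csv$", name)
    if match and match.group(1) in SENSOR_IDS:
        urllib.request.urlretrieve(BASE + name, name)
        print("fetched", name)
```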
Edited by David Roet
- David Roet closed
- David Roet reopened
- Maintainer
Seems I was a bit too quick: the CSV import works, but halts after a while, before completing. This is related to issue #5 (closed):
```
[root@ldp ~]# luftdatenpumpe readings --network=ldi --source=file:///var/spool/archive.luftdaten.info --country=BE --target=influxdb://luftdatenpumpe@localhost/luftdaten_info --progress
2019-09-30 10:14:50,582 [luftdatenpumpe.source               ] INFO    : Applying filter: Munch({'country': ['BE']})
2019-09-30 10:14:50,630 [luftdatenpumpe.commands             ] INFO    : Acquiring readings from network "ldi" with source "file:///var/spool/archive.luftdaten.info"
2019-09-30 10:14:50,631 [luftdatenpumpe.commands             ] INFO    : Will publish data to ['influxdb://luftdatenpumpe@localhost/luftdaten_info']
2019-09-30 10:14:50,631 [luftdatenpumpe.engine               ] INFO    : Configuring data sink "influxdb://luftdatenpumpe@localhost/luftdaten_info" with domain "readings"
2019-09-30 10:14:50,753 [luftdatenpumpe.engine               ] INFO    : Emitting to target data sinks, this might take some time
2019-09-30 10:14:50,754 [luftdatenpumpe.source.luftdaten_info] INFO    : Building list of CSV files from /var/spool/archive.luftdaten.info/**/*.csv
2019-09-30 10:14:51,136 [luftdatenpumpe.source.luftdaten_info] INFO    : Processing 16623 files
2019-09-30 10:14:51,136 [luftdatenpumpe.source.common        ] INFO    : Processing 16623 items
 12%|████████  | 1963/16623 [05:33<24:34,  9.94it/s]
Killed

[root@ldp ~]# systemctl status redis
● redis.service - Redis persistent key-value database
   Loaded: loaded (/usr/lib/systemd/system/redis.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/redis.service.d
           └─limit.conf, restart.conf
   Active: active (running) since Mon 2019-09-30 10:25:18 CEST; 4min 30s ago
  Process: 2478 ExecStop=/usr/libexec/redis-shutdown (code=exited, status=1/FAILURE)
 Main PID: 2494 (redis-server)
   CGroup: /system.slice/redis.service
           └─2494 /usr/bin/redis-server 127.0.0.1:6379

Sep 30 10:25:18 ldp.irceline.be systemd[1]: Starting Redis persistent key-value database...
Sep 30 10:25:18 ldp.irceline.be systemd[1]: Started Redis persistent key-value database.

[root@ldp ~]# dmesg
[11985743.252378] Out of memory: Kill process 2088 (luftdatenpumpe) score 329 or sacrifice child
[11985743.253700] Killed process 2088 (luftdatenpumpe) total-vm:2684640kB, anon-rss:347296kB, file-rss:0kB, shmem-rss:0kB
```
@opeeters Could you increase the memory for this VM please, to see what effect this has?
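For reference, one way to watch whether memory climbs steadily over time is to sample the process RSS periodically. This is only a sketch; `psutil`, the matching logic, and the 60-second interval are assumptions for illustration, not part of LDP:

```python
# Sketch: log the resident set size of the luftdatenpumpe process every
# 60 seconds, so a steady climb (a suspected leak) becomes visible.
import time
import psutil  # third-party: pip install psutil

def find_pump():
    """Return the first process whose command line mentions luftdatenpumpe."""
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = proc.info["cmdline"] or []
        if any("luftdatenpumpe" in part for part in cmdline):
            return proc
    return None

proc = find_pump()
while proc is not None and proc.is_running():
    rss_mb = proc.memory_info().rss / 2**20
    print(time.strftime("%H:%M:%S"), f"{rss_mb:.0f} MB", flush=True)
    time.sleep(60)
```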
Edited by David Roet
- Owner
@droet Memory has been increased from 4 GB to 8 GB.
Edited by Olav Peeters
- Author Developer
Dear David,
sorry to hear about that. I am running `luftdatenpumpe` on my workstation, which has 16 GB of RAM. If the memory usage problems keep popping up, I will consider investigating whether memory usage can be reduced. Please let me know how it goes with 8 GB of RAM. Thanks, @opeeters!
With kind regards, Andreas.
Edited by Andreas Motl
- Maintainer
@opeeters Thanks for the quick increase of the VM memory.
@amotl So, it now keeps running longer, but while monitoring through `top` (not that great, I confess) I do see the LDP process steadily consuming all memory until the kill; the number of it/s drops as well. It does look like something is leaking, or garbage collection is incomplete ...
Edited by David Roet
- Author Developer
Hi David,
we just released luftdatenpumpe-0.18.2 which might improve its memory usage. Please let us know if everything still works for you.
With kind regards, Andreas.
- Maintainer
@amotl: It seems v0.18.2 solved the memory issues; memory usage has been stable so far. However, I do see a big performance drop (it/s): ingesting the current one-day CSV dump will take (ETA) over an hour. Is this related to our machine/setup? What affects the performance/speed of the ingestion?
Edited by David Roet
- Author Developer
Hi David,
it's always about trading memory for speed ;]. That said, we will have to put some additional effort into the codebase to strike a better balance between the two.
While I'm sad to hear the performance dropped that much for you, I will be happy to hear about the general outcome, to see whether things actually work now.
As an outlook, the following things could contribute to better performance (in no particular order); see the sketch after this list for the last item:
- Tune the current implementation's buffering to strike a better balance between memory consumption and ingest performance.
- See whether using tablib's Dataset for ingesting the raw CSV files yields better performance. The underlying machinery is based on Pandas.
- Stop ingesting LDI CSV files altogether and use the Parquet files instead, see also [1].
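For illustration, reading one of those Parquet files with pandas could look roughly like this; the file name and column names are assumptions, see [1] for the actual layout:

```python
# Sketch: load a monthly LDI Parquet dump and filter it to one country.
# A single columnar read would replace parsing thousands of per-sensor CSVs.
import pandas as pd

# File name and column names are hypothetical; see [1] for the real layout.
df = pd.read_parquet("ldi_2019-09.parquet")  # requires pyarrow or fastparquet
readings_be = df[df["country"] == "BE"]
print(readings_be.describe())
```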
With kind regards, Andreas.
[1] https://github.com/panodata/luftdatenpumpe/issues/9#issuecomment-536127489
Edited by Andreas Motl
- Maintainer
@amotl I can confirm the archived CSV ingestion worked properly, although it took quite some time (over an hour). We will need to see how we might improve further on this if we want to import a big (huge?) backlog of Luftdaten LDI readings.
Interesting link about those Parquet files; I was unaware that they existed, or that Luftdaten themselves make them available. I do suspect that ingesting a 1.5 GB file might also affect performance (I/O). But for now, I think we can close this issue?
- Author Developer
Hi David,
thanks for your feedback and observations. We may well close this issue and divert the performance improvements into [1].
Cheers, Andreas.
- David Roet closed