Latest revision as of 21:30, 18 June 2023
Wikimedia Commons
(Wikimedia Commons main page on 2010-12-13)
- URL: http://commons.wikimedia.org
- Status: Online!
- Archiving status: In progress...
- Archiving type: Unknown
- IRC channel: #archiveteam-bs (on hackint)
Wikimedia Commons is a database of freely usable media files with more than 50 million files.
- As of December 2018, total size is over 195 TB (TiB?).
- As of December 2021, total size is 325 TiB. (check)
Archiving process
Tools
- Download script (Python)
- Checker script (Python)
- Feed lists (from 2004-09-07 to 2008-12-31; more coming soon)
How-to
Download the script and the feed lists (unpack them; each is a .csv file) into the same directory. Then run:
- python commonsdownloader.py 2005-01-01 2005-01-10 [to download that 10-day range; it generates zip files and a .csv for each day]
Don't forget the 30th and 31st in months that have them, and February 29th in leap years.
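The day-by-day iteration behind those start/end arguments can be sketched with Python's datetime module, which accounts for 30- and 31-day months and leap years automatically (the function name here is illustrative, not part of commonsdownloader.py):

```python
from datetime import date, timedelta

def days_in_range(start: str, end: str):
    """Yield every day from start to end (inclusive), as YYYY-MM-DD strings."""
    cur = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while cur <= last:
        yield cur.isoformat()
        cur += timedelta(days=1)

# datetime handles month lengths and leap years for us:
print(list(days_in_range("2004-02-27", "2004-03-01")))
# ['2004-02-27', '2004-02-28', '2004-02-29', '2004-03-01']
```

Each yielded day would then map to one zip file and one .csv, as described above.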
To verify the downloaded data, use the checker script:
- python commonschecker.py 2005-01-01 2005-01-10 [to check that 10-day range; it works on the .zip and .csv files, not the original folders]
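The internals of commonschecker.py are not reproduced here, but its core cross-check (does every file listed in a day's .csv actually appear in that day's .zip?) can be sketched as follows; the assumption that the filename sits in the first CSV column is hypothetical:

```python
import csv
import zipfile

def check_day(day: str) -> list:
    """Return filenames listed in <day>.csv that are missing from <day>.zip.

    Assumes the first CSV column holds the filename (hypothetical layout;
    the real commonschecker.py may differ).
    """
    with zipfile.ZipFile(day + ".zip") as zf:
        present = set(zf.namelist())
    missing = []
    with open(day + ".csv", newline="") as f:
        for row in csv.reader(f):
            if row and row[0] not in present:
                missing.append(row[0])
    return missing
```

An empty return list would mean the day's archive matches its manifest.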
Tools required
If you are downloading on a fresh server (e.g. a default virtual machine), you need to install zip (Ubuntu: apt-get install zip).
Python should already be installed on your server; if not, install it.
The scripts also depend on curl and wget, which are installed by default on most servers.
Volunteers
- Please wait until we have run some tests; there is probably a bug with long filenames.
Nick | Start date | End date | Images | Size | Revision | Status | Notes
---|---|---|---|---|---|---|---
Hydriz | 2004-09-07 | 2005-06-30 | ? | ? | r643 | Downloaded; uploaded to the Internet Archive | Check: October 2004: [1], November 2004: [2], December 2004: [3], January 2005: [4], February 2005: [5], March 2005: [6] (2005-03-23 to 2005-03-31 was downloaded differently, so it's not available for checking), April 2005: [7], May 2005: [8], June 2005: [9]
Hydriz | 2005-07-01 | 2005-12-31 | ? | ? | r643 | Downloaded; uploaded to the Internet Archive | Check: July 2005: [10], August 2005: [11], September 2005: [12], October 2005: [13], November 2005: [14], December 2005: [15]
Hydriz | 2006-01-01 | 2006-01-10 | 13198 | 4.8GB | r349 | Downloaded; uploaded to the Internet Archive |
Hydriz | 2006-01-11 | 2006-06-30 | ? | ? | r349 | Downloaded; uploaded to the Internet Archive |
Hydriz | 2006-07-01 | 2006-12-31 | ? | ? | r643 | Downloaded; uploaded to the Internet Archive | Check: July 2006: http://p.defau.lt/?IcMnwkx_j4H09FE_9iVgkQ, August 2006: http://p.defau.lt/?EmsKDtM0RXaysFNEABXJCQ, September 2006: http://p.defau.lt/?KBZVE9rJ9hdz4DiKnegnUw, October 2006: http://p.defau.lt/?f3F85TyqHtdY0LhpQk_m1w, November 2006: http://p.defau.lt/?VZwhzt_2doA_Z3c65_JkXg, December 2006: http://p.defau.lt/?Ms_TgrcyGDL_0oZQgKCNmw
Hydriz | 2007-01-01 | 2007-12-31 | ? | ? | r349 | Downloading | Check: January 2007, February 2007, March 2007, April 2007, May 2007, June 2007, July 2007
Errors
- oi_archive_name empty fields: https://commons.wikimedia.org/wiki/File:Nl-scheikundig.ogg
- broken file links: https://commons.wikimedia.org/wiki/File:SMS_Bluecher.jpg#filehistory
- Issue 45: 2005-03-23, 2005-08-08, 2005-09-12, 2005-09-18, 2005-09-25, 2005-11-18, 2006-02-05, 2006-02-11, 2006-02-25, 2006-03-10, 2006-03-23, 2006-04-21, 2006-04-25, 2006-05-01, 2006-07-13, 2006-07-30, 2006-08-02, 2006-08-05, 2006-08-13, 2006-09-12, 2006-10-22, 2006-10-26, 2006-11-23, 2006-12-06, 2006-12-13, 2006-12-17.
- Also issue 45: 2007-01-01, 2007-01-06, 2007-01-14, 2007-01-15, 2007-02-06, 2007-02-13, 2007-02-22, 2007-02-26, 2007-03-07, 2007-03-13, 2007-03-25, 2007-03-30, 2007-04-12, 2007-04-14, 2007-04-20, 2007-05-04, 2007-05-08, 2007-05-10, 2007-05-29, 2007-06-05, 2007-06-22.
I'm going to file a bug in Bugzilla.
Uploading
Upload using the format: wikimediacommons-<year><month>
E.g. wikimediacommons-200601 for the January 2006 grab.
If you can, add it to the WikiTeam collection; otherwise, just tag it with the wikiteam keyword and it will be added later.
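Deriving the item identifier from a day's date can be sketched as (helper name is illustrative):

```python
def item_name(day: str) -> str:
    """Map a YYYY-MM-DD day to its Internet Archive item name,
    following the wikimediacommons-<year><month> convention above."""
    year, month, _ = day.split("-")
    return "wikimediacommons-" + year + month

print(item_name("2006-01-15"))  # wikimediacommons-200601
```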
To run commonschecker and archive the output separately for each file, listing the archives that generated the most errors/output:
ls -1 *.zip | sed 's/\.zip$//' | xargs -n1 -P4 -I§ sh -c "python commonschecker.py § > §.log" ; ls -lhSr *.log
You can for instance generate a metadata.csv file to be used with the ias3upload.pl tool like this:
( echo -e "item,file,mediatype,collection,title,creator,language,description,contributor,date,subject[0],subject[1],originalurl,rights\n" ; ls -1 *zip | sed 's/.zip//g' | sed --regexp-extended 's/^(.+)$/wikimediacommons-\1,\1.zip,web,wikimediacommons,Wikimedia Commons Grab,,,"The <a href=""https:\/\/commons.wikimedia.org"" rel=""nofollow"">Wikimedia Commons<\/a> grab of files uploaded during this day.",,\1,WikiTeam,Wikimedia Commons,https:\/\/commons.wikimedia.org\/w\/api.php,"All files are under a free license or in the public domain, as specified in the associated description."\n,\1.csv\n,\1.log/g' ) > metadata.csv
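The same metadata.csv can be produced more readably with Python's csv module, which also handles the quoting. The column values mirror the shell one-liner above; the function wrapper is illustrative:

```python
import csv

HEADER = ["item", "file", "mediatype", "collection", "title", "creator",
          "language", "description", "contributor", "date",
          "subject[0]", "subject[1]", "originalurl", "rights"]

DESCRIPTION = ('The <a href="https://commons.wikimedia.org" rel="nofollow">'
               'Wikimedia Commons</a> grab of files uploaded during this day.')
RIGHTS = ("All files are under a free license or in the public domain, "
          "as specified in the associated description.")

def write_metadata(zip_names, out_path="metadata.csv"):
    """Write an ias3upload.pl-style metadata.csv for the given per-day zips."""
    with open(out_path, "w", newline="") as out:
        w = csv.writer(out)
        w.writerow(HEADER)
        for zip_name in sorted(zip_names):
            day = zip_name[:-len(".zip")]
            w.writerow(["wikimediacommons-" + day, zip_name, "web",
                        "wikimediacommons", "Wikimedia Commons Grab", "", "",
                        DESCRIPTION, "", day, "WikiTeam", "Wikimedia Commons",
                        "https://commons.wikimedia.org/w/api.php", RIGHTS])
            # like the shell version, attach the per-day .csv and .log
            # to the same item (a blank item column means "same item"):
            w.writerow([""] + [day + ".csv"] + [""] * 12)
            w.writerow([""] + [day + ".log"] + [""] * 12)
```

Call it with the zips in the current directory, e.g. `write_metadata(glob.glob("*.zip"))` after `import glob`.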
Other dumps
There is no public dump of all images (see upstream issue T298394: Produce regular public dumps of Commons media files, which saw some work by WMF in 2022).
WikiTeam developed and ran a scraper until 2017 (see section above), but updates were paused as the size of Wikimedia Commons outgrew the techniques used up to that point.
Pictures of the Year (best ones):
Featured images
Wikimedia Commons contains a lot of high-quality images.
Statistics
Stats per year
MariaDB [commonswiki_p]> select year(img_timestamp) as date, count(*) as numimages, round(sum(img_size)/(1024*1024*1024)) as gigabytes from image where 1 group by date;
+------+-----------+-----------+
| date | numimages | gigabytes |
+------+-----------+-----------+
| 2002 | 1 | 0 |
| 2003 | 260 | 0 |
| 2004 | 18972 | 3 |
| 2005 | 231826 | 94 |
| 2006 | 585157 | 310 |
| 2007 | 1146771 | 769 |
| 2008 | 1320350 | 1161 |
| 2009 | 1856416 | 2268 |
| 2010 | 2254873 | 2996 |
| 2011 | 3825196 | 5306 |
| 2012 | 3314956 | 7156 |
| 2013 | 4455631 | 12453 |
| 2014 | 4462202 | 19119 |
| 2015 | 5429764 | 16788 |
| 2016 | 6125166 | 35182 |
| 2017 | 7927028 | 34456 |
| 2018 | 7637895 | 62911 |
| 2019 | 6842086 | 34818 |
| 2020 | 9354449 | 46552 |
| 2021 | 12301202 | 48953 |
| 2022 | 10430326 | 56845 |
| 2023 | (partial) | (partial) |
+------+-----------+-----------+
22 rows in set (7 min 55.590 sec)
Months to download
select substr(img_timestamp, 1, 6) as date, count(*) as numimages, round(sum(img_size)/(1024*1024*1024)) as gigabytes from image where img_timestamp > '20150101000000' group by date;
+--------+-----------+-----------+
| date   | numimages | gigabytes |
+--------+-----------+-----------+
| 201501 | 310558 | 869 |
| 201502 | 319359 | 971 |
| 201503 | 394316 | 1213 |
| 201504 | 366594 | 1229 |
| 201505 | 509950 | 1785 |
| 201506 | 422456 | 1577 |
| 201507 | 497517 | 1389 |
| 201508 | 567581 | 1593 |
| 201509 | 824261 | 2341 |
| 201510 | 550427 | 1430 |
| 201511 | 436516 | 1522 |
| 201512 | 432697 | 1134 |
| 201601 | 463515 | 1508 |
| 201602 | 350460 | 2089 |
| 201603 | 462514 | 1966 |
| 201604 | 393534 | 1572 |
| 201605 | 508363 | 2368 |
| 201606 | 496242 | 4208 |
| 201607 | 437502 | 3768 |
| 201608 | 428234 | 3206 |
| 201609 | 756052 | 6001 |
| 201610 | 557747 | 3692 |
| 201611 | 793891 | 1960 |
| 201612 | 801610 | 3589 |
| 201701 | 264612 | 927 |
+--------+-----------+-----------+
Status
This table shows all items at Internet Archive in the wikimediacommons collection. It was most recently updated on 18:22, 4 August 2016 (EDT) using this script.
See also
- Wikipedia: files uploaded on some Wikipedias via the local upload form are not included in Wikimedia Commons; the English Wikipedia, for instance, contains about 800,000 images, many of which are unfree (used under fair use)
External links
- https://commons.wikimedia.org
- Picture of the Year archives
- https://github.com/emijrp/commons-coverage