Difference between revisions of "GitHub"

From Archiveteam
 
(11 intermediate revisions by 7 users not shown)
Line 1: Line 1:
{{Infobox project
{{Infobox project
| title = GitHub
| logo = GitHub_logo.png
| logo = GitHub_logo.png
| image = GitHub 1303511667338.png
| image = GitHub 1303511667338.png
| description = A screen shot of the GitHub home page taken on 2015-11-08
| description = A screen shot of the GitHub home page taken on 2021-05-01
| URL = {{url|1=https://github.com/|2=GitHub}}
| URL = {{url|1=https://github.com/|2=GitHub}}
| project_status = {{online}}
| project_status = {{online}}
| archiving_status = {{upcoming}}
| archiving_status = {{inprogress}}
| archiving_type = DPoS
| source = [https://github.com/ArchiveTeam/github-grab github-grab]
| tracker = [https://tracker.archiveteam.org/github/ github]
| irc = gitgud
| irc = gitgud
| irc_network = hackint
| data = {{IA collection|archiveteam_github}}
}}
}}


:''See also [[GitHub Downloads]]''
:''See also [[GitHub Downloads]]''


'''GitHub''' is a software repository powered by Git. Does not seem to have any site issues, often 24 hours uptime (see [http://status.github.com/ site status]). Looks pretty sunny at the moment, but when disaster strikes it would be a problem archiving the private repositories.
'''GitHub''' is the largest and most popular Git forge, a site that hosts software git repositories. When it was bought by Microsoft in 2018 (see [[#Acquisition_by_Microsoft|section]]) concerns were raised among users and archivists. Against their expectations the site hasn't become more closed or restricted since the acquisition, and looks pretty sunny at the moment. Though there have been incidents of censorship, (though nothing widespread yet) such as momentarily taking down the youtube-dl and nyaa.si repositories in reflex to DMCA requests.


== Archive Team project ==
== Archive Team project ==
Line 36: Line 38:
* Downloads - release downloads, comment attachments
* Downloads - release downloads, comment attachments
* github.io - if the repository is a <tt>*.github.io</tt> repository, the full corresponding <tt>*.github.io</tt> website is downloaded, for regular repositories an attempt is made to archive the <tt>*.github.io/*</tt> pages.
* github.io - if the repository is a <tt>*.github.io</tt> repository, the full corresponding <tt>*.github.io</tt> website is downloaded, for regular repositories an attempt is made to archive the <tt>*.github.io/*</tt> pages.
=== IRC channel ===
Before the [[Move Archiveteam to Hackint|migration to hackint]], the IRC channel was {{IRC|getgit|EFnet|abandoned|oldtextonly}}.


== Size ==
== Size ==
Line 47: Line 52:
It was [https://www.bloomberg.com/news/articles/2018-06-03/microsoft-is-said-to-have-agreed-to-acquire-coding-site-github reported by Bloomberg] and [https://news.microsoft.com/2018/06/04/microsoft-to-acquire-github-for-7-5-billion/ confirmed on June 4, 2018], that Microsoft bought GitHub for 7.5 billion dollars. On 26th October 2018, the new GitHub CEO, Nat Friedman, [https://blog.github.com/2018-10-26-github-and-microsoft/ announced] that the acquisition was complete.
It was [https://www.bloomberg.com/news/articles/2018-06-03/microsoft-is-said-to-have-agreed-to-acquire-coding-site-github reported by Bloomberg] and [https://news.microsoft.com/2018/06/04/microsoft-to-acquire-github-for-7-5-billion/ confirmed on June 4, 2018], that Microsoft bought GitHub for 7.5 billion dollars. On 26th October 2018, the new GitHub CEO, Nat Friedman, [https://blog.github.com/2018-10-26-github-and-microsoft/ announced] that the acquisition was complete.


A discussion into the feasibility of archiving GitHub has commenced in {{IRC|getgit}}.
A discussion into the feasibility of archiving GitHub commenced soon after. Key concerns mentioned included:
* Users in the FOSS community fear Microsoft's "embrace, extend, extinguish" schemes in the 1990s and 2000s and many called for a move to rival [[GitLab]] in the wake of the news.
* Users in the FOSS community fear Microsoft's "embrace, extend, extinguish" schemes in the 1990s and 2000s and many called for a move to rival [[GitLab]] in the wake of the news.
* [[LinkedIn]] shows how user content can be gradually taken away (by means of paywalls and login walls).
* [[LinkedIn]] shows how user content can be gradually taken away (by means of paywalls and login walls).
Line 61: Line 66:
=== Other tools ===
=== Other tools ===


[https://github-backup.branchable.com/ github-backup] runs in a git repository and chases down that information, committing it to a "github" branch. It also chases down the forks and efficiently downloads them as well.
[https://github-backup.branchable.com/ github-backup] runs in a git repository and chases down that information, committing it to a "github" branch. It also chases down the forks and efficiently downloads them as well. It is unmaintained since late 2020.<ref>{{URL|https://joeyh.name/blog/entry/Withrawing_github-backup/}}</ref>


[http://www.githubarchive.org/ githubarchive.org] and [http://ghtorrent.org/ GHTorrent] are both creating archives of the GitHub "timeline", that is, all events like git pushes, forks, created issues, pull requests, etc.
[http://www.githubarchive.org/ githubarchive.org] is creating archives of the GitHub "timeline", that is, all events like git pushes, forks, created issues, pull requests, etc.


[http://codearchive.org codearchive.org] Effort to backup all the versions of all the repos on GitHub and other sources. [https://speakerdeck.com/filosottile/the-code-archive-hope-xi Slides from a talk about it].
[http://codearchive.org codearchive.org] Effort to backup all the versions of all the repos on GitHub and other sources. [https://speakerdeck.com/filosottile/the-code-archive-hope-xi Slides from a talk about it].
Line 99: Line 104:


The Internet Archive item {{IA item|github_repository_index_201806}} contains another crawl of the API from June 2018.
The Internet Archive item {{IA item|github_repository_index_201806}} contains another crawl of the API from June 2018.
=== Extreme repositories ===
This is a list of repositories that are thought to hold records across GitHub (or even Git in general).
* Most commits: https://github.com/zp4rker/50mil-commits has 50&nbsp;million commits, which GitHub struggles to count correctly at all times. The author also attempted to push a repository with 100&nbsp;million commits, but GitHub asked them to stop at 61.5&nbsp;million, and the repository was removed.
** This and several similar repositories were generated. Largest known ''real'' repository: https://github.com/torvalds/linux
* Largest data size (LFS excluded): https://github.com/chromium/chromium
* Largest data size (LFS included): unknown
* Most branches: https://github.com/archlinux/aur
* Most tags: https://github.com/chromium/chromium
* Most issues: https://github.com/AdguardTeam/AdguardFilters
* Most PRs: https://github.com/google-test/signcla-probe-repo
* Largest merge: https://github.com/cirosantilli/test-octopus-100k/commit/07fdcceb20ac3626a07c08166d0c410707b1cb9b is a 100k-way octopus merge commit, i.e. a merge commit with a hundred thousand parent commits. This has caused issues in the past, and an attempt to push a 1-million-way merge failed.<ref>{{URL|https://github.com/isaacs/github/issues/1344}}</ref>
* Most orphan (aka root) commits: unknown, https://github.com/torvalds/linux has three as of 2022-09-26, but this seems unlikely to be the record.


== GithubArchive ==
== GithubArchive ==
Line 119: Line 138:
* {{url|1=https://archive.softwareheritage.org/|2=Software Heritage Archive}}
* {{url|1=https://archive.softwareheritage.org/|2=Software Heritage Archive}}
* {{url|1=https://archiveprogram.github.com/|2=The GitHub Archive Program}}
* {{url|1=https://archiveprogram.github.com/|2=The GitHub Archive Program}}
* [https://github.com/github/dmca/tree/master Official repository of disclosed DMCA takedown notices] With the names of infringing repositories.
== References ==
<references />


{{Navigation box}}
{{Navigation box}}
[[Category:Code]]

Latest revision as of 20:14, 14 January 2024

GitHub
GitHub logo
A screen shot of the GitHub home page taken on 2021-05-01
URL GitHub (https://github.com/)
Status Online!
Archiving status In progress...
Archiving type DPoS
Project source github-grab
Project tracker github
IRC channel #gitgud (on hackint)
Data archiveteam_github
See also GitHub Downloads

GitHub is the largest and most popular Git forge, a site that hosts Git repositories for software. When it was bought by Microsoft in 2018 (see Acquisition by Microsoft), concerns were raised among users and archivists. Against their expectations, the site has not become more closed or restricted since the acquisition, and looks pretty sunny at the moment. There have been incidents of censorship, though nothing widespread yet, such as the temporary takedown of the youtube-dl and nyaa.si repositories in response to DMCA requests.

Archive Team project

In 2020 Archive Team started a project to archive GitHub and keep the archive up to date as new content is added. This project is a collaboration with Internet Archive and GitHub.

The project is split up into two parts:

  • The web part, the UI of GitHub.
  • The code part, which consists of simulated calls like those made by git clone.

As of today the web part of the project has started, while the code part of the project is under active development.

What is being archived

For every project on GitHub the following is archived from the UI with the web part of the project:

  • Issues
  • Pull requests - the conversation, commits and checks tabs (not the files changed tab)
  • Actions
  • Projects
  • Wiki
  • Security
  • Insights
  • Releases and tags - the tar.gz tag archives are considered static and are downloaded if the tag has notes or extra downloads attached, or if the repository has at least one fork or star and is not a clone from another repository.
  • Downloads - release downloads, comment attachments
  • github.io - if the repository is a *.github.io repository, the full corresponding *.github.io website is downloaded, for regular repositories an attempt is made to archive the *.github.io/* pages.
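The rule for tag archives in the "Releases and tags" item can be read as a small predicate. The following is only a sketch of one reading of that sentence, not the project's actual code: the TagInfo class and its field names are hypothetical, and "a clone from another repository" is assumed to mean a fork.

```python
from dataclasses import dataclass


@dataclass
class TagInfo:
    """Hypothetical summary of a tag and its repository (illustrative field names)."""
    has_notes: bool            # the tag has release notes
    has_extra_downloads: bool  # extra files are attached to the tag
    repo_forks: int            # fork count of the repository
    repo_stars: int            # star count of the repository
    repo_is_fork: bool         # assumption: "clone from another repository" means fork


def should_download_tag_archive(tag: TagInfo) -> bool:
    """Return True if the tar.gz archive of this tag would be downloaded."""
    # First condition: the tag itself carries notes or extra downloads.
    tag_has_content = tag.has_notes or tag.has_extra_downloads
    # Second condition: the repository has a fork or star and is not itself a fork.
    repo_qualifies = (tag.repo_forks >= 1 or tag.repo_stars >= 1) and not tag.repo_is_fork
    return tag_has_content or repo_qualifies
```

Under this reading, a tag with release notes in an unstarred fork is still downloaded, while a bare tag in a popular fork is not.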

IRC channel

Before the migration to hackint, the IRC channel was #getgit (on EFnet) (abandoned).

Size

As of 12 August 2012: 1,963,652 people were hosting over 3,460,582 repositories. 1,117,147 public repositories are forks, which greatly reduces the amount of data required to archive the site.

As of 22 November 2015: There are 32,000,000 repositories, with a similar fork ratio. Back-of-the-envelope calculations suggest 120TB of data in git repositories.

As of June 2018, there are 79.6 million public repositories in 137 million repository IDs, indicating that around 42 % of all repositories ever created are private or have been deleted.
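The June 2018 percentage follows directly from the two counts. As a quick check (the function name is illustrative only):

```python
def private_or_deleted_fraction(public_repos: float, repo_ids: float) -> float:
    """Fraction of all repository IDs that do not map to a public repository."""
    return 1 - public_repos / repo_ids


# June 2018 figures quoted above: 79.6 million public repositories, 137 million IDs.
fraction = private_or_deleted_fraction(79.6e6, 137e6)
print(f"{fraction:.1%}")  # prints 41.9%
```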

Acquisition by Microsoft

It was reported by Bloomberg and confirmed on June 4, 2018, that Microsoft bought GitHub for 7.5 billion dollars. On 26th October 2018, the new GitHub CEO, Nat Friedman, announced that the acquisition was complete.

A discussion into the feasibility of archiving GitHub commenced soon after. Key concerns mentioned included:

  • Users in the FOSS community feared a repeat of Microsoft's "embrace, extend, extinguish" schemes of the 1990s and 2000s, and many called for a move to rival GitLab in the wake of the news.
  • LinkedIn shows how user content can be gradually taken away (by means of paywalls and login walls).

Backup tools

git itself

git clone is the simplest tool (and also works outside of GitHub, obviously). However, it does not capture project data that is not stored in Git, such as issue reports, comments, and pull requests.

When cloning a repository for archival, it is best to use the --mirror option. This mirror will include all branches and even the code associated with pull requests. (Note however that the PR code will get purged eventually by Git's GC when you create a clone from this mirror as the PR commits aren't referenced by any branches, though this can be solved by adding a line like fetch = +refs/pull/*/head:refs/remotes/origin/pr/* to the repository config file.)

To pack a clone/mirror into a single, easily handleable file, use git bundle create FILE --all inside the clone/mirror.
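The mirror-and-bundle steps can be scripted. The sketch below only assembles the git invocations described above and runs them on request; the function names, directory names and example URL are placeholders, and the pull-request refspec is applied to a secondary clone, which is one reading of "the repository config file".

```python
import subprocess
from pathlib import Path

# Refspec quoted in the text above; it keeps pull request heads reachable in a clone.
PR_REFSPEC = "+refs/pull/*/head:refs/remotes/origin/pr/*"


def archive_commands(url: str, mirror_dir: str, bundle_file: str) -> list[list[str]]:
    """Return the git commands that mirror a repository and pack it into one file."""
    mirror = str(Path(mirror_dir))
    # Resolve the bundle path because "git -C" changes the working directory.
    bundle = str(Path(bundle_file).resolve())
    return [
        ["git", "clone", "--mirror", url, mirror],
        ["git", "-C", mirror, "bundle", "create", bundle, "--all"],
    ]


def pr_refspec_command(clone_dir: str) -> list[str]:
    """Return the command that adds the pull request refspec to a clone's config."""
    return ["git", "-C", clone_dir, "config", "--add", "remote.origin.fetch", PR_REFSPEC]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder URL and paths; replace them with a real repository to archive.
    run_all(archive_commands("https://github.com/example/repo.git", "repo.git", "repo.bundle"))
```

Building the argument lists separately from running them makes it easy to inspect or log the commands before anything touches the network.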

Other tools

github-backup runs in a git repository and chases down that information, committing it to a "github" branch. It also chases down the forks and efficiently downloads them as well. It has been unmaintained since late 2020.[1]

githubarchive.org is creating archives of the GitHub "timeline", that is, all events like git pushes, forks, created issues, pull requests, etc.

codearchive.org is an effort to back up all versions of all repositories on GitHub and other sources. Slides from a talk about it are available.

python-github-backup can back up entire users or organisations and retrieves issues, PRs, labels, milestones, hooks, wikis, gists, and LFS data. It can also grab starred repositories and forks.

See also Software Heritage.

GitHub replacement engines

If we ever have to archive the data out of GitHub, the data will need to be exportable to a GitHub-style engine.

Currently[when?], the best free and open-source GitHub-style engine that offers a wiki, issues, and Git repository hosting is GitLab. The engine is used by and paid for by many major organizations, so it is likely to live on in a stable way. Other popular FOSS alternatives to GitHub include Gitea and Gogs.

We will need a complete migration system to move a git repository and all related GitHub service information of a repository to GitLab.

Things to scrape

In case of emergency, these are the items we need to grab.

  • Git Repository - Accomplished by github-backup
    • Forked Repositories - Accomplished by github-backup
    • Notes on Commits/Lines of Code - Not supported by github-backup yet. GitHub API support exists since ca. 2011.
  • GitHub Gollum Wiki - No tool yet, but just clone the whole thing, and then push it to GitLab.
  • Releases - Tags on GitHub can have binaries attached. These are of high priority to archive.
  • Issues + Comments - Accomplished by github-backup
    • Milestones - github-backup does not archive these yet.
    • Labels - github-backup does not archive these yet.
  • Hooks - Needs some kind of tool to archive GitHub Hooks

Lists of repositories

A list of repositories built from GitHub API data is maintained by an Archive Team member at za3k.com. The scraper runs continuously, and the public downloads are updated once a day. This list does not include gists.

The Internet Archive item github_repository_index_201806 contains another crawl of the API from June 2018.
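Such a list can be collected by paging through the public repositories endpoint of the GitHub REST API (GET /repositories with a since parameter holding the last repository ID seen). The following is a minimal sketch, not the scraper used for either list above; the function names and the User-Agent string are made up for illustration, and unauthenticated requests are subject to strict rate limits.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

API_URL = "https://api.github.com/repositories"


def fetch_page(since: int) -> list:
    """Fetch one page of public repositories with IDs greater than `since`."""
    request = urllib.request.Request(
        f"{API_URL}?since={since}",
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "repo-list-sketch",  # placeholder client name
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_repositories(
    fetch: Callable[[int], list] = fetch_page,
    start: int = 0,
    max_pages: Optional[int] = None,
) -> Iterator[dict]:
    """Yield repository records in ID order until an empty page is returned."""
    since = start
    pages = 0
    while max_pages is None or pages < max_pages:
        page = fetch(since)
        if not page:
            return
        yield from page
        # The next page starts after the highest ID seen so far.
        since = page[-1]["id"]
        pages += 1
```

Passing the fetch function as a parameter keeps the paging logic testable without network access, for example iter_repositories(fetch=my_cached_fetch, max_pages=10) with a hypothetical caching wrapper.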

Extreme repositories

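This is a list of repositories that are thought to hold records across GitHub (or even Git in general).

  • Most commits: https://github.com/zp4rker/50mil-commits has 50 million commits, which GitHub struggles to count correctly at all times. The author also attempted to push a repository with 100 million commits, but GitHub asked them to stop at 61.5 million, and the repository was removed.
    • This and several similar repositories were generated. Largest known real repository: https://github.com/torvalds/linux
  • Largest data size (LFS excluded): https://github.com/chromium/chromium
  • Largest data size (LFS included): unknown
  • Most branches: https://github.com/archlinux/aur
  • Most tags: https://github.com/chromium/chromium
  • Most issues: https://github.com/AdguardTeam/AdguardFilters
  • Most PRs: https://github.com/google-test/signcla-probe-repo
  • Largest merge: https://github.com/cirosantilli/test-octopus-100k/commit/07fdcceb20ac3626a07c08166d0c410707b1cb9b is a 100k-way octopus merge commit, i.e. a merge commit with a hundred thousand parent commits. This has caused issues in the past, and an attempt to push a 1-million-way merge failed.[2]
  • Most orphan (aka root) commits: unknown; https://github.com/torvalds/linux has three as of 2022-09-26, but this seems unlikely to be the record.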

GithubArchive

The metadata generated by the GitHub API are archived to Google BigQuery every hour by GithubArchive.

It obviously doesn't grab events dating before 2011, so a targeted repository scrape may still be ideal.

It should at least be possible to grab all information about a single repository using Google BigQuery's free version, since such a query would use little CPU power. However, an export script for this still needs to be written when the time comes.
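A first step towards such an export script could be a query builder like the one below. This is a hedged sketch only: it assumes the public BigQuery dataset exposes one table per day named githubarchive.day.YYYYMMDD with type, actor.login, repo.name, created_at and payload columns, and it leaves running the query (and binding the @repo_name parameter) to whichever BigQuery client is used.

```python
from datetime import date


def build_repo_events_query(day: date) -> str:
    """Build a standard SQL query selecting one repository's events for one day.

    Assumption: daily tables are named githubarchive.day.YYYYMMDD. The repository
    name is left as the named query parameter @repo_name so that the client binds
    it safely instead of interpolating it into the SQL text.
    """
    table = f"githubarchive.day.{day:%Y%m%d}"
    return (
        "SELECT type, actor.login AS actor, created_at, payload\n"
        f"FROM `{table}`\n"
        "WHERE repo.name = @repo_name\n"
        "ORDER BY created_at"
    )


print(build_repo_events_query(date(2018, 6, 1)))
```

Restricting the query to a single daily table keeps the amount of scanned data small, which matters for the free usage quota.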

ArchiveTeam archival efforts

In June 2018, a discovery warrior project was started based on the current list of repositories. The goal was to obtain the number of watchers, stars, forks, and the origin repository (for forks) for each repository – all information which is not returned by the repositories API endpoint which was used to collect the list – so that a prioritisation of content according to those numbers would be possible. The origin repository is needed for storing forks efficiently: since the original repository and all its forks are usually mostly identical, this can be stored in a single repository instead of one clone per fork, thus storing the shared revisions only once.
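The fork-aware storage idea amounts to grouping every repository under its origin, so that each group can share one object store. A minimal sketch, assuming the discovery data has been reduced to (repository, origin) pairs; the function name and the example repository names are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def group_by_origin(repos: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, List[str]]:
    """Group repositories by the root of their fork chain.

    `repos` holds (full_name, origin) pairs, where origin is None for
    repositories that are not forks.
    """
    parent = dict(repos)

    def root(name: str) -> str:
        # Follow fork-of-fork chains up to the original; `seen` guards against cycles.
        seen = set()
        while parent.get(name) is not None and name not in seen:
            seen.add(name)
            name = parent[name]
        return name

    groups: Dict[str, List[str]] = defaultdict(list)
    for name in parent:
        groups[root(name)].append(name)
    return dict(groups)


example = [
    ("upstream/project", None),
    ("alice/project", "upstream/project"),
    ("bob/project", "alice/project"),
    ("carol/tool", None),
]
print(group_by_origin(example))
```

Each resulting group could then be stored as a single repository with one remote per member, so that revisions shared between the original and its forks are stored only once.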

In December 2018, around 2,000 GitHub repositories linked from Wikidata were saved using ArchiveBot.

The GitHub Archive Program

On February 2, 2020, GitHub "captured a snapshot of every active public repository, to be preserved in the GitHub Arctic Code Vault". Read more at https://archiveprogram.github.com.

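External links

  • Software Heritage Archive (https://archive.softwareheritage.org/)
  • The GitHub Archive Program (https://archiveprogram.github.com/)
  • Official repository of disclosed DMCA takedown notices (https://github.com/github/dmca/tree/master), with the names of the infringing repositories.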

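References

  1. https://joeyh.name/blog/entry/Withrawing_github-backup/
  2. https://github.com/isaacs/github/issues/1344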